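Python: requests-html, an HTML parsing library for humans
The requests-html library aims to make parsing HTML (for example, scraping the web) as simple and intuitive as possible.
When you use this library, you automatically get:
- Full JavaScript support!
- CSS selectors.
- XPath selectors, for the faint of heart.
- A mocked user agent (like a real web browser).
- Automatic following of redirects.
- Connection pooling and cookie persistence.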
Installation
C:\Users\lifeng>pip install requests-html
Collecting requests-html
Downloading requests_html-0.10.0-py3-none-any.whl (13 kB)
Collecting fake-useragent
Downloading fake-useragent-0.1.11.tar.gz (13 kB)
Preparing metadata (setup.py) ... done
Collecting pyppeteer>=0.0.14
Downloading pyppeteer-0.2.6-py3-none-any.whl (83 kB)
|████████████████████████████████| 83 kB 3.4 kB/s
Collecting pyquery
Downloading pyquery-1.4.3-py3-none-any.whl (22 kB)
Requirement already satisfied: w3lib in d:\python\python37\lib\site-packages (from requests-html) (1.22.0)
Requirement already satisfied: requests in d:\python\python37\lib\site-packages (from requests-html) (2.25.0)
Collecting bs4
Downloading bs4-0.0.1.tar.gz (1.1 kB)
Preparing metadata (setup.py) ... done
Collecting parse
Downloading parse-1.19.0.tar.gz (30 kB)
Preparing metadata (setup.py) ... done
Requirement already satisfied: urllib3<2.0.0,>=1.25.8 in d:\python\python37\lib\site-packages (from pyppeteer>=0.0.14->requests-html) (1.26.2)
Requirement already satisfied: appdirs<2.0.0,>=1.4.3 in d:\python\python37\lib\site-packages (from pyppeteer>=0.0.14->requests-html) (1.4.4)
Requirement already satisfied: importlib-metadata>=1.4 in d:\python\python37\lib\site-packages (from pyppeteer>=0.0.14->requests-html) (1.7.0)
Collecting pyee<9.0.0,>=8.1.0
Downloading pyee-8.2.2-py2.py3-none-any.whl (12 kB)
Collecting websockets<10.0,>=9.1
Downloading websockets-9.1-cp37-cp37m-win_amd64.whl (90 kB)
|████████████████████████████████| 90 kB 4.9 kB/s
Requirement already satisfied: tqdm<5.0.0,>=4.42.1 in d:\python\python37\lib\site-packages (from pyppeteer>=0.0.14->requests-html) (4.62.3)
Requirement already satisfied: beautifulsoup4 in d:\python\python37\lib\site-packages (from bs4->requests-html) (4.8.2)
Requirement already satisfied: cssselect>0.7.9 in d:\python\python37\lib\site-packages (from pyquery->requests-html) (1.1.0)
Requirement already satisfied: lxml>=2.1 in d:\python\python37\lib\site-packages (from pyquery->requests-html) (4.5.0)
Requirement already satisfied: certifi>=2017.4.17 in d:\python\python37\lib\site-packages (from requests->requests-html) (2020.4.5.1)
Requirement already satisfied: idna<3,>=2.5 in d:\python\python37\lib\site-packages (from requests->requests-html) (2.9)
Requirement already satisfied: chardet<4,>=3.0.2 in d:\python\python37\lib\site-packages (from requests->requests-html) (3.0.4)
Requirement already satisfied: six>=1.4.1 in d:\python\python37\lib\site-packages (from w3lib->requests-html) (1.12.0)
Requirement already satisfied: zipp>=0.5 in d:\python\python37\lib\site-packages (from importlib-metadata>=1.4->pyppeteer>=0.0.14->requests-html) (3.1.0)
Requirement already satisfied: colorama in d:\python\python37\lib\site-packages (from tqdm<5.0.0,>=4.42.1->pyppeteer>=0.0.14->requests-html) (0.4.3)
Requirement already satisfied: soupsieve>=1.2 in d:\python\python37\lib\site-packages (from beautifulsoup4->bs4->requests-html) (2.0.1)
Using legacy 'setup.py install' for bs4, since package 'wheel' is not installed.
Using legacy 'setup.py install' for fake-useragent, since package 'wheel' is not installed.
Using legacy 'setup.py install' for parse, since package 'wheel' is not installed.
Installing collected packages: websockets, pyee, pyquery, pyppeteer, parse, fake-useragent, bs4, requests-html
Running setup.py install for parse ... done
Running setup.py install for fake-useragent ... done
Running setup.py install for bs4 ... done
Successfully installed bs4-0.0.1 fake-useragent-0.1.11 parse-1.19.0 pyee-8.2.2 pyppeteer-0.2.6 pyquery-1.4.3 requests-html-0.10.0 websockets-9.1
Tutorial and usage
- Use Requests to make a GET request to 'baidu.com':
from requests_html import HTMLSession
with HTMLSession() as session:
    r = session.get('https://www.baidu.com/')
    print(r)
- Grab all links on the page, as-is:
print(r.html.links)
# Output
{'https://www.baidu.com/s?cl=3&tn=baidutop10&fr=top1000&wd=%E4%B8%8D%E5%B0%91%E5%9C%B0%E5%8C%BA%E7%BB%BF%E5%8F%B6%E8%8F%9C%E4%BB%B7%E6%A0%BC%E5%BC%80%E5%A7%8B%E6%98%8E%E6%98%BE%E5%9B%9E%E8%90%BD&rsv_idx=2&rsv_dl=fyb_n_homepage&hisfilter=1', 'http://xueshu.baidu.com', 'https://b2b.baidu.com/s?fr=wwwt', 'https://baike.baidu.com', '/', 'https://map.baidu.com/?newmap=1&ie=utf-8&s=s', 'https://top.baidu.com/board?platform=pc&sa=pcindex_entry', 'http://tieba.baidu.com/f?fr=wwwt', 'http://zhidao.baidu.com/q?ct=17&pn=0&tn=ikaslist&rn=10&fr=wwwt', 'https://wenku.baidu.com', 'http://news.baidu.com', 'https://beian.miit.gov.cn', '//www.baidu.com/duty', 'http://tieba.baidu.com', 'https://www.baidu.com/s?cl=3&tn=baidutop10&fr=top1000&wd=%E5%8C%97%E6%96%B9%E6%9A%B4%E9%9B%AA%E5%8D%B3%E5%B0%86%E4%B8%8A%E7%BA%BF&rsv_idx=2&rsv_dl=fyb_n_homepage&hisfilter=1', '//home.baidu.com', '//www.baidu.com/licence/', 'https://jingyan.baidu.com', 'https://live.baidu.com/', 'http://e.baidu.com/ebaidu/home?refer=887', 'http://map.baidu.com', 'https://www.baidu.com/s?cl=3&tn=baidutop10&fr=top1000&wd=%E5%8D%97%E6%96%B9%E5%91%A8%E6%9C%AB%E5%88%9B%E5%A7%8B%E4%BA%BA%E5%B7%A6%E6%96%B9%E5%8E%BB%E4%B8%96&rsv_idx=2&rsv_dl=fyb_n_homepage&hisfilter=1', 'http://wenku.baidu.com/search?lm=0&od=0&ie=utf-8', 'https://www.baidu.com/s?cl=3&tn=baidutop10&fr=top1000&wd=%E6%B2%B3%E5%8C%97%E7%96%AB%E6%83%85%E5%AD%98%E5%A4%9A%E6%9D%A1%E4%BC%A0%E6%92%AD%E9%93%BE+%E6%B6%89%E5%A9%9A%E5%AE%B4%E7%AD%89&rsv_idx=2&rsv_dl=fyb_n_homepage&hisfilter=1', '//help.baidu.com', 'http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&ie=utf-8', 'https://zhidao.baidu.com', 'https://www.baidu.com/s?cl=3&tn=baidutop10&fr=top1000&wd=%E5%8C%97%E4%BA%AC%E6%97%A0%E5%8D%B0%E8%89%AF%E5%93%81%E8%B5%B7%E8%AF%89%E6%97%A5%E6%9C%AC%E6%97%A0%E5%8D%B0%E8%89%AF%E5%93%81%E8%8E%B7%E8%83%9C&rsv_idx=2&rsv_dl=fyb_n_homepage&hisfilter=1', 'https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F&sms=5', 'http://image.baidu.com', 
'http://www.baidu.com/more/', 'http://ir.baidu.com', 'https://www.hao123.com', 'http://music.taihe.com', 'https://haokan.baidu.com/?sfrom=baidu-top', 'https://pan.baidu.com', 'http://www.beian.gov.cn/portal/registerSystemInfo?recordcode=11000002000001', 'https://www.baidu.com/s?rtt=1&bsst=1&cl=2&tn=news', 'https://www.baidu.com/s?cl=3&tn=baidutop10&fr=top1000&wd=%E4%B8%AD%E4%BC%81%E4%BA%A7%E5%93%81%E8%A2%AB%E7%BE%8E%E6%96%B9%E6%89%A3%E7%95%99+%E5%A4%96%E4%BA%A4%E9%83%A8%E5%9B%9E%E5%BA%94&rsv_idx=2&rsv_dl=fyb_n_homepage&hisfilter=1', 'http://image.baidu.com/i?tn=baiduimage&ps=1&ct=201326592&lm=-1&cl=2&nc=1&ie=utf-8'}
- Grab all links on the page, in absolute form:
print(r.html.absolute_links)
# Output
{'http://map.baidu.com', 'https://beian.miit.gov.cn', 'https://www.baidu.com/duty', 'http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&ie=utf-8', 'https://www.baidu.com/s?rtt=1&bsst=1&cl=2&tn=news', 'https://www.baidu.com/s?cl=3&tn=baidutop10&fr=top1000&wd=%E5%8D%97%E6%96%B9%E5%91%A8%E6%9C%AB%E5%88%9B%E5%A7%8B%E4%BA%BA%E5%B7%A6%E6%96%B9%E5%8E%BB%E4%B8%96&rsv_idx=2&rsv_dl=fyb_n_homepage&hisfilter=1', 'https://www.baidu.com/s?cl=3&tn=baidutop10&fr=top1000&wd=%E4%B8%8D%E5%B0%91%E5%9C%B0%E5%8C%BA%E7%BB%BF%E5%8F%B6%E8%8F%9C%E4%BB%B7%E6%A0%BC%E5%BC%80%E5%A7%8B%E6%98%8E%E6%98%BE%E5%9B%9E%E8%90%BD&rsv_idx=2&rsv_dl=fyb_n_homepage&hisfilter=1', 'https://map.baidu.com/?newmap=1&ie=utf-8&s=s', 'http://xueshu.baidu.com', 'http://image.baidu.com/i?tn=baiduimage&ps=1&ct=201326592&lm=-1&cl=2&nc=1&ie=utf-8', 'https://top.baidu.com/board?platform=pc&sa=pcindex_entry', 'https://haokan.baidu.com/?sfrom=baidu-top', 'https://help.baidu.com', 'https://www.hao123.com', 'https://pan.baidu.com', 'https://zhidao.baidu.com', 'https://wenku.baidu.com', 'https://home.baidu.com', 'https://jingyan.baidu.com', 'https://baike.baidu.com', 'http://ir.baidu.com', 'http://zhidao.baidu.com/q?ct=17&pn=0&tn=ikaslist&rn=10&fr=wwwt', 'https://www.baidu.com/s?cl=3&tn=baidutop10&fr=top1000&wd=%E5%8C%97%E4%BA%AC%E6%97%A0%E5%8D%B0%E8%89%AF%E5%93%81%E8%B5%B7%E8%AF%89%E6%97%A5%E6%9C%AC%E6%97%A0%E5%8D%B0%E8%89%AF%E5%93%81%E8%8E%B7%E8%83%9C&rsv_idx=2&rsv_dl=fyb_n_homepage&hisfilter=1', 'http://www.beian.gov.cn/portal/registerSystemInfo?recordcode=11000002000001', 'http://image.baidu.com', 'https://www.baidu.com/s?cl=3&tn=baidutop10&fr=top1000&wd=%E4%B8%AD%E4%BC%81%E4%BA%A7%E5%93%81%E8%A2%AB%E7%BE%8E%E6%96%B9%E6%89%A3%E7%95%99+%E5%A4%96%E4%BA%A4%E9%83%A8%E5%9B%9E%E5%BA%94&rsv_idx=2&rsv_dl=fyb_n_homepage&hisfilter=1', 'https://www.baidu.com/licence/', 'http://news.baidu.com', 'http://music.taihe.com', 'http://www.baidu.com/more/', 'https://www.baidu.com/', 
'https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F&sms=5', 'https://www.baidu.com/s?cl=3&tn=baidutop10&fr=top1000&wd=%E5%8C%97%E6%96%B9%E6%9A%B4%E9%9B%AA%E5%8D%B3%E5%B0%86%E4%B8%8A%E7%BA%BF&rsv_idx=2&rsv_dl=fyb_n_homepage&hisfilter=1', 'http://tieba.baidu.com', 'http://e.baidu.com/ebaidu/home?refer=887', 'http://tieba.baidu.com/f?fr=wwwt', 'https://b2b.baidu.com/s?fr=wwwt', 'https://live.baidu.com/', 'https://www.baidu.com/s?cl=3&tn=baidutop10&fr=top1000&wd=%E6%B2%B3%E5%8C%97%E7%96%AB%E6%83%85%E5%AD%98%E5%A4%9A%E6%9D%A1%E4%BC%A0%E6%92%AD%E9%93%BE+%E6%B6%89%E5%A9%9A%E5%AE%B4%E7%AD%89&rsv_idx=2&rsv_dl=fyb_n_homepage&hisfilter=1', 'http://wenku.baidu.com/search?lm=0&od=0&ie=utf-8'}
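Comparing the two outputs above, entries such as '/' and '//home.baidu.com' in links show up as full URLs in absolute_links. A minimal sketch of that kind of resolution, using only the standard library's urljoin (this illustrates the idea and is not requests-html's own code; base_url and raw_links are sample values picked from the output above):

```python
from urllib.parse import urljoin

# Sample values taken from the output above; the base is the page that was requested.
base_url = "https://www.baidu.com/"
raw_links = ["/", "//home.baidu.com", "//www.baidu.com/duty", "http://news.baidu.com"]

# Resolve each raw link against the page URL.
resolved = {link: urljoin(base_url, link) for link in raw_links}

for link, absolute in resolved.items():
    print(f"{link!r:28} -> {absolute}")
```

Scheme-relative links (those starting with //) inherit the scheme of the page, while links that are already absolute are left unchanged.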
- Select an element with a CSS selector:
print(r.html.find("#kw", first=True))
# Output
<Element 'input' id='kw' name='wd' class=('s_ipt',) value='' maxlength='255' autocomplete='off'>
- Get an element's text content:
data = r.html.find(".text-color", first=True)
print(data.text)
# Output
关于百度
- Inspect an element's attributes:
data = r.html.find(".text-color", first=True)
print(data.attrs)
# Output
{'class': ('text-color',), 'href': '//home.baidu.com', 'target': '_blank'}
- Render an element's HTML:
data = r.html.find(".text-color", first=True)
print(data.html)
# Output
<a class="text-color" href="//home.baidu.com" target="_blank">关于百度</a>
- Select a list of elements within an element:
data = r.html.find(".text-color", first=True)
print(data.find('a'))
# Output
[<Element 'a' class=('text-color',) href='//home.baidu.com' target='_blank'>]
- Get the links within an element:
data = r.html.find(".text-color", first=True)
print(data.absolute_links)
# Output
{'https://home.baidu.com'}
- Search for text on the page:
print(r.html.search("baidu"))
# Output
<Result () {}>
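The result above is empty because search() expects a template in the style of the parse library, where each {} marks a field to capture; a template with no {} matches but captures nothing. The sketch below is not the library's implementation. It is a rough standard-library illustration in which search_template is a hypothetical helper and page is a sample string taken from the element output shown earlier:

```python
import re


def search_template(template: str, text: str):
    """Hypothetical helper: treat every {} in the template as a non-greedy capture."""
    literal_parts = template.split("{}")
    pattern = "(.+?)".join(re.escape(part) for part in literal_parts)
    match = re.search(pattern, text)
    return match.groups() if match else None


# Sample markup, copied from the element printed earlier in this post.
page = '<a class="text-color" href="//home.baidu.com" target="_blank">关于百度</a>'

no_fields = search_template("baidu", page)         # matches, but nothing to capture
one_field = search_template('href="{}"', page)     # captures the href value

print(no_fields)
print(one_field)
```

With the real library, the equivalent call would be something like r.html.search('href="{}"'), and the captured value is read by index from the returned result.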
- A more complex CSS selector example (copied from Chrome dev tools):
from requests_html import HTMLSession
with HTMLSession() as session:
    r = session.get('https://www.baidu.com/')
    ele = "li.hotsearch-item:nth-child(1) > a:nth-child(1) > span:nth-child(2)"
    print(r.html.find(ele, first=True).text)
# Output
河北疫情存多条传播链 涉婚宴等
- XPath is also supported:
from requests_html import HTMLSession
with HTMLSession() as session:
    r = session.get('https://www.baidu.com/')
    print(r.html.xpath('//*[@id="kw"]'))
# Output
[<Element 'input' id='kw' name='wd' class=('s_ipt',) value='' maxlength='255' autocomplete='off'>]
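The expression //*[@id="kw"] reads as "any element, anywhere in the document, whose id attribute equals kw". To try the same expression without a network request, here is a small sketch using the standard library's ElementTree. Note the assumptions: ElementTree supports only a subset of XPath and needs well-formed markup, whereas requests-html evaluates XPath through lxml, and the doc string below is made-up sample markup.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample markup loosely modelled on the search box above.
doc = """<html><body>
<form><input id="kw" name="wd" class="s_ipt" maxlength="255"/></form>
<a id="about" href="//home.baidu.com">About</a>
</body></html>"""

root = ET.fromstring(doc)

# ElementTree paths are relative to the root, hence the leading "."
matches = root.findall(".//*[@id='kw']")
summary = [(el.tag, el.attrib["name"]) for el in matches]
print(summary)
```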
- You can also select only elements that contain specific text:
from requests_html import HTMLSession
with HTMLSession() as session:
    r = session.get('https://www.baidu.com/')
    print(r.html.find('a', containing='baidu'))
# Output
[<Element 'a' class=('text-color',) href='http://ir.baidu.com' target='_blank'>]
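Only one link is returned above even though almost every href on the page contains "baidu", which suggests that containing filters on an element's text rather than on its attributes. The sketch below illustrates that kind of filter with the standard library only. It is not how requests-html is implemented; AnchorCollector is a hypothetical helper, and the sample markup (including the link text "About Baidu") is an assumption made for the demonstration.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Hypothetical helper: collect [href, text] pairs for every <a> element."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._inside_anchor = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchors.append([dict(attrs).get("href"), ""])
            self._inside_anchor = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._inside_anchor = False

    def handle_data(self, data):
        if self._inside_anchor:
            self.anchors[-1][1] += data


# Assumed sample markup: both hrefs contain "baidu", only one link text does.
doc = (
    '<a href="//home.baidu.com">关于百度</a>'
    '<a href="http://ir.baidu.com">About Baidu</a>'
)

collector = AnchorCollector()
collector.feed(doc)
collector.close()

# Keep only anchors whose visible text contains the keyword, ignoring case.
matches = [href for href, text in collector.anchors if "baidu" in text.lower()]
print(matches)
```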
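JavaScript support
You can also scrape text rendered by JavaScript (the first call to render() downloads Chromium):
from requests_html import HTMLSession
with HTMLSession() as session:
    r = session.get('http://www.baidu.com/')
    # render() reloads the page in Chromium and updates r.html in place;
    # it returns None unless a script argument is passed.
    r.html.render()
    print(r.html.html)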
Pagination
from requests_html import HTMLSession
with HTMLSession() as session:
    r = session.get('http://news.baidu.com/')
    # Iterating over r.html follows the "next page" links it can detect.
    for html in r.html:
        print(html)
Or you can simply request the next URL:
from requests_html import HTMLSession
with HTMLSession() as session:
    r = session.get('http://news.baidu.com/')
    print(r.html.next())
Using without Requests
You can also use this library without Requests:
from requests_html import HTML
doc = """<a href='https://httpbin.org'>"""
html = HTML(html=doc)
print(html.links)
# Output
{'https://httpbin.org'}
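You can also render JavaScript pages without Requests:
from requests_html import HTML
script = """
    () => {
        return {
            width: document.documentElement.clientWidth,
            height: document.documentElement.clientHeight,
            deviceScaleFactor: window.devicePixelRatio,
        }
    }
"""
# Note: the script text is also passed as the page content here,
# which is why it appears inside <body> in the output below.
html = HTML(html=script)
# With script=..., render() returns the value the script evaluates to.
val = html.render(script=script, reload=False)
print(val)
print(html.html)
# Output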
{'width': 800, 'height': 600, 'deviceScaleFactor': 1}
<html><head></head><body>() => {
return {
width: document.documentElement.clientWidth,
height: document.documentElement.clientHeight,
deviceScaleFactor: window.devicePixelRatio,
}
}</body></html>
Accessing websites asynchronously
- Try async and fetch several sites at the same time:
from requests_html import AsyncHTMLSession
asession = AsyncHTMLSession()
async def get_pythonorg():
    r = await asession.get('https://python.org/')
    return r
async def get_douban():
    r = await asession.get('https://www.douban.com/')
    return r
async def get_baidu():
    r = await asession.get('https://www.baidu.com/')
    return r
# Pass the coroutine functions themselves, not the result of calling them.
results = asession.run(get_pythonorg, get_douban, get_baidu)
print(results)
# Output
[<Response [200]>, <Response [200]>, <Response [200]>]
- Each item in the results list is a response object that you can interact with:
from requests_html import AsyncHTMLSession
asession = AsyncHTMLSession()
async def get_pythonorg():
    r = await asession.get('https://python.org/')
    return r
async def get_douban():
    r = await asession.get('https://www.douban.com/')
    return r
async def get_baidu():
    r = await asession.get('https://www.baidu.com/')
    return r
results = asession.run(get_pythonorg, get_douban, get_baidu)
# The results may not come back in the order the functions were passed.
for result in results:
    print(result.html.url)
# Output
https://www.python.org/
https://www.baidu.com/
https://www.douban.com/
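In the output above, baidu.com is listed before douban.com even though the douban coroutine was passed first, so the result order should not be relied on. The sketch below uses only the standard library's asyncio to show the difference between collecting results in call order and in completion order. It is an illustration, not requests-html code: fake_fetch is a hypothetical stand-in that sleeps instead of making an HTTP request, and the delays are made up.

```python
import asyncio


async def fake_fetch(name: str, delay: float) -> str:
    """Hypothetical stand-in for a network request: wait, then return the name."""
    await asyncio.sleep(delay)
    return name


async def main():
    # Made-up delays: the last job passed in finishes first.
    jobs = [("python.org", 0.15), ("douban.com", 0.10), ("baidu.com", 0.05)]

    # gather() runs the coroutines concurrently and keeps the call order.
    ordered = await asyncio.gather(*(fake_fetch(n, d) for n, d in jobs))

    # as_completed() yields results as each coroutine finishes.
    by_completion = []
    for finished in asyncio.as_completed([fake_fetch(n, d) for n, d in jobs]):
        by_completion.append(await finished)

    return ordered, by_completion


ordered, by_completion = asyncio.run(main())
print(ordered)
print(by_completion)
```

If the order matters in your own code, one option is to have each coroutine return its URL alongside the response and match on that instead of on list position.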
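I hope this summary is useful to you. If anything is unclear or ambiguous, send me a message and I will correct it promptly. Likes and shares are very welcome, thank you!
To be continued…
I am always working at it, and I hope you are too!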