Luffy Academy - Python Crawler Practical Training - Chapter 1
Crawler basics:
I. The requests module
Install: pip3 install requests
Import: import requests
1. Common parameters
- method: usually get or post. requests.get and requests.post are convenience wrappers; under the hood both simply set method.
GET request: fetches information and may carry a params argument. Data submitted via GET shows up in the address bar.
# example without parameters
import requests
ret = requests.get('https://github.com')
print(ret.text)

# example with parameters
import requests
ret = requests.get("https://github.com/", params={'key1': 'value1', 'key2': 'value2'})
print(ret.text)
POST request: updates resource information; the submitted data is placed in the body of the HTTP request.
# Method 1:
import requests
payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.post("http://httpbin.org/post", data=payload)
print(r.text)

# Method 2:
import requests
import json
url = 'https://api.github.com/some/endpoint'
payload = {'v1': 'k1'}
headers = {'content-type': 'application/json'}
ret = requests.post(url, data=json.dumps(payload), headers=headers)
print(ret.text)
2. url: the address the crawler visits
3. Saving cookies: use a session; get_dict() flattens a cookie jar into a plain dict
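As a minimal sketch of the idea (the domain and cookie values below are placeholders, not from the course): a requests.Session keeps one cookie jar across every request made through it, and get_dict() turns that jar into an ordinary dict:

```python
import requests

# A Session keeps a single cookie jar for all requests made through it,
# so cookies set by one response are sent automatically on the next request.
s = requests.Session()

# Simulate a Set-Cookie header the server might have returned:
s.cookies.set('sessionid', 'abc123', domain='example.com')

# get_dict() flattens the jar into a plain {name: value} dict.
print(s.cookies.get_dict())  # {'sessionid': 'abc123'}
```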
4. Other parameters:
- params: extra query-string parameters appended to the URL
- data: the body of a POST request; shown in the browser as form data
- json: JSON parameters of a POST request; shown in the browser as request payload
- headers: request headers
- cookies: browser cookies
- verify: whether to verify the SSL certificate (False skips verification)
- timeout: request timeout
- allow_redirects: whether to follow redirects
- proxies: proxy settings
- stream: typically used for downloading files
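To see where several of these keyword arguments end up, we can prepare a request without sending it (httpbin.org here is only a stand-in URL; nothing goes over the network):

```python
import requests

# Prepare (but do not send) a request to inspect where each argument goes.
req = requests.Request(
    method='POST',
    url='http://httpbin.org/post',           # stand-in endpoint, never contacted
    params={'page': '1'},                    # -> appended to the URL as a query string
    data={'user': 'alice'},                  # -> form-encoded request body
    headers={'User-Agent': 'demo-crawler'},  # -> request headers
)
prepared = req.prepare()
print(prepared.url)   # http://httpbin.org/post?page=1
print(prepared.body)  # user=alice
```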
II. The beautifulsoup module
Install: pip3 install beautifulsoup4
Import: from bs4 import BeautifulSoup
By default BeautifulSoup uses Python's built-in HTML parser. Installing a third-party parser such as lxml makes it more convenient to work with.
Usage example:
from bs4 import BeautifulSoup

html = '''
<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
'''

soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())
print(soup.title)
print(soup.title.name)
print(soup.title.string)
print(soup.title.parent.name)
print(soup.p)
print(soup.p["class"])
print(soup.a)
print(soup.find_all('a'))
print(soup.find(id='link3'))
1. Basic usage
Tag selectors
print(soup.title)
print(type(soup.title))
print(soup.head)
print(soup.p)
Writing soup.<tag name> gives us that tag and its content.
One caveat: if the document contains several tags with the same name, this form returns only the first one. For example, soup.p above returns only the first p tag even though the document has multiple p tags.
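A self-contained illustration of this first-match behavior (the two-paragraph snippet below is made up for the demo):

```python
from bs4 import BeautifulSoup

html = '<div><p>first</p><p>second</p></div>'
soup = BeautifulSoup(html, 'html.parser')

# soup.p returns only the first matching tag...
print(soup.p.string)            # first
# ...while find_all returns every match.
print(len(soup.find_all('p')))  # 2
```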
A tag has two important attributes: name and attrs.
name
print(soup.name)       # [document]
print(soup.head.name)  # head
attrs
# print all attributes of the p tag
print(soup.p.attrs)     # {'class': ['title'], 'name': 'dromouse'}
# get a single attribute
print(soup.p['class'])  # ['title']
Traversing the document
# Child nodes:
# .contents returns the tag's children as a list
# .children returns a generator over the children
# .descendants iterates over all descendants
print(soup.p.contents)
print(soup.p.children)
for i, child in enumerate(soup.p.children):
    print(i, child)

# Parent and ancestor nodes
print(soup.a.parent)
print(list(enumerate(soup.a.parents)))

# Sibling nodes
print(soup.a.next_siblings)      # generator over the following siblings
print(soup.a.previous_siblings)  # generator over the preceding siblings
print(soup.a.next_sibling)       # the next sibling tag
print(soup.a.previous_sibling)   # the previous sibling tag
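The traversal attributes above can be checked on a tiny hand-made snippet (invented for the demo, not from the course):

```python
from bs4 import BeautifulSoup

html = '<p><b>one</b><i>two</i></p>'
soup = BeautifulSoup(html, 'html.parser')

b = soup.b
print(soup.p.contents)     # [<b>one</b>, <i>two</i>] -- children as a list
print(b.parent.name)       # p
print(b.next_sibling)      # <i>two</i>
print(b.previous_sibling)  # None (b is the first child)
```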
Searching the document
find and find_all
Parameters
1. name (the name parameter matches every tag with that name; string objects are automatically ignored)
print(soup.find('a'))
print(soup.find_all('a'))
import re
print(soup.find(re.compile('^a')))
print(soup.find_all(re.compile('^a')))
print(soup.find(['a', 'b']))
print(soup.find_all(['a', 'b']))
2. keyword
print(soup.find(id='link2'))
print(soup.find(attrs={'class': 'link', 'name': 'baidu'}))
3. text: the text content inside a tag
print(soup.find(text='else'))
print(soup.find(text=['else', 'if']))
print(soup.find(text=re.compile('else')))
4. limit: caps the number of results. Searching a large document tree is slow; with this parameter the search returns as soon as enough matches are found.
print(soup.find_all('a',limit=2))
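A quick self-contained check of limit (the three-link snippet is made up for the demo):

```python
from bs4 import BeautifulSoup

html = '<p><a>1</a><a>2</a><a>3</a></p>'
soup = BeautifulSoup(html, 'html.parser')

# Only the first two matches are returned, then the search stops early.
links = soup.find_all('a', limit=2)
print(len(links))  # 2
```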
The select CSS selector
# by tag name
print(soup.select('title'))
print(soup.select('a'))
print(soup.select('b'))
# by class name
print(soup.select('.link'))
# by id
print(soup.select('#link2'))
# combined lookups
print(soup.select('a[class="link"]'))
print(soup.select('head > title'))
III. Chouti (dig.chouti.com) example
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/63.0.3239.26 Safari/537.36 Core/1.63.5478.400 QQBrowser/10.1.1550.400'
}

# The first visit returns an unauthorized cookie value
r1 = requests.get(url='https://dig.chouti.com/', headers=headers)
r1_cookies = r1.cookies.get_dict()

# After a successful login the cookie value is authorized
r2 = requests.post(
    url='https://dig.chouti.com/login',
    data={
        'phone': '8613800138000',
        'password': 'abcd1234',
        'oneMonth': '1',
    },
    headers=headers,
    cookies=r1_cookies,
)

for num_page in range(2, 10):
    ret_index = requests.get(
        url='https://dig.chouti.com/all/hot/recent/%s' % num_page,
        headers=headers,
    )
    soup = BeautifulSoup(ret_index.text, 'html.parser')
    div = soup.find(name='div', id='content-list')
    item_list = div.find_all(attrs={'class': 'part2'})
    for item in item_list:
        num = item.get('share-linkid')
        # Upvote while carrying the authorized cookie value
        r3 = requests.post(
            url='https://dig.chouti.com/link/vote?linksId=%s' % num,
            headers=headers,
            cookies=r1_cookies,
        )
        print(r3.text)