Web scraping: GET requests, POST requests, request headers, request bodies, simulating GET/POST login requests, Response object attributes, parsing JSON
Web scraping:
At its core, scraping means simulating HTTP requests (sending whatever the browser would send). The server responds with data, which we then clean, i.e. extract the parts we need, and finally store in a database.
The robots protocol: robots.txt
Examples:
https://www.baidu.com/robots.txt
https://www.cnblogs.com/robots.txt
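A robots.txt file can be checked programmatically with the standard library's urllib.robotparser. A minimal sketch, parsing a hypothetical robots.txt body rather than fetching a real one (the rules and example.com URLs are made up):

```python
from urllib.robotparser import RobotFileParser

# a hypothetical robots.txt body; real sites serve this at /robots.txt
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# check whether a crawler may fetch a given URL
print(rp.can_fetch('*', 'https://example.com/private/page'))  # False
print(rp.can_fetch('*', 'https://example.com/index.html'))    # True
```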
requests
Introduction: with requests we can simulate browser requests. Compared with the older urllib, the requests API is much more convenient (under the hood it builds on urllib3).
Install the module:
pip3 install requests
Note: requests only downloads the page content; it does not execute any JavaScript. For JS-rendered data we have to analyze the target site ourselves and send additional requests.
Request methods: the most commonly used are requests.get() and requests.post()
>>> import requests
>>> r = requests.get('https://api.github.com/events')
>>> r = requests.post('http://httpbin.org/post', data = {'key':'value'})
>>> r = requests.put('http://httpbin.org/put', data = {'key':'value'})
>>> r = requests.delete('http://httpbin.org/delete')
>>> r = requests.head('http://httpbin.org/get')
>>> r = requests.options('http://httpbin.org/get')
Usage:
Simulating a GET request (in PyCharm or any interpreter):
# import the module
import requests
res = requests.get("https://www.cnblogs.com/xiejunjie8888/")
# <Response [200]>
print(res)
# the response body as text
print(res.text)
This is the raw HTML, before any browser rendering.
Passing query parameters in a GET request
params
# equivalent to https://www.cnblogs.com/liuqingzheng/p/16005866.html?name=junjie&age=18
res = requests.get("https://www.cnblogs.com/liuqingzheng/p/16005866.html",
                   params={'name': 'junjie', 'age': 18})
URL encoding and decoding, usually seen in query keywords: if a keyword contains Chinese characters or other special symbols, it has to be URL-encoded.
Example:
Search Baidu for 刘亦菲. Note: the text after wd= is the keyword we typed.
Copying the URL shows it differs from what was typed, because the URL is percent-encoded; decoding recovers the original text.
from urllib.parse import quote, unquote
a = "%E5%88%98%E4%BA%A6%E8%8F%B2"
b = unquote(a)
# 刘亦菲
print(b)
Import the functions; note that quote/unquote percent-encode and decode any non-ASCII or reserved characters, not only Chinese:
from urllib.parse import quote, unquote
Example 1:
# encode
a = "俊杰"
b = quote(a)
# %E4%BF%8A%E6%9D%B0
print(b)
# decode
c = "%E4%BF%8A%E6%9D%B0"
d = unquote(c)
# 俊杰
print(d)
Example 2:
Import the function:
from urllib.parse import urlencode
res = {'name': '俊杰', 'age': 18}
p = urlencode(res)
# name=%E4%BF%8A%E6%9D%B0&age=18
print(p)
URL encoding and decoding are used throughout HTTP requests.
Example:
from urllib.parse import urlencode
wd = 'junjie老师'
# encoding='utf-8' is the default and may be omitted
encode_res = urlencode({'k': wd}, encoding='utf-8')
# k=junjie%E8%80%81%E5%B8%88
print(encode_res)
keyword = encode_res.split('=')[1]
# junjie%E8%80%81%E5%B8%88
print(keyword)
# then splice the keyword into the URL
url = 'https://www.baidu.com/s?wd=%s' % keyword
# https://www.baidu.com/s?wd=junjie%E8%80%81%E5%B8%88
print(url)
The example above can also be done directly with the params argument.
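To confirm that equivalence without sending anything over the network, we can build a PreparedRequest and inspect the URL requests would actually send; a minimal sketch:

```python
import requests

# prepare (but do not send) a GET request with a Chinese query keyword
req = requests.Request('GET', 'https://www.baidu.com/s', params={'wd': 'junjie老师'})
prepared = req.prepare()

# requests percent-encodes the params for us
print(prepared.url)  # https://www.baidu.com/s?wd=junjie%E8%80%81%E5%B8%88
```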
Carrying request headers: a GET request with headers
We usually need to send request headers along with a request; the headers are the key to disguising ourselves as a browser. Commonly useful headers:
- Host
- Referer --> large sites often use this to check where a request came from
- User-Agent --> identifies the client
- Cookie --> although cookies travel in the request headers, requests has a dedicated parameter for them, so don't put them in headers={}
User-Agent example, visiting cnblogs:
import requests
header = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36",
}
res = requests.get("https://www.cnblogs.com/xiejunjie8888/", headers=header)
with open('bokeyuan.html', 'wb') as f:
    # content is the raw bytes; open bokeyuan.html locally (it will have no CSS styling)
    f.write(res.content)
print(res.text)
If the request fails, add more header fields.
Carrying cookies
A cookie is itself a request-header value, so it can simply go in the headers; but because cookies are used so often, requests also accepts them as a separate parameter.
Example: simulating an upvote while carrying a cookie:
- Test site: https://dig.chouti.com/
Inspecting the network tab shows that upvoting actually sends a POST request to the vote endpoint.
Now upvote from code:
import requests
header = {
"User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36",
"Cookie":"deviceId=web.eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJqaWQiOiI5OTc4MTE0ZS1jZDRlLTQ0YWEtOTMzMS04YmJjZTU2YTljZmMiLCJleHBpcmUiOiIxNjU0NDI3NjIzMjI4In0.RDFMfFm9PlMQQbXB91knucnn2ULfL-TO1ymLhbbmjWQ; Hm_lvt_03b2668f8e8699e91d479d62bc7630f1=1651835624; __snaker__id=r9q6WZAENXErIczY; gdxidpyhxdE=cev%2B4O872OVnz6lamLJyWUeS5ff8V%5CIO4KTlb9MdTtDaY4UaEUKBEMaU6Q7waoeh6n0I8Sr4Q8KCOJuVjKeocEdcsTOPO3N6xsbNbRHBwNo4E3YQfYlcKQHaM73%2FdmJRb1nB0nti1kWIMW7LAvMiEhAD1Iw5%5CTGvOtKfQOOzV0SD8ICy%3A1651836524953; _9755xjdesxxd_=32; YD00000980905869%3AWM_NI=TjVihGAe%2BWLNC2lHcJxCYV6n5OePF1B%2BvEXmc%2Fw51U3TShYaeSW9tfEgy%2Ft3nBch0Q6Jk2pjnyWKH9RiWrxooc00SkNLu0Z2JOeWH4hd9O0B7iZHjWE7Tl0i%2FT0fvaYGdEo%3D; YD00000980905869%3AWM_NIKE=9ca17ae2e6ffcda170e2e6ee89cf79b0f1ffa7d37ef69a8ea3d85e868e9fb0d54aa8bf8ca3ea79aeaea6d5c42af0fea7c3b92a98b69a8aeb4d83bbe5b8cf4681eea4a9e242bc9d97b7b566f791abb1c53b82f5baa6db67a1f1b9d6b45ff18c9d89dc6ba78d99daf76e87b9b7d0f739ae9ead89b35bf696a5d3f362aea799accb3392b79993fb65adbd00d1d2618e94bbd1d93a8a91abb8cb40acbcf7b0b34af5b796a9e95d87e99c8db363ed8c0087f56b9b99aca6ee37e2a3; YD00000980905869%3AWM_TID=WY%2BAISARxQxFURREVAPAEIOHA0p2qn9B; token=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJqaWQiOiJjdHVfNjUxODM1NjAyMTEiLCJleHBpcmUiOiIxNjU0NDI3NjM1NjM5In0.5RDSo94xZigmibnPGic5g1RdYVNmn5avmj3y2bc1s0M; Hm_lpvt_03b2668f8e8699e91d479d62bc7630f1=1651835636"
}
data = {
    # request body
    "linkId": "34939488"
}
res = requests.post("https://dig.chouti.com/link/vote/",headers=header,data=data)
print(res.text)
Refresh the page and the upvote appears; delete the cookie and resend the POST request, and the site responds that login is required.
Cookies can also be passed via the cookies keyword argument of requests.post (ideally as a dict mapping each cookie name to its value):
import requests
header = {
"User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36",
# "Cookie":"deviceId=web.eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJqaWQiOiI5OTc4MTE0ZS1jZDRlLTQ0YWEtOTMzMS04YmJjZTU2YTljZmMiLCJleHBpcmUiOiIxNjU0NDI3NjIzMjI4In0.RDFMfFm9PlMQQbXB91knucnn2ULfL-TO1ymLhbbmjWQ; Hm_lvt_03b2668f8e8699e91d479d62bc7630f1=1651835624; __snaker__id=r9q6WZAENXErIczY; gdxidpyhxdE=cev%2B4O872OVnz6lamLJyWUeS5ff8V%5CIO4KTlb9MdTtDaY4UaEUKBEMaU6Q7waoeh6n0I8Sr4Q8KCOJuVjKeocEdcsTOPO3N6xsbNbRHBwNo4E3YQfYlcKQHaM73%2FdmJRb1nB0nti1kWIMW7LAvMiEhAD1Iw5%5CTGvOtKfQOOzV0SD8ICy%3A1651836524953; _9755xjdesxxd_=32; YD00000980905869%3AWM_NI=TjVihGAe%2BWLNC2lHcJxCYV6n5OePF1B%2BvEXmc%2Fw51U3TShYaeSW9tfEgy%2Ft3nBch0Q6Jk2pjnyWKH9RiWrxooc00SkNLu0Z2JOeWH4hd9O0B7iZHjWE7Tl0i%2FT0fvaYGdEo%3D; YD00000980905869%3AWM_NIKE=9ca17ae2e6ffcda170e2e6ee89cf79b0f1ffa7d37ef69a8ea3d85e868e9fb0d54aa8bf8ca3ea79aeaea6d5c42af0fea7c3b92a98b69a8aeb4d83bbe5b8cf4681eea4a9e242bc9d97b7b566f791abb1c53b82f5baa6db67a1f1b9d6b45ff18c9d89dc6ba78d99daf76e87b9b7d0f739ae9ead89b35bf696a5d3f362aea799accb3392b79993fb65adbd00d1d2618e94bbd1d93a8a91abb8cb40acbcf7b0b34af5b796a9e95d87e99c8db363ed8c0087f56b9b99aca6ee37e2a3; YD00000980905869%3AWM_TID=WY%2BAISARxQxFURREVAPAEIOHA0p2qn9B; token=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJqaWQiOiJjdHVfNjUxODM1NjAyMTEiLCJleHBpcmUiOiIxNjU0NDI3NjM1NjM5In0.5RDSo94xZigmibnPGic5g1RdYVNmn5avmj3y2bc1s0M; Hm_lpvt_03b2668f8e8699e91d479d62bc7630f1=1651835636"
}
data = {
"linkId": "34939488"
}
res = requests.post("https://dig.chouti.com/link/vote/",headers=header,data=data,
cookies={"Cookie":"deviceId=web.eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJqaWQiOiI5OTc4MTE0ZS1jZDRlLTQ0YWEtOTMzMS04YmJjZTU2YTljZmMiLCJleHBpcmUiOiIxNjU0NDI3NjIzMjI4In0.RDFMfFm9PlMQQbXB91knucnn2ULfL-TO1ymLhbbmjWQ; Hm_lvt_03b2668f8e8699e91d479d62bc7630f1=1651835624; __snaker__id=r9q6WZAENXErIczY; gdxidpyhxdE=cev%2B4O872OVnz6lamLJyWUeS5ff8V%5CIO4KTlb9MdTtDaY4UaEUKBEMaU6Q7waoeh6n0I8Sr4Q8KCOJuVjKeocEdcsTOPO3N6xsbNbRHBwNo4E3YQfYlcKQHaM73%2FdmJRb1nB0nti1kWIMW7LAvMiEhAD1Iw5%5CTGvOtKfQOOzV0SD8ICy%3A1651836524953; _9755xjdesxxd_=32; YD00000980905869%3AWM_NI=TjVihGAe%2BWLNC2lHcJxCYV6n5OePF1B%2BvEXmc%2Fw51U3TShYaeSW9tfEgy%2Ft3nBch0Q6Jk2pjnyWKH9RiWrxooc00SkNLu0Z2JOeWH4hd9O0B7iZHjWE7Tl0i%2FT0fvaYGdEo%3D; YD00000980905869%3AWM_NIKE=9ca17ae2e6ffcda170e2e6ee89cf79b0f1ffa7d37ef69a8ea3d85e868e9fb0d54aa8bf8ca3ea79aeaea6d5c42af0fea7c3b92a98b69a8aeb4d83bbe5b8cf4681eea4a9e242bc9d97b7b566f791abb1c53b82f5baa6db67a1f1b9d6b45ff18c9d89dc6ba78d99daf76e87b9b7d0f739ae9ead89b35bf696a5d3f362aea799accb3392b79993fb65adbd00d1d2618e94bbd1d93a8a91abb8cb40acbcf7b0b34af5b796a9e95d87e99c8db363ed8c0087f56b9b99aca6ee37e2a3; YD00000980905869%3AWM_TID=WY%2BAISARxQxFURREVAPAEIOHA0p2qn9B; token=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJqaWQiOiJjdHVfNjUxODM1NjAyMTEiLCJleHBpcmUiOiIxNjU0NDI3NjM1NjM5In0.5RDSo94xZigmibnPGic5g1RdYVNmn5avmj3y2bc1s0M; Hm_lpvt_03b2668f8e8699e91d479d62bc7630f1=1651835636"
})
print(res.text)
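Note that the cookies= parameter expects a dict mapping each cookie name to its value, not the whole header string under a single "Cookie" key. A raw Cookie header copied from the browser can be split into such a dict with the standard library's http.cookies; a minimal sketch using a made-up cookie string:

```python
from http.cookies import SimpleCookie

# a hypothetical raw Cookie header copied from the browser's dev tools
raw = "deviceId=web.abc123; token=xyz789; Hm_lvt=1651835624"

cookie = SimpleCookie()
cookie.load(raw)
cookies = {name: morsel.value for name, morsel in cookie.items()}

print(cookies)  # {'deviceId': 'web.abc123', 'token': 'xyz789', 'Hm_lvt': '1651835624'}
# then: requests.post(url, headers=header, data=data, cookies=cookies)
```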
Sending a POST request to simulate login
Using the 华华手机 site as an example, platform address: http://www.aa7a.cn/user.php?&ref=http%3A%2F%2Fwww.aa7a.cn%2F
Clicking Login shows that a POST request is sent to http://www.aa7a.cn/user.php.
Inspect the request body:
Simulate the login in code:
import requests
data = {
'username': 'ee@qq.com',
'password': 'lqz',
'captcha': '3j5c',
'remember': 1,
'ref': 'http://www.aa7a.cn/',
'act': 'act_login',
}
res = requests.post("http://www.aa7a.cn/user.php", data=data,)
print(res.text)
# the cookies returned after a successful login; we can use them to simulate post-login actions
'''
{"error":0,"ref":"http://www.aa7a.cn/"}
<RequestsCookieJar[<Cookie ECS[password]=4a5e6ce9d1aba9de9b31abdf303bbdc2 for www.aa7a.cn/>,
<Cookie ECS[user_id]=61399 for www.aa7a.cn/>, <Cookie ECS[username]=616564099%40qq.com for www.aa7a.cn/>,
<Cookie ECS[visit_times]=1 for www.aa7a.cn/>, <Cookie ECS_ID=aeb0d6ccd56197a1d1f7dbc0d3ddc8976fd955c6 for www.aa7a.cn/>]>
'''
print(res.cookies)
# False -- no cookies carried, so we are not logged in
res2 = requests.get("http://www.aa7a.cn/")
print("616564099@qq.com" in res2.text)
# True
res3 = requests.get("http://www.aa7a.cn/", cookies=res.cookies)
# carrying the cookies simulates a logged-in visit; check whether the account appears in the page
# this is the post-login HTML page (without CSS)
print(res3.text)
# check whether the account name appears in the post-login page
print("616564099@qq.com" in res3.text)
Carrying the cookie manually on every request is tedious; requests provides a Session object for this.
# from now on, send all requests through the session object; no manual cookie handling needed
session = requests.session()
data = {
'username': 'ee@qq.com',
'password': 'lqz',
'captcha': '3j5c',
'remember': 1,
'ref': 'http://www.aa7a.cn/',
'act': 'act_login',
}
res = session.post("http://www.aa7a.cn/user.php", data=data,)
print(res.text)
# the cookies returned after a successful login; the session keeps them for later requests
'''
{"error":0,"ref":"http://www.aa7a.cn/"}
<RequestsCookieJar[<Cookie ECS[password]=4a5e6ce9d1aba9de9b31abdf303bbdc2 for www.aa7a.cn/>,
<Cookie ECS[user_id]=61399 for www.aa7a.cn/>, <Cookie ECS[username]=616564099%40qq.com for www.aa7a.cn/>,
<Cookie ECS[visit_times]=1 for www.aa7a.cn/>, <Cookie ECS_ID=aeb0d6ccd56197a1d1f7dbc0d3ddc8976fd955c6 for www.aa7a.cn/>]>
'''
print(res.cookies)
# True -- the session already carries the login cookies automatically
res2 = session.get("http://www.aa7a.cn/")
print("616564099@qq.com" in res2.text)
# True
res3 = session.get("http://www.aa7a.cn/")  # no need to pass cookies; the session handles them
# this is the post-login HTML page (without CSS)
print(res3.text)
print("616564099@qq.com" in res3.text)
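A session's cookie jar only lives as long as the process; to reuse a login across runs, it can be persisted to disk, e.g. with pickle. A minimal sketch (the file name is an assumption, and the cookie is set by hand to stand in for a real login):

```python
import pickle
import requests

session = requests.session()
# pretend we logged in; set a cookie manually for demonstration
session.cookies.set('ECS[username]', '616564099%40qq.com')

# save the cookie jar after a successful login
with open('cookies.pkl', 'wb') as f:
    pickle.dump(session.cookies, f)

# later, in a fresh process: restore the jar into a new session
new_session = requests.session()
with open('cookies.pkl', 'rb') as f:
    new_session.cookies.update(pickle.load(f))

print(new_session.cookies.get('ECS[username]'))  # 616564099%40qq.com
```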
The Response object's attributes:
import requests
respone = requests.get('http://www.jianshu.com')
# respone attributes
print(respone.text)                # the response body as text
print(respone.content)             # the response body as raw bytes
print(respone.status_code)         # HTTP status code
print(respone.headers)             # response headers
print(respone.cookies)             # cookies set by the response
print(respone.cookies.get_dict())  # those cookies as a dict
print(respone.cookies.items())     # those cookies as (name, value) pairs
print(respone.url)                 # the requested URL
print(respone.history)             # if redirects occurred, a list of the responses before the final one
print(respone.encoding)            # the page's encoding, e.g. utf-8 or gbk
# respone.iter_content()           # iterate over the body in binary chunks; typically used to save files
Encoding issues (usually not a problem)
import requests
response = requests.get('http://www.autohome.com/news')
# response.encoding = 'gbk'  # autohome serves pages encoded as GB2312, while requests falls back to ISO-8859-1 when no charset is declared; without setting gbk the Chinese text is garbled
print(response.text)
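The garbling comes from decoding GBK/GB2312 bytes with the wrong codec; a minimal offline sketch of the same effect:

```python
# the same bytes decoded with the right and the wrong codec
raw = '汽车之家'.encode('gbk')

print(raw.decode('gbk'))         # 汽车之家 -- correct
print(raw.decode('iso-8859-1'))  # garbled latin-1 mojibake, like an unset response.encoding
```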
Fetching binary data: content or iter_content
Used for downloading images and videos.
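A chunked download can be sketched as follows, with the file writing split out so it works with any iterable of byte chunks (the URL and file names are placeholders):

```python
import requests

def save_chunks(chunks, path):
    # write an iterable of byte chunks to disk
    with open(path, 'wb') as f:
        for chunk in chunks:
            f.write(chunk)

def download_file(url, path, chunk_size=8192):
    # stream=True keeps the body out of memory until we iterate over it
    with requests.get(url, stream=True) as res:
        res.raise_for_status()
        save_chunks(res.iter_content(chunk_size=chunk_size), path)

# usage (placeholder URL):
# download_file('https://example.com/pic.jpg', 'pic.jpg')
```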
Parsing JSON
Convert the returned data to a Python object so it is easy to work with.
import json
import requests
data = {
    'cname': '',
    'pid': '',
    'keyword': '上海',
    'pageIndex': 1,
    'pageSize': 10,
}
res = requests.post('http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword', data=data)
j = json.loads(res.text)
print(j['Table'][0]['rowcount'])
The method above is clumsy; the simpler way:
data = {
'cname': '',
'pid': '',
'keyword': '上海',
'pageIndex': 1,
'pageSize': 10,
}
res= requests.post('http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword',data=data).json()
print(res['Table'][0]['rowcount'])
SSL certificate verification
# some sites use certificates not signed by a trusted CA; in the past we had to carry the certificate manually
# skip certificate verification and access directly
import requests
respone = requests.get('https://www.12306.cn', verify=False)  # skip verification; prints a warning, returns 200
print(respone.status_code)
# carry the certificate manually
import requests
respone = requests.get('https://www.12306.cn',
                       cert=('/path/server.crt',
                             '/path/key'))
print(respone.status_code)