Web scraping: GET requests, POST requests, request headers, request body, simulating login with GET/POST, Response object attributes, parsing JSON

Web scraping:

A crawler is essentially about simulating HTTP requests (whatever the browser carries, we carry too): the server responds and returns data, we then clean the data, i.e. pick out the parts we need, and finally store it in a database.

Crawler protocol: robots.txt

Examples:

https://www.baidu.com/robots.txt
https://www.cnblogs.com/robots.txt
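
For instance, you can fetch and read a site's robots.txt directly, a small sketch using the requests module introduced below:

import requests

# robots.txt lists the paths the site allows or disallows for crawlers
res = requests.get('https://www.cnblogs.com/robots.txt')
print(res.text)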

requests

Introduction: requests lets you simulate browser requests; compared with the older urllib, its API is more convenient (it is essentially a wrapper around urllib).

Install the module:

pip3 install requests

Note: after requests downloads the page content, it does not execute any JavaScript; if you need data rendered by JS, you have to analyze the target site yourself and send additional requests.

Various request methods; the most commonly used are requests.get() and requests.post():

>>> import requests
>>> r = requests.get('https://api.github.com/events')
>>> r = requests.post('http://httpbin.org/post', data = {'key':'value'})
>>> r = requests.put('http://httpbin.org/put', data = {'key':'value'})
>>> r = requests.delete('http://httpbin.org/delete')
>>> r = requests.head('http://httpbin.org/get')
>>> r = requests.options('http://httpbin.org/get')

Usage:

Simulating a GET request in PyCharm.

# import the module
import requests

res = requests.get("https://www.cnblogs.com/xiejunjie8888/")
# <Response [200]>
print(res)
# response body content; .text returns it as text
print(res.text)

This is the raw HTML, before the browser has rendered it.


Passing parameters in a GET URL

params

# equivalent to: https://www.cnblogs.com/liuqingzheng/p/16005866.html?name=junjie&age=18
res = requests.get("https://www.cnblogs.com/liuqingzheng/p/16005866.html",
                   params={'name':'junjie','age':18})
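
requests appends and URL-encodes the params for you; res.url shows the URL that was actually requested:

# https://www.cnblogs.com/liuqingzheng/p/16005866.html?name=junjie&age=18
print(res.url)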

URL encoding and decoding. This usually comes up with query keywords: if the keyword contains Chinese characters or other special symbols, it has to be URL-encoded.

Example:

Search Baidu for 刘亦菲. Note: the value of the wd= parameter is the query keyword we typed.


If you copy the URL, you will see that it differs from what was typed: the URL is in encoded form, and decoding it recovers the keyword.

from urllib.parse import quote,unquote

a = "%E5%88%98%E4%BA%A6%E8%8F%B2"
b = unquote(a)
# 刘亦菲
print(b)

Import the module. Note: quote/unquote handle the percent-encoding and decoding of non-ASCII text (such as Chinese) and special characters:

from urllib.parse import quote,unquote

Example 1:

# encode
a = "俊杰"
b = quote(a)
# %E4%BF%8A%E6%9D%B0
print(b)

# decode
c = "%E4%BF%8A%E6%9D%B0"
d = unquote(c)
# 俊杰
print(d)

Example 2:

Import the module:

from urllib.parse import urlencode
res = {'name':'俊杰','age':18}
p = urlencode(res)
# name=%E4%BF%8A%E6%9D%B0&age=18
print(p)

URL encoding and decoding like this is what happens inside HTTP requests.

Example:

from urllib.parse import urlencode

wd = 'junjie老师'
# encoding='utf-8' is the default and can be omitted
encode_res = urlencode({'k': wd}, encoding='utf-8')
# k=junjie%E8%80%81%E5%B8%88
print(encode_res)
keyword = encode_res.split('=')[1]
# junjie%E8%80%81%E5%B8%88
print(keyword)

# then splice the encoded keyword into the URL
url = 'https://www.baidu.com/s?wd=%s' % keyword
# https://www.baidu.com/s?wd=junjie%E8%80%81%E5%B8%88
print(url)

The example above can also be done directly with the params argument, as in the sketch below.
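
A minimal sketch of the same idea: pass the raw keyword and let requests handle the encoding (requests.Request(...).prepare() is used here only to show the final URL without actually sending anything):

import requests

# build (but do not send) the request, just to inspect the encoded URL
req = requests.Request('GET', 'https://www.baidu.com/s', params={'wd': 'junjie老师'}).prepare()
# https://www.baidu.com/s?wd=junjie%E8%80%81%E5%B8%88
print(req.url)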

Carrying request headers: GET requests with parameters --> headers

We usually need to send request headers along with the request; the headers are the key to disguising ourselves as a browser. Commonly useful headers include:

  • Host
  • Referer --> large sites often use this to check where the request came from
  • User-Agent --> identifies the client
  • Cookie --> the cookie is technically part of the request headers, but requests has a dedicated parameter for it, so don't put it inside headers={}

User-Agent --> example, requesting cnblogs:

import requests

header = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36",
}
res = requests.get("https://www.cnblogs.com/xiejunjie8888/", headers=header)
with open('bokeyuan.html', 'wb') as f:
    # content is the raw bytes; open bokeyuan.html locally (it will have no CSS styling)
    f.write(res.content)
print(res.text)

If the request does not succeed, add more fields to the request headers (see the sketch below).
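
A minimal sketch of a fuller header set; which fields are actually required depends on the target site, and the values below are only illustrative:

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36",
    # some sites check where the request came from
    "Referer": "https://www.cnblogs.com/",
    # content types the client claims to accept
    "Accept": "text/html,application/xhtml+xml",
}
res = requests.get("https://www.cnblogs.com/xiejunjie8888/", headers=headers)
print(res.status_code)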

Carrying cookies

The cookie is itself a value inside the request headers, so it can simply be placed in the headers; but because cookies are used so often, requests also accepts them as a separate parameter.

Example: simulating an upvote by carrying a cookie.

In the browser's developer tools you can see that upvoting actually sends a POST request to this address.


Now upvote it with code:

import requests

header = {
    "User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36",
    "Cookie":"deviceId=web.eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJqaWQiOiI5OTc4MTE0ZS1jZDRlLTQ0YWEtOTMzMS04YmJjZTU2YTljZmMiLCJleHBpcmUiOiIxNjU0NDI3NjIzMjI4In0.RDFMfFm9PlMQQbXB91knucnn2ULfL-TO1ymLhbbmjWQ; Hm_lvt_03b2668f8e8699e91d479d62bc7630f1=1651835624; __snaker__id=r9q6WZAENXErIczY; gdxidpyhxdE=cev%2B4O872OVnz6lamLJyWUeS5ff8V%5CIO4KTlb9MdTtDaY4UaEUKBEMaU6Q7waoeh6n0I8Sr4Q8KCOJuVjKeocEdcsTOPO3N6xsbNbRHBwNo4E3YQfYlcKQHaM73%2FdmJRb1nB0nti1kWIMW7LAvMiEhAD1Iw5%5CTGvOtKfQOOzV0SD8ICy%3A1651836524953; _9755xjdesxxd_=32; YD00000980905869%3AWM_NI=TjVihGAe%2BWLNC2lHcJxCYV6n5OePF1B%2BvEXmc%2Fw51U3TShYaeSW9tfEgy%2Ft3nBch0Q6Jk2pjnyWKH9RiWrxooc00SkNLu0Z2JOeWH4hd9O0B7iZHjWE7Tl0i%2FT0fvaYGdEo%3D; YD00000980905869%3AWM_NIKE=9ca17ae2e6ffcda170e2e6ee89cf79b0f1ffa7d37ef69a8ea3d85e868e9fb0d54aa8bf8ca3ea79aeaea6d5c42af0fea7c3b92a98b69a8aeb4d83bbe5b8cf4681eea4a9e242bc9d97b7b566f791abb1c53b82f5baa6db67a1f1b9d6b45ff18c9d89dc6ba78d99daf76e87b9b7d0f739ae9ead89b35bf696a5d3f362aea799accb3392b79993fb65adbd00d1d2618e94bbd1d93a8a91abb8cb40acbcf7b0b34af5b796a9e95d87e99c8db363ed8c0087f56b9b99aca6ee37e2a3; YD00000980905869%3AWM_TID=WY%2BAISARxQxFURREVAPAEIOHA0p2qn9B; token=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJqaWQiOiJjdHVfNjUxODM1NjAyMTEiLCJleHBpcmUiOiIxNjU0NDI3NjM1NjM5In0.5RDSo94xZigmibnPGic5g1RdYVNmn5avmj3y2bc1s0M; Hm_lpvt_03b2668f8e8699e91d479d62bc7630f1=1651835636"
}
data = {
    # request body
    "linkId": "34939488"
}
res = requests.post("https://dig.chouti.com/link/vote/",headers=header,data=data)

print(res.text)


Refresh the page and you can see the upvote went through. If you delete the cookie and send the POST request again, the response says you need to log in.

Cookies can also be passed as a keyword argument, carried in requests.post:

import requests

header = {
    "User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36",
    # "Cookie":"deviceId=web.eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJqaWQiOiI5OTc4MTE0ZS1jZDRlLTQ0YWEtOTMzMS04YmJjZTU2YTljZmMiLCJleHBpcmUiOiIxNjU0NDI3NjIzMjI4In0.RDFMfFm9PlMQQbXB91knucnn2ULfL-TO1ymLhbbmjWQ; Hm_lvt_03b2668f8e8699e91d479d62bc7630f1=1651835624; __snaker__id=r9q6WZAENXErIczY; gdxidpyhxdE=cev%2B4O872OVnz6lamLJyWUeS5ff8V%5CIO4KTlb9MdTtDaY4UaEUKBEMaU6Q7waoeh6n0I8Sr4Q8KCOJuVjKeocEdcsTOPO3N6xsbNbRHBwNo4E3YQfYlcKQHaM73%2FdmJRb1nB0nti1kWIMW7LAvMiEhAD1Iw5%5CTGvOtKfQOOzV0SD8ICy%3A1651836524953; _9755xjdesxxd_=32; YD00000980905869%3AWM_NI=TjVihGAe%2BWLNC2lHcJxCYV6n5OePF1B%2BvEXmc%2Fw51U3TShYaeSW9tfEgy%2Ft3nBch0Q6Jk2pjnyWKH9RiWrxooc00SkNLu0Z2JOeWH4hd9O0B7iZHjWE7Tl0i%2FT0fvaYGdEo%3D; YD00000980905869%3AWM_NIKE=9ca17ae2e6ffcda170e2e6ee89cf79b0f1ffa7d37ef69a8ea3d85e868e9fb0d54aa8bf8ca3ea79aeaea6d5c42af0fea7c3b92a98b69a8aeb4d83bbe5b8cf4681eea4a9e242bc9d97b7b566f791abb1c53b82f5baa6db67a1f1b9d6b45ff18c9d89dc6ba78d99daf76e87b9b7d0f739ae9ead89b35bf696a5d3f362aea799accb3392b79993fb65adbd00d1d2618e94bbd1d93a8a91abb8cb40acbcf7b0b34af5b796a9e95d87e99c8db363ed8c0087f56b9b99aca6ee37e2a3; YD00000980905869%3AWM_TID=WY%2BAISARxQxFURREVAPAEIOHA0p2qn9B; token=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJqaWQiOiJjdHVfNjUxODM1NjAyMTEiLCJleHBpcmUiOiIxNjU0NDI3NjM1NjM5In0.5RDSo94xZigmibnPGic5g1RdYVNmn5avmj3y2bc1s0M; Hm_lpvt_03b2668f8e8699e91d479d62bc7630f1=1651835636"
}
data = {
    "linkId": "34939488"
}
res = requests.post("https://dig.chouti.com/link/vote/",headers=header,data=data,
                    cookies={"Cookie":"deviceId=web.eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJqaWQiOiI5OTc4MTE0ZS1jZDRlLTQ0YWEtOTMzMS04YmJjZTU2YTljZmMiLCJleHBpcmUiOiIxNjU0NDI3NjIzMjI4In0.RDFMfFm9PlMQQbXB91knucnn2ULfL-TO1ymLhbbmjWQ; Hm_lvt_03b2668f8e8699e91d479d62bc7630f1=1651835624; __snaker__id=r9q6WZAENXErIczY; gdxidpyhxdE=cev%2B4O872OVnz6lamLJyWUeS5ff8V%5CIO4KTlb9MdTtDaY4UaEUKBEMaU6Q7waoeh6n0I8Sr4Q8KCOJuVjKeocEdcsTOPO3N6xsbNbRHBwNo4E3YQfYlcKQHaM73%2FdmJRb1nB0nti1kWIMW7LAvMiEhAD1Iw5%5CTGvOtKfQOOzV0SD8ICy%3A1651836524953; _9755xjdesxxd_=32; YD00000980905869%3AWM_NI=TjVihGAe%2BWLNC2lHcJxCYV6n5OePF1B%2BvEXmc%2Fw51U3TShYaeSW9tfEgy%2Ft3nBch0Q6Jk2pjnyWKH9RiWrxooc00SkNLu0Z2JOeWH4hd9O0B7iZHjWE7Tl0i%2FT0fvaYGdEo%3D; YD00000980905869%3AWM_NIKE=9ca17ae2e6ffcda170e2e6ee89cf79b0f1ffa7d37ef69a8ea3d85e868e9fb0d54aa8bf8ca3ea79aeaea6d5c42af0fea7c3b92a98b69a8aeb4d83bbe5b8cf4681eea4a9e242bc9d97b7b566f791abb1c53b82f5baa6db67a1f1b9d6b45ff18c9d89dc6ba78d99daf76e87b9b7d0f739ae9ead89b35bf696a5d3f362aea799accb3392b79993fb65adbd00d1d2618e94bbd1d93a8a91abb8cb40acbcf7b0b34af5b796a9e95d87e99c8db363ed8c0087f56b9b99aca6ee37e2a3; YD00000980905869%3AWM_TID=WY%2BAISARxQxFURREVAPAEIOHA0p2qn9B; token=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJqaWQiOiJjdHVfNjUxODM1NjAyMTEiLCJleHBpcmUiOiIxNjU0NDI3NjM1NjM5In0.5RDSo94xZigmibnPGic5g1RdYVNmn5avmj3y2bc1s0M; Hm_lpvt_03b2668f8e8699e91d479d62bc7630f1=1651835636"
})

print(res.text)


Simulating login with a POST request

Take 华华手机 as an example. Site address: http://www.aa7a.cn/user.php?&ref=http%3A%2F%2Fwww.aa7a.cn%2F

When we click Log in, we can see that a POST request is sent to http://www.aa7a.cn/user.php.


Inspecting the request body shows the form fields being submitted:


Simulate the login in code:

import requests

data = {
    'username': 'ee@qq.com',
    'password': 'lqz',
    'captcha': '3j5c',
    'remember': 1,
    'ref': 'http://www.aa7a.cn/',
    'act': 'act_login',
}
res = requests.post("http://www.aa7a.cn/user.php", data=data,)

print(res.text)

# cookies returned after a successful login; they can be used to act as the logged-in user
'''
{"error":0,"ref":"http://www.aa7a.cn/"}
<RequestsCookieJar[<Cookie ECS[password]=4a5e6ce9d1aba9de9b31abdf303bbdc2 for www.aa7a.cn/>, 
<Cookie ECS[user_id]=61399 for www.aa7a.cn/>, <Cookie ECS[username]=616564099%40qq.com for www.aa7a.cn/>, 
<Cookie ECS[visit_times]=1 for www.aa7a.cn/>, <Cookie ECS_ID=aeb0d6ccd56197a1d1f7dbc0d3ddc8976fd955c6 for www.aa7a.cn/>]>
'''
print(res.cookies)

# False
res2 = requests.get("http://www.aa7a.cn/")
print("616564099@qq.com" in res2.text)

# True
res3 = requests.get("http://www.aa7a.cn/",cookies=res.cookies)
# with the cookies attached, check whether the account name appears in the page
# the logged-in HTML page (without CSS)
print(res3.text)
# check whether the account name is in the logged-in page
print("616564099@qq.com" in res3.text)

However, carrying the cookie by hand on every request is tedious; requests provides a Session object that handles this for us.

# send all subsequent requests through the session object; no manual cookie handling needed
session = requests.session()
data = {
    'username': 'ee@qq.com',
    'password': 'lqz',
    'captcha': '3j5c',
    'remember': 1,
    'ref': 'http://www.aa7a.cn/',
    'act': 'act_login',
}
res = session.post("http://www.aa7a.cn/user.php", data=data,)

print(res.text)

# cookies returned after a successful login; they can be used to act as the logged-in user
'''
{"error":0,"ref":"http://www.aa7a.cn/"}
<RequestsCookieJar[<Cookie ECS[password]=4a5e6ce9d1aba9de9b31abdf303bbdc2 for www.aa7a.cn/>, 
<Cookie ECS[user_id]=61399 for www.aa7a.cn/>, <Cookie ECS[username]=616564099%40qq.com for www.aa7a.cn/>, 
<Cookie ECS[visit_times]=1 for www.aa7a.cn/>, <Cookie ECS_ID=aeb0d6ccd56197a1d1f7dbc0d3ddc8976fd955c6 for www.aa7a.cn/>]>
'''
print(res.cookies)

# with a session, cookies are carried automatically, so (assuming the login above succeeded) this should now be True
res2 = session.get("http://www.aa7a.cn/")
print("616564099@qq.com" in res2.text)

# True
# passing cookies explicitly is optional here: the session already carries them
res3 = session.get("http://www.aa7a.cn/", cookies=res.cookies)
# the logged-in HTML page (without CSS)
print(res3.text)
print("616564099@qq.com" in res3.text)

Response object attributes

import requests

response = requests.get('http://www.jianshu.com')

# Response attributes
print(response.text)     # response body as text
print(response.content)  # response body as raw bytes

print(response.status_code)         # status code
print(response.headers)             # response headers
print(response.cookies)             # cookies set by the response
print(response.cookies.get_dict())  # the cookies as a plain dict
print(response.cookies.items())     # the cookies as (name, value) pairs

print(response.url)      # the requested URL
print(response.history)  # (good to know) if there were redirects, a list of the responses before the final one

print(response.encoding)    # encoding of the page, e.g. utf-8 or gbk
# response.iter_content()   # iterate over the binary content in chunks; typically used to save files

Encoding issues (usually not a problem)

import requests

response = requests.get('http://www.autohome.com/news')
# response.encoding = 'gbk'  # autohome's pages are encoded as gb2312, while requests may fall back to ISO-8859-1; without setting gbk the Chinese text comes out garbled
print(response.text)
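
A more general fix is to let requests guess the charset from the page body; a small sketch (apparent_encoding relies on charset detection, which is usually but not always right):

import requests

response = requests.get('http://www.autohome.com/news')
# use the detected charset for .text instead of the default
response.encoding = response.apparent_encoding
print(response.text)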

Getting binary data: content or iter_content

Used for downloading images and videos; see the sketch below.
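
A minimal sketch (the image URL is only a placeholder; any direct link to an image works): small files can be written from res.content in one go, while large files are better streamed with iter_content so they are not loaded into memory all at once:

import requests

# small file: write res.content in one go
res = requests.get('https://www.python.org/static/img/python-logo.png')
with open('logo.png', 'wb') as f:
    f.write(res.content)

# large file: stream it and write chunk by chunk
res = requests.get('https://www.python.org/static/img/python-logo.png', stream=True)
with open('logo_stream.png', 'wb') as f:
    for chunk in res.iter_content(chunk_size=1024):
        f.write(chunk)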

Parsing JSON

Convert the returned JSON data into Python objects so it is easier to work with.

import json
import requests

data = {
    'cname': '',
    'pid': '',
    'keyword': '上海',
    'pageIndex': 1,
    'pageSize': 10,
}
res = requests.post('http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword', data=data)
# parse the JSON text manually with the json module
j = json.loads(res.text)
print(j['Table'][0]['rowcount'])

The method above is clumsy; a simpler way:

import requests

data = {
    'cname': '',
    'pid': '',
    'keyword': '上海',
    'pageIndex': 1,
    'pageSize': 10,
}

# .json() deserializes the response body directly
res = requests.post('http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword', data=data).json()
print(res['Table'][0]['rowcount'])

SSL verification

# Some sites use SSL certificates that cannot be verified; to access them we either
# carry the certificate manually or skip verification.

# skip verification and access directly
import requests
response = requests.get('https://www.12306.cn', verify=False)  # no verification: a warning is printed, but 200 is returned
print(response.status_code)

# carry the certificate manually
import requests
response = requests.get('https://www.12306.cn',
                        cert=('/path/server.crt',
                              '/path/key'))
print(response.status_code)
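
To silence the InsecureRequestWarning that verify=False triggers, a small sketch (urllib3 is the library requests uses underneath):

import urllib3
import requests

# suppress the warning emitted for unverified HTTPS requests
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

response = requests.get('https://www.12306.cn', verify=False)
print(response.status_code)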

Using proxies
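
A minimal sketch of the proxies parameter (the proxy address below is a placeholder; substitute a working proxy):

import requests

# placeholder proxy address; replace with a real proxy
proxies = {
    'http': 'http://127.0.0.1:8888',
    'https': 'http://127.0.0.1:8888',
}
response = requests.get('https://www.cnblogs.com/', proxies=proxies)
print(response.status_code)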
