Web Scraping Basics

Web scraping

The requests module

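Purpose: send HTTP requests from code
'''
urllib is a built-in module that can also send HTTP requests, but its API is cumbersome.
Overview: requests lets you imitate the requests a browser sends. Compared with urllib, its API is far more convenient (under the hood it wraps urllib3).
'''
Note: requests only downloads the page content; it does not execute JavaScript. For content that is loaded by JS, you have to analyze the target site yourself and send further requests.

Install: pip install requests

requests only sends requests; it cannot parse HTML. The same author later wrote the requests-html module, which both sends requests and parses the result:
	roughly, requests + lxml = requests-html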
    
'''
Basic usage:
	import requests
	
	response = requests.get('https://www.cnblogs.com/')
	
	# response wraps the HTTP response as an object; everything the server returned is available on it
	print(response.text)  # the response body as text
'''

Passing GET parameters

Option 1: put them directly in the URL

response = requests.get('https://www.cnblogs.com/?name=春游去动物园&age=20')
print(response.text)

Option 2: use the params argument

response = requests.get('https://www.cnblogs.com/', params={'name': '春游去动物园', 'age': 20})
print(response.text)
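
To see what the params argument produces without sending anything over the network, a request can be prepared locally. This is only a minimal sketch: build_url is a hypothetical helper name, not part of requests.

```python
import requests
from urllib.parse import urlsplit, parse_qs


def build_url(base_url, params):
    """Return the final URL requests would use for a GET with these params."""
    # Request(...).prepare() builds the request locally; nothing is sent.
    prepared = requests.Request('GET', base_url, params=params).prepare()
    return prepared.url


url = build_url('https://www.cnblogs.com/', {'name': '春游去动物园', 'age': 20})
print(url)  # the Chinese value appears percent-encoded in the query string

# Decoding the query string recovers the original values (as strings).
query = parse_qs(urlsplit(url).query)
print(query)
```

Both options end up sending the same kind of URL; params simply takes care of the encoding.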

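Chinese characters in a URL involve encoding and decoding

For example, type https://www.cnblogs.com/?name=春游去动物园&age=20 into the browser's address bar, then copy it back out, and it has changed: https://www.cnblogs.com/?name=%E6%98%A5%E6%B8%B8%E5%8E%BB%E5%8A%A8%E7%89%A9%E5%9B%AD&age=20

The non-ASCII characters have been percent-encoded, so our own code sometimes needs to encode or decode such values.
'''
Use the built-in urllib module
'''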

Encoding: parse.quote

from urllib import parse
res = parse.quote('春游去动物园')
print(res)
Result: %E6%98%A5%E6%B8%B8%E5%8E%BB%E5%8A%A8%E7%89%A9%E5%9B%AD

Decoding: parse.unquote

from urllib import parse
res = parse.unquote('%E6%98%A5%E6%B8%B8%E5%8E%BB%E5%8A%A8%E7%89%A9%E5%9B%AD')
print(res)
Result: 春游去动物园
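
quote and unquote handle a single value. For a whole query string, urllib.parse also offers urlencode and parse_qs. The sketch below uses only the standard library; the sample strings are made up for illustration.

```python
from urllib.parse import quote, quote_plus, unquote, urlencode, parse_qs

# quote() keeps '/' by default and encodes a space as %20.
print(quote('a/b c'))
# quote_plus() also encodes '/' and turns a space into '+', as HTML forms do.
print(quote_plus('a/b c'))

# urlencode() builds a whole query string from a dict.
query = urlencode({'name': '春游去动物园', 'age': 20})
print(query)

# parse_qs() reverses it; every value comes back as a list of strings.
decoded = parse_qs(query)
print(decoded)

# Encoding followed by decoding returns the original text.
round_trip = unquote(quote('春游去动物园'))
print(round_trip)
```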

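Sending request headers

Common request headers
	# Requests usually need headers; they are the key to making the client look like a browser. The most useful ones are:
    Host
    Referer  # large sites often use this to check where a request came from
    User-Agent  # identifies the client
    Cookie  # cookies travel in the request headers, but requests has a dedicated cookies argument for them, so they usually do not need to go in headers={}
    
 python:
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
    }
    res = requests.get('https://dig.chouti.com/', headers=header)
    print(res.text)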
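
Why the User-Agent matters becomes clearer when looking at what requests sends by default. The sketch below prepares a request locally (nothing is sent) and shows how custom headers are merged with the defaults; the Referer value is an assumption added for illustration.

```python
import requests

# The headers requests sends when none are supplied.
defaults = requests.utils.default_headers()
print(defaults['User-Agent'])  # starts with "python-requests/"

browser_headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
    'Referer': 'https://dig.chouti.com/',  # illustrative value
}

# Preparing through a session merges the defaults with the per-request
# headers, without sending the request.
session = requests.Session()
prepared = session.prepare_request(
    requests.Request('GET', 'https://dig.chouti.com/', headers=browser_headers)
)
print(prepared.headers['User-Agent'])       # the browser-like value wins
print(prepared.headers['Accept-Encoding'])  # still filled in from the defaults
```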

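Sending cookies

Cookies hold the login state. A request that carries them is treated as logged in and can do things that require a login.

Option 1: send them in the request headers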

 headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (K。。。。。。。。。。。。。。。。。。。。。。4492.400',
        # without this cookie the personal home page cannot be opened
        'Cookie': 'bid=CoFnyYaHvxQ; douban-fav-remind=1; __yadk_uid=MGLPA6D5bDkj8JypPNYfwDtVVqUUC1nu; ll="118281"; push_doumail_num=0; __utmv=30149280.22172; gr_user_id=59b923b6-172f-4e68-a7f1-4a2facfb25c7; _ga=GA1.1.843993488.1628860359; _ga_RXNMP372GL=GS1.1.1628860359.1.0.1628860362.0; push_noty_num=0; __gads=ID=a6e23bf0b365acc8-22f68e9c26cb0。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。6343244.1631934791.1631946579.12; __utmz=30149280.1631946579.12.11.utmcsr=movie.douban.com|utmccn=(referral)|utmcmd=referral|utmcct=/typerank; __utmt=1; dbcl2="221725636:8nlW+/9/z7g"; ck=dF24; ap_v=0,6.0; _pk_id.100001.8cb4=ee514d0cfc78c7ff.1626343189.12.1631946979.1631437922.; __utmb=30149280.6.10.1631946579'
    }

    # Send the request (url is the address of a page that requires login)
    res = requests.get(url, headers=headers)
    print(res.content.decode())

Option 2: use the cookies argument. After a successful login the response carries the cookies as a CookieJar object, which can be passed along directly.

1. Log in first (this is not a real login endpoint)
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
    }
    res = requests.post('https://dig.chouti.com/login', headers=header, data={
        'username': '春游去动物园',
        'password': '123456'
    })
    print(res.cookies)
    '''
    	res.cookies returns a RequestsCookieJar object that holds the cookie information.
    	
    	The login uses a POST request; the request body fields go in data.
    '''
    
2. Send a GET request with the cookies
Option 1: convert the cookies to a dict and pass it to the cookies argument
	'''
	requests.utils.dict_from_cookiejar() converts the returned cookies to a dict
	'''
    cookie = requests.utils.dict_from_cookiejar(res.cookies)
    
    response = requests.get('https://dig.chouti.com/', headers=header, cookies=cookie)
    
Option 2: build a RequestsCookieJar object and pass it
    cookie_jar = requests.cookies.RequestsCookieJar()
    cookie_jar.update(res.cookies)  # copy the login cookies into the new jar
    response = requests.get('https://dig.chouti.com/', headers=header, cookies=cookie_jar)


Option 3: pass res.cookies directly
    response = requests.get('https://dig.chouti.com/', headers=header, cookies=res.cookies)
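
The conversions between a cookie jar and a dict can be tried without logging in anywhere. In this sketch the cookie name, value and domain are made up, and the request is only prepared locally, never sent.

```python
import requests

# Build a jar by hand; the cookie name, value and domain are illustrative.
jar = requests.cookies.RequestsCookieJar()
jar.set('token', 'abc123', domain='dig.chouti.com', path='/')

# CookieJar -> dict
as_dict = requests.utils.dict_from_cookiejar(jar)
print(as_dict)

# dict -> CookieJar
rebuilt = requests.utils.cookiejar_from_dict(as_dict)
print(rebuilt.get('token'))

# Prepare (but do not send) a request to see the resulting Cookie header.
prepared = requests.Request(
    'GET', 'https://dig.chouti.com/', cookies=as_dict
).prepare()
print(prepared.headers['Cookie'])
```

Whichever form is passed to the cookies argument, the server receives the same Cookie header.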

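POST requests

'''
	In the browser, press F12, select a request and look at its Payload tab: it lists the request body fields. When reproducing the request with requests, include every one of them.
'''
Request body formats
	form-data, urlencoded (the default), json

    
res = requests.post('http://www.aa7a.cn/user.php', data={
    'username': '@qq.com',
    'password': '',
    'captcha': 'aaaa',
    'remember': 1,
    'ref': 'http://www.aa7a.cn/user.php?act=logout',
    'act': 'act_login',
})
print(res.text)
print(res.cookies)  # cookies set by a successful login; a RequestsCookieJar, which behaves like a dict

res1 = requests.get('http://www.aa7a.cn/', cookies=res.cookies)  # carry the cookies to a page that requires login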

POST requests with a JSON-encoded body

res = requests.post('xxx', json={})
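
The difference between data= and json= is easiest to see by preparing both requests locally and comparing what would be sent. This is a sketch with made-up field values; nothing is sent to the server.

```python
import json
import requests
from urllib.parse import parse_qs

payload = {'username': 'demo', 'password': 'secret'}  # made-up values
url = 'http://www.aa7a.cn/user.php'

# data=... is sent as a URL-encoded form.
form_req = requests.Request('POST', url, data=payload).prepare()
print(form_req.headers['Content-Type'])
print(form_req.body)

# json=... is serialized to JSON and labeled accordingly.
json_req = requests.Request('POST', url, json=payload).prepare()
print(json_req.headers['Content-Type'])
print(json_req.body)
```

Which one to use depends on what the Payload tab shows for the real request: form fields call for data=, a JSON document calls for json=.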

requests.session()

'''
	A requests session object keeps certain parameters across requests, and it also keeps cookies across all requests made from the same session instance.
'''
Using requests.session: cookies are maintained automatically for the whole exchange

session = requests.session()
# send the login request through the session
session.post('http://www.aa7a.cn/user.php', data={
    'username': '@qq.com',
    'password': '',
    'captcha': 'aaaa',
    'remember': 1,
    'ref': 'http://www.aa7a.cn/user.php?act=logout',
    'act': 'act_login',
})
# once logged in, call the pages that require a login
res1 = session.get('http://www.aa7a.cn/')
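
What "maintained automatically" means can be sketched offline: state stored on the session is merged into every request made through it. Here the cookie is set by hand with a made-up name and value; in a real login the server would set it through a Set-Cookie response header.

```python
import requests

session = requests.Session()  # requests.session() returns the same kind of object

# Settings stored on the session apply to every request made through it.
session.headers.update({'User-Agent': 'Mozilla/5.0'})
# Stand-in for a cookie the server would normally set during login.
session.cookies.set('sessionid', 'demo-value')

# prepare_request() merges session state into the request without sending it.
prepared = session.prepare_request(
    requests.Request('GET', 'http://www.aa7a.cn/')
)
print(prepared.headers['User-Agent'])
print(prepared.headers['Cookie'])
```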

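Response attributes

Attributes and methods of the response object (requests wraps the HTTP response in a Response object)

 response = requests.get('http://www.autohome.com/news')
 print(response.text)  # response body as a string
 print(response.content)  # response body as bytes, e.g. for videos and images
 print(response.status_code)  # HTTP status code
 print(response.headers)  # response headers
 print(response.cookies)  # cookies set by the response
 print(response.cookies.get_dict())  # cookies as a dict
 print(response.cookies.items())  # cookies as (name, value) pairs
 print(response.url)  # final URL of the response, after any redirects
 print(response.history)  # list of the earlier responses if the request was redirected
 print(response.encoding)  # encoding used to decode response.text
    
 response.iter_content()  # iterate over the body in chunks; useful for videos and images
 '''
 res = requests.get('https://video.pearvideo.com/mp4/adshort/20220427/cont-1760318-15870165_adpkg-ad_hd.mp4', stream=True)  # stream=True defers downloading the body
 with open('致命诱惑3.mp4', 'wb') as f:
     # f.write(res.content)  # writes everything at once; not ideal for large files
     for chunk in res.iter_content(chunk_size=1024):  # write 1024 bytes at a time
         f.write(chunk)
 '''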

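Encoding issues

Most sites use UTF-8; older Chinese sites may use gbk or gb2312.
requests picks the encoding from the response headers. If that choice does not match the encoding the site really uses, set the right one yourself:
response = requests.get('http://www.autohome.com/news')
response.encoding = 'gbk'
print(response.text)  # with the wrong encoding, Chinese text comes out garbled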
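
Setting response.encoding only changes which codec turns the raw bytes (response.content) into response.text. The effect of a mismatch can be reproduced with plain Python; the sample string below is made up for illustration.

```python
text = '汽车之家'
raw = text.encode('gbk')  # the bytes a GBK-encoded site would send

# Decoding with the matching encoding restores the text.
restored = raw.decode('gbk')
print(restored)

# Decoding with the wrong encoding garbles it;
# errors='replace' substitutes bad bytes instead of raising an exception.
garbled = raw.decode('utf-8', errors='replace')
print(garbled)
```

When the correct encoding is unknown, requests also exposes response.apparent_encoding, a guess based on the body, which can be assigned to response.encoding.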

Getting binary data

response.content
response.iter_content(chunk_size=1024)
res = requests.get('https://gd-hbimg.huaban.com/e1abf47cecfe5848afc2a4a8fd2e0df1c272637f2825b-e3lVMF_fw658', stream=True)
with open('美女.png', 'wb') as f:
    # f.write(res.content)  # writes everything at once; not ideal for large files
    # the approach below is recommended
    for chunk in res.iter_content(chunk_size=1024):  # write 1024 bytes at a time
        f.write(chunk)
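
The chunked download can be wrapped in a small reusable function. This is a sketch under assumptions: write_chunks and download are hypothetical helper names, and the timeout and chunk size are arbitrary defaults. The last lines exercise the writing logic offline with made-up chunks.

```python
import os
import tempfile

import requests


def write_chunks(chunks, path):
    """Write an iterable of byte chunks to path and return the byte count."""
    total = 0
    with open(path, 'wb') as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                f.write(chunk)
                total += len(chunk)
    return total


def download(url, path, chunk_size=1024, timeout=10):
    """Stream url to path so the whole body is never held in memory."""
    with requests.get(url, stream=True, timeout=timeout) as response:
        response.raise_for_status()  # fail early on 4xx/5xx responses
        return write_chunks(response.iter_content(chunk_size=chunk_size), path)


# Offline check of the writing logic with made-up chunks.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.bin')
written = write_chunks([b'abc', b'', b'defg'], demo_path)
print(written)
```

Without stream=True, requests reads the entire body before returning, so iterating in chunks would no longer save memory.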

Parsing JSON

The convenient way
 res = requests.post('http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword', data={
     'cname': '',
     'pid': '',
     'keyword': '北京',
     'pageIndex': 1,
     'pageSize': 10,
 })
 print(type(res.json()))  # type of the parsed result; a JSON object becomes a dict
 print(res.json())  # the parsed data itself

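Advanced usage: certificate verification

 Skipping verification, or sending a client certificate manually
import requests
# skip certificate verification and access the site directly
response = requests.get('https://www.12306.cn', verify=False)
# send a client certificate
response = requests.get('https://www.12306.cn',
                        cert=('/path/server.crt',
                              '/path/key'))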

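Proxies

import requests
proxies = {
    # keys are matched against the scheme of the target URL, so an https:// address needs an 'https' entry
    'http': 'http://112.14.47.6:52024',
    'https': 'http://112.14.47.6:52024',
}
# 180.164.66.7
response = requests.get('https://www.cnblogs.com/', proxies=proxies)
print(response.status_code)
    
The request goes to the target address through the proxy, and the response comes back through the proxy.
'''
	The proxy is what actually sends the request to the target.
'''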
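
requests chooses a proxy by the scheme of the target URL, so an 'http' entry alone is not used for an https:// address. The sketch below checks this offline with requests.utils.select_proxy, the helper requests uses for that lookup; the proxy address is the one from the example and is not assumed to be reachable.

```python
import requests

target = 'https://www.cnblogs.com/'

only_http = {'http': 'http://112.14.47.6:52024'}
both = {
    'http': 'http://112.14.47.6:52024',
    'https': 'http://112.14.47.6:52024',
}

# select_proxy() returns the proxy that matches the URL, or None.
chosen_with_http_only = requests.utils.select_proxy(target, only_http)
chosen_with_both = requests.utils.select_proxy(target, both)

print(chosen_with_http_only)  # None: there is no entry for https
print(chosen_with_both)
```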

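Timeouts, authentication, exceptions, file uploads

1. Timeouts
    import requests
    response = requests.get('https://www.baidu.com', timeout=0.0001)  # in seconds; a value this small is certain to time out
    
    
2. Exception handling
import requests
from requests.exceptions import ConnectionError, ReadTimeout, Timeout

try:
    r = requests.get('http://www.baidu.com', timeout=0.00001)
# More specific handlers can be enabled instead of the catch-all below:
# except ReadTimeout:  # connected, but the server was too slow to respond
#     print('read timeout')
# except ConnectionError:  # network unreachable
#     print('connection error')
# except Timeout:  # any timeout
#     print('timeout')
except Exception:
    print('request failed')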
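
The order of the except clauses matters because the requests exceptions form a hierarchy. The sketch below prints the relevant relationships and shows one possible ordering; fetch is a hypothetical helper and the timeout values are arbitrary.

```python
import requests
from requests.exceptions import (
    ConnectionError, ConnectTimeout, ReadTimeout, RequestException, Timeout,
)

# ConnectTimeout is both a ConnectionError and a Timeout,
# so whichever except clause comes first will catch it.
print(issubclass(ConnectTimeout, ConnectionError))
print(issubclass(ConnectTimeout, Timeout))
print(issubclass(ReadTimeout, Timeout))
print(issubclass(Timeout, RequestException))


def fetch(url, timeout=(3, 10)):
    """Return the body text, or None if the request fails.

    timeout is (connect seconds, read seconds).
    """
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx into an exception
        return response.text
    except Timeout:  # listed first so connect timeouts are reported as timeouts
        print('timed out:', url)
    except ConnectionError:
        print('network problem:', url)
    except RequestException as exc:  # base class of all requests errors
        print('request failed:', exc)
    return None
```

Catching RequestException last keeps unrelated bugs, such as a NameError, from being silently swallowed the way a bare except Exception would.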
        
        
3. Uploading files
import requests
with open('a.jpg', 'rb') as f:  # the with block closes the file after the upload
    response = requests.post('http://httpbin.org/post', files={'file': f})
print(response.status_code)
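
The files argument makes requests build a multipart/form-data body. The sketch below prepares such a request locally to show what would be sent; the file bytes and content type are made up, and nothing is uploaded.

```python
import requests

# A file can be given as (filename, content, content type).
files = {'file': ('a.jpg', b'fake image bytes', 'image/jpeg')}

prepared = requests.Request(
    'POST', 'http://httpbin.org/post', files=files
).prepare()

print(prepared.headers['Content-Type'])  # multipart/form-data with a boundary
print(b'filename="a.jpg"' in prepared.body)
```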

posted @ 2022-07-29 20:38 春游去动物园
