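Python Web Scraping: requests
Introduction to requests
requests has many advantages over urllib: it handles cookies, login authentication, proxy settings, and similar tasks more conveniently, without the extra work that urllib requires. The sections below cover the requests methods that are used most often.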
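Basic usage
The get() method requests a web page, performing the same operation as urlopen() in the urllib library. It returns a Response object; the example prints that object's type and some of its attributes.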
    import requests

    r = requests.get('https://www.baidu.com/')
    print(type(r))
    print(r.status_code)
    print(type(r.text))
    print(r.cookies)

Output:

    # a Response object
    <class 'requests.models.Response'>
    # status code
    200
    # type of the response body
    <class 'str'>
    # cookies
    <RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
GET requests: the URL http://httpbin.org/get can be used to test whether the client issued a GET request. It echoes the request information back in JSON format.
    import requests

    r = requests.get('http://www.httpbin.org/get')
    print(r.text)

Output:

    {
      "args": {},
      "headers": {
        "Accept": "*/*",
        "Accept-Encoding": "gzip, deflate",
        "Connection": "close",
        "Host": "www.httpbin.org",
        "User-Agent": "python-requests/2.19.1"
      },
      "origin": "36.40.49.173",
      "url": "http://www.httpbin.org/get"
    }
Converting the JSON response to a dict
    import requests

    r = requests.get('http://www.httpbin.org/get')
    # r.text is the raw body as a string
    print(type(r.text))
    # r.json() parses the body and returns a dict
    print(type(r.json()))
    print(r.json())

Output:

    <class 'str'>
    <class 'dict'>
    {'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Connection': 'close', 'Host': 'www.httpbin.org', 'User-Agent': 'python-requests/2.19.1'}, 'origin': '36.40.49.173', 'url': 'http://www.httpbin.org/get'}
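To make the str-to-dict step concrete without a network call, here is a minimal sketch that parses a JSON string with the standard-library json module. The sample text is a shortened copy of the httpbin response shown above, and json.loads is used here only as a stand-in that behaves much like r.json(); it is not a claim about how requests implements that method.

```python
import json

# Shortened copy of the httpbin response body shown above (a JSON string).
sample_text = '{"args": {}, "origin": "36.40.49.173", "url": "http://www.httpbin.org/get"}'

# Parsing turns the JSON text into a Python dict, much like r.json() does.
data = json.loads(sample_text)
print(type(data))
print(data["url"])

# A body that is not valid JSON cannot be parsed: json.loads raises
# json.JSONDecodeError, which is a subclass of ValueError.
try:
    json.loads("<html>not json</html>")
    parsed_ok = True
except ValueError:
    parsed_ok = False
print(parsed_ok)
```

Ordinary HTML pages fall into the second case, so it is worth guarding the parsing step when the response format is not known in advance.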
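Using a GET request to scrape a Zhihu page (ordinary web pages are generally not JSON). Header information must be passed in, specifically the User-Agent field; otherwise Zhihu blocks the request.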
    import re

    import requests

    # Present a browser User-Agent; without it Zhihu rejects the request
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0'
    }
    r = requests.get('https://www.zhihu.com/explore', headers=headers)
    # re.S lets '.' match newlines, so the pattern can span several lines
    pattern = re.compile(r'explore-feed.*?question_link.*?>(.*?)</a>', re.S)
    titles = re.findall(pattern, r.text)
    print(titles)

Output:

    ['\n地理这门学科有多有趣?\n', '\n如何看待辩论赛收报名费的现象?\n', '\n如何评价朱一龙?\n', '\n历史上哪些人的名或字比较奇怪?\n', '\n有哪些惊艳到你的句子?\n', '\n英雄联盟中有哪些冷知识?\n', '\n如何评价《海贼王》第925话?\n', '\n怎样看待华晨宇说自己做音乐的天赋占百分之九十九,努力占百分之一?\n', '\n你觉得《三体》中最残忍的一句话是什么?\n', '\n人类有哪些细思恐极的事?\n']
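The extraction step can be tried offline. The sketch below runs the same pattern against a small, made-up HTML snippet (the markup is hypothetical and only loosely modeled on what the pattern expects) and strips the surrounding newlines that appear in the output above.

```python
import re

# Hypothetical markup, loosely modeled on what the pattern above expects.
sample_html = """
<div class="explore-feed">
  <a class="question_link" href="/question/1">
First sample question?
</a>
</div>
<div class="explore-feed">
  <a class="question_link" href="/question/2">
Second sample question?
</a>
</div>
"""

# Non-greedy '.*?' stops at the first possible match; re.S lets '.' cross newlines.
pattern = re.compile(r'explore-feed.*?question_link.*?>(.*?)</a>', re.S)
raw_titles = pattern.findall(sample_html)

# Each captured title still carries the newlines around the link text.
titles = [t.strip() for t in raw_titles]
print(raw_titles)
print(titles)
```

Regular expressions are fragile against markup changes, so a pattern like this may need adjusting whenever the page layout changes.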
Fetching binary data (images, audio, video, and so on) and saving it to a file:
    import requests

    r = requests.get('https://github.com/favicon.ico')
    # r.text decodes the bytes as text, so the result is garbled
    print(r.text)
    # r.content returns the raw binary data
    print(r.content)

Partial output:

    �������O L������
    b'\x00\x00\x01\x00\x02\x00\x10\x10\x00\x00\x01\x00 \x00(\x05\x00\x00&\x00\x00\x00 \x00\x00\x01\x00

Saving the image to a file:

    import requests

    r = requests.get('https://www.github.com/favicon.ico')
    # open in binary mode ('wb') because r.content is bytes
    with open('favicon.ico', 'wb') as f:
        f.write(r.content)
POST requests with requests:
    import requests

    # form fields to send in the request body
    data = {
        'name': 'getmey',
        'age': 22
    }
    r = requests.post('http://httpbin.org/post', data=data)
    print(r.text)

Output:

    {
      "args": {},
      "data": "",
      "files": {},
      "form": {
        "age": "22",
        "name": "getmey"
      },
      "headers": {
        "Accept": "*/*",
        "Accept-Encoding": "gzip, deflate",
        "Connection": "close",
        "Content-Length": "18",
        "Content-Type": "application/x-www-form-urlencoded",
        "Host": "httpbin.org",
        "User-Agent": "python-requests/2.19.1"
      },
      "json": null,
      "origin": "36.40.49.173",
      "url": "http://httpbin.org/post"
    }
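The Content-Type and Content-Length in that response follow from how the dict passed as data is form-encoded. As a rough illustration, the sketch below encodes the same dict with the standard library's urlencode; treating urlencode as equivalent to what requests sends is an assumption made for this example.

```python
from urllib.parse import urlencode

# Same form fields as in the POST example above.
data = {'name': 'getmey', 'age': 22}

# application/x-www-form-urlencoded joins key=value pairs with '&'.
body = urlencode(data)
print(body)
print(len(body))
```

The encoded body is 18 characters long, which matches the "Content-Length": "18" header echoed by httpbin.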
Response information:
    import requests

    r = requests.get('http://www.jianshu.com')
    print(type(r.status_code), r.status_code)  # status code
    print(type(r.headers), r.headers)          # response headers
    print(type(r.cookies), r.cookies)          # cookies
    print(type(r.url), r.url)                  # final URL
    print(type(r.history), r.history)          # redirect history

Output:

    <class 'int'> 403
    <class 'requests.structures.CaseInsensitiveDict'> {'Date': 'Mon, 24 Dec 2018 02:36:55 GMT', 'Content-Type': 'text/html', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Server': 'Tengine', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains; preload', 'Content-Encoding': 'gzip', 'X-Via': '1.1 PSbjwjBGP2yt134:5 (Cdn Cache Server V2.0), 1.1 PSzjwzdx11at80:10 (Cdn Cache Server V2.0), 1.1 PSsxwndx4au44:4 (Cdn Cache Server V2.0)'}
    <class 'requests.cookies.RequestsCookieJar'> <RequestsCookieJar[]>
    <class 'str'> https://www.jianshu.com/
    <class 'list'> [<Response [301]>]
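Two details of that output are worth a short offline sketch: r.headers is a CaseInsensitiveDict, so header names can be looked up in any case, and requests.codes offers readable names for numeric status codes. The describe_status helper below is hypothetical and exists only for illustration.

```python
import requests
from requests.structures import CaseInsensitiveDict

# Header lookups ignore case, which is why r.headers is not a plain dict.
headers = CaseInsensitiveDict({'Content-Type': 'text/html', 'Server': 'Tengine'})
print(headers['content-type'])
print(headers.get('SERVER'))

# requests.codes maps readable names to numeric status codes.
print(requests.codes.ok)
print(requests.codes.forbidden)


def describe_status(status_code):
    """Hypothetical helper: turn a status code into a short label."""
    if status_code == requests.codes.ok:
        return 'ok'
    if status_code == requests.codes.forbidden:
        return 'forbidden'
    return 'other'


# 403 is the status jianshu.com returned in the example above.
print(describe_status(403))
```

Comparing against requests.codes names tends to read better than bare numbers when a scraper branches on the status code.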
Advanced usage of requests
File upload
    import requests

    # open the file in binary mode and close it once the upload finishes
    with open('favicon.ico', 'rb') as f:
        files = {'file': f}
        r = requests.post('http://httpbin.org/post', files=files)
    print(r.text)

Output:

    {
      "args": {},
      "data": "",
      "files": {
        "file": "data:application/octet-stream;base64,AAABAAIAEBAAAAEAIAA.....
      }
      "form": {},
      "headers": {
        "Accept": "*/*",
        "Accept-Encoding": "gzip, deflate",
        "Connection": "close",
        "Content-Length": "6665",
        "Content-Type": "multipart/form-data; boundary=aa294e10346d7b2538cf2d744dd46855",
        "Host": "httpbin.org",
        "User-Agent": "python-requests/2.19.1"
      },
      "json": null,
      "origin": "36.40.49.173",
      "url": "http://httpbin.org/post"
    }

Cookies:
Getting cookies
    import requests

    r = requests.get('https://www.baidu.com')
    # cookies set by the server are exposed as a RequestsCookieJar
    print(r.cookies)

Output:

    <RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
Session persistence: a Session simulates opening different pages of the same site in one browser
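    import requests

    s = requests.Session()
    # the first request makes the server set a cookie on the session
    s.get('http://httpbin.org/cookies/set/number/123456789')
    # the second request sends that cookie back automatically
    r = s.get('http://httpbin.org/cookies')
    print(r.text)

Output:

    {
      "cookies": {
        "number": "123456789"
      }
    }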
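The mechanism behind this can be observed without contacting a server. In the sketch below a cookie is stored on the session by hand (normally a server response would set it), and a follow-up request is prepared but not sent so that its headers can be inspected. Setting the cookie manually is an assumption made purely to keep the example offline.

```python
import requests

s = requests.Session()
# Store a cookie on the session by hand; normally a server response sets it.
s.cookies.set('number', '123456789')

# Prepare (but do not send) a follow-up request through the same session.
req = requests.Request('GET', 'http://httpbin.org/cookies')
prepped = s.prepare_request(req)
print(prepped.headers.get('Cookie'))

# A request prepared outside any session carries no stored cookies.
standalone = requests.Request('GET', 'http://httpbin.org/cookies').prepare()
print(standalone.headers.get('Cookie'))
```

The session attaches its stored cookies to every request it prepares, which is what makes consecutive s.get() calls behave like pages opened in the same browser.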
Proxy settings
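    import requests

    # plain HTTP proxies, one per URL scheme
    proxies = {
        'http': 'http://10.10.1.10:3128',
        'https': 'http://10.10.1.10:1080'
    }

or

    # SOCKS5 proxy with credentials (needs: pip install requests[socks])
    proxies = {
        'http': 'socks5://user:password@host:port',
        'https': 'socks5://user:password@host:port'
    }

or

    # HTTP proxy that requires a username and password
    proxies = {
        'http': 'http://user:password@10.10.1.10:3128'
    }

    requests.get('https://www.taobao.com', proxies=proxies)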
Timeout settings
    import requests

    # a single value applies to both the connect and the read timeout (1 second each)
    r = requests.get('https://www.taobao.com', timeout=1)
    print(r.status_code)

    import requests

    # set the connect and read timeouts separately
    r = requests.get('https://www.taobao.com', timeout=(5, 10))
    print(r.status_code)

    import requests

    # timeout=None waits indefinitely
    r = requests.get('https://www.taobao.com', timeout=None)
    print(r.status_code)
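When a timeout expires, requests raises an exception instead of returning a response. The sketch below shows one way to handle that; fetch_status and fake_get are hypothetical names, and the injectable get parameter exists only so the example can run without network access.

```python
import requests


def fetch_status(url, timeout=(5, 10), get=requests.get):
    """Hypothetical helper: return the status code, or None on timeout.

    `get` is injectable so the function can be exercised without a network.
    """
    try:
        return get(url, timeout=timeout).status_code
    except requests.exceptions.Timeout:
        # Covers both ConnectTimeout and ReadTimeout.
        return None


# Both specific timeout errors derive from requests.exceptions.Timeout.
print(issubclass(requests.exceptions.ConnectTimeout, requests.exceptions.Timeout))
print(issubclass(requests.exceptions.ReadTimeout, requests.exceptions.Timeout))


def fake_get(url, timeout=None):
    # Stand-in that simulates a server which never answers in time.
    raise requests.exceptions.ReadTimeout('simulated read timeout')


print(fetch_status('https://www.taobao.com', timeout=1, get=fake_get))
```

In real use the get argument would simply be left at its default, so that requests.get performs the actual request.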
Authentication:
    import requests
    from requests.auth import HTTPBasicAuth

    # URL and credentials are placeholders; adjust them to the actual service
    r = requests.get('http://localhost:5000', auth=HTTPBasicAuth('username', 'password'))
    print(r.status_code)
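requests also accepts a plain (username, password) tuple for auth, which it treats as HTTP Basic authentication. The sketch below prepares two requests without sending them and compares the resulting Authorization header; the URL and credentials are the same placeholders as above.

```python
import base64

import requests
from requests.auth import HTTPBasicAuth

url = 'http://localhost:5000'  # placeholder URL, nothing is actually sent

# Explicit auth object versus the tuple shorthand.
explicit = requests.Request('GET', url, auth=HTTPBasicAuth('username', 'password')).prepare()
shorthand = requests.Request('GET', url, auth=('username', 'password')).prepare()

print(explicit.headers['Authorization'])
print(explicit.headers['Authorization'] == shorthand.headers['Authorization'])

# Basic auth is just "username:password" encoded as Base64.
expected = 'Basic ' + base64.b64encode(b'username:password').decode('ascii')
print(explicit.headers['Authorization'] == expected)
```

Because Base64 is an encoding rather than encryption, Basic authentication should only be used over HTTPS outside of local testing.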
Prepared Request: a request can be represented as a data structure, and that data structure is called a Prepared Request
    from requests import Request, Session

    url = 'http://httpbin.org/post'
    data = {
        'name': 'getmey',
    }
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0'
    }
    s = Session()
    # build the request as a plain object first
    req = Request('POST', url, data=data, headers=headers)
    # turn it into a Prepared Request, then send it through the session
    preped = s.prepare_request(req)
    r = s.send(preped)
    print(r.text)

Output:

    {
      "args": {},
      "data": "",
      "files": {},
      "form": {
        "name": "getmey"
      },
      "headers": {
        "Accept": "*/*",
        "Accept-Encoding": "gzip, deflate",
        "Connection": "close",
        "Content-Length": "11",
        "Content-Type": "application/x-www-form-urlencoded",
        "Host": "httpbin.org",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0"
      },
      "json": null,
      "origin": "36.40.49.173",
      "url": "http://httpbin.org/post"
    }
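One practical benefit of a Prepared Request is that it can be inspected or adjusted before anything goes over the wire. The following sketch prepares the same POST without sending it; the X-Example header is a made-up name used only to show a last-minute modification.

```python
from requests import Request, Session

s = Session()
req = Request('POST', 'http://httpbin.org/post', data={'name': 'getmey'})
prepped = s.prepare_request(req)

# Everything that would be sent is available as plain attributes.
print(prepped.method)
print(prepped.url)
print(prepped.body)
print(prepped.headers['Content-Length'])

# The prepared request can still be changed before sending.
prepped.headers['X-Example'] = 'demo'  # hypothetical header, for illustration
print(prepped.headers['X-Example'])

# r = s.send(prepped)  # sending requires network access, so it is left out here
```

The body is the form-encoded string name=getmey, whose length of 11 matches the Content-Length echoed in the response above.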