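Python web scraping: requests

Introduction to requests

requests has many advantages over urllib: it handles cookies, login authentication, proxy settings and similar tasks far more conveniently, without the hassle that urllib involves. Below are some of the most commonly used requests methods.

Basic usage

The get() method requests a web page, performing the same operation as urllib's urlopen() and returning a Response object. The example prints the type of that object and some of its attributes.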

import requests

r = requests.get('https://www.baidu.com/')
print(type(r))
print(r.status_code)
print(type(r.text))
print(r.cookies)

Output:
# a Response object is returned
<class 'requests.models.Response'>
# status code
200
# type of the response body
<class 'str'>
# cookies
<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>

GET request: the URL http://httpbin.org/get can be used to test whether the client issued a GET request. It echoes the request information back in JSON format.

import requests

r = requests.get('http://www.httpbin.org/get')
print(r.text)

Output:
{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Connection": "close", 
    "Host": "www.httpbin.org", 
    "User-Agent": "python-requests/2.19.1"
  }, 
  "origin": "36.40.49.173", 
  "url": "http://www.httpbin.org/get"
}
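
Query parameters can also be passed as a dictionary through the params argument rather than written into the URL by hand. The sketch below only builds the request with prepare() and does not send it, so it runs offline; the parameter names and values are made up for illustration.

```python
from requests import Request

# Hypothetical query parameters, for illustration only.
params = {'name': 'getmey', 'age': 22}

# prepare() builds the final request without sending it over the network.
prepared = Request('GET', 'http://httpbin.org/get', params=params).prepare()
print(prepared.url)

# To actually send it: requests.get('http://httpbin.org/get', params=params)
```

When sent, httpbin echoes these values back in the "args" field of its JSON reply.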

Converting the JSON response to a dictionary

import requests

r = requests.get('http://www.httpbin.org/get')
print(type(r.text))
print(type(r.json()))
print(r.json())

Output:
<class 'str'>
<class 'dict'>
{'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Connection': 'close', 'Host': 'www.httpbin.org', 'User-Agent': 'python-requests/2.19.1'}, 'origin': '36.40.49.173', 'url': 'http://www.httpbin.org/get'}
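
If the body is not valid JSON, r.json() raises an exception. A minimal sketch of a defensive wrapper follows; safe_json and fake_response are hypothetical helper names, and filling in a Response by hand is only a trick to keep the example offline.

```python
import requests

def safe_json(resp):
    """Return the decoded JSON body of resp, or None if the body is not JSON."""
    try:
        return resp.json()
    except ValueError:  # the JSON decode errors raised by requests subclass ValueError
        return None

def fake_response(body):
    """Offline stand-in for a real response (assumption: testing trick only)."""
    resp = requests.Response()
    resp.status_code = 200
    resp.encoding = 'utf-8'
    resp._content = body  # real code receives the body from requests.get()
    return resp

print(safe_json(fake_response(b'{"args": {}, "url": "http://www.httpbin.org/get"}')))
print(safe_json(fake_response(b'<html>not json</html>')))
```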

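Using a GET request to scrape a Zhihu page (ordinary web pages are generally not JSON). Header information, specifically the User-Agent field, must be passed in; otherwise Zhihu refuses the request.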

import requests
import re

headers = {
    'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0'
}
r = requests.get('https://www.zhihu.com/explore',headers = headers)
pattern = re.compile('explore-feed.*?question_link.*?>(.*?)</a>',re.S)
titles = re.findall(pattern,r.text)
print(titles)

Output:
['\n地理这门学科有多有趣?\n', '\n如何看待辩论赛收报名费的现象?\n', '\n如何评价朱一龙?\n', '\n历史上哪些人的名或字比较奇怪?\n', '\n有哪些惊艳到你的句子?\n', '\n英雄联盟中有哪些冷知识?\n', '\n如何评价《海贼王》第925话?\n', '\n怎样看待华晨宇说自己做音乐的天赋占百分之九十九,努力占百分之一?\n', '\n你觉得《三体》中最残忍的一句话是什么?\n', '\n人类有哪些细思恐极的事?\n']
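
Each extracted title carries leading and trailing newlines. One way to clean them up, sketched with the same regular expression, is shown here; the sample markup is hypothetical and merely imitates the structure the pattern expects, so the example runs without contacting Zhihu.

```python
import re

# Same pattern as in the scraping example, compiled once.
PATTERN = re.compile('explore-feed.*?question_link.*?>(.*?)</a>', re.S)

def extract_titles(html):
    """Return the question titles found in html, without surrounding whitespace."""
    return [title.strip() for title in PATTERN.findall(html)]

# Hypothetical markup, for illustration only.
sample_html = '''
<div class="explore-feed feed-item">
  <h2><a class="question_link" href="/question/1">
Title one
</a></h2>
</div>
<div class="explore-feed feed-item">
  <h2><a class="question_link" href="/question/2">
Title two
</a></h2>
</div>
'''
print(extract_titles(sample_html))
```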

Fetching binary data (images, audio, video, etc.) and saving it to a file:

import requests

r = requests.get('https://github.com/favicon.ico')
# decoding the binary body as text produces garbled characters
print(r.text)
# the raw bytes
print(r.content)

Partial output:
�������O                                L������     

b'\x00\x00\x01\x00\x02\x00\x10\x10\x00\x00\x01\x00 \x00(\x05\x00\x00&\x00\x00\x00  \x00\x00\x01\x00 

Saving the image to a file:
import requests

r = requests.get('https://www.github.com/favicon.ico')
with open('favicon.ico','wb') as f:
    f.write(r.content)
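
r.content loads the whole body into memory, which is fine for an icon but not for large audio or video files. A possible streaming variant is sketched below; write_chunks and download are hypothetical helper names, and the chunk size and timeout are arbitrary choices.

```python
import requests

def write_chunks(resp, path, chunk_size=8192):
    """Write the body of resp to path chunk by chunk; return the bytes written."""
    written = 0
    with open(path, 'wb') as f:
        for chunk in resp.iter_content(chunk_size=chunk_size):
            f.write(chunk)
            written += len(chunk)
    return written

def download(url, path):
    """Stream url to path without holding the whole body in memory."""
    # stream=True defers fetching the body until iter_content() is consumed.
    with requests.get(url, stream=True, timeout=10) as resp:
        resp.raise_for_status()
        return write_chunks(resp, path)

# Example call (needs network access):
# download('https://github.com/favicon.ico', 'favicon.ico')
```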

POST requests with requests:

import requests

data = {
    'name' : 'getmey',
    'age' : 22
}
r = requests.post('http://httpbin.org/post',data=data)
print(r.text)

Output:
{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "age": "22", 
    "name": "getmey"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Connection": "close", 
    "Content-Length": "18", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.19.1"
  }, 
  "json": null, 
  "origin": "36.40.49.173", 
  "url": "http://httpbin.org/post"
}
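
The data argument sends a URL-encoded form, which is why the values appear under "form" in the reply. Passing the same dictionary through json sends a JSON document instead. The sketch below compares the two by preparing the requests without sending them, so it runs offline.

```python
import json
from requests import Request

data = {'name': 'getmey', 'age': 22}
url = 'http://httpbin.org/post'

# data= produces a URL-encoded form body.
form_req = Request('POST', url, data=data).prepare()
print(form_req.headers['Content-Type'], form_req.body)

# json= serializes the dictionary as a JSON document.
json_req = Request('POST', url, json=data).prepare()
print(json_req.headers['Content-Type'], json.loads(json_req.body))
```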

Response information:

import requests

r = requests.get('http://www.jianshu.com')
print(type(r.status_code),r.status_code)
print(type(r.headers),r.headers)
print(type(r.cookies),r.cookies)
print(type(r.url),r.url)
print(type(r.history),r.history)

Output:
<class 'int'> 403
<class 'requests.structures.CaseInsensitiveDict'> {'Date': 'Mon, 24 Dec 2018 02:36:55 GMT', 'Content-Type': 'text/html', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Server': 'Tengine', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains; preload', 'Content-Encoding': 'gzip', 'X-Via': '1.1 PSbjwjBGP2yt134:5 (Cdn Cache Server V2.0), 1.1 PSzjwzdx11at80:10 (Cdn Cache Server V2.0), 1.1 PSsxwndx4au44:4 (Cdn Cache Server V2.0)'}
<class 'requests.cookies.RequestsCookieJar'> <RequestsCookieJar[]>
<class 'str'> https://www.jianshu.com/
<class 'list'> [<Response [301]>]
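
The 403 in this output shows that a request can complete and still fail at the HTTP level. A minimal sketch of checking the status follows; check is a hypothetical helper, and the Response is built by hand only so that the example runs offline.

```python
import requests

def check(resp):
    """Return a short description of resp's status instead of raising."""
    if resp.status_code == requests.codes.ok:  # requests.codes.ok is 200
        return 'ok'
    try:
        resp.raise_for_status()  # raises HTTPError for 4xx and 5xx codes
    except requests.exceptions.HTTPError as exc:
        return 'failed: {}'.format(exc)
    return 'unexpected status {}'.format(resp.status_code)

# Offline stand-in for the 403 response in the output.
resp = requests.Response()
resp.status_code = 403
resp.url = 'https://www.jianshu.com/'
print(check(resp))
```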

Advanced usage of requests

File upload

import requests

# open the file in a with block so the handle is closed after the upload
with open('favicon.ico','rb') as f:
    r = requests.post('http://httpbin.org/post',files={'file' : f})
print(r.text)

Output:
{
  "args": {},
  "data": "",
  "files": {
    "file": "data:application/octet-stream;base64,AAABAAIAEBAAAAEAIAA.....
  },
  "form": {},
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Connection": "close", 
    "Content-Length": "6665", 
    "Content-Type": "multipart/form-data; boundary=aa294e10346d7b2538cf2d744dd46855", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.19.1"
  }, 
  "json": null, 
  "origin": "36.40.49.173", 
  "url": "http://httpbin.org/post"
}
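
The multipart Content-Type header in this output is generated by requests from the files argument. The sketch below shows this offline by preparing an upload of in-memory bytes; the (filename, content) tuple form is used here, and the bytes are placeholders rather than a real icon.

```python
from requests import Request

# (filename, content) tuples allow uploading in-memory bytes.
# The bytes below are placeholders, not a real icon file.
files = {'file': ('favicon.ico', b'\x00\x00\x01\x00')}
prepared = Request('POST', 'http://httpbin.org/post', files=files).prepare()

# requests picks the multipart boundary itself.
print(prepared.headers['Content-Type'])
# The field name and filename end up inside the encoded body.
print(b'filename="favicon.ico"' in prepared.body)
```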

Getting cookies

import requests

r = requests.get('https://www.baidu.com')
print(r.cookies)

Output:
<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
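
r.cookies is a RequestsCookieJar, which can be iterated like a dictionary. In the sketch below the jar is built by hand, with the cookie value copied from the output, so that it runs without contacting baidu.com.

```python
from requests.cookies import RequestsCookieJar

# Build a jar by hand so the sketch runs offline; r.cookies has the same type.
jar = RequestsCookieJar()
jar.set('BDORZ', '27315', domain='.baidu.com', path='/')

# Iterate over name/value pairs.
for name, value in jar.items():
    print(name, '=', value)

# Plain dictionary view of the jar.
print(jar.get_dict())
```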

Session persistence: a Session simulates opening different pages of the same site within one browser

import requests

s = requests.Session()
s.get('http://httpbin.org/cookies/set/number/123456789')
r = s.get('http://httpbin.org/cookies')
print(r.text)
Output:
{
  "cookies": {
    "number": "123456789"
  }
}
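
A Session also carries default headers, and both these and its cookies are merged into every request it prepares. The sketch below checks this offline with prepare_request; the User-Agent string is a made-up value, and the cookie is set by hand instead of being received from httpbin.

```python
import requests
from requests import Request

s = requests.Session()
# Hypothetical User-Agent, applied to every request made through s.
s.headers.update({'User-Agent': 'my-crawler/0.1'})
# Set by hand here; normally the server sets it via a Set-Cookie header.
s.cookies.set('number', '123456789', domain='httpbin.org', path='/')

# prepare_request merges the session state into the outgoing request.
prepared = s.prepare_request(Request('GET', 'http://httpbin.org/cookies'))
print(prepared.headers['User-Agent'])
print(prepared.headers['Cookie'])
```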

Proxy settings

import requests

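# HTTP proxies, chosen by the scheme of the target URL
proxies = {
    'http' : 'http://10.10.1.10:3128',
    'https' : 'http://10.10.1.10:1080'
}
# or SOCKS5 proxies (requires the extra dependency: pip install requests[socks])
proxies = {
    'http': 'socks5://user:password@host:port',
    'https': 'socks5://user:password@host:port'
}
# or a proxy that requires HTTP Basic Auth
proxies = {
    'http' : 'http://user:password@10.10.1.10:3128'
}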
requests.get('https://www.taobao.com',proxies=proxies)
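
The proxy is looked up by the scheme of the target URL, so a dictionary with only an 'http' key does not apply to an https:// address. The sketch below illustrates the lookup offline with the select_proxy helper from requests.utils; the proxy addresses are the placeholder values from the example.

```python
from requests.utils import select_proxy

proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}

# The key matching the URL scheme decides which proxy is used.
print(select_proxy('http://httpbin.org/get', proxies))
print(select_proxy('https://www.taobao.com', proxies))

# With only an 'http' entry, an https URL gets no proxy at all.
http_only = {'http': 'http://10.10.1.10:3128'}
print(select_proxy('https://www.taobao.com', http_only))
```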

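Timeout settings

import requests
# a single value applies to the connect phase and the read phase separately (1 second each), not to their sum
r = requests.get('https://www.taobao.com',timeout=1)
print(r.status_code)

import requests
# set the connect and read timeouts separately
r = requests.get('https://www.taobao.com',timeout=(5,10))
print(r.status_code)

import requests
# timeout=None waits indefinitely
r = requests.get('https://www.taobao.com',timeout=None)
print(r.status_code)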
print(r.status_code)
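
When a timeout expires, requests raises an exception rather than returning a response. A minimal sketch of handling it follows; fetch_status is a hypothetical helper, and its get parameter is an assumption of this sketch that lets the function be exercised without network access.

```python
import requests
from requests.exceptions import ConnectTimeout, ReadTimeout, Timeout

def fetch_status(url, timeout=(5, 10), get=requests.get):
    """Return the status code of url, or None if the request timed out."""
    try:
        return get(url, timeout=timeout).status_code
    except ConnectTimeout:
        print('could not connect within the connect timeout')
    except ReadTimeout:
        print('the server did not send data within the read timeout')
    return None

# Both exceptions derive from Timeout, so "except Timeout" would catch either.
print(issubclass(ConnectTimeout, Timeout), issubclass(ReadTimeout, Timeout))

# Example call (needs network access):
# fetch_status('https://www.taobao.com', timeout=(5, 10))
```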

Authentication:

import requests
from requests.auth import HTTPBasicAuth
# adjust the URL and credentials to your actual setup
r = requests.get("http://localhost:5000",auth=HTTPBasicAuth('username','password'))
print(r.status_code)
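
For Basic authentication, a plain (username, password) tuple can be passed as auth in place of an HTTPBasicAuth object. The sketch below compares the two offline by inspecting the Authorization header of the prepared requests; the credentials are the placeholder values from the example.

```python
from requests import Request
from requests.auth import HTTPBasicAuth

url = 'http://localhost:5000'

# Explicit auth object, as in the example.
explicit = Request('GET', url, auth=HTTPBasicAuth('username', 'password')).prepare()
# Tuple shorthand for the same thing.
shorthand = Request('GET', url, auth=('username', 'password')).prepare()

print(shorthand.headers['Authorization'])
print(explicit.headers['Authorization'] == shorthand.headers['Authorization'])
```

The header only base64-encodes the credentials, so Basic authentication should be used over HTTPS outside local testing.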

Prepared Request: a request can be represented as a data structure, and that data structure is called a Prepared Request

from requests import Request,Session
url = 'http://httpbin.org/post'
data ={
    'name' : 'getmey',
}
headers = {
    'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0'
}
s = Session()
req = Request('POST',url,data=data,headers=headers)
preped = s.prepare_request(req)
r = s.send(preped)
print(r.text)
Output:
{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "name": "getmey"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Connection": "close", 
    "Content-Length": "11", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0"
  }, 
  "json": null, 
  "origin": "36.40.49.173", 
  "url": "http://httpbin.org/post"
}
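
Because the prepared request is an ordinary object, it can be inspected or adjusted before it is sent, which is useful for logging or queueing requests. A minimal offline sketch follows; the X-Trace-Id header is a made-up example and not something httpbin requires.

```python
from requests import Request, Session

s = Session()
req = Request('POST', 'http://httpbin.org/post', data={'name': 'getmey'})
preped = s.prepare_request(req)

# Everything that will go on the wire can be inspected before sending.
print(preped.method, preped.url)
print(preped.body)
print(preped.headers['Content-Length'])

# The prepared request can still be modified; this header name is hypothetical.
preped.headers['X-Trace-Id'] = 'demo-1'

# Sending needs network access:
# r = s.send(preped, timeout=10)
```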

posted @ 2018-12-24 18:46  Coolc