Web Scraping Notes (4)

I. Requests Quickstart

1. Making a Request

import requests

# Common request methods: GET and POST
r = requests.get('https://api.github.com/events')
# r = requests.post("http://httpbin.org/post")
print(r.text)
print(r.url)

# Other HTTP methods
# r = requests.put("http://httpbin.org/put")
# r = requests.delete("http://httpbin.org/delete")
# r = requests.head("http://httpbin.org/get")
# r = requests.options("http://httpbin.org/get")

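2. Passing Parameters in URLs

When you build a URL by hand, the data goes into the URL as key/value pairs after a question mark, for example httpbin.org/get?key=val.

Requests lets you supply these arguments as a dictionary of strings through the params keyword argument.

To pass key1=value1 and key2=value2 to httpbin.org/get, you can use the following code: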

import requests

payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.get(url='http://httpbin.org/get', params=payload)

# params= passes key/value pairs in the URL; they are appended after the ? as ?key1=value1&key2=value2
print(r.url)  # http://httpbin.org/get?key1=value1&key2=value2

# Note: any dictionary key whose value is None is not added to the URL's query string

# A value can also be a list
payload = {'key1': 'value1', 'key2': ['value2', 'value3']}
r = requests.get(url='http://httpbin.org/get', params=payload)
print(r.url)    # http://httpbin.org/get?key1=value1&key2=value2&key2=value3
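
The encoding rules above can be checked without touching the network by preparing a request instead of sending it. This is a minimal sketch; the key name skipped is made up for the illustration.

```python
import requests

# Hypothetical parameters: one plain value, one None, one list.
payload = {'key1': 'value1', 'skipped': None, 'key2': ['value2', 'value3']}

# Request(...).prepare() builds the final URL without sending anything.
prepared = requests.Request('GET', 'http://httpbin.org/get', params=payload).prepare()

# The None entry is dropped and the list is expanded into repeated keys.
print(prepared.url)
```

Preparing a request is also a convenient way to inspect headers and the body before anything leaves the machine.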

3. Response Content

Requests automatically decodes content from the server. Most Unicode charsets are decoded seamlessly.

import requests

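r = requests.get(url='https://api.github.com/events')
# Requests guesses the encoding from the HTTP headers (check r.encoding); it is not always utf-8
r.encoding = 'gbk'  # override the encoding used when r.text is accessed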
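
To make the encoding guess more concrete, the sketch below calls the helper that Requests uses to read a charset from the Content-Type header, then decodes a GBK byte string both ways. The header values and the sample text are assumptions chosen for the illustration, and nothing is sent over the network.

```python
from requests.utils import get_encoding_from_headers

# Charset declared by the server: this is what r.encoding would be set to.
declared = get_encoding_from_headers({'content-type': 'text/html; charset=gbk'})

# A text/* response without a charset falls back to ISO-8859-1.
fallback = get_encoding_from_headers({'content-type': 'text/html'})

# Sample body encoded as GBK (made up for this example).
body = '爬虫'.encode('gbk')
garbled = body.decode(fallback)   # wrong codec: decodes without error, but to mojibake
fixed = body.decode(declared)     # right codec: recovers the original text

print(declared, fallback, fixed)
```

This is why setting r.encoding by hand helps when a page is served as GBK but its headers do not say so.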

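4. JSON Response Content

Requests also has a built-in JSON decoder to help you work with JSON data.

Note that a successful call to r.json() does not mean the response itself succeeded. Some servers return a JSON object in a failed response (for example, error details with an HTTP 500). Such JSON is decoded and returned. To check that a request succeeded, use r.raise_for_status() or check that r.status_code is what you expect.

r.json()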
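
The caveat above can be sketched as a small helper that checks the status before decoding. The function name json_or_raise is hypothetical, and the failed response is built by hand (setting the private _content attribute is a shortcut for this offline demonstration only, not something real code should do).

```python
import requests

def json_or_raise(response):
    """Return the decoded JSON body, but only for a successful status code."""
    response.raise_for_status()   # raises requests.exceptions.HTTPError on 4xx/5xx
    return response.json()

# Offline stand-in for a failed response that still carries a JSON body.
failed = requests.Response()
failed.status_code = 500
failed._content = b'{"error": "internal failure"}'

print(failed.json())              # the error body decodes without complaint
try:
    json_or_raise(failed)
except requests.exceptions.HTTPError as exc:
    print('request failed:', exc)
```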

5. Custom Headers

To add HTTP headers to a request, simply pass a dict to the headers parameter.

import requests

url = 'https://api.github.com/some/endpoint'
headers = {'user-agent': 'my-app/0.0.1'}

r = requests.get(url, headers=headers)

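All header values must be a string, bytestring, or unicode.

Passing unicode header values is permitted, but it is not recommended.

6. More Complicated POST Requests

To send form-encoded data, much like an HTML form, simply pass a dictionary to the data parameter. The dictionary is form-encoded automatically when the request is made.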

import requests

payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.post("http://httpbin.org/post", data=payload)

print(r.text)
'''
{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "key1": "value1", 
    "key2": "value2"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Connection": "close", 
    "Content-Length": "23", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.18.4"
  }, 
  "json": null, 
  "origin": "61.184.75.210", 
  "url": "http://httpbin.org/post"
}
'''

The data parameter also accepts a sequence of tuples. This is particularly useful when the form has multiple elements that share the same key:

import requests

payload = (('key1', 'value1'), ('key1', 'value2'))
r = requests.post("http://httpbin.org/post", data=payload)

print(r.text)
'''
{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "key1": [
      "value1", 
      "value2"
    ]
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Connection": "close", 
    "Content-Length": "23", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.18.4"
  }, 
  "json": null, 
  "origin": "61.184.75.210", 
  "url": "http://httpbin.org/post"
}
'''

Often the data you want to send is not form-encoded. If you pass a string instead of a dict, that data is posted directly.

import requests
import json

url = 'https://api.github.com/some/endpoint'
payload = {'some': 'data'}

r = requests.post(url, data=json.dumps(payload))
# r = requests.post(url, json=payload)
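
The difference between the three ways of passing a payload shows up in the Content-Type header and the body. The sketch below prepares each request without sending it; the endpoint is the same placeholder URL used above.

```python
import json
import requests

url = 'https://api.github.com/some/endpoint'   # placeholder endpoint, never contacted
payload = {'some': 'data'}

form = requests.Request('POST', url, data=payload).prepare()               # form-encoded
raw = requests.Request('POST', url, data=json.dumps(payload)).prepare()    # string sent as-is
as_json = requests.Request('POST', url, json=payload).prepare()            # JSON-encoded

for name, req in [('data=dict', form), ('data=str', raw), ('json=dict', as_json)]:
    print(name, req.headers.get('Content-Type'), req.body)
```

Only json= sets Content-Type to application/json; with data=json.dumps(...) the header has to be added by hand if the server requires it.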

7. Response Status Codes

import requests

r = requests.get('http://httpbin.org/get')
print(r.status_code)  # 200

8. Response Headers

We can view the server's response headers as a Python dictionary:

import requests

r = requests.get('http://httpbin.org/get')
print(r.headers)
'''
{
    'Connection': 'keep-alive', 
    'Server': 'meinheld/0.6.1', 
    'Date': 'Tue, 09 Jan 2018 07:00:48 GMT', 
    'Content-Type': 'application/json', 
    'Access-Control-Allow-Origin': '*', 
    'Access-Control-Allow-Credentials': 'true',
    'X-Powered-By': 'Flask', 
    'X-Processed-Time': '0.000684976577759', 
    'Content-Length': '266', 
    'Via': '1.1 vegur'
} 
'''

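This dictionary is special: it is made just for HTTP headers, so lookups are case-insensitive. The server may also send the same header more than once, each time with a different value; Requests combines these into a single entry.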
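
A minimal sketch of the case-insensitive lookup, using the same dictionary class that backs r.headers. The header values are sample data copied from the output above, not a live response.

```python
from requests.structures import CaseInsensitiveDict

# r.headers is a CaseInsensitiveDict; this one is built by hand with sample values.
headers = CaseInsensitiveDict({'Content-Type': 'application/json', 'Via': '1.1 vegur'})

print(headers['content-type'])       # any capitalization finds the same entry
print(headers.get('CONTENT-TYPE'))
print('via' in headers)
```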

9. Cookies

To send your own cookies to the server, use the cookies parameter:

import requests

url = 'http://httpbin.org/cookies'
cookies = dict(cookies_are='working')

r = requests.get(url, cookies=cookies)
print(r.text)
'''
{
  "cookies": {
    "cookies_are": "working"
  }
}
'''

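Cookies are returned in a RequestsCookieJar, which acts like a dict but also offers a more complete interface, suitable for use across multiple domains or paths. Cookie jars can also be passed in to Requests: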

import requests

jar = requests.cookies.RequestsCookieJar()
jar.set('tasty_cookie', 'yum', domain='httpbin.org', path='/cookies')
jar.set('gross_cookie', 'blech', domain='httpbin.org', path='/elsewhere')
url = 'http://httpbin.org/cookies'
r = requests.get(url, cookies=jar)
print(r.text)
'''
{
  "cookies": {
    "tasty_cookie": "yum"
  }
}
'''
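
The path matching seen in that output can also be checked offline. The sketch below reuses the same jar, reads one cookie back with dict-style access narrowed by domain and path, and prepares (but does not send) a request to see which cookie would be attached.

```python
import requests

jar = requests.cookies.RequestsCookieJar()
jar.set('tasty_cookie', 'yum', domain='httpbin.org', path='/cookies')
jar.set('gross_cookie', 'blech', domain='httpbin.org', path='/elsewhere')

# Dict-style lookup, narrowed by domain and path.
tasty = jar.get('tasty_cookie', domain='httpbin.org', path='/cookies')

# Preparing the request shows which Cookie header would go out for this URL.
prepared = requests.Request('GET', 'http://httpbin.org/cookies', cookies=jar).prepare()
cookie_header = prepared.headers.get('Cookie')

print(tasty)
print(cookie_header)
```

Only the cookie whose path matches /cookies ends up in the header, which agrees with the httpbin response shown above.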

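10. Redirection and History

By default, Requests follows redirects automatically for every method except HEAD.

You can use the history attribute of the Response object to track redirection.

Response.history is a list of the Response objects that were created in order to complete the request. The list is sorted from the oldest to the most recent response.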

import requests

r = requests.get('http://github.com')
print(r.url)         # https://github.com/
print(r.status_code) # 200
print(r.history)     # [<Response [301]>]

If you are using GET, OPTIONS, POST, PUT, PATCH, or DELETE, you can disable redirection handling with the allow_redirects parameter:

import requests

# allow_redirects=False disables redirect handling
r = requests.get('http://github.com', allow_redirects=False)
print(r.url)         # http://github.com/
print(r.status_code) # 301
print(r.history)     # []

If you are using HEAD, you can enable redirection as well:

import requests

r = requests.head('http://github.com', allow_redirects=True)
print(r.url)         # https://github.com/
print(r.status_code) # 200
print(r.history)     # [<Response [301]>]
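
When a request passes through several redirects, it can help to print the whole chain. The helper below is a sketch; the name redirect_chain is hypothetical, and the two Response objects are hand-built stand-ins for the http://github.com example so that nothing is sent.

```python
import requests

def redirect_chain(response):
    """Return (status_code, url) for every hop, oldest first, ending with the final response."""
    hops = list(response.history) + [response]
    return [(hop.status_code, hop.url) for hop in hops]

# Stand-in for the 301 hop.
first = requests.Response()
first.status_code, first.url = 301, 'http://github.com/'

# Stand-in for the final response, with the hop recorded in its history.
final = requests.Response()
final.status_code, final.url = 200, 'https://github.com/'
final.history = [first]

for status, url in redirect_chain(final):
    print(status, url)
```

With a real response, pass the object returned by requests.get() straight to the helper.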

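11. Timeouts

You can tell Requests to stop waiting for a response after a given number of seconds with the timeout parameter.

Nearly all production code should use this parameter. Without it, your program may hang indefinitely: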

>>> requests.get('http://github.com', timeout=0.001)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
requests.exceptions.Timeout: HTTPConnectionPool(host='github.com', port=80): Request timed out. (timeout=0.001)

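Note
timeout is not a time limit on the entire response download. Rather, an exception is raised if the server has not issued a response for timeout seconds (more precisely, if no bytes have been received on the underlying socket for timeout seconds). If no timeout is specified explicitly, requests do not time out.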

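12. Errors and Exceptions

In the event of a network problem (e.g. DNS failure, refused connection), Requests raises a ConnectionError exception.

If the HTTP request returned an unsuccessful status code, Response.raise_for_status() raises an HTTPError exception.

If a request times out, a Timeout exception is raised.

If a request exceeds the configured maximum number of redirections, a TooManyRedirects exception is raised.

All exceptions that Requests explicitly raises inherit from requests.exceptions.RequestException.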
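
The exception types above can be handled in one place. The following is a rough sketch, not a definitive pattern: the function name fetch_status, the returned labels, and the timeout values are all assumptions made for the illustration. Timeout is caught before ConnectionError because a connect timeout belongs to both families.

```python
import requests

def fetch_status(url, timeout=(3.05, 27)):
    """Return a short label describing how the request ended."""
    try:
        # timeout may be a single number or a (connect, read) tuple, in seconds
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()
    except requests.exceptions.Timeout:
        return 'timeout'
    except requests.exceptions.ConnectionError:
        return 'connection error'
    except requests.exceptions.TooManyRedirects:
        return 'too many redirects'
    except requests.exceptions.HTTPError as exc:
        return 'http error %s' % exc.response.status_code
    except requests.exceptions.RequestException:
        return 'other request error'
    return 'ok'

# A URL without a scheme is rejected before any network traffic happens.
print(fetch_status('not-a-url'))
```

The final except clause works as a catch-all because every exception that Requests raises explicitly derives from RequestException.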

posted @ 2018-01-27 10:41 _慕
