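The requests module and website requests (GET and POST)

The requests module

There are generally two ways to send a GET request

One is to assemble the URL by hand and send the final URL directly, with no parameter-passing logic.
The other is to pass the parameters through params, which is the more conventional way to make a GET request.

1. Sending a GET request directly, without passing parameters through params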

response = requests.get(url, headers=headers)

import requests

# Set the request URL
url = "https://www.baidu.com/s?wd=快递里的经济新脉动&sa=fyb_hp_news_31065&rsv_dl=fyb_hp_news_31065&from=31065"

# Set the request headers; the headers are a dictionary
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"
    }

# Send the request with GET, passing the URL and the headers
response = requests.get(url, headers=headers)

# Show the final URL of the request (the URL after all redirects have been followed)
print(response.url)

# Show the encoding requests inferred for the response
print(response.encoding)

# Decode the document as UTF-8
response.encoding = 'UTF-8'

# Print the text of the returned document
print(response.text)

2. Sending a GET request with parameters passed through params

response = requests.get(url, headers=headers, params=params)

import requests

# The result is the same with or without the question mark after s
url = 'https://www.baidu.com/s?'

# The request headers are a dictionary
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Dnt': '1',
    'Host': 'www.baidu.com',
    'Referer': 'https://www.baidu.com/',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-User': '?1',
    'Upgrade-Insecure-Requests': '1',
    'Sec-Ch-Ua': '"Not_A Brand";v="8", "Chromium";v="120", "Google Chrome";v="120"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"'
}
# The request parameters are a dictionary
params = {
    'wd': '快递里的经济新脉动',
    'sa': 'fyb_hp_news_31065',
    'rsv_dl': 'fyb_hp_news_31065',
    'from': '31065',
}
# Send the request with the parameters and get the response
response = requests.get(url, headers=headers, params=params)
data = response.content.decode()
print(response.url)

# Output
https://www.baidu.com/s?wd=%E5%BF%AB%E9%80%92%E9%87%8C%E7%9A%84%E7%BB%8F%E6%B5%8E%E6%96%B0%E8%84%89%E5%8A%A8&sa=fyb_hp_news_31065&rsv_dl=fyb_hp_news_31065&from=31065
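
To see that the two styles end up as the same request, the URL can be built without sending anything. The sketch below is a minimal illustration that assumes requests.Request(...).prepare() is a fair stand-in for what requests.get() does to the URL before sending; the shortened parameter set is chosen here only for brevity.

```python
import requests
from urllib.parse import quote

keyword = "快递里的经济新脉动"
params = {"wd": keyword, "from": "31065"}

# Style 1: the query string is already part of the URL.
manual = requests.Request(
    "GET", "https://www.baidu.com/s?wd=" + keyword + "&from=31065"
).prepare()

# Style 2: the query string is built from the params dictionary.
with_params = requests.Request(
    "GET", "https://www.baidu.com/s", params=params
).prepare()

# Same as style 2, but the base URL ends with a question mark.
with_question_mark = requests.Request(
    "GET", "https://www.baidu.com/s?", params=params
).prepare()

print(manual.url)
print(with_params.url)
print(with_question_mark.url)
```

All three prints should show the same percent-encoded URL, which is why the trailing question mark makes no difference.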

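Converting to and from URL encoding

Text typed into the browser is usually converted to URL encoding (percent-encoding). To convert between the two forms:

Use requests.utils.unquote() to URL-decode
Use requests.utils.quote() to URL-encode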

import requests

url = "https://www.baidu.com/s?wd=快递里的经济新脉动&sa=fyb_hp_news_31065&rsv_dl=fyb_hp_news_31065&from=31065"

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"
    }

# Send the request
response = requests.get(url, headers=headers)
# Handle the response
print("Requested URL:")
print(response.url)

# print(response.text)

# Decode
decoded_url = requests.utils.unquote(response.url)
print("Decoded URL:")
print(decoded_url)

# URL-encode a piece of text

print("URL-encoded:")
print("快递里的经济新脉动->", requests.utils.quote("快递里的经济新脉动"))

# Output
Requested URL:
https://www.baidu.com/s?wd=%E5%BF%AB%E9%80%92%E9%87%8C%E7%9A%84%E7%BB%8F%E6%B5%8E%E6%96%B0%E8%84%89%E5%8A%A8&sa=fyb_hp_news_31065&rsv_dl=fyb_hp_news_31065&from=31065
Decoded URL:
https://www.baidu.com/s?wd=快递里的经济新脉动&sa=fyb_hp_news_31065&rsv_dl=fyb_hp_news_31065&from=31065
URL-encoded:
快递里的经济新脉动-> %E5%BF%AB%E9%80%92%E9%87%8C%E7%9A%84%E7%BB%8F%E6%B5%8E%E6%96%B0%E8%84%89%E5%8A%A8
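
A small round-trip check makes the relationship between the two helpers concrete. This sketch assumes requests.utils.quote and requests.utils.unquote take the same arguments as the standard library's urllib.parse.quote and urllib.parse.unquote (including safe); the sample string "a/b c" is made up for illustration.

```python
import requests

keyword = "快递里的经济新脉动"

encoded = requests.utils.quote(keyword)
decoded = requests.utils.unquote(encoded)

# Encoding and then decoding returns the original text.
print(encoded)
print(decoded == keyword)

# By default "/" is treated as safe and left alone; pass safe="" to
# encode it too, which matters when the text is a single query value.
sample = "a/b c"  # hypothetical value, not from the original post
print(requests.utils.quote(sample))           # a/b%20c
print(requests.utils.quote(sample, safe=""))  # a%2Fb%20c
```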

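Two ways to decode:

The requests module offers two ways to read the response body: response.content and response.text. response.content returns the raw bytes and does no decoding; response.text returns a string that requests decodes for you, with no arguments needed.

To get the text of a web page, either works: read response.text directly, or decode response.content yourself.

For binary data (such as image or video files), response.content is the better fit.

Manual decoding (response.content.decode('utf-8')):
- You can decode with "UTF-8", "GBK", or another encoding.
- If the server's response really is in the encoding you name, the resulting data is usually correct.
```
data = response.content.decode('utf-8')
```
response.text with response.encoding set:
- If the server sets the charset in the Content-Type header correctly, response.text picks up its encoding from that header automatically.
- If the server does not set a charset in the Content-Type header, set response.encoding = 'UTF-8' (or 'GBK') by hand so that response.text decodes correctly.
```
response.encoding = 'UTF-8'
data = response.text
```

In general the two approaches give the same result as long as both end up using the same encoding, which for most modern sites is UTF-8. When the charset in the Content-Type header is missing or does not match the actual bytes, manual decoding can be more flexible, because the encoding is named explicitly on each call.

In short, when the server sets the Content-Type header with the correct charset, there is usually little difference between the two.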
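
The difference between the two approaches shows up when the body is not UTF-8. The sketch below builds a Response object by hand so that it runs without a network; assigning to the private _content attribute is a testing shortcut assumed here for illustration, not something the original examples do, and the GBK sample text is invented.

```python
import requests

# Pretend a server sent a GBK-encoded page (sample text, for illustration).
body = "中文页面".encode("gbk")

response = requests.Response()
response._content = body  # private attribute, set only to avoid a real request

# Manual decoding: the encoding is named on the call itself.
manual = body.decode("gbk")

# response.text: the encoding comes from response.encoding.
response.encoding = "gbk"
via_text = response.text

print(manual == via_text)

# With the wrong encoding, bytes.decode() fails loudly ...
try:
    response.content.decode("utf-8")
except UnicodeDecodeError as exc:
    print("utf-8 decode failed:", exc.reason)

# ... while response.text substitutes replacement characters instead of raising.
response.encoding = "utf-8"
garbled = response.text
print(garbled != manual)
```

A silent mismatch in response.text is the usual source of garbled pages, so checking response.encoding first is a reasonable habit.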

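Checking the site's encoding:

# Show the encoding requests inferred for the response
print(response.encoding)

Inspecting the response

# Print the HTTP response headers returned by the server
print(response.headers)
# Print the cookies contained in the response
print(response.cookies)
# Print the HTTP request headers that were sent
print(response.request.headers)
# Print the final URL of the request
print(response.url)
# Print the decoded URL of the request
print(requests.utils.unquote(response.url))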
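
The request headers printed above usually contain more than the headers passed in, because requests merges them with its own defaults. A minimal offline sketch, assuming Session.prepare_request() merges headers the same way a real requests.get() call does; the URL and the shortened User-Agent are placeholders.

```python
import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

session = requests.Session()
request = requests.Request("GET", "https://example.com/", headers=headers)

# prepare_request() merges the session's default headers with ours.
prepared = session.prepare_request(request)

for name, value in prepared.headers.items():
    print(f"{name}: {value}")
```

The custom User-Agent replaces the default one, while defaults such as Accept-Encoding are added alongside it.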

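Sending POST requests with requests

Question: where would we use a POST request?

Login and registration (POST carries the data in the request body rather than in the URL, so it stays out of the address bar and browser history; in that sense it is safer than GET, though only HTTPS actually encrypts it)
Sending large amounts of text (POST does not limit the data length the way a URL does)

So our crawler likewise needs to simulate the browser and send POST requests in these two situations.

POST request usage:

response = requests.post("http://www.baidu.com/", data = data, headers=headers)

Form of data: a dictionary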

import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}

url = 'https://www.test.com/index.php?c=trans&m=fy&client=6&auth_user=key_ciba&sign=99730f3bf66b2582'

data = {
    'from': 'zh',
    'to': 'en',
    'q': 'w'
}

res = requests.post(url, headers=headers, data=data)
print(res)
print(res.status_code)

# Output
<Response [200]>
200
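
What the data dictionary turns into on the wire can be inspected without contacting any server. This is a minimal sketch that assumes a prepared request reflects what requests.post() would send; the URL is a placeholder, not the translation endpoint used above.

```python
import requests

data = {"from": "zh", "to": "en", "q": "w"}

# Placeholder URL; nothing is sent, the request is only prepared.
prepared = requests.Request(
    "POST", "https://example.com/translate", data=data
).prepare()

print(prepared.method)                   # POST
print(prepared.headers["Content-Type"])  # application/x-www-form-urlencoded
print(prepared.body)                     # from=zh&to=en&q=w
```

The dictionary is form-encoded into the request body, which is why none of it appears in the URL.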

SSL certificate errors:

requests.exceptions.SSLError: HTTPSConnectionPool(host='www.xxxxxx.com', port=443): Max retries exceeded with url: /content/13264.html (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1002)')))

# Import the warning class needed to silence the SSL certificate warning
from requests.packages.urllib3.exceptions import InsecureRequestWarning

# Silence the SSL certificate warning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
# Pass verify=False to skip certificate verification (the server's identity is then no longer checked)
response = requests.get(url, headers=headers, params=params, verify=False)
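
Turning verification off hides the error rather than fixing it. One alternative, sketched here under assumptions, is to keep verification on, optionally point verify at a CA bundle file, and catch requests.exceptions.SSLError. The helper name fetch and its ca_bundle parameter are hypothetical; passing a file path as verify is standard requests behaviour.

```python
import requests


def fetch(url, ca_bundle=None, timeout=10):
    """Fetch url with certificate verification on; return None on SSL failure.

    fetch and ca_bundle are illustrative names, not part of requests.
    """
    # verify=True uses the default CA bundle; a string is a path to a bundle file.
    verify = ca_bundle if ca_bundle else True
    try:
        return requests.get(url, verify=verify, timeout=timeout)
    except requests.exceptions.SSLError as exc:
        print(f"certificate verification failed for {url}: {exc}")
        return None
```

A call such as fetch(url, ca_bundle="/path/to/ca.pem") (the path is a placeholder) would then trust the site's issuing certificate without disabling checks for everything else.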

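Processing the data after response = requests.get()

For a text site, there are two ways:

1. Using data = response.content.decode()

data = response.content.decode('utf-8')    # the encoding is named here; setting response.encoding has no effect on response.content

decode() is a method of Python's bytes objects that converts a byte string into a str. Called with no argument it decodes as UTF-8, but you can name another encoding.

2. Using data = response.text

response.encoding = 'utf-8'    # needed when the response headers specify no charset, or the detected encoding is wrong
data = response.text

The advantage of response.text over calling decode('utf-8') by hand is that requests handles the charset for you. If the response headers include a Content-Type field that names a charset (for example Content-Type: text/plain; charset=utf-8), requests decodes the body with that charset. If the Content-Type is a text type with no charset, requests falls back to ISO-8859-1 (also known as Latin-1), even though UTF-8 is far more common in modern applications; that is the case where response.encoding should be set by hand. With that caveat, response.text is the more robust choice in most cases.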
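
The fallback rule can be checked directly with requests.utils.get_encoding_from_headers, the helper requests uses to fill in response.encoding. The header dictionaries below are hand-written samples, assumed for illustration rather than captured from a real server.

```python
from requests.utils import get_encoding_from_headers

with_charset = {"content-type": "text/plain; charset=utf-8"}
text_no_charset = {"content-type": "text/html"}
no_content_type = {}

print(get_encoding_from_headers(with_charset))     # utf-8
print(get_encoding_from_headers(text_no_charset))  # ISO-8859-1
print(get_encoding_from_headers(no_content_type))  # None
```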

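If the response is JSON:

response.encoding = 'utf-8'    # needed when the response headers specify no charset, or the detected encoding is wrong
data = response.json()

response.encoding = 'utf-8'

This line explicitly sets the response object's character encoding to 'utf-8'. requests normally infers the encoding from the Content-Type field in the response headers, but the inference can be wrong, or the headers may carry no encoding information at all. In that case you can set the encoding by hand.

data = response.json()

This line calls the response object's json() method, which tries to parse the response body as JSON. Internally, json() takes the raw bytes from response.content, decodes them into a string using response.encoding when it is set (when no encoding is set, it detects which UTF encoding the bytes use), and then parses that string with Python's json module.

The advantage of calling response.json() directly is that it shortens the path from HTTP response to JSON data. You do not need to decode the bytes yourself or call json.loads() on the string. As long as the body is valid JSON and the encoding is right, response.json() returns a Python dict or list, depending on the structure of the JSON data.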
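
The equivalence between response.json() and the manual decode-then-parse route can be checked offline. As a sketch under assumptions, the Response below is built by hand through the private _content attribute (a testing shortcut, not part of the original examples), and the JSON payload is invented.

```python
import json
import requests

# Sample JSON body, invented for illustration.
payload = {"status": 1, "content": {"out": "你好"}}

response = requests.Response()
# Private attribute, set only to avoid a real request.
response._content = json.dumps(payload, ensure_ascii=False).encode("utf-8")
response.encoding = "utf-8"

# One step: decode and parse.
data = response.json()

# The equivalent two manual steps.
manual = json.loads(response.content.decode("utf-8"))

print(data == manual)
print(type(data).__name__)
```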

Posted 2024-01-23 17:17 by Magiclala
