Python Crawler Study Notes (2): The requests Library
I. The urllib Library
1. About urllib
urllib is Python's built-in HTTP request library. It consists of:
- urllib.request: the request module
- urllib.error: the exception-handling module
- urllib.parse: the URL-parsing module
- urllib.robotparser: the robots.txt-parsing module
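The parsing modules above can be tried out without any network access. A minimal sketch of urllib.parse (the example URL matches the httpbin.org examples used later):

```python
from urllib.parse import urlparse, urlencode

# Split a URL into its components
parts = urlparse("http://httpbin.org/get?name=zhaofan&age=23")
print(parts.scheme)   # http
print(parts.netloc)   # httpbin.org
print(parts.query)    # name=zhaofan&age=23

# Build a query string from a dict (the reverse operation)
query = urlencode({"name": "zhaofan", "age": 23})
print(query)          # name=zhaofan&age=23
```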
II. The requests Library
1. Simple usage

```python
import requests

response = requests.get(url)          # url is a URL string of your choosing
print(type(response))
print(response.status_code)
print(response.cookies)
print(response.text)
print(response.content)
print(response.content.decode("utf-8"))
```
Note:
In many cases response.text comes back garbled, so response.content is often used instead: it returns the raw bytes of the body, which can then be converted to a string with decode("utf-8").
The garbling can also be avoided by setting the encoding explicitly:

```python
response = requests.get(url)
response.encoding = 'utf-8'
print(response.text)
```
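The difference between the two approaches can be seen without any network request; the bytes below simulate a UTF-8 response body, standing in for response.content:

```python
# Simulated response.content: UTF-8 encoded bytes, as requests would return them
raw = "爬虫".encode("utf-8")

# Decoding with the wrong codec produces mojibake
print(raw.decode("latin-1"))

# Decoding with the correct codec recovers the original text
print(raw.decode("utf-8"))   # 爬虫
```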
2. Making requests
- GET requests
(1) Basic GET request
(2) GET request with parameters
Parameters can be appended directly to the URL (get?key=val):

```python
import requests

response = requests.get("http://httpbin.org/get?name=zhaofan&age=23")
print(response.text)
```
Parameters can also be passed through the params keyword:

```python
data = {"name": "zhaofan", "age": 22}
response = requests.get("http://httpbin.org/get", params=data)
print(response.url)
print(response.text)
```
- Parsing JSON: response.json() runs json.loads() internally, so the two produce identical results:

```python
import json
import requests

response = requests.get("http://httpbin.org/get")
print(response.json())
print(json.loads(response.text))
```
- Adding headers: some sites (such as Zhihu) reject a bare requests call by default.
Enter chrome://version in Chrome to see your user agent, then add it to the request headers:

```python
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
}
response = requests.get("https://www.zhihu.com", headers=headers)
print(response.text)
```
- POST requests
Pass form data through the data parameter:

```python
import requests

data = {"name": "zhaofan", "age": 23}
response = requests.post("http://httpbin.org/post", data=data)
print(response.text)
```
- Responses
The response object exposes many useful attributes:

```python
import requests

response = requests.get("http://www.baidu.com")
print(response.status_code)
print(response.headers)
print(response.cookies)
print(response.url)
print(response.history)
```
Checking status codes: requests maps status codes to readable names, for example:
- 202: accepted
- 404: not_found
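These names are available as attributes of requests.codes (e.g. requests.codes.not_found == 404). The same mapping also exists in the standard library's http.HTTPStatus, which this sketch uses so it runs even without requests installed:

```python
from http import HTTPStatus

def describe(status_code: int) -> str:
    """Map a numeric status code to its standard reason phrase."""
    return HTTPStatus(status_code).phrase

print(describe(202))  # Accepted
print(describe(404))  # Not Found
```

In practice you would compare response.status_code against these named values instead of bare numbers, which makes the intent of the check easier to read.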