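Web Scraping with the requests Module
Python 3 offers two commonly used modules for sending HTTP requests: urllib and requests. Because requests is simpler and more convenient to use than urllib, this article covers only requests.
Installation: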
pip install requests
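GET requests:
GET is one of the most common HTTP request methods. This section walks through building GET requests with the requests module.
Start with the simplest possible GET request. The url variable holds the request URL; when httpbin.org receives a GET request at this endpoint, it echoes back the details of that request.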
    import requests

    url = 'http://httpbin.org/get'
    # Send a GET request and print the response body as text
    response = requests.get(url=url)
    print(response.text)

Output:

    {
      "args": {},
      "headers": {
        "Accept": "*/*",
        "Accept-Encoding": "gzip, deflate",
        "Host": "httpbin.org",
        "User-Agent": "python-requests/2.25.1",
        "X-Amzn-Trace-Id": "Root=1-6069d800-43b4f5da49eb42f770c9dc90"
      },
      "origin": "113.118.77.36",
      "url": "http://httpbin.org/get"
    }
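Real requests can hang or fail, so a slightly more defensive variant of the request above may be useful. The sketch below is one possible approach and not part of the original example: the helper name fetch_text and the 10-second timeout are assumptions chosen for illustration.

```python
import requests


def fetch_text(url, timeout=10):
    """Return the response body as text, or None if the request fails.

    fetch_text is a hypothetical helper; the timeout value is an assumption.
    """
    try:
        response = requests.get(url, timeout=timeout)
        # Raise requests.HTTPError for 4xx/5xx status codes
        response.raise_for_status()
    except requests.RequestException as exc:
        # RequestException is the base class for errors raised by requests
        print(f"Request failed: {exc}")
        return None
    return response.text
```

Calling fetch_text('http://httpbin.org/get') should return the same JSON text as the plain example when the site is reachable, and None when the request cannot be completed.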
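To attach extra information to a GET request, pass it as a dictionary through the params argument; requests encodes it into the URL query string.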
    import requests

    url = 'http://httpbin.org/get'
    # Query parameters to append to the URL
    params = {
        'name': 'germey',
        'age': 22
    }
    response = requests.get(url=url, params=params)
    print(response.text)

Output:

    {
      "args": {
        "age": "22",
        "name": "germey"
      },
      "headers": {
        "Accept": "*/*",
        "Accept-Encoding": "gzip, deflate",
        "Host": "httpbin.org",
        "User-Agent": "python-requests/2.25.1",
        "X-Amzn-Trace-Id": "Root=1-6069d97d-10bea5df63cec72311101582"
      },
      "origin": "113.118.77.36",
      "url": "http://httpbin.org/get?name=germey&age=22"
    }
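To see how params ends up in the URL without sending anything over the network, one option is to build the request and inspect it before it is sent. This is a minimal sketch using requests.Request and its prepare() method; it reuses the same url and params values as the GET example.

```python
import requests

url = 'http://httpbin.org/get'
params = {'name': 'germey', 'age': 22}

# Build the request without sending it, then inspect the final URL
prepared = requests.Request('GET', url, params=params).prepare()
print(prepared.url)  # http://httpbin.org/get?name=germey&age=22
```

The printed URL matches the "url" field that httpbin echoes back, which shows that the dictionary is simply URL-encoded into the query string.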
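If the response body is JSON, call the response's json() method to parse it into Python objects. If it is binary data (an image or a file, for example), read the content attribute, which holds the raw bytes. Note that content is an attribute, not a method, so it takes no parentheses.
    response.json()
    response.content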
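As an illustration of choosing between the two, the hypothetical helper below (parse_body is not part of requests) inspects the Content-Type response header and returns parsed JSON when the server reports JSON, and raw bytes otherwise. Relying on the header this way is an assumption about how the server labels its responses, so treat it as a sketch.

```python
import requests


def parse_body(response):
    """Return parsed JSON for JSON responses, raw bytes otherwise.

    parse_body is a hypothetical helper for illustration only.
    """
    content_type = response.headers.get('Content-Type', '')
    if 'application/json' in content_type:
        # json() decodes the body and parses it into Python objects
        return response.json()
    # content is an attribute holding the undecoded bytes
    return response.content
```

For example, parse_body(requests.get('http://httpbin.org/get')) should return a dictionary, while a response for an image would yield bytes that can be written to a file opened in 'wb' mode.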
POST requests:
With basic GET requests covered, the other common request type is POST. Sending a POST request with requests is just as simple.
    import requests

    url = 'http://httpbin.org/post'
    # Form fields to send in the request body
    data = {
        'name': 'germey',
        'age': 22
    }
    response = requests.post(url=url, data=data)
    print(response.text)

Output:

    {
      "args": {},
      "data": "",
      "files": {},
      "form": {
        "age": "22",
        "name": "germey"
      },
      "headers": {
        "Accept": "*/*",
        "Accept-Encoding": "gzip, deflate",
        "Content-Length": "18",
        "Content-Type": "application/x-www-form-urlencoded",
        "Host": "httpbin.org",
        "User-Agent": "python-requests/2.25.1",
        "X-Amzn-Trace-Id": "Root=1-6069dbb2-20c01c0a048c8d0c239cdf28"
      },
      "json": null,
      "origin": "113.118.77.36",
      "url": "http://httpbin.org/post"
    }
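The "form" and "Content-Type" fields that httpbin echoes back reflect how requests encodes the data argument. The sketch below prepares the same POST request without sending it, and contrasts it with the json argument, which the original example does not use and which appears here only for comparison.

```python
import requests

url = 'http://httpbin.org/post'
data = {'name': 'germey', 'age': 22}

# data= is sent as a URL-encoded form body
form_request = requests.Request('POST', url, data=data).prepare()
print(form_request.headers['Content-Type'])  # application/x-www-form-urlencoded
print(form_request.body)                     # name=germey&age=22

# json= serializes the dictionary to a JSON body instead
json_request = requests.Request('POST', url, json=data).prepare()
print(json_request.headers['Content-Type'])  # application/json
print(json_request.body)
```

With data=, httpbin reports the values under "form"; with json=, they would be expected under "json" instead.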
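In practice you often need to pass a headers argument that sets a browser-like User-Agent (UA spoofing), because some sites reject requests that carry the default python-requests User-Agent.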
    import requests

    # Present a browser-like User-Agent instead of the python-requests default
    header = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15'
    }
    url = 'https://www.baidu.com'
    response = requests.get(url, headers=header)
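When many requests need the same headers, one common pattern is to set them once on a requests.Session rather than passing headers on every call. The sketch below is an assumption about how you might organize this, not something the original example does; it only prepares the request, so the merged headers can be inspected without contacting the site.

```python
import requests

header = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) '
                  'AppleWebKit/605.1.15 (KHTML, like Gecko) '
                  'Version/13.1.1 Safari/605.1.15'
}

session = requests.Session()
# The default User-Agent looks like "python-requests/<version>"
print(session.headers['User-Agent'])

# Headers set on the session apply to every request made through it
session.headers.update(header)
prepared = session.prepare_request(requests.Request('GET', 'https://www.baidu.com'))
print(prepared.headers['User-Agent'])
```

After the update, session.get(url) would send the browser-like User-Agent automatically on every call.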