python—网络爬虫(HTTP协议及Requests库方法)
HTTP协议:超文本传输协议。
HTTP是一个基于“请求与响应”模式的、无状态的应用层协议。
HTTP协议采用URL作为定位网络资源的标识。
URL格式: http://host[:port][path]
host: 合法的Internet主机域名或IP地址
port: 端口号,缺省端口为80
path: 请求资源路径
URL是通过HTTP协议存取资源的Internet路径,一个URL对应一个数据资源。
HTTP协议对资源的操作:
方法 | 说明 |
GET | 请求获取URL位置的资源 |
HEAD | 请求URL位置资源的响应消息报告,即获得该资源的头部信息 |
POST | 请求向URL位置的资源后附加新的数据 |
PUT | 请求向URL位置存储一个资源,覆盖原URL位置的资源 |
PATCH | 请求局部更新URL位置的资源,即该处资源的部分内容 |
DELETE | 请求删除URL位置存储的资源 |
理解PATCH和PUT的区别:
假设URL位置有一组数据UserInfo,包括UserID、UserName等20个字段。
需求:用户修改了UserName,其他不变。
采用PATCH,仅向URL提交UserName的局部更新请求。
采用PUT,必须将所有20个字段一并提交到URL,未提交字段被删除。
PATCH的最大好处:节省网络带宽。
HTTP协议与Request库功能比较
HTTP协议方法 | Request库方法 | 功能一致性 |
GET | requests.get() | 一致 |
HEAD | requests.head() | 一致 |
POST | requests.post() | 一致 |
PUT | requests.put() | 一致 |
PATCH | requests.patch() | 一致 |
DELETE | requests.delete() | 一致 |
Requests库的head()方法:
>>> r = requests.head('http://httpbin.org/get')
>>> r.headers
{'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Date': 'Wed, 01 Aug 2018 11:52:17 GMT', 'Content-Type': 'application/json', 'Content-Length': '267', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true', 'Via': '1.1 vegur'}
>>> r.text #显示内容为空
''
>>>
Requests库的post()方法:
向URL POST一个字典,自动编码为form(表单)
>>> payload = {'key1':'value1', 'key2':'value2'}
>>> r = requests.post('http://httpbin.org/post', data=payload)
>>> print(r.text)
{
"args": {},
"data": "",
"files": {},
"form": {
"key1": "value1",
"key2": "value2"
},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Connection": "close",
"Content-Length": "23",
"Content-Type": "application/x-www-form-urlencoded",
"Host": "httpbin.org",
"User-Agent": "python-requests/2.19.1"
},
"json": null,
"origin": "117.136.25.106",
"url": "http://httpbin.org/post"
}
>>>
向URL POST一个字符串,自动编码为data
>>> r = requests.post('http://httpbin.org/post',data='ABC')
>>> print(r.text)
{
"args": {},
"data": "ABC",
"files": {},
"form": {},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Connection": "close",
"Content-Length": "3",
"Host": "httpbin.org",
"User-Agent": "python-requests/2.19.1"
},
"json": null,
"origin": "117.136.25.106",
"url": "http://httpbin.org/post"
}
>>>
Requests库的put()方法: 它与post方法类似,只不过会将原有内容覆盖掉
>>> payload = {'key':'value1', 'key2':'value2'}
>>> r = requests.put('http://httpbin.org/put',data = payload)
>>> print(r.text)
{
"args": {},
"data": "",
"files": {},
"form": {
"key": "value1",
"key2": "value2"
},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Connection": "close",
"Content-Length": "22",
"Content-Type": "application/x-www-form-urlencoded",
"Host": "httpbin.org",
"User-Agent": "python-requests/2.19.1"
},
"json": null,
"origin": "117.136.25.106",
"url": "http://httpbin.org/put"
}