The Requests crawler library and parsing tools: XPath, regular expressions, Beautiful Soup

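Contents of the "Python crawler series":

Python crawlers (1) - Essential basics

Python crawlers (2) - The Requests library and the XPath parsing tool

Python crawlers (3) - The Scrapy framework series

scrapy (1) - Basic usage

scrapy (2) - GET requests

scrapy (3) - POST requests

scrapy (4) - Passing parameters between requests

scrapy (5) - Scraping content from second-level pages

scrapy (6) - Using CrawlSpider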

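Part 1: Requests

I. Introduction

Requests is an HTTP library written in Python, built on urllib3 and released under the Apache 2.0 license. It is one of the most frequently used libraries for writing crawlers, so this post summarizes its usage in the hope that it helps later readers.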

II. Installation

pip install requests

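III. Usage

1 A practical example

1.1 Fetching the Baidu home page

# Import the Requests library
import requests
# Send a GET request
r = requests.get('https://www.baidu.com/')
# Inspect the response type: requests.models.Response
print(type(r))
# Print the status code, e.g. 200
print(r.status_code)
# Print the type of the response body: str
print(type(r.text))
# Print the response body
print(r.text)
# Print the cookies
print(r.cookies)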

1.2 Other request methods

import requests
# Send a POST request
requests.post('http://httpbin.org/post')
# Send a PUT request
requests.put('http://httpbin.org/put')
# Send a DELETE request
requests.delete('http://httpbin.org/delete')
# Send a HEAD request
requests.head('http://httpbin.org/get')
# Send an OPTIONS request
requests.options('http://httpbin.org/get')

2 GET requests

2.1 GET without parameters

import requests
response = requests.get('http://httpbin.org/get')
print(response.text)

2.2 GET with parameters

2.2.1 Parameters written into the URL

import requests
response = requests.get('http://httpbin.org/get?name=jyx&age=18')
print(response.text)

2.2.2 Parameters passed through params

import requests
# Query parameters for the GET request
param = {'name': 'ide', 'city': 'New York'}
# Pass them with params; Requests encodes them into the URL query string
response = requests.get('http://httpbin.org/get', params=param)
print(response.text)
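
To see what params does to the URL, the request can be prepared without being sent. The following is a minimal sketch that reuses the dictionary from 2.2.2; it needs no network access, and the variable name prepared is only illustrative.

```python
import requests

# Build the request without sending it, so no network access is needed.
param = {'name': 'ide', 'city': 'New York'}
prepared = requests.Request('GET', 'http://httpbin.org/get', params=param).prepare()

# The dict has been URL-encoded and appended as the query string.
print(prepared.url)
```

The printed URL carries the same query string that would otherwise have to be written by hand as in 2.2.1, with the space in the city name encoded.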

3 POST requests

3.1 Sending form data

import requests
# Form fields for the POST request
param = {'name': 'ide', 'city': 'New York'}
# Pass them with data; they are sent form-encoded in the request body
response = requests.post('http://httpbin.org/post', data=param)
print(response.text)

3.2 Sending JSON data

import json
import requests
# Payload for the POST request
param = {'name': 'ide', 'city': 'New York'}
# Serialize the dict to a JSON string and send it as the request body
response = requests.post('http://httpbin.org/post', data=json.dumps(param))
print(response.text)
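
The difference between the two POST styles shows up in the Content-Type header and the body. The sketch below prepares the same payload three ways without sending anything. The json= keyword is not used in the examples above; it is included for comparison because it serializes the dict and sets the header in one step.

```python
import json
import requests

url = 'http://httpbin.org/post'
param = {'name': 'ide', 'city': 'New York'}

# Prepare (but do not send) the same payload in three ways.
as_form = requests.Request('POST', url, data=param).prepare()
as_json_string = requests.Request('POST', url, data=json.dumps(param)).prepare()
as_json = requests.Request('POST', url, json=param).prepare()

for name, req in [('data=dict', as_form),
                  ('data=json.dumps(...)', as_json_string),
                  ('json=dict', as_json)]:
    # Content-Type is missing when a plain string is passed as data.
    print(name, '|', req.headers.get('Content-Type'), '|', req.body)
```

If a server insists on a JSON Content-Type, passing json=param (or adding the header by hand) is likely the safer choice than data=json.dumps(param) alone.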

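3.3 Sending files

To upload images, documents, or other files to a site, use the files parameter.

import requests
# Open the file in binary mode; the with block closes it after the upload
with open('default.png', 'rb') as f:
    # Pass the file through the files parameter
    response = requests.post('http://httpbin.org/post', files={'file': f})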
print(response.text)

4 Fetching binary data

import requests
response = requests.get('http://l.bst.126.net/rsc/img/loginopen/201406/appstore/quanzi.jpg?v=001')
# Print the response body as raw bytes
print(response.content)
# Save the binary data to a local file; the with block closes the file itself
with open('quanzi.jpg', 'wb') as f:
    f.write(response.content)
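
For large files, reading response.content loads the whole body into memory. A possible alternative is to stream the body in chunks. The sketch below assumes a chunk size of 8192 bytes and a 10 second timeout, and the helper names write_chunks and download are made up for this example.

```python
import requests

def write_chunks(chunks, path):
    """Write an iterable of byte chunks to path and return the byte count."""
    total = 0
    with open(path, 'wb') as f:
        for chunk in chunks:
            f.write(chunk)
            total += len(chunk)
    return total

def download(url, path, chunk_size=8192):
    """Stream url to path without holding the whole body in memory."""
    # stream=True defers the body download until iter_content() is consumed.
    with requests.get(url, stream=True, timeout=10) as response:
        response.raise_for_status()
        return write_chunks(response.iter_content(chunk_size=chunk_size), path)
```

Calling download(url, 'quanzi.jpg') with the image URL above would save the same file as the example, one chunk at a time.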

5 Setting headers

import requests

# Set the User-Agent so the request looks like it comes from a browser
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36",
    # 'content-type': 'application/json'
}
# Send the request with the custom headers
response = requests.get('https://www.zhihu.com/question/37787004', headers=headers)
print(response.text)
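
Setting the User-Agent matters because Requests otherwise identifies itself as python-requests, which some sites reject. The sketch below inspects the default value and shows a custom header overriding it. Nothing is sent over the network, and the browser string is a placeholder.

```python
import requests
from requests.utils import default_user_agent

# What Requests sends when no User-Agent is given.
print(default_user_agent())

session = requests.Session()
custom = {'User-Agent': 'Mozilla/5.0 (example browser string)'}
request = requests.Request('GET', 'http://httpbin.org/headers', headers=custom)
# prepare_request() merges the session defaults with the per-request headers.
prepared = session.prepare_request(request)

print(prepared.headers['User-Agent'])       # the custom value wins
print(prepared.headers['Accept-Encoding'])  # other defaults are still present
```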

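6 Encoding

You can check which encoding Requests is using to decode the response, and change it:

# r is a Response object, such as the one from section 1.1
print(r.encoding)
r.encoding = 'ISO-8859-1'

After the encoding is changed, Requests uses the new value of r.encoding every time r.text is accessed.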
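
The effect of r.encoding on r.text can be reproduced offline. This sketch builds a Response by hand, which is a demonstration trick and not how responses are normally obtained, and assumes the server sent GBK-encoded bytes.

```python
import io
import requests

# Build a Response by hand so the demo needs no network access.
# Assumption: the server sent GBK-encoded bytes.
response = requests.Response()
response.status_code = 200
response.raw = io.BytesIO('中文编码'.encode('gbk'))

response.encoding = 'ISO-8859-1'   # wrong encoding: the text is garbled
garbled = response.text

response.encoding = 'gbk'          # correct encoding: the text is readable
readable = response.text

print(repr(garbled))
print(readable)
```

When the right encoding is unknown, assigning r.encoding = r.apparent_encoding is a common idiom; the detection is a guess and can be wrong on short bodies.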

7 Response attributes

import requests

response = requests.get('http://www.jianshu.com/')
# Get the response status code
print(type(response.status_code), response.status_code)
# Get the response headers
print(type(response.headers), response.headers)
# Get the cookies set by the response
print(type(response.cookies), response.cookies)
# Get the final URL that was requested
print(type(response.url), response.url)
# Get the redirect history
print(type(response.history), response.history)

8 Status code names built into requests

100: ('continue',),
101: ('switching_protocols',),
102: ('processing',),
103: ('checkpoint',),
122: ('uri_too_long', 'request_uri_too_long'),
200: ('ok', 'okay', 'all_ok', 'all_okay', 'all_good', '\\o/', '✓'),
201: ('created',),
202: ('accepted',),
203: ('non_authoritative_info', 'non_authoritative_information'),
204: ('no_content',),
205: ('reset_content', 'reset'),
206: ('partial_content', 'partial'),
207: ('multi_status', 'multiple_status', 'multi_stati', 'multiple_stati'),
208: ('already_reported',),
226: ('im_used',),

# Redirection.
300: ('multiple_choices',),
301: ('moved_permanently', 'moved', '\\o-'),
302: ('found',),
303: ('see_other', 'other'),
304: ('not_modified',),
305: ('use_proxy',),
306: ('switch_proxy',),
307: ('temporary_redirect', 'temporary_moved', 'temporary'),
308: ('permanent_redirect', 'resume_incomplete', 'resume',), # These 2 to be removed in 3.0

# Client Error.
400: ('bad_request', 'bad'),
401: ('unauthorized',),
402: ('payment_required', 'payment'),
403: ('forbidden',),
404: ('not_found', '-o-'),
405: ('method_not_allowed', 'not_allowed'),
406: ('not_acceptable',),
407: ('proxy_authentication_required', 'proxy_auth', 'proxy_authentication'),
408: ('request_timeout', 'timeout'),
409: ('conflict',),
410: ('gone',),
411: ('length_required',),
412: ('precondition_failed', 'precondition'),
413: ('request_entity_too_large',),
414: ('request_uri_too_large',),
415: ('unsupported_media_type', 'unsupported_media', 'media_type'),
416: ('requested_range_not_satisfiable', 'requested_range', 'range_not_satisfiable'),
417: ('expectation_failed',),
418: ('im_a_teapot', 'teapot', 'i_am_a_teapot'),
421: ('misdirected_request',),
422: ('unprocessable_entity', 'unprocessable'),
423: ('locked',),
424: ('failed_dependency', 'dependency'),
425: ('unordered_collection', 'unordered'),
426: ('upgrade_required', 'upgrade'),
428: ('precondition_required', 'precondition'),
429: ('too_many_requests', 'too_many'),
431: ('header_fields_too_large', 'fields_too_large'),
444: ('no_response', 'none'),
449: ('retry_with', 'retry'),
450: ('blocked_by_windows_parental_controls', 'parental_controls'),
451: ('unavailable_for_legal_reasons', 'legal_reasons'),
499: ('client_closed_request',),

# Server Error.
500: ('internal_server_error', 'server_error', '/o\\', '✗'),
501: ('not_implemented',),
502: ('bad_gateway',),
503: ('service_unavailable', 'unavailable'),
504: ('gateway_timeout',),
505: ('http_version_not_supported', 'http_version'),
506: ('variant_also_negotiates',),
507: ('insufficient_storage',),
509: ('bandwidth_limit_exceeded', 'bandwidth'),
510: ('not_extended',),
511: ('network_authentication_required', 'network_auth', 'network_authentication')
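
Each name in the table above is available as an attribute of requests.codes, which lets code compare against a readable name instead of a bare number. A minimal sketch follows; the is_success helper and the hand-built Response are illustrative only.

```python
import requests

# Each name in the table is an attribute of requests.codes.
print(requests.codes.ok)          # 200
print(requests.codes.not_found)   # 404
print(requests.codes.teapot)      # 418

def is_success(response):
    """Return True when the response status equals the built-in 'ok' code."""
    return response.status_code == requests.codes.ok

# A hand-built Response keeps the demo offline; real code would use requests.get().
fake = requests.Response()
fake.status_code = 404
print(is_success(fake))           # False
```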

9 Getting and sending cookies

import requests

response = requests.get('https://www.baidu.com')
print(response.cookies)
for key, value in response.cookies.items():
    print(key, '=====', value)
# get() returns None instead of raising KeyError when the cookie is absent
print(response.cookies.get('BAIDUID'))

Sending custom cookies with a request:

url = 'http://httpbin.org/cookies'
cookies = {'mycookies': 'working'}
response = requests.get(url, cookies=cookies)
print(response.text)

10 Keeping state with a session

import requests

# Create a Session object
session = requests.Session()
# Requests made through the same session share cookies
session.get('http://httpbin.org/cookies/set/number/12456')
response = session.get('http://httpbin.org/cookies')
print(response.text)
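
What a session carries from one request to the next can be inspected without touching the network. In this sketch the cookie is set by hand rather than by a server, and the User-Agent string is a made-up value.

```python
import requests

session = requests.Session()
# Defaults stored on the session apply to every request made through it.
session.headers.update({'User-Agent': 'my-crawler/0.1'})   # hypothetical UA string
session.cookies.set('number', '12456')

# Prepare two requests without sending them and inspect what would go out.
first = session.prepare_request(requests.Request('GET', 'http://httpbin.org/cookies'))
second = session.prepare_request(requests.Request('GET', 'http://httpbin.org/get'))

for prepared in (first, second):
    print(prepared.url, prepared.headers['User-Agent'], prepared.headers['Cookie'])
```

Both prepared requests carry the same cookie and headers, which is what keeps a login alive across requests in a crawler.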

11 HTTPS access

11.1 Access without a local certificate

import requests

response = requests.get('https://www.12306.cn')
# For HTTPS URLs, Requests verifies the server certificate and raises an SSLError if verification fails
print(response.status_code)

11.2 Disabling certificate verification

import requests

# Verification is disabled, but an InsecureRequestWarning is still emitted
response = requests.get('https://www.12306.cn', verify=False)
print(response.status_code)

11.3 Silencing the warning after disabling verification

import urllib3  # installed together with Requests, so it can be imported directly
import requests

# Turn off the warnings
urllib3.disable_warnings()
response = requests.get('https://www.12306.cn', verify=False)
print(response.status_code)

11.4 Supplying a certificate manually

import requests

# Use a local client certificate: (certificate file, private key file)
response = requests.get('https://www.12306.cn', cert=('/path/server.crt', '/path/key'))
print(response.status_code)

12 Proxies

12.1 Plain proxy

import requests

proxies = {
    "http": "http://127.0.0.1:9743",
    "https": "https://127.0.0.1:9743",
}
# Route the request through the proxies
response = requests.get("https://www.taobao.com", proxies=proxies)
print(response.status_code)

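12.2 Proxy that requires authentication

import requests

proxies = {
    "http": "http://user:password@127.0.0.1:9743/",
    # The target URL below is https, so an https entry is needed for the proxy to apply
    "https": "http://user:password@127.0.0.1:9743/",
}
response = requests.get("https://www.taobao.com", proxies=proxies)
print(response.status_code)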

12.3 SOCKS proxy

# SOCKS support needs an extra dependency: pip install requests[socks]
import requests

proxies = {
    'http': 'socks5://127.0.0.1:9742',
    'https': 'socks5://127.0.0.1:9742'
}
response = requests.get("https://www.taobao.com", proxies=proxies)
print(response.status_code)
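
Which entry of the proxies dictionary applies depends on the scheme of the target URL, not of the proxy. The sketch below uses select_proxy, a helper from requests.utils, purely to inspect that lookup. The addresses are the placeholder values from the examples above.

```python
from requests.utils import select_proxy

both = {
    'http': 'http://127.0.0.1:9743',
    'https': 'http://127.0.0.1:9743',
}
http_only = {'http': 'http://127.0.0.1:9743'}

# The proxy is looked up by the scheme of the target URL.
chosen = select_proxy('https://www.taobao.com', both)
missing = select_proxy('https://www.taobao.com', http_only)

print(chosen)    # the https entry
print(missing)   # None: no proxy is configured for https URLs
```

A dictionary with only an http key therefore leaves https requests without a proxy, so it is usually worth defining both keys.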

13 Timeouts

import requests
from requests.exceptions import ReadTimeout

try:
    # Wait at most 0.5 s for the server; a ReadTimeout is raised if the response does not arrive in time
    response = requests.get("http://httpbin.org/get", timeout=0.5)
    print(response.status_code)
except ReadTimeout:
    print('Timeout')

14 Parsing JSON

Requests has a built-in JSON decoder to help you handle JSON data.

import requests

response = requests.get('https://github.com/timeline.json')
print(response.json())

If JSON decoding fails, response.json() raises an exception.
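
One way to handle that exception is to wrap the call in a small helper. The function name json_or_default is made up for this sketch. It relies on the decoding error being a subclass of ValueError, which holds for the standard json module and, as far as I know, for the exception class newer Requests versions raise.

```python
import requests

def json_or_default(response, default=None):
    """Return the decoded JSON body, or default when the body is not valid JSON."""
    try:
        return response.json()
    except ValueError:
        # The decoding error raised by response.json() is a ValueError subclass.
        return default

# Usage (needs network access):
# data = json_or_default(requests.get('http://httpbin.org/get', timeout=10), default={})
```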

15 HTTP authentication

import requests
from requests.auth import HTTPBasicAuth

response = requests.get('http://120.27.34.24:9001', auth=HTTPBasicAuth('user', '123'))
print(response.status_code)
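
HTTPBasicAuth only adds an Authorization header to the request. The sketch below prepares a request without sending it to show the header, and also shows the (user, password) tuple shorthand that Requests accepts for basic auth. The httpbin URL is a placeholder, since nothing is sent.

```python
import requests
from requests.auth import HTTPBasicAuth

url = 'http://httpbin.org/basic-auth/user/123'   # placeholder; the request is never sent

# Prepare the request both ways without sending it.
explicit = requests.Request('GET', url, auth=HTTPBasicAuth('user', '123')).prepare()
shorthand = requests.Request('GET', url, auth=('user', '123')).prepare()

# Both carry the same "Authorization: Basic <base64 of user:123>" header.
print(explicit.headers['Authorization'])
print(shorthand.headers['Authorization'] == explicit.headers['Authorization'])
```

Because the credentials are only base64-encoded, not encrypted, basic auth is best used over HTTPS.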

16 Exception handling

import requests
from requests.exceptions import ReadTimeout, ConnectionError, RequestException

try:
    response = requests.get("http://httpbin.org/get", timeout=0.5)
    print(response.status_code)
except ReadTimeout:
    # The server did not respond in time
    print('Timeout')
except ConnectionError:
    # The connection could not be established
    print('Connection error')
except RequestException:
    # Any other error raised by Requests
    print('Error')
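
In a crawler the same handling is usually needed for every page, so it can be folded into one helper. This is a sketch under assumptions: the name fetch_text and the timeout values are illustrative, and raise_for_status() is added so that 4xx and 5xx responses are treated as failures too.

```python
import requests
from requests.exceptions import RequestException, Timeout

def fetch_text(url, timeout=(3.05, 10)):
    """Return the page text, or None if the request fails for any reason.

    timeout is (connect seconds, read seconds); the values are illustrative.
    """
    try:
        response = requests.get(url, timeout=timeout)
        # Turn 4xx/5xx status codes into an HTTPError exception.
        response.raise_for_status()
        return response.text
    except Timeout:
        # Covers both connect and read timeouts.
        print('Timeout:', url)
    except RequestException as exc:
        # Base class of the exceptions Requests raises, HTTPError included.
        print('Request failed:', exc)
    return None
```

The order of the except clauses matters: Timeout is a subclass of RequestException, so the more specific class has to come first.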

17 Raw response content

If you want the raw socket response from the server, you can access r.raw, provided that stream=True was set on the initial request.

import requests

response = requests.get('https://github.com/timeline.json', stream=True)
print(response.raw)
print(response.raw.read(10))

Excerpted from: https://www.jianshu.com/p/50bdcb7cd5f6

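Part 2: Parsing tools

XPath

1. Introduction

XPath, short for XML Path Language, is a language for locating information in XML documents. It was originally designed for searching XML, but it works equally well on HTML, so a crawler can use XPath to extract the information it needs.

2. Installation

pip install lxml

3. Usage

1 A practical example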

from lxml import etree

text = '''
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
'''
# etree.HTML() parses the text and repairs the markup (the last <li> is not closed)
x_data = etree.HTML(text)
result = x_data.xpath('//li/a[@href="link4.html"]/text()')
print(result)     # ['fourth item']

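2 Differences between XML and HTML

(1) HTML tags are predefined; XML tags are defined by the author.
(2) HTML is designed to display data; XML is designed to transport data.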

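3 Common XPath path expressions

/ : select from the root node
// : select matching nodes anywhere in the document
. : select the current node
.. : select the parent of the current node
@ : select an attribute
* : match any element node
@* : match any attribute node
Usage:
    Locating by attribute
        //input[@id="kw"]
    Locating by hierarchy and by index
        //div[@class="head_wrapper"]/div[@id="u1"]/a[1]
        //div[@class="head_wrapper"]//a
    Fuzzy matching
        contains
            //a[contains(@class,"lb")]
            all a elements whose class attribute contains lb
            //a[contains(text(),"新")]
            all a elements whose text contains 新
        starts-with
            //a[starts-with(@class,"lb")]
            all a elements whose class attribute starts with lb
            //a[starts-with(text(),"更多")]
            all a elements whose text starts with 更多
    Getting text content
        //div[@id="u1"]/a[1]/text()
    Getting attribute values
        //div[@id="u1"]/a[2]/@href
        //div[@id="u1"]/img[1]/@src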
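
The locating styles listed above can be tried on a small document. This sketch reuses part of the list markup from the practical example. The expressions are adapted to that markup and are not taken from a real page.

```python
from lxml import etree

html = etree.HTML('''
<ul>
  <li class="item-0"><a href="link1.html">first item</a></li>
  <li class="item-1"><a href="link2.html">second item</a></li>
  <li class="item-inactive"><a href="link3.html">third item</a></li>
</ul>
''')

# Attribute match: exact value of class
exact = html.xpath('//li[@class="item-1"]/a/text()')
# Fuzzy match: class contains "inactive"
fuzzy = html.xpath('//li[contains(@class, "inactive")]/a/@href')
# Fuzzy match: link text starts with "first"
prefix = html.xpath('//a[starts-with(text(), "first")]/@href')
# Index match: the second li (XPath indexes start at 1, not 0)
second = html.xpath('//li[2]/a/text()')

print(exact, fuzzy, prefix, second)
```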

Examples:
    bookstore/book : all book nodes that are direct children of bookstore
    //book : all book nodes in the document
    bookstore//book : all book nodes under bookstore, whether children or deeper descendants
    //@lang : all attributes named lang
    bookstore/book[1] : the first book under bookstore
    bookstore/book[last()] : the last book under bookstore
    bookstore/book[last()-1] : the second-to-last book under bookstore
    //title[@lang] : all title nodes that have a lang attribute
    //title[@lang='eng'] : all title nodes whose lang attribute is eng
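
The bookstore expressions can be checked against a concrete document. The XML below is invented for this sketch, and the paths are written in absolute form (/bookstore/...) so that they can be evaluated from the root element.

```python
from lxml import etree

# A small bookstore document invented for this example.
root = etree.XML('''
<bookstore>
  <book><title lang="eng">Harry Potter</title></book>
  <book><title lang="eng">Learning XML</title></book>
  <book><title>Untitled Draft</title></book>
  <shelf><book><title lang="zh">Python</title></book></shelf>
</bookstore>
''')

children = root.xpath('/bookstore/book')                    # direct children only
descendants = root.xpath('/bookstore//book')                # children and deeper descendants
first = root.xpath('/bookstore/book[1]/title/text()')
last = root.xpath('/bookstore/book[last()]/title/text()')
second_last = root.xpath('/bookstore/book[last()-1]/title/text()')
english = root.xpath("//title[@lang='eng']/text()")
langs = root.xpath('//@lang')                               # attribute values, not elements

print(len(children), len(descendants))
print(first, last, second_last)
print(english, langs)
```

Note that //@lang returns the attribute values themselves, while //title[@lang] returns the title elements that carry the attribute.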

Posted 2021-01-05 17:41 by peng_li
