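Prelude to Writing a Web Crawler

(I) The requests module

requests is an elegant and simple HTTP library for Python. It is released under the Apache 2.0 license and provides a high-level wrapper around lower-level Python libraries such as urllib3.

Chinese documentation: http://docs.python-requests.org/zh_CN/latest/

Common methods: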

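requests.get(url, params=None, **kwargs)
# Sends a GET request to the server.
# url: the URL to request.
# params: a dictionary (or bytes) sent in the query string.
# Returns a Response object (as do all of the functions below).

requests.options(url, **kwargs)
# Sends an OPTIONS request to the server.
# url: the URL to request.

requests.head(url, **kwargs)
# Sends a HEAD request to the server.
# url: the URL to request.

requests.post(url, data=None, json=None, **kwargs)
# Sends a POST request to the server.
# url: the URL to request.
# data: a dictionary, bytes, or file-like object sent in the request body.
# json: a JSON-serializable object sent in the request body.

requests.put(url, data=None, **kwargs)
# Sends a PUT request to the server.
# url: the URL to request.
# data: a dictionary, bytes, or file-like object sent in the request body.

requests.patch(url, data=None, **kwargs)
# Sends a PATCH request to the server.
# url: the URL to request.
# data: a dictionary, bytes, or file-like object sent in the request body.

requests.delete(url, **kwargs)
# Sends a DELETE request to the server.
# url: the URL to request.

requests.request(method, url, **kwargs)
# Sends a request using the given HTTP method.
# method: the HTTP method to use.
# url: the URL to request.
# params: a dictionary or bytes sent in the query string.
# data: a dictionary, a list of tuples [(key, value)], bytes, or a file-like object sent in the request body.
# json: a JSON-serializable object sent in the request body.
# headers: a dictionary of HTTP request headers.
# cookies: a dictionary or CookieJar object.
# files: a dictionary of {'name': file-like object} for multipart uploads.
# auth: a tuple used for HTTP authentication.
# timeout: a float, or a (connect timeout, read timeout) tuple, in seconds.
# allow_redirects: a boolean that controls whether redirects are followed; defaults to True.
# proxies: a dictionary mapping protocol to proxy URL.
# verify: a boolean that controls whether the server's TLS certificate is verified, or a path to a CA bundle; defaults to True.
# stream: if False, the response body is downloaded immediately; if True, it is fetched only when accessed.
# cert: the client certificate; a path string for a single .pem file, or a ('xxx.crt', 'xxx.key') tuple for separate certificate and key files.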
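To see how a params dictionary ends up in the query string, the following sketch builds and prepares a request without sending it, so it runs offline. The URL and the parameter names are made up for illustration.

```python
import requests

# Hypothetical endpoint and query parameters, used only for illustration.
url = "https://httpbin.org/get"
query = {"q": "crawler", "page": 2}

# Request(...).prepare() builds the final request without sending it.
prepared = requests.Request("GET", url, params=query).prepare()

print(prepared.method)  # GET
print(prepared.url)     # https://httpbin.org/get?q=crawler&page=2
```

Calling requests.get(url, params=query) would send this same request over the network and return a Response object.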

Source code (requests/api.py):

# -*- coding: utf-8 -*-

"""
requests.api
~~~~~~~~~~~~

This module implements the Requests API.

:copyright: (c) 2012 by Kenneth Reitz.
:license: Apache2, see LICENSE for more details.
"""

from . import sessions


def request(method, url, **kwargs):
    """Constructs and sends a :class:`Request <Request>`.

    :param method: method for the new :class:`Request` object.
    :param url: URL for the new :class:`Request` object.
    :param params: (optional) Dictionary or bytes to be sent in the query string for the :class:`Request`.
    :param data: (optional) Dictionary or list of tuples ``[(key, value)]`` (will be form-encoded), bytes, or file-like object to send in the body of the :class:`Request`.
    :param json: (optional) json data to send in the body of the :class:`Request`.
    :param headers: (optional) Dictionary of HTTP Headers to send with the :class:`Request`.
    :param cookies: (optional) Dict or CookieJar object to send with the :class:`Request`.
    :param files: (optional) Dictionary of ``'name': file-like-objects`` (or ``{'name': file-tuple}``) for multipart encoding upload.
        ``file-tuple`` can be a 2-tuple ``('filename', fileobj)``, 3-tuple ``('filename', fileobj, 'content_type')``
        or a 4-tuple ``('filename', fileobj, 'content_type', custom_headers)``, where ``'content-type'`` is a string
        defining the content type of the given file and ``custom_headers`` a dict-like object containing additional headers
        to add for the file.
    :param auth: (optional) Auth tuple to enable Basic/Digest/Custom HTTP Auth.
    :param timeout: (optional) How many seconds to wait for the server to send data
        before giving up, as a float, or a :ref:`(connect timeout, read
        timeout) <timeouts>` tuple.
    :type timeout: float or tuple
    :param allow_redirects: (optional) Boolean. Enable/disable GET/OPTIONS/POST/PUT/PATCH/DELETE/HEAD redirection. Defaults to ``True``.
    :type allow_redirects: bool
    :param proxies: (optional) Dictionary mapping protocol to the URL of the proxy.
    :param verify: (optional) Either a boolean, in which case it controls whether we verify
            the server's TLS certificate, or a string, in which case it must be a path
            to a CA bundle to use. Defaults to ``True``.
    :param stream: (optional) if ``False``, the response content will be immediately downloaded.
    :param cert: (optional) if String, path to ssl client cert file (.pem). If Tuple, ('cert', 'key') pair.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response

    Usage::

      >>> import requests
      >>> req = requests.request('GET', 'http://httpbin.org/get')
      <Response [200]>
    """

    # By using the 'with' statement we are sure the session is closed, thus we
    # avoid leaving sockets open which can trigger a ResourceWarning in some
    # cases, and look like a memory leak in others.
    with sessions.Session() as session:
        return session.request(method=method, url=url, **kwargs)


def get(url, params=None, **kwargs):
    r"""Sends a GET request.

    :param url: URL for the new :class:`Request` object.
    :param params: (optional) Dictionary or bytes to be sent in the query string for the :class:`Request`.
    :param \*\*kwargs: Optional arguments that ``request`` takes.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response
    """

    kwargs.setdefault('allow_redirects', True)
    return request('get', url, params=params, **kwargs)


def options(url, **kwargs):
    r"""Sends an OPTIONS request.

    :param url: URL for the new :class:`Request` object.
    :param \*\*kwargs: Optional arguments that ``request`` takes.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response
    """

    kwargs.setdefault('allow_redirects', True)
    return request('options', url, **kwargs)


def head(url, **kwargs):
    r"""Sends a HEAD request.

    :param url: URL for the new :class:`Request` object.
    :param \*\*kwargs: Optional arguments that ``request`` takes.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response
    """

    kwargs.setdefault('allow_redirects', False)
    return request('head', url, **kwargs)


def post(url, data=None, json=None, **kwargs):
    r"""Sends a POST request.

    :param url: URL for the new :class:`Request` object.
    :param data: (optional) Dictionary (will be form-encoded), bytes, or file-like object to send in the body of the :class:`Request`.
    :param json: (optional) json data to send in the body of the :class:`Request`.
    :param \*\*kwargs: Optional arguments that ``request`` takes.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response
    """

    return request('post', url, data=data, json=json, **kwargs)


def put(url, data=None, **kwargs):
    r"""Sends a PUT request.

    :param url: URL for the new :class:`Request` object.
    :param data: (optional) Dictionary (will be form-encoded), bytes, or file-like object to send in the body of the :class:`Request`.
    :param json: (optional) json data to send in the body of the :class:`Request`.
    :param \*\*kwargs: Optional arguments that ``request`` takes.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response
    """

    return request('put', url, data=data, **kwargs)


def patch(url, data=None, **kwargs):
    r"""Sends a PATCH request.

    :param url: URL for the new :class:`Request` object.
    :param data: (optional) Dictionary (will be form-encoded), bytes, or file-like object to send in the body of the :class:`Request`.
    :param json: (optional) json data to send in the body of the :class:`Request`.
    :param \*\*kwargs: Optional arguments that ``request`` takes.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response
    """

    return request('patch', url, data=data, **kwargs)


def delete(url, **kwargs):
    r"""Sends a DELETE request.

    :param url: URL for the new :class:`Request` object.
    :param \*\*kwargs: Optional arguments that ``request`` takes.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response
    """

    return request('delete', url, **kwargs)
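As the source shows, each module-level helper opens a fresh Session and closes it when the call returns, so default headers, cookies, and connections do not carry over between calls. When several requests belong together, sharing one Session is a common alternative. The sketch below uses a made-up URL and User-Agent value, and it only prepares the request rather than sending it, so it runs offline.

```python
import requests

# Hypothetical URL and header value, for illustration only.
session = requests.Session()
session.headers.update({"User-Agent": "my-crawler/0.1"})

# prepare_request() merges session-level settings (headers, cookies, ...)
# into the request without sending anything over the network.
request = requests.Request("GET", "https://example.com/page", params={"id": 1})
prepared = session.prepare_request(request)

print(prepared.url)                    # https://example.com/page?id=1
print(prepared.headers["User-Agent"])  # my-crawler/0.1

session.close()
```

In real use, session.get(...) and session.post(...) accept the same keyword arguments as requests.request and reuse the cookies set by earlier responses.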

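(II) The BeautifulSoup module

Beautiful Soup is a Python library for extracting data from HTML and XML files. Working with the parser of your choice, it provides idiomatic ways to navigate, search, and modify the parse tree, and it can save hours or even days of work.

Chinese documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/

Installing Beautiful Soup:

(1) Debian or Ubuntu (the Python 3 package is named python3-bs4):

apt-get install python-bs4

(2) easy_install or pip (at the time of writing, this package supported both Python 2 and Python 3):

easy_install beautifulsoup4
pip install beautifulsoup4

(3) Installing from source:

Download: https://www.crummy.com/software/BeautifulSoup/bs4/download/4.0/

Extract the downloaded source archive, change into the source directory, and run:

python setup.py install

(4) Installing the lxml and html5lib parsers (any one of the three commands per parser is enough; the Python 3 system packages are python3-lxml and python3-html5lib):

apt-get install python-lxml
easy_install lxml
pip install lxml
apt-get install python-html5lib
easy_install html5lib
pip install html5lib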

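Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	Built into Python; decent speed; lenient with malformed documents	Less lenient in versions before Python 2.7.3 / 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	Very fast; lenient with malformed documents	Requires an external C library
lxml XML parser	BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml")	Very fast; the only supported XML parser	Requires an external C library
html5lib	BeautifulSoup(markup, "html5lib")	Most lenient; parses pages the same way a browser does; produces valid HTML5	Very slow; requires an external Python package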

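As the table shows, lxml is the recommended parser because it is the fastest. Note, however, that when an HTML or XML document is malformed, different parsers may build different trees, so the result may not be what you expect.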

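from bs4 import BeautifulSoup

# Parse a local HTML file; naming the parser explicitly avoids the "no parser specified" warning
with open("index.html", encoding="utf-8") as fp:
    soup = BeautifulSoup(fp, "html.parser")

# Parse an HTML string
soup = BeautifulSoup("<html>data</html>", "html.parser")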

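Common objects:

Beautiful Soup converts an HTML or XML document into a tree in which every node is a Python object. These objects fall into four types:

(1) Tag objects:

A Tag object corresponds to a tag in the original HTML or XML document.

tag = soup.b

Name:

name is the name of the tag.

tag.name

Attrs:

A tag's attributes behave like a dictionary, not a list: use tag['class'] to read or set a single attribute, or tag.attrs to get all of them as a dict.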
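A small sketch of reading and changing a tag's name and attributes. The HTML snippet is made up for illustration, and the built-in "html.parser" is used so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Made-up HTML snippet for illustration.
html = '<b class="boldest" id="intro">Extremely bold</b>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.b
print(tag.name)         # b
print(tag["id"])        # intro
print(tag["class"])     # ['boldest']  (class is multi-valued, so it is a list)
print(tag.attrs)        # {'class': ['boldest'], 'id': 'intro'}
print(tag.get("href"))  # None  (missing attribute, no KeyError)

# Attributes can be changed or removed like dictionary entries.
tag["id"] = "summary"
del tag["class"]
print(tag)              # <b id="summary">Extremely bold</b>
```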

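(2) NavigableString objects (navigable strings):

Because strings live inside tags, Beautiful Soup wraps the text within a tag in the NavigableString class.

tag.string

Its type is NavigableString. Convert it to a plain Python string with str() (unicode() in Python 2).

unicode_string = str(tag.string)

A string contained in a tag cannot be edited in place, but it can be replaced with another string using replace_with().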
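The following sketch, using a made-up snippet, shows the conversion to a plain string and the replacement of a tag's string with replace_with().

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<b>Extremely bold</b>", "html.parser")
tag = soup.b

text = tag.string
print(isinstance(text, NavigableString))  # True

# str() gives a plain Python string that is detached from the parse tree.
plain = str(text)
print(plain)  # Extremely bold

# The string cannot be edited in place, but it can be swapped for another one.
tag.string.replace_with("No longer bold")
print(tag)  # <b>No longer bold</b>
```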

(3) The BeautifulSoup object:

This object represents the document as a whole. Its soup.name attribute has the value '[document]'.

(4) Comment objects (comments and other special strings):

A Comment object is a special type of NavigableString.

from bs4 import BeautifulSoup

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, "html.parser")
comment = soup.b.string
type(comment)
# <class 'bs4.element.Comment'>

print(soup.b.prettify())   # the comment is printed with its special formatting
# <b>
#  <!--Hey, buddy. Want to buy a used parser?-->
# </b>

Beautiful Soup defines a few other types that may appear in XML documents: CData, ProcessingInstruction, Declaration, and Doctype. Like Comment, they are subclasses of NavigableString that add a little extra behavior.
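Because Comment is a subclass of NavigableString, a crawler that only wants visible text usually needs to tell the two apart. A minimal sketch with a made-up snippet:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Made-up markup: one visible string and one comment inside the same tag.
markup = "<p>Visible text<!-- hidden note --></p>"
soup = BeautifulSoup(markup, "html.parser")

# Find every string node that is a Comment.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
print(comments)           # [' hidden note ']

# get_text() leaves comments out.
print(soup.p.get_text())  # Visible text
```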

Common attributes and methods:

soup.head                                 # the <head> tag
soup.title                                # the <title> tag
soup.TagName                              # the first <TagName> tag
soup.find_all('TagName')                  # all TagName tags
tag.contents                              # the tag's children as a list
tag.children                              # an iterator over the tag's direct children
tag.descendants                           # an iterator that walks all descendants recursively
tag.string                                # the tag's string if it contains exactly one, otherwise None
tag.strings                               # a generator over all strings inside the tag
tag.stripped_strings                      # like tag.strings, but with surrounding whitespace stripped and blank strings skipped
tag.parent                                # the parent node
tag.parents                               # an iterator over all ancestors
tag.next_sibling                          # the next sibling node
tag.previous_sibling                      # the previous sibling node
tag.next_siblings                         # an iterator over all following siblings
tag.previous_siblings                     # an iterator over all preceding siblings
tag.next_element                          # the object parsed immediately after this one
tag.previous_element                      # the object parsed immediately before this one
tag.next_elements                         # an iterator over all objects parsed after this one
tag.previous_elements                     # an iterator over all objects parsed before this one
tag.find_all('TagName')                   # all descendants named TagName
tag.find_all(['TagName1', 'TagName2'])    # all descendants matching any name in the list
tag.find_all(True)                        # all tags (strings are not included)
tag.find_all(FuncName)                    # accepts a function; an element matches when the function returns True. Official example:
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
soup.find_all(has_class_but_no_id)
tag.find_all(Key='Value')                 # all tags whose attribute Key equals Value
soup.find_all(Key1=re.compile("RegExp"), Key2='Value')   # regular expressions are allowed; multiple filters are combined with AND
tag.find_all(string='xxx')                # search string content (the argument was named text before Beautiful Soup 4.4.0)
tag.find_all(string=['xxx', 'yyy'])       # string accepts a string, a regular expression, a list, a function, or True
tag.find_all('TagName', limit=Number)     # return at most Number matching tags
tag.find_all('TagName', recursive=False)  # consider direct children only (the default, True, searches all descendants)
tag.find(name, attrs, recursive, string, **kwargs)
# Returns the first match, or None when nothing matches; find_all() returns an empty list [] in that case.
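The filters listed above can be tried on a small document. The HTML below is made up for illustration; the counts in the comments follow from that snippet.

```python
import re
from bs4 import BeautifulSoup

# Made-up HTML for illustration.
html = """
<div id="menu">
  <a class="nav" href="/home">Home</a>
  <a class="nav external" href="https://example.com">Example</a>
  <p>Plain <a href="/about">About</a></p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
menu = soup.find("div", id="menu")

all_links = menu.find_all("a")                      # 3 tags, at any depth
direct_links = menu.find_all("a", recursive=False)  # 2 tags, direct children only
nav_links = menu.find_all("a", class_="nav")        # 2 tags, class_ matches any one CSS class
local = menu.find_all("a", href=re.compile(r"^/"))  # hrefs starting with "/"
# Several keyword filters must all match (logical AND).
both = menu.find_all("a", class_="nav", href=re.compile(r"^/"))
first_two = menu.find_all("a", limit=2)
missing = menu.find("table")                        # no match: None

print([a["href"] for a in local])  # ['/home', '/about']
print([a["href"] for a in both])   # ['/home']
print(missing)                     # None
```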

Similar methods, which accept the same filters as find_all() and find() but search in other directions:

tag.find_parents()
tag.find_parent()
tag.find_next_siblings()
tag.find_next_sibling()
tag.find_previous_siblings()
tag.find_previous_sibling()
tag.find_all_next()
tag.find_next()
tag.find_all_previous()
tag.find_previous()
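A brief sketch of a few of these directional searches, starting from one list item in a made-up snippet:

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration.
html = "<ul id='list'><li>one</li><li class='current'>two</li><li>three</li></ul>"
soup = BeautifulSoup(html, "html.parser")
current = soup.find("li", class_="current")

parent_id = current.find_parent("ul")["id"]
next_text = current.find_next_sibling("li").get_text()
prev_text = current.find_previous_sibling("li").get_text()
following = [li.get_text() for li in current.find_all_next("li")]

print(parent_id)  # list
print(next_text)  # three
print(prev_text)  # one
print(following)  # ['three']
```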

Beautiful Soup supports most CSS selectors through tag.select().
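A few common selector forms, sketched against a made-up snippet. select() returns a list of matching tags, and select_one() returns the first match or None.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration.
html = """
<div id="content">
  <p class="intro">First</p>
  <p>Second <a href="https://example.com/a">link</a></p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

paragraphs = soup.select("p")                  # by tag name
intro = soup.select("p.intro")                 # by tag name and class
children = soup.select("#content > p")         # direct children of the element with id="content"
https_links = soup.select('a[href^="https://"]')  # attribute value prefix

print(len(paragraphs), len(intro), len(children), len(https_links))  # 2 1 2 1
print(soup.select_one("p.intro").get_text())   # First
print(soup.select_one("table"))                # None
```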

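tag.append("Content")          # append content to the tag
soup.new_string()              # create a new string object (a method of the BeautifulSoup object)
soup.new_tag()                 # create a new tag object (a method of the BeautifulSoup object)
tag.insert()                   # insert an object at a given position among the tag's children
tag.insert_before()            # insert a new object immediately before the tag
tag.insert_after()             # insert a new object immediately after the tag
tag.clear()                    # remove the tag's contents
tag.extract()                  # remove the tag from the tree and return it
tag.decompose()                # remove the tag from the tree and destroy it completely
tag.replace_with()             # replace the tag with another object
tag.wrap()                     # wrap the tag inside the tag passed in; returns the new wrapper
tag.unwrap()                   # replace the tag with its contents; returns the tag that was removed
tag.prettify()                 # return the subtree as a nicely formatted Unicode string
tag.get_text()                 # return all the text inside the tag and its descendants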
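The sketch below strings several of these tree-modifying calls together on a made-up snippet. The outputs in the comments follow from that snippet.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")
p = soup.p

# new_tag() is a factory method on the BeautifulSoup object.
link = soup.new_tag("a", href="https://example.com")
link.string = "link"
p.append(link)
print(p)  # <p>Hello <b>world</b><a href="https://example.com">link</a></p>

# unwrap() removes the <b> tag itself but keeps its contents.
p.b.unwrap()
print(p)  # <p>Hello world<a href="https://example.com">link</a></p>

# extract() removes the tag from the tree and hands it back.
removed = p.a.extract()
print(removed)       # <a href="https://example.com">link</a>
print(p.get_text())  # Hello world

# wrap() puts the tag inside a new parent and returns that parent.
wrapper = p.wrap(soup.new_tag("div"))
print(soup)  # <div><p>Hello world</p></div>
```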

(III) Logging in to GitHub automatically

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

# Username and password
username = 'xxx'
password = 'xxx'
# Request headers
header = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    # 'br' is left out: requests can decode Brotli responses only if a Brotli package is installed
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Connection': 'keep-alive',
    'Host': 'github.com',
    'Referer': "https://github.com/xvGe/xvGe.github.io",
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36",
}
# Fetch the login page
response = requests.request('get', 'https://github.com/login', headers=header)
soup = BeautifulSoup(response.text, features='lxml')
# Extract the login token from the hidden form field
token = soup.find(name='input', attrs={'name': "authenticity_token"})['value']
# Keep the cookies set by the login page
cookie = response.cookies.get_dict()
# Login form data to submit
formData = {
    'commit': 'Sign in',
    'utf8': '✓',
    'authenticity_token': token,
    'login': username,
    'password': password,
}
# Submit the login form, sending the cookies back with it
response = requests.request('post', 'https://github.com/session', data=formData, cookies=cookie, headers=header)
response.close()
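The same flow can also be written with a requests.Session, which carries cookies from the login page to the form submission automatically. The following is only a sketch under the assumptions of the script above (the /login and /session URLs and the authenticity_token field); the helper names extract_token and login are hypothetical, the real login form may have changed since this was written, and accounts with two-factor authentication need additional steps. Only the token-parsing step is exercised here, against a made-up form, so the block runs offline.

```python
import requests
from bs4 import BeautifulSoup

LOGIN_URL = "https://github.com/login"
SESSION_URL = "https://github.com/session"


def extract_token(html):
    """Return the authenticity_token value from a login page, or None."""
    soup = BeautifulSoup(html, "html.parser")
    field = soup.find("input", attrs={"name": "authenticity_token"})
    return field.get("value") if field is not None else None


def login(username, password):
    """Hypothetical helper: returns the Session and the response to the form post."""
    session = requests.Session()
    session.headers["User-Agent"] = "Mozilla/5.0"
    page = session.get(LOGIN_URL, timeout=10)
    page.raise_for_status()
    token = extract_token(page.text)
    if token is None:
        raise RuntimeError("authenticity_token not found; the form may have changed")
    form = {
        "commit": "Sign in",
        "authenticity_token": token,
        "login": username,
        "password": password,
    }
    # The session sends back the cookies it received from the login page.
    response = session.post(SESSION_URL, data=form, timeout=10)
    return session, response


# Offline check of the parsing step with a made-up form.
sample = '<form><input type="hidden" name="authenticity_token" value="abc123"></form>'
print(extract_token(sample))                  # abc123
print(extract_token("<form>no token</form>"))  # None
```

Returning the Session lets the caller keep making authenticated requests with the same cookies.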

Posted 2018-05-09 15:14 by 全栈英雄
