Python Web Scraping Basics
Today's outline:
- Requests and BeautifulSoup
- Scraping news from Autohome (汽车之家)
- Scraping GitHub and Chouti (抽屉)
- Polling and long polling
I. HTTP Crash Course
- An HTTP GET request has no request body; all of its parameters are carried in the URL's query string.
- An HTTP POST request carries its content in the request body.
- An HTTP exchange = request headers + request body, followed by response headers + response body.
- HTTP is stateless: one request gets one response, and the exchange ends there. (See the sketch after this list.)
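A quick way to see the GET/POST difference is to inspect the prepared request that Requests actually sends; a minimal sketch against httpbin.org (a public echo service):

import requests

# GET: parameters end up in the URL; there is no body
r = requests.get("http://httpbin.org/get", params={"k": "v"})
print(r.request.url)   # http://httpbin.org/get?k=v
print(r.request.body)  # None

# POST: parameters travel in the request body
r = requests.post("http://httpbin.org/post", data={"k": "v"})
print(r.request.url)   # http://httpbin.org/post
print(r.request.body)  # k=v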
II. Requests
Requests is an HTTP library, written in Python and released under the Apache2 License. It is a high-level wrapper around Python's built-in modules that makes network requests far more pleasant; with Requests you can easily do just about anything a browser can.
1. GET requests
Example without parameters:

# Example without parameters
import requests

data = requests.get("http://www.sina.com.cn/")
print(data.url)
print(data.text)
Example with parameters:

import requests

payload = {'key1': 'value1', 'key2': 'value2'}
ret = requests.get("http://httpbin.org/get", params=payload)
print(ret.url)
print(ret.text)
For example, sending a GET request to https://github.com/timeline.json returns a Response object (like data and ret above) that encapsulates everything about both the request that was sent and the response that came back.
2. POST requests
Basic POST example:

import requests

payload = {'key1': 'value1', 'key2': 'value2'}
data = requests.post("http://httpbin.org/post", data=payload)
print(data.text)
Example sending request headers and JSON data:

#!/usr/bin/python
# -*- coding:utf-8 -*-
import requests
import json

url = 'https://api.github.com/some/endpoint'
payload = {'some': 'data'}
headers = {'content-type': 'application/json'}

data = requests.post(url, data=json.dumps(payload), headers=headers)
print(data.text)
print(data.cookies)
3. Other request methods

requests.get(url, params=None, **kwargs)
requests.post(url, data=None, json=None, **kwargs)
requests.put(url, data=None, **kwargs)
requests.head(url, **kwargs)
requests.delete(url, **kwargs)
requests.patch(url, data=None, **kwargs)
requests.options(url, **kwargs)

# All of the above are built on top of this method:
requests.request(method, url, **kwargs)
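Since every helper delegates to requests.request, the two calls below are equivalent (a small sketch, again using httpbin.org as a test endpoint):

import requests

# requests.get(...) is shorthand for requests.request('GET', ...)
r1 = requests.get("http://httpbin.org/get")
r2 = requests.request("GET", "http://httpbin.org/get")
print(r1.status_code, r2.status_code)  # 200 200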
4. More parameters
def request(method, url, **kwargs):
    """Constructs and sends a :class:`Request <Request>`.

    :param method: method for the new :class:`Request` object.
    :param url: URL for the new :class:`Request` object.
    :param params: (optional) Dictionary or bytes to be sent in the query string for the :class:`Request`.
    :param data: (optional) Dictionary, bytes, or file-like object to send in the body of the :class:`Request`.
    :param json: (optional) json data to send in the body of the :class:`Request`.
    :param headers: (optional) Dictionary of HTTP Headers to send with the :class:`Request`.
    :param cookies: (optional) Dict or CookieJar object to send with the :class:`Request`.
    :param files: (optional) Dictionary of ``'name': file-like-objects`` (or ``{'name': file-tuple}``) for multipart encoding upload.
        ``file-tuple`` can be a 2-tuple ``('filename', fileobj)``, 3-tuple ``('filename', fileobj, 'content_type')``
        or a 4-tuple ``('filename', fileobj, 'content_type', custom_headers)``, where ``'content-type'`` is a string
        defining the content type of the given file and ``custom_headers`` a dict-like object containing additional headers
        to add for the file.
    :param auth: (optional) Auth tuple to enable Basic/Digest/Custom HTTP Auth.
    :param timeout: (optional) How long to wait for the server to send data
        before giving up, as a float, or a :ref:`(connect timeout, read
        timeout) <timeouts>` tuple.
    :type timeout: float or tuple
    :param allow_redirects: (optional) Boolean. Set to True if POST/PUT/DELETE redirect following is allowed.
    :type allow_redirects: bool
    :param proxies: (optional) Dictionary mapping protocol to the URL of the proxy.
    :param verify: (optional) whether the SSL cert will be verified. A CA_BUNDLE path can also be provided. Defaults to ``True``.
    :param stream: (optional) if ``False``, the response content will be immediately downloaded.
    :param cert: (optional) if String, path to ssl client cert file (.pem). If Tuple, ('cert', 'key') pair.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response

    Usage::

        >>> import requests
        >>> req = requests.request('GET', 'http://httpbin.org/get')
        <Response [200]>
    """
More documentation for the requests module: http://cn.python-requests.org/zh_CN/latest/
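A short sketch combining several of these keyword arguments in one call (httpbin.org is just a neutral test endpoint):

import requests

response = requests.get(
    "http://httpbin.org/get",
    params={"q": "python"},          # sent in the query string
    headers={"User-Agent": "demo"},  # custom request header
    timeout=(3.05, 10),              # (connect, read) timeouts in seconds
    allow_redirects=True,            # follow 3xx responses
)
print(response.status_code)
print(response.json())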
5. Scraping Autohome news (no login required)

#!/usr/bin/python
# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup
import requests

# Plain HTTP GET; this page needs no login
response = requests.get("http://www.autohome.com.cn/news/")
response.encoding = 'gbk'  # the page is GBK-encoded

soup = BeautifulSoup(response.text, "html.parser")
tag = soup.find(name="div", attrs={"id": "auto-channel-lazyload-article"})
li_list = tag.find_all("li")  # [tag object, tag object, ...]
for li in li_list:
    h3 = li.find(name="h3")
    if not h3:  # skip <li> items that are not news entries
        continue
    print(h3.text, li.find(name="a").get("href"))

"""
售13.59-18.59万元 别克新款威朗上市 //www.autohome.com.cn/news/201710/908038.html#pvareaid=102624
售11.99-14.69万元 别克阅朗正式上市 //www.autohome.com.cn/news/201710/908029.html#pvareaid=102624
售14.49-16.69万元 别克GL6正式上市 //www.autohome.com.cn/news/201710/908024.html#pvareaid=102624
售10.99-14.39万元 别克新款英朗上市 //www.autohome.com.cn/news/201710/908023.html#pvareaid=102624
中型SUV/1.6T动力 中华V7申报图曝光 //www.autohome.com.cn/news/201710/908128.html#pvareaid=102624
拉低门槛 奔驰C级或换装全新1.3T发动机 //www.autohome.com.cn/news/201710/908114.html#pvareaid=102624
外观造型硬朗 昌河全新SUV申报图曝光 //www.autohome.com.cn/news/201710/908111.html#pvareaid=102624
将于年内正式投产 捷豹XEL实车曝光 //www.autohome.com.cn/news/201710/908101.html#pvareaid=102624
与海外版一致 英菲尼迪新款Q50L申报图 //www.autohome.com.cn/news/201710/908108.html#pvareaid=102624
或11月上市/两种动力 荣威RX3实车到店 //www.autohome.com.cn/news/201710/908106.html#pvareaid=102624
更年轻 北汽新能源EC180/200推定制套装 //www.autohome.com.cn/news/201710/908107.html#pvareaid=102624
即将“复活” 别克全新凯越申报图曝光 //www.autohome.com.cn/news/201710/908105.html#pvareaid=102624
内饰焕然一新 全新牧马人产品手册曝光 //www.autohome.com.cn/news/201710/908102.html#pvareaid=102624
售16.78-17.98万元 长安CS95荣耀版上市 //www.autohome.com.cn/news/201710/908103.html#pvareaid=102624
售9.98-18.68万 2018款荣威RX5上市 //www.autohome.com.cn/news/201710/908094.html#pvareaid=102624
"""
III. BeautifulSoup
BeautifulSoup is a module that takes an HTML or XML string, parses it, and then lets you use its methods to quickly locate specific elements, which makes finding elements in HTML or XML simple.
Installing the BeautifulSoup module on Windows: pip install BeautifulSoup4
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
asdf
    <div class="title">
        <b>The Dormouse's story总共</b>
        <h1>f</h1>
    </div>
<div class="story">Once upon a time there were three little sisters; and their names were
    <a class="sister0" id="link1">Els<span>f</span>ie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</div>
ad<br/>sf
<p class="story">...</p>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, features="lxml")

# Find the first <a> tag
tag1 = soup.find(name='a')
# Find all <a> tags
tag2 = soup.find_all(name='a')
# Find the tag with id="link2"
tag3 = soup.select('#link2')
1. name: the tag's name

# tag = soup.find('a')
# name = tag.name    # get
# print(name)
# tag.name = 'span'  # set
# print(soup)
2. attrs: the tag's attributes

# tag = soup.find('a')
# attrs = tag.attrs          # get
# print(attrs)
# tag.attrs = {'ik': 123}    # set (replace all attributes)
# tag.attrs['id'] = 'iiiii'  # set a single attribute
# print(soup)
3. children: all direct children

# body = soup.find('body')
# v = body.children
4. descendants: all descendants, recursively

# body = soup.find('body')
# v = body.descendants
5. clear: remove everything inside the tag (the tag itself is kept)

# tag = soup.find('body')
# tag.clear()
# print(soup)
6. decompose: recursively destroy the tag and everything inside it

# body = soup.find('body')
# body.decompose()
# print(soup)
7. extract: recursively remove the tag and everything inside it, returning what was removed

# body = soup.find('body')
# v = body.extract()
# print(soup)
8. decode: serialize to a string, including the current tag; decode_contents: excluding the current tag

# body = soup.find('body')
# v = body.decode()
# v = body.decode_contents()
# print(v)
9. encode: serialize to bytes, including the current tag; encode_contents: excluding the current tag

# body = soup.find('body')
# v = body.encode()
# v = body.encode_contents()
# print(v)
10. find: get the first matching tag

# tag = soup.find('a')
# print(tag)
# tag = soup.find(name='a', attrs={'class': 'sister'}, recursive=True, text='Lacie')
# tag = soup.find(name='a', class_='sister', recursive=True, text='Lacie')
# print(tag)
11. find_all: get all matching tags

# tags = soup.find_all('a')
# print(tags)

# tags = soup.find_all('a', limit=1)
# print(tags)

# tags = soup.find_all(name='a', attrs={'class': 'sister'}, recursive=True, text='Lacie')
# tags = soup.find(name='a', class_='sister', recursive=True, text='Lacie')
# print(tags)

# ####### Lists #######
# v = soup.find_all(name=['a', 'div'])
# print(v)

# v = soup.find_all(class_=['sister0', 'sister'])
# print(v)

# v = soup.find_all(text=['Tillie'])
# print(v, type(v[0]))

# v = soup.find_all(id=['link1', 'link2'])
# print(v)

# v = soup.find_all(href=['link1', 'link2'])
# print(v)

# ####### Regular expressions #######
import re
# rep = re.compile('p')
# rep = re.compile('^p')
# v = soup.find_all(name=rep)
# print(v)

# rep = re.compile('sister.*')
# v = soup.find_all(class_=rep)
# print(v)

# rep = re.compile('http://www.oldboy.com/static/.*')
# v = soup.find_all(href=rep)
# print(v)

# ####### Filter functions #######
# def func(tag):
#     return tag.has_attr('class') and tag.has_attr('id')
# v = soup.find_all(name=func)
# print(v)

# ## get: fetch a tag attribute
# tag = soup.find('a')
# v = tag.get('id')
# print(v)
12. has_attr: check whether the tag has a given attribute

# tag = soup.find('a')
# v = tag.has_attr('id')
# print(v)
13. get_text: get the text inside the tag

# tag = soup.find('a')
# v = tag.get_text()  # the optional argument is a separator string, not an attribute name
# print(v)
14. index: get a child tag's index position within the current tag

# tag = soup.find('body')
# v = tag.index(tag.find('div'))
# print(v)

# tag = soup.find('body')
# for i, v in enumerate(tag):
#     print(i, v)
15. is_empty_element: whether the tag is an empty (void/self-closing) element, i.e. one of: 'br', 'hr', 'input', 'img', 'meta', 'spacer', 'link', 'frame', 'base'

# tag = soup.find('br')
# v = tag.is_empty_element
# print(v)
16. Navigating to a tag's relatives

# soup.next
# soup.next_element
# soup.next_elements
# soup.next_sibling
# soup.next_siblings

# tag.previous
# tag.previous_element
# tag.previous_elements
# tag.previous_sibling
# tag.previous_siblings

# tag.parent
# tag.parents
17. Searching among a tag's relatives

# tag.find_next(...)
# tag.find_all_next(...)
# tag.find_next_sibling(...)
# tag.find_next_siblings(...)

# tag.find_previous(...)
# tag.find_all_previous(...)
# tag.find_previous_sibling(...)
# tag.find_previous_siblings(...)

# tag.find_parent(...)
# tag.find_parents(...)

# These take the same parameters as find_all
18. select / select_one: CSS selectors

soup.select("title")

soup.select("p:nth-of-type(3)")

soup.select("body a")

soup.select("html head title")

tag = soup.select("span,a")

soup.select("head > title")

soup.select("p > a")

soup.select("p > a:nth-of-type(2)")

soup.select("p > #link1")

soup.select("body > a")

soup.select("#link1 ~ .sister")

soup.select("#link1 + .sister")

soup.select(".sister")

soup.select("[class~=sister]")

soup.select("#link1")

soup.select("a#link2")

soup.select('a[href]')

soup.select('a[href="http://example.com/elsie"]')

soup.select('a[href^="http://example.com/"]')

soup.select('a[href$="tillie"]')

soup.select('a[href*=".com/el"]')


from bs4.element import Tag

def default_candidate_generator(tag):
    # Yield only descendants that are tags and carry an href attribute
    for child in tag.descendants:
        if not isinstance(child, Tag):
            continue
        if not child.has_attr('href'):
            continue
        yield child

tags = soup.find('body').select("a", _candidate_generator=default_candidate_generator)
print(type(tags), tags)

tags = soup.find('body').select("a", _candidate_generator=default_candidate_generator, limit=1)
print(type(tags), tags)
19. Tag contents

# tag = soup.find('span')
# print(tag.string)           # get
# tag.string = 'new content'  # set
# print(soup)

# tag = soup.find('body')
# print(tag.string)
# tag.string = 'xxx'
# print(soup)

# tag = soup.find('body')
# v = tag.stripped_strings  # generator yielding the text of all inner tags, recursively
# print(v)
20. append: append a tag at the end of the current tag's contents

# tag = soup.find('body')
# tag.append(soup.find('a'))
# print(soup)

# from bs4.element import Tag
# obj = Tag(name='i', attrs={'id': 'it'})
# obj.string = '我是一个新来的'
# tag = soup.find('body')
# tag.append(obj)
# print(soup)
21. insert: insert a tag at a given position inside the current tag

# from bs4.element import Tag
# obj = Tag(name='i', attrs={'id': 'it'})
# obj.string = '我是一个新来的'
# tag = soup.find('body')
# tag.insert(2, obj)
# print(soup)
22. insert_after / insert_before: insert after or before the current tag

# from bs4.element import Tag
# obj = Tag(name='i', attrs={'id': 'it'})
# obj.string = '我是一个新来的'
# tag = soup.find('body')
# tag.insert_before(obj)
# tag.insert_after(obj)
# print(soup)
23. replace_with: replace the current tag with the given tag

# from bs4.element import Tag
# obj = Tag(name='i', attrs={'id': 'it'})
# obj.string = '我是一个新来的'
# tag = soup.find('div')
# tag.replace_with(obj)
# print(soup)
24. Creating relationships between tags

# tag = soup.find('div')
# a = soup.find('a')
# tag.setup(previous_sibling=a)
# print(tag.previous_sibling)
25. wrap: wrap the current tag in the given tag

# from bs4.element import Tag
# obj1 = Tag(name='div', attrs={'id': 'it'})
# obj1.string = '我是一个新来的'
# tag = soup.find('a')
# v = tag.wrap(obj1)
# print(soup)

# tag = soup.find('a')
# v = tag.wrap(soup.find('p'))
# print(soup)
26. unwrap: remove the current tag but keep everything it wraps

# tag = soup.find('a')
# v = tag.unwrap()
# print(soup)
IV. Scraping GitHub and Chouti
Automated GitHub login

#!/usr/bin/python
# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup
import requests

# 1. Fetch the login page to obtain the CSRF token and the initial cookie
r1 = requests.get(url='https://github.com/login')
s1 = BeautifulSoup(r1.text, 'html.parser')
val = s1.find(attrs={'name': 'authenticity_token'}).get('value')
r1_cookie_dict = r1.cookies.get_dict()

# 2. Send the credentials, carrying the token and the first cookie
r2 = requests.post(
    url='https://github.com/session',
    data={
        'commit': 'Sign in',
        'utf8': '✓',
        'authenticity_token': val,
        'login': 'xxx',
        'password': 'xxx'
    },
    cookies=r1_cookie_dict
)
r2_cookie_dict = r2.cookies.get_dict()
print(r1_cookie_dict)
print(r2_cookie_dict)

all_cookies = {}
all_cookies.update(r1_cookie_dict)
all_cookies.update(r2_cookie_dict)

# 3. For GitHub, the cookies issued after the token login are enough
r3 = requests.get('https://github.com/settings/emails', cookies=r2_cookie_dict)
print(r3.text)
Log in to Chouti and upvote automatically

#!/usr/bin/python
# -*- coding:utf-8 -*-
import requests

# 1. First GET: receive the initial cookie
r1 = requests.get(url='http://dig.chouti.com/')
r1_cookies_dict = r1.cookies.get_dict()

# 2. Log in; this merely authorizes the cookie from step 1
r2 = requests.post(
    url='http://dig.chouti.com/login',
    data={
        'phone': 'xxx',
        'password': 'xxx',
        'oneMonth': 1
    },
    cookies=r1_cookies_dict
)
r2_cookies_dict = r2.cookies.get_dict()
print(r1_cookies_dict)
print(r2_cookies_dict)

all_cookies = {}
all_cookies.update(r1_cookies_dict)
all_cookies.update(r2_cookies_dict)

# 3. Vote with the first-visit cookie (now authorized) - see the note below
r3 = requests.post('http://dig.chouti.com/link/vote?linksId=14708906', cookies=r1_cookies_dict)
print(r3.text)
Note: some sites do not issue the session cookie at login time. You have to GET a page first to receive the cookie; the login request then merely authorizes that cookie. In that case you keep using the cookie from the first GET, and there is no need to carry the cookies from the login response on later requests.
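All of this cookie bookkeeping can be delegated to requests.Session, which remembers the cookies from every response and resends them automatically; a minimal sketch of the same Chouti flow with placeholder credentials:

import requests

session = requests.Session()

# First GET: receive the initial cookie; Session stores it for us
session.get('http://dig.chouti.com/')

# Login merely authorizes the cookie obtained above
session.post(
    'http://dig.chouti.com/login',
    data={'phone': 'xxx', 'password': 'xxx', 'oneMonth': 1},
)

# Subsequent requests carry the (now authorized) cookie automatically
r = session.post('http://dig.chouti.com/link/vote?linksId=14708906')
print(r.text)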
V. Polling and Long Polling
- Polling: the client sends Ajax requests to the server on a fixed timer; the server responds immediately on each request and then closes the connection.
  Pros: the back-end code is easy to write.
  Cons: most requests return nothing useful, wasting bandwidth and server resources.
  Typical use: small applications.
- Long polling: the client sends an Ajax request; the server holds the connection open and only returns a response (then closes the connection) once a new message is available. After handling the response, the client immediately sends a new request. The server also sets a timeout: when it fires, the server closes the connection and the client re-requests so the server can hold the connection again. (A client-side sketch follows this list.)
  Pros: no wasted requests while there are no messages.
  Cons: held connections consume server resources.
  Examples: WebQQ, web-based Hi, Facebook IM.
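A minimal long-polling client sketch in Python, assuming a hypothetical endpoint http://example.com/messages that holds each request until a message is available:

import requests

# Hypothetical endpoint (assumption for illustration): the server holds
# the request open until a message arrives or its own timeout fires.
URL = "http://example.com/messages"

while True:
    try:
        # The read timeout must be longer than the server's hold time
        response = requests.get(URL, timeout=(3.05, 60))
    except requests.exceptions.Timeout:
        continue  # network stall or server held too long: just re-poll
    if response.status_code == 200 and response.text:
        print("new message:", response.text)
    # Loop immediately: the next request goes out as soon as this
    # response has been handled, so the server can hold it again.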
Separately, a distinction is also drawn between long connections and socket connections:
- Long connection: embed a hidden iframe in the page and point its src at a long-lived request; the server can then push data to the client continuously. (A server-side sketch follows this list.)
  Pros: messages arrive immediately and no useless requests are sent.
  Cons: each long connection the server maintains adds overhead.
  Example: Gmail chat.
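A minimal sketch of the server side of this hidden-iframe technique, using only Python's standard library; the port, the message loop, and the parent.onMessage callback are all illustrative assumptions, not part of any real service:

import time
from http.server import BaseHTTPRequestHandler, HTTPServer

class StreamHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The hidden iframe's src points here; we never finish the
        # response, we just keep appending <script> blocks to it.
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.end_headers()
        for n in range(5):  # stand-in for a real message source
            # Each block runs in the iframe as soon as it arrives and
            # hands the message to a callback in the parent page
            # (parent.onMessage is hypothetical, defined by that page).
            chunk = '<script>parent.onMessage(%d);</script>\n' % n
            self.wfile.write(chunk.encode("utf-8"))
            self.wfile.flush()
            time.sleep(1)

HTTPServer(("127.0.0.1", 8000), StreamHandler).serve_forever()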