Python 3 web scraping: urllib and BeautifulSoup
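I have been learning web scraping in the evenings lately, starting with the basics.
Python 3 merged Python 2's urllib and urllib2 into a single urllib package. urllib can request and download a given web page, and BeautifulSoup can pick the parts we need out of messy HTML.
Note: BeautifulSoup is a Python library for extracting data from HTML or XML files.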
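To make the extraction step concrete before touching the network, here is a minimal sketch that parses an inline HTML string. The markup in `SAMPLE_HTML`, the `example.com` URLs, and the helper name `extract_links` are all invented for illustration; only `BeautifulSoup`, `find_all`, and `get_text` come from the library itself.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, invented for this illustration.
SAMPLE_HTML = """
<div class="post_list">
  <h3><a href="https://example.com/post/1">First post</a></h3>
  <h3><a href="https://example.com/post/2">Second post</a></h3>
  <h3>No link here</h3>
</div>
"""


def extract_links(html):
    """Return (href, title) pairs for every <h3> that wraps a link."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for heading in soup.find_all("h3"):
        anchor = heading.a  # None when the heading has no <a> descendant
        if anchor is None or not anchor.get("href"):
            continue
        links.append((anchor["href"], anchor.get_text(strip=True)))
    return links


for href, title in extract_links(SAMPLE_HTML):
    print(href, title)
```

Checking `heading.a` for `None` matters on real pages, where not every heading wraps a link.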
Example 1:
    import re
    from urllib import request

    from bs4 import BeautifulSoup as bs

    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.96 Safari/537.36'
    }


    def fetch(url):
        """Request url with a browser-like User-Agent and return the decoded body."""
        req = request.Request(url, headers=header)
        with request.urlopen(req) as rep:
            return rep.read().decode('utf-8')


    def download():
        """Save every post linked from an <h3> on the listing pages as a local HTML file."""
        for pageIdx in range(1, 3):
            # Note: "#p1" is a URL fragment, which urllib does not send to the
            # server, so each iteration requests the same page.
            url = "https://www.cnblogs.com/#p%s" % pageIdx
            print(url)
            # Name the parser explicitly; omitting it triggers a warning.
            content = bs(fetch(url), 'html.parser')
            for link in content.find_all('h3'):
                anchor = link.a
                if anchor is None or not anchor.get('href'):
                    continue  # skip headings that do not wrap a link
                title = anchor.get_text(strip=True)
                print(anchor['href'], title)
                # Replace characters that are not allowed in file names.
                filename = re.sub(r'[\\/:*?"<>|]', '_', title) or 'untitled'
                with open('%s.html' % filename, 'w', encoding='utf-8') as fp:
                    fp.write(fetch(anchor['href']))


    if __name__ == "__main__":
        download()
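The first example builds its page URLs by appending `#p1`, `#p2` to the site root. The sketch below, which makes no network request, inspects what `urllib.request.Request` would actually send for such URLs. The shortened `User-Agent` value and the helper name `describe` are placeholders chosen for this illustration.

```python
from urllib import request

HEADERS = {"User-Agent": "Mozilla/5.0"}  # shortened placeholder value


def describe(url):
    """Build a Request without sending it and report its key parts."""
    req = request.Request(url, headers=HEADERS)
    return {
        "host": req.host,
        "selector": req.selector,    # the path that goes into the HTTP request line
        "fragment": req.fragment,    # the part after '#', kept on the client side
        "user_agent": req.get_header("User-agent"),  # header names are stored capitalized
    }


page1 = describe("https://www.cnblogs.com/#p1")
page2 = describe("https://www.cnblogs.com/#p2")
print(page1)
print(page2)
```

Both requests end up with the same host and selector, so as far as urllib is concerned the two URLs fetch the same document. Paging through a listing would need a URL whose path or query string changes per page; which URL that is for a given site is not something this sketch assumes.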
Example 2:
    # -*- coding: utf-8 -*-
    import unittest

    import requests
    from bs4 import BeautifulSoup as bs


    def school():
        for index in range(2, 34):
            url = "http://gaokao.chsi.com.cn/gkxx/zszcgd/dnzszc/201706/20170615/1611254988-%s.html" % index
            r = requests.get(url=url)
            soup = bs(r.content, 'lxml')  # the 'lxml' parser requires the lxml package
            try:
                # Text of the first cell with colspan=7, used as the output file name.
                city = soup.find_all(name="td", attrs={"colspan": "7"})[0].get_text(strip=True)
            except IndexError:
                continue  # this page has no such cell
            # "with" closes the file even if an error occurs while writing.
            with open("%s.txt" % city, "w", encoding="utf-8") as fp:
                for row in soup.find_all(name="tr", attrs={"height": "29"}):
                    cells = row.find_all(name="td")
                    if len(cells) < 2:
                        continue  # skip rows without a second column
                    # get_text() always returns a str, whereas .string can be None.
                    soup_content = cells[1].get_text(strip=True)
                    fp.write(soup_content + "\n")
                    print(soup_content)


    class MyTestCase(unittest.TestCase):
        def test_something(self):
            school()


    if __name__ == '__main__':
        unittest.main()
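BeautifulSoup supports several HTML parsers; the main ones are:
Parser | Usage | Advantages | Disadvantages |
Python standard library | BeautifulSoup(markup, "html.parser") | (1) Built into Python (2) Decent speed (3) Lenient with malformed documents | Less lenient in Python versions before 2.7.3 or 3.2.2 |
lxml HTML parser | BeautifulSoup(markup, "lxml") | (1) Very fast (2) Lenient with malformed documents | External C dependency |
lxml XML parser | BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml") | (1) Very fast (2) The only XML parser currently supported | External C dependency |
html5lib | BeautifulSoup(markup, "html5lib") | (1) Most lenient (2) Parses pages the same way a browser does (3) Produces valid HTML5 | (1) Very slow (2) External Python dependency |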
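The parsers in the table can repair the same broken markup differently. The following sketch feeds one deliberately invalid fragment to each parser and prints the result. The fragment and the helper name `parse_with` are chosen for illustration, and it assumes lxml and html5lib may or may not be installed, so a missing parser is reported rather than treated as an error.

```python
from bs4 import BeautifulSoup, FeatureNotFound

BROKEN_MARKUP = "<a></p>"  # deliberately invalid: a stray closing tag


def parse_with(parser):
    """Return the repaired markup as a string, or None if the parser is missing."""
    try:
        return str(BeautifulSoup(BROKEN_MARKUP, parser))
    except FeatureNotFound:
        # Raised by bs4 when the requested parser library is not installed.
        return None


results = {name: parse_with(name) for name in ("html.parser", "lxml", "html5lib")}
for name, output in results.items():
    print(name, "->", output if output is not None else "(not installed)")
```

Because the output depends on the parser, scraping code that must behave the same on every machine should name its parser explicitly instead of relying on whichever one happens to be installed.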