Python 3 Web Scraping (1)
Part One:
I. Reliable network connections:
Libraries used:
Python standard library: urllib
Python third-party library: BeautifulSoup
Install: pip3 install beautifulsoup4
Import: import bs4 (or from bs4 import BeautifulSoup)
cat scrapetest2.py
#!/usr/local/bin/python3
from urllib.request import urlopen
from bs4 import BeautifulSoup
from urllib.error import HTTPError

def getTitle(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        return None
    try:
        bsObj = BeautifulSoup(html.read())
        title = bsObj.body.h1
    except AttributeError as e:
        return None
    return title

x = 'http://pythonscraping.com/pages/page1.html'
title = getTitle(x)
if title == None:
    print('Title could not be found.')
else:
    print(title)

####### Output #######
python3 scrapetest2.py
/usr/local/lib/python3.5/site-packages/bs4/__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 21 of the file scrapetest2.py. To get rid of this warning, change code that looks like this:

 BeautifulSoup(YOUR_MARKUP})

to this:

 BeautifulSoup(YOUR_MARKUP, "html.parser")

  markup_type=markup_type))
<h1>An Interesting Title</h1>
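Following the UserWarning above, the same script can pass an explicit parser name to BeautifulSoup so the result does not depend on which parsers happen to be installed. A minimal sketch added here for illustration; only the constructor call changes:

#!/usr/local/bin/python3
# Same logic as scrapetest2.py, but with an explicit "html.parser" argument
# so bs4 does not emit the UserWarning shown above.
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

def getTitle(url):
    try:
        html = urlopen(url)
    except HTTPError:
        return None  # the page could not be fetched
    try:
        bsObj = BeautifulSoup(html.read(), 'html.parser')
        title = bsObj.body.h1
    except AttributeError:
        return None  # the tag we are looking for is missing
    return title

title = getTitle('http://pythonscraping.com/pages/page1.html')
print(title if title is not None else 'Title could not be found.')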
II. Advanced HTML parsing
.get_text() strips every tag (hyperlinks, paragraphs, and other markup) from the HTML document being processed and returns a string that contains only the text.
cat bs41.py
#!/usr/local/bin/python3
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
bsObj = BeautifulSoup(html)
nameList = bsObj.findAll('span',{'class':'green'})
for name in nameList:
    print(name.get_text())

################# Output: all of the text shown in green
Anna Pavlovna Scherer
Empress Marya Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
the prince
the prince
the prince
Prince Vasili
Anna Pavlovna
Anna Pavlovna
the prince
Wintzingerode
King of Prussia
le Vicomte de Mortemart
Montmorencys
Rohans
Abbe Morio
the Emperor
the prince
Prince Vasili
Dowager Empress Marya Fedorovna
the baron
Anna Pavlovna
the Empress
the Empress
Anna Pavlovna's
Her Majesty
Baron Funke
The prince
Anna Pavlovna
the Empress
The prince
Anatole
the prince
The prince
Anna Pavlovna
Anna Pavlovna
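To see exactly what .get_text() removes, it helps to print one matching tag both as markup and as plain text. A small sketch added here against the same warandpeace.html page:

#!/usr/local/bin/python3
# Print one green span as markup, then as text only.
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
bsObj = BeautifulSoup(html, 'html.parser')

first = bsObj.find('span', {'class': 'green'})
print(first)             # the whole tag: <span class="green">...</span>
print(first.get_text())  # only the text inside the tag, markup stripped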
find() and findAll() in BeautifulSoup
Purpose: filter an HTML page by the various attributes of its tags, to find a needed group of tags or a single tag.
findAll(tag, attributes, recursive, text, limit, keywords)
findAll = find_all(self, name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)
find(tag, attributes, recursive, text, keywords)
find(self, name=None, attrs={}, recursive=True, text=None, **kwargs)
Notes: tag takes a single tag name, or a Python list/set of several tag names, as the tag argument; findAll({'h1', 'h2', 'h3', 'h4', 'h5', 'h6'})
attributes takes a Python dictionary of a tag's attributes and their matching values; .findAll("span", {"class":{"green", "red"}})
recursive is a Boolean, True by default; if True, findAll searches all children of the tag argument, and the children's children; if False, it looks only at the top-level tags of the document;
text matches against the text content of tags rather than their attributes; nameList = bsObj.findAll(text='the prince'); print(len(nameList)) prints: 7
limit takes a number x and returns only the first x matches, in the order they appear on the page;
keywords selects tags that have a specified attribute and value (a combined sketch of these parameters follows the keyword examples below);
Keyword arguments:
allText = bsObj.findAll(id='text')
print(allText[0].get_text())
###### The following two lines are equivalent:
bsObj.findAll(id='text')
bsObj.findAll("", {"id": "text"})
###### Because class is a reserved word in Python, add an underscore after it:
bsObj.findAll(class_='green')
###### or put class in quotes inside the attributes dictionary instead:
bsObj.findAll("", {"class": "green"})
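Here is a sketch, added for illustration, that exercises the findAll() parameters described above against the same warandpeace.html page; the printed counts depend on the page's current markup:

#!/usr/local/bin/python3
# Exercise the findAll() parameters described above.
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
bsObj = BeautifulSoup(html, 'html.parser')

# tag: a set of tag names matches any of them
headings = bsObj.findAll({'h1', 'h2', 'h3', 'h4', 'h5', 'h6'})

# attributes: a dict mapping attribute names to one or more accepted values
spans = bsObj.findAll('span', {'class': {'green', 'red'}})

# text: match on the text content of a tag
princes = bsObj.findAll(text='the prince')

# limit: only the first two matches, in page order
firstTwo = bsObj.findAll('span', {'class': 'green'}, limit=2)

# recursive=False: only look at the document's top-level tags
topLevel = bsObj.findAll('html', recursive=False)

print(len(headings), len(spans), len(princes), len(firstTwo), len(topLevel))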
Two kinds of objects in the BeautifulSoup library:
1. The BeautifulSoup object
2. Tag objects
A list of objects, or a single object, obtained by calling child tags directly; bsObj.div.h1
The other two are:
3. NavigableString objects
Used to represent the text inside a tag, rather than the tag itself;
4. Comment objects
Used to find HTML comments in a document, <!-- like this --> (a short sketch of all four object types follows);
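A short sketch of how the four object types turn up in practice; the inline markup string here is made up for illustration:

#!/usr/local/bin/python3
# Illustrate the four bs4 object types with a small, made-up document.
from bs4 import BeautifulSoup, Comment

markup = '<div><h1>Hello</h1><!-- like this --></div>'  # hypothetical markup
bsObj = BeautifulSoup(markup, 'html.parser')            # 1. BeautifulSoup object
tag = bsObj.div.h1                                      # 2. Tag object
text = tag.string                                       # 3. NavigableString ('Hello')
comment = bsObj.div.find(text=lambda s: isinstance(s, Comment))  # 4. Comment object

print(type(bsObj), type(tag), type(text), type(comment))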
Navigating trees:
Finding tags by their position in the document is what navigating trees (Navigating Trees) is for;
1. Dealing with children and other descendants;
A child is always exactly one level below its parent tag, while descendants are tags at any level below a parent tag;
To find only children, use the .children attribute (a sketch counting children vs. descendants follows the two examples below);
####### children #######
# cat beau.py
#!/usr/local/bin/python3
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bsObj = BeautifulSoup(html)
for child in bsObj.find("table",{'id':'giftList'}).children:
    print(child)

####### descendants #######
cat beau.py
#!/usr/local/bin/python3
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bsObj = BeautifulSoup(html)
for child in bsObj.find("table",{'id':'giftList'}).descendants:
    print(child)
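To make the children/descendants distinction concrete, this sketch, added for illustration, counts both for the same giftList table; the exact numbers depend on the page:

#!/usr/local/bin/python3
# Count direct children vs. all descendants of the giftList table.
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bsObj = BeautifulSoup(html, 'html.parser')

table = bsObj.find('table', {'id': 'giftList'})
print(len(list(table.children)))     # rows (and whitespace strings) directly under the table
print(len(list(table.descendants)))  # rows, cells, images, strings: everything below the table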
2. Dealing with siblings;