Python Crawler (Part 3)
Crawling an Entire Domain
Six degrees of separation: any two strangers are separated by no more than six people, meaning you can reach any stranger through at most five intermediaries. On Wikipedia, we can likewise follow links from one person's page to the page of anyone we want to reach.
1. Get all the links on a page
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://en.wikipedia.org/wiki/Kevin_Bacon")
bsObj = BeautifulSoup(html, 'html.parser')
for link in bsObj.find_all("a"):
    if 'href' in link.attrs:
        print(link.attrs['href'])
2. Get the articles related to the current Wikipedia person
1. Exclude the sidebar, footer, and header links that appear on every page, as well as category pages and talk pages.
2. Links from the current page to other article pages have the following in common:
I. They are contained in a div whose id is bodyContent.
II. Their URLs do not contain a colon and begin with /wiki/.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen("http://en.wikipedia.org/wiki/Kevin_Bacon")
bsObj = BeautifulSoup(html, "html.parser")
for link in bsObj.find('div', {"id": "bodyContent"}).find_all("a", href=re.compile("^(/wiki/)((?!:).)*$")):
    if 'href' in link.attrs:
        print(link.attrs['href'])
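To see what the href filter above keeps and discards, here is a quick check of the regular expression against a few sample paths (the paths are made-up examples, not taken from a real page): internal article links match, while category and talk pages are rejected because they contain a colon.

import re

article_pattern = re.compile("^(/wiki/)((?!:).)*$")

samples = [
    "/wiki/Kevin_Bacon",         # article page: matches
    "/wiki/Category:Actors",     # category page: rejected (contains a colon)
    "/wiki/Talk:Kevin_Bacon",    # talk page: rejected (contains a colon)
    "//upload.wikimedia.org/x",  # external resource: rejected (no /wiki/ prefix)
]
for path in samples:
    print(path, "->", bool(article_pattern.match(path)))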
3. Searching deeper
Simply extracting the links from a single Wikipedia page is not very useful by itself; being able to start from the current page and keep following links in a loop is a big improvement.
1. Create a simple function that returns all the article links on the current page.
2. Create a main function that starts from one page, follows one of its links chosen at random, and keeps searching from each new page until no new links are found.
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup
from random import choice
import re

basename = "http://en.wikipedia.org"

def getLinks(pagename):
    url = basename + pagename
    try:
        with urlopen(url) as html:
            bsObj = BeautifulSoup(html, "html.parser")
            links = bsObj.find("div", {"id": "bodyContent"}).find_all("a", href=re.compile("^(/wiki/)((?!:).)*$"))
            return [link.attrs['href'] for link in links if 'href' in link.attrs]
    except (HTTPError, AttributeError):
        # return an empty list so the caller's loop simply stops
        return []

def main():
    links = getLinks("/wiki/Kevin_Bacon")
    while len(links) > 0:
        nextpage = choice(links)
        print(nextpage)
        links = getLinks(nextpage)

main()
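Note that this random walk only stops when a page yields no article links, or a fetch fails and getLinks returns an empty list; on Wikipedia that can take a very long time. Below is a minimal sketch of a bounded variant, reusing getLinks and choice from the script above (the max_steps parameter is my own addition, not part of the original).

def bounded_walk(start="/wiki/Kevin_Bacon", max_steps=20):
    # follow random article links, but give up after max_steps pages
    links = getLinks(start)
    steps = 0
    while links and steps < max_steps:
        nextpage = choice(links)
        print(nextpage)
        links = getLinks(nextpage)
        steps += 1

bounded_walk()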
4. Crawling the entire domain
1. To crawl an entire site, start from the site's home page.
2. Keep a record of the pages that have already been visited, so the same address is not fetched more than once.
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup
import re

basename = "http://en.wikipedia.org"
visitedpages = set()  # use a set to record the pages that have already been visited

def visitelink(pagename):
    url = basename + pagename
    global visitedpages
    try:
        with urlopen(url) as html:
            bsObj = BeautifulSoup(html, "html.parser")
            links = bsObj.find("div", {"id": "bodyContent"}).find_all("a", href=re.compile("^(/wiki/)((?!:).)*$"))
            for eachlink in links:
                if 'href' in eachlink.attrs:
                    if eachlink.attrs['href'] not in visitedpages:
                        nextpage = eachlink.attrs['href']
                        print(nextpage)
                        visitedpages.add(nextpage)
                        visitelink(nextpage)
    except (HTTPError, AttributeError):
        return None

visitelink("")
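Because visitelink calls itself for every newly discovered page, a crawl of a site as large as Wikipedia will eventually hit Python's default recursion limit (about 1000 nested calls). Here is a minimal sketch of an iterative alternative that uses a list as an explicit stack instead of recursion (the name crawl_site is my own, not from the original).

from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup
import re

basename = "http://en.wikipedia.org"

def crawl_site(start=""):
    visited = set()
    stack = [start]  # pages waiting to be visited
    while stack:
        pagename = stack.pop()
        if pagename in visited:
            continue
        visited.add(pagename)
        print(pagename)
        try:
            html = urlopen(basename + pagename)
            bsObj = BeautifulSoup(html, "html.parser")
            links = bsObj.find("div", {"id": "bodyContent"}).find_all(
                "a", href=re.compile("^(/wiki/)((?!:).)*$"))
        except (HTTPError, AttributeError):
            continue  # skip pages that fail to load or parse
        for link in links:
            if 'href' in link.attrs and link.attrs['href'] not in visited:
                stack.append(link.attrs['href'])

crawl_site("")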
5. Collecting useful information from the site
1. Nothing special here: while visiting each page, print its h1 heading and some of its text content.
2. A problem that shows up when printing:
UnicodeEncodeError: 'gbk' codec can't encode character u'\xa9' in position 24051: illegal multibyte sequence
Solution: call source_code.encode('GB18030') on the text before printing it.
Explanation: GB18030 is a superset of GBK, so it can handle the characters that GBK cannot encode.
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup
import re

basename = "http://en.wikipedia.org"
visitedpages = set()  # use a set to record the pages that have already been visited

def visitelink(pagename):
    url = basename + pagename
    global visitedpages
    try:
        with urlopen(url) as html:
            bsObj = BeautifulSoup(html, "html.parser")
            try:
                print(bsObj.h1.get_text())
                print(bsObj.find("div", {"id": "mw-content-text"}).find("p").get_text().encode('GB18030'))
            except AttributeError:
                print("AttributeError")
            links = bsObj.find("div", {"id": "bodyContent"}).find_all("a", href=re.compile("^(/wiki/)((?!:).)*$"))
            for eachlink in links:
                if 'href' in eachlink.attrs:
                    if eachlink.attrs['href'] not in visitedpages:
                        nextpage = eachlink.attrs['href']
                        print(nextpage)
                        visitedpages.add(nextpage)
                        visitelink(nextpage)
    except (HTTPError, AttributeError):
        return None

visitelink("")
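A minimal, self-contained illustration of the encoding problem and the workaround (the sample string is made up): GBK cannot represent U+00A9, so encoding with GBK, or printing such a string on a console that uses the GBK codec, raises the error shown above, while encoding with GB18030 always succeeds and yields a bytes object that prints without complaint.

text = "Copyright \xa9 Wikipedia"  # contains U+00A9, which GBK cannot encode

try:
    print(text.encode('gbk'))
except UnicodeEncodeError as e:
    print("gbk failed:", e)

# GB18030 is a superset of GBK and covers all of Unicode, so this succeeds,
# at the cost of printing a bytes object rather than readable text
print(text.encode('GB18030'))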