Scraping Web Page Content with Python's BeautifulSoup Library
BeautifulSoup is a library for parsing and working with the DOM tree of a page, which makes it handy for writing web crawlers. The module is imported as bs4.
Installing the library (either command works):
easy_install beautifulsoup4
pip install beautifulsoup4
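A quick way to verify the install, keeping in mind that the package installs as beautifulsoup4 but imports as bs4:

python -c "import bs4; print(bs4.__version__)"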
Here are some basic usage examples:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister text-bold text-danger" id="link3" title="this is title!">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="red">...</p>
<p class="green">...</p>
<p class="red green">...</p>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# find() returns the first matching element
link3 = soup.find(id='link3')
print(link3)
# <a class="sister text-bold text-danger" href="http://example.com/tillie" id="link3" title="this is title!">Tillie</a>
print(type(link3))
# <class 'bs4.element.Tag'>
print(link3.attrs)
# {'href': 'http://example.com/tillie', 'title': 'this is title!', 'id': 'link3', 'class': ['sister', 'text-bold', 'text-danger']}
print(link3.get_text())
# Tillie
print(link3["title"])
# this is title!

# find_all() returns every matching element
all_a = soup.find_all('a')
print(all_a[0])
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

print(soup.find_all(text=["Tillie", "Elsie", "Lacie"]))
# ['Elsie', 'Lacie', 'Tillie']

# Duplicate dict keys collapse in Python, so the original {"class": "red", "class": "red green"}
# is just {"class": "red green"}; this matches the exact class attribute string.
print(soup.find_all("p", {"class": "red green"}))
# [<p class="red green">...</p>]
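As a side note, the last multi-class lookup can also be written as a CSS selector via soup.select, which bs4 supports out of the box; a small sketch against the same html_doc:

# p.red.green matches <p> tags carrying both the red and the green class
print(soup.select("p.red.green"))
# [<p class="red green">...</p>]

# attribute selectors work too: every <a> that has an id attribute
print(soup.select("a[id]"))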
An example
Collecting the title attribute of every img tag:
# -*- coding: utf-8 -*-
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

url = "http://qa.beloved999.com/category/view?id=2"
url = "http://beloved.finley.com/category/view?id=24"  # the second assignment overrides the first

html = urlopen(url)
bs = BeautifulSoup(html.read(), "html.parser")
res = bs.find_all("img", "item-image")  # find_all is the modern spelling of findAll
print(len(res))
for a in res:
    print(a['title'])
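The script imports HTTPError but never uses it. A minimal sketch of how it could guard the urlopen call (reusing the url variable from the script above), so a failed request is reported instead of raising:

try:
    html = urlopen(url)
except HTTPError as e:
    # e.code is the HTTP status, e.g. 403
    print("request failed:", e.code, e.reason)
else:
    bs = BeautifulSoup(html.read(), "html.parser")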
Note that some sites will refuse the request with 403 Forbidden. For example, 开源中国 (oschina.net) failed when I tried it; this is most likely related to the request headers.
On inspection, the User-Agent being sent was Python-urllib/3.4. The HTTP headers, as seen by the server, were:
'HTTP_ACCEPT_ENCODING' => 'identity'
'HTTP_CONNECTION' => 'close'
'HTTP_HOST' => 'beloved.finley.com'
'HTTP_USER_AGENT' => 'Python-urllib/3.4'
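A common workaround is to send a browser-like User-Agent instead of the default Python-urllib one. Below is a minimal sketch using urllib.request.Request; the User-Agent string is just an example, and whether a given site then accepts the request is up to that site:

from urllib.request import urlopen, Request

url = "http://beloved.finley.com/category/view?id=24"  # URL from the example above
# override the default Python-urllib/3.x User-Agent with a browser-like one
req = Request(url, headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"})
html = urlopen(req)
print(html.status)  # 200 if the server accepted the request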