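Python BeautifulSoup module
BeautifulSoup documentation (Chinese): https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/
BeautifulSoup download: http://www.crummy.com/software/BeautifulSoup/
Extract the downloaded archive into any directory.
Open a cmd console and change into that directory.
Run: python setup.py install
After it finishes, start python from the command line and run import bs4; if no error is raised, the installation succeeded.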
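The import check can also be run as a short script. This is a minimal sketch that assumes bs4 is already installed; printing the version string is an addition for illustration and is not part of the steps above.

```python
# Check that the bs4 package can be imported after installation.
import bs4

# __version__ holds the installed release as a string.
print(bs4.__version__)
```

If the import fails with ModuleNotFoundError, the package was not installed into the interpreter that is running the script.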
Assume the variable content holds the whole web page as a string, or the return value of urllib.request.urlopen(url).
First import the module, then wrap content in a soup object:
from bs4 import BeautifulSoup
soup = BeautifulSoup(content,'html.parser')
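Both kinds of content can be tried without a network connection. In the sketch below, html_text is a hypothetical sample page and io.BytesIO only stands in for the file-like response that urllib.request.urlopen(url) would return; neither name comes from the original notes.

```python
import io
from bs4 import BeautifulSoup

# Hypothetical sample page; in practice this would be downloaded.
html_text = '<html><body><p>Hello</p></body></html>'

# Case 1: content is a plain string.
soup_from_str = BeautifulSoup(html_text, 'html.parser')

# Case 2: content is a file-like object with a read() method.
# BytesIO is used here as a stand-in for a real HTTP response.
fake_response = io.BytesIO(html_text.encode('utf-8'))
soup_from_file = BeautifulSoup(fake_response, 'html.parser')

print(soup_from_str.p.string)
print(soup_from_file.p.string)
```

Both soups expose the same tree, so the rest of the calls in these notes work the same way on either one.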
1. Pretty-print the string as an indented web page (the return value is a string)
print(soup.prettify())
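A small self-contained run makes the return type visible. The content string here is a made-up example, not taken from a real page.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line markup to be re-indented.
content = '<div><p>one</p><p>two</p></div>'
soup = BeautifulSoup(content, 'html.parser')

# prettify() returns a str with one tag or text node per line.
pretty = soup.prettify()
print(type(pretty).__name__)
print(pretty)
```

Because prettify() adds line breaks and indentation, it is best used for inspection rather than for producing markup whose whitespace matters.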
2. Extract specific tags from the page
For example, extract all <a> tags:
soup = BeautifulSoup(content,'html.parser')
print(soup.find_all('a'))
Or extract all <a> tags and <b> tags:
soup = BeautifulSoup(content,'html.parser')
print(soup.find_all(['a','b']))
Or extract all <span> tags whose class is t-large (that is, every tag of the form <span class="t-large"></span>):
soup = BeautifulSoup(content,'html.parser')
print(soup.find_all('span','t-large'))
Or extract all tags that have a class attribute but no id attribute:
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
soup = BeautifulSoup(content,'html.parser')
print(soup.find_all(has_class_but_no_id))
Or extract all tags whose id equals "link2":
soup = BeautifulSoup(content,'html.parser')
print(soup.find_all(id="link2"))
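The find_all variants above can be compared side by side on one small page. The markup below is a hypothetical sample written for this sketch; only the find_all calls themselves come from the notes.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page covering every case from this section.
content = '''
<p class="intro">
  <a id="link1" href="/one">One</a>
  <a id="link2" href="/two">Two</a>
  <b>bold</b>
  <span class="t-large">big</span>
  <span class="t-small">small</span>
</p>
'''
soup = BeautifulSoup(content, 'html.parser')

def has_class_but_no_id(tag):
    # A function filter receives each tag and returns True to keep it.
    return tag.has_attr('class') and not tag.has_attr('id')

links = soup.find_all('a')                        # by tag name
links_and_bold = soup.find_all(['a', 'b'])        # by a list of names
large_spans = soup.find_all('span', 't-large')    # by name and CSS class
class_no_id = soup.find_all(has_class_but_no_id)  # by a function
link2 = soup.find_all(id='link2')                 # by attribute value

print(len(links), len(links_and_bold), len(large_spans),
      len(class_no_id), len(link2))
```

find_all always returns a list-like result, even when only one tag matches, so link2[0] is needed to reach the tag itself.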
3. Get the children of a tag (or of a soup object) with .contents
print(soup.contents)
print(soup.a.contents)
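What .contents returns is easier to see on a tiny document. The markup here is a hypothetical example; the sketch assumes the html.parser backend used elsewhere in these notes.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one <a> tag holding text and a nested <b> tag.
soup = BeautifulSoup('<a href="/x">go <b>now</b></a>', 'html.parser')

# .contents is a list of direct children only (tags and text nodes).
top_level = soup.contents
inside_a = soup.a.contents

print(len(top_level))
print(inside_a)
```

Text nodes appear in the list alongside tags, so inside_a holds the string 'go ' followed by the <b> tag.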
4. Get a tag's class attribute (note that the return value is a list even when there is only one class, because class is a multi-valued attribute in HTML)
print(soup.a['class'])
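The difference between multi-valued and ordinary attributes shows up in a short example. The markup is hypothetical; using .get() for a tag that may lack the attribute is a suggestion added here, since indexing with ['class'] raises KeyError when the attribute is missing.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one tag with two classes, one tag with none.
soup = BeautifulSoup(
    '<a class="btn primary" href="/x">x</a><p>no class</p>',
    'html.parser',
)

classes = soup.a['class']      # multi-valued attribute: a list of str
href = soup.a['href']          # ordinary attribute: a single str
missing = soup.p.get('class')  # .get() returns None instead of raising

print(classes, href, missing)
```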
5. Remove a tag
>>> soup = BeautifulSoup('<script>a</script>baba<script>b</script>', 'html.parser')
>>> [s.extract() for s in soup('script')]
[<script>a</script>, <script>b</script>]
>>> soup
baba
6. Remove a tag with a specific class
from bs4 import BeautifulSoup
markup = '<a>This is not div <div class="1">This is div 1</div><div class="2">This is div 2</div></a>'
soup = BeautifulSoup(markup,"html.parser")
a_tag = soup
soup.find('div',class_='2').decompose()
print(a_tag)
# <a>This is not div <div class="1">This is div 1</div></a>
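Sections 5 and 6 use two different removal methods, and the sketch below puts them next to each other. The markup and the class name ad are hypothetical; the loop over find_all is one possible way to remove every match rather than only the first.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with one script and two unwanted divs.
markup = ('<script>a</script><div class="ad">x</div>'
          '<p>keep</p><div class="ad">y</div>')
soup = BeautifulSoup(markup, 'html.parser')

# extract() detaches the tag and returns it, so it can still be inspected.
removed_scripts = [s.extract() for s in soup.find_all('script')]
script_names = [tag.name for tag in removed_scripts]

# decompose() detaches the tag and destroys it; use it when the
# removed tag is not needed afterwards.
for tag in soup.find_all('div', class_='ad'):
    tag.decompose()

result = str(soup)
print(script_names)
print(result)
```

find_all builds its result list before the loop starts, so removing tags inside the loop does not disturb the iteration.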
7. Note that BeautifulSoup writes the <br> tag as <br/> in its output
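This matters when comparing or searching the serialized output as text. A minimal sketch with a hypothetical input string, assuming the html.parser backend:

```python
from bs4 import BeautifulSoup

# The input uses the plain <br> form.
soup = BeautifulSoup('line one<br>line two', 'html.parser')

# The serialized output uses the self-closing <br/> form.
rendered = str(soup)
br_count = len(soup.find_all('br'))

print(rendered)
print(br_count)
```

Searching the tree with find_all('br') is unaffected; only string comparisons against the output need to expect <br/>.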
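8. Extract all strings in a soup
for string in soup.strings:
    print(repr(string))
Extract the strings in a soup, skipping whitespace-only strings and stripping leading and trailing whitespace from the rest
for string in soup.stripped_strings:
    print(repr(string))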
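The two generators can be compared on the same input. The markup below is a hypothetical sample chosen to contain blank lines and padded text.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with whitespace between and inside tags.
content = '<div>\n  <p> first </p>\n\n  <p>second</p>\n</div>'
soup = BeautifulSoup(content, 'html.parser')

# .strings yields every text node, including whitespace-only ones.
all_strings = list(soup.strings)

# .stripped_strings drops whitespace-only nodes and strips the rest.
clean_strings = list(soup.stripped_strings)

print(len(all_strings))
print(clean_strings)
```

Both attributes are generators, so they are wrapped in list() here to allow counting and printing.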