Python Web Scraping (7): BeautifulSoup

Today I'll introduce a very handy Python scraping library: beautifulsoup4. The Chinese documentation for beautifulsoup4 is at http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/

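First, install the library with pip. We will also use the lxml parser; together they make it easy to process HTML documents and extract the information we need. You can run pip list to see which packages you already have installed. One important reminder: the command must be pip install beautifulsoup4. Do not forget the 4, otherwise the install fails with an error like this: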

　　　　　　print "Unit tests have failed!"

　　　　　　　　SyntaxError: Missing parentheses in call to 'print'

　　　　　　Command "python setup.py egg_info" failed with error code 1 in C:\Users\ADMINI~1\AppData\Local\Temp\pip-build-4g6q3fil\...

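In Python 3, print is a function and must be called with parentheses, so the error tells us this is a version incompatibility: without the 4, pip fetches the old BeautifulSoup 3 package, which was written for Python 2. Python 3 needs beautifulsoup4. I ran into exactly this problem when I first installed it, but fortunately spotted and fixed it quickly. When the installation succeeds, pip prints a "Successfully installed" message.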
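A quick way to confirm that the right package ended up in your environment is to import it and print its version. This is only a minimal sketch; it assumes beautifulsoup4 is installed in the Python environment you are running.

```python
# Sanity check after `pip install beautifulsoup4`.
# Note that beautifulsoup4 is imported under the name bs4.
import bs4

# A version string starting with "4." means beautifulsoup4 is in use.
print(bs4.__version__)
```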

C:\Users\Administrator\Desktop
λ ipython
Python 3.6.4 (v3.6.4:d48eceb, Dec 19 2017, 06:54:40) [MSC v.1900 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 6.2.1 -- An enhanced Interactive Python. Type '?' for help.
# Import the package
In [1]: from bs4 import BeautifulSoup

In [2]: html="""\
   ...: <!DOCTYPE HTML> <html> <head> <meta charset="utf-8"> <title>我的博客(CCColby.com)</title> </head> <body>  <video width="320" height="240" controls>   <source src="movie.mp4" type="video/mp4">   <source src="movie.ogg" type="video/ogg">   你的浏览器不支持 video 标签。 </video>  </body> </html>
   ...: """
# Create the object; if no parser is specified, a warning is shown
In [3]: soup=BeautifulSoup(html)
c:\users\administrator\appdata\local\programs\python\python36\lib\site-packages\bs4\__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 193 of the file c:\users\administrator\appdata\local\programs\python\python36\lib\runpy.py. To get rid of this warning, change code that looks like this:

 BeautifulSoup(YOUR_MARKUP})

to this:

 BeautifulSoup(YOUR_MARKUP, "lxml")

  markup_type=markup_type))
# Specify 'lxml' as the parser
In [4]: soup=BeautifulSoup(html,"lxml")
# Pretty-print the soup object
In [5]: print(soup.prettify())
<!DOCTYPE HTML>
<html>
 <head>
  <meta charset="utf-8"/>
  <title>
   我的博客(CCColby.com)
  </title>
 </head>
 <body>
  <video controls="" height="240" width="320">
   <source src="movie.mp4" type="video/mp4">
    <source src="movie.ogg" type="video/ogg">
     你的浏览器不支持 video 标签。
    </source>
   </source>
  </video>
 </body>
</html>
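The warning in the session above comes from leaving the parser unspecified. One possible way to always name a parser, while still running on machines where lxml is missing, is sketched below. The helper name make_soup is hypothetical and not part of Beautiful Soup; the fallback to Python's built-in "html.parser" is my own assumption about what a reasonable default would be.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Hypothetical helper: prefer lxml, fall back to the built-in parser."""
    try:
        # Naming the parser explicitly avoids the UserWarning.
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # Raised by bs4 when the requested parser is not installed.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<title>demo</title>")
print(soup.title.string)
```

Keep in mind that different parsers can build slightly different trees from the same broken or unusual markup, so a fallback like this trades consistency for portability.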

BeautifulSoup turns a complex HTML document into a tree structure in which every node is a Python object. These objects fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.
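To make the four object kinds concrete, here is a small sketch that parses a one-line document and looks at the type of each node. The sample markup is made up for illustration, and "html.parser" is used only so the snippet does not depend on lxml.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Made-up markup: one tag holding a text node and a comment.
doc = '<p class="intro">Hello<!-- a note --></p>'
soup = BeautifulSoup(doc, "html.parser")

p = soup.p              # a Tag
text, note = p.contents  # a NavigableString and a Comment

for node in (soup, p, text, note):
    print(type(node).__name__)
```

Comment is a special kind of NavigableString, so isinstance(note, NavigableString) is also true.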

Writing soup followed by a tag name gives you that tag directly (the first matching one in the document):

In [6]: print(soup.title)
<title>我的博客(CCColby.com)</title>

In [7]: print(soup.head)
<head> <meta charset="utf-8"/> <title>我的博客(CCColby.com)</title> </head>

In [8]: print(soup.source)
<source src="movie.mp4" type="video/mp4"> <source src="movie.ogg" type="video/ogg">   你的浏览器不支持 video 标签。 </source></source>
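Once you have a tag, its name and attributes are available too. The following is a minimal sketch on made-up markup (the link and its attribute values are illustrative, not from the session above), again using "html.parser" to avoid depending on lxml.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration.
doc = '<a href="/post/7" class="name">Post 7</a>'
soup = BeautifulSoup(doc, "html.parser")

link = soup.a
print(link.name)          # the tag name
print(link["href"])       # dictionary-style attribute access
print(link.get("class"))  # class is multi-valued, so this is a list
print(link.get("id"))     # .get() returns None for a missing attribute
print(link.attrs)         # all attributes as a dict
```

Using link["id"] instead of link.get("id") would raise a KeyError here, so .get() is the safer choice when an attribute may be absent.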

What if we want the text inside a tag? That is easy:

In [9]: print(soup.title.string)
我的博客(CCColby.com)
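One thing worth knowing about .string: it only works when a tag has a single child. If the tag contains several children, .string is None, and get_text() is the usual alternative. A small sketch on made-up markup:

```python
from bs4 import BeautifulSoup

# Made-up markup: <p> has two children (a text node and a <b> tag).
soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")

print(soup.b.string)      # <b> has exactly one child, so this is its text
print(soup.p.string)      # <p> has two children, so this is None
print(soup.p.get_text())  # concatenates all text inside <p>
```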

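To traverse the document tree, use a tag's .contents attribute (a list of its direct children) or its .children attribute (an iterator over the same children). To walk through every descendant node, use the .descendants attribute. The details are best learned through worked examples.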
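As a rough illustration of the difference between these three attributes, the sketch below walks a tiny made-up list. The markup is written without whitespace between tags so that no extra whitespace text nodes appear among the children.

```python
from bs4 import BeautifulSoup

# Made-up markup, deliberately without whitespace between tags.
soup = BeautifulSoup("<ul><li>one</li><li>two</li></ul>", "html.parser")
ul = soup.ul

direct = ul.contents                          # list of direct children
names = [child.name for child in ul.children]  # same children, as an iterator
everything = list(ul.descendants)             # children, grandchildren, ...

print(len(direct))
print(names)
print(len(everything))
```

Here .descendants yields four nodes: each li tag followed by the text node inside it.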

Searching the document tree: find_all(name, attrs, recursive, text, **kwargs)

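The name argument finds every Tag whose name matches name; string objects (text nodes) are automatically ignored. It accepts a string, a regular expression (compiled with re.compile()), or a list. The text argument searches the text content of the document instead of tags (since Beautiful Soup 4.4.0 this argument is called string, and text is the older name).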
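The different kinds of name filter are easier to see side by side. The sketch below runs them on a small made-up document; it assumes Beautiful Soup 4.4.0 or later, so it uses the string argument rather than the older text spelling.

```python
import re

from bs4 import BeautifulSoup

# Made-up markup for illustration.
doc = '<body><a href="/x">link</a><p><b>bold</b></p></body>'
soup = BeautifulSoup(doc, "html.parser")

by_string = soup.find_all("a")              # exact tag name
by_regex = soup.find_all(re.compile("^b"))  # names starting with "b"
by_list = soup.find_all(["a", "b"])         # any name in the list
by_attr = soup.find_all("a", href="/x")     # keyword argument filters an attribute
by_text = soup.find_all(string="bold")      # matches text nodes, not tags
top_only = soup.body.find_all("b", recursive=False)  # direct children only

print([tag.name for tag in by_regex])
print(by_text)
print(top_only)
```

With recursive=False the search looks only at direct children of body, so the b tag nested inside p is not found.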

Another way to search is with CSS selectors.

# Find by tag name
print(soup.select('title'))

# Find by attribute
print(soup.select('a[class="name"]'))

# select() always returns a list, so loop over the results
# and call get_text() on each one to get its text
for title in soup.select('title'):
    print(title.get_text())
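The document used earlier in this post has no a tags, so here is a self-contained sketch of the same selectors on made-up markup. The ids, class names, and link texts are all illustrative assumptions; select_one() is shown as a convenience when only the first match is wanted.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration.
doc = (
    '<div id="nav">'
    '<a class="name" href="/a">Alice</a>'
    '<a class="other" href="/b">Bob</a>'
    '</div>'
)
soup = BeautifulSoup(doc, "html.parser")

by_attr = soup.select('a[class="name"]')  # attribute selector
by_id = soup.select("#nav a")             # every <a> inside the element with id="nav"
first_other = soup.select_one("a.other")  # first match, or None if nothing matches

print([a.get_text() for a in by_attr])
print(len(by_id))
print(first_other.get_text())
```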

The next post will work through a practical BeautifulSoup example to deepen our understanding.

posted @ 2018-02-24 12:37 CCColby
