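Python Web Scraping and Extraction: Getting Started with Beautiful Soup
(1). Installing the Beautiful Soup library
Beautiful Soup (its Chinese nickname translates to "tasty soup") is an excellent third-party Python library that parses HTML and XML documents and extracts information from them. The official website is "".
Like other third-party packages, Beautiful Soup is installed with pip, using the command "pip install beautifulsoup4".
If you see "WARNING: You are using pip version 20.2.3; however, version 20.2.4 is available.", don't panic. This message only tells you that a newer pip is available; upgrading is optional. You can run "pip list" to check whether the package was installed successfully.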
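Besides "pip list", the installation can also be checked from inside Python. The following is a minimal sketch, assuming Python 3.8 or later (where importlib.metadata is in the standard library); the helper name installed_version is made up for this illustration.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# "beautifulsoup4" is the distribution name used on the pip command line.
version = installed_version("beautifulsoup4")
if version is None:
    print("beautifulsoup4 is not installed")
else:
    print("beautifulsoup4", version)
```

A return value of None means pip has not installed the package into the interpreter that runs the script.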
    <html><head><title>This is a python demo page</title></head>
    <body>
    <p class="title"><b>The demo python introduces several python courses.</b></p>
    <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
    <a href="" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse1" class="py2" id="link2">Advanced Python</a>.</p>
    </body></html>
    >>> import requests
    >>> r = requests.get("")
    >>> r.text
    '<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="" class="py1" id="link1">Basic Python</a> and <a href="" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'
    >>> demo = r.text
    >>> from bs4 import BeautifulSoup  # bs4 is short for BeautifulSoup4; this imports the BeautifulSoup class from the bs4 library
    >>> soup = BeautifulSoup(demo, "html.parser")  # "html.parser" is the parser; it tells bs4 to parse demo as HTML
    >>> print(soup.prettify())
    <html>
     <head>
      <title>
       This is a python demo page
      </title>
     </head>
     <body>
      <p class="title">
       <b>
        The demo python introduces several python courses.
       </b>
      </p>
      <p class="course">
       Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
       <a class="py1" href="" id="link1">
        Basic Python
       </a>
       and
       <a class="py2" href="" id="link2">
        Advanced Python
       </a>
       .
      </p>
     </body>
    </html>
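(2). Basic elements of the Beautiful Soup library
1) Understanding the Beautiful Soup library
Beautiful Soup is a library for parsing HTML and XML files; put another way, it is a library for parsing, traversing, and maintaining a "tag tree". Take HTML as an example: open the source of any HTML file and you will see that it is organized by tags written in angle brackets. Each pair of angle brackets forms a tag, and the tags nest inside one another in parent-child relationships, which together form a tag tree.
2) Importing the Beautiful Soup library
Beautiful Soup is also called the beautifulsoup4 library or the bs4 library. To use it we have to import it, and the most common form today is "from bs4 import BeautifulSoup", which imports a class named BeautifulSoup from the bs4 library.
Besides this conventional form, if we need to check some of the library's basic variables, such as its element types, we can also import the library directly with "import bs4".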
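As an illustration of when "import bs4" is handy, the sketch below uses the bs4.element.Tag type to tell tags apart from the strings that sit between them. The inline markup is invented for this example and is not the demo page.

```python
import bs4

# Inline markup used only for this illustration.
markup = "<body>\n<p>one</p>\n<p>two</p>\n</body>"
soup = bs4.BeautifulSoup(markup, "html.parser")

# The children of <body> mix Tag objects with the newline strings between them.
children = list(soup.body.children)
tag_children = [node for node in children if isinstance(node, bs4.element.Tag)]

print(len(children))                        # all child nodes, strings included
print([tag.name for tag in tag_children])   # names of the Tag children only
```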
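3) The BeautifulSoup class
Beautiful Soup parses HTML and XML documents, and each such document corresponds one-to-one to a tag tree. After processing by the BeautifulSoup class, each tag tree becomes a BeautifulSoup object. In this process the document can be treated as a string, and BeautifulSoup is the type that represents its tag tree.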
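The correspondence between a document string and a BeautifulSoup object can be sketched as follows. The markup is invented for this example, and io.StringIO stands in for an opened HTML file (an assumption made so that the snippet needs no file on disk); BeautifulSoup accepts either a string or a file-like object.

```python
import io

from bs4 import BeautifulSoup

# Inline markup used only for this illustration.
markup = "<html><body><p>data</p></body></html>"

# Build one object from a string and one from a file-like object.
soup_from_string = BeautifulSoup(markup, "html.parser")
soup_from_file = BeautifulSoup(io.StringIO(markup), "html.parser")

# Both objects represent the same tag tree, and str() turns it back into text.
print(soup_from_string.p.string)
print(str(soup_from_string) == str(soup_from_file))
```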
4) Beautiful Soup parsers
Parser | Usage | Requirement |
bs4's HTML parser (Python's built-in html.parser) | BeautifulSoup(mk,'html.parser') | Install the bs4 library |
lxml's HTML parser | BeautifulSoup(mk,'lxml') | pip install lxml |
lxml's XML parser | BeautifulSoup(mk,'xml') | pip install lxml |
html5lib's parser | BeautifulSoup(mk,'html5lib') | pip install html5lib |
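Because lxml and html5lib are optional installs, a script may want to fall back to the built-in parser when they are missing. The sketch below is one possible approach, not something prescribed by the library: the helper name make_soup and the preference order are assumptions for this illustration. It relies on bs4 raising FeatureNotFound when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed.

    Returns a (soup, parser_name) pair.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is installed")


soup, used_parser = make_soup("<p>hi</p>")
print(used_parser, soup.p.string)
```

Different parsers can build slightly different trees from the same markup (for example, lxml wraps a fragment in html and body tags), so results that depend on the exact tree shape are best checked against the parser actually in use.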
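5) Basic elements of the BeautifulSoup class
Element | Description |
Tag | A tag, the most basic unit of information; it opens with <> and closes with </> |
Name | The tag's name; the name of <p>...</p> is 'p'. Format: <Tag>.name |
Attributes | The tag's attributes, organized as a dictionary. Format: <Tag>.attrs |
NavigableString | The non-attribute string inside a tag, i.e. the text between <> and </>. Format: <Tag>.string |
Comment | The comment part of a string inside a tag; a special type named Comment |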
    >>> import requests  # import the requests library
    >>> r = requests.get("")
    >>> demo = r.text  # store the page source in the variable demo
    >>> from bs4 import BeautifulSoup  # import the BeautifulSoup class
    >>> soup = BeautifulSoup(demo, "html.parser")
    >>> soup.title  # the title tag
    <title>This is a python demo page</title>
    >>> tag = soup.a  # assign the a tag to tag; this returns only the first a tag
    >>> tag
    <a class="py1" href="" id="link1">Basic Python</a>
    >>> tag.name  # name of the a tag
    'a'
    >>> tag.parent.name  # name of the a tag's parent
    'p'
    >>> tag.parent.parent.name  # name of the parent's parent
    'body'
    >>> tag.attrs  # attributes of the a tag
    {'href': '', 'class': ['py1'], 'id': 'link1'}
    >>> tag.attrs['class']  # the class attribute of the a tag
    ['py1']
    >>> tag.attrs['href']  # the href attribute of the a tag
    ''
    >>> type(tag.attrs)  # the attributes are stored in a dict
    <class 'dict'>
    >>> type(tag)  # the tag itself is a bs4.element.Tag, a type defined by bs4
    <class 'bs4.element.Tag'>
    >>> soup.a.string  # the non-attribute string inside the a tag
    'Basic Python'
    >>> soup.p
    <p class="title"><b>The demo python introduces several python courses.</b></p>
    >>> soup.p.string  # reaches straight through the b tag, so .string can pass through nested tags
    'The demo python introduces several python courses.'
    >>> type(soup.p.string)  # the string inside the p tag is also a special type
    <class 'bs4.element.NavigableString'>
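The session above depends on a downloaded page. The same attributes can be tried offline with the sketch below, which uses invented inline markup. It also shows a case the session does not cover: .string is None when a tag holds more than one child node, in which case get_text() can be used to collect all the text.

```python
from bs4 import BeautifulSoup

# Inline markup used only for this illustration.
markup = (
    '<p class="title"><b>Bold text</b></p>'
    '<p class="course">Plain <a href="#x" id="link1">link</a> tail</p>'
)
soup = BeautifulSoup(markup, "html.parser")
first_p, second_p = soup.find_all("p")

print(first_p.name)             # tag name
print(first_p["class"])         # class is multi-valued, so it comes back as a list
print(first_p.string)           # single nested child: .string reaches through <b>
print(second_p.string)          # several children: .string is None
print(second_p.get_text())      # all text inside the tag, joined together
print(second_p.a.attrs)         # attribute dictionary of the <a> tag
print(second_p.a.get("title"))  # a missing attribute gives None instead of KeyError
```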
    >>> from bs4 import BeautifulSoup
    >>> newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser")
    >>> newsoup.b.string  # the <!-- --> markers are dropped automatically
    'This is a comment'
    >>> type(newsoup.b.string)  # but the type is still Comment
    <class 'bs4.element.Comment'>
    >>> newsoup.p.string
    'This is not a comment'
    >>> type(newsoup.p.string)
    <class 'bs4.element.NavigableString'>
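Since .string hides the comment markers, code that needs to treat comments and ordinary text differently has to check the type. The sketch below is one way to do that, reusing the markup from the session above. Comment is a subclass of NavigableString, so the Comment check must come first.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

markup = "<b><!--This is a comment--></b><p>This is not a comment</p>"
soup = BeautifulSoup(markup, "html.parser")

comments = []
texts = []
for node in soup.descendants:
    if isinstance(node, Comment):
        # Test for Comment first: every Comment is also a NavigableString.
        comments.append(str(node))
    elif isinstance(node, NavigableString):
        texts.append(str(node))

print(comments)
print(texts)
```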
    <html>
     <head>
      <title>
       This is a python demo page
      </title>
     </head>
     <body>
      <p class="title">
       <b>
        The demo python introduces several python courses.
       </b>
      </p>
      <p class="course">
       Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
       <a class="py1" href="" id="link1">
        Basic Python
       </a>
       and
       <a class="py2" href="" id="link2">
        Advanced Python
       </a>
       .
      </p>
     </body>
    </html>
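Attribute | Description |
.contents | List of the current node's child nodes; all children of <Tag> are stored in a list |
.children | Iterator over the current node's child nodes; similar to .contents, used to loop over the children |
.descendants | Iterator over all of the current node's descendant nodes, used to loop over them |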
    >>> import requests
    >>> r = requests.get("")
    >>> demo = r.text
    >>> from bs4 import BeautifulSoup
    >>> soup = BeautifulSoup(demo, "html.parser")
    >>> soup.head
    <head><title>This is a python demo page</title></head>
    >>> soup.head.contents  # list of the head tag's child nodes
    [<title>This is a python demo page</title>]
    >>> soup.body.contents  # list of the body tag's child nodes
    ['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
    <a class="py1" href="" id="link1">Basic Python</a> and <a class="py2" href="" id="link2">Advanced Python</a>.</p>, '\n']
    >>> len(soup.body.contents)  # number of child nodes of the body tag
    5
    >>> soup.body.contents[1]  # the second element of the list
    <p class="title"><b>The demo python introduces several python courses.</b></p>
    >>> for child in soup.body.children:
    ...     print("child:%s" % (child))  # the "child:" prefix makes the newline characters ('\n') easier to spot
    ...
    child:

    child:<p class="title"><b>The demo python introduces several python courses.</b></p>
    child:

    child:<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
    <a class="py1" href="" id="link1">Basic Python</a> and <a class="py2" href="" id="link2">Advanced Python</a>.</p>
    child:

    >>> for child in soup.body.descendants:
    ...     print("child:%s" % (child))  # the "child:" prefix makes the newline characters ('\n') easier to spot
    ...
    child:

    child:<p class="title"><b>The demo python introduces several python courses.</b></p>
    child:<b>The demo python introduces several python courses.</b>
    child:The demo python introduces several python courses.
    child:

    child:<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
    <a class="py1" href="" id="link1">Basic Python</a> and <a class="py2" href="" id="link2">Advanced Python</a>.</p>
    child:Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

    child:<a class="py1" href="" id="link1">Basic Python</a>
    child:Basic Python
    child: and 
    child:<a class="py2" href="" id="link2">Advanced Python</a>
    child:Advanced Python
    child:.
    child:

    # a trailing blank line must be kept here, otherwise the final newline character cannot be seen
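The difference between the three downward attributes can be summarized with the short sketch below, which uses invented inline markup instead of the demo page. Filtering on the Tag type is an addition made here to skip the newline strings.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Inline markup used only for this illustration.
markup = "<body>\n<p><b>bold</b></p>\n<p>text <a>link</a></p>\n</body>"
soup = BeautifulSoup(markup, "html.parser")

# .contents is a real list, so it supports len() and indexing.
print(len(soup.body.contents))

# .children yields only the direct children of <body>.
child_tags = [node.name for node in soup.body.children if isinstance(node, Tag)]

# .descendants walks the whole subtree in document order.
descendant_tags = [node.name for node in soup.body.descendants if isinstance(node, Tag)]

print(child_tags)
print(descendant_tags)
```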
Attribute | Description |
.parent | The node's parent tag |
.parents | Iterator over the node's ancestor tags, used to loop over the ancestors |
    >>> import requests
    >>> r = requests.get("")
    >>> demo = r.text
    >>> from bs4 import BeautifulSoup
    >>> soup = BeautifulSoup(demo, "html.parser")
    >>> soup.title.parent
    <head><title>This is a python demo page</title></head>
    >>> soup.html.parent
    <html><head><title>This is a python demo page</title></head>
    <body>
    <p class="title"><b>The demo python introduces several python courses.</b></p>
    <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
    <a class="py1" href="" id="link1">Basic Python</a> and <a class="py2" href="" id="link2">Advanced Python</a>.</p>
    </body></html>
    >>> soup.parent
    >>> for parent in soup.a.parents:
    ...     if parent is None:
    ...         print(parent)
    ...     else:
    ...         print(parent.name)
    ...
    p
    body
    html
    [document]
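Upward traversal is often used to find where a tag sits in the document. The sketch below builds a path string from .parents; the helper name ancestor_path and the inline markup are made up for this illustration. The BeautifulSoup object itself is the last item produced by .parents (its name is "[document]"), and the helper leaves it out.

```python
from bs4 import BeautifulSoup

# Inline markup used only for this illustration.
soup = BeautifulSoup("<html><body><p><a>link</a></p></body></html>", "html.parser")


def ancestor_path(tag):
    """Return a string such as 'html > body > p > a' for the given tag."""
    names = [tag.name]
    for parent in tag.parents:
        if isinstance(parent, BeautifulSoup):
            continue  # skip the document object at the top of the tree
        names.append(parent.name)
    return " > ".join(reversed(names))


print([parent.name for parent in soup.a.parents])
print(ancestor_path(soup.a))
print(soup.parent)  # the document object has no parent
```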
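Attribute | Description |
.next_sibling | Returns the next sibling node in HTML text order |
.previous_sibling | Returns the previous sibling node in HTML text order |
.next_siblings | Iterator that returns all following sibling nodes in HTML text order |
.previous_siblings | Iterator that returns all preceding sibling nodes in HTML text order |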
    >>> import requests
    >>> r = requests.get("")
    >>> demo = r.text
    >>> from bs4 import BeautifulSoup
    >>> soup = BeautifulSoup(demo, "html.parser")
    >>> soup.a.next_sibling  # the next sibling of the a tag (here a string, not a tag)
    ' and '
    >>> soup.a.next_sibling.next_sibling  # the sibling after the next sibling
    <a class="py2" href="" id="link2">Advanced Python</a>
    >>> soup.a.previous_sibling  # the previous sibling of the a tag
    'Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n'
    >>> soup.a.previous_sibling.previous_sibling  # nothing comes before the previous sibling, so nothing is printed
    >>> soup.a.parent  # look at the parent tag to verify
    <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
    <a class="py1" href="" id="link1">Basic Python</a> and <a class="py2" href="" id="link2">Advanced Python</a>.</p>
    >>> for sibling in soup.a.next_siblings:
    ...     print("sibling:%s" % (sibling))
    ...
    sibling: and 
    sibling:<a class="py2" href="" id="link2">Advanced Python</a>
    sibling:.
    >>> for sibling in soup.a.previous_siblings:
    ...     print("sibling:%s" % (sibling))
    ...
    sibling:Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
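As the output shows, siblings are not always tags: the text between two tags is a sibling node too. The sketch below, using invented inline markup, shows two ways to reach only the sibling tags: filtering by type, or calling find_next_sibling with a tag name.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Inline markup used only for this illustration.
markup = '<p>intro <a id="link1">A</a> and <a id="link2">B</a>.</p>'
soup = BeautifulSoup(markup, "html.parser")
first_link = soup.a

# The immediate next sibling is the text node between the two links.
print(repr(first_link.next_sibling))

# Keep only the siblings that are tags.
tag_siblings = [node for node in first_link.next_siblings if isinstance(node, Tag)]
print([tag["id"] for tag in tag_siblings])

# find_next_sibling skips over the text nodes directly.
print(first_link.find_next_sibling("a")["id"])

# Siblings before the first link: only the leading text.
print(list(first_link.previous_siblings))
```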
    >>> import requests
    >>> r = requests.get("")
    >>> demo = r.text
    >>> from bs4 import BeautifulSoup
    >>> soup = BeautifulSoup(demo, "html.parser")
    >>> soup.prettify()
    '<html>\n <head>\n  <title>\n   This is a python demo page\n  </title>\n </head>\n <body>\n  <p class="title">\n   <b>\n    The demo python introduces several python courses.\n   </b>\n  </p>\n  <p class="course">\n   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\n   <a class="py1" href="" id="link1">\n    Basic Python\n   </a>\n   and\n   <a class="py2" href="" id="link2">\n    Advanced Python\n   </a>\n   .\n  </p>\n </body>\n</html>'
    >>> print(soup.prettify())
    <html>
     <head>
      <title>
       This is a python demo page
      </title>
     </head>
     <body>
      <p class="title">
       <b>
        The demo python introduces several python courses.
       </b>
      </p>
      <p class="course">
       Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
       <a class="py1" href="" id="link1">
        Basic Python
       </a>
       and
       <a class="py2" href="" id="link2">
        Advanced Python
       </a>
       .
      </p>
     </body>
    </html>
    >>> print(soup.a.prettify())
    <a class="py1" href="" id="link1">
     Basic Python
    </a>
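The effect of prettify() can also be checked on a tiny document, as in the sketch below (the markup is invented for this illustration). It compares the plain str() form, which stays on one line, with the prettified form, which puts every tag and string on its own line.

```python
from bs4 import BeautifulSoup

# Inline markup used only for this illustration.
soup = BeautifulSoup("<p><b>hi</b></p>", "html.parser")

compact = str(soup)       # the markup as a single line
pretty = soup.prettify()  # one tag or string per line, indented by depth

print(compact)
print(pretty)

# Ignore the indentation and look only at what sits on each line.
stripped_lines = [line.strip() for line in pretty.strip().splitlines()]
print(stripped_lines)
```

prettify() is meant for reading: the line breaks and indentation it adds become part of the text, so the result is better suited to display than to further parsing.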
