BeautifulSoup
Basic usage of BeautifulSoup:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''

# Create the BeautifulSoup object. There are two ways to do this.
# The first is to build it from a string:
soup = BeautifulSoup(html, 'lxml')
# The other is to build it from a file, assuming the html string was saved as index.html:
# soup = BeautifulSoup(open('index.html'), 'lxml')
# The document is converted to Unicode, and HTML entities are converted to Unicode characters.
print(soup.prettify())
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ; and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
The following example gives a quick feel for bs4 and for what it can do:
from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''

soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())          # the whole document, indented
print(soup.title)               # the <title> tag
print(soup.title.name)          # the tag's name
print(soup.title.string)        # the text inside the tag
print(soup.title.parent.name)   # the name of the enclosing tag
print(soup.p)                   # the first <p> tag
print(soup.p["class"])          # its class attribute
print(soup.a)                   # the first <a> tag
print(soup.find_all('a'))       # every <a> tag
print(soup.find(id='link3'))    # the tag whose id is link3
Output:
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ; and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
<title>The Dormouse's story</title>
title
The Dormouse's story
head
<p class="title"><b>The Dormouse's story</b></p>
['title']
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
Tag selectors
Add the following code to the quick example above:
print(soup.title)
print(type(soup.title))
print(soup.head)
print(soup.p)
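Writing soup.<tag name> returns that tag together with its content.
One thing to watch out for: if the document contains several tags with that name, this form returns only the first one. Above, soup.p returns just the first p tag even though the document contains several.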
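The first-match behaviour described above can be checked with a small sketch. The three-paragraph snippet is made up for illustration, and html.parser is used here only so that the example runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Hypothetical three-paragraph snippet, used only for this illustration.
html = '<p class="title">One</p><p class="story">Two</p><p class="story">Three</p>'

# html.parser ships with Python, so no extra parser needs to be installed.
soup = BeautifulSoup(html, 'html.parser')

first = soup.p               # attribute access stops at the first <p>
all_p = soup.find_all('p')   # find_all() collects every <p>
missing = soup.table         # a tag name that does not occur gives None

print(first.get_text())
print([p.get_text() for p in all_p])
print(missing)
```

When every occurrence is needed, find_all() is the form to reach for; soup.p is a shortcut for the first match only.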
Getting content: soup.title.string
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

"""
The three attributes .string, .strings and .stripped_strings.
.string is the distinctive one. If a tag contains no child tags, .string returns the text inside it. If a tag contains exactly one child tag, .string returns the text of that innermost tag. If a tag contains several child nodes, it cannot tell which child's content is meant, so .string is None.
"""
# Note: the sample markup below is cut off at the end of most lines,
# so the parser has to repair it and the output reflects that.
html = '''
<html><head><title>The Dormouse's story
<body>
<p class="title"><b>The Dormouse's stor
<p class="story">Once upon a time there
<a href="http://example.com/elsie" clas
<a href="http://example.com/lacie" clas
<a href="http://example.com/tillie" cla
and they lived at the bottom of a well.
<p class="story">...</p>
'''
soup = BeautifulSoup(html, 'lxml')

# Use .string to get the text inside a tag.
print(soup.head.string)
print(soup.title.string)
print(soup.html.string)   # several children, so this is None
print('-' * 50)

# .strings is meant for tags that contain several strings; it can be iterated over.
for string in soup.strings:
    print(string)
print('+' * 50)

# .stripped_strings drops whitespace-only strings and strips leading and trailing whitespace.
for q in soup.stripped_strings:
    print(q)
Output:
The Dormouse's story
The Dormouse's story
None
--------------------------------------------------
The Dormouse's story
The Dormouse's stor
Once upon a time there
...
++++++++++++++++++++++++++++++++++++++++++++++++++
The Dormouse's story
The Dormouse's stor
Once upon a time there
...
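The three cases for .string are easier to see side by side. The sketch below uses a made-up snippet with one paragraph per case; html.parser is an assumption made so the example runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one <p> for each case of .string.
html = '<div><p>only text</p><p><b>nested</b></p><p>mixed <b>content</b></p></div>'
soup = BeautifulSoup(html, 'html.parser')
plain, nested, mixed = soup.find_all('p')

plain_string = plain.string      # a single text child: its text
nested_string = nested.string    # a single child tag: that tag's text
mixed_string = mixed.string      # two children: None
mixed_parts = list(mixed.strings)
mixed_text = mixed.get_text()    # concatenates all strings inside the tag

print(plain_string, nested_string, mixed_string)
print(mixed_parts)
print(mixed_text)
```

When a tag may hold several children, get_text() or .strings is usually the safer choice than .string.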
Nested selection
Tag names can be chained to reach a nested tag directly:
print(soup.head.title.string)
Getting the name: soup.title.name
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Tag: a Tag object corresponds to a tag in the original XML or HTML document, for example <title>The Dormouse's story</title> or <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>.
Extract title: print(soup.title)
Extract a: print(soup.a)
Extract p: print(soup.p)
The two most important properties of a Tag are its name and its attributes.
Every Tag has a name, available through .name.
"""
from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story
<body>
<p class="title"><b>The Dormouse's stor
<p class="story">Once upon a time there
<a href="http://example.com/elsie" clas
<a href="http://example.com/lacie" clas
<a href="http://example.com/tillie" cla
and they lived at the bottom of a well.
<p class="story">...</p>
'''
# Create the BeautifulSoup object from a string.
soup = BeautifulSoup(html, 'lxml')

# The soup object itself is special: its name is [document].
# For any tag inside it, .name is the name of the tag.
print(soup.name)
print(soup.title.name)
print(soup.p.string)

# A Tag's name can be changed as well as read. The change affects all
# markup generated from this BeautifulSoup object afterwards.
soup.title.name = "cc"
print(soup.title)   # None: there is no <title> tag any more
print(soup.cc)      # the former <title> is now <cc>

# Attributes: <p class="title"><b>The Dormouse's story</b></p> has a "class"
# attribute whose value is "title". Tag attributes are accessed like a dictionary.
print(soup.p['class'])
print(soup.p.get('class'))

# .attrs returns all attributes of a Tag as a dictionary.
# Like the name, attributes and content can be modified.
soup.p['class'] = 'cc'
print(soup.p)
Output:
[document]
title
The Dormouse's stor
None
<cc>The Dormouse's story </cc>
['title']
['title']
<p class="cc"><b>The Dormouse's stor </b></p>
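Getting attributes
print(soup.p.attrs['name'])
print(soup.p['name'])
Both forms return the value of the p tag's name attribute. Both raise KeyError when the tag has no such attribute, which is the case for the sample document above (its p tags only carry class); soup.p.get('name') returns None instead.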
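A short sketch of the attribute lookups follows. The markup, including the name="dromouse" value, is made up for illustration, and html.parser is assumed so that the example runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a <p> that really does carry a name attribute.
html = '<p class="title" name="dromouse">text</p>'
soup = BeautifulSoup(html, 'html.parser')
tag = soup.p

via_index = tag['name']         # dictionary-style access
via_attrs = tag.attrs['name']   # the same lookup on the attrs dict
tag_name = tag.name             # the tag's own name, not the attribute
missing = tag.get('id')         # .get() returns None for a missing attribute

try:
    tag['id']                   # indexing a missing attribute raises
    raised = False
except KeyError:
    raised = True

print(via_index, via_attrs, tag_name, missing, raised)
```

Note that tag.name (the tag's name, here p) and tag['name'] (the value of an attribute called name) are unrelated.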
Parent and ancestor nodes
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

# The markup is cut off at the end of most lines, which is why the
# repaired <a> tag in the output carries so many odd attributes.
html = '''
<html><head><title>The Dormouse's story
<body>
<p class="title"><b>The Dormouse's stor
<p class="story">Once upon a time there
<a href="http://example.com/elsie" clas
<a href="http://example.com/lacie" clas
<a href="http://example.com/tillie" cla
and they lived at the bottom of a well.
<p class="story">...</p>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.title)
print(soup.title.parent)   # parent node

# An element's .parents attribute yields all of its ancestors in turn.
# Here it walks from the <a> tag up to the root of the document.
print(soup.a)
for p in soup.a.parents:
    print(p.name)
Output:
<title>The Dormouse's story </title>
<head><title>The Dormouse's story </title></head>
<a a="" and="" at="" bottom="" cla="" clas="" class="story" href="http://example.com/elsie" lived="" of="" the="" they="" well.="">...</a>
p
body
html
[document]
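Sibling nodes
soup.a.next_siblings yields all siblings after the node
soup.a.previous_siblings yields all siblings before the node
soup.a.next_sibling returns the next sibling node (which may be a text node)
soup.a.previous_sibling returns the previous sibling node (which may be a text node)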
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story
<body>
<p class="title"><b>The Dormouse's stor
<p class="story">Once upon a time there
<a href="http://example.com/elsie" clas
<a href="http://example.com/lacie" clas
<a href="http://example.com/tillie" cla
and they lived at the bottom of a well.
<p class="story">...</p>
'''
soup = BeautifulSoup(html, 'lxml')

# Siblings are nodes on the same level as the current node.
# .next_sibling returns the next sibling and .previous_sibling the previous one;
# each returns None when there is no such node.
# repr() is used where the sibling is a whitespace-only text node, to make it visible.
print(soup.p.next_sibling)
print('-' * 50)
print(repr(soup.p.previous_sibling))
print('#' * 50)
print(repr(soup.p.next_sibling.next_sibling))
for i in soup.p.next_siblings:
    print(repr(i))
Output:
<p class="story">Once upon a time there <a a="" and="" at="" bottom="" cla="" clas="" class="story" href="http://example.com/elsie" lived="" of="" the="" they="" well.="">...</a></p>
--------------------------------------------------
'\n'
##################################################
'\n'
<p class="story">Once upon a time there <a a="" and="" at="" bottom="" cla="" clas="" class="story" href="http://example.com/elsie" lived="" of="" the="" they="" well.="">...</a></p>
'\n'
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

# .next_element and .previous_element move to the next or previous node in parse order.
# Unlike .next_sibling and .previous_sibling they are not limited to siblings; they
# cover every node regardless of level. For example, in
# <head><title>The Dormouse's story</title></head> the next element after head is title.
html = '''
<html><head><title>The Dormouse's story
<body>
<p class="title"><b>The Dormouse's stor
<p class="story">Once upon a time there
<a href="http://example.com/elsie" clas
<a href="http://example.com/lacie" clas
<a href="http://example.com/tillie" cla
and they lived at the bottom of a well.
<p class="story">...</p>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.head)
print(soup.head.next_element)

# To walk over all following or preceding nodes, use the .next_elements and
# .previous_elements iterators, which move forwards or backwards through the parsed document.
print('-' * 50)
for element in soup.a.next_elements:
    print(repr(element))
Output:
<head><title>The Dormouse's story </title></head>
<title>The Dormouse's story </title>
--------------------------------------------------
'...'
'\n'
Children and descendants:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story
<body>
<p class="title"><b>The Dormouse's stor
<p class="story">Once upon a time there
<a href="http://example.com/elsie" clas
<a href="http://example.com/lacie" clas
<a href="http://example.com/tillie" cla
and they lived at the bottom of a well.
<p class="story">...</p>
'''
# Children: .contents and .children on a Tag are the two attributes to know.
soup = BeautifulSoup(html, 'lxml')
print(soup.head.contents)             # a list of the direct children
print(len(soup.head.contents))
print(soup.head.contents[0].string)

# Strings have no .contents attribute because they have no children.
# .children returns an iterator over the direct children.
for child in soup.head.children:
    print(child)
print('-' * 50)

# .contents and .children only cover a Tag's direct children.
# .descendants iterates recursively over all of a Tag's descendants.
for c in soup.head.descendants:
    print(c)
Output:
[<title>The Dormouse's story </title>]
1
The Dormouse's story
<title>The Dormouse's story </title>
--------------------------------------------------
<title>The Dormouse's story </title>
The Dormouse's story
Standard selectors
find_all(name,attrs,recursive,text,**kwargs)
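find_all() searches the document by tag name, attributes, or text content. In recent versions of Beautiful Soup the text argument is called string; text is the older name for the same argument.
Using name: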
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('ul'))
print('-' * 50)
print(type(soup.find_all('ul')[0]))
Output:
[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
--------------------------------------------------
<class 'bs4.element.Tag'>
Each result is a Tag, so find_all can be called on it again, for example to collect all the li tags:
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))
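attrs takes a dictionary of attribute names and values to match. class needs special care because it is a reserved word in Python: as a keyword argument it is written class_='element', while inside a dictionary the plain name works, as in soup.find_all(attrs={'class': 'element'}). Common attributes such as id can be passed directly as keyword arguments without attrs, for example soup.find_all(id='list-1').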
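The attribute filters described above can be sketched as follows. The markup is a shortened, made-up variant of the list example (the data-role attribute is hypothetical), and html.parser is assumed so that the example runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely based on the list example in this section.
html = '''
<ul class="list" id="list-1">
  <li class="element">Foo</li>
  <li class="element">Bar</li>
</ul>
<ul class="list list-small" id="list-2" data-role="extra">
  <li class="element">Baz</li>
</ul>
'''
soup = BeautifulSoup(html, 'html.parser')

by_keyword = soup.find_all(class_='element')           # class_ as a keyword argument
by_dict = soup.find_all(attrs={'class': 'element'})    # plain 'class' inside a dict
by_id = soup.find_all(id='list-1')                     # id works directly as a keyword
by_data = soup.find_all(attrs={'data-role': 'extra'})  # names that are not valid keywords need attrs
both_lists = soup.find_all('ul', class_='list')        # matches any one of a tag's classes

print([li.get_text() for li in by_keyword])
print([ul['id'] for ul in both_lists])
```

The dictionary form is the more general one: it also covers attribute names such as data-role that cannot be written as Python keyword arguments.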
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
# text matches the text content of the document and returns strings, not tags.
print(soup.find_all(text='Foo'))
Output:
['Foo', 'Foo']
Other usage:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re

from bs4 import BeautifulSoup

"""
find_all() searches all Tag descendants of the current Tag and checks each one against the filter:
find_all(name, attrs, recursive, text, **kwargs)

name: finds all tags with the given name; text nodes are ignored automatically. name can be a string, a regular expression, a list, True, or a function. The simplest filter is a string: pass one to a search method and BeautifulSoup looks for tags whose name matches it exactly.
"""
html = '''
<html><head><title>The Dormouse's story
<body>
<p class="title"><b>The Dormouse's stor
<p class="story">Once upon a time there
<a href="http://example.com/elsie" clas
<a href="http://example.com/lacie" clas
<a href="http://example.com/tillie" cla
and they lived at the bottom of a well.
<p class="story">...</p>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('b'))

# With a compiled regular expression, BeautifulSoup filters tag names
# using the pattern's search() method.
for tag in soup.find_all(re.compile('^b')):
    print(tag.name)
print('*' * 50)

# With a list, BeautifulSoup returns everything that matches any item in the list.
print(soup.find_all(['a', 'b']))
print('@' * 50)

# True matches every tag.
for ti in soup.find_all(True):
    print(ti)
print('#' * 50)

# If no built-in filter fits, define a function that takes a single Tag argument
# and returns True when the tag matches and False otherwise:
# def has_class_and_id(tag):
#     return tag.has_attr('class') and tag.has_attr('id')
# print(soup.find_all(has_class_and_id))

# kwargs: a keyword argument that is not one of the built-in parameter names is
# treated as an attribute filter. Its value can be a string, a regular expression,
# a list, or True. With id=..., BeautifulSoup checks every tag's "id" attribute.
print(soup.find_all(id='link2'))
# With href=..., BeautifulSoup checks every tag's "href" attribute.
print(soup.find_all(href=re.compile('elsie')))
print(soup.find_all(id=True))
# To filter by class, add a trailing underscore, because class is a Python keyword.
print(soup.find_all('a', class_='sister'))
print('c' * 50)
# Several keyword arguments filter on several attributes at once:
print(soup.find_all(href=re.compile('elsie'), id='link1'))

# Some attribute names cannot be used as keyword arguments, such as the HTML5
# data-* attributes. Pass them through attrs instead:
# data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'lxml')
# data_soup.find_all(attrs={"data-foo": "value"})
Output:
[<b>The Dormouse's stor </b>]
body
b
**************************************************
[<b>The Dormouse's stor </b>, <a a="" and="" at="" bottom="" cla="" clas="" class="story" href="http://example.com/elsie" lived="" of="" the="" they="" well.="">...</a>]
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
<html><head><title>The Dormouse's story </title></head><body> <p class="title"><b>The Dormouse's stor </b></p><p class="story">Once upon a time there <a a="" and="" at="" bottom="" cla="" clas="" class="story" href="http://example.com/elsie" lived="" of="" the="" they="" well.="">...</a></p> </body></html>
<head><title>The Dormouse's story </title></head>
<title>The Dormouse's story </title>
<body> <p class="title"><b>The Dormouse's stor </b></p><p class="story">Once upon a time there <a a="" and="" at="" bottom="" cla="" clas="" class="story" href="http://example.com/elsie" lived="" of="" the="" they="" well.="">...</a></p> </body>
<p class="title"><b>The Dormouse's stor </b></p>
<b>The Dormouse's stor </b>
<p class="story">Once upon a time there <a a="" and="" at="" bottom="" cla="" clas="" class="story" href="http://example.com/elsie" lived="" of="" the="" they="" well.="">...</a></p>
<a a="" and="" at="" bottom="" cla="" clas="" class="story" href="http://example.com/elsie" lived="" of="" the="" they="" well.="">...</a>
##################################################
[]
[<a a="" and="" at="" bottom="" cla="" clas="" class="story" href="http://example.com/elsie" lived="" of="" the="" they="" well.="">...</a>]
[]
[]
cccccccccccccccccccccccccccccccccccccccccccccccccc
[]
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html, 'lxml')

# find_all() returns every match, which can be slow on a large document tree.
# When not all results are needed, the limit argument caps how many are returned:
# the search stops as soon as that many results have been found.
print(soup.find_all('a', limit=2))
Output:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html, 'lxml')

# find_all() on a Tag searches all of that tag's descendants.
# To search only its direct children, pass recursive=False.
print(soup.find_all('title'))
print(soup.find_all('title', recursive=False))   # <title> is not a direct child of soup
Output:
[<title>The Dormouse's story</title>]
[]
find
find(name,attrs,recursive,text,**kwargs)
find returns the first matching element, or None when nothing matches.
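A small sketch of how find() relates to find_all(). The two-item list is made up for illustration, and html.parser is assumed so that the example runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical two-item list.
html = '<ul><li class="element">Foo</li><li class="element">Bar</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

first_li = soup.find('li')                # a single Tag
limited = soup.find_all('li', limit=1)    # a one-element list holding the same tag
no_match = soup.find('table')             # None when nothing matches
no_matches = soup.find_all('table')       # an empty list when nothing matches

print(first_li.get_text())
print(len(limited), no_match, len(no_matches))
```

Because find() can return None, code that chains further calls onto its result may need a None check first.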
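Some related methods work the same way:
find_parents() returns all matching ancestors; find_parent() returns the closest matching parent.
find_next_siblings() returns all following siblings; find_next_sibling() returns the first following sibling.
find_previous_siblings() returns all preceding siblings; find_previous_sibling() returns the first preceding sibling.
find_all_next() returns all matching nodes after the node; find_next() returns the first matching one.
find_all_previous() returns all matching nodes before the node; find_previous() returns the first matching one.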
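The methods listed above can be tried on a small tree. The markup and its id values are made up for illustration, and html.parser is assumed so that the example runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the ids exist only to label the nodes.
html = '''
<div id="outer">
  <ul id="menu">
    <li id="a">A</li>
    <li id="b">B</li>
    <li id="c">C</li>
  </ul>
  <p id="after">tail</p>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
b = soup.find(id='b')   # start from the middle <li>

parent_id = b.find_parent('ul')['id']                          # closest <ul> ancestor
div_ids = [t['id'] for t in b.find_parents('div')]             # all <div> ancestors
next_ids = [t['id'] for t in b.find_next_siblings('li')]       # <li> siblings after b
prev_id = b.find_previous_sibling('li')['id']                  # nearest <li> sibling before b
after_id = b.find_next('p')['id']                              # first <p> later in the document
before_ids = [t['id'] for t in b.find_all_previous('li')]      # <li> tags earlier in the document

print(parent_id, div_ids, next_ids, prev_id, after_id, before_ids)
```

The sibling methods stay on one level of the tree, while find_next() and find_all_previous() follow document order and can cross levels, which is why find_next('p') reaches a tag outside the list.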
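Prefer the lxml parser; fall back to html.parser when necessary.
Tag selection (soup.<tag name>) offers only weak filtering, but it is fast.
Use find() and find_all() to match a single result or multiple results.
If you are comfortable with CSS selectors, use select().
Remember the common ways of getting attribute values and text.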
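CSS selectors
Pass a CSS selector straight to select() to make a selection.
Anyone familiar with front-end work will already know CSS; the syntax here is the same.
. selects by class, # selects by id
tag1, tag2 finds all tag1 and all tag2 elements
tag1 tag2 finds every tag2 inside a tag1
[attr] finds every tag that has the given attribute
[attr=value] for example, [target=_blank] finds every tag with target=_blank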
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'))   # by class, nested
print(soup.select('ul li'))                   # every li inside a ul
print(soup.select('#list-2 .element'))        # by id, then by class
print(type(soup.select('ul')[0]))
Output:
[<div class="panel-heading">
<h4>Hello</h4>
</div>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
<class 'bs4.element.Tag'>
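The grouping and attribute selectors from the list above (tag1, tag2 as well as [attr] and [attr=value]) do not appear in that example, so here is a minimal sketch of them. The markup, including the two links, is made up for illustration, and html.parser is assumed so that the example runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two links, one of which has a target attribute.
html = '''
<div class="panel">
  <h4>Hello</h4>
  <a href="/docs" target="_blank">Docs</a>
  <a href="/home">Home</a>
  <ul class="list" id="list-1"><li class="element">Foo</li></ul>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

grouped = soup.select('h4, li')                 # all h4 and all li, in document order
has_target = soup.select('[target]')            # tags that have a target attribute
blank = soup.select('a[target="_blank"]')       # tags whose target equals _blank
first_item = soup.select_one('#list-1 .element')  # select_one returns the first match only

print([t.name for t in grouped])
print([t.get_text() for t in has_target])
print(first_item.get_text())
```

select() always returns a list, while select_one() returns a single Tag or None, mirroring the find_all() and find() pair.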
Getting content
get_text() returns the text content of a tag.
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print(li.get_text())
Output:
Foo
Bar
Jay
Foo
Bar
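get_text() also accepts a separator and a strip flag, which helps when a tag holds several pieces of text. A minimal sketch, using a made-up snippet and html.parser so that it runs without lxml:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with surrounding whitespace in the first paragraph.
html = '<div><p> Foo </p><p>Bar</p></div>'
soup = BeautifulSoup(html, 'html.parser')

raw = soup.div.get_text()                    # strings joined as they are
joined = soup.div.get_text('|', strip=True)  # stripped, joined with a separator
pieces = list(soup.div.stripped_strings)     # the same pieces as a list

print(repr(raw))
print(joined)
print(pieces)
```

The separator form is handy for quick inspection; for further processing, the list from stripped_strings is usually easier to work with.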
Getting attributes
An attribute can be read either with [attribute name] or with attrs[attribute name].
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])
Output:
list-1
list-1
list-2
list-2
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html, 'lxml')

# Elements can also be located with CSS. In CSS a tag name is written bare,
# a class name is prefixed with '.', and an id is prefixed with '#'. The same
# syntax works with soup.select(), which returns a list.
# 1. Searching by tag name (direct matches, children, and siblings)
# Direct search
print(soup.select('title'))
# Search through several levels
print(soup.select('html head title'))
# Direct child: the title tag directly under head
print(soup.select('head > title'))
# The tag with id="link1" directly under p
print(soup.select('p > #link1'))
# Siblings
# All later siblings of id="link1" that have class sister
print(soup.select('#link1 ~ .sister'))
# The sibling immediately after id="link1", if it has class sister
print(soup.select('#link1 + .sister'))
Output:
[<title>The Dormouse's story</title>]
[<title>The Dormouse's story</title>]
[<title>The Dormouse's story</title>]
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html, 'lxml')

# Search by CSS class
print(soup.select('.sister'))
print(soup.select('[class~=sister]'))
# Search by a tag's id
print(soup.select('#link1'))
print(soup.select('a#link2'))
# Search by whether an attribute exists
print(soup.select('a[href]'))
# Search by attribute value: exact match, prefix match, substring match
print(soup.select('a[href="http://example.com/elsie"]'))
print(soup.select('a[href^="http://example.com/"]'))
print(soup.select('a[href*=".com/el"]'))
Output:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]