The Python BeautifulSoup module

BeautifulSoup documentation (Chinese): https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/

BeautifulSoup download: http://www.crummy.com/software/BeautifulSoup/

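Unpack the archive into any directory.

In a cmd console, change into that directory.

Run: python setup.py install (if pip is available, pip install beautifulsoup4 is a common alternative).

When it finishes, start python from the command line and run import bs4. If the import raises no error, the installation succeeded.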

 

Assume the variable content holds the whole web page as a string, or the return value of urllib.request.urlopen(url).

First import the module, then wrap content in a soup object:

from bs4 import BeautifulSoup
soup = BeautifulSoup(content,'html.parser')
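Here is a minimal sketch of the two kinds of input mentioned above. The sample markup is made up, and io.BytesIO only stands in for the bytes-yielding object that urlopen would return, so the snippet runs without a network connection.

```python
import io
from bs4 import BeautifulSoup

# Hypothetical sample page; any HTML string is handled the same way.
html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# Case 1: content is a str holding the whole page.
soup_from_str = BeautifulSoup(html, "html.parser")

# Case 2: content is a file-like object that yields bytes.
# io.BytesIO is a stand-in for the response of urllib.request.urlopen(url).
fake_response = io.BytesIO(html.encode("utf-8"))
soup_from_file = BeautifulSoup(fake_response, "html.parser")

title_from_str = soup_from_str.title.string
title_from_file = soup_from_file.title.string
print(title_from_str, title_from_file)
```

Both calls end up with the same tree, so the rest of this post applies to either form of content.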

1. Pretty-print the markup as indented HTML (prettify() returns a string)

print(soup.prettify())
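To see what that looks like on a concrete input, here is a small sketch with a made-up fragment. The exact indentation is up to the library, so the code only relies on the result being a multi-line string.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line fragment, as it might arrive from a minified page.
html = "<div><p>Hello</p><p>World</p></div>"
soup = BeautifulSoup(html, "html.parser")

# prettify() returns a str with tags and text spread over separate, indented lines.
pretty = soup.prettify()
print(pretty)
```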

2. Extract specific tags from the page

  For example, extract all <a> tags:

soup = BeautifulSoup(content,'html.parser')
print(soup.find_all('a'))
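find_all returns a list of tag objects, so in practice you usually loop over the result and pull out an attribute or the text. A minimal sketch with made-up markup:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two links.
html = '<p><a href="/one">One</a> and <a href="/two">Two</a></p>'
soup = BeautifulSoup(html, "html.parser")

links = soup.find_all("a")                  # list of Tag objects
hrefs = [a.get("href") for a in links]      # .get() returns None if href is absent
texts = [a.get_text() for a in links]       # visible text of each link
print(hrefs, texts)
```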

  Or extract all <a> tags and all <b> tags:

soup = BeautifulSoup(content,'html.parser')
print(soup.find_all(['a','b']))

  Or extract all <span> tags whose class is t-large (that is, every tag like <span class="t-large"></span>):

soup = BeautifulSoup(content,'html.parser')
print(soup.find_all('span','t-large'))
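A string passed as the second positional argument is treated as a CSS class, which makes it equivalent to the class_ keyword. The sketch below, using made-up markup, also shows that a tag carrying several classes still matches on any one of them.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: two matching spans, one non-matching span, one div.
html = (
    '<span class="t-large">A</span>'
    '<span class="t-large bold">B</span>'
    '<span class="t-small">C</span>'
    '<div class="t-large">D</div>'
)
soup = BeautifulSoup(html, "html.parser")

by_position = soup.find_all("span", "t-large")        # second argument = class
by_keyword = soup.find_all("span", class_="t-large")  # same filter, spelled out

matched_texts = [tag.get_text() for tag in by_position]
print(matched_texts)
```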

  Or extract all tags that have a class attribute but no id attribute:

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

soup = BeautifulSoup(content,'html.parser')
print(soup.find_all(has_class_but_no_id))
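To try the function filter without a real page, here is a self-contained sketch. The three paragraphs are invented so that exactly one of them passes the test.

```python
from bs4 import BeautifulSoup

def has_class_but_no_id(tag):
    # find_all calls this once per tag and keeps the tags for which it returns True.
    return tag.has_attr("class") and not tag.has_attr("id")

# Hypothetical fragment covering the three cases.
html = (
    '<p class="intro">keep</p>'
    '<p class="intro" id="first">has an id</p>'
    '<p>has no class</p>'
)
soup = BeautifulSoup(html, "html.parser")

kept = [tag.get_text() for tag in soup.find_all(has_class_but_no_id)]
print(kept)
```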

  Or extract all tags whose id equals "link2":

soup = BeautifulSoup(content,'html.parser')
print(soup.find_all(id="link2"))
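Since an id is normally unique, find is often handier than find_all here: it returns the first match directly, or None when nothing matches. A small sketch with made-up links:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two ids.
html = '<a id="link1" href="/a">A</a><a id="link2" href="/b">B</a>'
soup = BeautifulSoup(html, "html.parser")

matches = soup.find_all(id="link2")   # always a list, possibly empty
first = soup.find(id="link2")         # a single Tag
missing = soup.find(id="link9")       # None, because no such id exists

print(len(matches), first["href"], missing)
```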

3. Get the children of a tag (or of the soup object itself) with .contents

print(soup.contents)
print(soup.a.contents)
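.contents is a list of direct children only, and those children can be text nodes as well as tags. The following sketch, on an invented paragraph, shows the mix.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: text, a nested tag, then text again.
html = "<p>Hello <b>bold</b> world</p>"
soup = BeautifulSoup(html, "html.parser")

top_level = soup.contents        # direct children of the document: just the <p>
p_children = soup.p.contents     # direct children of the first <p>

# Record the type of each child to show that strings and tags are interleaved.
kinds = [type(child).__name__ for child in p_children]
print(kinds)
```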

4. Get a tag's class attribute (note that the result is a list even when there is only one class, because class is a multi-valued attribute in HTML)

print(soup.a['class'])
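The difference from ordinary attributes is easiest to see side by side. In this sketch (made-up markup), class comes back as a list, href as a plain string, and .get() is used for the tag that has no class at all, since indexing it would raise KeyError.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one class, two classes, and no class.
html = (
    '<a class="btn" href="/x">one</a>'
    '<a class="btn primary" href="/y">two</a>'
    '<a href="/z">three</a>'
)
soup = BeautifulSoup(html, "html.parser")
first, second, third = soup.find_all("a")

one_class = first["class"]         # a list with a single item
two_classes = second["class"]      # a list with two items
no_class = third.get("class")      # None; third["class"] would raise KeyError
href = first["href"]               # single-valued attributes stay plain strings
print(one_class, two_classes, no_class, href)
```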

5. Remove tags

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<script>a</script>baba<script>b</script>', 'html.parser')
>>> [s.extract() for s in soup('script')]
[<script>a</script>, <script>b</script>]
>>> soup
baba
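Note that extract() hands back the tag it removed, so the removed pieces stay usable afterwards. Here is the same example as a script, as a sketch that keeps both the removed tags and what is left.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<script>a</script>baba<script>b</script>", "html.parser")

# soup("script") is shorthand for soup.find_all("script"); the result is built
# before the loop starts, so removing tags while iterating is safe.
removed = [s.extract() for s in soup("script")]

removed_texts = [tag.string for tag in removed]   # content of the detached tags
remaining = str(soup)                             # what is left in the tree
print(removed_texts, remaining)
```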

6. Remove a tag that has a specific class

from bs4 import BeautifulSoup

markup = '<a>This is not div <div class="1">This is div 1</div><div class="2">This is div 2</div></a>'
soup = BeautifulSoup(markup,"html.parser")
a_tag = soup

soup.find('div',class_='2').decompose()

print(a_tag)

# <a>This is not div <div class="1">This is div 1</div></a>
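One thing to watch out for: find() returns None when nothing matches, and calling decompose() on None raises AttributeError. Looping over find_all avoids that and also removes every match instead of only the first. The class names in this sketch are invented.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two "ad" blocks to strip.
markup = (
    '<div class="ad">x</div>'
    '<p>text</p>'
    '<div class="ad">y</div>'
    '<div class="keep">z</div>'
)
soup = BeautifulSoup(markup, "html.parser")

# No tag has this class, so find() gives None rather than a Tag.
missing = soup.find("div", class_="banner")

# An empty result simply means the loop body never runs.
for tag in soup.find_all("div", class_="ad"):
    tag.decompose()   # removes the tag and destroys it

result = str(soup)
print(result)
```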

7. Note that BeautifulSoup writes a <br> tag out as <br/>
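This matters when you compare or search the output as a string. A quick sketch, assuming the html.parser backend used throughout this post:

```python
from bs4 import BeautifulSoup

# The input uses the plain HTML form <br>.
soup = BeautifulSoup("<p>line one<br>line two</p>", "html.parser")

as_text = str(soup)                   # serialized with a self-closing <br/>
br_count = len(soup.find_all("br"))   # the tag itself is still found by name
print(as_text, br_count)
```

So a search for the literal text "<br>" in the output will come up empty; search for the tag with find_all instead.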

8. Extract every string in a soup

for string in soup.strings:
    print(repr(string))

 Extract only the non-blank strings, with leading and trailing whitespace stripped

for string in soup.stripped_strings:
    print(repr(string))
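The difference between the two generators shows up as soon as the markup contains line breaks or indentation. A minimal sketch on an invented fragment:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with indentation and padding around the text.
html = "<div>\n  <p> Hello </p>\n  <p>World</p>\n</div>"
soup = BeautifulSoup(html, "html.parser")

all_strings = list(soup.strings)       # includes whitespace-only text nodes
clean = list(soup.stripped_strings)    # whitespace-only nodes dropped, rest stripped

print(all_strings)
print(clean)
```

Joining the cleaned strings, for example with " ".join(clean), is a simple way to get the readable text of a page.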

 

posted @ 2016-09-24 10:48  lvmememe