Basic Usage of Beautiful Soup

Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Chinese documentation: https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/

CSS selector reference: https://www.w3school.com.cn/cssref/css_selectors.asp

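Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying the parse tree.

Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8.

Beautiful Soup works on top of parsers such as lxml and html5lib (as well as Python's built-in html.parser), so users can choose between different parsing strategies or raw speed.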
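As a minimal sketch of the encoding behaviour and parser choice described above (the sample markup is made up for illustration, and only the built-in `html.parser` is assumed to be available):

```python
from bs4 import BeautifulSoup

# Hypothetical input: UTF-8 encoded bytes, as a crawler would receive them.
raw = '<html><body><p>café</p></body></html>'.encode('utf-8')

# The second argument names the parser. 'lxml' or 'html5lib' can be passed
# instead if those packages are installed.
soup = BeautifulSoup(raw, 'html.parser', from_encoding='utf-8')

text = soup.p.string      # text inside the tree is a str subclass (Unicode)
encoded = soup.encode()   # serialising to bytes uses UTF-8 by default

print(isinstance(text, str), text)
print(encoded)
```

Passing `from_encoding` is optional; it is used here only so the sketch does not depend on encoding detection.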

 <!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Title</title>
    <style>
        *{
            margin: 0 auto;
            padding: 0;
        }

        #header{
            width: 980px;
            height:45px;
            border: 1px solid blue;
        }
        #menu{
            width: 980px;
            height:45px;
            border: 1px solid red;
        }
        #content{
            width: 980px;
            height:400px;
            border: 1px solid red;
        }

        #content .left {
            width: 500px;
            height: 300px;
            border: 1px solid red;
            float: left;
        }
        #content .right{
            width: 400px;
            height: 300px;
            border: 1px solid red;
            float: right;
        }

        #foot{
            width: 980px;
            height:45px;
            border: 1px solid red;
        }
        .clear{
            width: 100%;
            height: 10px;
            clear: both;
        }



    </style>
</head>
<body>
    <div id="header">
        <!--<audio class="music" src="www.baidu.com"> />-->
        <span>logo  <a href="http://www.taobao.com">taobao</a></span><input type="text" /><input type="button" value="搜索" /><span>1</span><span>2</span><span>3</span>
    </div>
    <div class="clear"></div>
    <div id="menu">
        <a href="http://www.baidu.com">要闻</a>
        <a href="http://www.baidu.com">军事</a>
        <a href="http://www.baidu.com">体育</a>
        <a href="http://www.baidu.com">科技</a>
        <a href="http://www.baidu.com"><span>11</span></a>
        <a href="http://www.baidu.com"><span>haha</span></a>
        <a href="http://www.baidu.com"><span>50</span></a>

    </div>
    <div class="clear"></div>
    <div id="content">
        <a id="pw" href="www.baidu.com">lalala</a><br/>
        <div class="left" title="ff123">
            <a href="www.baidu.com">aaaaaa</a><br/>
            <a href="www.baidu.com">bbbbbb</a><br/>
            <a href="http://www.baidu.com">ccccccc</a><br/>
            <a href="http://www.baidu.com">ddddd</a><br/>

        </div>
        <div class="right" title="ff456">
            <a class="one" name="aa" href="http://www.baidu.com">eeeee</a><br/>
            <a class="one" href="http://www.baidus.com">fffffffffff</a><br/>
            <a href="http://www.baidu.com">gggggggggg</a><br/>
            <a href="http://www.baidu.com">hhhhhhhhhhh</a><br/>
        </div>

    </div>
    <div class="clear"></div>
    <div id="foot">
        <span class="f">底部</span>
    </div>
    <table>

        <tr>
            <td>aa</td>
            <td>vbb</td>
        </tr>
    </table>
</body>
</html>

web.html (the sample page above, which the script below reads)

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
import re
from faker import Factory


'''Create a BeautifulSoup object'''
fake = Factory.create('zh_CN')  # faker is used here to generate a random User-Agent string
user_agent = fake.user_agent()
headers = {'User-Agent': user_agent}
response = requests.get('http://www.baidu.com', headers=headers).content.decode('utf-8')
print(response)

# Parse
soup = BeautifulSoup(response, 'html.parser')  # 'lxml' can be used instead if it is installed
print(type(soup))  # after parsing we get a bs4.BeautifulSoup object and can use its methods
html = soup.prettify()  # pretty-print the document
print(html)

# Save the HTML document
with open('test.html','w',encoding='utf-8') as f:
    f.write(html)


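'''
The four kinds of objects: BeautifulSoup, Tag, NavigableString, Comment
'''
# The BeautifulSoup object
'''
A BeautifulSoup object represents the content of a whole document. Most of the time it can be treated as a special Tag, so we can read its type, name, and attributes.
'''
with open('web.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'html.parser')
print(soup)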

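# The Tag object
'''
Tag: simply put, an HTML tag such as title or div. A Tag has two important attributes:
name: the tag name
attrs: a dict of the tag's attributes
'''
print(soup.head.title.name)  # tag name
print(soup.head.meta.attrs['charset'])  # charset attribute of the meta tag
print(soup.body.div.attrs['id'])  # id attribute of the div; dotted access returns the first matching tag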

# The NavigableString object
'''
.string: the text inside a tag.
.strings: every piece of text inside a tag, as a generator that has to be iterated
'''
# To read the text inside a tag, use .string or .strings
print(type(soup.body.div.span.a.string))
print(soup.body.div.span.a.string)  # text of the a tag
print(soup.body.div.span.string)  # None: this span holds both text and a child a tag

# .string returns None when a tag has more than one child; .strings still yields every piece of text
print(soup.body.div.span.strings)  # a generator
for i in soup.body.div.span.strings:
    print(i)


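'''
Direct children
    .contents: a tag's direct children as a list
    .children: an iterator over the direct children
All descendants
    .descendants: iterates recursively over all of a tag's descendants; like .children, it has to be looped over to read the items.
Node content
    .string: the tag's single string
    .text: all text inside the tag, joined together
'''
print(soup.body.div.contents)  # all direct children
# Tag text
print(soup.body.div.span.a.string)  # text of the a tag
print(soup.body.div.span.a.text)
print(soup.body.div.span.text)  # .text is more convenient: it also includes text from child tags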


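'''
Parent node
    .parent: the node's parent
All ancestors
    .parents: a generator over all of the node's ancestors
'''
print(soup.body.div.span.parent)
print(soup.body.div.span.parents)  # a generator; wrap it in list() to see the items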


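'''
Sibling nodes
    Siblings are nodes at the same level as the current node.
    .next_sibling returns the next sibling and .previous_sibling the previous one; both return None if there is no such node.
    Whitespace between tags is a text node of its own, which is why .next_sibling is chained twice below to reach the next tag.
Next and previous elements
    .next_element / .previous_element: unlike .next_sibling and .previous_sibling, these are not limited to siblings; they follow parse order across all levels.
'''
print(soup.body.div)  # the first div (#header) itself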
print(soup.body.div.next_sibling.next_sibling)

print(soup.body.div.next_sibling.next_sibling.next_sibling.next_sibling)
a = soup.body.div.next_sibling.next_sibling.next_sibling.next_sibling.a
print(a.attrs['href'])
print(a.text)
print(a.next_sibling.next_sibling.attrs['href'])
print(a.next_sibling.next_sibling.text)

# Next and previous elements
print(a.next_element.next_element.next_element)


'''
find_all(name, attrs, recursive, text, **kwargs)
Parameters:
    recursive: by default find_all searches all descendants;
    with recursive=False it returns only direct children.
    text: called string since Beautiful Soup 4.4.0; the old name is still accepted.
find(name, attrs, recursive, text, **kwargs)
find_parents() find_parent()
find_next_siblings() find_next_sibling()
find_previous_siblings() find_previous_sibling()
find_all_next() find_next()
find_all_previous() and find_previous()
'''
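# The "Heaven Sword": find_all
# Pass a string
print(soup.find_all('a'))  # every a tag in the document
print(soup.find_all(name='span'))  # every span tag in the document

# Pass a regular expression
print(soup.find_all(href=re.compile('http://')))  # every tag whose href contains "http://"

# If a list is passed, Beautiful Soup returns anything that matches any item in the list
print(soup.find_all(['a','span']))  # all a tags and all span tags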

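# Keyword arguments (name, attrs)
print(soup.find_all(id='menu'))
print(soup.find_all(class_='one'))  # class_ matches every tag whose class attribute includes 'one'
print(soup.find_all(class_='one',attrs={'name':'aa'}))

# The text argument searches the strings in the document
print(soup.find_all(text = '要闻'))

# Limit the number of results
print(soup.find_all(class_='one',limit = 1))  # return at most one match

# find + find_all (the "Dragon Saber" plus the "Heaven Sword")
print(soup.find(id='menu').find_all('a'))  # find returns the first matching node; find_all then searches inside it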

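# Getting text and attributes
print(soup.find(class_ = 'left').find_all('a'))  # returns a list
for i in soup.find(class_ = 'left').find_all('a'):
    print(i.attrs['href'])
    print(i.text)

# 1. How to get the third one
print(soup.find(class_='left').find_all('a')[2])  # by index

# 2. Choosing between class and id
print(soup.find(class_='left').find_all('a'))
# Between id and class, prefer id
# First check that the matched content is correct
# If the match is ambiguous, go up one level and locate through the parent tag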

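'''
CSS selectors
    Another way of searching that achieves the same result as find_all.
    In CSS, a tag name is written bare, a class name is prefixed with ., and an id is prefixed with #. The same syntax can be used here to filter elements.
    The method is soup.select(), which returns a list.
By tag name: select('p')
By class name: select('.menu')
By id: select('#link')
By attribute: select('p[name="2"]')
Combined: select('div #p2')
Get the text: get_text()
'''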
print(soup.select('a'))               # find_all('a')
print(soup.select('.one'))            # find_all(class_ = 'one')
print(soup.select('#menu'))           # find_all(id='menu')
print(soup.select('a[name="aa"]'))    # find_all('a',attrs={'name':'aa'})
print(soup.select('#menu a'))         # find(id='menu').find_all('a')

# Getting attributes and text, option 1
print(soup.select('#menu a')[0].attrs['href'])
print(soup.select('#menu a')[0].get_text())

# Getting attributes and text, option 2
print(soup.select('#menu a')[0]['href'])
print(soup.select('#menu a')[0].text)
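The navigation attributes, `find_all()` arguments, and `select()` calls in the script above can also be tried without `web.html`. The following is a small sketch on made-up markup (the snippet and the variable names are illustrative assumptions, not part of the original script):

```python
import re
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up markup, much smaller than web.html, used only for this sketch.
html = (
    '<div id="menu">'
    '<span>logo <a href="http://www.taobao.com">taobao</a></span>'
    '<a class="one" href="http://www.baidu.com">news</a>'
    '<a class="one" href="www.baidu.com">sports</a>'
    '</div>'
)
soup = BeautifulSoup(html, 'html.parser')
menu = soup.find(id='menu')

# .string is None when a tag has more than one child; .strings and
# get_text() still reach the nested text.
span = menu.span
span_string = span.string
span_parts = list(span.strings)
span_text = span.get_text()

# .contents lists direct children; .descendants walks every level,
# text nodes included, so keep only Tag objects here.
direct_children = [child.name for child in menu.contents]
all_tags = [node.name for node in menu.descendants if isinstance(node, Tag)]

# recursive=False restricts find_all() to direct children.
direct_links = menu.find_all('a', recursive=False)
every_link = menu.find_all('a')

# A compiled pattern matches anywhere in the value, so ^ is needed to
# require the "http://" prefix rather than "contains http://".
absolute = menu.find_all('a', href=re.compile(r'^http://'))

# A CSS selector and the equivalent find_all() call return the same tags.
same_by_css = soup.select('#menu a.one')
same_by_find = menu.find_all('a', class_='one')

print(span_string, span_parts, span_text)
print(direct_children, all_tags)
print(len(direct_links), len(every_link), [a['href'] for a in absolute])
print(same_by_css == same_by_find)
```

Working on an inline string like this makes it easy to check what each call returns before pointing the same code at a real page.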

Crawler for the Python practice examples

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
import time
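'''
    1. Requirements
        100 practice examples, each with:
            title, timu (problem statement), cxfx (program analysis), code

    2. Page analysis
        Entry URL of the index page: http://www.runoob.com/python/python-100-examples.html
        Collect the a (URL) link of every example
        Request each link to fetch the example's detail page
            Parse the detail page and extract the fields above
    3. Implementation
'''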
startUrl = 'http://www.runoob.com/python/python-100-examples.html'
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36',
}

# Send the request and fetch the index page
response = requests.get(startUrl,headers = headers).content.decode('utf-8')
# print(response)

# Parse the HTML document
Html = BeautifulSoup(response,'html.parser')
# print(Html)

# Extract all a links
link_a = Html.find(id='content').find_all('a')

num = 1

# Fetch the content of every example
for i in link_a:
    response2 = requests.get('http://www.runoob.com'+i.attrs['href'],headers=headers).content.decode('utf-8')

    # Parse
    soup = BeautifulSoup(response2,'lxml')

    content = {}
    # title
    content['title'] = soup.find(id='content').h1.text
    # timu (problem statement)
    content['timu'] = soup.find(id='content').find_all('p')[1].text
    # cxfx (program analysis)
    content['cxfx'] = soup.find(id='content').find_all('p')[2].text
    # code
    try:
        content['code'] = soup.find(class_ = 'hl-main').text
    except AttributeError:  # find() returned None: no highlighted block, fall back to <pre>
        content['code'] = soup.find('pre').text

    print(content)

    # Save the content
    with open('py100.txt','a+',encoding='utf-8') as file:
        file.write(content['title']+'\n')
        file.write(content['timu']+'\n')
        file.write(content['cxfx']+'\n')
        file.write(content['code']+'\n')
        file.write('='*50+'\n')
    time.sleep(1)  # pause between requests

    print('Example {0}'.format(num))
    num+=1
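The loop above builds each detail URL by string concatenation, which only works when every href starts with `/`. A hedged alternative, assuming the index page may mix relative and absolute links, is to resolve each href with `urllib.parse.urljoin`. The markup and file names below are invented for illustration, not copied from runoob.com, and `collect_links` is a hypothetical helper:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

START_URL = 'http://www.runoob.com/python/python-100-examples.html'

# Invented index-page fragment for illustration only.
index_html = '''
<div id="content">
  <a href="/python/example-1.html">Example 1</a>
  <a href="example-2.html">Example 2</a>
  <a href="http://www.runoob.com/python/example-3.html">Example 3</a>
  <a>anchor without href</a>
</div>
'''

def collect_links(html, base_url):
    """Return absolute URLs for every <a href> inside the #content element."""
    soup = BeautifulSoup(html, 'html.parser')
    container = soup.find(id='content')
    if container is None:
        return []
    # href=True skips anchors that have no href attribute at all.
    anchors = container.find_all('a', href=True)
    return [urljoin(base_url, a['href']) for a in anchors]

links = collect_links(index_html, START_URL)
for url in links:
    print(url)
```

`urljoin` leaves absolute URLs untouched and resolves the other two forms against the index page's own address.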


# -*- coding: utf-8 -*-
# Same crawler as above, using CSS selectors (select) instead of find/find_all
from bs4 import BeautifulSoup
import requests
import time

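'''
    1. Requirements
        100 practice examples, each with:
            title, timu (problem statement), cxfx (program analysis), code

    2. Page analysis
        Entry URL of the index page: http://www.runoob.com/python/python-100-examples.html
        Collect the a (URL) link of every example
        Request each link to fetch the example's detail page
            Parse the detail page and extract the fields above
    3. Implementation
'''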
startUrl = 'http://www.runoob.com/python/python-100-examples.html'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36',
}

# Send the request and fetch the index page
response = requests.get(startUrl, headers=headers).content.decode('utf-8')
# print(response)

# Parse the HTML document
Html = BeautifulSoup(response, 'html.parser')
# print(Html)

# Extract all a links
link_a = Html.select('#content a')
# print(link_a)
num = 1

# Fetch the content of every example
for i in link_a:
    response2 = requests.get('http://www.runoob.com' + i.attrs['href'], headers=headers).content.decode('utf-8')

    # Parse
    soup = BeautifulSoup(response2, 'lxml')

    content = {}
    # title
    content['title'] = soup.select('#content')[0].h1.text
    # timu (problem statement)
    content['timu'] = soup.select('#content p')[1].text
    # cxfx (program analysis)
    content['cxfx'] = soup.select('#content p')[2].text
    # code
    try:
        content['code'] = soup.select('.hl-main')[0].text
    except IndexError:  # select() returned an empty list: no highlighted block, fall back to <pre>
        content['code'] = soup.select('pre')[0].text

    print(content)

    # Save the content (disabled in this version)
    # with open('py100.txt', 'a+', encoding='utf-8') as file:
    #     file.write(content['title'] + '\n')
    #     file.write(content['timu'] + '\n')
    #     file.write(content['cxfx'] + '\n')
    #     file.write(content['code'] + '\n')
    #     file.write('=' * 50 + '\n')
    time.sleep(1)  # pause between requests

    print('Example {0}'.format(num))
    num += 1
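Indexing the result of `select()` raises `IndexError` when nothing matches, and the fixed positions `[1]` and `[2]` depend on the page layout. As a sketch only, assuming a detail page shaped like the invented fragment below (the real runoob.com markup may differ, and `parse_detail` is a hypothetical helper), `select_one()` returns `None` instead of raising, which allows an explicit fallback:

```python
from bs4 import BeautifulSoup

# Invented detail-page fragment; not the real page.
detail_html = '''
<div id="content">
  <h1>Example 1</h1>
  <p>intro</p>
  <p>Problem text</p>
  <p>Analysis text</p>
  <pre>print("hello")</pre>
</div>
'''

def parse_detail(html):
    """Extract title, timu, cxfx and code, using '' for anything missing."""
    soup = BeautifulSoup(html, 'html.parser')
    paragraphs = soup.select('#content p')

    def nth_text(items, index):
        # Guard the index instead of letting IndexError escape.
        if len(items) > index:
            return items[index].get_text(strip=True)
        return ''

    title_tag = soup.select_one('#content h1')
    # Prefer the highlighted block; fall back to a plain <pre>.
    code_tag = soup.select_one('.hl-main')
    if code_tag is None:
        code_tag = soup.select_one('pre')

    return {
        'title': title_tag.get_text(strip=True) if title_tag is not None else '',
        'timu': nth_text(paragraphs, 1),
        'cxfx': nth_text(paragraphs, 2),
        'code': code_tag.get_text() if code_tag is not None else '',
    }

result = parse_detail(detail_html)
print(result)
```

Returning empty strings keeps the crawl going when a page deviates from the expected layout; whether that or a hard failure is preferable depends on how the collected data will be used.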


Posted 2019-08-10 21:17 by 码迷-wjz
