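Web Page Parsing: bs4-01
I. Introduction
1. BeautifulSoup is a Python library for extracting data from HTML and XML files. It is simpler and more convenient to use than regular expressions and often saves a great deal of time.
2. Installing BeautifulSoup is just as easy; pip is all you need.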
pip install beautifulsoup4
3. Parsers:
BeautifulSoup needs a parser to process a page. The main options are Python's built-in html.parser, lxml (for HTML and XML), and html5lib.
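Parsing speed affects the throughput of the whole crawler in large-scale scraping, so lxml is the recommended choice because it is much faster. lxml must be installed separately: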
pip install lxml
soup = BeautifulSoup(html_doc, 'lxml')  # specify the parser
Note: if an HTML or XML document is malformed, different parsers may return different results, so always specify the parser explicitly.
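To make the note about malformed documents concrete, here is a minimal sketch that parses the same broken fragment with each parser that happens to be installed. The fragment `<a></p>` and the helper name `parse_with` are illustrative assumptions, not part of the original example; lxml and html5lib are simply reported as missing if they are not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

BROKEN = "<a></p>"  # illustrative malformed fragment: a stray closing </p>


def parse_with(parser_name):
    """Return the parsed document as a string, or None if the parser is not installed."""
    try:
        return str(BeautifulSoup(BROKEN, parser_name))
    except FeatureNotFound:
        return None


# Parse the same input with every candidate parser and compare the results.
results = {name: parse_with(name) for name in ("html.parser", "lxml", "html5lib")}
for name, rendered in results.items():
    print(f"{name:12} -> {rendered if rendered is not None else '(not installed)'}")
```

Each parser repairs the stray tag in its own way, which is why code that works with one parser can silently break when another is picked up by default.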
II. Examples
Part 1: Shared code
from bs4 import BeautifulSoup  # import the library
html_str="""<!DOCTYPE html>
<html>
<head>
<title>爬虫</title>
<meta charset="utf-8">
<link rel="stylesheet" href="http://www.taobao.com">
<link rel="stylesheet" href="https://www.baidu.com">
<link rel="stylesheet" href="http://at.alicdn.com/t/font_684044_un7umbuwwfp.css">
</head>
<body>
<!-- footer start -->
<footer id="footer">
<div class="footer-box">
<div class="footer-content" >
<p class="top-content" id="111">
<a href="http://www.taobao.com">淘宝</a>
<span class="link">
<a class="product" href="https://www.baidu.com">关于Python</a>
<a href="http://www.taobao.com">好好学习</a>
<a href="javascript:void(0)">人生苦短</a>
<a href="javascript:void(0)">我用Python</a>
</span>
<span class="about-me">关于我: <i class="PyWhich py-wechat"></i> 忽略</span>
</p>
<p class="bottom-content">
<span>地址: xxxx</span>
<span>联系方式: <a href="tel:400-1567-315">400-1567-315</a> (24小时在线)</span>
</p>
</div>
<p class="copyright-desc">
Copyright © 爬虫有限公司. All Rights Reserved
</p>
</div>
</footer>
</body>
</html>
"""
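Part 2 (main content):
soup = BeautifulSoup(html_str, 'lxml')  # parse the HTML into a BeautifulSoup object
print(type(soup))
print(soup.a)  # 1. get the first matching tag
print(soup.a['href'])  # 2. get an attribute of the first matching tag
print(soup.a.text)  # 3. get the tag's text content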
# print(soup.body)
print('======================================================')
print(soup.body.children)  # an iterator, e.g. <list_iterator object at 0x00000271ADDB9C18>
# 4. Navigating to related tags
body = soup.body  # get the body tag
tags = body.children  # direct children only
# print(list(tags))
print('******************************************************')
for tag in tags:
    print(tag)
print('++++++++++++++++++++++++++++++++++++++++++++++++++++++++')
tags_des = body.descendants  # all descendants, not just direct children
print(tags_des)  # <generator object descendants at 0x0000024F990B8EB8>
print(list(tags_des))
print('---------------------------------------------------------')
print(body.p)  # the p tag (the first one if there are several)
print('---------------------------------------------------------')
print(body.p.next_sibling.next_sibling)  # the second p; the first next_sibling is the newline text node between the tags
print('---------------------------------------------------------')
print(body.previous_sibling.previous_sibling)  # the tag before body at the same level (head), not its parent
print('#############################################################')
p_parents = body.p.parents
print(list(p_parents))  # every ancestor of <p class="top-content" id="111">, from the nearest parent up to the document
print('---------------------------------------------------------')
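# 5. find_all
'''
def find_all(self, name=None, attrs={}, recursive=True, string=None,
             limit=None, **kwargs):
# Purpose: get every element that matches the filters
# Parameters: name (tag name), attrs (attribute dict), recursive (search all descendants or direct children only),
#             string (match on text), limit (maximum number of results)
# Returns: a ResultSet, which behaves like a list
'''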
print(soup.find_all(name='p'))  # get all p elements
print('#############################################################')
print(soup.find_all(name='p', attrs={'class': 'top-content'}))  # locate by tag name and attribute; only the first p matches
print('---------------------------------------------------------')
print(soup.find_all('a'))
print('---------------------------------------------------------')
# Note that find_all always returns its matches in a list
# (string= is the current name of the older text= argument)
print(soup.find_all('a', string='淘宝'))  # matching a tags: [<a href="http://www.taobao.com">淘宝</a>]
print(soup.find_all('a', string='淘宝')[0])  # take the first match: <a href="http://www.taobao.com">淘宝</a>
print(soup.find_all('a', string='淘宝')[0]['href'])  # http://www.taobao.com
#
print('---------------------------------------------------------')
print(soup.find_all(['p', 'a', 'span']))  # pass a list to match any of p, a, or span
print('---------------------------------------------------------')
# With recursive=False only direct children are searched
print(soup.html.find_all('body', recursive=False))  # body is a direct child of html, so it is found
print(soup.html.find_all('a', recursive=False))  # [] because a is not a direct child of html
print('---------------------------------------------------------')
# soup.find_all()
# 6. CSS selectors
# BeautifulSoup also supports searching with CSS selectors: pass a selector string to select()
# to find the matching tags.
print(soup.select('p'))  # get all p tags
print('---------------------------------------------------------')
print(soup.select('p>a'))  # a tags that are direct children of a p; returned as a list
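The script reaches the second p with two `.next_sibling` hops because the whitespace between tags is a node of its own. The following is a minimal sketch of that behavior and of two ways to avoid counting hops. The fragment `snippet` is an illustrative assumption (it is not the article's `html_str`), and `html.parser` is used only so the sketch runs without lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Illustrative fragment, not the article's html_str.
snippet = """<div>
<p class="top">first</p>
<p class="bottom">second</p>
</div>"""
soup = BeautifulSoup(snippet, "html.parser")
first_p = soup.div.p

# The newline between the two <p> tags is itself a sibling node ...
print(repr(first_p.next_sibling))

# ... so reaching the second <p> takes two hops,
second_by_hops = first_p.next_sibling.next_sibling
# or one call to find_next_sibling(), which only considers tags.
second_by_find = first_p.find_next_sibling("p")
print(second_by_hops is second_by_find)

# Keep only Tag children, dropping the whitespace strings.
child_tags = [child for child in soup.div.children if isinstance(child, Tag)]
print([tag["class"] for tag in child_tags])
```

Filtering on `Tag` or using `find_next_sibling()` tends to be more robust than chained `.next_sibling` calls, since it does not depend on how the source HTML happens to be indented.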
Run output:
D:\py学习01\venv\Scripts\python.exe D:/py学习01/python爬虫基础/网页解析-05/网页解析_bs4-01.py
<class 'bs4.BeautifulSoup'>
<a href="http://www.taobao.com">淘宝</a>
http://www.taobao.com
淘宝
======================================================
<list_iterator object at 0x000001DAD9AA0DA0>
******************************************************
footer start
<footer id="footer">
<div class="footer-box">
<div class="footer-content">
<p class="top-content" id="111">
<a href="http://www.taobao.com">淘宝</a>
<span class="link">
<a class="product" href="https://www.baidu.com">关于Python</a>
<a href="http://www.taobao.com">好好学习</a>
<a href="javascript:void(0)">人生苦短</a>
<a href="javascript:void(0)">我用Python</a>
</span>
<span class="about-me">关于我: <i class="PyWhich py-wechat"></i> 忽略</span>
</p>
<p class="bottom-content">
<span>地址: xxxx</span>
<span>联系方式: <a href="tel:400-1567-315">400-1567-315</a> (24小时在线)</span>
</p>
</div>
<p class="copyright-desc">
Copyright © 爬虫有限公司. All Rights Reserved
</p>
</div>
</footer>
++++++++++++++++++++++++++++++++++++++++++++++++++++++++
<generator object descendants at 0x000001DAD9496FC0>
['\n', ' footer start ', '\n', <footer id="footer">
<div class="footer-box">
<div class="footer-content">
<p class="top-content" id="111">
<a href="http://www.taobao.com">淘宝</a>
<span class="link">
<a class="product" href="https://www.baidu.com">关于Python</a>
<a href="http://www.taobao.com">好好学习</a>
<a href="javascript:void(0)">人生苦短</a>
<a href="javascript:void(0)">我用Python</a>
</span>
<span class="about-me">关于我: <i class="PyWhich py-wechat"></i> 忽略</span>
</p>
<p class="bottom-content">
<span>地址: xxxx</span>
<span>联系方式: <a href="tel:400-1567-315">400-1567-315</a> (24小时在线)</span>
</p>
</div>
<p class="copyright-desc">
Copyright © 爬虫有限公司. All Rights Reserved
</p>
</div>
</footer>, '\n', <div class="footer-box">
<div class="footer-content">
<p class="top-content" id="111">
<a href="http://www.taobao.com">淘宝</a>
<span class="link">
<a class="product" href="https://www.baidu.com">关于Python</a>
<a href="http://www.taobao.com">好好学习</a>
<a href="javascript:void(0)">人生苦短</a>
<a href="javascript:void(0)">我用Python</a>
</span>
<span class="about-me">关于我: <i class="PyWhich py-wechat"></i> 忽略</span>
</p>
<p class="bottom-content">
<span>地址: xxxx</span>
<span>联系方式: <a href="tel:400-1567-315">400-1567-315</a> (24小时在线)</span>
</p>
</div>
<p class="copyright-desc">
Copyright © 爬虫有限公司. All Rights Reserved
</p>
</div>, '\n', <div class="footer-content">
<p class="top-content" id="111">
<a href="http://www.taobao.com">淘宝</a>
<span class="link">
<a class="product" href="https://www.baidu.com">关于Python</a>
<a href="http://www.taobao.com">好好学习</a>
<a href="javascript:void(0)">人生苦短</a>
<a href="javascript:void(0)">我用Python</a>
</span>
<span class="about-me">关于我: <i class="PyWhich py-wechat"></i> 忽略</span>
</p>
<p class="bottom-content">
<span>地址: xxxx</span>
<span>联系方式: <a href="tel:400-1567-315">400-1567-315</a> (24小时在线)</span>
</p>
</div>, '\n', <p class="top-content" id="111">
<a href="http://www.taobao.com">淘宝</a>
<span class="link">
<a class="product" href="https://www.baidu.com">关于Python</a>
<a href="http://www.taobao.com">好好学习</a>
<a href="javascript:void(0)">人生苦短</a>
<a href="javascript:void(0)">我用Python</a>
</span>
<span class="about-me">关于我: <i class="PyWhich py-wechat"></i> 忽略</span>
</p>, '\n', <a href="http://www.taobao.com">淘宝</a>, '淘宝', '\n', <span class="link">
<a class="product" href="https://www.baidu.com">关于Python</a>
<a href="http://www.taobao.com">好好学习</a>
<a href="javascript:void(0)">人生苦短</a>
<a href="javascript:void(0)">我用Python</a>
</span>, '\n', <a class="product" href="https://www.baidu.com">关于Python</a>, '关于Python', '\n', <a href="http://www.taobao.com">好好学习</a>, '好好学习', '\n', <a href="javascript:void(0)">人生苦短</a>, '人生苦短', '\n', <a href="javascript:void(0)">我用Python</a>, '我用Python', '\n', '\n', <span class="about-me">关于我: <i class="PyWhich py-wechat"></i> 忽略</span>, '关于我: ', <i class="PyWhich py-wechat"></i>, ' 忽略', '\n', '\n', <p class="bottom-content">
<span>地址: xxxx</span>
<span>联系方式: <a href="tel:400-1567-315">400-1567-315</a> (24小时在线)</span>
</p>, '\n', <span>地址: xxxx</span>, '地址: xxxx', '\n', <span>联系方式: <a href="tel:400-1567-315">400-1567-315</a> (24小时在线)</span>, '联系方式: ', <a href="tel:400-1567-315">400-1567-315</a>, '400-1567-315', ' (24小时在线)', '\n', '\n', '\n', <p class="copyright-desc">
Copyright © 爬虫有限公司. All Rights Reserved
</p>, '\n Copyright © 爬虫有限公司. All Rights Reserved\n ', '\n', '\n', '\n']
---------------------------------------------------------
<p class="top-content" id="111">
<a href="http://www.taobao.com">淘宝</a>
<span class="link">
<a class="product" href="https://www.baidu.com">关于Python</a>
<a href="http://www.taobao.com">好好学习</a>
<a href="javascript:void(0)">人生苦短</a>
<a href="javascript:void(0)">我用Python</a>
</span>
<span class="about-me">关于我: <i class="PyWhich py-wechat"></i> 忽略</span>
</p>
---------------------------------------------------------
<p class="bottom-content">
<span>地址: xxxx</span>
<span>联系方式: <a href="tel:400-1567-315">400-1567-315</a> (24小时在线)</span>
</p>
---------------------------------------------------------
<head>
<title>爬虫</title>
<meta charset="utf-8"/>
<link href="http://www.taobao.com" rel="stylesheet"/>
<link href="https://www.baidu.com" rel="stylesheet"/>
<link href="http://at.alicdn.com/t/font_684044_un7umbuwwfp.css" rel="stylesheet"/>
</head>
#############################################################
[<div class="footer-content">
<p class="top-content" id="111">
<a href="http://www.taobao.com">淘宝</a>
<span class="link">
<a class="product" href="https://www.baidu.com">关于Python</a>
<a href="http://www.taobao.com">好好学习</a>
<a href="javascript:void(0)">人生苦短</a>
<a href="javascript:void(0)">我用Python</a>
</span>
<span class="about-me">关于我: <i class="PyWhich py-wechat"></i> 忽略</span>
</p>
<p class="bottom-content">
<span>地址: xxxx</span>
<span>联系方式: <a href="tel:400-1567-315">400-1567-315</a> (24小时在线)</span>
</p>
</div>, <div class="footer-box">
<div class="footer-content">
<p class="top-content" id="111">
<a href="http://www.taobao.com">淘宝</a>
<span class="link">
<a class="product" href="https://www.baidu.com">关于Python</a>
<a href="http://www.taobao.com">好好学习</a>
<a href="javascript:void(0)">人生苦短</a>
<a href="javascript:void(0)">我用Python</a>
</span>
<span class="about-me">关于我: <i class="PyWhich py-wechat"></i> 忽略</span>
</p>
<p class="bottom-content">
<span>地址: xxxx</span>
<span>联系方式: <a href="tel:400-1567-315">400-1567-315</a> (24小时在线)</span>
</p>
</div>
<p class="copyright-desc">
Copyright © 爬虫有限公司. All Rights Reserved
</p>
</div>, <footer id="footer">
<div class="footer-box">
<div class="footer-content">
<p class="top-content" id="111">
<a href="http://www.taobao.com">淘宝</a>
<span class="link">
<a class="product" href="https://www.baidu.com">关于Python</a>
<a href="http://www.taobao.com">好好学习</a>
<a href="javascript:void(0)">人生苦短</a>
<a href="javascript:void(0)">我用Python</a>
</span>
<span class="about-me">关于我: <i class="PyWhich py-wechat"></i> 忽略</span>
</p>
<p class="bottom-content">
<span>地址: xxxx</span>
<span>联系方式: <a href="tel:400-1567-315">400-1567-315</a> (24小时在线)</span>
</p>
</div>
<p class="copyright-desc">
Copyright © 爬虫有限公司. All Rights Reserved
</p>
</div>
</footer>, <body>
<!-- footer start -->
<footer id="footer">
<div class="footer-box">
<div class="footer-content">
<p class="top-content" id="111">
<a href="http://www.taobao.com">淘宝</a>
<span class="link">
<a class="product" href="https://www.baidu.com">关于Python</a>
<a href="http://www.taobao.com">好好学习</a>
<a href="javascript:void(0)">人生苦短</a>
<a href="javascript:void(0)">我用Python</a>
</span>
<span class="about-me">关于我: <i class="PyWhich py-wechat"></i> 忽略</span>
</p>
<p class="bottom-content">
<span>地址: xxxx</span>
<span>联系方式: <a href="tel:400-1567-315">400-1567-315</a> (24小时在线)</span>
</p>
</div>
<p class="copyright-desc">
Copyright © 爬虫有限公司. All Rights Reserved
</p>
</div>
</footer>
</body>, <html>
<head>
<title>爬虫</title>
<meta charset="utf-8"/>
<link href="http://www.taobao.com" rel="stylesheet"/>
<link href="https://www.baidu.com" rel="stylesheet"/>
<link href="http://at.alicdn.com/t/font_684044_un7umbuwwfp.css" rel="stylesheet"/>
</head>
<body>
<!-- footer start -->
<footer id="footer">
<div class="footer-box">
<div class="footer-content">
<p class="top-content" id="111">
<a href="http://www.taobao.com">淘宝</a>
<span class="link">
<a class="product" href="https://www.baidu.com">关于Python</a>
<a href="http://www.taobao.com">好好学习</a>
<a href="javascript:void(0)">人生苦短</a>
<a href="javascript:void(0)">我用Python</a>
</span>
<span class="about-me">关于我: <i class="PyWhich py-wechat"></i> 忽略</span>
</p>
<p class="bottom-content">
<span>地址: xxxx</span>
<span>联系方式: <a href="tel:400-1567-315">400-1567-315</a> (24小时在线)</span>
</p>
</div>
<p class="copyright-desc">
Copyright © 爬虫有限公司. All Rights Reserved
</p>
</div>
</footer>
</body>
</html>, <!DOCTYPE html>
<html>
<head>
<title>爬虫</title>
<meta charset="utf-8"/>
<link href="http://www.taobao.com" rel="stylesheet"/>
<link href="https://www.baidu.com" rel="stylesheet"/>
<link href="http://at.alicdn.com/t/font_684044_un7umbuwwfp.css" rel="stylesheet"/>
</head>
<body>
<!-- footer start -->
<footer id="footer">
<div class="footer-box">
<div class="footer-content">
<p class="top-content" id="111">
<a href="http://www.taobao.com">淘宝</a>
<span class="link">
<a class="product" href="https://www.baidu.com">关于Python</a>
<a href="http://www.taobao.com">好好学习</a>
<a href="javascript:void(0)">人生苦短</a>
<a href="javascript:void(0)">我用Python</a>
</span>
<span class="about-me">关于我: <i class="PyWhich py-wechat"></i> 忽略</span>
</p>
<p class="bottom-content">
<span>地址: xxxx</span>
<span>联系方式: <a href="tel:400-1567-315">400-1567-315</a> (24小时在线)</span>
</p>
</div>
<p class="copyright-desc">
Copyright © 爬虫有限公司. All Rights Reserved
</p>
</div>
</footer>
</body>
</html>
]
---------------------------------------------------------
[<p class="top-content" id="111">
<a href="http://www.taobao.com">淘宝</a>
<span class="link">
<a class="product" href="https://www.baidu.com">关于Python</a>
<a href="http://www.taobao.com">好好学习</a>
<a href="javascript:void(0)">人生苦短</a>
<a href="javascript:void(0)">我用Python</a>
</span>
<span class="about-me">关于我: <i class="PyWhich py-wechat"></i> 忽略</span>
</p>, <p class="bottom-content">
<span>地址: xxxx</span>
<span>联系方式: <a href="tel:400-1567-315">400-1567-315</a> (24小时在线)</span>
</p>, <p class="copyright-desc">
Copyright © 爬虫有限公司. All Rights Reserved
</p>]
#############################################################
[<p class="top-content" id="111">
<a href="http://www.taobao.com">淘宝</a>
<span class="link">
<a class="product" href="https://www.baidu.com">关于Python</a>
<a href="http://www.taobao.com">好好学习</a>
<a href="javascript:void(0)">人生苦短</a>
<a href="javascript:void(0)">我用Python</a>
</span>
<span class="about-me">关于我: <i class="PyWhich py-wechat"></i> 忽略</span>
</p>]
---------------------------------------------------------
[<a href="http://www.taobao.com">淘宝</a>, <a class="product" href="https://www.baidu.com">关于Python</a>, <a href="http://www.taobao.com">好好学习</a>, <a href="javascript:void(0)">人生苦短</a>, <a href="javascript:void(0)">我用Python</a>, <a href="tel:400-1567-315">400-1567-315</a>]
---------------------------------------------------------
[<a href="http://www.taobao.com">淘宝</a>]
<a href="http://www.taobao.com">淘宝</a>
http://www.taobao.com
---------------------------------------------------------
[<p class="top-content" id="111">
<a href="http://www.taobao.com">淘宝</a>
<span class="link">
<a class="product" href="https://www.baidu.com">关于Python</a>
<a href="http://www.taobao.com">好好学习</a>
<a href="javascript:void(0)">人生苦短</a>
<a href="javascript:void(0)">我用Python</a>
</span>
<span class="about-me">关于我: <i class="PyWhich py-wechat"></i> 忽略</span>
</p>, <a href="http://www.taobao.com">淘宝</a>, <span class="link">
<a class="product" href="https://www.baidu.com">关于Python</a>
<a href="http://www.taobao.com">好好学习</a>
<a href="javascript:void(0)">人生苦短</a>
<a href="javascript:void(0)">我用Python</a>
</span>, <a class="product" href="https://www.baidu.com">关于Python</a>, <a href="http://www.taobao.com">好好学习</a>, <a href="javascript:void(0)">人生苦短</a>, <a href="javascript:void(0)">我用Python</a>, <span class="about-me">关于我: <i class="PyWhich py-wechat"></i> 忽略</span>, <p class="bottom-content">
<span>地址: xxxx</span>
<span>联系方式: <a href="tel:400-1567-315">400-1567-315</a> (24小时在线)</span>
</p>, <span>地址: xxxx</span>, <span>联系方式: <a href="tel:400-1567-315">400-1567-315</a> (24小时在线)</span>, <a href="tel:400-1567-315">400-1567-315</a>, <p class="copyright-desc">
Copyright © 爬虫有限公司. All Rights Reserved
</p>]
---------------------------------------------------------
[<body>
<!-- footer start -->
<footer id="footer">
<div class="footer-box">
<div class="footer-content">
<p class="top-content" id="111">
<a href="http://www.taobao.com">淘宝</a>
<span class="link">
<a class="product" href="https://www.baidu.com">关于Python</a>
<a href="http://www.taobao.com">好好学习</a>
<a href="javascript:void(0)">人生苦短</a>
<a href="javascript:void(0)">我用Python</a>
</span>
<span class="about-me">关于我: <i class="PyWhich py-wechat"></i> 忽略</span>
</p>
<p class="bottom-content">
<span>地址: xxxx</span>
<span>联系方式: <a href="tel:400-1567-315">400-1567-315</a> (24小时在线)</span>
</p>
</div>
<p class="copyright-desc">
Copyright © 爬虫有限公司. All Rights Reserved
</p>
</div>
</footer>
</body>]
[]
---------------------------------------------------------
[<p class="top-content" id="111">
<a href="http://www.taobao.com">淘宝</a>
<span class="link">
<a class="product" href="https://www.baidu.com">关于Python</a>
<a href="http://www.taobao.com">好好学习</a>
<a href="javascript:void(0)">人生苦短</a>
<a href="javascript:void(0)">我用Python</a>
</span>
<span class="about-me">关于我: <i class="PyWhich py-wechat"></i> 忽略</span>
</p>, <p class="bottom-content">
<span>地址: xxxx</span>
<span>联系方式: <a href="tel:400-1567-315">400-1567-315</a> (24小时在线)</span>
</p>, <p class="copyright-desc">
Copyright © 爬虫有限公司. All Rights Reserved
</p>]
---------------------------------------------------------
[<a href="http://www.taobao.com">淘宝</a>]
Process finished with exit code 0
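As the output shows, find_all() and select() always return a list, even for a single match. When only one element is wanted, find() and select_one() return the element itself, or None when nothing matches. Below is a minimal sketch under that assumption; the fragment `snippet` is modeled on the footer above but is illustrative, and `html.parser` is used only so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Illustrative fragment modeled on the footer markup used in this article.
snippet = """<p class="top-content"><a href="http://www.taobao.com">淘宝</a>
<span class="link"><a class="product" href="https://www.baidu.com">关于Python</a></span></p>"""
soup = BeautifulSoup(snippet, "html.parser")

# find() returns the first match or None, so no [0] indexing is needed.
first_link = soup.find("a", string="淘宝")
href = first_link["href"] if first_link is not None else None

# select_one() is the CSS-selector counterpart of find().
product = soup.select_one("p a.product")    # descendant selector: matches
missing = soup.select_one("p > a.product")  # child selector: a.product sits inside <span>, so no match

# find_all() and select() always return a list, possibly an empty one.
direct_links = [a.get_text() for a in soup.select("p > a")]
all_hrefs = [a.get("href") for a in soup.find_all("a")]

print(href)
print(product.get_text())
print(missing)
print(direct_links)
print(all_hrefs)
```

Checking for None before indexing into a tag avoids the TypeError that would otherwise appear when a page does not contain the expected element.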