Learning Python: beautifulsoup4, searching the document tree, traversing the document tree, and basic MongoDB usage
1. Basic usage of BeautifulSoup
# Parsing tools: re, selenium
# XML parsers
# BeautifulSoup is a parsing library and must be used together with a parser
# The main parsers today: the Python standard library (html.parser) and the lxml HTML parser (preferred)
# BeautifulSoup gives us ways to search the document tree; its filters also accept re patterns

# 1. What bs4 is and why to use it
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="sister"><b>$37</b></p>

<p class="story" id="p">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" >Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

# Import BeautifulSoup from bs4
from bs4 import BeautifulSoup

# Instantiate a soup object by calling BeautifulSoup
# Argument 1: the text to parse
# Argument 2: the parser (html.parser or lxml)
soup = BeautifulSoup(html_doc, 'lxml')
print(soup)
print(type(soup))

# Pretty-print the document
html = soup.prettify()
print(html)
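The parser is chosen by the second argument, and lxml has to be installed separately. A minimal sketch of one way to cope with that, assuming it is acceptable to fall back to the standard-library parser: bs4 raises FeatureNotFound when the requested parser is missing. The helper name make_soup is hypothetical and not part of bs4.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml when it is installed, otherwise with html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed; use the parser shipped with Python
        return BeautifulSoup(markup, "html.parser")


soup = make_soup(
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p class='sister'><b>$37</b></p></body></html>"
)
print(soup.title.string)  # text inside <title>
print(soup.p["class"])    # class is multi-valued, so bs4 returns a list
```

For a complete, well-formed document like this one both parsers build the same tree; they can differ on broken markup.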
2. Searching the document tree with bs4
html_doc = """<html><head><title>The Dormouse's story</title></head><body><p class="sister"><b>$37</b></p><p class="story" id="p">Once upon a time there were three little sisters; and their names were<b>tank</b><a href="http://example.com/elsie" class="sister" >Elsie</a>,<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;and they lived at the bottom of a well.<hr></hr></p><p class="story">...</p>"""

'''
Searching the document tree:
    find()      returns the first match
    find_all()  returns all matches

Searching by tag and by attribute:
    Tag:
        name    match on the tag name
        attrs   match on attributes
        string  match on text (older name: text)

        - string filter    exact match against a string
        - regex filter     match with the re module
        - list filter      match any item in the list
        - bool filter      True matches any value
        - function filter  for rules that combine wanted and unwanted attributes

    Attribute:
        - class_
        - id
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'lxml')

# String filter
# name
p_tag = soup.find(name='p')
print(p_tag)  # the first tag whose name is p

# Find every node whose tag name is p
tag_s1 = soup.find_all(name='p')
print(tag_s1)

# attrs
# Find the first node whose class is sister
p = soup.find(attrs={"class": "sister"})
print(p)

# Find every node whose class is sister
tag_s2 = soup.find_all(attrs={"class": "sister"})
print(tag_s2)

# string (spelled text in older bs4 versions)
text = soup.find(string="$37")
print(text)

# Combined:
# find an a tag whose id is link2 and whose text is Lacie
a_tag = soup.find(name="a", attrs={"id": "link2"}, string="Lacie")
print(a_tag)

# # Regex filter
# import re
# # name
# p_tag = soup.find(name=re.compile('p'))
# print(p_tag)

# List filter
# import re
# # name
# tags = soup.find_all(name=['p', 'a', re.compile('html')])
# print(tags)

# - Bool filter
# True matches any value
# Find a p tag that has an id
# p = soup.find(name='p', attrs={"id": True})
# print(p)

# Function filter
# Match tags named a that have both an id and a class attribute
# def have_id_class(tag):
#     if tag.name == 'a' and tag.has_attr('id') and tag.has_attr('class'):
#         return tag
#
# tag = soup.find(name=have_id_class)
# print(tag)
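The regex, list, bool, and function filters are only outlined in comments above, so here is a small runnable sketch of all four. It makes two assumptions that differ from the listing: the shortened html_doc drops the class attribute from the third link so that a "has id but no class" rule has something to match, and it uses html.parser so that lxml is not required. The function name has_id_but_no_class is made up for the example.

```python
import re

from bs4 import BeautifulSoup

# Shortened sample; the third <a> deliberately has no class attribute
html_doc = """<html><head><title>The Dormouse's story</title></head><body>
<p class="sister"><b>$37</b></p>
<p class="story" id="p">Once upon a time
<a href="http://example.com/elsie" class="sister">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" id="link3">Tillie</a>.</p>
</body></html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# Regex filter: tag names that start with "b"
b_names = [tag.name for tag in soup.find_all(name=re.compile("^b"))]

# List filter: tag names that equal any item in the list
listed_names = [tag.name for tag in soup.find_all(name=["title", "b"])]

# Bool filter: <a> tags that carry an id attribute, whatever its value
ids = [a["id"] for a in soup.find_all(name="a", attrs={"id": True})]


# Function filter: bs4 calls the function once per tag and keeps the truthy ones
def has_id_but_no_class(tag):
    return tag.name == "a" and tag.has_attr("id") and not tag.has_attr("class")


id_only = [a.get_text() for a in soup.find_all(has_id_but_no_class)]

print(b_names, listed_names, ids, id_only)
```

A function filter is the most flexible of the four, because it can combine conditions that attrs alone cannot express, such as the absence of an attribute.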
3. Traversing the document tree with bs4
html_doc = """<html><head><title>The Dormouse's story</title></head><body><p class="sister"><b>$37</b></p><p class="story" id="p">Once upon a time there were three little sisters; and their names were<a href="http://example.com/elsie" class="sister" >Elsie</a>,<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;and they lived at the bottom of a well.</p><p class="story">...</p>"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'lxml')

'''
Traversing the document tree
'''
# 1. Direct access by tag name
print(soup.p)  # the first <p> tag
print(soup.a)  # the first <a> tag

# 2. Get the name of a tag
print(soup.head.name)

# 3. Get the attributes of a tag
print(soup.a.attrs)  # as a dict
print(soup.a.attrs['href'])  # the href attribute of the a tag

# 4. Get the text content of a tag
print(soup.p.text)  # $37

# 5. Nested selection
print(soup.html.head)

# 6. Children and descendants (the nodes enclosed by a tag)
print(soup.body.children)  # all direct children of body; returns an iterator, which saves memory
print(list(soup.body.children))  # convert to a list explicitly
print(soup.body.descendants)  # all descendants, also returned lazily
print(list(soup.body.descendants))

# 7. Parent and ancestors
print(soup.p.parent)  # the parent of the p tag
print(soup.p.parents)  # all ancestors of the p tag (a generator)

# 8. Siblings
# The next sibling
print(soup.p.next_sibling)
# All following siblings
print(soup.p.next_siblings)  # returns a generator, which saves memory
print(list(soup.p.next_siblings))

# Previous siblings; commas and other text nodes count as siblings too
print(soup.a.previous_sibling)  # the sibling just before the a tag
# All siblings before the a tag
print(soup.a.previous_siblings)
print(list(soup.a.previous_siblings))
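Because text nodes count as children and siblings, a document that contains line breaks between tags behaves differently from the single-line html_doc above. A short sketch of how to skip those text nodes, assuming a small multi-line sample and html.parser (both chosen for the example, not taken from the listing):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html_doc = """<body>
<p class="sister"><b>$37</b></p>
<p class="story" id="p">Once upon a time</p>
</body>"""

soup = BeautifulSoup(html_doc, "html.parser")
first_p = soup.p

# The line break after </p> is a text node, so it is the next sibling
raw_next = first_p.next_sibling

# Keep only real tags when walking the children
tag_children = [child.name for child in soup.body.children if isinstance(child, Tag)]

# find_next_sibling() skips text nodes and returns the next matching tag
next_p = first_p.find_next_sibling("p")

# parents walks upward and ends at the BeautifulSoup object itself
ancestors = [parent.name for parent in soup.b.parents]

print(repr(raw_next), tag_children, next_p["id"], ancestors)
```

When only tags matter, find_next_sibling() and find_previous_sibling() are usually more convenient than the raw next_sibling and previous_sibling attributes.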
4. Basic usage of MongoDB
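Relational databases: powerful query capabilities.
Non-relational databases: flexible schema, scalability, and performance. Data is stored in collections, and documents do not have to follow a fixed one-to-one mapping of fields.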
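To make "flexible schema" concrete, here is a tiny sketch in plain Python, with dicts standing in for documents. The field names and values are invented for illustration; nothing here talks to a database.

```python
import json

# Two hypothetical documents from the same collection: they share no fixed column list
students = [
    {"name": "tank", "age": 18},
    {"name": "tank1", "hobbies": ["read", "run"]},
]

# Collect every field that appears in at least one document
all_fields = sorted({field for doc in students for field in doc})
print(all_fields)

# Each document serializes on its own, carrying only the fields it has
print(json.dumps(students[1]))
```

In a relational table both rows would need the same columns, with NULL filling the gaps.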
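1. MongoDB
The global variable db shows the database currently in use
Create a collection
SQL:
create table table_name (f1, f2, ...)
MongoDB:
db.student (the collection is created implicitly on the first insert)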
Insert data
MongoDB:
Insert several documents
db.student.insert([{"name1":"tank1"},{"name2":"tank2"}])
Insert one document
db.student.insert({"name1":"tank1"})
Query data
Find all documents
db.student.find({})
Find by condition: the documents whose name is tank
db.student.find({"name":"tank"})
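The two find() calls above differ only in the filter document. As a rough mental model, and not how MongoDB is implemented, an equality filter keeps the documents in which every key of the filter has the given value. The sketch below mimics that in plain Python; the names matches and find are hypothetical helpers, and query operators, nested fields, and array matching are deliberately ignored.

```python
def matches(document, query):
    """Return True when every key in query is present in document with an equal value."""
    return all(key in document and document[key] == value for key, value in query.items())


def find(documents, query):
    """Simplified stand-in for a find() call with an equality filter."""
    return [doc for doc in documents if matches(doc, query)]


students = [
    {"name": "tank", "age": 18},
    {"name": "tank1", "age": 84},
]

everything = find(students, {})            # an empty filter matches every document
tanks = find(students, {"name": "tank"})   # only documents whose name equals "tank"
print(len(everything), tanks)
```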
from pymongo import MongoClient

# 1. Connect to the MongoDB server
# Argument 1: the IP address (or hostname) of the MongoDB server
# Argument 2: the MongoDB port, 27017 by default
client = MongoClient('localhost', 27017)
print(client)

# 2. Select the tank_db database; MongoDB creates it on the first write if it does not exist
print(client['tank_db'])

# 3. Select the people collection (also created on the first write)
print(client['tank_db']['people'])

# 4. Insert data into the tank_db database
# 4.1 Insert one document
data1 = {
    'name': 'tank',
    'age': 18,
    'sex': 'male'
}
client['tank_db']['people'].insert_one(data1)

# 4.2 Insert several documents
data1 = {
    'name': 'tank',
    'age': 18,
    'sex': 'male'
}
data2 = {
    'name': 'tank1',
    'age': 84,
    'sex': 'female'
}
data3 = {
    'name': 'tank2',
    'age': 73,
    'sex': 'male'
}
client['tank_db']['people'].insert_many([data1, data2, data3])

# 5. Query data
# Fetch all documents
data_s = client['tank_db']['people'].find()
print(data_s)  # <pymongo.cursor.Cursor object at 0x000002EEA6720128>

# find() returns a cursor, so loop over it to print every document
for data in data_s:
    print(data)

# Fetch a single document
data = client['tank_db']['people'].find_one()
print(data)

# insert_one() and insert_many() are the officially recommended methods;
# the legacy insert() is deprecated and was removed in PyMongo 4.0
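The script above repeats client['tank_db']['people'] on every call. One possible tidy-up, sketched under the assumption that the collection object is passed in by the caller: two small helpers that use only insert_many() and find() from the PyMongo collection API. The helper names seed_people and names_by_sex are invented for this example.

```python
def seed_people(collection, people):
    """Insert the given documents and return how many ids the server reported."""
    result = collection.insert_many(people)
    return len(result.inserted_ids)


def names_by_sex(collection, sex):
    """Return the name field of every document whose sex field equals the argument."""
    return [doc["name"] for doc in collection.find({"sex": sex})]


# Usage against a local server (assumes MongoDB is listening on localhost:27017):
# from pymongo import MongoClient
# people = MongoClient('localhost', 27017)['tank_db']['people']
# seed_people(people, [
#     {'name': 'tank', 'age': 18, 'sex': 'male'},
#     {'name': 'tank1', 'age': 84, 'sex': 'female'},
# ])
# print(names_by_sex(people, 'male'))
```

Taking the collection as a parameter keeps the connection details in one place and lets the helpers be exercised with a stand-in object when no server is available.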