Learning Python: beautifulsoup4, searching the document tree, traversing the document tree, and basic MongoDB usage

1. Basic usage of BeautifulSoup

# Parsing tools: re, selenium
# lxml also offers an XML parser
# BeautifulSoup is a parsing library that must be paired with a parser
# Main parsers today: Python's built-in html.parser, and the lxml HTML parser (preferred)
# BeautifulSoup gives us a way to search the document tree; internally it builds on re
# 1. What is bs4, and why use it?

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="sister"><b>$37</b></p>

<p class="story" id="p">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" >Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
from bs4 import BeautifulSoup  # import BeautifulSoup from the bs4 package
# Call BeautifulSoup to instantiate a soup object
# arg 1: the text to parse
# arg 2: the parser to use (html.parser or lxml)
soup = BeautifulSoup(html_doc, 'lxml')
print(soup)
print(type(soup))
# Pretty-print the document
html = soup.prettify()
print(html)
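If lxml is not installed, the standard-library html.parser mentioned above can be dropped in as the second argument with no extra install. A minimal sketch (the one-line snippet here is made up for illustration):

```python
from bs4 import BeautifulSoup

# html.parser ships with Python, so no third-party parser is needed
snippet = '<p class="sister"><b>$37</b></p>'
soup = BeautifulSoup(snippet, 'html.parser')

print(soup.p.b.text)    # the text inside the <b> tag
print(soup.p['class'])  # class attribute values come back as a list
```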

2. bs4: searching the document tree

html_doc = """<html><head><title>The Dormouse's story</title></head><body><p class="sister"><b>$37</b></p><p class="story" id="p">Once upon a time there were three little sisters; and their names were<b>tank</b><a href="http://example.com/elsie" class="sister" >Elsie</a>,<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;and they lived at the bottom of a well.<hr></hr></p><p class="story">...</p>"""
'''
Searching the document tree:
    find()      returns the first match
    find_all()  returns all matches

Searching by tag and by attribute:
    tag:
            name   match on the tag name
            attrs  match on the tag's attributes
            text   match on the tag's text

        - string filter
            exact string match

        - regex filter
            match with the re module

        - list filter
            match any item in a list

        - bool filter
            True matches any value

        - method filter
            a function deciding which attributes a tag must (or must not) have

    attribute shortcuts:
        - class_
        - id
'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')

# String filter
# name
p_tag = soup.find(name='p')
print(p_tag)  # find the first tag named p
# find every tag named p
tag_s1 = soup.find_all(name='p')
print(tag_s1)


# attrs
# find the first node whose class is sister
p = soup.find(attrs={"class": "sister"})
print(p)
# find every node whose class is sister
tag_s2 = soup.find_all(attrs={"class": "sister"})
print(tag_s2)


# text
text = soup.find(text="$37")
print(text)


# Filters can be combined:
# find the <a> tag whose id is link2 and whose text is Lacie
a_tag = soup.find(name="a", attrs={"id": "link2"}, text="Lacie")
print(a_tag)



# # Regex filter
# import re
# # name
# p_tag = soup.find(name=re.compile('p'))
# print(p_tag)

# List filter
# import re
# # name
# tags = soup.find_all(name=['p', 'a', re.compile('html')])
# print(tags)

# - bool filter
# True matches any attribute value
# find a <p> tag that has an id attribute
# p = soup.find(name='p', attrs={"id": True})
# print(p)

# Method filter
# match <a> tags that carry both an id and a class attribute
# def have_id_class(tag):
#     if tag.name == 'a' and tag.has_attr('id') and tag.has_attr('class'):
#         return tag
#
# tag = soup.find(name=have_id_class)
# print(tag)
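Since the filter examples above are left commented out, here is a small runnable sketch of the regex, list, bool, and method filters; the two-tag snippet below is an invented, trimmed stand-in for html_doc:

```python
import re
from bs4 import BeautifulSoup

doc = ('<p class="sister"><b>$37</b></p>'
       '<a href="#" class="sister" id="link2">Lacie</a>')
soup = BeautifulSoup(doc, 'html.parser')

# regex filter: the first tag whose name matches the pattern
print(soup.find(name=re.compile('^b')).name)

# list filter: every tag whose name is in the list
print([t.name for t in soup.find_all(name=['p', 'a'])])

# bool filter: every tag that has an id attribute at all
print([t['id'] for t in soup.find_all(attrs={'id': True})])

# method filter: keep tags that have a class but no id
def has_class_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

print([t.name for t in soup.find_all(has_class_no_id)])
```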

3. bs4: traversing the document tree

html_doc = """<html><head><title>The Dormouse's story</title></head><body><p class="sister"><b>$37</b></p><p class="story" id="p">Once upon a time there were three little sisters; and their names were<a href="http://example.com/elsie" class="sister" >Elsie</a>,<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;and they lived at the bottom of a well.</p><p class="story">...</p>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')
'''
Traversing the document tree:
1. Direct attribute access
'''
# 1. Direct attribute access
print(soup.p)  # the first <p> tag
print(soup.a)  # the first <a> tag
# 2. Get a tag's name
print(soup.head.name)
# 3. Get a tag's attributes
print(soup.a.attrs)  # returned as a dict
print(soup.a.attrs['href'])  # the href attribute of the <a> tag
# 4. Get a tag's text
print(soup.p.text)  # $37
# 5. Nested selection
print(soup.html.head)
# 6. Children and descendants
print(soup.body.children)  # all direct children of body, returned as an iterator to save memory
print(list(soup.body.children))  # force it into a list
print(soup.body.descendants)  # all descendants, also an iterator
print(list(soup.body.descendants))
# 7. Parent and ancestors
print(soup.p.parent)  # the parent node of the <p> tag
print(soup.p.parents)  # all ancestors of the <p> tag
# 8. Siblings
# the next sibling
print(soup.p.next_sibling)
# all following siblings
print(soup.p.next_siblings)  # an iterator, to save memory
print(list(soup.p.next_siblings))
# previous siblings; note that commas and bare text nodes count as siblings too
print(soup.a.previous_sibling)  # the sibling right before the <a> tag
# all siblings before the <a> tag
print(soup.a.previous_siblings)
print(list(soup.a.previous_siblings))
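To see these navigation attributes in isolation, a minimal sketch on a tiny hand-made fragment (written without whitespace so no text-node siblings get in the way):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><p>a</p><p>b</p><p>c</p></div>', 'html.parser')
first = soup.p  # the first <p>

print(first.next_sibling.text)                # the <p> right after it
print(first.parent.name)                      # the enclosing tag
print([t.text for t in first.next_siblings])  # all following siblings
print([c.name for c in soup.div.children])    # direct children of <div>
```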

4. Basic usage of MongoDB

Relational databases: strict schema, powerful query capabilities.
Non-relational databases: flexible schema, scalability, performance; you create collections, and documents need no fixed one-to-one structure.
1. MongoDB
The global variable db shows the current database.
Create a collection
SQL:
create table t (f1, f2, ...)
MongoDB:
db.student
Insert data
MongoDB:
insert several documents
db.student.insert([{"name1": "tank1"}, {"name2": "tank2"}])
insert one document
db.student.insert({"name1": "tank1"})
Query data
query everything
db.student.find({})
query the records whose name is tank
db.student.find({"name": "tank"})
from pymongo import MongoClient

# 1. Connect to the MongoDB server
# arg 1: MongoDB host address
# arg 2: MongoDB port (default: 27017)
client = MongoClient('localhost', 27017)
print(client)

# 2. Switch to the tank_db database (created lazily if it does not exist)
print(client['tank_db'])

# 3. Get (or create) the people collection
print(client['tank_db']['people'])

# 4. Insert data into the tank_db database

# 1. insert one document
data1 = {
    'name': 'tank',
    'age': 18,
    'sex': 'male'
}
client['tank_db']['people'].insert(data1)

# 2. insert several documents
data1 = {
    'name': 'tank',
    'age': 18,
    'sex': 'male'
}
data2 = {
    'name': 'tank1',
    'age': 84,
    'sex': 'female'
}
data3 = {
    'name': 'tank2',
    'age': 73,
    'sex': 'male'
}
client['tank_db']['people'].insert([data1, data2, data3])

# 5. Query data
# fetch all documents
data_s = client['tank_db']['people'].find()
print(data_s)  # <pymongo.cursor.Cursor object at 0x000002EEA6720128>
# loop over the cursor to print every document
for data in data_s:
    print(data)

# fetch a single document
data = client['tank_db']['people'].find_one()
print(data)

# the officially recommended methods:
# insert one document: insert_one
# client['tank_db']['people'].insert_one()
# insert several documents: insert_many
# client['tank_db']['people'].insert_many()

 

posted @ 2019-06-20 23:44  lhhhha