
Web Scraping: Three Ways to Parse Data

Parsing method 1: regular expressions

Parsing method 2: XPath expressions

Parsing method 3: bs4

======================================

Parsing method 1: regular expressions

Common regex patterns:

import re
# Extract "python"
key = "javapythonc++php"
re.findall('python', key)[0]  # 'python'
#####################################################################
# Extract "hello world"
key = "<html><h1>hello world</h1></html>"
re.findall('<h1>(.*)</h1>', key)[0]  # 'hello world'
#####################################################################
# Extract 170
string = 'I like girls with a height of 170'
re.findall(r'\d+', string)  # ['170']
#####################################################################
# Extract "http://" and "https://"
key = 'http://www.baidu.com and https://boob.com'
re.findall('https?://', key)  # ['http://', 'https://']
#####################################################################
# Extract "hello"
key = 'lalala<hTml>hello</HtMl>hahah'  # the pattern matches <hTml>hello</HtMl>; the group captures hello
re.findall('<[Hh][Tt][mM][lL]>(.*)</[Hh][Tt][mM][lL]>', key)  # ['hello']
#####################################################################
# Extract "hit."
key = 'bobo@hit.edu.com'  # we want to match "hit."
re.findall(r'h.*?\.', key)  # ['hit.']
#####################################################################
# Match "sas" and "saas"
key = 'saas and sas and saaas'
re.findall('sa{1,2}s', key)  # ['saas', 'sas']
#####################################################################
# Match the lines that start with "i"; re.M makes ^ and $ match at every line
string = '''fall in love with you
i love you very much
i love she
i love her'''
re.findall('^i.*', string, re.M)  # ['i love you very much', 'i love she', 'i love her']
#####################################################################
# Match across all lines; re.S lets "." match newlines too, so the text is treated as one line
string1 = """<div>静夜思
窗前明月光
疑是地上霜
举头望明月
低头思故乡
</div>"""
re.findall('.*', string1, re.S)  # the whole text, followed by an empty match at the end
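The patterns above rely on three behaviours that are easy to mix up: greedy versus non-greedy quantifiers, and the re.I and re.S flags. The following is a minimal sketch on made-up sample strings (the html and block variables are illustrative, not taken from any real page):

```python
import re

html = "<p>first</p><p>second</p>"

# Greedy: .* runs on to the LAST </p>, so both paragraphs end up in one match.
greedy = re.findall(r"<p>(.*)</p>", html)
# Non-greedy: .*? stops at the FIRST </p>, giving one match per paragraph.
lazy = re.findall(r"<p>(.*?)</p>", html)

# re.I makes the match case-insensitive, replacing classes like [Hh][Tt][Mm][Ll].
tag_text = re.findall(r"<html>(.*?)</html>", "lalala<hTml>hello</HtMl>hahah", re.I)

# Without re.S "." does not match a newline; with re.S a match can span lines.
block = "<div>line one\nline two</div>"
no_dotall = re.findall(r"<div>(.*?)</div>", block)
dotall = re.findall(r"<div>(.*?)</div>", block, re.S)

print(greedy, lazy, tag_text, no_dotall, dotall)
```

When scraping HTML, the non-greedy form combined with re.S is usually the safer default, since tags often span several lines.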

Task: use a regex to parse and download the image data from Qiushibaike (糗事百科)
import os
import re
import requests
url = "https://www.qiushibaike.com/pic/"
response = requests.get(url=url)
page_text = response.text
'''
Inspecting the page source shows that the target is the img tag inside the a tag under the div with class="thumb":
<div class="thumb">
<a href="/article/121159481" target="_blank">
<img src="//pic.qiushibaike.com/system/pictures/1212/CIY.jpg" alt="xad">
</a>
</div>
'''
img_list = re.findall('<div class="thumb">.*?<img src="(.*?)".*?>.*?</div>', page_text, re.S)
# ['//pic.qiushibaike.com/system/pictures/1212/CIY.jpg', '', ...]
os.makedirs('imgs', exist_ok=True)  # make sure the output directory exists
for img_url in img_list:
    # The src value is protocol-relative, so prepend the scheme to get a full URL
    img_url = 'https:' + img_url
    # Fetch the binary image data
    img_data = requests.get(url=img_url).content
    img_name = img_url.split('/')[-1]
    save_path = os.path.join('imgs', img_name)
    with open(save_path, 'wb') as fp:
        fp.write(img_data)
    print(img_name + ' written successfully')
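Prepending 'https:' works only when every src value is protocol-relative. As a hedged alternative, the standard library's urllib.parse.urljoin resolves protocol-relative, root-relative and absolute values alike. The helper names absolute_image_url and image_filename below are hypothetical, introduced only for this sketch:

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def absolute_image_url(page_url, src):
    """Resolve an <img src> value against the URL of the page it came from."""
    return urljoin(page_url, src)


def image_filename(img_url):
    """Take the last path segment as the file name, ignoring any query string."""
    return posixpath.basename(urlsplit(img_url).path)


page = "https://www.qiushibaike.com/pic/"
full = absolute_image_url(page, "//pic.qiushibaike.com/system/pictures/1212/CIY.jpg")
print(full, image_filename(full))
```

The result can be passed to requests.get in place of the manually concatenated string.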

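Parsing method 2: XPath expressions

1. Install: pip install lxml
2. Import: from lxml import etree
3. Create an etree object and parse the target data
    - Local file: tree = etree.parse('path to local file')
          tree.xpath('xpath expression')
      (etree.parse uses the XML parser by default; for HTML that is not well-formed XML, pass etree.HTMLParser() as the second argument)
    - Network: tree = etree.HTML('page data fetched over the network')
          tree.xpath('xpath expression')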

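Common XPath expressions:

Attribute selection:

# Find div tags whose class attribute is "song"
//div[@class="song"]

Hierarchy & index selection:

# Find the a tag that is a child of the second li child of the ul under the div whose class is "tang" (XPath indexes start at 1)
//div[@class="tang"]/ul/li[2]/a

Logical operators:

# Find a tags whose href attribute is empty and whose class attribute is "du"
//a[@href="" and @class="du"]

Fuzzy matching:

# Find div tags whose class attribute contains "ng"
//div[contains(@class, "ng")]
# Find div tags whose class attribute starts with "ta"
//div[starts-with(@class, "ta")]

Getting text:

# /text() returns the text directly inside a tag
//div[@class="song"]/p[1]/text()
# //text() returns the text inside a tag and inside all of its descendant tags
//div[@class="tang"]//text()

Getting attributes:

# Under the div whose class is "tang", take every descendant li that is the second li among its siblings, then get the href attribute of its child a tag
//div[@class="tang"]//li[2]/a/@href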
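To see what these expressions return, here is a minimal sketch that runs several of them through lxml against a small inline document. The HTML snippet and the a.example URLs are invented for illustration only:

```python
from lxml import etree

html = """
<html><body>
  <div class="song"><p>first</p><p>second</p></div>
  <div class="tang">
    <ul>
      <li><a href="http://a.example/1">one</a></li>
      <li><a href="http://a.example/2">two</a></li>
    </ul>
  </div>
</body></html>
"""
tree = etree.HTML(html)

# Attribute selection returns a list of Element objects.
song_divs = tree.xpath('//div[@class="song"]')
# /text() returns a list of strings directly inside the selected tag.
first_p = tree.xpath('//div[@class="song"]/p[1]/text()')
# @href returns a list of attribute values.
second_href = tree.xpath('//div[@class="tang"]/ul/li[2]/a/@href')
# contains() and starts-with() match on part of the attribute value.
ng_classes = tree.xpath('//div[contains(@class, "ng")]/@class')
ta_classes = tree.xpath('//div[starts-with(@class, "ta")]/@class')
# //text() also returns whitespace-only strings between tags, so filter them out.
all_text = [t.strip() for t in tree.xpath('//div[@class="tang"]//text()') if t.strip()]

print(len(song_divs), first_p, second_href, ng_classes, ta_classes, all_text)
```

Every call returns a list, even when only one node matches, which is why the examples in this post index with [0].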

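XPath plugin: lets you run XPath expressions directly against the page open in the browser
Install (Chrome as an example): More tools -> Extensions -> enable Developer mode in the top-right corner -> drag the XPath plugin onto the page
Shortcut to toggle the XPath plugin: Ctrl+Shift+X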

Task: use XPath to parse the joke titles and contents on the joke site ishuo.cn and persist them to a file
import requests
from lxml import etree
url = 'https://ishuo.cn/joke'
response = requests.get(url)
page_text = response.text
tree = etree.HTML(page_text)
# Get all li tags (each li holds one joke's content and title)
li_list = tree.xpath('//div[@id="list"]/ul/li')
# Note: an Element object can call xpath() again to parse its own subtree
# The with block closes the file even if parsing fails part-way through
with open('duanzi.txt', 'w', encoding='utf8') as fp:
    for li in li_list:
        content = li.xpath('./div[@class="content"]/text()')[0]
        title = li.xpath('./div[@class="info"]/a/text()')[0]
        fp.write(title + ":" + content + "\n\n")
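Because xpath() returns a list, indexing with [0] raises IndexError as soon as one li lacks the expected div. One possible guard is a small helper that falls back to a default value. The name first_or_default is hypothetical and not part of lxml:

```python
def first_or_default(results, default=""):
    """Return the first item of an xpath() result list, stripped, or a default."""
    return results[0].strip() if results else default


# Plain lists stand in here for what li.xpath(...) would return.
print(first_or_default(["  a joke title \n"]))
print(first_or_default([], default="(untitled)"))
```

In the loop above it would be used as first_or_default(li.xpath('./div[@class="content"]/text()')).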

Parsing method 3: bs4

bs4 (Beautiful Soup) is a Python-only library that is simple, convenient, and efficient to use.
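1. Environment setup:

    - The default overseas package index may be slow or unreachable
    - If so, point pip at a domestic mirror such as the Aliyun, Douban, or NetEase mirror
    - Windows
        (1) Open File Explorer (use the folder address bar)
        (2) Type %appdata% in the address bar
        (3) Create a folder named pip there
        (4) Inside the pip folder create a file named pip.ini with the following content
[global]
timeout = 6000
index-url = https://mirrors.aliyun.com/pypi/simple/
trusted-host = mirrors.aliyun.com
    - Linux
        (1) cd ~
        (2) mkdir ~/.pip
        (3) vi ~/.pip/pip.conf
        (4) Enter the same content as on Windows
    - Install:
        pip install bs4 (this pulls in the beautifulsoup4 package). The 'lxml' parser used below is a separate third-party library, so install it as well: pip install lxml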
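2. Workflow:
    Core idea: convert an HTML document into a BeautifulSoup object, then use the object's methods or attributes to find the nodes you want
    - Import: from bs4 import BeautifulSoup
    - Create the BeautifulSoup object:
    (1) The HTML document is a local file:
        - soup = BeautifulSoup(open('local file'), 'lxml')
    (2) The HTML document comes from the network:
        - soup = BeautifulSoup('page data as str or bytes', 'lxml')
3. Finding tags or attributes:
    (1) By tag name
        - soup.a  returns only the first matching tag
    (2) Getting attributes
        - soup.a.attrs  returns all attributes and values of a as a dict
        - soup.a.attrs['href']  returns the href attribute
        - soup.a['href']  shorthand for the same thing
    (3) Getting content
        - soup.a.string  # similar to /text()
        - soup.a.text  # similar to //text()
        - soup.a.get_text()  # similar to //text()
        [Note] If the tag contains child tags alongside its own text, string returns None, while the other two still return the text
    (4) find: returns the first matching tag
        - soup.find('a')  the first a tag
        - soup.find('a', title="xxx")
        - soup.find('a', alt="xxx")
        - soup.find('a', class_="xxx")
        - soup.find('a', id="xxx")
    (5) find_all: returns all matching tags
        - soup.find_all('a')
        - soup.find_all(['a','b'])  all a and b tags
        - soup.find_all('a', limit=2)  only the first two
    (6) Selecting content with CSS selectors
        soup.select('#feng')
        - Common selectors: tag selector (a), class selector (.), id selector (#), hierarchy selectors
        - Hierarchy selectors:
        soup.select('div .dudu #lala .meme .xixi')  a space matches descendants at any depth
        soup.select('div > p > a > .lala')  > matches direct children only
        [Note] select always returns a list, so use an index to pick out a specific tag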
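The lookups listed above can be tried on a small inline document. This is only a sketch: the HTML and the a.example URLs are made up, and it uses Python's built-in 'html.parser' so that it runs without lxml (passing 'lxml' as in this post should behave the same on this snippet):

```python
from bs4 import BeautifulSoup

html = """
<div id="feng" class="tang">
  <a href="http://a.example/1" title="one">first <b>bold</b></a>
  <a href="http://a.example/2" class="du">second</a>
  <p><a class="lala" href="http://a.example/3">third</a></p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first_href = soup.a["href"]        # attribute of the first <a>
first_string = soup.a.string       # None: this <a> holds text plus a <b> child
first_text = soup.a.get_text()     # text of the tag and of its descendants
du_text = soup.find("a", class_="du").string
two_hrefs = [a["href"] for a in soup.find_all("a", limit=2)]

child_links = soup.select("#feng > a")   # direct children only
all_links = soup.select("#feng a")       # descendants at any depth
lala_text = soup.select("div > p > a.lala")[0].text

print(first_href, first_string, first_text, du_text, two_hrefs)
print(len(child_links), len(all_links), lala_text)
```

The difference between the two select calls shows the space versus > distinction: the third link sits inside a p, so only the descendant selector reaches it.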

Task: scrape the chapter titles and contents of Romance of the Three Kingdoms from the classical poetry site shicimingju.com

import requests
from bs4 import BeautifulSoup

def get_content(content_url):
    # Fetch the chapter page behind a title's URL and return the chapter text
    page_text = requests.get(url=content_url).text
    soup = BeautifulSoup(page_text, 'lxml')
    div = soup.find('div', class_='chapter_content')
    return div.text

url = "http://www.shicimingju.com/book/sanguoyanyi.html"
page_text = requests.get(url=url).text
soup = BeautifulSoup(page_text, 'lxml')
a_list = soup.select('.book-mulu > ul > li > a')
# Note: a bs4.element.Tag object supports the same search methods as soup for parsing its own subtree
with open('./sanguo.txt', 'w', encoding='utf8') as fp:
    for a in a_list:
        # Get the chapter title
        title = a.string
        content_url = "http://www.shicimingju.com" + a['href']
        # get_content must be defined before this loop runs, and its return value kept
        content = get_content(content_url)
        fp.write(title + ":" + content + "\n\n")
        print("Wrote one chapter")

posted on 2019-04-25 15:02 要一直走下去
