Web Scraping: Data Parsing
Introduction to data parsing
Data parsing: parsing, or extracting, data — pulling a specified local piece of data out of the full page fetched by a general-purpose crawler
- Purpose: it is what makes a focused crawler possible
- Implementation options:
    - regex (comparatively cumbersome)
    - bs4 (exclusive to Python)
    - xpath (usable from Java, PHP, Python, and more)
    - pyquery (Python-only)
- What is the general principle behind data parsing?
    - What gets parsed is always the source code of an HTML page:
        - the text content stored inside tags
        - the values of tag attributes
    - Principle:
        - 1. locate the tag
        - 2. extract its text or its attribute value
- Workflow of a crawler (a minimal sketch follows this list):
    - specify the url
    - send the request
    - obtain the response data
    - parse the data
    - persist the parsed data
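A hedged sketch of that five-step flow with requests; the URL, headers, and the empty parsing step are placeholders, not a real target:

import requests

url = "http://example.com/page.html"                      # 1. specify the url
headers = {"User-Agent": "Mozilla/5.0"}                   # UA spoofing, used throughout this article
page_text = requests.get(url=url, headers=headers).text   # 2. send the request, 3. obtain the response data
titles = []  # 4. parse the data — regex / bs4 / xpath / pyquery plug in here
# 5. persist the parsed data
with open("./result.txt", "w", encoding="utf-8") as f:
    for t in titles:
        f.write(t + "\n")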
Regex parsing
Regex review
Single characters:
    .  : any character except a newline
    [] : e.g. [aoe], [a-w] — matches any single character from the set
    \d : a digit, same as [0-9]
    \D : a non-digit
    \w : digit, letter, underscore (and, with Python 3's default Unicode matching, characters such as Chinese)
    \W : the opposite of \w
    \s : any whitespace character, including space, tab, form feed, etc.; equivalent to [ \f\n\r\t\v]
    \S : non-whitespace
Quantifiers:
    *     : any number of times, >= 0
    +     : at least once, >= 1
    ?     : optional — 0 or 1 time
    {m}   : exactly m times, e.g. o{3}
    {m,}  : at least m times
    {m,n} : between m and n times
Anchors:
    $ : match at the end
    ^ : match at the start
Grouping:
    (ab)
Greedy mode: .*
Non-greedy (lazy) mode: .*?
re.I : ignore case
re.M : multi-line matching (^ and $ match at each line)
re.S : "dot-all" mode — lets . match newlines too
re.sub(pattern, replacement, string)
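A short demo tying a few of these together — greedy vs. lazy quantifiers and what re.S changes (the sample strings are made up):

import re

# A tag whose content spans two lines.
html = '<div class="post-head">line1\nline2</div>'

# Without re.S, . cannot cross the newline, so nothing matches.
print(re.findall('<div class="post-head">(.*?)</div>', html))        # []
# With re.S, . matches \n as well, so the body is captured.
print(re.findall('<div class="post-head">(.*?)</div>', html, re.S))  # ['line1\nline2']

# Greedy .* grabs as much as possible; lazy .*? stops at the first chance.
print(re.findall('a(.*)c', 'abcabc'))   # ['bcab']
print(re.findall('a(.*?)c', 'abcabc'))  # ['b', 'b']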
Case study: extracting joke titles
- Requirement: use regex to extract the joke titles from http://duanziwang.com/category/%E6%90%9E%E7%AC%91%E5%9B%BE/
import requests
import re

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
}
url = 'http://duanziwang.com/category/搞笑图/'
# .text captures the response data as a string
page_text = requests.get(url=url, headers=headers).text
# data parsing (raw string, so \d and \. reach the regex engine verbatim)
ex = r'<div class="post-head">.*?<a href="http://duanziwang.com/\d+\.html">(.*?)</a></h1>'
ret = re.findall(ex, page_text, re.S)  # when parsing crawled pages with regex, re.S is a must
# persistence
with open("./title.txt", "a", encoding="utf-8") as f:
    for i in ret:
        f.write(f"{i}\n")
bs4 Parsing
- Environment setup
pip install bs4
pip install lxml
- Parsing principle & workflow
    - 1. Instantiate a BeautifulSoup object and load the page source to be parsed into it
    - 2. Call the BeautifulSoup object's properties and methods to locate tags and extract text data
- How to instantiate
    - Method 1: BeautifulSoup(fp, "lxml") — parse specified data out of a locally stored HTML file (fp being the opened file object)
    - Method 2: BeautifulSoup(page_text, "lxml") — parse response data scraped from the internet
Common methods and properties (a short demo follows this list)
- Tag location: locate by tag name; returns only the first occurrence
    soup.tagName — returns the first tag with that name in the current source
- Attribute location: locate tags by a given attribute
    soup.find(tagName, attrName=value) — only the class attribute needs a trailing underscore: class_
    soup.find_all(tagName, attrName=value) — same as find, but returns all matches
- Selector location
    soup.select(".className")
    soup.select("#id")
    - Hierarchical selectors
        - > : one level down
        - a space : any number of levels down
- Text extraction
    tag.string — only the text sitting directly inside the tag
    tag.text — all text under the tag, including descendants
- Attribute extraction
    tag["attrName"]
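A hedged sketch exercising the calls above on a made-up snippet (the tag names, id, and classes are invented for illustration):

from bs4 import BeautifulSoup

html = '''
<div id="list">
  <p class="intro">hello <b>world</b></p>
  <a href="/chap1.html" class="chapter">Chapter 1</a>
  <a href="/chap2.html" class="chapter">Chapter 2</a>
</div>
'''
soup = BeautifulSoup(html, 'lxml')

print(soup.a)                            # first <a> tag only
print(soup.find('a', class_='chapter'))  # class needs the trailing underscore
print(soup.find_all('a'))                # every <a> tag
print(soup.select('#list > a'))          # > : one level; a space would mean any depth
p = soup.find('p')
print(p.string)  # None here: <p> has mixed content, .string only works for a single direct text node
print(p.text)    # 'hello world' — all nested text
print(soup.a['href'])                    # attribute value: '/chap1.html'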
Case study: scraping a novel
# For sites like this, a proxy pool is recommended
# Case 1
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
}
url = "https://www.52bqg.com/book_10508/"
fp = open("./九仙图.txt", "w", encoding="utf-8")
page_text = requests.get(url=url, headers=headers)
page_text.encoding = "GBK"  # the site serves GBK; set the encoding before reading .text
x = page_text.text
soup = BeautifulSoup(x, 'lxml')
a_list = soup.select('#list a')  # every chapter link
for i in a_list:
    title = i.string
    a_href = 'https://www.52bqg.com/book_10508/' + i['href']  # build the chapter URL
    page_text_a = requests.get(url=a_href, headers=headers)
    page_text_a.encoding = "GBK"
    f = page_text_a.text
    a_soup = BeautifulSoup(f, 'lxml')
    div_tag = a_soup.find('div', id='content')  # the chapter body container
    content = div_tag.text
    fp.write("\n" + title + "\n" + content + "\n")
    print(title, "downloaded")
fp.close()
# Case 2
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
}
url = "http://www.balingtxt.com/txtml_84980.html"
page_text = requests.get(url=url, headers=headers)
page_text.encoding = "utf-8"
page_text = page_text.text
menu_soup = BeautifulSoup(page_text, "lxml")
a_lst = menu_soup.select("#yulan > li > a")  # chapter links in the table of contents
fp = open("./天命相师.txt", "w", encoding="utf-8")
for i in a_lst:
    title = i.string
    a_url = i["href"]  # this site uses absolute chapter URLs, so no joining needed
    new_text = requests.get(url=a_url, headers=headers)
    new_text.encoding = "utf-8"
    new_text = new_text.text
    content_soup = BeautifulSoup(new_text, "lxml")
    content = content_soup.find("div", class_="book_content").text
    fp.write(f"{title}\n{content}\n")
    print(f"{title} downloaded!")
fp.close()
Scraping image data
Using requests
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
}
url = "http://a0.att.hudong.com/78/52/01200000123847134434529793168.jpg"
img_data = requests.get(url=url, headers=headers).content  # .content returns the data as bytes
with open("./tiger1.jpg", "wb") as f:
    f.write(img_data)
.content returns binary data — use it when scraping images, audio, or video.
Using urllib
from urllib import request

url = "http://a0.att.hudong.com/78/52/01200000123847134434529793168.jpg"
ret = request.urlretrieve(url=url, filename="./tiger2.jpg")  # downloads straight to disk
print(ret)
The practical difference between requests and urlretrieve here is UA spoofing: requests lets you pass custom headers, while urlretrieve does not.
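If you do want UA spoofing while staying inside the standard library, a hedged sketch: build a Request carrying headers and write the bytes yourself instead of calling urlretrieve (the tiger3.jpg filename is just for this sketch):

from urllib import request

url = "http://a0.att.hudong.com/78/52/01200000123847134434529793168.jpg"
# urlretrieve accepts no headers, but urlopen does, via a Request object.
req = request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
with request.urlopen(req) as resp:
    img_data = resp.read()  # bytes, just like requests' .content
with open("./tiger3.jpg", "wb") as f:
    f.write(img_data)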
XPath Parsing
- Environment setup
pip install lxml
- Parsing principle & workflow
    - Instantiate an etree object and load the page source to be parsed into it
    - Call the etree object's xpath method with XPath expressions of different forms to locate tags and extract data
- Instantiating the object
    etree.parse("filePath") — loads the data of a locally stored HTML file into the etree object
    etree.HTML(page_text) — loads data scraped from the web into it
- XPath expressions (see the sketch after this list)
    - Tag location
        - a leading / : locate starting from the root node (almost never used)
        - a non-leading / : one level down
        - a leading // : locate the tag from any position in the document
        - a non-leading // : any number of levels down (the most common form)
    - Attribute location:
        "//tagName[@attrName='value']"
    - Index location:
        "//tagName[index]"  # indices start at 1, for easier addressing
        //div[contains(@class,'ng')]     # all divs whose class value contains 'ng'
        //div[starts-with(@class,'ta')]  # all divs whose class value starts with 'ta'
    - Text extraction
        tree.xpath returns a list, so index into the result
        /text()  : the tag's direct text
        //text() : all text under the tag
    - Attribute extraction
        /@attrName
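A short demo of these expressions against a made-up snippet (the markup is an assumption):

from lxml import etree

html = '''
<div class="slist">
  <ul>
    <li><a href="/pic1.html">pic one</a></li>
    <li><a href="/pic2.html">pic two</a></li>
  </ul>
</div>
'''
tree = etree.HTML(html)

print(tree.xpath('//div[@class="slist"]/ul/li/a/text()'))   # direct text: ['pic one', 'pic two']
print(tree.xpath('//li[1]/a/@href'))                        # indices start at 1: ['/pic1.html']
print(tree.xpath('//div[contains(@class, "sl")]//text()'))  # all text under the div, whitespace nodes included
first_href = tree.xpath('//a/@href')[0]  # xpath returns a list, so index into it
print(first_href)                        # '/pic1.html'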
Case study: downloading images with XPath
- Requirement: use XPath to parse out the image URLs and names, then download the images and save them locally
import requests
import os
from lxml import etree
from urllib import request

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
}
dirName = "imglibs"  # folder that will hold the images
if not os.path.exists(dirName):
    os.mkdir(dirName)  # create the folder if it does not exist
page_url = "http://pic.netbian.com/4kfengjing/index_%d.html"
for page_num in range(1, 10):  # site-wide crawl; pick the page range that suits you
    if page_num == 1:  # page 1 has its own URL format
        url = "http://pic.netbian.com/4kfengjing/"
    else:
        url = page_url % page_num
    page_text = requests.get(url=url, headers=headers).text
    # data parsing
    tree = etree.HTML(page_text)  # instantiate an etree object
    img_lst = tree.xpath('//div[@class="slist"]/ul/li/a')  # returns a list of <a> tags
    for i in img_lst:
        img_href = "http://pic.netbian.com" + i.xpath("./@href")[0]  # join into the detail-page URL
        img_text = requests.get(url=img_href, headers=headers).text
        new_tree = etree.HTML(img_text)  # a fresh etree object for the detail page
        img_list = new_tree.xpath('//a[@id="img"]/img')[0]  # the <img> tag of the full-size image
        img_src = "http://pic.netbian.com" + img_list.xpath('./@src')[0]  # join into the full-size image URL
        img_alt = img_list.xpath('./@alt')[0].encode('iso-8859-1').decode('GBK')  # the image name; undo the mojibake
        filepath = "./" + dirName + "/" + img_alt + ".jpg"  # add the .jpg suffix
        request.urlretrieve(img_src, filename=filepath)  # persist to disk
        print(img_alt, "downloaded!")
print("Over!")
pyquery Parsing
- Installation
pip install pyquery
- Import
from pyquery import PyQuery as pq
- Overview
    - pyquery brings jQuery-style HTML parsing to Python; it plays a role similar to bs4
How to use it
- Initialization
from pyquery import PyQuery as pq
doc = pq(html)                          # parse an HTML string
doc = pq(url="http://news.baidu.com/")  # fetch and parse a web page
doc = pq(filename="./a.html")           # parse a local HTML file
- Basic CSS selectors
from pyquery import PyQuery as pq
html = '''
<div id="wrap">
<ul class="s_from">
asdasd
<link href="http://asda.com">asdadasdad12312</link>
<link href="http://asda1.com">asdadasdad12312</link>
<link href="http://asda2.com">asdadasdad12312</link>
</ul>
</div>
'''
doc = pq(html)
print(doc("#wrap .s_from link"))
Output:
<link href="http://asda.com">asdadasdad12312</link>
<link href="http://asda1.com">asdadasdad12312</link>
<link href="http://asda2.com">asdadasdad12312</link>
- Finding child elements
from pyquery import PyQuery as pq
html = '''
<div id="wrap">
<ul class="s_from">
asdasd
<link href="http://asda.com">asdadasdad12312</link>
<link href="http://asda1.com">asdadasdad12312</link>
<link href="http://asda2.com">asdadasdad12312</link>
</ul>
</div>
'''
# find child elements
doc = pq(html)
items=doc("#wrap")
print(items)
print("类型为:%s"%type(items))
link = items.find('.s_from')
print(link)
link = items.children()
print(link)
Output:
<div id="wrap">
<ul class="s_from">
asdasd
<link href="http://asda.com">asdadasdad12312</link>
<link href="http://asda1.com">asdadasdad12312</link>
<link href="http://asda2.com">asdadasdad12312</link>
</ul>
</div>
type: <class 'pyquery.pyquery.PyQuery'>
<ul class="s_from">
asdasd
<link href="http://asda.com">asdadasdad12312</link>
<link href="http://asda1.com">asdadasdad12312</link>
<link href="http://asda2.com">asdadasdad12312</link>
</ul>
<ul class="s_from">
asdasd
<link href="http://asda.com">asdadasdad12312</link>
<link href="http://asda1.com">asdadasdad12312</link>
<link href="http://asda2.com">asdadasdad12312</link>
</ul>
- Finding the parent element
from pyquery import PyQuery as pq
html = '''
<div href="wrap">
hello nihao
<ul class="s_from">
asdasd
<link href="http://asda.com">asdadasdad12312</link>
<link href="http://asda1.com">asdadasdad12312</link>
<link href="http://asda2.com">asdadasdad12312</link>
</ul>
</div>
'''
doc = pq(html)
items=doc(".s_from")
print(items)
# find the parent element
parent_href=items.parent()
print(parent_href)
Output:
<ul class="s_from">
asdasd
<link href="http://asda.com">asdadasdad12312</link>
<link href="http://asda1.com">asdadasdad12312</link>
<link href="http://asda2.com">asdadasdad12312</link>
</ul>
<div href="wrap">
hello nihao
<ul class="s_from">
asdasd
<link href="http://asda.com">asdadasdad12312</link>
<link href="http://asda1.com">asdadasdad12312</link>
<link href="http://asda2.com">asdadasdad12312</link>
</ul>
</div>
parent retrieves the content of the enclosing outer tag; the related parents retrieves all ancestor nodes.
- Finding sibling elements
from pyquery import PyQuery as pq
html = '''
<div href="wrap">
hello nihao
<ul class="s_from">
asdasd
<link class='active1 a123' href="http://asda.com">asdadasdad12312</link>
<link class='active2' href="http://asda1.com">asdadasdad12312</link>
<link class='movie1' href="http://asda2.com">asdadasdad12312</link>
</ul>
</div>
'''
doc = pq(html)
items=doc("link.active1.a123")
print(items)
# find sibling elements
siblings_href=items.siblings()
print(siblings_href)
Output:
<link class="active1 a123" href="http://asda.com">asdadasdad12312</link>
<link class="active2" href="http://asda1.com">asdadasdad12312</link>
<link class="movie1" href="http://asda2.com">asdadasdad12312</link>
As the output shows, siblings returns the other tags at the same level.
Conclusion: child, parent, and sibling lookups all return PyQuery objects, so the result can be selected on again (a small sketch of such chaining follows).
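A hedged sketch of that chaining, reusing the made-up markup from the examples above:

from pyquery import PyQuery as pq

html = '''
<div href="wrap">
hello nihao
<ul class="s_from">
asdasd
<link class='active1 a123' href="http://asda.com">asdadasdad12312</link>
</ul>
</div>
'''
doc = pq(html)
# Each step returns a PyQuery object, so selections chain directly.
first_link = doc("div").children("ul").find("link")
print(type(first_link))         # <class 'pyquery.pyquery.PyQuery'>
print(first_link.attr('href'))  # http://asda.com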
- Iterating over results
from pyquery import PyQuery as pq
html = '''
<div href="wrap">
hello nihao
<ul class="s_from">
asdasd
<link class='active1 a123' href="http://asda.com">asdadasdad12312</link>
<link class='active2' href="http://asda1.com">asdadasdad12312</link>
<link class='movie1' href="http://asda2.com">asdadasdad12312</link>
</ul>
</div>
'''
doc = pq(html)
its=doc("link").items()
for it in its:
    print(it)
Output:
<link class="active1 a123" href="http://asda.com">asdadasdad12312</link>
<link class="active2" href="http://asda1.com">asdadasdad12312</link>
<link class="movie1" href="http://asda2.com">asdadasdad12312</link>
- Getting attribute values
from pyquery import PyQuery as pq
html = '''
<div href="wrap">
hello nihao
<ul class="s_from">
asdasd
<link class='active1 a123' href="http://asda.com">asdadasdad12312</link>
<link class='active2' href="http://asda1.com">asdadasdad12312</link>
<link class='movie1' href="http://asda2.com">asdadasdad12312</link>
</ul>
</div>
'''
doc = pq(html)
its=doc("link").items()
for it in its:
    print(it.attr('href'))
    print(it.attr.href)
Output:
http://asda.com
http://asda.com
http://asda1.com
http://asda1.com
http://asda2.com
http://asda2.com
- Getting text
from pyquery import PyQuery as pq
html = '''
<div href="wrap">
hello nihao
<ul class="s_from">
asdasd
<link class='active1 a123' href="http://asda.com">asdadasdad12312</link>
<link class='active2' href="http://asda1.com">asdadasdad12312</link>
<link class='movie1' href="http://asda2.com">asdadasdad12312</link>
</ul>
</div>
'''
doc = pq(html)
its=doc("link").items()
for it in its:
    print(it.text())
Output:
asdadasdad12312
asdadasdad12312
asdadasdad12312
- Getting inner HTML
from pyquery import PyQuery as pq
html = '''
<div href="wrap">
hello nihao
<ul class="s_from">
asdasd
<link class='active1 a123' href="http://asda.com"><a>asdadasdad12312</a></link>
<link class='active2' href="http://asda1.com">asdadasdad12312</link>
<link class='movie1' href="http://asda2.com">asdadasdad12312</link>
</ul>
</div>
'''
doc = pq(html)
its=doc("link").items()
for it in its:
    print(it.html())
Output:
<a>asdadasdad12312</a>
asdadasdad12312
asdadasdad12312
Common DOM operations
- Adding and removing classes: addClass, removeClass
from pyquery import PyQuery as pq
html = '''
<div href="wrap">
hello nihao
<ul class="s_from">
asdasd
<link class='active1 a123' href="http://asda.com"><a>asdadasdad12312</a></link>
<link class='active2' href="http://asda1.com">asdadasdad12312</link>
<link class='movie1' href="http://asda2.com">asdadasdad12312</link>
</ul>
</div>
'''
doc = pq(html)
its=doc("link").items()
for it in its:
    print("added: %s"%it.addClass('active1'))
    print("removed: %s"%it.removeClass('active1'))
Output:
added: <link class="active1 a123" href="http://asda.com"><a>asdadasdad12312</a></link>
removed: <link class="a123" href="http://asda.com"><a>asdadasdad12312</a></link>
added: <link class="active2 active1" href="http://asda1.com">asdadasdad12312</link>
removed: <link class="active2" href="http://asda1.com">asdadasdad12312</link>
added: <link class="movie1 active1" href="http://asda2.com">asdadasdad12312</link>
removed: <link class="movie1" href="http://asda2.com">asdadasdad12312</link>
Note that a class which is already present will not be added a second time.
- attr gets or sets attributes; css adds style properties
from pyquery import PyQuery as pq
html = '''
<div href="wrap">
hello nihao
<ul class="s_from">
asdasd
<link class='active1 a123' href="http://asda.com"><a>asdadasdad12312</a></link>
<link class='active2' href="http://asda1.com">asdadasdad12312</link>
<link class='movie1' href="http://asda2.com">asdadasdad12312</link>
</ul>
</div>
'''
doc = pq(html)
its=doc("link").items()
for it in its:
    print("set: %s"%it.attr('class','active'))
    print("added: %s"%it.css('font-size','14px'))
set: <link class="active" href="http://asda.com"><a>asdadasdad12312</a></link>
added: <link class="active" href="http://asda.com" style="font-size: 14px"><a>asdadasdad12312</a></link>
set: <link class="active" href="http://asda1.com">asdadasdad12312</link>
added: <link class="active" href="http://asda1.com" style="font-size: 14px">asdadasdad12312</link>
set: <link class="active" href="http://asda2.com">asdadasdad12312</link>
added: <link class="active" href="http://asda2.com" style="font-size: 14px">asdadasdad12312</link>
Note that attr and css modify the object in place.
- remove: removing tags
from pyquery import PyQuery as pq
html = '''
<div href="wrap">
hello nihao
<ul class="s_from">
asdasd
<link class='active1 a123' href="http://asda.com"><a>asdadasdad12312</a></link>
<link class='active2' href="http://asda1.com">asdadasdad12312</link>
<link class='movie1' href="http://asda2.com">asdadasdad12312</link>
</ul>
</div>
'''
doc = pq(html)
its=doc("div")
print('text before remove:\n%s'%its.text())
it=its.remove('ul')
print('text after remove:\n%s'%it.text())
Output:
text before remove:
hello nihao
asdasd
asdadasdad12312
asdadasdad12312
asdadasdad12312
text after remove:
hello nihao
For other DOM methods, see:
http://pyquery.readthedocs.io/en/latest/api.html
Pseudo-class selectors
from pyquery import PyQuery as pq
html = '''
<div href="wrap">
hello nihao
<ul class="s_from">
asdasd
<link class='active1 a123' href="http://asda.com"><a>helloasdadasdad12312</a></link>
<link class='active2' href="http://asda1.com">asdadasdad12312</link>
<link class='movie1' href="http://asda2.com">asdadasdad12312</link>
</ul>
</div>
'''
doc = pq(html)
its=doc("link:first-child")
print('第一个标签:%s'%its)
its=doc("link:last-child")
print('最后一个标签:%s'%its)
its=doc("link:nth-child(2)")
print('第二个标签:%s'%its)
its=doc("link:gt(0)") #从零开始
print("获取0以后的标签:%s"%its)
its=doc("link:nth-child(2n-1)")
print("获取奇数标签:%s"%its)
its=doc("link:contains('hello')")
print("获取文本包含hello的标签:%s"%its)
Output:
first tag: <link class="active1 a123" href="http://asda.com"><a>helloasdadasdad12312</a></link>
last tag: <link class="movie1" href="http://asda2.com">asdadasdad12312</link>
second tag: <link class="active2" href="http://asda1.com">asdadasdad12312</link>
tags after index 0: <link class="active2" href="http://asda1.com">asdadasdad12312</link>
<link class="movie1" href="http://asda2.com">asdadasdad12312</link>
odd-position tags: <link class="active1 a123" href="http://asda.com"><a>helloasdadasdad12312</a></link>
<link class="movie1" href="http://asda2.com">asdadasdad12312</link>
tags whose text contains 'hello': <link class="active1 a123" href="http://asda.com"><a>helloasdadasdad12312</a></link>