05. Python Web Scraping: Three Data Parsing Methods
1. Regex Parsing
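Regex parsing pulls target values straight out of the raw page text with the re module. Below is a minimal sketch; the HTML snippet and the pattern are assumptions for illustration only (they mirror the test page used in the XPath section), not a real crawl target.

import re

# A minimal regex-parsing sketch; the HTML below is assumed sample data, not fetched from a real site.
page_text = '''
<div class="song">
    <img src="http://www.baidu.com/meinv.jpg" alt="" />
    <a href="http://www.song.com/" title="song">this is span</a>
</div>
'''

# re.S lets "." also match newlines, so the pattern can span line breaks
ex = '<img src="(.*?)" alt=""'
src_list = re.findall(ex, page_text, re.S)
print(src_list)  # ['http://www.baidu.com/meinv.jpg']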
2. XPath Parsing
Test page data
<html lang="en">
<head>
    <meta charset="UTF-8" />
    <title>测试bs4</title>
</head>
<body>
    <div>
        <p>百里守约</p>
    </div>
    <div class="song">
        <p>李清照</p>
        <p>王安石</p>
        <p>苏轼</p>
        <p>柳宗元</p>
        <a href="http://www.song.com/" title="赵匡胤" target="_self">
            <span>this is span</span>
            宋朝是最强大的王朝,不是军队的强大,而是经济很强大,国民都很有钱</a>
        <a href="" class="du">总为浮云能蔽日,长安不见使人愁</a>
        <img src="http://www.baidu.com/meinv.jpg" alt="" />
    </div>
    <div class="tang">
        <ul>
            <li><a href="http://www.baidu.com" title="qing">清明时节雨纷纷,路上行人欲断魂,借问酒家何处有,牧童遥指杏花村</a></li>
            <li><a href="http://www.163.com" title="qin">秦时明月汉时关,万里长征人未还,但使龙城飞将在,不教胡马度阴山</a></li>
            <li><a href="http://www.126.com" alt="qi">岐王宅里寻常见,崔九堂前几度闻,正是江南好风景,落花时节又逢君</a></li>
            <li><a href="http://www.sina.com" class="du">杜甫</a></li>
            <li><a href="http://www.dudu.com" class="du">杜牧</a></li>
            <li><b>杜小月</b></li>
            <li><i>度蜜月</i></li>
            <li><a href="http://www.haha.com" id="feng">凤凰台上凤凰游,凤去台空江自流,吴宫花草埋幽径,晋代衣冠成古丘</a></li>
        </ul>
    </div>
</body>
</html>
Common XPath expressions in practice
# XPath support is exposed through etree, a module inside the lxml package
from lxml import etree
# Instantiate an etree object from a local file, loading the page source into it
tree = etree.parse('./index.html')  # the returned object is an ElementTree
Parsing operations
Attribute-based selection
Find the div whose class attribute equals "song":
tree.xpath('//div[@class="song"]')
Hierarchy & index selection:
Find the a that is a direct child of the second li under the ul that is a direct child of the div with class "tang":
tree.xpath('//div[@class="tang"]/ul/li[2]/a')
Logical operators:
Find the a whose href attribute is empty and whose class attribute equals "du":
tree.xpath('//a[@href="" and @class="du"]')
Extracting text:
/text() returns the text directly inside a tag
//text() returns the text inside a tag plus the text of all its descendant tags
tree.xpath('//div[@class="song"]/p[1]/text()')[0]
tree.xpath('//div[@class="tang"]//text()')
tree.xpath('//div[@class="tang"]/ul/li[2]//text()')
Note:
Never append an index like [0] to //text(), because the returned list contains more than one element.
Extracting attributes:
tree.xpath('//div[@class="tang"]/ul/li[2]/a/@href')[0]
Parsing data with XPath expressions
1. Install: pip install lxml
2. Import: from lxml import etree
3. Convert the HTML or XML document into an etree object, then call its methods to locate the target nodes (a short sketch of both cases follows this list)
  3.1 Local file: tree = etree.parse(filename)
      tree.xpath("xpath expression")
  3.2 Network data: tree = etree.HTML(page_source_string)
      tree.xpath("xpath expression")
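A minimal sketch of both entry points; './index.html' is assumed to be the test page above saved locally, and example.com is only a placeholder URL:

import requests
from lxml import etree

# 3.1 Local file (assumes ./index.html exists on disk)
local_tree = etree.parse('./index.html')
print(local_tree.xpath('//div[@class="song"]/p[1]/text()'))

# 3.2 Network data (any reachable page works; example.com is a placeholder)
headers = {'User-Agent': 'Mozilla/5.0'}
page_text = requests.get(url='http://www.example.com', headers=headers).text
net_tree = etree.HTML(page_text)
print(net_tree.xpath('//title/text()'))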
Project requirement: scrape the address, price, and description of second-hand homes in Beijing's Changping district from 58.com
https://bj.58.com/changping/ershoufang/?PGTID=0d30000c-0047-e853-04ee-93a15ab7eede&ClickID=1
import requests
from lxml import etree

# Fetch the page source
url = 'https://bj.58.com/changping/ershoufang/?PGTID=0d30000c-0047-eddf-b503-3e2a5e562019&ClickID=1'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}
page_text = requests.get(url=url, headers=headers).text
# print(page_text)

all_dict_list = []

# Instantiate an etree object and load the page source into it
tree = etree.HTML(page_text)
li_list = tree.xpath('//ul[@class="house-list-wrap"]/li')
for li in li_list:  # li is an Element object for each li tag on the page
    title = li.xpath('.//div[@class="list-info"]/h2/a/text()')[0]
    detail_url = li.xpath('.//div[@class="list-info"]/h2/a/@href')[0]
    if "https:" not in detail_url:
        detail_url = "https:" + detail_url

    price = li.xpath('.//div[@class="price"]/p//text()')
    price = ''.join(price)  # join the price fragments into a single string
    # print(title, detail_url, price)

    # Send a request to the detail page and fetch its source
    detail_page_text = requests.get(url=detail_url, headers=headers).text
    tree = etree.HTML(detail_page_text)
    desc = tree.xpath('//div[@class="general-item-wrap"]//text()')  # all text under this div
    desc = ''.join(desc).strip(" \n \b \t")
    # print(desc)

    dic = {
        "title": title,
        "price": price,
        "desc": desc
    }

    all_dict_list.append(dic)
print(all_dict_list)
Scraping encrypted data
Project requirement: scrape the image data from jandan.net http://jandan.net/ooxx
# Looking at the page source, every image has the same src value.
# A quick look shows that each image is loaded through the JS function jandan_load_img(this).
# Right after that function there is a tag with class img-hash, which stores a hash value: the encrypted img address.
# The encryption is done by a JS function, so analyze that function, learn the encryption scheme, and then decrypt.
# Capture the start URL's response with a packet-capture tool, search it globally for the function name (jandan_load_img), and analyze how it encrypts.
# Inside that JS function there is a method call; that method is the encryption routine, so search for it as well.
# The method you find mentions base64 and md5; md5 is not reversible, so base64 decoding is the first thing to try.
import requests
from lxml import etree
import base64
import os
from urllib import request

# Fetch the page source
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}
url = 'http://jandan.net/ooxx'
page_text = requests.get(url=url, headers=headers).text
# print(page_text)

# Create a folder for the images
if not os.path.exists("jiandan"):
    os.mkdir("jiandan")

# Instantiate an etree object and parse out the encrypted src values
tree = etree.HTML(page_text)
li_list = tree.xpath('//ol[@class="commentlist"]/li')
for li in li_list:
    src_code = li.xpath('.//span[@class="img-hash"]/text()')
    if len(src_code) != 0:
        src_code = src_code[0]
        src = base64.b64decode(src_code).decode()
        src = "https:" + src
        # print(src)
        # Save the downloaded image into the folder
        imgPath = "jiandan/" + src.split("/")[-1]
        request.urlretrieve(url=src, filename=imgPath)
        print(imgPath + " downloaded!")
import requests
from lxml import etree
import base64
import os
from urllib import request

# Fetch the page source
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}
url = 'http://jandan.net/ooxx'
page_text = requests.get(url=url, headers=headers).text
# print(page_text)

# Create a folder for the images
if not os.path.exists("jiandan1"):
    os.mkdir("jiandan1")

# Instantiate an etree object and parse out the encrypted src values
tree = etree.HTML(page_text)
src_code_list = tree.xpath('//span[@class="img-hash"]/text()')
for src_code in src_code_list:  # src_code is each encrypted src address
    # print(src_code)
    src = base64.b64decode(src_code).decode()  # src is the decrypted src address
    src = "https:" + src
    imgPath = "jiandan1/" + src.split("/")[-1]
    request.urlretrieve(url=src, filename=imgPath)
    print(imgPath + " downloaded")
print("All downloads finished!!!")
Scraping files
Project requirement: scrape the free resume templates from sc.chinaz.com http://sc.chinaz.com/jianli/free.html
import requests
from lxml import etree
import random
import os

# Create a folder to store the resumes
if not os.path.exists("jianli"):
    os.mkdir("jianli")

# Fetch the page source
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}
url = 'http://sc.chinaz.com/jianli/free.html'
# response = requests.get(url=url, headers=headers)
# response.encoding = "utf-8"
# page_text = response.text
page_text = requests.get(url=url, headers=headers).content.decode()
# print(page_text)

# Instantiate an etree object
tree = etree.HTML(page_text)
div_list = tree.xpath('//div[@id="container"]/div')
for div in div_list:
    detail_url = div.xpath('./a/@href')[0]
    name = div.xpath('./a/img/@alt')[0]
    # print(detail_url, name)
    detail_page_text = requests.get(url=detail_url, headers=headers).text
    tree = etree.HTML(detail_page_text)
    download_url_list = tree.xpath('//div[@class="clearfix mt20 downlist"]/ul/li/a/@href')
    # print(download_url_list)  # all download urls for this resume template
    download_url = random.choice(download_url_list)
    jianli_data = requests.get(url=download_url, headers=headers).content  # the resume as binary data
    file_path = 'jianli/' + name + '.rar'
    with open(file_path, 'wb') as fb:
        fb.write(jianli_data)
    print(file_path + " downloaded!!")
# Handle multiple pages
import requests
from lxml import etree
import random
import os

# Create a folder to store the resumes
if not os.path.exists("jianli"):
    os.mkdir("jianli")

# Fetch the page source
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}

url = 'http://sc.chinaz.com/jianli/free_%d.html'
start_page = input("start_page:")
end_page = input("end_page:")
for page in range(int(start_page), int(end_page) + 1):
    if page == 1:
        new_url = 'http://sc.chinaz.com/jianli/free.html'
    else:
        new_url = format(url % page)

    # response = requests.get(url=new_url, headers=headers)
    # response.encoding = "utf-8"
    # page_text = response.text
    page_text = requests.get(url=new_url, headers=headers).content.decode()
    # print(page_text)

    # Instantiate an etree object
    tree = etree.HTML(page_text)
    div_list = tree.xpath('//div[@id="container"]/div')
    for div in div_list:
        detail_url = div.xpath('./a/@href')[0]
        name = div.xpath('./a/img/@alt')[0]
        # print(detail_url, name)
        detail_page_text = requests.get(url=detail_url, headers=headers).text
        tree = etree.HTML(detail_page_text)
        download_url_list = tree.xpath('//div[@class="clearfix mt20 downlist"]/ul/li/a/@href')
        # print(download_url_list)  # all download urls for this resume template
        download_url = random.choice(download_url_list)

        jianli_data = requests.get(url=download_url, headers=headers).content  # the resume as binary data
        file_path = 'jianli/' + name + '.rar'
        with open(file_path, 'wb') as fb:
            fb.write(jianli_data)
        print(file_path + " downloaded!!")
Scraping videos
Project requirement: scrape videos from the sports section of pearvideo.com https://www.pearvideo.com/category_9
import requests
from lxml import etree
import re
import os

if not os.path.exists("video"):
    os.mkdir("video")

# Fetch the page source
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}
url = 'https://www.pearvideo.com/category_9'
page_text = requests.get(url=url, headers=headers).text
# print(page_text)

# Instantiate an etree object
tree = etree.HTML(page_text)
a_href_list = tree.xpath('//div[@class="vervideo-bd"]/a/@href')
for a_href in a_href_list:
    # Build the url of the video detail page
    new_url = url.split("/")[0] + "//" + url.split("/")[2] + "/" + a_href
    # Fetch the detail page
    detail_page_text = requests.get(url=new_url, headers=headers).text

    # Grab the js snippet that holds the real video link
    tree = etree.HTML(detail_page_text)
    video_js = tree.xpath('//*[@id="detailsbd"]/div[1]/script[1]/text()')[0]
    # Regex-match the real video url out of the js
    ex = 'srcUrl="(.*?)",vdoUrl'
    video_true_path = re.findall(ex, video_js, re.S)[0]
    # Get the name of the current video
    video_name = tree.xpath('//div[@id="poster"]/img/@alt')[0]

    # Build the local path for the video
    file_path = "video/" + video_name + ".mp4"
    # Download the video content
    video_data = requests.get(url=video_true_path, headers=headers).content
    with open(file_path, "wb") as fp:
        fp.write(video_data)
    print(file_path + " downloaded")
print("All downloads finished")
Handling a common error [important]
Symptom (the problem):
When sending a large number of requests, you will often see an error like this:
HTTPConnectionPool(host='XX'): Max retries exceeded with url.
Causes:
1. Before each transfer the client has to establish a TCP connection with the server. To reduce connection overhead, the default is keep-alive: connect once, transfer many times. But if connections are never released, the connection pool eventually fills up, no new connection objects can be created, and requests can no longer be sent.
2. Your IP has been banned.
3. Requests are being sent too frequently.
Solutions (if none of these works, try simply re-running the program; a short sketch of all three follows this list):
1. Set the Connection request header to close, so the connection is torn down after each successful request.
2. Switch to a different request IP (e.g. via a proxy).
3. Sleep for a short interval before each request to space them out.
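A minimal sketch combining the three fixes; the target URLs and the proxy address are placeholders for illustration, not real endpoints:

import time
import requests

headers = {
    'User-Agent': 'Mozilla/5.0',
    'Connection': 'close'    # fix 1: close the connection after every request instead of keep-alive
}
# fix 2: route traffic through a different IP; the address below is a placeholder, substitute a working proxy
proxies = {'https': 'https://127.0.0.1:8888'}

urls = ['https://www.example.com/' for _ in range(3)]   # placeholder URLs for demonstration
for url in urls:
    time.sleep(1)            # fix 3: wait between requests to keep the request rate down
    page_text = requests.get(url=url, headers=headers).text   # pass proxies=proxies to enable fix 2
    print(len(page_text))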
Another way to write XPath expressions
| means "or" inside an xpath expression
Example: parse out the names of all cities from https://www.aqistudy.cn/historydata/
import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}
url = 'https://www.aqistudy.cn/historydata/'
page_text = requests.get(url=url, headers=headers).text

tree = etree.HTML(page_text)
# "|" lets a single expression match both the hot-city list and the full-city list
li_list = tree.xpath('//div[@class="bottom"]/ul/li | //div[@class="bottom"]/div/li')
print(li_list)