【Python Web Scraping (Part 1)】XPath
Parsing method: XPath
Basic usage of XPath
1 Install the lxml library
conda install lxml
If the download is slow, try a mobile hotspot or switch to another package mirror.
2 Import etree
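If you are not using conda, installing with pip works as well:
pip install lxml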
from lxml import etree
3 Parsing files with XPath
① Local HTML file
html_tree = etree.parse('xx.html')
② Server response file
html_tree = etree.HTML(response.read().decode('utf-8'))
4 html_tree.xpath()
html_tree.xpath()
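Putting steps 2 to 4 together, a minimal sketch (assuming a local file named xx.html exists; an HTMLParser is passed because etree.parse defaults to a strict XML parser):

from lxml import etree

# Parse a local HTML file with a tolerant HTML parser
parser = etree.HTMLParser(encoding='utf-8')
html_tree = etree.parse('xx.html', parser)

# xpath() always returns a list (of elements or of strings)
result_list = html_tree.xpath('//ul/li/text()')
print(result_list)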
Basic XPath syntax
1 Path query
//: selects all descendant nodes, regardless of nesting level
/: selects direct child nodes
2 Predicate query
//div[@id='']
//div[@class='']
3 Attribute query
//@class
4 Fuzzy (partial-match) query
//div[contains(@id, '')]
//div[starts-with(@id, '')]
5 Content query
text()
6 Logical operators
//div[@id='' and @class='']
//title | //price  (union uses a single |; XPath has no || operator)
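A small sketch that exercises the rules above against an inline snippet (the div/ul/li structure and its ids and classes are made up for illustration):

from lxml import etree

html = '''
<div>
  <ul>
    <li id="l1" class="item">apple</li>
    <li id="l2" class="item">banana</li>
    <li id="c3">cherry</li>
  </ul>
</div>
'''
tree = etree.HTML(html)

print(tree.xpath('//li/text()'))                                     # path + content query
print(tree.xpath('//li[@id="l1"]/text()'))                           # predicate query
print(tree.xpath('//li/@class'))                                     # attribute query
print(tree.xpath('//li[contains(@id, "l")]/text()'))                 # fuzzy query: contains
print(tree.xpath('//li[starts-with(@id, "c")]/text()'))              # fuzzy query: starts-with
print(tree.xpath('//li[@id="l2" and @class="item"]/text()'))         # logical and
print(tree.xpath('//li[@id="l1"]/text() | //li[@id="c3"]/text()'))   # union with |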
Example 1: Use XPath to get the post titles from this blog's home page
import urllib.request
from lxml import etree

url = 'https://www.cnblogs.com/tod4/'
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,'
              'application/signed-exchange;v=b3;q=0.9',
    # 'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
    'cache-control': 'max-age=0',
    'cookie': 【your own Cookie value】,
    'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="102", "Microsoft Edge";v="102"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/102.0.5005.63 Safari/537.36 Edg/102.0.1245.39',
}

# Build the request with custom headers
request = urllib.request.Request(url=url, headers=headers)
# Send the request to the server like a browser would
response = urllib.request.urlopen(request)
# Read the page source
context = response.read().decode('utf-8')
# Parse the server response with XPath
tree = etree.HTML(context)
# Extract the post titles
result_list = tree.xpath("//a[@class='postTitle2 vertical-middle']/span/text()")
for result in result_list:
    print(str(result).strip())
Output:
【读书笔记】【Spring实战】二 装配Bean
【图像分类网络(一)】残差神经网络ResNet以及组卷积ResNeXt
pytorch图像处理基础
【Vue】Vuex
【Vue】三
【Vue】二
【Vue】一
【MyBatis】分页插件
【Mybatis】(一)
【SpringMVC】(三)
Example 2: Download images from Douban
Script code
import os
import urllib.request
from lxml import etree

url = 'https://www.douban.com/doulist/136189091/'
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    # 'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
    'Connection': 'keep-alive',
    'Cookie': 【your own Cookie value】,
    'Host': 'www.douban.com',
    'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="102", "Microsoft Edge";v="102"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.63 Safari/537.36 Edg/102.0.1245.39',
}

# Build the request and fetch the page source
request = urllib.request.Request(url=url, headers=headers)
response = urllib.request.urlopen(request)
context = response.read().decode('utf-8')

# Parse the response and extract image URLs and titles
tree = etree.HTML(context)
img_src_list = tree.xpath("//div[@class='post']/a/img/@src")
img_name_list = tree.xpath("//div[@class='title']/a/text()")
print(len(img_src_list))
print(len(img_name_list))

# Create the output directory if it does not exist
if not os.path.exists('./image'):
    os.mkdir('./image')

for index in range(len(img_src_list)):
    img_src = str(img_src_list[index]).strip()
    img_name = str(img_name_list[index]).strip()
    img_path = './image/' + img_name + '.png'
    # Download the image
    urllib.request.urlretrieve(img_src, img_path)
Result: the images are downloaded into the ./image directory, each named after its title.
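Note that urllib.request.urlretrieve does not reuse the headers passed to the Request above, so sites that check the User-Agent may reject the image requests. A possible workaround (a sketch, not part of the original script) is to install a global opener before calling urlretrieve:

# Sketch: make urlretrieve send a browser-like User-Agent
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent',
                      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/102.0.5005.63 Safari/537.36')]
urllib.request.install_opener(opener)
# Subsequent urllib.request.urlretrieve(...) calls now go through this opener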