Learning Python Web Scraping Step by Step (Parsing with the XPath Module)

Parsing with XPath

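       XPath is a language for locating information in XML documents; it can be used to traverse the elements and attributes of an XML document. HTML is not strictly a subset of XML, but it has the same tree structure of elements and attributes, and lxml can parse an HTML page into such a tree. XPath can therefore be used to look up content in HTML as well.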

I. Installing the lxml module

       pip install lxml

       Usage: 1. Build an etree object from the HTML content to be parsed.

                  2. Call the etree object's xpath method with an XPath expression to extract the data.
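
The two steps above can be sketched roughly as follows. The HTML snippet and its link targets are made up for illustration; the `<li>` tags are left unclosed on purpose to show that `etree.HTML` tolerates markup that the strict `etree.XML` parser rejects.

```python
from lxml import etree

# Made-up, deliberately sloppy HTML: the <li> tags are never closed.
html_text = '''
<ul>
    <li><a href="/a">first</a>
    <li><a href="/b">second</a>
</ul>
'''

# Step 1: build an etree object (the HTML parser repairs the markup).
tree = etree.HTML(html_text)

# Step 2: call xpath() with an XPath expression to extract data.
texts = tree.xpath('//li/a/text()')
hrefs = tree.xpath('//li/a/@href')
print(texts, hrefs)

# The strict XML parser refuses the same input.
try:
    etree.XML(html_text)
    strict_ok = True
except etree.XMLSyntaxError:
    strict_ok = False
print(strict_ok)
```

For real web pages, `etree.HTML` (or `etree.HTMLParser`) is usually the safer choice, since pages in the wild are rarely well-formed XML.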

A simple example:

from lxml import etree

xml='''
<book>
    <id>1</id>
    <name>野花遍地香</name>
    <price>1.23</price>
    <nick>臭豆腐</nick>
    <author>
        <nick id="10086">周大强</nick>
        <nick id="10010">周芷若</nick>
        <nick class="joy">周杰伦</nick>
        <nick class="jolin">蔡依林</nick>
        <div>
            <nick>热了</nick>
        </div>
        <span>
            <nick>热了哦</nick>
        </span>
    </author>
    
    <partner>
    
        <nick id="ppc">胖胖陈</nick>
        <nick id="ppbc">胖胖不陈</nick>
    </partner>
</book>
'''

tree = etree.XML(xml)
res = tree.xpath('/book/name/text()')  # text() returns the text content
print(res)
# ['野花遍地香']
res = tree.xpath('/book/author/nick/text()')  # / selects direct children only
print(res)
# ['周大强', '周芷若', '周杰伦', '蔡依林']
res = tree.xpath('/book/author//nick/text()')  # // selects all descendants
print(res)
# ['周大强', '周芷若', '周杰伦', '蔡依林', '热了', '热了哦']
res = tree.xpath('/book/author/*/nick/text()')  # * matches any single element
print(res)
# ['热了', '热了哦']
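
A few more expressions that often come up alongside the ones above, shown as a small sketch on a shortened copy of the same XML (the shortened document is an assumption made to keep the example self-contained): `//` at the start of a path searches the whole document, `[@id]` keeps only elements that have the attribute, and `count()` returns a number instead of a list.

```python
from lxml import etree

# Shortened version of the book XML, for illustration only.
xml = '''
<book>
    <author>
        <nick id="10086">周大强</nick>
        <nick class="joy">周杰伦</nick>
        <div><nick>热了</nick></div>
    </author>
    <partner>
        <nick id="ppc">胖胖陈</nick>
    </partner>
</book>
'''
tree = etree.XML(xml)

all_nicks = tree.xpath('//nick/text()')            # every nick, at any depth
joy = tree.xpath('//nick[@class="joy"]/text()')    # filter by attribute value
with_id = tree.xpath('//nick[@id]/text()')         # only nicks that have an id
ids = tree.xpath('//nick/@id')                     # the id values themselves
total = tree.xpath('count(//nick)')                # XPath numbers come back as float

print(all_nicks)
# ['周大强', '周杰伦', '热了', '胖胖陈']
print(joy)
# ['周杰伦']
print(with_id)
# ['周大强', '胖胖陈']
print(ids)
# ['10086', 'ppc']
print(total)
# 4.0
```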

  

Example 2:

Suppose we have an HTML file named 1.html:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8" />
    <title>Title</title>
</head>
<body>
    <ul>
        <li><a href="http://www.baidu.com">百度</a></li>
        <li><a href="http://www.google.com">谷歌</a></li>
        <li><a href="http://www.sogou.com">搜狗</a></li>
    </ul>
    <ol>
        <li><a href="feiji">飞机</a></li>
        <li><a href="dapao">大炮</a></li>
        <li><a href="huoche">火车</a></li>
    </ol>
    <div class="job">李嘉诚</div>
    <div class="common">胡辣汤</div>
</body>
</html>

  

It can be parsed as follows:

from lxml import etree

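# Use the HTML parser so that pages that are not well-formed XML still load
tree = etree.parse('1.html', etree.HTMLParser(encoding='utf-8'))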
result = tree.xpath('/html/body/ul/li/a/text()')
print(result)
# ['百度', '谷歌', '搜狗']
result = tree.xpath('/html/body/ul/li[2]/a/text()')  # XPath positions start at 1, not 0
print(result)
# ['谷歌']
result = tree.xpath('/html/body/ol/li/a[@href="dapao"]/text()')  # [@attr="value"] filters by attribute
print(result)
# ['大炮']

ol_li_list = tree.xpath('/html/body/ol/li')
for li in ol_li_list:
    res = li.xpath('./a/text()')  # relative lookup: keep searching inside this li
    print(res)
    res2 = li.xpath('./a/@href')  # @attr returns the attribute value
    print(res2)
# Output, one pair per li:
# ['飞机']
# ['feiji']
# ['大炮']
# ['dapao']
# ['火车']
# ['huoche']

print(tree.xpath('/html/body/ul/li/a/@href'))
# ['http://www.baidu.com', 'http://www.google.com', 'http://www.sogou.com']
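
Building on the relative lookups above, here is a small sketch that pairs each link's text with its href, and also shows `last()`, `contains()`, and the difference between XPath's 1-based positions and Python's 0-based list indexes. To stay self-contained it parses an inline string with `etree.HTML` instead of reading 1.html; the string only reproduces the `<ul>` part of that file.

```python
from lxml import etree

# Inline copy of the <ul> from 1.html, so the sketch runs without the file.
html_text = '''
<html><body>
    <ul>
        <li><a href="http://www.baidu.com">百度</a></li>
        <li><a href="http://www.google.com">谷歌</a></li>
        <li><a href="http://www.sogou.com">搜狗</a></li>
    </ul>
</body></html>
'''
tree = etree.HTML(html_text)

# Pair text and href per <a>, so the two values can never get out of step.
links = []
for a in tree.xpath('//ul/li/a'):
    text = a.xpath('./text()')
    href = a.xpath('./@href')
    links.append((text[0], href[0]))

last_link = tree.xpath('//ul/li[last()]/a/text()')           # last li
sogou = tree.xpath('//a[contains(@href, "sogou")]/text()')   # substring match

# li[2] in XPath is the same element as index 1 in the Python list.
lis = tree.xpath('//ul/li')
second_by_xpath = tree.xpath('//ul/li[2]/a/text()')
second_by_python = lis[1].xpath('./a/text()')

print(links)
print(last_link, sogou)
print(second_by_xpath == second_by_python)
# True
```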

  

Example 3: Scraping listings from zbj.com (猪八戒网)

import requests
from lxml import etree

url = 'https://beijing.zbj.com/search/f/?type=new&kw=前端开发'

resp = requests.get(url)
# Parse the response
html = etree.HTML(resp.text)
divs = html.xpath('/html/body/div[6]/div/div/div[2]/div[4]/div[1]/div')
# Each div holds one service provider's listing
for div in divs:
    price = div.xpath("./div/div/a/div[2]/div[1]/span[1]/text()")
    title = div.xpath("./div/div/a/div[2]/div[2]/p/text()")
    print(price, title)
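
Since `xpath()` always returns a list, `price` and `title` above print as lists, and a listing with a missing field simply yields `[]`. One possible way to handle this is a small helper that takes the first match or falls back to a default. This is only a sketch: `first_text` is a hypothetical helper, and the HTML below, including its class names, is invented for illustration and is not zbj.com's real markup.

```python
from lxml import etree


def first_text(node, path, default=''):
    """Return the first stripped string matched by `path`, or `default`."""
    matches = node.xpath(path)
    return matches[0].strip() if matches else default


# Invented markup standing in for a results page; the second item has no price.
html_text = '''
<div id="results">
    <div class="item"><span class="price"> ¥500 </span><p class="title">Front-end page</p></div>
    <div class="item"><p class="title">No price listed</p></div>
</div>
'''
page = etree.HTML(html_text)

rows = []
for item in page.xpath('//div[@class="item"]'):
    price = first_text(item, './span[@class="price"]/text()', default='N/A')
    title = first_text(item, './p[@class="title"]/text()')
    rows.append((price, title))

print(rows)
# [('¥500', 'Front-end page'), ('N/A', 'No price listed')]
```

Selecting by attribute, as in this sketch, tends to survive layout changes better than long positional paths such as `div[6]/div/div/div[2]`, which stop matching as soon as the page inserts or removes a wrapper element.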

  

posted @ 2021-04-02 18:26  wangshanglinju