XPath Tutorial

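1. Installing the XPath parsing library: the Selector in Scrapy

Press Win+R, open cmd, and run pip install wheel. The wheel package is installed first so that .whl files can be installed afterwards.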

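Install lxml

Download the lxml build that matches your Python version from https://pypi.org/project/lxml/#files

Change to the directory where lxml was downloaded and install it with pip (pip install followed by the downloaded file's name).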

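Install Twisted

Download the Twisted build that matches your Python version from https://pypi.org/project/Twisted/#files

Change to the directory where Twisted was downloaded and install it the same way.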

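Install Scrapy

Download the Scrapy build that matches your Python version from https://pypi.org/project/Scrapy/#files

Once installation is complete, switch the PyCharm interpreter to the python_spider virtual environment created earlier.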
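To confirm that the packages ended up in the environment PyCharm is using, a quick check along the following lines may help. This is only a sketch: the helper name installed_versions is made up for illustration, and it relies on the standard library's importlib.metadata (Python 3.8+), not on anything Scrapy provides.

```python
from importlib import metadata


def installed_versions(packages=("wheel", "lxml", "Twisted", "Scrapy")):
    """Return {package name: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not visible to this interpreter
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Run it with the interpreter of the python_spider environment; any package reported as not installed was most likely installed into a different environment.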

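2. XPath

XPath uses path expressions to navigate XML and HTML documents. It comes with a standard function library and is a W3C standard.

Node relationships in XPath: (1) parent, (2) children, (3) siblings, (4) ancestors, (5) descendants.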
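The five relationships can be made concrete on a small tree. The sketch below uses the standard library's xml.etree.ElementTree instead of Scrapy's Selector, and a simplified, well-formed fragment modelled on the sample page used later in this tutorial; the English text in the fragment is an assumption made for readability.

```python
import xml.etree.ElementTree as ET

# Simplified fragment (illustrative, not the tutorial's full HTML)
doc = ET.fromstring(
    '<div id="info">'
    '<div class="teacher_info">'
    '<p class="age">age: 29</p>'
    '<p class="name">name: bobby</p>'
    '<p class="position">position: python developer</p>'
    '</div>'
    '</div>'
)

# ElementTree elements do not store their parent, so build a child -> parent map
parent_of = {child: parent for parent in doc.iter() for child in parent}

name = doc.find(".//p[@class='name']")

parent = parent_of[name]                              # (1) parent
children = list(parent)                               # (2) children of that parent
siblings = [p for p in children if p is not name]     # (3) siblings of <p class="name">

ancestors = []                                        # (4) ancestors, nearest first
node = name
while node in parent_of:
    node = parent_of[node]
    ancestors.append(node)

descendants = [el for el in doc.iter() if el is not doc]  # (5) descendants of the root

print(parent.get("class"))
print([s.get("class") for s in siblings])
print([a.tag for a in ancestors])
print(len(descendants))
```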

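XPath syntax

The same element can usually be reached by several different XPath expressions, and XPath can extract values (text or attribute values) directly.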
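As a small illustration of several expressions reaching the same element, the sketch below again uses xml.etree.ElementTree, which supports only a subset of XPath and requires relative paths starting with a dot. The fragment is a simplified stand-in for the sample page, not the tutorial's actual HTML.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for the sample page (illustrative only)
doc = ET.fromstring(
    '<body>'
    '<div id="info">'
    '<p>teacher info</p>'
    '<div class="teacher_info">'
    '<p class="age">age: 29</p>'
    '<p class="name">name: bobby</p>'
    '</div>'
    '</div>'
    '</body>'
)

# Four different paths that all lead to <p class="age">
paths = [
    ".//*[@id='info']/div/p[1]",           # by id, then position
    ".//div[@class='teacher_info']/p[1]",  # by the parent's class, then position
    ".//p[@class='age']",                  # by the element's own class
    "./div/div/p[1]",                      # purely structural path from the root
]

matches = [doc.find(path) for path in paths]
texts = [el.text for el in matches]
same_element = all(el is matches[0] for el in matches)

print(texts)
print(same_element)
```

Paths that rely on an id or class tend to survive layout changes better than purely positional ones.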

from scrapy import Selector
 
html="""
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>bobby基本信息</title>
    <script src="jquery-3.5.1.min.js"></script>
</head>
<body>
    <div id="info">
        <p style="color: blue">讲师信息</p>
        <div class="teacher_info">
            Python全栈工程师
            <p class="age">年龄:29</p>
            <p class="name bobbyname" data-bind="bobby">姓名:bobby</p>
            <p class="work_years">工作年限:7年</p>
            <p class="position">职位:python开发工程师</p>
        </div>
        <p style="color:aquamarine">课程信息</p>
        <table class="courses">
            <tbody><tr><th>课程名称</th>
            <th>讲师</th>
            <th>地址</th>
        </tr><tr>
                <td>django打造在线教育</td>
                <td>bobby</td>
                <td><a href="https://coding.imooc.com/class/78.html">访问</a></td>
            </tr><tr>
                <td>python高级编程</td>
                <td>bobby</td>
                <td><a href="https://coding.imooc.com/class/200.html">访问</a></td>
            </tr><tr>
                <td>scrapy分布式爬虫</td>
                <td>bobby</td>
                <td><a href="https://coding.imooc.com/class/92.html">访问</a></td>
            </tr><tr>
                <td>diango rest framework打造生鲜电商</td>
                <td>bobby</td>
                <td><a href="https://coding.imooc.com/class/131.html">访问</a></td>
            </tr><tr>
                <td>tornado从入门到精通</td>
                <td>bobby</td>
                <td><a href="https://coding.imooc.com/class/290.html">访问</a></td>
            </tr></tbody></table>
 
</div>
</body>
</html>
"""
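# Build a Selector from the HTML string
sel = Selector(text=html)
print(sel)
# Select the first <p> inside the div under #info (returns a SelectorList)
tag1 = sel.xpath("//*[@id='info']/div/p[1]")
print("tag1:" + str(tag1))
# Extract the text of that <p>; extract_first() returns None when nothing
# matches, whereas extract()[0] would raise IndexError
tag2 = sel.xpath("//*[@id='info']/div[1]/p[1]/text()").extract_first()
if tag2:
    print("tag2:" + str(tag2))
# Get the text of the first <p> in the first nested div
tag3 = sel.xpath("//div[1]/div[1]/p[1]/text()").extract_first()
print("tag3:" + str(tag3))
# Same element, without the position predicate on the inner div
tag4 = sel.xpath("//div[1]/div/p[1]/text()").extract_first()
print("tag4:" + str(tag4))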

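Output (last three lines only; the first two lines print Selector representations whose exact format depends on the Scrapy version):

tag2:年龄:29
tag3:年龄:29
tag4:年龄:29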

Printing the age (年龄:29), guarding against an empty result

from scrapy import Selector
 
html="""
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>bobby基本信息</title>
    <script src="jquery-3.5.1.min.js"></script>
</head>
<body>
    <div id="info">
        <p style="color: blue">讲师信息</p>
        <div class="teacher_info">
            Python全栈工程师
            <p class="age">年龄:29</p>
            <p class="name bobbyname" data-bind="bobby">姓名:bobby</p>
            <p class="work_years">工作年限:7年</p>
            <p class="position">职位:python开发工程师</p>
        </div>
        <p style="color:aquamarine">课程信息</p>
        <table class="courses">
            <tbody><tr><th>课程名称</th>
            <th>讲师</th>
            <th>地址</th>
        </tr><tr>
                <td>django打造在线教育</td>
                <td>bobby</td>
                <td><a href="https://coding.imooc.com/class/78.html">访问</a></td>
            </tr><tr>
                <td>python高级编程</td>
                <td>bobby</td>
                <td><a href="https://coding.imooc.com/class/200.html">访问</a></td>
            </tr><tr>
                <td>scrapy分布式爬虫</td>
                <td>bobby</td>
                <td><a href="https://coding.imooc.com/class/92.html">访问</a></td>
            </tr><tr>
                <td>diango rest framework打造生鲜电商</td>
                <td>bobby</td>
                <td><a href="https://coding.imooc.com/class/131.html">访问</a></td>
            </tr><tr>
                <td>tornado从入门到精通</td>
                <td>bobby</td>
                <td><a href="https://coding.imooc.com/class/290.html">访问</a></td>
            </tr></tbody></table>
 
</div>
</body>
</html>
"""
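# Build a Selector from the HTML string
sel = Selector(text=html)
name_xpath = "//div[1]/div/p[1]/text()"
name = ""
# extract() returns a list of strings; the list is empty when nothing matches
tag_texts = sel.xpath(name_xpath).extract()
if tag_texts:
    name = tag_texts[0]
print(name)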

Output:

年龄:29

Finding a value by class attribute

from scrapy import Selector
 
html="""
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>bobby基本信息</title>
    <script src="jquery-3.5.1.min.js"></script>
</head>
<body>
    <div id="info">
        <p style="color: blue">讲师信息</p>
        <div class="teacher_info">
            Python全栈工程师
            <p class="age">年龄:29</p>
            <p class="name bobbyname" data-bind="bobby">姓名:bobby</p>
            <p class="work_years">工作年限:7年</p>
            <p class="position">职位:python开发工程师</p>
        </div>
        <p style="color:aquamarine">课程信息</p>
        <table class="courses">
            <tbody><tr><th>课程名称</th>
            <th>讲师</th>
            <th>地址</th>
        </tr><tr>
                <td>django打造在线教育</td>
                <td>bobby</td>
                <td><a href="https://coding.imooc.com/class/78.html">访问</a></td>
            </tr><tr>
                <td>python高级编程</td>
                <td>bobby</td>
                <td><a href="https://coding.imooc.com/class/200.html">访问</a></td>
            </tr><tr>
                <td>scrapy分布式爬虫</td>
                <td>bobby</td>
                <td><a href="https://coding.imooc.com/class/92.html">访问</a></td>
            </tr><tr>
                <td>diango rest framework打造生鲜电商</td>
                <td>bobby</td>
                <td><a href="https://coding.imooc.com/class/131.html">访问</a></td>
            </tr><tr>
                <td>tornado从入门到精通</td>
                <td>bobby</td>
                <td><a href="https://coding.imooc.com/class/290.html">访问</a></td>
            </tr></tbody></table>
 
</div>
</body>
</html>
"""
# Build a Selector from the HTML string
sel = Selector(text=html)
# Second <p> inside the div whose class is exactly "teacher_info"
teacher_tag = sel.xpath('//div[@class="teacher_info"]/p[2]').extract()
print(teacher_tag)

Output:

['<p class="name bobbyname" data-bind="bobby">姓名:bobby</p>']

If a tag carries several classes, use the contains() function to match one of them

from scrapy import Selector
 
html="""
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>bobby基本信息</title>
    <script src="jquery-3.5.1.min.js"></script>
</head>
<body>
    <div id="info">
        <p style="color: blue">讲师信息</p>
        <div class="teacher_info">
            Python全栈工程师
            <p class="age">年龄:29</p>
            <p class="name bobbyname" data-bind="bobby">姓名:bobby</p>
            <p class="work_years">工作年限:7年</p>
            <p class="position">职位:python开发工程师</p>
        </div>
        <p style="color:aquamarine">课程信息</p>
        <table class="courses">
            <tbody><tr><th>课程名称</th>
            <th>讲师</th>
            <th>地址</th>
        </tr><tr>
                <td>django打造在线教育</td>
                <td>bobby</td>
                <td><a href="https://coding.imooc.com/class/78.html">访问</a></td>
            </tr><tr>
                <td>python高级编程</td>
                <td>bobby</td>
                <td><a href="https://coding.imooc.com/class/200.html">访问</a></td>
            </tr><tr>
                <td>scrapy分布式爬虫</td>
                <td>bobby</td>
                <td><a href="https://coding.imooc.com/class/92.html">访问</a></td>
            </tr><tr>
                <td>diango rest framework打造生鲜电商</td>
                <td>bobby</td>
                <td><a href="https://coding.imooc.com/class/131.html">访问</a></td>
            </tr><tr>
                <td>tornado从入门到精通</td>
                <td>bobby</td>
                <td><a href="https://coding.imooc.com/class/290.html">访问</a></td>
            </tr></tbody></table>
 
</div>
</body>
</html>
"""
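# Build a Selector from the HTML string
sel = Selector(text=html)
# Match any <p> whose class attribute contains the substring "bobbyname"
teacher_tag = sel.xpath('//p[contains(@class,"bobbyname")]').extract_first()
print(teacher_tag)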

Output:

<p class="name bobbyname" data-bind="bobby">姓名:bobby</p>
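One caveat: contains(@class, "...") is a plain substring test, so contains(@class, "bobby") would also match class="name bobbyname". A commonly used XPath 1.0 idiom for matching a whole class token is contains(concat(' ', normalize-space(@class), ' '), ' name '). The sketch below does not run any XPath; it only mimics the two string tests in plain Python to show where they differ, and both function names are made up for illustration.

```python
def contains_substring(class_attr: str, needle: str) -> bool:
    """Mimics XPath contains(@class, needle): a plain substring test."""
    return needle in class_attr


def has_class_token(class_attr: str, token: str) -> bool:
    """Mimics contains(concat(' ', normalize-space(@class), ' '), ' token ')."""
    normalized = " ".join(class_attr.split())  # roughly what normalize-space() does
    return f" {token} " in f" {normalized} "


class_attr = "name bobbyname"  # class attribute of the <p> in the sample HTML

for probe in ("bobbyname", "name", "bobby"):
    print(probe, contains_substring(class_attr, probe), has_class_token(class_attr, probe))
```

Only the last probe differs: "bobby" is a substring of the attribute but not one of its class tokens.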

Many other built-in functions similar to contains() are listed on this site:

https://developer.mozilla.org/en-US/docs/Web/XPath/Functions

Using the last() function to get the value of the last element

from scrapy import Selector
 
html="""
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>bobby基本信息</title>
    <script src="jquery-3.5.1.min.js"></script>
</head>
<body>
    <div id="info">
        <p style="color: blue">讲师信息</p>
        <div class="teacher_info">
            Python全栈工程师
            <p class="age">年龄:29</p>
            <p class="name bobbyname" data-bind="bobby">姓名:bobby</p>
            <p class="work_years">工作年限:7年</p>
            <p class="position">职位:python开发工程师</p>
        </div>
        <p style="color:aquamarine">课程信息</p>
        <table class="courses">
            <tbody><tr><th>课程名称</th>
            <th>讲师</th>
            <th>地址</th>
        </tr><tr>
                <td>django打造在线教育</td>
                <td>bobby</td>
                <td><a href="https://coding.imooc.com/class/78.html">访问</a></td>
            </tr><tr>
                <td>python高级编程</td>
                <td>bobby</td>
                <td><a href="https://coding.imooc.com/class/200.html">访问</a></td>
            </tr><tr>
                <td>scrapy分布式爬虫</td>
                <td>bobby</td>
                <td><a href="https://coding.imooc.com/class/92.html">访问</a></td>
            </tr><tr>
                <td>diango rest framework打造生鲜电商</td>
                <td>bobby</td>
                <td><a href="https://coding.imooc.com/class/131.html">访问</a></td>
            </tr><tr>
                <td>tornado从入门到精通</td>
                <td>bobby</td>
                <td><a href="https://coding.imooc.com/class/290.html">访问</a></td>
            </tr></tbody></table>
 
</div>
</body>
</html>
"""
# Build a Selector from the HTML string
sel = Selector(text=html)
# Text of the last <p> in the teacher div
info1 = sel.xpath('//div[contains(@class,"teacher")]/p[last()]/text()').extract_first()
print(info1)
# Text of the second-to-last <p>
info2 = sel.xpath('//div[contains(@class,"teacher")]/p[last()-1]/text()').extract_first()
print(info2)

Output:

职位:python开发工程师
工作年限:7年
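XPath positions are 1-based, while Python lists are 0-based, which is easy to mix up. The sketch below lines the two up using xml.etree.ElementTree (whose XPath subset supports p[1], p[last()] and p[last()-1]) on a simplified fragment with English placeholder text; it is not the Scrapy code from this tutorial.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the teacher_info block (illustrative only)
div = ET.fromstring(
    '<div class="teacher_info">'
    '<p class="age">age: 29</p>'
    '<p class="name">name: bobby</p>'
    '<p class="work_years">work years: 7</p>'
    '<p class="position">position: python developer</p>'
    '</div>'
)

paragraphs = div.findall("p")        # Python list, indexed from 0

first = div.find("p[1]")             # XPath position 1 == paragraphs[0]
last = div.find("p[last()]")         # last()          == paragraphs[-1]
second_last = div.find("p[last()-1]")  # last()-1      == paragraphs[-2]

print(first.get("class"), last.get("class"), second_last.get("class"))
```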

Getting the value of the class attribute

from scrapy import Selector
 
html="""
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>bobby基本信息</title>
    <script src="jquery-3.5.1.min.js"></script>
</head>
<body>
    <div id="info">
        <p style="color: blue">讲师信息</p>
        <div class="teacher_info">
            Python全栈工程师
            <p class="age">年龄:29</p>
            <p class="name bobbyname" data-bind="bobby">姓名:bobby</p>
            <p class="work_years">工作年限:7年</p>
            <p class="position">职位:python开发工程师</p>
        </div>
        <p style="color:aquamarine">课程信息</p>
        <table class="courses">
            <tbody><tr><th>课程名称</th>
            <th>讲师</th>
            <th>地址</th>
        </tr><tr>
                <td>django打造在线教育</td>
                <td>bobby</td>
                <td><a href="https://coding.imooc.com/class/78.html">访问</a></td>
            </tr><tr>
                <td>python高级编程</td>
                <td>bobby</td>
                <td><a href="https://coding.imooc.com/class/200.html">访问</a></td>
            </tr><tr>
                <td>scrapy分布式爬虫</td>
                <td>bobby</td>
                <td><a href="https://coding.imooc.com/class/92.html">访问</a></td>
            </tr><tr>
                <td>diango rest framework打造生鲜电商</td>
                <td>bobby</td>
                <td><a href="https://coding.imooc.com/class/131.html">访问</a></td>
            </tr><tr>
                <td>tornado从入门到精通</td>
                <td>bobby</td>
                <td><a href="https://coding.imooc.com/class/290.html">访问</a></td>
            </tr></tbody></table>
 
</div>
</body>
</html>
"""
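# Build a Selector from the HTML string
sel = Selector(text=html)
# /@class returns the attribute value itself rather than the element
class_value = sel.xpath('//div[contains(@class,"teacher")]/p[last()-1]/@class').extract_first()
print(class_value)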

Output:

work_years

Selecting two elements at once with the union operator (|)

from scrapy import Selector
 
html="""
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>bobby基本信息</title>
    <script src="jquery-3.5.1.min.js"></script>
</head>
<body>
    <div id="info">
        <p style="color: blue">讲师信息</p>
        <div class="teacher_info">
            Python全栈工程师
            <p class="age">年龄:29</p>
            <p class="name bobbyname" data-bind="bobby">姓名:bobby</p>
            <p class="work_years">工作年限:7年</p>
            <p class="position">职位:python开发工程师</p>
        </div>
        <p style="color:aquamarine">课程信息</p>
        <table class="courses">
            <tbody><tr><th>课程名称</th>
            <th>讲师</th>
            <th>地址</th>
        </tr><tr>
                <td>django打造在线教育</td>
                <td>bobby</td>
                <td><a href="https://coding.imooc.com/class/78.html">访问</a></td>
            </tr><tr>
                <td>python高级编程</td>
                <td>bobby</td>
                <td><a href="https://coding.imooc.com/class/200.html">访问</a></td>
            </tr><tr>
                <td>scrapy分布式爬虫</td>
                <td>bobby</td>
                <td><a href="https://coding.imooc.com/class/92.html">访问</a></td>
            </tr><tr>
                <td>diango rest framework打造生鲜电商</td>
                <td>bobby</td>
                <td><a href="https://coding.imooc.com/class/131.html">访问</a></td>
            </tr><tr>
                <td>tornado从入门到精通</td>
                <td>bobby</td>
                <td><a href="https://coding.imooc.com/class/290.html">访问</a></td>
            </tr></tbody></table>
 
</div>
</body>
</html>
"""
# Build a Selector from the HTML string
sel = Selector(text=html)
# The | operator returns the union of both node sets, in document order
class_value = sel.xpath('//p[@class="work_years"]|//p[@class="position"]').extract()
print(class_value)

Output:

['<p class="work_years">工作年限:7年</p>', '<p class="position">职位:python开发工程师</p>']

Posted by leagueandlegends