Simple usage of the link extractor in Scrapy
1. Environment:
macOS
Scrapy 1.3.3
conda env
2. Using dushu.com (读书网) as the example:
https://www.dushu.com/book/1107.html
2.1 Extracting links with an XPath rule
# open an interactive shell
scrapy shell https://www.dushu.com/book/1107.html

# import the link extractor
from scrapy.linkextractors import LinkExtractor

# instantiate it, passing in the XPath rule
link = LinkExtractor(restrict_xpaths=r'//div[@class="pages"]/a')
# inspect the object
link

<scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor at 0x7f8d05d42278>

# pass the response in and extract
link.extract_links(response)

# this outputs the extracted links
Note: the XPath rule handed to the extractor must select elements, not attributes:
Wrong: //div[@class="pages"]/a/@href
Right: //div[@class="pages"]/a
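What extract_links() returns are scrapy.link.Link objects, not plain strings. A minimal sketch of pulling the URLs out, runnable in the same shell session:

# each result is a scrapy.link.Link with .url and .text attributes
for l in link.extract_links(response):
    print(l.url, l.text)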
2.2 Extracting links with a CSS rule
# open an interactive shell
scrapy shell https://www.dushu.com/book/1107.html

# import the link extractor
from scrapy.linkextractors import LinkExtractor

# instantiate it, passing in the CSS rule
link = LinkExtractor(restrict_css=r'.pages > a')
# inspect the object
link

<scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor at 0x7f8d05d42278>

# pass the response in and extract
link.extract_links(response)

# this outputs the extracted links
Note: when writing the CSS rule, mind the spaces around the child combinator:
Wrong: r'.pages>a'
Right: r'.pages > a'
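Outside the shell, the same extractor can drive an ordinary spider. A hedged sketch (the spider name and callback wiring are my own illustration, not from the post):

import scrapy
from scrapy.linkextractors import LinkExtractor

class PagesSpider(scrapy.Spider):
    # hypothetical spider, for illustration only
    name = 'pages'
    start_urls = ['https://www.dushu.com/book/1107.html']

    def parse(self, response):
        extractor = LinkExtractor(restrict_css=r'.pages > a')
        for link in extractor.extract_links(response):
            # follow each pagination link back into the same callback
            yield scrapy.Request(link.url, callback=self.parse)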
2.3 Extracting links with a regular expression
# open an interactive shell
scrapy shell https://www.dushu.com/book/1107.html

# import the link extractor
from scrapy.linkextractors import LinkExtractor

# instantiate it, passing in the regex rule (note the escaped dot)
link = LinkExtractor(allow=r'/book/1107_\d+\.html')
# inspect the object
link

<scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor at 0x7f8d05d42278>

# pass the response in and extract
link.extract_links(response)

# this outputs the extracted links
Because the XPath and CSS rules cannot reach down to the href attribute, the link extractor may pick up some extra results; don't worry, though, as Scrapy deduplicates and cleans the extracted links automatically.
Personally I recommend the regex approach, paired with multi-page crawling, as sketched below.
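A minimal sketch of that setup: the allow regex plugs straight into a CrawlSpider Rule, which follows every matching pagination link. The spider name and the body of parse_item are assumptions for illustration:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class DushuSpider(CrawlSpider):
    # hypothetical spider, for illustration only
    name = 'dushu'
    allowed_domains = ['dushu.com']
    start_urls = ['https://www.dushu.com/book/1107.html']

    rules = (
        # follow every pagination link matched by the regex
        # and hand each page to parse_item
        Rule(LinkExtractor(allow=r'/book/1107_\d+\.html'),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # placeholder: extract whatever fields you need from each page
        self.logger.info('crawled %s', response.url)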