Scrapy Selectors
Practice URL: https://doc.scrapy.org/en/latest/_static/selectors-sample1.html
1. Getting text values
xpath
In [18]: response.selector.xpath('//title/text()').extract_first(default='')
Out[18]: 'Example website'
css
In [19]: response.selector.css('title::text').extract_first(default='')
Out[19]: 'Example website'
Note: response.selector.xpath() can be shortened to response.xpath().
2. Getting attribute values
xpath
In [23]: response.selector.xpath('//base/@href').extract_first()
Out[23]: 'http://example.com/'
css
In [24]: response.selector.css('base::attr(href)').extract_first()
Out[24]: 'http://example.com/'
Note: response.selector.css() can be shortened to response.css().
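3. Chaining xpath and css
Because css() and xpath() both return a SelectorList instance, the two can be chained freely.
PS: when selecting an attribute with xpath, @attr already yields the value, so there is no need to append /text().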
In [21]: response.selector.css('img').xpath('@src').extract()
Out[21]:
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']
4. .re()
.re()
.re_first()
PS: .re() returns a list of unicode strings, not a SelectorList, so further .re() calls cannot be chained onto its result.
In [1]: response.selector.css('div > p:nth-of-type(2)::text').extract()
Out[1]: ['333xxx']
In [2]: response.selector.css('div > p:nth-of-type(2)::text').extract_first()
Out[2]: '333xxx'
In [3]: response.selector.css('div > p:nth-of-type(2)::text').re_first(r'\w+')
Out[3]: '333xxx'
In [4]: response.selector.css('div > p:nth-of-type(2)::text').re_first('[A-Za-z]+')
Out[4]: 'xxx'
In [5]: response.selector.css('div > p:nth-of-type(2)::text').re('[A-Za-z]+')
Out[5]: ['xxx']
5. A note on relative paths in XPath
Goal: find the p tags under the div tag.
<html lang="zh-CN">
<head>
</head>
<body>
  <p>11</p>
  <div>
    <p>222</p>
    <p>333</p>
  </div>
</body>
</html>
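Wrong approach ('//p' is an absolute path, so it matches every p in the document, not only those inside the div):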
In [4]: divs = response.selector.xpath('//div')
In [5]: for p in divs.xpath('//p'):
   ...:     print(p.extract())
   ...:
<p>11</p>
<p>222</p>
<p>333</p>
Correct approach 1 (a leading dot makes the path relative to each div):
In [6]: divs = response.selector.css('div')
In [7]: for p in divs.xpath('.//p'):
   ...:     print(p.extract())
   ...:
<p>222</p>
<p>333</p>
Correct approach 2 (a bare 'p' selects the direct p children of each div):
In [8]: divs = response.selector.css('div')
In [9]: for p in divs.xpath('p'):
   ...:     print(p.extract())
   ...:
<p>222</p>
<p>333</p>