scrapy Selector

https://docs.scrapy.org/en/latest/topics/selectors.html

Basic usage

Full form, going through response.selector:
>>> response.selector.xpath('//span/text()').get()
'good'

Shortcut form, calling xpath() or css() directly on the response:
>>> response.xpath('//span/text()').get()
'good'
>>> response.css('span::text').get()
'good'

Parsing from a text string:

>>> from scrapy.selector import Selector
>>> body = '<html><body><span>good</span></body></html>'
>>> Selector(text=body).xpath('//span/text()').get()
'good'

Parsing from a response object:

>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse
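>>> # body is a str here, so an encoding must be given; without it HtmlResponse raises TypeError
>>> response = HtmlResponse(url='http://example.com', body=body, encoding='utf-8')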
>>> Selector(response=response).xpath('//span/text()').get()
'good'

Extracting text

>>> response.xpath('//title/text()').getall()
['Example website']
>>> response.xpath('//title/text()').get()
'Example website'

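.get() always returns a single result: the first match if there are several, or None if nothing matches.

  get(default="...") sets a fallback value that is returned instead of None.

  In older versions, extract_first() was used to get the first result.

.getall() returns a list with all results (an empty list if nothing matches).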

Extracting attributes

1. With XPath's @attribute syntax  =>  response.xpath("//a/@href").getall()
2. With the .attrib property (attributes of the first matched element)  =>  response.css('img').attrib['src']

Regular expressions

>>> response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
['My image 1',
 'My image 2',
 'My image 3',
 'My image 4',
 'My image 5']

Nesting selectors

>>> from scrapy import Selector
>>> sel = Selector(text='<div class="hero shout"><time datetime="2014-07-23 19:00">Special date</time></div>')
>>> sel.css('.shout').xpath('./time/@datetime').getall()
['2014-07-23 19:00']

Using variables in XPath expressions

>>> # `$val` used in the expression, a `val` argument needs to be passed
>>> response.xpath('//div[@id=$val]/a/text()', val='images').get()
'Name: My image 1 '

Removing namespaces

Example site:
$ scrapy shell https://feeds.feedburner.com/PythonInside


<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet ...
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:openSearch="http://a9.com/-/spec/opensearchrss/1.0/"
      xmlns:blogger="http://schemas.google.com/blogger/2008"
      xmlns:georss="http://www.georss.org/georss"
      xmlns:gd="http://schemas.google.com/g/2005"
      xmlns:thr="http://purl.org/syndication/thread/1.0"
      xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">
  ...


By default, the elements sit in the Atom namespace, so a bare element name matches nothing:
>>> response.xpath("//link")
[]

After removing namespaces:
>>> response.selector.remove_namespaces()
>>> response.xpath("//link")
[<Selector xpath='//link' data='<link rel="alternate" type="text/html" h'>,
    <Selector xpath='//link' data='<link rel="next" type="application/atom+'>,
    ...

posted @ 2020-10-27 09:01  xt12321