[Python crawler] Scrapy basics 5: chaining a regex after xpath and other selectors
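Suppose we want to debug a web page, for example https://g.widora.cn/
The Scrapy shell does not depend on a project environment, so there is no need to create a project first: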
scrapy shell https://g.widora.cn/
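This is similar to pressing F12 on a page in the browser: the shell lists every available object, and response is the one you will use most often.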
(earlier output omitted)
2020-05-08 21:07:18 [asyncio] DEBUG: Using selector: KqueueSelector
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x1118626d0>
[s]   item       {}
[s]   request    <GET https://g.widora.cn/>
[s]   response   <200 https://g.widora.cn/>
[s]   settings   <scrapy.settings.Settings object at 0x111bd7890>
[s]   spider     <DefaultSpider 'default' at 0x112103250>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()                     Shell help (print this help)
[s]   view(response)              View response in a browser
2020-05-08 21:07:18 [asyncio] DEBUG: Using selector: KqueueSelector
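Finding a group number: selector calls such as xpath() can be followed by .re(), but the results of extract(), get() and similar calls cannot, because they return plain strings (or lists of strings) rather than selectors.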
In [1]: response.xpath("/html/body/div/div[5]/p/a").extract()
Out[1]: ['<a target="_blank" href="//shang.qq.com/wpa/qunwpa?idkey=f65cb90612db81ef9bee771440adb40c004933a18b7c0466a279486936aedc79" src="title=" style="color:#00a1d6">G.widora.cn 群(1031687050)</a>']

In [2]: response.xpath("/html/body/div/div[5]/p/a/text()").extract()
Out[2]: ['G.widora.cn 群(1031687050)']

In [3]: response.xpath("/html/body/div/div[5]/p/a/text()")
Out[3]: [<Selector xpath='/html/body/div/div[5]/p/a/text()' data='G.widora.cn 群(1031687050)'>]

In [4]: response.xpath("/html/body/div/div[5]/p/a/text()").re('\d+')
Out[4]: ['1031687050']
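If the text has already been pulled out with extract(), the result is an ordinary list of Python strings, so the regex has to be applied with the standard re module instead. The following is a minimal sketch of that idea; the helper name find_digits is hypothetical (it is not part of Scrapy), and the input list is simply the Out[2] value from the session, copied in by hand so that the snippet runs without a network connection or a Scrapy install.

```python
import re

# The list that .extract() returned in the shell session (plain str objects,
# not Selector objects, so they have no .re() method).
extracted = ['G.widora.cn 群(1031687050)']


def find_digits(strings, pattern=r'\d+'):
    """Hypothetical helper: run a regex over each extracted string
    and collect every match into one flat list."""
    compiled = re.compile(pattern)
    matches = []
    for text in strings:
        matches.extend(compiled.findall(text))
    return matches


group_numbers = find_digits(extracted)
print(group_numbers)  # ['1031687050']
```

Note the raw string r'\d+': it keeps Python from interpreting the backslash itself, which avoids the invalid-escape warning that newer Python versions emit for a plain '\d+'.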
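Typing all of this in the terminal is tedious, so it is better to get the expression working in the browser first and then write the code.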