Scrapy notes

response.request.headers.getlist('Cookie')

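Spider naming rule: for a site at x.y.z.com, name the spider x.y.z. Spider names must be unique within a Scrapy project.
Spiders can take command-line arguments; scrapyd's schedule.json can pass them as well.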
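
A rough sketch of the scrapyd side (Python 3): schedule.json takes `project` and `spider` as POST fields, and fields other than its own reserved ones (such as `setting` or `jobid`) are handed to the spider as arguments, much like `scrapy crawl x.y.z -a category=books` on the command line. The host, the project name `myproject` and the argument `category` are made-up placeholders, and the helper `build_schedule_request` is hypothetical. The request is only built here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_schedule_request(project, spider, base_url="http://localhost:6800", **spider_args):
    """Build (but do not send) a POST request for scrapyd's schedule.json."""
    fields = {"project": project, "spider": spider}
    fields.update(spider_args)  # extra fields reach the spider as arguments
    body = urlencode(fields).encode("ascii")
    return Request(base_url + "/schedule.json", data=body, method="POST")


# "myproject" and "category" are placeholders for illustration
req = build_schedule_request("myproject", "x.y.z", category="books")
print(req.full_url)
print(req.data.decode("ascii"))
```

Inside the spider, such an argument arrives as a keyword argument to `__init__` and, by default, also becomes an attribute of the same name.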

for index, el in enumerate(items):  # get each element's index while iterating

response.xpath() and response.css() return a SelectorList. selector.re() returns a list of unicode strings.

Nested selectors
divs = response.xpath('//div')  # all <div> elements

for p in divs.xpath('.//p'):  # every <p> anywhere inside those <div>s
for p in divs.xpath('p'):  # only the <p> that are direct children of those <div>s
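
The difference between the two relative paths can be tried without Scrapy. This sketch uses the standard library's xml.etree.ElementTree as a stand-in (not Scrapy's own selectors), since its limited XPath subset treats `.//p` and `p` the same way. The sample markup is invented.

```python
import xml.etree.ElementTree as ET

# Invented markup: one <p> directly under the <div>, one nested deeper
doc = ET.fromstring(
    "<body>"
    "<div><p>direct</p><span><p>nested</p></span></div>"
    "</body>"
)

div = doc.find("div")
all_p = [p.text for p in div.findall(".//p")]  # descendants at any depth
child_p = [p.text for p in div.findall("p")]   # direct children only

print(all_p)
print(child_p)
```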

xp("//li[1]")  # every <li> that is the first <li> child of its parent
xp('(//li)[1]')  # the first <li> in the whole document
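
A small sketch of why the parentheses matter, again with the standard library's ElementTree as a stand-in. ElementTree cannot evaluate `(//li)[1]` itself, so that case is emulated by taking the first of all matches. The two-list markup is invented.

```python
import xml.etree.ElementTree as ET

# Invented markup: two lists, two items each
doc = ET.fromstring(
    "<body>"
    "<ul><li>a</li><li>b</li></ul>"
    "<ul><li>c</li><li>d</li></ul>"
    "</body>"
)

# Like //li[1]: the positional test is applied per parent
first_in_each_parent = [li.text for li in doc.findall(".//li[1]")]

# Like (//li)[1]: collect every <li> first, then take the first one
first_in_document = doc.findall(".//li")[0].text

print(first_in_each_parent)
print(first_in_document)
```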

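# For class lookups, prefer css: contains(@class, 'someclassname') is a substring test, so it also matches elements whose class merely contains someclassname inside a longer name, which widens the result set.
.css('.shout').xpath(...)  # selects the elements that have the class shout, e.g. <a class='time shout'/>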
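
To make the widening concrete, here is a plain-Python imitation of the two matching rules. It involves no Scrapy at all; both helper functions are hypothetical and only mimic the semantics.

```python
def xpath_contains_match(class_attr, name):
    """Mimic contains(@class, name): a plain substring test."""
    return name in class_attr


def css_class_match(class_attr, name):
    """Mimic the CSS selector .name: name must be one whitespace-separated token."""
    return name in class_attr.split()


for class_attr in ("time shout", "shouting", "time"):
    print(
        class_attr,
        xpath_contains_match(class_attr, "shout"),
        css_class_match(class_attr, "shout"),
    )
```

If XPath has to be used anyway, the usual idiom is contains(concat(' ', normalize-space(@class), ' '), ' shout '), which matches whole class tokens only.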

Both xpath() and css() return a SelectorList, which itself supports xpath(), css(), extract() and so on, so calls chain conveniently.

Get all the text content inside an HTML node

>>> sel.xpath("//a[1]").extract() # select the first node
[u'<a href="#">Click here to go to the <strong>Next Page</strong></a>']
>>> sel.xpath("string(//a[1])").extract() # convert it to string
[u'Click here to go to the Next Page']

Get the project settings

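import scrapy
from scrapy.utils.project import get_project_settings
...
class YourSpider(scrapy.Spider):  # BaseSpider is the old, deprecated name for scrapy.Spider
    ...
    def parse(self, response):
        ...
        settings = get_project_settings()
        # print() with a single argument works on both Python 2 and 3
        print("Your USER_AGENT is:\n%s" % settings.get('USER_AGENT'))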

To turn off a downloader middleware that the framework enables by default, set it to None in DOWNLOADER_MIDDLEWARES.
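
For example, a settings.py fragment along these lines would switch off the built-in user-agent middleware. The path shown is the one used by Scrapy 1.0 and later; older releases used a scrapy.contrib path, so treat the exact key as an assumption to check against the installed version.

```python
# settings.py (fragment)
DOWNLOADER_MIDDLEWARES = {
    # None disables a middleware that Scrapy enables by default
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
}
```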

An HTTP proxy is set either through Request.meta (the 'proxy' key, valid per request) or through the http_proxy environment variable, which applies globally.
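
A minimal sketch of both routes (Python 3). The proxy address is a placeholder. As far as I know, Scrapy's HttpProxyMiddleware reads the environment through the same urllib getproxies() helper used below, but that is an assumption worth checking against the installed version.

```python
import os
from urllib.request import getproxies

PROXY_URL = "http://127.0.0.1:8080"  # placeholder proxy address

# Per request: pass this dict as scrapy.Request(url, meta=per_request_meta)
per_request_meta = {"proxy": PROXY_URL}

# Globally: set the environment variable before the crawler starts
os.environ["http_proxy"] = PROXY_URL

# urllib now reports the proxy for the http scheme
print(getproxies().get("http"))
```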

posted @ 2016-02-24 11:59 怎么也得过啊
