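Scrapy Spiders
Official documentation: https://docs.scrapy.org/en/latest/topics/spiders.html
In one sentence: a spider is where you define the crawling behavior (for example, whether to follow new links) and how to parse the page structure (extract data and return items).
I scrapy.Spider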
1 name
2 allowed_domains <-----------------------> OffsiteMiddleware
3 start_urls <-----------------------> start_requests()
4 custom_settings <-----------------------> Built-in settings reference
It must be defined as a class attribute since the settings are updated before instantiation.
import scrapy


class BaiduSpider(scrapy.Spider):
    name = 'baidu'
    # allowed_domains takes bare domain names, not full URLs
    allowed_domains = ['baidu.com']
    start_urls = ['https://www.baidu.com/']
    # Keys must be Scrapy setting names, so the user agent is set via USER_AGENT
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
    }

    def parse(self, response):
        pass
5 crawler <----------> from_crawler()
6 settings
7 logger
8 from_crawler(crawler, *args, **kwargs)
This is the class method used by Scrapy to create your spiders.
9 start_requests()
It is called by Scrapy when the spider is opened for scraping.
Core code:
for url in self.start_urls:
    yield Request(url, dont_filter=True)
A note on Request. The following is from the Request source code.
class Request(object_ref):
    def __init__(self, url, callback=None, method='GET', headers=None, body=None,
                 cookies=None, meta=None, encoding='utf-8', priority=0,
                 dont_filter=False, errback=None, flags=None):
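As the source shows, Request uses GET by default. To send a POST request, override start_requests(). This involves the Request class and its subclass FormRequest.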
import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        # FormRequest sends a POST request when formdata is given
        return [scrapy.FormRequest("http://www.example.com/login",
                                   formdata={'user': 'john', 'pass': 'secret'},
                                   callback=self.logged_in)]

    def logged_in(self, response):
        # here you would extract links to follow and return Requests for
        # each of them, with another callback
        pass
10 parse(response)
This method, as well as any other Request callback, must return an iterable of Request and/or dicts or Item objects.
11 log(message[, level, component])
12 closed(reason)
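II Spider arguments
Passed on the command line with the -a option of the crawl command, e.g. scrapy crawl myspider -a category=electronics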
III Generic Spiders
1 CrawlSpider
Recommended.
It adds rules, which simplify following links.
2 XMLFeedSpider
3 CSVFeedSpider
4 SitemapSpider