Scrapy Framework Architecture
1. The Engine opens a domain, locates the Spider that handles that domain, and asks the spider for the first URLs to crawl.
2. The Engine gets the first URLs to crawl from the Spider and schedules them in the Scheduler, as Requests.
3. The Engine asks the Scheduler for the next URLs to crawl.
4. The Scheduler returns the next URLs to crawl to the Engine and the Engine sends them to the Downloader, passing through the Downloader Middleware (request direction).
5. Once the page finishes downloading the Downloader generates a Response (with that page) and sends it to the Engine, passing through the Downloader Middleware (response direction).
6. The Engine receives the Response from the Downloader and sends it to the Spider for processing, passing through the Spider Middleware (input direction).
7. The Spider processes the Response and returns scraped Items and new Requests (to follow) to the Engine.
8. The Engine sends scraped Items (returned by the Spider) to the Item Pipeline and Requests (returned by spider) to the Scheduler.
9. The process repeats (from step 2) until there are no more requests from the Scheduler, and the Engine closes the domain.
Translation:
The entire data flow in Scrapy is controlled by the Scrapy Engine. It runs roughly as follows:
- The Engine opens a domain, locates the Spider that handles that domain, and asks the Spider for the first URLs to crawl.
- The Engine gets the first URLs to crawl from the Spider and schedules them in the Scheduler as Requests.
- The Engine asks the Scheduler for the next URLs to crawl.
- The Scheduler returns the next URLs to the Engine, and the Engine sends them through the Downloader Middleware to the Downloader (see the middleware sketch after this list).
- Once the page has been downloaded, the response passes back through the Downloader Middleware to the Engine.
- The Engine receives the Response from the Downloader and sends it through the Spider Middleware to the Spider for processing.
- The Spider processes the Response, returns the scraped Items, and sends new Requests to the Engine.
- The Engine sends the scraped Items to the Item Pipeline and sends the Requests to the Scheduler.
- The process repeats from step 2 until there are no more requests in the Scheduler, at which point the Engine closes the domain.
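Steps 4 and 5 are where the Downloader Middleware sits. As a concrete illustration, here is a minimal sketch of a downloader middleware that only logs traffic in both directions; the class and module names are made up for this example, and it would be enabled through the DOWNLOADER_MIDDLEWARES setting in settings.py:

import logging

class LoggingDownloaderMiddleware(object):
    """Illustrative middleware: logs requests on the way to the Downloader
    and responses on the way back to the Engine."""

    def process_request(self, request, spider):
        # Request direction (step 4): called before the request reaches the Downloader.
        logging.debug("Request -> Downloader: %s", request.url)
        return None  # None means: continue processing this request normally

    def process_response(self, request, response, spider):
        # Response direction (step 5): called before the response reaches the Engine.
        logging.debug("Response -> Engine: %s (%s)", response.url, response.status)
        return response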
Spider example:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from myproject.items import MyItem  # assumes MyItem is declared in the project's items.py

class MySpider(BaseSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # These are the scraped items
        for h3 in hxs.select('//h3').extract():
            yield MyItem(title=h3)
        # These are new requests (e.g. the next page) that go back to the Scheduler
        for url in hxs.select('//a/@href').extract():
            yield Request(url, callback=self.parse)
The same spider can also collect everything into a list and return it:
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    items = []
    # These are the scraped items
    for h3 in hxs.select('//h3').extract():
        items.append(MyItem(title=h3))
    # These are new requests (e.g. the next page) that go back to the Scheduler
    for url in hxs.select('//a/@href').extract():
        items.append(Request(url, callback=self.parse))
    return items
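Both variants assume a MyItem item class defined elsewhere, typically in the project's items.py. A minimal sketch of what that declaration could look like; only the title field is taken from the example above, the rest is an assumption:

from scrapy.item import Item, Field

class MyItem(Item):
    title = Field()  # the only field used in the example above

With name = 'example.com', the spider would then be run with scrapy crawl example.com.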
A Spider produces two kinds of results. One is links that need further crawling, such as the "next page" links parsed above; these are handed back to the Scheduler. The other is data to be stored; it is sent to the Item Pipeline, which is where the data gets post-processed (detailed parsing, filtering, storage, and so on). Note that both kinds of results can be mixed in a single list returned by the spider and are told apart by type: Items versus Requests. Requests are handed back to Scrapy to be scheduled and downloaded, and their responses are then handled by the specified callback function.
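As an illustration of that post-processing stage, here is a minimal Item Pipeline sketch; the class name and the normalisation it performs are made up, and the process_item(self, item, spider) signature shown is the one used by later Scrapy releases. It would be enabled through the ITEM_PIPELINES setting:

class StripTitlePipeline(object):
    """Illustrative pipeline: normalises the scraped title before storage."""

    def process_item(self, item, spider):
        # Post-processing step: strip surrounding whitespace from the title field.
        item['title'] = item['title'].strip()
        return item  # returning the item passes it on to the next pipeline stage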
Compare this with the spider documentation:
For spiders, the scraping cycle goes through something like this:
1. You start by generating the initial Requests to crawl the first URLs, and specify a callback function to be called with the response downloaded from those requests.
   The first requests to perform are obtained by calling the start_requests() method which (by default) generates Requests for the URLs specified in start_urls, with the parse method as callback function for those Requests (a sketch of this follows after the list).
2. In the callback function, you parse the response (web page) and return either Item objects, Request objects, or an iterable of both. Those Requests will also contain a callback (maybe the same) and will then be downloaded by Scrapy and their responses handled by the specified callback.
3. In callback functions, you parse the page contents, typically using XPath Selectors (but you can also use BeautifulSoup, lxml or whatever mechanism you prefer) and generate items with the parsed data.
4. Finally, the items returned from the spider will typically be persisted in some Item pipeline.
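Step 1 refers to start_requests(). The sketch below is roughly what the default behaviour amounts to for the MySpider example above, and shows where you would override it; details of the real default (such as duplicate-filtering flags) are omitted:

from scrapy.spider import BaseSpider
from scrapy.http import Request

class MySpider(BaseSpider):
    name = 'example.com'
    start_urls = ['http://www.example.com/1.html']

    def start_requests(self):
        # Roughly what the default implementation does: one Request per entry
        # in start_urls, with self.parse as the callback.
        for url in self.start_urls:
            yield Request(url, callback=self.parse)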
The rough flow looks like this:
Spider ------> items --------> Item Pipeline
  |----------> requests -----> Engine ------> Scheduler queue
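To make the type-based routing in the diagram concrete, here is an illustrative sketch; it is not actual Scrapy internals, and the scheduler/pipeline helpers are made up. Everything the spider returns is routed by checking whether it is a Request:

from scrapy.http import Request

def route_spider_output(results, scheduler, pipeline):
    # Illustration only: Requests go back to the scheduler queue,
    # everything else is treated as an item and handed to the pipeline.
    for result in results:
        if isinstance(result, Request):
            scheduler.enqueue(result)   # hypothetical scheduler API
        else:
            pipeline.store(result)      # hypothetical pipeline API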
Another example:
def parse(self, response):
    items = []
    hxs = HtmlXPathSelector(response)

    # Links to individual blog posts: turn each URL into a Request whose
    # callback is parse_post instead of the default parse.
    posts = hxs.x('//h1/a/@href').extract()
    items.extend([self.make_requests_from_url(url).replace(callback=self.parse_post)
                  for url in posts])

    # The "next page" link from the pagination block.
    page_links = hxs.x('//div[@class="wp-pagenavi"]/a[not(@title)]')
    for link in page_links:
        if link.x('text()').extract()[0] == u'\xbb':
            url = link.x('@href').extract()[0]
            items.append(self.make_requests_from_url(url))

    return items
The first half parses the links to the blog posts we want to crawl; the second half extracts the link to the "next page". Note that the returned list does not just contain plain URL strings: Scrapy expects Request objects, which can carry more than a bare URL, such as cookies or a callback function. Also note that when creating the Requests for the blog posts we replace the callback, because the default callback, parse, is meant for parsing list pages like this one, while parse_post is defined as follows:
def parse_post(self, response):
    item = BlogCrawlItem()
    item.url = unicode(response.url)
    item.raw = response.body_as_unicode()
    return [item]
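BlogCrawlItem itself is not shown in this excerpt. The attribute-style assignments (item.url = ...) come from the older item API; with the later Item/Field API the declaration might look like the sketch below, and the assignments would become dict-style (item['url'] = ...). Only the field names are taken from the code above; the rest is an assumption:

from scrapy.item import Item, Field

class BlogCrawlItem(Item):
    url = Field()  # URL of the blog post
    raw = Field()  # raw HTML body of the post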