Getting to Know Scrapy (3): CrawlSpider

  To show how CrawlSpider differs from a plain Spider, we'll rewrite the example from Getting to Know Scrapy (2) with it.

Create the spider file: scrapy genspider -t crawl <spider_name> <domain> — here, scrapy genspider -t crawl crawl_rect hr.tencent.com

Write the spider file:

from Tencent_recruit.items import TencentRecruitItem, DetailRecruitItem
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CrawlRectSpider(CrawlSpider):
    name = 'crawl_rect'
    allowed_domains = ['hr.tencent.com']
    start_urls = ['https://hr.tencent.com/position.php']

    # Each Rule pairs a LinkExtractor with a callback.
    # follow=True keeps extracting links from the pages that rule matches;
    # follow=False parses those pages but does not mine them for more links.
    rules = (
        Rule(LinkExtractor(allow=r'position\.php\?&start=\d+#a'), callback='parse_tencent', follow=True),
        Rule(LinkExtractor(allow=r'position_detail\.php\?id=\d+&keywords=&tid=0&lid=0'), callback='parse_detail', follow=False),
    )

    def parse_tencent(self, response):
        # Rows of the position list alternate between the "even" and "odd" classes.
        node_list = response.xpath('//tr[@class="even"] | //tr[@class="odd"]')
        for node in node_list:
            item = TencentRecruitItem()
            item['position_name'] = node.xpath('./td[1]/a/text()').extract_first()
            item['position_category'] = node.xpath('./td[2]/text()').extract_first()
            item['position_number'] = node.xpath('./td[3]/text()').extract_first()
            item['position_place'] = node.xpath('./td[4]/text()').extract_first()
            item['position_link'] = node.xpath('./td[1]/a/@href').extract_first()
            item['release_time'] = node.xpath('./td[5]/text()').extract_first()
            yield item

    def parse_detail(self, response):
        item = DetailRecruitItem()
        print('parse_detail is running')
        item['position_name'] = response.xpath('//*[@id="sharetitle"]/text()').extract_first()
        # The first <ul class="squareli"> holds the duties, the second the requirements.
        result = response.xpath('//ul[@class="squareli"]')
        duty = result[0]
        req = result[1]
        item['work_duty'] = ''.join(duty.xpath('./li/text()').extract())
        item['work_request'] = ''.join(req.xpath('./li/text()').extract())
        yield item
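The allow patterns above are plain regular expressions that LinkExtractor searches against every URL it pulls out of a page. A quick stand-alone check (the sample URLs are hypothetical, modeled on the real Tencent pages) shows which callback each kind of link would be routed to:

```python
import re

# The "allow" patterns from the rules above; LinkExtractor searches them
# against every URL it extracts from a page.
LIST_PATTERN = r'position\.php\?&start=\d+#a'
DETAIL_PATTERN = r'position_detail\.php\?id=\d+&keywords=&tid=0&lid=0'

# Hypothetical sample URLs modeled on the real pages.
sample_urls = [
    'https://hr.tencent.com/position.php?&start=10#a',
    'https://hr.tencent.com/position_detail.php?id=42110&keywords=&tid=0&lid=0',
    'https://hr.tencent.com/about.php',
]

routed = []
for url in sample_urls:
    if re.search(LIST_PATTERN, url):
        routed.append((url, 'parse_tencent'))   # list page: first rule
    elif re.search(DETAIL_PATTERN, url):
        routed.append((url, 'parse_detail'))    # detail page: second rule
    else:
        routed.append((url, 'ignored'))         # matches no rule

print(routed)
```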

Run the spider: scrapy crawl crawl_rect

  Watching the run, the CrawlSpider version gets through the pages faster, and with noticeably less code: the pagination and detail-page requests that the Spider version built by hand are generated automatically by the rules.
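The follow flag is what drives that difference. The toy simulation below is an illustrative sketch, not Scrapy's actual implementation, and the mini in-memory "site" is made up; it mimics how a CrawlSpider walks pages: list pages (follow=True) keep feeding the link extractor, while detail pages (follow=False) get their callback called but are never mined for further links.

```python
import re

# Toy in-memory "site": each URL maps to the links found on that page.
# All URLs here are hypothetical stand-ins for the real pages.
SITE = {
    'position.php?&start=0#a': ['position.php?&start=10#a',
                                'position_detail.php?id=1&keywords=&tid=0&lid=0'],
    'position.php?&start=10#a': ['position_detail.php?id=2&keywords=&tid=0&lid=0'],
    'position_detail.php?id=1&keywords=&tid=0&lid=0': ['position.php?&start=20#a'],
    'position_detail.php?id=2&keywords=&tid=0&lid=0': [],
    'position.php?&start=20#a': [],
}

# (pattern, callback name, follow) -- mirrors the spider's rules tuple.
RULES = [
    (r'position\.php\?&start=\d+#a', 'parse_tencent', True),
    (r'position_detail\.php\?id=\d+&keywords=&tid=0&lid=0', 'parse_detail', False),
]

def crawl(start):
    seen, queue, calls = set(), [start], []
    while queue:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        for link in SITE[url]:
            for pattern, callback, follow in RULES:
                if re.search(pattern, link):
                    calls.append((callback, link))  # the callback runs on this response
                    if follow:
                        queue.append(link)          # follow=True: extract links there too
                    break
    return calls

calls = crawl('position.php?&start=0#a')
print(calls)
```

Note that the link to position.php?&start=20#a is never visited: it only appears on a detail page, and the detail rule has follow=False.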

A quick CrawlSpider vs. Spider summary:

With a plain Spider, every follow-up request (next page, detail page) has to be built and yielded by hand inside parse(). CrawlSpider subclasses Spider and adds the rules mechanism: each Rule's LinkExtractor pulls matching links out of every response and dispatches them to the named callback, with follow deciding whether those pages are mined for further links. One caveat: don't name a CrawlSpider callback parse, because CrawlSpider uses parse internally to drive the rules.

posted @ 2018-05-14 17:03  巴蜀秀才