Getting Started with the Scrapy Framework (3): CrawlSpider
To show how CrawlSpider differs from a plain Spider, we will rewrite the example from part (2) of this series using a CrawlSpider.
Create the spider file: `scrapy genspider -t crawl <spider_name> <domain>`
Write the spider file:
```python
from Tencent_recruit.items import TencentRecruitItem, DetailRecruitItem
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CrawlRectSpider(CrawlSpider):
    name = 'crawl_rect'
    allowed_domains = ['hr.tencent.com']
    start_urls = ['https://hr.tencent.com/position.php']

    rules = (
        # Follow pagination links and parse each listing page
        Rule(LinkExtractor(allow=r'position\.php\?&start=\d+#a'),
             callback='parse_tencent', follow=True),
        # Parse job-detail pages, but do not follow links found on them
        Rule(LinkExtractor(allow=r'position_detail\.php\?id=\d+&keywords=&tid=0&lid=0'),
             callback='parse_detail', follow=False),
    )

    def parse_tencent(self, response):
        # Each job posting is a table row with class "even" or "odd"
        node_list = response.xpath('//tr[@class="even"] | //tr[@class="odd"]')
        for node in node_list:
            item = TencentRecruitItem()
            item['position_name'] = node.xpath('./td[1]/a/text()').extract_first()
            item['position_category'] = node.xpath('./td[2]/text()').extract_first()
            item['position_number'] = node.xpath('./td[3]/text()').extract_first()
            item['position_place'] = node.xpath('./td[4]/text()').extract_first()
            item['position_link'] = node.xpath('./td[1]/a/@href').extract_first()
            item['release_time'] = node.xpath('./td[5]/text()').extract_first()
            yield item

    def parse_detail(self, response):
        item = DetailRecruitItem()
        print('parse_detail is running')
        item['position_name'] = response.xpath('//*[@id="sharetitle"]/text()').extract_first()
        # The first <ul class="squareli"> holds the duties, the second the requirements
        result = response.xpath('//ul[@class="squareli"]')
        duty = result[0]
        req = result[1]
        item['work_duty'] = ''.join(duty.xpath('./li/text()').extract())
        item['work_request'] = ''.join(req.xpath('./li/text()').extract())
        yield item
```
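The two `allow` patterns in `rules` are what route each discovered link to the right callback. A minimal stdlib-only sketch of that routing, applying the same regular expressions with `re` directly (LinkExtractor matches such patterns against candidate URLs; the sample URLs below are illustrative):

```python
import re

# Patterns copied from the spider's Rules
LIST_RE = re.compile(r'position\.php\?&start=\d+#a')
DETAIL_RE = re.compile(r'position_detail\.php\?id=\d+&keywords=&tid=0&lid=0')

# Sample URLs in the shape of the site's pagination and detail links
list_url = 'https://hr.tencent.com/position.php?&start=10#a'
detail_url = 'https://hr.tencent.com/position_detail.php?id=44954&keywords=&tid=0&lid=0'

print(bool(LIST_RE.search(list_url)))      # → True: routed to parse_tencent
print(bool(DETAIL_RE.search(detail_url)))  # → True: routed to parse_detail
print(bool(LIST_RE.search(detail_url)))    # → False: a detail URL never matches the list rule
```

Because the patterns are mutually exclusive, each page is handled by exactly one callback.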
Run the spider: `scrapy crawl crawl_rect`
As the run shows, the CrawlSpider version crawls faster than the plain Spider one and needs noticeably less code, which makes it the more convenient choice here.
Summary of CrawlSpider vs. Spider: