Getting Started with the Scrapy Framework (3): CrawlSpider
To show how CrawlSpider differs from a plain Spider, we will rewrite the example from part (2) of this series using a CrawlSpider.
Create the spider file: `scrapy genspider -t crawl <spider_name> <domain>`
Write the spider file:
```python
from Tencent_recruit.items import TencentRecruitItem, DetailRecruitItem
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CrawlRectSpider(CrawlSpider):
    name = 'crawl_rect'
    allowed_domains = ['hr.tencent.com']
    start_urls = ['https://hr.tencent.com/position.php']

    rules = (
        # Follow pagination links and parse each listing page
        Rule(LinkExtractor(allow=r'position\.php\?&start=\d+#a'),
             callback='parse_tencent', follow=True),
        # Parse job-detail pages, but do not follow links found on them
        Rule(LinkExtractor(allow=r'position_detail\.php\?id=\d+&keywords=&tid=0&lid=0'),
             callback='parse_detail', follow=False),
    )

    def parse_tencent(self, response):
        # Each job posting is a table row with class "even" or "odd"
        node_list = response.xpath('//tr[@class="even"] | //tr[@class="odd"]')
        for node in node_list:
            item = TencentRecruitItem()
            item['position_name'] = node.xpath('./td[1]/a/text()').extract_first()
            item['position_category'] = node.xpath('./td[2]/text()').extract_first()
            item['position_number'] = node.xpath('./td[3]/text()').extract_first()
            item['position_place'] = node.xpath('./td[4]/text()').extract_first()
            item['position_link'] = node.xpath('./td[1]/a/@href').extract_first()
            item['release_time'] = node.xpath('./td[5]/text()').extract_first()
            yield item

    def parse_detail(self, response):
        item = DetailRecruitItem()
        print('parse_detail is running')
        item['position_name'] = response.xpath('//*[@id="sharetitle"]/text()').extract_first()
        # The first <ul class="squareli"> holds the duties, the second the requirements
        result = response.xpath('//ul[@class="squareli"]')
        duty = result[0]
        req = result[1]
        item['work_duty'] = ''.join(duty.xpath('./li/text()').extract())
        item['work_request'] = ''.join(req.xpath('./li/text()').extract())
        yield item
```
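The two `allow` patterns in `rules` are what route each discovered link to the right callback. A minimal stdlib-only sketch of that routing, applying the same regular expressions with `re` directly (LinkExtractor matches such patterns against candidate URLs; the sample URLs below are illustrative):

```python
import re

# Patterns copied from the spider's Rules
LIST_RE = re.compile(r'position\.php\?&start=\d+#a')
DETAIL_RE = re.compile(r'position_detail\.php\?id=\d+&keywords=&tid=0&lid=0')

# Sample URLs in the shape of the site's pagination and detail links
list_url = 'https://hr.tencent.com/position.php?&start=10#a'
detail_url = 'https://hr.tencent.com/position_detail.php?id=44954&keywords=&tid=0&lid=0'

print(bool(LIST_RE.search(list_url)))      # → True: routed to parse_tencent
print(bool(DETAIL_RE.search(detail_url)))  # → True: routed to parse_detail
print(bool(LIST_RE.search(detail_url)))    # → False: a detail URL never matches the list rule
```

Because the patterns are mutually exclusive, each page is handled by exactly one callback.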
Run the spider: `scrapy crawl crawl_rect`
As the run shows, the CrawlSpider version crawls faster than the plain Spider one and needs noticeably less code, which makes it the more convenient choice here.
Summary of CrawlSpider vs. Spider: