Scrapy: Crawling the Entire Tencent Recruitment Site
Target site: https://hr.tencent.com/
Step 1: Analyze the site structure and the content to be crawled
(Ten thousand words of analysis omitted here.)
Step 2: The code (this part can't be skipped)
1. Configure items.py
import scrapy


class HrTencentItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    position_name = scrapy.Field()    # position title
    position_type = scrapy.Field()    # position category
    detail_url = scrapy.Field()       # URL of the detail page
    people_count = scrapy.Field()     # number of openings
    work_city = scrapy.Field()        # work city
    release_date = scrapy.Field()     # release date
    job_description = scrapy.Field()  # job description
    job_require = scrapy.Field()      # job requirements
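For reference, a scrapy.Item behaves like a dict, which is why the pipeline later in this post can simply call dict(item) before writing to MongoDB. A quick illustration (the field values here are hypothetical, only to show the access pattern):

from hr_tencent.items import HrTencentItem

item = HrTencentItem()
item["position_name"] = "Example position"  # hypothetical value
item["work_city"] = "Shenzhen"              # hypothetical value
print(dict(item))  # -> {'position_name': 'Example position', 'work_city': 'Shenzhen'}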
2. Configure settings.py
Configure the MongoDB settings:
NEWSPIDER_MODULE = 'hr_tencent.spiders'
MONGO_URL = 'localhost'
MONGO_DB = 'hrtencent'
Be sure to register the pipeline in ITEM_PIPELINES:
ITEM_PIPELINES = {
    # 'hr_tencent.pipelines.HrTencentPipeline': 300,
    'hr_tencent.pipelines.MongoPipeline': 400,
}
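To see how these pieces fit together, here is a minimal sketch of the relevant parts of settings.py. BOT_NAME and SPIDER_MODULES come from the standard project template; the two commented-out options at the end are optional extras, not part of the original configuration:

# settings.py (sketch; only the lines relevant to this project)
BOT_NAME = 'hr_tencent'
SPIDER_MODULES = ['hr_tencent.spiders']
NEWSPIDER_MODULE = 'hr_tencent.spiders'

# MongoDB connection used by MongoPipeline
MONGO_URL = 'localhost'
MONGO_DB = 'hrtencent'

ITEM_PIPELINES = {
    'hr_tencent.pipelines.MongoPipeline': 400,
}

# Optional extras (assumptions, not in the original config):
# ROBOTSTXT_OBEY = False
# DOWNLOAD_DELAY = 0.5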
3. Inside the project, run the command scrapy genspider tencent hr.tencent.com to generate the spider.
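If the command succeeds, Scrapy creates a skeleton file at hr_tencent/spiders/tencent.py. The exact contents depend on your Scrapy version, but it looks roughly like this:

# -*- coding: utf-8 -*-
import scrapy


class TencentSpider(scrapy.Spider):
    name = 'tencent'
    allowed_domains = ['hr.tencent.com']
    start_urls = ['http://hr.tencent.com/']

    def parse(self, response):
        pass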
4. Open the auto-generated tencent.py file and edit it:
# -*- coding: utf-8 -*-
import scrapy
from hr_tencent.items import HrTencentItem


class TencentSpider(scrapy.Spider):
    name = 'tencent'
    allowed_domains = ['hr.tencent.com']
    start_urls = ['https://hr.tencent.com/position.php']
    front_url = "https://hr.tencent.com/"

    def parse(self, response):
        # Each job posting is a table row with class "even" or "odd"
        tencenthr = response.xpath('//tr[@class="even"] | //tr[@class="odd"]')
        for job in tencenthr:
            item = HrTencentItem()
            item["position_name"] = job.xpath('.//a/text()').extract_first()
            item["detail_url"] = self.front_url + job.xpath('.//a/@href').extract_first()
            item["position_type"] = job.xpath('.//td[2]/text()').extract_first()
            item["people_count"] = job.xpath('.//td[3]/text()').extract_first()
            item["work_city"] = job.xpath('.//td[4]/text()').extract_first()
            item["release_date"] = job.xpath('.//td[5]/text()').extract_first()
            # Hand the half-filled item to the detail page via meta
            yield scrapy.Request(url=item["detail_url"], callback=self.detail_parse, meta={"item": item})
        # Follow the "next page" link; stop when the listing has no further page
        next_href = response.xpath('//div[@class="pagenav"]/a[@id="next"]/@href').extract_first()
        if next_href:
            yield scrapy.Request(url=self.front_url + next_href, callback=self.parse)

    def detail_parse(self, response):
        item = response.meta["item"]
        # The first <ul class="squareli"> holds the job description, the second the requirements
        node_list = response.xpath('//ul[@class="squareli"]')
        item["job_description"] = ''.join(node_list[0].xpath("./li/text()").extract())
        item["job_require"] = ''.join(node_list[1].xpath("./li/text()").extract())
        yield item
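Side note: passing the half-filled item through meta works on every Scrapy version. On Scrapy 1.7 and later, cb_kwargs does the same job a bit more explicitly. A minimal sketch of the two places that would change (drop-in fragments for the spider above, not a standalone script):

# Inside parse(): pass the item as a keyword argument instead of via meta
yield scrapy.Request(url=item["detail_url"], callback=self.detail_parse,
                     cb_kwargs={"item": item})

# The callback then receives the item directly in its signature
def detail_parse(self, response, item):
    node_list = response.xpath('//ul[@class="squareli"]')
    item["job_description"] = ''.join(node_list[0].xpath("./li/text()").extract())
    item["job_require"] = ''.join(node_list[1].xpath("./li/text()").extract())
    yield item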
5. Configure pipelines.py:
import pymongo


class MongoPipeline(object):
    def __init__(self, mongo_url, mongo_db):
        self.mongo_url = mongo_url
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read the MongoDB settings defined in settings.py
        return cls(
            mongo_url=crawler.settings.get('MONGO_URL'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_url)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # Use the item class name as the collection name
        name = item.__class__.__name__
        self.db[name].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()
6. Create a run.py file so you don't have to type the crawl command every time; just run run.py directly:
# -*- coding:utf-8 -*-
from scrapy import cmdline

cmdline.execute("scrapy crawl tencent".split())
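If you prefer not to shell out to the scrapy command at all, the same thing can be done through Scrapy's own API. A sketch of an alternative run.py (same project, same spider):

# run.py alternative (sketch): start the crawl via Scrapy's API instead of cmdline
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from hr_tencent.spiders.tencent import TencentSpider

process = CrawlerProcess(get_project_settings())  # loads this project's settings.py
process.crawl(TencentSpider)
process.start()  # blocks until the crawl finishes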
7. Make sure MongoDB is running locally or on a server (for a remote MongoDB instance you need to configure the address yourself).
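If MongoDB runs on another machine, MONGO_URL can be a full connection string instead of 'localhost'. The host, port, and credentials below are placeholders, not real values:

# settings.py (sketch with placeholder credentials): point the pipeline at a remote MongoDB
MONGO_URL = 'mongodb://username:password@mongo.example.com:27017'
MONGO_DB = 'hrtencent'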
8. Run run.py and the data is in hand.
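To confirm that the data actually landed, a quick check from a Python shell might look like the sketch below. The collection name HrTencentItem follows from the pipeline using the item's class name; count_documents requires pymongo 3.7+:

import pymongo

client = pymongo.MongoClient('localhost')
db = client['hrtencent']
# The pipeline stores items in a collection named after the item class
print(db['HrTencentItem'].count_documents({}))
for doc in db['HrTencentItem'].find().limit(3):
    print(doc)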
A headwind is the better direction to fly; fear not the thousands who stand in your way, fear only your own surrender!