Scraping Zhilian Zhaopin job listings with Scrapy + MongoDB

This post documents my process of scraping job listings from the Zhilian Zhaopin site.

Environment: Python 3.6.5

   PyCharm: JetBrains PyCharm Community Edition 2018.1 x64

   MongoDB 4.0.0

 

1. The first step, of course, is to create a new project and spider

scrapy startproject zhilian                  # create the project
scrapy genspider zl_spider www.zhaopin.com   # create the spider (zl_spider) inside it

2. Analyzing the site

Go to the Zhilian homepage at https://www.zhaopin.com/, run a search (python, in this example), and on the results page press F12 to open the developer tools.

 

At first I naively assumed that the Elements (inspector) panel would be enough to get all the information.

 

Full of optimism, I fetched the whole page and started analyzing it, only to find that the tag from the screenshot above, //*[@id="listContent"]/div[1]/div[1]/div[1]/div[1]/a/span, matched nothing at all. Open the page source and see for yourself.

That's because the listings are loaded asynchronously, so to analyze them you have to look at the Network tab and find the requests that actually carry the data.

Click the second request in the list and you'll find that all the information is hiding in its response.

With this JSON payload in hand, it's easy to pull out the fields we need; the spider below simply parses it with json.loads.

One more thing: pagination. Scroll to the bottom and click page 2 and page 3; the developer tools show another batch of requests just like before. Compare the requests whose path starts with sou:

1.https://fe-api.zhaopin.com/c/i/sou?pageSize=60&cityId=763&workExperience=-1&education=-1&companyType=-1&employmentType=-1&jobWelfareTag=-1&kw=python&kt=3&lastUrlQuery=%7B%22jl%22:%22763%22,%22kw%22:%22python%22,%22kt%22:%223%22%7D

2.https://fe-api.zhaopin.com/c/i/sou?start=60&pageSize=60&cityId=763&workExperience=-1&education=-1&companyType=-1&employmentType=-1&jobWelfareTag=-1&kw=python&kt=3&lastUrlQuery=%7B%22p%22:2,%22jl%22:%22763%22,%22kw%22:%22python%22,%22kt%22:%223%22%7D

3.https://fe-api.zhaopin.com/c/i/sou?start=120&pageSize=60&cityId=763&workExperience=-1&education=-1&companyType=-1&employmentType=-1&jobWelfareTag=-1&kw=python&kt=3&lastUrlQuery=%7B%22p%22:3,%22jl%22:%22763%22,%22kw%22:%22python%22,%22kt%22:%223%22%7D

Notice the extra start=xxx parameter: page 2 uses start=60 and page 3 uses start=120, so this is the paging offset. With it we can walk through all the result pages (a small sketch of the arithmetic follows). Then it's time to write the code!
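As a quick standalone illustration (not part of the Scrapy project), this is how the offset relates to the page number. The lastUrlQuery parameter from the captured URLs is dropped here, so treat this as a sketch of the arithmetic rather than a guaranteed working request:

# start is a 0-based offset that grows in steps of pageSize (60 postings per page)
BASE = ('https://fe-api.zhaopin.com/c/i/sou?start={start}&pageSize=60&cityId=763'
        '&workExperience=-1&education=-1&companyType=-1&employmentType=-1'
        '&jobWelfareTag=-1&kw=python&kt=3')

def page_url(page):                 # page is 1-based
    return BASE.format(start=(page - 1) * 60)

print(page_url(2))                  # ...start=60...  matches the second captured URL
print(page_url(3))                  # ...start=120... matches the third captured URL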

3. Configure the settings file

Enable the DEFAULT_REQUEST_HEADERS and turn on the item pipeline:

ROBOTSTXT_OBEY = False
DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
  'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko)"
                " Chrome/63.0.3239.132 Safari/537.36"
}
ITEM_PIPELINES = {
   'zhilian.pipelines.ZhilianPipeline': 300,
}

4. Define the items

For now we only grab these simple fields:


import scrapy


class ZhilianItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    eduLevel = scrapy.Field()    # education level
    company = scrapy.Field()     # company name
    salary = scrapy.Field()      # salary
    city = scrapy.Field()        # city
    workingExp = scrapy.Field()  # working experience
 

5. Write the spider

 zl_spider.py

# -*- coding: utf-8 -*-
import json

import scrapy

from scrapy import Request

from zhilian.items import ZhilianItem


class ZlSpiderSpider(scrapy.Spider):
    name = 'zl_spider'
    allowed_domains = ['www.zhaopin.com']
    start_urls = ['https://www.zhaopin.com']
    first_url = 'https://fe-api.zhaopin.com/c/i/sou?start={page}&pageSize=60&cityId=763&workExperience=-1&education' \
                '=-1&companyType=-1&employmentType=-1&jobWelfareTag=-1&kw=python&kt=3&lastUrlQuery=%7B%22p%22:2,' \
                '%22jl%22:%22763%22,%22kw%22:%22python%22,%22kt%22:%223%22%7D'

    def start_requests(self):  # paginate through the result pages
        for i in range(0, 11):
            page = i * 60
            yield Request(self.first_url.format(page=page), callback=self.parse)

    def parse(self, response):
        result = json.loads(response.text)
        item = ZhilianItem()
        for field in item.fields:
            if field in result.keys():
                item[field] = result.get(field)
        yield item

Now run it and see whether it worked.

Nothing. The items came back empty.

Let's check what result.keys() actually contains.
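A quick way to see this is a temporary print inside parse:

        result = json.loads(response.text)
        print(result.keys())   # temporary debug: show the top-level keys of the response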

 

It turns out the top-level dict has only two keys, which is why the loop above produced empty values; the data I actually need is under data.results.
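Reconstructed from the fields the final spider reads below, each entry in data.results looks roughly like this (a sketch only; the real entries carry many more fields than these):

# rough shape of one entry of result['data']['results'], limited to the fields used later
{
    'company':    {'name': '...'},      # company name
    'eduLevel':   {'name': '...'},      # education requirement
    'workingExp': {'name': '...'},      # experience requirement
    'city':       {'display': '...'},   # city
    'salary':     '...'                 # salary, taken as-is
}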

So I rewrote the code, changing the parse part as shown in the screenshot below.

 

It promptly threw an error.

 

With no better option, I typed out the field access by hand.

The final spider code ends up as:

# -*- coding: utf-8 -*-
import json

import scrapy
from scrapy import Request
from zhilian.items import ZhilianItem


class ZlSpiderSpider(scrapy.Spider):
    name = 'zl_spider'
    allowed_domains = ['www.zhaopin.com']
    start_urls = ['https://www.zhaopin.com']
    first_url = 'https://fe-api.zhaopin.com/c/i/sou?start={page}&pageSize=60&cityId=763&workExperience=-1&education' \
                '=-1&companyType=-1&employmentType=-1&jobWelfareTag=-1&kw=python&kt=3&lastUrlQuery=%7B%22p%22:2,' \
                '%22jl%22:%22763%22,%22kw%22:%22python%22,%22kt%22:%223%22%7D'

    def start_requests(self):  # paginate through the result pages
        for i in range(0, 11):
            page = i * 60
            yield Request(self.first_url.format(page=page), callback=self.parse)

    def parse(self, response):
        result = json.loads(response.text)
        for results in result['data']['results']:  # up to 60 posting dicts per page
            item = ZhilianItem()                    # a fresh item per posting
            item['company'] = results['company']['name']
            item['eduLevel'] = results['eduLevel']['name']
            item['workingExp'] = results['workingExp']['name']
            item['city'] = results['city']['display']
            item['salary'] = results.get('salary')
            yield item
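Run it from the project root with the spider name defined above:

scrapy crawl zl_spider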

Part of the run output is shown below.

 

6. Write to MongoDB, using an update (upsert) to deduplicate (suitable for small data volumes)

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo

class ZhilianPipeline(object):
    def process_item(self, item, spider):
        return item

class MongoPipeline(object):
    collection_name = 'zl_python'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # upsert keyed on company name: an existing document for the same company is replaced
        self.db[self.collection_name].update({'company': item['company']}, dict(item), True)
        return item
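For this pipeline to run, it also needs to be registered in ITEM_PIPELINES and given the connection settings it reads in from_crawler. A minimal addition to settings.py, assuming a local MongoDB and a database named zhilian (the URI, database name, and priority value are assumptions; adjust them to your setup):

# settings.py: register MongoPipeline and the connection settings it expects
ITEM_PIPELINES = {
   'zhilian.pipelines.ZhilianPipeline': 300,
   'zhilian.pipelines.MongoPipeline': 400,
}
MONGO_URI = 'mongodb://localhost:27017'   # assumed local MongoDB instance
MONGO_DATABASE = 'zhilian'                # assumed database name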

7. A screenshot of part of the final result

Ten pages of results end up as only 313 documents: because the upsert is keyed on company name, postings from the same company count as duplicates and get merged.
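If you would rather keep one document per posting instead of one per company, a compound key over the scraped fields is one option. A sketch against the pipeline above (inside process_item; the field names are just the item fields defined earlier):

        # dedup on several fields instead of company alone, so distinct postings
        # from the same company are no longer collapsed into a single document
        key = {f: item[f] for f in ('company', 'salary', 'city', 'workingExp', 'eduLevel')}
        self.db[self.collection_name].update(key, dict(item), True)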

 

 GitHub: https://github.com/biyeee/ZL_spider

 

posted @ 2018-07-21 22:43 biyeee