My First Scrapy Spider
Install Python

I won't walk through this here; there are plenty of tutorials online.
Install the Scrapy package
pip install scrapy
Create a Scrapy project
scrapy startproject aliSpider
Create the spider file

Open a terminal (cmd) in the project directory and run the following command; the `-t crawl` flag generates the spider from the CrawlSpider template:
scrapy genspider -t crawl alispi job.alibaba.com
Edit items.py
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class AlispiderItem(scrapy.Item):
    # define the fields for your item here:
    detail = scrapy.Field()        # detail-page link
    workPosition = scrapy.Field()  # work location
    jobclass = scrapy.Field()      # job category
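A Scrapy Item behaves like a dict that only accepts its declared fields; assigning an undeclared key raises a KeyError. A stdlib-only sketch of that behavior (a hypothetical stand-in for illustration, not Scrapy's real implementation):

```python
# Minimal stand-in mimicking how a scrapy.Item restricts keys
# to its declared fields (illustration only, not Scrapy code).
class FakeItem(dict):
    fields = ("detail", "workPosition", "jobclass")

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"{key!r} is not a declared field")
        super().__setitem__(key, value)

item = FakeItem()
item["detail"] = ["/job/123.html"]  # accepted: declared field
try:
    item["salary"] = "n/a"          # rejected: not declared
except KeyError as e:
    print("rejected:", e)
```

This is why a typo in a field name inside the spider fails loudly instead of silently creating a new key.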
Edit alispi.py
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from aliSpider.items import AlispiderItem


class AlispiSpider(CrawlSpider):
    name = 'alispi'
    allowed_domains = ['job.alibaba.com']
    start_urls = ['https://job.alibaba.com/zhaopin/positionList.html#page/0']

    # follow pagination links (raw string avoids an invalid-escape warning)
    pagelink = LinkExtractor(allow=(r"\d+",))
    rules = (
        Rule(pagelink, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        for each in response.xpath("//tr"):
            item = AlispiderItem()
            # detail-page link
            item['detail'] = each.xpath("./td[1]/span/a/@href").extract()
            # work location
            item['workPosition'] = each.xpath("./td[3]/span/text()").extract()
            # job category
            item['jobclass'] = each.xpath("./td[2]/span/text()").extract()
            yield item
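The XPath expressions above read cells relative to each table row. You can try the same relative paths against a hand-written sample row with the stdlib `xml.etree.ElementTree` before running the full crawl (the row markup here is a guess at the page's structure, not the real Alibaba HTML):

```python
import xml.etree.ElementTree as ET

# Hypothetical sample of one job row; the real page layout may differ.
row_html = """
<tr>
  <td><span><a href="/job/123.html">Java Engineer</a></span></td>
  <td><span>Backend Development</span></td>
  <td><span>Hangzhou</span></td>
</tr>
"""

row = ET.fromstring(row_html)
detail = row.find("./td[1]/span/a").get("href")  # detail-page link
jobclass = row.find("./td[2]/span").text         # job category
work_position = row.find("./td[3]/span").text    # work location
print(detail, jobclass, work_position)
```

Prototyping selectors on a static snippet like this is much faster than re-running the spider after every tweak.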
Run the spider
scrapy crawl alispi
Write the output to items.json
scrapy crawl alispi -o items.json
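The `-o` feed export writes the scraped items as one JSON array. A quick way to inspect the result afterwards (field names taken from items.py; the sample record itself is made up):

```python
import json

# Write a tiny sample in the shape `scrapy crawl alispi -o items.json`
# produces (a JSON array of item dicts); the record is made up.
sample = [{"detail": ["/job/123.html"],
           "workPosition": ["Hangzhou"],
           "jobclass": ["Backend Development"]}]
with open("items.json", "w", encoding="utf-8") as f:
    json.dump(sample, f, ensure_ascii=False)

# Load it back and inspect the first item, as you would with real output.
with open("items.json", encoding="utf-8") as f:
    items = json.load(f)
print(len(items), items[0]["workPosition"])
```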
If the crawl succeeds, Scrapy's log ends with its crawl statistics (request and item counts).
Version notes

Python 3.5.5
Reference: https://scrapy-chs.readthedocs.io/zh_CN/0.24/topics/selectors.html
Everything that happens to us is a carefully wrapped gift. If we are willing to face its sometimes ugly wrapping and unwrap it bit by bit with patience and courage, we will be delighted to find the treasure hidden inside.
---- 遇见未知的自己 (Meeting the Unknown Self)