11. Persistent Storage with the Scrapy Framework
Today's Summary
- Terminal-command-based persistent storage
- Pipeline-based persistent storage
Today's Details
1. Terminal-command-based persistent storage
- Make sure the spider's parse method returns an iterable object (usually a list or dict). That return value can then be written to a file in a chosen format with a terminal command, as the sketch below shows.
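A minimal sketch of this approach. The spider name, URL, and XPath here are hypothetical placeholders, not part of the Boss example that follows:

```python
import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo'                          # hypothetical spider name
    start_urls = ['https://example.com']   # placeholder URL

    def parse(self, response):
        all_data = []
        for title in response.xpath('//h1/text()').extract():
            # Each record is a plain dict; Scrapy's feed exports can serialize these
            all_data.append({'title': title})
        return all_data  # iterable return value required for -o export
```

Export it from the terminal with `scrapy crawl demo -o demo.csv`; Scrapy picks the serializer from the file suffix (supported formats include json, jsonlines/jl, csv, xml, marshal, and pickle).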
2. Pipeline-based persistent storage
The Scrapy framework already ships with efficient, convenient persistence functionality that we can use directly. To use it, we first need to get to know two files: items.py and pipelines.py.
A quick exercise: scrape the job-posting data from the Boss Zhipin recruitment site, then persist it.
- Spider file: boss.py
```python
# -*- coding: utf-8 -*-
import scrapy
from bossPro.items import BossproItem


class BossSpider(scrapy.Spider):
    name = 'boss'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.zhipin.com/job_detail/?query=python%E7%88%AC%E8%99%AB&scity=101010100&industry=&position=']
    # Template for the paginated listing URL
    url = 'https://www.zhipin.com/c101010100/?query=python爬虫&page=%d&ka=page-2'
    page = 1

    # Parse the listing page, then hand each item to the pipelines for storage
    def parse(self, response):
        li_list = response.xpath('//div[@class="job-list"]/ul/li')
        for li in li_list:
            job_name = li.xpath('.//div[@class="info-primary"]/h3/a/div/text()').extract_first()
            salary = li.xpath('.//div[@class="info-primary"]/h3/a/span/text()').extract_first()
            company = li.xpath('.//div[@class="company-text"]/h3/a/text()').extract_first()

            item = BossproItem()
            item['job_name'] = job_name
            item['salary'] = salary
            item['company'] = company
            yield item  # submit the item to the pipelines

        # Follow pages 2 and 3, reusing parse() as the callback
        if self.page <= 3:
            self.page += 1
            print(f'Requesting page {self.page}')
            new_url = self.url % self.page
            yield scrapy.Request(url=new_url, callback=self.parse)
```
- Items file: items.py
```python
import scrapy


class BossproItem(scrapy.Item):
    # Define the fields for your item here, one scrapy.Field per column
    job_name = scrapy.Field()
    salary = scrapy.Field()
    company = scrapy.Field()
```
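Items behave like dicts, which is why parse() above can assign fields with item['job_name'] = .... A quick hypothetical demo of that access pattern (the value is a placeholder):

```python
from bossPro.items import BossproItem

item = BossproItem()
item['job_name'] = 'python crawler engineer'  # hypothetical value
print(item['job_name'])   # fields read back like dict keys
print(dict(item))         # an Item converts cleanly to a plain dict
```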
- Pipeline file: pipelines.py
```python
import json
import os

import pymysql
from redis import Redis


# File storage
class BossproPipeline(object):
    fp = None

    def open_spider(self, spider):
        print('Spider started.............................')
        if not os.path.exists('./boss'):
            os.mkdir('./boss')
        self.fp = open('./boss/boss.txt', 'w', encoding='utf8')

    def close_spider(self, spider):
        print('Spider finished.............................')
        self.fp.close()

    def process_item(self, item, spider):
        self.fp.write(item['job_name'] + ':' + item['salary'] + ':' + item['company'] + '\n')
        return item  # pass the item on to the next pipeline


# MySQL storage
class mysqlPipeline(object):
    conn = None
    cursor = None

    def open_spider(self, spider):
        self.conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root',
                                    password='', db='boss', charset='utf8')
        print(self.conn)

    def process_item(self, item, spider):
        self.cursor = self.conn.cursor()
        try:
            # Parameterized query instead of string interpolation
            self.cursor.execute(
                'insert into boss(job_name, salary, company) values (%s, %s, %s)',
                (item['job_name'], item['salary'], item['company']))
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        self.cursor.close()  # close the cursor before the connection
        self.conn.close()


# Redis storage
class redisPipeLine(object):
    conn = None

    def open_spider(self, spider):
        self.conn = Redis(host='127.0.0.1', port=6379, db=2)
        print(self.conn)

    def process_item(self, item, spider):
        dic = {
            'name': item['job_name'],
            'salary': item['salary'],
            'company': item['company'],
        }
        # redis-py cannot store a dict directly, so serialize it first
        self.conn.lpush('boss', json.dumps(dic))
        return item
```
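To verify that the Redis pipeline worked, a small hypothetical read-back script (it assumes the same host, port, db, and the 'boss' list key used above):

```python
# Read the stored records back out of Redis
import json
from redis import Redis

conn = Redis(host='127.0.0.1', port=6379, db=2)
for raw in conn.lrange('boss', 0, -1):  # fetch the whole list
    print(json.loads(raw))              # deserialize each record
```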
- Settings file: settings.py (the relevant snippet is sketched below)
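The original notes do not include the settings content, so here is a minimal sketch of what has to be enabled. The module paths follow the bossPro project used above; the priority numbers and the User-Agent string are conventional placeholder values:

```python
# settings.py (excerpt)

# A browser-like User-Agent and disabling robots.txt are typically needed
# for this kind of crawl; the values here are placeholders.
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
ROBOTSTXT_OBEY = False

# Register all three pipelines; lower numbers run first
ITEM_PIPELINES = {
    'bossPro.pipelines.BossproPipeline': 300,
    'bossPro.pipelines.mysqlPipeline': 301,
    'bossPro.pipelines.redisPipeLine': 302,
}
```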
- Question: how can the scraped data be stored in several destinations at once (a local file, MySQL, and Redis)? Answer: with the pipeline code shown above; register one pipeline class per destination in ITEM_PIPELINES, and make sure each process_item returns the item so that the lower-priority pipelines receive it too.