Scrapy in practice: crawling the girl images from 干货集中营 (gank.io) with Scrapy, yet again
What you will learn:
1. How to handle the JSON data returned by the API
2. Saving the items to a JSON file
3. Saving the items to a MongoDB database
4. Downloading the item images (including thumbnails)
1. Create the project
scrapy startproject gank
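For reference, startproject generates the standard Scrapy project layout, roughly like this (the exact set of files may vary slightly with the Scrapy version):

gank/
    scrapy.cfg
    gank/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py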
2. Generate the spider file
scrapy genspider gank_img gank.io
Note: the spider name gank_img must not be the same as the project name gank.
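For reference, the freshly generated gank_img.py contains roughly the following skeleton (the exact template text depends on the Scrapy version); the next step replaces it with the real spider:

# -*- coding: utf-8 -*-
import scrapy


class GankImgSpider(scrapy.Spider):
    name = 'gank_img'
    allowed_domains = ['gank.io']
    start_urls = ['http://gank.io/']

    def parse(self, response):
        pass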
3. gank_img.py
import json

import scrapy

from gank.items import GankItem


class GankImgSpider(scrapy.Spider):
    name = 'gank_img'
    allowed_domains = ['gank.io']
    # For why the start URL is written this way, see: https://www.cnblogs.com/sanduzxcvbnm/p/10271493.html
    start_urls = ['https://gank.io/api/data/福利/700/1']

    def parse(self, response):
        # The response body is a JSON string; turn it into a dict and pull out the fields we need
        results = json.loads(response.text)['results']
        for i in results:
            item = GankItem()
            item['who'] = i['who']
            item['url'] = i['url']
            yield item
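If you want to inspect the API response before writing parse(), a minimal sketch like the one below also works outside Scrapy. It assumes the requests library is installed and asks for a smaller page (10 results instead of 700) purely for a quick look; the parsing mirrors the parse() method above.

# Standalone peek at the gank.io API response (assumes the requests library is installed)
import json

import requests

resp = requests.get('https://gank.io/api/data/福利/10/1')  # 10 results, page 1
results = json.loads(resp.text)['results']  # same parsing as in parse() above
for entry in results:
    print(entry['who'], entry['url'])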
4. items.py
import scrapy


class GankItem(scrapy.Item):
    # define the fields for your item here like:
    who = scrapy.Field()
    url = scrapy.Field()
    # Filled in by the images pipeline with the paths of the downloaded images
    image_paths = scrapy.Field()
5. pipelines.py
import json

import pymongo
import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline


# Enable these pipelines in settings.py; items yielded by the spider are then passed here for processing.

# Save the items to a JSON file
class JsonWriterPipeline(object):

    def open_spider(self, spider):
        self.file = open('items.json', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item


# Save the items to MongoDB
class MongoPipeline(object):
    # collection name
    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection parameters from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')  # database name
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert_one(dict(item))
        return item


# Download the item images
class MyImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        # Turn https image links into http
        if item['url'][0:5] == 'https':
            item['url'] = item['url'].replace(item['url'][0:5], 'http')
        yield scrapy.Request(item['url'])

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item
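To confirm that MongoPipeline actually wrote the items, a quick check with pymongo in a Python shell is enough. This is only a sketch using the connection parameters from settings.py below; count_documents() needs a reasonably recent pymongo.

# Inspect what MongoPipeline stored
import pymongo

client = pymongo.MongoClient('127.0.0.1')
db = client['gank']
print(db['scrapy_items'].count_documents({}))  # number of stored items
print(db['scrapy_items'].find_one())           # sample document: who / url / image_paths
client.close()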
6. settings.py
Only change the settings below; everything else stays at its default.
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,en-US;q=0.8,zh;q=0.5,en;q=0.3',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0'
}

# MongoDB connection parameters
MONGO_URI = '127.0.0.1'
MONGO_DATABASE = 'gank'

ITEM_PIPELINES = {
    'gank.pipelines.MyImagesPipeline': 1,
    'gank.pipelines.JsonWriterPipeline': 300,
    'gank.pipelines.MongoPipeline': 400,
}

# Directory where the downloaded images are stored
IMAGES_STORE = 'D:\\gank\\images'
# 90-day expiry: images downloaded within the last 90 days are not downloaded again
IMAGES_EXPIRES = 90
# Thumbnail sizes
IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (270, 270),
}
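Note that Scrapy's ImagesPipeline relies on Pillow to process the images and build the thumbnails, and MongoPipeline needs pymongo. If they are not installed yet:

pip install Pillow pymongo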
7. Run the spider
scrapy crawl gank_img
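As an aside, the JSON output could also be produced with Scrapy's built-in feed export instead of (or in addition to) the custom JsonWriterPipeline; the filename here is just an example:

scrapy crawl gank_img -o items_feed.json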
8. Results
The JSON file
The MongoDB database
The saved images and thumbnails
The full directory holds the images at their original size, while the thumbs directory holds the thumbnails, which come in two sizes, big and small.
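Concretely, the ImagesPipeline names each file after the SHA1 hash of its URL, so the tree under IMAGES_STORE looks roughly like this (the hash shown is only a placeholder):

D:\gank\images\
    full\
        471d0b7a....jpg          (original size)
    thumbs\
        big\
            471d0b7a....jpg      (270 x 270)
        small\
            471d0b7a....jpg      (50 x 50)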
At the end of the run Scrapy prints the crawl statistics.
561 images were downloaded; 108 images could not be downloaded.
For why some images fail to download, see the earlier article: https://www.cnblogs.com/sanduzxcvbnm/p/10271493.html