Scraping all the images under a Zhihu question with Scrapy
Preface:
1. This is only about downloading images. The pictures other users upload carry no copyright anyway (so the author assumes); I download them purely to enjoy myself, e.g. as phone wallpaper, never for commercial use.
2. Crawlers have a shelf life: note that this code was written on 2019-02-13.
1. About crawling Zhihu
In theory anything you can reach in a browser can be crawled; the only obstacles are the site's anti-crawling measures and the complexity of the crawl itself. Zhihu's content is roughly organized as questions plus answers (I only just started using it, so this is a tentative picture). The two usual flows are: (1) log in → land on the home page → click a question in the feed → read the question and its answers → read the comments; or (2) reach a question via a search engine → read the question and its answers. In the web version the second flow needs no login at all, which gives you two crawling goals and two corresponding approaches:
1.1. Crawl all questions (or all questions of some type) starting from the Zhihu home page, together with their answers (and comments)
This requires simulating the login flow, then walking from the home page to each question and fetching its answers and comments from the question URL. The login simulation is described at https://blog.csdn.net/sinat_34200786/article/details/78449499, but the site (anti-crawling measures included) keeps changing, so you will have to analyze it yourself.
1.2. Crawl the answers under one specific question
As of now this needs no login at all. After trying the approach above I realized that my goal, downloading images, doesn't require logging in; earlier I had simply tripped over a small mistake.
2. The Scrapy project
This project is also an excuse to brush up on Scrapy.
Watching the browser's network traffic shows that the content I want is served as JSON from the following URL:
https://www.zhihu.com/api/v4/questions/309298287/answers?include=data[*].is_normal,admin_closed_comment,reward_info,is_collapsed,annotation_action,annotation_detail,collapse_reason,is_sticky,collapsed_by,suggest_edit,comment_count,can_comment,content,editable_content,voteup_count,reshipment_settings,comment_permission,created_time,updated_time,review_info,relevant_info,question,excerpt,relationship.is_authorized,is_author,voting,is_thanked,is_nothelp,is_labeled;data[*].mark_infos[*].url;data[*].author.follower_count,badge[*].topics&offset=3&limit=1&sort_by=default&platform=desktop
It carries a lot of parameters, but the ones that matter are offset and limit. The request headers look intimidating, but it turned out that setting a basic "User-Agent" is all that's needed, since no login is involved. I use Scrapy's built-in ImagesPipeline, because apart from the images I don't care about any other field. The code follows:
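Before wiring it into Scrapy, it's worth probing the endpoint by hand. Below is a minimal sketch (it assumes the endpoint still behaves as it did in early 2019, and the shortened include parameter here is my own guess for readability; the full one captured above is the one known to work):

import json
import requests

# Trimmed URL for illustration only -- substitute the full include
# parameter shown above if the short form is rejected.
API = ('https://www.zhihu.com/api/v4/questions/309298287/answers'
       '?include=data[*].content&offset={offset}&limit={limit}'
       '&sort_by=default&platform=desktop')

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/69.0.3497.81 Safari/537.36',
}

res = requests.get(API.format(offset=0, limit=1), headers=headers)
data = json.loads(res.text)
print(data['paging']['totals'])          # total number of answers
print(data['data'][0]['content'][:200])  # one answer's HTML, truncated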
2.1. item.py is simple: it just stores the image URLs, as a list like ['xxx.jpg', 'yyy.jpg']
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class Img584294770Item(scrapy.Item):
    # define the fields for your item here like:
    imgs = scrapy.Field()
2.2. Img584294770.py is the spider file; it defines the crawling process
# -*- coding: utf-8 -*-
# Author: lwx
# 21090111: I want to scrape the images from the answers under Zhihu question id 584294770

from scrapy import Spider
import scrapy
import json
import re
import requests

from zhihu.items import Img584294770Item


class Img584294770(Spider):
    name = 'Img584294770'
    start_urls = [
        'https://www.zhihu.com/',
        'https://www.zhihu.com/api/v4/questions/309298287/answers?include=data[*].is_normal,admin_closed_comment,reward_info,is_collapsed,annotation_action,annotation_detail,collapse_reason,is_sticky,collapsed_by,suggest_edit,comment_count,can_comment,content,editable_content,voteup_count,reshipment_settings,comment_permission,created_time,updated_time,review_info,relevant_info,question,excerpt,relationship.is_authorized,is_author,voting,is_thanked,is_nothelp,is_labeled;data[*].mark_infos[*].url;data[*].author.follower_count,badge[*].topics&offset={offset}&limit={limit}&sort_by=default&platform=desktop',
    ]

    # request headers
    headers = {
        'Accept': '*/*',
        #'Accept-Encoding': 'gzip, deflate, br',  # do NOT set this: the response comes back garbled
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 Safari/537.36',
        'x-requested-with': 'fetch',
    }

    # kick off the crawl
    def start_requests(self):
        # fetch a single answer first, only to learn the total answer count
        url = self.start_urls[1].format(offset=1, limit=1)
        res = requests.get(url, headers=self.headers)
        data = json.loads(res.text)
        if 'data' in data:  # we got data back
            total = data['paging']['totals']  # total number of answers
            # fetch the answers three at a time; ceil(total / 3) pages cover them all
            for page in range((total + 2) // 3):
                url = self.start_urls[1].format(offset=page * 3, limit=3)
                yield scrapy.Request(url=url, callback=self.parse_imgs,
                                     headers=self.headers)

    # pull the image URLs out of each answer in the JSON response
    def parse_imgs(self, response):
        res = json.loads(response.body)
        if 'data' in res:  # we got data back
            for d in res['data']:
                item = Img584294770Item()
                item['imgs'] = self.get_imgs(d['content'])
                yield item

    # extract src="xxx.jpg" or src="yyy.png" addresses from the answer HTML
    def get_imgs(self, content):
        imgs_url_list = re.findall(r'\ssrc="(.*?)"', content)
        return [u for u in imgs_url_list
                if u.split('.')[-1] in ('jpg', 'png')]
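To see what get_imgs actually keeps, here is a standalone demonstration on a made-up content snippet (the HTML below is invented for illustration; real answer HTML carries more attributes):

import re

content = ('<p>text</p>'
           '<img src="https://pic1.zhimg.com/v2-aaa_r.jpg" data-caption="">'
           '<img src="https://pic2.zhimg.com/v2-bbb_r.gif">')

urls = re.findall(r'\ssrc="(.*?)"', content)
kept = [u for u in urls if u.split('.')[-1] in ('jpg', 'png')]
print(urls)  # both URLs are matched by the regex
print(kept)  # only the .jpg survives the extension filter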
2.3. settings.py mostly keeps the generated defaults; the parts that matter here are ROBOTSTXT_OBEY, DOWNLOAD_DELAY, and the ImagesPipeline configuration:

# -*- coding: utf-8 -*-

# Scrapy settings for zhihu project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'zhihu'

SPIDER_MODULES = ['zhihu.spiders']
NEWSPIDER_MODULE = 'zhihu.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'zhihu (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False  # False = ignore robots.txt and fetch what it disallows

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3  # a small delay between requests

# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'zhihu.middlewares.ZhihuSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'zhihu.middlewares.ZhihuDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import os

IMAGES_EXPIRES = 90  # images fetched within the last 90 days are not re-fetched
IMAGES_URLS_FIELD = "imgs"  # name of the item field that holds the image URLs
project_dir = os.path.abspath(os.path.dirname(__file__))
IMAGES_STORE = os.path.join(project_dir, 'images')  # folder the images are saved to

ITEM_PIPELINES = {
    # the old 'scrapy.contrib.pipeline.images.ImagesPipeline' path no longer
    # exists in current Scrapy; the pipeline now lives here:
    'scrapy.pipelines.images.ImagesPipeline': 200,
    #'zhihu.pipelines.ZhihuPipeline': 300,
    'zhihu.pipelines.Img584294770Pipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
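With the three files in place, install Pillow (ImagesPipeline depends on it) and run the spider from the project root:

pip install Pillow
scrapy crawl Img584294770

Scrapy then saves the downloaded files under the images/full/ directory next to settings.py, each one named after the SHA1 hash of its URL.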
3. Summary
3.1. 'utf-8' errors: every file involved, whether items, spider, or pipelines, must be saved as UTF-8. The mere sight of this error makes me sick, and yet I never manage to remember it.
3.2. "ROBOTSTXT_OBEY = False": in settings.py this value defaults to True, i.e. obey robots.txt. Leave it at the default and your crawler will stroll in, very politely touch nothing, and hand you a 200 response with none of the data you wanted. The obedient True mode is what search engines typically use; crawlers like this one go beyond what robots.txt permits, which is exactly why site owners don't welcome them.
3.3. This little project dragged on for three days, not because I'd half forgotten the Scrapy workflow, but mainly because I kept "analyzing the page" while actually watching certain Zhihu power users show off (not really), and because of encoding problems. Even though the culprit turned out to be nothing more than not setting 'Accept-Encoding': 'gzip, deflate, br' in the headers, encodings still gave me a thorough mauling once again, and now I have to go back and chew through the encoding chapter because I forgot it yet again.
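My understanding of the root cause (a hedged explanation, not something I traced through the library source): requests only decompresses encodings it can actually handle, and brotli ('br') support needs an extra package, so manually advertising 'br' invites the server to reply with a body the client then can't decode:

import requests

UA = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 '
                    '(KHTML, like Gecko) Chrome/69.0.3497.81 Safari/537.36'}
url = 'https://www.zhihu.com/'  # any page whose server supports brotli

# Claiming brotli support without the brotli package installed can get you
# a still-compressed body that .text renders as mojibake:
bad = requests.get(url, headers={**UA, 'Accept-Encoding': 'gzip, deflate, br'})

# Leave Accept-Encoding alone and the client advertises only what it can
# decode, so .text comes back readable:
good = requests.get(url, headers=UA)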
6000-plus images now... saving them one by one with right-click would have been a chore, though the ones I actually like number only a hundred or so... still a chore, and the code can't pick out the ones I'd like for me...
References:
https://blog.csdn.net/xwbk12/article/details/79009995
https://blog.csdn.net/sinat_34200786/article/details/78449499