Scrapy Framework
1. Introduction
Scrapy is an open-source, collaborative framework that was originally designed for page scraping (more precisely, web scraping): it lets you extract the data you need from websites quickly, simply, and extensibly. Today Scrapy's uses are far broader, covering areas such as data mining, monitoring, and automated testing; it can also fetch data returned by APIs (such as Amazon Associates Web Services) or power general-purpose crawlers. Scrapy is built on Twisted, a popular event-driven Python networking framework, so it uses non-blocking (i.e. asynchronous) code to achieve concurrency. The overall architecture is roughly as follows:
1. Engine (ENGINE)
The engine controls the data flow between all components of the system and triggers events when certain actions occur. See the data-flow section of the official documentation for details.
2. Scheduler (SCHEDULER)
Accepts requests from the engine, pushes them onto a queue, and hands them back when the engine asks again. Think of it as a priority queue of URLs: it decides which URL to crawl next, and it also removes duplicate URLs.
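The "priority queue plus de-duplication" idea can be sketched in plain Python. This is an illustration of the concept only, not Scrapy's actual scheduler (which fingerprints whole requests, not just URLs):

```python
import heapq

class SimpleScheduler:
    """Toy model of Scrapy's scheduler: a priority queue of URLs
    that silently drops duplicates (illustration only)."""

    def __init__(self):
        self._heap = []      # (priority, order, url); lower number is served first
        self._seen = set()   # set used for de-duplication
        self._order = 0      # tie-breaker that preserves insertion order

    def enqueue(self, url, priority=0):
        if url in self._seen:   # duplicate URL: dropped
            return False
        self._seen.add(url)
        heapq.heappush(self._heap, (priority, self._order, url))
        self._order += 1
        return True

    def next_request(self):
        # return the highest-priority URL, or None when the queue is empty
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

Enqueuing the same URL twice returns `False` the second time, and `next_request()` always yields the lowest priority number first.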
3. Downloader (DOWNLOADER)
Downloads page content and returns it to the ENGINE. The downloader is built on Twisted's efficient asynchronous model.
4. Spiders (SPIDERS)
Spiders are developer-defined classes that parse responses, extract items, and issue new requests.
5. Item Pipelines (ITEM PIPELINES)
Process items after they have been extracted; typical work includes cleaning, validation, and persistence (e.g. saving to a database).
6. Downloader Middlewares (DOWNLOADER MIDDLEWARES)
Sit between the Scrapy engine and the downloader, processing requests passed from the ENGINE to the DOWNLOADER and responses passed from the DOWNLOADER back to the ENGINE. You can use this middleware to:
1. process a request just before it is sent to the downloader (i.e. right before Scrapy sends the request to the website)
2. change a received response before passing it to a spider
3. send a new request instead of passing a received response to a spider
4. pass a response to a spider without fetching a web page
5. silently drop some requests
7. Spider Middlewares
Sit between the ENGINE and the SPIDERS; their main job is to process spider input (responses) and spider output (requests).
2. Installation
Windows
(Note: on recent Python versions, pip3 install scrapy alone usually works; the Twisted wheel steps below are only needed when pip cannot build Twisted from source.)
1. pip3 install wheel  (enables installing packages from wheel files; wheel downloads: https://www.lfd.uci.edu/~gohlke/pythonlibs)
2. pip3 install lxml
3. pip3 install pyopenssl
4. pip3 install pywin32
5. Download the Twisted wheel file: http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
6. pip3 install download_dir\Twisted-17.9.0-cp36-cp36m-win_amd64.whl
7. pip3 install scrapy
Linux
1. pip3 install scrapy
3. Command-Line Tool
#1 View help
scrapy -h
scrapy <command> -h
#2 There are two kinds of commands: Project-only commands must be run from inside a project directory, while Global commands can be run anywhere.
Global commands:
startproject  # create a project (worth memorizing)
genspider     # create a spider: scrapy genspider <name> <domain> (worth memorizing)
settings      # if run inside a project directory, shows that project's settings
runspider     # run a standalone Python spider file without creating a project
shell         # scrapy shell <url>: interactive debugging, e.g. to check whether selector rules are correct
fetch         # fetch a single page independently of any project; also shows the request headers
view          # download a page and open it in a browser, handy for spotting which data is loaded by Ajax
version       # scrapy version shows Scrapy's version; scrapy version -v also shows the versions of its dependencies (worth memorizing)
Project-only commands:
crawl         # run a spider; requires a project, and make sure ROBOTSTXT_OBEY = False in settings.py (worth memorizing)
check         # check the project for syntax errors
list          # list the spiders in the project
edit          # open a spider in an editor; rarely used
parse         # scrapy parse <url> --callback <method>: verify that a callback works as expected
bench         # scrapy bench: run a quick benchmark
#3 Official documentation
https://docs.scrapy.org/en/latest/topics/commands.html
4. Project Structure and Spider Application Overview
project_name/
    scrapy.cfg
    project_name/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            spider1.py
            spider2.py
            spider3.py
File overview:
- scrapy.cfg: the project's main configuration, used when deploying Scrapy; crawler-related settings live in settings.py.
- items.py: data-storage templates for structured data, similar to Django's Model.
- pipelines.py: data-processing behavior, e.g. persisting structured data.
- settings.py: configuration such as recursion depth, concurrency, download delay, etc. Note: setting names must be UPPERCASE or they are ignored; correct form: USER_AGENT = 'xxxx'.
- spiders: the spider directory; create files here and write the crawl rules.
Note: spider files are usually named after the target site's domain.
# Create run.py in the project directory (so the spider can be launched from PyCharm):
from scrapy.cmdline import execute
execute(['scrapy', 'crawl', 'spider_name', '--nolog'])  # the list is exactly what you would type on the command line
5. Spiders
import scrapy

class BaiduSpider(scrapy.Spider):
    name = 'tmall'  # identifier used to run the spider (scrapy crawl tmall)
    allowed_domains = ['www.tmall.com']  # allowed domain scope
    start_urls = ['https://www.tmall.com/']  # starting URLs
    custom_settings = {
        'DEFAULT_REQUEST_HEADERS': {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
        }
    }  # per-spider settings; consulted before the project-wide settings

    def start_requests(self):
        yield scrapy.Request(url=self.start_urls[0], callback=self.parse)

    def parse(self, response):  # parse callback
        pass
6. Selectors
import scrapy
from .. import items

class XiaopapaSpider(scrapy.Spider):
    name = 'xiaopapa'
    allowed_domains = ['list.tmall.com']
    start_urls = 'https://list.tmall.com/search_product.htm?q=%C4%D0%D7%B0&totalPage=1&jumpto=1'
    custom_settings = {
        'DEFAULT_REQUEST_HEADERS': {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36 LIEBO"
        }
    }

    def start_requests(self):
        # errback handles any exception raised for this request
        yield scrapy.Request(url=self.start_urls, callback=self.totalPage_parse,
                             dont_filter=True, errback=self.err_case)

    def totalPage_parse(self, response):
        # extract_first() returns the first match; extract() would return a list of all matches
        totalPage = int(response.css('[name="totalPage"]::attr(value)').extract_first())
        # rewrite the URL: fix totalPage, and turn jumpto into a format placeholder
        url = self.start_urls.replace('&totalPage=1&jumpto=1',
                                      '&totalPage=%s&jumpto={}' % totalPage)
        for i in range(1, totalPage + 1):
            next_url = url.format(i)
            # request every page
            yield scrapy.Request(url=next_url, callback=self.parse_info,
                                 dont_filter=True, errback=self.err_case)

    def parse_info(self, response):
        # all product blocks on this page
        product_selector_list = response.css('[class="product "]')
        for product_selector in product_selector_list:
            # images are lazy-loaded, so pass the lazy-load attribute as the
            # default: if src is missing, fall back to data-ks-lazyload
            product_imgurl = product_selector.css('[class="productImg-wrap"] img::attr(src)').extract_first(
                product_selector.css('[class="productImg-wrap"] img::attr(data-ks-lazyload)').extract_first())
            # print(product_imgurl)
            product_title = product_selector.css('[class="productTitle"] a::attr(title)').extract_first(None)
            print(product_title)
            product_price = product_selector.css('[class="productPrice"] em::attr(title)').extract_first(None)
            # print(product_price)
            item = items.DemoItem()
            item['product_imgurl'] = product_imgurl
            item['product_title'] = product_title
            item['product_price'] = product_price
            yield item

    def err_case(self, res):
        print(res)
        print("error callback fired")
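Two of the string tricks in the spider above can be exercised without Scrapy: the totalPage URL rewrite, and the `extract_first(default)` fallback, which simply means "first match, or the given default". A standalone sketch (the URL query here is shortened for readability):

```python
# URL-rewriting trick from totalPage_parse above
start_url = 'https://list.tmall.com/search_product.htm?q=x&totalPage=1&jumpto=1'
totalPage = 3  # pretend this was read from the page

# Fix totalPage to its real value and turn jumpto into a format placeholder.
template = start_url.replace('&totalPage=1&jumpto=1',
                             '&totalPage=%s&jumpto={}' % totalPage)
page_urls = [template.format(i) for i in range(1, totalPage + 1)]

# extract_first(default) boils down to "first element or default"; the same
# pattern covers lazy-loaded images (src missing, data-ks-lazyload present).
def first_or_default(matches, default=None):
    return matches[0] if matches else default

# src list is empty, so the lazy-load value wins
img = first_or_default([], first_or_default(['lazy.jpg']))
```

With `totalPage = 3` this produces three page URLs ending in `jumpto=1` through `jumpto=3`, and `img` resolves to the lazy-load value.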
7. Items
import scrapy

class DemoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    product_imgurl = scrapy.Field()  # one Field per value you extract
    product_title = scrapy.Field()
    product_price = scrapy.Field()
Note: uncomment ITEM_PIPELINES in settings.py; the lower the number, the higher the priority.
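The setting the note refers to looks like the following; the module path is an assumption based on the default layout generated by startproject (your project name may differ):

```python
# settings.py: lower number = higher priority (pipeline runs earlier)
ITEM_PIPELINES = {
    'demo.pipelines.DemoPipeline': 300,
}
```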
8. Item Pipeline
import pymongo

class DemoPipeline(object):
    def __init__(self, MONGO_INFO_DICT, MONGO_DB, MONGO_TABLE):
        self.MONGO_INFO_DICT = MONGO_INFO_DICT
        self.MONGO_DB = MONGO_DB
        self.MONGO_TABLE = MONGO_TABLE

    @classmethod
    def from_crawler(cls, crawler):
        """
        Scrapy first checks (via getattr) whether we defined from_crawler;
        if so, it calls it to build the instance instead of calling
        __init__ directly.
        """
        # crawler.settings is available here, so no import is needed
        MONGO_INFO_DICT = crawler.settings.get('MONGO_INFO_DICT')
        MONGO_DB = crawler.settings.get('MONGO_DB')
        MONGO_TABLE = crawler.settings.get('MONGO_TABLE')
        return cls(MONGO_INFO_DICT, MONGO_DB, MONGO_TABLE)

    def open_spider(self, spider):
        """Runs once when the spider starts."""
        print('spider started')
        self.client = pymongo.MongoClient(**self.MONGO_INFO_DICT)
        self.table = self.client[self.MONGO_DB][self.MONGO_TABLE]

    def close_spider(self, spider):
        """Runs once when the spider closes."""
        print('spider finished')
        self.client.close()

    def process_item(self, item, spider):
        product_dict = dict(item)
        self.table.insert_one(product_dict)  # insert() is deprecated in pymongo 3+
        print("%s written successfully" % product_dict)
        return item
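from_crawler above reads three custom settings; they are not built into Scrapy, so they must be defined in settings.py with exactly these names. The values below are assumptions for illustration:

```python
# settings.py: custom keys read by DemoPipeline.from_crawler
MONGO_INFO_DICT = {'host': 'localhost', 'port': 27017}  # kwargs for MongoClient
MONGO_DB = 'tmall'        # database name (assumed)
MONGO_TABLE = 'products'  # collection name (assumed)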
9. Downloader Middleware
from scrapy.exceptions import IgnoreRequest
from scrapy import signals
from scrapy.http import Response, Request
import requests

# the proxy-pool service comes from a separate project (see notes part 6)
def get_proxy():
    return requests.get("http://127.0.0.1:5010/get/").text

def delete_proxy(proxy):
    requests.get("http://127.0.0.1:5010/delete/?proxy={}".format(proxy))
class DemoSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.
    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Request, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
class DemoDownloaderMiddleware1(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.
    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        print(get_proxy())
        request.meta['proxy'] = 'http://' + get_proxy()
        request.meta['download_timeout'] = 10
        # print(request.meta['proxy'])
        # return Request(url='https://www.baidu.com')

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        print("response middleware 1")
        return response
        # return Request(url='https://www.baidu.com')

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        print("exception middleware 1")
        # drop the dead proxy, then retry the request with a fresh one
        proxy = request.meta.get('proxy').replace('http://', '')
        delete_proxy(proxy)
        request.meta['proxy'] = 'http://' + get_proxy()
        return request

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
Notes:
1. Uncomment DOWNLOADER_MIDDLEWARES in settings.py; the lower the number, the higher the priority.
2. Also pay attention to the return values of process_request and process_response:
   a request exception is passed back starting from the last middleware and then handed to errback;
   a response exception goes straight to errback (process_exception is not called).
3. Proxy configuration
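Enabling the middleware from this section in settings.py might look like the following; the module path is an assumption based on the default project layout:

```python
# settings.py: lower number = higher priority (closer to the engine)
DOWNLOADER_MIDDLEWARES = {
    'demo.middlewares.DemoDownloaderMiddleware1': 543,
}
```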