The Scrapy framework
What is a framework
- A project template that integrates many features and is highly reusable (a half-finished project).
- All you need to learn is what each feature does and how to use it.
What Scrapy provides
- high-performance network requests
- high-performance data parsing
- high-performance persistent storage
- deep crawling
- full-site crawling
- distributed crawling
- middleware
- passing parameters between requests
Environment setup
- mac/linux: pip install scrapy
- Windows:
  - pip install wheel
  - download Twisted from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
  - go to the download directory and run: pip install Twisted-17.1.0-cp35-cp35m-win_amd64.whl  # if the install fails (shows red), download the build matching your Python version
  - pip install pywin32
  - pip install scrapy
Basic usage
- Create a project: scrapy startproject ProName
- Project structure:
  - spiders (package): initially empty
  - settings: the configuration file
    - turn off robots.txt compliance
    - UA spoofing
    - set the log level
- cd ProName: enter the project directory
- Create a spider file inside spiders (the spider folder):
  - scrapy genspider spiderName www.xxx.com
- Write the code: most of the code goes into the spider file
- Run the project: scrapy crawl <spider file name>  # prints the log output
import scrapy

# spider class: the parent class is Spider
class FirstSpider(scrapy.Spider):
    # name of the spider: the unique identifier of this spider source file
    name = 'first'
    # allowed domains:
    # allowed_domains = ['www.baidu.com']
    # list of start urls:
    # every url in this list is requested asynchronously by scrapy
    start_urls = ['https://www.baidu.com/', 'https://www.sogou.com']

    # used for data parsing
    # parameter: response is the response object
    def parse(self, response):
        print(response)
Create a project: scrapy startproject <project name>
Run the project: scrapy crawl <spider file name>  # prints the log output
Before crawling, in settings.py change
ROBOTSTXT_OBEY = True
to False
To suppress log output entirely:
scrapy crawl <spider file name> --nolog
With --nolog nothing is logged, but error messages are hidden as well
To still see error messages, add to settings.py instead:
LOG_LEVEL = 'ERROR'
To write the log to a file:
LOG_FILE = './file.txt'
UA spoofing
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'  # replace the default project UA with a real browser User-Agent
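Collected in one place, a minimal settings.py for these notes might look like the sketch below (the User-Agent is just an example browser string; any current one works):

# settings.py -- a minimal sketch of the options above
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'
ROBOTSTXT_OBEY = False      # do not obey robots.txt
LOG_LEVEL = 'ERROR'         # only show errors in the console
# LOG_FILE = './file.txt'   # optionally also write the log to a file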
Data parsing in Scrapy
- extract() / extract_first(): pull the string data out of the Selector objects returned by xpath()
import scrapy

class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://dig.chouti.com/']

    # data parsing
    def parse(self, response):
        # parse the content and the publishing platform
        div_list = response.xpath('/html/body/main/div/div/div[1]/div/div[2]/div[1]/div')
        for div in div_list:
            # when xpath returns text stored inside a tag, extract()/extract_first() must be used
            # to pull out the actual string
            # content = div.xpath('.//div[@class="link-detail"]/a/text()')[0].extract()
            content = div.xpath('.//div[@class="link-detail"]/a/text()').extract_first()
            print(content)
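A minimal illustration of the difference between the two (any response object behaves the same way):

sel = response.xpath('//a/text()')   # a SelectorList
sel.extract()          # list of all matched strings, e.g. ['news', 'hot', ...]
sel.extract_first()    # the first matched string, or None if nothing matched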
Persistent storage in Scrapy
Persistence via the command line
- can only store the return value of the parse method into a local file (the file extension decides the format, e.g. .json, .csv, .xml)
- scrapy crawl spiderName -o filePath
# persistence via the command line
def parse(self, response):
    all_data = []
    # parse the content and the publishing platform
    div_list = response.xpath('/html/body/main/div/div/div[1]/div/div[2]/div[1]/div')
    for div in div_list:
        # when xpath returns text stored inside a tag, extract()/extract_first() must be used
        # content = div.xpath('.//div[@class="link-detail"]/a/text()')[0].extract()
        content = div.xpath('.//div[@class="link-detail"]/a/text()').extract_first()
        author = 'bobo'
        dic = {
            'content': content,
            'author': author
        }
        all_data.append(dic)
    return all_data
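For the chouti spider above the command could look like this (the extension chosen decides the output format; .csv here is just an arbitrary example):

scrapy crawl chouti -o ./chouti.csv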
Pipeline-based persistent storage (the important one)
- Workflow
  - 1. parse the data in the spider file
  - 2. define the corresponding fields in the Item class
  - 3. store the parsed data in an item object
  - 4. submit the item object to the pipeline
  - 5. the pipeline class's process_item method receives the item and can then persist it in any form
  - 6. enable the pipeline in the settings file
- one pipeline class corresponds to one storage backend
The spider file
# pipeline-based persistence
def parse(self, response):
    # parse the content and the publishing platform
    div_list = response.xpath('/html/body/main/div/div/div[1]/div/div[2]/div[1]/div')
    for div in div_list:
        # when xpath returns text stored inside a tag, extract()/extract_first() must be used
        # content = div.xpath('.//div[@class="link-detail"]/a/text()')[0].extract()
        content = div.xpath('.//div[@class="link-detail"]/a/text()').extract_first()
        author = 'bobo'
        # instantiate an item object
        item = ChoutiproItem()
        # assign values to the item's fields
        item['content'] = content
        item['author'] = author
        yield item  # submit the item to the pipeline
items.py
import scrapy

class ChoutiproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Field can be seen as a universal data type
    content = scrapy.Field()  # receives the parsed data
    author = scrapy.Field()   # receives the parsed data
pipelines
import pymysql

class ChoutiproPipeline(object):
    fp = None
    # overrides the parent method; called exactly once, when the spider starts
    def open_spider(self, spider):
        print('start crawling......')
        self.fp = open('chouti.txt', 'w', encoding='utf-8')

    # called for every item submitted by the spider class; the item is passed in as the item parameter
    def process_item(self, item, spider):
        author = item['author']
        content = item['content']
        self.fp.write(author + ':' + content + '\n')
        return item  # pass the item on to the next pipeline class

    def close_spider(self, spider):
        print('crawling finished!')
        self.fp.close()

class MysqlPileLine(object):
    conn = None
    cursor = None
    def open_spider(self, spider):
        self.conn = pymysql.Connection(host='127.0.0.1', port=3306, user='root', password='123', db='spider', charset='utf8')
        print(self.conn)

    def process_item(self, item, spider):
        content = item['content']
        author = item['author']
        sql = 'insert into chouti values ("%s","%s")' % (author, content)
        self.cursor = self.conn.cursor()
        try:
            self.cursor.execute(sql)
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()
settings
ITEM_PIPELINES = {
    # 300 is the priority; the smaller the number, the higher the priority
    'choutiPro.pipelines.ChoutiproPipeline': 300,
    'choutiPro.pipelines.MysqlPileLine': 301,
}
Crawling image data with Scrapy
- Scrapy ships with a pipeline class (ImagesPipeline) that handles both requesting and persisting image resources
- Workflow:
  - parse the image urls in the spider file
  - put the image url into an item and submit it to the pipeline
  - in the pipeline file, define a custom pipeline class whose parent is ImagesPipeline
  - override three methods:
    - def get_media_requests(self,item,info):
    - def file_path(self,request,response=None,info=None):
    - def item_completed(self,result,item,info):
  - enable the pipeline in the settings file and add IMAGES_STORE = './imgLibs'
Create the project
scrapy startproject imgPro
scrapy genspider img www.xxx.com
In img.py
import scrapy
from imgPro.items import ImgproItem

class ImgSpider(scrapy.Spider):
    name = 'img'
    allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.521609.com/daxuemeinv/']

    def parse(self, response):
        li_list = response.xpath('//*[@id="content"]/div[2]/div[2]/ul/li')
        for li in li_list:
            img_src = 'http://www.521609.com' + li.xpath('./a[1]/img/@src').extract_first()
            item = ImgproItem()
            item['img_src'] = img_src
            yield item
In items.py
import scrapy

class ImgproItem(scrapy.Item):
    img_src = scrapy.Field()
In pipelines.py
# ImagesPipeline can persist not only images but also audio and video resources
from scrapy.pipelines.images import ImagesPipeline
import scrapy

class ImgproPipeline(ImagesPipeline):
    # sends the request for the media resource
    def get_media_requests(self, item, info):
        # send a request for the img_src stored in the item
        yield scrapy.Request(url=item['img_src'])

    # specifies the storage path (folder + file name); the folder comes from IMAGES_STORE
    def file_path(self, request, response=None, info=None):
        return request.url.split('/')[-1]

    # pass the item on to the next pipeline class
    def item_completed(self, result, item, info):
        return item
In settings.py
# folder where the downloaded images are written
IMAGES_STORE = './imgLibs'
ITEM_PIPELINES = {
    'imgPro.pipelines.ImgproPipeline': 300,
}
Manual request sending in Scrapy
- yield scrapy.Request(url,callback)
import scrapy
from imgPro.items import ImgproItem

class ImgSpider(scrapy.Spider):
    name = 'img'
    start_urls = ['http://www.521609.com/daxuemeinv/']
    # a generic url template for the paginated list pages
    url = 'http://www.521609.com/daxuemeinv/list8%d.html'
    pageNum = 1

    def parse(self, response):
        li_list = response.xpath('//*[@id="content"]/div[2]/div[2]/ul/li')
        for li in li_list:
            img_src = 'http://www.521609.com' + li.xpath('./a[1]/img/@src').extract_first()
            item = ImgproItem()
            item['img_src'] = img_src
            yield item

        if self.pageNum < 3:
            self.pageNum += 1
            new_url = format(self.url % self.pageNum)
            # manually send a request for the next page; parse handles the response again
            yield scrapy.Request(new_url, callback=self.parse)
Sending POST requests in Scrapy
- yield scrapy.FormRequest(url,callback,formdata)
Sending POST requests for the start urls
- override the parent class's start_requests(self):

def start_requests(self):
    for url in self.start_urls:
        yield scrapy.FormRequest(url, callback=self.parse, formdata={})
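A sketch of a POST that actually carries form data (post_url and the 'kw' field are hypothetical placeholders, not part of the examples in these notes):

def start_requests(self):
    post_url = 'https://www.xxx.com/search'   # hypothetical POST endpoint
    data = {'kw': 'scrapy'}                   # hypothetical form fields
    yield scrapy.FormRequest(url=post_url, callback=self.parse, formdata=data)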
Improving Scrapy's crawling efficiency
- Increase concurrency: Scrapy runs 16 concurrent requests by default; this can be raised. In settings.py set CONCURRENT_REQUESTS = 100 to allow 100 concurrent requests.
- Lower the log level: running Scrapy produces a lot of log output; to cut CPU usage, restrict logging to INFO or ERROR. In settings.py: LOG_LEVEL = 'ERROR'
- Disable cookies: if cookies are not actually needed, disable them while crawling to reduce CPU usage and speed things up. In settings.py: COOKIES_ENABLED = False
- Disable retries: re-requesting failed HTTP requests (retrying) slows the crawl down, so retries can be turned off. In settings.py: RETRY_ENABLED = False
- Reduce the download timeout: when crawling very slow links, a small timeout lets stuck requests be abandoned quickly. In settings.py: DOWNLOAD_TIMEOUT = 10 sets the timeout to 10 s.
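The same tuning options collected into one settings.py sketch (the values are the ones suggested above; tune them per target site):

CONCURRENT_REQUESTS = 100   # raise concurrency (default is 16)
LOG_LEVEL = 'ERROR'         # less log output, less CPU
COOKIES_ENABLED = False     # skip cookie handling when cookies are not needed
RETRY_ENABLED = False       # do not retry failed requests
DOWNLOAD_TIMEOUT = 10       # give up on downloads that take longer than 10 seconds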
Passing parameters between requests (deep crawling)
- Deep crawling:
  - the data to crawl is not all on the same page.
- How to pass parameters between requests
  - Request(url, callback, meta={}): the meta dict is passed along to callback
  - the callback receives the item through response.meta
Create the project
scrapy startproject moviePro
scrapy genspider movie www.xxx.com
movie
# -*- coding: utf-8 -*-
import scrapy
from moviePro.items import MovieproItem

class MovieSpider(scrapy.Spider):
    name = 'movie'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.4567tv.tv/index.php/vod/show/class/%E5%8A%A8%E4%BD%9C/id/1.html']
    url = 'https://www.4567tv.tv/index.php/vod/show/class/动作/id/1/page/%d.html'
    pageNum = 2

    # parse the movie title and the detail-page url
    def parse(self, response):
        li_list = response.xpath('/html/body/div[1]/div/div/div/div[2]/ul/li')
        for li in li_list:
            title = li.xpath('./div/div/h4/a/text()').extract_first()
            detail_url = 'https://www.4567tv.tv' + li.xpath('./div/div/h4/a/@href').extract_first()
            item = MovieproItem()
            item['title'] = title
            # manual request; the meta dict travels with the request to parse_detail
            yield scrapy.Request(detail_url, callback=self.parse_detail, meta={'item': item})

        if self.pageNum < 5:
            new_url = format(self.url % self.pageNum)
            self.pageNum += 1
            yield scrapy.Request(new_url, callback=self.parse)

    # parse the movie description on the detail page
    def parse_detail(self, response):
        item = response.meta['item']
        desc = response.xpath('/html/body/div[1]/div/div/div/div[2]/p[5]/span[2]/text()').extract_first()
        item['desc'] = desc
        yield item
items
import scrapy

class MovieproItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    desc = scrapy.Field()
pipelines
class MovieproPipeline(object):
    def process_item(self, item, spider):
        print(item)
        return item
settings
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
ROBOTSTXT_OBEY = False
LOG_LEVEL = 'ERROR'
ITEM_PIPELINES = {
    'moviePro.pipelines.MovieproPipeline': 300,
}
Scrapy's five core components
Engine (Scrapy)
Handles the data flow of the whole system and triggers events (the core of the framework).
Scheduler
Accepts requests from the engine, pushes them into a queue, and returns them when the engine asks again. Think of it as a priority queue of URLs (the addresses of the pages to crawl): it decides which URL to crawl next and also removes duplicate URLs.
Downloader
Downloads page content and hands it back to the spiders (the Scrapy downloader is built on top of Twisted, an efficient asynchronous model).
Spiders
The spiders do the main work: they extract the required information, the so-called entities (Items), from specific pages. Links can also be extracted from them so that Scrapy goes on to crawl the next page.
Item pipeline (Pipeline)
Processes the entities the spiders extract; its main jobs are persisting entities, validating them, and removing unneeded information. After a page has been parsed by a spider, the items are sent to the item pipeline and pass through several components in a specific order.
Middleware
- Scrapy middleware
  - Which middlewares exist:
    - downloader middleware (recommended)
    - spider middleware
  - What downloader middleware does
    - intercepts all requests and responses in bulk
  - Why intercept requests
    - to tamper with the request headers (UA)
      - request.headers['User-Agent'] = 'xxxxx'
    - to set a proxy
      - request.meta['proxy'] = 'http://ip:port'
middle
import scrapy

class MiddleSpider(scrapy.Spider):
    name = 'middle'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https123://www.baidu.com/']

    def parse(self, response):
        print(response)
middlewares
from scrapy import signals
import random

user_agent_list = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
    "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 "
    "(KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 "
    "(KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 "
    "(KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 "
    "(KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 "
    "(KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 "
    "(KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 "
    "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 "
    "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]

PROXY_http = [
    'http://153.180.102.104:80',
    'http://195.208.131.189:56055',
]
PROXY_https = [
    'https://120.83.49.90:9000',
    'https://95.189.112.214:35508',
]

class MiddleproDownloaderMiddleware(object):
    # intercepts normal requests
    def process_request(self, request, spider):
        print('i am process_request()')
        # give as many intercepted requests as possible a different (random) UA
        request.headers['User-Agent'] = random.choice(user_agent_list)
        # applying a proxy
        # if request.url.split(':')[0] == 'http':
        #     request.meta['proxy'] = random.choice(PROXY_http)
        # else:
        #     request.meta['proxy'] = random.choice(PROXY_https)
        return None

    # intercepts all responses
    def process_response(self, request, response, spider):
        print('i am process_response')
        return response

    # intercepts requests that raised an exception
    def process_exception(self, request, exception, spider):
        print('i am process_exception')
        # apply the relevant fix, then resend the corrected request object
        return request  # resend the failed request
settings
ROBOTSTXT_OBEY = False
LOG_LEVEL = 'ERROR'
DOWNLOADER_MIDDLEWARES = {
    'middelPro.middlewares.MiddleproDownloaderMiddleware': 543,
}
NetEase news example
- Crawl the titles and content of the news in the NetEase sections: domestic, international, military, aviation, drones
- Analysis:
  - 1. the news data under each section is loaded dynamically
  - 2. the responses of the five sections will be parsed, but those response objects do not contain the dynamically
loaded news data; the section responses obtained so far therefore do not satisfy our needs!!!
  - 3. replace the 5 unsatisfactory response objects (the project produces 1 + 5 + n responses in total) with satisfactory ones.
    - locate those 5 specific responses (done in the downloader middleware)
  - if your redis cannot store a dict, run in cmd (a serialization alternative is sketched right below):
    - pip install -U redis==2.10.6
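An alternative to pinning redis==2.10.6 (a sketch only; the class name WangyiRedisJson is my own, not part of the project below): serialize the item to a JSON string in the pipeline, which any redis-py version will accept.

import json
from redis import Redis

class WangyiRedisJson(object):
    conn = None
    def open_spider(self, spider):
        self.conn = Redis(host='127.0.0.1', port=6379)
    def process_item(self, item, spider):
        # dict(item) turns the scrapy Item into a plain dict, json.dumps into a string
        self.conn.lpush('news', json.dumps(dict(item), ensure_ascii=False))
        return item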
wangyi
import scrapy
from selenium import webdriver
from wangyiPro.items import WangyiproItem

class WangyiSpider(scrapy.Spider):
    name = 'wangyi'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://news.163.com/']
    model_urls = []
    bro = webdriver.Chrome(executable_path=r'E:\Program Files\feiq\Recv Files\爬虫+数据\tools\chromedriver.exe')

    # parse the url of each section
    def parse(self, response):
        li_list = response.xpath('//*[@id="index2016_wrap"]/div[1]/div[2]/div[2]/div[2]/div[2]/div/ul/li')
        indexs = [3, 4, 6, 7, 8]
        for index in indexs:
            li_tag = li_list[index]
            # the url of this section
            model_url = li_tag.xpath('./a/@href').extract_first()
            self.model_urls.append(model_url)
            # request each section url to obtain the page source of that section
            yield scrapy.Request(model_url, callback=self.parse_model_page)

    # parse the news titles on each section page
    def parse_model_page(self, response):
        div_list = response.xpath('/html/body/div[1]/div[3]/div[4]/div[1]/div/div/ul/li/div/div')
        for div in div_list:
            item = WangyiproItem()
            detail_url = div.xpath('./a/@href').extract_first()
            title = div.xpath('./div/div[1]/h3/a/text()').extract_first()
            print(detail_url, title)
            if title and detail_url:
                item['title'] = title
                yield scrapy.Request(detail_url, callback=self.parse_detail, meta={'item': item})

    # parse the news content
    def parse_detail(self, response):
        item = response.meta['item']
        content = response.xpath('//*[@id="endText"]//text()').extract()
        content = ''.join(content)
        item['content'] = content
        yield item

    # called once at the very end; close the browser
    def closed(self, spider):
        self.bro.quit()
items
import scrapy

class WangyiproItem(scrapy.Item):
    title = scrapy.Field()
    content = scrapy.Field()
pipelines
# choose between storing to MySQL or Redis by switching ITEM_PIPELINES in settings
import pymysql

class WangyiMysql(object):
    conn = None
    cursor = None
    def open_spider(self, spider):
        self.conn = pymysql.Connection(host='127.0.0.1', port=3306, user='root', password='123', db='spider', charset='utf8')
        print(self.conn)

    def process_item(self, item, spider):
        sql = 'insert into news values ("%s","%s")' % (item['title'], item['content'])
        self.cursor = self.conn.cursor()
        try:
            self.cursor.execute(sql)
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()

from redis import Redis

class WangyiRedis(object):
    conn = None
    def open_spider(self, spider):
        self.conn = Redis(host='127.0.0.1', port=6379)
        print(self.conn)

    def process_item(self, item, spider):
        self.conn.lpush('news', item)
        return item
settings
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'
ROBOTSTXT_OBEY = False
LOG_LEVEL = 'ERROR'
DOWNLOADER_MIDDLEWARES = {
    'wangyiPro.middlewares.WangyiproDownloaderMiddleware': 543,
}
ITEM_PIPELINES = {
    # 'wangyiPro.pipelines.WangyiMysql': 300,
    'wangyiPro.pipelines.WangyiRedis': 300,
}
middlewares
from time import sleep
from scrapy.http import HtmlResponse

class WangyiproDownloaderMiddleware(object):
    def process_request(self, request, spider):
        return None

    # can intercept all 1 + 5 + n response objects
    def process_response(self, request, response, spider):
        # use the 5 section urls to locate their requests,
        # then use each request to locate the corresponding response;
        # the browser and the url list are passed in through the spider
        bro = spider.bro
        model_urls = spider.model_urls
        if request.url in model_urls:
            # response is one of the 5 section responses;
            # build a new response whose body contains the dynamically loaded news data
            bro.get(request.url)
            sleep(1)
            bro.execute_script('window.scrollTo(0,document.body.scrollHeight)')
            sleep(1)
            bro.execute_script('window.scrollTo(0,document.body.scrollHeight)')
            sleep(1)
            bro.execute_script('window.scrollTo(0,document.body.scrollHeight)')
            sleep(1)
            page_text = bro.page_source
            new_response = HtmlResponse(url=request.url, body=page_text, encoding='utf-8', request=request)
            return new_response
        else:
            return response

    def process_exception(self, request, exception, spider):
        pass
mysql
pipelines
import pymysql

class WangyiMysql(object):
    conn = None
    cursor = None
    def open_spider(self, spider):
        self.conn = pymysql.Connection(host='127.0.0.1', port=3306, user='root', password='123', db='spider', charset='utf8')
        print(self.conn)

    def process_item(self, item, spider):
        sql = 'insert into news values ("%s","%s")' % (item['title'], item['content'])
        self.cursor = self.conn.cursor()
        try:
            self.cursor.execute(sql)
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()
cmd
mysql -uroot -p123
create database spider;
use spider;
create table news (title varchar(200), content varchar(9999));
desc news;
commit;
redis
Using redis with the project
Open a cmd window in the Redis-x64-3.2.100 directory and run each of the following:
redis-server.exe
redis-cli.exe
keys *
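With the WangyiRedis pipeline enabled, the stored items can then be inspected from redis-cli; the key name news comes from the lpush call in the pipeline above:

lrange news 0 10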
Full-site crawling with CrawlSpider
- create a project
- cd into the project
- create the spider file: scrapy genspider -t crawl spiderName www.xxx.com
- link extractor (LinkExtractor)
  - extracts links according to a specified rule
  - the rule is determined by the allow('regular expression') argument of the constructor
- rule parser (Rule)
  - sends requests for the links the link extractor extracted, and parses the responses according to the specified rule (callback)
  - follow=True: apply the link extractor again to the pages that its extracted links point to
Create the project
scrapy startproject sunPro
cd sunPro
scrapy genspider -t crawl sun www.xxx.com  # creates a CrawlSpider
sun
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from sunPro.items import Item1, Item2

class SunSpider(CrawlSpider):
    name = 'sun'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://wz.sun0769.com/index.php/question/questionType?type=4&page=']

    # link extractors
    link = LinkExtractor(allow=r'type=4&page=\d+')
    link_1 = LinkExtractor(allow=r'type=4&page=$')
    link_detail = LinkExtractor(allow=r'question/\d+/\d+\.shtml')

    rules = (
        # instantiate Rule (rule parser) objects
        Rule(link, callback='parse_item', follow=False),
        Rule(link_1, callback='parse_item'),
        Rule(link_detail, callback='parse_detail'),
        # follow=True: apply the link extractor again to the pages its extracted links point to
    )

    # data parsing: parses the pages the link extractor's links point to
    def parse_item(self, response):
        tr_list = response.xpath('//*[@id="morelist"]/div/table[2]//tr/td/table//tr')
        for tr in tr_list:
            title = tr.xpath('./td[2]/a[2]/text()').extract_first()
            status = tr.xpath('./td[3]/span/text()').extract_first()
            item = Item2()
            item['title'] = title
            item['status'] = status
            item['num'] = tr.xpath('./td/text()').extract_first()
            yield item

    # parse the content on the detail page
    def parse_detail(self, response):
        content = response.xpath('/html/body/div[9]/table[2]//tr[1]/td//text()').extract()
        if content:
            content = ''.join(content)
            item = Item1()
            item['content'] = content
            num = response.xpath('/html/body/div[9]/table[1]//tr/td[2]/span[2]/text()').extract_first()
            item['num'] = num.split(':')[-1]
            yield item
items
import scrapy

class Item1(scrapy.Item):
    content = scrapy.Field()
    num = scrapy.Field()

class Item2(scrapy.Item):
    title = scrapy.Field()
    status = scrapy.Field()
    num = scrapy.Field()
pipelines
class SunproPipeline(object):
    def process_item(self, item, spider):
        if item.__class__.__name__ == 'Item1':
            content = item['content']
            num = item['num']
            print(content, num)
        else:
            title = item['title']
            status = item['status']
            num = item['num']
            print(num, status, title)
        return item
settings
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
ROBOTSTXT_OBEY = False
LOG_LEVEL = 'ERROR'
ITEM_PIPELINES = {
    'sunPro.pipelines.SunproPipeline': 300,
}
Distributed crawling
How it is implemented:
distributed crawling built with Scrapy plus the scrapy_redis component (scrapy + redis).
- Plain Scrapy cannot do distributed crawling on its own!!!
Concept:
Set up a cluster of n machines, run the same program on every machine, and let them crawl
the same network resource jointly, splitting the work between them.
Why plain Scrapy cannot be distributed
- the scheduler cannot be shared
- the pipeline cannot be shared
What the scrapy_redis component provides
- a pipeline and a scheduler that can be shared
Steps to set up distributed crawling
- install the environment: pip install scrapy-redis
- create a project
- cd into the project
- create the spider file:
  - based on Spider
  - based on CrawlSpider
- modify the spider file:
  - imports:
    from scrapy_redis.spiders import RedisCrawlSpider  # for CrawlSpider-based spider files
    from scrapy_redis.spiders import RedisSpider  # for Spider-based spider files
  - change the spider's parent class to RedisCrawlSpider
  - delete allowed_domains and start_urls
  - add a redis_key = 'xxx' attribute: the name of the scheduler queue
  - write the rest of the spider code as usual
- modify the settings file
  - specify the pipeline
    ITEM_PIPELINES = {
        'scrapy_redis.pipelines.RedisPipeline': 400
    }
  - specify the scheduler
    # add a dedup container class that uses a Redis set to store request fingerprints, making request dedup persistent
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
    # use scrapy-redis's own scheduler
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    # whether the scheduler persists, i.e. whether the request queue and the dedup fingerprint set in Redis are kept
    # when the spider finishes. True means persist (do not clear the data); False means clear it
    SCHEDULER_PERSIST = True
  - point to the redis database
    REDIS_HOST = '192.168.13.254'
    REDIS_PORT = 6379
- modify the redis configuration file redis.windows.conf
  - turn off the default bind
    - line 56: #bind 127.0.0.1
  - turn off protected mode
    - line 75: protected-mode no
- start the redis server (with the config file) and the redis client
- start the distributed program:
  - scrapy runspider xxx.py
- push a start url into the scheduler's queue
  - the queue lives in redis
  - in the redis client: lpush sun www.xxx.com
- the crawled data can then be viewed in redis
sun refers to the full-site crawling example shown above
fbs
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from fbsPro.items import FbsproItem
from scrapy_redis.spiders import RedisCrawlSpider
from scrapy_redis.spiders import RedisSpider

class FbsSpider(RedisCrawlSpider):
    name = 'fbs'
    # allowed_domains = ['www.xxx.com']
    # start_urls = ['http://www.xxx.com/']
    redis_key = 'sun'  # name of the shared scheduler queue

    rules = (
        Rule(LinkExtractor(allow=r'type=4&page=\d+'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        tr_list = response.xpath('//*[@id="morelist"]/div/table[2]//tr/td/table//tr')
        for tr in tr_list:
            title = tr.xpath('./td[2]/a[2]/text()').extract_first()
            status = tr.xpath('./td[3]/span/text()').extract_first()
            item = FbsproItem()
            item['title'] = title
            item['status'] = status
            yield item
items
import scrapy

class FbsproItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    status = scrapy.Field()
settings
# USER_AGENT is not required here
ROBOTSTXT_OBEY = False

# specify the pipeline
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 400
}

# specify the scheduler
# dedup container class: uses a Redis set to store request fingerprints, making request dedup persistent
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# use scrapy-redis's own scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# whether the scheduler persists, i.e. whether the Redis request queue and fingerprint set are kept when the spider finishes.
# True means persist (do not clear the data); False means clear it
SCHEDULER_PERSIST = True

# point to the redis server
REDIS_HOST = '192.168.13.254'
REDIS_PORT = 6379
Incremental crawling
Concept:
monitoring a site so that only newly added data is crawled
Core technique:
deduplication
Sites suited to incremental crawling
- deep-crawl style
  - keep a record of the urls of the pages already crawled (a record table)
- non-deep-crawl style
  - record table: the data fingerprints of the data already crawled
  - data fingerprint: a unique identifier derived from the raw data
In what form does this record table exist, and where?
- a redis set serves as the record table (a fingerprint sketch follows below)
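For the non-deep case the record table holds data fingerprints instead of urls. A minimal sketch, assuming SHA-256 as the hash (the key name data_fingerprints and the helper is_new are my own, not taken from the example project):

import hashlib
from redis import Redis

conn = Redis(host='127.0.0.1', port=6379)

def is_new(raw_data):
    # the data fingerprint: a unique identifier derived from the raw data
    fingerprint = hashlib.sha256(raw_data.encode('utf-8')).hexdigest()
    # sadd returns 1 if the fingerprint was newly added, 0 if it was already in the set
    return conn.sadd('data_fingerprints', fingerprint) == 1

# only persist records whose fingerprint has not been seen before, e.g.:
# if is_new(title + desc):
#     yield item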
Create the project
zls
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from redis import Redis
from zlsPro.items import ZlsproItem

class ZlsSpider(CrawlSpider):
    name = 'zls'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.4567tv.tv/frim/index1.html']
    conn = Redis(host='127.0.0.1', port=6379)

    rules = (
        Rule(LinkExtractor(allow=r'/frim/index1-\d+\.html'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        li_list = response.xpath('/html/body/div[1]/div/div/div/div[2]/ul/li')
        for li in li_list:
            title = li.xpath('./div/a/@title').extract_first()
            detail_url = 'https://www.4567tv.tv' + li.xpath('./div/a/@href').extract_first()
            ex = self.conn.sadd('movie_urls', detail_url)
            if ex == 1:  # detail_url was not yet in the redis set
                print('new data found......')
                item = ZlsproItem()
                item['title'] = title
                yield scrapy.Request(detail_url, callback=self.parse_detail, meta={'item': item})
            else:  # detail_url is already in the set
                print('no new data to crawl')

    def parse_detail(self, response):
        item = response.meta['item']
        desc = response.xpath('/html/body/div[1]/div/div/div/div[2]/p[5]/span[2]/text()').extract_first()
        item['content'] = desc
        yield item
items
import scrapy

class ZlsproItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    content = scrapy.Field()
pipelines
class ZlsproPipeline(object):
    def process_item(self, item, spider):
        conn = spider.conn
        conn.lpush('movie_data', item)
        return item
settings
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
ROBOTSTXT_OBEY = False
LOG_LEVEL = 'ERROR'
ITEM_PIPELINES = {
    'zlsPro.pipelines.ZlsproPipeline': 300,
}