Basic Usage of the Scrapy Framework
Installing Scrapy:
On Windows:
a. pip3 install wheel
b. Download Twisted from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
c. In the download directory, run: pip3 install Twisted-17.1.0-cp35-cp35m-win_amd64.whl
d. pip3 install pywin32
e. pip3 install scrapy
I. Basic Usage of Scrapy
1. Create a project: scrapy startproject firstBlood
2. Change into the project directory: cd proName
3. Generate a spider file: scrapy genspider first www.example.com (first is the spider file name; www.example.com is the starting URL to crawl)
4. Run the project: scrapy crawl spiderName
Notes on the generated spider file:
# -*- coding: utf-8 -*-
import scrapy

class FirstSpider(scrapy.Spider):
    # unique identifier of this spider
    name = 'first'
    # allowed domains, used to restrict which domains may be crawled
    # allowed_domains = ['www.example.com']
    # start URL list: may only contain URLs
    # every URL in this list is requested by scrapy automatically
    start_urls = ['http://baidu.com/', 'http://www.sogou.com']

    # used for data parsing
    def parse(self, response):
        print(response)
Three settings are usually changed in settings.py:
# 1. robots protocol: do not obey robots.txt
ROBOTSTXT_OBEY = False
# 2. UA spoofing
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36'
# 3. log level
LOG_LEVEL = 'ERROR'
II. Pipeline-based Persistent Storage
1. Parse the data in the spider file
2. Wrap the parsed data in an Item object
3. Submit the Item object to the pipeline
4. Persist the data in the pipeline in any form you like
5. Enable the pipeline in the settings file
2.1 Crawl the titles and content of jandan.net's 设计 (design) tag pages and persist them
Spider file code:
# -*- coding: utf-8 -*-
import scrapy
from JiandanPro.items import JiandanproItem

class JiandanSpider(scrapy.Spider):
    name = 'jiandan'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://jandan.net/tag/设计']

    def parse(self, response):
        div_list = response.xpath('//*[@id="content"]/div')
        for div in div_list:
            title = div.xpath('./div/h2/a/text()').extract_first()
            content = div.xpath('.//div[@class="indexs"]/text()').extract()
            content = ''.join(content)
            if title and content:
                item = JiandanproItem()
                item['title'] = title
                item['content'] = content
                yield item  # submit the item to the pipeline
items.py code:
import scrapy

class JiandanproItem(scrapy.Item):
    # define the fields for your item here like:
    # Field is a universal field type that can hold any kind of value
    title = scrapy.Field()
    content = scrapy.Field()
pipelines.py code:
class JiandanproPipeline(object):
    fp = None

    # overrides a method of the parent class; called exactly once when the spider opens
    def open_spider(self, spider):
        self.fp = open('./data.txt', 'w', encoding='utf-8')
        print('i am open_spider, I am called only once!')

    # receives the item and persists it in any form
    def process_item(self, item, spider):
        title = item['title']
        content = item['content']
        self.fp.write(title + ':' + content + '\n')
        return item  # pass the item on to the next pipeline class to be executed

    # called exactly once when the spider closes
    def close_spider(self, spider):
        self.fp.close()
        print('i am close_spider, I am called only once!')
settings.py configuration:
ITEM_PIPELINES = {
    'JiandanPro.pipelines.JiandanproPipeline': 300,
}
2.2 Backing up the data to MySQL
1. Each pipeline class is responsible for writing the data to one storage platform
2. The item yielded by the spider is only delivered to the pipeline class with the highest priority
3. How do we make every pipeline class receive the item?
Simply return the item at the end of process_item; it is then handed to the next pipeline class.
Create the database and table:
create database spider;
use spider;
create table jiandan(title varchar(300),content varchar(500));
Pipeline code:
import pymysql

class MysqlPipeline(object):
    conn = None    # connection object
    cursor = None  # cursor object

    def open_spider(self, spider):
        self.conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root',
                                    password='wang', db='spider', charset='utf8')
        print(self.conn)

    def process_item(self, item, spider):
        title = item['title']
        content = item['content']
        self.cursor = self.conn.cursor()
        # parameterized query avoids quoting/escaping problems
        sql = 'insert into jiandan values (%s, %s)'
        try:
            self.cursor.execute(sql, (title, content))
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()
Persisting data to Redis:
from redis import Redis

class RedisPipeline(object):
    conn = None

    def open_spider(self, spider):
        self.conn = Redis(host='127.0.0.1', port=6379)

    def process_item(self, item, spider):
        # writing a dict to redis raises an error with newer redis-py versions,
        # so downgrade the client: pip install -U redis==2.10.6
        self.conn.lpush('dataList', item)
        print(item)
        return item
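If you would rather not pin redis to 2.10.6, one common workaround (not part of the original notes) is to serialize the item into a JSON string before pushing it, since newer redis-py versions store strings without complaint. A minimal sketch, reusing the 'dataList' key from above:

import json
from redis import Redis

class RedisJsonPipeline(object):
    conn = None

    def open_spider(self, spider):
        self.conn = Redis(host='127.0.0.1', port=6379)

    def process_item(self, item, spider):
        # dict(item) turns the scrapy Item into a plain dict;
        # json.dumps turns that dict into a string any redis-py version can store
        self.conn.lpush('dataList', json.dumps(dict(item), ensure_ascii=False))
        return item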
P.S.: a pipeline class only takes effect after it is registered in settings.py.
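For reference, a sketch of what the ITEM_PIPELINES setting could look like once all three pipeline classes from this section are enabled; the priority numbers are illustrative (lower numbers run first), and each class must return the item so the next one receives it:

ITEM_PIPELINES = {
    'JiandanPro.pipelines.JiandanproPipeline': 300,  # write to the local file
    'JiandanPro.pipelines.MysqlPipeline': 301,       # back up to MySQL
    'JiandanPro.pipelines.RedisPipeline': 302,       # back up to Redis
}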
III. Manual Request Sending and Full-site Crawling
yield scrapy.Request(url, callback): sends a GET request
yield scrapy.FormRequest(url, formdata, callback): sends a POST request
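As an illustration of the POST form (not taken from the original notes), a minimal sketch that overrides start_requests so the start URL is requested with FormRequest; the URL and the form field are placeholder examples:

import scrapy

class PostDemoSpider(scrapy.Spider):
    name = 'postDemo'
    # placeholder endpoint that accepts POST form data
    start_urls = ['https://fanyi.baidu.com/sug']

    # start_requests is overridden so the start URL is sent as a POST request
    def start_requests(self):
        data = {'kw': 'dog'}  # example form data
        for url in self.start_urls:
            yield scrapy.FormRequest(url=url, formdata=data, callback=self.parse)

    def parse(self, response):
        print(response.text)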
Example: crawl the first six pages of http://wz.sun0769.com/index.php/question/questionType?page=
import scrapy

class SunSpider(scrapy.Spider):
    name = 'sun'
    # allowed_domains = ['www.xx.com']
    start_urls = ['http://wz.sun0769.com/index.php/question/questionType?page=']
    url = 'http://wz.sun0769.com/index.php/question/questionType?page=%d'
    page = 30

    def parse(self, response):
        tr_list = response.xpath('//*[@id="morelist"]/div/table[2]//tr/td/table//tr')
        for tr in tr_list:
            title = tr.xpath('./td[2]/a[2]/text()').extract_first()
            print(title)
        if self.page <= 150:
            new_url = format(self.url % self.page)
            self.page += 30
            # send the next page request manually
            yield scrapy.Request(new_url, callback=self.parse)
IV. The Five Core Components
1. Engine:
Handles the data flow of the whole system and triggers events (the core of the framework)
2. Scheduler:
Accepts requests sent over by the engine, puts them into a queue, and hands them back when the engine asks for them again
- deduplication filter
- queue
3. Downloader:
Downloads page content and returns it; the downloader is built on top of twisted, an efficient asynchronous model
4. Spider:
Extracts the information you need from specific web pages
5. Pipeline:
Processes the entities (items) the spider extracts from the pages; its main job is persistent storage
V. Passing Data Between Requests (meta)
Purpose: lets scrapy perform depth (multi-level) crawling
Depth crawling: the data to be scraped is not all stored on one page
- pass a dict through the meta argument of scrapy.Request(url, callback, meta)
- receive that dict in the callback via response.meta
Spider code:
# -*- coding: utf-8 -*-
import scrapy
from RequestSendPro.items import RequestsendproItem

class ParamdemoSpider(scrapy.Spider):
    name = 'paramDemo'
    # allowed_domains = ['www.xx.com']
    start_urls = ['http://wz.sun0769.com/index.php/question/questionType?page=']
    url = 'http://wz.sun0769.com/index.php/question/questionType?page=%d'
    page = 30

    def parse(self, response):
        tr_list = response.xpath('//*[@id="morelist"]/div/table[2]//tr/td/table//tr')
        for tr in tr_list:
            item = RequestsendproItem()
            title = tr.xpath('./td[2]/a[2]/text()').extract_first()
            item['title'] = title
            detail_url = tr.xpath('./td[2]/a[2]/@href').extract_first()
            # send a request for the detail page;
            # meta passes this dict on to the callback
            yield scrapy.Request(detail_url, callback=self.parse_detail, meta={'item': item})
        if self.page <= 150:
            new_url = format(self.url % self.page)
            self.page += 30
            yield scrapy.Request(new_url, callback=self.parse)

    # parse the news content on the detail page
    def parse_detail(self, response):
        # receive the meta dict
        item = response.meta['item']
        content = response.xpath('/html/body/div[9]/table[2]//tr[1]/td//text()').extract()
        content = ''.join(content)
        item['content'] = content
        yield item
VI. Middleware
Types:
1. Downloader middleware
2. Spider middleware
Purpose: intercept requests and responses in bulk
Why intercept requests:
- to set a proxy
in process_exception():
request.meta['proxy'] = 'http://ip:port'
- to tamper with request headers (e.g. the UA)
in process_request(): request.headers['User-Agent'] = 'xxx'
middlewares.py code:
# -*- coding: utf-8 -*-
from scrapy import signals
import random

# UA pool used for User-Agent spoofing
user_agent_list = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
    "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 "
    "(KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 "
    "(KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 "
    "(KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 "
    "(KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 "
    "(KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 "
    "(KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 "
    "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 "
    "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]

class MiddleproDownloaderMiddleware(object):
    # intercepts every request
    # spider: the instantiated spider object
    def process_request(self, request, spider):
        print('I am process_request')
        # UA spoofing based on the UA pool (random.choice returns a single string)
        request.headers['User-Agent'] = random.choice(user_agent_list)
        # proxy
        # request.meta['proxy'] = 'https://58.246.228.218:1080'
        return None

    # intercepts every response
    def process_response(self, request, response, spider):
        return response

    # intercepts requests that raised an exception
    def process_exception(self, request, exception, spider):
        print('I am process_exception')
        # set a proxy for the failed request
        request.meta['proxy'] = 'https://58.246.228.218:1080'
        # return the corrected request object so it gets re-sent
        return request

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
Spider file code:
import scrapy

class MiddleSpider(scrapy.Spider):
    name = 'middle'
    # allowed_domains = ['www.xx.com']
    start_urls = ['https://www.baidu.com/s?wd=ip']

    def parse(self, response):
        page_text = response.text
        ips = response.xpath('//*[@id="1"]/div[1]/div[1]/div[2]/table//tr/td/span').extract_first()
        print(ips)
        with open('ip.txt', 'w', encoding='utf-8') as f:
            f.write(page_text)
VII. CrawlSpider
Overview: a subclass of Spider
Purpose: used to implement full-site crawling
Usage:
1. Create a project
2. cd ProName
3. scrapy genspider -t crawl spiderName start_url
Link extractor (LinkExtractor): extracts links according to the specified rules
Rule parser (Rule): sends requests for the links extracted by the LinkExtractor and then parses the data according to the specified callback
Spider file code:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class MovieSpider(CrawlSpider):
    name = 'movie'
    # allowed_domains = ['www.xx.com']
    start_urls = ['https://www.4567tv.tv/index.php/vod/show/class/动作/id/1.html']

    # link extractor
    # purpose: extract links (urls) according to the specified rule (allow)
    link = LinkExtractor(allow=r'id/1/page/\d+\.html')
    rules = (
        # instantiate a Rule object
        # Rule: rule parser
        # purpose: sends requests for the links the link extractor found,
        #          then parses the data according to the specified callback
        Rule(link, callback='parse_item', follow=True),
    )

    # used for data parsing
    def parse_item(self, response):
        # parsing
        print(response)
VIII. Distributed Crawling
Concept: build a cluster of machines and have them all run the same program to crawl the same data source together
Implementation: scrapy + redis (scrapy with the scrapy_redis component)
Why native scrapy cannot be shared across machines:
1. The scheduler cannot be shared
2. The pipeline cannot be shared
What the scrapy_redis component provides:
A pipeline and a scheduler that can be shared
Environment setup:
pip install scrapy-redis
Coding steps:
Modify the spider file:
1. Import: from scrapy_redis.spiders import RedisCrawlSpider
2. Change the spider's parent class to RedisCrawlSpider
3. Delete start_urls and allowed_domains
4. Add a redis_key attribute; its value can be any string (it names the shared scheduler queue)
5. Write the rest of the spider as usual (a sketch follows at the end of this section)
6. Edit the settings file:
- specify the pipeline:
ITEM_PIPELINES = {
'scrapy_redis.pipelines.RedisPipeline':400
}
- specify the scheduler:
# add a config for the deduplication (fingerprint) container class
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# use the scheduler provided by scrapy_redis instead of the native one
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# whether the scheduler should persist its queue and fingerprints
SCHEDULER_PERSIST = True
REDIS_HOST = '192.168.2.201'
REDIS_PORT = 6379
- modify the redis configuration file (redis.windows.conf): comment out bind 127.0.0.1 and set protected-mode to no so other machines can connect
- start the redis server and a redis client
- push the starting URL into the shared scheduler queue
- the queue lives in the redis database
- do the push from redis-cli
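The notes end before showing the modified spider file, so here is a minimal sketch of what steps 1-5 above might look like; the project name FbsPro, spider name fbs, the item class, the redis_key value 'sun' and the allow pattern are all hypothetical:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy_redis.spiders import RedisCrawlSpider   # step 1: import
from FbsPro.items import FbsproItem                 # hypothetical item class

class FbsSpider(RedisCrawlSpider):                  # step 2: change the parent class
    name = 'fbs'
    # step 3: start_urls and allowed_domains are removed
    redis_key = 'sun'                               # step 4: name of the shared scheduler queue
    rules = (
        Rule(LinkExtractor(allow=r'type=4&page=\d+'), callback='parse_item', follow=True),
    )

    # step 5: normal data parsing
    def parse_item(self, response):
        item = FbsproItem()
        item['title'] = response.xpath('//title/text()').extract_first()
        yield item

With the cluster running, pushing the starting URL from redis-cli would then look something like lpush sun http://wz.sun0769.com/index.php/question/questionType?page= , where the key has to match the spider's redis_key.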
To be continued...