Web Scraping: The Scrapy Module
1 Introduction to Scrapy
Scrapy is an application framework for crawling websites and extracting structured data, which can be used for a wide range of useful applications such as data mining, information processing, or historical archiving. It offers:
- Full-site data crawling
- XPath support
- Asynchronous downloading
- High-performance persistent storage
- Distributed crawling
Official site: Scrapy | A Fast and Powerful Scraping and Web Crawling Framework
1.1 Installation
# Twisted is an event-driven networking engine written in Python; Scrapy is built on top of it
pip install twisted
# Install Scrapy
pip install scrapy
1.2 Scrapy Global Commands
Usage:
scrapy <command> [options] [args]
Available commands:
bench Run quick benchmark test
fetch Fetch a URL using the Scrapy downloader
genspider Generate new spider using pre-defined templates
runspider Run a self-contained spider (without creating a project)
settings Get settings values
shell Interactive scraping console
startproject Create new project
version Print Scrapy version
view Open URL in browser, as seen by Scrapy
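For example, the downloader and the interactive shell can be used to check selectors before writing a spider (a quick sketch; the URL is just a placeholder):
# fetch a page through the Scrapy downloader
scrapy fetch --nolog http://www.xxx.com > page.html
# open an interactive scraping console on the same page
scrapy shell http://www.xxx.com
>>> response.xpath('//title/text()').extract_first()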
1.3 Scrapy Project Commands
Usage:
scrapy <command> [options] [args]
Available commands:
check Check spider contracts
crawl Run a spider
edit Edit spider
list List available spiders
parse Parse URL (using its spider) and print the results
2 Working with Scrapy
2.1 Creating a Project
# Create the project
scrapy startproject <scrapyPJname>
# Create a spider file
cd <scrapyPJname>
scrapy genspider <spiderName> www.xxx.com
# Run the spider
scrapy crawl <spiderName>
2.2 Configuring the Project
# settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 ' \
'Safari/537.36 '
## Do not obey the robots.txt protocol
ROBOTSTXT_OBEY = False
## Log
LOG_LEVEL = 'ERROR'
LOG_FILE = 'log.txt'
# 300 is the priority: the smaller the number, the higher the priority.
# If several pipeline classes are defined in pipelines.py, an item yielded by the spider
# is handed to the highest-priority pipeline first and passed on via "return item".
ITEM_PIPELINES = {
    'scrapypj01.pipelines.Scrapypj01Pipeline': 300,
}
2.3 Data Parsing
extract(): use when the xpath() result list holds several elements; returns a list of strings
extract_first(): use when the result list holds a single element; returns that string directly (or None)
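A quick illustration of the difference (the selector path here is a placeholder, not taken from the examples below):
titles = div.xpath('./a/text()').extract()        # ['title 1', 'title 2', ...]
title = div.xpath('./a/text()').extract_first()   # 'title 1', or None if nothing matched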
2.4 Persistent Storage Workflow
1. Parse the data
2. Define the corresponding fields in the item class
3. Store the parsed data in an item object, e.g. item['name']
4. Submit the item object to the pipeline
5. The process_item method of the pipeline class receives the item object and persists it in any form
6. "return item" in process_item passes the item on to the next pipeline class to be executed
7. Enable the pipeline in settings.py
3 Examples
3.1 Persistence via a Terminal Command
ctspider.py
import scrapy


class CtspiderSpider(scrapy.Spider):
    name = 'ctspider'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.xxx.com/']

    def parse(self, response):
        data_list = []
        div_list = response.xpath('/html/body/main/div/div/div[1]/div/div[2]/div[1]/div')
        for div in div_list:
            # Note: the elements of the list returned by xpath() are Selector objects; the string
            # data we want is stored inside them and must be pulled out with extract()/extract_first().
            # title = div.xpath('./div/div/div[1]/a/text()')  # [<Selector xpath='./div/div/div[1]/a/text()' data='泽连斯基何以当选《时代》2022年度人物?'>]
            title = div.xpath('./div/div/div[1]/a/text()').extract_first()
            # When the list returned by xpath() holds several Selector objects, index into it and call extract()
            author = div.xpath('./div[1]/div/div[2]/div[1]/span[1]/span/text()')[0].extract()  # ['知世']
            content = div.xpath('./div[1]/div/div[1]/div[3]/text()').extract_first()  # 美国《时代》杂志将乌克兰总统泽连斯基及“乌克兰精神”评为2022年度风云人...
            # Collect the parsed record
            data = {
                'title': title,
                'author': author,
                'content': content
            }
            data_list.append(data)
        return data_list
scrapy crawl ctspider -o ctresult.csv
3.2 Introducing the Item Class
items.py
import scrapy


class Scrapypj01Item(scrapy.Item):
    # define the fields for your item here like:
    # Field is a general-purpose field type that can hold any kind of data
    title = scrapy.Field()
    author = scrapy.Field()
ctspider.py
import scrapy
import scrapypj01.items as items


class CtspiderSpider(scrapy.Spider):
    name = 'ctspider'
    # allowed_domains = ['dig.chouti.com']
    start_urls = ['http://dig.chouti.com/']

    # Terminal-command-based persistence
    def parse(self, response):
        title = response.xpath('/html/body/main/div/div/div[1]/div/div[2]/div[1]/div/div[1]/div/div/div[1]/a/text()').extract_first()
        author = response.xpath('/html/body/main/div/div/div[1]/div/div[2]/div[1]/div/div[1]/div[1]/div/div[2]/div[1]/span[1]/span/text()')[0].extract()
        # Instantiate an item object
        ctitem = items.Scrapypj01Item()
        ctitem['title'] = title
        ctitem['author'] = author
        return ctitem
scrapy crawl ctspider -o ctspider.csv
3.3 Pipeline-Based Persistence: pipelines.py
items.py
import scrapy


class Scrapypj01Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    author = scrapy.Field()
pipelines.py: dedicated to persistent storage
# Pipeline
# The method names open_spider and close_spider must not be changed
class Scrapypj01Pipeline(object):
    fp = None

    def open_spider(self, spider):
        print('Spider started')
        self.fp = open('./ctresult.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        title = item['title']
        author = item['author']
        data = '{0},{1}\n'.format(title, author)
        self.fp.write(data)
        print(data, 'written')
        return item

    def close_spider(self, spider):
        print('Spider finished')
        self.fp.close()
settings.py
ITEM_PIPELINES = {
    'scrapypj01.pipelines.Scrapypj01Pipeline': 300,
}
ctspider.py
import scrapy
import scrapypj01.items as items


class CtspiderSpider(scrapy.Spider):
    name = 'ctspider'
    # allowed_domains = ['dig.chouti.com']
    start_urls = ['http://dig.chouti.com/']

    def parse(self, response):
        div_list = response.xpath('/html/body/main/div/div/div[1]/div/div[2]/div[1]/div')
        for div in div_list:
            title = div.xpath('./div/div/div[1]/a/text()').extract_first()
            author = div.xpath('./div[1]/div/div[2]/div[1]/span[1]/span/text()')[0].extract()
            # Instantiate an item object
            ctitem = items.Scrapypj01Item()
            ctitem['title'] = title
            ctitem['author'] = author
            # Submit the item object to the pipeline
            yield ctitem
3.4 MySQL-Based Persistence
items.py
import scrapy


class Scrapypj01Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    author = scrapy.Field()
pipelines.py: dedicated to persistent storage
# MySQL
import pymysql


# File persistence (same pipeline as in 3.3)
class Scrapypj01Pipeline(object):
    fp = None

    def open_spider(self, spider):
        print('Spider started')
        self.fp = open('./ctresult.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        title = item['title']
        author = item['author']
        data = '{0},{1}\n'.format(title, author)
        self.fp.write(data)
        print(data, 'written')
        return item

    def close_spider(self, spider):
        print('Spider finished')
        self.fp.close()
class MysqlPipeline(object):
    conn = None
    cursor = None

    def open_spider(self, spider):
        print('Spider started')
        self.conn = pymysql.connect(host='10.1.1.8', port=3306, user='root', password='Admin@123', db='spiderdb')

    def process_item(self, item, spider):
        sql = 'insert into ctinfo values(%s,%s)'
        data = (item['title'], item['author'])
        self.cursor = self.conn.cursor()
        try:
            self.cursor.execute(sql, data)
            self.conn.commit()
        except Exception as error:
            print(error)
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        print('Spider finished')
        self.cursor.close()
        self.conn.close()
settings.py
ITEM_PIPELINES = {
    'scrapypj01.pipelines.MysqlPipeline': 301,
}
ctspider.py
import scrapy
import scrapypj01.items as items


class CtspiderSpider(scrapy.Spider):
    name = 'ctspider'
    # allowed_domains = ['dig.chouti.com']
    start_urls = ['http://dig.chouti.com/']

    def parse(self, response):
        div_list = response.xpath('/html/body/main/div/div/div[1]/div/div[2]/div[1]/div')
        for div in div_list:
            title = div.xpath('./div/div/div[1]/a/text()').extract_first()
            author = div.xpath('./div[1]/div/div[2]/div[1]/span[1]/span/text()')[0].extract()
            # Instantiate an item object
            ctitem = items.Scrapypj01Item()
            ctitem['title'] = title
            ctitem['author'] = author
            # Submit the item object to the pipeline
            yield ctitem
3.5 Redis-Based Persistence
items.py
import scrapy


class Scrapypj01Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    author = scrapy.Field()
pipelines.py: dedicated to persistent storage
# Redis
import json

from redis import Redis


class RedisPipeline(object):
    conn = None

    def open_spider(self, spider):
        print('Spider started')
        self.conn = Redis(host='10.1.1.8', port=6379, password='Admin@123')

    def process_item(self, item, spider):
        # redis-py cannot store an Item object directly, so serialize it first (see section 9.3)
        self.conn.lpush('ctlist', json.dumps(dict(item), ensure_ascii=False))
        return item
settings.py
ITEM_PIPELINES = {
    'scrapypj01.pipelines.RedisPipeline': 302,
}
ctspider.py
import scrapy
import scrapypj01.items as items


class CtspiderSpider(scrapy.Spider):
    name = 'ctspider'
    # allowed_domains = ['dig.chouti.com']
    start_urls = ['http://dig.chouti.com/']

    def parse(self, response):
        div_list = response.xpath('/html/body/main/div/div/div[1]/div/div[2]/div[1]/div')
        for div in div_list:
            title = div.xpath('./div/div/div[1]/a/text()').extract_first()
            author = div.xpath('./div[1]/div/div[2]/div[1]/span[1]/span/text()')[0].extract()
            # Instantiate an item object
            ctitem = items.Scrapypj01Item()
            ctitem['title'] = title
            ctitem['author'] = author
            # Submit the item object to the pipeline
            yield ctitem
4 Full-Site Crawling Based on the Spider Parent Class
- Full-site crawling: crawl the data of every page (all page numbers) of a website
- Manually send a GET request:
yield scrapy.Request(url, callback)
- Manually send a POST request (see the sketch below):
yield scrapy.FormRequest(url, formdata, callback)
formdata is a dict holding the request parameters
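A minimal sketch of a manual POST request; the spider name, URL, and form fields are placeholders, not part of the examples that follow:
import scrapy


class PostDemoSpider(scrapy.Spider):
    name = 'postdemo'
    start_urls = ['http://www.xxx.com/post']  # placeholder endpoint

    def start_requests(self):
        for url in self.start_urls:
            # formdata is a dict of request parameters; Scrapy sends it as a form-encoded POST body
            yield scrapy.FormRequest(url=url, formdata={'kw': 'scrapy'}, callback=self.parse)

    def parse(self, response):
        print(response.text)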
4.1 Example
import scrapy
import hySpider.items as items
import json


class HySpider(scrapy.Spider):
    name = 'hy'
    # allowed_domains = ['huya.com']
    start_urls = ['https://www.huya.com/g/xingxiu']
    url = "https://www.huya.com/cache.php?m=LiveList&do=getLiveListByPage&gameId=1663&tagAll=0&callback=getLiveListJsonpCallback&page=%d"

    def parse(self, response):
        li_list = response.xpath('//*[@id="js-live-list"]/li')
        for li in li_list:
            title = li.xpath('./a[2]/text()')[0].extract()
            imgurl = li.xpath('./a[1]/img/@data-original').extract_first().split('?')[0]
            hyitem = items.HyspiderItem()
            hyitem['title'] = title
            hyitem['imgurl'] = imgurl
            yield hyitem
        for page in range(2, 3):
            new_url = format(self.url % page)
            # Send a manual request and let parse_other handle the response via the callback
            yield scrapy.Request(url=new_url, callback=self.parse_other)

    def parse_other(self, response):
        """
        Works like a recursive counterpart of parse, so it takes the same parameters.
        :param response:
        :return:
        """
        res = response.text.replace('getLiveListJsonpCallback(', '')
        res = res[:-1]
        res = json.loads(res)
        if res.get('status') == 200:
            data = res.get('data').get('datas')
            for obj in data:
                title = obj.get('introduction')
                imgurl = obj.get('screenshot')
                hyitem = items.HyspiderItem()
                hyitem['title'] = title
                hyitem['imgurl'] = imgurl
                yield hyitem
5 Scrapy Features
5.1 Tuning Scrapy's Crawling Efficiency
All of the following are set in the settings.py configuration file:
- Increase concurrency: Scrapy's default concurrency (CONCURRENT_REQUESTS) is 16; it can be raised as appropriate in settings.py:
CONCURRENT_REQUESTS = 100
sets the concurrency to 100.
- Lower the log level: running Scrapy produces a large amount of log output; to reduce CPU usage, restrict logging to INFO or ERROR:
LOG_LEVEL = 'INFO'
- Disable cookies: if cookies are not genuinely needed, disabling them during the crawl reduces CPU usage and improves throughput:
COOKIES_ENABLED = False
- Disable retries: re-requesting failed HTTP requests (retrying) slows the crawl down, so retries can be turned off:
RETRY_ENABLED = False
- Reduce the download timeout: when crawling very slow links, a shorter timeout lets stuck requests be abandoned quickly, improving efficiency:
DOWNLOAD_TIMEOUT = 10
sets the timeout to 10 s. A combined settings sketch follows below.
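Put together, a settings.py tuned along these lines might look like the following sketch (the values are illustrative, not prescriptive):
# settings.py -- illustrative tuning values
CONCURRENT_REQUESTS = 100   # raise concurrency (default is 16)
LOG_LEVEL = 'ERROR'         # reduce log output to save CPU
COOKIES_ENABLED = False     # skip cookie handling when the site does not need it
RETRY_ENABLED = False       # do not retry failed requests
DOWNLOAD_TIMEOUT = 10       # abandon a response after 10 s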
5.2 The Five Core Components
Scrapy is built around five core components: the Engine, the Scheduler, the Downloader, the Spiders, and the Item Pipelines, connected by the downloader and spider middlewares. For details see: Scrapy五大核心组件简介 - 帅小伙⊙∪⊙ - 博客园 (cnblogs.com)
5.3 Scrapy Middlewares
Spider Middleware
Downloader Middleware
- Purpose: intercept all requests and responses in bulk
- Why intercept requests?
- To tamper with request headers (UA spoofing)
- To change the IP a request is sent from (proxies)
- Why intercept responses?
- To tamper with the response data or replace the response object
5.3.1 Downloader Middleware
import random

from scrapy import signals
# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter

user_agent_list = [
    'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    'Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11'
]


class MidwareDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    def process_request(self, request, spider):
        # Tamper with the request headers (UA spoofing)
        request.headers['User-Agent'] = random.choice(user_agent_list)
        print(request.headers['User-Agent'])
        # Proxy
        request.meta['proxy'] = 'http://121.13.252.61:41564'
        print(request.meta['proxy'])
        return None

    def process_response(self, request, response, spider):
        # Tamper with the response here
        return response

    def process_exception(self, request, exception, spider):
        # Intercept request objects whose download raised an exception
        pass
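For the middleware to take effect it also has to be registered in settings.py; a minimal sketch, where the module path is an assumption based on the class name above:
DOWNLOADER_MIDDLEWARES = {
    'midware.middlewares.MidwareDownloaderMiddleware': 543,  # module path assumed, adjust to your project
}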
5.3.2 Using Selenium to Capture Dynamically Loaded Pages
- wyspider.py
import scrapy
import wy.items as items
from selenium import webdriver
from selenium.webdriver.chrome.service import Service


class WyspiderSpider(scrapy.Spider):
    name = 'wyspider'
    # allowed_domains = ['163.com']
    start_urls = ['https://news.163.com/']
    module_list = []
    chrome_options = webdriver.ChromeOptions()
    # Work around SSL certificate errors
    chrome_options.add_argument('--ignore-certificate-errors')
    chrome_options.add_argument('--ignore-ssl-errors')
    # Suppress useless log output
    chrome_options.add_experimental_option("excludeSwitches", ['enable-automation', 'enable-logging'])
    chrome_options.binary_location = r'C:\Program Files\Google\Chrome Dev\Application\chrome.exe'
    ser = Service(r'chromedriver.exe')
    browser = webdriver.Chrome(service=ser, options=chrome_options)

    def parse(self, response):
        # Extract the URL of each news section
        modules_list = response.xpath('//*[@id="index2016_wrap"]/div[3]/div[2]/div[2]/div[2]/div/ul/li')
        # Keep only the sections we want
        module_index = [2, 5]
        for index in module_index:
            module_url = modules_list[index].xpath('./a/@href').extract_first()
            self.module_list.append(module_url)
            # Manually send a request for each filtered section URL
            yield scrapy.Request(url=module_url, callback=self.parse_module)

    def parse_module(self, response):
        # Parses the news titles and detail-page URLs of each section page.
        # Note: the response received here does NOT yet contain the dynamically loaded
        # news data (i.e. it is a response that does not meet our needs).
        news_list = response.xpath('/html/body/div/div[3]/div[4]/div[1]/div[1]/div/ul/li/div/div')
        for news_li in news_list:
            title = news_li.xpath('./div/h3/a/text()').extract_first()
            news_url = news_li.xpath('./div/h3/a/@href').extract_first()
            wyitem = items.WyItem()
            wyitem['title'] = title
            print(news_url)
            yield scrapy.Request(url=news_url, callback=self.parse_url, meta={'wyitem': wyitem})

    def parse_url(self, response):
        # Parse the news detail page
        wyitem = response.meta['wyitem']
        news_content = response.xpath('//*[@id="content"]/div[2]//text()').extract()
        news_content = ''.join(news_content)
        print(news_content)
        wyitem['news_content'] = news_content
        yield wyitem

    def close(self, spider, reason):
        # Runs only once, when the whole crawl finishes
        self.browser.quit()
- items.py
import scrapy


class WyItem(scrapy.Item):
    title = scrapy.Field()
    news_content = scrapy.Field()
- middlewares.py
from time import sleep

from scrapy import signals
from scrapy.http import HtmlResponse
# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter


class WyDownloaderMiddleware:
    def process_request(self, request, spider):
        request.headers['User-Agent'] = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'
        return None

    def process_response(self, request, response, spider):
        """
        :param request: the request object that produced this response (its URL is checked below)
        :param response: the original response; its body can be replaced with Selenium's page_source
        :param spider: the spider instance, which lets the middleware and the spider share data
        :return: for the filtered section pages, a new response that meets our needs; otherwise the original response
        """
        if request.url in spider.module_list:
            browser = spider.browser
            browser.get(request.url)
            sleep(20)
            # page_source now contains the dynamically loaded news data
            page_text = browser.page_source
            news_response = HtmlResponse(url=request.url, body=page_text, encoding='utf-8', request=request)
            return news_response
        else:
            return response
- pipelines.py
import json

from redis import Redis


class RedisPipeline(object):
    conn = None

    def open_spider(self, spider):
        print('Spider started')
        self.conn = Redis(host='10.1.1.5', port=6379, decode_responses=True)

    def process_item(self, item, spider):
        print('item: ', item)
        # redis-py cannot store an Item object directly, so serialize it first (see section 9.3)
        self.conn.lpush('wylist', json.dumps(dict(item), ensure_ascii=False))
        return item

    def close_spider(self, spider):
        print('Spider finished')
- settings.py
ROBOTSTXT_OBEY = False
## Log
LOG_LEVEL = 'ERROR'
LOG_FILE = 'log.txt'
DOWNLOADER_MIDDLEWARES = {
    'wy.middlewares.WyDownloaderMiddleware': 543,
}
ITEM_PIPELINES = {
    # 'wy.pipelines.WyPipeline': 300,
    'wy.pipelines.RedisPipeline': 300,
}
5.4 Lazy-Loaded Images and Efficient Media Downloads in Scrapy
- Image lazy loading: relies on a pseudo attribute on the tag (e.g. data-original instead of src); the data must be captured through that pseudo attribute
- ImagesPipeline: the pipeline class Scrapy provides specifically for downloading and persisting binary data
- bgimgspider.py
import scrapy

from bgimg.items import BgimgItem


class BgimgspiderSpider(scrapy.Spider):
    name = 'bgimgspider'
    # allowed_domains = ['xx.cm']
    start_urls = ['https://sc.chinaz.com/tupian/bangongrenwu.html']

    def parse(self, response):
        img_list = response.xpath('/html/body/div[3]/div[2]/div')
        for img in img_list:
            imgtitle = img.xpath('./div/a/text()').extract_first()
            # data-original holds the real image URL before lazy loading swaps it into src
            imgurl = img.xpath('./img/@data-original').extract_first()
            bgimgitem = BgimgItem()
            bgimgitem['imgtitle'] = imgtitle
            bgimgitem['imgurl'] = 'https:' + imgurl
            yield bgimgitem
- items.py
import scrapy


class BgimgItem(scrapy.Item):
    imgtitle = scrapy.Field()
    imgurl = scrapy.Field()
- pipelines.py
import scrapy
from scrapy.pipelines.images import ImagesPipeline


class BgimgPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        """
        Requests (downloads) the media resource.
        :param item: the item object submitted by the spider
        :param info:
        :return:
        """
        yield scrapy.Request(item['imgurl'])

    def file_path(self, request, response=None, info=None, *, item=None):
        """
        Specifies the storage path of the file. The folder itself is configured in settings.py:
        IMAGES_STORE = './imglibs'
        :param request:
        :param response:
        :param info:
        :param item:
        :return:
        """
        return item['imgtitle'] + request.url.split('/')[-1]

    def item_completed(self, results, item, info):
        """
        Passes the item on to the next pipeline class to be executed.
        :param results:
        :param item:
        :param info:
        :return:
        """
        return item
- middlewares.py
class BgimgDownloaderMiddleware:
    def process_request(self, request, spider):
        request.headers['User-Agent'] = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'
        return None

    def process_response(self, request, response, spider):
        return response

    def process_exception(self, request, exception, spider):
        pass
- settings.py
IMAGES_STORE = './imglibs'
ROBOTSTXT_OBEY = False
LOG_LEVEL = 'ERROR'
LOG_FILE = 'log.txt'
DOWNLOADER_MIDDLEWARES = {
    'bgimg.middlewares.BgimgDownloaderMiddleware': 543,
}
ITEM_PIPELINES = {
    'bgimg.pipelines.BgimgPipeline': 300,
}
6 Scrapy Full-Site Data Crawling (CrawlSpider)
- csspider.py
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

import crawlspider.items as items


class CsspiderSpider(CrawlSpider):
    name = 'csspider'
    # allowed_domains = ['xx.com']
    start_urls = ['https://wz.sun0769.com/political/index/politicsNewest?id=1&page=1']
    # Instantiate link extractors that pick up links matching the rule given in allow=<regex>
    # This one extracts the pagination links
    link = LinkExtractor(allow=r'politicsNewest\?id=1&page=\d+')
    # This one extracts the links to the news detail pages
    link_detail = LinkExtractor(allow=r'politics/index\?id=\d+')
    rules = (
        # The link extractor is passed as the first argument of Rule
        # Every link the extractor finds is requested, and the response is parsed by the given callback
        # follow=True: keep applying the link extractor to the pages reached through the extracted links
        Rule(link, callback='parse_item', follow=False),
        Rule(link_detail, callback='parse_detail_item', follow=False),
    )

    def parse_item(self, response):
        case_list = response.xpath('/html/body/div[2]/div[3]/ul[2]/li')
        for case in case_list:
            item = items.CrawlspiderItem()
            item['case_num'] = case.xpath('./span[1]/text()').extract_first()
            item['case_status'] = case.xpath('./span[2]/text()').extract_first().strip()
            item['case_name'] = case.xpath('./span[3]/a/text()').extract_first()
            yield item

    def parse_detail_item(self, response):
        item = items.CrawlspiderItem2()
        case_num = response.xpath('/html/body/div[3]/div[2]/div[2]/div[1]/span[5]/text()').extract_first().split(':')[-1]
        case_detail = response.xpath('/html/body/div[3]/div[2]/div[2]/div[2]/pre/text()').extract_first()
        item['case_num'] = case_num
        item['case_detail'] = case_detail
        yield item
- items.py
import scrapy


class CrawlspiderItem(scrapy.Item):
    case_num = scrapy.Field()
    case_status = scrapy.Field()
    case_name = scrapy.Field()


class CrawlspiderItem2(scrapy.Item):
    case_num = scrapy.Field()
    case_detail = scrapy.Field()
- pipelines.py
class CrawlspiderPipeline:
    def process_item(self, item, spider):
        # Tell the two item types apart by class name
        if item.__class__.__name__ == 'CrawlspiderItem':
            print(item)
        else:
            case_num = item['case_num']
            case_detail = item['case_detail']
            print(case_num, case_detail)
        return item
- middlewares.py
class CrawlspiderDownloaderMiddleware:
    def process_request(self, request, spider):
        request.headers['User-Agent'] = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'
        return None

    def process_response(self, request, response, spider):
        return response

    def process_exception(self, request, exception, spider):
        pass
- settings.py
ROBOTSTXT_OBEY = False
LOG_LEVEL = 'ERROR'
LOG_FILE = 'log.txt'
DOWNLOADER_MIDDLEWARES = {
    'crawlspider.middlewares.CrawlspiderDownloaderMiddleware': 543,
}
ITEM_PIPELINES = {
    'crawlspider.pipelines.CrawlspiderPipeline': 300,
}
7 Distributed Crawling with Scrapy
- Distributed crawling requires a cluster of machines that all run the same program and jointly crawl the data of one website.
- Why can't native Scrapy do distributed crawling on its own?
- Because the scheduler cannot be shared between machines
- Because the pipelines cannot be shared between machines
- How is distributed crawling implemented?
- With scrapy + scrapy-redis
- What does the scrapy-redis component provide?
- A scheduler and pipelines that can be shared across the cluster
- Limitation: the scraped data can only be stored in Redis.
7.1 Crawl Workflow
Steps to set up the distributed crawl:
- pip install scrapy-redis
- Create the project:
scrapy startproject scrapyredis
cd scrapyredis
scrapy genspider -t crawl scrapyredisspider xx.com
- Modify the spider class:
- Import: from scrapy_redis.spiders import RedisCrawlSpider
- Change the spider's parent class to RedisCrawlSpider
- Delete allowed_domains and start_urls
- Add a new attribute redis_key = 'fbsqueue', the name of the shared scheduler queue
- Write the rest of the spider as usual:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_redis.spiders import RedisCrawlSpider

from scrapyredis.items import ScrapyredisItem


class ScrapyredisspiderSpider(RedisCrawlSpider):
    name = 'scrapyredisspider'
    # allowed_domains = ['xx.com']
    # start_urls = ['http://xx.com/']
    # Name of the shared scheduler queue
    redis_key = 'fbsqueue'
    link = LinkExtractor(allow=r'politicsNewest\?id=1&page=\d+')
    rules = (
        Rule(link, callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        case_list = response.xpath('/html/body/div[2]/div[3]/ul[2]/li')
        for case in case_list:
            item = ScrapyredisItem()
            item['case_num'] = case.xpath('./span[1]/text()').extract_first()
            item['case_status'] = case.xpath('./span[2]/text()').extract_first().strip()
            item['case_name'] = case.xpath('./span[3]/a/text()').extract_first()
            yield item
- Configure items.py:
import scrapy


class ScrapyredisItem(scrapy.Item):
    case_num = scrapy.Field()
    case_status = scrapy.Field()
    case_name = scrapy.Field()
- Modify the settings.py configuration:
# UA spoofing
USER_AGENT = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
LOG_LEVEL = 'ERROR'
LOG_FILE = './log.txt'
# Enable the pipeline shipped with scrapy_redis
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 400
}
# Enable the scheduler shipped with scrapy_redis
## Dedup filter class: uses a Redis set to store request fingerprints, making request deduplication persistent
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
## Use the scrapy-redis scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
## Whether the scheduler state persists: if True, the request queue and the dedup fingerprint set in Redis
## are NOT cleared when the crawl ends; if False, they are cleared
SCHEDULER_PERSIST = True
# Redis connection used by the crawler
REDIS_HOST = '192.168.50.118'
REDIS_PORT = 6379
REDIS_ENCODING = 'utf-8'
# REDIS_PARAMS = {'password':'123456'}
- Modify the redis.conf configuration:
# Change the default bind address
bind 0.0.0.0
# Disable protected mode
protected-mode no
- Start crawling:
# Start the redis server
systemctl start redis
# Run the spider
scrapy runspider .\scrapyredis\spiders\scrapyredisspider.py
# Push a starting URL into the scheduler queue, which lives in Redis
127.0.0.1:6379> lpush fbsqueue https://wz.sun0769.com/political/index/politicsNewest?id=1&page=
8 Incremental Crawling with Scrapy
20.scrapy框架之增量式爬虫 - 盛夏中为你花开彼岸 - 博客园 (cnblogs.com)
8.1 Approaches to Incremental Crawling
- Before sending a request, check whether the URL has been crawled before
- Store the URLs that are about to be crawled in a Redis set
- Deduplicate on the crawled data itself and only then persist it (after parsing, check whether this content has been scraped before); see the fingerprint sketch below
- Generate a unique identifier for each scraped record (it can be used as a MySQL column, or stored in a Redis set)
- When writing to the storage medium, check whether the content already exists there
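For the unique-identifier approach, one option is to hash the item's content; the sketch below is a hypothetical pipeline (the Redis host and key names are placeholders) that keeps fingerprints in a Redis set and drops items it has already seen:
import hashlib
import json

from redis import Redis
from scrapy.exceptions import DropItem


class DedupPipeline:
    """Hypothetical pipeline: drop items whose content was already stored."""

    def open_spider(self, spider):
        self.conn = Redis(host='127.0.0.1', port=6379)

    def process_item(self, item, spider):
        # Build a fingerprint from the item's content
        fingerprint = hashlib.md5(
            json.dumps(dict(item), sort_keys=True, ensure_ascii=False).encode('utf-8')
        ).hexdigest()
        # sadd returns 1 if the fingerprint is new, 0 if it was already in the set
        if self.conn.sadd('item_fingerprints', fingerprint):
            return item
        raise DropItem('duplicate item: %s' % fingerprint)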
8.2 Implementation
- zlsspider.py
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from redis import Redis

from zls.items import ZlsItem


class ZlsspiderSpider(CrawlSpider):
    conn = Redis(host='192.168.50.118', port=6379)
    name = 'zlsspider'
    # allowed_domains = ['xx.com']
    start_urls = ['https://www.4567kp.com/frim/index1.html']
    # If no Rule matches, the page is not parsed
    rules = (
        Rule(LinkExtractor(allow=r'frim/index1.html'), callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        movie_list = response.xpath('/html/body/div[2]/div/div[3]/div/div[2]/ul/li')
        for movie in movie_list:
            movie_detail_url = 'https://www.4567kp.com' + movie.xpath('./div/a/@href').extract_first()
            movie_name = movie.xpath('./div/a/@title').extract_first()
            item = ZlsItem()
            item['movie_name'] = movie_name
            # sadd returns 1 only if this detail URL has not been seen before
            ex = self.conn.sadd('movie_detail_urls', movie_detail_url)
            if ex == 1:
                print('New data detected, crawling it!')
                yield scrapy.Request(url=movie_detail_url, callback=self.parse_detail, meta={'item': item})
            else:
                print('No new data at the moment.')

    def parse_detail(self, response):
        item = response.meta['item']
        movie_detail = response.xpath(
            '/html/body/div[2]/div/div[1]/div[5]/div/div[2]/div/span[2]/text()').extract_first()
        item['movie_detail'] = movie_detail
        return item
- items.py
import scrapy


class ZlsItem(scrapy.Item):
    movie_name = scrapy.Field()
    movie_detail = scrapy.Field()
- pipelines.py
import json


# Key point: serialize the dict before pushing it into Redis
class ZlsPipeline:
    def process_item(self, item, spider):
        conn = spider.conn
        dic = {
            "movie_name": item["movie_name"],
            "movie_detail": item["movie_detail"]
        }
        conn.lpush('movieData', json.dumps(dic))
        return item
- settings.py
ROBOTSTXT_OBEY = False
LOG_LEVEL = 'ERROR'
LOG_FILE = 'log.txt'
USER_AGENT = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'
ITEM_PIPELINES = {
    'zls.pipelines.ZlsPipeline': 300,
}
9 Problems Encountered While Parsing Data
9.1 Parsing JSON data
Take the Huya response as an example: when selecting the second page, the data comes back via an AJAX request, and the response body looks like this:
# When the callback parameter in the request URL has a value:
https://www.huya.com/cache.php?m=LiveList&do=getLiveListByPage&gameId=1663&tagAll=0&callback=getLiveListJsonpCallback&page=2
getLiveListJsonpCallback({"status":200,"message":"","data":{...}})
# When the callback parameter in the request URL is left empty:
https://www.huya.com/cache.php?m=LiveList&do=getLiveListByPage&gameId=1663&tagAll=0&callback=&page=2
{"status":200,"message":"","data":{...}}
9.2 If no Rule matches, the page content is not parsed
# If no Rule is matched, the site is not parsed
rules = (
    Rule(LinkExtractor(allow=r'frim/index1.html'), callback='parse_item', follow=False),
)
9.3 Pushing Scrapy items into Redis with the Python redis library
# Key point: convert the item to a dict and serialize it before pushing it into Redis
class ZlsPipeline:
    def process_item(self, item, spider):
        conn = spider.conn
        dic = {
            "movie_name": item["movie_name"],
            "movie_detail": item["movie_detail"]
        }
        conn.lpush('movieData', json.dumps(dic))
        return item