Web Crawling: The Scrapy Module

The Scrapy Module

1 Introduction to Scrapy

Scrapy is an application framework for crawling websites and extracting structured data, which can be used for a wide range of useful applications such as data mining, information processing, and historical archiving. Its main features:

  • Full-site data crawling
  • XPath support
  • Asynchronous data downloading
  • High-performance persistent storage
  • Distributed crawling (via the scrapy-redis extension, see Section 7)

Official site: Scrapy | A Fast and Powerful Scraping and Web Crawling Framework

1.1 Installation

# Twisted is an event-driven networking engine written in Python; Scrapy is built on top of Twisted
pip install twisted

# Install Scrapy
pip install scrapy

1.2 Scrapy Global Commands

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy
    

1.3 Scrapy Project Commands

Usage:
  scrapy <command> [options] [args]

Available commands:
  check         Check spider contracts
  crawl         Run a spider
  edit          Edit spider
  list          List available spiders
  parse         Parse URL (using its spider) and print the results

2 Working with Scrapy

2.1 Creating a Project

# Create the project
scrapy startproject <scrapyPJname>

# Create a spider file
cd <scrapyPJname>
scrapy genspider <spiderName> www.xxx.com

# Run the spider
scrapy crawl <spiderName>

2.2 配置项目文件

# settings.py

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 Safari/537.36'

## Do not obey the robots.txt protocol
ROBOTSTXT_OBEY = False

## Log
LOG_LEVEL = 'ERROR'
LOG_FILE = 'log.txt'

# 300 is the priority: the lower the number, the higher the priority
# If multiple pipeline classes are defined in pipelines.py, the item yielded by the spider goes to the highest-priority pipeline class first (see the example after this block)
ITEM_PIPELINES = {
   'scrapyPJ01.pipelines.Scrapypj01Pipeline': 300,
}
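
If pipelines.py defines a second pipeline class, both are registered with their priorities; the lower number runs first, and each process_item must return the item so the next class receives it. A minimal sketch (the MysqlPipeline entry below is purely illustrative):

# settings.py: hypothetical example with two pipeline classes
ITEM_PIPELINES = {
   'scrapyPJ01.pipelines.Scrapypj01Pipeline': 300,  # runs first (smaller number = higher priority)
   'scrapyPJ01.pipelines.MysqlPipeline': 301,       # receives the item returned by the pipeline above
}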

2.3 Data Parsing

extract(): returns every matched element as a list of strings; use it when the XPath result list contains multiple elements
extract_first(): returns only the first matched element as a single string (or None if nothing matched); see the example below
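
A minimal sketch of the difference, e.g. inside scrapy shell <url> (the XPath and the result values are illustrative only):

titles = response.xpath('//div[@class="item"]/a/text()')  # SelectorList
titles.extract()        # all matches as a list of strings, e.g. ['title 1', 'title 2']
titles.extract_first()  # only the first match as a string, or None if nothing matched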

2.4 Persistent Storage Workflow

1. Parse the data
2. Define the corresponding fields in the item class
3. Store the parsed data in an item object, e.g. item['name']
4. Yield the item object to the pipeline
5. The process_item method of the pipeline class receives the item and persists it in any form
6. Returning item from process_item passes the item on to the next pipeline class to be executed
7. Enable the pipeline in settings.py (a minimal sketch tying these steps together follows; Section 3 has complete examples)
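
A minimal sketch tying these steps together (all names and paths are illustrative; Section 3 walks through complete projects):

# items.py: step 2, define the fields
import scrapy

class DemoItem(scrapy.Item):
    name = scrapy.Field()

# demospider.py: steps 1, 3 and 4
class DemoSpider(scrapy.Spider):
    name = 'demospider'
    start_urls = ['https://example.com/']   # illustrative URL

    def parse(self, response):
        item = DemoItem()
        item['name'] = response.xpath('//h1/text()').extract_first()  # 1. parse, 3. store in the item
        yield item                                                    # 4. hand the item to the pipeline

# pipelines.py: steps 5 and 6
class DemoPipeline:
    def process_item(self, item, spider):
        print(item['name'])   # 5. persist in any form (print, file, DB, ...)
        return item           # 6. pass the item on to the next pipeline class

# settings.py: step 7, enable the pipeline
# ITEM_PIPELINES = {'demo.pipelines.DemoPipeline': 300}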

3 Examples

3.1 Terminal-Command-Based Persistent Storage

  • ctspider.py
import scrapy


class CtspiderSpider(scrapy.Spider):
    name = 'ctspider'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.xxx.com/']

    def parse(self, response):
        data_list = []
        div_list = response.xpath('/html/body/main/div/div/div[1]/div/div[2]/div[1]/div')
        for div in div_list:
            # Note: the elements of the list returned by xpath() are Selector objects; the string data we want is stored inside them
            # an extract() call is required to pull the stored string out of the Selector object
            # title = div.xpath('./div/div/div[1]/a/text()')  # [<Selector xpath='./div/div/div[1]/a/text()' data='泽连斯基何以当选《时代》2022年度人物?'>]
            title = div.xpath('./div/div/div[1]/a/text()').extract_first()
            # when xpath() returns a list with several Selector objects, index into it and call extract()
            author = div.xpath('./div[1]/div/div[2]/div[1]/span[1]/span/text()')[0].extract()  # ['知世']
            content = div.xpath('./div[1]/div/div[1]/div[3]/text()').extract_first()  # 美国《时代》杂志将乌克兰总统泽连斯基及“乌克兰精神”评为2022年度风云人...
            # collect the results into a list of dicts
            data = {
                'title':title,
                'author':author,
                'content':content
            }
            data_list.append(data)
        return data_list

  • scrapy crawl ctspider -o ctresult.csv

3.2 Introducing Items

  • items.py
import scrapy


class Scrapypj01Item(scrapy.Item):
    # define the fields for your item here like:
    # Field is a catch-all field type that can hold any kind of data
    title = scrapy.Field()
    author = scrapy.Field()

  • ctspider.py
import scrapy
import scrapypj01.items as items


class CtspiderSpider(scrapy.Spider):
    name = 'ctspider'
    # allowed_domains = ['dig.chouti.com']
    start_urls = ['http://dig.chouti.com/']

    # Terminal-command-based persistence (feed export via -o)
    def parse(self, response):
        title = response.xpath('/html/body/main/div/div/div[1]/div/div[2]/div[1]/div/div[1]/div/div/div[1]/a/text()').extract_first()
        author = response.xpath('/html/body/main/div/div/div[1]/div/div[2]/div[1]/div/div[1]/div[1]/div/div[2]/div[1]/span[1]/span/text()')[0].extract()

        # Instantiate an item object
        ctitem = items.Scrapypj01Item()
        ctitem['title'] = title
        ctitem['author'] = author

        return ctitem

  • scrapy crawl ctspider -o ctspider.csv

3.3 Pipeline-Based Persistent Storage: pipelines.py

  • items.py
import scrapy


class Scrapypj01Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    author = scrapy.Field()

  • pipelines.py: dedicated to persistent storage
# Pipeline
# The method names open_spider and close_spider must not be changed
class Scrapypj01Pipeline(object):
    fp = None

    def open_spider(self, spider):
        print('Spider started')
        self.fp = open('./ctresult.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        title = item['title']
        author = item['author']
        data = '{0},{1}\n'.format(title, author)
        self.fp.write(data)
        print(data, 'written successfully')
        return item

    def close_spider(self, spider):
        print('Spider finished')
        self.fp.close()
  • settings.py
ITEM_PIPELINES = {
   'scrapypj01.pipelines.Scrapypj01Pipeline': 300,
}
  • ctspider.py
import scrapy
import scrapypj01.items as items


class CtspiderSpider(scrapy.Spider):
    name = 'ctspider'
    # allowed_domains = ['dig.chouti.com']
    start_urls = ['http://dig.chouti.com/']

    def parse(self, response):
        div_list = response.xpath('/html/body/main/div/div/div[1]/div/div[2]/div[1]/div')
        for div in div_list:
            title = div.xpath('./div/div/div[1]/a/text()').extract_first()
            author = div.xpath('./div[1]/div/div[2]/div[1]/span[1]/span/text()')[0].extract()
            
            # Instantiate an item object
            ctitem = items.Scrapypj01Item()
            ctitem['title'] = title
            ctitem['author'] = author

            # Yield the item object to the pipeline
            yield ctitem

3.4 MySQL-Based Persistent Storage

  • items.py
import scrapy


class Scrapypj01Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    author = scrapy.Field()

  • pipelines.py: dedicated to persistent storage
# MySQL
import pymysql


# Dedicated to persistent storage
class Scrapypj01Pipeline(object):
    fp = None

    def open_spider(self, spider):
        print('Spider started')
        self.fp = open('./ctresult.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        title = item['title']
        author = item['author']
        data = '{0},{1}\n'.format(title, author)
        self.fp.write(data)
        print(data, 'written successfully')
        return item

    def close_spider(self, spider):
        print('Spider finished')
        self.fp.close()


class MysqlPipeline(object):
    conn = None
    cursor = None

    def open_spider(self, spider):
        print('Spider started')
        self.conn = pymysql.connect(host='10.1.1.8', port=3306, user='root', password='Admin@123', db='spiderdb')

    def process_item(self, item, spider):
        sql = 'insert into ctinfo values(%s,%s)'
        data = (item['title'], item['author'])
        self.cursor = self.conn.cursor()
        try:
            self.cursor.execute(sql, data)
            self.conn.commit()
        except Exception as error:
            print(error)
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        print('Spider finished')
        self.cursor.close()
        self.conn.close()

  • settings.py
ITEM_PIPELINES = {
      'scrapypj01.pipelines.MysqlPipeline': 301,
}
  • ctspider.py
import scrapy
import scrapypj01.items as items


class CtspiderSpider(scrapy.Spider):
    name = 'ctspider'
    # allowed_domains = ['dig.chouti.com']
    start_urls = ['http://dig.chouti.com/']

    def parse(self, response):
        div_list = response.xpath('/html/body/main/div/div/div[1]/div/div[2]/div[1]/div')
        for div in div_list:
            title = div.xpath('./div/div/div[1]/a/text()').extract_first()
            author = div.xpath('./div[1]/div/div[2]/div[1]/span[1]/span/text()')[0].extract()
            
            # Instantiate an item object
            ctitem = items.Scrapypj01Item()
            ctitem['title'] = title
            ctitem['author'] = author

            # Yield the item object to the pipeline
            yield ctitem

3.5 Redis-Based Persistent Storage

  • items.py
import scrapy


class Scrapypj01Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    author = scrapy.Field()

  • pipelines.py: dedicated to persistent storage
# Redis
from redis import Redis

class RedisPipeline(object):
    conn = None

    def open_spider(self, spider):
        print('Spider started')
        self.conn = Redis(host='10.1.1.8', port=6379, password='Admin@123')

    def process_item(self, item, spider):
        # Note: redis-py 3.x cannot lpush a dict/Item directly; serialize it first,
        # e.g. self.conn.lpush('ctlist', json.dumps(dict(item))) (see Section 9.3)
        self.conn.lpush('ctlist', item)
        return item
  • settings.py
ITEM_PIPELINES = {
      'scrapypj01.pipelines.RedisPipeline': 302,
}
  • ctspider.py
import scrapy
import scrapypj01.items as items


class CtspiderSpider(scrapy.Spider):
    name = 'ctspider'
    # allowed_domains = ['dig.chouti.com']
    start_urls = ['http://dig.chouti.com/']

    def parse(self, response):
        div_list = response.xpath('/html/body/main/div/div/div[1]/div/div[2]/div[1]/div')
        for div in div_list:
            title = div.xpath('./div/div/div[1]/a/text()').extract_first()
            author = div.xpath('./div[1]/div/div[2]/div[1]/span[1]/span/text()')[0].extract()
            
            # Instantiate an item object
            ctitem = items.Scrapypj01Item()
            ctitem['title'] = title
            ctitem['author'] = author

            # Yield the item object to the pipeline
            yield ctitem

4 Full-Site Crawling Based on the Spider Base Class

  • Full-site crawling: crawl the page data for every page number of a site
  • Manually send a GET request: yield scrapy.Request(url, callback)
  • Manually send a POST request: yield scrapy.FormRequest(url, formdata, callback), where formdata is a dict of request parameters (see the sketch after this list)
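
A minimal sketch of a manual POST with FormRequest; the login URL and form fields below are assumptions for illustration only:

import scrapy


class LoginSpider(scrapy.Spider):
    name = 'loginspider'
    start_urls = ['https://example.com/login']      # hypothetical page

    def parse(self, response):
        # Manually issue a POST request; formdata is a dict of request parameters
        yield scrapy.FormRequest(
            url='https://example.com/api/login',    # hypothetical endpoint
            formdata={'username': 'user', 'password': 'pass'},
            callback=self.parse_after_login,
        )

    def parse_after_login(self, response):
        print(response.status)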

4.1 Example

import scrapy
import hySpider.items as items
import json

class HySpider(scrapy.Spider):
    name = 'hy'
    # allowed_domains = ['huya.com']
    start_urls = ['https://www.huya.com/g/xingxiu']
    url = "https://www.huya.com/cache.php?m=LiveList&do=getLiveListByPage&gameId=1663&tagAll=0&callback=getLiveListJsonpCallback&page=%d"

    def parse(self, response):
        li_list = response.xpath('//*[@id="js-live-list"]/li')
        for li in li_list:
            title = li.xpath('./a[2]/text()')[0].extract()
            imgurl = li.xpath('./a[1]/img/@data-original').extract_first().split('?')[0]

            hyitem = items.HyspiderItem()
            hyitem['title'] = title
            hyitem['imgurl'] = imgurl
            yield hyitem

        for page in range(2, 3):
            new_url = format(self.url % page)
            # Use the callback to hand the response to parse_other
            yield scrapy.Request(url=new_url, callback=self.parse_other)

    def parse_other(self, response):
        """
        属于parse方法的递归,因此所带参数与parse方法一致
        :param response:
        :return:
        """
        res = response.text.replace('getLiveListJsonpCallback(','')
        res = res[:-1]
        res = json.loads(res)
        if res.get('status') == 200:
            data = res.get('data').get('datas')
            for obj in data:
                title = obj.get('introduction')
                imgurl = obj.get('screenshot')

                hyitem = items.HyspiderItem()
                hyitem['title'] = title
                hyitem['imgurl'] = imgurl
                yield hyitem

5 Scrapy Features

5.1 Tuning Scrapy Crawl Efficiency

All of these are set in the settings file; a consolidated example follows the list:

  • Increase concurrency: Scrapy runs 16 concurrent requests by default; this can be raised as needed, e.g. CONCURRENT_REQUESTS = 100 in the settings file.
  • Lower the log level: Scrapy prints a large amount of log output while running; to reduce CPU usage, set the log level to INFO or ERROR, e.g. LOG_LEVEL = 'INFO'.
  • Disable cookies: if cookies are not actually needed, disabling them reduces CPU usage and speeds up crawling: COOKIES_ENABLED = False.
  • Disable retries: re-requesting failed HTTP requests slows crawling down, so retries can be turned off: RETRY_ENABLED = False.
  • Reduce the download timeout: when crawling very slow links, a shorter timeout lets stuck requests be abandoned quickly: DOWNLOAD_TIMEOUT = 10 (a 10-second timeout).
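
A consolidated settings.py sketch of the options above (the values are examples, not recommendations):

# settings.py
CONCURRENT_REQUESTS = 100   # raise concurrency (the default is 16)
LOG_LEVEL = 'INFO'          # lower the log level ('ERROR' is quieter still)
COOKIES_ENABLED = False     # disable cookies
RETRY_ENABLED = False       # do not retry failed requests
DOWNLOAD_TIMEOUT = 10       # give up on a download after 10 s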

5.2 The Five Core Components

The five core components are the Engine, the Scheduler, the Downloader, the Spiders, and the Item Pipelines, connected by the downloader middleware and the spider middleware.

Reference: Scrapy五大核心组件简介 - 帅小伙⊙∪⊙ - 博客园 (cnblogs.com)

(Figure: Scrapy architecture diagram of the five core components; image-20230219172917172.png)

5.3 Scrapy Middleware

Spider middleware (Spider Middleware)

Downloader middleware (Downloader Middleware)

  • Purpose: intercept all requests and responses in bulk
  • Why intercept requests
    • To tamper with request headers (UA spoofing)
    • To change the IP used by the request (proxy)
  • Why intercept responses
    • To tamper with the response data or replace the response object

5.3.1 Downloader Middleware

import random
from scrapy import signals

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter

user_agent_list = [
    'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;',
    ' Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    'Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11'
]


class MidwareDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    def process_request(self, request, spider):
        # Tamper with the request headers (UA spoofing)
        request.headers['User-Agent'] = random.choice(user_agent_list)
        print(request.headers['User-Agent'])

        # Set a proxy for the request (the IP below is only an example)
        request.meta['proxy'] = 'http://121.13.252.61:41564'
        print(request.meta['proxy'])
        return None

    def process_response(self, request, response, spider):
        # Tamper with the response data or object here if needed
        return response

    def process_exception(self, request, exception, spider):
        # Intercept request objects whose download raised an exception
        pass

5.3.2 Using Selenium to Crawl Dynamically Loaded Sites

  • wyspider.py

    import scrapy
    import wy.items as items
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    
    
    class WyspiderSpider(scrapy.Spider):
        name = 'wyspider'
        # allowed_domains = ['163.com']
        start_urls = ['https://news.163.com/']
        module_list = []
        chrome_options = webdriver.ChromeOptions()
    
        # Handle SSL certificate errors
        chrome_options.add_argument('--ignore-certificate-errors')
        chrome_options.add_argument('--ignore-ssl-errors')
    
        # Suppress unneeded Chrome log output
        chrome_options.add_experimental_option("excludeSwitches", ['enable-automation', 'enable-logging'])
        chrome_options.binary_location = r'C:\Program Files\Google\Chrome Dev\Application\chrome.exe'
        ser = Service(r'chromedriver.exe')
        browser = webdriver.Chrome(service=ser, options=chrome_options)
    
        def parse(self, response):
            # Parse the URL of each section (board) on the page
            modules_list = response.xpath('//*[@id="index2016_wrap"]/div[3]/div[2]/div[2]/div[2]/div/ul/li')
            # Keep only the desired sections
            module_index = [2, 5]
            for index in module_index:
                module_url = modules_list[index].xpath('./a/@href').extract_first()
                self.module_list.append(module_url)
                # Manually send a request for each filtered section URL
                yield scrapy.Request(url=module_url, callback=self.parse_module)
    
        def parse_module(self, response):
            # Parses the news titles and detail-page URLs from each section page.
            # Note: the response received here does not contain the dynamically loaded news data (so it does not meet our needs as-is)
            news_list = response.xpath('/html/body/div/div[3]/div[4]/div[1]/div[1]/div/ul/li/div/div')
            for news_li in news_list:
                title = news_li.xpath('./div/h3/a/text()').extract_first()
                news_url = news_li.xpath('./div/h3/a/@href').extract_first()
                wyitem = items.WyItem()
                wyitem['title'] = title
                print(news_url)
                yield scrapy.Request(url=news_url, callback=self.parse_url, meta={'wyitem': wyitem})
    
        def parse_url(self, response):
            # Parse the news detail page
            wyitem = response.meta['wyitem']
            news_content = response.xpath('//*[@id="content"]/div[2]//text()').extract()
            news_content = ''.join(news_content)
            print(news_content)
            wyitem['news_content'] = news_content
    
            yield wyitem
    
        def close(self, spider, reason):
            # This method runs only once, when the whole spider finishes
            self.browser.quit()
    
    
  • items.py

    import scrapy
    
    class WyItem(scrapy.Item):
        title = scrapy.Field()
        news_content = scrapy.Field()
    
  • middleware.py

    from scrapy import signals
    from time import sleep
    from scrapy.http import HtmlResponse
    # useful for handling different item types with a single interface
    from itemadapter import is_item, ItemAdapter
    
    
    class WyDownloaderMiddleware:
    
        def process_request(self, request, spider):
            request.headers[
                'User-Agent'] = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'
            return None
    
        def process_response(self, request, response, spider):
            """
            :param request: 响应对象对应的请求对象中的URL
            :param response:响应数据,可以由selenium中的page_source返回
            :param spider:爬虫类实例化的对象,可以实现爬虫类和中间件类的数据交互
            :return:news_response
            拦截到5个板块对应的响应对象,将其替换成5个符合需求的新的响应对象进行返回
            """
            if request.url in spider.module_list:
                browser = spider.browser
                browser.get(request.url)
                sleep(20)
                # Contains the dynamically loaded news data
                page_text = browser.page_source
                news_response = HtmlResponse(url=request.url, body=page_text, encoding='utf-8', request=request)
                return news_response
            else:
                return response
    
    
  • pipelines.py

    from redis import Redis
    
    
    class RedisPipeline(object):
        conn = None
    
        def open_spider(self, spider):
            print('Spider started')
            self.conn = Redis(host='10.1.1.5', port=6379, decode_responses=True)
    
        def process_item(self, item, spider):
            print('item: ', item)
            self.conn.lpush('wylist', item)
            return item
    
        def close_spider(self, spider):
            print('Spider finished')
    
    
  • settings.py

    ROBOTSTXT_OBEY = False
    
    ## Log
    LOG_LEVEL = 'ERROR'
    LOG_FILE = 'log.txt'
    
    DOWNLOADER_MIDDLEWARES = {
       'wy.middlewares.WyDownloaderMiddleware': 543,
    }
    ITEM_PIPELINES = {
       # 'wy.pipelines.WyPipeline': 300,
       'wy.pipelines.RedisPipeline': 300,
    }
    
    

5.4 Handling Lazy Loading and Downloading Media Efficiently with Scrapy

  • Image lazy loading: relies on a pseudo attribute on the tag (e.g. data-original); data capture must be based on that pseudo attribute

  • ImagesPipeline: the pipeline class Scrapy provides specifically for downloading binary data and persisting it

  • bgimgspider.py

    import scrapy
    from bgimg.items import BgimgItem
    
    
    class BgimgspiderSpider(scrapy.Spider):
        name = 'bgimgspider'
        # allowed_domains = ['xx.cm']
        start_urls = ['https://sc.chinaz.com/tupian/bangongrenwu.html']
    
        def parse(self, response):
            img_list = response.xpath('/html/body/div[3]/div[2]/div')
            for img in img_list:
                imgtitle = img.xpath('./div/a/text()').extract_first()
                # Use the pseudo attribute that holds the real image URL before lazy loading (data-original)
                imgurl = img.xpath('./img/@data-original').extract_first()
                bgimgitem = BgimgItem()
                bgimgitem['imgtitle'] = imgtitle
                bgimgitem['imgurl'] = 'https:' + imgurl
                yield bgimgitem
    
    
  • items.py

    import scrapy
    
    
    class BgimgItem(scrapy.Item):
        imgtitle = scrapy.Field()
        imgurl = scrapy.Field()
    
    
  • pipelines.py

    import scrapy
    from scrapy.pipelines.images import ImagesPipeline
    
    
    class BgimgPipeline(ImagesPipeline):
        def get_media_requests(self, item, info):
            """
            该方法是用来对媒体资源进行请求(数据下载)
            :param item:接收到的爬虫类提交的item对象
            :param info:
            :return:
            """
            yield scrapy.Request(item['imgurl'])
    
        def file_path(self, request, response=None, info=None, *, item=None):
            """
            指明数据存储的路径。文件夹路径指定需要在setting.py文件中配置: IMAGES_STORE = './imglibs'
            :param request:
            :param response:
            :param info:
            :param item:
            :return:
            """
            return item['imgtitle'] + request.url.split('/')[-1]
    
        def item_completed(self, results, item, info):
            """
            将item传递给下一个即将被执行的管道类
            :param results:
            :param item:
            :param info:
            :return:
            """
            return item
    
    
  • middlewares.py

    class BgimgDownloaderMiddleware:
    
        def process_request(self, request, spider):
            request.headers[
                'User-Agent'] = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'
            return None
    
        def process_response(self, request, response, spider):
            return response
    
        def process_exception(self, request, exception, spider):
            pass
    
    
  • settings.py

    IMAGES_STORE = './imglibs'
    ROBOTSTXT_OBEY = False
    LOG_LEVEL = 'ERROR'
    LOG_FILE = 'log.txt'
    DOWNLOADER_MIDDLEWARES = {
       'bgimg.middlewares.BgimgDownloaderMiddleware': 543,
    }
    ITEM_PIPELINES = {
       'bgimg.pipelines.BgimgPipeline': 300,
    }
    

6 Scrapy Full-Site Data Crawling with CrawlSpider

  • csspider.py

    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    import crawlspider.items as items
    
    
    class CsspiderSpider(CrawlSpider):
        name = 'csspider'
        # allowed_domains = ['xx.com']
        start_urls = ['https://wz.sun0769.com/political/index/politicsNewest?id=1&page=1']
        # Instantiate a LinkExtractor that extracts links matching the given rule (allow='regex')
        # Extract the pagination links
        link = LinkExtractor(allow=r'politicsNewest\?id=1&page=\d+')
        # Extract the detail-page links
        link_detail = LinkExtractor(allow=r'politics/index\?id=\d+')
        rules = (
            # The LinkExtractor is passed as the first argument of Rule
            # Links extracted by the LinkExtractor are requested, and the responses are parsed with the specified callback
            # follow=True: keep applying the LinkExtractor to the pages reached from the extracted links
            Rule(link, callback='parse_item', follow=False),
            Rule(link_detail, callback='parse_detail_item', follow=False),
        )
    
        def parse_item(self, response):
            case_list = response.xpath('/html/body/div[2]/div[3]/ul[2]/li')
            for case in case_list:
                item = items.CrawlspiderItem()
                item['case_num'] = case.xpath('./span[1]/text()').extract_first()
                item['case_status'] = case.xpath('./span[2]/text()').extract_first().strip()
                item['case_name'] = case.xpath('./span[3]/a/text()').extract_first()
                yield item
    
        def parse_detail_item(self, response):
            item = items.CrawlspiderItem2()
            case_num = response.xpath('/html/body/div[3]/div[2]/div[2]/div[1]/span[5]/text()').extract_first().split(':')[
                -1]
            case_detail = response.xpath('/html/body/div[3]/div[2]/div[2]/div[2]/pre/text()').extract_first()
            item['case_num'] = case_num
            item['case_detail'] = case_detail
            yield item
    
    
  • items.py

    import scrapy
    
    
    class CrawlspiderItem(scrapy.Item):
        case_num = scrapy.Field()
        case_status = scrapy.Field()
        case_name = scrapy.Field()
    
    
    class CrawlspiderItem2(scrapy.Item):
        case_num = scrapy.Field()
        case_detail = scrapy.Field()
    
    
  • pipelines.py

    class CrawlspiderPipeline:
        def process_item(self, item, spider):
            if item.__class__.__name__ == 'CrawlspiderItem':
                print(item)
            else:
                case_num = item['case_num']
                case_detail = item['case_detail']
                print(case_num, case_detail)
            return item
    
    
  • middlewares.py

    class CrawlspiderDownloaderMiddleware:
    
        def process_request(self, request, spider):
            request.headers[
                'User-Agent'] = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'
            return None
    
        def process_response(self, request, response, spider):
            return response
    
        def process_exception(self, request, exception, spider):
            pass
    
    
  • settings.py

    ROBOTSTXT_OBEY = False
    LOG_LEVEL = 'ERROR'
    LOG_FILE = 'log.txt'
    DOWNLOADER_MIDDLEWARES = {
       'crawlspider.middlewares.CrawlspiderDownloaderMiddleware': 543,
    }
    ITEM_PIPELINES = {
       'crawlspider.pipelines.CrawlspiderPipeline': 300,
    }
    

7 Distributed Website Crawling with Scrapy

  • Build a cluster of machines and run the same program on every machine so that they jointly crawl one website's data.
  • Why can't the native Scrapy framework do distributed crawling?
    • The scheduler cannot be shared
    • The pipelines cannot be shared
  • How is distributed crawling implemented?
    • scrapy + scrapy_redis
  • What does the scrapy-redis component do?
    • It provides a shareable scheduler and pipeline
    • Caveat: the data can only be stored in Redis.

7.1 Crawling Workflow

  • Steps to implement distributed crawling:

    1. pip install scrapy-redis

    2. Create the project

      scrapy startproject scrapyredis
      cd scrapyredis 
      scrapy genspider -t crawl scrapyredisspider xx.com
      
      
    3. Modify the spider class:

      1. Import: from scrapy_redis.spiders import RedisCrawlSpider

      2. Change the spider's base class to RedisCrawlSpider

      3. Delete allowed_domains and start_urls

      4. Add a new attribute: redis_key = 'fbsqueue', the name of the shared scheduler queue

      5. Write the rest of the spider logic

        import scrapy
        from scrapy.linkextractors import LinkExtractor
        from scrapy.spiders import CrawlSpider, Rule
        from scrapy_redis.spiders import RedisCrawlSpider
        from scrapyredis.items import ScrapyredisItem
        
        class ScrapyredisspiderSpider(RedisCrawlSpider):
            name = 'scrapyredisspider'
            # allowed_domains = ['xx.com']
            # start_urls = ['http://xx.com/']
            # Name of the shared scheduler queue
            redis_key = 'fbsqueue'
            link = LinkExtractor(allow=r'politicsNewest\?id=1&page=\d+')
        
            rules = (
                Rule(link, callback='parse_item', follow=False),
            )
        
            def parse_item(self, response):
                case_list = response.xpath('/html/body/div[2]/div[3]/ul[2]/li')
                for case in case_list:
                    item = ScrapyredisItem()
                    item['case_num'] = case.xpath('./span[1]/text()').extract_first()
                    item['case_status'] = case.xpath('./span[2]/text()').extract_first().strip()
                    item['case_name'] = case.xpath('./span[3]/a/text()').extract_first()
                    yield item
        
        
    4. Configure items.py

       import scrapy
      
      
      class ScrapyredisItem(scrapy.Item):
          case_num = scrapy.Field()
          case_status = scrapy.Field()
          case_name = scrapy.Field()
      
      
    5. Modify the settings.py configuration

      # UA spoofing
      USER_AGENT = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'
      
      # Obey robots.txt rules
      ROBOTSTXT_OBEY = False
      LOG_LEVEL = 'ERROR'
      LOG_FILE = './log.txt'
      
      # Enable the pipeline provided by the scrapy_redis component
      ITEM_PIPELINES = {
          'scrapy_redis.pipelines.RedisPipeline': 400
      }
      
      # Enable the scheduler provided by the scrapy_redis component
      ## Add a dedupe container class that stores request fingerprints in a Redis set, making request dedup persistent
      DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
      ## Use scrapy-redis's own scheduler
      SCHEDULER = "scrapy_redis.scheduler.Scheduler"
      ## Whether the scheduler persists: when the crawl ends, should the request queue and the dedup fingerprint set in Redis be kept? True keeps them (persistent); False clears them
      SCHEDULER_PERSIST = True
      
      # Redis connection settings for the crawler
      REDIS_HOST = '192.168.50.118'
      REDIS_PORT = 6379
      REDIS_ENCODING = 'utf-8'
      # REDIS_PARAMS = {'password':'123456'}
      
      
    6. Modify the redis.conf configuration

      # Change the default bind address
      bind 0.0.0.0
      
      # Disable protected mode
      protected-mode no
      
    7. Start the crawl

      # Start the Redis server
      systemctl start redis
      
      # Run the spider
      scrapy runspider .\scrapyredis\spiders\scrapyredisspider.py
      
      # Push a starting URL into the scheduler queue (the queue lives in Redis)
      127.0.0.1:6379> lpush fbsqueue https://wz.sun0769.com/political/index/politicsNewest?id=1&page=
      

8 Incremental Website Crawling with Scrapy

Reference: 20.scrapy框架之增量式爬虫 - 盛夏中为你花开彼岸 - 博客园 (cnblogs.com)

8.1 Approaches to implementing an incremental crawler:

  1. Before sending a request, check whether the URL has already been crawled
    1. Store the URLs of the data about to be crawled in a Redis set
  2. Deduplicate based on the crawled data, then persist it (after parsing, check whether this content has been crawled before); see the sketch after this list
    1. Generate a unique fingerprint for each crawled record (the fingerprint can be stored as a MySQL column or in a Redis set)
  3. When writing to the storage medium, check whether the content already exists there
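
A minimal sketch of approach 2: generate a fingerprint of each parsed record and deduplicate through a Redis set (the connection details and the 'record_fingerprints' key are assumptions):

import hashlib
import json

from redis import Redis

conn = Redis(host='127.0.0.1', port=6379)   # assumed Redis instance


def is_new_record(record):
    """Return True only the first time this exact record content is seen."""
    fingerprint = hashlib.sha256(
        json.dumps(record, sort_keys=True, ensure_ascii=False).encode('utf-8')
    ).hexdigest()
    # sadd returns 1 if the fingerprint was not yet in the set, 0 if it already existed
    return conn.sadd('record_fingerprints', fingerprint) == 1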

8.2 Implementation

  • zlsspider.py

    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    from zls.items import ZlsItem
    from redis import Redis
    
    
    class ZlsspiderSpider(CrawlSpider):
        conn = Redis(host='192.168.50.118', port=6379)
        name = 'zlsspider'
        # allowed_domains = ['xx.com']
        start_urls = ['https://www.4567kp.com/frim/index1.html']
    
        # If no Rule matches, the site will not be parsed
        rules = (
            Rule(LinkExtractor(allow=r'frim/index1.html'), callback='parse_item', follow=False),
        )
    
        def parse_item(self, response):
            movie_list = response.xpath('/html/body/div[2]/div/div[3]/div/div[2]/ul/li')
            for movie in movie_list:
                movie_detail_url = 'https://www.4567kp.com' + movie.xpath('./div/a/@href').extract_first()
                movie_name = movie.xpath('./div/a/@title').extract_first()
                item = ZlsItem()
                item['movie_name'] = movie_name
                ex = self.conn.sadd('movie_detail_urls', movie_detail_url)
                if ex == 1:
                    print('New data captured!')
                    yield scrapy.Request(url=movie_detail_url, callback=self.parse_detail, meta={'item': item})
                else:
                    print('No new data to update.')
    
        def parse_detail(self, response):
            item = response.meta['item']
            movie_detail = response.xpath(
                '/html/body/div[2]/div/div[1]/div[5]/div/div[2]/div/span[2]/text()').extract_first()
            item['movie_detail'] = movie_detail
            return item
    
    
  • items.py

    import scrapy
    
    
    class ZlsItem(scrapy.Item):
        movie_name = scrapy.Field()
        movie_detail = scrapy.Field()
    
    
    
  • pipelines.py

    # Key point: serialize the dict data and push it into Redis
    import json


    class ZlsPipeline:
        def process_item(self, item, spider):
            conn = spider.conn
            dic = {
                "movie_name": item["movie_name"],
                "movie_detail": item["movie_detail"]
            }
            conn.lpush('movieData', json.dumps(dic))
            return item
    
    
  • settings.py

    ROBOTSTXT_OBEY = False
    LOG_LEVEL = 'ERROR'
    LOG_FILE = 'log.txt'
    USER_AGENT = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'
    
    ITEM_PIPELINES = {
       'zls.pipelines.ZlsPipeline': 300,
    }
    

9 Problems Encountered While Parsing Data

9.1 Parsing JSON data

Taking Huya's returned data as an example: when page 2 is selected, the data is requested via Ajax and the returned content is shown below (a sketch for stripping the JSONP wrapper follows these examples):

# When the callback parameter in the request URL carries a value:
https://www.huya.com/cache.php?m=LiveList&do=getLiveListByPage&gameId=1663&tagAll=0&callback=getLiveListJsonpCallback&page=2

getLiveListJsonpCallback({"status":200,"message":"","data":{...}})

# When the callback parameter in the request URL is empty:
https://www.huya.com/cache.php?m=LiveList&do=getLiveListByPage&gameId=1663&tagAll=0&callback=&page=2

{"status":200,"message":"","data":{...}}

9.2 If no Rule matches, the site content will not be parsed

# If no Rule matches, the site will not be parsed
    rules = (
        Rule(LinkExtractor(allow=r'frim/index1.html'), callback='parse_item', follow=False),
    )

9.3 Pushing Scrapy items into Redis with the Python redis library

# Key point: serialize the dict data and push it into Redis
import json


class ZlsPipeline:
    def process_item(self, item, spider):
        conn = spider.conn
        dic = {
            "movie_name": item["movie_name"],
            "movie_detail": item["movie_detail"]
        }
        conn.lpush('movieData', json.dumps(dic))
        return item

posted @ 2022-12-10 11:16  f_carey