Web Scraping 5: The Scrapy Framework and a Huya Scrapy Example

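Using the Scrapy framework
  - pySpider (another Python crawler framework)
- What is a framework?
  - A project template that is highly general and bundles many features, so it can be applied to all kinds of requirements.
- Features that Scrapy ships with:
  - High-performance data parsing (xpath)
  - High-performance downloading
  - High-performance persistent storage
  - Middleware
  - Whole-site data crawling
  - Distributed crawling: redis
  - Passing data between requests (deep crawling)
  - Sensible use of selenium inside Scrapy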

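- Installing the environment (Windows)

a. pip3 install wheel

b. Download the Twisted wheel that matches your Python version, for example

Twisted-20.3.0-cp38-cp38-win_amd64.whl (cp38 means CPython 3.8)

c. Change into the download directory and run pip3 install ./Twisted-20.3.0-cp38-cp38-win_amd64.whl

d. pip3 install pywin32

e. pip3 install scrapy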

- Creating a project
  - scrapy startproject ProName
  - cd ProName
  - scrapy genspider spiderName www.xxx.com (creates the spider file)
  - Run it: scrapy crawl spiderName

  - settings:
    - Do not obey the robots protocol: ROBOTSTXT_OBEY = False

    - Spoof the User-Agent: USER_AGENT

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'

    - LOG_LEVEL = 'ERROR': only log messages at ERROR level or above are printed.

      (Alternatively, LOG_FILE = 'log.txt' writes the log to a file instead of the console.)

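- Data parsing in Scrapy
- extract(): use when the xpath matches several nodes; xpath(...).extract() returns a list of strings, one per match
- extract_first(): use when only a single value is expected; it returns the first match as a string (or None if nothing matched)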

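- Persistent storage in Scrapy
- Via the command line:
- Only the return value of the parse method can be written to a file on disk
- scrapy crawl first -o file.csv
- Via pipelines: pipelines.py

- Coding workflow:
- 1. Parse the data
- 2. Define the matching fields in the item class
- 3. Store the parsed data in an item object, e.g. item['p']
- 4. Submit the item object to the pipeline
- 5. The process_item method of the pipeline class receives the item and persists it in whatever form you like
- 6. Enable the pipeline in the settings file

- Additional details:
- Each pipeline class in the pipelines file stores the data to one kind of backend.
- If the pipelines file defines several pipeline classes, the item submitted by the spider goes first to the pipeline class with the highest priority.
- In process_item, return item passes the item on to the next pipeline class to be executed.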
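The hand-off between pipeline classes can be pictured with a few lines of plain Python. This is only a simulation under stated assumptions, not Scrapy's actual implementation: run_pipelines, UpperCasePipeline and CollectPipeline are hypothetical names, and the numbers play the role of the priorities (lower runs first).

```python
class UpperCasePipeline:
    def process_item(self, item, spider):
        item['title'] = item['title'].upper()
        return item  # returning the item hands it to the next pipeline


class CollectPipeline:
    def __init__(self):
        self.seen = []

    def process_item(self, item, spider):
        self.seen.append(dict(item))
        return item


def run_pipelines(item, pipelines, spider=None):
    """Pass the item through the pipelines in ascending priority order."""
    for _priority, pipeline in sorted(pipelines, key=lambda pair: pair[0]):
        item = pipeline.process_item(item, spider)
    return item


collector = CollectPipeline()
# Registered out of order on purpose; 300 still runs before 301
pipelines = [(301, collector), (300, UpperCasePipeline())]
result = run_pipelines({'title': 'room a'}, pipelines)
print(result, collector.seen)
```

If UpperCasePipeline forgot its return statement, the next class would receive None instead of the item, which is why every process_item should end with return item.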

OK, let's scrape some information from Huya.

I'll skip creating the project.

The settings file (settings.py):

LOG_LEVEL = 'ERROR'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

The spider file, spiders/huya.py:

import scrapy


class HuyaSpider(scrapy.Spider):
    name = 'huya'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.huya.com/g/xingxiu']

    def parse(self, response):
        li_list = response.xpath('//*[@id="js-live-list"]/li')
        all_data = []
        for li in li_list:
            title = li.xpath('./a[2]/text()').extract_first()  # TODO: strip the 【】 brackets
            author = li.xpath('./span/span[1]/i/text()').extract_first()
            hot = li.xpath('./span/span[2]/i[2]/text()').extract_first()
            print(title,author,hot)
            dic = {
                'title':title,
                'author':author,
                'hot':hot
            }
            all_data.append(dic)
        return all_data


Running scrapy crawl huya as is just prints the results to the screen.

Running scrapy crawl huya -o huya.csv writes them to a CSV table whose header row is the dict keys.
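To picture what such a file looks like, here is a standard-library sketch that writes the same list-of-dicts shape to CSV. It is not Scrapy's exporter, only an illustration of how the keys become the header row; the sample rows are invented.

```python
import csv
import io

# Invented sample rows with the same keys the spider returns
all_data = [
    {'title': 'Room A', 'author': 'alice', 'hot': '1200'},
    {'title': 'Room B', 'author': 'bob', 'hot': '950'},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(all_data[0].keys()), lineterminator='\n')
writer.writeheader()        # header row comes from the dict keys
writer.writerows(all_data)  # one CSV row per dict

csv_text = buffer.getvalue()
print(csv_text)
```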

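That already comes close to what we need.

But the output formats and methods available this way are rather limited.

This is where items.py and pipelines.py come in, working together.

Write huya.py like this: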

import scrapy
from huyaPro.items import HuyaproItem

class HuyaSpider(scrapy.Spider):
    name = 'huya'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.huya.com/g/xingxiu']

    def parse(self, response):
        li_list = response.xpath('//*[@id="js-live-list"]/li')
        for li in li_list:
            title = li.xpath('./a[2]/text()').extract_first()  # TODO: strip the 【】 brackets
            author = li.xpath('./span/span[1]/i/text()').extract_first()
            hot = li.xpath('./span/span[2]/i[2]/text()').extract_first()
            # Wrap the parsed values in an item instead of collecting dicts in a list
            item = HuyaproItem()
            item['title'] = title
            item['author'] = author
            item['hot'] = hot
            yield item  # submit the item to the pipeline

The spider now submits items.

items.py has to define the item class beforehand:

import scrapy


class HuyaproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    author = scrapy.Field()
    hot = scrapy.Field()

Every key used in the spider must be declared as a Field here!

Now pipelines.py implements the persistent storage.

  This version saves to a txt file:

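class HuyaproPipeline:

    fp = None

    def open_spider(self, spider):  # runs once, when the spider starts
        self.fp = open('huyazhibo.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):  # item is the object submitted by the spider
        # f-string formatting also copes with fields that came back as None
        self.fp.write(f"{item['title']}:{item['author']}:{item['hot']}\n")
        print(item['title'], 'written successfully!')

        return item  # hand the item to the next pipeline

    def close_spider(self, spider):  # runs once, when the spider closes
        self.fp.close()
        print('file closed')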
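Since a pipeline is an ordinary class with three hooks, it can be exercised by hand without running a crawl. The sketch below is a hypothetical variant, JsonLinesPipeline, that writes one JSON object per line; the class name, the default file name and the sample item are assumptions made for illustration.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Hypothetical pipeline that stores each item as one JSON line."""

    def __init__(self, path='huya.jsonl'):
        # The default lets Scrapy create the class without arguments
        self.path = path
        self.fp = None

    def open_spider(self, spider):
        self.fp = open(self.path, 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.fp.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.fp.close()


# Drive the three hooks manually, in the order a crawl would call them
path = os.path.join(tempfile.mkdtemp(), 'huya.jsonl')
pipeline = JsonLinesPipeline(path)
pipeline.open_spider(spider=None)
returned = pipeline.process_item({'title': 'Room A', 'author': 'alice', 'hot': '1200'}, spider=None)
pipeline.close_spider(spider=None)

with open(path, encoding='utf-8') as fp:
    lines = fp.read().splitlines()
print(lines)
```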


To store the data in a database:

import pymysql

class mysqlPipeline(object):
    conn = None
    cursor = None

    def open_spider(self, spider):
        self.conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root', password='',
                                    db='spider', charset='utf8')
        # Create the cursor once so close_spider always has one to close
        self.cursor = self.conn.cursor()
        print(self.conn)

    def process_item(self, item, spider):
        # Let the driver fill in the %s placeholders so quotes in the data cannot break the SQL
        sql = 'insert into huya values (%s, %s, %s)'
        try:
            self.cursor.execute(sql, (item['title'], item['author'], item['hot']))
            self.conn.commit()
        except Exception as e:   # roll back the transaction on failure
            print(e)
            self.conn.rollback()
        return item  # pass the item on to the next pipeline

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()
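Building SQL with Python string formatting breaks as soon as a value contains a quote character, which is why database drivers offer placeholders. The sketch below shows the effect with the standard-library sqlite3 module as a stand-in for MySQL (sqlite3 writes placeholders as ?, pymysql as %s); the table layout and the sample row are assumptions.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('create table huya (title text, author text, hot text)')

# Invented row whose title contains double quotes
row = ('He said "hi"', 'alice', '1200')

# 1) String formatting: the quotes inside the title corrupt the statement
bad_sql = 'insert into huya values ("%s", "%s", "%s")' % row
try:
    conn.execute(bad_sql)
    formatted_ok = True
except sqlite3.Error:
    formatted_ok = False

# 2) Placeholders: the driver passes the values separately from the SQL text
conn.execute('insert into huya values (?, ?, ?)', row)
conn.commit()

stored = conn.execute('select title, author, hot from huya').fetchall()
print(formatted_ok, stored)
conn.close()
```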


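So now pipelines.py has two classes, i.e. two storage backends. How are they managed?

(Remember to return item, so the hand-off between classes works.)

Add one key-value pair per class to ITEM_PIPELINES in settings; the value is the priority (the smaller the number, the earlier it runs).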

ITEM_PIPELINES = {
   'huyaPro.pipelines.HuyaproPipeline': 300,
   'huyaPro.pipelines.mysqlPipeline': 301,

}

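With this in place,

>scrapy crawl huya stores the data in two places!

What, you want to store it in redis as well? Sure. Add another class to the pipelines file and another key-value pair to settings!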

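import json

from redis import Redis

class RedisPipeLine(object):
    conn = None

    def open_spider(self, spider):
        self.conn = Redis(host='127.0.0.1', port=6379)

    def process_item(self, item, spider):
        # Recent redis-py versions only accept bytes, str, int or float, so serialize the item first
        self.conn.lpush('huyaList', json.dumps(dict(item), ensure_ascii=False))
        return item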
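A Redis list holds strings, so one option is to serialize each item to JSON before pushing it and decode it when reading it back. The sketch below uses FakeRedis, a tiny in-memory stand-in written for this example (not the real client, and covering only the two list operations used here), to show that lpush puts the newest entry at the head of the list. The sample items are invented.

```python
import json


class FakeRedis:
    """In-memory stand-in mimicking only lpush and lrange of a Redis client."""

    def __init__(self):
        self.lists = {}

    def lpush(self, name, *values):
        target = self.lists.setdefault(name, [])
        for value in values:
            target.insert(0, value)  # each value goes to the head of the list
        return len(target)

    def lrange(self, name, start, end):
        # Simplified: non-negative start, end is inclusive, -1 means "to the last element"
        items = self.lists.get(name, [])
        stop = len(items) if end == -1 else end + 1
        return items[start:stop]


conn = FakeRedis()
for item in ({'title': 'Room A', 'hot': '1200'}, {'title': 'Room B', 'hot': '950'}):
    conn.lpush('huyaList', json.dumps(item, ensure_ascii=False))

restored = [json.loads(raw) for raw in conn.lrange('huyaList', 0, -1)]
print(restored)
```

Reading the list from the head therefore yields the most recently scraped item first.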


That's all!

Posted 2020-09-22 19:36 by 蜗牛般庄
