Scrapy: Data Parsing and Persistent Storage

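JS deobfuscation: presents obfuscated JS code in its original, readable form. Recommended deobfuscation site: http://www.bm8.com.cn/jsConfusion/
Calling JS code from Python:

PyExecJS: lets Python run JS code through a JS runtime.
Installing the environment:

pip install PyExecJS -i https://pypi.tuna.tsinghua.edu.cn/simple
Install the Node.js runtime (the installer can be found with a web search)

Usage: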

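// Contents of test.js
function test_function(start, end){
    var param;
    param = start - end;
    return param;
}

# Python script that calls test.js
import execjs

# Pick the default JS runtime (Node.js, if it is installed)
node = execjs.get()

# Compile the JavaScript source
with open('test.js', encoding='utf-8') as f:
    ctx = node.compile(f.read())

# Build the call expression; {0} and {1} are positional placeholders, numbered from 0
js = 'test_function("{0}", "{1}")'.format(20, 4)
param = ctx.eval(js)
param  # 16: JS converts the strings "20" and "4" to numbers before subtracting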

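The Scrapy framework

Besides Scrapy, there is also the pySpider framework.

Framework: a highly general-purpose project template that bundles many features and can be applied to all kinds of requirements.

Features built into Scrapy:

High-performance data parsing (xpath)
High-performance data downloading
High-performance persistent storage
Middleware
Full-site data crawling
Distributed crawling (relies on a Redis database)
Passing data between requests (deep crawling)
Using selenium within Scrapy where appropriate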

Installing the environment (Windows)

pip install wheel -i https://pypi.tuna.tsinghua.edu.cn/simple
Download the Twisted wheel from http://www.lfd.uci.edu/~gohlke/pythonlibs/ (search the page for twisted)
Change to the download directory and run pip install .\Twisted-20.3.0-cp39-cp39-win_amd64.whl
pip install pywin32 -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install scrapy -i https://pypi.tuna.tsinghua.edu.cn/simple

Twisted is an asynchronous component. Scrapy builds on Twisted to perform its asynchronous operations.

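Creating a project

1. scrapy startproject ProName
2. cd ProName
3. scrapy genspider spiderName www.xxx.com (creates the spider file; the URL can be changed later)
4. Run the project: scrapy crawl spiderName
Step 3 creates a spiderName.py file in the project's spiders folder.

Initial contents of spiderName.py: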

import scrapy


class spiderNameSpider(scrapy.Spider):

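    # Name of the spider: its unique identifier
    name = 'spiderName'

    # Allowed domains: requests to URLs outside these domains are filtered out; usually commented out
    # allowed_domains = ['www.xxx.com']

    # List of start URLs, usually the site's home page; Scrapy sends a request for each element automatically
    start_urls = ['http://www.xxx.com/']

    # Parse the data
    def parse(self, response):

        # Parse with xpath
        response.xpath('XPath expression')

        # Do not open a text file with `with` here to store data: requests are asynchronous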

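A newly created Scrapy project obeys the robots protocol; this can be changed in settings.py in the project directory:
settings.py:

ROBOTSTXT_OBEY = False: do not obey the robots protocol (the settings.py generated by startproject sets it to True)

Other settings:

User-Agent spoofing: set USER_AGENT=
LOG_LEVEL = 'ERROR': only output log messages of level ERROR or higher
LOG_FILE = 'log.txt': write the log to the file log.txt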

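XPath parsing in Scrapy: how it differs from the xpath module

The elements of the list returned by xpath are Selector objects, and the string data you want is stored inside them. For example, response.xpath('.//text()')[0] returns a Selector object, not a string. To get the text content, use response.xpath('.//text()')[0].extract() or response.xpath('.//text()').extract_first().
extract(): called on the list, extracts the text of every Selector in it and returns a list of strings; called on a single Selector, returns one string.
extract_first(): returns the text of the first Selector in the list as a string, or None if the list is empty.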
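
The difference between the two calls can be mimicked in plain Python. The sketch below is only an analogy and not Scrapy's implementation: the list of strings stands in for the text held by the matched Selector objects, and the helper names `extract_all` and `first_or_default` are hypothetical.

```python
def extract_all(texts):
    """Analogy for extract() on a result list: always a list, possibly empty."""
    return list(texts)


def first_or_default(texts, default=None):
    """Analogy for extract_first(): the first string, or a default when nothing matched."""
    return texts[0] if texts else default


matched = ['first title', 'second title']   # pretend two text nodes matched
nothing = []                                # pretend the XPath matched nothing

all_texts = extract_all(matched)
first_text = first_or_default(matched)
missing_text = first_or_default(nothing)
fallback_text = first_or_default(nothing, default='')
```

Because a missed XPath yields None instead of raising an exception, later code that concatenates the result with other strings should be prepared for None.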

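Persistent storage in Scrapy

Via a terminal command:

Only the return value of the parse method can be stored to a file on disk.
Command: scrapy crawl spiderName -o file_name.csv
The file extension cannot be .txt; supported formats include 'json', 'jsonlines', 'jl', 'csv', 'xml', 'marshal' and 'pickle'.
Example: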

import scrapy
class spiderNameSpider(scrapy.Spider):
    name = 'spiderName'
    start_urls = ['http://www.xxx.com/']
    def parse(self, response):
        data = response.xpath('XPath expression').extract_first()
        # Return dicts (or items) rather than a bare string: only those are exported
        return [{'data': data}]

# Run in the terminal:
scrapy crawl spiderName -o file_name.csv
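
To make the shape of an exportable return value concrete, the following standard-library sketch turns a list of dicts into CSV text, one row per dict. It only illustrates the row-per-dict idea and is not Scrapy's exporter; the helper `rows_to_csv` and the sample rows are hypothetical.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialise a list of dicts to CSV text, using the first dict's keys as the header."""
    if not rows:
        return ''
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()), lineterminator='\n')
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Stand-in for what parse() would return: one dict per parsed record
rows = [
    {'title': 'first', 'author': 'alice'},
    {'title': 'second', 'author': 'bob'},
]
csv_text = rows_to_csv(rows)
```

Each dict key becomes a column, which is why parse should return dicts with the same keys for every record.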

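Via pipelines: the pipelines.py file in the project. Workflow:

1. Parse the data
2. Define the corresponding fields in the item class in items.py
3. Store the parsed data in an object of the item type
4. Submit the item object to the pipeline
5. The process_item method of the pipeline class receives the item object and persists it in whatever form is needed
6. Enable the pipeline in the settings file by uncommenting ITEM_PIPELINES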
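
The order in which the pipeline hooks run can be sketched without Scrapy by driving a pipeline-shaped class by hand. This is a simulation under stated assumptions: `JsonLinesPipeline`, `fake_parse` and the sample records are hypothetical, the constructor argument exists only for this sketch, and in a real project Scrapy itself calls the three hooks.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Pipeline-shaped class: open once, handle each item, close once."""

    def __init__(self, path):
        self.path = path
        self.fp = None

    def open_spider(self, spider):
        # Runs once, before any item arrives
        self.fp = open(self.path, 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Runs once per item
        self.fp.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        # Runs once, after the last item
        self.fp.close()


def fake_parse():
    """Stand-in for a spider's parse(): yields one dict per parsed record."""
    yield {'title': 't1', 'author': 'a1'}
    yield {'title': 't2', 'author': 'a2'}


# Drive the hooks manually in the order described in the workflow
path = os.path.join(tempfile.mkdtemp(), 'items.jl')
pipeline = JsonLinesPipeline(path)
pipeline.open_spider(spider=None)
for item in fake_parse():
    pipeline.process_item(item, spider=None)
pipeline.close_spider(spider=None)

with open(path, encoding='utf-8') as f:
    saved = [json.loads(line) for line in f]
```

Opening the file in open_spider rather than in process_item keeps a single file handle for the whole crawl.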

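Additional details:

Each pipeline class in the pipelines file stores data to one kind of storage target.
The number attached to each key in the ITEM_PIPELINES dictionary is a priority: the smaller the value, the higher the priority. If several pipeline classes are defined in the pipelines file, the item submitted by the spider goes first to the pipeline class with the highest priority.
The return item statement in process_item passes the item on to the next pipeline class to be executed.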
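
The priority and hand-over rules can be simulated in a few lines of plain Python. This is a sketch of the behaviour described above, not Scrapy's own code: `run_pipelines` and the three toy pipeline classes are hypothetical, and the dict maps pipeline objects (instead of class paths) to priority numbers.

```python
class Upper:
    """Returns a modified copy of the item."""

    def process_item(self, item, spider):
        return dict(item, title=item['title'].upper())


class Forgetful:
    """Forgets to return the item, so the result is None."""

    def process_item(self, item, spider):
        pass


class Recorder:
    """Remembers whatever it was handed."""

    def __init__(self):
        self.seen = []

    def process_item(self, item, spider):
        self.seen.append(item)
        return item


def run_pipelines(item, pipelines):
    """Call each pipeline in ascending priority order, feeding each one the previous return value."""
    for pipe in sorted(pipelines, key=pipelines.get):
        item = pipe.process_item(item, spider=None)
    return item


# 300 runs before 301, regardless of the order in the dict
recorder = Recorder()
run_pipelines({'title': 'live'}, {recorder: 301, Upper(): 300})

# A pipeline without `return item` hands None to the next one
unlucky = Recorder()
run_pipelines({'title': 'live'}, {Forgetful(): 300, unlucky: 301})
```

In this simulation the second recorder receives None, which is the situation the return item statement is meant to avoid.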

A simple example: crawling Huya live streams:

# Setup:
scrapy startproject huyaPro
cd huyaPro
scrapy genspider huya www.xxx.com

# settings.py

BOT_NAME = 'huyaPro'

SPIDER_MODULES = ['huyaPro.spiders']
NEWSPIDER_MODULE = 'huyaPro.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
LOG_LEVEL = 'ERROR'

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# 6. Enable the pipelines in the settings file
ITEM_PIPELINES = {
    'huyaPro.pipelines.HuyaproPipeline': 300,
    'huyaPro.pipelines.MysqlPipeLine': 301,
    'huyaPro.pipelines.RedisPipeLine': 302,

}

# items.py
import scrapy


class HuyaproItem(scrapy.Item):
    # define the fields for your item here like:
    # Field is a universal type that can hold data of any type
    # name = scrapy.Field()

    # 2. Define the corresponding fields in the item class
    title = scrapy.Field()
    author = scrapy.Field()
    hot = scrapy.Field()

# spiders/huya.py
import scrapy

# Pipeline-based storage: import the item class
from huyaPro.items import HuyaproItem

class HuyaSpider(scrapy.Spider):
    name = 'huya'

    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.huya.com/g/4079']

    def parse(self, response):

        # Terminal-based storage: collect dicts, return the list, then run
        #   scrapy crawl huya -o huya.csv
        # all_data_list = []

        # 1. Parse the data
        li_list = response.xpath('//*[@id="js-live-list"]/li')
        for li in li_list:
            # li.xpath('./a[2]/text()')[0] is a Selector: <Selector xpath='./a[2]/text()' data='余生...'>
            title = li.xpath('./a[2]/text()').extract_first()
            author = li.xpath('./span/span[1]/i/text()').extract_first()
            hot = li.xpath('./span/span[2]/i[2]/text()').extract_first()

            # Terminal-based storage:
            # all_data_list.append({'title': title, 'author': author, 'hot': hot})

            # Pipeline-based storage: instantiate an item object for this <li>
            item = HuyaproItem()

            # 3. Store the parsed data in the item; fields are accessed with [] rather than as attributes
            item['title'] = title
            item['author'] = author
            item['hot'] = hot

            # print(item)  # an item behaves like a dict
            # 4. Submit the item to the pipelines (inside the loop, so every <li> is submitted)
            yield item

        # Terminal-based storage:
        # return all_data_list

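# pipelines.py

# Writes the data to a file on disk; each pipeline class stores data to one kind of storage target
class HuyaproPipeline:
    fp = None

    # Hook called by Scrapy exactly once, when the spider starts
    def open_spider(self, spider):
        self.fp = open('./huya.txt', 'w', encoding='utf-8')

    # 5. process_item receives the item object; it runs once per item during the crawl, so the file must not be opened here
    def process_item(self, item, spider):
        # One record per line; format() also copes with fields that are None
        self.fp.write('{}:{} {}\n'.format(item['author'], item['title'], item['hot']))
        # Pass the item on to the next pipeline class to be executed
        return item

    # Hook called by Scrapy exactly once, when the spider finishes
    def close_spider(self, spider):
        self.fp.close()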

# Writes the data to a MySQL database
import pymysql

class MysqlPipeLine(object):
    conn = None
    cursor = None

    def open_spider(self, spider):
        self.conn = pymysql.connect(host='127.0.0.1', port=3306, user='root', password='1', database='Spider', charset='utf8')
        # Create the cursor once, so close_spider can close it even if no item arrived
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # Pass the values as parameters so the driver escapes them (quotes in a title would break a formatted string)
        sql = 'insert into huya values(%s, %s, %s)'
        try:
            self.cursor.execute(sql, (item['author'], item['title'], item['hot']))
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        # Always return the item, otherwise the next pipeline class cannot receive it
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()

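# Writes the data to a Redis database
import json
from redis import Redis

class RedisPipeLine(object):
    conn = None

    def open_spider(self, spider):
        self.conn = Redis(host='127.0.0.1', port=6379)

    def process_item(self, item, spider):
        # Redis list entries are strings or bytes, so serialise the dict-like item to JSON before pushing it
        self.conn.lpush('huyaList', json.dumps(dict(item), ensure_ascii=False))
        # No pipeline follows this one, but returning the item keeps the chain intact if one is added later
        return item

    def close_spider(self, spider):
        self.conn.close()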
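
Since entries in a Redis list are strings or bytes, a dict-like item needs to be serialised before it is pushed and parsed again when it is read back. The sketch below shows that round trip with JSON and no Redis connection; the helper names and the sample record are hypothetical, and a plain dict stands in for the item.

```python
import json


def serialise_item(item):
    """Turn a dict-like item into a JSON string suitable for a Redis list entry."""
    return json.dumps(dict(item), ensure_ascii=False)


def deserialise_item(payload):
    """Rebuild the dict from the stored JSON string."""
    return json.loads(payload)


item = {'title': '余生', 'author': 'alice', 'hot': '1.2万'}   # sample record
payload = serialise_item(item)          # what would be passed to lpush
restored = deserialise_item(payload)    # what a consumer would rebuild
```

Passing ensure_ascii=False keeps the Chinese text readable in the stored string instead of escaping it.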

posted @ 2020-08-11 21:56 虫萧
