Scrapy: Basic Introduction and Usage

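I. Engineering the crawler

In the earlier lessons we covered most of the techniques a crawler needs, but the code we have written so far is a linear script, which is hard to use commercially. To reach production quality, that code has to be thoroughly reorganized into an engineered crawler. "Engineered" here means giving the program more system, clearer logic, and more modularity.

An analogy: my family used to make shoes at home. When my mother made me a pair, she went from drawing the pattern to cutting to the final stitching, finishing the shoe step by step. For a few pairs a year that works fine; it is hard work for her, but she can manage. If I want to sell these shoes, though, relying on her to sew them one pair at a time is unrealistic. Why? First, output cannot keep up, because one person's capacity is limited. Second, one person has to carry the whole shoemaking process from start to finish, and even if you want to hire help, craftsmen that skilled are hard to find. What to do? You may have guessed it already: since people who can make a whole shoe are hard to find, split the process into four parts (drawing, cutting, stitching, inspection) and hire four people. Each one handles a small part that is easy to complete. With one person (me) acting as overall coordinator, my little shoe factory is up and running.

The same logic applies to our crawlers. Up to now we have handled every step of every crawler ourselves, from start to finish. That has its advantage (better control), but think it through: code structured this way cannot be produced in bulk (imagine writing 100 crawlers), and a lot of shared logic is never reused. So consider taking the common problems (fetching the page source, storing the data) and turning each into its own function. If we write these functions separately, as standalone modules, crawler development becomes far more efficient, and that is what engineering the crawler achieves.

Crawler engineering: developing the crawler's functions as modules, to the point where things can be produced in bulk (both the development work and the data output).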

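II. Introduction to Scrapy

Scrapy remains one of the most widely used crawling frameworks for Python. Here is how the official site describes it:

An open source and collaborative framework for extracting the data you need from websites.

In a fast, simple, yet extensible way.

Scrapy's strengths: fast, simple, highly extensible.

Official Scrapy documentation (English): https://docs.scrapy.org/en/latest/

What is a framework? You follow the logic the framework was designed around and simply fill in your own content.
When learning any framework, resist the urge to dig into its source code right away.
Learn how to use it first (how to fill in the blanks), then go back and read the source. It is much easier to understand that way.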

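III. Scrapy workflow (key topic)

# Pseudocode, for illustration only
def get_page_source(url):
    resp = requests.get(url)
    return resp.text  # or resp.json()

def parse_source(source):
    # extract the data with xpath, bs4, or re
    return data

def save_data(data):
    # write to txt, csv, mysql, mongodb, ...
    ...

def main():  # controls the overall flow
    # page source of the home page
    ret = get_page_source(start_url)  # send the network request, get the page source
    data = parse_source(ret)  # parse out the data you need
    # new URLs still need to be requested
    while more_urls:
        # detail page
        ret = get_page_source(detail_url)  # send the network request, get the page source
        data = parse_source(ret)  # parse out the data you need
        save_data(data)  # store the data

        # if the detail page is paginated as well,
        # ... repeat the steps above.

if __name__ == '__main__':
    main()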

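The pseudocode above is the logic of the crawlers we have written so far.

Scrapy's workflow as a whole:

1. The start URLs in the spider are wrapped in request objects and passed to the scheduler.
2. The engine takes a request object from the scheduler and hands it to the downloader.
3. The downloader fetches the page source, wraps it in a response object, and returns it to the engine.
4. The engine passes the response object to the spider, which parses the data (parse) and returns the result to the engine.
5. The engine passes the data to the pipeline for persistent storage or further processing.
6. If what the spider extracts along the way is not data but the URL of a sub-page, it can be submitted to the scheduler, and the process repeats from step 2.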

# Pseudocode, for illustration only
def get_page_source(url, method):
    if method == "get":
        resp = requests.get(url)
        return resp.text  # or resp.json()

def parse_source(source):
    # xpath, bs4, re
    ...

def save_data(data):
    # txt, csv, mysql, mongodb
    ...

def main():  # controls the overall flow -> written this way to aid understanding
    # home page
    req = spider.get_first_req()
    while 1:
        scheduler.send(req)
        next_req = scheduler.next_req()
        sth = downloader.get_page_source(next_req)
        data = spider.parse(sth)
        if isinstance(data, dict):  # the result is data, not a new request
            pipeline.process_item(data)

if __name__ == '__main__':
    main()

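The same few things keep coming up in the process above:

1. Engine

   The core of Scrapy: it connects all the modules and coordinates the data flow.

2. Scheduler

   In essence it can be seen as a set plus a queue holding the requests we are about to send, that is, a container of URLs. It decides which URL gets crawled next. This is usually where URLs are deduplicated.

3. Downloader

   In essence, a module for sending requests. Beginners can think of it as doing the job of requests.get(), except that what it returns is a Scrapy response object.

4. Spider

   This is the first part we have to write ourselves. It parses the response object returned by the downloader and extracts the data we need.

5. Pipeline

   This is the second part we have to write ourselves. It is mainly responsible for storing the data and other persistence work.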
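The "set plus queue" description of the scheduler can be made concrete with a small toy model. This is only a sketch of the idea, not Scrapy's actual scheduler; the class name ToyScheduler and its methods are made up for illustration, and the dont_filter flag mirrors the Request parameter of the same name.

```python
from collections import deque


class ToyScheduler:
    """Toy model only: a FIFO queue of pending URLs plus a set of URLs already seen."""

    def __init__(self):
        self.queue = deque()
        self.seen = set()

    def enqueue(self, url, dont_filter=False):
        # Drop the URL if it was queued before, unless the caller opts out of filtering.
        if not dont_filter and url in self.seen:
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True

    def next_url(self):
        # Return the next URL to crawl, or None when nothing is pending.
        return self.queue.popleft() if self.queue else None


scheduler = ToyScheduler()
results = [scheduler.enqueue(u) for u in ["/a", "/b", "/a"]]
print(results)               # [True, True, False]
print(scheduler.next_url())  # /a
```

The second "/a" is rejected because it is already in the seen set, which is the deduplication mentioned above.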

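Seen this way, Scrapy takes the crawler we normally write and breaks it into pieces. Each function is wrapped up separately, the modules do not depend on each other, and everything is coordinated by the engine. The idea to take away here is decoupling: making the ties between modules looser. That makes it easy to replace any one module without affecting the others.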
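To see what decoupling buys us, here is a toy engine that only wires the parts together. It is a sketch under stated assumptions, not how Scrapy is implemented: run_engine, fake_download, and the in-memory FAKE_SITE are all hypothetical names, and a plain string stands in for "a new URL to follow".

```python
from collections import deque


def run_engine(start_urls, download, parse, process_item):
    """Toy engine: it wires the parts together and knows nothing about their internals."""
    pending = deque(start_urls)  # plays the scheduler's role
    seen = set(start_urls)
    while pending:
        url = pending.popleft()
        response = download(url)             # downloader
        for result in parse(url, response):  # spider
            if isinstance(result, str):      # a str stands for "a new URL to follow"
                if result not in seen:
                    seen.add(result)
                    pending.append(result)
            else:                            # anything else is treated as data
                process_item(result)         # pipeline


# Hypothetical in-memory "site", so the sketch runs without network access.
FAKE_SITE = {"/list": ["/game/1", "/game/2"], "/game/1": "Tetris", "/game/2": "Snake"}


def fake_download(url):
    return FAKE_SITE[url]


def parse(url, response):
    if isinstance(response, list):  # list page: yield the detail URLs
        yield from response
    else:                           # detail page: yield one item
        yield {"url": url, "name": response}


saved = []
run_engine(["/list"], fake_download, parse, saved.append)
print(saved)
```

Because the engine only receives callables, swapping fake_download for a real HTTP client, or saved.append for a database writer, requires no change to the spider logic.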

That is as much as we need to know about Scrapy for now. Later sections build further on this workflow.

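IV. Installing Scrapy

Installing Scrapy on Windows used to be painful, with all sorts of errors along the way. With Python 3.9 and later things are much more stable and problems are far less likely.

Version choice. We use:

scrapy(2.11.2) -> scrapy-redis(0.9.1)

First, try installing directly with pip and see whether it reports an error (the first command uses the Tsinghua PyPI mirror, the second the default index; run either one):

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple scrapy==2.11.2
pip install scrapy==2.11.2

If the installation succeeds, go straight to creating a project.

If it fails, upgrade pip first and then reinstall Scrapy.

If it still fails with the latest pip, adjust step by step according to the error message and retry until the installation succeeds.

If nothing helps, switch to a Python 3.9+ interpreter.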

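V. A Scrapy example

Next we use Scrapy to build a very simple crawler. The goal: understand how Scrapy's workflow runs and how the modules work together.

1. Create the project:

scrapy startproject <project name>

Example:

scrapy startproject mySpider_2

Once the project is created, PyCharm shows that Scrapy has generated a folder with the following structure:

mySpider_2   # folder containing the project; open this folder in PyCharm
    ├── mySpider_2          # project root package
    │   ├── __init__.py
    │   ├── items.py        # data format definitions
    │   ├── middlewares.py  # all middlewares
    │   ├── pipelines.py    # all pipelines
    │   ├── settings.py     # crawler settings
    │   └── spiders         # spider folder; the spider code goes here later
    │       └── __init__.py
    └── scrapy.cfg          # Scrapy project configuration; do not delete or modify it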

2. Create the spider

cd <folder>  # enter the project folder
scrapy genspider <spider name> <allowed domain>

Example:

cd mySpider_2
scrapy genspider youxi 4399.com

Result:

(base) sylardeMBP:第七章 sylar$ cd mySpider_2
(base) sylardeMBP:mySpider_2 sylar$ ls
mySpider_2      scrapy.cfg
(base) sylardeMBP:mySpider_2 sylar$ scrapy genspider youxi http://www.4399.com/
Created spider 'youxi' using template 'basic' in module:
  mySpider_2.spiders.youxi
(base) sylardeMBP:mySpider_2 sylar$

The spider has now been created. Open the folder and take a look:

├── mySpider_2
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── youxi.py   # this file is new
└── scrapy.cfg

3. Write the parsing logic

Fill in youxi.py.

import scrapy

class YouxiSpider(scrapy.Spider):
    name = 'youxi'  # this name matters: it is what we use to start the spider
    allowed_domains = ['4399.com']  # domains the spider is allowed to crawl
    start_urls = ['http://www.4399.com/flash/']  # start page

    def parse(self, response, **kwargs):
        # response.text  # page source
        # response.xpath()  # extract with XPath
        # response.css()  # extract with CSS selectors
        # response.json() # extract JSON data

        # Use the approach we know best: XPath, to extract game name, category, release date, etc.
        li_list = response.xpath("//ul[@class='n-game cf']/li")
        for li in li_list:
            name = li.xpath("./a/b/text()").extract_first()
            category = li.xpath("./em/a/text()").extract_first()
            date = li.xpath("./em/text()").extract_first()

            dic = {
                "name": name,
                "category": category,
                "date": date
            }

            # Hand the extracted data over to the pipeline.
            # Note: only Request objects, dicts, Item objects, or None may be returned here
            yield dic

Note:

A spider callback may only return dicts, Request objects, Item objects, or None. Anything else raises an error.
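The parse method relies on one pattern: find the list nodes once, then use paths starting with ./ relative to each node. A minimal sketch of that pattern follows, using the standard library's ElementTree so it runs without Scrapy. The markup is hypothetical and simplified (the real page may be structured differently), and ElementTree supports only a subset of XPath, so this illustrates the idea rather than replacing response.xpath().

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page may be structured differently.
html = """
<div>
  <ul class="n-game cf">
    <li><a href="/flash/1.htm"><b>Snake</b></a><em><a href="/arcade/">arcade</a></em><em>11-12</em></li>
    <li><a href="/flash/2.htm"><b>Tetris</b></a><em><a href="/puzzle/">puzzle</a></em><em>11-11</em></li>
  </ul>
</div>
"""

root = ET.fromstring(html)
games = []
for li in root.findall(".//ul[@class='n-game cf']/li"):  # search the whole document once
    games.append({
        "name": li.find("./a/b").text,       # paths starting with ./ are relative to this <li>
        "category": li.find("./em/a").text,
        "date": li.findall("./em")[1].text,  # the second <em> holds the date in this sample
    })
print(games)
```

Each dict built in the loop has the same shape as the dic yielded by the spider.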

Run the spider:

scrapy crawl <spider name>

Example:

scrapy crawl youxi

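4. Write a pipeline to save the data

The data is passed on to the pipeline. Let's first look at what it looks like there.

Start by editing the pipeline entry in settings.py.

ITEM_PIPELINES = {
    # key: import path of the pipeline class
    # value: priority; the lower the number, the earlier the pipeline runs
   'mySpider_2.pipelines.Myspider2Pipeline': 300,
}

Then modify the code in the pipeline:

class Myspider2Pipeline:
    # Do not change this method's signature! Scrapy calls process_item automatically
    # for every item the spider returns. Change it and the pipeline is broken.
    def process_item(self, item, spider):
        print(item)
        return item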
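The two rules above (a lower priority number runs first, and process_item must return the item) can be sketched with a toy runner. This is an illustrative model, not Scrapy's internals: run_pipelines and both pipeline classes are hypothetical, and pipeline instances are used as dict keys here only to keep the example self-contained.

```python
class UppercaseNamePipeline:
    def process_item(self, item, spider):
        item["name"] = item["name"].upper()
        return item  # hand the item on to the next pipeline


class CollectPipeline:
    def __init__(self):
        self.items = []

    def process_item(self, item, spider):
        self.items.append(dict(item))
        return item


def run_pipelines(item, pipelines_by_priority, spider=None):
    """Toy model: call each pipeline in ascending priority order,
    feeding each one the value returned by the previous one."""
    ordered = sorted(pipelines_by_priority.items(), key=lambda kv: kv[1])
    for pipeline, _priority in ordered:
        item = pipeline.process_item(item, spider)
    return item


collector = CollectPipeline()
# The collector is listed first but has the larger number, so it still runs last.
result = run_pipelines({"name": "snake"}, {collector: 400, UppercaseNamePipeline(): 300})
print(collector.items)
```

If UppercaseNamePipeline forgot its return statement, the collector would receive None instead of the item, which is why the return matters.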

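VI. Custom data structure for passing data: Item

In the example above we used a dict to carry the data. But dict keys can be made up freely, so as the amount of data grows, mistakes creep in easily (a mistyped key, for instance), and a dict stops being a good fit. Scrapy provides Item as the place to declare the data format. In items.py we can define up front the format this spider uses when passing data, and the code we write afterwards then has fixed field names to rely on.

items.py:

import scrapy

class GameItem(scrapy.Item):
    # declare the data structure
    name = scrapy.Field()
    category = scrapy.Field()
    date = scrapy.Field()

# Analogy (not Scrapy code): an Item relates to a dict the way a class with declared
# fields relates to a free-form dict.
#   dic = {"name": "alex", "age": 18}   # free-form: any key is accepted
#   p = Person("alex", 18)              # fixed fields: name, age

In the spider, use it like this:

from mySpider_2.items import GameItem

# In the spider's parse method, the following replaces the original dict
item = GameItem()
item["name"] = name
item["category"] = category
item["date"] = date
yield item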
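Why declared fields help is easiest to see with a typo. The sketch below uses a toy stand-in written with the standard library so it runs without Scrapy; StrictItem and GameRecord are hypothetical names. It mimics the behaviour of scrapy.Item, which raises a KeyError when you assign to a field that was not declared.

```python
class StrictItem(dict):
    """Toy stand-in for an item type: only keys listed in `fields` may be set."""

    fields = ()

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"{type(self).__name__} does not support field: {key}")
        super().__setitem__(key, value)


class GameRecord(StrictItem):
    fields = ("name", "category", "date")


loose = {}
loose["catgory"] = "puzzle"        # typo in the key; a plain dict accepts it silently

record = GameRecord()
record["category"] = "puzzle"      # declared field: accepted
error = None
try:
    record["catgory"] = "puzzle"   # same typo: rejected immediately
except KeyError as exc:
    error = str(exc)
print(error)
```

With the plain dict the typo only surfaces later, for example as a missing column in the saved data; with declared fields it fails at the line that contains the mistake.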

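VII. Scrapy pipelines

A. About pipelines

1. Writing to a CSV file

Writing to a file is simple: just open the file in the pipeline. The point to make here is that doing all the file handling inside process_item is not elegant. We should not have to call open once for every single item:

class YouxiPipeline:
    def process_item(self, item, spider):
        # Not elegant: the file is reopened (in append mode) for every item
        with open("youxi.csv", mode="a", encoding='utf-8') as f:
            f.write(f"{item['name']},{item['lei']},{item['tm']}")
            f.write("\n")
        return item

What we want is to open the file once and then use that single file handle for all the writes. That is possible: a pipeline can define two more methods, open_spider() and close_spider(). The names say it all:
open_spider() runs once when the spider starts
close_spider() runs once when the spider finishes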

class YouxiPipeline:
    def open_spider(self, spider):
        self.f = open('youxi.csv', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.f.close()

    def process_item(self, item, spider):
        self.f.write(f"{item['name']},{item['lei']},{item['tm']}")
        self.f.write("\n")
        return item

Open the file when the spider starts, close it when the spider finishes. Full marks.
Then register the pipeline in settings:

ITEM_PIPELINES = {
   "youxi.pipelines.YouxiPipeline": 300
}
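One caveat with building CSV lines by hand in an f-string: a game name that itself contains a comma would split into two columns. A possible variant, sketched here under the assumption that the items keep the name/lei/tm fields used above, lets the standard csv module handle quoting. The class name CsvWriterPipeline and the temporary demo path are made up for illustration.

```python
import csv
import os
import tempfile


class CsvWriterPipeline:
    """Same open_spider/close_spider pattern, but rows go through csv.writer."""

    def __init__(self, path="youxi.csv"):
        self.path = path

    def open_spider(self, spider):
        # newline="" is what the csv module expects for files it writes to.
        self.f = open(self.path, "w", encoding="utf-8", newline="")
        self.writer = csv.writer(self.f)

    def close_spider(self, spider):
        self.f.close()

    def process_item(self, item, spider):
        self.writer.writerow([item["name"], item["lei"], item["tm"]])
        return item


# Drive the pipeline by hand with a made-up item, the way Scrapy would call it.
demo_path = os.path.join(tempfile.mkdtemp(), "youxi_demo.csv")
pipeline = CsvWriterPipeline(demo_path)
pipeline.open_spider(spider=None)
pipeline.process_item({"name": "Fire, Ice", "lei": "puzzle", "tm": "2024-11-12"}, spider=None)
pipeline.close_spider(spider=None)

with open(demo_path, encoding="utf-8", newline="") as f:
    rows = list(csv.reader(f))
print(rows)
```

The name "Fire, Ice" comes back as a single column because csv.writer quoted it on the way out.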

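2. Writing to a MySQL database

With the example above, writing to a database follows naturally. Create the database connection in open_spider, close it in close_spider, and save the data in process_item.
Start by putting the MySQL settings into settings.py:

# MySQL configuration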
MYSQL_CONFIG = {
   "host": "localhost",
   "port": 3306,
   "user": "root",
   "password": "510alSKZ8dl.",
   "database": "spider",
}

from youxi.settings import MYSQL_CONFIG as mysql
import pymysql

class YouxiMysqlPipeline:

    def open_spider(self, spider):
        self.conn = pymysql.connect(host=mysql["host"], port=mysql["port"], user=mysql["user"], password=mysql["password"], database=mysql["database"])
        self.cursor = self.conn.cursor()  # one cursor, reused for every insert

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()

    def process_item(self, item, spider):
        try:
            sql = "INSERT INTO youxi1(name, lei, tm) VALUES (%s, %s, %s)"
            self.cursor.execute(sql, (item['name'], item['lei'], item['tm']))
            self.conn.commit()
        except Exception as e:
            print("Failed to save one record:", e)
            print("Record:", item)
            self.conn.rollback()
            spider.logger.error(f"Failed to save to the database: {e}; item: {item}")  # write an error log entry
        return item

Don't forget to register the pipeline:

ITEM_PIPELINES = {
   "youxi.pipelines.YouxiMysqlPipeline": 301
}
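The connect / commit / rollback lifecycle can be tried out without a MySQL server. The sketch below swaps in the standard library's sqlite3 purely as a stand-in (an assumption for demonstration, not the setup used in this article); note that sqlite3 uses ? as its placeholder where pymysql uses %s. SqlitePipeline and the NOT NULL constraint are illustrative choices.

```python
import sqlite3


class SqlitePipeline:
    """Stand-in for the MySQL pipeline: same lifecycle, backed by stdlib sqlite3."""

    def open_spider(self, spider):
        self.conn = sqlite3.connect(":memory:")  # in-memory database, nothing to install
        self.conn.execute("CREATE TABLE youxi1(name TEXT NOT NULL, lei TEXT, tm TEXT)")
        self.failed = []

    def close_spider(self, spider):
        self.conn.close()

    def process_item(self, item, spider):
        try:
            # Values are passed separately from the SQL text, never formatted into it.
            self.conn.execute(
                "INSERT INTO youxi1(name, lei, tm) VALUES (?, ?, ?)",
                (item["name"], item["lei"], item["tm"]),
            )
            self.conn.commit()
        except sqlite3.Error:
            self.conn.rollback()
            self.failed.append(item)
        return item  # returned on success and on failure alike


pipeline = SqlitePipeline()
pipeline.open_spider(spider=None)
pipeline.process_item({"name": "Snake", "lei": "arcade", "tm": "2024-11-12"}, spider=None)
# A missing name violates NOT NULL, so this insert is rolled back.
pipeline.process_item({"name": None, "lei": "arcade", "tm": "2024-11-12"}, spider=None)
stored = pipeline.conn.execute("SELECT name, lei, tm FROM youxi1").fetchall()
print(stored, len(pipeline.failed))
pipeline.close_spider(spider=None)
```

Returning the item outside the try/except keeps later pipelines working even when one record fails to save.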

3. Writing to a MongoDB database

Writing to MongoDB follows the same pattern as writing to MySQL, so let's go straight to the code.

MONGO_CONFIG = {
   "host": "localhost",
   "port": 27017,
   #'has_user': True,
   "user": "python_admin",
   "password": "123456",
   "db": "python"
}

from youxi.settings import MONGO_CONFIG as mongo
import pymongo

class YouxiMongoDBPipeline:
    def open_spider(self, spider):
        client = pymongo.MongoClient(host=mongo['host'],
                                     port=mongo['port'])
        db = client[mongo['db']]
        # Authentication, if your server needs it. db.authenticate() was removed in
        # PyMongo 4; pass username/password to MongoClient instead.
        self.client = client  # your setup does not need to be this involved
        self.collection = db['youxi']

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # insert_one replaces Collection.insert, which was removed in PyMongo 4
        self.collection.insert_one({"name": item['name'], 'lei': item["lei"], 'tm': item['tm']})
        return item

ITEM_PIPELINES = {
    "youxi.pipelines.YouxiMongoDBPipeline": 302
}

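4. Saving files

Next, let's use Scrapy to download some images and see how it goes.

# install the module first
pip install pillow

First, pick an image site (this one was chosen in advance): https://desk.zol.com.cn/dongman/. Take a look at it.

Next, create the project and fill in the spider. Pay attention to yield scrapy.Request().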

import scrapy
from urllib.parse import urljoin


class ZolSpider(scrapy.Spider):
    name = 'zol'
    allowed_domains = ['zol.com.cn']
    start_urls = ['https://desk.zol.com.cn/dongman/']

    def parse(self, resp, **kwargs):  # Scrapy calls parse automatically -> parse the data
        # print(resp.text)
        # 1. Get the URLs of the detail pages
        a_list = resp.xpath("//*[@class='pic-list2  clearfix']/li/a")
        for a in a_list:
            href = a.xpath("./@href").extract_first()
            if not href or href.endswith(".exe"):  # skip missing links and installer downloads
                continue

            # href = urljoin(resp.url, href)  # joining this way is always safe
            # print(resp.url)  # resp.url: the URL whose request produced this response
            # print(href)
            # Scrapy only
            href = resp.urljoin(href)  # joins resp.url with the part you pass in
            # print(href)
            # 2. Request the detail page and get the image's download address

            # Send a new request by yielding a new request object.
            # In the spider, the request object needs at least the following:
            # url  -> the URL to request
            # method -> the request method
            # callback -> how to parse the response once it arrives; pass the parsing method here
            yield scrapy.Request(
                url=href,
                method="get",
                # the parsing method that runs automatically when this URL's response comes back
                callback=self.suibianqimignzi,
            )

    def suibianqimignzi(self, resp, **kwargs):  # the callback can have any name you like
        # the response received here is the one returned for url=href
        img_src = resp.xpath("//*[@id='bigImg']/@src").extract_first()
        # print(img_src)
        yield {"img_src": img_src}
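The spider resolves every href against the URL of the page it was found on. What that joining does for different kinds of links can be checked with the standard library's urljoin, which the commented-out line in the spider also uses. The href values below are hypothetical examples, not links taken from the real page.

```python
from urllib.parse import urljoin

page_url = "https://desk.zol.com.cn/dongman/"  # the page the links were found on

# Hypothetical href values, covering the usual shapes a link can take.
hrefs = [
    "/bizhi/123.html",             # root-relative
    "2.html",                      # relative to the current directory
    "https://example.com/x.html",  # already absolute
    "//example.com/y.html",        # scheme-relative
]

resolved = [urljoin(page_url, href) for href in hrefs]
for href, full in zip(hrefs, resolved):
    print(f"{href!r:32} -> {full}")
```

Joining by plain string concatenation would only get the second case right, which is why the spider goes through a URL-aware join.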

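'''
Parameters of Request():
url: request URL
method: request method
callback: callback function
errback: callback invoked on errors
dont_filter: defaults to False, meaning the request goes through the duplicate filter; set it to True to skip filtering so the request is sent again
headers: request headers
cookies: cookie information
'''

Next comes the download itself. How do you download an image in a pipeline? Scrapy has this ready for you: its ImagesPipeline downloads images automatically.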

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
import scrapy
from itemadapter import ItemAdapter
# ImagesPipeline: the pipeline dedicated to images
from scrapy.pipelines.images import ImagesPipeline


class TuPipeline:
    def process_item(self, item, spider):
        print(item['img_src'])
        # One possible way to store the image by hand:
        # import requests
        # resp = requests.get(img_src)
        # resp.content
        return item

# Scrapy's way
class MyTuPipeline(ImagesPipeline):
    # 1. Send the request (download images, files, videos, ...)
    def get_media_requests(self, item, info):
        url = item['img_src']
        yield scrapy.Request(url=url, meta={"sss": url})  # just yield a request object

    # 2. Storage path of the image
    # Full path: IMAGES_STORE + the return value of file_path()
    # Folders are created automatically along the way
    def file_path(self, request, response=None, info=None, *, item=None):
        # prepare the folder
        img_path = "dongman/imgs/kunmo/libaojun/liyijia"
        # prepare the file name
        # Pitfall: response.url cannot be relied on here (response may be None when file_path is called)
        # file_name = response.url.split("/")[-1]  # take the url straight from the response
        # print("response:", file_name)
        file_name = item['img_src'].split("/")[-1]  # take the url from the item
        print("item:", file_name)
        file_name = request.meta['sss'].split("/")[-1]  # or take it from request.meta
        print("meta:", file_name)

        real_path = img_path + "/" + file_name  # join folder path and file name
        return real_path  # just return the storage path

    # 3. The item may need updating afterwards
    def item_completed(self, results, item, info):
        # print(results)
        for ok, result in results:  # each entry is a (success, info) pair
            if ok:
                print(result['path'])
        return item  # always return the item so it reaches the next pipeline
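Taking the file name with split("/")[-1] works for plain image URLs, but a query string would end up inside the file name. A slightly more defensive helper could look like the sketch below. The function name file_name_from_url, its default argument, and the example URLs are all hypothetical; it could be called from file_path in place of the split.

```python
import posixpath
from urllib.parse import urlparse


def file_name_from_url(url, default="unnamed.jpg"):
    """Return the last path segment of a URL, ignoring any query string or fragment."""
    name = posixpath.basename(urlparse(url).path)
    return name or default  # `default` is a hypothetical fallback for URLs ending in "/"


# Hypothetical image URLs for illustration.
print(file_name_from_url("https://img.example.com/a/b/cat.jpg"))        # cat.jpg
print(file_name_from_url("https://img.example.com/a/b/cat.jpg?w=960"))  # cat.jpg
print("https://img.example.com/a/b/cat.jpg?w=960".split("/")[-1])       # cat.jpg?w=960
```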

Finally, set the following in settings:

LOG_LEVEL = "WARNING"

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
ITEM_PIPELINES = {
   'tu.pipelines.TuPipeline': 300,
   'tu.pipelines.MyTuPipeline': 301,
}

MEDIA_ALLOW_REDIRECTS = True
# Downloading images requires this setting
# Root storage directory
IMAGES_STORE = "./qiaofu"

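VIII. Scrapy usage summary

At this point we have a first, basic understanding of Scrapy and how to use it. A quick recap of the workflow:

1. Create the project: scrapy startproject xxx
2. Enter the project directory: cd xxx
3. Create the spider: scrapy genspider <name> <domain>
4. Edit items.py and define the data Item
5. Modify the parse method in the spider to parse the response object and return items
6. Save the data in the pipeline
7. Edit settings.py to enable the pipeline and set its priority
8. Start the spider: scrapy crawl <name>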

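IX. Complete example

runner.py

Entry script for launching the project from PyCharm

from scrapy.cmdline import execute

if __name__ == '__main__':
    # either form works
    # execute(['scrapy', 'crawl', 'qiaofu_youxi'])
    execute("scrapy crawl qiaofu_youxi".split())

# This snippet can be placed inside the Scrapy project's source

youxi.py under the spiders folder

Parses the page data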

import scrapy

from youxi.items import YouxiItem


class QiaofuYouxiSpider(scrapy.Spider):
    name = "qiaofu_youxi"  # spider name, used with scrapy crawl
    allowed_domains = ["4399.com"]  # several domains can be listed
    start_urls = ["https://www.4399.com/flash/new.htm"]  # URLs to visit first

    def parse(self, response):
        # print(response.text)
        # parse the page source
        li_list = response.xpath('//ul[@class="n-game cf"]/li')
        for li in li_list:
            # extract the fields of each entry
            name = li.xpath('./a/b/text()').extract_first()
            lei = li.xpath('./em/a/text()').extract_first()
            tm = li.xpath("./em[2]/text()").extract_first()
            item = YouxiItem(name=name, lei=lei, tm=tm)
            yield item

items.py

Custom data structure (Item) for passing data

import scrapy


class YouxiItem(scrapy.Item):
    name = scrapy.Field()
    lei = scrapy.Field()
    tm = scrapy.Field()

settings.py

Configuration file

# Scrapy settings for youxi project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = "youxi"

SPIDER_MODULES = ["youxi.spiders"]
NEWSPIDER_MODULE = "youxi.spiders"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = "youxi (+http://www.yourdomain.com)"

# Obey robots.txt rules
# Whether to obey the robots.txt protocol
ROBOTSTXT_OBEY = False
# Log level
LOG_LEVEL = "WARNING"

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# Number of requests sent concurrently
# CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# Downloader delay in seconds between requests to the same site, similar in effect to time.sleep(3)
DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
# CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
# Built-in session-like behaviour: cookies are handled for you automatically
# COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
# TELNETCONSOLE_ENABLED = False

# Override the default request headers:
# Default request headers
DEFAULT_REQUEST_HEADERS = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
    "accept-encoding": "gzip, deflate, br, zstd",
    "accept-language": "zh-CN,zh;q=0.9",
    "cache-control": "max-age=0",
    "connection": "keep-alive",
    "cookie": "home4399=yes; UM_distinctid=1931f73656410a-0b1ce47bf3efd1-26011951-144000-1931f736565158; _4399stats_vid=17314642208328160; CNZZDATA30039538=cnzz_eid%3D373188627-1731399477-https%253A%252F%252Fwww.baidu.com%252F%26ntime%3D1731464221; Hm_lvt_334aca66d28b3b338a76075366b2b9e8=1731399477,1731464222; Hm_lpvt_334aca66d28b3b338a76075366b2b9e8=1731464222; HMACCOUNT=BE3BFACC7E1F214F",
    "host": "www.4399.com",
    "if-modified-since": "Tue, 12 Nov 2024 01:15:46 GMT",
    "if-none-match": "W/\"6732ac42-147cd\"",
    "sec-ch-ua": "\"Google Chrome\";v=\"131\", \"Chromium\";v=\"131\", \"Not_A Brand\";v=\"24\"",
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": "\"Windows\"",
    "sec-fetch-dest": "document",
    "sec-fetch-mode": "navigate",
    "sec-fetch-site": "none",
    "sec-fetch-user": "?1",
    "upgrade-insecure-requests": "1",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
#    "youxi.middlewares.YouxiSpiderMiddleware": 543,
# }

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
#    "youxi.middlewares.YouxiDownloaderMiddleware": 543,
# }

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#    "scrapy.extensions.telnet.TelnetConsole": None,
# }

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   "youxi.pipelines.YouxiPipeline": 300,
   "youxi.pipelines.YouxiPipeline_mysql": 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
# AUTOTHROTTLE_ENABLED = True
# The initial download delay
# AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
# AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
# HTTPCACHE_ENABLED = True
# HTTPCACHE_EXPIRATION_SECS = 0
# HTTPCACHE_DIR = "httpcache"
# HTTPCACHE_IGNORE_HTTP_CODES = []
# HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"

# Set settings whose default value is deprecated to a future-proof value
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"

pipelines.py

How the data is saved

from itemadapter import ItemAdapter
import pymysql


# To use this pipeline, add it to ITEM_PIPELINES in settings.py:
# ITEM_PIPELINES = {"youxi.pipelines.YouxiPipeline": 300}
# Save to a file
class YouxiPipeline:
    def open_spider(self, spider):
        self.f = open('youxi.csv', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.f.close()

    def process_item(self, item, spider):
        self.f.write(f"{item['name']},{item['lei']},{item['tm']}")
        self.f.write("\n")
        return item

# Save to a database
# To use this pipeline, add it to ITEM_PIPELINES in settings.py:
# ITEM_PIPELINES = {"youxi.pipelines.YouxiPipeline_mysql": 300}
class YouxiPipeline_mysql:
    def open_spider(self, spider):
        self.connect = pymysql.connect(
            host='192.168.0.88',
            port=3306,
            user='baseuser',
            password='510alSKZ8dl.',
            db='youxi',
            charset='utf8mb4')
        self.cursor = self.connect.cursor()

    def close_spider(self, spider):
        try:
            self.cursor.close()
            self.connect.close()
        except Exception as e:
            print("Failed to close the connection:", e)

    def process_item(self, item, spider):
        try:
            sql = "INSERT INTO youxi1(name, lei, tm) VALUES (%s, %s, %s)"
            self.cursor.execute(sql, (item['name'], item['lei'], item['tm']))
            self.connect.commit()
        except Exception as e:
            print("Failed to save one record:", e)
            print("Record:", item)
            self.connect.rollback()
        # return the item on success as well, so later pipelines still receive it
        return item

posted @ 2024-11-13 15:05 沈忻凯
