[scrapy] A simple Scrapy spider demo

A simple Scrapy spider demo that crawls the Douban Top 250 list and extracts each movie's title and tagline.

The persistence workflow used here:

  • After the spider has scraped the data, it wraps it in an Item object.
  • The yield keyword hands the Item object to the pipeline, which persists it.
  • The pipeline is enabled in settings.py.

The spider also crawls all of the pages, not just the first one. A compressed, single-file sketch of this flow is shown right below; the full project files follow after it.
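The sketch below is not part of the original project; it just squeezes the item → yield → pipeline chain into one runnable file. The pipeline only prints, and names such as MovieItem, PrintPipeline and DemoSpider are made up for illustration:

# sketch.py -- minimal, hypothetical single-file version of the flow
import scrapy
from scrapy.crawler import CrawlerProcess


class MovieItem(scrapy.Item):              # normally lives in items.py
    name = scrapy.Field()
    slogan = scrapy.Field()


class PrintPipeline:                       # normally lives in pipelines.py
    def process_item(self, item, spider):
        print(item['name'], '-', item['slogan'])
        return item


class DemoSpider(scrapy.Spider):           # normally lives in spiders/douban.py
    name = 'demo'
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        for div in response.xpath('//div[@class="item"]'):
            item = MovieItem()
            item['name'] = div.xpath('.//span[@class="title"]/text()').get()
            item['slogan'] = div.xpath('.//span[@class="inq"]/text()').get()
            yield item


if __name__ == '__main__':
    # normally done in settings.py: enable the pipeline (smaller value = higher priority)
    process = CrawlerProcess({
        'ITEM_PIPELINES': {'__main__.PrintPipeline': 300},
        # Douban rejects the default Scrapy user agent, so a browser UA is set here too
        'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
    })
    process.crawl(DemoSpider)
    process.start()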

【douban.py】

import scrapy

from ..items import ScPachongItem


class DoubanSpider(scrapy.Spider):
    name = 'douban'
    start_urls = [
        'https://movie.douban.com/top250',
    ]
    allowed_domains = ["douban.com"]
    pageNum = 0  # current offset (25 movies per page)
    # URL template for the following pages
    url = 'https://movie.douban.com/top250?start={}&filter='

    # parse() is the default callback for every response
    def parse(self, response):
        # response.xpath() applies the XPath expression directly to the page
        odiv = response.xpath('//div[@class="item"]')
        for div in odiv:
            # xpath() returns a list of Selector objects, so the matched text
            # still has to be pulled out of each Selector with extract()
            name = div.xpath('.//span[@class="title"]/text()')[0].extract()
            # note: a few entries have no tagline; extract_first('') would be safer here
            slogan = div.xpath('.//span[@class="inq"]/text()')[0].extract()

            item = ScPachongItem()
            item['name'] = name
            item['slogan'] = slogan
            yield item

        # Crawl the remaining pages: 250 movies in total, 10 pages
        self.pageNum += 25
        if self.pageNum < 250:
            url = self.url.format(self.pageNum)
            # Recursive crawl: callback=self.parse parses the next page the same way
            yield scrapy.Request(url=url, callback=self.parse)

【pipelines.py】

Item handling can run down two routes at once, for example saving to data.txt and saving to MySQL.

You only need to define one more pipeline class and register it with a priority value in 【settings.py】; the pipelines then run in the order of those values (a sketch of the database route follows the skeleton classes below).

【【The smaller the value, the higher the priority.】】

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class ScPachongPipeline:
    def __init__(self):
        self.fp = None

    # Runs once when the spider starts
    def open_spider(self, spider):
        print('spider started')
        # utf-8 so the Chinese titles and taglines are written correctly on every platform
        self.fp = open('./data.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.fp.write(item['name'] + ':' + item['slogan'] + '\n')
        return item

    # Runs once when the spider closes
    def close_spider(self, spider):
        self.fp.close()
        print('spider finished')

【【Two ways of persisting the items】】

# This is a pipeline class; its process_item method implements the persistence step.
class DoublekillPipeline(object):

    def process_item(self, item, spider):
        # persistence code (route 1: write to a file on disk)
        return item

# To persist in a second form, simply define another pipeline class:
class DoublekillPipeline_db(object):

    def process_item(self, item, spider):
        # persistence code (route 2: write to a database)
        return item
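As a concrete sketch of the database route, DoublekillPipeline_db could be filled in with pymysql roughly as follows. This is an assumption for illustration only: the connection parameters and the douban_movie table are not part of the original project.

import pymysql


class DoublekillPipeline_db(object):
    # Hypothetical sketch: assumes a local MySQL server with a database 'spider'
    # and a table douban_movie(name VARCHAR(128), slogan VARCHAR(256))
    def open_spider(self, spider):
        self.conn = pymysql.connect(host='127.0.0.1', user='root', password='root',
                                    db='spider', charset='utf8mb4')
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        sql = 'INSERT INTO douban_movie (name, slogan) VALUES (%s, %s)'
        self.cursor.execute(sql, (item['name'], item['slogan']))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()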

【items.py】

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class ScPachongItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    slogan = scrapy.Field()

【settings.py】

In ITEM_PIPELINES you adjust priority values such as 300 or 200 to control the order in which the pipelines handle each item (see the two-pipeline example after the settings file below).

# Scrapy settings for sc_pachong project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'sc_pachong'

SPIDER_MODULES = ['sc_pachong.spiders']
NEWSPIDER_MODULE = 'sc_pachong.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'sc_pachong.middlewares.ScPachongSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'sc_pachong.middlewares.ScPachongDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'sc_pachong.pipelines.ScPachongPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
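
If the database pipeline sketched above were enabled as well, ITEM_PIPELINES would list both classes. The 200/300 split here is only an example of the priority rule (smaller value runs first):

ITEM_PIPELINES = {
   'sc_pachong.pipelines.DoublekillPipeline_db': 200,  # runs first (smaller value)
   'sc_pachong.pipelines.ScPachongPipeline': 300,      # runs second
}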

The final result, saved as data.txt:

肖申克的救赎:希望让人自由。
霸王别姬:风华绝代。
阿甘正传:一部美国近现代史。
泰坦尼克号:失去的才是永恒的。 
这个杀手不太冷:怪蜀黍和小萝莉不得不说的故事。
千与千寻:最好的宫崎骏,最好的久石让。 
美丽人生:最美的谎言。
辛德勒的名单:拯救一个人,就是拯救整个世界。
星际穿越:爱是一种力量,让我们超越时空感知它的存在。
……(the remaining entries are omitted here)

The results can also be exported as JSON from the command line.

In day-to-day use you do not have to start Scrapy from the command line at all:

just put a 【run.py】 at the top level of the project.

# -*- coding: utf-8 -*-
from scrapy.cmdline import execute
execute(['scrapy', 'crawl', 'douban'])

Running this script starts the spider directly.

To save the items as JSON instead, define a 【save.py】:

# -*- coding: utf-8 -*-
from scrapy.cmdline import execute
# '-o items.json' sets the output file; on recent Scrapy versions the format is
# inferred from the file extension, so the '-t json' flag is optional there
execute(['scrapy', 'crawl', 'douban', '-o', 'items.json', '-t', 'json'])

In both scripts, change 【douban】 (the spider name) and 【items.json】 (the output file) to fit your own project.
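
Another option, not used in the original post, is to skip scrapy.cmdline and drive the crawl through Scrapy's CrawlerProcess API. The file name run_process.py and the FEEDS export below are an illustrative sketch only (FEEDS needs Scrapy 2.0+):

# run_process.py -- hypothetical alternative to run.py / save.py
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


def main():
    settings = get_project_settings()  # loads the project's settings.py
    # optional JSON export, equivalent to '-o items.json'
    settings.set('FEEDS', {'items.json': {'format': 'json'}})
    process = CrawlerProcess(settings)
    process.crawl('douban')            # spider name defined in douban.py
    process.start()                    # blocks until the crawl finishes


if __name__ == '__main__':
    main()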
