Basic Scrapy crawler usage & running cmd commands from Python

1. Installation

pip install scrapy

2. Running Scrapy and its architecture

1. Creating and running a project

Create the project

aaa@localhost pyspace % scrapy startproject demo1
New Scrapy project 'demo1', using template directory '/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/scrapy/templates/project', created in:
    /Users/aaa/app/pyspace/demo1

You can start your first spider with:
    cd demo1
    scrapy genspider example example.com

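Project layout

1. The spiders folder holds the spider files, which implement the core crawling logic. New spider .py files go under spiders; one is created below.

2. items.py defines the data structures.

3. middlewares.py holds the middlewares, for example for proxies.

4. pipelines.py is the pipeline file, used for post-processing the downloaded data. It contains a single class by default, and you can define more. Each pipeline gets an integer order value, customarily in the 0-1000 range (the generated settings.py uses 300); the smaller the value, the earlier the pipeline runs.

5. settings.py is the configuration file, e.g. whether to obey the robots protocol, the User-Agent definition, and so on.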
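
The ordering rule for pipelines can be illustrated with a minimal sketch. The three pipeline names below are hypothetical, and the sorting function only mimics the "lowest value runs first" rule; it is not Scrapy's own code.

```python
# Hypothetical pipeline names, used only to illustrate the ordering rule.
ITEM_PIPELINES = {
    "demo1.pipelines.SaveToFilePipeline": 300,
    "demo1.pipelines.CleanFieldsPipeline": 100,
    "demo1.pipelines.DropDuplicatesPipeline": 200,
}


def run_order(pipelines):
    """Return the pipeline paths sorted by their order value, lowest first."""
    return sorted(pipelines, key=pipelines.get)


for path in run_order(ITEM_PIPELINES):
    print(ITEM_PIPELINES[path], path)
```

With these values an item would pass through the cleaning step first, then the duplicate check, and reach the file writer last.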

Create the spider file

Change into the project folder, then create the spider

aaa@localhost pyspace % cd demo1
aaa@localhost demo1 % ls
demo1		scrapy.cfg
aaa@localhost demo1 % scrapy genspider baidu baidu.com
Created spider 'baidu' using template 'basic' in module:
  demo1.spiders.baidu

After running the command above, a new baidu.py appears in the project's spiders directory. After editing, its content is as follows:

import scrapy


class BaiduSpider(scrapy.Spider):
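    # Name of the spider
    name = 'baidu'
    # Domains the spider may visit (no http:// prefix needed here)
    allowed_domains = ['baidu.com']
    # Start URLs, i.e. the first URLs to be requested
    start_urls = ['http://baidu.com/']

    # Callback for the start_urls requests; response is the object returned for each request
    # Roughly comparable to response = urllib.request.urlopen(url)
    def parse(self, response):
        print("======")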

Run the spider above

Syntax:

scrapy crawl <spider name>

Note the robots protocol mentioned above: it can be understood as a convention about what may and may not be crawled. Visiting

https://www.baidu.com/robots.txt shows the relevant rules.
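
To make the idea of "what may and may not be crawled" concrete, here is a minimal sketch using Python's standard urllib.robotparser. The rules, the bot name and the example.com URLs are made up for illustration; they are not the content of Baidu's robots.txt, and this is not how Scrapy evaluates the file internally.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration; this is NOT the content of Baidu's robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Check a URL against robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "demo-bot", "https://example.com/index.html"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "demo-bot", "https://example.com/private/a.html"))
```

The first URL is allowed because no rule covers it; the second falls under the Disallow prefix and is rejected.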

1》Stop obeying the robots protocol by editing settings.py

ROBOTSTXT_OBEY = True

Change the setting above to False, or simply comment it out.

2》Run the baidu spider

aaa@localhost demo1 % pwd
/Users/aaa/app/pyspace/demo1
aaa@localhost demo1 % scrapy crawl baidu

3》The output shows the message we printed

Modify the code to locate the 百度一下 (Baidu search) button element

import scrapy


class BaiduSpider(scrapy.Spider):
    # Name of the spider
    name = 'baidu'
    # Domains the spider may visit (no http:// prefix needed here)
    allowed_domains = ['baidu.com']
    # Start URLs, i.e. the first URLs to be requested
    start_urls = ['http://baidu.com/']

    # Callback for the start_urls requests; response is the object returned for each request
    # Roughly comparable to response = urllib.request.urlopen(url)
    def parse(self, response):
        # print("======")
        # The response body as a string
        # print(response.text)
        print("******")
        # The response body as bytes
        # print(response.body)

        # response.xpath parses the response content with XPath directly. It returns a scrapy.selector.unified.SelectorList
        subList = response.xpath('//*[@id="su"]')
        print(subList)
        print(subList.__class__)
        # Index into the list to get the first selector, or call extract_first() on the list directly.
        # extract() on a selector and extract_first() on the list return the matched element as a string
        print(subList[0].extract())
        print(subList[0].extract().__class__)
        print(subList.extract_first())
        print(subList.extract_first().__class__)
        # .get() is equivalent to .extract_first()
        # print(subList.get())
        # For example, read the button's value attribute directly (this produces the last line of the output below)
        print(response.xpath('//*[@id="su"]/@value').extract_first())

Of course, a CSS selector can be used as well (bs4 is another option, applied to response.text):

subList = response.css('#su')

1》Run it again

scrapy crawl baidu

2》Result

******
[<Selector xpath='//*[@id="su"]' data='<input type="submit" id="su" value="百...'>]
<class 'scrapy.selector.unified.SelectorList'>
<input type="submit" id="su" value="百度一下" class="bg s_btn">
<class 'str'>
<input type="submit" id="su" value="百度一下" class="bg s_btn">
<class 'str'>
百度一下

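2. Architecture and basic principles

1. Architecture

1. Engine: runs automatically and needs no attention; it coordinates all request objects and hands them to the downloader

2. Downloader: fetches the data once it receives a request object from the engine

3. Spiders: define the crawling actions and the sites to crawl

4. Scheduler: queues the requests it receives from the engine and hands them back according to its own scheduling rules

5. Pipeline: processes Items in a defined order. Think of it as the data-processing stage; saving to a database or to a file is usually written here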

2. How it works
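
The way the components listed above cooperate can be pictured with a toy loop. Everything in this sketch (the FAKE_WEB data, the function names, requests represented as plain strings) is a simplified stand-in invented for illustration; it is not Scrapy's real implementation, which is asynchronous and far more elaborate.

```python
from collections import deque

# Toy stand-ins for Scrapy's components; none of this is Scrapy's real code.
FAKE_WEB = {
    "page1": {"title": "first", "links": ["page2"]},
    "page2": {"title": "second", "links": []},
}


def download(url):
    """Downloader: turn a request (here just a URL string) into a response."""
    return {"url": url, **FAKE_WEB[url]}


def parse(response):
    """Spider callback: yield items (dicts) and follow-up requests (strings)."""
    yield {"url": response["url"], "title": response["title"]}
    yield from response["links"]


def run_engine(start_urls):
    """Engine: move requests and responses between the other components."""
    scheduler = deque(start_urls)  # scheduler: queue of pending requests
    seen = set(start_urls)         # duplicate filter kept next to the queue
    stored = []                    # stands in for the item pipeline's output
    while scheduler:
        request = scheduler.popleft()
        response = download(request)
        for result in parse(response):
            if isinstance(result, dict):
                stored.append(result)     # item goes to the pipeline
            elif result not in seen:
                seen.add(result)
                scheduler.append(result)  # new request goes to the scheduler
    return stored


print(run_engine(["page1"]))
```

The loop ends when the scheduler has no pending requests left, which mirrors how a crawl finishes once nothing more is queued.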

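3. Scrapy examples

1. Crawling dushu.com

We crawl the book information in the 散文随笔 (essays) category of dushu.com. The first page is:

https://www.dushu.com/book/1163_1.html

This requires CrawlSpider, which lets you define rules for extracting the links on a page that match them and then continuing the crawl from those links. The link extraction parameters are:

allow=() regular expression(s); extract the links that match

deny=() regular expression(s); reject the links that match

allow_domains=() allowed domains

deny_domains=() rejected domains

restrict_xpaths=() extract links from the regions matched by the XPath

restrict_css=() extract links from the regions matched by the CSS selector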
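
Since allow and deny take regular expressions, small details such as an unescaped dot change which links match. The sketch below uses only the standard re module and assumes the patterns are applied with search semantics (worth verifying against the LinkExtractor documentation); the link list is made up for the demonstration.

```python
import re

# Unescaped dot: the "." before "html" matches any character.
LOOSE = re.compile(r"/book/1163_\d+.html")
# Escaped dot: only a literal ".html" matches.
STRICT = re.compile(r"/book/1163_\d+\.html")


def matching_links(pattern, links):
    """Keep the links in which the pattern is found anywhere in the URL."""
    return [link for link in links if pattern.search(link)]


# Hypothetical links, made up for the demonstration.
links = [
    "https://www.dushu.com/book/1163_2.html",
    "https://www.dushu.com/book/1163_13.html",
    "https://www.dushu.com/book/1164_1.html",
    "https://www.dushu.com/book/1163_2xhtml",
]

print(matching_links(STRICT, links))
print(matching_links(LOOSE, links))
```

The strict pattern keeps only the two pagination links of category 1163, while the loose one also accepts the malformed last link.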

2. Creating and running the project

Create the project

scrapy startproject dushu
cd dushu
scrapy genspider -t crawl read_dushu www.dushu.com
# List the existing spiders
aaa@localhost dushu % scrapy list
read_dushu

Modify the code

1》Modify read_dushu.py: the item uses the simplest data structure, a dict

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ReadDushuSpider(CrawlSpider):
    name = 'read_dushu'
    allowed_domains = ['www.dushu.com']
    start_urls = ['https://www.dushu.com/book/1163_1.html']

    '''
    follow controls whether links keep being followed, i.e. whether this rule is also applied to the pages it leads to.
    False: the rule applies to the current page only
    True: pages crawled later apply the rule as well, so more pages get crawled (page numbers grow as later pages are visited; the first page only shows links up to page 13 and the rest appear as ...)
    '''
    rules = (
        Rule(LinkExtractor(allow=r'/book/1163_\d+\.html'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        div_list = response.xpath('//div[@class="bookslist"]/ul/li/div')
        for div in div_list:
            # The item is a plain dict; create a new one for every book
            item = {}
            # data-original means the image is lazy-loaded, so the src attribute cannot be used
            item['src'] = div.xpath('./div/a/img/@data-original').extract_first()
            item['name'] = div.xpath('./div/a/img/@alt').extract_first()
            item['author'] = div.xpath('./p[1]/a[1]/text()|./p[1]/text()').extract_first()
            yield item

2》Modify pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class DushuPipeline:
    '''
    open_spider / close_spider are each called only once. They are typically used to open and close resources
    '''

    def open_spider(self, spider):
        self.fp = open('dushu.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.fp.write(str(item))
        return item

    def close_spider(self, spider):
        self.fp.close()
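
The pipeline above writes the Python text form of each item, which is not valid JSON. A possible variant, sketched here under the assumption that one JSON object per line is acceptable, serializes each item with json.dumps. The class name JsonLinesPipeline, the default file name and the sample books are hypothetical; the demo drives the pipeline by hand so it runs without Scrapy.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Hypothetical variant of DushuPipeline: one JSON object per line."""

    def __init__(self, path="dushu.jsonl"):
        self.path = path

    def open_spider(self, spider):
        self.fp = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # dict(item) accepts plain dicts and mapping-like items alike;
        # ensure_ascii=False keeps Chinese text readable in the file.
        self.fp.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        self.fp.close()


def demo():
    """Drive the pipeline by hand (no Scrapy needed) and read the file back."""
    path = os.path.join(tempfile.mkdtemp(), "dushu.jsonl")
    pipeline = JsonLinesPipeline(path)
    pipeline.open_spider(spider=None)
    # Made-up sample items, not real scraped data.
    pipeline.process_item({"name": "散文集", "author": "佚名"}, spider=None)
    pipeline.process_item({"name": "随笔", "author": None}, spider=None)
    pipeline.close_spider(spider=None)
    with open(path, encoding="utf-8") as fp:
        return [json.loads(line) for line in fp]


print(demo())
```

Each line of the resulting file can be parsed on its own, which makes the output easy to load back for later processing.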

3》Modify settings.py: stop obeying the robots protocol and enable the pipeline

# Obey robots.txt rules
# ROBOTSTXT_OBEY = True

...

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'dushu.pipelines.DushuPipeline': 300,
}

Run the project

scrapy crawl read_dushu

Change the item in the code to a dedicated data structure instead of a dict

1》Modify items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class DushuItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    src = scrapy.Field()
    author = scrapy.Field()

2》Modify spiders/read_dushu.py

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from dushu.items import DushuItem


class ReadDushuSpider(CrawlSpider):
    name = 'read_dushu'
    allowed_domains = ['www.dushu.com']
    start_urls = ['https://www.dushu.com/book/1163_1.html']

    '''
    follow controls whether links keep being followed, i.e. whether this rule is also applied to the pages it leads to.
    False: the rule applies to the current page only
    True: pages crawled later apply the rule as well, so more pages get crawled (page numbers grow as later pages are visited; the first page only shows links up to page 13 and the rest appear as ...)
    '''
    rules = (
        Rule(LinkExtractor(allow=r'/book/1163_\d+\.html'), callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        div_list = response.xpath('//div[@class="bookslist"]/ul/li/div')
        for div in div_list:
            # data-original means the image is lazy-loaded, so the src attribute cannot be used
            src = div.xpath('./div/a/img/@data-original').extract_first()
            name = div.xpath('./div/a/img/@alt').extract_first()
            author = div.xpath('./p[1]/a[1]/text()|./p[1]/text()').extract_first()
            # Yield a DushuItem instead of a plain dict
            yield DushuItem(src=src, name=name, author=author)

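3. Extending the project to also scrape the price from the book detail page

The goal is to also scrape the price shown after clicking through to a book on dushu.com.

Modify items.py to add a price field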

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class DushuItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    src = scrapy.Field()
    author = scrapy.Field()
    price = scrapy.Field()

Modify spiders/read_dushu.py

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from dushu.items import DushuItem


class ReadDushuSpider(CrawlSpider):
    name = 'read_dushu'
    allowed_domains = ['www.dushu.com']
    start_urls = ['https://www.dushu.com/book/1163_1.html']

    '''
    follow controls whether links keep being followed, i.e. whether this rule is also applied to the pages it leads to.
    False: the rule applies to the current page only
    True: pages crawled later apply the rule as well, so more pages get crawled (page numbers grow as later pages are visited; the first page only shows links up to page 13 and the rest appear as ...)
    '''
    rules = (
        Rule(LinkExtractor(allow=r'/book/1163_\d+\.html'), callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        div_list = response.xpath('//div[@class="bookslist"]/ul/li/div')
        for div in div_list:
            # data-original means the image is lazy-loaded, so the src attribute cannot be used
            src = div.xpath('./div/a/img/@data-original').extract_first()
            name = div.xpath('./div/a/img/@alt').extract_first()
            author = div.xpath('./p[1]/a[1]/text()|./p[1]/text()').extract_first()
            url = div.xpath('./div/a/@href').extract_first()
            if not url:
                # Skip entries that have no detail-page link
                continue
            # Resolve the relative href against the current page URL
            url = response.urljoin(url)
            # Pass the fields collected so far to the detail-page callback via meta
            yield scrapy.Request(url=url, callback=self.parse_second, meta={'name': name, 'src': src, 'author': author})

    def parse_second(self, response):
        price = response.xpath('//div[@class="book-details"]//span/text()').get()
        # Read back the fields passed along from the list page
        name = response.meta['name']
        src = response.meta['src']
        author = response.meta['author']
        yield DushuItem(src=src, name=name, author=author, price=price)
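
The detail-page request is built from an href found on the list page, and that href may be relative or missing. A minimal sketch of resolving it with the standard urllib.parse.urljoin follows; the helper name absolute_url, the page URL and the book id are hypothetical. Inside a Scrapy callback, response.urljoin(href) performs a comparable resolution against the response URL.

```python
from urllib.parse import urljoin

# Hypothetical list-page URL, for illustration only.
PAGE_URL = "https://www.dushu.com/book/1163_2.html"


def absolute_url(page_url, href):
    """Resolve an href found on page_url; return None when the href is missing."""
    if not href:
        return None
    return urljoin(page_url, href)


# The book id below is made up.
print(absolute_url(PAGE_URL, "/book/13981332/"))
print(absolute_url(PAGE_URL, "https://www.dushu.com/book/13981332/"))
print(absolute_url(PAGE_URL, None))
```

Unlike plain string concatenation, this handles hrefs that are already absolute and avoids a TypeError when the attribute is absent.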

Test run

4. Extending further: a pipeline that downloads the images

Pillow needs to be installed

pip install pillow

Modify pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


import scrapy
# useful for handling different item types with a single interface
from scrapy.pipelines.images import ImagesPipeline


class DushuPipeline:
    '''
    open_spider / close_spider are each called only once. They are typically used to open and close resources
    '''

    def open_spider(self, spider):
        self.fp = open('dushu.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.fp.write(str(item))
        return item

    def close_spider(self, spider):
        self.fp.close()


class ImgsPipLine(ImagesPipeline):

    def get_media_requests(self, item, info):
        src = item['src']
        # Some entries have no src attribute; skip them
        if src:
            yield scrapy.Request(url=src, meta={'item': item})

    # Only the image file name is returned; the base directory is set in the global settings file (IMAGES_STORE)
    # Recent Scrapy versions also pass a keyword-only item argument, so accept it
    def file_path(self, request, response=None, info=None, *, item=None):
        item = request.meta['item']
        filePath = item['name'] + ".jpg"
        return filePath

    def item_completed(self, results, item, info):
        return item
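
Because file_path builds the file name straight from the book title, a title containing a slash or another character that is illegal in file names could produce an unexpected path. A possible safeguard, sketched with a hypothetical helper named safe_image_name (not part of Scrapy), replaces such characters before appending the extension. The sample title is made up.

```python
import re


def safe_image_name(name, fallback="unnamed"):
    """Build a '.jpg' file name from a book title (hypothetical helper)."""
    # Replace path separators and characters that Windows forbids in file names.
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", name or "").strip()
    return (cleaned or fallback) + ".jpg"


print(safe_image_name("散文/随笔: 选集"))
print(safe_image_name(None))
```

In the pipeline, the return value of such a helper could be used in place of the plain concatenation of the name and ".jpg".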

Modify settings.py to add the related settings

LOG_LEVEL = "WARNING"
IMAGES_STORE = './result'   # directory where the images are saved
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    # The next two request headers are required; without referer, image requests fail with 403.
    'referer': 'https://www.dushu.com/book/1163_11.html',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36',
}

ITEM_PIPELINES = {
   'dushu.pipelines.DushuPipeline': 300,
   'dushu.pipelines.ImgsPipLine': 301,
}

Test

After a run, a result directory is created in the project root and the jpg images are downloaded into it.

5. Changing the Scrapy log level

Modify settings.py

LOG_LEVEL = "WARNING"

6. Writing a main script to launch the Scrapy program

Method 1

from scrapy.cmdline import execute
import sys
import os

# Put this file's directory on sys.path so the project modules can be imported
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
execute(["scrapy", "crawl", "read_dushu"])

Method 2

import os

# Only runs the command and returns its exit status (0 means success); the console output cannot be captured
# retValue = os.system("ipconfig")
# print(retValue)

# The console output can be captured; a file-like object is returned
# 'r' is the mode argument: the command's output is opened for reading
# retValue = os.popen('ipconfig', 'r')
# res = retValue.read()
# for line in res.splitlines():
#     print(line)
# retValue.close()

# Run the scrapy program
retValue = os.popen('scrapy list', 'r')
res = retValue.read()
for line in res.splitlines():
    print(line)
retValue.close()
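
As an alternative to os.system and os.popen, the standard subprocess module returns the exit status and the captured output in one call. This is a minimal sketch, not something the original article uses; the helper name run_command is hypothetical, and a stand-in command is executed so the example runs even where Scrapy is not installed.

```python
import subprocess
import sys


def run_command(args):
    """Run a command and return (exit_code, list_of_output_lines)."""
    completed = subprocess.run(args, capture_output=True, text=True)
    return completed.returncode, completed.stdout.splitlines()


# Stand-in command so the sketch runs without Scrapy; inside the project
# directory you would pass ["scrapy", "list"] instead.
code, lines = run_command([sys.executable, "-c", "print('read_dushu')"])
print(code, lines)
```

Passing the command as a list avoids shell quoting issues, and the exit code makes it easy to tell a failed run from an empty result.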

References:

https://docs.scrapy.org/en/latest/topics/commands.html

https://docs.scrapy.org/en/latest/topics/architecture.html

posted @ 2022-07-31 23:08 QiaoZhi
