Crawlers -- Incremental Crawlers

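Incremental crawler:

--Concept: detect when a website's data has been updated and crawl only the newly published data

--Analysis:

    --Specify the start URL

    --Use CrawlSpider to collect the links to the other listing pages

    --Use a Rule to send requests for those listing-page links

    --Parse the detail-page URLs out of the source of each listing page

    --Check whether each detail-page URL has already been crawled: skip it if so, otherwise send a request for it

        --Store the URLs of the detail pages that have already been crawled

            --Store them in a Redis set (Redis serves here as a lightweight store in place of a full database)

    --Persist the scraped data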
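
The duplicate check in the steps above relies on the return value of the Redis SADD command: 1 when the member is newly added to the set, 0 when it was already there. The following is a minimal sketch of that idea, not the crawler itself. `FakeRedisSet` and `filter_new_urls` are hypothetical names, the in-memory class only stands in for a real Redis connection so the snippet runs without a server, and the example.com URLs are made up.

```python
class FakeRedisSet:
    """Hypothetical in-memory stand-in that mimics the return value of Redis SADD."""

    def __init__(self):
        self._sets = {}

    def sadd(self, key, member):
        # Like SADD with a single member: 1 if newly added, 0 if already present.
        members = self._sets.setdefault(key, set())
        if member in members:
            return 0
        members.add(member)
        return 1


def filter_new_urls(conn, urls, key="urls"):
    """Return only the URLs that were not yet recorded under `key`."""
    return [url for url in urls if conn.sadd(key, url) == 1]


conn = FakeRedisSet()
first_run = filter_new_urls(conn, ["https://example.com/a", "https://example.com/b"])
second_run = filter_new_urls(conn, ["https://example.com/b", "https://example.com/c"])
print(first_run)
print(second_run)
```

On the second run only the URL that was not seen before is returned, which is what lets a later crawl skip everything it has already fetched.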

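# Spider file (.py)

import scrapy
from redis import Redis
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# Assumption: DetailItem is defined in the project's items.py with the fields content and article_id
from ..items import DetailItem


class BlueSpider(CrawlSpider):
    name = 'blue'
    # allowed_domains = ['www.xxx.com']  # domain restriction: only links under this domain are followed
    start_urls = ['https://www.fuzokuu.com/category/fuzokuuguide-thailand/page/1']
    # Extract the links to the other listing pages and pass each response to parse_item.
    # Assumption: the allow pattern is inferred from start_urls; adjust it to the site's real pagination links.
    rules = (
        Rule(LinkExtractor(allow=r'/category/fuzokuuguide-thailand/page/\d+'),
             callback='parse_item', follow=True),
    )
    # Create the Redis connection object
    conn = Redis(host='127.0.0.1', port=6379)

    def parse_item(self, response):
        # Parse the listing page from the response
        # Note: avoid tags such as tbody and iframe in XPath expressions
        article_list = response.xpath('//article')
        for article in article_list:
            article_title = article.xpath('./div[2]/header/h2/a/@title').extract_first()
            article_url = article.xpath('./div[2]/header/h2/a/@href').extract_first()
            if not article_url:
                continue  # no link found in this article block
            # sadd returns 1 if the URL is new to the set and 0 if it was already stored
            ex = self.conn.sadd('urls', article_url)
            if ex == 1:
                print('This URL is newly added; crawling its data')
                yield scrapy.Request(url=article_url, callback=self.parse_detail)
            else:
                print('No update for this URL; nothing new to crawl')

    # Detail page parsing
    def parse_detail(self, response):
        content = response.xpath('//article//p/text()').extract()
        article_id = response.xpath('//article/@id').extract_first()
        content = ''.join(content)
        # Guard against pages whose <article> tag has no id attribute
        article_id = article_id.split('-')[-1] if article_id else None
        item = DetailItem()
        item['content'] = content
        item['article_id'] = article_id
        yield item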
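
The field handling in the detail-page callback can be tried out without Scrapy. This is only a sketch: `build_detail_fields` is a hypothetical helper, the sample paragraphs are invented, and the id format `post-123` is an assumption about what the `id` attribute of the `<article>` tag looks like. Unlike a bare `split`, it returns `None` when no id was extracted instead of raising an error.

```python
def build_detail_fields(paragraphs, raw_id):
    """Join the extracted paragraph texts and keep the part of the id after the last '-'."""
    content = "".join(paragraphs)
    # Assumption: ids look like "post-123"; a missing id yields None instead of raising.
    article_id = raw_id.split("-")[-1] if raw_id else None
    return {"content": content, "article_id": article_id}


fields = build_detail_fields(["First paragraph. ", "Second paragraph."], "post-123")
missing = build_detail_fields([], None)
print(fields)
print(missing)
```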

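# Pipeline file pipelines.py

import json


class BluespiderPipeline:
    conn = None

    def open_spider(self, spider):
        # Reuse the Redis connection created on the spider
        self.conn = spider.conn

    def process_item(self, item, spider):
        dic = {
            'article_id': item['article_id'],
            'content': item['content']
        }
        # redis-py (3.0 and later) accepts only bytes, str, int, or float values,
        # so the dict is serialized to a JSON string before it is pushed
        self.conn.lpush('articleData', json.dumps(dic, ensure_ascii=False))
        return item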
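
Since redis-py 3.0 a value passed to a command such as `lpush` has to be bytes, a string, or a number, so a dict needs to be serialized before it is stored in a Redis list. One possible way, sketched here with the standard `json` module, is shown below. `serialize_item` is a hypothetical helper and the sample record is invented; the snippet does not need a Redis server.

```python
import json


def serialize_item(item):
    """Turn an item-like mapping into a JSON string that a Redis list can store."""
    record = {"article_id": item["article_id"], "content": item["content"]}
    # ensure_ascii=False keeps non-ASCII text readable in the stored string.
    return json.dumps(record, ensure_ascii=False)


stored = serialize_item({"article_id": "123", "content": "sample text"})
restored = json.loads(stored)  # what a consumer would do after reading the list entry
print(stored)
print(restored)
```

A consumer reading the list back can call `json.loads` on each entry to recover the original fields.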

posted @ 2022-06-02 16:02 EricYJChung
