Spark Streaming Sliding Window Operations
I. Spark Streaming window function concepts:
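1. reduceByKeyAndWindow(_+_, Seconds(3), Seconds(2))

Here the window size is Seconds(3) and the slide interval is Seconds(2): every 2 s the window slides, and each time it does, all data from the preceding 3 s is aggregated.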
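To make the relationship between window size and slide interval concrete, here is a minimal plain-Java sketch (no Spark involved). It assumes a 1 s batch interval, so a 3 s window spans 3 batches and a 2 s slide advances by 2 batches. The class and method names are made up for illustration, and the alignment is simplified so that the first window ends at time3, as in the example used later in this post.

```java
import java.util.ArrayList;
import java.util.List;

public class WindowLayout {
    /** Returns the 1-based batch indices covered by the window that ends at batch `end`. */
    public static List<Integer> batchesInWindow(int end, int windowLen) {
        List<Integer> batches = new ArrayList<>();
        for (int b = Math.max(1, end - windowLen + 1); b <= end; b++) {
            batches.add(b);
        }
        return batches;
    }

    public static void main(String[] args) {
        int windowLen = 3; // window size: 3 batches (3 s at the assumed 1 s batch interval)
        int slide = 2;     // slide interval: 2 batches (2 s)
        for (int end = windowLen; end <= 7; end += slide) {
            System.out.println("window ending at time" + end
                    + " covers batches " + batchesInWindow(end, windowLen));
        }
    }
}
```

Because the slide (2) is smaller than the window (3), consecutive windows share one batch: time3 appears in both the first and the second window.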
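2. The overloaded form reduceByKeyAndWindow(_+_, _-_, Seconds(3), Seconds(2))
The design idea: when the slide interval Seconds(2) is smaller than the window size Seconds(3), two consecutive windows overlap. The overlapping part does not need to be fetched or computed again. Instead, the new result is derived by updating the previous one, which saves both space and computation and improves efficiency considerably.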
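As shown in the figure above, the part shared by the two windows is the data in the time slice time3. The computation of window1 and window2 can therefore be written as
win1 = time1 + time2 + time3
win2 = time3 + time4 + time5
which can be rewritten as
win1 = time1 + time2 + time3
win2 = win1 + time4 + time5 - time1 - time2
In other words, _+_ aggregates the newly arrived time slices (the RDDs in time4 and time5), while _-_ removes the expired time slices of the previous window (time1 and time2).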
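The incremental update can be checked with a small plain-Java sketch. It is only an illustration of the arithmetic, not Spark's implementation: the per-slice counts are made-up numbers, and the names `recompute` and `slide` are hypothetical.

```java
public class IncrementalWindow {
    /** Recomputes a window from scratch: sum of slices from..to (1-based, inclusive). */
    public static int recompute(int[] slices, int from, int to) {
        int sum = 0;
        for (int i = from; i <= to; i++) {
            sum += slices[i - 1];
        }
        return sum;
    }

    /** Derives the next window from the previous one. */
    public static int slide(int prevWindow, int[] entering, int[] leaving) {
        int result = prevWindow;
        for (int v : entering) {
            result += v; // role of _+_ : add slices that entered the window
        }
        for (int v : leaving) {
            result -= v; // role of _-_ : subtract slices that left the window
        }
        return result;
    }

    public static void main(String[] args) {
        int[] slices = {4, 7, 1, 9, 3}; // made-up counts for time1..time5
        int win1 = recompute(slices, 1, 3);
        int win2 = slide(win1,
                new int[]{slices[3], slices[4]},  // time4, time5 enter
                new int[]{slices[0], slices[1]}); // time1, time2 leave
        System.out.println("win1               = " + win1);
        System.out.println("win2 (incremental) = " + win2);
        System.out.println("win2 (recomputed)  = " + recompute(slices, 3, 5));
    }
}
```

Both ways of computing win2 give the same value, but the incremental path only touches the slices that entered or left. This only works when the reduce function has an inverse, as addition does with subtraction.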
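II. Use case:
A sliding window function is useful when a business requirement spans multiple batches. For example, if the Spark Streaming batch interval in a project is 5 s but a computation needs the last 5 min of data, a sliding window function can provide it.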
https://www.jianshu.com/p/2f0d2cb1faf4
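For the 5 s batch and 5 min window mentioned above, a quick sanity check is to compute how many batches one window spans and to confirm that the window is a whole multiple of the batch interval, which the Spark API documentation quoted in the next section requires. The helper below is a hypothetical plain-Java sketch, not part of Spark.

```java
import java.time.Duration;

public class WindowSizing {
    /**
     * Returns how many batches fit into one window.
     * Throws if the window is not a positive whole multiple of the batch interval.
     */
    public static long batchesPerWindow(Duration window, Duration batch) {
        long w = window.toMillis();
        long b = batch.toMillis();
        if (b <= 0 || w <= 0 || w % b != 0) {
            throw new IllegalArgumentException(
                    "window must be a positive multiple of the batch interval");
        }
        return w / b;
    }

    public static void main(String[] args) {
        Duration batch = Duration.ofSeconds(5);  // batch interval from the scenario
        Duration window = Duration.ofMinutes(5); // data range the business logic needs
        System.out.println("batches per window: " + batchesPerWindow(window, batch));
    }
}
```

With these values each window covers 60 batches, so recomputing every window from scratch would reprocess 59 batches that the previous window already saw when the slide equals one batch.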
III. Code:
/**
 * Return a new DStream by applying incremental `reduceByKey` over a sliding window.
 * The reduced value of over a new window is calculated using the old window's reduced value :
 *  1. reduce the new values that entered the window (e.g., adding new counts)
 *
 *  2. "inverse reduce" the old values that left the window (e.g., subtracting old counts)
 *
 * This is more efficient than reduceByKeyAndWindow without "inverse reduce" function.
 * However, it is applicable to only "invertible reduce functions".
 * Hash partitioning is used to generate the RDDs with Spark's default number of partitions.
 * @param reduceFunc associative reduce function
 * @param invReduceFunc inverse reduce function
 * @param windowDuration width of the window; must be a multiple of this DStream's
 *                       batching interval
 * @param slideDuration  sliding interval of the window (i.e., the interval after which
 *                       the new DStream will generate RDDs); must be a multiple of this
 *                       DStream's batching interval
 * @param filterFunc     Optional function to filter expired key-value pairs;
 *                       only pairs that satisfy the function are retained
 */
def reduceByKeyAndWindow(
    reduceFunc: (V, V) => V,
    invReduceFunc: (V, V) => V,
    windowDuration: Duration,
    slideDuration: Duration = self.slideDuration,
    numPartitions: Int = ssc.sc.defaultParallelism,
    filterFunc: ((K, V)) => Boolean = null
  ): DStream[(K, V)] = ssc.withScope {
  reduceByKeyAndWindow(
    reduceFunc, invReduceFunc, windowDuration,
    slideDuration, defaultPartitioner(numPartitions), filterFunc
  )
}
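Input data: <Id, <value, 1>>
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import scala.Tuple2;

// monitorId2SpeedDStream is defined elsewhere: key = monitor Id, value = (speed, 1)
JavaPairDStream<String, Tuple2<Integer, Integer>> resultDStream =
    monitorId2SpeedDStream.reduceByKeyAndWindow(
        // Reduce function: add the values that entered the window
        new Function2<Tuple2<Integer, Integer>, Tuple2<Integer, Integer>, Tuple2<Integer, Integer>>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Tuple2<Integer, Integer> call(Tuple2<Integer, Integer> v1,
                                                 Tuple2<Integer, Integer> v2) throws Exception {
                return new Tuple2<Integer, Integer>(v1._1 + v2._1, v1._2 + v2._2);
            }
        },
        // Inverse reduce function: subtract the values that left the window
        new Function2<Tuple2<Integer, Integer>, Tuple2<Integer, Integer>, Tuple2<Integer, Integer>>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Tuple2<Integer, Integer> call(Tuple2<Integer, Integer> v1,
                                                 Tuple2<Integer, Integer> v2) throws Exception {
                // Both fields are v1 minus v2 (the original had v2._2 - v2._2, which is always 0)
                return new Tuple2<Integer, Integer>(v1._1 - v2._1, v1._2 - v2._2);
            }
        },
        Durations.minutes(5),   // window size
        Durations.seconds(5));  // slide interval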
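The (value, 1) pairs in this example accumulate a running sum and a record count per key, from which an average can be derived. The following plain-Java sketch mimics the reduce and inverse reduce functions on `int[]` pairs without Spark. The speed values are made up, and `add`, `subtract` and `average` are hypothetical helper names. The key point it illustrates is that the inverse function must subtract both components as v1 minus v2, otherwise the count drifts away from the sum.

```java
public class SpeedWindow {
    /** Reduce: component-wise sum of two (speedSum, count) pairs. */
    public static int[] add(int[] a, int[] b) {
        return new int[]{a[0] + b[0], a[1] + b[1]};
    }

    /** Inverse reduce: removes pair b (data that left the window) from pair a. */
    public static int[] subtract(int[] a, int[] b) {
        return new int[]{a[0] - b[0], a[1] - b[1]};
    }

    /** Average speed for a (speedSum, count) pair; 0.0 when the window is empty. */
    public static double average(int[] pair) {
        return pair[1] == 0 ? 0.0 : (double) pair[0] / pair[1];
    }

    public static void main(String[] args) {
        int[][] records = {{60, 1}, {80, 1}, {100, 1}}; // made-up (speed, 1) inputs
        int[] window = {0, 0};
        for (int[] r : records) {
            window = add(window, r);
        }
        System.out.println("sum=" + window[0] + " count=" + window[1]
                + " avg=" + average(window));

        // The window slides: the oldest record leaves, a new one enters.
        window = subtract(window, records[0]);
        window = add(window, new int[]{40, 1});
        System.out.println("sum=" + window[0] + " count=" + window[1]
                + " avg=" + average(window));
    }
}
```

In Spark itself, the variant of reduceByKeyAndWindow that takes an inverse function also requires checkpointing to be enabled on the streaming context.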