Spark Streaming Sliding Window Operations

I. Spark Streaming window function concepts:

1. reduceByKeyAndWindow(_+_, Seconds(3), Seconds(2))

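Here the window length is Seconds(3) and the slide interval is Seconds(2): the window slides every 2 s, and each time it slides it aggregates all data from the preceding 3 s.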
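To make the window/slide relationship concrete, here is a minimal plain-Java sketch (no Spark involved) that lists which time slices each window covers. It assumes a batch interval of 1 s so that one slice equals one second; the class and method names are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class WindowSlices {
    // Returns the 1-based slice indices covered by the window that ends at slice `end`.
    // Assumption: one slice corresponds to one batch interval.
    static List<Integer> slicesInWindow(int end, int windowLen) {
        List<Integer> slices = new ArrayList<>();
        for (int t = Math.max(1, end - windowLen + 1); t <= end; t++) {
            slices.add(t);
        }
        return slices;
    }

    public static void main(String[] args) {
        int windowLen = 3; // corresponds to Seconds(3)
        int slide = 2;     // corresponds to Seconds(2)
        // A new window is produced every `slide` slices.
        for (int end = windowLen; end <= 7; end += slide) {
            System.out.println("window ending at time" + end
                    + " covers slices " + slicesInWindow(end, windowLen));
        }
    }
}
```

With these settings, consecutive windows share exactly one slice, because the window is one slice longer than the slide interval.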

2. The overload reduceByKeyAndWindow(_+_, _-_, Seconds(3), Seconds(2))

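The design idea is this: when the slide interval, Seconds(2), is shorter than the window length, Seconds(3), two consecutive windows overlap. The overlapping part does not have to be fetched or computed again. Instead, the new result is obtained by updating the previous one, which avoids re-aggregating the overlapping data and improves efficiency considerably.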

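As shown in the figure above, the part counted by both computations is the data in the time slice corresponding to time3, so window1 and window2 can be computed as follows: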

win1 = time1 + time2 + time3

win2 = time3 + time4 + time5

which can be rewritten as

win1 = time1 + time2 + time3

win2 = win1 + time4 + time5 - time1 - time2

In other words, _+_ aggregates the newly arrived time slices (the RDDs in time4 and time5), while _-_ removes the expired time slices of the previous window (time1, time2).
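The identity above can be checked with a small plain-Java sketch. It is not Spark code: the slice values are made-up sample numbers, and the class and method names are hypothetical. It computes win2 both directly and incrementally and shows that the two agree.

```java
final class IncrementalWindowSum {
    // Direct form: re-add every slice in [from, to] (0-based, inclusive).
    static int direct(int[] slices, int from, int to) {
        int sum = 0;
        for (int i = from; i <= to; i++) {
            sum += slices[i];
        }
        return sum;
    }

    // Incremental form: start from the previous window's result,
    // add the slices that entered and subtract the slices that left.
    static int incremental(int previous, int[] entering, int[] leaving) {
        int sum = previous;
        for (int v : entering) {
            sum += v; // plays the role of _+_
        }
        for (int v : leaving) {
            sum -= v; // plays the role of _-_
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] t = {4, 7, 1, 9, 3};        // sample values for time1..time5
        int win1 = direct(t, 0, 2);       // time1 + time2 + time3
        int win2Direct = direct(t, 2, 4); // time3 + time4 + time5
        int win2Incremental = incremental(
                win1,
                new int[] {t[3], t[4]},   // time4, time5 entered the window
                new int[] {t[0], t[1]});  // time1, time2 left the window
        System.out.println(win2Direct + " == " + win2Incremental);
    }
}
```

The shortcut only works when the reduce function has an inverse, as addition does; a reduction such as max cannot be undone this way. In Spark itself, the overload that takes an inverse function also requires checkpointing to be enabled on the streaming context.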

II. Use cases:

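If a business requirement in a project calls for computation across batches, for example the Spark Streaming batch interval is set to 5 s but the computation needs the last 5 min of data, a sliding window function can be used.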
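As a quick sanity check on those numbers, the plain-Java sketch below works out how many batches a window spans. The helper name is hypothetical; the rule that the window length must be a multiple of the batch interval is taken from the Scaladoc of reduceByKeyAndWindow quoted in the code section.

```java
import java.time.Duration;

final class WindowSizing {
    // Number of batches covered by one window.
    // Rejects window lengths that are not a multiple of the batch interval.
    static long batchesPerWindow(Duration batchInterval, Duration windowLength) {
        if (batchInterval.isZero() || batchInterval.isNegative()) {
            throw new IllegalArgumentException("batch interval must be positive");
        }
        if (windowLength.toMillis() % batchInterval.toMillis() != 0) {
            throw new IllegalArgumentException(
                    "window length must be a multiple of the batch interval");
        }
        return windowLength.toMillis() / batchInterval.toMillis();
    }

    public static void main(String[] args) {
        // The 5 s batch interval and 5 min window from the example in the text.
        long batches = batchesPerWindow(Duration.ofSeconds(5), Duration.ofMinutes(5));
        System.out.println("batches per window: " + batches);
    }
}
```

For a 5 s batch interval and a 5 min window this gives 60 batches per window, which is why recomputing every window from scratch becomes expensive when the slide interval is short.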

https://www.jianshu.com/p/2f0d2cb1faf4

III. Code:

/**
 * Return a new DStream by applying incremental `reduceByKey` over a sliding window.
 * The reduced value of over a new window is calculated using the old window's reduced value :
 *  1. reduce the new values that entered the window (e.g., adding new counts)
 *
 *  2. "inverse reduce" the old values that left the window (e.g., subtracting old counts)
 *
 * This is more efficient than reduceByKeyAndWindow without "inverse reduce" function.
 * However, it is applicable to only "invertible reduce functions".
 * Hash partitioning is used to generate the RDDs with Spark's default number of partitions.
 * @param reduceFunc associative reduce function
 * @param invReduceFunc inverse reduce function
 * @param windowDuration width of the window; must be a multiple of this DStream's
 *                       batching interval
 * @param slideDuration  sliding interval of the window (i.e., the interval after which
 *                       the new DStream will generate RDDs); must be a multiple of this
 *                       DStream's batching interval
 * @param filterFunc     Optional function to filter expired key-value pairs;
 *                       only pairs that satisfy the function are retained
 */
def reduceByKeyAndWindow(
    reduceFunc: (V, V) => V,
    invReduceFunc: (V, V) => V,
    windowDuration: Duration,
    slideDuration: Duration = self.slideDuration,
    numPartitions: Int = ssc.sc.defaultParallelism,
    filterFunc: ((K, V)) => Boolean = null
  ): DStream[(K, V)] = ssc.withScope {
  reduceByKeyAndWindow(
    reduceFunc, invReduceFunc, windowDuration,
    slideDuration, defaultPartitioner(numPartitions), filterFunc
  )
}

Input data: <Id, <value, 1>>

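// Imports needed by this fragment.
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import scala.Tuple2;

// Each value is a (value, 1) pair, so reducing yields (sum of values, record count) per Id.
JavaPairDStream<String, Tuple2<Integer, Integer>> resultDStream =
        monitorId2SpeedDStream.reduceByKeyAndWindow(
            // Reduce function: merge the values that entered the window.
            new Function2<Tuple2<Integer, Integer>, Tuple2<Integer, Integer>, Tuple2<Integer, Integer>>() {
                private static final long serialVersionUID = 1L;

                @Override
                public Tuple2<Integer, Integer> call(Tuple2<Integer, Integer> v1, Tuple2<Integer, Integer> v2) throws Exception {
                    return new Tuple2<Integer, Integer>(v1._1 + v2._1, v1._2 + v2._2);
                }
            },
            // Inverse reduce function: subtract the values that left the window.
            new Function2<Tuple2<Integer, Integer>, Tuple2<Integer, Integer>, Tuple2<Integer, Integer>>() {
                private static final long serialVersionUID = 1L;

                @Override
                public Tuple2<Integer, Integer> call(Tuple2<Integer, Integer> v1, Tuple2<Integer, Integer> v2) throws Exception {
                    // Both components come from v1 minus v2; v2._2 - v2._2 would always be 0.
                    return new Tuple2<Integer, Integer>(v1._1 - v2._1, v1._2 - v2._2);
                }
            },
            Durations.minutes(5),  // window length
            Durations.seconds(5)); // slide interval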
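The pair arithmetic used by the two functions above can be tried out without a Spark cluster. The sketch below is plain Java: SumCount is a hypothetical stand-in for Tuple2<Integer, Integer>, and the final averaging step is an assumption about how the (sum, count) result might be consumed, not something the snippet above does.

```java
final class SumCountWindow {
    // Hypothetical stand-in for Tuple2<Integer, Integer>: (sum of values, number of records).
    static final class SumCount {
        final int sum;
        final int count;

        SumCount(int sum, int count) {
            this.sum = sum;
            this.count = count;
        }
    }

    // Same arithmetic as the reduce function: add both components.
    static SumCount reduce(SumCount a, SumCount b) {
        return new SumCount(a.sum + b.sum, a.count + b.count);
    }

    // Same arithmetic as the inverse reduce function: subtract both components.
    static SumCount invReduce(SumCount windowValue, SumCount leaving) {
        return new SumCount(windowValue.sum - leaving.sum, windowValue.count - leaving.count);
    }

    // Assumed downstream step: average value over the window.
    static double average(SumCount v) {
        return v.count == 0 ? 0.0 : (double) v.sum / v.count;
    }

    public static void main(String[] args) {
        SumCount first = new SumCount(60, 1);  // sample record: value 60
        SumCount second = new SumCount(80, 1); // sample record: value 80
        SumCount window = reduce(first, second);
        System.out.println("average in window: " + average(window));

        // Once `first` expires, the inverse function recovers the remaining value.
        SumCount remaining = invReduce(window, first);
        System.out.println("after expiry: sum=" + remaining.sum + ", count=" + remaining.count);
    }
}
```

Carrying the count next to the sum keeps the reduction invertible, whereas an average on its own could not be updated incrementally.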