The relationship between batchDuration, slideDuration, and windowDuration in Spark Streaming
batchDuration: the interval at which Spark Streaming attempts to submit a job. Note the word "attempts" — a job is not necessarily generated on every tick. The relevant code is:
/** Checks whether the 'time' is valid wrt slideDuration for generating RDD */
private[streaming] def isTimeValid(time: Time): Boolean = {
  if (!isInitialized) {
    throw new SparkException(this + " has not been initialized")
  } else if (time <= zeroTime || !(time - zeroTime).isMultipleOf(slideDuration)) {
    logInfo("Time " + time + " is invalid as zeroTime is " + zeroTime +
      " and slideDuration is " + slideDuration + " and difference is " + (time - zeroTime))
    false
  } else {
    logDebug("Time " + time + " is valid")
    true
  }
}
Suppose slideDuration is N times batchDuration. Then out of every N consecutive ticks, the first N-1 attempts fail to create a job;
only the N-th attempt actually submits one.
By default, batchDuration and slideDuration are equal, so every attempt succeeds.
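To make the arithmetic concrete, here is a minimal standalone sketch (plain Scala, not the Spark source) of the same validity check, assuming zeroTime is 0 and times are expressed in milliseconds:

object TimeValidity {
  // Same arithmetic as isTimeValid: a time is valid only when it lies strictly
  // after zeroTime and is a whole multiple of slideDuration away from it.
  def isTimeValid(timeMs: Long, zeroTimeMs: Long, slideMs: Long): Boolean =
    timeMs > zeroTimeMs && (timeMs - zeroTimeMs) % slideMs == 0

  def main(args: Array[String]): Unit = {
    val batchMs = 2000L // batchDuration = 2s: the scheduler ticks every 2s
    val slideMs = 6000L // slideDuration = 6s, i.e. N = 3
    for (t <- batchMs to 12000L by batchMs)
      println(s"t=${t}ms valid=${isTimeValid(t, 0L, slideMs)}")
    // Only every 3rd tick is valid: 2000 false, 4000 false, 6000 true,
    // 8000 false, 10000 false, 12000 true
  }
}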
InputDStream:

override def slideDuration: Duration = {
  if (ssc == null) throw new Exception("ssc is null")
  if (ssc.graph.batchDuration == null) throw new Exception("batchDuration is null")
  ssc.graph.batchDuration
}

MappedDStream:

override def slideDuration: Duration = parent.slideDuration
There is one exception, however: things change once a window operation is involved.
def window(windowDuration: Duration, slideDuration: Duration): DStream[T] = ssc.withScope {
  new WindowedDStream(this, windowDuration, slideDuration)
}

/**
 * Return a new DStream in which each RDD has a single element generated by reducing all
 * elements in a sliding window over this DStream.
 * @param reduceFunc associative reduce function
 * @param windowDuration width of the window; must be a multiple of this DStream's
 *                       batching interval
 * @param slideDuration  sliding interval of the window (i.e., the interval after which
 *                       the new DStream will generate RDDs); must be a multiple of this
 *                       DStream's batching interval
 */
def reduceByWindow(
    reduceFunc: (T, T) => T,
    windowDuration: Duration,
    slideDuration: Duration
  ): DStream[T] = ssc.withScope {
  this.reduce(reduceFunc).window(windowDuration, slideDuration).reduce(reduceFunc)
}
As you can see, the window-based methods let you pass a custom slideDuration, which can be any multiple of batchDuration. The new value only affects the DStreams downstream of the window operation; the upstream DStreams keep a slideDuration equal to batchDuration.
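As a usage sketch (the host, port, and concrete durations are illustrative, assuming a 2-second batchDuration):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("WindowExample")
    val ssc = new StreamingContext(conf, Seconds(2))    // batchDuration = 2s
    val lines = ssc.socketTextStream("localhost", 9999) // slideDuration = 2s here
    // windowDuration = 10s, slideDuration = 6s: both are multiples of batchDuration.
    // The WindowedDStream (and everything downstream of it) now slides every 6s,
    // while `lines` itself still generates an RDD every 2s.
    val windowed = lines.window(Seconds(10), Seconds(6))
    windowed.count().print()
    ssc.start()
    ssc.awaitTermination()
  }
}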
So how does the job actually execute?
Every DStream has the method:
override def compute(validTime: Time): Option[RDD[T]]
Take MappedDStream, for example; its implementation is:
override def compute(validTime: Time): Option[RDD[U]] = {
parent.getOrCompute(validTime).map(_.map[U](mapFunc))
}
It's simple: call the parent DStream's getOrCompute, then apply the map function to the result. This repeats level by level up the lineage until a DStream with no parent is reached.
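A simplified, non-Spark sketch of that chain (SimpleStream and its members are illustrative names; the real getOrCompute also handles time validity checks and checkpointing):

import scala.collection.mutable

abstract class SimpleStream[T] {
  // Cache of the value generated for each batch time, mirroring generatedRDDs.
  private val generated = mutable.Map[Long, T]()
  def compute(time: Long): Option[T]
  final def getOrCompute(time: Long): Option[T] =
    generated.get(time).orElse {
      val result = compute(time)
      result.foreach(generated.update(time, _))
      result
    }
}

// Mirrors MappedDStream.compute: ask the parent, then apply the function.
class SimpleMappedStream[A, B](parent: SimpleStream[A], f: A => B) extends SimpleStream[B] {
  def compute(time: Long): Option[B] = parent.getOrCompute(time).map(f)
}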
We know that slideDuration only changes at the WindowedDStream, so how does it implement compute?
override def compute(validTime: Time): Option[RDD[T]] = {
  val currentWindow = new Interval(validTime - windowDuration + parent.slideDuration, validTime)
  val rddsInWindow = parent.slice(currentWindow)
  val windowRDD = if (rddsInWindow.flatMap(_.partitioner).distinct.length == 1) {
    logDebug("Using partition aware union for windowing at " + validTime)
    new PartitionerAwareUnionRDD(ssc.sc, rddsInWindow)
  } else {
    logDebug("Using normal union for windowing at " + validTime)
    new UnionRDD(ssc.sc, rddsInWindow)
  }
  Some(windowRDD)
}
As we can see, WindowedDStream first works out the time range covered by this window:
val currentWindow = new Interval(validTime - windowDuration + parent.slideDuration, validTime)
For example, with windowDuration = 10s and parent.slideDuration = 2s, a validTime of 12s yields the interval [4s, 12s]. It then calls the parent DStream's slice method over that interval:
val rddsInWindow = parent.slice(currentWindow)
A reminder: at this point we are back in the parent DStream, whose slideDuration is equal to batchDuration; only the WindowedDStream and its descendants carry the custom value.
The implementation of slice is:
/**
 * Return all the RDDs between 'fromTime' to 'toTime' (both included)
 */
def slice(fromTime: Time, toTime: Time): Seq[RDD[T]] = ssc.withScope {
  if (!isInitialized) {
    throw new SparkException(this + " has not been initialized")
  }
  val alignedToTime = if ((toTime - zeroTime).isMultipleOf(slideDuration)) {
    toTime
  } else {
    logWarning("toTime (" + toTime + ") is not a multiple of slideDuration (" + slideDuration + ")")
    toTime.floor(slideDuration, zeroTime)
  }
  val alignedFromTime = if ((fromTime - zeroTime).isMultipleOf(slideDuration)) {
    fromTime
  } else {
    logWarning("fromTime (" + fromTime + ") is not a multiple of slideDuration (" + slideDuration + ")")
    fromTime.floor(slideDuration, zeroTime)
  }
  logInfo("Slicing from " + fromTime + " to " + toTime +
    " (aligned to " + alignedFromTime + " and " + alignedToTime + ")")
  alignedFromTime.to(alignedToTime, slideDuration).flatMap(time => {
    if (time >= zeroTime) getOrCompute(time) else None
  })
}
We only need to look at the last expression:
alignedFromTime.to(alignedToTime, slideDuration).flatMap(time => {
  if (time >= zeroTime) getOrCompute(time) else None
})
It walks through the window's time range in steps of slideDuration — which here equals batchDuration — and calls getOrCompute for each batch time within it.
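An illustrative sketch (not the Spark source) of what that enumeration produces, with times as millisecond Longs:

def timesInWindow(fromMs: Long, toMs: Long, slideMs: Long): Seq[Long] =
  (fromMs to toMs by slideMs).toSeq

// Continuing the example above: windowDuration = 10s,
// parent slideDuration = batchDuration = 2s, validTime = 12s:
timesInWindow(4000L, 12000L, 2000L)
// => Seq(4000, 6000, 8000, 10000, 12000): five 2s batches covering the 10s window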
/** Checks whether the 'time' is valid wrt slideDuration for generating RDD */
private[streaming] def isTimeValid(time: Time): Boolean = {
  if (!isInitialized) {
    throw new SparkException(this + " has not been initialized")
  } else if (time <= zeroTime || !(time - zeroTime).isMultipleOf(slideDuration)) {
    logInfo("Time " + time + " is invalid as zeroTime is " + zeroTime +
      " and slideDuration is " + slideDuration + " and difference is " + (time - zeroTime))
    false
  } else {
    logDebug("Time " + time + " is valid")
    true
  }
}
Because each generated time is, by construction, zeroTime plus a multiple of slideDuration, this check returns true for every one of them.
Once all the per-batch RDDs have been returned, WindowedDStream combines the results:
val windowRDD = if (rddsInWindow.flatMap(_.partitioner).distinct.length == 1) {
  logDebug("Using partition aware union for windowing at " + validTime)
  new PartitionerAwareUnionRDD(ssc.sc, rddsInWindow)
} else {
  logDebug("Using normal union for windowing at " + validTime)
  new UnionRDD(ssc.sc, rddsInWindow)
}
Some(windowRDD)
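A minimal sketch of that partitioner check over plain Options standing in for rdd.partitioner (the String values are illustrative). Note that flatMap drops the Nones, so the check asks whether all RDDs that have a partitioner share the same one:

def singlePartitioner(partitioners: Seq[Option[String]]): Boolean =
  partitioners.flatten.distinct.length == 1

singlePartitioner(Seq.fill(5)(Some("HashPartitioner(4)")))           // true
singlePartitioner(Seq(Some("HashPartitioner(4)"), Some("Range(8)"))) // false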
Either way, we finally get the windowed RDD we want.
posted on 2016-02-25 16:45 by luckuan1985