Spark Streaming运行流程及源码解析(二)
Spark Streaming源码流程解析。
写在前面
以下是我自己梳理了一遍Spark Streaming程序运行的流程,过程可能有点细、有点乱。
大家可以一边看我写的流程、一边跟着步骤点进去看源码,这样就不会太乱了。
跟着源码走一遍以后,对Spark Streaming的理解也就很清晰了。
这篇文章是自己看源码过程的记录,如果有理解偏差的部分,欢迎交流指正。
开干
以如下的WordCount代码展开叙述:
// 创建SparkConf,配置master为local
val conf = new SparkConf()
.setMaster("local[2]")
.setAppName("socket-streaming")
// 实例化StreamingContext
val ssc = new StreamingContext(conf, Seconds(2))
// 创建一个ReceiverInputDStream对象
val lines = ssc socketTextStream("localhost", 1234)
// 进行逻辑处理、输出
lines
.flatMap(_.split(" "))
.map((_, 1))
.reduceByKey(_ + _)
.print()
// 启动
ssc.start()
// 等待执行停止
ssc.awaitTermination()
以上代码启动后,可以接受1234端口收到的消息,然后按空格将句子切分成单词,之后对单词进行计数,每隔两秒计算输出一次结果。
接下来以我们写的WordCount代码为辅,从启动流处理引擎、接收并存储数据、处理数据、输出数据依次走一遍源码。
启动流处理引擎
StreamingContext的创建
从val ssc = new StreamingContext(conf, Seconds(2))
开始,这里会实例化StreamingContext对象。
先看一下StreamingContext中的一些重要的变量。
// SparkContext实例,Spark上下文,可以通过直接传参获得,
// 也可以通过sparkConf创建,或从checkpoint中取到
private[streaming] val sc: SparkContext = {
if (_sc != null) {
_sc
} else if (isCheckpointPresent) {
SparkContext.getOrCreate(_cp.createSparkConf())
} else {
throw new SparkException("Cannot create StreamingContext without a SparkContext")
}
}
// DStreamGraph用来管理DStream的依赖,
// 创建时将StreamingContext实例绑定到DStreamGraph上
private[streaming] val graph: DStreamGraph = {
if (isCheckpointPresent) {
_cp.graph.setContext(this)
_cp.graph.restoreCheckpointData()
_cp.graph
} else {
require(_batchDur != null, "Batch duration for StreamingContext cannot be null")
val newGraph = new DStreamGraph()
newGraph.setBatchDuration(_batchDur)
newGraph
}
}
// JobScheduler用来生成和调度任务,
// 也会将StreamingContext实例绑定到自己身上
private[streaming] val scheduler = new JobScheduler(this)
// 批处理间隔
batchDuration
实例化StreamingContext时,这些变量都将会被实例化。
既然这样,就顺势也看一下DStreamGraph和JobScheduler中一些重要的变量。
先看一下DStreamGraph中的重要变量:
// inputStreams是输入数据源的集合,
// 输入数据源中有对应的receive方法用来接收数据
private val inputStreams = new ArrayBuffer[InputDStream[_]]()
// outputStreams就是DStream的集合,
// 我们调用的各个算子最终都会根据依赖生成的DStream,
// outputOperator型的算子都会注册到这里来
private val outputStreams = new ArrayBuffer[DStream[_]]()
再看看JobScheduler中的重要变量:
// 生成的job集合,以time为key,jobset为value的Map
private val jobSets: java.util.Map[Time, JobSet] = new ConcurrentHashMap[Time, JobSet]
// 一个线程池,用来执行job
private val jobExecutor =
ThreadUtils.newDaemonFixedThreadPool(numConcurrentJobs, "streaming-job-executor")
// JobGenerator用来生成job
private val jobGenerator = new JobGenerator(this)
// Driver端用于管理Receiver的总管家
var receiverTracker: ReceiverTracker = null
// 事件循环,用来处理JobScheduler相关的事件
// 本质是以LinkedBlockingDeque一个队列
private var eventLoop: EventLoop[JobSchedulerEvent] = null
接下来执行val lines = ssc.socketTextStream("localhost", 1234)
如下所示,socketTextStream()会调用socketStream(),socketStream方法中会new一个SocketInputDStream,SocketInputDStream用于接收数据
def socketTextStream(
hostname: String,
port: Int,
storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
): ReceiverInputDStream[String] = withNamedScope("socket text stream") {
socketStream[String](hostname, port, SocketReceiver.bytesToLines, storageLevel)
}
def socketStream[T: ClassTag](
hostname: String,
port: Int,
converter: (InputStream) => Iterator[T],
storageLevel: StorageLevel
): ReceiverInputDStream[T] = {
new SocketInputDStream[T](this, hostname, port, converter, storageLevel)
}
追踪一下SocketInputDStream的继承关系,发现它继承于ReceiverInputDStream,ReceiverInputDStream又继承于InputDStream。
InputDStream中有ssc.graph.addInputStream(this)
这么一行代码,将InputDStream添加到DStreamGraph中的inputStreams中。
所以在new SocketInputDStream时,InputDStream就添加到DStreamGraph中了。(这个找了挺久才找见的,之前一直不知道InputDStream什么时候添加进去的)
outputOperator算子注册
接着执行如下几行代码
lines
.flatMap(_.split(" "))
.map((_, 1))
.reduceByKey(_ + _)
.print()
上面每个算子的调用会生成相互依赖的DStream: FlatMappedDStream、MappedDStream、ShuffledDStream。
只有到print()(outputOperator类算子)调用的时候,才会将DStream注册到DStreamGraph中的outputStreams中,之后DStreamGraph才能根据依赖关系生成job。
接下来跟进一下print()
// 以下的方法是依次调用的
def print(): Unit = ssc.withScope {
print(10)
}
def print(num: Int): Unit = ssc.withScope {
def foreachFunc: (RDD[T], Time) => Unit = {
(rdd: RDD[T], time: Time) => {
val firstNum = rdd.take(num + 1)
// scalastyle:off println
println("-------------------------------------------")
println(s"Time: $time")
println("-------------------------------------------")
firstNum.take(num).foreach(println)
if (firstNum.length > num) println("...")
println()
// scalastyle:on println
}
}
foreachRDD(context.sparkContext.clean(foreachFunc), displayInnerRDDOps = false)
}
private def foreachRDD(
foreachFunc: (RDD[T], Time) => Unit,
displayInnerRDDOps: Boolean): Unit = {
new ForEachDStream(this,
context.sparkContext.clean(foreachFunc, false), displayInnerRDDOps).register()
}
private[streaming] def register(): DStream[T] = {
ssc.graph.addOutputStream(this)
this
}
这里调用了register()
将DStream注册到DStreamGraph的outputStreams中
到这里就将我们的业务逻辑什么的都封装到DStream中了
StreamingContext的启动
接下来走ssc.start()
启动StreamingContext
StreamingContext的start方法中主要就是调用scheduler.start()
,启动了JobScheduler
接下来在看看JobScheduler的start方法
def start(): Unit = synchronized {
if (eventLoop != null) return // scheduler has already been started
logDebug("Starting JobScheduler")
// 事件环主要接收调度JobSchedulerEvent事件
eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)
override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
}
// 启动事件环,接收事件、处理事件
eventLoop.start()
// 添加监听
for {
inputDStream <- ssc.graph.getInputStreams
rateController <- inputDStream.rateController
} ssc.addStreamingListener(rateController)
// 监听总线启动
listenerBus.start()
receiverTracker = new ReceiverTracker(ssc)
inputInfoTracker = new InputInfoTracker(ssc)
val executorAllocClient: ExecutorAllocationClient = ssc.sparkContext.schedulerBackend match {
case b: ExecutorAllocationClient => b.asInstanceOf[ExecutorAllocationClient]
case _ => null
}
// 管理分配Executor
executorAllocationManager = ExecutorAllocationManager.createIfEnabled(
executorAllocClient,
receiverTracker,
ssc.conf,
ssc.graph.batchDuration.milliseconds,
clock)
executorAllocationManager.foreach(ssc.addStreamingListener)
// 启动ReceiverTracker
receiverTracker.start()
// 启动JobGenerator
jobGenerator.start()
executorAllocationManager.foreach(_.start())
logInfo("Started JobScheduler")
}
JobScheduler中主要启动了ReceiverTracker和JobGenerator。
ReceiverTracker通知Executor启动Receiver,管理Receiver的执行,与Receiver交互。
JobGenerator用于生成job,执行job。
这两个类分别代表了接收并存储数据 和 生成job、执行job
接下来先看接收并存储数据
接收并存储数据
Driver端ReceiverTracker的操作
先从ReceiverTracker.start()说起。
def start(): Unit = synchronized {
if (isTrackerStarted) {
throw new SparkException("ReceiverTracker already started")
}
if (!receiverInputStreams.isEmpty) {
// 建立RPC终端
endpoint = ssc.env.rpcEnv.setupEndpoint(
"ReceiverTracker", new ReceiverTrackerEndpoint(ssc.env.rpcEnv))
// 加载Receiver
if (!skipReceiverLaunch) launchReceivers()
logInfo("ReceiverTracker started")
trackerState = Started
}
}
// 加载Receiver
private def launchReceivers(): Unit = {
// 从inputStreams中获取receivers
val receivers = receiverInputStreams.map { nis =>
val rcvr = nis.getReceiver()
rcvr.setReceiverId(nis.id)
rcvr
}
runDummySparkJob()
// 发送StartAllReceivers的消息
logInfo("Starting " + receivers.length + " receivers")
endpoint.send(StartAllReceivers(receivers))
}
ReceiverTracker先建立RPC终端点准备通信,监听、回复与Receiver相关的信息。
然后调用launchReceivers(),launchReceivers中的receiverInputStreams是从DStreamGraph中获取的InputStream的集合。通过InputStream获取Receiver,然后发送StartAllReceivers消息。
这里的StartAllReceivers是发给endpoint的,也就是发给ReceiverTrackerEndpoint实例,也就相当于是发给自己的。
ReceiverTrackerEndpoint的receive方法通过模式匹配进行消息的接收,在收到StartAllReceivers后,会根据资源调度分配适合启动Receiver的位置,然后调用本类的startReceiver()
override def receive: PartialFunction[Any, Unit] = {
// Local messages
case StartAllReceivers(receivers) =>
// 分配适合的位置
val scheduledLocations = schedulingPolicy.scheduleReceivers(receivers, getExecutors)
for (receiver <- receivers) {
val executors = scheduledLocations(receiver.streamId)
updateReceiverScheduledExecutors(receiver.streamId, executors)
receiverPreferredLocations(receiver.streamId) = receiver.preferredLocation
startReceiver(receiver, executors)
}
}
接下来看看startReceiver方法
private def startReceiver(
receiver: Receiver[_],
scheduledLocations: Seq[TaskLocation]): Unit = {
def shouldStartReceiver: Boolean = {
!(isTrackerStopping || isTrackerStopped)
}
val receiverId = receiver.streamId
if (!shouldStartReceiver) {
onReceiverJobFinish(receiverId)
return
}
val checkpointDirOption = Option(ssc.checkpointDir)
val serializableHadoopConf =
new SerializableConfiguration(ssc.sparkContext.hadoopConfiguration)
// 封装在worker节点启动receiver的方法
val startReceiverFunc: Iterator[Receiver[_]] => Unit =
(iterator: Iterator[Receiver[_]]) => {
if (!iterator.hasNext) {
throw new SparkException(
"Could not start receiver as object not found.")
}
if (TaskContext.get().attemptNumber() == 0) {
val receiver = iterator.next()
assert(iterator.hasNext == false)
val supervisor = new ReceiverSupervisorImpl(
receiver, SparkEnv.get, serializableHadoopConf.value, checkpointDirOption)
supervisor.start()
supervisor.awaitTermination()
} else {
}
}
// 使用ScheduledLocations创建RDD以在Spark作业中运行接收器
val receiverRDD: RDD[Receiver[_]] =
if (scheduledLocations.isEmpty) {
ssc.sc.makeRDD(Seq(receiver), 1)
} else {
val preferredLocations = scheduledLocations.map(_.toString).distinct
ssc.sc.makeRDD(Seq(receiver -> preferredLocations))
}
// 提交启动receiver的job到spark核心进行启动
val future = ssc.sparkContext.submitJob[Receiver[_], Unit, Unit](
receiverRDD, startReceiverFunc, Seq(0), (_, _) => Unit, ())
// We will keep restarting the receiver job until ReceiverTracker is stopped
future.onComplete {
...
}(ThreadUtils.sameThread)
logInfo(s"Receiver ${receiver.streamId} started")
}
stratReceiver方法先封装了启动receiver的方法和RDD,然后提交给spark核心进行执行。
上面代码startReceiverFunc中,封装了创建和启动ReceiverSupervisor的操作。
ReceiverSupervisor是Executor端Receiver的管理者,负责监督和管理Executor中的Receiver的运行
Executor端ReceiverSupervisor的操作
接下来追踪ReceiverSupervisor的start方法。
/** Start the supervisor */
def start() {
onStart()
startReceiver()
}
// ReceiverSupervisorImpl中的onStart方法
override protected def onStart() {
registeredBlockGenerators.asScala.foreach { _.start() }
}
// ReceiverSupervisor的方法,用于启动Receiver
def startReceiver(): Unit = synchronized {
try {
if (onReceiverStart()) {
receiverState = Started
// 启动receiver,开始接收数据
receiver.onStart()
} else {
...
}
} catch {
}
}
在onStart方法中,可以看到一个registeredBlockGenerators集合,它是BlockGenerator的集合。
BlockGenerator是Receiver中比较重要的一个类,用于将我们收到的单条数据写入buffer,然后定时将buffer封装为块,进行存储和汇报给Driver。
接下来详细看一下它的变量和方法
// listener创建BlockGenerator时传进来的监听器,
// 用来监听块相关事件:onAddData、onGenerateBlock、onPushBlock
listener: BlockGeneratorListener
// 是一个ArrayBuffer,用来暂存接收到的数据
@volatile private var currentBuffer = new ArrayBuffer[Any]
// 一个队列,用来存取封装好的Block块
private val blocksForPushing = new ArrayBlockingQueue[Block](blockQueueSize)
// 定时器,定时将currentBuffer中的数据封装为Block,然后推到blocksForPushing里面
private val blockIntervalTimer =
new RecurringTimer(clock, blockIntervalMs, updateCurrentBuffer, "BlockGenerator")
// blocksForPushing队列的大小
private val blockQueueSize = conf.getInt("spark.streaming.blockQueueSize", 10)
// 这是一个线程,用来从blocksForPushing中取出Block,然后进行存储,汇报ReceiverTracker
private val blockPushingThread = new Thread() { override def run() { keepPushingBlocks() } }
// 按照时间生成块,然后将块推到blocksForPushing中
private def updateCurrentBuffer(time: Long): Unit = {
try {
var newBlock: Block = null
synchronized {
if (currentBuffer.nonEmpty) {
val newBlockBuffer = currentBuffer
currentBuffer = new ArrayBuffer[Any]
val blockId = StreamBlockId(receiverId, time - blockIntervalMs)
listener.onGenerateBlock(blockId)
newBlock = new Block(blockId, newBlockBuffer)
}
}
if (newBlock != null) {
blocksForPushing.put(newBlock) // put is blocking when queue is full
}
} catch {
}
}
// 推送块
private def keepPushingBlocks() {
...
while (!blocksForPushing.isEmpty) {
val block = blocksForPushing.take()
logDebug(s"Pushing block $block")
// 调用本类的pushBlock方法
pushBlock(block)
logInfo("Blocks left to push " + blocksForPushing.size())
}
logInfo("Stopped block pushing thread")
} catch {
case ie: InterruptedException =>
logInfo("Block pushing thread was interrupted")
case e: Exception =>
reportError("Error in block pushing thread", e)
}
}
// 推送块
private def pushBlock(block: Block) {
listener.onPushBlock(block.id, block.buffer)
logInfo("Pushed block " + block.id)
}
大体来说,BlockGenerator中使用了一个ArrayBuffer来不断的接收存储数据,然后会按时将ArrayBuffer中的数据封装为Block。另有一个队列ArrayBlockingQueue来存取Block,一边存一边取,这样实现了单条数据的接收与存储。
再接着看pushBlock的操作。其中调用了listener.onPushBlock()。
listener是构造BlockGenerator时传进来的,使用的是ReceiverSupervisorImpl中的defaultBlockGeneratorListener。
private val defaultBlockGeneratorListener = new BlockGeneratorListener {
def onAddData(data: Any, metadata: Any): Unit = { }
def onGenerateBlock(blockId: StreamBlockId): Unit = { }
def onError(message: String, throwable: Throwable) {
reportError(message, throwable)
}
// 推块的时候调用,它又会调用ReceiverSupervisorImpl.pushArrayBuffer()
def onPushBlock(blockId: StreamBlockId, arrayBuffer: ArrayBuffer[_]) {
pushArrayBuffer(arrayBuffer, None, Some(blockId))
}
}
// 将接收到的数据的ArrayBuffer作为数据块存储到Spark的内存中
def pushArrayBuffer(
arrayBuffer: ArrayBuffer[_],
metadataOption: Option[Any],
blockIdOption: Option[StreamBlockId]
) {
// 调用pushAndReportBlock()
pushAndReportBlock(ArrayBufferBlock(arrayBuffer), metadataOption, blockIdOption)
}
// 将块数据进行存储,然后汇报给Driver
def pushAndReportBlock(
receivedBlock: ReceivedBlock,
metadataOption: Option[Any],
blockIdOption: Option[StreamBlockId]
) {
val blockId = blockIdOption.getOrElse(nextBlockId)
val time = System.currentTimeMillis
// 这步会真正的存储数据
val blockStoreResult = receivedBlockHandler.storeBlock(blockId, receivedBlock)
logDebug(s"Pushed block $blockId in ${(System.currentTimeMillis - time)} ms")
val numRecords = blockStoreResult.numRecords
val blockInfo = ReceivedBlockInfo(streamId, numRecords, metadataOption, blockStoreResult)
// 将存储结果报告Driver
if (!trackerEndpoint.askSync[Boolean](AddBlock(blockInfo))) {
throw new SparkException("Failed to add block to receiver tracker.")
}
logDebug(s"Reported block $blockId")
}
listener.onPushBlock会调用pushArrayBuffer(),pushArrayBuffer方法会调用pushAndReportBlock()将数据进行存储,然后汇报给Driver。
这里需要注意一下:BlockGenerator负责单条数据的接收与生成快。这个一会会再说。
开始接收数据、存储数据
BlockGenerator的内部看完以后,接着回到ReceiverSupervisor.start()中来
def start() {
onStart()
startReceiver()
}
onStart()方法中启动BlockGenerator,启动块生成的定时器和推送块的线程
def start(): Unit = synchronized {
if (state == Initialized) {
state = Active
blockIntervalTimer.start()
blockPushingThread.start()
logInfo("Started BlockGenerator")
} else {
throw new SparkException(
s"Cannot start BlockGenerator as its not in the Initialized state [state = $state]")
}
}
startReceiver()方法中,调用receiver.onStart(),开始接收数据
def startReceiver(): Unit = synchronized {
try {
if (onReceiverStart()) {
receiverState = Started
// 启动receiver开始接收数据
receiver.onStart()
} else {
stop("Registered unsuccessfully because Driver refused to start receiver " + streamId, None)
}
} catch {
}
}
以我们一开写的demo中的SocketInputDStream为例,它会生成一个SocketReceiver实例,以下是SocketReceiver的onStart方法。
def onStart() {
try {
// 启动socket,开始监听
socket = new Socket(host, port)
} catch {
}
new Thread("Socket Receiver") {
setDaemon(true)
override def run() { receive() }
}.start()
}
def receive() {
try {
// 接收数据
val iterator = bytesToObjects(socket.getInputStream())
while(!isStopped && iterator.hasNext) {
// 将接收到的数据进行存储
store(iterator.next())
}
} catch {
...
} finally {
onStop()
}
}
可以看到,onStart中启动了一个线程,开始不断的接收数据,之后会调用store()将接收到的数据进行存储。
这里的store()方法是Receiver中定义的,我们跟进一下。
def store(dataItem: T) {
supervisor.pushSingle(dataItem)
}
/** Store an ArrayBuffer of received data as a data block into Spark's memory. */
def store(dataBuffer: ArrayBuffer[T]) {
supervisor.pushArrayBuffer(dataBuffer, None, None)
}
/**
* Store an ArrayBuffer of received data as a data block into Spark's memory.
* The metadata will be associated with this block of data
* for being used in the corresponding InputDStream.
*/
def store(dataBuffer: ArrayBuffer[T], metadata: Any) {
supervisor.pushArrayBuffer(dataBuffer, Some(metadata), None)
}
/** Store an iterator of received data as a data block into Spark's memory. */
def store(dataIterator: Iterator[T]) {
supervisor.pushIterator(dataIterator, None, None)
}
会发现有好几个重载的方法,参数不尽相同。
SocketReceiver中调用的是store(dataItem: T)
这个方法,它会调用pushSingle将数据添加到BlockGenerator中的currentBuffer中。BlockGenerator再定时将currentBuffer封装为Block,然后调用pushBlock、pushArrayBuffer、pushAndReportBlock对数据进行存储、汇报Driver。
store(dataItem: T)
就相当于之前说的接收单条数据进行存储的操作。
另外几个重载方法也都会最终也都会调用pushAndReportBlock数据进行存储,然后报告Driver。这里就不再跟下去了。
数据的接收与存储到这里就结束了。接下来我们在回到JobGenerator解析一下job的生成和执行。
生成job、执行job
JobGenerator介绍
视线在跳回到JobGenerator这边来,先看看JobGenerator中几个重要变量
// job生成消息的事件环
private var eventLoop: EventLoop[JobGeneratorEvent]
// 定时器,按照批处理间隔定时向eventLoop发送生成job的消息
private val timer = new RecurringTimer(
clock, ssc.graph.batchDuration.milliseconds,
longTime => eventLoop.post(GenerateJobs(new Time(longTime))),
"JobGenerator"
)
接下来看看JobGenerator的start方法
def start(): Unit = synchronized {
if (eventLoop != null) return
checkpointWriter
// eventLoop的回调方法onReceive会调用processEvent(event)进行事件的处理
eventLoop = new EventLoop[JobGeneratorEvent]("JobGenerator") {
override protected def onReceive(event: JobGeneratorEvent): Unit = processEvent(event)
override protected def onError(e: Throwable): Unit = {
jobScheduler.reportError("Error in job generator", e)
}
}
// 启动事件环
eventLoop.start()
if (ssc.isCheckpointPresent) {
restart()
} else {
startFirstTime()
}
}
start方法中会启动eventLoop和调用startFirstTime()。
eventLoop启动后,会启动一个线程来不断的接收消息,根据接收到的消息作出相应的操作
看一下startFirstTime(),startFirstTime中启动了DStreamGraph 和 用于定时发送生成job消息的定时器
/** Starts the generator for the first time */
private def startFirstTime() {
val startTime = new Time(timer.getStartTime())
// 启动DStreamGraph
graph.start(startTime - graph.batchDuration)
// 启动定时器timer
timer.start(startTime.milliseconds)
logInfo("Started JobGenerator at " + startTime)
}
DStreamGraph的start方法就不跟进了,没有很重要的东西。
timer启动后,会定时发送GenerateJobs(new Time(longTime))
的消息。eventLoop在收到消息后,调用processEvent方法进行处理,如下:
private def processEvent(event: JobGeneratorEvent) {
logDebug("Got event " + event)
event match {
case GenerateJobs(time) => generateJobs(time)
case ClearMetadata(time) => clearMetadata(time)
case DoCheckpoint(time, clearCheckpointDataLater) =>
doCheckpoint(time, clearCheckpointDataLater)
case ClearCheckpointData(time) => clearCheckpointData(time)
}
}
生成job
接下来就开始generateJobs的旅程了。
首先processEvent会将GenerateJobs消息通过调用JobGenerator.generateJobs()进行处理。
以下是JobGenerator的generateJobs方法:
// 根据时间生成job
private def generateJobs(time: Time) {
ssc.sparkContext.setLocalProperty(RDD.CHECKPOINT_ALL_MARKED_ANCESTORS, "true")
Try {
// 调用receiverTracker给批分配数据
jobScheduler.receiverTracker.allocateBlocksToBatch(time)
// 在DStreamGraph中根据分配的块生成job
graph.generateJobs(time)
} match {
// 如果job生成成功,调用jobScheduler.submitJobSet提交job
case Success(jobs) =>
val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
// 失败则打报告
case Failure(e) =>
jobScheduler.reportError("Error generating jobs for time " + time, e)
PythonDStream.stopStreamingContextIfPythonProcessIsDead(e)
}
// 完成后进行checkpoint
eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
}
首先会调用receiverTracker.allocateBlocksToBatch()给当前批分配需要处理的数据,之后调用DStreamGraph.generateJobs()生成job序列,如果生成成功,调用jobScheduler.submitJobSet提交job。
先跟进一下DStreamGraph.generateJobs():
def generateJobs(time: Time): Seq[Job] = {
logDebug("Generating jobs for time " + time)
val jobs = this.synchronized {
// 根据outputStream生成job
outputStreams.flatMap { outputStream =>
val jobOption = outputStream.generateJob(time)
jobOption.foreach(_.setCallSite(outputStream.creationSite))
jobOption
}
}
logDebug("Generated " + jobs.length + " jobs for time " + time)
jobs
}
发现这里会遍历outputStreams生成job,outputStreams中存放的是我们调用的outputOperation算子对应的DStream,也就是之前说的调用outputOperation算子将DStream注册到DStreamGraph中的outputStreams中。
以我们最开始的WordCount代码为例,我们的代码最终会添加一个ForEachDStream到outputStreams中去。
所以就会调用这里就调用ForEachDStream.generateJob()来生成job。
以下是ForEachDStream的generateJob方法:
override def generateJob(time: Time): Option[Job] = {
parent.getOrCompute(time) match {
case Some(rdd) =>
val jobFunc = () => createRDDWithLocalProperties(time, displayInnerRDDOps) {
foreachFunc(rdd, time)
}
Some(new Job(time, jobFunc))
case None => None
}
}
generateJob方法会调用parent.getOrCompute()生成RDD,如果生成成功,以RDD和我们定义的逻辑处理函数构造Job,并返回job。
需要注意一下这里的parent,parent其实就是它所依赖的上一个DStream的引用,
lines
.flatMap(_.split(" "))
.map((_, 1))
.reduceByKey(_ + _)
.print()
以我们写的代码为例,这里的parent就是由reduceByKey算子生成的ShuffledDStream的引用,ShuffledDStream中的parent是map生成的MappedDStream的引用,MappedDStream中的parent是flatMap生成的FlatMappedDStream的引用。
FlatMappedDStream中的parent就是SocketInputDStream的引用
跟进一下parent.getOrCompute(),现在的parent是ShuffledDStream的引用
private[streaming] final def getOrCompute(time: Time): Option[RDD[T]] = {
// 已经生成的RDD集合,是以时间为key,rdd为value的HashMap
generatedRDDs.get(time).orElse {
if (isTimeValid(time)) {
val rddOption = createRDDWithLocalProperties(time, displayInnerRDDOps = false) {
// 执行compute方法,生成rdd,几乎每个DStream子类都会实现这个方法
SparkHadoopWriterUtils.disableOutputSpecValidation.withValue(true) {
compute(time)
}
}
// 对生成的rdd缓存或checkpoint,添加到已经生成的RDD集合中
rddOption.foreach { case newRDD =>
if (storageLevel != StorageLevel.NONE) {
newRDD.persist(storageLevel)
}
if (checkpointDuration != null && (time - zeroTime).isMultipleOf(checkpointDuration)) {
newRDD.checkpoint()
}
generatedRDDs.put(time, newRDD)
}
rddOption
} else {
None
}
}
}
DStream中定义了一个generatedRDDs用来存储已经生成的RDD。
会先去generatedRDDs中获取当前批的RDD,如果不存在则执行compute()生成RDD。
按我们写的代码来走的话,调用的是ShuffledDStream的compute方法。
override def compute(validTime: Time): Option[RDD[(K, C)]] = {
parent.getOrCompute(validTime) match {
case Some(rdd) => Some(rdd.combineByKey[C](
createCombiner, mergeValue, mergeCombiner, partitioner, mapSideCombine))
case None => None
}
}
发现又调用了parent.getOrCompute生成RDD。
我们就可以发现它是根据依赖关系,循环的去调用getOrCompute和compute,直到最开始的DStream。
我们代码中最开始的是SocketInputDStream,会调用SocketInputDStream实例的compute方法,SocketInputDStream没有compute方法,这里调用的是他的父类ReceiverInputDStream的compute方法。
override def compute(validTime: Time): Option[RDD[T]] = {
val blockRDD = {
if (validTime < graph.startTime) {
new BlockRDD[T](ssc.sc, Array.empty)
} else {
// 获取当前分配给当前批的块信息
val receiverTracker = ssc.scheduler.receiverTracker
val blockInfos = receiverTracker.getBlocksOfBatch(validTime).getOrElse(id, Seq.empty)
val inputInfo = StreamInputInfo(id, blockInfos.flatMap(_.numRecords).sum)
ssc.scheduler.inputInfoTracker.reportInfo(validTime, inputInfo)
// 根据批时间和块信息创建RDD,并返回
createBlockRDD(validTime, blockInfos)
}
}
Some(blockRDD)
}
一系列操作生成RDD完成后,回到ForEachDStream的generateJob方法,
override def generateJob(time: Time): Option[Job] = {
parent.getOrCompute(time) match {
case Some(rdd) =>
val jobFunc = () => createRDDWithLocalProperties(time, displayInnerRDDOps) {
foreachFunc(rdd, time)
}
Some(new Job(time, jobFunc))
case None => None
}
}
根据生成的RDD和业务处理函数封装成job,返回job到DStream.generateJobs()
DStream.generateJobs()再将job返回到JobGenerator.generateJobs()中来
此刻,我们的job就生成完成了。
提交执行job
接下来JobGenerator.generateJobs()中会执行jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
,将job进行提交。
def submitJobSet(jobSet: JobSet) {
if (jobSet.jobs.isEmpty) {
logInfo("No jobs added for time " + jobSet.time)
} else {
listenerBus.post(StreamingListenerBatchSubmitted(jobSet.toBatchInfo))
jobSets.put(jobSet.time, jobSet)
jobSet.jobs.foreach(job => jobExecutor.execute(new JobHandler(job)))
logInfo("Added jobs for time " + jobSet.time)
}
}
这里会将job封装到JobHandler中进行处理,JobHandler是一个线程类,其中会执行job.run
运行job。
以下是Job的run方法,其中的func()就是我们封装进来的业务处理函数。
def run() {
_result = Try(func())
}
将JobHandler扔到线程池中执行,我们的job就跑起来了。
输出数据
job跑起来后,会根据我们封装的func(),执行对应的输出。
end...
至此,Spark Streaming源码流程解析就over了。
多敲、多看、多搬砖、加油。
个人公众号:码农峰,定时推送行业资讯,持续发布原创技术文章,欢迎大家关注。