Spark Source Code Analysis – DAGScheduler
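The architecture of DAGScheduler is actually quite simple:
1. eventQueue: everything that DAGScheduler needs to handle must be sent to eventQueue as an event.
2. The eventLoop thread continuously takes events from eventQueue and processes them.
3. It implements TaskSchedulerListener and registers itself with the TaskScheduler, so that the TaskScheduler can call the TaskSchedulerListener interface at any time to report status changes.
The TaskSchedulerListener implementation itself does little more than post the corresponding events to eventQueue.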
/**
 * The high-level scheduling layer that implements stage-oriented scheduling. It computes a DAG of
 * stages for each job, keeps track of which RDDs and stage outputs are materialized, and finds a
 * minimal schedule to run the job. It then submits stages as TaskSets to an underlying
 * TaskScheduler implementation that runs them on the cluster.
 *
 * In addition to coming up with a DAG of stages, this class also determines the preferred
 * locations to run each task on, based on the current cache status, and passes these to the
 * low-level TaskScheduler. Furthermore, it handles failures due to shuffle output files being
 * lost, in which case old stages may need to be resubmitted. Failures *within* a stage that are
 * not caused by shuffle file loss are handled by the TaskScheduler, which will retry each task
 * a small number of times before cancelling the whole stage.
 *
 * THREADING: This class runs all its logic in a single thread executing the run() method, to which
 * events are submitted using a synchronized queue (eventQueue). The public API methods, such as
 * runJob, taskEnded and executorLost, post events asynchronously to this queue. All other methods
 * should be private.
 */
private[spark]
class DAGScheduler(
    taskSched: TaskScheduler, // the TaskScheduler this DAGScheduler is bound to
    mapOutputTracker: MapOutputTracker,
    blockManagerMaster: BlockManagerMaster,
    env: SparkEnv)
  extends TaskSchedulerListener with Logging {

  def this(taskSched: TaskScheduler) {
    this(taskSched, SparkEnv.get.mapOutputTracker, SparkEnv.get.blockManager.master, SparkEnv.get)
  }

  // Tasks must report their execution status to DAGScheduler, so DAGScheduler
  // is registered as a listener on the TaskScheduler.
  taskSched.setListener(this)

  // It also implements the TaskSchedulerListener interface, which the TaskScheduler
  // calls whenever a status changes.

  // Called by TaskScheduler to report task's starting.
  override def taskStarted(task: Task[_], taskInfo: TaskInfo) {
    eventQueue.put(BeginEvent(task, taskInfo))
  }

  // ... other interface implementations omitted
  // The core event queue of DAGScheduler
  private val eventQueue = new LinkedBlockingQueue[DAGSchedulerEvent]

  val nextJobId = new AtomicInteger(0)

  val nextStageId = new AtomicInteger(0)

  val stageIdToStage = new TimeStampedHashMap[Int, Stage]

  val shuffleToMapStage = new TimeStampedHashMap[Int, Stage]

  private[spark] val stageToInfos = new TimeStampedHashMap[Stage, StageInfo]

  // DAGScheduler also exposes a SparkListenerBus so that other modules can listen to it
  private val listenerBus = new SparkListenerBus()

  // Contains the locations that each RDD's partitions are cached on
  private val cacheLocs = new HashMap[Int, Array[Seq[TaskLocation]]]
  // Start a thread to run the DAGScheduler event loop
  def start() {
    new Thread("DAGScheduler") { // create the event-processing thread
      setDaemon(true)
      override def run() {
        DAGScheduler.this.run()
      }
    }.start()
  }
  /**
   * The main event loop of the DAG scheduler, which waits for new-job / task-finished / failure
   * events and responds by launching tasks. This runs in a dedicated thread and receives events
   * via the eventQueue.
   */
  private def run() {
    SparkEnv.set(env)

    while (true) {
      val event = eventQueue.poll(POLL_TIMEOUT, TimeUnit.MILLISECONDS)
      if (event != null) {
        logDebug("Got event of type " + event.getClass.getName)
      }
      this.synchronized { // needed in case other threads make calls into methods of this class
        if (event != null) {
          if (processEvent(event)) {
            return
          }
        }

        val time = System.currentTimeMillis() // TODO: use a pluggable clock for testability
        // Periodically resubmit failed stages if some map output fetches have failed and we have
        // waited at least RESUBMIT_TIMEOUT. We wait for this short time because when a node fails,
        // tasks on many other nodes are bound to get a fetch failure, and they won't all get it at
        // the same time, so we want to make sure we've identified all the reduce tasks that depend
        // on the failed node.
        if (failed.size > 0 && time > lastFetchFailureTime + RESUBMIT_TIMEOUT) {
          resubmitFailedStages()
        } else {
          submitWaitingStages()
        }
      }
    }
  }
  /**
   * Process one event retrieved from the event queue.
   * Returns true if we should stop the event loop.
   */
  private[scheduler] def processEvent(event: DAGSchedulerEvent): Boolean = {
    event match {
      case JobSubmitted(finalRDD, func, partitions, allowLocal, callSite, listener, properties) =>
        // Get a new jobId; nextJobId is an AtomicInteger
        val jobId = nextJobId.getAndIncrement()
        // Create finalStage from finalRDD; whether other stages or RDDs precede it
        // has to be inferred from the dependencies
        val finalStage = newStage(finalRDD, None, jobId, Some(callSite))
        // Create the job from finalStage
        val job = new ActiveJob(jobId, finalStage, func, partitions, callSite, listener, properties)
        clearCacheLocs()
        if (allowLocal && finalStage.parents.size == 0 && partitions.length == 1) {
          // Compute very short actions like first() or take() with no parent stages locally.
          runLocally(job) // simple jobs are executed locally right away
        } else {
          listenerBus.post(SparkListenerJobStart(job, properties))
          idToActiveJob(jobId) = job
          activeJobs += job
          resultStageToJob(finalStage) = job
          submitStage(finalStage)
        }

      // Handling of the other event types is omitted; only JobSubmitted is covered here
    }
    // ... rest of the method omitted
  }
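The queue-plus-loop-thread structure described at the top can be sketched on its own, outside Spark. The following is only a minimal illustration, not Spark's code: `DemoEvent`, `Work`, `Stop`, `MiniEventLoop` and the 50 ms poll timeout are all hypothetical names and values chosen for this example.

```scala
import java.util.concurrent.{CountDownLatch, LinkedBlockingQueue, TimeUnit}
import scala.collection.mutable.ArrayBuffer

// Hypothetical event types standing in for DAGSchedulerEvent.
sealed trait DemoEvent
case class Work(name: String) extends DemoEvent
case object Stop extends DemoEvent

class MiniEventLoop {
  private val eventQueue = new LinkedBlockingQueue[DemoEvent]
  private val stopped = new CountDownLatch(1)

  // Written only by the loop thread; read it after awaitStop() has returned true.
  val handled = ArrayBuffer[String]()

  // Public API: callers never do the work themselves, they only enqueue an event.
  def post(event: DemoEvent): Unit = eventQueue.put(event)

  // Start the single daemon thread that runs the event loop.
  def start(): Unit = {
    val thread = new Thread("MiniEventLoop") {
      override def run(): Unit = {
        try loop() finally stopped.countDown()
      }
    }
    thread.setDaemon(true)
    thread.start()
  }

  // Poll with a timeout so the loop could also do periodic work between events.
  private def loop(): Unit = {
    while (true) {
      val event = eventQueue.poll(50, TimeUnit.MILLISECONDS)
      if (event != null && processEvent(event)) return
    }
  }

  // Returns true when the loop should stop.
  private def processEvent(event: DemoEvent): Boolean = event match {
    case Work(name) => handled += name; false
    case Stop => true
  }

  // Wait until the loop thread has exited, or until the timeout elapses.
  def awaitStop(timeoutMs: Long): Boolean = stopped.await(timeoutMs, TimeUnit.MILLISECONDS)
}
```

Posting `Work("a")`, `Work("b")` and then `Stop` after calling `start()` leaves the two names in `handled` in posting order, because a single thread drains the queue.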
1. dagScheduler.runJob
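Picking up from the previous article: calling runJob in SparkContext ends up calling dagScheduler.runJob.
All dagScheduler.runJob does is put the toSubmit event into eventQueue and then wait for the job to finish.
prepareJob, in turn, creates the JobWaiter and the JobSubmitted object.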
def runJob[T, U: ClassManifest](
    finalRdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: String,
    allowLocal: Boolean,
    resultHandler: (Int, U) => Unit,
    properties: Properties = null)
{
  if (partitions.size == 0) {
    return
  }
  val (toSubmit: JobSubmitted, waiter: JobWaiter[_]) = prepareJob(
    finalRdd, func, partitions, callSite, allowLocal, resultHandler, properties)
  eventQueue.put(toSubmit)
  waiter.awaitResult() match {
    case JobSucceeded => {}
    case JobFailed(exception: Exception, _) =>
      logInfo("Failed to run " + callSite)
      throw exception
  }
}
1.1 JobWaiter
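JobWaiter is fairly simple. It implements the taskSucceeded and jobFailed methods of JobListener; when DAGScheduler receives a task success or failure event, it calls the corresponding method.
In taskSucceeded, once every task has succeeded, jobFinished is set to indicate that the job is done.
awaitResult simply blocks until jobFinished has been set.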
private[spark] class JobWaiter[T](totalTasks: Int, resultHandler: (Int, T) => Unit)
  extends JobListener {

  override def taskSucceeded(index: Int, result: Any) {
    synchronized {
      if (jobFinished) {
        throw new UnsupportedOperationException("taskSucceeded() called on a finished JobWaiter")
      }
      resultHandler(index, result.asInstanceOf[T]) // process the task result with resultHandler
      finishedTasks += 1
      if (finishedTasks == totalTasks) {
        jobFinished = true
        jobResult = JobSucceeded
        this.notifyAll()
      }
    }
  }

  override def jobFailed(exception: Exception) {……}

  def awaitResult(): JobResult = synchronized {
    while (!jobFinished) {
      this.wait()
    }
    return jobResult
  }
}
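To see the wait/notify handshake in isolation, here is a small standalone sketch modeled on the excerpt above. It is not Spark's class: `MiniJobWaiter` and `MiniJobWaiterDemo` are hypothetical names, failure handling is left out, and treating a job with zero tasks as already finished is an assumption of this sketch.

```scala
// A stripped-down waiter: counts successful tasks and wakes up the waiting caller.
class MiniJobWaiter[T](totalTasks: Int, resultHandler: (Int, T) => Unit) {
  private var finishedTasks = 0
  // Assumption of this sketch: a job without tasks counts as finished from the start.
  private var jobFinished = totalTasks == 0

  // Called by whichever thread reports a task result.
  def taskSucceeded(index: Int, result: T): Unit = synchronized {
    if (jobFinished) {
      throw new UnsupportedOperationException("taskSucceeded() called on a finished waiter")
    }
    resultHandler(index, result)
    finishedTasks += 1
    if (finishedTasks == totalTasks) {
      jobFinished = true
      notifyAll() // wake up every thread blocked in awaitResult()
    }
  }

  // Blocks the caller until all tasks have reported success.
  // The while loop guards against spurious wake-ups.
  def awaitResult(): Unit = synchronized {
    while (!jobFinished) {
      wait()
    }
  }
}

object MiniJobWaiterDemo {
  // Three "tasks" report their results from separate threads;
  // the calling thread blocks until all of them have arrived.
  def run(): List[Int] = {
    val results = new Array[Int](3)
    val waiter = new MiniJobWaiter[Int](3, (index, value) => results(index) = value)
    for (i <- 0 until 3) {
      val reporter = new Thread(new Runnable {
        override def run(): Unit = waiter.taskSucceeded(i, i * 10)
      })
      reporter.start()
    }
    waiter.awaitResult()
    results.toList
  }
}
```

Because results are stored by task index, the order in which the reporting threads happen to run does not affect the list returned by `MiniJobWaiterDemo.run()`.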
1.2 JobSubmitted
JobSubmitted is just one kind of DAGSchedulerEvent, a textbook use case for pattern matching.
As the definitions below show, there are many other DAGSchedulerEvent types besides JobSubmitted.
private[spark] sealed trait DAGSchedulerEvent

private[spark] case class JobSubmitted(
    finalRDD: RDD[_],
    func: (TaskContext, Iterator[_]) => _,
    partitions: Array[Int],
    allowLocal: Boolean,
    callSite: String,
    listener: JobListener,
    properties: Properties = null)
  extends DAGSchedulerEvent

private[spark] case class BeginEvent(task: Task[_], taskInfo: TaskInfo) extends DAGSchedulerEvent

private[spark] case class CompletionEvent(
    task: Task[_],
    reason: TaskEndReason,
    result: Any,
    accumUpdates: Map[Long, Any],
    taskInfo: TaskInfo,
    taskMetrics: TaskMetrics)
  extends DAGSchedulerEvent

private[spark] case class ExecutorGained(execId: String, host: String) extends DAGSchedulerEvent

private[spark] case class ExecutorLost(execId: String) extends DAGSchedulerEvent

private[spark] case class TaskSetFailed(taskSet: TaskSet, reason: String) extends DAGSchedulerEvent

private[spark] case object StopDAGScheduler extends DAGSchedulerEvent
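Why a sealed trait with case classes suits this kind of dispatch can be shown with a toy hierarchy. The names below (`SchedEvent`, `Submitted`, `Lost`, `StopAll`, `Dispatch`) are hypothetical and much simpler than the real events; the point is only that each case both selects the event type and extracts its fields, and that the compiler can check a match on a sealed trait for missing cases.

```scala
// Hypothetical, simplified event hierarchy.
sealed trait SchedEvent
case class Submitted(jobId: Int, partitions: Seq[Int]) extends SchedEvent
case class Lost(execId: String) extends SchedEvent
case object StopAll extends SchedEvent

object Dispatch {
  // One case per event type; the pattern destructures the event's fields directly.
  def describe(event: SchedEvent): String = event match {
    case Submitted(jobId, partitions) => s"job $jobId with ${partitions.size} partitions"
    case Lost(execId) => s"executor $execId lost"
    case StopAll => "stop"
  }
}
```

If one of the cases is removed from `describe`, the compiler reports that the match may not be exhaustive, which is useful when a new event type is added later.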
2 processEvent.JobSubmitted
Handling JobSubmitted means first creating the final stage and then submitting it.
For stage-related operations, see Spark Source Code Analysis -- Stage.
2.1 submitStage
submitStage first resolves the DAG of stages by recursively locating the missing parent stages, and then submits each stage's tasks in dependency order.
/** Submits stage, but first recursively submits any missing parents. */
private def submitStage(stage: Stage) {
  logDebug("submitStage(" + stage + ")")
  if (!waiting(stage) && !running(stage) && !failed(stage)) {
    // Starting from this stage, check whether any parent stages still have to run
    val missing = getMissingParentStages(stage).sortBy(_.id)
    logDebug("missing: " + missing)
    if (missing == Nil) {
      logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
      // No parent stage needs to run, so submit the current stage directly
      submitMissingTasks(stage)
      running += stage
    } else {
      for (parent <- missing) {
        // Parents must be submitted first, because stages have to run in order
        submitStage(parent)
      }
      waiting += stage // put the current stage on the waiting list
    }
  }
}
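The parents-first recursion can be tried out on a toy DAG. This is a simplified sketch rather than Spark's logic: `ToyStage` and `ToySubmitter` are hypothetical, a stage only knows its direct parents, "missing" simply means "not in the `alreadyDone` set", and launching a stage is reduced to recording its id.

```scala
import scala.collection.mutable

// Hypothetical stage: an id plus its direct parent stages.
case class ToyStage(id: Int, parents: List[ToyStage])

class ToySubmitter(alreadyDone: Set[Int]) {
  val waiting = mutable.LinkedHashSet[Int]()  // stages blocked on unfinished parents
  val running = mutable.LinkedHashSet[Int]()  // stages whose tasks were launched
  val launched = mutable.ArrayBuffer[Int]()   // launch order, kept for inspection

  // Assumption of this sketch: a parent is missing when its output is not done yet.
  private def missingParents(stage: ToyStage): List[ToyStage] =
    stage.parents.filterNot(parent => alreadyDone(parent.id)).sortBy(_.id)

  def submitStage(stage: ToyStage): Unit = {
    if (!waiting(stage.id) && !running(stage.id)) {
      val missing = missingParents(stage)
      if (missing.isEmpty) {
        launched += stage.id          // stands in for submitMissingTasks(stage)
        running += stage.id
      } else {
        missing.foreach(submitStage)  // submit the parents first
        waiting += stage.id           // revisit this stage once the parents finish
      }
    }
  }
}
```

For a chain where stage 3 depends on stage 2, which depends on stages 0 and 1, submitting stage 3 with nothing done launches 0 and 1 and parks 2 and 3 in `waiting`. With 0 and 1 already done, only stage 2 is launched and stage 3 waits.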
2.2 submitMissingTasks
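For task-related details, see Spark Source Code Analysis -- Task.
As the code shows, regardless of the stage type, a task is created for each partition of the stage that still needs to be computed.
The tasks are then wrapped in a TaskSet, and the stage is submitted to the TaskScheduler.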
/** Called when stage's parents are available and we can now do its task. */
private def submitMissingTasks(stage: Stage) {
  // Get our pending tasks and remember them in our pendingTasks entry
  var tasks = ArrayBuffer[Task[_]]()
  if (stage.isShuffleMap) { // ShuffleMap stage
    for (p <- 0 until stage.numPartitions if stage.outputLocs(p) == Nil) {
      val locs = getPreferredLocs(stage.rdd, p)
      tasks += new ShuffleMapTask(stage.id, stage.rdd, stage.shuffleDep.get, p, locs)
    }
  } else { // Result stage
    // This is a final stage; figure out its job's missing partitions
    val job = resultStageToJob(stage)
    for (id <- 0 until job.numPartitions if !job.finished(id)) {
      val partition = job.partitions(id)
      val locs = getPreferredLocs(stage.rdd, partition)
      tasks += new ResultTask(stage.id, stage.rdd, job.func, partition, locs, id)
    }
  }

  // ... some lines are omitted in this excerpt, including the definition of `properties`

  // The opening condition below is restored to match the else branch at the end
  if (tasks.size > 0) {
    taskSched.submitTasks(
      new TaskSet(tasks.toArray, stage.id, stage.newAttemptId(), stage.jobId, properties))
    if (!stage.submissionTime.isDefined) {
      stage.submissionTime = Some(System.currentTimeMillis())
    }
  } else {
    logDebug("Stage " + stage + " is actually done; %b %d %d".format(
      stage.isAvailable, stage.numAvailableOutputs, stage.numPartitions))
    running -= stage
  }
}
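The per-partition filtering in both branches can be illustrated with plain data. This is a sketch under simplifying assumptions, not Spark's API: `ToyTask`, `ToyShuffleMapTask`, `ToyResultTask` and `ToyTaskBuilder` are hypothetical, output locations are plain host-name lists, and preferred locations are ignored.

```scala
// Hypothetical task types carrying only the fields needed for this illustration.
sealed trait ToyTask
case class ToyShuffleMapTask(stageId: Int, partition: Int) extends ToyTask
case class ToyResultTask(stageId: Int, partition: Int, outputId: Int) extends ToyTask

object ToyTaskBuilder {
  // Shuffle map stage: one task for every partition that has no registered output yet.
  def shuffleMapTasks(stageId: Int, outputLocs: Array[List[String]]): Seq[ToyTask] =
    for (p <- outputLocs.indices if outputLocs(p) == Nil)
      yield ToyShuffleMapTask(stageId, p)

  // Result stage: one task for every job output that is not finished yet.
  // `partitions(id)` maps the job's output index to the RDD partition to compute.
  def resultTasks(stageId: Int, partitions: Array[Int], finished: Array[Boolean]): Seq[ToyTask] =
    for (id <- partitions.indices if !finished(id))
      yield ToyResultTask(stageId, partitions(id), id)
}
```

For example, `ToyTaskBuilder.shuffleMapTasks(7, Array(List("hostA"), Nil, Nil))` creates tasks only for partitions 1 and 2, since partition 0 already has an output location. This is what makes a resubmitted stage recompute only its lost outputs.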