Spark Core 1.3.1源码解析及个人总结
本篇源码基于赵星对Spark 1.3.1解析进行整理。话说,我不认为我这下文源码的排版很好,不能适应的还是看总结吧。
虽然1.3.1有点老了,但对于standalone模式下的Master、Worker和划分stage的理解是很有帮助的。
=====================================================
总结:
master和worker都要创建ActorSystem来创建自身的Actor对象,master内部维护了一个保存workerinfo的hashSet和一个key为workerid,
value为workerInfo的HashMap。
master构造方法执行后会启动一个定时器,定期检查超时的worker。
worker构造方法执行后会尝试与master建立连接并发送注册消息,master收到消息后,封装worker并持久化,再给worker反馈,
worker收到反馈后,启动定时任务向master发送心跳,master收到心跳后更新心跳时间。
new SparkContext(),执行主构造器,创建SparkEnv,env里创建了ActorSystem用于通信,
然后创建TaskScheduler,创建DAGScheduler。TaskScheduler里创建了2个actor分别负责与master和executors进行通信。
ClientActor创建之前,会准备一大堆的参数,包括spark参数,java参数,executor的实现类等,
封装进AppClient,然后创建ClientActor与Master建立连接发送注册信息,Master收到后保存app的信息并反馈。
这时Master开始调度资源并启动worker,有两种调度方式:尽量打散,尽量集中,默认打散。
Master发消息给Worker,worker拼接Java命令,启动子进程。
(TaskScheduler 里会创建一个backend,backend调用start方法后,会先调用父类的start方法,父类的start方法会创建DriverActor,再执行自己的start方法创建ClientActor)
执行到Action算子会执行sparkContext里的runJob(),再调用DAGScheduler的runJob(),通过2个HashSet和1个Stack划分stage,然后提交stage。
将stage创建成多个Task,分为shuffleMapTask和ResultTask,组成taskSet,由taskScheduler通过DriverActor向Executor进行提交。
DAG的逻辑:
val parents = new HashSet[Stage]
val visited = new HashSet[RDD[_]]
val waitingForVisit = new Stack[RDD[_]]
将最后一个rdd压栈waitingForVisit,当waitingForVisit非空时while循环,waitingForVisit弹栈出的rdd判断是否在visited中,
否,则rdd添加进visited,循环rdd的父rdd,如果不是shuffleMapStage,将rdd压栈waitingForVisit,是shuffleMapStage,则再
求父stage加入parents,求父stage是调用本方法的递归过程。
=====================================================
object Master
|--def main()
|--加载配置文件并解析。
|--//创建ActorSystem和Actor
|--def startSyatemAndActor()
|--//通过AkkaUtils工具类创建ActorSystem
|--AkkaUtils.createActorSystem()
|--//定义一个函数,创建ActorSystem
|--val startService: Int => (返回值) = {doCreateActorSystem()}
|--val (actorSystem, boundPort) = doCreateActorSystem()
|--//创建ActorSystem
|--//准备Akka参数
|--val akkaConf = xxxx
|--//创建ActorSystem
|--val actorSystem = ActorSyatem(name, akkaConf)
|--return (actorSystem, boundPort)
|--//调用函数
|--Utils.startServiceOnPort(startService())
|--从一个没有被占用的端口启动服务,调用startService函数
|--//通过ActorSystem创建Actor:master
|--val actor(master) = actorSystem.actorOf(master)//创建master,master也是一个actor
|--//成员变量:保存workerInfo
|--val workers = new HashSet[WorkerInfo]
|--//成员变量:保存(workerId,workInfo)
|--val idToWorker = new HashMap[String, WorkerInfo]
|--//构造方法之后,receive方法之前
|--def preStart()
|--//启动一个定时器,定时检测超时的worker
|--context.system.scheduler.scheduler(self,checkxxx)//自己给自己发消息,发送到自己的recevice方法,启动任务
|--接收worker向master注册的消息
|--case RegisterWorker()
|--//封装worker信息
|--val worker = new WorkerInfo()
|--//持久化到zk
|--persistenceEngine.addWorker(worker)
|--//向worker反馈信息
|--sender ! RegisteredWorker(masterUrl)
|--//任务调度
|--schedule()
|--case Heartbeat(workerId)//worker发来的心跳
|--//更新上一次心跳时间
|--workerInfo.lastHeartbeat = Syatem.currentTimeMillis()
--------接SparkContext,Driver创建ClientActor向Master注册应用信息-----------
|--case RegisterApplication(description)
|--//封装消息
|--val app = createApplication(description, sender)
|--//注册消息,即存入集合
|--registerApplication(app) //方法内部就是把app放进map等
|--HashMap waitingApps(appid, app)
|--//持久化保存
|--persistenceEngine.addApplication(app)
|--//Master向ClientActor发送注册成功的消息
|--sender ! RegisterApplication(app.id, masterUrl)
|--//Master开始调度资源,将任务启动到worker上
|--//两种情况下会进行调度:
|--//1、提交任务,杀死任务
|--//2、worker新增或减少
|--schedule()
--------Master进行资源的调度-------------
|--//两种调度方式:尽量打散,尽量集中
|--def schedule()
|--//尽量打散
|--//进行一系列的判断过滤,例如worker上剩余的核数或内存是否大于app所需资源
|--//分核数的逻辑:
|--//假设需要10个核心,现有4台机器,各有8个核心
|--//创建一个长度为4的数组,角标为0,角标=(角标+1)%4,那角标只会在0~3之间循环,
|--//循环一次需要的核心-1,worker(角标)的核心+1
|--//Master发信息让worker启动executor
|--launchExecutor(usableWorkers(pos), exec)
|--//尽量集中
|--//一下子把worker剩余的资源全部分配完在分配下一个worker
|--//Master发信息让worker启动executor
|--launchExecutor(worker, exec)
--------Master发信息让worker启动executor-------------
|--def launchExecutor(worker, exec)
|--//记录worker使用资源
|--worker.addExecutor(exec)
|--//master发消息给worker,将参数传递给worker,让他启动executor
|--worker.actor ! LaunchExecutor(.....)
|--//Master向ClientActor发消息,告诉他executor已经启动了
|--exec.application.driver ! ExecutorAdded(......)
-----------------------------------------------------
object Worker
|--def main()
|--//创建ActorSystem和Actor
|--def startSyatemAndActor()
|--与Master过程相同
|--//通过ActorSystem创建Actor:worker
|--val actor(worker) = actorSystem.actorOf(worker)
|--//构造方法之后,receive方法之前
|--def preStart()
|--//与master建立连接,发送注册消息
|--registerWithMaster()
|--//尝试注册,如果失败尝试多次
|--tryRegidterAllMasters()
|--//建立连接
|--val actor(master) = context.actorSelection(masterAkkaUrl)
|--//发送注册消息
|--actor ! RegisterWorker(workId, host, port, cores, memory...)
|--//Master发给Worker注册成功的消息
|--case RegisteredWorker(masterUrl)
|--//启动定时器,定期发送心跳
|--context.system.scheduler.scheduler(self,SendHeartbeat)//自己给自己发消息,发送到自己的recevice方法,启动任务
|--case SendHeartbeat
|--//发送心跳
|--master ! Heartbeat(workid)
-------------上接:Master发信息让worker启动executor-----------
|--case LaunchExecutor(...)
|--创建ExecutorRunner,将参数放入其中,然后再通过他启动Executor
|--val manager = new ExecutorRunner(...)
|--//调用ExecutorRunner的start方法来启动executor java子进程
|--manager.start()
class ExecutorRunner
|--def start()
|--//创建线程,通过线程的start来启动java子进程
|--workerThread = new Thread(){def run(){fetchAndRunExecutor()}}
|--workerThread.start()
|--def fetchAndRunExecutor()
|--//启动子进程
|--//有具体的类,拼接java命令启动相应的类
总结:Master和Worker之间的通信:
master和worker都要创建ActorSystem来创建自身的Actor对象,master内部维护了一个保存workerinfo的hashSet和一个key为workerid,
value为workerInfo的HashMap。
master构造方法执行后会启动一个定时器,定期检查超时的worker。
worker构造方法执行后会尝试与master建立连接并发送注册消息,master收到消息后,封装worker并持久化,再给worker反馈,
worker收到反馈后,启动定时任务向master发送心跳,master收到心跳后更新心跳时间。
=====================================================
class SparkContext//即Driver端
|--//主构造器
|--def this()
|--//创建SparkEnv,包含了一个ActorSyatem
|--val env = createSparkEnv()
|--//创建ActorSyatem的方法
|--def createSparkEnv()
|--//调用 SparkEnv 的静态方法创建SparkEnv
|--SparkEnv.createDriverEnv()
|--//创建 TaskScheduler
|--var taskScheduler(schedulerBackend, taskScheduler) = SparkContext.createTaskScheduler(this, master)
|--//创建 executors 和 DriverActor 的心跳Actor
|--val heartbeatReceiver = env.actorSystem.actorOf(new HeartbeatReceiver(taskScheduler),...)
|--//创建DAGScheduler
|--var dagScheduler = new DAGScheduler(this)
|--//启动TaskSecheduler
|--taskScheduler.start()
|--//创建 TaskScheduler 方法,
|--//根据提交任务时指定的url(本地/yarn/standalone)创建相应的 TaskScheduler
|--def createTaskScheduler()
|--//spark的standalone模式
|--case SPARK_REGEX(sparkUrl)
|--//创建 TaskSchedulerImpl
|--val scheduler = new TaskSchedulerImpl(sc)
|--//创建 SparkDeploySchedulerBackend
|--val backend = new SparkDeploySchedulerBackend(scheduler, sc, masterUrls)
|--//调用 initialize 创建调度器,默认使用先进先出的调度器
|--scheduler.initialize(backend)
class TaskSchedulerImpl
|--def initialize(backend)
|--val backend = backend
|--def start()
|--//首先调用 SparkDeploySchedulerBackend 的start()
|--backend.start()
-----------★★★调用taskScheduler的submitTasks方法来提交TaskSet-------------
|--def submitTasks(taskSet)
|--//Driver发消息任务
|--backend.reviveOffers()
class SparkDeploySchedulerBackend extends CoarseGrainedSchedulerBackend
|--def start()
|--//调用父类的 start 来创建 DriverActor
|--super.start() //CoarseGrainedSchedulerBackend 的 start 方法
|--//准备一大堆的参数,例如spark的参数,java的参数,在Driver端都准备好,届时直接发给master,master拿到后发给executor执行即可
|--conf......
|--//将参数封装成Command,这是以后executor的实现类,类名也封装好了,yarn中启动的也是这个,所以不是yarnChild
|--val command = Command("org.apache.executor.CoarseGrainedExecutorBackend",conf,...)
|--//将参数封装到ApplicationDescription
|--val appDesc = new ApplicationDescription(sc.appName, command, ....)
|--创建AppClient
|--client = new AppClient(sc.actorSystem, masters, appDesc, ...)
|--//调用AppClient的start方法,创建ClientActor用于与Master通信
|--client.start()
class CoarseGrainedSchedulerBackend
|--def start()
|--//通过 actorSystem 创建 DriverActor
|--driverActor = actorSystem.actorOf(new DriverActor(..)) //等待 executor 过来通信
----------上接:Executor向Driver注册"|--//Driver建立连接,注册exectuor"------------------------
|--def receiveWithLogging
|--//Driver收到executor发来的注册消息
|--case RegisterExecutor()
|--//反馈注册成功
|--//★★★查看是否有任务需要提交
|--makeOffers()//暂时没有任务,还没有构建DAG
-----------上接:提交前面的stage-------------------------
|--def makeOffers()
|--//调用launchTask向Executor提交Task
|--launchTask(tasks)
|--def launchTask(tasks)
|--//序列化task
|--val serializedTask = ser.serialize(task)
|--//向Executor发送序列化好的Task
|--executorData.executorActor ! LaunchTask(new SerializableBuffer(serializedTask))
-----------上接:backend.reviveOffers()------------------
|--def reviveOffers()
|--driverActor ! ReviveOffers
class DriverActor
|--★★★调用makeOffers向Executor提交Task
|--case ReviveOffers => makeOffers()
class AppClient
|--def start()
|--//创建ClientActor用于与Master通信
|--actor = actorSystem.actorOf(new ClientActor)
|--//主构造器
|--def preStart()
|--//ClientActor向Master注册
|--registerWithMaster()
|--def registerWithMaster()
|--//向Master注册
|--tryRegidterAllMasters()
|--def tryRegidterAllMasters()
|--//循环所有Master,建立连接
|--val actor = context.actorSelection(masterAkkaUrl)
|--//拿到Master的引用,向master注册,备用的master不给反馈,活跃的才给
|--//参数都保存在appDescription中,例如核数,内存大小,java参数,executor实现类
|--actor ! RegisterApplication(appDescription)
|--def receiveWithLogging
|--//ClientActor收到Master发来的注册成功的消息
|--case RegisterApplication
|--//更新Master地址
|--changeMaster(masterUrl)
object SparkEnv
|--def createDriverEnv()
|//调用 create 创建 Actor
|--create
|--//创建 ActorSystem
|--val (actorSystem, boundPort) = AkkaUtils.createActorSystem()
总结:new SparkContext(),执行主构造器,创建SparkEnv,env里创建了ActorSystem用于通信,
然后创建TaskScheduler,创建AGScheduler。TaskScheduler里创建了2个actor分别负责与master
和executors进行通信。(TaskScheduler 里会创建一个backend,backend调用start方法后,会
先调用父类的start方法,父类的start方法会创建DriverActor,再执行自己的start方法创建ClientActor)
ClientActor创建之前,会准备一大堆的参数,包括spark参数,java参数,executor的实现类等,
封装进AppClient,然后创建ClientActor与Master建立连接发送注册信息,Master收到后保存app的信息并反馈。
这时Master开始调度资源并启动worker,有两种调度方式:尽量打散,尽量集中,默认打散。
Master发消息给Worker,worker拼接Java命令,启动子进程。
=====================================================
spark-submit脚本提交流程源码分析:
spark-submit脚本
|--/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"
spark-class脚本
|--1.3.1 echo "$RUNNER" -cp "$CLASSPATH" $JAVA_OPTS "$@" 1>&2 //org.apache.spark.deploy.SparkSubmit
|--1.6.1/2.0 "$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
-----------------------------------------------------
object org.apache.spark.deploy.SparkSubmit
|--def main()
|--//进行匹配
|--appArgs.action match{case SparkSubmitAction.SUBMIT => submit(appArgs)}
|--def submit()
|--def doRunMain()
|--
|--//调用doRunMain
|--doRunMain()
|--proxyUser.doAs(new xxxAction(){
override def run():Unit = {
runMain(...,childMainClass,...)
}
})
|--def runMain(...,childMainClass,...)
|--//反射自定义的spark程序 class
|--mainClass = Class.forName(childMainClass,...)
|--//调用main方法
|--val mainMethod = mainClass.getMethod("main",...)
|--mainMethod.invoke(null, childArgs.toArray)
总结: spark-submit启动了一个spark自己的submit程序,通过反射调用我们自定义的spark程序
=====================================================
Executor跟Driver通信过程源码分析
org.apache.executor.CoarseGrainedExecutorBackend
|--def main()
|--//解析一大堆参数
|--//调用run方法
|--run(....)
|--def run()
|--//在executor里创建ActorSystem
|--val fetcher = AkkaUtils.createActorSystem(...)
|--//跟Driver建立连接
|--env.actorSystem.actorOf(new CoarseGrainedExecutorBackend)
|--def preStart()
|--//Driver建立连接,注册exectuor
|--.....
|--def receiveWithLogging
|--//Driver反馈注册成功
|--case RegisteredExecutor
|--//创建Executor实例,执行业务逻辑
|--executor = new Executor(....)
Executor
|--//初始化线程池
|--val threadPool = Utils.newDaemonCachedThreadPoll()
总结:Executor启动后,创建actor向driver注册,创建Executor实例执行业务逻辑
=====================================================
任务提交流源码分析,DAScheduler执行过程
sc.textFile-->hadoopFile-->hadoopRDD-->MapParitionsRDD-->shuffleRDD
rdd.saveAsTextFile()-->MapPartitionsRDD
Driver端提交任务,执行self.context.runJob(....)
class SparkContext
|--def runJob()
|--//DAGScheduler切分Stage,转成TaskSet给TaskScheduler再提交给Executor
|--DAGScheduler.runJob(.....)
class DAGScheduler
|--//runjob切分stage
|--def runJob()
|--//调用submitJob返回一个回调器
|--val waiter = submitJob(rdd, ...)
|--//进行模式匹配
|--waiter.awaitResult() match
|--case JobSuccesded
|--case JobFailed
|--def submitJob(rdd, ...)
|--//将数据封装到事件中放入eventProcessLoop的阻塞队列中
|--eventProcessLoop.post(JobSubmitted(...))
|--val eventProcessLoop = new DAGSchedulerEventProcessLoop(this)
class DAGSchedulerEventProcessLoop extends EventLoop
|--def onReceive()
|--//提交计算任务
|--case JobSubmitted(jobId,...)
|--//调用DAGScheduler的handleJobSubmitted方法处理
|--dagScheduler.handleJobSubmitted(jobId,...)
|--//切分stage
|--def handleJobSubmitted(jobId,...)
|--★★★划分stage
|--finalStage = newStage(finalRDD, partitons.size, None, jobId, ...)
|--//开始提交Stage
|--submitStage(finalStage)
|--def submitStage(finalStage)
|--//获取父stage
|--val missing = getMissingParentStages(stage).sortBy(_.id)
|--if(missing == null){
//提交前面的stage
submitMissingTasks(stage, jobId.get)
}else{
//有父stage,递归执行本方法
for(parent <- missing){
submitStage(parent)
}
|--//放进waitingStages
|--waitingStages += stage
}
|--def submitMissingTasks(stage, jobId.get)
|--//将stage创建成多个Task,分为shuffleMapTask和ResultTask
|--new ShuffleMapTask(stage.id, taskBinary, part, locs)
|--new ResultTask(stage.id, taskBinary, part, locs, id)
|--//★★★调用taskScheduler的submitTasks方法来提交TaskSet
|--taskScheduler.submitTasks(new TaskSet(tasks.toArray, stage.id, ..., stage.jobId, properties))
|--def newStage
|--//获取父stage
|--val parentStages = getParentStages(rdd, jobId)
|--val stage = new Stage(...,parentStages,...)
|--def getParentStages
|--//使用了3个数据结构来处理父类stage
|--val parents = new HashSet[Stage]
|--val visited = new HashSet[RDD]
|--val waitingForVisit = new Stack[RDD]
|--//思路:通过递归,压栈弹栈
|--//见最后源码
|--def getMissingParentStages
|--//与getParentStages一样的数据结构找父stage
class EventLoop
|--//阻塞队列
|--val eventQueue = new LinkedBlockingDeque()
|--//不停的取事件
|--val eventThread = new Thread(name){
def run(){
while(){
val event = eventQueue.take()
onReceive(event)
}
}
}
总结:Action算子会执行sparkContext里的runJob(),再调用DAGScheduler的runJob(),
通过2个HashSet和1个Stack划分stage,然后提交stage
=======================划分stage源码==============================
/**
* Create a Stage -- either directly for use as a result stage, or as part of the (re)-creation
* of a shuffle map stage in newOrUsedStage. The stage will be associated with the provided
* jobId. Production of shuffle map stages should always use newOrUsedStage, not newStage
* directly.
*/
//★★★用于创建Stage
private def newStage(
rdd: RDD[_],
numTasks: Int,
shuffleDep: Option[ShuffleDependency[_, _, _]],
jobId: Int,
callSite: CallSite)
: Stage =
{
//★★★获取他的父Stage
val parentStages = getParentStages(rdd, jobId)
val id = nextStageId.getAndIncrement()
val stage = new Stage(id, rdd, numTasks, shuffleDep, parentStages, jobId, callSite)
stageIdToStage(id) = stage
updateJobIdStageIdMaps(jobId, stage)
stage
}
------------------------------------------------------
/**
* Get or create the list of parent stages for a given RDD. The stages will be assigned the
* provided jobId if they haven't already been created with a lower jobId.
*/
//TODO 用户获取父Stage
private def getParentStages(rdd: RDD[_], jobId: Int): List[Stage] = {
val parents = new HashSet[Stage]
val visited = new HashSet[RDD[_]]
// We are manually maintaining a stack here to prevent StackOverflowError
// caused by recursively visiting
val waitingForVisit = new Stack[RDD[_]]
def visit(r: RDD[_]) {
if (!visited(r)) {
visited += r
// Kind of ugly: need to register RDDs with the cache here since
// we can't do it in its constructor because # of partitions is unknown
for (dep <- r.dependencies) {
dep match {
case shufDep: ShuffleDependency[_, _, _] =>
//★★★把宽依赖传进去,获得父Stage
parents += getShuffleMapStage(shufDep, jobId)
case _ =>
waitingForVisit.push(dep.rdd)
}
}
}
}
waitingForVisit.push(rdd)
while (!waitingForVisit.isEmpty) {
visit(waitingForVisit.pop())
}
parents.toList
}
------------------------------------------------------
/**
* Get or create a shuffle map stage for the given shuffle dependency's map side.
* The jobId value passed in will be used if the stage doesn't already exist with
* a lower jobId (jobId always increases across jobs.)
*/
private def getShuffleMapStage(shuffleDep: ShuffleDependency[_, _, _], jobId: Int): Stage = {
shuffleToMapStage.get(shuffleDep.shuffleId) match {
case Some(stage) => stage
case None =>
// We are going to register ancestor shuffle dependencies
registerShuffleDependencies(shuffleDep, jobId)
// Then register current shuffleDep
val stage =
//★★★创建服父Stage
newOrUsedStage(
shuffleDep.rdd, shuffleDep.rdd.partitions.size, shuffleDep, jobId,
shuffleDep.rdd.creationSite)
shuffleToMapStage(shuffleDep.shuffleId) = stage
stage
}
}
------------------------------------------------------
/**
* Create a shuffle map Stage for the given RDD. The stage will also be associated with the
* provided jobId. If a stage for the shuffleId existed previously so that the shuffleId is
* present in the MapOutputTracker, then the number and location of available outputs are
* recovered from the MapOutputTracker
*/
private def newOrUsedStage(
rdd: RDD[_],
numTasks: Int,
shuffleDep: ShuffleDependency[_, _, _],
jobId: Int,
callSite: CallSite)
: Stage =
{
//★★★递归
val stage = newStage(rdd, numTasks, Some(shuffleDep), jobId, callSite)
if (mapOutputTracker.containsShuffle(shuffleDep.shuffleId)) {
val serLocs = mapOutputTracker.getSerializedMapOutputStatuses(shuffleDep.shuffleId)
val locs = MapOutputTracker.deserializeMapStatuses(serLocs)
for (i <- 0 until locs.size) {
stage.outputLocs(i) = Option(locs(i)).toList // locs(i) will be null if missing
}
stage.numAvailableOutputs = locs.count(_ != null)
} else {
// Kind of ugly: need to register RDDs with the cache and map output tracker here
// since we can't do it in the RDD constructor because # of partitions is unknown
logInfo("Registering RDD " + rdd.id + " (" + rdd.getCreationSite + ")")
mapOutputTracker.registerShuffle(shuffleDep.shuffleId, rdd.partitions.size)
}
stage
}