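Apache Spark Source Code Analysis: How the Master and Worker Nodes Cooperate
Spark is an efficient distributed computing framework, but learning it in depth requires reading its source code. Doing so not only gives a better understanding of how Spark works, it also improves your ability to troubleshoot a cluster. This article focuses on the startup process of the Spark Master and of the Worker.
Master Startup
The Master is started with the start-master.sh shell script. The script begins by calling: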
start-master.sh -> spark-daemon.sh start org.apache.spark.deploy.master.Master
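As we can see, the script launches the org.apache.spark.deploy.master.Master class. Startup parameters are passed in at launch; as the main method shows, they include the host, the port, and the web UI port.
The main method of the Master object looks like this: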
    private[spark] object Master extends Logging {
      val systemName = "sparkMaster"
      private val actorName = "Master"

      // Master startup entry point
      def main(argStrings: Array[String]) {
        SignalLogger.register(log)
        // Create the SparkConf
        val conf = new SparkConf
        // Parse the arguments and save them into the SparkConf
        val args = new MasterArguments(argStrings, conf)
        // Create the ActorSystem and the Actor
        val (actorSystem, _, _, _) = startSystemAndActor(args.host, args.port, args.webUiPort, conf)
        // Wait for termination
        actorSystem.awaitTermination()
      }
The part to focus on here is startSystemAndActor:
      /**
       * Start the Master and return a four tuple of:
       *   (1) The Master actor system
       *   (2) The bound port
       *   (3) The web UI bound port
       *   (4) The REST server bound port, if any
       */
      def startSystemAndActor(
          host: String,
          port: Int,
          webUiPort: Int,
          conf: SparkConf): (ActorSystem, Int, Int, Option[Int]) = {
        val securityMgr = new SecurityManager(conf)
        // Create the ActorSystem with AkkaUtils
        val (actorSystem, boundPort) = AkkaUtils.createActorSystem(systemName, host, port,
          conf = conf, securityManager = securityMgr)
        val actor = actorSystem.actorOf(
          Props(classOf[Master], host, boundPort, webUiPort, securityMgr, conf), actorName)
        ....
      }
    }
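Spark's low-level communication is implemented with Akka.
The ActorSystem is created first, and the Actor is then created through it. When the ActorSystem creates the Actor, the Master's constructor runs first, followed by the Actor lifecycle methods.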
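To make that ordering concrete, here is a minimal sketch in plain Scala that does not use Akka at all. ToyActor, ToyActorSystem, ToyMaster, and LifecycleDemo are hypothetical names invented for this illustration; the sketch only mimics the order constructor -> preStart -> receive described above and says nothing about how Akka implements it.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for an actor: one lifecycle hook plus a message handler.
abstract class ToyActor {
  def preStart(): Unit = ()
  def receive: PartialFunction[Any, Unit]
}

// Hypothetical stand-in for an actor system (not the Akka API).
object ToyActorSystem {
  def run(create: () => ToyActor, messages: Seq[Any]): Unit = {
    val actor = create()            // 1. the primary constructor body runs here
    actor.preStart()                // 2. then the lifecycle hook
    val handler = actor.receive     // 3. then messages are handed to receive
    messages.foreach { m => if (handler.isDefinedAt(m)) handler(m) }
  }
}

// Records each stage so the order can be inspected afterwards.
class ToyMaster(events: ArrayBuffer[String]) extends ToyActor {
  events += "constructor"
  override def preStart(): Unit = { events += "preStart" }
  def receive: PartialFunction[Any, Unit] = {
    case msg: String => events += s"receive($msg)"
  }
}

object LifecycleDemo {
  def trace(): List[String] = {
    val events = ArrayBuffer.empty[String]
    ToyActorSystem.run(() => new ToyMaster(events), Seq("CheckForWorkerTimeOut"))
    events.toList
  }
}
```

Calling LifecycleDemo.trace() returns the recorded stages in the order they happened, which is a quick way to check the sequence when reading the Master code.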
Running the Master's constructor initializes a number of fields:
    private[spark] class Master(
        host: String,
        port: Int,
        webUiPort: Int,
        val securityMgr: SecurityManager,
        val conf: SparkConf)
      extends Actor with ActorLogReceive with Logging with LeaderElectable {
      // Primary constructor

      // Enable the timer functionality
      import context.dispatcher // to use Akka's scheduler.schedule()

      val hadoopConf = SparkHadoopUtil.get.newConfiguration(conf)

      def createDateFormat = new SimpleDateFormat("yyyyMMddHHmmss") // For application IDs

      // Worker timeout
      val WORKER_TIMEOUT = conf.getLong("spark.worker.timeout", 60) * 1000
      val RETAINED_APPLICATIONS = conf.getInt("spark.deploy.retainedApplications", 200)
      val RETAINED_DRIVERS = conf.getInt("spark.deploy.retainedDrivers", 200)
      val REAPER_ITERATIONS = conf.getInt("spark.dead.worker.persistence", 15)
      val RECOVERY_MODE = conf.get("spark.deploy.recoveryMode", "NONE")

      // A HashSet that holds the WorkerInfo objects
      val workers = new HashSet[WorkerInfo]
      // A HashMap from worker ID to WorkerInfo
      val idToWorker = new HashMap[String, WorkerInfo]
      val addressToWorker = new HashMap[Address, WorkerInfo]

      // A HashSet that holds the applications submitted by the client (SparkSubmit)
      val apps = new HashSet[ApplicationInfo]
      // A HashMap from application ID to ApplicationInfo
      val idToApp = new HashMap[String, ApplicationInfo]
      val actorToApp = new HashMap[ActorRef, ApplicationInfo]
      val addressToApp = new HashMap[Address, ApplicationInfo]
      // Applications waiting to be scheduled
      val waitingApps = new ArrayBuffer[ApplicationInfo]
      val completedApps = new ArrayBuffer[ApplicationInfo]
      var nextAppNumber = 0
      val appIdToUI = new HashMap[String, SparkUI]

      // Holds the DriverInfo objects
      val drivers = new HashSet[DriverInfo]
      val completedDrivers = new ArrayBuffer[DriverInfo]
      val waitingDrivers = new ArrayBuffer[DriverInfo] // Drivers currently spooled for scheduling
Once the primary constructor has finished, preStart runs, followed by the receive method.
    // Start a timer that periodically checks for timed-out workers
    // (see the CheckForWorkerTimeOut message)
    context.system.scheduler.schedule(0 millis, WORKER_TIMEOUT millis, self, CheckForWorkerTimeOut)
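In the preStart method, a timer is created to check for Worker timeouts. The interval is WORKER_TIMEOUT = conf.getLong("spark.worker.timeout", 60) * 1000, so the configured value is in seconds, it is converted to milliseconds, and the default is 60 seconds.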
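The unit conversion can be checked with a small standalone sketch. It uses a plain Map in place of SparkConf; WorkerTimeoutConf and its getLong helper are hypothetical names for this illustration, and only the key "spark.worker.timeout" and the default of 60 come from the Master code above.

```scala
object WorkerTimeoutConf {
  // Hypothetical stand-in for SparkConf.getLong: parse the value if the key is set,
  // otherwise fall back to the default.
  def getLong(conf: Map[String, String], key: String, default: Long): Long =
    conf.get(key).map(_.trim.toLong).getOrElse(default)

  // Mirrors the expression in the Master: seconds from the configuration,
  // converted to milliseconds for the scheduler.
  def workerTimeoutMillis(conf: Map[String, String]): Long =
    getLong(conf, "spark.worker.timeout", 60L) * 1000L
}
```

With an empty configuration the result is 60000 milliseconds; setting "spark.worker.timeout" to "120" gives 120000.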
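As we have seen, Master initialization mainly consists of constructing a Master Actor that waits for messages, initializing the collections that hold Worker information, and starting a timer that checks for Worker timeouts.
Master startup sequence diagram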
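Worker Startup
Workers are started by the slaves.sh shell script, which reads the slaves file and starts the remote Workers over ssh:
spark-daemon.sh start org.apache.spark.deploy.worker.Worker
The script launches the org.apache.spark.deploy.worker.Worker class.
The Worker source code looks like this: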
    private[spark] object Worker extends Logging {
      // Worker startup entry point
      def main(argStrings: Array[String]) {
        SignalLogger.register(log)
        val conf = new SparkConf
        val args = new WorkerArguments(argStrings, conf)
        // Create the ActorSystem and the Actor
        val (actorSystem, _) = startSystemAndActor(args.host, args.port, args.webUiPort, args.cores,
          args.memory, args.masters, args.workDir)
        actorSystem.awaitTermination()
      }
The most important part here is the Worker's startSystemAndActor:
    def startSystemAndActor(
        host: String,
        port: Int,
        webUiPort: Int,
        cores: Int,
        memory: Int,
        masterUrls: Array[String],
        workDir: String,
        workerNumber: Option[Int] = None,
        conf: SparkConf = new SparkConf): (ActorSystem, Int) = {
      // The LocalSparkCluster runs multiple local sparkWorkerX actor systems
      val systemName = "sparkWorker" + workerNumber.map(_.toString).getOrElse("")
      val actorName = "Worker"
      val securityMgr = new SecurityManager(conf)
      // Create the ActorSystem through AkkaUtils
      val (actorSystem, boundPort) = AkkaUtils.createActorSystem(systemName, host, port,
        conf = conf, securityManager = securityMgr)
      val masterAkkaUrls = masterUrls.map(Master.toAkkaUrl(_, AkkaUtils.protocol(actorSystem)))
      // Create the Worker actor through actorSystem.actorOf:
      // constructor -> preStart -> receive
      actorSystem.actorOf(Props(classOf[Worker], host, boundPort, webUiPort, cores, memory,
        masterAkkaUrls, systemName, actorName, workDir, conf, securityMgr), name = actorName)
      (actorSystem, boundPort)
    }
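Here the Worker likewise constructs an Actor object of its own, which completes the Worker's startup initialization.
Worker and Master Communication
The Worker's preStart method is invoked as part of the Actor lifecycle: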
    override def preStart() {
      assert(!registered)
      logInfo("Starting Spark worker %s:%d with %d cores, %s RAM".format(
        host, port, cores, Utils.megabytesToString(memory)))
      logInfo(s"Running Spark version ${org.apache.spark.SPARK_VERSION}")
      logInfo("Spark home: " + sparkHome)
      createWorkDir()
      context.system.eventStream.subscribe(self, classOf[RemotingLifecycleEvent])
      shuffleService.startIfEnabled()
      webUi = new WorkerWebUI(this, workDir, webUiPort)
      webUi.bind()
      // The Worker registers with the Master
      registerWithMaster()
      ....
    }
Here preStart calls registerWithMaster, which begins registering the Worker with the Master.
    def registerWithMaster() {
      // DisassociatedEvent may be triggered multiple times, so don't attempt registration
      // if there are outstanding registration attempts scheduled.
      registrationRetryTimer match {
        case None =>
          registered = false
          // Start registration
          tryRegisterAllMasters()
          ....
      }
    }
The tryRegisterAllMasters method is called from the None branch of the match in registerWithMaster:
    private def tryRegisterAllMasters() {
      // Iterate over the master addresses
      for (masterAkkaUrl <- masterAkkaUrls) {
        logInfo("Connecting to master " + masterAkkaUrl + "...")
        // Look up the Master actor from the Worker
        val actor = context.actorSelection(masterAkkaUrl)
        // Send the registration information to the Master
        actor ! RegisterWorker(workerId, host, port, cores, memory, webUi.boundPort, publicAddress)
      }
    }
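After obtaining a reference to the Master through masterAkkaUrl, the Worker sends the Master a RegisterWorker(workerId, host, port, cores, memory, webUi.boundPort, publicAddress) message. The message carries the Worker's ID, host, port, CPU cores, memory, and so on.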
    override def receiveWithLogging = {
      ......
      // Accept the registration information from the Worker
      case RegisterWorker(id, workerHost, workerPort, cores, memory, workerUiPort, publicAddress) => {
        logInfo("Registering worker %s:%d with %d cores, %s RAM".format(
          workerHost, workerPort, cores, Utils.megabytesToString(memory)))
        if (state == RecoveryState.STANDBY) {
          // ignore, don't send response
        } else if (idToWorker.contains(id)) {
          // The worker ID is already registered: tell the Worker that registration failed
          sender ! RegisterWorkerFailed("Duplicate worker ID")
        } else {
          // Not registered yet: wrap the registration information from the Worker in a WorkerInfo
          val worker = new WorkerInfo(id, workerHost, workerPort, cores, memory,
            sender, workerUiPort, publicAddress)
          if (registerWorker(worker)) {
            // Record the Worker's information with the persistence engine
            persistenceEngine.addWorker(worker)
            // Reply to the Worker to report that registration succeeded
            sender ! RegisteredWorker(masterUrl, masterWebUiUrl)
            schedule()
          } else {
            val workerAddress = worker.actor.path.address
            logWarning("Worker registration failed. Attempted to re-register worker at same " +
              "address: " + workerAddress)
            sender ! RegisterWorkerFailed("Attempted to re-register worker at same address: "
              + workerAddress)
          }
        }
      }
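The branching in the RegisterWorker handler above can be condensed into a pure function, which makes the four possible outcomes easier to see. This is only a sketch in plain Scala without Akka: RegistrationSketch, MasterState, and the Reply types are hypothetical names, and the address check is an assumption about what registerWorker does, inferred from the warning text, since its body is not shown in this article.

```scala
object RegistrationSketch {
  sealed trait Reply
  case object NoReply extends Reply                        // a standby Master stays silent
  final case class Registered(masterUrl: String) extends Reply
  final case class Failed(reason: String) extends Reply

  // Hypothetical simplification of the Master's idToWorker and addressToWorker maps.
  final case class MasterState(
      standby: Boolean,
      knownIds: Set[String],
      knownAddresses: Set[String])

  // Returns the updated state together with the reply sent back to the Worker.
  def onRegisterWorker(
      state: MasterState,
      id: String,
      address: String,
      masterUrl: String): (MasterState, Reply) =
    if (state.standby) {
      (state, NoReply)
    } else if (state.knownIds.contains(id)) {
      (state, Failed("Duplicate worker ID"))
    } else if (state.knownAddresses.contains(address)) {
      // Assumed behaviour of registerWorker returning false.
      (state, Failed("Attempted to re-register worker at same address: " + address))
    } else {
      val updated = state.copy(
        knownIds = state.knownIds + id,
        knownAddresses = state.knownAddresses + address)
      (updated, Registered(masterUrl))
    }
}
```

Keeping the decision separate from the actor makes it easy to exercise each branch without starting an actor system.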
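This is the core of the exchange: the Master's receiveWithLogging handles incoming messages. When the Master receives a RegisterWorker message, it wraps the parameters in a WorkerInfo object, adds it to its collections, and records it in the persistence engine. It then replies to the Worker with sender ! RegisteredWorker(masterUrl, masterWebUiUrl).
Next, look at the Worker's receiveWithLogging: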
    override def receiveWithLogging = {
      case RegisteredWorker(masterUrl, masterWebUiUrl) =>
        logInfo("Successfully registered with master " + masterUrl)
        registered = true
        changeMaster(masterUrl, masterWebUiUrl)
        // Start the timer and send a heartbeat at regular intervals
        context.system.scheduler.schedule(0 millis, HEARTBEAT_MILLIS millis, self, SendHeartbeat)
        if (CLEANUP_ENABLED) {
          logInfo(s"Worker cleanup enabled; old application directories will be deleted in: $workDir")
          context.system.scheduler.schedule(CLEANUP_INTERVAL_MILLIS millis,
            CLEANUP_INTERVAL_MILLIS millis, self, WorkDirCleanup)
        }
When the Worker receives the Master's confirmation that registration succeeded, it starts a timer and sends heartbeats at regular intervals.
    case SendHeartbeat =>
      // The Worker sends heartbeats to report that it is still alive
      if (connected) { master ! Heartbeat(workerId) }
The receiveWithLogging method on the Master receives the heartbeat message:
    override def receiveWithLogging = {
      ....
      case Heartbeat(workerId) => {
        idToWorker.get(workerId) match {
          case Some(workerInfo) =>
            // Update the last heartbeat time
            workerInfo.lastHeartbeat = System.currentTimeMillis()
          .....
        }
      }
    }
The handler records the time of the latest heartbeat by setting workerInfo.lastHeartbeat = System.currentTimeMillis().
The Master's scheduled task keeps sending itself the internal CheckForWorkerTimeOut message and scans the set of Worker records each time. A Worker whose last heartbeat is older than the timeout (60 seconds by default) is removed.
    // Check for timed-out Workers
    case CheckForWorkerTimeOut => {
      timeOutDeadWorkers()
    }
The timeOutDeadWorkers method:
    def timeOutDeadWorkers() {
      // Copy the workers into an array so we don't modify the hashset while iterating through it
      val currentTime = System.currentTimeMillis()
      val toRemove = workers.filter(_.lastHeartbeat < currentTime - WORKER_TIMEOUT).toArray
      for (worker <- toRemove) {
        if (worker.state != WorkerState.DEAD) {
          logWarning("Removing %s because we got no heartbeat in %d seconds".format(
            worker.id, WORKER_TIMEOUT/1000))
          removeWorker(worker)
        } else {
          if (worker.lastHeartbeat < currentTime - ((REAPER_ITERATIONS + 1) * WORKER_TIMEOUT)) {
            workers -= worker // we've seen this DEAD worker in the UI, etc. for long enough; cull it
          }
        }
      }
    }
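If a Worker's last heartbeat time is earlier than the current time minus the timeout, the Worker is considered timed out and removeWorker is called to remove its information. A Worker that is already in the DEAD state stays in the workers set until (REAPER_ITERATIONS + 1) timeout periods have passed since its last heartbeat, and is then dropped from the set.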
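The two thresholds in timeOutDeadWorkers are easier to follow as a pure decision function. The sketch below is plain Scala with hypothetical names (TimeoutSketch, WorkerRecord, and the Action values); it only restates the comparisons from the method above and makes no claim about what removeWorker does internally.

```scala
object TimeoutSketch {
  sealed trait State
  case object Alive extends State
  case object Dead extends State

  // Hypothetical, reduced version of WorkerInfo.
  final case class WorkerRecord(id: String, state: State, lastHeartbeat: Long)

  sealed trait Action
  case object Keep extends Action
  case object RemoveWorker extends Action   // stands for the removeWorker(worker) call
  case object Cull extends Action           // stands for workers -= worker

  def decide(w: WorkerRecord, now: Long, timeoutMs: Long, reaperIterations: Int): Action =
    if (w.lastHeartbeat >= now - timeoutMs) Keep            // heartbeat is recent enough
    else if (w.state != Dead) RemoveWorker                  // first time the timeout is noticed
    else if (w.lastHeartbeat < now - (reaperIterations + 1) * timeoutMs) Cull
    else Keep                                               // DEAD, but still kept for display
}
```

With the defaults from the Master (a 60000 ms timeout and 15 reaper iterations), a DEAD Worker is culled only once its last heartbeat is more than 960000 ms, that is 16 minutes, in the past.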
If the heartbeat comes from a worker ID that is not in idToWorker, the Heartbeat handler takes the None branch:
    case None =>
      if (workers.map(_.id).contains(workerId)) {
        logWarning(s"Got heartbeat from unregistered worker $workerId." +
          " Asking it to re-register.")
        // Ask the Worker to register again
        sender ! ReconnectWorker(masterUrl)
      } else {
        logWarning(s"Got heartbeat from unregistered worker $workerId." +
          " This worker was never registered, so ignoring the heartbeat.")
      }
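Putting the Some and None branches together, the Master's reaction to a heartbeat has three possible outcomes. The following is a sketch in plain Scala with hypothetical names (HeartbeatSketch, onHeartbeat, and the Outcome values); the two collections are simplified stand-ins for idToWorker and the workers set.

```scala
object HeartbeatSketch {
  sealed trait Outcome
  final case class Updated(lastHeartbeat: Long) extends Outcome  // known worker: refresh timestamp
  case object AskToReconnect extends Outcome                     // seen before, no longer registered
  case object Ignored extends Outcome                            // never registered

  def onHeartbeat(
      registered: Map[String, Long],   // stand-in for idToWorker (ID -> last heartbeat)
      everSeen: Set[String],           // stand-in for the IDs in the workers set
      workerId: String,
      now: Long): Outcome =
    registered.get(workerId) match {
      case Some(_) => Updated(now)
      case None if everSeen.contains(workerId) => AskToReconnect
      case None => Ignored
    }
}
```

The middle case is the one that lets a Worker recover after the Master has timed it out: its record is still in the workers set, so the Master asks it to register again instead of dropping the heartbeat.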
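Worker and Master sequence diagram
That covers the general communication flow once the Master and Worker have started. The next question is how executor processes are launched on the cluster to run computation tasks.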