Big Data Platform Study Notes (3) -- Spark Core Programming
Spark Architecture Principles
Architecture diagram:
Creating RDDs
There are three ways to create an RDD (see the sketch after this list):
1. From a collection in the program. This is mainly used for testing: before actually deploying to a cluster, you can build test data from a collection to verify the rest of the Spark application's flow.
2. From a local file. This is mainly used for temporarily processing files on the local machine that hold large amounts of data.
3. From an HDFS file. This is mainly used for offline batch processing of big data stored on HDFS.
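A minimal sketch of the three ways, written spark-shell style; the SparkConf settings and file paths are placeholders, not taken from the original text:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("RDDExamples").setMaster("local[*]")
val sc = new SparkContext(conf)

// 1. From an in-memory collection (mainly for testing)
val numbersRDD = sc.parallelize(Seq(1, 2, 3, 4, 5))

// 2. From a local file (hypothetical path), for ad-hoc local processing
val localFileRDD = sc.textFile("file:///tmp/data.txt")

// 3. From an HDFS file (hypothetical path), for offline batch processing
val hdfsFileRDD = sc.textFile("hdfs://namenode:9000/user/data/input.txt")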
Operating on RDDs
Transformations:
Actions:
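A small sketch contrasting the two kinds of operations, reusing the sc from the sketch above; the data is made up:

val lines = sc.parallelize(Seq("hello spark", "hello world"))

// Transformations are lazy: they only build up the RDD lineage.
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val counts = pairs.reduceByKey(_ + _)

// Actions trigger the actual job execution.
counts.collect().foreach(println)
println(counts.count())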
RDD Persistence
Why persist RDDs?
What are the RDD persistence strategies?
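A minimal usage sketch of persistence; the storage levels shown are only the commonly used ones, and the path is a placeholder:

import org.apache.spark.storage.StorageLevel

val errors = sc.textFile("hdfs://namenode:9000/user/data/access.log") // hypothetical path
  .filter(_.contains("ERROR"))

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).
errors.cache()

// Or choose a level explicitly, e.g. spill to disk when memory is insufficient:
// errors.persist(StorageLevel.MEMORY_AND_DISK_SER)

// The first action computes and caches the RDD; later actions reuse the cached data.
println(errors.count())
println(errors.count())

// Release the cached data when it is no longer needed.
errors.unpersist()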
In-Depth Analysis of Spark Internals
Diagram:
Wide Dependencies and Narrow Dependencies
Two Submission Modes Based on YARN
SparkContext Internals
Master Internals and Source Code Analysis
The resource scheduling mechanism is essential (there are two resource scheduling mechanisms).
Master failover (active/standby switchover) diagram:
Registration mechanism diagram:
Master resource scheduling algorithm:
The Master performs resource scheduling through its schedule() method, which tells the Workers to launch Executors.
1. The schedule() method
2. startExecutorsOnWorkers starts Executor processes on the Workers
3. scheduleExecutorsOnWorkers schedules resources on each Worker: it decides whether a Worker can host one or more Executors and, if so, assigns the CPU cores each Executor needs (a simplified sketch follows this list)
4. allocateWorkerResourceToExecutors allocates the concrete resources on a Worker
5. launchDriver launches the Driver
6. launchExecutor launches the Executor
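A simplified sketch of the core idea behind scheduleExecutorsOnWorkers, not the actual Spark source: the spreadOut flag decides whether cores are assigned round-robin across Workers or packed onto as few Workers as possible. Memory limits and per-executor core counts are deliberately ignored here.

// Assumption: one core is assigned per step; returns how many cores each worker gets.
def scheduleCoresSketch(
    coresNeeded: Int,
    freeCores: Array[Int],
    spreadOut: Boolean): Array[Int] = {
  val assigned = Array.fill(freeCores.length)(0)
  var remaining = coresNeeded
  var pos = 0
  while (remaining > 0 && assigned.zip(freeCores).exists { case (a, f) => a < f }) {
    if (assigned(pos) < freeCores(pos)) {
      assigned(pos) += 1
      remaining -= 1
      if (spreadOut) pos = (pos + 1) % freeCores.length  // spread: move to the next worker
    } else {
      pos = (pos + 1) % freeCores.length                 // this worker is full, move on
    }
  }
  assigned
}

// Example: 6 cores requested, three workers with 4 free cores each.
// spreadOut = true  -> Array(2, 2, 2)   (round-robin across workers)
// spreadOut = false -> Array(4, 2, 0)   (fill one worker before the next)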
Worker Internals:
Job Trigger Flow:
Ultimately this calls the runJob method of the DAGScheduler that was created when the SparkContext was initialized.
DAGScheduler Internals:
Note: reduceByKey combines all values that share the same key, so that in the end only one record per key remains; it operates on RDDs of (key, value) pairs. A small example follows.
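A small example of reduceByKey on a (key, value) RDD; the data is made up:

val sales = sc.parallelize(Seq(("apple", 2), ("banana", 3), ("apple", 5)))
val totals = sales.reduceByKey(_ + _)
totals.collect().foreach(println)
// (apple,7)
// (banana,3)   -- order of the output may vary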
TaskScheduler Internals
Note: the locality levels are:
PROCESS_LOCAL: process-local; the RDD partition and the task are in the same Executor.
NODE_LOCAL: node-local; the RDD partition and the task are not in the same Executor, but are on the same node.
NO_PREF: no locality preference.
RACK_LOCAL: rack-local; the RDD partition and the task are at least in the same rack.
ANY: any locality level. (The related configuration keys are sketched after this list.)
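Falling back from a better level to a worse one is time-based. The sketch below lists the relevant configuration keys; the values are examples, not recommendations:

import org.apache.spark.SparkConf

val localityConf = new SparkConf()
  .setAppName("LocalityExample")
  .set("spark.locality.wait", "3s")          // overall wait before degrading a level
  .set("spark.locality.wait.process", "3s")  // wait for PROCESS_LOCAL
  .set("spark.locality.wait.node", "3s")     // wait for NODE_LOCAL
  .set("spark.locality.wait.rack", "3s")     // wait for RACK_LOCAL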
Executor Internals and Source Code Analysis
1. The Executor that a Worker launches for an application is actually a CoarseGrainedExecutorBackend process.
2. The backend obtains a reference to the driver's endpoint (actor) and sends it a RegisterExecutor message.
3. After the driver has registered the Executor, it replies with a RegisteredExecutor message (at this point CoarseGrainedExecutorBackend creates the Executor object, through which most of the functionality is implemented).
4. To start a task, the backend deserializes it and calls the Executor's launchTask() to launch the task.
5. For every task a TaskRunner is created and put into an in-memory cache (the task is wrapped in a TaskRunner thread); the TaskRunner is then handed to a thread pool, which queues work automatically (see the simplified sketch below).
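A simplified, self-contained sketch of the idea in step 5, not the actual Spark source: each task is wrapped in a Runnable, cached by taskId, and submitted to a thread pool that queues tasks when all threads are busy.

import java.util.concurrent.{ConcurrentHashMap, Executors}

class ExecutorSketch {
  private val threadPool = Executors.newCachedThreadPool()
  private val runningTasks = new ConcurrentHashMap[Long, Runnable]()

  def launchTask(taskId: Long, serializedTask: Array[Byte]): Unit = {
    val taskRunner = new Runnable {
      override def run(): Unit = {
        // deserialize and run the task, then report its status back to the driver
      }
    }
    runningTasks.put(taskId, taskRunner)   // keep track of running tasks
    threadPool.execute(taskRunner)         // the pool queues work automatically
  }
}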
Task Internals
1. A task is wrapped in a thread (TaskRunner).
2. run() is called: it deserializes the serialized task data, then copies over the required files, resources, and JAR files via network communication by calling updateDependencies(), and finally deserializes the Task object itself.
override def run(): Unit = {
  val threadMXBean = ManagementFactory.getThreadMXBean
  val taskMemoryManager = new TaskMemoryManager(env.memoryManager, taskId)
  val deserializeStartTime = System.currentTimeMillis()
  val deserializeStartCpuTime = if (threadMXBean.isCurrentThreadCpuTimeSupported) {
    threadMXBean.getCurrentThreadCpuTime
  } else 0L
  Thread.currentThread.setContextClassLoader(replClassLoader)
  val ser = env.closureSerializer.newInstance()
  logInfo(s"Running $taskName (TID $taskId)")
  execBackend.statusUpdate(taskId, TaskState.RUNNING, EMPTY_BYTE_BUFFER)
  var taskStart: Long = 0
  var taskStartCpu: Long = 0
  startGCTime = computeTotalGcTime()

  try {
    // Deserialize the serialized task data
    val (taskFiles, taskJars, taskProps, taskBytes) =
      Task.deserializeWithDependencies(serializedTask)

    // Must be set before updateDependencies() is called, in case fetching dependencies
    // requires access to properties contained within (e.g. for access control).
    Executor.taskDeserializationProps.set(taskProps)
    // Copy the required files, resources and jars over the network
    updateDependencies(taskFiles, taskJars)
    // Deserialize the whole Task object
    task = ser.deserialize[Task[Any]](taskBytes, Thread.currentThread.getContextClassLoader)
    task.localProperties = taskProps
    task.setTaskMemoryManager(taskMemoryManager)

    // If this task has been killed before we deserialized it, let's quit now. Otherwise,
    // continue executing the task.
    if (killed) {
      // Throw an exception rather than returning, because returning within a try{} block
      // causes a NonLocalReturnControl exception to be thrown. The NonLocalReturnControl
      // exception will be caught by the catch block, leading to an incorrect ExceptionFailure
      // for the task.
      throw new TaskKilledException
    }

    logDebug("Task " + taskId + "'s epoch is " + task.epoch)
    env.mapOutputTracker.updateEpoch(task.epoch)

    // Run the actual task and measure its runtime.
    // Record the task start time
    taskStart = System.currentTimeMillis()
    taskStartCpu = if (threadMXBean.isCurrentThreadCpuTimeSupported) {
      threadMXBean.getCurrentThreadCpuTime
    } else 0L
    var threwException = true
    val value = try {
      // Call the task's run() method
      val res = task.run(
        taskAttemptId = taskId,
        attemptNumber = attemptNumber,
        metricsSystem = env.metricsSystem)
      threwException = false
      res
    } finally {
      val releasedLocks = env.blockManager.releaseAllLocksForTask(taskId)
      val freedMemory = taskMemoryManager.cleanUpAllAllocatedMemory()

      if (freedMemory > 0 && !threwException) {
        val errMsg = s"Managed memory leak detected; size = $freedMemory bytes, TID = $taskId"
        if (conf.getBoolean("spark.unsafe.exceptionOnMemoryLeak", false)) {
          throw new SparkException(errMsg)
        } else {
          logWarning(errMsg)
        }
      }

      if (releasedLocks.nonEmpty && !threwException) {
        val errMsg =
          s"${releasedLocks.size} block locks were not released by TID = $taskId:\n" +
            releasedLocks.mkString("[", ", ", "]")
        if (conf.getBoolean("spark.storage.exceptionOnPinLeak", false)) {
          throw new SparkException(errMsg)
        } else {
          logWarning(errMsg)
        }
      }
    }
    val taskFinish = System.currentTimeMillis()
    val taskFinishCpu = if (threadMXBean.isCurrentThreadCpuTimeSupported) {
      threadMXBean.getCurrentThreadCpuTime
    } else 0L

    // If the task has been killed, let's fail it.
    if (task.killed) {
      throw new TaskKilledException
    }
    // ... (the rest of the method is omitted in this excerpt)
3. updateDependencies() first obtains the Hadoop configuration and then uses Java's synchronized keyword, because tasks run concurrently as Java threads inside a single CoarseGrainedExecutorBackend process; when they access shared resources while executing business logic, multi-threaded access has to be made safe.
private def updateDependencies(newFiles: HashMap[String, Long], newJars: HashMap[String, Long]) {
  // Get the Hadoop configuration
  lazy val hadoopConf = SparkHadoopUtil.get.newConfiguration(conf)
  // Synchronized because tasks run concurrently as Java threads inside one
  // CoarseGrainedExecutorBackend process, so access to this shared state must be thread safe
  synchronized {
    // Fetch missing dependencies
    // Iterate over the files that need to be fetched
    for ((name, timestamp) <- newFiles if currentFiles.getOrElse(name, -1L) < timestamp) {
      logInfo("Fetching " + name + " with timestamp " + timestamp)
      // Fetch file with useCache mode, close cache for local mode.
      // Utils.fetchFile() pulls the file from the remote side over the network
      Utils.fetchFile(name, new File(SparkFiles.getRootDirectory()), conf,
        env.securityManager, hadoopConf, timestamp, useCache = !isLocal)
      currentFiles(name) = timestamp
    }
    // Iterate over the jars that need to be fetched
    for ((name, timestamp) <- newJars) {
      val localName = name.split("/").last
      val currentTimeStamp = currentJars.get(name)
        .orElse(currentJars.get(localName))
        .getOrElse(-1L)
      if (currentTimeStamp < timestamp) {
        logInfo("Fetching " + name + " with timestamp " + timestamp)
        // Fetch file with useCache mode, close cache for local mode.
        Utils.fetchFile(name, new File(SparkFiles.getRootDirectory()), conf,
          env.securityManager, hadoopConf, timestamp, useCache = !isLocal)
        currentJars(name) = timestamp
        // Add it to our class loader
        val url = new File(SparkFiles.getRootDirectory(), localName).toURI.toURL
        if (!urlClassLoader.getURLs().contains(url)) {
          logInfo("Adding " + url + " to class loader")
          urlClassLoader.addURL(url)
        }
      }
    }
  }
}
4. The Task's run method creates a TaskContext that records global information about the task, including how many times it has been retried, which stage it belongs to, and which RDD partition it processes; it then calls the abstract method runTask().
final def run(
    taskAttemptId: Long,
    attemptNumber: Int,
    metricsSystem: MetricsSystem): T = {
  SparkEnv.get.blockManager.registerTask(taskAttemptId)
  // Create a TaskContext that records global information about the task,
  // including how many times it has been retried, which stage it belongs to,
  // and which RDD partition it processes
  context = new TaskContextImpl(
    stageId,
    partitionId,
    taskAttemptId,
    attemptNumber,
    taskMemoryManager,
    localProperties,
    metricsSystem,
    metrics)
  TaskContext.setTaskContext(context)
  taskThread = Thread.currentThread()

  if (_killed) {
    kill(interruptThread = false)
  }

  new CallerContext("TASK", appId, appAttemptId, jobId, Option(stageId), Option(stageAttemptId),
    Option(taskAttemptId), Option(attemptNumber)).setCurrentContext()

  try {
    // Call the abstract method runTask()
    runTask(context)
  } catch {
    case e: Throwable =>
      // Catch all errors; run task failure callbacks, and rethrow the exception.
      try {
        context.markTaskFailed(e)
      } catch {
        case t: Throwable =>
          e.addSuppressed(t)
      }
      throw e
  } finally {
    // Call the task completion callbacks.
    context.markTaskCompleted()
    try {
      Utils.tryLogNonFatalError {
        // Release memory used by this thread for unrolling blocks
        SparkEnv.get.blockManager.memoryStore.releaseUnrollMemoryForThisTask(MemoryMode.ON_HEAP)
        SparkEnv.get.blockManager.memoryStore.releaseUnrollMemoryForThisTask(MemoryMode.OFF_HEAP)
        // Notify any tasks waiting for execution memory to be freed to wake up and try to
        // acquire memory again. This makes impossible the scenario where a task sleeps forever
        // because there are no other tasks left to notify it. Since this is safe to do but may
        // not be strictly necessary, we should revisit whether we can remove this in the future.
        val memoryManager = SparkEnv.get.memoryManager
        memoryManager.synchronized { memoryManager.notifyAll() }
      }
    } finally {
      TaskContext.unset()
    }
  }
}
5. runTask() is abstract and is implemented by Task's subclasses, ShuffleMapTask and ResultTask.
def runTask(context: TaskContext): T
6. A ShuffleMapTask splits the elements of an RDD into multiple buckets, based on the partitioner specified in its ShuffleDependency.
private[spark] class ShuffleMapTask(
    stageId: Int,
    stageAttemptId: Int,
    taskBinary: Broadcast[Array[Byte]],
    partition: Partition,
    @transient private var locs: Seq[TaskLocation],
    metrics: TaskMetrics,
    localProperties: Properties,
    jobId: Option[Int] = None,
    appId: Option[String] = None,
    appAttemptId: Option[String] = None)
  extends Task[MapStatus](stageId, stageAttemptId, partition.index, metrics, localProperties, jobId,
    appId, appAttemptId)
  with Logging
7. runTask() returns a MapStatus. It deserializes the data the task has to process and uses a broadcast variable to get the part of the RDD this task is responsible for. It then obtains the ShuffleManager and calls the RDD's iterator() for the partition currently being executed; iterator() applies the operators we defined to that partition of the RDD. The resulting data goes through a ShuffleWriter and, after the HashPartitioner, is written into its corresponding bucket. Finally a MapStatus is returned, which wraps the computed data stored in the BlockManager.
override def runTask(context: TaskContext): MapStatus = {
  // Deserialize the RDD using the broadcast variable.
  val threadMXBean = ManagementFactory.getThreadMXBean
  val deserializeStartTime = System.currentTimeMillis()
  val deserializeStartCpuTime = if (threadMXBean.isCurrentThreadCpuTimeSupported) {
    threadMXBean.getCurrentThreadCpuTime
  } else 0L
  // Deserialize the data this task has to process and use the broadcast
  // variable to get the part of the RDD this task is responsible for
  val ser = SparkEnv.get.closureSerializer.newInstance()
  val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
    ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
  _executorDeserializeTime = System.currentTimeMillis() - deserializeStartTime
  _executorDeserializeCpuTime = if (threadMXBean.isCurrentThreadCpuTimeSupported) {
    threadMXBean.getCurrentThreadCpuTime - deserializeStartCpuTime
  } else 0L

  var writer: ShuffleWriter[Any, Any] = null
  try {
    // Get the shuffleManager
    val manager = SparkEnv.get.shuffleManager
    // Get a writer from the shuffleManager
    writer = manager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context)
    // Call the rdd's iterator(), passing in the partition this task should process.
    // iterator() applies the operators we defined to that partition of the rdd;
    // the resulting data goes through the ShuffleWriter and, after the HashPartitioner,
    // is written into the bucket for its target partition
    writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
    // Finally return a MapStatus, which wraps the computed data stored in the BlockManager
    writer.stop(success = true).get
  } catch {
    case e: Exception =>
      try {
        if (writer != null) {
          writer.stop(success = false)
        }
      } catch {
        case e: Exception =>
          log.debug("Could not stop writer", e)
      }
      throw e
  }
}
8. MapPartitionsRDD
Its compute() applies the operator or function we supplied to one partition of the RDD.
(Note: f can be thought of as the user-defined operator or function, which Spark has wrapped internally; compute() runs that custom computation against the RDD's partition and returns the data of the new RDD's partition.)
private[spark] class MapPartitionsRDD[U: ClassTag, T: ClassTag](
    var prev: RDD[T],
    f: (TaskContext, Int, Iterator[T]) => Iterator[U],  // (TaskContext, partition index, iterator)
    preservesPartitioning: Boolean = false)
  extends RDD[U](prev) {

  override val partitioner = if (preservesPartitioning) firstParent[T].partitioner else None

  override def getPartitions: Array[Partition] = firstParent[T].partitions

  // Apply the given operator or function f to one partition of the rdd.
  // f can be thought of as the user-defined operator or function, wrapped internally by Spark;
  // the custom computation runs against the rdd's partition and returns the new rdd's partition data
  override def compute(split: Partition, context: TaskContext): Iterator[U] =
    f(context, split.index, firstParent[T].iterator(split, context))

  override def clearDependencies() {
    super.clearDependencies()
    prev = null
  }
}
9. statusUpdate() is an abstract method.
execBackend.statusUpdate(taskId, TaskState.FINISHED, serializedResult)

// statusUpdate
override def statusUpdate(taskId: Long, state: TaskState, data: ByteBuffer) {
  val msg = StatusUpdate(executorId, taskId, state, data)
  driver match {
    case Some(driverRef) => driverRef.send(msg)
    case None => logWarning(s"Drop $msg because has not yet connected to driver")
  }
}
10. TaskSchedulerImpl's statusUpdate() is then called; it looks up the corresponding TaskSet, and once the task has finished it removes the task from memory.
def statusUpdate(tid: Long, state: TaskState, serializedData: ByteBuffer) {
  var failedExecutor: Option[String] = None
  var reason: Option[ExecutorLossReason] = None
  synchronized {
    try {
      // Look up the corresponding taskSet
      taskIdToTaskSetManager.get(tid) match {
        case Some(taskSet) =>
          if (state == TaskState.LOST) {
            // TaskState.LOST is only used by the deprecated Mesos fine-grained scheduling mode,
            // where each executor corresponds to a single task, so mark the executor as failed.
            val execId = taskIdToExecutorId.getOrElse(tid, throw new IllegalStateException(
              "taskIdToTaskSetManager.contains(tid) <=> taskIdToExecutorId.contains(tid)"))
            if (executorIdToRunningTaskIds.contains(execId)) {
              reason = Some(
                SlaveLost(s"Task $tid was lost, so marking the executor as lost as well."))
              removeExecutor(execId, reason.get)
              failedExecutor = Some(execId)
            }
          }
          if (TaskState.isFinished(state)) {
            // The task has finished, remove it from the in-memory state
            cleanupTaskState(tid)
            taskSet.removeRunningTask(tid)
            if (state == TaskState.FINISHED) {
              taskResultGetter.enqueueSuccessfulTask(taskSet, tid, serializedData)
            } else if (Set(TaskState.FAILED, TaskState.KILLED, TaskState.LOST).contains(state)) {
              taskResultGetter.enqueueFailedTask(taskSet, tid, state, serializedData)
            }
          }
        case None =>
          logError(
            ("Ignoring update with state %s for TID %s because its task set is gone (this is " +
              "likely the result of receiving duplicate task finished status updates) or its " +
              "executor has been marked as failed.")
              .format(state, tid))
      }
    } catch {
      case e: Exception => logError("Exception in statusUpdate", e)
    }
  }
  // Update the DAGScheduler without holding a lock on this, since that can deadlock
  if (failedExecutor.isDefined) {
    assert(reason.isDefined)
    dagScheduler.executorLost(failedExecutor.get, reason.get)
    backend.reviveOffers()
  }
}
Plain (Unoptimized) Shuffle Internals
Note: the shuffle process has a map side and a reduce side.
First, the map side:
Then, the reduce side of the shuffle:
The shuffle process as a whole (a small example of an operation that triggers it follows):
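A small example of an operation that triggers a shuffle; the data is made up. groupByKey makes the map side write each record into the bucket of the reduce partition that owns its key, and the reduce side then pulls those buckets over the network:

val kvPairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 4)  // 4 map partitions
val grouped = kvPairs.groupByKey(2)                                 // 2 reduce partitions
grouped.collect().foreach(println)
// (a,CompactBuffer(1, 3))
// (b,CompactBuffer(2))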
Optimized Shuffle Internals
BlockManager Internals:
1. The Driver has a BlockManagerMaster, which maintains the metadata managed by the BlockManagers on each node.
2. Each node's BlockManager has several key components: DiskStore reads and writes data on disk, MemoryStore reads and writes data in memory, ConnectionManager establishes network connections from this BlockManager to the BlockManagers on other remote nodes, and BlockTransferService reads and writes data against those remote BlockManagers.
3. After each BlockManager is created, it registers with the BlockManagerMaster, which creates a corresponding BlockManagerInfo for it.
4. When a BlockManager writes data, for example intermediate data produced while an RDD runs, or data persisted via persist(), it writes to memory first; if there is not enough memory, part of the data in memory is written to disk.
5. If persist() requests replication, the BlockTransferService replicates the data to a BlockManager on another node (see the example below).
6. When a BlockManager reads data, it uses the ConnectionManager to connect to the BlockManager that holds the data, and then reads the data from that remote BlockManager through the BlockTransferService.
7. Whenever a BlockManager adds, removes, or updates data, it must report the block's BlockStatus to the BlockManagerInfo, which updates its own BlockStatus records so that the metadata stays consistent.
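Tying point 5 above to the user-facing API: a storage level whose name ends in _2 asks for each block to be replicated to one other node (the path is a placeholder):

import org.apache.spark.storage.StorageLevel

val replicated = sc.textFile("hdfs://namenode:9000/user/data/input.txt") // hypothetical path
  .persist(StorageLevel.MEMORY_AND_DISK_2)  // one extra replica on another node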
CacheManager Internals:
CacheManager source code analysis:
1. The CacheManager manages cached data; the cache can be memory-based or disk-based.
2. The CacheManager operates on the data through the BlockManager.
3. Whenever a task runs, it calls the RDD's iterator method to get the partition data; iterator is final and cannot be overridden, but it is used by every subclass. You can see that the RDD prefers memory: if the storage level is not NONE, it first asks the CacheManager for the data; otherwise it checks whether the RDD has been checkpointed.
4. While it works, the cache keeps as much data as possible, but the data is not guaranteed to be complete: if the machine needs memory, the cached data has to give up its space, because execution takes priority over caching. If the RDD was persisted with a level that also allows disk, part of the cached data can be moved from memory to disk; otherwise that data is lost.
During caching, the BlockManager manages the data for you, and previously cached data can be looked up in the BlockManager by its key.
If BlockManager.get() returns no data, acquireLockForPartition is called, because several threads may be operating on the same data: Spark speculatively re-runs straggler tasks, and during speculation the same task usually runs on two machines at once.
In the end the data is still obtained through BlockManager.get().
5. When the CacheManager fetches cached data, it goes through the BlockManager, preferring local data and otherwise fetching it remotely. BlockManager.getLocal in turn calls doGetLocal, and the implementation of doGetLocal shows that the cache is not necessarily in memory: it can live in memory, on disk, or in off-heap storage (Tachyon).
6. As noted in the previous step, getLocal delegates to doGetLocal.
7. If, in step 5, no local cache exists, getRemote is called to fetch the data from a remote node.
8. If the CacheManager cannot obtain the cached content through the BlockManager, it falls back to the RDD's computeOrReadCheckpoint() method: it first checks whether the RDD has been checkpointed; if so, the checkpoint data is read directly, otherwise the data has to be computed. Checkpointing itself matters here: after computing, putInBlockManager re-caches the data according to its StorageLevel.
9. If there is not enough space to cache the data, the MemoryStore's unrollSafely method is called; it contains a loop that unrolls the data into memory.
(A self-contained sketch of this lookup order follows.)
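A simplified, self-contained sketch of the lookup order described above, not the actual CacheManager source; plain Scala maps stand in for the BlockManager and the checkpoint files:

import scala.collection.mutable

object CacheLookupSketch {
  private val blockCache = mutable.Map[String, Seq[Int]]()    // stand-in for the BlockManager
  private val checkpointed = mutable.Map[String, Seq[Int]]()  // stand-in for checkpoint files

  def getOrCompute(key: String, compute: () => Seq[Int]): Seq[Int] = {
    blockCache.get(key) match {
      case Some(cached) => cached                              // cache hit via the BlockManager
      case None =>
        val data = checkpointed.getOrElse(key, compute())      // checkpoint data, else recompute
        blockCache(key) = data                                 // re-cache for later callers
        data
    }
  }
}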
Checkpoint Internals:
Analysis:
1. SparkContext's setCheckpointDir sets the checkpoint directory:
def setCheckpointDir(directory: String) {
  // If we are running on a cluster, log a warning if the directory is local.
  // Otherwise, the driver may attempt to reconstruct the checkpointed RDD from
  // its own local file system, which is incorrect because the checkpoint files
  // are actually on the executor machines.
  if (!isLocal && Utils.nonLocalPaths(directory).isEmpty) {
    logWarning("Spark is not running in local mode, therefore the checkpoint directory " +
      s"must not be on the local filesystem. Directory '$directory' " +
      "appears to be on the local filesystem.")
  }
  // Create an HDFS directory using the Hadoop API
  checkpointDir = Option(directory).map { dir =>
    val path = new Path(dir, UUID.randomUUID().toString)
    val fs = path.getFileSystem(hadoopConfiguration)
    fs.mkdirs(path)
    fs.getFileStatus(path).getPath.toString
  }
}
2. The RDD's core checkpoint method:
def checkpoint(): Unit = RDDCheckpointData.synchronized {
  // NOTE: we use a global lock here due to complexities downstream with ensuring
  // children RDD partitions point to the correct parent partitions. In the future
  // we should revisit this consideration.
  if (context.checkpointDir.isEmpty) {
    throw new SparkException("Checkpoint directory has not been set in the SparkContext")
  } else if (checkpointData.isEmpty) {
    // Create an instance of ReliableRDDCheckpointData (a subclass of RDDCheckpointData)
    checkpointData = Some(new ReliableRDDCheckpointData(this))
  }
}
3. A ReliableRDDCheckpointData is created:
// ReliableRDDCheckpointData's parent class is RDDCheckpointData
private[spark] class ReliableRDDCheckpointData[T: ClassTag](@transient private val rdd: RDD[T])
  extends RDDCheckpointData[T](rdd) with Logging {
  ......
}
4. The parent class RDDCheckpointData:
/**
 * An RDD has to go through the states
 *
 *   [ Initialized --> CheckpointingInProgress --> Checkpointed ]
 *
 * before it is fully checkpointed.
 */
private[spark] abstract class RDDCheckpointData[T: ClassTag](@transient private val rdd: RDD[T])
  extends Serializable {

  import CheckpointState._

  // The checkpoint state of the associated RDD.
  // Marks the checkpoint state; it is Initialized when first created
  protected var cpState = Initialized

  ......
}
Writing checkpoint data:
1. A Spark job ultimately calls SparkContext's runJob method to submit the tasks to the Executors:
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    resultHandler: (Int, U) => Unit): Unit = {
  if (stopped.get()) {
    throw new IllegalStateException("SparkContext has been shutdown")
  }
  val callSite = getCallSite
  val cleanedFunc = clean(func)
  logInfo("Starting job: " + callSite.shortForm)
  if (conf.getBoolean("spark.logLineage", false)) {
    logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
  }
  dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
  progressBar.foreach(_.finishAll())
  // In production this eventually calls ReliableRDDCheckpointData's doCheckpoint method
  rdd.doCheckpoint()
}
2. The rdd.doCheckpoint() method:
private[spark] def doCheckpoint(): Unit = {
  RDDOperationScope.withScope(sc, "checkpoint", allowNesting = false, ignoreParent = true) {
    if (!doCheckpointCalled) {
      doCheckpointCalled = true
      if (checkpointData.isDefined) {
        if (checkpointAllMarkedAncestors) {
          // TODO We can collect all the RDDs that needs to be checkpointed, and then checkpoint
          // them in parallel.
          // Checkpoint parents first because our lineage will be truncated after we
          // checkpoint ourselves
          dependencies.foreach(_.rdd.doCheckpoint())
        }
        checkpointData.get.checkpoint()
      } else {
        // Walk the RDDs this one depends on and call doCheckpoint on each of them
        dependencies.foreach(_.rdd.doCheckpoint())
      }
    }
  }
}
3. checkpointData is of type RDDCheckpointData; its checkpoint() method:
final def checkpoint(): Unit = {
  // Guard against multiple threads checkpointing the same RDD by
  // atomically flipping the state of this RDDCheckpointData
  RDDCheckpointData.synchronized {
    if (cpState == Initialized) {
      // 1. Mark the current state as checkpointing in progress
      cpState = CheckpointingInProgress
    } else {
      return
    }
  }

  // 2. This calls the subclass's doCheckpoint()
  val newRDD = doCheckpoint()

  // Update our state and truncate the RDD lineage
  // 3. Mark the checkpoint as finished and clear the RDD's dependencies
  RDDCheckpointData.synchronized {
    cpRDD = Some(newRDD)
    cpState = Checkpointed
    rdd.markCheckpointed()
  }
}
4. The subclass ReliableRDDCheckpointData's doCheckpoint() method:
protected override def doCheckpoint(): CheckpointRDD[T] = {
  /**
   * Create a new dependency for this rdd: its parent becomes a CheckpointRDD,
   * which is responsible for later reading the checkpoint files from the file
   * system and producing this rdd's partitions.
   */
  val newRDD = ReliableCheckpointRDD.writeRDDToCheckpointDirectory(rdd, cpDir)

  // Optionally clean our checkpoint files if the reference is out of scope
  if (rdd.conf.getBoolean("spark.cleaner.referenceTracking.cleanCheckpoints", false)) {
    rdd.context.cleaner.foreach { cleaner =>
      cleaner.registerRDDCheckpointDataForCleanup(newRDD, rdd.id)
    }
  }

  logInfo(s"Done checkpointing RDD ${rdd.id} to $cpDir, new parent is RDD ${newRDD.id}")
  // Return the newly created RDD to the parent class
  newRDD
}
5. The ReliableCheckpointRDD.writeRDDToCheckpointDirectory(rdd, cpDir) method:
/**
 * Write RDD to checkpoint files and return a ReliableCheckpointRDD representing the RDD.
 *
 * Triggers runJob to write the current RDD's data into the checkpoint directory,
 * and produces a ReliableCheckpointRDD instance.
 */
def writeRDDToCheckpointDirectory[T: ClassTag](
    originalRDD: RDD[T],
    checkpointDir: String,
    blockSize: Int = -1): ReliableCheckpointRDD[T] = {
  val checkpointStartTimeNs = System.nanoTime()

  val sc = originalRDD.sparkContext

  // Create the output path for the checkpoint
  // checkpointDir is the checkpoint directory we configured
  val checkpointDirPath = new Path(checkpointDir)
  // Get the HDFS file system API
  val fs = checkpointDirPath.getFileSystem(sc.hadoopConfiguration)
  // Create the directory
  if (!fs.mkdirs(checkpointDirPath)) {
    throw new SparkException(s"Failed to create checkpoint path $checkpointDirPath")
  }

  // Save to file, and reload it as an RDD
  // Broadcast the Hadoop configuration to all nodes
  val broadcastedConf = sc.broadcast(
    new SerializableConfiguration(sc.hadoopConfiguration))
  // TODO: This is expensive because it computes the RDD again unnecessarily (SPARK-8582)
  /**
   * Core code: a new job is submitted here, which computes the RDD again.
   * If the original RDD was cached, a lot of recomputation is avoided;
   * this is why caching is strongly recommended before checkpointing.
   */
  sc.runJob(originalRDD,
    writePartitionToCheckpointFile[T](checkpointDirPath.toString, broadcastedConf) _)

  // If the rdd's partitioner is not empty, write the partitioner into the checkpoint directory
  if (originalRDD.partitioner.nonEmpty) {
    writePartitionerToCheckpointDir(sc, originalRDD.partitioner.get, checkpointDirPath)
  }

  val checkpointDurationMs =
    TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - checkpointStartTimeNs)
  logInfo(s"Checkpointing took $checkpointDurationMs ms.")

  // Create a CheckpointRDD; it should have the same number of partitions as the original rdd
  val newRDD = new ReliableCheckpointRDD[T](
    sc, checkpointDirPath.toString, originalRDD.partitioner)
  if (newRDD.partitions.length != originalRDD.partitions.length) {
    throw new SparkException(
      "Checkpoint RDD has a different number of partitions from original RDD. Original " +
        s"RDD [ID: ${originalRDD.id}, num of partitions: ${originalRDD.partitions.length}]; " +
        s"Checkpoint RDD [ID: ${newRDD.id}, num of partitions: " +
        s"${newRDD.partitions.length}].")
  }
  newRDD
}
Finally, the new CheckpointRDD is returned; the parent class assigns it to its cpRDD member, marks the state as Checkpointed, and clears the current RDD's dependency chain. At this point the checkpoint data has been serialized to HDFS.
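A minimal end-to-end usage sketch of this write path (the directory and file paths are placeholders); calling cache() before checkpoint() lets the separate checkpoint job reuse the already-computed data instead of recomputing the RDD:

sc.setCheckpointDir("hdfs://namenode:9000/user/spark/checkpoints") // hypothetical directory

val cleaned = sc.textFile("hdfs://namenode:9000/user/data/input.txt") // hypothetical path
  .filter(_.nonEmpty)

cleaned.cache()        // so the checkpoint job can reuse the computed data
cleaned.checkpoint()   // marks the RDD; the data is written after the next job runs

println(cleaned.count())          // runs the job, then rdd.doCheckpoint() writes to HDFS
println(cleaned.isCheckpointed)   // true once the checkpoint files exist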
Reading checkpoint data
1. The RDD's iterator():
/**
 * Call persist() first, then checkpoint().
 * The first time rdd.iterator() runs, storageLevel != StorageLevel.NONE, so the data is fetched
 * through the CacheManager; since the BlockManager does not have it yet, the data is computed
 * for the first time and then persisted through the BlockManager.
 *
 * When the rdd's job finishes, a separate job is started to perform the checkpoint.
 * The next time rdd.iterator() runs, the persistence level is not NONE, so by default the
 * persisted data is read from the BlockManager (the normal case).
 *
 * In the abnormal case, computeOrReadCheckpoint is called; if isCheckpointed is true,
 * the parent rdd's iterator() is called, which reads the data from the external file system.
 */
final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
  if (storageLevel != StorageLevel.NONE) {
    getOrCompute(split, context)
  } else {
    // Compute the RDD partition (or read it from the checkpoint)
    computeOrReadCheckpoint(split, context)
  }
}
2. computeOrReadCheckpoint is then called:
private[spark] def computeOrReadCheckpoint(split: Partition, context: TaskContext): Iterator[T] =
{
  if (isCheckpointedAndMaterialized) {
    firstParent[T].iterator(split, context)
  } else {
    // Abstract method; see a concrete implementation such as MapPartitionsRDD
    compute(split, context)
  }
}
When rdd.iterator() is called to compute an RDD partition, computeOrReadCheckpoint(split: Partition) checks whether the RDD has been checkpointed. If it has, the iterator() of the RDD's parent, i.e. CheckpointRDD.iterator(), is called; otherwise the RDD's own compute is called directly.
3. The compute method of CheckpointRDD (ReliableCheckpointRDD):
// Read our checkpoint data from the Path
override def compute(split: Partition, context: TaskContext): Iterator[T] = {
  val file = new Path(checkpointPath, ReliableCheckpointRDD.checkpointFileName(split.index))
  ReliableCheckpointRDD.readCheckpointFile(file, broadcastedConf, context)
}
4. The readCheckpointFile method:
def readCheckpointFile[T](
    path: Path,
    broadcastedConf: Broadcast[SerializableConfiguration],
    context: TaskContext): Iterator[T] = {
  val env = SparkEnv.get
  // Read the data on HDFS using the Hadoop API
  val fs = path.getFileSystem(broadcastedConf.value.value)
  val bufferSize = env.conf.getInt("spark.buffer.size", 65536)
  val fileInputStream = {
    val fileStream = fs.open(path, bufferSize)
    if (env.conf.get(CHECKPOINT_COMPRESS)) {
      CompressionCodec.createCodec(env.conf).compressedInputStream(fileStream)
    } else {
      fileStream
    }
  }
  val serializer = env.serializer.newInstance()
  val deserializeStream = serializer.deserializeStream(fileInputStream)

  // Register an on-task-completion callback to close the input stream.
  context.addTaskCompletionListener[Unit](context => deserializeStream.close())

  // Deserialize the data and convert it into an Iterator
  deserializeStream.asIterator.asInstanceOf[Iterator[T]]
}