Spark Streaming Source Code Analysis – Checkpoint
Persistence
Streaming does nothing special here: a DStream is ultimately scheduled as jobs on its individual RDDs, so persistence is handled per RDD exactly as in core Spark. The difference is that a stream is unbounded, so expired data must eventually be cleared.
During clearMetadata, expired RDDs are removed and unpersisted at the same time.
The special case is NetworkInputDStream, which always persists: the incoming stream is first turned into persisted blocks, and NetworkInputDStream then reads its data directly from those blocks.
The design document says NetworkInputDStream stores two copies of the source data to guard against loss, but I could not find that logic in the code; only a single copy is written to the blockManager.
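For reference, below is a minimal, self-contained sketch of the clear-plus-unpersist behavior described above (a toy class with hypothetical names, not the actual DStream source): RDDs older than the remember window are dropped from generatedRDDs and unpersisted in the same step.
import scala.collection.mutable.HashMap
import org.apache.spark.rdd.RDD

// Toy stand-in for a DStream, keyed by batch time in milliseconds
class ToyDStream[T](val rememberDurationMs: Long) {
  val generatedRDDs = new HashMap[Long, RDD[T]]()

  // Roughly what clearMetadata does: forget and unpersist expired RDDs together
  def clearMetadata(timeMs: Long): Unit = {
    val oldRDDs = generatedRDDs.filter { case (t, _) => t <= (timeMs - rememberDurationMs) }
    generatedRDDs --= oldRDDs.keys
    oldRDDs.values.foreach(_.unpersist())
  }
}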
Checkpoint
In Streaming, checkpointing has a special significance.
For ordinary Spark, having no checkpoint does not affect correctness: any data can be replayed from the source, and the source data usually lives on HDFS, so a checkpoint is merely an optimization.
Spark also only provides failover at the worker level: if a worker dies, its tasks are simply replayed on another worker. There is no driver failover; if the driver dies, the job fails.
That is acceptable because Spark is essentially a query engine; a failed query costs little, you just run it again.
For Spark Streaming the problem is not that simple: if the driver dies and nothing was recorded, where should processing resume after recovery?
Data would certainly be lost, hurting correctness, because stream data is unbounded; you cannot replay everything the way batch Spark does, even if the source supports replay (e.g. Kafka).
Streaming checkpointing therefore consists of two parts: checkpointing of RDDs and checkpointing of the DStreamGraph.
RDD checkpointing is exactly the same as in core Spark.
The rest of this post discusses checkpointing of the DStreamGraph; the goal is that, after the StreamingContext is restarted, the previous runtime state of the graph can be recovered from the checkpoint.
a. The graph object is serialized to the file in its entirety. The key field is outputStreams: it looks as if only the final outputStreams get persisted, but in fact every DStream in the graph is persisted,
because def dependencies: List[DStream[_]] references all upstream DStreams; following it recursively reaches every DStream object (see the serialization sketch after point b).
After the DStream objects have been restored, how is the RDD state at checkpoint time recovered? Note that generatedRDDs is @transient and is not serialized.
The answer lies in DStream.DStreamCheckpointData: currentCheckpointFiles records, at checkpoint time, the (time, checkpoint file name) pairs of all RDDs in generatedRDDs that have been checkpointed.
On recovery it is therefore enough to read those RDDs back from their checkpoint files and add them to generatedRDDs.
The checkpoint files themselves also need cleaning up: each time a DStreamGraph checkpoint completes, the checkpoint files of all RDDs older than the oldest RDD referenced by that checkpoint can be deleted, because those old RDDs can never be needed again.
b. Besides the graph object, the other important field in the Checkpoint object is pendingTimes, which records which jobs had not yet been submitted at checkpoint time.
When the JobScheduler restarts, it resubmits these jobs. This gives at-least-once semantics: we do not know how long after the checkpoint the crash happened, so some of those jobs may already have completed successfully.
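To see why serializing only outputStreams pulls in every DStream, here is a self-contained toy (a hypothetical Node class, not Spark code): each node holds references to its parents, so Java serialization of the last node writes the whole upstream chain.
import java.io._

// Each node references its parents, just like DStream.dependencies
class Node(val name: String, val parents: List[Node]) extends Serializable

object GraphSerializationDemo extends App {
  val input  = new Node("input",  Nil)
  val mapped = new Node("mapped", List(input))
  val output = new Node("output", List(mapped))

  val bos = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(bos)
  oos.writeObject(output)            // only the "output" node is written explicitly
  oos.close()

  val restored = new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray))
    .readObject().asInstanceOf[Node]
  // The whole chain came back with it
  println(restored.parents.head.parents.head.name)   // prints "input"
}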
The process of creating a checkpoint:
1. In JobGenerator, after each batch of jobs is submitted to Spark, a DoCheckpoint event serializes the Checkpoint object (which contains the graph object among other state) and writes it to a file.
2. Once DoCheckpoint completes, ClearCheckpointData is triggered to delete the checkpoint files of expired RDDs.
The process of using a checkpoint:
1. StreamingContext.getOrCreate calls CheckpointReader.read to deserialize the Checkpoint object from the file and uses it to initialize the StreamingContext.
2. Inside StreamingContext, cp_.graph.restoreCheckpointData is called to restore each DStream's generatedRDDs.
3. JobGenerator.restart is called to resubmit the jobs that had not yet been submitted at checkpoint time.
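The user-facing side of this recovery path is the familiar getOrCreate pattern. A minimal sketch, assuming an HDFS checkpoint directory and a socket source (the directory path, app name and port are made up):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointRecoveryExample {
  val checkpointDir = "hdfs:///tmp/streaming-cp"   // assumed path

  // Only called when no usable checkpoint exists in checkpointDir
  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("cp-demo")
    val ssc = new StreamingContext(conf, Seconds(2))
    ssc.checkpoint(checkpointDir)          // just records the checkpoint directory
    ssc.socketTextStream("localhost", 9999).count().print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // After a driver crash, the graph, generatedRDDs and pendingTimes are
    // restored from the checkpoint file instead of calling createContext()
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}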
DStreamGraph
final private[streaming] class DStreamGraph extends Serializable with Logging {
private val inputStreams = new ArrayBuffer[InputDStream[_]]()
private val outputStreams = new ArrayBuffer[DStream[_]]()
var rememberDuration: Duration = null
var checkpointInProgress = false
var zeroTime: Time = null
var startTime: Time = null
var batchDuration: Duration = null
def updateCheckpointData(time: Time) {
logInfo("Updating checkpoint data for time " + time)
this.synchronized {
outputStreams.foreach(_.updateCheckpointData(time))
}
logInfo("Updated checkpoint data for time " + time)
}
def clearCheckpointData(time: Time) {
logInfo("Clearing checkpoint data for time " + time)
this.synchronized {
outputStreams.foreach(_.clearCheckpointData(time))
}
logInfo("Cleared checkpoint data for time " + time)
}
def restoreCheckpointData() {
logInfo("Restoring checkpoint data")
this.synchronized {
outputStreams.foreach(_.restoreCheckpointData())
}
logInfo("Restored checkpoint data")
}
}
DStreamCheckpointData
private[streaming]
class DStreamCheckpointData[T: ClassTag] (dstream: DStream[T])
extends Serializable with Logging {
protected val data = new HashMap[Time, AnyRef]()
// Mapping of the batch time to the checkpointed RDD file of that time
@transient private var timeToCheckpointFile = new HashMap[Time, String] // (time, checkpoint file name) for every RDD that has ever been checkpointed
// Mapping of the batch time to the time of the oldest checkpointed RDD
// in that batch's checkpoint data
@transient private var timeToOldestCheckpointFileTime = new HashMap[Time, Time] // for each checkpoint time, the time of the oldest checkpointed RDD referenced by that checkpoint
@transient private var fileSystem : FileSystem = null
protected[streaming] def currentCheckpointFiles = data.asInstanceOf[HashMap[Time, String]] // (time, checkpoint file name) of the RDDs checkpointed in the current round
/**
* Updates the checkpoint data of the DStream. This gets called every time
* the graph checkpoint is initiated. Default implementation records the
* checkpoint files to which the generated RDDs of the DStream have been saved.
*/
def update(time: Time) {
// Get the checkpointed RDDs from the generated RDDs
val checkpointFiles = dstream.generatedRDDs.filter(_._2.getCheckpointFile.isDefined) // keep only the RDDs in dstream.generatedRDDs that have already been checkpointed
.map(x => (x._1, x._2.getCheckpointFile.get))
// Add the checkpoint files to the data to be serialized
if (!checkpointFiles.isEmpty) {
currentCheckpointFiles.clear()
currentCheckpointFiles ++= checkpointFiles // refresh currentCheckpointFiles
// Add the current checkpoint files to the map of all checkpoint files
// This will be used to delete old checkpoint files
timeToCheckpointFile ++= currentCheckpointFiles
// Remember the time of the oldest checkpoint RDD in current state
timeToOldestCheckpointFileTime(time) = currentCheckpointFiles.keys.min(Time.ordering) // remember the time of the oldest RDD checkpointed in this round
}
}
/**
* Cleanup old checkpoint data. This gets called after a checkpoint of `time` has been
* written to the checkpoint directory.
*/
def cleanup(time: Time) {
// Get the time of the oldest checkpointed RDD that was written as part of the
// checkpoint of `time`
timeToOldestCheckpointFileTime.remove(time) match { // lastCheckpointFileTime is the time of the oldest RDD referenced by the checkpoint taken at `time`
case Some(lastCheckpointFileTime) =>
// Find all the checkpointed RDDs (i.e. files) that are older than `lastCheckpointFileTime`
// This is because checkpointed RDDs older than this are not going to be needed
// even after master fails, as the checkpoint data of `time` does not refer to those files
val filesToDelete = timeToCheckpointFile.filter(_._1 < lastCheckpointFileTime) // delete every checkpoint file older than lastCheckpointFileTime
logDebug("Files to delete:\n" + filesToDelete.mkString(","))
filesToDelete.foreach {
case (time, file) =>
try {
val path = new Path(file)
if (fileSystem == null) {
fileSystem = path.getFileSystem(dstream.ssc.sparkContext.hadoopConfiguration)
}
fileSystem.delete(path, true)
timeToCheckpointFile -= time
logInfo("Deleted checkpoint file '" + file + "' for time " + time)
} catch {
case e: Exception =>
logWarning("Error deleting old checkpoint file '" + file + "' for time " + time, e)
}
}
case None =>
logDebug("Nothing to delete")
}
}
/**
* Restore the checkpoint data. This gets called once when the DStream graph
* (along with its DStreams) are being restored from a graph checkpoint file.
* Default implementation restores the RDDs from their checkpoint files.
*/
def restore() {
// Create RDDs from the checkpoint data
currentCheckpointFiles.foreach {
case(time, file) => {
// restore: deserialize the RDD from its checkpoint file and add it back to dstream.generatedRDDs
dstream.generatedRDDs += ((time, dstream.context.sparkContext.checkpointFile[T](file)))
}
}
}
}
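The interplay of the two maps above can be hard to follow, so here is a self-contained toy run (hypothetical names, Long batch times instead of Time, strings instead of real files): update(t) remembers the oldest RDD referenced by the checkpoint at t, and cleanup(t) may delete every file strictly older than that.
import scala.collection.mutable.HashMap

object CheckpointDataDemo extends App {
  val timeToCheckpointFile = new HashMap[Long, String]()
  val timeToOldestCheckpointFileTime = new HashMap[Long, Long]()

  def update(time: Long, checkpointed: Map[Long, String]): Unit = {
    timeToCheckpointFile ++= checkpointed
    timeToOldestCheckpointFileTime(time) = checkpointed.keys.min
  }

  def cleanup(time: Long): Unit =
    timeToOldestCheckpointFileTime.remove(time).foreach { oldest =>
      val toDelete = timeToCheckpointFile.filter(_._1 < oldest)
      toDelete.keys.foreach(timeToCheckpointFile -= _)
      println("would delete: " + toDelete.values.mkString(", "))
    }

  // Graph checkpoint at t=10000 references RDDs checkpointed at t=6000 and t=8000
  update(10000L, Map(6000L -> "cp-6000", 8000L -> "cp-8000"))
  // Next graph checkpoint at t=12000 references RDDs at t=8000 and t=10000
  update(12000L, Map(8000L -> "cp-8000", 10000L -> "cp-10000"))
  cleanup(12000L)   // oldest referenced time is 8000, so cp-6000 can go
}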
DStream
//DStream
// Checkpoint details
private[streaming] val mustCheckpoint = false
private[streaming] var checkpointDuration: Duration = null
private[streaming] val checkpointData = new DStreamCheckpointData(this)
/**
* Enable periodic checkpointing of RDDs of this DStream
* @param interval Time interval after which generated RDD will be checkpointed
*/
def checkpoint(interval: Duration): DStream[T] = {
if (isInitialized) {
throw new UnsupportedOperationException(
"Cannot change checkpoint interval of an DStream after streaming context has started")
}
persist()
checkpointDuration = interval
this
}
/**
* Refresh the list of checkpointed RDDs that will be saved along with checkpoint of
* this stream. This is an internal method that should not be called directly. This is
* a default implementation that saves only the file names of the checkpointed RDDs to
* checkpointData. Subclasses of DStream (especially those of InputDStream) may override
* this method to save custom checkpoint data.
*/
private[streaming] def updateCheckpointData(currentTime: Time) {
checkpointData.update(currentTime)
dependencies.foreach(_.updateCheckpointData(currentTime))
}
private[streaming] def clearCheckpointData(time: Time) {
checkpointData.cleanup(time)
dependencies.foreach(_.clearCheckpointData(time))
}
/**
* Restore the RDDs in generatedRDDs from the checkpointData. This is an internal method
* that should not be called directly. This is a default implementation that recreates RDDs
* from the checkpoint file names stored in checkpointData. Subclasses of DStream that
* override the updateCheckpointData() method would also need to override this method.
*/
private[streaming] def restoreCheckpointData() {
// Create RDDs from the checkpoint data
checkpointData.restore()
dependencies.foreach(_.restoreCheckpointData())
}
JobGenerator
1. After each runJobs call, i.e. every time a new batch of jobs has been submitted, a DoCheckpoint event writes the Checkpoint object to a file.
2. On restart, the jobs for pendingTimes + downTimes are rerun, which guarantees at-least-once semantics (see the worked example below).
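A small worked toy of the reschedule computation (plain Longs in milliseconds, made-up values; the real code uses Time and Time.until): the batches that fell into the down time plus the batches still pending at checkpoint time are rerun, which is why this is at-least-once.
object RestartRescheduleDemo extends App {
  val batchDurationMs  = 2000L
  val checkpointTimeMs = 10000L      // time of the last written checkpoint
  val restartTimeMs    = 16000L      // first batch time after the driver is back

  // Batches that would have run while the driver was down
  val downTimes = (checkpointTimeMs + batchDurationMs) to restartTimeMs by batchDurationMs
  // Batches whose jobs had not been submitted when the checkpoint was taken
  val pendingTimes = Seq(8000L, 10000L)

  val timesToReschedule = (pendingTimes ++ downTimes).distinct.sorted
  println(timesToReschedule.mkString(", "))   // 8000, 10000, 12000, 14000, 16000
}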
//JobGenerator
private lazy val checkpointWriter =
if (ssc.checkpointDuration != null && ssc.checkpointDir != null) {
new CheckpointWriter(this, ssc.conf, ssc.checkpointDir, ssc.sparkContext.hadoopConfiguration)
} else {
null
}
/** Generate jobs and perform checkpoint for the given `time`. */
private def generateJobs(time: Time) {
SparkEnv.set(ssc.env)
Try(graph.generateJobs(time)) match {
case Success(jobs) => jobScheduler.runJobs(time, jobs)
case Failure(e) => jobScheduler.reportError("Error generating jobs for time " + time, e)
}
eventActor ! DoCheckpoint(time) // after runJobs completes, checkpoint the DStreamGraph
}
/** Perform checkpoint for the given `time`. */
private def doCheckpoint(time: Time) = synchronized {
if (checkpointWriter != null && (time - graph.zeroTime).isMultipleOf(ssc.checkpointDuration)) {
ssc.graph.updateCheckpointData(time) // first refresh currentCheckpointFiles of every DStream in the graph
checkpointWriter.write(new Checkpoint(ssc, time)) // then write the Checkpoint object to a file via checkpointWriter
}
}
def onCheckpointCompletion(time: Time) {
eventActor ! ClearCheckpointData(time) // once the DStreamGraph checkpoint completes, delete the RDD checkpoint files that are no longer needed
}
/** Clear DStream checkpoint data for the given `time`. */
private def clearCheckpointData(time: Time) {
ssc.graph.clearCheckpointData(time)
}
/** Restarts the generator based on the information in checkpoint */
private def restart() {
// If manual clock is being used for testing, then
// either set the manual clock to the last checkpointed time,
// or if the property is defined set it to that time
if (clock.isInstanceOf[ManualClock]) {
val lastTime = ssc.initialCheckpoint.checkpointTime.milliseconds
val jumpTime = ssc.sc.conf.getLong("spark.streaming.manualClock.jump", 0)
clock.asInstanceOf[ManualClock].setTime(lastTime + jumpTime)
}
val batchDuration = ssc.graph.batchDuration
// Batches when the master was down, that is,
// between the checkpoint and current restart time
val checkpointTime = ssc.initialCheckpoint.checkpointTime
val restartTime = new Time(timer.getRestartTime(graph.zeroTime.milliseconds))
val downTimes = checkpointTime.until(restartTime, batchDuration) // batch times between the last checkpoint and the restart
logInfo("Batches during down time (" + downTimes.size + " batches): "
+ downTimes.mkString(", "))
// Batches that were unprocessed before failure
val pendingTimes = ssc.initialCheckpoint.pendingTimes.sorted(Time.ordering) // batch times whose job sets had not been submitted when the graph was checkpointed
logInfo("Batches pending processing (" + pendingTimes.size + " batches): " +
pendingTimes.mkString(", "))
// Reschedule jobs for these times
val timesToReschedule = (pendingTimes ++ downTimes).distinct.sorted(Time.ordering) // the batches to reschedule are pendingTimes + downTimes
logInfo("Batches to reschedule (" + timesToReschedule.size + " batches): " +
timesToReschedule.mkString(", "))
timesToReschedule.foreach(time =>
jobScheduler.runJobs(time, graph.generateJobs(time))
)
// Restart the timer
timer.start(restartTime.milliseconds)
logInfo("JobGenerator restarted at " + restartTime)
}
StreamingContext
When a checkpoint file exists, the Checkpoint object is read first and then used to initialize the StreamingContext,
so that the Checkpoint can restore every DStream in the graph.
//StreamingContext
class StreamingContext private[streaming] (
sc_ : SparkContext,
cp_ : Checkpoint,
batchDur_ : Duration
) extends Logging {
private[streaming] val isCheckpointPresent = (cp_ != null)
private[streaming] val graph: DStreamGraph = {
if (isCheckpointPresent) {
cp_.graph.setContext(this)
cp_.graph.restoreCheckpointData()
cp_.graph
} else {
assert(batchDur_ != null, "Batch duration for streaming context cannot be null")
val newGraph = new DStreamGraph()
newGraph.setBatchDuration(batchDur_)
newGraph
}
}
/**
* Set the context to periodically checkpoint the DStream operations for driver
* fault-tolerance.
* @param directory HDFS-compatible directory where the checkpoint data will be reliably stored.
* Note that this must be a fault-tolerant file system like HDFS for
*/
def checkpoint(directory: String) { // this only sets up checkpointDir; the method name is a bit misleading
if (directory != null) {
val path = new Path(directory)
val fs = path.getFileSystem(sparkContext.hadoopConfiguration)
fs.mkdirs(path)
val fullPath = fs.getFileStatus(path).getPath().toString
sc.setCheckpointDir(fullPath)
checkpointDir = fullPath
} else {
checkpointDir = null
}
}
private[streaming] def initialCheckpoint: Checkpoint = {
if (isCheckpointPresent) cp_ else null
}
}
object StreamingContext extends Logging {
/**
* Either recreate a StreamingContext from checkpoint data or create a new StreamingContext.
* If checkpoint data exists in the provided `checkpointPath`, then StreamingContext will be
* recreated from the checkpoint data. If the data does not exist, then the StreamingContext
* will be created by called the provided `creatingFunc`.
*
* @param checkpointPath Checkpoint directory used in an earlier StreamingContext program
* @param creatingFunc Function to create a new StreamingContext
* @param hadoopConf Optional Hadoop configuration if necessary for reading from the
* file system
* @param createOnError Optional, whether to create a new StreamingContext if there is an
* error in reading checkpoint data. By default, an exception will be
* thrown on error.
*/
def getOrCreate(
checkpointPath: String,
creatingFunc: () => StreamingContext,
hadoopConf: Configuration = new Configuration(),
createOnError: Boolean = false
): StreamingContext = {
val checkpointOption = try { // read the Checkpoint object from the checkpoint file
CheckpointReader.read(checkpointPath, new SparkConf(), hadoopConf)
} catch {
case e: Exception =>
if (createOnError) {
None
} else {
throw e
}
}
checkpointOption.map(new StreamingContext(null, _, null)).getOrElse(creatingFunc()) // initialize the StreamingContext from the Checkpoint object
}
}
Checkpoint (org.apache.spark.streaming)
The Checkpoint class exists mainly to checkpoint the DStreamGraph object; CheckpointWriter serializes the Checkpoint to a file.
private[streaming]
class Checkpoint(@transient ssc: StreamingContext, val checkpointTime: Time)
extends Logging with Serializable {
val master = ssc.sc.master
val framework = ssc.sc.appName
val sparkHome = ssc.sc.getSparkHome.getOrElse(null)
val jars = ssc.sc.jars
val graph = ssc.graph // the key piece of state to checkpoint: the graph
val checkpointDir = ssc.checkpointDir
val checkpointDuration = ssc.checkpointDuration
// times of the job sets in JobScheduler's jobSets that have not been run yet
val pendingTimes = ssc.scheduler.getPendingTimes().toArray
val delaySeconds = MetadataCleaner.getDelaySeconds(ssc.conf)
val sparkConfPairs = ssc.conf.getAll
}
CheckpointWriter, which writes the Checkpoint object to a file
/**
* Convenience class to handle the writing of graph checkpoint to file
*/
private[streaming]
class CheckpointWriter(
jobGenerator: JobGenerator,
conf: SparkConf,
checkpointDir: String,
hadoopConf: Configuration
) extends Logging {
val MAX_ATTEMPTS = 3
val executor = Executors.newFixedThreadPool(1)
val compressionCodec = CompressionCodec.createCodec(conf)
private var stopped = false
private var fs_ : FileSystem = _
class CheckpointWriteHandler(checkpointTime: Time, bytes: Array[Byte]) extends Runnable {
def run() {
var attempts = 0
val startTime = System.currentTimeMillis()
val tempFile = new Path(checkpointDir, "temp") // temporary file
val checkpointFile = Checkpoint.checkpointFile(checkpointDir, checkpointTime) // the final checkpoint file
val backupFile = Checkpoint.checkpointBackupFile(checkpointDir, checkpointTime) // the backup file
while (attempts < MAX_ATTEMPTS && !stopped) {
attempts += 1
try {
logInfo("Saving checkpoint for time " + checkpointTime + " to file '" + checkpointFile
+ "'")
// Write checkpoint to temp file first
fs.delete(tempFile, true) // just in case it exists
val fos = fs.create(tempFile)
fos.write(bytes)
fos.close()
// If the checkpoint file exists, back it up
// If the backup exists as well, just delete it, otherwise rename will fail
if (fs.exists(checkpointFile)) {
fs.delete(backupFile, true) // just in case it exists
if (!fs.rename(checkpointFile, backupFile)) { // rename the existing checkpoint file to the backup file
logWarning("Could not rename " + checkpointFile + " to " + backupFile)
}
}
// Rename temp file to the final checkpoint file
if (!fs.rename(tempFile, checkpointFile)) {
logWarning("Could not rename " + tempFile + " to " + checkpointFile)
}
// Delete old checkpoint files
val allCheckpointFiles = Checkpoint.getCheckpointFiles(checkpointDir, fs)
if (allCheckpointFiles.size > 10) {
allCheckpointFiles.take(allCheckpointFiles.size - 10).foreach(file => {
logInfo("Deleting " + file)
fs.delete(file, true)
})
}
// All done, print success
val finishTime = System.currentTimeMillis()
jobGenerator.onCheckpointCompletion(checkpointTime) // when the checkpoint is done, notify jobGenerator.onCheckpointCompletion
return
} catch {
case ioe: IOException =>
logWarning("Error in attempt " + attempts + " of writing checkpoint to " + checkpointFile, ioe)
}
}
}
}
def write(checkpoint: Checkpoint) {
val bos = new ByteArrayOutputStream()
val zos = compressionCodec.compressedOutputStream(bos)
val oos = new ObjectOutputStream(zos)
oos.writeObject(checkpoint) // serialize the Checkpoint object
oos.close()
bos.close()
try {
executor.execute(new CheckpointWriteHandler(checkpoint.checkpointTime, bos.toByteArray)) // hand the serialized bytes to CheckpointWriteHandler on the single-threaded executor, which writes them to the file
} catch {
case rej: RejectedExecutionException =>
logError("Could not submit checkpoint task to the thread pool executor", rej)
}
}
}
CheckpointReader
private[streaming]
object CheckpointReader extends Logging {
def read(checkpointDir: String, conf: SparkConf, hadoopConf: Configuration): Option[Checkpoint] =
{
val checkpointPath = new Path(checkpointDir)
def fs = checkpointPath.getFileSystem(hadoopConf)
// Try to find the checkpoint files
val checkpointFiles = Checkpoint.getCheckpointFiles(checkpointDir, fs).reverse
if (checkpointFiles.isEmpty) {
return None
}
// Try to read the checkpoint files in the order
logInfo("Checkpoint files found: " + checkpointFiles.mkString(","))
val compressionCodec = CompressionCodec.createCodec(conf)
checkpointFiles.foreach(file => {
logInfo("Attempting to load checkpoint from file " + file)
try {
val fis = fs.open(file)
// ObjectInputStream uses the last defined user-defined class loader in the stack
// to find classes, which maybe the wrong class loader. Hence, a inherited version
// of ObjectInputStream is used to explicitly use the current thread's default class
// loader to find and load classes. This is a well know Java issue and has popped up
// in other places (e.g., http://jira.codehaus.org/browse/GROOVY-1627)
val zis = compressionCodec.compressedInputStream(fis)
val ois = new ObjectInputStreamWithLoader(zis,
Thread.currentThread().getContextClassLoader)
val cp = ois.readObject.asInstanceOf[Checkpoint] // deserialize the file contents into a Checkpoint object
ois.close()
fs.close()
cp.validate()
return Some(cp)
} catch {
case e: Exception =>
logWarning("Error reading checkpoint from file " + file, e)
}
})
// None of the checkpoint files could be read
throw new SparkException("Failed to read checkpoint from directory " + checkpointPath)
}
}