Spark Source Code Reading 6: Fault-Tolerant Recovery of Apps, Drivers, and Workers

If, while applications are running, the machine they depend on or one of the workers crashes for some reason, the information about those apps, drivers, and workers would be lost along with it. To guard against this, Spark introduces an HA mechanism whose persistence layer is defined by the trait PersistenceEngine.

Spark provides three recovery mechanisms: ZooKeeper, FileSystem, and BlackHole. Each is implemented by a corresponding *PersistenceEngine class, and all of them extend the trait PersistenceEngine.
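
For reference, here is a minimal sketch of the PersistenceEngine contract, reduced to the three methods exercised in the excerpts below (the real trait in org.apache.spark.deploy.master also offers convenience helpers such as addApplication/addDriver/addWorker that delegate to these):

    import scala.reflect.ClassTag

    // Simplified sketch of the PersistenceEngine contract.
    trait PersistenceEngine {
      // Store `obj` under the key `name` (the Master uses names like
      // "app_<id>", "driver_<id>", "worker_<id>").
      def persist(name: String, obj: Object): Unit

      // Remove whatever was stored under `name`.
      def unpersist(name: String): Unit

      // Return every stored object whose key starts with `prefix`.
      def read[T: ClassTag](prefix: String): Seq[T]
    }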

1. ZooKeeperPersistenceEngine is built on ZooKeeper, the distributed coordination service.

    It first derives its working directory from the configuration spark.deploy.zookeeper.dir (defaulting to /spark when unset), then creates a ZooKeeper client from spark.deploy.zookeeper.url, and finally creates the working directory from these two pieces (zk, dir). The persist method serializes app, driver, and worker objects into that directory, unpersist deletes a saved object, and read deserializes the entries returned by the zk client.

  private val WORKING_DIR = conf.get("spark.deploy.zookeeper.dir", "/spark") + "/master_status"
  private val zk: CuratorFramework = SparkCuratorUtil.newClient(conf)

  SparkCuratorUtil.mkdir(zk, WORKING_DIR)


  override def persist(name: String, obj: Object): Unit = {
    serializeIntoFile(WORKING_DIR + "/" + name, obj)
  }

  override def unpersist(name: String): Unit = {
    zk.delete().forPath(WORKING_DIR + "/" + name)
  }

  override def read[T: ClassTag](prefix: String): Seq[T] = {
    val file = zk.getChildren.forPath(WORKING_DIR).filter(_.startsWith(prefix))
    file.map(deserializeFromFile[T]).flatten
  }
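
The helpers serializeIntoFile and deserializeFromFile are not shown in the excerpt. In essence, they turn an object into a byte array with the serializer the Master hands to the engine and store it as a persistent ZNode, then read the ZNode data back and deserialize it on recovery. Below is a simplified, self-contained sketch of that idea, using plain Java serialization instead of the Master's serializer (object and parameter names here are illustrative):

    import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
    import org.apache.curator.framework.CuratorFramework
    import org.apache.zookeeper.CreateMode

    object ZkSerializationSketch {
      // Serialize `value` and store it as one persistent ZNode at `path`.
      def serializeIntoFile(zk: CuratorFramework, path: String, value: AnyRef): Unit = {
        val buffer = new ByteArrayOutputStream()
        val out = new ObjectOutputStream(buffer)
        out.writeObject(value)
        out.close()
        zk.create().withMode(CreateMode.PERSISTENT).forPath(path, buffer.toByteArray)
      }

      // Read the ZNode at `path` back and deserialize it; drop the entry if it is corrupt.
      def deserializeFromFile[T](zk: CuratorFramework, path: String): Option[T] = {
        val data = zk.getData.forPath(path)
        try {
          val in = new ObjectInputStream(new ByteArrayInputStream(data))
          Some(in.readObject().asInstanceOf[T])
        } catch {
          case _: Exception =>
            zk.delete().forPath(path)
            None
        }
      }
    }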

2. FileSystemPersistenceEngine writes each app, driver, and worker out as a separate file on disk. Its persist, unpersist, and read methods play the same roles as above; a sketch of the idea follows.
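
A minimal sketch of the file-based approach, again using plain Java serialization rather than the serializer the real class receives from the Master (class name and details are illustrative; in the real engine the directory comes from the spark.deploy.recoveryDirectory setting):

    import java.io.{File, FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}

    class FileSystemSketch(dir: String) {
      new File(dir).mkdirs()

      // One file per persisted object, named after its key.
      def persist(name: String, obj: AnyRef): Unit = {
        val out = new ObjectOutputStream(new FileOutputStream(new File(dir, name)))
        try out.writeObject(obj) finally out.close()
      }

      def unpersist(name: String): Unit = {
        new File(dir, name).delete()
      }

      // The key prefix ("app_", "driver_", "worker_") selects which objects to restore.
      def read[T](prefix: String): Seq[T] = {
        new File(dir).listFiles().filter(_.getName.startsWith(prefix)).toSeq.map { f =>
          val in = new ObjectInputStream(new FileInputStream(f))
          try in.readObject().asInstanceOf[T] finally in.close()
        }
      }
    }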

3. BlackHolePersistenceEngine stores no app, driver, or worker information at all, which means no fault tolerance; its implementation is sketched below.
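
Against the trait sketch above, the black-hole engine is essentially a set of no-ops (shaped after the real class):

    import scala.reflect.ClassTag

    // Nothing is persisted, so nothing can be recovered after a Master failure.
    class BlackHoleSketch extends PersistenceEngine {
      override def persist(name: String, obj: Object): Unit = {}
      override def unpersist(name: String): Unit = {}
      override def read[T: ClassTag](prefix: String): Seq[T] = Nil
    }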

Which of these PersistenceEngines Spark actually picks is controlled by the configuration spark.deploy.recoveryMode. The default is NONE, i.e. the BlackHolePersistenceEngine mode.
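
For example, ZooKeeper-based recovery is typically enabled by passing the relevant properties to the Master daemon, e.g. through SPARK_DAEMON_JAVA_OPTS in conf/spark-env.sh (the ZooKeeper hosts and the recovery directory below are placeholders):

    # conf/spark-env.sh
    export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
      -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
      -Dspark.deploy.zookeeper.dir=/spark"

    # Single-Master file-system recovery instead:
    # export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=FILESYSTEM \
    #   -Dspark.deploy.recoveryDirectory=/some/recovery/dir"

In Master, RECOVERY_MODE drives the choice of both the persistence engine and the matching leader-election agent. Besides the three modes above, a CUSTOM value loads a user-supplied StandaloneRecoveryModeFactory: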

    val (persistenceEngine_, leaderElectionAgent_) = RECOVERY_MODE match {
      case "ZOOKEEPER" =>
        logInfo("Persisting recovery state to ZooKeeper")
        val zkFactory =
          new ZooKeeperRecoveryModeFactory(conf, SerializationExtension(context.system))
        (zkFactory.createPersistenceEngine(), zkFactory.createLeaderElectionAgent(this))
      case "FILESYSTEM" =>
        val fsFactory =
          new FileSystemRecoveryModeFactory(conf, SerializationExtension(context.system))
        (fsFactory.createPersistenceEngine(), fsFactory.createLeaderElectionAgent(this))
      case "CUSTOM" =>
        val clazz = Class.forName(conf.get("spark.deploy.recoveryMode.factory"))
        val factory = clazz.getConstructor(conf.getClass, Serialization.getClass)
          .newInstance(conf, SerializationExtension(context.system))
          .asInstanceOf[StandaloneRecoveryModeFactory]
        (factory.createPersistenceEngine(), factory.createLeaderElectionAgent(this))
      case _ =>
        (new BlackHolePersistenceEngine(), new MonarchyLeaderAgent(this))
    }
    persistenceEngine = persistenceEngine_
    leaderElectionAgent = leaderElectionAgent_
  }
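
The CUSTOM branch lets users plug in their own storage backend: spark.deploy.recoveryMode.factory names a class extending StandaloneRecoveryModeFactory, which the Master instantiates reflectively with (SparkConf, Serialization), as the excerpt shows. Below is a hedged sketch of such a factory; only the factory method names come from the excerpt above, the class and package names are made up, and parameter types may differ between Spark versions:

    import akka.serialization.Serialization
    import org.apache.spark.SparkConf
    import org.apache.spark.deploy.master._

    // Hypothetical custom factory; the ??? bodies are where a real implementation
    // would return its own PersistenceEngine and LeaderElectionAgent.
    class MyRecoveryModeFactory(conf: SparkConf, serialization: Serialization)
      extends StandaloneRecoveryModeFactory(conf, serialization) {

      override def createPersistenceEngine(): PersistenceEngine = ???

      override def createLeaderElectionAgent(master: LeaderElectable): LeaderElectionAgent = ???
    }

It would then be activated with -Dspark.deploy.recoveryMode=CUSTOM and -Dspark.deploy.recoveryMode.factory=com.example.MyRecoveryModeFactory (the package name here is hypothetical).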

 

    
