spark 源码导读6 App, Driver 及 Worker的容错恢复

为了防止应用在运行时,应用所在机器或某个worker由于某种原因crash了,这些信息也随之消失,spark引入了HA机制,由trait PersistenceEngine提供实现。

spark中提供了三种容错机制:ZooKeeper、FileSystem、BlackHole,它们分别由对应的*PersistenceEngine实现,都继承自trait PersistenceEngine。

1. ZooKeeperPersistenceEngine 它是基于Zookeeper这个分布式协调系统。

    首先会根据配置 spark.deploy.zookeeper.dir (如果没有设置,就为/spark),获取工作目录,其次根据配置spark.deploy.zookeeper.url,创建zk client。最后根据这两个参数(zk, dir)创建工作目录。方法persist是对app,driver,worker对象进行序列化,unpersist是删除保存的对象文件,read是反序列化由zk client提供的文件。

  private val WORKING_DIR = conf.get("spark.deploy.zookeeper.dir", "/spark") + "/master_status"
  private val zk: CuratorFramework = SparkCuratorUtil.newClient(conf)

  SparkCuratorUtil.mkdir(zk, WORKING_DIR)


  override def persist(name: String, obj: Object): Unit = {
    serializeIntoFile(WORKING_DIR + "/" + name, obj)
  }

  override def unpersist(name: String): Unit = {
    zk.delete().forPath(WORKING_DIR + "/" + name)
  }

  override def read[T: ClassTag](prefix: String): Seq[T] = {
    val file = zk.getChildren.forPath(WORKING_DIR).filter(_.startsWith(prefix))
    file.map(deserializeFromFile[T]).flatten
  }

2. FileSystemPersistenceEngine 是将每一个app和woker,在磁盘上保存成一个单独的文件。persist,unpersist,read等方法功能与前面一样。

3. BlackHolePersistenceEngine 不存储app和woker信息,也就是没有容错啦。

以上三种PersistenceEnging, spark究竟会选哪一种是由配置spark.deploy.recoveryMode决定的。默认是None, 即BlackHolePersistenceEngine模式。

    val (persistenceEngine_, leaderElectionAgent_) = RECOVERY_MODE match {
      case "ZOOKEEPER" =>
        logInfo("Persisting recovery state to ZooKeeper")
        val zkFactory =
          new ZooKeeperRecoveryModeFactory(conf, SerializationExtension(context.system))
        (zkFactory.createPersistenceEngine(), zkFactory.createLeaderElectionAgent(this))
      case "FILESYSTEM" =>
        val fsFactory =
          new FileSystemRecoveryModeFactory(conf, SerializationExtension(context.system))
        (fsFactory.createPersistenceEngine(), fsFactory.createLeaderElectionAgent(this))
      case "CUSTOM" =>
        val clazz = Class.forName(conf.get("spark.deploy.recoveryMode.factory"))
        val factory = clazz.getConstructor(conf.getClass, Serialization.getClass)
          .newInstance(conf, SerializationExtension(context.system))
          .asInstanceOf[StandaloneRecoveryModeFactory]
        (factory.createPersistenceEngine(), factory.createLeaderElectionAgent(this))
      case _ =>
        (new BlackHolePersistenceEngine(), new MonarchyLeaderAgent(this))
    }
    persistenceEngine = persistenceEngine_
    leaderElectionAgent = leaderElectionAgent_
  }

 

    

posted on 2015-04-22 23:27  Ai_togic  阅读(394)  评论(0编辑  收藏  举报

导航