Spark Source Code Reading (3): Hadoop Configuration and Executor Environment Variables in SparkContext Initialization

Hadoop-Related Configuration

By default, Spark uses HDFS as its distributed file system, so SparkContext needs to load the Hadoop-related configuration during initialization.

The code is as follows:

  _hadoopConfiguration = SparkHadoopUtil.get.newConfiguration(_conf)

The configuration loaded here covers three things:

  1. Read the Amazon S3 credentials from the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables and set the corresponding s3/s3n/s3a access-key and secret-key properties in the Hadoop Configuration;

  2. Copy every SparkConf property whose key starts with spark.hadoop. into the Hadoop Configuration, with the spark.hadoop. prefix stripped;

  3. Copy the SparkConf property spark.buffer.size (default 65536) into the Hadoop Configuration as io.file.buffer.size.

      def newConfiguration(conf: SparkConf): Configuration = {
        val hadoopConf = new Configuration()
        appendS3AndSparkHadoopConfigurations(conf, hadoopConf)
        hadoopConf
      }
      def appendS3AndSparkHadoopConfigurations(conf: SparkConf, hadoopConf: Configuration): Unit = {
        // Note: this null check is around more than just access to the "conf" object to maintain
        // the behavior of the old implementation of this code, for backwards compatibility.
        if (conf != null) {
          // Explicitly check for S3 environment variables
          if (System.getenv("AWS_ACCESS_KEY_ID") != null &&
              System.getenv("AWS_SECRET_ACCESS_KEY") != null) {
            val keyId = System.getenv("AWS_ACCESS_KEY_ID")
            val accessKey = System.getenv("AWS_SECRET_ACCESS_KEY")
    
            hadoopConf.set("fs.s3.awsAccessKeyId", keyId)
            hadoopConf.set("fs.s3n.awsAccessKeyId", keyId)
            hadoopConf.set("fs.s3a.access.key", keyId)
            hadoopConf.set("fs.s3.awsSecretAccessKey", accessKey)
            hadoopConf.set("fs.s3n.awsSecretAccessKey", accessKey)
            hadoopConf.set("fs.s3a.secret.key", accessKey)
          }
          // Copy any "spark.hadoop.foo=bar" system properties into conf as "foo=bar"
          conf.getAll.foreach { case (key, value) =>
            if (key.startsWith("spark.hadoop.")) {
              hadoopConf.set(key.substring("spark.hadoop.".length), value)
            }
          }
          val bufferSize = conf.get("spark.buffer.size", "65536")
          hadoopConf.set("io.file.buffer.size", bufferSize)
        }
      }
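
For example, any SparkConf property with a spark.hadoop. prefix ends up in the Hadoop Configuration with the prefix removed. The snippet below is a minimal sketch rather than code from the Spark source; the app name and property values are made up, and it assumes a Spark version (such as 2.x) where SparkHadoopUtil is still accessible as a developer API:

    import org.apache.spark.SparkConf
    import org.apache.spark.deploy.SparkHadoopUtil

    val conf = new SparkConf()
      .setAppName("hadoop-conf-demo")            // hypothetical app name
      .set("spark.hadoop.dfs.replication", "2")  // copied into hadoopConf as dfs.replication=2
      .set("spark.buffer.size", "131072")        // copied into hadoopConf as io.file.buffer.size=131072

    // Same entry point used by SparkContext above.
    val hadoopConf = SparkHadoopUtil.get.newConfiguration(conf)
    assert(hadoopConf.get("dfs.replication") == "2")
    assert(hadoopConf.get("io.file.buffer.size") == "131072")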
    

If SPARK_YARN_MODE is set (as a system property or environment variable), SparkHadoopUtil.get returns a YarnSparkHadoopUtil instance; otherwise it falls back to the default SparkHadoopUtil, as sketched below.
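
A simplified sketch of that selection logic in the SparkHadoopUtil companion object, paraphrased from the Spark 2.x source rather than copied verbatim (details vary by version, and SPARK_YARN_MODE was removed in later releases):

    // Paraphrased sketch of SparkHadoopUtil.get, not a verbatim copy of the Spark source.
    def get: SparkHadoopUtil = {
      // SPARK_YARN_MODE can arrive either as a system property or as an environment variable.
      val yarnMode = java.lang.Boolean.parseBoolean(
        System.getProperty("SPARK_YARN_MODE", System.getenv("SPARK_YARN_MODE")))
      if (yarnMode) {
        yarn    // lazily created org.apache.spark.deploy.yarn.YarnSparkHadoopUtil
      } else {
        hadoop  // plain SparkHadoopUtil instance
      }
    }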

Executor Environment Variables

The environment variables collected in executorEnvs are sent to the Master when the application registers. After the Master schedules the application onto Workers, each Worker uses the information in executorEnvs to launch its Executors. The Executor memory size can be set with the spark.executor.memory configuration, or with the SPARK_EXECUTOR_MEMORY or SPARK_MEM environment variables; if none of these is set, it defaults to 1024 MB. The code is as follows:

    _executorMemory = _conf.getOption("spark.executor.memory")
      .orElse(Option(System.getenv("SPARK_EXECUTOR_MEMORY")))
      .orElse(Option(System.getenv("SPARK_MEM"))
      .map(warnSparkMem))
      .map(Utils.memoryStringToMb)
      // default to 1024 MB if none of the above is set
      .getOrElse(1024)

    // Convert java options to env vars as a work around
    // since we can't set env vars directly in sbt.
    for { (envKey, propKey) <- Seq(("SPARK_TESTING", "spark.testing"))
      value <- Option(System.getenv(envKey)).orElse(Option(System.getProperty(propKey)))} {
      executorEnvs(envKey) = value
    }
    Option(System.getenv("SPARK_PREPEND_CLASSES")).foreach { v =>
      executorEnvs("SPARK_PREPEND_CLASSES") = v
    }
    // The Mesos scheduler backend relies on this environment variable to set executor memory.
    // TODO: Set this only in the Mesos scheduler.
    executorEnvs("SPARK_EXECUTOR_MEMORY") = executorMemory + "m"
    executorEnvs ++= _conf.getExecutorEnv
    executorEnvs("SPARK_USER") = sparkUser
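
As a usage note, _conf.getExecutorEnv collects every property prefixed with spark.executorEnv. (with the prefix stripped), so the contents of executorEnvs can be influenced from the application side. A minimal sketch, with made-up values such as MY_LIB_PATH and 2g:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // ends up in executorEnvs as SPARK_EXECUTOR_MEMORY=2048m (2g parsed to 2048 MB)
      .set("spark.executor.memory", "2g")
      // ends up in executorEnvs as MY_LIB_PATH=/opt/libs on every Executor
      .set("spark.executorEnv.MY_LIB_PATH", "/opt/libs")

    // conf.getExecutorEnv returns Seq(("MY_LIB_PATH", "/opt/libs")), which SparkContext
    // merges into executorEnvs alongside SPARK_EXECUTOR_MEMORY and SPARK_USER.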