Spark Source Code Walkthrough (3): SparkContext Initialization - Hadoop Configuration and Executor Environment Variables
Hadoop-related Configuration
By default, Spark uses HDFS as its distributed file system, so it needs to load the Hadoop-related configuration.
The code is as follows:
_hadoopConfiguration = SparkHadoopUtil.get.newConfiguration(_conf)
The configuration it loads includes the following:
- The Amazon S3 AccessKeyId and SecretAccessKey (read from the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables) are loaded into the Hadoop Configuration;
- Every SparkConf property whose key starts with spark.hadoop. is copied into the Hadoop Configuration with the prefix stripped;
- The SparkConf property spark.buffer.size is copied to the Hadoop Configuration setting io.file.buffer.size.
def newConfiguration(conf: SparkConf): Configuration = {
  val hadoopConf = new Configuration()
  appendS3AndSparkHadoopConfigurations(conf, hadoopConf)
  hadoopConf
}

def appendS3AndSparkHadoopConfigurations(conf: SparkConf, hadoopConf: Configuration): Unit = {
  // Note: this null check is around more than just access to the "conf" object to maintain
  // the behavior of the old implementation of this code, for backwards compatibility.
  if (conf != null) {
    // Explicitly check for S3 environment variables
    if (System.getenv("AWS_ACCESS_KEY_ID") != null &&
        System.getenv("AWS_SECRET_ACCESS_KEY") != null) {
      val keyId = System.getenv("AWS_ACCESS_KEY_ID")
      val accessKey = System.getenv("AWS_SECRET_ACCESS_KEY")
      hadoopConf.set("fs.s3.awsAccessKeyId", keyId)
      hadoopConf.set("fs.s3n.awsAccessKeyId", keyId)
      hadoopConf.set("fs.s3a.access.key", keyId)
      hadoopConf.set("fs.s3.awsSecretAccessKey", accessKey)
      hadoopConf.set("fs.s3n.awsSecretAccessKey", accessKey)
      hadoopConf.set("fs.s3a.secret.key", accessKey)
    }
    // Copy any "spark.hadoop.foo=bar" system properties into conf as "foo=bar"
    conf.getAll.foreach { case (key, value) =>
      if (key.startsWith("spark.hadoop.")) {
        hadoopConf.set(key.substring("spark.hadoop.".length), value)
      }
    }
    val bufferSize = conf.get("spark.buffer.size", "65536")
    hadoopConf.set("io.file.buffer.size", bufferSize)
  }
}
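To see this copying from the user side, here is a minimal, illustrative sketch. The Hadoop property dfs.replication is only an example of an arbitrary setting passed through the spark.hadoop. prefix; any Hadoop key would work the same way.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("hadoop-conf-demo")
  .setMaster("local[*]")
  // Any "spark.hadoop.foo" entry becomes "foo" in the Hadoop Configuration.
  .set("spark.hadoop.dfs.replication", "2")
  // Copied to io.file.buffer.size by appendS3AndSparkHadoopConfigurations.
  .set("spark.buffer.size", "131072")

val sc = new SparkContext(conf)
// hadoopConfiguration carries the values copied by newConfiguration.
println(sc.hadoopConfiguration.get("dfs.replication"))     // expected: 2
println(sc.hadoopConfiguration.get("io.file.buffer.size")) // expected: 131072
sc.stop()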
If the SPARK_YARN_MODE property is set, YarnSparkHadoopUtil is used; otherwise, the default SparkHadoopUtil is used.
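Roughly, the selection happens in SparkHadoopUtil.get. The following is a paraphrased sketch of the Spark 2.x-era logic; the exact code varies across versions.

// Paraphrased sketch of SparkHadoopUtil.get; not a verbatim copy of any one release.
def get: SparkHadoopUtil = {
  // SPARK_YARN_MODE may be supplied as a system property or an environment variable.
  val yarnMode = java.lang.Boolean.parseBoolean(
    System.getProperty("SPARK_YARN_MODE", System.getenv("SPARK_YARN_MODE")))
  if (yarnMode) yarn else hadoop // yarn: YarnSparkHadoopUtil, hadoop: SparkHadoopUtil
}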
Executor Environment Variables
The environment variables held in executorEnvs are sent to the Master when the application is registered. After the Master dispatches scheduling instructions to the Workers, the Workers ultimately launch Executors using the information provided by executorEnvs. The amount of memory an Executor uses can be specified with the configuration spark.executor.memory, or with the environment variables SPARK_EXECUTOR_MEMORY or SPARK_MEM. If none of these are set, the default is 1024 MB. The code is as follows:
_executorMemory = _conf.getOption("spark.executor.memory")
.orElse(Option(System.getenv("SPARK_EXECUTOR_MEMORY")))
.orElse(Option(System.getenv("SPARK_MEM"))
.map(warnSparkMem))
.map(Utils.memoryStringToMb)
// defaults to 1024 MB if none of the above are set
.getOrElse(1024)
// Convert java options to env vars as a work around
// since we can't set env vars directly in sbt.
for { (envKey, propKey) <- Seq(("SPARK_TESTING", "spark.testing"))
value <- Option(System.getenv(envKey)).orElse(Option(System.getProperty(propKey)))} {
executorEnvs(envKey) = value
}
Option(System.getenv("SPARK_PREPEND_CLASSES")).foreach { v =>
executorEnvs("SPARK_PREPEND_CLASSES") = v
}
// The Mesos scheduler backend relies on this environment variable to set executor memory.
// TODO: Set this only in the Mesos scheduler.
executorEnvs("SPARK_EXECUTOR_MEMORY") = executorMemory + "m"
executorEnvs ++= _conf.getExecutorEnv
executorEnvs("SPARK_USER") = sparkUser