Spark 2.1.0: Dissecting spark-shell
The post 《Spark2.1.0——运行环境准备》 explained how to prepare a basic Spark runtime environment, and 《Spark2.1.0——Spark初体验》 walked through running word count in spark-shell, showing that Spark jobs can be submitted from spark-shell. By now you are probably wondering: what exactly does spark-shell do?
Script analysis
In the bin directory under the Spark installation you will find spark-shell, which contains the snippet shown in Code Listing 1-1.
Code Listing 1-1: The spark-shell script
function main() {
  if $cygwin; then
    stty -icanon min 1 -echo > /dev/null 2>&1
    export SPARK_SUBMIT_OPTS="$SPARK_SUBMIT_OPTS -Djline.terminal=unix"
    "${SPARK_HOME}"/bin/spark-submit --class org.apache.spark.repl.Main --name "Spark shell" "$@"
    stty icanon echo > /dev/null 2>&1
  else
    export SPARK_SUBMIT_OPTS
    "${SPARK_HOME}"/bin/spark-submit --class org.apache.spark.repl.Main --name "Spark shell" "$@"
  fi
}
We can see that the spark-shell script executes spark-submit. Opening spark-submit in turn reveals the code shown in Code Listing 1-2.
Code Listing 1-2: The spark-submit script
if [ -z "${SPARK_HOME}" ]; then
  source "$(dirname "$0")"/find-spark-home
fi

# disable randomized hash for string in Python 3.3+
export PYTHONHASHSEED=0

exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"
We can see that spark-submit in turn executes the spark-class script. Opening spark-class, the first thing we find is the following snippet:
# Find the java binary
if [ -n "${JAVA_HOME}" ]; then
  RUNNER="${JAVA_HOME}/bin/java"
else
  if [ "$(command -v java)" ]; then
    RUNNER="java"
  else
    echo "JAVA_HOME is not set" >&2
    exit 1
  fi
fi
This snippet locates the java binary. Further down, spark-class also contains the following:
build_command() {
  "$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
  printf "%d\0" $?
}

CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <(build_command "$@")
Recall from Code Listing 1-2 that spark-submit passes org.apache.spark.deploy.SparkSubmit to spark-class as an argument; build_command above then invokes org.apache.spark.launcher.Main to assemble the actual java command around that class. So by this point it should be clear that spark-shell ultimately starts a JVM process whose main class is SparkSubmit.
Remote monitoring
To make it easy to monitor the Spark process remotely from the local machine, locate the following setting in the spark-shell script:
SPARK_SUBMIT_OPTS="$SPARK_SUBMIT_OPTS -Dscala.usejavacp=true"
and append the following JMX options to it:
-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=10207 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false
If Spark is installed on another machine, add that machine as a remote host after opening jvisualvm locally, as shown in Figure 1.
Right-click the newly added remote host and add a JMX connection, as shown in Figure 2.
If Spark is installed locally, the org.apache.spark.deploy.SparkSubmit process appears in jvisualvm's Applications window as soon as it opens; simply double-click it.
Switch to the Threads tab on the right, select the main thread, and click the Thread Dump button, as shown in Figure 3.
Figure 3: Viewing Spark threads
In the dump output, the information for the main thread is shown in Code Listing 1-3.
Code Listing 1-3: Dump of the main thread
"main" #1 prio=5 os_prio=31 tid=0x00007fa012802000 nid=0x1303 runnable [0x000000010d11c000]
   java.lang.Thread.State: RUNNABLE
    at java.io.FileInputStream.read0(Native Method)
    at java.io.FileInputStream.read(FileInputStream.java:207)
    at jline.internal.NonBlockingInputStream.read(NonBlockingInputStream.java:169)
    - locked <0x00000007837a8ab8> (a jline.internal.NonBlockingInputStream)
    at jline.internal.NonBlockingInputStream.read(NonBlockingInputStream.java:137)
    at jline.internal.NonBlockingInputStream.read(NonBlockingInputStream.java:246)
    at jline.internal.InputStreamReader.read(InputStreamReader.java:261)
    - locked <0x00000007837a8ab8> (a jline.internal.NonBlockingInputStream)
    at jline.internal.InputStreamReader.read(InputStreamReader.java:198)
    - locked <0x00000007837a8ab8> (a jline.internal.NonBlockingInputStream)
    at jline.console.ConsoleReader.readCharacter(ConsoleReader.java:2145)
    at jline.console.ConsoleReader.readLine(ConsoleReader.java:2349)
    at jline.console.ConsoleReader.readLine(ConsoleReader.java:2269)
    at scala.tools.nsc.interpreter.jline.InteractiveReader.readOneLine(JLineReader.scala:57)
    at scala.tools.nsc.interpreter.InteractiveReader$$anonfun$readLine$2.apply(InteractiveReader.scala:37)
    at scala.tools.nsc.interpreter.InteractiveReader$$anonfun$readLine$2.apply(InteractiveReader.scala:37)
    at scala.tools.nsc.interpreter.InteractiveReader$.restartSysCalls(InteractiveReader.scala:44)
    at scala.tools.nsc.interpreter.InteractiveReader$class.readLine(InteractiveReader.scala:37)
    at scala.tools.nsc.interpreter.jline.InteractiveReader.readLine(JLineReader.scala:28)
    at scala.tools.nsc.interpreter.ILoop.readOneLine(ILoop.scala:404)
    at scala.tools.nsc.interpreter.ILoop.loop(ILoop.scala:413)
    at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:923)
    at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
    at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
    at scala.reflect.internal.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:97)
    at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:909)
    at org.apache.spark.repl.Main$.doMain(Main.scala:68)
    at org.apache.spark.repl.Main$.main(Main.scala:51)
    at org.apache.spark.repl.Main.main(Main.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
The main thread's stack reveals the program's call sequence: SparkSubmit.main → repl.Main → ILoop.process.
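As an aside, the thread dump that jvisualvm displays can also be captured programmatically through the same JMX endpoint. The following is a minimal Scala sketch, not part of Spark, assuming the port 10207 configured above, authentication disabled as in the options shown earlier, and that it runs on the machine where spark-shell was started:

import java.lang.management.{ManagementFactory, ThreadMXBean}
import javax.management.remote.{JMXConnectorFactory, JMXServiceURL}

object JmxThreadDump {
  def main(args: Array[String]): Unit = {
    // Port 10207 matches the -Dcom.sun.management.jmxremote.port option added to SPARK_SUBMIT_OPTS.
    val url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:10207/jmxrmi")
    val connector = JMXConnectorFactory.connect(url)
    try {
      // Obtain a ThreadMXBean proxy backed by the remote SparkSubmit JVM.
      val threads = ManagementFactory.newPlatformMXBeanProxy(
        connector.getMBeanServerConnection,
        ManagementFactory.THREAD_MXBEAN_NAME,
        classOf[ThreadMXBean])
      // Dump every live thread, much like the "Thread Dump" button in jvisualvm.
      threads.dumpAllThreads(false, false).foreach { info =>
        println(info.getThreadName)
        info.getStackTrace.take(5).foreach(frame => println(s"    at $frame"))
      }
    } finally {
      connector.close()
    }
  }
}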
Source code analysis
Following the clues above, we go straight to the source of ILoop's process method (ILoop, a class from Scala's own library, implements the interactive shell, i.e. the REPL, Read-Eval-Print Loop); see Code Listing 1-4.
Code Listing 1-4: The process implementation
def process(settings: Settings): Boolean = savingContextLoader {
  this.settings = settings
  createInterpreter()

  // sets in to some kind of reader depending on environmental cues
  in = in0.fold(chooseReader(settings))(r => SimpleReader(r, out, interactive = true))
  globalFuture = future {
    intp.initializeSynchronous()
    loopPostInit()
    !intp.reporter.hasErrors
  }
  loadFiles(settings)
  printWelcome()

  try loop() match {
    case LineResults.EOF => out print Properties.shellInterruptedString
    case _ =>
  }
  catch AbstractOrMissingHandler()
  finally closeInterpreter()

  true
}
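Incidentally, process is exactly the entry point you would call to embed a plain Scala REPL in your own program. The sketch below is not Spark code; it assumes Scala 2.11 (the version Spark 2.1.0 is built against) with scala-compiler on the classpath, and its usejavacp setting mirrors the -Dscala.usejavacp=true option that spark-shell adds to SPARK_SUBMIT_OPTS:

import scala.tools.nsc.Settings
import scala.tools.nsc.interpreter.ILoop

object MiniRepl {
  def main(args: Array[String]): Unit = {
    val settings = new Settings
    // Let the REPL use the JVM classpath, just as spark-shell does via -Dscala.usejavacp=true.
    settings.usejavacp.value = true
    // Enter the read-eval-print loop; returns when the user types :quit or sends EOF.
    new ILoop().process(settings)
  }
}

Spark does not call ILoop this directly; instead it subclasses it, which is where the next step comes in.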
As Code Listing 1-4 shows, ILoop's process method calls loadFiles. Spark's SparkILoop extends ILoop and overrides loadFiles as follows:
override def loadFiles(settings: Settings): Unit = {
  initializeSpark()
  super.loadFiles(settings)
}
As the code above shows, loadFiles calls SparkILoop's initializeSpark method, whose implementation appears in Code Listing 1-5.
Code Listing 1-5: The initializeSpark implementation
def initializeSpark() {
  intp.beQuietDuring {
    processLine("""
      @transient val spark = if (org.apache.spark.repl.Main.sparkSession != null) {
          org.apache.spark.repl.Main.sparkSession
        } else {
          org.apache.spark.repl.Main.createSparkSession()
        }
      @transient val sc = {
        val _sc = spark.sparkContext
        if (_sc.getConf.getBoolean("spark.ui.reverseProxy", false)) {
          val proxyUrl = _sc.getConf.get("spark.ui.reverseProxyUrl", null)
          if (proxyUrl != null) {
            println(s"Spark Context Web UI is available at ${proxyUrl}/proxy/${_sc.applicationId}")
          } else {
            println(s"Spark Context Web UI is available at Spark Master Public URL")
          }
        } else {
          _sc.uiWebUrl.foreach {
            webUrl => println(s"Spark context Web UI available at ${webUrl}")
          }
        }
        println("Spark context available as 'sc' " +
          s"(master = ${_sc.master}, app id = ${_sc.applicationId}).")
        println("Spark session available as 'spark'.")
        _sc
      }
      """)
    processLine("import org.apache.spark.SparkContext._")
    processLine("import spark.implicits._")
    processLine("import spark.sql")
    processLine("import org.apache.spark.sql.functions._")
    replayCommandStack = Nil // remove above commands from session history.
  }
}
We can see that initializeSpark feeds a block of code to the interactive shell. The Scala REPL evaluates it and calls org.apache.spark.repl.Main's createSparkSession method (see Code Listing 1-6) to create a SparkSession. The value spark holds a reference to that SparkSession, and sc holds the SparkContext initialized inside it, which is why both sc and spark can be used directly in spark-shell's interactive shell.
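For example, immediately after the shell starts you can use both handles. The snippet below is just an arbitrary illustration typed at the scala> prompt; the $-column syntax works thanks to the import spark.implicits._ issued by initializeSpark:

// sc and spark are predefined by initializeSpark; nothing needs to be created here.
val rdd = sc.parallelize(1 to 100)            // RDD API via the SparkContext
println(rdd.filter(_ % 2 == 0).count())       // prints 50

val df = spark.range(100).toDF("id")          // DataFrame API via the SparkSession
df.filter($"id" % 2 === 0).show(5)            // $"id" comes from spark.implicits._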
Code Listing 1-6: The createSparkSession implementation
def createSparkSession(): SparkSession = {
  val execUri = System.getenv("SPARK_EXECUTOR_URI")
  conf.setIfMissing("spark.app.name", "Spark shell")
  conf.set("spark.repl.class.outputDir", outputDir.getAbsolutePath())
  if (execUri != null) {
    conf.set("spark.executor.uri", execUri)
  }
  if (System.getenv("SPARK_HOME") != null) {
    conf.setSparkHome(System.getenv("SPARK_HOME"))
  }
  val builder = SparkSession.builder.config(conf)
  if (conf.get(CATALOG_IMPLEMENTATION.key, "hive").toLowerCase == "hive") {
    if (SparkSession.hiveClassesArePresent) {
      sparkSession = builder.enableHiveSupport().getOrCreate()
      logInfo("Created Spark session with Hive support")
    } else {
      builder.config(CATALOG_IMPLEMENTATION.key, "in-memory")
      sparkSession = builder.getOrCreate()
      logInfo("Created Spark session")
    }
  } else {
    sparkSession = builder.getOrCreate()
    logInfo("Created Spark session")
  }
  sparkContext = sparkSession.sparkContext
  sparkSession
}
As Code Listing 1-6 shows, createSparkSession builds the SparkSession instance through the SparkSession builder API. APIs such as SparkSession are covered in Chapter 10 of the book 《Spark内核设计的艺术》; readers new to Spark only need a general impression of them for now.
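For comparison, a standalone application does not get spark and sc for free; it has to go through the same builder API that createSparkSession uses above. Below is a minimal sketch with an arbitrary application name and a local master, purely for illustration:

import org.apache.spark.sql.SparkSession

object StandaloneExample {
  def main(args: Array[String]): Unit = {
    // Same builder API used by createSparkSession, here without Hive support.
    val spark = SparkSession.builder
      .appName("standalone-example")        // hypothetical application name
      .master("local[*]")                   // run locally just for illustration
      .getOrCreate()

    val sc = spark.sparkContext             // the SparkContext that spark-shell exposes as sc
    println(sc.parallelize(1 to 10).sum())  // prints 55.0

    spark.stop()
  }
}

In other words, spark-shell simply performs these same setup steps on your behalf before handing you the prompt.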