|NO.Z.00052|——————————|BigDataEnd|——|Hadoop&Spark.V02|——|Spark.v12|spark sql原理|sql解析过程|
一、SQL解析过程
### --- sql解析过程
~~~ Spark SQL 可以说是 Spark 中的精华部分。
~~~ 原来基于 RDD 构建大数据计算任务,重心在向 DataSet 转移,原来基于 RDD 写的代码也在迁移。
~~~ 使用 Spark SQL 编码好处是非常大的,尤其是在性能方面,有很大提升。
~~~ Spark SQL 中各种内嵌的性能优化比写 RDD 遵守各种最佳实践更靠谱的,尤其对新手来说。
~~~ 如先 filter 操作再 map 操作,Spark SQL 中会自动进行谓词下推;
~~~ Spark SQL中会自动使用 broadcast join 来广播小表,把 shuffle join 转化为 map join 等等。
~~~ Spark SQL对SQL语句的处理和关系型数据库类似,即词法/语法解析、绑定、优化、执行。
~~~ Spark SQL会先将SQL语句解析成一棵树,然后使用规则(Rule)对Tree进行绑定、优化等处理过程。
### --- Spark SQL由Core、Catalyst、Hive、Hive-ThriftServer四部分构成:
~~~ Core: 负责处理数据的输入和输出,如获取数据,查询结果输出成DataFrame等
~~~ Catalyst: 负责处理整个查询过程,包括解析、绑定、优化等
~~~ Hive: 负责对Hive数据进行处理
~~~ Hive-ThriftServer: 主要用于对Hive的访问


~~~ # Spark SQL的代码复杂度是问题的本质复杂度带来的,
~~~ Spark SQL中的 Catalyst 框架大部分逻辑是在一个 Tree 类型的数据结构上做各种折腾,
~~~ 基于 Scala 来实现还是很优雅的,Scala 的偏函数和强大的 Case 正则匹配,
~~~ 让整个代码看起来非常优雅。
~~~ SparkSession 是编写 Spark 应用代码的入口,
~~~ 启动一个 spark-shell 会提供给你一个创建 SparkSession,
~~~ 这个对象是整个 Spark 应用的起始点。以下是SparkSession 的一些重要的变量和方法:
二、sparksql类
类 | 功能 |
catalog | 通过对这个类可以操作元数据,对数据库、表、 函数进行增删改查,内部使用SessionCatalog完成具体操作 |
table | 把一个table或view包装为一个DataFrame进行后续操作 |
emptyDataset/emptyDataFrame | 创建空的Dataset 或 DataFrame |
sql | 执行sql,返回一个DataFrame |
read或readStream | 获取数据读取器,读取各种格式的数据 |
sessionState | 维护了当前session使用的所有状态数据; 还包括SQL解析器、分析器、优化器等 |
### --- 源码提取说明
~~~ # 源码提取说明:sessionstate.scala
~~~ 59行
private[sql] class SessionState(
sharedState: SharedState,
val conf: SQLConf,
val experimentalMethods: ExperimentalMethods,
val functionRegistry: FunctionRegistry,
val udfRegistration: UDFRegistration,
catalogBuilder: () => SessionCatalog,
val sqlParser: ParserInterface,
analyzerBuilder: () => Analyzer,
optimizerBuilder: () => Optimizer,
val planner: SparkPlanner,
val streamingQueryManager: StreamingQueryManager,
val listenerManager: ExecutionListenerManager,
resourceLoaderBuilder: () => SessionResourceLoader,
createQueryExecution: LogicalPlan => QueryExecution,
createClone: (SparkSession, SessionState) => SessionState) {
三、编程代码实现
### --- 编程代码实现
package cn.yanqi.sparksql
import org.apache.spark.sql.{DataFrame, SparkSession}
object Plan {
def main(args: Array[String]): Unit = {
val spark = SparkSession
.builder()
.appName("Demo1")
.master("local[*]")
.getOrCreate()
spark.sparkContext.setLogLevel("warn")
import spark.implicits._
Seq((0, "zhansan", 10),
(1, "lisi", 11),
(2, "wangwu", 12)).toDF("id", "name", "age").createOrReplaceTempView("stu")
Seq((0, "chinese", 80), (0, "math", 100), (0, "english", 98),
(1, "chinese", 86), (1, "math", 97), (1, "english", 90),
(2, "chinese", 90), (2, "math", 94), (2, "english", 88)
).toDF("id", "subject", "score").createOrReplaceTempView("score")
val df: DataFrame = spark.sql(
"""
|select sum(v), name
| from (select stu.id, 100 + 10 + score.score as v, name
| from stu join score
| where stu.id = score.id and stu.age >= 11) tmp
|group by name
|""".stripMargin)
println(df.queryExecution)
// df.show()
val df1: DataFrame = spark.sql(
"""
|select sum(v), name
| from (select stu.id, 100 + 10 + score.score as v, name
| from stu join score on stu.id = score.id where stu.age >= 11) tmp
|group by name
|""".stripMargin)
println(df1.queryExecution)
val df2: DataFrame = spark.sql(
"""
|select sum(v), name
| from (select stu.id, 100 + 10 + score.score as v, name
| from stu join score on stu.id = score.id where stu.age >= 11) tmp
|group by name
|""".stripMargin)
println(df2.queryExecution)
// df1.show()
// 打印执行计划
// println(df.queryExecution)
spark.close()
}
}
### --- 编译打印
D:\JAVA\jdk1.8.0_231\bin\java.exe "-javaagent:D:\IntelliJIDEA\IntelliJ IDEA 2019.3.3\lib\idea_rt.jar=61251:D:\IntelliJIDEA\IntelliJ IDEA 2019.3.3\bin" -Dfile.encoding=UTF-8 -classpath D:\JAVA\jdk1.8.0_231\jre\lib\charsets.jar;D:\JAVA\jdk1.8.0_231\jre\lib\deploy.jar;D:\JAVA\jdk1.8.0_231\jre\lib\ext\access-bridge-64.jar;D:\JAVA\jdk1.8.0_231\jre\lib\ext\cldrdata.jar;D:\JAVA\jdk1.8.0_231\jre\lib\ext\dnsns.jar;D:\JAVA\jdk1.8.0_231\jre\lib\ext\jaccess.jar;D:\JAVA\jdk1.8.0_231\jre\lib\ext\jfxrt.jar;D:\JAVA\jdk1.8.0_231\jre\lib\ext\localedata.jar;D:\JAVA\jdk1.8.0_231\jre\lib\ext\nashorn.jar;D:\JAVA\jdk1.8.0_231\jre\lib\ext\sunec.jar;D:\JAVA\jdk1.8.0_231\jre\lib\ext\sunjce_provider.jar;D:\JAVA\jdk1.8.0_231\jre\lib\ext\sunmscapi.jar;D:\JAVA\jdk1.8.0_231\jre\lib\ext\sunpkcs11.jar;D:\JAVA\jdk1.8.0_231\jre\lib\ext\zipfs.jar;D:\JAVA\jdk1.8.0_231\jre\lib\javaws.jar;D:\JAVA\jdk1.8.0_231\jre\lib\jce.jar;D:\JAVA\jdk1.8.0_231\jre\lib\jfr.jar;D:\JAVA\jdk1.8.0_231\jre\lib\jfxswt.jar;D:\JAVA\jdk1.8.0_231\jre\lib\jsse.jar;D:\JAVA\jdk1.8.0_231\jre\lib\management-agent.jar;D:\JAVA\jdk1.8.0_231\jre\lib\plugin.jar;D:\JAVA\jdk1.8.0_231\jre\lib\resources.jar;D:\JAVA\jdk1.8.0_231\jre\lib\rt.jar;E:\NO.Z.80000.Hadoop.spark\SparkBigData\target\classes;C:\Users\Administrator\.m2\repository\org\scala-lang\scala-library\2.12.10\scala-library-2.12.10.jar;C:\Users\Administrator\.m2\repository\org\apache\spark\spark-core_2.12\2.4.5\spark-core_2.12-2.4.5.jar;C:\Users\Administrator\.m2\repository\com\thoughtworks\paranamer\paranamer\2.8\paranamer-2.8.jar;C:\Users\Administrator\.m2\repository\org\apache\avro\avro\1.8.2\avro-1.8.2.jar;C:\Users\Administrator\.m2\repository\org\codehaus\jackson\jackson-core-asl\1.9.13\jackson-core-asl-1.9.13.jar;C:\Users\Administrator\.m2\repository\org\apache\commons\commons-compress\1.8.1\commons-compress-1.8.1.jar;C:\Users\Administrator\.m2\repository\org\tukaani\xz\1.5\xz-1.5.jar;C:\Users\Administrator\.m2\repository\org\apache\avro\avro-mapred\1.8.2\avro-mapred-1.8.2-hadoop2.jar;C:\Users\Administrator\.m2\repository\org\apache\avro\avro-ipc\1.8.2\avro-ipc-1.8.2.jar;C:\Users\Administrator\.m2\repository\com\twitter\chill_2.12\0.9.3\chill_2.12-0.9.3.jar;C:\Users\Administrator\.m2\repository\com\esotericsoftware\kryo-shaded\4.0.2\kryo-shaded-4.0.2.jar;C:\Users\Administrator\.m2\repository\com\esotericsoftware\minlog\1.3.0\minlog-1.3.0.jar;C:\Users\Administrator\.m2\repository\org\objenesis\objenesis\2.5.1\objenesis-2.5.1.jar;C:\Users\Administrator\.m2\repository\com\twitter\chill-java\0.9.3\chill-java-0.9.3.jar;C:\Users\Administrator\.m2\repository\org\apache\xbean\xbean-asm6-shaded\4.8\xbean-asm6-shaded-4.8.jar;C:\Users\Administrator\.m2\repository\org\apache\hadoop\hadoop-client\2.6.5\hadoop-client-2.6.5.jar;C:\Users\Administrator\.m2\repository\org\apache\hadoop\hadoop-common\2.6.5\hadoop-common-2.6.5.jar;C:\Users\Administrator\.m2\repository\xmlenc\xmlenc\0.52\xmlenc-0.52.jar;C:\Users\Administrator\.m2\repository\commons-collections\commons-collections\3.2.2\commons-collections-3.2.2.jar;C:\Users\Administrator\.m2\repository\commons-configuration\commons-configuration\1.6\commons-configuration-1.6.jar;C:\Users\Administrator\.m2\repository\commons-digester\commons-digester\1.8\commons-digester-1.8.jar;C:\Users\Administrator\.m2\repository\commons-beanutils\commons-beanutils\1.7.0\commons-beanutils-1.7.0.jar;C:\Users\Administrator\.m2\repository\com\google\code\gson\gson\2.2.4\gson-2.2.4.jar;C:\Users\Administrator\.m2\repository\org\apache\hadoop\hadoop-auth\2.6.5\hadoop-auth-2.6.5.jar;C:\Users\Administrator\.m2\repository\org\apache\directory\server\apacheds-kerberos-codec\2.0.0-M15\apacheds-kerberos-codec-2.0.0-M15.jar;C:\Users\Administrator\.m2\repository\org\apache\directory\server\apacheds-i18n\2.0.0-M15\apacheds-i18n-2.0.0-M15.jar;C:\Users\Administrator\.m2\repository\org\apache\directory\api\api-asn1-api\1.0.0-M20\api-asn1-api-1.0.0-M20.jar;C:\Users\Administrator\.m2\repository\org\apache\directory\api\api-util\1.0.0-M20\api-util-1.0.0-M20.jar;C:\Users\Administrator\.m2\repository\org\apache\curator\curator-client\2.6.0\curator-client-2.6.0.jar;C:\Users\Administrator\.m2\repository\org\htrace\htrace-core\3.0.4\htrace-core-3.0.4.jar;C:\Users\Administrator\.m2\repository\org\apache\hadoop\hadoop-hdfs\2.6.5\hadoop-hdfs-2.6.5.jar;C:\Users\Administrator\.m2\repository\org\mortbay\jetty\jetty-util\6.1.26\jetty-util-6.1.26.jar;C:\Users\Administrator\.m2\repository\xerces\xercesImpl\2.9.1\xercesImpl-2.9.1.jar;C:\Users\Administrator\.m2\repository\xml-apis\xml-apis\1.3.04\xml-apis-1.3.04.jar;C:\Users\Administrator\.m2\repository\org\apache\hadoop\hadoop-mapreduce-client-app\2.6.5\hadoop-mapreduce-client-app-2.6.5.jar;C:\Users\Administrator\.m2\repository\org\apache\hadoop\hadoop-mapreduce-client-common\2.6.5\hadoop-mapreduce-client-common-2.6.5.jar;C:\Users\Administrator\.m2\repository\org\apache\hadoop\hadoop-yarn-client\2.6.5\hadoop-yarn-client-2.6.5.jar;C:\Users\Administrator\.m2\repository\org\apache\hadoop\hadoop-yarn-server-common\2.6.5\hadoop-yarn-server-common-2.6.5.jar;C:\Users\Administrator\.m2\repository\org\apache\hadoop\hadoop-mapreduce-client-shuffle\2.6.5\hadoop-mapreduce-client-shuffle-2.6.5.jar;C:\Users\Administrator\.m2\repository\org\apache\hadoop\hadoop-yarn-api\2.6.5\hadoop-yarn-api-2.6.5.jar;C:\Users\Administrator\.m2\repository\org\apache\hadoop\hadoop-mapreduce-client-core\2.6.5\hadoop-mapreduce-client-core-2.6.5.jar;C:\Users\Administrator\.m2\repository\org\apache\hadoop\hadoop-yarn-common\2.6.5\hadoop-yarn-common-2.6.5.jar;C:\Users\Administrator\.m2\repository\javax\xml\bind\jaxb-api\2.2.2\jaxb-api-2.2.2.jar;C:\Users\Administrator\.m2\repository\javax\xml\stream\stax-api\1.0-2\stax-api-1.0-2.jar;C:\Users\Administrator\.m2\repository\org\codehaus\jackson\jackson-jaxrs\1.9.13\jackson-jaxrs-1.9.13.jar;C:\Users\Administrator\.m2\repository\org\codehaus\jackson\jackson-xc\1.9.13\jackson-xc-1.9.13.jar;C:\Users\Administrator\.m2\repository\org\apache\hadoop\hadoop-mapreduce-client-jobclient\2.6.5\hadoop-mapreduce-client-jobclient-2.6.5.jar;C:\Users\Administrator\.m2\repository\org\apache\hadoop\hadoop-annotations\2.6.5\hadoop-annotations-2.6.5.jar;C:\Users\Administrator\.m2\repository\org\apache\spark\spark-launcher_2.12\2.4.5\spark-launcher_2.12-2.4.5.jar;C:\Users\Administrator\.m2\repository\org\apache\spark\spark-kvstore_2.12\2.4.5\spark-kvstore_2.12-2.4.5.jar;C:\Users\Administrator\.m2\repository\org\fusesource\leveldbjni\leveldbjni-all\1.8\leveldbjni-all-1.8.jar;C:\Users\Administrator\.m2\repository\com\fasterxml\jackson\core\jackson-core\2.6.7\jackson-core-2.6.7.jar;C:\Users\Administrator\.m2\repository\com\fasterxml\jackson\core\jackson-annotations\2.6.7\jackson-annotations-2.6.7.jar;C:\Users\Administrator\.m2\repository\org\apache\spark\spark-network-common_2.12\2.4.5\spark-network-common_2.12-2.4.5.jar;C:\Users\Administrator\.m2\repository\org\apache\spark\spark-network-shuffle_2.12\2.4.5\spark-network-shuffle_2.12-2.4.5.jar;C:\Users\Administrator\.m2\repository\org\apache\spark\spark-unsafe_2.12\2.4.5\spark-unsafe_2.12-2.4.5.jar;C:\Users\Administrator\.m2\repository\javax\activation\activation\1.1.1\activation-1.1.1.jar;C:\Users\Administrator\.m2\repository\org\apache\curator\curator-recipes\2.6.0\curator-recipes-2.6.0.jar;C:\Users\Administrator\.m2\repository\org\apache\curator\curator-framework\2.6.0\curator-framework-2.6.0.jar;C:\Users\Administrator\.m2\repository\com\google\guava\guava\16.0.1\guava-16.0.1.jar;C:\Users\Administrator\.m2\repository\org\apache\zookeeper\zookeeper\3.4.6\zookeeper-3.4.6.jar;C:\Users\Administrator\.m2\repository\javax\servlet\javax.servlet-api\3.1.0\javax.servlet-api-3.1.0.jar;C:\Users\Administrator\.m2\repository\org\apache\commons\commons-lang3\3.5\commons-lang3-3.5.jar;C:\Users\Administrator\.m2\repository\org\apache\commons\commons-math3\3.4.1\commons-math3-3.4.1.jar;C:\Users\Administrator\.m2\repository\com\google\code\findbugs\jsr305\1.3.9\jsr305-1.3.9.jar;C:\Users\Administrator\.m2\repository\org\slf4j\slf4j-api\1.7.16\slf4j-api-1.7.16.jar;C:\Users\Administrator\.m2\repository\org\slf4j\jul-to-slf4j\1.7.16\jul-to-slf4j-1.7.16.jar;C:\Users\Administrator\.m2\repository\org\slf4j\jcl-over-slf4j\1.7.16\jcl-over-slf4j-1.7.16.jar;C:\Users\Administrator\.m2\repository\log4j\log4j\1.2.17\log4j-1.2.17.jar;C:\Users\Administrator\.m2\repository\org\slf4j\slf4j-log4j12\1.7.16\slf4j-log4j12-1.7.16.jar;C:\Users\Administrator\.m2\repository\com\ning\compress-lzf\1.0.3\compress-lzf-1.0.3.jar;C:\Users\Administrator\.m2\repository\org\xerial\snappy\snappy-java\1.1.7.3\snappy-java-1.1.7.3.jar;C:\Users\Administrator\.m2\repository\org\lz4\lz4-java\1.4.0\lz4-java-1.4.0.jar;C:\Users\Administrator\.m2\repository\com\github\luben\zstd-jni\1.3.2-2\zstd-jni-1.3.2-2.jar;C:\Users\Administrator\.m2\repository\org\roaringbitmap\RoaringBitmap\0.7.45\RoaringBitmap-0.7.45.jar;C:\Users\Administrator\.m2\repository\org\roaringbitmap\shims\0.7.45\shims-0.7.45.jar;C:\Users\Administrator\.m2\repository\commons-net\commons-net\3.1\commons-net-3.1.jar;C:\Users\Administrator\.m2\repository\org\json4s\json4s-jackson_2.12\3.5.3\json4s-jackson_2.12-3.5.3.jar;C:\Users\Administrator\.m2\repository\org\json4s\json4s-core_2.12\3.5.3\json4s-core_2.12-3.5.3.jar;C:\Users\Administrator\.m2\repository\org\json4s\json4s-ast_2.12\3.5.3\json4s-ast_2.12-3.5.3.jar;C:\Users\Administrator\.m2\repository\org\json4s\json4s-scalap_2.12\3.5.3\json4s-scalap_2.12-3.5.3.jar;C:\Users\Administrator\.m2\repository\org\scala-lang\modules\scala-xml_2.12\1.0.6\scala-xml_2.12-1.0.6.jar;C:\Users\Administrator\.m2\repository\org\glassfish\jersey\core\jersey-client\2.22.2\jersey-client-2.22.2.jar;C:\Users\Administrator\.m2\repository\javax\ws\rs\javax.ws.rs-api\2.0.1\javax.ws.rs-api-2.0.1.jar;C:\Users\Administrator\.m2\repository\org\glassfish\hk2\hk2-api\2.4.0-b34\hk2-api-2.4.0-b34.jar;C:\Users\Administrator\.m2\repository\org\glassfish\hk2\hk2-utils\2.4.0-b34\hk2-utils-2.4.0-b34.jar;C:\Users\Administrator\.m2\repository\org\glassfish\hk2\external\aopalliance-repackaged\2.4.0-b34\aopalliance-repackaged-2.4.0-b34.jar;C:\Users\Administrator\.m2\repository\org\glassfish\hk2\external\javax.inject\2.4.0-b34\javax.inject-2.4.0-b34.jar;C:\Users\Administrator\.m2\repository\org\glassfish\hk2\hk2-locator\2.4.0-b34\hk2-locator-2.4.0-b34.jar;C:\Users\Administrator\.m2\repository\org\javassist\javassist\3.18.1-GA\javassist-3.18.1-GA.jar;C:\Users\Administrator\.m2\repository\org\glassfish\jersey\core\jersey-common\2.22.2\jersey-common-2.22.2.jar;C:\Users\Administrator\.m2\repository\javax\annotation\javax.annotation-api\1.2\javax.annotation-api-1.2.jar;C:\Users\Administrator\.m2\repository\org\glassfish\jersey\bundles\repackaged\jersey-guava\2.22.2\jersey-guava-2.22.2.jar;C:\Users\Administrator\.m2\repository\org\glassfish\hk2\osgi-resource-locator\1.0.1\osgi-resource-locator-1.0.1.jar;C:\Users\Administrator\.m2\repository\org\glassfish\jersey\core\jersey-server\2.22.2\jersey-server-2.22.2.jar;C:\Users\Administrator\.m2\repository\org\glassfish\jersey\media\jersey-media-jaxb\2.22.2\jersey-media-jaxb-2.22.2.jar;C:\Users\Administrator\.m2\repository\javax\validation\validation-api\1.1.0.Final\validation-api-1.1.0.Final.jar;C:\Users\Administrator\.m2\repository\org\glassfish\jersey\containers\jersey-container-servlet\2.22.2\jersey-container-servlet-2.22.2.jar;C:\Users\Administrator\.m2\repository\org\glassfish\jersey\containers\jersey-container-servlet-core\2.22.2\jersey-container-servlet-core-2.22.2.jar;C:\Users\Administrator\.m2\repository\io\netty\netty-all\4.1.42.Final\netty-all-4.1.42.Final.jar;C:\Users\Administrator\.m2\repository\io\netty\netty\3.9.9.Final\netty-3.9.9.Final.jar;C:\Users\Administrator\.m2\repository\com\clearspring\analytics\stream\2.7.0\stream-2.7.0.jar;C:\Users\Administrator\.m2\repository\io\dropwizard\metrics\metrics-core\3.1.5\metrics-core-3.1.5.jar;C:\Users\Administrator\.m2\repository\io\dropwizard\metrics\metrics-jvm\3.1.5\metrics-jvm-3.1.5.jar;C:\Users\Administrator\.m2\repository\io\dropwizard\metrics\metrics-json\3.1.5\metrics-json-3.1.5.jar;C:\Users\Administrator\.m2\repository\io\dropwizard\metrics\metrics-graphite\3.1.5\metrics-graphite-3.1.5.jar;C:\Users\Administrator\.m2\repository\com\fasterxml\jackson\core\jackson-databind\2.6.7.3\jackson-databind-2.6.7.3.jar;C:\Users\Administrator\.m2\repository\com\fasterxml\jackson\module\jackson-module-scala_2.12\2.6.7.1\jackson-module-scala_2.12-2.6.7.1.jar;C:\Users\Administrator\.m2\repository\org\scala-lang\scala-reflect\2.12.1\scala-reflect-2.12.1.jar;C:\Users\Administrator\.m2\repository\com\fasterxml\jackson\module\jackson-module-paranamer\2.7.9\jackson-module-paranamer-2.7.9.jar;C:\Users\Administrator\.m2\repository\org\apache\ivy\ivy\2.4.0\ivy-2.4.0.jar;C:\Users\Administrator\.m2\repository\oro\oro\2.0.8\oro-2.0.8.jar;C:\Users\Administrator\.m2\repository\net\razorvine\pyrolite\4.13\pyrolite-4.13.jar;C:\Users\Administrator\.m2\repository\net\sf\py4j\py4j\0.10.7\py4j-0.10.7.jar;C:\Users\Administrator\.m2\repository\org\apache\spark\spark-tags_2.12\2.4.5\spark-tags_2.12-2.4.5.jar;C:\Users\Administrator\.m2\repository\org\apache\commons\commons-crypto\1.0.0\commons-crypto-1.0.0.jar;C:\Users\Administrator\.m2\repository\org\spark-project\spark\unused\1.0.0\unused-1.0.0.jar;C:\Users\Administrator\.m2\repository\org\apache\spark\spark-sql_2.12\2.4.5\spark-sql_2.12-2.4.5.jar;C:\Users\Administrator\.m2\repository\com\univocity\univocity-parsers\2.7.3\univocity-parsers-2.7.3.jar;C:\Users\Administrator\.m2\repository\org\apache\spark\spark-sketch_2.12\2.4.5\spark-sketch_2.12-2.4.5.jar;C:\Users\Administrator\.m2\repository\org\apache\spark\spark-catalyst_2.12\2.4.5\spark-catalyst_2.12-2.4.5.jar;C:\Users\Administrator\.m2\repository\org\scala-lang\modules\scala-parser-combinators_2.12\1.1.0\scala-parser-combinators_2.12-1.1.0.jar;C:\Users\Administrator\.m2\repository\org\codehaus\janino\janino\3.0.9\janino-3.0.9.jar;C:\Users\Administrator\.m2\repository\org\codehaus\janino\commons-compiler\3.0.9\commons-compiler-3.0.9.jar;C:\Users\Administrator\.m2\repository\org\antlr\antlr4-runtime\4.7\antlr4-runtime-4.7.jar;C:\Users\Administrator\.m2\repository\org\apache\orc\orc-core\1.5.5\orc-core-1.5.5-nohive.jar;C:\Users\Administrator\.m2\repository\org\apache\orc\orc-shims\1.5.5\orc-shims-1.5.5.jar;C:\Users\Administrator\.m2\repository\com\google\protobuf\protobuf-java\2.5.0\protobuf-java-2.5.0.jar;C:\Users\Administrator\.m2\repository\commons-lang\commons-lang\2.6\commons-lang-2.6.jar;C:\Users\Administrator\.m2\repository\io\airlift\aircompressor\0.10\aircompressor-0.10.jar;C:\Users\Administrator\.m2\repository\org\apache\orc\orc-mapreduce\1.5.5\orc-mapreduce-1.5.5-nohive.jar;C:\Users\Administrator\.m2\repository\org\apache\parquet\parquet-column\1.10.1\parquet-column-1.10.1.jar;C:\Users\Administrator\.m2\repository\org\apache\parquet\parquet-common\1.10.1\parquet-common-1.10.1.jar;C:\Users\Administrator\.m2\repository\org\apache\parquet\parquet-encoding\1.10.1\parquet-encoding-1.10.1.jar;C:\Users\Administrator\.m2\repository\org\apache\parquet\parquet-hadoop\1.10.1\parquet-hadoop-1.10.1.jar;C:\Users\Administrator\.m2\repository\org\apache\parquet\parquet-format\2.4.0\parquet-format-2.4.0.jar;C:\Users\Administrator\.m2\repository\org\apache\parquet\parquet-jackson\1.10.1\parquet-jackson-1.10.1.jar;C:\Users\Administrator\.m2\repository\org\apache\arrow\arrow-vector\0.10.0\arrow-vector-0.10.0.jar;C:\Users\Administrator\.m2\repository\org\apache\arrow\arrow-format\0.10.0\arrow-format-0.10.0.jar;C:\Users\Administrator\.m2\repository\org\apache\arrow\arrow-memory\0.10.0\arrow-memory-0.10.0.jar;C:\Users\Administrator\.m2\repository\com\carrotsearch\hppc\0.7.2\hppc-0.7.2.jar;C:\Users\Administrator\.m2\repository\com\vlkan\flatbuffers\1.2.0-3f79e055\flatbuffers-1.2.0-3f79e055.jar;C:\Users\Administrator\.m2\repository\joda-time\joda-time\2.9.7\joda-time-2.9.7.jar;C:\Users\Administrator\.m2\repository\mysql\mysql-connector-java\5.1.44\mysql-connector-java-5.1.44.jar;C:\Users\Administrator\.m2\repository\org\apache\spark\spark-hive_2.12\2.4.5\spark-hive_2.12-2.4.5.jar;C:\Users\Administrator\.m2\repository\com\twitter\parquet-hadoop-bundle\1.6.0\parquet-hadoop-bundle-1.6.0.jar;C:\Users\Administrator\.m2\repository\org\spark-project\hive\hive-exec\1.2.1.spark2\hive-exec-1.2.1.spark2.jar;C:\Users\Administrator\.m2\repository\commons-io\commons-io\2.4\commons-io-2.4.jar;C:\Users\Administrator\.m2\repository\javolution\javolution\5.5.1\javolution-5.5.1.jar;C:\Users\Administrator\.m2\repository\log4j\apache-log4j-extras\1.2.17\apache-log4j-extras-1.2.17.jar;C:\Users\Administrator\.m2\repository\org\antlr\antlr-runtime\3.4\antlr-runtime-3.4.jar;C:\Users\Administrator\.m2\repository\org\antlr\stringtemplate\3.2.1\stringtemplate-3.2.1.jar;C:\Users\Administrator\.m2\repository\antlr\antlr\2.7.7\antlr-2.7.7.jar;C:\Users\Administrator\.m2\repository\org\antlr\ST4\4.0.4\ST4-4.0.4.jar;C:\Users\Administrator\.m2\repository\com\googlecode\javaewah\JavaEWAH\0.3.2\JavaEWAH-0.3.2.jar;C:\Users\Administrator\.m2\repository\org\iq80\snappy\snappy\0.2\snappy-0.2.jar;C:\Users\Administrator\.m2\repository\stax\stax-api\1.0.1\stax-api-1.0.1.jar;C:\Users\Administrator\.m2\repository\net\sf\opencsv\opencsv\2.3\opencsv-2.3.jar;C:\Users\Administrator\.m2\repository\org\spark-project\hive\hive-metastore\1.2.1.spark2\hive-metastore-1.2.1.spark2.jar;C:\Users\Administrator\.m2\repository\com\jolbox\bonecp\0.8.0.RELEASE\bonecp-0.8.0.RELEASE.jar;C:\Users\Administrator\.m2\repository\commons-cli\commons-cli\1.2\commons-cli-1.2.jar;C:\Users\Administrator\.m2\repository\commons-logging\commons-logging\1.1.3\commons-logging-1.1.3.jar;C:\Users\Administrator\.m2\repository\org\datanucleus\datanucleus-api-jdo\3.2.6\datanucleus-api-jdo-3.2.6.jar;C:\Users\Administrator\.m2\repository\org\datanucleus\datanucleus-rdbms\3.2.9\datanucleus-rdbms-3.2.9.jar;C:\Users\Administrator\.m2\repository\commons-pool\commons-pool\1.5.4\commons-pool-1.5.4.jar;C:\Users\Administrator\.m2\repository\commons-dbcp\commons-dbcp\1.4\commons-dbcp-1.4.jar;C:\Users\Administrator\.m2\repository\javax\jdo\jdo-api\3.0.1\jdo-api-3.0.1.jar;C:\Users\Administrator\.m2\repository\javax\transaction\jta\1.1\jta-1.1.jar;C:\Users\Administrator\.m2\repository\commons-httpclient\commons-httpclient\3.1\commons-httpclient-3.1.jar;C:\Users\Administrator\.m2\repository\org\apache\calcite\calcite-avatica\1.2.0-incubating\calcite-avatica-1.2.0-incubating.jar;C:\Users\Administrator\.m2\repository\org\apache\calcite\calcite-core\1.2.0-incubating\calcite-core-1.2.0-incubating.jar;C:\Users\Administrator\.m2\repository\org\apache\calcite\calcite-linq4j\1.2.0-incubating\calcite-linq4j-1.2.0-incubating.jar;C:\Users\Administrator\.m2\repository\net\hydromatic\eigenbase-properties\1.1.5\eigenbase-properties-1.1.5.jar;C:\Users\Administrator\.m2\repository\org\apache\httpcomponents\httpclient\4.5.6\httpclient-4.5.6.jar;C:\Users\Administrator\.m2\repository\org\apache\httpcomponents\httpcore\4.4.10\httpcore-4.4.10.jar;C:\Users\Administrator\.m2\repository\org\codehaus\jackson\jackson-mapper-asl\1.9.13\jackson-mapper-asl-1.9.13.jar;C:\Users\Administrator\.m2\repository\commons-codec\commons-codec\1.10\commons-codec-1.10.jar;C:\Users\Administrator\.m2\repository\org\jodd\jodd-core\3.5.2\jodd-core-3.5.2.jar;C:\Users\Administrator\.m2\repository\org\datanucleus\datanucleus-core\3.2.10\datanucleus-core-3.2.10.jar;C:\Users\Administrator\.m2\repository\org\apache\thrift\libthrift\0.9.3\libthrift-0.9.3.jar;C:\Users\Administrator\.m2\repository\org\apache\thrift\libfb303\0.9.3\libfb303-0.9.3.jar;C:\Users\Administrator\.m2\repository\org\apache\derby\derby\10.12.1.1\derby-10.12.1.1.jar cn.yanqi.sparksql.Plan
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
== Parsed Logical Plan ==
'Aggregate ['name], [unresolvedalias('sum('v), None), 'name]
+- 'SubqueryAlias `tmp`
+- 'Project ['stu.id, ((100 + 10) + 'score.score) AS v#26, 'name]
+- 'Filter (('stu.id = 'score.id) && ('stu.age >= 11))
+- 'Join Inner
:- 'UnresolvedRelation `stu`
+- 'UnresolvedRelation `score`
== Analyzed Logical Plan ==
sum(v): bigint, name: string
Aggregate [name#8], [sum(cast(v#26 as bigint)) AS sum(v)#28L, name#8]
+- SubqueryAlias `tmp`
+- Project [id#7, ((100 + 10) + score#22) AS v#26, name#8]
+- Filter ((id#7 = id#20) && (age#9 >= 11))
+- Join Inner
:- SubqueryAlias `stu`
: +- Project [_1#3 AS id#7, _2#4 AS name#8, _3#5 AS age#9]
: +- LocalRelation [_1#3, _2#4, _3#5]
+- SubqueryAlias `score`
+- Project [_1#16 AS id#20, _2#17 AS subject#21, _3#18 AS score#22]
+- LocalRelation [_1#16, _2#17, _3#18]
== Optimized Logical Plan ==
Aggregate [name#8], [sum(cast(v#26 as bigint)) AS sum(v)#28L, name#8]
+- Project [(110 + score#22) AS v#26, name#8]
+- Join Inner, (id#7 = id#20)
:- LocalRelation [id#7, name#8]
+- LocalRelation [id#20, score#22]
== Physical Plan ==
*(2) HashAggregate(keys=[name#8], functions=[sum(cast(v#26 as bigint))], output=[sum(v)#28L, name#8])
+- Exchange hashpartitioning(name#8, 200)
+- *(1) HashAggregate(keys=[name#8], functions=[partial_sum(cast(v#26 as bigint))], output=[name#8, sum#32L])
+- *(1) Project [(110 + score#22) AS v#26, name#8]
+- *(1) BroadcastHashJoin [id#7], [id#20], Inner, BuildLeft
:- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
: +- LocalTableScan [id#7, name#8]
+- LocalTableScan [id#20, score#22]
== Parsed Logical Plan ==
'Aggregate ['name], [unresolvedalias('sum('v), None), 'name]
+- 'SubqueryAlias `tmp`
+- 'Project ['stu.id, ((100 + 10) + 'score.score) AS v#33, 'name]
+- 'Filter ('stu.age >= 11)
+- 'Join Inner, ('stu.id = 'score.id)
:- 'UnresolvedRelation `stu`
+- 'UnresolvedRelation `score`
== Analyzed Logical Plan ==
sum(v): bigint, name: string
Aggregate [name#8], [sum(cast(v#33 as bigint)) AS sum(v)#35L, name#8]
+- SubqueryAlias `tmp`
+- Project [id#7, ((100 + 10) + score#22) AS v#33, name#8]
+- Filter (age#9 >= 11)
+- Join Inner, (id#7 = id#20)
:- SubqueryAlias `stu`
: +- Project [_1#3 AS id#7, _2#4 AS name#8, _3#5 AS age#9]
: +- LocalRelation [_1#3, _2#4, _3#5]
+- SubqueryAlias `score`
+- Project [_1#16 AS id#20, _2#17 AS subject#21, _3#18 AS score#22]
+- LocalRelation [_1#16, _2#17, _3#18]
== Optimized Logical Plan ==
Aggregate [name#8], [sum(cast(v#33 as bigint)) AS sum(v)#35L, name#8]
+- Project [(110 + score#22) AS v#33, name#8]
+- Join Inner, (id#7 = id#20)
:- LocalRelation [id#7, name#8]
+- LocalRelation [id#20, score#22]
== Physical Plan ==
*(2) HashAggregate(keys=[name#8], functions=[sum(cast(v#33 as bigint))], output=[sum(v)#35L, name#8])
+- Exchange hashpartitioning(name#8, 200)
+- *(1) HashAggregate(keys=[name#8], functions=[partial_sum(cast(v#33 as bigint))], output=[name#8, sum#39L])
+- *(1) Project [(110 + score#22) AS v#33, name#8]
+- *(1) BroadcastHashJoin [id#7], [id#20], Inner, BuildLeft
:- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
: +- LocalTableScan [id#7, name#8]
+- LocalTableScan [id#20, score#22]
== Parsed Logical Plan ==
'Aggregate ['name], [unresolvedalias('sum('v), None), 'name]
+- 'SubqueryAlias `tmp`
+- 'Project ['stu.id, ((100 + 10) + 'score.score) AS v#40, 'name]
+- 'Filter ('stu.age >= 11)
+- 'Join Inner, ('stu.id = 'score.id)
:- 'UnresolvedRelation `stu`
+- 'UnresolvedRelation `score`
== Analyzed Logical Plan ==
sum(v): bigint, name: string
Aggregate [name#8], [sum(cast(v#40 as bigint)) AS sum(v)#42L, name#8]
+- SubqueryAlias `tmp`
+- Project [id#7, ((100 + 10) + score#22) AS v#40, name#8]
+- Filter (age#9 >= 11)
+- Join Inner, (id#7 = id#20)
:- SubqueryAlias `stu`
: +- Project [_1#3 AS id#7, _2#4 AS name#8, _3#5 AS age#9]
: +- LocalRelation [_1#3, _2#4, _3#5]
+- SubqueryAlias `score`
+- Project [_1#16 AS id#20, _2#17 AS subject#21, _3#18 AS score#22]
+- LocalRelation [_1#16, _2#17, _3#18]
== Optimized Logical Plan ==
Aggregate [name#8], [sum(cast(v#40 as bigint)) AS sum(v)#42L, name#8]
+- Project [(110 + score#22) AS v#40, name#8]
+- Join Inner, (id#7 = id#20)
:- LocalRelation [id#7, name#8]
+- LocalRelation [id#20, score#22]
== Physical Plan ==
*(2) HashAggregate(keys=[name#8], functions=[sum(cast(v#40 as bigint))], output=[sum(v)#42L, name#8])
+- Exchange hashpartitioning(name#8, 200)
+- *(1) HashAggregate(keys=[name#8], functions=[partial_sum(cast(v#40 as bigint))], output=[name#8, sum#46L])
+- *(1) Project [(110 + score#22) AS v#40, name#8]
+- *(1) BroadcastHashJoin [id#7], [id#20], Inner, BuildLeft
:- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
: +- LocalTableScan [id#7, name#8]
+- LocalTableScan [id#20, score#22]
Process finished with exit code 0
四、SQL解析过程
### --- SQL解析过程
~~~ queryExecution 就是整个执行计划的执行引擎里面有执行过程中各个中间过程变量,
~~~ 整个执行流程如下:

### --- 上面例子中的 SQL 语句经过 Parser 解析后就会变成一个抽象语法树,对应解析后的逻辑计划 AST 为:
== Parsed Logical Plan ==
'Aggregate ['name], [unresolvedalias('sum('v), None), 'name]
+- 'SubqueryAlias `tmp`
+- 'Project ['stu.id, ((100 + 10) + 'score.score) AS v#26, 'name]
+- 'Filter (('stu.id = 'score.id) && ('stu.age >= 11))
+- 'Join Inner:- 'UnresolvedRelation `stu`
+- 'UnresolvedRelation `score`
### --- 备注:在执行计划中 Project/Projection 代表的意思是投影选投连三种最基本的操作
~~~ 其中过滤条件变为了 Filter 节点,这个节点是 UnaryNode(一元节点) 类型, 只有一个孩子。
~~~ 两个表中的数据变为了 UnresolvedRelation 节点,节点类型为 LeafNode,
~~~ 即叶子节点, JOIN 操作为节点, 这个是一个 BinaryNode 节点,有两个孩子。
~~~ 以上节点都是 LogicalPlan 类型的, 可以理解为进行各种操作的 Operator,
~~~ SparkSQL 对各种操作定义了各种 Operator。

五、 operator 组成的抽象语法树

~~~ # 这些 operator 组成的抽象语法树就是整个 Catatyst 优化的基础,
~~~ Catatyst 优化器会在这个树上面进行各种折腾,把树上面的节点挪来挪去来进行优化。
~~~ 经过 Parser 有了抽象语法树,但是并不知道 score,sum 这些东西是啥,
~~~ 所以就需要 analyer 来定位。
~~~ analyzer 会把 AST 上所有 Unresolved 的东西都转变为 resolved 状态,
~~~ SparkSQL有很多resolve 规则:
~~~ ResolverRelations。解析表(列)的基本类型等信息
~~~ ResolveFuncions。解析出来函数的基本信息
~~~ ResolveReferences。解析引用,通常是解析列名
== Analyzed Logical Plan ==
sum(v): bigint, name: string
Aggregate [name#8], [sum(cast(v#26 as bigint)) AS sum(v)#28L, name#8]
+- SubqueryAlias `tmp`
+- Project [id#7, ((100 + 10) + score#22) AS v#26, name#8]
+- Filter ((id#7 = id#20) && (age#9 >= 11))
+- Join Inner:- SubqueryAlias `stu`:
+- Project [_1#3 AS id#7, _2#4 AS name#8, _3#5 AS age#9]:
+- LocalRelation [_1#3, _2#4, _3#5]
+- SubqueryAlias `score`
+- Project [_1#16 AS id#20, _2#17 AS subject#21, _3#18 AS score#22]
+- LocalRelation [_1#16, _2#17, _3#18]

六、下面要进行逻辑优化了,常见的逻辑优化有:


== Optimized Logical Plan ==
Aggregate [name#8], [sum(cast(v#26 as bigint)) AS sum(v)#28L, name#8]
+- Project [(110 + score#22) AS v#26, name#8]
+- Join Inner, (id#7 = id#20) :- LocalRelation [id#7, name#8]
+- LocalRelation [id#20, score#22]
### --- 这里用到的优化有:
~~~ 谓词下推(Push Down Predicate)、常量折叠(ConstantFolding)、字段裁剪(Columning Pruning)
~~~ 做完逻辑优化,还需要先转换为物理执行计划,将逻辑上可行的执行计划变为 Spark可以真正执行的计划:


~~~ # SparkSQL 把逻辑节点转换为了相应的物理节点,
~~~ 比如 Join 算子,Spark 根据不同场景为该算子制定了不同的算法策略。
== Physical Plan ==
*(2) HashAggregate(keys=[name#8], functions=[sum(cast(v#26 as bigint))], output=[sum(v)#28L, name#8])
+- Exchange hashpartitioning(name#8, 200)
+- *(1) HashAggregate(keys=[name#8], functions= [partial_sum(cast(v#26 as bigint))], output=[name#8, sum#38L])
+- *(1) Project [(110 + score#22) AS v#26, name#8]
+- *(1) BroadcastHashJoin [id#7], [id#20], Inner, BuildLeft :- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))):
+- LocalTableScan [id#7, name#8]
+- LocalTableScan [id#20, score#22]
~~~ # 数据在一个一个的 plan 中流转,然后每个 plan 里面表达式都会对数据进行处理,
~~~ 就相当于经过了一个个小函数的调用处理,这里面有大量的函数调用开销。
~~~ 是不是可以把这些小函数内联一下,当成一个大函数,WholeStageCodegen 就是干这事的。
~~~ 可以看到最终执行计划每个节点前面有个 * 号,
~~~ 说明整段代码生成被启用,Project、BroadcastHashJoin、HashAggregate 这一段都启用了整段代码生成,
~~~ 级联为了大函数。
Walter Savage Landor:strove with none,for none was worth my strife.Nature I loved and, next to Nature, Art:I warm'd both hands before the fire of life.It sinks, and I am ready to depart
——W.S.Landor
分类:
bdv016-spark.v01
【推荐】国内首个AI IDE,深度理解中文开发场景,立即下载体验Trae
【推荐】编程新体验,更懂你的AI,立即体验豆包MarsCode编程助手
【推荐】抖音旗下AI助手豆包,你的智能百科全书,全免费不限次数
【推荐】轻量又高性能的 SSH 工具 IShell:AI 加持,快人一步
· 全程不用写代码,我用AI程序员写了一个飞机大战
· MongoDB 8.0这个新功能碉堡了,比商业数据库还牛
· 记一次.NET内存居高不下排查解决与启示
· 白话解读 Dapr 1.15:你的「微服务管家」又秀新绝活了
· DeepSeek 开源周回顾「GitHub 热点速览」