Background:
The company chose Apache Beam for big-data application development. Apache Beam provides a set of portable Java APIs; "portable" here means that a program written with Beam can run on today's popular compute engines, such as Spark and Flink, without any changes to the code.
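As a point of reference, a runner-agnostic Beam pipeline looks roughly like the sketch below (the class name and the transforms are purely illustrative, not our actual application code). The execution engine is selected only by the --runner flag parsed from the program arguments:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

public class PortabilityDemo {
    public static void main(String[] args) {
        // --runner=SparkRunner (or FlinkRunner, DirectRunner, ...) is parsed from args;
        // nothing below refers to a concrete execution engine.
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
        Pipeline p = Pipeline.create(options);
        p.apply(Create.of("hello", "beam"))
         .apply(MapElements.into(TypeDescriptors.strings())
                           .via((String s) -> s.toUpperCase()));
        p.run().waitUntilFinish();
    }
}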
Running a Beam program on Spark requires JARs from Spark, Hadoop, and even Kafka.
If the Beam program is packaged as a single fat JAR, it runs on Spark without any trouble:
spark-submit --master yarn-cluster --name RealTimeAPP --class com.data.analytics.app.RealTimeAPP RealTimeAPP-1.0.0-SNAPSHOT.jar --runner=SparkRunner
Here is the catch: such a fat JAR is huge, over 200 MB. Management wanted the small core application separated from the large dependency JAR, so that the big dependency JAR only has to be uploaded to the server once; whenever the core application changes, only the core JAR needs to be redeployed (which, of course, only works as long as the core application's dependencies have not changed).
So spark-submit turned into the following. ${JARS} is the path to the dependency JARs, which we put on HDFS; if there are several JARs, separate them with commas and make sure there are no spaces around the commas:
spark-submit --master yarn-cluster \
--name ${CLASS_NAME} \
--class ${PACKAGE_NAME}.${CLASS_NAME} \
--conf spark.driver.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties \
--conf spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties \
--conf spark.yarn.maxAppAttempts=${SPARK_YARN_MAXAPPATTEMPTS} \
--conf spark.yarn.am.attemptFailuresValidityInterval=${SPARK_YARN_AM_ATTEMPTFAILURESVALIDITYINTERVAL} \
--conf spark.yarn.max.executor.failures=${SPARK_YARN_MAX_EXECUTOR_FAILURES} \
--conf spark.yarn.executor.failuresValidityInterval=${SPARK_YARN_EXECUTOR_FAILURESVALIDITYINTERVAL} \
--conf spark.streaming.receiver.writeAheadLog.enable=${SPARK_STREAMING_RECEIVER_WRITEAHEADLOG_ENABLE} \
--files ${CONF_DIR}/metrics.properties,${CONF_DIR}/log4j.properties,${CONF_DIR}/conf.properties \
--driver-memory ${DRIVER_MEMORY} \
--executor-memory ${EXECUTOR_MEMORY} \
--num-executors ${NUM_EXECUTORS} \
--jars ${JARS} \
${JAR_DIR}/${JAR_FILE} \
--runner=SparkRunner \
--batchIntervalMillis=60000
OK, that was a long preamble; here is the real topic:
Later a new project came along that needed Beam SQL, and a problem showed up:
the Beam SQL job ran fine in Spark local mode, but failed as soon as it was submitted in yarn-cluster or yarn-client mode.
The cause, it turned out, was a whole series of JAR conflicts.
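For context, our Beam SQL usage was nothing exotic; roughly the sketch below. It is written against the current SQL entry point, SqlTransform (in the Beam 2.2.0 SDK we used, the SQL extension was still experimental and exposed through a different class), but either way the SQL module pulls in Apache Calcite and a sizable set of transitive JARs, which is presumably where the extra conflicts came from:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

public class BeamSqlDemo {
    public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
        Pipeline p = Pipeline.create(options);

        Schema schema = Schema.builder().addStringField("word").addInt64Field("cnt").build();
        PCollection<Row> rows = p.apply(Create.of(
                Row.withSchema(schema).addValues("spark", 3L).build(),
                Row.withSchema(schema).addValues("beam", 1L).build())
            .withRowSchema(schema));

        // PCOLLECTION is the implicit table name when the query has a single input.
        rows.apply(SqlTransform.query("SELECT word, cnt FROM PCOLLECTION WHERE cnt > 1"));

        p.run().waitUntilFinish();
    }
}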
Solution:
1. Pass --conf spark.driver.userClassPathFirst=true to spark-submit.
With this setting, the driver loads classes in the order: user classpath -> Spark classpath -> system classpath.
We had also tried --driver-class-path ****.jar, but with that option classes are loaded only from the specified JARs; if a class is not found there, the job errors out and exits.
2. Exclude a large batch of classes from the dependency JAR in the pom (a maven-shade-plugin <filter>), because some classes must be the ones provided by the Spark runtime environment on the cluster, otherwise the job will not run:
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/LICENSE</exclude>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
<exclude>**/*.java</exclude>
<exclude>org/apache/hadoop/conf/**/*.class</exclude>
<exclude>org/apache/hadoop/fs/**/*.class</exclude>
<exclude>org/apache/hadoop/io/**/*.class</exclude>
<exclude>org/apache/hadoop/security/**/*.class</exclude>
<exclude>org/apache/hadoop/ipc/**/*.class</exclude>
<exclude>org/apache/log4j/**</exclude>
<exclude>org/slf4j/**</exclude>
<exclude>log4j.properties</exclude>
<exclude>com/codahale/**</exclude>
<exclude>scala/**</exclude>
<exclude>org/apache/hadoop/yarn/api/records/impl/pb/**</exclude>
<exclude>org/apache/hadoop/yarn/api/impl/pb/**</exclude>
<exclude>org/apache/hadoop/yarn/api/protocolrecords/impl/pb/**</exclude>
<exclude>org/apache/hadoop/net/**</exclude>
<exclude>org/apache/hadoop/hdfs/protocol/proto/**</exclude>
<exclude>com/google/protobuf/**</exclude>
<exclude>org/apache/spark/**</exclude>
<exclude>akka/**</exclude>
<exclude>org/apache/hadoop/util/**</exclude>
</excludes>
</filter>
Spark local:
spark-submit \
--master local \
--name RealTimeAPP2 \
--conf spark.driver.userClassPathFirst=true \
--class com.data.analytics.app.RealTimeAPP2 \
--jars RealTimeAPP2-dependencies-1.0.0-SNAPSHOT.jar \
RealTimeAPP2-1.0.0-SNAPSHOT.jar \
--runner=SparkRunner
Spark yarn-cluster:
spark-submit \
--master yarn-cluster \
--name RealTimeAPP2 \
--conf spark.driver.userClassPathFirst=true \
--class com.data.analytics.app.RealTimeAPP2 \
--jars hdfs://testserver:8020/user/spark/lib/RealTimeAPP2/RealTimeAPP2-dependencies-1.0.0-SNAPSHOT.jar \
RealTimeAPP2-1.0.0-SNAPSHOT.jar \
--runner=SparkRunner
Apache Beam: 2.2.0
Apache Spark: 1.6.3