Background:
Our company adopted Apache Beam for big data development. Apache Beam provides a set of general-purpose Java APIs, where "general-purpose" means that a program written with Apache Beam
can run on today's popular compute engines, such as Spark and Flink, without any code changes.
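
To make this concrete, here is a minimal sketch of a runner-agnostic Beam pipeline (the class name and sample data are made up for illustration). The same code is submitted unchanged with --runner=SparkRunner, --runner=FlinkRunner, and so on:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;

public class PortableWordCount {
  public static void main(String[] args) {
    // The runner (SparkRunner, FlinkRunner, DirectRunner, ...) is chosen from the
    // command-line arguments; nothing engine-specific appears in the pipeline code.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline p = Pipeline.create(options);

    p.apply(Create.of("beam", "spark", "flink", "beam"))
     .apply(Count.<String>perElement());

    p.run().waitUntilFinish();
  }
}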

To run a Beam program on Spark, it needs a number of JARs from Spark, Hadoop, and even Kafka.

If the Beam program is built as a single fat JAR, it runs on Spark without any problem:
spark-submit --master yarn-cluster --name RealTimeAPP --class com.data.analytics.app.RealTimeAPP RealTimeAPP-1.0.0-SNAPSHOT.jar --runner=SparkRunner

The catch: such a fat JAR is huge, more than 200 MB. Management wanted the small core program separated from the large dependency package, so that the big dependency JAR only has to be uploaded to the server once,
and when the core program changes only the core JAR needs to be redeployed (which of course only works while the core program's dependencies stay unchanged).
So spark-submit became:


spark-submit --master yarn-cluster \
    --name ${CLASS_NAME} \
    --class ${PACKAGE_NAME}.${CLASS_NAME} \
    --conf spark.driver.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties \
    --conf spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties \
    --conf spark.yarn.maxAppAttempts=${SPARK_YARN_MAXAPPATTEMPTS} \
    --conf spark.yarn.am.attemptFailuresValidityInterval=${SPARK_YARN_AM_ATTEMPTFAILURESVALIDITYINTERVAL} \
    --conf spark.yarn.max.executor.failures=${SPARK_YARN_MAX_EXECUTOR_FAILURES} \
    --conf spark.yarn.executor.failuresValidityInterval=${SPARK_YARN_EXECUTOR_FAILURESVALIDITYINTERVAL} \
    --conf spark.streaming.receiver.writeAheadLog.enable=${SPARK_STREAMING_RECEIVER_WRITEAHEADLOG_ENABLE} \
    --files ${CONF_DIR}/metrics.properties,${CONF_DIR}/log4j.properties,${CONF_DIR}/conf.properties \
    --driver-memory ${DRIVER_MEMORY} \
    --executor-memory ${EXECUTOR_MEMORY} \
    --num-executors ${NUM_EXECUTORS} \
    --jars ${JARS} \
    ${JAR_DIR}/${JAR_FILE} \
    --runner=SparkRunner \
    --batchIntervalMillis=60000

${JARS} is the path of the dependency JAR(s); we put them on HDFS. If there are several dependency JARs, separate them with commas and make sure there are no spaces before or after the commas.

    
    
OK, that was a long preamble. Now for the real issue:
    
Later a new project came along that needed Beam SQL, and a problem appeared:
the Beam SQL job runs fine in Spark local mode, but fails as soon as it is submitted in yarn-cluster or yarn-client mode.
The cause turned out to be a series of JAR conflicts.
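
For context, a Beam SQL pipeline looks roughly like the sketch below. Note that this sketch uses the SqlTransform API of newer Beam releases; in Beam 2.2.0 (the version we were on) the SQL extension exposed differently named classes, so treat it purely as an illustration of the kind of job involved, not as code from this project:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

public class BeamSqlSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // A tiny in-memory "table" with an explicit schema.
    Schema schema = Schema.builder().addStringField("word").addInt64Field("cnt").build();
    PCollection<Row> rows = p.apply(Create.of(
            Row.withSchema(schema).addValues("beam", 3L).build(),
            Row.withSchema(schema).addValues("spark", 5L).build())
        .withRowSchema(schema));

    // Query the PCollection through SQL; it is registered as the implicit table PCOLLECTION.
    rows.apply(SqlTransform.query("SELECT word, cnt FROM PCOLLECTION WHERE cnt > 3"));

    p.run().waitUntilFinish();
  }
}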

Solution:

1. Pass --conf spark.driver.userClassPathFirst=true to spark-submit.

spark.driver.userClassPathFirst=true
With this property set, Java classes are loaded in the following order: user classpath -> Spark classpath -> system classpath, so the classes we ship with the job take precedence over the ones bundled with Spark.
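
To check which JAR a suspect class is actually being loaded from (and therefore whether userClassPathFirst is taking effect), a small diagnostic like the sketch below can help; the default class name used here is only an example, not something specific to our project:

public class WhichJar {
  public static void main(String[] args) throws ClassNotFoundException {
    // Pick a class that exists both in your dependency JAR and in the Spark/Hadoop runtime,
    // e.g. a protobuf or Hadoop class; the default below is just an illustration.
    String className = args.length > 0 ? args[0] : "com.google.protobuf.Message";
    Class<?> clazz = Class.forName(className);
    // getCodeSource() returns null for classes loaded by the bootstrap class loader (java.lang.* etc.).
    java.security.CodeSource src = clazz.getProtectionDomain().getCodeSource();
    System.out.println(className + " loaded from: "
        + (src == null ? "bootstrap class loader" : src.getLocation()));
  }
}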

I also tried --driver-class-path ****.jar, but with that option classes are only looked up in the specified JAR; if a class is not in that JAR, the job fails with an error and exits.

2. In the pom, exclude a large batch of classes from the packaged JAR, because some of them must come from the Spark environment on the cluster, otherwise the job cannot run:

                <filter>
                  <artifact>*:*</artifact>
                  <excludes>
                    <exclude>META-INF/LICENSE</exclude>
                    <exclude>META-INF/*.SF</exclude>
                    <exclude>META-INF/*.DSA</exclude>
                    <exclude>META-INF/*.RSA</exclude>
                    <exclude>**/*.java</exclude>
                    <exclude>org/apache/hadoop/conf/**/*.class</exclude>
                    <exclude>org/apache/hadoop/fs/**/*.class</exclude>
                    <exclude>org/apache/hadoop/io/**/*.class</exclude>
                    <exclude>org/apache/hadoop/security/**/*.class</exclude>
                    <exclude>org/apache/hadoop/ipc/**/*.class</exclude>
                    <exclude>org/apache/log4j/**</exclude>
                    <exclude>org/slf4j/**</exclude>
                    <exclude>log4j.properties</exclude>
                    <exclude>com/codahale/**</exclude>
                    <exclude>scala/**</exclude>
                    <exclude>org/apache/hadoop/yarn/api/records/impl/pb/**</exclude>
                    <exclude>org/apache/hadoop/yarn/api/impl/pb/**</exclude>
                    <exclude>org/apache/hadoop/yarn/api/protocolrecords/impl/pb/**</exclude>
                    <exclude>org/apache/hadoop/net/**</exclude>
                    <exclude>org/apache/hadoop/hdfs/protocol/proto/**</exclude>
                    <exclude>com/google/protobuf/**</exclude>
                    <exclude>org/apache/spark/**</exclude>
                    <exclude>akka/**</exclude>
                    <exclude>org/apache/hadoop/util/**</exclude>
                  </excludes>
                </filter>

                
                
Spark local mode:
spark-submit \
--master local \
--name RealTimeAPP2 \
--conf spark.driver.userClassPathFirst=true  \
--class com.data.analytics.app.RealTimeAPP2 \
--jars RealTimeAPP2-dependencies-1.0.0-SNAPSHOT.jar \
RealTimeAPP2-1.0.0-SNAPSHOT.jar \
--runner=SparkRunner


Spark yarn-cluster mode:
spark-submit \
--master yarn-cluster  \
--name RealTimeAPP2 \
--conf spark.driver.userClassPathFirst=true  \
--class com.data.analytics.app.RealTimeAPP2 \
--jars hdfs://testserver:8020/user/spark/lib/RealTimeAPP2/RealTimeAPP2-dependencies-1.0.0-SNAPSHOT.jar \
RealTimeAPP2-1.0.0-SNAPSHOT.jar  \
--runner=SparkRunner




    

Apache Beam: 2.2.0
Apache Spark: 1.6.3


posted on 2018-02-12 11:16 by robynn