Spark 1.0 introduced spark-submit as a single, unified way to submit applications.
./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  ... # other options
  <application-jar> \
  [application-arguments]
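--class: the entry point of the application, i.e. its main class;
--master: the master URL of the cluster;
--deploy-mode: where the driver is deployed, either on a node inside the cluster (cluster) or locally on the submitting machine (client, the default);
application-jar: the jar containing the application code; it can be stored on HDFS or on the local file system;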
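To make the template above concrete, here is a minimal sketch that assembles a spark-submit command line from its four main parts and prints it instead of running it. The function name build_submit_cmd is hypothetical and not part of Spark; the only thing it checks is that the deploy mode is one of the two values spark-submit accepts.

```shell
#!/bin/sh
# Sketch: assemble a spark-submit command line and print it (dry run).
# build_submit_cmd is a hypothetical helper, not something Spark ships.
# Arguments: <main-class> <master-url> <deploy-mode> <application-jar> [app args...]
build_submit_cmd() {
  if [ "$#" -lt 4 ]; then
    echo "usage: build_submit_cmd <main-class> <master-url> <deploy-mode> <application-jar> [args...]" >&2
    return 1
  fi
  main_class=$1
  master_url=$2
  deploy_mode=$3
  app_jar=$4
  shift 4

  # spark-submit accepts only "client" or "cluster" as the deploy mode.
  case "$deploy_mode" in
    client|cluster) ;;
    *) echo "invalid deploy mode: $deploy_mode" >&2; return 1 ;;
  esac

  # Rebuild the positional parameters as the full command, then print them
  # joined by single spaces. Arguments containing spaces are not re-quoted.
  set -- spark-submit --class "$main_class" --master "$master_url" \
    --deploy-mode "$deploy_mode" "$app_jar" "$@"
  echo "$*"
}
```

Called with the values used in this post (com.luogankun.spark.WordCount, spark://hadoop000:7077, client, /home/spark/data/spark.jar, hdfs://hadoop000:8020/hello.txt), it prints the corresponding spark-submit line, which can be inspected before actually running it.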
Standalone mode example:
spark-submit \
  --name SparkSubmit_Demo \
  --class com.luogankun.spark.WordCount \
  --master spark://hadoop000:7077 \
  --executor-memory 1G \
  --total-executor-cores 1 \
  /home/spark/data/spark.jar \
  hdfs://hadoop000:8020/hello.txt
In standalone mode, --master must be set to the master URL of the Spark cluster;
yarn-client mode example:
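spark-submit \
  --name SparkSubmit_Demo \
  --class com.luogankun.spark.WordCount \
  --master yarn-client \
  --executor-memory 1G \
  --num-executors 1 \
  --executor-cores 1 \
  /home/spark/data/spark.jar \
  hdfs://hadoop000:8020/hello.txt
(--total-executor-cores applies only to standalone and Mesos clusters; on YARN the executor count and cores per executor are set with --num-executors and --executor-cores.)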
yarn-cluster mode example:
spark-submit \
  --name SparkSubmit_Demo \
  --class com.luogankun.spark.WordCount \
  --master yarn-cluster \
  --executor-memory 1G \
  --num-executors 1 \
  --executor-cores 1 \
  /home/spark/data/spark.jar \
  hdfs://hadoop000:8020/hello.txt
Note: to submit to YARN, HADOOP_CONF_DIR must be set and point to the directory that holds the Hadoop client configuration files.
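As a sanity check before submitting to YARN, something like the following sketch can confirm that HADOOP_CONF_DIR is usable. The function name check_hadoop_conf_dir is hypothetical, and the check for yarn-site.xml assumes a conventional Hadoop client configuration layout; adjust it if your distribution organises its files differently.

```shell
#!/bin/sh
# Sketch: verify HADOOP_CONF_DIR before running spark-submit against YARN.
# check_hadoop_conf_dir is a hypothetical helper, not part of Spark or Hadoop.
check_hadoop_conf_dir() {
  # Fail if the variable is unset or empty.
  if [ -z "${HADOOP_CONF_DIR:-}" ]; then
    echo "HADOOP_CONF_DIR is not set" >&2
    return 1
  fi
  # Assumption: the client configuration directory contains yarn-site.xml,
  # which tells the client where the ResourceManager is.
  if [ ! -f "${HADOOP_CONF_DIR}/yarn-site.xml" ]; then
    echo "no yarn-site.xml found under ${HADOOP_CONF_DIR}" >&2
    return 1
  fi
  echo "using Hadoop configuration from ${HADOOP_CONF_DIR}"
}
```

A typical use would be to export HADOOP_CONF_DIR in the shell profile or in conf/spark-env.sh (for example to /etc/hadoop/conf, which is only an assumed path here) and run check_hadoop_conf_dir before the spark-submit command.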
Difference between yarn-client and yarn-cluster: the two modes are distinguished by where the Driver runs.
yarn-client:
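The Client and the Driver run together, and the ApplicationMaster is used only to acquire resources. Results are printed to the client console in real time, which makes the logs easy to follow, so this mode is the recommended one;
After the application is submitted, YARN first launches the ApplicationMaster and then the Executors; both run inside Containers. Note: each Container runs exactly one ExecutorBackend;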
yarn-cluster:
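The Driver and the ApplicationMaster run together, so results cannot be shown on the client console; they have to be written to HDFS or to a database instead;
The Driver runs on the cluster, and its status can be checked through the web UI.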
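Because the driver's console output stays on the cluster in yarn-cluster mode, one way to read it afterwards is the yarn logs command. The sketch below wraps that call; fetch_app_logs is a hypothetical helper name, the ID check is deliberately loose, and it assumes YARN log aggregation is enabled on the cluster.

```shell
#!/bin/sh
# Sketch: fetch the aggregated container logs (driver output included)
# for a finished YARN application.
# fetch_app_logs is a hypothetical helper; "yarn logs -applicationId" is
# the Hadoop CLI command it wraps.
# Assumption: log aggregation is enabled on the cluster.
fetch_app_logs() {
  app_id=$1
  # Loose shape check: YARN application IDs look like
  # application_<cluster timestamp>_<sequence number>.
  case "$app_id" in
    application_[0-9]*_[0-9]*) ;;
    *) echo "unexpected application id: $app_id" >&2; return 1 ;;
  esac
  yarn logs -applicationId "$app_id"
}
```

The application ID is shown in the ResourceManager web UI and in the output of yarn application -list; pass it to fetch_app_logs and pipe the result through a pager or grep as needed.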