Spark Installation
1. Download: http://spark.apache.org/downloads.html
2. Extract the archive
tar -zxvf spark-2.4.4-bin-hadoop2.7.tgz -C /opt/module/
3. Run the first example program in local mode
bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --executor-memory 1G \
  --total-executor-cores 2 \
  ./examples/jars/spark-examples_2.11-2.4.4.jar 200
... ...
19/09/05 11:13:27 INFO Executor: Running task 198.0 in stage 0.0 (TID 198)
19/09/05 11:13:27 INFO Executor: Finished task 198.0 in stage 0.0 (TID 198). 824 bytes result sent to driver
19/09/05 11:13:27 INFO TaskSetManager: Starting task 199.0 in stage 0.0 (TID 199, localhost, executor driver, partition 199, PROCESS_LOCAL, 7866 bytes)
19/09/05 11:13:27 INFO TaskSetManager: Finished task 198.0 in stage 0.0 (TID 198) in 6 ms on localhost (executor driver) (199/200)
19/09/05 11:13:27 INFO Executor: Running task 199.0 in stage 0.0 (TID 199)
19/09/05 11:13:27 INFO Executor: Finished task 199.0 in stage 0.0 (TID 199). 781 bytes result sent to driver
19/09/05 11:13:27 INFO TaskSetManager: Finished task 199.0 in stage 0.0 (TID 199) in 9 ms on localhost (executor driver) (200/200)
19/09/05 11:13:27 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
19/09/05 11:13:27 INFO DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:38) finished in 3.129 s
19/09/05 11:13:27 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 3.262553 s
Pi is roughly 3.1416157570807877
19/09/05 11:13:27 INFO SparkUI: Stopped Spark web UI at http://vmhome10.com:4040
19/09/05 11:13:27 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
19/09/05 11:13:27 INFO MemoryStore: MemoryStore cleared
19/09/05 11:13:27 INFO BlockManager: BlockManager stopped
19/09/05 11:13:27 INFO BlockManagerMaster: BlockManagerMaster stopped
19/09/05 11:13:27 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
19/09/05 11:13:27 INFO SparkContext: Successfully stopped SparkContext
19/09/05 11:13:27 INFO ShutdownHookManager: Shutdown hook called
19/09/05 11:13:27 INFO ShutdownHookManager: Deleting directory /tmp/spark-7a49f112-3630-4ef6-b4dc-1c46af32c133
19/09/05 11:13:27 INFO ShutdownHookManager: Deleting directory /tmp/spark-6ee58588-7298-4623-b10b-6310e628060d
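For reference, SparkPi estimates Pi by Monte Carlo sampling. Below is a minimal sketch of the same idea that you can paste into spark-shell (introduced in step 4 below); the constants are assumptions for illustration, not the exact SparkPi source:

val slices = 200                      // mirrors the "200" argument above
val n = 100000 * slices               // total number of sample points
val count = sc.parallelize(1 to n, slices).map { _ =>
  val x = math.random * 2 - 1         // random point in the 2x2 square
  val y = math.random * 2 - 1
  if (x * x + y * y <= 1) 1 else 0    // 1 if the point falls inside the unit circle
}.reduce(_ + _)
println(s"Pi is roughly ${4.0 * count / n}")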
Parameter reference:
./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]

--master spark://vmhome10.com:7077: the address of the Master
--class: the entry class of your application (e.g. org.apache.spark.examples.SparkPi)
--deploy-mode: whether to deploy your driver on a worker node (cluster) or run it locally as an external client (client) (default: client)
--conf: an arbitrary Spark configuration property in key=value format; if the value contains spaces, quote it as "key=value"
application-jar: the packaged application jar, including dependencies. The URL must be globally visible inside the cluster, e.g. an hdfs:// path on shared storage; with a file:// path, the same jar must exist at that path on every node
application-arguments: arguments passed to the main() method
--executor-memory 1G: give each executor 1 GB of memory
--total-executor-cores 2: use 2 CPU cores in total across all executors
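To make --class and <application-jar> concrete, here is a minimal sketch of an application you could package into a jar and submit; the object name MyApp and the app name are made up for this example:

import org.apache.spark.sql.SparkSession

object MyApp {
  def main(args: Array[String]): Unit = {
    // --master, --conf, etc. from the spark-submit command line are
    // picked up here, so nothing needs to be hard-coded.
    val spark = SparkSession.builder.appName("MyApp").getOrCreate()
    val sc = spark.sparkContext
    println(sc.parallelize(1 to 100).sum())   // a trivial job: prints 5050.0
    spark.stop()
  }
}

You would then pass --class MyApp and the path to its jar as <application-jar>.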
4. Start the interactive shell
bin/spark-shell
19/09/05 11:42:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://vmhome10.com:4040
Spark context available as 'sc' (master = local[*], app id = local-1567654930914).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_201)
Type in expressions to have them evaluated.
Type :help for more information.
If spark-shell is started without a master address, it still starts and executes programs normally; it is actually running in Spark's local mode, which starts a single process on the local machine and never connects to a cluster.
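You can confirm which mode the shell is in via the sc object (the res number below is illustrative):

scala> sc.master
res0: String = local[*]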
Starting the shell with arguments to connect to a cluster:
bin/spark-shell \
  --master spark://vmhome10.com:7077 \
  --executor-memory 1g \
  --total-executor-cores 2
The Spark shell initializes the SparkContext as the object sc by default; user code that needs it can use sc directly. The shell also creates a SparkSession as the object spark, which is the entry point for Spark SQL.
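As a quick illustration of the spark object, the following registers a small DataFrame as a view and queries it with SQL; the view name t is made up for this example:

scala> spark.range(5).toDF("n").createOrReplaceTempView("t")
scala> spark.sql("SELECT sum(n) FROM t").show()   // prints a one-row table: sum(n) = 10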
Run a word count in the shell:
scala> sc.textFile("/home/hadoop/1.txt").flatMap(_.split(",")).map((_,1)).reduceByKey(_+_).collect
res2: Array[(String, Int)] = Array((192.168.1.1,2), (mytest,1), (wow,5), (1990,1), (xu.dm,4), (192.168.1.3,1), (dnf,4), (sword,2), (192.168.1.2,2), (hdfs,2), (blade,2), (2000,3))
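A natural follow-up is sorting the counts, e.g. taking the three most frequent words (output elided; ties may come back in either order):

scala> sc.textFile("/home/hadoop/1.txt").flatMap(_.split(",")).map((_,1)).reduceByKey(_+_).sortBy(_._2, ascending = false).take(3)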