Spark Standalone Mode

因为Spark与Hadoop是关联的,所以在安装Spark前应该根据已安装的Hadoop版本来选择待安装的Sqark版本,要不然就会报“Server IPC version X cannot communicate with client version Y”的错误。

我安装的Hadoop版本为Hadoop2.4.0(下载),选择的Spark版本为spark-1.2.0-bin-hadoop2.4.tgz(下载)。要注意的是Spark和Scala存在一定的版本兼容问题,参考我的另一篇博客中记录的问题。

官方文档:http://spark.apache.org/docs/latest/spark-standalone.html#cluster-launch-scripts

 

Spark依赖与Scala,所以还需要预装Scala,我下载的版本为scala-2.11.5.tgz,配置Scala的环境变量:

export SCALA_HOME=/opt/scala/scala-2.11.5
export PATH=$PATH:$SCALA_HOME/bin

修改后使环境变量生效,查看Scala版本:

 

然后配置Spark的环境变量:

export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin

配置后使环境变量修改生效。

在 ${SPARK_HOME}/conf 目录下做如下操作:

cp spark-env.sh.template spark-env.sh

修改 spark-env.sh ,在文件最后添加(视具体配置路径而定):

export SCALA_HOME=/opt/scala/scala-2.11.5
export SPARK_MASTER_IP=127.0.0.1
export SPARK_WORKER_MEMORY=2G
export JAVA_HOME=/usr/lib/jvm/jdk1.7.0_75
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop

 

配置完毕。

 

启动Spark前,先启动Hadoop:

hadoop@tinylcy:/usr/local/hadoop$ sbin/start-all.sh

 

然后启动Spark:

hadoop@tinylcy:/opt/spark$ sbin/start-all.sh 

 

切换到Spark的bin目录,进入交互模式:

hadoop@tinylcy:/opt/spark$ bin/spark-shell

 

测试:

scala> val textFile=sc.textFile("hdfs://localhost:9000/user/hadoop/input/words.txt")

 

scala> val count=textFile.flatMap(line=>line.split(" ")).map(word=>(word,1)).reduceByKey(_+_)

 

scala> count.collect()

 

再举一个例子:

scala> val data=Array(1,2,3,4,5)  //产生data
data: Array[Int] = Array(1, 2, 3, 4, 5)

scala> val distData=sc.parallelize(data)  //将data处理成RDD
distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:14

scala> distData.reduce(_+_)  //在RDD上进行运算,对data里面的元素进行加和
15/07/19 14:37:56 INFO spark.SparkContext: Starting job: reduce at <console>:17
15/07/19 14:37:56 INFO scheduler.DAGScheduler: Got job 0 (reduce at <console>:17) with 4 output partitions (allowLocal=false)
15/07/19 14:37:56 INFO scheduler.DAGScheduler: Final stage: Stage 0(reduce at <console>:17)
15/07/19 14:37:56 INFO scheduler.DAGScheduler: Parents of final stage: List()
15/07/19 14:37:56 INFO scheduler.DAGScheduler: Missing parents: List()
15/07/19 14:37:56 INFO scheduler.DAGScheduler: Submitting Stage 0 (ParallelCollectionRDD[0] at parallelize at <console>:14), which has no missing parents
15/07/19 14:37:56 WARN util.SizeEstimator: Failed to check whether UseCompressedOops is set; assuming yes
15/07/19 14:37:56 INFO storage.MemoryStore: ensureFreeSpace(1184) called with curMem=0, maxMem=278019440
15/07/19 14:37:56 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1184.0 B, free 265.1 MB)
15/07/19 14:37:56 INFO storage.MemoryStore: ensureFreeSpace(912) called with curMem=1184, maxMem=278019440
15/07/19 14:37:56 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 912.0 B, free 265.1 MB)
15/07/19 14:37:56 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:47649 (size: 912.0 B, free: 265.1 MB)
15/07/19 14:37:56 INFO storage.BlockManagerMaster: Updated info of block broadcast_0_piece0
15/07/19 14:37:56 INFO spark.SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:838
15/07/19 14:37:56 INFO scheduler.DAGScheduler: Submitting 4 missing tasks from Stage 0 (ParallelCollectionRDD[0] at parallelize at <console>:14)
15/07/19 14:37:56 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 4 tasks
15/07/19 14:37:56 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1204 bytes)
15/07/19 14:37:56 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, PROCESS_LOCAL, 1204 bytes)
15/07/19 14:37:56 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, localhost, PROCESS_LOCAL, 1204 bytes)
15/07/19 14:37:56 INFO scheduler.TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, localhost, PROCESS_LOCAL, 1208 bytes)
15/07/19 14:37:56 INFO executor.Executor: Running task 3.0 in stage 0.0 (TID 3)
15/07/19 14:37:56 INFO executor.Executor: Running task 1.0 in stage 0.0 (TID 1)
15/07/19 14:37:56 INFO executor.Executor: Running task 2.0 in stage 0.0 (TID 2)
15/07/19 14:37:56 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 0)
15/07/19 14:37:56 INFO executor.Executor: Finished task 0.0 in stage 0.0 (TID 0). 727 bytes result sent to driver
15/07/19 14:37:56 INFO executor.Executor: Finished task 1.0 in stage 0.0 (TID 1). 727 bytes result sent to driver
15/07/19 14:37:56 INFO executor.Executor: Finished task 3.0 in stage 0.0 (TID 3). 727 bytes result sent to driver
15/07/19 14:37:56 INFO executor.Executor: Finished task 2.0 in stage 0.0 (TID 2). 727 bytes result sent to driver
15/07/19 14:37:56 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 60 ms on localhost (1/4)
15/07/19 14:37:56 INFO scheduler.TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 56 ms on localhost (2/4)
15/07/19 14:37:56 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 57 ms on localhost (3/4)
15/07/19 14:37:56 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 58 ms on localhost (4/4)
15/07/19 14:37:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
15/07/19 14:37:56 INFO scheduler.DAGScheduler: Stage 0 (reduce at <console>:17) finished in 0.077 s
15/07/19 14:37:56 INFO scheduler.DAGScheduler: Job 0 finished: reduce at <console>:17, took 0.342516 s
res0: Int = 15   //得到运算结果

scala> 

 

posted @ 2015-07-18 20:45  tinylcy  阅读(523)  评论(0编辑  收藏  举报