在IDEA中编写Spark的WordCount程序
1:spark shell仅在测试和验证我们的程序时使用的较多,在生产环境中,通常会在IDE中编制程序,然后打成jar包,然后提交到集群,最常用的是创建一个Maven项目,利用Maven来管理jar包的依赖。
2:配置Maven的pom.xml:
<?xml version="1.0" encoding="UTF-8"?> <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"> <modelVersion>4.0.0</modelVersion> <groupId>com.bie</groupId> <artifactId>sparkWordCount</artifactId> <version>1.0-SNAPSHOT</version> <properties> <maven.compiler.source>1.7</maven.compiler.source> <maven.compiler.target>1.7</maven.compiler.target> <encoding>UTF-8</encoding> <scala.version>2.10.6</scala.version> <scala.compat.version>2.10</scala.compat.version> </properties> <dependencies> <dependency> <groupId>org.scala-lang</groupId> <artifactId>scala-library</artifactId> <version>${scala.version}</version> </dependency> <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-core_2.10</artifactId> <version>1.5.2</version> </dependency> <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-streaming_2.10</artifactId> <version>1.5.2</version> </dependency> <dependency> <groupId>org.apache.hadoop</groupId> <artifactId>hadoop-client</artifactId> <version>2.6.2</version> </dependency> </dependencies> <build> <sourceDirectory>src/main/scala</sourceDirectory> <testSourceDirectory>src/test/scala</testSourceDirectory> <plugins> <plugin> <groupId>net.alchim31.maven</groupId> <artifactId>scala-maven-plugin</artifactId> <version>3.2.0</version> <executions> <execution> <goals> <goal>compile</goal> <goal>testCompile</goal> </goals> <configuration> <args> <arg>-make:transitive</arg> <arg>-dependencyfile</arg> <arg>${project.build.directory}/.scala_dependencies</arg> </args> </configuration> </execution> </executions> </plugin> <plugin> <groupId>org.apache.maven.plugins</groupId> <artifactId>maven-surefire-plugin</artifactId> <version>2.18.1</version> <configuration> <useFile>false</useFile> <disableXmlReport>true</disableXmlReport> <includes> <include>**/*Test.*</include> <include>**/*Suite.*</include> </includes> </configuration> </plugin> <plugin> <groupId>org.apache.maven.plugins</groupId> <artifactId>maven-shade-plugin</artifactId> <version>2.3</version> <executions> <execution> <phase>package</phase> <goals> <goal>shade</goal> </goals> <configuration> <filters> <filter> <artifact>*:*</artifact> <excludes> <exclude>META-INF/*.SF</exclude> <exclude>META-INF/*.DSA</exclude> <exclude>META-INF/*.RSA</exclude> </excludes> </filter> </filters> <transformers> <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer"> <mainClass>com.bie.WordCount</mainClass> </transformer> </transformers> </configuration> </execution> </executions> </plugin> </plugins> </build> </project>
注意:配置好pom.xml以后,点击Enable Auto-Import即可;
3:将src/main/java和src/test/java分别修改成src/main/scala和src/test/scala,与pom.xml中的配置保持一致();
4:新建一个scala class,类型为Object,然后编写spark程序,如下所示:
import org.apache.spark.{SparkConf, SparkContext} object WordCount { def main(args: Array[String]): Unit = { //创建SparkConf()并且设置App的名称 val conf = new SparkConf().setAppName("wordCount"); //创建SparkContext,该对象是提交spark app的入口 val sc = new SparkContext(conf); //使用sc创建rdd,并且执行相应的transformation和action sc.textFile(args(0)).flatMap(_.split(" ")).map((_ ,1)).reduceByKey(_ + _,1).sortBy(_._2,false).saveAsTextFile(args(1)); //停止sc,结束该任务 sc.stop(); } }
5:使用Maven打包:首先修改pom.xml中的mainClass,使其和自己的类路径对应起来:
然后,点击idea右侧的Maven Project选项,点击Lifecycle,选择clean和package,然后点击Run Maven Build:
等待编译完成,选择编译成功的jar包,并将该jar上传到Spark集群中的某个节点上:
记得,启动你的hdfs和Spark集群,然后使用spark-submit命令提交Spark应用(注意参数的顺序):
可以看下简单的几行代码,但是打成的包就将近百兆,都是封装好的啊,感觉牛人太多了。
然后开始进行Spark Submit提交操作,命令如下所示:
[root@master spark-1.6.1-bin-hadoop2.6]# bin/spark-submit \ > --class com.bie.WordCount \ > --master spark://master:7077 \ > --executor-memory 512M \ > --total-executor-cores 2 \ > /home/hadoop/data_hadoop/sparkWordCount-1.0-SNAPSHOT.jar \ > hdfs://master:9000/wordcount.txt \ > hdfs://master:9000/output
或者如下:
bin/spark-submit --class com.bie.WordCount --master spark://master:7077 --executor-memory 512M --total-executor-cores 2 /home/hadoop/data_hadoop/sparkWordCount-1.0-SNAPSHOT.jar hdfs://master:9000/wordcount.txt hdfs://master:9000/outpu
操作如下所示:
可以在图形化页面看到多了一个Application:
然后呢,就出错了,学知识,不出点错,感觉都不正常:
1 org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout 2 at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48) 3 at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) 4 at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) 5 at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) 6 at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76) 7 at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101) 8 at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77) 9 at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.removeExecutor(CoarseGrainedSchedulerBackend.scala:359) 10 at org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend.executorRemoved(SparkDeploySchedulerBackend.scala:144) 11 at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$receive$1.applyOrElse(AppClient.scala:186) 12 at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116) 13 at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204) 14 at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) 15 at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215) 16 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 17 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 18 at java.lang.Thread.run(Thread.java:745) 19 Caused by: java.util.concurrent.TimeoutException: Futures timed out after [120 seconds] 20 at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) 21 at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) 22 at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) 23 at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) 24 at scala.concurrent.Await$.result(package.scala:107) 25 at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) 26 ... 12 more 27 18/02/23 01:28:46 WARN NettyRpcEndpointRef: Error sending message [message = UpdateBlockInfo(BlockManagerId(driver, 192.168.3.129, 60565),broadcast_1_piece0,StorageLevel(false, true, false, false, 1),2358,0,0)] in 1 attempts 28 org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout 29 at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48) 30 at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) 31 at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) 32 at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) 33 at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:185) 34 at scala.util.Try$.apply(Try.scala:161) 35 at scala.util.Failure.recover(Try.scala:185) 36 at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324) 37 at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324) 38 at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) 39 at org.spark-project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293) 40 at scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:133) 41 at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40) 42 at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248) 43 at scala.concurrent.Promise$class.complete(Promise.scala:55) 44 at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153) 45 at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235) 46 at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235) 47 at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) 48 at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.processBatch$1(Future.scala:643) 49 at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply$mcV$sp(Future.scala:658) 50 at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply(Future.scala:635) 51 at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply(Future.scala:635) 52 at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72) 53 at scala.concurrent.Future$InternalCallbackExecutor$Batch.run(Future.scala:634) 54 at scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694) 55 at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:685) 56 at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40) 57 at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248) 58 at scala.concurrent.Promise$class.tryFailure(Promise.scala:112) 59 at scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:153) 60 at org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:241) 61 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) 62 at java.util.concurrent.FutureTask.run(FutureTask.java:262) 63 at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178) 64 at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292) 65 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 66 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 67 at java.lang.Thread.run(Thread.java:745) 68 Caused by: java.util.concurrent.TimeoutException: Cannot receive any reply in 120 seconds 69 at org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:242) 70 ... 7 more 71 72 73 18/02/23 01:30:48 WARN NettyRpcEndpointRef: Error sending message [message = RemoveExecutor(0,Command exited with code 1)] in 2 attempts 74 org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout 75 at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48) 76 at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) 77 at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) 78 at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) 79 at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:185) 80 at scala.util.Try$.apply(Try.scala:161) 81 at scala.util.Failure.recover(Try.scala:185) 82 at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324) 83 at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324) 84 at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) 85 at org.spark-project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293) 86 at scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:133) 87 at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40) 88 at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248) 89 at scala.concurrent.Promise$class.complete(Promise.scala:55) 90 at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153) 91 at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235) 92 at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235) 93 at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) 94 at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.processBatch$1(Future.scala:643) 95 at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply$mcV$sp(Future.scala:658) 96 at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply(Future.scala:635) 97 at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply(Future.scala:635) 98 at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72) 99 at scala.concurrent.Future$InternalCallbackExecutor$Batch.run(Future.scala:634) 100 at scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694) 101 at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:685) 102 at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40) 103 at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248) 104 at scala.concurrent.Promise$class.tryFailure(Promise.scala:112) 105 at scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:153) 106 at org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:241) 107 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) 108 at java.util.concurrent.FutureTask.run(FutureTask.java:262) 109 at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178) 110 at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292) 111 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 112 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 113 at java.lang.Thread.run(Thread.java:745) 114 Caused by: java.util.concurrent.TimeoutException: Cannot receive any reply in 120 seconds 115 at org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:242) 116 ... 7 more 117 18/02/23 01:30:49 WARN NettyRpcEndpointRef: Error sending message [message = UpdateBlockInfo(BlockManagerId(driver, 192.168.3.129, 60565),broadcast_1_piece0,StorageLevel(false, true, false, false, 1),2358,0,0)] in 2 attempts 118 org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout 119 at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48) 120 at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) 121 at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) 122 at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) 123 at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:185) 124 at scala.util.Try$.apply(Try.scala:161) 125 at scala.util.Failure.recover(Try.scala:185) 126 at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324) 127 at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324) 128 at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) 129 at org.spark-project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293) 130 at scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:133) 131 at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40) 132 at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248) 133 at scala.concurrent.Promise$class.complete(Promise.scala:55) 134 at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153) 135 at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235) 136 at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235) 137 at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) 138 at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.processBatch$1(Future.scala:643) 139 at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply$mcV$sp(Future.scala:658) 140 at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply(Future.scala:635) 141 at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply(Future.scala:635) 142 at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72) 143 at scala.concurrent.Future$InternalCallbackExecutor$Batch.run(Future.scala:634) 144 at scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694) 145 at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:685) 146 at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40) 147 at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248) 148 at scala.concurrent.Promise$class.tryFailure(Promise.scala:112) 149 at scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:153) 150 at org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:241) 151 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) 152 at java.util.concurrent.FutureTask.run(FutureTask.java:262) 153 at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178) 154 at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292) 155 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 156 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 157 at java.lang.Thread.run(Thread.java:745) 158 Caused by: java.util.concurrent.TimeoutException: Cannot receive any reply in 120 seconds 159 at org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:242) 160 ... 7 more
解决思路,百度了一下,也没缕出思路,就只知道是连接超时了,超过了120s,然后呢,我感觉是自己内存设置小了,因为开的虚拟机,主机8G,三台虚拟机,每台分了1G内存,然后设置Spark可以占用800M,跑程序的时候,第一次设置为512M,就连接超时了,第二次设置为了700M,顺利跑完,可以看看跑的过程,还是很有意思的:
1 [root@master hadoop]# bin/spark-submit --class com.bie.WordCount --master spark://master:7077 --executor-memory 700M --total-executor-cores 2 /home/hadoop/data_hadoop/sparkWordCount-1.0-SNAPSHOT.jar hdfs://master:9000/wordcount.txt hdfs://master:9000/output 2 bash: bin/spark-submit: No such file or directory 3 [root@master hadoop]# cd spark-1.6.1-bin-hadoop2.6/ 4 [root@master spark-1.6.1-bin-hadoop2.6]# bin/spark-submit --class com.bie.WordCount --master spark://master:7077 --executor-memory 700M --total-executor-cores 2 /home/hadoop/data_hadoop/sparkWordCount-1.0-SNAPSHOT.jar hdfs://master:9000/wordcount.txt hdfs://master:9000/output 5 18/02/23 01:45:46 INFO SparkContext: Running Spark version 1.6.1 6 18/02/23 01:45:47 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 7 18/02/23 01:45:47 INFO SecurityManager: Changing view acls to: root 8 18/02/23 01:45:47 INFO SecurityManager: Changing modify acls to: root 9 18/02/23 01:45:47 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root) 10 18/02/23 01:45:48 INFO Utils: Successfully started service 'sparkDriver' on port 60097. 11 18/02/23 01:45:49 INFO Slf4jLogger: Slf4jLogger started 12 18/02/23 01:45:49 INFO Remoting: Starting remoting 13 18/02/23 01:45:49 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@192.168.3.129:55353] 14 18/02/23 01:45:49 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 55353. 15 18/02/23 01:45:49 INFO SparkEnv: Registering MapOutputTracker 16 18/02/23 01:45:49 INFO SparkEnv: Registering BlockManagerMaster 17 18/02/23 01:45:49 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-17e41a67-b880-4c06-95eb-a0f64928f668 18 18/02/23 01:45:49 INFO MemoryStore: MemoryStore started with capacity 517.4 MB 19 18/02/23 01:45:49 INFO SparkEnv: Registering OutputCommitCoordinator 20 18/02/23 01:45:50 INFO Utils: Successfully started service 'SparkUI' on port 4040. 21 18/02/23 01:45:50 INFO SparkUI: Started SparkUI at http://192.168.3.129:4040 22 18/02/23 01:45:50 INFO HttpFileServer: HTTP File server directory is /tmp/spark-99c897ab-ea17-4797-8695-3a5df89ed490/httpd-f346e1dd-642d-437d-8947-6190f2e83065 23 18/02/23 01:45:50 INFO HttpServer: Starting HTTP Server 24 18/02/23 01:45:50 INFO Utils: Successfully started service 'HTTP file server' on port 35900. 25 18/02/23 01:45:51 INFO SparkContext: Added JAR file:/home/hadoop/data_hadoop/sparkWordCount-1.0-SNAPSHOT.jar at http://192.168.3.129:35900/jars/sparkWordCount-1.0-SNAPSHOT.jar with timestamp 1519379151547 26 18/02/23 01:45:52 INFO AppClient$ClientEndpoint: Connecting to master spark://master:7077... 27 18/02/23 01:45:52 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20180223014552-0004 28 18/02/23 01:45:52 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 37128. 29 18/02/23 01:45:52 INFO NettyBlockTransferService: Server created on 37128 30 18/02/23 01:45:52 INFO BlockManagerMaster: Trying to register BlockManager 31 18/02/23 01:45:52 INFO AppClient$ClientEndpoint: Executor added: app-20180223014552-0004/0 on worker-20180223110813-192.168.3.131-55934 (192.168.3.131:55934) with 1 cores 32 18/02/23 01:45:52 INFO SparkDeploySchedulerBackend: Granted executor ID app-20180223014552-0004/0 on hostPort 192.168.3.131:55934 with 1 cores, 700.0 MB RAM 33 18/02/23 01:45:52 INFO AppClient$ClientEndpoint: Executor added: app-20180223014552-0004/1 on worker-20180223110811-192.168.3.130-40991 (192.168.3.130:40991) with 1 cores 34 18/02/23 01:45:52 INFO SparkDeploySchedulerBackend: Granted executor ID app-20180223014552-0004/1 on hostPort 192.168.3.130:40991 with 1 cores, 700.0 MB RAM 35 18/02/23 01:45:52 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.3.129:37128 with 517.4 MB RAM, BlockManagerId(driver, 192.168.3.129, 37128) 36 18/02/23 01:45:52 INFO BlockManagerMaster: Registered BlockManager 37 18/02/23 01:45:52 INFO AppClient$ClientEndpoint: Executor updated: app-20180223014552-0004/1 is now RUNNING 38 18/02/23 01:45:52 INFO AppClient$ClientEndpoint: Executor updated: app-20180223014552-0004/0 is now RUNNING 39 18/02/23 01:45:53 INFO SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0 40 18/02/23 01:45:53 WARN SizeEstimator: Failed to check whether UseCompressedOops is set; assuming yes 41 18/02/23 01:45:53 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 146.7 KB, free 146.7 KB) 42 18/02/23 01:45:53 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 13.9 KB, free 160.6 KB) 43 18/02/23 01:45:53 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.3.129:37128 (size: 13.9 KB, free: 517.4 MB) 44 18/02/23 01:45:53 INFO SparkContext: Created broadcast 0 from textFile at WordCount.scala:13 45 Java HotSpot(TM) Client VM warning: You have loaded library /tmp/libnetty-transport-native-epoll4006421548933729587.so which might have disabled stack guard. The VM will try to fix the stack guard now. 46 It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'. 47 18/02/23 01:45:55 INFO SparkDeploySchedulerBackend: Registered executor NettyRpcEndpointRef(null) (slaver1:60082) with ID 1 48 18/02/23 01:45:55 INFO BlockManagerMasterEndpoint: Registering block manager slaver1:56633 with 282.5 MB RAM, BlockManagerId(1, slaver1, 56633) 49 18/02/23 01:45:56 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id 50 18/02/23 01:45:56 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id 51 18/02/23 01:45:56 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap 52 18/02/23 01:45:56 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition 53 18/02/23 01:45:56 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id 54 18/02/23 01:45:57 INFO SparkDeploySchedulerBackend: Registered executor NettyRpcEndpointRef(null) (slaver2:46908) with ID 0 55 18/02/23 01:45:57 INFO SparkContext: Starting job: saveAsTextFile at WordCount.scala:13 56 18/02/23 01:45:58 INFO FileInputFormat: Total input paths to process : 1 57 18/02/23 01:45:58 INFO BlockManagerMasterEndpoint: Registering block manager slaver2:36572 with 282.5 MB RAM, BlockManagerId(0, slaver2, 36572) 58 18/02/23 01:45:58 INFO DAGScheduler: Registering RDD 3 (map at WordCount.scala:13) 59 18/02/23 01:45:58 INFO DAGScheduler: Registering RDD 5 (sortBy at WordCount.scala:13) 60 18/02/23 01:45:58 INFO DAGScheduler: Got job 0 (saveAsTextFile at WordCount.scala:13) with 1 output partitions 61 18/02/23 01:45:58 INFO DAGScheduler: Final stage: ResultStage 2 (saveAsTextFile at WordCount.scala:13) 62 18/02/23 01:45:58 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 1) 63 18/02/23 01:45:58 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 1) 64 18/02/23 01:45:58 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[3] at map at WordCount.scala:13), which has no missing parents 65 18/02/23 01:45:59 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.1 KB, free 164.7 KB) 66 18/02/23 01:45:59 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.3 KB, free 167.0 KB) 67 18/02/23 01:45:59 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.3.129:37128 (size: 2.3 KB, free: 517.4 MB) 68 18/02/23 01:45:59 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006 69 18/02/23 01:45:59 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[3] at map at WordCount.scala:13) 70 18/02/23 01:45:59 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks 71 18/02/23 01:45:59 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, slaver2, partition 0,NODE_LOCAL, 2196 bytes) 72 18/02/23 01:45:59 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, slaver1, partition 1,NODE_LOCAL, 2196 bytes) 73 18/02/23 01:47:19 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on slaver1:56633 (size: 2.3 KB, free: 282.5 MB) 74 18/02/23 01:47:21 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on slaver1:56633 (size: 13.9 KB, free: 282.5 MB) 75 18/02/23 01:47:22 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on slaver2:36572 (size: 2.3 KB, free: 282.5 MB) 76 18/02/23 01:47:29 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on slaver2:36572 (size: 13.9 KB, free: 282.5 MB) 77 18/02/23 01:47:36 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 97388 ms on slaver1 (1/2) 78 18/02/23 01:48:19 INFO DAGScheduler: ShuffleMapStage 0 (map at WordCount.scala:13) finished in 140.390 s 79 18/02/23 01:48:19 INFO DAGScheduler: looking for newly runnable stages 80 18/02/23 01:48:19 INFO DAGScheduler: running: Set() 81 18/02/23 01:48:19 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 140387 ms on slaver2 (2/2) 82 18/02/23 01:48:19 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 83 18/02/23 01:48:19 INFO DAGScheduler: waiting: Set(ShuffleMapStage 1, ResultStage 2) 84 18/02/23 01:48:19 INFO DAGScheduler: failed: Set() 85 18/02/23 01:48:19 INFO DAGScheduler: Submitting ShuffleMapStage 1 (MapPartitionsRDD[5] at sortBy at WordCount.scala:13), which has no missing parents 86 18/02/23 01:48:19 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 3.5 KB, free 170.5 KB) 87 18/02/23 01:48:19 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 2.0 KB, free 172.5 KB) 88 18/02/23 01:48:19 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.3.129:37128 (size: 2.0 KB, free: 517.4 MB) 89 18/02/23 01:48:19 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1006 90 18/02/23 01:48:19 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 1 (MapPartitionsRDD[5] at sortBy at WordCount.scala:13) 91 18/02/23 01:48:19 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks 92 18/02/23 01:48:19 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, slaver2, partition 0,NODE_LOCAL, 1956 bytes) 93 18/02/23 01:48:19 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on slaver2:36572 (size: 2.0 KB, free: 282.5 MB) 94 18/02/23 01:48:19 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 1 to slaver2:46908 95 18/02/23 01:48:19 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 1 is 156 bytes 96 18/02/23 01:48:20 INFO DAGScheduler: ShuffleMapStage 1 (sortBy at WordCount.scala:13) finished in 0.708 s 97 18/02/23 01:48:20 INFO DAGScheduler: looking for newly runnable stages 98 18/02/23 01:48:20 INFO DAGScheduler: running: Set() 99 18/02/23 01:48:20 INFO DAGScheduler: waiting: Set(ResultStage 2) 100 18/02/23 01:48:20 INFO DAGScheduler: failed: Set() 101 18/02/23 01:48:20 INFO DAGScheduler: Submitting ResultStage 2 (MapPartitionsRDD[8] at saveAsTextFile at WordCount.scala:13), which has no missing parents 102 18/02/23 01:48:20 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 711 ms on slaver2 (1/1) 103 18/02/23 01:48:20 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 104 18/02/23 01:48:20 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 65.1 KB, free 237.6 KB) 105 18/02/23 01:48:20 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 22.6 KB, free 260.2 KB) 106 18/02/23 01:48:20 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on 192.168.3.129:37128 (size: 22.6 KB, free: 517.4 MB) 107 18/02/23 01:48:20 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1006 108 18/02/23 01:48:20 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 2 (MapPartitionsRDD[8] at saveAsTextFile at WordCount.scala:13) 109 18/02/23 01:48:20 INFO TaskSchedulerImpl: Adding task set 2.0 with 1 tasks 110 18/02/23 01:48:20 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 3, slaver2, partition 0,NODE_LOCAL, 1967 bytes) 111 18/02/23 01:48:20 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on slaver2:36572 (size: 22.6 KB, free: 282.5 MB) 112 18/02/23 01:48:20 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to slaver2:46908 113 18/02/23 01:48:20 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 136 bytes 114 18/02/23 01:48:25 INFO DAGScheduler: ResultStage 2 (saveAsTextFile at WordCount.scala:13) finished in 5.008 s 115 18/02/23 01:48:25 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 3) in 5012 ms on slaver2 (1/1) 116 18/02/23 01:48:25 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool 117 18/02/23 01:48:25 INFO DAGScheduler: Job 0 finished: saveAsTextFile at WordCount.scala:13, took 147.737606 s 118 18/02/23 01:48:26 INFO SparkUI: Stopped Spark web UI at http://192.168.3.129:4040 119 18/02/23 01:48:26 INFO SparkDeploySchedulerBackend: Shutting down all executors 120 18/02/23 01:48:26 INFO SparkDeploySchedulerBackend: Asking each executor to shut down 121 18/02/23 01:48:26 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped! 122 18/02/23 01:48:26 INFO MemoryStore: MemoryStore cleared 123 18/02/23 01:48:26 INFO BlockManager: BlockManager stopped 124 18/02/23 01:48:26 INFO BlockManagerMaster: BlockManagerMaster stopped 125 18/02/23 01:48:26 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped! 126 18/02/23 01:48:26 INFO SparkContext: Successfully stopped SparkContext 127 18/02/23 01:48:26 INFO ShutdownHookManager: Shutdown hook called 128 18/02/23 01:48:26 INFO ShutdownHookManager: Deleting directory /tmp/spark-99c897ab-ea17-4797-8695-3a5df89ed490 129 18/02/23 01:48:26 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon. 130 18/02/23 01:48:26 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports. 131 18/02/23 01:48:26 INFO ShutdownHookManager: Deleting directory /tmp/spark-99c897ab-ea17-4797-8695-3a5df89ed490/httpd-f346e1dd-642d-437d-8947-6190f2e83065 132 [root@master spark-1.6.1-bin-hadoop2.6]#
最后查看执行结果即可(由于第一次跑失败了,作为强迫症的我就把第一次的输出结果文件删除了):