Spark Scala Program Development
Every guide to developing Spark programs I had seen packages the code into a jar and then runs it. Today I found that, with a standalone setup, the program can actually be run directly.
As in the code below: set the master, then choose Run As -> Scala Application.
Running it this way saves the time otherwise spent packaging a jar.
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object WordCount {
  def main(args: Array[String]): Unit = {
    val logFile = "README.md" // Should be some file on your system
    val conf = new SparkConf()
      .setAppName("Simple Application")
      .setMaster("local[2]")
    val sc = new SparkContext(conf)
    val file = sc.textFile(logFile, 2).cache()
    val counts = file.flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.saveAsTextFile("result")
  }
}
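Note that setMaster("local[2]") runs the driver and executors inside the IDE's own JVM, which is why no jar is needed at all. To point the same program at a real standalone cluster master, you would set a spark:// URL and, because the worker JVMs still need your compiled classes, usually ship a jar with setJars. A minimal sketch, assuming a hypothetical master URL and jar path that you would replace with your own:

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object WordCountOnCluster {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("Simple Application")
      // spark:// URL of the standalone master (placeholder host and port).
      .setMaster("spark://namenode1:7077")
      // Worker JVMs need the compiled classes, so ship a jar of this
      // project; this path is a hypothetical example.
      .setJars(Seq("/home/hduser/workspace/TestScala/wordcount.jar"))
    val sc = new SparkContext(conf)
    // On a real cluster the input must be reachable from every worker
    // (e.g. an HDFS path); README.md here just mirrors the original code.
    val counts = sc.textFile("README.md", 2)
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.saveAsTextFile("result")
    sc.stop()
  }
}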
The console output is as follows.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
14/11/06 00:40:28 INFO SecurityManager: Changing view acls to: hduser,
14/11/06 00:40:28 INFO SecurityManager: Changing modify acls to: hduser,
14/11/06 00:40:28 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hduser, ); users with modify permissions: Set(hduser, )
14/11/06 00:40:30 INFO Slf4jLogger: Slf4jLogger started
14/11/06 00:40:30 INFO Remoting: Starting remoting
14/11/06 00:40:31 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@namenode1:36164]
14/11/06 00:40:31 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkDriver@namenode1:36164]
14/11/06 00:40:31 INFO Utils: Successfully started service 'sparkDriver' on port 36164.
14/11/06 00:40:31 INFO SparkEnv: Registering MapOutputTracker
14/11/06 00:40:31 INFO SparkEnv: Registering BlockManagerMaster
14/11/06 00:40:31 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20141106004031-6c1f
14/11/06 00:40:31 INFO Utils: Successfully started service 'Connection manager for block manager' on port 43311.
14/11/06 00:40:31 INFO ConnectionManager: Bound socket to port 43311 with id = ConnectionManagerId(namenode1,43311)
14/11/06 00:40:31 INFO MemoryStore: MemoryStore started with capacity 515.8 MB
14/11/06 00:40:31 INFO BlockManagerMaster: Trying to register BlockManager
14/11/06 00:40:31 INFO BlockManagerMasterActor: Registering block manager namenode1:43311 with 515.8 MB RAM
14/11/06 00:40:31 INFO BlockManagerMaster: Registered BlockManager
14/11/06 00:40:32 INFO HttpFileServer: HTTP File server directory is /tmp/spark-885d61b7-801f-4701-9676-abfe34983844
14/11/06 00:40:32 INFO HttpServer: Starting HTTP Server
14/11/06 00:40:32 INFO Utils: Successfully started service 'HTTP file server' on port 50899.
14/11/06 00:40:34 INFO Utils: Successfully started service 'SparkUI' on port 4040.
14/11/06 00:40:34 INFO SparkUI: Started SparkUI at http://namenode1:4040
14/11/06 00:40:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/11/06 00:40:36 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@namenode1:36164/user/HeartbeatReceiver
14/11/06 00:40:38 INFO MemoryStore: ensureFreeSpace(159118) called with curMem=0, maxMem=540821422
14/11/06 00:40:38 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 155.4 KB, free 515.6 MB)
14/11/06 00:40:38 INFO FileInputFormat: Total input paths to process : 1
14/11/06 00:40:39 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
14/11/06 00:40:39 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
14/11/06 00:40:39 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
14/11/06 00:40:39 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
14/11/06 00:40:39 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
14/11/06 00:40:39 INFO SparkContext: Starting job: saveAsTextFile at WordCount.scala:21
14/11/06 00:40:39 INFO DAGScheduler: Registering RDD 3 (map at WordCount.scala:18)
14/11/06 00:40:39 INFO DAGScheduler: Got job 0 (saveAsTextFile at WordCount.scala:21) with 2 output partitions (allowLocal=false)
14/11/06 00:40:39 INFO DAGScheduler: Final stage: Stage 0 (saveAsTextFile at WordCount.scala:21)
14/11/06 00:40:39 INFO DAGScheduler: Parents of final stage: List(Stage 1)
14/11/06 00:40:39 INFO DAGScheduler: Missing parents: List(Stage 1)
14/11/06 00:40:40 INFO DAGScheduler: Submitting Stage 1 (MappedRDD[3] at map at WordCount.scala:18), which has no missing parents
14/11/06 00:40:40 INFO MemoryStore: ensureFreeSpace(3360) called with curMem=159118, maxMem=540821422
14/11/06 00:40:40 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.3 KB, free 515.6 MB)
14/11/06 00:40:40 INFO DAGScheduler: Submitting 2 missing tasks from Stage 1 (MappedRDD[3] at map at WordCount.scala:18)
14/11/06 00:40:40 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
14/11/06 00:40:40 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 0, localhost, PROCESS_LOCAL, 1192 bytes)
14/11/06 00:40:40 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 1, localhost, PROCESS_LOCAL, 1192 bytes)
14/11/06 00:40:40 INFO Executor: Running task 0.0 in stage 1.0 (TID 0)
14/11/06 00:40:40 INFO Executor: Running task 1.0 in stage 1.0 (TID 1)
14/11/06 00:40:41 INFO CacheManager: Partition rdd_1_1 not found, computing it
14/11/06 00:40:41 INFO CacheManager: Partition rdd_1_0 not found, computing it
14/11/06 00:40:41 INFO HadoopRDD: Input split: file:/home/hduser/workspace/TestScala/README.md:0+2405
14/11/06 00:40:41 INFO HadoopRDD: Input split: file:/home/hduser/workspace/TestScala/README.md:2405+2406
14/11/06 00:40:41 INFO MemoryStore: ensureFreeSpace(7512) called with curMem=162478, maxMem=540821422
14/11/06 00:40:41 INFO MemoryStore: Block rdd_1_1 stored as values in memory (estimated size 7.3 KB, free 515.6 MB)
14/11/06 00:40:41 INFO MemoryStore: ensureFreeSpace(8352) called with curMem=169990, maxMem=540821422
14/11/06 00:40:41 INFO MemoryStore: Block rdd_1_0 stored as values in memory (estimated size 8.2 KB, free 515.6 MB)
14/11/06 00:40:41 INFO BlockManagerInfo: Added rdd_1_1 in memory on namenode1:43311 (size: 7.3 KB, free: 515.8 MB)
14/11/06 00:40:41 INFO BlockManagerMaster: Updated info of block rdd_1_1
14/11/06 00:40:41 INFO BlockManagerInfo: Added rdd_1_0 in memory on namenode1:43311 (size: 8.2 KB, free: 515.8 MB)
14/11/06 00:40:41 INFO BlockManagerMaster: Updated info of block rdd_1_0
14/11/06 00:40:41 INFO Executor: Finished task 0.0 in stage 1.0 (TID 0). 2433 bytes result sent to driver
14/11/06 00:40:41 INFO Executor: Finished task 1.0 in stage 1.0 (TID 1). 2433 bytes result sent to driver
14/11/06 00:40:41 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 1) in 1112 ms on localhost (1/2)
14/11/06 00:40:41 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 0) in 1175 ms on localhost (2/2)
14/11/06 00:40:41 INFO DAGScheduler: Stage 1 (map at WordCount.scala:18) finished in 1.204 s
14/11/06 00:40:41 INFO DAGScheduler: looking for newly runnable stages
14/11/06 00:40:41 INFO DAGScheduler: running: Set()
14/11/06 00:40:41 INFO DAGScheduler: waiting: Set(Stage 0)
14/11/06 00:40:41 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
14/11/06 00:40:41 INFO DAGScheduler: failed: Set()
14/11/06 00:40:41 INFO DAGScheduler: Missing parents for Stage 0: List()
14/11/06 00:40:41 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[5] at saveAsTextFile at WordCount.scala:21), which is now runnable
14/11/06 00:40:42 INFO MemoryStore: ensureFreeSpace(57496) called with curMem=178342, maxMem=540821422
14/11/06 00:40:42 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 56.1 KB, free 515.5 MB)
14/11/06 00:40:42 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (MappedRDD[5] at saveAsTextFile at WordCount.scala:21)
14/11/06 00:40:42 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
14/11/06 00:40:42 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 2, localhost, PROCESS_LOCAL, 948 bytes)
14/11/06 00:40:42 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 3, localhost, PROCESS_LOCAL, 948 bytes)
14/11/06 00:40:42 INFO Executor: Running task 0.0 in stage 0.0 (TID 2)
14/11/06 00:40:42 INFO Executor: Running task 1.0 in stage 0.0 (TID 3)
14/11/06 00:40:42 INFO BlockManager: Removing broadcast 1
14/11/06 00:40:42 INFO BlockManager: Removing block broadcast_1
14/11/06 00:40:42 INFO MemoryStore: Block broadcast_1 of size 3360 dropped from memory (free 540588944)
14/11/06 00:40:42 INFO ContextCleaner: Cleaned broadcast 1
14/11/06 00:40:42 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
14/11/06 00:40:42 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
14/11/06 00:40:42 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 15 ms
14/11/06 00:40:42 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
14/11/06 00:40:42 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
14/11/06 00:40:42 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 17 ms
14/11/06 00:40:42 INFO FileOutputCommitter: Saved output of task 'attempt_201411060040_0000_m_000001_3' to file:/home/hduser/workspace/TestScala/result/_temporary/0/task_201411060040_0000_m_000001
14/11/06 00:40:42 INFO FileOutputCommitter: Saved output of task 'attempt_201411060040_0000_m_000000_2' to file:/home/hduser/workspace/TestScala/result/_temporary/0/task_201411060040_0000_m_000000
14/11/06 00:40:42 INFO SparkHadoopWriter: attempt_201411060040_0000_m_000000_2: Committed
14/11/06 00:40:42 INFO SparkHadoopWriter: attempt_201411060040_0000_m_000001_3: Committed
14/11/06 00:40:42 INFO Executor: Finished task 1.0 in stage 0.0 (TID 3). 826 bytes result sent to driver
14/11/06 00:40:42 INFO Executor: Finished task 0.0 in stage 0.0 (TID 2). 826 bytes result sent to driver
14/11/06 00:40:42 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 2) in 395 ms on localhost (1/2)
14/11/06 00:40:42 INFO DAGScheduler: Stage 0 (saveAsTextFile at WordCount.scala:21) finished in 0.387 s
14/11/06 00:40:42 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 3) in 395 ms on localhost (2/2)
14/11/06 00:40:42 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
14/11/06 00:40:42 INFO SparkContext: Job finished: saveAsTextFile at WordCount.scala:21, took 2.649520816 s
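The FileOutputCommitter lines above show that saveAsTextFile wrote its output under the result directory in the project root, one part file per partition. A small sketch for spot-checking that output (assuming the two-partition run above, so the conventional Hadoop file names part-00000 and part-00001):

import scala.io.Source

object ShowResult {
  def main(args: Array[String]): Unit = {
    // saveAsTextFile produced one part-NNNNN file per partition; the job
    // used two partitions, so two part files are expected.
    for (part <- Seq("result/part-00000", "result/part-00001")) {
      val src = Source.fromFile(part)
      // Each line is the string form of a (word, count) pair, e.g. (Spark,12).
      try src.getLines().take(5).foreach(println)
      finally src.close()
    }
  }
}

Alternatively, calling counts.take(10).foreach(println) before saveAsTextFile prints a sample of the counts straight from the driver.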