A Brief Introduction to Using spark-shell (Scala)

>> Original post from 提君's blog  http://www.cnblogs.com/tijun/  <<

1. Start the shell

./bin/spark-shell

The :help command lists the available REPL commands:

scala> :help
All commands can be abbreviated, e.g., :he instead of :help.
:edit <id>|<line>        edit history
:help [command]          print this summary or command-specific help
:history [num]           show the history (optional num is commands to show)
:h? <string>             search the history
:imports [name name ...] show import history, identifying sources of names
:implicits [-v]          show the implicits in scope
:javap <path|class>      disassemble a file or class name
:line <id>|<line>        place line(s) at the end of history
:load <path>             interpret lines in a file
:paste [-raw] [path]     enter paste mode or paste a file
:power                   enable power user mode
:quit                    exit the interpreter
:replay [options]        reset the repl and replay all previous commands
:require <path>          add a jar to the classpath
:reset [options]         reset the repl to its initial state, forgetting all session entries
:save <path>             save replayable session to a file
:sh <command line>       run a shell command (result is implicitly => List[String])
:settings <options>      update compiler options, if possible; see reset
:silent                  disable/enable automatic printing of results
:type [-v] <expr>        display the type of an expression without evaluating it
:kind [-v] <expr>        display the kind of expression's type
:warnings                show the suppressed warnings from the most recent line which had any

 

2. Load a file with spark to create a Dataset

scala> val textFile = spark.read.textFile("hdfs://cluster1/input/README.txt")
textFile: org.apache.spark.sql.Dataset[String] = [value: string]

3. Load a file with sc to create an RDD

scala> val textFile=sc.textFile("hdfs://cluster1/input/README.txt")
textFile: org.apache.spark.rdd.RDD[String] = hdfs://cluster1/input/README.txt MapPartitionsRDD[1] at textFile at <console>:24
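
Steps 2 and 3 bind the same name, and in the REPL the later val textFile shadows the earlier one, so the steps that follow run against the RDD. Below is a minimal sketch, assuming the same spark-shell session and HDFS path as above, that keeps both handles by giving them distinct names (linesDs, linesRdd and rddFromDs are illustrative names, not from the original post):

```scala
// Run inside spark-shell, where `spark` (SparkSession) and `sc` (SparkContext) are predefined.
val path = "hdfs://cluster1/input/README.txt"   // path taken from the steps above

val linesDs = spark.read.textFile(path)   // Dataset[String]
val linesRdd = sc.textFile(path)          // RDD[String]

// A Dataset can also be viewed as an RDD through its rdd member.
val rddFromDs = linesDs.rdd               // RDD[String]
```

Both count() and first() are available on either type, so steps 4 and 5 work the same way with either handle.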

4. Count the lines (items) in textFile

scala> textFile.count()    // Number of items (lines) in textFile
res0: Long = 31

5. View the first item

scala> textFile.first()   // First item (line) in textFile
res1: String = For the latest information about Hadoop, please visit our website at:

The steps above are all simple. Next is a complete word count.

6. Word count

scala> val wordsRdd=textFile.flatMap(line=>line.split(" "))
wordsRdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at flatMap at <console>:26

scala> val kvsRdd=wordsRdd.map(word=>(word,1))
kvsRdd: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[3] at map at <console>:28

scala> val countRdd=kvsRdd.reduceByKey(_+_)
countRdd: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[4] at reduceByKey at <console>:30

scala> countRdd.collect()
res2: Array[(String, Int)] = Array((under,1), (this,3), (distribution,2), (Technology,1), (country,1), (is,1), (Jetty,1), (currently,1), (permitted.,1), (check,1), (have,1), (Security,1), (U.S.,1), (with,1), (BIS,1), (This,1), (mortbay.org.,1), ((ECCN),1), (using,2), (security,1), (Department,1), (export,1), (reside,1), (any,1), (algorithms.,1), (from,1), (re-export,2), (has,1), (SSL,1), (Industry,1), (Administration,1), (details,1), (provides,1), (http://hadoop.apache.org/core/,1), (country's,1), (Unrestricted,1), (740.13),1), (policies,1), (country,,1), (concerning,1), (uses,1), (Apache,1), (possession,,2), (information,2), (our,2), (as,1), ("",18), (Bureau,1), (wiki,,1), (please,2), (form,1), (information.,1), (ENC,1), (Export,2), (included,1), (asymmetric,1), (Commodity,1), (For,1),...
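
The result contains an ("",18) entry because split(" ") yields an empty string for a blank line and between two consecutive spaces. The following is a minimal sketch in plain Scala collections, with made-up sample lines standing in for README.txt, that mirrors the three steps and shows one way to drop the empty tokens. Local collections have no reduceByKey, so groupBy plus a sum stands in for it here.

```scala
// Sample lines are hypothetical; they stand in for the contents of README.txt.
val lines = Seq("For the latest information", "", "please visit  our website")

// Step 1: split each line on single spaces, as in textFile.flatMap(line => line.split(" ")).
val words = lines.flatMap(line => line.split(" "))

// Steps 2 and 3: pair each word with 1, then add up the 1s per word.
// groupBy + sum is a local stand-in for reduceByKey(_ + _).
val counts = words
  .map(word => (word, 1))
  .groupBy(_._1)
  .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

// The blank line and the double space each produce one empty token.
println(counts(""))

// Splitting on runs of whitespace and filtering empty strings removes them.
val cleaned = lines.flatMap(_.split("\\s+")).filter(_.nonEmpty)
println(cleaned.mkString(","))
```

In the shell session above, the equivalent change would be to apply wordsRdd.filter(_.nonEmpty) before the map step.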

That is all for now. I will continue to expand this post later.

posted @ 2017-09-20 18:52  提君