|NO.Z.00017|——————————|BigDataEnd|——|Hadoop&Spark.V05|——|Spark.v05|sparkcore|RDD Programming & Action Operators|
I. Action
### --- An Action triggers the computation of an RDD and returns the result;
~~~ Each Action triggers a Job: a Spark program (the Driver program) launches exactly as many Jobs as it contains Action operators;
~~~ Typical Action operators: collect / count
~~~ collect() => sc.runJob() => ... => dagScheduler.runJob() => a Job is triggered
~~~ Requirement: be able to distinguish Transformations from Actions quickly and accurately
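~~~ A minimal sketch of the distinction, assuming a live spark-shell session with a SparkContext named sc (as in the transcripts below):
// Transformations are lazy: this line only builds lineage, no Job is submitted.
val doubled = sc.makeRDD(1 to 10).map(_ * 2)   // Transformation: returns a new RDD

// The Action triggers sc.runJob => dagScheduler.runJob => one Job.
val total = doubled.reduce(_ + _)              // Action: returns 110 to the driver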
### --- Source code excerpts
~~~ # Source excerpt: RDD.scala
~~~ # line 983
/**
 * Return an array that contains all of the elements in this RDD.
 *
 * @note This method should only be used if the resulting array is expected to be small, as
 * all the data is loaded into the driver's memory.
 */
def collect(): Array[T] = withScope {
  val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
  Array.concat(results: _*)
}
~~~ # Source excerpt: SparkContext.scala
~~~ # line 2047
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    resultHandler: (Int, U) => Unit): Unit = {
  if (stopped.get()) {
    throw new IllegalStateException("SparkContext has been shutdown")
  }
  val callSite = getCallSite
  val cleanedFunc = clean(func)
  logInfo("Starting job: " + callSite.shortForm)
  if (conf.getBoolean("spark.logLineage", false)) {
    logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
  }
  dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
  progressBar.foreach(_.finishAll())
  rdd.doCheckpoint()
}
~~~ # Action methods in RDD.scala
~~~ # Method overview
collect() / collectAsMap()
stats / count / mean / stdev / max / min
reduce(func) / fold(func) / aggregate(func)

~~~ first(): Return the first element in this RDD
~~~ take(n): Take the first n elements of the RDD
~~~ top(n): returns the first n elements under the default (descending) ordering or a supplied ordering
~~~ takeSample(withReplacement, num, [seed]): returns a sampled subset of the elements
~~~ foreach(func) / foreachPartition(func): analogous to map and mapPartitions, except that foreach is an Action and returns no new RDD; see the sketch after this list
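~~~ A small illustrative sketch (names and setup are assumptions, not from the original) contrasting the two side-effecting Actions:
// Assumed setup for illustration: a small RDD with 2 partitions.
val demo = sc.makeRDD(1 to 4, 2)

// foreach is an Action: the closure runs on the executors and returns Unit,
// so the println output lands in the executor logs rather than in a new RDD.
demo.foreach(x => println(s"element: $x"))

// foreachPartition receives each partition's iterator in a single call,
// which suits per-partition setup (e.g. one connection per partition).
demo.foreachPartition { iter =>
  iter.foreach(x => println(s"in partition: $x"))
}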
### --- Hands-on: saveAsTextFile(path) / saveAsSequenceFile(path) / saveAsObjectFile(path)
~~~ # stats returns summary statistics; it is only available on numeric RDDs (implicitly converted to RDD[Double]), so RDD[Long] below works but a tuple RDD does not
scala> val rdd1 = sc.range(1, 101)
rdd1: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[158] at range at <console>:24
scala> rdd1.stats
res69: org.apache.spark.util.StatCounter = (count: 100, mean: 50.500000, stdev: 28.866070, max: 100.000000, min: 1.000000)
scala> val rdd2 = sc.range(1, 101)
rdd2: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[162] at range at <console>:24
~~~ # cannot be called: RDD[(Long, Long)] has no numeric conversion, so stats is undefined
scala> rdd1.zip(rdd2).stats
<console>:28: error: value stats is not a member of org.apache.spark.rdd.RDD[(Long, Long)]
rdd1.zip(rdd2).stats
^
~~~ # count, by contrast, can be called on an RDD of any element type
scala> rdd1.zip(rdd2).count
res71: Long = 100
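~~~ If statistics over the zipped data are wanted, projecting the pairs to a numeric value restores the implicit conversion to DoubleRDDFunctions; a hedged sketch (the projection a + b is just an example):
// Map the (Long, Long) pairs to a Double first; stats is then available again.
val pairStats = rdd1.zip(rdd2).map { case (a, b) => (a + b).toDouble }.stats()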
### --- Aggregation operations
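~~~ The transcript below uses an rdd whose definition is not shown; judging from the results (reduce = 55, fold(1) = 58, i.e. the zero value counted three times), it behaves like 1 to 10 in two partitions. A hedged reconstruction:
// Assumed setup (not shown in the original transcript): an RDD of 1 to 10
// with 2 partitions, which reproduces reduce = 55 and fold(1) = 58 below.
val rdd = sc.makeRDD(1 to 10, 2)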
scala> rdd.reduce(_+_)
res72: Int = 55
scala> rdd.fold(0)(_+_)
res73: Int = 55
scala> rdd.fold(1)(_+_)
res74: Int = 58
scala> rdd.fold(1)((x, y) => {
| println(s"x=$x, y=$y")
| x+y
| })
~~~ Printed output (driver side): the zero value 1 is added once per partition (16 = 15 + 1, 41 = 40 + 1) and once more when the partition results are merged, hence 55 + 3 × 1 = 58
x=1, y=16
x=17, y=41
res75: Int = 58
scala> rdd.aggregate(0)(_+_, _+_)
res76: Int = 55
scala> rdd.aggregate(1)(_+_, _+_)
res77: Int = 58
scala> rdd.aggregate(1)(
| (a, b) => {
| println(s"a=$a, b=$b")
| a+b
| },
| (x, y) => {
| println(s"x=$x, y=$y")
| x+y
| })
~~~ Printed output (the combOp running on the driver; the seqOp calls run on the executors): 1 + 41 = 42, then 42 + 16 = 58, again 55 + 3 × 1
x=1, y=41
x=42, y=16
res78: Int = 58
~~~ # first / take(n) / top(n): fetch elements from the RDD; mostly used for testing
scala> rdd.first
res79: Int = 1
scala> rdd.take(10)
res80: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scala> rdd.top(10)
res81: Array[Int] = Array(10, 9, 8, 7, 6, 5, 4, 3, 2, 1)
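~~~ top also takes an implicit Ordering, so the sort rule can be overridden; a small sketch:
// With the reversed ordering, top(3) returns the naturally smallest elements
// in ascending order (equivalent to takeOrdered(3)).
val smallest = rdd.top(3)(Ordering[Int].reverse)   // Array(1, 2, 3)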
~~~ # Sample elements and return the result
scala> rdd.takeSample(false, 5)
res82: Array[Int] = Array(3, 6, 9, 10, 8)
~~~ # Save to the given path: the RDD is written as one file per partition, so beware of the small-file problem (see the sketch after the call below)
scala> rdd.saveAsTextFile("data/t1")
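~~~ Coalescing before saving is one common way to avoid the small-file problem, since each partition becomes one part file; a hedged sketch (output paths data/t2 to data/t4 are illustrative, and a target path must not already exist):
// One output file: coalesce to a single partition before saving.
// (Fine for small results; keep more partitions for large data.)
rdd.coalesce(1).saveAsTextFile("data/t2")

// saveAsSequenceFile requires a key-value RDD (via Writable conversions);
// saveAsObjectFile writes Java-serialized elements.
rdd.map(x => (x, x * x)).saveAsSequenceFile("data/t3")
rdd.saveAsObjectFile("data/t4")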
II. Experiment results
~~~ After the saveAsTextFile call above, data/t1 contains one part-NNNNN file per partition of rdd, plus a _SUCCESS marker.