RDD Actions

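1.Definition: an action triggers a Job by calling the runJob() method.
  Examples: collect, count
2.foreach
  Note: the function is applied on the executor nodes, not on the driver, so any output it produces stays on the executors rather than being returned to the driver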

3.aggregate
  def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U
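  Note: aggregate combines the elements of an RDD. seqOp first folds the elements of type T within each partition into a value of type U,
     then combOp merges the per-partition values of type U into a single U.
     Note in particular that both seqOp and combOp start from zeroValue, which has type U.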
  val z = sc.parallelize(List(1,2,3,4,5,6), 2)
  z.aggregate(0)(math.max(_, _), _ + _)
  res40: Int = 9
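  Note:
    step1: in the first partition, math.max is applied to [0,1,2,3] (zeroValue 0 followed by the elements), giving 3
    step2: in the second partition, math.max is applied to [0,4,5,6], giving 6
    stepN: in the N-th partition, math.max gives that partition's maximum
    final step: combOp (_ + _) merges zeroValue with all partition results: 0+3+6=9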
  z.aggregate(5)(math.max(_, _), _ + _)
  res29: Int = 16
  Note:
    // This example returns 16 since the initial value is 5
    // reduce of partition 0 will be max(5, 1, 2, 3) = 5
    // reduce of partition 1 will be max(5, 4, 5, 6) = 6
    // final reduce across partitions will be 5 + 5 + 6 = 16
    // note that the final reduce includes the initial value
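
  The two results above can be reproduced without a cluster. The sketch below is plain Scala, not Spark: aggregateSim is a hypothetical helper that assumes aggregate folds each partition with seqOp starting from zeroValue and then merges the partition results with combOp, again starting from zeroValue. The partition layout [1,2,3] / [4,5,6] is taken from the notes above.

```scala
object AggregateSketch {
  // Hypothetical helper (not a Spark API): mimics RDD.aggregate on local data.
  def aggregateSim[T, U](partitions: Seq[Seq[T]], zeroValue: U)(
      seqOp: (U, T) => U,
      combOp: (U, U) => U): U = {
    // seqOp folds each partition, starting from zeroValue.
    val perPartition = partitions.map(_.foldLeft(zeroValue)(seqOp))
    // combOp merges the partition results, again starting from zeroValue.
    perPartition.foldLeft(zeroValue)(combOp)
  }

  def main(args: Array[String]): Unit = {
    // Assumed layout of sc.parallelize(List(1,2,3,4,5,6), 2).
    val parts = Seq(Seq(1, 2, 3), Seq(4, 5, 6))
    println(aggregateSim(parts, 0)((a, b) => math.max(a, b), _ + _)) // 0 + 3 + 6 = 9
    println(aggregateSim(parts, 5)((a, b) => math.max(a, b), _ + _)) // 5 + 5 + 6 = 16
  }
}
```

  With zeroValue 5 the first partition's maximum becomes 5 instead of 3, and 5 is added once more in the final merge, which is why the total jumps from 9 to 16.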

  Worked example:
    scala> val z = sc.parallelize(List("12","23","345","4567"),2)
    z: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[3] at parallelize at <console>:24

    scala> z.glom.collect
    res11: Array[Array[String]] = Array(Array(12, 23), Array(345, 4567))

    scala> z.aggregate("")((x,y) => math.max(x.length, y.length).toString, (x,y) => x + y)
    res12: String = 42

    scala> z.aggregate("")((x,y) => math.max(x.length, y.length).toString, (x,y) => x + y)
    res13: String = 24
  Note:
    step1: the first partition evaluates math.max(x.length, y.length).toString
      res0 = math.max("".length, "12".length).toString = "2"
      res1 = math.max(res0.length, "23".length).toString = "2"
      The first partition returns "2"
    step2: the second partition evaluates math.max(x.length, y.length).toString
      res2 = math.max("".length, "345".length).toString = "3"
      res3 = math.max(res2.length, "4567".length).toString = "4"
      The second partition returns "4"
    step3: finally (x,y) => x + y concatenates the results: "24" or "42", depending on the order in which the partition results are merged
    scala> z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)
    res14: String = 11
  Note:
    step1: the first partition evaluates math.min(x.length, y.length).toString
      res0 = math.min("".length, "12".length).toString = "0"
      res1 = math.min(res0.length, "23".length).toString = "1"
      The first partition returns "1"
    step2: the second partition evaluates math.min(x.length, y.length).toString
      res2 = math.min("".length, "345".length).toString = "0"
      res3 = math.min(res2.length, "4567".length).toString = "1"
      The second partition returns "1"
    step3: finally (x,y) => x + y concatenates the results: "11" in either order
    scala> z.aggregate("12")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)
    res15: String = 1211
  Note:
    step1: the first partition evaluates math.min(x.length, y.length).toString
      res0 = math.min("12".length, "12".length).toString = "2"
      res1 = math.min(res0.length, "23".length).toString = "1"
      The first partition returns "1"
    step2: the second partition evaluates math.min(x.length, y.length).toString
      res2 = math.min("12".length, "345".length).toString = "2"
      res3 = math.min(res2.length, "4567".length).toString = "1"
      The second partition returns "1"
    step3: finally (x,y) => x + y starts from zeroValue "12" and appends both results: "12" + "1" + "1" = "1211" in either order
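
  The three string cases can be checked the same way. This is again a plain-Scala sketch rather than Spark code: aggregateSim is a hypothetical helper, the partition contents come from the glom output above, and reversing the partition list is only an assumed stand-in for partition results arriving in a different order.

```scala
object StringAggregateSketch {
  // Hypothetical helper (not a Spark API): fold each partition with seqOp,
  // then merge the partition results with combOp, both starting from zeroValue.
  def aggregateSim[T, U](partitions: Seq[Seq[T]], zeroValue: U)(
      seqOp: (U, T) => U,
      combOp: (U, U) => U): U =
    partitions.map(_.foldLeft(zeroValue)(seqOp)).foldLeft(zeroValue)(combOp)

  // The two seqOp functions used in the notes.
  val maxLen: (String, String) => String =
    (x, y) => math.max(x.length, y.length).toString
  val minLen: (String, String) => String =
    (x, y) => math.min(x.length, y.length).toString

  def main(args: Array[String]): Unit = {
    // Partition contents as reported by z.glom.collect.
    val parts = Seq(Seq("12", "23"), Seq("345", "4567"))
    println(aggregateSim(parts, "")(maxLen, _ + _))         // 24
    println(aggregateSim(parts.reverse, "")(maxLen, _ + _)) // 42
    println(aggregateSim(parts, "")(minLen, _ + _))         // 11
    println(aggregateSim(parts, "12")(minLen, _ + _))       // 1211
  }
}
```

  Only the max case is order-sensitive, because its two partition results differ; in the min cases both partitions return "1", so the concatenation is the same in any order.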
4.fold
  def fold(zeroValue: T)(op: (T, T) => T): T
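  Note: fold can be seen as a simplified aggregate, where a single op serves as both seqOp and combOp and the result type is the element type T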
    val a = sc.parallelize(List(1,2,3), 3)
    a.fold(0)(_ + _)
    res59: Int = 6
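
  A zeroValue of 0 hides the fact that fold, like aggregate, uses zeroValue once per partition and once more in the final merge. The plain-Scala sketch below makes that visible; foldSim is a hypothetical helper (not a Spark API), and the one-element-per-partition layout is an assumption about how List(1,2,3) is split into 3 partitions.

```scala
object FoldSketch {
  // Hypothetical helper (not a Spark API): the same op folds each partition
  // and then merges the partition results, starting from zeroValue both times.
  def foldSim[T](partitions: Seq[Seq[T]], zeroValue: T)(op: (T, T) => T): T = {
    val perPartition = partitions.map(_.foldLeft(zeroValue)(op))
    perPartition.foldLeft(zeroValue)(op)
  }

  def main(args: Array[String]): Unit = {
    // Assumed layout of sc.parallelize(List(1,2,3), 3): one element per partition.
    val parts = Seq(Seq(1), Seq(2), Seq(3))
    println(foldSim(parts, 0)(_ + _)) // 6
    println(foldSim(parts, 1)(_ + _)) // (1+1) + (1+2) + (1+3) + 1 = 10
  }
}
```

  For a sum, only a neutral zeroValue such as 0 gives a result that does not depend on the number of partitions.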

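Lazy Evaluation

1.Definition:
  No computation is triggered until an action is invoked on the RDD. Transformations, creation operations, and control operations are all lazy;
  only actions trigger a Job.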
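
  The same behaviour can be observed with an ordinary Scala Iterator, used here only as an analogy for an RDD (it is not Spark code, and LazySketch and run are hypothetical names). In this sketch, map plays the role of a transformation and sum the role of an action.

```scala
object LazySketch {
  // Returns (calls before the "action", result, calls after the "action").
  def run(): (Int, Int, Int) = {
    var evaluated = 0
    // "Transformation": Iterator.map only records what to do.
    val doubled = Iterator(1, 2, 3).map { x => evaluated += 1; x * 2 }
    val before = evaluated // still 0, nothing has been computed yet
    // "Action": sum consumes the iterator and forces the mapping function to run.
    val total = doubled.sum
    (before, total, evaluated)
  }

  def main(args: Array[String]): Unit =
    println(run()) // (0,12,3)
}
```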

posted @ 2018-08-30 20:02  Coding_Now