RDD Actions

1.Definition: an action triggers a Job by calling the runJob() method.
　　Examples: collect, count
2.foreach
　　Note: foreach applies the given function on the executor nodes, not on the driver, and returns no result to the driver
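As a rough illustration of the note on foreach, the sketch below runs foreach in local mode and uses an accumulator to bring a value back to the driver, since foreach itself returns nothing. The object name ForeachOnExecutors, the helper sumWithForeach, and the local[2] master are assumptions made for this example, and it assumes Spark 2.x or later on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper for illustration; not part of the Spark API.
object ForeachOnExecutors {
  // Sums the data with foreach. The closure runs on the executors, so the
  // driver only sees a result here through the accumulator.
  def sumWithForeach(sc: SparkContext, data: Seq[Int]): Long = {
    val acc = sc.longAccumulator("sum")
    sc.parallelize(data, 2).foreach(x => acc.add(x.toLong))
    acc.sum // read on the driver after the action has finished
  }

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption so that the sketch runs without a cluster
    val conf = new SparkConf().setMaster("local[2]").setAppName("foreach-sketch")
    val sc = new SparkContext(conf)
    try println(sumWithForeach(sc, Seq(1, 2, 3, 4, 5, 6))) // 21
    finally sc.stop()
  }
}
```

Updating an ordinary local variable inside the foreach closure would not work on a cluster, because each executor gets its own copy of the closure; the accumulator is one way around that.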

3.aggregate
　　def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U
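　　Note: aggregate combines the elements of an RDD. seqOp first folds the T-typed elements of each partition into a value of type U,
　　　　　then combOp merges the per-partition U values into a single U.
　　　　　Note in particular that zeroValue, which has type U, is used as the starting value both by seqOp in every partition and by combOp in the final merge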
　　val z = sc.parallelize(List(1,2,3,4,5,6), 2)
　　z.aggregate(0)(math.max(_, _), _ + _)
　　res40: Int = 9
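　　Explanation:
　　　　step1: in the first partition, math.max is applied to [0,1,2,3] (zeroValue 0 followed by the elements 1,2,3), giving 3
　　　　step2: in the second partition, math.max is applied to [0,4,5,6], giving 6
　　　　stepn: in the Nth partition, math.max gives the maximum of zeroValue and that partition's elements
　　　　final step: combOp (_ + _) merges all partition results, starting from zeroValue: 0 + 3 + 6 = 9
　　z.aggregate(5)(math.max(_, _), _ + _)
　　res29: Int = 16
　　Explanation:
　　　　// This example returns 16 since the initial value is 5
　　　　// reduce of partition 0 will be max(5, 1, 2, 3) = 5
　　　　// reduce of partition 1 will be max(5, 4, 5, 6) = 6
　　　　// final reduce across partitions will be 5 + 5 + 6 = 16
　　　　// note that the final reduce includes the initial value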
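The two results above can be checked without a cluster. The following is a minimal plain-Scala model of the steps just described, not Spark's own implementation: the object AggregateModel and the helper aggregateModel are hypothetical names, and the partitions are passed in explicitly as nested sequences instead of being chosen by Spark.

```scala
// Plain-Scala model of aggregate; hypothetical names, no Spark dependency.
object AggregateModel {
  def aggregateModel[T, U](partitions: Seq[Seq[T]], zeroValue: U)(
      seqOp: (U, T) => U, combOp: (U, U) => U): U = {
    // seqOp folds each partition, starting from zeroValue every time
    val perPartition = partitions.map(_.foldLeft(zeroValue)(seqOp))
    // combOp merges the partition results, starting from zeroValue once more
    perPartition.foldLeft(zeroValue)(combOp)
  }

  def main(args: Array[String]): Unit = {
    // Same split as sc.parallelize(List(1,2,3,4,5,6), 2) in the text
    val parts = Seq(Seq(1, 2, 3), Seq(4, 5, 6))
    println(aggregateModel(parts, 0)(math.max(_, _), _ + _)) // 9
    println(aggregateModel(parts, 5)(math.max(_, _), _ + _)) // 16
  }
}
```

With n partitions, zeroValue enters the computation n + 1 times in this model, which is why a non-neutral zeroValue such as 5 changes the result.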

　　Worked example:
　　　　scala> val z = sc.parallelize(List("12","23","345","4567"),2)
　　　　z: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[3] at parallelize at <console>:24

　　　　scala> z.glom.collect
　　　　res11: Array[Array[String]] = Array(Array(12, 23), Array(345, 4567))

　　　　scala> z.aggregate("")((x,y) => math.max(x.length, y.length).toString, (x,y) => x + y)
　　　　res12: String = 42

　　　　scala> z.aggregate("")((x,y) => math.max(x.length, y.length).toString, (x,y) => x + y)
　　　　res13: String = 24
　　Explanation:
　　　　step1: the first partition runs math.max(x.length, y.length).toString
　　　　　　res0 = math.max("".length, "12".length).toString = "2"
　　　　　　res1 = math.max(res0.length, "23".length).toString = "2"
　　　　　　The first partition returns "2"
　　　　step2: the second partition runs math.max(x.length, y.length).toString
　　　　　　res2 = math.max("".length, "345".length).toString = "3"
　　　　　　res3 = math.max(res2.length, "4567".length).toString = "4"
　　　　　　The second partition returns "4"
　　　　step3: finally (x,y) => x + y concatenates the partition results: "24" or "42", depending on which partition's result is merged first
　　The same call with math.min in place of math.max:
　　　　scala> z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)
　　　　res14: String = 11
　　Explanation:
　　　　step1: the first partition runs math.min(x.length, y.length).toString
　　　　　　res0 = math.min("".length, "12".length).toString = "0"
　　　　　　res1 = math.min(res0.length, "23".length).toString = "1"
　　　　　　The first partition returns "1"
　　　　step2: the second partition runs math.min(x.length, y.length).toString
　　　　　　res2 = math.min("".length, "345".length).toString = "0"
　　　　　　res3 = math.min(res2.length, "4567".length).toString = "1"
　　　　　　The second partition returns "1"
　　　　step3: finally (x,y) => x + y concatenates the partition results: "11" in either order
　　The same call with zeroValue "12":
　　　　scala> z.aggregate("12")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)
　　　　res15: String = 1211
　　Explanation:
　　　　step1: the first partition runs math.min(x.length, y.length).toString
　　　　　　res0 = math.min("12".length, "12".length).toString = "2"
　　　　　　res1 = math.min(res0.length, "23".length).toString = "1"
　　　　　　The first partition returns "1"
　　　　step2: the second partition runs math.min(x.length, y.length).toString
　　　　　　res2 = math.min("12".length, "345".length).toString = "2"
　　　　　　res3 = math.min(res2.length, "4567".length).toString = "1"
　　　　　　The second partition returns "1"
　　　　step3: finally (x,y) => x + y concatenates zeroValue and the partition results: "12" + "1" + "1" = "1211" in either order
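The string results above, including the order-dependent "24" / "42", can be reproduced with a small plain-Scala sketch. It is a model of the explanation and not Spark code: StringAggregateModel, aggregateStrings, and the arrivalOrder parameter (the order in which partition results reach the final merge) are hypothetical names introduced for this example.

```scala
// Plain-Scala model of the string examples; hypothetical names, no Spark dependency.
object StringAggregateModel {
  // Folds each partition with seqOp, then concatenates the partition results
  // onto zeroValue in the given arrival order (indices into partitions).
  def aggregateStrings(partitions: Seq[Seq[String]], zeroValue: String, arrivalOrder: Seq[Int])(
      seqOp: (String, String) => String): String = {
    val perPartition = partitions.map(_.foldLeft(zeroValue)(seqOp))
    arrivalOrder.map(i => perPartition(i)).foldLeft(zeroValue)(_ + _)
  }

  // Same split as shown by z.glom.collect in the text
  val parts: Seq[Seq[String]] = Seq(Seq("12", "23"), Seq("345", "4567"))
  val maxLen: (String, String) => String = (x, y) => math.max(x.length, y.length).toString
  val minLen: (String, String) => String = (x, y) => math.min(x.length, y.length).toString

  def main(args: Array[String]): Unit = {
    println(aggregateStrings(parts, "", Seq(0, 1))(maxLen))   // 24
    println(aggregateStrings(parts, "", Seq(1, 0))(maxLen))   // 42
    println(aggregateStrings(parts, "", Seq(0, 1))(minLen))   // 11
    println(aggregateStrings(parts, "12", Seq(0, 1))(minLen)) // 1211
  }
}
```

String concatenation is not commutative, so the result depends on the merge order; this is what the two runs with identical input show.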
4.fold
　　def fold(zeroValue: T)(op: (T, T) => T): T
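　　Note: fold can be seen as a simplified aggregate in which the result type is the element type T and the same op serves as both seqOp and combOp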
　　　　val a = sc.parallelize(List(1,2,3), 3)
　　　　a.fold(0)(_ + _)
　　　　res59: Int = 6
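As with aggregate, the zeroValue of fold is applied once in every partition and once more in the final merge. The sketch below models this in plain Scala under the assumption that List(1,2,3) is split into three one-element partitions; FoldModel and foldModel are hypothetical names, and this is not Spark's implementation.

```scala
// Plain-Scala model of fold; hypothetical names, no Spark dependency.
object FoldModel {
  def foldModel[T](partitions: Seq[Seq[T]], zeroValue: T)(op: (T, T) => T): T =
    // op plays the role of seqOp inside each partition and of combOp across partitions
    partitions.map(_.foldLeft(zeroValue)(op)).foldLeft(zeroValue)(op)

  def main(args: Array[String]): Unit = {
    val parts = Seq(Seq(1), Seq(2), Seq(3))
    println(foldModel(parts, 0)(_ + _)) // 6
    println(foldModel(parts, 1)(_ + _)) // 10 = (1+1) + (1+2) + (1+3), plus 1 in the merge
  }
}
```

For a result that does not depend on the number of partitions, zeroValue should be the neutral element of op, such as 0 for addition.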

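Lazy Evaluation

1.Definition:
　　No computation is triggered until an action is called on the RDD. Transformations, creation operations, and control operations (such as cache and persist) are all lazy;
　　only an action triggers a Job.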
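One way to observe this is to count how often the function passed to map has run before and after an action. The sketch below is a minimal example assuming Spark 2.x or later on the classpath and a local[2] master; the object LazyEvaluationSketch and the helper mapCallsBeforeAndAfterAction are hypothetical names.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper for illustration; not part of the Spark API.
object LazyEvaluationSketch {
  // Returns (calls to the map function before the action, calls after it).
  def mapCallsBeforeAndAfterAction(sc: SparkContext): (Long, Long) = {
    val calls = sc.longAccumulator("mapCalls")
    // Creation and transformation only: nothing is computed at this point
    val doubled = sc.parallelize(List(1, 2, 3), 2).map { x => calls.add(1L); x * 2 }
    val before = calls.sum
    doubled.count() // the action triggers the Job, which runs the map function
    val after = calls.sum
    (before, after)
  }

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption so that the sketch runs without a cluster
    val conf = new SparkConf().setMaster("local[2]").setAppName("lazy-sketch")
    val sc = new SparkContext(conf)
    try println(mapCallsBeforeAndAfterAction(sc)) // (0,3)
    finally sc.stop()
  }
}
```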

posted @ 2018-08-30 20:02 Coding_Now
