RDD Actions
1. Definition: an action triggers a Job by calling the runJob() method.
Examples: collect, count
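2. foreach
Note: the function runs on the executor nodes, so any output it produces stays there instead of being returned to the driver.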
3. aggregate
def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U
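Note: aggregate combines the elements of an RDD. seqOp first folds the elements of type T inside each partition into a value of type U;
combOp then merges the per-partition values of type U into a single U.
Important: both seqOp and combOp start from zeroValue, whose type is U, so zeroValue is applied once per partition and once more in the final merge.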
val z = sc.parallelize(List(1,2,3,4,5,6), 2)
z.aggregate(0)(math.max(_, _), _ + _)
res40: Int = 9
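Explanation:
step 1: the first partition runs math.max over [0,1,2,3] (zeroValue 0 followed by the elements 1,2,3); the result is 3
step 2: the second partition runs math.max over [0,4,5,6]; the result is 6
step n: the N-th partition runs math.max the same way; the result is that partition's maximum
final step: combOp (_ + _) merges all partition results, starting from zeroValue: 0 + 3 + 6 = 9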
z.aggregate(5)(math.max(_, _), _ + _)
res29: Int = 16
Explanation:
// This example returns 16 since the initial value is 5
// reduce of partition 0 will be max(5, 1, 2, 3) = 5
// reduce of partition 1 will be max(5, 4, 5, 6) = 6
// final reduce across partitions will be 5 + 5 + 6 = 16
// note that the final reduce includes the initial value
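The two results above (9 and 16) can be reproduced without a cluster. What follows is a minimal sketch in plain Scala, not Spark's implementation: AggregateModel and aggregateModel are hypothetical names, and the partitions are written out by hand on the assumption that parallelize(List(1,2,3,4,5,6), 2) splits the list into [1,2,3] and [4,5,6], as the explanation above describes.

```scala
object AggregateModel {
  // Hypothetical helper that models RDD.aggregate over explicit partitions.
  def aggregateModel[T, U](partitions: Seq[Seq[T]], zeroValue: U)(
      seqOp: (U, T) => U,
      combOp: (U, U) => U): U = {
    // Each partition starts from zeroValue and folds its elements with seqOp.
    val perPartition = partitions.map(_.foldLeft(zeroValue)(seqOp))
    // The final merge also starts from zeroValue and combines with combOp.
    perPartition.foldLeft(zeroValue)(combOp)
  }

  // Assumed partition layout of sc.parallelize(List(1,2,3,4,5,6), 2).
  val parts: Seq[Seq[Int]] = Seq(Seq(1, 2, 3), Seq(4, 5, 6))

  // zeroValue 0: 0 + max(0,1,2,3) + max(0,4,5,6) = 0 + 3 + 6 = 9
  val withZero0: Int = aggregateModel(parts, 0)((acc, x) => math.max(acc, x), _ + _)

  // zeroValue 5: 5 + max(5,1,2,3) + max(5,4,5,6) = 5 + 5 + 6 = 16
  val withZero5: Int = aggregateModel(parts, 5)((acc, x) => math.max(acc, x), _ + _)
}
```

In this model zeroValue is used three times for two partitions, which is why a zeroValue that is not neutral for combOp changes the result.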
Worked example:
scala> val z = sc.parallelize(List("12","23","345","4567"),2)
z: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[3] at parallelize at <console>:24
scala> z.glom.collect
res11: Array[Array[String]] = Array(Array(12, 23), Array(345, 4567))
scala> z.aggregate("")((x,y) => math.max(x.length, y.length).toString, (x,y) => x + y)
res12: String = 42
scala> z.aggregate("")((x,y) => math.max(x.length, y.length).toString, (x,y) => x + y)
res13: String = 24
Explanation:
step 1: the first partition evaluates math.max(x.length, y.length).toString
res0 = math.max("".length, "12".length).toString = "2"
res1 = math.max(res0.length, "23".length).toString = "2"
The first partition returns: 2
step 2: the second partition evaluates math.max(x.length, y.length).toString
res2 = math.max("".length, "345".length).toString = "3"
res3 = math.max(res2.length, "4567".length).toString = "4"
The second partition returns: 4
step 3: finally (x,y) => x + y concatenates the partition results: 24 or 42, depending on which partition's result is merged first
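Why the same call returns 42 in one run and 24 in another can be illustrated in plain Scala. This is a sketch under assumptions, not Spark code: MergeOrderDemo and partitionResult are hypothetical names, the partition layout is copied from the z.glom.collect output above, and the possible arrival orders of the partition results are simply enumerated with permutations.

```scala
object MergeOrderDemo {
  // seqOp and combOp copied from the example above.
  val seqOp: (String, String) => String =
    (x, y) => math.max(x.length, y.length).toString
  val combOp: (String, String) => String = (x, y) => x + y

  // Partition layout as printed by z.glom.collect.
  val parts: Seq[Seq[String]] = Seq(Seq("12", "23"), Seq("345", "4567"))

  // Hypothetical helper: result of one partition, starting from zeroValue "".
  def partitionResult(p: Seq[String]): String = p.foldLeft("")(seqOp)

  // One result per partition: Seq("2", "4")
  val perPartition: Seq[String] = parts.map(partitionResult)

  // Merge the partition results in every possible arrival order,
  // each time starting from zeroValue "": Set("24", "42")
  val outcomes: Set[String] =
    perPartition.permutations.map(_.foldLeft("")(combOp)).toSet
}
```

Because string concatenation is not commutative, the merge order shows up in the result; a commutative combOp such as _ + _ on Int would hide it.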
scala> z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)
res14: String = 11
step 1: the first partition evaluates math.min(x.length, y.length).toString
res0 = math.min("".length, "12".length).toString = "0"
res1 = math.min(res0.length, "23".length).toString = "1"
The first partition returns: 1
step 2: the second partition evaluates math.min(x.length, y.length).toString
res2 = math.min("".length, "345".length).toString = "0"
res3 = math.min(res2.length, "4567".length).toString = "1"
The second partition returns: 1
step 3: finally (x,y) => x + y concatenates the partition results: 11 in either merge order
scala> z.aggregate("12")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)
res15: String = 1211
step 1: the first partition evaluates math.min(x.length, y.length).toString
res0 = math.min("12".length, "12".length).toString = "2"
res1 = math.min(res0.length, "23".length).toString = "1"
The first partition returns: 1
step 2: the second partition evaluates math.min(x.length, y.length).toString
res2 = math.min("12".length, "345".length).toString = "2"
res3 = math.min(res2.length, "4567".length).toString = "1"
The second partition returns: 1
step 3: finally (x,y) => x + y starts from zeroValue "12" and appends both partition results: "12" + "1" + "1" = 1211 in either merge order
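The intermediate values res0 to res3 in the two math.min walkthroughs can be checked with scanLeft, which keeps every intermediate accumulator. A minimal plain-Scala sketch, not Spark code: MinTrace and trace are hypothetical names, and the partition layout is the one shown by z.glom.collect.

```scala
object MinTrace {
  // seqOp copied from the math.min examples above.
  val seqOp: (String, String) => String =
    (x, y) => math.min(x.length, y.length).toString

  // Hypothetical helper: zeroValue followed by every intermediate seqOp result.
  def trace(zeroValue: String, partition: Seq[String]): Seq[String] =
    partition.scanLeft(zeroValue)(seqOp)

  val p1: Seq[String] = Seq("12", "23")
  val p2: Seq[String] = Seq("345", "4567")

  val emptyZeroP1: Seq[String] = trace("", p1)   // Seq("", "0", "1")
  val emptyZeroP2: Seq[String] = trace("", p2)   // Seq("", "0", "1")
  val zero12P1: Seq[String] = trace("12", p1)    // Seq("12", "2", "1")
  val zero12P2: Seq[String] = trace("12", p2)    // Seq("12", "2", "1")

  // Final merge: zeroValue first, then both partition results, joined by x + y.
  val resultEmptyZero: String = "" + emptyZeroP1.last + emptyZeroP2.last // "11"
  val resultZero12: String = "12" + zero12P1.last + zero12P2.last        // "1211"
}
```

The traces show that seqOp compares the length of the previous result string, not its numeric value, which is why the second step in each partition yields "1".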
4. fold
def fold(zeroValue: T)(op: (T, T) => T): T
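Note: fold can be understood as a simplified aggregate: zeroValue, the elements, and the result all have type T, and the single op plays the role of both seqOp and combOp.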
val a = sc.parallelize(List(1,2,3), 3)
a.fold(0)(_ + _)
res59: Int = 6
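To make the relationship with aggregate concrete, fold can be modelled as aggregate with one function used for both the per-partition step and the final merge. The sketch below is plain Scala, not Spark's implementation: FoldModel and foldModel are hypothetical names, and it assumes that parallelize(List(1,2,3), 3) places one element in each partition. The second call uses a non-neutral zeroValue to show that it is applied once per partition and once more in the final merge.

```scala
object FoldModel {
  // Hypothetical helper that models RDD.fold over explicit partitions.
  def foldModel[T](partitions: Seq[Seq[T]], zeroValue: T)(op: (T, T) => T): T = {
    // Per-partition fold, each starting from zeroValue.
    val perPartition = partitions.map(_.foldLeft(zeroValue)(op))
    // Final merge with the same op, again starting from zeroValue.
    perPartition.foldLeft(zeroValue)(op)
  }

  // Assumed partition layout of sc.parallelize(List(1,2,3), 3).
  val parts: Seq[Seq[Int]] = Seq(Seq(1), Seq(2), Seq(3))

  // zeroValue 0 is neutral for addition: 0 + 1 + 2 + 3 = 6
  val sumZero0: Int = foldModel(parts, 0)(_ + _)

  // zeroValue 1 is counted four times: 1 + (1+1) + (1+2) + (1+3) = 10
  val sumZero1: Int = foldModel(parts, 1)(_ + _)
}
```

In this model the result depends on the number of partitions unless zeroValue is the neutral element of op.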
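Lazy evaluation
1. Definition:
No computation is triggered until an action is called on the RDD. Transformations, creation operations, and control operations are all lazy;
only actions trigger a Job.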
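The same idea can be observed in plain Scala with an Iterator, whose map is also lazy. This is only an analogy, not Spark code: LazyDemo and its members are hypothetical names, the lazy map stands in for a transformation, and toList stands in for an action.

```scala
object LazyDemo {
  // Counts how often the mapping function actually runs.
  var calls: Int = 0

  // Building the pipeline: map on an Iterator is lazy, like a transformation.
  val pipeline: Iterator[Int] =
    List(1, 2, 3).iterator.map { x => calls += 1; x * 2 }

  // Nothing has been computed yet, so the counter is still 0.
  val callsBeforeAction: Int = calls

  // Consuming the iterator plays the role of an action and runs the function.
  val result: List[Int] = pipeline.toList // List(2, 4, 6)

  // The mapping function has now run once per element: 3
  val callsAfterAction: Int = calls
}
```

The counter stays at 0 until the pipeline is consumed, which mirrors how an RDD lineage is only executed once an action triggers a Job.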