RDD Transformations

Commonly used RDD transformation operators

1. map(func): applies func to each element of the dataset and returns a new RDD

scala> val rdd1 = sc.makeRDD(1 to 10)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[10] at makeRDD at <console>:24

scala> rdd1.collect
res10: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

scala> val rdd2 = rdd1.map(x => x * 2)
rdd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[11] at map at <console>:25

scala> rdd2.collect
res11: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)

scala> val rdd3 = rdd1.map(_*3)
rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[12] at map at <console>:25

scala> rdd3.collect
res12: Array[Int] = Array(3, 6, 9, 12, 15, 18, 21, 24, 27, 30)

2. filter(func): applies func to each element of the dataset and returns a new RDD made up of the elements for which func returns true

scala> val rdd4 = rdd1.filter(_%3==0)
rdd4: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[13] at filter at <console>:25

scala> rdd4.collect
res13: Array[Int] = Array(3, 6, 9)

3. flatMap(func): similar to map, but each input element can be mapped to zero or more output elements

scala> val lst = List("hello scala","hello spark","hello zhangcong")
lst: List[String] = List(hello scala, hello spark, hello zhangcong)

scala> val rdd1 = sc.makeRDD(lst)
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[14] at makeRDD at <console>:26

scala> rdd1.collect
res14: Array[String] = Array(hello scala, hello spark, hello zhangcong)

scala> val rdd2 = rdd1.flatMap(_.split(" "))
rdd2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[15] at flatMap at <console>:25

scala> rdd2.collect
res15: Array[String] = Array(hello, scala, hello, spark, hello, zhangcong)

4. mapPartitions(func): similar to map, but where map applies func to each individual element, mapPartitions applies func to an entire partition at a time

If an RDD has N elements in M partitions, the function passed to map is invoked N times, while the function passed to mapPartitions is invoked only M times, each call processing all the elements of one partition

scala> val rdd1 = sc.makeRDD(1 to 10)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[16] at makeRDD at <console>:24

scala> rdd1.getNumPartitions
res16: Int = 2

scala> rdd1.collect
res17: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

scala> val rdd2 = rdd1.mapPartitions(iter => Iterator(iter.toList))
rdd2: org.apache.spark.rdd.RDD[List[Int]] = MapPartitionsRDD[17] at mapPartitions at <console>:25

scala> rdd2.collect
res18: Array[List[Int]] = Array(List(1, 2, 3, 4, 5), List(6, 7, 8, 9, 10))

scala> val rdd3 = rdd1.mapPartitions(iter => iter.map(_*2))
rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[18] at mapPartitions at <console>:25

scala> rdd3.collect
res19: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
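
A common reason to prefer mapPartitions over map is to pay a per-partition setup cost (opening a database connection, building a parser, and so on) once per partition rather than once per element. A minimal sketch of that pattern, where java.text.DecimalFormat merely stands in for any expensive resource:

val rdd4 = rdd1.mapPartitions { iter =>
  // per-partition setup: built once per partition, not once per element
  val fmt = new java.text.DecimalFormat("#,##0.00")
  iter.map(x => fmt.format(x * 1.5))   // the same formatter is reused for every element
}
rdd4.collect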

5. mapPartitionsWithIndex(func): like mapPartitions, but func additionally receives the partition index

scala> val rdd1 = sc.makeRDD(1 to 10)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[16] at makeRDD at <console>:24

scala> rdd1.getNumPartitions
res16: Int = 2

scala> rdd1.collect
res17: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

scala> val rdd2 = rdd1.mapPartitionsWithIndex((idx,iter) => Iterator(idx.toString + " : " + iter.toList.toString))
rdd2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[19] at mapPartitionsWithIndex at <console>:25

scala> rdd2.collect
res20: Array[String] = Array(0 : List(1, 2, 3, 4, 5), 1 : List(6, 7, 8, 9, 10))

6. groupBy(func): groups elements by the return value of func; values that map to the same key are collected into an iterable

scala> val rdd1 = sc.makeRDD(1 to 10)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[20] at makeRDD at <console>:24

scala> val rdd2 = rdd1.groupBy(_%3)
rdd2: org.apache.spark.rdd.RDD[(Int, Iterable[Int])] = ShuffledRDD[22] at groupBy at <console>:25

scala> rdd2.collect
res22: Array[(Int, Iterable[Int])] = Array((0,CompactBuffer(3, 6, 9)), (2,CompactBuffer(2, 5, 8)), (1,CompactBuffer(7, 10, 1, 4)))

7. glom(): turns each partition into an array, producing a new RDD

scala> val rdd1 = sc.makeRDD(1 to 10)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[20] at makeRDD at <console>:24

scala> val rdd2 = rdd1.glom
rdd2: org.apache.spark.rdd.RDD[Array[Int]] = MapPartitionsRDD[23] at glom at <console>:25

scala> rdd2.collect
res23: Array[Array[Int]] = Array(Array(1, 2, 3, 4, 5), Array(6, 7, 8, 9, 10))

8. sample(withReplacement, fraction, seed): sampling operator; draws a random sample of the data using the random seed seed, where fraction is the expected fraction of elements to sample (not an exact count) and withReplacement controls whether elements are sampled with replacement

scala> val rdd1 = sc.makeRDD(1 to 10)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[16] at makeRDD at <console>:24

scala> rdd1.sample(true, 0.5, 666).collect
res30: Array[Int] = Array(3, 3, 6, 6, 8, 10, 10, 10)

scala> rdd1.sample(false, 0.5, 666).collect
res31: Array[Int] = Array(1, 2, 3, 4, 6, 7, 8, 9, 10)

scala> rdd1.sample(true, 0.5).collect
res32: Array[Int] = Array(3, 4, 6, 7)

scala> rdd1.sample(true, 0.5).collect
res33: Array[Int] = Array(1, 3, 3, 4, 6, 10, 10)

scala> rdd1.sample(true, 0.5).collect
res34: Array[Int] = Array(4, 7, 9, 10)

scala> rdd1.sample(true, 0.5).collect
res35: Array[Int] = Array(1, 1, 3, 7, 9)

9. distinct([numTasks]): removes duplicate elements and returns a new RDD; the optional numTasks argument changes the number of partitions of the result

scala> val rdd1 = sc.makeRDD(List(1,1,1,2,3,4,5,5,6,7,7,7))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[34] at makeRDD at <console>:24

scala> rdd1.collect
res36: Array[Int] = Array(1, 1, 1, 2, 3, 4, 5, 5, 6, 7, 7, 7)

scala> val rdd2 = rdd1.distinct
rdd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[37] at distinct at <console>:25

scala> rdd2.collect
res37: Array[Int] = Array(4, 6, 2, 1, 3, 7, 5)

scala> rdd2.getNumPartitions
res38: Int = 2

scala> val rdd3 = rdd1.distinct(3)
rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[40] at distinct at <console>:25

scala> rdd3.collect
res39: Array[Int] = Array(6, 3, 4, 1, 7, 5, 2)

scala> rdd3.getNumPartitions
res40: Int = 3

10. coalesce(numPartitions): reduces the number of partitions, without a shuffle by default

scala> val rdd1 = sc.makeRDD(1 to 20, 4)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[41] at makeRDD at <console>:24

scala> rdd1.getNumPartitions
res41: Int = 4

scala> val rdd2 = rdd1.coalesce(3)
rdd2: org.apache.spark.rdd.RDD[Int] = CoalescedRDD[42] at coalesce at <console>:25

scala> rdd2.getNumPartitions
res42: Int = 3

11. repartition(numPartitions): increases or decreases the number of partitions, always with a shuffle

scala> val rdd1 = sc.makeRDD(1 to 20, 4)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[41] at makeRDD at <console>:24

scala> rdd1.getNumPartitions
res41: Int = 4

scala> val rdd2 = rdd1.repartition(3)
rdd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[46] at repartition at <console>:25

scala> rdd2.getNumPartitions
res43: Int = 3

scala> val rdd3 = rdd1.repartition(5)
rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[50] at repartition at <console>:25

scala> rdd3.getNumPartitions
res44: Int = 5
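
In Spark's RDD API, repartition is essentially coalesce with shuffle = true, and coalesce without a shuffle can only reduce the partition count. A small sketch of that behaviour:

// coalesce with the default shuffle = false cannot grow the partition count:
rdd1.coalesce(5).getNumPartitions                    // stays at 4

// passing shuffle = true allows growing it, which is what repartition(5) does internally:
rdd1.coalesce(5, shuffle = true).getNumPartitions    // 5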

12. sortBy(func, [ascending], [numTasks]): applies func to each element and sorts the elements by the resulting key

scala> val random = scala.util.Random
random: util.Random.type = scala.util.Random$@66d34df2

scala> val arr = (1 to 20).map(x => random.nextInt(10))
arr: scala.collection.immutable.IndexedSeq[Int] = Vector(4, 5, 3, 5, 0, 3, 3, 3, 0, 8, 5, 8, 4, 5, 4, 8, 6, 8, 1, 9)

scala> val rdd1 = sc.makeRDD(arr)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[51] at makeRDD at <console>:26

scala> rdd1.collect
res45: Array[Int] = Array(4, 5, 3, 5, 0, 3, 3, 3, 0, 8, 5, 8, 4, 5, 4, 8, 6, 8, 1, 9)

scala> val rdd2 = rdd1.sortBy(x => x)
rdd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[56] at sortBy at <console>:25

scala> rdd2.collect
res46: Array[Int] = Array(0, 0, 1, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 8, 8, 8, 8, 9)

scala> val rdd3 = rdd1.sortBy(x => x, false)
rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[61] at sortBy at <console>:25

scala> rdd3.collect
res47: Array[Int] = Array(9, 8, 8, 8, 8, 6, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3, 3, 1, 0, 0)
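
Since the first argument of sortBy is a key-extraction function, it can also sort by a derived key, for example sorting key-value pairs by their value. A minimal sketch:

val pairs = sc.makeRDD(List(("a", 3), ("b", 1), ("c", 2)))
// sort by the second field, descending, into a single output partition
pairs.sortBy(_._2, ascending = false, numPartitions = 1).collect
// expected: Array((a,3), (c,2), (b,1))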

13. intersection(otherRDD): returns the intersection of two RDDs

scala> val rdd1 = sc.makeRDD(1 to 21)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[62] at makeRDD at <console>:24

scala> val rdd2 = sc.makeRDD(11 to 31)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[63] at makeRDD at <console>:24

scala> val rdd3 = rdd1.intersection(rdd2)
rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[69] at intersection at <console>:27

scala> rdd3.collect
res48: Array[Int] = Array(16, 14, 18, 12, 20, 13, 19, 15, 21, 11, 17)

scala> rdd3.collect.sorted
res50: Array[Int] = Array(11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21)

14. union(otherRDD): returns the union of two RDDs (duplicates are kept)

scala> val rdd4 = rdd1.union(rdd2)
rdd4: org.apache.spark.rdd.RDD[Int] = UnionRDD[70] at union at <console>:27

scala> rdd4.collect.sorted
res51: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 11, 12, 12, 13, 13, 14, 14, 15, 15, 16, 16, 17, 17, 18, 18, 19, 19, 20, 20, 21, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31)
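
Note that union simply concatenates the two RDDs without deduplication, and (for RDDs without a common partitioner, as here) the result carries the partitions of both inputs. A quick check, assuming rdd1 and rdd2 have 2 partitions each:

rdd4.getNumPartitions            // expected: 4, i.e. rdd1's partitions plus rdd2's
rdd4.distinct.collect.sorted     // apply distinct afterwards to drop the duplicates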

15. subtract(otherRDD): returns the difference of two RDDs (elements present in this RDD but not in otherRDD)

scala> val rdd5 = rdd1.subtract(rdd2)
rdd5: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[74] at subtract at <console>:27

scala> rdd5.collect.sorted
res52: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

16. cartesian(otherRDD): returns the Cartesian product of two RDDs

scala> val rdd6 = rdd1.cartesian(rdd2)
rdd6: org.apache.spark.rdd.RDD[(Int, Int)] = CartesianRDD[75] at cartesian at <console>:27

scala> rdd6.collect
res53: Array[(Int, Int)] = Array((1,11), (1,12), (1,13), (1,14), (1,15), (1,16), (1,17), (1,18), (1,19), (1,20), (2,11), (2,12), (2,13), (2,14), (2,15), (2,16), (2,17), (2,18), (2,19), (2,20), (3,11), (3,12), (3,13), (3,14), (3,15), (3,16), (3,17), (3,18), (3,19), (3,20), (4,11), (4,12), (4,13), (4,14), (4,15), (4,16), (4,17), (4,18), (4,19), (4,20), (5,11), (5,12), (5,13), (5,14), (5,15), (5,16), (5,17), (5,18), (5,19), (5,20), (6,11), (6,12), (6,13), (6,14), (6,15), (6,16), (6,17), (6,18), (6,19), (6,20), (7,11), (7,12), (7,13), (7,14), (7,15), (7,16), (7,17), (7,18), (7,19), (7,20), (8,11), (8,12), (8,13), (8,14), (8,15), (8,16), (8,17), (8,18), (8,19), (8,20), (9,11), (9,12), (9,13), (9,14), (9,15), (9,16), (9,17), (9,18), (9,19), (9,20), (10,11), (10,12), ...

17. zip(otherRDD): combines two RDDs into an RDD of key-value pairs; the two RDDs must have the same number of partitions and the same number of elements in each partition

scala> val rdd7 = rdd1.zip(rdd2)
rdd7: org.apache.spark.rdd.RDD[(Int, Int)] = ZippedPartitionsRDD2[76] at zip at <console>:27

scala> rdd7.collect
res54: Array[(Int, Int)] = Array((1,11), (2,12), (3,13), (4,14), (5,15), (6,16), (7,17), (8,18), (9,19), (10,20), (11,21), (12,22), (13,23), (14,24), (15,25), (16,26), (17,27), (18,28), (19,29), (20,30), (21,31))
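
When the two RDDs do not line up, zip fails at runtime rather than silently truncating. A minimal sketch (the exact exception depends on the Spark version, so only success or failure is checked here):

import scala.util.Try
val shorter = sc.makeRDD(1 to 5)                  // fewer elements than rdd1
Try(rdd1.zip(shorter).collect).isSuccess          // expected: false, per-partition element counts differ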

Note: all RDD transformations are lazily evaluated; an action is needed to trigger actual execution.
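
A quick illustration of this laziness: building a chain of transformations only records the lineage, and nothing is computed until an action runs. A minimal sketch:

val lazyRdd = sc.makeRDD(1 to 10).map(_ * 2).filter(_ > 5)
// toDebugString prints the recorded lineage; no job has run yet
println(lazyRdd.toDebugString)
// only an action such as collect, count or saveAsTextFile triggers execution
lazyRdd.collect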
