Spark Key-Value RDD Operations

The following operations are run in the Spark shell.

Broadly speaking, RDDs fall into two categories: Value type and Key-Value type.

In practice, key-value RDDs, also called Pair RDDs, are used far more often.

Operations on Value-type RDDs are mostly defined in RDD.scala.

Operations on key-value RDDs are defined in PairRDDFunctions.scala.
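
These pair-specific methods become available on any RDD[(K, V)] through an implicit conversion (rddToPairRDDFunctions in the RDD companion object, in scope automatically since Spark 1.3), so no extra import is needed in the shell. A minimal illustration:

// an RDD[(String, Int)]: pair operations such as reduceByKey come from PairRDDFunctions
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
pairs.reduceByKey(_ + _).collect   // Array((a,3), (b,3))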

1.1 Creating a Pair RDD

val arr = (1 to 10).toArray
val arr1 = arr.map(x => (x, x*10, x*100))
// rdd1 is not a Pair RDD (its elements are 3-tuples, not key-value pairs)
val rdd1 = sc.makeRDD(arr1)

// rdd2 is a Pair RDD (its elements are (key, value) 2-tuples)
val arr2 = arr.map(x => (x, (x*10, x*100)))
val rdd2 = sc.makeRDD(arr2)

A Pair RDD has an explicit key, so calling collectAsMap on it works; on rdd1, which is not a Pair RDD, the method does not even exist:

scala> rdd1.collectAsMap
<console>:26: error: value collectAsMap is not a member of org.apache.spark.rdd.RDD[(Int, Int, Int)]
       rdd1.collectAsMap

scala> rdd2.collectAsMap
res3: scala.collection.Map[Int,(Int, Int)] = Map(8 -> (80,800), 2 -> (20,200), 5 -> (50,500), 4 -> (40,400), 7 -> (70,700), 10 -> (100,1000), 1 -> (10,100), 9 -> (90,900), 3 -> (30,300), 6 -> (60,600))

1.2 Transformation Operations

  • map-like operations

    • mapValues: swap the two elements of the (Int, Int) value

      scala> rdd2.collect
      res25: Array[(Int, (Int, Int))] = Array((1,(10,100)), (2,(20,200)), (3,(30,300)), (4,(40,400)), (5,(50,500)), (6,(60,600)), (7,(70,700)), (8,(80,800)), (9,(90,900)), (10,(100,1000)))
      
      scala> rdd2.mapValues(x => (x._2,x._1)).collect
      res26: Array[(Int, (Int, Int))] = Array((1,(100,10)), (2,(200,20)), (3,(300,30)), (4,(400,40)), (5,(500,50)), (6,(600,60)), (7,(700,70)), (8,(800,80)), (9,(900,90)), (10,(1000,100)))
      // equivalent implementation using map
      scala> rdd2.map{case (x,y)=>(x,(y._2,y._1))}.collect
      res28: Array[(Int, (Int, Int))] = Array((1,(100,10)), (2,(200,20)), (3,(300,30)), (4,(400,40)), (5,(500,50)), (6,(600,60)), (7,(700,70)), (8,(800,80)), (9,(900,90)), (10,(1000,100)))
      
    • flatMapValues: flatten the values

      scala> val a = sc.parallelize(List((1,2),(3,4),(5,6)))
      a: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[15] at parallelize at <console>:24
      
      scala> a.flatMapValues(x => 1 to x).collect
      res37: Array[(Int, Int)] = Array((1,1), (1,2), (3,1), (3,2), (3,3), (3,4), (5,1), (5,2), (5,3), (5,4), (5,5), (5,6))
      

      Equivalent implementation using map:

      scala> a.mapValues(x=> 1 to x).flatMap{case (x,y)=>{y.map(z=>(x,z))}}.collect
      res39: Array[(Int, Int)] = Array((1,1), (1,2), (3,1), (3,2), (3,3), (3,4), (5,1), (5,2), (5,3), (5,4), (5,5), (5,6))
      
    • keys

      scala> a.keys.collect
      res41: Array[Int] = Array(1, 3, 5)
      

      Equivalent implementation using map:

      scala> a.map{case (k, v) => k}.collect
      res43: Array[Int] = Array(1, 3, 5)
      
    • values

      scala> a.values.collect
      res42: Array[Int] = Array(2, 4, 6)
      

      Equivalent implementation using map:

      scala> a.map{case (k, v) => v}.collect
      res44: Array[Int] = Array(2, 4, 6)
      

1.3 Aggregation Operations

The main operators are groupByKey / reduceByKey / foldByKey / aggregateByKey; under the hood they all call one of the following two:

combineByKey (older Spark versions) / combineByKeyWithClassTag (newer versions). A combineByKey sketch is included after the examples below.

subtractByKey: similar to subtract, but removes the elements whose key also appears in the other RDD.
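
A minimal subtractByKey sketch, with made-up data (output order may vary):

val left  = sc.makeRDD(Seq(("a", 1), ("b", 2), ("c", 3)))
val right = sc.makeRDD(Seq(("b", 99), ("d", 4)))
// key "b" also appears in `right`, so it is dropped
left.subtractByKey(right).collect   // Array((a,1), (c,3))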

Example: compute the average value per key.

    scala> val rdd = sc.makeRDD(Array(("Java", 12), ("C++", 26),("C", 23), ("Java", 15), ("scala", 26),("spark", 25),("C", 21), ("Java", 16), ("scala", 24), ("scala", 16)))
    rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[24] at makeRDD at <console>:24
    
    • groupByKey: aggregates by key; in the result each key maps to an Iterable of its values

      scala> rdd.groupByKey.collect
      res46: Array[(String, Iterable[Int])] = Array((Java,CompactBuffer(16, 12, 15)), (C++,CompactBuffer(26)), (spark,CompactBuffer(25)), (scala,CompactBuffer(26, 16, 24)), (C,CompactBuffer(23, 21)))
      
      scala> rdd.groupByKey.map{case (x,y)=>(x, y.sum.toDouble / y.size)}.collect
      res49: Array[(String, Double)] = Array((Java,14.333333333333334), (C++,26.0), (spark,25.0), (scala,22.0), (C,22.0))
      
      scala> rdd.groupByKey.mapValues (y=> y.sum.toDouble / y.size).collect
      res50: Array[(String, Double)] = Array((Java,14.333333333333334), (C++,26.0), (spark,25.0), (scala,22.0), (C,22.0))
      
    • reduceByKey

      scala> rdd.mapValues(v =>(v,1)).reduceByKey((x,y) => (x._1+y._1, x._2+y._2)).mapValues(x=>x._1.toDouble / x._2).collect
      res52: Array[(String, Double)] = Array((Java,14.333333333333334), (C++,26.0), (spark,25.0), (scala,22.0), (C,22.0))
      
    • foldByKey

      scala> rdd.mapValues(v =>(v,1)).foldByKey((0,0))((x,y) => (x._1+y._1, x._2+y._2)).mapValues(x=>x._1.toDouble / x._2).collect
      res57: Array[(String, Double)] = Array((Java,14.333333333333334), (C++,26.0), (spark,25.0), (scala,22.0), (C,22.0))
      
    • aggregateByKey => zero value + intra-partition aggregation function + inter-partition aggregation function

      scala> rdd.mapValues(v =>(v,1)).aggregateByKey((0,0))((x,y)=>(x._1+y._1, x._2+y._2), (a,b)=>(a._1+b._1, a._2+b._2)).mapValues(x=>x._1.toDouble / x._2).collect
      res61: Array[(String, Double)] = Array((Java,14.333333333333334), (C++,26.0), (spark,25.0), (scala,22.0), (C,22.0))
      
      
      scala> rdd.aggregateByKey((0,0))((x,y)=>(x._1+y, x._2+1), (a,b)=>(a._1+b._1, a._2+b._2)).mapValues(x=>x._1.toDouble / x._2).collect
      res63: Array[(String, Double)] = Array((Java,14.333333333333334), (C++,26.0), (spark,25.0), (scala,22.0), (C,22.0))
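
    • combineByKey (for reference, a minimal sketch of the same average computed directly with the underlying operator; it takes a createCombiner, a mergeValue and a mergeCombiners function)

      rdd.combineByKey(
        (v: Int) => (v, 1),                                           // createCombiner: first value seen in a partition becomes (sum, count)
        (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),        // mergeValue: fold another value into (sum, count) within a partition
        (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)  // mergeCombiners: merge partial results across partitions
      ).mapValues(x => x._1.toDouble / x._2).collect
      // same result as the groupByKey / reduceByKey / foldByKey / aggregateByKey versions above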
      

groupByKey is generally inefficient, because every value is shuffled across the network with no map-side combining; use it sparingly and prefer reduceByKey or aggregateByKey.
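
A small word-count comparison with made-up data: reduceByKey computes partial sums inside each partition before the shuffle, while groupByKey ships every individual value.

val words = sc.parallelize(Seq("a", "b", "a", "c", "a")).map(w => (w, 1))
words.reduceByKey(_ + _).collect                 // Array((a,3), (b,1), (c,1)), partial sums combined map-side
// words.groupByKey().mapValues(_.sum).collect   // same result, but all the 1s cross the network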


1.4 Join Operations

cogroup / join / leftOuterJoin / rightOuterJoin / fullOuterJoin

scala> val rdd1 = sc.makeRDD(Array((1,"Spark"), (2,"Hadoop"), (3,"Kylin"), (4,"Flink")))
rdd1: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[50] at makeRDD at <console>:24

scala> val rdd2 = sc.makeRDD(Array((3,"李四"), (4,"王五"), (5,"赵六"), (6,"冯七")))
rdd2: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[51] at makeRDD at <console>:24

scala> val rdd3 = rdd1.cogroup(rdd2)
rdd3: org.apache.spark.rdd.RDD[(Int, (Iterable[String], Iterable[String]))] = MapPartitionsRDD[53] at cogroup at <console>:27

scala> rdd3.collect.foreach(println)
(1,(CompactBuffer(Spark),CompactBuffer()))
(2,(CompactBuffer(Hadoop),CompactBuffer()))
(3,(CompactBuffer(Kylin),CompactBuffer(李四)))
(4,(CompactBuffer(Flink),CompactBuffer(王五)))
(5,(CompactBuffer(),CompactBuffer(赵六)))
(6,(CompactBuffer(),CompactBuffer(冯七)))

val rdd1 = sc.makeRDD(Array(("1","Spark"),("2","Hadoop"),
("3","Scala"),("4","Java")))
val rdd2 = sc.makeRDD(Array(("3","20K"),("4","18K"),
("5","25K"),("6","10K")))
rdd1.join(rdd2).collect
rdd1.leftOuterJoin(rdd2).collect
rdd1.rightOuterJoin(rdd2).collect
rdd1.fullOuterJoin(rdd2).collect
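
For reference, the element types these joins produce (the keys present in both RDDs are "3" and "4"):

// join           : RDD[(String, (String, String))]                 : only keys found in both rdd1 and rdd2
// leftOuterJoin  : RDD[(String, (String, Option[String]))]         : all keys of rdd1
// rightOuterJoin : RDD[(String, (Option[String], String))]         : all keys of rdd2
// fullOuterJoin  : RDD[(String, (Option[String], Option[String]))] : union of the keys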

1.5 Action Operations

collectAsMap / countByKey / lookup(key)
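
collectAsMap was shown in 1.1 and lookup is shown below. A minimal countByKey sketch with made-up data; it is an action that returns a local Map[K, Long] with the number of records per key:

val r = sc.makeRDD(Seq(("a", 1), ("b", 2), ("a", 3)))
r.countByKey()   // Map(a -> 2, b -> 1), collected to the driver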

lookup(key): an efficient lookup; if the RDD has a partitioner, only the partition that the key maps to is scanned.

val rdd1 = sc.makeRDD(Array(("1","Spark"),("2","Hadoop"),
("3","Scala"),("1","Java")))
val rdd2 = sc.makeRDD(Array(("3","20K"),("4","18K"),
("5","25K"),("6","10K")))
rdd1.lookup("1")
rdd2.lookup("3")