1. Definition

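  /*
  * 1. Definition
  *     def groupByKey(): RDD[(K, Iterable[V])]
  *     def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])]
  *     def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])]
  *
  * 2. Purpose
  *     Groups the values that share the same key into one Iterable per key
  *     Only available on key-value RDDs (RDD[(K, V)])
  *
  * 3. Discussion
  *    1. How do groupByKey and reduceByKey differ?
  *       1. In terms of shuffle
  *           1. Both groupByKey and reduceByKey shuffle (each pulls data from partitions on other nodes)
  *              reduceByKey, however, can pre-aggregate (combine) values with the same key inside each partition before the shuffle,
  *                  which reduces the amount of data written to disk and sent over the network
  *              groupByKey only groups and cannot pre-aggregate
  *              so reduceByKey usually performs better when the goal is an aggregate
  *       2. In terms of functionality
  *           1. groupByKey : groups the values of each key
  *           2. reduceByKey : groups the values of each key and then aggregates them
  *
  *
  * 4. Note
  *   1. The partitioner passed in is applied to the keys of the grouped result, i.e. it decides which partition each key's group ends up in
  *
  *
  * */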
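
The difference between groupByKey and reduceByKey discussed in point 3 can be sketched as follows. This is a minimal illustration, not a benchmark: the object name `ReduceVsGroupSketch`, its two helper methods, and the sample data are hypothetical and chosen only for this example. Both helpers compute a per-key sum; the first lets Spark combine values inside each partition before the shuffle, while the second ships every value across the shuffle and sums afterwards.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical helper object, used only for illustration.
object ReduceVsGroupSketch {

  // reduceByKey: values with the same key are combined inside each
  // partition first, so fewer records cross the shuffle.
  def sumWithReduceByKey(rdd: RDD[(String, Int)]): Map[String, Int] =
    rdd.reduceByKey(_ + _).collect().toMap

  // groupByKey: every value crosses the shuffle unchanged, and the sum
  // is computed only after the groups have been assembled.
  def sumWithGroupByKey(rdd: RDD[(String, Int)]): Map[String, Int] =
    rdd.groupByKey().mapValues(_.sum).collect().toMap

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local").setAppName("ReduceVsGroupSketch"))
    try {
      // Sample data (an assumption for this sketch), split over 2 partitions.
      val pairs = sc.makeRDD(List(("a", 1), ("a", 2), ("b", 3), ("a", 4), ("b", 5)), 2)
      println(sumWithReduceByKey(pairs))
      println(sumWithGroupByKey(pairs))
    } finally {
      sc.stop()
    }
  }
}
```

Both calls should produce the same totals (7 for "a" and 8 for "b"); only the amount of data moved during the shuffle differs.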


2. Example

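  import org.apache.spark.{Partitioner, SparkConf, SparkContext}
  import org.apache.spark.rdd.RDD

  object groupByKeyTest extends App {

    val sparkconf: SparkConf = new SparkConf().setMaster("local").setAppName("groupByKeyTest")

    val sc: SparkContext = new SparkContext(sparkconf)

    // Seven key-value pairs spread over 2 partitions
    val rdd: RDD[(Int, String)] = sc.makeRDD(List((1, "x1"), (-2, "x2"), (3, "x3"), (-4, "x4"), (-5, "x5"), (-6, "x6"), (7, "x7")), 2)

    // groupByKey with a custom partitioner: positive keys go to partition 1, all other keys to partition 0
    private val rdd1: RDD[(Int, Iterable[String])] = rdd.groupByKey(
      new Partitioner {
        override def numPartitions: Int = 2

        override def getPartition(key: Any): Int = if (key.asInstanceOf[Int] > 0) 1 else 0
      }
    )

    // For comparison: groupBy keeps the whole (key, value) tuple in each group,
    // whereas groupByKey keeps only the values
    private val rdd2: RDD[(Int, Iterable[(Int, String)])] = rdd.groupBy(_._1)

    println(s"${rdd1.collect().mkString(",")}")
    println(s"${rdd2.collect().mkString(",")}")

    sc.stop()
  }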
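
To see what the custom partitioner in the example actually does, one option is to tag each grouped key with the index of the partition that holds it. The sketch below assumes the same sign-based rule as the example; the names `SignPartitioner`, `PartitionLayoutSketch`, and `keyLayout` are hypothetical and exist only for this illustration.

```scala
import org.apache.spark.{Partitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical named version of the anonymous partitioner from the example:
// positive keys go to partition 1, all other keys to partition 0.
class SignPartitioner extends Partitioner {
  override def numPartitions: Int = 2
  override def getPartition(key: Any): Int = if (key.asInstanceOf[Int] > 0) 1 else 0
}

object PartitionLayoutSketch {

  // Returns, for each partition index, the set of keys stored there after groupByKey.
  def keyLayout(rdd: RDD[(Int, String)]): Map[Int, Set[Int]] =
    rdd.groupByKey(new SignPartitioner)
      .mapPartitionsWithIndex((idx, iter) => iter.map { case (key, _) => (idx, key) })
      .collect()
      .groupBy(_._1)
      .map { case (idx, tagged) => idx -> tagged.map(_._2).toSet }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local").setAppName("PartitionLayoutSketch"))
    try {
      val rdd = sc.makeRDD(List((1, "x1"), (-2, "x2"), (3, "x3"), (-4, "x4")), 2)
      keyLayout(rdd).foreach { case (idx, keys) =>
        println(s"partition $idx holds keys ${keys.mkString(", ")}")
      }
    } finally {
      sc.stop()
    }
  }
}
```

With this input, the negative keys (-2, -4) land in partition 0 and the positive keys (1, 3) in partition 1, which matches the note that the partitioner is applied to the keys of the grouped result.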