Comparing aggregate and treeAggregate

 1. Definition

   【aggregate】
      /**
      * Aggregate the elements of each partition, and then the results for all the partitions, using
      * given combine functions and a neutral "zero value". This function can return a different result
      * type, U, than the type of this RDD, T. Thus, we need one operation for merging a T into an U
      * and one operation for merging two U's, as in scala.TraversableOnce. Both of these functions are
      * allowed to modify and return their first argument instead of creating a new U to avoid memory
      * allocation.
      */
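      In other words:
      aggregate takes three arguments (the initial value zeroValue, the function seqOp, and the function combOp), and its return type U is the type of zeroValue.
      Processing steps:
          1. seqOp is applied within each partition of the RDD, starting from zeroValue, and yields one result of type U per partition.
          2. The partition results are sent back to the driver and reduced there: combOp is applied across them, again starting from zeroValue,
             and the final value (type U) is returned.
      Signature: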
         def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U
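
      The two processing steps can be tried out without a cluster. The sketch below is a plain-Scala simulation, not Spark's implementation: simulateAggregate is a hypothetical helper, and modelling the RDD as a Seq of partitions is an assumption made purely for illustration.

```scala
object AggregateSketch {
  // Hypothetical helper (not a Spark API): an RDD is modelled as a sequence of partitions.
  def simulateAggregate[T, U](partitions: Seq[Seq[T]], zeroValue: U)(
      seqOp: (U, T) => U, combOp: (U, U) => U): U = {
    // Step 1: fold every partition on its own, each starting from zeroValue.
    val partitionResults = partitions.map(_.foldLeft(zeroValue)(seqOp))
    // Step 2: merge the partition results "on the driver", again starting from zeroValue.
    partitionResults.foldLeft(zeroValue)(combOp)
  }

  def main(args: Array[String]): Unit = {
    val partitions = Seq(Seq(1, 2), Seq(4, 5), Seq(8, 9))
    val result = simulateAggregate(partitions, 3)(
      (a: Int, b: Int) => math.min(a, b),
      (a: Int, b: Int) => a + b)
    println(result) // 3 + 1 + 3 + 3 = 10
  }
}
```

      Because zeroValue enters once per partition and once more in the final merge, a non-neutral value such as 3 shifts the result.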

  【treeAggregate】
     /**
      * Aggregates the elements of this RDD in a multi-level tree pattern.
      * @param depth suggested depth of the tree (default: 2)
      * @see [[org.apache.spark.rdd.RDD#aggregate]]
      */
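      In other words: treeAggregate is called the same way as aggregate, with one extra parameter, depth, which defaults to 2 and may be omitted.
        The other parameters, the return value, and the usage are identical; only the processing and the final merge of the results differ slightly, as detailed below.

      Signature:
        def treeAggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U, depth: Int = 2): U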

 2. Usage example

    【aggregate】
        scala> def seq(a:Int,b:Int):Int={
             | println("seq:"+a+":"+b)
             | math.min(a,b)}
        seq: (a: Int, b: Int)Int

        scala> def comb(a:Int,b:Int):Int={
             | println("comb:"+a+":"+b)
             | a+b}
        comb: (a: Int, b: Int)Int

        scala> val z = sc.parallelize(List(1,2,4,5,8,9), 3)
        scala> z.aggregate(3)(seq,comb)
        seq:3:4
        seq:3:1
        seq:1:2
        seq:3:8
        seq:3:5
        seq:3:9
        comb:3:1
        comb:4:3
        comb:7:3
        res0: Int = 10

    【treeAggregate】
        scala> def seq(a:Int,b:Int):Int={
             | println("seq:"+a+":"+b)
             | math.min(a,b)}
        seq: (a: Int, b: Int)Int

        scala> def comb(a:Int,b:Int):Int={
             | println("comb:"+a+":"+b)
             | a+b}
        comb: (a: Int, b: Int)Int

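        scala> val z = sc.parallelize(List(1,2,4,5,8,9), 3)
        scala> z.treeAggregate(3)(seq,comb)
        seq:3:4   // returns 3, partition 2
        seq:3:1   // returns 1, partition 1
        seq:1:2   // returns 1, partition 1
        seq:3:8   // returns 3, partition 3
        seq:3:5   // returns 3, partition 2
        seq:3:9   // returns 3, partition 3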
        comb:1:3
        comb:4:3
        res1: Int = 7

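    As shown above, the two calls look the same; the only difference is that aggregate uses the initial value one more time than treeAggregate, in the final reduce over the partition results (10 = 7 + 3).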
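
    The arithmetic behind the two results can be reproduced with ordinary Scala collections. This is only an illustration of the observed output, not Spark code; the object name ZeroValueCount is made up for the example, and the three partition results are the values printed by seq in the runs above.

```scala
object ZeroValueCount {
  // Per-partition results from the runs above: min(3, 1, 2), min(3, 4, 5), min(3, 8, 9).
  val partitionResults: Seq[Int] = Seq(1, 3, 3)
  val zeroValue = 3

  // Final merge that starts from zeroValue, matching the aggregate output.
  val withZero: Int = partitionResults.foldLeft(zeroValue)(_ + _)
  // Final merge with no starting value, matching the treeAggregate output.
  val withoutZero: Int = partitionResults.reduce(_ + _)

  def main(args: Array[String]): Unit = {
    println(s"with zero value: $withZero")       // 10
    println(s"without zero value: $withoutZero") // 7
  }
}
```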

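 3. Differences

      Reading the implementations of aggregate and treeAggregate confirms what the output above suggests. In summary:
      (1) In the final result, aggregate applies combOp to the initial value one more time than treeAggregate does. As the parameter name
          suggests, however, zeroValue is normally a neutral value such as 0 or an empty collection, in which case both return the same result.
      (2) aggregate brings the partition results straight to the driver and reduces them there. treeAggregate first merges the partition
          results on the executors with reduceByKey, and only then brings the remaining results to the driver for the final reduce.
          The number of reduceByKey levels is governed by the depth parameter (a suggested tree depth, per the doc comment), which is where the name treeAggregate comes from.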
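
      The multi-level merge in (2) can be sketched in plain Scala. This is a simulation under stated assumptions, not Spark's code: simulateTreeAggregate is a hypothetical helper, the scale formula and loop condition are written from memory of the Spark source of that period and may differ between versions, and grouping by index modulo stands in for reduceByKey with a hash partitioner.

```scala
object TreeAggregateSketch {
  // Hypothetical helper (not a Spark API): an RDD is modelled as a sequence of partitions.
  def simulateTreeAggregate[T, U](partitions: Seq[Seq[T]], zeroValue: U, depth: Int = 2)(
      seqOp: (U, T) => U, combOp: (U, U) => U): U = {
    require(depth >= 1 && partitions.nonEmpty)
    // Each partition is folded from zeroValue, exactly as in aggregate.
    var partial: Seq[U] = partitions.map(_.foldLeft(zeroValue)(seqOp))
    var numPartitions = partial.length
    // Assumed fan-in per tree level.
    val scale = math.max(math.ceil(math.pow(numPartitions, 1.0 / depth)).toInt, 2)
    // Assumed stopping rule: add a level only while it still saves work.
    while (numPartitions > scale + math.ceil(numPartitions.toDouble / scale)) {
      numPartitions /= scale
      val n = numPartitions
      // Stand-in for reduceByKey: results whose index shares i % n are merged together.
      partial = partial.zipWithIndex
        .groupBy { case (_, i) => i % n }
        .toSeq.sortBy(_._1)
        .map { case (_, group) => group.map(_._1).reduce(combOp) }
    }
    // Final merge "on the driver", with no extra zeroValue.
    partial.reduce(combOp)
  }

  def main(args: Array[String]): Unit = {
    val small = Seq(Seq(1, 2), Seq(4, 5), Seq(8, 9))
    // Three partitions are too few for an intermediate level, so only the final reduce runs.
    println(simulateTreeAggregate(small, 3)((a: Int, b: Int) => math.min(a, b), (a: Int, b: Int) => a + b)) // 7
    // Sixteen partitions with a neutral zero value: a plain sum of 1 to 32.
    val large = (1 to 32).toList.grouped(2).toList
    println(simulateTreeAggregate(large, 0)((a: Int, b: Int) => a + b, (a: Int, b: Int) => a + b)) // 528
  }
}
```

      With a neutral zeroValue and an associative, commutative combOp, the shape of the tree does not change the result; it only changes where the merging happens.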

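 4. Source code
      The source follows the logic analysed above and is fairly simple, so it is not walked through here.
      For a diagram of the process, see http://blog.csdn.net/lookqlp/article/details/52121057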
      

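 5. Pros and cons
      (1) In aggregate, all n partition results are combined one after another on the driver, so the combine step is O(n). In treeAggregate the combining is spread over a tree, so its time complexity is about O(lg n). Here n is the number of partitions.

      (2) aggregate pulls every partition result to the driver, which risks an out-of-memory error there when the results are large. treeAggregate sends far fewer results to the driver, which greatly reduces that risk.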
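
      To get a feel for both points, the sketch below counts how many partial results the driver would have to merge. It is an estimate under an assumption: resultsAtDriver is a hypothetical helper, and its loop is written from memory of how the Spark source of that period shrank the number of partitions, so actual numbers may differ between versions.

```scala
object DriverLoad {
  // Hypothetical helper: partial results left for the driver under the assumed treeAggregate loop.
  def resultsAtDriver(numPartitions: Int, depth: Int = 2): Int = {
    require(numPartitions >= 1 && depth >= 1)
    var n = numPartitions
    // Assumed fan-in per tree level.
    val scale = math.max(math.ceil(math.pow(n, 1.0 / depth)).toInt, 2)
    // Each pass models one intermediate merge level executed on the executors.
    while (n > scale + math.ceil(n.toDouble / scale)) n /= scale
    n
  }

  def main(args: Array[String]): Unit = {
    // Under aggregate the driver merges one result per partition.
    for (p <- Seq(3, 50, 1000)) {
      println(s"partitions=$p aggregate=$p treeAggregate(depth=2)=${resultsAtDriver(p)}")
    }
  }
}
```

      Under these assumptions, 1000 partitions leave 31 results for the driver instead of 1000, while a 3-partition RDD gains nothing from the tree.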
       
      So, in my view, treeAggregate is simply the better default. If anything here is wrong, please leave a comment.

posted on 2016-08-11 20:00 by 在大地画满窗子