Preparing the data

SELECT uid, SUM(COUNT) FROM logs GROUP BY uid;
hive> SELECT * FROM logs;
a	苹果	5
a	橙子	3
a	苹果	2
b	烧鸡	1
 
hive> SELECT uid, SUM(COUNT) FROM logs GROUP BY uid;
a	10
b	1

Computation process

[figure: hive-groupby-cal — group-by computation flow]
Because hive.map.aggr=true by default, Hive first does a partial GROUP BY on the mapper side and then merges the results, in order to cut down the amount of data the reducers have to process. Note that the mode shown in EXPLAIN differs between the two sides: the mapper runs in hash mode, while the reducer runs in mergepartial mode. If you set hive.map.aggr=false, the GROUP BY is done entirely in the reducer, and its mode is complete.

Operator

[figure: hive-groupby-op — group-by operator tree]

Explain

hive> explain SELECT uid, sum(count) FROM logs group by uid;
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME logs))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL uid)) (TOK_SELEXPR (TOK_FUNCTION sum (TOK_TABLE_OR_COL count)))) (TOK_GROUPBY (TOK_TABLE_OR_COL uid))))
 
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage
 
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        logs 
          TableScan // scan the table
            alias: logs
            Select Operator // select columns
              expressions:
                    expr: uid
                    type: string
                    expr: count
                    type: int
              outputColumnNames: uid, count
              Group By Operator // because hive.map.aggr=true by default, aggregate once in the mapper to cut the data sent to the reducers
                aggregations:
                      expr: sum(count) // aggregate function
                bucketGroup: false
                keys: // group-by keys
                      expr: uid
                      type: string
                mode: hash // hash aggregation, handled by processHashAggr()
                outputColumnNames: _col0, _col1
                Reduce Output Operator // emit key/value pairs to the reducer
                  key expressions:
                        expr: _col0
                        type: string
                  sort order: +
                  Map-reduce partition columns:
                        expr: _col0
                        type: string
                  tag: -1
                  value expressions:
                        expr: _col1
                        type: bigint
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: sum(VALUE._col0) // aggregate the partial sums
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: string
          mode: mergepartial // merge the partial aggregates
          outputColumnNames: _col0, _col1
          Select Operator // select columns
            expressions:
                  expr: _col0
                  type: string
                  expr: _col1
                  type: bigint
            outputColumnNames: _col0, _col1
            File Output Operator // write results to a file
              compressed: false
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
 
  Stage: Stage-0
    Fetch Operator
      limit: -1

hive> select distinct value from src;
hive> select max(key) from src;
Because these queries have no grouping keys, only one reducer is used.
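
A sketch of that rule (an assumption on my part: a simplified paraphrase of how the plan picks the reducer count, not Hive's literal source):

import java.util.List;

class ReducerCountSketch {
    // With no grouping keys there is a single global result group, so the
    // plan forces one reducer; -1 means Hive picks the count at runtime.
    static int numReducers(List<?> grpByExprs) {
        return grpByExprs.isEmpty() ? 1 : -1;
    }
}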

 
      (From SemanticAnalyzer.genBodyPlan; see the code excerpt further below.)
      2.2 If the query has aggregate functions or a GROUP BY clause, handle it as follows:
            Insert a Select Operator that selects all columns, for the ColumnPruner to use during optimization.
            2.2.1 hive.map.aggr is true (the default, i.e. enabled): do partial aggregation on the map side.
                  2.2.1.1 hive.groupby.skewindata is false (the default): the group-by data is not skewed.
                  Operators generated: GroupByOperator + ReduceSinkOperator + GroupByOperator.
      GroupByOperator + ReduceSinkOperator run on the map side; the first GroupByOperator does the partial aggregation on the map side, and the second GroupByOperator does the GROUP BY on the reduce side.
                  2.2.1.2 hive.groupby.skewindata is true.
                  Operators generated: GroupByOperator + ReduceSinkOperator + GroupByOperator + ReduceSinkOperator + GroupByOperator.
               GroupByOperator + ReduceSinkOperator (map phase of the first MapredTask)
               GroupByOperator (reduce phase of the first MapredTask)
               ReduceSinkOperator (map phase of the second MapredTask)
               GroupByOperator (reduce phase of the second MapredTask)
            2.2.2 hive.map.aggr is false.
                   2.2.2.1 hive.groupby.skewindata is true.
                    Operators generated: ReduceSinkOperator + GroupByOperator + ReduceSinkOperator + GroupByOperator.
               ReduceSinkOperator (map phase of the first MapredTask)
               GroupByOperator (reduce phase of the first MapredTask)
               ReduceSinkOperator (map phase of the second MapredTask)
               GroupByOperator (reduce phase of the second MapredTask)
                   2.2.2.2 hive.groupby.skewindata is false.
                    Operators generated: ReduceSinkOperator (runs in the map phase) + GroupByOperator (runs in the reduce phase).



Case 1:
set hive.map.aggr=false;
set hive.groupby.skewindata=false;
SemanticAnalyzer.genGroupByPlan1MR(){
  (1) ReduceSinkOperator: puts all GROUP BY keys and the distinct field (if any) in the map-reduce sort key, and all other fields in the map-reduce value.
  (2) GroupByOperator: GroupByDesc.Mode.COMPLETE. Reducer: iterate/merge (mode = COMPLETE)
}

Case 2:
set hive.map.aggr=true;
set hive.groupby.skewindata=false;
SemanticAnalyzer.genGroupByPlanMapAggr1MR(){
   (1) GroupByOperator: GroupByDesc.Mode.HASH. The aggregation evaluation functions are: Mapper: iterate/terminatePartial (mode = HASH)
   (2) ReduceSinkOperator: Partitioning Key: grouping key. Sorting Key: grouping key if no DISTINCT, grouping + distinct key if DISTINCT
   (3) GroupByOperator: GroupByDesc.Mode.MERGEPARTIAL. Reducer: iterate/terminate if DISTINCT, merge/terminate if NO DISTINCT (mode = MERGEPARTIAL)
}

Case 3:
set hive.map.aggr=false;
set hive.groupby.skewindata=true;
SemanticAnalyzer.genGroupByPlan2MR(){
    (1) ReduceSinkOperator: Partitioning Key: random() if no DISTINCT, grouping + distinct key if DISTINCT. Sorting Key: grouping key if no DISTINCT, grouping + distinct key if DISTINCT
    (2) GroupByOperator: GroupByDesc.Mode.PARTIAL1. Reducer: iterate/terminatePartial (mode = PARTIAL1)
    (3) ReduceSinkOperator: Partitioning Key: grouping key. Sorting Key: grouping key if no DISTINCT, grouping + distinct key if DISTINCT
    (4) GroupByOperator: GroupByDesc.Mode.FINAL. Reducer: merge/terminate (mode = FINAL)
}

Case 4:
set hive.map.aggr=true;
set hive.groupby.skewindata=true;
SemanticAnalyzer.genGroupByPlanMapAggr2MR(){
     (1) GroupByOperator: GroupByDesc.Mode.HASH. Mapper: iterate/terminatePartial (mode = HASH)
     (2) ReduceSinkOperator: Partitioning Key: random() if no DISTINCT, grouping + distinct key if DISTINCT. Sorting Key: grouping key if no DISTINCT, grouping + distinct key if DISTINCT
     (3) GroupByOperator: GroupByDesc.Mode.PARTIALS. Reducer: iterate/terminatePartial if DISTINCT, merge/terminatePartial if NO DISTINCT (mode = PARTIALS)
     (4) ReduceSinkOperator: Partitioning Key: grouping key. Sorting Key: grouping key if no DISTINCT, grouping + distinct key if DISTINCT
     (5) GroupByOperator: GroupByDesc.Mode.FINAL. Reducer: merge/terminate (mode = FINAL)
}
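
As a quick reference, here is a sketch summarizing the GroupByDesc.Mode values used in the four cases above (my own condensation of those plans, not code from Hive's source):

public enum GroupByModeSketch {
    HASH,         // map side: iterate/terminatePartial into an in-memory hash table
    COMPLETE,     // single reduce pass with no map-side aggregation (case 1)
    MERGEPARTIAL, // reducer after map-side HASH, no skew (case 2)
    PARTIAL1,     // reducer of MR job 1, skewed data without map aggregation (case 3)
    PARTIALS,     // reducer of MR job 1, skewed data with map aggregation (case 4)
    FINAL         // reducer of MR job 2 in the skewed plans: merge/terminate
}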


ReduceSinkOperator.processOp(Object row, int tag) sets the key's hash value according to the plan. Take the first ReduceSinkOperator of Case 4 (Partitioning Key: random() if no DISTINCT, grouping + distinct key if DISTINCT): if there is no DISTINCT column, the key's hash value is set to a random number (random = new Random(12345);) just before OutputCollector.collect; if there is a DISTINCT column, the key's hash is derived from the grouping + distinct key.
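
A minimal sketch of that partitioning decision (the class and method names below are mine, for illustration only; the real logic lives in ReduceSinkOperator.processOp):

import java.util.Random;

class SkewPartitionSketch {
    private final Random random = new Random(12345); // same fixed seed as noted above
    private final int numReducers;

    SkewPartitionSketch(int numReducers) {
        this.numReducers = numReducers;
    }

    // No DISTINCT column: a random partition value scatters rows sharing a
    // group-by key across reducers; the second MR job re-partitions them by
    // the real grouping key and merges the partial aggregates.
    int partitionNoDistinct() {
        return (random.nextInt() & Integer.MAX_VALUE) % numReducers;
    }

    // With a DISTINCT column: rows with the same grouping + distinct key must
    // meet on one reducer, so the partition hash is derived from those columns.
    int partitionWithDistinct(Object groupKey, Object distinctKey) {
        int hash = 31 * groupKey.hashCode() + distinctKey.hashCode();
        return (hash & Integer.MAX_VALUE) % numReducers;
    }
}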



GroupByOperator:
initializeOp(Configuration hconf)
processOp(Object row, int tag)
closeOp(boolean abort)
forward(ArrayList<Object> keys, AggregationBuffer[] aggs)
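
To make that lifecycle concrete, here is a minimal hash-aggregation sketch (a deliberate simplification of mine; the real operator also manages AggregationBuffers, memory thresholds, and ObjectInspectors):

import java.util.HashMap;
import java.util.Map;

class HashGroupBySketch {
    private Map<String, Long> hashAggr; // group key -> running sum

    void initializeOp() {                      // cf. initializeOp(Configuration hconf)
        hashAggr = new HashMap<>();
    }

    void processOp(String key, long value) {   // cf. processOp(Object row, int tag)
        hashAggr.merge(key, value, Long::sum); // update this key's aggregation state
    }

    void closeOp() {                           // cf. closeOp(boolean abort)
        for (Map.Entry<String, Long> e : hashAggr.entrySet()) {
            forward(e.getKey(), e.getValue()); // cf. forward(keys, aggs)
        }
    }

    void forward(String key, long sum) {
        System.out.println(key + "\t" + sum);  // a real operator forwards rows to its children
    }
}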


groupby10.q   groupby11.q
set hive.map.aggr=false;
set hive.groupby.skewindata=false;

EXPLAIN
FROM INPUT
INSERT OVERWRITE TABLE dest1 SELECT INPUT.key, count(substr(INPUT.value,5)), count(distinct substr(INPUT.value,5)) GROUP BY INPUT.key;

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        input
          TableScan
            alias: input
            Select Operator  // insertSelectAllPlanForGroupBy
              expressions:
                    expr: key
                    type: int
                    expr: value
                    type: string
              outputColumnNames: key, value
              Reduce Output Operator
                key expressions:
                      expr: key
                      type: int
                      expr: substr(value, 5)
                      type: string
                sort order: ++
                Map-reduce partition columns:
                      expr: key
                      type: int
                tag: -1
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: count(KEY._col1:0._col0)
                expr: count(DISTINCT KEY._col1:0._col0)
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: int
          mode: complete
          outputColumnNames: _col0, _col1, _col2
          Select Operator
            expressions:
                  expr: _col0
                  type: int
                  expr: _col1
                  type: bigint
                  expr: _col2
                  type: bigint
            outputColumnNames: _col0, _col1, _col2
            Select Operator
              expressions:
                    expr: _col0
                    type: int
                    expr: UDFToInteger(_col1)
                    type: int
                    expr: UDFToInteger(_col2)
                    type: int
              outputColumnNames: _col0, _col1, _col2
              File Output Operator
                compressed: false
                GlobalTableId: 1
                table:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                    name: dest1

  Stage: Stage-0
    Move Operator
      tables:
          replace: true
          table:
              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: dest1




set hive.map.aggr=true;
set hive.groupby.skewindata=false;

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        input
          TableScan
            alias: input
            Select Operator
              expressions:
                    expr: key
                    type: int
                    expr: value
                    type: string
              outputColumnNames: key, value
              Group By Operator
                aggregations:
                      expr: count(substr(value, 5))
                      expr: count(DISTINCT substr(value, 5))
                bucketGroup: false
                keys:
                      expr: key
                      type: int
                      expr: substr(value, 5)
                      type: string
                mode: hash
                outputColumnNames: _col0, _col1, _col2, _col3
                Reduce Output Operator
                  key expressions:
                        expr: _col0
                        type: int
                        expr: _col1
                        type: string
                  sort order: ++
                  Map-reduce partition columns:
                        expr: _col0
                        type: int
                  tag: -1
                  value expressions:
                        expr: _col2
                        type: bigint
                        expr: _col3
                        type: bigint
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: count(VALUE._col0)
                expr: count(DISTINCT KEY._col1:0._col0)
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: int
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2
          Select Operator
            expressions:
                  expr: _col0
                  type: int
                  expr: _col1
                  type: bigint
                  expr: _col2
                  type: bigint
            outputColumnNames: _col0, _col1, _col2
            Select Operator
              expressions:
                    expr: _col0
                    type: int
                    expr: UDFToInteger(_col1)
                    type: int
                    expr: UDFToInteger(_col2)
                    type: int
              outputColumnNames: _col0, _col1, _col2
              File Output Operator
                compressed: false
                GlobalTableId: 1
                table:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                    name: dest1

  Stage: Stage-0
    Move Operator
      tables:
          replace: true
          table:
              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: dest1








set hive.map.aggr=false;
set hive.groupby.skewindata=true;

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1
  Stage-0 depends on stages: Stage-2

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        input
          TableScan
            alias: input
            Select Operator
              expressions:
                    expr: key
                    type: int
                    expr: value
                    type: string
              outputColumnNames: key, value
              Reduce Output Operator
                key expressions:
                      expr: key
                      type: int
                      expr: substr(value, 5)
                      type: string
                sort order: ++
                Map-reduce partition columns:
                      expr: key
                      type: int
                tag: -1
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: count(KEY._col1:0._col0)
                expr: count(DISTINCT KEY._col1:0._col0)
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: int
          mode: partial1
          outputColumnNames: _col0, _col1, _col2
          File Output Operator
            compressed: false
            GlobalTableId: 0
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

  Stage: Stage-2
    Map Reduce
      Alias -> Map Operator Tree:
        hdfs://localhost:54310/tmp/hive-tianzhao/hive_2011-07-15_21-48-26_387_7978992474997402829/-mr-10002
            Reduce Output Operator
              key expressions:
                    expr: _col0
                    type: int
              sort order: +
              Map-reduce partition columns:
                    expr: _col0
                    type: int
              tag: -1
              value expressions:
                    expr: _col1
                    type: bigint
                    expr: _col2
                    type: bigint
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: count(VALUE._col0)
                expr: count(VALUE._col1)
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: int
          mode: final
          outputColumnNames: _col0, _col1, _col2
          Select Operator
            expressions:
                  expr: _col0
                  type: int
                  expr: _col1
                  type: bigint
                  expr: _col2
                  type: bigint
            outputColumnNames: _col0, _col1, _col2
            Select Operator
              expressions:
                    expr: _col0
                    type: int
                    expr: UDFToInteger(_col1)
                    type: int
                    expr: UDFToInteger(_col2)
                    type: int
              outputColumnNames: _col0, _col1, _col2
              File Output Operator
                compressed: false
                GlobalTableId: 1
                table:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                    name: dest1

  Stage: Stage-0
    Move Operator
      tables:
          replace: true
          table:
              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: dest1




set hive.map.aggr=true;
set hive.groupby.skewindata=true;

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1
  Stage-0 depends on stages: Stage-2

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        input
          TableScan
            alias: input
            Select Operator
              expressions:
                    expr: key
                    type: int
                    expr: value
                    type: string
              outputColumnNames: key, value
              Group By Operator
                aggregations:
                      expr: count(substr(value, 5))
                      expr: count(DISTINCT substr(value, 5))
                bucketGroup: false
                keys:
                      expr: key
                      type: int
                      expr: substr(value, 5)
                      type: string
                mode: hash
                outputColumnNames: _col0, _col1, _col2, _col3
                Reduce Output Operator
                  key expressions:
                        expr: _col0
                        type: int
                        expr: _col1
                        type: string
                  sort order: ++
                  Map-reduce partition columns:
                        expr: _col0
                        type: int
                  tag: -1
                  value expressions:
                        expr: _col2
                        type: bigint
                        expr: _col3
                        type: bigint
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: count(VALUE._col0)
                expr: count(DISTINCT KEY._col1:0._col0)
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: int
          mode: partials
          outputColumnNames: _col0, _col1, _col2
          File Output Operator
            compressed: false
            GlobalTableId: 0
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

  Stage: Stage-2
    Map Reduce
      Alias -> Map Operator Tree:
        hdfs://localhost:54310/tmp/hive-tianzhao/hive_2011-07-15_21-49-25_899_4946067838822964610/-mr-10002
            Reduce Output Operator
              key expressions:
                    expr: _col0
                    type: int
              sort order: +
              Map-reduce partition columns:
                    expr: _col0
                    type: int
              tag: -1
              value expressions:
                    expr: _col1
                    type: bigint
                    expr: _col2
                    type: bigint
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: count(VALUE._col0)
                expr: count(VALUE._col1)
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: int
          mode: final
          outputColumnNames: _col0, _col1, _col2
          Select Operator
            expressions:
                  expr: _col0
                  type: int
                  expr: _col1
                  type: bigint
                  expr: _col2
                  type: bigint
            outputColumnNames: _col0, _col1, _col2
            Select Operator
              expressions:
                    expr: _col0
                    type: int
                    expr: UDFToInteger(_col1)
                    type: int
                    expr: UDFToInteger(_col2)
                    type: int
              outputColumnNames: _col0, _col1, _col2
              File Output Operator
                compressed: false
                GlobalTableId: 1
                table:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                    name: dest1

  Stage: Stage-0
    Move Operator
      tables:
          replace: true
          table:
              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: dest1








set hive.map.aggr=false;
set hive.groupby.skewindata=false;

EXPLAIN extended
FROM INPUT
INSERT OVERWRITE TABLE dest1 SELECT INPUT.key, count(substr(INPUT.value,5)), count(distinct substr(INPUT.value,5)) GROUP BY INPUT.key;

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        input
          TableScan
            alias: input
            Select Operator
              expressions:
                    expr: key
                    type: int
                    expr: value
                    type: string
              outputColumnNames: key, value
              Reduce Output Operator
                key expressions:
                      expr: key
                      type: int
                      expr: substr(value, 5)
                      type: string
                sort order: ++
                Map-reduce partition columns:
                      expr: key
                      type: int
                tag: -1
      Needs Tagging: false
      Path -> Alias:
        hdfs://localhost:54310/user/hive/warehouse/input [input]
      Path -> Partition:
        hdfs://localhost:54310/user/hive/warehouse/input
          Partition
            base file name: input
            input format: org.apache.hadoop.mapred.TextInputFormat
            output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
            properties:
              bucket_count -1
              columns key,value
              columns.types int:string
              file.inputformat org.apache.hadoop.mapred.TextInputFormat
              file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              location hdfs://localhost:54310/user/hive/warehouse/input
              name input
              serialization.ddl struct input { i32 key, string value}
              serialization.format 1
              serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              transient_lastDdlTime 1310523947
            serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
         
              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              properties:
                bucket_count -1
                columns key,value
                columns.types int:string
                file.inputformat org.apache.hadoop.mapred.TextInputFormat
                file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                location hdfs://localhost:54310/user/hive/warehouse/input
                name input
                serialization.ddl struct input { i32 key, string value}
                serialization.format 1
                serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                transient_lastDdlTime 1310523947
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: input
            name: input
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: count(KEY._col1:0._col0)
                expr: count(DISTINCT KEY._col1:0._col0)
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: int
          mode: complete
          outputColumnNames: _col0, _col1, _col2
          Select Operator
            expressions:
                  expr: _col0
                  type: int
                  expr: _col1
                  type: bigint
                  expr: _col2
                  type: bigint
            outputColumnNames: _col0, _col1, _col2
            Select Operator
              expressions:
                    expr: _col0
                    type: int
                    expr: UDFToInteger(_col1)
                    type: int
                    expr: UDFToInteger(_col2)
                    type: int
              outputColumnNames: _col0, _col1, _col2
              File Output Operator
                compressed: false
                GlobalTableId: 1
                directory: hdfs://localhost:54310/tmp/hive-tianzhao/hive_2011-07-15_21-50-38_510_6852880850328147221/-ext-10000
                NumFilesPerFileSink: 1
                table:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    properties:
                      bucket_count -1
                      columns key,val1,val2
                      columns.types int:int:int
                      file.inputformat org.apache.hadoop.mapred.TextInputFormat
                      file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      location hdfs://localhost:54310/user/hive/warehouse/dest1
                      name dest1
                      serialization.ddl struct dest1 { i32 key, i32 val1, i32 val2}
                      serialization.format 1
                      serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                      transient_lastDdlTime 1310523946
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                    name: dest1
                TotalFiles: 1
                MultiFileSpray: false

  Stage: Stage-0
    Move Operator
      tables:
          replace: true
          source: hdfs://localhost:54310/tmp/hive-tianzhao/hive_2011-07-15_21-50-38_510_6852880850328147221/-ext-10000
          table:
              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              properties:
                bucket_count -1
                columns key,val1,val2
                columns.types int:int:int
                file.inputformat org.apache.hadoop.mapred.TextInputFormat
                file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                location hdfs://localhost:54310/user/hive/warehouse/dest1
                name dest1
                serialization.ddl struct dest1 { i32 key, i32 val1, i32 val2}
                serialization.format 1
                serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                transient_lastDdlTime 1310523946
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: dest1
          tmp directory: hdfs://localhost:54310/tmp/hive-tianzhao/hive_2011-07-15_21-50-38_510_6852880850328147221/-ext-10001



ABSTRACT SYNTAX TREE:
(TOK_QUERY
(TOK_FROM (TOK_TABREF INPUT))
(TOK_INSERT
(TOK_DESTINATION (TOK_TAB dest1))
(TOK_SELECT
(TOK_SELEXPR (. (TOK_TABLE_OR_COL INPUT) key))
(TOK_SELEXPR (TOK_FUNCTION count (TOK_FUNCTION substr (. (TOK_TABLE_OR_COL INPUT) value) 5)))
(TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_FUNCTION substr (. (TOK_TABLE_OR_COL INPUT) value) 5)))
)
(TOK_GROUPBY (. (TOK_TABLE_OR_COL INPUT) key))
)
)



SemanticAnalyzer.genBodyPlan(QB qb, Operator input){
       if (qbp.getAggregationExprsForClause(dest).size() != 0
            || getGroupByForClause(qbp, dest).size() > 0) { // if there are aggregate functions or a GROUP BY, do the following
          //multiple distincts is not supported with skew in data
          if (conf.getVar(HiveConf.ConfVars.HIVEGROUPBYSKEW)
              .equalsIgnoreCase("true") &&
             qbp.getDistinctFuncExprsForClause(dest).size() > 1) {
            throw new SemanticException(ErrorMsg.UNSUPPORTED_MULTIPLE_DISTINCTS.
                getMsg());
          }
          // insert a select operator here used by the ColumnPruner to reduce
          // the data to shuffle
          curr = insertSelectAllPlanForGroupBy(dest, curr); // generate a SelectOperator selecting all columns (selectStar=true)
          if (conf.getVar(HiveConf.ConfVars.HIVEMAPSIDEAGGREGATE)
              .equalsIgnoreCase("true")) {
            if (conf.getVar(HiveConf.ConfVars.HIVEGROUPBYSKEW)
                .equalsIgnoreCase("false")) {
              curr = genGroupByPlanMapAggr1MR(dest, qb, curr);
            } else {
              curr = genGroupByPlanMapAggr2MR(dest, qb, curr);
            }
          } else if (conf.getVar(HiveConf.ConfVars.HIVEGROUPBYSKEW)
              .equalsIgnoreCase("true")) {
            curr = genGroupByPlan2MR(dest, qb, curr);
          } else {
            curr = genGroupByPlan1MR(dest, qb, curr);
          }
       }  
}

distinct (related .q.out test cases):
count.q.out
groupby11.q.out
groupby10.q.out
nullgroup4_multi_distinct.q.out
join18.q.out
groupby_bigdata.q.out
join18_multi_distinct.q.out
nullgroup4.q.out
auto_join18_multi_distinct.q.out
auto_join18.q.out

(1) Map-side partial aggregation, no data skew: one MR job is generated.
genGroupByPlanMapAggr1MR generates three operators:
(1.1) GroupByOperator: map-side partial aggregation, generated by genGroupByPlanMapGroupByOperator:
       Handle the GROUP BY clause (getGroupByForClause): add the group-by columns to groupByKeys and outputColumnNames
       Handle DISTINCT in the SELECT (getDistinctFuncExprsForClause): add the distinct columns to groupByKeys and outputColumnNames
       Handle the aggregate functions (getAggregationExprsForClause): build an AggregationDesc for each and add it to aggregations; add the generated columns to outputColumnNames
  public GroupByDesc(
      final Mode mode,
      final java.util.ArrayList<java.lang.String> outputColumnNames,
      final java.util.ArrayList<ExprNodeDesc> keys,
      final java.util.ArrayList<org.apache.hadoop.hive.ql.plan.AggregationDesc> aggregators,
      final boolean groupKeyNotReductionKey,float groupByMemoryUsage, float memoryThreshold) {
    this(mode, outputColumnNames, keys, aggregators, groupKeyNotReductionKey,
        false, groupByMemoryUsage, memoryThreshold);
  }
  mode: GroupByDesc.Mode.HASH
  outputColumnNames: groupby + Distinct + Aggregation
  keys: groupby + Distinct
  aggregators: Aggregation
  groupKeyNotReductionKey: false
  groupByMemoryUsage: defaults to 0.5
  memoryThreshold: defaults to 0.9
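
For example, the map-side descriptor would be assembled roughly like this (a sketch; the variable names stand in for the lists described above and are not identifiers from Hive's source):

GroupByDesc mapSideDesc = new GroupByDesc(
    GroupByDesc.Mode.HASH,  // mode
    outputColumnNames,      // groupby + Distinct + Aggregation
    keys,                   // groupby + Distinct
    aggregators,            // the AggregationDesc list
    false,                  // groupKeyNotReductionKey
    0.5f,                   // groupByMemoryUsage (default)
    0.9f);                  // memoryThreshold (default)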

(1.2) ReduceSinkOperator
      Handle the GROUP BY clause (getGroupByForClause): add the group-by columns to reduceKeys and outputKeyColumnNames
      Handle DISTINCT in the SELECT (getDistinctFuncExprsForClause): add the distinct columns to reduceKeys and outputKeyColumnNames
      Handle the aggregate functions (getAggregationExprsForClause): add the columns to be aggregated to reduceValues and outputValueColumnNames
public ReduceSinkDesc(java.util.ArrayList<ExprNodeDesc> keyCols,
      int numDistributionKeys,
      java.util.ArrayList<ExprNodeDesc> valueCols,
      java.util.ArrayList<java.lang.String> outputKeyColumnNames,
      List<List<Integer>> distinctColumnIndices,
      java.util.ArrayList<java.lang.String> outputValueColumnNames, int tag,
      java.util.ArrayList<ExprNodeDesc> partitionCols, int numReducers,
      final TableDesc keySerializeInfo, final TableDesc valueSerializeInfo) {
    this.keyCols = keyCols; // reduceKeys: groupby + distinct
    this.numDistributionKeys = numDistributionKeys; // grpByExprs.size()
    this.valueCols = valueCols; // reduceValues: the aggregate functions
    this.outputKeyColumnNames = outputKeyColumnNames;
    this.outputValueColumnNames = outputValueColumnNames;
    this.tag = tag; // -1
    this.numReducers = numReducers; // usually -1
    this.partitionCols = partitionCols; // groupby
    this.keySerializeInfo = keySerializeInfo;
    this.valueSerializeInfo = valueSerializeInfo;
    this.distinctColumnIndices = distinctColumnIndices;
  }

(1.3) GroupByOperator
      Handle the GROUP BY clause (getGroupByForClause): add the group-by columns to groupByKeys and outputColumnNames
      Handle the aggregate functions (getAggregationExprsForClause): add the columns to be aggregated to aggregations and outputColumnNames
  public GroupByDesc(
      final Mode mode,
      final java.util.ArrayList<java.lang.String> outputColumnNames,
      final java.util.ArrayList<ExprNodeDesc> keys,
      final java.util.ArrayList<org.apache.hadoop.hive.ql.plan.AggregationDesc> aggregators,
      final boolean groupKeyNotReductionKey,float groupByMemoryUsage, float memoryThreshold) {
    this(mode, outputColumnNames, keys, aggregators, groupKeyNotReductionKey,
        false, groupByMemoryUsage, memoryThreshold);
  }
  mode: GroupByDesc.Mode.MERGEPARTIAL
  outputColumnNames: groupby + Aggregation
  keys: groupby
  aggregators: Aggregation
  groupKeyNotReductionKey: false
  groupByMemoryUsage: defaults to 0.5
  memoryThreshold: defaults to 0.9

(2) Map-side partial aggregation, skewed data: two MR jobs are generated.
genGroupByPlanMapAggr2MR:
(2.1) GroupByOperator: map-side partial aggregation, generated by genGroupByPlanMapGroupByOperator:
       Handle the GROUP BY clause (getGroupByForClause): add the group-by columns to groupByKeys and outputColumnNames
       Handle DISTINCT in the SELECT (getDistinctFuncExprsForClause): add the distinct columns to groupByKeys and outputColumnNames
       Handle the aggregate functions (getAggregationExprsForClause): build an AggregationDesc for each and add it to aggregations; add the generated columns to outputColumnNames
  public GroupByDesc(
      final Mode mode,
      final java.util.ArrayList<java.lang.String> outputColumnNames,
      final java.util.ArrayList<ExprNodeDesc> keys,
      final java.util.ArrayList<org.apache.hadoop.hive.ql.plan.AggregationDesc> aggregators,
      final boolean groupKeyNotReductionKey,float groupByMemoryUsage, float memoryThreshold) {
    this(mode, outputColumnNames, keys, aggregators, groupKeyNotReductionKey,
        false, groupByMemoryUsage, memoryThreshold);
  }
  mode: GroupByDesc.Mode.HASH
  outputColumnNames: groupby + Distinct + Aggregation
  keys: groupby + Distinct
  aggregators: Aggregation
  groupKeyNotReductionKey: false
  groupByMemoryUsage: defaults to 0.5
  memoryThreshold: defaults to 0.9

(2.2) ReduceSinkOperator
      Handle the GROUP BY clause (getGroupByForClause): add the group-by columns to reduceKeys and outputKeyColumnNames
      Handle DISTINCT in the SELECT (getDistinctFuncExprsForClause): add the distinct columns to reduceKeys and outputKeyColumnNames
      Handle the aggregate functions (getAggregationExprsForClause): add the columns to be aggregated to reduceValues and outputValueColumnNames
public ReduceSinkDesc(java.util.ArrayList<ExprNodeDesc> keyCols,
      int numDistributionKeys,
      java.util.ArrayList<ExprNodeDesc> valueCols,
      java.util.ArrayList<java.lang.String> outputKeyColumnNames,
      List<List<Integer>> distinctColumnIndices,
      java.util.ArrayList<java.lang.String> outputValueColumnNames, int tag,
      java.util.ArrayList<ExprNodeDesc> partitionCols, int numReducers,
      final TableDesc keySerializeInfo, final TableDesc valueSerializeInfo) {
    this.keyCols = keyCols; // reduceKeys: groupby + distinct
    this.numDistributionKeys = numDistributionKeys; // grpByExprs.size()
    this.valueCols = valueCols; // reduceValues: the aggregate functions
    this.outputKeyColumnNames = outputKeyColumnNames;
    this.outputValueColumnNames = outputValueColumnNames;
    this.tag = tag; // -1
    this.numReducers = numReducers; // usually -1
    this.partitionCols = partitionCols; // groupby
    this.keySerializeInfo = keySerializeInfo;
    this.valueSerializeInfo = valueSerializeInfo;
    this.distinctColumnIndices = distinctColumnIndices;
  }

(2.3) GroupByOperator
      Handle the GROUP BY clause (getGroupByForClause): add the group-by columns to groupByKeys and outputColumnNames
      Handle the aggregate functions (getAggregationExprsForClause): build an AggregationDesc for each and add it to aggregations; add the generated columns to outputColumnNames
  public GroupByDesc(
      final Mode mode,
      final java.util.ArrayList<java.lang.String> outputColumnNames,
      final java.util.ArrayList<ExprNodeDesc> keys,
      final java.util.ArrayList<org.apache.hadoop.hive.ql.plan.AggregationDesc> aggregators,
      final boolean groupKeyNotReductionKey,float groupByMemoryUsage, float memoryThreshold) {
    this(mode, outputColumnNames, keys, aggregators, groupKeyNotReductionKey,
        false, groupByMemoryUsage, memoryThreshold);
  }
  mode: GroupByDesc.Mode.PARTIALS
  outputColumnNames: groupby + Aggregation
  keys: groupby
  aggregators: Aggregation
  groupKeyNotReductionKey: false
  groupByMemoryUsage: defaults to 0.5
  memoryThreshold: defaults to 0.9

(2.4) ReduceSinkOperator
      Handle the GROUP BY clause (getGroupByForClause): add the group-by columns to reduceKeys and outputKeyColumnNames
      Handle the aggregate functions (getAggregationExprsForClause): add the columns to be aggregated to reduceValues and outputValueColumnNames
public ReduceSinkDesc(java.util.ArrayList<ExprNodeDesc> keyCols,
      int numDistributionKeys,
      java.util.ArrayList<ExprNodeDesc> valueCols,
      java.util.ArrayList<java.lang.String> outputKeyColumnNames,
      List<List<Integer>> distinctColumnIndices,
      java.util.ArrayList<java.lang.String> outputValueColumnNames, int tag,
      java.util.ArrayList<ExprNodeDesc> partitionCols, int numReducers,
      final TableDesc keySerializeInfo, final TableDesc valueSerializeInfo) {
    this.keyCols = keyCols; // reduceKeys: groupby
    this.numDistributionKeys = numDistributionKeys; // grpByExprs.size()
    this.valueCols = valueCols; // reduceValues: the aggregate functions
    this.outputKeyColumnNames = outputKeyColumnNames;
    this.outputValueColumnNames = outputValueColumnNames;
    this.tag = tag; // -1
    this.numReducers = numReducers; // usually -1
    this.partitionCols = partitionCols; // groupby
    this.keySerializeInfo = keySerializeInfo;
    this.valueSerializeInfo = valueSerializeInfo;
    this.distinctColumnIndices = distinctColumnIndices;
  }

(2.5) GroupByOperator
      Handle the GROUP BY clause (getGroupByForClause): add the group-by columns to groupByKeys and outputColumnNames
      Handle the aggregate functions (getAggregationExprsForClause): build an AggregationDesc for each and add it to aggregations; add the aggregated columns to outputColumnNames
  public GroupByDesc(
      final Mode mode,
      final java.util.ArrayList<java.lang.String> outputColumnNames,
      final java.util.ArrayList<ExprNodeDesc> keys,
      final java.util.ArrayList<org.apache.hadoop.hive.ql.plan.AggregationDesc> aggregators,
      final boolean groupKeyNotReductionKey,float groupByMemoryUsage, float memoryThreshold) {
    this(mode, outputColumnNames, keys, aggregators, groupKeyNotReductionKey,
        false, groupByMemoryUsage, memoryThreshold);
  }
  mode: GroupByDesc.Mode.FINAL
  outputColumnNames: groupby + Aggregation
  keys: groupby
  aggregators: Aggregation
  groupKeyNotReductionKey: false
  groupByMemoryUsage: defaults to 0.5
  memoryThreshold: defaults to 0.9


(3) No map-side partial aggregation, skewed data: two MR jobs are generated.
genGroupByPlan2MR:

(3.1) ReduceSinkOperator
      Handle the GROUP BY clause (getGroupByForClause): add the group-by columns to reduceKeys and outputKeyColumnNames
      Handle DISTINCT in the SELECT (getDistinctFuncExprsForClause): add the distinct columns to reduceKeys and outputKeyColumnNames
      Handle the aggregate functions (getAggregationExprsForClause): add the columns to be aggregated to reduceValues and outputValueColumnNames
public ReduceSinkDesc(java.util.ArrayList<ExprNodeDesc> keyCols,
      int numDistributionKeys,
      java.util.ArrayList<ExprNodeDesc> valueCols,
      java.util.ArrayList<java.lang.String> outputKeyColumnNames,
      List<List<Integer>> distinctColumnIndices,
      java.util.ArrayList<java.lang.String> outputValueColumnNames, int tag,
      java.util.ArrayList<ExprNodeDesc> partitionCols, int numReducers,
      final TableDesc keySerializeInfo, final TableDesc valueSerializeInfo) {
    this.keyCols = keyCols; // reduceKeys: groupby + distinct
    this.numDistributionKeys = numDistributionKeys; // grpByExprs.size()
    this.valueCols = valueCols; // reduceValues: the aggregate functions
    this.outputKeyColumnNames = outputKeyColumnNames;
    this.outputValueColumnNames = outputValueColumnNames;
    this.tag = tag; // -1
    this.numReducers = numReducers; // usually -1
    this.partitionCols = partitionCols; // groupby
    this.keySerializeInfo = keySerializeInfo;
    this.valueSerializeInfo = valueSerializeInfo;
    this.distinctColumnIndices = distinctColumnIndices;
  }

(3.2) GroupByOperator
      Handle the GROUP BY clause (getGroupByForClause): add the group-by columns to groupByKeys and outputColumnNames
      Handle the aggregate functions (getAggregationExprsForClause): build an AggregationDesc for each and add it to aggregations; add the generated columns to outputColumnNames
  public GroupByDesc(
      final Mode mode,
      final java.util.ArrayList<java.lang.String> outputColumnNames,
      final java.util.ArrayList<ExprNodeDesc> keys,
      final java.util.ArrayList<org.apache.hadoop.hive.ql.plan.AggregationDesc> aggregators,
      final boolean groupKeyNotReductionKey,float groupByMemoryUsage, float memoryThreshold) {
    this(mode, outputColumnNames, keys, aggregators, groupKeyNotReductionKey,
        false, groupByMemoryUsage, memoryThreshold);
  }
  mode: GroupByDesc.Mode.PARTIAL1
  outputColumnNames: groupby + Aggregation
  keys: groupby
  aggregators: Aggregation
  groupKeyNotReductionKey: false
  groupByMemoryUsage: defaults to 0.5
  memoryThreshold: defaults to 0.9

(3.3) ReduceSinkOperator
      Handle the GROUP BY clause (getGroupByForClause): add the group-by columns to reduceKeys and outputKeyColumnNames
      Handle the aggregate functions (getAggregationExprsForClause): add the columns to be aggregated to reduceValues and outputValueColumnNames
public ReduceSinkDesc(java.util.ArrayList<ExprNodeDesc> keyCols,
      int numDistributionKeys,
      java.util.ArrayList<ExprNodeDesc> valueCols,
      java.util.ArrayList<java.lang.String> outputKeyColumnNames,
      List<List<Integer>> distinctColumnIndices,
      java.util.ArrayList<java.lang.String> outputValueColumnNames, int tag,
      java.util.ArrayList<ExprNodeDesc> partitionCols, int numReducers,
      final TableDesc keySerializeInfo, final TableDesc valueSerializeInfo) {
    this.keyCols = keyCols; // reduceKeys: groupby
    this.numDistributionKeys = numDistributionKeys; // grpByExprs.size()
    this.valueCols = valueCols; // reduceValues: the aggregate functions
    this.outputKeyColumnNames = outputKeyColumnNames;
    this.outputValueColumnNames = outputValueColumnNames;
    this.tag = tag; // -1
    this.numReducers = numReducers; // usually -1
    this.partitionCols = partitionCols; // groupby
    this.keySerializeInfo = keySerializeInfo;
    this.valueSerializeInfo = valueSerializeInfo;
    this.distinctColumnIndices = distinctColumnIndices;
  }

(3.4) GroupByOperator
      Handle the GROUP BY clause (getGroupByForClause): add the group-by columns to groupByKeys and outputColumnNames
      Handle the aggregate functions (getAggregationExprsForClause): build an AggregationDesc for each and add it to aggregations; add the aggregated columns to outputColumnNames
  public GroupByDesc(
      final Mode mode,
      final java.util.ArrayList<java.lang.String> outputColumnNames,
      final java.util.ArrayList<ExprNodeDesc> keys,
      final java.util.ArrayList<org.apache.hadoop.hive.ql.plan.AggregationDesc> aggregators,
      final boolean groupKeyNotReductionKey,float groupByMemoryUsage, float memoryThreshold) {
    this(mode, outputColumnNames, keys, aggregators, groupKeyNotReductionKey,
        false, groupByMemoryUsage, memoryThreshold);
  }
  mode: GroupByDesc.Mode.FINAL
  outputColumnNames: groupby + Aggregation
  keys: groupby
  aggregators: Aggregation
  groupKeyNotReductionKey: false
  groupByMemoryUsage: defaults to 0.5
  memoryThreshold: defaults to 0.9


(4) No map-side partial aggregation, no data skew: one MR job is generated.
genGroupByPlan1MR:
(4.1) ReduceSinkOperator
      Handle the GROUP BY clause (getGroupByForClause): add the group-by columns to reduceKeys and outputKeyColumnNames
      Handle DISTINCT in the SELECT (getDistinctFuncExprsForClause): add the distinct columns to reduceKeys and outputKeyColumnNames
      Handle the aggregate functions (getAggregationExprsForClause): add the columns to be aggregated to reduceValues and outputValueColumnNames
public ReduceSinkDesc(java.util.ArrayList<ExprNodeDesc> keyCols,
      int numDistributionKeys,
      java.util.ArrayList<ExprNodeDesc> valueCols,
      java.util.ArrayList<java.lang.String> outputKeyColumnNames,
      List<List<Integer>> distinctColumnIndices,
      java.util.ArrayList<java.lang.String> outputValueColumnNames, int tag,
      java.util.ArrayList<ExprNodeDesc> partitionCols, int numReducers,
      final TableDesc keySerializeInfo, final TableDesc valueSerializeInfo) {
    this.keyCols = keyCols; // reduceKeys: groupby + distinct
    this.numDistributionKeys = numDistributionKeys; // grpByExprs.size()
    this.valueCols = valueCols; // reduceValues: the aggregate functions
    this.outputKeyColumnNames = outputKeyColumnNames;
    this.outputValueColumnNames = outputValueColumnNames;
    this.tag = tag; // -1
    this.numReducers = numReducers; // usually -1
    this.partitionCols = partitionCols; // groupby
    this.keySerializeInfo = keySerializeInfo;
    this.valueSerializeInfo = valueSerializeInfo;
    this.distinctColumnIndices = distinctColumnIndices;
  }

(4.2) GroupByOperator
      Handle the GROUP BY clause (getGroupByForClause): add the group-by columns to groupByKeys and outputColumnNames
      Handle the aggregate functions (getAggregationExprsForClause): add the columns to be aggregated to aggregations and outputColumnNames
  public GroupByDesc(
      final Mode mode,
      final java.util.ArrayList<java.lang.String> outputColumnNames,
      final java.util.ArrayList<ExprNodeDesc> keys,
      final java.util.ArrayList<org.apache.hadoop.hive.ql.plan.AggregationDesc> aggregators,
      final boolean groupKeyNotReductionKey,float groupByMemoryUsage, float memoryThreshold) {
    this(mode, outputColumnNames, keys, aggregators, groupKeyNotReductionKey,
        false, groupByMemoryUsage, memoryThreshold);
  }
  mode: GroupByDesc.Mode.COMPLETE
  outputColumnNames: groupby + Aggregation
  keys: groupby
  aggregators: Aggregation
  groupKeyNotReductionKey: false
  groupByMemoryUsage: defaults to 0.5
  memoryThreshold: defaults to 0.9



SemanticAnalyzer.genBodyPlan
  optimizeMultiGroupBy  (multi-group by with the same distinct)
  groupby10.q  groupby11.q
posted on 2013-05-14 13:32 by 风生水起