准备数据
SELECT uid, SUM(COUNT) FROM logs GROUP BY uid;
|
hive> SELECT * FROM logs;
a 苹果 5
a 橙子 3
a 苹果 2
b 烧鸡 1
hive> SELECT uid, SUM(COUNT) FROM logs GROUP BY uid;
a 10
b 1
|
计算过程
默认设置了hive.map.aggr=true,所以会在mapper端先group
by一次,最后再把结果merge起来,为了减少reducer处理的数据量。注意看explain的mode是不一样的。mapper是
hash,reducer是mergepartial。如果把hive.map.aggr=false,那将groupby放到reducer才做,他的
mode是complete.
Operator
Explain
hive> explain SELECT uid, sum(count) FROM logs group by uid;
OK
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME logs))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL uid)) (TOK_SELEXPR (TOK_FUNCTION sum (TOK_TABLE_OR_COL count)))) (TOK_GROUPBY (TOK_TABLE_OR_COL uid))))
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
logs
TableScan // 扫描表
alias: logs
Select Operator //选择字段
expressions:
expr: uid
type: string
expr: count
type: int
outputColumnNames: uid, count
Group By Operator //这里是因为默认设置了hive.map.aggr=true,会在mapper先做一次聚合,减少reduce需要处理的数据
aggregations:
expr: sum(count) //聚集函数
bucketGroup: false
keys: //键
expr: uid
type: string
mode: hash //hash方式,processHashAggr()
outputColumnNames: _col0, _col1
Reduce Output Operator //输出key,value给reducer
key expressions:
expr: _col0
type: string
sort order: +
Map-reduce partition columns:
expr: _col0
type: string
tag: -1
value expressions:
expr: _col1
type: bigint
Reduce Operator Tree:
Group By Operator
aggregations:
expr: sum(VALUE._col0)
//聚合
bucketGroup: false
keys:
expr: KEY._col0
type: string
mode: mergepartial //合并值
outputColumnNames: _col0, _col1
Select Operator //选择字段
expressions:
expr: _col0
type: string
expr: _col1
type: bigint
outputColumnNames: _col0, _col1
File Output Operator //输出到文件
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1
hive> select distinct value from src;
hive> select max(key) from src;
因为没有grouping keys,所以只有一个reducer。
2.2 如果有聚合函数或者groupby,做如下处理:
插入一个select operator,选取所有的字段,用于优化阶段ColumnPruner的优化
2.2.1 hive.map.aggr为true,默认是true,开启的,在map端做部分聚合
2.2.1.1 hive.groupby.skewindata为false,默认是关闭的,groupby的数据没有倾斜。
生成的operator是: GroupByOperator+ReduceSinkOperator+GroupByOperator。
GroupByOperator+ReduceSinkOperator用于在map端做操作,第一个GroupByOperator在map端先做部分聚合。第二个用于在reduce端做GroupBy操作
2.2.1.2 hive.groupby.skewindata为true
生成的operator是: GroupbyOperator+ReduceSinkOperator+GroupbyOperator+ReduceSinkOperator +GroupByOperator
GroupbyOperator+ReduceSinkOperator(第一个MapredTask的map阶段)
GroupbyOperator(第一个MapredTask的reduce阶段)
ReduceSinkOperator (第二个MapredTask的map阶段)
GroupByOperator(第二个MapredTask的reduce阶段)
2.2.2 hive.map.aggr为false
2.2.2.1 hive.groupby.skewindata为true
生成的operator是: ReduceSinkOperator+GroupbyOperator+ReduceSinkOperator +GroupByOperator
ReduceSinkOperator(第一个MapredTask的map阶段)
GroupbyOperator(第一个MapredTask的reduce阶段)
ReduceSinkOperator (第二个MapredTask的map阶段)
GroupByOperator(第二个MapredTask的reduce阶段)
2.2.2.2 hive.groupby.skewindata为false
生成的operator是: ReduceSinkOperator(map阶段运行)+GroupbyOperator(reduce阶段运行)
第一种情况:
set hive.map.aggr=false;
set hive.groupby.skewindata=false;
SemanticAnalyzer.genGroupByPlan1MR(){
(1)ReduceSinkOperator: It will put all Group By keys and the
distinct field (if any) in the map-reduce sort key, and all other fields
in the map-reduce value.
(2)GroupbyOperator:GroupByDesc.Mode.COMPLETE,Reducer: iterate/merge (mode = COMPLETE)
}
第二种情况:
set hive.map.aggr=true;
set hive.groupby.skewindata=false;
SemanticAnalyzer.genGroupByPlanMapAggr1MR(){
(1)GroupByOperator:GroupByDesc.Mode.HASH,The agggregation
evaluation functions are as follows: Mapper: iterate/terminatePartial
(mode = HASH)
(2)ReduceSinkOperator:Partitioning Key: grouping key。Sorting Key:
grouping key if no DISTINCT grouping + distinct key if DISTINCT
(3)GroupByOperator:GroupByDesc.Mode.MERGEPARTIAL,Reducer:
iterate/terminate if DISTINCT merge/terminate if NO DISTINCT (mode =
MERGEPARTIAL)
}
第三种情况:
set hive.map.aggr=false;
set hive.groupby.skewindata=true;
SemanticAnalyzer.genGroupByPlan2MR(){
(1)ReduceSinkOperator:Partitioning Key: random() if no DISTINCT
grouping + distinct key if DISTINCT。Sorting Key: grouping key if no
DISTINCT grouping + distinct key if DISTINCT
(2)GroupbyOperator:GroupByDesc.Mode.PARTIAL1,Reducer: iterate/terminatePartial (mode = PARTIAL1)
(3)ReduceSinkOperator:Partitioning Key: grouping key。Sorting
Key: grouping key if no DISTINCT grouping + distinct key if DISTINCT
(4)GroupByOperator:GroupByDesc.Mode.FINAL,Reducer: merge/terminate (mode = FINAL)
}
第四种情况:
set hive.map.aggr=true;
set hive.groupby.skewindata=true;
SemanticAnalyzer.genGroupByPlanMapAggr2MR(){
(1)GroupbyOperator:GroupByDesc.Mode.HASH,Mapper: iterate/terminatePartial (mode = HASH)
(2)ReduceSinkOperator: Partitioning Key: random() if no
DISTINCT grouping + distinct key if DISTINCT。 Sorting Key: grouping key
if no DISTINCT grouping + distinct key if DISTINCT。
(3)GroupbyOperator:GroupByDesc.Mode.PARTIALS, Reducer:
iterate/terminatePartial if DISTINCT merge/terminatePartial if NO
DISTINCT (mode = MERGEPARTIAL)
(4)ReduceSinkOperator:Partitioining Key: grouping key。Sorting
Key: grouping key if no DISTINCT grouping + distinct key if DISTINCT
(5)GroupByOperator:GroupByDesc.Mode.FINAL,Reducer: merge/terminate (mode = FINAL)
}
ReduceSinkOperator的processOp(Object row, int
tag)会根据相应的条件设置Key的hash值,如第四种情况的第一个ReduceSinkOperator:Partitioning Key:
random() if no DISTINCT grouping + distinct key if
DISTINCT,如果没有DISTINCT字段,那么在OutputCollector.collect前会设置当前Key的hash值为一个随机
数,random = new Random(12345);。如果有DISTINCT字段,那么key的hash值跟grouping +
distinct key有关。
GroupByOperator:
initializeOp(Configuration hconf)
processOp(Object row, int tag)
closeOp(boolean abort)
forward(ArrayList<Object> keys, AggregationBuffer[] aggs)
groupby10.q groupby11.q
set hive.map.aggr=false;
set hive.groupby.skewindata=false;
EXPLAIN
FROM INPUT
INSERT OVERWRITE TABLE dest1 SELECT INPUT.key,
count(substr(INPUT.value,5)), count(distinct substr(INPUT.value,5))
GROUP BY INPUT.key;
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
input
TableScan
alias: input
Select Operator // insertSelectAllPlanForGroupBy
expressions:
expr: key
type: int
expr: value
type: string
outputColumnNames: key, value
Reduce Output Operator
key expressions:
expr: key
type: int
expr: substr(value, 5)
type: string
sort order: ++
Map-reduce partition columns:
expr: key
type: int
tag: -1
Reduce Operator Tree:
Group By Operator
aggregations:
expr: count(KEY._col1:0._col0)
expr: count(DISTINCT KEY._col1:0._col0)
bucketGroup: false
keys:
expr: KEY._col0
type: int
mode: complete
outputColumnNames: _col0, _col1, _col2
Select Operator
expressions:
expr: _col0
type: int
expr: _col1
type: bigint
expr: _col2
type: bigint
outputColumnNames: _col0, _col1, _col2
Select Operator
expressions:
expr: _col0
type: int
expr: UDFToInteger(_col1)
type: int
expr: UDFToInteger(_col2)
type: int
outputColumnNames: _col0, _col1, _col2
File Output Operator
compressed: false
GlobalTableId: 1
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
name: dest1
Stage: Stage-0
Move Operator
tables:
replace: true
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
name: dest1
set hive.map.aggr=true;
set hive.groupby.skewindata=false;
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
input
TableScan
alias: input
Select Operator
expressions:
expr: key
type: int
expr: value
type: string
outputColumnNames: key, value
Group By Operator
aggregations:
expr: count(substr(value, 5))
expr: count(DISTINCT substr(value, 5))
bucketGroup: false
keys:
expr: key
type: int
expr: substr(value, 5)
type: string
mode: hash
outputColumnNames: _col0, _col1, _col2, _col3
Reduce Output Operator
key expressions:
expr: _col0
type: int
expr: _col1
type: string
sort order: ++
Map-reduce partition columns:
expr: _col0
type: int
tag: -1
value expressions:
expr: _col2
type: bigint
expr: _col3
type: bigint
Reduce Operator Tree:
Group By Operator
aggregations:
expr: count(VALUE._col0)
expr: count(DISTINCT KEY._col1:0._col0)
bucketGroup: false
keys:
expr: KEY._col0
type: int
mode: mergepartial
outputColumnNames: _col0, _col1, _col2
Select Operator
expressions:
expr: _col0
type: int
expr: _col1
type: bigint
expr: _col2
type: bigint
outputColumnNames: _col0, _col1, _col2
Select Operator
expressions:
expr: _col0
type: int
expr: UDFToInteger(_col1)
type: int
expr: UDFToInteger(_col2)
type: int
outputColumnNames: _col0, _col1, _col2
File Output Operator
compressed: false
GlobalTableId: 1
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
name: dest1
Stage: Stage-0
Move Operator
tables:
replace: true
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
name: dest1
set hive.map.aggr=false;
set hive.groupby.skewindata=true;
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1
Stage-0 depends on stages: Stage-2
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
input
TableScan
alias: input
Select Operator
expressions:
expr: key
type: int
expr: value
type: string
outputColumnNames: key, value
Reduce Output Operator
key expressions:
expr: key
type: int
expr: substr(value, 5)
type: string
sort order: ++
Map-reduce partition columns:
expr: key
type: int
tag: -1
Reduce Operator Tree:
Group By Operator
aggregations:
expr: count(KEY._col1:0._col0)
expr: count(DISTINCT KEY._col1:0._col0)
bucketGroup: false
keys:
expr: KEY._col0
type: int
mode: partial1
outputColumnNames: _col0, _col1, _col2
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Stage: Stage-2
Map Reduce
Alias -> Map Operator Tree:
hdfs://localhost:54310/tmp/hive-tianzhao/hive_2011-07-15_21-48-26_387_7978992474997402829/-mr-10002
Reduce Output Operator
key expressions:
expr: _col0
type: int
sort order: +
Map-reduce partition columns:
expr: _col0
type: int
tag: -1
value expressions:
expr: _col1
type: bigint
expr: _col2
type: bigint
Reduce Operator Tree:
Group By Operator
aggregations:
expr: count(VALUE._col0)
expr: count(VALUE._col1)
bucketGroup: false
keys:
expr: KEY._col0
type: int
mode: final
outputColumnNames: _col0, _col1, _col2
Select Operator
expressions:
expr: _col0
type: int
expr: _col1
type: bigint
expr: _col2
type: bigint
outputColumnNames: _col0, _col1, _col2
Select Operator
expressions:
expr: _col0
type: int
expr: UDFToInteger(_col1)
type: int
expr: UDFToInteger(_col2)
type: int
outputColumnNames: _col0, _col1, _col2
File Output Operator
compressed: false
GlobalTableId: 1
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
name: dest1
Stage: Stage-0
Move Operator
tables:
replace: true
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
name: dest1
set hive.map.aggr=true;
set hive.groupby.skewindata=true;
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1
Stage-0 depends on stages: Stage-2
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
input
TableScan
alias: input
Select Operator
expressions:
expr: key
type: int
expr: value
type: string
outputColumnNames: key, value
Group By Operator
aggregations:
expr: count(substr(value, 5))
expr: count(DISTINCT substr(value, 5))
bucketGroup: false
keys:
expr: key
type: int
expr: substr(value, 5)
type: string
mode: hash
outputColumnNames: _col0, _col1, _col2, _col3
Reduce Output Operator
key expressions:
expr: _col0
type: int
expr: _col1
type: string
sort order: ++
Map-reduce partition columns:
expr: _col0
type: int
tag: -1
value expressions:
expr: _col2
type: bigint
expr: _col3
type: bigint
Reduce Operator Tree:
Group By Operator
aggregations:
expr: count(VALUE._col0)
expr: count(DISTINCT KEY._col1:0._col0)
bucketGroup: false
keys:
expr: KEY._col0
type: int
mode: partials
outputColumnNames: _col0, _col1, _col2
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Stage: Stage-2
Map Reduce
Alias -> Map Operator Tree:
hdfs://localhost:54310/tmp/hive-tianzhao/hive_2011-07-15_21-49-25_899_4946067838822964610/-mr-10002
Reduce Output Operator
key expressions:
expr: _col0
type: int
sort order: +
Map-reduce partition columns:
expr: _col0
type: int
tag: -1
value expressions:
expr: _col1
type: bigint
expr: _col2
type: bigint
Reduce Operator Tree:
Group By Operator
aggregations:
expr: count(VALUE._col0)
expr: count(VALUE._col1)
bucketGroup: false
keys:
expr: KEY._col0
type: int
mode: final
outputColumnNames: _col0, _col1, _col2
Select Operator
expressions:
expr: _col0
type: int
expr: _col1
type: bigint
expr: _col2
type: bigint
outputColumnNames: _col0, _col1, _col2
Select Operator
expressions:
expr: _col0
type: int
expr: UDFToInteger(_col1)
type: int
expr: UDFToInteger(_col2)
type: int
outputColumnNames: _col0, _col1, _col2
File Output Operator
compressed: false
GlobalTableId: 1
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
name: dest1
Stage: Stage-0
Move Operator
tables:
replace: true
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
name: dest1
set hive.map.aggr=false;
set hive.groupby.skewindata=false;
EXPLAIN extended
FROM INPUT
INSERT OVERWRITE TABLE dest1 SELECT INPUT.key,
count(substr(INPUT.value,5)), count(distinct substr(INPUT.value,5))
GROUP BY INPUT.key;
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
input
TableScan
alias: input
Select Operator
expressions:
expr: key
type: int
expr: value
type: string
outputColumnNames: key, value
Reduce Output Operator
key expressions:
expr: key
type: int
expr: substr(value, 5)
type: string
sort order: ++
Map-reduce partition columns:
expr: key
type: int
tag: -1
Needs Tagging: false
Path -> Alias:
hdfs://localhost:54310/user/hive/warehouse/input [input]
Path -> Partition:
hdfs://localhost:54310/user/hive/warehouse/input
Partition
base file name: input
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
properties:
bucket_count -1
columns key,value
columns.types int:string
file.inputformat org.apache.hadoop.mapred.TextInputFormat
file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
location hdfs://localhost:54310/user/hive/warehouse/input
name input
serialization.ddl struct input { i32 key, string value}
serialization.format 1
serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
transient_lastDdlTime 1310523947
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
properties:
bucket_count -1
columns key,value
columns.types int:string
file.inputformat org.apache.hadoop.mapred.TextInputFormat
file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
location hdfs://localhost:54310/user/hive/warehouse/input
name input
serialization.ddl struct input { i32 key, string value}
serialization.format 1
serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
transient_lastDdlTime 1310523947
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
name: input
name: input
Reduce Operator Tree:
Group By Operator
aggregations:
expr: count(KEY._col1:0._col0)
expr: count(DISTINCT KEY._col1:0._col0)
bucketGroup: false
keys:
expr: KEY._col0
type: int
mode: complete
outputColumnNames: _col0, _col1, _col2
Select Operator
expressions:
expr: _col0
type: int
expr: _col1
type: bigint
expr: _col2
type: bigint
outputColumnNames: _col0, _col1, _col2
Select Operator
expressions:
expr: _col0
type: int
expr: UDFToInteger(_col1)
type: int
expr: UDFToInteger(_col2)
type: int
outputColumnNames: _col0, _col1, _col2
File Output Operator
compressed: false
GlobalTableId: 1
directory: hdfs://localhost:54310/tmp/hive-tianzhao/hive_2011-07-15_21-50-38_510_6852880850328147221/-ext-10000
NumFilesPerFileSink: 1
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
properties:
bucket_count -1
columns key,val1,val2
columns.types int:int:int
file.inputformat org.apache.hadoop.mapred.TextInputFormat
file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
location hdfs://localhost:54310/user/hive/warehouse/dest1
name dest1
serialization.ddl struct dest1 { i32 key, i32 val1, i32 val2}
serialization.format 1
serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
transient_lastDdlTime 1310523946
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
name: dest1
TotalFiles: 1
MultiFileSpray: false
Stage: Stage-0
Move Operator
tables:
replace: true
source: hdfs://localhost:54310/tmp/hive-tianzhao/hive_2011-07-15_21-50-38_510_6852880850328147221/-ext-10000
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
properties:
bucket_count -1
columns key,val1,val2
columns.types int:int:int
file.inputformat org.apache.hadoop.mapred.TextInputFormat
file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
location hdfs://localhost:54310/user/hive/warehouse/dest1
name dest1
serialization.ddl struct dest1 { i32 key, i32 val1, i32 val2}
serialization.format 1
serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
transient_lastDdlTime 1310523946
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
name: dest1
tmp directory: hdfs://localhost:54310/tmp/hive-tianzhao/hive_2011-07-15_21-50-38_510_6852880850328147221/-ext-10001
ABSTRACT SYNTAX TREE:
(TOK_QUERY
(TOK_FROM (TOK_TABREF INPUT))
(TOK_INSERT
(TOK_DESTINATION (TOK_TAB dest1))
(TOK_SELECT
(TOK_SELEXPR (. (TOK_TABLE_OR_COL INPUT) key))
(TOK_SELEXPR (TOK_FUNCTION count (TOK_FUNCTION substr (. (TOK_TABLE_OR_COL INPUT) value) 5)))
(TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_FUNCTION substr (. (TOK_TABLE_OR_COL INPUT) value) 5)))
)
(TOK_GROUPBY (. (TOK_TABLE_OR_COL INPUT) key))
)
)
SemanticAnalyzer.genBodyPlan(QB qb, Operator input){
if (qbp.getAggregationExprsForClause(dest).size() != 0
|| getGroupByForClause(qbp, dest).size() > 0) { //如果有聚合函数或者有groupby,则执行下面的操作
//multiple distincts is not supported with skew in data
if (conf.getVar(HiveConf.ConfVars.HIVEGROUPBYSKEW)
.equalsIgnoreCase("true") &&
qbp.getDistinctFuncExprsForClause(dest).size() > 1) {
throw new SemanticException(ErrorMsg.UNSUPPORTED_MULTIPLE_DISTINCTS.
getMsg());
}
// insert a select operator here used by the ColumnPruner to reduce
// the data to shuffle
curr = insertSelectAllPlanForGroupBy(dest, curr); //生成一个SelectOperator,所有的字段都会选取,selectStar=true。
if (conf.getVar(HiveConf.ConfVars.HIVEMAPSIDEAGGREGATE)
.equalsIgnoreCase("true")) {
if (conf.getVar(HiveConf.ConfVars.HIVEGROUPBYSKEW)
.equalsIgnoreCase("false")) {
curr = genGroupByPlanMapAggr1MR(dest, qb, curr);
} else {
curr = genGroupByPlanMapAggr2MR(dest, qb, curr);
}
} else if (conf.getVar(HiveConf.ConfVars.HIVEGROUPBYSKEW)
.equalsIgnoreCase("true")) {
curr = genGroupByPlan2MR(dest, qb, curr);
} else {
curr = genGroupByPlan1MR(dest, qb, curr);
}
}
}
distince:
count.q.out
groupby11.q.out
groupby10.q.out
nullgroup4_multi_distinct.q.out
join18.q.out
groupby_bigdata.q.out
join18_multi_distinct.q.out
nullgroup4.q.out
auto_join18_multi_distinct.q.out
auto_join18.q.out
(1)map端部分聚合,数据无倾斜,一个MR生成。
genGroupByPlanMapAggr1MR,生成三个Operator:
(1.1)GroupByOperator:map-side partial aggregation,由genGroupByPlanMapGroupByOperator方法生成:
处理groupby子句,getGroupByForClause,groupby的column加入groupByKeys和outputColumnNames
处理select中的Distinct,getDistinctFuncExprsForClause,Distinct的column,加入groupByKeys和outputColumnNames
处理聚合函数,getAggregationExprsForClause,生成AggregationDesc加入aggregations,生成column加入outputColumnNames
public GroupByDesc(
final Mode mode,
final java.util.ArrayList<java.lang.String> outputColumnNames,
final java.util.ArrayList<ExprNodeDesc> keys,
final java.util.ArrayList<org.apache.hadoop.hive.ql.plan.AggregationDesc> aggregators,
final boolean groupKeyNotReductionKey,float groupByMemoryUsage, float memoryThreshold) {
this(mode, outputColumnNames, keys, aggregators, groupKeyNotReductionKey,
false, groupByMemoryUsage, memoryThreshold);
}
mode:GroupByDesc.Mode.HASH
outputColumnNames:groupby+Distinct+Aggregation
keys:groupby+Distinct
aggregators:Aggregation
groupKeyNotReductionKey:false
groupByMemoryUsage:默认为0.5
memoryThreshold:默认为0.9
(1.2)ReduceSinkOperator
处理groupby子句,getGroupByForClause,groupby的column加入reduceKeys和outputKeyColumnNames
处理select中的Distinct,getDistinctFuncExprsForClause,Distinct的column,加入reduceKeys和outputKeyColumnNames
处理聚合函数,getAggregationExprsForClause,需要做聚合的column加入reduceValues和outputValueColumnNames
public ReduceSinkDesc(java.util.ArrayList<ExprNodeDesc> keyCols,
int numDistributionKeys,
java.util.ArrayList<ExprNodeDesc> valueCols,
java.util.ArrayList<java.lang.String> outputKeyColumnNames,
List<List<Integer>> distinctColumnIndices,
java.util.ArrayList<java.lang.String> outputValueColumnNames, int tag,
java.util.ArrayList<ExprNodeDesc> partitionCols, int numReducers,
final TableDesc keySerializeInfo, final TableDesc valueSerializeInfo) {
this.keyCols = keyCols; // 为reduceKeys,groupby+distinct
this.numDistributionKeys = numDistributionKeys; // grpByExprs.size()
this.valueCols = valueCols; //reduceValues,聚合函数
this.outputKeyColumnNames = outputKeyColumnNames; //outputKeyColumnNames
this.outputValueColumnNames = outputValueColumnNames; //outputValueColumnNames
this.tag = tag; // -1
this.numReducers = numReducers; // 一般都是-1
this.partitionCols = partitionCols; // groupby
this.keySerializeInfo = keySerializeInfo;
this.valueSerializeInfo = valueSerializeInfo;
this.distinctColumnIndices = distinctColumnIndices;
}
(1.3)GroupByOperator
处理groupby子句,getGroupByForClause,groupby的column加入reduceKeys和outputKeyColumnNames
处理聚合函数,getAggregationExprsForClause,需要做聚合的column加入reduceValues和outputValueColumnNames
public GroupByDesc(
final Mode mode,
final java.util.ArrayList<java.lang.String> outputColumnNames,
final java.util.ArrayList<ExprNodeDesc> keys,
final java.util.ArrayList<org.apache.hadoop.hive.ql.plan.AggregationDesc> aggregators,
final boolean groupKeyNotReductionKey,float groupByMemoryUsage, float memoryThreshold) {
this(mode, outputColumnNames, keys, aggregators, groupKeyNotReductionKey,
false, groupByMemoryUsage, memoryThreshold);
}
mode:GroupByDesc.Mode.MERGEPARTIAL
outputColumnNames:groupby+Aggregation
keys:groupby
aggregators:Aggregation
groupKeyNotReductionKey:false
groupByMemoryUsage:默认为0.5
memoryThreshold:默认为0.9
(2)map端部分聚合,数据倾斜,两个MR生成。
genGroupByPlanMapAggr2MR:
(2.1)GroupByOperator:map-side partial aggregation,由genGroupByPlanMapGroupByOperator方法生成:
处理groupby子句,getGroupByForClause,groupby的column加入groupByKeys和outputColumnNames
处理select中的Distinct,getDistinctFuncExprsForClause,Distinct的column,加入groupByKeys和outputColumnNames
处理聚合函数,getAggregationExprsForClause,生成AggregationDesc加入aggregations,生成column加入outputColumnNames
public GroupByDesc(
final Mode mode,
final java.util.ArrayList<java.lang.String> outputColumnNames,
final java.util.ArrayList<ExprNodeDesc> keys,
final java.util.ArrayList<org.apache.hadoop.hive.ql.plan.AggregationDesc> aggregators,
final boolean groupKeyNotReductionKey,float groupByMemoryUsage, float memoryThreshold) {
this(mode, outputColumnNames, keys, aggregators, groupKeyNotReductionKey,
false, groupByMemoryUsage, memoryThreshold);
}
mode:GroupByDesc.Mode.HASH
outputColumnNames:groupby+Distinct+Aggregation
keys:groupby+Distinct
aggregators:Aggregation
groupKeyNotReductionKey:false
groupByMemoryUsage:默认为0.5
memoryThreshold:默认为0.9
(2.2)ReduceSinkOperator
处理groupby子句,getGroupByForClause,groupby的column加入reduceKeys和outputKeyColumnNames
处理select中的Distinct,getDistinctFuncExprsForClause,Distinct的column,加入reduceKeys和outputKeyColumnNames
处理聚合函数,getAggregationExprsForClause,需要做聚合的column加入reduceValues和outputValueColumnNames
public ReduceSinkDesc(java.util.ArrayList<ExprNodeDesc> keyCols,
int numDistributionKeys,
java.util.ArrayList<ExprNodeDesc> valueCols,
java.util.ArrayList<java.lang.String> outputKeyColumnNames,
List<List<Integer>> distinctColumnIndices,
java.util.ArrayList<java.lang.String> outputValueColumnNames, int tag,
java.util.ArrayList<ExprNodeDesc> partitionCols, int numReducers,
final TableDesc keySerializeInfo, final TableDesc valueSerializeInfo) {
this.keyCols = keyCols; // 为reduceKeys,groupby+distinct
this.numDistributionKeys = numDistributionKeys; // grpByExprs.size()
this.valueCols = valueCols; //reduceValues,聚合函数
this.outputKeyColumnNames = outputKeyColumnNames; //outputKeyColumnNames
this.outputValueColumnNames = outputValueColumnNames; //outputValueColumnNames
this.tag = tag; // -1
this.numReducers = numReducers; // 一般都是-1
this.partitionCols = partitionCols; // groupby
this.keySerializeInfo = keySerializeInfo;
this.valueSerializeInfo = valueSerializeInfo;
this.distinctColumnIndices = distinctColumnIndices;
}
(2.3)GroupByOperator
处理groupby子句,getGroupByForClause,groupby的column加入groupByKeys和outputColumnNames
处理聚合函数,getAggregationExprsForClause,生成AggregationDesc加入aggregations,生成column加入outputColumnNames
public GroupByDesc(
final Mode mode,
final java.util.ArrayList<java.lang.String> outputColumnNames,
final java.util.ArrayList<ExprNodeDesc> keys,
final java.util.ArrayList<org.apache.hadoop.hive.ql.plan.AggregationDesc> aggregators,
final boolean groupKeyNotReductionKey,float groupByMemoryUsage, float memoryThreshold) {
this(mode, outputColumnNames, keys, aggregators, groupKeyNotReductionKey,
false, groupByMemoryUsage, memoryThreshold);
}
mode:GroupByDesc.Mode.PARTIALS
outputColumnNames:groupby+Aggregation
keys:groupby
aggregators:Aggregation
groupKeyNotReductionKey:false
groupByMemoryUsage:默认为0.5
memoryThreshold:默认为0.9
(2.4)ReduceSinkOperator
处理groupby子句,getGroupByForClause,groupby的column加入reduceKeys和outputColumnNames
处理聚合函数,getAggregationExprsForClause,需要做聚合的column加入reduceValues和outputColumnNames
public ReduceSinkDesc(java.util.ArrayList<ExprNodeDesc> keyCols,
int numDistributionKeys,
java.util.ArrayList<ExprNodeDesc> valueCols,
java.util.ArrayList<java.lang.String> outputKeyColumnNames,
List<List<Integer>> distinctColumnIndices,
java.util.ArrayList<java.lang.String> outputValueColumnNames, int tag,
java.util.ArrayList<ExprNodeDesc> partitionCols, int numReducers,
final TableDesc keySerializeInfo, final TableDesc valueSerializeInfo) {
this.keyCols = keyCols; // 为reduceKeys,groupby
this.numDistributionKeys = numDistributionKeys; // grpByExprs.size()
this.valueCols = valueCols; //reduceValues,聚合函数
this.outputKeyColumnNames = outputKeyColumnNames; //outputKeyColumnNames
this.outputValueColumnNames = outputValueColumnNames; //outputValueColumnNames
this.tag = tag; // -1
this.numReducers = numReducers; // 一般都是-1
this.partitionCols = partitionCols; // groupby
this.keySerializeInfo = keySerializeInfo;
this.valueSerializeInfo = valueSerializeInfo;
this.distinctColumnIndices = distinctColumnIndices;
}
(2.5)GroupByOperator
处理groupby子句,getGroupByForClause,groupby的column加入groupByKeys和outputColumnNames
处理聚合函数,getAggregationExprsForClause,生成AggregationDesc加入aggregations,需要做聚合的column加入outputColumnNames
public GroupByDesc(
final Mode mode,
final java.util.ArrayList<java.lang.String> outputColumnNames,
final java.util.ArrayList<ExprNodeDesc> keys,
final java.util.ArrayList<org.apache.hadoop.hive.ql.plan.AggregationDesc> aggregators,
final boolean groupKeyNotReductionKey,float groupByMemoryUsage, float memoryThreshold) {
this(mode, outputColumnNames, keys, aggregators, groupKeyNotReductionKey,
false, groupByMemoryUsage, memoryThreshold);
}
mode:GroupByDesc.Mode.FINAL
outputColumnNames:groupby+Aggregation
keys:groupby
aggregators:Aggregation
groupKeyNotReductionKey:false
groupByMemoryUsage:默认为0.5
memoryThreshold:默认为0.9
(3)map端不部分聚合,数据倾斜,两个MR生成。
genGroupByPlan2MR:
(3.1)ReduceSinkOperator
处理groupby子句,getGroupByForClause,groupby的column加入reduceKeys和outputKeyColumnNames
处理select中的Distinct,getDistinctFuncExprsForClause,Distinct的column,加入reduceKeys和outputKeyColumnNames
处理聚合函数,getAggregationExprsForClause,需要做聚合的column加入reduceValues和outputValueColumnNames
public ReduceSinkDesc(java.util.ArrayList<ExprNodeDesc> keyCols,
int numDistributionKeys,
java.util.ArrayList<ExprNodeDesc> valueCols,
java.util.ArrayList<java.lang.String> outputKeyColumnNames,
List<List<Integer>> distinctColumnIndices,
java.util.ArrayList<java.lang.String> outputValueColumnNames, int tag,
java.util.ArrayList<ExprNodeDesc> partitionCols, int numReducers,
final TableDesc keySerializeInfo, final TableDesc valueSerializeInfo) {
this.keyCols = keyCols; // 为reduceKeys,groupby+distinct
this.numDistributionKeys = numDistributionKeys; // grpByExprs.size()
this.valueCols = valueCols; //reduceValues,聚合函数
this.outputKeyColumnNames = outputKeyColumnNames; //outputKeyColumnNames
this.outputValueColumnNames = outputValueColumnNames; //outputValueColumnNames
this.tag = tag; // -1
this.numReducers = numReducers; // 一般都是-1
this.partitionCols = partitionCols; // groupby
this.keySerializeInfo = keySerializeInfo;
this.valueSerializeInfo = valueSerializeInfo;
this.distinctColumnIndices = distinctColumnIndices;
}
(3.2)GroupByOperator
处理groupby子句,getGroupByForClause,groupby的column加入groupByKeys和outputColumnNames
处理聚合函数,getAggregationExprsForClause,生成AggregationDesc加入aggregations,生成column加入outputColumnNames
public GroupByDesc(
final Mode mode,
final java.util.ArrayList<java.lang.String> outputColumnNames,
final java.util.ArrayList<ExprNodeDesc> keys,
final java.util.ArrayList<org.apache.hadoop.hive.ql.plan.AggregationDesc> aggregators,
final boolean groupKeyNotReductionKey,float groupByMemoryUsage, float memoryThreshold) {
this(mode, outputColumnNames, keys, aggregators, groupKeyNotReductionKey,
false, groupByMemoryUsage, memoryThreshold);
}
mode:GroupByDesc.Mode.PARTIAL1
outputColumnNames:groupby+Aggregation
keys:groupby
aggregators:Aggregation
groupKeyNotReductionKey:false
groupByMemoryUsage:默认为0.5
memoryThreshold:默认为0.9
(3.3)ReduceSinkOperator
处理groupby子句,getGroupByForClause,groupby的column加入reduceKeys和outputColumnNames
处理聚合函数,getAggregationExprsForClause,需要做聚合的column加入reduceValues和outputColumnNames
public ReduceSinkDesc(java.util.ArrayList<ExprNodeDesc> keyCols,
int numDistributionKeys,
java.util.ArrayList<ExprNodeDesc> valueCols,
java.util.ArrayList<java.lang.String> outputKeyColumnNames,
List<List<Integer>> distinctColumnIndices,
java.util.ArrayList<java.lang.String> outputValueColumnNames, int tag,
java.util.ArrayList<ExprNodeDesc> partitionCols, int numReducers,
final TableDesc keySerializeInfo, final TableDesc valueSerializeInfo) {
this.keyCols = keyCols; // 为reduceKeys,groupby
this.numDistributionKeys = numDistributionKeys; // grpByExprs.size()
this.valueCols = valueCols; //reduceValues,聚合函数
this.outputKeyColumnNames = outputKeyColumnNames; //outputKeyColumnNames
this.outputValueColumnNames = outputValueColumnNames; //outputValueColumnNames
this.tag = tag; // -1
this.numReducers = numReducers; // 一般都是-1
this.partitionCols = partitionCols; // groupby
this.keySerializeInfo = keySerializeInfo;
this.valueSerializeInfo = valueSerializeInfo;
this.distinctColumnIndices = distinctColumnIndices;
}
(3.4)GroupByOperator
处理groupby子句,getGroupByForClause,groupby的column加入groupByKeys和outputColumnNames
处理聚合函数,getAggregationExprsForClause,生成AggregationDesc加入aggregations,需要做聚合的column加入outputColumnNames
public GroupByDesc(
final Mode mode,
final java.util.ArrayList<java.lang.String> outputColumnNames,
final java.util.ArrayList<ExprNodeDesc> keys,
final java.util.ArrayList<org.apache.hadoop.hive.ql.plan.AggregationDesc> aggregators,
final boolean groupKeyNotReductionKey,float groupByMemoryUsage, float memoryThreshold) {
this(mode, outputColumnNames, keys, aggregators, groupKeyNotReductionKey,
false, groupByMemoryUsage, memoryThreshold);
}
mode:GroupByDesc.Mode.FINAL
outputColumnNames:groupby+Aggregation
keys:groupby
aggregators:Aggregation
groupKeyNotReductionKey:false
groupByMemoryUsage:默认为0.5
memoryThreshold:默认为0.9
(4)map端不部分聚合,数据无倾斜,一个MR生成。
genGroupByPlan1MR:
(4.1)ReduceSinkOperator
处理groupby子句,getGroupByForClause,groupby的column加入reduceKeys和outputKeyColumnNames
处理select中的Distinct,getDistinctFuncExprsForClause,Distinct的column,加入reduceKeys和outputKeyColumnNames
处理聚合函数,getAggregationExprsForClause,需要做聚合的column加入reduceValues和outputValueColumnNames
public ReduceSinkDesc(java.util.ArrayList<ExprNodeDesc> keyCols,
int numDistributionKeys,
java.util.ArrayList<ExprNodeDesc> valueCols,
java.util.ArrayList<java.lang.String> outputKeyColumnNames,
List<List<Integer>> distinctColumnIndices,
java.util.ArrayList<java.lang.String> outputValueColumnNames, int tag,
java.util.ArrayList<ExprNodeDesc> partitionCols, int numReducers,
final TableDesc keySerializeInfo, final TableDesc valueSerializeInfo) {
this.keyCols = keyCols; // 为reduceKeys,groupby+distinct
this.numDistributionKeys = numDistributionKeys; // grpByExprs.size()
this.valueCols = valueCols; //reduceValues,聚合函数
this.outputKeyColumnNames = outputKeyColumnNames; //outputKeyColumnNames
this.outputValueColumnNames = outputValueColumnNames; //outputValueColumnNames
this.tag = tag; // -1
this.numReducers = numReducers; // 一般都是-1
this.partitionCols = partitionCols; // groupby
this.keySerializeInfo = keySerializeInfo;
this.valueSerializeInfo = valueSerializeInfo;
this.distinctColumnIndices = distinctColumnIndices;
}
(4.2)GroupByOperator
处理groupby子句,getGroupByForClause,groupby的column加入reduceKeys和outputKeyColumnNames
处理聚合函数,getAggregationExprsForClause,需要做聚合的column加入reduceValues和outputValueColumnNames
public GroupByDesc(
final Mode mode,
final java.util.ArrayList<java.lang.String> outputColumnNames,
final java.util.ArrayList<ExprNodeDesc> keys,
final java.util.ArrayList<org.apache.hadoop.hive.ql.plan.AggregationDesc> aggregators,
final boolean groupKeyNotReductionKey,float groupByMemoryUsage, float memoryThreshold) {
this(mode, outputColumnNames, keys, aggregators, groupKeyNotReductionKey,
false, groupByMemoryUsage, memoryThreshold);
}
mode:GroupByDesc.Mode.COMPLETE
outputColumnNames:groupby+Aggregation
keys:groupby
aggregators:Aggregation
groupKeyNotReductionKey:false
groupByMemoryUsage:默认为0.5
memoryThreshold:默认为0.9
SemanticAnalyzer.genBodyPlan
optimizeMultiGroupBy (multi-group by with the same distinct)
groupby10.q groupby11.q