Hive Source Code (5): OperatorTree Optimization

OperatorTree optimization (logical execution plan optimization)

Entry point
    Optimizer optm = new Optimizer();
    optm.setPctx(pCtx);
    optm.initialize(conf);
    // Every optimizer implements the transform method; this call runs them all in turn.
    // See transformations.add(...) in org.apache.hadoop.hive.ql.optimizer.Optimizer.
    pCtx = optm.optimize();
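Conceptually, optimize() just runs the registered transformations in order, threading the ParseContext through each one. A minimal sketch of that loop, with ParseContext stood in by a String and Transform mirroring the single-method shape of Hive's org.apache.hadoop.hive.ql.optimizer.Transform interface:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of how Optimizer.optimize() chains its transformations.
// The real code passes a ParseContext; a String stands in for it here.
public class OptimizerSketch {
    interface Transform {
        String transform(String pctx);   // Hive: ParseContext transform(ParseContext)
    }

    static final List<Transform> transformations = new ArrayList<>();

    static String optimize(String pctx) {
        // Each optimizer rewrites the plan and hands the context to the next one.
        for (Transform t : transformations) {
            pctx = t.transform(pctx);
        }
        return pctx;
    }

    public static void main(String[] args) {
        transformations.add(p -> p + " +PointLookup");
        transformations.add(p -> p + " +PPD");
        System.out.println(optimize("plan"));   // plan +PointLookup +PPD
    }
}
```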

The OperatorTree here looks much like the sample in the previous post, and the basic structure is easy to read off. The optimizations SQL developers rely on daily, such as partition pruning, column pruning, and predicate pushdown, all happen at this stage; let's go through them in detail below.

Operator name abbreviations
    TableScanOperator(TS)
    SelectOperator(SEL)
    FilterOperator(FIL)
    GroupByOperator(GBY)
    ReduceSinkOperator(RS)
    JoinOperator(JOIN)
    MapJoinOperator(MAPJOIN)
    ScriptOperator(SCR)
    UDTFOperator(UDTF)
    UnionOperator(UNION)
    FileSinkOperator(FS)
    LimitOperator(LIM)
Optimizer rule sample
    // The rules below match specific node patterns; if the tree does not have the matching shape, the optimization is skipped.
MAPJOIN[6]   (the indentation of this dump looks a bit off)
UNION[4]
UNION[5]
FS[3]
RS[2]
TS[0]
TS[1]
    Map<Rule, NodeProcessor> opRules = new LinkedHashMap<Rule, NodeProcessor>();
opRules.put(new RuleRegExp("R1", TableScanOperator.getOperatorName() + "%"), new GenMRTableScan1());
opRules.put(new RuleRegExp("R2", TableScanOperator.getOperatorName() + "%.*" + ReduceSinkOperator.getOperatorName() + "%"), new GenMRRedSink1());
opRules.put(new RuleRegExp("R3", ReduceSinkOperator.getOperatorName() + "%.*" + ReduceSinkOperator.getOperatorName() + "%"), new GenMRRedSink2());
opRules.put(new RuleRegExp("R4", FileSinkOperator.getOperatorName() + "%"), new GenMRFileSink1());
opRules.put(new RuleRegExp("R5", UnionOperator.getOperatorName() + "%"), new GenMRUnion1());
opRules.put(new RuleRegExp("R6", UnionOperator.getOperatorName() + "%.*" + ReduceSinkOperator.getOperatorName() + "%"), new GenMRRedSink3());
opRules.put(new RuleRegExp("R7", MapJoinOperator.getOperatorName() + "%"), MapJoinFactory.getTableScanMapJoin());

Kicking off the walk (disp is the Dispatcher built from the opRules above):
GraphWalker ogw = new GenMapRedWalker(disp);
ArrayList<Node> topNodes = new ArrayList<Node>();
topNodes.addAll(pCtx.getTopOps().values());
ogw.startWalking(topNodes, null);
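Each RuleRegExp is matched against the operator names concatenated along the current walk path; when a rule fires, its NodeProcessor runs. The shape of the matching can be illustrated with plain java.util.regex (Hive's RuleRegExp also computes a match cost to pick the best rule; that part is omitted here):

```java
import java.util.regex.Pattern;

public class RuleMatchSketch {
    // A walk path is serialized as concatenated "NAME%" tokens, e.g. "TS%SEL%RS%".
    static boolean matches(String ruleRegex, String pathNames) {
        // A rule fires when its pattern occurs somewhere in the path string.
        return Pattern.compile(ruleRegex).matcher(pathNames).find();
    }

    public static void main(String[] args) {
        String r2 = "TS%.*RS%";                        // rule R2 from the table above
        System.out.println(matches(r2, "TS%SEL%RS%")); // true: a TS with an RS below it
        System.out.println(matches(r2, "SEL%RS%"));    // false: no TS on the path
    }
}
```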
PointLookupOptimizer
Trigger conditions: 1. hive.optimize.point.lookup=true; 2. the number of ORed branches reaches hive.optimize.point.lookup.min; 3. CBO failed or is disabled
What it does: rewrites a chain of ORed equality conditions in the SQL into a single IN; the OR-branch threshold is configurable
SQL to run:
set hive.optimize.point.lookup.min=5;  -- threshold for rewriting OR into IN (default 31)
select id, user_id from ( select id from data group by id ) a inner join ( select user_id from data group by user_id ) b on a.id = b.user_id
where (id=1 or id=2 or id=3 or id=4 or id=5 or id=6 or id=7);
Before optimization (the GenericUDFOPOr in FIL[17] is about to become a GenericUDFIn):
FS[19]
    SEL[18]
        FIL[17] GenericUDFOPOr(GenericUDFOPEqual(Column[_col0], Const bigint 1), GenericUDFOPEqual(Column[_col0], Const bigint 2), GenericUDFOPEqual(Column[_col0], Const bigint 3), GenericUDFOPEqual(Column[_col0], Const bigint 4), GenericUDFOPEqual(Column[_col0], Const bigint 5), GenericUDFOPEqual(Column[_col0], Const bigint 6), GenericUDFOPEqual(Column[_col0], Const bigint 7))
            JOIN[16]  outputColumnNames _col0 aliasToOpInfo a |   _col1  aliasToOpInfo  b
                RS[13]  valueCols Column[_col0]
                    FIL[12]  GenericUDFOPNotNull(GenericUDFBridge ==> UDFToDouble (Column[_col0]))
                        SEL[5]  _col0
                            GBY[4]  KEY._col0
                                RS[3]  _col0
                                    GBY[2]  Column[id]
                                        SEL[1]  Column[id]  Column[user_id]  Column[pending_reward]  Column[description]  Column[BLOCK__OFFSET__INSIDE__FILE]  Column[INPUT__FILE__NAME]  Column[ROW__ID]
                                            TS[0]  dbName=default tableName=data BLOCKOFFSET FILENAME ROWID
                RS[15]  Column[_col0]
                    FIL[14]  GenericUDFOPNotNull(GenericUDFBridge ==> UDFToDouble (Column[_col0]))
                        SEL[11]  _col0
                            GBY[10]  KEY._col0
                                RS[9]  _col0
                                    GBY[8]  Column[user_id]
                                        SEL[7]  Column[id]  Column[user_id]  Column[pending_reward]  Column[description]  Column[BLOCK__OFFSET__INSIDE__FILE]  Column[INPUT__FILE__NAME]  Column[ROW__ID]
                                            TS[6]  dbName=default tableName=data BLOCKOFFSET FILENAME ROWID
After optimization:
FS[19]
    SEL[18]
        FIL[17] GenericUDFIn(Column[_col0], Const bigint 1, Const bigint 2, Const bigint 3, Const bigint 4, Const bigint 5, Const bigint 6, Const bigint 7)
            JOIN[16]  outputColumnNames _col0 aliasToOpInfo a |   _col1  aliasToOpInfo  b
                RS[13]  valueCols Column[_col0]
                    FIL[12]  GenericUDFOPNotNull(GenericUDFBridge ==> UDFToDouble (Column[_col0]))
                        SEL[5]  _col0
                            GBY[4]  KEY._col0
                                RS[3]  _col0
                                    GBY[2]  Column[id]
                                        SEL[1]  Column[id]  Column[user_id]  Column[pending_reward]  Column[description]  Column[BLOCK__OFFSET__INSIDE__FILE]  Column[INPUT__FILE__NAME]  Column[ROW__ID]
                                            TS[0]  dbName=default tableName=data BLOCKOFFSET FILENAME ROWID
                RS[15]  Column[_col0]
                    FIL[14]  GenericUDFOPNotNull(GenericUDFBridge ==> UDFToDouble (Column[_col0]))
                        SEL[11]  _col0
                            GBY[10]  KEY._col0
                                RS[9]  _col0
                                    GBY[8]  Column[user_id]
                                        SEL[7]  Column[id]  Column[user_id]  Column[pending_reward]  Column[description]  Column[BLOCK__OFFSET__INSIDE__FILE]  Column[INPUT__FILE__NAME]  Column[ROW__ID]
                                            TS[6]  dbName=default tableName=data BLOCKOFFSET FILENAME ROWID
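The rewrite itself is mechanical: when a disjunction consists solely of equality comparisons against the same column and the branch count reaches the threshold, it collapses into one IN. A toy version over strings (the real optimizer transforms ExprNodeDesc trees; all names here are illustrative):

```java
import java.util.List;
import java.util.stream.Collectors;

public class PointLookupSketch {
    // Collapse (col = v1 or col = v2 or ...) into col in (v1, v2, ...) once the
    // branch count reaches minBranches (hive.optimize.point.lookup.min).
    static String orToIn(String col, List<String> values, int minBranches) {
        if (values.size() < minBranches) {
            return values.stream()
                    .map(v -> col + " = " + v)
                    .collect(Collectors.joining(" or "));
        }
        return col + " in (" + String.join(", ", values) + ")";
    }

    public static void main(String[] args) {
        System.out.println(orToIn("id", List.of("1", "2", "3"), 2));
        // id in (1, 2, 3)
    }
}
```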
PartitionColumnsSeparator (the first time I have seen struct used in a where clause)
Trigger condition: 1. hive.optimize.partition.columns.separate=true
What it does: extracts the partition-column part of the where condition into its own predicate
SQL to run:
select id, dt from data a where struct(a.dt, a.id) in (struct('20220801','1'), struct('20220802','2'));
Before optimization:
FS[3]
    SEL[2] Column[id] Column[dt]
        FIL[1] GenericUDFIn(GenericUDFStruct(Column[dt], Column[id]), Const struct<col1:string,col2:string> [202208011], Const struct<col1:string,col2:string> [202208022])
            TS[0]   dbName=default tableName=data BLOCKOFFSET FILENAME ROWID
After optimization (the partition condition has been extracted into its own IN):
FS[3]
    SEL[2] Column[id] Column[dt]
        FIL[1] GenericUDFOPAnd(GenericUDFIn(GenericUDFStruct(Column[dt], Column[id]), Const struct<col1:string,col2:string> [202208011], Const struct<col1:string,col2:string> [202208022]), GenericUDFIn(GenericUDFStruct(Column[dt]), GenericUDFStruct(Const string 20220801), GenericUDFStruct(Const string 20220802)))
            TS[0]   dbName=default tableName=data BLOCKOFFSET FILENAME ROWID
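The derived predicate is weaker but partition-only: from struct(dt, id) IN ((d1, i1), (d2, i2)) it always follows that dt IN (d1, d2), and that conjunct is what the partition pruner can later use. A sketch of the projection step (illustrative; the real optimizer manipulates ExprNodeDesc trees):

```java
import java.util.List;
import java.util.stream.Collectors;

public class PartitionSeparatorSketch {
    // From the rows of a struct IN list, keep only the component at the
    // partition column's index, yielding the values for dt IN (...).
    static List<String> partitionValues(List<List<String>> structRows, int partIdx) {
        return structRows.stream()
                .map(row -> row.get(partIdx))
                .distinct()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<List<String>> rows = List.of(
                List.of("20220801", "1"),
                List.of("20220802", "2"));
        System.out.println(partitionValues(rows, 0)); // [20220801, 20220802]
    }
}
```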
PredicateTransitivePropagate
Trigger conditions: 1. hive.optimize.ppd=true; 2. CBO failed or is disabled
What it does: propagates a filter condition in the join clause onto the other side of the join
SQL to run:
set hive.cbo.enable=false;  -- disable CBO
select a.id, b.id from ( select id from data group by id ) a inner join ( select id from data_hr group by id ) b on a.id = b.id and b.id <= 5;
Before optimization:
FS[18]
    SEL[17] Column[_col0] Column[_col1]
        JOIN[16]  outputColumnNames _col0 aliasToOpInfo a |   _col1  aliasToOpInfo  b
            RS[14]  valueCols Column[_col0]
                FIL[13]  GenericUDFOPNotNull(GenericUDFBridge ==> UDFToDouble (Column[_col0]))
                    SEL[5]  _col0
                        GBY[4]  KEY._col0
                            RS[3]  _col0
                                GBY[2]  Column[id]
                                    SEL[1]  Column[id]  Column[user_id]  Column[pending_reward]  Column[description]  Column[BLOCK__OFFSET__INSIDE__FILE]  Column[INPUT__FILE__NAME]  Column[ROW__ID]
                                        TS[0]  dbName=default tableName=data BLOCKOFFSET FILENAME ROWID
            RS[15]  Column[_col0]
                FIL[12]  GenericUDFOPEqualOrLessThan(Column[_col0], Const bigint 5)
                    SEL[11]  _col0
                        GBY[10]  KEY._col0
                            RS[9]  _col0
                                GBY[8]  Column[user_id]
                                    SEL[7]  Column[id]  Column[user_id]  Column[pending_reward]  Column[description]  Column[BLOCK__OFFSET__INSIDE__FILE]  Column[INPUT__FILE__NAME]  Column[ROW__ID]
                                        TS[6]  dbName=default tableName=data BLOCKOFFSET FILENAME ROWID
After optimization (both FIL[13] and FIL[12] changed; each picked up the propagated condition):
FS[18]
    SEL[17] Column[_col0] Column[_col1]
        JOIN[16]  outputColumnNames _col0 aliasToOpInfo a |   _col1  aliasToOpInfo  b
            RS[14]  valueCols Column[_col0]
                FIL[13]  GenericUDFOPAnd(GenericUDFOPEqualOrLessThan(Column[_col0], Const bigint 5), GenericUDFOPNotNull(Column[_col0]))
                    SEL[5]  _col0
                        GBY[4]  KEY._col0
                            RS[3]  _col0
                                GBY[2]  Column[id]
                                    SEL[1]  Column[id]  Column[user_id]  Column[pending_reward]  Column[description]  Column[BLOCK__OFFSET__INSIDE__FILE]  Column[INPUT__FILE__NAME]  Column[ROW__ID]
                                        TS[0]  dbName=default tableName=data BLOCKOFFSET FILENAME ROWID
            RS[15]  Column[_col0]
                FIL[12]  GenericUDFOPAnd(GenericUDFOPNotNull(Column[_col0]), GenericUDFOPEqualOrLessThan(Column[_col0], Const bigint 5))
                    SEL[11]  _col0
                        GBY[10]  KEY._col0
                            RS[9]  _col0
                                GBY[8]  Column[user_id]
                                    SEL[7]  Column[id]  Column[user_id]  Column[pending_reward]  Column[description]  Column[BLOCK__OFFSET__INSIDE__FILE]  Column[INPUT__FILE__NAME]  Column[ROW__ID]
                                        TS[6]  dbName=default tableName=data BLOCKOFFSET FILENAME ROWID
ConstantPropagate
Trigger conditions: 1. hive.optimize.constant.propagation=true; 2. CBO failed or is disabled
What it does: evaluates constant expressions down to their final value
SQL to run:
set hive.cbo.enable=false;  -- disable CBO
select * from (select user_id from data where id = 2 + 1) t;
Before optimization:
FS[4]
    SEL[3] Column[_col0]
        SEL[2] Column[user_id]
            FIL[1] GenericUDFOPEqual(Column[id], GenericUDFOPPlus(Const int 2, Const int 1))
                TS[0]  dbName=default tableName=data BLOCKOFFSET FILENAME ROWID
After optimization (the GenericUDFOPPlus has been folded into a constant):
FS[4]
    SEL[3] Column[_col0]
        SEL[2] Column[user_id]
            FIL[1] GenericUDFOPEqual(Column[id], Const int 3)
                TS[0]  dbName=default tableName=data BLOCKOFFSET FILENAME ROWID
set hive.cbo.enable=false;  -- disable CBO
select * from (select user_id,'abcd' as abcd from data )t;
Before optimization:
FS[3]
    SEL[2] Column[_col0] Column[_col1]
        SEL[1] Column[user_id],Const string abcd
            TS[0]  dbName=default tableName=data BLOCKOFFSET FILENAME ROWID
After optimization (the constant string is propagated up into SEL[2]):
FS[3]
    SEL[2] Column[_col0] Const string abcd
        SEL[1] Column[user_id],Const string abcd
            TS[0]  dbName=default tableName=data BLOCKOFFSET FILENAME ROWID
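The folding step is simple: when every child of an expression node is a constant, evaluate the node and replace the subtree with the result. A toy version over a tiny expression type (Hive does this on ExprNodeDesc trees, evaluating a GenericUDF once all children are ExprNodeConstantDesc; requires Java 17 for sealed types):

```java
public class ConstantFoldSketch {
    // Minimal expression tree: a node is either a constant or a binary plus.
    sealed interface Expr permits Const, Plus {}
    record Const(int value) implements Expr {}
    record Plus(Expr left, Expr right) implements Expr {}

    static Expr fold(Expr e) {
        if (e instanceof Plus p) {
            Expr l = fold(p.left());
            Expr r = fold(p.right());
            if (l instanceof Const cl && r instanceof Const cr) {
                return new Const(cl.value() + cr.value()); // fully constant: evaluate now
            }
            return new Plus(l, r);
        }
        return e;
    }

    public static void main(String[] args) {
        // id = 2 + 1 becomes id = 3: the plus node folds to Const[value=3]
        System.out.println(fold(new Plus(new Const(2), new Const(1))));
    }
}
```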
SyntheticJoinPredicate
Trigger conditions: 1. hive.optimize.ppd=true; 2. execution engine is Spark or Tez
What it does: injects a synthetic IN predicate referencing the other join input, so later phases such as dynamic partition pruning can use the subquery result; a sensible optimization, since SQL written this way is asking for exactly that
SQL to run:
select * from data_par where dt = (select max(dt) from data_par_hr);
Before optimization:
FS[15]
    SEL[14] Column[_col0] Column[_col1] Column[_col2] Column[_col3]
        JOIN[13]
            RS[11]
                SEL[2] Column[id] Column[user_id] Column[desc]
                    FIL[1] GenericUDFOPNotNull(Column[dt])
                        TS[0] dbName=default tableName=data_par BLOCKOFFSET FILENAME ROWID
            RS[12]
                SEL[10] Column[_col0]
                    FIL[9] GenericUDFOPNotNull(Column[_col0])
                        SEL[8] Column[_col0]
                            GBY[7]
                                RS[6]
                                    GBY[5]
                                        SEL[4] Column[id]  Column[user_id]  Column[desc] Column[dt] Column[BLOCK__OFFSET__INSIDE__FILE]  Column[INPUT__FILE__NAME]  Column[ROW__ID]
                                            TS[3] dbName=default tableName=data_par_hr BLOCKOFFSET FILENAME ROWID
After optimization (a new FIL[16] appears):
FS[15]
    SEL[14] Column[_col0] Column[_col1] Column[_col2] Column[_col3]
        JOIN[13]
            RS[11]
                FIL[16] GenericUDFIn(Column[_col3], RS[12])
                    SEL[2] Column[id] Column[user_id] Column[desc]
                        FIL[1] GenericUDFOPNotNull(Column[dt])
                            TS[0] dbName=default tableName=data_par BLOCKOFFSET FILENAME ROWID
            RS[12]
                SEL[10] Column[_col0]
                    FIL[9] GenericUDFOPNotNull(Column[_col0])
                        SEL[8] Column[_col0]
                            GBY[7]
                                RS[6]
                                    GBY[5]
                                        SEL[4] Column[id]  Column[user_id]  Column[desc] Column[dt] Column[BLOCK__OFFSET__INSIDE__FILE]  Column[INPUT__FILE__NAME]  Column[ROW__ID]
                                            TS[3] dbName=default tableName=data_par_hr BLOCKOFFSET FILENAME ROWID
PredicatePushDown
Trigger conditions: 1. hive.optimize.ppd=true; 2. CBO failed or is disabled
What it does: pushes the outermost where condition in the SQL down as deep as possible
SQL to run:
select id, user_id from ( select id from data group by id ) a inner join ( select user_id from data group by user_id ) b on a.id = b.user_id where id = '1' or id = '2';
Before optimization:
FS[19]
    SEL[18]
        FIL[17] GenericUDFOPOr(GenericUDFOPEqual(Column[_col0], Const bigint 1), GenericUDFOPEqual(Column[_col0], Const bigint 2))
            JOIN[16]  outputColumnNames _col0 aliasToOpInfo a |   _col1  aliasToOpInfo  b
                RS[13]  valueCols Column[_col0]
                    FIL[12]  GenericUDFOPNotNull(GenericUDFBridge ==> UDFToDouble (Column[_col0]))
                        SEL[5]  _col0
                            GBY[4]  KEY._col0
                                RS[3]  _col0
                                    GBY[2]  Column[id]
                                        SEL[1]  Column[id]  Column[user_id]  Column[pending_reward]  Column[description]  Column[BLOCK__OFFSET__INSIDE__FILE]  Column[INPUT__FILE__NAME]  Column[ROW__ID]
                                            TS[0]  dbName=default tableName=data BLOCKOFFSET FILENAME ROWID
                RS[15]  Column[_col0]
                    FIL[14]  GenericUDFOPNotNull(GenericUDFBridge ==> UDFToDouble (Column[_col0]))
                        SEL[11]  _col0
                            GBY[10]  KEY._col0
                                RS[9]  _col0
                                    GBY[8]  Column[user_id]
                                        SEL[7]  Column[id]  Column[user_id]  Column[pending_reward]  Column[description]  Column[BLOCK__OFFSET__INSIDE__FILE]  Column[INPUT__FILE__NAME]  Column[ROW__ID]
                                            TS[6]  dbName=default tableName=data BLOCKOFFSET FILENAME ROWID
After optimization (new FIL[20] and FIL[21] sit directly above the table scans, and the old FIL[17]/FIL[12]/FIL[14] are gone):
FS[19]
    SEL[18]
            JOIN[16]  outputColumnNames _col0 aliasToOpInfo a |   _col1  aliasToOpInfo  b
                RS[13]  valueCols Column[_col0]
                        SEL[5]  _col0
                            GBY[4]  KEY._col0
                                RS[3]  _col0
                                    GBY[2]  Column[id]
                                        SEL[1]  Column[id]  Column[user_id]  Column[pending_reward]  Column[description]  Column[BLOCK__OFFSET__INSIDE__FILE]  Column[INPUT__FILE__NAME]  Column[ROW__ID]
                                            FIL[20]  GenericUDFOPAnd(GenericUDFOPNotNull(GenericUDFBridge ==> UDFToDouble (Column[id])), GenericUDFOPOr(GenericUDFOPEqual(Column[id], Const bigint 1), GenericUDFOPEqual(Column[id], Const bigint 2)))
                                                TS[0]  dbName=default tableName=data BLOCKOFFSET FILENAME ROWID
                RS[15]  Column[_col0]
                        SEL[11]  _col0
                            GBY[10]  KEY._col0
                                RS[9]  _col0
                                    GBY[8]  Column[user_id]
                                        SEL[7]  Column[id]  Column[user_id]  Column[pending_reward]  Column[description]  Column[BLOCK__OFFSET__INSIDE__FILE]  Column[INPUT__FILE__NAME]  Column[ROW__ID]
                                            FIL[21] GenericUDFOPNotNull(GenericUDFBridge ==> UDFToDouble (Column[user_id]))
                                                TS[6]  dbName=default tableName=data BLOCKOFFSET FILENAME ROWID
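Note that the pushed predicate was also rewritten from internal names to source columns: Column[_col0] in FIL[17] became Column[id] in FIL[20]. As a filter moves below a SELECT, its column references are mapped through that SELECT's projection. A toy version of the substitution (illustrative strings, not Hive's ExprWalkerProcFactory):

```java
import java.util.Map;

public class PushdownRenameSketch {
    // When a predicate is pushed below a projection, every internal column
    // reference is rewritten to the projection's source column, e.g. _col0 -> id.
    static String remap(String predicate, Map<String, String> colMapping) {
        for (Map.Entry<String, String> e : colMapping.entrySet()) {
            predicate = predicate.replace(e.getKey(), e.getValue());
        }
        return predicate;
    }

    public static void main(String[] args) {
        System.out.println(remap("(_col0 = 1) or (_col0 = 2)", Map.of("_col0", "id")));
        // (id = 1) or (id = 2)
    }
}
```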
PartitionPruner
Trigger condition: 1. hive.optimize.ppd=true
What it does: partition pruning
SQL to run:
select user_id, dt from data_par_hr where dt = 20220801;
After optimization, only a property inside the ParseContext changes:
opToPartPruner  {TableScanOperator@9981 "TS[0]" -> ExprNodeGenericFuncDesc@10354 "GenericUDFOPEqual(Column[dt], Const int 20220801)"}
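Conceptually, the pruner then evaluates that stored predicate against each candidate partition and keeps only the survivors; a minimal sketch with partition specs simplified to strings:

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class PartitionPrunerSketch {
    // Keep only the partitions whose spec satisfies the pruning predicate
    // (GenericUDFOPEqual(Column[dt], Const int 20220801) in the dump above).
    static List<String> prune(List<String> partitions, Predicate<String> pred) {
        return partitions.stream().filter(pred).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> parts = List.of("dt=20220731", "dt=20220801", "dt=20220802");
        System.out.println(prune(parts, p -> p.equals("dt=20220801")));
        // [dt=20220801]
    }
}
```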
PartitionConditionRemover
Trigger condition: 1. hive.optimize.ppd=true
What it does: lifts the partition condition out of the operator tree
SQL to run:
select user_id, dt from data_par_hr where dt = 20220801 and dt in ('20220801','20220802');
After optimization, a ParseContext property changes (much like the property above):
opToPartList    default.data_par_hr; ((dt) IN ('20220801', '20220802') and (dt = '20220801'))
Before optimization:
FS[3]
    SEL[2] Column[user_id] Column[dt]
        FIL[4] GenericUDFOPAnd(GenericUDFIn(Column[dt], Const string 20220801, Const string 20220802), GenericUDFOPEqual(GenericUDFBridge ==> UDFToDouble (Column[dt]), Const double 2.0220801E7))
            TS[0]  dbName=default tableName=data_par_hr BLOCKOFFSET FILENAME ROWID
After optimization (the FIL[4] node is removed):
FS[3]
    SEL[2] Column[user_id] Column[dt]
        TS[0]  dbName=default tableName=data_par_hr BLOCKOFFSET FILENAME ROWID
ColumnPruner
Trigger condition:
What it does: column pruning
SQL to run:
select a.user_id, b.id from ( select * from data ) a inner join ( select * from data ) b on a.id = b.user_id;
Before optimization:
FS[10]
    SEL[9] Column[_col1] Column[_col3]
        JOIN[8]
            RS[5]  GenericUDFBridge ==> UDFToDouble (Column[_col0])
                SEL[1] Column[id] Column[user_id] Column[desc]
                    FIL[11]  GenericUDFOPNotNull(GenericUDFBridge ==> UDFToDouble (Column[id]))
                        TS[0]  dbName=default tableName=data BLOCKOFFSET FILENAME ROWID
            RS[7]  GenericUDFBridge ==> UDFToDouble (Column[_col1])
                SEL[3]  Column[id] Column[user_id] Column[desc]
                    FIL[12]  GenericUDFOPNotNull(GenericUDFBridge ==> UDFToDouble (Column[user_id]))
                        TS[2]  dbName=default tableName=data BLOCKOFFSET FILENAME ROWID
After optimization (Column[desc] has been pruned):
FS[10]
    SEL[9] Column[_col1] Column[_col3]
        JOIN[8]
            RS[5]  GenericUDFBridge ==> UDFToDouble (Column[_col0])
                SEL[1] Column[id] Column[user_id] 
                    FIL[11]  GenericUDFOPNotNull(GenericUDFBridge ==> UDFToDouble (Column[id]))
                        TS[0]  dbName=default tableName=data BLOCKOFFSET FILENAME ROWID
            RS[7]  GenericUDFBridge ==> UDFToDouble (Column[_col1])
                SEL[3]  Column[id] Column[user_id] 
                    FIL[12]  GenericUDFOPNotNull(GenericUDFBridge ==> UDFToDouble (Column[user_id]))
                        TS[2]  dbName=default tableName=data BLOCKOFFSET FILENAME ROWID
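The pruning is a bottom-up "needed columns" computation: each SELECT only has to emit the columns its parent consumes, which is why desc disappears from SEL[1] and SEL[3] once the join output is known to need only id and user_id. A toy version (illustrative, not Hive's ColumnPrunerProcFactory):

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class ColumnPrunerSketch {
    // Keep only the columns that some downstream operator actually consumes.
    static List<String> prune(List<String> selCols, Set<String> needed) {
        return selCols.stream().filter(needed::contains).toList();
    }

    public static void main(String[] args) {
        Set<String> needed = new LinkedHashSet<>(List.of("id", "user_id"));
        System.out.println(prune(List.of("id", "user_id", "desc"), needed));
        // [id, user_id]
    }
}
```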
NonBlockingOpDeDupProc
Trigger condition: back-to-back SEL (or FIL) nodes; stacked SEL nodes are the easiest case to reproduce
What it does: merges adjacent SEL nodes and adjacent FIL nodes
SQL to run:
set hive.cbo.enable=false;  -- disable CBO
select * from (select * from data where id=1 union all select * from data_hr where id=2) t where id=1;
Before optimization:
FS[7]
    SEL[6] Column[_col0] Column[_col1] Column[_col2]
        SEL[5] Column[_col0] Column[_col1] Column[_col2]
            UNION[4] 
                SEL[1] Column[id] Column[user_id] Column[desc]
                    TS[0]  dbName=default tableName=data BLOCKOFFSET FILENAME ROWID
                SEL[3] Column[id] Column[user_id] Column[desc]
                    TS[2]  dbName=default tableName=data_hr BLOCKOFFSET FILENAME ROWID
After optimization (the SEL[6] node is removed):
FS[7]
    SEL[5] Column[_col0] Column[_col1] Column[_col2]
        UNION[4] 
            SEL[1] Column[id] Column[user_id] Column[desc]
                TS[0]  dbName=default tableName=data BLOCKOFFSET FILENAME ROWID
            SEL[3] Column[id] Column[user_id] Column[desc]
                TS[2]  dbName=default tableName=data_hr BLOCKOFFSET FILENAME ROWID
IdentityProjectRemover
Trigger conditions: hive.optimize.remove.identity.project=true; hive.cbo.returnpath.hiveop=false
What it does: removes SEL nodes that perform no column pruning (identity projections)
SQL to run:
select * from ( select * from data ) a inner join ( select * from data ) b on a.id = b.user_id;
Before optimization:
FS[10]
    SEL[9] Column[_col0] Column[_col1] Column[_col2] Column[_col3] Column[_col4] Column[_col5]
        JOIN[8]
            RS[5]  GenericUDFBridge ==> UDFToDouble (Column[_col0])
                SEL[1] Column[id] Column[user_id] Column[desc]
                    FIL[11]  GenericUDFOPNotNull(GenericUDFBridge ==> UDFToDouble (Column[id]))
                        TS[0]  dbName=default tableName=data BLOCKOFFSET FILENAME ROWID
            RS[7]  GenericUDFBridge ==> UDFToDouble (Column[_col1])
                SEL[3]  Column[id] Column[user_id] Column[desc]
                    FIL[12]  GenericUDFOPNotNull(GenericUDFBridge ==> UDFToDouble (Column[user_id]))
                        TS[2]  dbName=default tableName=data BLOCKOFFSET FILENAME ROWID
After optimization (the pointless SEL[9] is removed):
FS[10]
        JOIN[8]
            RS[5]  GenericUDFBridge ==> UDFToDouble (Column[_col0])
                SEL[1] Column[id] Column[user_id] Column[desc]
                    FIL[11]  GenericUDFOPNotNull(GenericUDFBridge ==> UDFToDouble (Column[id]))
                        TS[0]  dbName=default tableName=data BLOCKOFFSET FILENAME ROWID
            RS[7]  GenericUDFBridge ==> UDFToDouble (Column[_col1])
                SEL[3]  Column[id] Column[user_id] Column[desc]
                    FIL[12]  GenericUDFOPNotNull(GenericUDFBridge ==> UDFToDouble (Column[user_id]))
                        TS[2]  dbName=default tableName=data BLOCKOFFSET FILENAME ROWID
LimitPushdownOptimizer
Trigger condition: 1. hive.limit.pushdown.memory.usage > 0
What it does: pushes the limit down the tree
SQL to run:
select id from data group by id limit 3;
Before optimization:
FS[7]
    LIM[6]
        GBY[4]
            RS[3]  GenericUDFBridge ==> UDFToDouble (Column[_col0]) 
                GBY[2]
                    SEL[1] Column[id]
                        TS[0]  dbName=default tableName=data BLOCKOFFSET FILENAME ROWID
After optimization (RS[3] gains a TopN value, limiting each reducer's output):
FS[7]
    LIM[6]
        GBY[4]
            RS[3]  GenericUDFBridge ==> UDFToDouble (Column[_col0])  TopN=3
                GBY[2]
                    SEL[1] Column[id]
                        TS[0]  dbName=default tableName=data BLOCKOFFSET FILENAME ROWID
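The TopN=3 annotation tells each ReduceSink to ship at most its 3 smallest keys instead of its full output; the standard way to do that is a bounded heap. A sketch of the idea (not Hive's actual TopNHash implementation):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class TopNSketch {
    // Keep the n smallest keys seen so far using a size-bounded max-heap.
    static List<Long> topN(List<Long> keys, int n) {
        PriorityQueue<Long> heap = new PriorityQueue<>(Comparator.reverseOrder());
        for (long k : keys) {
            heap.offer(k);
            if (heap.size() > n) {
                heap.poll();          // evict the current largest key
            }
        }
        List<Long> out = new ArrayList<>(heap);
        out.sort(null);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(topN(List.of(7L, 3L, 9L, 1L, 5L), 3)); // [1, 3, 5]
    }
}
```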
StatsOptimizer
Trigger condition: 1. hive.compute.query.using.stats=true
What it does: answers the query from statistics in the metastore instead of scanning data
SQL to run:
select max(id) from data;
The corresponding data in the MySQL metastore (possibly a Hive quirk: only tables populated via INSERT have statistics; tables filled with LOAD have none):
select * from TAB_COL_STATS;
SimpleFetchOptimizer
Trigger conditions: 1. hive.fetch.task.conversion != none; 2. no join, group by, distinct, subquery, or CTAS
What it does: fetches the result directly instead of launching a job
SQL to run:
select id from data limit 3;
Before optimization:
FS[3]
    LIM[2]
        SEL[1] Column[id]
            TS[0]  dbName=default tableName=data BLOCKOFFSET FILENAME ROWID
After optimization (the FileSink becomes a ListSink):
LIST_SINK[4]
    LIM[2]
        SEL[1] Column[id]
            TS[0]  dbName=default tableName=data BLOCKOFFSET FILENAME ROWID


  The dumps above only show each optimizer's effect as captured in the debugger. If you really want to understand how the optimizer works, I recommend the series "Hive优化器原理与源码解析" (Hive optimizer internals and source code analysis): https://cloud.tencent.com/developer/column/95255

posted @ 2022-08-09 08:48  Kotlin