[Tested and working] Hive SQL DML optimization approaches and Hive table query optimization: optimize your Hive jobs, all you need, continuously updated

Hive table optimization

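Joining a small table with a large table
Put the table whose keys are relatively dispersed and whose data volume is small on the left side of the join; this effectively reduces the chance of out-of-memory errors. Going a step further, use a map join so that small dimension tables (fewer than 1,000 records) are loaded into memory first and the join completes on the map side, without a reduce phase.
In practice, testing shows that newer versions of Hive already optimize both small-table-JOIN-large-table and large-table-JOIN-small-table, so placing the small table on the left or on the right no longer makes a noticeable difference.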
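Example

Data preparation

(Figure: the small table and the large table)
Statements to create the large table, the small table, and the table that stores the join result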

-- Create the large table
create table bigtable(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';
-- Create the small table
create table smalltable(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';
-- Create the table that stores the join result
create table jointable(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';

Load the data

load data local inpath '/opt/module/datas/bigtable' into table bigtable;
load data local inpath '/opt/module/datas/smalltable' into table smalltable;

Disable the automatic map join feature (it is enabled by default)
set hive.auto.convert.join = false;

For reference, the default entry in the configuration file is:
<property>
    <name>hive.auto.convert.join</name>
    <value>true</value>
    <description>Whether Hive enables the optimization about converting common join into mapjoin based on the input file size</description>
</property>
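
Before timing the two joins, it can help to confirm which value is actually in effect for the current session. A minimal sketch, relying on the Hive CLI and Beeline behaviour that `set` followed by a property name and no value prints the current setting:

```sql
-- Print the value currently in effect for this session
set hive.auto.convert.join;

-- Turn automatic conversion off for the test, then confirm the change took effect
set hive.auto.convert.join = false;
set hive.auto.convert.join;
```

A `set` issued this way applies only to the current session; the value in the configuration file is unchanged.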

Run the small table JOIN large table query

insert overwrite table jointable
select b.id, b.time, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from smalltable s
left join bigtable  b
on b.id = s.id;

Run the large table JOIN small table query

insert overwrite table jointable
select b.id, b.time, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from bigtable  b
left join smalltable  s
on s.id = b.id;
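
One way to check which join strategy Hive picked for either query is to prefix it with EXPLAIN and read the plan. The sketch below reuses the tables from this example. A map-side join typically shows up in the plan as a Map Join Operator and a common join as a Join Operator inside a reduce stage, although the exact plan text varies with the Hive version and execution engine.

```sql
-- Plan with automatic map join conversion disabled (expect a common join)
set hive.auto.convert.join = false;
explain
select b.id, b.time, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from bigtable b
left join smalltable s
on s.id = b.id;

-- Plan with automatic conversion enabled (a map join, if smalltable is under the size threshold)
set hive.auto.convert.join = true;
explain
select b.id, b.time, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from bigtable b
left join smalltable s
on s.id = b.id;
```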

Joining a large table with a large table
1) Filtering null keys
Sometimes a join times out because certain keys have too many rows. Rows with the same key are all sent to the same reducer, which then runs out of memory. In that case, analyze the abnormal keys carefully: in many cases the rows for these keys are bad data, for example rows whose key field is null, and they should be filtered out in the SQL statement.
Example

Data preparation

Configure the history server
On the server that runs the ResourceManager service, edit mapred-site.xml and add the following:

<property>
    <name>mapreduce.jobhistory.address</name>
    <value>hadoop-101:10020</value>
</property>

<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>hadoop-101:19888</value>
</property>

Enable log aggregation
On every server in the Hadoop cluster, edit yarn-site.xml as follows:

<!-- Enable log aggregation -->
<property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
</property>

<!-- Retain logs for 7 days -->
<property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
</property>

Restart HDFS and YARN, then start the history server
stop-yarn.sh
stop-dfs.sh
start-dfs.sh
start-yarn.sh
mr-jobhistory-daemon.sh start historyserver
Create the tables

-- Create the original table
create table ori(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';
-- Create the table that contains null ids
create table nullidtable(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';
-- Create the table that stores the join result (it may already exist from the earlier example, hence "if not exists")
create table if not exists jointable(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';

Load the data

load data local inpath '/opt/module/datas/ori' into table ori;
load data local inpath '/opt/module/datas/nullid' into table nullidtable;

Test without filtering null ids
insert overwrite table jointable select n.* from nullidtable n left join ori o on n.id = o.id;

Test with null ids filtered out
insert overwrite table jointable select n.* from (select * from nullidtable where id is not null ) n left join ori o on n.id = o.id;
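
Before deciding what to filter, it is worth looking at how the join key is actually distributed. The following is a minimal sketch on the nullidtable table from this example; the aliases cnt and null_id_rows are illustrative names only.

```sql
-- Count rows per join key and list the heaviest keys first
select id, count(*) as cnt
from nullidtable
group by id
order by cnt desc
limit 10;

-- Count only the rows whose join key is null
select count(*) as null_id_rows
from nullidtable
where id is null;
```

If one key, such as null, dominates the first result, it is the likely cause of the overloaded reducer.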

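2) Converting null keys
Sometimes a null key accounts for many rows, but those rows are not bad data and must appear in the join result. In that case, assign a random value to the null key field in table a so that the rows are distributed randomly and evenly across different reducers.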

Example

Without randomly distributing the null keys
1) Set the number of reducers to 5
set mapreduce.job.reduces = 5;
2) JOIN the two tables

insert overwrite table jointable select n.* from nullidtable n left join ori b on n.id = b.id;

As the results show, data skew occurs: some reducers consume far more resources than the others.

With the null keys randomly distributed
1) Set the number of reducers to 5
set mapreduce.job.reduces = 5;
2) JOIN the two tables

insert overwrite table jointable select n.* from nullidtable n full join ori o on case when n.id is null then concat('hive', rand()) else n.id end = o.id;

As the results show, the data skew is eliminated and resource consumption is balanced across the reducers.
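
The randomized query above mixes a string (the result of concat) with the bigint column id inside one CASE expression and leaves the type conversion to Hive. A more explicit variant is sketched below. It is not the author's tested statement: it casts both sides of the comparison to string and uses left join so that it matches the non-random run. The prefix 'hive' is arbitrary; any value that cannot collide with a real id would do.

```sql
set mapreduce.job.reduces = 5;

insert overwrite table jointable
select n.*
from nullidtable n
left join ori o
on case
     -- null keys get a random string, so they spread across reducers and match nothing in ori
     when n.id is null then concat('hive', cast(rand() as string))
     -- real keys are compared as strings on both sides
     else cast(n.id as string)
   end = cast(o.id as string);
```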

MapJoin

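If MapJoin is not specified, or the conditions for MapJoin are not met, Hive's parser converts the join into a Common Join, which completes the join in the reduce phase and is prone to data skew. MapJoin loads the entire small table into memory and performs the join on the map side, so no reducer is involved.
Related parameter settings

Enable automatic selection of MapJoin (the default is true)
set hive.auto.convert.join = true;
Set the threshold that separates small tables from large tables (in bytes, about 25 MB)
set hive.mapjoin.smalltable.filesize=25000000;
The defaults appear in the configuration file as follows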

<property>
    <name>hive.auto.convert.join</name>
    <value>true</value>
    <description>Whether Hive enables the optimization about converting common join into mapjoin based on the input file size</description>
</property>
<property>
    <name>hive.mapjoin.smalltable.filesize</name>
    <value>25000000</value>
    <description>
        The threshold for the input file size of the small tables; if the file size is smaller
        than this threshold, it will try to convert the common join into map join
    </description>
</property>

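(Figure: how MapJoin works)

The hands-on MapJoin examples were already covered in the large-table-JOIN-small-table and small-table-JOIN-large-table runs above, so they are not repeated here.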
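
Besides automatic conversion, Hive also accepts an explicit MAPJOIN hint. A minimal sketch on the tables from the earlier example follows. It assumes a release in which hive.ignore.mapjoin.hint defaults to true, so the hint is ignored unless that property is set to false; whether the hint or automatic conversion is preferable depends on the Hive version in use.

```sql
-- Let query hints take effect (assumption: this release ignores them by default)
set hive.ignore.mapjoin.hint = false;

-- Ask Hive to load smalltable (alias s) into memory and join on the map side
insert overwrite table jointable
select /*+ MAPJOIN(s) */ b.id, b.time, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from bigtable b
join smalltable s
on s.id = b.id;
```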

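Count(Distinct) deduplicated counting
With small data volumes it does not matter. With large data volumes, a COUNT DISTINCT has to be completed by a single reduce task; that one reducer must process so much data that the whole job struggles to finish. COUNT DISTINCT is therefore usually replaced by a GROUP BY followed by a COUNT.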
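
The rewrite can be sketched on the bigtable table from the earlier example. The reducer count of 5 and the subquery alias a are arbitrary choices for illustration.

```sql
set mapreduce.job.reduces = 5;

-- Direct form: the deduplication funnels through a single reducer
select count(distinct id) from bigtable;

-- Rewritten form: GROUP BY deduplicates across several reducers, then an outer COUNT tallies the groups
select count(id)
from (select id from bigtable group by id) a;
```

The rewritten form runs an extra job, so it tends to pay off only when the data volume is large.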

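Cartesian product
Avoid Cartesian products wherever possible. When a join has no ON condition, or an invalid one, Hive can use only one reducer to compute the Cartesian product.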
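
The difference can be seen by comparing a join without a condition to an equi-join on the tables from the earlier example. The final statement is an optional guard and a version-dependent assumption: in releases that honour hive.mapred.mode, strict mode rejects Cartesian products (along with some other risky query shapes), while newer releases expose separate strict-check properties instead.

```sql
-- Cartesian product: no join condition, so every row of bigtable pairs with every row of smalltable
select count(*) from bigtable b join smalltable s;

-- Equi-join: rows are distributed to reducers by the join key
select count(*) from bigtable b join smalltable s on s.id = b.id;

-- Optional guard (assumption: this release honours hive.mapred.mode)
set hive.mapred.mode = strict;
```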

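Row and column filtering
Column handling: in the SELECT, take only the columns you need and avoid SELECT *; if the table is partitioned, filter on the partition wherever possible.
Row handling: in partition pruning with an outer join, if the filter condition on the secondary table is written in the WHERE clause, the full tables are joined first and the filter is applied only afterwards.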
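
A sketch of the row-handling point, using the bigtable and ori tables from the earlier examples with an arbitrary cutoff of 10. It uses an inner join so that both statements return the same ids; with an outer join, moving the filter into a subquery also changes which unmatched rows are kept, so the two forms are not interchangeable there. Hive's optimizer may push the predicate down on its own in simple cases like this one.

```sql
-- Filter written in WHERE: logically applied after the join
select o.id
from bigtable b
join ori o on o.id = b.id
where o.id <= 10;

-- Filter written in a subquery: fewer rows of ori reach the join
select b.id
from bigtable b
join (select id from ori where id <= 10) o on b.id = o.id;
```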

Execution plans (Explain)

Basic syntax
EXPLAIN [EXTENDED | DEPENDENCY | AUTHORIZATION] query
Hands-on examples
(1) View the execution plan of the following statements

explain select * from emp;
explain select deptno, avg(sal) avg_sal from emp group by deptno;

(2) View the detailed execution plan

explain extended select * from emp;
explain extended select deptno, avg(sal) avg_sal from emp group by deptno;
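
The other two keywords in the syntax can be tried on the same emp table. A brief sketch; the exact output format depends on the Hive version.

```sql
-- DEPENDENCY: lists the tables and partitions the query reads
explain dependency select deptno, avg(sal) avg_sal from emp group by deptno;

-- AUTHORIZATION: lists the query's inputs and outputs and any authorization failures
explain authorization select * from emp;
```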

Posted 2022-06-20 13:11 by 爱上编程技术
