第十章 Hive调优【大表Join大表-bucketmapjoin】

1. hive 的三种join
    1. reduceJoin 也叫 Common Join、Shuffle Join
    2. MapJoin
    3. Sort Merge Bucket Join(分桶表Join)

2. SMB(Sort Merge Bucket) Join 分桶表join
        说明 : 大表与大表join时,如果key分布均匀,单纯因为数据量过大,导致任务失败或运行时间过长
              可以考虑将大表分桶,来优化任务
        原理 :
            key % 分桶数 = 分桶编号
            分桶编号1 join 分桶编号1
        注意 :  A表、B表 都需要是分桶表且分桶规则相同
        参数 :

 　　　　　　 set hive.optimize.bucketmapjoin=true;
            set hive.optimize.bucketmapjoin.sortedmerge=true;
            set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

3.测试案例

-- 对照组
-- 不分桶 A表
create table bigtable2(
   id bigint,
   t bigint,
   uid string,
   keyword string,
   url_rank int,
   click_num int,
   click_url string)
row format delimited fields terminated by '\t';

-- 不分桶 B表
create table bigtable(
   id bigint,
   t bigint,
   uid string,
   keyword string,
   url_rank int,
   click_num int,
   click_url string)
row format delimited fields terminated by '\t';

-- 导入数据
load data local inpath '/root/bigtable' into table bigtable2;
load data local inpath '/root/bigtable' into table bigtable;


-- 不分桶关联
set yarn.scheduler.maximum-allocation-mb=118784;
set mapreduce.map.memory.mb=4096;
set mapreduce.reduce.memory.mb=4096;
set yarn.nodemanager.vmem-pmem-ratio=4.2;
insert overwrite table jointable
select b.id, b.t, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from bigtable s
join bigtable2 b
on b.id = s.id;

Time taken: 109.024 seconds

-- 实验组

-- 创建分桶表1,分桶个数不超过CPU核数
create table bigtable_buck1(
   id bigint,
   t bigint,
   uid string,
   keyword string,
url_rank int,
   click_num int,
   click_url string)
clustered by(id)
sorted by(id)
into 2 buckets
row format delimited fields terminated by '\t';

load data local inpath '/root/bigtable' into table bigtable_buck1;

-- 创建分桶表2,分桶个数不超过CPU核数
create table bigtable_buck2(
   id bigint,
   t bigint,
   uid string,
   keyword string,
   url_rank int,
   click_num int,
   click_url string)
clustered by(id)
sorted by(id)
into 2 buckets
row format delimited fields terminated by '\t';

load data local inpath '/root/bigtable' into table bigtable_buck2;

-- 参数设置(开启分桶连接)
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

insert overwrite table jointable
select b.id, b.t, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from bigtable_buck1 s
join bigtable_buck2 b
on b.id = s.id;

Time taken: 64.895 seconds

posted @ 2022-02-10 19:42 学而不思则罔！阅读(2332) 评论(0) 编辑收藏举报

刷新页面返回顶部

登录后才能查看或发表评论，立即登录或者逛逛博客园首页

相关博文：

· 第十章 Hive调优【大小表join-MapJoin】

· 第十章 Hive调优【小文件合并】

· Hive进阶

· hive的几种join

· Hive优化总结

阅读排行：
· DeepSeek 开源周回顾「GitHub 热点速览」
· 物流快递公司核心技术能力-地址解析分单基础技术分享
· .NET 10首个预览版发布：重大改进与新特性概览！
· AI与.NET技术实操系列（二）：开始使用ML.NET
· 单线程的Redis速度为什么快？

公告

昵称：学而不思则罔！
园龄： 4年10个月
粉丝： 11
关注： 0

+加关注

2025年3月

日

一

二

三

四

五

六

私人小院

第十章 Hive调优【大表Join大表-bucketmapjoin】

公告

搜索

常用链接

随笔分类

随笔档案

linux运维

阅读排行榜

评论排行榜

推荐排行榜

最新评论

私人小院

第十章 Hive调优 【大表Join大表-bucketmapjoin】

公告

搜索

常用链接

随笔分类

随笔档案

linux运维

阅读排行榜

评论排行榜

推荐排行榜

最新评论

第十章 Hive调优【大表Join大表-bucketmapjoin】