第十章 Hive调优 【大小表join-MapJoin】

1.大小表join(MapJoin)
说明 : 当大表小表关联时,可以将小表读取到内存,在Map端进行数据关联
小表在左在右都会触发 Mapjoin

2. 怎样判断大小表?
大表 > set hive.mapjoin.smalltable.filesize
小表 <= set hive.mapjoin.smalltable.filesize

3. 大小表关联可能引起的后果?
复制代码
-- 大文件 A表
1 a
2 b
2 b
2 b
3 c
3 c
3 c

-- 小文件 B表
1 x
2 y
3 z

--结果文件
1 a x
2 b y
2 b y
2 b y
3 c z
3 c z
3 c z


map阶段:
1.读取大文件 map输出
key : 1  value : a-大表标识
key : 2  value : b-大表标识
key : 2  value : b-大表标识
key : 2  value : b-大表标识
key : 3  value : c-大表标识

2.读取小文件 map输出
key : 1 value : x-小表标识
key : 2 value : y-小表标识
key : 3 value : z-小表标识

3.reduce阶段:
key : 1  value : a-大表标识
key : 1  value : x-小表标识
  聚合 : key : 1 value : a,x

key : 2  value : b-大表标识
key : 2  value : b-大表标识
key : 2  value : y-小表标识
  聚合 :
    key : 2  value : b,y
    key : 2  value : b,y
    key : 2  value : b,y

key : 3  value : c-大表标识
key : 3  value : z-小表标识
   聚合 :
      key : 3 value : c,z

4. 说明
    reduce阶段Join :
            当 A表有10亿数据,且id=1的数据有9亿 id=other的1亿,会造成数据严重倾斜,处理id=1的reduceTask 资源不足或处理过慢(map = 100%,  reduce = 99%)
    map阶段Join :
            将 B表数据读取到内存,在Map阶段完成join操作,避免了reduce阶段,id分布不均也就不会造成影响
View Code
复制代码
4. 参数设置
-- 根据输入文件大小,判断是否将普通join 转换为mapjoin (默认为true)
set hive.auto.convert.join=true;
-- 设置小表阈值(默认为 25M)
set hive.mapjoin.smalltable.filesize=25000000;
5. MapJoin 使用场景
1. 大小表关联
6. 案例测试
复制代码
A表 :
大小 :
    79.2 M  dt=20220209/000000_0
    72.9 M  dt=20220209/000001_0
    8.0 M   dt=20220209/000002_0
    11.1 M  dt=20220209/000003_0
数据量 :
    1613772

B表 :
大小 :
    27.0 K dt=20220209/000000_0
数据量 :
    523
--===============reduceJoin=================================================================

INFO  : Number of reduce tasks not specified. Estimated from input data size: 1
INFO  : In order to change the average load for a reducer (in bytes):
INFO  :   set hive.exec.reducers.bytes.per.reducer=
INFO  : In order to limit the maximum number of reducers:
INFO  :   set hive.exec.reducers.max=
INFO  : In order to set a constant number of reducers:
INFO  :   set mapreduce.job.reduces=
INFO  : number of splits:2
INFO  : Submitting tokens for job: job_1640670724826_487384
INFO  : The url to track the job: j
INFO  : Starting Job = job_1640670724826_487384, Tracking URL =
INFO  : Kill Command =  job  -kill job_1640670724826_487384
INFO  : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
INFO  : 2022-02-10 14:37:22,574 Stage-1 map = 0%,  reduce = 0%
INFO  : 2022-02-10 14:37:28,863 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 8.68 sec
INFO  : 2022-02-10 14:37:42,394 Stage-1 map = 65%,  reduce = 0%, Cumulative CPU 21.03 sec
INFO  : 2022-02-10 14:37:45,532 Stage-1 map = 80%,  reduce = 0%, Cumulative CPU 21.85 sec
INFO  : 2022-02-10 14:37:56,990 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 27.97 sec
INFO  : 2022-02-10 14:38:08,359 Stage-1 map = 100%,  reduce = 67%, Cumulative CPU 31.64 sec
INFO  : 2022-02-10 14:38:13,527 Stage-1 map = 100%,  reduce = 99%, Cumulative CPU 39.42 sec
INFO  : 2022-02-10 14:38:14,562 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 40.27 sec
INFO  : MapReduce Total cumulative CPU time: 40 seconds 270 msec
INFO  : Ended Job = job_1640670724826_487384
INFO  : Loading data to table tab from
INFO  : Table tab stats: [numFiles=1, numRows=1238897, totalSize=74173, rawDataSize=441047332]
No rows affected (93.818 seconds)

--================MapJoin================================================================

INFO  : Execution completed successfully
INFO  : MapredLocal task succeeded
INFO  : Number of reduce tasks is set to 0 since there's no reduce operator
INFO  : number of splits:1
INFO  : Submitting tokens for job: job_
INFO  : The url to track the job:
INFO  : Starting Job = job_1640670724826_487444, Tracking URL =
INFO  : Kill Command =  job  -kill job_
INFO  : Hadoop job information for Stage-4: number of mappers: 1; number of reducers: 0
INFO  : 2022-02-10 14:41:32,271 Stage-4 map = 0%,  reduce = 0%
INFO  : 2022-02-10 14:41:43,802 Stage-4 map = 89%,  reduce = 0%, Cumulative CPU 12.68 sec
INFO  : 2022-02-10 14:41:46,957 Stage-4 map = 100%,  reduce = 0%, Cumulative CPU 17.03 sec
INFO  : MapReduce Total cumulative CPU time: 17 seconds 30 msec
INFO  : Ended Job = job_1640670724826_487444
INFO  : Loading data to table tab1 from
INFO  : Table tab stats: [numFiles=1, numRows=1238897, totalSize=3935316, rawDataSize=441047332]
No rows affected (49.744 seconds)
复制代码

 










posted @   学而不思则罔!  阅读(1461)  评论(0编辑  收藏  举报
相关博文:
阅读排行:
· DeepSeek 开源周回顾「GitHub 热点速览」
· 物流快递公司核心技术能力-地址解析分单基础技术分享
· .NET 10首个预览版发布:重大改进与新特性概览!
· AI与.NET技术实操系列(二):开始使用ML.NET
· 单线程的Redis速度为什么快?
点击右上角即可分享
微信分享提示