第十章 Hive调优 【大小表join-MapJoin】
1.大小表join(MapJoin)
说明 : 当大表小表关联时,可以将小表读取到内存,在Map端进行数据关联
小表在左在右都会触发 Mapjoin
2. 怎样判断大小表?
大表 > set hive.mapjoin.smalltable.filesize
小表 <= set hive.mapjoin.smalltable.filesize
3. 大小表关联可能引起的后果?

-- 大文件 A表
1 a
2 b
2 b
2 b
3 c
3 c
3 c
-- 小文件 B表
1 x
2 y
3 z
--结果文件
1 a x
2 b y
2 b y
2 b y
3 c z
3 c z
3 c z
map阶段:
1.读取大文件 map输出
key : 1 value : a-大表标识
key : 2 value : b-大表标识
key : 2 value : b-大表标识
key : 2 value : b-大表标识
key : 3 value : c-大表标识
2.读取小文件 map输出
key : 1 value : x-小表标识
key : 2 value : y-小表标识
key : 3 value : z-小表标识
3.reduce阶段:
key : 1 value : a-大表标识
key : 1 value : x-小表标识
聚合 : key : 1 value : a,x
key : 2 value : b-大表标识
key : 2 value : b-大表标识
key : 2 value : y-小表标识
聚合 :
key : 2 value : b,y
key : 2 value : b,y
key : 2 value : b,y
key : 3 value : c-大表标识
key : 3 value : z-小表标识
聚合 :
key : 3 value : c,z
4. 说明
reduce阶段Join :
当 A表有10亿数据,且id=1的数据有9亿 id=other的1亿,会造成数据严重倾斜,处理id=1的reduceTask 资源不足或处理过慢(map = 100%, reduce = 99%)
map阶段Join :
将 B表数据读取到内存,在Map阶段完成join操作,避免了reduce阶段,id分布不均也就不会造成影响
4. 参数设置
-- 根据输入文件大小,判断是否将普通join 转换为mapjoin (默认为true)
set hive.auto.convert.join=true;
-- 设置小表阈值(默认为 25M)
set hive.mapjoin.smalltable.filesize=25000000;
5. MapJoin 使用场景
1. 大小表关联
6. 案例测试
A表 :
大小 :
79.2 M dt=20220209/000000_0
72.9 M dt=20220209/000001_0
8.0 M dt=20220209/000002_0
11.1 M dt=20220209/000003_0
数据量 :
1613772
B表 :
大小 :
27.0 K dt=20220209/000000_0
数据量 :
523
--===============reduceJoin=================================================================
INFO : Number of reduce tasks not specified. Estimated from input data size: 1
INFO : In order to change the average load for a reducer (in bytes):
INFO : set hive.exec.reducers.bytes.per.reducer=
INFO : In order to limit the maximum number of reducers:
INFO : set hive.exec.reducers.max=
INFO : In order to set a constant number of reducers:
INFO : set mapreduce.job.reduces=
INFO : number of splits:2
INFO : Submitting tokens for job: job_1640670724826_487384
INFO : The url to track the job: j
INFO : Starting Job = job_1640670724826_487384, Tracking URL =
INFO : Kill Command = job -kill job_1640670724826_487384
INFO : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
INFO : 2022-02-10 14:37:22,574 Stage-1 map = 0%, reduce = 0%
INFO : 2022-02-10 14:37:28,863 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 8.68 sec
INFO : 2022-02-10 14:37:42,394 Stage-1 map = 65%, reduce = 0%, Cumulative CPU 21.03 sec
INFO : 2022-02-10 14:37:45,532 Stage-1 map = 80%, reduce = 0%, Cumulative CPU 21.85 sec
INFO : 2022-02-10 14:37:56,990 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 27.97 sec
INFO : 2022-02-10 14:38:08,359 Stage-1 map = 100%, reduce = 67%, Cumulative CPU 31.64 sec
INFO : 2022-02-10 14:38:13,527 Stage-1 map = 100%, reduce = 99%, Cumulative CPU 39.42 sec
INFO : 2022-02-10 14:38:14,562 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 40.27 sec
INFO : MapReduce Total cumulative CPU time: 40 seconds 270 msec
INFO : Ended Job = job_1640670724826_487384
INFO : Loading data to table tab from
INFO : Table tab stats: [numFiles=1, numRows=1238897, totalSize=74173, rawDataSize=441047332]
No rows affected (93.818 seconds)
--================MapJoin================================================================
INFO : Execution completed successfully
INFO : MapredLocal task succeeded
INFO : Number of reduce tasks is set to 0 since there's no reduce operator
INFO : number of splits:1
INFO : Submitting tokens for job: job_
INFO : The url to track the job:
INFO : Starting Job = job_1640670724826_487444, Tracking URL =
INFO : Kill Command = job -kill job_
INFO : Hadoop job information for Stage-4: number of mappers: 1; number of reducers: 0
INFO : 2022-02-10 14:41:32,271 Stage-4 map = 0%, reduce = 0%
INFO : 2022-02-10 14:41:43,802 Stage-4 map = 89%, reduce = 0%, Cumulative CPU 12.68 sec
INFO : 2022-02-10 14:41:46,957 Stage-4 map = 100%, reduce = 0%, Cumulative CPU 17.03 sec
INFO : MapReduce Total cumulative CPU time: 17 seconds 30 msec
INFO : Ended Job = job_1640670724826_487444
INFO : Loading data to table tab1 from
INFO : Table tab stats: [numFiles=1, numRows=1238897, totalSize=3935316, rawDataSize=441047332]
No rows affected (49.744 seconds)
【推荐】国内首个AI IDE,深度理解中文开发场景,立即下载体验Trae
【推荐】编程新体验,更懂你的AI,立即体验豆包MarsCode编程助手
【推荐】抖音旗下AI助手豆包,你的智能百科全书,全免费不限次数
【推荐】轻量又高性能的 SSH 工具 IShell:AI 加持,快人一步
· DeepSeek 开源周回顾「GitHub 热点速览」
· 物流快递公司核心技术能力-地址解析分单基础技术分享
· .NET 10首个预览版发布:重大改进与新特性概览!
· AI与.NET技术实操系列(二):开始使用ML.NET
· 单线程的Redis速度为什么快?