
HIVE JOIN

Overview

Hive's join implementation covers several strategies:

  • Common (Reduce-side) Join
  • Broadcast (Map-side) Join
  • Bucket Map Join
  • Sort Merge Bucket Join
  • Skew Join

This post covers the first two.
The first is the common join. As the name suggests, it is the most commonly used join implementation, but it is not very flexible and its performance is not great either. A common join consists of a map phase, a shuffle phase, and a reduce phase. The map phase generates the join key and join value required by the join condition and writes them to intermediate files. The shuffle phase sorts those files by join key and merges rows with the same key into the same file. The reduce phase then performs the final merge and produces the result.

The second is the broadcast join, which drops the shuffle and reduce phases and completes the join entirely in the map phase: the small table in the join is loaded into memory, and every mapper can then join directly against the in-memory table.
How the small table gets into memory is worth spelling out. The small table is first loaded into memory and serialized into a hashtable file. When the map phase starts, this hashtable file is put into the distributed cache, which copies it to the local disk of every mapper; each mapper then loads the hashtable file into memory and performs the join there. With this optimization the small table only needs to be read once, and if several mappers run on the same machine, only one copy of the hashtable file is needed.
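
As a concrete illustration (my own sketch, not from the original post), here are two common ways to end up with a map-side join in Hive, using the test tables introduced in the next section; the reduce-side path is simply what you get when automatic conversion is turned off, as shown later in this post. The configuration values are the usual defaults, so treat this as a hedged sketch rather than a recommendation for your cluster:

-- 1) Let Hive convert the join automatically when the small table's size
--    is below the threshold (the value shown is the usual default).
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=25000000;

-- 2) Request it explicitly with the MAPJOIN hint, which broadcasts alias b.
--    Newer Hive versions ignore the hint unless hive.ignore.mapjoin.hint=false.
select /*+ MAPJOIN(b) */ a.id, a.uid, b.name
from
    temp.test_a as a
left join
    temp.test_city as b
on a.city_id = b.id;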

Inspecting the plans with EXPLAIN

I prepared two tables, test_a and test_city.

  • Data in test_a:

    test_a.id  test_a.uid  test_a.city_id
    1          1           1
    2          2           2
    3          3           3

  • Data in test_city:

    test_city.id  test_city.name
    1             beijing
    2             shanghai
    3             hangzhou
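
For reference, here is a minimal sketch (not part of the original post) of DDL and data that would reproduce these two tables. The column types are taken from the EXPLAIN output below; it assumes the temp database already exists and that the Hive version supports INSERT ... VALUES (0.14+):

create table temp.test_a (id bigint, uid bigint, city_id bigint);
create table temp.test_city (id bigint, name string);

-- populate the three rows shown above
insert into temp.test_a values (1, 1, 1), (2, 2, 2), (3, 3, 3);
insert into temp.test_city values (1, 'beijing'), (2, 'shanghai'), (3, 'hangzhou');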

LEFT JOIN

The SQL is as follows:

explain
select a.id, a.uid, b.name
from
    temp.test_a as a
left join
    temp.test_city as b
on a.city_id = b.id;

Because the tables are tiny, Hive uses a map-side join here. The plan is as follows:

STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables: // the small table is read from its files locally
        $hdt$_1:b
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        $hdt$_1:b
          TableScan // scan table test_city, reading it row by row
            alias: b
            Statistics: Num rows: 3 Data size: 29 Basic stats: COMPLETE Column stats: NONE
            Select Operator // select the data
              expressions: id (type: bigint), name (type: string)
              outputColumnNames: _col0, _col1
              Statistics: Num rows: 3 Data size: 29 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator // as I understand it, these are the keys used when the data is written out for the distributed cache, but I'm not entirely sure
                keys:
                  0 _col2 (type: bigint)
                  1 _col0 (type: bigint)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: a
            Statistics: Num rows: 3 Data size: 15 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: id (type: bigint), uid (type: bigint), city_id (type: bigint)
              outputColumnNames: _col0, _col1, _col2
              Statistics: Num rows: 3 Data size: 15 Basic stats: COMPLETE Column stats: NONE
              Map Join Operator // note that the map-side join happens here
                condition map:
                     Left Outer Join0 to 1
                keys:
                  0 _col2 (type: bigint)
                  1 _col0 (type: bigint)
                outputColumnNames: _col0, _col1, _col4
                Statistics: Num rows: 3 Data size: 16 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: _col0 (type: bigint), _col1 (type: bigint), _col4 (type: string)
                  outputColumnNames: _col0, _col1, _col2
                  Statistics: Num rows: 3 Data size: 16 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 3 Data size: 16 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
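
As an aside (my own hedged sketch, not from the original post), what counts as "small" for this automatic conversion is controlled by a handful of settings; the values below are the usual defaults, so check your own cluster before relying on them:

-- convert a common join to a map join when the small table is below this size
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=25000000;
-- convert directly (without a conditional backup task) when the combined size
-- of the n-1 smaller tables stays below the limit
set hive.auto.convert.join.noconditionaltask=true;
set hive.auto.convert.join.noconditionaltask.size=10000000;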

If you set

set hive.auto.convert.join=false;

the plan becomes a reduce-side join, which is the most widely used join implementation. The whole process consists of two stages:

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree: // map phase
          TableScan
            alias: a
            Statistics: Num rows: 3 Data size: 15 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: id (type: bigint), uid (type: bigint), city_id (type: bigint)
              outputColumnNames: _col0, _col1, _col2
              Statistics: Num rows: 3 Data size: 15 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator // emits the key/value pairs that are shuffled to the reduce phase
                key expressions: _col2 (type: bigint)
                sort order: +
                Map-reduce partition columns: _col2 (type: bigint)
                Statistics: Num rows: 3 Data size: 15 Basic stats: COMPLETE Column stats: NONE
                value expressions: _col0 (type: bigint), _col1 (type: bigint)
          TableScan
            alias: b
            Statistics: Num rows: 3 Data size: 29 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: id (type: bigint), name (type: string)
              outputColumnNames: _col0, _col1
              Statistics: Num rows: 3 Data size: 29 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: _col0 (type: bigint)
                sort order: +
                Map-reduce partition columns: _col0 (type: bigint)
                Statistics: Num rows: 3 Data size: 29 Basic stats: COMPLETE Column stats: NONE
                value expressions: _col1 (type: string)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Left Outer Join0 to 1
          keys:
            0 _col2 (type: bigint)
            1 _col0 (type: bigint)
          outputColumnNames: _col0, _col1, _col4
          Statistics: Num rows: 3 Data size: 16 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: bigint), _col1 (type: bigint), _col4 (type: string)
            outputColumnNames: _col0, _col1, _col2
            Statistics: Num rows: 3 Data size: 16 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 3 Data size: 16 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink


posted on 2016-11-21 23:03 by walkwalkwalk