源表数据记录数:60000000
第一步:创建三种文件类型的表,建表语法参考Hive文件存储格式
Sql代码
TextFile
1 set hive.exec.compress.output=true; 2 3 set mapred.output.compress=true; 4 5 set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec; 6 7 set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec; 8 9 INSERT OVERWRITE table hzr_test_text_table PARTITION(product='xxx',dt='2013-04-22') 10 11 SELECT xxx,xxx.... FROM xxxtable WHERE product='xxx' AND dt='2013-04-22';
SquenceFile
1 set hive.exec.compress.output=true; 2 3 set mapred.output.compress=true; 4 5 set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec; 6 7 set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec; 8 9 set io.seqfile.compression.type=BLOCK; 10 11 INSERT OVERWRITE table hzr_test_sequence_table PARTITION(product='xxx',dt='2013-04-22') 12 13 SELECT xxx,xxx.... FROM xxxtable WHERE product='xxx' AND dt='2013-04-22';
RCFile
1 set hive.exec.compress.output=true; 2 3 set mapred.output.compress=true; 4 5 set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec; 6 7 set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec; 8 9 INSERT OVERWRITE table hzr_test_rcfile_table PARTITION(product='xxx',dt='2013-04-22') 10 11 SELECT xxx,xxx.... FROM xxxtable WHERE product='xxx' AND dt='2013-04-22';
第二步:测试insert overwrite table tablename select.... 耗时,存储空间
类型 |
insert耗时(S) |
存储空间(G) |
Sequence |
97.291 |
7.13G |
RCFile |
120.901 |
5.73G |
TextFile |
290.517 |
6.80G |
insert耗时、count(1)耗时比较:
第三步:查询响应时间
测试一
Sql代码
方案一,测试整行记录的查询效率:
1 select * from hzr_test_sequence_table where game='XXX' ; 2 3 select * from hzr_test_rcfile_table where game='XXX' ; 4 5 select * from hzr_test_text_table where game='XXX' ;
方案二,测试特定列的查询效率:
文件格式 |
查询整行记录耗时(S) |
查询特定列记录耗时(S) |
sequence |
42.241 |
39.918 |
rcfile |
37.395 |
36.248 |
text |
43.164 |
41.632 |
方案耗时对比:
测试二:
本测试目的是验证RCFILE的数据读取方式和Lazy解压方式是否有性能优势。数据读取方式只读取元数据和相关的列,节省IO;Lazy解压方式只解压相关的列数据,对不满足where条件的查询数据不进行解压,IO和效率都有优势。
方案一:
记录数:698020
Sql代码
1 insert overwrite local directory 'XXX/XXXX' select game,game_server from hzr_test_xxx_table where game ='XXX';
方案二:
记录数:67236221
Sql代码
1 insert overwrite local directory 'xxx/xxxx' select game,game_server from hzr_test_xxx_table;
方案三:
记录数:
Sql代码
1 insert overwrite local directory 'xxx/xxx' select game from hzr_xxx_rcfile_table;
文件类型 |
方案一 |
方案二 |
方案三 |
TextFile |
54.895 |
69.428 |
167.667 |
SequenceFile |
137.096 |
77.03 |
123.667 |
RCFile |
44.28 |
57.037 |
89.9 |
上图表现反应在大小数据集上,RCFILE的查询效率高于SEQUENCEFILE,在特定字段数据读取时,RCFILE的查询效率依然优于SEQUENCEFILE。