Hive Tuning: SQL
Environment:
Hadoop 3.1.3
Hive: hive-3.1.2
1. Replace UNION with a multi-table INSERT
insert into table student_stat partition(tp)
select s_age,max(s_birth) stat, 'max' tp
from student
group by s_age
union all
select s_age,min(s_birth) stat, 'min' tp
from student
group by s_age
Execution process and results:
Total jobs = 5
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 10.25 sec HDFS Read: 114300469 HDFS Write: 261 SUCCESS
Stage-Stage-9: Map: 1 Reduce: 1 Cumulative CPU: 9.7 sec HDFS Read: 114300471 HDFS Write: 261 SUCCESS
Stage-Stage-2: Map: 2 Reduce: 1 Cumulative CPU: 17.83 sec HDFS Read: 21872 HDFS Write: 499 SUCCESS
Total MapReduce CPU Time Spent: 37 seconds 780 msec
from student
INSERT into table student_stat partition(tp)
select s_age,min(s_birth) stat,'min' tp
group by s_age
insert into table student_stat partition(tp)
select s_age,max(s_birth) stat,'max' tp
group by s_age;
Execution process and results:
Total jobs = 3
MapReduce Jobs Launched:
Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 12.16 sec HDFS Read: 114302639 HDFS Write: 574 SUCCESS
Stage-Stage-4: Map: 1 Reduce: 1 Cumulative CPU: 7.84 sec HDFS Read: 11548 HDFS Write: 183 SUCCESS
Stage-Stage-6: Map: 1 Reduce: 1 Cumulative CPU: 6.56 sec HDFS Read: 11548 HDFS Write: 183 SUCCESS
Total MapReduce CPU Time Spent: 26 seconds 560 msec
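The gain shows up directly in the job summaries: 5 jobs and 37.78 CPU seconds for the UNION ALL version versus 3 jobs and 26.56 CPU seconds for the multi-table INSERT, because the student table is scanned only once. A minimal Python sketch for pulling the totals out of the summary lines above and comparing them (the regex is derived from the log format shown here):

```python
import re

def total_cpu_seconds(summary: str) -> float:
    """Extract cumulative CPU seconds from a Hive job summary line."""
    m = re.search(r"Spent: (\d+) seconds (\d+) msec", summary)
    if m is None:
        raise ValueError("unrecognized summary line")
    return int(m.group(1)) + int(m.group(2)) / 1000.0

# Totals copied from the two runs above.
union_cpu = total_cpu_seconds("Total MapReduce CPU Time Spent: 37 seconds 780 msec")
multi_cpu = total_cpu_seconds("Total MapReduce CPU Time Spent: 26 seconds 560 msec")
saving = union_cpu - multi_cpu
print(f"multi-table insert saves {saving:.2f} CPU seconds ({saving / union_cpu:.0%})")
# → multi-table insert saves 11.22 CPU seconds (30%)
```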
2. Storage file factor
from student_tb_txt
-- same business logic as case 1.4
INSERT into table student_stat partition(tp)
select s_age,min(s_birth) stat,'min' tp
GROUP by s_age
insert into table student_stat partition(tp)
select s_age,max(s_birth) stat,'max' tp
GROUP by s_age;
MapReduce Total cumulative CPU time: 6 seconds 340 msec
Ended Job = job_1607608490512_0054
MapReduce Jobs Launched:
Stage-Stage-2: Map: 2 Reduce: 2 Cumulative CPU: 28.54 sec HDFS Read: 335962830 HDFS Write: 1602 SUCCESS
Stage-Stage-4: Map: 1 Reduce: 1 Cumulative CPU: 6.35 sec HDFS Read: 11970 HDFS Write: 191 SUCCESS
Stage-Stage-6: Map: 1 Reduce: 1 Cumulative CPU: 6.34 sec HDFS Read: 11970 HDFS Write: 191 SUCCESS
Total MapReduce CPU Time Spent: 41 seconds 230 msec
OK
Time taken: 93.516 seconds
set hive.merge.mapfiles=true;
set hive.merge.orcfile.stripe.level=true;
set hive.merge.size.per.task=268435454;
set hive.merge.smallfiles.avgsize=16777216;

create table student_tb_orc like student_tb_txt stored as orc;
insert into student_tb_orc
select * from student_tb_txt;

create table student_tb_txt_bigfile like student_tb_txt stored as textfile;
insert into student_tb_txt_bigfile
select * from student_tb_orc;
from student_tb_txt_bigfile
INSERT into table student_stat partition(tp)
select s_age,min(s_birth) stat,'min' tp
GROUP by s_age
insert into table student_stat partition(tp)
select s_age,max(s_birth) stat,'max' tp
GROUP by s_age;
MapReduce Jobs Launched:
Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 9.67 sec HDFS Read: 24269 HDFS Write: 612 SUCCESS
Stage-Stage-4: Map: 1 Reduce: 1 Cumulative CPU: 6.64 sec HDFS Read: 11558 HDFS Write: 191 SUCCESS
Stage-Stage-6: Map: 1 Reduce: 1 Cumulative CPU: 6.54 sec HDFS Read: 11558 HDFS Write: 191 SUCCESS
Total MapReduce CPU Time Spent: 22 seconds 850 msec
OK
Time taken: 91.072 seconds
hive> desc formatted student_tb_txt;
OK
# col_name data_type comment
s_no string
s_name string
s_birth string
s_age bigint
s_sex string
s_score bigint
s_desc string
# Detailed Table Information
Database: hive
OwnerType: USER
Owner: hadoop
CreateTime: Thu Dec 10 21:30:12 CST 2020
LastAccessTime: UNKNOWN
Retention: 0
Location: hdfs://master:9000/warehouse/student
Table Type: MANAGED_TABLE
Table Parameters:
bucketing_version 2
numFiles 50
totalSize 335929567
transient_lastDdlTime 1607607012
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
field.delim \t
serialization.format \t
Time taken: 0.166 seconds, Fetched: 35 row(s)
hive> desc formatted student_tb_txt_bigfile;
OK
# col_name data_type comment
s_no string
s_name string
s_birth string
s_age bigint
s_sex string
s_score bigint
s_desc string
# Detailed Table Information
Database: hive
OwnerType: USER
Owner: hadoop
CreateTime: Thu Dec 10 22:14:49 CST 2020
LastAccessTime: UNKNOWN
Retention: 0
Location: hdfs://master:9000/user/hive/warehouse/hive.db/student_tb_txt_bigfile
Table Type: MANAGED_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"s_age\":\"true\",\"s_birth\":\"true\",\"s_desc\":\"true\",\"s_name\":\"true\",\"s_no\":\"true\",\"s_score\":\"true\",\"s_sex\":\"true\"}}
numFiles 1
numRows 50
rawDataSize 8613
totalSize 8663
transient_lastDdlTime 1607609743
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
field.delim \t
serialization.format \t
Time taken: 0.128 seconds, Fetched: 37 row(s)
Summary:
The single-file table student_tb_txt_bigfile outperforms the original multi-file storage.
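One way to see why the small-file settings matter here: `desc formatted` reports student_tb_txt as 50 files totalling 335929567 bytes, an average of roughly 6.4 MB per file, well below the 16 MB hive.merge.smallfiles.avgsize threshold set above (the threshold Hive applies to job output when deciding whether to launch a merge pass). The same arithmetic, sketched with the numbers from the output above:

```python
# Figures copied from the `desc formatted` output above; the threshold is
# the hive.merge.smallfiles.avgsize value set earlier (16 MB).
total_size = 335_929_567        # totalSize of student_tb_txt (bytes)
num_files = 50                  # numFiles of student_tb_txt
avgsize_threshold = 16_777_216  # hive.merge.smallfiles.avgsize

avg_file_size = total_size / num_files
print(f"average file size: {avg_file_size / 2**20:.1f} MB")         # → 6.4 MB
print("below merge threshold:", avg_file_size < avgsize_threshold)  # → True
```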
3. Storage format factor
from student_tb_seq
-- same business logic as case 1.4
INSERT into table student_stat partition(tp)
select s_age,min(s_birth) stat,'min' tp
GROUP by s_age
insert into table student_stat partition(tp)
select s_age,max(s_birth) stat,'max' tp
GROUP by s_age;
MapReduce Jobs Launched:
Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 10.98 sec HDFS Read: 24893 HDFS Write: 954 SUCCESS
Stage-Stage-4: Map: 1 Reduce: 1 Cumulative CPU: 5.86 sec HDFS Read: 11558 HDFS Write: 191 SUCCESS
Stage-Stage-6: Map: 1 Reduce: 1 Cumulative CPU: 5.98 sec HDFS Read: 11558 HDFS Write: 191 SUCCESS
Total MapReduce CPU Time Spent: 22 seconds 820 msec
OK
Time taken: 92.55 seconds
from student_tb_par
-- same business logic as case 1.4
INSERT into table student_stat partition(tp)
select s_age,min(s_birth) stat,'min' tp
GROUP by s_age
insert into table student_stat partition(tp)
select s_age,max(s_birth) stat,'max' tp
GROUP by s_age;
MapReduce Jobs Launched:
Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 15.0 sec HDFS Read: 18691 HDFS Write: 954 SUCCESS
Stage-Stage-4: Map: 1 Reduce: 1 Cumulative CPU: 6.49 sec HDFS Read: 11558 HDFS Write: 191 SUCCESS
Stage-Stage-6: Map: 1 Reduce: 1 Cumulative CPU: 5.59 sec HDFS Read: 11558 HDFS Write: 191 SUCCESS
Total MapReduce CPU Time Spent: 27 seconds 80 msec
OK
Time taken: 91.433 seconds
from student_tb_orc
-- same business logic as case 1.4
INSERT into table student_stat partition(tp)
select s_age,min(s_birth) stat,'min' tp
GROUP by s_age
insert into table student_stat partition(tp)
select s_age,max(s_birth) stat,'max' tp
GROUP by s_age;
MapReduce Jobs Launched:
Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 10.41 sec HDFS Read: 19345 HDFS Write: 954 SUCCESS
Stage-Stage-4: Map: 1 Reduce: 1 Cumulative CPU: 5.98 sec HDFS Read: 11555 HDFS Write: 191 SUCCESS
Stage-Stage-6: Map: 1 Reduce: 1 Cumulative CPU: 6.47 sec HDFS Read: 11555 HDFS Write: 191 SUCCESS
Total MapReduce CPU Time Spent: 22 seconds 860 msec
OK
Time taken: 90.767 seconds
Summary:
orc > seq > par
4. Partition and bucket factors
DROP TABLE if EXISTS student_orc_partition;
-- create student_orc_partition, partitioned by s_age
create table if not exists student_orc_partition(
s_no string,
s_name string,
s_birth string,
s_sex string,
s_score bigint,
s_desc string
)
partitioned by(s_age bigint)
STORED AS ORC;

insert into table student_orc_partition partition(s_age)
select s_no,s_name,s_birth,s_sex,s_score,s_desc,s_age
from student_tb_orc;

-- create student_orc_bucket, bucketed on s_age
create table if not exists student_orc_bucket(
s_no string,
s_name string,
s_birth string,
s_age bigint,
s_sex string,
s_score bigint,
s_desc string
)
-- split into 16 buckets
clustered BY (s_age) INTO 16 BUCKETS
STORED AS ORC;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.enforce.bucketing = true;
insert into table student_orc_bucket
select s_no,s_name,s_birth,s_age,s_sex,s_score,s_desc
from student_tb_orc;

-- create student_orc_partition_bucket, both partitioned and bucketed
create table if not exists student_orc_partition_bucket(
s_no string,
s_name string,
s_birth string,
s_age string,
s_sex string,
s_score bigint,
s_desc string
)
partitioned by(part bigint)
clustered BY (s_age) INTO 16 BUCKETS
STORED AS ORC;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.enforce.bucketing = true;
insert into table student_orc_partition_bucket partition(part)
select s_no,s_name,s_birth,s_age,s_sex,s_score,s_desc,pmod(hash(s_no),10) part
from student_tb_orc;
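Conceptually, the DDL above routes each row twice: dynamic partitioning places it in a directory chosen by the partition expression, and bucketing places it in one of 16 files chosen by a hash of the bucket column. A rough Python illustration of that routing (Hive's hash() UDF is not reproduced here: for the bigint bucket column the value itself stands in, and crc32 stands in for the string hash; the row values are made up):

```python
import zlib

NUM_BUCKETS = 16   # clustered BY (s_age) INTO 16 BUCKETS
NUM_PARTS = 10     # pmod(hash(s_no), 10) in the last INSERT above

def bucket_for(s_age: int) -> int:
    # Stand-in for Hive's bucket assignment on a bigint column.
    return s_age % NUM_BUCKETS

def part_for(s_no: str) -> int:
    # Stand-in for pmod(hash(s_no), 10); crc32 replaces Hive's string hash.
    return zlib.crc32(s_no.encode()) % NUM_PARTS

# Hypothetical rows: (s_no, s_age)
for s_no, s_age in [("s001", 20), ("s002", 21), ("s003", 22)]:
    print(s_no, "-> partition", part_for(s_no), ", bucket", bucket_for(s_age))
```

The point of the sketch is only that partition and bucket assignment are independent functions of different columns, which is why the combined table can prune on both.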
select s_age,count(s_no)
from student_orc
where s_age<23
group by s_age;
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 9.98 sec HDFS Read: 18260 HDFS Write: 139 SUCCESS
Total MapReduce CPU Time Spent: 9 seconds 980 msec
OK
20 20
21 2
22 8
Time taken: 26.731 seconds, Fetched: 3 row(s)
select s_age,count(s_no)
from student_orc_partition
where s_age<23
group by s_age;
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 7.72 sec HDFS Read: 20201 HDFS Write: 139 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 720 msec
OK
20 20
21 2
22 8
Time taken: 27.234 seconds, Fetched: 3 row(s)
select s_age,count(s_no)
from student_orc_bucket
where s_age<23
group by s_age;
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 11.5 sec HDFS Read: 26434 HDFS Write: 139 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 500 msec
OK
20 20
21 2
22 8
Time taken: 29.778 seconds, Fetched: 3 row(s)
select s_age,count(s_no)
from student_orc_partition_bucket
where s_age<23
group by s_age;
Total MapReduce CPU Time Spent: 10 seconds 960 msec
OK
20 20
21 2
22 8
Time taken: 29.442 seconds, Fetched: 3 row(s)
Summary:
With a small data volume: partition > source > partition_bucket > bucket
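That ranking follows the cumulative CPU time of the four runs. A quick sort of the figures reported in the logs (the partition_bucket run's per-stage line is not shown above, so its job total of 10.96 s is used):

```python
# Cumulative CPU seconds for the s_age < 23 aggregation, copied from the
# four query runs above.
cpu_secs = {
    "source (student_orc)": 9.98,
    "partition": 7.72,
    "bucket": 11.50,
    "partition_bucket": 10.96,
}
for name, cpu in sorted(cpu_secs.items(), key=lambda kv: kv[1]):
    print(f"{name:22s} {cpu:5.2f} s")
```

Sorting ascending reproduces the order in the summary: partition, source, partition_bucket, bucket.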