impala+kudu

【impala建表】kudu的表必须有主键,作为分区的字段需排在其他字段前面

【range分区】(不推荐)
CREATE TABLE KUDU_WATER_HISTORY ( 
id STRING, 
year INT,
device STRING,
reading INT,
time STRING,
PRIMARY KEY (id,year) 
) PARTITION BY RANGE (year)

PARTITION VALUES < 2017, 
PARTITION 2017 <= VALUES < 2018, 
PARTITION 2018 <= VALUES
)
STORED AS KUDU 
TBLPROPERTIES('kudu.master_addresses' = '10.10.30.200:7051');

【hash分区】(不推荐)
CREATE TABLE KUDU_WATER_HISTORY_PARTITION_BY_ID ( 
id STRING, 
year INT,
device STRING,
reading INT,
time STRING,
PRIMARY KEY (id) 
) PARTITION BY HASH (id) PARTITIONS 4
STORED AS KUDU 
TBLPROPERTIES('kudu.master_addresses' = '10.10.30.200:7051');

【hash range混合分区】推荐是用混合分区方式
CREATE TABLE KUDU_WATER_HISTORY (
id STRING,
year INT,
device STRING,
reading INT,
time STRING,
PRIMARY KEY (id,device,year)
) PARTITION BY HASH (device) PARTITIONS 3,
RANGE (year)
(
PARTITION VALUE = 2016,
PARTITION VALUE = 2017,
PARTITION VALUE = 2018,
PARTITION VALUE = 2019
)
STORED AS KUDU
TBLPROPERTIES('kudu.master_addresses' = '10.10.30.200:7051');

CREATE TABLE DEVICE_KUDU (
id STRING,
device STRING,
name STRING,
orgId INT,
PRIMARY KEY (id)
) PARTITION BY HASH (id) PARTITIONS 4
STORED AS KUDU
TBLPROPERTIES('kudu.master_addresses' = '10.10.30.200:7051');

【增加分区】
ALTER TABLE KUDU_WATER_HISTORY ADD RANGE PARTITION VALUE = 2020;

【查询语句】

select
T_3C75F1.`device`,
year(T_3C75F1.`time`),
month(T_3C75F1.`time`),
sum(T_3C75F1.`reading`),
count(1)
from (select DEVICE_KUDU.device,reading,to_timestamp(time,'yyyy-MM-dd HH:mm:ss') as time from KUDU_WATER_HISTORY,DEVICE_KUDU where KUDU_WATER_HISTORY.device=DEVICE_KUDU.device) as `T_3C75F1`
group by
T_3C75F1.`device`,
year(T_3C75F1.`time`),
month(T_3C75F1.`time`);

耗时:DEVICE_KUDU表50条记录,KUDU_WATER_HISTORY表1亿条记录,执行上面的查询语句耗时12秒。同样的数据量和查询方式,比impala+hdfs+parquet快10倍。

 

【跨数据源的查询】

device表的数据存储在hdfs中,抄表读数的数据存储在kudu中。分别在impala中创建对应的表

explain select
T_3C75F1.`deviceid`,
year(T_3C75F1.`time`),
month(T_3C75F1.`time`),
sum(T_3C75F1.`reading`),
count(1)
from (select device_parquet.deviceid,reading,to_timestamp(time,'yyyy-MM-dd HH:mm:ss') as time from KUDU_WATER_HISTORY,device_parquet where KUDU_WATER_HISTORY.device=device_parquet.deviceid) as `T_3C75F1`
group by
T_3C75F1.`deviceid`,
year(T_3C75F1.`time`),
month(T_3C75F1.`time`);

执行上面的语句,查看执行计划

+-----------------------------------------------------------------------------------------------------------------------------------------+
| Explain String                                                                                                                          |
+-----------------------------------------------------------------------------------------------------------------------------------------+
| Per-Host Resource Reservation: Memory=400.00MB                                                                                          |
| Per-Host Resource Estimates: Memory=2.25GB                                                                                              |
| WARNING: The following tables are missing relevant table and/or column statistics.                                                      |
| default.device_parquet, default.kudu_water_history                                                                                      |
|                                                                                                                                         |
| PLAN-ROOT SINK                                                                                                                          |
| |                                                                                                                                       |
| 07:EXCHANGE [UNPARTITIONED]                                                                                                             |
| |                                                                                                                                       |
| 06:AGGREGATE [FINALIZE]                                                                                                                 |
| |  output: sum:merge(T_3C75F1.reading), count:merge(1)                                                                                  |
| |  group by: T_3C75F1.deviceid, year(T_3C75F1.time), month(T_3C75F1.time)                                                               |
| |                                                                                                                                       |
| 05:EXCHANGE [HASH(T_3C75F1.deviceid,year(T_3C75F1.time),month(T_3C75F1.time))]                                                          |
| |                                                                                                                                       |
| 03:AGGREGATE [STREAMING]                                                                                                                |
| |  output: sum(reading), count(1)                                                                                                       |
| |  group by: device_parquet.deviceid, year(to_timestamp(time, 'yyyy-MM-dd HH:mm:ss')), month(to_timestamp(time, 'yyyy-MM-dd HH:mm:ss')) |
| |                                                                                                                                       |
| 02:HASH JOIN [INNER JOIN, BROADCAST]                                                                                                    |
| |  hash predicates: KUDU_WATER_HISTORY.device = device_parquet.deviceid                                                                 |
| |                                                                                                                                       |
| |--04:EXCHANGE [BROADCAST]                                                                                                              |
| |  |                                                                                                                                    |
| |  01:SCAN HDFS [default.device_parquet]                                                                                                |
| |     partitions=1/1 files=1 size=1.00KB                                                                                                |
| |                                                                                                                                       |
| 00:SCAN KUDU [default.kudu_water_history]                                                                                               |
+-----------------------------------------------------------------------------------------------------------------------------------------+

能看到00:SCAN KUDU和01:SCAN HDFS,分别从kudu和hdfs中查询数据。

 

posted @ 2018-06-27 15:37  puroc  阅读(901)  评论(0编辑  收藏  举报