大叔经验分享（83）impala执行多个select distinct

impala在一个select中执行多个count distinct时会报错，比如执行

select key, count(distinct column_a), count(distinct column_b) from test_table group by key

会报错

Query submitted at: 2019-09-28 00:34:20 (Coordinator: http://DataOne-001:25000)
ERROR: AnalysisException: all DISTINCT aggregate functions need to have the same set of parameters as count(DISTINCT column_a);
deviating function: count(DISTINCT column_b)
Consider using NDV() instead of COUNT(DISTINCT) if estimated counts are acceptable. Enable the APPX_COUNT_DISTINCT query option to
perform this rewrite automatically.

这时有几种方法：

1 使用近似值

1.1 set APPX_COUNT_DISTINCT = true
1.2 count distinct改为ndv，即ndv(column_a)
这两种方法底层实现是一样的，设置APPX_COUNT_DISTINCT会自动将count distinct改写为ndv，ndv全称为（number of distinct values），用到
Cardinality（基数计数），底层实现是类似HLLC（Hyper LogLog Counting）这种概率算法，详见参考；

An aggregate function that returns an approximate value similar to the result of COUNT(DISTINCT col), the "number of distinct values". It is much faster than the combination of COUNT and DISTINCT, and uses a constant amount of memory and thus is less memory-intensive for columns with high cardinality.

2 使用精确值

改写为多个子查询然后join，比如

select a.key, a.count_a, b.count_b from
(select key, count(distinct column_a) count_a from test_table group by key) a join
(select key, count(distinct column_b) count_b from test_table group by key) b on a.key = b.key

参考：

ndv

http://impala.apache.org/docs/build/html/topics/impala_ndv.html#ndv

APPX_COUNT_DISTINCT

http://impala.apache.org/docs/build/html/topics/impala_appx_count_distinct.html

其他

https://stackoverflow.com/questions/39236076/impala-all-distinct-aggregate-functions-need-to-have-the-same-set-of-parameters

posted @ 2019-09-28 01:13 匠人先生阅读(5032) 评论(0) 收藏举报

刷新页面返回顶部

Thinking in BigData

匠人先生

大叔经验分享（83）impala执行多个select distinct

1 使用近似值

2 使用精确值

公告