大叔经验分享(83)impala执行多个select distinct
impala在一个select中执行多个count distinct时会报错,比如执行
select key, count(distinct column_a), count(distinct column_b) from test_table group by key
会报错
Query submitted at: 2019-09-28 00:34:20 (Coordinator: http://DataOne-001:25000)
ERROR: AnalysisException: all DISTINCT aggregate functions need to have the same set of parameters as count(DISTINCT column_a);
deviating function: count(DISTINCT column_b)
Consider using NDV() instead of COUNT(DISTINCT) if estimated counts are acceptable. Enable the APPX_COUNT_DISTINCT query option to
perform this rewrite automatically.
这时有几种方法:
1 使用近似值
1.1 set APPX_COUNT_DISTINCT = true
1.2 count distinct改为ndv,即ndv(column_a)
这两种方法底层实现是一样的,设置APPX_COUNT_DISTINCT会自动将count distinct改写为ndv,ndv全称为(number of distinct values),用到
Cardinality(基数计数),底层实现是类似HLLC(Hyper LogLog Counting)这种概率算法,详见参考;
An aggregate function that returns an approximate value similar to the result of COUNT(DISTINCT col)
, the "number of distinct values". It is much faster than the combination of COUNT
and DISTINCT
, and uses a constant amount of memory and thus is less memory-intensive for columns with high cardinality.
2 使用精确值
改写为多个子查询然后join,比如
select a.key, a.count_a, b.count_b from
(select key, count(distinct column_a) count_a from test_table group by key) a join
(select key, count(distinct column_b) count_b from test_table group by key) b on a.key = b.key
参考:
ndv
http://impala.apache.org/docs/build/html/topics/impala_ndv.html#ndv
APPX_COUNT_DISTINCT
http://impala.apache.org/docs/build/html/topics/impala_appx_count_distinct.html
其他
---------------------------------------------------------------- 结束啦,我是大魔王先生的分割线 :) ----------------------------------------------------------------
- 由于大魔王先生能力有限,文中可能存在错误,欢迎指正、补充!
- 感谢您的阅读,如果文章对您有用,那么请为大魔王先生轻轻点个赞,ありがとう