使用clickhouse实现开窗函数 row/rank_number 和 lag lead

ROW_NUMBER实现

如何在ClickHouse中实现ROW_NUMBER OVER 和DENSE_RANK OVER等同效果的查询,它们在一些其他数据库中可用于RANK排序。

CH中并没有直接提供对应的开窗函数,需要利用一些特殊函数变相实现,主要会用到下面几个数组函数,它们分别是:

arrayEnumerate
arrayEnumerateDense
arrayEnumerateUniq

这些函数均接受一个数组作为输入参数,并返回数组中元素出现的位置,例如:

SELECT    
 arrayEnumerate([10, 20, 30, 10, 40]) AS row_number,     
 arrayEnumerateDense([10, 20, 30, 10, 40]) AS dense_rank,    
 arrayEnumerateUniq([10, 20, 30, 10, 40]) AS uniq_rank
 
┌─row_number──┬─dense_rank──┬─uniq_rank───┐
│ [1,2,3,4,5][1,2,3,1,4][1,1,1,2,1] │
└─────────────┴─────────────┴─────────────┘

 

数据格式:

我们的目标是实现开窗查询:

ROW_NUMBER() OVER( PARTITION BY id ORDER BY val )==>arrayEnumerate
DENSE_RANK() OVER( PARTITION BY id ORDER BY val )==>arrayEnumerateDense
UNIQ_RANK() OVER( PARTITION BY id ORDER BY val )==>arrayEnumerateUniq

代码如下

SELECT     
customer_id ,     
groupArray(loan_dt) AS loan_dt, 
groupArray(ifnull(end_date,toDate('2099-12-31'))) AS end_date,  
groupArray(due_days) AS due_days,    
groupArray(loan_id) AS loan_id,    
arrayEnumerate(loan_id) AS row_number,     
arrayEnumerateDense(loan_id) AS dense_rank,     
arrayEnumerateUniq(loan_id) AS uniq_rank
FROM (    
    SELECT   *  FROM res_report.xxx_loan     ORDER BY loan_dt ,loan_id  
)
GROUP BY customer_id

 

 数组展开,利用ARRAY JOIN将数组展开,并按照customer_id 、loan_id  列排序:

SELECT 
customer_id
,loan_id
,loan_dt
,IF(end_date=toDate('2099-12-31'),null,end_date) as end_dt
,due_days
,row_number 
,dense_rank
,uniq_rank
from 
(
SELECT     
customer_id ,     
groupArray(loan_dt) AS loan_dt, 
groupArray(ifnull(end_date,toDate('2099-12-31'))) AS end_date,  
groupArray(due_days) AS due_days,    
groupArray(loan_id) AS loan_id,    
arrayEnumerate(loan_id) AS row_number,     
arrayEnumerateDense(loan_id) AS dense_rank,     
arrayEnumerateUniq(loan_id) AS uniq_rank
FROM (    
    SELECT   *  FROM res_report.xxx_loan     ORDER BY loan_dt ,loan_id  
)
GROUP BY customer_id
)
ARRAY JOIN  
    loan_dt,
    loan_id, 
    end_date,
    due_days,
    row_number,     
    dense_rank,     
    uniq_rank
ORDER BY     
customer_id ASC,     
row_number ASC ,     
dense_rank ASC

 

 技巧:因为end_date可能为空值,会导致array长度不一致。报错。需要用特数值填充然后最后再转换回来。

lag/lead实现:

neighbor(column, offset[, default_value])
 
The result of the function depends on the affected data blocks and the order of data in the block.
If you make a subquery with ORDER BY and call the function from outside the subquery, you can get the expected result.
 
Parameters
 
column — A column name or scalar expression.
offset — The number of rows forwards or backwards from the current row of column. Int64.
default_value — Optional. The value to be returned if offset goes beyond the scope of the block. Type of data blocks affected.
Returned values
 
Value for column in offset distance from current row if offset value is not outside block bounds.
Default value for column if offset value is outside block bounds. If default_value is given, then it will be used.
Type: type of data blocks affected or default value type.
 
 
参考:
https://clickhouse.tech/docs/en/sql-reference/functions/other-functions/

代码如下

SELECT 
customer_id
,loan_id
,loan_dt
,IF(end_date=toDate('2099-12-31'),null,end_date) as end_dt
,due_days
,row_number 
,dense_rank
,uniq_rank
,if(neighbor(row_number , 1)<>1,neighbor(loan_dt , 1),null)  as lead_loan_dt
,if(row_number<>1,neighbor(end_dt, -1),null)  as lag_end_dt
from 
(
SELECT     
customer_id ,     
groupArray(loan_dt) AS loan_dt, 
groupArray(ifnull(end_date,toDate('2099-12-31'))) AS end_date,  
groupArray(due_days) AS due_days,    
groupArray(loan_id) AS loan_id,    
groupArray(loan_tot) AS loan_tot, 
arrayEnumerate(loan_id) AS row_number,     
arrayEnumerateDense(loan_id) AS dense_rank,     
arrayEnumerateUniq(loan_id) AS uniq_rank
FROM (    
    SELECT   *  FROM res_report.xxx_loan     ORDER BY loan_dt ,loan_id  
)
GROUP BY customer_id
)
ARRAY JOIN  
    loan_dt,
    loan_id, 
    end_date,
    due_days,
    row_number,     
    dense_rank,     
    uniq_rank
ORDER BY     
customer_id ASC,     
row_number ASC ,     
dense_rank ASC

发现有一点问题:就是最后一行lead的时候会出现异常值1970-01-01年值的问题。 

SELECT 
customer_id
,loan_id
,loan_dt
,IF(end_date=toDate('2099-12-31'),null,end_date) as end_dt
,due_days
,row_number as row_num
,dense_rank
,uniq_rank
,if(neighbor(row_num , 1)>1,neighbor(loan_dt , 1),null)  as lead_loan_dt
,if(row_num<>1,neighbor(end_dt, -1),null)  as lag_end_dt
from 
(
SELECT     
customer_id ,     
groupArray(loan_dt) AS loan_dt, 
groupArray(ifnull(end_date,toDate('2099-12-31'))) AS end_date,  
groupArray(due_days) AS due_days,    
groupArray(loan_id) AS loan_id,    
arrayEnumerate(loan_id) AS row_number,     
arrayEnumerateDense(loan_id) AS dense_rank,     
arrayEnumerateUniq(loan_id) AS uniq_rank
FROM (    
    SELECT   *  FROM res_report.ipeso_loan     ORDER BY loan_dt ,loan_id  
)
GROUP BY customer_id
)
ARRAY JOIN  
    loan_dt,
    loan_id, 
    end_date,
    due_days,
    row_number,     
    dense_rank,     
    uniq_rank
ORDER BY     
customer_id ASC,     
row_number ASC ,     
dense_rank ASC

 

posted @ 2020-12-25 17:17  wqbin  阅读(7073)  评论(0编辑  收藏  举报