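Notes on using SQL navigation functions
SQL:2011 extended window functions with a new category called navigation functions, which further enriches what can be computed over a window.
This post records a few of the more commonly used navigation functions: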
- FIRST_VALUE
- LAST_VALUE
- NTH_VALUE
- LEAD
- LAG
Below I go through how to use each of them with examples.
FIRST_VALUE/LAST_VALUE/NTH_VALUE
First we create a test dataset; all the later examples use it.
WITH finishers AS (
  SELECT 'Sophia Liu' AS name, TIMESTAMP '2016-10-18 2:51:45' AS finish_time, 'F30-34' AS division
  UNION ALL SELECT 'Lisa Stelzner', TIMESTAMP '2016-10-18 2:54:11', 'F35-39'
  UNION ALL SELECT 'Nikki Leith', TIMESTAMP '2016-10-18 2:59:01', 'F30-34'
  UNION ALL SELECT 'Lauren Matthews', TIMESTAMP '2016-10-18 3:01:17', 'F35-39'
  UNION ALL SELECT 'Desiree Berry', TIMESTAMP '2016-10-18 3:05:42', 'F35-39'
  UNION ALL SELECT 'Suzy Slane', TIMESTAMP '2016-10-18 3:06:24', 'F35-39'
  UNION ALL SELECT 'Jen Edwards', TIMESTAMP '2016-10-18 3:06:36', 'F30-34'
  UNION ALL SELECT 'Meghan Lederer', TIMESTAMP '2016-10-18 3:07:41', 'F30-34'
  UNION ALL SELECT 'Carly Forte', TIMESTAMP '2016-10-18 3:08:58', 'F25-29'
  UNION ALL SELECT 'Lauren Reasoner', TIMESTAMP '2016-10-18 3:10:14', 'F30-34')
Navigation functions are, in the end, an extension of window functions. Here is how they are used:
WITH finishers AS (
  SELECT 'Sophia Liu' AS name, TIMESTAMP '2016-10-18 2:51:45' AS finish_time, 'F30-34' AS division
  UNION ALL SELECT 'Lisa Stelzner', TIMESTAMP '2016-10-18 2:54:11', 'F35-39'
  UNION ALL SELECT 'Nikki Leith', TIMESTAMP '2016-10-18 2:59:01', 'F30-34'
  UNION ALL SELECT 'Lauren Matthews', TIMESTAMP '2016-10-18 3:01:17', 'F35-39'
  UNION ALL SELECT 'Desiree Berry', TIMESTAMP '2016-10-18 3:05:42', 'F35-39'
  UNION ALL SELECT 'Suzy Slane', TIMESTAMP '2016-10-18 3:06:24', 'F35-39'
  UNION ALL SELECT 'Jen Edwards', TIMESTAMP '2016-10-18 3:06:36', 'F30-34'
  UNION ALL SELECT 'Meghan Lederer', TIMESTAMP '2016-10-18 3:07:41', 'F30-34'
  UNION ALL SELECT 'Carly Forte', TIMESTAMP '2016-10-18 3:08:58', 'F25-29'
  UNION ALL SELECT 'Lauren Reasoner', TIMESTAMP '2016-10-18 3:10:14', 'F30-34')
SELECT name,
  FORMAT_TIMESTAMP('%X', finish_time) AS finish_time,
  division,
  FORMAT_TIMESTAMP('%X', fastest_time) AS fastest_time,
  TIMESTAMP_DIFF(finish_time, fastest_time, SECOND) AS delta_in_seconds
FROM (
  SELECT name,
    finish_time,
    division,
    FIRST_VALUE(finish_time)
      OVER (PARTITION BY division ORDER BY finish_time ASC
            ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS fastest_time
  FROM finishers);

+-----------------+-------------+----------+--------------+------------------+
| name            | finish_time | division | fastest_time | delta_in_seconds |
+-----------------+-------------+----------+--------------+------------------+
| Carly Forte     | 03:08:58    | F25-29   | 03:08:58     | 0                |
| Sophia Liu      | 02:51:45    | F30-34   | 02:51:45     | 0                |
| Nikki Leith     | 02:59:01    | F30-34   | 02:51:45     | 436              |
| Jen Edwards     | 03:06:36    | F30-34   | 02:51:45     | 891              |
| Meghan Lederer  | 03:07:41    | F30-34   | 02:51:45     | 956              |
| Lauren Reasoner | 03:10:14    | F30-34   | 02:51:45     | 1109             |
| Lisa Stelzner   | 02:54:11    | F35-39   | 02:54:11     | 0                |
| Lauren Matthews | 03:01:17    | F35-39   | 02:54:11     | 426              |
| Desiree Berry   | 03:05:42    | F35-39   | 02:54:11     | 691              |
| Suzy Slane      | 03:06:24    | F35-39   | 02:54:11     | 733              |
+-----------------+-------------+----------+--------------+------------------+
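The rest is easy to follow, so let's look only at what the FIRST_VALUE line does. It contains some less familiar keywords, which I list and explain here:
OVER: used together with a window function. The OVER clause defines the window: how the rows are partitioned, how each partition is ordered, and which rows form the frame. Numbering functions such as ROW_NUMBER use the window to generate sequence numbers within each partition, while navigation functions pick a value directly from another row of the window.
PARTITION BY: specifies the partition key; one or more keys can be used. PARTITION BY splits the table by the partition key, each partition is a window, and window functions and navigation functions are applied to each partition separately.
ORDER BY: specifies the sort order within a partition. If ORDER BY is not followed by a ROWS/RANGE clause, the frame generally defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
ROWS BETWEEN: applied after the ORDER BY sort and combined with frame-boundary keywords. For example, combined with UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING below, it is written ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING, and the frame runs from the first row of the partition to its last row.
UNBOUNDED PRECEDING: the frame extends from the current row back to the first row of the partition.
UNBOUNDED FOLLOWING: the frame extends from the current row forward to the last row of the partition.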
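To make the effect of the default frame concrete, here is a minimal sketch that compares the same SUM with and without an explicit frame. It runs the SQL in SQLite through Python's sqlite3 module, which is an assumption of this sketch and not the engine the queries in this post were written for; the scores table and its columns are hypothetical.

```python
import sqlite3

# Assumption: SQLite 3.25+ (window function support) stands in for the post's engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (team TEXT, points INTEGER)")  # hypothetical table
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("a", 10), ("a", 20), ("a", 30), ("b", 5), ("b", 15)],
)

query = """
SELECT team, points,
       -- ORDER BY without a frame: RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       SUM(points) OVER (PARTITION BY team ORDER BY points) AS running_sum,
       -- explicit frame covering the whole partition
       SUM(points) OVER (PARTITION BY team ORDER BY points
                         ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS team_sum
FROM scores
ORDER BY team, points
"""
frame_rows = conn.execute(query).fetchall()
for row in frame_rows:
    print(row)
```

With the default frame the sum grows row by row inside each team, while the explicit frame gives every row the total of its team.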
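Important keywords that do not appear in this example:
CURRENT ROW: the current row.
num PRECEDING: the row num rows before the current row. It is typically used as ROWS BETWEEN 10 PRECEDING AND 9 FOLLOWING, which means from 10 rows before the current row to 9 rows after it.
num FOLLOWING: the row num rows after the current row, used in the same way as in the ROWS BETWEEN 10 PRECEDING AND 9 FOLLOWING frame above.
RANGE BETWEEN: we saw ROWS BETWEEN above. ROWS defines a physical window, while RANGE defines a logical one. ROWS behaves the way we usually think: it counts rows and does not depend on the value of the ORDER BY key. RANGE is different: it depends on the current row's value of the ORDER BY key, and the frame boundaries are computed as offsets on that key value.
This can be hard to grasp, so here is an example.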
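Suppose we have a table tmp with a single id column. The input rows are not listed here; the sums below are consistent with tmp holding the ids 1, 1, 3, 6, 6, 6, 7, 8, 9, with the output row for id 9 left out of the listing.
Using the statement

select id,
       sum(id) over (order by id range between 1 preceding and 2 following) range_sum1,
       sum(id) over (order by id rows between 1 preceding and 2 following) rows_sum1
from tmp

we get

+----+------------+-----------+
| id | range_sum1 | rows_sum1 |
+----+------------+-----------+
| 1  | 5          | 5         |
| 1  | 5          | 11        |
| 3  | 3          | 16        |
| 6  | 33         | 21        |
| 6  | 33         | 25        |
| 6  | 33         | 27        |
| 7  | 42         | 30        |
| 8  | 24         | 24        |
+----+------------+-----------+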
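The results start to differ from the second row. Let me explain what happened.
Range: as said above, RANGE is logically tied to the ORDER BY column, which here is id.
Here between 1 preceding and 2 following means id - 1 to id + 2. For the second row the range is 1 - 1 to 1 + 2, i.e. [0, 3]; note that it is a closed interval. It therefore covers the first three rows: range_sum1 = 1 + 1 + 3 = 5.
Rows: ROWS BETWEEN has nothing to do with the key values. We simply take the one row before and the two rows after the current row. For the second row that covers the ids [1, 1, 3, 6], so rows_sum1 = 1 + 1 + 3 + 6 = 11.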
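The RANGE and ROWS sums can be checked with a short sketch. Two things here are assumptions rather than part of the original example: the SQL is run in SQLite (3.28 or newer, needed for RANGE with numeric offsets) through Python's sqlite3 module, and tmp is filled with the ids 1, 1, 3, 6, 6, 6, 7, 8, 9, which is the input inferred from the sums in the text.

```python
import sqlite3

# Assumptions: SQLite 3.28+ stands in for the post's engine, and the ids below
# are inferred from the sums discussed in the text, not taken from a listing.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tmp (id INTEGER)")
conn.executemany("INSERT INTO tmp VALUES (?)", [(i,) for i in (1, 1, 3, 6, 6, 6, 7, 8, 9)])

query = """
SELECT id,
       -- logical frame: every row whose id lies in [id - 1, id + 2]
       SUM(id) OVER (ORDER BY id RANGE BETWEEN 1 PRECEDING AND 2 FOLLOWING) AS range_sum1,
       -- physical frame: one row before and two rows after, whatever their ids
       SUM(id) OVER (ORDER BY id ROWS BETWEEN 1 PRECEDING AND 2 FOLLOWING) AS rows_sum1
FROM tmp
ORDER BY id, rows_sum1  -- second key only makes the print order of tied ids stable
"""
sums = conn.execute(query).fetchall()
for row in sums:
    print(row)
```

Rows that share an id always get the same range_sum1, because their logical frames are identical, while rows_sum1 changes with the position of the row.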
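With the main keywords covered, what the navigation functions do is much easier to understand.
FIRST_VALUE returns the first row of the frame; with the full-partition frame used above, that is the first row of each partition after sorting.
LAST_VALUE returns the last row of the frame; with the same frame, that is the last row of each sorted partition.
NTH_VALUE returns the Nth row of the frame, i.e. the Nth row of each sorted partition here. Here is a separate example of NTH_VALUE.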
WITH finishers AS (
  SELECT 'Sophia Liu' AS name, TIMESTAMP '2016-10-18 2:51:45' AS finish_time, 'F30-34' AS division
  UNION ALL SELECT 'Lisa Stelzner', TIMESTAMP '2016-10-18 2:54:11', 'F35-39'
  UNION ALL SELECT 'Nikki Leith', TIMESTAMP '2016-10-18 2:59:01', 'F30-34'
  UNION ALL SELECT 'Lauren Matthews', TIMESTAMP '2016-10-18 3:01:17', 'F35-39'
  UNION ALL SELECT 'Desiree Berry', TIMESTAMP '2016-10-18 3:05:42', 'F35-39'
  UNION ALL SELECT 'Suzy Slane', TIMESTAMP '2016-10-18 3:06:24', 'F35-39'
  UNION ALL SELECT 'Jen Edwards', TIMESTAMP '2016-10-18 3:06:36', 'F30-34'
  UNION ALL SELECT 'Meghan Lederer', TIMESTAMP '2016-10-18 3:07:41', 'F30-34'
  UNION ALL SELECT 'Carly Forte', TIMESTAMP '2016-10-18 3:08:58', 'F25-29'
  UNION ALL SELECT 'Lauren Reasoner', TIMESTAMP '2016-10-18 3:10:14', 'F30-34')
SELECT name,
  FORMAT_TIMESTAMP('%X', finish_time) AS finish_time,
  division,
  FORMAT_TIMESTAMP('%X', fastest_time) AS fastest_time,
  FORMAT_TIMESTAMP('%X', second_fastest) AS second_fastest
FROM (
  SELECT name,
    finish_time,
    division,
    FIRST_VALUE(finish_time) OVER w1 AS fastest_time,
    NTH_VALUE(finish_time, 2) OVER w1 AS second_fastest
  FROM finishers
  WINDOW w1 AS (
    PARTITION BY division ORDER BY finish_time ASC
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING));

+-----------------+-------------+----------+--------------+----------------+
| name            | finish_time | division | fastest_time | second_fastest |
+-----------------+-------------+----------+--------------+----------------+
| Carly Forte     | 03:08:58    | F25-29   | 03:08:58     | NULL           |
| Sophia Liu      | 02:51:45    | F30-34   | 02:51:45     | 02:59:01       |
| Nikki Leith     | 02:59:01    | F30-34   | 02:51:45     | 02:59:01       |
| Jen Edwards     | 03:06:36    | F30-34   | 02:51:45     | 02:59:01       |
| Meghan Lederer  | 03:07:41    | F30-34   | 02:51:45     | 02:59:01       |
| Lauren Reasoner | 03:10:14    | F30-34   | 02:51:45     | 02:59:01       |
| Lisa Stelzner   | 02:54:11    | F35-39   | 02:54:11     | 03:01:17       |
| Lauren Matthews | 03:01:17    | F35-39   | 02:54:11     | 03:01:17       |
| Desiree Berry   | 03:05:42    | F35-39   | 02:54:11     | 03:01:17       |
| Suzy Slane      | 03:06:24    | F35-39   | 02:54:11     | 03:01:17       |
+-----------------+-------------+----------+--------------+----------------+
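LAST_VALUE has no example of its own here, and it is the function where the frame matters most, so the following is a rough sketch of it. It assumes SQLite via Python's sqlite3 module as a stand-in engine and stores finish_time as zero-padded HH:MM:SS text instead of a TIMESTAMP; both choices are simplifications of this sketch, not part of the queries above.

```python
import sqlite3

# Assumptions: SQLite 3.25+ as a stand-in engine; finish_time kept as sortable text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE finishers (name TEXT, finish_time TEXT, division TEXT)")
conn.executemany("INSERT INTO finishers VALUES (?, ?, ?)", [
    ("Sophia Liu", "02:51:45", "F30-34"),
    ("Lisa Stelzner", "02:54:11", "F35-39"),
    ("Nikki Leith", "02:59:01", "F30-34"),
    ("Lauren Matthews", "03:01:17", "F35-39"),
    ("Desiree Berry", "03:05:42", "F35-39"),
    ("Suzy Slane", "03:06:24", "F35-39"),
    ("Jen Edwards", "03:06:36", "F30-34"),
    ("Meghan Lederer", "03:07:41", "F30-34"),
    ("Carly Forte", "03:08:58", "F25-29"),
    ("Lauren Reasoner", "03:10:14", "F30-34"),
])

query = """
SELECT name, finish_time, division,
       -- default frame ends at the current row, so this is just the row's own time
       LAST_VALUE(finish_time) OVER (PARTITION BY division ORDER BY finish_time)
         AS last_default_frame,
       -- full-partition frame: the slowest time of the division
       LAST_VALUE(finish_time) OVER (PARTITION BY division ORDER BY finish_time
         ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS slowest_time
FROM finishers
ORDER BY division, finish_time
"""
last_rows = conn.execute(query).fetchall()
for row in last_rows:
    print(row)
```

Without the explicit frame, LAST_VALUE echoes each row's own finish_time, which is why the examples in this post spell out ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.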
LEAD/LAG
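LEAD returns the value of a column from a subsequent row in the sorted window. That description is still a bit abstract, so let's look at an example.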
WITH finishers AS (
  SELECT 'Sophia Liu' AS name, TIMESTAMP '2016-10-18 2:51:45' AS finish_time, 'F30-34' AS division
  UNION ALL SELECT 'Lisa Stelzner', TIMESTAMP '2016-10-18 2:54:11', 'F35-39'
  UNION ALL SELECT 'Nikki Leith', TIMESTAMP '2016-10-18 2:59:01', 'F30-34'
  UNION ALL SELECT 'Lauren Matthews', TIMESTAMP '2016-10-18 3:01:17', 'F35-39'
  UNION ALL SELECT 'Desiree Berry', TIMESTAMP '2016-10-18 3:05:42', 'F35-39'
  UNION ALL SELECT 'Suzy Slane', TIMESTAMP '2016-10-18 3:06:24', 'F35-39'
  UNION ALL SELECT 'Jen Edwards', TIMESTAMP '2016-10-18 3:06:36', 'F30-34'
  UNION ALL SELECT 'Meghan Lederer', TIMESTAMP '2016-10-18 3:07:41', 'F30-34'
  UNION ALL SELECT 'Carly Forte', TIMESTAMP '2016-10-18 3:08:58', 'F25-29'
  UNION ALL SELECT 'Lauren Reasoner', TIMESTAMP '2016-10-18 3:10:14', 'F30-34')
SELECT name,
  finish_time,
  division,
  LEAD(name)
    OVER (PARTITION BY division ORDER BY finish_time ASC) AS followed_by
FROM finishers;

+-----------------+-------------+----------+-----------------+
| name            | finish_time | division | followed_by     |
+-----------------+-------------+----------+-----------------+
| Carly Forte     | 03:08:58    | F25-29   | NULL            |
| Sophia Liu      | 02:51:45    | F30-34   | Nikki Leith     |
| Nikki Leith     | 02:59:01    | F30-34   | Jen Edwards     |
| Jen Edwards     | 03:06:36    | F30-34   | Meghan Lederer  |
| Meghan Lederer  | 03:07:41    | F30-34   | Lauren Reasoner |
| Lauren Reasoner | 03:10:14    | F30-34   | NULL            |
| Lisa Stelzner   | 02:54:11    | F35-39   | Lauren Matthews |
| Lauren Matthews | 03:01:17    | F35-39   | Desiree Berry   |
| Desiree Berry   | 03:05:42    | F35-39   | Suzy Slane      |
| Suzy Slane      | 03:06:24    | F35-39   | NULL            |
+-----------------+-------------+----------+-----------------+
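Again, focus on the window part: for each row we take the next row in its partition as the value of followed_by. Because the window is ordered by finish_time, followed_by is always the name of the next finisher in the same division. The last finisher of each division has no next row and gets NULL; the first group, F25-29, contains only one result, so its only row is NULL.
LEAD and LAG also accept an offset, which selects how many rows forward or backward to look, for example: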
WITH finishers AS (
  SELECT 'Sophia Liu' AS name, TIMESTAMP '2016-10-18 2:51:45' AS finish_time, 'F30-34' AS division
  UNION ALL SELECT 'Lisa Stelzner', TIMESTAMP '2016-10-18 2:54:11', 'F35-39'
  UNION ALL SELECT 'Nikki Leith', TIMESTAMP '2016-10-18 2:59:01', 'F30-34'
  UNION ALL SELECT 'Lauren Matthews', TIMESTAMP '2016-10-18 3:01:17', 'F35-39'
  UNION ALL SELECT 'Desiree Berry', TIMESTAMP '2016-10-18 3:05:42', 'F35-39'
  UNION ALL SELECT 'Suzy Slane', TIMESTAMP '2016-10-18 3:06:24', 'F35-39'
  UNION ALL SELECT 'Jen Edwards', TIMESTAMP '2016-10-18 3:06:36', 'F30-34'
  UNION ALL SELECT 'Meghan Lederer', TIMESTAMP '2016-10-18 3:07:41', 'F30-34'
  UNION ALL SELECT 'Carly Forte', TIMESTAMP '2016-10-18 3:08:58', 'F25-29'
  UNION ALL SELECT 'Lauren Reasoner', TIMESTAMP '2016-10-18 3:10:14', 'F30-34')
SELECT name,
  finish_time,
  division,
  LEAD(name, 2)
    OVER (PARTITION BY division ORDER BY finish_time ASC) AS two_runners_back
FROM finishers;

+-----------------+-------------+----------+------------------+
| name            | finish_time | division | two_runners_back |
+-----------------+-------------+----------+------------------+
| Carly Forte     | 03:08:58    | F25-29   | NULL             |
| Sophia Liu      | 02:51:45    | F30-34   | Jen Edwards      |
| Nikki Leith     | 02:59:01    | F30-34   | Meghan Lederer   |
| Jen Edwards     | 03:06:36    | F30-34   | Lauren Reasoner  |
| Meghan Lederer  | 03:07:41    | F30-34   | NULL             |
| Lauren Reasoner | 03:10:14    | F30-34   | NULL             |
| Lisa Stelzner   | 02:54:11    | F35-39   | Desiree Berry    |
| Lauren Matthews | 03:01:17    | F35-39   | Suzy Slane       |
| Desiree Berry   | 03:05:42    | F35-39   | NULL             |
| Suzy Slane      | 03:06:24    | F35-39   | NULL             |
+-----------------+-------------+----------+------------------+
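This example takes the second row after the current one instead of the default first. As before, NULL is shown when there is no such row. Both functions also accept a third argument: a default value that is returned instead of NULL when no row is found.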
WITH finishers AS (
  SELECT 'Sophia Liu' AS name, TIMESTAMP '2016-10-18 2:51:45' AS finish_time, 'F30-34' AS division
  UNION ALL SELECT 'Lisa Stelzner', TIMESTAMP '2016-10-18 2:54:11', 'F35-39'
  UNION ALL SELECT 'Nikki Leith', TIMESTAMP '2016-10-18 2:59:01', 'F30-34'
  UNION ALL SELECT 'Lauren Matthews', TIMESTAMP '2016-10-18 3:01:17', 'F35-39'
  UNION ALL SELECT 'Desiree Berry', TIMESTAMP '2016-10-18 3:05:42', 'F35-39'
  UNION ALL SELECT 'Suzy Slane', TIMESTAMP '2016-10-18 3:06:24', 'F35-39'
  UNION ALL SELECT 'Jen Edwards', TIMESTAMP '2016-10-18 3:06:36', 'F30-34'
  UNION ALL SELECT 'Meghan Lederer', TIMESTAMP '2016-10-18 3:07:41', 'F30-34'
  UNION ALL SELECT 'Carly Forte', TIMESTAMP '2016-10-18 3:08:58', 'F25-29'
  UNION ALL SELECT 'Lauren Reasoner', TIMESTAMP '2016-10-18 3:10:14', 'F30-34')
SELECT name,
  finish_time,
  division,
  LEAD(name, 2, 'Nobody')
    OVER (PARTITION BY division ORDER BY finish_time ASC) AS two_runners_back
FROM finishers;

+-----------------+-------------+----------+------------------+
| name            | finish_time | division | two_runners_back |
+-----------------+-------------+----------+------------------+
| Carly Forte     | 03:08:58    | F25-29   | Nobody           |
| Sophia Liu      | 02:51:45    | F30-34   | Jen Edwards      |
| Nikki Leith     | 02:59:01    | F30-34   | Meghan Lederer   |
| Jen Edwards     | 03:06:36    | F30-34   | Lauren Reasoner  |
| Meghan Lederer  | 03:07:41    | F30-34   | Nobody           |
| Lauren Reasoner | 03:10:14    | F30-34   | Nobody           |
| Lisa Stelzner   | 02:54:11    | F35-39   | Desiree Berry    |
| Lauren Matthews | 03:01:17    | F35-39   | Suzy Slane       |
| Desiree Berry   | 03:05:42    | F35-39   | Nobody           |
| Suzy Slane      | 03:06:24    | F35-39   | Nobody           |
+-----------------+-------------+----------+------------------+
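LAG works the same way, so I won't repeat it here. LAG looks at the rows before the current one and takes the same arguments, including the offset and the default value.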
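For completeness, a minimal LAG sketch on the same runners, showing both the one-argument form and the form with an offset and a default value. As an assumption of this sketch, the SQL runs in SQLite through Python's sqlite3 module and finish_time is stored as zero-padded HH:MM:SS text; the column names preceded_by and two_runners_ahead are made up for illustration.

```python
import sqlite3

# Assumptions: SQLite 3.25+ as a stand-in engine; finish_time kept as sortable text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE finishers (name TEXT, finish_time TEXT, division TEXT)")
conn.executemany("INSERT INTO finishers VALUES (?, ?, ?)", [
    ("Sophia Liu", "02:51:45", "F30-34"),
    ("Lisa Stelzner", "02:54:11", "F35-39"),
    ("Nikki Leith", "02:59:01", "F30-34"),
    ("Lauren Matthews", "03:01:17", "F35-39"),
    ("Desiree Berry", "03:05:42", "F35-39"),
    ("Suzy Slane", "03:06:24", "F35-39"),
    ("Jen Edwards", "03:06:36", "F30-34"),
    ("Meghan Lederer", "03:07:41", "F30-34"),
    ("Carly Forte", "03:08:58", "F25-29"),
    ("Lauren Reasoner", "03:10:14", "F30-34"),
])

query = """
SELECT name, division,
       LAG(name) OVER w AS preceded_by,                   -- previous finisher, NULL if none
       LAG(name, 2, 'Nobody') OVER w AS two_runners_ahead -- two rows back, with a default
FROM finishers
WINDOW w AS (PARTITION BY division ORDER BY finish_time)
ORDER BY division, finish_time
"""
lag_rows = conn.execute(query).fetchall()
for row in lag_rows:
    print(row)
```

The first finisher of each division has nobody in front, so preceded_by is NULL (None in Python) and two_runners_ahead falls back to 'Nobody'.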
Reference:
https://stackoverflow.com/questions/30861919/what-is-rows-unbounded-preceding-used-for-in-teradata
https://cloud.google.com/bigquery/docs/reference/standard-sql/navigation_functions#first_value
https://blog.csdn.net/weixin_42307036/article/details/112381387