Spark SQL -- Window Functions and Their Corresponding APIs
1. Types of Window Functions
- ranking functions
- analytic functions
- aggregate functions
Function Type | SQL | DataFrame API | Description |
---|---|---|---|
Ranking | rank | rank | rank values may be non-consecutive: tied rows share a rank and leave gaps after them |
Ranking | dense_rank | dense_rank | rank values are always consecutive (no gaps after ties) |
Ranking | percent_rank | percent_rank | within the partition, (rank - 1) / (row count - 1) |
Ranking | ntile | ntile | splits each partition's ordered rows into n contiguous, near-equal buckets and returns the bucket index, starting at 1 |
Ranking | row_number | row_number | plain sequential row number within the partition, like a row number in Excel |
Analytic | cume_dist | cume_dist | cumulative distribution: the fraction of partition rows whose value is <= the current row's |
Analytic | first_value | first_value | first value in the window frame |
Analytic | last_value | last_value | last value in the window frame (with the default frame, which ends at the current row, this is the current row's value) |
Analytic | lag | lag | value from the row n rows before the current row |
Analytic | lead | lead | value from the row n rows after the current row |
Aggregate | min | min | minimum |
Aggregate | max | max | maximum |
Aggregate | sum | sum | sum |
Aggregate | avg | avg | average |
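To make the ranking semantics in the table concrete, here is a minimal pure-Python sketch (not Spark code) that computes rank, dense_rank, percent_rank, and ntile(n) for one partition the way Spark SQL defines them; the helper name `rankings` and the sample scores are ours, not from any Spark API:

```python
# Pure-Python sketch of the ranking semantics above (not Spark code).
# `scores` is one partition's values, already sorted in the ORDER BY
# direction (descending here, as in the examples later in this post).

def rankings(scores, n_buckets=2):
    rows = len(scores)
    out = []
    distinct_seen = 0
    for i, s in enumerate(scores):
        # rank: 1 + number of rows strictly "before" in sort order,
        # so ties share a rank and leave a gap after them
        rank = 1 + sum(1 for t in scores if t > s)
        if i == 0 or s != scores[i - 1]:
            distinct_seen += 1
        dense = distinct_seen                       # dense_rank: no gaps
        pct = (rank - 1) / (rows - 1) if rows > 1 else 0.0
        out.append((rank, dense, pct))
    # ntile: split the ordered rows into n contiguous, near-equal buckets;
    # earlier buckets get the extra rows when the split is uneven
    base, extra = divmod(rows, n_buckets)
    tiles = []
    for b in range(1, n_buckets + 1):
        tiles += [b] * (base + (1 if b <= extra else 0))
    return [row + (tiles[i],) for i, row in enumerate(out)]

# one partition with a tie, like lesson "E" in the example below
print(rankings([100, 99, 99, 98], n_buckets=2))
# → [(1, 1, 0.0, 1), (2, 2, 0.3333333333333333, 1),
#    (2, 2, 0.3333333333333333, 2), (4, 3, 1.0, 2)]
```

Note that ntile assigns contiguous runs of ordered rows to each bucket; it does not deal rows round-robin.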
2. Usage
count(...) over(partition by ... order by ...) -- row count within the partition.
sum(...) over(partition by ... order by ...) -- sum within the partition.
max(...) over(partition by ... order by ...) -- maximum within the partition.
min(...) over(partition by ... order by ...) -- minimum within the partition.
avg(...) over(partition by ... order by ...) -- average within the partition.
rank() over(partition by ... order by ...) -- rank; values may be non-consecutive.
dense_rank() over(partition by ... order by ...) -- rank; values are consecutive.
first_value(...) over(partition by ... order by ...) -- first value in the partition.
last_value(...) over(partition by ... order by ...) -- last value in the window frame (with the default frame this is the current row's value).
lag(..., n) over(partition by ... order by ...) -- value from n rows before the current row.
lead(..., n) over(partition by ... order by ...) -- value from n rows after the current row.
ratio_to_report(...) over(partition by ...) -- the argument is the numerator and the partition total is the denominator. Note this is an Oracle function; Spark SQL does not provide it, but x / sum(x) over(partition by ...) computes the same ratio.
percent_rank() over(partition by ... order by ...) -- (rank - 1) / (partition row count - 1).
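The analytic functions above are easy to misread, so here is a pure-Python sketch of their semantics for one ordered partition (not Spark code; the `lag`/`lead` helpers below are our illustrations of the behavior, not the Spark API):

```python
# Pure-Python sketch of lag / lead / first_value / last_value semantics
# for a single partition, already sorted by the ORDER BY key.

def lag(col, n=1, default=None):
    """Each row sees the value n rows earlier; the first n rows get default."""
    return [default] * n + col[:-n] if n else col[:]

def lead(col, n=1, default=None):
    """Each row sees the value n rows later; the last n rows get default."""
    return col[n:] + [default] * n if n else col[:]

scores = [100, 99, 99, 98]      # one partition, ordered by score desc
print(lag(scores))              # [None, 100, 99, 99]
print(lead(scores))             # [99, 99, 98, None]
print(scores[0])                # first_value over the whole partition
print(scores[-1])               # last_value with an UNBOUNDED FOLLOWING frame
```

Caveat: in Spark SQL the default frame runs from the start of the partition to the current row, so last_value returns the current row's value unless you widen the frame (e.g. `rows between unbounded preceding and unbounded following`).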
3. A Worked Example
Sample data (/root/score.json/score.json): student name, lesson, and score
{"name":"A","lesson":"Math","score":100}
{"name":"B","lesson":"Math","score":100}
{"name":"C","lesson":"Math","score":99}
{"name":"D","lesson":"Math","score":98}
{"name":"A","lesson":"E","score":100}
{"name":"B","lesson":"E","score":99}
{"name":"C","lesson":"E","score":99}
{"name":"D","lesson":"E","score":98}
select name, lesson, score,
       ntile(2)       over (partition by lesson order by score desc) as ntile_2,
       ntile(3)       over (partition by lesson order by score desc) as ntile_3,
       row_number()   over (partition by lesson order by score desc) as row_number,
       rank()         over (partition by lesson order by score desc) as rank,
       dense_rank()   over (partition by lesson order by score desc) as dense_rank,
       percent_rank() over (partition by lesson order by score desc) as percent_rank
from score
order by lesson, name, score
The SQL and the equivalent DataFrame API calls produce exactly the same output, shown in the table below:
name | lesson | score | ntile_2 | ntile_3 | row_number | rank | dense_rank | percent_rank |
---|---|---|---|---|---|---|---|---|
A | E | 100 | 1 | 1 | 1 | 1 | 1 | 0.0 |
B | E | 99 | 1 | 1 | 2 | 2 | 2 | 0.3333333333333333 |
C | E | 99 | 2 | 2 | 3 | 2 | 2 | 0.3333333333333333 |
D | E | 98 | 2 | 3 | 4 | 4 | 3 | 1.0 |
A | Math | 100 | 1 | 1 | 1 | 1 | 1 | 0.0 |
B | Math | 100 | 1 | 1 | 2 | 1 | 1 | 0.0 |
C | Math | 99 | 2 | 2 | 3 | 3 | 2 | 0.6666666666666666 |
D | Math | 98 | 2 | 3 | 4 | 4 | 3 | 1.0 |
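As a sanity check on the table above, the ranking columns can be re-derived outside Spark. The sketch below (pure Python, not Spark) groups the sample JSON lines by lesson, orders by score descending, and recomputes row_number, rank, and percent_rank:

```python
# Re-derive row_number / rank / percent_rank for the sample data in pure
# Python, mirroring: over (partition by lesson order by score desc).

import json
from collections import defaultdict

raw = """
{"name":"A","lesson":"Math","score":100}
{"name":"B","lesson":"Math","score":100}
{"name":"C","lesson":"Math","score":99}
{"name":"D","lesson":"Math","score":98}
{"name":"A","lesson":"E","score":100}
{"name":"B","lesson":"E","score":99}
{"name":"C","lesson":"E","score":99}
{"name":"D","lesson":"E","score":98}
"""

partitions = defaultdict(list)
for line in raw.strip().splitlines():
    row = json.loads(line)
    partitions[row["lesson"]].append(row)

result = {}
for lesson, rows in partitions.items():
    rows.sort(key=lambda r: -r["score"])        # order by score desc
    n = len(rows)
    for i, r in enumerate(rows):
        rank = 1 + sum(1 for o in rows if o["score"] > r["score"])
        result[(r["name"], lesson)] = {
            "row_number": i + 1,
            "rank": rank,
            "percent_rank": (rank - 1) / (n - 1),
        }

print(result[("C", "Math")])
# → {'row_number': 3, 'rank': 3, 'percent_rank': 0.6666666666666666}
```

The recomputed values match the table; the only nondeterminism is row_number among tied scores (A and B in Math), where any order is valid.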
Reference: 程序媛说事儿, https://www.cnblogs.com/abc8023/p/10910741.html