greenplum窗口函数使用浅析

最近处于系统不活跃期，没怎么升级，因此有了时间可以对整个ETL系统在稳定的基础上进行优化。每天列出TOP 10 COST TIME JOB进行分析，其中TOP1 COSTTIME JOB采用了窗口函数first_value和last_value,结果SQL全部使用的是first_value,并且为了全部使用first_value,对窗口函数进行了二次排序。通过explain这段代码，可以发现两次sort消耗的时候大概是一次sort的1.7倍，把sort二次改进成一次，并且把SQL从datastage迁移到greenplum的function里面，整个过程由24分钟降至40秒。

分析为什么只使用first_value的写法，且为了完全使用first_value不惜二次sort,原因如下：

WINDOW CALL: WINDOW window_name (partition by xxx order by xxx)
在使用first_value,last_value的时候，partition和order by会对得出的结果有影响。
分析函数包含三个分析子句：partition by,order by ,window.window里面rows的方式如下：unbounded preceding(第一行）,current row（当前行）,unbounded following（最后一行）
（1）语句为(partition by xxx order by xxx)，则默认的窗口为 unbounded preceding and current row
（2）语句为(partition by xxx), 窗口默认为unbounded preceding and unbounded following
（3）语句为()，窗口默认为全体结果集。
可能出现的问题就是语句使用第一种方式的时候。
测试环境如下：
create table windowcheck
(
oc_date integer ,
city     varchar(50),
id       integer,
sale     integer
);
select *from windowcheck;
oc_date | city | id | sale
----------+------+------+-------
20120701 | bj   | 3299 | 10040
20120701 | cs   | 3210 | 7100
20120701 | nj   | 3300 | 8900
20120701 | nj   | 3301 | 9000
20120701 | tj   | 3303 | 3890
20120701 | wh   | 3302 | 4700
select oc_date,city,id,sale,last_value(sale) over (partition by oc_date order by city) last_value
from windowcheck;
oc_date | city | id | sale | last_value
----------+------+------+-------+------------
20120701 | bj   | 3299 | 10040 |      10040
20120701 | cs   | 3210 | 7100 |       7100
20120701 | nj   | 3301 | 9000 |       8900
20120701 | nj   | 3300 | 8900 |       8900
20120701 | tj   | 3303 | 3890 |       3890
20120701 | wh   | 3302 | 4700 |       4700
问题出来了：我们通过oc_date进行分区，对city进行排序，得出的结果集最后一个值为wh,其sale值为4700.那我们原来的想法就是结果集所有的last_value应该都为4700。那么问题出在哪个地方呢？问题出在之前写的窗口的范围上了。有partition与order的默认的窗口为 unbounded preceding and current row。此时，读取第一条的时候，窗口范围仅自己一条，其last_value值为10040，读取第二条的时候，窗口范围为第一条与自己这一条，那么得出last_value值为7100,以此类推.解决方案就是把窗口范围扩大些。语句如下：
select oc_date,city,id,sale,last_value(sale) over (partition by oc_date order by city rows between unbounded preceding and unbounded following) last_value
from windowcheck;
到这里，把窗口范围大小会引起的歧义解决了。还有一个疑惑是：如果在使用分区函数的时候，这个SQL语句本身也进行排序会怎么样？因为SQL语句的结果集的顺序会影响ast_value或者first_value的值。这里就要分析整个语句的先后执行顺序了：分析函数是在整个SQL查询结束后(SQL语句中的ORDER BY的执行比较特殊)再进行的操作, 也就是说
SQL语句中的ORDER BY也会影响分析函数的执行结果。分析语句如下：
--产生结果集的SQL语句的order by与分区函数里面的order by是一致的
select oc_date,city,id,sale,last_value(sale) over (partition by oc_date order by city rows between unbounded preceding and unbounded following) last_value
from windowcheck order by city;
oc_date | city | id | sale | last_value
----------+------+------+-------+------------
20120701 | bj   | 3299 | 10040 |       4700
20120701 | cs   | 3210 | 7100 |       4700
20120701 | nj   | 3301 | 9000 |       4700
20120701 | nj   | 3300 | 8900 |       4700
20120701 | tj   | 3303 | 3890 |       4700
20120701 | wh   | 3302 | 4700 |       4700
--产生结果集的SQL语句的order by与分区函数里面的order by不一致
select oc_date,city,id,sale,last_value(sale) over (partition by oc_date order by city rows between unbounded preceding and unbounded following) last_value
from windowcheck order by city desc;
oc_date | city | id | sale | last_value
----------+------+------+-------+------------
20120701 | wh   | 3302 | 4700 |       4700
20120701 | tj   | 3303 | 3890 |       4700
20120701 | nj   | 3301 | 9000 |       4700
20120701 | nj   | 3300 | 8900 |       4700
20120701 | cs   | 3210 | 7100 |       4700
20120701 | bj   | 3299 | 10040 |       4700
由以上二个语句，可以分析出：产生结果集的SQL语句里面的order by不会对分区函数里面的结果造成影响，原因是如果结果集的order by与分区函数里面的不一致时，先使用分区函数里面的order by进行结果运算，然后再执行结果集里面的order by.如果一致，则分区函数分析时不需要排序。
GP里面使用WINDOW的另外方式
select oc_date,city,id,sale,last_value(sale) over (w) last_value
from windowcheck
where oc_date=20120701
WINDOW w as (partition by oc_date order by city rows between unbounded preceding and unbounded following);

posted on 2012-08-17 21:36 gobird 阅读(3066) 评论(0) 收藏举报

刷新页面返回顶部

greenplum窗口函数使用浅析

导航

公告