ORACLE高级SQL之分析函数
首先来看一个例子:
--删除emp表 drop table emp purge; --创建emp表 CREATE TABLE emp ( emp_id NUMBER(6), ename VARCHAR2(45), dept_id NUMBER(4), hire_date DATE, sal NUMBER(8,2) ); --创建emp数据 INSERT INTO emp (emp_id, ename, dept_id, hire_date, sal) VALUES (101, 'Tom', 20, TO_DATE('21-09-1989', 'DD-MM-YYYY'), 2000); INSERT INTO emp (emp_id, ename, dept_id, hire_date, sal) VALUES (102, 'Mike', 20, TO_DATE('13-01-1993', 'DD-MM-YYYY'), 8000); INSERT INTO emp (emp_id, ename, dept_id, hire_date, sal) VALUES (120, 'John', 50, TO_DATE('18-07-1996', 'DD-MM-YYYY'), 1000); INSERT INTO emp (emp_id, ename, dept_id, hire_date, sal) VALUES (121, 'Joy', 50, TO_DATE('10-04-1997', 'DD-MM-YYYY'), 4000); INSERT INTO emp (emp_id, ename, dept_id, hire_date, sal) VALUES (122, 'Rich', 50, TO_DATE('01-05-1995', 'DD-MM-YYYY'), 3000); INSERT INTO emp (emp_id, ename, dept_id, hire_date, sal) VALUES (123, 'Kate', 50, TO_DATE('10-10-1997', 'DD-MM-YYYY'), 5000); INSERT INTO emp (emp_id, ename, dept_id, hire_date, sal) VALUES (124, 'Jess', 50, TO_DATE('16-11-1999', 'DD-MM-YYYY'), 6000); INSERT INTO emp (emp_id, ename, dept_id, hire_date, sal) VALUES (100, 'Stev', 10, TO_DATE('01-01-1990', 'DD-MM-YYYY'), 7000); COMMIT;
查询结果:
SELECT
emp_id,ename,dept_id,hire_date,sal,
SUM(sal) OVER (PARTITION BY dept_id ORDER BY hire_date) sum_sal,
SUM(sal) OVER (PARTITION BY dept_id ) sum_sal2,
SUM(sal) OVER ( ) sum_sal3
FROM emp;
SELECT
emp_id,ename,dept_id,hire_date,sal,
SUM(sal) OVER (PARTITION BY dept_id ORDER BY hire_date) sum_sal1,
SUM(sal) OVER (PARTITION BY dept_id ORDER BY hire_date DESC) sum_sal2, --desc是从下往上累加
SUM(sal) OVER (PARTITION BY dept_id ORDER BY hire_date DESC nulls LAST) sum_sal3,
SUM(sal) OVER (PARTITION BY dept_id ) sum_sal4,
SUM(sal) OVER ( ) sum_sal5
FROM emp;
查询结果:
开窗函数
1、开窗函数之ROWS:
SELECT emp_id,ename,dept_id,hire_date,sal, -- 以下均为首先按dept_id进行分组,其次按照hire_date进行排序,且所有统计不能跨越其所在分区,故不再重复 -- 窗口范围为该分区的第一行到该分区的最后一行,与sum_sal_part等同 SUM(sal) OVER (PARTITION BY dept_id ORDER BY hire_date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) sum_1_to_last, -- 窗口范围为该分区的第一行到本行,与sum_sal_part_order等同 SUM(sal) OVER (PARTITION BY dept_id ORDER BY hire_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) sum_1_to_cur, -- 窗口范围为该分区的第一行到本行前一行,统计的是第一行到本行前一行薪资的累计 SUM(sal) OVER (PARTITION BY dept_id ORDER BY hire_date ROWS BETWEEN UNBOUNDED PRECEDING AND 1/*value_expr*/ PRECEDING) sum_1_to_curbef1, -- 窗口范围为该分区的第一行到本行后一行,统计的是第一行到本行后一行薪资的累计 SUM(sal) OVER (PARTITION BY dept_id ORDER BY hire_date ROWS BETWEEN UNBOUNDED PRECEDING AND 1 FOLLOWING) sum_1_to_curaft1 FROM emp order by dept_id,hire_date;
查询结果:
2、开窗函数之RANGE:
RANGE窗口中只能对NUMBER、DATE起作用。
RANGE窗口中,ORDER BY中只能有一列;但是在ROWS窗口中,ORDER BY中可以有多列。
SELECT emp_id,ename,dept_id,hire_date,sal, -- 后面均为以dept_id分组,再按hire_date排序,且所有统计不能跨分区,由于是逻辑范围,因此PRECEDING和FOLLOWING表达式有符号 -- 窗口范围为该分区的第一行到该分区的最后一行,与sum_sal_part等同,在非条件表达式中等同于ROWS SUM(sal) OVER (PARTITION BY dept_id ORDER BY sal RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) sum_1_to_last, -- 窗口范围为该分区的第一行到本行,与sum_sal_part_order等同,在非条件表达式中等同于ROWS SUM(sal) OVER (PARTITION BY dept_id ORDER BY sal RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) sum_1_to_cur, -- 窗口范围为该分区内小于(当前行sal - 2500)的所有的薪资累计 SUM(sal) OVER (PARTITION BY dept_id ORDER BY sal RANGE BETWEEN UNBOUNDED PRECEDING AND 2500/*value_expr*/ PRECEDING) sum1, -- 窗口范围为该分区内小于(当前行sal + 2500)的所有薪资累计 SUM(sal) OVER (PARTITION BY dept_id ORDER BY sal RANGE BETWEEN UNBOUNDED PRECEDING AND 2500/*value_expr*/ FOLLOWING) sum2 FROM emp;
查询结果:
3、开窗函数之KEEP
KEEP为聚合函数的特殊关键字。(弄不明白这个函数)
当聚合函数MIN、MAX、SUM、AVG、COUNT、VARIANCE和STDDEV使用KEEP。
当KEEP和DENSE_RANK FIRST/DENSE_RANK LAST 一起使用时,获取一组中排名第一或者排名最后的记录。必须由order by子句来排序。后面也可以接over()分析函数部分。
Min(col2) keep (dense_rank first order by coll)保留按coll排名第一的col2的最小值。
Min(col2) keep (dense_rank first order by co11) over (partition by co 13 )按col3分组保留按coll排名各组第一的col2 的最小值。
SELECT emp_id,ename,dept_id,hire_date,sal,
DENSE_RANK() OVER(PARTITION BY dept_id ORDER BY sal) DENSE_RANK,
MIN(hire_date) KEEP (DENSE_RANK FIRST ORDER BY sal) OVER(PARTITION BY dept_id) min_first,
MIN(hire_date) KEEP (DENSE_RANK LAST ORDER BY sal) OVER(PARTITION BY dept_id) min_last,
MAX(hire_date) KEEP (DENSE_RANK FIRST ORDER BY sal) OVER(PARTITION BY dept_id) max_first,
MAX(hire_date) KEEP (DENSE_RANK LAST ORDER BY sal) OVER(PARTITION BY dept_id) max_last
FROM emp;
查询结果:
函数部分
1、排名函数:ROW_NUMBER、RANK、DENSE_RANK
这三个函数都是根据PARTITION BY后面的字段分组,然后根据ORDER BY后面的字段进行排序后再进行排名
SELECT
emp_id,ename,dept_id,hire_date,sal,
ROW_NUMBER() OVER (PARTITION BY dept_id ORDER BY sal) AS row_number,
RANK() OVER (PARTITION BY dept_id ORDER BY sal) AS rank,
DENSE_RANK() OVER (PARTITION BY dept_id ORDER BY sal) AS dense_rank
FROM emp;
查询结果:
SELECT
emp_id,ename,dept_id,hire_date,sal,
MIN(sal) KEEP (DENSE_RANK FIRST ORDER BY hire_date) OVER (PARTITION BY dept_id) Worst,
MAX(sal) KEEP (DENSE_RANK LAST ORDER BY hire_date) OVER (PARTITION BY dept_id) Best
FROM emp;
查询结果:
2、FIRST_VALUE与LAST_VALUE
FIRST_VALUE 返回一个排序数据集合的第一行,而LAST_VALUE 则返回一个排序数据集合的最后一行。
SELECT
emp_id,ename,dept_id,hire_date,sal,
FIRST_VALUE(ename) OVER(PARTITION BY dept_id ORDER BY sal ) AS fir_val,
FIRST_VALUE(ename) OVER(PARTITION BY dept_id ORDER BY sal DESC) AS fir_val_desc,
LAST_VALUE(ename) OVER(PARTITION BY dept_id ORDER BY sal ) AS last_val,
LAST_VALUE(ename) OVER(PARTITION BY dept_id ORDER BY sal DESC) AS last_val_desc
FROM emp;
查询结果:
3、LAG与LEAD
lag与lead函数是跟偏移量相关的两个分析函数,通过这两个函数可以在一次查询中取出同一字段的前N行的数据(lag)和后N行的数据(lead)作为独立的列,从而更方便地进行进行数据过滤。这种操作可以代替表的自联接,并且LAG和LEAD有更高的效率。
over()表示 lag()与lead()操作的数据都在over()的范围内,它里面可以使用partition by语句(用于分组) order by 语句(用于排序)。partition by a order by b表示以a字段进行分组,再以b字段进行排序,对数据进行查询。
例如:lead(field, num, defaultvalue) field需要查找的字段,num往后查找的num行的数据,defaultvalue没有符合条件的默认值。
SELECT emp_id,ename,dept_id,hire_date,sal, LAG(sal) OVER (ORDER BY hire_date) AS prev_sal1, LEAD(sal) OVER (ORDER BY hire_date) AS next_sal1, LAG(sal, 1, 0) OVER (ORDER BY hire_date) AS prev_sal2, LEAD(sal, 1,0) OVER (ORDER BY hire_date) AS next_sal2, LAG(sal, 1, 0) OVER (partition BY dept_id ORDER BY hire_date) AS prev_sal3, LEAD(sal, 1,0) OVER (partition BY dept_id ORDER BY hire_date) AS next_sal3, LAG(sal, 2, 999) OVER (partition BY dept_id ORDER BY hire_date) AS prev_sal4, LEAD(sal, 2,999) OVER (partition BY dept_id ORDER BY hire_date) AS next_sal4 FROM emp;
查询结果:
案例分析
1、排名论次
环境准备:
drop table emp purge; CREATE TABLE emp ( emp_id NUMBER(6), ename VARCHAR2(45), dept_id NUMBER(4), hire_date DATE, sal NUMBER(8,2) ); --创建emp数据 INSERT INTO emp (emp_id, ename, dept_id, hire_date, sal) VALUES (101, 'Tom', 20, TO_DATE ('21-09-1989', 'DD-MM-YYYY'), 2000); INSERT INTO emp (emp_id, ename, dept_id, hire_date, sal) VALUES (102, 'Mike', 20, TO_DATE ('13-01-1993', 'DD-MM-YYYY'), 8000); INSERT INTO emp (emp_id, ename, dept_id, hire_date, sal) VALUES (120, 'John', 50, TO_DATE ('18-07-1996', 'DD-MM-YYYY'), 1000); INSERT INTO emp (emp_id, ename, dept_id, hire_date, sal) VALUES (121, 'Joy', 50, TO_DATE ('10-04-1997', 'DD-MM-YYYY'), 1000); INSERT INTO emp (emp_id, ename, dept_id, hire_date, sal) VALUES (122, 'Rich', 50, TO_DATE ('01-05-1995', 'DD-MM-YYYY'), 4000); INSERT INTO emp (emp_id, ename, dept_id, hire_date, sal) VALUES (123, 'Kate', 50, TO_DATE ('10-10-1997', 'DD-MM-YYYY'), 4000); INSERT INTO emp (emp_id, ename, dept_id, hire_date, sal) VALUES (124, 'Jess', 50, TO_DATE ('16-11-1999', 'DD-MM-YYYY'), 7000); INSERT INTO emp (emp_id, ename, dept_id, hire_date, sal) VALUES (100, 'Stev', 10, TO_DATE ('01-01-1990', 'DD-MM-YYYY'), 7000); COMMIT;
SQL> select * from emp;
EMP_ID ENAME DEPT_ID HIRE_DATE SAL
------- ----- ------- ----------- ----------
101 Tom 20 1989-09-21 2000.00
102 Mike 20 1993-01-13 8000.00
120 John 50 1996-07-18 1000.00
121 Joy 50 1997-04-10 1000.00
122 Rich 50 1995-05-01 4000.00
123 Kate 50 1997-10-10 4000.00
124 Jess 50 1999-11-16 7000.00
100 Stev 10 1990-01-01 7000.00
用普通语句查询每个部门内收入最低的人:
WITH t as (SELECT dept_id, min(sal) as min_sal FROM emp GROUP BY dept_id) select emp.emp_id, emp.ename, emp.dept_id, emp.hire_date,emp.sal from emp, t where emp.dept_id = t.dept_id and emp.sal = t.min_sal; 查询结果: EMP_ID ENAME DEPT_ID HIRE_DATE SAL ------- ----- ------- ----------- ---------- 101 Tom 20 1989-09-21 2000.00 121 Joy 50 1997-04-10 1000.00 120 John 50 1996-07-18 1000.00 100 Stev 10 1990-01-01 7000.00
用dense_rand分析函数查询每个部门内收入最低的人(效率更高):
SELECT emp_id, ename, dept_id, hire_date,sal FROM (SELECT emp.*, dense_rank() OVER(PARTITION BY dept_id ORDER BY sal ) AS N FROM emp) WHERE N = 1; 查询结果: EMP_ID ENAME DEPT_ID HIRE_DATE SAL ------- ----- ------- ----------- ---------- 100 Stev 10 1990-01-01 7000.00 101 Tom 20 1989-09-21 2000.00 120 John 50 1996-07-18 1000.00 121 Joy 50 1997-04-10 1000.00
2、数据去重
环境准备:
DROP TABLE t purge ; CREATE TABLE t AS SELECT * FROM dba_objects WHERE rownum<=10; UPDATE t SET object_id=rownum; UPDATE t SET object_id=3 WHERE object_id<=3; UPDATE t SET object_id=4 WHERE object_id>=4 AND object_id<=6; COMMIT;
普通的去重写法(随便删除,保留rowid最大的一条):
delete from t where rowid < (select max(rowid) from t t2 where t.object_id = t2.object_id );
数据去重的分析函数写法(保留最新的一条记录):
delete t where rowid in (select rid from (select rowid rid, row_number() over(partition by object_id order by created desc) rn from t) where rn > 1);
日常在处理数据去重的时候,如果数据量太大,可以考虑先把不重复的数据移出到另外一个表中,然后只针对重复数据进行处理。
3、占比应用 ratio_to_report
SELECT
emp_id,ename,dept_id,hire_date,sal,
ratio_to_report(sal) OVER () as pct1l,
ratio_to_report(sal) OVER (partition by dept_id) as pct2
FROM emp;
查询结果:
4、连续值判定
环境准备:
drop table t purge; create table t (id1 int,id2 int ,id3 int); insert into t (id1 ,id2,id3) values (1,45,89); insert into t (id1 ,id2,id3) values (2,45,89); insert into t (id1 ,id2,id3) values (3,45,89); insert into t (id1 ,id2,id3) values (8,45,89); insert into t (id1 ,id2,id3) values (12,45,89); insert into t (id1 ,id2,id3) values (36,45,89); insert into t (id1 ,id2,id3) values (22,45,89); insert into t (id1 ,id2,id3) values (23,45,89); insert into t (id1 ,id2,id3) values (89,45,89); insert into t (id1 ,id2,id3) values (92,45,89); insert into t (id1 ,id2,id3) values (91,45,89); insert into t (id1 ,id2,id3) values (90,45,89); commit; SQL> select * from t; ID1 ID2 ID3 ---- ---- ---- 1 45 89 2 45 89 3 45 89 8 45 89 12 45 89 36 45 89 22 45 89 23 45 89 89 45 89 92 45 89 91 45 89 90 45 89
需要实现的需求:
1、将ID1连续的数据查找出来,效果如下:
ID1 ID2 ID3 ---- ---- ---- 1 45 89 2 45 89 3 45 89 22 45 89 23 45 89 89 45 89 90 45 89 91 45 89 92 45 89
2、要求查出连续数据,并且要算出出最小值和最大值及连续的个数,效果如下:
1 3 3 22 23 2 89 92 4
实现方法:
1、先看如下sql的查询结果,使用分析函数ROW_NUMBER已经把连续的数据找出来了,分成3个组。
SELECT id1, id2, id3, ROW_NUMBER() OVER(ORDER BY id1) - ID1 AS group_id FROM t; ID1 ID2 ID3 GROUP_ID ---- ---- ---- ---------- 1 45 89 0 2 45 89 0 3 45 89 0 8 45 89 -4 12 45 89 -7 22 45 89 -16 23 45 89 -16 36 45 89 -28 89 45 89 -80 90 45 89 -80 91 45 89 -80 92 45 89 -80
2、需求1的实现方法:
select id1, id2, id3 from (select id1, id2, id3, count(*) over(partition by group_id) cnt from (select id1, id2, id3, row_number() over(order by id1) - id1 as group_id from t) ) where cnt > 1 order by id1; 查询结果: ID1 ID2 ID3 ---- ---- ---- 1 45 89 2 45 89 3 45 89 22 45 89 23 45 89 89 45 89 90 45 89 91 45 89 92 45 89
3、需求2的实现方法:
select min(id1), max(id1), count(*) from (select id1, id2, id3, row_number() over(order by id1) - id1 as group_id from t) having count(*) > 1 group by group_id order by 1; 查询结果: MIN(ID1) MAX(ID1) COUNT(*) ---------- ---------- ---------- 1 3 3 22 23 2 89 92 4
5、高频数获取
原始数据查询结果:
SELECT sal,COUNT(*) repeat_num FROM emp GROUP BY sal; 查询结果: SAL REPEAT_NUM ---------- ---------- 1000.00 1 4000.00 4 --4000这个数值出现了4次,出现频率最高 2000.00 1 8000.00 1 7000.00 1
普通写法查找:
select sal from (select sal, count(*) as repeat_num from emp group by sal) t where t.repeat_num = (select max(repeat_num) from (select sal, count(*) as repeat_num from emp group by sal)); 查询结果: SAL ---------- 4000.00
分析函数查找:
select sal from (select sal, rank() over(order by repeat_num desc) rank_repeat_num from (select sal, count(*) repeat_num from emp group by sal) ) where rank_repeat_num = 1; 查询结果: SAL ---------- 4000.00