三种中位数求法效率比拼
解法1:这是日本人MICK在其著作《SQL进阶教程》里提出的方法:
select avg(distinct salary) from ( select t1.salary from tb_employee t1,tb_employee t2 group by t1.salary having sum(case when t1.salary>=t2.salary then 1 else 0 end)>=count(*)/2 and sum(case when t1.salary<=t2.salary then 1 else 0 end)>=count(*)/2 )
解法2:这是我在https://www.cnblogs.com/xiandedanteng/p/12637767.html 里提出的方法:
select avg(b.salary) from (select a.*,abs(a.seq-a.revseq) as diff from (select id,salary,rank() over (order by salary asc) as seq,rank() over (order by salary desc) as revseq from tb_employee order by salary) a ) b where b.diff=(select min(c.diff) from (select a.*,abs(a.seq-a.revseq) as diff from (select id,salary,rank() over (order by salary asc) as seq,rank() over (order by salary desc) as revseq from tb_employee order by salary) a ) c)
上面把rank函数换成row_number函数也是一样的结果。
解法3:这是我在 https://www.cnblogs.com/xiandedanteng/p/12677633.html 里提出的方法:
select avg(a.salary) from (select id,salary,row_number() over (order by salary asc) as seq,row_number() over (order by salary desc) as revseq,(select ceil(count(*)/2) from tb_employee) as ceil from tb_employee) a where a.seq=a.ceil or a.revseq=a.ceil
上面的row_numbe函数不能换成rank函数,否则查不出结果。
为了比拼效率,我特地制作了一张有两百万随机数据的大表:
表结构:
create table tb_employee( id number(7,0) primary key, name nvarchar2(20) not null, salary integer not null)
充值方法:
insert into tb_employee select rownum,dbms_random.string('*',dbms_random.value(2,20)),dbms_random.value(1000,10000) from dual connect by level<2000001 order by dbms_random.random;
然后发现,执行方法一时搞得sql plus都僵死了,估计是做非等值连接时出现两百万乘以两百万的超大结果集导致的,方法二三倒是可以出来,直觉上感觉方法三更快。
两百万数据时方法二的解释计划:
SQL> select avg(b.salary) from 2 (select a.*,abs(a.seq-a.revseq) as diff from (select id,salary,row_number() over (order by salary asc) as seq,row_number() over (order by salary desc) as revseq from tb_employee order by salary) a ) b 3 where b.diff=(select min(c.diff) from 4 (select a.*,abs(a.seq-a.revseq) as diff from (select id,salary,row_number() over (order by salary asc) as seq,row_number() over (order by salary desc) as revseq from tb_employee order by salary) a ) c); 已用时间: 00: 00: 00.00 执行计划 ---------------------------------------------------------- Plan hash value: 3874635296 ----------------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time | ----------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | 39 | | 34592 (2)| 00:06:56 | | 1 | SORT AGGREGATE | | 1 | 39 | | | | |* 2 | VIEW | | 1916K| 71M| | 17296 (2)| 00:03:28 | | 3 | WINDOW SORT | | 1916K| 47M| 66M| 17296 (2)| 00:03:28 | | 4 | WINDOW SORT | | 1916K| 47M| 66M| 17296 (2)| 00:03:28 | | 5 | TABLE ACCESS FULL | TB_EMPLOYEE | 1916K| 47M| | 3044 (2)| 00:00:37 | | 6 | SORT AGGREGATE | | 1 | 26 | | | | | 7 | VIEW | | 1916K| 47M| | 17296 (2)| 00:03:28 | | 8 | WINDOW SORT | | 1916K| 47M| 66M| 17296 (2)| 00:03:28 | | 9 | WINDOW SORT | | 1916K| 47M| 66M| 17296 (2)| 00:03:28 | | 10 | TABLE ACCESS FULL| TB_EMPLOYEE | 1916K| 47M| | 3044 (2)| 00:00:37 | ----------------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 2 - filter(ABS("A"."SEQ"-"A"."REVSEQ")= (SELECT MIN(ABS("A"."SEQ"-"A"."REVSEQ")) FROM (SELECT "ID" "ID","SALARY" "SALARY",ROW_NUMBER() OVER ( ORDER BY "SALARY") "SEQ",ROW_NUMBER() OVER ( ORDER BY INTERNAL_FUNCTION("SALARY") DESC ) "REVSEQ" FROM "TB_EMPLOYEE" "TB_EMPLOYEE" ORDER BY "SALARY") "A")) Note ----- - dynamic sampling used for this statement (level=2)
两百万数据时方法三的解释计划:
SQL> select avg(a.salary) from 2 (select 3 id,salary,row_number() over (order by salary asc) as seq,row_number() over (order by salary desc) as revseq,(select ceil(count(*)/2) from tb_employee) as ceil 4 from tb_employee) a 5 where a.seq=a.ceil or a.revseq=a.ceil; 已用时间: 00: 00: 00.00 执行计划 ---------------------------------------------------------- Plan hash value: 3541675306 ----------------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time | ----------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | 52 | | 21406 (2)| 00:04:17 | | 1 | SORT AGGREGATE | | 1 | 52 | | | | | 2 | SORT AGGREGATE | | 1 | | | | | | 3 | INDEX FAST FULL SCAN| SYS_C0012264 | 1916K| | | 1504 (1)| 00:00:19 | |* 4 | VIEW | | 1916K| 95M| | 21406 (2)| 00:04:17 | | 5 | WINDOW SORT | | 1916K| 23M| 36M| 21406 (2)| 00:04:17 | | 6 | WINDOW SORT | | 1916K| 23M| 36M| 21406 (2)| 00:04:17 | | 7 | TABLE ACCESS FULL | TB_EMPLOYEE | 1916K| 23M| | 3044 (2)| 00:00:37 | ----------------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 4 - filter("A"."SEQ"="A"."CEIL" OR "A"."REVSEQ"="A"."CEIL") Note ----- - dynamic sampling used for this statement (level=2)
方法三cost比方法二少三分之一多,这与执行时间上的感觉是可以互相印证的。
而跑方法一的效果就有点恐怖了:
SQL> select avg(distinct salary) 2 from 3 ( 4 select t1.salary 5 from tb_employee t1,tb_employee t2 6 group by t1.salary 7 having sum(case when t1.salary>=t2.salary then 1 else 0 end)>=count(*)/2 8 and sum(case when t1.salary<=t2.salary then 1 else 0 end)>=count(*)/2 9 ); 已用时间: 00: 00: 00.00 执行计划 ---------------------------------------------------------- Plan hash value: 93677323 ------------------------------------------------------------------------------------------ | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ------------------------------------------------------------------------------------------ | 0 | SELECT STATEMENT | | 1 | 13 | 6315M (9)|999:59:59 | | 1 | SORT AGGREGATE | | 1 | 13 | | | | 2 | VIEW | VM_NWVW_1 | 3671G| 43T| 6315M (9)|999:59:59 | | 3 | HASH GROUP BY | | 3671G| 43T| 6315M (9)|999:59:59 | | 4 | VIEW | | 3671G| 43T| 6315M (9)|999:59:59 | |* 5 | FILTER | | | | | | | 6 | HASH GROUP BY | | 3671G| 86T| 6315M (9)|999:59:59 | | 7 | MERGE JOIN CARTESIAN| | 3671G| 86T| 5829M (2)|999:59:59 | | 8 | TABLE ACCESS FULL | TB_EMPLOYEE | 1916K| 23M| 3044 (2)| 00:00:37 | | 9 | BUFFER SORT | | 1916K| 23M| 6315M (9)|999:59:59 | | 10 | TABLE ACCESS FULL | TB_EMPLOYEE | 1916K| 23M| 3042 (2)| 00:00:37 | ------------------------------------------------------------------------------------------ Predicate Information (identified by operation id): --------------------------------------------------- 5 - filter(SUM(CASE WHEN "T1"."SALARY">="T2"."SALARY" THEN 1 ELSE 0 END )>=COUNT(*)/2 AND SUM(CASE WHEN "T1"."SALARY"<="T2"."SALARY" THEN 1 ELSE 0 END )>=COUNT(*)/2) Note ----- - dynamic sampling used for this statement (level=2)
Cost是6315M!,令人惊讶的数字,难怪会让sql plus跑死。
方案二的运行时间:
SQL> set autotrace off; SQL> select avg(b.salary) from 2 (select a.*,abs(a.seq-a.revseq) as diff from (select id,salary,row_number() over (order by salary asc) as seq,row_number() over (order by salary desc) as revseq from tb_employee order by salary) a ) b 3 where b.diff=(select min(c.diff) from 4 (select a.*,abs(a.seq-a.revseq) as diff from (select id,salary,row_number() over (order by salary asc) as seq,row_number() over (order by salary desc) as revseq from tb_employee order by salary) a ) c); AVG(B.SALARY) ------------- 5498 已用时间: 00: 00: 10.76
方案三的运行时间:
SQL> select avg(a.salary) from 2 (select 3 id,salary,row_number() over (order by salary asc) as seq,row_number() over (order by salary desc) as revseq,(select ceil(count(*)/2) from tb_employee) as ceil 4 from tb_employee) a 5 where a.seq=a.ceil or a.revseq=a.ceil; AVG(A.SALARY) ------------- 5498 已用时间: 00: 00: 01.89
方案一因运行导致僵死而不能参赛。
最终的大比拼表格:
# | 方案一 | 方案二 | 方案三 |
执行结果 | 没出来 | 5498 | 5498 |
Cost | 6315M | 34592 | 21406 |
时间 | 僵死 | 10.76秒 | 1.89秒 |
从这里看,代码上简洁,未必是最好的方案,经过实际检验执行时间最短才是。
如果要衡量一个程序的好坏,第一标准是出现正确的结果,第二是出现正确结果的最快方案,第三才是寻求最快方案里的最简洁方案。
从以上标准衡量,方法三最优,方法二其次,方法一则经不住大量数据的考验,只适用于小数据量的场合。对于这个结论,不知《SQL进阶教程》的著者和译者看了有什么想法。
当然,方法三和方法二出现的结果是有细微差别的,认真执行过的同学能知道,具体原因还需要日后探讨。
--2020年4月11日--