从员工表和部门表联合查询的不同方式看CBO对不同SQL的优化
名词解释:
CBO Cost-Based Optimization 基于代价的优化器
往下进入正题:
有一张员工表这样设计:
create table emp( id int, name nvarchar2(20), deptid int, primary key(id));
可以这样给它塞入三十万条记录:
insert into emp select rownum, dbms_random.string('*',dbms_random.value(6,20)), dbms_random.value(1,10) from dual connect by level<300001;
还有一张部门表这样设计:
create table dept( id int, name nvarchar2(20), primary key(id));
这回选择逐条插值:
insert into dept(id,name) values('1','dev'); insert into dept(id,name) values('2','prod'); insert into dept(id,name) values('3','sales'); insert into dept(id,name) values('4','postsales'); insert into dept(id,name) values('5','market'); insert into dept(id,name) values('6','lab'); insert into dept(id,name) values('7','research'); insert into dept(id,name) values('8','adv'); insert into dept(id,name) values('9','hr'); insert into dept(id,name) values('10','mng');
需求出来了,要找员工名以AK打头,属于生产prod、销售sales、售后postsales和市场market四个部的员工。
按我机器上产生的结果,以AK开头的记录是466条,占emp表总数的0.15%;四个部的员工占dept表总数的4成。
按常规理解,应该是先过滤再连接最高效,但既然是实验,就把能出来正确结果的各种SQL都试试,比较一下。
满足这个需求的SQL不少,首先便是:
1.
select emp.name,dept.name from emp,dept where emp.deptid=dept.id and emp.name like 'AK%' and dept.id in (2,3,4,5)
这条SQL是让员工表和部门表内联,然后让表连接条件和两个过滤条件统一写到where从句中。
这是常规解法,效率未纳入考量范围。让我们看看解释计划是怎样的:
SQL> explain plan for select emp.name,dept.name from emp,dept where emp.deptid=dept.id and emp.name like 'AK%' and dept.id in (2,3,4,5); 已解释。 SQL> select * from table(dbms_xplan.display); PLAN_TABLE_OUTPUT -------------------------------------------------------------------------------- Plan hash value: 615168685 --------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | --------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 45 | 3150 | 486 (2)| 00:00:06 | |* 1 | HASH JOIN | | 45 | 3150 | 486 (2)| 00:00:06 | |* 2 | TABLE ACCESS FULL| DEPT | 4 | 140 | 3 (0)| 00:00:01 | |* 3 | TABLE ACCESS FULL| EMP | 111 | 3885 | 483 (2)| 00:00:06 | --------------------------------------------------------------------------- PLAN_TABLE_OUTPUT -------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 1 - access("EMP"."DEPTID"="DEPT"."ID") 2 - filter("DEPT"."ID"=2 OR "DEPT"."ID"=3 OR "DEPT"."ID"=4 OR "DEPT"."ID"=5) 3 - filter("EMP"."NAME" LIKE U'AK%' AND ("EMP"."DEPTID"=2 OR "EMP"."DEPTID"=3 OR "EMP"."DEPTID"=4 OR "EMP"."DEPTID"=5)) Note ----- PLAN_TABLE_OUTPUT -------------------------------------------------------------------------------- - dynamic sampling used for this statement (level=2) 已选择23行。
从解释计划里,我们发现CBO是让两表先过滤再连接,cost是486. 也就是说这种SQL没考虑效率问题,CBO帮咱们考虑了,并按最优方案进行。
第二种SQL
2.
select a.name,b.name from (select * from emp where name like 'AK%') a ,(select * from dept where id in (2,3,4,5)) b where a.deptid=b.id
这句SQL知道先主动过滤两表再进行连接,表面上是比第一种要快的,让我们看看解释计划里它会如何表现:
SQL> explain plan for select a.name,b.name from (select * from emp where name like 'AK%') a ,(select * from dept where id in (2,3,4,5)) b where a.deptid=b.id; 已解释。 SQL> select * from table(dbms_xplan.display); PLAN_TABLE_OUTPUT -------------------------------------------------------------------------------- Plan hash value: 615168685 --------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | --------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 45 | 3150 | 486 (2)| 00:00:06 | |* 1 | HASH JOIN | | 45 | 3150 | 486 (2)| 00:00:06 | |* 2 | TABLE ACCESS FULL| DEPT | 4 | 140 | 3 (0)| 00:00:01 | |* 3 | TABLE ACCESS FULL| EMP | 111 | 3885 | 483 (2)| 00:00:06 | --------------------------------------------------------------------------- PLAN_TABLE_OUTPUT -------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 1 - access("EMP"."DEPTID"="DEPT"."ID") 2 - filter("ID"=2 OR "ID"=3 OR "ID"=4 OR "ID"=5) 3 - filter("NAME" LIKE U'AK%' AND ("EMP"."DEPTID"=2 OR "EMP"."DEPTID"=3 OR "EMP"."DEPTID"=4 OR "EMP"."DEPTID"=5)) Note ----- - dynamic sampling used for this statement (level=2) 已选择22行。
我们发现,CBO知道该先过滤再连接,如果SQL这样做了,它也就顺其自然,于是解释计划和上面一样,Cost也是486.
第三种SQL
3.
select a.name,a.dname from (select emp.name,dept.name as dname,dept.id from emp,dept where emp.deptid=dept.id) a where a.name like 'AK%' and a.id in (2,3,4,5)
这种SQL是不管数据多少故意硬让emp和dept两表连接,形成一个逻辑大表,再在出来的结果集里筛选。
理论上,我们知道这是最费力的方案,但是,Oracle有CBO,它会对SQL进行优化,看解释计划暴露的CBO意图如何?
SQL> select * from table(dbms_xplan.display); PLAN_TABLE_OUTPUT -------------------------------------------------------------------------------- Plan hash value: 615168685 --------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | --------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 45 | 3150 | 486 (2)| 00:00:06 | |* 1 | HASH JOIN | | 45 | 3150 | 486 (2)| 00:00:06 | |* 2 | TABLE ACCESS FULL| DEPT | 4 | 140 | 3 (0)| 00:00:01 | |* 3 | TABLE ACCESS FULL| EMP | 111 | 3885 | 483 (2)| 00:00:06 | --------------------------------------------------------------------------- PLAN_TABLE_OUTPUT -------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 1 - access("EMP"."DEPTID"="DEPT"."ID") 2 - filter("DEPT"."ID"=2 OR "DEPT"."ID"=3 OR "DEPT"."ID"=4 OR "DEPT"."ID"=5) 3 - filter("EMP"."NAME" LIKE U'AK%' AND ("EMP"."DEPTID"=2 OR "EMP"."DEPTID"=3 OR "EMP"."DEPTID"=4 OR "EMP"."DEPTID"=5)) Note ----- PLAN_TABLE_OUTPUT -------------------------------------------------------------------------------- - dynamic sampling used for this statement (level=2) 已选择23行。
看,即使SQL里强制先连接表再过滤,知道最佳方案是怎样的CBO是不会舍近求远的,它还是选择最佳路径前进,所以cost依旧是486.
看到这里,你会知道过于纠结一些SQL的写法无意义,CBO会知道如何优化,就像java虚拟机对字符串+的优化;另一方面,SQL跑得顺溜也未必是SQL写得正确高效,或许是CBO在背后默默奉献。
这个例子只是简单情况,所以CBO能一直坚持最短路径,但复杂情况下未必合理,我们还是要秉承先过滤后连接的方针不变,就像第二种方案里的SQL那样写。
如果给emp表的name单列加索引会是什么效果呢?cost是增长还是有效降低?。
让我们先加索引,再对第二SQL(先过滤后连接)查看解释计划。
SQL> create index idx_emp_name on emp(name); 索引已创建。 SQL> explain plan for select a.name,b.name from (select * from emp where name like 'AK%') a ,(select * from dept where id in (2,3,4,5)) b where a.deptid=b.id; 已解释。 SQL> select * from table(dbms_xplan.display); PLAN_TABLE_OUTPUT -------------------------------------------------------------------------------- Plan hash value: 262175135 -------------------------------------------------------------------------------- ------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU )| Time | -------------------------------------------------------------------------------- ------------- PLAN_TABLE_OUTPUT -------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 45 | 3150 | 467 (1 )| 00:00:06 | |* 1 | HASH JOIN | | 45 | 3150 | 467 (1 )| 00:00:06 | |* 2 | TABLE ACCESS FULL | DEPT | 4 | 140 | 3 (0 )| 00:00:01 | |* 3 | TABLE ACCESS BY INDEX ROWID| EMP | 111 | 3885 | 463 (0 )| 00:00:06 | PLAN_TABLE_OUTPUT -------------------------------------------------------------------------------- |* 4 | INDEX RANGE SCAN | IDX_EMP_NAME | 362 | | 5 (0 )| 00:00:01 | -------------------------------------------------------------------------------- ------------- Predicate Information (identified by operation id): --------------------------------------------------- PLAN_TABLE_OUTPUT -------------------------------------------------------------------------------- 1 - access("EMP"."DEPTID"="DEPT"."ID") 2 - filter("ID"=2 OR "ID"=3 OR "ID"=4 OR "ID"=5) 3 - filter("EMP"."DEPTID"=2 OR "EMP"."DEPTID"=3 OR "EMP"."DEPTID"=4 OR "EMP"."DEPTID"=5) 4 - access("NAME" LIKE U'AK%') filter("NAME" LIKE U'AK%') Note ----- - dynamic sampling used for this statement (level=2) 已选择25行。
看来对name单列加索引对半模糊查询的优化效果极其有限。
下面我们用instr函数取代 like ‘AK%’,看是否有惊喜。
SQL:
select a.name,b.name from (select * from emp where instr(name,'AK')=1) a ,(select * from dept where id in (2,3,4,5)) b where a.deptid=b.id;
执行情况:
SQL> create index idx_emp_name on emp(name); 索引已创建。 SQL> explain plan for select a.name,b.name from (select * from emp where instr(name,'AK')=1) a ,(select * from dept where id in (2,3,4,5)) b where a.deptid=b.id; 已解释。 SQL> select * from table(dbms_xplan.display); PLAN_TABLE_OUTPUT -------------------------------------------------------------------------------- Plan hash value: 615168685 --------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | --------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 45 | 3150 | 487 (2)| 00:00:06 | |* 1 | HASH JOIN | | 45 | 3150 | 487 (2)| 00:00:06 | |* 2 | TABLE ACCESS FULL| DEPT | 4 | 140 | 3 (0)| 00:00:01 | |* 3 | TABLE ACCESS FULL| EMP | 111 | 3885 | 484 (2)| 00:00:06 | --------------------------------------------------------------------------- PLAN_TABLE_OUTPUT -------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 1 - access("EMP"."DEPTID"="DEPT"."ID") 2 - filter("ID"=2 OR "ID"=3 OR "ID"=4 OR "ID"=5) 3 - filter(INSTR("NAME",U'AK')=1 AND ("EMP"."DEPTID"=2 OR "EMP"."DEPTID"=3 OR "EMP"."DEPTID"=4 OR "EMP"."DEPTID"=5))
Cost反而上升了1,看来对name加单列索引,对查询的改变都很有限。
下面我们对emp表的name和deptid字段加联合索引,看是否有改善。
SQL> create index idx_emp_name_deptid on emp(name,deptid); 索引已创建。 SQL> explain plan for select a.name,b.name from (select * from emp where instr(name,'AK')=1) a ,(select * from dept where id in (2,3,4,5)) b where a.deptid=b.id; 已解释。 SQL> select * from table(dbms_xplan.display); PLAN_TABLE_OUTPUT -------------------------------------------------------------------------------- Plan hash value: 615168685 --------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | --------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 45 | 3150 | 487 (2)| 00:00:06 | |* 1 | HASH JOIN | | 45 | 3150 | 487 (2)| 00:00:06 | |* 2 | TABLE ACCESS FULL| DEPT | 4 | 140 | 3 (0)| 00:00:01 | |* 3 | TABLE ACCESS FULL| EMP | 111 | 3885 | 484 (2)| 00:00:06 | --------------------------------------------------------------------------- PLAN_TABLE_OUTPUT -------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 1 - access("EMP"."DEPTID"="DEPT"."ID") 2 - filter("ID"=2 OR "ID"=3 OR "ID"=4 OR "ID"=5) 3 - filter(INSTR("NAME",U'AK')=1 AND ("EMP"."DEPTID"=2 OR "EMP"."DEPTID"=3 OR "EMP"."DEPTID"=4 OR "EMP"."DEPTID"=5)) Note ----- - dynamic sampling used for this statement (level=2) 已选择22行。
结果和对name加单列索引是一样的,cost还上升了1.
下面我们把dept的in查询改成范围查询,还是用上面创建的联合索引,看看情况如何。
SQL:
select a.name,b.name from (select * from emp where instr(name,'AK')=1) a ,(select * from dept where id>1 and id<6 ) b where a.deptid=b.id;
执行:
SQL> explain plan for select a.name,b.name from (select * from emp where instr(name,'AK')=1) a ,(select * from dept where id>1 and id<6 ) b where a.deptid=b.id; 已解释。 SQL> select * from table(dbms_xplan.display); PLAN_TABLE_OUTPUT -------------------------------------------------------------------------------- Plan hash value: 615168685 --------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | --------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 45 | 3150 | 485 (2)| 00:00:06 | |* 1 | HASH JOIN | | 45 | 3150 | 485 (2)| 00:00:06 | |* 2 | TABLE ACCESS FULL| DEPT | 4 | 140 | 3 (0)| 00:00:01 | |* 3 | TABLE ACCESS FULL| EMP | 111 | 3885 | 482 (2)| 00:00:06 | --------------------------------------------------------------------------- PLAN_TABLE_OUTPUT -------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 1 - access("EMP"."DEPTID"="DEPT"."ID") 2 - filter("ID">1 AND "ID"<6) 3 - filter("EMP"."DEPTID">1 AND "EMP"."DEPTID"<6 AND INSTR("NAME",U'AK')=1) Note ----- - dynamic sampling used for this statement (level=2) 已选择22行。
从Cost上看,少了1,是聊胜于无的改善。
再换一种, 用between and 取代<>, 注意between and是包括上下边界的。
select a.name,b.name from (select * from emp where instr(name,'AK')=1) a ,(select * from dept where id between 2 and 5 ) b where a.deptid=b.id;
执行情况:
SQL> explain plan for select a.name,b.name from (select * from emp where instr(name,'AK')=1) a ,(select * from dept where id between 2 and 5 ) b where a.deptid=b.id; 已解释。 SQL> select * from table(dbms_xplan.display); PLAN_TABLE_OUTPUT -------------------------------------------------------------------------------- Plan hash value: 615168685 --------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | --------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 45 | 3150 | 485 (2)| 00:00:06 | |* 1 | HASH JOIN | | 45 | 3150 | 485 (2)| 00:00:06 | |* 2 | TABLE ACCESS FULL| DEPT | 4 | 140 | 3 (0)| 00:00:01 | |* 3 | TABLE ACCESS FULL| EMP | 111 | 3885 | 482 (2)| 00:00:06 | --------------------------------------------------------------------------- PLAN_TABLE_OUTPUT -------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 1 - access("EMP"."DEPTID"="DEPT"."ID") 2 - filter("ID">=2 AND "ID"<=5) 3 - filter("EMP"."DEPTID">=2 AND "EMP"."DEPTID"<=5 AND INSTR("NAME",U'AK')=1) Note ----- - dynamic sampling used for this statement (level=2) 已选择22行。
从Cost上看,这个方案比大于小于方案没有改进,还是485.
就这个简单例子而言,CBO的优化已经做到了极限,以致于改进的空间都极其有限了。
END