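How the CBO optimizes different SQL statements, seen through several ways of joining an employee table and a department table

Terminology:

CBO: Cost-Based Optimizer, the Oracle optimizer that chooses an execution plan by estimated cost.

 

Now to the main topic.

The employee table is designed like this: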

create table emp(
id int,
name nvarchar2(20),
deptid int,
primary key(id));

It can be filled with 300,000 rows like this:

insert into emp
select rownum,
dbms_random.string('*',dbms_random.value(6,20)),
dbms_random.value(1,10)
from dual
connect by level<300001;

The department table is designed like this:

create table dept(
id int,
name nvarchar2(20),
primary key(id));

This time the rows are inserted one at a time (id is an int column, so the ids are written as numeric literals rather than quoted strings):

insert into dept(id,name) values(1,'dev');
insert into dept(id,name) values(2,'prod');
insert into dept(id,name) values(3,'sales');
insert into dept(id,name) values(4,'postsales');
insert into dept(id,name) values(5,'market');
insert into dept(id,name) values(6,'lab');
insert into dept(id,name) values(7,'research');
insert into dept(id,name) values(8,'adv');
insert into dept(id,name) values(9,'hr');
insert into dept(id,name) values(10,'mng');

 

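Here is the requirement: find the employees whose name starts with AK and who belong to one of four departments: prod, sales, postsales and market.

On my machine, 466 rows start with AK, about 0.16% of the 300,000 rows in emp; the four departments are 4 of the 10 rows in dept, i.e. 40%.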
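
The figures above can be checked with queries along the following lines. This is a minimal sketch: the column aliases are made up for illustration, and because the names are generated randomly, the counts will differ from run to run.

```sql
-- Share of emp rows whose name starts with AK.
select count(*) as total_rows,
       count(case when name like 'AK%' then 1 end) as ak_rows,
       round(100 * count(case when name like 'AK%' then 1 end) / count(*), 2) as ak_pct
from emp;

-- Share of dept rows that belong to the four wanted departments.
select count(*) as total_depts,
       count(case when id in (2,3,4,5) then 1 end) as wanted_depts
from dept;
```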

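Intuitively, filtering first and joining afterwards should be the most efficient. But since this is an experiment, let's try every SQL form that returns the correct result and compare them.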

 

Quite a few SQL statements satisfy this requirement. The first is:

1.
select emp.name,dept.name from emp,dept where emp.deptid=dept.id and emp.name like 'AK%' and dept.id in (2,3,4,5)

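This statement inner-joins the employee and department tables and puts the join condition and both filter conditions together in the where clause.

This is the conventional way to write it, with no thought given to efficiency. Let's look at its execution plan: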

SQL> explain plan for select emp.name,dept.name from emp,dept where emp.deptid=dept.id and emp.name like 'AK%' and dept.id in (2,3,4,5);

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
Plan hash value: 615168685

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |    45 |  3150 |   486   (2)| 00:00:06 |
|*  1 |  HASH JOIN         |      |    45 |  3150 |   486   (2)| 00:00:06 |
|*  2 |   TABLE ACCESS FULL| DEPT |     4 |   140 |     3   (0)| 00:00:01 |
|*  3 |   TABLE ACCESS FULL| EMP  |   111 |  3885 |   483   (2)| 00:00:06 |
---------------------------------------------------------------------------


PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("EMP"."DEPTID"="DEPT"."ID")
   2 - filter("DEPT"."ID"=2 OR "DEPT"."ID"=3 OR "DEPT"."ID"=4 OR
              "DEPT"."ID"=5)
   3 - filter("EMP"."NAME" LIKE U'AK%' AND ("EMP"."DEPTID"=2 OR
              "EMP"."DEPTID"=3 OR "EMP"."DEPTID"=4 OR "EMP"."DEPTID"=5))

Note
-----

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
   - dynamic sampling used for this statement (level=2)

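23 rows selected.

The plan shows that the CBO filters both tables first and then joins them, at a cost of 486. Predicate 3 even filters EMP.DEPTID on 2, 3, 4 and 5, a condition the SQL never states: the optimizer derived it from dept.id in (2,3,4,5) together with the join condition emp.deptid=dept.id. In other words, the SQL was written without thinking about efficiency, but the CBO did the thinking for us and chose the plan it estimated to be cheapest.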
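
The Note section says dynamic sampling was used, which indicates that optimizer statistics were missing when the plan was produced. A possible follow-up, not part of the original experiment, is to gather statistics and explain the statement again; the row estimates and the cost may then differ from the figures shown here.

```sql
-- Gather optimizer statistics for both tables in the current schema
-- (cascade => true also covers their indexes).
begin
  dbms_stats.gather_table_stats(ownname => user, tabname => 'EMP',  cascade => true);
  dbms_stats.gather_table_stats(ownname => user, tabname => 'DEPT', cascade => true);
end;
/

-- Re-explain the first statement with statistics in place.
explain plan for
select emp.name, dept.name
from emp, dept
where emp.deptid = dept.id
  and emp.name like 'AK%'
  and dept.id in (2,3,4,5);

select * from table(dbms_xplan.display);
```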

 

The second SQL

2.

select a.name,b.name from (select * from emp where name like 'AK%') a ,(select * from dept where id in (2,3,4,5)) b where a.deptid=b.id

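This statement explicitly filters both tables first and then joins them. On the surface it should be faster than the first one; let's see how it fares in the execution plan: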

SQL> explain plan for select a.name,b.name from (select * from emp where name like 'AK%') a ,(select * from dept where id in (2,3,4,5)) b where a.deptid=b.id;

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
Plan hash value: 615168685

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |    45 |  3150 |   486   (2)| 00:00:06 |
|*  1 |  HASH JOIN         |      |    45 |  3150 |   486   (2)| 00:00:06 |
|*  2 |   TABLE ACCESS FULL| DEPT |     4 |   140 |     3   (0)| 00:00:01 |
|*  3 |   TABLE ACCESS FULL| EMP  |   111 |  3885 |   483   (2)| 00:00:06 |
---------------------------------------------------------------------------


PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("EMP"."DEPTID"="DEPT"."ID")
   2 - filter("ID"=2 OR "ID"=3 OR "ID"=4 OR "ID"=5)
   3 - filter("NAME" LIKE U'AK%' AND ("EMP"."DEPTID"=2 OR
              "EMP"."DEPTID"=3 OR "EMP"."DEPTID"=4 OR "EMP"."DEPTID"=5))

Note
-----
   - dynamic sampling used for this statement (level=2)

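22 rows selected.

The CBO already knows to filter first and join afterwards, so when the SQL is written that way it simply goes along with it. The plan is identical to the one above (same plan hash value, 615168685) and the cost is again 486.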

 

The third SQL

3.

select a.name,a.dname from (select emp.name,dept.name as dname,dept.id from emp,dept where emp.deptid=dept.id) a where a.name like 'AK%' and a.id in (2,3,4,5)

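This statement deliberately forces emp and dept to be joined regardless of data volume, forming one large logical table, and only then filters the result set.

In theory this is the most laborious approach. But Oracle has the CBO, which optimizes the SQL before running it; what does the execution plan reveal about its intent?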

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
Plan hash value: 615168685

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |    45 |  3150 |   486   (2)| 00:00:06 |
|*  1 |  HASH JOIN         |      |    45 |  3150 |   486   (2)| 00:00:06 |
|*  2 |   TABLE ACCESS FULL| DEPT |     4 |   140 |     3   (0)| 00:00:01 |
|*  3 |   TABLE ACCESS FULL| EMP  |   111 |  3885 |   483   (2)| 00:00:06 |
---------------------------------------------------------------------------


PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("EMP"."DEPTID"="DEPT"."ID")
   2 - filter("DEPT"."ID"=2 OR "DEPT"."ID"=3 OR "DEPT"."ID"=4 OR
              "DEPT"."ID"=5)
   3 - filter("EMP"."NAME" LIKE U'AK%' AND ("EMP"."DEPTID"=2 OR
              "EMP"."DEPTID"=3 OR "EMP"."DEPTID"=4 OR "EMP"."DEPTID"=5))

Note
-----

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
   - dynamic sampling used for this statement (level=2)

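23 rows selected.

Look: even though the SQL is written to join first and filter afterwards, the CBO does not take the long way round. As the predicate section shows, it still applies both filters before the join and arrives at the same plan, so the cost is again 486.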
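
Identical plans suggest the statements are equivalent, and that can also be checked on the data. The sketch below compares the first and third statements with MINUS in both directions; the names q1, q3, ename, dname and diff_rows are invented for illustration. MINUS works on distinct rows, so this checks that both forms return the same set of (name, department) pairs, not that duplicates occur equally often.

```sql
with q1 as (
  -- First form: join and filters in one where clause.
  select emp.name as ename, dept.name as dname
  from emp, dept
  where emp.deptid = dept.id
    and emp.name like 'AK%'
    and dept.id in (2,3,4,5)
),
q3 as (
  -- Third form: join everything, then filter the result.
  select a.name as ename, a.dname
  from (select emp.name, dept.name as dname, dept.id
        from emp, dept
        where emp.deptid = dept.id) a
  where a.name like 'AK%'
    and a.id in (2,3,4,5)
)
-- Zero means neither form returns a row the other one lacks.
select count(*) as diff_rows
from (
  (select ename, dname from q1 minus select ename, dname from q3)
  union all
  (select ename, dname from q3 minus select ename, dname from q1)
);
```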

 

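At this point you can see that agonizing over certain ways of writing SQL is pointless: the CBO knows how to optimize them, much as the Java compiler optimizes string concatenation with +. On the other hand, a SQL statement that runs smoothly was not necessarily written correctly and efficiently; the CBO may be quietly doing the work behind the scenes.

This example is a simple case, so the CBO arrives at the same plan every time. In complex cases its choice may not be reasonable, so we should still stick to the principle of filtering first and joining afterwards, writing SQL the way the second statement does.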

 

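What happens if we add a single-column index on the name column of emp? Does the cost go up, or does it drop noticeably?

 

Let's create the index first, then look at the execution plan of the second SQL (filter first, then join).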

SQL> create index idx_emp_name on emp(name);

Index created.

SQL> explain plan for select a.name,b.name from (select * from emp where name like 'AK%') a ,(select * from dept where id in (2,3,4,5)) b where a.deptid=b.id;

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
Plan hash value: 262175135

---------------------------------------------------------------------------------------------
| Id  | Operation                    | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |              |    45 |  3150 |   467   (1)| 00:00:06 |
|*  1 |  HASH JOIN                   |              |    45 |  3150 |   467   (1)| 00:00:06 |
|*  2 |   TABLE ACCESS FULL          | DEPT         |     4 |   140 |     3   (0)| 00:00:01 |
|*  3 |   TABLE ACCESS BY INDEX ROWID| EMP          |   111 |  3885 |   463   (0)| 00:00:06 |
|*  4 |    INDEX RANGE SCAN          | IDX_EMP_NAME |   362 |       |     5   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("EMP"."DEPTID"="DEPT"."ID")
   2 - filter("ID"=2 OR "ID"=3 OR "ID"=4 OR "ID"=5)
   3 - filter("EMP"."DEPTID"=2 OR "EMP"."DEPTID"=3 OR "EMP"."DEPTID"=4 OR
              "EMP"."DEPTID"=5)
   4 - access("NAME" LIKE U'AK%')
       filter("NAME" LIKE U'AK%')

Note
-----
   - dynamic sampling used for this statement (level=2)

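25 rows selected.

The single-column index on name does get used (INDEX RANGE SCAN on IDX_EMP_NAME), but the gain for this prefix LIKE query is very limited: the cost only drops from 486 to 467. The index scan itself is estimated at just 5; fetching the matching EMP rows by rowid brings that branch to 463.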
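
explain plan only reports estimates (here 362 index entries and 111 rows, while 466 rows actually start with AK). To compare estimates with what happens at run time, one option is to execute the statement with the gather_plan_statistics hint and read the cursor's plan with dbms_xplan.display_cursor. This is a sketch that was not part of the original experiment; it assumes SQL*Plus and select privileges on the underlying V$ views.

```sql
-- display_cursor(null, null, ...) reports the last statement executed in the
-- session, so serveroutput must be off or it picks up the wrong cursor.
set serveroutput off

select /*+ gather_plan_statistics */ a.name, b.name
from (select * from emp where name like 'AK%') a,
     (select * from dept where id in (2,3,4,5)) b
where a.deptid = b.id;

-- E-Rows is the optimizer's estimate, A-Rows the actual row count per step.
select * from table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST'));
```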

 

Next, let's replace like 'AK%' with the instr function and see whether there is a pleasant surprise.

SQL:

select a.name,b.name from (select * from emp where instr(name,'AK')=1) a ,(select * from dept where id in (2,3,4,5)) b where a.deptid=b.id;

Execution:

SQL> create index idx_emp_name on emp(name);

Index created.

SQL> explain plan for select a.name,b.name from (select * from emp where instr(name,'AK')=1) a ,(select * from dept where id in (2,3,4,5)) b where a.deptid=b.id;

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
Plan hash value: 615168685

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |    45 |  3150 |   487   (2)| 00:00:06 |
|*  1 |  HASH JOIN         |      |    45 |  3150 |   487   (2)| 00:00:06 |
|*  2 |   TABLE ACCESS FULL| DEPT |     4 |   140 |     3   (0)| 00:00:01 |
|*  3 |   TABLE ACCESS FULL| EMP  |   111 |  3885 |   484   (2)| 00:00:06 |
---------------------------------------------------------------------------


PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("EMP"."DEPTID"="DEPT"."ID")
   2 - filter("ID"=2 OR "ID"=3 OR "ID"=4 OR "ID"=5)
   3 - filter(INSTR("NAME",U'AK')=1 AND ("EMP"."DEPTID"=2 OR
              "EMP"."DEPTID"=3 OR "EMP"."DEPTID"=4 OR "EMP"."DEPTID"=5))

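The cost is now 487, 1 higher than the original 486 and 20 higher than the 467 obtained with like 'AK%'. The plan is back to TABLE ACCESS FULL on EMP: instr(name,'AK') wraps the column in a function, so the ordinary index on name cannot serve this predicate. This result therefore says little about the index itself; it shows that the instr form cannot benefit from it.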
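
If the instr form had to be kept, a function-based index on the expression is the usual way to give the optimizer something to work with. The sketch below is an assumption added for illustration, not something the original experiment ran: idx_emp_name_instr is a made-up name, and whether the optimizer actually picks the index depends on its estimates on your system.

```sql
-- Index the expression instr(name,'AK') rather than the bare column.
create index idx_emp_name_instr on emp(instr(name,'AK'));

explain plan for
select a.name, b.name
from (select * from emp where instr(name,'AK')=1) a,
     (select * from dept where id in (2,3,4,5)) b
where a.deptid = b.id;

select * from table(dbms_xplan.display);
```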

 

Next, let's add a composite index on the name and deptid columns of emp and see whether things improve.

SQL> create index idx_emp_name_deptid on emp(name,deptid);

Index created.

SQL> explain plan for select a.name,b.name from (select * from emp where instr(name,'AK')=1) a ,(select * from dept where id in (2,3,4,5)) b where a.deptid=b.id;

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
Plan hash value: 615168685

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |    45 |  3150 |   487   (2)| 00:00:06 |
|*  1 |  HASH JOIN         |      |    45 |  3150 |   487   (2)| 00:00:06 |
|*  2 |   TABLE ACCESS FULL| DEPT |     4 |   140 |     3   (0)| 00:00:01 |
|*  3 |   TABLE ACCESS FULL| EMP  |   111 |  3885 |   484   (2)| 00:00:06 |
---------------------------------------------------------------------------


PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("EMP"."DEPTID"="DEPT"."ID")
   2 - filter("ID"=2 OR "ID"=3 OR "ID"=4 OR "ID"=5)
   3 - filter(INSTR("NAME",U'AK')=1 AND ("EMP"."DEPTID"=2 OR
              "EMP"."DEPTID"=3 OR "EMP"."DEPTID"=4 OR "EMP"."DEPTID"=5))

Note
-----
   - dynamic sampling used for this statement (level=2)

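22 rows selected.

The plan and the cost (487) are exactly the same as in the previous test. The statement still filters with instr, so the composite index is not used either and EMP is again scanned in full.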
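
To judge the composite index itself, the filter needs to be in a form that an index led by name can serve. A sketch of that follow-up test, which the original experiment does not include, is the same statement written with like 'AK%' again. Since idx_emp_name_deptid contains both columns the query needs from emp, the optimizer could in principle answer the emp side from the index alone; whether it does, and at what cost, has to be read from the plan on your own system.

```sql
-- Same query as the second SQL, explained with idx_emp_name_deptid in place.
explain plan for
select a.name, b.name
from (select * from emp where name like 'AK%') a,
     (select * from dept where id in (2,3,4,5)) b
where a.deptid = b.id;

select * from table(dbms_xplan.display);
```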

 

Next, let's change the in list on dept to a range condition, with the composite index created above still in place, and see what happens.

SQL:

select a.name,b.name from (select * from emp where instr(name,'AK')=1) a ,(select * from dept where id>1 and id<6 ) b where a.deptid=b.id;

Execution:

SQL> explain plan for select a.name,b.name from (select * from emp where instr(name,'AK')=1) a ,(select * from dept where id>1 and id<6 ) b where a.deptid=b.id;

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
Plan hash value: 615168685

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |    45 |  3150 |   485   (2)| 00:00:06 |
|*  1 |  HASH JOIN         |      |    45 |  3150 |   485   (2)| 00:00:06 |
|*  2 |   TABLE ACCESS FULL| DEPT |     4 |   140 |     3   (0)| 00:00:01 |
|*  3 |   TABLE ACCESS FULL| EMP  |   111 |  3885 |   482   (2)| 00:00:06 |
---------------------------------------------------------------------------


PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("EMP"."DEPTID"="DEPT"."ID")
   2 - filter("ID">1 AND "ID"<6)
   3 - filter("EMP"."DEPTID">1 AND "EMP"."DEPTID"<6 AND
              INSTR("NAME",U'AK')=1)

Note
-----
   - dynamic sampling used for this statement (level=2)

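22 rows selected.

The cost is 485: 2 lower than the in-list version of the same statement (487) and 1 lower than the original 486. That is an improvement, but a negligible one.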

 

One more variation: use between ... and in place of the > and < comparisons. Note that between ... and includes both boundaries.

select a.name,b.name from (select * from emp where instr(name,'AK')=1) a ,(select * from dept where id between 2 and 5 ) b where a.deptid=b.id;

Execution:

SQL> explain plan for select a.name,b.name from (select * from emp where instr(name,'AK')=1) a ,(select * from dept where id between 2 and 5 ) b where a.deptid=b.id;

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
Plan hash value: 615168685

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |    45 |  3150 |   485   (2)| 00:00:06 |
|*  1 |  HASH JOIN         |      |    45 |  3150 |   485   (2)| 00:00:06 |
|*  2 |   TABLE ACCESS FULL| DEPT |     4 |   140 |     3   (0)| 00:00:01 |
|*  3 |   TABLE ACCESS FULL| EMP  |   111 |  3885 |   482   (2)| 00:00:06 |
---------------------------------------------------------------------------


PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("EMP"."DEPTID"="DEPT"."ID")
   2 - filter("ID">=2 AND "ID"<=5)
   3 - filter("EMP"."DEPTID">=2 AND "EMP"."DEPTID"<=5 AND
              INSTR("NAME",U'AK')=1)

Note
-----
   - dynamic sampling used for this statement (level=2)

22 rows selected.

The cost shows no improvement over the greater-than/less-than version; it is still 485.
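
Rather than reading three separate plan listings, the total costs of these variants can be put side by side by tagging each explain plan with a statement_id and querying plan_table directly. This is a sketch; the statement ids 'in_list', 'range' and 'between' are arbitrary labels chosen for illustration.

```sql
-- Clear earlier runs so each label appears only once.
delete from plan_table where statement_id in ('in_list','range','between');

explain plan set statement_id = 'in_list' for
select a.name, b.name
from (select * from emp where instr(name,'AK')=1) a,
     (select * from dept where id in (2,3,4,5)) b
where a.deptid = b.id;

explain plan set statement_id = 'range' for
select a.name, b.name
from (select * from emp where instr(name,'AK')=1) a,
     (select * from dept where id>1 and id<6) b
where a.deptid = b.id;

explain plan set statement_id = 'between' for
select a.name, b.name
from (select * from emp where instr(name,'AK')=1) a,
     (select * from dept where id between 2 and 5) b
where a.deptid = b.id;

-- The row with id = 0 is the SELECT STATEMENT line and carries the total cost.
select statement_id, cost
from plan_table
where id = 0
  and statement_id in ('in_list','range','between')
order by statement_id;
```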

 

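For this simple example, the CBO has already taken optimization about as far as it can go, so rewriting the SQL leaves very little room for further improvement.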

END

posted @ 2021-09-03 01:11  逆火狂飙