oracle单表选择率(selectivity)——计算执行计划的基数
CBO优化器是基于对当前经过特定测试的数据集中预期的行比率估计来计算基数的。此处的行数之比是一个数值,称为选择率(selectivity)。得到选择率之后,将其与输入行数进行简单相乘既可得到基数。
在理解选择性之前,必须得对user_tab_col_statistics视图有一定了解:
- SQL> desc user_tab_col_statistics
- 名称 是否为空? 类型
- ----------------------------------------- -------- ----------------------------
- TABLE_NAME VARCHAR2(30) 表名
- COLUMN_NAME VARCHAR2(30) 列名
- NUM_DISTINCT NUMBER 列中distinct值的数目
- LOW_VALUE RAW(32) 列的最小值
- HIGH_VALUE RAW(32) 列的最大值
- DENSITY NUMBER 当对列创建了直方图,则值不再等于1/NUM_DISTINCT。
- NUM_NULLS NUMBER 列中的NULL值数目。
- NUM_BUCKETS NUMBER Number of buckets in histogram for the column
- LAST_ANALYZED DATE 最近分析时间。
- SAMPLE_SIZE NUMBER 分析样本大小。
- GLOBAL_STATS VARCHAR2(3) 对分区采样,则-NO,否则-YES。
- USER_STATS VARCHAR2(3) 统计信息由用户导入,则YES,否则-NO。
- AVG_COL_LEN NUMBER 列的平均长度(以字节为单位)
- HISTOGRAM VARCHAR2(15) Indicates existence/type of histogram: NONE、FREQUENCY、HEIGHT BALANCED
下面创建一张测试表,并收集统计信息:
- SQL> create table audience as
- 2 select
- 3 trunc(dbms_random.value(1,13)) month_no
- 4 from
- 5 all_objects
- 6 where
- 7 rownum <= 1200
- 8 ;
- 表已创建。
- SQL> begin
- 2 dbms_stats.gather_table_stats(
- 3 user,
- 4 'audience',
- 5 cascade => true,
- 6 estimate_percent => null,
- 7 );ethod_opt => 'for all columns size 1'
- method_opt => 'for all columns size 1'
- 8 );
- 9 end;
- 10 /
- PL/SQL 过程已成功完成。
先查看一下上面表和列的统计信息:
- SQL> select t.TABLE_NAME, t.NUM_ROWS, t.BLOCKS, t.SAMPLE_SIZE
- 2 from user_tables t;
- TABLE_NAME NUM_ROWS BLOCKS SAMPLE_SIZE
- ---------- ---------- ---------- -----------
- AUDIENCE 1200 5 1200
- SQL> select s.table_name,
- s.column_name,
- s.num_distinct,
- 4 s.low_value,
- s.high_value,
- s.density,
- 7 s.num_nulls,
- 8 s.sample_size,
- 9 s.avg_col_len
- 10 from user_tab_col_statistics s;
- TABLE_NAME COLUMN_NAM NUM_DISTINCT LOW_VALUE HIGH_VALUE DENSITY NUM_NULLS SAMPLE_SIZE AVG_COL_LEN
- ---------- ---------- ------------ ---------- ---------- ---------- ---------- ----------- -----------
- AUDIENCE MONTH_NO 12 C102 C10D .083333333 0 1200 3
- SQL> select rawtohex(1), rawtohex(12) from dual;
- RAWT RAWT
- ---- ----
- C102 C10D
- SQL> select dump(1,16),dump(12,16) from dual;
- DUMP(1,16) DUMP(12,16)
- ----------------- -----------------
- Typ=2 Len=2: c1,2 Typ=2 Len=2: c1,d
- SQL> select utl_raw.cast_to_number('c102'),utl_raw.cast_to_number('c10d') from dual;
- UTL_RAW.CAST_TO_NUMBER('C102') UTL_RAW.CAST_TO_NUMBER('C10D')
- ------------------------------ ------------------------------
- 1 12 --可以看见上面的LOW_VALUE和HIGH_VALUE的值分别就是1和12.
- SQL> select count(a.month_no) from AUDIENCE a;
- COUNT(A.MONTH_NO)
- -----------------
- 1200 --可以看见,这里的值和NUM_ROWS是一样的。
- SQL> select count(distinct a.month_no) from AUDIENCE a;
- COUNT(DISTINCTA.MONTH_NO)
- -------------------------
- 12 --可以看见,这里的值也和NUM_DISTINCT的值是一样的。
- SQL> select 1/12 from dual;
- 1/12
- ----------
- .083333333 --这里的值和DENSITY一样的,计算公式为1/NUM_DISTINCT。
1、假如在上面创建了一张表,里面包含1200个人,如何才能确定其中有多少人的生日是在12月份。
- SQL> select count(*) from AUDIENCE where month_no=12;
- 执行计划
- ----------------------------------------------------------
- Plan hash value: 3337892515
- -------------------------------------------------------------------------------
- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
- -------------------------------------------------------------------------------
- | 0 | SELECT STATEMENT | | 1 | 3 | 3 (0)| 00:00:01 |
- | 1 | SORT AGGREGATE | | 1 | 3 | | |
- |* 2 | TABLE ACCESS FULL| AUDIENCE | 100 | 300 | 3 (0)| 00:00:01 |
- -------------------------------------------------------------------------------
- Predicate Information (identified by operation id):
- ---------------------------------------------------
- 2 - filter("MONTH_NO"=12)
可以看见CBO计算出1200里面,12月份生日的人是100人(在ID=2行的rows)。这和我们通常所理解的是一样的,我们知道月份只有12个,在1200人中在某一个月出生的人,算概率也是100人(CBO也是这样做得)。
计算方法为:DENSITY * NUM_ROWS = 1 / 12 * 1200 = 100。
2、现在假设有10%的人不记得自己的生日了,那么CBO会怎么计算呐?
- SQL> drop table audience purge;
- 表已删除。
- SQL> create table audience as
- 2 select
- 3 rownum id,
- 4 trunc(dbms_random.value(1,13)) month_no
- 5 from
- 6 all_objects
- 7 where
- 8 rownum <= 1200;
- 表已创建。
- SQL> update
- 2 audience
- 3 set month_no = null
- 4 where mod(id,10) = 0; --10%的用户不记得自己的生日。
- 已更新120行。
- SQL> commit;
- 提交完成。
- SQL> begin
- 2 dbms_stats.gather_table_stats(
- 3 user,
- 4 'audience',
- 5 cascade => true,
- 6 estimate_percent => null,
- 7 method_opt => 'for all columns size 1'
- 8 );
- 9 end;
- 10 /
- PL/SQL 过程已成功完成。
- SQL> select t.TABLE_NAME, t.NUM_ROWS, t.BLOCKS, t.SAMPLE_SIZE from user_tables t;
- TABLE_NAME NUM_ROWS BLOCKS SAMPLE_SIZE
- ---------- ---------- ---------- -----------
- AUDIENCE 1200 5 1200
- SQL> select s.table_name,
- 2 s.column_name,
- 3 s.num_distinct,
- 4 s.low_value,
- 5 s.high_value,
- 6 s.density,
- 7 s.num_nulls,
- 8 s.sample_size,
- 9 s.avg_col_len
- 10 from user_tab_col_statistics s;
- TABLE_NAME COLUMN_NAM NUM_DISTINCT LOW_VALUE HIGH_VALUE DENSITY NUM_NULLS SAMPLE_SIZE AVG_COL_LEN
- ---------- ---------- ------------ ---------- ---------- ---------- ---------- ----------- -----------
- AUDIENCE MONTH_NO 12 C102 C10D .083333333 120 1080 3 --这里可以看见,NUM_NULLS的值确实为120。
- AUDIENCE ID 1200 C102 C20D .000833333 0 1200 4
- SQL> select count(*) from AUDIENCE where month_no=12;
- 执行计划
- ----------------------------------------------------------
- Plan hash value: 3337892515
- -------------------------------------------------------------------------------
- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
- -------------------------------------------------------------------------------
- | 0 | SELECT STATEMENT | | 1 | 3 | 3 (0)| 00:00:01 |
- | 1 | SORT AGGREGATE | | 1 | 3 | | |
- |* 2 | TABLE ACCESS FULL| AUDIENCE | 90 | 270 | 3 (0)| 00:00:01 |
- -------------------------------------------------------------------------------
- Predicate Information (identified by operation id):
- ---------------------------------------------------
- 2 - filter("MONTH_NO"=12)
调整后的选择率:DENSITY * ((NUM_ROWS-NUM_NULLS)/NUM_ROWS) = 1 / 12 * ((1200 - 120) / 1200) = 0.075。
返回的记录数(ROWS):调整后的选择率 * NUM_ROWS = 0.075 * 1200 = 90行。
或者我们可以换一种方法思考,通过前面可以很容易的知道12分月有100人生日(其中这里就包含了不记得生日的人)。然后1200人中有10%的人不记得自己的生日,也就是120,那么12月份不记得自己生日的人就平摊到10个人,100-10=90。
3、现在假如我们想知道在6、7、8月份生日的人有多少呐?
- SQL> select count(*) from AUDIENCE where month_no in(6,7,8);
- 执行计划
- ----------------------------------------------------------
- Plan hash value: 3337892515
- -------------------------------------------------------------------------------
- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
- -------------------------------------------------------------------------------
- | 0 | SELECT STATEMENT | | 1 | 3 | 3 (0)| 00:00:01 |
- | 1 | SORT AGGREGATE | | 1 | 3 | | |
- |* 2 | TABLE ACCESS FULL| AUDIENCE | 270 | 810 | 3 (0)| 00:00:01 |
- -------------------------------------------------------------------------------
- Predicate Information (identified by operation id):
- ---------------------------------------------------
- 2 - filter("MONTH_NO"=6 OR "MONTH_NO"=7 OR "MONTH_NO"=8)
6、7、8月份的选择率:6月份选择率 + 7月份选择率 + 8月份选择率 = 0.075 * 3 = 0.225
返回的记录数(ROWS):6、7、8月份的选择率 * NUM_ROWS = 0.225 * 1200 = 270行。
4、下面来一个更复杂一点的,我们想知道不在6、7、8月份生日的人有多少呐?
- SQL> select count(*) from AUDIENCE where month_no not in(6,7,8);
- 执行计划
- ----------------------------------------------------------
- Plan hash value: 3337892515
- -------------------------------------------------------------------------------
- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
- -------------------------------------------------------------------------------
- | 0 | SELECT STATEMENT | | 1 | 3 | 3 (0)| 00:00:01 |
- | 1 | SORT AGGREGATE | | 1 | 3 | | |
- |* 2 | TABLE ACCESS FULL| AUDIENCE | 674 | 2022 | 3 (0)| 00:00:01 |
- -------------------------------------------------------------------------------
- Predicate Information (identified by operation id):
- ---------------------------------------------------
- 2 - filter("MONTH_NO"<>6 AND "MONTH_NO"<>7 AND "MONTH_NO"<>8)
选择率:1 - 6、7、8月份的选择率 = 1 - 0.075 * 3
返回记录数:(1-0.075*3)*1200 = 930。
month_no in{specific list} 的基数 + month_no not in{specific list} 的基数 = NUM_ROWS,这里计算出来是相等的,但是在数据库中看见的却不想等,需要注意!
5、现在我们求8月份以后出生的人,不包含8月份。
- SQL> select count(*) from AUDIENCE where month_no>8;
- 执行计划
- ----------------------------------------------------------
- Plan hash value: 3337892515
- -------------------------------------------------------------------------------
- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
- -------------------------------------------------------------------------------
- | 0 | SELECT STATEMENT | | 1 | 3 | 3 (0)| 00:00:01 |
- | 1 | SORT AGGREGATE | | 1 | 3 | | |
- |* 2 | TABLE ACCESS FULL| AUDIENCE | 393 | 1179 | 3 (0)| 00:00:01 |
- -------------------------------------------------------------------------------
- Predicate Information (identified by operation id):
- ---------------------------------------------------
- 2 - filter("MONTH_NO">8)
选择率:((HIGH_VALUE - LIMIT) / (HIGH_VALUE - LOW_VALUE)) * ((NUM_ROWS - NUM_NULLS) / NUM_ROWS)
返回的记录数:选择率 * NUM_ROWS = ((HIGH_VALUE - LIMIT) / (HIGH_VALUE - LOW_VALUE)) * ((NUM_ROWS - NUM_NULLS) / NUM_ROWS) * NUM_ROWS = round(((12-8)/(12-1))*((1200-120)/1200)*1200) = 393。
如果是求8月份以后出生的人,包含8月份。
- SQL> select count(*) from AUDIENCE where month_no>=8;
- 执行计划
- ----------------------------------------------------------
- Plan hash value: 3337892515
- -------------------------------------------------------------------------------
- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
- -------------------------------------------------------------------------------
- | 0 | SELECT STATEMENT | | 1 | 3 | 3 (0)| 00:00:01 |
- | 1 | SORT AGGREGATE | | 1 | 3 | | |
- |* 2 | TABLE ACCESS FULL| AUDIENCE | 483 | 1449 | 3 (0)| 00:00:01 |
- -------------------------------------------------------------------------------
- Predicate Information (identified by operation id):
- ---------------------------------------------------
- 2 - filter("MONTH_NO">=8)
选择率:((HIGH_VALUE - LIMIT) / (HIGH_VALUE - LOW_VALUE) + 1 / DENSITY) * ((NUM_ROWS - NUM_NULLS) / NUM_ROWS)
返回记录数:选择率 * NUM_ROWS = ((HIGH_VALUE - LIMIT) / (HIGH_VALUE - LOW_VALUE) + 1 / DENSITY) * ((NUM_ROWS - NUM_NULLS) / NUM_ROWS) * NUM_ROWS = round(((12-8)/(12-1)+1/12)*((1200-120)/1200)*1200) = 483。
如果是<8,选择率:((LIMIT - LOW_VALUE) / (HIGH_VALUE - LOW_VALUE)) * ((NUM_ROWS - NUM_NULLS) / NUM_ROWS)
如果是<=8,选择率:((LIMIT - LOW_VALUE) / (HIGH_VALUE - LOW_VALUE) + 1 / DENSITY) * ((NUM_ROWS - NUM_NULLS) / NUM_ROWS)
6、现在我们想知道6月份到8月份出生的人的数量?
- SQL> select count(*) from AUDIENCE where month_no>=6 and month_no<=8;
- 执行计划
- ----------------------------------------------------------
- Plan hash value: 3337892515
- -------------------------------------------------------------------------------
- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
- -------------------------------------------------------------------------------
- | 0 | SELECT STATEMENT | | 1 | 3 | 3 (0)| 00:00:01 |
- | 1 | SORT AGGREGATE | | 1 | 3 | | |
- |* 2 | TABLE ACCESS FULL| AUDIENCE | 376 | 1128 | 3 (0)| 00:00:01 |
- -------------------------------------------------------------------------------
- Predicate Information (identified by operation id):
- ---------------------------------------------------
- 2 - filter("MONTH_NO">=6 AND "MONTH_NO"<=8)
选择率:((HIGH_LIMIT - LOW_LIMIT) / (HIGH_VALUE - LOW_VALUE) + 1 / DENSITY + 1 / DENSITY) * ((NUM_ROWS - NUM_NULLS) / NUM_ROWS)
返回记录数:round(((8-6)/(12-1)+1/12+1/12)*((1200-120)/1200)*1200) = 376。
7、下面看两个谓词的情况下,CBO是怎么计算选择率的。
- SQL> drop table audience purge;
- 表已删除。
- SQL> create table audience as
- 2 select
- 3 rownum id,
- 4 trunc(dbms_random.value(1,13))month_no,
- 5 trunc(dbms_random.value(1,16))eu_country
- 6 from
- 7 all_objects
- 8 where
- 9 rownum <= 1200;
- 表已创建。
- SQL> begin
- 2 dbms_stats.gather_table_stats(
- 3 user,
- 4 'audience',
- 5 cascade => true,
- 6 estimate_percent => null,
- 7 method_opt => 'for all columns size 1'
- 8 );
- 9 end;
- 10 /
- PL/SQL 过程已成功完成。
- SQL> select t.TABLE_NAME, t.NUM_ROWS, t.BLOCKS, t.SAMPLE_SIZE from user_tables t;
- TABLE_NAME NUM_ROWS BLOCKS SAMPLE_SIZE
- ---------- ---------- ---------- -----------
- AUDIENCE 1200 6 1200
- SQL> select s.table_name,
- 2 s.column_name,
- 3 s.num_distinct,
- 4 s.low_value,
- 5 s.high_value,
- 6 s.density,
- 7 s.num_nulls,
- 8 s.sample_size,
- 9 s.avg_col_len
- 10 from user_tab_col_statistics s;
- TABLE_NAME COLUMN_NAM NUM_DISTINCT LOW_VALUE HIGH_VALUE DENSITY NUM_NULLS SAMPLE_SIZE AVG_COL_LEN
- ---------- ---------- ------------ ---------- ---------- ---------- ---------- ----------- -----------
- AUDIENCE EU_COUNTRY 15 C102 C110 .066666667 0 1200 3
- AUDIENCE MONTH_NO 12 C102 C10D .083333333 0 1200 3
- AUDIENCE ID 1200 C102 C20D .000833333 0 1200 4
- SQL> select count(*) from audience where month_no=12 and eu_country=8;
- 执行计划
- ----------------------------------------------------------
- Plan hash value: 3337892515
- -------------------------------------------------------------------------------
- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
- -------------------------------------------------------------------------------
- | 0 | SELECT STATEMENT | | 1 | 6 | 3 (0)| 00:00:01 |
- | 1 | SORT AGGREGATE | | 1 | 6 | | |
- |* 2 | TABLE ACCESS FULL| AUDIENCE | 7 | 42 | 3 (0)| 00:00:01 |
- -------------------------------------------------------------------------------
- Predicate Information (identified by operation id):
- ---------------------------------------------------
- 2 - filter("EU_COUNTRY"=8 AND "MONTH_NO"=12)
选择率:month_no选择率 * eu_contry选择率 = 1/12 * 1/15
返回记录:round(1/12*1/15*1200) = 7。
- SQL> select count(*) from audience where month_no=12 or eu_country=8;
- 执行计划
- ----------------------------------------------------------
- Plan hash value: 3337892515
- -------------------------------------------------------------------------------
- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
- -------------------------------------------------------------------------------
- | 0 | SELECT STATEMENT | | 1 | 6 | 3 (0)| 00:00:01 |
- | 1 | SORT AGGREGATE | | 1 | 6 | | |
- |* 2 | TABLE ACCESS FULL| AUDIENCE | 173 | 1038 | 3 (0)| 00:00:01 |
- -------------------------------------------------------------------------------
- Predicate Information (identified by operation id):
- ---------------------------------------------------
- 2 - filter("MONTH_NO"=12 OR "EU_COUNTRY"=8)
选择率:month_no选择率 + eu_contry选择率 - month_no选择率 * eu_contry选择率 = 1/12+1/15-1/12*1/15
返回记录:round((1/12+1/15-1/12*1/15)*1200) = 173。
- SQL> select count(*) from audience where month_no<>12;
- 执行计划
- ----------------------------------------------------------
- Plan hash value: 3337892515
- -------------------------------------------------------------------------------
- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
- -------------------------------------------------------------------------------
- | 0 | SELECT STATEMENT | | 1 | 3 | 3 (0)| 00:00:01 |
- | 1 | SORT AGGREGATE | | 1 | 3 | | |
- |* 2 | TABLE ACCESS FULL| AUDIENCE | 1100 | 3300 | 3 (0)| 00:00:01 |
- -------------------------------------------------------------------------------
- Predicate Information (identified by operation id):
- ---------------------------------------------------
- 2 - filter("MONTH_NO"<>12)
选择率:1- month_no选择率 = 1- 1/12
返回记录:(1-1/12)*1200 = 1100。
8、总结:
- 单个谓词过滤:
- = 基数计算公式 :1/num_distinct*(num_rows-num_nulls),如果有直方图,基数计算公式=(num_rows-num_nulls)*density
- > 基数计算公式:(high_value-limit)/(high_value-low_value)*(num_rows-num_nulls)
- >= 基数计算公式:((high_value-limit)/(high_value-low_value)+1/num_distinct)*(num_rows-num_nulls) 因为有=,所以要加上=的选择率,=的选择率为1/num_distinct
- < 基数计算公式:(limit-low_value)/(high_value-low_value)*(num_rows-num_nulls)
- <= 基数计算公式:((limit-low_value)/(high_value-low_value)+1/num_distinct)*(num_rows-num_nulls)
- between ... and ... 的基数计算公式等价于 xxx<= high_limit ,xxxx>=low_limit
- 基数计算公式:((high_limit-low_limit)/(high_value-low_value)+2/num_distinct)*(num_rows-num_nulls)
- low_limit<xxx and xxx<high_limit 基数计算公式:(high_limit-low_limit)/(high_value-low_value)*(num_rows-num_nulls)
- low_limit<=xxx and xxx<high_limit 基数计算公式:(high_limit-low_limit)/(high_value-low_value)+1/num_distinct)*(num_rows-num_nulls)
- 双谓词,多谓词:
- A AND B 选择率计算公式=A选择率*B选择率
- A OR B 选择率计算公式=A+B-(A AND B)
- NOT A 选择率计算公式=1-A选择率