Histograms
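The first concept to understand is selectivity. Selectivity is the fraction of the rows in a row source that an operation is expected to return. For example, if a query against a 100-row table returns 48 rows, the selectivity is 0.48. Selectivity matters a great deal to the CBO's decisions. Put simply, when selectivity is high and the returned rows make up most of the row source, the CBO tends to access the table with a full table scan; otherwise it tends to use an index scan.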
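As a quick illustration of the 0.48 figure, the sketch below builds a throwaway 100-row row source with CONNECT BY and measures the share of rows that satisfy a predicate. The column name n and the predicate n <= 48 are made up for the example; nothing here depends on optimizer statistics.

```sql
-- 100 generated rows, of which 48 satisfy the predicate n <= 48.
-- selectivity = matching rows / total rows
select count(case when n <= 48 then 1 end) / count(*) as selectivity
from  (select level as n from dual connect by level <= 100);
```

The query should return .48, the same ratio the CBO tries to predict in advance from statistics.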
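To compute selectivity, the CBO needs some statistics gathered in advance, as well as certain initialization parameter settings. For a single column of a table, the CBO needs the following statistics to compute selectivity:
- the number of distinct values
- the low and high values of the column
- the number of nulls
- the data distribution, that is, a histogram (optional)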
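Without histogram information, the CBO estimates selectivity from the first three statistics alone and assumes that the column's values are uniformly distributed. In other words, between the low and high values, every distinct value is assumed to occur equally often. Let's look at an example. (Note: the 10000 in this test was entered as a character value; a proper test should use a number. Character and numeric values behave differently here, which is worth investigating.)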
SQL> create table GOOD as select rownum all_distinct, 10000 skew from dual connect by level <= 10000;
SQL> update GOOD set skew=all_distinct+10 where rownum<=10;
SQL> select * from GOOD where rownum<12;

ALL_DISTINCT       SKEW
------------ ----------
           1         11
           2         12
           3         13
           4         14
           5         15
           6         16
           7         17
           8         18
           9         19
          10         20
          11      10000
Next we gather statistics. Note that no histogram is collected here (size 1 means a single bucket per column).
exec dbms_stats.gather_table_stats('SYS','GOOD', method_opt=>'for all columns size 1');
Then we look at the statistics that were gathered.
SQL> select column_name,num_distinct,LOW_VALUE,HIGH_VALUE,NUM_NULLS,density from user_tab_col_statistics where table_name='GOOD';

COLUMN_NAME        NUM_DISTINCT LOW_VALUE          HIGH_VALUE          NUM_NULLS    DENSITY
------------------ ------------ ------------------ ------------------ ---------- ----------
ALL_DISTINCT              10000 C102               C302                        0      .0001
SKEW                         11 C10C               C302                        0 .090909091
Because no histogram was gathered, density here is simply 1/NUM_DISTINCT: 1/10000 = .0001 for ALL_DISTINCT and 1/11 = .090909091 for SKEW.
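LOW_VALUE and HIGH_VALUE are stored as RAW, so values such as C10C are hard to read. The sketch below decodes them with UTL_RAW.CAST_TO_NUMBER and puts 1/NUM_DISTINCT next to the stored density. It assumes the GOOD table and the statistics from this example, and it only works for NUMBER columns; other data types need a different conversion.

```sql
-- Decode the internal NUMBER representation of the low/high values
-- and compare the stored density with 1/NUM_DISTINCT.
select column_name,
       utl_raw.cast_to_number(low_value)  as low_val,
       utl_raw.cast_to_number(high_value) as high_val,
       1 / num_distinct                   as one_over_ndv,
       density
from   user_tab_col_statistics
where  table_name = 'GOOD';
```

With the statistics shown here, C102 decodes to 1, C10C to 11 and C302 to 10000, which matches the data loaded into the table.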
Let's look at the execution plan the CBO produces in this situation.
SQL> select /*+ gather_plan_statistics */ * from GOOD where skew=12;

ALL_DISTINCT       SKEW
------------ ----------
           2         12

SQL> select * from TABLE(dbms_xplan.display_cursor(null,null,'iostats last'));

------------------------------------------------------------------------------------
| Id  | Operation         | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |      1 |        |      1 |00:00:00.01 |      21 |
|*  1 |  TABLE ACCESS FULL| GOOD |      1 |    909 |      1 |00:00:00.01 |      21 |
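E-Rows is the number of rows the CBO expects to be returned. Because no histogram was gathered, Oracle assumes the data is uniformly distributed, so cardinality = density * 10000 = .090909091 * 10000 = 909.09 rows, shown as 909, although only one row is actually returned (A-Rows). Let's look at another execution plan.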
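The arithmetic behind the 909 estimate can be reproduced from the data dictionary. The query below is a simplified model of an equality predicate on a column without a histogram, assuming the GOOD table and statistics from this example; it ignores nulls, out-of-range values and other adjustments the optimizer may apply.

```sql
-- Simplified model: estimated rows = density * num_rows
-- (only meaningful for columns with no histogram).
select s.column_name,
       t.num_rows,
       s.density,
       round(s.density * t.num_rows) as estimated_rows
from   user_tables t
join   user_tab_col_statistics s
on     s.table_name = t.table_name
where  t.table_name = 'GOOD'
and    s.histogram  = 'NONE';
```

For SKEW this gives round(.090909091 * 10000) = 909, the same value as E-Rows in the plan.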
SQL> explain plan for select * from GOOD where skew=10000;

Explained.

SQL> select * from table(dbms_xplan.display());

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   909 |  8181 |     7  (15)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| GOOD |   909 |  8181 |     7  (15)| 00:00:01 |
--------------------------------------------------------------------------
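The same happens here: even for skew=10000, Oracle estimates that 909 rows will be returned. As a result, when an index exists, Oracle cannot decide correctly whether to use it. For example: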
SQL> create index GOOD_I on GOOD(skew);
SQL> exec dbms_stats.gather_index_stats('SYS','GOOD_I');
SQL> explain plan for select * from GOOD where skew=10000;
SQL> select * from table(dbms_xplan.display());

--------------------------------------------------------------------------------------
| Id  | Operation                   | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |        |   909 |  5454 |     4   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| GOOD   |   909 |  5454 |     4   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | GOOD_I |   909 |       |     2   (0)| 00:00:01 |
--------------------------------------------------------------------------------------
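So without a histogram, Oracle assumes that the data in a column is uniformly distributed and can hardly make the right choice. Here skew=10000 actually matches 9990 rows, but the CBO estimates 909 rows and therefore uses the index.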
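What happens when a histogram is available? A histogram tells the CBO the real distribution of the data in a column. In the example above, the histogram tells the CBO that skew=11 matches only one row while skew=10000 matches 9990 rows, so Oracle can make the right choice.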
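The real distribution that the histogram is meant to capture can be checked directly against the table. This is only a verification query on the GOOD table from this example; it reads the data itself, not the statistics.

```sql
-- Actual number of rows per distinct value of SKEW
select skew, count(*) as actual_rows
from   GOOD
group  by skew
order  by skew;
```

With the data loaded earlier, this should list the values 11 through 20 with one row each and 10000 with 9990 rows.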
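In the releases covered here (up to 11g), Oracle has two kinds of histograms: width-balanced histograms (also called frequency histograms) and height-balanced histograms. Let's look at each of them in detail.
Width-balanced or Frequency Histograms
In this kind of histogram, the x-axis holds the distinct values and the y-axis holds the number of times each distinct value occurs in the column. The prerequisite is that the x-axis can cover every distinct value. An Oracle histogram can have at most 254 buckets, so this kind of histogram can be used only when the column has no more than 254 distinct values. The figure below shows the frequency histogram for our example.
First we create a frequency histogram, then we look at how it is stored in the data dictionary.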
SQL> exec dbms_stats.gather_table_stats('SYS','GOOD',method_opt=>'for columns skew size 11');

PL/SQL procedure successfully completed.

SQL> select column_name,endpoint_number,endpoint_value from user_tab_histograms where table_name='GOOD' and column_name='SKEW';

COLUMN_NAME        ENDPOINT_NUMBER ENDPOINT_VALUE
------------------ --------------- --------------
SKEW                             1             11
SKEW                             2             12
SKEW                             3             13
SKEW                             4             14
SKEW                             5             15
SKEW                             6             16
SKEW                             7             17
SKEW                             8             18
SKEW                             9             19
SKEW                            10             20
SKEW                         10000          10000
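Here ENDPOINT_VALUE is the distinct value itself, and ENDPOINT_NUMBER is the running total of occurrences up to and including that value. For example, 11 occurs once, 12 occurs 2-1=1 time, and 20 occurs 10-9=1 time. You may wonder why the cumulative count is stored rather than the plain number of occurrences. The reason is that cumulative counts are very useful for range predicates: for example, the number of rows with skew>15 is simply 10000-5.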
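The range calculation can be written against USER_TAB_HISTOGRAMS directly. The sketch below assumes the frequency histogram on GOOD.SKEW gathered above is still in place; the boundary value 15 is just the one used in the text. It mimics the arithmetic only and is not the optimizer's own code path.

```sql
-- Rows with skew > 15
--   = total rows (highest cumulative count)
--   - cumulative count up to and including 15
select max(endpoint_number)
       - nvl(max(case when endpoint_value <= 15
                      then endpoint_number end), 0) as rows_above_15
from   user_tab_histograms
where  table_name  = 'GOOD'
and    column_name = 'SKEW';
```

With the histogram rows listed above, this evaluates to 10000 - 5 = 9995.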
Of course, a query like the following gives a more readable mapping from each distinct value to its number of occurrences.
select endpoint_value as column_value,
       endpoint_number as cummulative_frequency,
       endpoint_number - lag(endpoint_number,1,0) over (order by endpoint_number) as frequency
from user_tab_histograms
where table_name = 'GOOD' and column_name = 'SKEW';

COLUMN_VALUE CUMMULATIVE_FREQUENCY  FREQUENCY
------------ --------------------- ----------
          11                     1          1
          12                     2          1
          13                     3          1
          14                     4          1
          15                     5          1
          16                     6          1
          17                     7          1
          18                     8          1
          19                     9          1
          20                    10          1
       10000                 10000       9990
Now let's see what effect the histogram has on queries.
SQL> select column_name, density, histogram from user_tab_col_statistics where table_name='GOOD' ;

COLUMN_NAME           DENSITY HISTOGRAM
------------------ ---------- ---------------------------------------------
ALL_DISTINCT            .0001 NONE
SKEW                   .00005 FREQUENCY
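The first thing to notice is the different density: for SKEW it changed from .090909091 to .00005, which here equals 1/(2 * 10000). The execution plan also improves: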
SQL> explain plan for Select * from good where skew=10000;
SQL> select * from table(dbms_xplan.display);

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |  9990 | 59940 |     6  (17)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| GOOD |  9990 | 59940 |     6  (17)| 00:00:01 |
--------------------------------------------------------------------------
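Rows is now 9990, which shows that the CBO estimated the size of the result correctly and chose a full table scan even though the index exists.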
Height-balanced Histograms
With a frequency histogram, Oracle allocates one bucket for each distinct value. However, the
maximum number of buckets is 254, so for a column with more than 254 distinct values you have to
use a height-balanced histogram.
In a height-balanced histogram there are more distinct values than buckets, so Oracle first sorts
the column data and then divides the complete data set into the requested number of buckets.
Every bucket contains the same number of values (which is why these histograms are called
height-balanced), except the last bucket, which may contain fewer values than the others.
There is no separate statement for creating a height-balanced histogram. When the number of buckets
requested is less than the number of distinct values in a column, Oracle creates a height-balanced
histogram, and the meanings of ENDPOINT_VALUE and ENDPOINT_NUMBER are quite different from those
in a frequency histogram. To see how to interpret the histogram information, take another example:
a column that holds 23 values, 9 of them distinct, for which we have requested 5 buckets.
Below is a pictorial representation of how the data will be stored in the histogram.
We can make the following points based on the picture above:
• The number of buckets is less than the number of distinct values in the column.
• Since we requested 5 buckets, the data set is divided into equally sized buckets, except the
last bucket, which in this case has only 3 values.
• The end point of each bucket and the first point of the first bucket are marked, as they are of
special interest.
• The data value '3' is marked in red; it is special in that it is the end point of multiple
buckets.
With 5 buckets and 23 values, there are 5 values in each bucket except the last bucket, which has
3 values. This is essentially how Oracle stores height-balanced histogram information in the data
dictionary views, with one minor change. Since buckets 1 and 2 both have 3 as their end point,
Oracle does not store bucket 1, to save space. The two buckets are merged and a single entry is
stored.
Let's create a histogram on column skew, this time with fewer buckets than the actual number of
distinct values, which is 11.
exec dbms_stats.gather_table_stats('SYS','GOOD',method_opt=>'for columns skew size 5');
SQL> select table_name, column_name,endpoint_number,endpoint_value from DBA_TAB_HISTOGRAMS where table_name='GOOD' and COLUMN_NAME='SKEW';

TABLE_NAME         COLUMN_NAME        ENDPOINT_NUMBER ENDPOINT_VALUE
------------------ ------------------ --------------- --------------
GOOD               SKEW                             0             11
GOOD               SKEW                             5          10000
Here buckets 1-5 all have 10000 as their end point, so buckets 1-4 are not stored, to save space.
Only bucket 0 (the low value, 11) and bucket 5 remain.
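To see why every bucket ends at 10000, the bucketing can be imitated with the NTILE analytic function. This is only an approximation of the idea (sort, then cut into equally sized buckets) on the GOOD table from this example; it is not Oracle's internal algorithm, which may, for instance, work on a sample.

```sql
-- Sort SKEW, cut the 10000 rows into 5 equally sized buckets,
-- and report the highest value (end point) of each bucket.
select bucket,
       count(*)  as rows_in_bucket,
       max(skew) as endpoint_value
from  (select skew,
              ntile(5) over (order by skew) as bucket
       from   GOOD)
group  by bucket
order  by bucket;
```

Each of the 5 buckets should hold 2000 rows and end at 10000, because the ten small values 11 to 20 all fall at the start of the first bucket. Collapsing the repeated end points gives the two rows stored in the dictionary.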
In a nutshell, in a height-balanced histogram the data is divided into buckets that each contain
the same number of values. The highest value in each bucket is recorded as ENDPOINT_VALUE,
together with the lowest value of the first bucket, which is stored as bucket 0.
ENDPOINT_NUMBER represents the bucket number. Once the data is recorded in buckets, we
recognize two types of data values: non-popular values and popular values.
Popular values are those that occur multiple times as end points. For instance, in our previous
example 3 is a popular value, and in the column skew 10000 is a popular value. Non-popular values
are those that do not occur multiple times as end points. As you might expect, popular and
non-popular values are not fixed; they depend on the bucket size. Changing the bucket size will
result in different popular values.
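Because repeated end points are collapsed into one row, a popular value shows up in the dictionary as a jump of more than one in ENDPOINT_NUMBER. The sketch below uses that to list popular values. It assumes the height-balanced histogram on GOOD.SKEW (size 5) from this example; on a frequency histogram the same difference would be a row count instead, so the query is only meaningful for height-balanced histograms.

```sql
-- A value that closes more than one bucket is popular.
select endpoint_value,
       buckets_spanned
from  (select endpoint_value,
              endpoint_number
                - lag(endpoint_number, 1, 0)
                    over (order by endpoint_number) as buckets_spanned
       from   user_tab_histograms
       where  table_name  = 'GOOD'
       and    column_name = 'SKEW')
where  buckets_spanned > 1;
```

With the two stored rows (0, 11) and (5, 10000), only 10000 qualifies, spanning 5 buckets.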
Let me summarize our discussion of the two histogram types:
• Number of distinct values less than or equal to the number of buckets: the ENDPOINT_VALUE
column contains the distinct values themselves, while the ENDPOINT_NUMBER column holds the
cumulative number of rows with a value less than or equal to that column value (frequency
histograms).
• More distinct values than buckets: the ENDPOINT_NUMBER column contains the bucket id and
the ENDPOINT_VALUE column holds the highest value in each bucket. Bucket 0 is special in that
it holds the low value for the column (height-balanced histograms).
Creating Histograms
The GATHER_TABLE_STATS procedure of the DBMS_STATS package gathers table and column
statistics. Optionally, we can instruct it to create histograms on certain columns through the
method_opt parameter, which accepts the following values:
• FOR ALL [INDEXED | HIDDEN] COLUMNS [size option]
• FOR COLUMNS column_name [size option] column_name [size option] …
The SIZE keyword specifies the maximum number of buckets for the histogram and takes the
following values:
SIZE {integer | REPEAT | AUTO | SKEWONLY}
• integer: number of histogram buckets. Must be in the range 1-254.
• REPEAT: collects histograms only on the columns that already have histograms.
• AUTO: Oracle determines the columns to collect histograms on based on the data distribution
and the workload of the columns.
• SKEWONLY: Oracle determines the columns to collect histograms on based on the data
distribution of the columns only.
The AUTO option also considers the workload of the columns, meaning that it checks whether SQL
queries have used the column in a WHERE predicate.
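The calls below sketch a few typical method_opt settings against the GOOD table from the earlier examples. They are SQL*Plus commands; USER is used for the owner here as an assumption (the earlier examples pass 'SYS' explicitly), and the bucket count 254 is just the documented maximum.

```sql
-- One column, explicit upper limit on the number of buckets
exec dbms_stats.gather_table_stats(user, 'GOOD', method_opt => 'for columns skew size 254');

-- Indexed columns only; histograms where the data distribution is skewed
exec dbms_stats.gather_table_stats(user, 'GOOD', method_opt => 'for all indexed columns size skewonly');

-- Refresh only the histograms that already exist
exec dbms_stats.gather_table_stats(user, 'GOOD', method_opt => 'for all columns size repeat');

-- One bucket per column, that is, no histograms at all
exec dbms_stats.gather_table_stats(user, 'GOOD', method_opt => 'for all columns size 1');
```

Each call replaces the column statistics gathered by the previous one, so the last call leaves GOOD without histograms.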
The default for method_opt changed to 'FOR ALL COLUMNS SIZE AUTO' in 10g; in 9i it was
'FOR ALL COLUMNS SIZE 1'. In other words, Oracle now decides automatically which columns
need histograms and how many buckets to use. This sounds ideal, but it has many caveats that
are outside the scope of this paper. I will cover this topic in greater detail in the next part.
The default value can be changed using the SET_PARAM procedure.
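A minimal sketch of reading and changing the default is shown below. GET_PARAM and SET_PARAM are the 10g calls referred to above. The GET_PREFS and SET_GLOBAL_PREFS calls are included on the understanding that they supersede SET_PARAM from 11g onward. Both variants need suitable privileges and affect every later statistics gathering that relies on the default, so treat the values here as illustrations only.

```sql
-- 10g: read the current default, then change it
select dbms_stats.get_param('METHOD_OPT') from dual;
exec dbms_stats.set_param('METHOD_OPT', 'FOR ALL COLUMNS SIZE 1');

-- 11g and later: preference-based equivalents
select dbms_stats.get_prefs('METHOD_OPT') from dual;
exec dbms_stats.set_global_prefs('METHOD_OPT', 'FOR ALL COLUMNS SIZE AUTO');
```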
Viewing Histograms
We have been querying histogram information for a while; now it is time to look in detail at the
places where it is available. Information about existing histograms in the database can be found
in the DBA_TAB_HISTOGRAMS data dictionary view, which lists the histograms on columns of all
tables. The actual value may be stored in ENDPOINT_ACTUAL_VALUE if the column is not a number
(for example, a VARCHAR2) and the first six bytes of some values are the same.
The number of buckets in each column's histogram and the density value can be found in the
DBA_TAB_COLUMNS and DBA_TAB_COL_STATISTICS data dictionary views. The latter
simply extracts its data from DBA_TAB_COLUMNS.
Corresponding views are available for partition and subpartition columns, for example
DBA_PART_HISTOGRAMS and DBA_SUBPART_HISTOGRAMS.
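As a short illustration of these views, the query below summarizes the column statistics for the GOOD table used throughout this article. It uses the USER_ variant so that it runs without DBA privileges; the HISTOGRAM column is available from 10g onward.

```sql
-- Per-column summary: distinct values, buckets, density, histogram type
select column_name,
       num_distinct,
       num_buckets,
       density,
       histogram
from   user_tab_col_statistics
where  table_name = 'GOOD'
order  by column_name;
```

The HISTOGRAM column shows NONE, FREQUENCY or HEIGHT BALANCED depending on which of the earlier gathering commands was run last.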