Oracle 优化之集群因子 (转载)

Oracle Clustering Factor – Explained

By Brian Peasland

http://www.peasland.net

 

Definition

What is the definition of Oracle’s Clustering Factor? If you ask most people, they will probably give you an incorrect answer, but one that does have some basis of truth to it, so people accept the answer. Most people will tell you that the Clustering Factor is a measure of how sorted the data is within the table. This is not exactly true. The real definition of the Clustering Factor is that the value is the anticipated number of I/O operations that need to complete when performing a range scan using an index and reading all rows of data from the table. The generally accepted definition does have a measure of truth to it because if you store the data in sorted order, the clustering factor will be lower than if the data were not sorted.

Clustering Factor Defined

The following diagram shows a table and its index that will have a high clustering factor.

 

 

 

In the diagram above, we can see the first index leaf block has three ‘A’ values. The index points to the presence of ‘A’ in blocks 1, 2, and 3. So if we are using the index, to read all ‘A’ values, I have to read three blocks from the table. Similarly, to read all ‘B’ values, I have to read blocks 1, 3, and 4.

To read all ‘A’ and ‘B’ values, the clustering factor tells us that we need to read 6 blocks. But wait a minute; the table only has 4 blocks in it. The clustering factor assumes that if you read a block and then need to read another block, the first block read is moved out of the Buffer Cache. The clustering factor assumes that to read all of the data values, we would perform the following I/O operations:

  1. Read block 1 – A
  2. Read block 2 – A
  3. Read block 3 – A
  4. Read block 1 – B
  5. Read block 3 – B
  6. Read block 4 – B
  7. Read block 1 – C
  8. Read block 4 – C
  9. Read block 2 – D
  10. Read block 4 – D
  11. Read block 2 – E
  12. Read block 3 – E

In this example, CLUSTERING_FACTOR=12 to read all 12 values, because it would require 12 read I/O operations to use this index to all rows of data. This is the first thing we need to understand when looking at the Clustering Factor. If the data is not sorted, the Clustering Factor will be tend towards the number of rows in the table. 

Now let’s store the data in the table in sorted order.

 

 

 

With the data stored in sorted order, we have many fewer reads, as follows:

  1. Read block 1 – A
  2. Block 1 in cache – A
  3. Block 1 in cache – A
  4. Read block 2 – B
  5. Block 2 in cache – B
  6. Block 2 in cache – B
  7. Read block 3 – C
  8. Block 3 in cache – C
  9. Block 3 in cache – D
  10. Read block 4 – D
  11. Block 4 in cache – E
  12. Block 4 in cache – E

In this example, CLUSTERING_FACTOR=4 to read all 12 values, because it would require 4 read I/O operations to use this index to all rows of data. This is the second thing we need to understand when looking at the Clustering Factor. If the data is sorted in the table, then the clustering factor will be low and will tend towards the number of table blocks. 

There are three major problems when looking at Clustering Factor in isolation. In other words, we cannot look at just the Clustering Factor and decide that we have an issue that needs to be addressed. The first major problem is the clustering factor ignores the Buffer Cache. In our first example, if the Buffer Cache were a factor, the operations would be as follows:

  1. Read block 1 – A
  2. Read block 2 – A
  3. Read block 3 – A
  4. Block 1 in cache – B
  5. Block 3 in cache – B
  6. Read block 4 – B
  7. Block 1 in cache – C
  8. Block 4 in cache – C
  9. Block 2 in cache – D
  10. Block 4 in cache – D
  11. Block 2 in cache – E
  12. Block 3 in cache – E

Even though the blocks are accessed out of order, there are still only four read I/O operations. This is for a small table and if the table is sufficiently large, the Buffer Cache will help but may not be able to hold the entire table’s blocks. In this case, the Buffer Cache will still help and will have a positive impact, but reading a block more than once can be a possibility.

The second major problem with the Clustering Factor is it assumes the index is used to read all rows of the table. However, Oracle will only use the index if the query is estimated to return 10% of the total number of rows of the table, or less. This is vitally important: A high Clustering Factor does not necessarily equate to poor performance because Oracle may not even use the index, preferring a full table scan instead.

The third major problem is that the Clustering Factor only applies to using the index when performing INDEX RANGE scans in queries. Index range scans are performed when the indexed column is compared with the following operators: <, >, ≤, ≥, and BETWEEN. If equality operations are performed, the Clustering Factor has little or no bearing. Just looking at the Clustering Factor eliminates from consideration if the index is being used for index range scans or for unique scans, or both.

 

Clustering Factor Test Case

Let’s look at a simple table and its clustering factor.

SQL> select count(*) from test_tab;

  COUNT(*)
----------
    108837
SQL> select clustering_factor from user_indexes where index_name='TEST_TAB_IX1';

CLUSTERING_FACTOR
-----------------
              165
SQL> select blocks from user_tables where table_name='TEST_TAB';

    BLOCKS
----------
       244

 

 

Here, we have a 100,000 row table with a clustering factor of 165. The number of blocks is 244. Since the clustering factor is closer to the number of blocks so the clustering factor is “low”.

 

 1 SQL> alter system flush buffer_cache;
 2 
 3  
 4 
 5 System altered.
 6 
 7 SQL> select id,val from test_tab
 8 
 9   2  where id between 100 and 200;
10 
11  
12 
13 --------------------------------------------------------------------------------
14 
15 | Id  | Operation                   | Name         | Rows  | Bytes | Cost (%CPU)
16 
17 --------------------------------------------------------------------------------
18 
19 |   0 | SELECT STATEMENT            |              |    95 |   475 |     3   (0)
20 
21 |   1 |  TABLE ACCESS BY INDEX ROWID| TEST_TAB     |    95 |   475 |     3   (0)
22 
23 |*  2 |   INDEX RANGE SCAN          | TEST_TAB_IX1 |    95 |       |     2   (0)
24 
25 --------------------------------------------------------------------------------
26 
27  
28 
29 Statistics
30 
31 ----------------------------------------------------------
32 
33           0  recursive calls
34 
35           0  db block gets
36 
37           2  consistent gets
38 
39           3  physical reads
40 
41  

 

We see that we have 3 physical reads when performing the index range scan.

Now we’ll look at the table with random data.

 1 SQL> select count(*) from test_tab;
 2 
 3  
 4 
 5   COUNT(*)
 6 
 7 ----------
 8 
 9     108838
10 
11 SQL> select clustering_factor from user_indexes where index_name='TEST_TAB_IX1';
12 
13  
14 
15 CLUSTERING_FACTOR
16 
17 -----------------
18 
19            108273
20 
21 SQL> select blocks from user_tables where table_name='TEST_TAB';
22 
23  
24 
25     BLOCKS
26 
27 ----------
28 
29        244
30 
31  

 

The above shows that we have a table with the same number of rows and the same number of blocks, but the clustering factor is closer to the number of rows, so we say we have a high clustering factor.  

 1 SQL> select id,val from test_tab
 2 
 3   2  where id between 100 and 200;
 4 
 5  
 6 
 7 --------------------------------------------------------------------------------
 8 
 9 | Id  | Operation                   | Name         | Rows  | Bytes | Cost (%CPU)
10 
11 --------------------------------------------------------------------------------
12 
13 |   0 | SELECT STATEMENT            |              |    95 |   475 |     3   (0)
14 
15 |   1 |  TABLE ACCESS BY INDEX ROWID| TEST_TAB     |    95 |   475 |     3   (0)
16 
17 |*  2 |   INDEX RANGE SCAN          | TEST_TAB_IX1 |    95 |       |     2   (0)
18 
19 --------------------------------------------------------------------------------
20 
21  
22 
23 Statistics
24 
25 ----------------------------------------------------------
26 
27           0  recursive calls
28 
29           0  db block gets
30 
31           2  consistent gets
32 
33           8  physical reads

 

 

We can see with the unordered data, that we have 8 physical reads for the same data as opposed for 3 physical reads in the example with ordered data. Since each physical read requires time, we can safely say that the unsorted data takes longer to process than the example with sorted data.

Which leads us to another vitally important concept: It is the run time and only the run time that is the important statistic. One cannot safely look at just the clustering factor and say that we can improve performance by sorting the data. We may improve performance or we may not. Performance also depends on a number of factors such as the Buffer Cache, the data in the buffer cache, the SQL statement using that index, competing resource usage on the system, etc. None of these have any correlation to the Clustering Factor!

 

But Lower Clustering Factor Performs!

Fair enough. The DBA has not focused solely on the Clustering Factor metric. Instead, the DBA is using only the run time metric. But the DBA notices that if the data is sorted, the clustering factor is lower and the run time decreases. At some later date, the data in the table becomes unsorted; the clustering factor rises as does the run time. The DBA has also performed due diligence and determined that the queries using the index on this table are performing Range Scans. Finally at this point we can say that by sorting the data and lowering the clustering factor, we have improved query performance. This is a fair enough observation.

Stop your reorgs

The DBA then performs table reorganization, the data is sorted and query performance is happy again. That is, everyone is happy until another ETL load or two and the data is no longer sorted anymore. So the DBA performs another reorg. This process will repeat itself ad-infinitum unless we do something else.

A reorg can only improve the clustering factor on one indexed column. Let’s create a table with two columns and populate it with data.

 1 SQL> create table test_tab (id1 number, id2 varchar2(30));
 2 
 3  
 4 
 5 Table created.
 6 
 7  
 8 
 9 SQL> insert into test_tab
10 
11   2  select object_id,object_name from dba_objects;
12 
13  
14 
15 108837 rows created.
16 
17  
18 
19 SQL> commit;
20 
21  
22 
23 Commit complete.
24 
25 SQL> create index test_tab_ix1 on test_tab(id1);
26 
27  
28 
29 Index created.
30 
31  
32 
33 SQL> create index test_tab_ix2 on test_tab(id2);
34 
35  
36 
37 Index created.
38 
39  

 

We will then calculate stats on the table and its indexes and look at the clustering factor.

 1 SQL> exec dbms_stats.gather_table_stats(ownname=>USER,tabname=>'TEST_TAB',method_opt=>'FOR ALL INDEXED COLUMNS',cascade=>TRUE);
 2 
 3  
 4 
 5 PL/SQL procedure successfully completed.
 6 
 7  
 8 
 9 SQL> select index_name,clustering_factor from user_indexes;
10 
11  
12 
13 INDEX_NAME                     CLUSTERING_FACTOR
14 
15 ------------------------------ -----------------
16 
17 TEST_TAB_IX1                                 997
18 
19 TEST_TAB_IX2                               58135

 

 

The first index has a low clustering factor and the second index has a high one. Let’s reorg the table to improve the clustering factor for IX2.

 1 SQL> create table test_tab_temp as select * from test_tab;
 2 
 3  
 4 
 5 Table created.
 6 
 7  
 8 
 9 SQL> truncate table test_tab;
10 
11  
12 
13 Table truncated.
14 
15  
16 
17 SQL> insert into test_tab select * from test_tab_temp order by id2;
18 
19  
20 
21 108837 rows created.
22 
23  
24 
25 SQL> commit;
26 
27  
28 
29 Commit complete.
30 
31  
32 
33 SQL> drop table test_tab_temp;
34 
35  
36 
37 Table dropped.
38 
39  
40 
41 SQL> exec dbms_stats.gather_table_stats(ownname=>USER,tabname=>'TEST_TAB',method_opt=>'FOR ALL INDEXED COLUMNS',cascade=>TRUE);
42 
43  
44 
45 PL/SQL procedure successfully completed.
46 
47  
48 
49 SQL> select index_name,clustering_factor from user_indexes;
50 
51  
52 
53 INDEX_NAME                     CLUSTERING_FACTOR
54 
55 ------------------------------ -----------------
56 
57 TEST_TAB_IX1                               53001
58 
59 TEST_TAB_IX2                                 557

 

 

We have performed a standard reorg by copying the data to another table, truncating the source table, and then inserting the data back in sorted order. The clustering factor on IX2 has improved! But notice that the clustering factor on IX1 is now worse. Except under very rare circumstances, when the DBA attempts to improve the clustering factor of one index, it will worsen the clustering factor for other indexes on the table. The DBA needs to weigh this very carefully before reorg-ing to improve the clustering factor. You will not be able to improve the Clustering Factor on one index without negatively impacting that same metric on other indexes, so the sorted order becomes important.

Here are some other methods to stop frequent reorgs to improve the Clustering Factor:

  • If your ETL process allows it, try to insert the records in sorted order. This way, you won’t have to reorg the table regularly. This may not be possible depending on the data. It is not possible to interleave new data with existing data without the reorg. But if the new data is “larger” then the existing data, you can presort before inserting. For instance, we have a table with an ID column and currently, the ID values are sorted and range from 1 to 100,000. The next ETL process will add 10,000 more rows and we know that the ID values will be at least 100,001. The new data does not need to be sorted with the old data so we can sort on ingest during the ETL process.
  • Partitioning is a great way to stop the reorgs. Partition your table on the column that has the bad clustering factor on its index. While the data will not necessarily be sorted, when performing a range scan, Oracle will use partition pruning to significantly reduce the blocks needing to be read.
  • Indexed Organized Tables (IOT’s) can help you stop the reorgs. The IOT basically stores all of the data in the index, not just the index key. But to the application, the IOT looks just like any other table. When a new row is added to the IOT, the row is inserted into the correct sorted location. Your data always stays sorted. By leveraging the IOT, the need for future reorgs disappears. There are some disadvantages to IOT’s. The IOT must have a primary key and you cannot cluster or compress the IOT. But you can create secondary indexes on the IOT.
  • If you cannot use the IOT, then a sorted hash cluster can help. A sorted hash cluster is often used to eliminate reorgs for the purposes of reducing the Clustering Factor. With a sorted hash cluster, the rows are stored in sorted order which can lead to faster data retrieval. The biggest downside to the sorted hash cluster is you must sort on the primary key. If the PK index is not the index you are trying to reduce the clustering factor on, then the sorted hash cluster may not work for you.

 

posted @ 2022-09-22 17:09  闻先生  阅读(23)  评论(0编辑  收藏  举报