MySQL 8.0 Reference Manual(读书笔记59节--Controlling the Query Optimizer(3))

3.7 Statement Execution Time Optimizer Hints

The MAX_EXECUTION_TIME hint is permitted only for SELECT statements. It places a limit N (a timeout value in milliseconds) on how long a statement is permitted to execute before the server terminates it:

MAX_EXECUTION_TIME(N)

Example with a timeout of 1 second (1000 milliseconds):

SELECT /*+ MAX_EXECUTION_TIME(1000) */ * FROM t1 INNER JOIN t2 WHERE ...

The MAX_EXECUTION_TIME(N) hint sets a statement execution timeout of N milliseconds. If this option is absent【æbˈsent 缺席的;不存在;不在的;缺少;心不在焉的;出神的;】 or N is 0, the statement timeout established【ɪˈstæblɪʃt 建立(尤指正式关系);设立;确立;创立;使立足;使稳固;】 by the max_execution_time system variable applies.

The MAX_EXECUTION_TIME hint is applicable as follows:

• For statements with multiple SELECT keywords, such as unions or statements with subqueries, MAX_EXECUTION_TIME applies to the entire statement and must appear after the first SELECT.

• It applies to read-only SELECT statements. Statements that are not read only are those that invoke a stored function that modifies data as a side effect.

• It does not apply to SELECT statements in stored programs and is ignored.

3.8 Variable-Setting Hint Syntax

The SET_VAR hint sets the session value of a system variable temporarily【ˈtɛmpəˌrɛrəlɪ 暂时;】 (for the duration of a single statement). Examples:

SELECT /*+ SET_VAR(sort_buffer_size = 16M) */ name FROM people ORDER BY name;
INSERT /*+ SET_VAR(foreign_key_checks=OFF) */ INTO t2 VALUES(2);
SELECT /*+ SET_VAR(optimizer_switch = 'mrr_cost_based=off') */ 1;

Syntax of the SET_VAR hint:

SET_VAR(var_name = value)

var_name names a system variable that has a session value (although not all such variables can be named, as explained later). value is the value to assign to the variable; the value must be a scalar.

SET_VAR makes a temporary variable change, as demonstrated by these statements:

mysql> SELECT @@unique_checks;
+-----------------+
| @@unique_checks |
+-----------------+
| 1               |
+-----------------+
mysql> SELECT /*+ SET_VAR(unique_checks=OFF) */ @@unique_checks;
+-----------------+
| @@unique_checks |
+-----------------+
| 0               |
+-----------------+
mysql> SELECT @@unique_checks;
+-----------------+
| @@unique_checks |
+-----------------+
| 1               |
+-----------------+

 With SET_VAR, there is no need to save and restore the variable value. This enables you to replace multiple statements by a single statement. Consider this sequence of statements:

SET @saved_val = @@SESSION.var_name;
SET @@SESSION.var_name = value;
SELECT ...
SET @@SESSION.var_name = @saved_val;

The sequence can be replaced by this single statement:

SELECT /*+ SET_VAR(var_name = value) ...

Standalone【单独的,独立的;】 SET statements permit any of these syntaxes for naming session variables:

SET SESSION var_name = value;
SET @@SESSION.var_name = value;
SET @@.var_name = value;

Because the SET_VAR hint applies only to session variables, session scope is implicit, and SESSION, @@SESSION., and @@ are neither needed nor permitted. Including explicit session-indicator syntax results in the SET_VAR hint being ignored with a warning.

Not all session variables are permitted for use with SET_VAR. Individual system variable descriptions indicate whether each variable is hintable.descriptions indicate whether each variable is hintable; see Section 5.1.8, “Server System Variables”. You can also check a system variable at runtime by attempting to use it with SET_VAR. If the variable is not hintable, a warning occurs:

mysql> SELECT /*+ SET_VAR(collation_server = 'utf8mb4') */ 1;
+---+
| 1 |
+---+
| 1 |
+---+
1 row in set, 1 warning (0.00 sec)
mysql> SHOW WARNINGS\G
*************************** 1. row ***************************
 Level: Warning
 Code: 4537
Message: Variable 'collation_server' cannot be set using SET_VAR hint.

SET_VAR syntax permits setting only a single variable, but multiple hints can be given to set multiple variables:

SELECT /*+ SET_VAR(optimizer_switch = 'mrr_cost_based=off')
 SET_VAR(max_heap_table_size = 1G) */ 1;

If several hints with the same variable name appear in the same statement, the first one is applied and the others are ignored with a warning:

SELECT /*+ SET_VAR(max_heap_table_size = 1G)
 SET_VAR(max_heap_table_size = 3G) */ 1;

In this case, the second hint is ignored with a warning that it is conflicting.

A SET_VAR hint is ignored with a warning if no system variable has the specified name or the variable value is incorrect:

SELECT /*+ SET_VAR(max_size = 1G) */ 1;
SELECT /*+ SET_VAR(optimizer_switch = 'mrr_cost_based=yes') */ 1;

For the first statement, there is no max_size variable. For the second statement, mrr_cost_based takes values of on or off, so attempting to set it to yes is incorrect. In each case, the hint is ignored with a warning.

The SET_VAR hint is permitted only at the statement level. If used in a subquery, the hint is ignored with a warning.

Replicas ignore SET_VAR hints in replicated statements to avoid the potential for security issues.

3.9 Resource Group Hint Syntax

The RESOURCE_GROUP optimizer hint is used for resource group management. This hint assigns【əˈsaɪnz 分配(某物);指定;指派;分派,布置(工作、任务等);派遣;委派;】 the thread that executes a statement to the named resource group temporarily (for the duration of the statement). It requires the RESOURCE_GROUP_ADMIN or RESOURCE_GROUP_USER privilege.

 Examples:

SELECT /*+ RESOURCE_GROUP(USR_default) */ name FROM people ORDER BY name;
INSERT /*+ RESOURCE_GROUP(Batch) */ INTO t2 VALUES(2);

Syntax of the RESOURCE_GROUP hint:

RESOURCE_GROUP(group_name)

group_name indicates the resource group to which the thread should be assigned for the duration of statement execution. If the group is nonexistent, a warning occurs and the hint is ignored.

The RESOURCE_GROUP hint must appear after the initial statement keyword (SELECT, INSERT, REPLACE, UPDATE, or DELETE).

An alternative to RESOURCE_GROUP is the SET RESOURCE GROUP statement, which nontemporarily assigns threads to a resource group.

3.10 Optimizer Hints for Naming Query Blocks

Table-level, index-level, and subquery optimizer hints permit specific query blocks to be named as part of their argument syntax. To create these names, use the QB_NAME hint, which assigns a name to the query block in which it occurs:

QB_NAME(name)

QB_NAME hints can be used to make explicit【ɪkˈsplɪsɪt 明确的;详述的;直言的, 坦率的;一目了然的;】 in a clear way which query blocks other hints apply to. They also permit all non-query block name hints to be specified within a single hint comment for easier understanding of complex statements. Consider the following statement:

SELECT ...
 FROM (SELECT ...
 FROM (SELECT ... FROM ...)) ...

QB_NAME hints assign names to query blocks in the statement:

SELECT /*+ QB_NAME(qb1) */ ...
 FROM (SELECT /*+ QB_NAME(qb2) */ ...
 FROM (SELECT /*+ QB_NAME(qb3) */ ... FROM ...)) ...

Then other hints can use those names to refer to the appropriate query blocks:

SELECT /*+ QB_NAME(qb1) MRR(@qb1 t1) BKA(@qb2) NO_MRR(@qb3t1 idx1, id2) */ ...
 FROM (SELECT /*+ QB_NAME(qb2) */ ...
 FROM (SELECT /*+ QB_NAME(qb3) */ ... FROM ...)) ...

The resulting effect is as follows:

• MRR(@qb1 t1) applies to table t1 in query block qb1.

• BKA(@qb2) applies to query block qb2.

• NO_MRR(@qb3 t1 idx1, id2) applies to indexes idx1 and idx2 in table t1 in query block qb3.

Query block names are identifiers and follow the usual rules about what names are valid and how to quote them. For example, a query block name that contains spaces must be quoted, which can be done using backticks:

SELECT /*+ BKA(@`my hint name`) */ ...
 FROM (SELECT /*+ QB_NAME(`my hint name`) */ ...) ...

If the ANSI_QUOTES SQL mode is enabled, it is also possible to quote query block names within double quotation marks:

SELECT /*+ BKA(@"my hint name") */ ...
 FROM (SELECT /*+ QB_NAME("my hint name") */ ...) ...

4 Index Hints

Index hints give the optimizer information about how to choose indexes during query processing. Index hints, described here, differ from optimizer hints, Index and optimizer hints may be used separately or together.

Index hints apply to SELECT and UPDATE statements. They also work with multi-table DELETE statements, but not with single-table DELETE, as shown later in this section.

Index hints are specified following a table name. (For the general syntax for specifying tables in a SELECT statement) The syntax for referring to an individual table, including index hints, looks like this:

tbl_name [[AS] alias] [index_hint_list]
index_hint_list:
 index_hint [index_hint] ...
index_hint:
 USE {INDEX|KEY}
 [FOR {JOIN|ORDER BY|GROUP BY}] ([index_list])
 | {IGNORE|FORCE} {INDEX|KEY}
 [FOR {JOIN|ORDER BY|GROUP BY}] (index_list)
index_list:
 index_name [, index_name] ...

The USE INDEX (index_list) hint tells MySQL to use only one of the named indexes to find rows in the table. The alternative syntax IGNORE INDEX (index_list) tells MySQL to not use some particular index or indexes. These hints are useful if EXPLAIN shows that MySQL is using the wrong index from the list of possible indexes.

The FORCE INDEX hint acts like USE INDEX (index_list), with the addition that a table scan is assumed to be very expensive. In other words, a table scan is used only if there is no way to use one of the named indexes to find rows in the table.

【As of MySQL 8.0.20, the server supports the index-level optimizer hints JOIN_INDEX, GROUP_INDEX, ORDER_INDEX, and INDEX, which are equivalent to and intended to supersede FORCE INDEX index hints, as well as the NO_JOIN_INDEX, NO_GROUP_INDEX, NO_ORDER_INDEX, and NO_INDEX optimizer hints, which are equivalent to and intended to supersede IGNORE INDEX index hints. Thus, you should expect USE INDEX, FORCE INDEX, and IGNORE INDEX to be deprecated in a future release of MySQL, and at some time thereafter to be removed altogether.

These index-level optimizer hints are supported with both single-table and multitable DELETE statements.】

Each hint requires index names, not column names. To refer to a primary key, use the name PRIMARY. To see the index names for a table, use the SHOW INDEX statement or the Information Schema STATISTICS table.

An index_name value need not be a full index name. It can be an unambiguous【ˌʌnæmˈbɪɡjuəs 明确的;毫不含糊的;无歧义的;意思清楚的;】 prefix of an index name. If a prefix is ambiguous, an error occurs.

Examples:

SELECT * FROM table1 USE INDEX (col1_index,col2_index)
 WHERE col1=1 AND col2=2 AND col3=3;
SELECT * FROM table1 IGNORE INDEX (col3_index)
 WHERE col1=1 AND col2=2 AND col3=3;

The syntax for index hints has the following characteristics:

• It is syntactically valid to omit index_list for USE INDEX, which means “use no indexes.” Omitting index_list for FORCE INDEX or IGNORE INDEX is a syntax error.

• You can specify the scope of an index hint by adding a FOR clause to the hint. This provides more fine-grained control over optimizer selection of an execution plan for various phases of query processing.

To affect only the indexes used when MySQL decides how to find rows in the table and how to process joins, use FOR JOIN. To influence index usage for sorting or grouping rows, use FOR ORDER BY or FOR GROUP BY.

• You can specify multiple index hints:

SELECT * FROM t1 USE INDEX (i1) IGNORE INDEX FOR ORDER BY (i2) ORDER BY a;

It is not an error to name the same index in several hints (even within the same hint):

SELECT * FROM t1 USE INDEX (i1) USE INDEX (i1,i1);

However, it is an error to mix USE INDEX and FORCE INDEX for the same table:

SELECT * FROM t1 USE INDEX FOR JOIN (i1) FORCE INDEX FOR JOIN (i2);

If an index hint includes no FOR clause, the scope of the hint is to apply to all parts of the statement. For example, this hint:

IGNORE INDEX (i1)

is equivalent to this combination of hints:

IGNORE INDEX FOR JOIN (i1)
IGNORE INDEX FOR ORDER BY (i1)
IGNORE INDEX FOR GROUP BY (i1)

In MySQL 5.0, hint scope with no FOR clause was to apply only to row retrieval. To cause the server to use this older behavior when no FOR clause is present, enable the old system variable at server startup. Take care about enabling this variable in a replication setup. With statement-based binary logging, having different modes for the source and replicas might lead to replication errors.

When index hints are processed, they are collected in a single list by type (USE, FORCE, IGNORE) and by scope (FOR JOIN, FOR ORDER BY, FOR GROUP BY). For example:

SELECT * FROM t1
 USE INDEX () IGNORE INDEX (i2) USE INDEX (i1) USE INDEX (i2);

is equivalent to:

SELECT * FROM t1
 USE INDEX (i1,i2) IGNORE INDEX (i2);

The index hints then are applied for each scope【skoʊp (题目、组织、活动等的)范围;(做或实现某事的)机会,能力;…镜(观察仪器);】 in the following order:

1. {USE|FORCE} INDEX is applied if present. (If not, the optimizer-determined set of indexes is used.)

2. IGNORE INDEX is applied over the result of the previous step. For example, the following two queries are equivalent:

SELECT * FROM t1 USE INDEX (i1) IGNORE INDEX (i2) USE INDEX (i2);
SELECT * FROM t1 USE INDEX (i1);

For FULLTEXT searches, index hints work as follows:

• For natural language mode searches, index hints are silently ignored. For example, IGNORE INDEX(i1) is ignored with no warning and the index is still used.

• For boolean mode searches, index hints with FOR ORDER BY or FOR GROUP BY are silently ignored. Index hints with FOR JOIN or no FOR modifier are honored. In contrast to how hints apply for non-FULLTEXT searches, the hint is used for all phases of query execution (finding rows and retrieval, grouping, and ordering). This is true even if the hint is given for a non-FULLTEXT index.

For example, the following two queries are equivalent【ɪˈkwɪvələnt (价值、数量、意义、重要性等)相同的;相等的;】:

SELECT * FROM t
 USE INDEX (index1)
 IGNORE INDEX FOR ORDER BY (index1)
 IGNORE INDEX FOR GROUP BY (index1)
 WHERE ... IN BOOLEAN MODE ... ;
SELECT * FROM t
 USE INDEX (index1)
 WHERE ... IN BOOLEAN MODE ... ;

Index hints work with DELETE statements, but only if you use multi-table DELETE syntax, as shown here:

mysql> EXPLAIN DELETE FROM t1 USE INDEX(col2)
 -> WHERE col1 BETWEEN 1 AND 100 AND COL2 BETWEEN 1 AND 100\G
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that
corresponds to your MySQL server version for the right syntax to use near 'use
index(col2) where col1 between 1 and 100 and col2 between 1 and 100' at line 1
 
mysql> EXPLAIN DELETE t1.* FROM t1 USE INDEX(col2)
 -> WHERE col1 BETWEEN 1 AND 100 AND COL2 BETWEEN 1 AND 100\G
*************************** 1. row ***************************
 id: 1
 select_type: DELETE
 table: t1
 partitions: NULL
 type: range
possible_keys: col2
 key: col2
 key_len: 5
 ref: NULL
 rows: 72
 filtered: 11.11
 Extra: Using where
1 row in set, 1 warning (0.00 sec)

5 The Optimizer Cost Model

To generate execution plans, the optimizer uses a cost model that is based on estimates of the cost of various operations that occur during query execution. The optimizer has a set of compiled-in default “cost constants” available to it to make decisions regarding execution plans.

The optimizer also has a database of cost estimates to use during execution plan construction. These estimates are stored in the server_cost and engine_cost tables in the mysql system database and are configurable at any time. The intent of these tables is to make it possible to easily adjust the cost estimates that the optimizer uses when it attempts to arrive at query execution plans.

5.1 Cost Model General Operation

The configurable optimizer cost model works like this:

• The server reads the cost model tables into memory at startup and uses the in-memory values at runtime. Any non-NULL cost estimate specified in the tables takes precedence over the corresponding compiled-in default cost constant. Any NULL estimate indicates to the optimizer to use the compiled-in default.

• At runtime, the server may re-read the cost tables. This occurs when a storage engine is dynamically loaded or when a FLUSH OPTIMIZER_COSTS statement is executed.

• Cost tables enable server administrators to easily adjust cost estimates by changing entries in the tables. It is also easy to revert to a default by setting an entry's cost to NULL. The optimizer uses the in-memory cost values, so changes to the tables should be followed by FLUSH OPTIMIZER_COSTS to take effect.

• The in-memory cost estimates that are current when a client session begins apply throughout that session until it ends. In particular, if the server re-reads the cost tables, any changed estimates apply only to subsequently started sessions. Existing sessions are unaffected.

• Cost tables are specific to a given server instance. The server does not replicate cost table changes to replicas.

5.2 The Cost Model Database

The optimizer cost model database consists of two tables in the mysql system database that contain cost estimate information for operations that occur during query execution:

• server_cost: Optimizer cost estimates for general server operations

• engine_cost: Optimizer cost estimates for operations specific to particular storage engines

The server_cost table contains these columns:

• cost_name

The name of a cost estimate used in the cost model. The name is not case-sensitive. If the server does not recognize the cost name when it reads this table, it writes a warning to the error log.

• cost_value

The cost estimate value. If the value is non-NULL, the server uses it as the cost. Otherwise, it uses the default estimate (the compiled-in value). DBAs can change a cost estimate by updating this column. If the server finds that the cost value is invalid (nonpositive) when it reads this table, it writes a warning to the error log.

To override a default cost estimate (for an entry that specifies NULL), set the cost to a non-NULL value. To revert to the default, set the value to NULL. Then execute FLUSH OPTIMIZER_COSTS to tell the server to re-read the cost tables.

• last_update

The time of the last row update.

• comment

A descriptive【dɪˈskrɪptɪv 描写的;叙述的;说明的;描写性的(描述语言的实际应用而非使用规则);】 comment associated with the cost estimate. DBAs can use this column to provide information about why a cost estimate row stores a particular value.

• default_value

The default (compiled-in) value for the cost estimate. This column is a read-only generated column that retains its value even if the associated cost estimate is changed. For rows added to the table at runtime, the value of this column is NULL.

The primary key for the server_cost table is the cost_name column, so it is not possible to create multiple entries for any cost estimate.

The server recognizes these cost_name values for the server_cost table:

• disk_temptable_create_cost, disk_temptable_row_cost

The cost estimates for internally created temporary tables stored in a disk-based storage engine (either InnoDB or MyISAM). Increasing these values increases the cost estimate of using internal temporary tables and makes the optimizer prefer query plans with less use of them.

The larger default values for these disk parameters compared to the default values for the corresponding memory parameters (memory_temptable_create_cost, memory_temptable_row_cost) reflects the greater cost of processing disk-based tables.

• key_compare_cost

The cost of comparing record keys. Increasing this value causes a query plan that compares many keys to become more expensive. For example, a query plan that performs a filesort becomes relatively more expensive compared to a query plan that avoids sorting by using an index.

• memory_temptable_create_cost, memory_temptable_row_cost

The cost estimates for internally created temporary tables stored in the MEMORY storage engine. Increasing these values increases the cost estimate of using internal temporary tables and makes the optimizer prefer query plans with less use of them.

The smaller default values for these memory parameters compared to the default values for the corresponding disk parameters (disk_temptable_create_cost, disk_temptable_row_cost) reflects the lesser cost of processing memory-based tables.

• row_evaluate_cost

The cost of evaluating record conditions. Increasing this value causes a query plan that examines many rows to become more expensive compared to a query plan that examines fewer rows. For example, a table scan becomes relatively more expensive compared to a range scan that reads fewer rows.

The engine_cost table contains these columns:

• engine_name

The name of the storage engine to which this cost estimate applies. The name is not case-sensitive. If the value is default, it applies to all storage engines that have no named entry of their own. If the server does not recognize the engine name when it reads this table, it writes a warning to the error log.

• device_type

The device type to which this cost estimate applies. The column is intended for specifying different cost estimates for different storage device types, such as hard disk drives versus solid state drives. Currently, this information is not used and 0 is the only permitted value.

 • cost_name

Same as in the server_cost table.

• cost_value

Same as in the server_cost table.

• last_update

Same as in the server_cost table.

• comment

Same as in the server_cost table.

• default_value

The default (compiled-in) value for the cost estimate. This column is a read-only generated column that retains its value even if the associated cost estimate is changed. For rows added to the table at runtime, the value of this column is NULL, with the exception that if the row has the same cost_name value as one of the original rows, the default_value column has the same value as that row.

The primary key for the engine_cost table is a tuple comprising the (cost_name, engine_name, device_type) columns, so it is not possible to create multiple entries for any combination of values in those columns.

The server recognizes these cost_name values for the engine_cost table:

• io_block_read_cost

The cost of reading an index or data block from disk. Increasing this value causes a query plan that reads many disk blocks to become more expensive compared to a query plan that reads fewer disk blocks. For example, a table scan becomes relatively more expensive compared to a range scan that reads fewer blocks.

• memory_block_read_cost

Similar to io_block_read_cost, but represents the cost of reading an index or data block from an inmemory database buffer.

If the io_block_read_cost and memory_block_read_cost values differ, the execution plan may change between two runs of the same query. Suppose that the cost for memory access is less than the cost for disk access. In that case, at server startup before data has been read into the buffer pool, you may get a different plan than after the query has been run because then the data is in memory.

5.3 Making Changes to the Cost Model Database

For DBAs who wish to change the cost model parameters from their defaults, try doubling or halving the value and measuring the effect.

Changes to the io_block_read_cost and memory_block_read_cost parameters are most likely to yield worthwhile results. These parameter values enable cost models for data access methods to take into account the costs of reading information from different sources; that is, the cost of reading information from disk versus reading information already in a memory buffer. For example, all other things being equal, setting io_block_read_cost to a value larger than memory_block_read_cost causes the optimizer to prefer query plans that read information already held in memory to plans that must read from disk.

This example shows how to change the default value for io_block_read_cost:

UPDATE mysql.engine_cost
 SET cost_value = 2.0
 WHERE cost_name = 'io_block_read_cost';
FLUSH OPTIMIZER_COSTS;

This example shows how to change the value of io_block_read_cost only for the InnoDB storage engine:

INSERT INTO mysql.engine_cost
 VALUES ('InnoDB', 0, 'io_block_read_cost', 3.0,
 CURRENT_TIMESTAMP, 'Using a slower disk for InnoDB');
FLUSH OPTIMIZER_COSTS;

6 Optimizer Statistics

The column_statistics data dictionary table stores histogram statistics about column values, for use by the optimizer in constructing query execution plans. To perform histogram management, use the ANALYZE TABLE statement.

The column_statistics table has these characteristics:

• The table contains statistics for columns of all data types except geometry types (spatial data) and JSON.

• The table is persistent so that column statistics need not be created each time the server starts.

• The server performs updates to the table; users do not.

The column_statistics table is not directly accessible by users because it is part of the data dictionary. Histogram information is available using INFORMATION_SCHEMA.COLUMN_STATISTICS, which is implemented as a view on the data dictionary table. COLUMN_STATISTICS has these columns:

• SCHEMA_NAME, TABLE_NAME, COLUMN_NAME: The names of the schema, table, and column for which the statistics apply.

• HISTOGRAM: A JSON value describing the column statistics, stored as a histogram.

Column histograms contain buckets for parts of the range of values stored in the column. Histograms are JSON objects to permit flexibility in the representation of column statistics. Here is a sample histogram object:

{
 "buckets": [
 [
 1,
 0.3333333333333333
 ],
 [
 2,
 0.6666666666666666
 ],
 [
 3,
 1
 ]
 ],
 "null-values": 0,
 "last-updated": "2017-03-24 13:32:40.000000",
 "sampling-rate": 1,
 "histogram-type": "singleton",
 "number-of-buckets-specified": 128,
 "data-type": "int",
 "collation-id": 8
}

Histogram objects have these keys:

• buckets: The histogram buckets. Bucket structure depends on the histogram type.

For singleton histograms, buckets contain two values:

  • Value 1: The value for the bucket. The type depends on the column data type.
  • Value 2: A double representing the cumulative frequency for the value. For example, .25 and .75 indicate that 25% and 75% of the values in the column are less than or equal to the bucket value.

For equi-height histograms, buckets contain four values:

  • Values 1, 2: The lower and upper inclusive values for the bucket. The type depends on the column data type.
  • Value 3: A double representing the cumulative frequency for the value. For example, .25 and .75 indicate that 25% and 75% of the values in the column are less than or equal to the bucket upper value.
  • Value 4: The number of distinct values in the range from the bucket lower value to its upper value.

• null-values: A number between 0.0 and 1.0 indicating the fraction of column values that are SQL NULL values. If 0, the column contains no NULL values.

• last-updated: When the histogram was generated, as a UTC value in YYYY-MM-DD hh:mm:ss.uuuuuu format.

• sampling-rate: A number between 0.0 and 1.0 indicating the fraction of data that was sampled to create the histogram. A value of 1 means that all of the data was read (no sampling).

• histogram-type: The histogram type:

  • singleton: One bucket represents one single value in the column. This histogram type is created when the number of distinct values in the column is less than or equal to the number of buckets specified in the ANALYZE TABLE statement that generated the histogram.
  • equi-height: One bucket represents a range of values. This histogram type is created when the number of distinct values in the column is greater than the number of buckets specified in the ANALYZE TABLE statement that generated the histogram.

• number-of-buckets-specified: The number of buckets specified in the ANALYZE TABLE statement that generated the histogram.

• data-type: The type of data this histogram contains. This is needed when reading and parsing histograms from persistent storage into memory. The value is one of int, uint (unsigned integer), double, decimal, datetime, or string (includes character and binary strings).

• collation-id: The collation ID for the histogram data. It is mostly meaningful when the data-type value is string. Values correspond to ID column values in the Information Schema COLLATIONS table.

To extract particular values from the histogram objects, you can use JSON operations. For example:

mysql> SELECT
 TABLE_NAME, COLUMN_NAME,
 HISTOGRAM->>'$."data-type"' AS 'data-type',
 JSON_LENGTH(HISTOGRAM->>'$."buckets"') AS 'bucket-count'
 FROM INFORMATION_SCHEMA.COLUMN_STATISTICS;
+-----------------+-------------+-----------+--------------+
| TABLE_NAME      | COLUMN_NAME | data-type | bucket-count |
+-----------------+-------------+-----------+--------------+
| country         | Population  | int       | 226          |
| city            | Population  | int       | 1024         |
| countrylanguage | Language    | string    | 457          |
+-----------------+-------------+-----------+--------------+

The optimizer uses histogram statistics, if applicable, for columns of any data type for which statistics are collected. The optimizer applies histogram statistics to determine row estimates based on the selectivity (filtering effect) of column value comparisons against constant values. Predicates of these forms qualify for histogram use:

col_name = constant
col_name <> constant
col_name != constant
col_name > constant
col_name < constant
col_name >= constant
col_name <= constant
col_name IS NULL
col_name IS NOT NULL
col_name BETWEEN constant AND constant
col_name NOT BETWEEN constant AND constant
col_name IN (constant[, constant] ...)
col_name NOT IN (constant[, constant] ...)

For example, these statements contain predicates that qualify for histogram use:

SELECT * FROM orders WHERE amount BETWEEN 100.0 AND 300.0;
SELECT * FROM tbl WHERE col1 = 15 AND col2 > 100;

The requirement for comparison against a constant value includes functions that are constant, such as ABS() and FLOOR():

SELECT * FROM tbl WHERE col1 < ABS(-34);

Histogram statistics are useful primarily for nonindexed columns. Adding an index to a column for which histogram statistics are applicable might also help the optimizer make row estimates. The tradeoffs are:

• An index must be updated when table data is modified.

• A histogram is created or updated only on demand, so it adds no overhead when table data is modified. On the other hand, the statistics become progressively more out of date when table modifications occur, until the next time they are updated.

The optimizer prefers range optimizer row estimates to those obtained from histogram statistics. If the optimizer determines that the range optimizer applies, it does not use histogram statistics.

For columns that are indexed, row estimates can be obtained for equality comparisons using index dives. In this case, histogram statistics are not necessarily useful because index dives can yield better estimates.

In some cases, use of histogram statistics may not improve query execution (for example, if the statistics are out of date). To check whether this is the case, use ANALYZE TABLE to regenerate the histogram statistics, then run the query again.

Alternatively, to disable histogram statistics, use ANALYZE TABLE to drop them. A different method of disabling histogram statistics is to turn off the condition_fanout_filter flag of the optimizer_switch system variable (although this may disable other optimizations as well):

SET optimizer_switch='condition_fanout_filter=off';

If histogram statistics are used, the resulting effect is visible using EXPLAIN. Consider the following query, where no index is available for column col1:

SELECT * FROM t1 WHERE col1 < 24;

SELECT * FROM t1 WHERE col1 < 24;

If histogram【ˈhɪstəɡræm 直方图;(统计学的)直方图,矩形图;】 statistics indicate that 57% of the rows in t1 satisfy the col1 < 24 predicate, filtering can occur even in the absence of an index, and EXPLAIN shows 57.00 in the filtered column.

posted @ 2024-05-29 23:21  东山絮柳仔  阅读(32)  评论(0编辑  收藏  举报