MySQL 8.0 Reference Manual (Reading Notes, Section 34: Character Encoding (1))
MySQL includes character【ˈkærəktər: letter; symbol】 set support that enables you to store data using a variety【vəˈraɪəti: different kinds of the same thing; diversity; variant】 of character sets and perform comparisons【kəmˈpɛrəsənz: comparisons; contrasts】 according to a variety of collations. The default MySQL server character set and collation are utf8mb4 and utf8mb4_0900_ai_ci, but you can specify character sets at the server, database, table, column, and string literal levels.
1. Character Sets and Collations in General
A character set is a set of symbols and encodings. A collation is a set of rules for comparing characters in a character set.
Suppose that we have an alphabet【ˈælfəbet: alphabet; the full set of letters of a language】 with four letters: A, B, a, b. We give each letter a number: A = 0, B = 1, a = 2, b = 3. The letter A is a symbol【ˈsɪmbl: symbol; sign; token】, the number 0 is the encoding for A, and the combination【ˌkɑːmbɪˈneɪʃn: combination; union; mixture】 of all four letters and their encodings is a character set.
Suppose that we want to compare【kəmˈper: to compare; to contrast; to liken】 two string values, A and B. The simplest way to do this is to look at the encodings: 0 for A and 1 for B. Because 0 is less than 1, we say A is less than B. What we've just done is apply a collation【kəˈleɪʃn: collating; checking; verifying the order of pages】 to our character set. The collation is a set of rules (only one rule in this case): “compare the encodings.” We call this simplest of all possible collations a binary collation.
But what if we want to say that the lowercase and uppercase letters are equivalent【ɪˈkwɪvələnt】? Then we would have at least two rules: (1) treat the lowercase letters a and b as equivalent to A and B; (2) then compare the encodings. We call this a case-insensitive collation. It is a little more complex than a binary collation.
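The difference between a binary and a case-insensitive collation can be tried directly in MySQL. The following is a minimal sketch, assuming a MySQL 8.0 server where the utf8mb4_bin and utf8mb4_0900_ai_ci collations are available; the column aliases are made up for illustration.

```sql
-- Binary collation: only the code values are compared, so 'a' and 'A' differ.
SELECT _utf8mb4'a' = _utf8mb4'A' COLLATE utf8mb4_bin AS equal_binary;         -- 0

-- Case-insensitive collation: lettercase is ignored.
SELECT _utf8mb4'a' = _utf8mb4'A' COLLATE utf8mb4_0900_ai_ci AS equal_ci;      -- 1

-- Ordering changes too: 'B' (0x42) precedes 'a' (0x61) only in binary order.
SELECT _utf8mb4'B' < _utf8mb4'a' COLLATE utf8mb4_bin        AS b_first_binary, -- 1
       _utf8mb4'B' < _utf8mb4'a' COLLATE utf8mb4_0900_ai_ci AS b_first_ci;     -- 0
```

The _utf8mb4 introducers are used so that the literals do not depend on the character set of the current connection.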
In real life, most character sets have many characters: not just A and B but whole alphabets, sometimes multiple alphabets or eastern writing systems with thousands of characters, along with many special symbols and punctuation【ˌpʌŋktʃuˈeɪʃn: punctuation marks; use of punctuation】 marks. Also in real life, most collations have many rules, not just for whether to distinguish【dɪˈstɪŋɡwɪʃ: to distinguish; to tell apart; to make out】 lettercase, but also for whether to distinguish accents (an “accent” is a mark attached to a character as in German Ö), and for multiple-character mappings (such as the rule that Ö = OE in one of the two German collations).
MySQL can do these things for you:
• Store strings using a variety【vəˈraɪəti: different kinds of the same thing; diversity; variant】 of character sets.
• Compare strings using a variety of collations.
• Mix strings with different character sets or collations in the same server, the same database, or even the same table.
• Enable specification of character set and collation at any level.
To use these features effectively, you must know what character sets and collations are available, how to change the defaults, and how they affect the behavior of string operators and functions.
2. Character Sets and Collations in MySQL
MySQL Server supports multiple character sets, including several Unicode character sets. To display the available character sets, use the INFORMATION_SCHEMA CHARACTER_SETS table or the SHOW CHARACTER SET statement.
mysql> SHOW CHARACTER SET;
By default, the SHOW CHARACTER SET statement displays all available character sets. It takes an optional LIKE or WHERE clause that indicates which character set names to match. The following example shows some of the Unicode character sets (those based on Unicode Transformation Format): --- pattern matching (filtering) is supported
mysql> SHOW CHARACTER SET LIKE 'utf%';
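The same listing can be obtained from the INFORMATION_SCHEMA CHARACTER_SETS table mentioned above. A minimal sketch, assuming the standard column names of that table in MySQL 8.0:

```sql
-- Equivalent of SHOW CHARACTER SET LIKE 'utf%', but with a selectable column list.
SELECT CHARACTER_SET_NAME, DEFAULT_COLLATE_NAME, MAXLEN
FROM INFORMATION_SCHEMA.CHARACTER_SETS
WHERE CHARACTER_SET_NAME LIKE 'utf%'
ORDER BY CHARACTER_SET_NAME;
```

MAXLEN is the maximum number of bytes needed to store one character, which is useful when estimating column storage.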
A given character set always has at least one collation, and most character sets have several. To list the display collations for a character set, use the INFORMATION_SCHEMA COLLATIONS table or the SHOW COLLATION statement.
By default, the SHOW COLLATION statement displays all available collations. It takes an optional LIKE or WHERE clause that indicates which collation names to display. For example, to see the collations for the default character set, utf8mb4, use this statement:
mysql> SHOW COLLATION WHERE Charset = 'utf8mb4';
Collations have these general characteristics:
• Two different character sets cannot have the same collation.
• Each character set has a default collation. For example, the default collations for utf8mb4 and latin1 are utf8mb4_0900_ai_ci and latin1_swedish_ci, respectively【rɪˈspektɪvli: respectively; each in turn; in the order given】. The INFORMATION_SCHEMA CHARACTER_SETS table and the SHOW CHARACTER SET statement indicate the default collation for each character set. The INFORMATION_SCHEMA COLLATIONS table and the SHOW COLLATION statement have a column that indicates for each collation whether it is the default for its character set (Yes if so, empty if not).
• Collation names start with the name of the character set with which they are associated, generally followed by one or more suffixes indicating other collation characteristics.
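The default-collation characteristic listed above can be checked with either interface. A short sketch, assuming MySQL 8.0; note that Default is a reserved word and has to be quoted with backticks in the SHOW COLLATION form.

```sql
-- Default collation of latin1 via SHOW COLLATION.
SHOW COLLATION WHERE Charset = 'latin1' AND `Default` = 'Yes';

-- The same information for two character sets via INFORMATION_SCHEMA.
SELECT CHARACTER_SET_NAME, COLLATION_NAME
FROM INFORMATION_SCHEMA.COLLATIONS
WHERE IS_DEFAULT = 'Yes'
  AND CHARACTER_SET_NAME IN ('utf8mb4', 'latin1');
```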
When a character set has multiple collations, it might not be clear which collation is most suitable for a given application. To avoid choosing an inappropriate collation, perform some comparisons with representative【ˌreprɪˈzentətɪv: typical; representative; serving as an example】 data values to make sure that a given collation sorts values the way you expect.
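One way to run such a trial is to sort a handful of sample values under each candidate collation and compare the results. This is only a sketch: the sample values and the derived-table alias are made up, and the second collation is just one possible candidate.

```sql
-- Sort the same sample under a binary collation ...
SELECT v
FROM (
    SELECT _utf8mb4'b' AS v
    UNION ALL SELECT _utf8mb4'A'
    UNION ALL SELECT _utf8mb4'a'
    UNION ALL SELECT _utf8mb4'B'
) AS sample
ORDER BY v COLLATE utf8mb4_bin;          -- code-value order: A, B, a, b

-- ... and under a case-sensitive, accent-sensitive UCA collation.
SELECT v
FROM (
    SELECT _utf8mb4'b' AS v
    UNION ALL SELECT _utf8mb4'A'
    UNION ALL SELECT _utf8mb4'a'
    UNION ALL SELECT _utf8mb4'B'
) AS sample
ORDER BY v COLLATE utf8mb4_0900_as_cs;   -- compare this order with the one above
```

In practice the sample should come from real application data, including accented and non-Latin values if the application stores them.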
2.1 Character Set Repertoire【ˈrepərtwɑːr: repertoire; the full range of things one can perform or do】
The repertoire of a character set is the collection of characters in the set.
String expressions have a repertoire attribute, which can have two values:
• ASCII: The expression can contain only ASCII characters; that is, characters in the Unicode range U+0000 to U+007F.
• UNICODE: The expression can contain characters in the Unicode range U+0000 to U+10FFFF. This includes characters in the Basic Multilingual Plane (BMP) range (U+0000 to U+FFFF) and supplementary【ˌsʌplɪˈmentri: supplementary; additional; extra】 characters outside the BMP range (U+10000 to U+10FFFF).
The ASCII range is a subset of UNICODE range, so a string with ASCII repertoire can be converted safely without loss of information to the character set of any string with UNICODE repertoire. It can also be converted safely to any character set that is a superset of the ascii character set. (All MySQL character sets are supersets of ascii with the exception of swe7, which reuses some punctuation characters for Swedish accented characters.)
The use of repertoire enables character set conversion in expressions for many cases where MySQL would otherwise return an “illegal mix of collations” error when the rules for collation coercibility【dictionary sense: "compressibility"; here: how readily one collation gives way to another in an expression】 are insufficient【ˌɪnsəˈfɪʃnt: insufficient; inadequate】 to resolve ambiguities【æmbəˈgjuətiz: ambiguities; unclear or equivocal expressions】.
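A small experiment can illustrate the subset-to-superset conversion. The sketch below assumes a default database is selected; the table name t_repertoire is hypothetical. Both columns have the same coercibility, so the collation rules alone cannot pick a winner, but the ascii column has ASCII repertoire and can be converted to latin1.

```sql
CREATE TABLE t_repertoire (
    c1 CHAR(1) CHARACTER SET latin1,
    c2 CHAR(1) CHARACTER SET ascii
);
INSERT INTO t_repertoire VALUES ('a', 'b');

-- The ascii value is converted to the superset (latin1) instead of
-- raising an "illegal mix of collations" error.
SELECT CONCAT(c1, c2)          AS joined,
       CHARSET(CONCAT(c1, c2)) AS result_charset
FROM t_repertoire;

DROP TABLE t_repertoire;
```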
2.2 UTF-8 for Metadata
Metadata is “the data about the data.” Anything that describes the database, as opposed【əˈpoʊzd: strongly against; in contrast】 to being the contents of the database, is metadata. Thus column names, database names, user names, version names, and most of the string results from SHOW are metadata. This is also true of the contents of tables in INFORMATION_SCHEMA because those tables by definition contain information about database objects.
Representation【ˌreprɪzenˈteɪʃn: representation; depiction; form of expression】 of metadata must satisfy these requirements:
• All metadata must be in the same character set. Otherwise, neither the SHOW statements nor SELECT statements for tables in INFORMATION_SCHEMA would work properly because different rows in the same column of the results of these operations would be in different character sets.
• Metadata must include all characters in all languages. Otherwise, users would not be able to name columns and tables using their own languages.
To satisfy both requirements, MySQL stores metadata in a Unicode character set, namely UTF-8. This does not cause any disruption if you never use accented or non-Latin characters. But if you do, you should be aware that metadata is in UTF-8.
The metadata requirements mean that the return values of the USER(), CURRENT_USER(), SESSION_USER(), SYSTEM_USER(), DATABASE(), and VERSION() functions have the UTF-8 character set by default.
The server sets the character_set_system system variable to the name of the metadata character set:
mysql> SHOW VARIABLES LIKE 'character_set_system';
+----------------------+---------+
| Variable_name        | Value   |
+----------------------+---------+
| character_set_system | utf8mb3 |
+----------------------+---------+
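The character set of the metadata functions can be inspected with CHARSET(). A minimal sketch, assuming MySQL 8.0; the exact name reported for the metadata character set may vary by release, so no output is shown here.

```sql
-- Compare the metadata character set with the character set of
-- two functions that return metadata strings.
SELECT @@character_set_system AS metadata_charset,
       CHARSET(USER())        AS user_charset,
       CHARSET(VERSION())     AS version_charset;
```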
Storage of metadata using Unicode does not mean that the server returns headers of columns and the results of DESCRIBE functions in the character_set_system character set by default. When you use SELECT column1 FROM t, the name column1 itself is returned from the server to the client in the character set determined by the value of the character_set_results system variable, which has a default value of utf8mb4. If you want the server to pass metadata results back in a different character set, use the SET NAMES statement to force the server to perform character set conversion. SET NAMES sets the character_set_results and other related system variables. Alternatively, a client program can perform the conversion after receiving the result from the server. It is more efficient for the client to perform the conversion, but this option is not always available for all clients.
If character_set_results is set to NULL, no conversion is performed and the server returns metadata using its original character set (the set indicated by character_set_system).
Error messages returned from the server to the client are converted to the client character set automatically, as with metadata.
If you are using (for example) the USER() function for comparison or assignment within a single statement, don't worry. MySQL performs some automatic conversion for you.
SELECT * FROM t1 WHERE USER() = latin1_column;
This works because the contents of latin1_column are automatically converted to UTF-8 before the comparison.
INSERT INTO t1 (latin1_column) SELECT USER();
This works because the contents of USER() are automatically converted to latin1 before the assignment.
Although automatic conversion is not in the SQL standard, the standard does say that every character set is (in terms of supported characters) a “subset” of Unicode. Because it is a well-known principle that “what applies to a superset can apply to a subset,” we believe that a collation for Unicode can apply for comparisons with non-Unicode strings.
3. Specifying Character Sets and Collations
There are default settings for character sets and collations at four levels: server, database, table, and column.
CHARACTER SET is used in clauses that specify a character set. CHARSET can be used as a synonym for CHARACTER SET.
Character set issues affect not only data storage, but also communication between client programs and the MySQL server. If you want the client program to communicate with the server using a character set different from the default, you'll need to indicate【ˈɪndɪkeɪt: to indicate; to show; to specify; to signal】 which one. For example, to use the utf8mb4 Unicode character set, issue this statement after connecting to the server:
SET NAMES 'utf8mb4';
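To see what SET NAMES changes, the connection-related system variables can be read back afterwards. A short sketch, assuming a session on a MySQL 8.0 server:

```sql
SET NAMES 'utf8mb4';

-- SET NAMES sets these three session variables to the named character set
-- and collation_connection to that character set's default collation.
SELECT @@character_set_client     AS cs_client,
       @@character_set_connection AS cs_connection,
       @@character_set_results    AS cs_results,
       @@collation_connection     AS coll_connection;

-- An explicit connection collation can be requested with COLLATE.
SET NAMES 'utf8mb4' COLLATE 'utf8mb4_bin';
SELECT @@collation_connection AS coll_connection;
```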
3.1 Collation Naming Conventions【kənˈvɛnʃənz: conventions; customary practices; agreements】
MySQL collation names follow these conventions:
• A collation name starts with the name of the character set with which it is associated, generally followed by one or more suffixes indicating other collation characteristics. For example, utf8mb4_0900_ai_ci and latin1_swedish_ci are collations for the utf8mb4 and latin1 character sets, respectively. The binary character set has a single collation, also named binary, with no suffixes.
• A language-specific collation includes a locale【loʊˈkæl: locale; place; setting】 code or language name. For example, utf8mb4_tr_0900_ai_ci and utf8mb4_hu_0900_ai_ci sort characters for the utf8mb4 character set using the rules of Turkish and Hungarian, respectively. utf8mb4_turkish_ci and utf8mb4_hungarian_ci are similar but based on a less recent version of the Unicode Collation Algorithm.
• Collation suffixes indicate whether a collation is case-sensitive, accent-sensitive, or kana-sensitive (or some combination thereof), or binary. The following table shows the suffixes used to indicate these characteristics.
Suffix | Meaning |
_ai | Accent-insensitive |
_as | Accent-sensitive |
_ci | Case-insensitive |
_cs | Case-sensitive |
_ks | Kana-sensitive |
_bin | Binary |
For nonbinary collation names that do not specify accent sensitivity, it is determined by case sensitivity. If a collation name does not contain _ai or _as, _ci in the name implies _ai and _cs in the name implies【ɪmˈplaɪz: implies; suggests; necessarily entails】 _as. For example, latin1_general_ci is explicitly【ɪkˈsplɪsətli: explicitly; clearly】 case-insensitive and implicitly accent-insensitive, latin1_general_cs is explicitly case-sensitive and implicitly accent-sensitive, and utf8mb4_0900_ai_ci is explicitly case-insensitive and accent-insensitive.
For Japanese collations, the _ks suffix【ˈsʌfɪks: suffix】 indicates that a collation is kana-sensitive; that is, it distinguishes Katakana characters from Hiragana characters. Japanese collations without the _ks suffix are not kana-sensitive and treat Katakana and Hiragana characters as equal for sorting.
For the binary collation of the binary character set, comparisons are based on numeric byte values. For the _bin collation of a nonbinary character set, comparisons are based on numeric character code values, which differ from byte values for multibyte characters.
• Collation names for Unicode character sets may include a version number to indicate the version of the Unicode Collation Algorithm (UCA) on which the collation is based. UCA-based collations without a version number in the name use the version-4.0.0 UCA weight keys. For example:
- utf8mb4_0900_ai_ci is based on UCA 9.0.0 weight keys (http://www.unicode.org/Public/UCA/9.0.0/allkeys.txt).
- utf8mb4_unicode_520_ci is based on UCA 5.2.0 weight keys (http://www.unicode.org/Public/UCA/5.2.0/allkeys.txt).
- utf8mb4_unicode_ci (with no version named) is based on UCA 4.0.0 weight keys (http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt).
• For Unicode character sets, the xxx_general_mysql500_ci collations preserve the pre-5.1.24 ordering of the original xxx_general_ci collations and permit upgrades for tables created before MySQL 5.1.24 (Bug #27877).
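The meaning of the _ai/_as and _ci/_cs suffixes described in the conventions above can be checked with a few comparisons. This is a sketch, assuming the utf8mb4_0900_* collations of MySQL 8.0; the accented character is written as a hex literal so that the example does not depend on the client's character set.

```sql
-- X'C3A9' is the UTF-8 encoding of U+00E9 (e with an acute accent).
SELECT _utf8mb4'e' = _utf8mb4 X'C3A9' COLLATE utf8mb4_0900_ai_ci AS equal_ai,  -- 1
       _utf8mb4'e' = _utf8mb4 X'C3A9' COLLATE utf8mb4_0900_as_ci AS equal_as;  -- 0

-- Lettercase is governed by the _ci / _cs suffix.
SELECT _utf8mb4'e' = _utf8mb4'E' COLLATE utf8mb4_0900_as_ci AS equal_ci,       -- 1
       _utf8mb4'e' = _utf8mb4'E' COLLATE utf8mb4_0900_as_cs AS equal_cs;       -- 0
```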
3.2 Server Character Set and Collation
MySQL Server has a server character set and a server collation. By default, these are utf8mb4 and utf8mb4_0900_ai_ci, but they can be set explicitly【ɪkˈsplɪsətli: explicitly; clearly】 at server startup on the command line or in an option file and changed at runtime.
Initially, the server character set and collation depend on the options that you use when you start mysqld. You can use --character-set-server for the character set. Along with it, you can add --collation-server for the collation. If you don't specify a character set, that is the same as saying --character-set-server=utf8mb4. If you specify only a character set (for example, utf8mb4) but not a collation, that is the same as saying --character-set-server=utf8mb4 --collation-server=utf8mb4_0900_ai_ci because utf8mb4_0900_ai_ci is the default collation for utf8mb4.
The server character set and collation are used as default values if the database character set and collation are not specified in CREATE DATABASE statements. They have no other purpose.
The current server character set and collation can be determined from the values of the character_set_server and collation_server system variables. These variables can be changed at runtime.
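The current values can be read as follows. A minimal sketch; changing the global values at runtime requires suitable privileges and is therefore not shown.

```sql
-- Server-level defaults as seen by the current session.
SELECT @@character_set_server AS server_charset,
       @@collation_server     AS server_collation;

-- The global values, which new sessions start with.
SELECT @@GLOBAL.character_set_server AS global_charset,
       @@GLOBAL.collation_server     AS global_collation;
```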
3.3 Database Character Set and Collation
Every database has a database character set and a database collation. The CREATE DATABASE and ALTER DATABASE statements have optional clauses for specifying the database character set and collation:
CREATE DATABASE db_name
    [[DEFAULT] CHARACTER SET charset_name]
    [[DEFAULT] COLLATE collation_name]

ALTER DATABASE db_name
    [[DEFAULT] CHARACTER SET charset_name]
    [[DEFAULT] COLLATE collation_name]
The keyword SCHEMA can be used instead of DATABASE.
The CHARACTER SET and COLLATE clauses make it possible to create databases with different character sets and collations on the same MySQL server.
Database options are stored in the data dictionary and can be examined by checking the Information Schema SCHEMATA table.
MySQL chooses the database character set and database collation in the following manner:
• If both CHARACTER SET charset_name and COLLATE collation_name are specified, character set charset_name and collation collation_name are used.
• If CHARACTER SET charset_name is specified without COLLATE, character set charset_name and its default collation are used. To see the default collation for each character set, use the SHOW CHARACTER SET statement or query the INFORMATION_SCHEMA CHARACTER_SETS table.
• If COLLATE collation_name is specified without CHARACTER SET, the character set associated with collation_name and collation collation_name are used.
• Otherwise (neither CHARACTER SET nor COLLATE is specified), the server character set and server collation are used.
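The selection rules above can be observed by creating a few throwaway databases and reading back what MySQL stored. This is a sketch: the demo_* database names are hypothetical, and the statements require the privilege to create and drop databases.

```sql
CREATE DATABASE demo_both    CHARACTER SET latin1 COLLATE latin1_german1_ci;
CREATE DATABASE demo_charset CHARACTER SET latin1;   -- default collation of latin1
CREATE DATABASE demo_collate COLLATE utf8mb4_bin;    -- character set implied by the collation
CREATE DATABASE demo_neither;                        -- server character set and collation

SELECT SCHEMA_NAME, DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME
FROM INFORMATION_SCHEMA.SCHEMATA
WHERE SCHEMA_NAME IN ('demo_both', 'demo_charset', 'demo_collate', 'demo_neither');

DROP DATABASE demo_both;
DROP DATABASE demo_charset;
DROP DATABASE demo_collate;
DROP DATABASE demo_neither;
```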
The character set and collation for the default database can be determined from the values of the character_set_database and collation_database system variables. The server sets these variables whenever the default database changes. If there is no default database, the variables have the same value as the corresponding server-level system variables, character_set_server and collation_server.
To see the default character set and collation for a given database, use these statements:
USE db_name;
SELECT @@character_set_database, @@collation_database;
Alternatively, to display the values without changing the default database:
SELECT DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME FROM INFORMATION_SCHEMA.SCHEMATA WHERE SCHEMA_NAME = 'db_name';
The database character set and collation affect these aspects of server operation:
• For CREATE TABLE statements, the database character set and collation are used as default values for table definitions if the table character set and collation are not specified. To override this, provide explicit CHARACTER SET and COLLATE table options.
• For LOAD DATA statements that include no CHARACTER SET clause, the server uses the character set indicated by the character_set_database system variable to interpret the information in the file. To override this, provide an explicit CHARACTER SET clause.
• For stored routines (procedures and functions), the database character set and collation in effect at routine creation time are used as the character set and collation of character data parameters for which the declaration includes no CHARACTER SET or a COLLATE attribute. To override this, provide CHARACTER SET and COLLATE explicitly.
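For LOAD DATA, the override mentioned above looks roughly like the following. This is only a sketch: the file path and the table name demo_import are hypothetical, and whether the server may read the file depends on settings such as secure_file_priv.

```sql
-- Without a CHARACTER SET clause the file would be interpreted using
-- character_set_database; here the encoding is stated explicitly.
LOAD DATA INFILE '/tmp/demo_import.txt'
INTO TABLE demo_import
CHARACTER SET utf8mb4
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n';
```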
3.4 Table Character Set and Collation
Every table has a table character set and a table collation. The CREATE TABLE and ALTER TABLE statements have optional clauses for specifying the table character set and collation:
CREATE TABLE tbl_name (column_list)
    [[DEFAULT] CHARACTER SET charset_name]
    [COLLATE collation_name]

ALTER TABLE tbl_name
    [[DEFAULT] CHARACTER SET charset_name]
    [COLLATE collation_name]
MySQL chooses the table character set and collation in the following manner:
• If both CHARACTER SET charset_name and COLLATE collation_name are specified, character set charset_name and collation collation_name are used.
• If CHARACTER SET charset_name is specified without COLLATE, character set charset_name and its default collation are used. To see the default collation for each character set, use the SHOW CHARACTER SET statement or query the INFORMATION_SCHEMA CHARACTER_SETS table.
• If COLLATE collation_name is specified without CHARACTER SET, the character set associated with collation_name and collation collation_name are used.
• Otherwise (neither CHARACTER SET nor COLLATE is specified), the database character set and collation are used.
The table character set and collation are used as default values for column definitions if the column character set and collation are not specified in individual column definitions. The table character set and collation are MySQL extensions; there are no such things in standard SQL.
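The following sketch shows a table-level setting and how a column without its own clauses inherits it. It assumes a default database is selected; the table name demo_tbl is hypothetical.

```sql
CREATE TABLE demo_tbl (
    c1 CHAR(10)              -- no column-level clauses: inherits from the table
) CHARACTER SET latin1 COLLATE latin1_danish_ci;

-- Table-level collation (the character set is implied by the collation).
SELECT TABLE_NAME, TABLE_COLLATION
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = DATABASE() AND TABLE_NAME = 'demo_tbl';

-- The column picked up the table defaults.
SELECT COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = DATABASE() AND TABLE_NAME = 'demo_tbl';

DROP TABLE demo_tbl;
```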
3.5 Column Character Set and Collation
Every “character” column (that is, a column of type CHAR, VARCHAR, a TEXT type, or any synonym) has a column character set and a column collation. Column definition syntax for CREATE TABLE and ALTER TABLE has optional clauses for specifying the column character set and collation:
col_name {CHAR | VARCHAR | TEXT} (col_length)
    [CHARACTER SET charset_name]
    [COLLATE collation_name]
MySQL chooses the column character set and collation in the following manner:
• If both CHARACTER SET charset_name and COLLATE collation_name are specified, character set charset_name and collation collation_name are used.
• If CHARACTER SET charset_name is specified without COLLATE, character set charset_name and its default collation are used. To see the default collation for each character set, use the SHOW CHARACTER SET statement or query the INFORMATION_SCHEMA CHARACTER_SETS table.
• If COLLATE collation_name is specified without CHARACTER SET, the character set associated with collation_name and collation collation_name are used.
• Otherwise (neither CHARACTER SET nor COLLATE is specified), the table character set and collation are used.
The CHARACTER SET and COLLATE clauses are standard SQL.
If you use ALTER TABLE to convert a column from one character set to another, MySQL attempts to map the data values, but if the character sets are incompatible, there may be data loss.
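The column-level rules and a character set conversion can be sketched as follows. The table name demo_col is hypothetical and a default database is assumed. Converting from latin1 to utf8mb4 is used here because utf8mb4 can represent every latin1 character; a conversion in the opposite direction is the kind that may lose data.

```sql
CREATE TABLE demo_col (
    col1 VARCHAR(5) CHARACTER SET latin1 COLLATE latin1_german1_ci, -- both given
    col2 VARCHAR(5) COLLATE utf8mb4_bin,                            -- charset implied
    col3 VARCHAR(5)                                                 -- table defaults
) CHARACTER SET latin1;

-- Convert col1 to another character set; MySQL maps the stored values.
ALTER TABLE demo_col MODIFY col1 VARCHAR(5) CHARACTER SET utf8mb4;

SELECT COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = DATABASE() AND TABLE_NAME = 'demo_col'
ORDER BY ORDINAL_POSITION;

DROP TABLE demo_col;
```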