永远不要在mysql中使用utf8,始终使用utf8mb4

一、背景

两周前将中文字符存入mysql数据库,遇到一个问题,表现如下:

插入语句:

insert into test_tbl set text='𦐒帝';

报错信息如下:

1 [SQL]insert into test_tbl set text='𦐒帝';
2 [Err] 1366 - Incorrect string value: '\xF0\xA6\x90\x92\xE5\xB8...' for column 'text' at row 1

二、分析原因

查看数据库建立语句:

CREATE DATABASE IF NOT EXISTS `config_db` DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;

建表语句如下:

CREATE TABLE `test_tbl` (
  `text` varchar(255) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE utf8_general_ci;

结合官网文档:https://mathiasbynens.be/notes/mysql-utf8mb4#utf8-to-utf8mb4,有这么一段总结永远不要在MySQL中使用utf8——始终使用utf8mb4。

Never use utf8 in MySQL — always use utf8mb4 instead. 
Updating your databases and code might take some time, 
but it’s definitely worth the effort.
Why would you arbitrarily limit the set of symbols that can be used in your database?
Why would you lose data every time a user enters an astral symbol as part of a comment
or message or whatever it is you store in your database?
There’s no reason not to strive for full Unicode support everywhere. Do the right thing, and use utf8mb4. 🍻

 在Mysql中有utf8和utf8mb4两种差不多字符集,

utf8是Mysql中的字符集,只支持最长三个字节的UTF-8字符,utf8mb4中的mb4就是most bytes 4的意思,专门用来兼容四字节的unicode。

可能是因为历史原因,早期的Mysql只支持最长三字节的UTF-8字符,在版本5.5.3之后增加了utf8mb4的编码。

三、解决方法

  • 将数据库、表及表字段字符集由utf8字符集更改为utf8mb4字符集。
/*
将数据库config_db的字符集更改utf8mb4
*/
ALTER DATABASE `config_db` CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

/*
将表及表字段字符集更改为utf8mb4
*/
ALTER TABLE `test_tbl` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
  •  修改Mysql配置文件(mysqld.cnf)。
[client] 
default-character-set = utf8mb4

[mysql]
default-character-set = utf8mb4

[mysqld]
character-set-client-handshake = FALSE
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci
init_connect='SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci'
pid-file    = /var/run/mysqld/mysqld.pid
socket        = /var/run/mysqld/mysqld.sock
datadir        = /var/lib/mysql
#log-error    = /var/log/mysql/error.log
# By default we only accept connections from localhost
#bind-address    = 127.0.0.1
# Disabling symbolic-links is recommended to prevent assorted security risks
symbolic-links=0
  • 重启Mysql服务,最终确认配置是否生效
mysql> SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%';
+--------------------------+--------------------+
| Variable_name            | Value              |
+--------------------------+--------------------+
| character_set_client     | utf8mb4            |
| character_set_connection | utf8mb4            |
| character_set_database   | utf8mb4            |
| character_set_filesystem | binary             |
| character_set_results    | utf8mb4            |
| character_set_server     | utf8mb4            |
| character_set_system     | utf8               |
| collation_connection     | utf8mb4_unicode_ci |
| collation_database       | utf8mb4_unicode_ci |
| collation_server         | utf8mb4_unicode_ci |
+--------------------------+--------------------+

四、总结及其他

  1. 永远不要在Mysql中使用utf8,始终使用utf8mb4
  2. collation与字符排序规则相关。
    • 准确性:
      • utf8mb4_unicode_ci是基于标准的Unicode来排序和比较,能够在各种语言之间精确排序。
      • utf8mb4_general_ci没有实现Unicode排序规则,在遇到某些特殊语言或者字符集,排序结果可能不一致。
      • 但是,绝大数情况下,特殊字符的顺序并不需要那么精确。
    • 性能:
      • utf8mb4_general_ci在比较和排序的时候更快
      • utf8mb4_unicode_ci在特殊情况下,Unicode排序规则为了能够处理特殊字符的情况,实现了稍微复杂的排序算法。
      • 但是在绝大数情况下,不会发生此类复杂比较。相比选择哪一种collation,使用者更应该关心字符集与排序规则在db里需要统一。
posted @ 2022-03-31 15:40  一瞬光阴  阅读(363)  评论(0编辑  收藏  举报