InnoDB的全文检索

InnoDB的全文检索

注:全文为MySQL官网5.7的文档(MySQL 8.0的文档与此几乎一致)

MySQL 5.6 不支持中、日、韩语,因为无法对其分词,5.7版本引入NGram(基于字符)对中、日、韩语分词,MeCab(基于单词)支持日语分词。

1、倒排索引

单词与所在文档之间的关系有两种表现形式,InnoDB采用第二种方式,形式见表1.

  • inverted file index:

  • full inverted index:

表1 full inverted index的关联数组

Numer Text Documents
1 code (1:6), (4:8)
2 days (3:2), (6:2)
3 hot (1:3), (2:4)

InnoDB将单词与文档的关系保存到辅助表(auxiliary table)中,为了提高全文检索的并行性能,使用Latin编码对word分区,拆为6张表。这些表都是持久表,位于磁盘上。

2、FTS index cache

引入原因:新增记录时,需要将解析后的词与文档的关系存入auxiliary表中,即使只有少量记录,也会造成大量单词的新增和单词对应文档的新增,会让auxiliary表的并发量下降

存储内容不是辅助表的内容,而是需要往辅助表插入的那些内容,即新来的文档,当缓存满时,将他们批量写入到磁盘辅助表中。数据库关闭时,也会将缓存中的数据同步到磁盘上。

数据库宕机时,FTS index cache中的数据可能没有同步到磁盘上,下次重启数据库时,当用户进行全文检索(插入或查询操作)时,InnoDB会自动读取未完成的文档,对其分析,将分词结果放入FTS index cache中。若缓存过大,恢复时间会较长(恢复数据多)。

好处

  • 避免频繁刷新辅助表,避免并发问题

  • 避免对同一个词的多次插入,最大限度减少重复条目,提高插入效率

  • 保持辅助索引表尽可能小(缓存默认8MB)

参数

  • innodb_ft_cache_size:定义每个表的缓存大小,达到限制时提交到磁盘
  • innodb_ft_total_cache_size:为所有表的InnoDB全文索引缓存分配的总内存,达到该限制时,强制同步所有数据

2.1 insert

(1)插入到对应的表

(2)事务提交时将分词写入FTS index cache中

(3)缓存满时,批量写入辅助表中

2.2 delete

(1)需要删除的记录的DOC_ID,保存到DELETED表中

(2)删除原表中对应的数据

2.3 update

(1)需要更新的记录的DOC_ID,保存到DELETED表中

(2)更新原表数据,包括DOC_ID(会有新的DOC_ID)

(3)事务提交时,将分词写入FTS index cache中(新的DOC_ID)

(4)缓存满时,批量写入到辅助表中

2.4 select

(1)将FTS index cache 和 辅助表合并

(2)进行全文检索

(3)使用DELETED表对检索结果过滤

3、FTS document id

引入原因:需要唯一标识符来标识与单词对应的文档的DOC_ID

FTS_DOC_ID 类型为 BIGINT UNSIGNED NOT NULL

特点:创建表时可手动创建该列,若没有,则在创建fulltext索引时InnoDB会创建名为FTS_DOC_ID的列,并对其添加唯一索引,但InnoDB创建的FTS_DOC_ID列会隐藏,无法查看

注意:若创建FTS_DOC_ID时表里已有数据,则需要重建该表的所有索引(add column)

说明:删除、更新操作存入DELETED表中的DOC_ID即为FTS_DOC_ID的值,且已经删除的FTS_DOC_ID不能再添加到原表中,否则报错

4、OPTIMIZE TABLE

原因:① FTS index cache只有满时才会批量更新到磁盘表中,若更新到磁盘前发生宕机,则需要恢复,缓存中存的数据越多,恢复时间越长;② 删除的记录永远在辅助表中,没有实际删除

使用:

set global innodb_optimize_fulltext_only=ON;
OPTIMIZE TABLE full_test_table;

作用:对有fulltext索引的表执行OPTIMIZE TABLE会有如下操作

  • rebuild the full-text index
  • removing deleted Document IDs(索引表中需要删除的文档---delete和update时未删除)
  • consolidating multiple entries for the same word, where possible

与delete记录有关的操作
使用delete语句删除数据时,delete语句只是将记录的位置或数据页标记为了“可复用”,但是磁盘文件的大小不会改变,即表空间不会直接回收。此时您可以通过optimize table语句释放表空间。
使用“optimize table”命令释放MySQL实例的表空间

相关参数:避免大表上重建fulltext索引时间过长,可使用innodb_ft_num_word_optimize分阶段优化

  • innodb_ft_num_word_optimize:定义每次运行OPTIMIZE TABLE时优化的单词数,默认为2000,每次运行OPTIMIZE TABLE时优化2000个单词,之后的OPTIMIZE TABLE从前一次结束的地方继续

5、INFORMATION_SCHEMA有关fulltext索引的表

mysql> SHOW TABLES FROM INFORMATION_SCHEMA LIKE 'INNODB_FT%';
+-------------------------------------------+
| Tables_in_information_schema (INNODB_FT%) |
+-------------------------------------------+
| INNODB_FT_CONFIG                          |
| INNODB_FT_BEING_DELETED                   |
| INNODB_FT_DELETED                         |
| INNODB_FT_DEFAULT_STOPWORD                |
| INNODB_FT_INDEX_TABLE                     |
| INNODB_FT_INDEX_CACHE                     |
+-------------------------------------------+
6 rows in set (0.00 sec)
  • INNODB_FT_CONFIG:fulltext索引的元数据和InnoDB表相关的处理

  • INNODB_FT_DELETED:保存从原表中删除的记录,这些记录仅在执行OPTIMIZE TABLE期间才会从fulltext表(辅助表)中删除

  • INNODB_FT_BEING_DELETED:是INNODB_FT_DELETED的快照,执行OPTIMIZE TABLE时,先清空该表数据,然后将DOC_ID从INNODB_FT_DELETED表中移过来,但表INNODB_FT_DELETED中的数据并未删除。该表的生存周期较短,用来调试和监控的作用比较小。

  • INNODB_FT_DEFAULT_STOPWORD:当InnoDB表创建fulltext索引时使用的默认停用词表

  • INNODB_FT_INDEX_TABLE:有关InnoDB表的fulltext索引文本搜索的倒排索引信息(与辅助表内容一样)

  • INNODB_FT_INDEX_CACHE:fulltext索引需要新插入的记录的分词信息

除过INNODB_FT_DEFAULT_STOPWORD表外,其他表最初时都是空的,查看这些表之前需要设置innodb_ft_aux_table系统变量为db_name/table_name,如下

mysql> set global innodb_ft_aux_table='testdb/full_test_table';

6、示例(mysql version=5.7.40)

-- 建表,不定义FTS_DOC_ID列
mysql> create table full_test_table(
	`id` int(11) not null comment "id",
	`description` text comment "介绍",
	primary key(id)
);
-- 创建全文索引
mysql> create fulltext index idx_desc on full_test_table(description);

6.1 创建全文索引后新增的表

-- 查看新产生的表
mysql> SELECT table_id, name, space from INFORMATION_SCHEMA.INNODB_SYS_TABLES
WHERE name LIKE 'testdb/%';
+----------+------------------------------------------------------+-------+
| table_id | name                                                 | space |
+----------+------------------------------------------------------+-------+
|       72 | testdb/FTS_0000000000000047_0000000000000066_INDEX_1 |    55 |
|       73 | testdb/FTS_0000000000000047_0000000000000066_INDEX_2 |    56 |
|       74 | testdb/FTS_0000000000000047_0000000000000066_INDEX_3 |    57 |
|       75 | testdb/FTS_0000000000000047_0000000000000066_INDEX_4 |    58 |
|       76 | testdb/FTS_0000000000000047_0000000000000066_INDEX_5 |    59 |
|       77 | testdb/FTS_0000000000000047_0000000000000066_INDEX_6 |    60 |
|       78 | testdb/FTS_0000000000000047_BEING_DELETED            |    61 |
|       79 | testdb/FTS_0000000000000047_BEING_DELETED_CACHE      |    62 |
|       80 | testdb/FTS_0000000000000047_CONFIG                   |    63 |
|       81 | testdb/FTS_0000000000000047_DELETED                  |    64 |
|       82 | testdb/FTS_0000000000000047_DELETED_CACHE            |    65 |
|       71 | testdb/full_test_table                               |    54 |
+----------+------------------------------------------------------+-------+

这些表不能直接查询,只能通过查询information_schema下封装过的临时表查看

前6张表为单词与文档关系的倒排索引辅助表,中间两段为16进制数,第一个47表示full_test_tabletable_id(71)的16进制数,第二个66为该表全文索引的id(102)

-- 查询索引id
mysql> select index_id, name, table_id from INFORMATION_SCHEMA.INNODB_SYS_INDEXES where table_id=71;
+----------+------------------+----------+
| index_id | name             | table_id |
+----------+------------------+----------+
|      101 | PRIMARY          |       71 |
|      102 | idx_desc         |       71 |
|      103 | FTS_DOC_ID_INDEX |       71 |
+----------+------------------+----------+
  • FTS_*_DELETEDFTS_*_DELETED_CACHE

​ 从表中已删除但辅助表未删除的记录的DOC_ID,FTS_*_DELETED_CACHE为内存版本

  • FTS_*_BEING_DELETEDFTS_*_BEING_DELETED_CACHE

​ 从表中已删除,fulltext索引正在删除的记录的DOC_ID,FTS_*_BEING_DELETED_CACHE为内存版本

  • FTS_*_CONFIG

    保存fulltext索引内部状态信息,其中最重要的是FTS_SYNCED_DOC_ID,它标识了已经解析并刷新到磁盘的文档,当宕机恢复时,该值用于将没有刷新到磁盘的文档重新解析加入到FTS index cache中。

    查看该表,需要查询表INFORMATION_SCHEMA.INNODB_FT_CONFIG

6.2 查看索引

-- 增加数据
mysql> insert into full_test_table values
(1, "today is wednesday"),
(2, "tomorrow is thursday");
-- 查看 FTS index cache
mysql> set global innodb_ft_aux_table = 'testdb/full_test_table';
mysql> SELECT * FROM information_schema.INNODB_FT_INDEX_CACHE;
+-----------+--------------+-------------+-----------+--------+----------+
| WORD      | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION |
+-----------+--------------+-------------+-----------+--------+----------+
| thursday  |            4 |           4 |         1 |      4 |       12 |
| today     |            3 |           3 |         1 |      3 |        0 |
| tomorrow  |            4 |           4 |         1 |      4 |        0 |
| wednesday |            3 |           3 |         1 |      3 |        9 |
+-----------+--------------+-------------+-----------+--------+----------+
-- 查看倒排索引辅助表
mysql> select * from information_schema.INNODB_FT_INDEX_TABLE;
-- Empty set (0.00 sec)

此时缓存未满,没有刷新到磁盘,所以fulltext索引表为空

-- 强制刷新,重建索引
mysql> set global innodb_optimize_fulltext_only=ON;
mysql> OPTIMIZE TABLE full_test_table;
+------------------------+----------+----------+----------+
| Table                  | Op       | Msg_type | Msg_text |
+------------------------+----------+----------+----------+
| testdb.full_test_table | optimize | status   | OK       |
+------------------------+----------+----------+----------+

执行后会返回执行状态为OK

再查看缓存表和索引表

-- 缓存表
mysql> SELECT * FROM information_schema.INNODB_FT_INDEX_CACHE;
-- Empty set (0.00 sec)

-- 索引表
mysql> select * from information_schema.INNODB_FT_INDEX_TABLE;
+-----------+--------------+-------------+-----------+--------+----------+
| WORD      | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION |
+-----------+--------------+-------------+-----------+--------+----------+
| thursday  |            4 |           4 |         1 |      4 |       12 |
| today     |            3 |           3 |         1 |      3 |        0 |
| tomorrow  |            4 |           4 |         1 |      4 |        0 |
| wednesday |            3 |           3 |         1 |      3 |        9 |
+-----------+--------------+-------------+-----------+--------+----------+

6.3 delete

mysql> delete from full_test_table where id=1;
-- 索引表,没有删除
mysql> select * from information_schema.INNODB_FT_INDEX_TABLE;
+-----------+--------------+-------------+-----------+--------+----------+
| WORD      | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION |
+-----------+--------------+-------------+-----------+--------+----------+
| thursday  |            4 |           4 |         1 |      4 |       12 |
| today     |            3 |           3 |         1 |      3 |        0 |
| tomorrow  |            4 |           4 |         1 |      4 |        0 |
| wednesday |            3 |           3 |         1 |      3 |        9 |
+-----------+--------------+-------------+-----------+--------+----------+

查看DELETED表,删除的id=1的记录,其FTS_DOC_ID=3

mysql> SELECT * FROM information_schema.INNODB_FT_DELETED;
+--------+
| DOC_ID |
+--------+
|      3 |
+--------+

执行OPTIMIZE TABLE,删除索引表中该记录

mysql> OPTIMIZE TABLE full_test_table;
+------------------------+----------+----------+----------+
| Table                  | Op       | Msg_type | Msg_text |
+------------------------+----------+----------+----------+
| testdb.full_test_table | optimize | status   | OK       |
+------------------------+----------+----------+----------+

执行后,INNODB_FT_DELETED表中依然存在DOC_ID=3的记录,可以看到BEING_DELETED中有DOC_ID=3的记录

mysql> SELECT * FROM information_schema.INNODB_FT_DELETED;
+--------+
| DOC_ID |
+--------+
|      3 |
+--------+
mysql> SELECT * FROM information_schema.INNODB_FT_BEING_DELETED;
+--------+
| DOC_ID |
+--------+
|      3 |
+--------+
-- INDEX_TABLE索引表已经更新
mysql> select * from information_schema.INNODB_FT_INDEX_TABLE;
+----------+--------------+-------------+-----------+--------+----------+
| WORD     | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION |
+----------+--------------+-------------+-----------+--------+----------+
| thursday |            4 |           4 |         1 |      4 |       12 |
| tomorrow |            4 |           4 |         1 |      4 |        0 |
+----------+--------------+-------------+-----------+--------+----------+

mysql 5.7.40版本在第二次执行OPTIMIZE TABLE时,INNODB_FT_DELETED表和INNODB_FT_BEING_DELETED表内容清空。

-- 第二次执行OPTIMIZE TABLE
mysql> OPTIMIZE TABLE full_test_table;
+------------------------+----------+----------+----------+
| Table                  | Op       | Msg_type | Msg_text |
+------------------------+----------+----------+----------+
| testdb.full_test_table | optimize | status   | OK       |
+------------------------+----------+----------+----------+
-- 查询DELETED表
mysql> SELECT * FROM information_schema.INNODB_FT_DELETED;
-- Empty set (0.00 sec)

-- 查询BEING_DELETED表
mysql> SELECT * FROM information_schema.INNODB_FT_BEING_DELETED;
-- Empty set (0.00 sec)

6.4 update

mysql> update full_test_table set description = "yesterday is tuesday" where id=2;
-- 查询DELETED表,id=2的记录,其DOC_ID=4
mysql> SELECT * FROM information_schema.INNODB_FT_DELETED;
+--------+
| DOC_ID |
+--------+
|      4 |
+--------+
-- 缓存表,更新后id=2的记录对应的DOC_ID=7
mysql> SELECT * FROM information_schema.INNODB_FT_INDEX_CACHE;
+-----------+--------------+-------------+-----------+--------+----------+
| WORD      | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION |
+-----------+--------------+-------------+-----------+--------+----------+
| tuesday   |            7 |           7 |         1 |      7 |       13 |
| yesterday |            7 |           7 |         1 |      7 |        0 |
+-----------+--------------+-------------+-----------+--------+----------+
-- 索引表暂未更新
mysql> select * from information_schema.INNODB_FT_INDEX_TABLE;
+----------+--------------+-------------+-----------+--------+----------+
| WORD     | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION |
+----------+--------------+-------------+-----------+--------+----------+
| thursday |            4 |           4 |         1 |      4 |       12 |
| tomorrow |            4 |           4 |         1 |      4 |        0 |
+----------+--------------+-------------+-----------+--------+----------+
-- 执行OPTIMIZE TABLE
mysql> OPTIMIZE TABLE full_test_table;
+------------------------+----------+----------+----------+
| Table                  | Op       | Msg_type | Msg_text |
+------------------------+----------+----------+----------+
| testdb.full_test_table | optimize | status   | OK       |
+------------------------+----------+----------+----------+
-- 查询DELETED表,依然存在
mysql> SELECT * FROM information_schema.INNODB_FT_DELETED;
+--------+
| DOC_ID |
+--------+
|      4 |
+--------+
-- 查询BEING_DELETED表
mysql> SELECT * FROM information_schema.INNODB_FT_BEING_DELETED;
+--------+
| DOC_ID |
+--------+
|      4 |
+--------+
-- 缓存表,已清空
mysql> SELECT * FROM information_schema.INNODB_FT_INDEX_CACHE;
-- Empty set (0.00 sec)

-- 索引表已更新
mysql> select * from information_schema.INNODB_FT_INDEX_TABLE;
+-----------+--------------+-------------+-----------+--------+----------+
| WORD      | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION |
+-----------+--------------+-------------+-----------+--------+----------+
| tuesday   |            7 |           7 |         1 |      7 |       13 |
| yesterday |            7 |           7 |         1 |      7 |        0 |
+-----------+--------------+-------------+-----------+--------+----------+

6.5 INNODB_FT_CONFIG

mysql> select * from information_schema.INNODB_FT_CONFIG;
+---------------------------+-------+
| KEY                       | VALUE |
+---------------------------+-------+
| optimize_checkpoint_limit | 180   |
| synced_doc_id             | 8     |
| stopword_table_name       |       |
| use_stopword              | 1     |
+---------------------------+-------+

optimize_checkpoint_limitOPTIMIZE TABLE执行的最长时间,单位:秒(The number of seconds after which an OPTIMIZE TABLE run stops.)

synced_doc_id:the next DOC_ID to be issued. 下一个分发的DOC_ID

stopword_table_name:用户定义停用词表的"database/table"名,为空表示用户未定义

use_stopword:在创建fulltext索引时定义,表示是否使用停用词表

-- 禁用innodb_optimize_fulltext_only
mysql> set global innodb_optimize_fulltext_only=OFF;

7、停用词

由于科学、文学等方向经常使用短词作为关键字或重要短语,因此InnoDB默认的停用词较少,只有36个。

mysql> select * from information_schema.INNODB_FT_DEFAULT_STOPWORD;
+-------+
| value |
+-------+
| a     |
| about |
| an    |
| are   |
| as    |
| at    |
| be    |
| by    |
| com   |
| de    |
| en    |
| for   |
| from  |
| how   |
| i     |
| in    |
| is    |
| it    |
| la    |
| of    |
| on    |
| or    |
| that  |
| the   |
| this  |
| to    |
| was   |
| what  |
| when  |
| where |
| who   |
| will  |
| with  |
| und   |
| the   |
| www   |
+-------+
36 rows in set (0.00 sec)

自定义停用词需要与INNODB_FT_DEFAULT_STOPWORD表结构相同,必须有一个列名为value且类型为VARCHAR的列。使用时,在创建fulltext索引前需要用innodb_ft_server_stopword_table设置停用词表,形式为db_name/table_name

示例如下

-- 创建停用词表
mysql> create table my_stopwords(value VARCHAR(30)) engine=innodb;
mysql> insert into my_stopwords(value) VALUES('today');
-- 创建表
mysql> create table stop_test(
   `id` int(11) not null comment "id",
   `description` text comment "介绍",
	primary key(id)
);
-- 插入数据
mysql> insert into stop_test values
(1, "today is wednesday"),
(2, "tomorrow is thursday");
-- 设置使用自定义停用词表
mysql> set global innodb_ft_server_stopword_table='testdb/my_stopwords';
-- 创建fulltext索引
mysql> create fulltext index idx on stop_test(description);
-- 查看config
mysql> set global innodb_ft_aux_table = 'testdb/stop_test';
mysql> select * from information_schema.INNODB_FT_CONFIG;
+---------------------------+---------------------+
| KEY                       | VALUE               |
+---------------------------+---------------------+
| optimize_checkpoint_limit | 180                 |
| synced_doc_id             | 5                   |
| stopword_table_name       | testdb/my_stopwords |
| use_stopword              | 1                   |
+---------------------------+---------------------+
-- 索引中已过滤today
mysql> select * from information_schema.INNODB_FT_INDEX_TABLE;
+-----------+--------------+-------------+-----------+--------+----------+
| WORD      | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION |
+-----------+--------------+-------------+-----------+--------+----------+
| thursday  |            4 |           4 |         1 |      4 |       12 |
| tomorrow  |            4 |           4 |         1 |      4 |        0 |
| wednesday |            3 |           3 |         1 |      3 |        9 |
+-----------+--------------+-------------+-----------+--------+----------+

:若新建表想使用默认的停用词表,在创建fulltext索引前,使用set global innodb_ft_aux_table = default;修改即可。

8、检索

  • 全文索引设计的目的是基于相似度的查询,支持的字段类型为CHARVARCHARTEXT

  • fulltext索引可以通过create tablealter table或者create index语句创建。

  • 对于数据量大的集合,先导入数据再创建fulltext索引比先创建fulltext索引再导入数据的方式快

目前限制:由多列组合而成的fulltext索引必须使用相同的字符集与排序规则

语法:

MATCH(col1,col2,...) AGAINST (expr [search_modifier])
search_modifier:
{
		IN NATURAL LANGUAGE MODE
	|	IN NATURAL LANGUAGE MODE WITH QUERY EXPANSION
	|	IN BOOLEAN MODE
	|	WITH QUERY EXPANSION
}

全文搜索使用match() against()语法执行,match()为要搜索的列表,用逗号分隔(搜索时与定义fulltext索引的列相同)。against()接收要搜索的字符串,以及一个可选的修饰符,表示要执行的搜索类型。

8.1 natural language

修饰符为IN NATURAL LANGUAGE MODE或不填,即为自然语言搜索模式。

对于表中每一条记录,MATCH()返回一个相似性值,其表示搜索字符串与该记录的文本之间的相似度。

8.1.1 例子
mysql> create table full_test_table(
	`id` int(11) not null comment "id",
    `description` varchar(200) comment "简介",
    `content` text comment "内容",
    primary key(id),
    fulltext(description,content)
);
mysql> insert into full_test_table values
(1, "tom cat", "tom is a cat"),
(2, "jerry", "jerry is a mouse"),
(3, "tom and jerry", "they are happy");
-- 自然语言模式搜索
mysql> select * from full_test_table where match(description,content) against("tom");
+----+---------------+----------------+
| id | description   | content        |
+----+---------------+----------------+
|  1 | tom cat       | tom is a cat   |
|  3 | tom and jerry | they are happy |
+----+---------------+----------------+
2 rows in set (0.00 sec)
  • 搜索默认不区分大小写

  • match()出现在where子句中时,返回的结果会按照相似度值降序排序,但要满足如下条件

    • 不能有order by子句
    • 搜索使用fulltext索引扫描而非全表扫描
    • 若有表连接(join),则fulltext索引扫描必须是连接中最左边的非常量表
8.1.2 相似性

非负的浮点数,0表示不相似

基于以下数据计算

  • 文档(记录)中单词数量
  • 记录中unique 单词的数量(去重后单词数)
  • 集合中的单词总数量
  • 包含指定单词的记录数量
8.1.3 查询匹配的总记录数
mysql> select count(*) from full_test_table where match(description,content) against("tom");
+----------+
| count(*) |
+----------+
|        2 |
+----------+
1 row in set (0.00 sec)

mysql> select count(if(match(description,content) against("tom" in natural language mode), 1, null)) as count from full_test_table;
+-------+
| count |
+-------+
|     2 |
+-------+
1 row in set (0.00 sec)

以上两条查询语句,由于表里数据太少,执行时间上没有分别。

但第一个语句会做一些额外的工作,比如根据相似性排序。如果搜索匹配的行很少,索引查找可能会使第一个查询更快;若搜索词在大多数行中都出现,则第二个全表扫描的可能比索引更快。

8.1.4 查看相似性值

match()子句放在select子句中,查询的结果为相似度得分,因为没有where子句,返回的结果没有排序

mysql> select id, match(description, content) against("tom" in natural language mode) as score from full_test_table;
+----+----------------------+
| id | score                |
+----+----------------------+
|  1 | 0.062016263604164124 |
|  2 |                    0 |
|  3 | 0.031008131802082062 |
+----+----------------------+
3 rows in set (0.00 sec)

若需要返回相似度值并且按照相似度降序排序,则match()子句需要出现在select子句和where子句中。出现两次的match()不会有额外的工作,MYSQL优化器只会调用一次fulltext搜索。

mysql> select id, match(description,content) against("tom" in natural language mode) as score from full_test_table where match(description,content) against("tom" in natural language mode);
+----+----------------------+
| id | score                |
+----+----------------------+
|  1 | 0.062016263604164124 |
|  3 | 0.031008131802082062 |
+----+----------------------+
2 rows in set (0.00 sec)

许多文档中都出现的单词,其具有较低的语义值,具有较低的权重;若词很罕见,会有更高的权重。

8.1.5 短语搜索

双引号内的字符串是作为短语搜索的,其先被拆分为单独的单词,然后在fulltext索引中搜索单词,对于非单词的字符,不需要完全匹配。短语搜索只需要包含与短语完全相同且顺序相同的单词,比如test phrasetest, phrase是匹配的。若短语不包含在索引中,则结果为空,比如所有单词都是停用词或短于索引词的最小长度。

注:短语包含在双引号内,外用单引号,否则结果不正确,为普通的OR关系

-- 先插入数据
mysql> insert into full_test_table value(4, "tom, cat", "tom is a cat");
mysql> insert into full_test_table value(5, "cat tom", "an animal");
-- 搜索"tom cat"
mysql> select * from full_test_table where match(description, content) against('"tom cat"' in natural language mode);
+----+-------------+--------------+
| id | description | content      |
+----+-------------+--------------+
|  1 | tom cat     | tom is a cat |
|  4 | tom, cat    | tom is a cat |
+----+-------------+--------------+
2 rows in set (0.00 sec)

mysql> select * from full_test_table where match(description, content) against('"cat tom"' in natural language mode);
+----+-------------+-----------+
| id | description | content   |
+----+-------------+-----------+
|  5 | cat tom     | an animal |
+----+-------------+-----------+
1 row in set (0.00 sec)

mysql> select * from full_test_table where match(description, content) against('"tom jerry"' in natural language mode);
Empty set (0.00 sec)

-- 若外面无单引号
mysql> select * from full_test_table where match(description, content) against("tom cat" in natural language mode);
+----+---------------+----------------+
| id | description   | content        |
+----+---------------+----------------+
|  1 | tom cat       | tom is a cat   |
|  4 | tom, cat      | tom is a cat   |
|  5 | cat tom       | an animal      |
|  3 | tom and jerry | they are happy |
+----+---------------+----------------+
4 rows in set (0.00 sec)
8.1.6 单词分隔

将任何由字母、数字、下划线组成的字符串视为一个单词。字符串可以包含撇号',但不能连续出现多个,即aaa'bbb被视为一个单词,但aaa''bbb视为两个词(aaabbb)。全文解析器会去除单词开头或结尾的撇号,'aaa'bbb'解析为aaa'bbb事实非也

-- 验证的结果表明,“'”会作为分隔符
mysql> insert into full_test_table values
(6, "aaa'bbb", "ccc''ddd"),
(7, "'eee'fff'", "ggg'''hhh"),
(8, "xxx''yyy", "'www'zzzz'");

mysql> select * from information_schema.INNODB_FT_INDEX_CACHE;
+------+--------------+-------------+-----------+--------+----------+
| WORD | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION |
+------+--------------+-------------+-----------+--------+----------+
| aaa  |           12 |          12 |         1 |     12 |        0 |
| bbb  |           12 |          12 |         1 |     12 |        4 |
| ccc  |           12 |          12 |         1 |     12 |        8 |
| ddd  |           12 |          12 |         1 |     12 |       13 |
| eee  |           13 |          13 |         1 |     13 |        1 |
| fff  |           13 |          13 |         1 |     13 |        5 |
| ggg  |           13 |          13 |         1 |     13 |       10 |
| hhh  |           13 |          13 |         1 |     13 |       16 |
| xxx  |           14 |          14 |         1 |     14 |        0 |
| yyy  |           14 |          14 |         1 |     14 |        5 |
| zzzz |           14 |          14 |         1 |     14 |       14 |
+------+--------------+-------------+-----------+--------+----------+
11 rows in set (0.00 sec)

内置的fulltext解析器通过查找特定的分隔符来确定单词的起止位置,如空格、逗号、句号。但对于中日韩语,需要使用ngram或MeCab.

8.1.7 会被忽略的单词
  • 单词太短,长度小于innodb_ft_min_token_size,默认值为3,ngram解析器的单词长度由ngram_token_size定义
  • 停用词列表中的单词会被忽略

8.2 boolean

修饰符:IN BOOLEAN MODE

单词前后出现的某些字符有其特殊的含义,如+-分别表示该单词必须出现或不能出现,即想要查包含tom而不能包含cat的记录:+tom -cat

mysql> select * from full_test_table where match(description,content) against("+tom -cat" in boolean mode);
+----+---------------+----------------+
| id | description   | content        |
+----+---------------+----------------+
|  3 | tom and jerry | they are happy |
+----+---------------+----------------+
1 row in set (0.00 sec)

实现这个特性时,MySQL使用了所谓的隐式布尔逻辑,+表示AND-表示NOT,没有操作符则表示OR

8.2.1 特点
  • 不会自动按照相似度降序排序

  • InnoDB表用Boolean搜索需要对所在列建立fulltext索引,而MyISAM可以不需要fulltext索引

  • 使用内置的fulltext解析器和MeCab解析器创建的fulltext索引都遵守最小、最大的单词长度,但ngram解析器创建的fulltext索引是由ngram_token_size设置的

  • 停用词列表的相关配置:innodb_ft_enable_stopwordinnodb_ft_server_stopword_tableinnodb_ft_user_stopword_table

  • 不支持对一个搜索词使用多个操作符,会报错,如++apple

    mysql> select * from full_test_table where match(description,content) against("++tom" in boolean mode);
    ERROR 1064 (42000): syntax error, unexpected '+'
    
  • 只支持前导+ 和前导-

    mysql> select * from full_test_table where match(description,content) against("tom+" in boolean mode);
    ERROR 1064 (42000): syntax error, unexpected $end
    
  • 不支持+*+-+-apple

    mysql> select * from full_test_table where match(description,content) against("+*" in boolean mode);
    ERROR 1064 (42000): syntax error, unexpected $end, expecting FTS_TERM or FTS_NUMB or '*'
    
    mysql> select * from full_test_table where match(description,content) against("+-" in boolean mode);
    ERROR 1064 (42000): syntax error, unexpected '-'
    
    mysql> select * from full_test_table where match(description,content) against("+-tom" in boolean mode);
    ERROR 1064 (42000): syntax error, unexpected '-'
    
  • boolean搜索不支持使用@,其保留供@distance使用

8.2.2 Boolean搜索支持的操作

先看下此时表中的数据,再插入一条内容长一点的记录

mysql> insert into full_test_table value(9, "today is wednesday and a good day", "but the rows that contain it are rated higher");
mysql> select * from full_test_table;
+----+-----------------------------------+-----------------------------------------------+
| id | description                       | content                                       |
+----+-----------------------------------+-----------------------------------------------+
|  1 | tom cat                           | tom is a cat                                  |
|  2 | jerry                             | jerry is a mouse                              |
|  3 | tom and jerry                     | they are happy                                |
|  4 | tom, cat                          | tom is a cat                                  |
|  5 | cat tom                           | an animal                                     |
|  6 | aaa'bbb                           | ccc''ddd                                      |
|  7 | 'eee'fff'                         | ggg'''hhh                                     |
|  8 | xxx''yyy                          | 'www'zzzz'                                    |
|  9 | today is wednesday and a good day | but the rows that contain it are rated higher |
+----+-----------------------------------+-----------------------------------------------+
9 rows in set (0.00 sec)

操作符

  • +:前导加号表示返回的结果必须包含该单词
-- 查找包含“jerry”的记录
mysql> select * from full_test_table where match(description,content) against("+jerry" in boolean mode);
+----+---------------+------------------+
| id | description   | content          |
+----+---------------+------------------+
|  2 | jerry         | jerry is a mouse |
|  3 | tom and jerry | they are happy   |
+----+---------------+------------------+
2 rows in set (0.00 sec)
  • -:前导减号表示返回的结果中不包含该单词

-只对用其他条件查找到的结果做过滤,不能直接使用该操作符查询不包含该单词的记录。

-- 直接查询不包含该单词的记录
mysql> select * from full_test_table where match(description,content) against("-cat" in boolean mode);
Empty set (0.00 sec)

-- 查询一定包含“tom”但一定不包含“cat”的记录
mysql> select * from full_test_table where match(description,content) against("+tom -cat" in boolean mode);
+----+---------------+----------------+
| id | description   | content        |
+----+---------------+----------------+
|  3 | tom and jerry | they are happy |
+----+---------------+----------------+
1 row in set (0.00 sec)
  • 无操作符:默认情况,既无+又无-,该单词是可选的,但包含该单词的相似度更高,效果与不带in boolean mode一样
-- 查询一定包含“tom”但不一定包含“cat”的记录,都包含则排名靠前
mysql> select * from full_test_table where match(description,content) against("+tom cat" in boolean mode);
+----+---------------+----------------+
| id | description   | content        |
+----+---------------+----------------+
|  1 | tom cat       | tom is a cat   |
|  4 | tom, cat      | tom is a cat   |
|  5 | cat tom       | an animal      |
|  3 | tom and jerry | they are happy |
+----+---------------+----------------+
4 rows in set (0.00 sec)
  • @distance:两个或多个单词之间的距离是不是在指定的范围内,以单词为单位。如:match(col1) against('"word1 word2 word3" @8' in boolean mode)

    但实践发现,查询字符串包含两个单词时,若

    • 两个单词均不是停用词,则距离为这两个单词之间的单词数加上2
    • 一个停用词与一个非停用词:停用词在前,@0能查到;停用词在后,@1可查到。查询的是非停用词(过滤停用词后,只剩一个词,放前放后为何有区别,暂不知
    • 两个都是停用词:无论@后为多少,都查不到(应该是过滤完停用词后没有查询单词了)
-- id=3的记录,“tom”与“jerry”之间为1个单词,“and”不为停用词
mysql> select * from full_test_table where match(description,content) against('"tom jerry" @2' in boolean mode);
Empty set (0.00 sec)

mysql> select * from full_test_table where match(description,content) against('"tom jerry" @3' in boolean mode);
+----+---------------+----------------+
| id | description   | content        |
+----+---------------+----------------+
|  3 | tom and jerry | they are happy |
+----+---------------+----------------+
1 row in set (0.00 sec)

-- “today”与“good”之间为4个单词
mysql> select * from full_test_table where match(description,content) against('"today good" @4' in boolean mode);
Empty set (0.00 sec)

mysql> select * from full_test_table where match(description,content) against('"today good" @5' in boolean mode);
Empty set (0.00 sec)

mysql> select * from full_test_table where match(description,content) against('"today good" @6' in boolean mode);
+----+-----------------------------------+-----------------------------------------------+
| id | description                       | content                                       |
+----+-----------------------------------+-----------------------------------------------+
|  9 | today is wednesday and a good day | but the rows that contain it are rated higher |
+----+-----------------------------------+-----------------------------------------------+
1 row in set (0.00 sec)

-- 一个停用词(长度<3)与一个非停用词
mysql> select * from full_test_table where match(description,content) against('"today a" @0' in boolean mode);
Empty set (0.00 sec)

mysql> select * from full_test_table where match(description,content) against('"today a" @1' in boolean mode);
+----+-----------------------------------+-----------------------------------------------+
| id | description                       | content                                       |
+----+-----------------------------------+-----------------------------------------------+
|  9 | today is wednesday and a good day | but the rows that contain it are rated higher |
+----+-----------------------------------+-----------------------------------------------+
1 row in set (0.00 sec)

mysql> select * from full_test_table where match(description,content) against('"tom a" @0' in boolean mode);
Empty set (0.00 sec)

mysql> select * from full_test_table where match(description,content) against('"tom a" @1' in boolean mode);
+----+---------------+----------------+
| id | description   | content        |
+----+---------------+----------------+
|  1 | tom cat       | tom is a cat   |
|  4 | tom, cat      | tom is a cat   |
|  3 | tom and jerry | they are happy |
|  5 | cat tom       | an animal      |
+----+---------------+----------------+
4 rows in set (0.00 sec)
-- 一个停用词(长度>=3)与一个非停用词
mysql> select * from full_test_table where match(description,content) against('"rows are" @0' in boolean mode);
Empty set (0.00 sec)

mysql> select * from full_test_table where match(description,content) against('"rows are" @1' in boolean mode);
+----+-----------------------------------+-----------------------------------------------+
| id | description                       | content                                       |
+----+-----------------------------------+-----------------------------------------------+
|  9 | today is wednesday and a good day | but the rows that contain it are rated higher |
+----+-----------------------------------+-----------------------------------------------+
1 row in set (0.00 sec)

-- 停用词在前,非停用词在后
mysql> select * from full_test_table where match(description,content) against('"that higher" @0' in boolean mode);
+----+-----------------------------------+-----------------------------------------------+
| id | description                       | content                                       |
+----+-----------------------------------+-----------------------------------------------+
|  9 | today is wednesday and a good day | but the rows that contain it are rated higher |
+----+-----------------------------------+-----------------------------------------------+
1 row in set (0.00 sec)

mysql> select * from full_test_table where match(description,content) against('"is cat" @0' in boolean mode);
+----+-------------+--------------+
| id | description | content      |
+----+-------------+--------------+
|  1 | tom cat     | tom is a cat |
|  4 | tom, cat    | tom is a cat |
|  5 | cat tom     | an animal    |
+----+-------------+--------------+
3 rows in set (0.00 sec)

-- 均为停用词
mysql> select * from full_test_table where match(description,content) against('"is a" @0' in boolean mode);
Empty set (0.00 sec)

mysql> select * from full_test_table where match(description,content) against('"is a" @1' in boolean mode);
Empty set (0.00 sec)

mysql> select * from full_test_table where match(description,content) against('"is a" @10' in boolean mode);
Empty set (0.00 sec)

mysql> select * from full_test_table where match(description,content) against('"is a" @100' in boolean mode);
Empty set (0.00 sec)
  • ><:这两个操作符用于改变单词对该行的相似性贡献,>表示该单词的贡献增加;<表示该单词的贡献会降低。实践上从score来看,像是先算出不带< >符号的score,然后记录中若包含了>的单词,则score+1,若包含了<的单词,则score-1,若都包含,则score不变
-- 不对相关性做控制
mysql> select id, description, content, match(description,content) against("jerry tom" in boolean mode) as score from full_test_table where match(description,content) against("jerry tom" in boolean mode);
+----+---------------+------------------+---------------------+
| id | description   | content          | score               |
+----+---------------+------------------+---------------------+
|  2 | jerry         | jerry is a mouse |  0.8533731698989868 |
|  3 | tom and jerry | they are happy   |   0.550719141960144 |
|  1 | tom cat       | tom is a cat     |  0.2480650544166565 |
|  4 | tom, cat      | tom is a cat     |  0.2480650544166565 |
|  5 | cat tom       | an animal        | 0.12403252720832825 |
+----+---------------+------------------+---------------------+
5 rows in set (0.00 sec)

-- 增加“tom”的贡献,从score看,像是对包含“tom”的记录的score + 1
mysql> select id, description, content, match(description,content) against("jerry >tom" in boolean mode) as score from full_test_table where match(description,content) against("jerry >tom" in boolean mode);
+----+---------------+------------------+--------------------+
| id | description   | content          | score              |
+----+---------------+------------------+--------------------+
|  3 | tom and jerry | they are happy   | 1.5507190227508545 |
|  1 | tom cat       | tom is a cat     | 1.2480649948120117 |
|  4 | tom, cat      | tom is a cat     | 1.2480649948120117 |
|  5 | cat tom       | an animal        | 1.1240324974060059 |
|  2 | jerry         | jerry is a mouse | 0.8533731698989868 |
+----+---------------+------------------+--------------------+
5 rows in set (0.00 sec)

-- 降低“jerry”的贡献,对上一个结果中包含“jerry”的score - 1
mysql> select id, description, content, match(description,content) against("<jerry >tom" in boolean mode) as score from full_test_table where match(description,content) against("<jerry >tom" in boolean mode);
+----+---------------+------------------+----------------------+
| id | description   | content          | score                |
+----+---------------+------------------+----------------------+
|  1 | tom cat       | tom is a cat     |   1.2480649948120117 |
|  4 | tom, cat      | tom is a cat     |   1.2480649948120117 |
|  5 | cat tom       | an animal        |   1.1240324974060059 |
|  3 | tom and jerry | they are happy   |    0.550719141960144 |
|  2 | jerry         | jerry is a mouse | -0.14662683010101318 |
+----+---------------+------------------+----------------------+
5 rows in set (0.00 sec)
  • ( ):括号将单词分组为子表达式,可嵌套
mysql> select * from full_test_table where match(description,content) against("today (+tom -cat)" in boolean mode);
+----+-----------------------------------+-----------------------------------------------+
| id | description                       | content                                       |
+----+-----------------------------------+-----------------------------------------------+
|  9 | today is wednesday and a good day | but the rows that contain it are rated higher |
|  3 | tom and jerry                     | they are happy                                |
+----+-----------------------------------+-----------------------------------------------+
2 rows in set (0.00 sec)

-- 包含‘jerry’和‘tom’或者包含‘jerry’和‘mouse’,且包含‘jerry’和‘tom’的排名靠前
mysql> select * from full_test_table where match(description,content) against("+jerry +(>tom <mouse)" in boolean mode);
+----+---------------+------------------+
| id | description   | content          |
+----+---------------+------------------+
|  3 | tom and jerry | they are happy   |
|  2 | jerry         | jerry is a mouse |
+----+---------------+------------------+
2 rows in set (0.00 sec)
  • ~:前导波浪号表示否定运算,单词的相关性贡献为负。包含该词的行的分数低于其他行,但不会像-运算完全排除,可用于标记“噪声”。
mysql> select id, description, content, match(description,content) against("+tom cat" in boolean mode) as score from full_test_table where match(description,content) against("+tom cat" in boolean mode);
+----+---------------+----------------+---------------------+
| id | description   | content        | score               |
+----+---------------+----------------+---------------------+
|  1 | tom cat       | tom is a cat   |  0.7033544778823853 |
|  4 | tom, cat      | tom is a cat   |  0.7033544778823853 |
|  5 | cat tom       | an animal      |  0.3516772389411926 |
|  3 | tom and jerry | they are happy | 0.12403252720832825 |
+----+---------------+----------------+---------------------+
4 rows in set (0.00 sec)

mysql> select id, description, content, match(description,content) against("+tom ~cat" in boolean mode) as score from full_test_table where match(description,content) against("+tom ~cat" in boolean mode);
+----+---------------+----------------+---------------------+
| id | description   | content        | score               |
+----+---------------+----------------+---------------------+
|  1 | tom cat       | tom is a cat   |  0.2480650544166565 |
|  4 | tom, cat      | tom is a cat   |  0.2480650544166565 |
|  3 | tom and jerry | they are happy | 0.12403252720832825 |
|  5 | cat tom       | an animal      | 0.12403252720832825 |
+----+---------------+----------------+---------------------+
4 rows in set (0.00 sec)
  • *:通配符,若*之前的单词长度太短,不会从布尔查询中删除

    注:官网说若innodb_ft_min_token_size=4,则查询+word +the*的结果少于+word +the,因为后者the长度小于4,会被删除,但实际查询时,只有不带加号的the会被删除,带加号的+the不会删除

mysql> select * from full_test_table where match(description,content) against("+cat +to*" in boolean mode);
+----+-------------+--------------+
| id | description | content      |
+----+-------------+--------------+
|  1 | tom cat     | tom is a cat |
|  4 | tom, cat    | tom is a cat |
|  5 | cat tom     | an animal    |
+----+-------------+--------------+
3 rows in set (0.00 sec)
-- +to to长度小于3,并没有将to删除再查询
mysql> select * from full_test_table where match(description,content) against("+cat +to" in boolean mode);
Empty set (0.00 sec)

mysql> select * from full_test_table where match(description,content) against("+cat to" in boolean mode);
+----+-------------+--------------+
| id | description | content      |
+----+-------------+--------------+
|  1 | tom cat     | tom is a cat |
|  4 | tom, cat    | tom is a cat |
|  5 | cat tom     | an animal    |
+----+-------------+--------------+
3 rows in set (0.00 sec)
mysql> select * from full_test_table where match(description,content) against('"tom cat"' in boolean mode);
+----+-------------+--------------+
| id | description | content      |
+----+-------------+--------------+
|  1 | tom cat     | tom is a cat |
|  4 | tom, cat    | tom is a cat |
+----+-------------+--------------+
2 rows in set (0.00 sec)

mysql> select * from full_test_table where match(description,content) against('"cat tom"' in boolean mode);
+----+-------------+-----------+
| id | description | content   |
+----+-------------+-----------+
|  5 | cat tom     | an animal |
+----+-------------+-----------+
1 row in set (0.00 sec)

mysql> select * from full_test_table where match(description,content) against('"tom jerry"' in boolean mode);
Empty set (0.00 sec)
8.2.3 布尔模式的相关性排序

InnoDB全文搜索以Sphinx全文搜索引擎为蓝本,使用BM25和TF-IDF算法,与MyISAM算法不同

注:由于InnoDB在转换整数和浮点数/双精度数之间的精度和舍入策略不同,计算的结果会有细微差别。

TF-IDF算法另有描述,不再叙述

计算单个词的相关性TF-IDF评分为:$$score=tf * idf^2$$

多个词的相关性评分为每个词评分之和

8.3 query expansion

来由:对于太短的搜索词非常有用,因为可能包含一些用户想要的隐藏知识。如用户搜索databases,可能想要MySQLOracleDB2或者RDBMS

修饰符:WITH QUERY EXPANSION或者IN NATURAL LANGUAGE MODE WITH QUERY EXPANSION

搜索:两次搜索,第一次使用原搜索词搜索,第二次使用原搜索词与第一次搜索结果中最相关的文档中单词进行搜索

如:搜索词databases,第一次使用该词搜索,返回结果中包含MySQL,第二次搜索以databasesMySQL为搜索词进行搜索

mysql> select id, description, content, match(description,content) against('tom' in natural language mode) as score from full_test_table where match(description,content) against('tom' in natural language mode);
+----+---------------+----------------+---------------------+
| id | description   | content        | score               |
+----+---------------+----------------+---------------------+
|  1 | tom cat       | tom is a cat   |  0.2480650544166565 |
|  4 | tom, cat      | tom is a cat   |  0.2480650544166565 |
|  3 | tom and jerry | they are happy | 0.12403252720832825 |
|  5 | cat tom       | an animal      | 0.12403252720832825 |
+----+---------------+----------------+---------------------+
4 rows in set (0.00 sec)

mysql> select id, description, content, match(description,content) against('tom' with query expansion) as score from full_test_table where match(description,content) against('tom' with query expansion);
+----+-----------------------------------+-----------------------------------------------+--------------------+
| id | description                       | content                                       | score              |
+----+-----------------------------------+-----------------------------------------------+--------------------+
|  3 | tom and jerry                     | they are happy                                |  2.798563241958618 |
|  5 | cat tom                           | an animal                                     | 1.2622560262680054 |
|  2 | jerry                             | jerry is a mouse                              | 0.8533731698989868 |
|  1 | tom cat                           | tom is a cat                                  | 0.7033544778823853 |
|  4 | tom, cat                          | tom is a cat                                  | 0.7033544778823853 |
|  9 | today is wednesday and a good day | but the rows that contain it are rated higher | 0.4266865849494934 |
+----+-----------------------------------+-----------------------------------------------+--------------------+
6 rows in set (0.00 sec)

:由于扩展查询会增加查询词返回一些不相关的文档,因此仅在搜索短语较短时使用

9、fulltext的限制

  • 只支持InnoDB表和MyISAM表(两者有区别)
  • 不支持分区表
  • 可用大多数的多字节字符集,对于Unicode,可用utf8字符集,不能使用ucs2字符集
  • 表意语言(中日韩)没有单词分隔符,内置全文解析器无法确定在这些语言中单词的开始和结束。NGram全文解析器支持中日韩基于字符的解析、MeCab支持日语的基于单词的解析
  • full text索引中的所有列必须使用相同的字符集和排序规则(collation具体含义可参考该文章
  • 除过MyISAM表的BOOLEAN MODE查询外,MATCH()的列必须与定义full text索引的列保持完全一致(MyISAM表的BOOLEAN MODE可以不走索引);多个列建立了一个fulltext索引,若需对某个单独的列进行fulltext搜索,则需要针对该列再创建一个fulltext索引
  • AGAINST()的参数必须在查询期间保持不变的字符串
  • 对于InnoDB,所有具有fulltext索引的DML操作(insert、update、delete)都是在提交事务时处理,全文检索只返回commit的数据;delete表只在手动执行OPTIMIZE TABLE时才会清空
  • 不支持通配符%
  • MySQL的全文检索几乎没有用户可调整的参数,除非修改源码,大多数情况下修改默认配置会降低效率
  • 大多数fulltext参数需要在服务器启动时设置,运行时修改需重启,如innodb_ft_min_token_sizeinnodb_ft_max_token_size
  • 一些变量修改需要重建fulltext索引,如innodb_ft_min_token_sizeinnodb_ft_max_token_sizeinnodb_ft_server_stopword_tableinnodb_ft_user_stopword_tableinnodb_ft_enable_stopword

10. 中文搜索

ngram全文解析器支持中日韩语,该解析器可用于InnoDB和MyISAM。ngram为MySQL内置的插件,在服务器启动时自动加载。

ngram解析器将文本序列标记为连续的n个字符序列,如对于字符串“abcd”使用不同的n值解析:

n 解析结果
1 "a", "b", "c","d"
2 "ab", "bc", "cd"
3 "abc", "bcd"
4 "abcd"

10.1 配置ngram token size

ngram默认的值为2,即为bigram,例:"abc def" => "ab", "bc", "de", "ef"

参数:ngram_token_size,默认为2,最小为1,最大为10

通常,ngram_token_size设置为要搜索的最大的token的长度,token size越小,对应的fulltext索引越小,搜索也越快

说明

  • ngram_token_size只读,only be set as part of a startup string or in a configuration file

    startup string设置:

    mysqld --ngram_token_size=2
    

    configuration file设置:

    [mysqld]
    ngram_token_size=2
    
  • 当fulltext索引使用ngram解析器时,会忽略配置的最小、最大word长度(innodb_ft_min_token_size, innodb_ft_max_token_size

10.2 创建fulltext索引

设置:在create table、alter table、create index中用 with parser ngram设置

mysql> CREATE TABLE articles (
      id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
      title VARCHAR(200),
      body TEXT,
      FULLTEXT (title,body) WITH PARSER ngram
    ) ENGINE=InnoDB CHARACTER SET utf8mb4;

-- 对已经存在的表增加fulltext索引
-- alter table articles add fulltext index ft_index(title, body) with parser ngram;
-- or
-- create fulltext index ft_index on articles(title, body) with parser ngram;

mysql> INSERT INTO articles (title,body) VALUES
    ('数据库管理','在本教程中我将向你展示如何管理数据库'),
    ('数据库应用开发','学习开发数据库应用程序');

mysql> SET GLOBAL innodb_ft_aux_table="testdb/articles";

mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE ORDER BY doc_id, position;
+--------+--------------+-------------+-----------+--------+----------+
| WORD   | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION |
+--------+--------------+-------------+-----------+--------+----------+
| 数据   |            2 |           3 |         2 |      2 |        0 |
| 据库   |            2 |           3 |         2 |      2 |        3 |
| 库管   |            2 |           2 |         1 |      2 |        6 |
| 管理   |            2 |           2 |         1 |      2 |        9 |
| 在本   |            2 |           2 |         1 |      2 |       16 |
| 本教   |            2 |           2 |         1 |      2 |       19 |
| 教程   |            2 |           2 |         1 |      2 |       22 |
| 程中   |            2 |           2 |         1 |      2 |       25 |
| 中我   |            2 |           2 |         1 |      2 |       28 |
| 我将   |            2 |           2 |         1 |      2 |       31 |
| 将向   |            2 |           2 |         1 |      2 |       34 |
| 向你   |            2 |           2 |         1 |      2 |       37 |
| 你展   |            2 |           2 |         1 |      2 |       40 |
| 展示   |            2 |           2 |         1 |      2 |       43 |
| 示如   |            2 |           2 |         1 |      2 |       46 |
| 管理   |            2 |           2 |         1 |      2 |       46 |
| 如何   |            2 |           2 |         1 |      2 |       49 |
| 何管   |            2 |           2 |         1 |      2 |       52 |
| 理数   |            2 |           2 |         1 |      2 |       58 |
| 数据   |            2 |           3 |         2 |      2 |       61 |
| 据库   |            2 |           3 |         2 |      2 |       61 |
| 数据   |            2 |           3 |         2 |      3 |        0 |
| 据库   |            2 |           3 |         2 |      3 |        3 |
| 库应   |            3 |           3 |         1 |      3 |        6 |
| 应用   |            3 |           3 |         1 |      3 |        9 |
| 用开   |            3 |           3 |         1 |      3 |       12 |
| 开发   |            3 |           3 |         1 |      3 |       13 |
| 开发   |            3 |           3 |         1 |      3 |       15 |
| 学习   |            3 |           3 |         1 |      3 |       22 |
| 习开   |            3 |           3 |         1 |      3 |       25 |
| 发数   |            3 |           3 |         1 |      3 |       31 |
| 据库   |            2 |           3 |         2 |      3 |       34 |
| 应用   |            3 |           3 |         1 |      3 |       34 |
| 数据   |            2 |           3 |         2 |      3 |       34 |
| 库应   |            3 |           3 |         1 |      3 |       34 |
| 用程   |            3 |           3 |         1 |      3 |       46 |
| 程序   |            3 |           3 |         1 |      3 |       49 |
+--------+--------------+-------------+-----------+--------+----------+
37 rows in set (0.00 sec)

对于空格的处理:类似于分隔符

mysql> insert into articles(title,body) value ("今天,天气", "大太阳 天气好");
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE ORDER BY doc_id, position;
+--------+--------------+-------------+-----------+--------+----------+
| WORD   | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION |
+--------+--------------+-------------+-----------+--------+----------+
| 今天   |            8 |           8 |         1 |      8 |        0 |
| 天,   |            8 |           8 |         1 |      8 |        3 |
| ,天   |            8 |           8 |         1 |      8 |        6 |
| 天气   |            8 |           8 |         1 |      8 |        9 |
| 大太   |            8 |           8 |         1 |      8 |       16 |
| 天气   |            8 |           8 |         1 |      8 |       17 |
| 太阳   |            8 |           8 |         1 |      8 |       19 |
| 气好   |            8 |           8 |         1 |      8 |       29 |
+--------+--------------+-------------+-----------+--------+----------+
8 rows in set (0.00 sec)

10.3 对停用词的处理

mysq内置的全文本解析器会将分词后的word与停用词对比,若相等,则将该word删除。但ngram不同,若ngram解析的word包含了停用词,则该word也会被删除。但若停用词长度大于ngram_token_size,则该停用词会被忽略。

如:"a,b" => "a,", ",b",若停用词有",",则"a,"和",b"都会被删除

ngram使用默认的停用词表,其只包含英文,若需要中日韩语的停用词,需要自己创建停用词表,可参考 7 停用词

10.4 term搜索

  • 自然语言搜索:搜索term转为 a union of ngram terms。

    如:"abc" => "ab bc",两个文档,一个包含"ab",另一个包含"abc",则这两篇都会匹配上

  • Boolean模式搜索:搜索term转为 an ngram phrase search。

    如:"abc" => "ab bc",两个文档,一个包含"ab",另一个包含"abc",则只会匹配包含"abc"的文档

mysql> update articles set body = '学习本教开发数据库应用程序' where id=2;

mysql> select * from articles where match(title,body) against("本教程");
+----+-----------------------+--------------------------------------------------------+
| id | title                 | body                                                   |
+----+-----------------------+--------------------------------------------------------+
|  1 | 数据库管理            | 在本教程中我将向你展示如何管理数据库                   |
|  2 | 数据库应用开发        | 学习本教开发数据库应用程序                             |
+----+-----------------------+--------------------------------------------------------+
2 rows in set (0.00 sec)

mysql> select * from articles where match(title,body) against("本教程" in boolean mode);
+----+-----------------+--------------------------------------------------------+
| id | title           | body                                                   |
+----+-----------------+--------------------------------------------------------+
|  1 | 数据库管理      | 在本教程中我将向你展示如何管理数据库                   |
+----+-----------------+--------------------------------------------------------+
1 row in set (0.00 sec)

10.5 通配符搜索

ngram全文检索只包含ngrams,terms的开始信息并没有,使用通配符搜索时可能出现意想不到的结果。以下场景可以用通配符搜索:

  • 若通配符搜索的前缀项短于ngram_token_size,则查询将返回包含该前缀项开头的ngram tokens 的所有行,如:ngram_token_size=2,搜索a*会返回所有以a开头的行;
  • 若通配符搜索的前缀长于ngram_token_size,则前缀项将转为ngram短语并忽略通配符运算。如:ngram_token_size=2,搜索abc*会转为ab bc

10.6 短语搜索

Phrase searches are converted to ngram phrase searches.

搜索短语"abc" => "ab bc",会返回包含"abc"和 "ab bc"的文档

搜索短语"abc def" => "ab bc de ef",会返回包含"abc def"和"ab bc de ef"的文档,但不会返回包含"abcdef"的文档

必须使用Boolean模式

-- ab cd
mysql> insert into articles(title,body) values("今天","太阳天气"),("天气","大太阳天气好");

mysql> select * from articles where match(title,body) against('"太阳 天气"' in boolean mode);
+----+-----------------+---------------------+
| id | title           | body                |
+----+-----------------+---------------------+
|  3 | 今天,天气      | 大太阳 天气好       |
+----+-----------------+---------------------+
1 row in set (0.00 sec)

mysql> select * from articles where match(title,body) against('"太阳 天气"' in natural language mode);
+----+-----------------+---------------------+
| id | title           | body                |
+----+-----------------+---------------------+
|  3 | 今天,天气      | 大太阳 天气好       |
|  5 | 天气            | 大太阳天气好        |
|  4 | 今天            | 太阳天气            |
+----+-----------------+---------------------+
3 rows in set (0.00 sec)

mysql> select * from articles where match(title,body) against('"今天 天气"' in boolean mode);
Empty set (0.00 sec)

mysql> select * from articles where match(title,body) against('"今天 天气"' in natural language mode);
+----+-----------------+---------------------+
| id | title           | body                |
+----+-----------------+---------------------+
|  3 | 今天,天气      | 大太阳 天气好       |
|  4 | 今天            | 太阳天气            |
|  5 | 天气            | 大太阳天气好        |
+----+-----------------+---------------------+
3 rows in set (0.00 sec)
posted @ 2023-02-01 15:03  dedication  阅读(139)  评论(0编辑  收藏  举报