MySQL Full-Text Search Functions 文档
12.10 Full-Text Search Functions
- 12.10.1 Natural Language Full-Text Searches
- 12.10.2 Boolean Full-Text Searches
- 12.10.3 Full-Text Searches with Query Expansion
- 12.10.4 Full-Text Stopwords
- 12.10.5 Full-Text Restrictions
- 12.10.6 Fine-Tuning MySQL Full-Text Search
- 12.10.7 Adding a User-Defined Collation for Full-Text Indexing
- 12.10.8 ngram Full-Text Parser
- 12.10.9 MeCab Full-Text Parser Plugin
MATCH (
col1
,col2
,...) AGAINST (expr
[search_modifier
])
search_modifier:
{
IN NATURAL LANGUAGE MODE
| IN NATURAL LANGUAGE MODE WITH QUERY EXPANSION
| IN BOOLEAN MODE
| WITH QUERY EXPANSION
}
MySQL has support for full-text indexing and searching:
-
A full-text index in MySQL is an index of type
FULLTEXT
. -
Full-text indexes can be used only with
InnoDB
orMyISAM
tables, and can be created only forCHAR
,VARCHAR
, orTEXT
columns. -
MySQL provides a built-in full-text ngram parser that supports Chinese, Japanese, and Korean (CJK), and an installable MeCab full-text parser plugin for Japanese. Parsing differences are outlined in Section 12.10.8, “ngram Full-Text Parser”, and Section 12.10.9, “MeCab Full-Text Parser Plugin”.
-
A
FULLTEXT
index definition can be given in theCREATE TABLE
statement when a table is created, or added later usingALTER TABLE
orCREATE INDEX
. -
For large data sets, it is much faster to load your data into a table that has no
FULLTEXT
index and then create the index after that, than to load data into a table that has an existingFULLTEXT
index.
Full-text searching is performed using MATCH() AGAINST()
syntax. MATCH()
takes a comma-separated list that names the columns to be searched. AGAINST
takes a string to search for, and an optional modifier that indicates what type of search to perform. The search string must be a string value that is constant during query evaluation. This rules out, for example, a table column because that can differ for each row.
There are three types of full-text searches:
-
A natural language search interprets the search string as a phrase in natural human language (a phrase in free text). There are no special operators, with the exception of double quote (") characters. The stopword list applies. For more information about stopword lists, see Section 12.10.4, “Full-Text Stopwords”.
Full-text searches are natural language searches if the
IN NATURAL LANGUAGE MODE
modifier is given or if no modifier is given. For more information, see Section 12.10.1, “Natural Language Full-Text Searches”. -
A boolean search interprets the search string using the rules of a special query language. The string contains the words to search for. It can also contain operators that specify requirements such that a word must be present or absent in matching rows, or that it should be weighted higher or lower than usual. Certain common words (stopwords) are omitted from the search index and do not match if present in the search string. The
IN BOOLEAN MODE
modifier specifies a boolean search. For more information, see Section 12.10.2, “Boolean Full-Text Searches”. -
A query expansion search is a modification of a natural language search. The search string is used to perform a natural language search. Then words from the most relevant rows returned by the search are added to the search string and the search is done again. The query returns the rows from the second search. The
IN NATURAL LANGUAGE MODE WITH QUERY EXPANSION
orWITH QUERY EXPANSION
modifier specifies a query expansion search. For more information, see Section 12.10.3, “Full-Text Searches with Query Expansion”.
For information about FULLTEXT
query performance, see Section 8.3.4, “Column Indexes”.
For more information about InnoDB
FULLTEXT
indexes, see Section 14.6.2.4, “InnoDB Full-Text Indexes”.
Constraints on full-text searching are listed in Section 12.10.5, “Full-Text Restrictions”.
The myisam_ftdump utility dumps the contents of a MyISAM
full-text index. This may be helpful for debugging full-text queries. See Section 4.6.2, “myisam_ftdump — Display Full-Text Index information”.
12.10.1 Natural Language Full-Text Searches
By default or with the IN NATURAL LANGUAGE MODE
modifier, the MATCH()
function performs a natural language search for a string against a text collection. A collection is a set of one or more columns included in a FULLTEXT
index. The search string is given as the argument to AGAINST()
. For each row in the table, MATCH()
returns a relevance value; that is, a similarity measure between the search string and the text in that row in the columns named in the MATCH()
list.
mysql> CREATE TABLE articles (
-> id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
-> title VARCHAR(200),
-> body TEXT,
-> FULLTEXT (title,body)
-> ) ENGINE=InnoDB;
Query OK, 0 rows affected (0.08 sec)
mysql> INSERT INTO articles (title,body) VALUES
-> ('MySQL Tutorial','DBMS stands for DataBase ...'),
-> ('How To Use MySQL Well','After you went through a ...'),
-> ('Optimizing MySQL','In this tutorial, we show ...'),
-> ('1001 MySQL Tricks','1. Never run mysqld as root. 2. ...'),
-> ('MySQL vs. YourSQL','In the following database comparison ...'),
-> ('MySQL Security','When configured properly, MySQL ...');
Query OK, 6 rows affected (0.01 sec)
Records: 6 Duplicates: 0 Warnings: 0
mysql> SELECT * FROM articles
-> WHERE MATCH (title,body)
-> AGAINST ('database' IN NATURAL LANGUAGE MODE);
+----+-------------------+------------------------------------------+
| id | title | body |
+----+-------------------+------------------------------------------+
| 1 | MySQL Tutorial | DBMS stands for DataBase ... |
| 5 | MySQL vs. YourSQL | In the following database comparison ... |
+----+-------------------+------------------------------------------+
2 rows in set (0.00 sec)
By default, the search is performed in case-insensitive fashion. To perform a case-sensitive full-text search, use a binary collation for the indexed columns. For example, a column that uses the latin1
character set of can be assigned a collation of latin1_bin
to make it case-sensitive for full-text searches.
When MATCH()
is used in a WHERE
clause, as in the example shown earlier, the rows returned are automatically sorted with the highest relevance first as long as the following conditions are met:
-
There must be no explicit
ORDER BY
clause. -
The search must be performed using a full-text index scan rather than a table scan.
-
If the query joins tables, the full-text index scan must be the leftmost non-constant table in the join.
Given the conditions just listed, it is usually less effort to specify using ORDER BY
an explicit sort order when one is necessary or desired.
Relevance values are nonnegative floating-point numbers. Zero relevance means no similarity. Relevance is computed based on the number of words in the row (document), the number of unique words in the row, the total number of words in the collection, and the number of rows that contain a particular word.
The term “document” may be used interchangeably with the term “row”, and both terms refer to the indexed part of the row. The term “collection” refers to the indexed columns and encompasses all rows.
To simply count matches, you could use a query like this:
mysql> SELECT COUNT(*) FROM articles
-> WHERE MATCH (title,body)
-> AGAINST ('database' IN NATURAL LANGUAGE MODE);
+----------+
| COUNT(*) |
+----------+
| 2 |
+----------+
1 row in set (0.00 sec)
You might find it quicker to rewrite the query as follows:
mysql> SELECT
-> COUNT(IF(MATCH (title,body) AGAINST ('database' IN NATURAL LANGUAGE MODE), 1, NULL))
-> AS count
-> FROM articles;
+-------+
| count |
+-------+
| 2 |
+-------+
1 row in set (0.03 sec)
The first query does some extra work (sorting the results by relevance) but also can use an index lookup based on the WHERE
clause. The index lookup might make the first query faster if the search matches few rows. The second query performs a full table scan, which might be faster than the index lookup if the search term was present in most rows.
For natural-language full-text searches, the columns named in the MATCH()
function must be the same columns included in some FULLTEXT
index in your table. For the preceding query, the columns named in the MATCH()
function (title
and body
) are the same as those named in the definition of the article
table's FULLTEXT
index. To search the title
or body
separately, you would create separate FULLTEXT
indexes for each column.
You can also perform a boolean search or a search with query expansion. These search types are described in Section 12.10.2, “Boolean Full-Text Searches”, and Section 12.10.3, “Full-Text Searches with Query Expansion”.
A full-text search that uses an index can name columns only from a single table in the MATCH()
clause because an index cannot span multiple tables. For MyISAM
tables, a boolean search can be done in the absence of an index (albeit more slowly), in which case it is possible to name columns from multiple tables.
The preceding example is a basic illustration that shows how to use the MATCH()
function where rows are returned in order of decreasing relevance. The next example shows how to retrieve the relevance values explicitly. Returned rows are not ordered because the SELECT
statement includes neither WHERE
nor ORDER BY
clauses:
mysql> SELECT id, MATCH (title,body)
-> AGAINST ('Tutorial' IN NATURAL LANGUAGE MODE) AS score
-> FROM articles;
+----+---------------------+
| id | score |
+----+---------------------+
| 1 | 0.22764469683170319 |
| 2 | 0 |
| 3 | 0.22764469683170319 |
| 4 | 0 |
| 5 | 0 |
| 6 | 0 |
+----+---------------------+
6 rows in set (0.00 sec)
The following example is more complex. The query returns the relevance values and it also sorts the rows in order of decreasing relevance. To achieve this result, specify MATCH()
twice: once in the SELECT
list and once in the WHERE
clause. This causes no additional overhead, because the MySQL optimizer notices that the two MATCH()
calls are identical and invokes the full-text search code only once.
mysql> SELECT id, body, MATCH (title,body)
-> AGAINST ('Security implications of running MySQL as root'
-> IN NATURAL LANGUAGE MODE) AS score
-> FROM articles
-> WHERE MATCH (title,body)
-> AGAINST('Security implications of running MySQL as root'
-> IN NATURAL LANGUAGE MODE);
+----+-------------------------------------+-----------------+
| id | body | score |
+----+-------------------------------------+-----------------+
| 4 | 1. Never run mysqld as root. 2. ... | 1.5219271183014 |
| 6 | When configured properly, MySQL ... | 1.3114095926285 |
+----+-------------------------------------+-----------------+
2 rows in set (0.00 sec)
A phrase that is enclosed within double quote ("
) characters matches only rows that contain the phrase literally, as it was typed. The full-text engine splits the phrase into words and performs a search in the FULLTEXT
index for the words. Nonword characters need not be matched exactly: Phrase searching requires only that matches contain exactly the same words as the phrase and in the same order. For example, "test phrase"
matches "test, phrase"
. If the phrase contains no words that are in the index, the result is empty. For example, if all words are either stopwords or shorter than the minimum length of indexed words, the result is empty.
The MySQL FULLTEXT
implementation regards any sequence of true word characters (letters, digits, and underscores) as a word. That sequence may also contain apostrophes ('
), but not more than one in a row. This means that aaa'bbb
is regarded as one word, but aaa''bbb
is regarded as two words. Apostrophes at the beginning or the end of a word are stripped by the FULLTEXT
parser; 'aaa'bbb'
would be parsed as aaa'bbb
.
The built-in FULLTEXT
parser determines where words start and end by looking for certain delimiter characters; for example,
(space), ,
(comma), and .
(period). If words are not separated by delimiters (as in, for example, Chinese), the built-in FULLTEXT
parser cannot determine where a word begins or ends. To be able to add words or other indexed terms in such languages to a FULLTEXT
index that uses the built-in FULLTEXT
parser, you must preprocess them so that they are separated by some arbitrary delimiter. Alternatively, you can create FULLTEXT
indexes using the ngram parser plugin (for Chinese, Japanese, or Korean) or the MeCab parser plugin (for Japanese).
It is possible to write a plugin that replaces the built-in full-text parser. For details, see The MySQL Plugin API. For example parser plugin source code, see the plugin/fulltext
directory of a MySQL source distribution.
Some words are ignored in full-text searches:
-
Any word that is too short is ignored. The default minimum length of words that are found by full-text searches is three characters for
InnoDB
search indexes, or four characters forMyISAM
. You can control the cutoff by setting a configuration option before creating the index:innodb_ft_min_token_size
configuration option forInnoDB
search indexes, orft_min_word_len
forMyISAM
.NoteThis behavior does not apply to
FULLTEXT
indexes that use the ngram parser. For the ngram parser, token length is defined by thengram_token_size
option. -
Words in the stopword list are ignored. A stopword is a word such as “the” or “some” that is so common that it is considered to have zero semantic value. There is a built-in stopword list, but it can be overridden by a user-defined list. The stopword lists and related configuration options are different for
InnoDB
search indexes andMyISAM
ones. Stopword processing is controlled by the configuration optionsinnodb_ft_enable_stopword
,innodb_ft_server_stopword_table
, andinnodb_ft_user_stopword_table
forInnoDB
search indexes, andft_stopword_file
forMyISAM
ones.
See Section 12.10.4, “Full-Text Stopwords” to view default stopword lists and how to change them. The default minimum word length can be changed as described in Section 12.10.6, “Fine-Tuning MySQL Full-Text Search”.
Every correct word in the collection and in the query is weighted according to its significance in the collection or query. Thus, a word that is present in many documents has a lower weight, because it has lower semantic value in this particular collection. Conversely, if the word is rare, it receives a higher weight. The weights of the words are combined to compute the relevance of the row. This technique works best with large collections.
For very small tables, word distribution does not adequately reflect their semantic value, and this model may sometimes produce bizarre results for search indexes on MyISAM
tables. For example, although the word “MySQL” is present in every row of the articles
table shown earlier, a search for the word in a MyISAM
search index produces no results:
mysql> SELECT * FROM articles
-> WHERE MATCH (title,body)
-> AGAINST ('MySQL' IN NATURAL LANGUAGE MODE);
Empty set (0.00 sec)
The search result is empty because the word “MySQL” is present in at least 50% of the rows, and so is effectively treated as a stopword. This filtering technique is more suitable for large data sets, where you might not want the result set to return every second row from a 1GB table, than for small data sets where it might cause poor results for popular terms.
The 50% threshold can surprise you when you first try full-text searching to see how it works, and makes InnoDB
tables more suited to experimentation with full-text searches. If you create a MyISAM
table and insert only one or two rows of text into it, every word in the text occurs in at least 50% of the rows. As a result, no search returns any results until the table contains more rows. Users who need to bypass the 50% limitation can build search indexes on InnoDB
tables, or use the boolean search mode explained in Section 12.10.2, “Boolean Full-Text Searches”.
12.10.2 Boolean Full-Text Searches
MySQL can perform boolean full-text searches using the IN BOOLEAN MODE
modifier. With this modifier, certain characters have special meaning at the beginning or end of words in the search string. In the following query, the +
and -
operators indicate that a word must be present or absent, respectively, for a match to occur. Thus, the query retrieves all the rows that contain the word “MySQL” but that do not contain the word “YourSQL”:
mysql> SELECT * FROM articles WHERE MATCH (title,body)
-> AGAINST ('+MySQL -YourSQL' IN BOOLEAN MODE);
+----+-----------------------+-------------------------------------+
| id | title | body |
+----+-----------------------+-------------------------------------+
| 1 | MySQL Tutorial | DBMS stands for DataBase ... |
| 2 | How To Use MySQL Well | After you went through a ... |
| 3 | Optimizing MySQL | In this tutorial, we show ... |
| 4 | 1001 MySQL Tricks | 1. Never run mysqld as root. 2. ... |
| 6 | MySQL Security | When configured properly, MySQL ... |
+----+-----------------------+-------------------------------------+
In implementing this feature, MySQL uses what is sometimes referred to as implied Boolean logic, in which
-
+
stands forAND
-
-
stands forNOT
-
[no operator] implies
OR
Boolean full-text searches have these characteristics:
-
They do not automatically sort rows in order of decreasing relevance.
-
InnoDB
tables require aFULLTEXT
index on all columns of theMATCH()
expression to perform boolean queries. Boolean queries against aMyISAM
search index can work even without aFULLTEXT
index, although a search executed in this fashion would be quite slow. -
The minimum and maximum word length full-text parameters apply to
FULLTEXT
indexes created using the built-inFULLTEXT
parser and MeCab parser plugin.innodb_ft_min_token_size
andinnodb_ft_max_token_size
are used forInnoDB
search indexes.ft_min_word_len
andft_max_word_len
are used forMyISAM
search indexes.Minimum and maximum word length full-text parameters do not apply to
FULLTEXT
indexes created using the ngram parser. ngram token size is defined by thengram_token_size
option. -
The stopword list applies, controlled by
innodb_ft_enable_stopword
,innodb_ft_server_stopword_table
, andinnodb_ft_user_stopword_table
forInnoDB
search indexes, andft_stopword_file
forMyISAM
ones. -
InnoDB
full-text search does not support the use of multiple operators on a single search word, as in this example:'++apple'
. Use of multiple operators on a single search word returns a syntax error to standard out. MyISAM full-text search successfully processes the same search, ignoring all operators except for the operator immediately adjacent to the search word. -
InnoDB
full-text search only supports leading plus or minus signs. For example,InnoDB
supports'+apple'
but does not support'apple+'
. Specifying a trailing plus or minus sign causesInnoDB
to report a syntax error. -
InnoDB
full-text search does not support the use of a leading plus sign with wildcard ('+*'
), a plus and minus sign combination ('+-'
), or leading a plus and minus sign combination ('+-apple'
). These invalid queries return a syntax error. -
InnoDB
full-text search does not support the use of the@
symbol in boolean full-text searches. The@
symbol is reserved for use by the@distance
proximity search operator. -
They do not use the 50% threshold that applies to
MyISAM
search indexes.
The boolean full-text search capability supports the following operators:
-
+
A leading or trailing plus sign indicates that this word must be present in each row that is returned.
InnoDB
only supports leading plus signs. -
-
A leading or trailing minus sign indicates that this word must not be present in any of the rows that are returned.
InnoDB
only supports leading minus signs.Note: The
-
operator acts only to exclude rows that are otherwise matched by other search terms. Thus, a boolean-mode search that contains only terms preceded by-
returns an empty result. It does not return “all rows except those containing any of the excluded terms.” -
(no operator)
By default (when neither
+
nor-
is specified), the word is optional, but the rows that contain it are rated higher. This mimics the behavior ofMATCH() AGAINST()
without theIN BOOLEAN MODE
modifier. -
@
distance
This operator works on
InnoDB
tables only. It tests whether two or more words all start within a specified distance from each other, measured in words. Specify the search words within a double-quoted string immediately before the@
operator, for example,distance
MATCH(col1) AGAINST('"word1 word2 word3" @8' IN BOOLEAN MODE)
-
> <
These two operators are used to change a word's contribution to the relevance value that is assigned to a row. The
>
operator increases the contribution and the<
operator decreases it. See the example following this list. -
( )
Parentheses group words into subexpressions. Parenthesized groups can be nested.
-
~
A leading tilde acts as a negation operator, causing the word's contribution to the row's relevance to be negative. This is useful for marking “noise” words. A row containing such a word is rated lower than others, but is not excluded altogether, as it would be with the
-
operator. -
*
The asterisk serves as the truncation (or wildcard) operator. Unlike the other operators, it is appended to the word to be affected. Words match if they begin with the word preceding the
*
operator.If a word is specified with the truncation operator, it is not stripped from a boolean query, even if it is too short or a stopword. Whether a word is too short is determined from the
innodb_ft_min_token_size
setting forInnoDB
tables, orft_min_word_len
forMyISAM
tables. These options are not applicable toFULLTEXT
indexes that use the ngram parser.The wildcarded word is considered as a prefix that must be present at the start of one or more words. If the minimum word length is 4, a search for
'+
could return fewer rows than a search forword
+the*''+
, because the second query ignores the too-short search termword
+the'the
. -
"
A phrase that is enclosed within double quote (
"
) characters matches only rows that contain the phrase literally, as it was typed. The full-text engine splits the phrase into words and performs a search in theFULLTEXT
index for the words. Nonword characters need not be matched exactly: Phrase searching requires only that matches contain exactly the same words as the phrase and in the same order. For example,"test phrase"
matches"test, phrase"
.If the phrase contains no words that are in the index, the result is empty. The words might not be in the index because of a combination of factors: if they do not exist in the text, are stopwords, or are shorter than the minimum length of indexed words.
The following examples demonstrate some search strings that use boolean full-text operators:
-
'apple banana'
Find rows that contain at least one of the two words.
-
'+apple +juice'
Find rows that contain both words.
-
'+apple macintosh'
Find rows that contain the word “apple”, but rank rows higher if they also contain “macintosh”.
-
'+apple -macintosh'
Find rows that contain the word “apple” but not “macintosh”.
-
'+apple ~macintosh'
Find rows that contain the word “apple”, but if the row also contains the word “macintosh”, rate it lower than if row does not. This is “softer” than a search for
'+apple -macintosh'
, for which the presence of “macintosh” causes the row not to be returned at all. -
'+apple +(>turnover <strudel)'
Find rows that contain the words “apple” and “turnover”, or “apple” and “strudel” (in any order), but rank “apple turnover” higher than “apple strudel”.
-
'apple*'
Find rows that contain words such as “apple”, “apples”, “applesauce”, or “applet”.
-
'"some words"'
Find rows that contain the exact phrase “some words” (for example, rows that contain “some words of wisdom” but not “some noise words”). Note that the
"
characters that enclose the phrase are operator characters that delimit the phrase. They are not the quotation marks that enclose the search string itself.
InnoDB
full-text search is modeled on the Sphinx full-text search engine, and the algorithms used are based on BM25 and TF-IDF ranking algorithms. For these reasons, relevancy rankings for InnoDB
boolean full-text search may differ from MyISAM
relevancy rankings.
InnoDB
uses a variation of the “term frequency-inverse document frequency” (TF-IDF
) weighting system to rank a document's relevance for a given full-text search query. The TF-IDF
weighting is based on how frequently a word appears in a document, offset by how frequently the word appears in all documents in the collection. In other words, the more frequently a word appears in a document, and the less frequently the word appears in the document collection, the higher the document is ranked.
How Relevancy Ranking is Calculated
The term frequency (TF
) value is the number of times that a word appears in a document. The inverse document frequency (IDF
) value of a word is calculated using the following formula, where total_records
is the number of records in the collection, and matching_records
is the number of records that the search term appears in.
${IDF} = log10( ${total_records} / ${matching_records} )
When a document contains a word multiple times, the IDF value is multiplied by the TF value:
${TF} * ${IDF}
Using the TF
and IDF
values, the relevancy ranking for a document is calculated using this formula:
${rank} = ${TF} * ${IDF} * ${IDF}
The formula is demonstrated in the following examples.
Relevancy Ranking for a Single Word Search
This example demonstrates the relevancy ranking calculation for a single-word search.
mysql> CREATE TABLE articles (
-> id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
-> title VARCHAR(200),
-> body TEXT,
-> FULLTEXT (title,body)
->) ENGINE=InnoDB;
Query OK, 0 rows affected (1.04 sec)
mysql> INSERT INTO articles (title,body) VALUES
-> ('MySQL Tutorial','This database tutorial ...'),
-> ("How To Use MySQL",'After you went through a ...'),
-> ('Optimizing Your Database','In this database tutorial ...'),
-> ('MySQL vs. YourSQL','When comparing databases ...'),
-> ('MySQL Security','When configured properly, MySQL ...'),
-> ('Database, Database, Database','database database database'),
-> ('1001 MySQL Tricks','1. Never run mysqld as root. 2. ...'),
-> ('MySQL Full-Text Indexes', 'MySQL fulltext indexes use a ..');
Query OK, 8 rows affected (0.06 sec)
Records: 8 Duplicates: 0 Warnings: 0
mysql> SELECT id, title, body,
-> MATCH (title,body) AGAINST ('database' IN BOOLEAN MODE) AS score
-> FROM articles ORDER BY score DESC;
+----+------------------------------+-------------------------------------+---------------------+
| id | title | body | score |
+----+------------------------------+-------------------------------------+---------------------+
| 6 | Database, Database, Database | database database database | 1.0886961221694946 |
| 3 | Optimizing Your Database | In this database tutorial ... | 0.36289870738983154 |
| 1 | MySQL Tutorial | This database tutorial ... | 0.18144935369491577 |
| 2 | How To Use MySQL | After you went through a ... | 0 |
| 4 | MySQL vs. YourSQL | When comparing databases ... | 0 |
| 5 | MySQL Security | When configured properly, MySQL ... | 0 |
| 7 | 1001 MySQL Tricks | 1. Never run mysqld as root. 2. ... | 0 |
| 8 | MySQL Full-Text Indexes | MySQL fulltext indexes use a .. | 0 |
+----+------------------------------+-------------------------------------+---------------------+
8 rows in set (0.00 sec)
There are 8 records in total, with 3 that match the “database” search term. The first record (id 6
) contains the search term 6 times and has a relevancy ranking of 1.0886961221694946
. This ranking value is calculated using a TF
value of 6 (the “database” search term appears 6 times in record id 6
) and an IDF
value of 0.42596873216370745, which is calculated as follows (where 8 is the total number of records and 3 is the number of records that the search term appears in):
${IDF} = LOG10( 8 / 3 ) = 0.42596873216370745
The TF
and IDF
values are then entered into the ranking formula:
${rank} = ${TF} * ${IDF} * ${IDF}
Performing the calculation in the MySQL command-line client returns a ranking value of 1.088696164686938.
mysql> SELECT 6*LOG10(8/3)*LOG10(8/3);
+-------------------------+
| 6*LOG10(8/3)*LOG10(8/3) |
+-------------------------+
| 1.088696164686938 |
+-------------------------+
1 row in set (0.00 sec)
You may notice a slight difference in the ranking values returned by the SELECT ... MATCH ... AGAINST
statement and the MySQL command-line client (1.0886961221694946
versus 1.088696164686938
). The difference is due to how the casts between integers and floats/doubles are performed internally by InnoDB
(along with related precision and rounding decisions), and how they are performed elsewhere, such as in the MySQL command-line client or other types of calculators.
Relevancy Ranking for a Multiple Word Search
This example demonstrates the relevancy ranking calculation for a multiple-word full-text search based on the articles
table and data used in the previous example.
If you search on more than one word, the relevancy ranking value is a sum of the relevancy ranking value for each word, as shown in this formula:
${rank} = ${TF} * ${IDF} * ${IDF} + ${TF} * ${IDF} * ${IDF}
Performing a search on two terms ('mysql tutorial') returns the following results:
mysql> SELECT id, title, body, MATCH (title,body)
-> AGAINST ('mysql tutorial' IN BOOLEAN MODE) AS score
-> FROM articles ORDER BY score DESC;
+----+------------------------------+-------------------------------------+----------------------+
| id | title | body | score |
+----+------------------------------+-------------------------------------+----------------------+
| 1 | MySQL Tutorial | This database tutorial ... | 0.7405621409416199 |
| 3 | Optimizing Your Database | In this database tutorial ... | 0.3624762296676636 |
| 5 | MySQL Security | When configured properly, MySQL ... | 0.031219376251101494 |
| 8 | MySQL Full-Text Indexes | MySQL fulltext indexes use a .. | 0.031219376251101494 |
| 2 | How To Use MySQL | After you went through a ... | 0.015609688125550747 |
| 4 | MySQL vs. YourSQL | When comparing databases ... | 0.015609688125550747 |
| 7 | 1001 MySQL Tricks | 1. Never run mysqld as root. 2. ... | 0.015609688125550747 |
| 6 | Database, Database, Database | database database database | 0 |
+----+------------------------------+-------------------------------------+----------------------+
8 rows in set (0.00 sec)
In the first record (id 8
), 'mysql' appears once and 'tutorial' appears twice. There are six matching records for 'mysql' and two matching records for 'tutorial'. The MySQL command-line client returns the expected ranking value when inserting these values into the ranking formula for a multiple word search:
mysql> SELECT (1*log10(8/6)*log10(8/6)) + (2*log10(8/2)*log10(8/2));
+-------------------------------------------------------+
| (1*log10(8/6)*log10(8/6)) + (2*log10(8/2)*log10(8/2)) |
+-------------------------------------------------------+
| 0.7405621541938003 |
+-------------------------------------------------------+
1 row in set (0.00 sec)
The slight difference in the ranking values returned by the SELECT ... MATCH ... AGAINST
statement and the MySQL command-line client is explained in the preceding example.
12.10.3 Full-Text Searches with Query Expansion
Full-text search supports query expansion (and in particular, its variant “blind query expansion”). This is generally useful when a search phrase is too short, which often means that the user is relying on implied knowledge that the full-text search engine lacks. For example, a user searching for “database” may really mean that “MySQL”, “Oracle”, “DB2”, and “RDBMS” all are phrases that should match “databases” and should be returned, too. This is implied knowledge.
Blind query expansion (also known as automatic relevance feedback) is enabled by adding WITH QUERY EXPANSION
or IN NATURAL LANGUAGE MODE WITH QUERY EXPANSION
following the search phrase. It works by performing the search twice, where the search phrase for the second search is the original search phrase concatenated with the few most highly relevant documents from the first search. Thus, if one of these documents contains the word “databases” and the word “MySQL”, the second search finds the documents that contain the word “MySQL” even if they do not contain the word “database”. The following example shows this difference:
mysql> SELECT * FROM articles
WHERE MATCH (title,body)
AGAINST ('database' IN NATURAL LANGUAGE MODE);
+----+-------------------+------------------------------------------+
| id | title | body |
+----+-------------------+------------------------------------------+
| 1 | MySQL Tutorial | DBMS stands for DataBase ... |
| 5 | MySQL vs. YourSQL | In the following database comparison ... |
+----+-------------------+------------------------------------------+
2 rows in set (0.00 sec)
mysql> SELECT * FROM articles
WHERE MATCH (title,body)
AGAINST ('database' WITH QUERY EXPANSION);
+----+-----------------------+------------------------------------------+
| id | title | body |
+----+-----------------------+------------------------------------------+
| 5 | MySQL vs. YourSQL | In the following database comparison ... |
| 1 | MySQL Tutorial | DBMS stands for DataBase ... |
| 3 | Optimizing MySQL | In this tutorial we show ... |
| 6 | MySQL Security | When configured properly, MySQL ... |
| 2 | How To Use MySQL Well | After you went through a ... |
| 4 | 1001 MySQL Tricks | 1. Never run mysqld as root. 2. ... |
+----+-----------------------+------------------------------------------+
6 rows in set (0.00 sec)
Another example could be searching for books by Georges Simenon about Maigret, when a user is not sure how to spell “Maigret”. A search for “Megre and the reluctant witnesses” finds only “Maigret and the Reluctant Witnesses” without query expansion. A search with query expansion finds all books with the word “Maigret” on the second pass.
Because blind query expansion tends to increase noise significantly by returning nonrelevant documents, use it only when a search phrase is short.
12.10.4 Full-Text Stopwords
The stopword list is loaded and searched for full-text queries using the server character set and collation (the values of the character_set_server
and collation_server
system variables). False hits or misses might occur for stopword lookups if the stopword file or columns used for full-text indexing or searches have a character set or collation different from character_set_server
or collation_server
.
Case sensitivity of stopword lookups depends on the server collation. For example, lookups are case-insensitive if the collation is latin1_swedish_ci
, whereas lookups are case-sensitive if the collation is latin1_general_cs
or latin1_bin
.
InnoDB
has a relatively short list of default stopwords, because documents from technical, literary, and other sources often use short words as keywords or in significant phrases. For example, you might search for “to be or not to be” and expect to get a sensible result, rather than having all those words ignored.
To see the default InnoDB
stopword list, query the INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD
table.
mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD;
+-------+
| value |
+-------+
| a |
| about |
| an |
| are |
| as |
| at |
| be |
| by |
| com |
| de |
| en |
| for |
| from |
| how |
| i |
| in |
| is |
| it |
| la |
| of |
| on |
| or |
| that |
| the |
| this |
| to |
| was |
| what |
| when |
| where |
| who |
| will |
| with |
| und |
| the |
| www |
+-------+
36 rows in set (0.00 sec)
To define your own stopword list for all InnoDB
tables, define a table with the same structure as the INNODB_FT_DEFAULT_STOPWORD
table, populate it with stopwords, and set the value of the innodb_ft_server_stopword_table
option to a value in the form
before creating the full-text index. The stopword table must have a single db_name
/table_name
VARCHAR
column named value
. The following example demonstrates creating and configuring a new global stopword table for InnoDB
.
-- Create a new stopword table
mysql> CREATE TABLE my_stopwords(value VARCHAR(30)) ENGINE = INNODB;
Query OK, 0 rows affected (0.01 sec)
-- Insert stopwords (for simplicity, a single stopword is used in this example)
mysql> INSERT INTO my_stopwords(value) VALUES ('Ishmael');
Query OK, 1 row affected (0.00 sec)
-- Create the table
mysql> CREATE TABLE opening_lines (
id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
opening_line TEXT(500),
author VARCHAR(200),
title VARCHAR(200)
) ENGINE=InnoDB;
Query OK, 0 rows affected (0.01 sec)
-- Insert data into the table
mysql> INSERT INTO opening_lines(opening_line,author,title) VALUES
('Call me Ishmael.','Herman Melville','Moby-Dick'),
('A screaming comes across the sky.','Thomas Pynchon','Gravity\'s Rainbow'),
('I am an invisible man.','Ralph Ellison','Invisible Man'),
('Where now? Who now? When now?','Samuel Beckett','The Unnamable'),
('It was love at first sight.','Joseph Heller','Catch-22'),
('All this happened, more or less.','Kurt Vonnegut','Slaughterhouse-Five'),
('Mrs. Dalloway said she would buy the flowers herself.','Virginia Woolf','Mrs. Dalloway'),
('It was a pleasure to burn.','Ray Bradbury','Fahrenheit 451');
Query OK, 8 rows affected (0.00 sec)
Records: 8 Duplicates: 0 Warnings: 0
-- Set the innodb_ft_server_stopword_table option to the new stopword table
mysql> SET GLOBAL innodb_ft_server_stopword_table = 'test/my_stopwords';
Query OK, 0 rows affected (0.00 sec)
-- Create the full-text index (which rebuilds the table if no FTS_DOC_ID column is defined)
mysql> CREATE FULLTEXT INDEX idx ON opening_lines(opening_line);
Query OK, 0 rows affected, 1 warning (1.17 sec)
Records: 0 Duplicates: 0 Warnings: 1
Verify that the specified stopword ('Ishmael') does not appear by querying the words in INFORMATION_SCHEMA.INNODB_FT_INDEX_TABLE
.
By default, words less than 3 characters in length or greater than 84 characters in length do not appear in an InnoDB
full-text search index. Maximum and minimum word length values are configurable using the innodb_ft_max_token_size
and innodb_ft_min_token_size
variables. This default behavior does not apply to the ngram parser plugin. ngram token size is defined by the ngram_token_size
option.
mysql> SET GLOBAL innodb_ft_aux_table='test/opening_lines';
Query OK, 0 rows affected (0.00 sec)
mysql> SELECT word FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_TABLE LIMIT 15;
+-----------+
| word |
+-----------+
| across |
| all |
| burn |
| buy |
| call |
| comes |
| dalloway |
| first |
| flowers |
| happened |
| herself |
| invisible |
| less |
| love |
| man |
+-----------+
15 rows in set (0.00 sec)
To create stopword lists on a table-by-table basis, create other stopword tables and use the innodb_ft_user_stopword_table
option to specify the stopword table that you want to use before you create the full-text index.
The stopword file is loaded and searched using latin1
if character_set_server
is ucs2
, utf16
, utf16le
, or utf32
.
To override the default stopword list for MyISAM tables, set the ft_stopword_file
system variable. (See Section 5.1.7, “Server System Variables”.) The variable value should be the path name of the file containing the stopword list, or the empty string to disable stopword filtering. The server looks for the file in the data directory unless an absolute path name is given to specify a different directory. After changing the value of this variable or the contents of the stopword file, restart the server and rebuild your FULLTEXT
indexes.
The stopword list is free-form, separating stopwords with any nonalphanumeric character such as newline, space, or comma. Exceptions are the underscore character (_
) and a single apostrophe ('
) which are treated as part of a word. The character set of the stopword list is the server's default character set; see Section 10.3.2, “Server Character Set and Collation”.
The following list shows the default stopwords for MyISAM
search indexes. In a MySQL source distribution, you can find this list in the storage/myisam/ft_static.c
file.
a's able about above according
accordingly across actually after afterwards
again against ain't all allow
allows almost alone along already
also although always am among
amongst an and another any
anybody anyhow anyone anything anyway
anyways anywhere apart appear appreciate
appropriate are aren't around as
aside ask asking associated at
available away awfully be became
because become becomes becoming been
before beforehand behind being believe
below beside besides best better
between beyond both brief but
by c'mon c's came can
can't cannot cant cause causes
certain certainly changes clearly co
com come comes concerning consequently
consider considering contain containing contains
corresponding could couldn't course currently
definitely described despite did didn't
different do does doesn't doing
don't done down downwards during
each edu eg eight either
else elsewhere enough entirely especially
et etc even ever every
everybody everyone everything everywhere ex
exactly example except far few
fifth first five followed following
follows for former formerly forth
four from further furthermore get
gets getting given gives go
goes going gone got gotten
greetings had hadn't happens hardly
has hasn't have haven't having
he he's hello help hence
her here here's hereafter hereby
herein hereupon hers herself hi
him himself his hither hopefully
how howbeit however i'd i'll
i'm i've ie if ignored
immediate in inasmuch inc indeed
indicate indicated indicates inner insofar
instead into inward is isn't
it it'd it'll it's its
itself just keep keeps kept
know known knows last lately
later latter latterly least less
lest let let's like liked
likely little look looking looks
ltd mainly many may maybe
me mean meanwhile merely might
more moreover most mostly much
must my myself name namely
nd near nearly necessary need
needs neither never nevertheless new
next nine no nobody non
none noone nor normally not
nothing novel now nowhere obviously
of off often oh ok
okay old on once one
ones only onto or other
others otherwise ought our ours
ourselves out outside over overall
own particular particularly per perhaps
placed please plus possible presumably
probably provides que quite qv
rather rd re really reasonably
regarding regardless regards relatively respectively
right said same saw say
saying says second secondly see
seeing seem seemed seeming seems
seen self selves sensible sent
serious seriously seven several shall
she should shouldn't since six
so some somebody somehow someone
something sometime sometimes somewhat somewhere
soon sorry specified specify specifying
still sub such sup sure
t's take taken tell tends
th than thank thanks thanx
that that's thats the their
theirs them themselves then thence
there there's thereafter thereby therefore
therein theres thereupon these they
they'd they'll they're they've think
third this thorough thoroughly those
though three through throughout thru
thus to together too took
toward towards tried tries truly
try trying twice two un
under unfortunately unless unlikely until
unto up upon us use
used useful uses using usually
value various very via viz
vs want wants was wasn't
way we we'd we'll we're
we've welcome well went were
weren't what what's whatever when
whence whenever where where's whereafter
whereas whereby wherein whereupon wherever
whether which while whither who
who's whoever whole whom whose
why will willing wish with
within without won't wonder would
wouldn't yes yet you you'd
you'll you're you've your yours
yourself yourselves zero
12.10.5 Full-Text Restrictions
-
Full-text searches are supported for
InnoDB
andMyISAM
tables only. -
Full-text searches are not supported for partitioned tables. See Section 22.6, “Restrictions and Limitations on Partitioning”.
-
Full-text searches can be used with most multibyte character sets. The exception is that for Unicode, the
utf8
character set can be used, but not theucs2
character set. AlthoughFULLTEXT
indexes onucs2
columns cannot be used, you can performIN BOOLEAN MODE
searches on aucs2
column that has no such index.The remarks for
utf8
also apply toutf8mb4
, and the remarks forucs2
also apply toutf16
,utf16le
, andutf32
. -
Ideographic languages such as Chinese and Japanese do not have word delimiters. Therefore, the built-in full-text parser cannot determine where words begin and end in these and other such languages.
A character-based ngram full-text parser that supports Chinese, Japanese, and Korean (CJK), and a word-based MeCab parser plugin that supports Japanese are provided for use with
InnoDB
andMyISAM
tables. -
Although the use of multiple character sets within a single table is supported, all columns in a
FULLTEXT
index must use the same character set and collation. -
The
MATCH()
column list must match exactly the column list in someFULLTEXT
index definition for the table, unless thisMATCH()
isIN BOOLEAN MODE
on aMyISAM
table. ForMyISAM
tables, boolean-mode searches can be done on nonindexed columns, although they are likely to be slow. -
The argument to
AGAINST()
must be a string value that is constant during query evaluation. This rules out, for example, a table column because that can differ for each row. -
Index hints are more limited for
FULLTEXT
searches than for non-FULLTEXT
searches. See Section 8.9.4, “Index Hints”. -
For
InnoDB
, all DML operations (INSERT
,UPDATE
,DELETE
) involving columns with full-text indexes are processed at transaction commit time. For example, for anINSERT
operation, an inserted string is tokenized and decomposed into individual words. The individual words are then added to full-text index tables when the transaction is committed. As a result, full-text searches only return committed data. -
The '%' character is not a supported wildcard character for full-text searches.
12.10.6 Fine-Tuning MySQL Full-Text Search
MySQL's full-text search capability has few user-tunable parameters. You can exert more control over full-text searching behavior if you have a MySQL source distribution because some changes require source code modifications. See Section 2.9, “Installing MySQL from Source”.
Full-text search is carefully tuned for effectiveness. Modifying the default behavior in most cases can actually decrease effectiveness. Do not alter the MySQL sources unless you know what you are doing.
Most full-text variables described in this section must be set at server startup time. A server restart is required to change them; they cannot be modified while the server is running.
Some variable changes require that you rebuild the FULLTEXT
indexes in your tables. Instructions for doing so are given later in this section.
The minimum and maximum lengths of words to be indexed are defined by the innodb_ft_min_token_size
and innodb_ft_max_token_size
for InnoDB
search indexes, and ft_min_word_len
and ft_max_word_len
for MyISAM
ones.
Minimum and maximum word length full-text parameters do not apply to FULLTEXT
indexes created using the ngram parser. ngram token size is defined by the ngram_token_size
option.
After changing any of these options, rebuild your FULLTEXT
indexes for the change to take effect. For example, to make two-character words searchable, you could put the following lines in an option file:
[mysqld]
innodb_ft_min_token_size=2
ft_min_word_len=2
Then restart the server and rebuild your FULLTEXT
indexes. For MyISAM
tables, note the remarks regarding myisamchk in the instructions that follow for rebuilding MyISAM
full-text indexes.
For MyISAM
search indexes, the 50% threshold for natural language searches is determined by the particular weighting scheme chosen. To disable it, look for the following line in storage/myisam/ftdefs.h
:
#define GWS_IN_USE GWS_PROB
Change that line to this:
#define GWS_IN_USE GWS_FREQ
Then recompile MySQL. There is no need to rebuild the indexes in this case.
By making this change, you severely decrease MySQL's ability to provide adequate relevance values for the MATCH()
function. If you really need to search for such common words, it would be better to search using IN BOOLEAN MODE
instead, which does not observe the 50% threshold.
To change the operators used for boolean full-text searches on MyISAM
tables, set the ft_boolean_syntax
system variable. (InnoDB
does not have an equivalent setting.) This variable can be changed while the server is running, but you must have privileges sufficient to set global system variables (see Section 5.1.8.1, “System Variable Privileges”). No rebuilding of indexes is necessary in this case.
For the built-in full-text parser, you can change the set of characters that are considered word characters in several ways, as described in the following list. After making the modification, rebuild the indexes for each table that contains any FULLTEXT
indexes. Suppose that you want to treat the hyphen character ('-') as a word character. Use one of these methods:
-
Modify the MySQL source: In
storage/innobase/handler/ha_innodb.cc
(forInnoDB
), or instorage/myisam/ftdefs.h
(forMyISAM
), see thetrue_word_char()
andmisc_word_char()
macros. Add'-'
to one of those macros and recompile MySQL. -
Modify a character set file: This requires no recompilation. The
true_word_char()
macro uses a “character type” table to distinguish letters and numbers from other characters. . You can edit the contents of the<ctype><map>
array in one of the character set XML files to specify that'-'
is a “letter.” Then use the given character set for yourFULLTEXT
indexes. For information about the<ctype><map>
array format, see Section 10.13.1, “Character Definition Arrays”. -
Add a new collation for the character set used by the indexed columns, and alter the columns to use that collation. For general information about adding collations, see Section 10.14, “Adding a Collation to a Character Set”. For an example specific to full-text indexing, see Section 12.10.7, “Adding a User-Defined Collation for Full-Text Indexing”.
For the changes to take effect, FULLTEXT
indexes must be rebuilt after modifying any of the following full-text index variables: innodb_ft_min_token_size
; innodb_ft_max_token_size
; innodb_ft_server_stopword_table
; innodb_ft_user_stopword_table
; innodb_ft_enable_stopword
; ngram_token_size
. Modifying innodb_ft_min_token_size
, innodb_ft_max_token_size
, or ngram_token_size
requires restarting the server.
To rebuild FULLTEXT
indexes for an InnoDB
table, use ALTER TABLE
with the DROP INDEX
and ADD INDEX
options to drop and re-create each index.
Running OPTIMIZE TABLE
on a table with a full-text index rebuilds the full-text index, removing deleted Document IDs and consolidating multiple entries for the same word, where possible.
To optimize a full-text index, enable innodb_optimize_fulltext_only
and run OPTIMIZE TABLE
.
mysql> set GLOBAL innodb_optimize_fulltext_only=ON;
Query OK, 0 rows affected (0.01 sec)
mysql> OPTIMIZE TABLE opening_lines;
+--------------------+----------+----------+----------+
| Table | Op | Msg_type | Msg_text |
+--------------------+----------+----------+----------+
| test.opening_lines | optimize | status | OK |
+--------------------+----------+----------+----------+
1 row in set (0.01 sec)
To avoid lengthy rebuild times for full-text indexes on large tables, you can use the innodb_ft_num_word_optimize
option to perform the optimization in stages. The innodb_ft_num_word_optimize
option defines the number of words that are optimized each time OPTIMIZE TABLE
is run. The default setting is 2000, which means that 2000 words are optimized each time OPTIMIZE TABLE
is run. Subsequent OPTIMIZE TABLE
operations continue from where the preceding OPTIMIZE TABLE
operation ended.
If you modify full-text variables that affect indexing (ft_min_word_len
, ft_max_word_len
, or ft_stopword_file
), or if you change the stopword file itself, you must rebuild your FULLTEXT
indexes after making the changes and restarting the server.
To rebuild the FULLTEXT
indexes for a MyISAM
table, it is sufficient to do a QUICK
repair operation:
mysql> REPAIR TABLE tbl_name QUICK;
Alternatively, use ALTER TABLE
as just described. In some cases, this may be faster than a repair operation.
Each table that contains any FULLTEXT
index must be repaired as just shown. Otherwise, queries for the table may yield incorrect results, and modifications to the table cause the server to see the table as corrupt and in need of repair.
If you use myisamchk to perform an operation that modifies MyISAM
table indexes (such as repair or analyze), the FULLTEXT
indexes are rebuilt using the default full-text parameter values for minimum word length, maximum word length, and stopword file unless you specify otherwise. This can result in queries failing.
The problem occurs because these parameters are known only by the server. They are not stored in MyISAM
index files. To avoid the problem if you have modified the minimum or maximum word length or stopword file values used by the server, specify the same ft_min_word_len
, ft_max_word_len
, and ft_stopword_file
values for myisamchk that you use for mysqld. For example, if you have set the minimum word length to 3, you can repair a table with myisamchk like this:
myisamchk --recover --ft_min_word_len=3 tbl_name.MYI
To ensure that myisamchk and the server use the same values for full-text parameters, place each one in both the [mysqld]
and [myisamchk]
sections of an option file:
[mysqld]
ft_min_word_len=3
[myisamchk]
ft_min_word_len=3
An alternative to using myisamchk for MyISAM
table index modification is to use the REPAIR TABLE
, ANALYZE TABLE
, OPTIMIZE TABLE
, or ALTER TABLE
statements. These statements are performed by the server, which knows the proper full-text parameter values to use.
12.10.7 Adding a User-Defined Collation for Full-Text Indexing
This section describes how to add a user-defined collation for full-text searches using the built-in full-text parser. The sample collation is like latin1_swedish_ci
but treats the '-'
character as a letter rather than as a punctuation character so that it can be indexed as a word character. General information about adding collations is given in Section 10.14, “Adding a Collation to a Character Set”; it is assumed that you have read it and are familiar with the files involved.
To add a collation for full-text indexing, use the following procedure. The instructions here add a collation for a simple character set, which as discussed in Section 10.14, “Adding a Collation to a Character Set”, can be created using a configuration file that describes the character set properties. For a complex character set such as Unicode, create collations using C source files that describe the character set properties.
-
Add a collation to the
Index.xml
file. The permitted range of IDs for user-defined collations is given in Section 10.14.2, “Choosing a Collation ID”. The ID must be unused, so choose a value different from 1025 if that ID is already taken on your system.<charset name="latin1"> ... <collation name="latin1_fulltext_ci" id="1025"/> </charset>
-
Declare the sort order for the collation in the
latin1.xml
file. In this case, the order can be copied fromlatin1_swedish_ci
:<collation name="latin1_fulltext_ci"> <map> 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F 10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F 20 21 22 23 24 25 26 27 28 29 2A 2B 2C 2D 2E 2F 30 31 32 33 34 35 36 37 38 39 3A 3B 3C 3D 3E 3F 40 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F 50 51 52 53 54 55 56 57 58 59 5A 5B 5C 5D 5E 5F 60 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F 50 51 52 53 54 55 56 57 58 59 5A 7B 7C 7D 7E 7F 80 81 82 83 84 85 86 87 88 89 8A 8B 8C 8D 8E 8F 90 91 92 93 94 95 96 97 98 99 9A 9B 9C 9D 9E 9F A0 A1 A2 A3 A4 A5 A6 A7 A8 A9 AA AB AC AD AE AF B0 B1 B2 B3 B4 B5 B6 B7 B8 B9 BA BB BC BD BE BF 41 41 41 41 5C 5B 5C 43 45 45 45 45 49 49 49 49 44 4E 4F 4F 4F 4F 5D D7 D8 55 55 55 59 59 DE DF 41 41 41 41 5C 5B 5C 43 45 45 45 45 49 49 49 49 44 4E 4F 4F 4F 4F 5D F7 D8 55 55 55 59 59 DE FF </map> </collation>
-
Modify the
ctype
array inlatin1.xml
. Change the value corresponding to 0x2D (which is the code for the'-'
character) from 10 (punctuation) to 01 (uppercase letter). In the following array, this is the element in the fourth row down, third value from the end.<ctype> <map> 00 20 20 20 20 20 20 20 20 20 28 28 28 28 28 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 48 10 10 10 10 10 10 10 10 10 10 10 10 01 10 10 84 84 84 84 84 84 84 84 84 84 10 10 10 10 10 10 10 81 81 81 81 81 81 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 10 10 10 10 10 10 82 82 82 82 82 82 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 10 10 10 10 20 10 00 10 02 10 10 10 10 10 10 01 10 01 00 01 00 00 10 10 10 10 10 10 10 10 10 02 10 02 00 02 01 48 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 10 01 01 01 01 01 01 01 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 10 02 02 02 02 02 02 02 02 </map> </ctype>
-
Restart the server.
-
To employ the new collation, include it in the definition of columns that are to use it:
mysql> DROP TABLE IF EXISTS t1; Query OK, 0 rows affected (0.13 sec) mysql> CREATE TABLE t1 ( a TEXT CHARACTER SET latin1 COLLATE latin1_fulltext_ci, FULLTEXT INDEX(a) ) ENGINE=InnoDB; Query OK, 0 rows affected (0.47 sec)
-
Test the collation to verify that hyphen is considered as a word character:
mysql> INSERT INTO t1 VALUEs ('----'),('....'),('abcd'); Query OK, 3 rows affected (0.22 sec) Records: 3 Duplicates: 0 Warnings: 0 mysql> SELECT * FROM t1 WHERE MATCH a AGAINST ('----' IN BOOLEAN MODE); +------+ | a | +------+ | ---- | +------+ 1 row in set (0.00 sec)
12.10.8 ngram Full-Text Parser
The built-in MySQL full-text parser uses the white space between words as a delimiter to determine where words begin and end, which is a limitation when working with ideographic languages that do not use word delimiters. To address this limitation, MySQL provides an ngram full-text parser that supports Chinese, Japanese, and Korean (CJK). The ngram full-text parser is supported for use with InnoDB
and MyISAM
.
MySQL also provides a MeCab full-text parser plugin for Japanese, which tokenizes documents into meaningful words. For more information, see Section 12.10.9, “MeCab Full-Text Parser Plugin”.
An ngram is a contiguous sequence of n
characters from a given sequence of text. The ngram parser tokenizes a sequence of text into a contiguous sequence of n
characters. For example, you can tokenize “abcd” for different values of n
using the ngram full-text parser.
n=1: 'a', 'b', 'c', 'd'
n=2: 'ab', 'bc', 'cd'
n=3: 'abc', 'bcd'
n=4: 'abcd'
The ngram full-text parser is a built-in server plugin. As with other built-in server plugins, it is automatically loaded when the server is started.
The full-text search syntax described in Section 12.10, “Full-Text Search Functions” applies to the ngram parser plugin. Differences in parsing behavior are described in this section. Full-text-related configuration options, except for minimum and maximum word length options (innodb_ft_min_token_size
, innodb_ft_max_token_size
, ft_min_word_len
, ft_max_word_len
) are also applicable.
Configuring ngram Token Size
The ngram parser has a default ngram token size of 2 (bigram). For example, with a token size of 2, the ngram parser parses the string “abc def” into four tokens: “ab”, “bc”, “de” and “ef”.
ngram token size is configurable using the ngram_token_size
configuration option, which has a minimum value of 1 and maximum value of 10.
Typically, ngram_token_size
is set to the size of the largest token that you want to search for. If you only intend to search for single characters, set ngram_token_size
to 1. A smaller token size produces a smaller full-text search index, and faster searches. If you need to search for words comprised of more than one character, set ngram_token_size
accordingly. For example, “Happy Birthday” is “生日快乐” in simplified Chinese, where “生日” is “birthday”, and “快乐” translates as “happy”. To search on two-character words such as these, set ngram_token_size
to a value of 2 or higher.
As a read-only variable, ngram_token_size
may only be set as part of a startup string or in a configuration file:
-
Startup string:
mysqld --ngram_token_size=2
-
Configuration file:
[mysqld] ngram_token_size=2
The following minimum and maximum word length configuration options are ignored for FULLTEXT
indexes that use the ngram parser: innodb_ft_min_token_size
, innodb_ft_max_token_size
, ft_min_word_len
, and ft_max_word_len
.
Creating a FULLTEXT Index that Uses the ngram Parser
To create a FULLTEXT
index that uses the ngram parser, specify WITH PARSER ngram
with CREATE TABLE
, ALTER TABLE
, or CREATE INDEX
.
The following example demonstrates creating a table with an ngram
FULLTEXT
index, inserting sample data (Simplified Chinese text), and viewing tokenized data in the INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE
table.
mysql> USE test;
mysql> CREATE TABLE articles (
id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
title VARCHAR(200),
body TEXT,
FULLTEXT (title,body) WITH PARSER ngram
) ENGINE=InnoDB CHARACTER SET utf8mb4;
mysql> SET NAMES utf8mb4;
INSERT INTO articles (title,body) VALUES
('数据库管理','在本教程中我将向你展示如何管理数据库'),
('数据库应用开发','学习开发数据库应用程序');
mysql> SET GLOBAL innodb_ft_aux_table="test/articles";
mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE ORDER BY doc_id, position;
To add a FULLTEXT
index to an existing table, you can use ALTER TABLE
or CREATE INDEX
. For example:
CREATE TABLE articles (
id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
title VARCHAR(200),
body TEXT
) ENGINE=InnoDB CHARACTER SET utf8;
ALTER TABLE articles ADD FULLTEXT INDEX ft_index (title,body) WITH PARSER ngram;
# Or:
CREATE FULLTEXT INDEX ft_index ON articles (title,body) WITH PARSER ngram;
ngram Parser Space Handling
The ngram parser eliminates spaces when parsing. For example:
-
“ab cd” is parsed to “ab”, “cd”
-
“a bc” is parsed to “bc”
ngram Parser Stopword Handling
The built-in MySQL full-text parser compares words to entries in the stopword list. If a word is equal to an entry in the stopword list, the word is excluded from the index. For the ngram parser, stopword handling is performed differently. Instead of excluding tokens that are equal to entries in the stopword list, the ngram parser excludes tokens that contain stopwords. For example, assuming ngram_token_size=2
, a document that contains “a,b” is parsed to “a,” and “,b”. If a comma (“,”) is defined as a stopword, both “a,” and “,b” are excluded from the index because they contain a comma.
By default, the ngram parser uses the default stopword list, which contains a list of English stopwords. For a stopword list applicable to Chinese, Japanese, or Korean, you must create your own. For information about creating a stopword list, see Section 12.10.4, “Full-Text Stopwords”.
Stopwords greater in length than ngram_token_size
are ignored.
ngram Parser Term Search
For natural language mode search, the search term is converted to a union of ngram terms. For example, the string “abc” (assuming ngram_token_size=2
) is converted to “ab bc”. Given two documents, one containing “ab” and the other containing “abc”, the search term “ab bc” matches both documents.
For boolean mode search, the search term is converted to an ngram phrase search. For example, the string 'abc' (assuming ngram_token_size=2
) is converted to '“ab bc”'. Given two documents, one containing 'ab' and the other containing 'abc', the search phrase '“ab bc”' only matches the document containing 'abc'.
ngram Parser Wildcard Search
Because an ngram FULLTEXT
index contains only ngrams, and does not contain information about the beginning of terms, wildcard searches may return unexpected results. The following behaviors apply to wildcard searches using ngram FULLTEXT
search indexes:
-
If the prefix term of a wildcard search is shorter than ngram token size, the query returns all indexed rows that contain ngram tokens starting with the prefix term. For example, assuming
ngram_token_size=2
, a search on “a*” returns all rows starting with “a”. -
If the prefix term of a wildcard search is longer than ngram token size, the prefix term is converted to an ngram phrase and the wildcard operator is ignored. For example, assuming
ngram_token_size=2
, an “abc*” wildcard search is converted to “ab bc”.
ngram Parser Phrase Search
Phrase searches are converted to ngram phrase searches. For example, The search phrase “abc” is converted to “ab bc”, which returns documents containing “abc” and “ab bc”.
The search phrase “abc def” is converted to “ab bc de ef”, which returns documents containing “abc def” and “ab bc de ef”. A document that contains “abcdef” is not returned.
12.10.9 MeCab Full-Text Parser Plugin
The built-in MySQL full-text parser uses the white space between words as a delimiter to determine where words begin and end, which is a limitation when working with ideographic languages that do not use word delimiters. To address this limitation for Japanese, MySQL provides a MeCab full-text parser plugin. The MeCab full-text parser plugin is supported for use with InnoDB
and MyISAM
.
MySQL also provides an ngram full-text parser plugin that supports Japanese. For more information, see Section 12.10.8, “ngram Full-Text Parser”.
The MeCab full-text parser plugin is a full-text parser plugin for Japanese that tokenizes a sequence of text into meaningful words. For example, MeCab tokenizes “データベース管理” (“Database Management”) into “データベース” (“Database”) and “管理” (“Management”). By comparison, the ngram full-text parser tokenizes text into a contiguous sequence of n
characters, where n
represents a number between 1 and 10.
In addition to tokenizing text into meaningful words, MeCab indexes are typically smaller than ngram indexes, and MeCab full-text searches are generally faster. One drawback is that it may take longer for the MeCab full-text parser to tokenize documents, compared to the ngram full-text parser.
The full-text search syntax described in Section 12.10, “Full-Text Search Functions” applies to the MeCab parser plugin. Differences in parsing behavior are described in this section. Full-text related configuration options are also applicable.
For additional information about the MeCab parser, refer to the MeCab: Yet Another Part-of-Speech and Morphological Analyzer project on Github.
Installing the MeCab Parser Plugin
The MeCab parser plugin requires mecab
and mecab-ipadic
.
On supported Fedora, Debian and Ubuntu platforms (except Ubuntu 12.04 where the system mecab
version is too old), MySQL dynamically links to the system mecab
installation if it is installed to the default location. On other supported Unix-like platforms, libmecab.so
is statically linked in libpluginmecab.so
, which is located in the MySQL plugin directory. mecab-ipadic
is included in MySQL binaries and is located in
.MYSQL_HOME
\lib\mecab
You can install mecab
and mecab-ipadic
using a native package management utility (on Fedora, Debian, and Ubuntu), or you can build mecab
and mecab-ipadic
from source. For information about installing mecab
and mecab-ipadic
using a native package management utility, see Installing MeCab From a Binary Distribution (Optional). If you want to build mecab
and mecab-ipadic
from source, see Building MeCab From Source (Optional).
On Windows, libmecab.dll
is found in the MySQL bin
directory. mecab-ipadic
is located in
.MYSQL_HOME
/lib/mecab
To install and configure the MeCab parser plugin, perform the following steps:
-
In the MySQL configuration file, set the
mecab_rc_file
configuration option to the location of themecabrc
configuration file, which is the configuration file for MeCab. If you are using the MeCab package distributed with MySQL, themecabrc
file is located inMYSQL_HOME/lib/mecab/etc/
.[mysqld] loose-mecab-rc-file=MYSQL_HOME/lib/mecab/etc/mecabrc
The
loose
prefix is an option modifier. Themecab_rc_file
option is not recognized by MySQL until the MeCaB parser plugin is installed but it must be set before attempting to install the MeCaB parser plugin. Theloose
prefix allows you restart MySQL without encountering an error due to an unrecognized variable.If you use your own MeCab installation, or build MeCab from source, the location of the
mecabrc
configuration file may differ.For information about the MySQL configuration file and its location, see Section 4.2.2.2, “Using Option Files”.
-
Also in the MySQL configuration file, set the minimum token size to 1 or 2, which are the values recommended for use with the MeCab parser. For
InnoDB
tables, minimum token size is defined by theinnodb_ft_min_token_size
configuration option, which has a default value of 3. ForMyISAM
tables, minimum token size is defined byft_min_word_len
, which has a default value of 4.[mysqld] innodb_ft_min_token_size=1
-
Modify the
mecabrc
configuration file to specify the dictionary you want to use. Themecab-ipadic
package distributed with MySQL binaries includes three dictionaries (ipadic_euc-jp
,ipadic_sjis
, andipadic_utf-8
). Themecabrc
configuration file packaged with MySQL contains and entry similar to the following:dicdir = /path/to/mysql/lib/mecab/lib/mecab/dic/ipadic_euc-jp
To use the
ipadic_utf-8
dictionary, for example, modify the entry as follows:dicdir=MYSQL_HOME/lib/mecab/dic/ipadic_utf-8
If you are using your own MeCab installation or have built MeCab from source, the default
dicdir
entry in themecabrc
file differs, as do the dictionaries and their location.NoteAfter the MeCab parser plugin is installed, you can use the
mecab_charset
status variable to view the character set used with MeCab. The three MeCab dictionaries provided with the MySQL binary support the following character sets.-
The
ipadic_euc-jp
dictionary supports theujis
andeucjpms
character sets. -
The
ipadic_sjis
dictionary supports thesjis
andcp932
character sets. -
The
ipadic_utf-8
dictionary supports theutf8
andutf8mb4
character sets.
mecab_charset
only reports the first supported character set. For example, theipadic_utf-8
dictionary supports bothutf8
andutf8mb4
.mecab_charset
always reportsutf8
when this dictionary is in use. -
-
Restart MySQL.
-
Install the MeCab parser plugin:
The MeCab parser plugin is installed using
INSTALL PLUGIN
syntax. The plugin name ismecab
, and the shared library name islibpluginmecab.so
. For additional information about installing plugins, see Section 5.5.1, “Installing and Uninstalling Plugins”.INSTALL PLUGIN mecab SONAME 'libpluginmecab.so';
Once installed, the MeCab parser plugin loads at every normal MySQL restart.
-
Verify that the MeCab parser plugin is loaded using the
SHOW PLUGINS
statement.mysql> SHOW PLUGINS;
A
mecab
plugin should appear in the list of plugins.
Creating a FULLTEXT Index that uses the MeCab Parser
To create a FULLTEXT
index that uses the mecab parser, specify WITH PARSER ngram
with CREATE TABLE
, ALTER TABLE
, or CREATE INDEX
.
This example demonstrates creating a table with a mecab
FULLTEXT
index, inserting sample data, and viewing tokenized data in the INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE
table:
mysql> USE test;
mysql> CREATE TABLE articles (
id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
title VARCHAR(200),
body TEXT,
FULLTEXT (title,body) WITH PARSER mecab
) ENGINE=InnoDB CHARACTER SET utf8;
mysql> SET NAMES utf8;
mysql> INSERT INTO articles (title,body) VALUES
('データベース管理','このチュートリアルでは、私はどのようにデータベースを管理する方法を紹介します'),
('データベースアプリケーション開発','データベースアプリケーションを開発することを学ぶ');
mysql> SET GLOBAL innodb_ft_aux_table="test/articles";
mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE ORDER BY doc_id, position;
To add a FULLTEXT
index to an existing table, you can use ALTER TABLE
or CREATE INDEX
. For example:
CREATE TABLE articles (
id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
title VARCHAR(200),
body TEXT
) ENGINE=InnoDB CHARACTER SET utf8;
ALTER TABLE articles ADD FULLTEXT INDEX ft_index (title,body) WITH PARSER mecab;
# Or:
CREATE FULLTEXT INDEX ft_index ON articles (title,body) WITH PARSER mecab;
MeCab Parser Space Handling
The MeCab parser uses spaces as separators in query strings. For example, the MeCab parser tokenizes データベース管理 as データベース and 管理.
MeCab Parser Stopword Handling
By default, the MeCab parser uses the default stopword list, which contains a short list of English stopwords. For a stopword list applicable to Japanese, you must create your own. For information about creating stopword lists, see Section 12.10.4, “Full-Text Stopwords”.
MeCab Parser Term Search
For natural language mode search, the search term is converted to a union of tokens. For example, データベース管理 is converted to データベース 管理.
SELECT COUNT(*) FROM articles WHERE MATCH(title,body) AGAINST('データベース管理' IN NATURAL LANGUAGE MODE);
For boolean mode search, the search term is converted to a search phrase. For example, データベース管理 is converted to データベース 管理.
SELECT COUNT(*) FROM articles WHERE MATCH(title,body) AGAINST('データベース管理' IN BOOLEAN MODE);
MeCab Parser Wildcard Search
Wildcard search terms are not tokenized. A search on データベース管理* is performed on the prefix, データベース管理.
SELECT COUNT(*) FROM articles WHERE MATCH(title,body) AGAINST('データベース*' IN BOOLEAN MODE);
MeCab Parser Phrase Search
Phrases are tokenized. For example, データベース管理 is tokenized as データベース 管理.
SELECT COUNT(*) FROM articles WHERE MATCH(title,body) AGAINST('"データベース管理"' IN BOOLEAN MODE);
Installing MeCab From a Binary Distribution (Optional)
This section describes how to install mecab
and mecab-ipadic
from a binary distribution using a native package management utility. For example, on Fedora, you can use Yum to perform the installation:
yum mecab-devel
On Debian or Ubuntu, you can perform an APT installation:
apt-get install mecab
apt-get install mecab-ipadic
Installing MeCab From Source (Optional)
If you want to build mecab
and mecab-ipadic
from source, basic installation steps are provided below. For additional information, refer to the MeCab documentation.
-
Download the tar.gz packages for
mecab
andmecab-ipadic
from http://taku910.github.io/mecab/#download. As of February, 2016, the latest available packages aremecab-0.996.tar.gz
andmecab-ipadic-2.7.0-20070801.tar.gz
. -
Install
mecab
:tar zxfv mecab-0.996.tar cd mecab-0.996 ./configure make make check su make install
-
Install
mecab-ipadic
:tar zxfv mecab-ipadic-2.7.0-20070801.tar cd mecab-ipadic-2.7.0-20070801 ./configure make su make install
-
Compile MySQL using the
WITH_MECAB
CMake option. Set theWITH_MECAB
option tosystem
if you have installedmecab
andmecab-ipadic
to the default location.-DWITH_MECAB=system
If you defined a custom installation directory, set
WITH_MECAB
to the custom directory. For example:-DWITH_MECAB=/path/to/mecab
【推荐】国内首个AI IDE,深度理解中文开发场景,立即下载体验Trae
【推荐】编程新体验,更懂你的AI,立即体验豆包MarsCode编程助手
【推荐】抖音旗下AI助手豆包,你的智能百科全书,全免费不限次数
【推荐】轻量又高性能的 SSH 工具 IShell:AI 加持,快人一步
· 全程不用写代码,我用AI程序员写了一个飞机大战
· DeepSeek 开源周回顾「GitHub 热点速览」
· 记一次.NET内存居高不下排查解决与启示
· MongoDB 8.0这个新功能碉堡了,比商业数据库还牛
· .NET10 - 预览版1新功能体验(一)