ElasticSearch - How to search for a part of a word with ElasticSearch
From Stack Overflow: https://stackoverflow.com/questions/6467067/how-to-search-for-a-part-of-a-word-with-elasticsearch
Scenario
// Initialize the test data
POST /my_idx/my_type/_bulk
{"index": {"_id": "1"}}
{"name": "John Doeman", "function": "Janitor"}
{"index": {"_id": "2"}}
{"name": "Jane Doewoman", "function": "Teacher"}
{"index": {"_id": "3"}}
{"name": "Jimmy Jackal", "function": "Student"}
Question
ElasticSearch contains the following data:
{
"_id" : "1",
"name" : "John Doeman",
"function" : "Janitor"
}
{
"_id" : "2",
"name" : "Jane Doewoman",
"function" : "Teacher"
}
{
"_id" : "3",
"name" : "Jimmy Jackal",
"function" : "Student"
}
We now want to find all documents that contain Doe:
// returns no documents
GET /my_idx/my_type/_search?q=Doe
// returns one document
GET /my_idx/my_type/_search?q=Doeman
The asker also tried switching analyzers and using a request-body query instead, but that did not work either:
GET /my_idx/my_type/_search
{
"query": {
"term": {
"name": "Doe"
}
}
}
Later an nGram tokenizer and filter were added:
{
"index": {
"index": "my_idx",
"type": "my_type",
"bulk_size": "100",
"bulk_timeout": "10ms",
"analysis": {
"analyzer": {
"my_analyzer": {
"type": "custom",
"tokenizer": "my_ngram_tokenizer",
"filter": [
"my_ngram_filter"
]
}
},
"filter": {
"my_ngram_filter": {
"type": "nGram",
"min_gram": 1,
"max_gram": 1
}
},
"tokenizer": {
"my_ngram_tokenizer": {
"type": "nGram",
"min_gram": 1,
"max_gram": 1
}
}
}
}
}
This introduced another problem: any query now returns all documents.
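To see why, consider what that analyzer does to a query term. A minimal sketch with the _analyze API, assuming the analysis settings above were actually applied to my_idx as my_analyzer (the request-body form here matches the _analyze call used later in this post):
GET my_idx/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Doe"
}
// tokens: "D", "o", "e"
With min_gram and max_gram both set to 1, every query is shredded into single letters, and nearly every document contains at least one of them, so nearly everything matches.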
Answers
First of all, this is a tokenization problem. By default the index uses the standard analyzer, so indexing the three documents shown above produces the following inverted index (considering only the name field):
term | document IDs |
---|---|
john | 1 |
doeman | 1 |
jane | 2 |
doewoman | 2 |
jimmy | 3 |
jackal | 3 |
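This term list is easy to verify with the _analyze API (the same endpoint used later in this post; the standard analyzer is built in, so no index is needed):
POST _analyze
{
  "analyzer": "standard",
  "text": "John Doeman"
}
// tokens: "john", "doeman" -- lowercased and split on whitespace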
Now consider each of our searches.
Search 1
GET /my_idx/my_type/_search?q=Doe
The standard analyzer analyzes Doe to doe, then looks it up in the inverted index. No doe term exists there, so nothing is returned.
Search 2
GET /my_idx/my_type/_search?q=Doeman
The standard analyzer analyzes Doeman to doeman, then finds that term in the inverted index. Only doc ID 1 contains it, so a single document is returned.
Search 3
GET /my_idx/my_type/_search
{
"query": {
"term": {
"name": "Doe"
}
}
}
A term query is not analyzed, so Doe stays Doe. But Doe (capitalized) does not exist in the inverted index either, so this approach also returns no documents.
Search 4
As an aside, the asker never actually tried this variant:
GET /my_idx/my_type/_search
{
"query": {
"term": {
"name": "Doeman"
}
}
}
Don't expect this to find anything either: because term queries are not analyzed, Doeman is looked up verbatim in the inverted index and matches no documents, unless Doeman is changed to doeman.
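For completeness, a minimal sketch of that lowercase variant against the same index:
GET /my_idx/my_type/_search
{
  "query": {
    "term": {
      "name": "doeman"
    }
  }
}
// returns document 1, since doeman is exactly the term the standard analyzer indexed
This still only solves exact whole-term lookups, not the original partial-word problem.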
Solutions
Summarizing the answers on Stack Overflow, there are several workable approaches:
- Regular expression matching
- Wildcard matching
- Prefix matching
- nGram tokenizer
Regular expression matching
Note that regexp queries are not analyzed; the pattern runs against the terms stored in the inverted index, which the standard analyzer lowercased, hence the lowercase doe.*:
GET my_idx/my_type/_search
{
"query": {
"regexp": {
"name": "doe.*"
}
}
}
Wildcard matching
Use query_string with a wildcard. Be aware that wildcard searches can use large amounts of memory and perform poorly. Suffix matching with a leading wildcard (e.g. "*ing") is a particularly heavy operation, since every term in the index must be scanned; leading wildcards can be disabled via allow_leading_wildcard (see the sketch after the query below).
GET my_idx/my_type/_search
{
"query": {
"query_string": {
"default_field": "name",
"query": "Doe*"
}
}
}
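If leading wildcards should never run at all, they can be rejected outright; allow_leading_wildcard is a standard query_string option. A minimal sketch based on the query above:
GET my_idx/my_type/_search
{
  "query": {
    "query_string": {
      "default_field": "name",
      "query": "Doe*",
      "allow_leading_wildcard": false
    }
  }
}
// a query like "*man" is now rejected instead of scanning every term in the index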
Prefix matching
The original answer suggested prefix, but prefix queries do not analyze the query string, so here we use match_phrase_prefix instead:
GET my_idx/my_type/_search
{
"query": {
"match_phrase_prefix": {
"name": {
"query": "Doe",
"max_expansions": 10
}
}
}
}
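Here max_expansions caps how many indexed terms the Doe prefix may expand to before matching. For contrast, a minimal sketch of the raw prefix query the original answer proposed; since it is not analyzed, the prefix must be lowercase to match the indexed terms:
GET my_idx/my_type/_search
{
  "query": {
    "prefix": {
      "name": "doe"
    }
  }
}
// matches documents 1 and 2 (doeman, doewoman)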
nGram tokenizer
Create the index:
PUT my_idx
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 3,
"token_chars": [
"letter",
"digit"
]
}
}
}
}
}
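Note that these settings define the analyzer but do not attach it to any field; for the search below to hit the n-grams, the name field must use it. A minimal sketch of the combined request, assuming a typeless (ES 7.x+) mapping; on older versions the properties block sits under the type name:
PUT my_idx
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": ["letter", "digit"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}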
Test the analyzer:
POST my_idx/_analyze
{
"analyzer": "my_analyzer",
"text": "Doeman"
}
// response
{
"tokens": [
{
"token": "Doe",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 0
},
{
"token": "oem",
"start_offset": 1,
"end_offset": 4,
"type": "word",
"position": 1
},
{
"token": "ema",
"start_offset": 2,
"end_offset": 5,
"type": "word",
"position": 2
},
{
"token": "man",
"start_offset": 3,
"end_offset": 6,
"type": "word",
"position": 3
}
]
}
Now partial-word searches find their documents. The asker did use ngram, but with min_gram and max_gram both set to 1. The shorter the grams, the more documents match but the lower the match quality; the longer the grams, the more relevant the matches. Tri-grams (length 3) are the recommended choice; the official documentation covers this in detail.
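As a final check, a match query for the partial word, assuming name is mapped with my_analyzer as sketched earlier:
GET my_idx/_search
{
  "query": {
    "match": {
      "name": "Doe"
    }
  }
}
// the query string is analyzed with the same ngram analyzer into the single
// tri-gram "Doe", which matches documents 1 (Doeman) and 2 (Doewoman)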