ES系列十三、Elasticsearch Suggester API(自动补全)
1.概念
1.补全api主要分为四类
- Term Suggester(纠错补全,输入错误的情况下补全正确的单词)
- Phrase Suggester(自动补全短语,输入一个单词补全整个短语)
- Completion Suggester(完成补全单词,输出如前半部分,补全整个单词)
- Context Suggester(上下文补全)
整体效果类似百度搜索,如图:
2.Term Suggester(纠错补全)
2.1.api
1.建立索引
PUT /book4 { "mappings": { "english": { "properties": { "passage": { "type": "text" } } } } }
2.插入数据
curl -H "Content-Type: application/json" -XPOST 'http:localhost:9200/_bulk' -d' { "index" : { "_index" : "book4", "_type" : "english" } } { "passage": "Lucene is cool"} { "index" : { "_index" : "book4", "_type" : "english" } } { "passage": "Elasticsearch builds on top of lucene"} { "index" : { "_index" : "book4", "_type" : "english" } } { "passage": "Elasticsearch rocks"} { "index" : { "_index" : "book4", "_type" : "english" } } { "passage": "Elastic is the company behind ELK stack"} { "index" : { "_index" : "book4", "_type" : "english" } } { "passage": "elk rocks"} { "index" : { "_index" : "book4", "_type" : "english" } } { "passage": "elasticsearch is rock solid"} '
3.看下储存的分词有哪些
post /_analyze { "text": [ "Lucene is cool", "Elasticsearch builds on top of lucene", "Elasticsearch rocks", "Elastic is the company behind ELK stack", "elk rocks", "elasticsearch is rock solid" ] }
结果:
{ "tokens": [ { "token": "lucene", "start_offset": 0, "end_offset": 6, "type": "<ALPHANUM>", "position": 0 }, { "token": "is", "start_offset": 7, "end_offset": 9, "type": "<ALPHANUM>", "position": 1 }, { "token": "cool", "start_offset": 10, "end_offset": 14, "type": "<ALPHANUM>", "position": 2 }, { "token": "elasticsearch", "start_offset": 15, "end_offset": 28, "type": "<ALPHANUM>", "position": 103 }, { "token": "builds", "start_offset": 29, "end_offset": 35, "type": "<ALPHANUM>", "position": 104 }, { "token": "on", "start_offset": 36, "end_offset": 38, "type": "<ALPHANUM>", "position": 105 }, { "token": "top", "start_offset": 39, "end_offset": 42, "type": "<ALPHANUM>", "position": 106 }, { "token": "of", "start_offset": 43, "end_offset": 45, "type": "<ALPHANUM>", "position": 107 }, { "token": "lucene", "start_offset": 46, "end_offset": 52, "type": "<ALPHANUM>", "position": 108 }, { "token": "elasticsearch", "start_offset": 53, "end_offset": 66, "type": "<ALPHANUM>", "position": 209 }, { "token": "rocks", "start_offset": 67, "end_offset": 72, "type": "<ALPHANUM>", "position": 210 }, { "token": "elastic", "start_offset": 73, "end_offset": 80, "type": "<ALPHANUM>", "position": 311 }, { "token": "is", "start_offset": 81, "end_offset": 83, "type": "<ALPHANUM>", "position": 312 }, { "token": "the", "start_offset": 84, "end_offset": 87, "type": "<ALPHANUM>", "position": 313 }, { "token": "company", "start_offset": 88, "end_offset": 95, "type": "<ALPHANUM>", "position": 314 }, { "token": "behind", "start_offset": 96, "end_offset": 102, "type": "<ALPHANUM>", "position": 315 }, { "token": "elk", "start_offset": 103, "end_offset": 106, "type": "<ALPHANUM>", "position": 316 }, { "token": "stack", "start_offset": 107, "end_offset": 112, "type": "<ALPHANUM>", "position": 317 }, { "token": "elk", "start_offset": 113, "end_offset": 116, "type": "<ALPHANUM>", "position": 418 }, { "token": "rocks", "start_offset": 117, "end_offset": 122, "type": "<ALPHANUM>", "position": 419 }, { "token": "elasticsearch", "start_offset": 123, "end_offset": 136, "type": "<ALPHANUM>", "position": 520 }, { "token": "is", "start_offset": 137, "end_offset": 139, "type": "<ALPHANUM>", "position": 521 }, { "token": "rock", "start_offset": 140, "end_offset": 144, "type": "<ALPHANUM>", "position": 522 }, { "token": "solid", "start_offset": 145, "end_offset": 150, "type": "<ALPHANUM>", "position": 523 } ] }
4.term suggest api(搜索单个字段)
搜索下试试,给出错误单词Elasticsearaach
POST /book4/_search { "suggest" : { "my-suggestion" : { "text" : "Elasticsearaach", "term" : { "field" : "passage",
"suggest_mode": "popular" } } } }
response:
{ "took": 26, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 0, "max_score": 0, "hits": [] }, "suggest": { "my-suggestion": [ { "text": "elasticsearaach", "offset": 0, "length": 15, "options": [ { "text": "elasticsearch", "score": 0.84615386, "freq": 3 } ] } ] } }
5.搜索多个字段分别给出提示:
POST _search { "suggest": { "my-suggest-1" : { "text" : "tring out Elasticsearch", "term" : { "field" : "message" } }, "my-suggest-2" : { "text" : "kmichy", "term" : { "field" : "user" } } } }
该term
建议者提出基于编辑距离条款。在建议术语之前分析提供的建议文本。建议的术语是根据分析的建议文本标记提供的。该term
建议者不走查询到的是是的请求部分。
|
建议文字。建议文本是必需的选项,需要全局或按建议设置。 |
|
从中获取候选建议的字段。这是一个必需的选项,需要全局设置或根据建议设置。 |
|
用于分析建议文本的分析器。默认为建议字段的搜索分析器。 |
|
每个建议文本标记返回的最大更正。 |
|
定义如何根据建议文本术语对建议进行排序。两个可能的值:
|
|
建议模式控制包含哪些建议或控制建议的文本术语,建议。可以指定三个可能的值:
|
|
在文本分析之后,建议文本术语小写。 |
|
最大编辑距离候选建议可以具有以便被视为建议。只能是介于1和2之间的值。任何其他值都会导致抛出错误的请求错误。默认为2。 |
|
必须匹配的最小前缀字符的数量才是候选建议。默认为1.增加此数字可提高拼写检查性能。通常拼写错误不会出现在术语的开头。(旧名“prefix_len”已弃用) |
|
建议文本术语必须具有的最小长度才能包含在内。默认为4.(旧名称“min_word_len”已弃用) |
|
设置从每个单独分片中检索的最大建议数。在减少阶段,仅根据 |
|
用于乘以的因子, |
|
建议应出现的文档数量的最小阈值。可以指定为绝对数字或文档数量的相对百分比。这可以仅通过建议高频项来提高质量。默认为0f且未启用。如果指定的值大于1,则该数字不能是小数。分片级文档频率用于此选项。 |
|
建议文本令牌可以存在的文档数量的最大阈值,以便包括在内。可以是表示文档频率的相对百分比数(例如0.4)或绝对数。如果指定的值大于1,则不能指定小数。默认为0.01f。这可用于排除高频术语的拼写检查。高频术语通常拼写正确,这也提高了拼写检查的性能。分片级文档频率用于此选项。 |
|
用于比较类似建议术语的字符串距离实现。可以指定五个可能的值: |
2.phase sguesster:短语纠错
phrase 短语建议,在term的基础上,会考量多个term之间的关系,比如是否同时出现在索引的原文里,相邻程度,以及词频等
示例1:
POST book4/_search
"text": "Elasticsearch rock",
"phrase": {
"field": "passage"
}
}
}
}
{ "took": 11, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 0, "max_score": 0, "hits": [] }, "suggest": { "myss": [ { "text": "Elasticsearch rock", "offset": 0, "length": 18, "options": [ { "text": "elasticsearch rocks", "score": 0.3467123 } ] } ] } }
3. Completion suggester 自动补全
针对自动补全场景而设计的建议器。此场景下用户每输入一个字符的时候,就需要即时发送一次查询请求到后端查找匹配项,在用户输入速度较高的情况下对后端响应速度要求比较苛刻。因此实现上它和前面两个Suggester采用了不同的数据结构,索引并非通过倒排来完成,而是将analyze过的数据编码成FST和索引一起存放。对于一个open状态的索引,FST会被ES整个装载到内存里的,进行前缀查找速度极快。但是FST只能用于前缀查找,这也是Completion Suggester的局限所在。
1.建立索引
POST /book5 { "mappings": { "music" : { "properties" : { "suggest" : { "type" : "completion" }, "title" : { "type": "keyword" } } } } }
插入数据:
POST /book5/music { "suggest":"test my book" }
Input 指定输入词 Weight 指定排序值(可选)
PUT music/music/5nupmmUBYLvVFwGWH3cu?refresh { "suggest" : { "input": [ "test", "book" ], "weight" : 34 } }
指定不同的排序值:
PUT music/_doc/6Hu2mmUBYLvVFwGWxXef?refresh { "suggest" : [ { "input": "test", "weight" : 10 }, { "input": "good", "weight" : 3 } ]}
示例1:查询建议根据前缀查询
POST book5/_search?pretty { "suggest": { "song-suggest" : { "prefix" : "te", "completion" : { "field" : "suggest" } } } }
{ "took": 8, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 0, "max_score": 0, "hits": [] }, "suggest": { "song-suggest": [ { "text": "te", "offset": 0, "length": 2, "options": [ { "text": "test my book1", "_index": "book5", "_type": "music", "_id": "6Xu6mmUBYLvVFwGWpXeL", "_score": 1, "_source": { "suggest": "test my book1" } }, { "text": "test my book1", "_index": "book5", "_type": "music", "_id": "6nu8mmUBYLvVFwGWSndC", "_score": 1, "_source": { "suggest": "test my book1" } }, { "text": "test my book1 english", "_index": "book5", "_type": "music", "_id": "63u8mmUBYLvVFwGWZHdC", "_score": 1, "_source": { "suggest": "test my book1 english" } } ] } ] } }
示例2:对建议查询结果去重
{ "suggest": { "song-suggest" : { "prefix" : "te", "completion" : { "field" : "suggest" , "skip_duplicates": true } } } }
示例3:查询建议文档存储短语
POST /book5/music/63u8mmUBYLvVFwGWZHdC?refresh { "suggest" : { "input": [ "book1 english", "test english" ], "weight" : 20 } }
查询:
POST book5/_search?pretty { "suggest": { "song-suggest" : { "prefix" : "test", "completion" : { "field" : "suggest" , "skip_duplicates": true } } } }
结果:
{ "took": 7, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 0, "max_score": 0, "hits": [] }, "suggest": { "song-suggest": [ { "text": "test", "offset": 0, "length": 4, "options": [ { "text": "test english", "_index": "book5", "_type": "music", "_id": "63u8mmUBYLvVFwGWZHdC", "_score": 20, "_source": { "suggest": { "input": [ "book1 english", "test english" ], "weight": 20 } } }, { "text": "test my book1", "_index": "book5", "_type": "music", "_id": "6Xu6mmUBYLvVFwGWpXeL", "_score": 1, "_source": { "suggest": "test my book1" } } ] } ] } }
4. 总结和建议
因此用好Completion Sugester并不是一件容易的事,实际应用开发过程中,需要根据数据特性和业务需要,灵活搭配analyzer和mapping参数,反复调试才可能获得理想的补全效果。
回到篇首搜索框的补全/纠错功能,如果用ES怎么实现呢?我能想到的一个的实现方式:
- 在用户刚开始输入的过程中,使用Completion Suggester进行关键词前缀匹配,刚开始匹配项会比较多,随着用户输入字符增多,匹配项越来越少。如果用户输入比较精准,可能Completion Suggester的结果已经够好,用户已经可以看到理想的备选项了。
- 如果Completion Suggester已经到了零匹配,那么可以猜测是否用户有输入错误,这时候可以尝试一下Phrase Suggester。
- 如果Phrase Suggester没有找到任何option,开始尝试term Suggester。
精准程度上(Precision)看: Completion > Phrase > term, 而召回率上(Recall)则反之。从性能上看,Completion Suggester是最快的,如果能满足业务需求,只用Completion Suggester做前缀匹配是最理想的。 Phrase和Term由于是做倒排索引的搜索,相比较而言性能应该要低不少,应尽量控制suggester用到的索引的数据量,最理想的状况是经过一定时间预热后,索引可以全量map到内存。