Elasticsearch: The Difference Between the edge_ngram and ngram Tokenizers

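edge_ngram and ngram are two tokenizers built into Elasticsearch, and they come up often when defining an index's analysis settings and mappings. Once the gram lengths (min_gram and max_gram) are configured, the tokenizer can be assigned directly to the tokenizer field of a custom analyzer.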
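As a rough illustration of that configuration step, the sketch below builds an index-creation body as a plain Python dict. The names `demo_tokenizer`, `demo_analyzer`, the `title` field, and the helper `build_index_body` are hypothetical and chosen only for this example. It also assumes a recent Elasticsearch version, where an ngram tokenizer whose max_gram exceeds min_gram by more than 1 needs `index.max_ngram_diff` raised accordingly.

```python
import json


def build_index_body(tokenizer_type, min_gram=1, max_gram=3):
    """Return an index-creation body that wires an n-gram tokenizer into an analyzer.

    tokenizer_type is expected to be "edge_ngram" or "ngram".
    All names below (demo_tokenizer, demo_analyzer, title) are illustrative.
    """
    body = {
        "settings": {
            "analysis": {
                "tokenizer": {
                    "demo_tokenizer": {
                        "type": tokenizer_type,
                        "min_gram": min_gram,
                        "max_gram": max_gram,
                    }
                },
                "analyzer": {
                    "demo_analyzer": {
                        "type": "custom",
                        "tokenizer": "demo_tokenizer",
                    }
                },
            }
        },
        "mappings": {
            "properties": {
                "title": {"type": "text", "analyzer": "demo_analyzer"}
            }
        },
    }
    # Assumption: the ngram tokenizer is limited by index.max_ngram_diff
    # (default 1), so a wider min/max range has to raise that setting.
    if tokenizer_type == "ngram" and max_gram - min_gram > 1:
        body["settings"]["index"] = {"max_ngram_diff": max_gram - min_gram}
    return body


if __name__ == "__main__":
    print(json.dumps(build_index_body("edge_ngram"), indent=2))
```

The resulting dict could be sent as the body of an index-creation request; how it is sent (client library, curl, Kibana console) is left open here.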
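The examples below all tokenize the same three-character string (the Chinese word for "string"). The results shown correspond to min_gram = 1 and max_gram of at least 3: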
字符串

Result from the edge_ngram tokenizer:
{
  "tokens": [
    {
      "token": "字",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "字符",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 1
    },
    {
      "token": "字符串",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 2
    }
  ]
}
Result from the ngram tokenizer:
{
  "tokens": [
    {
      "token": "字",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "字符",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 1
    },
    {
      "token": "字符串",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 2
    },
    {
      "token": "符",
      "start_offset": 1,
      "end_offset": 2,
      "type": "word",
      "position": 3
    },
    {
      "token": "符串",
      "start_offset": 1,
      "end_offset": 3,
      "type": "word",
      "position": 4
    },
    {
      "token": "串",
      "start_offset": 2,
      "end_offset": 3,
      "type": "word",
      "position": 5
    }
  ]
}
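The difference should be clear at a glance. Put simply: the edge_ngram tokenizer always starts at the first character and emits grams of increasing length, from min_gram up to max_gram or the end of the text, so every token is a prefix. The ngram tokenizer does the same thing from every character position in turn, so it emits every substring whose length falls within the configured range.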
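The rule described above can be sketched in a few lines of plain Python. This is a simplified model for illustration only, not Elasticsearch's implementation: the function names `edge_ngrams` and `ngrams` are made up here, the whole input is treated as a single token (no token_chars splitting), and offsets and positions are ignored.

```python
def edge_ngrams(text, min_gram=1, max_gram=3):
    """Grams anchored at the first character, shortest first."""
    longest = min(max_gram, len(text))
    return [text[:n] for n in range(min_gram, longest + 1)]


def ngrams(text, min_gram=1, max_gram=3):
    """Grams starting at every character position, shortest first per start."""
    grams = []
    for start in range(len(text)):
        for n in range(min_gram, max_gram + 1):
            # Skip grams that would run past the end of the text.
            if start + n <= len(text):
                grams.append(text[start:start + n])
    return grams


if __name__ == "__main__":
    print(edge_ngrams("字符串"))  # ['字', '字符', '字符串']
    print(ngrams("字符串"))       # ['字', '字符', '字符串', '符', '符串', '串']
```

For the sample string with min_gram = 1 and max_gram = 3, the two lists contain the same tokens, in the same order, as the two results shown earlier.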
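What about practical use? If a match must begin at the first character (prefix matching, as in search-as-you-type), edge_ngram is the natural choice. If a match may begin at any character in the text (substring matching), ngram is the better fit.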
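To make the prefix-versus-substring distinction concrete, here is a small self-contained sketch. It models an index as the set of grams produced at index time and assumes the query is looked up as one unsplit term; that is a simplification of real search behavior, and the helpers `indexed_grams` and `matches` are hypothetical names used only here.

```python
def indexed_grams(text, edge_only, min_gram=1, max_gram=3):
    """Return the set of grams that would be indexed for text.

    edge_only=True models edge_ngram (start position 0 only);
    edge_only=False models ngram (every start position).
    """
    starts = [0] if edge_only else range(len(text))
    return {
        text[s:s + n]
        for s in starts
        for n in range(min_gram, max_gram + 1)
        if s + n <= len(text)
    }


def matches(query, text, edge_only):
    """True if the unsplit query term equals one of the indexed grams."""
    return query in indexed_grams(text, edge_only)


if __name__ == "__main__":
    # A prefix query is found under both schemes.
    print(matches("字符", "字符串", edge_only=True))    # True
    print(matches("字符", "字符串", edge_only=False))   # True
    # A query starting mid-text is found only with ngram.
    print(matches("符串", "字符串", edge_only=True))    # False
    print(matches("符串", "字符串", edge_only=False))   # True
```

The trade-off is index size: ngram stores many more grams per value than edge_ngram, so it is worth choosing only when substring matches are actually needed.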
Original article: https://blog.csdn.net/Frankltf/article/details/109734447

posted @ 2021-05-09 21:49 ppjj
