Elasticsearch Analyzers Explained
I. Analyzers
1. What an analyzer does: ① tokenization (splitting text into terms)
② normalization (improves recall, i.e. the proportion of relevant documents that a search actually finds)
2. An analyzer is built from three components:
① character filter: preprocessing applied before tokenization (stripping useless characters and tags, mapping sequences such as & => and, 《Elasticsearch》 => Elasticsearch)
A. HTML Strip Character Filter (type: html_strip)
escaped_tags: the HTML tags to keep
PUT my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "html_strip",
          "escaped_tags": ["a"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["my_char_filter"]
        }
      }
    }
  }
}
Test the analyzer:
GET my_index/_analyze
{
"analyzer": "my_analyzer",
"text": "liuyucheng <a><b>edu</b></a>"
}
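Outside of Elasticsearch, the effect of html_strip with escaped_tags can be approximated with a regular expression. The sketch below is plain Python, not part of the ES API, and ignores details such as HTML entity decoding; it only illustrates "strip every tag except the escaped one":

```python
import re

def html_strip(text, escaped_tag="a"):
    # Remove every HTML tag except the escaped one (here <a> / </a>),
    # mimicking html_strip with "escaped_tags": ["a"].
    return re.sub(r"<(?!/?%s\b)[^>]*>" % re.escape(escaped_tag), "", text)

print(html_strip("liuyucheng <a><b>edu</b></a>"))
# -> liuyucheng <a>edu</a>
```

Because the analyzer above uses the keyword tokenizer, the filtered string is then emitted as a single token.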
B. Mapping Character Filter (type: mapping)
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["my_char_filter"]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "٠ => 0", "١ => 1", "٢ => 2", "٣ => 3", "٤ => 4",
            "٥ => 5", "٦ => 6", "٧ => 7", "٨ => 8", "٩ => 9"
          ]
        }
      }
    }
  }
}
Test the analyzer:
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "My license plate is ٢٥٠١٥"
}
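What the mapping char filter does is a simple per-character substitution before tokenization; a minimal Python stand-in:

```python
# Per-character mapping, as configured in the "mapping" char filter above.
ARABIC_DIGITS = {"٠": "0", "١": "1", "٢": "2", "٣": "3", "٤": "4",
                 "٥": "5", "٦": "6", "٧": "7", "٨": "8", "٩": "9"}

def map_chars(text):
    # Replace each mapped character, leave everything else untouched.
    return "".join(ARABIC_DIGITS.get(ch, ch) for ch in text)

print(map_chars("My license plate is ٢٥٠١٥"))
# -> My license plate is 25015
```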
C. Pattern Replace Character Filter (type: pattern_replace): regex-based replacement
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": ["my_char_filter"]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "(\\d+)-(?=\\d)",
          "replacement": "$1_"
        }
      }
    }
  }
}
Test the analyzer:
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "My credit card is 123-456-789"
}
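The same regex can be exercised directly in Python to see why the digits end up joined by underscores (Python writes the $1 backreference as \1):

```python
import re

def replace_dashes(text):
    # Same pattern as the pattern_replace char filter above: a run of
    # digits followed by "-" and (lookahead) another digit becomes "digits_".
    return re.sub(r"(\d+)-(?=\d)", r"\1_", text)

print(replace_dashes("My credit card is 123-456-789"))
# -> My credit card is 123_456_789
```

The lookahead (?=\d) keeps a trailing dash with no digit after it (e.g. "123-") untouched.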
② tokenizer: splits the character stream into individual tokens
③ token filter: tense normalization, case conversion, synonym expansion, stop-word handling, etc.
e.g. has => have, him => he, apples => apple; drop words like the/oh/a
A. Lowercase token filter (lowercase)
GET _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "THE Quick FoX JUMPs"
}

The filter can also be applied conditionally, via a script:

GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "condition",
      "filter": ["lowercase"],
      "script": {
        "source": "token.getTerm().length() < 5"
      }
    }
  ],
  "text": "THE QUICK BROWN FOX"
}
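The effect of the condition filter is easy to see in a plain Python sketch (not the ES implementation, just the same predicate applied per token):

```python
def conditional_lowercase(tokens, max_len=5):
    # Lowercase a token only when the condition holds, mirroring the
    # "token.getTerm().length() < 5" script above.
    return [t.lower() if len(t) < max_len else t for t in tokens]

print(conditional_lowercase("THE QUICK BROWN FOX".split()))
# -> ['the', 'QUICK', 'BROWN', 'fox']
```

Only THE and FOX are shorter than 5 characters, so QUICK and BROWN keep their case.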
B. Stop words (stopwords)
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}

GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Teacher Ma is in the restroom"
}
C. Tokenizer: standard
GET /my_index/_analyze
{
  "text": "江山如此多娇,小姐姐哪里可以撩",
  "analyzer": "standard"
}
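The result makes the limitation obvious: the standard tokenizer has no Chinese word model, so it emits one token per Han character. A rough stand-in (non-CJK handling omitted):

```python
def standard_cjk_tokens(text):
    # The standard tokenizer splits CJK text character by character:
    # each ideograph in the basic CJK block becomes its own token.
    return [ch for ch in text if "\u4e00" <= ch <= "\u9fff"]

print(standard_cjk_tokens("江山如此多娇"))
# -> ['江', '山', '如', '此', '多', '娇']
```

This single-character splitting is why a dedicated Chinese analyzer (section II) is needed for usable Chinese search.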
D. Custom analyzer: set type to custom to tell Elasticsearch that we are defining a custom analyzer. Compare this with how a built-in analyzer is configured: there, type is set to the name of the built-in analyzer, such as standard or simple.
PUT /test_analysis
{
  "settings": {
    "analysis": {
      "char_filter": {
        "test_char_filter": {
          "type": "mapping",
          "mappings": ["& => and", "| => or"]
        }
      },
      "filter": {
        "test_stopwords": {
          "type": "stop",
          "stopwords": ["is", "in", "at", "the", "a", "for"]
        }
      },
      "tokenizer": {
        "punctuation": {
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip", "test_char_filter"],
          "tokenizer": "standard",
          "filter": ["lowercase", "test_stopwords"]
        }
      }
    }
  }
}

GET /test_analysis/_analyze
{
  "text": "Teacher ma & zhang also thinks [mother's friends] is good | nice!!!",
  "analyzer": "my_analyzer"
}
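The whole custom pipeline runs in a fixed order: char filters first, then the tokenizer, then token filters. It can be sketched end to end in plain Python (a rough approximation: re.findall stands in for the standard tokenizer, and the html_strip step is skipped since the input has no tags):

```python
import re

def my_analyzer(text):
    # 1) character filters: & -> and, | -> or (as in test_char_filter)
    text = text.replace("&", "and").replace("|", "or")
    # 2) tokenizer: rough stand-in for "standard" (words, keeping apostrophes)
    tokens = re.findall(r"\w+(?:'\w+)?", text)
    # 3) token filters: lowercase, then stop-word removal
    stop = {"is", "in", "at", "the", "a", "for"}
    return [t.lower() for t in tokens if t.lower() not in stop]

print(my_analyzer("Teacher ma & zhang also thinks [mother's friends] is good | nice!!!"))
# -> ['teacher', 'ma', 'and', 'zhang', 'also', 'thinks', "mother's",
#     'friends', 'good', 'or', 'nice']
```

Note how "&" survives as the token "and" because the char filter rewrote it before tokenization, while "is" is dropped by the stop filter at the end.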
E. Specifying the analyzer when creating a mapping
PUT /test_analysis/_mapping/my_type
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}
II. Chinese Analyzers
① The IK analyzer (note: the Elasticsearch installation path must not contain Chinese characters or spaces). Installation:
1) Download it from https://github.com/medcl/elasticsearch-analysis-ik
2) Create the plugin folder: cd your-es-root/plugins/ && mkdir ik
3) Unzip the plugin into your-es-root/plugins/ik
4) Restart Elasticsearch
② IK ships two analyzers:
1) ik_max_word: fine-grained; exhaustively emits every word it can segment
2) ik_smart: coarse-grained; returns the fewest, longest words
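With the plugin installed, the two analyzers can be compared directly against a running cluster (the text below is the example from the IK README):

```
GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "中华人民共和国国歌"
}

GET _analyze
{
  "analyzer": "ik_smart",
  "text": "中华人民共和国国歌"
}
```

ik_max_word emits every word it can find in the phrase, while ik_smart returns only the coarsest split (中华人民共和国 / 国歌). Use ik_max_word at index time when you want maximum recall, and ik_smart when you want fewer, longer terms.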
③ IK files:
1) IKAnalyzer.cfg.xml: the IK configuration file
2) main.dic: the main dictionary
3) stopword.dic: English stop words; they are never written into the inverted index
4) Special dictionaries:
- quantifier.dic: units of measure and classifiers
- suffix.dic: suffixes
- surname.dic: Chinese surnames
- preposition.dic: function/modal words
5) Custom dictionaries: for words the main dictionary lacks, e.g. current slang such as 857, emmm..., 渣女, 舔屏, 996
6) Hot updates, two approaches:
- modify the IK analyzer source code, or
- use IK's native hot-update mechanism: deploy a web server that exposes an HTTP endpoint serving the dictionary, and signal word changes through the Last-Modified and ETag response headers
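IK polls the configured remote-dictionary URL and re-downloads the word list only when Last-Modified or ETag differs from the last value it saw. A minimal sketch of the header side (a hypothetical helper, not part of IK; the served file is assumed to be UTF-8 with one word per line):

```python
import hashlib

def dict_headers(dict_bytes, last_modified):
    # IK re-fetches the remote dictionary only when one of these two
    # response headers changes between polls.
    return {
        "Last-Modified": last_modified,
        "ETag": hashlib.md5(dict_bytes).hexdigest(),
    }

h1 = dict_headers("渣女\n996\n".encode("utf-8"), "Mon, 01 Jan 2024 00:00:00 GMT")
h2 = dict_headers("渣女\n996\n857\n".encode("utf-8"), "Mon, 08 Jan 2024 00:00:00 GMT")
print(h1["ETag"] != h2["ETag"])  # True: new word added -> new ETag -> IK reloads
```

Deriving the ETag from a hash of the file contents guarantees that any edit to the dictionary changes the header and triggers a reload.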
Author: http://cnblogs.com/lyc-code/
Copyright in this article is shared by the author and cnblogs.com. Reposting is welcome, but without the author's consent this notice must be retained and a clearly visible link to the original must appear on the article page; otherwise the right to pursue legal liability is reserved.