Notes on Analyzers

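An analyzer can be applied to documents in two ways:

1. Specify the analyzer when the index is created.

2. Set a global default analyzer in the ES configuration file.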

I. Specifying an analyzer in the index settings

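# JSON does not allow inline comments, so the notes are collected here:
#   "analysis"           analyzers and their parts are defined inside the analysis object
#   "myCustomAnalyzer"   name of the custom analyzer
#   "type":"custom"      a custom analyzer assembled from the parts below
#   "tokenizer"          the tokenizer
#   "filter"             token filters
#   "char_filter"        character filters
% curl -XPOST 'localhost:9200/myindex' -d '{
  "settings":{
    "number_of_shards":2,
    "number_of_replicas":1,
    "index":{
      "analysis":{
        "analyzer":{
          "myCustomAnalyzer":{
            "type":"custom",
            "tokenizer":"myCustomTokenizer",
            "filter":["myCustomFilter1","myCustomFilter2"],
            "char_filter":["myCustomCharFilter"]
          }
        },
        "tokenizer":{
          "myCustomTokenizer":{
            "type":"letter"
          }
        },
        "filter":{
          "myCustomFilter1":{
            "type":"lowercase"
          },
          "myCustomFilter2":{
            "type":"kstem"
          }
        },
        "char_filter":{
          "myCustomCharFilter":{
            "type":"mapping",
            "mappings":["ph=>f","u=>you"]
          }
        }
      }
    }
  },
  "mappings":{
    ... index mappings
  }
}'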
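To make the order of the analysis chain easier to see, here is a rough Python sketch that imitates the settings above: the mapping character filter runs first, then the letter tokenizer, then the lowercase token filter. This is only a local approximation and not Elasticsearch code. The function names are made up for illustration, and the kstem stemming step is left out.

```python
import re

# Same rules as the "mappings" list of myCustomCharFilter.
CHAR_MAPPINGS = {"ph": "f", "u": "you"}


def char_filter(text):
    """Replace every mapped key in a single left-to-right pass, longest key first."""
    keys = sorted(CHAR_MAPPINGS, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: CHAR_MAPPINGS[m.group(0)], text)


def letter_tokenizer(text):
    """Split on anything that is not a letter (digits and punctuation are dropped)."""
    return re.findall(r"[^\W\d_]+", text)


def lowercase_filter(tokens):
    """Lowercase every token."""
    return [t.lower() for t in tokens]


def analyze(text):
    # Order matters: character filters, then the tokenizer, then token filters.
    # The kstem step of the real analyzer is not reproduced in this sketch.
    return lowercase_filter(letter_tokenizer(char_filter(text)))


if __name__ == "__main__":
    print(analyze("Send a photo 2 u"))
```

The character filter works on the raw text before any tokens exist, which is why "photo" is already "foto" by the time the tokenizer sees it.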

II. Defining the analyzer in the ES configuration file

index:
  analysis:
    analyzer:
      myCustomAnalyzer:
        type: custom
        tokenizer: myCustomTokenizer
        filter: [myCustomFilter1, myCustomFilter2]
        char_filter: [myCustomCharFilter]
    tokenizer:
      myCustomTokenizer:
        type: letter
    filter:
      myCustomFilter1:
        type: lowercase
      myCustomFilter2:
        type: kstem
    char_filter:
      myCustomCharFilter:
        type: mapping
        mappings: ["ph=>f","u=>you"]

III. Specifying an analyzer for a field

# Alternative: set "index":"not_analyzed" on the field to leave the name field unanalyzed.
% curl -XPOST 'localhost:9200/myindex' -d '{
  "settings":{
    "number_of_shards":2,
    "number_of_replicas":1,
    "index":{
      ... analysis settings
    }
  },
  "mappings":{
    "document":{
      "properties":{
        "name":{
          "type":"string",
          "analyzer":"myCustomAnalyzer"
        }
      }
    }
  }
}'

IV. Analyzing text with the Analyze API

% curl -XPOST 'localhost:9200/_analyze?analyzer=standard' -d 'share your experience with NoSql & big data'

% curl -XPOST 'localhost:9200/_analyze?analyzer=myCustomAnalyzer' -d 'share your experience with NoSql & big data'  # uses the custom analyzer defined in the ES configuration file

% curl -XPOST 'localhost:9200/_analyze?tokenizer=whitespace&filters=lowercase,reverse' -d 'share your experience with NoSql & big data' # split on whitespace, then lowercase and reverse each token
erahs,ruoy,ecneirepxe,htiw,lqson,&,gib,atad
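The tokenizer=whitespace, filters=lowercase,reverse chain is simple enough to imitate locally. The Python sketch below is only an approximation for checking what such a chain should produce. It does not call Elasticsearch, and the function names are invented for this note.

```python
def whitespace_tokenizer(text):
    """Split the text on runs of whitespace, keeping punctuation such as '&'."""
    return text.split()


def analyze(text):
    """Apply the chain: whitespace tokenizer, lowercase filter, reverse filter."""
    tokens = whitespace_tokenizer(text)
    tokens = [t.lower() for t in tokens]   # lowercase token filter
    tokens = [t[::-1] for t in tokens]     # reverse token filter
    return tokens


if __name__ == "__main__":
    print(",".join(analyze("share your experience with NoSql & big data")))
```

Because the whitespace tokenizer only splits on whitespace, the "&" survives as a token of its own, unlike with the letter tokenizer.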

V. Term Vectors API

% curl 'localhost:9200/get-together/group/1/_termvector?pretty=true'
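As a rough idea of the kind of per-term information a term vector carries, the Python sketch below counts how often each term occurs in a text and records where each occurrence sits. The key names (term_freq, tokens, position, start_offset, end_offset) are modeled on the API's response, but the whitespace-plus-lowercase analysis and the function itself are assumptions made for illustration, not a reproduction of what Elasticsearch returns.

```python
import re
from collections import defaultdict


def term_vector(text):
    """Map each lowercased term to its frequency and its token positions/offsets."""
    vector = defaultdict(lambda: {"term_freq": 0, "tokens": []})
    # Each run of non-whitespace characters is treated as one token.
    for position, match in enumerate(re.finditer(r"\S+", text)):
        entry = vector[match.group(0).lower()]
        entry["term_freq"] += 1
        entry["tokens"].append({
            "position": position,            # index of the token in the stream
            "start_offset": match.start(),   # character offset where it starts
            "end_offset": match.end(),       # character offset just past its end
        })
    return dict(vector)


if __name__ == "__main__":
    for term, info in term_vector("big data big ideas").items():
        print(term, info["term_freq"], info["tokens"])
```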

VI. Analyzers and tokenizers

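  1. Standard analyzer

  2. Simple analyzer

  3. Whitespace analyzer

  4. Stop analyzer

  5. Keyword analyzer

  6. Pattern analyzer

  7. Language and multilingual analyzers

  8. Snowball analyzer

  9. UAX URL email tokenizer

  10. Path hierarchy tokenizer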
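To get a feel for how a few of these differ, here is a small Python sketch that imitates the whitespace, simple, and keyword analyzers and the path hierarchy tokenizer. These are simplified stand-ins written for this note, with made-up function names. The real implementations handle many more cases.

```python
import re


def whitespace_analyzer(text):
    """Split on whitespace only; case and punctuation are left untouched."""
    return text.split()


def simple_analyzer(text):
    """Split on non-letters and lowercase the resulting tokens."""
    return [t.lower() for t in re.findall(r"[^\W\d_]+", text)]


def keyword_analyzer(text):
    """Treat the whole input as one single token."""
    return [text]


def path_hierarchy_tokenizer(path, delimiter="/"):
    """Emit every ancestor prefix of a path, shortest first."""
    parts = path.split(delimiter)
    tokens = []
    for i in range(1, len(parts) + 1):
        token = delimiter.join(parts[:i])
        if token:  # skip the empty prefix in front of a leading delimiter
            tokens.append(token)
    return tokens


if __name__ == "__main__":
    sample = "NoSql & Big-Data"
    print(whitespace_analyzer(sample))
    print(simple_analyzer(sample))
    print(keyword_analyzer(sample))
    print(path_hierarchy_tokenizer("/usr/local/var"))
```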

VII. Token filters

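  1. Standard token filter

  2. Lowercase token filter

  3. Length token filter: removes tokens that are shorter than the minimum or longer than the maximum length

  4. Stop token filter

  5. Truncate token filter, trim token filter, and limit token count filter

  6. Reverse token filter

  7. Unique token filter: keeps only unique tokens

  8. ASCII folding token filter: converts Unicode characters that are not plain ASCII into their ASCII equivalents

  9. Synonym token filter

  10. N-grams, edge N-grams, and shingles

  11. Stemming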
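Several of these filters are easy to imitate in a few lines of Python, which helps when guessing what a filter chain will emit. The sketch below is an approximation with made-up function names. In particular, the ASCII folding here only strips accents via Unicode decomposition, and the shingle function returns only the multi-word shingles without the single tokens.

```python
import unicodedata


def length_filter(tokens, min_len, max_len):
    """Keep only tokens whose length is within [min_len, max_len]."""
    return [t for t in tokens if min_len <= len(t) <= max_len]


def unique_filter(tokens):
    """Drop repeated tokens, keeping the first occurrence of each."""
    return list(dict.fromkeys(tokens))


def ascii_folding(token):
    """Strip accents by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def ngrams(token, n):
    """All substrings of length n, sliding one character at a time."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def edge_ngrams(token, min_gram, max_gram):
    """Prefixes of the token, from min_gram up to max_gram characters."""
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]


def shingles(tokens, size=2):
    """Word-level n-grams: each run of `size` adjacent tokens joined by a space."""
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


if __name__ == "__main__":
    print(length_filter(["a", "big", "elasticsearch"], 2, 5))
    print(unique_filter(["big", "data", "big"]))
    print(ascii_folding("café"))
    print(ngrams("data", 2), edge_ngrams("data", 1, 3))
    print(shingles(["big", "data", "rocks"]))
```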

Posted 2024-09-10 21:27 by Wind_LPH
