Elasticsearch Fundamentals: Custom Analyzers
I. Analyzer
An analyzer is made up of three kinds of components, applied in order:
1. Character filter
e.g. stripping HTML markup, or converting "&" to "and"
2. Tokenizer
e.g. splitting on whitespace
3. Token filter
e.g. lowercasing, removing stop words, adding synonyms
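The three stages can be combined in a single `_analyze` request to see the whole pipeline at work. A minimal sketch using only built-in components (the sample text is my own):

```
GET _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase", "stop"],
  "text": "<p>The QUICK &amp; brown fox</p>"
}
```

The `html_strip` char filter removes the tags and decodes `&amp;`, the standard tokenizer splits and drops the `&`, and the `lowercase` and `stop` token filters should leave just the tokens `quick`, `brown`, `fox`.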
II. Built-in analyzers
Standard analyzer
Splits on word boundaries, removes most punctuation, and lowercases.
GET _analyze
{
  "analyzer": "standard",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
Simple analyzer
Splits on any non-letter character, discards the non-letters (so the digit 2 is dropped), and lowercases.
GET _analyze
{
  "analyzer": "simple",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
Stop analyzer
Like the simple analyzer (splits on non-letters, discards them, lowercases), plus stop-word removal.
GET _analyze
{
  "analyzer": "stop",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
Whitespace analyzer
Splits on whitespace only; does not lowercase.
GET _analyze
{
  "analyzer": "whitespace",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
Pattern analyzer
Splits on a regular expression, by default \W+ (runs of non-word characters), and lowercases.
GET _analyze
{
  "analyzer": "pattern",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
Keyword analyzer
No tokenization; the entire input is emitted as a single token.
GET _analyze
{
  "analyzer": "keyword",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
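A built-in analyzer is applied per field via the `analyzer` mapping parameter. A minimal sketch (`my_index` and `title` are placeholder names):

```
PUT my_index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "whitespace"
      }
    }
  }
}
```

With this mapping, text indexed into `title` is split on whitespace and keeps its original case, so a query for `quick` would not match a document containing `Quick`.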
III. Custom analyzers
Character filters
# html_strip
POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": ["html_strip"],
  "text": "<b>hello world</b>"
}
# mapping replacement
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": ["-=>_"]
    }
  ],
  "text": "123-456, i-test"
}
# regex replacement
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "http://(.*)",
      "replacement": "$1"
    }
  ],
  "text": "http://www.elastic.co"
}
Tokenizers
# path hierarchy tokenization
POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/usr/abc/efg"
}
# whitespace tokenization with a stop filter ("The" survives: the default stop list is lowercase and no lowercase filter runs first)
POST _analyze
{
  "tokenizer": "whitespace",
  "filter": ["stop"],
  "text": "The rain in Spain falls mainly on the plain."
}
Token filters
# lowercase, then stop-word removal
POST _analyze
{
  "tokenizer": "whitespace",
  "filter": ["lowercase", "stop"],
  "text": "The rain in Spain falls mainly on the plain."
}
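The `_analyze` calls above only test components ad hoc; to make them reusable, register them under `settings.analysis` as a custom analyzer. A sketch under assumed placeholder names (`my_index`, `my_analyzer`, `dash_to_underscore`), combining the pieces shown in this section:

```
PUT my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "dash_to_underscore": {
          "type": "mapping",
          "mappings": ["-=>_"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip", "dash_to_underscore"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  }
}

GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<b>123-456</b>"
}
```

Mapping `-` to `_` keeps `123-456` together: the standard tokenizer would split on the hyphen but treats the underscore as part of one token, so the request should return the single token `123_456`.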
Further reading
https://www.elastic.co/guide/en/elasticsearch/reference/current/analyzer-anatomy.html#_character_filters