Elasticsearch Study Notes: Analyzers (analyzer)
analyzer
An analyzer is made up of three parts:
Character Filters, Tokenizers, Token Filters
Character Filters handle character-level preprocessing. The official example is converting Hindu-Arabic numerals (٠١٢٣٤٥٦٧٨٩) into their Arabic-Latin equivalents (0123456789), or stripping HTML markup such as <b> from the text.
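For example, the built-in html_strip character filter can be tested directly through the _analyze API; a minimal sketch:

GET _analyze
{
  "tokenizer": "keyword",
  "char_filter": ["html_strip"],
  "text": "<b>I'm so happy!</b>"
}

The <b> tags are removed before tokenization, so the keyword tokenizer returns the single token "I'm so happy!".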
Tokenizers split the text into tokens; commonly used tokenizers include whitespace and standard.
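The difference between the two is easy to see by running the same text through each tokenizer; a minimal sketch:

GET _analyze
{
  "tokenizer": "whitespace",
  "text": "Hello, Elasticsearch-7!"
}

GET _analyze
{
  "tokenizer": "standard",
  "text": "Hello, Elasticsearch-7!"
}

The whitespace tokenizer keeps "Hello," and "Elasticsearch-7!" as-is, while the standard tokenizer drops the punctuation and splits on the hyphen, producing "Hello", "Elasticsearch" and "7".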
1. standard analyzer (StandardAnalyzer): for English its behavior is comparable to StopAnalyzer; Chinese text is handled by splitting it into single characters. It lowercases tokens and strips punctuation; stop-word removal is supported, though in Elasticsearch the stop-word list is empty by default.
2. simple analyzer (SimpleAnalyzer): more capable than WhitespaceAnalyzer; it splits the text on non-letter characters and then lowercases the tokens. Numeric characters are dropped by this analyzer.
3. whitespace analyzer (WhitespaceAnalyzer): only splits on whitespace; it does not lowercase tokens, does not handle Chinese, and applies no further normalization to the tokens it produces.
4. stop analyzer (StopAnalyzer): goes beyond SimpleAnalyzer by additionally removing common English words (the, a, etc.); the stop-word list can be customized to your needs. It does not handle Chinese.
5. keyword analyzer (KeywordAnalyzer): treats the entire input as a single token, which makes it easy to index and search special kinds of text. It is very convenient for building index terms from postal codes, addresses and similar fields.
6. pattern analyzer: splits the text into "terms" (what comes out after the token filters) using a regular expression.
7. snowball analyzer: built from the standard tokenizer plus four filters: standard, lowercase, stop and snowball. The snowball analyzer is generally not recommended in Lucene.
8. custom analyzer: a user-defined analyzer. It combines exactly one tokenizer with zero or more token filters and zero or more character filters. The name of a custom analyzer must not start with "_".
9. ik-analyzer: dictionary-based Chinese segmentation. ik_max_word splits the text at the finest granularity, producing as many terms as possible; ik_smart splits at the coarsest granularity, and text already consumed by one term is not reused by another (see the comparison right after this list).
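The two IK modes can be compared through the _analyze API. A minimal sketch, assuming the analysis-ik plugin is installed (ik_smart and ik_max_word are provided by that plugin, not by Elasticsearch itself):

# assumes the analysis-ik plugin is installed on the cluster
GET _analyze
{
  "analyzer": "ik_smart",
  "text": "中华人民共和国国歌"
}

GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "中华人民共和国国歌"
}

ik_smart typically returns just 中华人民共和国 and 国歌, while ik_max_word also emits the shorter overlapping terms such as 中华人民, 中华, 华人, 人民共和国, 人民, 共和国 and 国歌.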
Defining a default analyzer for an index:
PUT /my_index10
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "ik_smart",
          "filter": [ "synonym" ]
        }
      },
      "filter": {
        "synonym": {
          "type": "synonym",
          "synonyms_path": "analysis/synonym.txt"
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "title": {
          "type": "text"
        }
      }
    }
  }
}
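Once the index exists, the default analyzer can be checked with the index-scoped _analyze API. A minimal sketch, assuming the analysis-ik plugin is installed and analysis/synonym.txt is present on every node (otherwise the index creation above fails):

# no analyzer or field is specified, so the index's default analyzer is used
GET /my_index10/_analyze
{
  "text": "some text to check the default analyzer"
}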
And an index whose text fields explicitly use the standard analyzer:
PUT person_index
{
  "mappings": {
    "person": {
      "properties": {
        "id": {
          "type": "integer"
        },
        "name": {
          "type": "text",
          "analyzer": "standard"
        },
        "address": {
          "type": "text",
          "analyzer": "standard"
        }
      }
    }
  }
}
Token filters
- Standard Token Filter: currently does nothing
- ASCII Folding Token Filter: a token filter of type asciifolding; it converts alphabetic, numeric and symbolic Unicode characters that are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if such equivalents exist.
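For example, a minimal sketch:

GET _analyze
{
  "tokenizer": "standard",
  "filter": ["asciifolding"],
  "text": "açaí à la carte"
}

The accented characters are folded to plain ASCII, so the tokens come back as acai, a, la and carte.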
- Length Token Filter
length is used to drop tokens that are too long or too short;
min defines the minimum length
max defines the maximum length
Usage:
GET _analyze
{
  "tokenizer": "standard",
  "filter": [{ "type": "length", "min": 1, "max": 3 }],
  "text": "this is a test"
}
Result:
"tokens": [ { "token": "is", "start_offset": 5, "end_offset": 7, "type": "<ALPHANUM>", "position": 1 }, { "token": "a", "start_offset": 8, "end_offset": 9, "type": "<ALPHANUM>", "position": 2 } ]
- Lowercase Token Filter: normalizes token text to lowercase
- Uppercase Token Filter: normalizes token text to uppercase
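Both can be tried directly with _analyze; a minimal sketch for the lowercase filter (swap in "uppercase" for the opposite effect):

GET _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "The QUICK Brown FOX"
}

The tokens are returned as the, quick, brown and fox.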
- Stop Token Filter: removes the specified stop words. Input:
{ "tokenizer" : "standard", "filter": [{"type": "stop", "stopwords": ["this", "a"]}], "text" : ["this is a test"] }
Output:
# the stop words "this" and "a" are filtered out
"tokens": [
  { "token": "is",   "start_offset": 5,  "end_offset": 7,  "type": "<ALPHANUM>", "position": 1 },
  { "token": "test", "start_offset": 10, "end_offset": 14, "type": "<ALPHANUM>", "position": 3 }
]
- Stemmer Token Filter: provides access to almost all of the available stemmers through a single unified interface. Usage:
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": ["standard", "lowercase", "my_stemmer"]
        }
      },
      "filter": {
        "my_stemmer": {
          "type": "stemmer",
          "name": "light_german"
        }
      }
    }
  }
}
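The custom analyzer can then be exercised against the index with the index-scoped _analyze API; a minimal sketch:

GET /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Häuser und Autos"
}

The text is run through the standard tokenizer, the lowercase filter and the German light stemmer in that order, and the response shows the stemmed tokens.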
- Synonym Token Filter: handles synonyms
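Synonyms are normally configured as part of the index settings. A minimal sketch, using a hypothetical index name synonym_test and an inline synonyms list instead of a synonyms_path file:

# synonym_test, my_synonym and my_synonym_analyzer are illustrative names
PUT /synonym_test
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym": {
          "type": "synonym",
          "synonyms": ["quick, fast"]
        }
      },
      "analyzer": {
        "my_synonym_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "my_synonym"]
        }
      }
    }
  }
}

GET /synonym_test/_analyze
{
  "analyzer": "my_synonym_analyzer",
  "text": "the quick fox"
}

Because "quick, fast" are declared as equivalent synonyms, the token quick is emitted together with fast at the same position.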
- Reverse Token Filter: reverses each token, for example:
Request:
GET _analyze
{
  "tokenizer": "standard",
  "filter": ["reverse"],
  "text": ["hello world"]
}
Result:
"tokens": [ { "token": "olleh", "start_offset": 0, "end_offset": 5, "type": "<ALPHANUM>", "position": 0 }, { "token": "dlrow", "start_offset": 6, "end_offset": 11, "type": "<ALPHANUM>", "position": 1 } ] - Unique Token Filter
GET _analyze
{
  "tokenizer": "standard",
  "filter": ["unique"],
  "text": ["this is a test test test"]
}
Only one of the repeated test tokens at the end survives in the final output.
Output:
"tokens": [ { "token": "this", "start_offset": 0, "end_offset": 4, "type": "<ALPHANUM>", "position": 0 }, { "token": "is", "start_offset": 5, "end_offset": 7, "type": "<ALPHANUM>", "position": 1 }, { "token": "a", "start_offset": 8, "end_offset": 9, "type": "<ALPHANUM>", "position": 2 }, { "token": "test", "start_offset": 10, "end_offset": 14, "type": "<ALPHANUM>", "position": 3 } ] - Trim Token Filter 去除词元周围的空格
- Delimited Payload Token Filter: splits each token into the token itself and a payload; the delimiter parameter defines the separator character and defaults to '|'
GET _analyze
{
  "tokenizer": "standard",
  "filter": ["delimited_payload_filter"],
  "text": ["the|1 quick|2 fox|3"]
}
"tokens": [ { "token": "the", "start_offset": 0, "end_offset": 3, "type": "<ALPHANUM>", "position": 0 }, { "token": "1", "start_offset": 4, "end_offset": 5, "type": "<NUM>", "position": 1 }, { "token": "quick", "start_offset": 6, "end_offset": 11, "type": "<ALPHANUM>", "position": 2 }, { "token": "2", "start_offset": 12, "end_offset": 13, "type": "<NUM>", "position": 3 }, { "token": "fox", "start_offset": 14, "end_offset": 17, "type": "<ALPHANUM>", "position": 4 }, { "token": "3", "start_offset": 18, "end_offset": 19, "type": "<NUM>", "position": 5 } ]
- Keep Words Token Filter: keeps only the specified words, for example:
GET _analyze
{
  "tokenizer": "standard",
  "filter": [{ "type": "keep", "keep_words": ["this", "test"] }],
  "text": ["this is a test"]
}
# "is" and "a" are filtered out because they are not listed in keep_words
"tokens": [
  { "token": "this", "start_offset": 0,  "end_offset": 4,  "type": "<ALPHANUM>", "position": 0 },
  { "token": "test", "start_offset": 10, "end_offset": 14, "type": "<ALPHANUM>", "position": 3 }
]
- Fingerprint Token Filter: sorts all tokens in ascending order, removes duplicates, and concatenates them into a single token
GET _analyze
{
  "tokenizer": "standard",
  "filter": ["fingerprint"],
  "text": ["b a f e c f"]
}
"tokens": [ { "token": "a b c e f", "start_offset": 0, "end_offset": 11, "type": "fingerprint", "position": 0 } ]