Elasticsearch built-in token filters
Preface
Elasticsearch ships with a huge number of built-in filters (a truly absurd number! Luckily, most of them are rarely needed in practice). They fall into two categories: token filters and character filters.
Common token filters
Here we list just a few of the common token filters:
- Standard Token Filter (deprecated in 6.5.0): this filter is deprecated and will be removed in the next major release. In recent versions it did essentially nothing; in much older versions of Lucene it stripped the trailing s from words along with unnecessary period characters, but even those small jobs have since been taken over by other tokenizers and token filters.
- ASCII Folding Token Filter: converts alphabetic, numeric, and symbolic Unicode characters that are not among the first 127 ASCII characters (the Basic Latin Unicode block) into their ASCII equivalents, where such equivalents exist (see the sketch after this list).
- Flatten Graph Token Filter: accepts an arbitrary graph token stream, such as the one produced by the Synonym Graph Token Filter, and flattens it into a single linear token chain suitable for indexing. This is a lossy process, since separate side paths are squashed on top of one another, but it is necessary if you use a graph token stream during indexing, because a Lucene index cannot currently represent a graph. For this reason it is best to apply graph analyzers only at search time, which preserves the full graph structure and yields correct matches for proximity queries. This feature is experimental in Lucene.
- Length Token Filter: removes tokens that are too long or too short from the token stream. It is configurable; we set its bounds in the index settings.
- Lowercase Token Filter: normalizes tokens to lowercase. Via the language parameter it also supports Greek, Irish, and Turkish lowercasing.
- Uppercase Token Filter: normalizes tokens to uppercase.
The remaining token filters are not covered one by one here; see the official documentation for details.
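As a quick illustration of the ASCII Folding Token Filter mentioned above, the following _analyze request (a minimal sketch; the text is an arbitrary example) folds accented characters down to plain ASCII:
POST _analyze
{
  "tokenizer": "standard",
  "filter": ["asciifolding"],
  "text": "açaí à la carte"
}
The tokens come back as acai, a, la, and carte, with all accents removed.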
自定义分词过滤器
Next, let's briefly walk through defining two custom token filters, starting with the length token filter.
PUT pattern_test3
{
"settings": {
"analysis": {
"filter": {
"my_test_length":{
"type":"length",
"max":8,
"min":2
}
}
}
}
}
In the example above, we define a custom length filter that drops tokens longer than 8 characters or shorter than 2. To elaborate: the max parameter sets the maximum token length and defaults to Integer.MAX_VALUE, i.e. 2147483647 ($2^{31}-1$), while min sets the minimum length and defaults to 0.
POST pattern_test3/_analyze
{
"tokenizer": "standard",
"filter": ["my_test_length"],
"text":"a Small word and a longerword"
}
The result is as follows; note that both occurrences of a (length 1) and longerword (length 10) have been filtered out:
{
"tokens" : [
{
"token" : "Small",
"start_offset" : 2,
"end_offset" : 7,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "word",
"start_offset" : 8,
"end_offset" : 12,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "and",
"start_offset" : 13,
"end_offset" : 16,
"type" : "<ALPHANUM>",
"position" : 3
}
]
}
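Incidentally, the same test can be run without creating an index at all, since the _analyze API also accepts inline filter definitions (a minimal sketch; it produces the same three tokens as above):
POST _analyze
{
  "tokenizer": "standard",
  "filter": [{"type": "length", "min": 2, "max": 8}],
  "text": "a Small word and a longerword"
}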
Custom lowercase token filter
Now let's define a custom lowercase token filter for handling Greek text:
PUT lowercase_example
{
"settings": {
"analysis": {
"analyzer": {
"standard_lowercase_example": {
"type": "custom",
"tokenizer": "standard",
"filter": ["lowercase"]
},
"greek_lowercase_example": {
"type": "custom",
"tokenizer": "standard",
"filter": ["greek_lowercase"]
}
},
"filter": {
"greek_lowercase": {
"type": "lowercase",
"language": "greek"
}
}
}
}
}
Then run some Greek text through it:
POST lowercase_example/_analyze
{
"tokenizer": "standard",
"filter": ["greek_lowercase"],
"text":"Ένα φίλτρο διακριτικού τύπου πεζά s ομαλοποιεί το κείμενο διακριτικού σε χαμηλότερη θήκη"
}
The result is as follows:
{
"tokens" : [
{
"token" : "ενα",
"start_offset" : 0,
"end_offset" : 3,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "φιλτρο",
"start_offset" : 4,
"end_offset" : 10,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "διακριτικου",
"start_offset" : 11,
"end_offset" : 22,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "τυπου",
"start_offset" : 23,
"end_offset" : 28,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "πεζα",
"start_offset" : 29,
"end_offset" : 33,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "s",
"start_offset" : 34,
"end_offset" : 35,
"type" : "<ALPHANUM>",
"position" : 5
},
{
"token" : "ομαλοποιει",
"start_offset" : 36,
"end_offset" : 46,
"type" : "<ALPHANUM>",
"position" : 6
},
{
"token" : "το",
"start_offset" : 47,
"end_offset" : 49,
"type" : "<ALPHANUM>",
"position" : 7
},
{
"token" : "κειμενο",
"start_offset" : 50,
"end_offset" : 57,
"type" : "<ALPHANUM>",
"position" : 8
},
{
"token" : "διακριτικου",
"start_offset" : 58,
"end_offset" : 69,
"type" : "<ALPHANUM>",
"position" : 9
},
{
"token" : "σε",
"start_offset" : 70,
"end_offset" : 72,
"type" : "<ALPHANUM>",
"position" : 10
},
{
"token" : "χαμηλοτερη",
"start_offset" : 73,
"end_offset" : 83,
"type" : "<ALPHANUM>",
"position" : 11
},
{
"token" : "θηκη",
"start_offset" : 84,
"end_offset" : 88,
"type" : "<ALPHANUM>",
"position" : 12
}
]
}
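Since the settings above also define the greek_lowercase_example analyzer, which wires this same filter behind the standard tokenizer, the request can equivalently reference the analyzer by name (a sketch; the output is identical to the result above):
POST lowercase_example/_analyze
{
  "analyzer": "greek_lowercase_example",
  "text": "Ένα φίλτρο διακριτικού τύπου πεζά s ομαλοποιεί το κείμενο διακριτικού σε χαμηλότερη θήκη"
}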
Multiple token filters
Beyond that, we can chain several token filters together. For example, we can apply the length filter and the lowercase filter (or more) at the same time.
POST _analyze
{
"tokenizer": "standard",
"filter": ["length","lowercase"],
"text":"a Small word and a longerword"
}
In the example above we simply list the filters, and they are applied in order. Note that the built-in length filter is used here with its defaults (min 0, max Integer.MAX_VALUE), so it removes nothing; that is why a and longerword survive in the result below.
The result is as follows:
{
"tokens" : [
{
"token" : "a",
"start_offset" : 0,
"end_offset" : 1,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "small",
"start_offset" : 2,
"end_offset" : 7,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "word",
"start_offset" : 8,
"end_offset" : 12,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "and",
"start_offset" : 13,
"end_offset" : 16,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "a",
"start_offset" : 17,
"end_offset" : 18,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "longerword",
"start_offset" : 19,
"end_offset" : 29,
"type" : "<ALPHANUM>",
"position" : 5
}
]
}
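The same chaining works with custom filters. For instance, against the pattern_test3 index created earlier, we could combine my_test_length with lowercase (a sketch; the expected tokens are small, word, and and, since the length filter runs first and drops the one-character and ten-character tokens before lowercasing):
POST pattern_test3/_analyze
{
  "tokenizer": "standard",
  "filter": ["my_test_length", "lowercase"],
  "text": "a Small word and a longerword"
}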
See also: Standard Token Filter | ASCII Folding Token Filter | Flatten Graph Token Filter | Length Token Filter | Lowercase Token Filter | Uppercase Token Filter | NGram Token Filter | Edge NGram Token Filter | Porter Stem Token Filter | Shingle Token Filter | Stop Token Filter | Word Delimiter Token Filter | Word Delimiter Graph Token Filter | Multiplexer Token Filter | Conditional Token Filter | Predicate Token Filter Script | Stemmer Token Filter | Stemmer Override Token Filter | Keyword Marker Token Filter | Keyword Repeat Token Filter | KStem Token Filter | Snowball Token Filter | Phonetic Token Filter | Synonym Token Filter | Parsing synonym files | Synonym Graph Token Filter | Parsing synonym files | Compound Word Token Filters | Reverse Token Filter | Elision Token Filter | Truncate Token Filter | Unique Token Filter | Pattern Capture Token Filter | Pattern Replace Token Filter | Trim Token Filter | Limit Token Count Token Filter | Hunspell Token Filter | Common Grams Token Filter | Normalization Token Filter | CJK Width Token Filter | CJK Bigram Token Filter | Delimited Payload Token Filter | Keep Words Token Filter | Keep Types Token Filter | Exclude mode settings example | Classic Token Filter | Apostrophe Token Filter | Decimal Digit Token Filter | Fingerprint Token Filter | Minhash Token Filter | Remove Duplicates Token Filter | Character Filters | HTML Strip Char Filter | Mapping Char Filter | Pattern Replace Char Filter