Elasticsearch built-in analyzers
Preface
In Elasticsearch, an analyzer can consist of:
- optional character filters
- exactly one tokenizer
- zero or more token filters
A minimal custom analyzer that wires these three pieces together is sketched right after this preface. The sections below then give a brief tour of each built-in analyzer. One note before we start: if you installed the ik analysis plugin following the earlier tutorial, please move it out of the plugins directory for now, so that the demos behave as described.
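Here is that sketch: a custom analyzer combining one character filter, one tokenizer and one token filter. The index name my_custom_test and the analyzer name my_custom_analyzer are invented purely for this illustration; html_strip, standard and lowercase are standard built-in components.
PUT my_custom_test
{
"settings": {
"analysis": {
"analyzer": {
"my_custom_analyzer": {
"type": "custom",
"char_filter": ["html_strip"],
"tokenizer": "standard",
"filter": ["lowercase"]
}
}
}
}
}
You can test it with POST my_custom_test/_analyze and "analyzer": "my_custom_analyzer", just like the examples that follow.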
Standard analyzer: standard
The standard analyzer (standard analyzer) is Elasticsearch's default analyzer. It combines defaults that are reasonable for most European languages: the standard tokenizer, the standard token filter, the lowercase token filter and the stop token filter (whose stop word list is empty by default, which is why words such as to and be survive in the demo below).
POST _analyze
{
"analyzer": "standard",
"text":"To be or not to be, That is a question ———— 莎士比亚"
}
The tokenization result is as follows:
{
"tokens" : [
{
"token" : "to",
"start_offset" : 0,
"end_offset" : 2,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "be",
"start_offset" : 3,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "or",
"start_offset" : 6,
"end_offset" : 8,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "not",
"start_offset" : 9,
"end_offset" : 12,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "to",
"start_offset" : 13,
"end_offset" : 15,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "be",
"start_offset" : 16,
"end_offset" : 18,
"type" : "<ALPHANUM>",
"position" : 5
},
{
"token" : "that",
"start_offset" : 21,
"end_offset" : 25,
"type" : "<ALPHANUM>",
"position" : 6
},
{
"token" : "is",
"start_offset" : 26,
"end_offset" : 28,
"type" : "<ALPHANUM>",
"position" : 7
},
{
"token" : "a",
"start_offset" : 29,
"end_offset" : 30,
"type" : "<ALPHANUM>",
"position" : 8
},
{
"token" : "question",
"start_offset" : 31,
"end_offset" : 39,
"type" : "<ALPHANUM>",
"position" : 9
},
{
"token" : "莎",
"start_offset" : 45,
"end_offset" : 46,
"type" : "<IDEOGRAPHIC>",
"position" : 10
},
{
"token" : "士",
"start_offset" : 46,
"end_offset" : 47,
"type" : "<IDEOGRAPHIC>",
"position" : 11
},
{
"token" : "比",
"start_offset" : 47,
"end_offset" : 48,
"type" : "<IDEOGRAPHIC>",
"position" : 12
},
{
"token" : "亚",
"start_offset" : 48,
"end_offset" : 49,
"type" : "<IDEOGRAPHIC>",
"position" : 13
}
]
}
Simple analyzer: simple
The simple analyzer (simple analyzer) uses nothing but the lowercase tokenizer, which means it splits the text at every non-letter character and lowercases each token. It does not work well for Asian languages, because those languages are not delimited by whitespace, so it is mainly used for European languages.
POST _analyze
{
"analyzer": "simple",
"text":"To be or not to be, That is a question ———— 莎士比亚"
}
The tokenization result is as follows:
{
"tokens" : [
{
"token" : "to",
"start_offset" : 0,
"end_offset" : 2,
"type" : "word",
"position" : 0
},
{
"token" : "be",
"start_offset" : 3,
"end_offset" : 5,
"type" : "word",
"position" : 1
},
{
"token" : "or",
"start_offset" : 6,
"end_offset" : 8,
"type" : "word",
"position" : 2
},
{
"token" : "not",
"start_offset" : 9,
"end_offset" : 12,
"type" : "word",
"position" : 3
},
{
"token" : "to",
"start_offset" : 13,
"end_offset" : 15,
"type" : "word",
"position" : 4
},
{
"token" : "be",
"start_offset" : 16,
"end_offset" : 18,
"type" : "word",
"position" : 5
},
{
"token" : "that",
"start_offset" : 21,
"end_offset" : 25,
"type" : "word",
"position" : 6
},
{
"token" : "is",
"start_offset" : 26,
"end_offset" : 28,
"type" : "word",
"position" : 7
},
{
"token" : "a",
"start_offset" : 29,
"end_offset" : 30,
"type" : "word",
"position" : 8
},
{
"token" : "question",
"start_offset" : 31,
"end_offset" : 39,
"type" : "word",
"position" : 9
},
{
"token" : "莎士比亚",
"start_offset" : 45,
"end_offset" : 49,
"type" : "word",
"position" : 10
}
]
}
Whitespace analyzer: whitespace
The whitespace analyzer (whitespace analyzer): this one does nothing more than split the text into tokens on whitespace. Lazy indeed; notice in the output below that To keeps its capital letter and be, keeps its comma.
POST _analyze
{
"analyzer": "whitespace",
"text":"To be or not to be, That is a question ———— 莎士比亚"
}
The tokenization result is as follows:
{
"tokens" : [
{
"token" : "To",
"start_offset" : 0,
"end_offset" : 2,
"type" : "word",
"position" : 0
},
{
"token" : "be",
"start_offset" : 3,
"end_offset" : 5,
"type" : "word",
"position" : 1
},
{
"token" : "or",
"start_offset" : 6,
"end_offset" : 8,
"type" : "word",
"position" : 2
},
{
"token" : "not",
"start_offset" : 9,
"end_offset" : 12,
"type" : "word",
"position" : 3
},
{
"token" : "to",
"start_offset" : 13,
"end_offset" : 15,
"type" : "word",
"position" : 4
},
{
"token" : "be,",
"start_offset" : 16,
"end_offset" : 19,
"type" : "word",
"position" : 5
},
{
"token" : "That",
"start_offset" : 21,
"end_offset" : 25,
"type" : "word",
"position" : 6
},
{
"token" : "is",
"start_offset" : 26,
"end_offset" : 28,
"type" : "word",
"position" : 7
},
{
"token" : "a",
"start_offset" : 29,
"end_offset" : 30,
"type" : "word",
"position" : 8
},
{
"token" : "question",
"start_offset" : 31,
"end_offset" : 39,
"type" : "word",
"position" : 9
},
{
"token" : "————",
"start_offset" : 40,
"end_offset" : 44,
"type" : "word",
"position" : 10
},
{
"token" : "莎士比亚",
"start_offset" : 45,
"end_offset" : 49,
"type" : "word",
"position" : 11
}
]
}
Stop analyzer: stop
The stop analyzer (stop analyzer) behaves much like the simple analyzer, except that it additionally filters stop words out of the token stream.
POST _analyze
{
"analyzer": "stop",
"text":"To be or not to be, That is a question ———— 莎士比亚"
}
The result is accordingly short:
{
"tokens" : [
{
"token" : "question",
"start_offset" : 31,
"end_offset" : 39,
"type" : "word",
"position" : 9
},
{
"token" : "莎士比亚",
"start_offset" : 45,
"end_offset" : 49,
"type" : "word",
"position" : 10
}
]
}
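The default stop word list is the English one. If you need your own list, the stop analyzer accepts a stopwords setting when it is defined in the index settings. A minimal sketch, with the index name stop_test, the analyzer name my_stop_analyzer and the word list all invented for illustration:
PUT stop_test
{
"settings": {
"analysis": {
"analyzer": {
"my_stop_analyzer": {
"type": "stop",
"stopwords": ["to", "be", "or", "not", "that", "is", "a"]
}
}
}
}
}
Running the same sentence through POST stop_test/_analyze with this analyzer should again leave only question and 莎士比亚.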
Keyword analyzer: keyword
The keyword analyzer (keyword analyzer) treats the entire field as one single token. Unless there is a real need for it, we do not use the keyword analyzer in mappings.
POST _analyze
{
"analyzer": "keyword",
"text":"To be or not to be, That is a question ———— 莎士比亚"
}
The result is as follows:
{
"tokens" : [
{
"token" : "To be or not to be, That is a question ———— 莎士比亚",
"start_offset" : 0,
"end_offset" : 49,
"type" : "word",
"position" : 0
}
]
}
Exactly as advertised: the whole input comes back as a single token.
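Since the description above mentions mappings, here is roughly how an analyzer is attached to a field when an index is created. This is only a sketch assuming Elasticsearch 7.x-style mappings; the index name mapping_test and the field names are invented for illustration. For exact-value fields, the keyword field type shown for tag is usually a better choice than a text field with the keyword analyzer.
PUT mapping_test
{
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "standard"
},
"tag": {
"type": "keyword"
}
}
}
}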
Pattern analyzer: pattern
The pattern analyzer (pattern analyzer) lets us specify a pattern on which the text is split into tokens. Usually, though, the better approach is a custom analyzer that combines the existing pattern tokenizer with whatever token filters you need.
POST _analyze
{
"analyzer": "pattern",
"explain": false,
"text":"To be or not to be, That is a question ———— 莎士比亚"
}
The result is as follows:
{
"tokens" : [
{
"token" : "to",
"start_offset" : 0,
"end_offset" : 2,
"type" : "word",
"position" : 0
},
{
"token" : "be",
"start_offset" : 3,
"end_offset" : 5,
"type" : "word",
"position" : 1
},
{
"token" : "or",
"start_offset" : 6,
"end_offset" : 8,
"type" : "word",
"position" : 2
},
{
"token" : "not",
"start_offset" : 9,
"end_offset" : 12,
"type" : "word",
"position" : 3
},
{
"token" : "to",
"start_offset" : 13,
"end_offset" : 15,
"type" : "word",
"position" : 4
},
{
"token" : "be",
"start_offset" : 16,
"end_offset" : 18,
"type" : "word",
"position" : 5
},
{
"token" : "that",
"start_offset" : 21,
"end_offset" : 25,
"type" : "word",
"position" : 6
},
{
"token" : "is",
"start_offset" : 26,
"end_offset" : 28,
"type" : "word",
"position" : 7
},
{
"token" : "a",
"start_offset" : 29,
"end_offset" : 30,
"type" : "word",
"position" : 8
},
{
"token" : "question",
"start_offset" : 31,
"end_offset" : 39,
"type" : "word",
"position" : 9
}
]
}
Let's build a custom pattern analyzer of our own, for example with a regular expression suited to splitting e-mail addresses.
PUT pattern_test
{
"settings": {
"analysis": {
"analyzer": {
"my_email_analyzer":{
"type":"pattern",
"pattern":"\\W|_",
"lowercase":true
}
}
}
}
}
In the example above we configure a custom analyzer while creating the index. Note that inside a JSON string the backslashes of the regular expression have to be escaped.
Now let's analyze some text with the custom analyzer.
POST pattern_test/_analyze
{
"analyzer": "my_email_analyzer",
"text": "John_Smith@foo-bar.com"
}
The result is as follows:
{
"tokens" : [
{
"token" : "john",
"start_offset" : 0,
"end_offset" : 4,
"type" : "word",
"position" : 0
},
{
"token" : "smith",
"start_offset" : 5,
"end_offset" : 10,
"type" : "word",
"position" : 1
},
{
"token" : "foo",
"start_offset" : 11,
"end_offset" : 14,
"type" : "word",
"position" : 2
},
{
"token" : "bar",
"start_offset" : 15,
"end_offset" : 18,
"type" : "word",
"position" : 3
},
{
"token" : "com",
"start_offset" : 19,
"end_offset" : 22,
"type" : "word",
"position" : 4
}
]
}
Language and multilingual analyzers: chinese
Elasticsearch ships with a good, simple, out-of-the-box collection of language analyzers for many widely spoken languages: Arabic, Armenian, Basque, Brazilian Portuguese, Bulgarian, Catalan, Chinese, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Korean, Kurdish, Norwegian, Persian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish, and Thai.
You can use any of these names to pick the corresponding language analyzer, but the name has to be lowercase! If the language you want to analyze is not in the list above, you may also need an additional plugin.
POST _analyze
{
"analyzer": "chinese",
"text":"To be or not to be, That is a question ———— 莎士比亚"
}
The result is as follows:
{
"tokens" : [
{
"token" : "question",
"start_offset" : 31,
"end_offset" : 39,
"type" : "<ALPHANUM>",
"position" : 9
},
{
"token" : "莎",
"start_offset" : 45,
"end_offset" : 46,
"type" : "<IDEOGRAPHIC>",
"position" : 10
},
{
"token" : "士",
"start_offset" : 46,
"end_offset" : 47,
"type" : "<IDEOGRAPHIC>",
"position" : 11
},
{
"token" : "比",
"start_offset" : 47,
"end_offset" : 48,
"type" : "<IDEOGRAPHIC>",
"position" : 12
},
{
"token" : "亚",
"start_offset" : 48,
"end_offset" : 49,
"type" : "<IDEOGRAPHIC>",
"position" : 13
}
]
}
Other languages work in the same way:
POST _analyze
{
"analyzer": "french",
"text":"Je suis ton père"
}
POST _analyze
{
"analyzer": "german",
"text":"Ich bin dein vater"
}
Snowball analyzer: snowball
The snowball analyzer (snowball analyzer) uses the standard tokenizer and token filter (just like the standard analyzer) together with the lowercase token filter and the stop filter; on top of that it stems the text with the Snowball stemmer.
POST _analyze
{
"analyzer": "snowball",
"text":"To be or not to be, That is a question ———— 莎士比亚"
}
The result is as follows:
{
"tokens" : [
{
"token" : "question",
"start_offset" : 31,
"end_offset" : 39,
"type" : "<ALPHANUM>",
"position" : 9
},
{
"token" : "莎",
"start_offset" : 45,
"end_offset" : 46,
"type" : "<IDEOGRAPHIC>",
"position" : 10
},
{
"token" : "士",
"start_offset" : 46,
"end_offset" : 47,
"type" : "<IDEOGRAPHIC>",
"position" : 11
},
{
"token" : "比",
"start_offset" : 47,
"end_offset" : 48,
"type" : "<IDEOGRAPHIC>",
"position" : 12
},
{
"token" : "亚",
"start_offset" : 48,
"end_offset" : 49,
"type" : "<IDEOGRAPHIC>",
"position" : 13
}
]
}
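The output above does not really show the stemmer at work, because none of the surviving tokens is an inflected form. To see the stemming, analyze a sentence that contains plurals or verb endings; with the English Snowball stemmer, tokens such as questions and asking are typically reduced to question and ask. The exact output may differ between Elasticsearch versions, so treat this request only as a sketch:
POST _analyze
{
"analyzer": "snowball",
"text": "She keeps asking tricky questions"
}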
See also: elasticsearch analyzer
Corrections are welcome. That's all.