Elasticsearch: Built-in Character Filters
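Introduction
Character filters are defined under the char_filter setting, and they process the character stream before it reaches the tokenizer. There are not many of them. Elasticsearch provides only three built-in character filters:
- HTML Strip Char Filter
- Mapping Char Filter
- Pattern Replace Char Filter
Let's look at how each one works.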
HTML Strip Char Filter
The HTML Strip Char Filter removes HTML elements from the text.
POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": ["html_strip"],
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}
The result:
{
  "tokens" : [
    {
      "token" : """
I'm so happy!
""",
      "start_offset" : 0,
      "end_offset" : 32,
      "type" : "word",
      "position" : 0
    }
  ]
}
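To make the effect easier to see outside a cluster, here is a rough local imitation in Python. It is only a sketch under stated assumptions: the helper strip_html and the short list of block-level tags are made up for illustration and are not how Elasticsearch implements html_strip. It drops inline tags, turns a few block-level tags into newlines (which is what the output above suggests), and decodes HTML entities.

```python
import html
import re

# Assumption: only a handful of block-level tags are handled here;
# the real filter recognises far more of HTML than this sketch does.
BLOCK_TAGS = re.compile(r"</?(?:p|div|br|li|h[1-6])\b[^>]*>", re.IGNORECASE)
ANY_TAG = re.compile(r"<[^>]+>")


def strip_html(text: str) -> str:
    """Rough imitation of html_strip, not Elasticsearch's implementation."""
    text = BLOCK_TAGS.sub("\n", text)  # block-level tags become newlines
    text = ANY_TAG.sub("", text)       # remaining (inline) tags are dropped
    return html.unescape(text)         # entities such as &apos; are decoded


stripped = strip_html("<p>I&apos;m so <b>happy</b>!</p>")
print(repr(stripped))
```

The sample input contains one block-level tag pair, one inline tag pair, and one entity, so each of the three steps does some work.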
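Mapping Char Filter
The Mapping Char Filter takes a map of keys and values. Whenever it encounters a string that equals one of the keys, it replaces that string with the value associated with the key.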
PUT pattern_test4
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["my_char_filter"]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": ["苍井空 => 666", "武藤兰 => 888"]
        }
      }
    }
  }
}
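In the example above we defined a custom analyzer that uses the keyword tokenizer together with a custom character filter, which replaces every occurrence of 苍井空 with 666 and every occurrence of 武藤兰 with 888.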
POST pattern_test4/_analyze
{
  "analyzer": "my_analyzer",
  "text": "苍井空热爱武藤兰,可惜后来苍井空结婚了"
}
The result:
{
  "tokens" : [
    {
      "token" : "666热爱888,可惜后来666结婚了",
      "start_offset" : 0,
      "end_offset" : 19,
      "type" : "word",
      "position" : 0
    }
  ]
}
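The key-to-value replacement can be sketched locally in Python. This is a minimal illustration, assuming rules written in the same "key => value" form as above; build_mapper and apply_mapping are hypothetical helper names, and the code is not Elasticsearch's implementation. Keys are tried longest first, so that when two keys could match at the same position the longer one wins.

```python
import re


def build_mapper(rules):
    """Turn 'key => value' rules into a function that replaces every key."""
    table = {}
    for rule in rules:
        key, value = (part.strip() for part in rule.split("=>", 1))
        table[key] = value
    # Longest keys first, so the longest match wins at any given position.
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return lambda text: pattern.sub(lambda m: table[m.group(0)], text)


apply_mapping = build_mapper(["苍井空 => 666", "武藤兰 => 888"])
mapped = apply_mapping("苍井空热爱武藤兰,可惜后来苍井空结婚了")
print(mapped)
```

Note that the sketch expects at least one rule; an empty rule list is not handled.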
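Pattern Replace Char Filter
The Pattern Replace Char Filter uses a regular expression to match characters in the text and replace them. Be careful with poorly written regular expressions, though, because they can slow analysis down considerably!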
PUT pattern_test5
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "my_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "(\\d+)-(?=\\d)",
          "replacement": "$1_"
        }
      }
    }
  }
}
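In the example above we defined a custom regex rule. The pattern (\d+)-(?=\d) matches a run of digits followed by a hyphen, but only when another digit comes right after the hyphen. The replacement $1_ keeps the digits and turns the hyphen into an underscore, so the standard tokenizer no longer splits the number at the hyphens.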
POST pattern_test5/_analyze
{
  "analyzer": "my_analyzer",
  "text": "My credit card is 123-456-789"
}
The result:
{
  "tokens" : [
    {
      "token" : "My",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "credit",
      "start_offset" : 3,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "card",
      "start_offset" : 10,
      "end_offset" : 14,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "is",
      "start_offset" : 15,
      "end_offset" : 17,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "123_456_789",
      "start_offset" : 18,
      "end_offset" : 29,
      "type" : "<NUM>",
      "position" : 4
    }
  ]
}
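The same regular expression can be tried locally with Python's re module before putting it into an index setting. This is a small sketch, not part of Elasticsearch: replace_dashes is a hypothetical helper, and the Java-style backreference $1 used in the filter definition is written \1 in Python.

```python
import re

# Same pattern as my_char_filter: digits, then a hyphen that is
# followed by another digit (checked by the lookahead, not consumed).
PATTERN = re.compile(r"(\d+)-(?=\d)")


def replace_dashes(text: str) -> str:
    """Replace a hyphen between digits with an underscore."""
    return PATTERN.sub(r"\1_", text)  # \1 is Python's form of Java's $1


replaced = replace_dashes("My credit card is 123-456-789")
print(replaced)
```

Because the lookahead does not consume the following digit, that digit is still available to start the next match, which is why both hyphens in the number are replaced.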
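We now have a rough picture of how Elasticsearch analyzes data. You may have noticed, though, that the examples rarely involve Chinese text. That is because the analyzers built into Elasticsearch do not handle Chinese very well. So the next topic is a heavyweight plugin: elasticsearch analysis ik (usually just called the IK analyzer).
Corrections are welcome. That's all.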