Elasticsearch Custom Analyzers

When Analysis Happens

Analysis takes place at two points: Index Time and Search Time.

  • Index Time: when documents are written and the inverted index is built; the analysis logic is determined by the mapping parameter analyzer (a minimal mapping sketch follows this list).
  • Search Time: when a search is executed; analysis at this stage applies only to the search terms.
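
A minimal mapping sketch showing where these settings live (the index name my_index and the field title are illustrative; analyzer controls index-time analysis, while search_analyzer, covered at the end of this article, controls search-time analysis):

### Illustrative mapping with index-time and search-time analyzers
PUT my_index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "standard",
        "search_analyzer": "standard"
      }
    }
  }
}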

Components of an Analyzer

  • Tokenizer: defines the tokenization (word-splitting) rules.
  • Token Filter: processes the individual tokens produced by the tokenizer.
  • Character Filter: processes the raw text at the character level before tokenization (all three are combined in the sketch after this list).
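
A minimal sketch combining all three components in a single _analyze request (html_strip, standard, and lowercase are all built-in; the input text is illustrative):

### Character filter, tokenizer, and token filter applied in order
GET _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": ["<p>Hello World</p>"]
}

The character filter strips the HTML tags first, the tokenizer then splits the remaining text into words, and the token filter lowercases each resulting token.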

_analyze API

The _analyze API can be used to inspect the tokens produced by a specified analyzer.

The syntax is as follows:

GET _analyze
{
    "text": "hello world",
    "analyzer": "english"
}

Tokenizer

The tokenizer is one of the core components of an analyzer. Its main job is tokenization, that is, splitting the raw text into fine-grained pieces; each resulting piece is called a term (also referred to as a token).

You can think of the tokenizer as a predefined set of tokenization rules.

Elasticsearch ships with many built-in tokenizers; the default is standard.

For example:

### Tokenizing English text with the standard tokenizer
GET _analyze
{
  "tokenizer": "standard",
  "text": ["hello world"]
}
### The response is as follows
{
  "tokens": [
    {
      "token": "hello",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "world",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
### Tokenizing Chinese text with the standard tokenizer
GET _analyze
{
  "tokenizer": "standard",
  "text": ["我是中国人"]
}
### The response is as follows
{
  "tokens": [
    {
      "token": "我",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<IDEOGRAPHIC>",
      "position": 0
    },
    {
      "token": "是",
      "start_offset": 1,
      "end_offset": 2,
      "type": "<IDEOGRAPHIC>",
      "position": 1
    },
    {
      "token": "中",
      "start_offset": 2,
      "end_offset": 3,
      "type": "<IDEOGRAPHIC>",
      "position": 2
    },
    {
      "token": "国",
      "start_offset": 3,
      "end_offset": 4,
      "type": "<IDEOGRAPHIC>",
      "position": 3
    },
    {
      "token": "人",
      "start_offset": 4,
      "end_offset": 5,
      "type": "<IDEOGRAPHIC>",
      "position": 4
    }
  ]
}

Token Filter

Overview

Token filters process the tokens produced after tokenization, for example converting case, removing stop words, or handling synonyms.

Elasticsearch ships with many built-in token filters that cover most day-to-day needs, and third-party filters are supported as well.

Lowercase and Uppercase

Note that the first request below does not specify a tokenizer, so the _analyze API falls back to the keyword tokenizer and the whole input is kept as a single token before lowercasing; the second request uses the standard tokenizer explicitly.

GET _analyze
{
  "filter": ["lowercase"],
  "text": ["HELLO world"]
}
### The response is as follows
{
  "tokens": [
    {
      "token": "hello world",
      "start_offset": 0,
      "end_offset": 11,
      "type": "word",
      "position": 0
    }
  ]
}
——————————————————————————————————————————————————————————————————————
GET _analyze
{
  "tokenizer": "standard", 
  "filter": ["uppercase"],
  "text": ["www.elastic.org.cn","www elastic org cn"]
}
### The response is as follows
{
  "tokens": [
    {
      "token": "WWW.ELASTIC.ORG.CN",
      "start_offset": 0,
      "end_offset": 18,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "WWW",
      "start_offset": 19,
      "end_offset": 22,
      "type": "<ALPHANUM>",
      "position": 101
    },
    {
      "token": "ELASTIC",
      "start_offset": 23,
      "end_offset": 30,
      "type": "<ALPHANUM>",
      "position": 102
    },
    {
      "token": "ORG",
      "start_offset": 31,
      "end_offset": 34,
      "type": "<ALPHANUM>",
      "position": 103
    },
    {
      "token": "CN",
      "start_offset": 35,
      "end_offset": 37,
      "type": "<ALPHANUM>",
      "position": 104
    }
  ]
}

Stop Words

Stop words are terms that get dropped after tokenization. The stop word list can be customized.

### Built-in stop word token filter
GET _analyze
{
  "tokenizer": "standard",
  "filter": ["stop"],
  "text": ["what are you doing"]
}
### The response is as follows
{
  "tokens": [
    {
      "token": "what",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "you",
      "start_offset": 9,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "doing",
      "start_offset": 13,
      "end_offset": 18,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
————————————————————————————————————————————————————————————————
### Custom stop word token filter
PUT test_token_filter_stop
{
  "settings": {
    "analysis": {
      "filter": {
        "my_filter": {
          "type": "stop",
          "stopwords": ["world"],
          "ignore_case": true //忽略大小写
        }
      }
    }
  }
}
### Test
GET test_token_filter_stop/_analyze
{
  "tokenizer": "standard", 
  "filter": ["my_filter"],
  "text": ["hello world"]
}
### The response is as follows
{
  "tokens": [
    {
      "token": "hello",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}

Synonyms

Synonym rule formats

  • a,b,c => d: a, b, and c are all replaced by d.
  • a,b,c,d: a, b, c, and d are treated as equivalent.

Ways to define synonyms

  • Inline: declare the rules directly in the synonyms setting.
  • File: define the rules in a file; the path is resolved relative to the Elasticsearch config directory.

Inline

### Custom synonym filter
PUT test_token_filter_synonym
{
  "settings": {
    "analysis": {
      "filter": {
        "my_filter": {
          "type": "synonym",
          "synonyms": ["a,b,c=>d"]
        }
      }
    }
  }
}
### Test
GET test_token_filter_synonym/_analyze
{
  "tokenizer": "standard",
  "filter": ["my_filter"],
  "text": ["a"]
}
### The response is as follows
{
  "tokens": [
    {
      "token": "d",
      "start_offset": 0,
      "end_offset": 1,
      "type": "SYNONYM",
      "position": 0
    }
  ]
}
————————————————————————————————————————————————————————————————————————
### Custom synonym filter
PUT test_token_filter_synonym
{
  "settings": {
    "analysis": {
      "filter": {
        "my_filter": {
          "type": "synonym",
          "synonyms": ["a,b,c,d"]
        }
      }
    }
  }
}
### Test
GET test_token_filter_synonym/_analyze
{
  "tokenizer": "standard",
  "filter": ["my_filter"],
  "text": ["a"]
}
### The response is as follows
{
  "tokens": [
    {
      "token": "a",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "b",
      "start_offset": 0,
      "end_offset": 1,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "c",
      "start_offset": 0,
      "end_offset": 1,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "d",
      "start_offset": 0,
      "end_offset": 1,
      "type": "SYNONYM",
      "position": 0
    }
  ]
}

File

### Custom synonym filter
PUT test_token_filter_synonym
{
  "settings": {
    "analysis": {
      "filter": {
        "my_filter": {
          "type": "synonym",
          "synonyms_path": "analysis/synonym.txt" //相对于elastic的config目录下
        }
      }
    }
  }
}
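
The file uses the same rule syntax as the inline form, one rule per line. A hypothetical analysis/synonym.txt placed under the Elasticsearch config directory might look like this:

### Hypothetical contents of config/analysis/synonym.txt
a,b,c => d
e,f,g,h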

Character Filter

Overview

Character filters preprocess the text before tokenization, filtering out unwanted characters.

Basic usage

Syntax

PUT <index_name>
{
    "settings": {
        "analysis": {
            "char_filter": {
                "my_char_filter": {
                    "type": "<char_filter_type>"
                }
            }
        }
    }
}

Parameters

  • type: the character filter type to use; one of the following values:
    • html_strip
    • mapping
    • pattern_replace

HTML strip character filter

This character filter removes HTML tags such as <b> and replaces HTML entities such as &nbsp; with their decoded characters.

### Custom HTML strip character filter
PUT test_char_filter
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "html_strip",
          "escaped_tags": ["a"]
        }
      }
    }
  }
}
### Test
GET test_char_filter/_analyze
{
  "char_filter": ["my_char_filter"],
  "text": ["<p>hello&nbsp;<a>world</a></p>"]
}
### The response is as follows
{
  "tokens": [
    {
      "token": """
hello <a>world</a>
""",
      "start_offset": 0,
      "end_offset": 30,
      "type": "word",
      "position": 0
    }
  ]
}

Parameters:

  • escaped_tags: HTML tags that should be preserved (not stripped).

Mapping character filter

Defines mapping rules that replace specific characters or strings with the specified replacement.

### Custom mapping character filter
PUT test_char_filter
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": ["垃圾=>*","辣鸡=>*","滚=>*"]
        }
      }
    }
  }
}
### Test
GET test_char_filter/_analyze
{
  "char_filter": ["my_char_filter"],
  "text": ["你就是个垃圾,滚!"]
}
### The response is as follows
{
  "tokens": [
    {
      "token": "你就是个*,*!",
      "start_offset": 0,
      "end_offset": 9,
      "type": "word",
      "position": 0
    }
  ]
}

Pattern replace character filter

Replaces characters matched by a regular expression with the specified replacement string; capture groups can be referenced as $1, $2, and so on. The example below masks the middle four digits of a phone number.

### Custom pattern replace character filter
PUT test_char_filter
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": """(\d{3})\d{4}(\d{4})""",
          "replacement": "$1****$2"
        }
      }
    }
  }
}
### Test
GET test_char_filter/_analyze
{
  "char_filter": ["my_char_filter"],
  "text": ["16637524752"]
}
### The response is as follows
{
  "tokens": [
    {
      "token": "166****4752",
      "start_offset": 0,
      "end_offset": 11,
      "type": "word",
      "position": 0
    }
  ]
}

Built-in Analyzers

  • Standard⭐: the default analyzer; not ideal for Chinese, which it splits character by character; parameter value standard.
  • Pattern: splits text into terms on a regular-expression separator; parameter value pattern.
  • Simple: splits on non-letter characters and discards everything except letters; parameter value simple.
  • Whitespace⭐: splits on whitespace and does not change case (see the example after this list); parameter value whitespace.
  • Keyword⭐: effectively a no-op analyzer that keeps the original text intact as a single term; parameter value keyword.
  • Stop: same tokenization rules as the Simple analyzer, with stop word removal added; parameter value stop.
  • Language Analyzers: support for more than thirty major languages.
  • Fingerprint: a specialized analyzer; rarely used.
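
As a quick check, the whitespace analyzer can be tested with _analyze; it splits only on whitespace and leaves case untouched (the input text is illustrative):

### whitespace analyzer keeps the original case
GET _analyze
{
  "analyzer": "whitespace",
  "text": ["HELLO World"]
}

This returns the tokens HELLO and World unchanged, whereas the standard analyzer would lowercase them to hello and world.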

Custom Analyzers

If the built-in analyzers do not meet your needs, you can define a custom analyzer by freely combining the three components: tokenizer, token filters, and character filters. A custom analyzer must satisfy the following requirements:

  • Tokenizer: exactly one tokenizer must be specified; it defines the tokenization rules.
  • Token Filter: optional; zero or more token filters may be specified.
  • Char Filter: optional; zero or more character filters may be specified.
### Custom analyzer
PUT test_analyzer
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": [",。!?:"]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": ["垃圾=>*","菜=>*"]
        }
      },
      "filter": {
        "my_filter": {
          "type": "stop",
          "stopwords": ["GTA"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "my_tokenizer",
          "char_filter": ["my_char_filter"],
          "filter": ["my_filter"]
        }
      }
    }
  }
}
### Test
GET test_analyzer/_analyze
{
  "analyzer": "my_analyzer",
  "text": ["你玩GTA吗?GTA,你玩的菜吗?"]
}
### The response is as follows
{
  "tokens": [
    {
      "token": "你玩GTA吗",
      "start_offset": 0,
      "end_offset": 6,
      "type": "word",
      "position": 0
    },
    {
      "token": "你玩的*吗",
      "start_offset": 11,
      "end_offset": 16,
      "type": "word",
      "position": 2
    }
  ]
}

Mapping a field to the custom analyzer

PUT test_analyzer/_mapping
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}

Test
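
For the search below to return a hit, a document must be indexed first; a minimal example using the same text as above and the document ID seen in the response:

### Index a test document
PUT test_analyzer/_doc/1
{
  "title": "你玩GTA吗?GTA,你玩的菜吗?"
}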

GET test_analyzer/_search
{
  "query": {
    "match": {
      "title": "你玩的菜吗"
    }
  }
}
### The response is as follows
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "test_analyzer",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "title": "你玩GTA吗?GTA,你玩的菜吗?"
        }
      }
    ]
  }
}

analyzer and search_analyzer

  • analyzer: the analyzer assigned to a field; it only applies to text fields and acts on the source data, i.e. the field values being indexed.
  • search_analyzer: the analyzer applied at search time; it acts only on the search terms supplied by the user.
  • If search_analyzer is not specified, it defaults to analyzer; if analyzer is not specified either, both default to standard.

For example

PUT test_analyzer
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": [",。!?:"]
        },
        "my_search_tokenizer": {
          "type": "<type>"
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": ["垃圾=>*","菜=>*"]
        }
      },
      "filter": {
        "my_filter": {
          "type": "stop",
          "stopwords": ["GTA"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "my_tokenizer",
          "char_filter": ["my_char_filter"],
          "filter": ["my_filter"]
        },
        "my_search_analyzer": {
          "type": "custom",
          "tokenizer": "<tokenizer>",
          "char_filter": ["<char_filter>"],
          "filter": ["filter"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_analyzer",
        "search_analyzer": "my_search_analyzer"
      }
    }
  }
}