Elasticsearch Custom Analyzers
When Analysis Happens
Analysis takes place at two points in time: Index Time and Search Time.
- Index Time: when a document is written and the inverted index is built; the analysis logic is determined by the mapping parameter analyzer.
- Search Time: when a search is executed; analysis is applied only to the search terms (see the sketch below).
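A minimal sketch of both time points (the index name my_index and the field title are made up for illustration): the index-time analyzer is fixed in the mapping, while a match query may override the analyzer applied to the query string at search time.
### Index Time: the analyzer mapping parameter controls how field values are analyzed when documents are indexed
PUT my_index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "standard"
      }
    }
  }
}
### Search Time: only the query string is analyzed; the analyzer can be overridden per query
GET my_index/_search
{
  "query": {
    "match": {
      "title": {
        "query": "Hello World",
        "analyzer": "standard"
      }
    }
  }
}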
Components of an Analyzer
- Tokenizer: defines the tokenization (word-splitting) rules.
- Token Filter: processes the individual terms produced after tokenization.
- Character Filter: processes individual characters of the raw text before tokenization (all three are combined in the sketch below).
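A minimal sketch that applies all three components in a single _analyze call; the built-in html_strip, standard, and lowercase are used here purely for illustration. The processing order is character filter, then tokenizer, then token filter.
### char_filter -> tokenizer -> filter, applied in that order
GET _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": ["<p>Hello WORLD</p>"]
}
### Expected tokens: "hello" and "world"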
The _analyze API
The _analyze API can be used to inspect how a given analyzer breaks a piece of text into tokens.
The syntax is as follows:
GET _analyze
{
"text": "hello world",
"analyzer": "english"
}
Tokenizer
The tokenizer is the core component of an analyzer. Its job is tokenization (also called word splitting): breaking the raw text into fine-grained pieces, each of which is called a term (also known as a token).
You can think of a tokenizer as a predefined set of splitting rules.
Elasticsearch ships with many built-in tokenizers; the default is standard.
For example:
### The standard tokenizer applied to English text
GET _analyze
{
"tokenizer": "standard",
"text": ["hello world"]
}
### The response is as follows
{
"tokens": [
{
"token": "hello",
"start_offset": 0,
"end_offset": 5,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "world",
"start_offset": 6,
"end_offset": 11,
"type": "<ALPHANUM>",
"position": 1
}
]
}
### The standard tokenizer applied to Chinese text
GET _analyze
{
"tokenizer": "standard",
"text": ["我是中国人"]
}
### The response is as follows
{
"tokens": [
{
"token": "我",
"start_offset": 0,
"end_offset": 1,
"type": "<IDEOGRAPHIC>",
"position": 0
},
{
"token": "是",
"start_offset": 1,
"end_offset": 2,
"type": "<IDEOGRAPHIC>",
"position": 1
},
{
"token": "中",
"start_offset": 2,
"end_offset": 3,
"type": "<IDEOGRAPHIC>",
"position": 2
},
{
"token": "国",
"start_offset": 3,
"end_offset": 4,
"type": "<IDEOGRAPHIC>",
"position": 3
},
{
"token": "人",
"start_offset": 4,
"end_offset": 5,
"type": "<IDEOGRAPHIC>",
"position": 4
}
]
}
Token Filter
Overview
Token filters process the terms produced after tokenization, for example converting case, removing stop words, or handling synonyms.
Elasticsearch also ships with many built-in token filters that cover most day-to-day needs, and third-party token filters are supported as well.
Lowercase and Uppercase
### The lowercase filter; no tokenizer is specified here, so the whole input remains a single token
GET _analyze
{
"filter": ["lowercase"],
"text": ["HELLO world"]
}
### The response is as follows
{
"tokens": [
{
"token": "hello world",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 0
}
]
}
——————————————————————————————————————————————————————————————————————
GET _analyze
{
"tokenizer": "standard",
"filter": ["uppercase"],
"text": ["www.elastic.org.cn","www elastic org cn"]
}
### The response is as follows
{
"tokens": [
{
"token": "WWW.ELASTIC.ORG.CN",
"start_offset": 0,
"end_offset": 18,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "WWW",
"start_offset": 19,
"end_offset": 22,
"type": "<ALPHANUM>",
"position": 101
},
{
"token": "ELASTIC",
"start_offset": 23,
"end_offset": 30,
"type": "<ALPHANUM>",
"position": 102
},
{
"token": "ORG",
"start_offset": 31,
"end_offset": 34,
"type": "<ALPHANUM>",
"position": 103
},
{
"token": "CN",
"start_offset": 35,
"end_offset": 37,
"type": "<ALPHANUM>",
"position": 104
}
]
}
Stop Words
Stop words are terms that are dropped after tokenization. The stop word list can be customized.
### The built-in stop word filter
GET _analyze
{
"tokenizer": "standard",
"filter": ["stop"],
"text": ["what are you doing"]
}
### The response is as follows
{
"tokens": [
{
"token": "what",
"start_offset": 0,
"end_offset": 4,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "you",
"start_offset": 9,
"end_offset": 12,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "doing",
"start_offset": 13,
"end_offset": 18,
"type": "<ALPHANUM>",
"position": 3
}
]
}
————————————————————————————————————————————————————————————————
### A custom stop word filter
PUT test_token_filter_stop
{
"settings": {
"analysis": {
"filter": {
"my_filter": {
"type": "stop",
"stopwords": ["world"],
"ignore_case": true //忽略大小写
}
}
}
}
}
### Test
GET test_token_filter_stop/_analyze
{
"tokenizer": "standard",
"filter": ["my_filter"],
"text": ["hello world"]
}
### The response is as follows
{
"tokens": [
{
"token": "hello",
"start_offset": 0,
"end_offset": 5,
"type": "<ALPHANUM>",
"position": 0
}
]
}
Synonyms
Synonym rule syntax
- a,b,c => d: a, b, and c are all replaced by d.
- a,b,c,d: a, b, c, and d are treated as equivalent.
Ways to define synonyms
- Inline: declare the rules directly in the synonyms parameter.
- File: define the rules in a file whose path is relative to the Elasticsearch config directory.
Inline
### A custom synonym filter (replacement rule)
PUT test_token_filter_synonym
{
"settings": {
"analysis": {
"filter": {
"my_filter": {
"type": "synonym",
"synonyms": ["a,b,c=>d"]
}
}
}
}
}
### Test
GET test_token_filter_synonym/_analyze
{
"tokenizer": "standard",
"filter": ["my_filter"],
"text": ["a"]
}
### The response is as follows
{
"tokens": [
{
"token": "d",
"start_offset": 0,
"end_offset": 1,
"type": "SYNONYM",
"position": 0
}
]
}
————————————————————————————————————————————————————————————————————————
### A custom synonym filter (equivalence rule)
PUT test_token_filter_synonym
{
"settings": {
"analysis": {
"filter": {
"my_filter": {
"type": "synonym",
"synonyms": ["a,b,c,d"]
}
}
}
}
}
### Test
GET test_token_filter_synonym/_analyze
{
"tokenizer": "standard",
"filter": ["my_filter"],
"text": ["a"]
}
### The response is as follows
{
"tokens": [
{
"token": "a",
"start_offset": 0,
"end_offset": 1,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "b",
"start_offset": 0,
"end_offset": 1,
"type": "SYNONYM",
"position": 0
},
{
"token": "c",
"start_offset": 0,
"end_offset": 1,
"type": "SYNONYM",
"position": 0
},
{
"token": "d",
"start_offset": 0,
"end_offset": 1,
"type": "SYNONYM",
"position": 0
}
]
}
File
### A custom synonym filter that loads its rules from a file
PUT test_token_filter_synonym
{
"settings": {
"analysis": {
"filter": {
"my_filter": {
"type": "synonym",
"synonyms_path": "analysis/synonym.txt" //相对于elastic的config目录下
}
}
}
}
}
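The file uses the same rule syntax as the inline form, one rule per line, and must be present on every node. A hypothetical analysis/synonym.txt might look like this (lines starting with # are comments):
# one rule per line, same syntax as the inline rules
a,b,c => d
ipod, i-pod, i pod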
Character Filter
Concept
A character filter pre-processes the text before tokenization, filtering out or transforming unwanted characters.
Basic Usage
Syntax
PUT <index_name>
{
"settings": {
"analysis": {
"char_filter": {
"my_char_filter": {
"type": "<char_filter_type>"
}
}
}
}
}
Parameters
- type: the character filter type to use; it can be one of the following values
- html_strip
- mapping
- pattern_replace
HTML Strip Character Filter
The html_strip character filter removes HTML tags such as <b> and decodes HTML entities such as &amp;.
### A custom HTML strip character filter
PUT test_char_filter
{
"settings": {
"analysis": {
"char_filter": {
"my_char_filter": {
"type": "html_strip",
"escaped_tags": ["a"]
}
}
}
}
}
### Test
GET test_char_filter/_analyze
{
"char_filter": ["my_char_filter"],
"text": ["<p>hello <a>world</a></p>"]
}
### The response is as follows
{
"tokens": [
{
"token": """
hello <a>world</a>
""",
"start_offset": 0,
"end_offset": 30,
"type": "word",
"position": 0
}
]
}
Parameters:
- escaped_tags: HTML tags that should be kept rather than stripped.
Mapping Character Filter
Defines mapping rules that replace specific characters or strings with the specified replacements.
### A custom mapping character filter
PUT test_char_filter
{
"settings": {
"analysis": {
"char_filter": {
"my_char_filter": {
"type": "mapping",
"mappings": ["垃圾=>*","辣鸡=>*","滚=>*"]
}
}
}
}
}
### Test
GET test_char_filter/_analyze
{
"char_filter": ["my_char_filter"],
"text": ["你就是个垃圾,滚!"]
}
### The response is as follows
{
"tokens": [
{
"token": "你就是个*,*!",
"start_offset": 0,
"end_offset": 9,
"type": "word",
"position": 0
}
]
}
Pattern Replace Character Filter
Replaces characters matching a regular expression with the specified replacement string, which is handy for masking data such as phone numbers.
### A custom pattern replace character filter
PUT test_char_filter
{
"settings": {
"analysis": {
"char_filter": {
"my_char_filter": {
"type": "pattern_replace",
"pattern": """(\d{3})\d{4}(\d{4})""",
"replacement": "$1****$2"
}
}
}
}
}
### Test
GET test_char_filter/_analyze
{
"char_filter": ["my_char_filter"],
"text": ["16637524752"]
}
### The response is as follows
{
"tokens": [
{
"token": "166****4752",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 0
}
]
}
Built-in Analyzers
- Standard⭐: the default analyzer; not ideal for Chinese, which it splits character by character. Parameter value: standard.
- Pattern: splits text into terms using a regular expression as the separator. Parameter value: pattern.
- Simple: keeps only letters and discards everything else. Parameter value: simple.
- Whitespace⭐: splits on whitespace only and does not change case (see the comparison with standard sketched after this list). Parameter value: whitespace.
- Keyword⭐: effectively a no-op analyzer that keeps the original text intact as a single term. Parameter value: keyword.
- Stop: same splitting rules as the Simple analyzer, with stop word removal added. Parameter value: stop.
- Language Analyzers: built-in support for more than thirty major languages.
- Fingerprint: a special-purpose analyzer that is rarely used.
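A quick illustration of the difference between two of them; the sample text is made up and the expected tokens are noted in comments.
GET _analyze
{
  "analyzer": "standard",
  "text": ["Hello World-Wide"]
}
### Expected tokens: "hello", "world", "wide" (split on word boundaries, lowercased)

GET _analyze
{
  "analyzer": "whitespace",
  "text": ["Hello World-Wide"]
}
### Expected tokens: "Hello", "World-Wide" (split on whitespace only, case preserved)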
Custom Analyzers
If the built-in analyzers do not meet your needs, you can define a custom analyzer by freely combining the three components: tokenizer, token filters, and character filters. A custom analyzer must satisfy the following requirements:
- Tokenizer: exactly one tokenizer must be specified, i.e. the splitting rules must be defined.
- Token Filter: zero or more token filters may be specified.
- Char Filter: zero or more character filters may be specified.
### A custom analyzer
PUT test_analyzer
{
"settings": {
"analysis": {
"tokenizer": {
"my_tokenizer": {
"type": "pattern",
"pattern": [",。!?:"]
}
},
"char_filter": {
"my_char_filter": {
"type": "mapping",
"mappings": ["垃圾=>*","菜=>*"]
}
},
"filter": {
"my_filter": {
"type": "stop",
"stopwords": ["GTA"]
}
},
"analyzer": {
"my_analyzer": {
"type": "custom",
"tokenizer": "my_tokenizer",
"char_filter": ["my_char_filter"],
"filter": ["my_filter"]
}
}
}
}
}
### Test
GET test_analyzer/_analyze
{
"analyzer": "my_analyzer",
"text": ["你玩GTA吗?GTA,你玩的菜吗?"]
}
### The response is as follows
{
"tokens": [
{
"token": "你玩GTA吗",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 0
},
{
"token": "你玩的*吗",
"start_offset": 11,
"end_offset": 16,
"type": "word",
"position": 2
}
]
}
Mapping the custom analyzer to a field in the index
PUT test_analyzer/_mapping
{
"properties": {
"title": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
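The search below needs a matching document; as a minimal setup step (the _id of 1 matches the hit shown in the result), index the sample sentence first.
PUT test_analyzer/_doc/1
{
  "title": "你玩GTA吗?GTA,你玩的菜吗?"
}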
Test
GET test_analyzer/_search
{
"query": {
"match": {
"title": "你玩的菜吗"
}
}
}
### The response is as follows
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 0.2876821,
"hits": [
{
"_index": "test_analyzer",
"_id": "1",
"_score": 0.2876821,
"_source": {
"title": "你玩GTA吗?GTA,你玩的菜吗?"
}
}
]
}
}
analyzer and search_analyzer
- analyzer: the analyzer specified for a field; it only takes effect on text fields and acts on the source data, i.e. the field values being indexed.
- search_analyzer: the analyzer used at search time; it acts on the search terms supplied by the user.
- If search_analyzer is not specified, it defaults to analyzer; if analyzer is not specified either, both default to standard.
For example:
PUT test_analyzer
{
"settings": {
"analysis": {
"tokenizer": {
"my_tokenizer": {
"type": "pattern",
"pattern": [",。!?:"]
},
"my_search_tokenizer": {
"type": "<type>"
}
},
"char_filter": {
"my_char_filter": {
"type": "mapping",
"mappings": ["垃圾=>*","菜=>*"]
}
},
"filter": {
"my_filter": {
"type": "stop",
"stopwords": ["GTA"]
}
},
"analyzer": {
"my_analyzer": {
"type": "custom",
"tokenizer": "my_tokenizer",
"char_filter": ["my_char_filter"],
"filter": ["my_filter"]
},
"my_search_analyzer": {
"type": "custom",
"tokenizer": "<tokenizer>",
"char_filter": ["<char_filter>"],
"filter": ["filter"]
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "my_analyzer",
"search_analyzer": "my_search_analyzer"
}
}
}
}