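Elasticsearch Analyzers
1. Analyzer Concepts
1.1 What Is an Analyzer?
Text analysis is the process of converting full text into a series of terms (term/token), also called tokenization. In ES, analysis is carried out by an Analyzer; you can use one of the built-in analyzers or define a custom one as needed.
A simple example: if you input Mastering Elasticsearch, it is automatically split into two terms, mastering and elasticsearch, and both are converted to lowercase.
Having briefly covered Analysis and Analyzers, let's look at what an analyzer is made of: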
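1.2 Components of an Analyzer
An analyzer is the component dedicated to tokenization. It consists of three parts:
- Character Filters: process the raw text, e.g. strip HTML tags
- Tokenizer: split the text into terms according to rules, e.g. on whitespace
- Token Filters: post-process the terms, e.g. lowercase them, remove stopwords, add synonyms
The three parts run in a fixed order: Character Filters first, then the Tokenizer, then Token Filters. Incoming text is first preprocessed, then split into tokens, and finally the resulting tokens are filtered.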
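The ordering described above can be sketched in plain Python. This is only an illustration of the three-stage pipeline, not Elasticsearch code: the function names (strip_html, whitespace_tokenizer, lowercase, remove_stopwords, analyze) and the tiny stopword set are made up for this example.

```python
import re
from typing import Callable, List

STOPWORDS = {"the", "a", "is"}  # illustrative subset, not the list ES ships with


def strip_html(text: str) -> str:
    """Character filter: replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def whitespace_tokenizer(text: str) -> List[str]:
    """Tokenizer: split on runs of whitespace."""
    return text.split()


def lowercase(tokens: List[str]) -> List[str]:
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]


def remove_stopwords(tokens: List[str]) -> List[str]:
    """Token filter: drop tokens found in STOPWORDS."""
    return [t for t in tokens if t not in STOPWORDS]


def analyze(
    text: str,
    char_filters: List[Callable[[str], str]],
    tokenizer: Callable[[str], List[str]],
    token_filters: List[Callable[[List[str]], List[str]]],
) -> List[str]:
    """Run the stages in order: character filters, tokenizer, token filters."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


tokens = analyze(
    "<b>Mastering</b> Elasticsearch is fun",
    [strip_html],
    whitespace_tokenizer,
    [lowercase, remove_stopwords],
)
print(tokens)  # ['mastering', 'elasticsearch', 'fun']
```

Swapping the order of the token filters matters here: remove_stopwords compares against lowercase words, so it has to run after lowercase.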
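ES ships with many built-in analyzers:
- Standard Analyzer - the default analyzer; splits on word boundaries and lowercases
- Simple Analyzer - splits on non-letters (symbols are dropped) and lowercases
- Stop Analyzer - lowercases and removes stop words (the, a, is)
- Whitespace Analyzer - splits on whitespace; does not lowercase
- Keyword Analyzer - no tokenization; the input is emitted unchanged as a single token
- Pattern Analyzer - splits by regular expression, default \W+
- Language - analyzers for more than 30 common languages
- Custom Analyzer - a user-defined analyzer
2. Analyze API
The _analyze endpoint lets you see how an analyzer works. It can be used in the following three ways:
2.1 Testing with an explicitly specified analyzer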
GET _analyze
{
"analyzer": "standard",
"text" : "Mastering Elasticsearch , elasticsearch in Action"
}
2.2 Testing against a field of an index
POST books/_analyze
{
"field": "title",
"text": "Mastering Elasticsearch"
}
2.3 Testing a custom combination of tokenizer and filters
POST /_analyze
{
"tokenizer": "standard",
"filter": ["lowercase"],
"text": "Mastering Elasticsearch"
}
}
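For readers who want to issue the same three kinds of request outside Kibana, here is a minimal sketch using only the Python standard library. The helper names (analyze_path, analyze_body, call_analyze) are hypothetical, and ES_URL assumes a local cluster reachable over plain HTTP without authentication, which is not the default for Elasticsearch 8, so adjust it for your setup.

```python
import json
import urllib.request
from typing import List, Optional

ES_URL = "http://localhost:9200"  # assumption: local cluster, no TLS, no auth


def analyze_path(index: Optional[str] = None) -> str:
    """Return /_analyze, or /<index>/_analyze when an index is given."""
    return f"/{index}/_analyze" if index else "/_analyze"


def analyze_body(
    text: str,
    analyzer: Optional[str] = None,
    field: Optional[str] = None,
    tokenizer: Optional[str] = None,
    filters: Optional[List[str]] = None,
) -> dict:
    """Build the JSON body, including only the options that were provided."""
    body = {"text": text}
    if analyzer:
        body["analyzer"] = analyzer
    if field:
        body["field"] = field
    if tokenizer:
        body["tokenizer"] = tokenizer
    if filters:
        body["filter"] = filters
    return body


def call_analyze(body: dict, index: Optional[str] = None) -> List[str]:
    """POST the body to the cluster and return just the token strings.

    Needs a reachable cluster at ES_URL; nothing in this sketch calls it automatically.
    """
    request = urllib.request.Request(
        ES_URL + analyze_path(index),
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return [t["token"] for t in json.load(response)["tokens"]]


# The three request styles from sections 2.1 to 2.3, as (path, body) pairs.
by_analyzer = (analyze_path(), analyze_body("Mastering Elasticsearch", analyzer="standard"))
by_field = (analyze_path("books"), analyze_body("Mastering Elasticsearch", field="title"))
custom = (
    analyze_path(),
    analyze_body("Mastering Elasticsearch", tokenizer="standard", filters=["lowercase"]),
)
print(json.dumps(custom[1], indent=2))
```

With a cluster running, call_analyze(custom[1]) would send the section 2.3 request and return the list of tokens.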
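Now that we have covered the Analyze API, let's look at ES's built-in analyzers:
3. Built-in Analyzers in Practice
3.1 Standard Analyzer
The Standard Analyzer is ES's default analyzer. It splits the input text on word boundaries and then lowercases the resulting tokens; stop word removal is disabled by default.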
GET _analyze
{
"analyzer": "standard",
"text": "In 2020, Java is the best language in the world."
}
Response:
{
"tokens": [
{
"token": "in",
"start_offset": 0,
"end_offset": 2,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "2020",
"start_offset": 3,
"end_offset": 7,
"type": "<NUM>",
"position": 1
},
{
"token": "java",
"start_offset": 9,
"end_offset": 13,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "is",
"start_offset": 14,
"end_offset": 16,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "the",
"start_offset": 17,
"end_offset": 20,
"type": "<ALPHANUM>",
"position": 4
},
{
"token": "best",
"start_offset": 21,
"end_offset": 25,
"type": "<ALPHANUM>",
"position": 5
},
{
"token": "language",
"start_offset": 26,
"end_offset": 34,
"type": "<ALPHANUM>",
"position": 6
},
{
"token": "in",
"start_offset": 35,
"end_offset": 37,
"type": "<ALPHANUM>",
"position": 7
},
{
"token": "the",
"start_offset": 38,
"end_offset": 41,
"type": "<ALPHANUM>",
"position": 8
},
{
"token": "world",
"start_offset": 42,
"end_offset": 47,
"type": "<ALPHANUM>",
"position": 9
}
]
}
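As the response shows, the input was split on whitespace and punctuation, Java was lowercased, and stop words such as in were not removed.
Here token is the resulting term, start_offset and end_offset are the character offsets where the term starts and ends in the original text, and position is the term's ordinal position in the token stream.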
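To make the offset fields concrete, the sketch below recomputes them in Python for the same sentence. It approximates the tokenizer with the regular expression \w+, which happens to give the same tokens for this input but is not the algorithm the Standard Analyzer actually uses; tokens_with_offsets is a name invented for the example.

```python
import re

text = "In 2020, Java is the best language in the world."


def tokens_with_offsets(text: str) -> list:
    """Approximate tokenization: each run of word characters becomes a lowercased token."""
    result = []
    for position, match in enumerate(re.finditer(r"\w+", text)):
        result.append(
            {
                "token": match.group().lower(),
                "start_offset": match.start(),  # index of the first character
                "end_offset": match.end(),      # index just past the last character
                "position": position,           # ordinal position in the token stream
            }
        )
    return result


tokens = tokens_with_offsets(text)
java = tokens[2]
print(java)
# Slicing the original text with the offsets recovers the un-lowercased source word.
print(text[java["start_offset"]:java["end_offset"]])  # Java
```

The offsets always point back into the original text, which is why the slice returns Java even though the token itself is java.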
3.2 Simple Analyzer
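It consists only of the Lower Case Tokenizer, which splits on non-letter characters, discards them, and lowercases the resulting tokens. Here is the same input text, with the analyzer switched to simple: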
GET _analyze
{
"analyzer": "simple",
"text": "In 2020, Java is the best language in the world."
}
Response:
{
"tokens": [
{
"token": "in",
"start_offset": 0,
"end_offset": 2,
"type": "word",
"position": 0
},
{
"token": "java",
"start_offset": 9,
"end_offset": 13,
"type": "word",
"position": 1
},
{
"token": "is",
"start_offset": 14,
"end_offset": 16,
"type": "word",
"position": 2
},
{
"token": "the",
"start_offset": 17,
"end_offset": 20,
"type": "word",
"position": 3
},
{
"token": "best",
"start_offset": 21,
"end_offset": 25,
"type": "word",
"position": 4
},
{
"token": "language",
"start_offset": 26,
"end_offset": 34,
"type": "word",
"position": 5
},
{
"token": "in",
"start_offset": 35,
"end_offset": 37,
"type": "word",
"position": 6
},
{
"token": "the",
"start_offset": 38,
"end_offset": 41,
"type": "word",
"position": 7
},
{
"token": "world",
"start_offset": 42,
"end_offset": 47,
"type": "word",
"position": 8
}
]
}
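The behavior seen in this response, where 2020 disappears because digits are not letters, can be approximated in a few lines of Python. This is a rough imitation for illustration only; simple_analyze is a hypothetical name, and the character class [^\W\d_] is just one way of expressing "letters only" with Python's re module.

```python
import re


def simple_analyze(text: str) -> list:
    """Keep runs of letters only, lowercased; digits and punctuation act as separators."""
    return [match.group().lower() for match in re.finditer(r"[^\W\d_]+", text)]


simple_tokens = simple_analyze("In 2020, Java is the best language in the world.")
print(simple_tokens)
# ['in', 'java', 'is', 'the', 'best', 'language', 'in', 'the', 'world']
```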
3.3 Whitespace Analyzer
The Whitespace Analyzer splits on whitespace only; it neither lowercases tokens nor strips punctuation:
GET _analyze
{
"analyzer": "whitespace",
"text": "In 2020, Java is the best language in the world."
}
Response:
{
"tokens": [
{
"token": "In",
"start_offset": 0,
"end_offset": 2,
"type": "word",
"position": 0
},
{
"token": "2020,",
"start_offset": 3,
"end_offset": 8,
"type": "word",
"position": 1
},
{
"token": "Java",
"start_offset": 9,
"end_offset": 13,
"type": "word",
"position": 2
},
{
"token": "is",
"start_offset": 14,
"end_offset": 16,
"type": "word",
"position": 3
},
{
"token": "the",
"start_offset": 17,
"end_offset": 20,
"type": "word",
"position": 4
},
{
"token": "best",
"start_offset": 21,
"end_offset": 25,
"type": "word",
"position": 5
},
{
"token": "language",
"start_offset": 26,
"end_offset": 34,
"type": "word",
"position": 6
},
{
"token": "in",
"start_offset": 35,
"end_offset": 37,
"type": "word",
"position": 7
},
{
"token": "the",
"start_offset": 38,
"end_offset": 41,
"type": "word",
"position": 8
},
{
"token": "world.",
"start_offset": 42,
"end_offset": 48,
"type": "word",
"position": 9
}
]
}
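Splitting on whitespace alone is what Python's str.split() does when called without arguments, so the response above can be reproduced with a one-line sketch (whitespace_analyze is a name made up for this example). Note that the comma and the period stay attached to their tokens and the capitalization is preserved.

```python
def whitespace_analyze(text: str) -> list:
    """Split on runs of whitespace; tokens are otherwise left untouched."""
    return text.split()


whitespace_tokens = whitespace_analyze("In 2020, Java is the best language in the world.")
print(whitespace_tokens)
# ['In', '2020,', 'Java', 'is', 'the', 'best', 'language', 'in', 'the', 'world.']
```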
3.4 Stop Analyzer
It consists of the Lower Case Tokenizer and the Stop Token Filter. Compared with the Simple Analyzer above, it adds stop word filtering, which removes common words such as the, a, and is.
GET _analyze
{
"analyzer": "stop",
"text": "In 2020, Java is the best language in the world."
}
Response:
{
"tokens": [
{
"token": "java",
"start_offset": 9,
"end_offset": 13,
"type": "word",
"position": 1
},
{
"token": "best",
"start_offset": 21,
"end_offset": 25,
"type": "word",
"position": 4
},
{
"token": "language",
"start_offset": 26,
"end_offset": 34,
"type": "word",
"position": 5
},
{
"token": "world",
"start_offset": 42,
"end_offset": 47,
"type": "word",
"position": 8
}
]
}
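One detail in this response is easy to miss: the surviving tokens keep the positions they had before filtering (1, 4, 5, 8), leaving gaps where stop words were removed. The Python sketch below imitates that behavior under simplifying assumptions: stop_analyze is a hypothetical helper, and STOPWORDS is a small illustrative set rather than the full English list ES uses.

```python
import re

STOPWORDS = {"in", "is", "the", "a"}  # illustrative subset only


def stop_analyze(text: str) -> list:
    """Letter-only lowercase tokenization, then stop word removal that keeps original positions."""
    kept = []
    for position, match in enumerate(re.finditer(r"[^\W\d_]+", text)):
        token = match.group().lower()
        if token in STOPWORDS:
            continue  # the position number is consumed even though the token is dropped
        kept.append((token, position))
    return kept


stop_tokens = stop_analyze("In 2020, Java is the best language in the world.")
print(stop_tokens)
# [('java', 1), ('best', 4), ('language', 5), ('world', 8)]
```

Keeping the gaps means position-based queries still see the original distance between the remaining words.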
3.5 Keyword Analyzer
The Keyword Analyzer performs no tokenization; it emits the entire input as a single term:
GET _analyze
{
"analyzer": "keyword",
"text": "In 2020, Java is the best language in the world."
}
Response:
{
"tokens": [
{
"token": "In 2020, Java is the best language in the world.",
"start_offset": 0,
"end_offset": 48,
"type": "word",
"position": 0
}
]
}
3.6 Pattern Analyzer
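It tokenizes with a regular expression. The default pattern is \W+, which splits on runs of non-word characters (anything other than letters, digits, and underscores); tokens are lowercased by default: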
GET _analyze
{
"analyzer": "pattern",
"text": "In 2020, Java is the best language in the world."
}
Response:
{
"tokens": [
{
"token": "in",
"start_offset": 0,
"end_offset": 2,
"type": "word",
"position": 0
},
{
"token": "2020",
"start_offset": 3,
"end_offset": 7,
"type": "word",
"position": 1
},
{
"token": "java",
"start_offset": 9,
"end_offset": 13,
"type": "word",
"position": 2
},
{
"token": "is",
"start_offset": 14,
"end_offset": 16,
"type": "word",
"position": 3
},
{
"token": "the",
"start_offset": 17,
"end_offset": 20,
"type": "word",
"position": 4
},
{
"token": "best",
"start_offset": 21,
"end_offset": 25,
"type": "word",
"position": 5
},
{
"token": "language",
"start_offset": 26,
"end_offset": 34,
"type": "word",
"position": 6
},
{
"token": "in",
"start_offset": 35,
"end_offset": 37,
"type": "word",
"position": 7
},
{
"token": "the",
"start_offset": 38,
"end_offset": 41,
"type": "word",
"position": 8
},
{
"token": "world",
"start_offset": 42,
"end_offset": 47,
"type": "word",
"position": 9
}
]
}
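Because \W+ means the same thing in Python's re module, the default behavior can be imitated with re.split. This is a sketch for intuition, not the plugin's implementation; the function name pattern_analyze is invented here, and its two keyword arguments simply mirror the idea that both the pattern and the lowercasing can be changed.

```python
import re


def pattern_analyze(text: str, pattern: str = r"\W+", lowercase: bool = True) -> list:
    """Split on the given regular expression, drop empty pieces, optionally lowercase."""
    tokens = [piece for piece in re.split(pattern, text) if piece]
    return [t.lower() for t in tokens] if lowercase else tokens


pattern_tokens = pattern_analyze("In 2020, Java is the best language in the world.")
print(pattern_tokens)
# ['in', '2020', 'java', 'is', 'the', 'best', 'language', 'in', 'the', 'world']

# A different separator pattern, e.g. for comma- or semicolon-delimited values.
print(pattern_analyze("Red,Green;Blue", pattern=r"[,;]"))  # ['red', 'green', 'blue']
```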
3.7 Language Analyzer
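ES provides Language Analyzers for text in different languages; you select one by its language name. Here we use english, which removes stop words and also stems terms, so language becomes languag in the response: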
GET _analyze
{
"analyzer": "english",
"text": "In 2020, Java is the best language in the world."
}
Response:
{
"tokens": [
{
"token": "2020",
"start_offset": 3,
"end_offset": 7,
"type": "<NUM>",
"position": 1
},
{
"token": "java",
"start_offset": 9,
"end_offset": 13,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "best",
"start_offset": 21,
"end_offset": 25,
"type": "<ALPHANUM>",
"position": 5
},
{
"token": "languag",
"start_offset": 26,
"end_offset": 34,
"type": "<ALPHANUM>",
"position": 6
},
{
"token": "world",
"start_offset": 42,
"end_offset": 47,
"type": "<ALPHANUM>",
"position": 9
}
]
}
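4. Chinese Word Segmentation
Chinese word segmentation has its own difficulties. Unlike English, where words are naturally delimited by spaces, a Chinese sentence cannot simply be split into individual characters; it has to be split into meaningful words, and the same sentence can be segmented differently depending on the context.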
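The ambiguity can be demonstrated with a classic dictionary-based technique, maximum matching, run in both directions over the sentence 各国有企业相继倒闭. This is a teaching sketch only: the toy DICTIONARY is an assumption made up for the example, and real analyzers use far larger dictionaries and more elaborate algorithms than this.

```python
DICTIONARY = {"各国", "国有", "国有企业", "企业", "相继", "倒闭"}  # toy dictionary (assumption)
MAX_LEN = max(len(word) for word in DICTIONARY)


def forward_max_match(text: str) -> list:
    """Scan left to right, always taking the longest dictionary word that starts here."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(MAX_LEN, len(text) - i), 0, -1):
            word = text[i:i + size]
            if size == 1 or word in DICTIONARY:  # fall back to a single character
                tokens.append(word)
                i += size
                break
    return tokens


def backward_max_match(text: str) -> list:
    """Scan right to left, always taking the longest dictionary word that ends here."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(MAX_LEN, j), 0, -1):
            word = text[j - size:j]
            if size == 1 or word in DICTIONARY:  # fall back to a single character
                tokens.append(word)
                j -= size
                break
    return tokens[::-1]  # tokens were collected from the end, so reverse them


sentence = "各国有企业相继倒闭"
print(forward_max_match(sentence))   # ['各国', '有', '企业', '相继', '倒闭']
print(backward_max_match(sentence))  # ['各', '国有企业', '相继', '倒闭']
```

Both outputs are valid readings built from the same dictionary: the first reads the sentence as "countries / have / enterprises ...", the second as "each / state-owned enterprise ...". Which one is right depends on context, which is exactly what makes Chinese segmentation hard.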
4.1 Using the Standard Analyzer
GET _analyze
{
"analyzer": "standard",
"text": "各国有企业相继倒闭"
}
Response:
{
"tokens": [
{
"token": "各",
"start_offset": 0,
"end_offset": 1,
"type": "<IDEOGRAPHIC>",
"position": 0
},
{
"token": "国",
"start_offset": 1,
"end_offset": 2,
"type": "<IDEOGRAPHIC>",
"position": 1
},
{
"token": "有",
"start_offset": 2,
"end_offset": 3,
"type": "<IDEOGRAPHIC>",
"position": 2
},
{
"token": "企",
"start_offset": 3,
"end_offset": 4,
"type": "<IDEOGRAPHIC>",
"position": 3
},
{
"token": "业",
"start_offset": 4,
"end_offset": 5,
"type": "<IDEOGRAPHIC>",
"position": 4
},
{
"token": "相",
"start_offset": 5,
"end_offset": 6,
"type": "<IDEOGRAPHIC>",
"position": 5
},
{
"token": "继",
"start_offset": 6,
"end_offset": 7,
"type": "<IDEOGRAPHIC>",
"position": 6
},
{
"token": "倒",
"start_offset": 7,
"end_offset": 8,
"type": "<IDEOGRAPHIC>",
"position": 7
},
{
"token": "闭",
"start_offset": 8,
"end_offset": 9,
"type": "<IDEOGRAPHIC>",
"position": 8
}
]
}
4.2 Using ICU
ICU is not a built-in analyzer; the Elasticsearch plugin has to be installed first:
bin/elasticsearch-plugin install analysis-icu
With Docker:
docker exec -it es01 /usr/share/elasticsearch/bin/elasticsearch-plugin install analysis-icu
After restarting the Elasticsearch container, run the following in Kibana:
GET _analyze
{
"analyzer": "icu_analyzer",
"text": "各国有企业相继倒闭"
}
Response:
{
"tokens": [
{
"token": "各国",
"start_offset": 0,
"end_offset": 2,
"type": "<IDEOGRAPHIC>",
"position": 0
},
{
"token": "有",
"start_offset": 2,
"end_offset": 3,
"type": "<IDEOGRAPHIC>",
"position": 1
},
{
"token": "企业",
"start_offset": 3,
"end_offset": 5,
"type": "<IDEOGRAPHIC>",
"position": 2
},
{
"token": "相继",
"start_offset": 5,
"end_offset": 7,
"type": "<IDEOGRAPHIC>",
"position": 3
},
{
"token": "倒闭",
"start_offset": 7,
"end_offset": 9,
"type": "<IDEOGRAPHIC>",
"position": 4
}
]
}
4.3 Other Recommended Analyzers
IK:
- Supports custom dictionaries and hot reloading of the segmentation dictionary
- https://github.com/medcl/elasticsearch-analysis-ik
Install command:
docker exec -it es01 /usr/share/elasticsearch/bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v8.8.2/elasticsearch-analysis-ik-8.8.2.zip
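The IK version must match the Elasticsearch version. At the time of writing, the latest IK release was 8.8.2, while my local Elasticsearch is 8.9.1, so the command above fails to install.
I found an 8.9.1 build online:
docker exec -it es01 /usr/share/elasticsearch/bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/files/12424800/elasticsearch-analysis-ik-8.9.1.zip
Then test the ik_smart analyzer: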
GET _analyze
{
"analyzer": "ik_smart",
"text": "各国有企业相继倒闭"
}
Response:
{
"tokens": [
{
"token": "各",
"start_offset": 0,
"end_offset": 1,
"type": "CN_CHAR",
"position": 0
},
{
"token": "国有企业",
"start_offset": 1,
"end_offset": 5,
"type": "CN_WORD",
"position": 1
},
{
"token": "相继",
"start_offset": 5,
"end_offset": 7,
"type": "CN_WORD",
"position": 2
},
{
"token": "倒闭",
"start_offset": 7,
"end_offset": 9,
"type": "CN_WORD",
"position": 3
}
]
}
jieba:
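- The most popular word segmentation system in Python; supports segmentation and part-of-speech tagging
- Supports Traditional Chinese segmentation, custom dictionaries, parallel segmentation, and more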
- https://github.com/sing1ee/elasticsearch-jieba-plugin
THULAC:
- THU Lexical Analyzer for Chinese, a Chinese word segmenter from the Natural Language Processing and Computational Social Science Lab at Tsinghua University
- https://github.com/thunlp/THULAC-Java