Elasticsearch Analyzers

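1. Analyzer Concepts

1.1 What Is an Analyzer?

Text analysis is the process of converting full text into a series of terms (tokens); it is also called tokenization. In ES, analysis is carried out by an analyzer. You can use one of the analyzers built into ES or define a custom one as needed.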

A simple example: given the input Mastering Elasticsearch, the analyzer splits it into two terms, mastering and elasticsearch. Note that the terms have also been lowercased.

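Now that we have a basic idea of Analysis and Analyzer, let's look at what an analyzer is made of:

1.2 Components of an Analyzer

An analyzer is the component dedicated to tokenizing text. It consists of the following three parts: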

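Character Filters: process the raw text, e.g. strip HTML tags
Tokenizer: split the text into terms according to a rule, e.g. on whitespace
Token Filters: post-process the terms, e.g. convert uppercase to lowercase, remove stopwords, add synonyms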

The three parts run in a fixed order: Character Filters first, then the Tokenizer, then Token Filters. Incoming text is first cleaned up, then split into tokens, and finally the tokens are filtered.
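
To make the ordering concrete, here is a minimal Python sketch of the three stages chained together. It is only an illustration of the idea, not how ES or Lucene implement it; the function names and the tiny stopword list are made up for this example.

```python
import re

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "is"}

def strip_html(text):
    """Character filter: replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)

def whitespace_tokenizer(text):
    """Tokenizer: split the text on runs of whitespace."""
    return text.split()

def lowercase_filter(tokens):
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]

def stop_filter(tokens):
    """Token filter: drop tokens found in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]

def analyze(text):
    # 1. character filters, 2. tokenizer, 3. token filters, in that order
    text = strip_html(text)
    tokens = whitespace_tokenizer(text)
    for token_filter in (lowercase_filter, stop_filter):
        tokens = token_filter(tokens)
    return tokens

print(analyze("<b>Mastering</b> Elasticsearch is fun"))
# ['mastering', 'elasticsearch', 'fun']
```

Swapping the order of the stages changes the result: tokenizing before stripping the tags, for instance, would leave <b>Mastering</b> as a single token.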

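ES ships with many built-in analyzers:

Standard Analyzer - the default analyzer; splits on word boundaries and lowercases
Simple Analyzer - splits on non-letters (symbols are discarded) and lowercases
Stop Analyzer - lowercases and removes stopwords (the, a, is)
Whitespace Analyzer - splits on whitespace; does not lowercase
Keyword Analyzer - no tokenization; the input is emitted unchanged as the output
Pattern Analyzer - splits by regular expression, \W+ by default
Language - analyzers for more than 30 common languages
Custom Analyzer - a user-defined analyzer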

2. Analyzer API

The Analyzer API (the _analyze endpoint) lets you see how an analyzer works in the following three ways:

2.1 Testing with a Specified Analyzer

GET _analyze
{
    "analyzer": "standard",
    "text" : "Mastering Elasticsearch , elasticsearch in Action"
}

2.2 Testing Against a Field of an Index

POST books/_analyze
{
    "field": "title",
    "text": "Mastering Elasticsearch"
}

2.3 Testing with a Custom Tokenizer and Filters

POST /_analyze
{
    "tokenizer": "standard", 
    "filter": ["lowercase"],
    "text": "Mastering Elasticsearch"
}

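Now that we know the Analyzer API, let's look at the analyzers built into ES:

3. Built-in Analyzers in Practice

3.1 Standard Analyzer

The Standard Analyzer is the default analyzer in ES. It splits the input text on word boundaries and then lowercases the resulting terms. Stopword removal is disabled by default.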

GET _analyze
{
  "analyzer": "standard",
  "text": "In 2020, Java is the best language in the world."
}

Response:

{
  "tokens": [
    {
      "token": "in",
      "start_offset": 0,
      "end_offset": 2,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "2020",
      "start_offset": 3,
      "end_offset": 7,
      "type": "<NUM>",
      "position": 1
    },
    {
      "token": "java",
      "start_offset": 9,
      "end_offset": 13,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "is",
      "start_offset": 14,
      "end_offset": 16,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "the",
      "start_offset": 17,
      "end_offset": 20,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "best",
      "start_offset": 21,
      "end_offset": 25,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "language",
      "start_offset": 26,
      "end_offset": 34,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "in",
      "start_offset": 35,
      "end_offset": 37,
      "type": "<ALPHANUM>",
      "position": 7
    },
    {
      "token": "the",
      "start_offset": 38,
      "end_offset": 41,
      "type": "<ALPHANUM>",
      "position": 8
    },
    {
      "token": "world",
      "start_offset": 42,
      "end_offset": 47,
      "type": "<ALPHANUM>",
      "position": 9
    }
  ]
}

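As the response shows, the input was split at whitespace and punctuation, Java was lowercased, and stopwords such as in were not removed.

In each entry, token is the resulting term, start_offset and end_offset are the start and end character offsets in the original text, and position is the token's position in the token stream.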
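
The following Python sketch shows where those offset and position values come from. It is a rough approximation: the real standard tokenizer follows Unicode text segmentation rules, while this sketch simply uses the regex \w+, which happens to agree on the sample sentence. The function name analyze_with_offsets is made up for this example.

```python
import re

def analyze_with_offsets(text):
    """Approximate a standard-style analysis for plain ASCII text."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text)):
        tokens.append({
            "token": match.group().lower(),  # lowercased term
            "start_offset": match.start(),   # index of the first character
            "end_offset": match.end(),       # index just past the last character
            "position": position,            # ordinal of the token in the stream
        })
    return tokens

for tok in analyze_with_offsets("In 2020, Java is the best language in the world."):
    print(tok)
```

Note that end_offset is exclusive: 2020 starts at offset 3 and ends at 7, so the comma at offset 7 is not part of the token.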

3.2 Simple Analyzer

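The Simple Analyzer consists only of the Lower Case Tokenizer: it splits on non-letters, discards the non-letter characters, and lowercases the resulting terms. Using the same input text as before, switch the analyzer to simple: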

GET _analyze
{
  "analyzer": "simple",
  "text": "In 2020, Java is the best language in the world."
}

Response:

{
  "tokens": [
    {
      "token": "in",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "java",
      "start_offset": 9,
      "end_offset": 13,
      "type": "word",
      "position": 1
    },
    {
      "token": "is",
      "start_offset": 14,
      "end_offset": 16,
      "type": "word",
      "position": 2
    },
    {
      "token": "the",
      "start_offset": 17,
      "end_offset": 20,
      "type": "word",
      "position": 3
    },
    {
      "token": "best",
      "start_offset": 21,
      "end_offset": 25,
      "type": "word",
      "position": 4
    },
    {
      "token": "language",
      "start_offset": 26,
      "end_offset": 34,
      "type": "word",
      "position": 5
    },
    {
      "token": "in",
      "start_offset": 35,
      "end_offset": 37,
      "type": "word",
      "position": 6
    },
    {
      "token": "the",
      "start_offset": 38,
      "end_offset": 41,
      "type": "word",
      "position": 7
    },
    {
      "token": "world",
      "start_offset": 42,
      "end_offset": 47,
      "type": "word",
      "position": 8
    }
  ]
}
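
The behaviour above can be imitated in a few lines of Python. This is a sketch under the assumption that "letter" means any Unicode letter; simple_analyze is a made-up helper name, not an ES API.

```python
import re

def simple_analyze(text):
    """Keep runs of letters only and lowercase them."""
    # [^\W\d_]+ means: word characters, minus digits and the underscore,
    # which leaves runs of letters.
    return [m.group().lower() for m in re.finditer(r"[^\W\d_]+", text)]

print(simple_analyze("In 2020, Java is the best language in the world."))
# ['in', 'java', 'is', 'the', 'best', 'language', 'in', 'the', 'world']
```

Because digits are not letters, 2020 disappears entirely, which matches the response above and is the main visible difference from the Standard Analyzer on this sentence.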

3.3 Whitespace Analyzer

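The Whitespace Analyzer splits on whitespace only; it does not lowercase, and punctuation stays attached to the neighbouring term: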

GET _analyze
{
  "analyzer": "whitespace",
  "text": "In 2020, Java is the best language in the world."
}

Response:

{
  "tokens": [
    {
      "token": "In",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "2020,",
      "start_offset": 3,
      "end_offset": 8,
      "type": "word",
      "position": 1
    },
    {
      "token": "Java",
      "start_offset": 9,
      "end_offset": 13,
      "type": "word",
      "position": 2
    },
    {
      "token": "is",
      "start_offset": 14,
      "end_offset": 16,
      "type": "word",
      "position": 3
    },
    {
      "token": "the",
      "start_offset": 17,
      "end_offset": 20,
      "type": "word",
      "position": 4
    },
    {
      "token": "best",
      "start_offset": 21,
      "end_offset": 25,
      "type": "word",
      "position": 5
    },
    {
      "token": "language",
      "start_offset": 26,
      "end_offset": 34,
      "type": "word",
      "position": 6
    },
    {
      "token": "in",
      "start_offset": 35,
      "end_offset": 37,
      "type": "word",
      "position": 7
    },
    {
      "token": "the",
      "start_offset": 38,
      "end_offset": 41,
      "type": "word",
      "position": 8
    },
    {
      "token": "world.",
      "start_offset": 42,
      "end_offset": 48,
      "type": "word",
      "position": 9
    }
  ]
}
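
For comparison, whitespace-only splitting is what Python's str.split() does when called without arguments. The helper name whitespace_analyze is made up for this sketch.

```python
def whitespace_analyze(text):
    """Split on runs of whitespace; leave case and punctuation untouched."""
    return text.split()

print(whitespace_analyze("In 2020, Java is the best language in the world."))
# ['In', '2020,', 'Java', 'is', 'the', 'best', 'language', 'in', 'the', 'world.']
```

A search for world would not match the term world. produced here, which is why this analyzer is usually reserved for input that is already cleanly delimited.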

3.4 Stop Analyzer

The Stop Analyzer consists of the Lower Case Tokenizer and the Stop Token Filter. Compared with the Simple Analyzer above, it adds stop filtering, which removes stopwords such as the, a, and is.

GET _analyze
{
  "analyzer": "stop",
  "text": "In 2020, Java is the best language in the world."
}

Response:

{
  "tokens": [
    {
      "token": "java",
      "start_offset": 9,
      "end_offset": 13,
      "type": "word",
      "position": 1
    },
    {
      "token": "best",
      "start_offset": 21,
      "end_offset": 25,
      "type": "word",
      "position": 4
    },
    {
      "token": "language",
      "start_offset": 26,
      "end_offset": 34,
      "type": "word",
      "position": 5
    },
    {
      "token": "world",
      "start_offset": 42,
      "end_offset": 47,
      "type": "word",
      "position": 8
    }
  ]
}
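
Notice that the positions in the response are 1, 4, 5 and 8 rather than 0 to 3: removed stopwords still consume a position. The Python sketch below reproduces that behaviour. The stopword set is a small hand-picked subset chosen for this example (the default English list in ES is longer), and stop_analyze is a made-up name.

```python
import re

# Hypothetical subset of English stopwords, enough for the sample sentence.
STOPWORDS = {"a", "an", "in", "is", "the"}

def stop_analyze(text):
    """Letter-run tokenization, lowercasing, then stopword removal."""
    tokens = []
    for position, match in enumerate(re.finditer(r"[^\W\d_]+", text)):
        term = match.group().lower()
        if term in STOPWORDS:
            continue  # the term is dropped, but its position stays used up
        tokens.append((term, position))
    return tokens

print(stop_analyze("In 2020, Java is the best language in the world."))
# [('java', 1), ('best', 4), ('language', 5), ('world', 8)]
```

Keeping the gaps matters for phrase queries, which rely on the distance between token positions.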

3.5 Keyword Analyzer

The Keyword Analyzer does no tokenization; it emits the entire input as a single term:

GET _analyze
{
  "analyzer": "keyword",
  "text": "In 2020, Java is the best language in the world."
}

Response:

{
  "tokens": [
    {
      "token": "In 2020, Java is the best language in the world.",
      "start_offset": 0,
      "end_offset": 48,
      "type": "word",
      "position": 0
    }
  ]
}

3.6 Pattern Analyzer

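The Pattern Analyzer tokenizes with a regular expression. The default pattern is \W+, which splits on runs of non-word characters; the resulting terms are lowercased by default: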

GET _analyze
{
  "analyzer": "pattern",
  "text": "In 2020, Java is the best language in the world."
}

Response:

{
  "tokens": [
    {
      "token": "in",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "2020",
      "start_offset": 3,
      "end_offset": 7,
      "type": "word",
      "position": 1
    },
    {
      "token": "java",
      "start_offset": 9,
      "end_offset": 13,
      "type": "word",
      "position": 2
    },
    {
      "token": "is",
      "start_offset": 14,
      "end_offset": 16,
      "type": "word",
      "position": 3
    },
    {
      "token": "the",
      "start_offset": 17,
      "end_offset": 20,
      "type": "word",
      "position": 4
    },
    {
      "token": "best",
      "start_offset": 21,
      "end_offset": 25,
      "type": "word",
      "position": 5
    },
    {
      "token": "language",
      "start_offset": 26,
      "end_offset": 34,
      "type": "word",
      "position": 6
    },
    {
      "token": "in",
      "start_offset": 35,
      "end_offset": 37,
      "type": "word",
      "position": 7
    },
    {
      "token": "the",
      "start_offset": 38,
      "end_offset": 41,
      "type": "word",
      "position": 8
    },
    {
      "token": "world",
      "start_offset": 42,
      "end_offset": 47,
      "type": "word",
      "position": 9
    }
  ]
}
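
The key point is that the pattern describes the separators, not the tokens. A minimal Python sketch of the same idea follows; pattern_analyze is a made-up helper, and Python's regex dialect differs from Java's in some details.

```python
import re

def pattern_analyze(text, pattern=r"\W+"):
    """Split on every match of `pattern`, drop empty pieces, lowercase the rest."""
    return [piece.lower() for piece in re.split(pattern, text) if piece]

print(pattern_analyze("In 2020, Java is the best language in the world."))
# ['in', '2020', 'java', 'is', 'the', 'best', 'language', 'in', 'the', 'world']

# A different separator pattern: split on commas and semicolons only.
print(pattern_analyze("Red,Green;Blue", r"[,;]"))
# ['red', 'green', 'blue']
```

The empty-piece check is needed because the trailing period leaves an empty string at the end of the split result.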

3.7 Language Analyzer

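ES provides Language Analyzers for input in different languages; you pick the analyzer for the language you need. Besides removing stopwords, the english analyzer also stems terms, which is why language appears as languag in the response. Here we analyze with english: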

GET _analyze
{
  "analyzer": "english",
  "text": "In 2020, Java is the best language in the world."
}

Response:

{
  "tokens": [
    {
      "token": "2020",
      "start_offset": 3,
      "end_offset": 7,
      "type": "<NUM>",
      "position": 1
    },
    {
      "token": "java",
      "start_offset": 9,
      "end_offset": 13,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "best",
      "start_offset": 21,
      "end_offset": 25,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "languag",
      "start_offset": 26,
      "end_offset": 34,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "world",
      "start_offset": 42,
      "end_offset": 47,
      "type": "<ALPHANUM>",
      "position": 9
    }
  ]
}

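4. Chinese Word Segmentation

Chinese word segmentation has its own difficulties. Unlike English, where words are naturally separated by spaces, a Chinese sentence cannot simply be split into individual characters. It has to be split into meaningful words, and the same sentence can be segmented differently depending on the context.

4.1 Using the Standard Analyzer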
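
To illustrate the ambiguity, here is a toy Python sketch of dictionary-based maximum matching, run forwards and backwards over the sample sentence used in this section. It is a textbook technique shown purely as an illustration; it is not a claim about how any ES plugin segments text, and the six-word dictionary is invented for this example.

```python
# Hypothetical toy dictionary; real segmenters ship far larger ones.
DICTIONARY = {"各国", "国有", "国有企业", "企业", "相继", "倒闭"}
MAX_LEN = max(len(word) for word in DICTIONARY)

def forward_max_match(text):
    """Scan left to right, always taking the longest dictionary word."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(MAX_LEN, len(text) - i), 0, -1):
            word = text[i:i + size]
            # Fall back to a single character when nothing longer matches.
            if size == 1 or word in DICTIONARY:
                tokens.append(word)
                i += size
                break
    return tokens

def backward_max_match(text):
    """Scan right to left, always taking the longest dictionary word."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(MAX_LEN, j), 0, -1):
            word = text[j - size:j]
            if size == 1 or word in DICTIONARY:
                tokens.append(word)
                j -= size
                break
    return tokens[::-1]

print(forward_max_match("各国有企业相继倒闭"))   # ['各国', '有', '企业', '相继', '倒闭']
print(backward_max_match("各国有企业相继倒闭"))  # ['各', '国有企业', '相继', '倒闭']
```

Both results are built from valid words, yet they read differently: "every country has enterprises..." versus "each state-owned enterprise...". Choosing between them requires context, not just a dictionary.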

GET _analyze
{
  "analyzer": "standard",
  "text": "各国有企业相继倒闭"
}

Response (the Standard Analyzer emits every Chinese character as a separate token):

{
  "tokens": [
    {
      "token": "各",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<IDEOGRAPHIC>",
      "position": 0
    },
    {
      "token": "国",
      "start_offset": 1,
      "end_offset": 2,
      "type": "<IDEOGRAPHIC>",
      "position": 1
    },
    {
      "token": "有",
      "start_offset": 2,
      "end_offset": 3,
      "type": "<IDEOGRAPHIC>",
      "position": 2
    },
    {
      "token": "企",
      "start_offset": 3,
      "end_offset": 4,
      "type": "<IDEOGRAPHIC>",
      "position": 3
    },
    {
      "token": "业",
      "start_offset": 4,
      "end_offset": 5,
      "type": "<IDEOGRAPHIC>",
      "position": 4
    },
    {
      "token": "相",
      "start_offset": 5,
      "end_offset": 6,
      "type": "<IDEOGRAPHIC>",
      "position": 5
    },
    {
      "token": "继",
      "start_offset": 6,
      "end_offset": 7,
      "type": "<IDEOGRAPHIC>",
      "position": 6
    },
    {
      "token": "倒",
      "start_offset": 7,
      "end_offset": 8,
      "type": "<IDEOGRAPHIC>",
      "position": 7
    },
    {
      "token": "闭",
      "start_offset": 8,
      "end_offset": 9,
      "type": "<IDEOGRAPHIC>",
      "position": 8
    }
  ]
}

4.2 Using ICU

ICU is not a built-in analyzer; you first need to install the Elasticsearch plugin with the following command:

bin/elasticsearch-plugin install analysis-icu

Docker version:

docker exec -it es01 /usr/share/elasticsearch/bin/elasticsearch-plugin install analysis-icu

After restarting the elasticsearch container, run the following request in Kibana:

GET _analyze
{
  "analyzer": "icu_analyzer",
  "text": "各国有企业相继倒闭"
}

Response:

{
  "tokens": [
    {
      "token": "各国",
      "start_offset": 0,
      "end_offset": 2,
      "type": "<IDEOGRAPHIC>",
      "position": 0
    },
    {
      "token": "有",
      "start_offset": 2,
      "end_offset": 3,
      "type": "<IDEOGRAPHIC>",
      "position": 1
    },
    {
      "token": "企业",
      "start_offset": 3,
      "end_offset": 5,
      "type": "<IDEOGRAPHIC>",
      "position": 2
    },
    {
      "token": "相继",
      "start_offset": 5,
      "end_offset": 7,
      "type": "<IDEOGRAPHIC>",
      "position": 3
    },
    {
      "token": "倒闭",
      "start_offset": 7,
      "end_offset": 9,
      "type": "<IDEOGRAPHIC>",
      "position": 4
    }
  ]
}

4.3 Other Recommended Analyzers

IK:

Supports custom dictionaries and hot updates of the segmentation dictionary
https://github.com/medcl/elasticsearch-analysis-ik

Installation command:

docker exec -it es01 /usr/share/elasticsearch/bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v8.8.2/elasticsearch-analysis-ik-8.8.2.zip

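The IK version must match the Elasticsearch version. At the time of writing, the latest IK release was 8.8.2, while the Elasticsearch on my machine was 8.9.1, so the installation failed.

I found an 8.9.1 build online: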

docker exec -it es01 /usr/share/elasticsearch/bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/files/12424800/elasticsearch-analysis-ik-8.9.1.zip

GET _analyze
{
  "analyzer": "ik_smart",
  "text": "各国有企业相继倒闭"
}

Response:

{
  "tokens": [
    {
      "token": "各",
      "start_offset": 0,
      "end_offset": 1,
      "type": "CN_CHAR",
      "position": 0
    },
    {
      "token": "国有企业",
      "start_offset": 1,
      "end_offset": 5,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "相继",
      "start_offset": 5,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "倒闭",
      "start_offset": 7,
      "end_offset": 9,
      "type": "CN_WORD",
      "position": 3
    }
  ]
}

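jieba:

The most popular Chinese word segmentation system in Python; supports segmentation and part-of-speech tagging
Supports Traditional Chinese segmentation, custom dictionaries, parallel segmentation, and more
https://github.com/sing1ee/elasticsearch-jieba-plugin

THULAC:

THU Lexical Analyzer for Chinese, a Chinese word segmenter from the Natural Language Processing and Computational Social Science Lab at Tsinghua University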
https://github.com/thunlp/THULAC-Java

posted @ 2023-09-07 16:09 emptyironbox
