Elasticsearch Analyzers

1. Analyzer Concepts

1.1 What is an analyzer?

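Text analysis is the process of converting full text into a series of terms (tokens); it is also called tokenization. In ES, analysis is performed by an analyzer. You can use one of ES's built-in analyzers or define a custom one as needed.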

A simple example: if you input Mastering Elasticsearch, it is automatically split into two terms, mastering and elasticsearch. Note that the terms have also been lowercased.

With a basic understanding of analysis and analyzers, let's look at what an analyzer is made of:

1.2 Components of an analyzer

An analyzer is the component dedicated to tokenization. It consists of the following three parts:

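  • Character Filters: process the raw text, for example stripping HTML tags
  • Tokenizer: splits the text into terms according to rules, for example on whitespace
  • Token Filters: post-process the resulting terms, for example lowercasing, removing stopwords, and adding synonyms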

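The three parts of an analyzer are applied in a fixed order: Character Filters, then the Tokenizer, then Token Filters. Incoming text is first preprocessed, then tokenized, and finally the resulting tokens are filtered.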
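The three-stage order can be sketched in plain Python. This is only an illustration of the pipeline idea, not how ES implements it: the function names and the tiny STOPWORDS set are made up for the example.

```python
import re

# Tiny illustrative stop word list (an assumption, not the ES default list).
STOPWORDS = {"the", "a", "is"}

def strip_html(text):
    """Character filter: drop anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", text)

def whitespace_tokenize(text):
    """Tokenizer: split the text on runs of whitespace."""
    return text.split()

def lowercase(tokens):
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]

def remove_stopwords(tokens):
    """Token filter: drop tokens found in STOPWORDS."""
    return [t for t in tokens if t not in STOPWORDS]

def analyze(text):
    # 1) character filters, 2) tokenizer, 3) token filters, in that order
    text = strip_html(text)
    tokens = whitespace_tokenize(text)
    for token_filter in (lowercase, remove_stopwords):
        tokens = token_filter(tokens)
    return tokens

print(analyze("<b>Mastering</b> Elasticsearch is fun"))
# ['mastering', 'elasticsearch', 'fun']
```

Swapping the order of the two token filters would change the result for capitalized stop words such as "The", which is why the order of the stages matters.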

ES ships with many built-in analyzers:

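  • Standard Analyzer - the default analyzer; splits by word and lowercases
  • Simple Analyzer - splits on non-letters (symbols are discarded) and lowercases
  • Stop Analyzer - lowercases and removes stop words (the, a, is)
  • Whitespace Analyzer - splits on whitespace; does not lowercase
  • Keyword Analyzer - no tokenization; the input is emitted as-is
  • Pattern Analyzer - splits by regular expression, \W+ by default
  • Language - analyzers for more than 30 common languages
  • Custom Analyzer - a user-defined analyzer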

2. Analyze API (_analyze)

The API can be used in the following three ways to see how an analyzer behaves:

2.1 Testing with a specified analyzer

GET _analyze
{
    "analyzer": "standard",
    "text" : "Mastering Elasticsearch , elasticsearch in Action"
}

2.2 Testing against a field of an index

POST books/_analyze
{
    "field": "title",
    "text": "Mastering Elasticsearch"
}

2.3 Testing with a custom tokenizer and filters

POST /_analyze
{
    "tokenizer": "standard", 
    "filter": ["lowercase"],
    "text": "Mastering Elasticsearch"
}
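Outside Kibana, the same three request shapes can be sent over HTTP. The sketch below uses only the Python standard library; the names ES_URL, build_analyze_request, and run_analyze are hypothetical, and it assumes a local node at http://localhost:9200 with no authentication (a secured cluster would need HTTPS and credentials).

```python
import json
import urllib.request
from typing import Optional

# Assumption: a local, unsecured node. Adjust for your own cluster.
ES_URL = "http://localhost:9200"

def build_analyze_request(body: dict, index: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a POST to /_analyze or /<index>/_analyze."""
    path = f"/{index}/_analyze" if index else "/_analyze"
    return urllib.request.Request(
        ES_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def run_analyze(body: dict, index: Optional[str] = None) -> list:
    """Send the request and return only the token strings (needs a running node)."""
    with urllib.request.urlopen(build_analyze_request(body, index)) as resp:
        return [t["token"] for t in json.load(resp)["tokens"]]

# 2.1: name an analyzer directly
by_analyzer = build_analyze_request(
    {"analyzer": "standard", "text": "Mastering Elasticsearch"})
# 2.2: use whatever analyzer the field "title" of index "books" is mapped to
by_field = build_analyze_request(
    {"field": "title", "text": "Mastering Elasticsearch"}, index="books")
# 2.3: assemble a tokenizer and token filters ad hoc
by_parts = build_analyze_request(
    {"tokenizer": "standard", "filter": ["lowercase"], "text": "Mastering Elasticsearch"})

print(by_field.get_method(), by_field.full_url)
```

Only the field-based form needs an index in the path, because the analyzer is looked up from that index's mapping.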

Now that we have covered the Analyze API, let's look at ES's built-in analyzers:

3. Built-in Analyzers in Practice

3.1 Standard Analyzer

The Standard Analyzer is ES's default analyzer. It splits the input text into words and then lowercases them; stop word removal is disabled by default.

GET _analyze
{
  "analyzer": "standard",
  "text": "In 2020, Java is the best language in the world."
}

Response:

{
  "tokens": [
    {
      "token": "in",
      "start_offset": 0,
      "end_offset": 2,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "2020",
      "start_offset": 3,
      "end_offset": 7,
      "type": "<NUM>",
      "position": 1
    },
    {
      "token": "java",
      "start_offset": 9,
      "end_offset": 13,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "is",
      "start_offset": 14,
      "end_offset": 16,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "the",
      "start_offset": 17,
      "end_offset": 20,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "best",
      "start_offset": 21,
      "end_offset": 25,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "language",
      "start_offset": 26,
      "end_offset": 34,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "in",
      "start_offset": 35,
      "end_offset": 37,
      "type": "<ALPHANUM>",
      "position": 7
    },
    {
      "token": "the",
      "start_offset": 38,
      "end_offset": 41,
      "type": "<ALPHANUM>",
      "position": 8
    },
    {
      "token": "world",
      "start_offset": 42,
      "end_offset": 47,
      "type": "<ALPHANUM>",
      "position": 9
    }
  ]
}

As you can see, the input was split on whitespace and punctuation. Java, for example, was lowercased, and stop words such as in were not removed.

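Here token is the resulting term, start_offset is the starting character offset, end_offset is the ending character offset, and position is the token's position in the token stream.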
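To make the four fields concrete, here is a rough Python sketch that derives them with a regular expression. It is not the Standard Analyzer's real algorithm (which follows Unicode text segmentation rules); for this plain ASCII sentence, splitting on \w+ and lowercasing happens to give the same tokens, offsets, and positions as the response above. The function name is made up for the example.

```python
import re

def tokens_with_offsets(text):
    """Return token dicts shaped like the _analyze response (minus "type")."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text)):
        tokens.append({
            "token": match.group().lower(),
            "start_offset": match.start(),  # index of the first character
            "end_offset": match.end(),      # index just past the last character
            "position": position,           # ordinal of the token in the stream
        })
    return tokens

text = "In 2020, Java is the best language in the world."
for tok in tokens_with_offsets(text):
    print(tok)

# The offsets slice the original (not lowercased) text:
print(text[9:13])  # Java
```

Because end_offset is exclusive, text[start_offset:end_offset] always recovers the original substring.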

3.2 Simple Analyzer

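It consists only of the Lower Case Tokenizer, which splits on non-letters, discards the non-letter characters, and lowercases the resulting tokens. Using the same input text as before, switch the analyzer to simple: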

GET _analyze
{
  "analyzer": "simple",
  "text": "In 2020, Java is the best language in the world."
}

Response:

{
  "tokens": [
    {
      "token": "in",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "java",
      "start_offset": 9,
      "end_offset": 13,
      "type": "word",
      "position": 1
    },
    {
      "token": "is",
      "start_offset": 14,
      "end_offset": 16,
      "type": "word",
      "position": 2
    },
    {
      "token": "the",
      "start_offset": 17,
      "end_offset": 20,
      "type": "word",
      "position": 3
    },
    {
      "token": "best",
      "start_offset": 21,
      "end_offset": 25,
      "type": "word",
      "position": 4
    },
    {
      "token": "language",
      "start_offset": 26,
      "end_offset": 34,
      "type": "word",
      "position": 5
    },
    {
      "token": "in",
      "start_offset": 35,
      "end_offset": 37,
      "type": "word",
      "position": 6
    },
    {
      "token": "the",
      "start_offset": 38,
      "end_offset": 41,
      "type": "word",
      "position": 7
    },
    {
      "token": "world",
      "start_offset": 42,
      "end_offset": 47,
      "type": "word",
      "position": 8
    }
  ]
}
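The behavior seen above (2020 disappears, everything is lowercased) can be approximated in a couple of lines of Python. This sketch only handles ASCII letters to stay short, whereas the real tokenizer works on Unicode letters; simple_analyze is a made-up name.

```python
import re

def simple_analyze(text):
    """Keep maximal runs of ASCII letters, lowercased; drop everything else."""
    return [m.group().lower() for m in re.finditer(r"[A-Za-z]+", text)]

print(simple_analyze("In 2020, Java is the best language in the world."))
# ['in', 'java', 'is', 'the', 'best', 'language', 'in', 'the', 'world']

# Digits and punctuation act as separators, so mixed tokens are broken apart:
print(simple_analyze("Wi-Fi 6E"))
# ['wi', 'fi', 'e']
```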

3.3 Whitespace Analyzer

The Whitespace Analyzer splits on whitespace only, with no lowercasing, so punctuation stays attached to the tokens:

GET _analyze
{
  "analyzer": "whitespace",
  "text": "In 2020, Java is the best language in the world."
}

Response:

{
  "tokens": [
    {
      "token": "In",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "2020,",
      "start_offset": 3,
      "end_offset": 8,
      "type": "word",
      "position": 1
    },
    {
      "token": "Java",
      "start_offset": 9,
      "end_offset": 13,
      "type": "word",
      "position": 2
    },
    {
      "token": "is",
      "start_offset": 14,
      "end_offset": 16,
      "type": "word",
      "position": 3
    },
    {
      "token": "the",
      "start_offset": 17,
      "end_offset": 20,
      "type": "word",
      "position": 4
    },
    {
      "token": "best",
      "start_offset": 21,
      "end_offset": 25,
      "type": "word",
      "position": 5
    },
    {
      "token": "language",
      "start_offset": 26,
      "end_offset": 34,
      "type": "word",
      "position": 6
    },
    {
      "token": "in",
      "start_offset": 35,
      "end_offset": 37,
      "type": "word",
      "position": 7
    },
    {
      "token": "the",
      "start_offset": 38,
      "end_offset": 41,
      "type": "word",
      "position": 8
    },
    {
      "token": "world.",
      "start_offset": 42,
      "end_offset": 48,
      "type": "word",
      "position": 9
    }
  ]
}

3.4 Stop Analyzer

It consists of the Lower Case Tokenizer and the Stop Token Filter. Compared with the Simple Analyzer above, it adds the stop filter, which removes stop words such as the, a, and is.

GET _analyze
{
  "analyzer": "stop",
  "text": "In 2020, Java is the best language in the world."
}

Response:

{
  "tokens": [
    {
      "token": "java",
      "start_offset": 9,
      "end_offset": 13,
      "type": "word",
      "position": 1
    },
    {
      "token": "best",
      "start_offset": 21,
      "end_offset": 25,
      "type": "word",
      "position": 4
    },
    {
      "token": "language",
      "start_offset": 26,
      "end_offset": 34,
      "type": "word",
      "position": 5
    },
    {
      "token": "world",
      "start_offset": 42,
      "end_offset": 47,
      "type": "word",
      "position": 8
    }
  ]
}
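Note that the positions in the response are 1, 4, 5, 8 rather than 0 to 3: removed stop words still consume a position. The Python sketch below reproduces this under two assumptions: tokenization is approximated with ASCII letter runs, and the STOP set is my understanding of the default English stop word list, which may differ between versions.

```python
import re

# Assumption: believed to mirror the default English stop words.
STOP = frozenset(
    "a an and are as at be but by for if in into is it no not of on or "
    "such that the their then there these they this to was will with".split()
)

def stop_analyze(text):
    """Return (token, position) pairs, keeping the gaps left by stop words."""
    kept = []
    for position, match in enumerate(re.finditer(r"[A-Za-z]+", text)):
        token = match.group().lower()
        if token not in STOP:
            kept.append((token, position))
    return kept

print(stop_analyze("In 2020, Java is the best language in the world."))
# [('java', 1), ('best', 4), ('language', 5), ('world', 8)]
```

Keeping the gaps matters for phrase queries, which rely on the distance between token positions.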

3.5 Keyword Analyzer

The Keyword Analyzer does no tokenization; it emits the entire input as a single term:

GET _analyze
{
  "analyzer": "keyword",
  "text": "In 2020, Java is the best language in the world."
}

Response:

{
  "tokens": [
    {
      "token": "In 2020, Java is the best language in the world.",
      "start_offset": 0,
      "end_offset": 48,
      "type": "word",
      "position": 0
    }
  ]
}

3.6 Pattern Analyzer

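It tokenizes using a regular expression. The default pattern is \W+, which splits on runs of non-word characters; the resulting tokens are also lowercased: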

GET _analyze
{
  "analyzer": "pattern",
  "text": "In 2020, Java is the best language in the world."
}

Response:

{
  "tokens": [
    {
      "token": "in",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "2020",
      "start_offset": 3,
      "end_offset": 7,
      "type": "word",
      "position": 1
    },
    {
      "token": "java",
      "start_offset": 9,
      "end_offset": 13,
      "type": "word",
      "position": 2
    },
    {
      "token": "is",
      "start_offset": 14,
      "end_offset": 16,
      "type": "word",
      "position": 3
    },
    {
      "token": "the",
      "start_offset": 17,
      "end_offset": 20,
      "type": "word",
      "position": 4
    },
    {
      "token": "best",
      "start_offset": 21,
      "end_offset": 25,
      "type": "word",
      "position": 5
    },
    {
      "token": "language",
      "start_offset": 26,
      "end_offset": 34,
      "type": "word",
      "position": 6
    },
    {
      "token": "in",
      "start_offset": 35,
      "end_offset": 37,
      "type": "word",
      "position": 7
    },
    {
      "token": "the",
      "start_offset": 38,
      "end_offset": 41,
      "type": "word",
      "position": 8
    },
    {
      "token": "world",
      "start_offset": 42,
      "end_offset": 47,
      "type": "word",
      "position": 9
    }
  ]
}
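The key point is that the pattern describes the separators, not the tokens. A minimal Python approximation follows; pattern_analyze is a made-up name, and Python's \W is Unicode-aware by default while Java's is not, so the two can disagree on non-ASCII text.

```python
import re

def pattern_analyze(text, pattern=r"\W+"):
    """Split on the separator pattern, drop empty pieces, lowercase the rest."""
    return [piece.lower() for piece in re.split(pattern, text) if piece]

print(pattern_analyze("In 2020, Java is the best language in the world."))
# ['in', '2020', 'java', 'is', 'the', 'best', 'language', 'in', 'the', 'world']

# A different separator pattern changes where the text is cut:
print(pattern_analyze("a,b;c", r"[,;]"))
# ['a', 'b', 'c']
```

The empty-string filter is needed because the trailing period leaves an empty piece at the end of the split.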

3.7 Language Analyzer

ES provides Language Analyzers for text in different languages; you select the language by name. Here we analyze with english:

GET _analyze
{
  "analyzer": "english",
  "text": "In 2020, Java is the best language in the world."
}

Response:

{
  "tokens": [
    {
      "token": "2020",
      "start_offset": 3,
      "end_offset": 7,
      "type": "<NUM>",
      "position": 1
    },
    {
      "token": "java",
      "start_offset": 9,
      "end_offset": 13,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "best",
      "start_offset": 21,
      "end_offset": 25,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "languag",
      "start_offset": 26,
      "end_offset": 34,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "world",
      "start_offset": 42,
      "end_offset": 47,
      "type": "<ALPHANUM>",
      "position": 9
    }
  ]
}

4. Chinese Word Segmentation

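Chinese word segmentation has its own difficulties. Unlike English, where words are naturally separated by spaces, a Chinese sentence cannot simply be split into individual characters; it has to be split into meaningful words, and the same character sequence can be segmented differently depending on context.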
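The ambiguity can be shown with a toy dictionary-based segmenter. The sketch below is a classic textbook technique (forward and backward maximum matching), not the algorithm used by any particular ES plugin, and the DICTIONARY contents are invented for this example. Scanning the same sentence from the left or from the right produces two different, equally dictionary-valid segmentations.

```python
# Toy dictionary, invented for this illustration.
DICTIONARY = {"各国", "国有", "企业", "国有企业", "相继", "倒闭"}
MAX_LEN = max(len(word) for word in DICTIONARY)

def forward_max_match(text):
    """Greedy left-to-right: always take the longest dictionary word."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(MAX_LEN, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when nothing matches.
            if size == 1 or piece in DICTIONARY:
                tokens.append(piece)
                i += size
                break
    return tokens

def backward_max_match(text):
    """Greedy right-to-left: always take the longest dictionary word."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(MAX_LEN, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in DICTIONARY:
                tokens.append(piece)
                j -= size
                break
    return tokens[::-1]

sentence = "各国有企业相继倒闭"
print(forward_max_match(sentence))   # ['各国', '有', '企业', '相继', '倒闭']
print(backward_max_match(sentence))  # ['各', '国有企业', '相继', '倒闭']
```

Both outputs are plausible readings, so a real segmenter needs more than a greedy dictionary lookup (for example word frequencies or a statistical model) to choose between them.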

4.1 Using the Standard Analyzer

GET _analyze
{
  "analyzer": "standard",
  "text": "各国有企业相继倒闭"
}

Response:

{
  "tokens": [
    {
      "token": "各",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<IDEOGRAPHIC>",
      "position": 0
    },
    {
      "token": "国",
      "start_offset": 1,
      "end_offset": 2,
      "type": "<IDEOGRAPHIC>",
      "position": 1
    },
    {
      "token": "有",
      "start_offset": 2,
      "end_offset": 3,
      "type": "<IDEOGRAPHIC>",
      "position": 2
    },
    {
      "token": "企",
      "start_offset": 3,
      "end_offset": 4,
      "type": "<IDEOGRAPHIC>",
      "position": 3
    },
    {
      "token": "业",
      "start_offset": 4,
      "end_offset": 5,
      "type": "<IDEOGRAPHIC>",
      "position": 4
    },
    {
      "token": "相",
      "start_offset": 5,
      "end_offset": 6,
      "type": "<IDEOGRAPHIC>",
      "position": 5
    },
    {
      "token": "继",
      "start_offset": 6,
      "end_offset": 7,
      "type": "<IDEOGRAPHIC>",
      "position": 6
    },
    {
      "token": "倒",
      "start_offset": 7,
      "end_offset": 8,
      "type": "<IDEOGRAPHIC>",
      "position": 7
    },
    {
      "token": "闭",
      "start_offset": 8,
      "end_offset": 9,
      "type": "<IDEOGRAPHIC>",
      "position": 8
    }
  ]
}

4.2 Using ICU

ICU is not a built-in analyzer; you first need to install the Elasticsearch plugin with the following command:

bin/elasticsearch-plugin install analysis-icu

Docker version:

docker exec -it es01 /usr/share/elasticsearch/bin/elasticsearch-plugin install analysis-icu

After restarting the Elasticsearch container, run the following in Kibana:

GET _analyze
{
  "analyzer": "icu_analyzer",
  "text": "各国有企业相继倒闭"
}

Response:

{
  "tokens": [
    {
      "token": "各国",
      "start_offset": 0,
      "end_offset": 2,
      "type": "<IDEOGRAPHIC>",
      "position": 0
    },
    {
      "token": "有",
      "start_offset": 2,
      "end_offset": 3,
      "type": "<IDEOGRAPHIC>",
      "position": 1
    },
    {
      "token": "企业",
      "start_offset": 3,
      "end_offset": 5,
      "type": "<IDEOGRAPHIC>",
      "position": 2
    },
    {
      "token": "相继",
      "start_offset": 5,
      "end_offset": 7,
      "type": "<IDEOGRAPHIC>",
      "position": 3
    },
    {
      "token": "倒闭",
      "start_offset": 7,
      "end_offset": 9,
      "type": "<IDEOGRAPHIC>",
      "position": 4
    }
  ]
}

4.3 Other recommended analyzers

IK

Install command:

docker exec -it es01 /usr/share/elasticsearch/bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v8.8.2/elasticsearch-analysis-ik-8.8.2.zip

The IK version must match the Elasticsearch version. At the time of writing, the latest IK release was 8.8.2 while my local Elasticsearch was 8.9.1, so the installation failed.

I found an 8.9.1 build online:

docker exec -it es01 /usr/share/elasticsearch/bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/files/12424800/elasticsearch-analysis-ik-8.9.1.zip

GET _analyze
{
  "analyzer": "ik_smart",
  "text": "各国有企业相继倒闭"
}

Response:

{
  "tokens": [
    {
      "token": "各",
      "start_offset": 0,
      "end_offset": 1,
      "type": "CN_CHAR",
      "position": 0
    },
    {
      "token": "国有企业",
      "start_offset": 1,
      "end_offset": 5,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "相继",
      "start_offset": 5,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "倒闭",
      "start_offset": 7,
      "end_offset": 9,
      "type": "CN_WORD",
      "position": 3
    }
  ]
}

jieba

THULAC

posted @ 2023-09-07 16:09  emptyironbox