Analyzers
The IK Analyzer
Installing the IK analyzer on each platform
elasticsearch-analysis-ik releases: https://github.com/medcl/elasticsearch-analysis-ik/releases
- Windows: create an ik directory inside the plugins directory of the Elasticsearch installation, then extract the archive downloaded from GitHub into it.
- Mac: create an ik directory inside the plugins directory of the Elasticsearch installation, then extract the archive downloaded from GitHub into it.
- CentOS: cd into the plugins directory of the Elasticsearch installation, then download and extract:
[root@cs ik]# wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.5.4/elasticsearch-analysis-ik-6.5.4.zip
[root@cs ik]# unzip elasticsearch-analysis-ik-6.5.4.zip
[root@cs ik]# ll
total 5832
-rw-r--r--. 1 root root 263965 May 6 2018 commons-codec-1.9.jar
-rw-r--r--. 1 root root 61829 May 6 2018 commons-logging-1.2.jar
drwxr-xr-x. 2 root root 4096 Aug 26 2018 config
-rw-r--r--. 1 root root 54693 Dec 23 2018 elasticsearch-analysis-ik-6.5.4.jar
-rw-r--r--. 1 root root 4504539 Dec 23 2018 elasticsearch-analysis-ik-6.5.4.zip
-rw-r--r--. 1 root root 736658 May 6 2018 httpclient-4.5.2.jar
-rw-r--r--. 1 root root 326724 May 6 2018 httpcore-4.4.4.jar
-rw-r--r--. 1 root root 1805 Dec 23 2018 plugin-descriptor.properties
-rw-r--r--. 1 root root 125 Dec 23 2018 plugin-security.policy
Testing
- First restart the elasticsearch and kibana services.
- Then open http://localhost:5601 (the Kibana address) in the browser, enter the command in the left pane of the Console in Dev Tools, and click the green run button to execute it.
GET _analyze
{
"analyzer": "ik_max_word",
"text": "学不学的会靠天收"
}
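The same `_analyze` call can also be issued outside Kibana as a plain HTTP request. A minimal Python sketch that builds the request (assuming Elasticsearch listens on the default `localhost:9200`; the actual network call is commented out so the snippet runs standalone):

```python
import json
from urllib import request

# The same _analyze request body used in the Kibana console above.
body = {
    "analyzer": "ik_max_word",
    "text": "学不学的会靠天收",
}
payload = json.dumps(body, ensure_ascii=False).encode("utf-8")

req = request.Request(
    "http://localhost:9200/_analyze",  # assumed default address
    data=payload,
    headers={"Content-Type": "application/json"},
)
# Uncomment against a running cluster:
# with request.urlopen(req) as resp:
#     print(json.loads(resp.read())["tokens"])
print(payload.decode("utf-8"))
```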
Overview of the ik directory
A brief look at the ik analyzer's configuration files:
- IKAnalyzer.cfg.xml: configures custom dictionaries.
- main.dic: ik's built-in Chinese dictionary, with roughly 270,000+ entries; any phrase listed here is kept together as a single token.
- surname.dic: Chinese surnames.
- suffix.dic: special (suffix) nouns, e.g. 乡, 江, 所, 省.
- preposition.dic: Chinese prepositions and particles, e.g. 不, 也, 了, 仍.
- stopword.dic: English stop words, e.g. a, an, and, the.
- quantifier.dic: measure words and units, e.g. 厘米, 件, 倍, 像素.
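The dictionary-driven splitting these files enable can be illustrated with a toy forward-maximum-matching segmenter in Python. This is a simplification (closer to ik_smart's coarse-grained output than to ik_max_word, which also emits overlapping sub-words), and the three-entry dictionary is made up for the demo; the real main.dic holds ~270,000 entries:

```python
def max_match(text, dictionary, max_len=4):
    """Toy forward maximum matching: at each position take the longest
    dictionary word that matches; fall back to a single character."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + size]
            if size == 1 or word in dictionary:
                tokens.append(word)
                i += size
                break
    return tokens

# A tiny stand-in for main.dic (hypothetical entries, for illustration only).
demo_dict = {"中华", "中华人民", "人民", "共和国"}
print(max_match("中华人民共和国", demo_dict))  # → ['中华人民', '共和国']
```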
Other analyzers
Simple Analyzer – splits on non-letter characters (symbols are discarded) and lowercases
GET _analyze
{
"analyzer": "simple",
"text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
Result
{
  "tokens" : [
    {
      "token" : "running",
      "start_offset" : 2,
      "end_offset" : 9,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "quick",
      "start_offset" : 10,
      "end_offset" : 15,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "brown",
      "start_offset" : 16,
      "end_offset" : 21,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "foxes",
      "start_offset" : 22,
      "end_offset" : 27,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "leap",
      "start_offset" : 28,
      "end_offset" : 32,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "over",
      "start_offset" : 33,
      "end_offset" : 37,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "lazy",
      "start_offset" : 38,
      "end_offset" : 42,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "dogs",
      "start_offset" : 43,
      "end_offset" : 47,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "in",
      "start_offset" : 48,
      "end_offset" : 50,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "the",
      "start_offset" : 51,
      "end_offset" : 54,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "summer",
      "start_offset" : 55,
      "end_offset" : 61,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : "evening",
      "start_offset" : 62,
      "end_offset" : 69,
      "type" : "word",
      "position" : 11
    }
  ]
}
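The simple analyzer's rule (split on anything that is not a letter, then lowercase) is easy to mimic. A rough Python equivalent, ignoring offsets and positions:

```python
import re

def simple_analyze(text):
    # Split on runs of non-letters and lowercase, like the simple analyzer.
    # Note the digit "2" and the hyphen in "brown-foxes" both act as separators.
    return [t.lower() for t in re.split(r"[^a-zA-Z]+", text) if t]

print(simple_analyze(
    "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
))
```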
Stop Analyzer – lowercases and filters out stop words (the, a, is, ...)
GET _analyze
{
"analyzer": "stop",
"text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
Result
{
  "tokens" : [
    {
      "token" : "running",
      "start_offset" : 2,
      "end_offset" : 9,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "quick",
      "start_offset" : 10,
      "end_offset" : 15,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "brown",
      "start_offset" : 16,
      "end_offset" : 21,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "foxes",
      "start_offset" : 22,
      "end_offset" : 27,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "leap",
      "start_offset" : 28,
      "end_offset" : 32,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "over",
      "start_offset" : 33,
      "end_offset" : 37,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "lazy",
      "start_offset" : 38,
      "end_offset" : 42,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "dogs",
      "start_offset" : 43,
      "end_offset" : 47,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "summer",
      "start_offset" : 55,
      "end_offset" : 61,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : "evening",
      "start_offset" : 62,
      "end_offset" : 69,
      "type" : "word",
      "position" : 11
    }
  ]
}
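Compared with the simple analyzer's output above, "in" and "the" are gone (their positions 8 and 9 are simply skipped). The stop analyzer is essentially the simple analyzer plus a stop-word filter; sketched in Python with a small hand-picked subset of the English stop-word list:

```python
import re

# A small subset of the default English stop-word list, for illustration.
STOP_WORDS = {"a", "an", "and", "the", "in", "is", "of", "to"}

def stop_analyze(text):
    # Same tokenization as the simple analyzer...
    tokens = [t.lower() for t in re.split(r"[^a-zA-Z]+", text) if t]
    # ...then drop stop words.
    return [t for t in tokens if t not in STOP_WORDS]

print(stop_analyze(
    "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
))
```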
The remaining analyzers are:
Whitespace Analyzer – splits on whitespace only, no lowercasing
Keyword Analyzer – no tokenization; the whole input is emitted as a single token
Pattern Analyzer – regular-expression based, default \W+ (splits on non-word characters)
Language analyzers – analyzers for 30+ common languages
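The pattern analyzer's default behavior (split on \W+, then lowercase) can likewise be approximated; unlike the simple analyzer, digits survive because \w matches them:

```python
import re

def pattern_analyze(text, pattern=r"\W+"):
    # Default pattern analyzer: split on non-word characters, lowercase.
    return [t.lower() for t in re.split(pattern, text) if t]

print(pattern_analyze(
    "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
))
```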
# Compare the output of the different analyzers
#standard
GET _analyze
{
"analyzer": "standard",
"text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
#simple
GET _analyze
{
"analyzer": "simple",
"text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
#stop
GET _analyze
{
"analyzer": "stop",
"text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
#whitespace
GET _analyze
{
"analyzer": "whitespace",
"text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
#keyword
GET _analyze
{
"analyzer": "keyword",
"text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
#pattern
GET _analyze
{
"analyzer": "pattern",
"text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
#english
GET _analyze
{
"analyzer": "english",
"text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
#icu_analyzer (requires the analysis-icu plugin)
POST _analyze
{
"analyzer": "icu_analyzer",
"text": "他说的确实在理"
}
POST _analyze
{
"analyzer": "standard",
"text": "他说的确实在理”"
}
POST _analyze
{
"analyzer": "icu_analyzer",
"text": "这个苹果不大好吃"
}
Only the self-disciplined are truly formidable.