Analyzers
The IK Analyzer
Installing the IK analyzer on each platform
elasticsearch-analysis-ik releases: https://github.com/medcl/elasticsearch-analysis-ik/releases
- Windows: create an ik directory inside the plugins directory of the Elasticsearch installation, then extract the archive downloaded from GitHub into it.
- Mac: create an ik directory inside the plugins directory of the Elasticsearch installation, then extract the archive downloaded from GitHub into it.
- CentOS: cd into the plugins directory of the Elasticsearch installation, then download and extract:
[root@cs ik]# wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.5.4/elasticsearch-analysis-ik-6.5.4.zip
[root@cs ik]# unzip elasticsearch-analysis-ik-6.5.4.zip
[root@cs ik]# ll
total 5832
-rw-r--r--. 1 root root 263965 May 6 2018 commons-codec-1.9.jar
-rw-r--r--. 1 root root 61829 May 6 2018 commons-logging-1.2.jar
drwxr-xr-x. 2 root root 4096 Aug 26 2018 config
-rw-r--r--. 1 root root 54693 Dec 23 2018 elasticsearch-analysis-ik-6.5.4.jar
-rw-r--r--. 1 root root 4504539 Dec 23 2018 elasticsearch-analysis-ik-6.5.4.zip
-rw-r--r--. 1 root root 736658 May 6 2018 httpclient-4.5.2.jar
-rw-r--r--. 1 root root 326724 May 6 2018 httpcore-4.4.4.jar
-rw-r--r--. 1 root root 1805 Dec 23 2018 plugin-descriptor.properties
-rw-r--r--. 1 root root 125 Dec 23 2018 plugin-security.policy
Testing
- First restart the elasticsearch and kibana services.
- Then open http://localhost:5601 (the Kibana address) in the browser, enter the command in the left pane of the Console in Dev Tools, and click the green run button to execute it.
GET _analyze
{
"analyzer": "ik_max_word",
"text": "学不学的会靠天收"
}
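The same `_analyze` call can also be issued outside Kibana as a plain HTTP request. A minimal Python sketch that builds the request (assuming Elasticsearch listens on the default `localhost:9200`; the actual network call is commented out so the snippet runs standalone):

```python
import json
from urllib import request

# The same _analyze request body used in the Kibana console above.
body = {
    "analyzer": "ik_max_word",
    "text": "学不学的会靠天收",
}
payload = json.dumps(body, ensure_ascii=False).encode("utf-8")

req = request.Request(
    "http://localhost:9200/_analyze",  # assumed default address
    data=payload,
    headers={"Content-Type": "application/json"},
)
# Uncomment against a running cluster:
# with request.urlopen(req) as resp:
#     print(json.loads(resp.read())["tokens"])
print(payload.decode("utf-8"))
```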
Overview of the ik directory
A brief look at the ik analyzer's configuration files:
- IKAnalyzer.cfg.xml: configures custom dictionaries.
- main.dic: ik's built-in Chinese dictionary, with roughly 270,000+ entries; any phrase listed here is kept together as a single token.
- surname.dic: Chinese surnames.
- suffix.dic: special (suffix) nouns, e.g. 乡, 江, 所, 省.
- preposition.dic: Chinese prepositions and particles, e.g. 不, 也, 了, 仍.
- stopword.dic: English stop words, e.g. a, an, and, the.
- quantifier.dic: measure words and units, e.g. 厘米, 件, 倍, 像素.
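The dictionary-driven splitting these files enable can be illustrated with a toy forward-maximum-matching segmenter in Python. This is a simplification (closer to ik_smart's coarse-grained output than to ik_max_word, which also emits overlapping sub-words), and the three-entry dictionary is made up for the demo; the real main.dic holds ~270,000 entries:

```python
def max_match(text, dictionary, max_len=4):
    """Toy forward maximum matching: at each position take the longest
    dictionary word that matches; fall back to a single character."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + size]
            if size == 1 or word in dictionary:
                tokens.append(word)
                i += size
                break
    return tokens

# A tiny stand-in for main.dic (hypothetical entries, for illustration only).
demo_dict = {"中华", "中华人民", "人民", "共和国"}
print(max_match("中华人民共和国", demo_dict))  # → ['中华人民', '共和国']
```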
Other analyzers
Simple Analyzer – splits on non-letter characters (symbols are discarded) and lowercases
GET _analyze
{
"analyzer": "simple",
"text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
Result
{
  "tokens" : [
    {
      "token" : "running",
      "start_offset" : 2,
      "end_offset" : 9,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "quick",
      "start_offset" : 10,
      "end_offset" : 15,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "brown",
      "start_offset" : 16,
      "end_offset" : 21,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "foxes",
      "start_offset" : 22,
      "end_offset" : 27,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "leap",
      "start_offset" : 28,
      "end_offset" : 32,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "over",
      "start_offset" : 33,
      "end_offset" : 37,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "lazy",
      "start_offset" : 38,
      "end_offset" : 42,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "dogs",
      "start_offset" : 43,
      "end_offset" : 47,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "in",
      "start_offset" : 48,
      "end_offset" : 50,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "the",
      "start_offset" : 51,
      "end_offset" : 54,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "summer",
      "start_offset" : 55,
      "end_offset" : 61,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : "evening",
      "start_offset" : 62,
      "end_offset" : 69,
      "type" : "word",
      "position" : 11
    }
  ]
}
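The simple analyzer's rule (split on anything that is not a letter, then lowercase) is easy to mimic. A rough Python equivalent, ignoring offsets and positions:

```python
import re

def simple_analyze(text):
    # Split on runs of non-letters and lowercase, like the simple analyzer.
    # Note the digit "2" and the hyphen in "brown-foxes" both act as separators.
    return [t.lower() for t in re.split(r"[^a-zA-Z]+", text) if t]

print(simple_analyze(
    "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
))
```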
Stop Analyzer – lowercases and filters out stop words (the, a, is, ...)
GET _analyze
{
"analyzer": "stop",
"text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
Result
{
  "tokens" : [
    {
      "token" : "running",
      "start_offset" : 2,
      "end_offset" : 9,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "quick",
      "start_offset" : 10,
      "end_offset" : 15,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "brown",
      "start_offset" : 16,
      "end_offset" : 21,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "foxes",
      "start_offset" : 22,
      "end_offset" : 27,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "leap",
      "start_offset" : 28,
      "end_offset" : 32,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "over",
      "start_offset" : 33,
      "end_offset" : 37,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "lazy",
      "start_offset" : 38,
      "end_offset" : 42,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "dogs",
      "start_offset" : 43,
      "end_offset" : 47,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "summer",
      "start_offset" : 55,
      "end_offset" : 61,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : "evening",
      "start_offset" : 62,
      "end_offset" : 69,
      "type" : "word",
      "position" : 11
    }
  ]
}
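Compared with the simple analyzer's output above, "in" and "the" are gone (their positions 8 and 9 are simply skipped). The stop analyzer is essentially the simple analyzer plus a stop-word filter; sketched in Python with a small hand-picked subset of the English stop-word list:

```python
import re

# A small subset of the default English stop-word list, for illustration.
STOP_WORDS = {"a", "an", "and", "the", "in", "is", "of", "to"}

def stop_analyze(text):
    # Same tokenization as the simple analyzer...
    tokens = [t.lower() for t in re.split(r"[^a-zA-Z]+", text) if t]
    # ...then drop stop words.
    return [t for t in tokens if t not in STOP_WORDS]

print(stop_analyze(
    "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
))
```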
The remaining analyzers are:
Whitespace Analyzer – splits on whitespace only, no lowercasing
Keyword Analyzer – no tokenization; the whole input is emitted as a single token
Pattern Analyzer – regular-expression based, default \W+ (splits on non-word characters)
Language analyzers – analyzers for 30+ common languages
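The pattern analyzer's default behavior (split on \W+, then lowercase) can likewise be approximated; unlike the simple analyzer, digits survive because \w matches them:

```python
import re

def pattern_analyze(text, pattern=r"\W+"):
    # Default pattern analyzer: split on non-word characters, lowercase.
    return [t.lower() for t in re.split(pattern, text) if t]

print(pattern_analyze(
    "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
))
```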
# Compare the output of the different analyzers
#standard
GET _analyze
{
"analyzer": "standard",
"text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
#simple
GET _analyze
{
"analyzer": "simple",
"text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
#stop
GET _analyze
{
"analyzer": "stop",
"text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
#whitespace
GET _analyze
{
"analyzer": "whitespace",
"text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
#keyword
GET _analyze
{
"analyzer": "keyword",
"text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
#pattern
GET _analyze
{
"analyzer": "pattern",
"text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
#english
GET _analyze
{
"analyzer": "english",
"text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
#icu_analyzer (requires the analysis-icu plugin)
POST _analyze
{
"analyzer": "icu_analyzer",
"text": "他说的确实在理"
}
POST _analyze
{
"analyzer": "standard",
"text": "他说的确实在理”"
}
POST _analyze
{
"analyzer": "icu_analyzer",
"text": "这个苹果不大好吃"
}
Only the self-disciplined are truly formidable.