Elasticsearch (v2.4.6): Adding the IK Chinese Analyzer
1. References
2. Building and Installing analysis-ik
2.1 Download the source
git clone --depth 1 --branch v1.10.6 https://github.com/medcl/elasticsearch-analysis-ik.git
ES 2.4.6 corresponds to IK v1.10.6, so we clone only that tag (--depth 1 keeps the clone shallow).
2.2 Build
(1) Download and install Maven
# download the Maven binary distribution
wget https://mirror.olnevhost.net/pub/apache/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz
# extract into the install directory
mkdir /usr/local/maven
tar -zxvf apache-maven-3.6.3-bin.tar.gz --directory /usr/local/maven
# set environment variables (append these lines to /etc/profile, then reload it)
export JAVA_HOME=/home/java/jdk1.8.0_131
export MAVEN_HOME=/usr/local/maven/apache-maven-3.6.3
export PATH=$PATH:$JAVA_HOME/bin:$MAVEN_HOME/bin
source /etc/profile
# verify the installation
mvn --version
(2) Build the IK plugin
# build; the artifact lands in target/releases/
cd elasticsearch-analysis-ik/
mvn package
# copy the built plugin into the ES plugins directory and unpack it
cd target/releases/
mkdir -p /home/elastic/elasticsearch-2.4.6/plugins/ik
cp elasticsearch-analysis-ik-1.10.6.zip /home/elastic/elasticsearch-2.4.6/plugins/ik/
cd /home/elastic/elasticsearch-2.4.6/plugins/ik/
unzip elasticsearch-analysis-ik-1.10.6.zip
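To confirm the plugin files are in place, you can list the plugin directory (a quick sanity check, assuming the paths above):
# the unpacked plugin directory should contain the plugin jar plus its config
ls /home/elastic/elasticsearch-2.4.6/plugins/ik/
# ES 2.x also ships a plugin utility that scans the plugins directory
/home/elastic/elasticsearch-2.4.6/bin/plugin list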
2.3 Restart the ES service
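A minimal restart sketch, assuming ES was started from the tarball as the elastic user (adjust the pkill pattern and paths to your setup):
# stop the running node
pkill -f org.elasticsearch.bootstrap.Elasticsearch
# start it again as a daemon
su - elastic -c '/home/elastic/elasticsearch-2.4.6/bin/elasticsearch -d'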
3. Testing IK Segmentation
3.1 The built-in analyzer
# request
GET http://127.0.0.1:9200/_analyze
{
  "text": "正是江南好风景"
}
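The same request as a curl one-liner (the later requests follow the same pattern):
curl -XGET 'http://127.0.0.1:9200/_analyze?pretty' -d '{"text": "正是江南好风景"}'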
# response
{
  "tokens": [
    { "token": "正", "start_offset": 0, "end_offset": 1, "type": "<IDEOGRAPHIC>", "position": 0 },
    { "token": "是", "start_offset": 1, "end_offset": 2, "type": "<IDEOGRAPHIC>", "position": 1 },
    { "token": "江", "start_offset": 2, "end_offset": 3, "type": "<IDEOGRAPHIC>", "position": 2 },
    { "token": "南", "start_offset": 3, "end_offset": 4, "type": "<IDEOGRAPHIC>", "position": 3 },
    { "token": "好", "start_offset": 4, "end_offset": 5, "type": "<IDEOGRAPHIC>", "position": 4 },
    { "token": "风", "start_offset": 5, "end_offset": 6, "type": "<IDEOGRAPHIC>", "position": 5 },
    { "token": "景", "start_offset": 6, "end_offset": 7, "type": "<IDEOGRAPHIC>", "position": 6 }
  ]
}
3.2 IK's ik_max_word analyzer
# request
GET http://127.0.0.1:9200/_analyze
{
  "analyzer": "ik_max_word",
  "text": "正是江南好风景"
}
# response
{
  "tokens": [
    { "token": "正是", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 0 },
    { "token": "江南", "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 1 },
    { "token": "江", "start_offset": 2, "end_offset": 3, "type": "CN_WORD", "position": 2 },
    { "token": "南", "start_offset": 3, "end_offset": 4, "type": "CN_CHAR", "position": 3 },
    { "token": "好", "start_offset": 4, "end_offset": 5, "type": "CN_CHAR", "position": 4 },
    { "token": "风景", "start_offset": 5, "end_offset": 7, "type": "CN_WORD", "position": 5 },
    { "token": "景", "start_offset": 6, "end_offset": 7, "type": "CN_WORD", "position": 6 }
  ]
}
3.3 IK's ik_smart analyzer
# request
GET http://127.0.0.1:9200/_analyze
{
  "analyzer": "ik_smart",
  "text": "正是江南好风景"
}
# response
{
  "tokens": [
    { "token": "正是", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 0 },
    { "token": "江南", "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 1 },
    { "token": "好", "start_offset": 4, "end_offset": 5, "type": "CN_CHAR", "position": 2 },
    { "token": "风景", "start_offset": 5, "end_offset": 7, "type": "CN_WORD", "position": 3 }
  ]
}
3.4 Comparing the results
(1) The default analyzer splits Chinese text into individual characters, which is unsuitable for most use cases.
(2) ik_max_word produces the finest-grained segmentation (emitting overlapping terms such as 江南 alongside 江 and 南), while ik_smart does the opposite and produces the coarsest-grained segmentation. A common pattern, sketched below, is to index with ik_max_word and search with ik_smart.
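A minimal sketch of that pattern using the ES 2.x string-field mapping syntax; the index name news and field content are illustrative examples, not anything from the steps above:
# create an index whose "content" field uses IK:
# ik_max_word at index time, ik_smart at search time
curl -XPUT 'http://127.0.0.1:9200/news' -d '{
  "mappings": {
    "article": {
      "properties": {
        "content": {
          "type": "string",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_smart"
        }
      }
    }
  }
}'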