Using the IK Analyzer with Elasticsearch and Rebuilding the Index
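Background
The index was originally built with Elasticsearch's default analyzer. With that analyzer, searches that use Chinese keywords return imprecise results, so the plan is to switch to the IK analyzer.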
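Solution
Installing the IK analyzer
Go to the plugins directory under the Elasticsearch installation directory, for example /mnt/public/elasticsearch-7.17.3/plugins, and create a directory named ik.
Download elasticsearch-analysis-ik-7.17.3. Note: the IK plugin version must match the Elasticsearch version exactly, otherwise Elasticsearch will not start. After downloading, upload the archive to /mnt/public/elasticsearch-7.17.3/plugins/ik and extract it there.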
Make sure that "elasticsearch.version" in plugin-descriptor.properties matches the Elasticsearch version you are running; otherwise startup will fail.
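The version check can be scripted. The following is a minimal sketch, assuming the descriptor uses the usual key=value properties format; the function names descriptor_es_version and check_ik_version are made up for illustration.

```shell
# Print the value of elasticsearch.version from a plugin descriptor file.
descriptor_es_version() {
  # $1: path to plugin-descriptor.properties
  sed -n 's/^elasticsearch\.version[[:space:]]*=[[:space:]]*//p' "$1" | tr -d '\r' | head -n 1
}

# Succeed only if the descriptor targets the expected Elasticsearch version.
check_ik_version() {
  # $1: descriptor path, $2: expected version such as 7.17.3
  local found
  found="$(descriptor_es_version "$1")"
  if [ "$found" = "$2" ]; then
    echo "OK: plugin targets Elasticsearch $found"
  else
    echo "MISMATCH: plugin targets '$found', expected '$2'" >&2
    return 1
  fi
}
```

For the paths used in this post, the call would be: check_ik_version /mnt/public/elasticsearch-7.17.3/plugins/ik/plugin-descriptor.properties 7.17.3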
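Next, go to the bin directory and run ./elasticsearch-plugin list to confirm that the plugin was installed.
Then restart Elasticsearch. Note: before restarting, switch to the user that runs the Elasticsearch service, otherwise the restart will fail with an error.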
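The plugin check can also be automated before the restart. This sketch assumes that the list command prints one plugin per line and that the plugin shows up under the directory name ik created earlier; the name printed on your installation may differ. The helper name plugin_listed is hypothetical.

```shell
# Succeed if the given plugin name appears as a whole line on stdin.
plugin_listed() {
  # $1: plugin name to look for, e.g. ik (assumed to be what the list prints)
  grep -qx -- "$1"
}
```

Usage, from the bin directory: ./elasticsearch-plugin list | plugin_listed ik && echo "ik is installed"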
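The IK analyzer
The IK plugin provides two analyzers; choose between them according to your needs. Taking the keyword 纳税人 (taxpayer) as an example:
ik_max_word: splits the text at the finest granularity. For example, "纳税人" is split into "纳税人", "纳税", and "人", exhausting the possible combinations.
ik_smart: splits the text at the coarsest granularity. For example, "纳税人" is not split further and yields a single token, "纳税人".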
Without the IK analyzer
{
"tokens": [
{
"token": "纳",
"start_offset": 0,
"end_offset": 1,
"type": "<IDEOGRAPHIC>",
"position": 0
},
{
"token": "税",
"start_offset": 1,
"end_offset": 2,
"type": "<IDEOGRAPHIC>",
"position": 1
},
{
"token": "人",
"start_offset": 2,
"end_offset": 3,
"type": "<IDEOGRAPHIC>",
"position": 2
}
]
}
With the ik_max_word analyzer
{
"tokens": [
{
"token": "纳税人",
"start_offset": 0,
"end_offset": 3,
"type": "CN_WORD",
"position": 0
},
{
"token": "纳税",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 1
},
{
"token": "人",
"start_offset": 2,
"end_offset": 3,
"type": "CN_CHAR",
"position": 2
}
]
}
With the ik_smart analyzer
{
"tokens": [
{
"token": "纳税人",
"start_offset": 0,
"end_offset": 3,
"type": "CN_WORD",
"position": 0
}
]
}
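The three outputs above can be reproduced side by side with the _analyze API. This is only a sketch: ES_URL and ES_AUTH are placeholder settings, the function names are invented here, and "standard" is assumed to be the default analyzer that produced the first output (its single-character <IDEOGRAPHIC> tokens are consistent with that).

```shell
# Hypothetical connection settings; adjust to your cluster.
ES_URL="${ES_URL:-http://localhost:9200}"
ES_AUTH="${ES_AUTH:-username:password}"

# Print one token per line from an _analyze response read on stdin.
tokens_from_response() {
  grep -o '"token"[[:space:]]*:[[:space:]]*"[^"]*"' |
    sed 's/^"token"[[:space:]]*:[[:space:]]*"//; s/"$//'
}

# Analyze a text with a named analyzer and print its tokens.
analyze_tokens() {
  # $1: analyzer name, $2: text (must not need JSON escaping)
  curl -s -u "$ES_AUTH" -X POST "$ES_URL/_analyze" \
    -H 'Content-Type: application/json' \
    -d "{\"analyzer\": \"$1\", \"text\": \"$2\"}" | tokens_from_response
}

# Run the same text through the three analyzers discussed above.
compare_analyzers() {
  # $1: text to analyze
  local name
  for name in standard ik_max_word ik_smart; do
    echo "== $name"
    analyze_tokens "$name" "$1"
  done
}
```

Usage: compare_analyzers '纳税人'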
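Rebuilding the index
The existing historical data was indexed without the IK analyzer. The goal is to analyze that data with ik_max_word at index time and to use ik_smart at search time. Because the analyzer of an existing field cannot be changed in place, the index has to be rebuilt.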
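Create the new index
Create a new index; here it is named hot_question_new. Its text fields are configured to use the IK analyzers. The mapping of the existing index is shown below: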
{
"hot_question": {
"mappings": {
"properties": {
"id": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"hotQuestion": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"business_item_third": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"business_scenario_second": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"country": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"demandSourceChannel": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"existFlag": {
"type": "boolean"
},
"hotCommonCause": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"hotReply": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"industry_large_category_name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"province": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"provinceSet": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"tax_policy_measure_second": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"tax_policy_topic_second": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"tax_type_second": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"version": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
The following curl command creates the new index:
curl -u username:password -X PUT "http://172.30.xxx.xxx:9200/hot_question_new" -H 'Content-Type: application/json' -d '{
"settings": {
"analysis": {
"analyzer": {
"ik_max_word": {
"type": "ik_max_word"
},
"ik_smart": {
"type": "ik_smart"
}
}
}
},
"mappings": {
"properties": {
"id": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"hotQuestion": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
},
"analyzer": "ik_max_word",
"search_analyzer": "ik_smart"
},
"business_item_third": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
},
"analyzer": "ik_max_word",
"search_analyzer": "ik_smart"
},
"business_scenario_second": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
},
"analyzer": "ik_max_word",
"search_analyzer": "ik_smart"
},
"demandSourceChannel": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
},
"analyzer": "ik_max_word",
"search_analyzer": "ik_smart"
},
"industry_large_category_name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
},
"analyzer": "ik_max_word",
"search_analyzer": "ik_smart"
},
"province": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
},
"analyzer": "ik_max_word",
"search_analyzer": "ik_smart"
},
"tax_policy_measure_second": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
},
"analyzer": "ik_max_word",
"search_analyzer": "ik_smart"
},
"tax_policy_topic_second": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
},
"analyzer": "ik_max_word",
"search_analyzer": "ik_smart"
},
"tax_type_second": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
},
"analyzer": "ik_max_word",
"search_analyzer": "ik_smart"
},
"version": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}'
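Every IK-analyzed field in the command above repeats the same fragment, so the properties object can be generated instead of typed by hand. A minimal sketch follows; ik_text_field and ik_properties are hypothetical helper names, and the field names are assumed not to need JSON escaping.

```shell
# Print the mapping fragment for one text field that is indexed with
# ik_max_word and searched with ik_smart, plus a keyword sub-field.
ik_text_field() {
  # $1: field name
  printf '"%s": {"type": "text", "fields": {"keyword": {"type": "keyword", "ignore_above": 256}}, "analyzer": "ik_max_word", "search_analyzer": "ik_smart"}' "$1"
}

# Print a JSON object with one IK fragment per field name given.
ik_properties() {
  local sep name
  sep=""
  printf '{'
  for name in "$@"; do
    printf '%s' "$sep"
    ik_text_field "$name"
    sep=", "
  done
  printf '}'
}
```

Usage: ik_properties hotQuestion province tax_type_second prints an object that can be pasted into "properties", next to the fields that keep the default analyzer.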
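The resulting mapping of hot_question_new is shown below. Note that country, existFlag, hotCommonCause, hotReply, and provinceSet from the old mapping are not declared here; if the reindexed documents contain them, dynamic mapping will add them without an IK analyzer.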
{
"hot_question_new": {
"mappings": {
"properties": {
"id": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"hotQuestion": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
},
"analyzer": "ik_max_word",
"search_analyzer": "ik_smart"
},
"business_item_third": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
},
"analyzer": "ik_max_word",
"search_analyzer": "ik_smart"
},
"business_scenario_second": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
},
"analyzer": "ik_max_word",
"search_analyzer": "ik_smart"
},
"demandSourceChannel": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
},
"analyzer": "ik_max_word",
"search_analyzer": "ik_smart"
},
"industry_large_category_name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
},
"analyzer": "ik_max_word",
"search_analyzer": "ik_smart"
},
"province": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
},
"analyzer": "ik_max_word",
"search_analyzer": "ik_smart"
},
"tax_policy_measure_second": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
},
"analyzer": "ik_max_word",
"search_analyzer": "ik_smart"
},
"tax_policy_topic_second": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
},
"analyzer": "ik_max_word",
"search_analyzer": "ik_smart"
},
"tax_type_second": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
},
"analyzer": "ik_max_word",
"search_analyzer": "ik_smart"
},
"version": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
Reindex the historical data into the new index
Use the _reindex API to copy the data from the old index to the new one. The command is:
curl -u username:password -X POST "http://172.30.xxx.xxx:9200/_reindex" -H 'Content-Type: application/json' -d '{
"source": {
"index": "hot_question"
},
"dest": {
"index": "hot_question_new"
}
}'
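For a large index the request above may run for a long time. One option is to start the reindex as a background task with wait_for_completion=false and poll the Tasks API. This is a sketch under assumptions: ES_URL and ES_AUTH are placeholders, and the function names are invented for illustration.

```shell
# Hypothetical connection settings; adjust to your cluster.
ES_URL="${ES_URL:-http://localhost:9200}"
ES_AUTH="${ES_AUTH:-username:password}"

# Print the task id from a {"task": "..."} response read on stdin.
task_id_from_response() {
  sed -n 's/.*"task"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Start a reindex in the background and print its task id.
start_reindex() {
  # $1: source index, $2: destination index
  curl -s -u "$ES_AUTH" -X POST "$ES_URL/_reindex?wait_for_completion=false" \
    -H 'Content-Type: application/json' \
    -d "{\"source\": {\"index\": \"$1\"}, \"dest\": {\"index\": \"$2\"}}" |
    task_id_from_response
}

# Fetch the current status of a task.
task_status() {
  # $1: task id printed by start_reindex
  curl -s -u "$ES_AUTH" "$ES_URL/_tasks/$1?pretty"
}
```

Usage: task="$(start_reindex hot_question hot_question_new)" and then task_status "$task" until the response reports the task as completed.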
Verify the data
curl -u username:password -X GET "http://172.30.xxx.xxx:9200/hot_question_new/_search?pretty" -H 'Content-Type: application/json' -d '{
"query": {
"match_all": {}
}
}'
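Besides inspecting search results, comparing document counts of the two indices gives a quick completeness check. The sketch below uses the _count API; ES_URL, ES_AUTH, and the function names are assumptions for illustration. Counts only reflect documents that have been refreshed, so a check made immediately after the reindex may briefly lag.

```shell
# Hypothetical connection settings; adjust to your cluster.
ES_URL="${ES_URL:-http://localhost:9200}"
ES_AUTH="${ES_AUTH:-username:password}"

# Read a _count response on stdin and print the numeric "count" value.
extract_count() {
  sed -n 's/.*"count"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Print the number of documents in an index.
index_count() {
  # $1: index name
  curl -s -u "$ES_AUTH" "$ES_URL/$1/_count" | extract_count
}

# Succeed only if both counts are non-empty and equal.
counts_match() {
  # $1: count of the old index, $2: count of the new index
  [ -n "$1" ] && [ "$1" = "$2" ]
}
```

Usage: counts_match "$(index_count hot_question)" "$(index_count hot_question_new)" && echo "document counts match"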
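Test the IK analyzer
To confirm that the IK analyzer works correctly, test it with the _analyze API. The following example commands analyze text with ik_max_word and ik_smart:
# Analyze with ik_max_word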
curl -u username:password -X POST "http://172.30.xxx.xxx:9200/hot_question_new/_analyze" -H 'Content-Type: application/json' -d '{
"analyzer": "ik_max_word",
"text": "纳税人"
}'
# Analyze with ik_smart
curl -u username:password -X POST "http://172.30.xxx.xxx:9200/hot_question_new/_analyze" -H 'Content-Type: application/json' -d '{
"analyzer": "ik_smart",
"text": "你的测试文本"
}'
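The commands above name the analyzer explicitly, so they do not show whether the field mapping is actually in effect. Two further checks may help: _analyze with a "field" parameter derives the analyzer from the mapping, and a match query applies the field's search analyzer to the query text. The helper names below are hypothetical, and the arguments are assumed not to need JSON escaping.

```shell
# Body for _analyze that takes the analyzer from a mapped field.
analyze_field_body() {
  # $1: field name, $2: text
  printf '{"field": "%s", "text": "%s"}' "$1" "$2"
}

# Body for a match query on one field.
match_query_body() {
  # $1: field name, $2: query text
  printf '{"query": {"match": {"%s": "%s"}}}' "$1" "$2"
}
```

Usage: pass -d "$(analyze_field_body hotQuestion '纳税人')" to the hot_question_new/_analyze endpoint; if the mapping was applied, the tokens should match the ik_max_word output. Likewise, pass -d "$(match_query_body hotQuestion '纳税人')" to hot_question_new/_search.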