Using the IK Analyzer with Elasticsearch and Rebuilding the Index

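Background

The index was originally built with Elasticsearch's default analyzer. That analyzer splits Chinese text into single characters, so searches with Chinese keywords return skewed, imprecise results. I therefore decided to switch to the IK analyzer.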

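Solution

Installing the IK analyzer

Go to the plugins directory under the Elasticsearch installation directory, e.g. /mnt/public/elasticsearch-7.17.3/plugins, and create an ik directory there.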

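Download elasticsearch-analysis-ik-7.17.3. Note: the IK analyzer version must match the Elasticsearch version exactly, otherwise Elasticsearch will not start. After downloading, upload the archive to /mnt/public/elasticsearch-7.17.3/plugins/ik and extract it there.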
Make sure that "elasticsearch.version" in plugin-descriptor.properties matches the Elasticsearch version you are running; otherwise startup will fail.
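The version check can also be scripted. The following is a minimal sketch, assuming the plugin was extracted to the ik directory used in this post; the function names descriptor_es_version and check_ik_version are hypothetical helpers, not part of Elasticsearch or the IK plugin. It only reads the elasticsearch.version key from the properties file.

```shell
# Print the value of elasticsearch.version from a plugin-descriptor.properties file.
# $1 = path to the properties file
descriptor_es_version() {
  # -n plus the p flag prints only the matching line, with "key =" stripped.
  sed -n 's/^elasticsearch\.version[[:space:]]*=[[:space:]]*//p' "$1" \
    | tr -d '\r' \
    | sed 's/[[:space:]]*$//' \
    | head -n 1
}

# Succeed only when the descriptor targets the expected Elasticsearch version.
# $1 = path to plugin-descriptor.properties, $2 = expected version (e.g. 7.17.3)
check_ik_version() {
  ik_actual="$(descriptor_es_version "$1")"
  if [ "$ik_actual" = "$2" ]; then
    echo "OK: plugin targets Elasticsearch $ik_actual"
  else
    echo "MISMATCH: descriptor says '$ik_actual', expected '$2'" >&2
    return 1
  fi
}
```

For the paths in this post, the call would be: check_ik_version /mnt/public/elasticsearch-7.17.3/plugins/ik/plugin-descriptor.properties 7.17.3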

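Next, go to the bin directory and run ./elasticsearch-plugin list to confirm that the plugin is installed.
Then restart Elasticsearch. Note: before restarting, switch to the user that runs the Elasticsearch service, otherwise the restart fails with an error (for example, Elasticsearch refuses to start as root).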

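The IK analyzer

The IK plugin provides two analyzers; choose the one that fits your needs. Taking the keyword 纳税人 ("taxpayer") as an example:

  • ik_max_word: splits the text at the finest granularity and exhausts the possible combinations; for example, "纳税人" is split into "纳税人", "纳税", and "人".
  • ik_smart: splits the text at the coarsest granularity; for example, "纳税人" is not split further and yields the single token "纳税人".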

Without the IK analyzer

{
  "tokens": [
    {
      "token": "纳",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<IDEOGRAPHIC>",
      "position": 0
    },
    {
      "token": "税",
      "start_offset": 1,
      "end_offset": 2,
      "type": "<IDEOGRAPHIC>",
      "position": 1
    },
    {
      "token": "人",
      "start_offset": 2,
      "end_offset": 3,
      "type": "<IDEOGRAPHIC>",
      "position": 2
    }
  ]
}
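
Output like the above can be requested through the _analyze API with Elasticsearch's default analyzer, which is named standard. The following is a minimal sketch; ES_URL, ES_AUTH, analyze_body, and analyze are hypothetical names introduced for illustration, and the default values are placeholders to replace with your own host and credentials.

```shell
# Placeholders (assumptions): point these at your own cluster and account.
ES_URL="${ES_URL:-http://localhost:9200}"
ES_AUTH="${ES_AUTH:-username:password}"

# Build the JSON body for an _analyze request.
# $1 = analyzer name, $2 = text (must not contain double quotes or backslashes)
analyze_body() {
  printf '{"analyzer": "%s", "text": "%s"}\n' "$1" "$2"
}

# Send the text to the _analyze API and print the token list.
analyze() {
  curl -s -u "$ES_AUTH" -X POST "$ES_URL/_analyze?pretty" \
    -H 'Content-Type: application/json' \
    -d "$(analyze_body "$1" "$2")"
}
```

Calling analyze standard "纳税人" exercises the default analyzer; swapping in ik_max_word or ik_smart as the first argument lets you compare the three analyzers on the same text.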

The ik_max_word analyzer

{
  "tokens": [
    {
      "token": "纳税人",
      "start_offset": 0,
      "end_offset": 3,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "纳税",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "人",
      "start_offset": 2,
      "end_offset": 3,
      "type": "CN_CHAR",
      "position": 2
    }
  ]
}

The ik_smart analyzer

{
  "tokens": [
    {
      "token": "纳税人",
      "start_offset": 0,
      "end_offset": 3,
      "type": "CN_WORD",
      "position": 0
    }
  ]
}

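Rebuilding the index

The historical data was indexed without the IK analyzer, and the analyzer of an existing field cannot be changed in place. The goal is to tokenize the historical data with ik_max_word at index time and to use ik_smart at search time, so the index has to be rebuilt.

Creating the new index

Create a new index, named hot_question_new in this example, and set the IK analyzers on its text fields. The mapping of the existing index is: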

{
  "hot_question": {
    "mappings": {
      "properties": {
        "id": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "hotQuestion": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "business_item_third": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "business_scenario_second": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "country": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "demandSourceChannel": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "existFlag": {
          "type": "boolean"
        },
        "hotCommonCause": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "hotReply": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "industry_large_category_name": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "province": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "provinceSet": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "tax_policy_measure_second": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "tax_policy_topic_second": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "tax_type_second": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "version": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}

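The curl command for creating the new index is shown below. The IK plugin registers the ik_max_word and ik_smart analyzers itself, so the settings.analysis block is optional: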

curl -u username:password -X PUT "http://172.30.xxx.xxx:9200/hot_question_new" -H 'Content-Type: application/json' -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "ik_max_word": {
          "type": "ik_max_word"
        },
        "ik_smart": {
          "type": "ik_smart"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "id": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "hotQuestion": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        },
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      },
      "business_item_third": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        },
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      },
      "business_scenario_second": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        },
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      },
      "demandSourceChannel": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        },
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      },
      "industry_large_category_name": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        },
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      },
      "province": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        },
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      },
      "tax_policy_measure_second": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        },
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      },
      "tax_policy_topic_second": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        },
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      },
      "tax_type_second": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        },
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      },
      "version": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}'


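The mapping of hot_question_new is as follows. Fields that the create request leaves out (country, existFlag, hotCommonCause, hotReply, provinceSet) are not listed; under Elasticsearch's default dynamic mapping they are added with the default analyzer when documents containing them are reindexed: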

{
  "hot_question_new": {
    "mappings": {
      "properties": {
        "id": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "hotQuestion": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          },
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_smart"
        },
        "business_item_third": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          },
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_smart"
        },
        "business_scenario_second": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          },
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_smart"
        },
        "demandSourceChannel": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          },
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_smart"
        },
        "industry_large_category_name": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          },
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_smart"
        },
        "province": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          },
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_smart"
        },
        "tax_policy_measure_second": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          },
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_smart"
        },
        "tax_policy_topic_second": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          },
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_smart"
        },
        "tax_type_second": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          },
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_smart"
        },
        "version": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}

Reindexing the historical data into the new index

Use the _reindex API to copy the data from the old index to the new one. The command is:

curl -u username:password -X POST "http://172.30.xxx.xxx:9200/_reindex" -H 'Content-Type: application/json' -d '{
  "source": {
    "index": "hot_question"
  },
  "dest": {
    "index": "hot_question_new"
  }
}'
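
One way to check that the reindex copied everything is to compare the document counts of both indexes with the _count API. The following is a minimal sketch under stated assumptions: ES_URL and ES_AUTH are placeholder variables, and extract_count, index_count, and compare_counts are hypothetical helper names. Counts on the new index may lag briefly until the index is refreshed.

```shell
# Placeholders (assumptions): point these at your own cluster and account.
ES_URL="${ES_URL:-http://localhost:9200}"
ES_AUTH="${ES_AUTH:-username:password}"

# Extract the numeric "count" field from a _count API response read on stdin.
extract_count() {
  sed -n 's/.*"count"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Print the document count of one index. $1 = index name
index_count() {
  curl -s -u "$ES_AUTH" "$ES_URL/$1/_count" | extract_count
}

# Print both counts and return non-zero when they differ.
# $1 = old index, $2 = new index
compare_counts() {
  old_count="$(index_count "$1")"
  new_count="$(index_count "$2")"
  echo "$1=$old_count $2=$new_count"
  [ "$old_count" = "$new_count" ]
}
```

For the indexes in this post, the call would be: compare_counts hot_question hot_question_new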

Verifying the data

curl -u username:password -X GET "http://172.30.xxx.xxx:9200/hot_question_new/_search?pretty" -H 'Content-Type: application/json' -d '{
  "query": {
    "match_all": {}
  }
}'
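
A match_all query only confirms that documents exist. To exercise the analyzers, a match query on an IK-mapped field such as hotQuestion can be used: the query text is analyzed with the field's search_analyzer (ik_smart) and matched against tokens produced by ik_max_word at index time. The following is a minimal sketch; ES_URL, ES_AUTH, match_query_body, and search_field are hypothetical names introduced for illustration.

```shell
# Placeholders (assumptions): point these at your own cluster and account.
ES_URL="${ES_URL:-http://localhost:9200}"
ES_AUTH="${ES_AUTH:-username:password}"

# Build a match query body for a single field.
# $1 = field name, $2 = query text (must not contain double quotes or backslashes)
match_query_body() {
  printf '{"query": {"match": {"%s": "%s"}}}\n' "$1" "$2"
}

# Run the match query against one index and print the hits.
# $1 = index name, $2 = field name, $3 = query text
search_field() {
  curl -s -u "$ES_AUTH" -X GET "$ES_URL/$1/_search?pretty" \
    -H 'Content-Type: application/json' \
    -d "$(match_query_body "$2" "$3")"
}
```

For example: search_field hot_question_new hotQuestion "纳税人"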

Testing the IK analyzer

To confirm that the IK analyzer works, test it with the _analyze API. Example commands for ik_max_word and ik_smart:

# Tokenize with ik_max_word
curl -u username:password -X POST "http://172.30.xxx.xxx:9200/hot_question_new/_analyze" -H 'Content-Type: application/json' -d '{
  "analyzer": "ik_max_word",
  "text": "纳税人"
}'

# Tokenize with ik_smart
curl -u username:password -X POST "http://172.30.xxx.xxx:9200/hot_question_new/_analyze" -H 'Content-Type: application/json' -d '{
  "analyzer": "ik_smart",
  "text": "你的测试文本"
}'

Comparison

Without the IK analyzer

image

With the ik_max_word analyzer

image

With the ik_smart analyzer

image

posted @ 2024-11-19 17:27  Reecelin