Elasticsearch Study Notes 8: The IK Analyzer

analyzer

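An analyzer is applied in two situations:
1. Index time analysis: when a document is created or updated, its text fields are analyzed.
2. Search time analysis: when a query runs, the query string is analyzed.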

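There are two ways to choose which analyzer is used at search time:

Specify the analyzer in the query itself with the analyzer parameter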

GET test_index/_search
{
  "query": {
    "match": {
      "name": {
        "query": "lin",
        "analyzer": "standard"
      }
    }
  }
}

Specify search_analyzer in the mapping when creating the index

PUT test_index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "whitespace",
        "search_analyzer": "standard"
      }
    }
  }
}
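To see why the index-time and search-time analyzers have to be chosen together, here is a minimal toy model in Python. It is only a sketch: whitespace_analyze and standard_analyze are hypothetical stand-ins (split on whitespace and keep case, versus split into word tokens and lowercase) and are only meant for Latin text, not a reimplementation of what Elasticsearch does internally.

```python
import re


def whitespace_analyze(text):
    # Toy stand-in for the whitespace analyzer: split on whitespace, keep case.
    return text.split()


def standard_analyze(text):
    # Toy stand-in for the standard analyzer: word tokens, lowercased.
    return [token.lower() for token in re.findall(r"\w+", text)]


def build_index(docs, analyzer):
    # Index time analysis: map every token to the ids of the documents containing it.
    index = {}
    for doc_id, text in docs.items():
        for token in analyzer(text):
            index.setdefault(token, set()).add(doc_id)
    return index


def search(index, query, analyzer):
    # Search time analysis: analyze the query, then look each token up in the index.
    hits = set()
    for token in analyzer(query):
        hits |= index.get(token, set())
    return hits


docs = {1: "Quick brown fox", 2: "lazy dog"}
index = build_index(docs, whitespace_analyze)

print(search(index, "Quick", standard_analyze))  # -> set()
print(search(index, "brown", standard_analyze))  # -> {1}
```

In this toy model the query "Quick" finds nothing, because the index stores "Quick" while the search side looks up "quick". A mismatch of this kind is worth keeping in mind whenever analyzer and search_analyzer differ.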

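Elasticsearch ships with many built-in analyzers, but they handle Chinese poorly. For example, the standard analyzer splits a Chinese sentence into individual characters. In that case you can use a third-party analyzer plugin such as ik or pinyin.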
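The character-by-character behaviour can be approximated in a few lines of Python. This is a rough sketch, not the real tokenizer: standard_like_tokens is a hypothetical helper that assumes each Han character becomes its own token, while runs of ASCII letters and digits stay together and are lowercased.

```python
import re

# One Han character per token, or a run of ASCII letters/digits.
# Punctuation and whitespace match neither branch and are dropped.
TOKEN_RE = re.compile(r"[\u4e00-\u9fff]|[A-Za-z0-9]+")


def standard_like_tokens(text):
    # Approximate how the standard analyzer would tokenize mixed text.
    return [match.group(0).lower() for match in TOKEN_RE.finditer(text)]


print(standard_like_tokens("你好吗?我有一句话要对你说呀。"))
```

Every Chinese character ends up as a separate token, so word-level matching is lost. That is the gap a dictionary-based analyzer such as ik is meant to fill.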

IK Analyzer

Download

Download the IK analyzer. Its version must match the Elasticsearch version.

https://github.com/medcl/elasticsearch-analysis-ik/releases/tag/v7.4.2

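Download the prebuilt release package. If you download the source code instead, you have to build it yourself, which is more trouble.

Create the directory

Create an ik directory under the plugins directory of the Elasticsearch installation.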
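Since the plugin version has to match the Elasticsearch version, a small check before installing can save a failed startup. The sketch below assumes the release zip follows the naming pattern elasticsearch-analysis-ik-<version>.zip, as the 7.4.2 file used in these notes does. plugin_version and versions_match are hypothetical helper names.

```python
import re

# Assumed file name pattern of the release zip, e.g. elasticsearch-analysis-ik-7.4.2.zip
ZIP_NAME_RE = re.compile(r"elasticsearch-analysis-ik-(\d+\.\d+\.\d+)\.zip")


def plugin_version(zip_name):
    # Return the version embedded in the zip name, or None if the name does not fit.
    match = ZIP_NAME_RE.fullmatch(zip_name)
    return match.group(1) if match else None


def versions_match(es_version, zip_name):
    # True only when the plugin version equals the Elasticsearch version exactly.
    return plugin_version(zip_name) == es_version


print(versions_match("7.4.2", "elasticsearch-analysis-ik-7.4.2.zip"))  # -> True
print(versions_match("7.4.1", "elasticsearch-analysis-ik-7.4.2.zip"))  # -> False
```

The Elasticsearch version itself can be read from the version.number field that a running node returns at its root URL.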

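# Create the ik directory
mkdir /usr/local/elasticsearch-7.4.2/plugins/ik
# Unzip the downloaded archive into the ik directory
unzip elasticsearch-analysis-ik-7.4.2.zip -d /usr/local/elasticsearch-7.4.2/plugins/ik/
# Reset the owner and group of the Elasticsearch directory (including ik) to the es user
chown -Rf es:es /usr/local/elasticsearch-7.4.2/
# Restart Elasticsearch afterwards so that the plugin is loaded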
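A common slip at this step is unzipping into the wrong level, which leaves the plugin files nested one directory too deep. The following Python sketch checks the layout. It assumes that a correctly unpacked plugin has a plugin-descriptor.properties file directly inside plugins/ik. ik_plugin_problems is a hypothetical helper, not part of Elasticsearch.

```python
from pathlib import Path


def ik_plugin_problems(plugins_dir):
    # Return a list of human-readable problems; an empty list means the layout looks fine.
    ik_dir = Path(plugins_dir) / "ik"
    problems = []
    if not ik_dir.is_dir():
        problems.append(f"{ik_dir} does not exist")
    elif not (ik_dir / "plugin-descriptor.properties").is_file():
        # Assumption: the descriptor must sit at the top level of the plugin directory.
        problems.append(f"{ik_dir} has no plugin-descriptor.properties")
    return problems
```

For the paths used above, the call would be ik_plugin_problems("/usr/local/elasticsearch-7.4.2/plugins").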

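Verify

The IK analyzer has two modes: ik_max_word (fine-grained segmentation) and ik_smart (coarse-grained segmentation).

Analyze a sentence with the IK analyzer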

// GET request
GET http://192.168.2.128:9200/_analyze

// Request body (ik_max_word is the fine-grained mode)
{
  "analyzer": "ik_max_word",
  "text": "你好吗?我有一句话要对你说呀。"
}
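The same request can be sent from a script. Below is a minimal sketch using only the Python standard library. build_analyze_request and analyze are hypothetical helper names, and the sketch uses POST, which the _analyze endpoint accepts as well as GET.

```python
import json
import urllib.request


def build_analyze_request(base_url, analyzer, text):
    # Build (but do not send) a request to the _analyze endpoint.
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(base_url, analyzer, text):
    # Send the request and return the parsed JSON response.
    request = build_analyze_request(base_url, analyzer, text)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Against the node used in these notes, analyze("http://192.168.2.128:9200", "ik_smart", "你好吗?我有一句话要对你说呀。") would return the parsed response. Passing "ik_max_word" instead makes it easy to compare the two modes on the same sentence.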

Analysis result

{
  "tokens" : [
    {
      "token" : "你",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "好吗",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "我",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "CN_CHAR",
      "position" : 2
    },
    {
      "token" : "有",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "CN_CHAR",
      "position" : 3
    },
    {
      "token" : "一句话",
      "start_offset" : 6,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "要对",
      "start_offset" : 9,
      "end_offset" : 11,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "你",
      "start_offset" : 11,
      "end_offset" : 12,
      "type" : "CN_CHAR",
      "position" : 6
    },
    {
      "token" : "说呀",
      "start_offset" : 12,
      "end_offset" : 14,
      "type" : "CN_WORD",
      "position" : 7
    }
  ]
}
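The start_offset and end_offset values point back into the original text, which is useful for highlighting. The Python sketch below reads a response of this shape. token_spans is a hypothetical helper, the sample holds only the first three tokens from the result above, and it assumes all characters are in the Basic Multilingual Plane so that the offsets line up with Python string indices.

```python
def token_spans(response, text):
    # Pair each token with the slice of the original text that its offsets cover.
    spans = []
    for tok in response["tokens"]:
        start, end = tok["start_offset"], tok["end_offset"]
        spans.append((tok["token"], text[start:end], tok["type"]))
    return spans


text = "你好吗?我有一句话要对你说呀。"
sample = {
    "tokens": [
        {"token": "你", "start_offset": 0, "end_offset": 1, "type": "CN_CHAR", "position": 0},
        {"token": "好吗", "start_offset": 1, "end_offset": 3, "type": "CN_WORD", "position": 1},
        {"token": "我", "start_offset": 4, "end_offset": 5, "type": "CN_CHAR", "position": 2},
    ]
}

for token, covered, kind in token_spans(sample, text):
    print(token, covered, kind)
```

Each token matches the slice its offsets cover. The question mark at offset 3 belongs to no token, which is why the offsets jump from 3 to 4.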

Posted 2022-05-07 18:26 by 阿瞒123
