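Using the IK Analyzer with Elasticsearch

Analyzers

What is tokenization, and where are analyzers used?

Tokenization means splitting text into individual terms. For example, 你好啊 might be split into "你好" and "啊". A search with highlighting enabled then returns every title that contains "你好" or "啊".

Analyzers are used mostly in search engines. For example, a Baidu search for mysql时间减7天 ("MySQL subtract 7 days from a time") returns content that matches the terms mysql, 时间, 减, and 7天.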
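To make the idea of matching on terms concrete, here is a minimal Python sketch of an inverted index. The documents and their terms are made up for illustration, and the sketch assumes the text has already been split into terms by some analyzer; it is not how Elasticsearch is implemented internally.

```python
from collections import defaultdict

# Hypothetical documents, already split into terms by some analyzer.
DOCS = {
    1: ["mysql", "时间", "函数"],
    2: ["日期", "减", "7天"],
    3: ["redis", "缓存"],
}

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index

def search(index, query_terms):
    """Return the ids of documents that contain at least one query term."""
    hits = set()
    for term in query_terms:
        hits |= index.get(term, set())
    return sorted(hits)

INDEX = build_index(DOCS)
print(search(INDEX, ["mysql", "时间", "减", "7天"]))  # [1, 2]
```

Document 3 shares no term with the query, so it is not returned.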

The following text produces different terms depending on which analyzer processes it:

"Set the shape to semi-transparent by calling set_trans(5)"

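Standard analyzer

The standard analyzer is the one Elasticsearch uses by default. It is the most common general-purpose choice for text in any language. It splits text on word boundaries as defined by the Unicode Consortium, removes most punctuation, and finally lowercases the terms. It produces: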

set, the, shape, to, semi, transparent, by, calling, set_trans, 5

Simple analyzer

The simple analyzer splits text wherever a character is not a letter, and lowercases the terms. It produces:

set, the, shape, to, semi, transparent, by, calling, set, trans

Whitespace analyzer

The whitespace analyzer splits text on whitespace. It produces:

Set, the, shape, to, semi-transparent, by, calling, set_trans(5)
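The three behaviours above can be imitated with a few lines of Python. This is only a rough sketch: the function names are made up, and `rough_standard_tokens` is a crude regular-expression stand-in for the Unicode word-boundary rules, good enough for this one sentence but not a faithful reimplementation of the standard analyzer.

```python
import re

TEXT = "Set the shape to semi-transparent by calling set_trans(5)"

def whitespace_tokens(text):
    # Split on runs of whitespace only; case and punctuation are kept.
    return text.split()

def simple_tokens(text):
    # Keep runs of letters (no digits, no underscores), then lowercase.
    return [t.lower() for t in re.findall(r"[^\W\d_]+", text)]

def rough_standard_tokens(text):
    # Assumption: runs of letters, digits and underscores approximate
    # Unicode word boundaries for this sentence; then lowercase.
    return [t.lower() for t in re.findall(r"\w+", text)]

print(rough_standard_tokens(TEXT))
print(simple_tokens(TEXT))
print(whitespace_tokens(TEXT))
```

The only differences between the three results are how `semi-transparent` and `set_trans(5)` are cut and whether the terms are lowercased.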

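Language analyzers

Language-specific analyzers are available for many languages. They can take the characteristics of the given language into account. For example, the english analyzer ships with a set of English stop words (common words such as and or the, which have little effect on relevance) and removes them. Because it understands the rules of English grammar, it can also reduce English words to their stems.

The english analyzer produces the following terms: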

set, shape, semi, transpar, call, set_tran, 5

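Note that transparent, calling, and set_trans have been reduced to their stems.

Testing an analyzer

Analyzing the text "Text to analyze" with the standard analyzer yields three terms: text, to, and analyze.

Send the request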

GET http://localhost:9200/_analyze

body

{
  "analyzer": "standard",
  "text": "Text to analyze"
}

Output

{
    "tokens": [
        {
            "token": "text",
            "start_offset": 0,
            "end_offset": 4,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "to",
            "start_offset": 5,
            "end_offset": 7,
            "type": "<ALPHANUM>",
            "position": 1
        },
        {
            "token": "analyze",
            "start_offset": 8,
            "end_offset": 15,
            "type": "<ALPHANUM>",
            "position": 2
        }
    ]
}
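The same request can be sent from a script. The sketch below uses only the Python standard library; the function names are made up for illustration, and it sends a POST (the `_analyze` endpoint accepts POST as well as GET) because some HTTP clients refuse to attach a body to a GET. It assumes Elasticsearch is listening on localhost:9200 as in the request above.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # same host and port as in the article

def build_analyze_request(analyzer, text, base_url=ES_URL):
    """Build (but do not send) a request to the _analyze endpoint."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def token_strings(response):
    """Pull just the token text out of a parsed _analyze response."""
    return [t["token"] for t in response["tokens"]]

def analyze(analyzer, text, base_url=ES_URL):
    """Send the request; needs a running Elasticsearch node."""
    request = build_analyze_request(analyzer, text, base_url)
    with urllib.request.urlopen(request) as resp:
        return token_strings(json.load(resp))
```

With a node running, `analyze("standard", "Text to analyze")` should return the three tokens shown in the output above.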

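IK Chinese analyzer

Installing the IK analyzer

Elasticsearch's default standard analyzer is usually not good enough for Chinese, because it breaks Chinese text into single characters. Apart from working around this in the mapping, the usual approach is to install the IK Chinese analysis plugin.

GitHub: https://github.com/medcl/elasticsearch-analysis-ik

Downloads: https://github.com/medcl/elasticsearch-analysis-ik/releases

Normally you download a prebuilt release package; if you download the source instead, you have to build it yourself.

Download the plugin release that matches your Elasticsearch version. I am running ES 7.8, so I downloaded the 7.8 release.

7.8 release download: https://github.com/medcl/elasticsearch-analysis-ik/releases/tag/v7.8.0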

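On Windows, download the corresponding zip file.

Create a folder named ik under the plugins directory
Copy elasticsearch-analysis-ik-7.8.0.zip into the ES plugins directory and extract it into the ik folder
Restart ES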
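If you prefer to script the extraction step, something like the following would do it. This is a minimal sketch with a made-up function name; the zip path and the Elasticsearch home directory are whatever applies on your machine, and restarting ES is still a manual step.

```python
import zipfile
from pathlib import Path

def install_ik(zip_path, es_home):
    """Extract the plugin zip into <es_home>/plugins/ik and return that folder."""
    target = Path(es_home) / "plugins" / "ik"
    # Create plugins/ik if it does not exist yet.
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target

# Example call (hypothetical paths):
# install_ik(r"C:\Downloads\elasticsearch-analysis-ik-7.8.0.zip", r"C:\elasticsearch-7.8.0")
```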

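Testing the IK analyzer

The IK plugin provides two tokenization modes

ik_smart: coarsest-grained splitting (fewest tokens)
ik_max_word: finest-grained splitting

In most cases ik_smart is sufficient

Send the request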

GET http://localhost:9200/_analyze

body

{
  "analyzer": "ik_smart",
  "text": "钢铁般的意志"
}

Output

{
    "tokens": [
        {
            "token": "钢铁",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "般",
            "start_offset": 2,
            "end_offset": 3,
            "type": "CN_CHAR",
            "position": 1
        },
        {
            "token": "的",
            "start_offset": 3,
            "end_offset": 4,
            "type": "CN_CHAR",
            "position": 2
        },
        {
            "token": "意志",
            "start_offset": 4,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 3
        }
    ]
}
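The start_offset and end_offset fields say where each token sits in the original text. A small Python check, using the tokens from the output above, makes that visible. It assumes the offsets can be used directly as Python string indices, which holds for this text.

```python
TEXT = "钢铁般的意志"

# Tokens copied from the ik_smart output above (type and position omitted).
TOKENS = [
    {"token": "钢铁", "start_offset": 0, "end_offset": 2},
    {"token": "般", "start_offset": 2, "end_offset": 3},
    {"token": "的", "start_offset": 3, "end_offset": 4},
    {"token": "意志", "start_offset": 4, "end_offset": 6},
]

def slice_by_offsets(text, tokens):
    """Cut the original text using each token's start and end offsets."""
    return [text[t["start_offset"]:t["end_offset"]] for t in tokens]

slices = slice_by_offsets(TEXT, TOKENS)
print(slices)
# Each slice is exactly the token text, and together they cover the input.
print(slices == [t["token"] for t in TOKENS], "".join(slices) == TEXT)
```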

Using the IK analyzer

First create the index news together with its mapping

PUT localhost:9200/news

body

{
  "mappings": {
    "properties": {
        "content": {
          "type": "text",
          "analyzer": "ik_smart"
        }
      }
  }
}

The content field has type text, which means its value is analyzed (tokenized)
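For reference, the same index creation can be built from Python with the standard library. This is a sketch only: the function name is made up, and it assumes the node from the article at localhost:9200. The request is built without being sent, so you can inspect it first.

```python
import json
import urllib.request

def build_create_index_request(index, field, analyzer,
                               base_url="http://localhost:9200"):
    """Build a PUT request that creates an index with one analyzed text field."""
    mapping = {
        "mappings": {
            "properties": {
                field: {"type": "text", "analyzer": analyzer}
            }
        }
    }
    return urllib.request.Request(
        f"{base_url}/{index}",
        data=json.dumps(mapping).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )

request = build_create_index_request("news", "content", "ik_smart")
print(request.get_method(), request.full_url)
# To actually create the index (needs a running node):
# urllib.request.urlopen(request)
```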

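Add documents under the _doc type. The three bodies below are three separate documents; send one POST request per document.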

POST localhost:9200/news/_doc

body

{
    "content": "我国灵活就业者已经达到2亿人"
}
{
    "content": "五笔输入法为什么输给拼音输入法"
}
{
    "content": "碳水化合物会导致发胖吗"
}
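As an alternative to three separate requests, the documents can go in a single call to the bulk API. This is not what the article does; it is a sketch of how the request body could be assembled, with a made-up function name. The bulk API expects newline-delimited JSON: an action line followed by the document, with a newline after the last line.

```python
import json

DOCS = [
    {"content": "我国灵活就业者已经达到2亿人"},
    {"content": "五笔输入法为什么输给拼音输入法"},
    {"content": "碳水化合物会导致发胖吗"},
]

def build_bulk_body(docs):
    """Return an NDJSON body: one action line plus one source line per document."""
    lines = []
    for doc in docs:
        # An empty "index" action lets ES generate the id; the target index
        # is taken from the URL (for example POST /news/_bulk).
        lines.append(json.dumps({"index": {}}))
        lines.append(json.dumps(doc, ensure_ascii=False))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"

print(build_bulk_body(DOCS))
```

The body would be sent to localhost:9200/news/_bulk with the header Content-Type: application/x-ndjson.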

Testing a match query

POST localhost:9200/news/_search

body

{
  "query": {
    "match": {
      "content": "发胖"
    }
  }
}

Match result

{
    "took": 14,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 1.0065652,
        "hits": [
            {
                "_index": "news",
                "_type": "_doc",
                "_id": "20Vg4n4B7c1LMGBvNd71",
                "_score": 1.0065652,
                "_source": {
                    "content": "碳水化合物会导致发胖吗"
                }
            }
        ]
    }
}
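When the search is issued from code, usually only the matched documents are of interest. The sketch below builds the match query and pulls the content field out of a response shaped like the one above. The function names are made up, and the sample response is a trimmed copy of the result shown in the article.

```python
import json

def build_match_query(field, text):
    """Body for a match query on a single field."""
    return {"query": {"match": {field: text}}}

def hit_contents(response, field="content"):
    """Collect one field from the _source of every hit."""
    return [hit["_source"][field] for hit in response["hits"]["hits"]]

# Trimmed copy of the search result shown above.
SAMPLE_RESPONSE = {
    "hits": {
        "total": {"value": 1, "relation": "eq"},
        "hits": [
            {
                "_index": "news",
                "_id": "20Vg4n4B7c1LMGBvNd71",
                "_score": 1.0065652,
                "_source": {"content": "碳水化合物会导致发胖吗"},
            }
        ],
    }
}

print(json.dumps(build_match_query("content", "发胖"), ensure_ascii=False))
print(hit_contents(SAMPLE_RESPONSE))
```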

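Extension dictionary

Suppose we now want "钢铁般的意志" to be treated as a single term instead of being split into several. This is what the extension dictionary is for.

Create a new file my.dic in the plugins\ik\config directory (save it as UTF-8);
Write 钢铁般的意志 into my.dic

Register the extension dictionary my.dic in IKAnalyzer.cfg.xml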

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
	<comment>IK Analyzer extension configuration</comment>
	<!-- Configure your own extension dictionary here -->
	<entry key="ext_dict">my.dic</entry>
	<!-- Configure your own extension stop-word dictionary here -->
	<entry key="ext_stopwords"></entry>
	<!-- Configure a remote extension dictionary here -->
	<!-- <entry key="remote_ext_dict">words_location</entry> -->
	<!-- Configure a remote extension stop-word dictionary here -->
	<!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>
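A quick way to double-check the edited file is to parse it and print the entries. The sketch below uses Python's standard XML parser on a trimmed sample of the configuration; the function name is made up, and splitting the value on semicolons is an assumption about how several dictionary files would be listed.

```python
import xml.etree.ElementTree as ET

# Trimmed sample of IKAnalyzer.cfg.xml (comments removed).
SAMPLE_CFG = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
	<comment>IK Analyzer</comment>
	<entry key="ext_dict">my.dic</entry>
	<entry key="ext_stopwords"></entry>
</properties>
"""

def read_entries(xml_bytes):
    """Return {key: [file names]} for every <entry> element."""
    root = ET.fromstring(xml_bytes)
    entries = {}
    for entry in root.findall("entry"):
        value = (entry.text or "").strip()
        # Assumption: multiple files are separated by semicolons.
        entries[entry.get("key")] = [p.strip() for p in value.split(";") if p.strip()]
    return entries

print(read_entries(SAMPLE_CFG))
```

To check the real file, read it in binary mode and pass the bytes to `read_entries`.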

After restarting ES, the same test request returns the following

{
    "tokens": [
        {
            "token": "钢铁般的意志",
            "start_offset": 0,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 0
        }
    ]
}
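Why does one extra dictionary entry change the result? A dictionary-based tokenizer prefers words it knows. The sketch below uses forward maximum matching, a much simpler technique than what IK actually implements, and a made-up two-word dictionary, purely to illustrate the effect of adding an entry.

```python
def forward_max_match(text, dictionary):
    """Greedy left-to-right matching: always take the longest known word."""
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens

BASE_DICT = {"钢铁", "意志"}                # hypothetical built-in words
EXTENDED_DICT = BASE_DICT | {"钢铁般的意志"}  # after adding the my.dic entry

print(forward_max_match("钢铁般的意志", BASE_DICT))      # ['钢铁', '般', '的', '意志']
print(forward_max_match("钢铁般的意志", EXTENDED_DICT))  # ['钢铁般的意志']
```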

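Note: files in plugins\ik\config

IKAnalyzer.cfg.xml: configures custom dictionaries
main.dic: IK's built-in Chinese dictionary, more than 270,000 entries in total
quantifier.dic: words related to units of measure
suffix.dic: suffixes
surname.dic: Chinese surnames
stopword.dic: English stop words

Reference: https://www.elastic.co/guide/cn/elasticsearch/guide/current/mapping-intro.html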

posted @ 2022-02-24 17:40 知识追寻者