Elasticsearch Tokenization

I. Tokenization

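A tokenizer receives a stream of characters, splits it into individual tokens (usually individual words), and outputs a stream of tokens. For example, the whitespace tokenizer splits text whenever it encounters whitespace, so it turns the text "Quick brown fox!" into [Quick, brown, fox!]. The tokenizer is also responsible for recording the order, or position, of each term (used for phrase and word proximity queries) and the start and end character offsets of the original word that each term represents (used for highlighting search results). Elasticsearch ships with many built-in tokenizers, which can be used to build custom analyzers.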
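The bookkeeping described above can be sketched in a few lines of Python. This is only an illustration of splitting on whitespace while recording positions and character offsets, not Elasticsearch's implementation; the names Token and whitespace_tokenize are made up for this example.

```python
import re
from typing import NamedTuple


class Token(NamedTuple):
    token: str
    start_offset: int  # index of the first character of the term
    end_offset: int    # index one past the last character of the term
    position: int      # ordinal of the term in the token stream


def whitespace_tokenize(text: str) -> list[Token]:
    """Split on runs of whitespace, keeping offsets and positions."""
    return [
        Token(m.group(), m.start(), m.end(), position)
        for position, m in enumerate(re.finditer(r"\S+", text))
    ]


for tok in whitespace_tokenize("Quick brown fox!"):
    print(tok)
```

The offsets are what a highlighter would use to locate each term in the original text, and the positions are what a phrase query would compare.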

II. The standard analyzer

1. Analyzing English text

POST _analyze
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog bone"
}

{
  "tokens" : [
    {
      "token" : "the",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "2",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<NUM>",
      "position" : 1
    },
    {
      "token" : "quick",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "brown",
      "start_offset" : 12,
      "end_offset" : 17,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "foxes",
      "start_offset" : 18,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "jumped",
      "start_offset" : 24,
      "end_offset" : 30,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "over",
      "start_offset" : 31,
      "end_offset" : 35,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "the",
      "start_offset" : 36,
      "end_offset" : 39,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "lazy",
      "start_offset" : 40,
      "end_offset" : 44,
      "type" : "<ALPHANUM>",
      "position" : 8
    },
    {
      "token" : "dog",
      "start_offset" : 45,
      "end_offset" : 48,
      "type" : "<ALPHANUM>",
      "position" : 9
    },
    {
      "token" : "bone",
      "start_offset" : 49,
      "end_offset" : 53,
      "type" : "<ALPHANUM>",
      "position" : 10
    }
  ]
}
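The response above reflects two steps: the text is split into terms, then each term is lowercased. A rough Python approximation of that pipeline follows. It assumes plain ASCII input and splits on anything that is not a letter or digit, whereas the real standard tokenizer follows Unicode text segmentation rules, so treat it as a sketch of the idea only. The function names are hypothetical.

```python
import re


def standard_like_tokenize(text):
    """Yield (term, start, end) for each run of ASCII letters or digits."""
    for m in re.finditer(r"[A-Za-z0-9]+", text):
        yield m.group(), m.start(), m.end()


def lowercase_filter(tokens):
    """Token filter: lowercase each term, leaving offsets untouched."""
    for term, start, end in tokens:
        yield term.lower(), start, end


def analyze(text):
    """Tokenizer followed by one token filter, shaped like an _analyze response."""
    return [
        {"token": term, "start_offset": start, "end_offset": end, "position": pos}
        for pos, (term, start, end) in enumerate(
            lowercase_filter(standard_like_tokenize(text))
        )
    ]


for tok in analyze("The 2 QUICK Brown-Foxes jumped over the lazy dog bone"):
    print(tok)
```

Note that the offsets still point into the original, mixed-case text even though the emitted terms are lowercase.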

2. Analyzing Chinese text: the standard analyzer does not handle Chinese well

POST _analyze
{
  "analyzer": "standard",
  "text": "分词中文"
}

{
  "tokens" : [
    {
      "token" : "分",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "词",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "中",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "文",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    }
  ]
}
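To see why one token per character is a problem, consider a toy inverted index built from such tokens. This is a simplified sketch, not how Elasticsearch stores or scores documents; the sample documents are invented, and the search uses plain OR semantics (a document matches if it contains any query term).

```python
from collections import defaultdict


def unigram_tokens(text):
    """One token per character, mirroring the per-character output above."""
    return [ch for ch in text if not ch.isspace()]


def build_index(docs):
    """Map each single-character term to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in unigram_tokens(text):
            index[term].add(doc_id)
    return index


def search_any(index, query):
    """Return ids of documents containing at least one query term."""
    hits = set()
    for term in unigram_tokens(query):
        hits |= index.get(term, set())
    return sorted(hits)


docs = {1: "分词中文", 2: "中午吃饭", 3: "文章很长"}  # made-up sample documents
index = build_index(docs)
print(search_any(index, "中文"))  # [1, 2, 3]
```

Only document 1 actually contains the word 中文, yet documents 2 and 3 also match because each shares a single character with the query. A word-aware analyzer avoids this by indexing 中文 as one term.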

III. Installing the IK analyzer (Docker version)

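Note: do not use the default elasticsearch-plugin install xxx.zip automatic installation here; the plugin is downloaded and unpacked by hand instead.
GitHub repository: https://github.com/medcl/elasticsearch-analysis-ik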

1. Enter the plugins directory inside the ES container

docker exec -it <container-id> /bin/bash
cd plugins
wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.4.2/elasticsearch-analysis-ik-7.4.2.zip

unzip elasticsearch-analysis-ik-7.4.2.zip

rm -rf *.zip

mv elasticsearch/ ik

2. Verify that the analyzer is installed

cd ../bin

elasticsearch-plugin list

This lists the installed plugins; the IK analyzer should appear among them.

IV. Testing the analyzer

1. Smart segmentation: ik_smart

POST _analyze
{
  "analyzer": "ik_smart",
  "text": "我是中国人"
}

{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "中国人",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 2
    }
  ]
}

2. ik_max_word: splits the text at the finest granularity

POST _analyze
{
  "analyzer": "ik_max_word",
  "text": "我是中国人"
}

{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "中国人",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "中国",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "国人",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 4
    }
  ]
}
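The difference between the two modes can be illustrated with a toy dictionary-based segmenter. This is not IK's actual algorithm: it assumes a three-word dictionary, uses forward longest-match for the coarse mode, and simply emits every dictionary word for the fine mode. All names here are hypothetical.

```python
DICTIONARY = {"中国人", "中国", "国人"}  # toy dictionary, assumed for illustration
MAX_LEN = max(len(w) for w in DICTIONARY)


def matches_at(text, i):
    """Dictionary words starting at index i, longest first."""
    return [
        text[i:i + n]
        for n in range(min(MAX_LEN, len(text) - i), 0, -1)
        if text[i:i + n] in DICTIONARY
    ]


def smart_like(text):
    """Coarse split: take the longest dictionary word at each step."""
    tokens, i = [], 0
    while i < len(text):
        found = matches_at(text, i)
        word = found[0] if found else text[i]
        tokens.append(word)
        i += len(word)
    return tokens


def max_word_like(text):
    """Fine split: emit every dictionary word, plus uncovered single characters."""
    tokens, covered_until = [], 0
    for i in range(len(text)):
        found = matches_at(text, i)
        if found:
            tokens.extend(found)
            covered_until = max(covered_until, i + len(found[0]))
        elif i >= covered_until:
            tokens.append(text[i])
    return tokens


print(smart_like("我是中国人"))     # ['我', '是', '中国人']
print(max_word_like("我是中国人"))  # ['我', '是', '中国人', '中国', '国人']
```

With this toy dictionary the two functions reproduce the token lists shown in the responses above. The fine-grained output favors recall at search time, while the coarse output keeps the index smaller and matches more precisely.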

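V. Custom dictionary

The IK analyzer's built-in dictionary does not cover some newly coined internet slang, so we need to extend the dictionary.

1. Configuration file

Edit IKAnalyzer.cfg.xml under /mydata/elasticsearch/plugins/ik/config (this is the host directory mapped into the container as a data volume; the file can also be edited from inside the ES container).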

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
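        <comment>IK Analyzer extension configuration</comment>
        <!-- Users can configure their own extension dictionary here -->
        <entry key="ext_dict"></entry>
        <!-- Users can configure their own extension stop-word dictionary here -->
        <entry key="ext_stopwords"></entry>
        <!-- Users can configure a remote extension dictionary here -->
        <!-- <entry key="remote_ext_dict">words_location</entry> -->
        <!-- Users can configure a remote extension stop-word dictionary here -->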
        <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>

2. Install Nginx with Docker and use it to serve a remote dictionary

Installation: https://www.cnblogs.com/mangoubiubiu/p/16796373.html

Create a folder for the dictionary file

mkdir -p /mydata/nginx/html/es
cd /mydata/nginx/html/es
vi fenci.txt
# write the word 乔碧罗 into the file

3. Edit the configuration file: point the remote dictionary at the file served by Nginx

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
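        <comment>IK Analyzer extension configuration</comment>
        <!-- Users can configure their own extension dictionary here -->
        <entry key="ext_dict"></entry>
        <!-- Users can configure their own extension stop-word dictionary here -->
        <entry key="ext_stopwords"></entry>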
        <!-- Users can configure a remote extension dictionary here -->
        <entry key="remote_ext_dict">http://192.168.56.10/es/fenci.txt</entry>
        <!-- Users can configure a remote extension stop-word dictionary here -->
        <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>
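Because a malformed comment or tag in this file is easy to miss, it can help to check that the edited XML still parses before restarting. Below is a minimal sketch using Python's standard library. The embedded string is a trimmed copy of the configuration above, and read_entries is a hypothetical helper name, not part of IK.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of IKAnalyzer.cfg.xml as edited above (comments shortened).
CONFIG = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
        <comment>IK Analyzer extension configuration</comment>
        <entry key="ext_dict"></entry>
        <entry key="ext_stopwords"></entry>
        <entry key="remote_ext_dict">http://192.168.56.10/es/fenci.txt</entry>
        <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>
"""


def read_entries(xml_text):
    """Parse the properties file and return {key: value} for every active entry."""
    # Parsing raises xml.etree.ElementTree.ParseError if the XML is not well-formed.
    root = ET.fromstring(xml_text.encode("utf-8"))
    return {e.get("key"): (e.text or "").strip() for e in root.iter("entry")}


entries = read_entries(CONFIG)
print(entries["remote_ext_dict"])  # http://192.168.56.10/es/fenci.txt
```

Entries that are commented out, such as remote_ext_stopwords here, do not appear in the result, which makes it easy to confirm which settings are actually active.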

4. Restart ES

docker restart <container-id>
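After the restart, one way to confirm that the remote dictionary was picked up is to run 乔碧罗 through _analyze and check that it comes back as a single token. The sketch below assumes the node is reachable without authentication at http://localhost:9200, which is an assumed address since the post does not give one. The helper names analyze and is_single_token are hypothetical.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: adjust to your node's address


def analyze(text, analyzer="ik_smart", es_url=ES_URL):
    """POST the text to the _analyze endpoint and return the parsed JSON response."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    request = urllib.request.Request(
        f"{es_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def is_single_token(response, word):
    """True if the analyzer returned exactly one token equal to the given word."""
    return [t["token"] for t in response["tokens"]] == [word]
```

With a running node, the check would be is_single_token(analyze("乔碧罗"), "乔碧罗"). If it returns False, the word is probably still being split into separate characters, which suggests the dictionary URL was not reachable from the ES container or the file was not loaded.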

Author: KwFruit

Link: https://www.cnblogs.com/mangoubiubiu/p/16796577.html

Posted 2022-10-16 17:09 by KwFruit