es之IK分词器

1：默认的分析器-- standard

使用默认的分词器

curl -XGET 'http://hadoop01:9200/_analyze?pretty&analyzer=standard' -d '我爱中国'
curl -XGET 'http://hadoop01:9200/_analyze?pretty&analyzer=simple' -d '我爱中国'

这就是默认的分词器，但是默认的分析器有时候在生产环境会出现问题，比如：

curl -XPUT 'http://hadoop01:9200/test/class/1' -d '{"title" : "我爱中国"}'

去hadoop01:9100查看当前索引数据：

然后使用命令查询：

curl -XGET 'hadoop01:9200/test/class/_search?pretty' -d '
{ "query": { "term": {"title": "我爱中国"}}}'

结果：

发现没有找到数据！

这说明分词器出了问题！然后我们可以通过分析器提供的接口去做查询，查看一下当前的分词器是如何进行分词的：

curl -XGET 'hadoop01:9200/test/_analyze?field=title?pretty' -d '我爱中国'

通过图片我们可以看到，我们的数据经过分词器后，会将数据切割成---我、爱、中、国；如果单独去查询每一个字段，都是可以查询到的：

整体查询确查询不到；这就是默认的分词器存在的弊端；

2：IK分词器

2.1：安装

1）：下载

wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v5.5.2/elasticsearch-analysis-ik-5.5.2.zip

2）：下载好了之后解压，将解压后的文件夹放在elasticsearch目录下的plugins目录下，并重命名为analysis-ik【发送到所有机器(本课程提供安装包)】

3）：重启elasticsearch

ik带有两个分词器:

•   ik_max_word ：会将文本做最细粒度的拆分；尽可能多的拆分出词语
•   ik_smart：会做最粗粒度的拆分；已被分出的词语将不会再次被其它词语占有

2.2：使用IK分词

1）：构建索引，指定IK分词器

curl -XPUT 'http://hadoop01:9200/iktest?pretty' -d '{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "ik" : {
                    "tokenizer" : "ik_max_word"
                }
            }
        }
    },
    "mappings" : {
        "article" : {
            "dynamic" : true,
            "properties" : {
                "subject" : {
                    "type" : "string",
                    "analyzer" : "ik_max_word"
                }
            }
        }
    }
}'

2）：插入数据

curl -XPOST http://hadoop01:9200/iktest/article/_bulk?pretty -d '
{ "index" : { "_id" : "1" } }
{"subject" : "＂闺蜜＂崔顺实被韩检方传唤 韩总统府促彻查真相" }
{ "index" : { "_id" : "2" } }
{"subject" : "韩举行＂护国训练＂ 青瓦台:决不许国家安全出问题" }
{ "index" : { "_id" : "3" } }
{"subject" : "媒体称FBI已经取得搜查令 检视希拉里电邮" }
{ "index" : { "_id" : "4" } }
{"subject" : "村上春树获安徒生奖 演讲中谈及欧洲排外问题" }
{ "index" : { "_id" : "5" } }
{"subject" : "希拉里团队炮轰FBI 参院民主党领袖批其”违法”" }
'

3）：查询测试

curl -XPOST http://hadoop01:9200/iktest/article/_search?pretty  -d'
{
    "query" : { "match" : { "subject" : "希拉里和韩国" }},
    "highlight" : {
        "pre_tags" : ["<font color='red'>"],
        "post_tags" : ["</font>"],
        "fields" : {
            "subject" : {}
        }
    }
}
'

2.3：热词更新

现在网络热词很多，每隔一段时间就会出现网红热词；但是如果直接使用IK分词，是识别不了这些词的；

curl -XGET 'http://hadoop01:9200/_analyze?pretty&analyzer=ik_max_word' -d '
王者荣耀'

返回结果：

IK并没有识别出网红的词汇；ik 的主词典中没有【老铁、没毛病】词，所以被拆分了。

修改 IK 的配置文件：ES 目录/export/servers/elasticsearch-5.5.2/plugins/analysis-ik/config/IKAnalyzer.cfg.xml

修改配置如下:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">  
<properties>  
    <comment>IK Analyzer 扩展配置</comment>
    <!--用户可以在这里配置自己的扩展字典 -->    
    <entry key="ext_dict">custom/my.dic</entry>     
     <!--用户可以在这里配置自己的扩展停止词字典-->
    <entry key="ext_stopwords">ext_stopword.dic</entry>
    <!--用户可以在这里配置远程扩展字典 --> 
    <entry key="remote_ext_dict">192.168.0.1/stopworld.di</entry>
    <!--用户可以在这里配置远程扩展停止词字典-->
    <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>

在custom/my.dic中添加：老铁、没毛病

然后重启es；在次执行：

curl -XGET 'http://hadoop01:9200/_analyze?pretty&analyzer=ik_max_word' -d '老铁没毛病，双击666'

但是，上面的操作是需要进行重启的，上面的步骤只是更新词库，并不是所谓的热更新；所谓的热更新词库，是要在不重启es的前提下完成的；

posted @ 2017-05-22 23:06 niutao 阅读(1252) 评论(0) 编辑收藏举报

刷新页面返回顶部