
Installing a Chinese analyzer (IK) for Elasticsearch

1. Installing the analyzer plugin

./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.2.3/elasticsearch-analysis-ik-6.2.3.zip

NOTE: replace 6.2.3 with your own Elasticsearch version

GitHub repository:

https://github.com/medcl/elasticsearch-analysis-ik

Note that the plugin version you install must match your Elasticsearch version.
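After installing, you can check that the plugin was picked up; Elasticsearch then needs to be restarted for the newly installed plugin to load. A minimal check, assuming the default on-disk layout:

./bin/elasticsearch-plugin list
# should print an entry such as: analysis-ik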

Usage:

1> Add the setting index.analysis.analyzer.default.type: ik as the last line of the Elasticsearch configuration file config/elasticsearch.yml to make ik the default analyzer for all indices.

2> Alternatively, enable the ik analyzer through an index mapping, as shown in the sketch after this list.
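A minimal sketch of both options, assuming a local node on the default port; the index name my_index and the field content are made up for illustration, and the mapping uses the single-type (_doc) style of Elasticsearch 6.x. Two caveats: on Elasticsearch 5.x and later, index-level settings such as index.analysis.analyzer.default.type generally have to be supplied per index (as in the settings block below) rather than in elasticsearch.yml, and newer versions of the IK plugin register the analyzers as ik_max_word and ik_smart rather than a single ik analyzer.

curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'
{
    "settings":{
        "index":{
            "analysis":{
                "analyzer":{
                    "default":{
                        "type":"ik_max_word"
                    }
                }
            }
        }
    },
    "mappings":{
        "_doc":{
            "properties":{
                "content":{
                    "type":"text",
                    "analyzer":"ik_max_word",
                    "search_analyzer":"ik_smart"
                }
            }
        }
    }
}
'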

2. The two tokenization modes of the IK analyzer

1> ik_max_word: splits the text at the finest granularity. For example, "北京邮电大学" is broken into every plausible sub-word combination:

{
    "tokens":[
        {
            "token":"北京邮电",
            "start_offset":0,
            "end_offset":4,
            "type":"CN_WORD",
            "position":0
        },
        {
            "token":"北京",
            "start_offset":0,
            "end_offset":2,
            "type":"CN_WORD",
            "position":1
        },
        {
            "token":"邮电大学",
            "start_offset":2,
            "end_offset":6,
            "type":"CN_WORD",
            "position":2
        },
        {
            "token":"邮电",
            "start_offset":2,
            "end_offset":4,
            "type":"CN_WORD",
            "position":3
        },
        {
            "token":"电大",
            "start_offset":3,
            "end_offset":5,
            "type":"CN_WORD",
            "position":4
        },
        {
            "token":"大学",
            "start_offset":4,
            "end_offset":6,
            "type":"CN_WORD",
            "position":5
        }
    ]
}
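Output like the above can be reproduced with the _analyze API once the plugin is installed (request shape shown for Elasticsearch 6.x, assuming a local node on the default port):

curl -X GET "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
    "analyzer":"ik_max_word",
    "text":"北京邮电大学"
}
'

Swapping "ik_max_word" for "ik_smart" in the same request yields the coarser output shown below.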

2> ik_smart: splits the text at the coarsest granularity:

{
    "tokens":[
        {
            "token":"北京",
            "start_offset":0,
            "end_offset":2,
            "type":"CN_WORD",
            "position":0
        },
        {
            "token":"邮电大学",
            "start_offset":2,
            "end_offset":6,
            "type":"CN_WORD",
            "position":1
        }
    ]
}
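To see the two modes working together end to end, here is a small sketch that indexes a document into the hypothetical my_index defined earlier (ik_max_word at index time, ik_smart at search time) and then runs a match query against it:

curl -X PUT "localhost:9200/my_index/_doc/1" -H 'Content-Type: application/json' -d'
{
    "content":"北京邮电大学"
}
'

curl -X GET "localhost:9200/my_index/_search" -H 'Content-Type: application/json' -d'
{
    "query":{
        "match":{
            "content":"邮电大学"
        }
    }
}
'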