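ElasticSearch Chinese Word Segmentation (IK)
IK is one of the most popular Chinese analyzers for ElasticSearch. This post briefly walks through installing it and testing it.
1. ElasticSearch's built-in analyzer
The built-in analyzer is weak on Chinese. You can try it yourself: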
[zsz@VS-zsz ~]$ curl -XGET 'http://192.168.31.77:9200/_analyze?analyzer=standard' -d '岁月如梭'
{
"tokens": [
{
"token": "岁",
"start_offset": 0,
"end_offset": 1,
"type": "<IDEOGRAPHIC>",
"position": 0
},
{
"token": "月",
"start_offset": 1,
"end_offset": 2,
"type": "<IDEOGRAPHIC>",
"position": 1
},
{
"token": "如",
"start_offset": 2,
"end_offset": 3,
"type": "<IDEOGRAPHIC>",
"position": 2
},
{
"token": "梭",
"start_offset": 3,
"end_offset": 4,
"type": "<IDEOGRAPHIC>",
"position": 3
}
]
}
[zsz@VS-zsz ~]$ curl -XGET 'http://192.168.31.77:9200/_analyze?analyzer=standard' -d 'i am an enginner'
{
"tokens": [
{
"token": "i",
"start_offset": 0,
"end_offset": 1,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "am",
"start_offset": 2,
"end_offset": 4,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "an",
"start_offset": 5,
"end_offset": 7,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "enginner",
"start_offset": 8,
"end_offset": 16,
"type": "<ALPHANUM>",
"position": 3
}
]
}
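As the output shows, the standard analyzer handles English words fine but splits Chinese text into single characters, so ES's built-in Chinese segmentation is poor.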
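The same check can be scripted. Below is a minimal Python sketch, not part of the original setup: it builds the `_analyze` URL used in the curl calls above and pulls the token strings out of a response body. The helper names `analyze_url` and `token_strings` are hypothetical, and the sample response is the standard-analyzer output from above, trimmed to the fields used here.

```python
import json
from urllib.parse import urlencode


def analyze_url(base_url, analyzer):
    """Build the _analyze URL in the same form as the curl commands above."""
    return base_url.rstrip("/") + "/_analyze?" + urlencode({"analyzer": analyzer})


def token_strings(response_body):
    """Return the token texts from an _analyze JSON response, in order."""
    return [t["token"] for t in json.loads(response_body)["tokens"]]


# Sample response trimmed from the standard-analyzer output above.
SAMPLE = """
{"tokens": [
  {"token": "岁", "start_offset": 0, "end_offset": 1},
  {"token": "月", "start_offset": 1, "end_offset": 2},
  {"token": "如", "start_offset": 2, "end_offset": 3},
  {"token": "梭", "start_offset": 3, "end_offset": 4}
]}
"""

if __name__ == "__main__":
    print(analyze_url("http://192.168.31.77:9200", "standard"))
    print(token_strings(SAMPLE))
```

Every token here is a single character, which is exactly the weakness described above.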
2. The IK Chinese analyzer
2.1. If you downloaded the source release of IK, you have to build the plugin yourself. First install Maven on Linux:
tar zxvf apache-maven-3.0.5-bin.tar.gz
mv apache-maven-3.0.5 /usr/local/apache-maven-3.0.5
vi /etc/profile
Add:
export MAVEN_HOME=/usr/local/apache-maven-3.0.5
export PATH=$PATH:$MAVEN_HOME/bin
source /etc/profile
mvn -v
2.2. Build the source. The output lands in the target/ directory:
mvn clean package
Deploy the packaged IK plugin to ES:
[zsz@VS-zsz ~]$ cd /home/zsz/elasticsearch-analysis-ik-1.10.0/target/releases/
[zsz@VS-zsz releases]$ mkdir /usr/local/elasticsearch-2.4.0/plugins/ik/
[zsz@VS-zsz releases]$ cp elasticsearch-analysis-ik-1.10.0.zip /usr/local/elasticsearch-2.4.0/plugins/ik/elasticsearch-analysis-ik-1.10.0.zip
[zsz@VS-zsz releases]$ unzip /usr/local/elasticsearch-2.4.0/plugins/ik/elasticsearch-analysis-ik-1.10.0.zip -d /usr/local/elasticsearch-2.4.0/plugins/ik/
[zsz@VS-zsz releases]$ cd /usr/local/elasticsearch-2.4.0/plugins/ik/
[zsz@VS-zsz ik]$ rm elasticsearch-analysis-ik-1.10.0.zip
[zsz@VS-zsz ik]$ mkdir /usr/local/elasticsearch-2.4.0/config/ik
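If you would rather script the deployment than type the commands, the unzip step can be sketched in Python with the standard `zipfile` module. This is only an illustration under the paths used in this post; `deploy_plugin` is a hypothetical helper, not something IK or ElasticSearch ships.

```python
import zipfile
from pathlib import Path


def deploy_plugin(zip_path, plugin_dir):
    """Extract the plugin zip into plugin_dir and return the archive's entry names."""
    target = Path(plugin_dir)
    # Equivalent of the mkdir step; harmless if the directory already exists.
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
        return sorted(archive.namelist())


# Example call with the paths from this post (assumes the zip has been built):
# deploy_plugin(
#     "/home/zsz/elasticsearch-analysis-ik-1.10.0/target/releases/"
#     "elasticsearch-analysis-ik-1.10.0.zip",
#     "/usr/local/elasticsearch-2.4.0/plugins/ik",
# )
```

Extracting straight from the build directory means no copy of the zip is left behind in plugins/ik that has to be deleted afterwards.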
Copy IK's configuration into the ElasticSearch configuration directory:
[zsz@VS-zsz ik]$ cp /home/zsz/elasticsearch-analysis-ik-1.10.0/config /usr/local/elasticsearch-2.4.0/config/ik
Edit the ElasticSearch configuration:
[zsz@VS-zsz ik]$ vi /usr/local/elasticsearch-2.4.0/config/elasticsearch.yml
Append the analyzer setting at the end of the file:
index.analysis.analyzer.ik.type : "ik"
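Once the analyzer is registered under the name `ik`, a field mapping can refer to it by that name. The sketch below only builds the JSON body for such a mapping, assuming the ES 2.x `string` field type; the type name `article` and field name `content` are made up for illustration, and `ik_mapping_body` is a hypothetical helper.

```python
import json


def ik_mapping_body(doc_type, field):
    """Build a create-index body that analyzes one string field with "ik"."""
    return {
        "mappings": {
            doc_type: {
                "properties": {
                    # "ik" is the analyzer name configured in elasticsearch.yml above.
                    field: {"type": "string", "analyzer": "ik"}
                }
            }
        }
    }


if __name__ == "__main__":
    # "article" and "content" are illustrative names, not from the original post.
    print(json.dumps(ik_mapping_body("article", "content"), indent=2))
```

The printed JSON could be sent as the body of an index-creation request after ElasticSearch has been restarted with the plugin in place.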
Start ElasticSearch:
[zsz@VS-zsz ik]$ cd /usr/local/elasticsearch-2.4.0/
[zsz@VS-zsz elasticsearch-2.4.0]$ ./bin/elasticsearch -d
Test the IK analyzer:
[zsz@VS-zsz elasticsearch-2.4.0]$ curl -XGET 'http://192.168.31.77:9200/_analyze?analyzer=ik' -d '岁月如梭'
{
"tokens": [
{
"token": "岁月如梭",
"start_offset": 0,
"end_offset": 4,
"type": "CN_WORD",
"position": 0
},
{
"token": "岁月",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 1
},
{
"token": "如梭",
"start_offset": 2,
"end_offset": 4,
"type": "CN_WORD",
"position": 2
},
{
"token": "梭",
"start_offset": 3,
"end_offset": 4,
"type": "CN_WORD",
"position": 3
}
]
}
[zsz@VS-zsz config]$ curl -XGET 'http://192.168.31.77:9200/_analyze?analyzer=ik' -d 'elasticsearch很受欢迎的的一款拥有活跃社区开源的搜索解决方案'
{
"tokens": [
{
"token": "elasticsearch",
"start_offset": 0,
"end_offset": 13,
"type": "CN_WORD",
"position": 0
},
{
"token": "elastic",
"start_offset": 0,
"end_offset": 7,
"type": "CN_WORD",
"position": 1
},
{
"token": "很受",
"start_offset": 13,
"end_offset": 15,
"type": "CN_WORD",
"position": 2
},
{
"token": "受欢迎",
"start_offset": 14,
"end_offset": 17,
"type": "CN_WORD",
"position": 3
},
{
"token": "欢迎",
"start_offset": 15,
"end_offset": 17,
"type": "CN_WORD",
"position": 4
},
{
"token": "一款",
"start_offset": 19,
"end_offset": 21,
"type": "CN_WORD",
"position": 5
},
{
"token": "一",
"start_offset": 19,
"end_offset": 20,
"type": "TYPE_CNUM",
"position": 6
},
{
"token": "款",
"start_offset": 20,
"end_offset": 21,
"type": "COUNT",
"position": 7
},
{
"token": "拥有",
"start_offset": 21,
"end_offset": 23,
"type": "CN_WORD",
"position": 8
},
{
"token": "拥",
"start_offset": 21,
"end_offset": 22,
"type": "CN_WORD",
"position": 9
},
{
"token": "有",
"start_offset": 22,
"end_offset": 23,
"type": "CN_CHAR",
"position": 10
},
{
"token": "活跃",
"start_offset": 23,
"end_offset": 25,
"type": "CN_WORD",
"position": 11
},
{
"token": "跃",
"start_offset": 24,
"end_offset": 25,
"type": "CN_WORD",
"position": 12
},
{
"token": "社区",
"start_offset": 25,
"end_offset": 27,
"type": "CN_WORD",
"position": 13
},
{
"token": "开源",
"start_offset": 27,
"end_offset": 29,
"type": "CN_WORD",
"position": 14
},
{
"token": "搜索",
"start_offset": 30,
"end_offset": 32,
"type": "CN_WORD",
"position": 15
},
{
"token": "索解",
"start_offset": 31,
"end_offset": 33,
"type": "CN_WORD",
"position": 16
},
{
"token": "索",
"start_offset": 31,
"end_offset": 32,
"type": "CN_WORD",
"position": 17
},
{
"token": "解决方案",
"start_offset": 32,
"end_offset": 36,
"type": "CN_WORD",
"position": 18
},
{
"token": "解决",
"start_offset": 32,
"end_offset": 34,
"type": "CN_WORD",
"position": 19
},
{
"token": "方案",
"start_offset": 34,
"end_offset": 36,
"type": "CN_WORD",
"position": 20
}
]
}
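As you can see, the Chinese segmentation is now much more reasonable: IK returns whole words such as 岁月如梭 and 解决方案 instead of isolated characters.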
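To put a rough number on the difference, here is a small Python sketch that keeps only the tokens spanning more than one character, using the offsets from the two 岁月如梭 responses in this post. `multi_char_tokens` is a hypothetical helper and the token lists are trimmed copies of the outputs above.

```python
def multi_char_tokens(tokens):
    """Return the texts of tokens that cover more than one character."""
    return [t["token"] for t in tokens if t["end_offset"] - t["start_offset"] > 1]


# Trimmed from the standard-analyzer output in section 1.
STANDARD_TOKENS = [
    {"token": "岁", "start_offset": 0, "end_offset": 1},
    {"token": "月", "start_offset": 1, "end_offset": 2},
    {"token": "如", "start_offset": 2, "end_offset": 3},
    {"token": "梭", "start_offset": 3, "end_offset": 4},
]

# Trimmed from the IK output in section 2.
IK_TOKENS = [
    {"token": "岁月如梭", "start_offset": 0, "end_offset": 4},
    {"token": "岁月", "start_offset": 0, "end_offset": 2},
    {"token": "如梭", "start_offset": 2, "end_offset": 4},
    {"token": "梭", "start_offset": 3, "end_offset": 4},
]

if __name__ == "__main__":
    print(len(multi_char_tokens(STANDARD_TOKENS)), "multi-character tokens from standard")
    print(len(multi_char_tokens(IK_TOKENS)), "multi-character tokens from ik")
```

The standard analyzer yields no multi-character tokens for this input, while IK yields three, including the full idiom.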
Original post: http://www.cnblogs.com/zhongshengzhen/p/elasticsearch_ik.html