Upgrading ES: analyzed-text search returns mismatched result counts. Confirming the problem and the ik code change (Part 1)

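After upgrading the es cluster and updating the ik plugin to the matching version, testing showed that search results were no longer consistent with the old cluster.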

I compared the source code of the two self-compiled ik plugin versions and checked the es changelog, which revealed the cause.

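From 6.* to 7.*, ik has almost no functional changes in the tokenization code itself. The changes are mainly to log messages (Chinese to English) and some adjustments to remote dictionary loading.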

Testing showed that the change responsible for the difference is this commit:

https://github.com/medcl/elasticsearch-analysis-ik/commit/06e8a23d1828ae993aef21b81829a7729d06f224#diff-34e2e8feb4a24c48b1157d4edd30fd621511757f8b32a3ecac122863ea8c9f3f

es6

elasticsearch-analysis-ik/src/main/java/org/wltea/analyzer/core/AnalyzeContext.java

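					// No single-character entry in the dictionary, but the lexemes conflict: emit the single character from the earlier of the two intersecting lexemes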
					int innerIndex = index + 1;
					for (; innerIndex < index + l.getLength(); innerIndex++) {
						Lexeme innerL = path.peekFirst();
						if (innerL != null && innerIndex == innerL.getBegin()) {
							this.outputSingleCJK(innerIndex - 1);
						}
					}

es7

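					// No single-character entry in the dictionary, but the lexemes conflict: emit the single character from the earlier of the two intersecting lexemes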
					/*int innerIndex = index + 1;
					for (; innerIndex < index + l.getLength(); innerIndex++) {
						Lexeme innerL = path.peekFirst();
						if (innerL != null && innerIndex == innerL.getBegin()) {
							this.outputSingleCJK(innerIndex - 1);
						}
					}*/

This block is active in 6.x and commented out in 7.x.
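To make the effect of the removed loop easier to see, here is a minimal Python model of just that loop. It is a sketch, not ik's code: the function name `extra_single_char_offsets` and the representation of a lexeme as a start offset plus a length are assumptions for illustration. Only the loop bounds and the condition come from the snippet above.

```python
from typing import List, Optional


def extra_single_char_offsets(index: int, length: int,
                              next_begin: Optional[int]) -> List[int]:
    """Model of the loop that is active in 6.x and commented out in 7.x.

    index      -- start offset of the lexeme that was just output (l)
    length     -- l.getLength()
    next_begin -- begin offset of path.peekFirst(), or None if the path is empty
    Returns the offsets that the loop would pass to outputSingleCJK.
    """
    offsets = []
    for inner_index in range(index + 1, index + length):
        if next_begin is not None and inner_index == next_begin:
            offsets.append(inner_index - 1)
    return offsets


# The next lexeme starts inside the current one (overlap): one extra offset.
print(extra_single_char_offsets(0, 3, 2))  # [1]
# The next lexeme starts right after the current one: nothing extra.
print(extra_single_char_offsets(0, 2, 2))  # []
```

In this model, 6.x calls outputSingleCJK for one extra offset whenever the next lexeme begins inside the current one. With the loop commented out, 7.x never makes that call, which is consistent with the two versions producing different token lists for the same text.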


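I narrowed down the data to locate a doc whose tokenization differs, then checked that doc's tokenization on each ES version to confirm the problem.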

How to query the tokenization result:

curl -k -H 'Content-Type: application/json' http://ip:port/index_name/_analyze -d'{"text": "字典中无单字，但是词元冲突了，切分出相交词元的前一个词元中的单字 ","tokenizer": "ik_max_word"}'

The result is similar to the following:


{
    "tokens": [
        {
            "token": "#",
            "start_offset": 0,
            "end_offset": 1,
            "type": "CN_CHAR",
            "position": 0
        },     
        {
            "token": "z8ytwll",
            "start_offset": 89,
            "end_offset": 96,
            "type": "LETTER",
            "position": 77
        },
        {
            "token": "z",
            "start_offset": 89,
            "end_offset": 90,
            "type": "ENGLISH",
            "position": 78
        },
        {
            "token": "8",
            "start_offset": 90,
            "end_offset": 91,
            "type": "ARABIC",
            "position": 79
        },
        {
            "token": "ytwll",
            "start_offset": 91,
            "end_offset": 96,
            "type": "ENGLISH",
            "position": 80
        }
    ]
}
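To compare the two clusters without reading the JSON by eye, the same `_analyze` request can be scripted and the token lists diffed. This is a minimal sketch, assuming both clusters are reachable over plain HTTP without authentication. The helper names `parse_tokens`, `fetch_tokens` and `diff_tokens` are hypothetical; only the request body and the `tokens` response fields are taken from the example above.

```python
import json
import urllib.request
from typing import Dict, List, Tuple

Token = Tuple[str, int, int]  # (token, start_offset, end_offset)


def parse_tokens(response: Dict) -> List[Token]:
    """Reduce an _analyze response to comparable (token, start, end) tuples."""
    return [(t["token"], t["start_offset"], t["end_offset"])
            for t in response.get("tokens", [])]


def fetch_tokens(base_url: str, index: str, text: str,
                 tokenizer: str = "ik_max_word") -> List[Token]:
    """POST the text to <base_url>/<index>/_analyze and return its tokens."""
    body = json.dumps({"text": text, "tokenizer": tokenizer}).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/{index}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        return parse_tokens(json.load(resp))


def diff_tokens(old: List[Token], new: List[Token]) -> Dict[str, List[Token]]:
    """Return the tokens that appear on only one side, sorted."""
    old_set, new_set = set(old), set(new)
    return {
        "only_old": sorted(old_set - new_set),
        "only_new": sorted(new_set - old_set),
    }
```

A call would look like `diff_tokens(fetch_tokens(es6_url, "index_name", text), fetch_tokens(es7_url, "index_name", text))`, where `es6_url` and `es7_url` are placeholders for the two clusters' `http://ip:port` addresses. Comparing on offsets rather than `position` avoids false differences when one extra token shifts every later position.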

posted @ 2021-06-21 22:35 cclient
