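Elasticsearch Search: Chinese and English Search (Part 4)
Requirements
For a product named 雪花啤酒 (Snow Beer), searches for 雪花, 啤酒, 雪花啤酒, xh, pj, xh啤酒, and 雪花pj should all match.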
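Installing the ik plugin
See https://www.cnblogs.com/LQBlog/p/10443862.html. The step that modifies the source code can be skipped.
Installing the pinyin analyzer
Same procedure as for ik: download the source, build it, move the package into the es plugins directory, and rename the directory to pinyin. https://github.com/medcl/elasticsearch-analysis-pinyin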
Test
GET request: http://127.0.0.1:9200/_analyze
body:
{ "analyzer":"pinyin", "text":"雪花啤酒" }
Response:
{ "tokens": [ { "token": "xue", "start_offset": 0, "end_offset": 0, "type": "word", "position": 0 }, { "token": "xhpj", "start_offset": 0, "end_offset": 0, "type": "word", "position": 0 }, { "token": "hua", "start_offset": 0, "end_offset": 0, "type": "word", "position": 1 }, { "token": "pi", "start_offset": 0, "end_offset": 0, "type": "word", "position": 2 }, { "token": "jiu", "start_offset": 0, "end_offset": 0, "type": "word", "position": 3 } ] }
This confirms that the plugin was installed successfully.
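The same check can be scripted. The sketch below is only an illustration: `build_analyze_request` and `tokens_from_response` are hypothetical helper names, `urllib` is just one possible client, and actually sending the request assumes a node is running at the address used in this post.

```python
import json
import urllib.request

ES_URL = "http://127.0.0.1:9200"  # address used throughout this post


def build_analyze_request(text, analyzer="pinyin"):
    """Build (but do not send) the _analyze request shown above."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        ES_URL + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="GET",
    )


def tokens_from_response(response_json):
    """Extract just the token strings from an _analyze response."""
    return [t["token"] for t in response_json["tokens"]]


request = build_analyze_request("雪花啤酒")

# The response shown above, trimmed to the one field this helper reads.
sample = {"tokens": [{"token": "xue"}, {"token": "xhpj"}, {"token": "hua"},
                     {"token": "pi"}, {"token": "jiu"}]}
print(tokens_from_response(sample))
```

To run it against a live node, pass `request` to `urllib.request.urlopen` and feed the parsed JSON to `tokens_from_response`.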
Testing combined Chinese and pinyin search
Custom mapping and custom analyzer
PUT request: http://127.0.0.1:9200/opcm3
body:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ik_pinyin_analyzer": {        // a custom analyzer named ik_pinyin_analyzer
          "type": "custom",            // marks this as a custom analyzer
          "tokenizer": "ik_smart",     // tokenize with ik: ik_smart is coarse-grained, ik_max_word is the finest-grained
          "filter": ["my_pinyin"]      // hand each resulting token to the filter for a second pass
        },
        "onlyOne_analyzer": {
          "tokenizer": "onlyOne_pinyin"
        }
      },
      "tokenizer": {
        "onlyOne_pinyin": {
          "type": "pinyin",
          "keep_separate_first_letter": "true",
          "keep_full_pinyin": "false"
        }
      },
      "filter": {
        "my_pinyin": {                          // define the filter
          "type": "pinyin",
          "keep_joined_full_pinyin": true,      // also emit the joined full pinyin of a word, e.g. 雪花 -> xuehua
          "keep_separate_first_letter": true,   // also emit each first letter on its own, so 雪花 yields xue, hua, xuehua, xh, x, h
          "none_chinese_pinyin_tokenize": true  // xh is tokenized into x, h, xh
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "productName": {
          "type": "text",
          "analyzer": "ik_pinyin_analyzer",  // index with the custom analyzer: ik tokenizes first, then the filter passes tokens to pinyin
          "fields": {                         // not used yet; kept as a reminder that a different analyzer can be chosen per search condition
            "keyword_once_pinyin": {          // extra sub-field: analyzed only, not present in _source; query it as productName.keyword_once_pinyin when the input is a single letter
              "type": "text",
              "analyzer": "onlyOne_analyzer"
            }
          }
        }
      }
    }
  }
}
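My understanding of the filter
As I understand it, ik tokenizes the text first, and each resulting token is then passed through the filter to the pinyin tokenizer. ik splits 雪花啤酒 into 雪花 and 啤酒. pinyin then turns 雪花 into xue, hua, xh, x, h and 啤酒 into pi, jiu, pj, p, j.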
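The tokenizer-then-filter chain described above can be mimicked in a few lines of Python. This is a toy model, not the plugins' real behavior: `IK_SMART` and `PINYIN` are hypothetical lookup tables standing in for the ik and pinyin dictionaries, and the filter only produces full pinyin, joined first letters, and separate first letters.

```python
# Hypothetical lookup tables standing in for the real ik and pinyin plugins.
IK_SMART = {"雪花啤酒": ["雪花", "啤酒"]}
PINYIN = {"雪": "xue", "花": "hua", "啤": "pi", "酒": "jiu"}


def ik_smart(text):
    """Stand-in for the ik_smart tokenizer: look the text up in a toy table."""
    return IK_SMART.get(text, [text])


def pinyin_filter(token):
    """Stand-in for the pinyin token filter, applied to ONE ik token."""
    syllables = [PINYIN[ch] for ch in token]
    initials = [s[0] for s in syllables]
    # Full pinyin first, then the joined first letters, then each first letter.
    return syllables + ["".join(initials)] + initials


def analyze(text):
    """Run the tokenizer, then feed every token through the filter."""
    return {token: pinyin_filter(token) for token in ik_smart(text)}


result = analyze("雪花啤酒")
for word, terms in result.items():
    print(word, terms)
```

The point of the sketch is the ordering: the filter never sees the whole string 雪花啤酒, only the words that ik has already cut out.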
Insert test data
PUT request: http://127.0.0.1:9200/opcm3/doc/1
body:
{ "productName":"雪花纯生勇闯天涯9度100ml" }
PUT request: http://127.0.0.1:9200/opcm3/doc/2
body:
{ "productName":"金威纯生勇闯天涯9度100ml" }
Inspect the indexed terms
GET request: http://127.0.0.1:9200/opcm3/doc/{id}/_termvectors?fields=productName
GET request: http://127.0.0.1:9200/opcm3/doc/{id}/_termvectors?fields=productName.keyword_once_pinyin
Test search
http://127.0.0.1:9200/opcm3/_search
{ "query":{ "match":{ "productName":{ "query":"雪花纯生" } } } }
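This returns both 雪花纯生 and 金威纯生. Choose match or match_phrase depending on whether you want loose matching or adjacent (phrase) matching.
I need adjacent matching, so I changed the query to: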
{ "query":{ "match_phrase":{ "productName":{ "query":"雪花纯生" } } } }
Now only 雪花纯生 is returned.
Searching for the 雪花纯生 9度 product
{ "query":{ "match_phrase":{ "productName":{ "query":"雪花纯生9度" } } } }
This query returns no data.
For the reason, see: https://www.cnblogs.com/LQBlog/p/10580247.html
Changing it to the following returns results:
{ "query":{ "match_phrase":{ "productName":{ "query":"雪花纯生9度", "slop":5 } } } }
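The three query bodies used in this section differ only in the query type and the optional slop. As a convenience, they can be generated from one small helper. `build_query` is a hypothetical name introduced here for illustration, and the rule that slop is only attached to match_phrase is a choice made in this sketch.

```python
import json


def build_query(text, phrase=False, slop=None, field="productName"):
    """Build a search body: `match` for loose matching, `match_phrase` for adjacency."""
    kind = "match_phrase" if phrase else "match"
    clause = {"query": text}
    if slop is not None:
        # This sketch only attaches slop to phrase queries.
        if not phrase:
            raise ValueError("slop is only used with match_phrase in this sketch")
        clause["slop"] = slop
    return {"query": {kind: {field: clause}}}


loose = build_query("雪花纯生")                              # loose matching
adjacent = build_query("雪花纯生", phrase=True)              # adjacent matching
relaxed = build_query("雪花纯生9度", phrase=True, slop=5)    # phrase with slop

print(json.dumps(relaxed, ensure_ascii=False))
```

Each returned dict can be serialized with `json.dumps` and sent as the body of a request to http://127.0.0.1:9200/opcm3/_search.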
The pinyin analyzer also supports many other parameters; see the plugin repository linked above for the full list.
Troubleshooting and fixing the model above
Add test data
{
"productName":"纯生"
}
{
"productName":"纯爽"
}
Test
Search
{ "query":{ "match_phrase":{ "productName":{ "query":"纯生", "slop":5 } } } }
Result
{ "took": 3, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 2, "max_score": 2.8277423, "hits": [ { "_index": "opcm3", "_type": "doc", "_id": "1", "_score": 2.8277423, "_source": { "productName": "纯爽" } }, { "_index": "opcm3", "_type": "doc", "_id": "2", "_score": 1.4466299, "_source": { "productName": "纯生" } } ] } }
Note that 纯爽 is returned as well.
Investigation
1. Check the terms indexed for each document (纯生, then 纯爽)
http://127.0.0.1:9200/opcm3/doc/2/_termvectors?fields=productName
纯生: [c,chun,s,sheng]
纯爽: [c,chun,s,shuang]
2. Check how the search query is analyzed
http://127.0.0.1:9200/opcm3/_validate/query?explain
body:
{ "query":{ "match_phrase":{ "productName":{ "query":"纯生", "slop":5 } } } }
Response:
{ "valid": true, "_shards": { "total": 1, "successful": 1, "failed": 0 }, "explanations": [ { "index": "opcm3", "valid": true, "explanation": "productName:\"(c chun) (s sheng)\"~5" } ] }
This can be read as: (c OR chun) AND (s OR sheng).
So the single letters c and s alone are enough to match 纯爽.
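The false positive can be reproduced with a simplified check. The sketch below treats each document as a plain set of terms and ignores positions and slop, so it is only an approximation of phrase matching. `phrase_groups_match` is a hypothetical helper, not an Elasticsearch API.

```python
def phrase_groups_match(query_groups, doc_terms):
    """True if every query position has at least one term present in the document.

    Simplification: positions and slop are ignored.
    """
    return all(any(term in doc_terms for term in group) for group in query_groups)


# Query "纯生" as reported by _validate/query: "(c chun) (s sheng)"
query_groups = [{"c", "chun"}, {"s", "sheng"}]

# Indexed terms taken from the term vectors above.
docs = {
    "纯生": {"c", "chun", "s", "sheng"},
    "纯爽": {"c", "chun", "s", "shuang"},
}

matches = {name: phrase_groups_match(query_groups, terms) for name, terms in docs.items()}
print(matches)
```

Both documents pass the check: 纯爽 never contains sheng, but it does contain s, which satisfies the second group.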
Solution
Tokenize at the finest granularity when indexing, and at the coarsest granularity when searching.
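For example, the document 纯生 is indexed as [chun,sheng,chunsheng,cs,c,s],
while the search-time tokens are [chun,sheng,chunsheng].
The following model satisfies these searches: 雪花, 雪花cs, 雪花chunsheng, xhcs, xh纯生, and 雪花纯生 all return the correct data.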
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ik_pinyin_analyzer": {
          "type": "custom",
          "tokenizer": "ik_smart",
          "filter": ["pinyin_max_word_filter"]
        },
        "ik_pingying_smark": {
          "type": "custom",
          "tokenizer": "ik_smart",
          "filter": ["pinyin_smark_word_filter"]
        }
      },
      "filter": {
        "pinyin_max_word_filter": {
          "type": "pinyin",
          "keep_full_pinyin": "true",             // emit the full pinyin of each character, e.g. 雪花 -> xue, hua
          "keep_separate_first_letter": "true",   // emit each first letter on its own, e.g. 雪花 -> x, h
          "keep_joined_full_pinyin": true         // emit the joined full pinyin, e.g. 雪花 -> xuehua
        },
        "pinyin_smark_word_filter": {
          "type": "pinyin",
          "keep_separate_first_letter": "false",  // do not emit separate first letters, e.g. no x, h for 雪花
          "keep_first_letter": "false"            // do not emit the joined first letters, e.g. no xh for 雪花
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "productName": {
          "type": "text",
          "analyzer": "ik_pinyin_analyzer",       // analyzer used when indexing documents
          "search_analyzer": "ik_pingying_smark"  // analyzer used when searching
        }
      }
    }
  }
}
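To see why the split between index-time and search-time analysis removes the 纯爽 false positive, the two filter configurations can be modeled with a toy function. Everything here is an assumption made for illustration: `PINYIN` is a three-character stand-in dictionary, `pinyin_terms` is a hypothetical helper whose keyword defaults mirror the plugin defaults as I understand them from its README, and matching is reduced to a subset test that ignores positions.

```python
# Toy pinyin table; the real plugin ships its own dictionary.
PINYIN = {"纯": "chun", "生": "sheng", "爽": "shuang"}


def pinyin_terms(word, keep_full_pinyin=True, keep_joined_full_pinyin=False,
                 keep_first_letter=True, keep_separate_first_letter=False):
    """Toy model of the pinyin filter options used in the settings above."""
    syllables = [PINYIN[ch] for ch in word]
    terms = set()
    if keep_full_pinyin:
        terms.update(syllables)                        # chun, sheng
    if keep_joined_full_pinyin:
        terms.add("".join(syllables))                  # chunsheng
    if keep_first_letter:
        terms.add("".join(s[0] for s in syllables))    # cs
    if keep_separate_first_letter:
        terms.update(s[0] for s in syllables)          # c, s
    return terms


def index_terms(word):
    """Options of pinyin_max_word_filter (index time)."""
    return pinyin_terms(word, keep_full_pinyin=True, keep_joined_full_pinyin=True,
                        keep_separate_first_letter=True)


def search_terms(word):
    """Options of pinyin_smark_word_filter (search time)."""
    return pinyin_terms(word, keep_first_letter=False, keep_separate_first_letter=False)


def matches(query, doc):
    """Simplified match: every search-time term must be indexed for the document."""
    return search_terms(query) <= index_terms(doc)


print(sorted(index_terms("纯生")))
print(sorted(search_terms("纯生")))
print(matches("纯生", "纯生"), matches("纯生", "纯爽"))
```

Because the search side no longer produces the single letters c and s, the query for 纯生 needs sheng to be present, which 纯爽 does not have.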