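Elasticsearch autocomplete
Autocomplete means suggesting complete terms based on the letters the user has typed so far.
To support this, the index must contain pinyin data for each term, so the relevant fields have to be re-analyzed with a pinyin analyzer.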
Pinyin analyzer
To complete terms from typed letters, documents must be analyzed into pinyin. A pinyin analysis plugin for elasticsearch is available on GitHub: https://github.com/medcl/elasticsearch-analysis-pinyin
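Installation works the same way as for the IK analyzer. There are three steps, followed by a test:
① Unzip the plugin package
② Upload it to the elasticsearch plugins directory on the virtual machine
③ Restart elasticsearch
docker restart es
④ Test
POST /_analyze
{
  "text": "如家酒店还不错",
  "analyzer": "pinyin"
}
The default pinyin analyzer converts each Chinese character to pinyin separately, whereas we want one group of pinyin per term. The pinyin analyzer therefore has to be customized into a custom analyzer.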
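Custom analyzer
An analyzer in elasticsearch consists of three parts:
- character filters: process the text before the tokenizer, for example by deleting or replacing characters
- tokenizer: splits the text into terms according to some rule. For example, keyword does not split the text at all; ik_smart is another option
- token filters: further process the terms output by the tokenizer, for example case conversion, synonym handling, or pinyin conversion
When a document is analyzed, it passes through these three parts in order: character filters first, then the tokenizer, then the token filters.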
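To make the order of the three stages concrete, here is a minimal Java sketch. It is a toy model and not Elasticsearch code: the class `AnalyzerPipelineSketch` and its methods are hypothetical names, the tokenizer simply splits on whitespace as a stand-in for something like ik_max_word, and the token filter only lowercases.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy model of the three analyzer stages. This is NOT Elasticsearch code.
class AnalyzerPipelineSketch {

    // Stage 1, character filter: replace substrings before tokenization
    // (the same idea as a "mapping" character filter).
    static String charFilter(String text, Map<String, String> mappings) {
        String result = text;
        for (Map.Entry<String, String> e : mappings.entrySet()) {
            result = result.replace(e.getKey(), e.getValue());
        }
        return result;
    }

    // Stage 2, tokenizer: split on whitespace (assumed stand-in for a real tokenizer).
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                terms.add(part);
            }
        }
        return terms;
    }

    // Stage 3, token filter: post-process each term (here: lowercase it).
    static List<String> tokenFilter(List<String> terms) {
        List<String> out = new ArrayList<>();
        for (String term : terms) {
            out.add(term.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // The stages always run in this order.
    static List<String> analyze(String text, Map<String, String> mappings) {
        return tokenFilter(tokenize(charFilter(text, mappings)));
    }

    public static void main(String[] args) {
        Map<String, String> mappings = new LinkedHashMap<>();
        mappings.put("OMG", "oh my god");
        mappings.put("BRB", "be right back");
        // prints [oh, my, god, hello, world]
        System.out.println(analyze("OMG Hello World", mappings));
    }
}
```

Because the character filter runs first, "OMG" is already expanded before the tokenizer sees the text, which is why it ends up as three separate terms.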
The syntax for declaring a custom analyzer is as follows:
// Create the index
PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": { // custom analyzers
        "my_analyzer": { // analyzer name
          "char_filter": "my_char_filter", // custom character filter
          "tokenizer": "ik_max_word", // rule for splitting the text into terms
          "filter": "py" // custom token filter
        }
      },
      "char_filter": { // custom character filters
        "my_char_filter": { // character filter name
          "type": "mapping",
          "mappings": [
            "LOL => laughing out loud",
            "BRB => be right back",
            "OMG => oh my god"
          ]
        }
      },
      "filter": { // custom token filters
        "py": { // token filter name
          "type": "pinyin",
          "keep_full_pinyin": false,
          "keep_joined_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "remove_duplicated_term": true,
          "none_chinese_pinyin_tokenize": false
        }
      }
    }
  }
}
Test:
POST /test/_analyze
{
  "text": "OMG如家酒店还不错",
  "analyzer": "my_analyzer"
}
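Note: a custom analyzer has to be defined in the index settings when the index is created.
Example:
1) Use custom analyzers when creating the index
Note: "com_analyzer": {"tokenizer": "keyword", "filter": "py"} is a custom analyzer that does not split the text (keyword tokenizer) and then applies the pinyin token filter.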

PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "char_filter": "my_char_filter",
          "tokenizer": "ik_max_word",
          "filter": "py"
        },
        "com_analyzer": {
          "tokenizer": "keyword",
          "filter": "py"
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "LOL => laughing out loud",
            "BRB => be right back",
            "OMG => oh my god"
          ]
        }
      },
      "filter": {
        "py": {
          "type": "pinyin",
          "keep_full_pinyin": false,
          "keep_joined_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "remove_duplicated_term": true,
          "none_chinese_pinyin_tokenize": false
        }
      }
    }
  }
}
2) In the mapping, assign the custom analyzer to the field

PUT /test/_mapping
{
  "properties": {
    "title": {
      "type": "completion",
      "analyzer": "com_analyzer",
      "search_analyzer": "ik_smart"
    }
  }
}
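3) Test the analyzer
POST /test/_analyze
{
  "text": ["OMG如家酒店还不错", "上海如家酒"],
  "analyzer": "com_analyzer"
}
4) Result:
- The characters are converted to full pinyin and joined (for example: OMG如家酒店还不错 => omg, rujiajiudianhaibucuo)
- The first letter of each character's pinyin is kept (for example: OMG如家酒店还不错 => omgrjjdhbc)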
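The two kinds of output listed above can be illustrated with a small Java sketch. It is only an illustration of what "joined full pinyin" and "first letters" mean, not the plugin's implementation: `PinyinJoinSketch` is a hypothetical class, the lookup table is hand-written and covers only the characters of the example text, and unknown characters are simply lowercased and passed through, which does not reproduce how the plugin treats Latin letters.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: a hand-written table instead of a real pinyin dictionary.
class PinyinJoinSketch {

    // Hypothetical lookup table covering just the example text.
    static final Map<Character, String> PINYIN = new HashMap<>();
    static {
        PINYIN.put('如', "ru");
        PINYIN.put('家', "jia");
        PINYIN.put('酒', "jiu");
        PINYIN.put('店', "dian");
        PINYIN.put('还', "hai");
        PINYIN.put('不', "bu");
        PINYIN.put('错', "cuo");
    }

    // Full pinyin of every character, concatenated into one term.
    static String joinedFullPinyin(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            String py = PINYIN.get(c);
            // Simplification: characters without an entry are lowercased and kept.
            sb.append(py != null ? py : String.valueOf(Character.toLowerCase(c)));
        }
        return sb.toString();
    }

    // First letter of every character's pinyin, concatenated into one term.
    static String firstLetters(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            String py = PINYIN.get(c);
            sb.append(py != null ? py.charAt(0) : Character.toLowerCase(c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "如家酒店还不错";
        System.out.println(joinedFullPinyin(text)); // prints rujiajiudianhaibucuo
        System.out.println(firstLetters(text));     // prints rjjdhbc
    }
}
```

Both forms are useful for autocomplete: a user may type the full pinyin ("rujia") or just the initials ("rj") and still reach the same term.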
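Autocomplete query
elasticsearch provides the Completion Suggester query for autocomplete. It matches terms that begin with the user's input and returns them. To keep completion queries efficient, there are some constraints on the field types in the document:
- The field used in a completion query must be of type completion.
- The field's content is usually an array of the terms to be completed.
For example, take an index like this: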
PUT /test/_mapping
{
  "properties": {
    "title": {
      "type": "completion",
      "analyzer": "my_analyzer",
      "search_analyzer": "ik_smart"
    }
  }
}
Insert the following data:
POST test/_doc
{
  "title": ["Sony", "WH-1000XM3"]
}
POST test/_doc
{
  "title": ["SK-II", "PITERA"]
}
POST test/_doc
{
  "title": ["Nintendo", "switch"]
}
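The DSL for the query is as follows:
// Autocomplete query
GET /test/_search
{
  "suggest": {
    "title_suggest": {
      "text": "s", // the keyword typed by the user
      "completion": {
        "field": "title", // the field to run the completion query on
        "skip_duplicates": true, // skip duplicate suggestions
        "size": 10 // return the first 10 results
      }
    }
  }
}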
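The idea behind the query, matching inputs that start with the typed text, skipping duplicates, and limiting the number of results, can be sketched in plain Java. This is a simplified in-memory model under stated assumptions and not how Elasticsearch implements the suggester: `PrefixSuggestSketch` is a hypothetical class, matching is a case-insensitive `startsWith`, and results come back in insertion order rather than by score.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// In-memory illustration of prefix completion. NOT the Elasticsearch implementation.
class PrefixSuggestSketch {

    // docs: one list of completion inputs per document (like the "title" arrays).
    static List<String> suggest(List<List<String>> docs, String prefix, int size) {
        String p = prefix.toLowerCase(Locale.ROOT);
        // LinkedHashSet drops duplicates while keeping first-seen order.
        Set<String> matches = new LinkedHashSet<>();
        for (List<String> inputs : docs) {
            for (String input : inputs) {
                if (input.toLowerCase(Locale.ROOT).startsWith(p)) {
                    matches.add(input);
                }
            }
        }
        List<String> all = new ArrayList<>(matches);
        // Keep at most `size` results.
        return new ArrayList<>(all.subList(0, Math.min(size, all.size())));
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(
                List.of("Sony", "WH-1000XM3"),
                List.of("SK-II", "PITERA"),
                List.of("Nintendo", "switch"));
        // prints [Sony, SK-II, switch]
        System.out.println(suggest(docs, "s", 10));
    }
}
```

With the three sample documents, the input "s" matches one entry from each document, which is the result the DSL query is expected to suggest as well.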
Autocomplete with the RestAPI
Modify the hotel mapping (using the pinyin analyzer)
// Create the index
PUT /hotel
{
  "settings": {
    "analysis": { // custom analyzers
      "analyzer": {
        "text_anlyzer": {
          "tokenizer": "ik_max_word", // tokenization rule: ik_max_word
          "filter": "py" // pinyin token filter
        },
        "completion_analyzer": {
          "tokenizer": "keyword", // tokenization rule: do not split
          "filter": "py" // pinyin token filter
        }
      },
      "filter": {
        "py": {
          "type": "pinyin",
          "keep_full_pinyin": false,
          "keep_joined_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "remove_duplicated_term": true,
          "none_chinese_pinyin_tokenize": false
        }
      }
    }
  },
  "mappings": { // mapping
    "properties": {
      "id": {
        "type": "keyword"
      },
      "name": {
        "type": "text",
        "analyzer": "text_anlyzer", // at index time, tokenize with ik_max_word, then convert the terms to pinyin with the pinyin filter
        "search_analyzer": "ik_smart", // at search time, use the ik_smart analyzer
        "copy_to": "all"
      },
      "address": {
        "type": "keyword",
        "index": false
      },
      "price": {
        "type": "integer"
      },
      "score": {
        "type": "integer"
      },
      "brand": {
        "type": "keyword",
        "copy_to": "all"
      },
      "city": {
        "type": "keyword"
      },
      "starName": {
        "type": "keyword"
      },
      "business": {
        "type": "keyword",
        "copy_to": "all"
      },
      "location": {
        "type": "geo_point"
      },
      "pic": {
        "type": "keyword",
        "index": false
      },
      "all": {
        "type": "text",
        "analyzer": "text_anlyzer",
        "search_analyzer": "ik_smart"
      },
      "suggestion": { // autocomplete field
        "type": "completion", // a field used for completion queries must be of type completion
        "analyzer": "completion_analyzer" // pinyin conversion without splitting the text
      }
    }
  }
}
Modify the HotelDoc entity

import lombok.Data;
import lombok.NoArgsConstructor;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

@Data
@NoArgsConstructor
public class HotelDoc {
    private Long id;
    private String name;
    private String address;
    private Integer price;
    private Integer score;
    private String brand;
    private String city;
    private String starName;
    private String business;
    private String location;
    private String pic;
    private Object distance;
    private Boolean isAD;
    // Terms offered for autocomplete (maps to the "suggestion" completion field)
    private List<String> suggestion;

    public HotelDoc(Hotel hotel) {
        this.id = hotel.getId();
        this.name = hotel.getName();
        this.address = hotel.getAddress();
        this.price = hotel.getPrice();
        this.score = hotel.getScore();
        this.brand = hotel.getBrand();
        this.city = hotel.getCity();
        this.starName = hotel.getStarName();
        this.business = hotel.getBusiness();
        this.location = hotel.getLatitude() + ", " + hotel.getLongitude();
        this.pic = hotel.getPic();
        // Assemble suggestion
        if (this.business.contains("/")) {
            // business holds several values, so it has to be split
            String[] arr = this.business.split("/");
            // Add the brand first, then every business area
            this.suggestion = new ArrayList<>();
            this.suggestion.add(this.brand);
            Collections.addAll(this.suggestion, arr);
        } else {
            this.suggestion = Arrays.asList(this.brand, this.business);
        }
    }
}
Import the data

@Test
void testBulkRequest() throws IOException {
    // Query all hotel data
    List<Hotel> list = hotelService.list();

    // 1. Prepare the request
    BulkRequest request = new BulkRequest();
    // 2. Prepare the parameters
    for (Hotel hotel : list) {
        // 2.1. Convert to HotelDoc
        HotelDoc hotelDoc = new HotelDoc(hotel);
        // 2.2. Convert to JSON
        String json = JSON.toJSONString(hotelDoc);
        // 2.3. Add one index request per hotel
        request.add(new IndexRequest("hotel").id(hotel.getId().toString()).source(json, XContentType.JSON));
    }

    // 3. Send the request
    client.bulk(request, RequestOptions.DEFAULT);
}
Java API for the autocomplete query
The code for parsing the autocomplete results is as follows:
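As a library-independent sketch of the parsing step, the following Java code pulls the suggested texts out of a suggest response. It assumes the response JSON has already been deserialized into nested `Map` and `List` objects by some JSON library, and that the response follows the usual shape `suggest -> <suggestion name> -> entries -> options -> text`. `SuggestResponseSketch` and `optionTexts` are hypothetical names, and the suggestion name `title_suggest` is taken from the DSL query earlier in this post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch: extract option texts from an already-deserialized suggest response.
class SuggestResponseSketch {

    @SuppressWarnings("unchecked")
    static List<String> optionTexts(Map<String, Object> response, String suggestionName) {
        List<String> texts = new ArrayList<>();
        // "suggest" holds one entry list per named suggestion.
        Map<String, Object> suggest = (Map<String, Object>) response.get("suggest");
        if (suggest == null) {
            return texts;
        }
        List<Map<String, Object>> entries =
                (List<Map<String, Object>>) suggest.get(suggestionName);
        if (entries == null) {
            return texts;
        }
        for (Map<String, Object> entry : entries) {
            // Each entry carries the matched completions in "options".
            List<Map<String, Object>> options =
                    (List<Map<String, Object>>) entry.get("options");
            if (options == null) {
                continue;
            }
            for (Map<String, Object> option : options) {
                texts.add(String.valueOf(option.get("text")));
            }
        }
        return texts;
    }

    public static void main(String[] args) {
        // Hand-built sample mirroring the assumed response shape.
        Map<String, Object> response = Map.of(
                "suggest", Map.of(
                        "title_suggest", List.of(
                                Map.of("text", "s", "offset", 0, "length", 1,
                                        "options", List.of(
                                                Map.of("text", "SK-II"),
                                                Map.of("text", "Sony"),
                                                Map.of("text", "switch"))))));
        // prints [SK-II, Sony, switch]
        System.out.println(optionTexts(response, "title_suggest"));
    }
}
```

The same traversal applies to the hotel index: the suggestion name is whatever key was used in the request, and each option's text is one candidate to show to the user.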