Preface
The previous article covered the various kinds of queries in ES.
This article covers how to run MySQL-style grouping and aggregation queries in ES,
how to implement pinyin autocomplete for user input,
and how to keep data automatically synchronized between MySQL and ES.
一、Grouping and aggregation
In ES, aggregation queries fall into two broad categories: metric aggregations and bucket aggregations.
- Metric aggregation: max, min, sum and so on, equivalent to the corresponding aggregate functions in MySQL.
- Bucket aggregation: the equivalent of GROUP BY in MySQL, i.e. which field the data is grouped on.
Note that text fields cannot be used for grouping: a text field is analyzed into separate terms, which makes grouping on it impossible.
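As a quick orientation (a sketch of mine, not code from the original project), the two kinds of aggregation map onto the Java high-level client used later in this article roughly as follows; the aggregation names max_price and by_brand are illustrative:
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.bucket.terms.TermsAggregationBuilder;
import org.elasticsearch.search.aggregations.metrics.MaxAggregationBuilder;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class AggregationKinds {
    public static SearchSourceBuilder build() {
        // size(0): return only aggregation results, no document hits
        SearchSourceBuilder source = new SearchSourceBuilder().size(0);
        // Metric aggregation: the equivalent of MAX(price) in MySQL
        MaxAggregationBuilder maxPrice = AggregationBuilders.max("max_price").field("price");
        // Bucket aggregation: the equivalent of GROUP BY brand (brand must be a keyword field, not text)
        TermsAggregationBuilder byBrand = AggregationBuilders.terms("by_brand").field("brand").size(20);
        source.aggregation(maxPrice);
        source.aggregation(byBrand);
        return source;
    }
}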
1. Metric aggregation
Metric aggregations correspond to MySQL aggregate functions. The query below finds the most expensive and the cheapest price among hotels whose brand is 万豪 ("size": 0 suppresses the document hits so that only the aggregation results are returned):
GET /hotel/_search
{
"query": {
"term": {
"brand": {
"value": "万豪"
}
}
},
"size": 0,
"aggs": {
"最贵的": {
"max": {
"field": "price"
}
},
"最便宜的": {
"min": {
"field": "price"
}
}
}
}
2. Bucket aggregation
Bucket aggregations are the GROUP BY of ES. The query below filters hotels of brand 万豪 and groups them by star rating (the specs field):
GET /hotel/_search
{
"size": 0,
"query": {
"term": {
"brand": {
"value": "万豪"
}
}
},
"aggs": {
"按星级名称分组": {
"terms": {
"field": "specs",
"size": 20
}
}
}
}
Group all documents in the index by star rating and by brand (two sibling bucket aggregations in one request):
GET /hotel/_search
{
"size": 0,
"aggs": {
"按品牌分组": {
"terms": {
"field": "brand",
"size": 20
}
},
"按星级分组": {
"terms": {
"field": "specs",
"size": 20
}
}
}
}
3. Summary
A single ES request can contain multiple aggregations.
4. Implementing the feature
Filter the hotels by the search condition first, then group the matches by brand:
GET hotel/_search
{
  "size": 0,
  "query": {
    "query_string": {
      "fields": ["name","synopsis","area","address"],
      "query": "三亚 OR 商务"
    }
  },
  "aggs": {
    "hotel_brands": {
      "terms": {
        "field": "brand",
        "size": 100
      }
    }
  }
}
@Override
public Map<String, Object> searchBrandGroupQuery(Integer current, Integer size, Map<String, Object> searchParam) {
    // Build the search request
    SearchRequest searchRequest = new SearchRequest("hotel");
    // Build the request body
    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    // Main query: multi-field query_string search
    if (!StringUtils.isEmpty(searchParam.get("condition"))) {
        QueryBuilder queryBuilder = QueryBuilders.queryStringQuery(searchParam.get("condition").toString())
                .field("name")
                .field("synopsis")
                .field("area")
                .field("address")
                .defaultOperator(Operator.OR);
        searchSourceBuilder.query(queryBuilder);
    }
    // Terms (bucket) aggregation on brand
    AggregationBuilder aggregationBuilder = AggregationBuilders.terms("brand_groups")
            .size(200)
            .field("brand");
    searchSourceBuilder.aggregation(aggregationBuilder);
    // Pagination
    searchSourceBuilder.from((current - 1) * size);
    searchSourceBuilder.size(size);
    searchRequest.source(searchSourceBuilder);
    try {
        SearchResponse searchResponse = restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);
        SearchHits hits = searchResponse.getHits();
        long totalHits = hits.getTotalHits().value;
        ArrayList<String> groupNameList = new ArrayList<>();
        // Read the aggregation result
        Terms brandGroups = searchResponse.getAggregations().get("brand_groups");
        for (Terms.Bucket bucket : brandGroups.getBuckets()) {
            String key = (String) bucket.getKey();
            groupNameList.add(key);
        }
        Map<String, Object> map = new HashMap<>();
        // map.put("list", list);
        map.put("totalResultSize", totalHits);
        map.put("current", current);
        // Total number of pages
        map.put("totalPage", (totalHits + size - 1) / size);
        // List of brand buckets
        map.put("brandList", groupNameList);
        return map;
    } catch (IOException e) {
        e.printStackTrace();
    }
    return null;
}
5. Combining grouping and aggregation
When producing statistics we usually group the data first and then aggregate within each group.
For example: given a date range entered by the user, compute the sales volume of every brand's hotels. This requires multi-level (nested) aggregations:
- group the documents by brand, then compute the metric aggregations inside each brand bucket.
5.1. DSL
GET hotel/_search
{
"size": 0,
"query": {
"range": {
"createTime": {
"gte": "2015-01-01",
"lte": "2015-12-31"
}
}
},
"aggs": {
"根据品牌分组": {
"terms": {
"field": "brand",
"size": 100
},
"aggs": {
"该品牌总销量": {
"sum": {
"field": "salesVolume"
}
},
"该品牌销量平均值": {
"avg": {
"field": "salesVolume"
}
}
}
}
}
}
public List<Map<String, Object>> searchDateHistogram(Map<String, Object> searchParam) {
    // Result list
    List<Map<String, Object>> result = new ArrayList<>();
    // Build the query
    SearchRequest searchRequest = new SearchRequest("hotel");
    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    // todo: range query over the user-supplied date interval
    RangeQueryBuilder queryBuilder = QueryBuilders.rangeQuery("createTime")
            .gte(searchParam.get("minTime"))
            .lte(searchParam.get("maxTime"))
            .format("yyyy-MM-dd");
    searchSourceBuilder.query(queryBuilder);
    // todo: aggregation setup
    TermsAggregationBuilder aggregationBuilder = AggregationBuilders.terms("hotel_brand").field("brand").size(100);
    // Build the second-level (metric) aggregation
    SumAggregationBuilder secondAggregation = AggregationBuilders.sum("hotel_salesVolume").field("salesVolume");
    aggregationBuilder.subAggregation(secondAggregation);
    searchSourceBuilder.aggregation(aggregationBuilder);
    searchRequest.source(searchSourceBuilder);
    try {
        SearchResponse searchResponse = restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);
        // todo: read and process the aggregation results
        Aggregations aggregations = searchResponse.getAggregations();
        Map<String, Aggregation> aggregationMap = aggregations.asMap();
        Terms terms = (Terms) aggregationMap.get("hotel_brand");
        List<? extends Terms.Bucket> buckets = terms.getBuckets();
        buckets.forEach(bucket -> {
            Map<String, Object> info = new HashMap<>();
            info.put("brand", bucket.getKeyAsString());
            // Read the second-level aggregation value
            ParsedSum parsedSum = bucket.getAggregations().get("hotel_salesVolume");
            Integer sumValue = (int) parsedSum.getValue();
            info.put("sumValue", sumValue);
            result.add(info);
        });
        return result;
    } catch (IOException e) {
        e.printStackTrace();
    }
    return null;
}
二、Relevance scoring and function score
- function score: a scoring-function query that lets us adjust a document's relevance score and thereby control its ranking.
First, a plain match query, where the ranking is decided purely by the default relevance score:
GET hotel/_search
{
"query": {
"match": {
"name": "北京市东城区万豪"
}
}
}
# Results (abridged)
[
{
"_score" : 7.060467,
"_source" : {
"name" : "北京市东城区万豪酒店",
}
},
{
"_score" : 7.060467,
"_source" : {
"name" : "北京市东城区金陵酒店",
}
},
{
"_score" : 7.060467,
"_source" : {
"name" : "北京市东城区华天酒店",
}
}
]
Early versions of Elasticsearch scored documents with the TF-IDF algorithm; the formula is shown below.
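The formula itself appears as an image in the original post; a commonly cited form (my notation, assuming the classic textbook definition rather than Lucene's exact practical scoring) is:
\mathrm{score}(q,d)=\sum_{t\in q}\mathrm{TF}(t,d)\cdot\mathrm{IDF}(t),\qquad
\mathrm{TF}(t,d)=\frac{f_{t,d}}{\sum_{t'}f_{t',d}},\qquad
\mathrm{IDF}(t)=\log\frac{N}{n_t}
where f_{t,d} is how often term t occurs in document d, N is the number of documents in the index, and n_t is the number of documents containing t.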
In a later upgrade (the 5.x series), elasticsearch switched the default scoring algorithm to BM25; the formula is shown below.
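Again the original shows the formula as an image; the standard BM25 formula (with Elasticsearch's default parameters k_1 = 1.2 and b = 0.75) is:
\mathrm{score}(q,d)=\sum_{t\in q}\ln\!\Bigl(1+\frac{N-n_t+0.5}{n_t+0.5}\Bigr)\cdot\frac{f_{t,d}\,(k_1+1)}{f_{t,d}+k_1\bigl(1-b+b\,\frac{|d|}{\mathrm{avgdl}}\bigr)}
where |d| is the length of document d and avgdl is the average document length in the index.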
TF-IDF has a flaw: the higher a term's frequency, the higher the document's score, without limit, so a single term can dominate the result. BM25 instead caps the contribution of a single term, giving a much smoother curve (see the figure below).
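The cap can also be read directly from the BM25 formula above: as the term frequency grows, a term's contribution saturates instead of growing without bound:
\lim_{f_{t,d}\to\infty}\frac{f_{t,d}\,(k_1+1)}{f_{t,d}+k_1\bigl(1-b+b\,\frac{|d|}{\mathrm{avgdl}}\bigr)}=k_1+1
so each term can add at most (k_1+1) times its IDF to the score, whereas under TF-IDF the score keeps rising with the term frequency.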
(Figure: the TF-IDF and BM25 scoring curves.)
# Multi-field query returning relevance-ranked results
GET hotel/_search
{
"query": {
"query_string": {
"fields": ["name","synopsis","area","address"],
"query": "北京市万豪spa三星"
}
}
}
Query results:
2.2. Boost settings
2.2.1. Boosting the whole query
Adding a boost to the query raises the score of every matching document by the same factor, so the relative ranking does not change and this cannot be used to implement paid (bid-based) ranking:
GET hotel/_search
{
"query": {
"query_string": {
"fields": ["name","synopsis","area","address"],
"query": "北京市万豪spa三星",
"boost": 50
}
}
}
Query results:
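Why the ranking is unchanged: a boost on the whole query multiplies every document's score by the same constant, which preserves the order:
\mathrm{score}'(d)=b\cdot\mathrm{score}(d),\; b>0\;\Longrightarrow\;\mathrm{score}(d_1)>\mathrm{score}(d_2)\iff\mathrm{score}'(d_1)>\mathrm{score}'(d_2)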
2.2.2. Index-time boost (static)
A per-field boost can be configured in the mapping when the index is created.
This approach is rarely used in practice: as the business changes, the weights cannot be adjusted on the fly,
because a mapping cannot be modified once the index exists, short of deleting and rebuilding the index.
PUT hotel
{
"mappings": {
"properties": {
"name":{
"type": "text",
"analyzer": "ik_max_word",
"boost": 5
},
"address":{
"type": "text",
"analyzer": "ik_max_word",
"boost": 3
}
}
}
}
2.2.3. Query-time boost (dynamic)
In the query below, the content of query is the main search condition, while functions decides which documents receive an extra weight.
Suppose the brand x豪 has paid an advertising fee; we then boost the score of hotels whose brand is x豪 with a weight of 50:
GET hotel/_search
{
"query": {
"function_score": {
"query": {
"query_string": {
"fields": ["name","synopsis","area","address"],
"query": "北京市spa三星"
}
},
"functions": [
{
"filter": {
"term": {
"brand": "x豪"
}
},
"weight": 50
}
]
}
}
}
Query results:
3. Pinning advertised hotels (isAd)
Using the same function_score approach, hotels whose isAd field is 1 are given a weight of 100 so that advertised listings rank first:
GET hotel/_search
{
"query": {
"function_score": {
"query": {
"query_string": {
"fields": [
"name",
"specs",
"area"
],
"query": "北京市万豪sap三星"
}
},
"functions": [
{
"filter": {
"term": {
"isAd": "1"
}
},
"weight": 100
}
]
}
}
}
public Map<String, Object> searchScoreQuery(Integer current, Integer size, Map<String, Object> searchParam) {
    SearchRequest searchRequest = new SearchRequest("hotel");
    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    // Build the main query condition
    QueryStringQueryBuilder queryStringQueryBuilder = QueryBuilders.queryStringQuery(searchParam.get("condition").toString())
            .field("name")
            .field("synopsis")
            .field("area")
            .field("address")
            .defaultOperator(Operator.OR);
    // Build the weighting (function score) condition
    FunctionScoreQueryBuilder.FilterFunctionBuilder[] scoreFunctionBuilder = new FunctionScoreQueryBuilder.FilterFunctionBuilder[]{
            new FunctionScoreQueryBuilder.FilterFunctionBuilder(QueryBuilders.termQuery("isAd", 1),
                    ScoreFunctionBuilders.weightFactorFunction(100))
    };
    FunctionScoreQueryBuilder queryBuilder = QueryBuilders.functionScoreQuery(queryStringQueryBuilder, scoreFunctionBuilder);
    searchSourceBuilder.query(queryBuilder);
    // Pagination
    searchSourceBuilder.from((current - 1) * size);
    searchSourceBuilder.size(size);
    searchRequest.source(searchSourceBuilder);
    try {
        // Process the query result
        SearchResponse searchResponse = restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);
        SearchHits hits = searchResponse.getHits();
        long totalHits = hits.getTotalHits().value;
        SearchHit[] searchHits = hits.getHits();
        List<HotelEntity> list = new ArrayList<>();
        for (SearchHit searchHit : searchHits) {
            String sourceAsString = searchHit.getSourceAsString();
            list.add(JSON.parseObject(sourceAsString, HotelEntity.class));
        }
        Map<String, Object> map = new HashMap<>();
        map.put("list", list);
        map.put("totalResultSize", totalHits);
        map.put("current", current);
        // Total number of pages
        map.put("totalPage", (totalHits + size - 1) / size);
        return map;
    } catch (IOException e) {
        e.printStackTrace();
    }
    return null;
}
三、Autocomplete
1. The copy_to mapping parameter
First create a user index with two separate text fields and query across both of them:
PUT user
{
"mappings": {
"properties": {
"first_name": {
"type": "text"
},
"last_name": {
"type": "text"
}
}
}
}
# Add a document
PUT user/_doc/1
{
"first_name": "John",
"last_name": "Smith"
}
# Query both fields
GET user/_search
{
"query": {
"query_string": {
"fields": ["first_name","last_name"],
"query": "John OR Smith"
}
}
}
With the copy_to mapping parameter we can copy multiple fields into one combined field and simplify the query.
This is a classic space-for-time trade-off:
DELETE user
PUT user
{
"mappings": {
"properties": {
"first_name": {
"type": "text",
"copy_to": "full_name"
},
"last_name": {
"type": "text",
"copy_to": "full_name"
},
"full_name": {
"type": "text"
}
}
}
}
PUT user/_doc/1
{
"first_name": "John",
"last_name": "Smith"
}
# A single-field match query against the combined field
GET user/_search
{
"query": {
"match": {
"full_name": {
"query": "John Smith",
"operator": "and"
}
}
}
}
- The copy_to parameter copies one or more fields into another field.
- copy_to therefore lets us simplify a multi-field query into a single-field query, as in the Java sketch below.
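As a sketch (mine, not from the original article, reusing the index and field names from the example above), the same single-field query can be issued through the RestHighLevelClient used elsewhere in this post:
import java.io.IOException;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.Operator;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class CopyToQueryDemo {
    // Query the copy_to target field "full_name" instead of first_name/last_name separately
    public static SearchResponse searchFullName(RestHighLevelClient client, String text) throws IOException {
        SearchRequest request = new SearchRequest("user");
        SearchSourceBuilder source = new SearchSourceBuilder();
        // Single-field match query, requiring all terms to match (operator AND, as in the DSL above)
        source.query(QueryBuilders.matchQuery("full_name", text).operator(Operator.AND));
        request.source(source);
        return client.search(request, RequestOptions.DEFAULT);
    }
}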
2. Installing the pinyin analyzer plugin
2.1. Unpack the plugin archive
2.2. Upload it to the ES plugins directory
[root@zhanggen plugins]# ls
elasticsearch-analysis-ik-7.10.1  elasticsearch-analysis-pinyin-7.10.1
[root@zhanggen plugins]# pwd
/mydata/elasticsearch/plugins
2.3. Restart the ES container
2.4. Test the pinyin analyzer
POST /_analyze
{
"text": "张根",
"analyzer": "pinyin"
}
2.5. Test result
{
"tokens" : [
{
"token" : "zhang",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 0
},
{
"token" : "zg",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 0
},
{
"token" : "gen",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 1
}
]
}
3. Custom analyzer combining ik and pinyin
3.1. Declaring the custom analyzer
The syntax for declaring a custom analyzer is as follows:
PUT test
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "ik_max_word",
"filter": "py"
}
},
"filter": {
"py": {
"type": "pinyin",
"keep_full_pinyin": false,
"keep_joined_full_pinyin": true,
"keep_original": true,
"limit_first_letter_length": 16,
"remove_duplicated_term": true,
"none_chinese_pinyin_tokenize": false
}
}
}
},
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "my_analyzer",
"search_analyzer": "ik_smart"
}
}
}
}
POST test/_analyze
{
"text": "张根",
"analyzer": "my_analyzer"
}
3.2. Analysis result
{
"tokens" : [
{
"token" : "张",
"start_offset" : 0,
"end_offset" : 1,
"type" : "CN_CHAR",
"position" : 0
},
{
"token" : "zhang",
"start_offset" : 0,
"end_offset" : 1,
"type" : "CN_CHAR",
"position" : 0
},
{
"token" : "z",
"start_offset" : 0,
"end_offset" : 1,
"type" : "CN_CHAR",
"position" : 0
},
{
"token" : "根",
"start_offset" : 1,
"end_offset" : 2,
"type" : "CN_CHAR",
"position" : 1
},
{
"token" : "gen",
"start_offset" : 1,
"end_offset" : 2,
"type" : "CN_CHAR",
"position" : 1
},
{
"token" : "g",
"start_offset" : 1,
"end_offset" : 2,
"type" : "CN_CHAR",
"position" : 1
}
]
}
4. Completion suggester
- The field used for completion suggestions must be of type completion.
- The field content is usually an array of terms that can be suggested.
Create a test index with a completion field:
PUT test
{
  "mappings": {
    "properties": {
      "title": {
        "type": "completion"
      }
    }
  }
}
Then insert the sample data below:
# Sample data
POST test/_doc
{
"title": [
"Sony",
"WH-1000XM3"
]
}
POST test/_doc
{
"title": [
"SK-II",
"PITERA"
]
}
POST test/_doc
{
"title": [
"Nintendo",
"switch"
]
}
The autocomplete query DSL looks like this:
# Autocomplete
GET test/_search
{
"suggest": {
"YOUR_SUGGESTION": {
"text": "s",
"completion": {
"field": "title",
"skip_duplicates":true,
"size":10
}
}
}
}
To add autocomplete to the hotel index, two things are needed:
- 1. Define the custom analyzers
- 2. Create a suggest field of type completion
# Hotel index with autocomplete support
PUT hotel_3
{
"settings": {
"analysis": {
"analyzer": {
"text_anlyzer": {
"tokenizer": "ik_max_word",
"filter": "py"
},
"completion_analyzer": {
"tokenizer": "keyword",
"filter": "py"
}
},
"filter": {
"py": {
"type": "pinyin",
"keep_full_pinyin": false,
"keep_joined_full_pinyin": true,
"keep_original": true,
"limit_first_letter_length": 16,
"remove_duplicated_term": true,
"none_chinese_pinyin_tokenize": false
}
}
}
},
"mappings": {
"properties": {
"suggest":{
"type": "completion",
"analyzer": "completion_analyzer"
},
"address" : {
"type" : "text",
"analyzer" : "text_anlyzer",
"search_analyzer" : "ik_smart"
},
"area" : {
"type" : "text",
"analyzer" : "text_anlyzer",
"search_analyzer" : "ik_smart"
},
"brand" : {
"type" : "keyword",
"copy_to": "suggest"
},
"createTime" : {
"type" : "date",
"format" : "yyyy-MM-dd"
},
"id" : {
"type" : "long"
},
"imageUrl" : {
"type" : "text"
},
"isAd" : {
"type" : "integer"
},
"name" : {
"type" : "text",
"analyzer" : "text_anlyzer",
"search_analyzer" : "ik_smart",
"copy_to": "suggest"
},
"price" : {
"type" : "integer"
},
"salesVolume" : {
"type" : "integer"
},
"specs" : {
"type" : "keyword"
},
"synopsis" : {
"type" : "text",
"analyzer" : "text_anlyzer",
"search_analyzer" : "ik_smart"
},
"type" : {
"type" : "keyword"
}
}
}
}
Reindex means moving data from one index to another.
Once an index can accept documents, its structure (shard count, mapping and so on) can no longer be modified, yet sometimes the structure has to change.
For example, the current number of shards may no longer satisfy the query load and needs to be increased.
# Migrate the data smoothly (throttled, run in the background)
POST _reindex?wait_for_completion=false&requests_per_second=200
{
  "source": {
    "index": "hotel_2"
  },
  "dest": {
    "index": "hotel_3"
  }
}
# Check the task status
GET _tasks/_6af5BFpS7mrvRyP6f8xlg:6792
# Repoint the alias
# Remove the alias from the old index
POST _aliases
{
  "actions": [
    {
      "remove": {
        "index": "hotel_2",
        "alias": "hotel"
      }
    }
  ]
}
# Delete the old index
DELETE hotel_2
# Add the alias to hotel_3
POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "hotel_3",
        "alias": "hotel"
      }
    }
  ]
}
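A minimal Java sketch of the same migration through the RestHighLevelClient (my illustration, reusing the hotel_2 / hotel_3 index names and the hotel alias from the DSL above):
import java.io.IOException;
import org.elasticsearch.action.admin.indices.alias.IndicesAliasesRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.reindex.BulkByScrollResponse;
import org.elasticsearch.index.reindex.ReindexRequest;

public class ReindexDemo {
    // Copy all documents from hotel_2 into hotel_3, then swing the "hotel" alias over
    public static void migrate(RestHighLevelClient client) throws IOException {
        ReindexRequest reindexRequest = new ReindexRequest();
        reindexRequest.setSourceIndices("hotel_2");   // old index
        reindexRequest.setDestIndex("hotel_3");       // new index
        reindexRequest.setRequestsPerSecond(200);     // throttle, as in the DSL above
        BulkByScrollResponse response = client.reindex(reindexRequest, RequestOptions.DEFAULT);
        System.out.println("Documents copied: " + response.getTotal());

        // Atomically remove the alias from hotel_2 and add it to hotel_3
        IndicesAliasesRequest aliasesRequest = new IndicesAliasesRequest();
        aliasesRequest.addAliasAction(
                IndicesAliasesRequest.AliasActions.remove().index("hotel_2").alias("hotel"));
        aliasesRequest.addAliasAction(
                IndicesAliasesRequest.AliasActions.add().index("hotel_3").alias("hotel"));
        client.indices().updateAliases(aliasesRequest, RequestOptions.DEFAULT);
    }
}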
Migrating between ES clusters (remote reindex)
# Run in the background to avoid client timeouts: add wait_for_completion=false
POST _reindex?wait_for_completion=false
{
  "source": {
    "remote": {
      "host": "https://es-cn-gd7.public.elasticsearch.aliyuncs.com:9200",
      "username": "elastic",
      "password": "YourPassword"
    },
    "index": "shrink-app-ad-log-info-2024.04.*"
  },
  "script": {
    "source": "ctx._source.remove('headers')"
  },
  "dest": {
    "index": "app-ad-log-info-2024.04"
  }
}
# Check the destination index
GET app-ad-log-info-2024.04/_search
# Count the documents in the destination index
GET app-ad-log-info-2024.04/_count
# Check the task
GET _tasks/JjRgYiAWQLym-4lCQ0eLUg:5958176
# Cancel the task
POST _tasks/e6D0cJ8RTqinohpdKpF4yA:3147320/_cancel
# Delete the index
DELETE app-ad-log-info-2024.04
5.3. Testing the autocomplete query
Simulate a user typing the pinyin prefix wan:
GET hotel/_search
{
"_source": false,
"suggest": {
"my_suggest": {
"text": "wan",
"completion": {
"field": "suggest",
"skip_duplicates":true,
"size":10
}
}
}
}
5.4. Result
The suggestions returned are 万事达, 万悦, 万豪 and 王朝:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"suggest" : {
"my_suggest" : [
{
"text" : "wan",
"offset" : 0,
"length" : 3,
"options" : [
{
"text" : "万事达",
"_index" : "hotel_3",
"_type" : "_doc",
"_id" : "AeSfyIEBhlAS7ARu8P7t",
"_score" : 1.0
},
{
"text" : "万悦",
"_index" : "hotel_3",
"_type" : "_doc",
"_id" : "_uSfyIEBhlAS7ARu8P3t",
"_score" : 1.0
},
{
"text" : "万豪",
"_index" : "hotel_3",
"_type" : "_doc",
"_id" : "wuSfyIEBhlAS7ARu8P3t",
"_score" : 1.0
},
{
"text" : "王朝",
"_index" : "hotel_3",
"_type" : "_doc",
"_id" : "1eSfyIEBhlAS7ARu8P3t",
"_score" : 1.0
}
]
}
]
}
}
5.5. Java implementation
public List<String> searchSuggestInfo(String key) {
    // Result list
    List<String> result = new ArrayList<>();
    // Build the query
    SearchRequest searchRequest = new SearchRequest("hotel");
    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    // todo: build the completion (autocomplete) suggestion
    searchSourceBuilder.fetchSource(false);
    SuggestBuilder suggestBuilder = new SuggestBuilder();
    CompletionSuggestionBuilder suggest = SuggestBuilders
            .completionSuggestion("suggest")
            .prefix(key)
            .skipDuplicates(true)
            .size(10);
    suggestBuilder.addSuggestion("my_suggest", suggest);
    searchSourceBuilder.suggest(suggestBuilder);
    searchRequest.source(searchSourceBuilder);
    try {
        SearchResponse searchResponse = restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);
        // Process the autocomplete suggestion results
        CompletionSuggestion my_suggest = searchResponse.getSuggest().getSuggestion("my_suggest");
        List<CompletionSuggestion.Entry.Option> options = my_suggest.getOptions();
        for (CompletionSuggestion.Entry.Option option : options) {
            String s = option.getText().string();
            result.add(s);
        }
        return result;
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}
5.6. Final result