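Bucketed Retrieval Settings for Knowledge QA Search
1 Why bucketed retrieval is needed
In the matching-principle section of "A Walkthrough of the Index-Based QA Pair Matching Flow", we preprocessed the similar questions (alternative phrasings) of each QA pair at indexing time and generated feature vectors for them. Indexing is done per question phrasing. In real business scenarios, however, each category holds many knowledge items and each knowledge item has many phrasings. If we simply match on phrasings and return their scores, the phrasings of a single knowledge item can occupy the entire top N. To address this, we want to merge the retrieved phrasings so that each knowledge item returns only its highest-scoring phrasing, and so that the number of results returned can be controlled.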
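The requirement can be illustrated independently of Elasticsearch. The following is a minimal in-memory sketch, assuming a hypothetical Hit type with id, kId and score fields (these names and the sample values are illustrative, not the real index schema). It keeps the best-scoring phrasing per knowledge item and returns at most maxNum of them, best first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BestPerKnowledge {

    /** One scored phrasing. Hypothetical type used only for this illustration. */
    static final class Hit {
        final String id;
        final String kId;
        final double score;

        Hit(String id, String kId, double score) {
            this.id = id;
            this.kId = kId;
            this.score = score;
        }
    }

    /** Keeps the highest-scoring hit per kId and returns at most maxNum of them, best first. */
    static List<Hit> bestPerKnowledge(List<Hit> hits, int maxNum) {
        Map<String, Hit> best = new LinkedHashMap<>();
        for (Hit h : hits) {
            // On a kId collision keep whichever hit has the higher score.
            best.merge(h.kId, h, (a, b) -> a.score >= b.score ? a : b);
        }
        List<Hit> result = new ArrayList<>(best.values());
        result.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        int limit = Math.min(Math.max(maxNum, 0), result.size());
        return new ArrayList<>(result.subList(0, limit));
    }

    public static void main(String[] args) {
        // Made-up sample data: two phrasings of kA, one each of kB and kC.
        List<Hit> hits = Arrays.asList(
                new Hit("q1", "kA", 25.2),
                new Hit("q2", "kA", 10.9),
                new Hit("q3", "kB", 3.9),
                new Hit("q4", "kC", 7.5));
        for (Hit h : bestPerKnowledge(hits, 2)) {
            System.out.println(h.kId + " " + h.id + " " + h.score);
        }
    }
}
```

In the real system this merging is pushed down into ES rather than done in application code, which is what the design in the next section aims for.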
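2 Design and implementation
In the ES field design we add a kId field that stores the ID of the knowledge item each phrasing belongs to (one knowledge item to many phrasings). At query time we group on the kId field, return the single highest-scoring document per group, and cap the number of buckets returned.
With this design in place we implemented it and ran verification tests (this code was later found to have a bug). The grouping query code is as follows: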
// Wrap the search request body
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
// Build the filtered more_like_this query
BoolQueryBuilder boolQueryBuilder = QueryBuilders.boolQuery()
        .filter(QueryBuilders.termsQuery("online", "1"))
        .filter(QueryBuilders.termsQuery("userId", userId))
        .filter(QueryBuilders.termsQuery("category", category.split(",")))
        .must(QueryBuilders.moreLikeThisQuery(new String[]{"questionStr"}, new String[]{questionStr}, null)
                .minTermFreq(0).minDocFreq(0).minWordLength(2));
// Highest score within each bucket (needs org.elasticsearch.script.Script)
AggregationBuilder maxScore = AggregationBuilders.max("scoreTop").script(new Script("_score"));
// Keep only the top hit of each bucket; top_hits sorts by _score by default
AggregationBuilder top = AggregationBuilders.topHits("result")
        .fetchSource(new String[]{"id", "title", "kId"}, null)
        .size(1);
// Group by kId
TermsAggregationBuilder groupTermsBuilder = AggregationBuilders.terms("groupkId")
        .field("kId").executionHint("map");
// Number of buckets to return
groupTermsBuilder.size(maxNum);
groupTermsBuilder.subAggregation(top);
groupTermsBuilder.subAggregation(maxScore);
// size(0): only the aggregation result is needed, not the plain hit list
searchSourceBuilder.query(boolQueryBuilder).aggregation(groupTermsBuilder).size(0);
When verifying the query we found that each group did return its own highest-scoring document, but higher-scoring documents existed in groups that were not returned, as the following result shows (the result has been trimmed and is for illustration only):
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 135,
    "max_score": 25.215496,
    "hits": [
      {
        "_index": "qaknowwledge",
        "_type": "doc",
        "_id": "11657847994935215",
        "_score": 25.215496,
        "_source": {
          "category": "11656187146936040",
          "id": "11657847994935215",
          "kId": "11657847993624508",
          "online": "1",
          "qStr": "stored phrasing 1",
          "title": "knowledge 1",
          "userId": "10869305621348777"
        }
      },
      {
        "_index": "qaknowwledge",
        "_type": "doc",
        "_id": "11657847994935216",
        "_score": 10.988454,
        "_source": {
          "category": "11656187146936040",
          "id": "11657847994935216",
          "kId": "11657847993624508",
          "online": "1",
          "questionStr": "phrasing 2",
          "title": "knowledge 2",
          "userId": "10869305621348777"
        }
      }
    ]
  },
  "aggregations": {
    "groupkId": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 72,
      "buckets": [
        {
          "key": "11657847993624494",
          "doc_count": 5,
          "result": {
            "hits": {
              "total": 5,
              "max_score": 3.8905885,
              "hits": [
                {
                  "_index": "qaknowwledge",
                  "_type": "doc",
                  "_id": "11657847994935160",
                  "_score": 3.8905885,
                  "_source": {
                    "kId": "11657847993624494",
                    "id": "11657847994935160",
                    "title": "knowledge"
                  }
                }
              ]
            }
          },
          "scoreTop": {
            "value": 3.8905885219573975
          }
        }
      ]
    }
  }
}
We found that the highest-scoring document in the hit list did not appear in any of the returned groups. Printing the query that was actually sent gives:
{
  "size": 20,
  "timeout": "60s",
  "query": {
    "bool": {
      "must": [
        {
          "more_like_this": {
            "fields": [
              "questionStr"
            ],
            "like": [
              "phrasing"
            ],
            "max_query_terms": 25,
            "min_term_freq": 0,
            "min_doc_freq": 0,
            "max_doc_freq": 2147483647,
            "min_word_length": 2,
            "max_word_length": 0,
            "minimum_should_match": "30%",
            "boost_terms": 0,
            "include": false,
            "fail_on_unsupported_field": true,
            "boost": 1
          }
        }
      ],
      "filter": [
        {
          "terms": {
            "online": [
              "1"
            ],
            "boost": 1
          }
        }
      ],
      "adjust_pure_negative": true,
      "boost": 1
    }
  },
  "aggregations": {
    "groupkId": {
      "terms": {
        "field": "kId",
        "size": 20,
        "min_doc_count": 1,
        "shard_min_doc_count": 0,
        "show_term_doc_count_error": false,
        "execution_hint": "map",
        "order": [
          {
            "_count": "desc"
          },
          {
            "_key": "asc"
          }
        ]
      },
      "aggregations": {
        "result": {
          "top_hits": {
            "from": 0,
            "size": 1,
            "version": false,
            "explain": false,
            "_source": {
              "includes": [
                "id",
                "title",
                "kId"
              ],
              "excludes": []
            }
          }
        },
        "scoreTop": {
          "max": {
            "script": {
              "source": "_score",
              "lang": "painless"
            }
          }
        }
      }
    }
  }
}
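Analysis shows that the ordering we intended did not take effect. As the printed query shows, buckets are still ordered by the number of matching documents in each group (the default for a terms aggregation), namely:
"terms": {
  "field": "kId",
  "size": 20,
  "min_doc_count": 1,
  "shard_min_doc_count": 0,
  "show_term_doc_count_error": false,
  "execution_hint": "map",
  "order": [
    {
      "_count": "desc"
    },
    {
      "_key": "asc"
    }
  ]
}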
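To see why the top-scoring document can disappear under this ordering, the sketch below models three hypothetical buckets in plain Java (the keys k1 to k3 and all scores are made up) and compares ordering by doc count descending then key ascending with ordering by each bucket's maximum score. It is a simplified model of the behaviour, not Elasticsearch's implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BucketOrderDemo {

    /** Summary of one kId bucket: matched phrasing count and best score among them. */
    static final class Bucket {
        final String kId;
        int docCount;
        double maxScore = Double.NEGATIVE_INFINITY;

        Bucket(String kId) {
            this.kId = kId;
        }
    }

    /** Groups parallel arrays of kIds and scores into buckets. */
    static List<Bucket> buildBuckets(String[] kIds, double[] scores) {
        Map<String, Bucket> byKey = new LinkedHashMap<>();
        for (int i = 0; i < kIds.length; i++) {
            Bucket b = byKey.computeIfAbsent(kIds[i], Bucket::new);
            b.docCount++;
            b.maxScore = Math.max(b.maxScore, scores[i]);
        }
        return new ArrayList<>(byKey.values());
    }

    /** Models the default ordering: doc count descending, then key ascending. */
    static List<String> topByCount(List<Bucket> buckets, int size) {
        return top(buckets, size,
                Comparator.comparingInt((Bucket b) -> b.docCount).reversed()
                        .thenComparing((Bucket b) -> b.kId));
    }

    /** Models ordering by the per-bucket max score, descending. */
    static List<String> topByMaxScore(List<Bucket> buckets, int size) {
        return top(buckets, size,
                Comparator.comparingDouble((Bucket b) -> b.maxScore).reversed()
                        .thenComparing((Bucket b) -> b.kId));
    }

    private static List<String> top(List<Bucket> buckets, int size, Comparator<Bucket> order) {
        List<Bucket> sorted = new ArrayList<>(buckets);
        sorted.sort(order);
        List<String> keys = new ArrayList<>();
        for (Bucket b : sorted.subList(0, Math.min(size, sorted.size()))) {
            keys.add(b.kId);
        }
        return keys;
    }

    public static void main(String[] args) {
        // k1 has many weak matches; k2 has few matches but the best score overall.
        String[] kIds = {"k1", "k1", "k1", "k1", "k1", "k2", "k2", "k3", "k3", "k3"};
        double[] scores = {3.9, 3.1, 2.8, 2.5, 2.0, 25.2, 10.9, 2.2, 1.9, 1.5};
        List<Bucket> buckets = buildBuckets(kIds, scores);
        System.out.println("by count:     " + topByCount(buckets, 2));
        System.out.println("by max score: " + topByMaxScore(buckets, 2));
    }
}
```

With a bucket limit of 2, the count-based ordering drops k2 even though it holds the best match, which mirrors the result observed above.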
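Reviewing the query code, we found that we had only defined the scoreTop sub-aggregation but never applied it as the ordering of the terms aggregation. Applying it fixes the problem; the code is as follows: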
// Wrap the search request body
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
// Build the filtered more_like_this query
BoolQueryBuilder boolQueryBuilder = QueryBuilders.boolQuery()
        .filter(QueryBuilders.termsQuery("online", "1"))
        .filter(QueryBuilders.termsQuery("userId", userId))
        .filter(QueryBuilders.termsQuery("category", category.split(",")))
        .must(QueryBuilders.moreLikeThisQuery(new String[]{"questionStr"}, new String[]{questionStr}, null)
                .minTermFreq(0).minDocFreq(0).minWordLength(2));
// Highest score within each bucket, used as the sort key (needs org.elasticsearch.script.Script)
AggregationBuilder maxScore = AggregationBuilders.max("scoreTop").script(new Script("_score"));
// Keep only the top hit of each bucket; top_hits sorts by _score by default
AggregationBuilder top = AggregationBuilders.topHits("result")
        .fetchSource(new String[]{"id", "title", "kId", "qSimhas"}, null)
        .size(1);
// Group by kId and order buckets by the scoreTop sub-aggregation, descending
TermsAggregationBuilder groupTermsBuilder = AggregationBuilders.terms("groupkId")
        .field("kId").executionHint("map").order(BucketOrder.aggregation("scoreTop", false));
// Number of buckets to return
groupTermsBuilder.size(maxNum);
groupTermsBuilder.subAggregation(top);
groupTermsBuilder.subAggregation(maxScore);
// size(0): only the aggregation result is needed, not the plain hit list
searchSourceBuilder.query(boolQueryBuilder).aggregation(groupTermsBuilder).size(0);
In other words, "scoreTop" is applied as the order of "groupTermsBuilder". The printed query then shows that buckets are ordered by the highest score within each group.