Elasticsearch: beware of the performance cost of inner hits

1. Introduction to inner hits

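Elasticsearch provides the nested data type for documents that contain arrays of sub-objects. Without it, the fields of each sub-object are flattened into separate per-field value lists, and the link between the fields that belonged to one sub-object is lost.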

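The inner hits feature is used when a query matches on nested documents: it gives convenient control over which of the matching nested documents are returned, and how many.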
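The loss of association mentioned above can be illustrated with a small sketch. It is a plain Python imitation, not Elasticsearch code; the field names `ip` and `port` and all helper names are hypothetical and do not come from the index described in this post.

```python
# Illustrative only: mimics how an array of objects behaves with and
# without the nested type. The field names ("ip", "port") are hypothetical.

rows = [
    {"ip": "192.168.1.1", "port": 80},
    {"ip": "10.0.0.1", "port": 443},
]


def flatten(objects):
    """Object-array behaviour: values are grouped per field, rows are lost."""
    flat = {}
    for obj in objects:
        for key, value in obj.items():
            flat.setdefault(key, []).append(value)
    return flat


def matches_flattened(flat, ip, port):
    # Each condition is checked against the whole value list of its field.
    return ip in flat["ip"] and port in flat["port"]


def matches_nested(objects, ip, port):
    # Each object is checked on its own, the way a nested query works.
    return any(o["ip"] == ip and o["port"] == port for o in objects)


flat = flatten(rows)
print(matches_flattened(flat, "192.168.1.1", 443))  # True: a false positive
print(matches_nested(rows, "192.168.1.1", 443))     # False
```

No single row has both `192.168.1.1` and port `443`, yet the flattened form matches; keeping each sub-object as its own unit is what the nested type is for.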

2. Data description

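For the data structure and the index setup, see the earlier post "elasticsearch支持大table格式数据的搜索" (Elasticsearch support for searching large table-format data).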

3. The problem

A search for a simple IP prefix matches only one root document, returns ten nested documents, and highlights them.

Query:

{
  "_source": {
    "excludes": [
      "content"
    ]
  },
  "query": {
    "bool": {
      "should": {
        "nested": {
          "path": "content",
          "query": {
            "query_string": {
              "query": "192.168.1.1*",
              "fields": [
                "content.*"
              ]
            }
          },
          "inner_hits": {
            "from": 0,
            "size": 10,
            "highlight": {
              "fields": {
                "*": {}
              },
              "fragment_size": 1000
            }
          },
          "score_mode": "avg",
          "ignore_unmapped": true
        }
      }
    }
  },
  "size": 20,
  "timeout": "20s"
}

The request took 3111 ms. It matched a single document and returned only 10 highlighted nested documents, so it should not take this long:

{
    "took":3111,
    "timed_out":false,
    "_shards":{
        "total":1,
        "successful":1,
        "skipped":0,
        "failed":0
    },
    "hits":{
        "total":1,
        "max_score":0.001722915,
        "hits":[
        ]
    }
}

4. Locating the problem

Run the following request, which uses the profile API to show how long the query takes:


{
  "profile": true,
  "_source": {
    "excludes": [
      "content"
    ]
  },
  "query": {
    "bool": {
      "should": {
        "nested": {
          "path": "content",
          "query": {
            "query_string": {
              "query": "192.168.1.1*",
              "fields": [
                "content.*"
              ]
            }
          },
          "inner_hits": {
            "from": 0,
            "size": 10,
            "highlight": {
              "fields": {
                "*": {}
              },
              "fragment_size": 1000
            }
          },
          "score_mode": "avg",
          "ignore_unmapped": true
        }
      }
    }
  },
  "size": 20,
  "timeout": "20s"
}

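The profile section shows that the whole search (query execution, rewrite, and collection, which is all this profile output covers) took less than 20 ms, so the query itself is clearly not the cause: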

{
    "took":2859,
    "timed_out":false,
    "profile":{
        "shards":[
            {
                "searches":[
                    {
                        "query":[
                            {
                                "type":"BooleanQuery",
                                "time":"9.9ms",
                                "time_in_nanos":9945310,
                                "breakdown":{
                                    "score":9349172,
                                    "build_scorer_count":6,
                                    "match_count":0,
                                    "create_weight":398951,
                                    "next_doc":1262,
                                    "match":0,
                                    "create_weight_count":1,
                                    "next_doc_count":1,
                                    "score_count":1,
                                    "build_scorer":176010,
                                    "advance":19905,
                                    "advance_count":1
                                }
                            }
                        ],
                        "rewrite_time":41647,
                        "collector":[
                            {
                                "name":"CancellableCollector",
                                "reason":"search_cancelled",
                                "time":"9.3ms",
                                "time_in_nanos":9376796,
                                "children":[
                                    {
                                        "name":"SimpleTopScoreDocCollector",
                                        "reason":"search_top_hits",
                                        "time":"9.3ms",
                                        "time_in_nanos":9355874
                                    }
                                ]
                            }
                        ]
                    }
                ],
                "aggregations":[

                ]
            }
        ]
    }
}
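The comparison between `took` and the profiled time can be sketched as follows. The numbers are copied from the response above; the helper name `profiled_millis` is hypothetical, and adding query, rewrite, and collector times together is a deliberately generous upper bound rather than an exact accounting.

```python
# Numbers copied from the profile response above; helper name is hypothetical.
response = {
    "took": 2859,
    "profile": {"shards": [{"searches": [{
        "query": [{"type": "BooleanQuery", "time_in_nanos": 9945310}],
        "rewrite_time": 41647,
        "collector": [{"name": "CancellableCollector", "time_in_nanos": 9376796}],
    }]}]},
}


def profiled_millis(resp):
    """Sum top-level query, rewrite and collector times over all shards."""
    total_nanos = 0
    for shard in resp["profile"]["shards"]:
        for search in shard["searches"]:
            total_nanos += sum(q["time_in_nanos"] for q in search["query"])
            total_nanos += search["rewrite_time"]
            total_nanos += sum(c["time_in_nanos"] for c in search["collector"])
    return total_nanos / 1_000_000


profiled = profiled_millis(response)
unexplained = response["took"] - profiled
print(f"profiled: {profiled:.2f} ms, unaccounted: {unexplained:.2f} ms")
```

Even with this generous sum, more than 2.8 seconds of the request are spent outside the profiled part, which is why the investigation moves on to what is returned rather than what is matched.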

Could highlighting be the cause?

Remove the highlight section from the query and run the following request:

{
  "_source": {
    "excludes": [
      "content"
    ]
  },
  "query": {
    "bool": {
      "should": {
        "nested": {
          "path": "content",
          "query": {
            "query_string": {
              "query": "192.168.1.1*",
              "fields": [
                "content.*"
              ]
            }
          },
          "inner_hits": {
            "from": 0,
            "size": 10
          },
          "score_mode": "avg",
          "ignore_unmapped": true
        }
      }
    }
  },
  "size": 20,
  "timeout": "20s"
}

The execution time barely changes:

{
    "took":3117,
    "timed_out":false,
    "_shards":{
        "total":1,
        "successful":1,
        "skipped":0,
        "failed":0
    },
    "hits":{
        "total":1,
        "max_score":0.001722915,
        "hits":[
            {
                 "inner_hits":{
                    "content":{
                        "hits":{
                            "total":400000,
                            "max_score":0.001722915,
                            "hits":[
                             ]
                        }
                    }
                }
            }
        ]
    }
}

What remains can only be related to the documents being returned.

Disable returning the root document's _source and run the following request:

{
  "_source": false,
  "query": {
    "bool": {
      "should": {
        "nested": {
          "path": "content",
          "query": {
            "query_string": {
              "query": "192.168.1.1*",
              "fields": [
                "content.*"
              ]
            }
          },
          "inner_hits": {
            "from": 0,
            "size": 10
          },
          "score_mode": "avg",
          "ignore_unmapped": true
        }
      }
    }
  },
  "size": 20,
  "timeout": "20s"
}

The time still barely changes:

{
    "took":2915,
    "timed_out":false,
    "_shards":{
        "total":1,
        "successful":1,
        "skipped":0,
        "failed":0
    },
    "hits":{
        "total":1,
        "max_score":0.001722915,
        "hits":[
            {
                 "inner_hits":{
                    "content":{
                        "hits":{
                            "total":400000,
                            "max_score":0.001722915,
                            "hits":[
                             ]
                        }
                    }
                }
            }
        ]
    }
}

Now change the query so that no nested documents are returned (inner hits size set to 0) and run the following request:

{
  "_source": false,
  "query": {
    "bool": {
      "should": {
        "nested": {
          "path": "content",
          "query": {
            "query_string": {
              "query": "192.168.1.1*",
              "fields": [
                "content.*"
              ]
            }
          },
          "inner_hits": {
            "from": 0,
            "size": 0
          },
          "score_mode": "avg",
          "ignore_unmapped": true
        }
      }
    }
  },
  "size": 20,
  "timeout": "20s"
}

This time the request finishes in 10 ms:

{
    "took":10,
    "timed_out":false,
    "_shards":{
        "total":1,
        "successful":1,
        "skipped":0,
        "failed":0
    },
    "hits":{
        "total":1,
        "max_score":0.001722915,
        "hits":[
            {
                "_type":"_doc",
                "_score":0.001722915,
                "inner_hits":{
                    "content":{
                        "hits":{
                            "total":400000,
                            "max_score":0,
                            "hits":[

                            ]
                        }
                    }
                }
            }
        ]
    }
}
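As a rough back-of-the-envelope check, the two measurements that both use `"_source": false` can be compared directly. This assumes the extra time is spread evenly over the returned inner hits, which the post does not verify, and it rests on a single measurement for each case.

```python
# "took" values (ms) copied from the two responses that use "_source": false.
took_with_10_inner_hits = 2915
took_with_0_inner_hits = 10

inner_hits_returned = 10
extra_ms = took_with_10_inner_hits - took_with_0_inner_hits
per_hit_ms = extra_ms / inner_hits_returned

print(f"extra time: {extra_ms} ms, about {per_hit_ms:.1f} ms per inner hit")
```

Under that assumption, each returned nested document costs on the order of a few hundred milliseconds, far more than a small document should.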

5. Root cause analysis

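The analysis above shows that returning the 10 nested documents is what makes the execution time grow. Intuitively, simply returning 10 small documents should not take anywhere near this long.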

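inner hits provides from and size to control how many nested documents are returned. We assumed they could be used just like in an ordinary query, but here size defaults to 3, and from + size must be less than or equal to 100 unless the index.max_inner_result_window index setting is changed: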

{
    "type":"illegal_argument_exception",
    "reason":"Inner result window is too large, the inner hit definition's [null]'s from + size must be less than or equal to: [100] but was [101]. This limit can be set by changing the [index.max_inner_result_window] index level setting."
}
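The rule in the error message can be mirrored on the client side so that an oversized window is caught before the request is sent. This is a minimal sketch: `check_inner_window` is a hypothetical helper, not part of any Elasticsearch client, and the defaults are the ones stated in the text and in the error message above.

```python
# Defaults as described in the text and in the error message above.
DEFAULT_INNER_HITS_SIZE = 3
DEFAULT_MAX_INNER_RESULT_WINDOW = 100


def check_inner_window(from_=0, size=DEFAULT_INNER_HITS_SIZE,
                       max_window=DEFAULT_MAX_INNER_RESULT_WINDOW):
    """Raise ValueError when from + size exceeds the inner result window."""
    if from_ < 0 or size < 0:
        raise ValueError("from and size must be non-negative")
    if from_ + size > max_window:
        raise ValueError(
            f"from + size must be less than or equal to {max_window} "
            f"but was {from_ + size}"
        )
    return from_, size


print(check_inner_window(0, 10))    # accepted
print(check_inner_window(90, 10))   # accepted: exactly 100
try:
    check_inner_window(0, 101)
except ValueError as err:
    print("rejected:", err)
```

If the index setting has been raised, pass the new value as `max_window`; the check itself stays the same.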

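The existence of this limit suggests that inner hits are not cheap, and that the cost is tied to how the nested type is stored and how inner hits are implemented. In fact, the root document and all of its nested documents are stored together in the root document's _source field. Returning a nested document therefore requires loading and parsing the root document's _source and then locating the nested document inside it. The responses above show that although only one root document matched, it has 400,000 nested documents (the inner hits total). That many nested documents inevitably makes the _source very large, and the execution time balloons as a result.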

The Elasticsearch documentation describes this as follows: Nested documents don’t have a _source field, because the entire source of document is stored with the root document under its _source field. To include the source of just the nested document, the source of the root document is parsed and just the relevant bit for the nested document is included as source in the inner hit. Doing this for each matching nested document has an impact on the time it takes to execute the entire search request, especially when size and the inner hits' size are set higher than the default. To avoid the relatively expensive source extraction for nested inner hits, one can disable including the source and solely rely on doc values fields.
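The cost pattern described above can be imitated with plain JSON handling. This is a simplified model, not Elasticsearch's actual fetch code: the document shape, the field names, and the helpers `build_root_source`, `extract_inner_hit`, and `fetch_inner_hits` are all made up for illustration.

```python
import json


def build_root_source(n_nested):
    """Synthetic root document with many nested rows under "content"."""
    rows = [{"ip": f"192.168.1.{i % 256}", "note": f"row {i}"}
            for i in range(n_nested)]
    return json.dumps({"title": "big table", "content": rows})


def extract_inner_hit(root_source, path, offset):
    """Parse the whole root source, then keep only one nested object."""
    parsed = json.loads(root_source)   # cost grows with the root source size
    return parsed[path][offset]        # the part actually returned is tiny


def fetch_inner_hits(root_source, path, offsets):
    # One full parse per returned inner hit, as in the description above.
    return [extract_inner_hit(root_source, path, o) for o in offsets]


source = build_root_source(50_000)
hits = fetch_inner_hits(source, "content", range(10))
print(len(source), "characters parsed per hit;", len(hits), "hits returned")
```

In this model the work per inner hit depends on the size of the root source rather than on the size of the nested document being returned, which matches the behaviour observed in the measurements.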

6. Solutions

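  1. A single document is stored on exactly one shard, so adding shards cannot speed up this query;
  2. The documentation suggests disabling _source and relying on doc values fields, but in our tests the query time showed almost no improvement;
  3. Reducing the number of nested documents returned lowers the query time significantly; for example, the response below returns 3: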
{
    "took":967,
    "timed_out":false,
    "_shards":{
        "total":1,
        "successful":1,
        "skipped":0,
        "failed":0
    },
    "hits":{
        "total":1,
        "max_score":0.001722915,
        "hits":[
            {
                "_type":"_doc",
                "_score":0.001722915,
                "inner_hits":{
                    "content":{
                        "hits":{
                            "total":100008,
                            "max_score":0.001722915
                        }
                    }
                }
            }
        ]
    }
}
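Options 2 and 3 can be combined in one request body. The sketch below only builds the body; it does not send it. `build_search_body` is a hypothetical helper, `content.ip` is a placeholder for whatever doc-values-enabled fields the real mapping has, and the `bool`/`should` wrapper from the original query is omitted for brevity.

```python
import json


def build_search_body(query_text, inner_size=3, docvalue_fields=("content.ip",)):
    """Nested query whose inner hits skip _source and return doc values only.

    "content.ip" is a hypothetical field; replace it with real
    doc-values-enabled fields from the index mapping.
    """
    return {
        "_source": False,
        "query": {
            "nested": {
                "path": "content",
                "query": {
                    "query_string": {"query": query_text, "fields": ["content.*"]}
                },
                "inner_hits": {
                    "from": 0,
                    "size": inner_size,          # keep this small
                    "_source": False,            # skip source extraction
                    "docvalue_fields": list(docvalue_fields),
                },
                "score_mode": "avg",
                "ignore_unmapped": True,
            }
        },
        "size": 20,
        "timeout": "20s",
    }


body = build_search_body("192.168.1.1*")
print(json.dumps(body, indent=2))
```

As noted in item 2, disabling _source alone brought almost no improvement in the author's test, so the small `size` is the part that is expected to matter most here.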
posted @ 2022-01-06 22:39  无风听海