Elasticsearch (28): percentiles and percentile_ranks for website latency SLA statistics
Requirement
A website records the latency of every request, and we want to compute tp50, tp90, and tp99:
tp50: the longest latency among the fastest 50% of requests
tp90: the longest latency among the fastest 90% of requests
tp99: the longest latency among the fastest 99% of requests
(The examples that follow compute the 50th, 95th, and 99th percentiles.)
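To make the tpN idea concrete, here is a minimal plain-Python sketch of a linear-interpolation percentile. The `percentile` helper is illustrative only; Elasticsearch actually uses TDigest, which happens to give the same answers on a dataset this small.

```python
def percentile(values, p):
    """Return the p-th percentile of values using linear interpolation."""
    s = sorted(values)
    k = (len(s) - 1) * p / 100.0   # fractional rank into the sorted list
    lo = int(k)                    # index just below the rank
    hi = min(lo + 1, len(s) - 1)   # index just above the rank
    return s[lo] + (s[hi] - s[lo]) * (k - lo)

# the 12 latency values from the sample data below
latencies = [105, 83, 92, 112, 68, 76, 101, 275, 166, 654, 389, 302]
tp50 = percentile(latencies, 50)   # 108.5
tp95 = percentile(latencies, 95)   # ~508.25
tp99 = percentile(latencies, 99)   # ~624.85
```

These match the aggregation results shown later in this article, because on a tiny dataset the approximation is exact.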
Set up the index and load sample data
PUT /website
{
  "mappings": {
    "logs": {
      "properties": {
        "latency": {
          "type": "long"
        },
        "province": {
          "type": "keyword"
        },
        "timestamp": {
          "type": "date"
        }
      }
    }
  }
}

POST /website/logs/_bulk
{ "index": {}}
{ "latency" : 105, "province" : "江苏", "timestamp" : "2016-10-28" }
{ "index": {}}
{ "latency" : 83, "province" : "江苏", "timestamp" : "2016-10-29" }
{ "index": {}}
{ "latency" : 92, "province" : "江苏", "timestamp" : "2016-10-29" }
{ "index": {}}
{ "latency" : 112, "province" : "江苏", "timestamp" : "2016-10-28" }
{ "index": {}}
{ "latency" : 68, "province" : "江苏", "timestamp" : "2016-10-28" }
{ "index": {}}
{ "latency" : 76, "province" : "江苏", "timestamp" : "2016-10-29" }
{ "index": {}}
{ "latency" : 101, "province" : "新疆", "timestamp" : "2016-10-28" }
{ "index": {}}
{ "latency" : 275, "province" : "新疆", "timestamp" : "2016-10-29" }
{ "index": {}}
{ "latency" : 166, "province" : "新疆", "timestamp" : "2016-10-29" }
{ "index": {}}
{ "latency" : 654, "province" : "新疆", "timestamp" : "2016-10-28" }
{ "index": {}}
{ "latency" : 389, "province" : "新疆", "timestamp" : "2016-10-28" }
{ "index": {}}
{ "latency" : 302, "province" : "新疆", "timestamp" : "2016-10-29" }
percentiles syntax
GET /website/logs/_search
{
  "size": 0,
  "aggs": {
    "latency_percentiles": {
      "percentiles": {
        "field": "latency",
        "percents": [
          50,
          95,
          99
        ]
      }
    },
    "latency_avg": {
      "avg": {
        "field": "latency"
      }
    }
  }
}
{
  "took": 31,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 12,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "latency_avg": {
      "value": 201.91666666666666
    },
    "latency_percentiles": {
      "values": {
        "50.0": 108.5,
        "95.0": 508.24999999999983,
        "99.0": 624.8500000000001
      }
    }
  }
}
Drilling down: percentiles per province (group by province)
GET /website/logs/_search
{
  "size": 0,
  "aggs": {
    "group_by_province": {
      "terms": {
        "field": "province"
      },
      "aggs": {
        "latency_percentiles": {
          "percentiles": {
            "field": "latency",
            "percents": [
              50,
              95,
              99
            ]
          }
        },
        "latency_avg": {
          "avg": {
            "field": "latency"
          }
        }
      }
    }
  }
}
{
  "took": 33,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 12,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "group_by_province": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "新疆",
          "doc_count": 6,
          "latency_avg": {
            "value": 314.5
          },
          "latency_percentiles": {
            "values": {
              "50.0": 288.5,
              "95.0": 587.75,
              "99.0": 640.75
            }
          }
        },
        {
          "key": "江苏",
          "doc_count": 6,
          "latency_avg": {
            "value": 89.33333333333333
          },
          "latency_percentiles": {
            "values": {
              "50.0": 87.5,
              "95.0": 110.25,
              "99.0": 111.65
            }
          }
        }
      ]
    }
  }
}
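Because the dataset is tiny, TDigest is exact here, and the per-bucket medians above can be reproduced in plain Python. This is only a sketch of what the `terms` + `percentiles` combination computes; the `percentile` helper is illustrative, not Elasticsearch's implementation:

```python
from collections import defaultdict

def percentile(values, p):
    """p-th percentile via linear interpolation (exact on small datasets)."""
    s = sorted(values)
    k = (len(s) - 1) * p / 100.0
    lo = int(k)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (k - lo)

# (province, latency) pairs from the sample data
docs = [("江苏", 105), ("江苏", 83), ("江苏", 92), ("江苏", 112),
        ("江苏", 68), ("江苏", 76), ("新疆", 101), ("新疆", 275),
        ("新疆", 166), ("新疆", 654), ("新疆", 389), ("新疆", 302)]

# the "terms" aggregation: one bucket of latencies per province
buckets = defaultdict(list)
for province, latency in docs:
    buckets[province].append(latency)

for province, latencies in buckets.items():
    print(province, percentile(latencies, 50))
# 新疆 288.5 and 江苏 87.5, matching the response above
```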
percentile_ranks syntax
An SLA (Service Level Agreement) is the standard of service you commit to provide. Suppose our website's latency SLA requires that 100% of requests complete within 200ms; large companies commonly impose exactly this kind of 100%-within-200ms requirement. If latency exceeds 1s, the issue is escalated to a severity-A incident, because site performance and user experience have degraded sharply.
Requirement: what percentage of requests complete within 200ms, and what percentage within 1000ms? This is precisely what the percentile_ranks metric answers.
In practice, percentile_ranks is used even more often than percentiles.
An analogous use case: group by brand and compute, for TVs, the proportion priced within 1000, within 2000, and within 3000.
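A percentile rank is the inverse question: given a value, what fraction of the data falls at or below it? A naive plain-Python sketch follows; note that Elasticsearch's TDigest interpolates between points, so its answer can differ slightly on coarse data (e.g. ~29.4% where the naive count gives 33.3%):

```python
def percentile_rank(values, threshold):
    """Percentage of values that are <= threshold (naive count, no interpolation)."""
    return 100.0 * sum(1 for v in values if v <= threshold) / len(values)

# 新疆 latencies from the sample data
xinjiang = [101, 275, 166, 654, 389, 302]
print(percentile_rank(xinjiang, 200))    # 33.33... (ES's TDigest reports ~29.4)
print(percentile_rank(xinjiang, 1000))   # 100.0
```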
GET /website/logs/_search
{
  "size": 0,
  "aggs": {
    "group_by_province": {
      "terms": {
        "field": "province"
      },
      "aggs": {
        "latency_percentile_ranks": {
          "percentile_ranks": {
            "field": "latency",
            "values": [
              200,
              1000
            ]
          }
        }
      }
    }
  }
}
{
  "took": 38,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 12,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "group_by_province": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "新疆",
          "doc_count": 6,
          "latency_percentile_ranks": {
            "values": {
              "200.0": 29.40613026819923,
              "1000.0": 100
            }
          }
        },
        {
          "key": "江苏",
          "doc_count": 6,
          "latency_percentile_ranks": {
            "values": {
              "200.0": 100,
              "1000.0": 100
            }
          }
        }
      ]
    }
  }
}
Optimizing percentiles
As with cardinality, computing percentiles requires an approximation algorithm.
percentiles uses the TDigest algorithm. As with HyperLogLog, you don't need to understand the full technical details, but it helps to know the algorithm's characteristics:
- Accuracy is proportional to how extreme the percentile is: the 1st or 99th percentile is more accurate than the 50th. This is simply a property of the data structure, but it is a useful one, since most people care about the extreme percentiles.
- For small sets of values, percentiles are highly accurate; if the dataset is small enough, the result can be 100% exact.
- As the number of values in a bucket grows, the algorithm starts to estimate the percentiles, effectively trading accuracy for memory savings. The degree of inaccuracy is hard to characterize in general, since it depends on the distribution of the data and the volume of data being aggregated.
TDigest caps the number of nodes it keeps at compression * 20; with the default compression of 100, that is 2,000 nodes.
The larger the compression, the more memory used, the more accurate the result, and the worse the performance.
Each node takes about 32 bytes, so the default worst-case footprint is 100 * 20 * 32 = 64KB.
If you want more accurate percentiles, set compression higher (it is configurable on the percentiles aggregation).
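The memory math above as a back-of-envelope helper (assumes the 20-nodes-per-unit-of-compression cap and the ~32 bytes per node stated in the text; the function name is illustrative):

```python
def tdigest_max_memory_bytes(compression=100, bytes_per_node=32):
    """Worst-case TDigest buffer size: compression * 20 nodes, ~32 bytes each."""
    max_nodes = compression * 20
    return max_nodes * bytes_per_node

print(tdigest_max_memory_bytes(100))  # 64000 bytes, i.e. the 64KB default
print(tdigest_max_memory_bytes(200))  # 128000 bytes: more accurate, more memory
```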