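druid.io in Practice: Approximate Histograms and Quantiles
The histogram feature is still listed among Druid's experimental features, so it is probably not fully mature yet.
Use case: combining the approxHistogram aggregator with the quantile or quantiles post-aggregator lets you query response times at the 0.95/0.98/0.99 quantiles.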
Adding the extension
Check that druid-histogram exists under the {DRUID}/extensions directory.
druid-histogram must then be added to the extension load list:
druid.extensions.loadList=["druid-histogram",.....]
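The nodes must be restarted to load the newly added extension:
- On the query side, restart the historical and broker nodes.
- On the ingestion side, restart the overlord node.
Output on restart: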
2017-02-17T14:22:30,007 INFO [main] io.druid.initialization.Initialization - Loading extension [druid-histogram] for class [io.druid.cli.CliCommandCreator]
2017-02-17T14:22:30,008 INFO [main] io.druid.initialization.Initialization - added URL[file:/disk1/druid-0.9.1.1/extensions/druid-histogram/druid-histogram-0.9.1.1.jar]
2017-02-17T14:22:30,426 INFO [main] io.druid.initialization.Initialization - Loading extension [druid-histogram] for class [io.druid.initialization.DruidModule]
2017-02-17T14:22:30,429 INFO [main] io.druid.initialization.Initialization - Adding local file system extension module [io.druid.query.aggregation.histogram.ApproximateHistogramDruidModule] for class [io.druid.initialization.DruidModule]
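When several nodes are restarted, checking each log by eye gets tedious. The following is a minimal Python sketch that scans log text for the "Loading extension" lines shown above; the function name `loaded_extensions` is made up for illustration, and the sample text is two of the log lines from this post.

```python
import re

# Two of the lines from the restart output above.
LOG = """\
2017-02-17T14:22:30,007 INFO [main] io.druid.initialization.Initialization - Loading extension [druid-histogram] for class [io.druid.cli.CliCommandCreator]
2017-02-17T14:22:30,426 INFO [main] io.druid.initialization.Initialization - Loading extension [druid-histogram] for class [io.druid.initialization.DruidModule]
"""

# Matches "Loading extension [<name>]" and captures the extension name.
PATTERN = re.compile(r"Loading extension \[([^\]]+)\]")


def loaded_extensions(log_text):
    """Return the set of extension names mentioned in 'Loading extension' lines."""
    return set(PATTERN.findall(log_text))


if __name__ == "__main__":
    names = loaded_extensions(LOG)
    print("druid-histogram loaded:", "druid-histogram" in names)
```

In practice the text would come from the node's log file rather than an inline string; the log location depends on how Druid was deployed.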
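Data ingestion: creating the histogram sketch
One of the following two aggregators must be included at indexing time, and they apply only to numeric values:
- approxHistogram: rows with a missing value are treated as having the value 0.
- approxHistogramFold: rows with a missing value are ignored.
To query for results, the "approxHistogramFold" aggregator must be included in the query.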
{
  "type" : "approxHistogram" or "approxHistogramFold",
  "name" : <output_name>,
  "fieldName" : <metric_name>,
  "resolution" : <integer>,
  "numBuckets" : <integer>,
  "lowerLimit" : <float>,
  "upperLimit" : <float>
}
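Parameter | Meaning | Default |
---|---|---|
resolution | Number of centroids (data points) to store. The higher the resolution, the more accurate the results, but the slower the computation. | 50 |
numBuckets | Number of output buckets in the resulting histogram | 7 |
lowerLimit/upperLimit | Restricts the approximation to the given range. Values outside this range are aggregated into two centroids; counts of values outside the range are still maintained. | -INF/+INF |
resolution and numBuckets can be increased as the situation requires, and the lowerLimit/upperLimit range can likewise be adjusted as needed.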
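To give a feel for why a higher resolution costs more but approximates better, here is a toy Python sketch of a fixed-size centroid histogram in the spirit of the streaming histogram from the paper listed at the end of this post (A Streaming Parallel Decision Tree Algorithm). It is not Druid's implementation: the class name `CentroidHistogram` and its methods are made up for illustration, and the only idea it shows is "keep at most `resolution` centroids, merging the two closest when the limit is exceeded".

```python
import bisect


class CentroidHistogram:
    """Toy fixed-size histogram: keeps at most `resolution` [value, count] centroids."""

    def __init__(self, resolution=50):
        self.resolution = resolution
        self.centroids = []  # kept sorted by value; each entry is [value, count]

    def add(self, value):
        # Insert the new point as its own centroid, keeping the list sorted.
        bisect.insort(self.centroids, [float(value), 1])
        if len(self.centroids) > self.resolution:
            self._merge_closest()

    def _merge_closest(self):
        # Find the adjacent pair with the smallest gap and replace it with
        # a single centroid at the count-weighted mean of the two.
        gaps = [self.centroids[i + 1][0] - self.centroids[i][0]
                for i in range(len(self.centroids) - 1)]
        i = gaps.index(min(gaps))
        (v1, c1), (v2, c2) = self.centroids[i], self.centroids[i + 1]
        merged = [(v1 * c1 + v2 * c2) / (c1 + c2), c1 + c2]
        self.centroids[i:i + 2] = [merged]


if __name__ == "__main__":
    h = CentroidHistogram(resolution=3)
    for v in [1, 2, 10, 11, 50]:
        h.add(v)
    print(h.centroids)
```

With only three centroids allowed, the five inputs collapse into three weighted points, so detail inside each merged pair is lost. A larger resolution delays that loss, at the price of more centroids to store and scan.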
Example
At ingestion time
"metricsSpec" : [
  { "type" : "longSum", "name" : "total_num", "fieldName" : "total_num" },
  { "type" : "approxHistogramFold", "name" : "rsp_time", "fieldName" : "rsp_time", "resolution" : 500, "numBuckets" : 500, "lowerLimit" : 0.0, "upperLimit" : 1000000.0 }
],
At query time
"aggregations": [ { "type": "longSum", "name": "total_num", "fieldName": "total_num" }, { "type": "approxHistogramFold", "name": "rsp_time", "fieldName": "rsp_time", "resolution" : 500, "numBuckets" : 500 } ],
Here is a concrete example. At query time, if numBuckets is not specified (it defaults to 7) and no post-aggregators are used, the raw data looks like this:
"rsp_time" : { "breaks" : [ -123.5, 1.0, 125.5, 250.0, 374.5, 499.0, 623.5, 748.0 ], "counts" : [ 1.0, 10.0, 4.0, 0.0, 0.0, 0.0, 1.0 ] }
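breaks has 8 values, which delimit 7 buckets, and counts gives the count for each bucket in the same order. For example, the bucket from 1.0 to 125.5 holds 10 values and the bucket from 125.5 to 250.0 holds 4.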
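The pairing of breaks and counts can be sketched in a few lines of Python. The helper name `to_buckets` is made up for illustration, and the data is the raw `rsp_time` value shown above.

```python
def to_buckets(histogram):
    """Pair consecutive breaks into (lower, upper, count) tuples."""
    breaks = histogram["breaks"]
    counts = histogram["counts"]
    # n breaks delimit n - 1 buckets, so the two lists must line up.
    if len(breaks) != len(counts) + 1:
        raise ValueError("expected len(breaks) == len(counts) + 1")
    return [(breaks[i], breaks[i + 1], counts[i]) for i in range(len(counts))]


# The raw aggregator value from the example above.
rsp_time = {
    "breaks": [-123.5, 1.0, 125.5, 250.0, 374.5, 499.0, 623.5, 748.0],
    "counts": [1.0, 10.0, 4.0, 0.0, 0.0, 0.0, 1.0],
}

if __name__ == "__main__":
    for lower, upper, count in to_buckets(rsp_time):
        print(f"{lower} .. {upper}: {count:.0f}")
```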
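Data query: post-aggregators
Post-aggregators take the raw histogram data returned by the aggregator and compute metrics such as quantiles, the minimum, and the maximum:
- Equal buckets post-aggregator
- Buckets post-aggregator
- Custom buckets post-aggregator: uses the breaks you specify
- min post-aggregator: the minimum value
- max post-aggregator: the maximum value
- quantile post-aggregator: takes a single quantile, for example 0.99
- quantiles post-aggregator: takes an array of quantiles, which covers the usual 0.75, 0.99, and so on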
{ "type": "customBuckets", "name": "histogram", "fieldName": "rsp_time", "breaks": [ 0, 10000, 999999999 ] }
{ "type": "quantile", "name": "rsp_time", "fieldName": "rsp_time", "probability": 0.99 }
{ "type" : "quantiles", "name" : <output_name>, "fieldName" : <aggregator_name>, "probabilities" : [ <quantile>, <quantile>, ... ] }
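The post-aggregator definitions above are only fragments of a query. The Python sketch below shows, under several assumptions, how the query-time aggregators and a quantiles post-aggregator might be assembled into one request. The groupBy query type, the datasource name, the interval, the output name `rsp_time_quantiles`, and the broker address are all placeholders not taken from this post. `run_query` is defined but never called, since it needs a live cluster.

```python
import json
import urllib.request


def build_query(data_source, interval):
    """Assemble a groupBy query body (assumed query type) as a plain dict."""
    return {
        "queryType": "groupBy",
        "dataSource": data_source,       # placeholder datasource name
        "granularity": "all",
        "dimensions": [],
        "intervals": [interval],         # placeholder ISO-8601 interval
        "aggregations": [
            {"type": "longSum", "name": "total_num", "fieldName": "total_num"},
            {"type": "approxHistogramFold", "name": "rsp_time",
             "fieldName": "rsp_time", "resolution": 500, "numBuckets": 500},
        ],
        "postAggregations": [
            # fieldName must match the name of the histogram aggregator above.
            {"type": "quantiles", "name": "rsp_time_quantiles",
             "fieldName": "rsp_time",
             "probabilities": [0.5, 0.75, 0.9, 0.95, 0.99]},
        ],
    }


def run_query(broker_url, query):
    """POST the query to a broker. Not called here: it needs a running cluster."""
    request = urllib.request.Request(
        broker_url,  # e.g. "http://localhost:8082/druid/v2/" (assumed host and port)
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    print(json.dumps(build_query("my_datasource", "2017-03-02/2017-03-03"), indent=2))
```

Building the body as a dict and serializing it once keeps the aggregator name and the post-aggregator's fieldName visibly in sync, which is the part that is easiest to get wrong by hand.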
Example
For the rsp_time aggregator above, the corresponding post-aggregators are:
"postAggregations":[ { "type" : "quantiles", "name" : "响应时间", "fieldName" : "rsp_time","probabilities" : [ 0.5,0.75,0.9,0.95,0.99] } ],
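With numBuckets left unset (default 7), a sample result is shown below; numBuckets can be raised for finer-grained buckets, and resolution for more accurate estimates: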
{
  "version" : "v1",
  "timestamp" : "2017-03-02T16:53:08.000Z",
  "event" : {
    .... ....
    "响应时间" : {
      "probabilities" : [ 0.5, 0.75, 0.9, 0.95, 0.99 ],
      "quantiles" : [ 84.0, 152.0, 205.59998, 349.5999, 668.32007 ],
      "min" : 1.0,
      "max" : 748.0
    },
    "rsp_time" : {
      "breaks" : [ -123.5, 1.0, 125.5, 250.0, 374.5, 499.0, 623.5, 748.0 ],
      "counts" : [ 1.0, 10.0, 4.0, 0.0, 0.0, 0.0, 1.0 ]
    }
    .... ....
  }
}
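The quantiles post-aggregator output in the result above is a pair of parallel arrays, so reading off a particular quantile means matching positions. A small Python sketch, with the helper name `quantile_map` made up for illustration and the values copied from the result above:

```python
def quantile_map(post_agg_value):
    """Map each requested probability to its estimated quantile."""
    return dict(zip(post_agg_value["probabilities"], post_agg_value["quantiles"]))


# The quantiles post-aggregator value from the sample result above.
result = {
    "probabilities": [0.5, 0.75, 0.9, 0.95, 0.99],
    "quantiles": [84.0, 152.0, 205.59998, 349.5999, 668.32007],
    "min": 1.0,
    "max": 748.0,
}

if __name__ == "__main__":
    for probability, value in quantile_map(result).items():
        print(f"p{probability * 100:g}: {value}")
```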
Theory
Bookmarking these to study when time allows:
paper: A Streaming Parallel Decision Tree Algorithm
post: The Art of Approximating Distributions: Histograms and Quantiles at Scale
References
Official documentation
Druid big-data analytics: queries
Google Group question about histogram, 1
Google Group question about histogram, 2
Google Group question about histogram, 3
Original: https://fangyeqing.github.io/2017/03/15/druid.io实践6---实现近似直方图和分位数/ Author: 小懒