Druid.io in practice: implementing approximate histograms and quantiles

The histogram feature is still one of Druid's experimental features, so it is probably not fully mature yet.

Use case: combining the approxHistogram aggregator with a quantile or quantiles post-aggregator makes it possible to query response times at the 0.95/0.98/0.99 quantiles.

Adding extension support

Check that druid-histogram exists under the {DRUID}/extensions directory.
druid-histogram then needs to be added to the extension load list:

druid.extensions.loadList=["druid-histogram",.....]

The nodes must be restarted to load the newly added extension:

  • On the query side, restart the historical and broker nodes.
  • On the ingestion side, restart the overlord node.

Output when a node restarts:

2017-02-17T14:22:30,007 INFO [main] io.druid.initialization.Initialization - Loading extension [druid-histogram] for class [io.druid.cli.CliCommandCreator]
2017-02-17T14:22:30,008 INFO [main] io.druid.initialization.Initialization - added URL[file:/disk1/druid-0.9.1.1/extensions/druid-histogram/druid-histogram-0.9.1.1.jar]
2017-02-17T14:22:30,426 INFO [main] io.druid.initialization.Initialization - Loading extension [druid-histogram] for class [io.druid.initialization.DruidModule]
2017-02-17T14:22:30,429 INFO [main] io.druid.initialization.Initialization - Adding local file system extension module [io.druid.query.aggregation.histogram.ApproximateHistogramDruidModule] for class [io.druid.initialization.DruidModule]

Data ingestion: creating the histogram sketch

At indexing time, one of the following two aggregators must be included, and they apply only to numeric values:

  • approxHistogram: missing values are treated as 0.
  • approxHistogramFold: missing values are ignored.

To query for results, an "approxHistogramFold" aggregator must be included in the query.

{
  "type" : "approxHistogram/approxHistogramFold",
  "name" : <output_name>,
  "fieldName" : <metric_name>,
  "resolution" : <integer>,
  "numBuckets" : <integer>,
  "lowerLimit" : <float>,
  "upperLimit" : <float>
}
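
Parameter | Description | Default
resolution | Number of centroids (data points) to store. The higher the resolution, the more accurate the results, but the slower the computation. | 50
numBuckets | Number of output buckets in the resulting histogram. | 7
lowerLimit/upperLimit | Restrict the approximation to the given range. Values outside this range are aggregated into two centroids. Counts of values outside this range are still maintained. | -INF/+INF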

resolution and numBuckets can be increased to suit the actual situation. The lowerLimit/upperLimit range can also be adjusted as needed.
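
To give a feel for what resolution controls, here is a minimal Python sketch of the centroid idea behind this kind of streaming histogram: each value becomes a centroid, and once there are more centroids than the resolution allows, the two closest ones are merged. This is a simplified illustration and not Druid's actual implementation; the class name StreamingHistogram and its methods are made up for this example.

```python
import bisect


class StreamingHistogram:
    """Simplified centroid histogram keeping at most `resolution` centroids."""

    def __init__(self, resolution=50):
        self.resolution = resolution
        self.centroids = []  # sorted list of [value, count]

    def add(self, value):
        # Insert the value as its own centroid, keeping the list sorted.
        keys = [c[0] for c in self.centroids]
        i = bisect.bisect_left(keys, value)
        if i < len(self.centroids) and self.centroids[i][0] == value:
            self.centroids[i][1] += 1
        else:
            self.centroids.insert(i, [float(value), 1])
        # Too many centroids: collapse the closest adjacent pair.
        if len(self.centroids) > self.resolution:
            self._merge_closest()

    def _merge_closest(self):
        # Find the adjacent pair with the smallest gap and replace it
        # with a single count-weighted centroid.
        gaps = [self.centroids[j + 1][0] - self.centroids[j][0]
                for j in range(len(self.centroids) - 1)]
        j = gaps.index(min(gaps))
        (v1, c1), (v2, c2) = self.centroids[j], self.centroids[j + 1]
        merged = [(v1 * c1 + v2 * c2) / (c1 + c2), c1 + c2]
        self.centroids[j:j + 2] = [merged]

    def count(self):
        # Total number of values ever added.
        return sum(c for _, c in self.centroids)


hist = StreamingHistogram(resolution=5)
for v in range(100):
    hist.add(v)
print(hist.centroids)
```

With a small resolution the 100 inputs are squeezed into 5 centroids, which is why a larger resolution preserves more detail at the cost of more work per value.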

Example

At ingestion time

"metricsSpec" : [ {
        "type" : "longSum",
        "name" : "total_num",
        "fieldName" : "total_num"
      }, {
        "type" : "approxHistogramFold",
        "name" : "rsp_time",
        "fieldName" : "rsp_time",
        "resolution" : 500,
        "numBuckets" : 500,
        "lowerLimit" : 0.0,
        "upperLimit" : 1000000.0
      } ],

At query time

 "aggregations": [
            {
              "type": "longSum",
              "name": "total_num",
              "fieldName": "total_num"
            },
            {
              "type": "approxHistogramFold",
              "name": "rsp_time",
              "fieldName": "rsp_time",
              "resolution" : 500,
              "numBuckets" : 500
            }
],

Here is a real example. At query time, if numBuckets is not specified (the default is 7), the raw data returned without post-aggregators has this format:

"rsp_time" : {
     "breaks" : [ -123.5, 1.0, 125.5, 250.0, 374.5, 499.0, 623.5, 748.0 ],
     "counts" : [ 1.0, 10.0, 4.0, 0.0, 0.0, 0.0, 1.0 ]
   }

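breaks has 8 values, which split the range into 7 segments, and counts gives the count for each segment in order. For example, 1 value falls between -123.5 and 1.0, and 10 values fall between 1.0 and 125.5.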
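
The pairing of breaks and counts can be sketched in a few lines of Python. The helper name describe_buckets is made up for this illustration; the data is the rsp_time value shown above.

```python
def describe_buckets(histogram):
    """Pair each count with the [lower, upper) break values around it."""
    breaks = histogram["breaks"]
    counts = histogram["counts"]
    # N buckets are delimited by N + 1 break values.
    if len(breaks) != len(counts) + 1:
        raise ValueError("expected one more break than counts")
    return [(breaks[i], breaks[i + 1], counts[i]) for i in range(len(counts))]


rsp_time = {
    "breaks": [-123.5, 1.0, 125.5, 250.0, 374.5, 499.0, 623.5, 748.0],
    "counts": [1.0, 10.0, 4.0, 0.0, 0.0, 0.0, 1.0],
}

buckets = describe_buckets(rsp_time)
for lower, upper, count in buckets:
    print(f"{lower} to {upper}: {count}")
```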

 

Querying data: post-aggregators

Post-aggregators work on the raw data returned by the aggregator above to compute metrics such as quantiles, the minimum, and the maximum.

  • Equal buckets post-aggregator
  • Buckets post-aggregator
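  • Custom buckets post-aggregator: specify the break points yourself
  • min post-aggregator: the minimum value
  • max post-aggregator: the maximum value
  • quantile post-aggregator: takes a single quantile, for example 0.99
  • quantiles post-aggregator: takes an array of quantiles, which covers the commonly used 0.75, 0.99, and so on

{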
    "type": "customBuckets",
    "name": "histogram",
    "fieldName": "rsp_time",
    "breaks": [
        0,
        10000,
        999999999
    ]
}

{
    "type": "quantile",
    "name": "rsp_time",
    "fieldName": "rsp_time",
    "probability": 0.99
}

{
    "type" : "quantiles",
    "name" : <output_name>, 
    "fieldName" : <aggregator_name>,
    "probabilities" : [ <quantile>, <quantile>, ... ]
}

Example

For the rsp_time aggregator above, the corresponding post-aggregators are as follows (the output name "响应时间" means "response time"):

"postAggregations":[
            { "type" : "quantiles", "name" : "响应时间", "fieldName" : "rsp_time","probabilities" : [ 0.5,0.75,0.9,0.95,0.99] }
        ],
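
As a rough sketch of how the aggregations and postAggregations fragments fit into one query body, the following Python builds a complete query as a dict. The groupBy query type with no dimensions, the data source name my_datasource, the interval, the output name rsp_time_quantiles, and the helper build_quantile_query are all assumptions made for illustration; the original post does not show the full query.

```python
import json


def build_quantile_query(data_source, interval, probabilities):
    """Assemble a query that folds the rsp_time sketch and extracts quantiles."""
    return {
        "queryType": "groupBy",       # assumed; the post does not state the type
        "dataSource": data_source,    # hypothetical data source name
        "granularity": "all",
        "dimensions": [],
        "intervals": [interval],      # hypothetical interval
        "aggregations": [
            {
                "type": "approxHistogramFold",
                "name": "rsp_time",
                "fieldName": "rsp_time",
                "resolution": 500,
                "numBuckets": 500,
            }
        ],
        "postAggregations": [
            {
                "type": "quantiles",
                "name": "rsp_time_quantiles",
                # fieldName must match the aggregator's output name above.
                "fieldName": "rsp_time",
                "probabilities": probabilities,
            }
        ],
    }


query = build_quantile_query(
    "my_datasource", "2017-03-01/2017-03-03", [0.5, 0.75, 0.9, 0.95, 0.99]
)
print(json.dumps(query, indent=2))
```

The key point is that the post-aggregator's fieldName refers to the name of the approxHistogramFold aggregator in the same query, not to the raw metric column.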

With numBuckets left unset (the default of 7), a sample result looks like this; the value can be increased to improve accuracy:

{
  "version" : "v1",
  "timestamp" : "2017-03-02T16:53:08.000Z",
  "event" : {
    ....
    ....
    "响应时间" : {
      "probabilities" : [ 0.5, 0.75, 0.9, 0.95, 0.99 ],
      "quantiles" : [ 84.0, 152.0, 205.59998, 349.5999, 668.32007 ],
      "min" : 1.0,
      "max" : 748.0
    },
    "rsp_time" : {
      "breaks" : [ -123.5, 1.0, 125.5, 250.0, 374.5, 499.0, 623.5, 748.0 ],
      "counts" : [ 1.0, 10.0, 4.0, 0.0, 0.0, 0.0, 1.0 ]
    }
    ....
    ....
  }
}
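
On the client side, the probabilities and quantiles arrays line up by position, so they can be zipped into a lookup table. A small Python sketch, using the values from the result above (the helper name quantile_map is made up for this example):

```python
def quantile_map(post_agg_value):
    """Map each requested probability to its estimated quantile."""
    probabilities = post_agg_value["probabilities"]
    quantiles = post_agg_value["quantiles"]
    if len(probabilities) != len(quantiles):
        raise ValueError("probabilities and quantiles differ in length")
    return dict(zip(probabilities, quantiles))


# The "event" part of the sample result, reduced to the quantiles entry.
event = {
    "响应时间": {
        "probabilities": [0.5, 0.75, 0.9, 0.95, 0.99],
        "quantiles": [84.0, 152.0, 205.59998, 349.5999, 668.32007],
        "min": 1.0,
        "max": 748.0,
    }
}

latency = quantile_map(event["响应时间"])
print("p99 response time:", latency[0.99])
```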

Theory

Bookmarking these to study when I have time:
paper: A Streaming Parallel Decision Tree Algorithm
post: The Art of Approximating Distributions: Histograms and Quantiles at Scale

References

Official documentation
Druid big data analytics: queries
Google Group question about histogram 1
Google Group question about histogram 2
Google Group question about histogram 3

Original: https://fangyeqing.github.io/2017/03/15/druid.io实践6---实现近似直方图和分位数/  Author: 小懒

 

posted @ 2018-08-06 15:19  舞羊