OLAP之Druid之查询
数据查询
Druid的聚合查询主要有三种形式:
- Timeseries
- TopN
- GroupBy
一般而言,OLAP系统最核心的能力是GroupBy
查询,Druid也不例外。 但是GroupBy查询资源消耗较多,TopN
和Timeseries
作为GroupBy的有益补充,能够改善查询的性能。我们建议:如果TopN
和Timeseries
能够满足业务的应用场景,那么尽量采用这两种查询,而非GroupBy
。
Druid提供RESTful的查询接口,用户使用JSON表达查询意图。
查询命令:
curl -X POST 'broker:<port>/druid/v2/?pretty' -H 'Content-Type:application/json' -d @<query_json_file>
注意点
在Druid查询中,过滤条件是所有查询都可能涉及的部分,并且有一些使用技巧,需要特别注意。请参考Filters。
指标聚合这部分也是非常重要的,Aggregations也提供了系统的介绍,此处就不再赘述了。我们需要指出的是,这一页文档中Filtered Aggregator
能够提供非常强大的查询功能,比如在查询过程中根据维度取值定制指标。
GroupBy
示例
{ "queryType": "groupBy", "dataSource": "sample_datasource", "granularity": "day", "dimensions": ["country", "device"], #需要聚合的维度列 "limitSpec": { "type": "default", "limit": 5000, "columns": ["country", "data_transfer"] }, #limit语句 "filter": { #过滤条件 "type": "and", "fields": [ { "type": "selector", "dimension": "carrier", "value": "AT&T" }, { "type": "or", "fields": [ { "type": "selector", "dimension": "make", "value": "Apple" }, { "type": "selector", "dimension": "make", "value": "Samsung" } ] } ] }, "aggregations": [ #返回的指标列 { "type": "longSum", "name": "total_usage", "fieldName": "user_count" }, { "type": "doubleSum", "name": "data_transfer", "fieldName": "data_transfer" } ], "postAggregations": [ #这部分是可选的 { "type": "arithmetic", "name": "avg_usage", "fn": "/", "fields": [ { "type": "fieldAccess", "fieldName": "data_transfer" }, { "type": "fieldAccess", "fieldName": "total_usage" } ] } ], "intervals": [ "2012-01-01T00:00:00.000/2012-01-03T00:00:00.000" ], #本次查询需要覆盖的时间范围 "having": { #having语句,这部分是可选的 "type": "greaterThan", "aggregation": "total_usage", "value": 100 } }
Timeseries
示例
{ "queryType": "timeseries", "dataSource": "sample_datasource", "granularity": "day", "descending": "true", #是否排序 "filter": { #过滤条件 "type": "and", "fields": [ { "type": "selector", "dimension": "sample_dimension1", "value": "sample_value1" }, { "type": "or", "fields": [ { "type": "selector", "dimension": "sample_dimension2", "value": "sample_value2" }, { "type": "selector", "dimension": "sample_dimension3", "value": "sample_value3" } ] } ] }, "aggregations": [ #返回的指标列 { "type": "longSum", "name": "sample_name1", "fieldName": "sample_fieldName1" }, { "type": "doubleSum", "name": "sample_name2", "fieldName": "sample_fieldName2" } ], "postAggregations": [ #这部分是可选的 { "type": "arithmetic", "name": "sample_divide", "fn": "/", "fields": [ { "type": "fieldAccess", "name": "postAgg__sample_name1", "fieldName": "sample_name1" }, { "type": "fieldAccess", "name": "postAgg__sample_name2", "fieldName": "sample_name2" } ] } ], "intervals": [ "2012-01-01T00:00:00.000/2012-01-04T00:00:00.000" ] #本次查询覆盖的时间范围 }
Timeseries query通常对空的查询时间段返回0作为查询结果
TopN
- TopN查询返回的是根据某一维度进行group by后再排序,返回结果集
- 为了提高执行效率,TopN的查询是近似查询(从我们使用经验来看,返回结果基本是比较准确的)
示例
{ "queryType": "topN", "dataSource": "sample_data", "dimension": "sample_dim", #需要聚合的维度列 "threshold": 5, "metric": "count", #作为排序依据的指标列 "granularity": "all", "filter": { #过滤条件 "type": "and", "fields": [ { "type": "selector", "dimension": "dim1", "value": "some_value" }, { "type": "selector", "dimension": "dim2", "value": "some_other_val" } ] }, "aggregations": [ #返回的指标列 { "type": "longSum", "name": "count", "fieldName": "count" }, { "type": "doubleSum", "name": "some_metric", "fieldName": "some_metric" } ], "postAggregations": [ #后处理逻辑,这部分是可选的 { "type": "arithmetic", "name": "sample_divide", "fn": "/", "fields": [ { "type": "fieldAccess", "name": "some_metric", "fieldName": "some_metric" }, { "type": "fieldAccess", "name": "count", "fieldName": "count" } ] } ], "intervals": [ "2013-08-31T00:00:00.000/2013-09-03T00:00:00.000" #查询覆盖的时间范围 ] }