SkyWalking Alarms: A Detailed Analysis (Part 2)

https://blog.csdn.net/feiying0canglang/article/details/121562890

http://www.manongjc.com/detail/26-asnfhftlcafxjai.html

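I have read a lot of material online. On the question of which metric names SkyWalking supports, the official documentation and almost every blog post just point to a path. Nobody explains in detail which metrics exist or what each one measures, which makes writing custom alarm rules harder than it should be.

So here is a summary. If you spot a mistake, please point it out:

SkyWalking's OAL metric definitions are stored under /apache-skywalking-apm-bin-es78/config/oal/*.oal

Let's look at the first OAL file: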

core.oal

// All scope metrics
all_percentile = from(All.latency).percentile(10);  // Multiple values including p50, p75, p90, p95, p99
all_heatmap = from(All.latency).histogram(100, 20); // Latency heatmap of all incoming requests

// Service scope metrics
service_resp_time = from(Service.latency).longAvg(); // Average response time of the service
service_sla = from(Service.*).percent(status == true); // Request success rate of the service
service_cpm = from(Service.*).cpm(); // Calls per minute of the service
service_percentile = from(Service.latency).percentile(10); // Multiple values including p50, p75, p90, p95, p99
service_apdex = from(Service.latency).apdex(name, status); // Apdex score of the service. Apdex measures the ratio of satisfactory to unsatisfactory response times; the default satisfaction threshold is 500 ms

// Service relation scope metrics for topology (calls between services)
service_relation_client_cpm = from(ServiceRelation.*).filter(detectPoint == DetectPoint.CLIENT).cpm(); // Calls per minute detected on the client side
service_relation_server_cpm = from(ServiceRelation.*).filter(detectPoint == DetectPoint.SERVER).cpm(); // Calls per minute detected on the server side
service_relation_client_call_sla = from(ServiceRelation.*).filter(detectPoint == DetectPoint.CLIENT).percent(status == true); // Success rate detected on the client side
service_relation_server_call_sla = from(ServiceRelation.*).filter(detectPoint == DetectPoint.SERVER).percent(status == true); // Success rate detected on the server side
service_relation_client_resp_time = from(ServiceRelation.latency).filter(detectPoint == DetectPoint.CLIENT).longAvg(); // Average response time detected on the client side
service_relation_server_resp_time = from(ServiceRelation.latency).filter(detectPoint == DetectPoint.SERVER).longAvg(); // Average response time detected on the server side
service_relation_client_percentile = from(ServiceRelation.latency).filter(detectPoint == DetectPoint.CLIENT).percentile(10); // Multiple values including p50, p75, p90, p95, p99
service_relation_server_percentile = from(ServiceRelation.latency).filter(detectPoint == DetectPoint.SERVER).percentile(10); // Multiple values including p50, p75, p90, p95, p99

// Service Instance relation scope metrics for topology (calls between service instances)
service_instance_relation_client_cpm = from(ServiceInstanceRelation.*).filter(detectPoint == DetectPoint.CLIENT).cpm(); // Calls per minute detected on the client instance
service_instance_relation_server_cpm = from(ServiceInstanceRelation.*).filter(detectPoint == DetectPoint.SERVER).cpm(); // Calls per minute detected on the server instance
service_instance_relation_client_call_sla = from(ServiceInstanceRelation.*).filter(detectPoint == DetectPoint.CLIENT).percent(status == true); // Success rate detected on the client instance
service_instance_relation_server_call_sla = from(ServiceInstanceRelation.*).filter(detectPoint == DetectPoint.SERVER).percent(status == true); // Success rate detected on the server instance
service_instance_relation_client_resp_time = from(ServiceInstanceRelation.latency).filter(detectPoint == DetectPoint.CLIENT).longAvg(); // Average response time detected on the client instance
service_instance_relation_server_resp_time = from(ServiceInstanceRelation.latency).filter(detectPoint == DetectPoint.SERVER).longAvg(); // Average response time detected on the server instance
service_instance_relation_client_percentile = from(ServiceInstanceRelation.latency).filter(detectPoint == DetectPoint.CLIENT).percentile(10); // Multiple values including p50, p75, p90, p95, p99
service_instance_relation_server_percentile = from(ServiceInstanceRelation.latency).filter(detectPoint == DetectPoint.SERVER).percentile(10); // Multiple values including p50, p75, p90, p95, p99

// Service Instance Scope metrics
service_instance_sla = from(ServiceInstance.*).percent(status == true); // Success rate of the service instance
service_instance_resp_time= from(ServiceInstance.latency).longAvg(); // Average response time of the service instance
service_instance_cpm = from(ServiceInstance.*).cpm(); // Calls per minute of the service instance

// Endpoint scope metrics
endpoint_cpm = from(Endpoint.*).cpm(); // Calls per minute of the endpoint
endpoint_avg = from(Endpoint.latency).longAvg(); // Average response time of the endpoint
endpoint_sla = from(Endpoint.*).percent(status == true); // Success rate of the endpoint
endpoint_percentile = from(Endpoint.latency).percentile(10); // Multiple values including p50, p75, p90, p95, p99

// Endpoint relation scope metrics
endpoint_relation_cpm = from(EndpointRelation.*).filter(detectPoint == DetectPoint.SERVER).cpm(); // Calls per minute detected on the server-side endpoint
endpoint_relation_resp_time = from(EndpointRelation.rpcLatency).filter(detectPoint == DetectPoint.SERVER).longAvg(); // Average RPC latency detected on the server side
endpoint_relation_sla = from(EndpointRelation.*).filter(detectPoint == DetectPoint.SERVER).percent(status == true); // Request success rate detected on the server side
endpoint_relation_percentile = from(EndpointRelation.rpcLatency).filter(detectPoint == DetectPoint.SERVER).percentile(10); // Multiple values including p50, p75, p90, p95, p99

database_access_resp_time = from(DatabaseAccess.latency).longAvg(); // Average response time of database access
database_access_sla = from(DatabaseAccess.*).percent(status == true); // Success rate of database requests
database_access_cpm = from(DatabaseAccess.*).cpm(); // Database calls per minute
database_access_percentile = from(DatabaseAccess.latency).percentile(10);

java-agent.oal

// JVM instance metrics
instance_jvm_cpu = from(ServiceInstanceJVMCPU.usePercent).doubleAvg(); // Average JVM CPU usage percentage
instance_jvm_memory_heap = from(ServiceInstanceJVMMemory.used).filter(heapStatus == true).longAvg(); // Average used JVM heap memory
instance_jvm_memory_noheap = from(ServiceInstanceJVMMemory.used).filter(heapStatus == false).longAvg(); // Average used JVM non-heap memory
instance_jvm_memory_heap_max = from(ServiceInstanceJVMMemory.max).filter(heapStatus == true).longAvg(); // Average of the JVM maximum heap memory
instance_jvm_memory_noheap_max = from(ServiceInstanceJVMMemory.max).filter(heapStatus == false).longAvg(); // Average of the JVM maximum non-heap memory
instance_jvm_young_gc_time = from(ServiceInstanceJVMGC.time).filter(phrase == GCPhrase.NEW).sum(); // Time spent in young-generation GC
instance_jvm_old_gc_time = from(ServiceInstanceJVMGC.time).filter(phrase == GCPhrase.OLD).sum(); // Time spent in old-generation GC
instance_jvm_young_gc_count = from(ServiceInstanceJVMGC.count).filter(phrase == GCPhrase.NEW).sum(); // Number of young-generation GCs
instance_jvm_old_gc_count = from(ServiceInstanceJVMGC.count).filter(phrase == GCPhrase.OLD).sum(); // Number of old-generation GCs
instance_jvm_thread_live_count = from(ServiceInstanceJVMThread.liveCount).longAvg(); // Number of live threads
instance_jvm_thread_daemon_count = from(ServiceInstanceJVMThread.daemonCount).longAvg(); // Number of daemon threads
instance_jvm_thread_peak_count = from(ServiceInstanceJVMThread.peakCount).longAvg(); // Peak thread count

  

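Alarm configuration

rules:
  # Alarm rule name: must be unique and must end with _rule
  service_resp_time_rule:
    # Metrics name as defined in the OAL scripts; only int, long and double metrics are supported
    metrics-name: service_resp_time
    # Comparison operator
    op: ">"
    # Threshold (ms for this metric)
    threshold: 1000
    # Length of the evaluation window, in minutes
    period: 10
    # How many times the metric must match the condition within the window before the alarm fires
    count: 2
    # Silence period. By default it equals period, so the alarm fires at most once within one period.
    silence-period: 10
    message: The average response time of service [{name}] exceeded 1 second in 2 of the last 10 minutes
  service_sla_rule:
    metrics-name: service_sla
    op: "<"
    # service_sla is stored as the percentage multiplied by 100, so 8000 means 80%
    threshold: 8000
    period: 10
    count: 2
    silence-period: 10
    message: The success rate of service [{name}] was below 80% in 2 of the last 10 minutes
composite-rules:
  # Rule name: the unique name shown in the alarm message; must end with _rule
  comp_rule:
    # How the rules are combined; the &&, || and () operators are supported
    expression: service_resp_time_rule && service_sla_rule
    message: In 2 of the last 10 minutes, service [{name}] had an average response time above 1 second and a success rate below 80%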
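
How period, count, op and threshold interact can be sketched in a few lines of Python. This is only an illustration of the rule semantics described in the comments above, not the OAP implementation; the function name `should_alarm` and the sample data are hypothetical, and silence-period is ignored.

```python
import operator

# Comparison operators that may appear in the `op` field of a rule.
OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
       "<=": operator.le, "==": operator.eq}

def should_alarm(window, op, threshold, count):
    """Return True when at least `count` per-minute values in `window` match."""
    matches = sum(1 for value in window if OPS[op](value, threshold))
    return matches >= count

# Ten per-minute values of service_resp_time in ms (hypothetical data, period: 10).
resp_times = [300, 450, 1200, 500, 380, 1500, 420, 390, 410, 400]
print(should_alarm(resp_times, ">", 1000, 2))
```

Two of the ten minutes exceed 1000 ms, so a rule with count: 2 would match, while count: 3 would not.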

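This article describes how to use SkyWalking's OAL syntax.

Official documentation

OAL introduction

https://github.com/apache/skywalking/blob/master/docs/en/guides/backend-oal-scripts.md

OAL rule syntax: https://github.com/apache/skywalking/blob/master/docs/en/concepts-and-designs/oal.md

Scopes and fields: https://github.com/apache/skywalking/blob/master/docs/en/concepts-and-designs/scope-definitions.md

Introduction to OAL
Since 8.0.0, SkyWalking ships its OAL scripts under config/oal/*.oal. You can modify them, for example to add filter conditions or new metrics; the changes take effect after the OAP server is restarted.

Apache SkyWalking alarms are driven by a set of rules defined in config/alarm-settings.yml. The rules.xxx_rule.metrics-name value in alarm-settings.yml refers to a metric defined in one of the scripts under config/oal: core.oal, event.oal, java-agent.oal, or browser.oal.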
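
Because a rule's metrics-name has to match a name defined in an OAL script, a quick consistency check can catch typos before the OAP server is restarted. The following is a minimal sketch under simple assumptions: it treats every `name = from(` line as a metric definition, and the helper names `defined_metrics` and `unknown_metrics` are made up for this example.

```python
import re

# A few statements copied from core.oal, embedded to keep the sketch self-contained.
OAL_TEXT = """
// Service scope metrics
service_resp_time = from(Service.latency).longAvg();
service_sla = from(Service.*).percent(status == true);
service_instance_resp_time= from(ServiceInstance.latency).longAvg();
"""

def defined_metrics(oal_text):
    """Collect the names on the left-hand side of `name = from(...)` statements."""
    return set(re.findall(r"^\s*(\w+)\s*=\s*from\(", oal_text, flags=re.MULTILINE))

def unknown_metrics(rule_metrics, oal_text):
    """Return the metrics-name values that no OAL statement defines."""
    return sorted(set(rule_metrics) - defined_metrics(oal_text))

print(unknown_metrics(["service_sla", "service_cpm"], OAL_TEXT))
```

In real use the text would be read from the files under config/oal instead of an embedded string.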

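Endpoint rules consume more memory and resources than service and instance rules.

OAL stands for Observability Analysis Language.

In streaming mode, SkyWalking uses OAL to analyze incoming data. OAL focuses on metrics of services, service instances and endpoints, which makes it easy to learn and use.

Since 6.3, the OAL engine is embedded in the OAP server runtime as oal-rt (OAL Runtime). OAL scripts now live in the /config folder, so users can simply change them and restart the server for the changes to take effect.

However, OAL is still a compiled language: the OAL runtime generates Java code dynamically. Set SW_OAL_ENGINE_DEBUG=Y in the system environment to see which classes are generated.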

Configuration examples
// Calculate the p99 of Endpoint1 and Endpoint2.
endpoint_p99 = from(Endpoint.latency).filter(name in ("Endpoint1", "Endpoint2")).summary(0.99)

// Calculate the p99 of endpoints whose names start with "serv".
serv_Endpoint_p99 = from(Endpoint.latency).filter(name like "serv%").summary(0.99)

// Calculate the average response time of each endpoint.
endpoint_avg = from(Endpoint.latency).avg()

// Calculate p50, p75, p90, p95 and p99 of each endpoint; the argument 10 is the precision.
endpoint_percentile = from(Endpoint.latency).percentile(10)

// Calculate the percentage of requests whose response status is true, for each endpoint.
endpoint_success = from(Endpoint.*).percent(status == true)

// Count the requests whose response code is 404, 500 or 503, for each endpoint.
endpoint_abnormal = from(Endpoint.*).filter(responseCode in [404, 500, 503]).count()

// Sum the requests whose request type is RPC or gRPC, for each endpoint.
endpoint_rpc_calls_sum = from(Endpoint.*).filter(type in [RequestType.RPC, RequestType.gRPC]).sum()

// Sum the requests whose endpoint name is "/v1" or "/v2", for each endpoint.
endpoint_url_sum = from(Endpoint.*).filter(endpointName in ["/v1", "/v2"]).sum()

// Count the total calls of each endpoint.
endpoint_calls = from(Endpoint.*).count()

// Calculate the CPM of the GET method for each service. The value has the form `tagKey:tagValue`.
// Option 1: use `tags contain`.
service_cpm_http_get = from(Service.*).filter(tags contain "http.method:GET").cpm()
// Option 2: use `tag[key]`.
service_cpm_http_get = from(Service.*).filter(tag["http.method"] == "GET").cpm();

// Calculate the CPM of all methods other than GET for each service. The value has the form `tagKey:tagValue`.
service_cpm_http_other = from(Service.*).filter(tags not contain "http.method:GET").cpm()

// Calculate the error rate of browser apps. The numerator is FIRST_ERROR and the denominator is NORMAL.
browser_app_error_rate = from(BrowserAppTraffic.*).rate(trafficCategory == BrowserAppTrafficCategory.FIRST_ERROR, trafficCategory == BrowserAppTrafficCategory.NORMAL);

disable(segment);
disable(endpoint_relation_server_side);
disable(top_n_database_statement);
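
The two tag filters in the examples above (`tags contain "key:value"` and `tag["key"] == "value"`) select the same requests. A minimal Python sketch of that equivalence, assuming tags are plain `tagKey:tagValue` strings as the comments state; the helper names are hypothetical and this is not OAP code.

```python
def tags_contain(tags, expected):
    """Mimic `tags contain "key:value"`: exact match on a whole tag string."""
    return expected in tags

def tag_value(tags, key):
    """Mimic `tag["key"]`: return the value of the first tag with that key."""
    for tag in tags:
        k, _, v = tag.partition(":")  # split on the first colon only
        if k == key:
            return v
    return None

tags = ["http.method:GET", "url:http://host/a"]
print(tags_contain(tags, "http.method:GET"))   # option 1
print(tag_value(tags, "http.method") == "GET") # option 2
```

Splitting on the first colon keeps values that contain colons themselves, such as URLs, intact.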

Default configuration
config/oal/core.oal

/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*
*/

// For services using protocols HTTP 1/2, gRPC, RPC, etc., the cpm metrics means "calls per minute",
// for services that are built on top of TCP, the cpm means "packages per minute".

// All scope metrics
all_percentile = from(All.latency).percentile(10); // Multiple values including p50, p75, p90, p95, p99
all_heatmap = from(All.latency).histogram(100, 20);

// Service scope metrics
service_resp_time = from(Service.latency).longAvg();
service_sla = from(Service.*).percent(status == true);
service_cpm = from(Service.*).cpm();
service_percentile = from(Service.latency).percentile(10); // Multiple values including p50, p75, p90, p95, p99
service_apdex = from(Service.latency).apdex(name, status);

// Service relation scope metrics for topology
service_relation_client_cpm = from(ServiceRelation.*).filter(detectPoint == DetectPoint.CLIENT).cpm();
service_relation_server_cpm = from(ServiceRelation.*).filter(detectPoint == DetectPoint.SERVER).cpm();
service_relation_client_call_sla = from(ServiceRelation.*).filter(detectPoint == DetectPoint.CLIENT).percent(status == true);
service_relation_server_call_sla = from(ServiceRelation.*).filter(detectPoint == DetectPoint.SERVER).percent(status == true);
service_relation_client_resp_time = from(ServiceRelation.latency).filter(detectPoint == DetectPoint.CLIENT).longAvg();
service_relation_server_resp_time = from(ServiceRelation.latency).filter(detectPoint == DetectPoint.SERVER).longAvg();
service_relation_client_percentile = from(ServiceRelation.latency).filter(detectPoint == DetectPoint.CLIENT).percentile(10); // Multiple values including p50, p75, p90, p95, p99
service_relation_server_percentile = from(ServiceRelation.latency).filter(detectPoint == DetectPoint.SERVER).percentile(10); // Multiple values including p50, p75, p90, p95, p99

// Service Instance relation scope metrics for topology
service_instance_relation_client_cpm = from(ServiceInstanceRelation.*).filter(detectPoint == DetectPoint.CLIENT).cpm();
service_instance_relation_server_cpm = from(ServiceInstanceRelation.*).filter(detectPoint == DetectPoint.SERVER).cpm();
service_instance_relation_client_call_sla = from(ServiceInstanceRelation.*).filter(detectPoint == DetectPoint.CLIENT).percent(status == true);
service_instance_relation_server_call_sla = from(ServiceInstanceRelation.*).filter(detectPoint == DetectPoint.SERVER).percent(status == true);
service_instance_relation_client_resp_time = from(ServiceInstanceRelation.latency).filter(detectPoint == DetectPoint.CLIENT).longAvg();
service_instance_relation_server_resp_time = from(ServiceInstanceRelation.latency).filter(detectPoint == DetectPoint.SERVER).longAvg();
service_instance_relation_client_percentile = from(ServiceInstanceRelation.latency).filter(detectPoint == DetectPoint.CLIENT).percentile(10); // Multiple values including p50, p75, p90, p95, p99
service_instance_relation_server_percentile = from(ServiceInstanceRelation.latency).filter(detectPoint == DetectPoint.SERVER).percentile(10); // Multiple values including p50, p75, p90, p95, p99

// Service Instance Scope metrics
service_instance_sla = from(ServiceInstance.*).percent(status == true);
service_instance_resp_time= from(ServiceInstance.latency).longAvg();
service_instance_cpm = from(ServiceInstance.*).cpm();

// Endpoint scope metrics
endpoint_cpm = from(Endpoint.*).cpm();
endpoint_avg = from(Endpoint.latency).longAvg();
endpoint_sla = from(Endpoint.*).percent(status == true);
endpoint_percentile = from(Endpoint.latency).percentile(10); // Multiple values including p50, p75, p90, p95, p99

// Endpoint relation scope metrics
endpoint_relation_cpm = from(EndpointRelation.*).filter(detectPoint == DetectPoint.SERVER).cpm();
endpoint_relation_resp_time = from(EndpointRelation.rpcLatency).filter(detectPoint == DetectPoint.SERVER).longAvg();
endpoint_relation_sla = from(EndpointRelation.*).filter(detectPoint == DetectPoint.SERVER).percent(status == true);
endpoint_relation_percentile = from(EndpointRelation.rpcLatency).filter(detectPoint == DetectPoint.SERVER).percentile(10); // Multiple values including p50, p75, p90, p95, p99

database_access_resp_time = from(DatabaseAccess.latency).longAvg();
database_access_sla = from(DatabaseAccess.*).percent(status == true);
database_access_cpm = from(DatabaseAccess.*).cpm();
database_access_percentile = from(DatabaseAccess.latency).percentile(10);

config/oal/event.oal

event_total = from(Event.*).count();

event_normal_count = from(Event.*).filter(type == "Normal").count();
event_error_count = from(Event.*).filter(type == "Error").count();

event_start_count = from(Event.*).filter(name == "Start").count();
event_shutdown_count = from(Event.*).filter(name == "Shutdown").count();

config/oal/java-agent.oal

// JVM instance metrics
instance_jvm_cpu = from(ServiceInstanceJVMCPU.usePercent).doubleAvg();
instance_jvm_memory_heap = from(ServiceInstanceJVMMemory.used).filter(heapStatus == true).longAvg();
instance_jvm_memory_noheap = from(ServiceInstanceJVMMemory.used).filter(heapStatus == false).longAvg();
instance_jvm_memory_heap_max = from(ServiceInstanceJVMMemory.max).filter(heapStatus == true).longAvg();
instance_jvm_memory_noheap_max = from(ServiceInstanceJVMMemory.max).filter(heapStatus == false).longAvg();
instance_jvm_young_gc_time = from(ServiceInstanceJVMGC.time).filter(phrase == GCPhrase.NEW).sum();
instance_jvm_old_gc_time = from(ServiceInstanceJVMGC.time).filter(phrase == GCPhrase.OLD).sum();
instance_jvm_young_gc_count = from(ServiceInstanceJVMGC.count).filter(phrase == GCPhrase.NEW).sum();
instance_jvm_old_gc_count = from(ServiceInstanceJVMGC.count).filter(phrase == GCPhrase.OLD).sum();
instance_jvm_thread_live_count = from(ServiceInstanceJVMThread.liveCount).longAvg();
instance_jvm_thread_daemon_count = from(ServiceInstanceJVMThread.daemonCount).longAvg();
instance_jvm_thread_peak_count = from(ServiceInstanceJVMThread.peakCount).longAvg();
instance_jvm_thread_runnable_state_thread_count = from(ServiceInstanceJVMThread.runnableStateThreadCount).longAvg();
instance_jvm_thread_blocked_state_thread_count = from(ServiceInstanceJVMThread.blockedStateThreadCount).longAvg();
instance_jvm_thread_waiting_state_thread_count = from(ServiceInstanceJVMThread.waitingStateThreadCount).longAvg();
instance_jvm_thread_timed_waiting_state_thread_count = from(ServiceInstanceJVMThread.timedWaitingStateThreadCount).longAvg();
instance_jvm_class_loaded_class_count = from(ServiceInstanceJVMClass.loadedClassCount).longAvg();
instance_jvm_class_total_unloaded_class_count = from(ServiceInstanceJVMClass.totalUnloadedClassCount).longAvg();
instance_jvm_class_total_loaded_class_count = from(ServiceInstanceJVMClass.totalLoadedClassCount).longAvg();

config/oal/browser.oal

// browser app
browser_app_pv = from(BrowserAppTraffic.count).filter(trafficCategory == BrowserAppTrafficCategory.NORMAL).sum();
browser_app_error_rate = from(BrowserAppTraffic.*).rate(trafficCategory == BrowserAppTrafficCategory.FIRST_ERROR,trafficCategory == BrowserAppTrafficCategory.NORMAL);
browser_app_error_sum = from(BrowserAppTraffic.count).filter(trafficCategory != BrowserAppTrafficCategory.NORMAL).sum();

// browser app single version
browser_app_single_version_pv = from(BrowserAppSingleVersionTraffic.count).filter(trafficCategory == BrowserAppTrafficCategory.NORMAL).sum();
browser_app_single_version_error_rate = from(BrowserAppSingleVersionTraffic.trafficCategory).rate(trafficCategory == BrowserAppTrafficCategory.FIRST_ERROR,trafficCategory == BrowserAppTrafficCategory.NORMAL);
browser_app_single_version_error_sum = from(BrowserAppSingleVersionTraffic.count).filter(trafficCategory != BrowserAppTrafficCategory.NORMAL).sum();

// browser app page
browser_app_page_pv = from(BrowserAppPageTraffic.count).filter(trafficCategory == BrowserAppTrafficCategory.NORMAL).sum();
browser_app_page_error_rate = from(BrowserAppPageTraffic.*).rate(trafficCategory == BrowserAppTrafficCategory.FIRST_ERROR,trafficCategory == BrowserAppTrafficCategory.NORMAL);
browser_app_page_error_sum = from(BrowserAppPageTraffic.count).filter(trafficCategory != BrowserAppTrafficCategory.NORMAL).sum();

browser_app_page_ajax_error_sum = from(BrowserAppPageTraffic.count).filter(trafficCategory != BrowserAppTrafficCategory.NORMAL).filter(errorCategory == BrowserErrorCategory.AJAX).sum();
browser_app_page_resource_error_sum = from(BrowserAppPageTraffic.count).filter(trafficCategory != BrowserAppTrafficCategory.NORMAL).filter(errorCategory == BrowserErrorCategory.RESOURCE).sum();
browser_app_page_js_error_sum = from(BrowserAppPageTraffic.count).filter(trafficCategory != BrowserAppTrafficCategory.NORMAL).filter(errorCategory in [BrowserErrorCategory.JS,BrowserErrorCategory.VUE,BrowserErrorCategory.PROMISE]).sum();
browser_app_page_unknown_error_sum = from(BrowserAppPageTraffic.count).filter(trafficCategory != BrowserAppTrafficCategory.NORMAL).filter(errorCategory == BrowserErrorCategory.UNKNOWN).sum();

// browser performance metrics
browser_app_page_redirect_avg = from(BrowserAppPagePerf.redirectTime).longAvg();
browser_app_page_dns_avg = from(BrowserAppPagePerf.dnsTime).longAvg();
browser_app_page_ttfb_avg = from(BrowserAppPagePerf.ttfbTime).longAvg();
browser_app_page_tcp_avg = from(BrowserAppPagePerf.tcpTime).longAvg();
browser_app_page_trans_avg = from(BrowserAppPagePerf.transTime).longAvg();
browser_app_page_dom_analysis_avg = from(BrowserAppPagePerf.domAnalysisTime).longAvg();
browser_app_page_fpt_avg = from(BrowserAppPagePerf.fptTime).longAvg();
browser_app_page_dom_ready_avg = from(BrowserAppPagePerf.domReadyTime).longAvg();
browser_app_page_load_page_avg = from(BrowserAppPagePerf.loadPageTime).longAvg();
browser_app_page_res_avg = from(BrowserAppPagePerf.resTime).longAvg();
browser_app_page_ssl_avg = from(BrowserAppPagePerf.sslTime).longAvg();
browser_app_page_ttl_avg = from(BrowserAppPagePerf.ttlTime).longAvg();
browser_app_page_first_pack_avg = from(BrowserAppPagePerf.firstPackTime).longAvg();
browser_app_page_fmp_avg = from(BrowserAppPagePerf.fmpTime).longAvg();

browser_app_page_fpt_percentile = from(BrowserAppPagePerf.fptTime).percentile(10);
browser_app_page_ttl_percentile = from(BrowserAppPagePerf.ttlTime).percentile(10);
browser_app_page_dom_ready_percentile = from(BrowserAppPagePerf.domReadyTime).percentile(10);
browser_app_page_load_page_percentile = from(BrowserAppPagePerf.loadPageTime).percentile(10);
browser_app_page_first_pack_percentile = from(BrowserAppPagePerf.firstPackTime).percentile(10);
browser_app_page_fmp_percentile = from(BrowserAppPagePerf.fmpTime).percentile(10);

// Disable unnecessary hard core stream, targeting @Stream#name
//
//disable(browser_error_log);

OAL syntax
OAL script files should use the .oal suffix.

// Declare the metrics.
METRICS_NAME = from(SCOPE.(* | [FIELD][,FIELD ...]))
[.filter(FIELD OP [INT | STRING])]
.FUNCTION([PARAM][, PARAM ...])

// Disable hard code
disable(METRICS_NAME);
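
To make the grammar concrete, here is a rough Python sketch that splits a simple OAL statement into the parts named above. It is a simplified regular expression written for this article, not the real OAL parser: it assumes no nested parentheses inside from(...) or filter(...), and the function name `parse` is hypothetical.

```python
import re

STATEMENT = re.compile(
    r"^\s*(?P<name>\w+)\s*=\s*from\((?P<source>[^)]*)\)"  # METRICS_NAME = from(SCOPE.FIELD)
    r"(?P<filters>(?:\.filter\(.*?\))*)"                   # zero or more .filter(...)
    r"\.(?P<function>\w+)\((?P<params>.*)\)\s*;?\s*$"      # .FUNCTION(PARAMS)
)

def parse(statement):
    """Split one simple OAL statement into name, scope, field, filters, function, params."""
    m = STATEMENT.match(statement)
    if m is None:
        raise ValueError("not a simple OAL metrics statement: " + statement)
    scope, _, field = m.group("source").partition(".")
    return {
        "name": m.group("name"),
        "scope": scope,
        "field": field,
        "filters": re.findall(r"\.filter\((.*?)\)", m.group("filters")),
        "function": m.group("function"),
        "params": m.group("params"),
    }

print(parse("all_heatmap = from(All.latency).histogram(100, 20);"))
```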
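Scope
The scopes are All, Service, Service Instance, Endpoint, Service Relation, Service Instance Relation, and Endpoint Relation.

There are also a number of fields, each of which belongs to one of the scopes above.

Filter
A filter builds a condition on field values from a field name and an expression.

Expressions can be combined with and, or and ().

The operators are ==, !=, >, <, >=, <=, in [...], like %..., like ...%, and like %...%. They are type-checked against the field type, and an incompatible type raises an error during compilation/code generation.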
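
The three like forms correspond to suffix, prefix and substring matching. A minimal Python sketch of that reading, assuming % is only meaningful at the start or end of the pattern; the function name `oal_like` is hypothetical.

```python
def oal_like(value, pattern):
    """Evaluate `value like pattern` for the forms %x, x% and %x%."""
    starts = pattern.startswith("%")
    ends = pattern.endswith("%")
    core = pattern.strip("%")
    if starts and ends:
        return core in value          # like %...%  -> contains
    if starts:
        return value.endswith(core)   # like %...   -> suffix match
    if ends:
        return value.startswith(core) # like ...%   -> prefix match
    return value == core              # no wildcard -> exact match

print(oal_like("service-a", "serv%"))
```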

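Aggregation Function
The default aggregation functions are implemented by the SkyWalking OAP core, and more functions can be added freely.

Provided functions:

longAvg: the average of all inputs per scope entity. The input field must be of type long.

instance_jvm_memory_max = from(ServiceInstanceJVMMemory.max).longAvg();
In the example above, the input is every request of the ServiceInstanceJVMMemory scope, and the average is computed over the max field.

doubleAvg: the average of all inputs per scope entity. The input field must be of type double.

instance_jvm_cpu = from(ServiceInstanceJVMCPU.usePercent).doubleAvg();
In the example above, the input is every request of the ServiceInstanceJVMCPU scope, and the average is computed over the usePercent field.

percent: the percentage of inputs that match the given condition.

endpoint_percent = from(Endpoint.*).percent(status == true);
In the example above, the input is every request of each endpoint, and the condition is endpoint.status == true.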
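
The alarm rule earlier in this article compares service_sla against 8000 to mean 80%, which suggests the percent value is the percentage multiplied by 100. A small Python sketch under that assumption (the integer scaling and the function name `percent_metric` are assumptions made for illustration, not taken from the OAP source):

```python
def percent_metric(statuses):
    """Share of inputs that match, scaled so that 10000 means 100%."""
    total = len(statuses)
    if total == 0:
        return 0
    matches = sum(1 for ok in statuses if ok)
    return matches * 10000 // total

statuses = [True] * 8 + [False] * 2   # 8 successful requests out of 10
value = percent_metric(statuses)
print(value, value / 100)             # scaled value and plain percentage
```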

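rate: the ratio between inputs that match two conditions, expressed as a fraction of 100.

browser_app_error_rate = from(BrowserAppTraffic.*).rate(trafficCategory == BrowserAppTrafficCategory.FIRST_ERROR, trafficCategory == BrowserAppTrafficCategory.NORMAL);
In the example above, the input is every request of each browser app's traffic. The numerator condition is trafficCategory == BrowserAppTrafficCategory.FIRST_ERROR and the denominator condition is trafficCategory == BrowserAppTrafficCategory.NORMAL.

The first parameter is the numerator condition and the second parameter is the denominator condition.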
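
Unlike percent, rate counts two conditions independently and divides one count by the other. A minimal Python sketch of that idea; the category strings, the function name `rate` and the plain 0-100 scale are assumptions for illustration, and the value the OAP actually stores may be scaled differently.

```python
def rate(inputs, numerator_cond, denominator_cond):
    """Count matches of each condition and return numerator/denominator as a percentage."""
    numerator = sum(1 for item in inputs if numerator_cond(item))
    denominator = sum(1 for item in inputs if denominator_cond(item))
    if denominator == 0:
        return 0.0
    return 100 * numerator / denominator

# Hypothetical traffic categories of six browser requests.
traffic = ["NORMAL", "NORMAL", "FIRST_ERROR", "ERROR", "NORMAL", "NORMAL"]
print(rate(traffic, lambda c: c == "FIRST_ERROR", lambda c: c == "NORMAL"))
```

One FIRST_ERROR against four NORMAL requests gives 25.0; the ERROR entry matches neither condition and is ignored.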

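sum: the total number of calls per scope entity.

service_calls_sum = from(Service.*).sum();
In the example above, the number of calls of each service is counted.

histogram: heatmap. See Heatmap in WIKI for more details.

all_heatmap = from(All.latency).histogram(100, 20);
In the example above, a heatmap of all incoming requests is calculated.

The first parameter is the precision of the latency calculation. In the example above, 113 ms and 193 ms are treated as the same because both fall into the 101-200 ms group.

The second parameter is the number of groups. In the example above there are 21 groups in total: 0-100 ms, 101-200 ms, ..., 1901-2000 ms, and above 2000 ms.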
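
The two histogram parameters can be illustrated with a short Python sketch. It assumes plain integer division for bucketing, so the handling of exact boundaries such as 100 ms may differ from the groups listed above and from the OAP itself; the function names are hypothetical.

```python
from collections import Counter

def bucket_index(latency_ms, step=100, max_steps=20):
    """Map a latency to a group index; everything past the last step shares one group."""
    return min(latency_ms // step, max_steps)

def heatmap(latencies, step=100, max_steps=20):
    """Count how many latencies fall into each group."""
    return Counter(bucket_index(latency, step, max_steps) for latency in latencies)

print(bucket_index(113) == bucket_index(193))  # same group
print(heatmap([50, 113, 193, 2500]))
```

With step 100 and 20 steps, indexes 0 to 19 cover the regular groups and index 20 collects everything beyond them, which gives the 21 groups mentioned above.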

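apdex: Application Performance Index.

service_apdex = from(Service.latency).apdex(name, status);
In the example above, the Apdex score of every service is calculated.

The first parameter is the service name; the Apdex threshold for that name is defined in the configuration file service-apdex-threshold.yml.

The second parameter is the request status; the status (success or failure) affects the Apdex calculation.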
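
For readers unfamiliar with Apdex, the standard formula is (satisfied + tolerating / 2) / total, where a request is satisfied if its latency is at most the threshold T and tolerating if it is at most 4T. The Python sketch below applies that formula and treats failed requests as frustrated, which is one reading of "the status affects the calculation"; it returns a value between 0 and 1, which may not be the scale the OAP stores, and the function name is hypothetical.

```python
def apdex(requests, threshold_ms=500):
    """requests: list of (latency_ms, success) pairs; returns a score in [0, 1]."""
    if not requests:
        return 0.0
    satisfied = tolerating = 0
    for latency, success in requests:
        if not success or latency > 4 * threshold_ms:
            continue                  # frustrated: counts only toward the total
        if latency > threshold_ms:
            tolerating += 1           # between T and 4T
        else:
            satisfied += 1            # at most T
    return (satisfied + tolerating / 2) / len(requests)

sample = [(100, True), (600, True), (3000, True), (100, False)]
print(apdex(sample))
```

With the default 500 ms threshold the sample has one satisfied, one tolerating and two frustrated requests, so the score is (1 + 0.5) / 4.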

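P99, P95, P90, P75, P50: percentiles. See Percentile in WIKI for more details.

Percentile is the first multi-value metric, introduced in 7.0. Because it has multiple values, it can be queried through the getMultipleLinearIntValues GraphQL query.

all_percentile = from(All.latency).percentile(10);
In the example above, P99, P95, P90, P75 and P50 of all incoming requests are calculated. The parameter is the precision of the percentile calculation; in this example, 120 ms and 124 ms are treated as the same.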
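
The effect of the precision parameter is easiest to see in code. The Python sketch below buckets latencies by the precision and then walks the buckets in order until the requested rank is reached; it is an approximation written for this article, and the rounding rule and the function name `percentiles` are assumptions rather than the OAP's exact algorithm.

```python
from collections import Counter

def percentiles(latencies, precision=10, ranks=(50, 75, 90, 95, 99)):
    """Return {rank: latency} where latencies are rounded down to `precision`."""
    buckets = Counter(latency // precision for latency in latencies)
    total = len(latencies)
    result = {}
    for rank in ranks:
        roof = round(total * rank / 100)   # how many samples must be covered
        seen = 0
        for key in sorted(buckets):
            seen += buckets[key]
            if seen >= roof:
                result[rank] = key * precision
                break
    return result

print(percentiles([120, 124]))             # both samples share the 120 bucket
print(percentiles(list(range(1, 101))))
```

A coarser precision means fewer buckets to keep per entity, at the cost of less exact percentile values.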

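Metrics Name
The metrics name is used by the storage implementation, the alarm module and the query module. The SkyWalking core infers its type automatically.

Group
All metrics data is grouped by Scope.ID and the min-level time bucket.

In the Endpoint scope, Scope.ID is the endpoint ID (a unique identifier based on the service and its endpoint).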
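
Grouping by entity ID and time bucket means each metric value belongs to one entity in one minute. A minimal Python sketch of a longAvg computed that way; the record layout, the entity IDs and the yyyyMMddHHmm bucket numbers are hypothetical sample data.

```python
from collections import defaultdict

def long_avg_by_group(records):
    """records: iterable of (entity_id, time_bucket, value); returns {(id, bucket): avg}."""
    sums = defaultdict(lambda: [0, 0])       # (entity, bucket) -> [total, count]
    for entity_id, bucket, value in records:
        acc = sums[(entity_id, bucket)]
        acc[0] += value
        acc[1] += 1
    return {key: total // count for key, (total, count) in sums.items()}

records = [
    ("ep-1", 202203202149, 100),
    ("ep-1", 202203202149, 300),
    ("ep-1", 202203202150, 50),
    ("ep-2", 202203202149, 70),
]
print(long_avg_by_group(records))
```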

Cast
Fields of a source are statically typed. In some cases the type required by a filter or aggregation statement does not match the type of the source field. For example, the value of a source tag is a String, while most aggregations need a numeric type. The cast expression exists to solve this.

Usage

(str->long) or (long), cast string type into long.
(str->int) or (int), cast string type into int.
Example:

mq_consume_latency = from((str->long)Service.tag["transmission.latency"]).longAvg(); // the value of tag is string type.
Cast expressions are supported in the following positions:

From statement. from((cast)source.attr).
Filter expression. .filter((cast)tag["transmission.latency"] > 0)
Aggregation function parameter. .longAvg((cast)strField1== 1, (cast)strField2)
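
In Python terms, a cast in the from statement or in a filter is simply a string-to-integer conversion applied before the comparison or aggregation. The sketch below mirrors the mq_consume_latency example with an added `> 0` filter; the function name and the sample tag values are hypothetical.

```python
def long_avg_of_tag(tag_values, min_exclusive=None):
    """Cast string tag values to int, optionally keep only values > min_exclusive, then average."""
    values = [int(raw) for raw in tag_values]            # (str->long)
    if min_exclusive is not None:
        values = [v for v in values if v > min_exclusive]  # .filter((cast)tag[...] > 0)
    return sum(values) // len(values) if values else 0

# Hypothetical values of the "transmission.latency" tag, which arrive as strings.
print(long_avg_of_tag(["12", "0", "30", "18"], min_exclusive=0))
```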
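Disable
Disable is an advanced OAL statement that is only used in specific cases.

Some aggregations and metrics are hard-coded in the core. The disable statement is designed to deactivate them,
for example segment and top_n_database_statement.

By default, none of them are disabled.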
————————————————
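Copyright notice: this is an original article by CSDN blogger "IT利刃出鞘", licensed under CC 4.0 BY-SA. When reposting, please include the link to the original article and this notice.
Original link: https://blog.csdn.net/feiying0canglang/article/details/121562890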

posted on 2022-03-20 21:49 by luzhouxiaoshuai
