Grafana Charting 06: Monitoring Metrics Explained

1. The four Prometheus metric types

Counters
Gauges
Histograms
Summaries

Further reading on the metric types:
https://dbaplus.cn/news-134-5049-1.html

2. The metric types in detail

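2.1 Counter

The raw value of a counter is usually not very useful on its own. A counter is most often used to compute the delta between two timestamps, or the rate of change over time.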

Take the CPU metric node_cpu_seconds_total as an example; it is a counter.

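node_cpu_seconds_total records the cumulative number of seconds that one CPU core has spent in one mode (idle, user, system, and so on). We usually look at the idle mode. On Linux, node_exporter reads these values from /proc/stat. For comparison, the first column of /proc/uptime is the system uptime, and the second column is the idle time summed across all cores, which roughly matches the sum of the idle series.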

2.1.1 View the cumulative idle time of each CPU core

node_cpu_seconds_total{host="node1",env=~"test",mode="idle"}


2.1.2 View the idle rate of each CPU core

rate(node_cpu_seconds_total{host="node1",env=~"test",mode="idle"}[5m])

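To make the idea behind rate() concrete, here is a minimal Python sketch of a per-second rate computed from raw counter samples. It is an illustration only: the function names and sample values are hypothetical, and unlike the real rate() it does not extrapolate to the edges of the time window. It does show the one property that matters for counters, namely that a drop in value is treated as a reset rather than a negative change.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, counter value)


def counter_increase(samples: List[Sample]) -> float:
    """Total increase across the samples, treating any drop as a counter reset."""
    total = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # After a reset the counter restarts from 0, so the new value is the increase.
        total += curr - prev if curr >= prev else curr
    return total


def simple_rate(samples: List[Sample]) -> float:
    """Average per-second increase between the first and the last sample."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("timestamps must be increasing")
    return counter_increase(samples) / elapsed


# Hypothetical idle-seconds samples for one core, scraped every 15 s.
idle_samples = [(0.0, 1000.0), (15.0, 1014.0), (30.0, 1028.5), (45.0, 1043.0), (60.0, 1057.0)]
idle_rate = simple_rate(idle_samples)

# Hypothetical samples in which the exporter restarted before the last scrape.
reset_samples = [(0.0, 100.0), (10.0, 110.0), (20.0, 5.0)]
reset_rate = simple_rate(reset_samples)

print(f"idle rate: {idle_rate:.3f} s/s, rate across a reset: {reset_rate:.3f} /s")
```

A result near 1 for mode="idle" means the core was idle for almost the whole window, since a core can accumulate at most one second of CPU time per wall-clock second.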

2.1.3 Use the avg aggregation to compute the average CPU idle rate of each machine, grouped by instance

avg(irate(node_cpu_seconds_total{host="node1",env=~"test",mode="idle"}[5m])) by (instance)


2.1.4 Query the average CPU utilization

(1 - avg(rate(node_cpu_seconds_total{host="node1",env=~"test",mode="idle"}[5m])) by (instance)) * 100

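The arithmetic in the utilization query can be sketched in Python as follows. This is only an illustration of the formula (1 - average idle rate) * 100; the function name and the per-core idle rates are hypothetical values, not output from a real server.

```python
from statistics import mean
from typing import Dict


def cpu_utilization_percent(idle_rate_by_core: Dict[str, float]) -> float:
    """Utilization in percent, given the idle rate (0..1) of each core."""
    if not idle_rate_by_core:
        raise ValueError("need at least one core")
    # avg(...) by (instance): average the idle rate over all cores of one machine.
    average_idle = mean(list(idle_rate_by_core.values()))
    # The share of time that was not idle is the utilization.
    return (1.0 - average_idle) * 100.0


# Hypothetical idle rates for a 4-core machine.
idle_rates = {"0": 0.95, "1": 0.85, "2": 0.90, "3": 0.90}
utilization = cpu_utilization_percent(idle_rates)
print(f"CPU utilization: {utilization:.1f}%")
```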

2.2 Gauge

Common node_exporter metrics by category:

https://www.volcengine.com/docs/6731/177137

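A gauge is used for any measurement that can go up or down arbitrarily. It is probably the metric type you are most familiar with, because the raw value is meaningful even without further processing, and gauges are used very often. Metrics that measure temperature, memory, or the size of a queue are all gauges.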

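Unlike with counters, rate() and increase() are not meaningful for a gauge, because they assume a value that only goes up. Instead, functions that compute the average, maximum, minimum, or a quantile of a time series over a window are commonly used with gauges.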

In Prometheus these functions are named avg_over_time, max_over_time, min_over_time, and quantile_over_time.

To compute the average amount of vmalloc memory in use on node1 over the past 10 minutes, you can write:

avg_over_time(node_memory_VmallocUsed_bytes{host="node1"}[10m])

The Linux kernel supports dynamic allocation of virtual memory through vmalloc, which means this metric may be produced on any physical server running Linux.
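As a rough illustration of what the *_over_time functions compute, the Python sketch below applies them to a list of gauge samples. Everything here is an assumption made for the example: the helper names, the sample values, the choice to exclude the left edge of the window, and the use of linear interpolation between the closest ranks for the quantile.

```python
from statistics import mean
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, gauge value)


def window_values(samples: List[Sample], now: float, window_seconds: float) -> List[float]:
    """Values whose timestamp lies in the window (now - window_seconds, now]."""
    return [v for t, v in samples if now - window_seconds < t <= now]


def quantile(values: List[float], q: float) -> float:
    """q-quantile (0 <= q <= 1) using linear interpolation between closest ranks."""
    if not values:
        raise ValueError("no samples in the window")
    ordered = sorted(values)
    rank = q * (len(ordered) - 1)
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    weight = rank - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


# Hypothetical gauge samples (in bytes), one per minute.
samples = [(0, 100.0), (60, 120.0), (120, 110.0), (180, 130.0), (240, 140.0), (300, 150.0)]
recent = window_values(samples, now=300, window_seconds=180)

avg_value = mean(recent)            # like avg_over_time
max_value = max(recent)             # like max_over_time
min_value = min(recent)             # like min_over_time
median_value = quantile(recent, 0.5)  # like quantile_over_time(0.5, ...)

print(avg_value, max_value, min_value, median_value)
```

Because a gauge value is meaningful on its own, none of these helpers needs the reset handling that a counter requires.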

Another common way to view memory utilization:

(1 - node_memory_MemAvailable_bytes{host="node1",env=~"test"} / node_memory_MemTotal_bytes{host="node1",env=~"test"}) * 100

2.3 Histogram

Further reading:

https://www.infoq.cn/article/qikmtssy5w6llkmspxat

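In most cases people tend to look at the average of a quantitative metric, such as average CPU utilization or average page response time. The problem with this is easy to see. Take the average response time of API calls: if most API requests complete within about 100 ms but a few take 5 s, the average is dominated by the fast majority and says little about the slow requests that some web pages actually hit. This is known as the long-tail problem.
To tell whether the system is slow on average or slow only in the long tail, the simplest approach is to group requests by latency range, for example counting how many requests took 0~10 ms and how many took 10~20 ms. This makes it possible to analyze quickly why the system is slow. Histogram and Summary both exist to solve this kind of problem: with metrics of these types we can quickly see how the monitored samples are distributed.
A Histogram samples observations over a period of time (typically request durations or response sizes) and counts them in configurable buckets. The samples can later be filtered by range, or the total number of samples can be counted, and the data is usually displayed as a histogram.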
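The long-tail effect and the bucket counting described above can be reproduced with a few lines of Python. This is a sketch with made-up numbers: the latencies, the bucket bounds, and the helper name cumulative_buckets are all hypothetical.

```python
import math
from statistics import mean, median
from typing import Dict, List


def cumulative_buckets(values: List[float], upper_bounds: List[float]) -> Dict[float, int]:
    """For each upper bound, count how many values are <= that bound."""
    bounds = sorted(upper_bounds) + [math.inf]  # the +Inf bucket holds every sample
    return {le: sum(1 for v in values if v <= le) for le in bounds}


# Hypothetical latencies in seconds: 98 fast requests and 2 very slow ones.
latencies = [0.1] * 98 + [5.0] * 2

avg_latency = mean(latencies)
median_latency = median(latencies)
buckets = cumulative_buckets(latencies, [0.1, 0.5, 1.0, 5.0])

print(f"mean={avg_latency:.3f}s median={median_latency:.1f}s max={max(latencies):.1f}s")
for le, count in buckets.items():
    print(f'le="{le}" {count}')
```

The mean and the median both look healthy here, while the bucket counts show directly that two requests took longer than one second.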

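The number of samples that fall into each bucket is exposed under a name ending in _bucket{le="<upper bound>"}. Put more plainly, this value is the number of all samples whose value is less than or equal to the upper bound, so the buckets are cumulative.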

 // Out of 2 requests in total, 0 had an HTTP response time <= 0.005 s
 io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="0.005",} 0.0
 // Out of 2 requests in total, 0 had an HTTP response time <= 0.01 s
 io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="0.01",} 0.0
 // Out of 2 requests in total, 0 had an HTTP response time <= 0.025 s
 io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="0.025",} 0.0
 io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="0.05",} 0.0
 io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="0.075",} 0.0
 io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="0.1",} 0.0
 io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="0.25",} 0.0
 io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="0.5",} 0.0
 io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="0.75",} 0.0
 io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="1.0",} 0.0
 io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="2.5",} 0.0
 io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="5.0",} 0.0
 io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="7.5",} 2.0
 // Out of 2 requests in total, 2 had an HTTP response time <= 10 s
 io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="10.0",} 2.0
 io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="+Inf",} 2.0
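Given cumulative bucket counts like the ones listed above, a quantile can only be estimated, typically by assuming that samples are spread evenly inside the bucket that contains the target rank. The Python sketch below follows that idea, which is also the general approach of the histogram_quantile() function, but it is a simplified illustration and not the Prometheus implementation. The function name is hypothetical, and it assumes the lowest bucket starts at 0.

```python
import math
from typing import List, Tuple

Bucket = Tuple[float, float]  # (upper bound "le", cumulative count)


def estimate_quantile(q: float, buckets: List[Bucket]) -> float:
    """Estimate the q-quantile (0 < q <= 1) from cumulative bucket counts."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if not buckets:
        raise ValueError("need at least one bucket")
    ordered = sorted(buckets)
    total = ordered[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # assumption: the first bucket starts at 0
    for bound, count in ordered:
        if count >= rank:
            if math.isinf(bound):
                # Nothing is known beyond the highest finite bound.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Bucket counts copied from the sample output above.
observed = [
    (0.005, 0), (0.01, 0), (0.025, 0), (0.05, 0), (0.075, 0), (0.1, 0),
    (0.25, 0), (0.5, 0), (0.75, 0), (1.0, 0), (2.5, 0), (5.0, 0),
    (7.5, 2), (10.0, 2), (math.inf, 2),
]
p50 = estimate_quantile(0.5, observed)
p100 = estimate_quantile(1.0, observed)
print(p50, p100)
```

Both requests fall in the bucket between 5.0 and 7.5 seconds, so every estimate lands inside that range. The precision of the estimate therefore depends on how finely the buckets are chosen.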

2.4 Summary

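Similar to a Histogram, a Summary represents the result of sampling data over a period of time (typically request durations or response sizes), but it stores quantiles directly (they are computed on the client and then exposed) instead of deriving them from buckets.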

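Samples of the Summary type also expose three kinds of series: the quantile series, plus a _sum series and a _count series. The quantile series of the metric io_namespace_http_requests_latency_seconds_summary look like this: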

 // Meaning: 50% of these 12 HTTP requests had a response time of at most 3.052404983 s
 io_namespace_http_requests_latency_seconds_summary{path="/",method="GET",code="200",quantile="0.5",} 3.052404983
 // Meaning: 90% of these 12 HTTP requests had a response time of at most 8.003261666 s
 io_namespace_http_requests_latency_seconds_summary{path="/",method="GET",code="200",quantile="0.9",} 8.003261666
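The client-side calculation behind a summary can be sketched in Python as below. This is a deliberately naive illustration: the class SimpleSummary is hypothetical, it keeps every observation in memory, and it uses the nearest-rank method, whereas real client libraries typically use a streaming approximation over a sliding time window. The 12 observations are made-up values.

```python
import math
from typing import List


class SimpleSummary:
    """Keeps all observations and answers quantile queries exactly."""

    def __init__(self) -> None:
        self._values: List[float] = []

    def observe(self, value: float) -> None:
        self._values.append(value)

    @property
    def count(self) -> int:
        """Number of observations, like the _count series."""
        return len(self._values)

    @property
    def total(self) -> float:
        """Sum of all observations, like the _sum series."""
        return sum(self._values)

    def quantile(self, q: float) -> float:
        """Nearest-rank q-quantile (0 < q <= 1) of the observations."""
        if not 0 < q <= 1:
            raise ValueError("q must be in (0, 1]")
        if not self._values:
            return math.nan
        ordered = sorted(self._values)
        index = math.ceil(q * len(ordered)) - 1
        return ordered[index]


# Hypothetical response times in seconds for 12 requests.
summary = SimpleSummary()
for seconds in range(1, 13):
    summary.observe(float(seconds))

p50 = summary.quantile(0.5)
p90 = summary.quantile(0.9)
print(summary.count, summary.total, p50, p90)
```

Because the quantiles are computed before the data leaves the client, quantile series from several instances cannot be meaningfully averaged or merged afterwards, which is the main trade-off compared with a histogram.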

Posted 2024-11-01 20:37 by solomon123
