012k8s_Prometheus: Common Monitoring Metrics Explained

I. JVM process CPU utilization: alert tuning

# HELP process_cpu_usage The "recent cpu usage" for the Java Virtual Machine process
# TYPE process_cpu_usage gauge
process_cpu_usage 0.12489451476793248

Alert tuning:
(1) Original alert rule:

(2) New alert rule:
      - alert: JVM_P HighCPU
        expr: process_cpu_usage > 0.8
        for: 2m
        annotations:
          summary: "warn Instance {{ $labels.instance }} JVM process high CPU usage"
          description: "{{ $labels.instance }} of job {{ $labels.job }}: JVM process CPU usage has been too high for more than 2 minutes."
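
To make the effect of for: 2m concrete, here is a minimal Python sketch of how such a rule moves through inactive, pending, and firing. It is a simplified model, not Prometheus's implementation: the function name alert_states, the 30-second evaluation interval, and the sample values are all assumptions made for illustration.

```python
from typing import List, Tuple


def alert_states(
    samples: List[Tuple[float, float]],
    threshold: float = 0.8,
    for_seconds: float = 120.0,
) -> List[str]:
    """Return the alert state at each evaluation.

    samples: (timestamp_seconds, process_cpu_usage) pairs, one per rule
    evaluation, in time order. Hypothetical helper, simplified model.
    """
    states = []
    active_since = None  # time at which the expression first became true
    for ts, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = ts
            # The alert fires only once the condition has held for `for`.
            held = ts - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            # Any evaluation at or below the threshold resets the timer.
            active_since = None
            states.append("inactive")
    return states


# Assumed data: one evaluation every 30 seconds.
samples = [(0, 0.50), (30, 0.85), (60, 0.90), (90, 0.95),
           (120, 0.90), (150, 0.99), (180, 0.30)]
for (ts, value), state in zip(samples, alert_states(samples)):
    print(f"t={ts:>3}s usage={value:.2f} -> {state}")
```

A short spike that drops back below 0.8 before two minutes have passed never reaches firing, which is the kind of noise the for clause is meant to filter out.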

II. Pod CPU usage > 80%

(1) Formula

sum by(cluster, pod) (irate(container_cpu_usage_seconds_total{pod!=""}[5m]) * 100 ) /
sum by(cluster, pod) (container_spec_cpu_quota{pod!=""} / container_spec_cpu_period{pod!=""}) > 80 

(2) The formula broken down

sum by(cluster, pod) (
    irate(container_cpu_usage_seconds_total{pod!=""}[5m]) * 100
)
/
sum by(cluster, pod) (
    container_spec_cpu_quota{pod!=""} / container_spec_cpu_period{pod!=""}
)
> 80
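
The same calculation can be written out in Python to check the arithmetic by hand. This is a sketch under assumptions: irate is modeled as the per-second increase between the last two samples (counter resets are ignored), and the helper names and sample numbers are invented for illustration.

```python
from typing import List, Tuple


def irate(points: List[Tuple[float, float]]) -> float:
    """Per-second increase between the last two (timestamp, value) samples.

    Simplified: assumes the counter did not reset between them.
    """
    (t_prev, v_prev), (t_last, v_last) = points[-2], points[-1]
    return (v_last - v_prev) / (t_last - t_prev)


def pod_cpu_percent(
    usage_rates: List[float],
    quotas_us: List[float],
    periods_us: List[float],
) -> float:
    """Pod CPU usage as a percentage of the summed container limits."""
    # Left side of the division: per-container usage rate * 100, summed.
    used = sum(rate * 100 for rate in usage_rates)
    # Right side: per-container quota / period (the limit in cores), summed.
    limit = sum(q / p for q, p in zip(quotas_us, periods_us))
    return used / limit


# Assumed pod with two containers; counter values are in CPU seconds.
rate_a = irate([(0, 100.0), (15, 106.0), (30, 119.5)])  # 13.5 s over 15 s
rate_b = irate([(0, 40.0), (15, 42.25), (30, 49.0)])    # 6.75 s over 15 s
percent = pod_cpu_percent(
    [rate_a, rate_b],
    quotas_us=[100000, 50000],    # limits of 1 CPU and 0.5 CPU
    periods_us=[100000, 100000],
)
print(f"{percent:.1f}% of limit, alert: {percent > 80}")
```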

(3) Explanation of the metrics

①container_spec_cpu_quota

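container_spec_cpu_quota corresponds to the container's CPU limit in k8s: it is that limit (in CPUs) multiplied by container_spec_cpu_period, expressed in microseconds.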
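
As a rough illustration of that relationship, the sketch below converts a Kubernetes-style CPU limit into the quota value one would expect to see. The helper names are hypothetical and the 100000-microsecond period is an assumed default; the real values come from the container runtime.

```python
def parse_cpu_limit(limit: str) -> float:
    """Convert a Kubernetes CPU quantity such as "500m" or "2" to cores."""
    if limit.endswith("m"):
        return int(limit[:-1]) / 1000  # millicores to cores
    return float(limit)


def expected_cpu_quota_us(limit: str, period_us: int = 100000) -> int:
    """Quota in microseconds per period for a given CPU limit (assumed model)."""
    return round(parse_cpu_limit(limit) * period_us)


for limit in ("500m", "2", "7"):
    quota = expected_cpu_quota_us(limit)
    print(f"limit={limit:>4} -> quota={quota} us, cores={quota / 100000}")
```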

② A further explanation. The one quoted below is very good and very clear.

I was having some difficulties understanding how this could result in some useful percentage, especially how the right part of the division was related to the left part in seconds.

So I'm sharing this for others as well. Please correct me if I'm wrong.

Some definitions

container_cpu_usage_seconds_total - CPU usage time in seconds of a specific container, as the name suggests. A rate on top of this will give us how many CPU seconds a container used per second.

container_spec_cpu_period - Denotes the period in which container CPU utilisation is tracked. I understood this as the duration of a CPU "cycle". Typically 100000 microseconds for docker containers.

container_spec_cpu_quota - How much CPU time your container has for each cpu_period, in microseconds. It results from multiplying the limit in "CPU units" (e.g. 7 CPUs) by the container_spec_cpu_period. You only have it if you define a limit for your container.

Let's pick an example with numbers

Limit of 7 CPUs/"CPU units" and container_spec_cpu_period of 100000 microseconds.

This means container_spec_cpu_quota will be 700000 microseconds, meaning that we have 700000 CPU microseconds every 100000 microseconds, and the right part of our division will result in 7.
700000/100000 = 7 (container_spec_cpu_quota / container_spec_cpu_period)

Let's now say rate(container_cpu_usage_seconds_total[10m]) is 1.34. This means our container used, on average, 1.34 CPU seconds per second over that 10-minute window. The final result would be:
1.34/7=~0.191=~19.1% of CPU usage based on the defined limit.

But how is 7 related to 1.34 CPU seconds per second?

Going back to our container_spec_cpu_quota of 700000 microseconds and its meaning.
It means that we have 700000 CPU microseconds every 100000 microseconds. Which is the same as...
700 milliseconds of CPU every 100 milliseconds. And...
0.7 seconds of CPU every 100 milliseconds. And...
7 seconds of CPU every second.

So 1.34 CPU seconds per second is actually ~19.1% of 7 CPU seconds per second, which is our allowed CPU time.
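
The numbers from the quoted example can be checked with a few lines of Python. Nothing here comes from a real cluster; the values are the ones used in the quote, and the variable names are chosen for illustration.

```python
quota_us = 700_000    # container_spec_cpu_quota from the example
period_us = 100_000   # container_spec_cpu_period from the example
usage_rate = 1.34     # rate(container_cpu_usage_seconds_total[10m])

# The same allowance expressed per period, in different units.
print(f"{quota_us} us of CPU every {period_us} us")
print(f"{quota_us / 1000:.0f} ms of CPU every {period_us / 1000:.0f} ms")
print(f"{quota_us / 1e6} s of CPU every {period_us / 1000:.0f} ms")

# Scaling the period up to one second gives the allowed CPU seconds per second.
allowed_cpu_seconds_per_second = quota_us / period_us
usage_fraction = usage_rate / allowed_cpu_seconds_per_second

print(f"{allowed_cpu_seconds_per_second:.0f} s of CPU every second")
print(f"usage = {usage_fraction:.3f} = {usage_fraction * 100:.1f}% of the limit")
```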

TLDR

container_spec_cpu_quota / container_spec_cpu_period will actually tell you how many CPU seconds you have in each second, and that's why you can divide rate(container_cpu_usage_seconds_total[x]) by it.

(4) Resolving the alert:

Increase the container's CPU limit.
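
To get a feel for how far the limit has to move, the sketch below computes the smallest limit that would bring a given usage back to the 80% threshold. The function name and the example figures are assumptions for illustration; real sizing would be based on the usage actually observed for the pod.

```python
def min_limit_cores(usage_cores: float, threshold_percent: float = 80.0) -> float:
    """CPU limit (in cores) at which usage sits exactly at the threshold.

    Any limit above this value keeps usage / limit below the threshold.
    """
    if usage_cores < 0 or not 0 < threshold_percent <= 100:
        raise ValueError("usage must be >= 0 and threshold in (0, 100]")
    return usage_cores * 100 / threshold_percent


current_limit = 2.0   # assumed current limit, in cores
usage = 1.8           # assumed sustained usage, in CPU seconds per second

print(f"current usage: {usage / current_limit * 100:.0f}% of limit")
print(f"limit must exceed {min_limit_cores(usage):.2f} cores to stay under 80%")
```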

Reference: https://github.com/google/cadvisor/issues/2026

 

posted @ 2023-02-05 16:21  arun_yh