Rancher Prometheus Monitoring

 

Cluster Metrics

Cluster CPU Utilization

Catalog    Expression
Detail     1 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance))
Summary    1 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])))
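To make the arithmetic behind this expression concrete, here is a minimal Python sketch of what the Summary form roughly computes: the per-second idle rate from the last two samples of each core (the idea behind `irate`), averaged over cores and subtracted from 1. The sample values and function names are hypothetical, and counter resets are ignored for brevity.

```python
from statistics import mean

def irate(samples):
    """Per-second rate from the last two (timestamp, value) samples."""
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    return (v_last - v_prev) / (t_last - t_prev)

def cpu_utilization(idle_samples_per_core):
    """1 - average idle rate across cores, mirroring the Summary expression."""
    return 1 - mean(irate(s) for s in idle_samples_per_core)

# Hypothetical idle-second counters for two cores, scraped 15 s apart.
core0 = [(0, 1000.0), (15, 1012.0)]  # idle for 12 of 15 seconds
core1 = [(0, 2000.0), (15, 2009.0)]  # idle for 9 of 15 seconds
utilization = cpu_utilization([core0, core1])
```

The Detail form does the same thing per `instance` instead of across the whole cluster.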

Cluster Load Average

Catalog    Expression
Detail
load1     sum(node_load1) by (instance) / count(node_cpu_seconds_total{mode="system"}) by (instance)
load5     sum(node_load5) by (instance) / count(node_cpu_seconds_total{mode="system"}) by (instance)
load15    sum(node_load15) by (instance) / count(node_cpu_seconds_total{mode="system"}) by (instance)
Summary
load1     sum(node_load1) by (instance) / count(node_cpu_seconds_total{mode="system"})
load5     sum(node_load5) by (instance) / count(node_cpu_seconds_total{mode="system"})
load15    sum(node_load15) by (instance) / count(node_cpu_seconds_total{mode="system"})
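The per-instance form divides each node's load average by the number of `node_cpu_seconds_total{mode="system"}` series, which is one per CPU core, so the result is load per core. A small Python sketch of that per-instance division follows; the instance names and numbers are made up for illustration.

```python
def normalized_load(load_by_instance, cores_by_instance):
    """Divide each instance's load average by its core count.

    Only instances present on both sides are kept, similar to how
    label matching on `instance` pairs the two sides of the division.
    """
    return {
        instance: load / cores_by_instance[instance]
        for instance, load in load_by_instance.items()
        if instance in cores_by_instance
    }

# Hypothetical values: node-a has 4 cores, node-b has 8.
loads = {"node-a:9100": 6.0, "node-b:9100": 2.0}
cores = {"node-a:9100": 4, "node-b:9100": 8}
per_core = normalized_load(loads, cores)
```

A value above 1.0 suggests more runnable work than cores on that node.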

Cluster Memory Utilization

Detail     1 - sum(node_memory_MemAvailable_bytes) by (instance) / sum(node_memory_MemTotal_bytes) by (instance)
Summary    1 - sum(node_memory_MemAvailable_bytes) / sum(node_memory_MemTotal_bytes)

 

Cluster Disk Utilization

Catalog    Expression
Detail     (sum(node_filesystem_size_bytes{device!="rootfs"}) by (instance) - sum(node_filesystem_free_bytes{device!="rootfs"}) by (instance)) / sum(node_filesystem_size_bytes{device!="rootfs"}) by (instance)
Summary    (sum(node_filesystem_size_bytes{device!="rootfs"}) - sum(node_filesystem_free_bytes{device!="rootfs"})) / sum(node_filesystem_size_bytes{device!="rootfs"})
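The disk expression is a plain (size - free) / size ratio over every filesystem except `rootfs`. The following Python sketch shows the same calculation on a hypothetical list of filesystems; the dictionary keys are illustrative and not part of any exporter API.

```python
def disk_utilization(filesystems):
    """(total size - total free) / total size, skipping the rootfs device."""
    kept = [fs for fs in filesystems if fs["device"] != "rootfs"]
    size = sum(fs["size_bytes"] for fs in kept)
    free = sum(fs["free_bytes"] for fs in kept)
    return (size - free) / size

# Hypothetical filesystems; the rootfs entry is excluded from both sums.
filesystems = [
    {"device": "/dev/sda1", "size_bytes": 100, "free_bytes": 40},
    {"device": "/dev/sdb1", "size_bytes": 300, "free_bytes": 60},
    {"device": "rootfs", "size_bytes": 50, "free_bytes": 50},
]
used_fraction = disk_utilization(filesystems)
```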

 

Cluster Disk I/O

Catalog    Expression
Detail
read    sum(rate(node_disk_read_bytes_total[5m])) by (instance)
written    sum(rate(node_disk_written_bytes_total[5m])) by (instance)
Summary
read    sum(rate(node_disk_read_bytes_total[5m]))
written    sum(rate(node_disk_written_bytes_total[5m]))
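The I/O expressions rely on `rate(...[5m])`, the average per-second increase of a counter over the window. Below is a simplified Python sketch of that idea. It treats a drop in the counter as a reset to zero, as Prometheus does, but it omits the extrapolation to the window boundaries that the real function performs, so treat it as an approximation. The sample data is hypothetical.

```python
def simple_rate(samples):
    """Average per-second increase over a list of (timestamp, value) samples.

    A decrease between two samples is treated as a counter reset,
    so the new value itself counts as the increase for that step.
    """
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed

# Hypothetical read-bytes counter scraped every 60 s, with one reset at t=180.
samples = [(0, 0), (60, 6_000_000), (120, 12_000_000), (180, 3_000_000), (240, 9_000_000)]
bytes_per_second = simple_rate(samples)
```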

 

 

Node Metrics

Node CPU Utilization

Catalog    Expression
Detail     avg(irate(node_cpu_seconds_total{mode!="idle", instance=~"$instance"}[5m])) by (mode)
Summary    1 - (avg(irate(node_cpu_seconds_total{mode="idle", instance=~"$instance"}[5m])))
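The node-level expressions contain the placeholder `$instance`, which has to be replaced with a concrete value before the query reaches Prometheus. If you want to run one of these expressions outside Rancher, one way to fill the placeholder is Python's `string.Template`, which happens to use the same `$name` syntax. This is only a sketch, and the instance address is a made-up example.

```python
from string import Template

# The Summary expression from the table above, kept verbatim.
NODE_CPU_SUMMARY = Template(
    '1 - (avg(irate(node_cpu_seconds_total{mode="idle", instance=~"$instance"}[5m])))'
)

def render(template, **variables):
    """Substitute $placeholders; raises KeyError if one is left unfilled."""
    return template.substitute(**variables)

# Hypothetical node exporter address.
query = render(NODE_CPU_SUMMARY, instance="10.0.0.5:9100")
```

Because the matcher is `=~`, the substituted value is interpreted as a regular expression, so a value like `node-1.*|node-2.*` would select several nodes at once.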

Node Load Average

Catalog    Expression
Detail
load1    sum(node_load1{instance=~"$instance"}) / count(node_cpu_seconds_total{mode="system",instance=~"$instance"})
load5    sum(node_load5{instance=~"$instance"}) / count(node_cpu_seconds_total{mode="system",instance=~"$instance"})
load15    sum(node_load15{instance=~"$instance"}) / count(node_cpu_seconds_total{mode="system",instance=~"$instance"})
Summary
load1    sum(node_load1{instance=~"$instance"}) / count(node_cpu_seconds_total{mode="system",instance=~"$instance"})
load5    sum(node_load5{instance=~"$instance"}) / count(node_cpu_seconds_total{mode="system",instance=~"$instance"})
load15    sum(node_load15{instance=~"$instance"}) / count(node_cpu_seconds_total{mode="system",instance=~"$instance"})

 

Node Memory Utilization

Catalog    Expression
Detail     1 - sum(node_memory_MemAvailable_bytes{instance=~"$instance"}) / sum(node_memory_MemTotal_bytes{instance=~"$instance"})
Summary    1 - sum(node_memory_MemAvailable_bytes{instance=~"$instance"}) / sum(node_memory_MemTotal_bytes{instance=~"$instance"})

Node Disk Utilization

Catalog    Expression
Detail     (sum(node_filesystem_size_bytes{device!="rootfs",instance=~"$instance"}) by (device) - sum(node_filesystem_free_bytes{device!="rootfs",instance=~"$instance"}) by (device)) / sum(node_filesystem_size_bytes{device!="rootfs",instance=~"$instance"}) by (device)
Summary    (sum(node_filesystem_size_bytes{device!="rootfs",instance=~"$instance"}) - sum(node_filesystem_free_bytes{device!="rootfs",instance=~"$instance"})) / sum(node_filesystem_size_bytes{device!="rootfs",instance=~"$instance"})

 

Node Disk I/O

Catalog    Expression
Detail
read      sum(rate(node_disk_read_bytes_total{instance=~"$instance"}[5m]))
written    sum(rate(node_disk_written_bytes_total{instance=~"$instance"}[5m]))
Summary
read      sum(rate(node_disk_read_bytes_total{instance=~"$instance"}[5m]))
written    sum(rate(node_disk_written_bytes_total{instance=~"$instance"}[5m]))

 

Etcd Metrics

Etcd Leader

max(etcd_server_has_leader)

Number of Leader Changes

max(etcd_server_leader_changes_seen_total)

 

Number of Failed Proposals

sum(etcd_server_proposals_failed_total)
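Single-expression metrics like these are easy to check by hand through the Prometheus HTTP API, whose instant-query endpoint is `/api/v1/query` with the expression passed in the `query` parameter. The sketch below only builds the request URL; the base address is a placeholder and should be replaced with wherever your Prometheus is reachable.

```python
from urllib.parse import urlencode

def instant_query_url(base_url, expr):
    """Build an instant-query URL with the expression safely URL-encoded."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"

# Hypothetical Prometheus address; the expression is taken from the section above.
url = instant_query_url("http://prometheus.example:9090", "max(etcd_server_has_leader)")
```

The resulting URL can be fetched with any HTTP client. For `etcd_server_has_leader`, a result value of 1 means a leader exists.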

 

gRPC Client Traffic

Catalog    Expression
Detail
in      sum(rate(etcd_network_client_grpc_received_bytes_total[5m])) by (instance)
out      sum(rate(etcd_network_client_grpc_sent_bytes_total[5m])) by (instance)
Summary
in     sum(rate(etcd_network_client_grpc_received_bytes_total[5m]))
out    sum(rate(etcd_network_client_grpc_sent_bytes_total[5m]))

Peer Traffic

Catalog    Expression
Detail
in    sum(rate(etcd_network_peer_received_bytes_total[5m])) by (instance)
out    sum(rate(etcd_network_peer_sent_bytes_total[5m])) by (instance)
Summary
in    sum(rate(etcd_network_peer_received_bytes_total[5m]))
out    sum(rate(etcd_network_peer_sent_bytes_total[5m]))

 

DB Size

Catalog    Expression
Detail     sum(etcd_debugging_mvcc_db_total_size_in_bytes) by (instance)
Summary    sum(etcd_debugging_mvcc_db_total_size_in_bytes)

 

RPC Rate

Catalog    Expression
Detail
total    sum(rate(grpc_server_started_total{grpc_type="unary"}[5m])) by (instance)
fail    sum(rate(grpc_server_handled_total{grpc_type="unary",grpc_code!="OK"}[5m])) by (instance)
Summary
total    sum(rate(grpc_server_started_total{grpc_type="unary"}[5m]))
fail    sum(rate(grpc_server_handled_total{grpc_type="unary",grpc_code!="OK"}[5m]))

 

Disk Operations

Catalog    Expression
Detail
commit-called-by-backend    sum(rate(etcd_disk_backend_commit_duration_seconds_sum[1m])) by (instance)
fsync-called-by-wal        sum(rate(etcd_disk_wal_fsync_duration_seconds_sum[1m])) by (instance)
Summary
commit-called-by-backend    sum(rate(etcd_disk_backend_commit_duration_seconds_sum[1m]))
fsync-called-by-wal        sum(rate(etcd_disk_wal_fsync_duration_seconds_sum[1m]))

 

Disk Sync Duration

Catalog    Expression
Detail
wal    histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) by (instance, le))
db    histogram_quantile(0.99, sum(rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) by (instance, le))
Summary
wal    sum(histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) by (instance, le)))
db    sum(histogram_quantile(0.99, sum(rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) by (instance, le)))
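`histogram_quantile(0.99, ...)` estimates the 99th percentile from cumulative bucket counts by finding the bucket that contains the target rank and interpolating linearly inside it. The Python sketch below follows that general approach on already-aggregated buckets. It is a simplification that leaves out some edge cases Prometheus handles, and the bucket values are invented.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate quantile q from (upper_bound, cumulative_count) buckets.

    The list must include the +Inf bucket and at least one finite bucket.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower, prev_count = (0.0, 0.0) if i == 0 else buckets[i - 1]
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)

# Hypothetical fsync latency buckets in seconds.
buckets = [(0.001, 0), (0.002, 50), (0.004, 90), (0.008, 100), (math.inf, 100)]
p99 = histogram_quantile(0.99, buckets)
```

Here the rank 99 falls in the (0.004, 0.008] bucket, so the estimate lands between those two bounds. Its precision is limited by the bucket layout.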

 

Kubernetes Components Metrics

API Server Request Latency

Catalog    Expression
Detail     avg(apiserver_request_latencies_sum / apiserver_request_latencies_count) by (instance, verb) /1e+06
Summary    avg(apiserver_request_latencies_sum / apiserver_request_latencies_count) by (instance) /1e+06
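The latency expression divides a cumulative latency sum by a cumulative request count and then by 1e+06, which converts microseconds to seconds. A short Python sketch of that arithmetic, with made-up numbers:

```python
def mean_latency_seconds(latency_sum_us, request_count):
    """Mean request latency in seconds from a microsecond sum and a count."""
    return latency_sum_us / request_count / 1e6

# Hypothetical counters: 10 requests totalling 2.5 million microseconds.
mean_latency = mean_latency_seconds(2_500_000, 10)
```

Because the expression uses the raw counters rather than `rate(...)`, the value is an average over the whole lifetime of the counters, not over a recent window.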

 

API Server Request Rate

Catalog    Expression
Detail     sum(rate(apiserver_request_count[5m])) by (instance, code)
Summary    sum(rate(apiserver_request_count[5m])) by (instance)

 

Scheduling Failed Pods

Catalog    Expression
Detail     sum(kube_pod_status_scheduled{condition="false"})
Summary    sum(kube_pod_status_scheduled{condition="false"})

 

Controller Manager Queue Depth

Catalog    Expression
Detail
volumes    sum(volumes_depth) by (instance)
deployment    sum(deployment_depth) by (instance)
replicaset    sum(replicaset_depth) by (instance)
service    sum(service_depth) by (instance)
serviceaccount    sum(serviceaccount_depth) by (instance)
endpoint    sum(endpoint_depth) by (instance)
daemonset    sum(daemonset_depth) by (instance)
statefulset    sum(statefulset_depth) by (instance)
replicationmanager    sum(replicationmanager_depth) by (instance)
Summary
volumes    sum(volumes_depth)
deployment    sum(deployment_depth)
replicaset    sum(replicaset_depth)
service    sum(service_depth)
serviceaccount    sum(serviceaccount_depth)
endpoint    sum(endpoint_depth)
daemonset    sum(daemonset_depth)
statefulset    sum(statefulset_depth)
replicationmanager    sum(replicationmanager_depth)

 

Scheduler E2E Scheduling Latency

Catalog    Expression
Detail     histogram_quantile(0.99, sum(scheduler_e2e_scheduling_latency_microseconds_bucket) by (le, instance)) / 1e+06
Summary    sum(histogram_quantile(0.99, sum(scheduler_e2e_scheduling_latency_microseconds_bucket) by (le, instance)) / 1e+06)

 

Scheduler Preemption Attempts

Catalog    Expression
Detail     sum(rate(scheduler_total_preemption_attempts[5m])) by (instance)
Summary    sum(rate(scheduler_total_preemption_attempts[5m]))

Ingress Controller Connections

Catalog    Expression
Detail
reading    sum(nginx_ingress_controller_nginx_process_connections{state="reading"}) by (instance)
waiting    sum(nginx_ingress_controller_nginx_process_connections{state="waiting"}) by (instance)
writing    sum(nginx_ingress_controller_nginx_process_connections{state="writing"}) by (instance)
accepted    sum(ceil(increase(nginx_ingress_controller_nginx_process_connections_total{state="accepted"}[5m]))) by (instance)
active    sum(ceil(increase(nginx_ingress_controller_nginx_process_connections_total{state="active"}[5m]))) by (instance)
handled    sum(ceil(increase(nginx_ingress_controller_nginx_process_connections_total{state="handled"}[5m]))) by (instance)
Summary
reading    sum(nginx_ingress_controller_nginx_process_connections{state="reading"})
waiting    sum(nginx_ingress_controller_nginx_process_connections{state="waiting"})
writing    sum(nginx_ingress_controller_nginx_process_connections{state="writing"})
accepted    sum(ceil(increase(nginx_ingress_controller_nginx_process_connections_total{state="accepted"}[5m])))
active    sum(ceil(increase(nginx_ingress_controller_nginx_process_connections_total{state="active"}[5m])))
handled    sum(ceil(increase(nginx_ingress_controller_nginx_process_connections_total{state="handled"}[5m])))

 

Ingress Controller Request Process Time

Catalog    Expression
Detail     topk(10, histogram_quantile(0.95,sum by (le, host, path)(rate(nginx_ingress_controller_request_duration_seconds_bucket{host!="_"}[5m]))))
Summary    topk(10, histogram_quantile(0.95,sum by (le, host)(rate(nginx_ingress_controller_request_duration_seconds_bucket{host!="_"}[5m]))))

 

Workload Metrics

Workload CPU Utilization

Catalog    Expression
Detail
cfs throttled seconds    sum(rate(container_cpu_cfs_throttled_seconds_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)
user seconds          sum(rate(container_cpu_user_seconds_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)
system seconds        sum(rate(container_cpu_system_seconds_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)
usage seconds         sum(rate(container_cpu_usage_seconds_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)
Summary
cfs throttled seconds    sum(rate(container_cpu_cfs_throttled_seconds_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))
user seconds          sum(rate(container_cpu_user_seconds_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))
system seconds        sum(rate(container_cpu_system_seconds_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))
usage seconds         sum(rate(container_cpu_usage_seconds_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))

 

Workload Memory Utilization

Catalog    Expression
Detail     sum(container_memory_working_set_bytes{namespace="$namespace",pod_name=~"$podName", container_name!=""}) by (pod_name)
Summary    sum(container_memory_working_set_bytes{namespace="$namespace",pod_name=~"$podName", container_name!=""})
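The workload expressions select pods with the regex matcher `pod_name=~"$podName"`, whereas the pod-level expressions further down use the exact matcher `pod_name="$podName"`. Prometheus regex matchers are fully anchored, so the pattern must match the whole label value. The Python sketch below imitates that with `re.fullmatch`. Note that Prometheus uses RE2 syntax rather than Python's `re`, and the pod names and pattern are hypothetical.

```python
import re

def matches_regex(pattern, value):
    """Anchored regex match, similar in spirit to the =~ matcher."""
    return re.fullmatch(pattern, value) is not None

def matches_exact(expected, value):
    """Plain equality, like the = matcher."""
    return expected == value

# Hypothetical pods; a workload variable might expand to an alternation like this.
pods = ["web-7d9f-abc12", "web-7d9f-xyz34", "webhook-55c-qq1"]
pattern = "web-7d9f-abc12|web-7d9f-xyz34"
selected = [p for p in pods if matches_regex(pattern, p)]
```

A pattern such as `web` alone would select nothing here, because the match is anchored at both ends.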

Workload Network Packets

Catalog    Expression
Detail
receive-packets    sum(rate(container_network_receive_packets_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)
receive-dropped    sum(rate(container_network_receive_packets_dropped_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)
receive-errors    sum(rate(container_network_receive_errors_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)
transmit-packets    sum(rate(container_network_transmit_packets_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)
transmit-dropped    sum(rate(container_network_transmit_packets_dropped_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)
transmit-errors    sum(rate(container_network_transmit_errors_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)
Summary
receive-packets    sum(rate(container_network_receive_packets_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))
receive-dropped    sum(rate(container_network_receive_packets_dropped_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))
receive-errors    sum(rate(container_network_receive_errors_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))
transmit-packets    sum(rate(container_network_transmit_packets_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))
transmit-dropped    sum(rate(container_network_transmit_packets_dropped_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))
transmit-errors    sum(rate(container_network_transmit_errors_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))

 

Pod Metrics

Pod CPU Utilization

Catalog    Expression
Detail
cfs throttled seconds    sum(rate(container_cpu_cfs_throttled_seconds_total{container_name!="POD",namespace="$namespace",pod_name="$podName", container_name!=""}[5m])) by (container_name)
usage seconds    sum(rate(container_cpu_usage_seconds_total{container_name!="POD",namespace="$namespace",pod_name="$podName", container_name!=""}[5m])) by (container_name)
system seconds    sum(rate(container_cpu_system_seconds_total{container_name!="POD",namespace="$namespace",pod_name="$podName", container_name!=""}[5m])) by (container_name)
user seconds    sum(rate(container_cpu_user_seconds_total{container_name!="POD",namespace="$namespace",pod_name="$podName", container_name!=""}[5m])) by (container_name)
Summary
cfs throttled seconds    sum(rate(container_cpu_cfs_throttled_seconds_total{container_name!="POD",namespace="$namespace",pod_name="$podName", container_name!=""}[5m]))
usage seconds    sum(rate(container_cpu_usage_seconds_total{container_name!="POD",namespace="$namespace",pod_name="$podName", container_name!=""}[5m]))
system seconds    sum(rate(container_cpu_system_seconds_total{container_name!="POD",namespace="$namespace",pod_name="$podName", container_name!=""}[5m]))
user seconds    sum(rate(container_cpu_user_seconds_total{container_name!="POD",namespace="$namespace",pod_name="$podName", container_name!=""}[5m]))

 

Pod Memory Utilization

Catalog    Expression
Detail     sum(container_memory_working_set_bytes{container_name!="POD",namespace="$namespace",pod_name="$podName",container_name!=""}) by (container_name)
Summary    sum(container_memory_working_set_bytes{container_name!="POD",namespace="$namespace",pod_name="$podName",container_name!=""})

 

Pod Network Packets

Catalog    Expression
Detail
receive-packets    sum(rate(container_network_receive_packets_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))
receive-dropped    sum(rate(container_network_receive_packets_dropped_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))
receive-errors    sum(rate(container_network_receive_errors_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))
transmit-packets    sum(rate(container_network_transmit_packets_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))
transmit-dropped    sum(rate(container_network_transmit_packets_dropped_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))
transmit-errors    sum(rate(container_network_transmit_errors_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))
Summary
receive-packets    sum(rate(container_network_receive_packets_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))
receive-dropped    sum(rate(container_network_receive_packets_dropped_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))
receive-errors    sum(rate(container_network_receive_errors_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))
transmit-packets    sum(rate(container_network_transmit_packets_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))
transmit-dropped    sum(rate(container_network_transmit_packets_dropped_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))
transmit-errors    sum(rate(container_network_transmit_errors_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))

 

Container Metrics

Container CPU Utilization

Catalog    Expression
cfs throttled seconds    sum(rate(container_cpu_cfs_throttled_seconds_total{namespace="$namespace",pod_name="$podName",container_name="$containerName"}[5m]))
usage seconds    sum(rate(container_cpu_usage_seconds_total{namespace="$namespace",pod_name="$podName",container_name="$containerName"}[5m]))
system seconds    sum(rate(container_cpu_system_seconds_total{namespace="$namespace",pod_name="$podName",container_name="$containerName"}[5m]))
user seconds    sum(rate(container_cpu_user_seconds_total{namespace="$namespace",pod_name="$podName",container_name="$containerName"}[5m]))

 

Container Memory Utilization

sum(container_memory_working_set_bytes{namespace="$namespace",pod_name="$podName",container_name="$containerName"})

 

Container Disk I/O

Catalog    Expression
read    sum(rate(container_fs_reads_bytes_total{namespace="$namespace",pod_name="$podName",container_name="$containerName"}[5m]))
write    sum(rate(container_fs_writes_bytes_total{namespace="$namespace",pod_name="$podName",container_name="$containerName"}[5m]))

 

Source

https://docs.rancher.cn/docs/rancher2.5/monitoring-alerting/expression/_index

 

posted @ 2022-12-09 15:24  fengjian1585