Rancher Prometheus Monitoring
Cluster Metrics
Cluster CPU Utilization
Catalog Expression
Detail 1 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance))
Summary 1 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])))
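To make the CPU expression concrete, here is a minimal Python sketch of what 1 - avg(irate(idle)) computes. It assumes each CPU core exposes an idle-seconds counter and that irate() uses only the last two samples in the window; counter resets are ignored, and the sample data and function names are hypothetical.

```python
def irate(samples):
    """Per-second rate from the last two (timestamp, value) samples.

    Simplified stand-in for PromQL irate(); counter resets are not handled.
    """
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    return (v1 - v0) / (t1 - t0)


def cpu_utilization(idle_series):
    """1 - average idle rate across all CPU cores of one instance."""
    rates = [irate(samples) for samples in idle_series]
    return 1 - sum(rates) / len(rates)


# Hypothetical idle counters for a two-core node, sampled 10 s apart.
node = [
    [(0, 100.0), (10, 109.0)],  # core 0 was idle for 9 of the 10 seconds
    [(0, 100.0), (10, 105.0)],  # core 1 was idle for 5 of the 10 seconds
]
print(cpu_utilization(node))
```

Because the idle counter grows by at most one second per second per core, the average idle rate stays between 0 and 1, so the result reads directly as a utilization fraction.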
Cluster Load Average
Catalog Expression
Detail
load1 sum(node_load1) by (instance) / count(node_cpu_seconds_total{mode="system"}) by (instance)
load5 sum(node_load5) by (instance) / count(node_cpu_seconds_total{mode="system"}) by (instance)
load15 sum(node_load15) by (instance) / count(node_cpu_seconds_total{mode="system"}) by (instance)
Summary
load1 sum(node_load1) by (instance) / count(node_cpu_seconds_total{mode="system"})
load5 sum(node_load5) by (instance) / count(node_cpu_seconds_total{mode="system"})
load15 sum(node_load15) by (instance) / count(node_cpu_seconds_total{mode="system"})
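The per-instance load expressions divide the load average by the number of CPU cores, which is obtained by counting the mode="system" series (one per core). A rough Python sketch of that normalization, with hypothetical label dictionaries standing in for the Prometheus series:

```python
def normalized_load(load, cpu_series):
    """Divide each instance's load average by its CPU core count.

    load: {instance: load average}
    cpu_series: label sets of node_cpu_seconds_total series.
    Only mode="system" series are counted, so each core counts once.
    """
    cores = {}
    for labels in cpu_series:
        if labels.get("mode") == "system":
            instance = labels["instance"]
            cores[instance] = cores.get(instance, 0) + 1
    return {inst: load[inst] / n for inst, n in cores.items() if inst in load}


# Hypothetical data: node-a has 4 cores, node-b has 2.
series = [{"instance": "node-a", "cpu": str(i), "mode": m}
          for i in range(4) for m in ("system", "idle")]
series += [{"instance": "node-b", "cpu": str(i), "mode": m}
           for i in range(2) for m in ("system", "idle")]
print(normalized_load({"node-a": 2.0, "node-b": 1.0}, series))
```

A normalized value near or above 1 suggests that runnable tasks are queuing for CPU time on that instance.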
Cluster Memory Utilization
Catalog Expression
Detail 1 - sum(node_memory_MemAvailable_bytes) by (instance) / sum(node_memory_MemTotal_bytes) by (instance)
Summary 1 - sum(node_memory_MemAvailable_bytes) / sum(node_memory_MemTotal_bytes)
Cluster Disk Utilization
Catalog Expression
Detail (sum(node_filesystem_size_bytes{device!="rootfs"}) by (instance) - sum(node_filesystem_free_bytes{device!="rootfs"}) by (instance)) / sum(node_filesystem_size_bytes{device!="rootfs"}) by (instance)
Summary (sum(node_filesystem_size_bytes{device!="rootfs"}) - sum(node_filesystem_free_bytes{device!="rootfs"})) / sum(node_filesystem_size_bytes{device!="rootfs"})
Cluster Disk I/O
Catalog Expression
Detail
read sum(rate(node_disk_read_bytes_total[5m])) by (instance)
written sum(rate(node_disk_written_bytes_total[5m])) by (instance)
Summary
read sum(rate(node_disk_read_bytes_total[5m]))
written sum(rate(node_disk_written_bytes_total[5m]))
Node Metrics
Node CPU Utilization
Catalog Expression
Detail avg(irate(node_cpu_seconds_total{mode!="idle", instance=~"$instance"}[5m])) by (mode)
Summary 1 - (avg(irate(node_cpu_seconds_total{mode="idle", instance=~"$instance"}[5m])))
Node Load Average
Catalog Expression
Detail
load1 sum(node_load1{instance=~"$instance"}) / count(node_cpu_seconds_total{mode="system",instance=~"$instance"})
load5 sum(node_load5{instance=~"$instance"}) / count(node_cpu_seconds_total{mode="system",instance=~"$instance"})
load15 sum(node_load15{instance=~"$instance"}) / count(node_cpu_seconds_total{mode="system",instance=~"$instance"})
Summary
load1 sum(node_load1{instance=~"$instance"}) / count(node_cpu_seconds_total{mode="system",instance=~"$instance"})
load5 sum(node_load5{instance=~"$instance"}) / count(node_cpu_seconds_total{mode="system",instance=~"$instance"})
load15 sum(node_load15{instance=~"$instance"}) / count(node_cpu_seconds_total{mode="system",instance=~"$instance"})
Node Memory Utilization
Catalog Expression
Detail 1 - sum(node_memory_MemAvailable_bytes{instance=~"$instance"}) / sum(node_memory_MemTotal_bytes{instance=~"$instance"})
Summary 1 - sum(node_memory_MemAvailable_bytes{instance=~"$instance"}) / sum(node_memory_MemTotal_bytes{instance=~"$instance"})
Node Disk Utilization
Catalog Expression
Detail (sum(node_filesystem_size_bytes{device!="rootfs",instance=~"$instance"}) by (device) - sum(node_filesystem_free_bytes{device!="rootfs",instance=~"$instance"}) by (device)) / sum(node_filesystem_size_bytes{device!="rootfs",instance=~"$instance"}) by (device)
Summary (sum(node_filesystem_size_bytes{device!="rootfs",instance=~"$instance"}) - sum(node_filesystem_free_bytes{device!="rootfs",instance=~"$instance"})) / sum(node_filesystem_size_bytes{device!="rootfs",instance=~"$instance"})
Node Disk I/O
Catalog Expression
Detail
read sum(rate(node_disk_read_bytes_total{instance=~"$instance"}[5m]))
written sum(rate(node_disk_written_bytes_total{instance=~"$instance"}[5m]))
Summary
read sum(rate(node_disk_read_bytes_total{instance=~"$instance"}[5m]))
written sum(rate(node_disk_written_bytes_total{instance=~"$instance"}[5m]))
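The node expressions contain a $instance placeholder that has to be filled in before the query reaches Prometheus. The sketch below shows one way to do that substitution and build a request URL for the Prometheus HTTP API instant-query endpoint (/api/v1/query). The base URL, the instance value, and the function name are assumptions for illustration; no request is sent.

```python
from string import Template
from urllib.parse import urlencode

# Node memory utilization expression from the table above.
NODE_MEMORY = (
    '1 - sum(node_memory_MemAvailable_bytes{instance=~"$instance"}) '
    '/ sum(node_memory_MemTotal_bytes{instance=~"$instance"})'
)


def build_query_url(base_url, expression, **variables):
    """Fill $name placeholders and return an instant-query URL."""
    # safe_substitute leaves unknown placeholders untouched instead of raising.
    promql = Template(expression).safe_substitute(**variables)
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


# Hypothetical Prometheus address and node exporter target.
url = build_query_url("http://prometheus.example:9090", NODE_MEMORY,
                      instance="10.0.0.5:9796")
print(url)
```

Since the matcher is =~, the substituted value is treated as a regular expression, so a value such as "10.0.0.5:.*" would select every exporter port on that host.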
Etcd Metrics
Etcd Leader
max(etcd_server_has_leader)
Number of Leader Changes
max(etcd_server_leader_changes_seen_total)
Number of Failed Proposals
sum(etcd_server_proposals_failed_total)
GRPC Client Traffic
Catalog Expression
Detail
in sum(rate(etcd_network_client_grpc_received_bytes_total[5m])) by (instance)
out sum(rate(etcd_network_client_grpc_sent_bytes_total[5m])) by (instance)
Summary
in sum(rate(etcd_network_client_grpc_received_bytes_total[5m]))
out sum(rate(etcd_network_client_grpc_sent_bytes_total[5m]))
Peer Traffic
Catalog Expression
Detail
in sum(rate(etcd_network_peer_received_bytes_total[5m])) by (instance)
out sum(rate(etcd_network_peer_sent_bytes_total[5m])) by (instance)
Summary
in sum(rate(etcd_network_peer_received_bytes_total[5m]))
out sum(rate(etcd_network_peer_sent_bytes_total[5m]))
DB Size
Catalog Expression
Detail sum(etcd_debugging_mvcc_db_total_size_in_bytes) by (instance)
Summary sum(etcd_debugging_mvcc_db_total_size_in_bytes)
RPC Rate
Catalog Expression
Detail
total sum(rate(grpc_server_started_total{grpc_type="unary"}[5m])) by (instance)
fail sum(rate(grpc_server_handled_total{grpc_type="unary",grpc_code!="OK"}[5m])) by (instance)
Summary
total sum(rate(grpc_server_started_total{grpc_type="unary"}[5m]))
fail sum(rate(grpc_server_handled_total{grpc_type="unary",grpc_code!="OK"}[5m]))
Disk Operations
Catalog Expression
Detail
commit-called-by-backend sum(rate(etcd_disk_backend_commit_duration_seconds_sum[1m])) by (instance)
fsync-called-by-wal sum(rate(etcd_disk_wal_fsync_duration_seconds_sum[1m])) by (instance)
Summary
commit-called-by-backend sum(rate(etcd_disk_backend_commit_duration_seconds_sum[1m]))
fsync-called-by-wal sum(rate(etcd_disk_wal_fsync_duration_seconds_sum[1m]))
Disk Sync Duration
Catalog Expression
Detail
wal histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) by (instance, le))
db histogram_quantile(0.99, sum(rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) by (instance, le))
Summary
wal sum(histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) by (instance, le)))
db sum(histogram_quantile(0.99, sum(rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) by (instance, le)))
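The disk sync expressions rely on histogram_quantile(), which estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The following is a simplified Python sketch of that idea, not the Prometheus implementation itself; it assumes at least two buckets, a final +Inf bucket, and 0 < q < 1. The bucket values are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) buckets."""
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket holds the total count
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower, prev_count = buckets[index - 1] if index > 0 else (0.0, 0.0)
    if count == prev_count:
        return lower
    # Linear interpolation inside the selected bucket.
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)


# Hypothetical fsync latency buckets in seconds: (le, cumulative count).
fsync = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(histogram_quantile(0.99, fsync))
```

The estimate can only be as precise as the bucket boundaries, which is why the by (instance, le) grouping must keep the le label when the buckets are summed.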
Kubernetes Components Metrics
API Server Request Latency
Catalog Expression
Detail avg(apiserver_request_latencies_sum / apiserver_request_latencies_count) by (instance, verb) /1e+06
Summary avg(apiserver_request_latencies_sum / apiserver_request_latencies_count) by (instance) /1e+06
API Server Request Rate
Catalog Expression
Detail sum(rate(apiserver_request_count[5m])) by (instance, code)
Summary sum(rate(apiserver_request_count[5m])) by (instance)
Scheduling Failed Pods
Catalog Expression
Detail sum(kube_pod_status_scheduled{condition="false"})
Summary sum(kube_pod_status_scheduled{condition="false"})
Controller Manager Queue Depths
Catalog Expression
Detail
volumes sum(volumes_depth) by (instance)
deployment sum(deployment_depth) by (instance)
replicaset sum(replicaset_depth) by (instance)
service sum(service_depth) by (instance)
serviceaccount sum(serviceaccount_depth) by (instance)
endpoint sum(endpoint_depth) by (instance)
daemonset sum(daemonset_depth) by (instance)
statefulset sum(statefulset_depth) by (instance)
replicationmanager sum(replicationmanager_depth) by (instance)
Summary
volumes sum(volumes_depth)
deployment sum(deployment_depth)
replicaset sum(replicaset_depth)
service sum(service_depth)
serviceaccount sum(serviceaccount_depth)
endpoint sum(endpoint_depth)
daemonset sum(daemonset_depth)
statefulset sum(statefulset_depth)
replicationmanager sum(replicationmanager_depth)
Scheduler E2E Scheduling Latency
Catalog Expression
Detail histogram_quantile(0.99, sum(scheduler_e2e_scheduling_latency_microseconds_bucket) by (le, instance)) / 1e+06
Summary sum(histogram_quantile(0.99, sum(scheduler_e2e_scheduling_latency_microseconds_bucket) by (le, instance)) / 1e+06)
Scheduler Preemption Attempts
Catalog Expression
Detail sum(rate(scheduler_total_preemption_attempts[5m])) by (instance)
Summary sum(rate(scheduler_total_preemption_attempts[5m]))
Ingress Controller Connections
Catalog Expression
Detail
reading sum(nginx_ingress_controller_nginx_process_connections{state="reading"}) by (instance)
waiting sum(nginx_ingress_controller_nginx_process_connections{state="waiting"}) by (instance)
writing sum(nginx_ingress_controller_nginx_process_connections{state="writing"}) by (instance)
accepted sum(ceil(increase(nginx_ingress_controller_nginx_process_connections_total{state="accepted"}[5m]))) by (instance)
active sum(ceil(increase(nginx_ingress_controller_nginx_process_connections_total{state="active"}[5m]))) by (instance)
handled sum(ceil(increase(nginx_ingress_controller_nginx_process_connections_total{state="handled"}[5m]))) by (instance)
Summary
reading sum(nginx_ingress_controller_nginx_process_connections{state="reading"})
waiting sum(nginx_ingress_controller_nginx_process_connections{state="waiting"})
writing sum(nginx_ingress_controller_nginx_process_connections{state="writing"})
accepted sum(ceil(increase(nginx_ingress_controller_nginx_process_connections_total{state="accepted"}[5m])))
active sum(ceil(increase(nginx_ingress_controller_nginx_process_connections_total{state="active"}[5m])))
handled sum(ceil(increase(nginx_ingress_controller_nginx_process_connections_total{state="handled"}[5m])))
Ingress Controller Request Process Time
Catalog Expression
Detail topk(10, histogram_quantile(0.95,sum by (le, host, path)(rate(nginx_ingress_controller_request_duration_seconds_bucket{host!="_"}[5m]))))
Summary topk(10, histogram_quantile(0.95,sum by (le, host)(rate(nginx_ingress_controller_request_duration_seconds_bucket{host!="_"}[5m]))))
Workload Metrics
Workload CPU Utilization
Catalog Expression
Detail
cfs throttled seconds sum(rate(container_cpu_cfs_throttled_seconds_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)
user seconds sum(rate(container_cpu_user_seconds_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)
system seconds sum(rate(container_cpu_system_seconds_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)
usage seconds sum(rate(container_cpu_usage_seconds_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)
Summary
cfs throttled seconds sum(rate(container_cpu_cfs_throttled_seconds_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))
user seconds sum(rate(container_cpu_user_seconds_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))
system seconds sum(rate(container_cpu_system_seconds_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))
usage seconds sum(rate(container_cpu_usage_seconds_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))
Workload Memory Utilization
Catalog Expression
Detail sum(container_memory_working_set_bytes{namespace="$namespace",pod_name=~"$podName", container_name!=""}) by (pod_name)
Summary sum(container_memory_working_set_bytes{namespace="$namespace",pod_name=~"$podName", container_name!=""})
Workload Network Packets
Catalog Expression
Detail
receive-packets sum(rate(container_network_receive_packets_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)
receive-dropped sum(rate(container_network_receive_packets_dropped_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)
receive-errors sum(rate(container_network_receive_errors_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)
transmit-packets sum(rate(container_network_transmit_packets_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)
transmit-dropped sum(rate(container_network_transmit_packets_dropped_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)
transmit-errors sum(rate(container_network_transmit_errors_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)
Summary
receive-packets sum(rate(container_network_receive_packets_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))
receive-dropped sum(rate(container_network_receive_packets_dropped_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))
receive-errors sum(rate(container_network_receive_errors_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))
transmit-packets sum(rate(container_network_transmit_packets_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))
transmit-dropped sum(rate(container_network_transmit_packets_dropped_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))
transmit-errors sum(rate(container_network_transmit_errors_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))
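The workload expressions select pods with the regex matcher pod_name=~"$podName", while the Pod expressions further down use the exact matcher pod_name="$podName". PromQL regex matchers are fully anchored, so a pattern must match the whole label value. The sketch below imitates the two matcher types in Python; it assumes $podName expands to an alternation of pod names, and Python's re module only approximates the RE2 syntax Prometheus uses.

```python
import re


def label_matches(operator, pattern, value):
    """Imitate the PromQL '=' and '=~' label matchers."""
    if operator == "=":
        return value == pattern
    if operator == "=~":
        # PromQL anchors regexes at both ends, hence fullmatch.
        return re.fullmatch(pattern, value) is not None
    raise ValueError(f"unsupported operator: {operator}")


# Hypothetical pod names belonging to one workload.
pod_regex = "web-7d9f-abcde|web-7d9f-fghij"
for pod in ("web-7d9f-abcde", "web-7d9f-abcde-debug", "db-0"):
    print(pod, label_matches("=~", pod_regex, pod))
```

Because of the anchoring, "web-7d9f-abcde-debug" is not selected even though it starts with one of the listed names.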
Pod Metrics
Pod CPU Utilization
Catalog Expression
Detail
cfs throttled seconds sum(rate(container_cpu_cfs_throttled_seconds_total{container_name!="POD",namespace="$namespace",pod_name="$podName", container_name!=""}[5m])) by (container_name)
usage seconds sum(rate(container_cpu_usage_seconds_total{container_name!="POD",namespace="$namespace",pod_name="$podName", container_name!=""}[5m])) by (container_name)
system seconds sum(rate(container_cpu_system_seconds_total{container_name!="POD",namespace="$namespace",pod_name="$podName", container_name!=""}[5m])) by (container_name)
user seconds sum(rate(container_cpu_user_seconds_total{container_name!="POD",namespace="$namespace",pod_name="$podName", container_name!=""}[5m])) by (container_name)
Summary
cfs throttled seconds sum(rate(container_cpu_cfs_throttled_seconds_total{container_name!="POD",namespace="$namespace",pod_name="$podName", container_name!=""}[5m]))
usage seconds sum(rate(container_cpu_usage_seconds_total{container_name!="POD",namespace="$namespace",pod_name="$podName", container_name!=""}[5m]))
system seconds sum(rate(container_cpu_system_seconds_total{container_name!="POD",namespace="$namespace",pod_name="$podName", container_name!=""}[5m]))
user seconds sum(rate(container_cpu_user_seconds_total{container_name!="POD",namespace="$namespace",pod_name="$podName", container_name!=""}[5m]))
Pod Memory Utilization
Catalog Expression
Detail sum(container_memory_working_set_bytes{container_name!="POD",namespace="$namespace",pod_name="$podName",container_name!=""}) by (container_name)
Summary sum(container_memory_working_set_bytes{container_name!="POD",namespace="$namespace",pod_name="$podName",container_name!=""})
Pod Network Packets
Catalog Expression
Detail
receive-packets sum(rate(container_network_receive_packets_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))
receive-dropped sum(rate(container_network_receive_packets_dropped_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))
receive-errors sum(rate(container_network_receive_errors_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))
transmit-packets sum(rate(container_network_transmit_packets_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))
transmit-dropped sum(rate(container_network_transmit_packets_dropped_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))
transmit-errors sum(rate(container_network_transmit_errors_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))
Summary
receive-packets sum(rate(container_network_receive_packets_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))
receive-dropped sum(rate(container_network_receive_packets_dropped_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))
receive-errors sum(rate(container_network_receive_errors_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))
transmit-packets sum(rate(container_network_transmit_packets_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))
transmit-dropped sum(rate(container_network_transmit_packets_dropped_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))
transmit-errors sum(rate(container_network_transmit_errors_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))
Container Metrics
Container CPU Utilization
Catalog Expression
cfs throttled seconds sum(rate(container_cpu_cfs_throttled_seconds_total{namespace="$namespace",pod_name="$podName",container_name="$containerName"}[5m]))
usage seconds sum(rate(container_cpu_usage_seconds_total{namespace="$namespace",pod_name="$podName",container_name="$containerName"}[5m]))
system seconds sum(rate(container_cpu_system_seconds_total{namespace="$namespace",pod_name="$podName",container_name="$containerName"}[5m]))
user seconds sum(rate(container_cpu_user_seconds_total{namespace="$namespace",pod_name="$podName",container_name="$containerName"}[5m]))
Container Memory Utilization
sum(container_memory_working_set_bytes{namespace="$namespace",pod_name="$podName",container_name="$containerName"})
Container Disk I/O
Catalog Expression
read sum(rate(container_fs_reads_bytes_total{namespace="$namespace",pod_name="$podName",container_name="$containerName"}[5m]))
write sum(rate(container_fs_writes_bytes_total{namespace="$namespace",pod_name="$podName",container_name="$containerName"}[5m]))
Excerpted from
https://docs.rancher.cn/docs/rancher2.5/monitoring-alerting/expression/_index