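Zabbix 6: notes on Kubernetes monitoring metrics
kube.pod.status.phase is not reliable: in testing, when a pod was in CrashLoopBackOff the value was still running (2). For that reason, do not use the item prototype Namespace [{#NAMESPACE}] Pod [{#POD}] Status: Phase from the Node discovery rule in the Kubernetes_test nodes by HTTP template.
The triggers in the Zabbix template define only problem expressions and no recovery expressions, so the recovery expressions have to be added by hand.
I. Deployment metrics
1.1 Problem: Deployment replicas mismatch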
min(/Kubernetes_test cluster state by HTTP/kube.deployment.replicas_mismatched[{#NAMESPACE}/{#NAME}],{$KUBE.REPLICA.MISMATCH.EVAL_PERIOD:"deployment:{#NAMESPACE}:{#NAME}"})>0
and last(/Kubernetes_test cluster state by HTTP/kube.deployment.replicas_desired[{#NAMESPACE}/{#NAME}])>=0
and last(/Kubernetes_test cluster state by HTTP/kube.deployment.replicas_available[{#NAMESPACE}/{#NAME}])>=0
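Explanation:
1)min(/Kubernetes_test cluster state by HTTP/kube.deployment.replicas_mismatched[{#NAMESPACE}/{#NAME}],{$KUBE.REPLICA.MISMATCH.EVAL_PERIOD:"deployment:{#NAMESPACE}:{#NAME}"})>0
kube.deployment.replicas_mismatched is the number of replicas by which the deployment deviates from the desired count. {$KUBE.REPLICA.MISMATCH.EVAL_PERIOD} is a macro defined in the template and set to #5, i.e. the last 5 collected values. The server default collection interval is 30s, but the master item Kubernetes: Get state metrics sets an interval of 1m, which overrides the default, so 5 collection cycles span 5 minutes.
The detection window can be tuned per deployment through the context of {$KUBE.REPLICA.MISMATCH.EVAL_PERIOD}. For example, to use 5 minutes for every deployment: {$KUBE.REPLICA.MISMATCH.EVAL_PERIOD:regex:"deployment:.*:.*"} = #5. To use 3 minutes for the deployment named nginx in the default namespace: {$KUBE.REPLICA.MISMATCH.EVAL_PERIOD:"deployment:default:nginx"} = #3.
So the first clause means: the minimum replica mismatch over the last 5 minutes is greater than 0, i.e. the mismatch persisted for the whole window.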
2)last(/Kubernetes_test cluster state by HTTP/kube.deployment.replicas_desired[{#NAMESPACE}/{#NAME}])>=0
kube.deployment.replicas_desired is the number of replicas the deployment wants; it must be greater than or equal to 0.
3)last(/Kubernetes_test cluster state by HTTP/kube.deployment.replicas_available[{#NAMESPACE}/{#NAME}])>=0
kube.deployment.replicas_available is the number of available replicas of the deployment; it must be greater than or equal to 0.
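The macro-with-context lookup described above can be sketched in Python. This is only an illustration of the idea, not Zabbix code: the MACROS table, the resolve_eval_period helper, and the lookup order (exact context first, then regex context, then the plain macro) are assumptions based on my reading of how Zabbix resolves user macros with context, and the unanchored regex match is likewise an assumption.

```python
import re

# Hypothetical macro table mirroring the two overrides described above.
MACROS = {
    "default": "#5",                                   # plain macro value
    "exact": {"deployment:default:nginx": "#3"},       # static context
    "regex": {r"deployment:.*:.*": "#5"},              # regex context
}


def resolve_eval_period(namespace, name, macros=MACROS):
    """Return the evaluation period for one deployment.

    Assumed order: exact context, then regex context, then the plain macro.
    """
    context = f"deployment:{namespace}:{name}"
    if context in macros["exact"]:
        return macros["exact"][context]
    for pattern, value in macros["regex"].items():
        # Assumption: the regex may match anywhere in the context string.
        if re.search(pattern, context):
            return value
    return macros["default"]


if __name__ == "__main__":
    print(resolve_eval_period("default", "nginx"))
    print(resolve_eval_period("kube-system", "coredns"))
```

With the values from the text, nginx in the default namespace is checked over 3 collection cycles and every other deployment over 5.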
1.2 Recovery: Deployment replicas mismatch
max(/Kubernetes_test cluster state by HTTP/kube.deployment.replicas_mismatched[{#NAMESPACE}/{#NAME}],{$KUBE.REPLICA.MISMATCH.EVAL_PERIOD:"deployment:{#NAMESPACE}:{#NAME}"})=0
and last(/Kubernetes_test cluster state by HTTP/kube.deployment.replicas_desired[{#NAMESPACE}/{#NAME}])>=0
and last(/Kubernetes_test cluster state by HTTP/kube.deployment.replicas_available[{#NAMESPACE}/{#NAME}])>=0
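Explanation:
1)max(/Kubernetes_test cluster state by HTTP/kube.deployment.replicas_mismatched[{#NAMESPACE}/{#NAME}],{$KUBE.REPLICA.MISMATCH.EVAL_PERIOD:"deployment:{#NAMESPACE}:{#NAME}"})=0
The maximum replica mismatch of the deployment over the last 5 minutes is 0.
2)kube.deployment.replicas_desired is the number of replicas the deployment wants; it must be greater than or equal to 0.
3)kube.deployment.replicas_available is the number of available replicas of the deployment; it must be greater than or equal to 0.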
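To see how the problem and recovery expressions interact, here is a minimal Python sketch that evaluates both over a window of samples. The function names and the sample lists are hypothetical; the only thing taken from the text is the logic of min(...)>0 for the problem and max(...)=0 for the recovery.

```python
def is_problem(mismatched, desired, available):
    """Problem expression: the mismatch was above 0 for every sample."""
    return min(mismatched) > 0 and desired >= 0 and available >= 0


def is_recovered(mismatched, desired, available):
    """Recovery expression: the mismatch was 0 for every sample."""
    return max(mismatched) == 0 and desired >= 0 and available >= 0


def next_state(in_problem, mismatched, desired, available):
    """Return True if the trigger is in PROBLEM state after this evaluation."""
    if not in_problem:
        return is_problem(mismatched, desired, available)
    return not is_recovered(mismatched, desired, available)


if __name__ == "__main__":
    # Five samples, oldest first, as with an evaluation period of #5.
    print(next_state(False, [1, 1, 2, 1, 1], desired=3, available=2))
    print(next_state(True, [1, 0, 0, 0, 0], desired=3, available=3))
    print(next_state(True, [0, 0, 0, 0, 0], desired=3, available=3))
```

A window that mixes zero and non-zero samples satisfies neither expression, so the trigger keeps whatever state it had. That is the point of using min for the problem and max for the recovery: a brief flap does not open or close the alert.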
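II. Pod metrics
1. Problem: Pod is unhealthy
Over 10 minutes, the minimum of the pod's failed phase is greater than 0, or the minimum of its pending phase is greater than 0, or the minimum of its unknown phase is greater than 0. In short: the minimum of some abnormal phase over 10 minutes is greater than 0.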
min(/Kubernetes cluster state by HTTP/kube.pod.phase.failed[{#NAMESPACE}/{#NAME}],10m)>0 or min(/Kubernetes cluster state by HTTP/kube.pod.phase.pending[{#NAMESPACE}/{#NAME}],10m)>0 or min(/Kubernetes cluster state by HTTP/kube.pod.phase.unknown[{#NAMESPACE}/{#NAME}],10m)>0
Recovery
Over 5 minutes, the minimum of the pod's running phase or of its succeeded phase is greater than or equal to 1.
min(/Kubernetes cluster state by HTTP/kube.pod.phase.running[{#NAMESPACE}/{#NAME}],5m)>=1 or min(/Kubernetes cluster state by HTTP/kube.pod.phase.succeeded[{#NAMESPACE}/{#NAME}],5m)>=1
2. Problem: Pod is crash looping
Within 15 minutes, the latest restart count minus the minimum restart count is greater than 1.
(last(/Kubernetes cluster state by HTTP/kube.pod.containers_restarts[{#NAMESPACE}/{#NAME}])-min(/Kubernetes cluster state by HTTP/kube.pod.containers_restarts[{#NAMESPACE}/{#NAME}],15m))>1
Recovery
Within 15 minutes, the latest restart count minus the minimum restart count equals 0.
(last(/Kubernetes cluster state by HTTP/kube.pod.containers_restarts[{#NAMESPACE}/{#NAME}])-min(/Kubernetes cluster state by HTTP/kube.pod.containers_restarts[{#NAMESPACE}/{#NAME}],15m))=0
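The restart arithmetic in the two expressions above can be checked with a small Python sketch. The helper names and sample values are hypothetical; samples stands for the restart counts collected in the last 15 minutes, oldest first.

```python
def restart_delta(samples):
    """Latest restart count minus the minimum count in the window."""
    return samples[-1] - min(samples)


def crash_loop_problem(samples):
    """Problem expression: more than one restart within the window."""
    return restart_delta(samples) > 1


def crash_loop_recovered(samples):
    """Recovery expression: no restart within the window."""
    return restart_delta(samples) == 0


if __name__ == "__main__":
    window = [4, 4, 5, 6, 7]  # three restarts inside the window
    print(restart_delta(window), crash_loop_problem(window))
```

Note that a delta of exactly 1 matches neither expression: a single restart does not raise the alert, and an open alert only closes after a full window with no restarts.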
III. StatefulSet metrics
Problem: StatefulSet is down
The latest ready replica count divided by the latest current replica count is not 1.
(last(/Kubernetes cluster state by HTTP/kube.statefulset.replicas_ready[{#NAMESPACE}/{#NAME}]) / last(/Kubernetes cluster state by HTTP/kube.statefulset.replicas_current[{#NAMESPACE}/{#NAME}]))<>1
Recovery
The latest ready replica count divided by the latest current replica count is 1.
(last(/Kubernetes cluster state by HTTP/kube.statefulset.replicas_ready[{#NAMESPACE}/{#NAME}]) / last(/Kubernetes cluster state by HTTP/kube.statefulset.replicas_current[{#NAMESPACE}/{#NAME}]))=1
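The ratio check can be sketched in Python as follows. The function name and the state labels are hypothetical. One edge case is worth keeping in mind: when replicas_current is 0 the division has no result. My understanding is that Zabbix cannot evaluate such an expression, so the sketch models that case as None; treat this as an assumption rather than documented behaviour.

```python
def statefulset_state(ready, current):
    """Classify a StatefulSet from its ready and current replica counts.

    Returns "PROBLEM", "OK", or None when the ratio cannot be computed.
    """
    if current == 0:
        # Assumption: division by zero leaves the trigger unevaluated.
        return None
    return "OK" if ready / current == 1 else "PROBLEM"


if __name__ == "__main__":
    print(statefulset_state(ready=2, current=3))
    print(statefulset_state(ready=3, current=3))
    print(statefulset_state(ready=0, current=0))
```

For a StatefulSet that is intentionally scaled to 0, neither the problem nor the recovery expression yields a usable result under this assumption.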