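>>> Contents <<<
1. Overview
2. Structure Analysis
3. Prometheus Configuration
4. PrometheusRule Configuration
5. Adding External Monitoring
6. Scheduler and Controller Configuration
7. Alertmanager Configuration
8. Persisting Monitoring Data
9. Grafana Dashboard Configuration
10. Summary
>>> Main Text <<<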
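1. Overview
The overall Prometheus monitoring stack is fairly complex, and deploying its pieces one by one is not simple. In addition, monitoring Kubernetes requires access to cluster-internal data, which means going through authentication, authorization, and admission control.
Doing all of this by hand is difficult and time-consuming, so unless you have very specific requirements, I recommend choosing one of the well-maintained open-source solutions.
I won't go into a detailed introduction to Prometheus here; see my other post: "Kubernetes in Practice - Prometheus Deployment (v0.3.0)".
This post focuses on the configuration involved in deploying Prometheus on Kubernetes. I use the open-source deployment project hosted on GitHub, kube-prometheus.
kube-prometheus is probably the best open-source option at the moment. The repository collects Kubernetes manifests, Grafana dashboards, and Prometheus rules, along with documentation and scripts,
to provide easy-to-operate, end-to-end Kubernetes cluster monitoring with Prometheus using the Prometheus Operator. Everything is deployed into the k8s cluster as containers, and the configuration can be customized, which is very convenient.
Note: I use kubernetes-1.17.5 + release-0.3. Because of network issues, I have changed all image addresses.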
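2. Structure Analysis
The kube-prometheus deployment files live in the manifests directory, 65 YAML files in total. The setup folder contains all of the CustomResourceDefinition configuration (there is usually no need to change it, and it should not be changed lightly), so this folder must be applied first when deploying.
It defines three kinds of resources: alerting (Alertmanager), monitoring (Prometheus), and monitoring rules (PrometheusRule). Because the Prometheus Operator generates the underlying controllers from these custom resources, editing a controller directly in k8s has no lasting effect (for example, kubectl edit sts prometheus-k8s -n monitoring).
There appear to be a lot of YAML files, but they are easy to understand once sorted. They fall into 7 components: prometheus-operator, prometheus-adapter, prometheus, alertmanager, grafana, kube-state-metrics, and node-exporter.
Each component defines a controller, configuration files, cluster permissions, access configuration, and monitoring configuration. Usually we only need to customize the alerting configuration and the monitoring rules, so after filtering, only a few files need to change (the red entries are covered in detail later, and the purple entries can have their resource settings adjusted to suit the project).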
[root@ymt108 manifests]# tree .
├── alertmanager-alertmanager.yaml
├── alertmanager-secret.yaml                 # alerting configuration
├── alertmanager-serviceAccount.yaml
├── alertmanager-serviceMonitor.yaml
├── alertmanager-service.yaml
├── grafana-dashboardDatasources.yaml
├── grafana-dashboardDefinitions.yaml
├── grafana-dashboardSources.yaml
├── grafana-deployment.yaml
├── grafana-serviceAccount.yaml
├── grafana-serviceMonitor.yaml
├── grafana-service.yaml
├── kube-state-metrics-clusterRoleBinding.yaml
├── kube-state-metrics-clusterRole.yaml
├── kube-state-metrics-deployment.yaml
├── kube-state-metrics-roleBinding.yaml
├── kube-state-metrics-role.yaml
├── kube-state-metrics-serviceAccount.yaml
├── kube-state-metrics-serviceMonitor.yaml
├── kube-state-metrics-service.yaml
├── node-exporter-clusterRoleBinding.yaml
├── node-exporter-clusterRole.yaml
├── node-exporter-daemonset.yaml
├── node-exporter-serviceAccount.yaml
├── node-exporter-serviceMonitor.yaml
├── node-exporter-service.yaml
├── prometheus-adapter-apiService.yaml
├── prometheus-adapter-clusterRoleAggregatedMetricsReader.yaml
├── prometheus-adapter-clusterRoleBindingDelegator.yaml
├── prometheus-adapter-clusterRoleBinding.yaml
├── prometheus-adapter-clusterRoleServerResources.yaml
├── prometheus-adapter-clusterRole.yaml
├── prometheus-adapter-configMap.yaml
├── prometheus-adapter-deployment.yaml
├── prometheus-adapter-roleBindingAuthReader.yaml
├── prometheus-adapter-serviceAccount.yaml
├── prometheus-adapter-service.yaml
├── prometheus-clusterRoleBinding.yaml
├── prometheus-clusterRole.yaml
├── prometheus-operator-serviceMonitor.yaml
├── prometheus-prometheus.yaml               # monitoring configuration
├── prometheus-roleBindingConfig.yaml
├── prometheus-roleBindingSpecificNamespaces.yaml
├── prometheus-roleConfig.yaml
├── prometheus-roleSpecificNamespaces.yaml
├── prometheus-rules.yaml                    # default monitoring rules
├── prometheus-serviceAccount.yaml
├── prometheus-serviceMonitorApiserver.yaml
├── prometheus-serviceMonitorCoreDNS.yaml
├── prometheus-serviceMonitorKubeControllerManager.yaml
├── prometheus-serviceMonitorKubelet.yaml
├── prometheus-serviceMonitorKubeScheduler.yaml
├── prometheus-serviceMonitor.yaml
├── prometheus-service.yaml
└── setup
    ├── 0namespace-namespace.yaml
    ├── prometheus-operator-0alertmanagerCustomResourceDefinition.yaml
    ├── prometheus-operator-0podmonitorCustomResourceDefinition.yaml
    ├── prometheus-operator-0prometheusCustomResourceDefinition.yaml
    ├── prometheus-operator-0prometheusruleCustomResourceDefinition.yaml
    ├── prometheus-operator-0servicemonitorCustomResourceDefinition.yaml
    ├── prometheus-operator-clusterRoleBinding.yaml
    ├── prometheus-operator-clusterRole.yaml
    ├── prometheus-operator-deployment.yaml
    ├── prometheus-operator-serviceAccount.yaml
    └── prometheus-operator-service.yaml

1 directories, 65 files
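The grouping into seven components described above can be reproduced mechanically from the file names. The following is only a rough sketch: the helper names component_of and group_by_component are made up for illustration, and it assumes every manifest is named "<component>-<something>.yaml". Because prometheus is itself a prefix of prometheus-operator and prometheus-adapter, the longest matching prefix wins.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# The seven component names listed in the text.
COMPONENTS = [
    "prometheus-operator",
    "prometheus-adapter",
    "prometheus",
    "alertmanager",
    "grafana",
    "kube-state-metrics",
    "node-exporter",
]


def component_of(path: str) -> str:
    """Return the component a manifest belongs to (hypothetical helper).

    The directory part (for example "setup/") is ignored, and the longest
    matching component prefix is chosen so that "prometheus-operator-..."
    is not mistaken for plain "prometheus".
    """
    name = path.rsplit("/", 1)[-1]
    matches = [c for c in COMPONENTS if name.startswith(c + "-")]
    return max(matches, key=len) if matches else "other"


def group_by_component(paths: Iterable[str]) -> Dict[str, List[str]]:
    """Group manifest paths by component, keeping the input order."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for path in paths:
        groups[component_of(path)].append(path)
    return dict(groups)


if __name__ == "__main__":
    sample = [
        "alertmanager-secret.yaml",
        "prometheus-prometheus.yaml",
        "prometheus-rules.yaml",
        "prometheus-adapter-configMap.yaml",
        "setup/prometheus-operator-deployment.yaml",
        "setup/0namespace-namespace.yaml",
    ]
    for component, files in group_by_component(sample).items():
        print(component, files)
```

Files that match none of the seven names, such as the namespace manifest in setup, end up under "other".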
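3. Prometheus Configuration
To preserve the original file, we make a copy of prometheus-prometheus.yaml and modify it as follows:
1) replicas: adjust the number of replicas to suit the project.
2) retention: sets how long Prometheus keeps its data. The default is "24h", and the value must match the regular expression "[0-9]+(ms|s|m|h|d|w|y)".
3) additionalScrapeConfigs: adds extra scrape configuration; see Part 5, "Adding External Monitoring", for the details.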
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  labels:
    prometheus: k8s
  name: k8s
  namespace: monitoring
spec:
  alerting:
    alertmanagers:
    - name: alertmanager-main
      namespace: monitoring
      port: web
  # baseImage: quay.io/prometheus/prometheus
  baseImage: registry.cn-shanghai.aliyuncs.com/leozhanggg/prometheus/prometheus
  additionalScrapeConfigs:
    name: additional-scrape-configs
    key: prometheus-additional.yaml
  retention: 15d
  nodeSelector:
    kubernetes.io/os: linux
  podMonitorSelector: {}
  replicas: 2
  resources:
    requests:
      memory: 400Mi
  ruleSelector:
    matchLabels:
      prometheus: k8s
      role: alert-rules
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  serviceAccountName: prometheus-k8s
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector: {}
  version: v2.11.0
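Before applying the file, it can be handy to check a retention value against the format described in item 2). This is a minimal sketch, assuming the pattern is exactly the regular expression quoted above with the stray spaces removed; is_valid_retention is a name made up for this example and is not part of any Prometheus tooling.

```python
import re

# Pattern quoted in the text for the retention field (assumed exact).
RETENTION_PATTERN = re.compile(r"[0-9]+(ms|s|m|h|d|w|y)")


def is_valid_retention(value: str) -> bool:
    """Return True if the whole string is a number followed by one unit."""
    return RETENTION_PATTERN.fullmatch(value) is not None


if __name__ == "__main__":
    for candidate in ["15d", "24h", "500ms", "15 d", "1.5d", "15days"]:
        print(candidate, is_valid_retention(candidate))
```

fullmatch is used rather than match so that trailing text such as "15days" is rejected instead of being accepted on the strength of its "15d" prefix.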
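4. PrometheusRule Configuration
First, look at the default rules file prometheus-rules.yaml. It contains 76 alert rules and covers most of the commonly monitored points in k8s. Again, to preserve the source file, we make a copy of prometheus-rules.yaml and apply some changes.
The general-rules group conflicted with my own custom rules, so I commented it out. I also added a platform annotation to distinguish environments and translated some of the alert messages into Chinese.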
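Several alerts in the rules file that follows, such as NodeFilesystemSpaceFillingUp and KubePersistentVolumeFullInFourDays, rely on the PromQL function predict_linear, which fits a straight line to the samples in a time window and extrapolates it a given number of seconds ahead. The sketch below only illustrates that idea with an ordinary least-squares fit; it is not Prometheus's actual implementation, and the function signature and sample data are invented for the example.

```python
from typing import Sequence, Tuple


def predict_linear(
    samples: Sequence[Tuple[float, float]],
    now: float,
    horizon_seconds: float,
) -> float:
    """Fit a least-squares line through (timestamp, value) samples and
    return its value at now + horizon_seconds (illustrative only)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must have distinct timestamps")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov / var_t
    intercept = mean_v - slope * mean_t
    return intercept + slope * (now + horizon_seconds)


if __name__ == "__main__":
    hour = 3600.0
    # Invented data: free space drops by 10 units per hour over a 6h window.
    window = [(h * hour, 100.0 - 10.0 * h) for h in range(7)]
    predicted = predict_linear(window, now=6 * hour, horizon_seconds=24 * hour)
    # A negative prediction corresponds to the "< 0" condition in the alert expressions.
    print(predicted < 0)
```

In the alert expressions this check is combined with a threshold on the current free percentage, so a steep but short-lived drop on a mostly empty filesystem does not fire the alert.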
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: k8s
    role: alert-rules
  name: prometheus-k8s-rules
  namespace: monitoring
spec:
  groups:
  - name: node-exporter.rules
    rules:
    - expr: |
        count without (cpu) (
          count without (mode) (
            node_cpu_seconds_total{job="node-exporter"}
          )
        )
      record: instance:node_num_cpu:sum
    - expr: |
        1 - avg without (cpu, mode) (
          rate(node_cpu_seconds_total{job="node-exporter", mode="idle"}[1m])
        )
      record: instance:node_cpu_utilisation:rate1m
    - expr: |
        (
          node_load1{job="node-exporter"}
        /
          instance:node_num_cpu:sum{job="node-exporter"}
        )
      record: instance:node_load1_per_cpu:ratio
    - expr: |
        1 - (
          node_memory_MemAvailable_bytes{job="node-exporter"}
        /
          node_memory_MemTotal_bytes{job="node-exporter"}
        )
      record: instance:node_memory_utilisation:ratio
    - expr: |
        rate(node_vmstat_pgmajfault{job="node-exporter"}[1m])
      record: instance:node_vmstat_pgmajfault:rate1m
    - expr: |
        rate(node_disk_io_time_seconds_total{job="node-exporter", device=~"nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+"}[1m])
      record: instance_device:node_disk_io_time_seconds:rate1m
    - expr: |
        rate(node_disk_io_time_weighted_seconds_total{job="node-exporter", device=~"nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+"}[1m])
      record: instance_device:node_disk_io_time_weighted_seconds:rate1m
    - expr: |
        sum without (device) (
          rate(node_network_receive_bytes_total{job="node-exporter", device!="lo"}[1m])
        )
      record: instance:node_network_receive_bytes_excluding_lo:rate1m
    - expr: |
        sum without (device) (
          rate(node_network_transmit_bytes_total{job="node-exporter", device!="lo"}[1m])
        )
      record: instance:node_network_transmit_bytes_excluding_lo:rate1m
    - expr: |
        sum without (device) (
          rate(node_network_receive_drop_total{job="node-exporter", device!="lo"}[1m])
        )
      record: instance:node_network_receive_drop_excluding_lo:rate1m
    - expr: |
        sum without (device) (
          rate(node_network_transmit_drop_total{job="node-exporter", device!="lo"}[1m])
        )
      record: instance:node_network_transmit_drop_excluding_lo:rate1m
  - name: kube-apiserver.rules
    rules:
    - expr: |
        histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver"}[5m])) without(instance, pod))
      labels:
        quantile: "0.99"
      record: cluster_quantile:apiserver_request_duration_seconds:histogram_quantile
    - expr: |
        histogram_quantile(0.9, sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver"}[5m])) without(instance, pod))
      labels:
        quantile: "0.9"
      record: cluster_quantile:apiserver_request_duration_seconds:histogram_quantile
    - expr: |
        histogram_quantile(0.5, sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver"}[5m])) without(instance, pod))
      labels:
        quantile: "0.5"
      record: cluster_quantile:apiserver_request_duration_seconds:histogram_quantile
  - name: k8s.rules
    rules:
    - expr: |
        sum(rate(container_cpu_usage_seconds_total{job="kubelet", image!="", container!="POD"}[5m])) by (namespace)
      record: namespace:container_cpu_usage_seconds_total:sum_rate
    - expr: |
        sum by (namespace, pod, container) (
          rate(container_cpu_usage_seconds_total{job="kubelet", image!="", container!="POD"}[5m])
        ) * on (namespace, pod) group_left(node) max by(namespace, pod, node) (kube_pod_info)
      record: node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate
    - expr: |
        container_memory_working_set_bytes{job="kubelet", image!=""}
        * on (namespace, pod) group_left(node) max by(namespace, pod, node) (kube_pod_info)
      record: node_namespace_pod_container:container_memory_working_set_bytes
    - expr: |
        container_memory_rss{job="kubelet", image!=""}
        * on (namespace, pod) group_left(node) max by(namespace, pod, node) (kube_pod_info)
      record: node_namespace_pod_container:container_memory_rss
    - expr: |
        container_memory_cache{job="kubelet", image!=""}
        * on (namespace, pod) group_left(node) max by(namespace, pod, node) (kube_pod_info)
      record: node_namespace_pod_container:container_memory_cache
    - expr: |
        container_memory_swap{job="kubelet", image!=""}
        * on (namespace, pod) group_left(node) max by(namespace, pod, node) (kube_pod_info)
      record: node_namespace_pod_container:container_memory_swap
    - expr: |
        sum(container_memory_usage_bytes{job="kubelet", image!="", container!="POD"}) by (namespace)
      record: namespace:container_memory_usage_bytes:sum
    - expr: |
        sum by (namespace, label_name) (
            sum(kube_pod_container_resource_requests_memory_bytes{job="kube-state-metrics"} * on (endpoint, instance, job, namespace, pod, service) group_left(phase) (kube_pod_status_phase{phase=~"Pending|Running"} == 1)) by (namespace, pod)
          * on (namespace, pod)
            group_left(label_name) kube_pod_labels{job="kube-state-metrics"}
        )
      record: namespace:kube_pod_container_resource_requests_memory_bytes:sum
    - expr: |
        sum by (namespace, label_name) (
            sum(kube_pod_container_resource_requests_cpu_cores{job="kube-state-metrics"} * on (endpoint, instance, job, namespace, pod, service) group_left(phase) (kube_pod_status_phase{phase=~"Pending|Running"} == 1)) by (namespace, pod)
          * on (namespace, pod)
            group_left(label_name) kube_pod_labels{job="kube-state-metrics"}
        )
      record: namespace:kube_pod_container_resource_requests_cpu_cores:sum
    - expr: |
        sum(
          label_replace(
            label_replace(
              kube_pod_owner{job="kube-state-metrics", owner_kind="ReplicaSet"},
              "replicaset", "$1", "owner_name", "(.*)"
            ) * on(replicaset, namespace) group_left(owner_name) kube_replicaset_owner{job="kube-state-metrics"},
            "workload", "$1", "owner_name", "(.*)"
          )
        ) by (namespace, workload, pod)
      labels:
        workload_type: deployment
      record: mixin_pod_workload
    - expr: |
        sum(
          label_replace(
            kube_pod_owner{job="kube-state-metrics", owner_kind="DaemonSet"},
            "workload", "$1", "owner_name", "(.*)"
          )
        ) by (namespace, workload, pod)
      labels:
        workload_type: daemonset
      record: mixin_pod_workload
    - expr: |
        sum(
          label_replace(
            kube_pod_owner{job="kube-state-metrics", owner_kind="StatefulSet"},
            "workload", "$1", "owner_name", "(.*)"
          )
        ) by (namespace, workload, pod)
      labels:
        workload_type: statefulset
      record: mixin_pod_workload
  - name: kube-scheduler.rules
    rules:
    - expr: |
        histogram_quantile(0.99, sum(rate(scheduler_e2e_scheduling_duration_seconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod))
      labels:
        quantile: "0.99"
      record: cluster_quantile:scheduler_e2e_scheduling_duration_seconds:histogram_quantile
    - expr: |
        histogram_quantile(0.99, sum(rate(scheduler_scheduling_algorithm_duration_seconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod))
      labels:
        quantile: "0.99"
      record: cluster_quantile:scheduler_scheduling_algorithm_duration_seconds:histogram_quantile
    - expr: |
        histogram_quantile(0.99, sum(rate(scheduler_binding_duration_seconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod))
      labels:
        quantile: "0.99"
      record: cluster_quantile:scheduler_binding_duration_seconds:histogram_quantile
    - expr: |
        histogram_quantile(0.9, sum(rate(scheduler_e2e_scheduling_duration_seconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod))
      labels:
        quantile: "0.9"
      record: cluster_quantile:scheduler_e2e_scheduling_duration_seconds:histogram_quantile
    - expr: |
        histogram_quantile(0.9, sum(rate(scheduler_scheduling_algorithm_duration_seconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod))
      labels:
        quantile: "0.9"
      record: cluster_quantile:scheduler_scheduling_algorithm_duration_seconds:histogram_quantile
    - expr: |
        histogram_quantile(0.9, sum(rate(scheduler_binding_duration_seconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod))
      labels:
        quantile: "0.9"
      record: cluster_quantile:scheduler_binding_duration_seconds:histogram_quantile
    - expr: |
        histogram_quantile(0.5, sum(rate(scheduler_e2e_scheduling_duration_seconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod))
      labels:
        quantile: "0.5"
      record: cluster_quantile:scheduler_e2e_scheduling_duration_seconds:histogram_quantile
    - expr: |
        histogram_quantile(0.5, sum(rate(scheduler_scheduling_algorithm_duration_seconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod))
      labels:
        quantile: "0.5"
      record: cluster_quantile:scheduler_scheduling_algorithm_duration_seconds:histogram_quantile
    - expr: |
        histogram_quantile(0.5, sum(rate(scheduler_binding_duration_seconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod))
      labels:
        quantile: "0.5"
      record: cluster_quantile:scheduler_binding_duration_seconds:histogram_quantile
  - name: node.rules
    rules:
    - expr: sum(min(kube_pod_info) by (node))
      record: ':kube_pod_info_node_count:'
    - expr: |
        max(label_replace(kube_pod_info{job="kube-state-metrics"}, "pod", "$1", "pod", "(.*)")) by (node, namespace, pod)
      record: 'node_namespace_pod:kube_pod_info:'
    - expr: |
        count by (node) (sum by (node, cpu) (
          node_cpu_seconds_total{job="node-exporter"}
        * on (namespace, pod) group_left(node)
          node_namespace_pod:kube_pod_info:
        ))
      record: node:node_num_cpu:sum
    - expr: |
        sum(
          node_memory_MemAvailable_bytes{job="node-exporter"} or
          (
            node_memory_Buffers_bytes{job="node-exporter"} +
            node_memory_Cached_bytes{job="node-exporter"} +
            node_memory_MemFree_bytes{job="node-exporter"} +
            node_memory_Slab_bytes{job="node-exporter"}
          )
        )
      record: :node_memory_MemAvailable_bytes:sum
  - name: kube-prometheus-node-recording.rules
    rules:
    - expr: sum(rate(node_cpu_seconds_total{mode!="idle",mode!="iowait"}[3m])) BY (instance)
      record: instance:node_cpu:rate:sum
    - expr: sum((node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_free_bytes{mountpoint="/"})) BY (instance)
      record: instance:node_filesystem_usage:sum
    - expr: sum(rate(node_network_receive_bytes_total[3m])) BY (instance)
      record: instance:node_network_receive_bytes:rate:sum
    - expr: sum(rate(node_network_transmit_bytes_total[3m])) BY (instance)
      record: instance:node_network_transmit_bytes:rate:sum
    - expr: sum(rate(node_cpu_seconds_total{mode!="idle",mode!="iowait"}[5m])) WITHOUT (cpu, mode) / ON(instance) GROUP_LEFT() count(sum(node_cpu_seconds_total) BY (instance, cpu)) BY (instance)
      record: instance:node_cpu:ratio
    - expr: sum(rate(node_cpu_seconds_total{mode!="idle",mode!="iowait"}[5m]))
      record: cluster:node_cpu:sum_rate5m
    - expr: cluster:node_cpu_seconds_total:rate5m / count(sum(node_cpu_seconds_total) BY (instance, cpu))
      record: cluster:node_cpu:ratio
  - name: node-exporter
    rules:
    - alert: NodeFilesystemSpaceFillingUp
      annotations:
        platform: "育苗通测试平台"
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available space left and is filling up.
        summary: "预计文件系统将在接下来的24小时内用完空间。"
      expr: |
        (
          node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 40
        and
          predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 24*60*60) < 0
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: warning
    - alert: NodeFilesystemSpaceFillingUp
      annotations:
        platform: "育苗通测试平台"
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available space left and is filling up fast.
        summary: "预计文件系统将在接下来的4个小时内用完空间。"
      expr: |
        (
          node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 20
        and
          predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 4*60*60) < 0
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: critical
    - alert: NodeFilesystemAlmostOutOfSpace
      annotations:
        platform: "育苗通测试平台"
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available space left.
        summary: "文件系统剩余空间不到5%。"
      expr: |
        (
          node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 5
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: warning
    - alert: NodeFilesystemAlmostOutOfSpace
      annotations:
        platform: "育苗通测试平台"
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available space left.
        summary: "文件系统剩余空间不到3%。"
      expr: |
        (
          node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 3
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: critical
    - alert: NodeFilesystemFilesFillingUp
      annotations:
        platform: "育苗通测试平台"
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available inodes left and is filling up.
        summary: "预计文件系统将在接下来的24小时内用尽inodes。"
      expr: |
        (
          node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 40
        and
          predict_linear(node_filesystem_files_free{job="node-exporter",fstype!=""}[6h], 24*60*60) < 0
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: warning
    - alert: NodeFilesystemFilesFillingUp
      annotations:
        platform: "育苗通测试平台"
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available inodes left and is filling up fast.
        summary: "预计文件系统将在接下来的4小时内用尽inodes。"
      expr: |
        (
          node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 20
        and
          predict_linear(node_filesystem_files_free{job="node-exporter",fstype!=""}[6h], 4*60*60) < 0
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: critical
    - alert: NodeFilesystemAlmostOutOfFiles
      annotations:
        platform: "育苗通测试平台"
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available inodes left.
        summary: "文件系统仅剩不到5%的inodes。"
      expr: |
        (
          node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 5
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: warning
    - alert: NodeFilesystemAlmostOutOfFiles
      annotations:
        platform: "育苗通测试平台"
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available inodes left.
        summary: "文件系统仅剩不到3%的inodes。"
      expr: |
        (
          node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 3
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: critical
    - alert: NodeNetworkReceiveErrs
      annotations:
        platform: "育苗通测试平台"
        description: '{{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf "%.0f" $value }} receive errors in the last two minutes.'
        summary: "网络接口报告许多接收错误。"
      expr: |
        increase(node_network_receive_errs_total[2m]) > 10
      for: 1h
      labels:
        severity: warning
    - alert: NodeNetworkTransmitErrs
      annotations:
        platform: "育苗通测试平台"
        description: '{{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf "%.0f" $value }} transmit errors in the last two minutes.'
        summary: "网络接口报告许多传输错误。"
      expr: |
        increase(node_network_transmit_errs_total[2m]) > 10
      for: 1h
      labels:
        severity: warning
  - name: kubernetes-apps
    rules:
    - alert: KubePodCrashLooping
      annotations:
        platform: "育苗通测试平台"
        message: Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is restarting {{ printf "%.2f" $value }} times / 5 minutes.
      expr: |
        rate(kube_pod_container_status_restarts_total{job="kube-state-metrics"}[15m]) * 60 * 5 > 0
      for: 15m
      labels:
        severity: critical
    - alert: KubePodNotReady
      annotations:
        platform: "育苗通测试平台"
        message: "Pod {{$labels.namespace}}/{{$labels.pod}}处于未就绪状态的时间超过15分钟。"
      expr: |
        sum by (namespace, pod) (kube_pod_status_phase{job="kube-state-metrics", phase=~"Failed|Pending|Unknown"} * on(namespace, pod) group_left(owner_kind) kube_pod_owner{owner_kind!="Job"}) > 0
      for: 15m
      labels:
        severity: critical
    - alert: KubeDeploymentGenerationMismatch
      annotations:
        platform: "育苗通测试平台"
        message: "Deployment {{$labels.namespace}}/{{$labels.deployment}}生成不匹配,这表明Deployment已失败但尚未回滚。"
      expr: |
        kube_deployment_status_observed_generation{job="kube-state-metrics"}
          !=
        kube_deployment_metadata_generation{job="kube-state-metrics"}
      for: 15m
      labels:
        severity: critical
    - alert: KubeDeploymentReplicasMismatch
      annotations:
        platform: "育苗通测试平台"
        message: "Deployment {{$labels.namespace}}/{{$labels.deployment}}超过15分钟未匹配预期的副本数。"
      expr: |
        kube_deployment_spec_replicas{job="kube-state-metrics"}
          !=
        kube_deployment_status_replicas_available{job="kube-state-metrics"}
      for: 15m
      labels:
        severity: critical
    - alert: KubeStatefulSetReplicasMismatch
      annotations:
        platform: "育苗通测试平台"
        message: "StatefulSet {{$labels.namespace}}/{{$labels.statefulset}}超过15分钟未匹配预期的副本数。"
      expr: |
        kube_statefulset_status_replicas_ready{job="kube-state-metrics"}
          !=
        kube_statefulset_status_replicas{job="kube-state-metrics"}
      for: 15m
      labels:
        severity: critical
    - alert: KubeStatefulSetGenerationMismatch
      annotations:
        platform: "育苗通测试平台"
        message: "StatefulSet {{$labels.namespace}}/{{$labels.statefulset}}生成不匹配,这表明StatefulSet已失败但尚未回滚。"
      expr: |
        kube_statefulset_status_observed_generation{job="kube-state-metrics"}
          !=
        kube_statefulset_metadata_generation{job="kube-state-metrics"}
      for: 15m
      labels:
        severity: critical
    - alert: KubeStatefulSetUpdateNotRolledOut
      annotations:
        platform: "育苗通测试平台"
        message: StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} update has not been rolled out.
      expr: |
        max without (revision) (
          kube_statefulset_status_current_revision{job="kube-state-metrics"}
            unless
          kube_statefulset_status_update_revision{job="kube-state-metrics"}
        )
          *
        (
          kube_statefulset_replicas{job="kube-state-metrics"}
            !=
          kube_statefulset_status_replicas_updated{job="kube-state-metrics"}
        )
      for: 15m
      labels:
        severity: critical
    - alert: KubeDaemonSetRolloutStuck
      annotations:
        platform: "育苗通测试平台"
        message: Only {{ $value | humanizePercentage }} of the desired Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are scheduled and ready.
      expr: |
        kube_daemonset_status_number_ready{job="kube-state-metrics"}
          /
        kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"} < 1.00
      for: 15m
      labels:
        severity: critical
    - alert: KubeContainerWaiting
      annotations:
        platform: "育苗通测试平台"
        message: Pod {{ $labels.namespace }}/{{ $labels.pod }} container {{ $labels.container}} has been in waiting state for longer than 1 hour.
      expr: |
        sum by (namespace, pod, container) (kube_pod_container_status_waiting_reason{job="kube-state-metrics"}) > 0
      for: 1h
      labels:
        severity: warning
    - alert: KubeDaemonSetNotScheduled
      annotations:
        platform: "育苗通测试平台"
        message: '{{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are not scheduled.'
      expr: |
        kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"}
          -
        kube_daemonset_status_current_number_scheduled{job="kube-state-metrics"} > 0
      for: 10m
      labels:
        severity: warning
    - alert: KubeDaemonSetMisScheduled
      annotations:
        platform: "育苗通测试平台"
        message: '{{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are running where they are not supposed to run.'
      expr: |
        kube_daemonset_status_number_misscheduled{job="kube-state-metrics"} > 0
      for: 10m
      labels:
        severity: warning
    - alert: KubeCronJobRunning
      annotations:
        platform: "育苗通测试平台"
        message: CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is taking more than 1h to complete.
      expr: |
        time() - kube_cronjob_next_schedule_time{job="kube-state-metrics"} > 3600
      for: 1h
      labels:
        severity: warning
    - alert: KubeJobCompletion
      annotations:
        platform: "育苗通测试平台"
        message: Job {{ $labels.namespace }}/{{ $labels.job_name }} is taking more than one hour to complete.
      expr: |
        kube_job_spec_completions{job="kube-state-metrics"} - kube_job_status_succeeded{job="kube-state-metrics"} > 0
      for: 1h
      labels:
        severity: warning
    - alert: KubeJobFailed
      annotations:
        platform: "育苗通测试平台"
        message: Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete.
      expr: |
        kube_job_failed{job="kube-state-metrics"} > 0
      for: 15m
      labels:
        severity: warning
    - alert: KubeHpaReplicasMismatch
      annotations:
        platform: "育苗通测试平台"
        message: HPA {{ $labels.namespace }}/{{ $labels.hpa }} has not matched the desired number of replicas for longer than 15 minutes.
      expr: |
        (kube_hpa_status_desired_replicas{job="kube-state-metrics"}
          !=
        kube_hpa_status_current_replicas{job="kube-state-metrics"})
          and
        changes(kube_hpa_status_current_replicas[15m]) == 0
      for: 15m
      labels:
        severity: warning
    - alert: KubeHpaMaxedOut
      annotations:
        platform: "育苗通测试平台"
        message: HPA {{ $labels.namespace }}/{{ $labels.hpa }} has been running at max replicas for longer than 15 minutes.
      expr: |
        kube_hpa_status_current_replicas{job="kube-state-metrics"}
          ==
        kube_hpa_spec_max_replicas{job="kube-state-metrics"}
      for: 15m
      labels:
        severity: warning
  - name: kubernetes-resources
    rules:
    - alert: KubeCPUOvercommit
      annotations:
        platform: "育苗通测试平台"
        message: "集群已超额使用Pod的CPU资源请求,因此无法容忍节点故障。"
      expr: |
        sum(namespace:kube_pod_container_resource_requests_cpu_cores:sum)
          /
        sum(kube_node_status_allocatable_cpu_cores)
          >
        (count(kube_node_status_allocatable_cpu_cores)-1) / count(kube_node_status_allocatable_cpu_cores)
      for: 5m
      labels:
        severity: warning
    - alert: KubeMemOvercommit
      annotations:
        platform: "育苗通测试平台"
        message: "集群已过量使用Pod的内存资源请求,因此无法容忍节点故障。"
      expr: |
        sum(namespace:kube_pod_container_resource_requests_memory_bytes:sum)
          /
        sum(kube_node_status_allocatable_memory_bytes)
          >
        (count(kube_node_status_allocatable_memory_bytes)-1)
          /
        count(kube_node_status_allocatable_memory_bytes)
      for: 5m
      labels:
        severity: warning
    - alert: KubeCPUOvercommit
      annotations:
        platform: "育苗通测试平台"
        message: "集群已超额使用了对命名空间的CPU资源请求。"
      expr: |
        sum(kube_resourcequota{job="kube-state-metrics", type="hard", resource="cpu"})
          /
        sum(kube_node_status_allocatable_cpu_cores)
          > 1.5
      for: 5m
      labels:
        severity: warning
    - alert: KubeMemOvercommit
      annotations:
        platform: "育苗通测试平台"
        message: "集群已过量使用了对命名空间的内存资源请求。"
      expr: |
        sum(kube_resourcequota{job="kube-state-metrics", type="hard", resource="memory"})
          /
        sum(kube_node_status_allocatable_memory_bytes{job="node-exporter"})
          > 1.5
      for: 5m
      labels:
        severity: warning
    - alert: KubeQuotaExceeded
      annotations:
        platform: "育苗通测试平台"
        message: Namespace {{ $labels.namespace }} is using {{ $value | humanizePercentage }} of its {{ $labels.resource }} quota.
      expr: |
        kube_resourcequota{job="kube-state-metrics", type="used"}
          / ignoring(instance, job, type)
        (kube_resourcequota{job="kube-state-metrics", type="hard"} > 0)
          > 0.90
      for: 15m
      labels:
        severity: warning
    - alert: CPUThrottlingHigh
      annotations:
        message: '{{ $value | humanizePercentage }} throttling of CPU in namespace {{ $labels.namespace }} for container {{ $labels.container }} in pod {{ $labels.pod }}.'
      expr: |
        sum(increase(container_cpu_cfs_throttled_periods_total{container!="", }[5m])) by (container, pod, namespace)
          /
        sum(increase(container_cpu_cfs_periods_total{}[5m])) by (container, pod, namespace)
          > ( 25 / 100 )
      for: 15m
      labels:
        severity: warning
  - name: kubernetes-storage
    rules:
    - alert: KubePersistentVolumeUsageCritical
      annotations:
        platform: "育苗通测试平台"
        message: The PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }} is only {{ $value | humanizePercentage }} free.
      expr: |
        kubelet_volume_stats_available_bytes{job="kubelet"}
          /
        kubelet_volume_stats_capacity_bytes{job="kubelet"}
          < 0.03
      for: 1m
      labels:
        severity: critical
    - alert: KubePersistentVolumeFullInFourDays
      annotations:
        platform: "育苗通测试平台"
        message: "根据最近的抽样,{{$labels.persistentvolumeclaim}}在命名空间{{$labels.namespace}}中声明的PersistentVolume预计将在四天内填满,目前{{$value | humanizePercentage}}可用。"
      expr: |
        (
          kubelet_volume_stats_available_bytes{job="kubelet"}
            /
          kubelet_volume_stats_capacity_bytes{job="kubelet"}
        ) < 0.15
        and
        predict_linear(kubelet_volume_stats_available_bytes{job="kubelet"}[6h], 4 * 24 * 3600) < 0
      for: 1h
      labels:
        severity: critical
    - alert: KubePersistentVolumeErrors
      annotations:
        platform: "育苗通测试平台"
        message: The persistent volume {{ $labels.persistentvolume }} has status {{ $labels.phase }}.
      expr: |
        kube_persistentvolume_status_phase{phase=~"Failed|Pending",job="kube-state-metrics"} > 0
      for: 5m
      labels:
        severity: critical
  - name: kubernetes-system
    rules:
    - alert: KubeVersionMismatch
      annotations:
        platform: "育苗通测试平台"
        message: There are {{ $value }} different semantic versions of Kubernetes components running.
      expr: |
        count(count by (gitVersion) (label_replace(kubernetes_build_info{job!~"kube-dns|coredns"},"gitVersion","$1","gitVersion","(v[0-9]*.[0-9]*.[0-9]*).*"))) > 1
      for: 15m
      labels:
        severity: warning
    - alert: KubeClientErrors
      annotations:
        platform: "育苗通测试平台"
        message: Kubernetes API server client '{{ $labels.job }}/{{ $labels.instance }}' is experiencing {{ $value | humanizePercentage }} errors.'
      expr: |
        (sum(rate(rest_client_requests_total{code=~"5.."}[5m])) by (instance, job)
          /
        sum(rate(rest_client_requests_total[5m])) by (instance, job))
        > 0.01
      for: 15m
      labels:
        severity: warning
  - name: kubernetes-system-apiserver
    rules:
    - alert: KubeAPILatencyHigh
      annotations:
        platform: "育苗通测试平台"
        message: The API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}.
      expr: |
        cluster_quantile:apiserver_request_duration_seconds:histogram_quantile{job="apiserver",quantile="0.99",subresource!="log",verb!~"LIST|WATCH|WATCHLIST|PROXY|CONNECT"} > 1
      for: 10m
      labels:
        severity: warning
    - alert: KubeAPILatencyHigh
      annotations:
        platform: "育苗通测试平台"
        message: The API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}.
      expr: |
        cluster_quantile:apiserver_request_duration_seconds:histogram_quantile{job="apiserver",quantile="0.99",subresource!="log",verb!~"LIST|WATCH|WATCHLIST|PROXY|CONNECT"} > 4
      for: 10m
      labels:
        severity: critical
    - alert: KubeAPIErrorsHigh
      annotations:
        platform: "育苗通测试平台"
        message: API server is returning errors for {{ $value | humanizePercentage }} of requests.
      expr: |
        sum(rate(apiserver_request_total{job="apiserver",code=~"5.."}[5m]))
          /
        sum(rate(apiserver_request_total{job="apiserver"}[5m])) > 0.03
      for: 10m
      labels:
        severity: critical
    - alert: KubeAPIErrorsHigh
      annotations:
        platform: "育苗通测试平台"
        message: API server is returning errors for {{ $value | humanizePercentage }} of requests.
      expr: |
        sum(rate(apiserver_request_total{job="apiserver",code=~"5.."}[5m]))
          /
        sum(rate(apiserver_request_total{job="apiserver"}[5m])) > 0.01
      for: 10m
      labels:
        severity: warning
    - alert: KubeAPIErrorsHigh
      annotations:
        platform: "育苗通测试平台"
        message: API server is returning errors for {{ $value | humanizePercentage }} of requests for {{ $labels.verb }} {{ $labels.resource }} {{ $labels.subresource }}.
      expr: |
        sum(rate(apiserver_request_total{job="apiserver",code=~"5.."}[5m])) by (resource,subresource,verb)
          /
        sum(rate(apiserver_request_total{job="apiserver"}[5m])) by (resource,subresource,verb) > 0.10
      for: 10m
      labels:
        severity: critical
    - alert: KubeAPIErrorsHigh
      annotations:
        platform: "育苗通测试平台"
        message: API server is returning errors for {{ $value | humanizePercentage }} of requests for {{ $labels.verb }} {{ $labels.resource }} {{ $labels.subresource }}.
      expr: |
        sum(rate(apiserver_request_total{job="apiserver",code=~"5.."}[5m])) by (resource,subresource,verb)
          /
        sum(rate(apiserver_request_total{job="apiserver"}[5m])) by (resource,subresource,verb) > 0.05
      for: 10m
      labels:
        severity: warning
    - alert: KubeClientCertificateExpiration
      annotations:
        platform: "育苗通测试平台"
        message: "用于验证apiserver的客户端证书的有效期限少于7天。"
      expr: |
        apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 604800
      labels:
        severity: warning
    - alert: KubeClientCertificateExpiration
      annotations:
        platform: "育苗通测试平台"
        message: "用于验证apiserver的客户端证书的有效期限少于24小时。"
      expr: |
        apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 86400
      labels:
        severity: critical
    - alert: KubeAPIDown
      annotations:
        platform: "育苗通测试平台"
        message: "KubeAPI已从Prometheus目标发现中消失。"
      expr: |
        absent(up{job="apiserver"} == 1)
      for: 15m
      labels:
        severity: critical
  - name: kubernetes-system-kubelet
    rules:
    - alert: KubeNodeNotReady
      annotations:
        platform: "育苗通测试平台"
        message: "{{$labels.node}}尚未准备就绪超过15分钟。"
      expr: |
        kube_node_status_condition{job="kube-state-metrics",condition="Ready",status="true"} == 0
      for: 15m
      labels:
        severity: warning
    - alert: KubeNodeUnreachable
      annotations:
        platform: "育苗通测试平台"
        message: "{{$labels.node}}无法访问,某些工作负荷可能会重新安排。"
      expr: |
        kube_node_spec_taint{job="kube-state-metrics",key="node.kubernetes.io/unreachable",effect="NoSchedule"} == 1
      labels:
        severity: warning
    - alert: KubeletTooManyPods
      annotations:
        platform: "育苗通测试平台"
        message: Kubelet '{{ $labels.node }}' is running at {{ $value | humanizePercentage }} of its Pod capacity.
      expr: |
        max(max(kubelet_running_pod_count{job="kubelet"}) by(instance) * on(instance) group_left(node) kubelet_node_name{job="kubelet"}) by(node) / max(kube_node_status_capacity_pods{job="kube-state-metrics"}) by(node) > 0.95
      for: 15m
      labels:
        severity: warning
    - alert: KubeletDown
      annotations:
        platform: "育苗通测试平台"
        message: "Kubelet已从Prometheus目标发现中消失。"
      expr: |
        absent(up{job="kubelet"} == 1)
      for: 15m
      labels:
        severity: critical
  - name: kubernetes-system-scheduler
    rules:
    - alert: KubeSchedulerDown
      annotations:
        message: "KubeScheduler已从Prometheus目标发现中消失。"
      expr: |
        absent(up{job="kube-scheduler"} == 1)
      for: 15m
      labels:
        severity: critical
  - name: kubernetes-system-controller-manager
    rules:
    - alert: KubeControllerManagerDown
      annotations:
        message: "KubeControllerManager已从Prometheus目标发现中消失。"
      expr: |
        absent(up{job="kube-controller-manager"} == 1)
      for: 15m
      labels:
        severity: critical
  - name: prometheus
    rules:
    - alert: PrometheusBadConfig
      annotations:
        platform: "育苗通测试平台"
        description: Prometheus {{$labels.namespace}}/{{$labels.pod}} has failed to reload its configuration.
        summary: "Prometheus配置重新加载失败。"
      expr: |
        # Without max_over_time, failed scrapes could create false negatives, see
        # https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details.
        max_over_time(prometheus_config_last_reload_successful{job="prometheus-k8s",namespace="monitoring"}[5m]) == 0
      for: 10m
      labels:
        severity: critical
    - alert: PrometheusNotificationQueueRunningFull
      annotations:
        platform: "育苗通测试平台"
        description: Alert notification queue of Prometheus {{$labels.namespace}}/{{$labels.pod}} is running full.
        summary: "Prometheus警报通知队列预计将在30m以内用完。"
      expr: |
        # Without min_over_time, failed scrapes could create false negatives, see
        # https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details.
( predict_linear(prometheus_notifications_queue_length{job="prometheus-k8s",namespace="monitoring"}[5m], 60 * 30) > min_over_time(prometheus_notifications_queue_capacity{job="prometheus-k8s",namespace="monitoring"}[5m]) ) for: 15m labels: severity: warning - alert: PrometheusErrorSendingAlertsToSomeAlertmanagers annotations: platform: "育苗通测试平台" description: '{{ printf "%.1f" $value }}% errors while sending alerts from Prometheus {{$labels.namespace}}/{{$labels.pod}} to Alertmanager {{$labels.alertmanager}}.' summary: "Prometheus在将警报发送到特定的Alertmanager时遇到了超过1%的错误。" expr: | ( rate(prometheus_notifications_errors_total{job="prometheus-k8s",namespace="monitoring"}[5m]) / rate(prometheus_notifications_sent_total{job="prometheus-k8s",namespace="monitoring"}[5m]) ) * 100 > 1 for: 15m labels: severity: warning - alert: PrometheusErrorSendingAlertsToAnyAlertmanager annotations: platform: "育苗通测试平台" description: '{{ printf "%.1f" $value }}% minimum errors while sending alerts from Prometheus {{$labels.namespace}}/{{$labels.pod}} to any Alertmanager.' summary: "Prometheus在将警报发送到任何Alertmanager时遇到3%以上的错误。" expr: | min without(alertmanager) ( rate(prometheus_notifications_errors_total{job="prometheus-k8s",namespace="monitoring"}[5m]) / rate(prometheus_notifications_sent_total{job="prometheus-k8s",namespace="monitoring"}[5m]) ) * 100 > 3 for: 15m labels: severity: critical - alert: PrometheusNotConnectedToAlertmanagers annotations: platform: "育苗通测试平台" description: Prometheus {{$labels.namespace}}/{{$labels.pod}} is not connected to any Alertmanagers. summary: "Prometheus未与任何Alertmanager连接。" expr: | # Without max_over_time, failed scrapes could create false negatives, see # https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details. 
max_over_time(prometheus_notifications_alertmanagers_discovered{job="prometheus-k8s",namespace="monitoring"}[5m]) < 1 for: 10m labels: severity: warning - alert: PrometheusTSDBReloadsFailing annotations: platform: "育苗通测试平台" description: Prometheus {{$labels.namespace}}/{{$labels.pod}} has detected {{$value | humanize}} reload failures over the last 3h. summary: "Prometheus从磁盘重新加载块时遇到问题。" expr: | increase(prometheus_tsdb_reloads_failures_total{job="prometheus-k8s",namespace="monitoring"}[3h]) > 0 for: 4h labels: severity: warning - alert: PrometheusTSDBCompactionsFailing annotations: platform: "育苗通测试平台" description: Prometheus {{$labels.namespace}}/{{$labels.pod}} has detected {{$value | humanize}} compaction failures over the last 3h. summary: "Prometheus在压缩块时遇到问题。" expr: | increase(prometheus_tsdb_compactions_failed_total{job="prometheus-k8s",namespace="monitoring"}[3h]) > 0 for: 4h labels: severity: warning - alert: PrometheusNotIngestingSamples annotations: platform: "育苗通测试平台" description: Prometheus {{$labels.namespace}}/{{$labels.pod}} is not ingesting samples. summary: "Prometheus没有获取到样本" expr: | rate(prometheus_tsdb_head_samples_appended_total{job="prometheus-k8s",namespace="monitoring"}[5m]) <= 0 for: 10m labels: severity: warning - alert: PrometheusDuplicateTimestamps annotations: platform: "育苗通测试平台" description: Prometheus {{$labels.namespace}}/{{$labels.pod}} is dropping {{ printf "%.4g" $value }} samples/s with different values but duplicated timestamp. summary: "Prometheus正在删除带有重复时间戳的样本。" expr: | rate(prometheus_target_scrapes_sample_duplicate_timestamp_total{job="prometheus-k8s",namespace="monitoring"}[5m]) > 0 for: 10m labels: severity: warning - alert: PrometheusOutOfOrderTimestamps annotations: platform: "育苗通测试平台" description: Prometheus {{$labels.namespace}}/{{$labels.pod}} is dropping {{ printf "%.4g" $value }} samples/s with timestamps arriving out of order. 
summary: "Prometheus丢弃带有乱序时间戳的样本。" expr: | rate(prometheus_target_scrapes_sample_out_of_order_total{job="prometheus-k8s",namespace="monitoring"}[5m]) > 0 for: 10m labels: severity: warning - alert: PrometheusRemoteStorageFailures annotations: platform: "育苗通测试平台" description: Prometheus {{$labels.namespace}}/{{$labels.pod}} failed to send {{ printf "%.1f" $value }}% of the samples to queue {{$labels.queue}}. summary: "Prometheus无法将样本发送到远程存储。" expr: | ( rate(prometheus_remote_storage_failed_samples_total{job="prometheus-k8s",namespace="monitoring"}[5m]) / ( rate(prometheus_remote_storage_failed_samples_total{job="prometheus-k8s",namespace="monitoring"}[5m]) + rate(prometheus_remote_storage_succeeded_samples_total{job="prometheus-k8s",namespace="monitoring"}[5m]) ) ) * 100 > 1 for: 15m labels: severity: critical - alert: PrometheusRemoteWriteBehind annotations: platform: "育苗通测试平台" description: Prometheus {{$labels.namespace}}/{{$labels.pod}} remote write is {{ printf "%.1f" $value }}s behind for queue {{$labels.queue}}. summary: "Prometheus远程写入落后了。" expr: | # Without max_over_time, failed scrapes could create false negatives, see # https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details. ( max_over_time(prometheus_remote_storage_highest_timestamp_in_seconds{job="prometheus-k8s",namespace="monitoring"}[5m]) - on(job, instance) group_right max_over_time(prometheus_remote_storage_queue_highest_sent_timestamp_seconds{job="prometheus-k8s",namespace="monitoring"}[5m]) ) > 120 for: 15m labels: severity: critical - alert: PrometheusRemoteWriteDesiredShards annotations: platform: "育苗通测试平台" description: Prometheus {{$labels.namespace}}/{{$labels.pod}} remote write desired shards calculation wants to run {{ $value }} shards, which is more than the max of {{ printf `prometheus_remote_storage_shards_max{instance="%s",job="prometheus-k8s",namespace="monitoring"}` $labels.instance | query | first | value }}. 
summary: "Prometheus远程写入所需的分片计算要比配置的最大分片运行更多。" expr: | # Without max_over_time, failed scrapes could create false negatives, see # https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details. ( max_over_time(prometheus_remote_storage_shards_desired{job="prometheus-k8s",namespace="monitoring"}[5m]) > max_over_time(prometheus_remote_storage_shards_max{job="prometheus-k8s",namespace="monitoring"}[5m]) ) for: 15m labels: severity: warning - alert: PrometheusRuleFailures annotations: platform: "育苗通测试平台" description: Prometheus {{$labels.namespace}}/{{$labels.pod}} has failed to evaluate {{ printf "%.0f" $value }} rules in the last 5m. summary: "Prometheus无法通过规则评估。" expr: | increase(prometheus_rule_evaluation_failures_total{job="prometheus-k8s",namespace="monitoring"}[5m]) > 0 for: 15m labels: severity: critical - alert: PrometheusMissingRuleEvaluations annotations: platform: "育苗通测试平台" description: Prometheus {{$labels.namespace}}/{{$labels.pod}} has missed {{ printf "%.0f" $value }} rule group evaluations in the last 5m. 
summary: "Prometheus由于规则组评估速度慢而缺少规则评估。" expr: | increase(prometheus_rule_group_iterations_missed_total{job="prometheus-k8s",namespace="monitoring"}[5m]) > 0 for: 15m labels: severity: warning - name: alertmanager.rules rules: - alert: AlertmanagerConfigInconsistent annotations: platform: "育苗通测试平台" message: "Alertmanager {{$labels.service}}实例的配置不同步。" expr: | count_values("config_hash", alertmanager_config_hash{job="alertmanager-main",namespace="monitoring"}) BY (service) / ON(service) GROUP_LEFT() label_replace(max(prometheus_operator_spec_replicas{job="prometheus-operator",namespace="monitoring",controller="alertmanager"}) by (name, job, namespace, controller), "service", "alertmanager-$1", "name", "(.*)") != 1 for: 5m labels: severity: critical - alert: AlertmanagerFailedReload annotations: platform: "育苗通测试平台" message: "Alertmanager {{$labels.namespace}}/{{$labels.pod}}重新加载配置失败。" expr: | alertmanager_config_last_reload_successful{job="alertmanager-main",namespace="monitoring"} == 0 for: 10m labels: severity: warning - alert: AlertmanagerMembersInconsistent annotations: platform: "育苗通测试平台" message: "Alertmanager尚未找到集群的所有其他成员。" expr: | alertmanager_cluster_members{job="alertmanager-main",namespace="monitoring"} != on (service) GROUP_LEFT() count by (service) (alertmanager_cluster_members{job="alertmanager-main",namespace="monitoring"}) for: 5m labels: severity: critical # - name: general.rules # rules: # - alert: TargetDown # annotations: # platform: "育苗通测试平台" # message: '{{ printf "%.4g" $value }}% of the {{ $labels.job }} targets in # {{ $labels.namespace }} namespace are down.' 
# runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md # expr: 100 * (count(up == 0) BY (job, namespace, service) / count(up) BY (job, # namespace, service)) > 10 # for: 10m # labels: # severity: warning # - alert: Watchdog # annotations: # platform: "育苗通测试平台" # message: "此警报始终处于触发状态,旨在确保整个警报管道均正常运行。" # expr: vector(1) # labels: # severity: none - name: node-time rules: - alert: ClockSkewDetected annotations: platform: "育苗通测试平台" message: Clock skew detected on node-exporter {{ $labels.namespace }}/{{ $labels.pod }}. Ensure NTP is configured correctly on this host. expr: | abs(node_timex_offset_seconds{job="node-exporter"}) > 0.05 for: 2m labels: severity: warning - name: node-network rules: - alert: NodeNetworkInterfaceFlapping annotations: platform: "育苗通测试平台" message: Network interface "{{ $labels.device }}" changing it's up status often on node-exporter {{ $labels.namespace }}/{{ $labels.pod }}" expr: | changes(node_network_up{job="node-exporter",device!~"veth.+"}[2m]) > 2 for: 2m labels: severity: warning - name: prometheus-operator rules: - alert: PrometheusOperatorReconcileErrors annotations: platform: "育苗通测试平台" message: Errors while reconciling {{ $labels.controller }} in {{ $labels.namespace }} Namespace. expr: | rate(prometheus_operator_reconcile_errors_total{job="prometheus-operator",namespace="monitoring"}[5m]) > 0.1 for: 10m labels: severity: warning - alert: PrometheusOperatorNodeLookupErrors annotations: platform: "育苗通测试平台" message: Errors while reconciling Prometheus in {{ $labels.namespace }} Namespace. expr: | rate(prometheus_operator_node_address_lookup_errors_total{job="prometheus-operator",namespace="monitoring"}[5m]) > 0.1 for: 10m labels: severity: warning
Next, using prometheus-rules.yaml as a reference, create a file of custom alerting rules named prometheus-additional-rules.yaml.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: k8s
    role: alert-rules
  name: prometheus-additional-rules
  namespace: monitoring
spec:
  groups:
  - name: general.rules
    rules:
    - alert: InstanceDown
      expr: up == 0
      for: 1m
      labels:
        status: critical
      annotations:
        platform: "育苗通测试平台"
        summary: "{{$labels.instance}} exporter has stopped responding"
        description: "{{$labels.instance}} has been unreachable for more than 1 minute"
    # CPU, memory, disk I/O, network and TCP series carry an instance label but no
    # mountpoint label, so those alerts reference {{$labels.instance}}.
    - alert: NodeCPUUsage
      expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by(instance) * 100) > 80
      for: 1m
      labels:
        status: critical
      annotations:
        platform: "育苗通测试平台"
        summary: "{{$labels.instance}} CPU usage is too high!"
        description: "{{$labels.instance}} CPU usage is above 80% (current: {{$value}}%)"
    - alert: NodeMemoryUsage
      expr: 100 - (node_memory_MemFree_bytes+node_memory_Buffers_bytes+node_memory_Cached_bytes) / node_memory_MemTotal_bytes * 100 > 80
      for: 1m
      labels:
        status: critical
      annotations:
        platform: "育苗通测试平台"
        summary: "{{$labels.instance}} memory usage is too high!"
        description: "{{$labels.instance}} memory usage is above 80% (current: {{$value}}%)"
    - alert: NodeFilesystemUsage
      expr: 100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100) > 80
      for: 1m
      labels:
        status: critical
      annotations:
        platform: "育苗通测试平台"
        summary: "{{$labels.mountpoint}} disk partition usage is too high!"
        description: "{{$labels.mountpoint}} disk partition usage is above 80% (current: {{$value}}%)"
    - alert: NodeDiskIOUsage
      expr: (avg(irate(node_disk_io_time_seconds_total[1m])) by(instance) * 100) > 80
      for: 1m
      labels:
        status: critical
      annotations:
        platform: "育苗通测试平台"
        summary: "{{$labels.instance}} disk I/O utilization is too high!"
        description: "{{$labels.instance}} disk I/O utilization is above 80% (current: {{$value}})"
    # Prometheus regex matchers are fully anchored, so virbr* would not exclude
    # virbr0; virbr.* does. The loopback device is matched literally as lo.
    - alert: NodeNetworkReceive
      expr: ((sum(rate (node_network_receive_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) by (instance)) / 100) > 1048576
      for: 1m
      labels:
        status: critical
      annotations:
        platform: "育苗通测试平台"
        summary: "{{$labels.instance}} inbound network bandwidth is too high!"
        description: "{{$labels.instance}} inbound network bandwidth has averaged more than 100 MiB/s (close to 1 Gbit/s) over 5 minutes. RX bandwidth value: {{$value}}"
    - alert: NodeNetworkTransmit
      expr: ((sum(rate (node_network_transmit_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) by (instance)) / 100) > 1048576
      for: 1m
      labels:
        status: critical
      annotations:
        platform: "育苗通测试平台"
        summary: "{{$labels.instance}} outbound network bandwidth is too high!"
        description: "{{$labels.instance}} outbound network bandwidth has averaged more than 100 MiB/s (close to 1 Gbit/s) over 5 minutes. TX bandwidth value: {{$value}}"
    - alert: NodeTCPCurrEstab
      expr: node_netstat_Tcp_CurrEstab > 1000
      for: 1m
      labels:
        status: critical
      annotations:
        platform: "育苗通测试平台"
        summary: "{{$labels.instance}} has too many established TCP connections!"
        description: "{{$labels.instance}} has more than 1000 established TCP connections (current: {{$value}})"
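The threshold in the two network rules is easy to misread, because the byte rate is divided by 100 before it is compared with 1048576. The following Python sketch only redoes that arithmetic; the function name is made up for illustration and is not part of any Prometheus tooling.

```python
# Hypothetical helper: converts the rule's comparison back into link-speed units.
def network_threshold(divisor: int = 100, limit: int = 1048576) -> dict:
    """Return the byte rate above which (rate / divisor) > limit holds."""
    bytes_per_second = divisor * limit  # the rate must exceed this value
    return {
        "bytes_per_second": bytes_per_second,
        "mib_per_second": bytes_per_second / 2**20,
        "mbit_per_second": bytes_per_second * 8 / 1e6,
    }


t = network_threshold()
print(t)
```

Under these numbers the alert fires at about 100 MiB/s, or roughly 839 Mbit/s, which is close to saturating a 1 Gbit/s link.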
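5. Adding External Monitoring Targets
A project can rarely be fully containerized from the start; databases and CDH clusters, for example, often stay outside the cluster. They still need to be monitored, and running two separate Prometheus setups is harder to manage, so we add these targets to kube-prometheus as well.
To do that, create a file named prometheus-additional.yaml that holds the extra scrape_configs entries.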
- job_name: 'node-exporter-others'
  static_configs:
  - targets:
    - *.*.*.149:31190
    - *.*.*.150:31190
    - *.*.*.122:31190
- job_name: 'mysql-exporter'
  static_configs:
  - targets:
    - *.*.*.104:9592
    - *.*.*.125:9592
    - *.*.*.128:9592
- job_name: 'nacos-exporter'
  metrics_path: '/nacos/actuator/prometheus'
  static_configs:
  - targets:
    - *.*.*.113:8848
    - *.*.*.114:8848
    - *.*.*.118:8848
- job_name: 'elasticsearch-exporter'
  static_configs:
  - targets:
    - *.*.*.110:9597
    - *.*.*.107:9597
    - *.*.*.117:9597
- job_name: 'zookeeper-exporter'
  static_configs:
  - targets:
    - *.*.*.115:9595
    - *.*.*.121:9595
    - *.*.*.120:9595
- job_name: 'nginx-exporter'
  static_configs:
  - targets:
    - *.*.*.149:9593
    - *.*.*.150:9593
    - *.*.*.122:9593
- job_name: 'redis-exporter'
  static_configs:
  - targets:
    - *.*.*.109:9594
- job_name: 'redis-exporter-targets'
  static_configs:
  - targets:
    - redis://*.*.*.146:7090
    - redis://*.*.*.144:7090
    - redis://*.*.*.133:7091
  metrics_path: /scrape
  relabel_configs:
  - source_labels: [__address__]
    target_label: __param_target
  - source_labels: [__param_target]
    target_label: instance
  - target_label: __address__
    replacement: *.*.*.109:9594
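The relabel_configs of the last job follow the usual multi-target exporter pattern: the listed redis:// address becomes both the target URL parameter and the instance label, while the scrape itself is sent to the exporter. Below is a rough Python model of those three steps. The addresses (10.0.0.x) are invented stand-ins for the masked ones, and the function name is hypothetical.

```python
def relabel_redis_target(address: str, exporter: str) -> dict:
    """Mimic the three relabel steps for a single static target."""
    labels = {"__address__": address}
    # Step 1: copy __address__ into the ?target= URL parameter.
    labels["__param_target"] = labels["__address__"]
    # Step 2: use that parameter as the instance label.
    labels["instance"] = labels["__param_target"]
    # Step 3: point the actual scrape at the exporter.
    labels["__address__"] = exporter
    return labels


# Made-up addresses, for illustration only.
result = relabel_redis_target("redis://10.0.0.46:7090", "10.0.0.9:9594")
print(result)
```

With these values Prometheus would request /scrape?target=redis://10.0.0.46:7090 from the exporter, and the resulting series would carry the Redis address as their instance label.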
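Next, store this scrape configuration in the cluster as a Secret. The Prometheus resource picks it up through additionalScrapeConfigs, as in the prometheus-prometheus.yaml excerpt in the data persistence section.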
kubectl create secret generic additional-scrape-configs --from-file=prometheus-additional.yaml -n monitoring
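For reference, --from-file stores the file under a key equal to its file name, with the content base64-encoded in the Secret's data field. The sketch below builds an equivalent manifest as a Python dictionary. The function name and the sample content are illustrative assumptions, not part of kube-prometheus.

```python
import base64


def secret_from_file(name: str, namespace: str, filename: str, content: bytes) -> dict:
    """Build a manifest equivalent to `kubectl create secret generic --from-file`."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "type": "Opaque",
        "metadata": {"name": name, "namespace": namespace},
        # key = file name, value = base64 of the raw file content
        "data": {filename: base64.b64encode(content).decode("ascii")},
    }


# Sample content only; the real file holds the full scrape_configs list.
sample = b"- job_name: 'mysql-exporter'\n"
secret = secret_from_file(
    "additional-scrape-configs", "monitoring", "prometheus-additional.yaml", sample
)
print(list(secret["data"]))
```

The key name matters: it has to equal the key given under additionalScrapeConfigs in the Prometheus resource.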
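6. Scheduler and Controller Manager Configuration
Open the Status menu in the Prometheus UI and look at Targets: only two scrape jobs, kube-scheduler and kube-controller-manager, have no targets. The cause lies in the corresponding ServiceMonitor resources.
Look at the prometheus-serviceMonitorKubeScheduler manifest: its selector matches Service labels, but the kube-system namespace contains no Service labeled k8s-app=kube-scheduler.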
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    k8s-app: kube-scheduler
  name: kube-scheduler
  namespace: monitoring
spec:
  endpoints:
  - interval: 30s        # scrape every 30s
    port: http-metrics   # name of the Service port to scrape
  jobLabel: k8s-app
  # Namespaces in which to look for Services; use any: true to search all namespaces.
  namespaceSelector:
    matchNames:
    - kube-system
  # Labels of the Services to select. With matchLabels, every listed label must match.
  # matchExpressions adds set-based requirements (In, NotIn, Exists, DoesNotExist),
  # and all of them must hold as well.
  selector:
    matchLabels:
      k8s-app: kube-scheduler
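To see why the job has no targets, it helps to model how a label selector is evaluated. The following is a simplified Python sketch of the matching rules, not the Kubernetes implementation. The Service names and labels in the example are assumptions made for illustration.

```python
def matches(selector: dict, labels: dict) -> bool:
    """Return True if `labels` satisfy a Kubernetes-style label selector."""
    # Every matchLabels pair must be present with the same value.
    for key, value in selector.get("matchLabels", {}).items():
        if labels.get(key) != value:
            return False
    # Every matchExpressions requirement must hold as well (logical AND).
    for req in selector.get("matchExpressions", []):
        key, op = req["key"], req["operator"]
        values = req.get("values", [])
        if op == "In" and labels.get(key) not in values:
            return False
        if op == "NotIn" and labels.get(key) in values:
            return False
        if op == "Exists" and key not in labels:
            return False
        if op == "DoesNotExist" and key in labels:
            return False
    return True


selector = {"matchLabels": {"k8s-app": "kube-scheduler"}}

# Assumed Service labels before and after adding a kube-scheduler Service.
before = {"kube-dns": {"k8s-app": "kube-dns"}}
after = dict(before, **{"kube-scheduler": {"k8s-app": "kube-scheduler"}})

print([name for name, lbl in before.items() if matches(selector, lbl)])
print([name for name, lbl in after.items() if matches(selector, lbl)])
```

As long as no Service carries the k8s-app=kube-scheduler label, the selection is empty, which corresponds to the empty target list seen in the UI.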
Create prometheus-kubeSchedulerService.yaml:
apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  name: kube-scheduler
  labels:
    k8s-app: kube-scheduler    # must match the selector in the ServiceMonitor
spec:
  selector:
    component: kube-scheduler  # must match the labels on the scheduler pod
  ports:
  - name: http-metrics
    port: 10251
    targetPort: 10251
    protocol: TCP
In the same way, create prometheus-kubeControllerManagerService.yaml:
apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  name: kube-controller-manager
  labels:
    k8s-app: kube-controller-manager
spec:
  selector:
    component: kube-controller-manager
  ports:
  - name: http-metrics
    port: 10252
    targetPort: 10252
    protocol: TCP
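Besides the labels, the port name has to line up: a ServiceMonitor endpoint refers to a Service port by name, not by number. The sketch below checks that link for the controller manager. The ServiceMonitor fragment is an assumption modeled on the scheduler one shown earlier, and the function name is hypothetical.

```python
def find_scrape_port(service_monitor: dict, service: dict):
    """Return the Service port number the first endpoint refers to, or None."""
    wanted = service_monitor["spec"]["endpoints"][0]["port"]
    for port in service["spec"]["ports"]:
        if port["name"] == wanted:
            return port["port"]
    return None


# Assumed to mirror the scheduler ServiceMonitor (port name http-metrics).
monitor = {"spec": {"endpoints": [{"interval": "30s", "port": "http-metrics"}]}}
service = {"spec": {"ports": [{"name": "http-metrics", "port": 10252, "targetPort": 10252}]}}

print(find_scrape_port(monitor, service))
```

If the Service port were named differently, the lookup would return None, and under this model the ServiceMonitor would again find nothing to scrape.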
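7. Alertmanager Configuration
With the scrape targets and alerting rules in place, the next step is to configure how Alertmanager delivers alerts.
Email is the most common receiver, but here we want alerts delivered to WeChat Work (企业微信), so we developed an application, appalertservice, that connects to WeChat and handles message forwarding and processing.
You can of course also configure the WeChat account and message template directly in Alertmanager; see "第3章 Prometheus告警处理" (Chapter 3, Prometheus Alert Handling).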
global:
  resolve_timeout: 5m
  smtp_from: '123456789@qq.com'
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_auth_username: '123456789@qq.com'
  smtp_auth_password: '123456789'
  smtp_require_tls: false
  smtp_hello: 'qq.com'
route:
  group_by: ['alertname']
  group_wait: 5s
  group_interval: 5s
  repeat_interval: 5m
  receiver: 'email'
  # receiver: webhook
receivers:
- name: 'email'
  email_configs:
  - to: '123456789@qq.com'
    send_resolved: true
#- name: webhook
#  webhook_configs:
#  - url: 'http://app-alert.service:20119/'
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname', 'dev', 'instance']
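The inhibit_rules entry mutes a warning alert while a critical alert with the same alertname, dev and instance labels is firing. A simplified Python model of that decision follows; it is a sketch of the documented semantics, not Alertmanager's code, and the function name and sample alerts are made up.

```python
def is_inhibited(target: dict, firing: list, rule: dict) -> bool:
    """Return True if `target` is muted by some firing alert under `rule`."""
    # The target must carry every target_match label with the given value.
    if any(target.get(k) != v for k, v in rule["target_match"].items()):
        return False
    for source in firing:
        # The source must carry every source_match label with the given value.
        if any(source.get(k) != v for k, v in rule["source_match"].items()):
            continue
        # A missing label is treated the same as an empty one.
        if all(source.get(k, "") == target.get(k, "") for k in rule["equal"]):
            return True
    return False


rule = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["alertname", "dev", "instance"],
}
critical = {"alertname": "KubeAPIErrorsHigh", "severity": "critical"}
warning = {"alertname": "KubeAPIErrorsHigh", "severity": "warning"}
other = {"alertname": "KubeletTooManyPods", "severity": "warning"}

print(is_inhibited(warning, [critical], rule))
print(is_inhibited(other, [critical], rule))
```

Note that the custom rules in prometheus-additional-rules.yaml set a status label rather than severity, so under this model they would neither inhibit nor be inhibited by this rule.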
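Then store the Alertmanager configuration in the cluster as a Secret. kube-prometheus already creates a Secret named alertmanager-main from alertmanager-secret.yaml, so if it exists, delete it first (kubectl delete secret alertmanager-main -n monitoring); otherwise the create command below fails because the Secret already exists.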
kubectl create secret generic alertmanager-main --from-file=alertmanager.yaml -n monitoring
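If the commented-out webhook receiver is enabled instead of email, Alertmanager POSTs a JSON document to the forwarding service. The source does not show how appalertservice works internally, so the following is only a hypothetical sketch of the formatting step such a service might perform. It relies on the alerts, status, labels and annotations fields of Alertmanager's webhook payload; the function name and the sample payload are invented.

```python
import json


def format_alerts(payload: dict) -> list:
    """Turn an Alertmanager webhook payload into one text line per alert."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        # The rules in this article use either summary or message.
        text = annotations.get("summary") or annotations.get("message") or ""
        parts = [
            "[{}]".format(alert.get("status", "unknown").upper()),
            annotations.get("platform", ""),
            labels.get("alertname", "") + ":",
            text,
        ]
        lines.append(" ".join(p for p in parts if p))
    return lines


# Invented sample payload, reduced to the fields used above.
sample = json.loads("""
{"status": "firing",
 "alerts": [{"status": "firing",
             "labels": {"alertname": "InstanceDown", "instance": "10.0.0.5:9100"},
             "annotations": {"platform": "demo", "summary": "10.0.0.5:9100 is down"}}]}
""")
messages = format_alerts(sample)
print(messages)
```

A real forwarder would wrap this in an HTTP handler and pass each line on to the messaging API, which is outside the scope of this sketch.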
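8. Persisting Monitoring Data
By default neither Prometheus nor Grafana persists its data, so after a restart the configured dashboards, accounts and passwords, and collected metrics are lost. Persistence is therefore well worth setting up.
Out of the box the data lives in an emptyDir volume, whose lifetime is tied to the pod: when the pod is deleted or recreated, all monitoring data disappears with it.
Here we use a StorageClass for persistence; for the setup itself see: Kubernetes实战总结 - 动态存储管理StorageClass
For Grafana, add a PVC to grafana-deployment.yaml and change the grafana-storage entry under volumes so that it uses the claim.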
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grafana
  namespace: monitoring
  annotations:
    volume.beta.kubernetes.io/storage-class: "managed-nfs-storage"
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: grafana
  name: grafana
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
      #- image: grafana/grafana:6.4.3
      - image: registry.cn-shanghai.aliyuncs.com/leozhanggg/prometheus/grafana:6.4.3
        name: grafana
        ports:
        - containerPort: 3000
          name: http
        readinessProbe:
          httpGet:
            path: /api/health
            port: http
        resources:
          limits:
            cpu: 200m
            memory: 200Mi
          requests:
            cpu: 100m
            memory: 100Mi
        volumeMounts:
        - mountPath: /var/lib/grafana
          name: grafana-storage
          readOnly: false
        ...
      volumes:
      # - emptyDir: {}
      #   name: grafana-storage
      - name: grafana-storage
        persistentVolumeClaim:
          claimName: grafana
      ...
For Prometheus, it is enough to add a volumeClaimTemplate under storage in prometheus-prometheus.yaml.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  labels:
    prometheus: k8s
  name: k8s
  namespace: monitoring
spec:
  alerting:
    alertmanagers:
    - name: alertmanager-main
      namespace: monitoring
      port: web
  #baseImage: quay.io/prometheus/prometheus
  baseImage: registry.cn-shanghai.aliyuncs.com/leozhanggg/prometheus/prometheus
  additionalScrapeConfigs:
    name: additional-scrape-configs
    key: prometheus-additional.yaml
  retention: 15d
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: managed-nfs-storage
        resources:
          requests:
            storage: 100Gi
  ...
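Whether 100Gi is enough for 15 days depends on the ingestion rate. The Prometheus documentation gives a rough rule of thumb: needed disk space = retention time in seconds * ingested samples per second * bytes per sample, with about 1 to 2 bytes per sample. The sketch below inverts that formula for the values above. The function name is made up, and the result is an estimate that ignores the write-ahead log and other overhead.

```python
def max_ingest_rate(disk_bytes: int, retention_days: int, bytes_per_sample: float = 2.0) -> float:
    """Samples per second that fit into disk_bytes over the retention period."""
    retention_seconds = retention_days * 24 * 3600
    return disk_bytes / (retention_seconds * bytes_per_sample)


# 100Gi volume, 15d retention, pessimistic 2 bytes per sample.
rate = max_ingest_rate(100 * 2**30, 15)
print(round(rate))
```

With these inputs the volume holds roughly 41,000 samples per second for the full retention period. The actual ingestion rate can be read from rate(prometheus_tsdb_head_samples_appended_total[5m]), the same metric the PrometheusNotIngestingSamples rule uses.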
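9. Grafana Dashboard Configuration
Monitoring and alerting are now configured, which leaves visualization. Open Grafana, click the add (+) button, choose Import, then Upload .json file, and import the monitoring dashboard.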
{ "annotations": { "list": [ { "builtIn": 1, "datasource": "-- Grafana --", "enable": true, "hide": true, "iconColor": "rgba(0, 211, 255, 1)", "name": "Annotations & Alerts", "type": "dashboard" } ] }, "description": "This dashboard provides cluster admins with the ability to monitor nodes and identify workload bottlenecks. It can be deployed with PSPs enabled using the following helm chart - https://github.com/pivotal-cf/charts-grafana", "editable": true, "gnetId": 10000, "graphTooltip": 0, "id": 102, "iteration": 1597137794957, "links": [], "panels": [ { "collapsed": false, "datasource": null, "gridPos": { "h": 1, "w": 24, "x": 0, "y": 0 }, "id": 34, "panels": [], "repeat": null, "title": "Summary", "type": "row" }, { "cacheTimeout": null, "colorBackground": false, "colorValue": true, "colors": [ "rgba(50, 172, 45, 0.97)", "rgba(237, 129, 40, 0.89)", "rgba(245, 54, 54, 0.9)" ], "datasource": "prometheus", "editable": true, "error": false, "format": "percent", "gauge": { "maxValue": 100, "minValue": 0, "show": true, "thresholdLabels": false, "thresholdMarkers": true }, "gridPos": { "h": 5, "w": 8, "x": 0, "y": 1 }, "height": "180px", "id": 4, "interval": null, "links": [], "mappingType": 1, "mappingTypes": [ { "name": "value to text", "value": 1 }, { "name": "range to text", "value": 2 } ], "maxDataPoints": 100, "nullPointMode": "connected", "nullText": null, "options": {}, "postfix": "", "postfixFontSize": "50%", "prefix": "", "prefixFontSize": "50%", "rangeMaps": [ { "from": "null", "text": "N/A", "to": "null" } ], "sparkline": { "fillColor": "rgba(31, 118, 189, 0.18)", "full": false, "lineColor": "rgb(31, 120, 193)", "show": false }, "tableColumn": "", "targets": [ { "expr": "sum (container_memory_working_set_bytes{id=\"/\",kubernetes_io_hostname=~\"^$Node$\"}) / sum (machine_memory_bytes{kubernetes_io_hostname=~\"^$Node$\"}) * 100", "format": "time_series", "interval": "10s", "intervalFactor": 1, "refId": "A", "step": 10 } ], "thresholds": "65, 90", "title": 
"Cluster memory usage", "type": "singlestat", "valueFontSize": "80%", "valueMaps": [ { "op": "=", "text": "N/A", "value": "null" } ], "valueName": "current" }, { "cacheTimeout": null, "colorBackground": false, "colorValue": true, "colors": [ "rgba(50, 172, 45, 0.97)", "rgba(237, 129, 40, 0.89)", "rgba(245, 54, 54, 0.9)" ], "datasource": "prometheus", "decimals": 2, "editable": true, "error": false, "format": "percent", "gauge": { "maxValue": 100, "minValue": 0, "show": true, "thresholdLabels": false, "thresholdMarkers": true }, "gridPos": { "h": 5, "w": 8, "x": 8, "y": 1 }, "height": "180px", "id": 6, "interval": null, "links": [], "mappingType": 1, "mappingTypes": [ { "name": "value to text", "value": 1 }, { "name": "range to text", "value": 2 } ], "maxDataPoints": 100, "nullPointMode": "connected", "nullText": null, "options": {}, "postfix": "", "postfixFontSize": "50%", "prefix": "", "prefixFontSize": "50%", "rangeMaps": [ { "from": "null", "text": "N/A", "to": "null" } ], "sparkline": { "fillColor": "rgba(31, 118, 189, 0.18)", "full": false, "lineColor": "rgb(31, 120, 193)", "show": false }, "tableColumn": "", "targets": [ { "expr": "sum (rate (container_cpu_usage_seconds_total{id=\"/\",kubernetes_io_hostname=~\"^$Node$\"}[$interval])) / sum (machine_cpu_cores{kubernetes_io_hostname=~\"^$Node$\"}) * 100", "format": "time_series", "interval": "10s", "intervalFactor": 1, "legendFormat": "", "refId": "A", "step": 10 } ], "thresholds": "65, 90", "title": "Cluster CPU usage", "type": "singlestat", "valueFontSize": "80%", "valueMaps": [ { "op": "=", "text": "N/A", "value": "null" } ], "valueName": "current" }, { "cacheTimeout": null, "colorBackground": false, "colorValue": true, "colors": [ "rgba(50, 172, 45, 0.97)", "rgba(237, 129, 40, 0.89)", "rgba(245, 54, 54, 0.9)" ], "datasource": "prometheus", "decimals": 2, "editable": true, "error": false, "format": "percent", "gauge": { "maxValue": 100, "minValue": 0, "show": true, "thresholdLabels": false, 
"thresholdMarkers": true }, "gridPos": { "h": 5, "w": 8, "x": 16, "y": 1 }, "height": "180px", "id": 7, "interval": null, "links": [], "mappingType": 1, "mappingTypes": [ { "name": "value to text", "value": 1 }, { "name": "range to text", "value": 2 } ], "maxDataPoints": 100, "nullPointMode": "connected", "nullText": null, "options": {}, "postfix": "", "postfixFontSize": "50%", "prefix": "", "prefixFontSize": "50%", "rangeMaps": [ { "from": "null", "text": "N/A", "to": "null" } ], "sparkline": { "fillColor": "rgba(31, 118, 189, 0.18)", "full": false, "lineColor": "rgb(31, 120, 193)", "show": false }, "tableColumn": "", "targets": [ { "expr": "sum (container_fs_usage_bytes{id=\"/\"}) / sum (container_fs_limit_bytes{id=\"/\"}) * 100", "format": "time_series", "interval": "10s", "intervalFactor": 1, "legendFormat": "", "metric": "", "refId": "A", "step": 10 } ], "thresholds": "65, 90", "title": "Cluster filesystem usage", "type": "singlestat", "valueFontSize": "80%", "valueMaps": [ { "op": "=", "text": "N/A", "value": "null" } ], "valueName": "current" }, { "cacheTimeout": null, "colorBackground": false, "colorValue": false, "colors": [ "rgba(50, 172, 45, 0.97)", "rgba(237, 129, 40, 0.89)", "rgba(245, 54, 54, 0.9)" ], "datasource": "prometheus", "decimals": 2, "editable": true, "error": false, "format": "bytes", "gauge": { "maxValue": 100, "minValue": 0, "show": false, "thresholdLabels": false, "thresholdMarkers": true }, "gridPos": { "h": 3, "w": 4, "x": 0, "y": 6 }, "height": "1px", "id": 9, "interval": null, "links": [], "mappingType": 1, "mappingTypes": [ { "name": "value to text", "value": 1 }, { "name": "range to text", "value": 2 } ], "maxDataPoints": 100, "nullPointMode": "connected", "nullText": null, "options": {}, "postfix": "", "postfixFontSize": "20%", "prefix": "", "prefixFontSize": "20%", "rangeMaps": [ { "from": "null", "text": "N/A", "to": "null" } ], "sparkline": { "fillColor": "rgba(31, 118, 189, 0.18)", "full": false, "lineColor": "rgb(31, 120, 
193)", "show": false }, "tableColumn": "", "targets": [ { "expr": "sum (container_memory_working_set_bytes{id=\"/\",kubernetes_io_hostname=~\"^$Node$\"})", "format": "time_series", "interval": "10s", "intervalFactor": 1, "refId": "A", "step": 10 } ], "thresholds": "", "title": "Used", "type": "singlestat", "valueFontSize": "50%", "valueMaps": [ { "op": "=", "text": "N/A", "value": "null" } ], "valueName": "current" }, { "cacheTimeout": null, "colorBackground": false, "colorValue": false, "colors": [ "rgba(50, 172, 45, 0.97)", "rgba(237, 129, 40, 0.89)", "rgba(245, 54, 54, 0.9)" ], "datasource": "prometheus", "decimals": 2, "editable": true, "error": false, "format": "bytes", "gauge": { "maxValue": 100, "minValue": 0, "show": false, "thresholdLabels": false, "thresholdMarkers": true }, "gridPos": { "h": 3, "w": 4, "x": 4, "y": 6 }, "height": "1px", "id": 10, "interval": null, "links": [], "mappingType": 1, "mappingTypes": [ { "name": "value to text", "value": 1 }, { "name": "range to text", "value": 2 } ], "maxDataPoints": 100, "nullPointMode": "connected", "nullText": null, "options": {}, "postfix": "", "postfixFontSize": "50%", "prefix": "", "prefixFontSize": "50%", "rangeMaps": [ { "from": "null", "text": "N/A", "to": "null" } ], "sparkline": { "fillColor": "rgba(31, 118, 189, 0.18)", "full": false, "lineColor": "rgb(31, 120, 193)", "show": false }, "tableColumn": "", "targets": [ { "expr": "sum (machine_memory_bytes{kubernetes_io_hostname=~\"^$Node$\"})", "format": "time_series", "interval": "10s", "intervalFactor": 1, "refId": "A", "step": 10 } ], "thresholds": "", "title": "Total", "type": "singlestat", "valueFontSize": "50%", "valueMaps": [ { "op": "=", "text": "N/A", "value": "null" } ], "valueName": "current" }, { "cacheTimeout": null, "colorBackground": false, "colorValue": false, "colors": [ "rgba(50, 172, 45, 0.97)", "rgba(237, 129, 40, 0.89)", "rgba(245, 54, 54, 0.9)" ], "datasource": "prometheus", "decimals": 2, "editable": true, "error": false, 
"format": "none", "gauge": { "maxValue": 100, "minValue": 0, "show": false, "thresholdLabels": false, "thresholdMarkers": true }, "gridPos": { "h": 3, "w": 4, "x": 8, "y": 6 }, "height": "1px", "id": 11, "interval": null, "links": [], "mappingType": 1, "mappingTypes": [ { "name": "value to text", "value": 1 }, { "name": "range to text", "value": 2 } ], "maxDataPoints": 100, "nullPointMode": "connected", "nullText": null, "options": {}, "postfix": " cores", "postfixFontSize": "30%", "prefix": "", "prefixFontSize": "50%", "rangeMaps": [ { "from": "null", "text": "N/A", "to": "null" } ], "sparkline": { "fillColor": "rgba(31, 118, 189, 0.18)", "full": false, "lineColor": "rgb(31, 120, 193)", "show": false }, "tableColumn": "", "targets": [ { "expr": "sum (rate (container_cpu_usage_seconds_total{id=\"/\",kubernetes_io_hostname=~\"^$Node$\"}[$interval]))", "format": "time_series", "interval": "10s", "intervalFactor": 1, "legendFormat": "", "refId": "A", "step": 10 } ], "thresholds": "", "title": "Used", "type": "singlestat", "valueFontSize": "50%", "valueMaps": [ { "op": "=", "text": "N/A", "value": "null" } ], "valueName": "current" }, { "cacheTimeout": null, "colorBackground": false, "colorValue": false, "colors": [ "rgba(50, 172, 45, 0.97)", "rgba(237, 129, 40, 0.89)", "rgba(245, 54, 54, 0.9)" ], "datasource": "prometheus", "decimals": 2, "editable": true, "error": false, "format": "none", "gauge": { "maxValue": 100, "minValue": 0, "show": false, "thresholdLabels": false, "thresholdMarkers": true }, "gridPos": { "h": 3, "w": 4, "x": 12, "y": 6 }, "height": "1px", "id": 12, "interval": null, "links": [], "mappingType": 1, "mappingTypes": [ { "name": "value to text", "value": 1 }, { "name": "range to text", "value": 2 } ], "maxDataPoints": 100, "nullPointMode": "connected", "nullText": null, "options": {}, "postfix": " cores", "postfixFontSize": "30%", "prefix": "", "prefixFontSize": "50%", "rangeMaps": [ { "from": "null", "text": "N/A", "to": "null" } ], "sparkline": { 
"fillColor": "rgba(31, 118, 189, 0.18)", "full": false, "lineColor": "rgb(31, 120, 193)", "show": false }, "tableColumn": "", "targets": [ { "expr": "sum (machine_cpu_cores{kubernetes_io_hostname=~\"^$Node$\"})", "format": "time_series", "interval": "10s", "intervalFactor": 1, "refId": "A", "step": 10 } ], "thresholds": "", "title": "Total", "type": "singlestat", "valueFontSize": "50%", "valueMaps": [ { "op": "=", "text": "N/A", "value": "null" } ], "valueName": "current" }, { "cacheTimeout": null, "colorBackground": false, "colorValue": false, "colors": [ "rgba(50, 172, 45, 0.97)", "rgba(237, 129, 40, 0.89)", "rgba(245, 54, 54, 0.9)" ], "datasource": "prometheus", "decimals": 2, "editable": true, "error": false, "format": "bytes", "gauge": { "maxValue": 100, "minValue": 0, "show": false, "thresholdLabels": false, "thresholdMarkers": true }, "gridPos": { "h": 3, "w": 4, "x": 16, "y": 6 }, "height": "1px", "id": 13, "interval": null, "links": [], "mappingType": 1, "mappingTypes": [ { "name": "value to text", "value": 1 }, { "name": "range to text", "value": 2 } ], "maxDataPoints": 100, "nullPointMode": "connected", "nullText": null, "options": {}, "postfix": "", "postfixFontSize": "50%", "prefix": "", "prefixFontSize": "50%", "rangeMaps": [ { "from": "null", "text": "N/A", "to": "null" } ], "sparkline": { "fillColor": "rgba(31, 118, 189, 0.18)", "full": false, "lineColor": "rgb(31, 120, 193)", "show": false }, "tableColumn": "", "targets": [ { "expr": "sum (container_fs_usage_bytes{id=\"/\"})", "format": "time_series", "interval": "10s", "intervalFactor": 1, "legendFormat": "", "refId": "A", "step": 10 } ], "thresholds": "", "title": "Used", "type": "singlestat", "valueFontSize": "50%", "valueMaps": [ { "op": "=", "text": "N/A", "value": "null" } ], "valueName": "current" }, { "cacheTimeout": null, "colorBackground": false, "colorValue": false, "colors": [ "rgba(50, 172, 45, 0.97)", "rgba(237, 129, 40, 0.89)", "rgba(245, 54, 54, 0.9)" ], "datasource": "prometheus", 
"decimals": 2, "editable": true, "error": false, "format": "bytes", "gauge": { "maxValue": 100, "minValue": 0, "show": false, "thresholdLabels": false, "thresholdMarkers": true }, "gridPos": { "h": 3, "w": 4, "x": 20, "y": 6 }, "height": "1px", "id": 14, "interval": null, "links": [], "mappingType": 1, "mappingTypes": [ { "name": "value to text", "value": 1 }, { "name": "range to text", "value": 2 } ], "maxDataPoints": 100, "nullPointMode": "connected", "nullText": null, "options": {}, "postfix": "", "postfixFontSize": "50%", "prefix": "", "prefixFontSize": "50%", "rangeMaps": [ { "from": "null", "text": "N/A", "to": "null" } ], "sparkline": { "fillColor": "rgba(31, 118, 189, 0.18)", "full": false, "lineColor": "rgb(31, 120, 193)", "show": false }, "tableColumn": "", "targets": [ { "expr": "sum (container_fs_limit_bytes{id=\"/\"})", "format": "time_series", "interval": "10s", "intervalFactor": 1, "legendFormat": "", "refId": "A", "step": 10 } ], "thresholds": "", "title": "Total", "type": "singlestat", "valueFontSize": "50%", "valueMaps": [ { "op": "=", "text": "N/A", "value": "null" } ], "valueName": "current" }, { "collapsed": false, "datasource": null, "gridPos": { "h": 1, "w": 24, "x": 0, "y": 9 }, "id": 35, "panels": [], "repeat": null, "title": "Memory", "type": "row" }, { "aliasColors": {}, "bars": false, "dashLength": 10, "dashes": false, "datasource": "prometheus", "decimals": 2, "editable": true, "error": false, "fill": 0, "fillGradient": 0, "grid": {}, "gridPos": { "h": 7, "w": 24, "x": 0, "y": 10 }, "id": 25, "legend": { "alignAsTable": true, "avg": true, "current": true, "hideEmpty": false, "max": false, "min": false, "rightSide": true, "show": true, "sideWidth": 200, "sort": "current", "sortDesc": true, "total": false, "values": true }, "lines": true, "linewidth": 1, "links": [], "nullPointMode": "connected", "options": { "dataLinks": [] }, "percentage": false, "pointradius": 5, "points": false, "renderer": "flot", "seriesOverrides": [], 
"spaceLength": 10, "stack": true, "steppedLine": false, "targets": [ { "expr": "sum (container_memory_working_set_bytes{image!=\"\",name=~\"^k8s_.*\",kubernetes_io_hostname=~\"^$Node$\"}) by (pod)", "format": "time_series", "interval": "10s", "intervalFactor": 1, "legendFormat": "{{ pod }}", "metric": "container_memory_usage:sort_desc", "refId": "A", "step": 10 } ], "thresholds": [], "timeFrom": null, "timeRegions": [], "timeShift": null, "title": "Pods memory usage", "tooltip": { "msResolution": false, "shared": true, "sort": 2, "value_type": "cumulative" }, "type": "graph", "xaxis": { "buckets": null, "mode": "time", "name": null, "show": true, "values": [] }, "yaxes": [ { "format": "bytes", "label": null, "logBase": 1, "max": null, "min": null, "show": true }, { "format": "short", "label": null, "logBase": 1, "max": null, "min": null, "show": false } ], "yaxis": { "align": false, "alignLevel": null } }, { "collapsed": false, "datasource": null, "gridPos": { "h": 1, "w": 24, "x": 0, "y": 17 }, "id": 37, "panels": [], "title": "CPU", "type": "row" }, { "aliasColors": {}, "bars": false, "dashLength": 10, "dashes": false, "datasource": "prometheus", "decimals": 3, "editable": true, "error": false, "fill": 0, "fillGradient": 0, "grid": {}, "gridPos": { "h": 7, "w": 24, "x": 0, "y": 18 }, "height": "", "id": 17, "legend": { "alignAsTable": true, "avg": true, "current": true, "max": false, "min": false, "rightSide": true, "show": true, "sort": "current", "sortDesc": true, "total": false, "values": true }, "lines": true, "linewidth": 1, "links": [], "nullPointMode": "connected", "options": { "dataLinks": [] }, "percentage": false, "pointradius": 5, "points": false, "renderer": "flot", "seriesOverrides": [], "spaceLength": 10, "stack": true, "steppedLine": false, "targets": [ { "expr": "sum (rate (container_cpu_usage_seconds_total{image!=\"\",name=~\"^k8s_.*\",kubernetes_io_hostname=~\"^$Node$\"}[$interval])) by (pod)", "format": "time_series", "interval": "10s", 
"intervalFactor": 1, "legendFormat": "{{ pod }}", "metric": "container_cpu", "refId": "A", "step": 10 } ], "thresholds": [], "timeFrom": null, "timeRegions": [], "timeShift": null, "title": "Pods CPU usage", "tooltip": { "msResolution": true, "shared": true, "sort": 2, "value_type": "cumulative" }, "type": "graph", "xaxis": { "buckets": null, "mode": "time", "name": null, "show": true, "values": [] }, "yaxes": [ { "format": "none", "label": "cores", "logBase": 1, "max": null, "min": null, "show": true }, { "format": "short", "label": null, "logBase": 1, "max": null, "min": null, "show": false } ], "yaxis": { "align": false, "alignLevel": null } }, { "collapsed": false, "datasource": null, "gridPos": { "h": 1, "w": 24, "x": 0, "y": 25 }, "id": 33, "panels": [], "repeat": null, "title": "Network I/O", "type": "row" }, { "aliasColors": {}, "bars": false, "dashLength": 10, "dashes": false, "datasource": "prometheus", "decimals": 2, "editable": true, "error": false, "fill": 1, "fillGradient": 0, "grid": {}, "gridPos": { "h": 7, "w": 24, "x": 0, "y": 26 }, "id": 16, "legend": { "alignAsTable": true, "avg": true, "current": true, "max": false, "min": false, "rightSide": true, "show": true, "sideWidth": 200, "sort": "current", "sortDesc": true, "total": false, "values": true }, "lines": true, "linewidth": 2, "links": [], "nullPointMode": "connected", "options": { "dataLinks": [] }, "percentage": false, "pointradius": 5, "points": false, "renderer": "flot", "seriesOverrides": [], "spaceLength": 10, "stack": false, "steppedLine": false, "targets": [ { "expr": "sum (rate (container_network_receive_bytes_total{image!=\"\",name=~\"^k8s_.*\",kubernetes_io_hostname=~\"^$Node$\"}[$interval])) by (pod)", "format": "time_series", "interval": "10s", "intervalFactor": 1, "legendFormat": "-> {{ pod }}", "metric": "network", "refId": "A", "step": 10 }, { "expr": "- sum (rate 
(container_network_transmit_bytes_total{image!=\"\",name=~\"^k8s_.*\",kubernetes_io_hostname=~\"^$Node$\"}[$interval])) by (pod)", "format": "time_series", "interval": "10s", "intervalFactor": 1, "legendFormat": "<- {{ pod }}", "metric": "network", "refId": "B", "step": 10 } ], "thresholds": [], "timeFrom": null, "timeRegions": [], "timeShift": null, "title": "Pods network I/O", "tooltip": { "msResolution": false, "shared": true, "sort": 2, "value_type": "cumulative" }, "type": "graph", "xaxis": { "buckets": null, "mode": "time", "name": null, "show": true, "values": [] }, "yaxes": [ { "format": "Bps", "label": null, "logBase": 1, "max": null, "min": null, "show": true }, { "format": "short", "label": null, "logBase": 1, "max": null, "min": null, "show": false } ], "yaxis": { "align": false, "alignLevel": null } }, { "aliasColors": {}, "bars": false, "dashLength": 10, "dashes": false, "datasource": "prometheus", "decimals": 2, "editable": true, "error": false, "fill": 1, "fillGradient": 0, "grid": {}, "gridPos": { "h": 5, "w": 24, "x": 0, "y": 33 }, "height": "200px", "id": 32, "legend": { "alignAsTable": false, "avg": true, "current": true, "max": false, "min": false, "rightSide": false, "show": false, "sideWidth": 200, "sort": "current", "sortDesc": true, "total": false, "values": true }, "lines": true, "linewidth": 2, "links": [], "nullPointMode": "connected", "options": { "dataLinks": [] }, "percentage": false, "pointradius": 5, "points": false, "renderer": "flot", "seriesOverrides": [], "spaceLength": 10, "stack": false, "steppedLine": false, "targets": [ { "expr": "sum (rate (container_network_receive_bytes_total{kubernetes_io_hostname=~\"^$Node$\"}[$interval]))", "format": "time_series", "interval": "10s", "intervalFactor": 1, "legendFormat": "Received", "metric": "network", "refId": "A", "step": 10 }, { "expr": "- sum (rate (container_network_transmit_bytes_total{kubernetes_io_hostname=~\"^$Node$\"}[$interval]))", "format": "time_series", "interval": "10s", 
"intervalFactor": 1, "legendFormat": "Sent", "metric": "network", "refId": "B", "step": 10 } ], "thresholds": [], "timeFrom": null, "timeRegions": [], "timeShift": null, "title": "Network I/O pressure", "tooltip": { "msResolution": false, "shared": true, "sort": 0, "value_type": "cumulative" }, "type": "graph", "xaxis": { "buckets": null, "mode": "time", "name": null, "show": true, "values": [] }, "yaxes": [ { "format": "Bps", "label": null, "logBase": 1, "max": null, "min": null, "show": true }, { "format": "Bps", "label": null, "logBase": 1, "max": null, "min": null, "show": false } ], "yaxis": { "align": false, "alignLevel": null } } ], "refresh": "10s", "schemaVersion": 20, "style": "dark", "tags": [ "Prometheus", "Kubernetes" ], "templating": { "list": [ { "auto": true, "auto_count": 20, "auto_min": "2m", "current": { "text": "auto", "value": "$__auto_interval_interval" }, "hide": 2, "label": null, "name": "interval", "options": [ { "selected": true, "text": "auto", "value": "$__auto_interval_interval" }, { "selected": false, "text": "1m", "value": "1m" }, { "selected": false, "text": "10m", "value": "10m" }, { "selected": false, "text": "30m", "value": "30m" }, { "selected": false, "text": "1h", "value": "1h" }, { "selected": false, "text": "6h", "value": "6h" }, { "selected": false, "text": "12h", "value": "12h" }, { "selected": false, "text": "1d", "value": "1d" }, { "selected": false, "text": "7d", "value": "7d" }, { "selected": false, "text": "14d", "value": "14d" }, { "selected": false, "text": "30d", "value": "30d" } ], "query": "1m,10m,30m,1h,6h,12h,1d,7d,14d,30d", "refresh": 2, "skipUrlSync": false, "type": "interval" }, { "current": { "text": "prometheus", "value": "prometheus" }, "hide": 0, "includeAll": false, "label": null, "multi": false, "name": "datasource", "options": [], "query": "prometheus", "refresh": 1, "regex": "", "skipUrlSync": false, "type": "datasource" }, { "allValue": ".*", "current": { "text": "All", "value": "$__all" }, 
"datasource": "prometheus", "definition": "", "hide": 0, "includeAll": true, "label": null, "multi": false, "name": "Node", "options": [], "query": "label_values(kubernetes_io_hostname)", "refresh": 1, "regex": "", "skipUrlSync": false, "sort": 0, "tagValuesQuery": "", "tags": [], "tagsQuery": "", "type": "query", "useTags": false } ] }, "time": { "from": "now-5m", "to": "now" }, "timepicker": { "refresh_intervals": [ "5s", "10s", "30s", "1m", "5m", "15m", "30m", "1h", "2h", "1d" ], "time_options": [ "5m", "15m", "1h", "6h", "12h", "24h", "2d", "7d", "30d" ] }, "timezone": "browser", "title": "育苗通K8S集群监控", "uid": "6KoW2MIGk", "version": 15 }
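The dashboard export above keeps its PromQL in each panel's `targets[].expr`, with Grafana template variables such as `$Node` and `$interval` still unexpanded. Before importing a dashboard like this into another cluster, it can help to list every query and check that the metrics and labels it relies on (for example `kubernetes_io_hostname`) actually exist there. The following is a minimal sketch of that audit, not part of kube-prometheus or Grafana: the helper names `iter_panels` and `list_queries` and the trimmed `sample_dashboard` are assumptions made for illustration, and the expressions are copied from the panels above.

```python
def iter_panels(panels):
    """Yield each panel, descending into row panels that nest children under "panels"."""
    for panel in panels:
        yield panel
        # Row panels may carry their child panels in their own "panels" list.
        yield from iter_panels(panel.get("panels", []))


def list_queries(dashboard):
    """Return (panel title, refId, expr) for every query target in the dashboard."""
    return [
        (panel.get("title", ""), target.get("refId", ""), target.get("expr", ""))
        for panel in iter_panels(dashboard.get("panels", []))
        for target in panel.get("targets", [])
    ]


# Trimmed stand-in for the exported dashboard. In practice the dict would come
# from json.load() on the exported file (the file name is up to you).
sample_dashboard = {
    "title": "sample",
    "panels": [
        {"type": "row", "title": "Memory", "panels": []},
        {
            "type": "graph",
            "title": "Pods memory usage",
            "targets": [
                {
                    "refId": "A",
                    "expr": 'sum (container_memory_working_set_bytes{image!="",name=~"^k8s_.*",kubernetes_io_hostname=~"^$Node$"}) by (pod)',
                }
            ],
        },
        {
            "type": "graph",
            "title": "Pods network I/O",
            "targets": [
                {
                    "refId": "A",
                    "expr": 'sum (rate (container_network_receive_bytes_total{image!="",name=~"^k8s_.*",kubernetes_io_hostname=~"^$Node$"}[$interval])) by (pod)',
                },
                {
                    "refId": "B",
                    "expr": '- sum (rate (container_network_transmit_bytes_total{image!="",name=~"^k8s_.*",kubernetes_io_hostname=~"^$Node$"}[$interval])) by (pod)',
                },
            ],
        },
    ],
}

queries = list_queries(sample_dashboard)
for title, ref_id, expr in queries:
    print(f"{title} [{ref_id}]: {expr}")
```

Row panels without targets (such as "Memory" here) contribute nothing to the list, so the output contains only panels that actually query Prometheus. The same walk works on the second dashboard below, whose expressions filter on `$job` and `$node` instead of `$Node`.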
{ "annotations": { "list": [ { "builtIn": 1, "datasource": "-- Grafana --", "enable": true, "hide": true, "iconColor": "rgba(0, 211, 255, 1)", "name": "Annotations & Alerts", "type": "dashboard" } ] }, "description": "【中文版本】2020.06.28更新,增加整体资源展示!支持 Grafana6&7,Node Exporter v0.16及以上的版本,优化重要指标展示。包含整体资源展示与资源明细图表:CPU 内存 磁盘 IO 网络等监控指标。https://github.com/starsliao/Prometheus", "editable": true, "gnetId": 8919, "graphTooltip": 0, "id": 72, "iteration": 1597137684806, "links": [ { "icon": "external link", "tags": [], "targetBlank": true, "title": "更新node_exporter", "tooltip": "", "type": "link", "url": "https://github.com/prometheus/node_exporter/releases" }, { "icon": "external link", "tags": [], "targetBlank": true, "title": "更新当前仪表板", "tooltip": "", "type": "link", "url": "https://grafana.com/dashboards/8919" }, { "icon": "external link", "tags": [], "targetBlank": true, "title": "StarsL.cn", "tooltip": "", "type": "link", "url": "https://starsl.cn" }, { "asDropdown": true, "icon": "external link", "tags": [], "targetBlank": true, "title": "", "type": "dashboards" } ], "panels": [ { "collapsed": false, "datasource": "prometheus", "gridPos": { "h": 1, "w": 24, "x": 0, "y": 0 }, "id": 187, "panels": [], "title": "资源总览(关联JOB项)当前选中主机:【$show_hostname】实例:$node", "type": "row" }, { "columns": [], "datasource": "prometheus", "description": "分区使用率、磁盘读取、磁盘写入、下载带宽、上传带宽,如果有多个网卡或者多个分区,是采集的使用率最高的网卡或者分区的数值。", "fontSize": "100%", "gridPos": { "h": 12, "w": 24, "x": 0, "y": 1 }, "id": 185, "options": {}, "pageSize": 10, "showHeader": true, "sort": { "col": 5, "desc": false }, "styles": [ { "alias": "主机名", "align": "auto", "colorMode": null, "colors": [ "rgba(245, 54, 54, 0.9)", "rgba(237, 129, 40, 0.89)", "rgba(50, 172, 45, 0.97)" ], "dateFormat": "YYYY-MM-DD HH:mm:ss", "decimals": 1, "link": false, "linkTooltip": "", "linkUrl": "", "mappingType": 1, "pattern": "nodename", "thresholds": [], "type": "string", "unit": "bytes" }, { "alias": "IP(链接到明细)", "align": "auto", "colorMode": null, 
"colors": [ "rgba(245, 54, 54, 0.9)", "rgba(237, 129, 40, 0.89)", "rgba(50, 172, 45, 0.97)" ], "dateFormat": "YYYY-MM-DD HH:mm:ss", "decimals": 2, "link": true, "linkTargetBlank": false, "linkTooltip": "浏览主机明细", "linkUrl": "/d/9CWBz0bik/node-exporter?orgId=1&var-job=${job}&var-hostname=All&var-node=${__cell}&var-device=All", "mappingType": 1, "pattern": "instance", "thresholds": [], "type": "number", "unit": "short" }, { "alias": "内存", "align": "auto", "colorMode": null, "colors": [ "rgba(245, 54, 54, 0.9)", "rgba(237, 129, 40, 0.89)", "rgba(50, 172, 45, 0.97)" ], "dateFormat": "YYYY-MM-DD HH:mm:ss", "decimals": 2, "link": false, "mappingType": 1, "pattern": "Value #B", "thresholds": [], "type": "number", "unit": "bytes" }, { "alias": "CPU核", "align": "auto", "colorMode": null, "colors": [ "rgba(245, 54, 54, 0.9)", "rgba(237, 129, 40, 0.89)", "rgba(50, 172, 45, 0.97)" ], "dateFormat": "YYYY-MM-DD HH:mm:ss", "decimals": null, "mappingType": 1, "pattern": "Value #C", "thresholds": [], "type": "number", "unit": "short" }, { "alias": " 运行时间", "align": "auto", "colorMode": null, "colors": [ "rgba(245, 54, 54, 0.9)", "rgba(237, 129, 40, 0.89)", "rgba(50, 172, 45, 0.97)" ], "dateFormat": "YYYY-MM-DD HH:mm:ss", "decimals": 2, "mappingType": 1, "pattern": "Value #D", "thresholds": [], "type": "number", "unit": "s" }, { "alias": "分区使用率*", "align": "auto", "colorMode": "cell", "colors": [ "rgba(50, 172, 45, 0.97)", "rgba(237, 129, 40, 0.89)", "rgba(245, 54, 54, 0.9)" ], "dateFormat": "YYYY-MM-DD HH:mm:ss", "decimals": 2, "mappingType": 1, "pattern": "Value #E", "thresholds": [ "70", "85" ], "type": "number", "unit": "percent" }, { "alias": "CPU使用率", "align": "auto", "colorMode": "cell", "colors": [ "rgba(50, 172, 45, 0.97)", "rgba(237, 129, 40, 0.89)", "rgba(245, 54, 54, 0.9)" ], "dateFormat": "YYYY-MM-DD HH:mm:ss", "decimals": 2, "mappingType": 1, "pattern": "Value #F", "thresholds": [ "70", "85" ], "type": "number", "unit": "percent" }, { "alias": "内存使用率", "align": "auto", 
"colorMode": "cell", "colors": [ "rgba(50, 172, 45, 0.97)", "rgba(237, 129, 40, 0.89)", "rgba(245, 54, 54, 0.9)" ], "dateFormat": "YYYY-MM-DD HH:mm:ss", "decimals": 2, "mappingType": 1, "pattern": "Value #G", "thresholds": [ "70", "85" ], "type": "number", "unit": "percent" }, { "alias": "磁盘读取*", "align": "auto", "colorMode": "cell", "colors": [ "rgba(50, 172, 45, 0.97)", "rgba(237, 129, 40, 0.89)", "rgba(245, 54, 54, 0.9)" ], "dateFormat": "YYYY-MM-DD HH:mm:ss", "decimals": 2, "mappingType": 1, "pattern": "Value #H", "thresholds": [ "10485760", "20485760" ], "type": "number", "unit": "Bps" }, { "alias": "磁盘写入*", "align": "auto", "colorMode": "cell", "colors": [ "rgba(50, 172, 45, 0.97)", "rgba(237, 129, 40, 0.89)", "rgba(245, 54, 54, 0.9)" ], "dateFormat": "YYYY-MM-DD HH:mm:ss", "decimals": 2, "mappingType": 1, "pattern": "Value #I", "thresholds": [ "10485760", "20485760" ], "type": "number", "unit": "Bps" }, { "alias": "下载带宽*", "align": "auto", "colorMode": "cell", "colors": [ "rgba(50, 172, 45, 0.97)", "rgba(237, 129, 40, 0.89)", "rgba(245, 54, 54, 0.9)" ], "dateFormat": "YYYY-MM-DD HH:mm:ss", "decimals": 2, "mappingType": 1, "pattern": "Value #J", "thresholds": [ "30485760", "104857600" ], "type": "number", "unit": "bps" }, { "alias": "上传带宽*", "align": "auto", "colorMode": "cell", "colors": [ "rgba(50, 172, 45, 0.97)", "rgba(237, 129, 40, 0.89)", "rgba(245, 54, 54, 0.9)" ], "dateFormat": "YYYY-MM-DD HH:mm:ss", "decimals": 2, "mappingType": 1, "pattern": "Value #K", "thresholds": [ "30485760", "104857600" ], "type": "number", "unit": "bps" }, { "alias": "5m负载", "align": "auto", "colorMode": null, "colors": [ "rgba(245, 54, 54, 0.9)", "rgba(237, 129, 40, 0.89)", "rgba(50, 172, 45, 0.97)" ], "dateFormat": "YYYY-MM-DD HH:mm:ss", "decimals": 2, "mappingType": 1, "pattern": "Value #L", "thresholds": [], "type": "number", "unit": "short" }, { "alias": "", "align": "right", "colorMode": null, "colors": [ "rgba(245, 54, 54, 0.9)", "rgba(237, 129, 40, 0.89)", "rgba(50, 
172, 45, 0.97)" ], "decimals": 2, "pattern": "/.*/", "thresholds": [], "type": "hidden", "unit": "short" } ], "targets": [ { "expr": "node_uname_info{job=~\"$job\"} - 0", "format": "table", "instant": true, "interval": "", "legendFormat": "主机名", "refId": "A" }, { "expr": "sum(time() - node_boot_time_seconds{job=~\"$job\"})by(instance)", "format": "table", "hide": false, "instant": true, "interval": "", "legendFormat": "运行时间", "refId": "D" }, { "expr": "node_memory_MemTotal_bytes{job=~\"$job\"} - 0", "format": "table", "hide": false, "instant": true, "interval": "", "legendFormat": "总内存", "refId": "B" }, { "expr": "count(node_cpu_seconds_total{job=~\"$job\",mode='system'}) by (instance)", "format": "table", "hide": false, "instant": true, "interval": "", "legendFormat": "总核数", "refId": "C" }, { "expr": "node_load5{job=~\"$job\"}", "format": "table", "instant": true, "interval": "", "legendFormat": "5分钟负载", "refId": "L" }, { "expr": "(1 - avg(irate(node_cpu_seconds_total{job=~\"$job\",mode=\"idle\"}[5m])) by (instance)) * 100", "format": "table", "hide": false, "instant": true, "interval": "", "legendFormat": "CPU使用率", "refId": "F" }, { "expr": "(1 - (node_memory_MemAvailable_bytes{job=~\"$job\"} / (node_memory_MemTotal_bytes{job=~\"$job\"})))* 100", "format": "table", "hide": false, "instant": true, "interval": "", "legendFormat": "内存使用率", "refId": "G" }, { "expr": "max((node_filesystem_size_bytes{job=~\"$job\",fstype=~\"ext.?|xfs\"}-node_filesystem_free_bytes{job=~\"$job\",fstype=~\"ext.?|xfs\"}) *100/(node_filesystem_avail_bytes {job=~\"$job\",fstype=~\"ext.?|xfs\"}+(node_filesystem_size_bytes{job=~\"$job\",fstype=~\"ext.?|xfs\"}-node_filesystem_free_bytes{job=~\"$job\",fstype=~\"ext.?|xfs\"})))by(instance)", "format": "table", "hide": false, "instant": true, "interval": "", "legendFormat": "分区使用率", "refId": "E" }, { "expr": "max(irate(node_disk_read_bytes_total{job=~\"$job\"}[5m])) by (instance)", "format": "table", "hide": false, "instant": true, "interval": "", 
"legendFormat": "最大读取", "refId": "H" }, { "expr": "max(irate(node_disk_written_bytes_total{job=~\"$job\"}[5m])) by (instance)", "format": "table", "hide": false, "instant": true, "interval": "", "legendFormat": "最大写入", "refId": "I" }, { "expr": "max(irate(node_network_receive_bytes_total{job=~\"$job\"}[5m])*8) by (instance)", "format": "table", "hide": false, "instant": true, "interval": "", "legendFormat": "下载带宽", "refId": "J" }, { "expr": "max(irate(node_network_transmit_bytes_total{job=~\"$job\"}[5m])*8) by (instance)", "format": "table", "hide": false, "instant": true, "interval": "", "legendFormat": "上传带宽", "refId": "K" } ], "timeFrom": null, "timeShift": null, "title": "服务器资源总览表(每页10行)", "transform": "table", "type": "table" }, { "aliasColors": { "192.168.200.241:9100_Total": "dark-red", "Idle - Waiting for something to happen": "#052B51", "guest": "#9AC48A", "idle": "#052B51", "iowait": "#EAB839", "irq": "#BF1B00", "nice": "#C15C17", "sdb_每秒I/O操作%": "#d683ce", "softirq": "#E24D42", "steal": "#FCE2DE", "system": "#508642", "user": "#5195CE", "磁盘花费在I/O操作占比": "#ba43a9" }, "bars": false, "dashLength": 10, "dashes": false, "datasource": "prometheus", "decimals": null, "description": "", "fieldConfig": { "defaults": { "custom": {} }, "overrides": [] }, "fill": 0, "fillGradient": 0, "gridPos": { "h": 8, "w": 8, "x": 0, "y": 13 }, "hiddenSeries": false, "id": 191, "legend": { "alignAsTable": false, "avg": false, "current": true, "hideEmpty": true, "hideZero": true, "max": false, "min": false, "rightSide": false, "show": true, "sideWidth": null, "sort": "current", "sortDesc": true, "total": false, "values": true }, "lines": true, "linewidth": 2, "links": [], "maxPerRow": 6, "nullPointMode": "null", "options": { "dataLinks": [] }, "percentage": false, "pointradius": 5, "points": false, "renderer": "flot", "repeat": null, "seriesOverrides": [ { "alias": "总平均使用率", "lines": false, "pointradius": 1, "points": true, "yaxis": 2 }, { "alias": "总核数", "color": "#C4162A" } ], 
"spaceLength": 10, "stack": false, "steppedLine": false, "targets": [ { "expr": "count(node_cpu_seconds_total{job=~\"$job\", mode='system'})", "format": "time_series", "hide": false, "interval": "", "intervalFactor": 1, "legendFormat": "总核数", "refId": "B", "step": 240 }, { "expr": "sum(node_load5{job=~\"$job\"})", "format": "time_series", "hide": false, "interval": "", "intervalFactor": 1, "legendFormat": "总5分钟负载", "refId": "A", "step": 240 }, { "expr": "avg(1 - avg(irate(node_cpu_seconds_total{job=~\"$job\",mode=\"idle\"}[5m])) by (instance)) * 100", "format": "time_series", "hide": false, "interval": "30m", "intervalFactor": 1, "legendFormat": "总平均使用率", "refId": "F", "step": 240 } ], "thresholds": [], "timeFrom": null, "timeRegions": [], "timeShift": null, "title": "$job:整体总负载与整体平均CPU使用率", "tooltip": { "shared": true, "sort": 2, "value_type": "individual" }, "type": "graph", "xaxis": { "buckets": null, "mode": "time", "name": null, "show": true, "values": [] }, "yaxes": [ { "decimals": null, "format": "short", "label": "总负载", "logBase": 1, "max": null, "min": null, "show": true }, { "decimals": 0, "format": "percent", "label": "平均使用率", "logBase": 1, "max": null, "min": null, "show": true } ], "yaxis": { "align": false, "alignLevel": null } }, { "aliasColors": { "192.168.200.241:9100_总内存": "dark-red", "内存_Avaliable": "#6ED0E0", "内存_Cached": "#EF843C", "内存_Free": "#629E51", "内存_Total": "#6d1f62", "内存_Used": "#eab839", "可用": "#9ac48a", "总内存": "#bf1b00" }, "bars": false, "dashLength": 10, "dashes": false, "datasource": "prometheus", "decimals": 1, "fieldConfig": { "defaults": { "custom": {} }, "overrides": [] }, "fill": 0, "fillGradient": 0, "gridPos": { "h": 8, "w": 8, "x": 8, "y": 13 }, "height": "300", "hiddenSeries": false, "id": 195, "legend": { "alignAsTable": false, "avg": false, "current": true, "max": false, "min": false, "rightSide": false, "show": true, "sort": "current", "sortDesc": false, "total": false, "values": true }, "lines": true, "linewidth": 2, 
"links": [], "nullPointMode": "null", "options": { "dataLinks": [] }, "percentage": false, "pointradius": 5, "points": false, "renderer": "flot", "seriesOverrides": [ { "alias": "总内存", "color": "#C4162A", "fill": 0 }, { "alias": "总平均使用率", "lines": false, "pointradius": 1, "points": true, "yaxis": 2 } ], "spaceLength": 10, "stack": false, "steppedLine": false, "targets": [ { "expr": "sum(node_memory_MemTotal_bytes{job=~\"$job\"})", "format": "time_series", "hide": false, "instant": false, "interval": "", "intervalFactor": 1, "legendFormat": "总内存", "refId": "A", "step": 4 }, { "expr": "sum(node_memory_MemTotal_bytes{job=~\"$job\"} - node_memory_MemAvailable_bytes{job=~\"$job\"})", "format": "time_series", "hide": false, "interval": "", "intervalFactor": 1, "legendFormat": "总已用", "refId": "B", "step": 4 }, { "expr": "(sum(node_memory_MemTotal_bytes{job=~\"$job\"} - node_memory_MemAvailable_bytes{job=~\"$job\"}) / sum(node_memory_MemTotal_bytes{job=~\"$job\"}))*100", "format": "time_series", "hide": false, "interval": "30m", "intervalFactor": 1, "legendFormat": "总平均使用率", "refId": "H" } ], "thresholds": [], "timeFrom": null, "timeRegions": [], "timeShift": null, "title": "$job:整体总内存与整体平均内存使用率", "tooltip": { "shared": true, "sort": 2, "value_type": "individual" }, "type": "graph", "xaxis": { "buckets": null, "mode": "time", "name": null, "show": true, "values": [] }, "yaxes": [ { "decimals": null, "format": "bytes", "label": "总内存量", "logBase": 1, "max": null, "min": "0", "show": true }, { "decimals": null, "format": "percent", "label": "平均使用率", "logBase": 1, "max": null, "min": null, "show": true } ], "yaxis": { "align": false, "alignLevel": null } }, { "aliasColors": {}, "bars": false, "dashLength": 10, "dashes": false, "datasource": "prometheus", "decimals": 1, "description": "", "fieldConfig": { "defaults": { "custom": {} }, "overrides": [] }, "fill": 0, "fillGradient": 0, "gridPos": { "h": 8, "w": 8, "x": 16, "y": 13 }, "hiddenSeries": false, "id": 197, "legend": { 
"alignAsTable": false, "avg": false, "current": true, "hideEmpty": false, "hideZero": false, "max": false, "min": false, "rightSide": false, "show": true, "sideWidth": null, "sort": "current", "sortDesc": true, "total": false, "values": true }, "lines": true, "linewidth": 2, "links": [], "nullPointMode": "null", "options": { "dataLinks": [] }, "percentage": false, "pointradius": 5, "points": false, "renderer": "flot", "seriesOverrides": [ { "alias": "总平均使用率", "lines": false, "pointradius": 1, "points": true, "yaxis": 2 }, { "alias": "总磁盘量", "color": "#C4162A" } ], "spaceLength": 10, "stack": false, "steppedLine": false, "targets": [ { "expr": "sum(avg(node_filesystem_size_bytes{job=~\"$job\",fstype=~\"xfs|ext.*\"})by(device,instance))", "format": "time_series", "instant": false, "interval": "", "intervalFactor": 1, "legendFormat": "总磁盘量", "refId": "E" }, { "expr": "sum(avg(node_filesystem_size_bytes{job=~\"$job\",fstype=~\"xfs|ext.*\"})by(device,instance)) - sum(avg(node_filesystem_free_bytes{job=~\"$job\",fstype=~\"xfs|ext.*\"})by(device,instance))", "format": "time_series", "instant": false, "interval": "", "intervalFactor": 1, "legendFormat": "总使用量", "refId": "C" }, { "expr": "(sum(avg(node_filesystem_size_bytes{job=~\"$job\",fstype=~\"xfs|ext.*\"})by(device,instance)) - sum(avg(node_filesystem_free_bytes{job=~\"$job\",fstype=~\"xfs|ext.*\"})by(device,instance))) *100/(sum(avg(node_filesystem_avail_bytes{job=~\"$job\",fstype=~\"xfs|ext.*\"})by(device,instance))+(sum(avg(node_filesystem_size_bytes{job=~\"$job\",fstype=~\"xfs|ext.*\"})by(device,instance)) - sum(avg(node_filesystem_free_bytes{job=~\"$job\",fstype=~\"xfs|ext.*\"})by(device,instance))))", "format": "time_series", "instant": false, "interval": "30m", "intervalFactor": 1, "legendFormat": "总平均使用率", "refId": "A" } ], "thresholds": [], "timeFrom": null, "timeRegions": [], "timeShift": null, "title": "$job:整体总磁盘与整体平均磁盘使用率", "tooltip": { "shared": true, "sort": 2, "value_type": "individual" }, "type": 
"graph", "xaxis": { "buckets": null, "mode": "time", "name": null, "show": true, "values": [] }, "yaxes": [ { "decimals": 1, "format": "bytes", "label": "总磁盘量", "logBase": 1, "max": null, "min": "0", "show": true }, { "decimals": null, "format": "percent", "label": "平均使用率", "logBase": 1, "max": null, "min": null, "show": true } ], "yaxis": { "align": false, "alignLevel": null } }, { "collapsed": false, "datasource": "prometheus", "gridPos": { "h": 1, "w": 24, "x": 0, "y": 21 }, "id": 189, "panels": [], "title": "资源明细:【$show_hostname】", "type": "row" }, { "cacheTimeout": null, "colorBackground": false, "colorPostfix": false, "colorPrefix": false, "colorValue": true, "colors": [ "rgba(245, 54, 54, 0.9)", "rgba(237, 129, 40, 0.89)", "rgba(50, 172, 45, 0.97)" ], "datasource": "prometheus", "decimals": 0, "description": "", "fieldConfig": { "defaults": { "custom": {} }, "overrides": [] }, "format": "s", "gauge": { "maxValue": 100, "minValue": 0, "show": false, "thresholdLabels": false, "thresholdMarkers": true }, "gridPos": { "h": 2, "w": 2, "x": 0, "y": 22 }, "hideTimeOverride": true, "id": 15, "interval": null, "links": [], "mappingType": 1, "mappingTypes": [ { "name": "value to text", "value": 1 }, { "name": "range to text", "value": 2 } ], "maxDataPoints": 100, "nullPointMode": "null", "nullText": null, "options": {}, "pluginVersion": "6.4.2", "postfix": "", "postfixFontSize": "50%", "prefix": "", "prefixFontSize": "50%", "rangeMaps": [ { "from": "null", "text": "N/A", "to": "null" } ], "sparkline": { "fillColor": "rgba(31, 118, 189, 0.18)", "full": false, "lineColor": "rgb(31, 120, 193)", "show": false }, "tableColumn": "", "targets": [ { "expr": "avg(time() - node_boot_time_seconds{instance=~\"$node\"})", "format": "time_series", "hide": false, "instant": true, "interval": "", "intervalFactor": 1, "legendFormat": "", "refId": "A", "step": 40 } ], "thresholds": "1,3", "title": "运行时间", "type": "singlestat", "valueFontSize": "70%", "valueMaps": [
{ "op": "=", "text": "N/A", "value": "null" } ], "valueName": "current" }, { "datasource": "prometheus", "fieldConfig": { "defaults": { "color": { "mode": "thresholds" }, "custom": {}, "decimals": 2, "displayName": "", "mappings": [ { "from": "", "id": 1, "operator": "", "text": "N/A", "to": "", "type": 1, "value": "0" } ], "max": 100, "min": 0, "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 70 }, { "color": "#EAB839", "value": 90 } ] }, "unit": "percent" }, "overrides": [] }, "gridPos": { "h": 6, "w": 3, "x": 2, "y": 22 }, "id": 177, "options": { "displayMode": "lcd", "fieldOptions": { "calcs": [ "last" ], "defaults": { "decimals": 1, "mappings": [ { "from": "", "id": 1, "operator": "", "text": "N/A", "to": "", "type": 1, "value": "0" } ], "max": 100, "min": 0.1, "thresholds": { "0": { "color": "green", "value": null }, "1": { "color": "red", "value": 80 }, "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "#EAB839", "value": 70 }, { "color": "red", "value": 90 } ] }, "unit": "percent" }, "override": {}, "overrides": [], "values": false }, "orientation": "horizontal", "reduceOptions": { "calcs": [ "mean" ], "values": false }, "showUnfilled": true }, "pluginVersion": "6.4.3", "targets": [ { "expr": "100 - (avg(irate(node_cpu_seconds_total{instance=~\"$node\",mode=\"idle\"}[5m])) * 100)", "instant": true, "interval": "", "legendFormat": "总CPU使用率", "refId": "A" }, { "expr": "avg(irate(node_cpu_seconds_total{instance=~\"$node\",mode=\"iowait\"}[5m])) * 100", "hide": true, "instant": true, "interval": "", "legendFormat": "IOwait使用率", "refId": "C" }, { "expr": "(1 - (node_memory_MemAvailable_bytes{instance=~\"$node\"} / (node_memory_MemTotal_bytes{instance=~\"$node\"})))* 100", "instant": true, "interval": "", "legendFormat": "内存使用率", "refId": "B" }, { "expr": 
"(node_filesystem_size_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint=\"$maxmount\"}-node_filesystem_free_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint=\"$maxmount\"})*100 /(node_filesystem_avail_bytes {instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint=\"$maxmount\"}+(node_filesystem_size_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint=\"$maxmount\"}-node_filesystem_free_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint=\"$maxmount\"}))", "hide": false, "instant": true, "interval": "", "legendFormat": "最大分区({{mountpoint}})使用率", "refId": "D" }, { "expr": "(1 - ((node_memory_SwapFree_bytes{instance=~\"$node\"} + 1)/ (node_memory_SwapTotal_bytes{instance=~\"$node\"} + 1))) * 100", "instant": true, "legendFormat": "交换分区使用率", "refId": "F" } ], "timeFrom": null, "timeShift": null, "title": "", "type": "bargauge" }, { "columns": [], "datasource": "prometheus", "description": "本看板中的:磁盘总量、使用量、可用量、使用率保持和df命令的Size、Used、Avail、Use% 列的值一致,并且Use%的值会四舍五入保留一位小数,会更加准确。\n\n注:df中Use%算法为:(size - free) * 100 / (avail + (size - free)),结果是整除则为该值,非整除则为该值+1,结果的单位是%。\n参考df命令源码:", "fieldConfig": { "defaults": { "custom": {} }, "overrides": [] }, "fontSize": "100%", "gridPos": { "h": 6, "w": 10, "x": 5, "y": 22 }, "id": 181, "links": [ { "targetBlank": true, "title": "https://github.com/coreutils/coreutils/blob/master/src/df.c", "url": "https://github.com/coreutils/coreutils/blob/master/src/df.c" } ], "options": {}, "pageSize": null, "scroll": true, "showHeader": true, "sort": { "col": 6, "desc": false }, "styles": [ { "alias": "分区", "align": "auto", "colorMode": null, "colors": [ "rgba(50, 172, 45, 0.97)", "rgba(237, 129, 40, 0.89)", "rgba(245, 54, 54, 0.9)" ], "dateFormat": "YYYY-MM-DD HH:mm:ss", "decimals": 2, "mappingType": 1, "pattern": "mountpoint", "thresholds": [ "" ], "type": "string", "unit": "bytes" }, { "alias": "可用空间", "align": "auto", "colorMode": "value", "colors": [ "rgba(245, 54, 54, 0.9)", "rgba(237, 129, 40, 0.89)", "rgba(50, 172, 45, 
0.97)" ], "dateFormat": "YYYY-MM-DD HH:mm:ss", "decimals": 1, "mappingType": 1, "pattern": "Value #A", "thresholds": [ "10000000000", "20000000000" ], "type": "number", "unit": "bytes" }, { "alias": "使用率", "align": "auto", "colorMode": "cell", "colors": [ "rgba(50, 172, 45, 0.97)", "rgba(237, 129, 40, 0.89)", "rgba(245, 54, 54, 0.9)" ], "dateFormat": "YYYY-MM-DD HH:mm:ss", "decimals": 1, "mappingType": 1, "pattern": "Value #B", "thresholds": [ "70", "85" ], "type": "number", "unit": "percent" }, { "alias": "总空间", "align": "auto", "colorMode": null, "colors": [ "rgba(245, 54, 54, 0.9)", "rgba(237, 129, 40, 0.89)", "rgba(50, 172, 45, 0.97)" ], "dateFormat": "YYYY-MM-DD HH:mm:ss", "decimals": 0, "link": false, "mappingType": 1, "pattern": "Value #C", "thresholds": [], "type": "number", "unit": "bytes" }, { "alias": "文件系统", "align": "auto", "colorMode": null, "colors": [ "rgba(245, 54, 54, 0.9)", "rgba(237, 129, 40, 0.89)", "rgba(50, 172, 45, 0.97)" ], "dateFormat": "YYYY-MM-DD HH:mm:ss", "decimals": 2, "link": false, "mappingType": 1, "pattern": "fstype", "thresholds": [], "type": "string", "unit": "short" }, { "alias": "设备名", "align": "auto", "colorMode": null, "colors": [ "rgba(245, 54, 54, 0.9)", "rgba(237, 129, 40, 0.89)", "rgba(50, 172, 45, 0.97)" ], "dateFormat": "YYYY-MM-DD HH:mm:ss", "decimals": 2, "link": false, "mappingType": 1, "pattern": "device", "preserveFormat": false, "sanitize": false, "thresholds": [], "type": "string", "unit": "short" }, { "alias": "", "align": "auto", "colorMode": null, "colors": [ "rgba(245, 54, 54, 0.9)", "rgba(237, 129, 40, 0.89)", "rgba(50, 172, 45, 0.97)" ], "decimals": 2, "pattern": "/.*/", "preserveFormat": true, "sanitize": false, "thresholds": [], "type": "hidden", "unit": "short" } ], "targets": [ { "expr": "node_filesystem_size_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}-0", "format": "table", "hide": false, "instant": true, "interval": "", "intervalFactor": 1, "legendFormat": "总量", "refId": 
"C" }, { "expr": "node_filesystem_avail_bytes {instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}-0", "format": "table", "hide": false, "instant": true, "interval": "10s", "intervalFactor": 1, "legendFormat": "", "refId": "A" }, { "expr": "(node_filesystem_size_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}-node_filesystem_free_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}) *100/(node_filesystem_avail_bytes {instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}+(node_filesystem_size_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}-node_filesystem_free_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}))", "format": "table", "hide": false, "instant": true, "interval": "", "intervalFactor": 1, "legendFormat": "", "refId": "B" } ], "title": "【$show_hostname】:各分区可用空间(EXT.*/XFS)", "transform": "table", "type": "table" }, { "cacheTimeout": null, "colorBackground": false, "colorValue": true, "colors": [ "rgba(50, 172, 45, 0.97)", "rgba(237, 129, 40, 0.89)", "#d44a3a" ], "datasource": "prometheus", "decimals": 2, "description": "", "fieldConfig": { "defaults": { "custom": {} }, "overrides": [] }, "format": "percent", "gauge": { "maxValue": 100, "minValue": 0, "show": false, "thresholdLabels": false, "thresholdMarkers": true }, "gridPos": { "h": 2, "w": 2, "x": 15, "y": 22 }, "id": 20, "interval": null, "links": [], "mappingType": 1, "mappingTypes": [ { "name": "value to text", "value": 1 }, { "name": "range to text", "value": 2 } ], "maxDataPoints": 100, "nullPointMode": "connected", "nullText": null, "options": {}, "pluginVersion": "6.4.2", "postfix": "", "postfixFontSize": "50%", "prefix": "", "prefixFontSize": "50%", "rangeMaps": [ { "from": "null", "text": "N/A", "to": "null" } ], "sparkline": { "fillColor": "rgba(31, 118, 189, 0.18)", "full": true, "lineColor": "#3274D9", "show": true, "ymax": null, "ymin": null }, "tableColumn": "", "targets": [ { 
"expr": "avg(irate(node_cpu_seconds_total{instance=~\"$node\",mode=\"iowait\"}[5m])) * 100", "format": "time_series", "hide": false, "instant": false, "interval": "", "intervalFactor": 1, "legendFormat": "", "refId": "A", "step": 20 } ], "thresholds": "20,50", "timeFrom": null, "timeShift": null, "title": "CPU iowait", "type": "singlestat", "valueFontSize": "80%", "valueMaps": [ { "op": "=", "text": "N/A", "value": "null" } ], "valueName": "avg" }, { "aliasColors": { "cn-shenzhen.i-wz9cq1dcb6zwc39ehw59_cni0_in": "light-red", "cn-shenzhen.i-wz9cq1dcb6zwc39ehw59_cni0_in下载": "green", "cn-shenzhen.i-wz9cq1dcb6zwc39ehw59_cni0_out上传": "yellow", "cn-shenzhen.i-wz9cq1dcb6zwc39ehw59_eth0_in下载": "purple", "cn-shenzhen.i-wz9cq1dcb6zwc39ehw59_eth0_out": "purple", "cn-shenzhen.i-wz9cq1dcb6zwc39ehw59_eth0_out上传": "blue" }, "bars": true, "dashLength": 10, "dashes": false, "datasource": "prometheus", "editable": true, "error": false, "fieldConfig": { "defaults": { "custom": {} }, "overrides": [] }, "fill": 1, "fillGradient": 0, "grid": {}, "gridPos": { "h": 6, "w": 7, "x": 17, "y": 22 }, "hiddenSeries": false, "id": 183, "legend": { "alignAsTable": true, "avg": true, "current": true, "hideEmpty": true, "hideZero": true, "max": true, "min": false, "show": false, "sort": "current", "sortDesc": true, "total": true, "values": true }, "lines": false, "linewidth": 2, "links": [], "nullPointMode": "null as zero", "options": { "dataLinks": [] }, "percentage": false, "pointradius": 1, "points": false, "renderer": "flot", "repeat": null, "seriesOverrides": [ { "alias": "/.*_out上传$/", "transform": "negative-Y" } ], "spaceLength": 10, "stack": false, "steppedLine": false, "targets": [ { "expr": "increase(node_network_receive_bytes_total{instance=~\"$node\",device=~\"$device\"}[60m])", "interval": "60m", "intervalFactor": 1, "legendFormat": "{{device}}_in下载", "metric": "", "refId": "A", "step": 600, "target": "" }, { "expr": 
"increase(node_network_transmit_bytes_total{instance=~\"$node\",device=~\"$device\"}[60m])", "hide": false, "interval": "60m", "intervalFactor": 1, "legendFormat": "{{device}}_out上传", "refId": "B", "step": 600 } ], "thresholds": [], "timeFrom": null, "timeRegions": [], "timeShift": null, "title": "每小时流量$device", "tooltip": { "msResolution": false, "shared": true, "sort": 0, "value_type": "cumulative" }, "type": "graph", "xaxis": { "buckets": null, "mode": "time", "name": null, "show": true, "values": [] }, "yaxes": [ { "format": "bytes", "label": "上传(-)/下载(+)", "logBase": 1, "max": null, "min": null, "show": true }, { "format": "short", "logBase": 1, "max": null, "min": null, "show": false } ], "yaxis": { "align": false, "alignLevel": null } }, { "cacheTimeout": null, "colorBackground": false, "colorPostfix": false, "colorValue": true, "colors": [ "rgba(245, 54, 54, 0.9)", "rgba(237, 129, 40, 0.89)", "rgba(50, 172, 45, 0.97)" ], "datasource": "prometheus", "description": "", "fieldConfig": { "defaults": { "custom": {} }, "overrides": [] }, "format": "short", "gauge": { "maxValue": 100, "minValue": 0, "show": false, "thresholdLabels": false, "thresholdMarkers": true }, "gridPos": { "h": 2, "w": 2, "x": 0, "y": 24 }, "id": 14, "interval": null, "links": [], "mappingType": 1, "mappingTypes": [ { "name": "value to text", "value": 1 }, { "name": "range to text", "value": 2 } ], "maxDataPoints": 100, "maxPerRow": 6, "nullPointMode": "null", "nullText": null, "options": {}, "postfix": "", "postfixFontSize": "50%", "prefix": "", "prefixFontSize": "50%", "rangeMaps": [ { "from": "null", "text": "N/A", "to": "null" } ], "sparkline": { "fillColor": "rgba(31, 118, 189, 0.18)", "full": false, "lineColor": "rgb(31, 120, 193)", "show": false }, "tableColumn": "", "targets": [ { "expr": "count(node_cpu_seconds_total{instance=~\"$node\", mode='system'})", "format": "time_series", "instant": true, "interval": "", "intervalFactor": 1, "legendFormat": "", "refId": "A", "step": 20 } ], 
"thresholds": "1,2", "title": "CPU 核数", "type": "singlestat", "valueFontSize": "80%", "valueMaps": [ { "op": "=", "text": "N/A", "value": "null" } ], "valueName": "current" }, { "cacheTimeout": null, "colorBackground": false, "colorPostfix": false, "colorValue": true, "colors": [ "rgba(245, 54, 54, 0.9)", "rgba(237, 129, 40, 0.89)", "rgba(50, 172, 45, 0.97)" ], "datasource": "prometheus", "decimals": null, "description": "", "fieldConfig": { "defaults": { "custom": {} }, "overrides": [] }, "format": "short", "gauge": { "maxValue": 100, "minValue": 0, "show": false, "thresholdLabels": false, "thresholdMarkers": true }, "gridPos": { "h": 2, "w": 2, "x": 15, "y": 24 }, "id": 179, "interval": null, "links": [], "mappingType": 1, "mappingTypes": [ { "name": "value to text", "value": 1 }, { "name": "range to text", "value": 2 } ], "maxDataPoints": 100, "maxPerRow": 6, "nullPointMode": "null", "nullText": null, "options": {}, "postfix": "", "postfixFontSize": "50%", "prefix": "", "prefixFontSize": "50%", "rangeMaps": [ { "from": "null", "text": "N/A", "to": "null" } ], "sparkline": { "fillColor": "rgba(31, 118, 189, 0.18)", "full": false, "lineColor": "rgb(31, 120, 193)", "show": false }, "tableColumn": "", "targets": [ { "expr": "avg(node_filesystem_files_free{instance=~\"$node\",mountpoint=\"$maxmount\",fstype=~\"ext.?|xfs\"})", "format": "time_series", "instant": true, "interval": "", "intervalFactor": 1, "legendFormat": "", "refId": "A", "step": 20 } ], "thresholds": "100000,1000000", "title": "剩余节点数:$maxmount ", "type": "singlestat", "valueFontSize": "70%", "valueMaps": [ { "op": "=", "text": "N/A", "value": "null" } ], "valueName": "current" }, { "cacheTimeout": null, "colorBackground": false, "colorValue": true, "colors": [ "rgba(245, 54, 54, 0.9)", "rgba(237, 129, 40, 0.89)", "rgba(50, 172, 45, 0.97)" ], "datasource": "prometheus", "decimals": 0, "description": "", "fieldConfig": { "defaults": { "custom": {} }, "overrides": [] }, "format": "bytes", "gauge": { 
"maxValue": 100, "minValue": 0, "show": false, "thresholdLabels": false, "thresholdMarkers": true }, "gridPos": { "h": 2, "w": 2, "x": 0, "y": 26 }, "id": 75, "interval": null, "links": [], "mappingType": 1, "mappingTypes": [ { "name": "value to text", "value": 1 }, { "name": "range to text", "value": 2 } ], "maxDataPoints": 100, "maxPerRow": 6, "nullPointMode": "null", "nullText": null, "options": {}, "postfix": "", "postfixFontSize": "70%", "prefix": "", "prefixFontSize": "50%", "rangeMaps": [ { "from": "null", "text": "N/A", "to": "null" } ], "sparkline": { "fillColor": "rgba(31, 118, 189, 0.18)", "full": false, "lineColor": "rgb(31, 120, 193)", "show": false }, "tableColumn": "", "targets": [ { "expr": "sum(node_memory_MemTotal_bytes{instance=~\"$node\"})", "format": "time_series", "instant": true, "interval": "", "intervalFactor": 1, "legendFormat": "{{instance}}", "refId": "A", "step": 20 } ], "thresholds": "2,3", "title": "总内存", "type": "singlestat", "valueFontSize": "80%", "valueMaps": [ { "op": "=", "text": "N/A", "value": "null" } ], "valueName": "current" }, { "cacheTimeout": null, "colorBackground": false, "colorPostfix": false, "colorValue": true, "colors": [ "rgba(245, 54, 54, 0.9)", "rgba(237, 129, 40, 0.89)", "rgba(50, 172, 45, 0.97)" ], "datasource": "prometheus", "decimals": null, "description": "", "fieldConfig": { "defaults": { "custom": {} }, "overrides": [] }, "format": "locale", "gauge": { "maxValue": 100, "minValue": 0, "show": false, "thresholdLabels": false, "thresholdMarkers": true }, "gridPos": { "h": 2, "w": 2, "x": 15, "y": 26 }, "id": 178, "interval": null, "links": [], "mappingType": 1, "mappingTypes": [ { "name": "value to text", "value": 1 }, { "name": "range to text", "value": 2 } ], "maxDataPoints": 100, "maxPerRow": 6, "nullPointMode": "null", "nullText": null, "options": {}, "postfix": "", "postfixFontSize": "50%", "prefix": "", "prefixFontSize": "50%", "rangeMaps": [ { "from": "null", "text": "N/A", "to": "null" } ], 
"sparkline": { "fillColor": "rgba(31, 118, 189, 0.18)", "full": false, "lineColor": "rgb(31, 120, 193)", "show": false }, "tableColumn": "", "targets": [ { "expr": "avg(node_filefd_maximum{instance=~\"$node\"})", "format": "time_series", "instant": true, "intervalFactor": 1, "legendFormat": "", "refId": "A", "step": 20 } ], "thresholds": "1024,10000", "title": "总文件描述符", "type": "singlestat", "valueFontSize": "70%", "valueMaps": [ { "op": "=", "text": "N/A", "value": "null" } ], "valueName": "current" }, { "aliasColors": { "192.168.200.241:9100_Total": "dark-red", "Idle - Waiting for something to happen": "#052B51", "guest": "#9AC48A", "idle": "#052B51", "iowait": "#EAB839", "irq": "#BF1B00", "nice": "#C15C17", "sdb_每秒I/O操作%": "#d683ce", "softirq": "#E24D42", "steal": "#FCE2DE", "system": "#508642", "user": "#5195CE", "磁盘花费在I/O操作占比": "#ba43a9" }, "bars": false, "dashLength": 10, "dashes": false, "datasource": "prometheus", "decimals": 2, "description": "", "fieldConfig": { "defaults": { "custom": {} }, "overrides": [] }, "fill": 1, "fillGradient": 0, "gridPos": { "h": 8, "w": 8, "x": 0, "y": 28 }, "hiddenSeries": false, "id": 7, "legend": { "alignAsTable": true, "avg": true, "current": true, "hideEmpty": true, "hideZero": true, "max": true, "min": true, "rightSide": false, "show": true, "sideWidth": null, "sort": "current", "sortDesc": true, "total": false, "values": true }, "lines": true, "linewidth": 2, "links": [], "maxPerRow": 6, "nullPointMode": "null", "options": { "dataLinks": [] }, "percentage": false, "pointradius": 5, "points": false, "renderer": "flot", "repeat": null, "seriesOverrides": [ { "alias": "/.*总使用率/", "color": "#C4162A", "fill": 0 } ], "spaceLength": 10, "stack": false, "steppedLine": false, "targets": [ { "expr": "avg(irate(node_cpu_seconds_total{instance=~\"$node\",mode=\"system\"}[5m])) by (instance) *100", "format": "time_series", "hide": false, "instant": false, "interval": "", "intervalFactor": 1, "legendFormat": "系统使用率", "refId": "A", 
"step": 20 }, { "expr": "avg(irate(node_cpu_seconds_total{instance=~\"$node\",mode=\"user\"}[5m])) by (instance) *100", "format": "time_series", "hide": false, "interval": "", "intervalFactor": 1, "legendFormat": "用户使用率", "refId": "B", "step": 240 }, { "expr": "avg(irate(node_cpu_seconds_total{instance=~\"$node\",mode=\"iowait\"}[5m])) by (instance) *100", "format": "time_series", "hide": false, "instant": false, "interval": "", "intervalFactor": 1, "legendFormat": "磁盘IO使用率", "refId": "D", "step": 240 }, { "expr": "(1 - avg(irate(node_cpu_seconds_total{instance=~\"$node\",mode=\"idle\"}[5m])) by (instance))*100", "format": "time_series", "hide": false, "interval": "", "intervalFactor": 1, "legendFormat": "总使用率", "refId": "F", "step": 240 } ], "thresholds": [], "timeFrom": null, "timeRegions": [], "timeShift": null, "title": "CPU使用率", "tooltip": { "shared": true, "sort": 2, "value_type": "individual" }, "type": "graph", "xaxis": { "buckets": null, "mode": "time", "name": null, "show": true, "values": [] }, "yaxes": [ { "decimals": 0, "format": "percent", "label": "", "logBase": 1, "max": null, "min": null, "show": true }, { "format": "short", "label": null, "logBase": 1, "max": null, "min": null, "show": false } ], "yaxis": { "align": false, "alignLevel": null } }, { "aliasColors": { "192.168.200.241:9100_总内存": "dark-red", "使用率": "yellow", "内存_Avaliable": "#6ED0E0", "内存_Cached": "#EF843C", "内存_Free": "#629E51", "内存_Total": "#6d1f62", "内存_Used": "#eab839", "可用": "#9ac48a", "总内存": "#bf1b00" }, "bars": false, "dashLength": 10, "dashes": false, "datasource": "prometheus", "decimals": 2, "fieldConfig": { "defaults": { "custom": {} }, "overrides": [] }, "fill": 1, "fillGradient": 0, "gridPos": { "h": 8, "w": 8, "x": 8, "y": 28 }, "height": "300", "hiddenSeries": false, "id": 156, "legend": { "alignAsTable": true, "avg": true, "current": true, "hideEmpty": true, "hideZero": true, "max": true, "min": true, "rightSide": false, "show": true, "sort": "current", "sortDesc": 
true, "total": false, "values": true }, "lines": true, "linewidth": 2, "links": [], "nullPointMode": "null", "options": { "dataLinks": [] }, "percentage": false, "pointradius": 5, "points": false, "renderer": "flot", "seriesOverrides": [ { "alias": "总内存", "color": "#C4162A", "fill": 0 }, { "alias": "使用率", "color": "rgb(0, 209, 255)", "lines": false, "pointradius": 1, "points": true, "yaxis": 2 } ], "spaceLength": 10, "stack": false, "steppedLine": false, "targets": [ { "expr": "node_memory_MemTotal_bytes{instance=~\"$node\"}", "format": "time_series", "hide": false, "instant": false, "interval": "", "intervalFactor": 1, "legendFormat": "总内存", "refId": "A", "step": 4 }, { "expr": "node_memory_MemTotal_bytes{instance=~\"$node\"} - node_memory_MemAvailable_bytes{instance=~\"$node\"}", "format": "time_series", "hide": false, "interval": "", "intervalFactor": 1, "legendFormat": "已用", "refId": "B", "step": 4 }, { "expr": "node_memory_MemAvailable_bytes{instance=~\"$node\"}", "format": "time_series", "hide": false, "interval": "", "intervalFactor": 1, "legendFormat": "可用", "refId": "F", "step": 4 }, { "expr": "node_memory_Buffers_bytes{instance=~\"$node\"}", "format": "time_series", "hide": true, "intervalFactor": 1, "legendFormat": "内存_Buffers", "refId": "D", "step": 4 }, { "expr": "node_memory_MemFree_bytes{instance=~\"$node\"}", "format": "time_series", "hide": true, "intervalFactor": 1, "legendFormat": "内存_Free", "refId": "C", "step": 4 }, { "expr": "node_memory_Cached_bytes{instance=~\"$node\"}", "format": "time_series", "hide": true, "intervalFactor": 1, "legendFormat": "内存_Cached", "refId": "E", "step": 4 }, { "expr": "node_memory_MemTotal_bytes{instance=~\"$node\"} - (node_memory_Cached_bytes{instance=~\"$node\"} + node_memory_Buffers_bytes{instance=~\"$node\"} + node_memory_MemFree_bytes{instance=~\"$node\"})", "format": "time_series", "hide": true, "intervalFactor": 1, "refId": "G" }, { "expr": "(1 - (node_memory_MemAvailable_bytes{instance=~\"$node\"} / 
(node_memory_MemTotal_bytes{instance=~\"$node\"})))* 100", "format": "time_series", "hide": false, "interval": "30m", "intervalFactor": 10, "legendFormat": "使用率", "refId": "H" } ], "thresholds": [], "timeFrom": null, "timeRegions": [], "timeShift": null, "title": "内存信息", "tooltip": { "shared": true, "sort": 2, "value_type": "individual" }, "type": "graph", "xaxis": { "buckets": null, "mode": "time", "name": null, "show": true, "values": [] }, "yaxes": [ { "format": "bytes", "label": null, "logBase": 1, "max": null, "min": "0", "show": true }, { "format": "percent", "label": "内存使用率", "logBase": 1, "max": "100", "min": "0", "show": true } ], "yaxis": { "align": false, "alignLevel": null } }, { "aliasColors": { "192.168.10.227:9100_em1_in下载": "super-light-green", "192.168.10.227:9100_em1_out上传": "dark-blue" }, "bars": false, "dashLength": 10, "dashes": false, "datasource": "prometheus", "decimals": 2, "fieldConfig": { "defaults": { "custom": {} }, "overrides": [] }, "fill": 1, "fillGradient": 0, "gridPos": { "h": 8, "w": 8, "x": 16, "y": 28 }, "height": "300", "hiddenSeries": false, "id": 157, "legend": { "alignAsTable": true, "avg": true, "current": true, "hideEmpty": true, "hideZero": true, "max": true, "min": true, "rightSide": false, "show": true, "sort": "current", "sortDesc": true, "total": false, "values": true }, "lines": true, "linewidth": 1, "links": [], "nullPointMode": "null", "options": { "dataLinks": [] }, "percentage": false, "pointradius": 2, "points": false, "renderer": "flot", "seriesOverrides": [ { "alias": "/.*_out上传$/", "transform": "negative-Y" } ], "spaceLength": 10, "stack": false, "steppedLine": false, "targets": [ { "expr": "irate(node_network_receive_bytes_total{instance=~'$node',device=~\"$device\"}[5m])*8", "format": "time_series", "interval": "", "intervalFactor": 1, "legendFormat": "{{device}}_in下载", "refId": "A", "step": 4 }, { "expr": "irate(node_network_transmit_bytes_total{instance=~'$node',device=~\"$device\"}[5m])*8", "format": 
"time_series", "interval": "", "intervalFactor": 1, "legendFormat": "{{device}}_out上传", "refId": "B", "step": 4 } ], "thresholds": [], "timeFrom": null, "timeRegions": [], "timeShift": null, "title": "每秒网络带宽使用$device", "tooltip": { "shared": true, "sort": 2, "value_type": "individual" }, "type": "graph", "xaxis": { "buckets": null, "mode": "time", "name": null, "show": true, "values": [] }, "yaxes": [ { "format": "bps", "label": "上传(-)/下载(+)", "logBase": 1, "max": null, "min": null, "show": true }, { "format": "short", "label": null, "logBase": 1, "max": null, "min": null, "show": false } ], "yaxis": { "align": false, "alignLevel": null } }, { "aliasColors": { "15分钟": "#6ED0E0", "1分钟": "#BF1B00", "5分钟": "#CCA300" }, "bars": false, "dashLength": 10, "dashes": false, "datasource": "prometheus", "decimals": 2, "editable": true, "error": false, "fieldConfig": { "defaults": { "custom": {} }, "overrides": [] }, "fill": 1, "fillGradient": 1, "grid": {}, "gridPos": { "h": 8, "w": 8, "x": 0, "y": 36 }, "height": "300", "hiddenSeries": false, "id": 13, "legend": { "alignAsTable": true, "avg": true, "current": true, "hideEmpty": true, "hideZero": true, "max": true, "min": true, "rightSide": false, "show": true, "sort": "current", "sortDesc": true, "total": false, "values": true }, "lines": true, "linewidth": 2, "links": [], "maxPerRow": 6, "nullPointMode": "null as zero", "options": { "dataLinks": [] }, "percentage": false, "pointradius": 5, "points": false, "renderer": "flot", "repeat": null, "seriesOverrides": [ { "alias": "/.*总核数/", "color": "#C4162A" } ], "spaceLength": 10, "stack": false, "steppedLine": false, "targets": [ { "expr": "node_load1{instance=~\"$node\"}", "format": "time_series", "instant": false, "interval": "", "intervalFactor": 1, "legendFormat": "1分钟负载", "metric": "", "refId": "A", "step": 20, "target": "" }, { "expr": "node_load5{instance=~\"$node\"}", "format": "time_series", "instant": false, "interval": "", "intervalFactor": 1, "legendFormat": 
"5分钟负载", "refId": "B", "step": 20 }, { "expr": "node_load15{instance=~\"$node\"}", "format": "time_series", "instant": false, "interval": "", "intervalFactor": 1, "legendFormat": "15分钟负载", "refId": "C", "step": 20 }, { "expr": " sum(count(node_cpu_seconds_total{instance=~\"$node\", mode='system'}) by (cpu,instance)) by(instance)", "format": "time_series", "instant": false, "interval": "", "intervalFactor": 1, "legendFormat": "CPU总核数", "refId": "D", "step": 20 } ], "thresholds": [], "timeFrom": null, "timeRegions": [], "timeShift": null, "title": "系统平均负载", "tooltip": { "msResolution": false, "shared": true, "sort": 2, "value_type": "cumulative" }, "type": "graph", "xaxis": { "buckets": null, "mode": "time", "name": null, "show": true, "values": [] }, "yaxes": [ { "format": "short", "logBase": 1, "max": null, "min": null, "show": true }, { "format": "short", "logBase": 1, "max": null, "min": null, "show": true } ], "yaxis": { "align": false, "alignLevel": null } }, { "aliasColors": { "vda_write": "#6ED0E0" }, "bars": false, "dashLength": 10, "dashes": false, "datasource": "prometheus", "decimals": 2, "description": "Read bytes 每个磁盘分区每秒读取的比特数\nWritten bytes 每个磁盘分区每秒写入的比特数", "fieldConfig": { "defaults": { "custom": {} }, "overrides": [] }, "fill": 1, "fillGradient": 1, "gridPos": { "h": 8, "w": 8, "x": 8, "y": 36 }, "height": "300", "hiddenSeries": false, "id": 168, "legend": { "alignAsTable": true, "avg": true, "current": true, "hideEmpty": true, "hideZero": true, "max": true, "min": true, "show": true, "sort": "current", "sortDesc": true, "total": false, "values": true }, "lines": true, "linewidth": 2, "links": [], "nullPointMode": "null", "options": { "dataLinks": [] }, "percentage": false, "pointradius": 5, "points": false, "renderer": "flot", "seriesOverrides": [ { "alias": "/.*_读取$/", "transform": "negative-Y" } ], "spaceLength": 10, "stack": false, "steppedLine": false, "targets": [ { "expr": "irate(node_disk_read_bytes_total{instance=~\"$node\"}[5m])", 
"format": "time_series", "interval": "", "intervalFactor": 1, "legendFormat": "{{device}}_读取", "refId": "A", "step": 10 }, { "expr": "irate(node_disk_written_bytes_total{instance=~\"$node\"}[5m])", "format": "time_series", "hide": false, "interval": "", "intervalFactor": 1, "legendFormat": "{{device}}_写入", "refId": "B", "step": 10 } ], "thresholds": [], "timeFrom": null, "timeRegions": [], "timeShift": null, "title": "每秒磁盘读写容量", "tooltip": { "shared": true, "sort": 2, "value_type": "individual" }, "type": "graph", "xaxis": { "buckets": null, "mode": "time", "name": null, "show": true, "values": [] }, "yaxes": [ { "decimals": null, "format": "Bps", "label": "读取(-)/写入(+)", "logBase": 1, "max": null, "min": null, "show": true }, { "format": "short", "label": null, "logBase": 1, "max": null, "min": null, "show": false } ], "yaxis": { "align": false, "alignLevel": null } }, { "aliasColors": {}, "bars": false, "dashLength": 10, "dashes": false, "datasource": "prometheus", "decimals": 1, "description": "", "fieldConfig": { "defaults": { "custom": {} }, "overrides": [] }, "fill": 0, "fillGradient": 0, "gridPos": { "h": 8, "w": 8, "x": 16, "y": 36 }, "hiddenSeries": false, "id": 174, "legend": { "alignAsTable": true, "avg": true, "current": true, "hideEmpty": true, "hideZero": true, "max": true, "min": true, "rightSide": false, "show": true, "sideWidth": null, "sort": "current", "sortDesc": true, "total": false, "values": true }, "lines": true, "linewidth": 2, "links": [], "nullPointMode": "null", "options": { "dataLinks": [] }, "percentage": false, "pointradius": 5, "points": false, "renderer": "flot", "seriesOverrides": [ { "alias": "/Inodes.*/", "yaxis": 2 } ], "spaceLength": 10, "stack": false, "steppedLine": false, "targets": [ { "expr": "(node_filesystem_size_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}-node_filesystem_free_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}) *100/(node_filesystem_avail_bytes 
{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}+(node_filesystem_size_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}-node_filesystem_free_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}))", "format": "time_series", "instant": false, "interval": "", "intervalFactor": 1, "legendFormat": "{{mountpoint}}", "refId": "A" }, { "expr": "node_filesystem_files_free{instance=~'$node',fstype=~\"ext.?|xfs\"} / node_filesystem_files{instance=~'$node',fstype=~\"ext.?|xfs\"}", "hide": true, "interval": "", "legendFormat": "Inodes:{{instance}}:{{mountpoint}}", "refId": "B" } ], "thresholds": [], "timeFrom": null, "timeRegions": [], "timeShift": null, "title": "磁盘使用率", "tooltip": { "shared": true, "sort": 2, "value_type": "individual" }, "type": "graph", "xaxis": { "buckets": null, "mode": "time", "name": null, "show": true, "values": [] }, "yaxes": [ { "decimals": null, "format": "percent", "label": "", "logBase": 1, "max": "100", "min": "0", "show": true }, { "decimals": 2, "format": "percentunit", "label": null, "logBase": 1, "max": "1", "min": null, "show": false } ], "yaxis": { "align": false, "alignLevel": null } }, { "aliasColors": { "vda_write": "#6ED0E0" }, "bars": false, "dashLength": 10, "dashes": false, "datasource": "prometheus", "decimals": 2, "description": "Reads completed: 每个磁盘分区每秒读完成次数\n\nWrites completed: 每个磁盘分区每秒写完成次数\n\nIO now 每个磁盘分区每秒正在处理的输入/输出请求数", "fieldConfig": { "defaults": { "custom": {} }, "overrides": [] }, "fill": 0, "fillGradient": 0, "gridPos": { "h": 9, "w": 8, "x": 0, "y": 44 }, "height": "300", "hiddenSeries": false, "id": 161, "legend": { "alignAsTable": true, "avg": true, "current": true, "hideEmpty": true, "hideZero": true, "max": true, "min": true, "show": true, "sort": "current", "sortDesc": true, "total": false, "values": true }, "lines": true, "linewidth": 1, "links": [], "nullPointMode": "null", "options": { "dataLinks": [] }, "percentage": false, "pointradius": 5, 
"points": false, "renderer": "flot", "seriesOverrides": [ { "alias": "/.*_读取$/", "transform": "negative-Y" } ], "spaceLength": 10, "stack": false, "steppedLine": false, "targets": [ { "expr": "irate(node_disk_reads_completed_total{instance=~\"$node\"}[5m])", "format": "time_series", "hide": false, "interval": "", "intervalFactor": 1, "legendFormat": "{{device}}_读取", "refId": "A", "step": 10 }, { "expr": "irate(node_disk_writes_completed_total{instance=~\"$node\"}[5m])", "format": "time_series", "hide": false, "interval": "", "intervalFactor": 1, "legendFormat": "{{device}}_写入", "refId": "B", "step": 10 }, { "expr": "node_disk_io_now{instance=~\"$node\"}", "format": "time_series", "hide": true, "interval": "", "intervalFactor": 1, "legendFormat": "{{device}}", "refId": "C" } ], "thresholds": [], "timeFrom": null, "timeRegions": [], "timeShift": null, "title": "磁盘读写速率(IOPS)", "tooltip": { "shared": true, "sort": 2, "value_type": "individual" }, "type": "graph", "xaxis": { "buckets": null, "mode": "time", "name": null, "show": true, "values": [] }, "yaxes": [ { "decimals": null, "format": "iops", "label": "读取(-)/写入(+)I/O ops/sec", "logBase": 1, "max": null, "min": null, "show": true }, { "format": "short", "label": null, "logBase": 1, "max": null, "min": null, "show": true } ], "yaxis": { "align": false, "alignLevel": null } }, { "aliasColors": { "Idle - Waiting for something to happen": "#052B51", "guest": "#9AC48A", "idle": "#052B51", "iowait": "#EAB839", "irq": "#BF1B00", "nice": "#C15C17", "sdb_每秒I/O操作%": "#d683ce", "softirq": "#E24D42", "steal": "#FCE2DE", "system": "#508642", "user": "#5195CE", "磁盘花费在I/O操作占比": "#ba43a9" }, "bars": false, "dashLength": 10, "dashes": false, "datasource": "prometheus", "decimals": null, "description": "每一秒钟的自然时间内,花费在I/O上的耗时。(wall-clock time)\n\nnode_disk_io_time_seconds_total:\n磁盘花费在输入/输出操作上的秒数。该值为累加值。(Milliseconds Spent Doing 
I/Os)\n\nirate(node_disk_io_time_seconds_total[1m]):\n计算每秒的速率:(last值-last前一个值)/时间戳差值,即:1秒钟内磁盘花费在I/O操作的时间占比。", "fieldConfig": { "defaults": { "custom": {} }, "overrides": [] }, "fill": 1, "fillGradient": 0, "gridPos": { "h": 9, "w": 8, "x": 8, "y": 44 }, "hiddenSeries": false, "id": 175, "legend": { "alignAsTable": true, "avg": true, "current": true, "hideEmpty": true, "hideZero": true, "max": true, "min": false, "rightSide": false, "show": true, "sideWidth": null, "sort": null, "sortDesc": null, "total": false, "values": true }, "lines": true, "linewidth": 1, "links": [], "maxPerRow": 6, "nullPointMode": "null", "options": { "dataLinks": [] }, "percentage": false, "pointradius": 5, "points": false, "renderer": "flot", "seriesOverrides": [], "spaceLength": 10, "stack": false, "steppedLine": false, "targets": [ { "expr": "irate(node_disk_io_time_seconds_total{instance=~\"$node\"}[5m])", "format": "time_series", "interval": "", "intervalFactor": 1, "legendFormat": "{{device}}_每秒I/O操作%", "refId": "C" } ], "thresholds": [], "timeFrom": null, "timeRegions": [], "timeShift": null, "title": "每1秒内I/O操作耗时占比", "tooltip": { "shared": true, "sort": 2, "value_type": "individual" }, "type": "graph", "xaxis": { "buckets": null, "mode": "time", "name": null, "show": true, "values": [] }, "yaxes": [ { "decimals": null, "format": "percentunit", "label": "", "logBase": 1, "max": null, "min": null, "show": true }, { "format": "short", "label": null, "logBase": 1, "max": null, "min": null, "show": false } ], "yaxis": { "align": false, "alignLevel": null } }, { "aliasColors": { "vda": "#6ED0E0" }, "bars": false, "dashLength": 10, "dashes": false, "datasource": "prometheus", "decimals": 2, "description": "Read time seconds 每个磁盘分区读操作花费的秒数\n\nWrite time seconds 每个磁盘分区写操作花费的秒数\n\nIO time seconds 每个磁盘分区输入/输出操作花费的秒数\n\nIO time weighted seconds每个磁盘分区输入/输出操作花费的加权秒数", "fieldConfig": { "defaults": { "custom": {} }, "overrides": [] }, "fill": 1, "fillGradient": 1, "gridPos": { "h": 9, "w": 8, "x": 
16, "y": 44 }, "height": "300", "hiddenSeries": false, "id": 160, "legend": { "alignAsTable": true, "avg": true, "current": true, "hideEmpty": true, "hideZero": true, "max": true, "min": true, "show": true, "sort": "current", "sortDesc": true, "total": false, "values": true }, "lines": true, "linewidth": 2, "links": [], "nullPointMode": "null as zero", "options": { "dataLinks": [] }, "percentage": false, "pointradius": 5, "points": false, "renderer": "flot", "seriesOverrides": [ { "alias": "/,*_读取$/", "transform": "negative-Y" } ], "spaceLength": 10, "stack": false, "steppedLine": false, "targets": [ { "expr": "irate(node_disk_read_time_seconds_total{instance=~\"$node\"}[5m]) / irate(node_disk_reads_completed_total{instance=~\"$node\"}[5m])", "format": "time_series", "hide": false, "instant": false, "interval": "", "intervalFactor": 1, "legendFormat": "{{device}}_读取", "refId": "B" }, { "expr": "irate(node_disk_write_time_seconds_total{instance=~\"$node\"}[5m]) / irate(node_disk_writes_completed_total{instance=~\"$node\"}[5m])", "format": "time_series", "hide": false, "instant": false, "interval": "", "intervalFactor": 1, "legendFormat": "{{device}}_写入", "refId": "C" }, { "expr": "irate(node_disk_io_time_seconds_total{instance=~\"$node\"}[5m])", "format": "time_series", "hide": true, "interval": "", "intervalFactor": 1, "legendFormat": "{{device}}", "refId": "A", "step": 10 }, { "expr": "irate(node_disk_io_time_weighted_seconds_total{instance=~\"$node\"}[5m])", "format": "time_series", "hide": true, "interval": "", "intervalFactor": 1, "legendFormat": "{{device}}_加权", "refId": "D" } ], "thresholds": [], "timeFrom": null, "timeRegions": [], "timeShift": null, "title": "每次IO读写的耗时(参考:小于100ms)(beta)", "tooltip": { "shared": true, "sort": 2, "value_type": "individual" }, "type": "graph", "xaxis": { "buckets": null, "mode": "time", "name": null, "show": true, "values": [] }, "yaxes": [ { "format": "s", "label": "读取(-)/写入(+)", "logBase": 1, "max": null, "min": null, 
"show": true }, { "format": "short", "label": null, "logBase": 1, "max": null, "min": null, "show": false } ], "yaxis": { "align": false, "alignLevel": null } }, { "aliasColors": { "192.168.200.241:9100_TCP_alloc": "semi-dark-blue", "TCP": "#6ED0E0", "TCP_alloc": "blue" }, "bars": false, "dashLength": 10, "dashes": false, "datasource": "prometheus", "decimals": 2, "description": "Sockets_used - 已使用的所有协议套接字总量\n\nCurrEstab - 当前状态为 ESTABLISHED 或 CLOSE-WAIT 的 TCP 连接数\n\nTCP_alloc - 已分配(已建立、已申请到sk_buff)的TCP套接字数量\n\nTCP_tw - 等待关闭的TCP连接数\n\nUDP_inuse - 正在使用的 UDP 套接字数量\n\nRetransSegs - TCP 重传报文数\n\nOutSegs - TCP 发送的报文数\n\nInSegs - TCP 接收的报文数", "fieldConfig": { "defaults": { "custom": {} }, "overrides": [] }, "fill": 0, "fillGradient": 0, "gridPos": { "h": 8, "w": 16, "x": 0, "y": 53 }, "height": "300", "hiddenSeries": false, "id": 158, "interval": "", "legend": { "alignAsTable": true, "avg": false, "current": true, "hideEmpty": true, "hideZero": true, "max": true, "min": false, "rightSide": true, "show": true, "sideWidth": null, "sort": "current", "sortDesc": true, "total": false, "values": true }, "lines": true, "linewidth": 1, "links": [], "nullPointMode": "null", "options": { "dataLinks": [] }, "percentage": false, "pointradius": 5, "points": false, "renderer": "flot", "seriesOverrides": [ { "alias": "/.*Sockets_used/", "color": "#E02F44", "lines": false, "pointradius": 1, "points": true, "yaxis": 2 } ], "spaceLength": 10, "stack": false, "steppedLine": false, "targets": [ { "expr": "node_netstat_Tcp_CurrEstab{instance=~'$node'}", "format": "time_series", "hide": false, "instant": false, "interval": "", "intervalFactor": 1, "legendFormat": "CurrEstab", "refId": "A", "step": 20 }, { "expr": "node_sockstat_TCP_tw{instance=~'$node'}", "format": "time_series", "interval": "", "intervalFactor": 1, "legendFormat": "TCP_tw", "refId": "D" }, { "expr": "node_sockstat_sockets_used{instance=~'$node'}", "hide": false, "interval": "30m", "intervalFactor": 1, "legendFormat": 
"Sockets_used", "refId": "B" }, { "expr": "node_sockstat_UDP_inuse{instance=~'$node'}", "interval": "", "legendFormat": "UDP_inuse", "refId": "C" }, { "expr": "node_sockstat_TCP_alloc{instance=~'$node'}", "interval": "", "legendFormat": "TCP_alloc", "refId": "E" }, { "expr": "irate(node_netstat_Tcp_PassiveOpens{instance=~'$node'}[5m])", "hide": true, "interval": "", "legendFormat": "{{instance}}_Tcp_PassiveOpens", "refId": "G" }, { "expr": "irate(node_netstat_Tcp_ActiveOpens{instance=~'$node'}[5m])", "hide": true, "interval": "", "legendFormat": "{{instance}}_Tcp_ActiveOpens", "refId": "F" }, { "expr": "irate(node_netstat_Tcp_InSegs{instance=~'$node'}[5m])", "interval": "", "legendFormat": "Tcp_InSegs", "refId": "H" }, { "expr": "irate(node_netstat_Tcp_OutSegs{instance=~'$node'}[5m])", "interval": "", "legendFormat": "Tcp_OutSegs", "refId": "I" }, { "expr": "irate(node_netstat_Tcp_RetransSegs{instance=~'$node'}[5m])", "hide": false, "interval": "", "legendFormat": "Tcp_RetransSegs", "refId": "J" }, { "expr": "irate(node_netstat_TcpExt_ListenDrops{instance=~'$node'}[5m])", "hide": true, "interval": "", "legendFormat": "", "refId": "K" } ], "thresholds": [], "timeFrom": null, "timeRegions": [], "timeShift": null, "title": "网络Socket连接信息", "tooltip": { "shared": true, "sort": 2, "value_type": "individual" }, "transformations": [], "type": "graph", "xaxis": { "buckets": null, "mode": "time", "name": null, "show": true, "values": [] }, "yaxes": [ { "format": "short", "label": null, "logBase": 1, "max": null, "min": null, "show": true }, { "format": "short", "label": "已使用的所有协议套接字总量", "logBase": 1, "max": null, "min": null, "show": true } ], "yaxis": { "align": false, "alignLevel": null } }, { "aliasColors": { "filefd_192.168.200.241:9100": "super-light-green", "switches_192.168.200.241:9100": "semi-dark-red", "使用的文件描述符_10.118.72.128:9100": "red", "每秒上下文切换次数_10.118.71.245:9100": "yellow", "每秒上下文切换次数_10.118.72.128:9100": "yellow" }, "bars": false, "cacheTimeout": null, 
"dashLength": 10, "dashes": false, "datasource": "prometheus", "description": "", "fieldConfig": { "defaults": { "custom": {} }, "overrides": [] }, "fill": 0, "fillGradient": 1, "gridPos": { "h": 8, "w": 8, "x": 16, "y": 53 }, "hiddenSeries": false, "hideTimeOverride": false, "id": 16, "legend": { "alignAsTable": false, "avg": false, "current": true, "max": false, "min": false, "rightSide": false, "show": true, "total": false, "values": true }, "lines": true, "linewidth": 2, "links": [], "nullPointMode": "null", "options": { "dataLinks": [] }, "percentage": false, "pluginVersion": "6.4.2", "pointradius": 1, "points": false, "renderer": "flot", "seriesOverrides": [ { "alias": "/每秒上下文切换次数.*/", "color": "#FADE2A", "lines": false, "pointradius": 1, "points": true, "yaxis": 2 }, { "alias": "/使用的文件描述符.*/", "color": "#F2495C" } ], "spaceLength": 10, "stack": false, "steppedLine": false, "targets": [ { "expr": "node_filefd_allocated{instance=~\"$node\"}", "format": "time_series", "instant": false, "interval": "", "intervalFactor": 5, "legendFormat": "使用的文件描述符", "refId": "B" }, { "expr": "irate(node_context_switches_total{instance=~\"$node\"}[5m])", "interval": "", "intervalFactor": 5, "legendFormat": "每秒上下文切换次数", "refId": "A" }, { "expr": " (node_filefd_allocated{instance=~\"$node\"}/node_filefd_maximum{instance=~\"$node\"}) *100", "format": "time_series", "hide": true, "instant": false, "interval": "", "intervalFactor": 5, "legendFormat": "使用的文件描述符占比_{{instance}}", "refId": "C" } ], "thresholds": [], "timeFrom": null, "timeRegions": [], "timeShift": null, "title": "打开的文件描述符(左 )/每秒上下文切换次数(右)", "tooltip": { "shared": true, "sort": 2, "value_type": "individual" }, "type": "graph", "xaxis": { "buckets": null, "mode": "time", "name": null, "show": true, "values": [] }, "yaxes": [ { "format": "short", "label": "使用的文件描述符", "logBase": 1, "max": null, "min": null, "show": true }, { "format": "short", "label": "context_switches", "logBase": 1, "max": null, "min": null, "show": true 
} ], "yaxis": { "align": false, "alignLevel": null } } ], "refresh": "", "schemaVersion": 20, "style": "dark", "tags": [ "Prometheus", "node_exporter" ], "templating": { "list": [ { "allValue": null, "current": { "tags": [], "text": "node-exporter", "value": "node-exporter" }, "datasource": "prometheus", "definition": "label_values(node_uname_info, job)", "hide": 0, "includeAll": false, "index": -1, "label": "JOB", "multi": false, "name": "job", "options": [], "query": "label_values(node_uname_info, job)", "refresh": 1, "regex": "", "skipUrlSync": false, "sort": 5, "tagValuesQuery": "", "tags": [], "tagsQuery": "", "type": "query", "useTags": false }, { "allValue": null, "current": { "text": "All", "value": "$__all" }, "datasource": "prometheus", "definition": "label_values(node_uname_info{job=~\"$job\"}, nodename)", "hide": 0, "includeAll": true, "index": -1, "label": "主机名", "multi": false, "name": "hostname", "options": [], "query": "label_values(node_uname_info{job=~\"$job\"}, nodename)", "refresh": 1, "regex": "", "skipUrlSync": false, "sort": 5, "tagValuesQuery": "", "tags": [], "tagsQuery": "", "type": "query", "useTags": false }, { "allFormat": "glob", "allValue": null, "current": { "text": "ymt108", "value": "ymt108" }, "datasource": "prometheus", "definition": "label_values(node_uname_info{job=~\"$job\",nodename=~\"$hostname\"},instance)", "hide": 0, "includeAll": false, "index": -1, "label": "Instance", "multi": true, "multiFormat": "regex values", "name": "node", "options": [], "query": "label_values(node_uname_info{job=~\"$job\",nodename=~\"$hostname\"},instance)", "refresh": 1, "regex": "", "skipUrlSync": false, "sort": 5, "tagValuesQuery": "", "tags": [], "tagsQuery": "", "type": "query", "useTags": false }, { "allFormat": "glob", "allValue": null, "current": { "text": "All", "value": "$__all" }, "datasource": "prometheus", "definition": "label_values(node_network_info{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo.*|cni.*'},device)", "hide": 0, 
"includeAll": true, "index": -1, "label": "网卡", "multi": true, "multiFormat": "regex values", "name": "device", "options": [], "query": "label_values(node_network_info{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo.*|cni.*'},device)", "refresh": 1, "regex": "", "skipUrlSync": false, "sort": 1, "tagValuesQuery": "", "tags": [], "tagsQuery": "", "type": "query", "useTags": false }, { "allValue": null, "current": { "text": "/", "value": "/" }, "datasource": "prometheus", "definition": "query_result(topk(1,sort_desc (max(node_filesystem_size_bytes{instance=~'$node',fstype=~\"ext.?|xfs\",mountpoint!~\".*pods.*\"}) by (mountpoint))))", "hide": 2, "includeAll": false, "index": -1, "label": "最大挂载目录", "multi": false, "name": "maxmount", "options": [], "query": "query_result(topk(1,sort_desc (max(node_filesystem_size_bytes{instance=~'$node',fstype=~\"ext.?|xfs\",mountpoint!~\".*pods.*\"}) by (mountpoint))))", "refresh": 2, "regex": "/.*\\\"(.*)\\\".*/", "skipUrlSync": false, "sort": 5, "tagValuesQuery": "", "tags": [], "tagsQuery": "", "type": "query", "useTags": false }, { "allValue": null, "current": { "text": "ymt108", "value": "ymt108" }, "datasource": "prometheus", "definition": "label_values(node_uname_info{job=~\"$job\",instance=~\"$node\"}, nodename)", "hide": 2, "includeAll": false, "index": -1, "label": "展示使用的主机名", "multi": false, "name": "show_hostname", "options": [], "query": "label_values(node_uname_info{job=~\"$job\",instance=~\"$node\"}, nodename)", "refresh": 1, "regex": "", "skipUrlSync": false, "sort": 5, "tagValuesQuery": "", "tags": [], "tagsQuery": "", "type": "query", "useTags": false } ] }, "time": { "from": "now-12h", "to": "now" }, "timepicker": { "hidden": false, "now": true, "refresh_intervals": [ "15s", "30s", "1m", "5m", "15m", "30m" ], "time_options": [ "5m", "15m", "1h", "6h", "12h", "24h", "2d", "7d", "30d" ] }, "timezone": "browser", "title": "育苗通Node资源监控", "uid": "hb7fSE0Zz", "version": 11 }
Of course, many Kubernetes resource-monitoring dashboard templates are also built in by default.
10. Summary
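Special note 1: etcd monitoring can also be added as a custom target; for details see: https://www.jianshu.com/p/2fbbe767870d
- Step 1: Create a ServiceMonitor object so that Prometheus picks up the new scrape target;
- Step 2: Associate the ServiceMonitor object with a Service object that exposes the metrics endpoint;
- Step 3: Make sure the Service object can actually reach the metrics data.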
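The three steps above can be sketched as a single manifest file. This is only an illustrative sketch, not the configuration from the referenced article: the object names, labels, certificate paths, and the etcd IP address are all hypothetical placeholders, and it assumes etcd serves metrics over HTTPS on port 2379 with client certificates mounted into Prometheus from a secret named etcd-certs.

```shell
#!/bin/sh
# Sketch only: write the three objects from the steps above into one file.
# Every name, label, certificate path, and IP below is a hypothetical placeholder.
cat > etcd-monitoring.yaml <<'EOF'
# Step 1: ServiceMonitor that tells Prometheus what to scrape
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: etcd-k8s
  namespace: monitoring
  labels:
    k8s-app: etcd-k8s
spec:
  jobLabel: k8s-app
  endpoints:
  - port: port                 # must match the port name in the Service
    interval: 30s
    scheme: https
    tlsConfig:                 # assumed mount path of the etcd-certs secret
      caFile: /etc/prometheus/secrets/etcd-certs/ca.crt
      certFile: /etc/prometheus/secrets/etcd-certs/healthcheck-client.crt
      keyFile: /etc/prometheus/secrets/etcd-certs/healthcheck-client.key
      insecureSkipVerify: true
  selector:
    matchLabels:
      k8s-app: etcd            # selects the Service defined below
  namespaceSelector:
    matchNames:
    - kube-system
---
# Step 2: headless Service that the ServiceMonitor selects by label
apiVersion: v1
kind: Service
metadata:
  name: etcd-k8s
  namespace: kube-system
  labels:
    k8s-app: etcd
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: port
    port: 2379
---
# Step 3: Endpoints pointing the Service at the real etcd member(s)
apiVersion: v1
kind: Endpoints
metadata:
  name: etcd-k8s
  namespace: kube-system
  labels:
    k8s-app: etcd
subsets:
- addresses:
  - ip: 192.0.2.10             # placeholder; replace with an etcd node IP
  ports:
  - name: port
    port: 2379
EOF
echo "wrote etcd-monitoring.yaml"
```

After adjusting the placeholders, the file could be applied with kubectl apply -f etcd-monitoring.yaml. For the certificate paths to exist inside the Prometheus pod, the secret would also need to be listed under spec.secrets of the Prometheus resource in prometheus-prometheus.yaml.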
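Special note 2: When deploying kube-prometheus you may run into a failure to connect to the apiserver; for details see: [Solved] Error from server (ServiceUnavailable): the server is currently unable to handle the request
Once all of the configuration is done, it is worth tidying things up by writing an upgrade script, upgrade.sh, so that later deployments and updates are easier.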
#!/bin/sh

# deploy kubernetes service
kubectl apply -f prometheus-kubeControllerManagerService.yaml
kubectl apply -f prometheus-kubeSchedulerService.yaml

# upgrade alertmanager configuration
# (kubectl 1.18+ deprecates the bare --dry-run flag in favor of --dry-run=client)
kubectl create secret generic alertmanager-main --from-file=alertmanager.yaml -n monitoring --dry-run -oyaml > alertmanager-secret.yaml
kubectl apply -f alertmanager-secret.yaml

# upgrade scrape configs
kubectl create secret generic additional-scrape-configs --from-file=prometheus-additional.yaml -n monitoring --dry-run -oyaml > additional-scrape-configs.yaml
kubectl apply -f additional-scrape-configs.yaml

# upgrade prometheus rules
kubectl apply -f prometheus-additional-rules.yaml
kubectl apply -f prometheus-rules.yaml

# upgrade prometheus configuration
kubectl apply -f prometheus-prometheus.yaml

# upgrade grafana configuration
kubectl apply -f grafana-deployment.yaml
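Author: Leozhanggg
Source: https://www.cnblogs.com/leozhanggg/p/13502983.html
Copyright of this article is shared by the author and cnblogs (博客园). Reposting is welcome, but unless the author agrees otherwise, this notice must be retained and a link to the original article must be placed prominently on the page; otherwise the author reserves the right to pursue legal liability.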