Installing alertmanager v0.21 in K8S, with Prometheus alerting templates

 

1. Install alertmanager v0.21

1.1 Prepare the Docker image

docker pull docker.io/prom/alertmanager:v0.21.0

docker tag  c876f5897d7b harbor.st.com/infra/alertmanager:v0.21.0

docker push harbor.st.com/infra/alertmanager:v0.21.0

 

Prepare the working directory:

mkdir /data/k8s-yaml/alertmanager

cd /data/k8s-yaml/alertmanager

 

1.2 Prepare the ConfigMap manifest

cat >cm0.21.yaml <<'EOF'

apiVersion: v1

kind: ConfigMap

metadata:

  name: alertmanager-config

  namespace: test

data:

  config.yml: |-

    global:

      # How long to wait before declaring an alert resolved once it stops firing

      resolve_timeout: 5m

      # Email (SMTP) delivery settings

      smtp_smarthost: 'smtp.qq.com:465'

      smtp_from: 'xxxxx@qq.com'

      smtp_auth_username: 'xxxxx@qq.com'

      smtp_auth_password: 'nxhwqcmaaaaaa'

      # This is the QQ mailbox authorization code for third-party clients, not the QQ account login password

      smtp_hello: 'xxxx@qq.com'

      # smtp_hello must be set here, otherwise mail cannot be sent

      smtp_require_tls: false

    templates:  

      - '/etc/alertmanager/*.tmpl'

    # Root route for all incoming alerts; defines how alerts are dispatched

    route:

      # Labels used to regroup alerts after they arrive

      group_by: ['alertname', 'cluster']

      # After a new alert group is created, wait at least group_wait before the first notification, so that multiple alerts for the same group can be batched into a single notification.

      group_wait: 30s

      # After the first notification has been sent, wait group_interval before notifying about new alerts added to the group.

      group_interval: 5m

      # If a notification has already been sent successfully, wait repeat_interval before re-sending it

      repeat_interval: 5m

      # Default receiver: alerts not matched by any child route are sent here

      receiver: 'default'

    receivers:

    - name: 'default'

      email_configs:

      - to: 'xxxx@qq.com'

        send_resolved: true

        html: '{{ template "email.to.html" . }}'

        headers: { Subject: " {{ .CommonLabels.instance }} {{ .CommonAnnotations.summary }}" }  

  email.tmpl: |

    {{ define "email.to.html" }}

    {{- if gt (len .Alerts.Firing) 0 -}}

    {{ range .Alerts.Firing }}
    Alerting program: prometheus_alert <br>
    Severity: {{ .Labels.severity }} <br>
    Alert name: {{ .Labels.alertname }} <br>
    Host: {{ .Labels.instance }} <br>
    Summary: {{ .Annotations.summary }} <br>
    Started at: {{ .StartsAt.Format "2006-01-02 15:04:05" }} <br>

    {{ end }}{{ end -}}

    {{- if gt (len .Alerts.Resolved) 0 -}}

    {{ range .Alerts.Resolved }}
    Alerting program: prometheus_alert <br>
    Severity: {{ .Labels.severity }} <br>
    Alert name: {{ .Labels.alertname }} <br>
    Host: {{ .Labels.instance }} <br>
    Summary: {{ .Annotations.summary }} <br>
    Started at: {{ .StartsAt.Format "2006-01-02 15:04:05" }} <br>
    Resolved at: {{ .EndsAt.Format "2006-01-02 15:04:05" }} <br>

    {{ end }}{{ end -}}

    {{- end }}

EOF
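Optionally, the embedded Alertmanager configuration can be sanity-checked with amtool (bundled in the official image) before it is applied; a minimal sketch, assuming the config.yml body has been copied out of the ConfigMap into a local file (the /data/k8s-yaml/alertmanager/check path is only illustrative):

docker run --rm --entrypoint amtool \
  -v /data/k8s-yaml/alertmanager/check:/check \
  harbor.st.com/infra/alertmanager:v0.21.0 \
  check-config /check/config.yml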

 

1.3 Prepare the Deployment manifest

cat >dp0.21.yaml <<'EOF'

apiVersion: extensions/v1beta1

kind: Deployment

metadata:

  name: alertmanager

  namespace: test

spec:

  replicas: 1

  selector:

    matchLabels:

      app: alertmanager

  template:

    metadata:

      labels:

        app: alertmanager

    spec:

      containers:

      - name: alertmanager

        image: harbor.st.com/infra/alertmanager:v0.21.0

        args:

          - "--config.file=/etc/alertmanager/config.yml"

          - "--storage.path=/alertmanager"

          - "--cluster.advertise-address=0.0.0.0:9093"

        ports:

        - name: alertmanager

          containerPort: 9093

        volumeMounts:

        - name: alertmanager-cm

          mountPath: /etc/alertmanager

      volumes:

      - name: alertmanager-cm

        configMap:

          name: alertmanager-config

      imagePullSecrets:

      - name: harbor

EOF

 

1.4 Prepare the Service manifest

cat >svc0.21.yaml <<'EOF'

apiVersion: v1

kind: Service

metadata:

  name: alertmanager

  namespace: test

spec:

  selector:

    app: alertmanager

  ports:

    - port: 80

      targetPort: 9093

EOF

  

1.5 Apply the manifests

kubectl apply -f http://k8s-yaml.st.com/alertmanager/cm0.21.yaml

kubectl apply -f http://k8s-yaml.st.com/alertmanager/dp0.21.yaml

kubectl apply -f http://k8s-yaml.st.com/alertmanager/svc0.21.yaml
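A quick way to verify the rollout before adding the Ingress (the port-forward is only a temporary local check):

kubectl -n test get pods -l app=alertmanager
kubectl -n test get svc alertmanager
kubectl -n test port-forward svc/alertmanager 9093:80 &
sleep 2
curl -s http://127.0.0.1:9093/-/healthy   # Alertmanager health endpoint, should print OK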

 

Supplement:

1.6 Prepare the Ingress manifest

cat > ingress0.21.yaml <<'EOF'

apiVersion: extensions/v1beta1

kind: Ingress

metadata:

  name: alertmanager

  namespace: test

spec:

  rules:

  - host: alertmanager.st.com

    http:

      paths:

      - path: /

        backend:

          serviceName: alertmanager

          servicePort: 80

EOF 

kubectl apply -f http://k8s-yaml.st.com/alertmanager/ingress0.21.yaml

 

1.7 Add a DNS record

alertmanager                A    192.168.40.13

The Ingress is added mainly so that the Alertmanager UI can be viewed directly from a browser outside the cluster.
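Once the record resolves, a quick external check against the Ingress (an HTTP 200 from the Alertmanager UI is expected):

curl -s -o /dev/null -w '%{http_code}\n' http://alertmanager.st.com/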

 

 

 

 

Troubleshooting 1:

With Alertmanager versions above 0.15, startup may fail with:

Failed to get final advertise address: No private IP address found, and explicit IP not provided

Solution 1

Downgrade Alertmanager to prom/alertmanager:v0.14.0 and the problem goes away.

Solution 2

Add the argument "--cluster.advertise-address=0.0.0.0:9093" at startup (already included in the Deployment args above).

 

 

2. Integrate with Prometheus alerting

2.1 Prometheus alerting rules (adjust the path to your environment)

cat >/data/nfs-volume/prometheus/etc/rules.yml <<'EOF'

groups:

- name: hostStatsAlert

  rules:

  - alert: hostCpuUsageAlert

    expr: sum(avg without (cpu)(irate(node_cpu{mode!='idle'}[5m]))) by (instance) > 0.85

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "{{ $labels.instance }} CPU usage above 85% (current value: {{ $value }}%)"

  - alert: hostMemUsageAlert

    expr: (node_memory_MemTotal - node_memory_MemAvailable)/node_memory_MemTotal > 0.85

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "{{ $labels.instance }} MEM usage above 85% (current value: {{ $value }}%)"

  - alert: OutOfInodes

    expr: node_filesystem_free{fstype="overlay",mountpoint ="/"} / node_filesystem_size{fstype="overlay",mountpoint ="/"} * 100 < 10

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "Out of inodes (instance {{ $labels.instance }})"

      description: "Disk is almost running out of available inodes (< 10% left) (current value: {{ $value }})"

  - alert: OutOfDiskSpace

    expr: node_filesystem_free{fstype="overlay",mountpoint ="/rootfs"} / node_filesystem_size{fstype="overlay",mountpoint ="/rootfs"} * 100 < 10

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "Out of disk space (instance {{ $labels.instance }})"

      description: "Disk is almost full (< 10% left) (current value: {{ $value }})"

  - alert: UnusualNetworkThroughputIn

    expr: sum by (instance) (irate(node_network_receive_bytes[2m])) / 1024 / 1024 > 100

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "Unusual network throughput in (instance {{ $labels.instance }})"

      description: "Host network interfaces are probably receiving too much data (> 100 MB/s) (current value: {{ $value }})"

  - alert: UnusualNetworkThroughputOut

    expr: sum by (instance) (irate(node_network_transmit_bytes[2m])) / 1024 / 1024 > 100

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "Unusual network throughput out (instance {{ $labels.instance }})"

      description: "Host network interfaces are probably sending too much data (> 100 MB/s) (current value: {{ $value }})"

  - alert: UnusualDiskReadRate

    expr: sum by (instance) (irate(node_disk_bytes_read[2m])) / 1024 / 1024 > 50

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "Unusual disk read rate (instance {{ $labels.instance }})"

      description: "Disk is probably reading too much data (> 50 MB/s) (current value: {{ $value }})"

  - alert: UnusualDiskWriteRate

    expr: sum by (instance) (irate(node_disk_bytes_written[2m])) / 1024 / 1024 > 50

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "Unusual disk write rate (instance {{ $labels.instance }})"

      description: "Disk is probably writing too much data (> 50 MB/s) (current value: {{ $value }})"

  - alert: UnusualDiskReadLatency

    expr: rate(node_disk_read_time_ms[1m]) / rate(node_disk_reads_completed[1m]) > 100

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "Unusual disk read latency (instance {{ $labels.instance }})"

      description: "Disk latency is growing (read operations > 100ms) (current value: {{ $value }})"

  - alert: UnusualDiskWriteLatency

    expr: rate(node_disk_write_time_ms[1m]) / rate(node_disk_writes_completed[1m]) > 100

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "Unusual disk write latency (instance {{ $labels.instance }})"

      description: "Disk latency is growing (write operations > 100ms) (current value: {{ $value }})"

- name: http_status

  rules:

  - alert: ProbeFailed

    expr: probe_success == 0

    for: 1m

    labels:

      severity: error

    annotations:

      summary: "Probe failed (instance {{ $labels.instance }})"

      description: "Probe failed (current value: {{ $value }})"

  - alert: StatusCode

    expr: probe_http_status_code <= 199 or probe_http_status_code >= 400

    for: 1m

    labels:

      severity: error

    annotations:

      summary: "Status Code (instance {{ $labels.instance }})"

      description: "HTTP status code is not 200-399 (current value: {{ $value }})"

  - alert: SslCertificateWillExpireSoon

    expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "SSL certificate will expire soon (instance {{ $labels.instance }})"

      description: "SSL certificate expires in 30 days (current value: {{ $value }})"

  - alert: SslCertificateHasExpired

    expr: probe_ssl_earliest_cert_expiry - time()  <= 0

    for: 5m

    labels:

      severity: error

    annotations:

      summary: "SSL certificate has expired (instance {{ $labels.instance }})"

      description: "SSL certificate has expired already (current value: {{ $value }})"

  - alert: BlackboxSlowPing

    expr: probe_icmp_duration_seconds > 2

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "Blackbox slow ping (instance {{ $labels.instance }})"

      description: "Blackbox ping took more than 2s (current value: {{ $value }})"

  - alert: BlackboxSlowRequests

    expr: probe_http_duration_seconds > 2

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "Blackbox slow requests (instance {{ $labels.instance }})"

      description: "Blackbox request took more than 2s (current value: {{ $value }})"

  - alert: PodCpuUsagePercent

    expr: sum(sum(label_replace(irate(container_cpu_usage_seconds_total[1m]),"pod","$1","container_label_io_kubernetes_pod_name", "(.*)"))by(pod) / on(pod) group_right kube_pod_container_resource_limits_cpu_cores *100 )by(container,namespace,node,pod,severity) > 80

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "Pod cpu usage percent has exceeded 80% (current value: {{ $value }}%)"

EOF
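Before reloading Prometheus, the rules file can be syntax-checked with promtool (shipped with Prometheus); a sketch, assuming promtool is available on the host that holds the NFS volume:

promtool check rules /data/nfs-volume/prometheus/etc/rules.yml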

 

2.2 Update the Prometheus configuration

Append the following to the Prometheus configuration file (adjust the path to your environment):

cat >>/data/nfs-volume/prometheus/etc/prometheus.yml <<'EOF'

alerting:

  alertmanagers:

    - static_configs:

        - targets: ["alertmanager.test"]

rule_files:

 - "/data/etc/rules.yml"

EOF

 

Reload the configuration:

curl -X POST http://prometheus.st.com/-/reload
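After the reload, the Prometheus API can confirm that both the Alertmanager target and the rule groups were picked up (a quick sketch against the same host):

curl -s http://prometheus.st.com/api/v1/alertmanagers   # activeAlertmanagers should list the alertmanager target
curl -s http://prometheus.st.com/api/v1/rules | head    # the hostStatsAlert and http_status groups should appear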

 

Troubleshooting 2

The address Prometheus uses for the Alertmanager target depends on whether Prometheus and Alertmanager run in the same namespace; with the wrong address, Prometheus never reaches Alertmanager and no notifications go out.

Prometheus and Alertmanager in different namespaces (use the namespace-qualified Service name):

cat >>/data/nfs-volume/prometheus/etc/prometheus.yml <<'EOF'

alerting:

  alertmanagers:

    - static_configs:

        - targets: ["alertmanager.test"]

rule_files:

 - "/data/etc/rules.yml"

EOF

 

Prometheus and Alertmanager in the same namespace (the bare Service name is enough):

cat >>/data/nfs-volume/prometheus/etc/prometheus.yml <<'EOF'

alerting:

  alertmanagers:

    - static_configs:

        - targets: ["alertmanager"]

rule_files:

 - "/data/etc/rules.yml"

EOF
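The only difference is the Service DNS name: the bare name "alertmanager" resolves only from within the same namespace, while the namespace-qualified "alertmanager.test" (or the fully qualified "alertmanager.test.svc.cluster.local") works from anywhere in the cluster. A quick resolution test from a throwaway pod (the image choice is only illustrative):

kubectl run dns-test --rm -it --image=busybox:1.28 --restart=Never -- nslookup alertmanager.test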

 

 

3. Stress test to trigger the CPU alert

stress -c 4 -t 10000

Once the alert on the Prometheus Alerts page turns red (FIRING), the email notification is sent.
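Here stress -c 4 spins up 4 CPU-bound workers and -t 10000 keeps them running for 10000 seconds; on a host with four or fewer cores this drives non-idle CPU above the 0.85 threshold once the 5m "for" window has elapsed. The alert state can also be watched via the Prometheus API (a sketch using the ALERTS metric Prometheus maintains for every rule):

curl -s -G http://prometheus.st.com/api/v1/query --data-urlencode 'query=ALERTS{alertname="hostCpuUsageAlert"}'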

 

 

 

 

 

posted @ 2021-03-15 20:03  ST运维