Deploying Grafana, Alertmanager, Prometheus, a DingTalk webhook and Pushgateway with Docker Compose to monitor system metrics

 

This setup provides: Grafana dashboards, Prometheus monitoring and alerting, and DingTalk alert notifications.

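1. What each service does

  • Prometheus: an open-source system monitoring and alerting framework, inspired by Google's Borgmon monitoring system
  • AlertManager: handles alerts sent by client applications such as the Prometheus server. It deduplicates and groups them, routes them to the correct receiver integration, and also handles silencing and inhibition of alerts
  • Node_Exporter: an exporter for the resource metrics of each node; deploy it on every node that Prometheus monitors
  • cAdvisor: monitors containers
  • mysqld-exporter: collects MySQL metrics
  • pushgateway: metrics are pushed to it with curl, and Prometheus then pulls them from it
  • prometheus-webhook-dingtalk: DingTalk alerting plugin
  • grafana: monitoring visualization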

 

[Figure: simple topology diagram]

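2. Create the prometheus directory that holds all monitoring files, and server information

There is a single server, 10.1.1.10, which runs every service. To monitor more machines, add a job to the Prometheus configuration file and start a Node_Exporter service on each monitored host.

mkdir -p /data/prometheus    # all following steps are run inside this prometheus directory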
cd /data/prometheus
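
To make the "one extra job per monitored machine" step concrete, here is a minimal sketch of a helper that prints the scrape job stanza for one more host. The function name node_job, the example IP 10.1.1.11 and the default port 19100 (the host port that the node-exporter compose file in this article maps to 9100) are illustrative assumptions, not part of the original setup.

```shell
# Print a scrape job stanza for one extra node-exporter host.
# $1 = host IP (hypothetical example value below)
# $2 = node-exporter port on that host (assumed default: 19100)
node_job() {
  local ip="$1" port="${2:-19100}"
  cat <<EOF
  - job_name: ${ip}-node
    static_configs:
      - targets: ['${ip}:${port}']
        labels:
          instance: ${ip}
EOF
}

node_job 10.1.1.11
```

The printed stanza can be pasted under scrape_configs in prometheus.yml; Prometheus needs a restart or reload to pick it up.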

 

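3. Create the Prometheus configuration file and data directory, which Prometheus reads at startup

mkdir -p ./prometheus/data
chmod 777 ./prometheus/data         # data directory for Prometheus; writable by the container user
vim ./prometheus/prometheus.yml
global:
  scrape_interval:     15s    # how often to scrape targets
  evaluation_interval: 15s    # how often to evaluate rules
  scrape_timeout:      10s    # timeout for each scrape

# list of scrape configurations
scrape_configs:
  - job_name: prometheus            # required; attached automatically as the job label, must be unique
    static_configs:
      - targets: ['10.1.1.10:19090']       # Prometheus IP and port (host port 19090 is mapped to 9090 in docker-compose.yml)
        labels:
          instance: prometheus                 # label

  - job_name: 1.1.1.1-node1          # node-exporter
    static_configs:
      - targets: ['10.1.1.10:19100']       # host port mapped to node-exporter's 9100
        labels:
          instance: 1.1.1.1      # same instance label on both jobs, so node and cadvisor data are collected under one instance
  - job_name: 1.1.1.1-node2              # cadvisor
    static_configs:
      - targets: ['10.1.1.10:19200']       # host port mapped to cAdvisor's 8080
        labels:
          instance: 1.1.1.1
alerting:                     # Alertmanager settings
  alertmanagers:
    - static_configs:
        - targets:
            - 10.1.1.10:19093   # Alertmanager address (host port 19093 is mapped to 9093)
rule_files:                   # alert rule files; wildcards are allowed
  - "/etc/prometheus/rules/*.yml"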
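
Before starting the stack, the configuration file can be checked for syntax errors. The sketch below wraps promtool, which ships inside the prom/prometheus image; the function name check_prometheus_config is a hypothetical helper, and calling it assumes Docker is installed and the current directory is /data/prometheus.

```shell
# Validate prometheus.yml with promtool from the prom/prometheus image.
# Nothing runs until the function is called; calling it requires Docker.
check_prometheus_config() {
  docker run --rm \
    -v "$PWD/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro" \
    --entrypoint promtool \
    prom/prometheus check config /etc/prometheus/prometheus.yml
}
```

Run check_prometheus_config after each edit; a non-zero exit status means the file did not parse.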

 

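4. Create the alerting rules file and the recording rules file, which Prometheus loads through rule_files in its configuration

4.1: 
mkdir rules                     # create the rules directory first
vim rules/alert-rules.yml       # general-purpose alert rules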
groups:
  - name: prometheus-alert
    rules:
    - alert: node-down
      expr: up == 0               # the prometheus:up recording rule is commented out in record-rules.yml, so query up directly
      for: 1m
      labels:
        severity: 'critical'
      annotations:
        summary: "instance: {{ $labels.instance }} is down"
        description: "instance: {{ $labels.instance }} \n- job: {{ $labels.job }} has been down for 1 minute."
        value: "{{ $value }}"
        instance: "{{ $labels.instance }}"



    - alert: cpu-high
      expr:  prometheus:cpu:total:percent > 80
      for: 3m
      labels:
        severity: info
      annotations:
        summary: "instance: {{ $labels.instance }} cpu usage is high: {{ $value }}"
        description: "instance: {{ $labels.instance }} \n- job: {{ $labels.job }} CPU usage has been above 80% for 3 minutes."
        value: "{{ $value }}"
        instance: "{{ $labels.instance }}"

    - alert: cpu-iowait-high
      expr:  prometheus:cpu:iowait:percent >= 12
      for: 3m
      labels:
        severity: info
      annotations:
        summary: "instance: {{ $labels.instance }} cpu iowait is high: {{ $value }}"
        description: "instance: {{ $labels.instance }} \n- job: {{ $labels.job }} cpu iowait has been at or above 12% for 3 minutes."
        value: "{{ $value }}"
        instance: "{{ $labels.instance }}"



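    - alert:  load-load1-high
      # prometheus:cpu:count is not defined in record-rules.yml and must be recorded separately;
      # on (instance) makes the comparison match on instance only, whatever other labels the two series carry
      expr:  prometheus:load:load1 > on (instance) (prometheus:cpu:count * 1.2)
      for: 3m
      labels:
        severity: info
      annotations:
        summary: "instance: {{ $labels.instance }} load1 is high: {{ $value }}"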
        description: ""
        value: "{{ $value }}"
        instance: "{{ $labels.instance }}"


    - alert:  memory-high
      expr:  prometheus:memory:used:percent > 85
      for: 3m
      labels:
        severity: info
      annotations:
        summary: "instance: {{ $labels.instance }} memory usage is high: {{ $value }}"
        description: ""
        value: "{{ $value }}"
        instance: "{{ $labels.instance }}"


    - alert: disk-high
      expr:  prometheus:disk:used:percent > 80
      for: 10m
      labels:
        severity: info
      annotations:
        summary: "instance: {{ $labels.instance }} disk usage is high: {{ $value }}"
        description: ""
        value: "{{ $value }}"
        instance: "{{ $labels.instance }}"


    - alert: disk-read-count-high
      expr:  prometheus:disk:read:count:rate > 2000
      for: 2m
      labels:
        severity: info
      annotations:
        summary: "instance: {{ $labels.instance }} read IOPS is high: {{ $value }}"
        description: ""
        value: "{{ $value }}"
        instance: "{{ $labels.instance }}"


    - alert: disk-write-count-high
      expr:  prometheus:disk:write:count:rate > 2000
      for: 2m
      labels:
        severity: info
      annotations:
        summary: "instance: {{ $labels.instance }} write IOPS is high: {{ $value }}"
        description: ""
        value: "{{ $value }}"
        instance: "{{ $labels.instance }}"


    - alert: disk-read-mb-high
      expr:  prometheus:disk:read:mb:rate > 60
      for: 2m
      labels:
        severity: info
      annotations:
        summary: "instance: {{ $labels.instance }} disk read throughput (MB/s) is high: {{ $value }}"
        description: ""
        instance: "{{ $labels.instance }}"
        value: "{{ $value }}"


    - alert: disk-write-mb-high
      expr:  prometheus:disk:write:mb:rate > 60
      for: 2m
      labels:
        severity: info
      annotations:
        summary: "instance: {{ $labels.instance }} disk write throughput (MB/s) is high: {{ $value }}"
        description: ""
        value: "{{ $value }}"
        instance: "{{ $labels.instance }}"


    - alert: filefd-allocated-percent-high
      expr:  prometheus:filefd_allocated:percent > 80
      for: 10m
      labels:
        severity: info
      annotations:
        summary: "instance: {{ $labels.instance }} open file descriptor usage is high: {{ $value }}"
        description: ""
        value: "{{ $value }}"
        instance: "{{ $labels.instance }}"


    - alert: network-netin-error-rate-high
      expr:  prometheus:network:netin:error:rate > 4
      for: 1m
      labels:
        severity: info
      annotations:
        summary: "instance: {{ $labels.instance }} inbound packet error rate is high: {{ $value }}"
        description: ""
        value: "{{ $value }}"
        instance: "{{ $labels.instance }}"


    - alert: network-netin-packet-rate-high
      expr:  prometheus:network:netin:packet:rate > 35000
      for: 1m
      labels:
        severity: info
      annotations:
        summary: "instance: {{ $labels.instance }} inbound packet rate is high: {{ $value }}"
        description: ""
        value: "{{ $value }}"
        instance: "{{ $labels.instance }}"


    - alert: network-netout-packet-rate-high
      expr:  prometheus:network:netout:packet:rate > 35000
      for: 1m
      labels:
        severity: info
      annotations:
        summary: "instance: {{ $labels.instance }} outbound packet rate is high: {{ $value }}"
        description: ""
        value: "{{ $value }}"
        instance: "{{ $labels.instance }}"


    - alert: network-tcp-total-count-high
      expr:  prometheus:network:tcp:total:count > 40000
      for: 1m
      labels:
        severity: info
      annotations:
        summary: "instance: {{ $labels.instance }} TCP connection count is high: {{ $value }}"
        description: ""
        value: "{{ $value }}"
        instance: "{{ $labels.instance }}"


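    - alert: process-zoom-total-count-high
      # prometheus:process:zoom:total:count (zombie processes) is not defined in record-rules.yml and must be recorded separately
      expr:  prometheus:process:zoom:total:count > 10
      for: 10m
      labels:
        severity: info
      annotations:
        summary: "instance: {{ $labels.instance }} zombie process count is high: {{ $value }}"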
        description: ""
        value: "{{ $value }}"
        instance: "{{ $labels.instance }}"


    - alert: time-offset-high
      # prometheus:time:offset is not defined in record-rules.yml and must be recorded separately
      expr:  prometheus:time:offset > 0.03
      for: 2m
      labels:
        severity: info
      annotations:
        summary: "instance: {{ $labels.instance }} {{ $labels.desc }}  {{ $value }} {{ $labels.unit }}"
        description: ""
        value: "{{ $value }}"
        instance: "{{ $labels.instance }}"
vim rules/record-rules.yml      # recording rules
groups:
  - name: prometheus-record
    rules:
#    - expr: up{job!="prometheus"} == 0
#      record: prometheus:up
#      labels:
#        desc: "whether the node is online: 1 online, 0 offline"
#        unit: " "
#        job: "prometheus"
    - expr: time() - node_boot_time_seconds{}
      record: prometheus:node_uptime
      labels:
        desc: "node uptime"
        unit: "s"
        job: "prometheus"
##############################################################################################
#                              cpu                                                           #
    - expr: (1 - avg by (environment,instance) (irate(node_cpu_seconds_total{job!="prometheus",mode="idle"}[5m])))  * 100
      record: prometheus:cpu:total:percent
      labels:
        desc: "total CPU usage percentage of the node"
        unit: "%"
        job: "prometheus"

    - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job!="prometheus",mode="idle"}[5m])))  * 100
      record: prometheus:cpu:idle:percent
      labels:
        desc: "CPU idle percentage of the node"
        unit: "%"
        job: "prometheus"

    - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job!="prometheus",mode="iowait"}[5m])))  * 100
      record: prometheus:cpu:iowait:percent
      labels:
        desc: "CPU iowait percentage of the node"
        unit: "%"
        job: "prometheus"



    - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job!="prometheus",mode="system"}[5m])))  * 100
      record: prometheus:cpu:system:percent
      labels:
        desc: "CPU system percentage of the node"
        unit: "%"
        job: "prometheus"

    - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job!="prometheus",mode="user"}[5m])))  * 100
      record: prometheus:cpu:user:percent
      labels:
        desc: "CPU user percentage of the node"
        unit: "%"
        job: "prometheus"

    - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job!="prometheus",mode=~"softirq|nice|irq|steal"}[5m])))  * 100
      record: prometheus:cpu:other:percent
      labels:
        desc: "CPU percentage of the node spent in other modes"
        unit: "%"
        job: "prometheus"
##############################################################################################

##############################################################################################
#                                    memory                                                  #
    - expr: node_memory_MemTotal_bytes{job!="prometheus"}
      record: prometheus:memory:total
      labels:
        desc: "total memory of the node"
        unit: byte
        job: "prometheus"

    - expr: node_memory_MemFree_bytes{job!="prometheus"}
      record: prometheus:memory:free
      labels:
        desc: "free memory of the node"
        unit: byte
        job: "prometheus"

    - expr: node_memory_MemTotal_bytes{job!="prometheus"} - node_memory_MemFree_bytes{job!="prometheus"}
      record: prometheus:memory:used
      labels:
        desc: "used memory of the node"
        unit: byte
        job: "prometheus"

    - expr: node_memory_MemTotal_bytes{job!="prometheus"} - node_memory_MemAvailable_bytes{job!="prometheus"}
      record: prometheus:memory:actualused
      labels:
        desc: "memory actually in use on the node (total minus available)"
        unit: byte
        job: "prometheus"

    - expr: (1-(node_memory_MemAvailable_bytes{job!="prometheus"} / (node_memory_MemTotal_bytes{job!="prometheus"})))* 100
      record: prometheus:memory:used:percent
      labels:
        desc: "memory usage percentage of the node"
        unit: "%"
        job: "prometheus"

    - expr: ((node_memory_MemAvailable_bytes{job!="prometheus"} / (node_memory_MemTotal_bytes{job!="prometheus"})))* 100
      record: prometheus:memory:free:percent
      labels:
        desc: "available memory percentage of the node"
        unit: "%"
        job: "prometheus"
##############################################################################################
#                                   load                                                     #
    - expr: sum by (instance) (node_load1{job!="prometheus"})
      record: prometheus:load:load1
      labels:
        desc: "1-minute system load"
        unit: " "
        job: "prometheus"

    - expr: sum by (instance) (node_load5{job!="prometheus"})
      record: prometheus:load:load5
      labels:
        desc: "5-minute system load"
        unit: " "
        job: "prometheus"

    - expr: sum by (instance) (node_load15{job!="prometheus"})
      record: prometheus:load:load15
      labels:
        desc: "15-minute system load"
        unit: " "
        job: "prometheus"

##############################################################################################
#                                 disk                                                       #
    - expr: node_filesystem_size_bytes{job!="prometheus" ,fstype=~"ext4|xfs"}
      record: prometheus:disk:usage:total
      labels:
        desc: "total disk space of the node"
        unit: byte
        job: "prometheus"

    - expr: node_filesystem_avail_bytes{job!="prometheus",fstype=~"ext4|xfs"}
      record: prometheus:disk:usage:free
      labels:
        desc: "free disk space of the node"
        unit: byte
        job: "prometheus"

    - expr: node_filesystem_size_bytes{job!="prometheus",fstype=~"ext4|xfs"} - node_filesystem_avail_bytes{job!="prometheus",fstype=~"ext4|xfs"}
      record: prometheus:disk:usage:used
      labels:
        desc: "used disk space of the node"
        unit: byte
        job: "prometheus"

    - expr:  (1 - node_filesystem_avail_bytes{job!="prometheus",fstype=~"ext4|xfs"} / node_filesystem_size_bytes{job!="prometheus",fstype=~"ext4|xfs"}) * 100
      record: prometheus:disk:used:percent
      labels:
        desc: "disk usage percentage of the node"
        unit: "%"
        job: "prometheus"

    - expr: irate(node_disk_reads_completed_total{job!="prometheus"}[1m])
      record: prometheus:disk:read:count:rate
      labels:
        desc: "disk read operations per second on the node"
        unit: "ops/s"
        job: "prometheus"

    - expr: irate(node_disk_writes_completed_total{job!="prometheus"}[1m])
      record: prometheus:disk:write:count:rate
      labels:
        desc: "disk write operations per second on the node"
        unit: "ops/s"
        job: "prometheus"

    - expr: (irate(node_disk_read_bytes_total{job!="prometheus"}[1m]))/1024/1024
      record: prometheus:disk:read:mb:rate
      labels:
        desc: "device read throughput of the node in MB per second"
        unit: "MB/s"
        job: "prometheus"

    - expr: (irate(node_disk_written_bytes_total{job!="prometheus"}[1m]))/1024/1024
      record: prometheus:disk:write:mb:rate
      labels:
        desc: "device write throughput of the node in MB per second"
        unit: "MB/s"
        job: "prometheus"

##############################################################################################
#                                filesystem                                                  #
    - expr:   (1 -node_filesystem_files_free{job!="prometheus",fstype=~"ext4|xfs"} / node_filesystem_files{job!="prometheus",fstype=~"ext4|xfs"}) * 100
      record: prometheus:filesystem:used:percent
      labels:
        desc: "percentage of inodes in use on the node"
        unit: "%"
        job: "prometheus"
#############################################################################################
#                                filefd                                                     #
    - expr: node_filefd_allocated{job!="prometheus"}
      record: prometheus:filefd_allocated:count
      labels:
        desc: "number of allocated file descriptors on the node"
        unit: ""
        job: "prometheus"

    - expr: node_filefd_allocated{job!="prometheus"}/node_filefd_maximum{job!="prometheus"} * 100
      record: prometheus:filefd_allocated:percent
      labels:
        desc: "percentage of file descriptors allocated on the node"
        unit: "%"
        job: "prometheus"

#############################################################################################
#                                network                                                    #
    - expr: avg by (environment,instance,device) (irate(node_network_receive_bytes_total{device=~"eth0|eth1|ens33|ens37"}[1m])) * 8
      record: prometheus:network:netin:bit:rate
      labels:
        desc: "bits received per second on each matched network interface (bytes * 8)"
        unit: "bit/s"
        job: "prometheus"

    - expr: avg by (environment,instance,device) (irate(node_network_transmit_bytes_total{device=~"eth0|eth1|ens33|ens37"}[1m])) * 8
      record: prometheus:network:netout:bit:rate
      labels:
        desc: "bits sent per second on each matched network interface (bytes * 8)"
        unit: "bit/s"
        job: "prometheus"

    - expr: avg by (environment,instance,device) (irate(node_network_receive_packets_total{device=~"eth0|eth1|ens33|ens37"}[1m]))
      record: prometheus:network:netin:packet:rate
      labels:
        desc: "packets received per second by the node's network interfaces"
        unit: "packets/s"
        job: "prometheus"


    - expr: avg by (environment,instance,device) (irate(node_network_transmit_packets_total{device=~"eth0|eth1|ens33|ens37"}[1m]))
      record: prometheus:network:netout:packet:rate
      labels:
        desc: "packets sent per second by the node's network interfaces"
        unit: "packets/s"
        job: "prometheus"

    - expr: avg by (environment,instance,device) (irate(node_network_receive_errs_total{device=~"eth0|eth1|ens33|ens37"}[1m]))
      record: prometheus:network:netin:error:rate
      labels:
        desc: "receive errors per second detected by the device driver"
        unit: "packets/s"
        job: "prometheus"

    - expr: avg by (environment,instance,device) (irate(node_network_transmit_errs_total{device=~"eth0|eth1|ens33|ens37"}[1m]))
      record: prometheus:network:netout:error:rate
      labels:
        desc: "transmit errors per second detected by the device driver"
        unit: "packets/s"
        job: "prometheus"

    - expr: node_tcp_connection_states{job!="prometheus", state="established"}
      record: prometheus:network:tcp:established:count
      labels:
        desc: "number of established TCP connections on the node"
        unit: ""
        job: "prometheus"

    - expr: node_tcp_connection_states{job!="prometheus", state="time_wait"}
      record: prometheus:network:tcp:timewait:count
      labels:
        desc: "number of TCP connections in time_wait on the node"
        unit: ""
        job: "prometheus"

    - expr: sum by (environment,instance) (node_tcp_connection_states{job!="prometheus"})
      record: prometheus:network:tcp:total:count
      labels:
        desc: "total number of TCP connections on the node"
        unit: ""
        job: "prometheus"
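
The alert rules in alert-rules.yml reference three recorded series that record-rules.yml never defines: prometheus:cpu:count, prometheus:process:zoom:total:count and prometheus:time:offset. The sketch below is one possible way to record them, not the author's original rules. The file name record-rules-extra.yml is a hypothetical choice, the zombie-process rule assumes node-exporter is started with --collector.processes, and the clock-offset rule assumes the timex collector is enabled.

```shell
# Sketch: write the three missing recording rules into their own file.
# Any *.yml under rules/ is picked up by the rule_files glob in prometheus.yml.
mkdir -p rules
cat > rules/record-rules-extra.yml <<'EOF'
groups:
  - name: prometheus-record-extra
    rules:
    # CPU core count per node (one idle-mode series exists per core)
    - expr: count by (instance) (node_cpu_seconds_total{job!="prometheus",mode="idle"})
      record: prometheus:cpu:count
      labels:
        desc: "number of CPU cores on the node"
        unit: ""
        job: "prometheus"

    # zombie processes; assumes node-exporter runs with --collector.processes
    - expr: node_processes_state{job!="prometheus",state="Z"}
      record: prometheus:process:zoom:total:count
      labels:
        desc: "number of zombie processes on the node"
        unit: ""
        job: "prometheus"

    # absolute clock offset in seconds, from the timex collector
    - expr: abs(node_timex_offset_seconds{job!="prometheus"})
      record: prometheus:time:offset
      labels:
        desc: "clock offset of the node"
        unit: "s"
        job: "prometheus"
EOF
```

Because each recorded series carries its own desc label, an expression that compares prometheus:load:load1 with prometheus:cpu:count only matches when it uses on (instance).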

 

5. Create the Grafana data directory and configuration file, used by Grafana to store its data

mkdir -p grafana/grafana-storage
chmod 777 grafana/grafana-storage

The grafana.ini configuration file can be copied out of a Grafana container.
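
One possible way to get that file, sketched below, is to print the default grafana.ini from the image and redirect it into ./grafana/. The function name copy_grafana_ini is a hypothetical helper; calling it assumes Docker is installed and that the grafana/grafana image keeps its default configuration at /etc/grafana/grafana.ini.

```shell
# Copy the default grafana.ini out of the grafana/grafana image into ./grafana/.
# Nothing runs until the function is called; calling it requires Docker.
copy_grafana_ini() {
  mkdir -p grafana
  docker run --rm --entrypoint cat grafana/grafana /etc/grafana/grafana.ini > grafana/grafana.ini
}
```

The file has to exist before the stack is started, otherwise Docker creates a directory named grafana.ini at the mount source.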

 

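6. Create the Alertmanager configuration, used to send alerts to the webhook

mkdir alert
vim alert/alertmanager.yml
global:
  resolve_timeout: 5m
  # SMTP settings; configure them only if you also need email notifications
  # smtp_from: "123456789@qq.com"
  # smtp_smarthost: 'smtp.qq.com:465'
  # smtp_auth_username: "123456789@qq.com"
  # smtp_auth_password: "auth_pass"
  # smtp_require_tls: true

route:
  receiver: webhook
  group_wait: 30s   # wait this long before the first notification for a new group, so alerts arriving within 30s are sent together
  group_interval: 5m  # wait this long before notifying about new alerts added to a group that was already notified
  repeat_interval: 5m  # resend the notification at this interval while the alert is still firing
  group_by: [alertname]
  routes:
  - receiver: webhook
    group_wait: 10s
receivers:
- name: webhook
  webhook_configs:
  - url: http://10.1.1.10:19092/dingtalk/webhook1/send    # host port 19092 is mapped to the webhook's 8060 in docker-compose.yml
    send_resolved: true

The url above points at the webhook service.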

7. Webhook template file

mkdir ./webhook
vim ./webhook/dingding.tmpl
[root@prometheus ~]# cat dingding.tmpl

{{ define "__subject" }}[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.SortedPairs.Values | join " " }} {{ if gt (len .CommonLabels) (len .GroupLabels) }}({{ with .CommonLabels.Remove .GroupLabels.Names }}{{ .Values | join " " }}{{ end }}){{ end }}{{ end }}

{{ define "__alertmanagerURL" }}{{ .ExternalURL }}/#/alerts?receiver={{ .Receiver }}{{ end }}

{{/* .Local.Format renders timestamps in local time, which fixes alert times being 8 hours off */}}
{{ define "__text_alert_list" }}{{ range . }}
Alert source: prometheus_alert
Severity: {{ .Labels.severity }}
Alert type: {{ .Labels.alertname }}
Host: {{ .Labels.instance }}
Namespace: {{ .Labels.namespace }}
Pod: {{ .Labels.pod }}
Summary: {{ .Annotations.summary }}
Description: {{ .Annotations.description }}
Start time: {{ .StartsAt.Local.Format "2006-01-02 15:04:05" }}
------------------------

{{ end }}{{ end }}

{{ define "__text_resolve_list" }}{{ range . }}
Resolved alert: {{ .Labels.alertname }}
Host: {{ .Labels.instance }}
Description: {{ .Annotations.description }}
Start time: {{ .StartsAt.Local.Format "2006-01-02 15:04:05" }}
------------------------
{{ end }}{{ end }}



{{ define "ding.link.title" }}{{ template "__subject" . }}{{ end }}
{{ define "ding.link.content" }}#### \[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}\] **[{{ index .GroupLabels "alertname" }}]({{ template "__alertmanagerURL" . }})**
{{ if gt (len .Alerts.Firing) 0 -}}
![alert icon](https://gimg2.baidu.com/image_search/src=http%3A%2F%2Fsafe-img.xhscdn.com%2Fbw1%2F0b4c39ef-bd3c-47cf-962f-83d824fa48cf%3FimageView2%2F2%2Fw%2F1080%2Fformat%2Fjpg&refer=http%3A%2F%2Fsafe-img.xhscdn.com&app=2002&size=f9999,10000&q=a80&n=0&g=0n&fmt=auto?sec=1690597215&t=d72bb6712215b52124bd9e1d3f1dbd3a)



 

**====Fault detected====**

{{ template "__text_alert_list" .Alerts.Firing }}
{{- end }}
{{ if gt (len .Alerts.Resolved) 0 -}}
Resolved alerts:
{{ template "__text_resolve_list" .Alerts.Resolved }}
{{- end }}
{{- end }}

 

 

 

8. Write the docker-compose.yml that starts the services

vim docker-compose.yml
version: '3.2'
services:
  prometheus:
    image: prom/prometheus
    restart: "always"
    ports:
      - 19090:9090
    container_name: "prometheus"
    volumes:
      - "./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml"
      - "./rules:/etc/prometheus/rules"
      - "./prometheus/data:/prometheus"
      - "/etc/localtime:/etc/localtime:ro"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'          # config file path; matches the volume mounted above
      - '--storage.tsdb.path=/prometheus'                     # data path; matches the volume mounted above

# alerting
  alertmanager:
    image: prom/alertmanager:latest
    restart: "always"
    ports:
      - 19093:9093
    container_name: "alertmanager"
    volumes:
      - "./alert/alertmanager.yml:/etc/alertmanager/alertmanager.yml"
      - "/etc/localtime:/etc/localtime:ro"

# DingTalk plugin
  webhook:
    image: timonwong/prometheus-webhook-dingtalk:v0.3.0
    restart: "always"
    ports:
      - 19092:8060
    container_name: "webhook"           # the access_token below selects the DingTalk robot
    volumes:
      - "./webhook/dingding.tmpl:/root/dingding.tmpl"   # template file
      - "/etc/localtime:/etc/localtime:ro"
    command:
      - '--ding.profile=webhook1=https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxxxxxx'
      - '--template.file=/root/dingding.tmpl'

# web UI
  grafana:
    image: grafana/grafana
    restart: "always"
    ports:
      - 19091:3000
    container_name: "grafana"
    volumes:
      - "./grafana/grafana.ini:/etc/grafana/grafana.ini"              # copy this configuration file out of the container yourself
      - "./grafana/grafana-storage:/var/lib/grafana"
      - "/etc/localtime:/etc/localtime:ro"

Start the services:

docker-compose -f docker-compose.yml up -d

9. Create node-exporter-compose.yml to start the collector services

vim  node-exporter-compose.yml
docker-compose -f node-exporter-compose.yml up -d    # run this after saving the file shown below
version: '3.2'
services:
  node-exporter:
    image: prom/node-exporter
    restart: "always"
    ports:
      - "19100:9100"
    container_name: "node-exporter"
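    volumes:
      - "/proc:/host/proc:ro"
      - "/sys:/host/sys:ro"
      - "/:/rootfs:ro"
    command:                              # point node-exporter at the host paths mounted above
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'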


  cAdvisor:
    image: google/cadvisor
    container_name: cAdvisor
    restart: always
    ports:
      - "19200:8080"
    volumes:
      - "/:/rootfs:ro"
      - "/var/run:/var/run/:rw"
      - "/sys:/sys:ro"
      - "/var/lib/docker:/var/lib/docker:ro"

  mysqld-exporter:
    image: prom/mysqld-exporter
    container_name: mysqld-exporter
    restart: always
    ports:
      - "19300:9104"
    volumes:
      - ./my.cnf:/etc/mysql/my.cnf          # this file must be mounted or the exporter will not start; its contents are shown below
    command: ["--config.my-cnf=/etc/mysql/my.cnf"]

  pushgateway-nei:
    image: prom/pushgateway
    container_name: pushgateway-nei
    restart: always
    ports:
      - 19400:9091

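For every machine you add, create one copy of this file on it. The local machine works too.

my.cnf:

[client]
# MySQL user created for mysqld-exporter to log in with
user=exporter
# password for that user; use a strong password
password=Aa123456
# database IP
host=192.168.1.1
# database port
port=3307

# SQL statements that create the user in the database:
# CREATE USER 'exporter'@'%' IDENTIFIED BY 'Aa123456';
# GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'%';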

 

10. Push node-exporter metrics to Pushgateway (other exporters work the same way)

curl -s 127.0.0.1:9100/metrics | curl -s --data-binary @- http://xxx.com:9091/metrics/job/test/instance/10.2.1.11/hostname/ip-10-2-1-11

Script:

#!/bin/bash
job_name="test"
hostname=$(hostname)
HOST_IP=$(hostname --all-ip-addresses | awk '{print $1}')

# fetch the local node-exporter metrics and push them to Pushgateway
/usr/bin/curl -s 127.0.0.1:9100/metrics | /usr/bin/curl -s --data-binary @- "http://xxxx.com:9091/metrics/job/$job_name/instance/$HOST_IP/hostname/$hostname"

Remember to add a scrape job for Pushgateway to the Prometheus configuration file.
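
A minimal sketch of such a scrape job is printed by the helper below. The function name pushgateway_job is hypothetical, and the default target 10.1.1.10:19400 assumes the Pushgateway from the compose file in this article (host port 19400 mapped to 9091). honor_labels: true is set so that the job and instance labels pushed with the metrics are kept instead of being overwritten by the scrape.

```shell
# Print a scrape job for Pushgateway, to paste under scrape_configs.
# $1 = Pushgateway address (assumed default: host port 19400 from the compose file)
pushgateway_job() {
  local target="${1:-10.1.1.10:19400}"
  cat <<EOF
  - job_name: pushgateway
    honor_labels: true    # keep the job/instance labels that were pushed
    static_configs:
      - targets: ['${target}']
EOF
}

pushgateway_job
```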

 

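11. Verify

docker ps -a   # check that the containers are running
netstat -nltp    # check that the ports are listening

Open ip:19090 in a browser (the host port mapped to Prometheus in docker-compose.yml).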
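
The same check can be scripted against the health endpoints of the web services. This is a sketch under assumptions: the function names are hypothetical, the ports are the host ports published in docker-compose.yml, and the probe itself needs curl and network access to the server.

```shell
# Print the health endpoints of the three web services for one host.
# $1 = host (default is the server used in this article)
health_urls() {
  local host="${1:-10.1.1.10}"
  printf 'http://%s:19090/-/healthy\n' "$host"   # Prometheus
  printf 'http://%s:19093/-/healthy\n' "$host"   # Alertmanager
  printf 'http://%s:19091/api/health\n' "$host"  # Grafana
}

# Probe each endpoint and print its HTTP status code.
# Nothing is requested until this function is called.
check_health() {
  health_urls "$1" | while read -r url; do
    printf '%s -> %s\n' "$url" "$(curl -s -o /dev/null -w '%{http_code}' "$url")"
  done
}

health_urls
```

Calling check_health with no argument probes 10.1.1.10; a status of 200 on every line means all three services answer.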

[Figure: Prometheus displaying local time]

 

Manual alert test (replace the token with your own):

curl -H "Content-Type: application/json" -d '{"msgtype":"text","text":{"content":"prometheus alert test"}}' 'https://oapi.dingtalk.com/robot/send?access_token=***'
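
The curl command above only exercises the DingTalk robot. To test the whole chain (Alertmanager, webhook, DingTalk), a test alert can be posted to Alertmanager's v2 API. The sketch below is an illustration: the function names, the alert name manual-test and the instance test-host are hypothetical, and the default URL assumes the host port 19093 published in docker-compose.yml.

```shell
# Build a one-alert JSON payload for Alertmanager's /api/v2/alerts endpoint.
# $1 = alert name, $2 = instance label (both hypothetical test values)
test_alert_payload() {
  local name="${1:-manual-test}" instance="${2:-test-host}"
  printf '[{"labels":{"alertname":"%s","severity":"info","instance":"%s"},"annotations":{"summary":"manual test alert","description":"sent with curl"}}]' "$name" "$instance"
}

# POST the payload to Alertmanager; nothing is sent until this is called.
# $1 = Alertmanager base URL (assumed default below)
send_test_alert() {
  test_alert_payload | curl -s -H 'Content-Type: application/json' \
    --data-binary @- "${1:-http://10.1.1.10:19093}/api/v2/alerts"
}

test_alert_payload; echo
```

After calling send_test_alert, the DingTalk message should arrive once the group_wait configured in alertmanager.yml has elapsed.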

 

12. Configure Grafana at ip:19091 (the host port mapped to 3000 in docker-compose.yml); the default username/password is admin/admin

Result (where to download dashboards is listed below; dashboard IDs 16314 and 1860 can be used)

 

 

 

 

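# download dashboard templates from the official site

Dashboards: https://grafana.com/grafana/dashboards

That completes the deployment. Thanks for reading; please credit this article if you repost it.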

 

posted @ 2021-06-22 18:36  mrdongdong