Prometheus+Grafana+Alertmanager DingTalk Alerting

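I. Introduction to Prometheus

Prometheus is an open-source toolkit for system monitoring and alerting, originally released by SoundCloud.

Features:

A multi-dimensional data model (time series identified by a metric name and key/value pairs)
A flexible query language (PromQL)
No dependence on distributed storage
Time series are collected with a pull model over HTTP
Pushing time series is supported through an intermediary gateway
Monitoring targets are found through service discovery or static configuration
Multiple modes of graphing and dashboarding are supported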

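Components:

Prometheus: the main program and server side. It scrapes and stores data and exposes an API for external queries; it is mainly responsible for storage, scraping, aggregation, and querying.
Alertmanager: a separate program mainly responsible for alerting.
Pushgateway: a separate program that accepts metrics pushed by clients; the main program then scrapes them at the configured interval.
*_exporter: comparable to the agent on a monitored host in traditional monitoring, except that it never pushes data to the server. It waits for the server to collect data on a schedule, which is the pull model (the server does the active part).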
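
To make the push path concrete, here is a minimal sketch of how a short-lived job could hand one sample to a Pushgateway, which Prometheus then scrapes like any other target. The gateway address pushgateway.example.org:9091, the job name backup, and the metric name backup_duration_seconds are assumptions for illustration; they do not appear elsewhere in this guide.

```shell
# Build one sample in the Prometheus text exposition format: "<name> <value>"
format_sample() {
  printf '%s %s\n' "$1" "$2"
}

# Push one sample to a Pushgateway; Prometheus later scrapes the gateway itself.
# Arguments: <gateway host:port> <job name> <metric name> <value>
push_sample() {
  gateway="$1"; job="$2"; name="$3"; value="$4"
  format_sample "$name" "$value" |
    curl -s --data-binary @- "http://${gateway}/metrics/job/${job}"
}

# Example (not run here, the address is hypothetical):
# push_sample pushgateway.example.org:9091 backup backup_duration_seconds 42
```

Exporters need nothing like this: they only serve their current values over HTTP and wait to be scraped.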

Architecture: [diagram not available]

II. Deploying Prometheus

Download from the official Prometheus site: https://prometheus.io/download/

1. Download and deploy

# Download
[root@prometheus src]# cd /usr/local/src/
[root@prometheus src]# wget https://github.com/prometheus/prometheus/releases/download/v2.3.2/prometheus-2.3.2.linux-amd64.tar.gz

# Deploy to /usr/local/
# Prometheus needs no compilation; the extracted directory already contains the config file and the binary

[root@prometheus src]# tar zxf prometheus-2.3.2.linux-amd64.tar.gz -C /usr/local/

[root@prometheus src]# cd /usr/local/
# The directory name must match the downloaded version (2.3.2)
[root@prometheus local]# mv prometheus-2.3.2.linux-amd64/ prometheus/
# Verify
[root@prometheus local]# cd prometheus/
[root@prometheus prometheus]# ./prometheus --version

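2. Configuration file

# prometheus.yml in the extracted directory
# For a quick check, most settings keep the defaults from the shipped file
[root@prometheus prometheus]# vim prometheus.yml
# Global settings
global:
  scrape_interval:     15s # How often to scrape (pull) targets; the default is 1m
  evaluation_interval: 15s # How often to evaluate rules; the default is 1m
  # scrape_timeout is set to the global default (10s).

# Alertmanager settings; not used yet, left at the defaults
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Rule files, loaded and evaluated periodically at evaluation_interval; not used yet, left at the defaults
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# Scrape (pull) settings, i.e. the monitored targets
# By default only the Prometheus host itself is monitored
scrape_configs:
  # The job name is attached as the label job=<job_name> to every time series scraped from this config.
  # A job is a logical group of targets rather than one specific host; a single host can expose several targets.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    # Overrides the global scrape interval, from 15 seconds to 5 seconds.
    scrape_interval: 5s

    # Targets are listed statically; service discovery is not covered here
    static_configs:
      - targets: ['localhost:9090']
        # (optional) An extra label that identifies the monitored host
        labels:
          instance: prometheus

  - job_name: 'linux'
    scrape_interval: 10s
    static_configs:
      # The default port exposed by node_exporter
      - targets: ['172.20.1.212:9100','192.168.233.131:9100']
        # These labels apply to every target in the list above, so both hosts share instance=node1;
        # give each host its own "- targets:" entry if they need distinct instance labels
        labels:
          instance: node1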

3. Create a user

# Add a user; the service will run under this account later
[root@prometheus prometheus]# groupadd prometheus
[root@prometheus prometheus]# useradd -g prometheus -s /sbin/nologin prometheus

# Set ownership
[root@prometheus prometheus]# cd ~
[root@prometheus ~]# chown -R prometheus:prometheus /usr/local/prometheus/

# Create the Prometheus data directory
[root@prometheus ~]# mkdir -p /var/lib/prometheus
[root@prometheus ~]# chown -R prometheus:prometheus /var/lib/prometheus/

4. Start on boot

[root@prometheus ~]# touch /usr/lib/systemd/system/prometheus.service 
[root@prometheus ~]# chown prometheus:prometheus /usr/lib/systemd/system/prometheus.service

[root@prometheus ~]# vim /usr/lib/systemd/system/prometheus.service
[Unit]
Description=Prometheus
Documentation=https://prometheus.io/
After=network.target

[Service]
# With Type=notify the service keeps restarting
Type=simple
User=prometheus
# --storage.tsdb.path is optional; by default the data directory is ./data under the working directory
ExecStart=/usr/local/prometheus/prometheus --config.file=/usr/local/prometheus/prometheus.yml --storage.tsdb.path=/var/lib/prometheus
Restart=on-failure

[Install]
WantedBy=multi-user.target

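# Reload systemd so it picks up the new unit file, then enable start on boot and start the service
# (unit names are case-sensitive, so use lowercase "prometheus")
[root@prometheus ~]# systemctl daemon-reload
[root@prometheus ~]# systemctl enable prometheus
[root@prometheus ~]# systemctl start prometheus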

5. Start and verify

1) Check the service status

[root@prometheus ~]# systemctl status prometheus

[root@prometheus ~]# netstat -tunlp | grep 9090

2) Web UI

Prometheus ships with a simple UI at http://localhost:9090

The Status menu contains Configuration, Rules, Targets, and other pages.

Status --> Configuration shows the loaded prometheus.yml.

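III. Deploying node_exporter

node_exporter collects system-level data from a machine. This guide uses the exporter provided by the Prometheus project; besides node_exporter, the project also provides exporters for consul, memcached, haproxy, mysqld, and others. See the official site for details, or download from GitHub (exporters missing from the official site, such as the Windows exporter, can often be found there).

The service is deployed here on the monitored node (node1).

1. Download and deploy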

# Download
[root@node1 ~]# cd /usr/local/src/
[root@node1 src]# wget https://github.com/prometheus/node_exporter/releases/download/v0.16.0/node_exporter-0.16.0.linux-amd64.tar.gz

# Deploy
[root@node1 src]# tar -zxvf node_exporter-0.16.0.linux-amd64.tar.gz -C /usr/local/
[root@node1 src]# cd /usr/local/
[root@node1 local]# mv node_exporter-0.16.0.linux-amd64/ node_exporter/

2. Create a user

[root@node1 ~]# groupadd prometheus
[root@node1 ~]# useradd -g prometheus -s /sbin/nologin prometheus
[root@node1 ~]# chown -R prometheus:prometheus /usr/local/node_exporter/

3. Start on boot

[root@node1 ~]# vim /usr/lib/systemd/system/node_exporter.service
[Unit]
Description=node_exporter
Documentation=https://prometheus.io/
After=network.target

[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/node_exporter/node_exporter
Restart=on-failure

[Install]
WantedBy=multi-user.target

[root@node1 ~]# systemctl enable node_exporter
[root@node1 ~]# systemctl start node_exporter

4. Verify

Open the Prometheus UI and confirm, for example under Status --> Targets, that host node1 is now being monitored.
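
The same check can be done from the command line. The sketch below assumes the Prometheus server from this guide is listening on localhost:9090 and uses its /api/v1/targets endpoint. summarize_health and check_targets are hypothetical helper names, and the JSON is picked apart with grep rather than a proper JSON parser, so treat it as a quick sanity check only.

```shell
# Count targets per health state from the JSON of /api/v1/targets (read on stdin).
# Prints one line per state, for example "up=2".
summarize_health() {
  grep -o '"health":"[a-z]*"' | sort | uniq -c |
    awk '{ gsub(/"health":|"/, "", $2); print $2 "=" $1 }'
}

# Fetch the target list from the Prometheus server (assumed at localhost:9090) and summarize it.
check_targets() {
  curl -s http://localhost:9090/api/v1/targets | summarize_health
}

# Example (not run here, needs a running server):
# check_targets
```

Any state other than up is worth a look on the Targets page, which also shows the scrape error.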

5. Graphing

Open http://192.168.233.131:9100/metrics to see exactly which data can be scraped from the exporter.
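
The /metrics page is plain text: one sample per line, plus comment lines starting with #. A small helper like the following (filter_metrics is a hypothetical name) can narrow the output down to one family of metrics when piping it from curl.

```shell
# Print the samples whose metric name starts with the given prefix,
# skipping the "# HELP" and "# TYPE" comment lines.
filter_metrics() {
  awk -v prefix="$1" '!/^#/ && index($1, prefix) == 1 { print }'
}

# Example (not run here, needs the exporter from this section):
# curl -s http://192.168.233.131:9100/metrics | filter_metrics node_load
```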

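In the Prometheus UI, type the name of any metric the exporter exposes into the expression box, click "Execute", and open the "Graph" tab to see the scraped data plotted. The time range and unit can be adjusted there.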

IV. Deploying Grafana

Grafana is deployed on the same server node as Prometheus.

1. Download and install

# Download
[root@prometheus ~]# cd /usr/local/src/
[root@prometheus src]# wget https://s3-us-west-2.amazonaws.com/grafana-releases/release/grafana-5.2.2-1.x86_64.rpm

# Install
[root@prometheus src]# yum localinstall grafana-5.2.2-1.x86_64.rpm

2. Configuration file

The configuration file is /etc/grafana/grafana.ini; the defaults are fine for now.

3. Start on boot

[root@prometheus src]# systemctl enable grafana-server
[root@prometheus src]# systemctl start grafana-server

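5. Add a data source

1) Log in

Open http://localhost:3000; the default username/password is admin/admin.

2) Add the data source

From the home page after logging in, click "Configuration" > "Data Sources" to open the add-data-source page, and configure it as follows: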

Name: prometheus

Type: prometheus

URL: http://localhost:9090/

Access: Server

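Uncheck Default, leave everything else at its default, and click "Add".

On the "Dashboards" tab, import the bundled templates.

6. Import a dashboard

Download the dashboard you want from the Grafana website, for example https://grafana.com/dashboards/1860

Upload: upload the JSON file you downloaded.

Grafana.com Dashboard: enter the dashboard's link on the Grafana website (for example https://grafana.com/dashboards/1860).

Either download the file and upload it, or skip the download and paste the link.

7. View the dashboards

Grafana home --> Dashboard --> Home: the Home drop-down lists the two dashboards added so far, "Prometheus Stats" and "Node Exporter Full"; pick either one.

Additional notes

If the Grafana website doesn't have the dashboard you want, look on GitHub.

Most dashboards don't work out of the box: panels render nothing and show "no data", or they show something other than what you want (for example, disk space used when you wanted disk space free). This can be fixed by editing the query expression under the panel's Metrics tab.

Select a panel that isn't working and click Edit.

Use Add Query to add a query for the value you want to monitor.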
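
As one illustration of such a query change, here is a sketch that builds a PromQL expression for free disk space as a percentage. It assumes the metric names exposed by node_exporter 0.16.0 (node_filesystem_avail_bytes and node_filesystem_size_bytes). As far as I know, dashboards written for earlier exporter versions use names without the _bytes suffix, which is one likely cause of "no data" panels. disk_free_percent_expr and query_disk_free are hypothetical helper names.

```shell
# Build a PromQL expression: free space on a mount point as a percentage of its size.
disk_free_percent_expr() {
  printf '100 * node_filesystem_avail_bytes{mountpoint="%s"} / node_filesystem_size_bytes{mountpoint="%s"}\n' "$1" "$1"
}

# Try the expression as an instant query against the Prometheus HTTP API
# (assumed at localhost:9090) before pasting it into a Grafana panel.
query_disk_free() {
  curl -s -G http://localhost:9090/api/v1/query \
    --data-urlencode "query=$(disk_free_percent_expr "$1")"
}

# Example (not run here, needs a running server):
# query_disk_free /
```

If the query returns an empty result, check the exact metric names on the exporter's /metrics page first.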

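V. Deploying Alertmanager with DingTalk alerts

Grafana has alerting too, but in practice it was awkward: Grafana alerts can't use template variables and the alert rules are tedious to set up. After comparing the options again, I chose Alertmanager with DingTalk notifications. Alertmanager isn't limited to DingTalk; it can also notify through WeChat, email, and more.

1. Download and install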

[root@localhost src]# wget https://github.com/prometheus/alertmanager/releases/download/v0.15.2/alertmanager-0.15.2.linux-amd64.tar.gz
[root@localhost src]# tar zxf alertmanager-0.15.2.linux-amd64.tar.gz

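2. Configuration file

Alertmanager has no built-in DingTalk support; DingTalk is integrated through Alertmanager's generic webhook receiver. DingTalk is strict about message format, so a plugin is used later to convert the payload.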

cat alertmanager.yml
global:
  resolve_timeout: 5m
route:
  receiver: webhook
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  group_by: [alertname]
  routes:
  - receiver: webhook
    group_wait: 10s
    match:
      team: node
receivers:
- name: webhook
  webhook_configs:
  - url: http://localhost:8060/dingtalk/ops_dingding/send 
    send_resolved: true

3. Start Alertmanager

nohup ./alertmanager --config.file=alertmanager.yml > alertmanager.log 2>&1 &
# Check the port
netstat -anpt | grep 9093

4. Alert rules

Monitor whether hosts are up

cd /usr/local/prometheus
cat rules.yml
groups:
    - name: test-rule
      rules:
      - alert: HostStatus
        expr: up == 0
        for: 2m
        labels:
          status: warning
        annotations:
          summary: "{{$labels.instance}}: server is down"
          description: "{{$labels.instance}}: server is down"

5. Update the Prometheus configuration

Edit the alerting and rule_files sections.

rule_files can list more than one rule file.

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets: ["localhost:9093"]
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules.yml"
  # - "second_rules.yml"

Restart Prometheus so the changes take effect.
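
Before restarting, it may be worth validating the edited file. The sketch below assumes the promtool binary that ships in the same tarball as prometheus sits in /usr/local/prometheus, and that Prometheus runs under a systemd unit named prometheus. check_and_restart is a hypothetical helper name.

```shell
# Validate the config (promtool also checks the rule files it references),
# and restart Prometheus only if the check passes.
# PROMTOOL can override the assumed promtool location.
check_and_restart() {
  promtool_bin="${PROMTOOL:-/usr/local/prometheus/promtool}"
  config="${1:-/usr/local/prometheus/prometheus.yml}"
  if [ ! -x "$promtool_bin" ]; then
    echo "promtool not found at $promtool_bin" >&2
    return 127
  fi
  # promtool exits non-zero on an invalid file, so the restart is skipped in that case
  "$promtool_bin" check config "$config" && systemctl restart prometheus
}

# Example (not run here):
# check_and_restart /usr/local/prometheus/prometheus.yml
```

After the restart, the rule should appear under Status --> Rules in the web UI.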

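6. Connect DingTalk to the Prometheus Alertmanager webhook

Reference: http://theo.im/blog/2017/10/16/release-prometheus-alertmanager-webhook-for-dingtalk/

Plugin download: https://github.com/timonwong/prometheus-webhook-dingtalk

Install

Install a Go environment first.

Consider setting the hostname to the host's IP address (or to a domain name) so that the URLs included in alerts are usable.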

cd /root/go/src/github.com/timonwong/

git clone https://github.com/timonwong/prometheus-webhook-dingtalk.git

cd prometheus-webhook-dingtalk
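make   # errors during the build can be ignored
If no prometheus-webhook-dingtalk binary is produced, create a new directory, change into it, git clone the source there, and build again.

mkdir -p /usr/lib/golang/src/github.com/timonwong/

Start

See the DingTalk documentation or search online for how to add a custom robot to a group.

--ding.profile maps a profile name to the DingTalk robot's webhook URL; the profile name (ops_dingding here) is the one used in the Alertmanager webhook URL path.

nohup ./prometheus-webhook-dingtalk --ding.profile="ops_dingding=https://oapi.dingtalk.com/robot/send?access_token=xxx" > dingding.log 2>&1 &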
netstat -anpt | grep 8060
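
To check the bridge on its own, without waiting for a real alert, one option is to post a hand-written payload to it. The sketch below prints a minimal payload modeled on Alertmanager's webhook format (version 4). All field values are made-up samples, sample_payload is a hypothetical helper, and whether the plugin accepts such a trimmed-down payload is an assumption.

```shell
# Print a minimal Alertmanager-style webhook payload for one firing alert.
# Argument: the instance label to put into the sample alert.
sample_payload() {
  instance="$1"
  cat <<EOF
{
  "version": "4",
  "status": "firing",
  "receiver": "webhook",
  "groupLabels": {"alertname": "SampleAlert"},
  "commonLabels": {"alertname": "SampleAlert", "instance": "${instance}"},
  "commonAnnotations": {"summary": "${instance}: test message"},
  "externalURL": "http://localhost:9093",
  "alerts": [
    {
      "status": "firing",
      "labels": {"alertname": "SampleAlert", "instance": "${instance}"},
      "annotations": {"summary": "${instance}: test message"},
      "startsAt": "2018-08-28T07:00:00Z",
      "endsAt": "0001-01-01T00:00:00Z",
      "generatorURL": "http://localhost:9090/graph"
    }
  ]
}
EOF
}

# Example (not run here, needs the bridge listening on port 8060):
# sample_payload 192.168.233.131:9100 |
#   curl -s -H 'Content-Type: application/json' --data-binary @- \
#     http://localhost:8060/dingtalk/ops_dingding/send
```

If nothing shows up in the DingTalk group, dingding.log is the first place to look.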

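7. Test

Stop the exporter on a monitored host, or shut the host down. Once up == 0 has held for the rule's for: 2m, the alert fires and a message arrives in DingTalk.

Start the exporter again and the alert resolves; because send_resolved is true, a recovery message is sent as well.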
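
How long should you wait for the first message? A rough estimate, assuming the delays simply add up: the scrape interval of the linux job (10s), the rule evaluation interval (15s), the rule's for duration (2m), and Alertmanager's group_wait. The rule in this guide carries the label status: warning rather than team: node, so the child route does not match and the top-level group_wait of 30s applies. alert_delay_seconds is a hypothetical helper.

```shell
# Rough upper bound, in seconds, between a target going down and the first notification.
# Assumption: the individual delays simply add up; real timing depends on how the cycles align.
# Arguments: <scrape_interval> <evaluation_interval> <for duration> <group_wait>, all in seconds
alert_delay_seconds() {
  scrape_interval="$1"; evaluation_interval="$2"; for_duration="$3"; group_wait="$4"
  echo $(( scrape_interval + evaluation_interval + for_duration + group_wait ))
}

# Values from this guide: scrape 10s, evaluation 15s, for 2m, group_wait 30s
alert_delay_seconds 10 15 120 30   # prints 175
```

So with these settings, allow roughly three minutes before concluding that something is misconfigured.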

posted @ 2018-08-28 15:26 庞优秀
