Using Prometheus with Grafana

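Prometheus at a glance: the core is a single standalone binary; it uses a pull model; it ships with a built-in time series database (TSDB); it has a powerful query language, PromQL; and it supports visualization and is open to integration.

It uses a dimensional storage model and behaves like an OLAP system.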

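1. Storage and computation layer

> Prometheus Server: contains the storage engine and the computation engine

> Retrieval: the scraping component; it actively pulls data from Pushgateway or exporters

> Service discovery: dynamically discovers the targets to monitor

> TSDB: the core data store and query engine

> HTTP server: exposes the HTTP service to the outside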

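2. Collection layer

The collection layer covers two kinds of jobs: short-lived jobs and long-lived jobs.

> Short-lived jobs: push their metrics to Pushgateway through its API when they exit

> Long-lived jobs: the Retrieval component pulls data directly from the job or an exporter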
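
The push made by a short-lived job can be sketched roughly as follows. This is only an illustration: it assumes a Pushgateway reachable at localhost:9091 (its default port), and the job name and metric names are made up for the example. It renders the metrics in the plain text exposition format and sends them with an HTTP PUT to the Pushgateway path /metrics/job/<job>.

```python
import urllib.parse
import urllib.request


def build_payload(metrics):
    """Render {name: value} as text exposition lines, one sample per line."""
    lines = [f"{name} {value}" for name, value in sorted(metrics.items())]
    return "\n".join(lines) + "\n"  # the body must end with a newline


def push_url(gateway, job):
    """Build the Pushgateway URL for one job group (gateway is host:port)."""
    return f"http://{gateway}/metrics/job/{urllib.parse.quote(job, safe='')}"


def push(gateway, job, metrics, timeout=5):
    """PUT the metrics; this replaces whatever the job group held before."""
    request = urllib.request.Request(
        push_url(gateway, job),
        data=build_payload(metrics).encode("utf-8"),
        method="PUT",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    # Hypothetical metrics a batch job might report just before exiting.
    sample = {"batch_duration_seconds": 12.5, "batch_last_success": 1}
    print(push_url("localhost:9091", "nightly backup"))
    print(build_payload(sample), end="")
```

A real job would call push("localhost:9091", "nightly backup", sample) as its last step; Prometheus then scrapes the Pushgateway like any other target.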

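3. Application layer

The application layer has two main parts: Alertmanager and data visualization.

> Alertmanager: can hand alerts to PagerDuty (a paid alerting service) and notify by SMS, phone call, or email

> Data visualization: the Prometheus built-in web UI, Grafana, or other clients built on the API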

I. Hands-on: installing Prometheus and Grafana with Docker

1. Prepare a uniform environment

Install Docker, then disable the firewall and SELinux.

2. Pull the images

docker pull prom/prometheus
docker pull prom/alertmanager
docker pull grafana/grafana

3. Start the components

Start prometheus-webhook-dingtalk
docker run -d -p 8060:8060 -v /data/prom/config.yml:/etc/prometheus-webhook-dingtalk/config.yml --name alertdingtalk timonwong/prometheus-webhook-dingtalk

Start Alertmanager
docker run -d -p 9093:9093 -p 9094:9094 -v /data/prom/alertmanager.yml:/etc/alertmanager/alertmanager.yml --name alertmanager prom/alertmanager

alertmanager.yml

global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.exmail.qq.com:465'   # SMTP relay; sending over SSL usually uses port 465
  smtp_from: 'test@qq.com'                   # sender address
  smtp_auth_username: 'test@qq.com'          # mailbox account
  smtp_auth_password: 'passwd'               # mailbox password or authorization code
  smtp_require_tls: false
route:
  receiver: 'default-receiver'    # alerts that match no child route stay at the root and go to "default-receiver"
  group_wait: 30s                 # how long to wait before the first notification for a new group (default 30s)
  group_interval: 5m              # how long to wait before notifying about new alerts in an already-notified group; usually 5m or more
  repeat_interval: 1h             # how long to wait before re-sending a notification that was already sent
  group_by: [alertname]           # labels used to group alerts
  routes:
  - receiver: 'bigdata-pager'     # matches every alert labelled team=bigdata; add that label under labels in alert-rules.yml
    group_wait: 10s
    match:
      team: bigdata
receivers:                        # who receives the notifications
- name: 'default-receiver'
  email_configs:
  - to: 'xx@qq.com,xx@qq.com'

- name: 'bigdata-pager'
  email_configs:
  - to: 'xxx@qq.com,xx@qq.com'
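
How the route tree above picks a receiver can be illustrated with a small sketch. It is a deliberately simplified model, not Alertmanager's actual implementation: it only handles one level of child routes with exact-match `match` labels, and ignores regex matchers, nested routes, and the `continue` option. The function name and the dictionary layout are assumptions made for the example.

```python
def pick_receiver(alert_labels, route):
    """Return the receiver of the first child route whose match labels all
    equal the alert's labels; fall back to the root receiver otherwise."""
    for child in route.get("routes", []):
        wanted = child.get("match", {})
        if all(alert_labels.get(name) == value for name, value in wanted.items()):
            return child["receiver"]
    return route["receiver"]


# The same tree as in alertmanager.yml, written as a plain dictionary.
ROUTE = {
    "receiver": "default-receiver",
    "routes": [
        {"receiver": "bigdata-pager", "match": {"team": "bigdata"}},
    ],
}

if __name__ == "__main__":
    print(pick_receiver({"alertname": "disk", "team": "bigdata"}, ROUTE))
    print(pick_receiver({"alertname": "disk"}, ROUTE))
```

An alert carrying team=bigdata lands on bigdata-pager; everything else stays with default-receiver.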

prometheus.yml

global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
alerting:       # address of the Alertmanager instance
  alertmanagers:
  - static_configs:
    - targets: [ '192.168.188.2:9093']

rule_files:  # alert rule files
  - "*rules.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['192.168.188.2:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['192.168.188.3:9100']

  - job_name: 'alertmanager'
    static_configs:
      - targets: [ '192.168.188.2:9093']

alert-rules.yml

groups:
- name: Host status monitoring alerts
  rules:
  - alert: Host status
    expr: up *on(instance)group_left(nodename)(node_uname_info) == 0
    for: 5m
    labels:
      level: warning
    annotations:
      summary: "{{$labels.instance}}: server down"
      description: "{{$labels.instance}}({{$labels.nodename}}): server unreachable for more than 5 minutes"
  - alert: Host CPU usage
    expr: 100-avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by(instance) *100  *on(instance)group_left(nodename)(node_uname_info) > 90
    for: 3m
    labels:
      level: warning
    annotations:
      summary: "{{ $labels.instance }}: CPU usage too high"
      description: "{{ $labels.instance }}({{$labels.nodename}}): CPU usage above 90% (current: {{ $value }}%)"
  - alert: Host memory usage
    expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes* 100 *on(instance)group_left(nodename)(node_uname_info) > 90
    for: 3m
    labels:
      level: warning
    annotations:
      summary: "{{$labels.instance}}: High Memory usage detected"
      description: "{{$labels.instance}}({{$labels.nodename}}): memory usage above 90% (current: {{ $value }}%)"
  - alert: Host disk usage
    expr: 100-(node_filesystem_free_bytes{fstype=~"ext4|xfs"}/node_filesystem_size_bytes {fstype=~"ext4|xfs"})*100 *on(instance)group_left(nodename)(node_uname_info)  > 85
    for: 3m
    labels:
      level: warning
    annotations:
      summary: "{{ $labels.instance }}: disk usage too high"
      description: "{{ $labels.instance }}({{$labels.nodename}}): disk usage above 85% (current: {{$value}}%)"
  - alert: Disk IO performance
    expr: 100-(avg(irate(node_disk_io_time_seconds_total[1m])) by(instance) *100) *on(instance)group_left(nodename)(node_uname_info)   < 60
    for: 3m
    labels:
      level: warning
    annotations:
      summary: "{{ $labels.instance }}: disk IO utilization too high"
      description: "{{ $labels.instance }}({{$labels.nodename}}): disk IO idle share below 60% (current idle share: {{$value}}%)"
  - alert: TCP sessions
    expr: node_netstat_Tcp_CurrEstab *on(instance)group_left(nodename)(node_uname_info)  > 1500
    for: 3m
    labels:
      level: warning
    annotations:
      summary: "{{ $labels.instance }}: too many TCP_ESTABLISHED connections"
      description: "{{ $labels.instance }}({{$labels.nodename}}): more than 1500 TCP_ESTABLISHED connections (current: {{$value}})"
  - alert: Inbound network
    expr: ((sum(rate (node_network_receive_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) by (instance)) / 100) *on(instance)group_left(nodename)(node_uname_info)   > 204800
    for: 3m
    labels:
      level: warning
    annotations:
      summary: "{{ $labels.instance }}: inbound network bandwidth too high"
      description: "{{ $labels.instance }}({{$labels.nodename}}): inbound network bandwidth above 200M for 3 minutes (current: {{$value}})"
  - alert: Outbound network
    expr: ((sum(rate (node_network_transmit_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) by (instance)) / 100 ) *on(instance)group_left(nodename)(node_uname_info) > 204800
    for: 3m
    labels:
      team: bigdata
      level: warning
    annotations:
      summary: "{{ $labels.instance }}: outbound network bandwidth too high"
      description: "{{ $labels.instance }}({{$labels.nodename}}): outbound network bandwidth above 200M for 3 minutes (current: {{$value}})"
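
The interplay between a threshold in `expr` and the `for` duration in the rules above can be walked through with a small sketch. It is an illustration only: the memory readings are invented, the 60-second evaluation step is an assumption for the example (the prometheus.yml above evaluates every 15s), and the state machine is a simplified model of how a rule moves from pending to firing.

```python
def memory_used_percent(total, free, buffers, cached):
    """Same arithmetic as the memory rule: used share of MemTotal, in percent."""
    return (total - (free + buffers + cached)) / total * 100


def alert_states(breached, interval_s, for_s):
    """Classify each evaluation as inactive, pending, or firing.

    `breached` holds one boolean per rule evaluation, taken every
    `interval_s` seconds; `for_s` is the rule's `for` duration in seconds.
    """
    states = []
    active_since = None  # time at which the condition first became true
    for index, is_breached in enumerate(breached):
        now = index * interval_s
        if not is_breached:
            active_since = None  # a single miss resets the pending timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_s else "pending")
    return states


if __name__ == "__main__":
    # Hypothetical (total, free, buffers, cached) readings in MiB.
    readings = [(1024, 32, 16, 16)] * 4
    over_threshold = [memory_used_percent(*r) > 90 for r in readings]
    print(alert_states(over_threshold, interval_s=60, for_s=180))
```

With `for: 3m`, the condition has to hold at every evaluation for three minutes before a notification goes to Alertmanager; one evaluation below the threshold starts the wait over.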

Start Prometheus
docker run -d -p 9090:9090 \
-v /data/prom/prometheus.yml:/etc/prometheus/prometheus.yml \
-v /data/prom/alert-rules.yml:/etc/prometheus/alert-rules.yml \
-v /data/prom/data:/prometheus --name prometheus prom/prometheus:latest

Start Grafana
docker run -d -p 3000:3000 -v /data/prom/grafana:/var/lib/grafana --name=grafana grafana/grafana:latest

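Start node-exporter  # node-exporter has to see the real host hardware, so running it in Docker is not recommended; the binary install further down is preferred. If you use Docker anyway, run only one of the two commands (they share a container name and port)
docker run -d -p 9100:9100 --name node-exporter prom/node-exporter:latest
docker run -d --net=host -v "/proc:/host/proc:ro" -v "/sys:/host/sys:ro" -v "/:/rootfs:ro" --name node-exporter prom/node-exporter:latest --path.procfs=/host/proc --path.sysfs=/host/sys --path.rootfs=/rootfs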

Start Consul  # discovers and registers hosts automatically; see this other post: https://www.cnblogs.com/xq0422/p/17470150.html
docker run -d -p 8500:8500 \
--name=consul -v /data/consul/data:/consul/data \
-v /data/consul/config:/consul/config consul

Download the client from: https://github.com/prometheus/node_exporter/releases

Pick the linux-amd64 build, then download and unpack it:

# download
wget https://github.com/prometheus/node_exporter/releases/download/v1.5.0/node_exporter-1.5.0.linux-amd64.tar.gz
# unpack
tar -zxvf node_exporter-1.5.0.linux-amd64.tar.gz
# rename
mv node_exporter-1.5.0.linux-amd64 node_exporter

To start it:
cd node_exporter
# without keeping logs
nohup ./node_exporter >/dev/null 2>&1 &
# or, writing logs to /var/log/node_exporter.log
nohup ./node_exporter >/var/log/node_exporter.log 2>&1 &

mysqld_exporter is installed in much the same way.

For MySQL, first create a dedicated user, exporter, for monitoring the database:

mysql> CREATE USER 'exporter'@'localhost' IDENTIFIED BY 'see_teampass' WITH MAX_USER_CONNECTIONS 5;

mysql> GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'localhost';

Note: MAX_USER_CONNECTIONS caps the number of connections the exporter user can open, so that monitoring cannot overload the database. Not every MySQL/MariaDB version supports this option.

mysql> flush privileges;

Download and install mysqld_exporter: https://github.com/prometheus/mysqld_exporter/releases

tar xvf mysqld_exporter-0.14.1.linux-amd64.tar.gz -C /usr/local/

cd /usr/local/

mv mysqld_exporter-0.14.1.linux-amd64 mysqld_exporter

cd /usr/local/mysqld_exporter

Create the connection file

cat > .my.cnf <<EOF
[client]
user=exporter
password=see_teampass
EOF

Start it with systemd

cat >/usr/lib/systemd/system/mysqld_exporter.service <<EOF
[Unit]
Description=Prometheus mysqld_exporter

[Service]
ExecStart=/usr/local/mysqld_exporter/mysqld_exporter --config.my-cnf=/usr/local/mysqld_exporter/.my.cnf
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload

systemctl enable mysqld_exporter

systemctl start mysqld_exporter

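II. node_exporter above is white-box monitoring; this part covers black-box monitoring

Black-box checks answer questions such as: does an endpoint or page respond correctly, is a port healthy, and how long until a certificate expires.

1. Install and deploy blackbox exporter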

wget  https://github.com/prometheus/blackbox_exporter/releases/download/v0.23.0/blackbox_exporter-0.23.0.linux-amd64.tar.gz

tar -zxvf  blackbox_exporter-0.23.0.linux-amd64.tar.gz -C /usr/local

mv /usr/local/blackbox_exporter-0.23.0.linux-amd64   /usr/local/blackbox_exporter

2. Review the probe configuration first: cat /usr/local/blackbox_exporter/blackbox.yml. Each module names its probe type (prober).

modules:
  http_2xx:
    prober: http
  http_post_2xx:
    prober: http
    http:
      method: POST
  tcp_connect:
    prober: tcp
  pop3s_banner:
    prober: tcp
    tcp:
      query_response:
      - expect: "^+OK"
      tls: true
      tls_config:
        insecure_skip_verify: false
  grpc:
    prober: grpc
    grpc:
      tls: true
      preferred_ip_protocol: "ip4"
  grpc_plain:
    prober: grpc
    grpc:
      tls: false
      service: "service1"
  ssh_banner:
    prober: tcp
    tcp:
      query_response:
      - expect: "^SSH-2.0-"
      - send: "SSH-2.0-blackbox-ssh-check"
  irc_banner:
    prober: tcp
    tcp:
      query_response:
      - send: "NICK prober"
      - send: "USER prober prober prober :prober"
      - expect: "PING :([^ ]+)"
        send: "PONG ${1}"
      - expect: "^:[^ ]+ 001"
  icmp:
    prober: icmp
  icmp_ttl5:
    prober: icmp
    timeout: 5s
    icmp:
      ttl: 5

3. Register it as a systemd service: cat /usr/lib/systemd/system/blackbox_exporter.service

[Unit]
Description=blackbox_exporter

[Service]
User=root
Type=simple
ExecStart=/usr/local/blackbox_exporter/blackbox_exporter  --config.file=/usr/local/blackbox_exporter/blackbox.yml
Restart=on-failure

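4. Check that it is running

You can also probe baidu.com by visiting http://127.0.0.1:9115/probe?module=http_2xx&target=baidu.com.

The module parameter in the URL selects the probe to use and the target parameter names what to probe. The result comes back as metrics: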

# HELP probe_dns_lookup_time_seconds Returns the time taken for probe dns lookup in seconds
# TYPE probe_dns_lookup_time_seconds gauge
probe_dns_lookup_time_seconds 0.004366919
# HELP probe_duration_seconds Returns how long the probe took to complete in seconds
# TYPE probe_duration_seconds gauge
probe_duration_seconds 0.09053371
# HELP probe_failed_due_to_regex Indicates if probe failed due to regex
# TYPE probe_failed_due_to_regex gauge
probe_failed_due_to_regex 0
# HELP probe_http_content_length Length of http content response
# TYPE probe_http_content_length gauge
probe_http_content_length 81
# HELP probe_http_duration_seconds Duration of http request by phase, summed over all redirects
# TYPE probe_http_duration_seconds gauge
probe_http_duration_seconds{phase="connect"} 0.040772637
probe_http_duration_seconds{phase="processing"} 0.04430544
probe_http_duration_seconds{phase="resolve"} 0.004366919
probe_http_duration_seconds{phase="tls"} 0
probe_http_duration_seconds{phase="transfer"} 0.00019256
# HELP probe_http_last_modified_timestamp_seconds Returns the Last-Modified HTTP response header in unixtime
# TYPE probe_http_last_modified_timestamp_seconds gauge
probe_http_last_modified_timestamp_seconds 1.26330408e+09
# HELP probe_http_redirects The number of redirects
# TYPE probe_http_redirects gauge
probe_http_redirects 0
# HELP probe_http_ssl Indicates if SSL was used for the final redirect
# TYPE probe_http_ssl gauge
probe_http_ssl 0
# HELP probe_http_status_code Response HTTP status code
# TYPE probe_http_status_code gauge
probe_http_status_code 200
# HELP probe_http_uncompressed_body_length Length of uncompressed response body
# TYPE probe_http_uncompressed_body_length gauge
probe_http_uncompressed_body_length 81
# HELP probe_http_version Returns the version of HTTP of the probe response
# TYPE probe_http_version gauge
probe_http_version 1.1
# HELP probe_ip_addr_hash Specifies the hash of IP address. It's useful to detect if the IP address changes.
# TYPE probe_ip_addr_hash gauge
probe_ip_addr_hash 3.6694721e+08
# HELP probe_ip_protocol Specifies whether probe ip protocol is IP4 or IP6
# TYPE probe_ip_protocol gauge
probe_ip_protocol 4
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1

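From these samples you can read the site's DNS lookup time, response time, HTTP status code, and other metrics tied to access quality, which helps administrators find faults proactively.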
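
Reading such a probe response programmatically might look like the sketch below. It is a minimal parser for illustration, not a full implementation of the exposition format: it assumes one sample per line with no trailing timestamp, and the helper names are made up. The embedded sample is a shortened copy of the output shown above.

```python
SAMPLE = """\
# HELP probe_http_status_code Response HTTP status code
# TYPE probe_http_status_code gauge
probe_http_status_code 200
probe_http_duration_seconds{phase="connect"} 0.040772637
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1
"""


def parse_samples(text):
    """Map each series (name plus any labels) to its float value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def probe_ok(samples):
    """A probe counts as healthy only when probe_success is exactly 1."""
    return samples.get("probe_success") == 1.0


if __name__ == "__main__":
    parsed = parse_samples(SAMPLE)
    print(parsed["probe_http_status_code"], probe_ok(parsed))
```

In practice Prometheus does this parsing itself during a scrape; a script like this is mostly useful for a quick manual check of one target.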

5. Append the corresponding monitoring jobs to prometheus.yml

  # website monitoring
  - job_name: 'http_status'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
        - http://192.168.188.10:3928/anywhere/#/before          # Anywhere Door (任意门)
        - http://192.168.188.81:9092/index01.html               # training videos
        - http://192.168.188.10:8000/accounts/login/?next=/ # library (藏经阁)
        - http://192.168.188.21:8090/#all-updates               # confluence
       # - https://192.168.188.22/dev/dist/               # 龙华思妍
        - https://www.baidu.com                   # Baidu
        - http://192.168.188.30:6682/dist/project     # 龙华思视
        labels:
          instance: http_status
          group: web
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.188.68:9115

  # ping check
  - job_name: 'ping_status'
    metrics_path: /probe
    params:
      module: [icmp]
    static_configs:
      - targets:
        - 192.168.188.200
        labels:
          instance: 'ping_status'
          group: icmp
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.188.68:9115

  # port monitoring
  - job_name: 'port_status'
    metrics_path: /probe
    params:
      module: [tcp_connect]
    static_configs:
      - targets:
        - 192.168.188.10:3928
        - 192.168.188.22:3306
        - 192.168.188.200:8090
        labels:
          instance: 'port_status'
          group: port
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.188.68:9115
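
What the three relabel_configs steps do to each target can be traced with a small sketch. This is a hand-written imitation of the effect for illustration, not Prometheus's relabeling engine; the function names are made up, and the parameter order in the printed URL is just what this sketch produces.

```python
import urllib.parse


def relabel_for_blackbox(target, exporter_addr):
    """Apply the three relabel steps used by the probe jobs, in order."""
    labels = {"__address__": target}
    # 1. copy the original target into the ?target= URL parameter
    labels["__param_target"] = labels["__address__"]
    # 2. keep the original target visible as the instance label
    labels["instance"] = labels["__param_target"]
    # 3. point the actual scrape at the blackbox exporter
    labels["__address__"] = exporter_addr
    return labels


def probe_url(labels, module, metrics_path="/probe"):
    """Show a request equivalent to the one Prometheus ends up making."""
    query = urllib.parse.urlencode(
        {"module": module, "target": labels["__param_target"]}
    )
    return f"http://{labels['__address__']}{metrics_path}?{query}"


if __name__ == "__main__":
    result = relabel_for_blackbox("192.168.188.22:3306", "192.168.188.68:9115")
    print(result["instance"])
    print(probe_url(result, "tcp_connect"))
```

The point of the rewrite is that Prometheus always talks to the exporter at 192.168.188.68:9115, while the `instance` label on the stored series still names the thing being probed.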

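6. Grafana dashboard ID: 9965

Alert rules can watch the probe_success metric.

Whether an icmp, tcp, http, or post probe is healthy shows up in probe_success:
probe_success == 0 ## connectivity failure
probe_success == 1 ## connectivity OK
Alerting checks whether this metric equals 0 and fires when it does.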
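
The same check can be run by hand against the Prometheus HTTP API, which is handy when testing a new probe job before writing the alert rule. The sketch below only builds the instant-query URL and reads a response body; the Prometheus address is the one from the configuration above, and the JSON in the example is a made-up response in the shape the /api/v1/query endpoint returns.

```python
import json
import urllib.parse


def query_url(prometheus, expr):
    """Build an instant-query URL for the given PromQL expression."""
    return f"http://{prometheus}/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def failing_instances(body):
    """List the instance labels of every series returned by the query."""
    doc = json.loads(body)
    if doc.get("status") != "success":
        raise ValueError("query failed: " + str(doc.get("error")))
    return sorted(s["metric"].get("instance", "") for s in doc["data"]["result"])


# Illustrative response only; real values come from your own Prometheus.
EXAMPLE_BODY = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {"__name__": "probe_success", "instance": "192.168.188.200"},
                "value": [1684395180, "0"],
            }
        ],
    },
})

if __name__ == "__main__":
    print(query_url("192.168.188.2:9090", "probe_success == 0"))
    print(failing_instances(EXAMPLE_BODY))
```

Fetching the URL (with a browser, curl, or urllib.request) and passing the body to failing_instances gives the targets that would currently trip a probe_success == 0 alert.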

For Hadoop (big data) monitoring setups, see:

https://github.com/tamtran96/hadoop-jmx-exporter/tree/master/dashboards

For more Prometheus exporters, see:

https://blog.51cto.com/u_14065119/4166081

References:

https://it.cha138.com/mysql/show-99068.html

https://www.infoq.cn/article/sxextntuttxduedeagiq

https://www.prometheus.wang/exporter/install_blackbox_exporter.html

posted @ 2023-05-18 15:33 会bk的鱼
