Using Prometheus together with Grafana



 

Prometheus's core is a single binary. It follows a pull model, ships with a built-in time series database (TSDB), offers the powerful PromQL query language, and supports visualization and an open ecosystem of integrations.

It uses a multi-dimensional storage model, similar in spirit to an OLAP system for metrics.
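For example (an illustrative query only; node_cpu_seconds_total is exposed by node_exporter, which is installed later in this post), per-instance CPU usage can be computed with a PromQL expression like:

100 - avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100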

1. Storage and compute layer

> Prometheus Server, which contains the storage engine and the query/compute engine

> Retrieval, the scraping component; it actively pulls data from Pushgateway or from Exporters

> Service discovery, which dynamically discovers the targets to be monitored

> TSDB, the core storage and query layer for the data

> HTTP server, which exposes the HTTP API to the outside

2. Collection layer

The collection layer falls into two categories: short-lived jobs and long-lived jobs.

> Short-lived jobs: push their metrics to Pushgateway via its API before they exit (see the sketch after this list)

> Long-lived jobs: the Retrieval component pulls data directly from the job or from its Exporter
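A minimal sketch of a short-lived job pushing a metric to Pushgateway before exiting (assuming a Pushgateway is reachable on localhost:9091; the job name and metric are made up for illustration):

echo "backup_duration_seconds 42" | curl --data-binary @- http://localhost:9091/metrics/job/backup_job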

3. Application layer

The application layer has two main parts: AlertManager and data visualization.

> AlertManager can route alerts to PagerDuty (a paid incident-management service) and deliver notifications via SMS, phone call, or email

> Data visualization: Prometheus's built-in web UI, Grafana, or other clients built on top of the HTTP API

 

 

I. Hands-on: installing Prometheus and Grafana with Docker

1. Prepare the environment

Install Docker, and disable the firewall and SELinux.

 

2. Pull the required images

docker pull prom/prometheus
docker pull prom/alertmanager
docker pull grafana/grafana

 

3. Start the components

Start prometheus-webhook-dingtalk
docker run -d -p 8060:8060 -v /data/prom/config.yml:/etc/prometheus-webhook-dingtalk/config.yml --name alertdingtalk timonwong/prometheus-webhook-dingtalk

 

Start alertmanager
docker run -d -p 9093:9093 -p 9094:9094 -v /data/prom/alertmanager.yml:/etc/alertmanager/alertmanager.yml --name alertmanager prom/alertmanager
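Once the container is up, a quick liveness check (assuming it runs on the local host with the port mapping above):

curl http://localhost:9093/-/healthy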

 

 alertmanager.yml

global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.exmail.qq.com:465'        # SMTP server; SSL submission usually uses port 465
  smtp_from: 'test@qq.com'                        # sender address
  smtp_auth_username: 'test@qq.com'               # SMTP account
  smtp_auth_password: 'passwd'                    # mailbox password or authorization code
  smtp_require_tls: false
route:
  receiver: 'default-receiver'    # alerts that match no child route stay on the root node and go to "default-receiver"
  group_wait: 30s                 # initial wait before sending a notification for a new group, default 30s
  group_interval: 5m              # wait before sending notifications about new alerts added to a group, usually 5m or more
  repeat_interval: 1h             # how long to wait before re-sending a notification that was already sent
  group_by: [alertname]           # label(s) used to group alerts

  routes:
  - receiver: 'bigdata-pager'     # alerts carrying the label team=bigdata match this child route; add the label under labels in alert-rules.yml
    group_wait: 10s
    match:
      team: bigdata
receivers:                        # define the receivers that alerts are sent to
- name: 'default-receiver'
  email_configs:
  - to: 'xx@qq.com,xx@qq.com'

- name: 'bigdata-pager'
  email_configs:
  - to: 'xxx@qq.com,xx@qq.com'
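After editing the file, it can be validated with amtool inside the container before restarting (assuming the container name alertmanager used above):

docker exec alertmanager amtool check-config /etc/alertmanager/alertmanager.yml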

 

 

prometheus.yml 

global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
alerting:       # address of the Alertmanager that receives alerts
  alertmanagers:
  - static_configs:
    - targets: [ '192.168.188.2:9093']
 
rule_files:  # alerting rule files
  - "*rules.yml"
 
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['192.168.188.2:9090']
       
  - job_name: 'node'
    static_configs:
      - targets: ['192.168.188.3:9100']
 
  - job_name: 'alertmanager'
    static_configs:
      - targets: [ '192.168.188.2:9093']

  

alert-rules.yml

groups:
- name: host-status-alerts
  rules:
  - alert: HostDown
    expr: up * on(instance) group_left(nodename) (node_uname_info) == 0
    for: 5m
    labels:
      level: warning
    annotations:
      summary: "{{ $labels.instance }}: server is down"
      description: "{{ $labels.instance }}({{ $labels.nodename }}): the server has been unreachable for more than 5 minutes"
  - alert: HostCpuUsage
    expr: 100 - avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by(instance) * 100 * on(instance) group_left(nodename) (node_uname_info) > 90
    for: 3m
    labels:
      level: warning
    annotations:
      summary: "{{ $labels.instance }}: CPU usage is high"
      description: "{{ $labels.instance }}({{ $labels.nodename }}): CPU usage is above 90% (current value: {{ $value }}%)"
  - alert: HostMemoryUsage
    expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 * on(instance) group_left(nodename) (node_uname_info) > 90
    for: 3m
    labels:
      level: warning
    annotations:
      summary: "{{ $labels.instance }}: high memory usage detected"
      description: "{{ $labels.instance }}({{ $labels.nodename }}): memory usage is above 90% (current value: {{ $value }}%)"
  - alert: HostDiskUsage
    expr: 100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"}) * 100 * on(instance) group_left(nodename) (node_uname_info) > 85
    for: 3m
    labels:
      level: warning
    annotations:
      summary: "{{ $labels.instance }}: disk space usage is high!"
      description: "{{ $labels.instance }}({{ $labels.nodename }}): disk space usage is above 85% (current value: {{ $value }}%)"
  - alert: HostDiskIO
    expr: 100 - (avg(irate(node_disk_io_time_seconds_total[1m])) by(instance) * 100) * on(instance) group_left(nodename) (node_uname_info) < 60
    for: 3m
    labels:
      level: warning
    annotations:
      summary: "{{ $labels.instance }}: disk I/O usage is high!"
      description: "{{ $labels.instance }}({{ $labels.nodename }}): disk I/O idle time is below 60% (current value: {{ $value }}%)"
  - alert: TcpEstablished
    expr: node_netstat_Tcp_CurrEstab * on(instance) group_left(nodename) (node_uname_info) > 1500
    for: 3m
    labels:
      level: warning
    annotations:
      summary: "{{ $labels.instance }}: too many TCP_ESTABLISHED connections!"
      description: "{{ $labels.instance }}({{ $labels.nodename }}): TCP_ESTABLISHED count is above 1500 (current value: {{ $value }})"
  - alert: InboundNetwork
    expr: ((sum(rate(node_network_receive_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr*|lo*'}[5m])) by (instance)) / 100) * on(instance) group_left(nodename) (node_uname_info) > 204800
    for: 3m
    labels:
      level: warning
    annotations:
      summary: "{{ $labels.instance }}: inbound network bandwidth is high!"
      description: "{{ $labels.instance }}({{ $labels.nodename }}): sustained inbound bandwidth above 200M (current value: {{ $value }})"
  - alert: OutboundNetwork
    expr: ((sum(rate(node_network_transmit_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr*|lo*'}[5m])) by (instance)) / 100) * on(instance) group_left(nodename) (node_uname_info) > 204800
    for: 3m
    labels:
      team: bigdata
      level: warning
    annotations:
      summary: "{{ $labels.instance }}: outbound network bandwidth is high!"
      description: "{{ $labels.instance }}({{ $labels.nodename }}): sustained outbound bandwidth above 200M (current value: {{ $value }})"

 


Start prometheus
docker run -d -p 9090:9090 \
-v /data/prom/prometheus.yml:/etc/prometheus/prometheus.yml \
-v /data/prom/alert-rules.yml:/etc/prometheus/alert-rules.yml \
-v /data/prom/data:/prometheus --name prometheus prom/prometheus:latest
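After the container starts, the configuration can be sanity-checked with promtool, which ships in the image (assuming the container name prometheus used above):

docker exec prometheus promtool check config /etc/prometheus/prometheus.yml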


Start grafana
docker run -d -p 3000:3000 -v /data/prom/grafana:/var/lib/grafana --name=grafana grafana/grafana:latest

 

Start node-exporter   # node-exporter needs to read real host hardware metrics, so running it in Docker is not recommended; the binary install below is preferred. If you do use Docker, use the second command, which shares the host network and mounts the host filesystems read-only (with --net=host the -p mapping is unnecessary):
docker run -d -p 9100:9100 --name node-exporter prom/node-exporter:latest
docker run -d --net=host -v "/proc:/host/proc:ro" -v "/sys:/host/sys:ro" -v "/:/rootfs:ro" --name node-exporter prom/node-exporter:latest

 

Start consul    # auto-discovers and registers hosts; see this other post for details: https://www.cnblogs.com/xq0422/p/17470150.html
docker run -d -p 8500:8500  \
  --name=consul -v /data/consul/data:/consul/data \
  -v /data/consul/config:/consul/config consul 
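To have Prometheus discover the services registered in Consul, a consul_sd_configs job can be added to prometheus.yml (a sketch; the address below assumes Consul runs on the same host as Prometheus on port 8500, and the job name is illustrative):

  - job_name: 'consul-services'
    consul_sd_configs:
      - server: '192.168.188.2:8500'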

 

Client download: https://github.com/prometheus/node_exporter/releases

Pick the linux-amd64 build, then download and extract it.

# download
wget https://github.com/prometheus/node_exporter/releases/download/v1.5.0/node_exporter-1.5.0.linux-amd64.tar.gz
# extract
tar -zxvf node_exporter-1.5.0.linux-amd64.tar.gz
# rename
mv node_exporter-1.5.0.linux-amd64 node_exporter

Start it:
# discard the logs
nohup ./node_exporter >/dev/null 2>&1 &
# write the logs to /var/log/node_exporter.log
nohup ./node_exporter >/var/log/node_exporter.log 2>&1 &
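A quick check that the exporter is serving metrics (node_exporter listens on port 9100 by default):

curl -s http://localhost:9100/metrics | head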

 

mysqld_exporter works much the same way.

For MySQL, first create a dedicated user `exporter` for monitoring the database:

mysql> CREATE USER 'exporter'@'localhost' IDENTIFIED BY 'see_teampass' WITH MAX_USER_CONNECTIONS 5;

mysql> GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'localhost';
Note: MAX_USER_CONNECTIONS caps the exporter user's connections so that monitoring cannot overload the database; be aware that not every MySQL/MariaDB version supports this option.

mysql> flush privileges;

Download and install mysqld_exporter: https://github.com/prometheus/mysqld_exporter/releases

tar xvf mysqld_exporter-0.14.1.linux-amd64.tar.gz -C /usr/local/

cd /usr/local/

mv mysqld_exporter-0.14.1.linux-amd64 mysqld_exporter

cd /usr/local/mysqld_exporter

Create the credentials file:

cat > .my.cnf <<EOF
[client]
user=exporter
password=see_teampass
EOF

 

Start it with systemd:

cat > /usr/lib/systemd/system/mysqld_exporter.service <<EOF
[Unit]
Description=mysqld_exporter

[Service]
ExecStart=/usr/local/mysqld_exporter/mysqld_exporter --config.my-cnf=/usr/local/mysqld_exporter/.my.cnf
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

 

systemctl daemon-reload
systemctl enable mysqld_exporter
systemctl start mysqld_exporter
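mysqld_exporter listens on port 9104 by default. To have Prometheus scrape it, a job along these lines can be appended to prometheus.yml (a sketch; the IP of the database host is an assumption):

  - job_name: 'mysql'
    static_configs:
      - targets: ['192.168.188.3:9104']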

 

 

II. The node_exporter above is white-box monitoring; what follows is black-box monitoring:

it checks whether a page or API responds normally, whether a port is healthy, and how soon a TLS certificate expires.

 

1. Install and deploy blackbox_exporter

wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.23.0/blackbox_exporter-0.23.0.linux-amd64.tar.gz

tar -zxvf blackbox_exporter-0.23.0.linux-amd64.tar.gz -C /usr/local

mv /usr/local/blackbox_exporter-0.23.0.linux-amd64 /usr/local/blackbox_exporter

2. Configure the probes first: cat /usr/local/blackbox_exporter/blackbox.yml (each module defines a prober type)

modules:
  http_2xx:
    prober: http
  http_post_2xx:
    prober: http
    http:
      method: POST
  tcp_connect:
    prober: tcp
  pop3s_banner:
    prober: tcp
    tcp:
      query_response:
      - expect: "^+OK"
      tls: true
      tls_config:
        insecure_skip_verify: false
  grpc:
    prober: grpc
    grpc:
      tls: true
      preferred_ip_protocol: "ip4"
  grpc_plain:
    prober: grpc
    grpc:
      tls: false
      service: "service1"
  ssh_banner:
    prober: tcp
    tcp:
      query_response:
      - expect: "^SSH-2.0-"
      - send: "SSH-2.0-blackbox-ssh-check"
  irc_banner:
    prober: tcp
    tcp:
      query_response:
      - send: "NICK prober"
      - send: "USER prober prober prober :prober"
      - expect: "PING :([^ ]+)"
        send: "PONG ${1}"
      - expect: "^:[^ ]+ 001"
  icmp:
    prober: icmp
  icmp_ttl5:
    prober: icmp
    timeout: 5s
    icmp:
      ttl: 5

3. Add it as a systemd service: cat /usr/lib/systemd/system/blackbox_exporter.service

[Unit]
Description=blackbox_exporter

[Service]
User=root
Type=simple
ExecStart=/usr/local/blackbox_exporter/blackbox_exporter  --config.file=/usr/local/blackbox_exporter/blackbox.yml
Restart=on-failure

4. Check that it is running properly

You can also probe baidu.com by visiting http://127.0.0.1:9115/probe?module=http_2xx&target=baidu.com.

The module parameter in the URL selects the probe module to use and the target parameter sets the probe target; the probe result is returned in the form of metrics.
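For example, the same probe can be triggered from the command line (assuming blackbox_exporter is listening locally on its default port 9115):

curl 'http://127.0.0.1:9115/probe?module=http_2xx&target=baidu.com'

which returns output along these lines: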

# HELP probe_dns_lookup_time_seconds Returns the time taken for probe dns lookup in seconds
# TYPE probe_dns_lookup_time_seconds gauge
probe_dns_lookup_time_seconds 0.004366919
# HELP probe_duration_seconds Returns how long the probe took to complete in seconds
# TYPE probe_duration_seconds gauge
probe_duration_seconds 0.09053371
# HELP probe_failed_due_to_regex Indicates if probe failed due to regex
# TYPE probe_failed_due_to_regex gauge
probe_failed_due_to_regex 0
# HELP probe_http_content_length Length of http content response
# TYPE probe_http_content_length gauge
probe_http_content_length 81
# HELP probe_http_duration_seconds Duration of http request by phase, summed over all redirects
# TYPE probe_http_duration_seconds gauge
probe_http_duration_seconds{phase="connect"} 0.040772637
probe_http_duration_seconds{phase="processing"} 0.04430544
probe_http_duration_seconds{phase="resolve"} 0.004366919
probe_http_duration_seconds{phase="tls"} 0
probe_http_duration_seconds{phase="transfer"} 0.00019256
# HELP probe_http_last_modified_timestamp_seconds Returns the Last-Modified HTTP response header in unixtime
# TYPE probe_http_last_modified_timestamp_seconds gauge
probe_http_last_modified_timestamp_seconds 1.26330408e+09
# HELP probe_http_redirects The number of redirects
# TYPE probe_http_redirects gauge
probe_http_redirects 0
# HELP probe_http_ssl Indicates if SSL was used for the final redirect
# TYPE probe_http_ssl gauge
probe_http_ssl 0
# HELP probe_http_status_code Response HTTP status code
# TYPE probe_http_status_code gauge
probe_http_status_code 200
# HELP probe_http_uncompressed_body_length Length of uncompressed response body
# TYPE probe_http_uncompressed_body_length gauge
probe_http_uncompressed_body_length 81
# HELP probe_http_version Returns the version of HTTP of the probe response
# TYPE probe_http_version gauge
probe_http_version 1.1
# HELP probe_ip_addr_hash Specifies the hash of IP address. It's useful to detect if the IP address changes.
# TYPE probe_ip_addr_hash gauge
probe_ip_addr_hash 3.6694721e+08
# HELP probe_ip_protocol Specifies whether probe ip protocol is IP4 or IP6
# TYPE probe_ip_protocol gauge
probe_ip_protocol 4
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1

From the returned samples you can read the site's DNS lookup time, response time, HTTP status code, and other metrics describing access quality, which helps administrators discover faults and problems proactively.

5. Add the related monitoring to Prometheus by appending the following to prometheus.yml


# website monitoring
  - job_name: 'http_status'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
        - http://192.168.188.10:3928/anywhere/#/before          # "Anywhere" portal
        - http://192.168.188.81:9092/index01.html               # learning videos
        - http://192.168.188.10:8000/accounts/login/?next=/     # internal library (Cangjingge)
        - http://192.168.188.21:8090/#all-updates               # Confluence
       # - https://192.168.188.22/dev/dist/                     # Longhua Siyan
        - https://www.baidu.com                                 # Longhua Siyan
        - http://192.168.188.30:6682/dist/project               # Longhua Sishi
        labels:
          instance: http_status
          group: web
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.188.68:9115

# ping checks
  - job_name: 'ping_status'
    metrics_path: /probe
    params:
      module: [icmp]
    static_configs:
      - targets:
        - 192.168.188.200
        labels:
          instance: 'ping_status'
          group: icmp
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.188.68:9115

# port (TCP connect) monitoring
  - job_name: 'port_status'
    metrics_path: /probe
    params:
      module: [tcp_connect]
    static_configs:
      - targets:
        - 192.168.188.10:3928
        - 192.168.188.22:3306
        - 192.168.188.200:8090
        labels:
          instance: 'port_status'
          group: port
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.188.68:9115

6. Grafana dashboard ID: 9965

Alerting rules can watch the probe_success metric:

  • For icmp, tcp, http, and post checks, the probe_success metric shows whether the check succeeded

  • probe_success == 0 ## connectivity failed

  • probe_success == 1 ## connectivity OK

  • The alert simply checks whether this metric equals 0; if it does, an alert fires (see the sketch below)
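A minimal rule for this check might look as follows (a sketch to place under a group's rules: section in alert-rules.yml; the alert name, duration, and wording are illustrative):

  - alert: ProbeFailed
    expr: probe_success == 0
    for: 1m
    labels:
      level: warning
    annotations:
      summary: "{{ $labels.instance }}: blackbox probe failed"
      description: "{{ $labels.instance }}: probe_success has been 0 for more than 1 minute (current value: {{ $value }})"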

 

 

 

For Hadoop and other big-data monitoring setup, see:

https://github.com/tamtran96/hadoop-jmx-exporter/tree/master/dashboards

 

For more Prometheus exporters, see:

 https://blog.51cto.com/u_14065119/4166081 

 

References:

https://it.cha138.com/mysql/show-99068.html

https://www.infoq.cn/article/sxextntuttxduedeagiq

https://www.prometheus.wang/exporter/install_blackbox_exporter.html

 
