Using Prometheus with Grafana
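The core of Prometheus is a single standalone binary. It uses a pull model, ships with a built-in time series database (TSDB), offers a powerful query language (PromQL), and supports visualization and an open ecosystem.
It stores data in a dimensional model and is used like an OLAP system.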
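1. Storage and compute layer
> Prometheus Server, which contains the storage engine and the query engine
> Retrieval is the scraping component; it actively pulls data from the Pushgateway or from exporters
> Service discovery dynamically discovers the targets to monitor
> TSDB, the core data storage and query component
> HTTP server, which exposes the HTTP service
2. Collection layer
The collection layer handles two kinds of jobs: short-lived jobs and long-running jobs
> Short-lived jobs: push their metrics to the Pushgateway through its API when they exit
> Long-running jobs: the Retrieval component pulls data directly from the job or its exporter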
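The short-lived job path can be sketched as follows. This is a minimal illustration using only the Python standard library, not the author's setup: the metric names and the helper names `render_metrics` and `push_to_gateway` are hypothetical. It relies on the Pushgateway accepting text-format metrics via PUT at `/metrics/job/<job>`.

```python
import urllib.parse
import urllib.request


def render_metrics(samples):
    """Render a {metric_name: value} dict in the Prometheus text exposition format."""
    lines = [f"{name} {value}" for name, value in sorted(samples.items())]
    # The request body has to end with a newline.
    return "\n".join(lines) + "\n"


def push_to_gateway(gateway, job, samples, timeout=5):
    """PUT the samples to http://<gateway>/metrics/job/<job> and return the HTTP status."""
    url = f"http://{gateway}/metrics/job/{urllib.parse.quote(job, safe='')}"
    request = urllib.request.Request(
        url, data=render_metrics(samples).encode("utf-8"), method="PUT"
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Hypothetical metrics a batch job might report just before it exits.
exit_metrics = {"batch_job_duration_seconds": 12.5, "batch_job_success": 1}
print(render_metrics(exit_metrics), end="")
```

A job would call something like `push_to_gateway("192.168.188.2:9091", "nightly_backup", exit_metrics)` as its last step; the address and job name are placeholders, and 9091 is the Pushgateway default port. Prometheus then scrapes the Pushgateway like any other target.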
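3. Application layer
The application layer has two main parts: Alertmanager and data visualization
> Alertmanager: integrates with PagerDuty (a paid alerting service) and can notify by SMS, phone call, or email
> Data visualization: the built-in Prometheus web UI, Grafana, and other clients built on the API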
I. Hands-on: installing Prometheus and Grafana with Docker
1. Prepare a consistent environment
Install Docker, and disable the firewall and SELinux
2. Pull the required images
docker pull prom/prometheus
docker pull prom/alertmanager
docker pull grafana/grafana
3. Start the components
Start prometheus-webhook-dingtalk
docker run -d -p 8060:8060 -v /data/prom/config.yml:/etc/prometheus-webhook-dingtalk/config.yml --name alertdingtalk timonwong/prometheus-webhook-dingtalk
Start Alertmanager
docker run -d -p 9093:9093 -p 9094:9094 -v /data/prom/alertmanager.yml:/etc/alertmanager/alertmanager.yml --name alertmanager prom/alertmanager
alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.exmail.qq.com:465'  # SMTP relay server; SSL is enabled, and the port is usually 465
  smtp_from: 'test@qq.com'                  # sender address
  smtp_auth_username: 'test@qq.com'         # mailbox user name
  smtp_auth_password: 'passwd'              # mailbox password or authorization code
  smtp_require_tls: false
route:
  receiver: 'default-receiver'  # alerts that match no child route stay at the root and go to "default-receiver"
  group_wait: 30s               # initial wait before sending a notification for a group, default 30s
  group_interval: 5m            # wait before notifying about new alerts in a group, usually 5m or more
  repeat_interval: 1h           # how long to wait before re-sending a notification that was already sent
  group_by: [alertname]         # labels used to group alerts
  routes:
  - receiver: 'bigdata-pager'   # matches every alert labelled team=bigdata; add the label under labels in alert-rules.yml
    group_wait: 10s
    match:
      team: bigdata
receivers:                      # who receives the alerts
- name: 'default-receiver'
  email_configs:
  - to: 'xx@qq.com,xx@qq.com'
- name: 'bigdata-pager'
  email_configs:
  - to: 'xxx@qq.com,xx@qq.com'
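The routing tree in alertmanager.yml can be read as "first matching child route wins, otherwise fall back to the root receiver". The sketch below models only that decision for this particular file; it is a simplification for illustration (real Alertmanager routing also supports nested routes, regex matchers and `continue`), and the function name `pick_receiver` is hypothetical.

```python
ROOT_RECEIVER = "default-receiver"

# (match labels, receiver) pairs mirroring the single child route in alertmanager.yml.
CHILD_ROUTES = [({"team": "bigdata"}, "bigdata-pager")]


def pick_receiver(alert_labels):
    """Return the receiver name for an alert, given its label set."""
    for match, receiver in CHILD_ROUTES:
        # A child route matches only if every listed label has exactly that value.
        if all(alert_labels.get(name) == value for name, value in match.items()):
            return receiver
    # Alerts that match no child route stay at the root.
    return ROOT_RECEIVER


print(pick_receiver({"alertname": "Outbound network", "team": "bigdata"}))
print(pick_receiver({"alertname": "Host CPU usage"}))
```

In this setup only rules that carry `team: bigdata` under `labels` reach the bigdata-pager mailbox; everything else is mailed to default-receiver.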
prometheus.yml
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
alerting: # address of the Alertmanager component
  alertmanagers:
  - static_configs:
    - targets: [ '192.168.188.2:9093' ]
rule_files: # alerting rule files
  - "*rules.yml"
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
    - targets: [ '192.168.188.2:9090' ]
  - job_name: 'node'
    static_configs:
    - targets: [ '192.168.188.3:9100' ]
  - job_name: 'alertmanager'
    static_configs:
    - targets: [ '192.168.188.2:9093' ]
alert-rules.yml
groups:
- name: Host status monitoring alerts
  rules:
  - alert: Host status
    # node_uname_info goes stale when the target is down, so it cannot be joined in here
    expr: up{job="node"} == 0
    for: 5m
    labels:
      level: warning
    annotations:
      summary: "{{ $labels.instance }}: host is down"
      description: "{{ $labels.instance }}: host has been unreachable for more than 5 minutes"
  - alert: Host CPU usage
    expr: 100 - avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100 * on(instance) group_left(nodename) (node_uname_info) > 90
    for: 3m
    labels:
      level: warning
    annotations:
      summary: "{{ $labels.instance }}: CPU usage too high"
      description: "{{ $labels.instance }}({{ $labels.nodename }}): CPU usage above 90% (current: {{ $value }}%)"
  - alert: Host memory usage
    expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 * on(instance) group_left(nodename) (node_uname_info) > 90
    for: 3m
    labels:
      level: warning
    annotations:
      summary: "{{ $labels.instance }}: High Memory usage detected"
      description: "{{ $labels.instance }}({{ $labels.nodename }}): memory usage above 90% (current: {{ $value }}%)"
  - alert: Host disk usage
    expr: 100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"}) * 100 * on(instance) group_left(nodename) (node_uname_info) > 85
    for: 3m
    labels:
      level: warning
    annotations:
      summary: "{{ $labels.instance }}: disk space usage too high"
      description: "{{ $labels.instance }}({{ $labels.nodename }}): disk space usage above 85% (current: {{ $value }}%)"
  - alert: Disk I/O performance
    # compares utilization directly, so that the threshold and {{ $value }} agree with the description
    expr: avg(irate(node_disk_io_time_seconds_total[1m])) by (instance) * 100 * on(instance) group_left(nodename) (node_uname_info) > 60
    for: 3m
    labels:
      level: warning
    annotations:
      summary: "{{ $labels.instance }}: disk I/O utilization too high"
      description: "{{ $labels.instance }}({{ $labels.nodename }}): disk I/O utilization above 60% (current: {{ $value }}%)"
  - alert: TCP sessions
    expr: node_netstat_Tcp_CurrEstab * on(instance) group_left(nodename) (node_uname_info) > 1500
    for: 3m
    labels:
      level: warning
    annotations:
      summary: "{{ $labels.instance }}: too many TCP_ESTABLISHED connections"
      description: "{{ $labels.instance }}({{ $labels.nodename }}): more than 1500 TCP_ESTABLISHED connections (current: {{ $value }})"
  - alert: Inbound network
    # PromQL regexes are fully anchored, so virbr.* and lo are spelled out
    expr: ((sum(rate(node_network_receive_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) by (instance)) / 100) * on(instance) group_left(nodename) (node_uname_info) > 204800
    for: 3m
    labels:
      level: warning
    annotations:
      summary: "{{ $labels.instance }}: inbound network bandwidth too high"
      description: "{{ $labels.instance }}({{ $labels.nodename }}): inbound bandwidth above 200M for 3 minutes (current: {{ $value }})"
  - alert: Outbound network
    expr: ((sum(rate(node_network_transmit_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) by (instance)) / 100) * on(instance) group_left(nodename) (node_uname_info) > 204800
    for: 3m
    labels:
      team: bigdata
      level: warning
    annotations:
      summary: "{{ $labels.instance }}: outbound network bandwidth too high"
      description: "{{ $labels.instance }}({{ $labels.nodename }}): outbound bandwidth above 200M for 3 minutes (current: {{ $value }})"
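Every rule above carries a `for` duration, which means the expression must stay true across consecutive evaluations before the alert fires. The sketch below is a simplified model of that behaviour, not Prometheus code: the function name `alert_state` and the sample values are hypothetical, and evaluations are assumed to arrive every 15 seconds as configured by `evaluation_interval`.

```python
def alert_state(samples, threshold, for_seconds):
    """Return 'inactive', 'pending' or 'firing' after the last evaluation.

    samples is a list of (unix_timestamp, value) pairs in evaluation order.
    """
    active_since = None
    for timestamp, value in samples:
        if value > threshold:
            # Remember when the condition first became true.
            if active_since is None:
                active_since = timestamp
        else:
            # Any evaluation below the threshold resets the timer.
            active_since = None
    if active_since is None:
        return "inactive"
    last_timestamp = samples[-1][0]
    return "firing" if last_timestamp - active_since >= for_seconds else "pending"


# CPU usage sampled every 15 s and held at 95% for three minutes.
steady = [(t, 95.0) for t in range(0, 181, 15)]
print(alert_state(steady, threshold=90, for_seconds=180))
```

A single dip below the threshold puts the alert back to inactive and restarts the wait, which is why short spikes do not page anyone with `for: 3m`.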
Start Prometheus
docker run -d -p 9090:9090 \
-v /data/prom/prometheus.yml:/etc/prometheus/prometheus.yml \
-v /data/prom/alert-rules.yml:/etc/prometheus/alert-rules.yml \
-v /data/prom/data:/prometheus --name prometheus prom/prometheus:latest
Start Grafana
docker run -d -p 3000:3000 -v /data/prom/grafana:/var/lib/grafana --name=grafana grafana/grafana:latest
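Start node-exporter. node-exporter has to see the real host hardware, so running it in Docker is not recommended; prefer the binary install described further down. If you do run it in Docker, use one of the following two commands (they share a container name, so do not run both):
docker run -d -p 9100:9100 --name node-exporter prom/node-exporter:latest
docker run -d --net=host -v "/proc:/host/proc:ro" -v "/sys:/host/sys:ro" -v "/:/rootfs:ro" --name node-exporter prom/node-exporter:latest --path.procfs=/host/proc --path.sysfs=/host/sys --path.rootfs=/rootfs
Start Consul, which discovers and registers hosts automatically; see this other article: https://www.cnblogs.com/xq0422/p/17470150.html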
docker run -d -p 8500:8500 \
--name=consul -v /data/consul/data:/consul/data \
-v /data/consul/config:/consul/config consul
Client (node_exporter) download page: https://github.com/prometheus/node_exporter/releases
Pick the linux-amd64 build, then download and extract it
# Download
wget https://github.com/prometheus/node_exporter/releases/download/v1.5.0/node_exporter-1.5.0.linux-amd64.tar.gz
# Extract
tar -zxvf node_exporter-1.5.0.linux-amd64.tar.gz
# Rename
mv node_exporter-1.5.0.linux-amd64 node_exporter
How to start it (run these from inside the node_exporter directory):
# Discard logs
nohup ./node_exporter >/dev/null 2>&1 &
# Save logs to /var/log/node_exporter.log
nohup ./node_exporter >/var/log/node_exporter.log 2>&1 &
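mysqld_exporter is set up in much the same way
Note that MySQL first needs a user named exporter for monitoring the database, for example with the grants suggested in the mysqld_exporter README (the password is a placeholder and must match the one in .my.cnf):
mysql > CREATE USER 'exporter'@'localhost' IDENTIFIED BY 'XXXXXXXX' WITH MAX_USER_CONNECTIONS 3;
mysql > GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'localhost';
mysql > flush privileges;
Download and install mysqld_exporter: https://github.com/prometheus/mysqld_exporter/releases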
tar xvf mysqld_exporter-0.14.1.linux-amd64.tar.gz -C /usr/local/
cd /usr/local/
mv mysqld_exporter-0.14.1.linux-amd64 mysqld_exporter
cd /usr/local/mysqld_exporter
Create the connection file
cat > .my.cnf <<EOF
[client]
user=exporter
password=see_teampass
EOF
Start it with systemd
cat >/usr/lib/systemd/system/mysqld_exporter.service <<EOF
[Unit]
Description=Prometheus
[Service]
ExecStart=/usr/local/mysqld_exporter/mysqld_exporter --config.my-cnf=/usr/local/mysqld_exporter/.my.cnf
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable mysqld_exporter
systemctl start mysqld_exporter
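II. node_exporter above is white-box monitoring; this part covers black-box monitoring:
It checks whether endpoints and pages respond normally, whether ports are healthy, and how long until certificates expire
1. Install and deploy blackbox_exporter
wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.23.0/blackbox_exporter-0.23.0.linux-amd64.tar.gz
tar -zxvf blackbox_exporter-0.23.0.linux-amd64.tar.gz -C /usr/local
mv /usr/local/blackbox_exporter-0.23.0.linux-amd64 /usr/local/blackbox_exporter
2. Configure the probes first. Each module in blackbox.yml names a probe type (prober):
cat /usr/local/blackbox_exporter/blackbox.yml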
modules:
  http_2xx:
    prober: http
  http_post_2xx:
    prober: http
    http:
      method: POST
  tcp_connect:
    prober: tcp
  pop3s_banner:
    prober: tcp
    tcp:
      query_response:
      - expect: "^+OK"
      tls: true
      tls_config:
        insecure_skip_verify: false
  grpc:
    prober: grpc
    grpc:
      tls: true
      preferred_ip_protocol: "ip4"
  grpc_plain:
    prober: grpc
    grpc:
      tls: false
      service: "service1"
  ssh_banner:
    prober: tcp
    tcp:
      query_response:
      - expect: "^SSH-2.0-"
      - send: "SSH-2.0-blackbox-ssh-check"
  irc_banner:
    prober: tcp
    tcp:
      query_response:
      - send: "NICK prober"
      - send: "USER prober prober prober :prober"
      - expect: "PING :([^ ]+)"
        send: "PONG ${1}"
      - expect: "^:[^ ]+ 001"
  icmp:
    prober: icmp
  icmp_ttl5:
    prober: icmp
    timeout: 5s
    icmp:
      ttl: 5
3. Register it as a startup service
cat /usr/lib/systemd/system/blackbox_exporter.service
[Unit]
Description=blackbox_exporter
[Service]
User=root
Type=simple
ExecStart=/usr/local/blackbox_exporter/blackbox_exporter --config.file=/usr/local/blackbox_exporter/blackbox.yml
Restart=on-failure
[Install]
WantedBy=multi-user.target
4. Check that it is running properly
You can probe baidu.com by requesting http://127.0.0.1:9115/probe?module=http_2xx&target=baidu.com.
The module parameter in the URL selects the probe to use and the target parameter names the target to probe; the probe results are returned as metrics:
# HELP probe_dns_lookup_time_seconds Returns the time taken for probe dns lookup in seconds
# TYPE probe_dns_lookup_time_seconds gauge
probe_dns_lookup_time_seconds 0.004366919
# HELP probe_duration_seconds Returns how long the probe took to complete in seconds
# TYPE probe_duration_seconds gauge
probe_duration_seconds 0.09053371
# HELP probe_failed_due_to_regex Indicates if probe failed due to regex
# TYPE probe_failed_due_to_regex gauge
probe_failed_due_to_regex 0
# HELP probe_http_content_length Length of http content response
# TYPE probe_http_content_length gauge
probe_http_content_length 81
# HELP probe_http_duration_seconds Duration of http request by phase, summed over all redirects
# TYPE probe_http_duration_seconds gauge
probe_http_duration_seconds{phase="connect"} 0.040772637
probe_http_duration_seconds{phase="processing"} 0.04430544
probe_http_duration_seconds{phase="resolve"} 0.004366919
probe_http_duration_seconds{phase="tls"} 0
probe_http_duration_seconds{phase="transfer"} 0.00019256
# HELP probe_http_last_modified_timestamp_seconds Returns the Last-Modified HTTP response header in unixtime
# TYPE probe_http_last_modified_timestamp_seconds gauge
probe_http_last_modified_timestamp_seconds 1.26330408e+09
# HELP probe_http_redirects The number of redirects
# TYPE probe_http_redirects gauge
probe_http_redirects 0
# HELP probe_http_ssl Indicates if SSL was used for the final redirect
# TYPE probe_http_ssl gauge
probe_http_ssl 0
# HELP probe_http_status_code Response HTTP status code
# TYPE probe_http_status_code gauge
probe_http_status_code 200
# HELP probe_http_uncompressed_body_length Length of uncompressed response body
# TYPE probe_http_uncompressed_body_length gauge
probe_http_uncompressed_body_length 81
# HELP probe_http_version Returns the version of HTTP of the probe response
# TYPE probe_http_version gauge
probe_http_version 1.1
# HELP probe_ip_addr_hash Specifies the hash of IP address. It's useful to detect if the IP address changes.
# TYPE probe_ip_addr_hash gauge
probe_ip_addr_hash 3.6694721e+08
# HELP probe_ip_protocol Specifies whether probe ip protocol is IP4 or IP6
# TYPE probe_ip_protocol gauge
probe_ip_protocol 4
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1
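From the returned samples you can read the site's DNS lookup time, response time, HTTP status code, and other metrics related to site access quality, which helps administrators find faults and problems proactively.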
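To check a probe from a script rather than a browser, the same /probe URL can be built and its text output parsed. The following is a minimal sketch using only the standard library; the helper names `probe_url`, `parse_metrics` and `fetch_probe` are hypothetical, and the parser assumes samples without timestamps, as in the output shown above.

```python
import urllib.parse
import urllib.request


def probe_url(exporter, module, target):
    """Build the blackbox_exporter probe URL for a module and target."""
    query = urllib.parse.urlencode({"module": module, "target": target})
    return f"http://{exporter}/probe?{query}"


def parse_metrics(text):
    """Map each series (name plus labels) to its float value, skipping comments."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field on the line.
        series, _, value = line.rpartition(" ")
        metrics[series] = float(value)
    return metrics


def fetch_probe(exporter, module, target, timeout=10):
    """Run one probe against a live exporter and return the parsed metrics."""
    with urllib.request.urlopen(probe_url(exporter, module, target), timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))


# A few lines copied from the probe output above.
SAMPLE = """\
# HELP probe_http_status_code Response HTTP status code
# TYPE probe_http_status_code gauge
probe_http_status_code 200
probe_http_duration_seconds{phase="connect"} 0.040772637
probe_success 1
"""
result = parse_metrics(SAMPLE)
print(probe_url("127.0.0.1:9115", "http_2xx", "baidu.com"))
print("probe ok" if result["probe_success"] == 1 else "probe failed")
```

Against a running exporter, `fetch_probe("127.0.0.1:9115", "http_2xx", "baidu.com")` would return the same kind of dictionary for a live probe.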
5. Add the monitoring jobs to Prometheus by appending the following under scrape_configs in prometheus.yml
  # Website monitoring
  - job_name: 'http_status'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
    - targets:
      - http://192.168.188.10:3928/anywhere/#/before #Anywhere Door
      - http://192.168.188.81:9092/index01.html #training videos
      - http://192.168.188.10:8000/accounts/login/?next=/ #knowledge library
      - http://192.168.188.21:8090/#all-updates #confluence
      # - https://192.168.188.22/dev/dist/ #龙华思妍
      - https://www.baidu.com #龙华思妍
      - http://192.168.188.30:6682/dist/project #龙华思视
      labels:
        instance: http_status
        group: web
    relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: 192.168.188.68:9115
  # Ping check
  - job_name: 'ping_status'
    metrics_path: /probe
    params:
      module: [icmp]
    static_configs:
    - targets:
      - 192.168.188.200
      labels:
        instance: 'ping_status'
        group: icmp
    relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: 192.168.188.68:9115
  # Port monitoring
  - job_name: 'port_status'
    metrics_path: /probe
    params:
      module: [tcp_connect]
    static_configs:
    - targets:
      - 192.168.188.10:3928
      - 192.168.188.22:3306
      - 192.168.188.200:8090
      labels:
        instance: 'port_status'
        group: port
    relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: 192.168.188.68:9115
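The three relabel_configs steps are what turn a listed target into a request to the blackbox exporter. The sketch below replays them on a plain dictionary to show the resulting labels and scrape URL; it is an illustration of the effect, not Prometheus's relabeling engine, and the function names are hypothetical.

```python
import urllib.parse


def relabel_for_blackbox(target, exporter_address):
    """Apply the three relabel steps from the job definitions to one target."""
    labels = {"__address__": target}
    # Step 1: copy the original target into the ?target= URL parameter.
    labels["__param_target"] = labels["__address__"]
    # Step 2: keep the probed target visible as the instance label.
    labels["instance"] = labels["__param_target"]
    # Step 3: point the actual scrape at the blackbox exporter.
    labels["__address__"] = exporter_address
    return labels


def scrape_url(labels, module, metrics_path="/probe"):
    """Build the URL Prometheus would request after relabeling."""
    query = urllib.parse.urlencode({"module": module, "target": labels["__param_target"]})
    return f"http://{labels['__address__']}{metrics_path}?{query}"


port_labels = relabel_for_blackbox("192.168.188.22:3306", "192.168.188.68:9115")
print(port_labels["instance"])
print(scrape_url(port_labels, "tcp_connect"))
```

Because step 2 overwrites `instance` with the probed address, the static `instance` label set under `labels` in each job does not survive relabeling; the `group` label does.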
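6. Grafana dashboard ID: 9965
Alerting rules can watch the probe_success metric:
- For icmp, tcp, http, and post probes, probe_success shows whether the check passed
- probe_success == 0 ## connectivity failure
- probe_success == 1 ## connectivity OK
- Alerts test whether this metric equals 0; if it does, an alert fires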
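The same check can be run ad hoc against the Prometheus HTTP API by querying `probe_success` through `/api/v1/query` and keeping the series whose value is 0. The sketch below only handles the JSON that such an instant query returns; the function name `failed_targets` and the sample response are hypothetical, with instances taken from the jobs above.

```python
def failed_targets(query_response):
    """Return the sorted instances whose probe_success value is 0."""
    if query_response.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    failed = []
    for series in query_response["data"]["result"]:
        # An instant vector sample is [unix_timestamp, "value as string"].
        _timestamp, value = series["value"]
        if float(value) == 0:
            failed.append(series["metric"].get("instance", "<unknown>"))
    return sorted(failed)


# Hypothetical response for the query `probe_success`.
sample_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"instance": "192.168.188.22:3306", "job": "port_status"},
             "value": [1700000000.0, "0"]},
            {"metric": {"instance": "192.168.188.200", "job": "ping_status"},
             "value": [1700000000.0, "1"]},
        ],
    },
}
print(failed_targets(sample_response))
```

An alerting rule does the same comparison continuously, so a script like this is mainly useful for spot checks or for feeding another tool.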
For Hadoop (big data) monitoring, setup details are available at:
https://github.com/tamtran96/hadoop-jmx-exporter/tree/master/dashboards
For more Prometheus exporters, see:
https://blog.51cto.com/u_14065119/4166081
References:
https://it.cha138.com/mysql/show-99068.html
https://www.infoq.cn/article/sxextntuttxduedeagiq
https://www.prometheus.wang/exporter/install_blackbox_exporter.html