Prometheus Monitoring in Practice

Section 1: Environment and Software Versions

1.1 Operating system environment

Host IP         OS                    Deployed software                          Notes
192.168.10.10   CentOS 7.9            Grafana, Pushgateway, Blackbox Exporter    Monitoring UI
192.168.10.11   CentOS 7.9            Loki                                       Log storage
192.168.10.12   CentOS 7.9            Prometheus                                 Metrics storage
192.168.10.13   CentOS 7.9            Logstash                                   Log filtering
192.168.10.14   CentOS 7.9            Filebeat, node_exporter                    Log and metrics collection
192.168.10.15   Windows Server 2016   Filebeat, windows_exporter                 Log and metrics collection
192.168.10.16   CentOS 7.9            Alertmanager                               Alerting

1.2 Software versions

Software            Version   Notes
Grafana             8.3.3     Monitoring UI
Loki                2.5.0     Log storage
Prometheus          2.32.1    Metrics storage
Pushgateway         1.4.2     Receives custom metrics
Filebeat            6.4.3     Log collection agent
node_exporter       1.3.1     Metrics collection agent (Linux)
windows_exporter    0.17.0    Metrics collection agent (Windows)
Logstash            7.16.2    Log filtering
Blackbox Exporter   0.19.0    Probes websites, HTTP/TCP/UDP, etc.
Alertmanager        0.24.0    Alerting

1.3 System initialization

1. Disable the firewall

systemctl stop firewalld
systemctl disable firewalld

2. Disable SELinux

setenforce 0
vim /etc/selinux/config
SELINUX=disabled

setenforce 0 disables SELinux immediately; setting SELINUX=disabled in /etc/selinux/config makes the change persist across reboots.

1.4 Architecture diagram

(architecture diagram)

Section 2: Monitoring Platform Deployment

2.1 Server-side deployment

1. Grafana
Note: run on host 192.168.10.10.
Install

tar -xvf grafana-8.3.3.linux-amd64.tar
cd grafana-8.3.3/

Start

nohup ./bin/grafana-server > ./log/grafana.log &

Open http://192.168.10.10:3000 in a browser.
Default username/password: admin/admin
(screenshot)
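To confirm Grafana is up without opening a browser, you can also hit its health endpoint (a quick check against the default port 3000 used above):

curl -s http://192.168.10.10:3000/api/health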

2. Prometheus
Note: run on host 192.168.10.12.
Install

tar -xvf prometheus-2.32.1.linux-amd64.tar
cd prometheus-2.32.1.linux-amd64/

Start

nohup ./prometheus --config.file=./prometheus.yml --web.listen-address=:49800 1>nohup.log 2>&1 &

Open http://192.168.10.12:49800 in a browser.
(screenshot)
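Prometheus exposes simple health and readiness endpoints, which make for a quick check from the shell (same address and port as above):

curl -s http://192.168.10.12:49800/-/healthy
curl -s http://192.168.10.12:49800/-/ready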

3. Integrate Prometheus with Grafana
Add Prometheus as a data source in Grafana; the steps are shown in the screenshots below, and a scripted alternative via the Grafana HTTP API follows them.
(screenshots)
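If you prefer to script this step, the same data source can be created through Grafana's HTTP API; this is a sketch that assumes the default admin/admin credentials and the Prometheus address used above:

curl -s -u admin:admin -H 'Content-Type: application/json' \
  -X POST http://192.168.10.10:3000/api/datasources \
  -d '{"name":"Prometheus","type":"prometheus","url":"http://192.168.10.12:49800","access":"proxy","isDefault":true}'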

2.2 Client-side deployment

1. Linux

  • Install
    Deploy node_exporter; simply extract the tarball.
tar -xvf node_exporter-1.3.1.linux-amd64.tar.gz
cd node_exporter-1.3.1.linux-amd64/
  • Start (a quick verification is sketched after the command)
nohup ./node_exporter --web.listen-address=:49999 --log.format=logfmt --collector.textfile.directory=./collection.textfile.directory/ --collector.ntp.server-is-local  >/dev/null &
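Once it is running, a quick spot check confirms that metrics are being served (any node_* metric will do):

curl -s http://192.168.10.14:49999/metrics | grep -m1 node_cpu_seconds_total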

2. Windows

  • Install
    On Windows, simply extract the archive.
    (screenshot)
  • Create the startup script startNode.bat
start /b "" .\windows_exporter-0.17.0-amd64.exe --telemetry.addr=":9182" --collector.textfile.directory="./collection.textfile.directory/"
  • Start
    Double-click the startup script, as shown below.
    (screenshot)

3. Configure Prometheus

  • Edit the configuration file
    vi prometheus.yml
  - job_name: "NODE"
    static_configs:
      - targets: ['192.168.10.14:49999']
        labels:
          env: prd001
          group: PAAS
          hostip: 192.168.10.14

      - targets: ['192.168.10.15:9182']
        labels:
          env: prd001
          group: PAAS
          hostip: 192.168.10.15
  • Restart Prometheus
nohup ./prometheus --config.file=./prometheus.yml --web.listen-address=:49800 1>nohup.log 2>&1 &
  • Check the targets in Prometheus (an API check is sketched below)
    (screenshots)
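Target health can also be read from Prometheus's HTTP API (same address as above; the grep just pulls the health fields out of the JSON):

curl -s http://192.168.10.12:49800/api/v1/targets | grep -o '"health":"[^"]*"'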
4. Configure Grafana and view the dashboards
  • Import dashboards
    In Grafana, import the Windows and Linux dashboards (Windows dashboard ID: 10467, Linux dashboard ID: 11074); the steps are shown below.
    (screenshots)
  • View the Linux dashboard
    (screenshots)
  • View the Windows dashboard
    (screenshots)

Section 3: Log Platform Deployment

3.1 Install the server (Loki)

1. Install

tar -xvf loki.tar.gz
cd loki/

Start

nohup ./loki-linux-amd64 -config.file=config.yaml 1> ./log/loki.log 2> ./log/loki_error.log &
ss -tunlp | grep 3100
tcp    LISTEN     0      128    [::]:3100               [::]:*                   users:(("loki-linux-amd6",pid=8422,fd=10))
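Loki also exposes a readiness endpoint; running this on the Loki host confirms it has come up:

curl -s http://localhost:3100/ready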

2. Configure Grafana (add Loki as a data source)
(screenshots)

3.2 Deploy Logstash

tar -xvf logstash-7.16.2.tar
cd logstash-7.16.2/
bin/logstash-plugin install file:///bankapp/logstash/plugin/logstash-codec-plain.zip 
bin/logstash-plugin install file:///bankapp/logstash/plugin/logstash-output-loki.zip
vi pipelines/log_collect.conf
input{
   beats {
       port => 10515
   }  
}
input{
   http {
       host => "0.0.0.0"
       port => 10516
       type => "healthcheck"
   }
}
filter {
    grok{                                                                         
          match => {
               "message" => ".*\[INFO\] \[(?<funcname>(.*?)):.*"
          }
    }
    grok {
        match => ["message", "%{TIMESTAMP_ISO8601:logdate}"]
    }
    if [appname] == "switch" {
        date {
            match => ["logdate", "yyyy-MM-dd HH:mm:ss.SSS"]
            target => "@timestamp"  # the default target is "@timestamp"
        }
    }else {
        date {
            match => ["logdate", "yyyy-MM-dd'T'HH:mm:ss.SSS"]
            target => "@timestamp"  # the default target is "@timestamp"
        }
    }
    mutate {
        remove_field => ["tags"]
        remove_field => ["offset"]
        remove_field => ["logdate"]
    }
}
output {
    if [type] == "healthcheck" {
    }else{
        loki {
            url => "http://192.168.10.10:3100/loki/api/v1/push"
            batch_size => 112640 #112.64 kilobytes
            retries => 5
            min_delay => 3
            max_delay => 500
            message_field => "message"
        }
    }
}

Start

nohup ./bin/logstash -f ./pipelines/log_collect.conf 1>nohup.log 2>&1 &
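Before starting (or after any later edit to the pipeline), the configuration can be syntax-checked without actually running it:

./bin/logstash -f ./pipelines/log_collect.conf --config.test_and_exit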

3.3 Deploy the Filebeat client

The log format is as follows:

gtms-switch-center 2022-04-19 17:28:14.616 [http-nio-8080-exec-989] INFO  c.p.switchcenter.web.controller.SwitchController

1. Linux

  • Install
tar -xvf filebeat.tar.gz
cd filebeat/
  • Edit the configuration file
vi filebeat.yml
filebeat.prospectors:
  - input_type: log
    paths:
      - /bankapp/switch/gtms-switch-center/com.pactera.jep.log.biz*.log
    multiline:
      pattern: '^gtms-switch-center'
      negate: true
      match: after
      max_lines: 200
      timeout: 20s
    fields:
      env: "prd001"
      appid: "switch"
      appname: "switch"
      hostip: "192.168.10.14"
    reload.enabled: true
    reload.period: 2s
    fields_under_root: true
output.logstash:
  hosts: ["192.168.10.11:10515" ]
  enabled: true
  • Start (configuration and output connectivity tests are sketched after the command)
nohup ./filebeat -e -c filebeat.yml -d "publish" 1>/dev/null 2>&1 &
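Filebeat 6.x can validate its configuration and test connectivity to the Logstash output before you leave it running in the background:

./filebeat test config -c filebeat.yml
./filebeat test output -c filebeat.yml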

2. Windows

  • On Windows, simply extract the archive, as shown below.
    (screenshot)
  • Edit the configuration file filebeat.yml
filebeat.prospectors:
  - input_type: log
    encoding: gbk
    paths:
      - C:/bankapp/switch/gtms-switch-center/com.pactera.jep.log.biz*.log
    multiline:
      pattern: '^gtms-switch-center'
      negate: true
      match: after
      max_lines: 200
      timeout: 20s
    fields:
      env: "prd001"
      appid: "switch"
      appname: "switch"
      hostip: "192.168.10.15"
    reload.enabled: true
    reload.period: 2s
    fields_under_root: true
output.logstash:
  hosts: ["192.168.10.11:10515" ]
  enabled: true
  • Create the background startup script startFilebeat.vbs
set ws=WScript.CreateObject("WScript.Shell") 
ws.Run "filebeat.exe -e -c filebeat.yml",0
  • Start: double-click the startFilebeat.vbs script.
    (screenshot)

3.4 View logs in Grafana

Use Grafana to view the logs: you can query the relevant log entries with your own filter conditions (keywords, time range, and so on), as shown below. An equivalent query against the Loki HTTP API is sketched after the screenshots.
(screenshots)
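The same kind of query can be issued directly against Loki's HTTP API, which is handy for scripting. This sketch is run on the Loki host and assumes the appname field set in Filebeat is carried through as a Loki label, as the label filters used later in this guide suggest:

curl -G -s 'http://localhost:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={appname="switch"} |= "ERROR"' \
  --data-urlencode 'limit=10'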

Section 4: Custom Monitoring

Custom monitoring lets you use your own scripts to push whatever metrics you need to Pushgateway; Prometheus then scrapes and stores them, and Grafana is used to view them.

4.1 Pushgateway

1. Deploy Pushgateway

tar -xvf pushgateway-1.4.2.linux-amd64.tar.gz 
cd pushgateway-1.4.2.linux-amd64/

Start

nohup ./pushgateway --web.listen-address=:48888 1>nohup.log 2>&1 &
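To verify that the Pushgateway accepts data, push a throwaway metric and read it back (the metric and job names here are arbitrary):

echo "smoke_test_metric 1" | curl --data-binary @- http://192.168.10.10:48888/metrics/job/smoke_test
curl -s http://192.168.10.10:48888/metrics | grep smoke_test_metric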

2. Integrate Pushgateway with Prometheus

  • Edit the configuration file
    vi prometheus.yml
  - job_name: 'pushgateway'
    static_configs:
      - targets: ['192.168.10.10:48888']
        labels:
          instance: pushgateway
  • Restart Prometheus
nohup ./prometheus --config.file=./prometheus.yml --web.listen-address=:49800 1>nohup.log 2>&1 &

Note: stop the running Prometheus process first, then start it again.
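Once Prometheus is back up, you can confirm the new job is being scraped by querying its HTTP API for the pushgateway target:

curl -G -s http://192.168.10.12:49800/api/v1/query --data-urlencode 'query=up{job="pushgateway"}'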

4.2 Monitoring the JVM

1. Write the JVM monitoring script and run it
Write the script

vi jvm_stat_exporter.sh
#!/bin/ksh
echo  "start ..."
#JAVA_PROCESS_LIST=`jps | grep -v " Jps$" | grep -v " Jstat$"` 
#echo $JAVA_PROCESS_LIST
HOST_IP=`ifconfig -a|grep inet|grep -v 127.0.0.1|grep -v 192.168|grep -v inet6|awk '{print $2}'|tr -d "addr:"`
#echo  "$HOST_IP"
push_jvm_stat()
{
  line=$1
  #echo $line
  PID=`echo $line | cut -d ' ' -f 1`
  PNAME=`echo $line | cut -d ' ' -f 2`
  #echo "PID:$PID,HOST_IP:$HOST_IP,PNAME:$PNAME"

  GC_LINE=`jstat -gc $PID | tail -1`
  #echo "$GC_LINE"
  # S0C    S1C    S0U    S1U      EC       EU        OC         OU       MC     MU    CCSC   CCSU   YGC     YGCT    FGC    FGCT     GCT
  # S0C
  S0C=`echo $GC_LINE | cut -d ' ' -f 1`
  S1C=`echo $GC_LINE | cut -d ' ' -f 2`
  S0U=`echo $GC_LINE | cut -d ' ' -f 3`
  S1U=`echo $GC_LINE | cut -d ' ' -f 4`
  EC=`echo $GC_LINE | cut -d ' ' -f 5`
  EU=`echo $GC_LINE | cut -d ' ' -f 6`
  OC=`echo $GC_LINE | cut -d ' ' -f 7`
  OU=`echo $GC_LINE | cut -d ' ' -f 8`
  MC=`echo $GC_LINE | cut -d ' ' -f 9`
  MU=`echo $GC_LINE | cut -d ' ' -f 10`
  CCSC=`echo $GC_LINE | cut -d ' ' -f 11`
  CCSU=`echo $GC_LINE | cut -d ' ' -f 12`
  YGC=`echo $GC_LINE | cut -d ' ' -f 13`
  YGCT=`echo $GC_LINE | cut -d ' ' -f 14`
  FGC=`echo $GC_LINE | cut -d ' ' -f 15`
  FGCT=`echo $GC_LINE | cut -d ' ' -f 16`
  GCT=`echo $GC_LINE | cut -d ' ' -f 17`
  #echo $S0C $S1C $S0U    $S1U      $EC       $EU        $OC         $OU       $MC     $MU    $CCSC   $CCSU   $YGC     $YGCT    $FGC    $FGCT     $GCT
  #echo "******* $HOST_IP $PNAME *******"
  cat <<EOF | curl --data-binary @- http://192.168.10.10:48888/metrics/job/test_jvm_job/instance/${HOST_IP}_$PNAME
  # TYPE jvm_s0c gauge
  jvm_s0c{processname="$PNAME",hostip="$HOST_IP"} $S0C
  # TYPE jvm_s1c gauge
  jvm_s1c{processname="$PNAME",hostip="$HOST_IP"} $S1C
  # TYPE jvm_s0u gauge
  jvm_s0u{processname="$PNAME",hostip="$HOST_IP"} $S0U
  # TYPE jvm_s1u gauge
  jvm_s1u{processname="$PNAME",hostip="$HOST_IP"} $S1U
  # TYPE jvm_ec gauge
  jvm_ec{processname="$PNAME",hostip="$HOST_IP"} $EC
  # TYPE jvm_eu gauge
  jvm_eu{processname="$PNAME",hostip="$HOST_IP"} $EU
  # TYPE jvm_oc gauge
  jvm_oc{processname="$PNAME",hostip="$HOST_IP"} $OC
  # TYPE jvm_ou gauge
  jvm_ou{processname="$PNAME",hostip="$HOST_IP"} $OU
  # TYPE jvm_mc gauge
  jvm_mc{processname="$PNAME",hostip="$HOST_IP"} $MC
  # TYPE jvm_mu gauge
  jvm_mu{processname="$PNAME",hostip="$HOST_IP"} $MU
  # TYPE jvm_ccsc gauge
  jvm_ccsc{processname="$PNAME",hostip="$HOST_IP"} $CCSC
  # TYPE jvm_ccsu gauge
  jvm_ccsu{processname="$PNAME",hostip="$HOST_IP"} $CCSU
  # TYPE jvm_ygc counter
  jvm_ygc{processname="$PNAME",hostip="$HOST_IP"} $YGC
  # TYPE jvm_ygct counter
  jvm_ygct{processname="$PNAME",hostip="$HOST_IP"} $YGCT
  # TYPE jvm_fgc counter
  jvm_fgc{processname="$PNAME",hostip="$HOST_IP"} $FGC
  # TYPE jvm_fgct counter
  jvm_fgct{processname="$PNAME",hostip="$HOST_IP"} $FGCT
  # TYPE jvm_gct counter
  jvm_gct{processname="$PNAME",hostip="$HOST_IP"} $GCT
EOF
 # echo "******* $PNAME 2 *******"
}
while [ 1 = 1 ]
do
  jps |grep -v " Jps$" | grep -v " Jstat$" | while read line_jps
  do
    push_jvm_stat "$line_jps"
  done
  echo "`date` pushed" > ./lastpushed.log
  sleep 5
done

Make the script executable and run it

chmod +x  jvm_stat_exporter.sh
./jvm_stat_exporter.sh
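In practice the collection loop should keep running after you log out; a minimal way to background it, following the nohup pattern used throughout this guide:

nohup ./jvm_stat_exporter.sh >/dev/null 2>&1 &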

2. View the JVM metrics

  • View them in the Pushgateway UI, as shown below
    (screenshot)
  • View the metrics in Grafana, as shown below
    (screenshot)

Section 5: Service Monitoring

5.1 Deploy Blackbox Exporter

1. Install

tar -xvf blackbox_exporter-0.19.0.linux-amd64.tar.gz
cd blackbox_exporter-0.19.0.linux-amd64/

2. Start

nohup ./blackbox_exporter &

3. Access
Open http://192.168.10.10:9115 in a browser.
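Before wiring it into Prometheus, a probe can be exercised by hand. tcp_connect and http_2xx are modules defined in the default blackbox.yml shipped with the exporter; the targets below are just hosts from this guide:

curl -s 'http://192.168.10.10:9115/probe?module=tcp_connect&target=192.168.10.14:22'
curl -s 'http://192.168.10.10:9115/probe?module=http_2xx&target=http://192.168.10.15:8080'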

5.2 Monitor a port

1. Configure Prometheus with blackbox_exporter to monitor port 22

  - job_name: 'prometheus_port_status'
    metrics_path: /probe
    params:
        module: [tcp_connect]
    static_configs:
        - targets: ['192.168.10.14:22']
          labels:
            instance: port_22_ssh
            hostip: 192.168.10.14
            group: 'tcp'
    relabel_configs:
        - source_labels: [__address__]
          target_label: __param_target
        - target_label: __address__
          replacement: 192.168.10.10:9115

2. Restart Prometheus

nohup ./prometheus --config.file=./prometheus.yml --web.listen-address=:49800 1>nohup.log 2>&1 &

Note: stop the running Prometheus process first, then start it again.

5.3 Monitor HTTP

1. Configure Prometheus with blackbox_exporter to monitor HTTP endpoints

  - job_name: web_status
    metrics_path: /probe
    params:
        module: [http_2xx]
    static_configs:
        - targets: ['http://192.168.10.15:8080']
          labels:
            instance: starweb
            hostip: 192.168.10.15
            group: 'web'

    relabel_configs:
        - source_labels: [__address__]
          target_label: __param_target
        - target_label: __address__
          replacement: 192.168.10.10:9115

2. Restart Prometheus

nohup ./prometheus --config.file=./prometheus.yml --web.listen-address=:49800 1>nohup.log 2>&1 &

Note: stop the running Prometheus process first, then start it again.

Section 6: Alerting

6.1 Deploy Alertmanager

wget https://github.com/prometheus/alertmanager/releases/download/v0.24.0/alertmanager-0.24.0.linux-amd64.tar.gz
tar -xvf alertmanager-0.24.0.linux-amd64.tar.gz
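A minimal way to launch Alertmanager once it is extracted, following the nohup pattern used elsewhere in this guide (the configuration file is the alertmanager.yml bundled in the archive):

cd alertmanager-0.24.0.linux-amd64/
nohup ./alertmanager --config.file=alertmanager.yml 1>nohup.log 2>&1 &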

Open http://192.168.10.16:9093/#/alerts in a browser.
(screenshot)

6.2 Email alert configuration

1. Configuration

vi alertmanager.yml
global:
  resolve_timeout: 5m      # timeout for treating an alert as resolved; default 5m
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: '17585217552@163.com'  # sender address
  smtp_auth_username: '17585217552@163.com'  # sender account username
  smtp_auth_password: 'HUSEUWWWYAYZOENXZET' # sender authorization code (SMTP app password)
  smtp_require_tls: false
templates:
  - 'conf/*.tmpl'
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 30s
  repeat_interval: 4h
  receiver: 'default'
receivers:
- name: 'default'
  email_configs:
  - to: '473145009@qq.com'    # recipient address
    send_resolved: true
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
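Alertmanager ships with amtool, which can validate this file before you (re)start the service; run it from the extracted directory:

./amtool check-config alertmanager.yml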

2. Alert template

vi mail.tmpl
{{ define "email.from" }}2877364346@qq.com{{ end }}
{{ define "email.to" }}2877364346@qq.com{{ end }}
{{ define "email.to.html" }}
{{ range .Alerts }}
=========start==========<br>
Alerting program: prometheus_alert<br>
Severity: {{ .Labels.severity }}<br>
Alert type: {{ .Labels.alertname }}<br>
Affected host: {{ .Labels.instance }}<br>
Summary: {{ .Annotations.summary }}<br>
Triggered at: {{ .StartsAt.Format "2006-01-02 15:04:05" }} <br>
=========end==========<br>
{{ end }}
{{ end }}

6.3 WeChat alerts

1. Configuration

2. WeChat alert template

{{ define "wechat.default.message" }}
{{ if gt (len .Alerts.Firing) 0 -}}
Alerts Firing:
{{ range .Alerts }}
Severity: {{ .Labels.severity }}
Alert type: {{ .Labels.alertname }}
Affected host: {{ .Labels.instance }}
Summary: {{ .Annotations.summary }}
Details: {{ .Annotations.description }}
Triggered at: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
{{- end }}
{{- end }}
{{ if gt (len .Alerts.Resolved) 0 -}}
Alerts Resolved:
{{ range .Alerts }}
Severity: {{ .Labels.severity }}
Alert type: {{ .Labels.alertname }}
Affected host: {{ .Labels.instance }}
Summary: {{ .Annotations.summary }}
Triggered at: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
Resolved at: {{ .EndsAt.Format "2006-01-02 15:04:05" }}
{{- end }}
{{- end }}
Alert link:
{{ template "__alertmanagerURL" . }}
{{- end }}

6.4 Integrate Alertmanager with Prometheus

prometheus.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - 192.168.10.16:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - rules/*.yml
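After editing prometheus.yml, the file can be validated with promtool, which is bundled in the Prometheus tarball; run it from the Prometheus directory before restarting:

./promtool check config prometheus.yml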

6.5 Configure Prometheus alerting rules

1. host_rule.yml

groups:
  - name: host-status-alerts
    rules:
    - alert: HostDown
      expr: up == 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "{{$labels.instance}}: host is down"
        description: "{{$labels.instance}}: host has been unreachable for more than 1 minute"
    - alert: ServiceDown
      expr: probe_success{} == 0
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "{{$labels.instance}}: service is down"
        description: "{{$labels.instance}}: service has been down for more than 2 minutes"
    - alert: LinuxDiskUsage
      expr: 100 - (node_filesystem_avail_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100) > 80
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "{{$labels.mountpoint}} disk partition usage is too high!"
        description: "{{$labels.mountpoint}} disk partition usage is above 80% (currently {{$value}}%)"
    - alert: LinuxMemoryUsage
      expr: (100 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100) > 80
      for: 3m
      labels:
        severity: warning
      annotations:
        summary: "{{$labels.instance}} memory usage is too high!"
        description: "{{$labels.instance}} memory usage is above 80% (currently {{$value}}%)"
    - alert: LinuxCpuUsage
      expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: Host high CPU load (instance {{ $labels.instance }})
        description: "CPU load is > 80%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
    - alert: WindowsDiskUsage
      expr: (sum(windows_logical_disk_size_bytes{volume!~"Harddisk.*"}) by (group,hostip) - sum(windows_logical_disk_free_bytes{volume!~"Harddisk.*"}) by (group,hostip)) / sum(windows_logical_disk_size_bytes{volume!~"Harddisk.*"}) by (group,hostip) * 100 > 80
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "{{$labels.hostip}} disk usage is too high!"
        description: "{{$labels.hostip}} disk usage is above 80% (currently {{$value}}%)"
    - alert: WindowsMemoryUsage
      expr: sum by (group,hostip) ((windows_cs_physical_memory_bytes{} - windows_os_physical_memory_free_bytes{}) / windows_cs_physical_memory_bytes{} * 100) > 80
      for: 3m
      labels:
        severity: warning
      annotations:
        summary: "{{$labels.hostip}} memory usage is too high!"
        description: "{{$labels.hostip}} memory usage is above 80% (currently {{$value}}%)"
    - alert: WindowsCpuUsage
      expr: 100 - (avg by (group, hostip) (irate(windows_cpu_time_total{mode="idle"}[1m])) * 100) > 80
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: Host high CPU load (instance {{ $labels.instance }})
        description: "CPU load is > 80%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
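The rule file can be checked with promtool before Prometheus is restarted (run from the Prometheus directory):

./promtool check rules rules/host_rule.yml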

2. View the rules
http://192.168.10.12:49800/classic/alerts
(screenshot)

6.6 Integrate Loki with Alertmanager

1. Edit the Loki configuration file loki-local-config.yml and add the following ruler section:

ruler:
  alertmanager_url: http://10.0.14.90:9093           
  enable_alertmanager_v2: true
  enable_api: true                              
  enable_sharding: true            
  ring:                                 
        kvstore:
          store: inmemory
  rule_path: /bankapp/loki/rules-temp           
  storage:                        
        type: local
        local:
          directory: /bankapp/loki/rules          
  flush_period: 1m

Notes:
alertmanager_url: the Alertmanager address
enable_api: enables the Loki ruler (rules) API
enable_sharding: shards the rules so multiple ruler instances can be run
ring: consistent-hash ring configuration for the ruler service, used for multi-instance sharding
rule_path: temporary storage path for rule files
storage: rule storage; local storage and object storage are supported
directory: path where the rule files are stored
flush_period: how often the rules are loaded

6.7 Configure Loki alerting rules

1. Create the directories

mkdir -p /bankapp/loki/{rules-temp,rules}
mkdir /bankapp/loki/rules/fake

Note: create a folder named fake under /bankapp/loki/rules and put the rule files in it. Why a folder named fake? In a single-tenant Loki deployment, fake is the default tenant ID; in a multi-tenant system, folders with other tenant names under /bankapp/loki/rules work as well.
2. Configure the rules

cd /bankapp/loki/rules/fake
vi log-alert.yml
groups:
  - name: bank_connection_status
    rules:
    - alert: GYCB_status
      expr: sum(rate({bankno="GYCB"} |= "LOGDEBUG" [5m])) by (ip) > 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: GYCB status is down[5m]

3. Restart Loki and check that the rules are loaded
(screenshot)
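Besides the UI, the ruler API (enabled by enable_api: true above) can list the rule groups Loki has loaded; run this on the Loki host:

curl -s http://localhost:3100/loki/api/v1/rules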

4. The resulting alert email looks like this

(screenshot)

Other (full configuration files)

prometheus.yml

# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ["localhost:49800"]

  - job_name: 'pushgateway'
    static_configs:
      - targets: ['localhost:48888']
        labels:
          instance: pushgateway

  - job_name: "NODE"
    static_configs:
      # PAAS
      - targets: ['10.0.14.206:49999']
        labels:
          env: prd001
          group: PAAS
          hostip: 10.0.14.206

      - targets: ['10.0.14.205:49999']
        labels:
          env: prd001
          group: APP
          hostip: 10.0.14.205

      # NGINX
      - targets: ['10.0.14.200:49999']
        labels:
          env: prd001
          group: NGINX
          hostip: 10.0.14.200

      - targets: ['10.0.14.201:49999']
        labels:
          env: prd001
          group: NGINX
          hostip: 10.0.14.201

      - targets: ['10.0.14.202:49999']
        labels:
          env: prd001
          group: NGINX
          hostip: 10.0.14.202

      - targets: ['10.0.14.203:49999']
        labels:
          env: prd001
          group: NGINX
          hostip: 10.0.14.203

      # SWITCH
      - targets: ['10.0.14.209:49999']
        labels:
          env: prd001
          group: SWITCH
          hostip: 10.0.14.209

      - targets: ['10.0.14.210:49999']
        labels:
          env: prd001
          group: SWITCH
          hostip: 10.0.14.210

      - targets: ['10.0.14.211:49999']
        labels:
          env: prd001
          group: SWITCH
          hostip: 10.0.14.211

      - targets: ['10.0.14.214:49999']
        labels:
          env: prd001
          group: LOGSTASH
          hostip: 10.0.14.214

      - targets: ['10.0.14.215:49999']
        labels:
          env: prd001
          group: LOGSTASH
          hostip: 10.0.14.215

      - targets: ['10.0.14.221:49999']
        labels:
          env: prd001
          group: SWITCH
          hostip: 10.0.14.221

      - targets: ['10.0.14.216:49999']
        labels:
          env: prd001
          group: TOOLS
          hostip: 10.0.14.216

  - job_name: 'prometheus_port_status'
    metrics_path: /probe
    params:
        module: [tcp_connect]
    static_configs:
        - targets: ['10.0.14.222:6789']
          labels:
            instance: port_6789_PINGAN
            hostip: 10.0.14.222
            group: 'tcp'
    relabel_configs:
        - source_labels: [__address__]
          target_label: __param_target
        - target_label: __address__
          replacement: 127.0.0.1:9115

  - job_name: web_status
    metrics_path: /probe
    params:
        module: [http_2xx]
    static_configs:
        - targets: ['https://star.moutai.com.cn/nginx_status']
          labels:
            instance: starweb
            hostip: 10.0.14.241
            group: 'web'

        - targets: ['http://10.0.14.226:8046/actuator/health']
          labels:
            instance: gtms-service-business
            hostip: 10.0.14.226
            group: 'web'
            
        - targets: ['http://10.0.14.225:8043/actuator/health']
          labels:
            instance: gtms-service-gateway
            hostip: 10.0.14.225
            group: 'web'
        - targets: ['http://10.0.14.226:8043/actuator/health']
          labels:
            instance: gtms-service-gateway
            hostip: 10.0.14.226
            group: 'web'
            
        - targets: ['http://10.0.14.204:8047/actuator/health']
          labels:
            instance: gtms-service-job
            hostip: 10.0.14.204
            group: 'web'
        - targets: ['http://10.0.14.205:8047/actuator/health']
          labels:
            instance: gtms-service-job
            hostip: 10.0.14.205
            group: 'web'
            
        - targets: ['http://10.0.14.225:9080/star-api/actuator/health']
          labels:
            instance: ijep-router-zuul-star-api
            hostip: 10.0.14.225
            group: 'web'
        - targets: ['http://10.0.14.226:9080/star-api/actuator/health']
          labels:
            instance: ijep-router-zuul-star-api
            hostip: 10.0.14.226
            group: 'web'

    relabel_configs:
        - source_labels: [__address__]
          target_label: __param_target
        - target_label: __address__
          replacement: 127.0.0.1:9115

loki-config.yaml

auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
    final_sleep: 0s
  chunk_idle_period: 5m
  chunk_retain_period: 30s

schema_config:
  configs:
  - from: 2020-05-15
    store: boltdb-shipper
    object_store: filesystem
    schema: v11
    index:
      prefix: index_
      period: 24h

storage_config:
  boltdb_shipper:
   active_index_directory: /bankapp/loki/data/index
   cache_location: /bankapp/loki/data/index/cache
   shared_store: filesystem

  filesystem:
    directory: /bankapp/loki/data/chunks

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 336h
  per_stream_rate_limit: "30MB"
  ingestion_rate_mb: 50
  retention_period: 336h

compactor:
  working_directory: /bankapp/loki/data/compactor
  shared_store: filesystem
  compaction_interval: 10m
  retention_enabled: true

FAQ

1. Loki is not receiving logs, or Prometheus cannot scrape metrics

Solution:
Check whether the firewall rules allow the required ports, or simply disable the firewall (not recommended in production). A sketch for opening individual ports follows.
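On CentOS 7 the usual alternative to disabling firewalld is to open only the ports this stack uses, for example:

firewall-cmd --permanent --add-port=3100/tcp    # Loki
firewall-cmd --permanent --add-port=49800/tcp   # Prometheus
firewall-cmd --permanent --add-port=49999/tcp   # node_exporter
firewall-cmd --reload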
