Prometheus Monitoring in Practice

Section 1: Environment and Software Versions

1.1 Operating system environment

Host IP         OS                    Deployed software                          Notes
192.168.10.10   CentOS 7.9            Grafana, Pushgateway, Blackbox Exporter    Monitoring UI
192.168.10.11   CentOS 7.9            Loki                                       Log storage
192.168.10.12   CentOS 7.9            Prometheus                                 Metrics storage
192.168.10.13   CentOS 7.9            Logstash                                   Log filtering
192.168.10.14   CentOS 7.9            Filebeat, node_exporter                    Log and metrics collection
192.168.10.15   Windows Server 2016   Filebeat, windows_exporter                 Log and metrics collection
192.168.10.16   CentOS 7.9            Alertmanager                               Alerting

1.2 Software versions

Software            Version   Notes
Grafana             8.3.3     Monitoring UI
Loki                2.5.0     Log storage
Prometheus          2.32.1    Metrics storage
Pushgateway         1.4.2     Receives custom metrics
Filebeat            6.4.3     Log collection agent
node_exporter       1.3.1     Metrics collection agent (Linux)
windows_exporter    0.17.0    Metrics collection agent (Windows)
Logstash            7.16.2    Log filtering
Blackbox Exporter   0.19.0    Probes websites, HTTP/TCP/UDP, etc.
Alertmanager        0.24.0    Alerting

1.3 System initialization

1. Disable the firewall

systemctl stop firewalld
systemctl disable firewalld

2. Disable SELinux

setenforce 0
vim /etc/selinux/config
SELINUX=disabled

setenforce 0 disables SELinux immediately; setting SELINUX=disabled in /etc/selinux/config makes the change persist across reboots.

1.4 Architecture diagram

(architecture diagram)

Section 2: Monitoring Platform Deployment

2.1 Server-side deployment

1. Grafana
Note: run on host 192.168.10.10.
Install

tar -xvf grafana-8.3.3.linux-amd64.tar
cd grafana-8.3.3/

Start

nohup ./bin/grafana-server > ./log/grafana.log &

Open http://192.168.10.10:3000 in a browser.
Default username/password: admin/admin
(screenshot)
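To confirm Grafana is up without opening a browser, you can also hit its health endpoint (a quick check against the default port 3000 used above):

curl -s http://192.168.10.10:3000/api/health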

2. Prometheus
Note: run on host 192.168.10.12.
Install

tar -xvf prometheus-2.32.1.linux-amd64.tar
cd prometheus-2.32.1.linux-amd64/

Start

nohup ./prometheus --config.file=./prometheus.yml --web.listen-address=:49800 1>nohup.log 2>&1 &

Open http://192.168.10.12:49800 in a browser.
(screenshot)
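Prometheus exposes simple health and readiness endpoints, which make for a quick check from the shell (same address and port as above):

curl -s http://192.168.10.12:49800/-/healthy
curl -s http://192.168.10.12:49800/-/ready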

3. Integrate Prometheus with Grafana
Add Prometheus as a data source in Grafana; the steps are shown in the screenshots below, and a scripted alternative via the Grafana HTTP API follows them.
(screenshots)
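If you prefer to script this step, the same data source can be created through Grafana's HTTP API; this is a sketch that assumes the default admin/admin credentials and the Prometheus address used above:

curl -s -u admin:admin -H 'Content-Type: application/json' \
  -X POST http://192.168.10.10:3000/api/datasources \
  -d '{"name":"Prometheus","type":"prometheus","url":"http://192.168.10.12:49800","access":"proxy","isDefault":true}'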

2.2 Client-side deployment

1. Linux

  • Install
    Deploy node_exporter; simply extract the tarball.
tar -xvf node_exporter-1.3.1.linux-amd64.tar.gz
cd node_exporter-1.3.1.linux-amd64/
  • Start (a quick verification is sketched after the command)
nohup ./node_exporter --web.listen-address=:49999 --log.format=logfmt --collector.textfile.directory=./collection.textfile.directory/ --collector.ntp.server-is-local  >/dev/null &
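Once it is running, a quick spot check confirms that metrics are being served (any node_* metric will do):

curl -s http://192.168.10.14:49999/metrics | grep -m1 node_cpu_seconds_total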

2. Windows

  • Install
    On Windows, simply extract the archive.
    (screenshot)
  • Create the startup script startNode.bat
start /b "" .\windows_exporter-0.17.0-amd64.exe --telemetry.addr=":9182" --collector.textfile.directory="./collection.textfile.directory/"
  • Start
    Double-click the startup script, as shown below.
    (screenshot)

3. Configure Prometheus

  • Edit the configuration file
    vi prometheus.yml
  - job_name: "NODE"
    static_configs:
      - targets: ['192.168.10.14:49999']
        labels:
          env: prd001
          group: PAAS
          hostip: 192.168.10.14

      - targets: ['192.168.10.15:9182']
        labels:
          env: prd001
          group: PAAS
          hostip: 192.168.10.15
  • Restart Prometheus
nohup ./prometheus --config.file=./prometheus.yml --web.listen-address=:49800 1>nohup.log 2>&1 &
  • Check the targets in Prometheus (an API check is sketched below)
    (screenshots)
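Target health can also be read from Prometheus's HTTP API (same address as above; the grep just pulls the health fields out of the JSON):

curl -s http://192.168.10.12:49800/api/v1/targets | grep -o '"health":"[^"]*"'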
4. Configure Grafana and view the dashboards
  • Import dashboards
    In Grafana, import the Windows and Linux dashboards (Windows dashboard ID: 10467, Linux dashboard ID: 11074); the steps are shown below.
    (screenshots)
  • View the Linux dashboard
    (screenshots)
  • View the Windows dashboard
    (screenshots)

Section 3: Log Platform Deployment

3.1 Install the server (Loki)

1. Install

tar -xvf loki.tar.gz
cd loki/

Start

nohup ./loki-linux-amd64 -config.file=config.yaml 1> ./log/loki.log 2> ./log/loki_error.log &
ss -tunlp | grep 3100
tcp    LISTEN     0      128    [::]:3100               [::]:*                   users:(("loki-linux-amd6",pid=8422,fd=10))
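Loki also exposes a readiness endpoint; running this on the Loki host confirms it has come up:

curl -s http://localhost:3100/ready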

2. Configure Grafana (add Loki as a data source)
(screenshots)

3.2 Deploy Logstash

tar -xvf logstash-7.16.2.tar
cd logstash-7.16.2/
bin/logstash-plugin install file:///bankapp/logstash/plugin/logstash-codec-plain.zip 
bin/logstash-plugin install file:///bankapp/logstash/plugin/logstash-output-loki.zip
vi pipelines/log_collect.conf
input{
   beats {
       port => 10515
   }  
}
input{
   http {
       host => "0.0.0.0"
       port => 10516
       type => "healthcheck"
   }
}
filter {
    grok{                                                                         
          match => {
               "message" => ".*\[INFO\] \[(?<funcname>(.*?)):.*"
          }
    }
    grok {
        match => ["message", "%{TIMESTAMP_ISO8601:logdate}"]
    }
    if [appname] == "switch" {
        date {
            match => ["logdate", "yyyy-MM-dd HH:mm:ss.SSS"]
            target => "@timestamp"  # the default target is "@timestamp"
        }
    }else {
        date {
            match => ["logdate", "yyyy-MM-dd'T'HH:mm:ss.SSS"]
            target => "@timestamp"  # the default target is "@timestamp"
        }
    }
    mutate {
        remove_field => ["tags"]
        remove_field => ["offset"]
        remove_field => ["logdate"]
    }
}
output {
    if [type] == "healthcheck" {
    }else{
        loki {
            url => "http://192.168.10.10:3100/loki/api/v1/push"
            batch_size => 112640 #112.64 kilobytes
            retries => 5
            min_delay => 3
            max_delay => 500
            message_field => "message"
        }
    }
}

Start

nohup ./bin/logstash -f ./pipelines/log_collect.conf 1>nohup.log 2>&1 &
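Before starting (or after any later edit to the pipeline), the configuration can be syntax-checked without actually running it:

./bin/logstash -f ./pipelines/log_collect.conf --config.test_and_exit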

3.3 Deploy the Filebeat client

The log format is as follows:

gtms-switch-center 2022-04-19 17:28:14.616 [http-nio-8080-exec-989] INFO  c.p.switchcenter.web.controller.SwitchController

1. Linux

  • Install
tar -xvf filebeat.tar.gz
cd filebeat/
  • Edit the configuration file
vi filebeat.yml
filebeat.prospectors:
  - input_type: log
    paths:
      - /bankapp/switch/gtms-switch-center/com.pactera.jep.log.biz*.log
    multiline:
      pattern: '^gtms-switch-center'
      negate: true
      match: after
      max_lines: 200
      timeout: 20s
    fields:
      env: "prd001"
      appid: "switch"
      appname: "switch"
      hostip: "192.168.10.14"
    reload.enabled: true
    reload.period: 2s
    fields_under_root: true
output.logstash:
  hosts: ["192.168.10.11:10515" ]
  enabled: true
  • Start (configuration and output connectivity tests are sketched after the command)
nohup ./filebeat -e -c filebeat.yml -d "publish" 1>/dev/null 2>&1 &
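Filebeat 6.x can validate its configuration and test connectivity to the Logstash output before you leave it running in the background:

./filebeat test config -c filebeat.yml
./filebeat test output -c filebeat.yml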

2. Windows

  • On Windows, simply extract the archive, as shown below.
    (screenshot)
  • Edit the configuration file filebeat.yml
filebeat.prospectors:
  - input_type: log
    encoding: gbk
    paths:
      - C:/bankapp/switch/gtms-switch-center/com.pactera.jep.log.biz*.log
    multiline:
      pattern: '^gtms-switch-center'
      negate: true
      match: after
      max_lines: 200
      timeout: 20s
    fields:
      env: "prd001"
      appid: "switch"
      appname: "switch"
      hostip: "192.168.10.15"
    reload.enabled: true
    reload.period: 2s
    fields_under_root: true
output.logstash:
  hosts: ["192.168.10.11:10515" ]
  enabled: true
  • Create the background startup script startFilebeat.vbs
set ws=WScript.CreateObject("WScript.Shell") 
ws.Run "filebeat.exe -e -c filebeat.yml",0
  • Start: double-click the startFilebeat.vbs script.
    (screenshot)

3.4 View logs in Grafana

Use Grafana to view the logs: you can query the relevant log entries with your own filter conditions (keywords, time range, and so on), as shown below. An equivalent query against the Loki HTTP API is sketched after the screenshots.
(screenshots)
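The same kind of query can be issued directly against Loki's HTTP API, which is handy for scripting. This sketch is run on the Loki host and assumes the appname field set in Filebeat is carried through as a Loki label, as the label filters used later in this guide suggest:

curl -G -s 'http://localhost:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={appname="switch"} |= "ERROR"' \
  --data-urlencode 'limit=10'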

Section 4: Custom Monitoring

Custom monitoring lets you use your own scripts to push whatever metrics you need to Pushgateway; Prometheus then scrapes and stores them, and Grafana is used to view them.

4.1 Pushgateway

1. Deploy Pushgateway

tar -xvf pushgateway-1.4.2.linux-amd64.tar.gz 
cd pushgateway-1.4.2.linux-amd64/

Start

nohup ./pushgateway --web.listen-address=:48888 1>nohup.log 2>&1 &
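To verify that the Pushgateway accepts data, push a throwaway metric and read it back (the metric and job names here are arbitrary):

echo "smoke_test_metric 1" | curl --data-binary @- http://192.168.10.10:48888/metrics/job/smoke_test
curl -s http://192.168.10.10:48888/metrics | grep smoke_test_metric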

2. Integrate Pushgateway with Prometheus

  • Edit the configuration file
    vi prometheus.yml
  - job_name: 'pushgateway'
    static_configs:
      - targets: ['192.168.10.10:48888']
        labels:
          instance: pushgateway
  • Restart Prometheus
nohup ./prometheus --config.file=./prometheus.yml --web.listen-address=:49800 1>nohup.log 2>&1 &

Note: stop the running Prometheus process first, then start it again.
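Once Prometheus is back up, you can confirm the new job is being scraped by querying its HTTP API for the pushgateway target:

curl -G -s http://192.168.10.12:49800/api/v1/query --data-urlencode 'query=up{job="pushgateway"}'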

4.2 Monitoring the JVM

1. Write the JVM monitoring script and run it
Write the script

vi jvm_stat_exporter.sh
#!/bin/ksh
echo  "start ..."
#JAVA_PROCESS_LIST=`jps | grep -v " Jps$" | grep -v " Jstat$"` 
#echo $JAVA_PROCESS_LIST
HOST_IP=`ifconfig -a|grep inet|grep -v 127.0.0.1|grep -v 192.168|grep -v inet6|awk '{print $2}'|tr -d "addr:"`
#echo  "$HOST_IP"
push_jvm_stat()
{
  line=$1
  #echo $line
  PID=`echo $line | cut -d ' ' -f 1`
  PNAME=`echo $line | cut -d ' ' -f 2`
  #echo "PID:$PID,HOST_IP:$HOST_IP,PNAME:$PNAME"

  GC_LINE=`jstat -gc $PID | tail -1`
  #echo "$GC_LINE"
  # S0C    S1C    S0U    S1U      EC       EU        OC         OU       MC     MU    CCSC   CCSU   YGC     YGCT    FGC    FGCT     GCT
  # S0C
  S0C=`echo $GC_LINE | cut -d ' ' -f 1`
  S1C=`echo $GC_LINE | cut -d ' ' -f 2`
  S0U=`echo $GC_LINE | cut -d ' ' -f 3`
  S1U=`echo $GC_LINE | cut -d ' ' -f 4`
  EC=`echo $GC_LINE | cut -d ' ' -f 5`
  EU=`echo $GC_LINE | cut -d ' ' -f 6`
  OC=`echo $GC_LINE | cut -d ' ' -f 7`
  OU=`echo $GC_LINE | cut -d ' ' -f 8`
  MC=`echo $GC_LINE | cut -d ' ' -f 9`
  MU=`echo $GC_LINE | cut -d ' ' -f 10`
  CCSC=`echo $GC_LINE | cut -d ' ' -f 11`
  CCSU=`echo $GC_LINE | cut -d ' ' -f 12`
  YGC=`echo $GC_LINE | cut -d ' ' -f 13`
  YGCT=`echo $GC_LINE | cut -d ' ' -f 14`
  FGC=`echo $GC_LINE | cut -d ' ' -f 15`
  FGCT=`echo $GC_LINE | cut -d ' ' -f 16`
  GCT=`echo $GC_LINE | cut -d ' ' -f 17`
  #echo $S0C $S1C $S0U    $S1U      $EC       $EU        $OC         $OU       $MC     $MU    $CCSC   $CCSU   $YGC     $YGCT    $FGC    $FGCT     $GCT
  #echo "******* $HOST_IP $PNAME *******"
  cat <<EOF | curl --data-binary @- http://192.168.10.10:48888/metrics/job/test_jvm_job/instance/${HOST_IP}_$PNAME
  # TYPE jvm_s0c gauge
  jvm_s0c{processname="$PNAME",hostip="$HOST_IP"} $S0C
  # TYPE jvm_s1c gauge
  jvm_s1c{processname="$PNAME",hostip="$HOST_IP"} $S1C
  # TYPE jvm_s0u gauge
  jvm_s0u{processname="$PNAME",hostip="$HOST_IP"} $S0U
  # TYPE jvm_s1u gauge
  jvm_s1u{processname="$PNAME",hostip="$HOST_IP"} $S1U
  # TYPE jvm_ec gauge
  jvm_ec{processname="$PNAME",hostip="$HOST_IP"} $EC
  # TYPE jvm_eu gauge
  jvm_eu{processname="$PNAME",hostip="$HOST_IP"} $EU
  # TYPE jvm_oc gauge
  jvm_oc{processname="$PNAME",hostip="$HOST_IP"} $OC
  # TYPE jvm_ou gauge
  jvm_ou{processname="$PNAME",hostip="$HOST_IP"} $OU
  # TYPE jvm_mc gauge
  jvm_mc{processname="$PNAME",hostip="$HOST_IP"} $MC
  # TYPE jvm_mu gauge
  jvm_mu{processname="$PNAME",hostip="$HOST_IP"} $MU
  # TYPE jvm_ccsc gauge
  jvm_ccsc{processname="$PNAME",hostip="$HOST_IP"} $CCSC
  # TYPE jvm_ccsu gauge
  jvm_ccsu{processname="$PNAME",hostip="$HOST_IP"} $CCSU
  # TYPE jvm_ygc counter
  jvm_ygc{processname="$PNAME",hostip="$HOST_IP"} $YGC
  # TYPE jvm_ygct counter
  jvm_ygct{processname="$PNAME",hostip="$HOST_IP"} $YGCT
  # TYPE jvm_fgc counter
  jvm_fgc{processname="$PNAME",hostip="$HOST_IP"} $FGC
  # TYPE jvm_fgct counter
  jvm_fgct{processname="$PNAME",hostip="$HOST_IP"} $FGCT
  # TYPE jvm_gct counter
  jvm_gct{processname="$PNAME",hostip="$HOST_IP"} $GCT
EOF
 # echo "******* $PNAME 2 *******"
}
while [ 1 = 1 ]
do
  jps |grep -v " Jps$" | grep -v " Jstat$" | while read line_jps
  do
    push_jvm_stat "$line_jps"
  done
  echo "`date` pushed" > ./lastpushed.log
  sleep 5
done

Make the script executable and run it

chmod +x  jvm_stat_exporter.sh
./jvm_stat_exporter.sh
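In practice the collection loop should keep running after you log out; a minimal way to background it, following the nohup pattern used throughout this guide:

nohup ./jvm_stat_exporter.sh >/dev/null 2>&1 &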

2. View the JVM metrics

  • View them in the Pushgateway UI, as shown below
    (screenshot)
  • View the metrics in Grafana, as shown below
    (screenshot)

Section 5: Service Monitoring

5.1 Deploy Blackbox Exporter

1. Install

tar -xvf blackbox_exporter-0.19.0.linux-amd64.tar.gz
cd blackbox_exporter-0.19.0.linux-amd64/

2. Start

nohup ./blackbox_exporter &

3. Access
Open http://192.168.10.10:9115 in a browser.
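Before wiring it into Prometheus, a probe can be exercised by hand. tcp_connect and http_2xx are modules defined in the default blackbox.yml shipped with the exporter; the targets below are just hosts from this guide:

curl -s 'http://192.168.10.10:9115/probe?module=tcp_connect&target=192.168.10.14:22'
curl -s 'http://192.168.10.10:9115/probe?module=http_2xx&target=http://192.168.10.15:8080'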

5.2 Monitor a port

1. Configure Prometheus with blackbox_exporter to monitor port 22

  - job_name: 'prometheus_port_status'
    metrics_path: /probe
    params:
        module: [tcp_connect]
    static_configs:
        - targets: ['192.168.10.14:22']
          labels:
            instance: port_22_ssh
            hostip: 192.168.10.14
            group: 'tcp'
    relabel_configs:
        - source_labels: [__address__]
          target_label: __param_target
        - target_label: __address__
          replacement: 192.168.10.10:9115

2. Restart Prometheus

nohup ./prometheus --config.file=./prometheus.yml --web.listen-address=:49800 1>nohup.log 2>&1 &

Note: stop the running Prometheus process first, then start it again.

5.3 Monitor HTTP

1. Configure Prometheus with blackbox_exporter to monitor HTTP endpoints

  - job_name: web_status
    metrics_path: /probe
    params:
        module: [http_2xx]
    static_configs:
        - targets: ['http://192.168.10.15:8080']
          labels:
            instance: starweb
            hostip: 192.168.10.15
            group: 'web'

    relabel_configs:
        - source_labels: [__address__]
          target_label: __param_target
        - target_label: __address__
          replacement: 192.168.10.10:9115

2. Restart Prometheus

nohup ./prometheus --config.file=./prometheus.yml --web.listen-address=:49800 1>nohup.log 2>&1 &

Note: stop the running Prometheus process first, then start it again.

Section 6: Alerting

6.1 Deploy Alertmanager

wget https://github.com/prometheus/alertmanager/releases/download/v0.24.0/alertmanager-0.24.0.linux-amd64.tar.gz
tar -xvf alertmanager-0.24.0.linux-amd64.tar.gz
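A minimal way to launch Alertmanager once it is extracted, following the nohup pattern used elsewhere in this guide (the configuration file is the alertmanager.yml bundled in the archive):

cd alertmanager-0.24.0.linux-amd64/
nohup ./alertmanager --config.file=alertmanager.yml 1>nohup.log 2>&1 &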

Open http://192.168.10.16:9093/#/alerts in a browser.
(screenshot)

6.2 Email alert configuration

1. Configuration

vi alertmanager.yml
global:
  resolve_timeout: 5m      # timeout for treating an alert as resolved; default 5m
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: '17585217552@163.com'  # sender address
  smtp_auth_username: '17585217552@163.com'  # sender account username
  smtp_auth_password: 'HUSEUWWWYAYZOENXZET' # sender authorization code (SMTP app password)
  smtp_require_tls: false
templates:
  - 'conf/*.tmpl'
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 30s
  repeat_interval: 4h
  receiver: 'default'
receivers:
- name: 'default'
  email_configs:
  - to: '473145009@qq.com'    # recipient address
    send_resolved: true
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
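Alertmanager ships with amtool, which can validate this file before you (re)start the service; run it from the extracted directory:

./amtool check-config alertmanager.yml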

2. Alert template

vi mail.tmpl
{{ define "email.from" }}2877364346@qq.com{{ end }}
{{ define "email.to" }}2877364346@qq.com{{ end }}
{{ define "email.to.html" }}
{{ range .Alerts }}
=========start==========<br>
Alerting program: prometheus_alert<br>
Severity: {{ .Labels.severity }}<br>
Alert type: {{ .Labels.alertname }}<br>
Affected host: {{ .Labels.instance }}<br>
Summary: {{ .Annotations.summary }}<br>
Triggered at: {{ .StartsAt.Format "2006-01-02 15:04:05" }} <br>
=========end==========<br>
{{ end }}
{{ end }}

6.3 WeChat alerts

1. Configuration

2. WeChat alert template

{{ define "wechat.default.message" }}
{{ if gt (len .Alerts.Firing) 0 -}}
Alerts Firing:
{{ range .Alerts }}
Severity: {{ .Labels.severity }}
Alert type: {{ .Labels.alertname }}
Affected host: {{ .Labels.instance }}
Summary: {{ .Annotations.summary }}
Details: {{ .Annotations.description }}
Triggered at: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
{{- end }}
{{- end }}
{{ if gt (len .Alerts.Resolved) 0 -}}
Alerts Resolved:
{{ range .Alerts }}
Severity: {{ .Labels.severity }}
Alert type: {{ .Labels.alertname }}
Affected host: {{ .Labels.instance }}
Summary: {{ .Annotations.summary }}
Triggered at: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
Resolved at: {{ .EndsAt.Format "2006-01-02 15:04:05" }}
{{- end }}
{{- end }}
Alert link:
{{ template "__alertmanagerURL" . }}
{{- end }}

6.4 Integrate Alertmanager with Prometheus

prometheus.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - 192.168.10.16:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - rules/*.yml
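After editing prometheus.yml, the file can be validated with promtool, which is bundled in the Prometheus tarball; run it from the Prometheus directory before restarting:

./promtool check config prometheus.yml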

6.5 Configure Prometheus alerting rules

1. host_rule.yml

groups:
  - name: host-status-alerts
    rules:
    - alert: HostDown
      expr: up == 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "{{$labels.instance}}: host is down"
        description: "{{$labels.instance}}: host has been unreachable for more than 1 minute"
    - alert: ServiceDown
      expr: probe_success{} == 0
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "{{$labels.instance}}: service is down"
        description: "{{$labels.instance}}: service has been down for more than 2 minutes"
    - alert: LinuxDiskUsage
      expr: 100 - (node_filesystem_avail_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100) > 80
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "{{$labels.mountpoint}} disk partition usage is too high!"
        description: "{{$labels.mountpoint}} disk partition usage is above 80% (currently {{$value}}%)"
    - alert: LinuxMemoryUsage
      expr: (100 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100) > 80
      for: 3m
      labels:
        severity: warning
      annotations:
        summary: "{{$labels.instance}} memory usage is too high!"
        description: "{{$labels.instance}} memory usage is above 80% (currently {{$value}}%)"
    - alert: LinuxCpuUsage
      expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: Host high CPU load (instance {{ $labels.instance }})
        description: "CPU load is > 80%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
    - alert: WindowsDiskUsage
      expr: (sum(windows_logical_disk_size_bytes{volume!~"Harddisk.*"}) by (group,hostip) - sum(windows_logical_disk_free_bytes{volume!~"Harddisk.*"}) by (group,hostip)) / sum(windows_logical_disk_size_bytes{volume!~"Harddisk.*"}) by (group,hostip) * 100 > 80
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "{{$labels.hostip}} disk usage is too high!"
        description: "{{$labels.hostip}} disk usage is above 80% (currently {{$value}}%)"
    - alert: WindowsMemoryUsage
      expr: sum by (group,hostip) ((windows_cs_physical_memory_bytes{} - windows_os_physical_memory_free_bytes{}) / windows_cs_physical_memory_bytes{} * 100) > 80
      for: 3m
      labels:
        severity: warning
      annotations:
        summary: "{{$labels.hostip}} memory usage is too high!"
        description: "{{$labels.hostip}} memory usage is above 80% (currently {{$value}}%)"
    - alert: WindowsCpuUsage
      expr: 100 - (avg by (group, hostip) (irate(windows_cpu_time_total{mode="idle"}[1m])) * 100) > 80
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: Host high CPU load (instance {{ $labels.instance }})
        description: "CPU load is > 80%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
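The rule file can be checked with promtool before Prometheus is restarted (run from the Prometheus directory):

./promtool check rules rules/host_rule.yml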

2. View the rules
http://192.168.10.12:49800/classic/alerts
(screenshot)

6.6 Integrate Loki with Alertmanager

1. Edit the Loki configuration file loki-local-config.yml and add the following ruler section:

ruler:
  alertmanager_url: http://10.0.14.90:9093           
  enable_alertmanager_v2: true
  enable_api: true                              
  enable_sharding: true            
  ring:                                 
        kvstore:
          store: inmemory
  rule_path: /bankapp/loki/rules-temp           
  storage:                        
        type: local
        local:
          directory: /bankapp/loki/rules          
  flush_period: 1m

Notes:
alertmanager_url: the Alertmanager address
enable_api: enables the Loki ruler (rules) API
enable_sharding: shards the rules so multiple ruler instances can be run
ring: consistent-hash ring configuration for the ruler service, used for multi-instance sharding
rule_path: temporary storage path for rule files
storage: rule storage; local storage and object storage are supported
directory: path where the rule files are stored
flush_period: how often the rules are loaded

6.7 Configure Loki alerting rules

1. Create the directories

mkdir -p /bankapp/loki/{rules-temp,rules}
mkdir /bankapp/loki/rules/fake

Note: create a folder named fake under /bankapp/loki/rules and put the rule files in it. Why a folder named fake? In a single-tenant Loki deployment, fake is the default tenant ID; in a multi-tenant system, folders with other tenant names under /bankapp/loki/rules work as well.
2. Configure the rules

cd /bankapp/loki/rules/fake
vi log-alert.yml
groups:
  - name: bank_connection_status
    rules:
    - alert: GYCB_status
      expr: sum(rate({bankno="GYCB"} |= "LOGDEBUG" [5m])) by (ip) > 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: GYCB status is down[5m]

3. Restart Loki and check that the rules are loaded
(screenshot)
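Besides the UI, the ruler API (enabled by enable_api: true above) can list the rule groups Loki has loaded; run this on the Loki host:

curl -s http://localhost:3100/loki/api/v1/rules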

4. The resulting alert email looks like this

(screenshot)

Other (full configuration files)

prometheus.yml

# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ["localhost:49800"]

  - job_name: 'pushgateway'
    static_configs:
      - targets: ['localhost:48888']
        labels:
          instance: pushgateway

  - job_name: "NODE"
    static_configs:
      # PAAS
      - targets: ['10.0.14.206:49999']
        labels:
          env: prd001
          group: PAAS
          hostip: 10.0.14.206

      - targets: ['10.0.14.205:49999']
        labels:
          env: prd001
          group: APP
          hostip: 10.0.14.205

      # NGINX
      - targets: ['10.0.14.200:49999']
        labels:
          env: prd001
          group: NGINX
          hostip: 10.0.14.200

      - targets: ['10.0.14.201:49999']
        labels:
          env: prd001
          group: NGINX
          hostip: 10.0.14.201

      - targets: ['10.0.14.202:49999']
        labels:
          env: prd001
          group: NGINX
          hostip: 10.0.14.202

      - targets: ['10.0.14.203:49999']
        labels:
          env: prd001
          group: NGINX
          hostip: 10.0.14.203

      # SWITCH
      - targets: ['10.0.14.209:49999']
        labels:
          env: prd001
          group: SWITCH
          hostip: 10.0.14.209

      - targets: ['10.0.14.210:49999']
        labels:
          env: prd001
          group: SWITCH
          hostip: 10.0.14.210

      - targets: ['10.0.14.211:49999']
        labels:
          env: prd001
          group: SWITCH
          hostip: 10.0.14.211

      - targets: ['10.0.14.214:49999']
        labels:
          env: prd001
          group: LOGSTASH
          hostip: 10.0.14.214

      - targets: ['10.0.14.215:49999']
        labels:
          env: prd001
          group: LOGSTASH
          hostip: 10.0.14.215

      - targets: ['10.0.14.221:49999']
        labels:
          env: prd001
          group: SWITCH
          hostip: 10.0.14.221

      - targets: ['10.0.14.216:49999']
        labels:
          env: prd001
          group: TOOLS
          hostip: 10.0.14.216

  - job_name: 'prometheus_port_status'
    metrics_path: /probe
    params:
        module: [tcp_connect]
    static_configs:
        - targets: ['10.0.14.222:6789']
          labels:
            instance: port_6789_PINGAN
            hostip: 10.0.14.222
            group: 'tcp'
    relabel_configs:
        - source_labels: [__address__]
          target_label: __param_target
        - target_label: __address__
          replacement: 127.0.0.1:9115

  - job_name: web_status
    metrics_path: /probe
    params:
        module: [http_2xx]
    static_configs:
        - targets: ['https://star.moutai.com.cn/nginx_status']
          labels:
            instance: starweb
            hostip: 10.0.14.241
            group: 'web'

        - targets: ['http://10.0.14.226:8046/actuator/health']
          labels:
            instance: gtms-service-business
            hostip: 10.0.14.226
            group: 'web'
            
        - targets: ['http://10.0.14.225:8043/actuator/health']
          labels:
            instance: gtms-service-gateway
            hostip: 10.0.14.225
            group: 'web'
        - targets: ['http://10.0.14.226:8043/actuator/health']
          labels:
            instance: gtms-service-gateway
            hostip: 10.0.14.226
            group: 'web'
            
        - targets: ['http://10.0.14.204:8047/actuator/health']
          labels:
            instance: gtms-service-job
            hostip: 10.0.14.204
            group: 'web'
        - targets: ['http://10.0.14.205:8047/actuator/health']
          labels:
            instance: gtms-service-job
            hostip: 10.0.14.205
            group: 'web'
            
        - targets: ['http://10.0.14.225:9080/star-api/actuator/health']
          labels:
            instance: ijep-router-zuul-star-api
            hostip: 10.0.14.225
            group: 'web'
        - targets: ['http://10.0.14.226:9080/star-api/actuator/health']
          labels:
            instance: ijep-router-zuul-star-api
            hostip: 10.0.14.226
            group: 'web'

    relabel_configs:
        - source_labels: [__address__]
          target_label: __param_target
        - target_label: __address__
          replacement: 127.0.0.1:9115

loki-config.yaml

auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
    final_sleep: 0s
  chunk_idle_period: 5m
  chunk_retain_period: 30s

schema_config:
  configs:
  - from: 2020-05-15
    store: boltdb-shipper
    object_store: filesystem
    schema: v11
    index:
      prefix: index_
      period: 24h

storage_config:
  boltdb_shipper:
   active_index_directory: /bankapp/loki/data/index
   cache_location: /bankapp/loki/data/index/cache
   shared_store: filesystem

  filesystem:
    directory: /bankapp/loki/data/chunks

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 336h
  per_stream_rate_limit: "30MB"
  ingestion_rate_mb: 50
  retention_period: 336h

compactor:
  working_directory: /bankapp/loki/data/compactor
  shared_store: filesystem
  compaction_interval: 10m
  retention_enabled: true

FAQ

1. Loki is not receiving logs, or Prometheus cannot scrape metrics

Solution:
Check whether the firewall rules allow the required ports, or simply disable the firewall (not recommended in production). A sketch for opening individual ports follows.
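On CentOS 7 the usual alternative to disabling firewalld is to open only the ports this stack uses, for example:

firewall-cmd --permanent --add-port=3100/tcp    # Loki
firewall-cmd --permanent --add-port=49800/tcp   # Prometheus
firewall-cmd --permanent --add-port=49999/tcp   # node_exporter
firewall-cmd --reload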
