Writing Monitoring Metrics in Java and Integrating Them with Prometheus and Grafana

I. Pushgateway

1. Upload pushgateway-1.4.3.linux-amd64.tar.gz to /opt/soft/

  Download URL: https://github.com/prometheus/pushgateway/releases/download/v1.4.3/pushgateway-1.4.3.linux-amd64.tar.gz

2. Extract it to /opt/module/ and rename the directory

tar -zxvf /opt/soft/pushgateway-1.4.3.linux-amd64.tar.gz -C /opt/module/
mv /opt/module/pushgateway-1.4.3.linux-amd64/ /opt/module/pushgateway-1.4.3

3. Start Pushgateway

nohup /opt/module/pushgateway-1.4.3/pushgateway --web.listen-address 127.0.0.1:9091 >/opt/module/pushgateway-1.4.3/pushgateway.log 2>&1 &

4. Open the web page

  http://127.0.0.1:9091
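
  With Pushgateway running, a Java program can push a sample by sending the text exposition format to /metrics/job/<job> over HTTP. The following is only a minimal sketch using the JDK's HttpURLConnection. The class name PushgatewaySketch, the job name demo_job, and the metric demo_last_run_seconds are made up for illustration and are not part of the project in this article.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class PushgatewaySketch {

    // Builds the push URL <base>/metrics/job/<job>; the job name is a hypothetical example.
    public static String pushUrl(String baseUrl, String job) {
        String base = baseUrl.endsWith("/") ? baseUrl.substring(0, baseUrl.length() - 1) : baseUrl;
        return base + "/metrics/job/" + job;
    }

    // Renders one sample line in the text exposition format; every line must end with '\n'.
    public static String sample(String metric, double value) {
        return metric + " " + value + "\n";
    }

    // Sends the body with HTTP PUT and returns the HTTP status code.
    public static int push(String url, String body) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);
        conn.setConnectTimeout(3000);
        conn.setReadTimeout(3000);
        conn.setRequestProperty("Content-Type", "text/plain; version=0.0.4; charset=utf-8");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        int status = conn.getResponseCode();
        conn.disconnect();
        return status;
    }

    public static void main(String[] args) throws IOException {
        // Assumes Pushgateway listens on 127.0.0.1:9091 as configured in step 3.
        String url = pushUrl("http://127.0.0.1:9091", "demo_job");
        int status = push(url, sample("demo_last_run_seconds", 12.5));
        System.out.println("Pushgateway responded with HTTP " + status);
    }
}
```

  PUT replaces every metric in the job's group, whereas POST only replaces metrics that have the same name. A job name containing "/" or other special characters would need extra encoding, which this sketch does not handle.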

5. Enable start on boot

vi /etc/systemd/system/pushgateway.service

  Add the following content:

[Unit]
Description=logging pushgateway service

[Service]
Type=simple
User=root
Group=root
ExecStart=/opt/module/pushgateway-1.4.3/pushgateway --web.listen-address 127.0.0.1:9091

Restart=on-failure

[Install]
WantedBy=multi-user.target

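6. Set permissions on the unit file (644 is sufficient; systemd logs a warning for unit files that are executable or world-writable)

chmod 644 /etc/systemd/system/pushgateway.service

7. Reload systemd, then start the service and enable it on boot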

systemctl daemon-reload
systemctl start pushgateway
systemctl enable pushgateway

8. Stop Pushgateway

ps -ef | grep pushgateway | grep -v grep | awk '{print $2}' | xargs kill

9. Write a restart script

vi /opt/module/pushgateway-1.4.3/restart.sh

  Add the following content:

#!/bin/bash
# Find the PID(s) of the running Pushgateway, if any
process=$(ps -ef | grep /opt/module/pushgateway-1.4.3/pushgateway | grep -v grep | awk '{print $2}')
echo "process=$process"
if [ -z "$process" ]; then
    echo "$(date "+%Y-%m-%d %H:%M:%S"): pushgateway is already stopped"
else
    # $process is left unquoted on purpose so that several PIDs are passed as separate arguments
    kill -9 $process
    echo "$(date "+%Y-%m-%d %H:%M:%S"): pushgateway has been killed"
fi

sleep 3

echo "Starting pushgateway"
nohup /opt/module/pushgateway-1.4.3/pushgateway --web.listen-address 127.0.0.1:9091 >/opt/module/pushgateway-1.4.3/pushgateway.log 2>&1 &

10. Make the script executable

chmod +x /opt/module/pushgateway-1.4.3/restart.sh

11. Schedule the restart (Pushgateway may hang after running for a long time, so it is restarted once an hour)

vi /etc/crontab

  Add the following content:

36 */1 * * * root cd /opt/module/pushgateway-1.4.3 && sh restart.sh

II. Prometheus server

1. Upload prometheus-2.38.0.linux-amd64.tar.gz to /opt/soft/

  Download URL: https://github.com/prometheus/prometheus/releases/download/v2.38.0/prometheus-2.38.0.linux-amd64.tar.gz

2. Extract it to /opt/module/ and rename the directory

tar -zxvf /opt/soft/prometheus-2.38.0.linux-amd64.tar.gz -C /opt/module/
mv /opt/module/prometheus-2.38.0.linux-amd64/ /opt/module/prometheus-2.38.0

3. Edit the configuration file

vi /opt/module/prometheus-2.38.0/prometheus.yml

  Change it as follows:

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets: ["127.0.0.1:9093"] 

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
    - targets: ['127.0.0.1:9090']
    
  # Add the Pushgateway scrape configuration
  - job_name: 'pushgateway'
    static_configs:
    - targets: ['127.0.0.1:9091']
      labels:
        instance: pushgateway

4. Start Prometheus

nohup /opt/module/prometheus-2.38.0/prometheus --config.file=/opt/module/prometheus-2.38.0/prometheus.yml --web.enable-admin-api > /opt/module/prometheus-2.38.0/prometheus.log 2>&1 &

5. Manage Prometheus with systemctl

vi /etc/systemd/system/prometheus.service

  Add the following content:

[Unit]
Description=logging prometheus service
Documentation=https://prometheus.io

[Service]
Type=simple
User=root
Group=root
ExecStart=/opt/module/prometheus-2.38.0/prometheus --config.file=/opt/module/prometheus-2.38.0/prometheus.yml --web.enable-admin-api

Restart=on-failure

[Install]
WantedBy=multi-user.target

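6. Set permissions on the unit file (644 is sufficient; systemd logs a warning for unit files that are executable or world-writable)

chmod 644 /etc/systemd/system/prometheus.service

7. Reload systemd, then start the service and enable it on boot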

systemctl daemon-reload
systemctl start prometheus
systemctl enable prometheus

8. Stop Prometheus

systemctl stop prometheus
ps -ef |grep -v grep | grep /opt/module/prometheus-2.38.0/prometheus | awk '{print $2}' | xargs kill

9. Check the process

ps -ef |grep -v grep | grep /opt/module/prometheus-2.38.0/prometheus

10. Open the web page

  http://127.0.0.1:9090

III. Monitoring ZooKeeper (no longer in use)

1. Edit the configuration file

vi /opt/module/prometheus-2.38.0/prometheus.yml

  Add the following under scrape_configs:

  # Add the ZooKeeper scrape configuration
  - job_name: 'zookeeper'
    static_configs:
    - targets: ['127.0.0.1:7000','127.0.0.1:7000','127.0.0.1:7000']
      labels:
        instance: zookeeper

2. Restart Prometheus

systemctl restart prometheus

IV. Monitoring the cluster

1. Create a new Maven project in IDEA

2. Edit the pom file

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.yuange</groupId>
    <artifactId>prometheus</artifactId>
    <version>1.0-SNAPSHOT</version>
    <packaging>pom</packaging>

    <repositories>
        <repository>
            <id>aliyun</id>
            <name>aliyun</name>
            <url>https://maven.aliyun.com/repository/public</url>
        </repository>
        <repository>
            <id>cloudera</id>
            <name>cloudera</name>
            <url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
        </repository>
    </repositories>

    <properties>
        <java.version>1.8</java.version>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
        <slf4j.version>1.7.30</slf4j.version>
        <httpclient.version>4.5</httpclient.version>
        <fastjson.version>1.2.75</fastjson.version>
        <lombok.version>1.18.16</lombok.version>
    </properties>

    <dependencyManagement>
        <dependencies>
            <dependency>
                <groupId>org.slf4j</groupId>
                <artifactId>slf4j-api</artifactId>
                <version>${slf4j.version}</version>
            </dependency>
            <dependency>
                <groupId>org.slf4j</groupId>
                <artifactId>slf4j-log4j12</artifactId>
                <version>${slf4j.version}</version>
            </dependency>
            <!-- Lombok -->
            <dependency>
                <groupId>org.projectlombok</groupId>
                <artifactId>lombok</artifactId>
                <version>${lombok.version}</version>
            </dependency>
            <dependency>
                <groupId>com.alibaba</groupId>
                <artifactId>fastjson</artifactId>
                <version>${fastjson.version}</version>
            </dependency>
            <dependency>
                <groupId>org.apache.httpcomponents</groupId>
                <artifactId>httpclient</artifactId>
                <version>${httpclient.version}</version>
            </dependency>
        </dependencies>
    </dependencyManagement>

</project>

3. Create a child module

4. Edit the child module's pom file

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <parent>
        <artifactId>prometheus</artifactId>
        <groupId>com.yuange</groupId>
        <version>1.0-SNAPSHOT</version>
    </parent>
    <modelVersion>4.0.0</modelVersion>

    <artifactId>yuange-prometheus</artifactId>

    <properties>
        <java.version>1.8</java.version>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
        <spring-boot.version>2.4.1</spring-boot.version>
        <fastjson.version>1.2.75</fastjson.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <optional>true</optional>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>${fastjson.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpclient</artifactId>
        </dependency>
    </dependencies>

    <dependencyManagement>
        <dependencies>
            <dependency>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-dependencies</artifactId>
                <version>${spring-boot.version}</version>
                <type>pom</type>
                <scope>import</scope>
            </dependency>
        </dependencies>
    </dependencyManagement>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.8.1</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                    <encoding>UTF-8</encoding>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
                <version>2.4.1</version>
                <configuration>
                    <mainClass>com.yuange.promethues.YuangePromethuesApplication</mainClass>
                </configuration>
                <executions>
                    <execution>
                        <id>repackage</id>
                        <goals>
                            <goal>repackage</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

</project>

5. Add an application.properties file (this is where the application port, 8999 in this article, is configured)

6. Create the Spring Boot main class

com.yuange.promethues.YuangePromethuesApplication
package com.yuange.promethues;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

/**
 * Created: 2023/2/22 14:31
 *
 * @author 袁工
 */
@SpringBootApplication
public class YuangePromethuesApplication {

    public static void main(String[] args) {
        SpringApplication.run(YuangePromethuesApplication.class,args);
    }
    
}

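7. About the rest of the code

  There is too much code to walk through line by line; the full source is on Gitee: https://gitee.com/LzMingYueShanPao/prometheus.git

  The yuange-prometheus module is a Spring Boot project. It listens on port 8999 after startup, which can be changed in application.properties.

  In addition, ClusterMetricController.java in the controller package exposes the endpoint /metrics/prdClusterMetric, which is the monitoring endpoint for the big data cluster.

  The metrics it reports include: memory and vcore usage percentages, YARN resource metrics, HeapMemoryUsage of the NameNode and HiveServer2, DataNode heartbeat data, and Telnet-style liveness checks on IP + port.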
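
  To give an idea of what the Telnet-style liveness check could look like, here is a minimal sketch that opens a TCP connection with a timeout and renders the result as one line of the Prometheus text format. It is not the code from the repository: the class PortProbeSketch and its method names are hypothetical, and the 1 = reachable / 0 = unreachable convention is an assumption. The metric name ACTIVESTATE and the labels name, ip, and port are taken from the Grafana queries later in this article.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class PortProbeSketch {

    // Returns true if a TCP connection to host:port succeeds within the timeout.
    public static boolean isOpen(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, timed out, or host unreachable
            return false;
        }
    }

    // Renders the probe result as one sample line: 1 when reachable, 0 otherwise (assumed convention).
    public static String toSample(String name, String ip, int port, boolean up) {
        return String.format("ACTIVESTATE{name=\"%s\",ip=\"%s\",port=\"%d\"} %d\n",
                name, ip, port, up ? 1 : 0);
    }

    public static void main(String[] args) {
        // 127.0.0.1:9870 is the NameNode address used in the Grafana examples of this article.
        String ip = "127.0.0.1";
        int port = 9870;
        boolean up = isOpen(ip, port, 2000);
        System.out.print(toSample("NameNode", ip, port, up));
    }
}
```

  A short connect timeout keeps one unreachable host from delaying the whole scrape, since Prometheus waits at most scrape_timeout (10s in the configuration above) for the endpoint to respond.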

8. Package the project with Maven in IDEA, then upload the JAR to /opt/module/prometheus-2.38.0/java

9. Write the start/stop script restart.sh

vim /opt/module/prometheus-2.38.0/java/restart.sh
#!/bin/bash

# Log level: debug-1, info-2, warn-3, error-4, always-5
LOG_LEVEL=3

# Debug log
function log_debug(){
  content="[DEBUG] $(date '+%Y-%m-%d %H:%M:%S') $@"
  [ $LOG_LEVEL -le 1  ] && echo -e "\033[32m"  ${content}  "\033[0m"
}
# Info log
function log_info(){
  content="[INFO] $(date '+%Y-%m-%d %H:%M:%S') $@"
  [ $LOG_LEVEL -le 2  ] && echo -e "\033[32m"  ${content} "\033[0m"
}
# Warning log
function log_warn(){
  content="[WARN] $(date '+%Y-%m-%d %H:%M:%S') $@"
  [ $LOG_LEVEL -le 3  ] && echo -e "\033[33m" ${content} "\033[0m"
}
# Error log
function log_err(){
  content="[ERROR] $(date '+%Y-%m-%d %H:%M:%S') $@"
  [ $LOG_LEVEL -le 4  ] && echo -e "\033[31m" ${content} "\033[0m"
}
# Log that is always printed
function log_always(){
   content="[ALWAYS] $(date '+%Y-%m-%d %H:%M:%S') $@"
   [ $LOG_LEVEL -le 5  ] && echo -e  "\033[32m" ${content} "\033[0m"
}


SpringBoot=$1

if [ "$SpringBoot" = "" ];
then
    log_err "Please enter the Jar application name"
    lot=$(find ./ -maxdepth 1 -type f -and -name "*.jar")
    # lot_pat=${lot#*/}
    log_err "The Optional Jar applications are as follows: $lot"
    exit 1
fi

ADATE=$(date +%Y%m%d%H%M%S)

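# Startup options (third argument); the text after "=" is used as the port below
START_OPTS=$3

# Application home directory, defaulting to the current working directory
APP_HOME=$(pwd)

# Resolve the directory that contains this script
if dirname "$0" | grep -q "^/"; then
     # The script was invoked with an absolute path
     APP_HOME=$(dirname "$0")
else
     if dirname "$0" | grep -q "^\."; then
        # Relative path starting with ".": replace the leading "." with the working directory
        APP_HOME=$(dirname "$0" | sed "s#^.#$APP_HOME#")
     else
        # Any other relative path: prefix it with the working directory
        APP_HOME=$(dirname "$0" | sed "s#^#$APP_HOME/#")
     fi
fi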

log_info "Current directory is $APP_HOME"

ENV_PORT=${START_OPTS#*=}

ENV_DIR="$APP_HOME/logs"


if [ "$ENV_PORT" = "" ]; then
    log_info "Application Port is Null"
    log_info "$ENV_DIR"
    if [ ! -d "$ENV_DIR"  ];then
        mkdir -p $ENV_DIR
    fi

else
    log_info "Application Port is $ENV_PORT"
    ENV_DIR="$ENV_DIR/$ENV_PORT"
    log_info "$ENV_DIR"
    if [ ! -d "$ENV_DIR"  ];then
        mkdir -p $ENV_DIR
    fi
fi

log_warn " Construct log folder $ENV_DIR"

pid=0

Purpose=$2

# If no operation is given, the default is "start", which requires confirmation:
# enter y|Y|Yes|YES to proceed, or n|N|NO|no to exit without starting
if [ "$Purpose" = "" ];
    then
        log_err "Operation Name Not Entered : The Default Action Is START !"
        read -p  "Are You Sure?[y/n]:"  sure
        case  $sure  in
            y|Y|Yes|YES)  
                log_warn "You Enter $sure"
                log_warn "Prepare to restart the app: $SpringBoot"
                Purpose="start"
                ;;
            n|N|NO|no)
                echo "you enter $sure"
                log_warn "Ready To Exit Startup: $SpringBoot"
                exit 1
                ;;
            *)
                echo "error"
                exit 1
                ;;
        esac
fi


log_debug "##############################"

log_debug "Java environment variable information"

LOG_PATH=$ENV_DIR/$SpringBoot-$ADATE.log
GC_LOG_PATH=$ENV_DIR/gc-$SpringBoot-$ADATE.log

VSpringBoot=${SpringBoot%%.*}

LOG_DEBUG_PATH=$ENV_DIR/$VSpringBoot/debug.log

log_debug "LOG_DEBUG_PATH is $LOG_DEBUG_PATH"


log_debug "$(java -version 2>&1)"
log_debug "Startup log  $LOG_PATH"

log_debug "Startup gc log  $GC_LOG_PATH"

JVM_OPTS="-Dname=$SpringBoot -Duser.timezone=Asia/Shanghai -Xms8192M -Xmx8192M -XX:PermSize=1024M -XX:MaxPermSize=1024M -XX:+HeapDumpOnOutOfMemoryError -XX:+PrintGCDateStamps -Xloggc:$GC_LOG_PATH -XX:+PrintGCDetails -XX:NewRatio=1 -XX:SurvivorRatio=30 -XX:+UseParallelGC -XX:+UseParallelOldGC"

log_debug "$JVM_OPTS $GC_LOG_PATH"

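# Start the application; if it is already running, kill the old process first
function start()
{
    # Check whether the application process exists
    checkPid
    # A non-zero pid means the application is already running
    if [ "${pid:-0}" -ne 0 ]; then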
        log_info "Application  $SpringBoot Running... PID:$pid ,Please Stop It"
        # echo -e "\033[31m Application  $SpringBoot Running... PID:$pid ,Please Stop It  \033[0m"
        # Kill the current Process
        killPid
        log_info ".............."
        startFun
    else
        log_info ".............."
        startFun
        log_info ".............."
    fi
}


checkPid()
{
    # PID of the running application, or 0 if it is not running (only the first match is kept)
    pid=$(ps -ef |grep $SpringBoot |grep 'java -jar'|grep -v grep|awk '{print $2}'|head -n 1)
    pid=${pid:-0}
}

killPid()
{
    log_err "Application PID:$pid is being stopped, please wait a moment"
    # Send SIGTERM to every matching java process
    ps -ef |grep $SpringBoot |grep 'java -jar'|grep -v grep|awk '{print $2}' | xargs kill
    tail -n 10 $LOG_DEBUG_PATH
    sleep 10s
}

startFun()
{
    log_debug "Begin Start $SpringBoot ..."
    # Run in the background with nohup so the application keeps running after the shell exits
    nohup java -jar $JVM_OPTS $SpringBoot $START_OPTS > $LOG_PATH 2>&1 &
    log_debug "$SpringBoot SUCCESS..."
    # Give the application time to boot, then show the startup log
    sleep 10s
    tail -n 300 $LOG_PATH

}

function stop()
{
    checkPid
    log_info "Begin Stop Application $SpringBoot"
    # echo "Begin Stop Application $SpringBoot"
    if [ "${pid:-0}" -ne 0 ]; then
        log_err "$SpringBoot stop..."
        # echo "$SpringBoot stop..."
        killPid
    else
        log_info "$SpringBoot not running!"
        # echo "$SpringBoot not running!"
    fi

}

function restart()
{
    # stop
    sleep 3s
    start
}

function status()
{
    checkPid
    # A non-zero pid means the application is running
    if [ "${pid:-0}" -ne 0 ]; then
        log_info "$SpringBoot is running... PID:$pid"
    else
        log_info "$SpringBoot not running!"
    fi
}

case $Purpose in
    start) start;;
    stop) stop;;
    restart) restart;;
    status) status;;
    *) log_info "require start|stop|restart|status"  ;;

esac

10. Make the script executable

chmod +x /opt/module/prometheus-2.38.0/java/restart.sh

11. Create the log file (the directory name is the JAR name up to its first ".")

mkdir -p /opt/module/prometheus-2.38.0/java/logs/meiyijia-promethues-1
touch /opt/module/prometheus-2.38.0/java/logs/meiyijia-promethues-1/debug.log

12. Start the Java program (use the JAR name your build actually produces)

cd /opt/module/prometheus-2.38.0/java/
bash restart.sh meiyijia-promethues-1.0-SNAPSHOT.jar start

  Check the endpoint: http://127.0.0.1:8999/metrics/prdClusterMetric
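
  When checking the endpoint, it helps to know what Prometheus expects: one sample per line in the form name{labels} value, with lines starting with # treated as comments. The sketch below parses such text into a map so the output can be sanity-checked from Java. It is a simplified illustration only: the class ExpositionParseSketch is hypothetical, and it assumes there are no timestamps after the value and that all values are finite numbers.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ExpositionParseSketch {

    // Parses "name{labels} value" lines; blank lines and "#" comment lines are skipped.
    // Assumption: no trailing timestamp, and values that Double.parseDouble accepts.
    public static Map<String, Double> parse(String text) {
        Map<String, Double> samples = new LinkedHashMap<>();
        for (String line : text.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            // The value is whatever follows the last space; label values may contain spaces.
            int idx = trimmed.lastIndexOf(' ');
            if (idx < 0) {
                continue;
            }
            String series = trimmed.substring(0, idx).trim();
            double value = Double.parseDouble(trimmed.substring(idx + 1));
            samples.put(series, value);
        }
        return samples;
    }

    public static void main(String[] args) {
        // Sample text for illustration; the real endpoint output will differ.
        String body = "# HELP yarn_activenodes Active nodes\n"
                + "yarn_activenodes 3\n"
                + "memory_percentage{queue=\"default\"} 12.5\n";
        parse(body).forEach((series, value) -> System.out.println(series + " -> " + value));
    }
}
```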

13. Edit the Prometheus configuration

vi /opt/module/prometheus-2.38.0/prometheus.yml

  Add the following under scrape_configs:

  # Add the custom cluster monitoring configuration
  - job_name: 'prdClusterMetric'
    metrics_path: '/metrics/prdClusterMetric'
    static_configs:
    - targets: ['127.0.0.1:8999']
      labels:
        instance: prdClusterMetric

14. Restart Prometheus

systemctl restart prometheus

15. Check the status

systemctl status prometheus

  http://127.0.0.1:9090/targets?search=

V. Installing and configuring Grafana

1. Download the package

wget https://repo.huaweicloud.com/grafana/7.4.3/grafana-7.4.3-1.x86_64.rpm -P /opt/soft

2. Install dependencies

yum install -y /sbin/service fontconfig freetype urw-fonts

3. Install Grafana

rpm -ivh /opt/soft/grafana-7.4.3-1.x86_64.rpm

4. Reload systemd

systemctl daemon-reload

5. Start Grafana

systemctl start grafana-server.service

6. Enable start on boot

systemctl enable grafana-server.service

7. Check the status

systemctl status grafana-server.service

8. Configure access by domain name

vim /usr/share/grafana/conf/defaults.ini

  Under [server], add the following:

[server]
domain = grafana-yuange.yuange.com.cn

  Then change the following line:

root_url = %(protocol)s://%(domain)s:%(http_port)s/

  to:

root_url = %(protocol)s://%(domain)s/

9. Open the web UI (username: admin, password: admin)

http://127.0.0.1:3000/login
https://grafana-yuange.yuange.com.cn/

10. Add the Prometheus data source

11. Configure the metric panels in Grafana

(1) IP and port liveness example

sum(ACTIVESTATE{instance="prdClusterMetric",name="NameNode",type="prdActive"})

(2) NumStaleDataNodes (monitors DataNodes entering the stale state)

ACTIVESTATE{instance="prdClusterMetric",name="NumStaleDataNodes"}

(3) yarn_unhealthynodes (number of unhealthy YARN nodes)

yarn_unhealthynodes{instance="prdClusterMetric"}

(4) HeapMemoryUsage and NonHeapMemoryUsage (heap and non-heap memory monitoring)

abs(delta(ACTIVESTATE{instance="prdClusterMetric",name="HeapMemoryUsage",ip="127.0.0.1",port="9870"}[10m]))
abs(delta(ACTIVESTATE{instance="prdClusterMetric",name="NonHeapMemoryUsage",ip="127.0.0.1",port="9870"}[10m]))

(5) HeartbeatsTotalNumOps (DataNode heartbeat monitoring)

increase(ACTIVESTATE{instance="prdClusterMetric",name="HeartbeatsTotalNumOps",ip="127.0.0.1"}[1m])

(6) yarn_reservedmb (memory currently reserved in YARN)

yarn_reservedmb{instance="prdClusterMetric"}

(7) yarn_reservedvirtualcores (vcores currently reserved in YARN)

yarn_reservedvirtualcores{instance="prdClusterMetric"}

(8) yarn_activenodes (number of active nodes in the cluster)

yarn_activenodes{instance="prdClusterMetric"}

(9) yarn_allocatedmb (total memory allocated in the cluster)

yarn_allocatedmb{instance="prdClusterMetric"}

(10) memory_percentage (memory usage percentage of each queue in the cluster)

memory_percentage{instance="prdClusterMetric"}

(11) vcores_percentage (vcore usage percentage of each queue in the cluster)

vcores_percentage{instance="prdClusterMetric"}
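
  The same expressions can also be evaluated outside Grafana through the Prometheus HTTP API, where an instant query is a GET request to /api/v1/query with the expression passed in the query parameter. As a rough sketch, the class below only builds the URL-encoded request address; the class name InstantQuerySketch is hypothetical, and the base URL assumes the Prometheus server from section II.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class InstantQuerySketch {

    // Builds the instant-query URL; braces, quotes and "=" in PromQL must be percent-encoded.
    public static String queryUrl(String baseUrl, String promql) {
        try {
            return baseUrl + "/api/v1/query?query=" + URLEncoder.encode(promql, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to exist on every JVM, so this cannot happen in practice.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String url = queryUrl("http://127.0.0.1:9090",
                "yarn_unhealthynodes{instance=\"prdClusterMetric\"}");
        // The URL can be opened with curl or a browser to see the JSON result.
        System.out.println(url);
    }
}
```

  Checking an expression this way is a quick method to tell whether an empty Grafana panel is caused by the query itself or by the panel settings.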

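12. Alerting (Feishu and WeCom integration) template

13. Alert message template

Alert! The NameNode in the production cluster has been down for more than 5 minutes! Please decide how to proceed!
<#list root as c>
Metric: ${c.metric}  Alert value: ${c.value}
</#list>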

posted @ 2023-02-22 15:39  落魄的大数据转AI小哥