Writing Monitoring Metrics in Java and Integrating Prometheus and Grafana
I. Pushgateway
1. Upload pushgateway-1.4.3.linux-amd64.tar.gz to /opt/soft/
2. Extract it to /opt/module/ and rename the directory
tar -zxvf /opt/soft/pushgateway-1.4.3.linux-amd64.tar.gz -C /opt/module/
mv /opt/module/pushgateway-1.4.3.linux-amd64/ /opt/module/pushgateway-1.4.3
3. Start Pushgateway
nohup /opt/module/pushgateway-1.4.3/pushgateway --web.listen-address 127.0.0.1:9091 >/opt/module/pushgateway-1.4.3/pushgateway.log 2>&1 &
4. Open the web UI at http://127.0.0.1:9091 (the listen address set in the previous step)
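With Pushgateway running, a Java process can push a metric to it over plain HTTP. The following is a minimal sketch using only the JDK, not the code from the author's project: the metric name `demo_heartbeat` and the job name `demo_job` are hypothetical, and the only Pushgateway behavior relied on is that metrics are sent in the Prometheus text format to `/metrics/job/<job>`. The program prints the payload by default and only contacts Pushgateway when started with `--push`.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class PushgatewayPushSketch {

    // Builds one gauge sample in the Prometheus text exposition format.
    static String gaugePayload(String name, String help, double value) {
        return "# HELP " + name + " " + help + "\n"
             + "# TYPE " + name + " gauge\n"
             + name + " " + value + "\n";
    }

    // Pushgateway groups pushed metrics by the job name in the URL path.
    static String pushUrl(String baseUrl, String job) {
        return baseUrl + "/metrics/job/" + job;
    }

    // Sends the payload with PUT, which replaces all metrics in that group.
    static int push(String url, String payload) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);
        conn.setConnectTimeout(3000);
        conn.setReadTimeout(3000);
        conn.setRequestProperty("Content-Type", "text/plain; version=0.0.4; charset=utf-8");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(payload.getBytes(StandardCharsets.UTF_8));
        }
        int status = conn.getResponseCode();
        conn.disconnect();
        return status;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical metric and job names, for illustration only.
        String payload = gaugePayload("demo_heartbeat", "Hypothetical demo gauge", 1);
        String url = pushUrl("http://127.0.0.1:9091", "demo_job");
        System.out.print(payload);
        if (args.length > 0 && args[0].equals("--push")) {
            System.out.println("HTTP " + push(url, payload));
        }
    }
}
```

After a push, the metric should appear on the Pushgateway web UI under the chosen job, and Prometheus picks it up on its next scrape of Pushgateway.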
5. Enable start at boot
vi /etc/systemd/system/pushgateway.service
Add the following content:
[Unit]
Description=logging pushgateway service

[Service]
Type=simple
User=root
Group=root
ExecStart=/opt/module/pushgateway-1.4.3/pushgateway --web.listen-address 127.0.0.1:9091
Restart=on-failure

[Install]
WantedBy=multi-user.target
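6. Set file permissions (a unit file only needs to be readable; 644 avoids systemd warnings about executable or world-writable unit files)
chmod 644 /etc/systemd/system/pushgateway.service
7. Reload systemd, then start the service and enable it at boot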
systemctl daemon-reload
systemctl start pushgateway
systemctl enable pushgateway
8. Stop (if Pushgateway was started through systemd, systemctl stop pushgateway can be used instead)
ps -ef | grep pushgateway | grep -v grep | awk '{print $2}' | xargs kill
9. Write a restart script
vi /opt/module/pushgateway-1.4.3/restart.sh
Add the following content:
#!/bin/bash
# PID(s) of the running pushgateway binary, if any
process=$(ps -ef | grep /opt/module/pushgateway-1.4.3/pushgateway | grep -v grep | awk '{print $2}')
echo "process=$process"
if [ "$process" == "" ]; then
    echo "$(date "+%Y-%m-%d %H:%M:%S"): pushgateway is already stopped"
else
    kill -9 $process
    echo "$(date "+%Y-%m-%d %H:%M:%S"): pushgateway has been killed"
fi
sleep 3
echo "Starting pushgateway"
nohup /opt/module/pushgateway-1.4.3/pushgateway --web.listen-address 127.0.0.1:9091 >/opt/module/pushgateway-1.4.3/pushgateway.log 2>&1 &
10. Make the script executable
chmod +x /opt/module/pushgateway-1.4.3/restart.sh
11. Schedule the restart (Pushgateway can hang after running for a long time, so it is restarted once an hour)
vi /etc/crontab
Add the following content:
36 */1 * * * root cd /opt/module/pushgateway-1.4.3 && sh restart.sh
II. Prometheus server
1. Upload prometheus-2.38.0.linux-amd64.tar.gz to /opt/soft/
2. Extract it to /opt/module/ and rename the directory
tar -zxvf /opt/soft/prometheus-2.38.0.linux-amd64.tar.gz -C /opt/module/
mv /opt/module/prometheus-2.38.0.linux-amd64/ /opt/module/prometheus-2.38.0
3. Edit the configuration file
vi /opt/module/prometheus-2.38.0/prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["127.0.0.1:9093"]

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['127.0.0.1:9090']

  # Add the Pushgateway scrape job
  - job_name: 'pushgateway'
    static_configs:
      - targets: ['127.0.0.1:9091']
        labels:
          instance: pushgateway
4. Start Prometheus
nohup /opt/module/prometheus-2.38.0/prometheus --config.file=/opt/module/prometheus-2.38.0/prometheus.yml --web.enable-admin-api > /opt/module/prometheus-2.38.0/prometheus.log 2>&1 &
5. Manage Prometheus with systemctl
vi /etc/systemd/system/prometheus.service
Add the following content:
[Unit]
Description=logging prometheus service
Documentation=https://prometheus.io

[Service]
Type=simple
User=root
Group=root
ExecStart=/opt/module/prometheus-2.38.0/prometheus --config.file=/opt/module/prometheus-2.38.0/prometheus.yml --web.enable-admin-api
Restart=on-failure

[Install]
WantedBy=multi-user.target
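6. Set file permissions (a unit file only needs to be readable; 644 avoids systemd warnings about executable or world-writable unit files)
chmod 644 /etc/systemd/system/prometheus.service
7. Reload systemd, then start the service and enable it at boot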
systemctl daemon-reload
systemctl start prometheus
systemctl enable prometheus
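8. Stop (use systemctl when Prometheus runs as a service; the ps/kill pipeline is for an instance started with nohup)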
systemctl stop prometheus
ps -ef |grep -v grep | grep /opt/module/prometheus-2.38.0/prometheus | awk '{print $2}' | xargs kill
9. Check the process
ps -ef |grep -v grep | grep /opt/module/prometheus-2.38.0/prometheus
10. Open the web UI at http://127.0.0.1:9090
III. Monitoring ZooKeeper (no longer in use)
1. Edit the configuration file
vi /opt/module/prometheus-2.38.0/prometheus.yml
  # Add the ZooKeeper scrape job
  - job_name: 'zookeeper'
    static_configs:
      - targets: ['127.0.0.1:7000','127.0.0.1:7000','127.0.0.1:7000']
        labels:
          instance: zookeeper
2. Restart Prometheus
systemctl restart prometheus
IV. Monitoring the cluster
1. Create a new Maven project in IDEA
2. Edit the pom file
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.yuange</groupId>
    <artifactId>prometheus</artifactId>
    <version>1.0-SNAPSHOT</version>
    <packaging>pom</packaging>

    <repositories>
        <repository>
            <id>aliyun</id>
            <name>aliyun</name>
            <url>https://maven.aliyun.com/repository/public</url>
        </repository>
        <repository>
            <id>cloudera</id>
            <name>cloudera</name>
            <url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
        </repository>
    </repositories>

    <properties>
        <java.version>1.8</java.version>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
        <slf4j.version>1.7.30</slf4j.version>
        <httpclient.version>4.5</httpclient.version>
        <fastjson.version>1.2.75</fastjson.version>
        <lombok.version>1.18.16</lombok.version>
    </properties>

    <dependencyManagement>
        <dependencies>
            <dependency>
                <groupId>org.slf4j</groupId>
                <artifactId>slf4j-api</artifactId>
                <version>${slf4j.version}</version>
            </dependency>
            <dependency>
                <groupId>org.slf4j</groupId>
                <artifactId>slf4j-log4j12</artifactId>
                <version>${slf4j.version}</version>
            </dependency>
            <!-- Lombok -->
            <dependency>
                <groupId>org.projectlombok</groupId>
                <artifactId>lombok</artifactId>
                <version>${lombok.version}</version>
            </dependency>
            <dependency>
                <groupId>com.alibaba</groupId>
                <artifactId>fastjson</artifactId>
                <version>${fastjson.version}</version>
            </dependency>
            <dependency>
                <groupId>org.apache.httpcomponents</groupId>
                <artifactId>httpclient</artifactId>
                <version>${httpclient.version}</version>
            </dependency>
        </dependencies>
    </dependencyManagement>
</project>
3. Create a child module
4. Edit the child module's pom file
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <parent>
        <artifactId>prometheus</artifactId>
        <groupId>com.yuange</groupId>
        <version>1.0-SNAPSHOT</version>
    </parent>
    <modelVersion>4.0.0</modelVersion>

    <artifactId>yuange-prometheus</artifactId>

    <properties>
        <java.version>1.8</java.version>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
        <spring-boot.version>2.4.1</spring-boot.version>
        <fastjson.version>1.2.75</fastjson.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <optional>true</optional>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>${fastjson.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpclient</artifactId>
        </dependency>
    </dependencies>

    <dependencyManagement>
        <dependencies>
            <dependency>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-dependencies</artifactId>
                <version>${spring-boot.version}</version>
                <type>pom</type>
                <scope>import</scope>
            </dependency>
        </dependencies>
    </dependencyManagement>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.8.1</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                    <encoding>UTF-8</encoding>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
                <version>2.4.1</version>
                <configuration>
                    <mainClass>com.yuange.promethues.YuangePromethuesApplication</mainClass>
                </configuration>
                <executions>
                    <execution>
                        <id>repackage</id>
                        <goals>
                            <goal>repackage</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
5. Add the application.properties file
6. Create the Spring Boot main class
com.yuange.promethues.YuangePromethuesApplication
package com.yuange.promethues;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

/**
 * Entry point of the monitoring application.
 *
 * Author: 袁工
 * Date: 2023/2/22 14:31
 */
@SpringBootApplication
public class YuangePromethuesApplication {
    public static void main(String[] args) {
        SpringApplication.run(YuangePromethuesApplication.class, args);
    }
}
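7. About the rest of the code
There is too much code to walk through piece by piece; the full project is on Gitee: https://gitee.com/LzMingYueShanPao/prometheus.git
The yuange-prometheus module is a Spring Boot project. It listens on port 8999 after startup, and the port can be changed in application.properties.
In the controller package, ClusterMetricController.java exposes the endpoint /metrics/prdClusterMetric, which is the monitoring endpoint for the big data cluster.
The metrics it reports include: memory and vcore usage percentages, YARN resource metrics, the HeapMemoryUsage of NameNode and HiveServer2, DataNode heartbeat data, and Telnet-style liveness checks of IP and port pairs.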
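To make the Telnet-style liveness check more concrete, here is a minimal JDK-only sketch of how one such sample could be produced. It is an illustration under assumptions, not the code from the Gitee repository: the class and method names are hypothetical, while the metric name `ACTIVESTATE` and the labels `name`, `ip`, and `port` are taken from the Grafana queries later in this article. The probe opens a TCP connection with a timeout and reports 1 for reachable and 0 for unreachable.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

class PortProbeSketch {

    // Returns true if a TCP connection to host:port succeeds within timeoutMs.
    static boolean isPortOpen(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    // Renders one sample line in the text format: metric{label="value",...} value
    static String sampleLine(String metric, String name, String ip, int port, int value) {
        return metric + "{name=\"" + name + "\",ip=\"" + ip + "\",port=\"" + port + "\"} " + value;
    }

    public static void main(String[] args) throws IOException {
        // Open a throwaway local listener so the probe has something to connect to.
        try (ServerSocket server = new ServerSocket(0)) {
            int port = server.getLocalPort();
            int alive = isPortOpen("127.0.0.1", port, 1000) ? 1 : 0;
            System.out.println(sampleLine("ACTIVESTATE", "NameNode", "127.0.0.1", port, alive));
        }
    }
}
```

In a real endpoint the host and port would come from configuration, and the rendered lines would be returned as the response body of /metrics/prdClusterMetric.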
8. Build the Jar with Maven in IDEA and upload it to the /opt/module/prometheus-2.38.0/java directory
9. Write the start/stop script restart.sh
vim /opt/module/prometheus-2.38.0/java/restart.sh
#!/bin/bash
# Log level: debug-1, info-2, warn-3, error-4, always-5
LOG_LEVEL=3

# Debug log
function log_debug(){
  content="[DEBUG] $(date '+%Y-%m-%d %H:%M:%S') $@"
  [ $LOG_LEVEL -le 1 ] && echo -e "\033[32m" ${content} "\033[0m"
}
# Info log
function log_info(){
  content="[INFO] $(date '+%Y-%m-%d %H:%M:%S') $@"
  [ $LOG_LEVEL -le 2 ] && echo -e "\033[32m" ${content} "\033[0m"
}
# Warning log
function log_warn(){
  content="[WARN] $(date '+%Y-%m-%d %H:%M:%S') $@"
  [ $LOG_LEVEL -le 3 ] && echo -e "\033[33m" ${content} "\033[0m"
}
# Error log
function log_err(){
  content="[ERROR] $(date '+%Y-%m-%d %H:%M:%S') $@"
  [ $LOG_LEVEL -le 4 ] && echo -e "\033[31m" ${content} "\033[0m"
}
# Log that is always printed
function log_always(){
  content="[ALWAYS] $(date '+%Y-%m-%d %H:%M:%S') $@"
  [ $LOG_LEVEL -le 5 ] && echo -e "\033[32m" ${content} "\033[0m"
}

SpringBoot=$1
if [ "$SpringBoot" = "" ]; then
  log_err "Please enter the Jar application name"
  lot=$(find ./ -maxdepth 1 -type f -and -name "*.jar")
  # lot_pat=${lot#*/}
  log_err "The Optional Jar applications are as follows: $lot"
  exit 1
fi

ADATE=$(date +%Y%m%d%H%M%S)
# Startup arguments
START_OPTS=$3

# Resolve the directory the script lives in
APP_HOME=$(pwd)
dirname $0|grep "^/" >/dev/null
if [ $? -eq 0 ];then
  APP_HOME=$(dirname $0)
else
  dirname $0|grep "^\." >/dev/null
  retval=$?
  if [ $retval -eq 0 ];then
    APP_HOME=$(dirname $0|sed "s#^.#$APP_HOME#")
  else
    APP_HOME=$(dirname $0|sed "s#^#$APP_HOME/#")
  fi
fi
log_info "Current directory is $APP_HOME"

ENV_PORT=${START_OPTS#*=}
ENV_DIR="$APP_HOME/logs"
if [ "$ENV_PORT" = "" ]; then
  log_info "Application Port is Null"
  log_info "$ENV_DIR"
  if [ ! -d "$ENV_DIR" ];then
    mkdir -p $ENV_DIR
  fi
else
  log_info "Application Port is $ENV_PORT"
  ENV_DIR="$ENV_DIR/$ENV_PORT"
  log_info "$ENV_DIR"
  if [ ! -d "$ENV_DIR" ];then
    mkdir -p $ENV_DIR
  fi
fi
log_warn " Construct log folder $ENV_DIR"

pid=0
Purpose=$2
# When no operation is given, the default is START and the user must confirm:
# Y|y|YES|Yes to proceed, or n|N|NO|no to exit without restarting
if [ "$Purpose" = "" ]; then
  log_err "Operation Name Not Entered : The Default Action Is START !"
  read -p "Are You Sure?[y/n]:" sure
  case $sure in
    y|Y|Yes|YES)
      log_warn "You Enter $sure"
      log_warn "Prepare to restart the app: $SpringBoot"
      Purpose="start"
      ;;
    n|N|NO|no)
      echo "you enter $sure"
      log_warn "Ready To Exit Startup: $SpringBoot"
      exit 1
      ;;
    *)
      echo "error";;
  esac
fi

log_debug "##############################"
log_debug "Java environment variable information"
LOG_PATH=$ENV_DIR/$SpringBoot-$ADATE.log
GC_LOG_PATH=$ENV_DIR/gc-$SpringBoot-$ADATE.log
VSpringBoot=${SpringBoot%%.*}
LOG_DEBUG_PATH=$ENV_DIR/$VSpringBoot/debug.log
log_debug "LOG_DEBUG_PATH is $LOG_DEBUG_PATH"
# java -version writes to stderr, so redirect it to capture the text
log_debug "$(java -version 2>&1)"
log_debug "Startup log $LOG_PATH"
log_debug "Startup gc log $GC_LOG_PATH"
# JVM options. Note: -XX:PermSize and -XX:MaxPermSize are ignored (with a warning) on Java 8 and later.
JVM_OPTS="-Dname=$SpringBoot -Duser.timezone=Asia/Shanghai -Xms8192M -Xmx8192M -XX:PermSize=1024M -XX:MaxPermSize=1024M -XX:+HeapDumpOnOutOfMemoryError -XX:+PrintGCDateStamps -Xloggc:$GC_LOG_PATH -XX:+PrintGCDetails -XX:NewRatio=1 -XX:SurvivorRatio=30 -XX:+UseParallelGC -XX:+UseParallelOldGC"
log_debug "$JVM_OPTS $GC_LOG_PATH"

# Start the application; if it is already running, kill the old process first
function start() {
  # Check whether the application process exists
  checkPid
  if [ -n "$pid" ]; then
    log_info "Application $SpringBoot Running... PID:$pid ,Please Stop It"
    # Kill the current process
    killPid
    log_info ".............."
    startFun
  else
    log_info ".............."
    startFun
    log_info ".............."
  fi
}

# Sets pid to the PID of the running jar, or to an empty string if it is not running
checkPid() {
  pid=$(ps -ef |grep $SpringBoot |grep 'java -jar'|grep -v grep|awk '{print $2}')
}

killPid() {
  log_err "Application PID:$pid is being stopped, Please wait for a while, or fish"
  kill $pid
  tail -n 10 $LOG_DEBUG_PATH
  sleep 10s
}

startFun() {
  log_debug "Begin Start $SpringBoot ..."
  # Keep "java -jar" at the front: checkPid greps for that exact string
  nohup java -jar $JVM_OPTS $SpringBoot $START_OPTS > $LOG_PATH 2>&1 &
  # nohup java -jar $JVM_OPTS $SpringBoot --spring.config.location=file:./application.yml $START_OPTS > $LOG_PATH 2>&1 &
  log_debug "$SpringBoot SUCCESS..."
  sleep 10s
  tail -n 300 $LOG_PATH
}

function stop() {
  checkPid
  log_info "Begin Stop Application $SpringBoot"
  if [ -n "$pid" ]; then
    log_err "$SpringBoot stop..."
    killPid
  else
    log_info "$SpringBoot not running!"
  fi
}

function restart() {
  # No explicit stop here: start() already kills a running instance
  sleep 3s
  start
}

function status() {
  checkPid
  if [ -n "$pid" ]; then
    log_info "$SpringBoot is running... PID:$pid"
  else
    log_info "$SpringBoot not running!"
  fi
}

case $Purpose in
  start) start;;
  stop) stop;;
  restart) restart;;
  status) status;;
  *) log_info "require start|stop|restart|status" ;;
esac
10. Make the script executable
chmod +x /opt/module/prometheus-2.38.0/java/restart.sh
11. Create the log file (the script tails debug.log when it stops the application)
mkdir -p /opt/module/prometheus-2.38.0/java/logs/meiyijia-promethues-1
touch /opt/module/prometheus-2.38.0/java/logs/meiyijia-promethues-1/debug.log
12. Start the Java application
cd /opt/module/prometheus-2.38.0/java/
sh restart.sh meiyijia-promethues-1.0-SNAPSHOT.jar start
Check the endpoint: http://127.0.0.1:8999/metrics/prdClusterMetric
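Besides opening the URL in a browser, the endpoint can be sanity-checked from Java. The sketch below is JDK-only and hypothetical (class and method names are mine): it counts the sample lines in a text-format response, skipping blank lines and `#` comment lines. By default it runs against a small built-in example whose values are made up; pass the endpoint URL as the first argument to check the live service.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class MetricsEndpointCheckSketch {

    // Counts sample lines, skipping blank lines and "# HELP" / "# TYPE" comments.
    static int countSamples(String body) {
        int count = 0;
        for (String line : body.split("\n")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty() && !trimmed.startsWith("#")) {
                count++;
            }
        }
        return count;
    }

    // Downloads the response body of an HTTP GET request as UTF-8 text.
    static String fetch(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setConnectTimeout(3000);
        conn.setReadTimeout(5000);
        StringBuilder body = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line).append('\n');
            }
        }
        return body.toString();
    }

    public static void main(String[] args) throws IOException {
        // Made-up values, only to demonstrate the counting logic.
        String sample = "# TYPE yarn_activenodes gauge\nyarn_activenodes 3\n\nyarn_unhealthynodes 0\n";
        System.out.println("samples in built-in example: " + countSamples(sample));
        if (args.length > 0) {
            System.out.println("samples at " + args[0] + ": " + countSamples(fetch(args[0])));
        }
    }
}
```

A count of zero against the live endpoint would suggest the application started but is not producing metrics yet.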
13. Update the Prometheus configuration
vi /opt/module/prometheus-2.38.0/prometheus.yml
Add the following content under scrape_configs:
  # Add the custom cluster monitoring job
  - job_name: 'prdClusterMetric'
    metrics_path: '/metrics/prdClusterMetric'
    static_configs:
      - targets: ['127.0.0.1:8999']
        labels:
          instance: prdClusterMetric
14. Restart Prometheus
systemctl restart prometheus
15. Check the status
systemctl status prometheus
http://127.0.0.1:9090/targets?search=
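The same check can be scripted against the Prometheus HTTP API, whose instant-query endpoint is `/api/v1/query`. The following sketch only builds the request URL, which is the part that is easy to get wrong because PromQL selectors contain characters that must be URL-encoded. The class and method names are hypothetical; the query `up{job="prdClusterMetric"}` uses the job name configured above and the built-in `up` series that Prometheus records for every scrape target.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class PrometheusQueryUrlSketch {

    // Builds the URL of an instant query, URL-encoding the PromQL expression.
    static String instantQueryUrl(String baseUrl, String promql) {
        try {
            return baseUrl + "/api/v1/query?query=" + URLEncoder.encode(promql, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String url = instantQueryUrl("http://127.0.0.1:9090", "up{job=\"prdClusterMetric\"}");
        System.out.println(url);
    }
}
```

Requesting the printed URL with any HTTP client returns JSON; a value of 1 for `up` means the last scrape of the target succeeded.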
V. Installing and configuring Grafana
1. Download the package
wget https://repo.huaweicloud.com/grafana/7.4.3/grafana-7.4.3-1.x86_64.rpm -P /opt/soft
2. Install dependencies
yum install -y /sbin/service fontconfig freetype urw-fonts
3. Install
rpm -ivh /opt/soft/grafana-7.4.3-1.x86_64.rpm
4. Reload systemd
systemctl daemon-reload
5. Start
systemctl start grafana-server.service
6. Enable start at boot
systemctl enable grafana-server.service
7. Check the status
systemctl status grafana-server.service
8. Configure access through a domain name
vim /usr/share/grafana/conf/defaults.ini
Under [server], set the following:
[server]
domain = grafana-yuange.yuange.com.cn
Then change this line:
root_url = %(protocol)s://%(domain)s:%(http_port)s/
to:
root_url = %(protocol)s://%(domain)s/
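The effect of this change is easier to see once the placeholders are expanded. The sketch below imitates the `%(name)s` substitution with plain string replacement; it is not Grafana's own implementation, and the class name is hypothetical. It assumes the Grafana defaults `protocol = http` and `http_port = 3000` together with the domain set above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class RootUrlExpansionSketch {

    // Replaces each %(key)s placeholder with the matching value.
    static String expand(String template, Map<String, String> values) {
        String result = template;
        for (Map.Entry<String, String> e : values.entrySet()) {
            result = result.replace("%(" + e.getKey() + ")s", e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> server = new LinkedHashMap<>();
        server.put("protocol", "http");   // assumed Grafana default
        server.put("domain", "grafana-yuange.yuange.com.cn");
        server.put("http_port", "3000");  // assumed Grafana default
        System.out.println(expand("%(protocol)s://%(domain)s:%(http_port)s/", server));
        System.out.println(expand("%(protocol)s://%(domain)s/", server));
    }
}
```

The first line printed includes `:3000`, the second does not. Dropping the port presumably matters here because Grafana is reached through the domain on the standard port rather than on port 3000 directly.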
9. Open the web UI (username: admin, password: admin)
http://127.0.0.1:3000/login
https://grafana-yuange.yuange.com.cn/
10. Add the Prometheus data source
11. Configure Grafana panels for the metrics
(1) IP and port liveness example
sum(ACTIVESTATE{instance="prdClusterMetric",name="NameNode",type="prdActive"})
(2) NumStaleDataNodes (DataNodes that have entered the stale state)
ACTIVESTATE{instance="prdClusterMetric",name="NumStaleDataNodes"}
(3) yarn_unhealthynodes (number of unhealthy YARN nodes)
yarn_unhealthynodes{instance="prdClusterMetric"}
(4) HeapMemoryUsage and NonHeapMemoryUsage (heap and non-heap memory)
abs(delta(ACTIVESTATE{instance="prdClusterMetric",name="HeapMemoryUsage",ip="127.0.0.1",port="9870"}[10m]))
abs(delta(ACTIVESTATE{instance="prdClusterMetric",name="NonHeapMemoryUsage",ip="127.0.0.1",port="9870"}[10m]))
(5) HeartbeatsTotalNumOps (DataNode heartbeats)
increase(ACTIVESTATE{instance="prdClusterMetric",name="HeartbeatsTotalNumOps",ip="127.0.0.1"}[1m])
(6) yarn_reservedmb (memory currently reserved in YARN)
yarn_reservedmb{instance="prdClusterMetric"}
(7) yarn_reservedvirtualcores (vcores currently reserved in YARN)
yarn_reservedvirtualcores{instance="prdClusterMetric"}
(8) yarn_activenodes (number of active nodes in the cluster)
yarn_activenodes{instance="prdClusterMetric"}
(9) yarn_allocatedmb (total memory allocated in the cluster)
yarn_allocatedmb{instance="prdClusterMetric"}
(10) memory_percentage (memory usage percentage of each queue in the cluster)
memory_percentage{instance="prdClusterMetric"}
(11) vcores_percentage (vcore usage percentage of each queue in the cluster)
vcores_percentage{instance="prdClusterMetric"}
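Two of the queries above wrap the raw series in `delta` and `increase`. As a rough illustration of what these functions compute over the samples inside a range window, here is a simplified Java sketch. It deliberately ignores the extrapolation to the window boundaries that Prometheus applies, so real results will differ slightly; the class name and sample values are hypothetical.

```java
class RangeFunctionSketch {

    // Simplified increase(): sums the growth between consecutive samples.
    // A drop is treated as a counter reset, so the new value counts from zero.
    static double increase(double[] samples) {
        double total = 0;
        for (int i = 1; i < samples.length; i++) {
            double prev = samples[i - 1];
            double cur = samples[i];
            total += (cur >= prev) ? cur - prev : cur;
        }
        return total;
    }

    // Simplified abs(delta()): absolute difference between last and first sample.
    static double absDelta(double[] samples) {
        if (samples.length < 2) {
            return 0;
        }
        return Math.abs(samples[samples.length - 1] - samples[0]);
    }

    public static void main(String[] args) {
        // Hypothetical heartbeat counter with one restart (125 -> 5).
        double[] heartbeats = {100, 110, 125, 5, 15};
        // Hypothetical heap usage in MB, a gauge that can go up or down.
        double[] heapMb = {500, 450, 480};
        System.out.println("increase = " + increase(heartbeats));
        System.out.println("abs(delta) = " + absDelta(heapMb));
    }
}
```

This is why `increase` suits an ever-growing counter such as HeartbeatsTotalNumOps (a flat result over one minute means no heartbeats arrived), while `delta` suits gauges such as heap usage.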
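12. Alerting templates (Feishu and WeCom integration)
13. Alert message template
Alert! The NameNode in the production cluster has been down for more than 5 minutes! Awaiting your decision, my lord!
<#list root as c>
Metric: ${c.metric}
Alert value: ${c.value}
</#list>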
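The template above loops over a list called `root` and prints the `metric` and `value` of each entry. For readers unfamiliar with that template syntax, the sketch below produces the equivalent text in plain Java, with English wording. It is only an illustration of the loop: the `Alert` class, the method names, and the sample entry are hypothetical, and no template engine is involved.

```java
import java.util.ArrayList;
import java.util.List;

class AlertMessageSketch {

    // One firing alert: the metric expression and its current value.
    static final class Alert {
        final String metric;
        final String value;

        Alert(String metric, String value) {
            this.metric = metric;
            this.value = value;
        }
    }

    // Renders the header once, then one metric/value pair per alert.
    static String render(List<Alert> root) {
        StringBuilder sb = new StringBuilder();
        sb.append("Alert! The NameNode in the production cluster has been down for more than 5 minutes!\n");
        for (Alert c : root) {
            sb.append("Metric: ").append(c.metric).append('\n');
            sb.append("Alert value: ").append(c.value).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Alert> root = new ArrayList<>();
        // Hypothetical entry, for illustration only.
        root.add(new Alert("ACTIVESTATE{name=\"NameNode\"}", "0"));
        System.out.print(render(root));
    }
}
```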