KingbaseES Cluster Start/Stop Series 04 -- VIP Fails to Load at Cluster Startup
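Case description:
When a KingbaseES V8R6 cluster is started with sys_monitor.sh, the script loads the VIP on the primary node. At a production site, loading the VIP failed during cluster startup. This case describes in detail how the problem was resolved.
Applicable version:
KingbaseES V8R6
Operating system:
Kylin Xin'an (kylin信安)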
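Troubleshooting approach:
1. Ping the VIP address to test whether it is already in use.
2. Run sh -x sys_monitor.sh start to trace the script and find the exact statements that load the VIP.
3. Run the script's VIP test statement and check what it returns.
4. Analyze the test statement in the script to narrow down the cause of the failure.
5. Run the same VIP test statement manually on the current system and compare the result with what the script expects, to confirm the cause.
6. Provide a solution for the identified cause.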
I. Symptom
As shown below, after sys_monitor.sh starts the cluster, the following error is reported and the VIP fails to load:
2024-06-14 19:29:32 The primary DB is started.
2024-06-14 19:29:32 The virtual ip [172.172.20.115] has already exists and not on primary host [172.172.20.113], exit.
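# The script reports that the VIP is already in use on another node and is not on the primary node.
II. Problem analysis
1. Run a ping test (is the VIP address in use?)
As shown below, pinging the VIP shows that the address is not in use: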
[kingbase@node201 bin]$ ping 172.172.20.115 -c 3 -w 3
PING 172.172.20.115 (172.172.20.115) 56(84) bytes of data.
--- 172.172.20.115 ping 统计 ---
发送3个包,已接收0个包, 100% packet loss, time 2000ms
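A reachability check can also rely on ping's exit status instead of its printed summary, which avoids depending on the language of the output. The following is a minimal sketch, not part of sys_monitor.sh; the function name vip_status and the PING_BIN override are hypothetical, and it assumes an iputils-style ping that exits with 0 when a reply arrives for -c 1 within the -w deadline.

```shell
# Sketch only: vip_status and PING_BIN are illustrative names, not part of sys_monitor.sh.
# PING_BIN lets the check be exercised without touching the network.
vip_status() {
    ip=$1
    # Assumption: ping exits 0 when the single requested reply arrives in time.
    if ${PING_BIN:-ping} -c 1 -w 3 "$ip" >/dev/null 2>&1; then
        echo "in-use"
    else
        echo "free"
    fi
}
```

For example, vip_status 172.172.20.115 would print "free" in the situation shown above, where no reply is received.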
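2. Analyze the startup with sh -x sys_monitor.sh start
As shown below, tracing the startup with sh -x sys_monitor.sh start reveals that the variable holding the ping test result for the VIP is empty: ping_result=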
++ ssh -o StrictHostKeyChecking=no -o ConnectTimeout=10 -q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o ServerAliveInterval=2 -o ServerAliveCountMax=5 -p 22 -l kingbase -T 172.172.20.113 'ping 172.172.20.115 -c 3 -w 3 | grep received | awk '\''{print $4}'\'''
++ '[' 0 -ne 0 ']'
++ return 0
+ ping_result=
+ '[' 0 -ne 0 ']'
+ '[' x = 0x ']'
+ return 0
+ '[' 0 -eq 0 ']'
+ return 0
+ '[' 0 -eq 0 ']'
++ execute_command root 172.172.20.113 '/usr/sbin/ip addr | grep -w "172.172.20.115" | wc -l'
++ local user=root
++ local host=172.172.20.113
++ local 'command=/usr/sbin/ip addr | grep -w "172.172.20.115" | wc -l'
++ '[' 0 -eq 0 ']'
++ ssh -o StrictHostKeyChecking=no -o ConnectTimeout=10 -q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o ServerAliveInterval=2 -o ServerAliveCountMax=5 -p 22 -l root -T 172.172.20.113 '/usr/sbin/ip addr | grep -w "172.172.20.115" | wc -l'
++ '[' 0 -ne 0 ']'
++ return 0
+ local on_primary_host=0
+ '[' 0 -ne 0 ']'
+ '[' 0x '!=' x ']'
+ '[' 0 -eq 0 ']'
++ date '+%Y-%m-%d %H:%M:%S'
+ echo '2024-06-14 19:29:32 The virtual ip [172.172.20.115] has already exists and not on primary host [172.172.20.113], exit.'
2024-06-14 19:29:32 The virtual ip [172.172.20.115] has already exists and not on primary host [172.172.20.113], exit.
+ exit 1
As shown below, when the VIP is not in use, the ping test should yield ping_result=0:
++ ssh -o StrictHostKeyChecking=no -o ConnectTimeout=10 -q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -p 22 -o ServerAliveInterval=2 -o ServerAliveCountMax=3 -l kingbase -T 192.168.1.201 'ping 192.168.1.88 -c 3 -w 3 | grep received | awk '\''{print $4}'\'''
++ '[' 0 -ne 0 ']'
++ return 0
+ ping_result=0
3. Inspect the statements in sys_monitor.sh
As shown below, this is the test statement that the script runs:
function ping_ip_on_host()
{
    local host=$1
    local ip=$2
    local ping_result=""
    local is_ipv6=`echo "$ip" | grep -o ":" | wc -l`
    if [ ${is_ipv6} -eq 0 ]
    then
        ping_result=`execute_command ${execute_user} $host "ping ${ip} -c 3 -w 3 | grep received | awk '{print \\$4}'"`
    else
        ping_result=`execute_command ${execute_user} $host "ping6 ${ip} -c 3 -w 3 | grep received | awk '{print \\$4}'"`
    fi
    if [ $? -ne 0 ] || [ "$ping_result"x = "0"x ]
    then
        return 1
    fi
    return 0
}
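The final comparison in this function only returns 1 when the result is exactly "0", so an empty result falls through to return 0. The sketch below mirrors just that string comparison to make the three cases visible; classify_ping_result is a hypothetical helper written for illustration, and the labels "reachable" and "not reachable" are assumptions about how the caller interprets the return code, based on the trace above.

```shell
# Sketch only: mirrors the [ "$ping_result"x = "0"x ] test from ping_ip_on_host.
# classify_ping_result is an illustrative name, not part of sys_monitor.sh.
classify_ping_result() {
    ping_result=$1
    if [ "$ping_result"x = "0"x ]; then
        echo "not reachable"   # corresponds to "return 1" in the script
    else
        echo "reachable"       # corresponds to "return 0" in the script
    fi
}

classify_ping_result "0"   # zero packets received
classify_ping_result "3"   # three packets received
classify_ping_result ""    # empty result, treated the same as a non-zero count
```

The third call shows why an empty ping_result is dangerous here: it is indistinguishable from a successful ping.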
Running the test statement on the current system:
[系统未激活][kingbase@localhost bin] ssh -o StrictHostKeyChecking=no \
-o ConnectTimeout=10 -q -o ConnectTimeout=10 -o StrictHostKeyChecking=no \
-o ServerAliveInterval=2 -o ServerAliveCountMax=5 -p 22 -l kingbase \
-T 172.172.20.113 \
'ping 172.172.20.115 -c 3 -w 3 | grep received | awk '\''{print $4}'\'''
The command returns an empty value (no output).
Test in a CentOS environment:
# If the VIP is already in use, the result is 3
[kingbase@node201 bin]$ ssh -o StrictHostKeyChecking=no -o ConnectTimeout=10 \
-q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o ServerAliveInterval=2 \
-o ServerAliveCountMax=5 -p 22 -l kingbase -T 192.168.1.201 \
'ping 192.168.1.88 -c 3 -w 3 | grep received | awk '\''{print $4}'\'''
3
# If the VIP is not in use, the result is 0
[kingbase@node201 bin]$ ssh -o StrictHostKeyChecking=no -o ConnectTimeout=10 \
-q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o ServerAliveInterval=2 \
-o ServerAliveCountMax=5 -p 22 -l kingbase -T 192.168.1.201 \
'ping 192.168.1.99 -c 3 -w 3 | grep received | awk '\''{print $4}'\'''
0
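4. Run the ping test against the VIP on the system
As shown below, on the Kylin Xin'an system the ping summary is printed in Chinese, with "received" appearing as "已接收":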
[kingbase@node201 bin]$ ping 172.172.20.115 -c 3 -w 3
PING 172.172.20.115 (172.172.20.115) 56(84) bytes of data.
--- 172.172.20.115 ping 统计 ---
发送3个包,已接收0个包, 100% packet loss, time 2000ms
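However, sys_monitor.sh filters the output with grep on the word "received". The filter matches nothing, so ping_result is left empty.
Because an empty ping_result is not equal to "0", the script treats the VIP as reachable, that is, already in use. The subsequent ip addr check then finds that the VIP is not on the primary host, so the script exits with the error shown earlier and the VIP is not loaded.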
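The effect of the filter can be reproduced without a network by feeding it two sample summary lines. In this sketch the Chinese line is copied from the output above, while the English line is an assumption about what an iputils ping prints for the same result; filter_received is a hypothetical name for the grep and awk stages used by the script.

```shell
# Sketch only: filter_received reproduces the "grep received | awk '{print $4}'" stages.
filter_received() {
    grep received | awk '{print $4}'
}

# Assumed English summary line (iputils format) and the Chinese line seen on Kylin.
en='3 packets transmitted, 0 received, 100% packet loss, time 2000ms'
zh='发送3个包,已接收0个包, 100% packet loss, time 2000ms'

echo "en: [$(printf '%s\n' "$en" | filter_received)]"   # prints: en: [0]
echo "zh: [$(printf '%s\n' "$zh" | filter_received)]"   # prints: zh: []
```

The English line yields the received count in the fourth field, while the Chinese line is dropped by grep before awk ever sees it.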
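III. Solution
1. Configure the Kylin Xin'an system so that command output is printed in English.
2. Alternatively, modify sys_monitor.sh so that the grep filter matches "已接收". Judging from the ping output above, the received count would then no longer be the fourth whitespace-separated field, so the awk extraction would likely need adjusting as well.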
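A third option that leaves both the system locale and the filter untouched is to force the C locale for the ping command alone. The sketch below only builds the remote command string; build_ping_cmd is a hypothetical helper, and it assumes the localized ping honors LC_ALL, as gettext-based programs normally do, which should be verified on the Kylin system before changing sys_monitor.sh.

```shell
# Sketch only: build_ping_cmd is an illustrative helper, not part of sys_monitor.sh.
# Assumption: prefixing LC_ALL=C makes ping print its summary in English.
build_ping_cmd() {
    ip=$1
    case "$ip" in
        *:*) ping_bin=ping6 ;;   # an address containing ":" is treated as IPv6
        *)   ping_bin=ping ;;
    esac
    printf "LC_ALL=C %s %s -c 3 -w 3 | grep received | awk '{print \$4}'\n" "$ping_bin" "$ip"
}

build_ping_cmd 172.172.20.115
```

The printed string is the command that would be passed to execute_command in place of the current one; only the LC_ALL=C prefix differs from the original statement.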
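IV. Summary
This case was caused by a compatibility problem between the script and the operating system. On Linux systems, keep command output in English wherever possible to avoid script compatibility failures of this kind.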
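Before starting a cluster on a new platform, it can help to check which locale governs command messages for the account that runs the scripts. The following sketch applies the standard precedence LC_ALL, then LC_MESSAGES, then LANG; both function names are hypothetical, and the GNU-specific LANGUAGE variable is deliberately ignored here, which is a simplification.

```shell
# Sketch only: both function names are illustrative.
# Prints the locale that governs message language: LC_ALL, then LC_MESSAGES, then LANG.
effective_messages_locale() {
    if [ -n "${LC_ALL:-}" ]; then
        echo "$LC_ALL"
    elif [ -n "${LC_MESSAGES:-}" ]; then
        echo "$LC_MESSAGES"
    else
        echo "${LANG:-C}"
    fi
}

# Returns 0 when messages are expected in English (C, POSIX, or an en_* locale).
messages_are_english() {
    case "$(effective_messages_locale)" in
        C|C.*|POSIX|en_*) return 0 ;;
        *) return 1 ;;
    esac
}
```

Running messages_are_english in the same kind of non-interactive ssh session that sys_monitor.sh uses gives a quick indication of whether text filters such as grep received are likely to match.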
KINGBASE Research Institute