KingbaseES Cluster Typical Cases: A Collection of VIP Loading Failures During Cluster Deployment, Startup, and Switchover

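Case description:
A KingbaseES V8R6 cluster is accessed through a VIP, and the VIP has to be loaded whenever the cluster is deployed, started, or switched over. This case describes why VIP loading fails in each of these situations and how to fix it.

Applicable version: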
KingbaseES V8R6

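Case 1: Scripted cluster deployment cannot load the VIP on openEuler and Kylin

1. Symptom
When a KingbaseES V8R6 cluster is deployed with the deployment script, the script reports during execution that the VIP cannot be loaded.

2. Analysis

1) Check the failure log
Running the script with tracing (sh -x) shows that the ping test used to check whether the VIP address is already in use returns an empty value.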

[kingbase@localhost R6_cluster]$ sh -x cluster_install.sh

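2) Analyze why the deployment script reports 'vip Cannot use'
The script uses the statements below to decide whether the VIP address is already in use. If the number of ping replies received is not 0 and the VIP is not configured on the local host (vip_exist is 0), the script reports the '[Virtual IP] ... Cannot use' error. An empty vip_ping value also satisfies the "not 0" test, so it triggers the same error.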

if [ "$function_name"x == "install"x ]
            then
                vip_ping=`${ping_path}/ping $vip -c 3 2>/dev/null |grep -w "received" |awk '{print $4}'`
                vip_exist=`${ipaddr_path}/ip addr |grep -w "$vip"|wc -l`
                if [ "$vip_ping"x != "0"x -a "$vip_exist"x = "0"x ]
                then
                    echo "$(date +"%Y-%m-%d %T") [ERROR] [Virtual IP] $virtual_ip Cannot use"
                    exit 1
                else
                    echo "$(date +"%Y-%m-%d %T") [INFO] [Virtual IP] $virtual_ip OK"
                fi
            fi
        fi
    else
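To make the condition concrete, the following minimal sketch replays it with fixed inputs. check_vip is a hypothetical helper written for this note and is not a function in cluster_install.sh; only the test expression is copied from the excerpt above.

```shell
#!/bin/sh
# check_vip: hypothetical helper that mirrors the condition in the deployment script.
#   $1 = vip_ping  (packets received from the VIP, or "" if the grep matched nothing)
#   $2 = vip_exist (number of local "ip addr" lines that contain the VIP)
check_vip() {
    vip_ping=$1
    vip_exist=$2
    if [ "$vip_ping"x != "0"x -a "$vip_exist"x = "0"x ]; then
        echo "Cannot use"
    else
        echo "OK"
    fi
}

check_vip 0 0     # no reply, VIP not local           -> OK
check_vip 3 0     # another host answers the VIP      -> Cannot use
check_vip "" 0    # empty ping result (filter failed) -> Cannot use
check_vip 3 1     # VIP already on this host          -> OK
```

The third call is the situation in this case: the address is free, but the empty string is not equal to "0", so the script rejects the VIP.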

3) ping test on openEuler

As shown below, running the same statement on openEuler returns an empty result for an IP address that cannot be reached.

[kingbase@localhost R6_cluster]$ ping 172.31.254.121 -c 3 2>/dev/null |grep -w "received"|awk '{print $4}'

[kingbase@localhost R6_cluster]$

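As shown below, ping on this openEuler system prints its summary in Chinese ("已接收"), whereas the script filters on the English word "received", so the filter matches nothing and the result is empty.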

[kingbase@localhost R6_cluster]$ ping 172.31.254.121 -c 3

PING 172.31.254.121 (172.31.254.121) 56(84) bytes of data.

--- 172.31.254.121 ping 统计 ---

发送3个包,已接收0个包, 100% packet loss, time 2000ms

After repeated testing, the ping statement was changed as follows, which returns 0 for an unreachable address:

[kingbase@localhost R6_cluster]$ ping 172.31.254.121 -c 3 |grep "接收" |awk '{print $5}'

0

3. Solution
Modify the ping statement that the deployment script uses to decide whether the VIP is already in use, as shown below:

if [ "$function_name"x == "install"x ]
            then
                # ping statement modified according to the test on this system:
                vip_ping=`${ping_path}/ping $vip -c 3 |grep "接收" |awk '{print $5}'`
                vip_exist=`${ipaddr_path}/ip addr |grep -w "$vip"|wc -l`
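
A possible alternative, sketched here under the assumption that the Chinese summary comes from the system locale, is to run ping in the C locale so that the original "received" filter keeps working on any system language. received_count and vip_received are hypothetical helper names, not part of the deployment script.

```shell
#!/bin/sh
# received_count: hypothetical helper; reads ping output on stdin and prints
# the number that precedes the word "received" in the summary line.
received_count() {
    awk '{ for (i = 2; i <= NF; i++) if ($i ~ /^received/) print $(i - 1) }'
}

# vip_received: hypothetical wrapper; LC_ALL=C requests untranslated (English)
# messages, so the summary line contains "received" whatever the system language.
vip_received() {
    LC_ALL=C ping -c 3 "$1" 2>/dev/null | received_count
}

# Example call (needs network access, so it is left commented out):
# vip_ping=$(vip_received 172.31.254.121)
```

Compared with matching the Chinese text, this keeps one code path for every distribution, but it should be verified on the target system before the script is changed.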

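Case 2: VIP loading fails at cluster startup

1. Symptom
As shown below, when the cluster is started with sys_monitor.sh start, startup fails because the ip and arping commands lack the required permissions: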

[kingbase@node201 bin]$ ./sys_monitor.sh start
2024-07-12 14:42:40 Ready to start all DB ...
......
2024-07-12 14:42:49 DB on "[192.168.1.202]" start success.
[ERROR] No execute permission for "/usr/sbin/ip"
incorrect command permissions for the virtual ip.
2024-07-12 14:42:49 There is no primary DB running, will do nothing and exit.

[kingbase@node201 bin]$ ./sys_monitor.sh restart
2024-07-12 14:43:10 Ready to stop all DB ...
......
2024-07-12 14:43:22 execute to start DB on "[192.168.1.202]" success, connect to check it.
2024-07-12 14:43:23 DB on "[192.168.1.202]" start success.
[ERROR] No execute permission for "/home/kingbase/cluster/R6C8/HAC8/kingbase/bin/arping"
incorrect command permissions for the virtual ip.
2024-07-12 14:43:23 There is no primary DB running, will do nothing and exit.

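2. Analysis

During startup, the cluster loads the VIP through the kbha process. The operations actually performed are:

1) VIP loading commands

# 192.168.1.88 is the VIP address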
[root@node201 ~]# /usr/sbin/ip address add 192.168.1.88/24 dev enp0s3
[root@node201 ~]# /home/kingbase/cluster/R6C8/HAC8/kingbase/bin/arping -U 192.168.1.88 -I enp0s3 -w 5 -c 3

2) Check the cluster configuration
The paths of the ip and arping commands are configured as follows:

[kingbase@node201 bin]$ cat ../etc/repmgr.conf |grep _path
arping_path='/home/kingbase/cluster/R6C8/HAC8/kingbase/bin'
ipaddr_path='/usr/sbin'

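3. Solution
When the kingbase user invokes ip and arping, both command files must be owned by root and have the setuid bit set.

# On general-purpose machines, ip must be owned by root and needs the u+s permission (4755)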
[root@node201 ~]# chown root.root /usr/sbin/ip
[root@node201 ~]# chmod 4755 /usr/sbin/ip
[root@node201 ~]# ls -lh /usr/sbin/ip
-rwsr-xr-x 1 root root 460K Oct  1  2020 /usr/sbin/ip

# On general-purpose machines, arping must be owned by root and needs the u+s permission (4755)
[root@node201 ~]# chown root.root /home/kingbase/cluster/R6C8/HAC8/kingbase/bin/arping
[root@node201 ~]# chmod 4755 /home/kingbase/cluster/R6C8/HAC8/kingbase/bin/arping
[root@node201 ~]# ls -lh /home/kingbase/cluster/R6C8/HAC8/kingbase/bin/arping
-rwsr-xr-x 1 root root 14K Sep  2  2023 /home/kingbase/cluster/R6C8/HAC8/kingbase/bin/arping

After this configuration, the cluster starts successfully.
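
The ownership and permission requirements above can be checked before starting the cluster. The following is a minimal sketch: check_vip_cmd is a hypothetical helper, and it assumes GNU stat (the -c '%U' option) is available, as on most Linux distributions.

```shell
#!/bin/sh
# check_vip_cmd: hypothetical pre-start check for one command file.
# It verifies the file is executable, has the setuid bit set, and is owned by root.
check_vip_cmd() {
    cmd=$1
    if [ ! -x "$cmd" ]; then
        echo "$cmd: missing or not executable"
        return 1
    fi
    if [ ! -u "$cmd" ]; then
        echo "$cmd: setuid bit not set"
        return 1
    fi
    owner=$(stat -c '%U' "$cmd")   # assumption: GNU coreutils stat
    if [ "$owner" != "root" ]; then
        echo "$cmd: owner is $owner, expected root"
        return 1
    fi
    echo "$cmd: OK"
}

# Example calls, using the paths from repmgr.conf in this case (commented out):
# check_vip_cmd /usr/sbin/ip
# check_vip_cmd /home/kingbase/cluster/R6C8/HAC8/kingbase/bin/arping
```

Running the check on every node is useful because a switchover loads the VIP on whichever node becomes primary.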

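Case 3: Failover fails because of a VIP configuration error

1. Symptom
In a cluster with one primary and one standby, failover fails. As the hamgr.log of the original standby shows below, the new primary times out while loading the VIP (vip_timeout=60s), so the switchover fails:

# After the pqping() check has failed for the threshold of 5 attempts, the primary/standby switchover is started.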
[2024-07-12 15:53:31] [INFO] checking state of node "node1" (ID: 1), 5 of 5 attempts
......
[2024-07-12 15:53:38] [WARNING] ping host"192.168.1.88" failed
[2024-07-12 15:53:38] [DETAIL] average RTT value is not greater than zero

# kbha is executed to load the VIP; the arping parameter is misconfigured, so the command fails.
[2024-07-12 15:53:38] [DEBUG] executing:
  /home/kingbase/cluster/R6C8/HAC8/kingbase/bin/kbha -A loadvip
[2024-07-12 15:53:38] [WARNING] arping//home/kingbase/cluster/R6C8/HAC8/kingbase/bin: unknown paramName/value pair provided; ignoring
[2024-07-12 15:53:38] [ERROR] No execute permission for "/sbin/arping"
[2024-07-12 15:53:38] [DEBUG] result of command was 1 (256)
[2024-07-12 15:53:38] [DEBUG] LocalCommand(): oneLineStr returned was:
incorrect command permissions for the virtual ip.
[2024-07-12 15:53:38] [INFO] loadvip result: 0, arping result: 0
[2024-07-12 15:53:38] [WARNING] new primary node (ID: 2) acquire the virtual ip 192.168.1.88/24 failed
[2024-07-12 15:53:38] [DETAIL] incorrect command permissions for the virtual ip.
.......
# The VIP loading attempts time out and exit (vip_timeout=60s), so the switchover cannot complete.
[2024-07-12 15:54:35] [ERROR] the time from the first failure to acquire VIP is 61 seconds (max 60 seconds), timeout
[2024-07-12 15:54:35] [ERROR] cannot promote myself

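2. Analysis

1) Log information
The log errors above show that the arping parameter is misconfigured, so /sbin/arping is invoked instead. That file lacks the required permission, and the command fails.

2) Check the arping configuration
As shown below, the parameter is named "arping", but the correct name is "arping_path". Because the cluster does not recognize a valid parameter, it falls back to '/sbin/arping':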

[kingbase@node202 bin]$ cat ../etc/repmgr.conf |grep arping
arping='/home/kingbase/cluster/R6C8/HAC8/kingbase/bin'
Correct configuration:
arping_path='/home/kingbase/cluster/R6C8/HAC8/kingbase/bin'

# arping permissions (the setuid bit is missing, so the kingbase user cannot execute arping):
[kingbase@node202 bin]$ ls -lh /sbin/arping
-rwxr-xr-x 1 root root 24K Aug  4  2017 /sbin/arping

3. Solution
Correct the arping setting in repmgr.conf:

[kingbase@node202 bin]$ cat ../etc/repmgr.conf |grep arping
arping_path='/home/kingbase/cluster/R6C8/HAC8/kingbase/bin'
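
A misnamed parameter like this can be caught before a failover ever happens. The sketch below is one way to do it: conf_value is a hypothetical helper, and it assumes repmgr.conf uses simple name='value' lines like the ones shown in this case.

```shell
#!/bin/sh
# conf_value: hypothetical helper; prints the value of parameter $1 in config file $2,
# with surrounding blanks and single quotes removed. Prints nothing if the name is absent.
conf_value() {
    awk -F= -v key="$1" -v q="'" '
        { name = $1; gsub(/[ \t]/, "", name) }
        name == key {
            val = $2
            gsub("^[ \t" q "]+|[ \t" q "]+$", "", val)
            print val
        }
    ' "$2"
}

# Example check (commented out; the path is relative to the cluster bin directory):
# for key in arping_path ipaddr_path; do
#     [ -n "$(conf_value "$key" ../etc/repmgr.conf)" ] || echo "WARNING: $key is not set"
# done
```

With the faulty file from this case, the lookup for arping_path returns nothing, which points directly at the wrong parameter name.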

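Summary:
Most VIP loading problems in a cluster are caused by incorrect ownership or permissions on the ip and arping commands. In addition, some Chinese domestic Linux distributions, such as UOS, Kylin, openEuler, and 凝思, can be incompatible with the shell code in the deployment or startup scripts. In those cases, trace the script execution to locate the fault.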

posted @ 2024-07-12 14:28  天涯客1224