KingbaseES V8R3 Cluster Operations Series: The network_rewind.sh Disk Check Explained
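Case description:
In a KingbaseES V8R3 cluster, network_rewind.sh automatically recovers a node's database service when that service goes down. Each time it runs, the script performs a read/write check on the disk that holds the database storage (the data directory). By default, a failed check causes the script to shut the database down. In production, heavy disk I/O can make the check fail spuriously, so a healthy database is shut down and applications are affected. A parameter lets you keep the database service running when the check fails.
Applicable version: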
KingbaseES V8R3
I. Cluster architecture
TEST=# show pool_nodes;
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-------------------
0 | 192.168.1.101 | 54321 | up | 0.500000 | primary | 0 | false | 0
1 | 192.168.1.102 | 54321 | up | 0.500000 | standby | 0 | true | 0
(2 rows)
II. Testing the default disk check
1. Simulate a disk check failure
[root@node102 db]# pwd
/home/kingbase/cluster/HAR3/db
[root@node102 db]# ls -lhd data
drwx------ 20 kingbase kingbase 4.0K Mar 17 10:20 data
[root@node102 db]# chown root.root data
[root@node102 db]# chmod 700 data
[root@node102 db]# ls -lhd data
drwx------ 20 root root 4.0K Mar 17 10:20 data
---As shown above, the database user kingbase now has no read or write permission on the data directory.
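The effect of the ownership change can be confirmed before waiting for the next scheduled run. The following is a minimal sketch, not part of network_rewind.sh: `can_rw_dir` and the `.rw_probe` file name are hypothetical, and the function only reports whether the user running it can list the directory and create a file in it.

```shell
#!/bin/sh
# Hypothetical helper, not part of network_rewind.sh: report whether the
# current user can list a directory and create a file in it.
can_rw_dir() {
    dir=$1
    # Listing fails (for example with "Permission denied") when the
    # directory is missing or read/execute access is not granted.
    if ! ls "$dir" >/dev/null 2>&1; then
        echo "cannot list $dir"
        return 1
    fi
    # Create and remove a scratch file to prove write access.
    probe="$dir/.rw_probe.$$"
    if ! touch "$probe" 2>/dev/null; then
        echo "cannot write to $dir"
        return 1
    fi
    rm -f "$probe"
    echo "read/write ok on $dir"
}
```

Run it from the kingbase account, for example `can_rw_dir /home/kingbase/cluster/HAR3/db/data`. Running it as root is not meaningful here, because root bypasses the permission bits.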
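2. View the node's recovery.log
Tips:
By default in a KingbaseES V8R3 cluster, crond calls the network_rewind.sh script once a minute to check the node's database status. The details of each run are written to recovery.log.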
2023-03-17 10:21:01 recover beging...
my pid is 9836,officially began to perform recovery
2023-03-17 10:21:01 check read/write on mount point
2023-03-17 10:21:01 check read/write on mount point (1 / 6).
2023-03-17 10:21:01 stat the directory of the mount point "/home/kingbase/cluster/HAR3/db/data" ...
ls: cannot open directory /home/kingbase/cluster/HAR3/db/data: Permission denied
could not stat the mount point "/home/kingbase/cluster/HAR3/db/data", please check it
could not execute "ls /home/kingbase/cluster/HAR3/db/data".
2023-03-17 10:21:01 failed to check read/write on mount point (1 / 6).
2023-03-17 10:21:11 check read/write on mount point (2 / 6).
2023-03-17 10:21:11 stat the directory of the mount point "/home/kingbase/cluster/HAR3/db/data" ...
ls: cannot open directory /home/kingbase/cluster/HAR3/db/data: Permission denied
could not stat the mount point "/home/kingbase/cluster/HAR3/db/data", please check it
could not execute "ls /home/kingbase/cluster/HAR3/db/data".
2023-03-17 10:21:11 failed to check read/write on mount point (2 / 6).
2023-03-17 10:21:21 check read/write on mount point (3 / 6).
2023-03-17 10:21:21 stat the directory of the mount point "/home/kingbase/cluster/HAR3/db/data" ...
ls: cannot open directory /home/kingbase/cluster/HAR3/db/data: Permission denied
could not stat the mount point "/home/kingbase/cluster/HAR3/db/data", please check it
could not execute "ls /home/kingbase/cluster/HAR3/db/data".
2023-03-17 10:21:21 failed to check read/write on mount point (3 / 6).
2023-03-17 10:21:31 check read/write on mount point (4 / 6).
2023-03-17 10:21:31 stat the directory of the mount point "/home/kingbase/cluster/HAR3/db/data" ...
ls: cannot open directory /home/kingbase/cluster/HAR3/db/data: Permission denied
could not stat the mount point "/home/kingbase/cluster/HAR3/db/data", please check it
could not execute "ls /home/kingbase/cluster/HAR3/db/data".
2023-03-17 10:21:31 failed to check read/write on mount point (4 / 6).
2023-03-17 10:21:41 check read/write on mount point (5 / 6).
2023-03-17 10:21:41 stat the directory of the mount point "/home/kingbase/cluster/HAR3/db/data" ...
ls: cannot open directory /home/kingbase/cluster/HAR3/db/data: Permission denied
could not stat the mount point "/home/kingbase/cluster/HAR3/db/data", please check it
could not execute "ls /home/kingbase/cluster/HAR3/db/data".
2023-03-17 10:21:41 failed to check read/write on mount point (5 / 6).
2023-03-17 10:21:51 check read/write on mount point (6 / 6).
2023-03-17 10:21:51 stat the directory of the mount point "/home/kingbase/cluster/HAR3/db/data" ...
ls: cannot open directory /home/kingbase/cluster/HAR3/db/data: Permission denied
could not stat the mount point "/home/kingbase/cluster/HAR3/db/data", please check it
could not execute "ls /home/kingbase/cluster/HAR3/db/data".
2023-03-17 10:21:51 failed to check read/write on mount point (6 / 6).
2023-03-17 10:22:01 execute check_mount_point() failed, maybe the disk is error
2023-03-17 10:22:01 USE_CHECK_DISK = on, will exit with stop db.
exit with error and stop db.....
sys_ctl: could not open PID file "/home/kingbase/cluster/HAR3/db/data/kingbase.pid": Permission denied
2023-03-17 10:22:01 now will del vip [192.168.1.204/24]
I'm already recovery now pid[9836], return nothing to do,will exit script will success
now, there is no 192.168.1.204/24 on my DEV
......
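---As shown above, the script runs the read/write check against the "/home/kingbase/cluster/HAR3/db/data" directory. All six attempts, made 10 seconds apart, fail; because USE_CHECK_DISK is on, the script then exits and stops the database.
Figure: the disk check fails and the database service is shut down.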
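The behavior visible in the log (six attempts, 10 seconds apart, then a decision based on USE_CHECK_DISK) can be sketched roughly as follows. This is an illustration reconstructed from the log messages, not the source of check_mount_point(): the names `RETRIES`, `INTERVAL`, `check_dir_with_retry` and `handle_check_result` are assumptions, and the sketch only lists the directory, whereas the real check may do more.

```shell
#!/bin/sh
# Sketch only: the real check_mount_point() in network_rewind.sh is not
# reproduced here. RETRIES and INTERVAL are assumed names; the values 6 and
# 10 are read off the timestamps in recovery.log.
RETRIES=${RETRIES:-6}
INTERVAL=${INTERVAL:-10}

# Try to list the directory up to RETRIES times; progress goes to stderr.
check_dir_with_retry() {
    dir=$1
    i=1
    while [ "$i" -le "$RETRIES" ]; do
        echo "check read/write on mount point ($i / $RETRIES)." >&2
        if ls "$dir" >/dev/null 2>&1; then
            return 0
        fi
        echo "failed to check read/write on mount point ($i / $RETRIES)." >&2
        sleep "$INTERVAL"
        i=$((i + 1))
    done
    return 1
}

# Print the action implied by the check result and the switch value:
# 1 = stop the database on failure, 0 = leave it running.
handle_check_result() {
    dir=$1
    use_check_disk=$2
    if check_dir_with_retry "$dir"; then
        echo "continue"
    elif [ "$use_check_disk" = "1" ]; then
        echo "stop db"
    else
        echo "do nothing"
    fi
}
```

With these values a persistent failure takes about one minute to confirm, which matches the gap between 10:21:01 and 10:22:01 in the log.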
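III. Adjusting the disk check
Tips:
The parameter is described as: "if failed in check_mount_point(), should stop the database? default is on, do stop db".
With USE_CHECK_DISK=1 (the default), the database service is stopped when the check fails; with USE_CHECK_DISK=0, the database service is left running.
1. Configure the disk check parameter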
[root@node102 db]# cat etc/HAmodule.conf |grep -i disk
USE_CHECK_DISK=0
---Add this parameter to HAmodule.conf on every node (the default configuration file does not contain it).
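Because the parameter has to be added by hand on every node, a small idempotent helper avoids duplicate entries when the step is repeated. The sketch below is hypothetical (`set_conf_param` is not a KingbaseES tool) and assumes the plain `KEY=VALUE` line format shown in the grep output above.

```shell
#!/bin/sh
# Hypothetical helper: set KEY=VALUE in a conf file, replacing an existing
# line for KEY or appending one if the key is absent.
set_conf_param() {
    file=$1
    key=$2
    value=$3
    if grep -q "^${key}=" "$file"; then
        # Rewrite the existing line through a temp file, then copy the
        # content back so the original file keeps its owner and mode.
        sed "s|^${key}=.*|${key}=${value}|" "$file" > "$file.tmp" &&
            cat "$file.tmp" > "$file" &&
            rm -f "$file.tmp"
    else
        echo "${key}=${value}" >> "$file"
    fi
}

# Example, using the path from this article (relative to the db directory):
# set_conf_param etc/HAmodule.conf USE_CHECK_DISK 0
```

Run it on each node, then confirm the result with the same `grep -i disk` command shown above.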
2. Simulate a disk check failure
[root@node102 db]# pwd
/home/kingbase/cluster/HAR3/db
[root@node102 db]# ls -lhd data
drwx------ 20 kingbase kingbase 4.0K Mar 17 10:20 data
[root@node102 db]# chown root.root data
[root@node102 db]# chmod 700 data
[root@node102 db]# ls -lhd data
drwx------ 20 root root 4.0K Mar 17 10:20 data
---As shown above, the database user kingbase now has no read or write permission on the data directory.
3. View the node's recovery.log
---------------------------------------------------------------------
2023-03-17 10:33:01 recover beging...
my pid is 16274,officially began to perform recovery
2023-03-17 10:33:01 check read/write on mount point
2023-03-17 10:33:01 check read/write on mount point (1 / 6).
2023-03-17 10:33:01 stat the directory of the mount point "/home/kingbase/cluster/HAR3/db/data" ...
.......
2023-03-17 10:33:51 check read/write on mount point (6 / 6).
2023-03-17 10:33:51 stat the directory of the mount point "/home/kingbase/cluster/HAR3/db/data" ...
ls: cannot open directory /home/kingbase/cluster/HAR3/db/data: Permission denied
could not stat the mount point "/home/kingbase/cluster/HAR3/db/data", please check it
could not execute "ls /home/kingbase/cluster/HAR3/db/data".
2023-03-17 10:33:51 failed to check read/write on mount point (6 / 6).
2023-03-17 10:34:01 execute check_mount_point() failed, maybe the disk is error
2023-03-17 10:34:01 USE_CHECK_DISK = off, do nothing.
2023-03-17 10:34:01 check read/write on mount point ... ok
2023-03-17 10:34:01 check if the network is ok
I'm already recovery now pid[16274], return nothing to do,will exit script will success
ping trust ip 192.168.1.1 success ping times :[3], success times:[2]
determine if i am master or standby
........
Figure: the disk check fails, but the database is not shut down.
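To tell quickly which of the two outcomes a node last recorded, the log can be searched for the USE_CHECK_DISK message. The following sketch relies only on the two message texts that appear in the logs in this article ("USE_CHECK_DISK = on" and "USE_CHECK_DISK = off"); the function name and its output strings are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: report the most recent USE_CHECK_DISK decision found
# in a recovery.log, based on the message texts shown in this article.
last_disk_check_action() {
    log=$1
    # The message is only logged after check_mount_point() has failed.
    line=$(grep 'USE_CHECK_DISK = ' "$log" | tail -n 1)
    case $line in
        *"USE_CHECK_DISK = on"*)  echo "stopped db" ;;
        *"USE_CHECK_DISK = off"*) echo "kept db running" ;;
        *)                        echo "no disk check failure logged" ;;
    esac
}
```

Call it with the path to the node's recovery.log. The path depends on the installation, so none is assumed here.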
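IV. Summary
The disk check helps protect the data in a clustered database, but in some production environments heavy disk I/O can cause it to report a false failure. Adjust the USE_CHECK_DISK parameter to suit the production workload, so that the cluster stays highly available while the data stays safe.
KINGBASE Research Institute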