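KingbaseES V8R6 Cluster Deployment: Scripted Deployment Fails to Read the db_user Parameter
Case description:
A new standby node was being added online to a KingbaseES V8R6 cluster using the deployment script. The deployment failed because the script could not obtain a value for the db_user variable. This case study traces the failure to its root cause and gives a fix, as a reference for troubleshooting similar scripted-deployment problems.
Applicable version:
KingbaseES V8R6
Troubleshooting procedure:
1. Check the configuration files related to the symptom.
2. Run 'sh -x cluster_install.sh expand' to trace the script's execution and locate the problem.
3. Run the failing statement manually on the system and inspect the result.
4. Analyze the statement the script executed to determine why it failed.
5. Following the script's trail, check the relevant settings in the corresponding configuration file.
6. Combine the findings above to identify the root cause.
7. Propose a fix based on the root cause.
I. Symptom
As shown below, when the KingbaseES V8R6 cluster expansion was run through the script, it reported "param[db_user]: is null in cluster, please check cluster status,exit!". The script could not obtain a cluster-side value for db_user, and the deployment failed.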
[kingbase@node203 r6_install]$ sh cluster_install.sh expand
[CONFIG_CHECK] will deploy the cluster of
[RUNNING] success connect to the target "192.168.1.203" ..... OK
[RUNNING] success connect to "192.168.1.203" from current node by 'ssh' ..... OK
[RUNNING] success connect to "" from "192.168.1.203" by 'ssh' ..... OK
[RUNNING] success connect to the target "192.168.1.201" ..... OK
[RUNNING] success connect to "192.168.1.201" from current node by 'ssh' ..... OK
[RUNNING] success connect to "" from "192.168.1.201" by 'ssh' ..... OK
[RUNNING] Primary node ip is 192.168.1.201 ...
[DEBUG] connecting to: "user=esrep connect_timeout=10 dbname=esrep host=192.168.1.201 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000 fallback_application_name=internal_rwcmgr options=-csearch_path="
[DEBUG] connecting to: "user=esrep connect_timeout=10 dbname=esrep host=192.168.1.201 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000 fallback_application_name=internal_rwcmgr options=-csearch_path="
[DEBUG] connecting to: "user=esrep connect_timeout=10 dbname=esrep host=192.168.1.202 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=internal_rwcmgr options=-csearch_path="
[DEBUG] connecting to: "user=esrep connect_timeout=10 dbname=esrep host=192.168.1.201 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000 fallback_application_name=internal_rwcmgr options=-csearch_path="
[RUNNING] Primary node ip is 192.168.1.201 ... OK
[INSTALL] load config from cluster.....
[WARNING] the /home/kingbase/r6_install/install.conf param[db_user]: is null in cluster, please check cluster status,exit!
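--- The script fails because the value of db_user is null.
II. Analysis
1. Check the install.conf configuration
As shown below, install.conf sets db_user to 'system', so the setting itself is correct.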
[kingbase@node201 bin]$ cat install.conf |grep db_user
db_user="system" # the user name of database
2. Trace the script's execution
Run: sh -x cluster_install.sh expand. As the trace below shows, the failure occurs when the script checks the db_user variable.
++ local user=root
++ local host=192.168.1.201
++ local 'command=test -f /home/kingbase/cluster/R6C8/HAC8/kingbase/kingbase/etc/all_nodes_tools.conf'
++ '[' 1 -eq 1 ']'
++ ssh -q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -p 22 -o ServerAliveInterval=2 -o ServerAliveCountMax=3 -l root -T 192.168.1.201 'test -f /home/kingbase/cluster/R6C8/HAC8/kingbase/kingbase/etc/all_nodes_tools.conf'
++ '[' 1 -ne 0 ']'
++ return 1
++ '[' 1 -ne 0 ']'
++ return 1
+ local cluster_db_user=
+ compare_param_diff db_user system ''
+ local param_name=db_user
+ local install_param=system
+ local cluster_param=
+ '[' x == x ']'
+ echo '[WARNING] the /home/kingbase/r6_install/install.conf param[db_user]: is null in cluster, please check cluster status,exit!'
[WARNING] the /home/kingbase/r6_install/install.conf param[db_user]: is null in cluster, please check cluster status,exit!
+ exit 1
--- As shown above, the following statement returns a nonzero exit status ($?), which is where execution goes wrong:
ssh -q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -p 22 -o ServerAliveInterval=2 -o ServerAliveCountMax=3 -l root -T 192.168.1.201 'test -f /home/kingbase/cluster/R6C8/HAC8/kingbase/kingbase/etc/all_nodes_tools.conf'
++ '[' 1 -ne 0 ']'
++ return 1
Run the failing statement manually on the system and check its exit status:
[kingbase@node203 r6_install]$ ssh -q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -p 22 -o ServerAliveInterval=2 -o ServerAliveCountMax=3 -l root -T 192.168.1.201 'test -f /home/kingbase/cluster/R6C8/HAC8/kingbase/kingbase/etc/all_nodes_tools.conf'
[kingbase@node203 r6_install]$ echo $?
1
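echo $? returns a nonzero value, so the statement failed. The status is 1 rather than 255 (the value ssh returns when the connection itself fails), which indicates that the remote 'test -f' ran but did not find the file at that path.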
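A failing 'test -f' only says that the full path does not exist, not which part of it is wrong. The sketch below walks an absolute path one component at a time and prints the first prefix that is missing. The function name first_missing_prefix is a hypothetical helper, not part of cluster_install.sh, and it is meant to be run on the node that holds the file.

```shell
#!/bin/sh
# first_missing_prefix PATH  (hypothetical helper, not part of cluster_install.sh)
# Print the shortest prefix of an absolute PATH that does not exist and
# return 1. Print nothing and return 0 if the whole path exists.
first_missing_prefix() {
    rest=${1#/}          # remaining components, without the leading "/"
    prefix=""
    while [ -n "$rest" ]; do
        part=${rest%%/*} # next component
        case $rest in
            */*) rest=${rest#*/} ;;
            *)   rest="" ;;
        esac
        # Skip empty components produced by doubled slashes.
        [ -z "$part" ] && continue
        prefix="$prefix/$part"
        if [ ! -e "$prefix" ]; then
            printf '%s\n' "$prefix"
            return 1
        fi
    done
    return 0
}
```

For example, calling first_missing_prefix on the primary node with the path taken from the trace shows at which directory level the path stops being valid.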
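3. Check the all_nodes_tools.conf configuration
As shown below, the script reads this configuration file during execution to obtain the cluster's db_user value (stored in this file under the key db_u) and then compares it with db_user in install.conf.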
[kingbase@node201 bin]$ cat ../etc/all_nodes_tools.conf
db_u=system
db_password=MTIzNDU2NzhhYg==
db_port=54321
db_name=test
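4. Analyze the script
The script reads the db_user value from both install.conf and all_nodes_tools.conf and compares the two (the compare_param_diff call in the trace above).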
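The comparison logic can be sketched as follows. This is a reconstruction from the 'sh -x' trace only, not the actual source of cluster_install.sh. The helper read_conf_value, the shortened messages, and the "differs" branch are assumptions, and the sketch returns 1 where the real script calls exit 1.

```shell
#!/bin/sh
# Hypothetical reconstruction from the trace; the real script may differ.

# read_conf_value FILE KEY (assumed helper): print the value of KEY from
# a key=value file. Prints nothing and returns 1 if FILE does not exist,
# which is the situation the wrong path produced.
read_conf_value() {
    file=$1
    key=$2
    [ -f "$file" ] || return 1
    sed -n "s/^${key}=//p" "$file" | head -n 1
}

# compare_param_diff NAME INSTALL_VALUE CLUSTER_VALUE
compare_param_diff() {
    param_name=$1
    install_param=$2
    cluster_param=$3
    # The trace shows this test expanding to '[' x == x ']', i.e. the
    # cluster-side value was empty. "=" is the portable spelling of "==".
    if [ "x${cluster_param}" = "x" ]; then
        echo "[WARNING] param[${param_name}]: is null in cluster"
        return 1    # the real script exits here
    fi
    # Assumed branch: the trace never reaches a value comparison.
    if [ "${install_param}" != "${cluster_param}" ]; then
        echo "[WARNING] param[${param_name}]: differs from cluster"
        return 1
    fi
    return 0
}
```

With a wrong file path, read_conf_value yields an empty string, so compare_param_diff db_user system '' takes the first branch, matching the warning seen in the trace.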
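5. Examine the failing statement
As shown below, the path used for all_nodes_tools.conf contains an extra 'kingbase' directory level, so the script cannot read the cluster's configuration.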
ssh -q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -p 22 -o ServerAliveInterval=2 -o ServerAliveCountMax=3 -l root -T 192.168.1.201 'test -f /home/kingbase/cluster/R6C8/HAC8/kingbase/kingbase/etc/all_nodes_tools.conf'
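--- Inspection shows that the path '/home/kingbase/cluster/R6C8/HAC8/kingbase/kingbase/etc/all_nodes_tools.conf' in this statement contains an extra kingbase level.
III. Resolution
1. Check the install.conf configuration
As shown below, in the [expand] section of install.conf, install_dir is misconfigured: the path includes an extra 'kingbase' level.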
[expand]
expand_type="0" # The node type of standby/witness node, which would be add to cluster. 0:standby 1:witness
primary_ip="192.168.1.201" # The ip addr of cluster primary node, which need to expand a standby/witness node.
expand_ip="192.168.1.203" # The ip addr of standby/witness node, which would be add to cluster.
node_id="3" # The node_id of standby/witness node, which would be add to cluster. It does not the same with any one in cluster node
# for example: node_id="3"
## Specific instructions ,see it under [install]
install_dir="/home/kingbase/cluster/R6C8/HAC8/kingbase" ### incorrect path
Description of [install_dir] from the script's configuration file:
## the path of cluster to be deployed, for example: install_dir="/home/kingbase/tmp_kingbase" [if it is BMJ, you do not need to configure this parameter]
## the directory structure after deployment:
## ${install_dir}/kingbase/data the data directory
## ${install_dir}/kingbase/archive log archive directory
## ${install_dir}/kingbase/etc configuration file directory
## ${install_dir}/kingbase/bin、lib、share、log install file directory
## the last layer of directory could not add '/'
The directory layout of the deployed cluster in the current environment is shown below:
--- Cluster deployment path [install_dir]
[kingbase@node203 HAC8]$ pwd
/home/kingbase/cluster/R6C8/HAC8
[kingbase@node203 HAC8]$ ls -lh
total 308M
-rw------- 1 kingbase kingbase 62 Oct 11 2023 control.so
-rwxr-xr-x 1 kingbase root 308M Oct 11 2023 db.zip
drwxrwxr-x 10 kingbase kingbase 127 Nov 7 19:08 kingbase
-rwxrwxrwx 1 kingbase kingbase 3.6K Oct 11 2023 license.dat
--- The kingbase directory is created automatically during deployment
[kingbase@node203 HAC8]$ tree -d -L 2
.
└── kingbase
    ├── archive
    ├── bin
    ├── data
    ├── etc
    ├── include
    ├── lib
    ├── log
    └── share
9 directories
2. Correct the setting in install.conf
## Specific instructions ,see it under [install]
install_dir="/home/kingbase/cluster/R6C8/HAC8"
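Before re-running the script, the install_dir value can be sanity-checked against the documented layout (${install_dir}/kingbase/etc, no trailing '/'). The following is a minimal sketch, assuming it runs on a node where the cluster is already deployed. check_install_dir is a hypothetical helper and not part of the deployment tool.

```shell
#!/bin/sh
# check_install_dir DIR  (hypothetical helper, not part of cluster_install.sh)
# Return 0 if DIR looks like a valid install_dir for a deployed cluster,
# i.e. DIR has no trailing "/" and DIR/kingbase/etc exists.
check_install_dir() {
    dir=$1
    case $dir in
        */)
            echo "install_dir must not end with '/'"
            return 1
            ;;
    esac
    if [ ! -d "$dir/kingbase/etc" ]; then
        echo "missing directory: $dir/kingbase/etc"
        # Typical mistake from this case: the value already ends with
        # the "kingbase" level that the deployment creates by itself.
        case $dir in
            */kingbase)
                parent=${dir%/kingbase}
                if [ -d "$parent/kingbase/etc" ]; then
                    echo "did you mean install_dir=\"$parent\"?"
                fi
                ;;
        esac
        return 1
    fi
    return 0
}
```

Calling check_install_dir with the original value from this case would report the missing directory and, if the parent matches the documented layout, suggest the path without the trailing kingbase level.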
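3. Re-run the deployment script
IV. Summary
When deploying a cluster with the script, make sure the parameters in install.conf are correct. If a deployment fails, running 'sh -x cluster_install.sh' traces the script's execution and pinpoints the failing step, which can then be addressed directly.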
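Because 'sh -x' writes its trace to standard error, the trace can be saved to a file and searched instead of read from the terminal. The sketch below prints the trace lines leading up to the first traced "echo '[WARNING]" command. context_before_warning is a hypothetical helper, and it assumes a grep that supports the -m and -B options (GNU grep does).

```shell
#!/bin/sh
# context_before_warning LOG [N]  (hypothetical helper)
# Print the N trace lines (default 8) that precede the first traced
# "echo '[WARNING] ..." command in a saved "sh -x" log, plus that line.
context_before_warning() {
    log=$1
    n=${2:-8}
    # -F: fixed string, -m 1: stop after the first match,
    # -B N: also print N lines of leading context.
    grep -F -m 1 -B "$n" -- "echo '[WARNING]" "$log"
}

# Usage (file name expand.trace is an arbitrary choice):
#   sh -x cluster_install.sh expand 2> expand.trace
#   context_before_warning expand.trace 12
```

In this case, that context would include the ssh 'test -f' statement and its 'return 1', which is the step that led to the root cause.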