KingbaseES V8R6 Cluster Deployment: Script Fails to Read the db_user Parameter

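Case description:
A KingbaseES V8R6 cluster was being expanded online with the deployment script to add a new standby node. During deployment the script could not obtain a value for the db_user variable, and the deployment failed. This case study traces the root cause and gives a fix, as a reference for troubleshooting similar script deployment problems.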

Applicable version:
KingbaseES V8R6

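Troubleshooting workflow:

1. Check the configuration file entries related to the reported error.
2. Run 'sh -x cluster_install.sh expand' to trace the script and locate the failing step.
3. Run the failing statement by hand and inspect its result.
4. Analyze the statement the script executed to determine why it fails.
5. Based on what the script does, check the corresponding configuration entries.
6. Combine these findings to identify the root cause.
7. Propose a fix based on the root cause.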

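I. Symptoms
As shown below, when the script is run to expand the KingbaseES V8R6 cluster, it reports "param[db_user]: is null in cluster, please check cluster status,exit!". The db_user parameter cannot be assigned a value and the deployment fails.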

[kingbase@node203 r6_install]$ sh cluster_install.sh expand
[CONFIG_CHECK] will deploy the cluster of
[RUNNING] success connect to the target "192.168.1.203" ..... OK
[RUNNING] success connect to "192.168.1.203" from current node by 'ssh' ..... OK
[RUNNING] success connect to "" from "192.168.1.203" by 'ssh' ..... OK
[RUNNING] success connect to the target "192.168.1.201" ..... OK
[RUNNING] success connect to "192.168.1.201" from current node by 'ssh' ..... OK
[RUNNING] success connect to "" from "192.168.1.201" by 'ssh' ..... OK
[RUNNING] Primary node ip is 192.168.1.201 ...
[DEBUG] connecting to: "user=esrep connect_timeout=10 dbname=esrep host=192.168.1.201 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000 fallback_application_name=internal_rwcmgr options=-csearch_path="
[DEBUG] connecting to: "user=esrep connect_timeout=10 dbname=esrep host=192.168.1.201 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000 fallback_application_name=internal_rwcmgr options=-csearch_path="
[DEBUG] connecting to: "user=esrep connect_timeout=10 dbname=esrep host=192.168.1.202 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=internal_rwcmgr options=-csearch_path="
[DEBUG] connecting to: "user=esrep connect_timeout=10 dbname=esrep host=192.168.1.201 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000 fallback_application_name=internal_rwcmgr options=-csearch_path="
[RUNNING] Primary node ip is 192.168.1.201 ... OK
[INSTALL] load config from cluster.....
[WARNING] the /home/kingbase/r6_install/install.conf param[db_user]: is null in cluster, please check cluster status,exit!
---The script fails because the value of db_user is null.

II. Analysis
1. Check the install.conf configuration
As shown below, db_user is set to "system" in install.conf, so the setting itself is correct.

[kingbase@node201 bin]$ cat install.conf |grep db_user
db_user="system"                 # the user name of database

2. Trace the script execution
Running 'sh -x cluster_install.sh expand' shows that a statement fails just before the script evaluates the db_user variable:

++ local user=root
++ local host=192.168.1.201
++ local 'command=test -f /home/kingbase/cluster/R6C8/HAC8/kingbase/kingbase/etc/all_nodes_tools.conf'
++ '[' 1 -eq 1 ']'
++ ssh -q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -p 22 -o ServerAliveInterval=2 -o ServerAliveCountMax=3 -l root -T 192.168.1.201 'test -f /home/kingbase/cluster/R6C8/HAC8/kingbase/kingbase/etc/all_nodes_tools.conf'
++ '[' 1 -ne 0 ']'
++ return 1
++ '[' 1 -ne 0 ']'
++ return 1
+ local cluster_db_user=
+ compare_param_diff db_user system ''
+ local param_name=db_user
+ local install_param=system
+ local cluster_param=
+ '[' x == x ']'
+ echo '[WARNING] the /home/kingbase/r6_install/install.conf param[db_user]: is null in cluster, please check cluster status,exit!'
[WARNING] the /home/kingbase/r6_install/install.conf param[db_user]: is null in cluster, please check cluster status,exit!
+ exit 1

---As shown above, the following statement returns a non-zero exit status ($?), i.e. it fails:
ssh -q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -p 22 -o ServerAliveInterval=2 -o ServerAliveCountMax=3 -l root -T 192.168.1.201 'test -f /home/kingbase/cluster/R6C8/HAC8/kingbase/kingbase/etc/all_nodes_tools.conf'
++ '[' 1 -ne 0 ']'
++ return 1

Run the failing statement directly from the shell and check its exit status:

[kingbase@node203 r6_install]$  ssh -q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -p 22 -o ServerAliveInterval=2 -o ServerAliveCountMax=3 -l root -T 192.168.1.201 'test -f /home/kingbase/cluster/R6C8/HAC8/kingbase/kingbase/etc/all_nodes_tools.conf'
[kingbase@node203 r6_install]$ echo $?
1

'echo $?' prints 1, a non-zero status, so the statement failed: the remote 'test -f' did not find the file at that path.
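When repeating this check by hand, it helps to distinguish "the file is missing" from "ssh could not connect". ssh exits with the exit status of the remote command, or with 255 if ssh itself hit an error, and 'test -f' returns 1 when the path is not a regular file. A minimal sketch follows; explain_ssh_test_status is a hypothetical helper written for this article, not part of cluster_install.sh.

```shell
# Hypothetical helper (not part of cluster_install.sh): interpret the exit
# status of a command of the form  ssh ... HOST 'test -f PATH'.
explain_ssh_test_status() {
    case "$1" in
        0)   echo "file exists on the remote host" ;;
        1)   echo "ssh connected, but test -f failed: file missing or not a regular file" ;;
        255) echo "ssh itself failed: connection or authentication error" ;;
        *)   echo "unexpected exit status: $1" ;;
    esac
}

# Usage, right after running the ssh statement shown above:
#   explain_ssh_test_status $?
```

In this case the status is 1, which points at the path rather than at the ssh connection.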

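3. Check the all_nodes_tools.conf configuration
During execution the script reads this configuration file to obtain the 'db_user' value and then compares it with 'db_user' in install.conf. The file on the primary node contains: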

[kingbase@node201 bin]$ cat ../etc/all_nodes_tools.conf
db_u=system
db_password=MTIzNDU2NzhhYg==
db_port=54321
db_name=test

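4. Analyze the script
The script reads the 'db_user' value from both install.conf and all_nodes_tools.conf and compares the two. In the trace above, the value read from the cluster side is empty, which triggers the "is null in cluster" warning.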
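The comparison logic can be sketched as follows. This is an illustrative reconstruction based only on the trace output, not the actual code of cluster_install.sh; the function names read_param and compare_param are hypothetical, and the parsing assumes simple KEY=VALUE lines with optional double quotes and trailing "#" comments.

```shell
# Hypothetical sketch, not the real cluster_install.sh code.

# read_param FILE KEY: print the value of the first KEY=VALUE line in FILE,
# without surrounding double quotes or a trailing "# comment".
read_param() {
    sed -n "s/^$2=\"\{0,1\}\([^\"#]*\)\"\{0,1\}.*/\1/p" "$1" |
        sed 's/[[:space:]]*$//' | head -n 1
}

# compare_param NAME INSTALL_VALUE CLUSTER_VALUE: fail when the cluster-side
# value is empty or differs from the value in install.conf.
compare_param() {
    name=$1
    install_value=$2
    cluster_value=$3
    if [ -z "$cluster_value" ]; then
        echo "param[$name] is null in cluster"
        return 1
    fi
    if [ "$install_value" != "$cluster_value" ]; then
        echo "param[$name] differs: install.conf=$install_value cluster=$cluster_value"
        return 1
    fi
    echo "param[$name] matches"
}

# Usage (paths are placeholders):
#   compare_param db_user "$(read_param install.conf db_user)" \
#                         "$(read_param /path/to/all_nodes_tools.conf db_user)"
```

If the cluster-side file cannot be read at all, the third argument is empty and the first branch fires, which matches the behavior seen in the trace.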

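5. Inspect the failing statement
The path to all_nodes_tools.conf in the failing statement contains an extra 'kingbase' directory level, so the script cannot find the file and cannot read the cluster's settings.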

ssh -q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -p 22 -o ServerAliveInterval=2 -o ServerAliveCountMax=3 -l root -T 192.168.1.201 'test -f /home/kingbase/cluster/R6C8/HAC8/kingbase/kingbase/etc/all_nodes_tools.conf'
---The path '/home/kingbase/cluster/R6C8/HAC8/kingbase/kingbase/etc/all_nodes_tools.conf' in this statement contains one 'kingbase' level too many.

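III. Resolution

1. Check the install.conf configuration

As shown below, 'install_dir' under the [expand] section of install.conf is set incorrectly: it contains an extra 'kingbase' directory level: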

[expand]
expand_type="0"                   # The node type of standby/witness node, which would be add to cluster. 0:standby  1:witness
primary_ip="192.168.1.201"                    # The ip addr of cluster primary node, which need to expand a standby/witness node.
expand_ip="192.168.1.203"                     # The ip addr of standby/witness node, which would be add to cluster.
node_id="3"                       # The node_id of standby/witness node, which would be add to cluster. It does not the same with any one in  cluster node
                                 # for example: node_id="3"
## Specific instructions ,see it under [install]
install_dir="/home/kingbase/cluster/R6C8/HAC8/kingbase"  ### incorrect path: extra 'kingbase' level

Description of [install_dir] from the script:

## the path of cluster to be deployed, for example: install_dir="/home/kingbase/tmp_kingbase" [if it is BMJ, you do not need to configure this parameter]
## the directory structure after deployment:
##           ${install_dir}/kingbase/data                               the data directory
##           ${install_dir}/kingbase/archive                            log archive directory
##           ${install_dir}/kingbase/etc                                configuration file directory
##           ${install_dir}/kingbase/bin、lib、share、log       install file directory
## the last layer of directory could not add '/'

The on-disk layout of the cluster deployment in the current environment is as follows:

---Cluster deployment path [install_dir]
[kingbase@node203 HAC8]$ pwd
/home/kingbase/cluster/R6C8/HAC8

[kingbase@node203 HAC8]$ ls -lh
total 308M
-rw-------  1 kingbase kingbase   62 Oct 11  2023 control.so
-rwxr-xr-x  1 kingbase root     308M Oct 11  2023 db.zip
drwxrwxr-x 10 kingbase kingbase  127 Nov  7 19:08 kingbase
-rwxrwxrwx  1 kingbase kingbase 3.6K Oct 11  2023 license.dat

---The kingbase directory is created automatically during deployment
[kingbase@node203 HAC8]$ tree -d -L 2
.
└── kingbase
    ├── archive
    ├── bin
    ├── data
    ├── etc
    ├── include
    ├── lib
    ├── log
    └── share

9 directories
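Given this layout, a candidate install_dir can be sanity-checked before running the script. The sketch below is a hypothetical helper, not part of the KingbaseES tooling; it only encodes the two rules from the install_dir description above (no trailing '/', and ${install_dir}/kingbase/etc must exist on an already deployed node).

```shell
# Hypothetical helper (not part of the KingbaseES tooling).
# check_install_dir DIR: succeed only if DIR has no trailing '/' and
# DIR/kingbase/etc exists, as expected on an already deployed node.
check_install_dir() {
    dir=$1
    case "$dir" in
        */) echo "install_dir must not end with '/': $dir"
            return 1 ;;
    esac
    if [ ! -d "$dir/kingbase/etc" ]; then
        echo "missing $dir/kingbase/etc: install_dir is probably at the wrong level"
        return 1
    fi
    echo "install_dir looks consistent: $dir"
}

# Usage on a deployed node:
#   check_install_dir /home/kingbase/cluster/R6C8/HAC8
```

With the layout shown above, the value ending in /HAC8 passes, while the value ending in /HAC8/kingbase fails because there is no kingbase/kingbase/etc directory.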

2. Correct the setting in install.conf

## Specific instructions ,see it under [install]
install_dir="/home/kingbase/cluster/R6C8/HAC8"

3. Re-run the deployment script (sh cluster_install.sh expand)

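IV. Summary
When deploying a cluster with the script, make sure the parameters in install.conf are correct. If a deployment fails, run 'sh -x cluster_install.sh' with the relevant subcommand to trace the script, pinpoint the failing statement, and then fix the specific cause.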
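For long traces it can be easier to save the 'sh -x' output to a file and search it. 'sh -x' writes the trace to stderr, so stderr has to be redirected as well. The sketch below is one way to do this; the log path and the list_ssh_calls helper are assumptions made for illustration.

```shell
# Assumed log location; adjust as needed.
trace_log=${TRACE_LOG:-/tmp/cluster_install_trace.log}

# Capture both normal output and the -x trace (written to stderr):
#   sh -x cluster_install.sh expand >"$trace_log" 2>&1

# Hypothetical helper: list every traced ssh invocation in an 'sh -x' log,
# prefixed with its line number, so the statement before a failure is easy
# to find.
list_ssh_calls() {
    grep -n '^+\{1,\} ssh ' "$1"
}

# Usage:
#   list_ssh_calls "$trace_log"
```

Each reported line can then be copied and run by hand to check its exit status.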

posted @ 2023-11-09 11:46  天涯客1224