KingbaseES V8R6 Cluster Deployment: Script Deployment Fails to Read the db_user Parameter

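Case description:
A KingbaseES V8R6 cluster was being expanded online with the deployment script to add a new standby node. During deployment the script could not obtain a value for the db_user variable, and the deployment failed. This case analysis locates the root cause, provides a fix, and offers an approach for troubleshooting similar script deployment problems.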

Applicable version:
KingbaseES V8R6

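Troubleshooting process:

1. Starting from the failure symptom, check the relevant settings in the corresponding configuration file.
2. Run 'sh -x cluster_install.sh expand' to trace the script's execution and locate the problem.
3. Run the failing statement manually on the system and check the exact result.
4. Analyze the statement the script executes to determine the cause of the failure.
5. Based on what the trace shows, check the relevant items in the configuration file.
6. Combine the findings above to identify the root cause.
7. Propose a solution based on the root cause.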
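
Step 2 is easier to work with when the trace is saved to a file instead of scrolling past on the terminal. The following is a minimal sketch, not part of cluster_install.sh: the function names trace_to_log and failed_steps and the log path are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of cluster_install.sh): run a script under
# "sh -x" and store its output together with the trace in one log file.
# The exit status of the traced script is passed through.
trace_to_log() {
    ttl_script=$1
    ttl_log=$2
    shift 2
    sh -x "$ttl_script" "$@" >"$ttl_log" 2>&1
}

# Print the numbered trace lines where a function or the script itself
# ended with a non-zero status, such as "++ return 1" or "+ exit 1".
failed_steps() {
    grep -n -E '^\++ (return|exit) [1-9]' "$1"
}
```

A possible invocation would be `trace_to_log cluster_install.sh /tmp/expand_trace.log expand` followed by `failed_steps /tmp/expand_trace.log`; the lines just above the first match are usually the place to start reading.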

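I. Problem symptom
As shown below, when the KingbaseES V8R6 cluster is expanded with the script, it reports "param[db_user]: is null in cluster, please check cluster status,exit!". No value can be obtained for the db_user parameter, and the deployment fails.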

[kingbase@node203 r6_install]$ sh cluster_install.sh expand
[CONFIG_CHECK] will deploy the cluster of
[RUNNING] success connect to the target "192.168.1.203" ..... OK
[RUNNING] success connect to "192.168.1.203" from current node by 'ssh' ..... OK
[RUNNING] success connect to "" from "192.168.1.203" by 'ssh' ..... OK
[RUNNING] success connect to the target "192.168.1.201" ..... OK
[RUNNING] success connect to "192.168.1.201" from current node by 'ssh' ..... OK
[RUNNING] success connect to "" from "192.168.1.201" by 'ssh' ..... OK
[RUNNING] Primary node ip is 192.168.1.201 ...
[DEBUG] connecting to: "user=esrep connect_timeout=10 dbname=esrep host=192.168.1.201 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000 fallback_application_name=internal_rwcmgr options=-csearch_path="
[DEBUG] connecting to: "user=esrep connect_timeout=10 dbname=esrep host=192.168.1.201 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000 fallback_application_name=internal_rwcmgr options=-csearch_path="
[DEBUG] connecting to: "user=esrep connect_timeout=10 dbname=esrep host=192.168.1.202 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=internal_rwcmgr options=-csearch_path="
[DEBUG] connecting to: "user=esrep connect_timeout=10 dbname=esrep host=192.168.1.201 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000 fallback_application_name=internal_rwcmgr options=-csearch_path="
[RUNNING] Primary node ip is 192.168.1.201 ... OK
[INSTALL] load config from cluster.....
[WARNING] the /home/kingbase/r6_install/install.conf param[db_user]: is null in cluster, please check cluster status,exit!
---The script aborts with an error: the db_user value is reported as null.

II. Problem analysis
1. Check the install.conf configuration
As shown below, db_user is set to "system" in install.conf, so this setting is correct.

[kingbase@node201 bin]$ cat install.conf |grep db_user
db_user="system"                 # the user name of database

2. Analyze the script execution
As shown below, running sh -x cluster_install.sh expand reveals that a statement fails when the script checks the db_user variable.

++ local user=root
++ local host=192.168.1.201
++ local 'command=test -f /home/kingbase/cluster/R6C8/HAC8/kingbase/kingbase/etc/all_nodes_tools.conf'
++ '[' 1 -eq 1 ']'
++ ssh -q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -p 22 -o ServerAliveInterval=2 -o ServerAliveCountMax=3 -l root -T 192.168.1.201 'test -f /home/kingbase/cluster/R6C8/HAC8/kingbase/kingbase/etc/all_nodes_tools.conf'
++ '[' 1 -ne 0 ']'
++ return 1
++ '[' 1 -ne 0 ']'
++ return 1
+ local cluster_db_user=
+ compare_param_diff db_user system ''
+ local param_name=db_user
+ local install_param=system
+ local cluster_param=
+ '[' x == x ']'
+ echo '[WARNING] the /home/kingbase/r6_install/install.conf param[db_user]: is null in cluster, please check cluster status,exit!'
[WARNING] the /home/kingbase/r6_install/install.conf param[db_user]: is null in cluster, please check cluster status,exit!
+ exit 1

---As shown above, this statement returns a non-zero $? and the check fails:
ssh -q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -p 22 -o ServerAliveInterval=2 -o ServerAliveCountMax=3 -l root -T 192.168.1.201 'test -f /home/kingbase/cluster/R6C8/HAC8/kingbase/kingbase/etc/all_nodes_tools.conf'
++ '[' 1 -ne 0 ']'
++ return 1

Run the failing statement directly on the system and check the result:

[kingbase@node203 r6_install]$  ssh -q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -p 22 -o ServerAliveInterval=2 -o ServerAliveCountMax=3 -l root -T 192.168.1.201 'test -f /home/kingbase/cluster/R6C8/HAC8/kingbase/kingbase/etc/all_nodes_tools.conf'
[kingbase@node203 r6_install]$ echo $?
1

echo $? returns a non-zero value, so the statement failed.
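
The value 1 is itself informative. ssh exits with the status of the remote command, or with 255 if ssh itself hit an error, so a result of 1 means the connection worked and the remote test -f simply did not find the file. A small sketch of that interpretation follows; the function name explain_remote_test_status is an assumption for illustration and does not come from the deployment script.

```shell
#!/bin/sh
# Hypothetical helper: interpret the status printed by "echo $?" after
#   ssh ... HOST 'test -f /path/to/file'
explain_remote_test_status() {
    case "$1" in
        0)   echo "file exists on the remote host" ;;
        1)   echo "ssh connected, but test -f found no regular file at that path" ;;
        255) echo "ssh itself failed (connection or authentication problem)" ;;
        *)   echo "unexpected status: $1" ;;
    esac
}
```

Here `explain_remote_test_status 1` points at the path rather than at the network, which matches the earlier log lines showing that the ssh connections to 192.168.1.201 succeed.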

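3. Check the all_nodes_tools.conf configuration
As shown below, the script reads this configuration file during execution to obtain the 'db_user' value and then compares it with 'db_user' in install.conf.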

[kingbase@node201 bin]$ cat ../etc/all_nodes_tools.conf
db_u=system
db_password=MTIzNDU2NzhhYg==
db_port=54321
db_name=test

4. Analyze the script
The script reads the 'db_user' value from install.conf and from all_nodes_tools.conf and compares the two.
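
The read-and-compare step can be sketched roughly as follows. This is an illustration only, not the code of cluster_install.sh: the trace shows a function named compare_param_diff, but its body is not reproduced in this article, so read_conf_value and compare_values below are hypothetical stand-ins that mimic the observed behavior (an empty cluster value leads to the "is null in cluster" message).

```shell
#!/bin/sh
# Hypothetical helper: print the value of KEY from a KEY=VALUE file,
# dropping a trailing "# comment" and surrounding double quotes.
# Returns 1 without output when the file does not exist.
read_conf_value() {
    rcv_file=$1
    rcv_key=$2
    [ -f "$rcv_file" ] || return 1
    sed -n "s/^${rcv_key}=//p" "$rcv_file" | head -n 1 |
        sed -e 's/[[:space:]]*#.*$//' -e 's/^"//' -e 's/"$//'
}

# Hypothetical comparison modelled on the messages seen in the trace:
# status 1 = cluster value empty, 2 = values differ, 0 = values match.
compare_values() {
    cv_name=$1
    cv_install=$2
    cv_cluster=$3
    if [ -z "$cv_cluster" ]; then
        echo "param[$cv_name] is null in cluster"
        return 1
    fi
    if [ "$cv_install" != "$cv_cluster" ]; then
        echo "param[$cv_name] differs: install.conf=$cv_install cluster=$cv_cluster"
        return 2
    fi
    echo "param[$cv_name] matches: $cv_install"
}
```

The point of the sketch is that a missing cluster-side file and an empty cluster-side value look identical to the comparison, which is why the error message mentions db_user even though the real problem lies in the file path.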

5. Check the failing statement
As shown below, the path used for all_nodes_tools.conf contains an extra 'kingbase' directory level, so the script cannot read the file.

ssh -q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -p 22 -o ServerAliveInterval=2 -o ServerAliveCountMax=3 -l root -T 192.168.1.201 'test -f /home/kingbase/cluster/R6C8/HAC8/kingbase/kingbase/etc/all_nodes_tools.conf'
---Inspection shows that the path '/home/kingbase/cluster/R6C8/HAC8/kingbase/kingbase/etc/all_nodes_tools.conf' in this statement contains one 'kingbase' level too many.
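
A doubled directory level like this is easy to overlook in a long trace line. As a rough heuristic, a path can be scanned for a component that is immediately repeated. The function name repeated_components is an assumption for illustration, and a repeated component is only a hint, since such paths can also be legitimate.

```shell
#!/bin/sh
# Hypothetical check: print every path component that is immediately
# repeated (for example ".../kingbase/kingbase/..."). The exit status
# is 0 if at least one repetition is found and 1 otherwise.
repeated_components() {
    printf '%s\n' "$1" | tr '/' '\n' |
        awk 'NF && $0 == prev { print; found = 1 } { prev = $0 } END { exit !found }'
}
```

Called with the path from the failing statement, it prints kingbase; called with the same path minus the extra level, it prints nothing and returns 1.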

III. Solution

1. Check the install.conf configuration

As shown below, in the [expand] section of install.conf, 'install_dir' is misconfigured with an extra 'kingbase' directory level:

[expand]
expand_type="0"                   # The node type of standby/witness node, which would be add to cluster. 0:standby  1:witness
primary_ip="192.168.1.201"                    # The ip addr of cluster primary node, which need to expand a standby/witness node.
expand_ip="192.168.1.203"                     # The ip addr of standby/witness node, which would be add to cluster.
node_id="3"                       # The node_id of standby/witness node, which would be add to cluster. It does not the same with any one in  cluster node
                                 # for example: node_id="3"
## Specific instructions ,see it under [install]
install_dir="/home/kingbase/cluster/R6C8/HAC8/kingbase"  ### incorrect path

The script's description of [install_dir]:

## the path of cluster to be deployed, for example: install_dir="/home/kingbase/tmp_kingbase" [if it is BMJ, you do not need to configure this parameter]
## the directory structure after deployment:
##           ${install_dir}/kingbase/data                               the data directory
##           ${install_dir}/kingbase/archive                            log archive directory
##           ${install_dir}/kingbase/etc                                configuration file directory
##           ${install_dir}/kingbase/bin、lib、share、log       install file directory
## the last layer of directory could not add '/'

The directory layout of the cluster deployment in the current environment is shown below:

---Cluster deployment path [install_dir]
[kingbase@node203 HAC8]$ pwd
/home/kingbase/cluster/R6C8/HAC8

[kingbase@node203 HAC8]$ ls -lh
total 308M
-rw-------  1 kingbase kingbase   62 Oct 11  2023 control.so
-rwxr-xr-x  1 kingbase root     308M Oct 11  2023 db.zip
drwxrwxr-x 10 kingbase kingbase  127 Nov  7 19:08 kingbase
-rwxrwxrwx  1 kingbase kingbase 3.6K Oct 11  2023 license.dat

---The kingbase directory is created automatically during deployment
[kingbase@node203 HAC8]$ tree -d -L 2
.
└── kingbase
    ├── archive
    ├── bin
    ├── data
    ├── etc
    ├── include
    ├── lib
    ├── log
    └── share

9 directories
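
Given the documented layout ${install_dir}/kingbase/etc, a candidate install_dir value can be checked before running the expansion. The sketch below is an assumption-laden illustration: tools_conf_path and check_install_dir are hypothetical names, and the check runs against the local filesystem, so for the expand case it would need to be run on the primary node (or wrapped in ssh).

```shell
#!/bin/sh
# Hypothetical helper: print the all_nodes_tools.conf path implied by an
# install_dir value, following the layout documented in install.conf.
tools_conf_path() {
    printf '%s/kingbase/etc/all_nodes_tools.conf\n' "$1"
}

# Check an install_dir candidate on the local host: succeed only if the
# implied all_nodes_tools.conf exists as a regular file.
check_install_dir() {
    cid_conf=$(tools_conf_path "$1")
    if [ -f "$cid_conf" ]; then
        echo "OK: $cid_conf"
    else
        echo "NOT FOUND: $cid_conf"
        return 1
    fi
}
```

With install_dir set to /home/kingbase/cluster/R6C8/HAC8/kingbase, tools_conf_path yields exactly the doubled path seen in the failing ssh statement; with /home/kingbase/cluster/R6C8/HAC8 it yields the path that matches the tree listing above.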

2. Correct the setting in install.conf

## Specific instructions ,see it under [install]
install_dir="/home/kingbase/cluster/R6C8/HAC8"

3. Re-run the deployment script

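IV. Summary
When deploying a cluster with the script, make sure the parameters in install.conf are configured correctly. If the deployment fails, run 'sh -x cluster_install.sh' to trace the script's execution, pinpoint where it fails, and then fix that specific problem.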

posted @ 2023-11-09 11:46 天涯客1224