KingbaseES RAC Operations Case: Cluster and Database Management
Case description:
This case covers the routine cluster and database management tasks performed after a KingbaseES RAC deployment is complete.
Applicable version:
KingbaseES V008R006C008M030B0010
Operating system version:
[root@node201 KingbaseHA]# cat /etc/centos-release
CentOS Linux release 7.9.2009 (Core)
Cluster architecture:
As shown below, node201 and node202 are the cluster nodes (node203 provides the iSCSI shared storage):
Node information:
[root@node201 KingbaseHA]# vi /etc/hosts
192.168.1.201 node201
192.168.1.202 node202
192.168.1.203 node203 iscsi_Srv
I. Cluster and Database Structure
1. Database service processes
As shown below, each cluster node runs its own instance that accesses the shared database. The instances are started manually with sys_ctl on each node, and each instance has its own pid file:
[root@node201 KingbaseHA]# ps -ef |grep kingbase
kingbase 23496 1 0 11:05 ? 00:00:00 /opt/Kingbase/ES/V8/KESRealPro/V008R006C008M030B0010/Server/bin/kingbase -D /sharedata/data_gfs2/kingbase/data -c config_file=/sharedata/data_gfs2/kingbase/data/kingbase.conf -c log_directory=sys_log -h 0.0.0.0
kingbase 24164 23496 0 11:06 ? 00:00:00 kingbase: logger
kingbase 24165 23496 0 11:06 ? 00:00:00 kingbase: lmon
kingbase 24166 23496 0 11:06 ? 00:00:00 kingbase: lms 1
kingbase 24167 23496 0 11:06 ? 00:00:00 kingbase: lms 2
kingbase 24168 23496 0 11:06 ? 00:00:00 kingbase: lms 3
kingbase 24169 23496 0 11:06 ? 00:00:00 kingbase: lms 4
kingbase 24170 23496 0 11:06 ? 00:00:00 kingbase: lms 5
kingbase 24171 23496 0 11:06 ? 00:00:00 kingbase: lms 6
kingbase 24172 23496 0 11:06 ? 00:00:00 kingbase: lms 7
kingbase 24393 23496 0 11:06 ? 00:00:00 kingbase: checkpointer
kingbase 24394 23496 0 11:06 ? 00:00:00 kingbase: background writer
kingbase 24395 23496 0 11:06 ? 00:00:00 kingbase: global deadlock checker
kingbase 24396 23496 0 11:06 ? 00:00:00 kingbase: transaction syncer
kingbase 24397 23496 0 11:06 ? 00:00:00 kingbase: walwriter
kingbase 24398 23496 0 11:06 ? 00:00:00 kingbase: autovacuum launcher
kingbase 24399 23496 0 11:06 ? 00:00:00 kingbase: archiver last was 00000001000000000000000E
kingbase 24402 23496 0 11:06 ? 00:00:00 kingbase: stats collector
kingbase 24403 23496 0 11:06 ? 00:00:00 kingbase: kwr collector
kingbase 24404 23496 0 11:06 ? 00:00:00 kingbase: ksh writer
kingbase 24405 23496 0 11:06 ? 00:00:00 kingbase: ksh collector
kingbase 24406 23496 0 11:06 ? 00:00:00 kingbase: logical replication launche
Tips:
The lms processes handle cluster requests and communication with the other nodes.
The lms processes occupy 7 ports.
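To see which ports the instance actually holds on a node, a quick check such as the following can be used (run as root; the exact port numbers depend on the configuration, and the lms ports, if open on this node, appear alongside the listener port 55321):
# List all TCP sockets owned by kingbase processes on this node
netstat -antp | grep kingbase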
# Each node runs one instance; the instance pid files:
[root@node201 ~]# ls -lh /sharedata/data_gfs2/kingbase/data/kingbase*.pid
-rw------- 1 kingbase kingbase 100 Aug 12 11:06 /sharedata/data_gfs2/kingbase/data/kingbase_1.pid
-rw------- 1 kingbase kingbase 100 Aug 12 11:06 /sharedata/data_gfs2/kingbase/data/kingbase_2.pid
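Assuming each kingbase_N.pid follows the usual postmaster.pid layout (process id on the first line), the file for the local node can be matched against the running postmaster, for example:
# First line of the node-1 pid file (adjust the node id for the local node)
head -1 /sharedata/data_gfs2/kingbase/data/kingbase_1.pid
# Postmaster started from the shared data directory on this node
pgrep -f 'kingbase -D /sharedata/data_gfs2/kingbase/data'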
2. Data storage architecture
1) Database data directory (stored on the shared gfs2 filesystem)
test=# show data_directory;
data_directory
------------------------------------
/sharedata/data_gfs2/kingbase/data
(1 row)
2) Per-node configuration files
By default every instance reads data/kingbase.conf; each node can also be given its own configuration file, whose settings take precedence over the shared configuration:
[root@node201 ~]# ls -lh /sharedata/data_gfs2/kingbase/data/kingbase*.conf
-rw------- 1 kingbase kingbase 0 Aug 2 11:45 /sharedata/data_gfs2/kingbase/data/kingbase_1.conf
-rw------- 1 kingbase kingbase 0 Aug 2 11:45 /sharedata/data_gfs2/kingbase/data/kingbase_2.conf
-rw------- 1 kingbase kingbase 0 Aug 2 11:45 /sharedata/data_gfs2/kingbase/data/kingbase_3.conf
-rw------- 1 kingbase kingbase 0 Aug 2 11:45 /sharedata/data_gfs2/kingbase/data/kingbase_4.conf
-rw------- 1 kingbase kingbase 88 Aug 2 11:45 /sharedata/data_gfs2/kingbase/data/kingbase.auto.conf
-rw------- 1 kingbase kingbase 28K Aug 2 11:45 /sharedata/data_gfs2/kingbase/data/kingbase.conf
# Enable the per-node configuration file by setting the following in kingbase.conf:
sub_config_file='/sharedata/data_gfs2/kingbase/data/kingbase_node.conf'
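As a hypothetical example of a per-node override (assuming the node files listed above accept ordinary kingbase.conf parameters), a setting can simply be appended to that node's file and is picked up when that instance is restarted or reloaded:
# Give node 1 its own work_mem, leaving the other nodes on the shared default
echo "work_mem = '64MB'" >> /sharedata/data_gfs2/kingbase/data/kingbase_1.conf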
3) Per-node WAL and sys_log directories
As shown below, each node's WAL and sys_log files are kept in separate subdirectories, named after the node id, under sys_wal and sys_log:
# sys_wal directories
[root@node201 ~]# ls -lh /sharedata/data_gfs2/kingbase/data/sys_wal
total 16K
drwx------ 3 kingbase kingbase 3.8K Aug 12 11:11 1
drwx------ 3 kingbase kingbase 3.8K Aug 12 11:11 2
# sys_log directories
[root@node201 ~]# ls -lh /sharedata/data_gfs2/kingbase/data/sys_log
total 8.0K
drwx------ 2 kingbase kingbase 3.8K Aug 12 11:06 1
drwx------ 2 kingbase kingbase 3.8K Aug 12 11:05 2
II. Starting the Cluster and Database
1. Start the cluster (all nodes)
[root@node201 ~]# cd /opt/KingbaseHA/
[root@node201 KingbaseHA]# ./cluster_manager.sh start
Waiting for node failover handling:[ OK ]
Starting Corosync Cluster Engine (corosync): [WARNING]
clean qdisk fence flag start
clean qdisk fence flag success
Starting Qdisk Fenced daemon (qdisk-fenced): [ OK ]
Starting Corosync Qdevice daemon (corosync-qdevice): [ OK ]
Waiting for quorate:.....................................................................................................................................[ OK ]
Starting Pacemaker Cluster Manager[ OK ]
2. Check resource status
# Check the cluster service status
[root@node201 KingbaseHA]# ./cluster_manager.sh status
corosync (pid 2937) is running...
pacemakerd (pid 3277) is running...
corosync-qdevice (pid 2955) is running...
[root@node201 KingbaseHA]# ./cluster_manager.sh --status_pacemaker
pacemakerd (pid 11521) is running...
[root@node201 KingbaseHA]# ./cluster_manager.sh --status_corosync
corosync (pid 9924) is running...
[root@node201 KingbaseHA]# ./cluster_manager.sh --status_qdevice
corosync-qdevice (pid 11499) is running...
[root@node201 KingbaseHA]# ./cluster_manager.sh --status_qdisk_fenced
qdisk-fenced is stopped
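Beyond the cluster_manager.sh checks, corosync's own view of ring and quorum state can be inspected directly (assuming the corosync tools are on PATH in the KingbaseHA environment):
# Link/ring status of the local corosync node
corosync-cfgtool -s
# Quorum state: expected votes, total votes, and whether the partition is quorate
corosync-quorumtool -s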
# As shown below, the dlm and gfs2 resources are not yet configured
[root@node202 KingbaseHA]# crm status
Cluster Summary:
* Stack: corosync
* Current DC: node201 (version 2.0.3-4b1f869f0f) - partition with quorum
* Last updated: Fri Aug 9 18:05:20 2024
* Last change: Fri Aug 9 18:01:06 2024 by hacluster via crmd on node201
* 2 nodes configured
* 0 resource instances configured
Node List:
* Online: [ node201 node202 ]
Full List of Resources: # no resources configured
* No resources
3. Configure the dlm and gfs2 resources
[root@node201 KingbaseHA]# ./cluster_manager.sh --config_gfs2_resource
config dlm and gfs2 resource start
3e934629-a2b8-4b7d-a153-ded2dbec7a28
config dlm and gfs2 resource success
As shown below, the dlm and gfs2 resources have been started:
[root@node201 KingbaseHA]# crm status
Cluster Summary:
* Stack: corosync
* Current DC: node201 (version 2.0.3-4b1f869f0f) - partition with quorum
* Last updated: Mon Aug 12 15:31:41 2024
* Last change: Mon Aug 12 15:31:31 2024 by root via cibadmin on node201
* 2 nodes configured
* 4 resource instances configured
Node List:
* Online: [ node201 node202 ]
Full List of Resources: # dlm and gfs2 resources configured and started
* Clone Set: clone-dlm [dlm]:
* Started: [ node201 node202 ]
* Clone Set: clone-gfs2 [gfs2]
* Started: [ node201 node202 ]
4. Start the database resource
1) Configure and start the DB resource
[root@node201 KingbaseHA]# ./cluster_manager.sh --config_rac_resource
crm configure DB resource start
crm configure DB resource end
2) View the cluster resource configuration
As shown below, the database resource DB has been added:
[root@node201 ~]# crm config show
node 1: node201
node 2: node202
primitive DB ocf:kingbase:kingbase \
params sys_ctl="/opt/Kingbase/ES/V8/Server/bin/sys_ctl" ksql="/opt/Kingbase/ES/V8/Server/bin/ksql" sys_isready="/opt/Kingbase/ES/V8/Server/bin/sys_isready" kb_data="/sharedata/data_gfs2/kingbase/data" kb_dba=kingbase kb_host=0.0.0.0 kb_user=system kb_port=55321 kb_db=template1 logfile="/home/kingbase/log/kingbase1.log" \
op start interval=0 timeout=120 \
op stop interval=0 timeout=120 \
op monitor interval=9s timeout=30 on-fail=stop \
meta failure-timeout=5min
primitive dlm ocf:pacemaker:controld \
params daemon="/opt/KingbaseHA/dlm-dlm/sbin/dlm_controld" dlm_tool="/opt/KingbaseHA/dlm-dlm/sbin/dlm_tool" args="-s 0 -f 0" allow_stonith_disabled=true \
op start interval=0 \
op stop interval=0 \
op monitor interval=60 timeout=60
primitive gfs2 Filesystem \
params device="-U 3e934629-a2b8-4b7d-a153-ded2dbec7a28" directory="/sharedata/data_gfs2" fstype=gfs2 \
op start interval=0 timeout=60 \
op stop interval=0 timeout=60 \
op monitor interval=30s timeout=60 OCF_CHECK_LEVEL=20 \
meta failure-timeout=5min
clone clone-DB DB \
meta target-role=Started
clone clone-dlm dlm \
meta interleave=true target-role=Started
clone clone-gfs2 gfs2 \
meta interleave=true target-role=Started
colocation cluster-colo1 inf: clone-gfs2 clone-dlm
order cluster-order1 clone-dlm clone-gfs2
order cluster-order2 clone-dlm clone-gfs2
property cib-bootstrap-options: \
have-watchdog=false \
dc-version=2.0.3-4b1f869f0f \
cluster-infrastructure=corosync \
cluster-name=krac \
no-quorum-policy=freeze \
stonith-enabled=false
3) Check the database service status
As shown below, the cluster resource status shows that the DB resource has been started:
[root@node201 KingbaseHA]# crm status
Cluster Summary:
* Stack: corosync
* Current DC: node201 (version 2.0.3-4b1f869f0f) - partition with quorum
* Last updated: Mon Aug 12 15:32:50 2024
* Last change: Mon Aug 12 15:32:43 2024 by root via cibadmin on node201
* 2 nodes configured
* 6 resource instances configured
Node List:
* Online: [ node201 node202 ]
Full List of Resources:
* Clone Set: clone-dlm [dlm]:
* Started: [ node201 node202 ]
* Clone Set: clone-gfs2 [gfs2]:
* Started: [ node201 node202 ]
* Clone Set: clone-DB [DB]: # DB resource configured and started
* Started: [ node201 node202 ]
4) Database service listening state
[root@node201 KingbaseHA]# netstat -antlp |grep 553
tcp 0 0 0.0.0.0:55321 0.0.0.0:* LISTEN 29041/kingbase
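A lightweight probe with sys_isready (binary path taken from the DB resource definition above) can also confirm that the instance accepts connections; this assumes sys_isready takes the usual pg_isready-style -h/-p options:
# Probe the local listener on the cluster port
/opt/Kingbase/ES/V8/Server/bin/sys_isready -h 127.0.0.1 -p 55321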
5) Real-time cluster status monitoring
[root@node201 ~]# crm_mon -1
Cluster Summary:
* Stack: corosync
* Current DC: node202 (version 2.0.3-4b1f869f0f) - partition with quorum
* Last updated: Mon Aug 12 11:20:47 2024
* Last change: Mon Aug 12 10:55:34 2024 by root via cibadmin on node201
* 2 nodes configured
* 6 resource instances configured
Node List:
* Online: [ node201 node202 ]
Active Resources:
* Clone Set: clone-dlm [dlm]:
* Started: [ node201 node202 ]
* Clone Set: clone-gfs2 [gfs2]:
* Started: [ node201 node202 ]
* Clone Set: clone-DB [DB]:
* Started: [ node201 node202 ]
5. Stop the cluster
[root@node201 KingbaseHA]# ./cluster_manager.sh stop
Signaling Pacemaker Cluster Manager to terminate[ OK ]
Waiting for cluster services to unload.......[ OK ]
Signaling Qdisk Fenced daemon (qdisk-fenced) to terminate: [ OK ]
Waiting for qdisk-fenced services to unload:..[ OK ]
Signaling Corosync Qdevice daemon (corosync-qdevice) to terminate: [ OK ]
Waiting for corosync-qdevice services to unload:.[ OK ]
Signaling Corosync Cluster Engine (corosync) to terminate: [ OK ]
Waiting for corosync services to unload:..[ OK ]
# Check the resource status from the other node:
[root@node202 KingbaseHA]# crm resource status
fence_qdisk_0 (stonith:fence_qdisk): Started
fence_qdisk_1 (stonith:fence_qdisk): Started
Clone Set: clone-dlm [dlm]
Started: [ node201 node202 ]
Clone Set: clone-gfs2 [gfs2]
Started: [ node201 node202 ]
Clone Set: clone-DB [DB]
Stopped (disabled): [ node201 node202 ]
III. Automatic Resource Recovery
KingbaseES RAC manages the database as a cluster resource: if the database service is shut down with sys_ctl stop or killed, Pacemaker automatically restarts it.
1. Stop the database service
[kingbase@node201 bin]$ ./sys_ctl stop -D /sharedata/data_gfs2/kingbase/data/
waiting for server to shut down................... done
server stopped
2. Check resource status
As shown below, Pacemaker detects that the DB resource is no longer running:
[root@node201 KingbaseHA]# crm status
Cluster Summary:
* Stack: corosync
* Current DC: node202 (version 2.0.3-4b1f869f0f) - partition with quorum
* Last updated: Mon Aug 12 11:56:05 2024
* Last change: Mon Aug 12 11:53:25 2024 by root via cibadmin on node202
* 2 nodes configured
* 6 resource instances configured
Node List:
* Online: [ node201 node202 ]
Full List of Resources:
* Clone Set: clone-dlm [dlm]:
* Started: [ node201 node202 ]
* Clone Set: clone-gfs2 [gfs2]:
* Started: [ node201 node202 ]
* Clone Set: clone-DB [DB]:
* DB (ocf::kingbase:kingbase): Stopping node202
* DB (ocf::kingbase:kingbase): FAILED node201
Failed Resource Actions:
* DB_monitor_9000 on node201 'not running' (7): call=35, status='complete', exitreason='', last-rc-change='2024-08-12 11:56:04 +08:00', queued=0ms, exec=0ms
3. Database resource recovers
As shown below, after a short while the database resource is brought back up by Pacemaker:
[root@node201 KingbaseHA]# crm status
Cluster Summary:
* Stack: corosync
* Current DC: node202 (version 2.0.3-4b1f869f0f) - partition with quorum
* Last updated: Mon Aug 12 13:56:02 2024
* Last change: Mon Aug 12 11:53:25 2024 by root via cibadmin on node202
* 2 nodes configured
* 6 resource instances configured
Node List:
* Online: [ node201 node202 ]
Full List of Resources:
* Clone Set: clone-dlm [dlm]:
* Started: [ node201 node202 ]
* Clone Set: clone-gfs2 [gfs2]:
* Started: [ node201 node202 ]
* Clone Set: clone-DB [DB]:
* Started: [ node201 node202 ]
# The database service is running normally
[root@node201 KingbaseHA]# netstat -antlp |grep 553
tcp 0 0 0.0.0.0:55321 0.0.0.0:* LISTEN 20963/kingbase
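After Pacemaker has restarted the resource, the earlier failure still shows up under "Failed Resource Actions"; if desired it can be inspected and cleared with the standard tooling (a sketch, run on any cluster node):
# Query the recorded fail count of the DB resource on node201
crm_failcount --query -r DB -N node201
# Clear the failure history so the failed action disappears from crm status
crm resource cleanup DB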
IV. Accessing the Database
1. Connect to the database with ksql
[kingbase@node201 bin]$ ./ksql -U system test -p 55321
Type "help" for help.
# View the cluster node information
test=# select sys_rac_nodelist;
sys_rac_nodelist
-------------------------------------------
(1,NODESTATE_MEMBER_ACTIVE,192.168.1.201)
(2,NODESTATE_MEMBER_ACTIVE,192.168.1.202)
(2 rows)
prod=# select * from t1 limit 10;
id | name
----+-------
1 | usr1
2 | usr2
3 | usr3
4 | usr4
5 | usr5
6 | usr6
7 | usr7
8 | usr8
9 | usr9
10 | usr10
(10 rows)
[kingbase@node202 bin]$ ./ksql -U system test -p 55321
Type "help" for help.
test=# \c prod
prod=# select count(*) from t1;
count
-------
1000
(1 row)
2. Test: one cluster instance down (node OS reboot)
1) State of the surviving instance
As shown below, while one node's instance is down, the surviving instance stays in cluster recovery and the database is temporarily inaccessible:
[kingbase@node209 bin]$ ps -ef |grep kingbase
kingbase 7989 1 0 10:19 ? 00:00:00 /home/kingbase/cluster/v9/kingbase/bin/kbha -A daemon -f /home/kingbase/cluster/v9/kingbase/bin/../etc/repmgr.conf
kingbase 20781 1 2 11:00 ? 00:00:54 /opt/Kingbase/ES/V8/Server/bin/kingbase -D /sharedata/data_gfs2/kingbase/data -c config_file=/sharedata/data_gfs2/kingbase/data/kingbase.conf -c log_directory=sys_log -h 0.0.0.0
kingbase 20951 20781 0 11:00 ? 00:00:00 kingbase: logger
kingbase 21123 20781 0 11:01 ? 00:00:00 kingbase: lmon
kingbase 21124 20781 0 11:01 ? 00:00:00 kingbase: lms 1
kingbase 21125 20781 0 11:01 ? 00:00:00 kingbase: lms 2
kingbase 21126 20781 0 11:01 ? 00:00:00 kingbase: lms 3
kingbase 21127 20781 0 11:01 ? 00:00:00 kingbase: lms 4
kingbase 21128 20781 0 11:01 ? 00:00:00 kingbase: lms 5
kingbase 21129 20781 0 11:01 ? 00:00:00 kingbase: lms 6
kingbase 21130 20781 0 11:01 ? 00:00:00 kingbase: lms 7
kingbase 22305 20781 0 11:02 ? 00:00:00 kingbase: checkpointer
kingbase 22312 20781 0 11:02 ? 00:00:00 kingbase: stats collector
root 45967 9676 0 11:32 pts/0 00:00:00 su - kingbase
kingbase 45968 45967 0 11:32 pts/0 00:00:00 -bash
kingbase 46430 20781 0 11:32 ? 00:00:00 kingbase: startup waiting for cluster recovery
Connections to the database fail during recovery:
[kingbase@node209 bin]$ ./ksql -U system test
ksql: error: could not connect to server: FATAL: the database system is in recovery mode
2) Check the cluster resource status
As shown below, the resources on node208 are in an abnormal (UNCLEAN) state:
[root@node209 ~]# crm status
Cluster Summary:
* Stack: corosync
* Current DC: node209 Pacemaker (Kingbase) V008R006B1108 (2.0.3.0.0 4b1f869f0f:1268c00dfa83) - partition with quorum
* Last updated: Mon Jan 6 11:34:56 2025
* Last change: Thu Nov 14 14:42:59 2024 by root via cibadmin on node208
* 2 nodes configured
* 8 resource instances configured
Node List:
* Node node208: UNCLEAN (offline)
* Online: [ node209 ]
Full List of Resources:
* fence_qdisk_0 (stonith:fence_qdisk): Started node209
* fence_qdisk_1 (stonith:fence_qdisk): Started [ node208 node209 ]
* Clone Set: clone-dlm [dlm]:
* dlm (ocf::pacemaker:controld): Started node208 (UNCLEAN)
* Started: [ node209 ]
* Clone Set: clone-gfs2 [gfs2]:
* gfs2 (ocf::heartbeat:Filesystem): Started node208 (UNCLEAN)
* Started: [ node209 ]
* Clone Set: clone-DB [DB]:
* DB (ocf::kingbase:kingbase): Started node208 (UNCLEAN)
* Started: [ node209 ]
3) The surviving instance returns to normal
As shown below, once the cluster has recovered automatically, the surviving instance returns to normal and the database can be accessed again:
[kingbase@node209 bin]$ ps -ef |grep kingbase
kingbase 20781 1 2 11:00 ? 00:00:54 /opt/Kingbase/ES/V8/Server/bin/kingbase -D /sharedata/data_gfs2/kingbase/data -c config_file=/sharedata/data_gfs2/kingbase/data/kingbase.conf -c log_directory=sys_log -h 0.0.0.0
kingbase 20951 20781 0 11:00 ? 00:00:00 kingbase: logger
kingbase 21123 20781 0 11:01 ? 00:00:00 kingbase: lmon
kingbase 21124 20781 0 11:01 ? 00:00:00 kingbase: lms 1
kingbase 21125 20781 0 11:01 ? 00:00:00 kingbase: lms 2
kingbase 21126 20781 0 11:01 ? 00:00:00 kingbase: lms 3
kingbase 21127 20781 0 11:01 ? 00:00:00 kingbase: lms 4
kingbase 21128 20781 0 11:01 ? 00:00:00 kingbase: lms 5
kingbase 21129 20781 0 11:01 ? 00:00:00 kingbase: lms 6
kingbase 21130 20781 0 11:01 ? 00:00:00 kingbase: lms 7
kingbase 22305 20781 0 11:02 ? 00:00:00 kingbase: checkpointer
kingbase 22312 20781 0 11:02 ? 00:00:00 kingbase: stats collector
root 45967 9676 0 11:32 pts/0 00:00:00 su - kingbase
kingbase 45968 45967 0 11:32 pts/0 00:00:00 -bash
kingbase 47810 20781 0 11:33 ? 00:00:00 kingbase: background writer
kingbase 47811 20781 0 11:33 ? 00:00:00 kingbase: deadlock checker
kingbase 47812 20781 0 11:33 ? 00:00:00 kingbase: txn synchronizer
kingbase 47813 20781 0 11:33 ? 00:00:00 kingbase: walwriter
kingbase 47814 20781 0 11:33 ? 00:00:00 kingbase: autovacuum launcher
kingbase 47815 20781 0 11:33 ? 00:00:00 kingbase: archiver last was 000000010000000000000017 for node2
kingbase 47816 20781 0 11:33 ? 00:00:00 kingbase: kwr collector
kingbase 47818 20781 0 11:33 ? 00:00:00 kingbase: ksh writer
kingbase 47819 20781 0 11:33 ? 00:00:00 kingbase: ksh collector
kingbase 47820 20781 0 11:33 ? 00:00:00 kingbase: logical replication launcher
ksql connects to the database normally again:
[kingbase@node209 bin]$ ./ksql -U system test
License Type: TEST-Enterprise Edition V9R1.
Type "help" for help.
test=#
V. Appendix: Fault Cases
Fault 1: cluster service startup fails
As shown below, cluster startup stalls at the "clean qdisk fence flag" step:
[root@node201 KingbaseHA]# ./cluster_manager.sh start
Waiting for node failover handling:[ OK ]
Starting Corosync Cluster Engine (corosync): [WARNING]
clean qdisk fence flag start
Check the cluster configuration:
[root@node201 ~]# cat /opt/KingbaseHA/cluster_manager.conf|grep fence
################# fence #################
enable_fence=1
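If fencing is to be disabled, the flag can be flipped in place on each node before restarting the cluster, for example as below; note that this gives up qdisk fence protection, so treat it as a diagnostic workaround rather than a permanent setting.
# Turn off qdisk fencing in the KingbaseHA configuration (repeat on every node)
sed -i 's/^enable_fence=1/enable_fence=0/' /opt/KingbaseHA/cluster_manager.conf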
After setting enable_fence=0, start the cluster:
[root@node201 KingbaseHA]# ./cluster_manager.sh start
Waiting for node failover handling:[ OK ]
Starting Corosync Cluster Engine (corosync): [WARNING]
Starting Corosync Qdevice daemon (corosync-qdevice): [ OK ]
Waiting for quorate:...........[ OK ]
Starting Pacemaker Cluster Manager[ OK ]
Case 2: cluster startup fails because the node is not connected to the shared storage
1) The node sees no shared-storage devices (only local disks):
[root@node208 KingbaseHA]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 122.5G 0 disk
├─sda1 8:1 0 1G 0 part /boot
└─sda2 8:2 0 121.5G 0 part
├─openeuler-root 253:0 0 70G 0 lvm /
├─openeuler-swap 253:1 0 4.2G 0 lvm [SWAP]
└─openeuler-home 253:2 0 47.3G 0 lvm /home
sr0 11:0 1 1024M 0 rom
2) Start the cluster (startup fails)
[root@node208 ~]# cd /opt/KingbaseHA/
[root@node208 KingbaseHA]# ./cluster_manager.sh start
Waiting for node failover handling:[ OK ]
Starting Corosync Cluster Engine (corosync): [WARNING]
clean qdisk fence flag start
3) Connect to the shared storage
[root@node208 KingbaseHA]# sh ~/iscsi.sh
Logging in to [iface: default, target: iqn.2024-08.pip.cc:server, portal: 192.168.1.203,3260]
Login to [iface: default, target: iqn.2024-08.pip.cc:server, portal: 192.168.1.203,3260] successful.
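The content of ~/iscsi.sh is not shown here; a minimal equivalent using standard iscsiadm calls against the portal and target printed above might look like this (illustrative only):
# Discover targets on the storage server and log in to the one used by the cluster
iscsiadm -m discovery -t sendtargets -p 192.168.1.203
iscsiadm -m node -T iqn.2024-08.pip.cc:server -p 192.168.1.203:3260 --login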
4) Check the storage devices
[root@node208 KingbaseHA]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 122.5G 0 disk
├─sda1 8:1 0 1G 0 part /boot
└─sda2 8:2 0 121.5G 0 part
├─openeuler-root 253:0 0 70G 0 lvm /
├─openeuler-swap 253:1 0 4.2G 0 lvm [SWAP]
└─openeuler-home 253:2 0 47.3G 0 lvm /home
sdb 8:16 0 512M 0 disk
sdc 8:32 0 60.6G 0 disk
sdd 8:48 0 128M 0 disk
sde 8:64 0 128M 0 disk
sdf 8:80 0 10.7G 0 disk
sdg 8:96 0 2.2G 0 disk
sdh 8:112 0 128M 0 disk
sr0 11:0 1 1024M 0 rom
5) Start the cluster
As shown below, the cluster now starts normally:
[root@node208 KingbaseHA]# ./cluster_manager.sh start
Waiting for node failover handling:[ OK ]
Starting Corosync Cluster Engine (corosync): [ OK ]
clean qdisk fence flag start
clean qdisk fence flag success
Starting Qdisk Fenced daemon (qdisk-fenced): [ OK ]
Starting Corosync Qdevice daemon (corosync-qdevice): [ OK ]
Waiting for quorate:......[ OK ]
Starting Pacemaker Cluster Manager[ OK ]
Fault 2: crm resource start clone-DB fails
1) Start the cluster services
[root@node201 KingbaseHA]# ./cluster_manager.sh start
Waiting for node failover handling:[ OK ]
Starting Corosync Cluster Engine (corosync): [WARNING]
Starting Corosync Qdevice daemon (corosync-qdevice): [ OK ]
Waiting for quorate:...........[ OK ]
Starting Pacemaker Cluster Manager[ OK ]
2) Check the cluster resource status
As shown below, the dlm and gfs2 resources are not configured:
[root@node202 KingbaseHA]# crm status
Cluster Summary:
* Stack: corosync
* Current DC: node201 (version 2.0.3-4b1f869f0f) - partition with quorum
* Last updated: Fri Aug 9 18:05:20 2024
* Last change: Fri Aug 9 18:01:06 2024 by hacluster via crmd on node201
* 2 nodes configured
* 0 resource instances configured
Node List:
* Online: [ node201 node202 ]
Full List of Resources: # no resources configured
* No resources
3) Configure the dlm and gfs2 resources
[root@node201 KingbaseHA]# ./cluster_manager.sh --config_gfs2_resource
config dlm and gfs2 resource start
3e934629-a2b8-4b7d-a153-ded2dbec7a28
config dlm and gfs2 resource success
As shown below, the dlm and gfs2 resources are started, but the database resource DB is still missing:
[root@node201 KingbaseHA]# crm status
Cluster Summary:
* Stack: corosync
* Current DC: node201 (version 2.0.3-4b1f869f0f) - partition with quorum
* Last updated: Mon Aug 12 15:31:41 2024
* Last change: Mon Aug 12 15:31:31 2024 by root via cibadmin on node201
* 2 nodes configured
* 4 resource instances configured
Node List:
* Online: [ node201 node202 ]
Full List of Resources: # dlm and gfs2 resources configured and started
* Clone Set: clone-dlm [dlm]:
* Started: [ node201 node202 ]
* Clone Set: clone-gfs2 [gfs2]:
* Started: [ node201 node202 ]
4) Configure the database resource DB
crm configure primitive DB ocf:kingbase:kingbase \
params sys_ctl="/opt/Kingbase/ES/V8/Server/bin/sys_ctl" \
ksql="/opt/Kingbase/ES/V8/Server/bin/ksql" \
sys_isready="/opt/Kingbase/ES/V8/Server/bin/sys_isready" \
kb_data="/sharedata/data_gfs2/kingbase/data" \
kb_dba="kingbase" kb_host="0.0.0.0" \
kb_user="system" \
kb_port="55321" \
kb_db="template1" \
logfile="/home/kingbase/log/kingbase1.log" \
op start interval="0" timeout="120" \
op stop interval="0" timeout="120" \
op monitor interval="9s" timeout="30" on-fail=stop \
meta failure-timeout=5min target-role=Stopped
[root@node201 KingbaseHA]# crm configure primitive DB ocf:kingbase:kingbase \
> params sys_ctl="/opt/Kingbase/ES/V8/Server/bin/sys_ctl" \
> ksql="/opt/Kingbase/ES/V8/Server/bin/ksql" \
> sys_isready="/opt/Kingbase/ES/V8/Server/bin/sys_isready" \
> kb_data="/sharedata/data_gfs2/kingbase/data" \
> kb_dba="kingbase" kb_host="0.0.0.0" \
> kb_user="system" \
> kb_port="55321" \
> kb_db="template1" \
> logfile="/home/kingbase/log/kingbase1.log" \
> op start interval="0" timeout="120" \
> op stop interval="0" timeout="120" \
> op monitor interval="9s" timeout="30" on-fail=stop \
> meta failure-timeout=5min target-role=Stopped
# Configure DB as a clone resource and set the resource start order
[root@node201 KingbaseHA]# crm configure clone clone-DB DB
[root@node201 KingbaseHA]# crm configure order cluster-order2 clone-dlm clone-gfs2 clone-DB
5) View the cluster resources
As shown below, the DB resource has been added to the cluster configuration:
[root@node201 KingbaseHA]# crm config show
node 1: node201
node 2: node202
primitive DB ocf:kingbase:kingbase \
params sys_ctl="/opt/Kingbase/ES/V8/Server/bin/sys_ctl" ksql="/opt/Kingbase/ES/V8/Server/bin/ksql" sys_isready="/opt/Kingbase/ES/V8/Server/bin/sys_isready" kb_data="/sharedata/data_gfs2/kingbase/data" kb_dba=kingbase kb_host=0.0.0.0 kb_user=system kb_port=55321 kb_db=template1 logfile="/home/kingbase/log/kingbase1.log" \
op start interval=0 timeout=120 \
op stop interval=0 timeout=120 \
op monitor interval=9s timeout=30 on-fail=stop \
meta failure-timeout=5min
primitive dlm ocf:pacemaker:controld \
params daemon="/opt/KingbaseHA/dlm-dlm/sbin/dlm_controld" dlm_tool="/opt/KingbaseHA/dlm-dlm/sbin/dlm_tool" args="-s 0 -f 0" allow_stonith_disabled=true \
op start interval=0 \
op stop interval=0 \
op monitor interval=60 timeout=60
primitive gfs2 Filesystem \
params device="-U 3e934629-a2b8-4b7d-a153-ded2dbec7a28" directory="/sharedata/data_gfs2" fstype=gfs2 \
op start interval=0 timeout=60 \
op stop interval=0 timeout=60 \
op monitor interval=30s timeout=60 OCF_CHECK_LEVEL=20 \
meta failure-timeout=5min
clone clone-DB DB \
meta target-role=Started
clone clone-dlm dlm \
meta interleave=true target-role=Started
clone clone-gfs2 gfs2 \
meta interleave=true target-role=Started
colocation cluster-colo1 inf: clone-gfs2 clone-dlm
order cluster-order1 clone-dlm clone-gfs2
order cluster-order2 clone-dlm clone-gfs2
property cib-bootstrap-options: \
have-watchdog=false \
dc-version=2.0.3-4b1f869f0f \
cluster-infrastructure=corosync \
cluster-name=krac \
no-quorum-policy=freeze \
stonith-enabled=false
[root@node201 KingbaseHA]# crm config verify
[root@node201 KingbaseHA]# crm config commit
6) Start the database resource
[root@node201 KingbaseHA]# crm resource start clone-DB
[root@node201 KingbaseHA]# crm resource status clone-DB
resource clone-DB is running on: node201
resource clone-DB is running on: node202
# The database service has been started
[root@node201 KingbaseHA]# netstat -antlp |grep 553
tcp 0 0 0.0.0.0:55321 0.0.0.0:* LISTEN 3240/kingbase
Check the database service status:
[root@node202 KingbaseHA]# crm status
Cluster Summary:
* Stack: corosync
* Current DC: node201 (version 2.0.3-4b1f869f0f) - partition with quorum
* Last updated: Mon Aug 12 14:57:06 2024
* Last change: Mon Aug 12 14:56:00 2024 by root via cibadmin on node201
* 2 nodes configured
* 6 resource instances configured
Node List:
* Online: [ node201 node202 ]
Full List of Resources:
* Clone Set: clone-dlm [dlm]:
* Started: [ node201 node202 ]
* Clone Set: clone-gfs2 [gfs2]:
* Started: [ node201 node202 ]
* Clone Set: clone-DB [DB]: # DB resource configured and started
* Started: [ node201 node202 ]
Fault 3: crm status shows "Failed Fencing Actions"
1) Abnormal crm status output
[root@node208 KingbaseHA]# crm status
* Current DC: node208 Pacemaker (Kingbase) V008R006B1108 (2.0.3.0.0 4b1f869f0f:1268c00dfa83) - partition with quorum
* Last updated: Mon Jan 6 11:02:41 2025
* Last change: Thu Nov 14 14:42:59 2024 by root via cibadmin on node208
* 2 nodes configured
* 8 resource instances configured
Node List:
* Online: [ node208 node209 ]
Full List of Resources:
* fence_qdisk_0 (stonith:fence_qdisk): Started node209
* fence_qdisk_1 (stonith:fence_qdisk): Started node208
* Clone Set: clone-dlm [dlm]:
* Started: [ node208 node209 ]
* Clone Set: clone-gfs2 [gfs2]:
* Started: [ node208 node209 ]
* Clone Set: clone-DB [DB]:
* Started: [ node208 node209 ]
Failed Resource Actions:
* DB_monitor_9000 on node208 'not running' (7): call=32, status='complete', exitreason='', last-rc-change='2025-01-06 11:01:44 +08:00', queued=0ms, exec=0ms
Failed Fencing Actions:
* reboot of node209 failed: delegate=node208, client=pacemaker-controld.20180, origin=node208, last-failed='2025-01-06 10:55:29 +08:00'
2) Clean up the gfs2 resource
[root@node208 data]# crm resource cleanup gfs2
Cleaned up gfs2:0 on node208
Cleaned up gfs2:1 on node209
3) Restart the cluster services
[root@node208 KingbaseHA]# ./cluster_manager.sh stop
Signaling Pacemaker Cluster Manager to terminate[ OK ]
Waiting for cluster services to unload..............................[ OK ]
Signaling Qdisk Fenced daemon (qdisk-fenced) to terminate: [ OK ]
Waiting for qdisk-fenced services to unload:[ OK ]
Signaling Corosync Qdevice daemon (corosync-qdevice) to terminate: [ OK ]
Waiting for corosync-qdevice services to unload:.[ OK ]
Signaling Corosync Cluster Engine (corosync) to terminate: [ OK ]
Waiting for corosync services to unload:..[ OK ]
[root@node208 KingbaseHA]# ./cluster_manager.sh start
Waiting for node failover handling:[ OK ]
Starting Corosync Cluster Engine (corosync): [WARNING]
clean qdisk fence flag start
clean qdisk fence flag success
Starting Qdisk Fenced daemon (qdisk-fenced): [ OK ]
Starting Corosync Qdevice daemon (corosync-qdevice): [ OK ]
Waiting for quorate:.....[ OK ]
Starting Pacemaker Cluster Manager[ OK ]
4) crm status is back to normal
[root@node208 KingbaseHA]# crm status
Cluster Summary:
* Stack: corosync
* Current DC: node209 Pacemaker (Kingbase) V008R006B1108 (2.0.3.0.0 4b1f869f0f:1268c00dfa83) - partition with quorum
* Last updated: Mon Jan 6 15:30:16 2025
* Last change: Mon Jan 6 15:15:01 2025 by root via cibadmin on node208
* 2 nodes configured
* 8 resource instances configured
Node List:
* Online: [ node208 node209 ]
Full List of Resources:
* fence_qdisk_0 (stonith:fence_qdisk): Started node209
* fence_qdisk_1 (stonith:fence_qdisk): Started node208
* Clone Set: clone-dlm [dlm]:
* Started: [ node208 node209 ]
* Clone Set: clone-gfs2 [gfs2]:
* Started: [ node208 node209 ]
* Clone Set: clone-DB [DB]:
* Started: [ node208 node209 ]
Case: manually starting the cluster resources
1) Check the cluster status
[root@node209 KingbaseHA]# crm status
Cluster Summary:
* Stack: corosync
* Current DC: node209 Pacemaker (Kingbase) V008R006B1108 (2.0.3.0.0 4b1f869f0f:1268c00dfa83) - partition with quorum
* Last updated: Wed Jan 15 10:27:06 2025
* Last change: Tue Jan 14 18:32:12 2025 by root via cibadmin on node208
* 2 nodes configured
* 9 resource instances configured (9 DISABLED)
Node List:
* Online: [ node208 node209 ]
Full List of Resources:
* fence_qdisk_0 (stonith:fence_qdisk): Stopped (disabled)
* fence_qdisk_1 (stonith:fence_qdisk): Stopped (disabled)
* Clone Set: clone-dlm [dlm]:
* Stopped (disabled): [ node208 node209 ]
* Clone Set: clone-gfs2 [gfs2]:
* Stopped (disabled): [ node208 node209 ]
* Clone Set: clone-DB [DB]:
* Stopped (disabled): [ node208 node209 ]
* fence_qdisk_2 (stonith:fence_qdisk): Stopped (disabled)
2) Start the fence_qdisk resources
[root@node209 KingbaseHA]# crm resource start fence_qdisk_0
[root@node209 KingbaseHA]# crm resource start fence_qdisk_1
[root@node209 KingbaseHA]# crm status
Cluster Summary:
* Stack: corosync
* Current DC: node209 Pacemaker (Kingbase) V008R006B1108 (2.0.3.0.0 4b1f869f0f:1268c00dfa83) - partition with quorum
* Last updated: Wed Jan 15 10:27:24 2025
* Last change: Wed Jan 15 10:27:22 2025 by root via cibadmin on node209
* 2 nodes configured
* 9 resource instances configured (6 DISABLED)
Node List:
* Online: [ node208 node209 ]
Full List of Resources:
* fence_qdisk_0 (stonith:fence_qdisk): Started node209
* fence_qdisk_1 (stonith:fence_qdisk): Started node208
* Clone Set: clone-dlm [dlm]:
* Stopped (disabled): [ node208 node209 ]
* Clone Set: clone-gfs2 [gfs2]:
* Stopped (disabled): [ node208 node209 ]
* Clone Set: clone-DB [DB]:
* Stopped (disabled): [ node208 node209 ]
* fence_qdisk_2 (stonith:fence_qdisk): Started node208
3) Start the dlm resource
[root@node209 KingbaseHA]# crm resource start dlm
[root@node209 KingbaseHA]# crm status
Cluster Summary:
* Stack: corosync
* Current DC: node209 Pacemaker (Kingbase) V008R006B1108 (2.0.3.0.0 4b1f869f0f:1268c00dfa83) - partition with quorum
* Last updated: Wed Jan 15 10:27:36 2025
* Last change: Wed Jan 15 10:27:33 2025 by root via cibadmin on node209
* 2 nodes configured
* 9 resource instances configured (4 DISABLED)
Node List:
* Online: [ node208 node209 ]
Full List of Resources:
* fence_qdisk_0 (stonith:fence_qdisk): Started node209
* fence_qdisk_1 (stonith:fence_qdisk): Started node208
* Clone Set: clone-dlm [dlm]:
* Started: [ node208 node209 ]
* Clone Set: clone-gfs2 [gfs2]:
* Stopped (disabled): [ node208 node209 ]
* Clone Set: clone-DB [DB]:
* Stopped (disabled): [ node208 node209 ]
* fence_qdisk_2 (stonith:fence_qdisk): Started node208
4) Start the gfs2 resource
[root@node209 KingbaseHA]# crm resource start gfs2
[root@node209 KingbaseHA]# crm status
Cluster Summary:
* Stack: corosync
* Current DC: node209 Pacemaker (Kingbase) V008R006B1108 (2.0.3.0.0 4b1f869f0f:1268c00dfa83) - partition with quorum
* Last updated: Wed Jan 15 10:27:46 2025
* Last change: Wed Jan 15 10:27:45 2025 by root via cibadmin on node209
* 2 nodes configured
* 9 resource instances configured (2 DISABLED)
Node List:
* Online: [ node208 node209 ]
Full List of Resources:
* fence_qdisk_0 (stonith:fence_qdisk): Started node209
* fence_qdisk_1 (stonith:fence_qdisk): Started node208
* Clone Set: clone-dlm [dlm]:
* Started: [ node208 node209 ]
* Clone Set: clone-gfs2 [gfs2]:
* gfs2 (ocf::heartbeat:Filesystem): Starting node209
* gfs2 (ocf::heartbeat:Filesystem): Starting node208
* Clone Set: clone-DB [DB]:
* Stopped (disabled): [ node208 node209 ]
* fence_qdisk_2 (stonith:fence_qdisk): Started node208
5) Start the database resource
[root@node209 KingbaseHA]# crm resource start clone-DB
[root@node209 KingbaseHA]# crm status
Cluster Summary:
* Stack: corosync
* Current DC: node209 Pacemaker (Kingbase) V008R006B1108 (2.0.3.0.0 4b1f869f0f:1268c00dfa83) - partition with quorum
* Last updated: Wed Jan 15 10:47:19 2025
* Last change: Wed Jan 15 10:34:54 2025 by root via cibadmin on node209
* 2 nodes configured
* 9 resource instances configured
Node List:
* Online: [ node208 node209 ]
Full List of Resources:
* fence_qdisk_0 (stonith:fence_qdisk): Started node209
* fence_qdisk_1 (stonith:fence_qdisk): Started node208
* Clone Set: clone-dlm [dlm]:
* Started: [ node208 node209 ]
* Clone Set: clone-gfs2 [gfs2]:
* Started: [ node208 node209 ]
* Clone Set: clone-DB [DB]:
* Started: [ node208 node209 ]
* fence_qdisk_2 (stonith:fence_qdisk): Started node208
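To take the resources down by hand, the reverse order follows from the order constraints shown earlier: stop the database first, then the filesystem and the lock manager, for example:
# Stop resources in reverse dependency order
crm resource stop clone-DB
crm resource stop clone-gfs2
crm resource stop clone-dlm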
VI. Cleaning Up and Uninstalling the Cluster
1. Clean up the cluster configuration (all nodes)
[root@node201 KingbaseHA]# ./cluster_manager.sh --clean_all
clean all start
Pacemaker Cluster Manager is already stopped[ OK ]
clean env variable start
clean env variable success
clean host start
clean host success
remove pacemaker daemon user start
remove pacemaker daemon user success
clean all success
# As shown below, the cluster configuration has been cleaned up
[root@node201 KingbaseHA]# crm config show
ERROR: running cibadmin -Ql: Connection to the CIB manager failed: Transport endpoint is not connected
Init failed, could not perform requested operations
ERROR: configure: Missing requirements
[root@node201 KingbaseHA]# ./cluster_manager.sh start
Waiting for node failover handling:[ OK ]
./cluster_manager.sh: line 1143: /etc/init.d/corosync: No such file or directory
2. Uninstall the cluster (all nodes)
As shown below, uninstalling the cluster removes the /opt/KingbaseHA directory:
[root@node202 KingbaseHA]# ./cluster_manager.sh --uninstall
uninstall start
./cluster_manager.sh: line 1276: /etc/init.d/pacemaker: No such file or directory
./cluster_manager.sh: line 1335: /etc/init.d/corosync-qdevice: No such file or directory
./cluster_manager.sh: line 1148: /etc/init.d/corosync: No such file or directory
clean env variable start
clean env variable success
clean host start
clean host success
remove pacemaker daemon user start
userdel: user 'hacluster' does not exist
groupdel: group 'haclient' does not exist
remove pacemaker daemon user success
uninstall success
# The /opt/KingbaseHA directory has been removed
[root@node202 KingbaseHA]# ls -lh /opt/KingbaseHA/
ls: cannot access /opt/KingbaseHA/: No such file or directory