Linux High Availability: Quorum Device Configuration

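Red Hat Enterprise Linux 7.4 fully supports configuring a separate quorum device that acts as a third-party arbitration device for the cluster. Its primary use is to allow a cluster to sustain more node failures than standard quorum rules allow. A quorum device is recommended for clusters with an even number of nodes. In a two-node cluster, a quorum device can better determine which node survives in a split-brain situation.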
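The vote arithmetic behind these claims can be sketched with a small shell helper. It assumes votequorum's majority rule (quorum = floor(total votes / 2) + 1), one vote per node, and a qdevice that contributes one vote to the surviving side, as in the ffsplit setup used later. It ignores special modes such as two_node, and the names quorum_needed and max_node_failures are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: votequorum-style majority for a given vote total,
# assuming quorum = floor(votes / 2) + 1.
quorum_needed() {
    local votes=$1
    echo $(( votes / 2 + 1 ))
}

# Hypothetical helper: node failures a cluster of one-vote nodes survives,
# assuming the qdevice (qd_votes, default 0) always votes for the survivors.
# Special modes such as two_node or auto_tie_breaker are ignored.
max_node_failures() {
    local nodes=$1 qd_votes=${2:-0}
    local need
    need=$(quorum_needed $(( nodes + qd_votes )))
    # Survivors keep quorum while (nodes - failed) + qd_votes >= need
    local failures=$(( nodes + qd_votes - need ))
    (( failures < 0 )) && failures=0
    # At least one node has to remain to run anything
    (( failures > nodes - 1 )) && failures=$(( nodes - 1 ))
    echo "$failures"
}

for n in 2 3 4; do
    echo "nodes=$n: tolerates $(max_node_failures "$n") failure(s) alone," \
         "$(max_node_failures "$n" 1) with a 1-vote qdevice"
done
```

Under these assumptions a 4-node cluster tolerates one failure on its own and two with the qdevice vote, while a 3-node cluster gains nothing from a single qdevice vote, which is why the quorum device is recommended for even node counts.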
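When configuring a quorum device, you must take the following into account:
  • It is recommended that you run the quorum device on a different physical network at the same site as the cluster that uses it. Ideally, the quorum device host should be separate from the main cluster (for example, in a separate rack), or at least on a separate PSU, and not on the same network segment as the corosync ring or rings.
  • You cannot use more than one quorum device in a cluster at the same time.
  • Although you cannot use more than one quorum device in a cluster at the same time, a single quorum device may be used by several clusters at the same time. Each cluster that uses that quorum device can use different algorithms and quorum options, as those are stored on the cluster nodes themselves. For example, a single quorum device can be used by one cluster with the ffsplit (fifty/fifty split) algorithm and by a second cluster with the lms (last man standing) algorithm.
  • A quorum device should not be run on an existing cluster node.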

System environment:

[root@node201 ~]# cat /etc/centos-release
CentOS Linux release 7.9.2009 (Core)

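System architecture (note: node203 acts as both a cluster node and the qdevice host here, which the recommendation above advises against):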

[root@node203 ~]# cat /etc/hosts
192.168.1.201 node201
192.168.1.202 node202
192.168.1.203 node203  qdevice
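Every node, including the qdevice host, should resolve these names to the same addresses. A hypothetical hosts_ip helper can look up a name or alias (such as qdevice) in a hosts-format file, which makes it easy to compare the files across nodes; it only reads the file and does not query DNS.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the address that a hosts-format file maps to a
# hostname or alias (default file: /etc/hosts). Comment lines are skipped.
hosts_ip() {
    local name=$1 file=${2:-/etc/hosts}
    awk -v n="$name" '
        !/^[[:space:]]*#/ {
            # Field 1 is the address, the remaining fields are names/aliases
            for (i = 2; i <= NF; i++) if ($i == n) { print $1; exit }
        }
    ' "$file"
}

# Usage (run on each node and compare the results):
#   for h in node201 node202 node203 qdevice; do echo "$h -> $(hosts_ip "$h")"; done
```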

I. Deploying the system environment
1. Deploying the cluster nodes

[root@node201 ~]# dnf install corosync-qdevice
[root@node201 ~]# rpm -qa |grep qdevice
corosync-qdevice-2.4.5-7.el7_9.2.x86_64

[root@node201 ~]# rpm -qa |grep pcs
pcs-0.9.169-3.el7.centos.3.x86_64
pcsc-lite-libs-1.8.8-8.el7.x86_64

[root@node201 ~]# rpm -qa |egrep 'pacemaker|corosync'
corosynclib-2.4.5-7.el7_9.2.x86_64
pacemaker-1.1.23-1.el7_9.1.x86_64
pacemaker-libs-1.1.23-1.el7_9.1.x86_64
pacemaker-doc-1.1.23-1.el7_9.1.x86_64
corosync-qdevice-2.4.5-7.el7_9.2.x86_64
pacemaker-cluster-libs-1.1.23-1.el7_9.1.x86_64
pacemaker-cli-1.1.23-1.el7_9.1.x86_64
corosync-2.4.5-7.el7_9.2.x86_64

2. Deploying the quorum device node

[root@node203 ~]# dnf install pcs corosync-qnetd
[root@node203 corosync]# yum install -y corosync-qdevice
# Start the pcsd service
[root@node203 ~]# systemctl start pcsd.service
[root@node203 ~]# systemctl status pcsd.service
[root@node203 ~]# systemctl enable pcsd.service

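II. Creating the cluster
1. Setting up user authentication
As shown below, make sure the hacluster user exists on the cluster nodes and on the qdevice node, and set its password (for example with passwd hacluster):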

[root@node201 ~]# id hacluster
uid=003(hacluster) gid=1004(haclient) groups=1004(haclient)
[root@node202 ~]# id hacluster
uid=5001(hacluster) gid=5010(haclient) groups=5010(haclient)
[root@node203 ~]# id hacluster
uid=5001(hacluster) gid=1004(haclient) groups=1004(haclient)

As shown below, authenticate the cluster nodes against the qdevice node:

[root@node201 ~]# pcs cluster auth node203
Username: hacluster
Password:
node203: Authorized
[root@node202 ~]# pcs cluster auth node203
Username: hacluster
Password:
node203: Authorized

2. Creating the cluster
As shown below, create the cluster test_cluster:

[root@node201 pcs]# pcs cluster setup --name test_cluster node201 node202 node203 --force
Destroying cluster on nodes: node201, node202, node203...
node202: Stopping Cluster (pacemaker)...
node203: Stopping Cluster (pacemaker)...
node201: Stopping Cluster (pacemaker)...
node203: Successfully destroyed cluster
node202: Successfully destroyed cluster
node201: Successfully destroyed cluster

Sending 'pacemaker_remote authkey' to 'node201', 'node202', 'node203'
node201: successful distribution of the file 'pacemaker_remote authkey'
node203: successful distribution of the file 'pacemaker_remote authkey'
node202: successful distribution of the file 'pacemaker_remote authkey'
Sending cluster config files to the nodes...
node201: Succeeded
node202: Succeeded
node203: Succeeded

Synchronizing pcsd certificates on nodes node201, node202, node203...
node201: Success
node203: Success
node202: Success
Restarting pcsd on the nodes in order to reload the certificates...
node201: Success
node203: Success
node202: Success

[root@node201 pcs]#  pcs cluster start --all
node201: Starting Cluster (corosync)...
node202: Starting Cluster (corosync)...
node203: Starting Cluster (corosync)...
node203: Starting Cluster (pacemaker)...
node202: Starting Cluster (pacemaker)...
node201: Starting Cluster (pacemaker)...

[root@node201 pcs]#  pcs cluster status
Cluster Status:
 Stack: unknown
 Current DC: NONE
 Last updated: Thu Aug 29 19:24:45 2024
 Last change: Thu Aug 29 19:24:40 2024 by hacluster via crmd on node203
 3 nodes configured
 0 resource instances configured
PCSD Status:
  node203: Online
  node201: Online
  node202: Online

[root@node201 pcs]# pcs cluster enable --all
node201: Cluster Enabled
node202: Cluster Enabled
node203: Cluster Enabled

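III. Configuring the quorum device
The quorum device model is net, which is currently the only supported model. The net model supports the following algorithms: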

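  • ffsplit: fifty-fifty split. This provides exactly one vote to the partition with the highest number of active nodes.
  • lms: last-man-standing. If the node is the only one left in the cluster that can see the qnetd server, it returns a vote.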
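To make the ffsplit rule concrete, here is a deliberately simplified model of the choice between two partitions: the larger partition gets the vote, and on an even split the tie-breaker reported by pcs in this setup ("Node with lowest node ID") decides. The real algorithm also takes heuristics and previous state into account; the function name and argument format are hypothetical.

```shell
#!/usr/bin/env bash
# Simplified, hypothetical model of ffsplit. Each argument is one partition,
# given as a space-separated list of node IDs. Prints the partition that
# would receive the qdevice vote.
ffsplit_winner() {
    local a=$1 b=$2
    local -a na nb
    read -r -a na <<< "$a"
    read -r -a nb <<< "$b"
    if (( ${#na[@]} > ${#nb[@]} )); then
        echo "$a"
    elif (( ${#nb[@]} > ${#na[@]} )); then
        echo "$b"
    else
        # Even split: tie-breaker "node with lowest node ID"
        local min_a min_b
        min_a=$(printf '%s\n' "${na[@]}" | sort -n | head -n 1)
        min_b=$(printf '%s\n' "${nb[@]}" | sort -n | head -n 1)
        if (( min_a < min_b )); then echo "$a"; else echo "$b"; fi
    fi
}

ffsplit_winner "1 2" "3"      # prints: 1 2  (larger partition wins)
ffsplit_winner "2 4" "1 3"    # prints: 1 3  (tie, partition with node 1 wins)
```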

1. Configuring and starting the quorum device (model net)

[root@node203 ~]# pcs qdevice setup model net --enable --start
Quorum device 'net' initialized
quorum device enabled
Starting quorum device...
quorum device started

2. Checking the quorum device status

[root@node203 ~]# pcs qdevice status net --full
QNetd address:                  *:5403
TLS:                            Supported (client certificate required)
Connected clients:              0
Connected clusters:             0
Maximum send/receive size:      32768/32768 bytes

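As shown below, open the firewall for qdevice traffic (firewalld is not running on this host, so both commands report that):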

[root@node203 ~]# firewall-cmd --permanent --add-service=high-availability
FirewallD is not running
[root@node203 ~]# firewall-cmd --add-service=high-availability
FirewallD is not running

3. Adding the quorum device on the cluster nodes
1) Configure corosync.conf (on every cluster node)

[root@node202 corosync]#  cat /etc/corosync/corosync.conf |grep -v ^#|grep -v ^$|grep -v '#'
totem {
        version: 2
        crypto_cipher: none
        crypto_hash: none
        interface {
                ringnumber: 0
                bindnetaddr: 192.168.1.0
                mcastaddr: 239.255.1.1
                mcastport: 5405
                ttl: 1
        }
}
logging {
        fileline: off
        to_stderr: no
        to_logfile: yes
        logfile: /var/log/cluster/corosync.log
        to_syslog: yes
        debug: off
        timestamp: on
        logger_subsys {
                subsys: QUORUM
                debug: off
        }
}
quorum {
        provider: corosync_votequorum
        expected_votes: 7
}
nodelist {
        node { ring0_addr: node201
               nodeid: 1
        }
        node { ring0_addr: node202
               nodeid: 2
        }
}
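The grep -v filters above strip comments and blank lines but still print every section. To look at a single section such as quorum, including nested blocks like the device { } block that pcs adds to it, a hypothetical awk helper can track brace depth. It is a convenience sketch, not part of corosync or pcs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one top-level section (e.g. "quorum") of a
# corosync.conf-style file, including any nested { } blocks inside it.
print_section() {
    local section=$1 file=${2:-/etc/corosync/corosync.conf}
    awk -v s="$section" '
        # Start when a line opens the requested section, e.g. "quorum {"
        !inside && $0 ~ ("^[[:space:]]*" s "[[:space:]]*[{]") { inside = 1; depth = 0 }
        inside {
            print
            depth += gsub(/[{]/, "{")   # count opening braces on this line
            depth -= gsub(/[}]/, "}")   # count closing braces on this line
            if (depth == 0) inside = 0
        }
    ' "$file"
}

# Usage:
#   print_section quorum /etc/corosync/corosync.conf
```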

Explanation of the quorum options:

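quorum {
        provider: corosync_votequorum        # enable votequorum
        expected_votes: 7                    # 7 expected votes (e.g. 7 nodes), so quorum is 4. If nodelist is set, expected_votes is ignored
        wait_for_all: 1                      # 1: at cluster startup, quorum is withheld until all nodes are online and have joined the cluster (new in Corosync 2.0)
        last_man_standing: 1                 # 1: enable the LMS feature, which is disabled (0) by default.
                                             # When enabled and the cluster sits at the quorum edge (e.g. expected_votes=7 with 4 nodes online)
                                             # for longer than last_man_standing_window, quorum is recalculated, repeatedly, down to 2 online nodes.
                                             # Going down to 1 online node also requires auto_tie_breaker, which is not recommended in production.
        last_man_standing_window: 10000      # in milliseconds: how long to wait after one or more hosts are lost before recalculating quorum
}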
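The LMS behaviour described in these comments can be modelled roughly as follows: while the remaining partition is still quorate, expected_votes is lowered to the votes currently online, but never below 2. This is a simplified, hypothetical model of the comments above; it ignores last_man_standing_window timing, wait_for_all and auto_tie_breaker, and is not corosync's code.

```shell
#!/usr/bin/env bash
# Hypothetical, simplified model of last_man_standing recalculation.
# Args: current expected_votes, votes currently online.
# Ignores timing (last_man_standing_window), wait_for_all and auto_tie_breaker.
lms_recalculate() {
    local expected=$1 online=$2
    local quorum=$(( expected / 2 + 1 ))
    if (( online < quorum )); then
        # Partition already lost quorum: LMS cannot help any more
        echo "expected=$expected quorum=$quorum (not quorate, no recalculation)"
        return
    fi
    local new=$online
    # LMS on its own never lowers expected_votes below 2
    (( new < 2 )) && new=2
    echo "expected=$new quorum=$(( new / 2 + 1 ))"
}

lms_recalculate 7 4   # prints: expected=4 quorum=3
lms_recalculate 3 2   # prints: expected=2 quorum=2
lms_recalculate 2 1   # prints: expected=2 quorum=2 (not quorate, no recalculation)
```

Stepping 7 → 4 → 3 → 2 online nodes keeps the cluster quorate at each stage, while the drop to a single node does not, which matches the "down to 2 online nodes" limit described above.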

2) Restart the corosync service

[root@node201 corosync]# systemctl restart corosync
[root@node201 corosync]# systemctl status corosync

3) Check the quorum configuration on the cluster nodes and add the quorum device

[root@node201 ~]# pcs quorum config
Options:

Check the quorum status:
[root@node202 corosync]#  pcs quorum status
Quorum information
------------------
Date:             Thu Aug 29 17:56:32 2024
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          3232235978
Ring ID:          -1062731319/85
Quorate:          No

Votequorum information
----------------------
Expected votes:   7
Highest expected: 7
Total votes:      3
Quorum:           4 Activity blocked
Flags:

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
3232235977          1         NR node201
3232235978          1         NR node202 (local)
3232235979          1         NR node203

As shown below, add the quorum device from a cluster node and select the ffsplit algorithm:

[root@node201 pcs]# pcs quorum device add model net host=node203 algorithm=ffsplit --force
Setting up qdevice certificates on nodes...
node201: Succeeded
node202: Succeeded
node203: Succeeded
Enabling corosync-qdevice...
node203: corosync-qdevice enabled
node201: corosync-qdevice enabled
node202: corosync-qdevice enabled
Sending updated corosync.conf to nodes...
node201: Succeeded
node202: Succeeded
node203: Succeeded
Corosync configuration reloaded
Starting corosync-qdevice...
node203: corosync-qdevice started
node201: corosync-qdevice started
node202: corosync-qdevice started

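The quorum device only needs to be added once, from any one cluster node; running the same command again on node202 fails because the device is already defined:

[root@node202 corosync]# pcs quorum device add model net host=node203 algorithm=ffsplit --force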
Error: quorum device is already defined

4) Check the configuration after adding the quorum device

[root@node201 pcs]# pcs quorum config
Options:
Device:
  votes: 1
  Model: net
    algorithm: ffsplit
    host: node203
	
[root@node201 pcs]# pcs quorum status
Quorum information
------------------
Date:             Thu Aug 29 19:31:45 2024
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          1
Ring ID:          1/98
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      4
Quorum:           3
Flags:            Quorate Qdevice

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
         1          1    A,V,NMW node201 (local)
         2          1    A,V,NMW node202
         3          1    A,V,NMW node203
         0          1            Qdevice
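The Qdevice column packs several flags into one field. The hypothetical helper below spells them out, assuming the abbreviations described in the corosync-quorumtool man page: A/NA (alive / not alive), V/NV (vote / no vote), MW/NMW (master wins / not master wins) and NR (not registered).

```shell
#!/usr/bin/env bash
# Hypothetical helper: expand a Qdevice flag field such as "A,V,NMW"
# from the "pcs quorum status" membership table into readable words.
decode_qdevice_flags() {
    local -a flags out=()
    IFS=',' read -r -a flags <<< "$1"
    local flag
    for flag in "${flags[@]}"; do
        case $flag in
            A)   out+=("alive") ;;
            NA)  out+=("not-alive") ;;
            V)   out+=("vote") ;;
            NV)  out+=("no-vote") ;;
            MW)  out+=("master-wins") ;;
            NMW) out+=("not-master-wins") ;;
            NR)  out+=("not-registered") ;;
            *)   out+=("unknown($flag)") ;;
        esac
    done
    # "${out[*]}" joins the words with single spaces
    echo "${out[*]}"
}

decode_qdevice_flags "A,V,NMW"   # prints: alive vote not-master-wins
```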



[root@node201 pcs]# pcs quorum device status
Qdevice information
-------------------
Model:                  Net
Node ID:                1
Configured node list:
    0   Node ID = 1
    1   Node ID = 2
    2   Node ID = 3
Membership node list:   1, 2, 3

Qdevice-net information
----------------------
Cluster name:           test_cluster
QNetd host:             node203:5403
Algorithm:              Fifty-Fifty split
Tie-breaker:            Node with lowest node ID
State:                  Connected

5) Check the qdevice connections on the quorum device node

[root@node203 corosync]# pcs qdevice status net --full
QNetd address:                  *:5403
TLS:                            Supported (client certificate required)
Connected clients:              3
Connected clusters:             1
Maximum send/receive size:      32768/32768 bytes
Cluster "test_cluster":
    Algorithm:          Fifty-Fifty split
    Tie-breaker:        Node with lowest node ID
    Node ID 3:
        Client address:         ::ffff:192.168.1.203:45974
        HB interval:            8000ms
        Configured node list:   1, 2, 3
        Ring ID:                1.62
        Membership node list:   1, 2, 3
        Heuristics:             Undefined (membership: Undefined, regular: Undefined)
        TLS active:             Yes (client certificate verified)
        Vote:                   ACK (ACK)
    Node ID 1:
        Client address:         ::ffff:192.168.1.201:40657
        HB interval:            8000ms
        Configured node list:   1, 2, 3
        Ring ID:                1.62
        Membership node list:   1, 2, 3
        Heuristics:             Undefined (membership: Undefined, regular: Undefined)
        TLS active:             Yes (client certificate verified)
        Vote:                   ACK (ACK)
    Node ID 2:
        Client address:         ::ffff:192.168.1.202:35765
        HB interval:            8000ms
        Configured node list:   1, 2, 3
        Ring ID:                1.62
        Membership node list:   1, 2, 3
        Heuristics:             Undefined (membership: Undefined, regular: Undefined)
        TLS active:             Yes (client certificate verified)
        Vote:                   No change (ACK)
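Reading the per-client Vote: lines gets tedious as the number of clients grows. A hypothetical awk helper can pair each Node ID with its Vote: value from the text format shown above; it only reformats the output and does not interpret the vote states.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "pcs qdevice status net --full" output on stdin
# and print one "node=<id> vote=<value>" line per connected client.
qnetd_votes() {
    awk '
        /^[[:space:]]*Node ID [0-9]+:/ {
            node = $3; sub(/:$/, "", node)      # "Node ID 3:" -> "3"
        }
        /^[[:space:]]*Vote:/ {
            vote = $0
            sub(/^[[:space:]]*Vote:[[:space:]]*/, "", vote)
            print "node=" node " vote=" vote
        }
    '
}

# Usage on the qnetd host:
#   pcs qdevice status net --full | qnetd_votes
```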

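6) Manage the quorum device from the cluster nodes (the sample output below comes from a two-node cluster whose nodes are named node1 and node2)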

# Change the quorum algorithm
[root@node1:~]# pcs quorum device update model algorithm=lms
Sending updated corosync.conf to nodes...
node1: Succeeded
node2: Succeeded
Corosync configuration reloaded
Reloading qdevice configuration on nodes...
node1: corosync-qdevice stopped
node2: corosync-qdevice stopped
node1: corosync-qdevice started
node2: corosync-qdevice started

# Remove the quorum device
[root@node1:~]# pcs quorum device remove
Sending updated corosync.conf to nodes...
node1: Succeeded
node2: Succeeded
Corosync configuration reloaded
Disabling corosync-qdevice...
node1: corosync-qdevice disabled
node2: corosync-qdevice disabled
Stopping corosync-qdevice...
node1: corosync-qdevice stopped
node2: corosync-qdevice stopped
Removing qdevice certificates from nodes...
node1: Succeeded
node2: Succeeded

[root@node1:~]# pcs quorum device status
Error: Unable to get quorum status: corosync-qdevice-tool: Can't connect to QDevice socket (is QDevice running?): No such file or directory

7) Destroy the quorum device on the quorum device node

[root@qdevice:~]# pcs qdevice destroy net
Stopping quorum device...
quorum device stopped
quorum device disabled
Quorum device 'net' configuration files removed

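IV. Managing the quorum device service
pcs can manage the quorum device service (corosync-qnetd) on the local host, as shown in the following example. Note that these commands affect only the corosync-qnetd service.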

[root@qdevice:~]# pcs qdevice start net
[root@qdevice:~]# pcs qdevice stop net
[root@qdevice:~]# pcs qdevice enable net
[root@qdevice:~]# pcs qdevice disable net
[root@qdevice:~]# pcs qdevice kill net

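Appendix: configuration error cases

Case 1: checking the quorum status fails
As shown below, checking the quorum status on a cluster node returns an error: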

[root@node201 ~]#  pcs quorum status
Error: Unable to get quorum status: Unable to start votequorum status tracking: CS_ERR_BAD_HANDLE

1) Check the corosync.conf configuration

2) corosync fails to start

[root@node202 ~]# systemctl restart corosync
Job for corosync.service failed because the control process exited with error code. See "systemctl status corosync.service" and "journalctl -xe" for details.

3) Check the corosync log

[root@node201 corosync]# tail -1000 /var/log/cluster/corosync.log

Aug 29 17:14:31 [324] node203 corosync notice  [SERV  ] Service engine loaded: corosync profile loading service [4]
Aug 29 17:14:31 [324] node203 corosync notice  [QUORUM] Using quorum provider corosync_votequorum
Aug 29 17:14:31 [324] node203 corosync crit    [QUORUM] Quorum provider: corosync_votequorum failed to initialize.
Aug 29 17:14:31 [324] node203 corosync error   [SERV  ] Service engine 'corosync_quorum' failed to load for reason 'configuration error: nodelist or quorum.expected_votes must be configured!'
Aug 29 17:14:31 [324] node203 corosync error   [MAIN  ] Corosync Cluster Engine exiting with status 20 at service.c:356.
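When the log is long, a hypothetical sed one-liner can pull out just the engine name and the reason from lines like the "failed to load for reason" error above:

```shell
#!/usr/bin/env bash
# Hypothetical helper: scan corosync log lines on stdin for service engines
# that failed to load and print "<engine>: <reason>" for each one.
corosync_load_errors() {
    sed -n "s/.*Service engine '\([^']*\)' failed to load for reason '\(.*\)'.*/\1: \2/p"
}

# Usage:
#   corosync_load_errors < /var/log/cluster/corosync.log
```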

The log shows that, because corosync.conf uses corosync_votequorum as the quorum provider, either a nodelist or quorum.expected_votes must be configured.
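A hypothetical pre-flight check that mirrors this rule can catch the problem before restarting corosync. It is a simplified text check (it does not verify that expected_votes sits inside the quorum section), not a corosync tool.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: when corosync_votequorum is the provider,
# the file must define a nodelist section or quorum.expected_votes.
# Returns 0 if the rule is met, 1 otherwise.
check_votequorum_conf() {
    local file=${1:-/etc/corosync/corosync.conf}
    local body
    # Drop comment lines so commented-out settings are not counted
    body=$(grep -Ev '^[[:space:]]*#' "$file") || return 1
    # The rule only applies when votequorum is the quorum provider
    grep -q 'corosync_votequorum' <<< "$body" || return 0
    # Simplified: does not verify that expected_votes sits inside quorum { }
    if grep -Eq '^[[:space:]]*nodelist[[:space:]]*[{]' <<< "$body" ||
       grep -Eq '^[[:space:]]*expected_votes[[:space:]]*:' <<< "$body"; then
        return 0
    fi
    echo "missing nodelist or quorum.expected_votes in $file" >&2
    return 1
}

# Usage:
#   check_votequorum_conf /etc/corosync/corosync.conf && systemctl restart corosync
```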

4) Modify corosync.conf accordingly (add a nodelist or quorum.expected_votes)

5) Start the corosync service
[root@node201 corosync]# systemctl restart corosync

6) Check the quorum status

[root@node202 ~]# pcs quorum status
Quorum information
------------------
Date:             Thu Aug 29 17:41:02 2024
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          3232235978
Ring ID:          -1062731319/67
Quorate:          No

Votequorum information
----------------------
Expected votes:   7
Highest expected: 7
Total votes:      2
Quorum:           1 Activity blocked
Flags:            2Node WaitForAll LastManStanding
Unable to get node 3232235979 info

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
3232235977          1         NR node201
3232235978          1         NR node202 (local)
3232235979          0         NR node203

Case 2: adding the qdevice fails

As shown below, adding the qdevice on a cluster node fails with a Python error:

[root@node201 corosync]# pcs quorum device add model net host=node203 algorithm=ffsplit
Setting up qdevice certificates on nodes...
Traceback (most recent call last):
  File "/usr/sbin/pcs", line 9, in <module>
    load_entry_point('pcs==0.9.169', 'console_scripts', 'pcs')()
......
  File "/usr/lib/python2.7/site-packages/pcs/common/node_communicator.py", line 160, in url
    host="[{0}]".format(self.host) if ":" in self.host else self.host,
TypeError: argument of type 'NoneType' is not iterable

Create the cluster first:

[root@node201 pcs]# pcs cluster setup --name test_cluster node201 node202 node203 --force
Destroying cluster on nodes: node201, node202, node203...
node202: Stopping Cluster (pacemaker)...
node203: Stopping Cluster (pacemaker)...
node201: Stopping Cluster (pacemaker)...
node203: Successfully destroyed cluster
node202: Successfully destroyed cluster
node201: Successfully destroyed cluster

Sending 'pacemaker_remote authkey' to 'node201', 'node202', 'node203'
node201: successful distribution of the file 'pacemaker_remote authkey'
node203: successful distribution of the file 'pacemaker_remote authkey'
node202: successful distribution of the file 'pacemaker_remote authkey'
Sending cluster config files to the nodes...
node201: Succeeded
node202: Succeeded
node203: Succeeded

Synchronizing pcsd certificates on nodes node201, node202, node203...
node201: Success
node203: Success
node202: Success
Restarting pcsd on the nodes in order to reload the certificates...
node201: Success
node203: Success
node202: Success

As shown below, once the cluster has been created, the qdevice can be added successfully:

[root@node201 pcs]# pcs quorum device add model net host=node203 algorithm=ffsplit --force
Setting up qdevice certificates on nodes...
node201: Succeeded
node202: Succeeded
node203: Succeeded
Enabling corosync-qdevice...
node203: corosync-qdevice enabled
node201: corosync-qdevice enabled
node202: corosync-qdevice enabled
Sending updated corosync.conf to nodes...
node201: Succeeded
node202: Succeeded
node203: Succeeded
Corosync configuration reloaded
Starting corosync-qdevice...
node203: corosync-qdevice started
node201: corosync-qdevice started
node202: corosync-qdevice started
Posted by 天涯客1224