FreeSwitch High-Availability Solution

1. Introduction

corosync is the cluster framework engine, pacemaker is the high-availability cluster resource manager, and crmsh is a command-line tool for pacemaker.

1.1 corosync

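Corosync lets you define how messages are transmitted, including the transport and protocol, through a simple configuration file. It is relatively new software, released in 2008, but it is not entirely new: the OpenAIS project, started in 2002, grew too large and was split into two subprojects, and the part that carries HA heartbeat messages became Corosync. About 60% of its code comes from OpenAIS. Corosync can provide complete HA functionality, but more complex features still require OpenAIS. Corosync is the direction of future development and is the usual choice for new projects. hb_gui provides good HA management with a graphical interface; another graphical option is the RHCS suite luci + ricci.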

1.2 pacemaker

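pacemaker is an open-source high-availability cluster resource manager (CRM). In the HA cluster architecture it sits at the resource management and resource agent (RA) layer. It does not provide the underlying heartbeat messaging itself; to communicate with the peer node it relies on the messaging layer underneath to deliver information. It is usually combined with corosync.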

1.3 Summary

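The resource management layer (pacemaker) arbitrates which node is active, moves the IP address, and manages local resources. The messaging layer (heartbeat or corosync) carries heartbeat information. Resource Agents (think of them as service scripts) start, stop, and report the status of services. Several different services can run across multiple nodes; the remaining two standby nodes are called the failover domain. Which node is the primary is only relative, and the same goes for third-party arbitration. Vote system: the majority wins. When a node fails and its resources move to a standby node, that is failover; when the failed node is repaired and the resources move back to it, that is failback.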

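CRM: cluster resource manager, i.e. pacemaker (literally a cardiac pacemaker). Every node runs a crmd daemon (5560/tcp). The command-line interfaces crmsh and pcs (the latter introduced by Red Hat, in the heartbeat v3 era) edit the XML configuration so that crmd can pick it up and handle the resources. In other words, crmsh and pcs are equivalent.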

Resource Agent: OCF (Open Cluster Framework)

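primitive: a primary resource, which runs as a single instance in the cluster. clone: a cloned resource, which can run as multiple instances in the cluster. Every resource has a priority.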

For scores, infinity + negative infinity = negative infinity. Each node's hostname must match the name it resolves to in DNS.
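The score rule can be illustrated with a small shell sketch. It assumes Pacemaker's convention that INFINITY is represented as 1000000 and that -INFINITY wins over everything else; the function name add_scores is hypothetical and not part of any cluster tool.

```shell
#!/bin/sh
# Hypothetical helper: add two scores following the rule described above.
INF=1000000   # assumed numeric value of INFINITY

add_scores() {
    a=$1; b=$2
    # -INFINITY dominates everything, including +INFINITY
    if [ "$a" -le "-$INF" ] || [ "$b" -le "-$INF" ]; then
        echo "-$INF"; return
    fi
    # +INFINITY dominates any finite score
    if [ "$a" -ge "$INF" ] || [ "$b" -ge "$INF" ]; then
        echo "$INF"; return
    fi
    sum=$((a + b))
    # clamp finite sums into the [-INFINITY, INFINITY] range
    if [ "$sum" -gt "$INF" ]; then sum=$INF; fi
    if [ "$sum" -lt "-$INF" ]; then sum=-$INF; fi
    echo "$sum"
}
```

For example, add_scores 1000000 -1000000 prints -1000000, which is why a single -INFINITY constraint is enough to keep a resource off a node.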

2. Environment

Host IP	Hostname	Installed components
192.168.2.11	freeswitch-node1	corosync+pacemaker+pcsd+crmsh
192.168.2.12	freeswitch-node2	corosync+pacemaker+pcsd+crmsh

2.1 Environment preparation

2.1.1 Hostname resolution

Do this on both machines (in /etc/hosts):

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4

::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

192.168.2.11    freeswitch-node1
192.168.2.12    freeswitch-node2
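To confirm that the entries are in place on each node, a check along the following lines can help. This is only a sketch: missing_hosts is a hypothetical helper that scans a hosts-format file and prints every name that has no entry.

```shell
#!/bin/sh
# Hypothetical helper: print each given hostname that has no entry in the hosts file.
missing_hosts() {
    file=$1; shift
    for name in "$@"; do
        # skip comment lines; compare the name against every field after the IP
        if ! awk -v n="$name" '
            /^[[:space:]]*#/ { next }
            { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit (found ? 0 : 1) }' "$file"; then
            echo "$name"
        fi
    done
}
```

Running missing_hosts /etc/hosts freeswitch-node1 freeswitch-node2 should print nothing once both entries exist.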

2.1.2 Passwordless SSH login

Do this on each machine, toward the other:

[root@freeswitch-node1 ~]# ssh-keygen
[root@freeswitch-node1 ~]# ssh-copy-id -i /root/.ssh/id_rsa root@freeswitch-node2

[root@freeswitch-node2 ~]# ssh-keygen
[root@freeswitch-node2 ~]# ssh-copy-id -i /root/.ssh/id_rsa root@freeswitch-node1

2.1.3 Time synchronization

Do this on both machines. There are many ways to do it; this article sets the system clock from the hardware clock.

[root@freeswitch-node1 ~]# hwclock -s

[root@freeswitch-node2 ~]# hwclock -s

3. Installing corosync and pacemaker

3.1 Installing corosync and pacemaker

Do this on both machines. The stock CentOS repositories are sufficient; alternatively, installing only pcs is also enough.

[root@freeswitch-node1 ~]# yum install corosync pacemaker -y
[root@freeswitch-node1 ~]# cd /etc/corosync
[root@freeswitch-node1 corosync]# cp corosync.conf.example corosync.conf
[root@freeswitch-node1 corosync]# vi corosync.conf

# Change the following line
bindnetaddr: 192.168.2.0 # set to the network the machines are on
# Add the following block
service {
ver: 0
name: pacemaker # tells corosync to start pacemaker
}
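If you prefer not to edit the file by hand, the bindnetaddr change can be scripted. The sketch below assumes the file already contains a bindnetaddr line, as corosync.conf.example does; set_bindnetaddr is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: rewrite the bindnetaddr value of a corosync.conf in place.
set_bindnetaddr() {
    conf=$1; net=$2
    tmp=$(mktemp) || return 1
    # keep the original indentation, replace only the value after "bindnetaddr:"
    if sed "s/^\([[:space:]]*bindnetaddr:\).*/\1 $net/" "$conf" > "$tmp"; then
        # write back through cat so the file keeps its owner and permissions
        cat "$tmp" > "$conf"
    fi
    rm -f "$tmp"
}
```

Usage would be set_bindnetaddr /etc/corosync/corosync.conf 192.168.2.0, run as root on the node where the file is being prepared.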

[root@freeswitch-node1 corosync]# mv /dev/{random,random.bak}
[root@freeswitch-node1 corosync]# ln -s /dev/urandom /dev/random
[root@freeswitch-node1 corosync]# corosync-keygen
Corosync Cluster Engine Authentication key generator.
Gathering 1024 bits for key from /dev/random.
Press keys on your keyboard to generate entropy.
Writing corosync key to /etc/corosync/authkey.

[root@freeswitch-node1 corosync]# scp corosync.conf authkey root@freeswitch-node2:/etc/corosync/ 

[root@freeswitch-node2 corosync]# scp corosync.conf authkey root@freeswitch-node1:/etc/corosync/

# Copy in both directions (mind the hostnames); in fact, copying from one node to the others is enough

[root@freeswitch-node1 corosync]# systemctl start corosync

4. Installing the pcs management tool

[root@freeswitch-node1 corosync]# yum -y install pcs
[root@freeswitch-node1 corosync]# systemctl start pcsd
[root@freeswitch-node1 corosync]# echo "passw0rd"|passwd --stdin hacluster 

[root@freeswitch-node2 corosync]# yum -y install pcs
[root@freeswitch-node2 corosync]# systemctl start pcsd
[root@freeswitch-node2 corosync]# echo "passw0rd"|passwd --stdin hacluster

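Make sure pcsd is running on both machines before continuing (you may need to stop the firewall or add rules that allow the cluster traffic).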
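As an alternative to turning the firewall off, the cluster ports can be opened. This sketch assumes firewalld is the active firewall and uses its predefined high-availability service; the open_ha_firewall function and the DRY_RUN variable are hypothetical conveniences for previewing the commands before running them.

```shell
#!/bin/sh
# Run a command, or only print it when DRY_RUN=1 (hypothetical preview switch).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Allow cluster traffic (pcsd, corosync, pacemaker) through firewalld.
open_ha_firewall() {
    run firewall-cmd --permanent --add-service=high-availability &&
    run firewall-cmd --reload
}
```

DRY_RUN=1 open_ha_firewall prints the two commands; calling open_ha_firewall as root on each node applies them.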

[root@freeswitch-node1 corosync]# pcs cluster auth freeswitch-node2 freeswitch-node1
Username: hacluster
Password:
freeswitch-node2: Authorized
freeswitch-node1: Authorized 

[root@freeswitch-node2 corosync]# pcs cluster auth freeswitch-node1 freeswitch-node2
freeswitch-node1: Already authorized
freeswitch-node2: Already authorized

4.1 Creating the cluster

Run on freeswitch-node1:

[root@freeswitch-node1 corosync]# pcs cluster setup --name mycluster freeswitch-node1 freeswitch-node2 --force

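If this command reports an error, check whether pacemaker is running: start it if it is not, or restart it if it is. That usually resolves the problem.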

After the command above completes, check the configuration file on freeswitch-node2; it has already been synchronized:

[root@freeswitch-node2 corosync]# cat corosync.conf

totem {
version: 2
secauth: off
cluster_name: mycluster
transport: udpu
}

nodelist {
node {
ring0_addr: freeswitch-node1
nodeid: 1
}

node {
ring0_addr: freeswitch-node2
nodeid: 2
}
}

quorum {
provider: corosync_votequorum
two_node: 1
} 

logging {
to_logfile: yes
logfile: /var/log/cluster/corosync.log
to_syslog: yes
}

Do the following on both machines:

[root@freeswitch-node1 ~]# pcs cluster start
Starting Cluster (corosync)...
Starting Cluster (pacemaker)...

# The pcsd daemon must be started separately on each node.

[root@freeswitch-node1 ~]# pcs cluster status
Cluster Status:
 Stack: corosync
 Current DC: freeswitch-node1 (version 1.1.21-4.el7-f14e36fd43) - partition WITHOUT quorum
 Last updated: Thu Jun 18 11:33:51 2020
 Last change: Thu Jun 18 11:33:28 2020 by hacluster via crmd on freeswitch-node1
 2 nodes configured
 0 resources configured 

PCSD Status:
  freeswitch-node2: Online
  freeswitch-node1: Online

 

[root@freeswitch-node2 ~]# pcs cluster start
Starting Cluster (corosync)...
Starting Cluster (pacemaker)...
# The pcsd daemon must be started separately on each node.

[root@freeswitch-node2 ~]# pcs cluster status
Cluster Status:
 Stack: corosync
 Current DC: freeswitch-node1 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
 Last updated: Thu Jun 18 11:34:29 2020
 Last change: Thu Jun 18 11:33:28 2020 by hacluster via crmd on freeswitch-node1
 2 nodes configured
 0 resources configured

PCSD Status:
  freeswitch-node1: Online
  freeswitch-node2: Online 

[root@freeswitch-node1 corosync]# corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
        id      = 192.168.2.11
        status  = ring 0 active with no faults 

[root@freeswitch-node2 corosync]# corosync-cfgtool -s
Printing ring status.
Local node ID 2
RING ID 0
        id      = 192.168.2.12
        status  = ring 0 active with no faults 

[root@freeswitch-node1 corosync]# corosync-cmapctl |grep members
runtime.totem.pg.mrp.srp.members.1.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.1.ip (str) = r(0) ip(192.168.2.11)
runtime.totem.pg.mrp.srp.members.1.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.1.status (str) = joined
runtime.totem.pg.mrp.srp.members.2.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.2.ip (str) = r(0) ip(192.168.2.12)
runtime.totem.pg.mrp.srp.members.2.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.2.status (str) = joined 

[root@freeswitch-node2 corosync]# corosync-cmapctl |grep members
## Check the current cluster membership

runtime.totem.pg.mrp.srp.members.1.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.1.ip (str) = r(0) ip(192.168.2.11)
runtime.totem.pg.mrp.srp.members.1.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.1.status (str) = joined
runtime.totem.pg.mrp.srp.members.2.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.2.ip (str) = r(0) ip(192.168.2.12)
runtime.totem.pg.mrp.srp.members.2.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.2.status (str) = joined 

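[root@freeswitch-node1 ~]# pcs status
## DC (Designated Coordinator) is the designated coordinator node
## Every node runs a CRM and one of them is elected DC, the brain of the whole cluster. The CIB (cluster information base) controlled by the DC is the master CIB; all other CIBs are replicas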

Cluster name: mycluster
WARNINGS:
No stonith devices and stonith-enabled is not false
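## No STONITH (fencing) device is configured. STONITH means "shoot the other node in the head": fencing the peer when the nodes contend for resources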

Stack: corosync
Current DC: freeswitch-node1 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
Last updated: Thu Jun 18 11:48:22 2020
Last change: Thu Jun 18 11:33:28 2020 by hacluster via crmd on freeswitch-node1
2 nodes configured
0 resources configured

Online: [ freeswitch-node1 freeswitch-node2 ] 

No resources 

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/disabled 

[root@freeswitch-node2 corosync]# pcs status corosync
Membership information
----------------------
    Nodeid      Votes Name
         1          1 freeswitch-node1
         2          1 freeswitch-node2 (local)
[root@freeswitch-node1 ~]# crm_verify -L -V
## crm_verify checks the current cluster configuration for errors
   error: unpack_resources:     Resource start-up disabled since no STONITH resources have been defined
   error: unpack_resources:     Either configure some or disable STONITH with the stonith-enabled option
   error: unpack_resources:     NOTE: Clusters with shared data need STONITH to ensure data integrity
Errors found during check: config not valid
## If this error appears, run the commands below; the error text above is self-explanatory

 

 

[root@freeswitch-node1 ~]# pcs property set stonith-enabled=false
[root@freeswitch-node1 ~]# pcs property list
## Shows the cluster properties that have been changed; to list all properties, including defaults, use pcs property list --all

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: mycluster
 dc-version: 1.1.21-4.el7-f14e36fd43
 have-watchdog: false
 stonith-enabled: false

4.2 Installing the crmsh command-line cluster management tool

Do this on both machines:

[root@freeswitch-node1 corosync]# cd /etc/yum.repos.d/
[root@freeswitch-node1 yum.repos.d]# wget http://download.opensuse.org/repositories/network:/ha-clustering:/Stable/CentOS_CentOS-7/network:ha-clustering:Stable.repo
[root@freeswitch-node1 ~]# yum install crmsh -y
[root@freeswitch-node1 ~]# yum -y install httpd
[root@freeswitch-node1 ~]# systemctl start httpd
## Do not enable httpd at boot; it has to be managed by the cluster (crm)

[root@freeswitch-node1 ~]# echo "<h1>corosync pacemaker on the openstack</h1>" >/var/www/html/index.html 

[root@freeswitch-node2 corosync]# cd /etc/yum.repos.d/
[root@freeswitch-node2 yum.repos.d]# wget http://download.opensuse.org/repositories/network:/ha-clustering:/Stable/CentOS_CentOS-7/network:ha-clustering:Stable.repo
[root@freeswitch-node2 ~]# yum install crmsh -y
[root@freeswitch-node2 ~]# yum -y install httpd
[root@freeswitch-node2 ~]# systemctl start httpd
## Do not enable httpd at boot; it has to be managed by the cluster (crm)

[root@freeswitch-node2 ~]# echo "<h1>corosync pacemaker on the openstack</h1>" >/var/www/html/index.html

At this point the web page on both nodes can be reached from a browser.

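httpd is installed on both nodes. Note that from here on httpd should only be stopped by hand, never restarted, and it must not be enabled at boot, because the resource manager starts and stops these services itself.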

4.3 Checking the configuration

Check and confirm on both nodes:

[root@freeswitch-node1 ~]# crm
crm(live)# status ## make sure all nodes are online before running the commands that follow
crm(live)# ra
crm(live)ra# list systemd
httpd
crm(live)ra# help info
crm(live)ra# classes
crm(live)ra# cd
crm(live)# configure
crm(live)configure# help primitive

[root@freeswitch-node2 ~]# crm
crm(live)# status ## make sure all nodes are online before running the commands that follow
crm(live)# ra
crm(live)ra# list systemd
httpd
crm(live)ra# help info
crm(live)ra# classes
crm(live)ra# cd
crm(live)# configure
crm(live)configure# help primitive

4.4 Defining the high-availability resources

Run on freeswitch-node1:

crm(live)ra# classes
crm(live)ra# list ocf ## ocf is one of the classes
crm(live)ra# info ocf:IPaddr ## IPaddr is the resource agent; its provider is heartbeat

crm(live)ra# cd ..
crm(live)# configure
crm(live)configure# primitive FloadtIP ocf:IPaddr params ip=192.168.2.10 # define the VIP; this highly available IP floats between nodes automatically

crm(live)configure# show
node 1: freeswitch-node1
node 2: freeswitch-node2
primitive FloadtIP IPaddr \
        params ip=192.168.2.10
property cib-bootstrap-options: \
        have-watchdog=false \
        dc-version=1.1.21-4.el7-f14e36fd43 \
        cluster-infrastructure=corosync \
        cluster-name=mycluster \
        stonith-enabled=false

crm(live)configure# verify
crm(live)configure# commit
crm(live)configure# cd ..
crm(live)# status
Stack: corosync
Current DC: freeswitch-node1 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
Last updated: Thu Jun 18 14:23:04 2020
Last change: Thu Jun 18 14:22:50 2020 by root via cibadmin on freeswitch-node1

2 nodes configured
1 resource configured

Online: [ freeswitch-node1 freeswitch-node2 ]
Full list of resources:
 FloadtIP  (ocf::heartbeat:IPaddr):        Started freeswitch-node1
# The steps above add the FloadtIP (floating IP) resource

crm(live)# configure
crm(live)configure# primitive WebServer systemd:httpd ## systemd is one of the classes listed by the classes command
crm(live)configure# verify
crm(live)configure# commit
# The steps above add the WebServer resource

crm(live)# configure
crm(live)configure# primitive FreeSwitch systemd:freeswitch
crm(live)configure# verify
crm(live)configure# commit
# The steps above add the FreeSwitch resource (see the previous article for setting FreeSwitch up to start as a service)

crm(live)configure# help group
crm(live)configure# group HAService FloadtIP WebServer FreeSwitch ## group members are ordered; WebServer and FreeSwitch run wherever the IP is
crm(live)configure# verify
crm(live)configure# commit
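# The steps above bind the floating IP and the services into one resource group
crm(live)# node standby ## put the current node into standby, which switches the resources to the other node
## equivalent to running crm node standby directly from the root shell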

4.5 Enabling start at boot

At this point neither the pcsd service nor the cluster is enabled at boot.

# systemctl enable pcsd         # enable pcsd at boot; the cluster can still be used without it running
# pcs cluster enable --all      # start the cluster at boot on all nodes

4.6 Defining resources with monitoring

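At this point the resources are not monitored. For example, if httpd is stopped directly from the root shell, crm status still reports it as Started. Redefining the resources with monitoring fixes this.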

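Monitoring has to be specified in the primitive command when the resource is defined, so first delete the previously defined resources and then define them again.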

crm(live)# resource
crm(live)resource# show
 Resource Group: HAService
     FloadtIP   (ocf::heartbeat:IPaddr):        Started
     WebServer  (systemd:httpd):        Started
     FreeSwitch (systemd:freeswitch):   Started

crm(live)resource# stop HAService    # stop all resources in the group
crm(live)resource# show
 Resource Group: HAService
     FloadtIP   (ocf::heartbeat:IPaddr):        Started (disabled)
     WebServer  (systemd:httpd):        Stopping (disabled)
     FreeSwitch (systemd:freeswitch):   Stopped (disabled)
crm(live)resource# cd
crm(live)# configure
crm(live)configure# edit    # edit the resource definitions and delete the 3 resources and the group

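Redefine the resources with monitoring: check every 60 seconds with a 20-second timeout. If the timeout is shorter than the advised value, verify prints a warning, as the output below shows.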

crm(live)# configure
crm(live)configure# primitive FloadtIP ocf:IPaddr params ip=192.168.2.10 op monitor timeout=20s interval=60s
crm(live)configure# primitive WebServer systemd:httpd op monitor timeout=20s interval=60s
crm(live)configure# primitive FreeSwitch systemd:freeswitch op monitor timeout=20s interval=60s
crm(live)configure# group HAService FloadtIP WebServer FreeSwitch
crm(live)configure# property no-quorum-policy=ignore    # ignore loss of quorum; best avoided when the cluster has an odd number of nodes
crm(live)configure# verify
WARNING: FreeSwitch: specified timeout 20s for monitor is smaller than the advised 100
WARNING: WebServer: specified timeout 20s for monitor is smaller than the advised 100
crm(live)configure# commit
crm(live)configure# cd
crm(live)# status
Stack: corosync
Current DC: freeswitch-node1 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
Last updated: Wed Jun 24 11:04:50 2020
Last change: Wed Jun 24 11:04:41 2020 by root via cibadmin on freeswitch-node1
 
2 nodes configured
3 resources configured

Online: [ freeswitch-node1 freeswitch-node2 ]

Full list of resources:
 Resource Group: HAService
     FloadtIP   (ocf::heartbeat:IPaddr):        Started freeswitch-node1
     WebServer  (systemd:httpd):        Started freeswitch-node1
     FreeSwitch (systemd:freeswitch):   Started freeswitch-node1

To test, stop a service; after a short while the cluster starts it again automatically.

[root@freeswitch-node1 ~]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens192: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:50:56:a6:29:82 brd ff:ff:ff:ff:ff:ff
    inet 192.168.2.11/16 brd 192.168.255.255 scope global noprefixroute ens192
       valid_lft forever preferred_lft forever
    inet 192.168.2.10/16 brd 192.168.255.255 scope global secondary ens192
       valid_lft forever preferred_lft forever
    inet6 fe80::250:56ff:fea6:2982/64 scope link
       valid_lft forever preferred_lft forever

[root@freeswitch-node1 ~]# systemctl stop httpd 

crm(live)# status
Stack: corosync
Current DC: freeswitch-node1 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
Last updated: Wed Jun 24 11:19:13 2020
Last change: Wed Jun 24 11:15:22 2020 by root via crm_attribute on freeswitch-node2

2 nodes configured
3 resources configured 

Node freeswitch-node2: standby
Online: [ freeswitch-node1 ] 

Full list of resources:
 Resource Group: HAService
     FloadtIP   (ocf::heartbeat:IPaddr):        Started freeswitch-node1
     WebServer  (systemd:httpd):        Started freeswitch-node1
     FreeSwitch (systemd:freeswitch):   Started freeswitch-node1 

Failed Resource Actions:

* WebServer_monitor_60000 on freeswitch-node1 'not running' (7): call=50, status=complete, exitreason='',
    last-rc-change='Wed Jun 24 11:08:41 2020', queued=0ms, exec=0ms
* FreeSwitch_monitor_60000 on freeswitch-node1 'not running' (7): call=60, status=complete, exitreason='',
    last-rc-change='Wed Jun 24 11:10:50 2020', queued=0ms, exec=0ms
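Because the monitor runs every 60 seconds, recovery is not instant. A polling loop like the following can be used to wait for the service to come back during the test. It is only a sketch; wait_until is a hypothetical helper that retries any command until it succeeds or a timeout expires.

```shell
#!/bin/sh
# Hypothetical helper: poll a command until it succeeds or the timeout (seconds) expires.
wait_until() {
    timeout=$1; interval=$2; shift 2
    elapsed=0
    while ! "$@" >/dev/null 2>&1; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1    # gave up: the command never succeeded
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 0
}
```

For instance, after systemctl stop httpd, running wait_until 120 5 systemctl is-active httpd && echo recovered waits up to two minutes, checking every five seconds.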

4.7 Clearing resource failure records

[Note] After restoring the httpd service, remember to clear the resource's failure records; otherwise the resource cannot be started.

crm(live)# resource
crm(live)resource# cleanup HAService
Cleaned up FloadtIP on freeswitch-node2
Cleaned up FloadtIP on freeswitch-node1
Cleaned up WebServer on freeswitch-node2
Cleaned up WebServer on freeswitch-node1
Cleaned up FreeSwitch on freeswitch-node2
.Cleaned up FreeSwitch on freeswitch-node1
Waiting for 1 reply from the CRMd. OK
crm(live)resource# show
 Resource Group: HAService
     FloadtIP   (ocf::heartbeat:IPaddr):        Started
     WebServer  (systemd:httpd):        Started
     FreeSwitch (systemd:freeswitch):   Started

5 Summary

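1. After restoring a resource's service, always clear the resource's failure records; otherwise the resource cannot be started.

2. When building high availability with corosync + pacemaker on two nodes, set the global properties to disable STONITH and to ignore the loss of quorum (votes not exceeding half).

3. Watch out for the effect of SELinux and iptables (firewalld.service) on the services.

4. Make sure the nodes resolve each other through /etc/hosts.

5. Node clocks must be kept in sync.

6. Set up key-based (passwordless) SSH between the nodes.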

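7. With 2 nodes, or any even number of nodes, resources may fail to move because quorum cannot be reached. There are four ways to address this:

7.1 Add a ping node.

7.2 Add a quorum disk.

7.3 Make the number of nodes in the cluster odd.

7.4 Ignore the loss of quorum: property no-quorum-policy=ignore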
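The quorum problem in item 7 comes down to simple majority arithmetic, sketched below. has_quorum is a hypothetical helper that applies the plain "more than half of the votes" rule; it does not model special handling such as the two_node option that appears in the corosync.conf shown earlier.

```shell
#!/bin/sh
# Hypothetical helper: succeed when the votes present form a strict majority of all votes.
has_quorum() {
    total=$1; present=$2
    [ $((present * 2)) -gt "$total" ]
}
```

With two nodes and one of them down, has_quorum 2 1 fails, so the surviving node cannot take over on its own; with three nodes, has_quorum 3 2 succeeds.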

posted @ 2020-06-24 17:18 飛言緋語
