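FreeSwitch High-Availability Solution
1. Introduction
corosync is the cluster framework engine, pacemaker is the high-availability cluster resource manager, and crmsh is a command-line tool for pacemaker.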
1.1 corosync
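Corosync lets you define how messages are transmitted, including the transport and protocol, in one simple configuration file. It was released in 2008 and is often described as new, but it is not really new software: the OpenAIS project, started in 2002, grew too large and was split into two subprojects, and the one that carries the HA heartbeat messaging is Corosync. About 60% of its code comes from OpenAIS. Corosync can provide complete HA functionality on its own, but more numerous and more complex features require OpenAIS. Corosync is the direction of future development and is generally the choice for new projects. hb_gui offers good HA management through a graphical interface; another graphical option is the RHCS suite luci+ricci.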
1.2 pacemaker
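pacemaker is an open-source high-availability cluster resource manager (CRM). It sits at the resource management and resource agent (RA) level of the HA cluster stack. It does not provide the underlying heartbeat messaging itself; to communicate with peer nodes it relies on a lower-level messaging service to deliver information to them. It is usually combined with corosync.
1.3 Summary
The resource management layer (pacemaker) arbitrates which node is active, moves the IP address, and manages local resources. The messaging layer (heartbeat or corosync) carries the heartbeat information. Resource Agents (think of them as service scripts) start, stop, and report the status of services. Several nodes may run several different services; the remaining standby nodes (two in this description) are called the failover domain. Which node is the primary is only relative, and the same goes for the third-party arbiter. Vote system: the majority wins. When the active node fails and its resources move to a standby node, that is failover; when the failed node is repaired and the resources move back to it, that is failback.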
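The "majority wins" rule can be illustrated with a small sketch. The function names `quorum_threshold` and `has_quorum` are hypothetical helpers written for this article, not part of corosync or pacemaker, and the sketch assumes one vote per node and a plain strict-majority rule.

```shell
# Hypothetical helpers: strict-majority quorum, assuming one vote per node.

quorum_threshold() {
    # Smallest number of votes that is more than half of the total.
    echo $(( $1 / 2 + 1 ))
}

has_quorum() {
    # Usage: has_quorum <votes_present> <total_votes>
    [ "$1" -ge "$(quorum_threshold "$2")" ]
}
```

Under this rule a two-node cluster needs both votes, so losing either node loses quorum, while a three-node cluster survives the loss of one node.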
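CRM: cluster resource manager, i.e. pacemaker (the name refers to a heart pacemaker). Every node runs a crmd daemon (5560/tcp). The command-line interfaces crmsh and pcs (the latter introduced by Red Hat in the heartbeat v3 era) edit the XML configuration, which crmd reads in order to manage resources and services. In that sense crmsh and pcs are equivalent.
Resource Agent: OCF (Open Cluster Framework)
primitive: a primary resource; only one instance runs in the cluster. clone: a cloned resource; multiple instances may run in the cluster. Every resource has a priority.
For scores, INFINITY + (-INFINITY) = -INFINITY. Each node's hostname must match the name that DNS resolution returns for it.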
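The score rule above can be sketched in shell. The names `to_num` and `add_scores` are hypothetical, and the sketch assumes INFINITY is represented as 1000000, which is how Pacemaker's documentation describes it; it is an illustration of the arithmetic, not Pacemaker's own code.

```shell
# Sketch of Pacemaker-style score addition (helper names are hypothetical).
# Assumption: INFINITY is represented as 1000000.
INF=1000000

to_num() {
    case "$1" in
        INFINITY|+INFINITY) echo "$INF" ;;
        -INFINITY)          echo "-$INF" ;;
        *)                  echo "$1" ;;
    esac
}

add_scores() {
    a=$(to_num "$1")
    b=$(to_num "$2")
    # Negative infinity always wins; positive infinity wins otherwise.
    if [ "$a" -le "-$INF" ] || [ "$b" -le "-$INF" ]; then
        echo "-INFINITY"
    elif [ "$a" -ge "$INF" ] || [ "$b" -ge "$INF" ]; then
        echo "INFINITY"
    else
        sum=$((a + b))
        if [ "$sum" -ge "$INF" ]; then
            echo "INFINITY"
        elif [ "$sum" -le "-$INF" ]; then
            echo "-INFINITY"
        else
            echo "$sum"
        fi
    fi
}
```

For example, `add_scores INFINITY -INFINITY` prints `-INFINITY`, which is why a single -INFINITY constraint is enough to keep a resource off a node.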
2. Environment
Host IP | Hostname | Installed components
192.168.2.11 | freeswitch-node1 | corosync+pacemaker+pcsd+crmsh
192.168.2.12 | freeswitch-node2 | corosync+pacemaker+pcsd+crmsh
2.1 Preparation
2.1.1 Hostname resolution
Do this on both machines (in /etc/hosts).
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.2.11 freeswitch-node1
192.168.2.12 freeswitch-node2
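A quick way to confirm that both node names are present in the hosts file is sketched below. `missing_hosts` is a hypothetical helper written for this article; it only reads the file it is given and prints the names that have no entry.

```shell
# Hypothetical helper: print each given hostname that has no entry in a hosts file.
# Usage: missing_hosts <hosts_file> <name>...
missing_hosts() {
    file=$1
    shift
    for name in "$@"; do
        # Skip comment lines; look for the name in any column after the address.
        if ! awk -v n="$name" '
            /^[ \t]*#/ { next }
            { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit (found ? 0 : 1) }' "$file"; then
            echo "$name"
        fi
    done
}
```

Running `missing_hosts /etc/hosts freeswitch-node1 freeswitch-node2` on each machine should print nothing once the entries above are in place.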
2.1.2 Passwordless SSH login
Set this up in both directions between the two machines.
[root@freeswitch-node1 ~]# ssh-keygen
[root@freeswitch-node1 ~]# ssh-copy-id -i /root/.ssh/id_rsa root@freeswitch-node2
[root@freeswitch-node2 ~]# ssh-keygen
[root@freeswitch-node2 ~]# ssh-copy-id -i /root/.ssh/id_rsa root@freeswitch-node1
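2.1.3 Time synchronization
Do this on both machines. There are many ways to do it; this article sets the system time from the hardware clock.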
[root@freeswitch-node1 ~]# hwclock -s
[root@freeswitch-node2 ~]# hwclock -s
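3. Installing corosync and pacemaker
3.1 Installing corosync and pacemaker
Do this on both machines. The packages in the stock CentOS repositories are sufficient; installing only pcs is also enough.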
[root@freeswitch-node1 ~]# yum install corosync pacemaker -y
[root@freeswitch-node1 ~]# cd /etc/corosync
[root@freeswitch-node1 corosync]# cp corosync.conf.example corosync.conf
[root@freeswitch-node1 corosync]# vi corosync.conf
# Change the following (set it to the network the machines are on):
bindnetaddr: 192.168.2.0
# Add the following (this tells corosync to start pacemaker):
service {
    ver: 0
    name: pacemaker
}
[root@freeswitch-node1 corosync]# mv /dev/{random,random.bak}
[root@freeswitch-node1 corosync]# ln -s /dev/urandom /dev/random
[root@freeswitch-node1 corosync]# corosync-keygen
Corosync Cluster Engine Authentication key generator.
Gathering 1024 bits for key from /dev/random.
Press keys on your keyboard to generate entropy.
Writing corosync key to /etc/corosync/authkey.
[root@freeswitch-node1 corosync]# scp corosync.conf authkey root@freeswitch-node2:/etc/corosync/
[root@freeswitch-node2 corosync]# scp corosync.conf authkey root@freeswitch-node1:/etc/corosync/
# Copying in both directions is shown here; mind the hostnames. In practice only one node needs to copy its files to the others.
[root@freeswitch-node1 corosync]# systemctl start corosync
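The value for bindnetaddr is the network address of the interface, which depends on the prefix length in use. As a rough sketch, the hypothetical helper `network_addr` below derives it from an IPv4 address and a prefix length using shell arithmetic; it assumes well-formed dotted-quad input without leading zeros.

```shell
# Hypothetical helper: compute the IPv4 network address for <ip> <prefix_len>.
network_addr() {
    ip=$1
    prefix=$2
    # Split the dotted quad into the positional parameters.
    old_ifs=$IFS
    IFS=.
    set -- $ip
    IFS=$old_ifs
    addr=$(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
    mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
    net=$(( addr & mask ))
    echo "$(( (net >> 24) & 255 )).$(( (net >> 16) & 255 )).$(( (net >> 8) & 255 )).$(( net & 255 ))"
}
```

With a 24-bit prefix, `network_addr 192.168.2.11 24` prints `192.168.2.0`, the value used above; with a 16-bit prefix the same address yields `192.168.0.0`.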
4. Installing the pcs management tool
[root@freeswitch-node1 corosync]# yum -y install pcs
[root@freeswitch-node1 corosync]# systemctl start pcsd
[root@freeswitch-node1 corosync]# echo "passw0rd"|passwd --stdin hacluster
[root@freeswitch-node2 corosync]# yum -y install pcs
[root@freeswitch-node2 corosync]# systemctl start pcsd
[root@freeswitch-node2 corosync]# echo "passw0rd"|passwd --stdin hacluster
Make sure pcsd is running on both machines before continuing (you may need to stop the firewall or add rules that allow the traffic).
[root@freeswitch-node1 corosync]# pcs cluster auth freeswitch-node2 freeswitch-node1
Username: hacluster
Password:
freeswitch-node2: Authorized
freeswitch-node1: Authorized
[root@freeswitch-node2 corosync]# pcs cluster auth freeswitch-node1 freeswitch-node2
freeswitch-node1: Already authorized
freeswitch-node2: Already authorized
4.1 Creating the cluster
Run this on freeswitch-node1.
[root@freeswitch-node1 corosync]# pcs cluster setup --name mycluster freeswitch-node1 freeswitch-node2 --force
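If this command reports an error, check whether pacemaker is running: start it if it is not, or restart it if it is, and then retry.
After the command completes, check the configuration file on freeswitch-node2; it has already been synchronized.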
[root@freeswitch-node2 corosync]# cat corosync.conf
totem {
    version: 2
    secauth: off
    cluster_name: mycluster
    transport: udpu
}
nodelist {
    node {
        ring0_addr: freeswitch-node1
        nodeid: 1
    }
    node {
        ring0_addr: freeswitch-node2
        nodeid: 2
    }
}
quorum {
    provider: corosync_votequorum
    two_node: 1
}
logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
}
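To confirm that two copies of the file describe the same node list, a sketch along the following lines may help. `ring0_addrs` and `same_nodelist` are hypothetical helpers; they assume the simple one-key-per-line layout shown above.

```shell
# Hypothetical helpers for comparing node lists in corosync.conf-style files.

# Print the ring0_addr value of every node entry in the given file.
ring0_addrs() {
    awk '$1 == "ring0_addr:" { print $2 }' "$1"
}

# Succeed if two files list the same set of ring0_addr values.
same_nodelist() {
    [ "$(ring0_addrs "$1" | sort)" = "$(ring0_addrs "$2" | sort)" ]
}
```

One possible use is to copy the peer's file to a temporary path with scp and then run `same_nodelist /etc/corosync/corosync.conf /tmp/peer-corosync.conf`, where the second path is only an example.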
Do the following on both machines.
[root@freeswitch-node1 ~]# pcs cluster start
Starting Cluster (corosync)...
Starting Cluster (pacemaker)...
# The pcsd daemon must be started separately on each node.
[root@freeswitch-node1 ~]# pcs cluster status
Cluster Status:
 Stack: corosync
 Current DC: freeswitch-node1 (version 1.1.21-4.el7-f14e36fd43) - partition WITHOUT quorum
 Last updated: Thu Jun 18 11:33:51 2020
 Last change: Thu Jun 18 11:33:28 2020 by hacluster via crmd on freeswitch-node1
 2 nodes configured
 0 resources configured
PCSD Status:
  freeswitch-node2: Online
  freeswitch-node1: Online
[root@freeswitch-node2 ~]# pcs cluster start
Starting Cluster (corosync)...
Starting Cluster (pacemaker)...
# The pcsd daemon must be started separately on each node.
[root@freeswitch-node2 ~]# pcs cluster status
Cluster Status:
 Stack: corosync
 Current DC: freeswitch-node1 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
 Last updated: Thu Jun 18 11:34:29 2020
 Last change: Thu Jun 18 11:33:28 2020 by hacluster via crmd on freeswitch-node1
 2 nodes configured
 0 resources configured
PCSD Status:
  freeswitch-node1: Online
  freeswitch-node2: Online
[root@freeswitch-node1 corosync]# corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
    id = 192.168.2.11
    status = ring 0 active with no faults
[root@freeswitch-node2 corosync]# corosync-cfgtool -s
Printing ring status.
Local node ID 2
RING ID 0
    id = 192.168.2.12
    status = ring 0 active with no faults
[root@freeswitch-node1 corosync]# corosync-cmapctl |grep members
runtime.totem.pg.mrp.srp.members.1.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.1.ip (str) = r(0) ip(192.168.2.11)
runtime.totem.pg.mrp.srp.members.1.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.1.status (str) = joined
runtime.totem.pg.mrp.srp.members.2.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.2.ip (str) = r(0) ip(192.168.2.12)
runtime.totem.pg.mrp.srp.members.2.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.2.status (str) = joined
[root@freeswitch-node2 corosync]# corosync-cmapctl |grep members    ## check the current cluster membership
runtime.totem.pg.mrp.srp.members.1.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.1.ip (str) = r(0) ip(192.168.2.11)
runtime.totem.pg.mrp.srp.members.1.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.1.status (str) = joined
runtime.totem.pg.mrp.srp.members.2.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.2.ip (str) = r(0) ip(192.168.2.12)
runtime.totem.pg.mrp.srp.members.2.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.2.status (str) = joined
[root@freeswitch-node1 ~]# pcs status
## DC stands for Designated Coordinator.
## Every node runs a CRM and one of them is elected DC, the brain of the whole cluster. The CIB (cluster information base) controlled by the DC is the master CIB; all other CIBs are replicas.
Cluster name: mycluster
WARNINGS:
No stonith devices and stonith-enabled is not false
## No STONITH (fencing) device is configured. STONITH is the mechanism that "shoots the other node in the head" when nodes contend for resources.
Stack: corosync
Current DC: freeswitch-node1 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
Last updated: Thu Jun 18 11:48:22 2020
Last change: Thu Jun 18 11:33:28 2020 by hacluster via crmd on freeswitch-node1
2 nodes configured
0 resources configured
Online: [ freeswitch-node1 freeswitch-node2 ]
No resources
Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/disabled
[root@freeswitch-node2 corosync]# pcs status corosync
Membership information
----------------------
    Nodeid      Votes Name
         1          1 freeswitch-node1
         2          1 freeswitch-node2 (local)
[root@freeswitch-node1 ~]# crm_verify -L -V    ## crm_verify checks the current cluster configuration for errors
   error: unpack_resources: Resource start-up disabled since no STONITH resources have been defined
   error: unpack_resources: Either configure some or disable STONITH with the stonith-enabled option
   error: unpack_resources: NOTE: Clusters with shared data need STONITH to ensure data integrity
Errors found during check: config not valid
## If this error appears, run the command below; the error text itself explains why.
[root@freeswitch-node1 ~]# pcs property set stonith-enabled=false
[root@freeswitch-node1 ~]# pcs property list    ## lists the cluster properties that have been changed; to list all of them, use pcs property list --all
Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: mycluster
 dc-version: 1.1.21-4.el7-f14e36fd43
 have-watchdog: false
 stonith-enabled: false
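The checks above can also be scripted. The following sketch defines two hypothetical filters, `count_joined` and `partition_has_quorum`, that read command output on standard input; they assume the output format shown in the transcript above and nothing more.

```shell
# Hypothetical filters over the command output shown above (read from stdin).

# Count members whose status is "joined" in `corosync-cmapctl` output.
count_joined() {
    grep -c 'members\.[0-9][0-9]*\.status (str) = joined'
}

# Succeed if a status report says the local partition has quorum.
partition_has_quorum() {
    grep -q 'partition with quorum'
}
```

For example, `corosync-cmapctl | count_joined` should print 2 on a healthy two-node cluster, and `pcs cluster status | partition_has_quorum` can serve as the condition of an `if`.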
4.2 Installing the crmsh command-line cluster management tool
Do this on both machines.
[root@freeswitch-node1 corosync]# cd /etc/yum.repos.d/
[root@freeswitch-node1 yum.repos.d]# wget http://download.opensuse.org/repositories/network:/ha-clustering:/Stable/CentOS_CentOS-7/network:ha-clustering:Stable.repo
[root@freeswitch-node1 ~]# yum install crmsh -y
[root@freeswitch-node1 ~]# yum -y install httpd
[root@freeswitch-node1 ~]# systemctl start httpd    ## do not enable httpd at boot; crm has to manage it
[root@freeswitch-node1 ~]# echo "<h1>corosync pacemaker on the openstack</h1>" >/var/www/html/index.html
[root@freeswitch-node2 corosync]# cd /etc/yum.repos.d/
[root@freeswitch-node2 yum.repos.d]# wget http://download.opensuse.org/repositories/network:/ha-clustering:/Stable/CentOS_CentOS-7/network:ha-clustering:Stable.repo
[root@freeswitch-node2 ~]# yum install crmsh -y
[root@freeswitch-node2 ~]# yum -y install httpd
[root@freeswitch-node2 ~]# systemctl start httpd    ## do not enable httpd at boot; crm has to manage it
[root@freeswitch-node2 ~]# echo "<h1>corosync pacemaker on the openstack</h1>" >/var/www/html/index.html
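At this point the web page on each of the two nodes can be reached from a browser.
httpd is installed on both nodes. Note that from here on you may only stop the httpd service by hand, not restart it, and it must not be enabled at boot, because the resource manager starts and stops these services itself.
4.3 Checking the configuration
Check on both nodes.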
[root@freeswitch-node1 ~]# crm
crm(live)# status    ## all nodes must be online before running the commands below
crm(live)# ra
crm(live)ra# list systemd httpd
crm(live)ra# help info
crm(live)ra# classes
crm(live)ra# cd
crm(live)# configure
crm(live)configure# help primitive
[root@freeswitch-node2 ~]# crm
crm(live)# status    ## all nodes must be online before running the commands below
crm(live)# ra
crm(live)ra# list systemd httpd
crm(live)ra# help info
crm(live)ra# classes
crm(live)ra# cd
crm(live)# configure
crm(live)configure# help primitive
4.4 Defining highly available resources
Run this on freeswitch-node1.
crm(live)ra# classes
crm(live)ra# list ocf    ## ocf is one of the classes
crm(live)ra# info ocf:IPaddr    ## IPaddr is the resource agent (its provider is heartbeat)
crm(live)ra# cd ..
crm(live)# configure
crm(live)configure# primitive FloadtIP ocf:IPaddr params ip=192.168.2.10    # defines the VIP, the highly available IP that floats between nodes automatically
crm(live)configure# show
node 1: freeswitch-node1
node 2: freeswitch-node2
primitive FloadtIP IPaddr \
        params ip=192.168.2.10
property cib-bootstrap-options: \
        have-watchdog=false \
        dc-version=1.1.21-4.el7-f14e36fd43 \
        cluster-infrastructure=corosync \
        cluster-name=mycluster \
        stonith-enabled=false
crm(live)configure# verify
crm(live)configure# commit
crm(live)configure# cd ../
crm(live)# status
Stack: corosync
Current DC: freeswitch-node1 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
Last updated: Thu Jun 18 14:23:04 2020
Last change: Thu Jun 18 14:22:50 2020 by root via cibadmin on freeswitch-node1
2 nodes configured
1 resource configured
Online: [ freeswitch-node1 freeswitch-node2 ]
Full list of resources:
 FloadtIP (ocf::heartbeat:IPaddr): Started freeswitch-node1
# The commands above add the FloadtIP resource.
crm(live)# configure
crm(live)configure# primitive WebServer systemd:httpd    ## systemd is one of the classes listed by the classes command
crm(live)configure# verify
crm(live)configure# commit
# The commands above add the WebServer resource.
crm(live)# configure
crm(live)configure# primitive FreeSwitch systemd:freeswitch
crm(live)configure# verify
crm(live)configure# commit
# The commands above add the FreeSwitch resource (see the previous article for running FreeSwitch as a service).
crm(live)configure# help group
crm(live)configure# group HAService FloadtIP WebServer FreeSwitch    ## the members are ordered; WebServer and FreeSwitch run wherever the IP is
crm(live)configure# verify
crm(live)configure# commit
# The commands above bind the IP and the services into one group resource.
crm(live)# node standby    ## puts the current node into standby, which triggers a switchover; equivalent to running crm node standby from the root shell
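After putting a node into standby it is useful to see where a resource ended up. The hypothetical filter `resource_node` below is a sketch that reads status text on standard input and assumes the resource line format shown in the status output above.

```shell
# Hypothetical filter: print the node on which a resource is reported as Started.
# Usage: crm status | resource_node <resource_id>
resource_node() {
    awk -v r="$1" '$1 == r && $(NF-1) == "Started" { print $NF }'
}
```

For example, `crm status | resource_node FloadtIP` would print the node currently holding the VIP; comparing the value before and after `crm node standby` shows whether the switchover happened.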
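4.5 Enabling start at boot
At this point neither the pcsd service nor the cluster starts at boot.
# systemctl enable pcsd        # start pcsd at boot; the cluster can still be used without it
# pcs cluster enable --all     # start the cluster at boot on all nodes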
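Whether the daemons are enabled at boot is visible in the "Daemon Status" section of `pcs status`. The hypothetical filter `not_enabled_daemons` below is a sketch that assumes lines of the form `corosync: active/disabled`, as in the output shown earlier.

```shell
# Hypothetical filter: from `pcs status` output on stdin, print the daemons
# whose "Daemon Status" line ends in "/disabled" (not enabled at boot).
not_enabled_daemons() {
    sed -n 's/^[[:space:]]*\([a-z]*\): [a-z]*\/disabled[[:space:]]*$/\1/p'
}
```

Running `pcs status | not_enabled_daemons` after the two commands above should print nothing.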
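4.6 Defining monitored resources
The resources are not monitored at this point. For example, if you stop httpd directly from the root shell and then check crm status, the resource is still shown as Started. To fix this, redefine the resources with monitoring.
Monitoring has to be specified when the resource is defined with the primitive command at the configure level, so first delete the resources defined earlier and then define them again.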
crm(live)# resource
crm(live)resource# show
 Resource Group: HAService
     FloadtIP (ocf::heartbeat:IPaddr): Started
     WebServer (systemd:httpd): Started
     FreeSwitch (systemd:freeswitch): Started
crm(live)resource# stop HAService    # stop all resources
crm(live)resource# show
 Resource Group: HAService
     FloadtIP (ocf::heartbeat:IPaddr): Started (disabled)
     WebServer (systemd:httpd): Stopping (disabled)
     FreeSwitch (systemd:freeswitch): Stopped (disabled)
crm(live)configure# edit    # edit the resource definitions and delete the 3 resources and the group defined earlier
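Redefine the resources with monitoring: check every 60 seconds with a 20-second timeout. The timeout should not be shorter than the advised value; otherwise verify complains about it.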
crm(live)# configure
crm(live)configure# primitive FloadtIP ocf:IPaddr params ip=192.168.2.10 op monitor timeout=20s interval=60s
crm(live)configure# primitive WebServer systemd:httpd op monitor timeout=20s interval=60s
crm(live)configure# primitive FreeSwitch systemd:freeswitch op monitor timeout=20s interval=60s
crm(live)configure# group HAService FloadtIP WebServer FreeSwitch
crm(live)configure# property no-quorum-policy=ignore    # ignore loss of quorum; best avoided when the number of nodes is odd
crm(live)configure# verify
WARNING: FreeSwitch: specified timeout 20s for monitor is smaller than the advised 100
WARNING: WebServer: specified timeout 20s for monitor is smaller than the advised 100
crm(live)configure# commit
crm(live)configure# cd
crm(live)# status
Stack: corosync
Current DC: freeswitch-node1 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
Last updated: Wed Jun 24 11:04:50 2020
Last change: Wed Jun 24 11:04:41 2020 by root via cibadmin on freeswitch-node1
2 nodes configured
3 resources configured
Online: [ freeswitch-node1 freeswitch-node2 ]
Full list of resources:
 Resource Group: HAService
     FloadtIP (ocf::heartbeat:IPaddr): Started freeswitch-node1
     WebServer (systemd:httpd): Started freeswitch-node1
     FreeSwitch (systemd:freeswitch): Started freeswitch-node1
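When several monitored resources are defined the same way, the primitive lines can be generated rather than typed. The sketch below uses a hypothetical helper, `monitored_primitive`, that only prints text in the syntax used above; it does not talk to the cluster.

```shell
# Hypothetical helper: print a crmsh "primitive" line with a monitor operation.
# Usage: monitored_primitive <id> <type> <interval> <timeout> [param=value...]
monitored_primitive() {
    id=$1
    type=$2
    interval=$3
    timeout=$4
    shift 4
    if [ "$#" -gt 0 ]; then
        printf 'primitive %s %s params %s op monitor interval=%s timeout=%s\n' \
            "$id" "$type" "$*" "$interval" "$timeout"
    else
        printf 'primitive %s %s op monitor interval=%s timeout=%s\n' \
            "$id" "$type" "$interval" "$timeout"
    fi
}
```

For example, `monitored_primitive WebServer systemd:httpd 60s 100s` prints a definition whose timeout matches the advised value of 100 from the warnings above. The printed lines can be pasted at the `crm(live)configure#` prompt.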
To test, stop the service; after a short while it is started again automatically.
[root@freeswitch-node1 ~]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens192: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:50:56:a6:29:82 brd ff:ff:ff:ff:ff:ff
    inet 192.168.2.11/16 brd 192.168.255.255 scope global noprefixroute ens192
       valid_lft forever preferred_lft forever
    inet 192.168.2.10/16 brd 192.168.255.255 scope global secondary ens192
       valid_lft forever preferred_lft forever
    inet6 fe80::250:56ff:fea6:2982/64 scope link
       valid_lft forever preferred_lft forever
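The presence of the VIP on a node can be checked without reading the listing by eye. `has_address` is a hypothetical filter sketched for this article; it assumes the `inet <address>/<prefix>` line format shown above.

```shell
# Hypothetical filter: succeed if the given IPv4 address appears as an
# "inet" entry in `ip addr` output read from stdin.
has_address() {
    awk -v ip="$1" '
        $1 == "inet" { split($2, a, "/"); if (a[1] == ip) found = 1 }
        END { exit (found ? 0 : 1) }'
}
```

For example, `ip addr | has_address 192.168.2.10 && echo "VIP is here"` reports whether the local node currently holds the floating IP.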
[root@freeswitch-node1 ~]# systemctl stop httpd
crm(live)# status
Stack: corosync
Current DC: freeswitch-node1 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
Last updated: Wed Jun 24 11:19:13 2020
Last change: Wed Jun 24 11:15:22 2020 by root via crm_attribute on freeswitch-node2
2 nodes configured
3 resources configured
Node freeswitch-node2: standby
Online: [ freeswitch-node1 ]
Full list of resources:
 Resource Group: HAService
     FloadtIP (ocf::heartbeat:IPaddr): Started freeswitch-node1
     WebServer (systemd:httpd): Started freeswitch-node1
     FreeSwitch (systemd:freeswitch): Started freeswitch-node1
Failed Resource Actions:
* WebServer_monitor_60000 on freeswitch-node1 'not running' (7): call=50, status=complete, exitreason='',
    last-rc-change='Wed Jun 24 11:08:41 2020', queued=0ms, exec=0ms
* FreeSwitch_monitor_60000 on freeswitch-node1 'not running' (7): call=60, status=complete, exitreason='',
    last-rc-change='Wed Jun 24 11:10:50 2020', queued=0ms, exec=0ms
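The "Failed Resource Actions" section can be summarized with a small filter. `failed_monitors` is a hypothetical helper; it assumes the `* <resource>_monitor_<interval> on <node> ...` line format shown in the status output above.

```shell
# Hypothetical filter: print "<resource> <node>" for every failed monitor
# action listed in status output read from stdin.
failed_monitors() {
    sed -n 's/^[[:space:]]*\* \([^ ]*\)_monitor_[0-9]* on \([^ ]*\) .*/\1 \2/p'
}
```

For example, `crm status | failed_monitors` lists the resources whose failure records are still present and therefore still need to be cleared.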
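4.7 Clearing resource error records
[Note] After restoring the httpd service, remember to clear the resource's error records; otherwise the resource cannot be started.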
crm(live)# resource
crm(live)resource# cleanup HAService
Cleaned up FloadtIP on freeswitch-node2
Cleaned up FloadtIP on freeswitch-node1
Cleaned up WebServer on freeswitch-node2
Cleaned up WebServer on freeswitch-node1
Cleaned up FreeSwitch on freeswitch-node2
.Cleaned up FreeSwitch on freeswitch-node1
Waiting for 1 reply from the CRMd. OK
crm(live)resource# show
 Resource Group: HAService
     FloadtIP (ocf::heartbeat:IPaddr): Started
     WebServer (systemd:httpd): Started
     FreeSwitch (systemd:freeswitch): Started
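5. Summary
1. After restoring a resource's service, always clear the resource's error records; otherwise the resource cannot be started.
2. When building high availability with corosync+pacemaker on two nodes, set the global properties to disable STONITH and to ignore the quorum requirement (more than half of the votes).
3. Watch for the effect of SELinux and iptables (firewalld.service) on the services.
4. Make sure the nodes resolve each other through /etc/hosts.
5. Node clocks must be kept in sync.
6. Set up passwordless (key-based) SSH between the nodes.
7. With 2 nodes, or any even number of nodes, resources may fail to move because quorum is lost. There are four ways to address this:
7.1 Add a ping node.
7.2 Add a quorum disk.
7.3 Make the number of nodes in the cluster odd.
7.4 Ignore the loss of quorum: property no-quorum-policy=ignore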