MariaDB Galera: troubleshooting new nodes that cannot join the cluster

This MariaDB database had been in use for some time. Recently, for HA, I needed to turn it into a MariaDB Galera cluster.

Setup: 3 nodes. node1 is the existing database node; node2 and node3 are new.

OS: CentOS Stream release 8

MariaDB version: mariadb-server-10.3.28

1. Database installation


Install the database packages on the new nodes

yum install mariadb mariadb-server python3-PyMySQL -y

Install the Galera packages on all nodes

yum install galera mariadb-server-galera -y

Initialize the database on node2 and node3

systemctl enable mariadb.service
systemctl start mariadb.service
mysql_secure_installation
Set the root password
Remove anonymous users? [Y/n] y
Disallow root login remotely? [Y/n] n
Remove test database and access to it? [Y/n] y
Reload privilege tables now? [Y/n] y 

 

2. Galera configuration

The configuration for node1 is shown below. On the other nodes, change the wsrep_node_name and wsrep_node_address parameters.

# cat /etc/my.cnf.d/galera.cnf 
[mysqld]
bind-address=0.0.0.0
binlog_format=ROW
default-storage-engine=innodb
innodb_autoinc_lock_mode=2
innodb_buffer_pool_size=122M
wsrep_auto_increment_control=1
wsrep_causal_reads=0
wsrep_certify_nonPK=1
wsrep_cluster_name="my_wsrep_cluster"
wsrep_node_name=node1
wsrep_node_address=192.168.1.1
wsrep_cluster_address="gcomm://192.168.1.1,192.168.1.2,192.168.1.3"
wsrep_convert_LOCK_to_trx=0
wsrep_debug=0
wsrep_drupal_282555_workaround=0
wsrep_max_ws_rows=0
wsrep_max_ws_size=2147483647
wsrep_notify_cmd=
wsrep_on=ON
wsrep_provider=/usr/lib64/galera/libgalera_smm.so
wsrep_provider_options="gcache.size=300M; gcache.page_size=300M"
wsrep_retry_autocommit=1
wsrep_slave_threads=1
wsrep_sst_method=rsync
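
As a minimal sketch of the per-node differences, the helper below prints the two lines that would change on each node. It assumes the per-node settings are wsrep_node_name and wsrep_node_address; the function name node_overrides and the node2 address 192.168.1.2 are illustrative assumptions, not values taken from the original setup.

```shell
# Hypothetical helper: print the per-node lines for galera.cnf.
# $1 = node name, $2 = node IP address (both supplied by the caller).
node_overrides() {
    printf 'wsrep_node_name=%s\n' "$1"
    printf 'wsrep_node_address=%s\n' "$2"
}

# Example for node2; the address is an assumed value.
node_overrides node2 192.168.1.2
```

All other settings in the file stay identical across the nodes.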

3. Starting the services

Run on node1

galera_new_cluster

If the mariadb service is already running, stop it before running galera_new_cluster

Run on node2 and node3

systemctl restart mariadb.service 

Under normal circumstances, that should be all it takes
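
One way to confirm that all nodes have joined is to read the wsrep status variables. The sketch below is a small filter, not part of the original procedure: the function name wsrep_status_value is hypothetical, and it assumes tab-separated "name value" input such as the mysql client produces with -N -B.

```shell
# Hypothetical helper: print the value of one status variable.
# Reads tab-separated "name<TAB>value" lines on stdin; $1 is the variable name.
wsrep_status_value() {
    awk -F '\t' -v name="$1" '$1 == name { print $2 }'
}

# Intended usage on a running node (assumes the mysql client can log in):
#   mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'wsrep_%'" \
#       | wsrep_status_value wsrep_cluster_size
```

With three healthy nodes, wsrep_cluster_size should report 3.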

4. The problem

After completing the steps above, I found that node2 and node3 would not start

# systemctl restart mariadb
Job for mariadb.service failed because a fatal signal was delivered to the control process.
See "systemctl status mariadb.service" and "journalctl -xe" for details. 

Reinstalling, adjusting the configuration, nothing I tried made any difference

Errors in the log

# grep -Ei "err|war" /var/log/mariadb/mariadb.log
WSREP_SST: [ERROR] Parent mysqld process (PID:379774) terminated unexpectedly. (20221009 11:01:54.254)
2022-10-09 11:01:59 0 [Warning] WSREP: access file(/var/lib/mysql//gvwstate.dat) failed(No such file or directory)
2022-10-09 11:02:00 1 [Warning] WSREP: Gap in state sequence. Need state transfer.
2022-10-09 11:02:00 1 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (ab886bd7-46d6-11ed-8a83-fe4004c311ab): 1 (Operation not permitted)
2022-10-09 11:02:01 0 [Warning] WSREP: 0.0 (node-1): State transfer to 1.0 (node-2) failed: -255 (Unknown error 255)
2022-10-09 11:02:01 0 [ERROR] WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():780: Will never receive state. Need to abort.

I searched Google and Baidu exhaustively and still could not solve it. What I tried:

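a. Deleting the galera.cache, grastate.dat and gvwstate.dat files (no effect). I even removed all Galera-related configuration and files and recreated or reinstalled them; still no luck.

b. Changing TimeoutSec in mariadb.service (no effect).

c. Changing the order of the addresses in wsrep_cluster_address (no effect). This one never looked very plausible; I was grasping at straws.

d. Firewall, SELinux and so on (no effect).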

I also tried a few more exotic suggestions; none of them helped at all

Eventually, by chance, I noticed an rsync error in /var/log/messages

rsyncd[380389]: rsyncd version 3.1.3 starting, listening on port 4444
rsyncd[380409]: connect from node1 (192.168.0.1)
rsyncd[380409]: rsync to rsync_sst/ from node1  (192.168.0.1)
rsyncd[380409]: rsync: on remote machine: --sparse-block=1024: unknown option
rsyncd[380409]: rsync error: requested action not supported (code 4) at clientserver.c(971) [Receiver=3.1.3]
rsyncd[380389]: sent 0 bytes  received 0 bytes  total size 0
rsyncd[380605]: rsyncd version 3.1.3 starting, listening on port 4444 
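
The decisive line above is the "unknown option" rejection reported by rsyncd on the receiving side. As a rough sketch, a filter like the following pulls out any rejected options from a syslog file; the function name find_rejected_rsync_options is hypothetical, and the pattern is based only on the message format seen in this log.

```shell
# Hypothetical helper: list rsync options that the receiving rsyncd rejected.
# Reads the files given as arguments, or stdin when called without arguments.
find_rejected_rsync_options() {
    grep -E 'on remote machine: .*: unknown option' "$@" \
        | sed -E 's/.*on remote machine: (.*): unknown option.*/\1/' \
        | sort -u
}

# Intended usage on the joining node:
#   find_rejected_rsync_options /var/log/messages
```

Checking the system log this way matters here because the rsync error never appeared in mariadb.log.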

Taken together with the rsync-related entries in mariadb.log

2022-10-09  3:27:39 2 [Warning] WSREP: Gap in state sequence. Need state transfer.
2022-10-09  3:27:39 0 [Note] WSREP: Running: 'wsrep_sst_rsync --role 'joiner' --address '192.168.0.2' --datadir '/var/lib/mysql/' --parent '3693645' --mysqld-args --basedir=/usr'
2022-10-09  3:27:40 2 [Note] WSREP: Prepared SST request: rsync|192.168.0.2:4444/rsync_sst 
2022-10-09  3:27:40 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2022-10-09  3:27:40 2 [Note] WSREP: Assign initial position for certification: 237433, protocol version: 4
2022-10-09  3:27:40 0 [Note] WSREP: Service thread queue flushed.
2022-10-09  3:27:40 2 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (ab886bd7-46d6-11ed-8a83-fe4004c311ab): 1 (Operation not permitted) at galera/src/replicator_str.cpp:prepare_for_IST():467. IST will be unavailable.
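
The "Prepared SST request" entry shows which method, address and port the joiner offers for the state transfer, which is what links mariadb.log to the rsyncd messages on port 4444. A small sketch for extracting those fields follows; the function name sst_request_endpoint is hypothetical, and the pattern assumes the method|host:port/module layout seen in this log.

```shell
# Hypothetical helper: print "method host port" from "Prepared SST request" lines.
# Reads log lines on stdin; lines that do not match are ignored.
sst_request_endpoint() {
    sed -nE 's#.*Prepared SST request: ([^|]+)[|]([^:]+):([0-9]+)/.*#\1 \2 \3#p'
}

# Intended usage on the joining node:
#   sst_request_endpoint < /var/log/mariadb/mariadb.log
```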

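I suspected rsync was the culprit: the installed version was probably too old to recognize the --sparse-block=1024 option, so the state transfer failed and mariadb could not start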

So I went ahead and updated rsync

# yum update rsync 

Start mariadb again

# systemctl restart mariadb

It actually started. I was nearly in tears.

Old version: rsync-3.1.3-14.el8.2.x86_64

New version: rsync-3.1.3-19.el8.x86_64

# rpm -qa |grep rsync
rsync-3.1.3-14.el8.2.x86_64
# rsync --help |grep sparse
 -S, --sparse                turn sequences of nulls into sparse blocks


# rpm -qa |grep rsync
rsync-3.1.3-19.el8.x86_64
# rsync --help |grep sparse
 -S, --sparse                turn sequences of nulls into sparse blocks
     --sparse-block=SIZE     set block size used to handle sparse files
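
The comparison above can be turned into a quick pre-flight check before adding a node. This is only a sketch under the assumption that the presence of --sparse-block in the help text is a sufficient test; the function name supports_sparse_block is hypothetical.

```shell
# Hypothetical helper: succeeds if the rsync help text on stdin lists --sparse-block.
supports_sparse_block() {
    grep -q -e '--sparse-block'
}

# Only query the local rsync if it is installed.
if command -v rsync >/dev/null 2>&1; then
    if rsync --help 2>&1 | supports_sparse_block; then
        echo "rsync supports --sparse-block"
    else
        echo "rsync lacks --sparse-block; consider updating the rsync package"
    fi
fi
```

Running it on every node before starting the cluster would have exposed the version mismatch up front.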

  

 

posted @ 2022-10-09 13:35  苦逼挨踢男