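Troubleshooting: MariaDB Galera cluster cannot add new nodes
Our MariaDB database had been in use for some time. Recently, for HA, we needed to turn it into a MariaDB Galera cluster.
Setup: 3 nodes. node1 is the original database node; node2 and node3 are new.
OS: CentOS Stream release 8
MariaDB version: mariadb-server-10.3.28
1. Database installation
Install the database packages on the new nodes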
yum install mariadb mariadb-server python3-PyMySQL -y
Install the Galera packages on all nodes
yum install galera mariadb-server-galera -y
Initialize the database on node2 and node3
systemctl enable mariadb.service
systemctl start mariadb.service
mysql_secure_installation
Set the root password when prompted, then answer the remaining questions as follows:
Remove anonymous users? [Y/n] y
Disallow root login remotely? [Y/n] n
Remove test database and access to it? [Y/n] y
Reload privilege tables now? [Y/n] y
2. Galera configuration
node1 is configured as follows; on the other nodes, change the wsrep_node_name and wsrep_node_address parameters
# cat /etc/my.cnf.d/galera.cnf
[mysqld]
bind-address=0.0.0.0
binlog_format=ROW
default-storage-engine=innodb
innodb_autoinc_lock_mode=2
innodb_buffer_pool_size=122M
wsrep_auto_increment_control=1
wsrep_causal_reads=0
wsrep_certify_nonPK=1
wsrep_cluster_name="my_wsrep_cluster"
wsrep_node_name=node1
wsrep_node_address=192.168.1.1
wsrep_cluster_address="gcomm://192.168.1.1,192.168.1.2,192.168.1.3"
wsrep_convert_LOCK_to_trx=0
wsrep_debug=0
wsrep_drupal_282555_workaround=0
wsrep_max_ws_rows=0
wsrep_max_ws_size=2147483647
wsrep_notify_cmd=
wsrep_on=ON
wsrep_provider=/usr/lib64/galera/libgalera_smm.so
wsrep_provider_options="gcache.size=300M; gcache.page_size=300M"
wsrep_retry_autocommit=1
wsrep_slave_threads=1
wsrep_sst_method=rsync
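Because each node's file is a copy of node1's with only the node-specific parameters changed, a copy-paste slip such as setting the same key twice is easy to miss. A minimal sketch of a check for that; the helper name find_duplicate_keys is hypothetical, and it only understands simple key=value lines:

```shell
# Print every key that is assigned more than once in an ini-style file.
# $1: path to the config file, e.g. /etc/my.cnf.d/galera.cnf
find_duplicate_keys() {
    awk -F= '
        /^[[:space:]]*(#|;|\[)/ { next }      # skip comments and [section] headers
        NF >= 2 {
            key = $1
            gsub(/[[:space:]]/, "", key)      # tolerate spaces around the key
            if (++seen[key] == 2) print key   # report each duplicate once
        }
    ' "$1"
}

# Usage: find_duplicate_keys /etc/my.cnf.d/galera.cnf
```

No output means no key is repeated in that file.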
3. Starting the services
On node1, run
galera_new_cluster
If the mariadb service is already running on node1, stop it first.
On node2 and node3, run
systemctl restart mariadb.service
Under normal circumstances, that should be all it takes.
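One way to confirm that the nodes really joined is to read the wsrep_cluster_size status variable, which should report 3 for this setup. A small sketch, assuming the mysql client is available and root can log in; the helper name cluster_size is hypothetical:

```shell
# Read tab-separated "name<TAB>value" rows on stdin and print the
# value of wsrep_cluster_size.
cluster_size() {
    awk -F'\t' '$1 == "wsrep_cluster_size" { print $2 }'
}

# Usage (on any node; -N drops the header row, -B gives tab-separated output):
#   mysql -u root -p -N -B -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size'" | cluster_size
```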
4. The problem
After completing the steps above, I found that node2 and node3 would not start
# systemctl restart mariadb
Job for mariadb.service failed because a fatal signal was delivered to the control process.
See "systemctl status mariadb.service" and "journalctl -xe" for details.
Reinstalling, adjusting the configuration: nothing I tried made any difference.
Errors in the log
# grep -Ei "err|war" /var/log/mariadb/mariadb.log
WSREP_SST: [ERROR] Parent mysqld process (PID:379774) terminated unexpectedly. (20221009 11:01:54.254)
2022-10-09 11:01:59 0 [Warning] WSREP: access file(/var/lib/mysql//gvwstate.dat) failed(No such file or directory)
2022-10-09 11:02:00 1 [Warning] WSREP: Gap in state sequence. Need state transfer.
2022-10-09 11:02:00 1 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (ab886bd7-46d6-11ed-8a83-fe4004c311ab): 1 (Operation not permitted)
2022-10-09 11:02:01 0 [Warning] WSREP: 0.0 (node-1): State transfer to 1.0 (node-2) failed: -255 (Unknown error 255)
2022-10-09 11:02:01 0 [ERROR] WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():780: Will never receive state. Need to abort.
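Searching Google and Baidu turned up nothing that solved it. What I tried:
a. Deleting the galera.cache, grastate.dat, and gvwstate.dat files (no effect). I even removed all Galera-related configuration and files and recreated or reinstalled them; still no luck.
b. Changing TimeoutSec in mariadb.service (no effect).
c. Changing the order of the addresses in wsrep_cluster_address (no effect). This one never looked promising; it was a last-ditch attempt.
d. Firewall, SELinux, and so on (no effect).
There were a few other odd suggestions as well, none of which helped.
Then, by chance, I noticed an rsync error in /var/log/messages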
rsyncd[380389]: rsyncd version 3.1.3 starting, listening on port 4444
rsyncd[380409]: connect from node1 (192.168.0.1)
rsyncd[380409]: rsync to rsync_sst/ from node1 (192.168.0.1)
rsyncd[380409]: rsync: on remote machine: --sparse-block=1024: unknown option
rsyncd[380409]: rsync error: requested action not supported (code 4) at clientserver.c(971) [Receiver=3.1.3]
rsyncd[380389]: sent 0 bytes received 0 bytes total size 0
rsyncd[380605]: rsyncd version 3.1.3 starting, listening on port 4444
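These rsyncd lines never show up in mariadb.log, so it is worth searching the system log for them directly when an rsync SST fails. A rough sketch that keeps only the rsyncd error lines; the helper name rsyncd_errors is hypothetical, and the pattern is based solely on the log format shown above:

```shell
# Print rsyncd lines that report an rsync error or a rejected option.
# $1: path to a syslog-style file, e.g. /var/log/messages
rsyncd_errors() {
    grep -E 'rsyncd\[[0-9]+\]: rsync(:| error:)' "$1"
}

# Usage (on the joining node): rsyncd_errors /var/log/messages
```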
Together with the rsync-related entries in mariadb.log
2022-10-09 3:27:39 2 [Warning] WSREP: Gap in state sequence. Need state transfer.
2022-10-09 3:27:39 0 [Note] WSREP: Running: 'wsrep_sst_rsync --role 'joiner' --address '192.168.0.2' --datadir '/var/lib/mysql/' --parent '3693645' --mysqld-args --basedir=/usr'
2022-10-09 3:27:40 2 [Note] WSREP: Prepared SST request: rsync|192.168.0.2:4444/rsync_sst
2022-10-09 3:27:40 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2022-10-09 3:27:40 2 [Note] WSREP: Assign initial position for certification: 237433, protocol version: 4
2022-10-09 3:27:40 0 [Note] WSREP: Service thread queue flushed.
2022-10-09 3:27:40 2 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (ab886bd7-46d6-11ed-8a83-fe4004c311ab): 1 (Operation not permitted) at galera/src/replicator_str.cpp:prepare_for_IST():467. IST will be unavailable.
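The all-zero local state UUID is what a node that has never been part of the cluster reports, which is why IST is unavailable and a full SST over rsync is requested instead. A minimal sketch for reading that saved state from grastate.dat; the helper name grastate_field is hypothetical, and the path in the usage line assumes the datadir /var/lib/mysql/ shown in the log:

```shell
# Print one field (uuid, seqno, safe_to_bootstrap, ...) from a grastate.dat file.
# $1: path to grastate.dat, $2: field name
grastate_field() {
    awk -v f="$2" -F': *' '$1 == f { print $2 }' "$1"
}

# Usage: grastate_field /var/lib/mysql/grastate.dat uuid
```

If the file has been deleted, as in attempt (a), awk will simply report that it cannot open it.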
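I suspected rsync was the culprit: the installed build was probably too old to recognize the --sparse-block=1024 option, so the state transfer failed and mariadb could not start.
So I went ahead and upgraded rsync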
# yum update rsync
Then I started mariadb again
# systemctl restart mariadb
It actually started. I was nearly in tears.
Old version: rsync-3.1.3-14.el8.2.x86_64
New version: rsync-3.1.3-19.el8.x86_64
Before the upgrade:
# rpm -qa |grep rsync
rsync-3.1.3-14.el8.2.x86_64
# rsync --help |grep sparse
 -S, --sparse                turn sequences of nulls into sparse blocks
After the upgrade:
# rpm -qa |grep rsync
rsync-3.1.3-19.el8.x86_64
# rsync --help |grep sparse
 -S, --sparse                turn sequences of nulls into sparse blocks
     --sparse-block=SIZE     set block size used to handle sparse files
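Since both packages report the same upstream version 3.1.3, the version string alone does not tell you whether the option is available; checking the help text does. A small sketch that could be run on every node before adding it to the cluster; the helper name supports_sparse_block is hypothetical:

```shell
# Read `rsync --help` text on stdin; succeed (exit status 0) only if
# the --sparse-block option is listed.
supports_sparse_block() {
    grep -q -e '--sparse-block'
}

# Usage:
#   rsync --help | supports_sparse_block || echo "rsync on $(hostname) lacks --sparse-block"
```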