Redis Cluster Failure Recovery

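Overview

This post walks through how to handle failures in a Redis Cluster. Under normal conditions the cluster fails over from master to slave automatically, provided that more than half of the masters are still online.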
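
The majority condition above can be checked from `cluster nodes` output. The following is a minimal sketch, not part of the original procedure: `has_master_quorum` is a hypothetical helper name, and it assumes that only masters holding at least one slot count toward the majority and that the flags `fail` and `fail?` both mean "not available".

```shell
# Sketch: read `cluster nodes` output on stdin and report whether more than
# half of the slot-holding masters are still considered alive.
# has_master_quorum is a hypothetical helper, not a Redis command.
has_master_quorum() {
  awk '
    {
      # Field 3 is a comma-separated flag list, e.g. "myself,master" or "master,fail".
      n = split($3, flags, ",")
      is_master = 0; failed = 0
      for (i = 1; i <= n; i++) {
        if (flags[i] == "master") is_master = 1
        if (flags[i] == "fail" || flags[i] == "fail?") failed = 1
      }
      # Fields 9 and later are slot ranges; masters without slots are ignored.
      if (is_master && NF >= 9) {
        total++
        if (!failed) alive++
      }
    }
    END {
      printf "masters=%d alive=%d\n", total, alive
      if (alive * 2 > total) exit 0
      exit 1
    }'
}

# Usage (host, port and password are placeholders):
#   redis-cli -h <host> -p <port> -a <password> cluster nodes | has_master_quorum
```

The function exits with status 0 when a majority is present, so it can be used directly in an `if` statement before deciding between waiting for automatic failover and intervening manually.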

https://github.com/tair-opensource/RedisShake/wiki

http://www.redis.cn/commands/cluster-failover.html

Environment:

CentOS 7.x

Redis 5.x

Redis Cluster (3 masters / 3 slaves)

 

Redis Cluster commands

127.0.0.1:3380> cluster help

 1) CLUSTER <subcommand> arg arg ... arg. Subcommands are:

 2) ADDSLOTS <slot> [slot ...] -- Assign slots to current node.

 3) BUMPEPOCH -- Advance the cluster config epoch.

 4) COUNT-failure-reports <node-id> -- Return number of failure reports for <node-id>.

 5) COUNTKEYSINSLOT <slot> - Return the number of keys in <slot>.

 6) DELSLOTS <slot> [slot ...] -- Delete slots information from current node.

 7) FAILOVER [force|takeover] -- Promote current replica node to being a master

 8) FORGET <node-id> -- Remove a node from the cluster.

 9) GETKEYSINSLOT <slot> <count> -- Return key names stored by current node in a slot.

10) FLUSHSLOTS -- Delete current node own slots information.

11) INFO - Return onformation about the cluster.

12) KEYSLOT <key> -- Return the hash slot for <key>.

13) MEET <ip> <port> [bus-port] -- Connect nodes into a working cluster.

14) MYID -- Return the node id.

15) NODES -- Return cluster configuration seen by node. Output format:

16)     <id> <ip:port> <flags> <master> <pings> <pongs> <epoch> <link> <slot> ... <slot>

17) REPLICATE <node-id> -- Configure current node as replica to <node-id>.

18) RESET [hard|soft] -- Reset current node (default: soft).

19) SET-config-epoch <epoch> - Set config epoch of current node.

20) SETSLOT <slot> (importing|migrating|stable|node <node-id>) -- Set slot state.

21) REPLICAS <node-id> -- Return <node-id> replicas.

22) SLOTS -- Return information about slots range mappings. Each range is made of:

23)     start, end, master and replicas IP addresses, ports and ids

127.0.0.1:3380>

 

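Scenarios

Two masters on the same virtual machine

Check the node information: cluster nodes

Use redis-cli to log in to the slave of one of the two masters that share a machine, and run a manual failover:

cluster failover force

Remove the failed node's ID:

cluster forget xxxx

Check the cluster state: cluster nodes / cluster info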
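
To find the IDs to pass to cluster forget, the `cluster nodes` output can be filtered on the `fail` flag. This is a small sketch under assumptions: `failed_node_ids` is a hypothetical helper name, and it only reports nodes the cluster has confirmed as failed (`fail`), not nodes that are merely suspected (`fail?`).

```shell
# Sketch: print the node ID of every node flagged "fail" in `cluster nodes`
# output read from stdin. failed_node_ids is a hypothetical helper name.
failed_node_ids() {
  awk '
    {
      n = split($3, flags, ",")
      for (i = 1; i <= n; i++) {
        # Exact match, so "fail?" and "nofailover" are not treated as failed.
        if (flags[i] == "fail") { print $1; break }
      }
    }'
}

# Usage (host and port are placeholders):
#   redis-cli -h <host> -p <port> cluster nodes | failed_node_ids
```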

 

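Fewer than half of the masters online

Redis with 3 masters and 3 slaves

Data directory intact (no files lost on the Redis host)

Log in to the server that hosts the node, start the Redis service, then after a short while check whether the cluster has recovered:

cluster nodes

cluster info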
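
Checking whether the cluster has recovered can be scripted by polling the `cluster_state` field of `cluster info`. The sketch below is illustrative only: `cluster_state` and `wait_cluster_ok` are hypothetical helper names, the 2-second interval and 30 attempts are arbitrary choices, and the cluster is assumed to need no password (add `-a` to the redis-cli call otherwise).

```shell
# Sketch: extract the cluster_state value from `cluster info` output on stdin.
# The reply uses CRLF line endings, so carriage returns are stripped first.
cluster_state() {
  tr -d '\r' | awk -F: '$1 == "cluster_state" { print $2 }'
}

# Sketch: poll a node until it reports cluster_state:ok.
# $1 = host, $2 = port, $3 = maximum attempts (default 30, an arbitrary choice).
wait_cluster_ok() {
  tries=${3:-30}
  while [ "$tries" -gt 0 ]; do
    state=$(redis-cli -h "$1" -p "$2" cluster info | cluster_state)
    if [ "$state" = "ok" ]; then
      return 0
    fi
    tries=$((tries - 1))
    sleep 2
  done
  return 1
}

# Usage (host and port are placeholders):
#   wait_cluster_ok <host> <port> && echo "cluster recovered"
```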

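Data directory lost

01 Manually promote a slave in the current cluster to master so that the cluster recovers

Check the master/slave relationships of the Redis nodes that are still online

Manually promote slave -> master so the cluster recovers (with fewer than half of the masters online, cluster failover force cannot collect the majority vote it still requires, so cluster failover takeover is needed here)

Note: at this point the Redis cluster is healthy again, so you can go on to investigate the business-side failure (and have another colleague handle re-adding the dead Redis node to the cluster).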

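02 Reinstall and reinitialize Redis, then add it to the cluster as a slave

01 Initialize with the script, or copy the Redis scripts and configuration from a node in the current cluster (correcting them as needed), then start Redis

Initialize Redis with the script (comment out the cluster slot initialization)

Or copy from another node and correct the ip/port

Start Redis with the script

02 Add the Redis node to the cluster

Add the new slave, joining the cluster based on the existing node information: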

./bin/redis-cli -h 172.24.20.31 -p 6379 -a xx

cluster nodes  // view the node mapping information

./bin/redis-cli -h 172.24.20.31 -p 6379 -a xxx  --cluster add-node 172.24.20.30:3001 172.24.20.31:6379 --cluster-slave

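Notes:

172.24.20.30:3001   the Redis node to add to the cluster

172.24.20.31:6379   any node that is currently online in the cluster

--cluster-slave  join the cluster as a slave; a master to attach to is chosen automatically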
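
Before or after adding the slave, it can help to see how many healthy slaves each master currently has, for example to confirm that the new node ended up under the master that had none. The following is a rough sketch: `replica_counts` is a hypothetical helper name, and nodes flagged `fail` or `fail?` are assumed not to count.

```shell
# Sketch: for every healthy master in `cluster nodes` output (stdin), print
# "<healthy-slave-count> <master-id>", fewest slaves first.
# replica_counts is a hypothetical helper name.
replica_counts() {
  awk '
    {
      n = split($3, flags, ",")
      role = ""; failed = 0
      for (i = 1; i <= n; i++) {
        if (flags[i] == "master" || flags[i] == "slave") role = flags[i]
        if (flags[i] == "fail" || flags[i] == "fail?") failed = 1
      }
      if (failed) next
      if (role == "master") is_master[$1] = 1
      # For a slave, field 4 is the ID of the master it replicates.
      if (role == "slave") count[$4]++
    }
    END {
      for (id in is_master) print count[id] + 0, id
    }' | sort -n
}

# Usage (host and port are placeholders):
#   redis-cli -h <host> -p <port> cluster nodes | replica_counts
```

If a specific master should receive the new slave, redis-cli also accepts `--cluster-master-id <node-id>` together with `--cluster-slave` in the add-node call.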

 

172.24.20.202:3380> cluster forget be91cd62eec29df6a95da23f59d01bb92bbd4656

OK

Note:

be91cd62eec29df6a95da23f59d01bb92bbd4656  the ID of the Redis node being removed (master, fail)

172.24.20.202:3380> cluster forget 6901e19a1924395bc9c3190992e1f25bbfc51577

OK

Then check again that the cluster has recovered:

cluster info
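
One detail worth keeping in mind with cluster forget: the command only affects the node it is sent to, and that node bans the forgotten ID for 60 seconds, so the same command has to reach every remaining node within that window or gossip will bring the entry back. The sketch below prints one redis-cli command per connected node instead of running anything. `forget_commands` is a hypothetical helper name, and the sketch assumes IPv4 addresses and no password (append `-a` as needed).

```shell
# Sketch: given a node ID to forget ($1) and `cluster nodes` output on stdin,
# print one "cluster forget" command for every other connected node.
# forget_commands is a hypothetical helper; it only prints, it runs nothing.
forget_commands() {
  awk -v victim="$1" '
    $1 != victim && $8 == "connected" {
      # Field 2 looks like ip:port@cport; keep ip and port (IPv4 assumed).
      split($2, addr, "@")
      split(addr[1], hp, ":")
      printf "redis-cli -h %s -p %s cluster forget %s\n", hp[1], hp[2], victim
    }'
}

# Usage (host, port and node ID are placeholders); review the printed
# commands first, then pipe them to sh if they look right:
#   redis-cli -h <host> -p <port> cluster nodes | forget_commands <node-id>
```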

posted @ 2023-06-12 14:41  mvpbang