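Redis Cluster Failure Recovery

Overview

This document analyzes how Redis handles failures. Under normal conditions a Redis cluster performs master/slave failover automatically, provided that more than half of the masters are online.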

https://github.com/tair-opensource/RedisShake/wiki

http://www.redis.cn/commands/cluster-failover.html
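The "more than half of the masters online" condition from the overview can be checked from the output of cluster nodes. The following is a minimal sketch, not part of the original procedure: the function name masters_quorum is hypothetical, and it counts every node flagged master, including masters that hold no slots, so treat the result as an approximation.

```shell
#!/bin/sh
# Reads `cluster nodes` output on stdin and prints "<up>/<total> quorum" or
# "<up>/<total> no-quorum", as seen by the node that answered the command.
# Assumption: every node flagged "master" is counted, even if it owns no slots.
masters_quorum() {
    awk '
        $3 ~ /master/ {
            total++
            # "fail" (confirmed) and "fail?" (suspected) both count as down here
            if ($3 ~ /fail/) down++
        }
        END {
            up = total - down
            status = (up * 2 > total) ? "quorum" : "no-quorum"
            printf "%d/%d %s\n", up, total, status
        }'
}
```

A possible invocation, using the node address that appears later in this document: redis-cli -h 127.0.0.1 -p 3380 cluster nodes | masters_quorum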

Environment:

centos7.x

redis-5.x

redis-cluster(3master/3slave)

redis-cluster commands

127.0.0.1:3380> cluster help

1) CLUSTER <subcommand> arg arg ... arg. Subcommands are:

2) ADDSLOTS <slot> [slot ...] -- Assign slots to current node.

3) BUMPEPOCH -- Advance the cluster config epoch.

4) COUNT-failure-reports <node-id> -- Return number of failure reports for <node-id>.

5) COUNTKEYSINSLOT <slot> - Return the number of keys in <slot>.

6) DELSLOTS <slot> [slot ...] -- Delete slots information from current node.

7) FAILOVER [force|takeover] -- Promote current replica node to being a master

    See http://www.redis.cn/commands/cluster-failover.html

8) FORGET <node-id> -- Remove a node from the cluster.

9) GETKEYSINSLOT <slot> <count> -- Return key names stored by current node in a slot.

10) FLUSHSLOTS -- Delete current node own slots information.

11) INFO - Return onformation about the cluster.

12) KEYSLOT <key> -- Return the hash slot for <key>.

13) MEET <ip> <port> [bus-port] -- Connect nodes into a working cluster.

14) MYID -- Return the node id.

15) NODES -- Return cluster configuration seen by node. Output format:

16) <id> <ip:port> <flags> <master> <pings> <pongs> <epoch> <link> <slot> ... <slot>

17) REPLICATE <node-id> -- Configure current node as replica to <node-id>.

18) RESET [hard|soft] -- Reset current node (default: soft).

19) SET-config-epoch <epoch> - Set config epoch of current node.

20) SETSLOT <slot> (importing|migrating|stable|node <node-id>) -- Set slot state.

21) REPLICAS <node-id> -- Return <node-id> replicas.

22) SLOTS -- Return information about slots range mappings. Each range is made of:

23) start, end, master and replicas IP addresses, ports and ids

127.0.0.1:3380>

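Scenarios

Two masters on the same virtual machine

Check the node information: cluster nodes

Use redis-cli to log in to the slave of one of the two masters that share a machine, then perform a manual failover: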

cluster failover force

Remove the node ID of the failed Redis node:

cluster forget xxxx

Check the state of the Redis cluster: cluster nodes / cluster info
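Spotting the "two masters on one machine" situation by eye in cluster nodes output is error-prone, so the check can be sketched as a small filter. This is an illustration only: the function name colocated_masters is hypothetical, and it assumes IPv4 addresses in the ip:port@cport form printed by Redis 5.

```shell
#!/bin/sh
# Reads `cluster nodes` output on stdin and prints "<ip> <count>" for every
# host that currently runs more than one healthy master.
# Assumption: node addresses are IPv4 (the IP is the text before the first ":").
colocated_masters() {
    awk '
        $3 ~ /master/ && $3 !~ /fail/ {
            split($2, addr, ":")      # addr[1] = IP
            count[addr[1]]++
        }
        END {
            for (ip in count)
                if (count[ip] > 1) print ip, count[ip]
        }' | sort
}
```

For each host it reports, the manual failover described above is run on the slave of one of those masters. When the master is still reachable, a plain cluster failover (without force) performs a coordinated handoff; force skips the handshake with the master.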

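Fewer than half of the masters online

Redis with 3 masters and 3 slaves

Data directory intact (no files lost on the Redis host)

Log in to the affected node's server and start the Redis service, then watch whether the cluster returns to normal: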

cluster nodes

cluster info
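Watching for recovery can be automated by polling cluster info until cluster_state reports ok. A minimal sketch under stated assumptions: the function names, the 2-second interval and the default of 30 attempts are all hypothetical choices, and no password is passed (add -a if the cluster requires one).

```shell
#!/bin/sh
# Extracts the cluster_state value ("ok" or "fail") from `cluster info`
# output on stdin. redis-cli emits CRLF line endings, so "\r" is stripped.
cluster_state() {
    tr -d '\r' | awk -F: '$1 == "cluster_state" { print $2 }'
}

# Polls one node until the cluster reports ok; returns 1 if it never does.
# Usage: wait_cluster_ok <host> <port> [attempts]
wait_cluster_ok() {
    host=$1 port=$2 attempts=${3:-30}
    while [ "$attempts" -gt 0 ]; do
        state=$(redis-cli -h "$host" -p "$port" cluster info | cluster_state)
        [ "$state" = "ok" ] && return 0
        attempts=$((attempts - 1))
        sleep 2
    done
    return 1
}
```

For example, wait_cluster_ok 127.0.0.1 3380 returns success once the node at that address sees all slots covered again.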

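Data directory lost

01 Manually promote a slave in the current Redis cluster to master so that the cluster returns to normal

Check the master/slave relationships of the Redis nodes that are still online

Manually promote slave -> master so that the cluster returns to normal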
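To decide which slaves to promote, the master/slave relationships can be read out of cluster nodes. The sketch below lists every reachable slave whose master is confirmed failed; the function name replicas_of_failed_masters is hypothetical. The original notes do not say which failover option is used at this step. Per the CLUSTER FAILOVER documentation, force still needs authorization from a majority of masters, while takeover does not, so takeover is the option that fits the "fewer than half of the masters online" case.

```shell
#!/bin/sh
# Reads `cluster nodes` output on stdin and prints
# "<slave-ip:port> <failed-master-id>" for every reachable slave whose
# master carries the confirmed "fail" flag ("fail?" is not matched).
replicas_of_failed_masters() {
    awk '
        { sub(/@.*/, "", $2) }        # strip the @cluster-bus-port suffix
        $3 ~ /master/ && $3 ~ /(^|,)fail(,|$)/ { failed[$1] = 1 }
        $3 ~ /slave/  && $3 !~ /fail/ { addr[$1] = $2; master[$1] = $4 }
        END {
            for (id in addr)
                if (master[id] in failed) print addr[id], master[id]
        }' | sort
}

# Possible follow-up on each address printed (one slave per failed master):
#   redis-cli -h <ip> -p <port> cluster failover takeover
```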

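Note: at this point the Redis cluster is healthy again, so troubleshooting of the business failure can continue (assign another colleague to handle rejoining the failed Redis node to the cluster).

02 Reinitialize the Redis installation, then join the node to the cluster as a slave

01 Initialize with the script, or copy the Redis scripts and configuration from a node in the current cluster (and correct them), then start Redis

Initialize Redis with the script (comment out the cluster slot initialization)

Copy from another node and correct the ip/port

Start Redis with the script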
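The "copy from another node and correct the ip/port" step can be sketched with sed. This is an assumption-laden illustration: the function name retarget_conf and the file paths are hypothetical, and only the port and bind directives are rewritten. Other port-specific settings (pidfile, logfile, dir) still need a manual review, and the cluster-config-file (nodes.conf) of the source node should not be copied, because it stores that node's ID.

```shell
#!/bin/sh
# Copies a redis.conf and rewrites its "port" and "bind" directives.
# Usage: retarget_conf <src.conf> <dst.conf> <new-ip> <new-port>
# Assumption: each directive appears at the start of a line, uncommented.
retarget_conf() {
    src=$1 dst=$2 ip=$3 port=$4
    sed -e "s/^port[[:space:]].*/port $port/" \
        -e "s/^bind[[:space:]].*/bind $ip/" \
        "$src" > "$dst"
}

# Example with hypothetical paths:
#   retarget_conf /tmp/redis-6379.conf ./conf/redis-3001.conf 172.24.20.30 3001
```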

02 Join the Redis node to the cluster

Add the new slave: run the join command using the Redis node information

./bin/redis-cli -h 172.24.20.31 -p 6379 -a xx

cluster nodes // view the node mapping information

./bin/redis-cli -h 172.24.20.31 -p 6379 -a xxx --cluster add-node 172.24.20.30:3001 172.24.20.31:6379 --cluster-slave

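Notes:

172.24.20.30:3001 the Redis node to be added to the cluster

172.24.20.31:6379 any node that is currently online in the cluster

--cluster-slave join the cluster as a slave; a master to attach to is chosen automatically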
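With --cluster-slave alone, redis-cli picks the master itself. If the new slave should attach to a specific master, for example one that currently has no slave, redis-cli also accepts --cluster-master-id. The sketch below finds candidate masters from cluster nodes output; the function name masters_without_replica is hypothetical.

```shell
#!/bin/sh
# Reads `cluster nodes` output on stdin and prints the IDs of healthy
# masters that have no healthy slave attached.
masters_without_replica() {
    awk '
        $3 ~ /master/ && $3 !~ /fail/ { masters[$1] = 1 }
        $3 ~ /slave/  && $3 !~ /fail/ { covered[$4] = 1 }
        END {
            for (id in masters)
                if (!(id in covered)) print id
        }' | sort
}

# Possible use with the addresses from this document (password elided as in the original):
#   master_id=$(./bin/redis-cli -h 172.24.20.31 -p 6379 -a xxx cluster nodes \
#               | masters_without_replica | head -n 1)
#   ./bin/redis-cli -a xxx --cluster add-node 172.24.20.30:3001 172.24.20.31:6379 \
#       --cluster-slave --cluster-master-id "$master_id"
```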

172.24.20.202:3380> cluster forget be91cd62eec29df6a95da23f59d01bb92bbd4656

Notes:

be91cd62eec29df6a95da23f59d01bb92bbd4656 node ID of the Redis node to remove (master fail)

172.24.20.202:3380> cluster forget 6901e19a1924395bc9c3190992e1f25bbfc51577

Check again that the cluster has returned to normal.
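CLUSTER FORGET only affects the node that receives it and places the forgotten ID on a ban list for 60 seconds, so the command has to reach every remaining node within that window or gossip will reintroduce the failed node. A minimal sketch of building the target list; the function name live_node_addrs is hypothetical and IPv4 addresses are assumed.

```shell
#!/bin/sh
# Reads `cluster nodes` output on stdin and prints "<ip> <port>" for every
# node that is not flagged fail or noaddr, i.e. each node that should
# receive the CLUSTER FORGET command.
live_node_addrs() {
    awk '$3 !~ /fail/ && $3 !~ /noaddr/ {
        sub(/@.*/, "", $2)            # strip the @cluster-bus-port suffix
        split($2, a, ":")             # a[1] = IP, a[2] = port (IPv4 assumed)
        print a[1], a[2]
    }'
}

# Possible use with the node ID from this document:
#   redis-cli -h 172.24.20.202 -p 3380 cluster nodes | live_node_addrs |
#   while read -r ip port; do
#       redis-cli -h "$ip" -p "$port" \
#           cluster forget be91cd62eec29df6a95da23f59d01bb92bbd4656 < /dev/null
#   done
```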