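Redis cluster failure recovery
Overview
This note analyzes how Redis handles failure scenarios. In normal operation a Redis cluster performs master/slave failover automatically, provided that more than half of the masters are online.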
https://github.com/tair-opensource/RedisShake/wiki
http://www.redis.cn/commands/cluster-failover.html
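The "more than half of the masters" precondition can be made concrete with a small calculation. The sketch below is only an illustration; the function name failover_quorum is hypothetical and not part of Redis.

```shell
# Number of masters that must be reachable for automatic failover:
# a strict majority of all masters in the cluster.
failover_quorum() {
  echo $(( $1 / 2 + 1 ))
}

failover_quorum 3   # prints 2
```

For the 3master/3slave layout used here, losing two masters at once leaves only one voter, which is below the quorum of 2, so automatic failover cannot complete.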
Environment:
centos7.x
redis-5.x
redis-cluster(3master/3slave)
redis-cluster commands
127.0.0.1:3380> cluster help
1) CLUSTER <subcommand> arg arg ... arg. Subcommands are:
2) ADDSLOTS <slot> [slot ...] -- Assign slots to current node.
3) BUMPEPOCH -- Advance the cluster config epoch.
4) COUNT-failure-reports <node-id> -- Return number of failure reports for <node-id>.
5) COUNTKEYSINSLOT <slot> - Return the number of keys in <slot>.
6) DELSLOTS <slot> [slot ...] -- Delete slots information from current node.
7) FAILOVER [force|takeover] -- Promote current replica node to being a master
8) FORGET <node-id> -- Remove a node from the cluster.
9) GETKEYSINSLOT <slot> <count> -- Return key names stored by current node in a slot.
10) FLUSHSLOTS -- Delete current node own slots information.
11) INFO - Return onformation about the cluster.
12) KEYSLOT <key> -- Return the hash slot for <key>.
13) MEET <ip> <port> [bus-port] -- Connect nodes into a working cluster.
14) MYID -- Return the node id.
15) NODES -- Return cluster configuration seen by node. Output format:
16) <id> <ip:port> <flags> <master> <pings> <pongs> <epoch> <link> <slot> ... <slot>
17) REPLICATE <node-id> -- Configure current node as replica to <node-id>.
18) RESET [hard|soft] -- Reset current node (default: soft).
19) SET-config-epoch <epoch> - Set config epoch of current node.
20) SETSLOT <slot> (importing|migrating|stable|node <node-id>) -- Set slot state.
21) REPLICAS <node-id> -- Return <node-id> replicas.
22) SLOTS -- Return information about slots range mappings. Each range is made of:
23) start, end, master and replicas IP addresses, ports and ids
127.0.0.1:3380>
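The NODES output format listed in the help text above is easier to scan once it is reduced to role, state and address. The following is a minimal sketch, assuming the Redis 5 address form ip:port@cport; the function name summarize_nodes is hypothetical.

```shell
# Summarize CLUSTER NODES output read from stdin as:
#   <role> <state> <ip:port> <node-id>
# Field order follows the help text: id, address, flags, master, ...
summarize_nodes() {
  awk '{
    addr = $2; sub(/@.*/, "", addr)      # drop the @cluster-bus-port suffix
    role = "other"
    if ($3 ~ /master/) role = "master"
    else if ($3 ~ /slave/) role = "slave"
    # "fail" here covers both the confirmed flag (fail) and the suspected one (fail?)
    state = ($3 ~ /fail/) ? "fail" : "ok"
    print role, state, addr, $1
  }'
}

# Possible usage against the node from the transcript above:
#   redis-cli -h 127.0.0.1 -p 3380 cluster nodes | summarize_nodes
```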
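Scenarios
Two masters on the same virtual machine
View the node information: cluster nodes
Log in with redis-cli to a slave of one of the two masters that share a machine, and trigger a manual failover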
cluster failover force
Remove the ID of the failed Redis node
cluster forget xxxx
Check the cluster state: cluster nodes / cluster info
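Spotting the "two masters on one machine" layout by eye in cluster nodes output is error-prone. A rough sketch of an automated check follows; it assumes IPv4 addresses, and the function name hosts_with_multiple_masters is hypothetical.

```shell
# From CLUSTER NODES output on stdin, print "<host> <count>" for every host
# that currently carries more than one master.
hosts_with_multiple_masters() {
  awk '$3 ~ /master/ {
         host = $2
         sub(/:[0-9]+(@[0-9]+)?$/, "", host)   # strip :port and optional @cport
         count[host]++
       }
       END { for (h in count) if (count[h] > 1) print h, count[h] }'
}

# Possible usage:
#   redis-cli -h 127.0.0.1 -p 3380 cluster nodes | hosts_with_multiple_masters
```

Any host reported here is a single point of failure for the majority vote in a three-master cluster, which is why one of its masters is moved to a slave on another machine.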
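Fewer than half of the masters online
Redis with 3 masters and 3 slaves
Data directory intact (no files were lost on the Redis host)
Log in to the node's server, start the Redis service, then check after a short while whether the cluster has recovered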
cluster nodes
cluster info
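Instead of rerunning cluster info by hand, the check can be polled. This is a sketch only: the function names are hypothetical, the 2-second interval and 30 attempts are arbitrary choices, and a password option (-a) would need to be added where the cluster requires one.

```shell
# Extract the cluster_state value (ok or fail) from CLUSTER INFO output on stdin.
# CLUSTER INFO lines end in CRLF, so carriage returns are stripped first.
cluster_state() {
  tr -d '\r' | awk -F: '$1 == "cluster_state" { print $2 }'
}

# Poll a node until it reports cluster_state:ok; give up after $3 attempts (default 30).
wait_for_cluster_ok() {
  host=$1; port=$2; attempts=${3:-30}
  while [ "$attempts" -gt 0 ]; do
    state=$(redis-cli -h "$host" -p "$port" cluster info | cluster_state)
    [ "$state" = "ok" ] && return 0
    attempts=$((attempts - 1))
    sleep 2
  done
  return 1
}

# Possible usage:
#   wait_for_cluster_ok 127.0.0.1 3380 && echo "cluster recovered"
```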
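Data directory lost
01 Manually promote slave nodes in the current Redis cluster to master so that the cluster recovers
Master/slave relationships among the online Redis nodes
Manually promote slave -> master; the cluster recovers
Note: at this point the Redis cluster is healthy again, so troubleshooting of the business fault can continue (assign another colleague to handle rejoining the failed Redis nodes to the cluster).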
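To decide which slaves to promote, it helps to list the healthy slaves whose master is flagged fail. One caveat from the CLUSTER FAILOVER documentation: the FORCE option still needs a majority of masters to authorize the promotion, whereas TAKEOVER skips that consensus, so TAKEOVER is the option that applies when fewer than half of the masters are online. The sketch below only lists candidates; the function name promotable_replicas is hypothetical.

```shell
# From CLUSTER NODES output on stdin, print "<slave ip:port> <failed master id>"
# for every healthy slave whose master carries the confirmed fail flag.
promotable_replicas() {
  awk '
    { flags[$1] = $3; addr[$1] = $2; master[$1] = $4; ids[NR] = $1 }
    END {
      for (i = 1; i <= NR; i++) {
        id = ids[i]; m = master[id]
        if (flags[id] ~ /slave/ && flags[id] !~ /fail/ && flags[m] ~ /(^|,)fail(,|$)/) {
          a = addr[id]; sub(/@.*/, "", a)   # drop the @cluster-bus-port suffix
          print a, m
        }
      }
    }'
}

# Possible usage, followed by a promotion run on each listed slave:
#   redis-cli -h 127.0.0.1 -p 3380 cluster nodes | promotable_replicas
#   redis-cli -h <slave-ip> -p <slave-port> cluster failover takeover
```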
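02 Reinitialize the Redis installation, then join the cluster as a slave node
01 Initialize with the script, or copy the Redis scripts and configuration from a node in the current cluster (correct them as needed), then start Redis
Initialize Redis with the script (comment out the cluster slot initialization)
Copy from another node and correct the ip/port
Start Redis with the script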
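When the configuration is copied from another node, the ip/port correction could be scripted roughly as follows. This is a sketch under assumptions: it only rewrites the port and bind directives, the function name retarget_redis_conf is hypothetical, and other port-dependent settings in the real file (for example pidfile, dir or cluster-config-file) would still need a manual review. The copied node's cluster state file and data files should not be carried over, since add-node expects an empty node.

```shell
# Rewrite the port and bind address in a redis.conf copied from another node.
# Usage: retarget_redis_conf <conf-file> <new-ip> <new-port>
retarget_redis_conf() {
  conf=$1; ip=$2; port=$3
  # Write to a temporary file and move it into place (avoids non-portable sed -i).
  sed -e "s/^port .*/port $port/" \
      -e "s/^bind .*/bind $ip/" \
      "$conf" > "$conf.new" && mv "$conf.new" "$conf"
}

# Possible usage (the path is illustrative):
#   retarget_redis_conf ./conf/redis.conf 172.24.20.30 3001
```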
02 Join the Redis node to the cluster
Add a new slave and join it to the cluster based on the Redis node information
./bin/redis-cli -h 172.24.20.31 -p 6379 -a xx
cluster nodes // view the node mapping information
./bin/redis-cli -h 172.24.20.31 -p 6379 -a xxx --cluster add-node 172.24.20.30:3001 172.24.20.31:6379 --cluster-slave
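Notes:
172.24.20.30:3001 the Redis node to be added to the cluster
172.24.20.31:6379 any node that is currently online in the cluster
--cluster-slave join the cluster as a slave; a master to attach to is selected automatically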
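If the new slave should attach to a particular master rather than an automatically selected one, redis-cli also accepts --cluster-master-id together with --cluster-slave. The helper below only prints the command so it can be reviewed before running; the function name add_slave_cmd is hypothetical, and the -a password option is left out and would need to be added where required.

```shell
# Print the add-node command that attaches a new slave to a specific master.
# Usage: add_slave_cmd <new ip:port> <existing ip:port> <master-node-id>
add_slave_cmd() {
  echo "redis-cli --cluster add-node $1 $2 --cluster-slave --cluster-master-id $3"
}

# Possible usage with the addresses from this section:
#   add_slave_cmd 172.24.20.30:3001 172.24.20.31:6379 <master-node-id>
```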
172.24.20.202:3380> cluster forget be91cd62eec29df6a95da23f59d01bb92bbd4656
OK
Note:
be91cd62eec29df6a95da23f59d01bb92bbd4656 node ID of the failed Redis node (master, fail)
172.24.20.202:3380> cluster forget 6901e19a1924395bc9c3190992e1f25bbfc51577
OK
Check again that the cluster has recovered
cluster info
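CLUSTER FORGET only affects the node that receives it, and that node bans the forgotten ID for 60 seconds; if the other nodes are not told within that window, gossip can reintroduce the entry. A sketch of sending the command to every live node follows. The function names are hypothetical, IPv4 addresses are assumed, and the -a password option would need to be added where required.

```shell
# From CLUSTER NODES output on stdin, print "ip port" for every node that is
# not flagged fail, fail? or noaddr.
live_node_addrs() {
  awk '$3 !~ /fail|noaddr/ {
         a = $2
         sub(/@.*/, "", a)    # drop the @cluster-bus-port suffix
         sub(/:/, " ", a)     # split ip:port into two fields
         print a
       }'
}

# Send CLUSTER FORGET <node-id> to every live node known to a seed node.
# Usage: forget_everywhere <seed-ip> <seed-port> <node-id>
forget_everywhere() {
  seed_host=$1; seed_port=$2; node_id=$3
  redis-cli -h "$seed_host" -p "$seed_port" cluster nodes | live_node_addrs |
  while read -r host port; do
    redis-cli -h "$host" -p "$port" cluster forget "$node_id" </dev/null
  done
}

# Possible usage with the node ID from the transcript above:
#   forget_everywhere 172.24.20.202 3380 be91cd62eec29df6a95da23f59d01bb92bbd4656
```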