Redis Fundamentals (Study Notes 19: Redis Sentinel)

I. Tuning Configuration Parameters

1. down-after-milliseconds

# sentinel down-after-milliseconds <master-name> <milliseconds>
#
# Number of milliseconds the master (or any attached slave or sentinel) should
# be unreachable (as in, not acceptable reply to PING, continuously, for the
# specified period) in order to consider it in S_DOWN state (Subjectively
# Down).
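# In other words: this is how many milliseconds a Sentinel must be out of contact with the master (or with a slave or another Sentinel, not only the master) before it judges that instance to be S_DOWN.
# The setting therefore covers Sentinel-to-master, Sentinel-to-slave, and Sentinel-to-Sentinel connections alike.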
# Default is 30 seconds.

sentinel down-after-milliseconds mymaster 30000

In other words:

down-after-milliseconds is the time in milliseconds an instance should not be reachable (either does not reply to our PINGs or it is replying with an error) for a Sentinel starting to think it is down.
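The rule above can be sketched in a few lines of Python. This is only an illustration of the timing check, not Sentinel's actual code; the function name and its parameters are hypothetical, and timestamps are assumed to be plain millisecond counters.

```python
def is_subjectively_down(last_valid_reply_ms: int, now_ms: int,
                         down_after_ms: int = 30_000) -> bool:
    """Return True when no acceptable PING reply has arrived for longer
    than down-after-milliseconds (hypothetical helper for illustration)."""
    return now_ms - last_valid_reply_ms > down_after_ms


# With the default of 30000 ms, silence of 45 seconds means S_DOWN.
print(is_subjectively_down(last_valid_reply_ms=0, now_ms=45_000))
```

A lower value makes Sentinel react faster but also more likely to flag an instance that is merely slow for a moment.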

2. parallel-syncs

# sentinel parallel-syncs <master-name> <numslaves>
#
# How many slaves we can reconfigure to point to the new slave simultaneously
# during the failover. Use a low number if you use the slaves to serve query
# to avoid that all the slaves will be unreachable at about the same
# time while performing the synchronization with the master.
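## If you want some slaves to keep serving queries while the others synchronize with the new master during a failover,
## set this to a small value. [Note the costs: slaves that are still serving queries may return stale data that differs from the new master,
## and the overall synchronization of all slaves with the new master after the failover takes longer.]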
sentinel parallel-syncs mymaster 1

parallel-syncs sets the number of replicas that can be reconfigured to use the new master after a failover at the same time. The lower the number, the more time it will take for the failover process to complete, however if the replicas are configured to serve old data, you may not want all the replicas to re-synchronize with the master at the same time. While the replication process is mostly non blocking for a replica, there is a moment when it stops to load the bulk data from the master. You may want to make sure only one replica at a time is not reachable by setting this option to the value of 1.

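This setting specifies how many slaves may synchronize from the new master at the same time during a failover, that is, after the old master has failed and a new master has just been promoted. The default value of 1 means the slaves synchronize from the new master one at a time.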
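To make the trade-off concrete, here is a small Python sketch that groups slaves into waves of at most parallel-syncs nodes. It is a simplification: the function name is hypothetical, and a real Sentinel keeps up to parallel-syncs reconfigurations in flight rather than waiting for strict waves.

```python
from typing import List


def resync_batches(replicas: List[str], parallel_syncs: int = 1) -> List[List[str]]:
    """Split replicas into groups of at most parallel_syncs nodes.

    Hypothetical helper: each inner list is one wave of slaves that would be
    pointed at the new master (and be briefly unavailable) together.
    """
    if parallel_syncs < 1:
        raise ValueError("parallel_syncs must be at least 1")
    return [replicas[i:i + parallel_syncs]
            for i in range(0, len(replicas), parallel_syncs)]


# Three slaves with parallel-syncs 1: three waves, two slaves always serving.
print(resync_batches(["r1", "r2", "r3"], parallel_syncs=1))
```

With a larger value the failover completes in fewer waves, but more slaves are busy loading data at the same moment.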

3. failover-timeout

# sentinel failover-timeout <master-name> <milliseconds>
#
# Specifies the failover timeout in milliseconds. It is used in many ways: ## [Note: this timeout serves several purposes]
#
# - The time needed to re-start a failover after a previous failover was
# already tried against the same master by a given Sentinel, is two
# times the failover timeout. (The sentence has many modifiers and is hard to parse;
# its core is: The time is two times the failover timeout.)
#
# - The time needed for a slave replicating to a wrong master according
# to a Sentinel current configuration, to be forced to replicate
# with the right master, is exactly the failover timeout (counting since
# the moment a Sentinel detected the misconfiguration).
# Core of the sentence: The time is exactly the failover timeout.
#
# - The time needed to cancel a failover that is already in progress but
# did not produced any configuration change (SLAVEOF NO ONE yet not
# acknowledged by the promoted slave).
# Core of the sentence: The time needed to cancel a failover.
#
# - The maximum time a failover in progress waits for all the slaves to be
# reconfigured as slaves of the new master. However even after this time
# the slaves will be reconfigured by the Sentinels anyway, but not with
# the exact parallel-syncs progression as specified.
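# [The part after "However" means: even once this maximum is exceeded, the Sentinels still reconfigure the slaves;
# they just no longer follow the parallel-syncs pacing strictly, for example no longer handling the slaves one at a time as with the default.]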
# Default is 3 minutes.
sentinel failover-timeout mymaster 180000

Moreover Sentinels have a rule: if a Sentinel voted another Sentinel for the failover of a given master, it will wait some time to try to failover the same master again. This delay is the 2 * failover-timeout you can configure in sentinel.conf. This means that Sentinels will not try to failover the same master at the same time, the first to ask to be authorized will try, if it fails another will try after some time, and so forth.

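Specifies the failover timeout; the default is 3 minutes. The timeout is used in several ways:

After a failed failover, a second failover attempt against the same master can start only after twice this value.
It is the time after which a slave still replicating from the wrong (old) master is forced to replicate from the new master.
It is the time after which a failover that is in progress but has produced no configuration change is cancelled.
It is the maximum time to wait, after the new master is promoted, for all replicas to be reconfigured to follow the new master.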
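The first use, the 2 * failover-timeout retry delay, is simple arithmetic. The Python sketch below shows it under the assumption that times are millisecond counters; both function names are hypothetical.

```python
def next_failover_attempt_ms(last_attempt_ms: int,
                             failover_timeout_ms: int = 180_000) -> int:
    """Earliest time a new failover of the same master may start:
    two times the failover timeout after the previous attempt."""
    return last_attempt_ms + 2 * failover_timeout_ms


def may_retry_failover(now_ms: int, last_attempt_ms: int,
                       failover_timeout_ms: int = 180_000) -> bool:
    """True once the retry delay has fully elapsed."""
    return now_ms >= next_failover_attempt_ms(last_attempt_ms, failover_timeout_ms)


# With the default 3 minutes, a retry is possible 6 minutes after the last try.
print(next_failover_attempt_ms(0))
```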

II. Changing Configuration Dynamically

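After connecting to a Sentinel with redis-cli, you can change its configuration at runtime with the SENTINEL SET command. For example, the following command changes the quorum value originally set by sentinel monitor.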

SENTINEL SET mymaster quorum 5

SENTINEL SET supports the following parameters:

Parameter	Example
quorum	SENTINEL SET mymaster quorum 2
down-after-milliseconds	SENTINEL SET mymaster down-after-milliseconds 50000
failover-timeout	SENTINEL SET mymaster failover-timeout 300000
parallel-syncs	SENTINEL SET mymaster parallel-syncs 3
notification-script	SENTINEL SET mymaster notification-script /var/redis/notify.sh
client-reconfig-script	SENTINEL SET mymaster client-reconfig-script /var/redis/reconfig.sh

Supplement

Starting with Redis version 2.8.4, Sentinel provides an API in order to add, remove, or change the configuration of a given master. Note that if you have multiple sentinels you should apply the changes to all your instances for Redis Sentinel to work properly. This means that changing the configuration of a single Sentinel does not automatically propagate the changes to the other Sentinels in the network.

The following is a list of SENTINEL subcommands used in order to update the configuration of a Sentinel instance.

SENTINEL MONITOR <name> <ip> <port> <quorum> This command tells the Sentinel to start monitoring a new master with the specified name, ip, port, and quorum. It is identical to the sentinel monitor configuration directive in the sentinel.conf configuration file, with the difference that you can't use a hostname as ip, but you need to provide an IPv4 or IPv6 address.
SENTINEL REMOVE <name> is used in order to remove the specified master: the master will no longer be monitored, and will totally be removed from the internal state of the Sentinel, so it will no longer be listed by SENTINEL masters and so forth.
SENTINEL SET <name> [<option> <value> ...] The SET command is very similar to the CONFIG SET command of Redis, and is used in order to change configuration parameters of a specific master. Multiple option / value pairs can be specified (or none at all). All the configuration parameters that can be configured via sentinel.conf are also configurable using the SET command.
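Because a change made on one Sentinel does not propagate to the others, the same SENTINEL SET has to be issued against every Sentinel. The Python sketch below only builds the command arguments per Sentinel; it does not open any connection. The function name and the addresses are hypothetical, and sending the commands (for example with redis-cli) is left to the reader.

```python
from typing import Dict, List, Tuple


def sentinel_set_commands(sentinels: List[Tuple[str, int]], master: str,
                          options: Dict[str, str]) -> Dict[str, List[str]]:
    """Build one SENTINEL SET argument list per Sentinel address.

    Several option/value pairs are packed into a single command, which
    SENTINEL SET accepts. Nothing is sent over the network here.
    """
    argv = ["SENTINEL", "SET", master]
    for option, value in options.items():
        argv += [option, str(value)]
    return {f"{host}:{port}": list(argv) for host, port in sentinels}


# Hypothetical Sentinel addresses, used purely for illustration.
commands = sentinel_set_commands(
    [("10.0.0.1", 26379), ("10.0.0.2", 26379), ("10.0.0.3", 26379)],
    "mymaster",
    {"quorum": "2", "down-after-milliseconds": "50000"},
)
for address, argv in commands.items():
    print(address, " ".join(argv))
```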

Starting with Redis version 6.2, Sentinel also allows getting and setting global configuration parameters which were only supported in the configuration file prior to that.

SENTINEL CONFIG GET <name> Get the current value of a global Sentinel configuration parameter. The specified name may be a wildcard, similar to the Redis CONFIG GET command.
SENTINEL CONFIG SET <name> <value> Set the value of a global Sentinel configuration parameter.

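III. How Sentinel Works

3.1 Three Periodic Tasks

Sentinel runs three periodic tasks to monitor the state of the Redis nodes and of the other Sentinel nodes.

(1) INFO task

Every 10 seconds, each Sentinel sends the INFO command to every Redis node it monitors (the master and its replicas) to learn the current replication topology.

(2) Heartbeat task

Every second, each Sentinel sends a PING command to every Redis node and to every other Sentinel to check whether they are alive. This task is the main basis for deciding whether a node is online.

(3) Publish/subscribe task

On startup, each Sentinel subscribes to the __sentinel__:hello channel on every Redis node. Whenever a message is published to that channel on a Redis node, all subscribers are notified immediately.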

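After startup, each Sentinel publishes a message to the __sentinel__:hello channel of every Redis node every 2 seconds. The message carries the Sentinel's own information (IP, port, run ID) together with its current view of the master's configuration.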

Each Sentinel is therefore both a publisher and a subscriber, and the Redis nodes act as a message relay between Sentinels.

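When a Sentinel receives a __sentinel__:hello message, it parses it and mainly does the following:

If the message comes from a Sentinel it has not seen before, it records that Sentinel's information and opens a connection to it.
If the message carries a newer configuration for the master than the one it holds, it adopts that configuration.
Leader-election votes and other Sentinels' down-state judgments do not travel over this channel; Sentinels exchange them directly with the SENTINEL is-master-down-by-addr command (see 3.2).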
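The three intervals from 3.1 (INFO every 10 seconds, PING every second, hello every 2 seconds) can be sketched as a tiny scheduler in Python. This is a toy model with hypothetical names; a real Sentinel tracks the last send time per monitored instance instead of counting ticks.

```python
from typing import List

# Period of each task in seconds, as described in section 3.1.
INTERVALS = {"info": 10, "ping": 1, "hello": 2}


def tasks_due(second: int) -> List[str]:
    """Return the tasks that fire at the given whole second since startup."""
    return [name for name, every in INTERVALS.items() if second % every == 0]


for second in range(1, 11):
    print(second, tasks_due(second))
```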

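3.2 Deciding That a Redis Node Is Down

(1) Subjectively Down state (SDOWN)

Every second, each Sentinel sends a PING heartbeat to each Redis node. If it receives no valid reply from a node for down-after-milliseconds, it marks that node as down. This is only that one Sentinel's own opinion, which is why it is called subjectively down.

(2) Objectively Down state (ODOWN)

When the node a Sentinel considers subjectively down is the master, the Sentinel sends the SENTINEL is-master-down-by-addr command to every other Sentinel to ask for its view of the master. Each of them replies with 0 (up) or 1 (down). Once at least quorum Sentinels, counting the asking Sentinel itself, consider the master down, the Sentinel marks the master as objectively down.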

【Redis Sentinel has two different concepts of being down, one is called a Subjectively Down condition (SDOWN) and is a down condition that is local to a given Sentinel instance. Another is called Objectively Down condition (ODOWN) and is reached when enough Sentinels (at least the number configured as the quorum parameter of the monitored master) have an SDOWN condition, and get feedback from other Sentinels using the SENTINEL is-master-down-by-addr command.】
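The step from SDOWN to ODOWN is a simple count against the quorum. The Python sketch below illustrates it; the function name and the list-of-booleans input are assumptions made for the example, with each boolean standing for one peer's reply to SENTINEL is-master-down-by-addr.

```python
from typing import List


def is_objectively_down(own_sdown: bool, peer_says_down: List[bool],
                        quorum: int) -> bool:
    """Decide ODOWN for a master from this Sentinel's point of view.

    own_sdown: this Sentinel already considers the master subjectively down.
    peer_says_down: one entry per other Sentinel, True if it replied "down".
    """
    if not own_sdown:
        return False
    agreeing = 1 + sum(1 for down in peer_says_down if down)  # count ourselves
    return agreeing >= quorum


# Three Sentinels, quorum 2: one agreeing peer is enough.
print(is_objectively_down(True, [True, False], quorum=2))
```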

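3.3 Sentinel Leader Election

Once a Sentinel has judged the master objectively down, the rest of the failover is carried out by a Sentinel leader. In other words, the Sentinels are not all peers at that point: there is a leader and there are followers.

The leader election among Sentinels is based on the Raft algorithm. Roughly:

Every participant is eligible to become leader. As soon as a Sentinel reaches the objectively-down verdict, it nominates itself and sends its proposal to all participants. A participant that receives a proposal and has not yet cast its vote accepts the proposal at once and replies with its approval; later proposals are rejected because the participant has no vote left. A proposer becomes leader once the number of approvals it has received is at least max(quorum, sentinelNum/2+1).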

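Notes:

(1) When the network is healthy, the Sentinel that first reaches the objectively-down verdict is usually the first to start the election, the one that wins the support of the majority, and therefore the one elected leader.

(2) A Sentinel leader election takes place before every failover.

(3) After the failover finishes, the Sentinels no longer keep the leader-follower relationship; the leader ceases to exist.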
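The winning threshold max(quorum, sentinelNum/2+1) and the "one vote per Sentinel" rule can be sketched as follows. This Python example is a toy tally, not the real protocol: the function names are hypothetical, and the votes mapping is assumed to record, for each Sentinel that voted, the candidate whose proposal reached it first (a candidate votes for itself).

```python
from collections import Counter
from typing import Dict, Optional


def votes_needed(quorum: int, sentinel_count: int) -> int:
    """Approvals required to win: max(quorum, majority of all Sentinels)."""
    return max(quorum, sentinel_count // 2 + 1)


def elect_leader(votes: Dict[str, str], quorum: int,
                 sentinel_count: int) -> Optional[str]:
    """Return the elected leader, or None if no candidate reached the threshold.

    votes maps a voting Sentinel's name to the candidate it approved.
    """
    if not votes:
        return None
    candidate, count = Counter(votes.values()).most_common(1)[0]
    return candidate if count >= votes_needed(quorum, sentinel_count) else None


# s1 asked first and collected two of three votes, which is enough.
print(elect_leader({"s1": "s1", "s2": "s1", "s3": "s3"}, quorum=2, sentinel_count=3))
```

If the votes are split so that nobody reaches the threshold, the function returns None, which corresponds to a failed round that has to be retried later.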

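3.4 Master Selection Algorithm

During a failover, the Sentinel leader has to pick the new master from among the Redis slave nodes. The selection works as follows:

(1) Filter out every Redis node that is subjectively down, that has not been answering the Sentinel's heartbeats, or whose replica-priority is 0.

(2) From the remaining nodes, take those with the smallest replica-priority. If only one node is left, return it; otherwise continue.

(3) Among the nodes with equal priority, take those with the largest replication offset. If only one node is left, return it; otherwise continue.

(4) Among the nodes with equal replication offset, return the one with the smallest run ID.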

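In short: discard unusable slaves, then prefer the smallest replica-priority, then the largest replication offset, then the smallest run ID.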
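The selection order described in 3.4 maps naturally onto a filter followed by a sort key. The Python sketch below follows those four steps; the Replica class and its field names are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    run_id: str
    priority: int          # replica-priority; 0 means "never promote"
    repl_offset: int       # replication offset already processed
    sdown: bool = False    # subjectively down from the Sentinel's view
    responsive: bool = True  # still answering heartbeats


def select_new_master(replicas: List[Replica]) -> Optional[Replica]:
    """Pick the promotion candidate following the four steps of section 3.4."""
    candidates = [r for r in replicas
                  if not r.sdown and r.responsive and r.priority != 0]
    if not candidates:
        return None
    # Smallest priority first, then largest offset, then smallest run ID.
    return min(candidates, key=lambda r: (r.priority, -r.repl_offset, r.run_id))


replicas = [
    Replica("bbb", priority=100, repl_offset=500),
    Replica("aaa", priority=100, repl_offset=500),
    Replica("ccc", priority=100, repl_offset=400),
    Replica("ddd", priority=0, repl_offset=900),
]
chosen = select_new_master(replicas)
print(chosen.run_id if chosen else None)
```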

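3.5 Failover Process

The Sentinel leader drives the whole failover. The main steps are:

(1) The Sentinel leader picks a slave as the new master using the master selection algorithm.

(2) The Sentinel leader sends slaveof no one to that node to promote it to master.

(3) The Sentinel leader sends info replication to the new master to obtain its run ID.

(4) The Sentinel leader notifies the other Redis nodes of the new master's run ID.

(5) The Sentinel leader sends slaveof <masterip> <masterport> to the other Redis nodes to make them slaves of the new master.

(6) The Sentinel leader picks parallel-syncs slaves at a time to synchronize from the new master, until all slaves have synchronized.

(7) The failover is complete.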
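The command sequence in the steps above can be written out as an ordered plan. The Python sketch below only lists which command goes to which node, using the commands named in the steps; it sends nothing, ignores the parallel-syncs pacing of step (6), and its function name and addresses are hypothetical.

```python
from typing import List, Tuple

Node = Tuple[str, int]  # (ip, port)


def failover_plan(new_master: Node, others: List[Node]) -> List[Tuple[str, str]]:
    """Return (target, command) pairs in the order the leader would issue them."""
    ip, port = new_master
    target = f"{ip}:{port}"
    plan = [
        (target, "SLAVEOF NO ONE"),     # step 2: promote the chosen slave
        (target, "INFO replication"),   # step 3: query the promoted node
    ]
    for other_ip, other_port in others:
        # step 5: point every remaining node at the new master
        plan.append((f"{other_ip}:{other_port}", f"SLAVEOF {ip} {port}"))
    return plan


# Hypothetical addresses for illustration only.
for step in failover_plan(("10.0.0.2", 6379), [("10.0.0.1", 6379), ("10.0.0.3", 6379)]):
    print(step)
```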

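3.6 Nodes Coming Online

There are three cases: a previously known Redis node comes back, a new Redis node is added, and a Sentinel node is added.

(1) A previously known Redis node comes back

Whether it was the master or a slave before it went down, a node that already belonged to the Redis deployment only needs to be started. Each Sentinel keeps the list of all Redis nodes it has been monitoring and checks periodically whether they have recovered. When it sees that a node is back, it tells the node to synchronize from the current master.

For the old master, however, the Sentinel leader records it as a slave as soon as the new master has been promoted, and only then checks periodically whether it has recovered.

(2) A new Redis node is added

A node that has never been part of the Redis deployment can only be brought online by hand: the operator must know which node is the current master and, after starting the new node, run the slaveof command on it to join the deployment.

(3) A Sentinel node is added

A Sentinel node, whether or not it has been part of the Sentinel group before, must always be added by hand: the operator must know which node is the current master, set the sentinel monitor directive in the configuration file to point at that master, and then start the Sentinel.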

References and Acknowledgments

1.《High availability with Redis Sentinel》

https://redis.io/docs/latest/operate/oss_and_stack/management/sentinel/

2. 【Redis视频从入门到高级】 (Redis video course, from beginner to advanced)

【https://www.bilibili.com/video/BV1U24y1y7jF?p=11&vd_source=0e347fbc6c2b049143afaa5a15abfc1c】

posted @ 2024-07-18 23:30 东山絮柳仔
