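Redis Sentinel is the high-availability solution for Redis. It was officially introduced in Redis 2.8.

With plain master-slave replication, if the master fails, someone has to promote one of the slaves to master by hand, point the remaining slaves at the new master, and update the master address used by the application. The whole process requires manual intervention.

The logs below show exactly how a Sentinel failover proceeds.

Sentinel failover walkthrough

The cluster topology is as follows.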

Role          IP           Port     runID
Master        127.0.0.1    6379
Slave-1       127.0.0.1    6380
Slave-2       127.0.0.1    6381
Sentinel-1    127.0.0.1    26379    d4424b8684977767be4f5abd1e364153fbb0adbd
Sentinel-2    127.0.0.1    26380    18311edfbfb7bf89fe4b67d08ef432053db62fff
Sentinel-3    127.0.0.1    26381    3e9eb1aa9378d89cfe04fe21bf4a05a901747fa8

Kill the master process with kill -9.

1. The slaves react first.

They immediately log the following.

:S  Oct ::34.184 # Connection with master lost.
:S Oct ::34.184 * Caching the disconnected master state.
:S Oct ::34.548 * Connecting to MASTER 127.0.0.1:
:S Oct ::34.548 * MASTER <-> SLAVE sync started
:S Oct ::34.548 # Error condition on socket for SYNC: Connection refused
:S Oct ::35.556 * Connecting to MASTER 127.0.0.1:
:S Oct ::35.556 * MASTER <-> SLAVE sync started
...

2. The Sentinel logs only show output 30 seconds later, which follows from the setting "sentinel down-after-milliseconds mymaster 30000".

Below are the logs of each Sentinel node and of the slaves, in turn.

Sentinel-1

:X  Oct ::04.277 # +sdown master mymaster 127.0.0.1
:X Oct ::04.379 # +new-epoch
:X Oct ::04.385 # +vote-for-leader 18311edfbfb7bf89fe4b67d08ef432053db62fff
:X Oct ::05.388 # +odown master mymaster 127.0.0.1 #quorum /
:X Oct ::05.388 # Next failover delay: I will not start a failover before Mon Oct ::
:X Oct ::05.631 # +config-update-from sentinel 18311edfbfb7bf89fe4b67d08ef432053db62fff 127.0.0.1 @ mymaster 127.0.0.1
:X Oct ::05.631 # +switch-master mymaster 127.0.0.1 127.0.0.1
:X Oct ::05.631 * +slave slave 127.0.0.1: 127.0.0.1 @ mymaster 127.0.0.1
:X Oct ::05.631 * +slave slave 127.0.0.1: 127.0.0.1 @ mymaster 127.0.0.1
:X Oct ::35.656 # +sdown slave 127.0.0.1: 127.0.0.1 @ mymaster 127.0.0.1

Sentinel-2

:X  Oct ::04.289 # +sdown master mymaster 127.0.0.1
:X Oct ::04.366 # +odown master mymaster 127.0.0.1 #quorum /
:X Oct ::04.366 # +new-epoch
:X Oct ::04.366 # +try-failover master mymaster 127.0.0.1
:X Oct ::04.373 # +vote-for-leader 18311edfbfb7bf89fe4b67d08ef432053db62fff
:X Oct ::04.385 # 3e9eb1aa9378d89cfe04fe21bf4a05a901747fa8 voted for 18311edfbfb7bf89fe4b67d08ef432053db62fff
:X Oct ::04.385 # d4424b8684977767be4f5abd1e364153fbb0adbd voted for 18311edfbfb7bf89fe4b67d08ef432053db62fff
:X Oct ::04.450 # +elected-leader master mymaster 127.0.0.1
:X Oct ::04.450 # +failover-state-select-slave master mymaster 127.0.0.1
:X Oct ::04.528 # +selected-slave slave 127.0.0.1: 127.0.0.1 @ mymaster 127.0.0.1
:X Oct ::04.528 * +failover-state-send-slaveof-noone slave 127.0.0.1: 127.0.0.1 @ mymaster 127.0.0.1
:X Oct ::04.586 * +failover-state-wait-promotion slave 127.0.0.1: 127.0.0.1 @ mymaster 127.0.0.1
:X Oct ::05.543 # +promoted-slave slave 127.0.0.1: 127.0.0.1 @ mymaster 127.0.0.1
:X Oct ::05.543 # +failover-state-reconf-slaves master mymaster 127.0.0.1
:X Oct ::05.629 * +slave-reconf-sent slave 127.0.0.1: 127.0.0.1 @ mymaster 127.0.0.1
:X Oct ::06.554 # -odown master mymaster 127.0.0.1
:X Oct ::06.555 * +slave-reconf-inprog slave 127.0.0.1: 127.0.0.1 @ mymaster 127.0.0.1
:X Oct ::06.555 * +slave-reconf-done slave 127.0.0.1: 127.0.0.1 @ mymaster 127.0.0.1
:X Oct ::06.606 # +failover-end master mymaster 127.0.0.1
:X Oct ::06.606 # +switch-master mymaster 127.0.0.1 127.0.0.1
:X Oct ::06.606 * +slave slave 127.0.0.1: 127.0.0.1 @ mymaster 127.0.0.1
:X Oct ::06.606 * +slave slave 127.0.0.1: 127.0.0.1 @ mymaster 127.0.0.1
:X Oct ::36.687 # +sdown slave 127.0.0.1: 127.0.0.1 @ mymaster 127.0.0.1

Sentinel-3

:X  Oct ::04.288 # +sdown master mymaster 127.0.0.1
:X Oct ::04.378 # +new-epoch
:X Oct ::04.385 # +vote-for-leader 18311edfbfb7bf89fe4b67d08ef432053db62fff
:X Oct ::04.385 # +odown master mymaster 127.0.0.1 #quorum /
:X Oct ::04.385 # Next failover delay: I will not start a failover before Mon Oct ::
:X Oct ::05.630 # +config-update-from sentinel 18311edfbfb7bf89fe4b67d08ef432053db62fff 127.0.0.1 @ mymaster 127.0.0.1
:X Oct ::05.630 # +switch-master mymaster 127.0.0.1 127.0.0.1
:X Oct ::05.630 * +slave slave 127.0.0.1: 127.0.0.1 @ mymaster 127.0.0.1
:X Oct ::05.630 * +slave slave 127.0.0.1: 127.0.0.1 @ mymaster 127.0.0.1
:X Oct ::35.709 # +sdown slave 127.0.0.1: 127.0.0.1 @ mymaster 127.0.0.1

slave2

:S  Oct ::04.762 * MASTER <-> SLAVE sync started
:S Oct ::04.762 # Error condition on socket for SYNC: Connection refused
:S Oct ::05.630 * SLAVE OF 127.0.0.1: enabled (user request from 'id=6 addr=127.0.0.1:43880 fd=12 name= age=148 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=224 qbuf-free= obl= oll= omem= events=r cmd=slaveof')
28244:S 08 Oct 16:04:05.636 # CONFIG REWRITE executed with success.
:S Oct ::05.770 * Connecting to MASTER 127.0.0.1:
:S Oct ::05.770 * MASTER <-> SLAVE sync started
:S Oct ::05.770 * Non blocking connect for SYNC fired the event.
:S Oct ::05.770 * Master replied to PING, replication can continue...
:S Oct ::05.770 * Trying a partial resynchronization (request b95802ca8afd97c578b355a5838d219681d0af27:).
:S Oct ::05.770 * Successful partial resynchronization with master.
:S Oct ::05.770 # Master replication ID changed to a4022bb5c361353a4773fd460cec5cdcc5c02031
:S Oct ::05.770 * MASTER <-> SLAVE sync: Master accepted a Partial Resynchronization.

slave3

:S  Oct ::03.655 * MASTER <-> SLAVE sync started
:S Oct ::03.655 # Error condition on socket for SYNC: Connection refused
:M Oct ::04.586 # Setting secondary replication ID to b95802ca8afd97c578b355a5838d219681d0af27, valid up to offset: . New replication ID is a4022bb5c361353a4773fd460cec5cdcc5c02031
28253:M Oct ::04.586 * Discarding previously cached master state.
:M Oct ::04.586 * MASTER MODE enabled (user request from 'id=9 addr=127.0.0.1:49316 fd=8 name=sentinel-18311edf-cmd age=137 idle=0 flags=x db=0 sub=0 psub=0 multi=3 qbuf=0 qbuf-free= obl= oll= omem= events=r cmd=exec')
28253:M 08 Oct 16:04:04.593 # CONFIG REWRITE executed with success.
:M Oct ::05.770 * Slave 127.0.0.1: asks for synchronization
:M Oct ::05.770 * Partial resynchronization request from 127.0.0.1: accepted. Sending bytes of backlog starting from offset .

Putting the logs together, we can see the following.

Every Sentinel node marks 127.0.0.1 6379 as subjectively down (sdown).

:X  Oct ::04.289 # +sdown master mymaster 127.0.0.1 

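Once the quorum is reached, Sentinel-2 marks the master as objectively down (odown). Comparing the logs of the other two Sentinels shows that Sentinel-2 was the first to reach that verdict. Next comes the Sentinel leader election. Generally speaking, whichever Sentinel completes the objective-down determination first becomes the leader, and only the Sentinel leader may carry out the failover.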

:X  Oct ::04.366 # +odown master mymaster 127.0.0.1  #quorum /
:X Oct ::04.366 # +new-epoch
:X Oct ::04.366 # +try-failover master mymaster 127.0.0.1
:X Oct ::04.373 # +vote-for-leader 18311edfbfb7bf89fe4b67d08ef432053db62fff
:X Oct ::04.385 # 3e9eb1aa9378d89cfe04fe21bf4a05a901747fa8 voted for 18311edfbfb7bf89fe4b67d08ef432053db62fff
:X Oct ::04.385 # d4424b8684977767be4f5abd1e364153fbb0adbd voted for 18311edfbfb7bf89fe4b67d08ef432053db62fff
:X Oct ::04.450 # +elected-leader master mymaster 127.0.0.1

Look for a suitable slave to become the master.

:X  Oct ::04.450 # +failover-state-select-slave master mymaster 127.0.0.1 

+failover-state-select-slave <instance details> -- New failover state is select-slave: we are trying to find a suitable slave for promotion.

127.0.0.1 6381 is chosen as the new master.

:X  Oct ::04.528 # +selected-slave slave 127.0.0.1: 127.0.0.1  @ mymaster 127.0.0.1 

+selected-slave <instance details> -- We found the specified good slave to promote.

Tell the 6381 node to run slaveof no one so that it becomes a master.

:X  Oct ::04.528 * +failover-state-send-slaveof-noone slave 127.0.0.1: 127.0.0.1  @ mymaster 127.0.0.1 

+failover-state-send-slaveof-noone <instance details> -- We are trying to reconfigure the promoted slave as master, waiting for it to switch.

Wait for the 6381 node to be promoted to master.

:X  Oct ::04.586 * +failover-state-wait-promotion slave 127.0.0.1: 127.0.0.1  @ mymaster 127.0.0.1 

Confirm that the 6381 node has been promoted to master.

:X  Oct ::05.543 # +promoted-slave slave 127.0.0.1: 127.0.0.1  @ mymaster 127.0.0.1 

Now look at slave3's log between 16:04:04.528 and 16:04:05.543. It switched to MASTER mode and rewrote its configuration file.

:M  Oct ::04.586 # Setting secondary replication ID to b95802ca8afd97c578b355a5838d219681d0af27, valid up to offset: . New replication ID is a4022bb5c361353a4773fd460cec5cdcc5c02031
28253:M Oct ::04.586 * Discarding previously cached master state.
:M Oct ::04.586 * MASTER MODE enabled (user request from 'id=9 addr=127.0.0.1:49316 fd=8 name=sentinel-18311edf-cmd age=137 idle=0 flags=x db=0 sub=0 psub=0 multi=3 qbuf=0 qbuf-free= obl= oll= omem= events=r cmd=exec')
28253:M 08 Oct 16:04:04.593 # CONFIG REWRITE executed with success.

The failover enters the slave-reconfiguration phase.

:X  Oct ::05.543 # +failover-state-reconf-slaves master mymaster 127.0.0.1 

Tell the 6380 node to replicate from the new master.

:X  Oct ::05.629 * +slave-reconf-sent slave 127.0.0.1: 127.0.0.1  @ mymaster 127.0.0.1 

+slave-reconf-sent <instance details> -- The leader sentinel sent the SLAVEOF command to this instance in order to reconfigure it for the new slave.

slave2's log at this point in time matches. It performs a partial (incremental) resynchronization.

:S  Oct ::05.630 * SLAVE OF 127.0.0.1: enabled (user request from 'id=6 addr=127.0.0.1:43880 fd=12 name= age=148 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=224 qbuf-free= obl= oll= omem= events=r cmd=slaveof')
28244:S 08 Oct 16:04:05.636 # CONFIG REWRITE executed with success.
:S Oct ::05.770 * Connecting to MASTER 127.0.0.1:
:S Oct ::05.770 * MASTER <-> SLAVE sync started
:S Oct ::05.770 * Non blocking connect for SYNC fired the event.
:S Oct ::05.770 * Master replied to PING, replication can continue...
:S Oct ::05.770 * Trying a partial resynchronization (request b95802ca8afd97c578b355a5838d219681d0af27:).
:S Oct ::05.770 * Successful partial resynchronization with master.
:S Oct ::05.770 # Master replication ID changed to a4022bb5c361353a4773fd460cec5cdcc5c02031
:S Oct ::05.770 * MASTER <-> SLAVE sync: Master accepted a Partial Resynchronization.

The Sentinels also log at this moment; take sentinel1 as an example. The log shows that it updates its configuration at this point.

:X  Oct ::05.631 # +config-update-from sentinel 18311edfbfb7bf89fe4b67d08ef432053db62fff 127.0.0.1  @ mymaster 127.0.0.1
:X Oct ::05.631 # +switch-master mymaster 127.0.0.1 127.0.0.1
:X Oct ::05.631 * +slave slave 127.0.0.1: 127.0.0.1 @ mymaster 127.0.0.1
:X Oct ::05.631 * +slave slave 127.0.0.1: 127.0.0.1 @ mymaster 127.0.0.1

switch-master <master name> <oldip> <oldport> <newip> <newport> -- The master new IP and address is the specified one after a configuration change. This is the message most external users are interested in.
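Because +switch-master is the event most clients care about, a small parser for its payload can be handy. The sketch below assumes only the field layout quoted above (master name, old ip, old port, new ip, new port); the names SwitchMaster and parse_switch_master are hypothetical, and the sample payload uses the ports from the topology table.

```python
from typing import NamedTuple


class SwitchMaster(NamedTuple):
    master_name: str
    old_ip: str
    old_port: int
    new_ip: str
    new_port: int


def parse_switch_master(payload: str) -> SwitchMaster:
    """Parse '<master name> <oldip> <oldport> <newip> <newport>'."""
    parts = payload.split()
    if len(parts) != 5:
        raise ValueError(f"unexpected +switch-master payload: {payload!r}")
    name, old_ip, old_port, new_ip, new_port = parts
    # Ports arrive as text, so convert them for later use in a connection.
    return SwitchMaster(name, old_ip, int(old_port), new_ip, int(new_port))


# Sample payload built from the topology in this article (6379 -> 6381).
event = parse_switch_master("mymaster 127.0.0.1 6379 127.0.0.1 6381")
print(event.new_ip, event.new_port)  # prints: 127.0.0.1 6381
```

A client that receives this event would typically drop its connection to the old address and reconnect to the new one.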

The synchronization has not finished yet.

:X  Oct ::06.555 * +slave-reconf-inprog slave 127.0.0.1: 127.0.0.1  @ mymaster 127.0.0.1 

+slave-reconf-inprog <instance details> -- The slave being reconfigured showed to be a slave of the new master ip:port pair, but the synchronization process is not yet complete.

Master-slave synchronization is complete.

:X  Oct ::06.555 * +slave-reconf-done slave 127.0.0.1: 127.0.0.1  @ mymaster 127.0.0.1 

+slave-reconf-done <instance details> -- The slave is now synchronized with the new master.

The failover is complete.

:X  Oct ::06.606 # +failover-end master mymaster 127.0.0.1 

After the failover succeeds, the master switch is announced.

:X  Oct ::06.606 # +switch-master mymaster 127.0.0.1  127.0.0.1 

The slaves are attached to the new master. Note that the original master is registered as a slave of the new master.

:X  Oct ::06.606 * +slave slave 127.0.0.1: 127.0.0.1  @ mymaster 127.0.0.1
:X Oct ::06.606 * +slave slave 127.0.0.1: 127.0.0.1 @ mymaster 127.0.0.1

+slave <instance details> -- A new slave was detected and attached.

30 seconds later, the original master is marked as subjectively down.

:X  Oct ::36.687 # +sdown slave 127.0.0.1: 127.0.0.1  @ mymaster 127.0.0.1 

To summarize, a Sentinel failover goes through the following steps.

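1. Once per second, each Sentinel sends a ping to the master, the slaves, and the other Sentinels as a heartbeat to check whether they are reachable. If a node gives no valid reply for longer than down-after-milliseconds, the Sentinel marks it as subjectively down.

2. If the subjectively down node is the master, the Sentinel asks the other Sentinels for their view of the master with the sentinel is-master-down-by-addr command. When at least <quorum> Sentinels, counting itself, agree, the Sentinel marks the master as objectively down. If a slave or a Sentinel is marked subjectively down, no failover follows.

3. The Sentinels elect a leader, which carries out the rest of the failover. The election algorithm is based on Raft.

4. The Sentinel leader starts the failover.

5. It picks a suitable slave as the new master.

6. The Sentinel leader sends slaveof no one to the slave chosen in the previous step to make it a master.

7. It sends commands to the remaining slaves so that they become slaves of the new master. How many are reconfigured at a time depends on the parallel-syncs parameter.

8. It records the original master as a slave and keeps it under Sentinel management, so that it replicates from the new master once it recovers.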
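The two down states in steps 1 and 2 can be sketched as two small predicates. This is a simplified model, not Sentinel's actual code: the function names and millisecond timestamps are assumptions made for illustration, and the Sentinel's own vote is counted toward the quorum.

```python
from typing import Sequence


def is_subjectively_down(now_ms: int, last_valid_reply_ms: int,
                         down_after_ms: int) -> bool:
    """sdown: no valid reply for longer than down-after-milliseconds."""
    return now_ms - last_valid_reply_ms > down_after_ms


def is_objectively_down(local_sdown: bool, peer_reports: Sequence[bool],
                        quorum: int) -> bool:
    """odown: this Sentinel plus agreeing peers reach the quorum.

    peer_reports holds each other Sentinel's answer to
    'sentinel is-master-down-by-addr' (True = it also sees the master down).
    """
    if not local_sdown:
        return False
    agreeing = 1 + sum(peer_reports)  # 1 counts this Sentinel itself
    return agreeing >= quorum


# With the settings used in this article: down-after 30000 ms, quorum 2.
sdown = is_subjectively_down(now_ms=30_500, last_valid_reply_ms=0,
                             down_after_ms=30_000)
print(sdown, is_objectively_down(sdown, [True, False], quorum=2))
# prints: True True
```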

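Sentinel leader election

The Sentinel leader election is based on the Raft protocol.

1. Every online Sentinel is eligible to become leader. Once a Sentinel has determined that the master is objectively down (the logs above show +odown followed by +try-failover), it sends the sentinel is-master-down-by-addr command to the other Sentinels, asking to be made leader.

2. A Sentinel that receives the command grants the request if it has not already granted another Sentinel's sentinel is-master-down-by-addr request; otherwise it refuses.

3. If a Sentinel finds that its vote count is greater than or equal to max(quorum, num(sentinels)/2+1), it becomes the leader.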
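The vote threshold and the first-come-first-served voting rule can be illustrated with a toy model. It is a sketch under stated assumptions, not the real implementation: the Voter class is hypothetical, the epoch argument mirrors the +new-epoch lines in the logs, and the candidate ID is Sentinel-2's run ID shortened.

```python
from typing import Dict


def votes_needed(quorum: int, num_sentinels: int) -> int:
    """Votes a candidate needs: max(quorum, num(sentinels)/2+1)."""
    return max(quorum, num_sentinels // 2 + 1)


class Voter:
    """A Sentinel that grants at most one vote per epoch."""

    def __init__(self) -> None:
        self.granted: Dict[int, str] = {}  # epoch -> candidate voted for

    def request_vote(self, epoch: int, candidate: str) -> str:
        # The first request in an epoch wins; later ones get the same answer.
        return self.granted.setdefault(epoch, candidate)


candidate = "18311edf"         # Sentinel-2, shortened
voters = [Voter(), Voter()]    # Sentinel-1 and Sentinel-3
votes = 1 + sum(v.request_vote(1, candidate) == candidate for v in voters)
print(votes >= votes_needed(quorum=2, num_sentinels=3))  # prints: True
```

With three Sentinels and quorum 2 the threshold is 2, so a candidate needs just one vote besides its own.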

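How the new master is selected

1. Discard all slaves that are already down or disconnected.

2. Discard slaves that have not replied to the leader Sentinel's INFO command within the last 5 seconds.

3. Discard slaves whose link to the failed master has been down for more than down-after-milliseconds*10 milliseconds.

4. Pick the slave with the highest priority, that is, the lowest slave-priority value.

5. If priorities tie, pick the slave with the largest replication offset.

6. If offsets also tie, pick the slave with the smallest runid.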
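The selection rules above amount to a filter followed by a three-key sort. The sketch below follows the listed steps only; the Slave dataclass and its field names are hypothetical, and the sample run IDs and offsets are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Slave:
    run_id: str
    priority: int             # slave-priority value; lower is preferred
    repl_offset: int          # replication offset reached so far
    down: bool                # flagged down or disconnected
    info_age_ms: int          # time since the last INFO reply
    master_link_down_ms: int  # how long the link to the old master was down


def select_new_master(slaves: List[Slave],
                      down_after_ms: int) -> Optional[Slave]:
    # Steps 1-3: drop slaves that are unhealthy or too far out of touch.
    candidates = [
        s for s in slaves
        if not s.down
        and s.info_age_ms <= 5000
        and s.master_link_down_ms <= down_after_ms * 10
    ]
    if not candidates:
        return None
    # Steps 4-6: lowest priority value, then largest offset, then smallest runid.
    return min(candidates, key=lambda s: (s.priority, -s.repl_offset, s.run_id))


slaves = [
    Slave("bbb", 100, 1200, False, 800, 0),
    Slave("aaa", 100, 1500, False, 900, 0),
    Slave("ccc", 100, 9999, True, 900, 0),  # down, so filtered out
]
print(select_new_master(slaves, 30000).run_id)  # prints: aaa
```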

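Three periodic monitoring tasks

1. Every 10 seconds, each Sentinel sends the info command to the master and the slaves to obtain the latest topology. This serves the following purposes:

    1> Running info against the master returns the slave information, which is why Sentinel does not need the slaves to be configured explicitly.
    2> New slaves are picked up as soon as they join.
    3> After a node becomes unreachable or a failover completes, the topology is refreshed through info.

2. Every 2 seconds, each Sentinel publishes its view of the master, together with its own information, to the __sentinel__:hello channel of the Redis data nodes. Every Sentinel also subscribes to that channel to learn about the other Sentinels and their view of the master. This serves the following purposes:

    1> Discovering new Sentinels: by subscribing to the master's __sentinel__:hello channel, a Sentinel learns about the others. When a new Sentinel appears, its information is saved and a connection to it is created.
    2> Exchanging master state between Sentinels, which feeds the later objective-down decision and the leader election.

3. Every second, each Sentinel sends a ping to the master, the slaves, and the other Sentinels as a heartbeat to check whether they are reachable. This task is the main basis for failure detection.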

Sentinel parameters

# bind 127.0.0.1 192.168.1.1
# protected-mode no
port 26379
# sentinel announce-ip <ip>
# sentinel announce-port <port>
dir /tmp
sentinel monitor mymaster 127.0.0.1 6379 2
# sentinel auth-pass <master-name> <password>
sentinel down-after-milliseconds mymaster 30000
sentinel parallel-syncs mymaster 1
sentinel failover-timeout mymaster 180000
# sentinel notification-script mymaster /var/redis/notify.sh
# sentinel client-reconfig-script mymaster /var/redis/reconfig.sh
sentinel deny-scripts-reconfig yes

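Where:

dir: sets Sentinel's working directory.

sentinel monitor mymaster 127.0.0.1 6379 2: the trailing 2 is the quorum, i.e. the number of Sentinels that must agree. At least two Sentinels have to consider the master subjectively down before it can be marked objectively down. The usual recommendation is half the number of Sentinels plus 1. The quorum also affects the Sentinel leader election: to elect a leader, at least max(quorum, num(sentinels) / 2 + 1) Sentinels must take part.

sentinel down-after-milliseconds mymaster 30000: every Sentinel periodically sends ping to determine whether the Redis nodes and the other Sentinels are reachable.

If no valid reply arrives from the master within this time, it is marked subjectively down. Note that the parameter applies not only to the master but also to that master's slaves and to the other Sentinels. The default is 30s.

sentinel parallel-syncs mymaster 1: how many slaves may be repointed at the new master at the same time during a failover. With a large parallel-syncs value, replication itself does not block the master, but many slaves syncing from the new master at once increases its network and disk I/O load.

sentinel failover-timeout mymaster 180000: the failover timeout. The default is 180000 milliseconds, i.e. 3 minutes. Note that this is not the total duration of a failover; it applies to several different failover situations.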

# Specifies the failover timeout in milliseconds. It is used in many ways:
#
# - The time needed to re-start a failover after a previous failover was
# already tried against the same master by a given Sentinel, is two
# times the failover timeout.
#
# - The time needed for a slave replicating to a wrong master according
# to a Sentinel current configuration, to be forced to replicate
# with the right master, is exactly the failover timeout (counting since
# the moment a Sentinel detected the misconfiguration).
#
# - The time needed to cancel a failover that is already in progress but
# did not produced any configuration change (SLAVEOF NO ONE yet not
# acknowledged by the promoted slave).
#
# - The maximum time a failover in progress waits for all the slaves to be
# reconfigured as slaves of the new master. However even after this time
# the slaves will be reconfigured by the Sentinels anyway, but not with
# the exact parallel-syncs progression as specified.

First case: if Redis Sentinel fails to fail over a master, the next failover attempt against that master starts no earlier than 2 times failover-timeout later. This shows up in the Sentinel log (28234:X 08 Oct 16:04:04.385 # Next failover delay: I will not start a failover before Mon Oct  8 16:10:04 2018).
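The arithmetic behind that log line is easy to check. The snippet below is only a worked calculation using the values from this article (failover-timeout 180000 ms and the 16:04:04 timestamp from the log, truncated to whole seconds); the variable names are illustrative.

```python
from datetime import datetime, timedelta

failover_timeout_ms = 180000                  # sentinel failover-timeout mymaster 180000
logged_at = datetime(2018, 10, 8, 16, 4, 4)   # time of the "Next failover delay" line

# The next attempt is allowed after two times the failover timeout (6 minutes).
earliest_retry = logged_at + timedelta(milliseconds=2 * failover_timeout_ms)
print(earliest_retry.isoformat())  # prints: 2018-10-08T16:10:04
```

The result matches the "Mon Oct  8 16:10:04 2018" deadline printed by the Sentinel.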

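sentinel notification-script: defines a notification script. Sentinel calls it whenever a WARNING-level event occurs and passes two arguments: the event type and the event description.

sentinel client-reconfig-script: the script called when the master changes. It receives the following arguments: <master-name> <role> <state> <from-ip> <from-port> <to-ip> <to-port>

Scripts must follow certain rules.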

# SCRIPTS EXECUTION
#
# sentinel notification-script and sentinel reconfig-script are used in order
# to configure scripts that are called to notify the system administrator
# or to reconfigure clients after a failover. The scripts are executed
# with the following rules for error handling:
#
# If script exits with "1" the execution is retried later (up to a maximum
# number of times currently set to 10).
#
# If script exits with "2" (or an higher value) the script execution is
# not retried.
#
# If script terminates because it receives a signal the behavior is the same
# as exit code 1.
#
# A script has a maximum running time of 60 seconds. After this limit is
# reached the script is terminated with a SIGKILL and the execution retried.
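A client-reconfig-script could be structured roughly as follows. This is a minimal sketch, assuming only the seven arguments and the exit-code rules quoted above; the function names are hypothetical, and the print call stands in for whatever real action (updating a proxy, a config store, and so on) a deployment needs.

```python
import sys
from typing import Dict, List, Union

EXIT_OK = 0
EXIT_RETRY = 1      # Sentinel retries the script later (up to 10 times)
EXIT_NO_RETRY = 2   # Sentinel does not retry

ARG_NAMES = ("master_name", "role", "state",
             "from_ip", "from_port", "to_ip", "to_port")


def parse_reconfig_args(argv: List[str]) -> Dict[str, Union[str, int]]:
    """Map the positional arguments passed by Sentinel to named fields."""
    if len(argv) != len(ARG_NAMES):
        raise ValueError(f"expected {len(ARG_NAMES)} arguments, got {len(argv)}")
    event: Dict[str, Union[str, int]] = dict(zip(ARG_NAMES, argv))
    event["from_port"] = int(argv[4])
    event["to_port"] = int(argv[6])
    return event


def main(argv: List[str]) -> int:
    try:
        event = parse_reconfig_args(argv)
    except ValueError as exc:
        # Malformed input will not improve on a retry, so ask for none.
        print(f"bad arguments: {exc}", file=sys.stderr)
        return EXIT_NO_RETRY
    # Placeholder action: a real script would repoint clients here and
    # return EXIT_RETRY on a transient failure.
    print(f"{event['master_name']}: master moved to "
          f"{event['to_ip']}:{event['to_port']}")
    return EXIT_OK

# When installed as the script, finish with: sys.exit(main(sys.argv[1:]))
```

The script should also finish well within the 60-second limit, since Sentinel kills it with SIGKILL after that.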

sentinel deny-scripts-reconfig: forbids changing notification-script and client-reconfig-script through SENTINEL SET.

Common Sentinel operations

  • PING This command simply returns PONG.
  • SENTINEL masters Show a list of monitored masters and their state.
  • SENTINEL master <master name> Show the state and info of the specified master.
  • SENTINEL slaves <master name> Show a list of slaves for this master, and their state.
  • SENTINEL sentinels <master name> Show a list of sentinel instances for this master, and their state.
  • SENTINEL get-master-addr-by-name <master name> Return the ip and port number of the master with that name. If a failover is in progress or terminated successfully for this master it returns the address and port of the promoted slave.
  • SENTINEL reset <pattern> This command will reset all the masters with matching name. The pattern argument is a glob-style pattern. The reset process clears any previous state in a master (including a failover in progress), and removes every slave and sentinel already discovered and associated with the master.
  • SENTINEL failover <master name> Force a failover as if the master was not reachable, and without asking for agreement to other Sentinels (however a new version of the configuration will be published so that the other Sentinels will update their configurations).
  • SENTINEL ckquorum <master name> Check if the current Sentinel configuration is able to reach the quorum needed to failover a master, and the majority needed to authorize the failover. This command should be used in monitoring systems to check if a Sentinel deployment is ok.
  • SENTINEL flushconfig Force Sentinel to rewrite its configuration on disk, including the current Sentinel state. Normally Sentinel rewrites the configuration every time something changes in its state (in the context of the subset of the state which is persisted on disk across restart). However sometimes it is possible that the configuration file is lost because of operation errors, disk failures, package upgrade scripts or configuration managers. In those cases a way to to force Sentinel to rewrite the configuration file is handy. This command works even if the previous configuration file is completely missing.
  • SENTINEL MONITOR <name> <ip> <port> <quorum> This command tells the Sentinel to start monitoring a new master with the specified name, ip, port, and quorum. It is identical to the sentinel monitor configuration directive in sentinel.conf configuration file
  • SENTINEL REMOVE <name> is used in order to remove the specified master: the master will no longer be monitored, and will totally be removed from the internal state of the Sentinel, so it will no longer listed by SENTINEL masters and so forth.
  • SENTINEL SET <name> <option> <value> The SET command is very similar to the CONFIG SET command of Redis, and is used in order to change configuration parameters of a specific master. Multiple option / value pairs can be specified (or none at all). All the configuration parameters that can be configured via sentinel.conf are also configurable using the SET command.

sentinel masters

Shows the state of the monitored masters.

127.0.0.1:> sentinel masters
) ) "name"
) "mymaster"
) "ip"
) "127.0.0.1"
) "port"
) ""
) "runid"
) "6ab2be5db3a37c10f2473c8fb9daed147a32df3e"
) "flags"
) "master"
) "link-pending-commands"
) ""
) "link-refcount"
) ""
) "last-ping-sent"
) ""
) "last-ok-ping-reply"
) ""
) "last-ping-reply"
) ""
) "down-after-milliseconds"
) ""
) "info-refresh"
) ""
) "role-reported"
) "master"
) "role-reported-time"
) ""
) "config-epoch"
) ""
) "num-slaves"
) ""
) "num-other-sentinels"
) ""
) "quorum"
) ""
) "failover-timeout"
) ""
) "parallel-syncs"
) ""
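The reply is a flat list that alternates field names and values, so client code usually folds it into a dictionary. A minimal sketch, assuming the reply has already been read as a list of strings; pairs_to_dict is a hypothetical helper, and the sample list uses only fields whose values are visible in the output above.

```python
from typing import Dict, List


def pairs_to_dict(flat_reply: List[str]) -> Dict[str, str]:
    """Fold ['name', 'mymaster', 'ip', ...] into {'name': 'mymaster', ...}."""
    if len(flat_reply) % 2 != 0:
        raise ValueError("reply must contain an even number of items")
    # Even positions are field names, odd positions are their values.
    return dict(zip(flat_reply[0::2], flat_reply[1::2]))


reply = ["name", "mymaster", "ip", "127.0.0.1", "flags", "master"]
info = pairs_to_dict(reply)
print(info["name"], info["flags"])  # prints: mymaster master
```

The same helper applies to each entry returned by sentinel slaves and sentinel sentinels, which use the same name/value layout.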

A single master can also be queried on its own.

sentinel master mymaster

sentinel slaves mymaster

Shows the state of a master's slaves.

127.0.0.1:> sentinel slaves mymaster
) ) "name"
) "127.0.0.1:6380"
) "ip"
) "127.0.0.1"
) "port"
) ""
) "runid"
) "983b87fd070c7f052b26f5135bbb30fdeb170a54"
) "flags"
) "slave"
) "link-pending-commands"
) ""
) "link-refcount"
) ""
) "last-ping-sent"
) ""
) "last-ok-ping-reply"
) ""
) "last-ping-reply"
) ""
) "down-after-milliseconds"
) ""
) "info-refresh"
) ""
) "role-reported"
) "slave"
) "role-reported-time"
) ""
) "master-link-down-time"
) ""
) "master-link-status"
) "ok"
) "master-host"
) "127.0.0.1"
) "master-port"
) ""
) "slave-priority"
) ""
) "slave-repl-offset"
) ""
) ) "name"
) "127.0.0.1:6381"
) "ip"
) "127.0.0.1"
) "port"
) ""
) "runid"
) "b88059cce9104dd4e0366afd6ad07a163dae8b15"
) "flags"
) "slave"
) "link-pending-commands"
) ""
) "link-refcount"
) ""
) "last-ping-sent"
) ""
) "last-ok-ping-reply"
) ""
) "last-ping-reply"
) ""
) "down-after-milliseconds"
) ""
) "info-refresh"
) ""
) "role-reported"
) "slave"
) "role-reported-time"
) ""
) "master-link-down-time"
) ""
) "master-link-status"
) "ok"
) "master-host"
) "127.0.0.1"
) "master-port"
) ""
) "slave-priority"
) ""
) "slave-repl-offset"
) ""

sentinel sentinels mymaster

Shows the state of the other Sentinels.

127.0.0.1:> sentinel sentinels mymaster
) ) "name"
) "738ccbddaa0d4379d89a147613d9aecfec765bcb"
) "ip"
) "127.0.0.1"
) "port"
) ""
) "runid"
) "738ccbddaa0d4379d89a147613d9aecfec765bcb"
) "flags"
) "sentinel"
) "link-pending-commands"
) ""
) "link-refcount"
) ""
) "last-ping-sent"
) ""
) "last-ok-ping-reply"
) ""
) "last-ping-reply"
) ""
) "down-after-milliseconds"
) ""
) "last-hello-message"
) ""
) "voted-leader"
) "?"
) "voted-leader-epoch"
) ""
) ) "name"
) "7251bb129ca373ad0d8c7baf3b6577ae2593079f"
) "ip"
) "127.0.0.1"
) "port"
) ""
) "runid"
) "7251bb129ca373ad0d8c7baf3b6577ae2593079f"
) "flags"
) "sentinel"
) "link-pending-commands"
) ""
) "link-refcount"
) ""
) "last-ping-sent"
) ""
) "last-ok-ping-reply"
) ""
) "last-ping-reply"
) ""
) "down-after-milliseconds"
) ""
) "last-hello-message"
) ""
) "voted-leader"
) "?"
) "voted-leader-epoch"
) ""

sentinel get-master-addr-by-name <master name>

Returns the IP address and port of the master named <master name>. If a failover is in progress, the new master's address is returned.

127.0.0.1:26379> sentinel get-master-addr-by-name mymaster
1) "127.0.0.1"
2) "6379"

sentinel reset <pattern>

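Resets the configuration of the masters matching <pattern> (glob style).

If a slave goes down it remains under Sentinel's management, so after it recovers it rejoins the previous replication setup even if its configuration file has no slaveof option. Likewise, if the master goes down, it rejoins the previous replication setup as a slave by default after it restarts.

Often, though, the goal is precisely to remove an old master or a slave, and this is where sentinel reset comes in. It resets the configuration based on the current state of the master (they'll refresh the list of slaves within the next 10 seconds, only adding the ones listed as correctly replicating from the current master INFO output). Crucially, slaves that are not in a healthy state are dropped from the current configuration, so a removed node does not rejoin the previous replication setup automatically after it recovers (make sure slaveof has been removed from its configuration file by then).

Note that the command only affects the Sentinel it is run on. To remove a node, run reset on every Sentinel.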
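Since reset has to be repeated on each Sentinel, it can help to generate the commands rather than type them. The sketch below only builds redis-cli command lines and does not run them; the port list comes from the topology table in this article, and reset_commands is a hypothetical helper.

```python
import shlex
from typing import List, Sequence

SENTINEL_PORTS = (26379, 26380, 26381)  # Sentinel-1..3 from the topology table


def reset_commands(master_pattern: str, host: str = "127.0.0.1",
                   ports: Sequence[int] = SENTINEL_PORTS) -> List[str]:
    """Build one 'sentinel reset' command line per Sentinel."""
    return [
        # shlex.join quotes arguments such as '*' so the shell leaves them alone.
        shlex.join(["redis-cli", "-h", host, "-p", str(port),
                    "sentinel", "reset", master_pattern])
        for port in ports
    ]


for line in reset_commands("mymaster"):
    print(line)
```

Running the printed lines one Sentinel at a time, and checking sentinel slaves in between, keeps a mistake from propagating to all Sentinels at once.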

sentinel failover <master name>

Forces a failover of the master named <master name>. Unlike a regular failover, no Sentinel leader election takes place; the current Sentinel carries out the failover directly.

sentinel ckquorum <master name>

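Checks whether enough Sentinels are currently reachable to meet the <quorum> and to reach the majority needed to authorize a failover.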

127.0.0.1:26379> sentinel ckquorum mymaster
OK 3 usable Sentinels. Quorum and failover authorization can be reached

sentinel flushconfig

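Forces the Sentinel to flush its configuration to disk. Sentinel itself does this frequently; developers and operators only need the command when an external cause, such as disk damage, has corrupted or lost the configuration file.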

sentinel remove <master name>

Stops the current Sentinel from monitoring the master named <master name>.

[root@slowtech redis-4.0.]# grep -Ev "^#|^$" sentinel_26379.conf
port
dir "/tmp"
sentinel myid 2467530fa249dbbc435c50fbb0dc2a4e766146f8
sentinel deny-scripts-reconfig yes
sentinel monitor mymaster 127.0.0.1
sentinel config-epoch mymaster
sentinel leader-epoch mymaster
sentinel known-slave mymaster 127.0.0.1
sentinel known-slave mymaster 127.0.0.1
sentinel known-sentinel mymaster 127.0.0.1 738ccbddaa0d4379d89a147613d9aecfec765bcb
sentinel known-sentinel mymaster 127.0.0.1 7251bb129ca373ad0d8c7baf3b6577ae2593079f
sentinel current-epoch
[root@slowtech redis-4.0.]# redis-cli -p
127.0.0.1:> sentinel remove mymaster
OK
127.0.0.1:> quit
[root@slowtech redis-4.0.]# grep -Ev "^#|^$" sentinel_26379.conf
port
dir "/tmp"
sentinel myid 2467530fa249dbbc435c50fbb0dc2a4e766146f8
sentinel deny-scripts-reconfig yes
sentinel current-epoch

sentinel set <name> <option> <value>

Parameter                  Usage
quorum                     sentinel set mymaster quorum 3
down-after-milliseconds    sentinel set mymaster down-after-milliseconds 30000
failover-timeout           sentinel set mymaster failover-timeout 18000
parallel-syncs             sentinel set mymaster parallel-syncs 3
notification-script        sentinel set mymaster notification-script /tmp/a.sh
client-reconfig-script     sentinel set mymaster client-reconfig-script /tmp/b.sh
auth-pass                  sentinel set mymaster auth-pass masterpassword

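Notes:

1. sentinel set only takes effect on the current Sentinel.

2. When sentinel set succeeds, the configuration file is rewritten immediately. This differs from a regular Redis data node, where config rewrite must be run after a change to persist it to the configuration file.

3. Keep the configuration of all Sentinels as consistent as possible.

4. Sentinel, as of the Redis 4.0 release used here, does not support the config command. To inspect a parameter's current value, use the SENTINEL MASTER command.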

References:

1. 《Redis开发与运维》

2. 《Redis设计与实现》

3. 《Redis 4.X Cookbook》

4. Official Redis documentation
