Redis Cluster Replication and Failover
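#### 1. Problems a cluster must solve
- 1. When a master node goes down, no node serves its slots, so the whole cluster enters the fail state and becomes unavailable. How is that handled?
- 2. How does the cluster decide that a master node is really down?
- 3. How is a suitable node elected as the new master from among the failed master's slaves?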
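#### 2. Cluster replication
- 1. Replication works the same way as master-slave replication on a standalone node.
- 2. A slave also runs in cluster mode, so configure it the same way as a master node.
- 3. Use `cluster meet` to add the node to the cluster.
- 4. On the node that is to become a slave, run `cluster replicate <node-id>`; this makes it a slave of the node identified by `node-id`.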
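The two commands above can be scripted. The following is a minimal sketch, assuming `redis-cli` is available on the PATH; the helper name `replica_setup_commands` is hypothetical, and the function only builds the argument lists instead of executing them.

```python
def replica_setup_commands(new_host, new_port, seed_host, seed_port, master_id):
    """Build the redis-cli invocations that attach a new node as a slave.

    Both commands are sent to the new node: first it is introduced to the
    cluster through a node that is already a member (the seed), then it is
    told which master to replicate.
    """
    base = ["redis-cli", "-h", new_host, "-p", str(new_port)]
    return [
        base + ["cluster", "meet", seed_host, str(seed_port)],
        base + ["cluster", "replicate", master_id],
    ]


if __name__ == "__main__":
    commands = replica_setup_commands(
        "172.16.10.154", 6379,   # the node that will become a slave
        "172.16.10.142", 6379,   # any node already in the cluster
        "d0beb418f4682c1d93ab133df804626252fbc265",
    )
    for argv in commands:
        print(" ".join(argv))
```

In practice the second command may be rejected until the new node has learned the master's ID through gossip, so a real script would likely wait or retry between the two steps.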
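#### 3. Failure detection
- 1. Every node in the cluster sends PING messages to the other nodes. If the matching PONG does not arrive within the configured timeout (`cluster-node-timeout`), the sender marks that node as suspected down (PFAIL).
- 2. Each PING carries information about the sender and its view of the cluster. This both checks liveness and keeps cluster state consistent across nodes, at the cost of some propagation delay.
- 3. Suspected down is not the same as down. A node is marked as really down (FAIL) only when the following condition holds:
    - More than half of the master nodes that have been assigned slots consider the node suspected down.
- 4. When a node learns from these messages that more than half of those masters report a node as suspected down, it broadcasts a message marking that node as down.
- 5. Nodes that receive the broadcast update their own cluster state to mark the node as down.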
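The quorum rule above can be sketched in a few lines. This is an illustration only, not Redis source code: the function names and the dictionary layout are hypothetical, and it assumes the majority is counted over all slot-owning masters, the suspected node included.

```python
def voting_masters(nodes):
    """Return the IDs of masters that own at least one slot."""
    return {
        node_id
        for node_id, info in nodes.items()
        if info["role"] == "master" and info["slots"] > 0
    }


def is_confirmed_fail(pfail_reports, nodes):
    """Decide whether a suspected node (PFAIL) becomes confirmed down (FAIL).

    pfail_reports: set of node IDs that currently flag the target as PFAIL.
    Only reports from slot-owning masters count, and more than half of
    those masters must agree.
    """
    voters = voting_masters(nodes)
    agreeing = pfail_reports & voters
    return len(agreeing) > len(voters) // 2


if __name__ == "__main__":
    nodes = {
        "A": {"role": "master", "slots": 5000},
        "B": {"role": "master", "slots": 5000},
        "C": {"role": "master", "slots": 5000},
        "D": {"role": "master", "slots": 1384},   # the node that stopped answering
        "E": {"role": "slave", "slots": 0},
    }
    print(is_confirmed_fail({"A", "B", "E"}, nodes))   # slave report is ignored
    print(is_confirmed_fail({"A", "B", "C"}, nodes))
```

With four slot-owning masters, three of them must report the node before it is marked FAIL; a report from a slave does not count toward that number.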
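#### 4. Failover
- 1. A new master is elected from among the failed master's slaves.
- 2. The elected slave promotes itself from slave to master, the equivalent of running `slaveof no one`.
- 3. It revokes all slot assignments of the failed master and assigns those slots to itself.
- 4. The new master broadcasts a PONG message to the cluster, telling every node that it is now a master and has taken over from the old one.
- 5. The new master starts accepting and processing commands for the slots it now owns.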
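To make the slot handover concrete, here is a small in-memory sketch of steps 2 to 4. The data layout and the name `promote_replica` are hypothetical; a real node changes its own internal state and sends a binary cluster-bus message rather than returning a dictionary.

```python
def promote_replica(cluster, replica_id):
    """Apply the failover steps to an in-memory cluster description.

    cluster: dict of node_id -> {"role", "master", "slots"}, where "slots"
    is a list of (start, end) ranges.
    Returns a dictionary standing in for the PONG announcement.
    """
    replica = cluster[replica_id]
    old_master = cluster[replica["master"]]

    # Step 2: stop being a slave.
    replica["role"] = "master"
    replica["master"] = None

    # Step 3: take over every slot range of the failed master.
    replica["slots"] = old_master["slots"]
    old_master["slots"] = []

    # Step 4: what the other nodes would learn from the broadcast.
    return {"type": "PONG", "sender": replica_id, "slots": list(replica["slots"])}


if __name__ == "__main__":
    cluster = {
        "d0beb418": {"role": "master", "master": None, "slots": [(461, 5460)]},
        "9555a765": {"role": "slave", "master": "d0beb418", "slots": []},
    }
    print(promote_replica(cluster, "9555a765"))
    print(cluster)
```

The shortened IDs in the demo mirror the first eight characters of the nodes used in the walkthrough later in this post.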
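#### 5. How a new master is elected from the slaves
- 1. Each election increments the configuration epoch by one; the epoch starts at 0.
- 2. Every node that is eligible to vote gets exactly one vote per election, granted on a first-come, first-served basis. A node is eligible when it meets both conditions:
    - Condition 1: it is a master node.
    - Condition 2: it has been assigned slots.
- 3. When a slave detects that its master is down, it broadcasts a `CLUSTERMSG_TYPE_FAILOVER_AUTH_REQUEST` message to the cluster, asking eligible nodes to vote for it.
- 4. If a receiving node is eligible and has not yet voted for another node in this epoch, it replies with a `CLUSTERMSG_TYPE_FAILOVER_AUTH_ACK` message to grant its vote.
- 5. The requesting slave counts the votes it receives. If it has votes from more than half of the eligible nodes, it becomes the master and carries out the failover.
- 6. If no slave reaches that majority, the configuration epoch is incremented and a new election is held.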
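The voting rules above can be simulated without a running cluster. The sketch below is illustrative only: `run_election` and `elect` are hypothetical names, vote requests are modeled as `(voter, candidate)` pairs in arrival order, and the majority is assumed to be counted over all slot-owning masters, including the failed one.

```python
def run_election(total_masters, requests):
    """Simulate the voting of a single epoch.

    total_masters: number of slot-owning masters in the cluster.
    requests: (voter_id, candidate_id) pairs, in the order each voter
    received the vote request.
    Returns the winning candidate, or None if nobody reached a majority.
    """
    voted = {}
    tally = {}
    for voter, candidate in requests:
        if voter in voted:
            continue  # this voter already spent its vote in this epoch
        voted[voter] = candidate
        tally[candidate] = tally.get(candidate, 0) + 1

    needed = total_masters // 2 + 1
    for candidate, votes in tally.items():
        if votes >= needed:
            return candidate
    return None


def elect(total_masters, rounds, epoch=0):
    """Retry with a higher epoch until one round produces a winner."""
    for requests in rounds:
        epoch += 1
        winner = run_election(total_masters, requests)
        if winner is not None:
            return winner, epoch
    return None, epoch


if __name__ == "__main__":
    split_vote = [("A", "R1"), ("B", "R2"), ("C", "R1")]
    clean_vote = [("A", "R1"), ("B", "R1"), ("C", "R1")]
    print(elect(4, [split_vote, clean_vote]))
```

In the demo, four masters own slots and one of them has failed, so a candidate needs all three remaining votes. The first round splits the vote and fails; the second round, one epoch later, succeeds.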
#### 6. A cluster failover in practice, with captured output
- 1. Current cluster state
```
127.0.0.1:6379> cluster nodes
9555a765592418cd9e975ace7df053d202bcc876 172.16.10.154:6379@16379 slave d0beb418f4682c1d93ab133df804626252fbc265 0 1542932936823 10 connected
0640b032745e6a0c5aae3fe1d5ad839a4f1fed7d 172.16.10.143:6379@16379 master - 0 1542932937000 2 connected 5803-10922
3e52dcc1694d213e47d0b7efbd1cb85fa69a8dba 172.16.10.142:6381@16381 slave 1f974aaa1ca841259b2f4c6ce6302f66b0aa2e27 0 1542932934000 11 connected
afac608e5f37e399fe11bc5125a6f5a6548deef4 172.16.10.153:6379@16379 slave fea781a76cf0dc84b047bdee3f82112be362d483 0 1542932937825 12 connected
d0beb418f4682c1d93ab133df804626252fbc265 172.16.10.142:6379@16379 myself,master - 0 1542932934000 9 connected 461-5460
1f974aaa1ca841259b2f4c6ce6302f66b0aa2e27 172.16.10.142:6380@16380 master - 0 1542932938827 11 connected 0-340 5461-5801 10923-11456
f7c1cf0ba03202dd9b6c520a038225d6dabbea5d 172.16.10.155:6379@16379 slave 0640b032745e6a0c5aae3fe1d5ad839a4f1fed7d 0 1542932935000 6 connected
fea781a76cf0dc84b047bdee3f82112be362d483 172.16.10.144:6379@16379 master - 0 1542932936000 12 connected 341-460 5802 11457-16383
```
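Each line of the `cluster nodes` output above has a fixed field order: node ID, address, flags, master ID, ping-sent, pong-received, config epoch, link state, then zero or more slot ranges. A minimal parser sketch follows; the function name and the returned dictionary keys are hypothetical, and it targets the `ip:port@cport` address form shown in this output.

```python
def parse_cluster_nodes_line(line):
    """Split one line of CLUSTER NODES output into named fields."""
    parts = line.split()
    node_id, addr, flags, master_id = parts[0], parts[1], parts[2], parts[3]

    # Address looks like "172.16.10.142:6379@16379".
    host_port, _, bus_part = addr.partition("@")
    host, _, port = host_port.rpartition(":")
    bus_port = bus_part.split(",")[0]  # ignore anything after a comma

    slots = []
    for token in parts[8:]:
        if token.startswith("["):
            continue  # migration markers are not owned slot ranges
        start, _, end = token.partition("-")
        slots.append((int(start), int(end or start)))

    return {
        "id": node_id,
        "host": host,
        "port": int(port),
        "bus_port": int(bus_port) if bus_port else None,
        "flags": flags.split(","),
        "master_id": None if master_id == "-" else master_id,
        "ping_sent": int(parts[4]),
        "pong_recv": int(parts[5]),
        "config_epoch": int(parts[6]),
        "link_state": parts[7],
        "slots": slots,
    }


if __name__ == "__main__":
    sample = (
        "d0beb418f4682c1d93ab133df804626252fbc265 172.16.10.142:6379@16379 "
        "myself,master - 0 1542932934000 9 connected 461-5460"
    )
    print(parse_cluster_nodes_line(sample))
```

A single-slot entry such as `5802` is returned as the range `(5802, 5802)`, so every entry in `slots` has the same shape.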
- 2. Kill the node `d0beb418f4682c1d93ab133df804626252fbc265 172.16.10.142:6379@16379 myself,master - 0 1542932934000 9 connected 461-5460`
```
# Log of the corresponding slave, 172.16.10.154:6379
1058:S 23 Nov 08:34:51.104 # Connection with master lost.
1058:S 23 Nov 08:34:51.104 * Caching the disconnected master state.
1058:S 23 Nov 08:34:51.240 * Connecting to MASTER 172.16.10.142:6379
1058:S 23 Nov 08:34:51.240 * MASTER <-> SLAVE sync started
1058:S 23 Nov 08:34:51.240 # Error condition on socket for SYNC: Connection refused
1058:S 23 Nov 08:35:08.287 * Connecting to MASTER 172.16.10.142:6379
1058:S 23 Nov 08:35:08.287 * MASTER <-> SLAVE sync started
1058:S 23 Nov 08:35:08.287 # Error condition on socket for SYNC: Connection refused
1058:S 23 Nov 08:35:09.173 * FAIL message received from fea781a76cf0dc84b047bdee3f82112be362d483 about d0beb418f4682c1d93ab133df804626252fbc265
1058:S 23 Nov 08:35:09.174 # Cluster state changed: fail
1058:S 23 Nov 08:35:09.189 # Start of election delayed for 830 milliseconds (rank #0, offset 1912008).
1058:S 23 Nov 08:35:09.289 * Connecting to MASTER 172.16.10.142:6379
1058:S 23 Nov 08:35:09.289 * MASTER <-> SLAVE sync started
1058:S 23 Nov 08:35:09.289 # Error condition on socket for SYNC: Connection refused
1058:S 23 Nov 08:35:10.091 # Starting a failover election for epoch 13.
1058:S 23 Nov 08:35:10.093 # Failover election won: I'm the new master.
1058:S 23 Nov 08:35:10.094 # configEpoch set to 13 after successful failover
1058:M 23 Nov 08:35:10.094 # Setting secondary replication ID to 1dff33da4b26e7a3b350cc5eb3d1d908d6a40915, valid up to offset: 1912009. New replication ID is b328006d8d93ecdfb0d98d55cc0a01b897b17d95
1058:M 23 Nov 08:35:10.094 * Discarding previously cached master state.
1058:M 23 Nov 08:35:10.094 # Cluster state changed: ok
```
About 20 seconds after losing the connection to its master (08:34:51 to 08:35:10), the slave is promoted to master.
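The elapsed time can be read straight from the log timestamps. The sketch below assumes the log layout shown above (`<pid>:<role> <day> <month> <time> <level> <message>`) and that both lines fall on the same day; the helper names are hypothetical.

```python
from datetime import datetime


def parse_log_time(line):
    """Extract the time of day from a log line in the layout shown above."""
    fields = line.split()
    # fields: ["1058:S", "23", "Nov", "08:34:51.104", "#", ...]
    return datetime.strptime(fields[3], "%H:%M:%S.%f")


def seconds_between(first_line, second_line):
    """Seconds elapsed between two log lines written on the same day."""
    delta = parse_log_time(second_line) - parse_log_time(first_line)
    return delta.total_seconds()


if __name__ == "__main__":
    lost = "1058:S 23 Nov 08:34:51.104 # Connection with master lost."
    won = "1058:S 23 Nov 08:35:10.093 # Failover election won: I'm the new master."
    print(seconds_between(lost, won))
```

For the two lines in the demo, taken from the slave log above, the difference is just under 19 seconds.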
- 3. Log of another node
```
14818:M 23 Nov 08:35:06.313 * FAIL message received from fea781a76cf0dc84b047bdee3f82112be362d483 about d0beb418f4682c1d93ab133df804626252fbc265
14818:M 23 Nov 08:35:06.313 # Cluster state changed: fail
14818:M 23 Nov 08:35:07.232 # Failover auth granted to 9555a765592418cd9e975ace7df053d202bcc876 for epoch 13
14818:M 23 Nov 08:35:07.234 # Cluster state changed: ok
```
- 4. Cluster state after the failover
```
172.16.10.153:6379> cluster nodes
fea781a76cf0dc84b047bdee3f82112be362d483 172.16.10.144:6379@16379 master - 0 1542933587000 12 connected 341-460 5802 11457-16383
afac608e5f37e399fe11bc5125a6f5a6548deef4 172.16.10.153:6379@16379 myself,slave fea781a76cf0dc84b047bdee3f82112be362d483 0 1542933586000 4 connected
3e52dcc1694d213e47d0b7efbd1cb85fa69a8dba 172.16.10.142:6381@16381 slave 1f974aaa1ca841259b2f4c6ce6302f66b0aa2e27 0 1542933587455 11 connected
1f974aaa1ca841259b2f4c6ce6302f66b0aa2e27 172.16.10.142:6380@16380 master - 0 1542933588456 11 connected 0-340 5461-5801 10923-11456
d0beb418f4682c1d93ab133df804626252fbc265 172.16.10.142:6379@16379 master,fail - 1542933288229 1542933287000 9 disconnected
0640b032745e6a0c5aae3fe1d5ad839a4f1fed7d 172.16.10.143:6379@16379 master - 0 1542933588000 2 connected 5803-10922
9555a765592418cd9e975ace7df053d202bcc876 172.16.10.154:6379@16379 master - 0 1542933590459 13 connected 461-5460
f7c1cf0ba03202dd9b6c520a038225d6dabbea5d 172.16.10.155:6379@16379 slave 0640b032745e6a0c5aae3fe1d5ad839a4f1fed7d 0 1542933589458 6 connected
```
The original slave 172.16.10.154:6379 has been promoted to master, and the original master 172.16.10.142:6379 is now in the `fail` state.
- 5. Restart the original master
```
# Log of the original master: it turns into a slave and does a full resync from the new master
24809:M 23 Nov 08:45:08.078 # Configuration change detected. Reconfiguring myself as a replica of 9555a765592418cd9e975ace7df053d202bcc876
24809:S 23 Nov 08:45:08.078 * Before turning into a slave, using my master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
24809:S 23 Nov 08:45:08.078 # Cluster state changed: ok
24809:S 23 Nov 08:45:09.079 * Connecting to MASTER 172.16.10.154:6379
24809:S 23 Nov 08:45:09.079 * MASTER <-> SLAVE sync started
24809:S 23 Nov 08:45:09.079 * Non blocking connect for SYNC fired the event.
24809:S 23 Nov 08:45:09.080 * Master replied to PING, replication can continue...
24809:S 23 Nov 08:45:09.080 * Trying a partial resynchronization (request 1aa4d36b7bf6dcf3e87f41c5768436b8e26ccbb5:1).
24809:S 23 Nov 08:45:09.082 * Full resync from master: b328006d8d93ecdfb0d98d55cc0a01b897b17d95:1912008
24809:S 23 Nov 08:45:09.082 * Discarding previously cached master state.
24809:S 23 Nov 08:45:09.100 * MASTER <-> SLAVE sync: receiving 178 bytes from master
24809:S 23 Nov 08:45:09.100 * MASTER <-> SLAVE sync: Flushing old data
24809:S 23 Nov 08:45:09.100 * MASTER <-> SLAVE sync: Loading DB in memory
24809:S 23 Nov 08:45:09.100 * MASTER <-> SLAVE sync: Finished with success
24809:S 23 Nov 08:45:09.100 * Background append only file rewriting started by pid 24813
24809:S 23 Nov 08:45:09.122 * AOF rewrite child asks to stop sending diffs.
24813:C 23 Nov 08:45:09.123 * Parent agreed to stop sending diffs. Finalizing AOF...
24813:C 23 Nov 08:45:09.123 * Concatenating 0.00 MB of AOF diff received from parent.
24813:C 23 Nov 08:45:09.123 * SYNC append only file rewrite performed
24813:C 23 Nov 08:45:09.123 * AOF rewrite: 0 MB of memory used by copy-on-write
24809:S 23 Nov 08:45:09.179 * Background AOF rewrite terminated with success
24809:S 23 Nov 08:45:09.179 * Residual parent diff successfully flushed to the rewritten AOF (0.00 MB)
24809:S 23 Nov 08:45:09.179 * Background AOF rewrite finished successfully
# Log of the new master
1058:M 23 Nov 08:45:10.982 * Clear FAIL state for node d0beb418f4682c1d93ab133df804626252fbc265: master without slots is reachable again.
1058:M 23 Nov 08:45:11.964 * Slave 172.16.10.142:6379 asks for synchronization
1058:M 23 Nov 08:45:11.965 * Partial resynchronization not accepted: Replication ID mismatch (Slave asked for '1aa4d36b7bf6dcf3e87f41c5768436b8e26ccbb5', my replication IDs are 'b328006d8d93ecdfb0d98d55cc0a01b897b17d95' and '1dff33da4b26e7a3b350cc5eb3d1d908d6a40915')
1058:M 23 Nov 08:45:11.965 * Starting BGSAVE for SYNC with target: disk
1058:M 23 Nov 08:45:11.966 * Background saving started by pid 4063
4063:C 23 Nov 08:45:11.968 * DB saved on disk
4063:C 23 Nov 08:45:11.969 * RDB: 6 MB of memory used by copy-on-write
1058:M 23 Nov 08:45:11.983 * Background saving terminated with success
1058:M 23 Nov 08:45:11.983 * Synchronization with slave 172.16.10.142:6379 succeeded
# Another node notices the recovery
14818:M 23 Nov 08:45:08.135 * Clear FAIL state for node d0beb418f4682c1d93ab133df804626252fbc265: master without slots is reachable again.
```
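The logs above show the new master rejecting a partial resynchronization because of a replication ID mismatch. A simplified sketch of that check follows. It is an assumption-laden illustration, not the Redis implementation: the function and parameter names are hypothetical, and it deliberately leaves out the additional check that the requested offset is still held in the replication backlog.

```python
def accepts_partial_resync(requested_id, requested_offset,
                           replid, replid2, second_replid_offset):
    """Simplified replication-ID check applied to a resync request.

    replid: the master's current replication ID.
    replid2: its previous (secondary) replication ID, which is only
    valid up to second_replid_offset.
    """
    if requested_id == replid:
        return True
    if requested_id == replid2 and requested_offset <= second_replid_offset:
        return True
    return False


if __name__ == "__main__":
    # Values copied from the logs above.
    ok = accepts_partial_resync(
        requested_id="1aa4d36b7bf6dcf3e87f41c5768436b8e26ccbb5",
        requested_offset=1,
        replid="b328006d8d93ecdfb0d98d55cc0a01b897b17d95",
        replid2="1dff33da4b26e7a3b350cc5eb3d1d908d6a40915",
        second_replid_offset=1912009,
    )
    print("partial resync" if ok else "full resync")
```

The restarted node asked for an ID that matches neither of the new master's two replication IDs, which is why the log reports `Full resync from master`.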
- 6. Cluster state after the original master rejoins
```
3e52dcc1694d213e47d0b7efbd1cb85fa69a8dba 172.16.10.142:6381@16381 slave 1f974aaa1ca841259b2f4c6ce6302f66b0aa2e27 0 1542934118000 11 connected
0640b032745e6a0c5aae3fe1d5ad839a4f1fed7d 172.16.10.143:6379@16379 master - 0 1542934119000 2 connected 5803-10922
9555a765592418cd9e975ace7df053d202bcc876 172.16.10.154:6379@16379 master - 0 1542934118417 13 connected 461-5460
fea781a76cf0dc84b047bdee3f82112be362d483 172.16.10.144:6379@16379 master - 0 1542934120421 12 connected 341-460 5802 11457-16383
afac608e5f37e399fe11bc5125a6f5a6548deef4 172.16.10.153:6379@16379 slave fea781a76cf0dc84b047bdee3f82112be362d483 0 1542934120000 12 connected
f7c1cf0ba03202dd9b6c520a038225d6dabbea5d 172.16.10.155:6379@16379 slave 0640b032745e6a0c5aae3fe1d5ad839a4f1fed7d 0 1542934116413 6 connected
1f974aaa1ca841259b2f4c6ce6302f66b0aa2e27 172.16.10.142:6380@16380 master - 0 1542934118000 11 connected 0-340 5461-5801 10923-11456
d0beb418f4682c1d93ab133df804626252fbc265 172.16.10.142:6379@16379 myself,slave 9555a765592418cd9e975ace7df053d202bcc876 0 1542934117000 9 connected
```
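A quick sanity check on the final state is that the four masters together still cover all 16384 slots. The sketch below does that arithmetic on the slot ranges listed above; `uncovered_slots` is a hypothetical helper name.

```python
TOTAL_SLOTS = 16384  # slots are numbered 0 through 16383


def uncovered_slots(owned_ranges):
    """Return the sorted list of slots that no (start, end) range covers."""
    covered = set()
    for start, end in owned_ranges:
        covered.update(range(start, end + 1))
    return sorted(set(range(TOTAL_SLOTS)) - covered)


if __name__ == "__main__":
    ranges = [
        (5803, 10922),                               # 172.16.10.143:6379
        (461, 5460),                                 # 172.16.10.154:6379, the new master
        (341, 460), (5802, 5802), (11457, 16383),    # 172.16.10.144:6379
        (0, 340), (5461, 5801), (10923, 11456),      # 172.16.10.142:6380
    ]
    missing = uncovered_slots(ranges)
    print("fully covered" if not missing else f"{len(missing)} slots uncovered")
```

Dropping the `(461, 5460)` range from the list reproduces the situation right after the original master was killed, when those 5000 slots had no owner and the cluster state changed to `fail`.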