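Troubleshooting Redis node failures in production: errors and recovery

Redis failure detection

Subjective offline (PFAIL)

If a node cannot complete a ping exchange with another node within cluster-node-timeout, it marks that node as subjectively offline.

Objective offline (FAIL)

Once a node considers another node subjectively offline, its failure report for that node spreads through Gossip messages. When a receiving node finds a subjectively offline node in a message body, it tries to mark that node objectively offline. This succeeds only if the failure reports are still valid (reports expire if reports from more than half of the slot-holding masters cannot be collected within cluster-node-timeout*2) and their number exceeds half of the total number of slot-holding masters. If both conditions hold, the node is updated to objectively offline and a fail message about it is broadcast to the cluster.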
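
The objective-offline rule above can be sketched as follows. This is a simplified illustration, not the Redis implementation: the class `FailureReports` and its method names are hypothetical, and times are plain millisecond integers.

```python
from dataclasses import dataclass, field


@dataclass
class FailureReports:
    """Failure reports received about one suspected node (hypothetical helper)."""

    node_timeout_ms: int
    reports: dict = field(default_factory=dict)  # reporter id -> arrival time (ms)

    def add(self, reporter_id: str, now_ms: int) -> None:
        # A newer report from the same master replaces the older one.
        self.reports[reporter_id] = now_ms

    def valid_count(self, now_ms: int) -> int:
        # Reports older than cluster-node-timeout * 2 have expired and are dropped.
        validity_ms = self.node_timeout_ms * 2
        self.reports = {
            reporter: t
            for reporter, t in self.reports.items()
            if now_ms - t <= validity_ms
        }
        return len(self.reports)

    def is_objectively_down(self, now_ms: int, slot_masters: int) -> bool:
        # Needs valid reports from more than half of the slot-holding masters.
        return self.valid_count(now_ms) * 2 > slot_masters
```

With three slot-holding masters, two valid reports are enough; if one of them expires before the second arrives, the node stays only subjectively offline.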

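Redis failure recovery

Once a failed node is objectively offline, and if it is a master that holds slots, one of its replicas (slaves) must be chosen to replace it so that the cluster stays highly available. The process is as follows:

Eligibility check

Each replica checks how long ago it last lost contact with its master to decide whether it is eligible to replace the failed master. If that time exceeds (cluster-node-timeout*cluster-slave-validity-factor), the replica is not eligible for failover.

Preparing the election time

Once a replica is eligible, it sets the time at which it will start the failover election; the later steps run only after that time is reached. This delayed trigger gives the replicas different election delays and so implements a priority order: a larger replication offset means the replica lags less behind the master, so it should have higher priority.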
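
A minimal sketch of the eligibility check and the delayed election time. The function name, the input shape, and the `jitter` parameter are hypothetical. The constants (a fixed 500 ms, a random 0 to 500 ms, and 1000 ms per rank) reflect my understanding of how Redis Cluster staggers replicas and should be read as an assumption, not as a quote from the Redis source.

```python
import random


def election_delays(replicas, node_timeout_ms, validity_factor, jitter=None):
    """Return {replica name: delay in ms before it may start an election}.

    replicas maps a name to (ms since last contact with the master,
    replication offset). Ineligible replicas are left out of the result.
    """
    if jitter is None:
        jitter = lambda: random.randint(0, 500)

    # Eligibility: disconnected for no longer than timeout * validity factor.
    limit_ms = node_timeout_ms * validity_factor
    eligible = {
        name: offset
        for name, (gap_ms, offset) in replicas.items()
        if gap_ms <= limit_ms
    }

    # Rank 0 is the replica with the largest replication offset.
    ranked = sorted(eligible, key=lambda name: eligible[name], reverse=True)
    return {name: 500 + jitter() + rank * 1000 for rank, name in enumerate(ranked)}
```

Under these assumed constants the best replica waits at most 1000 ms, and each lower-ranked replica waits one more second, which is what lets the most up-to-date replica usually win.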

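Starting the election

When a replica reaches its election time, it starts the election:
1. Update the configuration epoch
  
  The configuration epoch is an integer that only ever increases. Each master maintains its own configuration epoch, which identifies the version of that master; no two masters share the same configuration epoch, and a replica copies the epoch of its master. The cluster as a whole also maintains a global configuration epoch, which records the largest configuration epoch among all masters. On every major cluster event, such as a new master joining, a replica being promoted to master, or replicas competing in an election, the global configuration epoch is incremented and assigned to the master involved, which records the event.
2. Broadcast the election message
  
  The replica broadcasts an election message to the cluster and records that it has done so, which ensures that it can start only one election per configuration epoch.

Election voting

Only masters that hold slots handle failover election messages. Each slot-holding master has exactly one vote per configuration epoch: it replies to the first replica that requests its vote and ignores election messages from other replicas in the same epoch. The voting is in effect a leader election.

Each configuration epoch represents one election round. If the replica has not collected enough votes (a majority of the slot-holding masters) within cluster-node-timeout*2 after voting starts, the election is void. The replica then increments the configuration epoch and starts another round, until an election succeeds.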
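
The one-vote-per-epoch rule and the majority requirement can be sketched like this. `Master` and `run_election` are hypothetical names, and the model ignores timeouts and message loss.

```python
class Master:
    """A slot-holding master that grants at most one vote per epoch."""

    def __init__(self, name):
        self.name = name
        self.last_vote_epoch = 0

    def request_vote(self, candidate, epoch):
        # Refuse if this master already voted in this epoch or a later one.
        if epoch <= self.last_vote_epoch:
            return False
        self.last_vote_epoch = epoch
        return True


def run_election(candidate, epoch, reachable_masters, total_masters):
    """Return True if the candidate collects a majority of all masters."""
    votes = sum(1 for m in reachable_masters if m.request_vote(candidate, epoch))
    return votes >= total_masters // 2 + 1
```

In a cluster with three masters where one has failed, the two remaining masters can still form the required majority of two; a second replica asking in the same epoch gets no votes and has to retry with a higher epoch.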

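Replacing the master

The winning replica stops replicating and becomes a master, revokes the slots assigned to the failed master, assigns those slots to itself, and broadcasts to the cluster that it is now a master.

Redis failover time (failover-time)

Time to detect subjective offline: equal to cluster-node-timeout.
Time to propagate the subjective-offline state: at most cluster-node-timeout/2 (the messaging mechanism gives priority to nodes in an offline state).
Replica failover time: at most 1000 ms (the replica with the largest offset waits at most 1 second before starting the election, and the first election usually succeeds).

Therefore, failover-time (ms) <= cluster-node-timeout + cluster-node-timeout/2 + 1000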
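
As a worked example of the bound above, a small helper (hypothetical name) evaluates it for a given cluster-node-timeout. The 15000 ms used in the usage note is only an assumed setting, not a value taken from this incident.

```python
def max_failover_time_ms(cluster_node_timeout_ms):
    """Upper bound on failover time according to the formula above."""
    detect_ms = cluster_node_timeout_ms          # subjective-offline detection
    propagate_ms = cluster_node_timeout_ms / 2   # gossip propagation
    election_ms = 1000                           # best replica's election delay
    return detect_ms + propagate_ms + election_ms
```

For an assumed cluster-node-timeout of 15000 ms, `max_failover_time_ms(15000)` gives 15000 + 7500 + 1000 = 23500 ms.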

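Redis production node failure: errors and recovery

Background and analysis

This document records a Redis incident in production. Users could not log in or register, while some other business operations still worked.

We first looked at the Java logs and found the following errors:

Java error log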

{"jv_time":"2020-10-01 08:02:24.222","jv_level":"ERROR","jv_thread":"http-nio-8888-exec-33","jv_class":"org.apache.catalina.core.ContainerBase.[Tomcat].[localhost].[/member].[dispatcherServlet]","jv_method":"log","jv_message":"Servlet.service() for servlet [dispatcherServlet] in context with path [/member] threw exception","jv_throwable":"org.redisson.client.RedisTimeoutException: Redis server response timeout (10000 ms) occured for command: (HSET) with params: [redisson_spring_session:0e4db6b7-bb7a-4050-949f-d6b182e8ccd0, PooledUnsafeDirectByteBuf(ridx: 0, widx: 26, cap: 256), PooledUnsafeDirectByteBuf(ridx: 0, widx: 32, cap: 256)] channel: [id: 0xbfc0a52c, L:/172.20.20.9:22454 - R:/172.20.20.9:7002] at org.redisson.command.CommandAsyncService$11.run(CommandAsyncService.java:682) ~[redisson-3.5.7.jar!/:?] at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:680) ~[netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:755) ~[netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:483) ~[netty-all-4.1.42.Final.jar!/:4.1.42.Final] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172] "}

{"jv_time":"2020-10-01 08:08:12.935","jv_level":"ERROR","jv_thread":"redisson-netty-1-4","jv_class":"org.redisson.connection.pool.SlaveConnectionPool","jv_method":"checkForReconnect","jv_message":"host testsrv1982/172.20.20.9:7002 disconnected due to failedAttempts=3 limit reached","jv_throwable":"org.redisson.client.RedisTimeoutException: Command execution timeout for testsrv1982/172.20.20.9:7002 at org.redisson.client.RedisConnection$2.run(RedisConnection.java:212) ~[redisson-3.5.7.jar!/:?] at io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38) ~[netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:139) ~[netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) [netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:510) [netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:518) [netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044) [netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [netty-all-4.1.42.Final.jar!/:4.1.42.Final] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172] "}

{"jv_time":"2020-10-01 08:08:50.563","jv_level":"ERROR","jv_thread":"redisson-netty-1-14","jv_class":"org.redisson.connection.pool.SlaveConnectionPool","jv_method":"checkForReconnect","jv_message":"host testsrv1982/172.20.20.9:7002 disconnected due to failedAttempts=3 limit reached","jv_throwable":"org.redisson.client.RedisTimeoutException: Command execution timeout for testsrv1982/172.20.20.9:7002 at org.redisson.client.RedisConnection$2.run(RedisConnection.java:212) ~[redisson-3.5.7.jar!/:?] at io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38) ~[netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:139) ~[netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) [netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:510) [netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:518) [netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044) [netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [netty-all-4.1.42.Final.jar!/:4.1.42.Final] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172] "}

{"jv_time":"2020-10-01 08:08:59.953","jv_level":"ERROR","jv_thread":"redisson-netty-1-1","jv_class":"org.redisson.connection.pool.SlaveConnectionPool","jv_method":"checkForReconnect","jv_message":"host /172.20.20.9:7003 disconnected due to failedAttempts=3 limit reached","jv_throwable":"org.redisson.client.RedisTimeoutException: Command execution timeout for /172.20.20.9:7003 at org.redisson.client.RedisConnection$2.run(RedisConnection.java:212) ~[redisson-3.5.7.jar!/:?] at io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38) ~[netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:139) ~[netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) [netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:510) [netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:518) [netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044) [netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [netty-all-4.1.42.Final.jar!/:4.1.42.Final] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172] "}

{"jv_time":"2020-10-01 08:28:54.570","jv_level":"ERROR","jv_thread":"http-nio-8888-exec-608","jv_class":"org.apache.catalina.core.ContainerBase.[Tomcat].[localhost].[/member].[dispatcherServlet]","jv_method":"log","jv_message":"Servlet.service() for servlet [dispatcherServlet] in context with path [/member] threw exception","jv_throwable":"org.redisson.client.RedisTimeoutException: Unable to send command: (HGET) with params: [maintained-flag, PooledUnsafeDirectByteBuf(ridx: 0, widx: 5, cap: 256)] after 3 retry attempts at org.redisson.command.CommandAsyncService$8.run(CommandAsyncService.java:554) ~[redisson-3.5.7.jar!/:?] at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:680) ~[netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:755) ~[netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:483) ~[netty-all-4.1.42.Final.jar!/:4.1.42.Final] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172] "}

{"jv_time":"2020-10-01 08:28:54.570","jv_level":"ERROR","jv_thread":"http-nio-8888-exec-124","jv_class":"org.apache.catalina.core.ContainerBase.[Tomcat].[localhost].[/member].[dispatcherServlet]","jv_method":"log","jv_message":"Servlet.service() for servlet [dispatcherServlet] in context with path [/member] threw exception","jv_throwable":"org.redisson.client.RedisTimeoutException: Unable to send command: (HGET) with params: [maintained-flag, PooledUnsafeDirectByteBuf(ridx: 0, widx: 5, cap: 256)] after 3 retry attempts at org.redisson.command.CommandAsyncService$8.run(CommandAsyncService.java:554) ~[redisson-3.5.7.jar!/:?] at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:680) ~[netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:755) ~[netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:483) ~[netty-all-4.1.42.Final.jar!/:4.1.42.Final] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172] "}

{"jv_time":"2020-10-01 08:28:54.570","jv_level":"ERROR","jv_thread":"http-nio-8888-exec-379","jv_class":"org.apache.catalina.core.ContainerBase.[Tomcat].[localhost].[/member].[dispatcherServlet]","jv_method":"log","jv_message":"Servlet.service() for servlet [dispatcherServlet] in context with path [/member] threw exception","jv_throwable":"org.redisson.client.RedisTimeoutException: Unable to send command: (HGET) with params: [maintained-flag, PooledUnsafeDirectByteBuf(ridx: 0, widx: 5, cap: 256)] after 3 retry attempts at org.redisson.command.CommandAsyncService$8.run(CommandAsyncService.java:554) ~[redisson-3.5.7.jar!/:?] at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:680) ~[netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:755) ~[netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:483) ~[netty-all-4.1.42.Final.jar!/:4.1.42.Final] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172] "}

The Java error log contains two main kinds of error:

org.redisson.client.RedisTimeoutException: Redis server response timeout (10000 ms) occured for command: (HSET) with params: [redisson_spring_session:0e4db6b7-bb7a-4050-949f-d6b182e8ccd0, PooledUnsafeDirectByteBuf(ridx: 0, widx: 26, cap: 256), PooledUnsafeDirectByteBuf(ridx: 0, widx: 32, cap: 256)] channel: [id: 0xbfc0a52c, L:/172.20.20.9:22454 - R:/172.20.20.9:7002]

The first is a timeout while waiting for the Redis server's response (10000 ms), which caused the HSET write to fail.

org.redisson.client.RedisTimeoutException: Unable to send command: (HGET) with params: [maintained-flag, PooledUnsafeDirectByteBuf(ridx: 0, widx: 5, cap: 256)] after 3 retry attempts

The second is also a timeout: the HGET command could not be sent, and the query still failed after 3 retry attempts.
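
When there are many such records, it helps to count the timeouts per Redis command. The sketch below assumes one JSON object per line with a `jv_throwable` field, as in the log above; the function name `summarize` and the "unknown" bucket are hypothetical.

```python
import json
import re
from collections import Counter

# Matches "command: (HSET)" and "Unable to send command: (HGET)".
COMMAND_RE = re.compile(r"command: \((\w+)\)")


def summarize(lines):
    """Count RedisTimeoutException records per Redis command name."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        throwable = record.get("jv_throwable", "")
        if "RedisTimeoutException" not in throwable:
            continue
        match = COMMAND_RE.search(throwable)
        # Records such as "Command execution timeout for <host>" name no command.
        counts[match.group(1) if match else "unknown"] += 1
    return counts
```

Feeding the application log file to `summarize` line by line shows at a glance whether the timeouts hit writes (HSET), reads (HGET), or the connection checks that carry no command name.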

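Summary: the Java error logs point to the Redis cluster, so we then checked the Redis logs. They show the cluster state changing to fail: nodes became abnormal and the cluster started its automatic recovery, during which some Redis clients could not write or query data. That is why user registration and login failed. Next we analyze the logs of the cluster's automatic recovery.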

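Redis error log

The production Redis cluster entered the fail state, and the business appeared busy while the cluster recovered automatically. Possible causes of the node problems:

Business interfaces were being hammered, so a large number of queries piled up and became slow queries, which led to network timeouts

Use the slow-log command SLOWLOG GET to find which commands were slow enough to destabilize the cluster

http://redis.cn/commands/slowlog
A network timeout exceeded cluster-node-timeout, so Redis nodes were automatically marked offline and elections were triggered
RDB backups synchronized a large amount of data to the Redis server, which degraded server performance and caused network timeouts

Next we look at the Redis log and follow the cluster recovery process:

Redis cluster nodes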

redissrv1:7001> cluster nodes

191e4f15c0de39efe1ab9248bd413c9fdb2dddec 172.20.20.10:7005@17005 master - 0 1601964809000 11 connected 10923-16383

00fced34f12176c1a229211be9a3d579a51091df 172.20.20.9:7003@17003 slave 9a28b0084c1486ceb0bca3c45f2554d79041e57d 0 1601964810090 9 connected

d1fbc46714f89985ce45445b07355869085b2e7e 172.20.20.9:7002@17002 slave 191e4f15c0de39efe1ab9248bd413c9fdb2dddec 0 1601964809090 11 connected

9a28b0084c1486ceb0bca3c45f2554d79041e57d 172.20.20.10:7004@17004 master - 0 1601964807087 9 connected 5461-10922

599f153057aae0b5ef2f1d895a8e4ac7d0474cec 172.20.20.10:7006@17006 master - 0 1601964809000 10 connected 0-5460

219e9a5cb2898bf656ff4bf18eeadc1467ae8dd8 172.20.20.9:7001@17001 myself,slave 599f153057aae0b5ef2f1d895a8e4ac7d0474cec 0 1601964808000 1 connected
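
To read this output programmatically, for example to see which replica backs which master, a small parser is enough. This is a sketch: `ClusterNode`, `parse_cluster_nodes`, and `replicas_by_master` are hypothetical names, and only the fields visible in the output above are handled.

```python
from dataclasses import dataclass


@dataclass
class ClusterNode:
    node_id: str
    address: str       # ip:port, without the @cluster-bus-port suffix
    flags: list
    master_id: str     # "-" for masters
    config_epoch: int
    link_state: str
    slots: list        # slot ranges as strings, empty for replicas


def parse_cluster_nodes(text):
    """Parse the text returned by CLUSTER NODES into ClusterNode objects."""
    nodes = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 8:
            continue  # blank lines, prompts, or anything that is not a node line
        nodes.append(ClusterNode(
            node_id=parts[0],
            address=parts[1].split("@")[0],
            flags=parts[2].split(","),
            master_id=parts[3],
            config_epoch=int(parts[6]),
            link_state=parts[7],
            slots=parts[8:],
        ))
    return nodes


def replicas_by_master(nodes):
    """Map each master's address to the addresses of its replicas."""
    by_id = {n.node_id: n for n in nodes}
    result = {}
    for n in nodes:
        if "slave" in n.flags and n.master_id in by_id:
            result.setdefault(by_id[n.master_id].address, []).append(n.address)
    return result
```

Applied to the output above, this pairs each master on 172.20.20.10 (ports 7004 to 7006) with one replica on 172.20.20.9 (ports 7001 to 7003).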

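The log below covers only the node where the errors appeared. When one node in a Redis cluster has a problem, the other nodes also print related cluster errors; those are not repeated here.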

4233:C 01 Oct 07:58:09.005 * RDB: 8 MB of memory used by copy-on-write

8971:M 01 Oct 07:58:09.072 * Background saving terminated with success

8971:M 01 Oct 07:59:50.030 * 1000 changes in 100 seconds. Saving...

8971:M 01 Oct 07:59:50.182 * Background saving started by pid 4914

4914:C 01 Oct 07:59:53.778 * DB saved on disk

4914:C 01 Oct 07:59:53.796 * RDB: 7 MB of memory used by copy-on-write

8971:M 01 Oct 07:59:53.909 * Background saving terminated with success

8971:M 01 Oct 08:00:54.791 # Cluster state changed: fail

# After an RDB save, the cluster state suddenly changed to fail, which shows that the cluster was already unstable

8971:M 01 Oct 08:00:59.888 # Cluster state changed: ok

8971:M 01 Oct 08:01:33.895 * Marking node d1fbc46714f89985ce45445b07355869085b2e7e as failing (quorum reached).

# Node d1fbc46714f89985ce45445b07355869085b2e7e is marked objectively offline (FAIL); "quorum reached" means enough masters reported it as down

8971:M 01 Oct 08:01:38.483 * 1000 changes in 100 seconds. Saving...

8971:M 01 Oct 08:02:23.076 * Background saving started by pid 5614

8971:M 01 Oct 08:02:23.852 # Cluster state changed: fail

8971:M 01 Oct 08:02:24.282 * Marking node 00fced34f12176c1a229211be9a3d579a51091df as failing (quorum reached).

# Node 00fced34f12176c1a229211be9a3d579a51091df is marked objectively offline (FAIL); "quorum reached" means enough masters reported it as down

8971:M 01 Oct 08:02:24.370 * Clear FAIL state for node d1fbc46714f89985ce45445b07355869085b2e7e: is reachable again and nobody is serving its slots after some time.

# Node d1fbc46714f89985ce45445b07355869085b2e7e is reachable again (its ping messages are received), so its objective-offline state is cleared.

8971:M 01 Oct 08:02:25.158 # Failover auth denied to 191e4f15c0de39efe1ab9248bd413c9fdb2dddec: its master is up

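# The current node refuses the failover vote requested by 191e4f15c0de39efe1ab9248bd413c9fdb2dddec, because it still considers that node's master to be up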

8971:M 01 Oct 08:02:25.301 # Failover auth granted to 9a28b0084c1486ceb0bca3c45f2554d79041e57d for epoch 9

# The current node casts its vote for 9a28b0084c1486ceb0bca3c45f2554d79041e57d

8971:M 01 Oct 08:02:27.454 * Clear FAIL state for node 00fced34f12176c1a229211be9a3d579a51091df: master without slots is reachable again.

# Node 00fced34f12176c1a229211be9a3d579a51091df is reachable again (its ping messages are received), so its objective-offline state is cleared.

5614:C 01 Oct 08:02:28.802 * DB saved on disk

5614:C 01 Oct 08:02:28.822 * RDB: 8 MB of memory used by copy-on-write

8971:M 01 Oct 08:02:28.914 * Background saving terminated with success

8971:M 01 Oct 08:02:29.903 # Cluster state changed: ok

# After nodes went offline and a new master was promoted, the cluster state changed from fail back to ok.

8971:M 01 Oct 08:03:50.110 * 5000 changes in 80 seconds. Saving...

8971:M 01 Oct 08:07:29.150 * Background saving started by pid 5708

8971:M 01 Oct 08:07:29.663 # Connection with slave client id #4 lost.

8971:M 01 Oct 08:07:29.752 * Slave 172.20.20.10:7006 asks for synchronization

8971:M 01 Oct 08:07:29.782 * Partial resynchronization request from 172.20.20.10:7006 accepted. Sending 40417 bytes of backlog starting from offset 372035256846.

# This node performs a partial (incremental) resynchronization with node 7006

5708:C 01 Oct 08:07:34.672 * DB saved on disk

5708:C 01 Oct 08:07:34.692 * RDB: 12 MB of memory used by copy-on-write

8971:M 01 Oct 08:07:34.815 * Background saving terminated with success

8971:M 01 Oct 08:08:55.159 * 5000 changes in 80 seconds. Saving...

8971:M 01 Oct 08:08:55.468 * Background saving started by pid 6485

6485:C 01 Oct 08:09:00.333 * DB saved on disk

6485:C 01 Oct 08:09:00.349 * RDB: 15 MB of memory used by copy-on-write

8971:M 01 Oct 08:09:00.439 * Background saving terminated with success

8971:M 01 Oct 08:09:37.073 * Marking node 00fced34f12176c1a229211be9a3d579a51091df as failing (quorum reached).

# Node 00fced34f12176c1a229211be9a3d579a51091df is marked objectively offline (FAIL) again

8971:M 01 Oct 08:10:25.863 * 5000 changes in 80 seconds. Saving...

8971:M 01 Oct 08:12:02.845 * Background saving started by pid 6751

8971:M 01 Oct 08:12:04.741 # Connection with slave client id #8319 lost.

8971:M 01 Oct 08:12:04.777 # Configuration change detected. Reconfiguring myself as a replica of 599f153057aae0b5ef2f1d895a8e4ac7d0474cec

# The current node 219e9a5cb2898bf656ff4bf18eeadc1467ae8dd8 is reconfigured from master to replica of 599f153057aae0b5ef2f1d895a8e4ac7d0474cec

8971:S 01 Oct 08:12:04.787 * Before turning into a slave, using my master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.

8971:S 01 Oct 08:12:05.007 * Clear FAIL state for node 00fced34f12176c1a229211be9a3d579a51091df: slave is reachable again.

8971:S 01 Oct 08:12:05.895 * Connecting to MASTER 172.20.20.10:7006

# This node connects to its master

8971:S 01 Oct 08:12:06.350 * MASTER <-> SLAVE sync started

# This node starts synchronizing data from its master

8971:S 01 Oct 08:12:06.543 * Non blocking connect for SYNC fired the event.

8971:S 01 Oct 08:12:06.543 * Master replied to PING, replication can continue...

8971:S 01 Oct 08:12:06.544 * Trying a partial resynchronization (request 57b5cf8f66406836b49f26855b77a6f61e7653fc:372038922930).

8971:S 01 Oct 08:12:06.559 * Full resync from master: 1a6c129f8d1de27cb06714b27c46a420eae014e2:372038873893

8971:S 01 Oct 08:12:06.559 * Discarding previously cached master state.

8971:S 01 Oct 08:12:10.182 * MASTER <-> SLAVE sync: receiving 165760616 bytes from master

6751:C 01 Oct 08:12:10.661 * DB saved on disk

6751:C 01 Oct 08:12:14.532 * RDB: 11 MB of memory used by copy-on-write

8971:S 01 Oct 08:12:29.561 * Background saving terminated with success

8971:S 01 Oct 08:12:37.365 * FAIL message received from 599f153057aae0b5ef2f1d895a8e4ac7d0474cec about d1fbc46714f89985ce45445b07355869085b2e7e

8971:S 01 Oct 08:12:37.764 * FAIL message received from 599f153057aae0b5ef2f1d895a8e4ac7d0474cec about 00fced34f12176c1a229211be9a3d579a51091df

8971:S 01 Oct 08:12:49.951 * Clear FAIL state for node 00fced34f12176c1a229211be9a3d579a51091df: slave is reachable again.

8971:S 01 Oct 08:12:50.674 * Clear FAIL state for node d1fbc46714f89985ce45445b07355869085b2e7e: slave is reachable again.

8971:S 01 Oct 08:14:08.346 * FAIL message received from 599f153057aae0b5ef2f1d895a8e4ac7d0474cec about 00fced34f12176c1a229211be9a3d579a51091df

8971:S 01 Oct 08:14:13.072 * Clear FAIL state for node 00fced34f12176c1a229211be9a3d579a51091df: slave is reachable again.

8971:S 01 Oct 08:14:40.230 * FAIL message received from 599f153057aae0b5ef2f1d895a8e4ac7d0474cec about 00fced34f12176c1a229211be9a3d579a51091df

8971:S 01 Oct 08:14:42.439 * Clear FAIL state for node 00fced34f12176c1a229211be9a3d579a51091df: slave is reachable again.

8971:S 01 Oct 08:16:35.122 * FAIL message received from 191e4f15c0de39efe1ab9248bd413c9fdb2dddec about 00fced34f12176c1a229211be9a3d579a51091df

8971:S 01 Oct 08:16:38.671 * FAIL message received from 191e4f15c0de39efe1ab9248bd413c9fdb2dddec about d1fbc46714f89985ce45445b07355869085b2e7e

8971:S 01 Oct 08:16:42.318 * 50 changes in 250 seconds. Saving...

8971:S 01 Oct 08:19:59.146 * Background saving started by pid 7514

8971:S 01 Oct 08:20:00.886 * Clear FAIL state for node 00fced34f12176c1a229211be9a3d579a51091df: slave is reachable again.

8971:S 01 Oct 08:20:01.319 * Clear FAIL state for node d1fbc46714f89985ce45445b07355869085b2e7e: slave is reachable again.

7514:C 01 Oct 08:20:04.454 * DB saved on disk

7514:C 01 Oct 08:20:04.514 * RDB: 11 MB of memory used by copy-on-write

8971:S 01 Oct 08:20:05.327 * Background saving terminated with success

8971:S 01 Oct 08:22:20.042 * FAIL message received from 9a28b0084c1486ceb0bca3c45f2554d79041e57d about 00fced34f12176c1a229211be9a3d579a51091df

8971:S 01 Oct 08:24:08.344 * Clear FAIL state for node 00fced34f12176c1a229211be9a3d579a51091df: slave is reachable again.

8971:S 01 Oct 08:25:27.485 * FAIL message received from 9a28b0084c1486ceb0bca3c45f2554d79041e57d about d1fbc46714f89985ce45445b07355869085b2e7e

8971:S 01 Oct 08:25:45.951 * FAIL message received from 599f153057aae0b5ef2f1d895a8e4ac7d0474cec about 00fced34f12176c1a229211be9a3d579a51091df

8971:S 01 Oct 08:25:54.469 * Clear FAIL state for node d1fbc46714f89985ce45445b07355869085b2e7e: slave is reachable again.

8971:S 01 Oct 08:26:18.242 # Cluster state changed: fail

8971:S 01 Oct 08:26:20.579 * FAIL message received from 9a28b0084c1486ceb0bca3c45f2554d79041e57d about d1fbc46714f89985ce45445b07355869085b2e7e

8971:S 01 Oct 08:26:28.343 # Cluster state changed: ok

8971:S 01 Oct 08:27:11.280 # Cluster state changed: fail

8971:S 01 Oct 08:27:18.984 # Cluster state changed: ok

8971:S 01 Oct 08:27:25.077 * Clear FAIL state for node d1fbc46714f89985ce45445b07355869085b2e7e: slave is reachable again.

8971:S 01 Oct 08:27:25.191 * Clear FAIL state for node 00fced34f12176c1a229211be9a3d579a51091df: slave is reachable again.

8971:S 01 Oct 08:27:28.897 * MASTER <-> SLAVE sync: Flushing old data

8971:S 01 Oct 08:27:34.787 * MASTER <-> SLAVE sync: Loading DB in memory

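# Up to this point the log consists entirely of nodes going offline and online and of synchronization; here the node starts loading the received data into memory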

8971:S 01 Oct 08:28:28.832 * FAIL message received from 599f153057aae0b5ef2f1d895a8e4ac7d0474cec about d1fbc46714f89985ce45445b07355869085b2e7e

8971:S 01 Oct 08:28:28.859 * FAIL message received from 9a28b0084c1486ceb0bca3c45f2554d79041e57d about 00fced34f12176c1a229211be9a3d579a51091df

8971:S 01 Oct 08:28:37.881 * MASTER <-> SLAVE sync: Finished with success

# Data synchronization between the master and this replica finished successfully.

8971:S 01 Oct 08:28:37.898 * Background append only file rewriting started by pid 10524

8971:S 01 Oct 08:28:38.733 * Clear FAIL state for node d1fbc46714f89985ce45445b07355869085b2e7e: slave is reachable again.

8971:S 01 Oct 08:28:39.330 * Clear FAIL state for node 00fced34f12176c1a229211be9a3d579a51091df: slave is reachable again.

8971:S 01 Oct 08:28:43.029 * AOF rewrite child asks to stop sending diffs.

10524:C 01 Oct 08:28:43.040 * Parent agreed to stop sending diffs. Finalizing AOF...

10524:C 01 Oct 08:28:43.040 * Concatenating 6.43 MB of AOF diff received from parent.

10524:C 01 Oct 08:28:43.087 * SYNC append only file rewrite performed

10524:C 01 Oct 08:28:43.104 * AOF rewrite: 112 MB of memory used by copy-on-write

8971:S 01 Oct 08:28:43.332 * Background AOF rewrite terminated with success

8971:S 01 Oct 08:28:43.333 * Residual parent diff successfully flushed to the rewritten AOF (0.02 MB)

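# These lines are the AOF rewrite: a child process rewrites the AOF file, and the changes received from the parent during the rewrite are appended to the new file.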

8971:S 01 Oct 08:28:43.333 * Background AOF rewrite finished successfully

8971:S 01 Oct 08:28:43.434 * 10 changes in 300 seconds. Saving...

8971:S 01 Oct 08:28:43.451 * Background saving started by pid 11643

11643:C 01 Oct 08:28:48.797 * DB saved on disk

11643:C 01 Oct 08:28:48.837 * RDB: 14 MB of memory used by copy-on-write

8971:S 01 Oct 08:28:48.916 * Background saving terminated with success

8971:S 01 Oct 08:30:09.007 * 5000 changes in 80 seconds. Saving...

8971:S 01 Oct 08:30:09.024 * Background saving started by pid 12605

12605:C 01 Oct 08:30:16.128 * DB saved on disk

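Summary:

Redis Cluster provides high availability by itself: when a small number of nodes fail, automatic failover keeps the cluster serving requests.

In the log, the automatic recovery goes through nodes being marked offline, voting to promote a new master, and nodes coming back online, and it ends with the new slot assignment being synchronized and the data being verified. All of these steps are performed by Redis itself.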
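
To quantify how often each node flapped in a log like the one above, the FAIL and "Clear FAIL" lines can be counted per node. This is a sketch: `count_flaps` is a hypothetical name, and node IDs are shortened to their first 8 characters for readability.

```python
import re
from collections import Counter

# A node enters FAIL either locally ("Marking node ...") or via a broadcast.
FAIL_RE = re.compile(
    r"Marking node (\w+) as failing|FAIL message received from \w+ about (\w+)"
)
CLEAR_RE = re.compile(r"Clear FAIL state for node (\w+)")


def count_flaps(log_lines):
    """Return (failed, cleared) Counters keyed by shortened node ID."""
    failed, cleared = Counter(), Counter()
    for line in log_lines:
        match = FAIL_RE.search(line)
        if match:
            node_id = match.group(1) or match.group(2)
            failed[node_id[:8]] += 1
            continue
        match = CLEAR_RE.search(line)
        if match:
            cleared[match.group(1)[:8]] += 1
    return failed, cleared
```

A node whose FAIL count and clear count are both high was repeatedly timing out and recovering, which points to overload or network latency more than to a crashed process.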

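Error analysis summary

Problem summary:

Based on the Java and Redis logs, together with the registration and login failures seen in production: while the Redis cluster was synchronizing and promoting a new master, slot ownership changed, and during data synchronization the cluster consumed too many system resources. As a result, the user service timed out when accessing cluster nodes and failed to reach the correct slots, which caused the query errors.

Resolution:

After the cluster nodes recovered, the production service still timed out and did not work properly. Our assessment is that the user service was still using the old slot assignment and kept sending requests to the old slot owners, so Redis kept receiving queries and writes it could not serve, which drove system resource usage up.

We restarted the user service so that it connected using the new state of the Redis nodes and wrote data to the correct slots, which resolved the problem.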
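
For background on why a stale slot map matters: Redis Cluster assigns every key to one of 16384 slots using CRC16 of the key modulo 16384, and a client sends each command to the node it believes owns that slot. The sketch below illustrates the mapping. The function names are hypothetical, and `SLOT_RANGES` is copied from the cluster nodes output earlier in this document, so it reflects the state after recovery.

```python
def crc16_xmodem(data):
    """CRC16 (XMODEM variant, polynomial 0x1021), as used for key hashing."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_slot(key):
    """Return the cluster slot (0..16383) for a key, honoring {hash tags}."""
    raw = key.encode()
    start = raw.find(b"{")
    if start != -1:
        end = raw.find(b"}", start + 1)
        if end > start + 1:  # only a non-empty tag is used
            raw = raw[start + 1:end]
    return crc16_xmodem(raw) % 16384


# Slot ranges and owners taken from the cluster nodes output above.
SLOT_RANGES = [
    (0, 5460, "172.20.20.10:7006"),
    (5461, 10922, "172.20.20.10:7004"),
    (10923, 16383, "172.20.20.10:7005"),
]


def owner_of(key, ranges=SLOT_RANGES):
    """Return the address a client with this slot map would contact."""
    slot = key_slot(key)
    for first, last, address in ranges:
        if first <= slot <= last:
            return address
    return None
```

A client holding an outdated version of such a table keeps targeting the previous owner of a slot until it refreshes the table, which is consistent with the restart fixing the problem here.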

第一步去官网挑选自己需要的图形了. 官网地址:https://echarts.baidu.com/ 第二步按照官网的demo,下载需要的js组件并放到自己的项目中. 第三步仿照demo创建需要的div ...