Background

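The disks filled up and every server in the Kafka cluster went down. After restarting the cluster, the services started successfully, but some errors were reported: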

broker1:

broker2:

broker3: kept printing the following error messages

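Despite these errors, Kafka started normally, and a command-line test confirmed that the cluster could produce and consume messages. The kafka-manager UI, however, showed an abnormal state with unassigned replicas: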

The programs consuming these topics had indeed failed and kept printing the following exception:

Note: the IP in the screenshot is the broker3 node.

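At this point it is clear that broker3 has a problem that prevents the consumer programs from connecting. Strangely, though, a test topic created from the command line could still be consumed on broker3.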

Further analysis of the broker3 log showed the cause of the error: the cluster requires 2 replicas, but only 1 was found.
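The rule behind this kind of error can be sketched as follows. This is a simplified Python model, not Kafka's actual code. It assumes the setting involved is `min.insync.replicas` (the post does not name the property) and that the write requires acknowledgement from all in-sync replicas. The function name `can_accept_write` and the broker ids are made up for illustration.

```python
def can_accept_write(isr, min_insync_replicas):
    """Simplified model: a write that needs acks from all in-sync replicas
    is accepted only if the ISR holds at least min_insync_replicas brokers."""
    return len(isr) >= min_insync_replicas


# Hypothetical partition: replicas [1, 3] are assigned, but only broker 1 is in sync.
print(can_accept_write(isr=[1], min_insync_replicas=2))     # rejected: 1 < 2
print(can_accept_write(isr=[1], min_insync_replicas=1))     # accepted
print(can_accept_write(isr=[1, 3], min_insync_replicas=2))  # accepted
```

Under this model, a partition with a single in-sync replica keeps rejecting such writes until either the second replica rejoins the ISR or the required minimum is lowered.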

Checking the details of the affected topics confirmed that replicas were indeed missing from the ISR list.
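A check like this can be automated by comparing the `Replicas` and `Isr` columns in the output of `kafka-topics.sh --describe`. The sketch below is a minimal Python example under stated assumptions: the sample text, broker ids, and partition numbers are invented, and the helper name `under_replicated` is hypothetical. Only the column layout follows what the describe command prints.

```python
import re

# Invented sample in the tab-separated layout of `kafka-topics.sh --describe`.
SAMPLE = """\
Topic:__consumer_offsets\tPartitionCount:3\tReplicationFactor:2\tConfigs:
\tTopic: __consumer_offsets\tPartition: 0\tLeader: 1\tReplicas: 1,3\tIsr: 1
\tTopic: __consumer_offsets\tPartition: 1\tLeader: 2\tReplicas: 2,1\tIsr: 2,1
\tTopic: __consumer_offsets\tPartition: 2\tLeader: 2\tReplicas: 3,2\tIsr: 2
"""

# Matches per-partition lines only; the summary line has no "Partition:" field.
LINE_RE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>-?\d+)\s+Replicas:\s*(?P<replicas>[\d,]+)\s+"
    r"Isr:\s*(?P<isr>[\d,]*)"
)


def under_replicated(describe_output):
    """Return (topic, partition, missing_broker_ids) for every partition
    whose ISR does not contain all of its assigned replicas."""
    result = []
    for line in describe_output.splitlines():
        m = LINE_RE.search(line)
        if not m:
            continue
        replicas = [int(x) for x in m.group("replicas").split(",") if x]
        isr = [int(x) for x in m.group("isr").split(",") if x]
        missing = [b for b in replicas if b not in isr]
        if missing:
            result.append((m.group("topic"), int(m.group("partition")), missing))
    return result


for topic, partition, missing in under_replicated(SAMPLE):
    print(f"{topic}-{partition}: replicas missing from ISR: {missing}")
```

In practice the text would come from the real describe output rather than from `SAMPLE`. The describe command also has an `--under-replicated-partitions` option that filters on the server side, which may be simpler if it is available in the installed version.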

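My guess was that after the crash some replicas had fallen too far behind the leader and had not caught up yet, so they had dropped out of the ISR list. I decided to wait for them to catch up on their own.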

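On the second day nothing had changed: they still had not caught up. I restarted the Kafka cluster and found that the ISR of some partitions grew back to 2 automatically, but the problematic partitions still did not recover.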

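Next I tried reassigning the partitions with explicitly specified replicas, to see whether that would make the replicas recover, using the following command: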

bin/kafka-reassign-partitions.sh --zookeeper 10.0.xx.x:,10.0.xx.x:,10.0.xx.x: --reassignment-json-file reassign.json --execute
Contents of reassign.json:
{"version":, "partitions":[
{"topic":"__consumer_offsets","partition":,"replicas":[,]},
{"topic":"__consumer_offsets","partition":,"replicas":[,]},
{"topic":"__consumer_offsets","partition":,"replicas":[,]},
{"topic":"__consumer_offsets","partition":,"replicas":[,]},
{"topic":"__consumer_offsets","partition":,"replicas":[,]},
{"topic":"__consumer_offsets","partition":,"replicas":[,]},
{"topic":"__consumer_offsets","partition":,"replicas":[,]},
{"topic":"__consumer_offsets","partition":,"replicas":[,]},
{"topic":"__consumer_offsets","partition":,"replicas":[,]},
{"topic":"__consumer_offsets","partition":,"replicas":[,]},
{"topic":"__consumer_offsets","partition":,"replicas":[,]},
{"topic":"__consumer_offsets","partition":,"replicas":[,]},
{"topic":"__consumer_offsets","partition":,"replicas":[,]},
{"topic":"__consumer_offsets","partition":,"replicas":[,]},
{"topic":"__consumer_offsets","partition":,"replicas":[,]},
{"topic":"__consumer_offsets","partition":,"replicas":[,]},
{"topic":"__consumer_offsets","partition":,"replicas":[,]},
{"topic":"__consumer_offsets","partition":,"replicas":[,]},
{"topic":"__consumer_offsets","partition":,"replicas":[,]},
{"topic":"__consumer_offsets","partition":,"replicas":[,]},
{"topic":"__consumer_offsets","partition":,"replicas":[,]},
{"topic":"__consumer_offsets","partition":,"replicas":[,]},
{"topic":"__consumer_offsets","partition":,"replicas":[,]},
{"topic":"__consumer_offsets","partition":,"replicas":[,]},
{"topic":"__consumer_offsets","partition":,"replicas":[,]},
{"topic":"__consumer_offsets","partition":,"replicas":[,]},
{"topic":"__consumer_offsets","partition":,"replicas":[,]},
{"topic":"__consumer_offsets","partition":,"replicas":[,]},
{"topic":"__consumer_offsets","partition":,"replicas":[,]},
{"topic":"__consumer_offsets","partition":,"replicas":[,]},
{"topic":"__consumer_offsets","partition":,"replicas":[,]},
{"topic":"__consumer_offsets","partition":,"replicas":[,]},
{"topic":"__consumer_offsets","partition":,"replicas":[,]},
{"topic":"__consumer_offsets","partition":,"replicas":[,]},
{"topic":"__consumer_offsets","partition":,"replicas":[,]},
{"topic":"__consumer_offsets","partition":,"replicas":[,]},
{"topic":"__consumer_offsets","partition":,"replicas":[,]},
{"topic":"__consumer_offsets","partition":,"replicas":[,]},
{"topic":"__consumer_offsets","partition":,"replicas":[,]},
{"topic":"__consumer_offsets","partition":,"replicas":[,]},
{"topic":"__consumer_offsets","partition":,"replicas":[,]},
{"topic":"__consumer_offsets","partition":,"replicas":[,]},
{"topic":"__consumer_offsets","partition":,"replicas":[,]},
{"topic":"__consumer_offsets","partition":,"replicas":[,]},
{"topic":"__consumer_offsets","partition":,"replicas":[,]},
{"topic":"__consumer_offsets","partition":,"replicas":[,]},
{"topic":"__consumer_offsets","partition":,"replicas":[,]},
{"topic":"__consumer_offsets","partition":,"replicas":[,]},
{"topic":"__consumer_offsets","partition":,"replicas":[,]},
{"topic":"__consumer_offsets","partition":,"replicas":[,]}
]}
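A file of this shape does not have to be written by hand. The following Python sketch generates one entry per partition. Everything concrete in it is an assumption: the broker ids `[1, 2, 3]`, the round-robin placement, and the replication factor of 2 are illustrative and are not the values from the original file, and `"version": 1` is the value the reassignment tool's JSON format uses. The count of 50 partitions matches the 50 entries listed above. `build_reassignment` is a hypothetical helper name.

```python
import json


def build_reassignment(topic, num_partitions, broker_ids, replication_factor):
    """Build a reassignment plan that spreads replicas round-robin over broker_ids."""
    if replication_factor > len(broker_ids):
        raise ValueError("replication_factor cannot exceed the number of brokers")
    partitions = []
    for p in range(num_partitions):
        # Start at a different broker for each partition so leaders are spread out.
        replicas = [broker_ids[(p + i) % len(broker_ids)]
                    for i in range(replication_factor)]
        partitions.append({"topic": topic, "partition": p, "replicas": replicas})
    return {"version": 1, "partitions": partitions}


# Assumed values for illustration only.
plan = build_reassignment("__consumer_offsets", 50, [1, 2, 3], 2)
print(json.dumps(plan, indent=1))
```

Redirecting the output to `reassign.json` gives a file that can be passed to `--reassignment-json-file`. After running with `--execute`, the same tool can be run with `--verify` and the same file to check whether the reassignment has completed.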

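Reassigning the partitions with explicit replicas did not work either, so I changed the Kafka configuration and lowered the number of replicas the cluster requires to 1: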

vi server.properties

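After restarting the Kafka cluster, broker3 no longer reported errors, and once the consumer programs were restarted they could connect to Kafka and consume normally again.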

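Summary:

After a Kafka crash, replicas drop out of the ISR list because they have fallen too far behind the leader. Normally they should gradually catch up and then rejoin the ISR list automatically, but in my case they still had not done so after 20 hours, and restarting the Kafka cluster did not restore them either. This is what caused the service problems.

The temporary workaround is to lower the setting to 1, let the cluster run for a while to see whether the replicas recover, and then set it back to 2.

Open issues:

1. The root cause has not been found yet.

2. With this adjustment Kafka can lose data (all data from the period of the incident was lost).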
