Reference: https://blog.csdn.net/xjping0794/article/details/77784171

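1.1            Operating system information
1.1.1          CPU information

The monitoring data shows that CPU utilization was not high when the problem occurred, so the CPU is ruled out as the cause.

1.1.2          Memory information

The CM monitoring interface has no memory information, so no screenshot can be provided.

However, memory on the machines was monitored during the incident and free memory remained, so memory is ruled out as well.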

1.2            ZKFAILOVER log information

2017-09-01 09:45:57,390 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 1668ms for sessionid 0x0, closing socket connection and attempting reconnect

2017-09-01 09:45:58,224 ERROR org.apache.hadoop.ha.ActiveStandbyElector: Connection timed out: couldn't connect to ZooKeeper in 5000 milliseconds

2017-09-01 09:45:58,450 INFO org.apache.zookeeper.ZooKeeper: Session: 0x0 closed

2017-09-01 09:45:58,450 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down

2017-09-01 09:45:58,451 WARN org.apache.hadoop.ha.ActiveStandbyElector: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss

2017-09-01 09:46:03,453 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=test-ssps-s-02:2181,test-ssps-s-03:2181,test-ssps-s-04:2181 sessionTimeout=5000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@7f0d36c

2017-09-01 09:46:03,455 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server test-ssps-s-04/10.117.210.216:2181. Will not attempt to authenticate using SASL (unknown error)

2017-09-01 09:46:03,463 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to test-ssps-s-04/10.117.210.216:2181, initiating session

2017-09-01 09:46:05,131 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 1668ms for sessionid 0x0, closing socket connection and attempting reconnect

2017-09-01 09:46:05,885 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server test-ssps-s-03/10.51.20.155:2181. Will not attempt to authenticate using SASL (unknown error)

2017-09-01 09:46:06,626 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to test-ssps-s-03/10.51.20.155:2181, initiating session

2017-09-01 09:46:08,293 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 1667ms for sessionid 0x0, closing socket connection and attempting reconnect

2017-09-01 09:46:08,454 ERROR org.apache.hadoop.ha.ActiveStandbyElector: Connection timed out: couldn't connect to ZooKeeper in 5000 milliseconds

2017-09-01 09:46:09,157 INFO org.apache.zookeeper.ZooKeeper: Session: 0x0 closed

2017-09-01 09:46:09,157 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down

2017-09-01 09:46:09,158 WARN org.apache.hadoop.ha.ActiveStandbyElector: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss

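The highlighted entries, notably the ERROR lines, show that a connection to ZK could not be established, which caused the failover to abort.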
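One way to check from the failover host whether each ZooKeeper server is answering at all is to send it the four-letter command `ruok`; a healthy server replies `imok`. The following is a minimal sketch, not something taken from the original investigation: the function name `zk_ruok` and the timeout value are assumptions, and newer ZooKeeper releases may require the command to be whitelisted on the server.

```python
import socket


def zk_ruok(host, port=2181, timeout=5.0):
    """Send the four-letter command 'ruok'; return True only if the reply is 'imok'.

    Hypothetical helper: any connection error or timeout is reported as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            sock.sendall(b"ruok")
            chunks = []
            while True:
                data = sock.recv(64)
                if not data:  # server closed the connection after replying
                    break
                chunks.append(data)
            return b"".join(chunks) == b"imok"
    except OSError:  # covers refused connections, DNS failures and timeouts
        return False
```

Calling `zk_ruok` for each host in the connectString above (test-ssps-s-02, test-ssps-s-03, test-ssps-s-04) while the problem is occurring would show whether the servers are unreachable or merely too slow to complete a session handshake.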

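1.3            YARN RESOURCEMANAGER log information
The specific log entries were not kept, but in every case the service went down because of the ZK failure.

1.4            ZOOKEEPER service logs
Logs from the leader and a follower are excerpted and analyzed below.

1.4.1           Leader log
Analyzing the log for the abnormal window from 9:40 to 9:50 turned up the following errors: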

2017-09-01 09:45:49,754 INFO org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket connection from /10.51.20.155:40713

2017-09-01 09:45:49,754 INFO org.apache.zookeeper.server.ZooKeeperServer: Client attempting to establish new session at /10.51.20.155:40713

2017-09-01 09:45:51,090 INFO org.apache.zookeeper.server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x15e2a2922a213b5 type:setData cxid:0x21 zxid:0x6d00050f33 txntype:-1 reqpath:n/a Error Path:/yarn-leader-election/yarnRM/ActiveBreadCrumb Error:KeeperErrorCode = BadVersion for /yarn-leader-election/yarnRM/ActiveBreadCrumb

2017-09-01 09:45:52,000 INFO org.apache.zookeeper.server.ZooKeeperServer: Expiring session 0x35e2a2922671668, timeout of 5000ms exceeded

2017-09-01 09:45:52,001 INFO org.apache.zookeeper.server.PrepRequestProcessor: Processed session termination for sessionid: 0x35e2a2922671668

2017-09-01 09:45:52,482 INFO org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket connection from /10.117.68.10:33589

2017-09-01 09:45:52,484 INFO org.apache.zookeeper.server.ZooKeeperServer: Client attempting to renew session 0x15e2a2922a21870 at /10.117.68.10:33589

2017-09-01 09:45:52,484 INFO org.apache.zookeeper.server.ZooKeeperServer: Established session 0x15e2a2922a21870 with negotiated timeout 5000 for client /10.117.68.10:33589

2017-09-01 09:45:52,884 WARN org.apache.zookeeper.server.NIOServerCnxn: caught end of stream exception

2017-09-01 09:45:52,884 INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /10.51.20.155:40712 which had sessionid 0x15de73379b60000

2017-09-01 09:45:53,244 INFO org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket connection from /10.51.20.155:40730

2017-09-01 09:46:36,520 INFO org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket connection from /10.51.20.155:40749

2017-09-01 09:46:29,773 WARN org.apache.zookeeper.server.persistence.FileTxnLog: fsync-ing the write ahead log in SyncThread:3 took 53812ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide

2017-09-01 09:46:17,239 INFO org.apache.zookeeper.server.quorum.Leader: Shutting down

2017-09-01 09:45:58,000 INFO org.apache.zookeeper.server.ZooKeeperServer: Expiring session 0x15e2a2922a21870, timeout of 5000ms exceeded

2017-09-01 09:45:53,439 ERROR org.apache.zookeeper.server.NIOServerCnxn: Unexpected Exception:

java.nio.channels.CancelledKeyException
    at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
    at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
    at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:418)
    at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1509)
    at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:171)
    at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73)

2017-09-01 09:46:36,648 INFO org.apache.zookeeper.server.ZooKeeperServer: Expiring session 0x15e2a2922a21871, timeout of 10000ms exceeded

2017-09-01 09:46:36,648 INFO org.apache.zookeeper.server.ZooKeeperServer: Expiring session 0x15e2a2922a213b5, timeout of 10000ms exceeded

2017-09-01 09:46:36,648 INFO org.apache.zookeeper.server.ZooKeeperServer: Expiring session 0x15de73379b60000, timeout of 6000ms exceeded

2017-09-01 09:46:36,648 INFO org.apache.zookeeper.server.ZooKeeperServer: Expiring session 0x25de7337bb00000, timeout of 6000ms exceeded

2017-09-01 09:46:36,648 INFO org.apache.zookeeper.server.ZooKeeperServer: Expiring session 0x35e2a2922671890, timeout of 10000ms exceeded

2017-09-01 09:46:36,648 INFO org.apache.zookeeper.server.ZooKeeperServer: Expiring session 0x15e2a2922a213b6, timeout of 10000ms exceeded

2017-09-01 09:46:36,647 INFO org.apache.zookeeper.server.quorum.Leader: Shutdown called

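The highlighted log entries show that the ZooKeeper server took far too long to sync its transaction log: the fsync took 53812 ms, when it should normally finish within 3 seconds. While the log is being synced, ZK cannot respond to external requests, which causes sessions to expire, which in turn causes the ZK server to shut down.

The CancelledKeyException is a separate matter. After a session expires its socket is closed, but the server still tries to send a reply to that session, which raises the exception. The error is not fatal and has little impact. It is caused by a bug in this ZooKeeper version (3.4.5) and has been fixed in newer ZK releases.

The logs at other times report the same problem. At 10:33 in particular, ZK took nearly 2 minutes to flush the log, and the service exited abnormally.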
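To find every slow sync in a server log rather than reading it by eye, the fsync warnings can be extracted with a short script. This is a sketch only: the function name `slow_fsyncs` is hypothetical, the 3000 ms default simply mirrors the "within 3 seconds" rule of thumb above, and the pattern is deliberately tolerant of missing spaces in pasted log lines.

```python
import re

# Matches the FileTxnLog warning and captures timestamp, sync thread and duration.
FSYNC_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}\s*\d{2}:\d{2}:\d{2},\d{3}).*?"
    r"fsync-ing the write ahead\s*log in (?P<thread>\S+?)\s*took\s*(?P<ms>\d+)\s*ms"
)


def slow_fsyncs(lines, threshold_ms=3000):
    """Yield (timestamp, thread, millis) for fsync warnings at or above threshold_ms."""
    for line in lines:
        match = FSYNC_RE.search(line)
        if match and int(match.group("ms")) >= threshold_ms:
            yield match.group("ts"), match.group("thread"), int(match.group("ms"))
```

Passing an open log file to `slow_fsyncs` (for example `list(slow_fsyncs(open(path)))`, where `path` is the server log) gives a list of stalls whose timestamps can then be lined up against the session expirations and the disk IO peaks.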

1.4.2          Follower log

Analyzing the log for the abnormal window from 13:00 to 13:10 turned up the following errors:

2017-09-01 13:08:55,526 INFO org.apache.zookeeper.server.ZooKeeperServer: Established session 0x15e3b4962bc0000 with negotiated timeout 6000 for client /10.117.210.216:39709

2017-09-01 13:08:57,259 INFO org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket connection from /10.51.20.155:53325

2017-09-01 13:08:57,260 INFO org.apache.zookeeper.server.ZooKeeperServer: Client attempting to renew session 0x25e3b531e84006e at /10.51.20.155:53325

2017-09-01 13:08:57,261 INFO org.apache.zookeeper.server.ZooKeeperServer: Established session 0x25e3b531e84006e with negotiated timeout 6000 for client /10.51.20.155:53325

2017-09-01 13:08:57,261 INFO org.apache.zookeeper.server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x25e3b531e84006e type:delete cxid:0x11e3d zxid:0x7000006e8f txntype:-1 reqpath:n/a Error Path:/admin/preferred_replica_election Error:KeeperErrorCode = NoNode for /admin/preferred_replica_election

2017-09-01 13:08:59,525 WARN org.apache.zookeeper.server.NIOServerCnxn: caught end of stream exception

2017-09-01 13:08:59,526 INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /10.117.210.216:39709 which had sessionid 0x15e3b4962bc0000

2017-09-01 13:09:00,302 WARN org.apache.zookeeper.server.persistence.FileTxnLog: fsync-ing the write ahead log in SyncThread:3 took 35356ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide

2017-09-01 13:09:01,263 WARN org.apache.zookeeper.server.NIOServerCnxn: caught end of stream exception

2017-09-01 13:09:01,264 INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /10.51.20.155:53325 which had sessionid 0x25e3b531e84006e

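The cause is the same as above: a slow fsync of the write-ahead log, here 35356 ms.

1.5            Disk

At the times of the problem, IO usage was very high.

After the problem appeared at 9:46, analysis of the cluster IO values for that moment showed IO reaching 200 MB/s. Likewise, at 10:33 IO reached nearly 200 MB/s. Disk IO may be a bottleneck.

1.6            Network

The network was normal.

2  Summary and recommendations

The monitoring and analysis results indicate that every failure was caused by the ZK server taking too long when fsync-ing the write-ahead log. Disk IO during the problem windows was much higher than at normal times, reaching 200 MB/s, which shows that a large job was doing disk IO when the problem occurred. The developer Yang Yijun (杨仪军) confirmed that he had indeed run jobs with much more data than usual that day.

2.1.1          Official recommendation
Regarding where to store the ZK transaction log, the official site gives the following advice: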

Having a dedicated log device has a large impact on throughput and stable latencies. It is highly recommended to dedicate a log device and set dataLogDir to point to a directory on that device, and then make sure to point dataDir to a directory not residing on that device.

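To avoid this kind of problem, the dataLogDir directory should therefore be kept separate from dataDir; a dedicated storage device can be used to hold the ZK transaction log.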
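In zoo.cfg this separation might look like the fragment below. Both paths are hypothetical examples; the only point is that the two directories sit on different physical devices, with the dataLogDir device shared with as little other IO as possible.

```
# zoo.cfg (example paths, not taken from the affected cluster)
# Snapshots stay on the general-purpose data disk.
dataDir=/data/zookeeper
# The transaction log goes to a directory on a dedicated device.
dataLogDir=/zklog/zookeeper
```

Note that an existing ensemble would need its current log files moved to the new directory while the server is stopped, one server at a time.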

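2.1.2          Disk IO

Judging from today's observations, disk IO performance is seriously inadequate: the disks hit a bottleneck at only 200 MB/s. Their IO capability needs to be tested further with specialized tools.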
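Before a full benchmark is arranged, a quick probe can at least show how long a small write plus fsync takes on the disk that holds the ZK log. The script below is a rough sketch under stated assumptions: the function names, the 4096-byte block size and the 50 writes are arbitrary choices, and it is no substitute for the dedicated tools mentioned above.

```python
import os
import tempfile
import time


def fsync_latencies(directory, writes=50, block_size=4096):
    """Append small blocks to a scratch file in `directory`, fsync after each one,
    and return the per-fsync latencies in milliseconds."""
    block = b"\0" * block_size
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory, prefix="fsync-probe-")
    try:
        for _ in range(writes):
            os.write(fd, block)
            start = time.perf_counter()
            os.fsync(fd)  # blocks until the OS reports the data is on the device
            latencies.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
        os.remove(path)  # leave no scratch file behind
    return latencies


def summarize(latencies):
    """Return the count, median and maximum of a non-empty list of latencies."""
    ordered = sorted(latencies)
    return {
        "count": len(ordered),
        "median_ms": ordered[len(ordered) // 2],
        "max_ms": ordered[-1],
    }
```

Running `summarize(fsync_latencies(d))`, with `d` set to the ZK log directory, once while the cluster is idle and once while a large job is running would show whether fsync latency on that disk degrades under load in the way the ZK warnings suggest.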

2.1.3          Disable the forceSync parameter

Add the following to zoo.cfg:

forceSync=no

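forceSync is enabled by default: as soon as ZK receives data it syncs the current state to the on-disk log file, and it replies only after the sync completes. With the option turned off, the reply no longer waits for the sync, so clients get a fast response and the sync latency problem is avoided. (The original post shows the ZK log-flushing source code in a figure at this point.)

Turning forceSync off carries a potential risk. The log is still flushed (log.flush() is executed first), but the operating system writes to its cache first to make disk writes more efficient. If the machine fails, some ZK state may never have reached the disk, leaving ZK with inconsistent state before and after the failure.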
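The difference between flushing and syncing can be illustrated in a few lines. This is a Python illustration of the general mechanism, not ZooKeeper's Java source; `append_record` and its `force_sync` flag are hypothetical names chosen to mirror the option discussed above.

```python
import os


def append_record(f, record, force_sync=True):
    """Append one record to an open binary file.

    flush() always runs and hands the bytes to the operating system's cache.
    fsync() runs only when force_sync is True and waits for the storage device.
    """
    f.write(record)
    f.flush()  # data leaves the process buffer but may still sit in the OS cache
    if force_sync:
        os.fsync(f.fileno())  # durable once this returns, at the cost of latency
```

With `force_sync=False` the call returns as soon as the OS has the data, which is why responses become fast, and also why a machine crash before the OS writes its cache back can lose records that were already acknowledged.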

2.1.4          Resolving CancelledKeyException
This problem has been fixed in ZooKeeper 3.4.8.

To fix it in the current version, apply the patch from https://issues.apache.org/jira/browse/ZOOKEEPER-1237

and then rebuild ZK and deploy the result.

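---------------------
Author: jimmyxyalj
Source: CSDN
Original: https://blog.csdn.net/xjping0794/article/details/77784171
Copyright notice: This is an original article by the blogger. Please include a link to the original post when reposting.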
