最近hbase集群很多region server挂掉,查看其中一个RegionServer1日志发现,17:17:14挂的时候服务器压力很大,有大量的responseTooSlow,也有不少gc,但是当时内存还有很多剩余,不是因为oom被kill

2018-03-13T17:17:13.372+0800: [GC (Allocation Failure) 2018-03-13T17:17:13.372+0800: [ParNew: 3280066K->256481K(3762880K), 0.0429718 secs] 23536952K->20513367K(32801920K), 0.0432932 secs] [Times: user=1.92 sys=0.01, real=0.04 secs]

Heap

par new generation   total 3762880K, used 1416770K [0x00007f2f4c000000, 0x00007f305f990000, 0x00007f305f990000)

eden space 3010368K,  38% used [0x00007f2f4c000000, 0x00007f2f92d182c0, 0x00007f3003bd0000)

from space 752512K,  34% used [0x00007f3003bd0000, 0x00007f30136487c0, 0x00007f3031ab0000)

to   space 752512K,   0% used [0x00007f3031ab0000, 0x00007f3031ab0000, 0x00007f305f990000)

concurrent mark-sweep generation total 29039040K, used 19905973K [0x00007f305f990000, 0x00007f374c000000, 0x00007f374c000000)

Metaspace       used 49530K, capacity 50072K, committed 55492K, reserved 57344K

region server挂掉的原因是tryRegionServerReport时发现被标记为dead server,抛出YouAreDeadException,然后HRegionServer.run方法会退出,意味着region server进程退出

2018-03-13 17:17:08,159 FATAL [regionserver/RegionServer1/RegionServer1:16020] regionserver.HRegionServer: ABORTING region server RegionServer1,16020,1519805863844: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing RegionServer1,16020,1519805863844 as dead server

org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing RegionServer1,16020,1519805863844 as dead server

at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:434)

at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:339)

at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:339)

at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(Reg

ionServerStatusProtos.java:8617)

at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2196)

at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)

at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133)

at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:108)

at java.lang.Thread.run(Thread.java:745)

at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)

at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)

at java.lang.reflect.Constructor.newInstance(Constructor.java:422)

at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)

at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)

at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:330)

at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:1153)

at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:962)

被标记为dead server是因为zk session expired

2018-03-13 17:17:08,128 INFO  [regionserver/RegionServer1/RegionServer1:16020-SendThread(ZKServer1:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 41499ms for sessionid 0x161d1433ba86c84, closing socket connection and attempting reconnect

2018-03-13 17:17:08,445 FATAL [main-EventThread] regionserver.HRegionServer: ABORTING region server RegionServer1,16020,1519805863844: regionserver:16020-0x35fe18af217e685, quorum=ZKServer1:2181, baseZNode=/hbase regionserver:16020-0x35fe18af217e685 received expired from ZooKeeper, aborting

在17:17:08超时41499ms是因为17:16:38的时候有一个full gc耗时29s

2018-03-13T17:16:38.268+0800: [GC (Allocation Failure)

2018-03-13T17:16:38.268+0800: [ParNew (promotion failed): 3762880K->3762880K(3762880K), 3.2398750 secs]

2018-03-13T17:16:41.508+0800: [CMS2018-03-13T17:16:47.078+0800: [CMS-concurrent-sweep: 10.829/17.135 secs] [Times: user=161.97 sys=2.65, real=17.13 secs]

(concurrent mode failure): 28187725K->21036514K(29039040K), 26.5884371 secs] 31874835K->21036514K(32801920K), [Metaspace: 49335K->49335K(57344K)], 29.8297421 secs] [Times: user=40.45 sys=0.14, real=29.82 secs]

但是hbase-site.xml中配置session超时是180s,为什么41s就超时

<property>

<name>zookeeper.session.timeout</name>

<value>180000</value>

</property>

建立zk session时的negotiated timeout = 40000,也就是40s,为什么配置的是180s,实际是40s?

2018-03-13 17:15:11,522 INFO  [main-SendThread(ZKServer1:2181)] zookeeper.ClientCnxn: Session establishment complet

e on server ZKServer1/ZKServer1:2181, sessionid = 0x35fe18af217e685, negotiated timeout = 40000

2018-03-13 17:15:11,526 INFO  [regionserver/RegionServer1/RegionServer1:16020-SendThread(ZKServer1:2181)]

zookeeper.ClientCnxn: Session establishment complete on server RegionServer1/RegionServer1:2181, sessionid = 0x161d1433b

a86c84, negotiated timeout = 40000

zk中建立连接后输出

            LOG.info("Session establishment complete on server "

                    + clientCnxnSocket.getRemoteSocketAddress()

                    + ", sessionid = 0x" + Long.toHexString(sessionId)

                    + ", negotiated timeout = " + negotiatedSessionTimeout

                    + (isRO ? " (READ-ONLY mode)" : ""));

negotiatedSessionTimeout是在onConnected中被回调

        void onConnected(int _negotiatedSessionTimeout, long _sessionId,

                byte[] _sessionPasswd, boolean isRO) throws IOException {

回调是在ClientCnxnSocket中,取得是ConnectResponse.getTimeout()

        ConnectResponse conRsp = new ConnectResponse();

        conRsp.deserialize(bbia, "connect");

                  sendThread.onConnected(conRsp.getTimeOut(), this.sessionId,

                conRsp.getPasswd(), isRO);

这里取的是ServerCnxn中的sessionTimeout,这个值在ZooKeeperServer中做初始化

        int minSessionTimeout = getMinSessionTimeout();

        if (sessionTimeout < minSessionTimeout) {

            sessionTimeout = minSessionTimeout;

        }

        int maxSessionTimeout = getMaxSessionTimeout();

        if (sessionTimeout > maxSessionTimeout) {

            sessionTimeout = maxSessionTimeout;

        }

可以看到会根据minSessionTimeout和maxSessionTimeout限制

    public int getMaxSessionTimeout() {

        return maxSessionTimeout == -1 ? tickTime * 20 : maxSessionTimeout;

    }

由于zookeeper服务器tickTime设置的是2000ms,所以maxSessionTimeout默认会被设置为40000ms,所以解决这个问题需要修改zk的maxSessionTimeout;

【原创】大叔问题定位分享(1)HBase RegionServer频繁挂掉的更多相关文章

  1. 20130617 hbase regionserver 老挂掉

    hbase regionserver 老挂掉: 添加如下: <property><name>hbase.regionserver.restart.on.zk.expire< ...

  2. 【原创】大叔问题定位分享(25)ambari metrics collector内置standalone hbase启动失败

    ambari metrics collector内置hbase目录位于 /usr/lib/ams-hbase 配置位于 /etc/ams-hbase/conf 通过ruby启动 /usr/lib/am ...

  3. 【原创】大叔问题定位分享(13)HBase Region频繁下线

    问题现象:hive执行sql报错 select count(*) from test_hive_table; 报错 Error: java.io.IOException: org.apache.had ...

  4. 一次bug死磕经历之Hbase堆内存小导致regionserver频繁挂掉

    环境如下: Centos6.5 Apache Hadoop2.7.1 Apache Hbase0.98.12 Apache Zookeeper3.4.6 JDK1.7 Ant1.9.5 Maven3. ...

  5. 【原创】大叔问题定位分享(24)hbase standalone方式启动报错

    hbase 2.0.2 hbase standalone方式启动报错: 2019-01-17 15:49:08,730 ERROR [Thread-24] master.HMaster: Failed ...

  6. 【原创】大叔问题定位分享(16)spark写数据到hive外部表报错ClassCastException: org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat cannot be cast to org.apache.hadoop.hive.ql.io.HiveOutputFormat

    spark 2.1.1 spark在写数据到hive外部表(底层数据在hbase中)时会报错 Caused by: java.lang.ClassCastException: org.apache.h ...

  7. 【原创】大叔问题定位分享(6)Dubbo monitor服务iowait高,负载高

    一 问题 Dubbo monitor所在服务器状态异常,iowait一直很高,load也一直很高,监控如下: iowait如图: load如图: 二 分析 通过iotop命令可以查看当前系统中磁盘io ...

  8. 【原创】大叔问题定位分享(3)Kafka集群broker进程逐个报错退出

    kafka0.8.1 一 问题现象 生产环境kafka服务器134.135.136分别在10月11号.10月13号挂掉: 134日志 [2014-10-13 16:45:41,902] FATAL [ ...

  9. 【原创】大叔问题定位分享(30)mesos agent启动失败:Failed to perform recovery: Incompatible agent info detected

    mesos agent启动失败,报错如下: Feb 15 22:03:18 server1.bj mesos-slave[1190]: E0215 22:03:18.622994 1192 slave ...

随机推荐

  1. hdu4746 Mophues (莫比乌斯进阶)

    参考博客:https://blog.csdn.net/acdreamers/article/details/12871643 题意:满足1<=x<=n,1<=y<=m,并且gc ...

  2. PyQt5基础应用一

    一.PyQt5基础   1.1 创建窗口 import sys from PyQt5.QtWidgets import QApplication, QWidget if __name__ == '__ ...

  3. appium框架之bootstrap

    (闲来无事,做做测试..)最近弄了弄appium,感觉挺有意思,就深入研究了下. 看小弟这篇文章之前,先了解一下appium的架构,对你理解有好处,推荐下面这篇文章:testerhome appium ...

  4. 软工+C(6): 最近发展区/脚手架

    // 上一篇:工具和结构化 // 下一篇:野生程序员 教育心理学里面有提到"最近发展区"这个概念,这个概念是前苏联发展心理学家维果茨基(Vygotsky)提出的,英文名词是Zone ...

  5. 采用VSPD、ModbusTool模拟串口、MODBUS TCP设备进行Python采集软件开发

    版权声明:本文为博主原创文章,欢迎转载,并请注明出处.联系方式:460356155@qq.com 不少仪器/设备都提供了数据采集的接口,其中不少是串口或网络的MODBUS/TCP协议. 串口是比较简单 ...

  6. MySQL 字符串 分割 多列

    mysql如何进行以,分割的字符串的拆分 - 我有一个梦想 - CSDN博客https://blog.csdn.net/u012009613/article/details/52770567 mysq ...

  7. SQL拼接字符串时单引号转义问题 单引号转义字符

    要拼接一个单引号到已有字符串前后, 开始以为(错误)可以用  \ 转义,如下: '\''+ str+'\'' 看颜色就知道是不行的. 正确方法是两个单引号就转义为单引号,如下: ''''+str+'' ...

  8. javaWeb1之Servlet

    Servlet Servlet 环境设置 servlet是扩展web服务器功能的组件规范.浏览器发送请求给web服务器,如果是动态资源的请求,web服务器会将请求转发给servlet容器来处理(由容器 ...

  9. [基础]Android 应用的启动

    Android 应用的启动模式分为两种,一种是通过启动器(Launcher)启动,另一种是通过Intent消息启动. 如果在通过Intent 消息启动前,希望判断欲启动的应用是否已经安装, 目前有两种 ...

  10. Codeforces 1077E Thematic Contests(二分)

    题目链接:Thematic Contests 题意:给出n道有类型的题目,每天组织一场专题比赛,该天题目数量必须是前一天的2倍,第一天的题目数量可以任意选择,求能消耗的最多题目数量. 题解:先整理成不 ...