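HBase permanent RIT (Region-In-Transition) problem: an abnormal shutdown corrupted and lost HBase table data, leaving a large number of regions stuck in the Offline state and unable to come online.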

  • Problem 1: when HBase starts, the HBase RegionServer Web UI stays on the "The RegionServer is initializing!" page
 Initializing Master file system (since 10mins, 16sec ago)
The RegionServer is initializing!
  • Problem 1: troubleshooting and solution

Check the HBase log. The Master is waiting for HDFS to leave safe mode:

2018-02-27 17:59:43,114 INFO  [prd-bldb-hdp-name01:60000.activeMasterManager] util.FSUtils: Waiting for dfs to exit safe mode...
2018-02-27 17:59:53,116 INFO [prd-bldb-hdp-name01:60000.activeMasterManager] util.FSUtils: Waiting for dfs to exit safe mode...
2018-02-27 18:00:03,119 INFO [prd-bldb-hdp-name01:60000.activeMasterManager] util.FSUtils: Waiting for dfs to exit safe mode...
2018-02-27 18:00:13,121 INFO [prd-bldb-hdp-name01:60000.activeMasterManager] util.FSUtils: Waiting for dfs to exit safe mode...
2018-02-27 18:00:23,123 INFO [prd-bldb-hdp-name01:60000.activeMasterManager] util.FSUtils: Waiting for dfs to exit safe mode...
2018-02-27 18:00:33,125 INFO [prd-bldb-hdp-name01:60000.activeMasterManager] util.FSUtils: Waiting for dfs to exit safe mode...
2018-02-27 18:00:43,128 INFO [prd-bldb-hdp-name01:60000.activeMasterManager] util.FSUtils: Waiting for dfs to exit safe mode...
2018-02-27 18:00:53,130 INFO [prd-bldb-hdp-name01:60000.activeMasterManager] util.FSUtils: Waiting for dfs to exit safe mode...
2018-02-27 18:01:03,132 INFO [prd-bldb-hdp-name01:60000.activeMasterManager] util.FSUtils: Waiting for dfs to exit safe mode...
2018-02-27 18:01:13,135 INFO [prd-bldb-hdp-name01:60000.activeMasterManager] util.FSUtils: Waiting for dfs to exit safe mode...
2018-02-27 18:01:23,137 INFO [prd-bldb-hdp-name01:60000.activeMasterManager] util.FSUtils: Waiting for dfs to exit safe mode...
2018-02-27 18:01:33,139 INFO [prd-bldb-hdp-name01:60000.activeMasterManager] util.FSUtils: Waiting for dfs to exit safe mode...
2018-02-27 18:01:43,141 INFO [prd-bldb-hdp-name01:60000.activeMasterManager] util.FSUtils: Waiting for dfs to exit safe mode...
2018-02-27 18:01:53,144 INFO [prd-bldb-hdp-name01:60000.activeMasterManager] util.FSUtils: Waiting for dfs to exit safe mode...
2018-02-27 18:02:03,146 INFO [prd-bldb-hdp-name01:60000.activeMasterManager] util.FSUtils: Waiting for dfs to exit safe mode...
2018-02-27 18:02:13,148 INFO [prd-bldb-hdp-name01:60000.activeMasterManager] util.FSUtils: Waiting for dfs to exit safe mode...
2018-02-27 18:02:23,151 INFO [prd-bldb-hdp-name01:60000.activeMasterManager] util.FSUtils: Waiting for dfs to exit safe mode...

# Check HDFS safe mode
hadoop dfsadmin -safemode get
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
Safe mode is ON in namenode02/172.31.132.72:9000
Safe mode is ON in namenode01/172.31.132.71:9000
# Leave HDFS safe mode
hadoop dfsadmin -safemode leave
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
Safe mode is OFF in namenode02/172.31.132.72:9000
Safe mode is OFF in namenode01/172.31.132.71:9000
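
With two NameNodes it is easy to miss one that is still in safe mode. The following is a minimal Python sketch, not part of the original procedure, that parses the text printed by the -safemode get command above and reports the state per NameNode. The function names parse_safemode and all_out_of_safemode are made up for this illustration, and the sample text is the output shown above.

```python
import re
from typing import Dict

# Matches lines such as "Safe mode is ON in namenode02/172.31.132.72:9000".
# The " in <namenode>" part is optional, so a bare "Safe mode is ON" also matches.
SAFEMODE_RE = re.compile(r"Safe mode is (ON|OFF)(?: in (\S+))?")


def parse_safemode(output: str) -> Dict[str, bool]:
    """Map each NameNode found in the output to True if its safe mode is ON."""
    states: Dict[str, bool] = {}
    for match in SAFEMODE_RE.finditer(output):
        state, node = match.groups()
        # "default" is a placeholder key (an assumption) for output without a node name.
        states[node or "default"] = (state == "ON")
    return states


def all_out_of_safemode(output: str) -> bool:
    """True only if at least one NameNode was reported and none is in safe mode."""
    states = parse_safemode(output)
    return bool(states) and not any(states.values())


SAMPLE_ON = """DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
Safe mode is ON in namenode02/172.31.132.72:9000
Safe mode is ON in namenode01/172.31.132.71:9000
"""

SAMPLE_OFF = SAMPLE_ON.replace("is ON", "is OFF")

if __name__ == "__main__":
    print(parse_safemode(SAMPLE_ON))
    print(all_out_of_safemode(SAMPLE_OFF))
```

In practice the text would come from running the command (for example through subprocess) rather than from a constant; that part is left out because it depends on the cluster.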
  • Problem 2: the Hadoop NameNode Web UI reports missing blocks.

  • Problem 3: the HBase Master Web UI shows a large number of Offline Regions and RIT (Region-In-Transition).

  • Problems 2/3: troubleshooting and solution
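# View the DFS status report
hadoop dfsadmin -report

# Check for corrupt files and the current HDFS replica counts
hdfs fsck /
or, to also print the location of every block,
hdfs fsck / -files -blocks -locations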

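Under replicated blocks: number of blocks with fewer replicas than the configured replication factor
Blocks with corrupt replicas: number of blocks that have at least one corrupt replica
Missing blocks: number of blocks with no replica left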

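Note that 662 - 53 = 609, which almost exactly matches the RIT count of 610 from problem 3. This suggests that the HBase Offline Regions and RIT (Region-In-Transition) problems are caused by blocks in Hadoop that have not yet been repaired.

Earlier, repeated attempts to repair the HBase tables with hbase hbck had no effect. The reason turned out to be that the Hadoop replicas had not been fully restored.

My reading of the red-boxed figures in the report screenshot is: 662 blocks on HDFS are under-replicated (Under replicated blocks); of these, 609 blocks still have 1-2 replicas, and 53 blocks have lost all of their replicas (Missing blocks).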

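Core repair step 1:

# Reset the replication factor of files already in HDFS so that under-replicated blocks are re-replicated
hadoop fs -setrep -R 3 /

With this command, each under-replicated block that still has 1-2 replicas (609 of the 662) is copied back up to 3 replicas, which recovers the lost replicas.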

Under replicated blocks: 53
Blocks with corrupt replicas: 0
Missing blocks: 53
Missing blocks (with replication factor 1): 0
Pending deletion blocks: 0

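The report now shows only 53 problem blocks left, and these 53 are the missing blocks that have no replica at all. Every block that still had 1-2 replicas has been repaired.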
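
The arithmetic used above (under-replicated minus missing gives the blocks that can still be re-replicated) can be checked mechanically. The sketch below is an illustration in Python, not part of the original procedure. It parses counter lines in the format shown in the report above and follows this post's reading that the under-replicated count includes the missing blocks. The names parse_block_counters and recoverable_blocks are hypothetical.

```python
import re
from typing import Dict

# Only the three plain counters are matched. A line such as
# "Missing blocks (with replication factor 1): 0" is skipped on purpose,
# because the colon does not directly follow the counter name.
COUNTER_RE = re.compile(
    r"^[ \t]*(Under replicated blocks|Blocks with corrupt replicas|Missing blocks)"
    r"[ \t]*:[ \t]*(\d+)[ \t]*$",
    re.MULTILINE,
)


def parse_block_counters(report: str) -> Dict[str, int]:
    """Extract the block counters from dfsadmin-style report text."""
    return {name: int(value) for name, value in COUNTER_RE.findall(report)}


def recoverable_blocks(counters: Dict[str, int]) -> int:
    """Blocks that still have at least one replica, per this post's interpretation."""
    return counters["Under replicated blocks"] - counters["Missing blocks"]


# Counter lines copied from the report shown after repair step 1.
AFTER_STEP_1 = """Under replicated blocks: 53
Blocks with corrupt replicas: 0
Missing blocks: 53
Missing blocks (with replication factor 1): 0
Pending deletion blocks: 0
"""

if __name__ == "__main__":
    counters = parse_block_counters(AFTER_STEP_1)
    print(counters, recoverable_blocks(counters))
    # The figures quoted before the repair (662 under-replicated, 53 missing):
    print(recoverable_blocks({"Under replicated blocks": 662, "Missing blocks": 53}))
```

When the recoverable count reaches 0 while Missing blocks is still above 0, re-replication has done all it can, and what remains needs the second repair step.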

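Core repair step 2:

# Delete corrupted files
hdfs fsck / -delete

Running this command several times removes, from the NameNode, the metadata and the corrupted files belonging to the 53 blocks whose replicas are all lost (Missing blocks) or corrupt.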

Under replicated blocks: 33
Blocks with corrupt replicas: 0
Missing blocks: 33
Missing blocks (with replication factor 1): 0
Pending deletion blocks: 244

Checking the HBase Master Web UI again at this point, the numbers of Offline Regions and RIT (Region-In-Transition) have dropped sharply.
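
Because the delete step removes files for good, it is useful to know which HBase tables own the affected files. The following Python sketch is an illustration rather than part of the original procedure. It maps HDFS paths under /hbase/data/ to (namespace, table) pairs, assuming the layout /hbase/data/<namespace>/<table>/... that appears in the logs of this post and an HBase root directory of /hbase. The function names are hypothetical, and the sample paths are copied from the hbck and DFSClient output in this post purely to exercise the code. On a live cluster the paths could come from hdfs fsck / -list-corruptfileblocks.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple
from urllib.parse import urlparse


def table_of(path: str) -> Optional[Tuple[str, str]]:
    """Return (namespace, table) for a path under /hbase/data/, otherwise None."""
    # urlparse drops a scheme and authority such as hdfs://namenode01:9000;
    # a plain absolute path is returned unchanged.
    parts = urlparse(path).path.strip("/").split("/")
    if len(parts) >= 4 and parts[0] == "hbase" and parts[1] == "data":
        return parts[2], parts[3]
    return None


def affected_tables(paths: Iterable[str]) -> Counter:
    """Count how many of the given paths fall under each HBase table."""
    counts: Counter = Counter()
    for path in paths:
        table = table_of(path)
        if table is not None:
            counts[table] += 1
    return counts


# Sample paths taken from the log output quoted in this post.
SAMPLE_PATHS = [
    "/hbase/data/default/pos_flow_summary_20170304/.tabledesc/.tableinfo.0000000001",
    "hdfs://namenode01:9000/hbase/data/default/pos_flow/71eb7d463708010bc2a3f1e96deca135",
    "hdfs://namenode01:9000/hbase/data/default/pos_flow/67bfa42b4c45ec847c7eb27bbd7d86e5",
    "/tmp/not-an-hbase-file",
]

if __name__ == "__main__":
    for (namespace, table), count in affected_tables(SAMPLE_PATHS).most_common():
        print(f"{namespace}:{table}\t{count}")
```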

Shortly afterwards, the RegionServer on datanode03 went down.

At this point, restarting the HBase and Hadoop clusters cleared all of the remaining problems:

# Run on namenode01: stop HBase
stop-hbase.sh
# Run on namenode01: stop Hadoop
stop-all.sh
# Run on namenode01: start Hadoop
start-all.sh
# Run on both namenode01 and namenode02: start HBase
start-hbase.sh

  • Other issues: troubleshooting and repair approaches for HBase-related failures

The HBase Regions in Transition problem
# Check HBase for inconsistencies
hbase hbck
# Repair HBase
hbase hbck -repair

The Load Balancer is not enabled which will eventually cause performance degradation in HBase as Regions will not be distributed across all RegionServers. The balancer is only expected to be disabled during rolling upgrade scenarios.

The balancer is switched off (balance_switch false) before services are stopped, so that regions on the stopped node are not migrated to other nodes and then moved back later, which wastes time. Once the work is done, switch it back on to clear the warning above:
hbase(main):001:0> balance_switch true

2018-02-27 21:14:54,236 INFO [hbasefsck-pool1-t38] util.HBaseFsckRepair: Region still in transition, waiting for it to become assigned: {ENCODED => e540df791e7fcdc93c118b8055d1c74f, NAME => 'pos_flow_summary_20170713,,1503656523513.e540df791e7fcdc93c118b8055d1c74f.', STARTKEY => '', ENDKEY => ''}
2018-02-27 21:14:54,236 INFO [hbasefsck-pool1-t47] util.HBaseFsckRepair: Region still in transition, waiting for it to become assigned: {ENCODED => e59b1015c6fed189cdb9ba8493024563, NAME => 'pos_flow_summary_20180111,,1515768771542.e59b1015c6fed189cdb9ba8493024563.', STARTKEY => '', ENDKEY => ''}
2018-02-27 21:14:54,241 INFO [hbasefsck-pool1-t44] util.HBaseFsckRepair: Region still in transition, waiting for it to become assigned: {ENCODED => d22e214e72ff89e87b4df3eebd9603f9, NAME => 'pos_flow_summary_20180112,,1515855181051.d22e214e72ff89e87b4df3eebd9603f9.', STARTKEY => '', ENDKEY => ''}
2018-02-27 21:14:54,244 INFO [hbasefsck-pool1-t23] util.HBaseFsckRepair: Region still in transition, waiting for it to become assigned: {ENCODED => e8667191e988db9d65b52cfdb5e83a4d, NAME => 'pos_flow_summary_20170310,,1504229353726.e8667191e988db9d65b52cfdb5e83a4d.', STARTKEY => '', ENDKEY => ''}
2018-02-27 21:14:54,245 INFO [hbasefsck-pool1-t45] util.HBaseFsckRepair: Region still in transition, waiting for it to become assigned: {ENCODED => d05b759994d757b8fc857993e3351648, NAME => 'app_point,5000,1510910952310.d05b759994d757b8fc857993e3351648.', STARTKEY => '5000', ENDKEY => '5505|1dcfb8c9a44c4147acc823c2e463d536'}

# Repair the .META table
hbase hbck -fixMeta
ERROR: Region { meta => pos_flow,2012|dd12dceee69c56f6776154d02e49f840,1518061965154.71eb7d463708010bc2a3f1e96deca135., hdfs => hdfs://namenode01:9000/hbase/data/default/pos_flow/71eb7d463708010bc2a3f1e96deca135, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => pos_flow_summary_20180115,,1516115199923.70df944adbd82c1422be8f7ee8c24f3e., hdfs => hdfs://namenode01:9000/hbase/data/default/pos_flow_summary_20180115/70df944adbd82c1422be8f7ee8c24f3e, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => pos_flow,5215|249f79b383f5c144cdd95cd1c29fdec3,1518380260884.67bfa42b4c45ec847c7eb27bbd7d86e5., hdfs => hdfs://namenode01:9000/hbase/data/default/pos_flow/67bfa42b4c45ec847c7eb27bbd7d86e5, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => pos_flow_summary_20170528,,1504142971183.679bcdecd0335c99d847374db34de31d., hdfs => hdfs://namenode01:9000/hbase/data/default/pos_flow_summary_20170528/679bcdecd0335c99d847374db34de31d, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => pos_flow,4744|dcf7bccc75f738986e5db100f1f54473,1518489513549.673e899d577f6111b5699b3374ba6adc., hdfs => hdfs://namenode01:9000/hbase/data/default/pos_flow/673e899d577f6111b5699b3374ba6adc, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => pos_flow,1321|ab83f75ef25bdd0d2ecc363fe1fe0106,1518466350793.66b9622950bba42339f011ac745b080b., hdfs => hdfs://namenode01:9000/hbase/data/default/pos_flow/66b9622950bba42339f011ac745b080b, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => pos_flow,9449|1ed33683e675c3e9ddbecf4d9bd42183,1518041132081.66b11e69bc62f356b3f81f351b8a6c68., hdfs => hdfs://namenode01:9000/hbase/data/default/pos_flow/66b11e69bc62f356b3f81f351b8a6c68, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => access_log,1000,1517363823393.65c41f802af180f41af848f1fed8e725., hdfs => hdfs://namenode01:9000/hbase/data/default/access_log/65c41f802af180f41af848f1fed8e725, deployed => , replicaId => 0 } not deployed on any region server.
Table pos_flow_summary_20180222 is okay.
Number of regions: 0
Deployed on:
Table pos_flow_summary_20180223 is okay.
Number of regions: 0
Deployed on:
Table pos_flow_summary_20180224 is okay.
Number of regions: 1
Deployed on: prd-bldb-hdp-data02,60020,1519734905071
Table pos_flow_summary_20180225 is okay.
Number of regions: 1
Deployed on: prd-bldb-hdp-data02,60020,1519734905071
Table pos_flow_summary_20180226 is okay.
Number of regions: 1
Deployed on: prd-bldb-hdp-data02,60020,1519734905071
Table hbase:namespace is okay.
Number of regions: 1
Deployed on: prd-bldb-hdp-data02,60020,1519734905071
Table gb_app_active is inconsistent.
Number of regions: 7
Deployed on: prd-bldb-hdp-data01,60020,1519734905393 prd-bldb-hdp-data02,60020,1519734905071 prd-bldb-hdp-data03,60020,1519734905043
Table app_point is inconsistent.
Number of regions: 3
Deployed on: prd-bldb-hdp-data01,60020,1519734905393 prd-bldb-hdp-data03,60020,1519734905043
970 inconsistencies detected.
Status: INCONSISTENT
2018-02-27 21:40:59,644 INFO [main] client.ConnectionManager$HConnectionImplementation: Closing master protocol: MasterService
2018-02-27 21:40:59,644 INFO [main] client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x161d70981710083
2018-02-27 21:40:59,646 INFO [main] zookeeper.ZooKeeper: Session: 0x161d70981710083 closed
2018-02-27 21:40:59,646 INFO [main-EventThread] zookeeper.ClientCnxn: EventThread shut down

# When there are region holes in HDFS
hbase hbck -fixHdfsHoles
# When a region directory has no .regioninfo file
hbase hbck -fixHdfsOrphans
# When HBase region reference files are broken
# Found lingering reference file hdfs:
hbase hbck -fixReferenceFiles
# Fix assignment problems
hbase hbck -fixAssignments

2018-02-28 14:07:57,814 INFO [hbasefsck-pool1-t40] util.HBaseFsckRepair: Region still in transition, waiting for it to become assigned: {ENCODED => 0836651ac3e23c331ca049e2f333e19f, NAME => 'pos_flow,9139|62b0a7a92cb8c4d25cea82991856334e,1518205798951.0836651ac3e23c331ca049e2f333e19f.', STARTKEY => '9139|62b0a7a92cb8c4d25cea82991856334e', ENDKEY => '9159|441da161eba8d989493f9d2ca2a3e4a2'}
2018-02-28 14:07:57,814 INFO [hbasefsck-pool1-t11] util.HBaseFsckRepair: Region still in transition, waiting for it to become assigned: {ENCODED => 0e4dbc902294c799db7029df118c61a4, NAME => 'app_point,9000,1510910950800.0e4dbc902294c799db7029df118c61a4.', STARTKEY => '9000', ENDKEY => '9509|05a93'}
2018-02-28 14:07:57,817 INFO [hbasefsck-pool1-t10] util.HBaseFsckRepair: Region still in transition, waiting for it to become assigned: {ENCODED => 03f6d27f36e4f73e8030cfa6454dfadf, NAME => 'pos_flow_summary_20170913,,1505387404710.03f6d27f36e4f73e8030cfa6454dfadf.', STARTKEY => '', ENDKEY => ''}
2018-02-28 14:07:57,817 INFO [hbasefsck-pool1-t35] util.HBaseFsckRepair: Region still in transition, waiting for it to become assigned: {ENCODED => 0634fab1259b036a5fbd024fd8da4ba7, NAME => 'pos_flow_summary_20171213,,1513262830799.0634fab1259b036a5fbd024fd8da4ba7.', STARTKEY => '', ENDKEY => ''}
2018-02-28 14:07:57,818 INFO [hbasefsck-pool1-t29] util.HBaseFsckRepair: Region still in transition, waiting for it to become assigned: {ENCODED => 0666757ecbb89b60a52613cba2dab2f0, NAME => 'pos_flow,4344|8226741aea1eb243789f87abd6e44318,1518152887266.0666757ecbb89b60a52613cba2dab2f0.', STARTKEY => '4344|8226741aea1eb243789f87abd6e44318', ENDKEY => '4404|5fe6f71832f527f173696d3570556461'}
2018-02-28 14:07:57,819 INFO [hbasefsck-pool1-t42] util.HBaseFsckRepair: Region still in transition, waiting for it to become assigned: {ENCODED => 037d3dff1101418ea3c3868dc9855ecc, NAME => 'pos_flow,7037|c0124cbc08feb233e745aae0d896195a,1518489580825.037d3dff1101418ea3c3868dc9855ecc.', STARTKEY => '7037|c0124cbc08feb233e745aae0d896195a', ENDKEY => '7057|9fae9bddd296a534155c02297532cd28'}
2018-02-28 14:07:57,820 INFO [hbasefsck-pool1-t12] util.HBaseFsckRepair: Region still in transition, waiting for it to become assigned: {ENCODED => 01ff32a4f85c2de9e3e16c9b6156afa2, NAME => 'pos_flow_summary_20170902,,1504437004424.01ff32a4f85c2de9e3e16c9b6156afa2.', STARTKEY => '', ENDKEY => ''}
2018-02-28 14:07:57,823 INFO [hbasefsck-pool1-t41] util.HBaseFsckRepair: Region still in transition, waiting for it to become assigned: {ENCODED => 0761d39042e2ec7002dbf291ce23e209, NAME => 'pos_flow_summary_20170725,,1503624161916.0761d39042e2ec7002dbf291ce23e209.', STARTKEY => '', ENDKEY => ''}

hbase master stop
hbase master start
service hbase-master restart

2018-02-28 14:41:39,547 INFO [main] hdfs.DFSClient: No node available for BP-1225127698-172.31.132.71-1516782893469:blk_1073741999_1175 file=/hbase/data/default/pos_flow_summary_20170304/.tabledesc/.tableinfo.0000000001
2018-02-28 14:41:39,547 INFO [main] hdfs.DFSClient: Could not obtain BP-1225127698-172.31.132.71-1516782893469:blk_1073741999_1175 from any node: java.io.IOException: No live nodes contain current block No live nodes contain current block Block locations: Dead nodes: . Will get new block locations from namenode and retry...
2018-02-28 14:41:39,547 WARN [main] hdfs.DFSClient: DFS chooseDataNode: got # 1 IOException, will wait for 251.87213750182957 msec.
2018-02-28 14:41:39,799 INFO [main] hdfs.DFSClient: No node available for BP-1225127698-172.31.132.71-1516782893469:blk_1073741999_1175 file=/hbase/data/default/pos_flow_summary_20170304/.tabledesc/.tableinfo.0000000001
2018-02-28 14:41:39,799 INFO [main] hdfs.DFSClient: Could not obtain BP-1225127698-172.31.132.71-1516782893469:blk_1073741999_1175 from any node: java.io.IOException: No live nodes contain current block No live nodes contain current block Block locations: Dead nodes: . Will get new block locations from namenode and retry...
2018-02-28 14:41:39,799 WARN [main] hdfs.DFSClient: DFS chooseDataNode: got # 2 IOException, will wait for 5083.97300871329 msec.
2018-02-28 14:41:44,883 INFO [main] hdfs.DFSClient: No node available for BP-1225127698-172.31.132.71-1516782893469:blk_1073741999_1175 file=/hbase/data/default/pos_flow_summary_20170304/.tabledesc/.tableinfo.0000000001
2018-02-28 14:41:44,883 INFO [main] hdfs.DFSClient: Could not obtain BP-1225127698-172.31.132.71-1516782893469:blk_1073741999_1175 from any node: java.io.IOException: No live nodes contain current block No live nodes contain current block Block locations: Dead nodes: . Will get new block locations from namenode and retry...
2018-02-28 14:41:44,883 WARN [main] hdfs.DFSClient: DFS chooseDataNode: got # 3 IOException, will wait for 9836.488682902267 msec.

Reposted from: https://blog.csdn.net/mnasd/article/details/84963221
