某日 收到告警 线上集群rm切换 观察resourcemanager 日志报错如下

这行不明显 再看看其他日志报错

在 app attempt_removed 时候发生了空指针错误

    break;
case APP_ATTEMPT_REMOVED:
if (!(event instanceof AppAttemptRemovedSchedulerEvent)) {
throw new RuntimeException("Unexpected event type: " + event);
}
AppAttemptRemovedSchedulerEvent appAttemptRemovedEvent =
(AppAttemptRemovedSchedulerEvent) event;
removeApplicationAttempt(
appAttemptRemovedEvent.getApplicationAttemptID(),
appAttemptRemovedEvent.getFinalAttemptState(),
appAttemptRemovedEvent.getKeepContainersAcrossAppAttempts());
break;
case CONTAINER_EXPIRED:

定位到代码问题在这里 新增标记内容

private void removeApplicationAttempt(
ApplicationAttemptId applicationAttemptId,
RMAppAttemptState rmAppAttemptFinalState, boolean keepContainers) {
LOG.info("Application " + applicationAttemptId + " is done." +
" finalState=" + rmAppAttemptFinalState);
try {
writeLock.lock();
SchedulerApplication<FSAppAttempt> application =
applications.get(applicationAttemptId.getApplicationId());
FSAppAttempt attempt = getSchedulerApp(applicationAttemptId); if (attempt == null || application == null) {
LOG.info("Unknown application " + applicationAttemptId + " has completed!");
return;
}
//已经停止了就不用再次停止了 新增
// Check if the attempt is already stopped and don't stop it twice.
if (attempt.isStopped()) {
LOG.info("Application " + applicationAttemptId + " has already been "
+ "stopped!");
return;
} // Release all the running containers
for (RMContainer rmContainer : attempt.getLiveContainers()) {
if (keepContainers
&& rmContainer.getState().equals(RMContainerState.RUNNING)) {
// do not kill the running container in the case of work-preserving AM
// restart.
LOG.info("Skip killing " + rmContainer.getContainerId());
continue;
}
super.completedContainer(rmContainer,
SchedulerUtils.createAbnormalContainerStatus(
rmContainer.getContainerId(),
SchedulerUtils.COMPLETED_APPLICATION),
RMContainerEventType.KILL);
}

增加如下代码

 @Override
public String moveApplication(ApplicationId appId,
String queueName) throws YarnException {
try {
writeLock.lock();
SchedulerApplication<FSAppAttempt> app = applications.get(appId);
if (app == null) {
throw new YarnException("App to be moved " + appId + " not found.");
}
FSAppAttempt attempt = (FSAppAttempt) app.getCurrentAppAttempt();
// To serialize with FairScheduler#allocate, synchronize on app attempt
try {
attempt.getWriteLock().lock();
FSLeafQueue oldQueue = (FSLeafQueue) app.getQueue();
// Check if the attempt is already stopped: don't move stopped app
// attempt. The attempt has already been removed from all queues.
if (attempt.isStopped()) {
LOG.info("Application " + appId + " is stopped and can't be moved!"throw new YarnException("Application " ++ " is stopped and can't be moved!");
}
String destQueueName = handleMoveToPlanQueue(queueName);
FSLeafQueue targetQueue = queueMgr.getLeafQueue(destQueueName, false);
if (targetQueue == null) {
throw new YarnException("Target queue " + queueName
+ " not found or is not a leaf queue.");
}
if (targetQueue == oldQueue) {
return oldQueue.getQueueName();
}
private void executeMove(SchedulerApplication<FSAppAttempt> app,
FSAppAttempt attempt, FSLeafQueue oldQueue, FSLeafQueue newQueue) {
// Check current runs state. Do not remove the attempt from the queue until
// after the check has been performed otherwise it could remove the app
// from a queue without moving it to a new queue.
boolean wasRunnable = oldQueue.isRunnableApp(attempt);
// if app was not runnable before, it may be runnable now
boolean nowRunnable = maxRunningEnforcer.canAppBeRunnable(newQueue,
attempt.getUser());
if (wasRunnable && !nowRunnable) {
throw new IllegalStateException("Should have already verified that app "
+ attempt.getApplicationId() + " would be runnable in new queue");
} // Now it is safe to remove from the queue.
oldQueue.removeApp(attempt); if (wasRunnable) {
maxRunningEnforcer.untrackRunnableApp(attempt);
} else if (nowRunnable) {
// App has changed from non-runnable to runnable
maxRunningEnforcer.untrackNonRunnableApp(attempt);
} attempt.move(newQueue); // This updates all the metrics
app.setQueue(newQueue);
newQueue.addApp(attempt, nowRunnable); if (nowRunnable) {
maxRunningEnforcer.trackRunnableApp(attempt);
}
if (wasRunnable) {
maxRunningEnforcer.updateRunnabilityOnAppRemoval(attempt, oldQueue);
}
}

问题解决

参考 https://issues.apache.org/jira/secure/attachment/12841441/YARN-5136.2.patch

记录一次 hadoop yarn resourceManager无故切换的故障的更多相关文章

  1. 记录一次线上yarn RM频繁切换的故障

    周末一大早被报警惊醒,rm频繁切换 急急忙忙排查 看到两处错误日志 错误信息1 ervation <memory:0, vCores:0> 2019-12-21 11:51:57,781 ...

  2. Hadoop记录-yarn ResourceManager Active频繁易主问题排查(转载)

    一.故障现象 两个节点的ResourceManger频繁在active和standby角色中切换.不断有active易主的告警发出 许多任务的状态没能成功更新,导致一些任务状态卡在NEW_SAVING ...

  3. Hadoop官方文档翻译—— YARN ResourceManager High Availability 2.7.3

    ResourceManager High Availability (RM高可用) Introduction(简介) Architecture(架构) RM Failover(RM 故障切换) Rec ...

  4. Spark&Hive:如何使用scala开发spark访问hive作业,如何使用yarn resourcemanager。

    背景: 接到任务,需要在一个一天数据量在460亿条记录的hive表中,筛选出某些host为特定的值时才解析该条记录的http_content中的经纬度: 解析规则譬如: 需要解析host: api.m ...

  5. Hadoop yarn配置参数

    参照site:http://hadoop.apache.org/docs/r2.6.0/hadoop-yarn/hadoop-yarn-common/yarn-default.xml 我们在配置yar ...

  6. Hadoop Yarn 安装

    环境:Linux, 8G 内存.60G 硬盘 , Hadoop 2.2.0 为了构建基于Yarn体系的Spark集群.先要安装Hadoop集群,为了以后查阅方便记录了我本次安装的详细步骤. 事前准备 ...

  7. 通过tarball形式安装HBASE Cluster(CDH5.0.2)——配置分布式集群中的YARN ResourceManager 的HA

    <?xml version="1.0"?> <!-- Licensed under the Apache License, Version 2.0 (the &q ...

  8. hadoop yarn HA集群搭建

    可先完成hadoop namenode HA的搭建:http://www.cnblogs.com/kisf/p/7458519.html 搭建yarnde HA只需要在namenode HA配置基础上 ...

  9. hadoop+yarn+hbase+storm+kafka+spark+zookeeper)高可用集群详细配置

    配置 hadoop+yarn+hbase+storm+kafka+spark+zookeeper 高可用集群,同时安装相关组建:JDK,MySQL,Hive,Flume 文章目录 环境介绍 节点介绍 ...

随机推荐

  1. 【概率论】4-5:均值和中值(The Mean and the Median)

    title: [概率论]4-5:均值和中值(The Mean and the Median) categories: - Mathematic - Probability keywords: - Me ...

  2. [svn]查看,删除svn账号

    1.查看svn账号 ll ~/.subversion/auth/svn.simple 随便打开一个文件 这是保存的对应地址的svn账号和密码,都是明文的 win路径:C:\Users\ysk\AppD ...

  3. ReactJS和AngularJS对比

    Angular的特点: 优势: AngularJS是一套完整的框架,angular有自带的数据绑定.render渲染.angularUI库,过滤器,$filter,$directive(模板),$se ...

  4. PYTHON -----pyinstaller的安装

    这几天一直在安装pyinstaller库,发现了一个好方法 因为一直使用pip安装,我只能介绍一下pip安装pyinstaller的方法的注意事项 1:一般pip会在安装python的Scripts文 ...

  5. 从UDP的”连接性”说起–告知你不为人知的UDP

    原文地址:http://bbs.utest.qq.com/?p=631 很早就计划写篇关于UDP的文章,尽管UDP协议远没TCP协议那么庞大.复杂,但是,要想将UDP描述清楚,用好UDP却要比TCP难 ...

  6. ybatis 逆向工程 自动生成的mapper文件没有 主键方法

    1.数据表没有设置主键 设置个主键就好 2.在mybits配置文档里设置了某些属性值为false 在mybatis配置文档里查看 enableSelectByPrimaryKey="true ...

  7. Class.ForName()读取配置文件

    榨汁机(Juicer)榨汁的案例 分别有水果(Fruit)苹果(Apple)香蕉(Banana)桔子(Orange)榨汁(squeeze) public class Demo_Reflect { /* ...

  8. 一篇文章搞懂Python装饰器所有用法

    01. 装饰器语法糖 如果你接触 Python 有一段时间了的话,想必你对 @ 符号一定不陌生了,没错 @ 符号就是装饰器的语法糖. 它放在一个函数开始定义的地方,它就像一顶帽子一样戴在这个函数的头上 ...

  9. set serveroutput on 命令

    使用set serveroutput on 命令设置环境变量serveroutput为打开状态,从而使得pl/sql程序能够在SQL*plus中输出结果 使用函数dbms_output.put_lin ...

  10. 阶段5 3.微服务项目【学成在线】_day04 页面静态化_17-页面静态化-模板管理-GridFS研究-存文件

    将模板信息保存在cms_template里面 存储在fs.chunks这个集合中.这个集合里面存的是分块文件. fs.files存的是文件的基本信息 chunks存的是块信息 创建测试文件 在cms的 ...