YARN single point of failure: ResourceManager Restart
http://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/ResourceManagerRestart.html
Feature
Phase 1: Non-work-preserving RM restart
As of the Hadoop 2.4.0 release, only Phase 1 of ResourceManager Restart was implemented; it is described below.
The overall concept is that the RM persists application metadata (i.e. the ApplicationSubmissionContext) in a pluggable state-store when a client submits an application, and also saves the application's final status, such as its completion state (failed, killed, finished) and diagnostics, when the application completes. In addition, the RM saves credentials such as security keys and tokens so that it can work in a secure environment. Whenever the RM shuts down, as long as the required information (i.e. the application metadata and, in a secure environment, the accompanying credentials) is available in the state-store, the restarted RM can pick up the application metadata from the state-store and re-submit the application. The RM does not re-submit applications that had already completed (i.e. failed, killed, finished) before it went down.
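The persist-then-recover idea above can be sketched in a few lines of Python. This is only an illustrative toy, not the RM's actual code: `InMemoryStateStore`, `StoredApp` and `apps_to_resubmit` are hypothetical names, and a plain dict stands in for the pluggable state-store and for ApplicationSubmissionContext.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Completion states named in the text; an app without one is "not completed".
FINAL_STATES = {"FAILED", "KILLED", "FINISHED"}


@dataclass
class StoredApp:
    app_id: str
    submission_context: dict           # stand-in for ApplicationSubmissionContext
    final_state: Optional[str] = None  # set only when the application completes
    diagnostics: str = ""


class InMemoryStateStore:
    """Hypothetical stand-in for the pluggable state-store."""

    def __init__(self) -> None:
        self._apps: Dict[str, StoredApp] = {}

    def store_new_app(self, app_id: str, submission_context: dict) -> None:
        # Called when a client submits an application.
        self._apps[app_id] = StoredApp(app_id, dict(submission_context))

    def store_final_state(self, app_id: str, final_state: str,
                          diagnostics: str = "") -> None:
        # Called when the application completes.
        if final_state not in FINAL_STATES:
            raise ValueError(f"not a final state: {final_state}")
        app = self._apps[app_id]
        app.final_state = final_state
        app.diagnostics = diagnostics

    def load_all(self) -> List[StoredApp]:
        return list(self._apps.values())


def apps_to_resubmit(store: InMemoryStateStore) -> List[str]:
    """On restart, re-submit only the apps that had not completed."""
    return [app.app_id for app in store.load_all() if app.final_state is None]


store = InMemoryStateStore()
store.store_new_app("app_1", {"queue": "default"})
store.store_new_app("app_2", {"queue": "default"})
store.store_final_state("app_1", "FINISHED")
print(apps_to_resubmit(store))  # ['app_2']
```

Credentials are left out of the sketch; in a secure environment they would be stored and reloaded alongside the metadata.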
While the RM is down, NodeManagers and clients keep polling it until it comes back up. Once the RM is alive again, it sends a re-sync command via heartbeats to all the NodeManagers and ApplicationMasters it was talking to. As of the Hadoop 2.4.0 release, they handle this command as follows: each NM kills all of its managed containers and re-registers with the RM, so from the RM's perspective these re-registered NodeManagers look like newly joining NMs; AMs (e.g. the MapReduce AM) are expected to shut down when they receive the re-sync command. After the RM restarts, loads all the application metadata and credentials from the state-store, and populates them into memory, it creates a new attempt (i.e. ApplicationMaster) for each application that had not yet completed and re-kicks that application as usual. Consequently, the work of previously running applications is lost, since they are essentially killed by the RM via the re-sync command on restart.
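As a rough illustration of the Phase 1 re-sync behavior described above, the following toy model shows why running work is lost. All class and function names here (`ToyNodeManager`, `ToyAppMaster`, `recover_phase1`) are made up for the sketch and do not correspond to Hadoop classes; attempt numbers are an assumed simplification.

```python
from typing import Dict, List, Tuple


class ToyNodeManager:
    def __init__(self, node_id: str, containers: List[str]) -> None:
        self.node_id = node_id
        self.containers = list(containers)

    def on_resync(self) -> Dict[str, object]:
        # Phase 1: kill every managed container, then re-register.
        self.containers.clear()
        # The registration carries no containers, so the RM sees
        # this node the same way it sees a newly joining NM.
        return {"node_id": self.node_id, "containers": []}


class ToyAppMaster:
    def __init__(self, app_id: str) -> None:
        self.app_id = app_id
        self.running = True

    def on_resync(self) -> None:
        # Phase 1: the AM is expected to shut down on re-sync.
        self.running = False


def recover_phase1(
    stored_apps: Dict[str, Tuple[int, bool]],  # app_id -> (last attempt, completed?)
    node_managers: List[ToyNodeManager],
    app_masters: List[ToyAppMaster],
) -> Tuple[List[Dict[str, object]], Dict[str, int]]:
    registrations = [nm.on_resync() for nm in node_managers]
    for am in app_masters:
        am.on_resync()
    # A new attempt is created only for apps that had not completed.
    new_attempts = {
        app_id: last_attempt + 1
        for app_id, (last_attempt, completed) in stored_apps.items()
        if not completed
    }
    return registrations, new_attempts


nm = ToyNodeManager("node_1", ["container_1", "container_2"])
am = ToyAppMaster("app_2")
registrations, new_attempts = recover_phase1(
    {"app_1": (1, True), "app_2": (1, False)}, [nm], [am]
)
print(nm.containers)  # []
print(am.running)     # False
print(new_attempts)   # {'app_2': 2}
```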
Phase 2: Work-preserving RM restart
As of Hadoop 2.6.0, the RM restart feature was further enhanced so that applications running on the YARN cluster are not killed when the RM restarts.
Beyond the groundwork done in Phase 1 to persist application state and reload it on recovery, Phase 2 primarily focuses on reconstructing the entire running state of the YARN cluster. Most of that state belongs to the central scheduler inside the RM, which keeps track of every container's life-cycle, applications' headroom and resource requests, queues' resource usage, and so on. In this way, the RM does not need to kill the AM and re-run the application from scratch as it does in Phase 1. Applications can simply re-sync with the RM and resume from where they left off.
The RM recovers its running state by taking advantage of the container statuses sent from all NMs. An NM does not kill its containers when it re-syncs with the restarted RM. It continues managing the containers and sends the container statuses to the RM when it re-registers. The RM reconstructs the container instances and the associated applications' scheduling status by absorbing this container information. In the meantime, the AM needs to re-send its outstanding resource requests to the RM, because the RM may lose the unfulfilled requests when it shuts down. Application writers who use the AMRMClient library to communicate with the RM do not need to worry about re-sending resource requests on re-sync, as the library takes care of it automatically.
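The reconstruction step can be pictured with a small sketch, under loud assumptions: a container status is reduced to a hypothetical `(container_id, app_id, memory_mb)` tuple, the scheduler state to a few dictionaries, and `ToyAppMaster.on_resync` merely imitates the re-sending of outstanding requests that the text attributes to AMRMClient. None of these names are Hadoop APIs.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical, simplified container status: (container_id, app_id, memory_mb)
ContainerStatus = Tuple[str, str, int]


def rebuild_scheduler_state(nm_reports: Dict[str, List[ContainerStatus]]):
    """Rebuild a toy scheduler view from the statuses NMs send on re-registration."""
    live_containers: Dict[str, Tuple[str, str]] = {}
    used_by_app: Dict[str, int] = defaultdict(int)
    used_by_node: Dict[str, int] = defaultdict(int)
    for node_id, statuses in nm_reports.items():
        for container_id, app_id, memory_mb in statuses:
            # Containers keep running; the RM only re-learns about them.
            live_containers[container_id] = (node_id, app_id)
            used_by_app[app_id] += memory_mb
            used_by_node[node_id] += memory_mb
    return live_containers, dict(used_by_app), dict(used_by_node)


class ToyAppMaster:
    def __init__(self, app_id: str, outstanding_requests: List[int]) -> None:
        self.app_id = app_id
        self.outstanding_requests = list(outstanding_requests)  # memory asks in MB

    def on_resync(self) -> List[int]:
        # The restarted RM may have lost unfulfilled requests, so send them again.
        return list(self.outstanding_requests)


reports = {
    "node_1": [("c_1", "app_1", 1024), ("c_2", "app_2", 2048)],
    "node_2": [("c_3", "app_1", 512)],
}
live, by_app, by_node = rebuild_scheduler_state(reports)
print(by_app)   # {'app_1': 1536, 'app_2': 2048}
print(by_node)  # {'node_1': 3072, 'node_2': 512}

am = ToyAppMaster("app_1", outstanding_requests=[1024, 1024])
print(am.on_resync())  # [1024, 1024]
```

The point of the sketch is that in Phase 2 nothing is killed: the RM's view is rebuilt from what the NMs and AMs report rather than from the state-store alone.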