YARN - Yet Another Resource Negotiator

http://www.socc2013.org/home/program

http://www.ibm.com/developerworks/cn/opensource/os-cn-hadoop-yarn/

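Problems with Hadoop V1.0

Hadoop was originally built to index massive web crawls, and it fits that scenario well. Today, however, Hadoop is used as a general-purpose computing platform, which goes beyond the goals and scope it was designed for.
As a general-purpose platform, Hadoop therefore has two main shortcomings: the computation model is tightly coupled to resource management, so no model other than map/reduce can be used; and job control is centralized, which creates serious scalability problems.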

1) tight coupling of a specific programming model with the resource management infrastructure, forcing developers to abuse the MapReduce programming model

2) centralized handling of jobs’ control flow, which resulted in endless scalability concerns for the scheduler

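YARN exists to decouple the computation model from resource management; MapReduce is just one of the applications running on top of YARN.
Switching to another computation model also becomes easy, for example Dryad, Giraph, Hoya, REEF, Spark, Storm and Tez.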

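The problems in more detail:

The JobTracker is the central processing point of MapReduce and is a single point of failure.

The JobTracker takes on too many tasks and consumes too many resources. When there are very many MapReduce jobs, the memory overhead becomes large, which in turn raises the risk of the JobTracker failing. This is why the industry commonly cites an upper limit of about 4,000 nodes for the old Hadoop MapReduce.

On the TaskTracker side, representing resources as a number of map/reduce tasks is too simplistic because it ignores CPU and memory usage. If two memory-hungry tasks are scheduled onto the same node, an OOM is likely.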

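On the TaskTracker side, resources are rigidly split into map task slots and reduce task slots. When the system has only map tasks or only reduce tasks, resources are wasted, which is the cluster utilization problem mentioned earlier.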
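The slot problem above can be illustrated with a toy calculation. This is only a sketch: the function names and the slot and task counts are made up for illustration and are not taken from Hadoop.

```python
# Toy model: a node with typed map/reduce slots versus one shared pool.
# All numbers are illustrative assumptions, not measurements.

def slot_utilization(map_slots, reduce_slots, map_tasks, reduce_tasks):
    """Fraction of slots in use when slots are typed (Hadoop 1.x style)."""
    used = min(map_slots, map_tasks) + min(reduce_slots, reduce_tasks)
    return used / (map_slots + reduce_slots)

def pooled_utilization(total_units, map_tasks, reduce_tasks):
    """Fraction of capacity in use when any task may take any unit."""
    return min(total_units, map_tasks + reduce_tasks) / total_units

# A map-only workload: the 4 reduce slots sit idle in the typed model.
typed = slot_utilization(map_slots=8, reduce_slots=4, map_tasks=20, reduce_tasks=0)
pooled = pooled_utilization(total_units=12, map_tasks=20, reduce_tasks=0)
print(f"typed slots: {typed:.2f}, shared pool: {pooled:.2f}")
```

With a map-only workload the typed model cannot use the reduce slots, while the shared pool is fully used.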

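At the source level, the code is very hard to read. A single class often does too much and runs to more than 3,000 lines, so its responsibility is unclear and bug fixing and version maintenance become harder.

Operationally, the current Hadoop MapReduce framework forces a system-wide upgrade for any change, important or not (bug fixes, performance improvements, new features). Worse, it forces every client of the distributed cluster to upgrade at the same time, regardless of user preference. Users then waste a lot of time verifying that their existing applications still work with the new Hadoop version.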

The three main roles in YARN

ResourceManager(RM)

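The ResourceManager supports hierarchical application queues, each entitled to a certain share of the cluster's resources. In a sense it is a pure scheduler: it does not monitor or track the status of applications while they run, and it likewise does not restart tasks that fail because of application errors or hardware faults. The ResourceManager schedules according to each application's resource requirements; every application needs different kinds of resources and therefore different containers. Resources include memory, CPU, disk, network and so on. This differs markedly from the fixed-type resource model of the current MapReduce, a model that hurts cluster utilization. The ResourceManager provides a pluggable scheduling policy, which is responsible for dividing cluster resources among the queues and applications. The scheduling plug-in can be based on the existing capacity scheduling and fair scheduling models.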

The RM runs as a daemon on a dedicated machine, and acts as the central authority arbitrating resources among various competing applications in the cluster.
Given this central and global view of the cluster resources, it can enforce rich, familiar properties such as fairness, capacity, and locality across tenants.

1. Jobs are submitted to the RM via a public submission protocol and go through an admission control phase during which security credentials are validated and various operational and administrative checks are performed.

2. Once the scheduler has enough resources, the application is moved from accepted to running state. Aside from internal bookkeeping, this involves allocating a container for the AM and spawning it on a node in the cluster.

3. A record of accepted applications is written to persistent storage and recovered in case of RM restart or failure.
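The three steps above can be sketched as a small state machine. This is a toy model under stated assumptions: `ToyRM`, `AppState` and the in-memory `state_store` list are hypothetical names for illustration, not YARN classes, and admission control is reduced to a single boolean.

```python
from enum import Enum

class AppState(Enum):
    ACCEPTED = "accepted"
    RUNNING = "running"
    REJECTED = "rejected"

class ToyRM:
    """Hypothetical stand-in for the RM's admission and AM-launch logic."""

    def __init__(self, free_memory_mb):
        self.free_memory_mb = free_memory_mb
        self.apps = {}          # app_id -> AppState
        self.state_store = []   # stand-in for persistent storage

    def submit(self, app_id, credentials_ok, am_memory_mb):
        # Step 1: admission control (here only a credentials flag).
        if not credentials_ok:
            self.apps[app_id] = AppState.REJECTED
            return self.apps[app_id]
        # Step 3: record the accepted application so it survives a restart.
        self.apps[app_id] = AppState.ACCEPTED
        self.state_store.append((app_id, am_memory_mb))
        return self.schedule(app_id, am_memory_mb)

    def schedule(self, app_id, am_memory_mb):
        # Step 2: once resources suffice, allocate the AM container and run.
        if (self.apps[app_id] is AppState.ACCEPTED
                and self.free_memory_mb >= am_memory_mb):
            self.free_memory_mb -= am_memory_mb
            self.apps[app_id] = AppState.RUNNING
        return self.apps[app_id]

rm = ToyRM(free_memory_mb=1024)
print(rm.submit("app_a", True, 512))    # enough memory for the AM
print(rm.submit("app_b", True, 1024))   # stays accepted until memory frees up
print(rm.submit("app_c", False, 256))   # fails admission control
```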

NodeManager (NM)

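The NodeManager is the framework's agent on each machine. It hosts the containers that execute applications, monitors their resource usage (CPU, memory, disk, network) and reports it to the scheduler (via heartbeats).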

NMs are responsible for monitoring resource availability, reporting faults, and container lifecycle management (e.g., starting, killing).
Communications between RM and NMs are heartbeat-based for scalability.

ApplicationMaster (AM)

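Each application's ApplicationMaster is responsible for requesting suitable resource containers from the scheduler, running tasks, tracking application status, monitoring progress, and handling task failures.
In YARN the AM is itself a special container. It is not global: one AM is created per job and is dedicated to that job's entire lifecycle. An AM is usually built on an existing high-level programming framework such as MapReduce.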

The ApplicationMaster is the “head” of a job, managing all lifecycle aspects including dynamically increasing and decreasing resources consumption, managing the flow of execution (e.g., running reducers against the output of maps), handling faults and computation skew, and performing other local optimizations.

The AM can run arbitrary user code, and can be written in any programming language since all communication with the RM and NM is encoded using extensible communication protocols (e.g., protobuf).
In practice, though, we expect most jobs will use a higher level programming framework (e.g., MapReduce, Dryad, Tez, REEF, etc.).

Resource Manager (RM)

What does the RM do?
The ResourceManager should only handle live resource scheduling, which helps the central components in YARN scale beyond the Hadoop 1.0 JobTracker.

What does the RM not do?

ResourceManager is not responsible for:
1. coordinating application execution or task fault-tolerance
2. providing status or metrics for running applications (now part of the ApplicationMaster)
3. serving framework specific reports of completed jobs (now delegated to a per-framework daemon)

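Communication between the RM and the AM

The RM has to communicate with clients, AMs and NMs, so it must expose an interface for each. The RM-AM communication is the most important of these and is the focus here.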

The ResourceManager exposes two public interfaces, towards:
1) clients submitting applications, and
2) ApplicationMaster(s) dynamically negotiating access to resources,
and one internal interface towards NodeManagers for cluster monitoring and resource access management.

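The AM and the RM communicate through ResourceRequests (RRs): the AM uses them to tell the RM what resources it needs.
The request attributes listed below are currently supported, and the set can be extended over time.
The RM can now also use this channel to reclaim resources from the AM, for example when resources are scarce and the RM needs to reallocate what it granted earlier.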

ApplicationMasters codify their need for resources in terms of one or more ResourceRequests, each of which tracks:
1. number of containers (e.g., 200 containers),
2. resources per container (e.g., 2GB RAM, 1 CPU),
3. locality preferences, and
4. priority of requests within the application
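The four attributes above map naturally onto a small record type. The sketch below is illustrative only: the class and field names are assumptions chosen to mirror the list, not the actual Hadoop Java API, and the node and rack names are made up.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass(frozen=True)
class Resource:
    memory_mb: int
    vcores: int

@dataclass
class ResourceRequest:
    num_containers: int                  # 1. number of containers
    capability: Resource                 # 2. resources per container
    locality: List[str] = field(default_factory=lambda: ["*"])  # 3. node, rack, or "*" for anywhere
    priority: int = 0                    # 4. priority within the application

    def total(self) -> Resource:
        """Aggregate demand expressed by this request."""
        return Resource(self.capability.memory_mb * self.num_containers,
                        self.capability.vcores * self.num_containers)

# 200 containers of 2 GB RAM and 1 CPU each, preferring a hypothetical node, then its rack.
req = ResourceRequest(200, Resource(2048, 1),
                      locality=["node17", "rack3", "*"], priority=1)
print(req.total())
```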

Application Master (AM)

The ApplicationMaster is the process that coordinates the application’s execution in the cluster, but it itself is run in the cluster just like any other container.
A component of the RM negotiates for the container to spawn this bootstrap process.

1. The AM periodically heartbeats to the RM to affirm its liveness and to update the record of its demand. AM encodes its preferences and constraints in a heartbeat message to the RM.

2. In response to subsequent heartbeats, the AM will receive a container lease on bundles of resources bound to a particular node in the cluster.

3. Based on the containers it receives from the RM, the AM may update its execution plan to accommodate perceived abundance or scarcity.

Since the RM does not interpret the container status, the AM determines the semantics of the success or failure of the container exit status reported by NMs through the RM.
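The heartbeat-driven negotiation described above can be sketched as a loop. This is a simplified model under assumptions: `ToyScheduler` and `run_am` are hypothetical names, the scheduler grants a fixed number of containers per heartbeat in round-robin order, and failures are ignored.

```python
class ToyScheduler:
    """Stand-in for the RM: grants at most `per_beat` containers per heartbeat."""

    def __init__(self, nodes, per_beat):
        self.nodes = list(nodes)
        self.per_beat = per_beat
        self.next_id = 0

    def allocate(self, outstanding):
        # The heartbeat carries the AM's outstanding demand; the response
        # carries container leases, each bound to a particular node.
        granted = []
        for _ in range(min(outstanding, self.per_beat)):
            node = self.nodes[self.next_id % len(self.nodes)]
            granted.append((f"container_{self.next_id}", node))
            self.next_id += 1
        return granted

def run_am(scheduler, tasks, max_beats=100):
    """Heartbeat until every task has been bound to a granted container."""
    pending = list(tasks)
    placement = {}
    beats = 0
    while pending and beats < max_beats:
        beats += 1
        for container_id, node in scheduler.allocate(len(pending)):
            # Late binding: the AM decides which task uses which container.
            placement[pending.pop(0)] = (container_id, node)
    return placement, beats

placement, beats = run_am(ToyScheduler(["n1", "n2"], per_beat=2),
                          ["t0", "t1", "t2", "t3", "t4"])
print(beats, placement)
```

A real AM would also react to container exit statuses in the heartbeat response and revise its plan, which this sketch leaves out.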

Since the AM is itself a container running in a cluster of unreliable hardware, it should be resilient to failure.

Node Manager (NM)

The NodeManager is the “worker” daemon in YARN.
It authenticates container leases, manages containers’ dependencies, monitors their execution, and provides a set of services to containers.

All containers in YARN, including AMs, are described by a container launch context (CLC).

This record includes a map of environment variables, dependencies stored in remotely accessible storage, security tokens, payloads for NM services, and the command necessary to create the process.
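As a rough picture of what such a record holds, here is a minimal sketch. The field names follow the description above but are assumptions, not the Hadoop API, and the paths, variable values and the `shell_line` helper are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ContainerLaunchContext:
    environment: Dict[str, str] = field(default_factory=dict)      # environment variables
    local_resources: Dict[str, str] = field(default_factory=dict)  # local name -> remote URI
    tokens: bytes = b""                                            # security tokens
    service_data: Dict[str, bytes] = field(default_factory=dict)   # payloads for NM services
    commands: List[str] = field(default_factory=list)              # commands that create the process

def shell_line(clc: ContainerLaunchContext) -> str:
    """Render the environment and commands as one shell-style line."""
    env = " ".join(f"{k}={v}" for k, v in sorted(clc.environment.items()))
    cmd = " && ".join(clc.commands)
    return f"{env} {cmd}".strip()

# Hypothetical values for illustration only.
clc = ContainerLaunchContext(
    environment={"JAVA_HOME": "/usr/lib/jvm/default", "APP_ROLE": "worker"},
    local_resources={"job.jar": "hdfs:///apps/demo/job.jar"},
    commands=["java -cp job.jar Worker"],
)
print(shell_line(clc))
```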

After validating the authenticity of the lease, the NM configures the environment for the container, including initializing its monitoring subsystem with the resource constraints specified in the lease.

To launch the container, the NM copies all the necessary dependencies (data files, executables, tarballs) to local storage.

The NM eventually garbage collects dependencies not in use by running containers.

Container Killing

The NM will also kill containers as directed by the RM or the AM.

a. Containers may be killed when the RM reports their owning application as completed, or when the scheduler decides to evict them for another tenant.
b. Containers may be killed when the NM detects that they exceeded the limits of their lease.
c. AMs may request containers to be killed when the corresponding work isn’t needed any more.
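The kill conditions above can be collected into one decision function. This is a sketch under assumptions: the function name, the argument shapes and the reason strings are invented for illustration, and only memory is checked against the lease.

```python
def containers_to_kill(leases, usage_mb, completed_apps, evicted, released_by_am):
    """Return {container_id: reason} for every container the NM should kill.

    leases maps container_id -> (app_id, memory_limit_mb).
    """
    kills = {}
    for cid, (app_id, limit_mb) in leases.items():
        if app_id in completed_apps:
            kills[cid] = "application completed"   # reported by the RM
        elif cid in evicted:
            kills[cid] = "evicted by scheduler"    # decided by the RM
        elif usage_mb.get(cid, 0) > limit_mb:
            kills[cid] = "exceeded lease limits"   # detected by the NM
        elif cid in released_by_am:
            kills[cid] = "released by AM"          # requested by the AM
    return kills

leases = {
    "c1": ("app1", 2048),
    "c2": ("app2", 1024),
    "c3": ("app2", 1024),
    "c4": ("app3", 512),
}
kills = containers_to_kill(
    leases,
    usage_mb={"c2": 1500, "c3": 800, "c4": 100},
    completed_apps={"app1"},
    evicted=set(),
    released_by_am={"c3"},
)
print(kills)
```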

Whenever a container exits, the NM will clean up its working directory in local storage. When an application completes, all resources owned by its containers are discarded on all nodes, including any of its processes still running in the cluster.

NM Local Monitoring

The NM also periodically monitors the health of the physical node. It monitors any issues with the local disks, and frequently runs an admin-configured script, which can in turn point to any hardware/software issues. When such an issue is discovered, the NM changes its state to unhealthy and reports this to the RM, which then makes a scheduler-specific decision to kill the containers and/or stop future allocations on this node until the health issue is addressed.

YARN framework/application writers

From the preceding description of the core architecture, we extract the responsibilities of a YARN application author:

1. Submitting the application by passing a CLC for the ApplicationMaster to the RM.
2. When the RM starts the AM, the AM should register with the RM and periodically advertise its liveness and requirements over the heartbeat protocol.
3. Once the RM allocates a container, the AM can construct a CLC to launch the container on the corresponding NM. It may also monitor the status of the running container and stop it when the resource should be reclaimed. Monitoring the progress of work done inside the container is strictly the AM’s responsibility.
4. Once the AM is done with its work, it should unregister from the RM and exit cleanly.
5. Optionally, framework authors may add control flow between their own clients to report job status and expose a control plane.

Fault tolerance and availability

RM Failover: still a single point of failure

At the time of this writing, the RM remains a single point of failure in YARN’s architecture.

The RM recovers from its own failures by restoring its state from a persistent store on initialization.
Once the recovery process is complete, it kills all the containers running in the cluster, including live ApplicationMasters. It then launches new instances of each AM.

NM Failover

When an NM fails, the RM detects it by timing out its heartbeat response, marks all the containers running on that node as killed, and reports the failure to all running AMs.
If the fault is transient, the NM will re-synchronize with the RM, clean up its local state, and continue.
In both cases, AMs are responsible for reacting to node failures, potentially redoing work done by any containers running on that node during the fault.
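Heartbeat-timeout detection of a failed NM can be sketched in a few lines. This is a minimal illustration: the function name, the data shapes and the 30-second timeout are assumptions, not YARN defaults.

```python
def expire_nodes(last_heartbeat, containers_by_node, now, timeout_s):
    """Return (lost_nodes, killed_containers) given last-heartbeat timestamps."""
    lost_nodes = sorted(node for node, seen in last_heartbeat.items()
                        if now - seen > timeout_s)
    # Every container on a lost node is marked as killed; the AMs that own
    # them are then told and decide whether to redo the work.
    killed = [c for node in lost_nodes
              for c in containers_by_node.get(node, [])]
    return lost_nodes, killed

last_heartbeat = {"n1": 100.0, "n2": 40.0}
containers_by_node = {"n1": ["c1"], "n2": ["c2", "c3"]}
lost, killed = expire_nodes(last_heartbeat, containers_by_node,
                            now=105.0, timeout_s=30.0)
print(lost, killed)
```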

AM Failover

Since the AM runs in the cluster, its failure does not affect the availability of the cluster, but the probability of an application hiccup due to AM failure is higher than in Hadoop 1.x.
The RM may restart the AM if it fails, though the platform offers no support to restore the AM's state.

Container Failure

The failure handling of the containers themselves is completely left to the frameworks.
The RM collects all container exit events from the NMs and propagates those to the corresponding AMs in a heartbeat response.

Mesos vs. YARN

While Mesos and YARN both have schedulers at two levels, there are two very significant differences.

First, Mesos is an offer-based resource manager, whereas YARN has a request-based approach.
YARN allows the AM to ask for resources based on various criteria including locations, and allows the requester to modify future requests based on what was given and on current usage.
Our approach was necessary to support the location based allocation.

Second, instead of a per-job intraframework scheduler, Mesos leverages a pool of central schedulers (e.g., classic Hadoop or MPI).
YARN enables late binding of containers to tasks, where each individual job can perform local optimizations, and seems more amenable to rolling upgrades (since each job can run on a different version of the framework). On the other hand, a per-job ApplicationMaster might result in greater overhead than the Mesos approach.

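The two differences, in short:
Mesos is offer-based: the master collects the resource offers available on the slaves and notifies each computation framework, and each framework's scheduler decides whether to take the current offer.
YARN is request-based: the AM does not know whether enough resources are available; it simply sends resource requests (carrying its selection criteria) to the RM.

Mesos uses central schedulers, that is, one scheduler per computation framework.
YARN assigns one AM to each job, which makes job-level local optimizations easier.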
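The offer-based versus request-based contrast can be made concrete with a toy comparison. Both functions are deliberately simplified sketches with made-up names; neither reflects the real Mesos or YARN APIs, and resources are reduced to whole nodes.

```python
def offer_based(free_nodes, framework_accepts):
    """Offer style: the master offers free nodes; the framework accepts or declines each."""
    return [node for node in free_nodes if framework_accepts(node)]

def request_based(free_nodes, preferred, count):
    """Request style: the requester states a count and locality preferences; the manager picks."""
    # Preferred nodes sort first (False < True), ties broken by name.
    ranked = sorted(free_nodes, key=lambda node: (node not in preferred, node))
    return ranked[:count]

free = ["n1", "n2", "n3", "n4"]
wanted = {"n3", "n9"}   # n9 is not currently free

accepted = offer_based(free, lambda node: node in wanted)
granted = request_based(free, wanted, count=2)
print(accepted, granted)
```

In the offer model the framework only sees what happens to be offered and takes or leaves it; in the request model the manager sees the full preference list and can fall back to a non-preferred node.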

Hadoop V1.0 vs. YARN

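First, the client is unchanged: most of its APIs and interfaces remain compatible. This keeps the change transparent to developers, who do not have to make major changes to existing code.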

What advantages does the YARN framework have over the old MapReduce framework? We can see the following:

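This design greatly reduces the resource consumption of the JobTracker (now the ResourceManager) and distributes the programs that monitor the status of each job's tasks, which is safer and more elegant.
In the new YARN, the ApplicationMaster is a replaceable part: users can write their own AppMst for different programming models, so more kinds of programming models can run on a Hadoop cluster. See the mapred-site.xml in the official Hadoop YARN configuration templates.
Resources are expressed in units of memory (the current version of YARN does not take CPU usage into account), which is more reasonable than the earlier count of remaining slots.
In the old framework, one heavy burden on the JobTracker was monitoring the tasks of every job. That part is now handed to the ApplicationMaster, while a ResourceManager module called ApplicationsMasters (note: not ApplicationMaster) monitors the health of each ApplicationMaster and restarts it on another machine if something goes wrong.
Container is the framework YARN introduced for future resource isolation. This probably borrows from the work on Mesos. At present it is only a framework and provides isolation of Java virtual machine memory only; the Hadoop team's design should be able to support more kinds of resource scheduling and control later. Since resources are expressed as an amount of memory, the awkward situation where separate map slots and reduce slots left cluster resources idle no longer arises.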