Analyzing Map/Reduce Task Assignment in JobInProgress

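1. Can the JobTracker decide exactly which task of a Job is assigned to the current TaskTracker node?
2. What is a local map task?
3. What is the purpose of nonRunningMapCache?
4. When a local task attached to the TaskTracker's node is being assigned, how should a Map task be handled if this TaskTracker previously failed to execute it?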

Posted on 2014-4-6 23:51:53

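As is well known, the JobTracker node uses its configured task scheduler (TaskScheduler) to assign tasks to a particular TaskTracker node. The scheduler can only decide which Job or Jobs the TaskTracker receives tasks from and how many tasks it receives; it cannot decide which specific task of a Job is assigned to the current TaskTracker node.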

In addition, for any given TaskTracker node, every job can determine which of its Map tasks are local to that TaskTracker and which are non-local. For Reduce tasks there is no distinction between local and non-local.

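Concretely, when the task scheduler decides to give a TaskTracker node a local task of a Job, it calls the obtainNewLocalMapTask() method of that Job's JobInProgress object; to give it a non-local task, it calls the corresponding obtainNewNonLocalMapTask() method. Relative to the TaskTracker's physical position in the cluster, the Job may have several local tasks and several non-local tasks, and it is JobInProgress that decides which of them the TaskTracker actually gets. When the scheduler assigns a Reduce task of a Job to the TaskTracker, it calls the obtainNewReduceTask() method of the Job's JobInProgress object. How JobInProgress picks a local Map task, a non-local Map task, or a Reduce task is the subject of the rest of this article.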
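The division of labor described above can be sketched with a toy model. This is not Hadoop code: JobView, ToyJob, and ToyScheduler are hypothetical stand-ins, task names are plain strings, and only the three method names mirror the JobInProgress methods named in the text. The scheduler decides which job to ask and how many tasks to request; the job decides which concrete task to hand out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class ToyScheduler {
    // The scheduler only chooses the job and the number of tasks to request.
    static List<String> assign(JobView job, String tracker, int mapSlots, int reduceSlots) {
        List<String> assigned = new ArrayList<>();
        for (int i = 0; i < mapSlots; i++) {
            String task = job.obtainNewLocalMapTask(tracker);
            if (task == null) {
                task = job.obtainNewNonLocalMapTask(tracker); // fall back to non-local
            }
            if (task == null) {
                break; // the job has no map task for this tracker
            }
            assigned.add(task);
        }
        for (int i = 0; i < reduceSlots; i++) {
            String task = job.obtainNewReduceTask(tracker);
            if (task == null) {
                break;
            }
            assigned.add(task);
        }
        return assigned;
    }

    public static void main(String[] args) {
        JobView job = new ToyJob(List.of("m0"), List.of("m1", "m2"), List.of("r0"));
        System.out.println(assign(job, "tracker_host1", 2, 1)); // [m0, m1, r0]
    }
}

// Hypothetical stand-in for the three JobInProgress entry points.
interface JobView {
    String obtainNewLocalMapTask(String tracker);
    String obtainNewNonLocalMapTask(String tracker);
    String obtainNewReduceTask(String tracker);
}

// Toy job: it, not the scheduler, picks the concrete task (here simply FIFO).
class ToyJob implements JobView {
    private final Deque<String> localMaps = new ArrayDeque<>();
    private final Deque<String> nonLocalMaps = new ArrayDeque<>();
    private final Deque<String> reduces = new ArrayDeque<>();

    ToyJob(List<String> local, List<String> nonLocal, List<String> reduceTasks) {
        localMaps.addAll(local);
        nonLocalMaps.addAll(nonLocal);
        reduces.addAll(reduceTasks);
    }

    public String obtainNewLocalMapTask(String tracker) { return localMaps.poll(); }
    public String obtainNewNonLocalMapTask(String tracker) { return nonLocalMaps.poll(); }
    public String obtainNewReduceTask(String tracker) { return reduces.poll(); }
}
```

In this sketch locality is faked by two fixed queues; the real JobInProgress derives it from the tracker's position in the cluster topology.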

1. Assigning a Job's Map Tasks

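A job's Map tasks are divided into local and non-local because a Map task's input data and the TaskTracker node that executes the task may sit at different locations in the cluster. Local and non-local are defined relative to the TaskTracker that runs, or is about to run, the task. When the task scheduler decides to assign a local Map task of some Job to a TaskTracker, JobInProgress looks for the Job's Map tasks whose input data is on the same machine or the same rack as that TaskTracker; those Map tasks are local tasks for that TaskTracker. It is worth noting that localization is precomputed when the job is initialized: based on the physical location of its input data, each Map task is attached to the corresponding physical nodes. The source code for this step is: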

private Map<Node, List<TaskInProgress>> createCache(
    JobClient.RawSplit[] splits, int maxLevel) {
  Map<Node, List<TaskInProgress>> cache =
      new IdentityHashMap<Node, List<TaskInProgress>>(maxLevel);
  for (int i = 0; i < splits.length; i++) {
    // Physical locations of this split (one per replica)
    String[] splitLocations = splits[i].getLocations();
    if (splitLocations.length == 0) {
      nonLocalMaps.add(maps[i]);
      continue;
    }
    // For the physical location of each replica
    for (String host : splitLocations) {
      // Resolve which node of the cluster holds the replica
      Node node = jobtracker.resolveAndAddToTopology(host);
      LOG.info("tip:" + maps[i].getTIPId() + " has split on node:" + node);
      for (int j = 0; j < maxLevel; j++) {
        List<TaskInProgress> hostMaps = cache.get(node);
        if (hostMaps == null) {
          hostMaps = new ArrayList<TaskInProgress>();
          cache.put(node, hostMaps);
          hostMaps.add(maps[i]); // attach the Map task to this node
        }
        // De-duplicate: avoid attaching the same Map task to a node twice
        if (hostMaps.get(hostMaps.size() - 1) != maps[i]) {
          hostMaps.add(maps[i]);
        }
        // Move to the parent node (with the default maxLevel of 2,
        // the parent is the rack node)
        node = node.getParent();
      }
    }
  }
  return cache;
}


After this preprocessing, the mapping from Node to Map tasks is stored in the JobInProgress field nonRunningMapCache. To assign a local Map task to a TaskTracker, JobInProgress only has to resolve the TaskTracker to its node in the cluster topology and use that node to fetch a Map task from nonRunningMapCache; that task is local to the TaskTracker. To assign a non-local Map task, it takes all rack nodes in the cluster except the TaskTracker's own rack and uses those rack nodes to fetch a Map task from nonRunningMapCache; that task is non-local to the TaskTracker.
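A toy version of this pre-assignment and the lookup that follows may make the data structure easier to picture. It is a minimal sketch under stated assumptions, not the Hadoop implementation: nodes are plain strings instead of Node objects, the host-to-rack table RACK_OF is invented, Map tasks are identified by their index, and localCandidates is a hypothetical helper.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ToyMapCache {
    static final int MAX_LEVEL = 2; // host level plus rack level

    // Invented topology for illustration: host name -> rack name.
    static final Map<String, String> RACK_OF =
            Map.of("h1", "/rackA", "h2", "/rackA", "h3", "/rackB");

    final Map<String, List<Integer>> nonRunningMapCache = new HashMap<>();
    final List<Integer> nonLocalMaps = new ArrayList<>();

    // splitLocations[i] lists the hosts that hold replicas of map task i's input.
    ToyMapCache(String[][] splitLocations) {
        for (int i = 0; i < splitLocations.length; i++) {
            if (splitLocations[i].length == 0) {
                nonLocalMaps.add(i); // no location: non-local from the outset
                continue;
            }
            for (String host : splitLocations[i]) {
                String node = host;
                for (int level = 0; level < MAX_LEVEL && node != null; level++) {
                    List<Integer> tips =
                            nonRunningMapCache.computeIfAbsent(node, k -> new ArrayList<>());
                    // Same de-duplication idea as createCache: compare with the last entry.
                    if (tips.isEmpty() || tips.get(tips.size() - 1) != i) {
                        tips.add(i);
                    }
                    node = RACK_OF.get(node); // a host's parent is its rack
                }
            }
        }
    }

    // Tasks local to a tracker: its host entry first, then its rack entry.
    List<Integer> localCandidates(String trackerHost) {
        List<Integer> out = new ArrayList<>();
        String node = trackerHost;
        for (int level = 0; level < MAX_LEVEL && node != null; level++) {
            for (Integer tip : nonRunningMapCache.getOrDefault(node, List.of())) {
                if (!out.contains(tip)) {
                    out.add(tip);
                }
            }
            node = RACK_OF.get(node);
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] splits = { {"h1", "h2"}, {"h3"}, {} };
        ToyMapCache cache = new ToyMapCache(splits);
        System.out.println(cache.nonRunningMapCache.get("/rackA")); // [0]
        System.out.println(cache.localCandidates("h3"));            // [1]
        System.out.println(cache.nonLocalMaps);                     // [2]
    }
}
```

Task 0 has replicas on two hosts of the same rack, yet the rack entry lists it only once, which is the effect of the last-element check.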

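As the source code above shows, what counts as a local task is determined by maxLevel, that is, by the distance in the cluster topology between a Map task's input data and the TaskTracker node. In the version discussed here (Hadoop-0.20.2.0), maxLevel defaults to 2 and can be set in the JobTracker's configuration file through the mapred.task.cache.levels property. The preprocessing step also defines precisely which Map tasks are non-local from the outset: those whose split reports no locations (getLocations() returns an empty array). This does not mean that such a Map task has no input data. Hadoop gives users an API for defining their own splits (by implementing InputSplit), and RawSplit does not store the location of the map input directly but wraps the real InputSplit. This has two important consequences.

First, when defining a custom InputSplit for a Map task, consider whether the task can be a local task for some TaskTracker nodes (for example, if a Map task's input data spans multiple nodes, the task can never be a local task).

Second, a Map task's InputSplit implementation can carry a small amount of input data for the map operation. For example, if a Map task needs two inputs, one large and one only a few hundred or a few thousand bytes, the user can define a custom InputSplit that holds the small input itself; storing such small data in HDFS is clearly not worthwhile. Non-local Map tasks of this kind are stored in the nonLocalMaps field.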

1.1 Assigning a Local Map Task

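Assigning a local Map task to a TaskTracker is fairly simple. There is one exceptional case: when the TaskTracker cannot be resolved to a Node in the cluster, the local assignment is handled as if it were a non-local assignment. The source code is: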

public synchronized Task obtainNewLocalMapTask(TaskTrackerStatus tts,
    int clusterSize, int numUniqueHosts) throws IOException {
  if (!tasksInited.get()) {
    return null;
  }
  // Find a local map task for the current compute node
  int target = findNewMapTask(tts, clusterSize, numUniqueHosts, maxLevel,
      status.mapProgress());
  if (target == -1) {
    return null;
  }
  Task result = maps[target].getTaskToRun(tts.getTrackerName());
  if (result != null) {
    addRunningTaskToTIP(maps[target], result.getTaskID(), tts, true);
  }
  return result;
}


/**
 * Selects a suitable task for the current compute node from the job's map tasks.
 * The maxCacheLevel parameter determines whether a local or a non-local task
 * is being assigned.
 */
private synchronized int findNewMapTask(final TaskTrackerStatus tts,
    final int clusterSize, final int numUniqueHosts,
    final int maxCacheLevel, final double avgProgress) {
  ...
  // Resolve the compute node's host/IP to its node in the cluster topology
  Node node = jobtracker.getNode(tts.getHost());
  //
  // I) Non-running TIP :
  // 1. check from local node to the root [bottom up cache lookup]
  //    i.e if the cache is available and the host has been resolved
  //    (node != null)
  if (node != null) {
    Node key = node; // node that holds the input data of the candidate map tasks
    int level = 0;
    // maxCacheLevel might be greater than this.maxLevel if findNewMapTask is
    // called to schedule any task (local, rack-local, off-switch or speculative),
    // or it might be NON_LOCAL_CACHE_LEVEL (i.e. -1) if findNewMapTask is
    // to only schedule off-switch/speculative tasks
    int maxLevelToSchedule = Math.min(maxCacheLevel, maxLevel);
    for (level = 0; level < maxLevelToSchedule; ++level) {
      // Map tasks attached to key that have not been scheduled yet
      List<TaskInProgress> cacheForLevel = nonRunningMapCache.get(key);
      if (cacheForLevel != null) {
        // Find a suitable task for this compute node in the list
        tip = findTaskFromList(cacheForLevel, tts, numUniqueHosts, level == 0);
        if (tip != null) {
          // Add to running cache
          scheduleMap(tip);
          // remove the cache if its empty
          if (cacheForLevel.size() == 0) {
            nonRunningMapCache.remove(key);
          }
          return tip.getIdWithinJob();
        }
      }
      key = key.getParent();
    }
    // Check if we need to only schedule a local task (node-local/rack-local)
    if (level == maxCacheLevel) {
      return -1;
    }
  }
  ...
}

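The role of maxCacheLevel in the bottom-up lookup can be seen in isolation in the following sketch. It is a simplified model, not the Hadoop method: nodes are strings, the PARENT table is invented, tasks are integers taken in FIFO order, and the FALL_THROUGH return value is a hypothetical marker for "the caller would continue with the non-local search", which the real method does inline.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class BottomUpLookup {
    static final int MAX_LEVEL = 2;
    static final int NON_LOCAL_CACHE_LEVEL = -1;
    static final int NOT_FOUND = -1;
    static final int FALL_THROUGH = -2; // hypothetical marker, not a Hadoop constant

    // Invented topology: child node -> parent node.
    static final Map<String, String> PARENT =
            new HashMap<>(Map.of("h1", "/rackA", "h2", "/rackA"));

    static int findLocal(Map<String, List<Integer>> cache, String host, int maxCacheLevel) {
        String key = host;
        int level;
        // A negative maxCacheLevel makes the loop body unreachable.
        int maxLevelToSchedule = Math.min(maxCacheLevel, MAX_LEVEL);
        for (level = 0; level < maxLevelToSchedule; ++level) {
            List<Integer> tips = cache.get(key);
            if (tips != null && !tips.isEmpty()) {
                int tip = tips.remove(0);
                if (tips.isEmpty()) {
                    cache.remove(key); // drop empty cache entries
                }
                return tip;
            }
            key = PARENT.get(key); // climb from host to rack
        }
        // Only local tasks were requested and every level was searched.
        if (level == maxCacheLevel) {
            return NOT_FOUND;
        }
        return FALL_THROUGH;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> cache = new HashMap<>();
        cache.put("/rackA", new ArrayList<>(List.of(5)));
        System.out.println(findLocal(cache, "h1", MAX_LEVEL));             // 5 (rack-local hit)
        System.out.println(findLocal(cache, "h1", MAX_LEVEL));             // -1 (nothing local left)
        System.out.println(findLocal(cache, "h1", NON_LOCAL_CACHE_LEVEL)); // -2 (lookup skipped)
    }
}
```

With maxCacheLevel equal to maxLevel the method answers "local task or nothing"; with NON_LOCAL_CACHE_LEVEL the local levels are never consulted, which is how the same method serves obtainNewNonLocalMapTask().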

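One more point deserves attention. When the local tasks attached to the TaskTracker's own Node are being assigned, a Map task that previously failed on this TaskTracker is removed from that Node's list; Map tasks that are not runnable or are already running are removed from the list as well. The corresponding source is: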

private synchronized TaskInProgress findTaskFromList(
    Collection<TaskInProgress> tips, TaskTrackerStatus ttStatus,
    int numUniqueHosts, boolean removeFailedTip) {
  Iterator<TaskInProgress> iter = tips.iterator();
  while (iter.hasNext()) {
    TaskInProgress tip = iter.next();
    // Select a tip if
    //   1. runnable : still needs to be run and is not completed
    //   2. ~running : no other node is running it
    //   3. earlier attempt failed : has not failed on this host
    //      and has failed on all the other hosts
    // A TIP is removed from the list if
    //   (1) this tip is scheduled
    //   (2) if the passed list is a level 0 (host) cache
    //   (3) when the TIP is non-schedulable (running, killed, complete)
    if (tip.isRunnable() && !tip.isRunning()) {
      // check if the tip has failed on this host
      if (!tip.hasFailedOnMachine(ttStatus.getHost())
          || tip.getNumberOfFailedMachines() >= numUniqueHosts) {
        // check if the tip has failed on all the nodes
        iter.remove();
        return tip;
      } else if (removeFailedTip) {
        // the case where we want to remove a failed tip from the host cache
        // point#3 in the TIP removal logic above
        iter.remove();
      }
    } else {
      // see point#3 in the comment above for TIP removal logic
      iter.remove();
    }
  }
  return null;
}

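The selection and removal rules of findTaskFromList can be exercised on a toy task type. The sketch below assumes a hypothetical ToyTip class whose fields stand in for TaskInProgress.isRunnable(), isRunning(), and the failed-host bookkeeping; the control flow follows the listing above.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

class TaskPicker {
    static ToyTip findTaskFromList(Collection<ToyTip> tips, String host,
                                   int numUniqueHosts, boolean removeFailedTip) {
        Iterator<ToyTip> iter = tips.iterator();
        while (iter.hasNext()) {
            ToyTip tip = iter.next();
            if (tip.runnable && !tip.running) {
                // Eligible if it never failed here, or if it has failed everywhere.
                if (!tip.failedHosts.contains(host)
                        || tip.failedHosts.size() >= numUniqueHosts) {
                    iter.remove();
                    return tip;
                } else if (removeFailedTip) {
                    iter.remove(); // host-level list: this host will not retry it
                }
            } else {
                iter.remove(); // running or finished: no longer schedulable from this list
            }
        }
        return null;
    }

    public static void main(String[] args) {
        ToyTip running = new ToyTip(0);
        running.running = true;
        ToyTip failedHere = new ToyTip(1);
        failedHere.failedHosts.add("h1");
        ToyTip fresh = new ToyTip(2);

        List<ToyTip> rackList = new ArrayList<>(List.of(running, failedHere, fresh));
        ToyTip chosen = findTaskFromList(rackList, "h1", 3, false);
        System.out.println(chosen.id);       // 2
        System.out.println(rackList.size()); // 1 (the task that failed on h1 stays for other hosts)
    }
}

// Hypothetical stand-in for TaskInProgress.
class ToyTip {
    final int id;
    boolean runnable = true;
    boolean running = false;
    final Set<String> failedHosts = new HashSet<>();

    ToyTip(int id) {
        this.id = id;
    }
}
```

Passing removeFailedTip = true, as the real code does for the level 0 (host) list, would also drop the task that failed on h1, because that list belongs to the failing host alone.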

1.2 Assigning a Non-Local Map Task

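Assigning a non-local Map task to a TaskTracker is considerably more involved than assigning a local one. JobInProgress first tries to pick a non-local task from nonRunningMapCache; if none is found, it tries nonLocalMaps. If that also fails, it checks whether the job has hasSpeculativeMaps set. If not, no non-local Map task is assigned to this TaskTracker. If it is set, JobInProgress picks from the Map tasks already being executed by other TaskTrackers, in priority order: first a local Map task from runningMapCache, then a non-local Map task from runningMapCache, and finally a non-local Map task from nonLocalRunningMaps. If nothing has been found by then, no Map task is assigned to the TaskTracker. The source code is: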

public synchronized Task obtainNewNonLocalMapTask(TaskTrackerStatus tts,
    int clusterSize, int numUniqueHosts) throws IOException {
  if (!tasksInited.get()) {
    return null;
  }
  int target = findNewMapTask(tts, clusterSize, numUniqueHosts,
      NON_LOCAL_CACHE_LEVEL, status.mapProgress());
  if (target == -1) {
    return null;
  }
  Task result = maps[target].getTaskToRun(tts.getTrackerName());
  if (result != null) {
    addRunningTaskToTIP(maps[target], result.getTaskID(), tts, true);
  }
  return result;
}

private synchronized int findNewMapTask(final TaskTrackerStatus tts,
    final int clusterSize, final int numUniqueHosts,
    final int maxCacheLevel, final double avgProgress) {
  ...
  Collection<Node> nodesAtMaxLevel = jobtracker.getNodesAtMaxLevel();
  // get the node parent at max level
  Node nodeParentAtMaxLevel =
      (node == null) ? null : JobTracker.getParentNode(node, maxLevel - 1);
  for (Node parent : nodesAtMaxLevel) {
    // skip the parent that has already been scanned
    if (parent == nodeParentAtMaxLevel) {
      continue;
    }
    List<TaskInProgress> cache = nonRunningMapCache.get(parent);
    if (cache != null) {
      tip = findTaskFromList(cache, tts, numUniqueHosts, false);
      if (tip != null) {
        // Add to the running cache
        scheduleMap(tip);
        // remove the cache if empty
        if (cache.size() == 0) {
          nonRunningMapCache.remove(parent);
        }
        LOG.info("Choosing a non-local task " + tip.getTIPId());
        return tip.getIdWithinJob();
      }
    }
  }
  // 3. Search non-local tips for a new task
  tip = findTaskFromList(nonLocalMaps, tts, numUniqueHosts, false);
  if (tip != null) {
    // Add to the running list
    scheduleMap(tip);
    LOG.info("Choosing a non-local task " + tip.getTIPId());
    return tip.getIdWithinJob();
  }
  // II) Running TIP :
  if (hasSpeculativeMaps) {
    long currentTime = System.currentTimeMillis();
    // 1. Check bottom up for speculative tasks from the running cache
    if (node != null) {
      Node key = node;
      for (int level = 0; level < maxLevel; ++level) {
        Set<TaskInProgress> cacheForLevel = runningMapCache.get(key);
        if (cacheForLevel != null) {
          tip = findSpeculativeTask(cacheForLevel, tts, avgProgress,
              currentTime, level == 0);
          if (tip != null) {
            if (cacheForLevel.size() == 0) {
              runningMapCache.remove(key);
            }
            return tip.getIdWithinJob();
          }
        }
        key = key.getParent();
      }
    }
    // 2. Check breadth-wise for speculative tasks
    for (Node parent : nodesAtMaxLevel) {
      // ignore the parent which is already scanned
      if (parent == nodeParentAtMaxLevel) {
        continue;
      }
      Set<TaskInProgress> cache = runningMapCache.get(parent);
      if (cache != null) {
        tip = findSpeculativeTask(cache, tts, avgProgress, currentTime, false);
        if (tip != null) {
          // remove empty cache entries
          if (cache.size() == 0) {
            runningMapCache.remove(parent);
          }
          LOG.info("Choosing a non-local task " + tip.getTIPId()
              + " for speculation");
          return tip.getIdWithinJob();
        }
      }
    }
    // 3. Check non-local tips for speculation
    tip = findSpeculativeTask(nonLocalRunningMaps, tts, avgProgress,
        currentTime, false);
    if (tip != null) {
      LOG.info("Choosing a non-local task " + tip.getTIPId()
          + " for speculation");
      return tip.getIdWithinJob();
    }
  }
  return -1;
}

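The fallback order for non-local assignment can be condensed into a short sketch. Everything here is a simplification made for illustration: the NonLocalSearch class and its pick method are hypothetical, tasks are strings held in queues, the three speculative sources are passed as one list in priority order, and the progress-based check that the real findSpeculativeTask applies through hasSpeculativeTask() is omitted.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class NonLocalSearch {
    // Returns the chosen task, or null when nothing can be assigned.
    static String pick(Map<String, Deque<String>> nonRunningByRack, String ownRack,
                       Deque<String> nonLocalMaps, boolean hasSpeculativeMaps,
                       List<Deque<String>> runningSourcesInPriorityOrder) {
        // 1. Breadth-wise scan over all racks except the tracker's own rack.
        for (Map.Entry<String, Deque<String>> entry : nonRunningByRack.entrySet()) {
            if (entry.getKey().equals(ownRack)) {
                continue;
            }
            String tip = entry.getValue().poll();
            if (tip != null) {
                return tip;
            }
        }
        // 2. Tasks whose split reported no location.
        String tip = nonLocalMaps.poll();
        if (tip != null) {
            return tip;
        }
        // 3. Only if speculation is enabled: duplicate a task that is already running.
        if (hasSpeculativeMaps) {
            for (Deque<String> source : runningSourcesInPriorityOrder) {
                String runningTip = source.peek(); // the original attempt keeps running
                if (runningTip != null) {
                    return runningTip + " (speculative)";
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, Deque<String>> racks = new LinkedHashMap<>();
        racks.put("/rackA", new ArrayDeque<String>(List.of("m0")));
        racks.put("/rackB", new ArrayDeque<String>(List.of("m1")));
        Deque<String> nonLocal = new ArrayDeque<String>(List.of("m2"));
        List<Deque<String>> running = new ArrayList<>();
        running.add(new ArrayDeque<String>(List.of("m9")));

        System.out.println(pick(racks, "/rackA", nonLocal, true, running)); // m1
        System.out.println(pick(racks, "/rackA", nonLocal, true, running)); // m2
        System.out.println(pick(racks, "/rackA", nonLocal, true, running)); // m9 (speculative)
    }
}
```

Task m0 is never returned because it sits on the tracker's own rack, which this search path deliberately skips.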

2. Assigning a Job's Reduce Tasks

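The input of a Reduce task comes from the output of all the job's Map tasks, and each TaskTracker that runs a Map task keeps the map output on its own local disk. In the vast majority of cases, therefore, a Reduce task's input cannot all reside on a single TaskTracker node, so for any TaskTracker there is no distinction between local and non-local Reduce tasks. Assigning a Reduce task to a TaskTracker is accordingly quite simple, and the process resembles assigning a non-local Map task.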

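JobInProgress first looks for a task in nonRunningReduces. If none is found, it checks whether the job has hasSpeculativeReduces set. If not, nothing is assigned; if so, it picks from runningReduces a Reduce task that is already being executed by another TaskTracker and assigns it to this TaskTracker as well. The source code is: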

public synchronized Task obtainNewReduceTask(TaskTrackerStatus tts,
    int clusterSize, int numUniqueHosts) throws IOException {
  if (status.getRunState() != JobStatus.RUNNING) {
    return null;
  }
  // Ensure we have sufficient map outputs ready to shuffle before
  // scheduling reduces
  if (!scheduleReduces()) {
    return null;
  }
  int target = findNewReduceTask(tts, clusterSize, numUniqueHosts,
      status.reduceProgress());
  if (target == -1) {
    return null;
  }
  Task result = reduces[target].getTaskToRun(tts.getTrackerName());
  if (result != null) {
    addRunningTaskToTIP(reduces[target], result.getTaskID(), tts, true);
  }
  return result;
}

private synchronized int findNewReduceTask(TaskTrackerStatus tts,
    int clusterSize, int numUniqueHosts, double avgProgress) {
  if (numReduceTasks == 0) {
    return -1;
  }
  String taskTracker = tts.getTrackerName();
  TaskInProgress tip = null;
  // Update the last-known clusterSize
  this.clusterSize = clusterSize;
  if (!shouldRunOnTaskTracker(taskTracker)) {
    return -1;
  }
  long outSize = resourceEstimator.getEstimatedReduceInputSize();
  long availSpace = tts.getResourceStatus().getAvailableSpace();
  if (availSpace < outSize) {
    LOG.warn("No local disk space for reduce task. TaskTracker[" + taskTracker
        + "] has " + availSpace + " bytes free; but we expect reduce input to take "
        + outSize);
    return -1; // see if a different TIP might work better.
  }
  // 1. check for a never-executed reduce tip
  // reducers don't have a locality cache, so removeFailedTip is passed as false
  tip = findTaskFromList(nonRunningReduces, tts, numUniqueHosts, false);
  if (tip != null) {
    scheduleReduce(tip);
    return tip.getIdWithinJob();
  }
  // 2. check for a reduce tip to be speculated
  if (hasSpeculativeReduces) {
    tip = findSpeculativeTask(runningReduces, tts, avgProgress,
        System.currentTimeMillis(), false);
    if (tip != null) {
      scheduleReduce(tip);
      return tip.getIdWithinJob();
    }
  }
  return -1;
}

private synchronized TaskInProgress findSpeculativeTask(
    Collection<TaskInProgress> list, TaskTrackerStatus ttStatus,
    double avgProgress, long currentTime, boolean shouldRemove) {
  Iterator<TaskInProgress> iter = list.iterator();
  while (iter.hasNext()) {
    TaskInProgress tip = iter.next();
    // should never be true! (since we delete completed/failed tasks)
    if (!tip.isRunning()) {
      iter.remove();
      continue;
    }
    // The current TaskTracker has not run this task
    if (!tip.hasRunOnMachine(ttStatus.getHost(), ttStatus.getTrackerName())) {
      if (tip.hasSpeculativeTask(currentTime, avgProgress)) {
        // In case of shared list we don't remove it. Since the TIP failed
        // on this tracker can be scheduled on some other tracker.
        if (shouldRemove) {
          iter.remove(); // this tracker is never going to run it again
        }
        return tip;
      }
    } else {
      // Check if this tip can be removed from the list.
      // If the list is shared then we should not remove.
      if (shouldRemove) {
        // This tracker will never speculate this tip
        iter.remove();
      }
    }
  }
  return null;
}

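The order of checks for a Reduce task can be summarized in a few lines. This is a minimal sketch under simplifying assumptions, not the Hadoop method: ReducePicker is hypothetical, tasks are integers, the disk-space figures are passed in directly instead of coming from resourceEstimator and TaskTrackerStatus, and speculation simply returns the first running task without the progress check.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

class ReducePicker {
    static int findNewReduceTask(Deque<Integer> nonRunningReduces,
                                 List<Integer> runningReduces,
                                 boolean hasSpeculativeReduces,
                                 long availSpace, long estimatedInputSize) {
        // The tracker must have room for the estimated reduce input.
        if (availSpace < estimatedInputSize) {
            return -1;
        }
        // 1. Prefer a reduce task that has not been scheduled yet.
        Integer tip = nonRunningReduces.poll();
        if (tip != null) {
            return tip;
        }
        // 2. Otherwise speculate on a running task, if the job allows it.
        if (hasSpeculativeReduces && !runningReduces.isEmpty()) {
            return runningReduces.get(0);
        }
        return -1;
    }

    public static void main(String[] args) {
        Deque<Integer> pending = new ArrayDeque<Integer>(List.of(0));
        List<Integer> running = List.of(7);
        System.out.println(findNewReduceTask(pending, running, true, 100L, 500L)); // -1 (disk too small)
        System.out.println(findNewReduceTask(pending, running, true, 900L, 500L)); // 0
        System.out.println(findNewReduceTask(pending, running, true, 900L, 500L)); // 7 (speculative)
    }
}
```

There is no topology lookup anywhere in this path, which reflects the point made above that Reduce tasks have no notion of locality.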

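In the current Hadoop design, the details of scheduling the tasks within a job are encapsulated in JobInProgress, so the granularity that the job scheduler (TaskScheduler) fully controls is limited to the Job level, while JobInProgress exposes an API on which a TaskScheduler can build its task scheduling. This greatly lowers the barrier to writing a custom TaskScheduler: an implementer can concentrate on a job-level scheduling policy suited to the application scenario.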
