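The main purpose of delay scheduling is to improve data locality and reduce the amount of data moved across the network. When a MapTask's input data is not local to the node offering a slot, the scheduler delays that task and gives the slot to a MapTask that does have locality.

  The general idea of delay scheduling is as follows:

  If the job has a node-local MapTask, return that task. If it does not, delay scheduling: for up to nodeLocalityDelay, keep looking for a node-local MapTask and return it once found;

  otherwise, once the wait exceeds nodeLocalityDelay, look for a rack-local MapTask and return it. If none is found, delay again: for up to rackLocalityDelay, keep looking for a rack-local MapTask and return it once found;

  otherwise, once the wait exceeds nodeLocalityDelay + rackLocalityDelay, look for an off-switch MapTask and return it.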

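  The main variables related to delay scheduling in FairScheduler.java:

long nodeLocalityDelay: // how long a job waits for a node-local task before rack-local tasks are allowed
long rackLocalityDelay: // how long a job waits for a rack-local task before off-switch tasks are allowed
boolean skippedAtLastHeartbeat: // whether the job was delayed (skipped) at the last heartbeat
long timeWaitedForLocalMap: // time waited since the job's last MapTask was assigned
LocalityLevel lastMapLocalityLevel: // locality level of the last MapTask assigned
nodeLocalityDelay = rackLocalityDelay =
Math.min(15000 , (long) (1.5 * jobTracker.getNextHeartbeatInterval()));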
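
  As a quick worked example of the formula above, here is a minimal sketch. The class and method names (DelayDefaults, defaultLocalityDelay) are hypothetical, and the heartbeat intervals are made-up inputs; only the formula itself comes from the listing.

```java
// Hypothetical helper: reproduces only the delay formula quoted above.
final class DelayDefaults {
    static final long MAX_DELAY_MS = 15000L;

    // heartbeatIntervalMs stands in for jobTracker.getNextHeartbeatInterval().
    static long defaultLocalityDelay(long heartbeatIntervalMs) {
        return Math.min(MAX_DELAY_MS, (long) (1.5 * heartbeatIntervalMs));
    }

    public static void main(String[] args) {
        System.out.println(defaultLocalityDelay(3000));   // 4500
        System.out.println(defaultLocalityDelay(20000));  // 15000 (capped)
    }
}
```

  So the delay is one and a half heartbeat intervals, capped at 15 seconds.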

  

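  In the fair scheduler, each job keeps two variables for delay scheduling: the locality level of the last MapTask scheduled (lastMapLocalityLevel) and the time the job has waited since it started being skipped (timeWaitedForLocalMap). The allowed locality level is computed in the getAllowedLocalityLevel() method of FairScheduler.java: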

/**
 * Get the maximum locality level at which a given job is allowed to
 * launch tasks, based on how long it has been waiting for local tasks.
 * This is used to implement the "delay scheduling" feature of the Fair
 * Scheduler for optimizing data locality.
 * If the job has no locality information (e.g. it does not use HDFS), this
 * method returns LocalityLevel.ANY, allowing tasks at any level.
 * Otherwise, the job can only launch tasks at its current locality level
 * or lower, unless it has waited at least nodeLocalityDelay or
 * rackLocalityDelay milliseconds, depending on the current level. If it
 * has waited (nodeLocalityDelay + rackLocalityDelay) milliseconds,
 * it can go to any level.
 */
protected LocalityLevel getAllowedLocalityLevel(JobInProgress job,
    long currentTime) {
  JobInfo info = infos.get(job);
  if (info == null) { // Job not in infos (shouldn't happen)
    LOG.error("getAllowedLocalityLevel called on job " + job
        + ", which does not have a JobInfo in infos");
    return LocalityLevel.ANY;
  }
  if (job.nonLocalMaps.size() > 0) { // Job doesn't have locality information
    return LocalityLevel.ANY;
  }
  // Don't wait for locality if the job's pool is starving for maps
  Pool pool = poolMgr.getPool(job);
  PoolSchedulable sched = pool.getMapSchedulable();
  long minShareTimeout = poolMgr.getMinSharePreemptionTimeout(pool.getName());
  long fairShareTimeout = poolMgr.getFairSharePreemptionTimeout();
  if (currentTime - sched.getLastTimeAtMinShare() > minShareTimeout ||
      currentTime - sched.getLastTimeAtHalfFairShare() > fairShareTimeout) {
    eventLog.log("INFO", "No delay scheduling for "
        + job.getJobID() + " because it is being starved");
    return LocalityLevel.ANY;
  }
  // In the common case, compute locality level based on time waited
  switch(info.lastMapLocalityLevel) {
  case NODE: // Last task launched was node-local
    if (info.timeWaitedForLocalMap >=
        nodeLocalityDelay + rackLocalityDelay)
      return LocalityLevel.ANY;
    else if (info.timeWaitedForLocalMap >= nodeLocalityDelay)
      return LocalityLevel.RACK;
    else
      return LocalityLevel.NODE;
  case RACK: // Last task launched was rack-local
    if (info.timeWaitedForLocalMap >= rackLocalityDelay)
      return LocalityLevel.ANY;
    else
      return LocalityLevel.RACK;
  default: // Last task was non-local; can launch anywhere
    return LocalityLevel.ANY;
  }
}

getAllowedLocalityLevel()

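1. If lastMapLocalityLevel is NODE:

1) if timeWaitedForLocalMap >= nodeLocalityDelay + rackLocalityDelay, MapTasks at the off-switch level or any more local level may be scheduled;

2) otherwise, if timeWaitedForLocalMap >= nodeLocalityDelay, MapTasks at the rack-local or node-local level may be scheduled;

3) otherwise, only node-local MapTasks may be scheduled.

2. If lastMapLocalityLevel is RACK:

1) if timeWaitedForLocalMap >= rackLocalityDelay, MapTasks at the off-switch level or any more local level may be scheduled;

2) otherwise, MapTasks at the rack-local or node-local level may be scheduled.

3. Otherwise (the last MapTask was off-switch), MapTasks at any level may be scheduled.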
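
The three cases above can be checked with a small standalone model. This is a sketch, not the Hadoop class: AllowedLevelSketch and its Level enum are hypothetical stand-ins, the starvation and no-locality shortcuts are left out, and 4500 ms is an assumed value for both delays.

```java
// Simplified model of the decision table above; not the Hadoop class itself.
final class AllowedLevelSketch {
    enum Level { NODE, RACK, ANY }

    static Level allowed(Level last, long waitedMs, long nodeDelayMs, long rackDelayMs) {
        switch (last) {
            case NODE: // last map was node-local: relax in two steps
                if (waitedMs >= nodeDelayMs + rackDelayMs) return Level.ANY;
                if (waitedMs >= nodeDelayMs) return Level.RACK;
                return Level.NODE;
            case RACK: // last map was rack-local: one step left
                return waitedMs >= rackDelayMs ? Level.ANY : Level.RACK;
            default:   // last map was off-switch: nothing left to wait for
                return Level.ANY;
        }
    }

    public static void main(String[] args) {
        long d = 4500L; // assumed value for both delays
        System.out.println(allowed(Level.NODE, 0, d, d));     // NODE
        System.out.println(allowed(Level.NODE, 4500, d, d));  // RACK
        System.out.println(allowed(Level.NODE, 9000, d, d));  // ANY
        System.out.println(allowed(Level.RACK, 4499, d, d));  // RACK
        System.out.println(allowed(Level.ANY, 0, d, d));      // ANY
    }
}
```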

  The delay scheduling workflow itself is carried out in the assignTasks() method of FairScheduler.java:

@Override
public synchronized List<Task> assignTasks(TaskTracker tracker)
    throws IOException {
  if (!initialized) // Don't try to assign tasks if we haven't yet started up
    return null;
  String trackerName = tracker.getTrackerName();
  eventLog.log("HEARTBEAT", trackerName);
  long currentTime = clock.getTime();

  // Compute total runnable maps and reduces, and currently running ones
  int runnableMaps = 0;
  int runningMaps = 0;
  int runnableReduces = 0;
  int runningReduces = 0;
  for (Pool pool: poolMgr.getPools()) {
    runnableMaps += pool.getMapSchedulable().getDemand();
    runningMaps += pool.getMapSchedulable().getRunningTasks();
    runnableReduces += pool.getReduceSchedulable().getDemand();
    runningReduces += pool.getReduceSchedulable().getRunningTasks();
  }

  ClusterStatus clusterStatus = taskTrackerManager.getClusterStatus();
  // Compute total map/reduce slots
  // In the future we can precompute this if the Scheduler becomes a
  // listener of tracker join/leave events.
  int totalMapSlots = getTotalSlots(TaskType.MAP, clusterStatus);
  int totalReduceSlots = getTotalSlots(TaskType.REDUCE, clusterStatus);

  eventLog.log("RUNNABLE_TASKS",
      runnableMaps, runningMaps, runnableReduces, runningReduces);

  // Update time waited for local maps for jobs skipped on last heartbeat
  // Note 1
  updateLocalityWaitTimes(currentTime);

  // Check for JT safe-mode
  if (taskTrackerManager.isInSafeMode()) {
    LOG.info("JobTracker is in safe-mode, not scheduling any tasks.");
    return null;
  }

  TaskTrackerStatus tts = tracker.getStatus();

  int mapsAssigned = 0; // loop counter for map in the below while loop
  int reducesAssigned = 0; // loop counter for reduce in the below while
  int mapCapacity = maxTasksToAssign(TaskType.MAP, tts);
  int reduceCapacity = maxTasksToAssign(TaskType.REDUCE, tts);
  boolean mapRejected = false; // flag used for ending the loop
  boolean reduceRejected = false; // flag used for ending the loop

  // Keep track of which jobs were visited for map tasks and which had tasks
  // launched, so that we can later mark skipped jobs for delay scheduling
  Set<JobInProgress> visitedForMap = new HashSet<JobInProgress>();
  Set<JobInProgress> visitedForReduce = new HashSet<JobInProgress>();
  Set<JobInProgress> launchedMap = new HashSet<JobInProgress>();

  ArrayList<Task> tasks = new ArrayList<Task>();
  // Scan jobs to assign tasks until neither maps nor reduces can be assigned
  // Note 2
  while (true) {
    // Computing the ending conditions for the loop
    // Reject a task type if one of the following condition happens
    // 1. number of assigned task reaches per heartbeat limit
    // 2. number of running tasks reaches runnable tasks
    // 3. task is rejected by the LoadManager.canAssign
    if (!mapRejected) {
      if (mapsAssigned == mapCapacity ||
          runningMaps == runnableMaps ||
          !loadMgr.canAssignMap(tts, runnableMaps,
              totalMapSlots, mapsAssigned)) {
        eventLog.log("INFO", "Can't assign another MAP to " + trackerName);
        mapRejected = true;
      }
    }
    if (!reduceRejected) {
      if (reducesAssigned == reduceCapacity ||
          runningReduces == runnableReduces ||
          !loadMgr.canAssignReduce(tts, runnableReduces,
              totalReduceSlots, reducesAssigned)) {
        eventLog.log("INFO", "Can't assign another REDUCE to " + trackerName);
        reduceRejected = true;
      }
    }
    // Exit while (true) loop if
    // 1. neither maps nor reduces can be assigned
    // 2. assignMultiple is off and we already assigned one task
    if (mapRejected && reduceRejected ||
        !assignMultiple && tasks.size() > 0) {
      break; // This is the only exit of the while (true) loop
    }

    // Determine which task type to assign this time
    // First try choosing a task type which is not rejected
    TaskType taskType;
    if (mapRejected) {
      taskType = TaskType.REDUCE;
    } else if (reduceRejected) {
      taskType = TaskType.MAP;
    } else {
      // If both types are available, choose the task type with fewer running
      // tasks on the task tracker to prevent that task type from starving
      if (tts.countMapTasks() + mapsAssigned <=
          tts.countReduceTasks() + reducesAssigned) {
        taskType = TaskType.MAP;
      } else {
        taskType = TaskType.REDUCE;
      }
    }

    // Get the map or reduce schedulables and sort them by fair sharing
    List<PoolSchedulable> scheds = getPoolSchedulables(taskType);
    // Sort the schedulables
    Collections.sort(scheds, new SchedulingAlgorithms.FairShareComparator());
    boolean foundTask = false;
    // Note 3
    for (Schedulable sched: scheds) { // This loop will assign only one task
      eventLog.log("INFO", "Checking for " + taskType +
          " task in " + sched.getName());
      // Note 4
      Task task = taskType == TaskType.MAP ?
          sched.assignTask(tts, currentTime, visitedForMap) :
          sched.assignTask(tts, currentTime, visitedForReduce);
      if (task != null) {
        foundTask = true;
        JobInProgress job = taskTrackerManager.getJob(task.getJobID());
        eventLog.log("ASSIGN", trackerName, taskType,
            job.getJobID(), task.getTaskID());
        // Update running task counts, and the job's locality level
        if (taskType == TaskType.MAP) {
          launchedMap.add(job);
          mapsAssigned++;
          runningMaps++;
          // Note 5
          updateLastMapLocalityLevel(job, task, tts);
        } else {
          reducesAssigned++;
          runningReduces++;
        }
        // Add task to the list of assignments
        tasks.add(task);
        break; // This break makes this loop assign only one task
      } // end if(task != null)
    } // end for(Schedulable sched: scheds)

    // Reject the task type if we cannot find a task
    if (!foundTask) {
      if (taskType == TaskType.MAP) {
        mapRejected = true;
      } else {
        reduceRejected = true;
      }
    }
  } // end while (true)

  // Mark any jobs that were visited for map tasks but did not launch a task
  // as skipped on this heartbeat
  for (JobInProgress job: visitedForMap) {
    if (!launchedMap.contains(job)) {
      infos.get(job).skippedAtLastHeartbeat = true;
    }
  }

  // If no tasks were found, return null
  return tasks.isEmpty() ? null : tasks;
}

assignTasks()

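  Note 1: updateLocalityWaitTimes(). For every job that was skipped at the last heartbeat, this method adds the time elapsed since the last heartbeat to the job's timeWaitedForLocalMap and then resets its skippedAtLastHeartbeat to false. The code is: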

/**
 * Update locality wait times for jobs that were skipped at last heartbeat.
 */
private void updateLocalityWaitTimes(long currentTime) {
  long timeSinceLastHeartbeat =
      (lastHeartbeatTime == 0 ? 0 : currentTime - lastHeartbeatTime);
  lastHeartbeatTime = currentTime;
  for (JobInfo info: infos.values()) {
    if (info.skippedAtLastHeartbeat) {
      info.timeWaitedForLocalMap += timeSinceLastHeartbeat;
      info.skippedAtLastHeartbeat = false;
    }
  }
}

updateLocalityWaitTimes()
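
  To see how the wait time accumulates only for skipped jobs, here is a toy model of the same bookkeeping. WaitTimeSketch and JobState are hypothetical stand-ins for the scheduler and its per-job JobInfo, and the timestamps are made up.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the bookkeeping described above; JobState is a hypothetical
// stand-in for the scheduler's per-job JobInfo.
final class WaitTimeSketch {
    static final class JobState {
        boolean skippedAtLastHeartbeat;
        long timeWaitedForLocalMap;
    }

    private long lastHeartbeatTime = 0;
    final List<JobState> jobs = new ArrayList<>();

    void updateLocalityWaitTimes(long currentTime) {
        // The very first heartbeat has no predecessor, so it credits nothing.
        long sinceLast = (lastHeartbeatTime == 0) ? 0 : currentTime - lastHeartbeatTime;
        lastHeartbeatTime = currentTime;
        for (JobState job : jobs) {
            if (job.skippedAtLastHeartbeat) {
                job.timeWaitedForLocalMap += sinceLast;
                job.skippedAtLastHeartbeat = false;
            }
        }
    }

    public static void main(String[] args) {
        WaitTimeSketch s = new WaitTimeSketch();
        JobState job = new JobState();
        s.jobs.add(job);

        s.updateLocalityWaitTimes(1000);   // first heartbeat: nothing to add
        job.skippedAtLastHeartbeat = true; // job was passed over at t=1000
        s.updateLocalityWaitTimes(4000);   // 3000 ms credited, flag cleared
        s.updateLocalityWaitTimes(7000);   // not skipped at t=4000, so no credit
        System.out.println(job.timeWaitedForLocalMap); // 3000
    }
}
```

  In the real scheduler the flag is set again at the end of assignTasks() for every job that was visited for maps but launched none, so a job that keeps being skipped keeps accumulating wait time.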

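  Note 2: the while(true) loop keeps assigning MapTasks and ReduceTasks until nothing more can be assigned. In each iteration the schedulables are sorted by fair share, and then a for() loop does the actual task assignment. (Schedulable has two subclasses, PoolSchedulable and JobSchedulable. The list sorted here holds PoolSchedulables, but for the purpose of following delay scheduling each Schedulable can be thought of as a job.)

  Notes 3 and 4: inside the for() loop, the assignTask() method of JobSchedulable ends up being called to pick a suitable MapTask or ReduceTask. When picking a MapTask, it first calls FairScheduler.getAllowedLocalityLevel() to determine which level of MapTask may be scheduled (see the analysis above), and then selects a MapTask of the corresponding level based on the return value. The code of assignTask() is: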

@Override
public Task assignTask(TaskTrackerStatus tts, long currentTime,
    Collection<JobInProgress> visited) throws IOException {
  if (isRunnable()) {
    visited.add(job);
    TaskTrackerManager ttm = scheduler.taskTrackerManager;
    ClusterStatus clusterStatus = ttm.getClusterStatus();
    int numTaskTrackers = clusterStatus.getTaskTrackers();

    // check with the load manager whether it is safe to
    // launch this task on this taskTracker.
    LoadManager loadMgr = scheduler.getLoadManager();
    if (!loadMgr.canLaunchTask(tts, job, taskType)) {
      return null;
    }
    if (taskType == TaskType.MAP) {
      // Determine which locality level may be scheduled
      LocalityLevel localityLevel = scheduler.getAllowedLocalityLevel(
          job, currentTime);
      scheduler.getEventLog().log(
          "ALLOWED_LOC_LEVEL", job.getJobID(), localityLevel);
      switch (localityLevel) {
      case NODE:
        return job.obtainNewNodeLocalMapTask(tts, numTaskTrackers,
            ttm.getNumberOfUniqueHosts());
      case RACK:
        return job.obtainNewNodeOrRackLocalMapTask(tts, numTaskTrackers,
            ttm.getNumberOfUniqueHosts());
      default:
        return job.obtainNewMapTask(tts, numTaskTrackers,
            ttm.getNumberOfUniqueHosts());
      }
    } else {
      return job.obtainNewReduceTask(tts, numTaskTrackers,
          ttm.getNumberOfUniqueHosts());
    }
  } else {
    return null;
  }
}

assignTask()

  As the listing shows, this method in turn calls the JobInProgress method that matches the allowed level in order to obtain a MapTask at that level.
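
  What "a task at the allowed level or a more local one" means can be modelled in a few lines. This is only an illustration: Candidate and pick() are hypothetical and do not correspond to any JobInProgress API, and the preference for the most local eligible task is an assumption of this sketch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Toy model only: Candidate and pick() are hypothetical, not Hadoop APIs.
final class LevelDispatchSketch {
    enum Level { NODE, RACK, ANY } // declared from most local to least local

    static final class Candidate {
        final String id;
        final Level locality; // locality relative to the tracker asking for work
        Candidate(String id, Level locality) {
            this.id = id;
            this.locality = locality;
        }
    }

    // Return the most local pending task that is no less local than `allowed`.
    static Optional<Candidate> pick(List<Candidate> pending, Level allowed) {
        Candidate best = null;
        for (Candidate c : pending) {
            if (c.locality.compareTo(allowed) > 0) {
                continue; // task is less local than the job may accept right now
            }
            if (best == null || c.locality.compareTo(best.locality) < 0) {
                best = c;
            }
        }
        return Optional.ofNullable(best);
    }

    public static void main(String[] args) {
        List<Candidate> pending = Arrays.asList(
            new Candidate("m1", Level.ANY),
            new Candidate("m2", Level.RACK));
        System.out.println(pick(pending, Level.NODE).isPresent());  // false
        System.out.println(pick(pending, Level.RACK).get().id);     // m2
        System.out.println(pick(pending, Level.ANY).get().id);      // m2
    }
}
```

  When the allowed level is NODE and no node-local task exists, nothing is returned. In assignTasks() that is exactly the case in which the job ends up marked as skipped.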

  Note 5: finally, updateLastMapLocalityLevel() updates the job's state: lastMapLocalityLevel is set to the locality level of the MapTask that was just launched, and timeWaitedForLocalMap is reset to 0.

/**
 * Update a job's locality level and locality wait variables given that
 * it has just launched a map task on a given task tracker.
 */
private void updateLastMapLocalityLevel(JobInProgress job,
    Task mapTaskLaunched, TaskTrackerStatus tracker) {
  JobInfo info = infos.get(job);
  boolean isNodeGroupAware = conf.getBoolean(
      "net.topology.nodegroup.aware", false);
  LocalityLevel localityLevel = LocalityLevel.fromTask(
      job, mapTaskLaunched, tracker, isNodeGroupAware);
  info.lastMapLocalityLevel = localityLevel;
  info.timeWaitedForLocalMap = 0;
  eventLog.log("ASSIGNED_LOC_LEVEL", job.getJobID(), localityLevel);
}

updateLastMapLocalityLevel()
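
  Putting the pieces together, the following sketch walks one job through several heartbeats. It is a toy state machine, not Hadoop code: DelayLifecycleSketch is hypothetical, the 3000 ms heartbeat gap and 4500 ms delays are assumed values, and starting a fresh job at NODE with zero wait is an assumption of this sketch.

```java
// Toy per-job state machine; mirrors only the two fields discussed in the text.
final class DelayLifecycleSketch {
    enum Level { NODE, RACK, ANY }

    // Assumption: a fresh job starts at NODE with no time waited.
    Level lastMapLocalityLevel = Level.NODE;
    long timeWaitedForLocalMap = 0;
    final long nodeDelayMs;
    final long rackDelayMs;

    DelayLifecycleSketch(long nodeDelayMs, long rackDelayMs) {
        this.nodeDelayMs = nodeDelayMs;
        this.rackDelayMs = rackDelayMs;
    }

    Level allowedLevel() {
        switch (lastMapLocalityLevel) {
            case NODE:
                if (timeWaitedForLocalMap >= nodeDelayMs + rackDelayMs) return Level.ANY;
                if (timeWaitedForLocalMap >= nodeDelayMs) return Level.RACK;
                return Level.NODE;
            case RACK:
                return timeWaitedForLocalMap >= rackDelayMs ? Level.ANY : Level.RACK;
            default:
                return Level.ANY;
        }
    }

    // The job was visited but launched nothing; credit the heartbeat gap.
    void skipped(long heartbeatGapMs) { timeWaitedForLocalMap += heartbeatGapMs; }

    // A map task was launched at the given level; the wait clock restarts.
    void launched(Level level) {
        lastMapLocalityLevel = level;
        timeWaitedForLocalMap = 0;
    }

    public static void main(String[] args) {
        DelayLifecycleSketch job = new DelayLifecycleSketch(4500, 4500);
        System.out.println(job.allowedLevel()); // NODE
        job.skipped(3000);
        System.out.println(job.allowedLevel()); // NODE (3000 < 4500)
        job.skipped(3000);
        System.out.println(job.allowedLevel()); // RACK (6000 >= 4500)
        job.launched(Level.RACK);
        System.out.println(job.allowedLevel()); // RACK (wait reset to 0)
        job.skipped(3000);
        job.skipped(3000);
        System.out.println(job.allowedLevel()); // ANY (6000 >= 4500)
    }
}
```

  Note that a launch at a less local level lowers the bar for the next decision: after a rack-local launch, only rackLocalityDelay has to elapse before off-switch tasks are allowed, and the job returns to the strict NODE level only after it launches a node-local task again.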

  This article is based on hadoop 1.2.1. Corrections are welcome.

  References: 《Hadoop技术内幕 深入理解MapReduce架构设计与实现原理》, 董西成

    https://issues.apache.org/jira/secure/attachment/12457515/fair_scheduler_design_doc.pdf

  Please credit the source when reposting: http://www.cnblogs.com/gwgyk/p/4568270.html
