The Spark paper says it uses the delay scheduling algorithm, which comes from this paper: http://people.csail.mit.edu/matei/papers/2010/eurosys_delay_scheduling.pdf (it is also a scheduling algorithm used in Hadoop).

Abstract

delay scheduling: when the job that should be scheduled next according to fairness cannot launch a local task, it waits for a small amount of time, letting other jobs launch tasks instead.
When fairness says a job should be scheduled next but that job cannot launch a node-local task, the job waits a short time and lets other jobs launch tasks first.

1. Introduction

HFS has two main goals:
  • Fair sharing: divide resources using max-min fair sharing [7] to achieve statistical multiplexing
  • Data locality: place computations near their input data, to maximize system throughput
At a high level, two approaches can be taken:
1. Kill running tasks to make room for the new job.
2. Wait for running tasks to finish
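The Hadoop Fair Scheduler (HFS) has two goals: fair sharing of resources based on max-min fairness, which gives statistical multiplexing, and placing computation close to its data. To stay fair when a new job arrives, the scheduler can either
1) kill a running task to free up resources; or
2) wait until enough running tasks finish. The first wastes the work the killed task had already completed, while the second makes fairness hard to guarantee.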
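
The max-min fair sharing mentioned above can be illustrated with a small water-filling routine. This is only a sketch of the general idea, not HFS code: the function name, the dictionary of per-job demands, and the slot counts are all made up for illustration.

```python
def max_min_fair_share(capacity, demands):
    """Split `capacity` slots among jobs by max-min fairness.

    `demands` maps a job name to the number of slots it could use.
    Jobs that want less than an equal split get what they ask for;
    the leftover is split equally among the remaining jobs.
    """
    allocation = {}
    remaining = float(capacity)
    pending = dict(demands)
    while pending:
        equal_share = remaining / len(pending)
        small = [job for job, need in pending.items() if need <= equal_share]
        if not small:
            # Every remaining job wants more than an equal split.
            for job in pending:
                allocation[job] = equal_share
            break
        for job in small:
            allocation[job] = float(pending.pop(job))
            remaining -= allocation[job]
    return allocation


shares = max_min_fair_share(10, {"a": 2, "b": 4, "c": 10})
print(shares)
```

With 10 slots and demands of 2, 4 and 10, the two smaller jobs are fully satisfied and the large job receives the 4 slots that are left.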

Our principal result in this paper is that, counterintuitively, an algorithm based on waiting can achieve both high fairness and high data locality. We show first that in large clusters, tasks finish at such a high rate that resources can be reassigned to new jobs on a timescale much smaller than job durations. However, a strict implementation of fair sharing compromises locality, because the job to be scheduled next according to fairness might not have data on the nodes that are currently free. To resolve this problem, we relax fairness slightly through a simple algorithm called delay scheduling, in which a job waits for a limited amount of time for a scheduling opportunity on a node that has data for it. We show that a very small amount of waiting is enough to bring locality close to 100%. Delay scheduling performs well in typical Hadoop workloads because Hadoop tasks are short relative to jobs, and because there are multiple locations where a task can run to access each data block

2. Background

Users submit jobs consisting of a map function and a reduce function. Hadoop breaks each job into tasks. First, map tasks process each input block (typically 64 MB) and produce intermediate results, which are key-value pairs. There is one map task per input block. Next, reduce tasks pass the list of intermediate values for each key and through the user’s reduce function, producing the job’s final output.
Job scheduling in Hadoop is performed by a master, which manages a number of slaves. Each slave has a fixed number of map slots and reduce slots in which it can run tasks. Typically, administrators set the number of slots to one or two per core. The master assigns tasks in response to heartbeats sent by slaves every few seconds, which report the number of free map and reduce slots on the slave.
 Hadoop’s default scheduler runs jobs in FIFO order, with five priority levels. When the scheduler receives a heartbeat indicating that a map or reduce slot is free, it scans through jobs in order of priority and submit time to find one with a task of the required type. For maps, Hadoop uses a locality optimization as in Google’s MapReduce [18]: after selecting a job, the scheduler greedily picks the map task in the job with data closest to the slave (on the same node if possible, otherwise on the same rack, or finally on a remote rack)
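Hadoop splits each job into tasks. Each map task processes one block and produces intermediate results as key-value pairs; reduce tasks pass the intermediate values for each key to the user's reduce function and produce the output files.
Scheduling in Hadoop is done by the master. Each slave has a fixed number of map slots and reduce slots in which it can run tasks;
every few seconds a slave sends a heartbeat telling the master how many slots it has free, and the master uses this information to assign tasks to slaves.
Hadoop's default scheduler is FIFO with five priority levels, with strict priority between levels.
The default locality optimization follows the Google MapReduce paper: after selecting a job, the scheduler greedily picks the map task in that job whose data is closest to the slave.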
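
The default behavior described above (scan jobs by priority and submit time, then greedily pick the closest map task) might be sketched as follows. This is a toy model, not Hadoop's code: `Job`, `MapTask`, `rack_of`, and the convention that a lower number means higher priority are all assumptions made for the example.

```python
from dataclasses import dataclass, field

NODE_LOCAL, RACK_LOCAL, OFF_RACK = 0, 1, 2


@dataclass
class MapTask:
    task_id: str
    data_nodes: set  # nodes holding a replica of the task's input block


@dataclass
class Job:
    name: str
    priority: int       # assumption: lower number = higher priority
    submit_time: float
    pending: list = field(default_factory=list)


def distance(task, node, rack_of):
    """How far `node` is from the task's data: same node, same rack, or off-rack."""
    if node in task.data_nodes:
        return NODE_LOCAL
    if rack_of[node] in {rack_of[n] for n in task.data_nodes}:
        return RACK_LOCAL
    return OFF_RACK


def assign_map_task(jobs, free_node, rack_of):
    """On a heartbeat reporting a free map slot, return (job, task) or None."""
    for job in sorted(jobs, key=lambda j: (j.priority, j.submit_time)):
        if job.pending:
            # Greedy locality: take the pending task whose data is closest.
            task = min(job.pending, key=lambda t: distance(t, free_node, rack_of))
            job.pending.remove(task)
            return job, task
    return None
```

Note that the first job in FIFO order always gets the slot, even when its best task is off-rack; this is the behavior delay scheduling relaxes.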


3. Delay Scheduling

1. How should resources be reassigned to new jobs?
We show that waiting imposes little impact on job response times when jobs are longer than the average task length and when a cluster is shared between many users
Waiting has almost no effect on response time under these conditions.
2. How should data locality be achieved?
We propose an algorithm called delay scheduling that temporarily relaxes fairness to improve locality by asking jobs to wait for a scheduling opportunity on a node with local data
To get a slave with local data, a job waits for a bounded amount of time.
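
The idea of skipping a job for a bounded time can be sketched with a per-job skip counter. This is a simplified illustration under my own assumptions, not the HFS implementation: the `Job` fields, `on_free_slot`, and the use of "fewest running tasks first" as the fairness order are hypothetical, and `max_skips` stands in for the delay D discussed in Section 3.5.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    pending: dict = field(default_factory=dict)  # task id -> set of nodes holding its data
    running: int = 0
    skip_count: int = 0


def on_free_slot(jobs, node, max_skips):
    """Pick a task for a free slot on `node`.

    Returns (job name, task id, is_local), or None if every job was skipped.
    """
    # Assumed fairness order: the job with the fewest running tasks goes first.
    for job in sorted(jobs, key=lambda j: j.running):
        if not job.pending:
            continue
        local = [t for t, nodes in job.pending.items() if node in nodes]
        if local:
            task, is_local = local[0], True
            job.skip_count = 0
        elif job.skip_count >= max_skips:
            # The job has waited long enough: give up on locality.
            task, is_local = next(iter(job.pending)), False
        else:
            # Skip this job for now and let a later job use the slot.
            job.skip_count += 1
            continue
        del job.pending[task]
        job.running += 1
        return job.name, task, is_local
    return None
```

A job that is first in line but has no data on the offered node is passed over, and a later job with local data takes the slot instead.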

3.1 Naïve Fair Sharing Algorithm

3.2 Scheduling Responsiveness
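Suppose job j needs F slots and would take J seconds when running on its own; suppose each task takes T seconds on average and the cluster has S slots. Then
  a slot frees up every T/S seconds on average, so job j waits about FT/S seconds for its F slots. For this wait to be negligible compared with the job's running time J, we need
      J >> FT/S.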

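Waiting does not hurt response time as long as at least one of the following holds:
  • there are many jobs (each job's share F is a small part of the cluster);
  • the job is small (few tasks, so F is small);
  • the job is long (its running time J is much greater than its task length T).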
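
The condition J >> FT/S can be checked with a few lines of arithmetic. The numbers below are invented purely for illustration; only the formula comes from the notes above.

```python
def expected_wait_seconds(fair_share_slots, avg_task_seconds, cluster_slots):
    """Expected time for a new job to collect F slots.

    One slot frees up every T/S seconds on average, so F slots take F*T/S seconds.
    """
    return fair_share_slots * avg_task_seconds / cluster_slots


# Hypothetical cluster: 1000 slots, 20-second tasks, a job entitled to 50 slots.
wait = expected_wait_seconds(fair_share_slots=50, avg_task_seconds=20, cluster_slots=1000)
job_seconds = 600  # hypothetical running time J
print(f"wait = {wait} s, fraction of job time = {wait / job_seconds:.4f}")
```

In this made-up case the job waits about one second for its full share, which is tiny next to a ten-minute job.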




3.3 Locality Problems with Naïve Fair Sharing
Running on a node that contains the data (node locality) is most efficient, but when this is not possible, running on the same rack (rack locality) is faster than running off-rack.
Running a task on the node that holds its data is most efficient; the next best is a node in the same rack as the data.

3.3.1 Head-of-line Scheduling
If the head-of-line job is small, it is unlikely to have data on the node that is given to it. For example, a job with data on 10% of nodes will only achieve 10% locality.
When the job at the head of the queue is small, it is hard for it to achieve locality.


At Facebook, jobs with 1 to 25 maps make up the majority of jobs, yet they achieve only 5% node locality and 59% rack locality.

3.3.2 Sticky Slots


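Sticky slots means that a job's tasks keep running in the same slots: even when those slots are not the best for locality, the job has a hard time leaving them (to find better ones).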

3.4 Delay Scheduling


3.5 Analysis of Delay Scheduling
How should the delay D be chosen?
1. Non-locality decreases exponentially with D.
The fraction of non-local tasks falls exponentially as the delay grows.
2. The amount of waiting required to achieve a given level of locality is a fraction of the average task length and decreases linearly with the number of slots per node L.
To reach a given level of locality, the total wait grows with the average task length and shrinks as the number of slots per node grows.
We first consider how much locality improves depending on D. Non-locality falls exponentially with D.
A second question is how long a job waits below its fair share to launch a local task.
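
Both observations can be reproduced with a deliberately simplified model. The assumptions here are mine, not a quote from the paper: each scheduling opportunity lands on a node holding the job's data with independent probability `p_local`, and a slot frees up every T/S seconds with S = nodes × slots per node. All function names are hypothetical.

```python
import math


def non_local_probability(p_local, skips):
    """Chance that none of `skips` independent opportunities is on a node with local data."""
    return (1.0 - p_local) ** skips


def skips_for_locality(p_local, target_locality):
    """Smallest D with 1 - (1 - p)^D >= target, for 0 < p < 1 and 0 < target < 1."""
    return math.ceil(math.log(1.0 - target_locality) / math.log(1.0 - p_local))


def wait_seconds(skips, avg_task_seconds, nodes, slots_per_node):
    """Time spent on D skipped opportunities when a slot frees every T/(nodes*L) seconds."""
    return skips * avg_task_seconds / (nodes * slots_per_node)


d = skips_for_locality(p_local=0.1, target_locality=0.95)
print(d, wait_seconds(d, avg_task_seconds=60, nodes=100, slots_per_node=4))
```

Under this model a job with data on 10% of nodes needs 29 skipped opportunities to reach 95% locality, and doubling the slots per node halves the time those skips take.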

In our experiments, local tasks ran up to 2x faster than non-local tasks.



3.5.1 Long Tasks and Hotspots



4. Hadoop Fair Scheduler Design


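1. As long as a pool has tasks, it is guaranteed at least its minimum number of slots.
2. Killing long-running tasks is allowed:
First, each pool has a minimum share timeout, T-min. If a pool has been below its minimum share for longer than T-min, the scheduler kills other tasks to meet that minimum.
Second, there is a global fair share timeout, T-fair, used to kill tasks if a pool is being starved of its fair share.
If a pool has been starved for longer than T-fair, tasks are killed to free up resources.
T-min should be set from the application's service level, while T-fair should be set from the longest delay the application can tolerate.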
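
The two timeouts can be read as a simple decision rule. The sketch below is hypothetical: `PoolState`, its fields, and `preemption_reason` are names invented here, and how a real scheduler tracks when a pool fell below its share is not covered by these notes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PoolState:
    below_min_since: Optional[float]   # when the pool fell below its minimum share, or None
    below_fair_since: Optional[float]  # when the pool fell below its fair share, or None


def preemption_reason(pool, now, t_min, t_fair):
    """Return why tasks should be killed for this pool, or None if it must keep waiting."""
    if pool.below_min_since is not None and now - pool.below_min_since >= t_min:
        return "min_share"
    if pool.below_fair_since is not None and now - pool.below_fair_since >= t_fair:
        return "fair_share"
    return None


pool = PoolState(below_min_since=0.0, below_fair_since=0.0)
print(preemption_reason(pool, now=30.0, t_min=60.0, t_fair=300.0))
print(preemption_reason(pool, now=60.0, t_min=60.0, t_fair=300.0))
```

Checking the minimum-share timeout first reflects that T-min is the shorter, guarantee-driven timeout, while T-fair is the longer, tolerance-driven one.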

4.1 Task Assignment in HFS
First, we create a sorted list of jobs according to our hierarchical scheduling policy. Second, we scan down this list to find a job to launch a task from, applying delay scheduling to skip jobs that do not have data on the node being assigned for a limited time. The same algorithm is applied independently for map slots and reduce slots although we do not use delay scheduling for reduces because they usually need to read data from all nodes.
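When a slot becomes free, HFS assigns it as follows:
1) sort the jobs according to the hierarchical scheduling policy;
2) scan the sorted jobs using delay scheduling and pick one job to launch a task from.
Map slots and reduce slots run the same algorithm independently, but reduces do not use delay scheduling because they usually need to read data from all nodes.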


We expect administrators to set the wait times W1 and W2 based on the rate at which slots free up in their cluster and the desired level of locality, using the analysis in Section 3.5.
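
One way to read the two wait times is as thresholds that progressively relax locality. The sketch assumes, and this is my reading rather than something stated in these notes, that a job must wait W1 seconds before it may launch rack-local tasks and a further W2 seconds before it may launch off-rack tasks. The names `allowed_level` and `may_launch` are hypothetical.

```python
LEVELS = ["node", "rack", "any"]  # from strictest to most relaxed locality


def allowed_level(waited_seconds, w1, w2):
    """Most relaxed locality level a job may use after waiting this long."""
    if waited_seconds < w1:
        return "node"
    if waited_seconds < w1 + w2:
        return "rack"
    return "any"


def may_launch(best_available, waited_seconds, w1, w2):
    """`best_available` is the tightest locality the job can get on the offered node."""
    allowed = allowed_level(waited_seconds, w1, w2)
    return LEVELS.index(best_available) <= LEVELS.index(allowed)


print(may_launch("rack", waited_seconds=2, w1=5, w2=5))
print(may_launch("rack", waited_seconds=6, w1=5, w2=5))
```

Larger W1 and W2 trade a little more waiting for higher locality, which is why they are meant to be tuned against how quickly slots free up in the cluster.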
configured with a block size of 128 MB because this improved performance (Facebook uses this setting in production).
A 128 MB block size performed better, and Facebook uses this setting in production.

5. Evaluation



5.1.1 Results for IO-Heavy Workload

5.1.2 Results for CPU-Heavy Workload
We note two behaviors: 
First, fair sharing improves response times of small jobs as before, but its effect is much larger (speeding some jobs as much as 20x), because the cluster is more heavily loaded (we are running on the same data but with more expensive jobs). 
Second, delay scheduling has a negligible effect, because the workload is CPU-bound, but it also does not hurt performance.

5.1.3 Results for Mixed Workload

5.2.1 Hierarchical Scheduling

5.2.2 Delay Scheduling with Small Jobs

5.2.3 Delay Scheduling with Sticky Slots

5.3 Sensitivity Analysis

6. Discussion

Two key aspects of the cluster environment enable delay scheduling to perform well:
  • first, most tasks are short compared to jobs, and
  • second, there are multiple locations in which a task can run to read a given data block, because systems like Hadoop support multiple task slots per node
Because delay scheduling only involves being able to skip jobs in a sorted order that captures “who should be scheduled next,” we believe that it can be used in a variety of environments beyond Hadoop and HFS. Other settings where it applies:
  • Scheduling Policies other than Fair Sharing (not limited to fair scheduling)
  • Scheduling Preferences other than Data Locality (not limited to locality)
  • Load Management Mechanisms other than Slots (not limited to slots)
  • Distributed Scheduling Decisions (not limited to a single scheduler)




