Spark Job Scheduling

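I have recently been studying Spark for a project and wrote up some technical notes along the way. I am posting them here as a record; they were written in English and I have not translated them.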

There are some places the official documentation does not explain very clearly, especially when it comes to details. Below are some additional explanations based on my own experiments and on the source code I have read over the past few days.

(The official document link is http://spark.apache.org/docs/latest/job-scheduling.html)

The current Spark implementation has two schedulers. FIFO is the default setting and is the way Spark originally scheduled jobs.
Both the FIFO and FAIR schedulers support the basic capability of running multiple jobs in parallel, provided that the jobs are submitted from separate threads. (Within a single thread, jobs are executed in order.)
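To make the thread requirement concrete, here is a minimal sketch, assuming Spark 2.x or later on the classpath and a local master. The object name, application name, pool names and job sizes are invented for illustration; `spark.scheduler.mode` and `spark.scheduler.pool` are the property names used on the official job-scheduling page linked above.

```scala
import org.apache.spark.sql.SparkSession

object ConcurrentJobsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]")                      // assumption: local test run with 4 cores
      .appName("concurrent-jobs-sketch")       // hypothetical application name
      .config("spark.scheduler.mode", "FAIR")  // omit this line to keep the FIFO default
      .getOrCreate()
    val sc = spark.sparkContext

    // Start one thread per job. Actions called from the same thread
    // would simply run one after another.
    def submit(pool: String, n: Int): Thread = {
      val t = new Thread(() => {
        // The pool is a per-thread local property (hypothetical pool names).
        sc.setLocalProperty("spark.scheduler.pool", pool)
        val count = sc.parallelize(1 to n, 8).map(_ * 2).count()
        println(s"pool=$pool count=$count")
      })
      t.start()
      t
    }

    val threads = Seq(submit("large", 1000000), submit("small", 1000))
    threads.foreach(_.join())  // wait for both jobs before shutting down
    spark.stop()
  }
}
```

The two pools are not declared in any allocation file here, so they should simply get default settings; this sketch only shows how jobs end up being submitted concurrently.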
In the FIFO scheduler, jobs submitted earlier have higher priority than later ones and are more likely to run first. This does not mean the first job always executes first: a later job can run before an earlier one if the cluster's resources are not fully occupied. The worst case for the FIFO scheduler is when the first jobs are large, in which case later jobs may suffer significant delay.
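The worst case can be illustrated with a toy model. This is not Spark code: it assumes every task takes exactly one time unit, a fixed number of slots, and a fair policy that simply gives the next free slot to the job with the fewest running tasks. All names (`FifoVsFairSketch`, `Job`, `simulate`) are hypothetical.

```scala
object FifoVsFairSketch {
  // Hypothetical job description: a name and a number of one-unit tasks.
  final case class Job(name: String, tasks: Int)

  /** Returns job name -> round in which the job's last task finishes. */
  def simulate(jobs: Seq[Job], slots: Int, fair: Boolean): Map[String, Int] = {
    require(slots > 0, "need at least one slot")
    val remaining = jobs.map(_.tasks).toArray   // tasks left per job, in submission order
    val finishedAt = Array.fill(jobs.size)(0)
    var round = 0
    while (remaining.exists(_ > 0)) {
      round += 1
      val running = Array.fill(jobs.size)(0)    // tasks started in this round
      var free = slots
      while (free > 0 && jobs.indices.exists(i => remaining(i) > running(i))) {
        val candidates = jobs.indices.filter(i => remaining(i) > running(i))
        // FIFO: the earliest submitted job takes the slot.
        // Fair (simplified): the job with the fewest running tasks takes it.
        val pick =
          if (fair) candidates.minBy(i => (running(i), i)) else candidates.head
        running(pick) += 1
        free -= 1
      }
      for (i <- jobs.indices if running(i) > 0) {
        remaining(i) -= running(i)
        if (remaining(i) == 0) finishedAt(i) = round
      }
    }
    jobs.map(_.name).zip(finishedAt).toMap
  }

  def main(args: Array[String]): Unit = {
    val jobs = Seq(Job("large", 100), Job("small", 4))
    println(simulate(jobs, slots = 4, fair = false)) // small waits behind large
    println(simulate(jobs, slots = 4, fair = true))  // small finishes early
  }
}
```

In this model the small job finishes in round 26 under FIFO and in round 2 under the simplified fair policy, while the large job finishes in round 25 and round 26 respectively.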
The FAIR scheduler corresponds to the Hadoop Fair Scheduler and is an enhancement over FIFO. In FIFO mode, the only factor considered when ordering the SchedulableQueue is priority (with the stage ID as a tie-breaker); in FAIR mode more factors are considered, including minShare, runningTasks and weight (see the code below if you are interested). Likewise, jobs do not always run strictly in the order given by FairSchedulingAlgorithm. Overall, though, in my observation from concurrent JMeter tests, tuning these parameters in FAIR mode greatly reduces the delay for small jobs that were significantly delayed under FIFO.

private[spark] class FIFOSchedulingAlgorithm extends SchedulingAlgorithm {
  override def comparator(s1: Schedulable, s2: Schedulable): Boolean = {
    val priority1 = s1.priority
    val priority2 = s2.priority
    var res = math.signum(priority1 - priority2)
    if (res == 0) {
      val stageId1 = s1.stageId
      val stageId2 = s2.stageId
      res = math.signum(stageId1 - stageId2)
    }
    if (res < 0) {
      true
    } else {
      false
    }
  }
}

private[spark] class FairSchedulingAlgorithm extends SchedulingAlgorithm {
  override def comparator(s1: Schedulable, s2: Schedulable): Boolean = {
    val minShare1 = s1.minShare
    val minShare2 = s2.minShare
    val runningTasks1 = s1.runningTasks
    val runningTasks2 = s2.runningTasks
    val s1Needy = runningTasks1 < minShare1
    val s2Needy = runningTasks2 < minShare2
    val minShareRatio1 = runningTasks1.toDouble / math.max(minShare1, 1.0).toDouble
    val minShareRatio2 = runningTasks2.toDouble / math.max(minShare2, 1.0).toDouble
    val taskToWeightRatio1 = runningTasks1.toDouble / s1.weight.toDouble
    val taskToWeightRatio2 = runningTasks2.toDouble / s2.weight.toDouble
    var compare: Int = 0
 
    if (s1Needy && !s2Needy) {
      return true
    } else if (!s1Needy && s2Needy) {
      return false
    } else if (s1Needy && s2Needy) {
      compare = minShareRatio1.compareTo(minShareRatio2)
    } else {
      compare = taskToWeightRatio1.compareTo(taskToWeightRatio2)
    }
 
    if (compare < 0) {
      true
    } else if (compare > 0) {
      false
    } else {
      s1.name < s2.name
    }
  }
}
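
To see what the two comparators do to a queue, here is a small self-contained sketch. `Sched` is a hypothetical stand-in for Spark's private `Schedulable` trait that keeps only the compared fields, `fifoLess` and `fairLess` are condensed re-statements of the logic quoted above, and the three queue entries are made up.

```scala
object ComparatorSketch {
  // Hypothetical stand-in for Schedulable: only the fields the comparators read.
  final case class Sched(name: String, priority: Int, stageId: Int,
                         minShare: Int, runningTasks: Int, weight: Int)

  // FIFO: lower priority value first, stage ID breaks ties.
  def fifoLess(a: Sched, b: Sched): Boolean =
    if (a.priority != b.priority) a.priority < b.priority
    else a.stageId < b.stageId

  // FAIR: entries below their minShare ("needy") come first; otherwise compare
  // running tasks relative to minShare (both needy) or to weight (neither needy).
  def fairLess(a: Sched, b: Sched): Boolean = {
    val aNeedy = a.runningTasks < a.minShare
    val bNeedy = b.runningTasks < b.minShare
    if (aNeedy != bNeedy) aNeedy
    else {
      val (ra, rb) =
        if (aNeedy)
          (a.runningTasks.toDouble / math.max(a.minShare, 1),
           b.runningTasks.toDouble / math.max(b.minShare, 1))
        else
          (a.runningTasks.toDouble / a.weight, b.runningTasks.toDouble / b.weight)
      val c = java.lang.Double.compare(ra, rb)
      if (c != 0) c < 0 else a.name < b.name  // name is the final tie-breaker
    }
  }

  // Made-up queue state for illustration.
  val queue: Seq[Sched] = Seq(
    Sched("etl",    priority = 0, stageId = 0, minShare = 0, runningTasks = 8, weight = 1),
    Sched("report", priority = 1, stageId = 3, minShare = 2, runningTasks = 1, weight = 1),
    Sched("adhoc",  priority = 2, stageId = 5, minShare = 0, runningTasks = 2, weight = 2)
  )

  def main(args: Array[String]): Unit = {
    println(queue.sortWith(fifoLess).map(_.name)) // etl, report, adhoc
    println(queue.sortWith(fairLess).map(_.name)) // report, adhoc, etl
  }
}
```

Under FIFO the order follows priority alone. Under FAIR, `report` moves to the front because it is below its minShare, and `adhoc` overtakes `etl` because it has fewer running tasks per unit of weight.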

5. The pools in FIFO and FAIR schedulers
