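Spark Application Scheduling Algorithms
To understand how Spark schedules applications, we need to answer the following questions:
1. Who does the scheduling?
2. Who is it done for?
3. What is scheduled?
4. When does scheduling happen?
5. Which scheduling algorithm is used?
The first four questions can be answered in one sentence: whenever the cluster's resources change, the active master process schedules Executor processes on Worker nodes for every application that is registered and not yet fully scheduled.
"Active master": a Spark cluster may have several masters, but only the active master takes part in scheduling; standby masters do not.
What does "the cluster's resources change" mean? Here the cluster's resources are mainly its cores. Launching an Executor process reduces the cluster's free cores and removing one increases them; adding a Worker node increases the free cores and removing one reduces them. Every operation that changes the cluster's resources calls schedule() to reschedule resources for applications and drivers.
Spark provides two resource scheduling algorithms: spreadOut and non-spreadOut. spreadOut distributes the Executor processes an application needs over as many worker nodes as possible, which increases parallelism. Non-spreadOut does the opposite: it uses up the free cores of one worker node before moving on to the next.
To explain the two algorithms in detail, we first walk through a concrete example and then look at the source code.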
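Basic concepts
Every application has at least the following properties:
coresPerExecutor: the number of cores for each Executor process
memoryPerExecutor: the amount of memory for each Executor process
maxCores: the maximum number of cores the application needs
Every worker has at least the following properties:
coresFree: the number of cores currently available on the worker node
memoryFree: the amount of memory currently available on the worker node
Suppose an application waiting to be registered looks like this: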
coresPerExecutor:2
memoryPerExecutor:512M
maxCores: 12
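This means the application needs at most 12 cores, and each Executor process needs 2 cores and 512M of memory.
Suppose that at some moment the Spark cluster has the following worker nodes, sorted by coresFree in descending order: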
Worker1:coresFree=10 memoryFree=10G
Worker2:coresFree=7 memoryFree=1G
Worker3:coresFree=3 memoryFree=2G
Worker4:coresFree=2 memoryFree=215M
Worker5:coresFree=1 memoryFree=1G
Worker5 does not meet the application's requirements: worker5.coresFree < application.coresPerExecutor
Worker4 does not meet them either: worker4.memoryFree < application.memoryPerExecutor
So only the first three worker nodes are eligible for scheduling. We call these three nodes usableWorkers.
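The filtering step above can be sketched in a few lines of Scala. This is a minimal sketch, not Spark's code: the `Worker` and `App` case classes and the `usableWorkers` helper are hypothetical stand-ins for Spark's internal types, and memory is expressed in MB (1G = 1024M).

```scala
object UsableWorkersSketch {
  // Hypothetical, simplified stand-ins for Spark's WorkerInfo and application description.
  final case class Worker(name: String, coresFree: Int, memoryFreeMB: Int)
  final case class App(coresPerExecutor: Int, memoryPerExecutorMB: Int, maxCores: Int)

  // Keep the workers that can host at least one Executor, most free cores first.
  def usableWorkers(workers: Seq[Worker], app: App): Seq[Worker] =
    workers
      .filter(w => w.coresFree >= app.coresPerExecutor && w.memoryFreeMB >= app.memoryPerExecutorMB)
      .sortBy(w => -w.coresFree)

  // The five workers from the example above.
  val cluster: Seq[Worker] = Seq(
    Worker("Worker1", 10, 10240),
    Worker("Worker2", 7, 1024),
    Worker("Worker3", 3, 2048),
    Worker("Worker4", 2, 215),
    Worker("Worker5", 1, 1024)
  )

  def main(args: Array[String]): Unit =
    println(usableWorkers(cluster, App(2, 512, 12)).map(_.name))
}
```

Worker4 is dropped by the memory condition and Worker5 by the core condition, which leaves Worker1, Worker2 and Worker3.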
The spreadOut algorithm
Let's start with spreadOut. As noted above, only the first three workers qualify:
Worker1:coresFree=10 memoryFree=10G
Worker2:coresFree=7 memoryFree=1G
Worker3:coresFree=3 memoryFree=2G
After the first allocation, the worker list looks like this:
Worker1:coresFree=8 memoryFree=9.5G assignedExecutors=1 assignedCores=2
Worker2:coresFree=7 memoryFree=1G assignedExecutors=0 assignedCores=0
Worker3:coresFree=3 memoryFree=2G assignedExecutors=0 assignedCores=0
totalExecutors:1,totalCores=2
Worker1's coresFree and memoryFree have both decreased while worker2 and worker3 are unchanged. That is because one Executor process (2 cores, 512M of memory) was allocated on worker1 and nothing was allocated on worker2 or worker3.
Next, allocation moves to worker2:
Worker1:coresFree=8 memoryFree=9.5G assignedExecutors=1 assignedCores=2
Worker2:coresFree=5 memoryFree=512M assignedExecutors=1 assignedCores=2
Worker3:coresFree=3 memoryFree=2G assignedExecutors=0 assignedCores=0
totalExecutors:2,totalCores=4
At this point 2 Executor processes and 4 cores have been allocated.
Next, allocation moves to worker3:
Worker1:coresFree=8 memoryFree=9.5G assignedExecutors=1 assignedCores=2
Worker2:coresFree=5 memoryFree=512M assignedExecutors=1 assignedCores=2
Worker3:coresFree=1 memoryFree=1.5G assignedExecutors=1 assignedCores=2
totalExecutors:3,totalCores=6
Allocation then returns to worker1, then worker2, and so on in round-robin fashion. Because worker3.coresFree < application.coresPerExecutor, no more resources are allocated on worker3:
Worker1:coresFree=6 memoryFree=9.0G assignedExecutors=2 assignedCores=4
Worker2:coresFree=5 memoryFree=512M assignedExecutors=1 assignedCores=2
Worker3:coresFree=1 memoryFree=1.5G assignedExecutors=1 assignedCores=2
totalExecutors:4,totalCores=8
Worker1:coresFree=6 memoryFree=9.0G assignedExecutors=2 assignedCores=4
Worker2:coresFree=3 memoryFree=0M assignedExecutors=2 assignedCores=4
Worker3:coresFree=1 memoryFree=1.5G assignedExecutors=1 assignedCores=2
totalExecutors:5,totalCores=10
Now worker2 no longer qualifies either: worker2.memoryFree < application.memoryPerExecutor
So the next allocation goes to worker1:
Worker1:coresFree=4 memoryFree=8.5G assignedExecutors=3 assignedCores=6
Worker2:coresFree=3 memoryFree=0M assignedExecutors=2 assignedCores=4
Worker3:coresFree=1 memoryFree=1.5G assignedExecutors=1 assignedCores=2
totalExecutors:6,totalCores=12
Since 12 cores have now been allocated, the application's requirement is met and no further scheduling is done for it.
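The round-robin walkthrough above can be reproduced with a small simulation. This is a simplified sketch under stated assumptions: `Worker` and `assign` are hypothetical names, memory is in MB, and details of the real scheduler such as executor limits are ignored.

```scala
object SpreadOutSketch {
  // Hypothetical simplified worker: only the two fields the walkthrough uses.
  final case class Worker(coresFree: Int, memoryFreeMB: Int)

  // Returns the cores assigned on each worker, visiting workers round-robin
  // and placing one Executor per visit.
  def assign(workers: Vector[Worker], coresPerExecutor: Int,
             memPerExecutorMB: Int, maxCores: Int): Vector[Int] = {
    val cores = Array.fill(workers.length)(0)
    val execs = Array.fill(workers.length)(0)
    var remaining = math.min(maxCores, workers.map(_.coresFree).sum)

    // A worker can take another Executor if the app still needs cores and
    // the worker has enough free cores and memory left.
    def canLaunch(i: Int): Boolean =
      remaining >= coresPerExecutor &&
        workers(i).coresFree - cores(i) >= coresPerExecutor &&
        workers(i).memoryFreeMB - execs(i) * memPerExecutorMB >= memPerExecutorMB

    var free = workers.indices.filter(canLaunch)
    while (free.nonEmpty) {
      free.foreach { i =>
        if (canLaunch(i)) {
          cores(i) += coresPerExecutor
          execs(i) += 1
          remaining -= coresPerExecutor
        }
      }
      free = free.filter(canLaunch)
    }
    cores.toVector
  }
}
```

For the three usable workers of the example (2 cores and 512M per Executor, maxCores 12) the result is 6, 4 and 2 cores on worker1, worker2 and worker3, matching the final table above.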
The non-spreadOut algorithm
What about the non-spreadOut algorithm? Once it picks a worker, it does not let go until that worker's resources are exhausted:
Worker1:coresFree=8 memoryFree=9.5G assignedExecutors=1 assignedCores=2
Worker2:coresFree=7 memoryFree=1G assignedExecutors=0 assignedCores=0
Worker3:coresFree=3 memoryFree=2G assignedExecutors=0 assignedCores=0
totalExecutors:1,totalCores=2
Worker1:coresFree=6 memoryFree=9.0G assignedExecutors=2 assignedCores=4
Worker2:coresFree=7 memoryFree=1G assignedExecutors=0 assignedCores=0
Worker3:coresFree=3 memoryFree=2G assignedExecutors=0 assignedCores=0
totalExecutors:2,totalCores=4
Worker1:coresFree=4 memoryFree=8.5G assignedExecutors=3 assignedCores=6
Worker2:coresFree=7 memoryFree=1G assignedExecutors=0 assignedCores=0
Worker3:coresFree=3 memoryFree=2G assignedExecutors=0 assignedCores=0
totalExecutors:3,totalCores=6
Worker1:coresFree=2 memoryFree=8.0G assignedExecutors=4 assignedCores=8
Worker2:coresFree=7 memoryFree=1G assignedExecutors=0 assignedCores=0
Worker3:coresFree=3 memoryFree=2G assignedExecutors=0 assignedCores=0
totalExecutors:4,totalCores=8
Worker1:coresFree=0 memoryFree=7.5G assignedExecutors=5 assignedCores=10
Worker2:coresFree=7 memoryFree=1G assignedExecutors=0 assignedCores=0
Worker3:coresFree=3 memoryFree=2G assignedExecutors=0 assignedCores=0
totalExecutors:5,totalCores=10
Worker1's coresFree is now exhausted. The application needs 12 cores and only 10 have been allocated so far, so allocation continues on the next worker:
Worker1:coresFree=0 memoryFree=7.5G assignedExecutors=5 assignedCores=10
Worker2:coresFree=5 memoryFree=512M assignedExecutors=1 assignedCores=2
Worker3:coresFree=3 memoryFree=2G assignedExecutors=0 assignedCores=0
totalExecutors:6,totalCores=12
In the end 12 cores are allocated, which meets the application's requirement.
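Because the non-spreadOut strategy fills one worker completely before moving on, its result for this simplified model can be computed per worker without a loop over rounds. The sketch below is an illustration only: `Worker` and `assign` are hypothetical names, memory is in MB, and this closed form is a stand-in for, not a copy of, the loop Spark uses.

```scala
object PackingSketch {
  // Hypothetical simplified worker: free cores and free memory in MB.
  final case class Worker(coresFree: Int, memoryFreeMB: Int)

  // Fill each worker with as many Executors as it can hold before moving on.
  def assign(workers: Vector[Worker], coresPerExecutor: Int,
             memPerExecutorMB: Int, maxCores: Int): Vector[Int] = {
    var remaining = math.min(maxCores, workers.map(_.coresFree).sum)
    workers.map { w =>
      val byCores = w.coresFree / coresPerExecutor     // limit from free cores
      val byMemory = w.memoryFreeMB / memPerExecutorMB // limit from free memory
      val byDemand = remaining / coresPerExecutor      // limit from what the app still needs
      val executors = List(byCores, byMemory, byDemand).min
      remaining -= executors * coresPerExecutor
      executors * coresPerExecutor
    }
  }
}
```

With the example's workers this yields 10 cores on worker1, 2 on worker2 and none on worker3, the same outcome as the tables above.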
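Comparison:
With spreadOut, Executor processes are allocated across the worker nodes in round-robin order: worker1, worker2, ..., worker n, worker1, ...
Without spreadOut, the scheduler stays on one worker until one of the following conditions holds:
worker.coresFree < application.coresPerExecutor, or worker.memoryFree < application.memoryPerExecutor.
In both examples 6 Executor processes and 12 cores were allocated in the end. With spreadOut, the 6 Executor processes are spread over all three usable workers, which makes fuller use of the cluster's worker nodes. Without spreadOut, Executor processes were allocated only on worker1 and worker2, so the worker nodes are not fully used.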
An aside: spreadOut + oneExecutorPerWorker
Spark also has a mechanism called "oneExecutorPerWorker", in which a worker starts only one Executor process for the application. In the source shown later, this mode is active when coresPerExecutor is not set, and cores are then handed out one at a time; the brief walkthrough below keeps the 2-core step of the previous examples only to make the comparison easy:
Worker1:coresFree=8 memoryFree=9.5G assignedExecutors=1 assignedCores=2
Worker2:coresFree=7 memoryFree=1G assignedExecutors=0 assignedCores=0
Worker3:coresFree=3 memoryFree=2G assignedExecutors=0 assignedCores=0
totalExecutors:1,totalCores=2
Worker1:coresFree=8 memoryFree=9.5G assignedExecutors=1 assignedCores=2
Worker2:coresFree=5 memoryFree=512M assignedExecutors=1 assignedCores=2
Worker3:coresFree=3 memoryFree=2G assignedExecutors=0 assignedCores=0
totalExecutors:2,totalCores=4
Worker1:coresFree=8 memoryFree=9.5G assignedExecutors=1 assignedCores=2
Worker2:coresFree=5 memoryFree=512M assignedExecutors=1 assignedCores=2
Worker3:coresFree=1 memoryFree=1.5G assignedExecutors=1 assignedCores=2
totalExecutors:3,totalCores=6
Worker1:coresFree=6 memoryFree=9.5G assignedExecutors=1 assignedCores=4
Worker2:coresFree=5 memoryFree=512M assignedExecutors=1 assignedCores=2
Worker3:coresFree=1 memoryFree=1.5G assignedExecutors=1 assignedCores=2
totalExecutors:3,totalCores=8
Worker1:coresFree=6 memoryFree=9.5G assignedExecutors=1 assignedCores=4
Worker2:coresFree=3 memoryFree=512M assignedExecutors=1 assignedCores=4
Worker3:coresFree=1 memoryFree=1.5G assignedExecutors=1 assignedCores=2
totalExecutors:3,totalCores=10
Worker1:coresFree=4 memoryFree=9.5G assignedExecutors=1 assignedCores=6
Worker2:coresFree=3 memoryFree=512M assignedExecutors=1 assignedCores=4
Worker3:coresFree=1 memoryFree=1.5G assignedExecutors=1 assignedCores=2
totalExecutors:3,totalCores=12
Compared with spreadOut without oneExecutorPerWorker, the cores per worker are the same; the difference is the number of Executor processes: 6 there, 3 here. Memory is deducted only once per worker, because each worker hosts a single Executor.
(
As an extension, suppose the application's maxCores is 14 instead of 12. Continuing from the worker list above:
Worker1:coresFree=4 memoryFree=9.5G assignedExecutors=1 assignedCores=6
Worker2:coresFree=1 memoryFree=512M assignedExecutors=1 assignedCores=6
Worker3:coresFree=1 memoryFree=1.5G assignedExecutors=1 assignedCores=2
totalExecutors:3,totalCores=14
Worker2 receives more cores without any further memory check: when cores are added to an Executor that already exists on a worker, the oneExecutorPerWorker mechanism does not check the memory limit. In the earlier spreadOut example worker2 stopped at 4 cores because it had no memory left for a third Executor.
)
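To make the skipped memory check concrete, here is a minimal sketch of the oneExecutorPerWorker case with spreadOut. It follows the source quoted later in this article rather than the 2-core illustration: coresPerExecutor is assumed to be unset, so one core is assigned per step. `Worker` and `assign` are hypothetical names and memory is in MB.

```scala
object OneExecutorPerWorkerSketch {
  // Hypothetical simplified worker: free cores and free memory in MB.
  final case class Worker(coresFree: Int, memoryFreeMB: Int)

  // Returns (cores per worker, executors per worker) for an app that does not
  // set coresPerExecutor, so cores are handed out one at a time.
  def assign(workers: Vector[Worker], memPerExecutorMB: Int,
             maxCores: Int): (Vector[Int], Vector[Int]) = {
    val cores = Array.fill(workers.length)(0)
    val execs = Array.fill(workers.length)(0)
    var remaining = math.min(maxCores, workers.map(_.coresFree).sum)

    def canLaunch(i: Int): Boolean = {
      val enoughCores = remaining >= 1 && workers(i).coresFree - cores(i) >= 1
      if (execs(i) == 0) enoughCores && workers(i).memoryFreeMB >= memPerExecutorMB // new Executor: check memory
      else enoughCores // growing the existing Executor: memory is not checked again
    }

    var free = workers.indices.filter(canLaunch)
    while (free.nonEmpty) {
      free.foreach { i =>
        if (canLaunch(i)) {
          cores(i) += 1
          execs(i) = 1
          remaining -= 1
        }
      }
      free = free.filter(canLaunch)
    }
    (cores.toVector, execs.toVector)
  }
}
```

With the example's three workers and maxCores 12, this sketch ends with 5, 4 and 3 cores and exactly one Executor per worker. The per-worker numbers differ from the tables above because the step is 1 core rather than 2. A worker whose free memory equals exactly one Executor's worth can still receive all of its cores, since memory is checked only when its single Executor is first placed.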
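Now let's look at how the source implements this:
With the above in mind, the source is easy to read, so only a brief walkthrough is given here.
When org.apache.spark.deploy.master.Master receives the RegisterApplication(description, driver) message sent by an application, it runs the registration logic: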
case RegisterApplication(description, driver) => {
  // TODO Prevent repeated registrations from some driver
  // A standby master does not schedule
  if (state == RecoveryState.STANDBY) {
    // ignore, don't send response
  } else {
    logInfo("Registering app " + description.name)
    val app = createApplication(description, driver)
    // Register the app, i.e. add it to waitingApps
    registerApplication(app)
    logInfo("Registered app " + description.name + " with ID " + app.id)
    // Add the app to the persistence engine, mainly for failure recovery
    persistenceEngine.addApplication(app)
    // Send RegisteredApplication to the driver to confirm that the master has registered the app
    driver.send(RegisteredApplication(app.id, self))
    // Schedule resources for the apps in waitingApps
    schedule()
  }
}
The comments above explain each step.
/**
 * Schedule the currently available resources among waiting apps. This method will be called
 * every time a new app joins or resource availability changes.
 */
private def schedule(): Unit = {
  if (state != RecoveryState.ALIVE) { return }
  // Drivers take strict precedence over executors
  // Shuffle the workers so that successive calls to schedule() do not always allocate on the same workers
  val shuffledWorkers = Random.shuffle(workers) // Randomization helps balance drivers
  // The loop below schedules resources for drivers. This article only covers application
  // scheduling, so driver scheduling is not discussed further.
  for (worker <- shuffledWorkers if worker.state == WorkerState.ALIVE) {
    for (driver <- waitingDrivers) {
      if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
        launchDriver(worker, driver)
        waitingDrivers -= driver
      }
    }
  }
  // Schedule resources for applications
  startExecutorsOnWorkers()
}
/**
 * Schedule and launch executors on workers
 */
private def startExecutorsOnWorkers(): Unit = {
  // Right now this is a very simple FIFO scheduler. We keep trying to fit in the first app
  // in the queue, then the second app, etc.
  // Schedule resources for the apps in waitingApps; app.coresLeft is the number of cores the app still needs
  for (app <- waitingApps if app.coresLeft > 0) {
    val coresPerExecutor: Option[Int] = app.desc.coresPerExecutor
    // Filter out workers that don't have enough resources to launch an executor
    // Keep the workers that are ALIVE and whose free memory and free cores meet the app's
    // requirements, then sort them by coresFree in descending order
    val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
      .filter(worker => worker.memoryFree >= app.desc.memoryPerExecutorMB &&
        worker.coresFree >= coresPerExecutor.getOrElse(1))
      .sortBy(_.coresFree).reverse
    // Decide how many cores the app gets on each usable worker
    val assignedCores = scheduleExecutorsOnWorkers(app, usableWorkers, spreadOutApps)

    // Now that we've decided how many cores to allocate on each worker, let's allocate them
    // Launch the Executor processes on the workers
    for (pos <- 0 until usableWorkers.length if assignedCores(pos) > 0) {
      allocateWorkerResourceToExecutors(
        app, assignedCores(pos), coresPerExecutor, usableWorkers(pos))
    }
  }
}
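This method does the following:
1. It selects the usable workers, usableWorkers. A worker is added to usableWorkers if it meets all of the following conditions:
its state is ALIVE
worker.memoryFree >= app.desc.memoryPerExecutorMB
worker.coresFree >= coresPerExecutor (or 1 if coresPerExecutor is not set)
2. It computes assignedCores, an array in which assignedCores(i) holds the number of cores to allocate on usableWorkers(i). For example, if assignedCores(1) = 2, then 2 cores must be allocated on usableWorkers(1).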
/**
* Schedule executors to be launched on the workers.
* Returns an array containing number of cores assigned to each worker.
*
* There are two modes of launching executors. The first attempts to spread out an application's
* executors on as many workers as possible, while the second does the opposite (i.e. launch them
* on as few workers as possible). The former is usually better for data locality purposes and is
* the default.
*
* The number of cores assigned to each executor is configurable. When this is explicitly set,
* multiple executors from the same application may be launched on the same worker if the worker
* has enough cores and memory. Otherwise, each executor grabs all the cores available on the
* worker by default, in which case only one executor may be launched on each worker.
*
* It is important to allocate coresPerExecutor on each worker at a time (instead of 1 core
* at a time). Consider the following example: cluster has 4 workers with 16 cores each.
* User requests 3 executors (spark.cores.max = 48, spark.executor.cores = 16). If 1 core is
* allocated at a time, 12 cores from each worker would be assigned to each executor.
* Since 12 < 16, no executors would launch [SPARK-8881].
*/
private def scheduleExecutorsOnWorkers(
    app: ApplicationInfo,
    usableWorkers: Array[WorkerInfo],
    spreadOutApps: Boolean): Array[Int] = {
  val coresPerExecutor = app.desc.coresPerExecutor
  val minCoresPerExecutor = coresPerExecutor.getOrElse(1)
  val oneExecutorPerWorker = coresPerExecutor.isEmpty
  val memoryPerExecutor = app.desc.memoryPerExecutorMB
  val numUsable = usableWorkers.length
  val assignedCores = new Array[Int](numUsable) // Number of cores to give to each worker
  val assignedExecutors = new Array[Int](numUsable) // Number of new executors on each worker
  var coresToAssign = math.min(app.coresLeft, usableWorkers.map(_.coresFree).sum)

  /** Return whether the specified worker can launch an executor for this app. */
  // Whether an Executor can be allocated on the given worker
  def canLaunchExecutor(pos: Int): Boolean = {
    val keepScheduling = coresToAssign >= minCoresPerExecutor
    val enoughCores = usableWorkers(pos).coresFree - assignedCores(pos) >= minCoresPerExecutor

    // If we allow multiple executors per worker, then we can always launch new executors.
    // Otherwise, if there is already an executor on this worker, just give it more cores.
    val launchingNewExecutor = !oneExecutorPerWorker || assignedExecutors(pos) == 0
    if (launchingNewExecutor) {
      // Here we must check that the worker has enough free cores and memory
      val assignedMemory = assignedExecutors(pos) * memoryPerExecutor
      val enoughMemory = usableWorkers(pos).memoryFree - assignedMemory >= memoryPerExecutor
      val underLimit = assignedExecutors.sum + app.executors.size < app.executorLimit
      keepScheduling && enoughCores && enoughMemory && underLimit
    } else {
      // We're adding cores to an existing executor, so no need
      // to check memory and executor limits
      // Important: with oneExecutorPerWorker the memory limit is not checked in this branch
      keepScheduling && enoughCores
    }
  }

  // Keep launching executors until no more workers can accommodate any
  // more executors, or if we have reached this application's limits
  var freeWorkers = (0 until numUsable).filter(canLaunchExecutor)
  while (freeWorkers.nonEmpty) {
    freeWorkers.foreach { pos =>
      var keepScheduling = true
      while (keepScheduling && canLaunchExecutor(pos)) {
        // Cores still to be assigned
        coresToAssign -= minCoresPerExecutor
        // Cores already assigned on this worker
        assignedCores(pos) += minCoresPerExecutor

        // If we are launching one executor per worker, then every iteration assigns 1 core
        // to the executor. Otherwise, every iteration assigns cores to a new executor.
        // A worker starts only one Executor in oneExecutorPerWorker mode
        if (oneExecutorPerWorker) {
          assignedExecutors(pos) = 1
        } else {
          assignedExecutors(pos) += 1
        }

        // Spreading out an application means spreading out its executors across as
        // many workers as possible. If we are not spreading out, then we should keep
        // scheduling executors on this worker until we use all of its resources.
        // Otherwise, just move on to the next worker.
        // Without spreadOut, keep allocating on the same worker until nothing more fits
        if (spreadOutApps) {
          keepScheduling = false
        }
      }
    }
    freeWorkers = freeWorkers.filter(canLaunchExecutor)
  }
  assignedCores
}
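The doc comment's SPARK-8881 scenario (4 workers with 16 cores each, spark.cores.max = 48, spark.executor.cores = 16) can be checked with a short calculation. The sketch below is only an illustration of that arithmetic; `roundRobin` and `executors` are hypothetical helper names, and memory is ignored.

```scala
object Spark8881Sketch {
  // Hand out `total` cores round-robin over identical workers, `step` cores per visit.
  def roundRobin(numWorkers: Int, coresFree: Int, total: Int, step: Int): Vector[Int] = {
    val assigned = Array.fill(numWorkers)(0)
    var remaining = total
    var progressed = true
    while (remaining >= step && progressed) {
      progressed = false
      for (i <- 0 until numWorkers) {
        if (remaining >= step && coresFree - assigned(i) >= step) {
          assigned(i) += step
          remaining -= step
          progressed = true
        }
      }
    }
    assigned.toVector
  }

  // An Executor can only launch where a full coresPerExecutor was assigned.
  def executors(assigned: Vector[Int], coresPerExecutor: Int): Int =
    assigned.map(_ / coresPerExecutor).sum

  def main(args: Array[String]): Unit = {
    val oneAtATime = roundRobin(4, 16, 48, 1)
    val perExecutor = roundRobin(4, 16, 48, 16)
    println(s"$oneAtATime -> ${executors(oneAtATime, 16)} executors")
    println(s"$perExecutor -> ${executors(perExecutor, 16)} executors")
  }
}
```

Assigning one core at a time leaves 12 cores on each of the four workers, so no 16-core Executor fits anywhere. Assigning coresPerExecutor at a time gives 16 cores to three workers and launches the three Executors the user asked for.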
/**
* Allocate a worker's resources to one or more executors.
* @param app the info of the application which the executors belong to
* @param assignedCores number of cores on this worker for this application
* @param coresPerExecutor number of cores per executor
* @param worker the worker info
*/
private def allocateWorkerResourceToExecutors(
    app: ApplicationInfo,
    assignedCores: Int,
    coresPerExecutor: Option[Int],
    worker: WorkerInfo): Unit = {
  // If the number of cores per executor is specified, we divide the cores assigned
  // to this worker evenly among the executors with no remainder.
  // Otherwise, we launch a single executor that grabs all the assignedCores on this worker.
  // Compute how many Executor processes to create; the default is 1
  val numExecutors = coresPerExecutor.map { assignedCores / _ }.getOrElse(1)
  val coresToAssign = coresPerExecutor.getOrElse(assignedCores)
  for (i <- 1 to numExecutors) {
    val exec = app.addExecutor(worker, coresToAssign)
    // This is where the Executor process is actually launched
    launchExecutor(worker, exec)
    app.state = ApplicationState.RUNNING
  }
}
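How the cores assigned to one worker are split into Executors can be shown in isolation. The sketch below reuses the two expressions from allocateWorkerResourceToExecutors; the wrapper `split` and its return type are hypothetical, added only so the result can be inspected.

```scala
object ExecutorSplitSketch {
  // Returns the core count of each Executor to launch on one worker.
  def split(assignedCores: Int, coresPerExecutor: Option[Int]): Seq[Int] = {
    // With coresPerExecutor set: assignedCores / coresPerExecutor Executors.
    // Without it: a single Executor that takes all assigned cores.
    val numExecutors = coresPerExecutor.map(assignedCores / _).getOrElse(1)
    val coresEach = coresPerExecutor.getOrElse(assignedCores)
    Seq.fill(numExecutors)(coresEach)
  }

  def main(args: Array[String]): Unit = {
    println(split(6, Some(2))) // three Executors with 2 cores each
    println(split(6, None))    // one Executor with all 6 cores
  }
}
```

This is why worker1 in the spreadOut example ends up with three 2-core Executors, while a worker in oneExecutorPerWorker mode gets a single Executor that holds all of its assigned cores.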
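I have not been working with Spark for long. If you find any mistakes or have any comments, please leave a comment or send an email to franciswbs@163.com so we can discuss.
Author: FrancisWang
Email: franciswbs@163.com
Source: http://www.cnblogs.com/francisYoung/
Article URL: http://www.cnblogs.com/francisYoung/
Copyright of this article is shared by the author and cnblogs (博客园). Reposting is welcome, but unless the author agrees otherwise, this notice must be kept and a link to the original article must be placed prominently on the page; otherwise the author reserves the right to pursue legal liability.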