Continuing from the previous post:

Spark中Master源码分析(一) http://www.cnblogs.com/yourarebest/p/5312965.html

4. The receive method. The messages handled in receive fall into the following 12 cases:
(1) A new leader has been elected; the Master recovers the persisted data
(2) Recovery is complete; drivers are re-launched and resources are re-scheduled
(3) Leadership has been revoked; the Master shuts down
(4) The Master registers a new Worker
(5) The Master registers a new App and then re-schedules resources
(6) An Executor's state has changed, e.g. it is running or has finished
(7) A Driver's state has changed and the corresponding action is taken
(8) Heartbeats, through which the Master and the Workers keep in touch
(9) An App acknowledges the Master change during recovery
(10) A Worker reports its scheduler state after the Master change
(11) An App unregisters itself and is treated as finished and removed
(12) Workers that have timed out are considered dead and removed

The detailed code for the 12 cases is shown below:
(1) A new leader has been elected; the Master recovers the persisted data

case ElectedLeader => {
  val (storedApps, storedDrivers, storedWorkers) = persistenceEngine.readPersistedData(rpcEnv)
  state = if (storedApps.isEmpty && storedDrivers.isEmpty && storedWorkers.isEmpty) {
    RecoveryState.ALIVE
  } else {
    RecoveryState.RECOVERING
  }
  logInfo("I have been elected leader! New state: " + state)
  if (state == RecoveryState.RECOVERING) {
    // recover the persisted apps, drivers and workers
    beginRecovery(storedApps, storedDrivers, storedWorkers)
    // after WORKER_TIMEOUT_MS (60s by default), the single-threaded daemon sends
    // a CompleteRecovery message back to this endpoint
    recoveryCompletionTask = forwardMessageThread.schedule(new Runnable {
      override def run(): Unit = Utils.tryLogNonFatalError {
        self.send(CompleteRecovery)
      }
    }, WORKER_TIMEOUT_MS, TimeUnit.MILLISECONDS)
  }
}
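
beginRecovery is not shown in this post. Roughly speaking, it re-registers the persisted apps and workers, marks them UNKNOWN, and sends them a MasterChanged message; their replies are what cases (9) and (10) below handle. A simplified sketch based on the Spark 1.x source (details may differ between versions):

private def beginRecovery(storedApps: Seq[ApplicationInfo], storedDrivers: Seq[DriverInfo],
    storedWorkers: Seq[WorkerInfo]) {
  for (app <- storedApps) {
    logInfo("Trying to recover app: " + app.id)
    try {
      // re-register the app, mark it UNKNOWN and ask its driver to acknowledge the new Master
      registerApplication(app)
      app.state = ApplicationState.UNKNOWN
      app.driver.send(MasterChanged(self, masterWebUiUrl))
    } catch {
      case e: Exception => logInfo("App " + app.id + " had exception on reconnect")
    }
  }
  for (driver <- storedDrivers) {
    // drivers whose workers are gone will be re-launched in completeRecovery
    drivers += driver
  }
  for (worker <- storedWorkers) {
    logInfo("Trying to recover worker: " + worker.id)
    try {
      // re-register the worker, mark it UNKNOWN and ask it to report its scheduler state
      registerWorker(worker)
      worker.state = WorkerState.UNKNOWN
      worker.endpoint.send(MasterChanged(self, masterWebUiUrl))
    } catch {
      case e: Exception => logInfo("Worker " + worker.id + " had exception on reconnect")
    }
  }
}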

(2) Recovery is complete; re-launch the drivers and re-schedule resources

case CompleteRecovery => completeRecovery() // see ① below

① The completeRecovery method:

private def completeRecovery() {
  if (state != RecoveryState.RECOVERING) { return }
  state = RecoveryState.COMPLETING_RECOVERY
  // kill all workers and apps that did not respond
  workers.filter(_.state == WorkerState.UNKNOWN).foreach(removeWorker)
  apps.filter(_.state == ApplicationState.UNKNOWN).foreach(finishApplication)
  // re-launch drivers that are no longer attached to any worker
  drivers.filter(_.worker.isEmpty).foreach { d =>
    logWarning(s"Driver ${d.id} was not found after master recovery")
    if (d.desc.supervise) {
      logWarning(s"Re-launching ${d.id}")
      relaunchDriver(d) // see ② below
    } else {
      removeDriver(d.id, DriverState.ERROR, None)
      logWarning(s"Did not re-launch ${d.id} because it was not supervised")
    }
  }
  // mark recovery as complete and re-schedule resources
  state = RecoveryState.ALIVE
  schedule()
  logInfo("Recovery complete - resuming operations!")
}

② The relaunchDriver method. It sets the driver's state to RELAUNCHING, adds it to the list of drivers waiting to be launched, and then re-schedules resources:

private def relaunchDriver(driver: DriverInfo) {
  driver.worker = None
  driver.state = DriverState.RELAUNCHING
  waitingDrivers += driver
  // re-schedule resources, see ③ below
  schedule()
}

③ The schedule method. It assigns the currently available resources to the waiting apps and drivers, and is called whenever a new app is submitted or the available resources (e.g. workers) change:

private def schedule(): Unit = {
  if (state != RecoveryState.ALIVE) { return }
  // drivers take precedence over executors
  // Random.shuffle returns a new, randomly ordered collection of workers
  val shuffledWorkers = Random.shuffle(workers)
  for (worker <- shuffledWorkers if worker.state == WorkerState.ALIVE) {
    for (driver <- waitingDrivers) {
      if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
        // launch the driver on this worker, see ④ below
        launchDriver(worker, driver)
        waitingDrivers -= driver
      }
    }
  }
  // schedule and launch executors on the workers (a sketch of this method follows below)
  startExecutorsOnWorkers()
}
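
startExecutorsOnWorkers is not shown in this post. As a rough sketch of what it does in the Spark 1.x source (a simplified rendering; the exact code and helper names such as scheduleExecutorsOnWorkers and allocateWorkerResourceToExecutors may differ between versions): it walks the waiting apps in FIFO order, filters out workers that cannot host an executor, decides how many cores to assign on each usable worker, and then launches the executors.

private def startExecutorsOnWorkers(): Unit = {
  // simple FIFO scheduling: try to satisfy the first waiting app, then the second, and so on
  for (app <- waitingApps if app.coresLeft > 0) {
    val coresPerExecutor: Option[Int] = app.desc.coresPerExecutor
    // keep only ALIVE workers with enough free memory and cores for at least one executor
    val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
      .filter(worker => worker.memoryFree >= app.desc.memoryPerExecutorMB &&
        worker.coresFree >= coresPerExecutor.getOrElse(1))
      .sortBy(_.coresFree).reverse
    // decide how many cores each usable worker should contribute (spread-out or not)
    val assignedCores = scheduleExecutorsOnWorkers(app, usableWorkers, spreadOutApps)
    // allocate the decided cores as executors on the corresponding workers
    for (pos <- 0 until usableWorkers.length if assignedCores(pos) > 0) {
      allocateWorkerResourceToExecutors(
        app, assignedCores(pos), coresPerExecutor, usableWorkers(pos))
    }
  }
}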

④ The launchDriver method. It launches the given driver on the given worker:

private def launchDriver(worker: WorkerInfo, driver: DriverInfo) {
  logInfo("Launching driver " + driver.id + " on worker " + worker.id)
  // account the driver's resources against this worker
  worker.addDriver(driver)
  driver.worker = Some(worker)
  // tell the worker to start the driver
  worker.endpoint.send(LaunchDriver(driver.id, driver.desc))
  // set the driver's state to RUNNING
  driver.state = DriverState.RUNNING
}

(3) Leadership has been revoked; the Master shuts down

case RevokedLeadership => {
  logError("Leadership has been revoked -- master shutting down.")
  System.exit(0)
}

(4) The Master registers a new Worker and then re-schedules resources

case RegisterWorker(
    id, workerHost, workerPort, workerRef, cores, memory, workerUiPort, publicAddress) => {
  logInfo("Registering worker %s:%d with %d cores, %s RAM".format(
    workerHost, workerPort, cores, Utils.megabytesToString(memory)))
  if (state == RecoveryState.STANDBY) {
    // a standby Master ignores the registration and does not reply
  } else if (idToWorker.contains(id)) {
    // tell the worker that the registration failed, so it exits
    workerRef.send(RegisterWorkerFailed("Duplicate worker ID"))
  } else {
    val worker = new WorkerInfo(id, workerHost, workerPort, cores, memory,
      workerRef, workerUiPort, publicAddress)
    if (registerWorker(worker)) {
      // persist the newly added worker
      persistenceEngine.addWorker(worker)
      // reply RegisteredWorker to the worker, which then starts sending heartbeats to the Master
      workerRef.send(RegisteredWorker(self, masterWebUiUrl))
      // re-schedule resources
      schedule()
    } else {
      val workerAddress = worker.endpoint.address
      logWarning("Worker registration failed. Attempted to re-register worker at same " +
        "address: " + workerAddress)
      workerRef.send(RegisterWorkerFailed(
        "Attempted to re-register worker at same address: " + workerAddress))
    }
  }
}
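
The registerWorker(worker) helper is not shown in this post. As a rough sketch based on the Spark 1.x source (details may differ): it drops any DEAD workers previously registered on the same host and port, rejects a duplicate registration from the same address unless the old entry is UNKNOWN (i.e. a worker restarted during recovery), and otherwise records the worker in workers, idToWorker and addressToWorker.

private def registerWorker(worker: WorkerInfo): Boolean = {
  // drop old, DEAD entries for the same host:port (e.g. a worker that was restarted)
  workers.filter { w =>
    (w.host == worker.host && w.port == worker.port) && (w.state == WorkerState.DEAD)
  }.foreach { w =>
    workers -= w
  }
  val workerAddress = worker.endpoint.address
  if (addressToWorker.contains(workerAddress)) {
    val oldWorker = addressToWorker(workerAddress)
    if (oldWorker.state == WorkerState.UNKNOWN) {
      // an UNKNOWN worker re-registering means it was restarted during recovery;
      // remove the old entry and accept the new one
      removeWorker(oldWorker)
    } else {
      logInfo("Attempted to re-register worker at same address: " + workerAddress)
      return false
    }
  }
  workers += worker
  idToWorker(worker.id) = worker
  addressToWorker(workerAddress) = worker
  true
}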

(5) The Master registers a new App and then re-schedules resources

case RegisterApplication(description, driver) => {
  if (state == RecoveryState.STANDBY) {
    // a standby Master ignores the registration and does not reply
  } else {
    logInfo("Registering app " + description.name)
    // create the app from the ApplicationDescription and the driver ref, see ① below
    val app = createApplication(description, driver)
    // register the app, see ② below
    registerApplication(app)
    logInfo("Registered app " + description.name + " with ID " + app.id)
    // persist the app
    persistenceEngine.addApplication(app)
    // reply RegisteredApplication to the driver ref (the AppClient endpoint)
    driver.send(RegisteredApplication(app.id, self))
    // re-schedule resources
    schedule()
  }
}

① The createApplication method. It builds an ApplicationInfo from the ApplicationDescription and the driver ref:

private def createApplication(desc: ApplicationDescription, driver: RpcEndpointRef):
    ApplicationInfo = {
  val now = System.currentTimeMillis()
  val date = new Date(now)
  // build the app with ApplicationInfo's primary constructor
  new ApplicationInfo(now, newApplicationId(date), desc, date, driver, defaultCores)
}

② The registerApplication method:

private def registerApplication(app: ApplicationInfo): Unit = {
  val appAddress = app.driver.address
  if (addressToApp.contains(appAddress)) {
    logInfo("Attempted to re-register application at same address: " + appAddress)
    return
  }
  // register the app's metrics source (state, run time, core count, etc.) with the metrics system
  applicationMetricsSystem.registerSource(app.appSource)
  apps += app
  idToApp(app.id) = app
  endpointToApp(app.driver) = app
  addressToApp(appAddress) = app
  waitingApps += app
}

(6) An Executor's state has changed, e.g. it is running or has finished

case ExecutorStateChanged(appId, execId, state, message, exitStatus) => {
  val execOption = idToApp.get(appId).flatMap(app => app.executors.get(execId))
  execOption match {
    case Some(exec) => {
      val appInfo = idToApp(appId)
      exec.state = state
      // if the executor is now running, reset the app's retry count to 0
      if (state == ExecutorState.RUNNING) { appInfo.resetRetryCount() }
      // forward ExecutorUpdated to the AppClient
      exec.application.driver.send(ExecutorUpdated(execId, state, message, exitStatus))
      // if the executor has finished, remove it from its worker and app
      if (ExecutorState.isFinished(state)) {
        logInfo(s"Removing executor ${exec.fullId} because it is $state")
        // if the app has already finished, keep the executor info so the Web UI can still show it
        if (!appInfo.isFinished) {
          appInfo.removeExecutor(exec)
        }
        exec.worker.removeExecutor(exec)
        val normalExit = exitStatus == Some(0)
        // on abnormal exit, keep re-scheduling as long as the retry count is below MAX_NUM_RETRY (10)
        if (!normalExit) {
          if (appInfo.incrementRetryCount() < ApplicationState.MAX_NUM_RETRY) {
            // re-schedule resources
            schedule()
          } else {
            val execs = appInfo.executors.values
            if (!execs.exists(_.state == ExecutorState.RUNNING)) {
              logError(s"Application ${appInfo.desc.name} with ID ${appInfo.id} failed " +
                s"${appInfo.retryCount} times; removing it")
              removeApplication(appInfo, ApplicationState.FAILED)
            }
          }
        }
      }
    }
    case None =>
      logWarning(s"Got status update for unknown executor $appId/$execId")
  }
}
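
ExecutorState.isFinished simply checks whether the state is one of the terminal ones. Roughly, in the Spark 1.x source (paraphrased; check your version):

// in org.apache.spark.deploy.ExecutorState
def isFinished(state: ExecutorState): Boolean =
  Seq(KILLED, FAILED, LOST, EXITED).contains(state)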

(7) A Driver's state has changed; take the corresponding action

case DriverStateChanged(driverId, state, exception) => {
  state match {
    case DriverState.ERROR | DriverState.FINISHED | DriverState.KILLED | DriverState.FAILED =>
      removeDriver(driverId, state, exception)
    case _ =>
      throw new Exception(s"Received unexpected state update for driver $driverId: $state")
  }
}
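
removeDriver is not shown here. A simplified sketch based on the Spark 1.x source (field and constant names such as RETAINED_DRIVERS may vary between versions): it moves the driver from the active set to the completed list, removes it from the persistence engine and from its worker, and re-schedules resources.

private def removeDriver(driverId: String, finalState: DriverState, exception: Option[Exception]) {
  drivers.find(d => d.id == driverId) match {
    case Some(driver) =>
      logInfo(s"Removing driver: $driverId")
      drivers -= driver
      // keep only a bounded number of completed drivers around for the Web UI
      if (completedDrivers.size >= RETAINED_DRIVERS) {
        val toRemove = math.max(RETAINED_DRIVERS / 10, 1)
        completedDrivers.trimStart(toRemove)
      }
      completedDrivers += driver
      // remove the driver from the persistence engine and from its worker
      persistenceEngine.removeDriver(driver)
      driver.state = finalState
      driver.exception = exception
      driver.worker.foreach(w => w.removeDriver(driver))
      // re-schedule the resources freed by this driver
      schedule()
    case None =>
      logWarning(s"Asked to remove unknown driver: $driverId")
  }
}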

(8) The heartbeat mechanism, through which the Master and the Workers keep in touch

case Heartbeat(workerId, worker) => {
  idToWorker.get(workerId) match {
    case Some(workerInfo) =>
      // record the worker's latest heartbeat time
      workerInfo.lastHeartbeat = System.currentTimeMillis()
    case None =>
      if (workers.map(_.id).contains(workerId)) {
        logWarning(s"Got heartbeat from unregistered worker $workerId." +
          " Asking it to re-register.")
        worker.send(ReconnectWorker(masterUrl))
      } else {
        logWarning(s"Got heartbeat from unregistered worker $workerId." +
          " This worker was never registered, so ignoring the heartbeat.")
      }
  }
}

(9) An App acknowledges the Master change; the Master marks it WAITING and checks whether recovery can complete

case MasterChangeAcknowledged(appId) => {
  idToApp.get(appId) match {
    case Some(app) =>
      logInfo("Application has been re-registered: " + appId)
      app.state = ApplicationState.WAITING
    case None =>
      logWarning("Master change ack from unknown app: " + appId)
  }
  if (canCompleteRecovery) { completeRecovery() }
}
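
canCompleteRecovery just checks that no worker or app is still in the UNKNOWN state, i.e. everyone contacted during beginRecovery has responded. Roughly, in the Spark 1.x source (paraphrased):

private def canCompleteRecovery =
  workers.count(_.state == WorkerState.UNKNOWN) == 0 &&
    apps.count(_.state == ApplicationState.UNKNOWN) == 0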

(10) A Worker reports its scheduler state (running executors and drivers) after the Master change

case WorkerSchedulerStateResponse(workerId, executors, driverIds) => {
  idToWorker.get(workerId) match {
    case Some(worker) =>
      logInfo("Worker has been re-registered: " + workerId)
      worker.state = WorkerState.ALIVE
      val validExecutors = executors.filter(exec => idToApp.get(exec.appId).isDefined)
      for (exec <- validExecutors) {
        val app = idToApp.get(exec.appId).get
        val execInfo = app.addExecutor(worker, exec.cores, Some(exec.execId))
        worker.addExecutor(execInfo)
        execInfo.copyState(exec)
      }
      for (driverId <- driverIds) {
        drivers.find(_.id == driverId).foreach { driver =>
          driver.worker = Some(worker)
          driver.state = DriverState.RUNNING
          worker.drivers(driverId) = driver
        }
      }
    case None =>
      logWarning("Scheduler state from unknown worker: " + workerId)
  }
  if (canCompleteRecovery) { completeRecovery() }
}

(11) An App unregisters itself and is treated as finished and removed

case UnregisterApplication(applicationId) =>
  logInfo(s"Received unregister request from application $applicationId")
  idToApp.get(applicationId).foreach(finishApplication)
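
finishApplication is a thin wrapper; in the Spark 1.x source it simply delegates to removeApplication with the FINISHED state (paraphrased sketch):

private def finishApplication(app: ApplicationInfo) {
  removeApplication(app, ApplicationState.FINISHED)
}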

(12) Check whether Workers have timed out; timed-out Workers are considered dead and removed

case CheckForWorkerTimeOut => {
  // remove dead workers: a worker is considered dead and removed if its last heartbeat
  // is older than currentTime - WORKER_TIMEOUT_MS (60s by default)
  timeOutDeadWorkers()
}
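
timeOutDeadWorkers is not shown here. A simplified sketch based on the Spark 1.x source (constant names such as REAPER_ITERATIONS may vary between versions): any worker whose last heartbeat is older than WORKER_TIMEOUT_MS is removed; workers already marked DEAD are kept around for a while (so they still show up on the Web UI) and are culled only after several timeout periods.

private def timeOutDeadWorkers() {
  // copy into an array so we don't mutate the set while iterating over it
  val currentTime = System.currentTimeMillis()
  val toRemove = workers.filter(_.lastHeartbeat < currentTime - WORKER_TIMEOUT_MS).toArray
  for (worker <- toRemove) {
    if (worker.state != WorkerState.DEAD) {
      logWarning("Removing %s because we got no heartbeat in %d seconds".format(
        worker.id, WORKER_TIMEOUT_MS / 1000))
      removeWorker(worker)
    } else {
      // already DEAD: keep it visible for a while, then drop it completely
      if (worker.lastHeartbeat < currentTime - ((REAPER_ITERATIONS + 1) * WORKER_TIMEOUT_MS)) {
        workers -= worker
      }
    }
  }
}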
