DStreamGraph有点像简洁版的DAG scheduler,负责根据某个时间间隔生成一序列JobSet,以及按照依赖关系序列化。这个类的inputStream和outputStream是最重要的属性。spark stream将动态的输入流与对流的处理通过一个shuffle来连接。前面的(shuffle map)是input stream,其实是DStream的子类,它们负责将收集的数据以block的方式存到spark memory中;而output stream,是另外的一系类DStream,负责将数据从spark memory读取出来,分解成spark core中的RDD,然后再做数据处理。





final private[streaming] class DStreamGraph extends Serializable with Logging {

private val inputStreams = new ArrayBuffer[InputDStream[_]]()
private val outputStreams = new ArrayBuffer[DStream[_]]()

var rememberDuration: Duration = null
var checkpointInProgress = false

var zeroTime: Time = null
var startTime: Time = null
var batchDuration: Duration = null


def addInputStream(inputStream: InputDStream[_]) {
this.synchronized {
inputStream.setGraph(this)
inputStreams += inputStream
}
}

def addOutputStream(outputStream: DStream[_]) {
this.synchronized {
outputStream.setGraph(this)
outputStreams += outputStream
}
}

def getInputStreams() = this.synchronized { inputStreams.toArray }

def getOutputStreams() = this.synchronized { outputStreams.toArray }

def getReceiverInputStreams() = this.synchronized {
inputStreams.filter(_.isInstanceOf[ReceiverInputDStream[_]])
.map(_.asInstanceOf[ReceiverInputDStream[_]])
.toArray
}


def generateJobs(time: Time): Seq[Job] = {
logDebug("Generating jobs for time " + time)
val jobs = this.synchronized {
outputStreams.flatMap(outputStream => outputStream.generateJob(time))
}
logDebug("Generated " + jobs.length + " jobs for time " + time)
jobs
}


@throws(classOf[IOException])
private def writeObject(oos: ObjectOutputStream): Unit = Utils.tryOrIOException {
logDebug("DStreamGraph.writeObject used")
this.synchronized {
checkpointInProgress = true
logDebug("Enabled checkpoint mode")
oos.defaultWriteObject()
checkpointInProgress = false
logDebug("Disabled checkpoint mode")
}
}

@throws(classOf[IOException])
private def readObject(ois: ObjectInputStream): Unit = Utils.tryOrIOException {
logDebug("DStreamGraph.readObject used")
this.synchronized {
checkpointInProgress = true
ois.defaultReadObject()
checkpointInProgress = false
}
}

JobScheduler负责产生jobs
/**
* This class schedules jobs to be run on Spark. It uses the JobGenerator to generate
* the jobs and runs them using a thread pool.
*/
private[streaming]
class JobScheduler(val ssc: StreamingContext) extends Logging {

private val jobSets = new ConcurrentHashMap[Time, JobSet]
private val numConcurrentJobs = ssc.conf.getInt("spark.streaming.concurrentJobs", 1)
private val jobExecutor = Executors.newFixedThreadPool(numConcurrentJobs)
private val jobGenerator = new JobGenerator(this)
val clock = jobGenerator.clock
val listenerBus = new StreamingListenerBus()

// These two are created only when scheduler starts.
// eventActor not being null means the scheduler has been started and not stopped
var receiverTracker: ReceiverTracker = null
private var eventActor: ActorRef = null

def start(): Unit = synchronized {
if (eventActor != null) return // scheduler has already been started

logDebug("Starting JobScheduler")
eventActor = ssc.env.actorSystem.actorOf(Props(new Actor {
def receive = {
case event: JobSchedulerEvent => processEvent(event)
}
}), "JobScheduler")

listenerBus.start()
receiverTracker = new ReceiverTracker(ssc)
receiverTracker.start()
jobGenerator.start()
logInfo("Started JobScheduler")
}

def submitJobSet(jobSet: JobSet) {
if (jobSet.jobs.isEmpty) {
logInfo("No jobs added for time " + jobSet.time)
} else {
jobSets.put(jobSet.time, jobSet)
jobSet.jobs.foreach(job => jobExecutor.execute(new JobHandler(job)))
logInfo("Added jobs for time " + jobSet.time)
}
}

private class JobHandler(job: Job) extends Runnable {
def run() {
eventActor ! JobStarted(job)
job.run()
eventActor ! JobCompleted(job)
}
}
job完成后处理
private def handleJobCompletion(job: Job) {
job.result match {
case Success(_) =>
val jobSet = jobSets.get(job.time)
jobSet.handleJobCompletion(job)
logInfo("Finished job " + job.id + " from job set of time " + jobSet.time)
if (jobSet.hasCompleted) {
jobSets.remove(jobSet.time)
jobGenerator.onBatchCompletion(jobSet.time)
logInfo("Total delay: %.3f s for time %s (execution: %.3f s)".format(
jobSet.totalDelay / 1000.0, jobSet.time.toString,
jobSet.processingDelay / 1000.0
))
listenerBus.post(StreamingListenerBatchCompleted(jobSet.toBatchInfo))
}
case Failure(e) =>
reportError("Error running job " + job, e)
}
}









spark streaming 4: DStreamGraph JobScheduler的更多相关文章

  1. Spark Streaming Backpressure分析

    1.为什么引入Backpressure 默认情况下,Spark Streaming通过Receiver以生产者生产数据的速率接收数据,计算过程中会出现batch processing time > ...

  2. Spark Streaming性能优化: 如何在生产环境下应对流数据峰值巨变

    1.为什么引入Backpressure 默认情况下,Spark Streaming通过Receiver以生产者生产数据的速率接收数据,计算过程中会出现batch processing time > ...

  3. 5. Spark Streaming高级解析

    5.1 DStreamGraph对象分析 在Spark Streaming中,DStreamGraph是一个非常重要的组件,主要用来: 1. 通过成员inputStreams持有Spark Strea ...

  4. 4. Spark Streaming解析

    4.1 初始化StreamingContext import org.apache.spark._ import org.apache.spark.streaming._ val conf = new ...

  5. Spark Streaming揭秘 Day25 StreamingContext和JobScheduler启动源码详解

    Spark Streaming揭秘 Day25 StreamingContext和JobScheduler启动源码详解 今天主要理一下StreamingContext的启动过程,其中最为重要的就是Jo ...

  6. Spark Streaming揭秘 Day3-运行基石(JobScheduler)大揭秘

    Spark Streaming揭秘 Day3 运行基石(JobScheduler)大揭秘 引子 作为一个非常强大框架,Spark Streaming兼具了流处理和批处理的特点.还记得第一天的谜团么,众 ...

  7. Spark Streaming源码分析 – JobScheduler

    先给出一个job从被generate到被执行的整个过程在JobGenerator中,需要定时的发起GenerateJobs事件,而每个job其实就是针对DStream中的一个RDD,发起一个Spark ...

  8. 贯通Spark Streaming JobScheduler内幕实现和深入思考

    本节主要内容: 一.SparkStreaming Job生成深度思考 二.SparkStreaming Job生成源码解析 JobScheduler的地位非常的重要,所有的关键都在JobSchedul ...

  9. Spark Streaming源码解读之JobScheduler内幕实现和深度思考

    本期内容 : JobScheduler内幕实现 JobScheduler深度思考 JobScheduler 是整个Spark Streaming调度的核心,需要设置多线程,一条用于接收数据不断的循环, ...

随机推荐

  1. href="javascript:show_login()"意思

    整句话意味着当你点击一个超链接时,你会触发函数show_login. Href是一个超链接,通过单击该超链接触发. javascript:后面是JS代码 show_login():表示JS的函数的油烟 ...

  2. jeesite表字段太多导致不能自动生成那张表的代码——————jetty 之 form too large | form too many keys 异常

    看了Jetty的源码才发现,jetty限制了Form提交数据的大小,该源码类来自jetty lib库下的jetty-server-7.6.16.v20140903.jar包下的 org.eclipse ...

  3. idea 党用快捷键

    实用快捷键: Ctrl+/ 或 Ctrl+Shift+/ 注释(// 或者/*...*/ )Ctrl+D 复制行Ctrl+X 删除行快速修复 alt+enter (modify/cast)代码提示 a ...

  4. Hacklab WebIDE在线调试ESP32笔记

    目录 1.什么是Hacklab WebIDE 1.1 优势 1.2 趋势 2. 使用方法 2.1 功能介绍 2.2 编译第一个程序 2.3 搭建esp32的开发环境 2.4 建立开发板与云平台的连接 ...

  5. 1.RPC原理学习

    1.什么是RPC:远程过程调用协议 RPC(Remote Procedure Call Protocol)— 远程过程调用协议,是一种通过网络从远程计算机程序上请求服务,而不需要 了解底层网络技术的协 ...

  6. Linux基础命令02

    常用的一些命令选项 向网络发送icmp检测主机是否在线 ping 指定发送包数量 ping -c windows系统中是ping -t不间断刷包 比如ping百度,ping不同,一直卡在这里,加了-w ...

  7. (一)数据库系统概述和ER图

    1.数据库系统 数据库系统有数据库.数据库管理系统.应用系统和数据库管理员组成.数据库呢就是数据的集合,应用系统和管理员就不说了,数据库管理系统即常说的DBMS,比如我们用的mysql,oracle, ...

  8. updatedepthtexture 和 screen space shadow 开关

    2018.0.3f 里面directional light开了shadow 就会有一张updatedepth 如果距离远 没有阴影就没有shadow pass 但是updatedepth没有关掉 管线 ...

  9. Acwing-203-同余方程(扩展欧几里得)

    链接: https://www.acwing.com/problem/content/205/ 题意: 求关于x的同余方程 ax ≡ 1(mod b) 的最小正整数解. 思路: 首先:扩展欧几里得推导 ...

  10. HDU 6047 - Maximum Sequence | 2017 Multi-University Training Contest 2

    /* HDU 6047 - Maximum Sequence [ 单调队列 ] 题意: 起初给出n个元素的数列 A[N], B[N] 对于 A[]的第N+K个元素,从B[N]中找出一个元素B[i],在 ...