1. Spark Streaming code analysis

1.1 Sample demo code

A real-time WordCount:

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._
    import org.apache.spark.storage.StorageLevel

    object NetworkWordCount {
      def main(args: Array[String]) {
        if (args.length < 3) {
          System.err.println("Usage: NetworkWordCount <master> <hostname> <port>\n" +
            "In local mode, <master> should be 'local[n]' with n > 1")
          System.exit(1)
        }

        // Create the context with a 1 second batch size
        val ssc = new StreamingContext("local[4]", "NetworkWordCount", Seconds(1),
          System.getenv("SPARK_HOME"))

        // Create a NetworkInputDStream on target ip:port and count the
        // words in input stream of \n delimited text (eg. generated by 'nc')
        val lines = ssc.socketTextStream(args(1), args(2).toInt, StorageLevel.MEMORY_ONLY_SER)
        val words = lines.flatMap(_.split(" "))
        val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
        wordCounts.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }

The code above receives text over the network in real time and counts the words in it.
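To see what each micro-batch computes, the same flatMap / map / reduceByKey pipeline can be sketched on a plain Scala collection, with groupBy plus sum standing in for reduceByKey (a local illustration only, no Spark involved):

```scala
// Word count over the lines of one micro-batch; groupBy + sum plays the
// role of reduceByKey(_ + _) in the streaming demo.
def wordCount(lines: Seq[String]): Map[String, Int] =
  lines.flatMap(_.split(" "))
    .map(w => (w, 1))
    .groupBy(_._1)
    .map { case (w, pairs) => w -> pairs.map(_._2).sum }
```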

1.2 Based on the demo above, this section walks through the source code of the Spark Streaming startup process and of the byte-stream processing path.

The startup sequence involved in the demo is shown in the sequence diagram. In brief:

(1) As the streaming WordCount example shows, the Spark Streaming API looks very much like the core Spark API.

Spark Streaming is built on top of the Spark framework. The interaction between StreamingContext and SparkContext in the sequence diagram shows that StreamingContext wraps a SparkContext. The key code:

    // Package-private: visible inside the streaming package only.
    // StreamingContext still wraps a plain SparkContext.
    class StreamingContext private[streaming] (
        sc_ : SparkContext,
        cp_ : Checkpoint,
        batchDur_ : Duration
      ) extends Logging {

      def this(conf: SparkConf, batchDuration: Duration) = {
        this(StreamingContext.createNewSparkContext(conf), null, batchDuration)
      }

      private[streaming] def createNewSparkContext(
          master: String,
          appName: String,
          sparkHome: String,
          jars: Seq[String],
          environment: Map[String, String]
        ): SparkContext = {
        val conf = SparkContext.updatedConf(
          new SparkConf(), master, appName, sparkHome, jars, environment)
        createNewSparkContext(conf)
      }
      ...

These are parts of StreamingContext's constructors. They show both the wrapping of a SparkContext and how createNewSparkContext builds one. From the code's perspective, StreamingContext is a wrapper around SparkContext. In terms of the demo's startup phase, once the StreamingContext is instantiated, the program's configuration and initialization are complete.

(2) Back in the demo: once the StreamingContext is instantiated, a socket text stream is obtained from it:

    val lines = ssc.socketTextStream(args(1), args(2).toInt, StorageLevel.MEMORY_ONLY_SER)

Here args(1) and args(2) correspond to the IP and port that the Receiver (discussed later) listens on, and the final parameter, as its name suggests, is the storage level (memory-only in this case).

With the demo code briefly explained, let us return to the sequence diagram and the source:

StreamingContext.socketTextStream is defined as follows:

    def socketTextStream(
        hostname: String,
        port: Int,
        storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
      ): ReceiverInputDStream[String] = {
      socketStream[String](hostname, port, SocketReceiver.bytesToLines, storageLevel)
    }

The method returns a ReceiverInputDStream, which brings up the concept of a DStream (worth expanding on). According to the comments in the DStream code, a DStream is essentially a discretized stream: the stream is broken up into a list of RDDs, so the fundamental operations are still RDD-based.

The class hierarchy around InputDStream:

The hierarchy diagram above shows the InputDStream family. The class names largely explain what each DStream does: FileInputDStream handles file streams, KafkaInputDStream handles Kafka messages, and so on. A DStream itself wraps a stream that has been discretized into time slices, as the following DStream code fragment shows:

    abstract class DStream[T: ClassTag] (
        @transient private[streaming] var ssc: StreamingContext
      ) extends Serializable with Logging {

      @transient
      private[streaming] var generatedRDDs = new HashMap[Time, RDD[T]] ()
      ...

As the code shows, generatedRDDs is a HashMap member of DStream whose keys are time slices and whose values are RDDs. This one member captures the essence of a DStream: a series of data batches stored by time slice.
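To make the generatedRDDs idea concrete, here is a toy model (hypothetical types, not the Spark API) that discretizes a sequence of timestamped records into per-interval batches keyed by the interval start time, just as the HashMap keys batches by Time:

```scala
case class Record(timestampMs: Long, value: String)

// Assign each record to the batch interval containing its timestamp; the
// result maps interval-start time -> batch of values, mirroring the shape
// of generatedRDDs: HashMap[Time, RDD[T]].
def discretize(records: Seq[Record], batchMs: Long): Map[Long, Seq[String]] =
  records.groupBy(r => (r.timestampMs / batchMs) * batchMs)
    .map { case (t, rs) => t -> rs.map(_.value) }
```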

    /**
     * Retrieve a precomputed RDD of this DStream, or computes the RDD. This is an internal
     * method that should not be called directly.
     */
    private[streaming] def getOrCompute(time: Time): Option[RDD[T]] = {
      // If this DStream was not initialized (i.e., zeroTime not set), then do it
      // If RDD was already generated, then retrieve it from HashMap
      generatedRDDs.get(time) match {
        // If an RDD was already generated and is being reused, then
        // probably all RDDs in this DStream will be reused and hence should be cached
        case Some(oldRDD) => Some(oldRDD)
        // if RDD was not generated, and if the time is valid
        // (based on sliding time of this DStream), then generate the RDD
        case None => {
          if (isTimeValid(time)) {
            compute(time) match {
              case Some(newRDD) =>
                if (storageLevel != StorageLevel.NONE) {
                  newRDD.persist(storageLevel)
                  logInfo("Persisting RDD " + newRDD.id + " for time " +
                    time + " to " + storageLevel + " at time " + time)
                }
                if (checkpointDuration != null &&
                    (time - zeroTime).isMultipleOf(checkpointDuration)) {
                  newRDD.checkpoint()
                  logInfo("Marking RDD " + newRDD.id + " for time " + time +
                    " for checkpointing at time " + time)
                }
                generatedRDDs.put(time, newRDD)
                Some(newRDD)
              case None =>
                None
            }
          } else {
            None
          }
        }
      }
    }

This is the main function for building and looking up RDDs by time slice; the logic is simple enough that it needs little further comment.
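The pattern is ordinary memoization keyed by time. A stripped-down sketch (a hypothetical BatchCache with plain Long timestamps instead of Spark's Time and RDD types):

```scala
import scala.collection.mutable

// Sketch of the getOrCompute pattern: return the batch cached for `time`,
// or compute, cache and return it on first access.
class BatchCache[T](compute: Long => Option[T]) {
  private val generated = mutable.HashMap[Long, T]()

  def getOrCompute(time: Long): Option[T] =
    generated.get(time) match {
      case some @ Some(_) => some                              // already generated: reuse
      case None =>
        compute(time).map { v => generated.put(time, v); v }   // generate and cache
    }
}
```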

That is as deep as we will go into DStream; back to the main thread. The demo obtained a ReceiverInputDStream from the StreamingContext. The ReceiverInputDStream wraps a SocketReceiver object that reads the data stream from the network; following the code through, what we actually get is a SocketInputDStream, whose stream is discretized by time slice into the HashMap defined on DStream. The code involved:

    def socketTextStream(
        hostname: String,
        port: Int,
        storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
      ): ReceiverInputDStream[String] = {
      socketStream[String](hostname, port, SocketReceiver.bytesToLines, storageLevel)
    }

    def socketStream[T: ClassTag](
        hostname: String,
        port: Int,
        converter: (InputStream) => Iterator[T],
        storageLevel: StorageLevel
      ): ReceiverInputDStream[T] = {
      new SocketInputDStream[T](this, hostname, port, converter, storageLevel)
    }

That is how the client demo (the driver) obtains the socketTextStream. In short, it gets a SocketInputDStream through the API provided by StreamingContext.

Back to the demo: apart from the WordCount processing logic itself, what remains is the StreamingContext startup logic.

From core Spark we know that defining RDDs only builds the computation's logic and dependency graph without actually executing anything; the same is true here. Next comes the key part: how StreamingContext starts.

(3) Returning to the sequence diagram, we can see that StreamingContext's start method directly or indirectly triggers, in order, methods on JobScheduler, StreamingListenerBus, ReceiverTracker and JobGenerator. Let us look at the source:

StreamingContext calls the start method of the central JobScheduler:

    def start(): Unit = synchronized {
      if (eventActor != null) return // scheduler has already been started

      logDebug("Starting JobScheduler")
      eventActor = ssc.env.actorSystem.actorOf(Props(new Actor {
        def receive = {
          case event: JobSchedulerEvent => processEvent(event)
        }
      }), "JobScheduler")

      // Start the listener bus
      listenerBus.start()
      receiverTracker = new ReceiverTracker(ssc)
      // Start the receiver tracker
      receiverTracker.start()
      jobGenerator.start()
      logInfo("Started JobScheduler")
    }

A quick summary of the code above: it first creates an eventActor object (an Akka Actor; for a short introduction to Akka see http://baike.baidu.com/view/5912486.htm?fr=aladdin, and it is worth digging deeper alongside Scala). The actor's event-handling logic delegates to the processEvent method:

    private def processEvent(event: JobSchedulerEvent) {
      try {
        event match {
          case JobStarted(job) => handleJobStart(job)
          case JobCompleted(job) => handleJobCompletion(job)
          case ErrorReported(m, e) => handleError(m, e)
        }
      } catch {
        case e: Throwable =>
          reportError("Error in job scheduler", e)
      }
    }

    // Called indirectly: handles the job-started event
    private def handleJobStart(job: Job) {
      val jobSet = jobSets.get(job.time)
      if (!jobSet.hasStarted) {
        listenerBus.post(StreamingListenerBatchStarted(jobSet.toBatchInfo))
      }
      jobSet.handleJobStart(job)
      logInfo("Starting job " + job.id + " from job set of time " + jobSet.time)
      SparkEnv.set(ssc.env)
    }

    // Handles the job-completion event
    private def handleJobCompletion(job: Job) {
      job.result match {
        case Success(_) =>
          val jobSet = jobSets.get(job.time)
          jobSet.handleJobCompletion(job)
          logInfo("Finished job " + job.id + " from job set of time " + jobSet.time)
          if (jobSet.hasCompleted) {
            jobSets.remove(jobSet.time)
            jobGenerator.onBatchCompletion(jobSet.time)
            logInfo("Total delay: %.3f s for time %s (execution: %.3f s)".format(
              jobSet.totalDelay / 1000.0, jobSet.time.toString,
              jobSet.processingDelay / 1000.0
            ))
            listenerBus.post(StreamingListenerBatchCompleted(jobSet.toBatchInfo))
          }
        case Failure(e) =>
          reportError("Error running job " + job, e)
      }
    }

    // Handles the error-report event
    private def handleError(msg: String, e: Throwable) {
      logError(msg, e)
      ssc.waiter.notifyError(e)
    }

So JobScheduler, as the job scheduler, uses the Akka eventActor to be notified in real time of job start, completion and error events.

Moreover, the handle* methods listed above show that, after handling the start and completion events, the scheduler wraps the event information and posts it to the ListenerBus, so that every listener registered on the bus can react promptly to job state changes. The ListenerBus design is quite elegant.

Once the eventActor is in place, the StreamingListenerBus object is started.

The core of ListenerBus.start:

    // The start function
    def start() {
      listenerThread.start()
    }

    // The daemon thread it starts
    val listenerThread = new Thread("StreamingListenerBus") {
      setDaemon(true)
      override def run() {
        while (true) {
          val event = eventQueue.take
          event match {
            case receiverStarted: StreamingListenerReceiverStarted =>
              listeners.foreach(_.onReceiverStarted(receiverStarted))
            case receiverError: StreamingListenerReceiverError =>
              listeners.foreach(_.onReceiverError(receiverError))
            case receiverStopped: StreamingListenerReceiverStopped =>
              listeners.foreach(_.onReceiverStopped(receiverStopped))
            case batchSubmitted: StreamingListenerBatchSubmitted =>
              listeners.foreach(_.onBatchSubmitted(batchSubmitted))
            case batchStarted: StreamingListenerBatchStarted =>
              listeners.foreach(_.onBatchStarted(batchStarted))
            case batchCompleted: StreamingListenerBatchCompleted =>
              listeners.foreach(_.onBatchCompleted(batchCompleted))
            case StreamingListenerShutdown =>
              // Get out of the while loop and shutdown the daemon thread
              return
            case _ =>
          }
        }
      }
    }

As the code shows, the listener bus lives up to its name: its start method launches a daemon thread that waits for events and dispatches each one to the listeners registered on the bus.

I could not resist including the full source of StreamingListenerBus:

    /** Asynchronously passes StreamingListenerEvents to registered StreamingListeners. */
    private[spark] class StreamingListenerBus() extends Logging {
      private val listeners = new ArrayBuffer[StreamingListener]()
        with SynchronizedBuffer[StreamingListener]

      /* Cap the capacity of the SparkListenerEvent queue so we get an explicit error (rather than
       * an OOM exception) if it's perpetually being added to more quickly than it's being drained. */
      private val EVENT_QUEUE_CAPACITY = 10000
      private val eventQueue = new LinkedBlockingQueue[StreamingListenerEvent](EVENT_QUEUE_CAPACITY)
      private var queueFullErrorMessageLogged = false

      val listenerThread = new Thread("StreamingListenerBus") {
        setDaemon(true)
        override def run() {
          while (true) {
            val event = eventQueue.take
            event match {
              case receiverStarted: StreamingListenerReceiverStarted =>
                listeners.foreach(_.onReceiverStarted(receiverStarted))
              case receiverError: StreamingListenerReceiverError =>
                listeners.foreach(_.onReceiverError(receiverError))
              case receiverStopped: StreamingListenerReceiverStopped =>
                listeners.foreach(_.onReceiverStopped(receiverStopped))
              case batchSubmitted: StreamingListenerBatchSubmitted =>
                listeners.foreach(_.onBatchSubmitted(batchSubmitted))
              case batchStarted: StreamingListenerBatchStarted =>
                listeners.foreach(_.onBatchStarted(batchStarted))
              case batchCompleted: StreamingListenerBatchCompleted =>
                listeners.foreach(_.onBatchCompleted(batchCompleted))
              case StreamingListenerShutdown =>
                // Get out of the while loop and shutdown the daemon thread
                return
              case _ =>
            }
          }
        }
      }

      def start() {
        listenerThread.start()
      }

      def addListener(listener: StreamingListener) {
        listeners += listener
      }

      def post(event: StreamingListenerEvent) {
        val eventAdded = eventQueue.offer(event)
        if (!eventAdded && !queueFullErrorMessageLogged) {
          logError("Dropping StreamingListenerEvent because no remaining room in event queue. " +
            "This likely means one of the StreamingListeners is too slow and cannot keep up with the " +
            "rate at which events are being started by the scheduler.")
          queueFullErrorMessageLogged = true
        }
      }

      /**
       * Waits until there are no more events in the queue, or until the specified time has elapsed.
       * Used for testing only. Returns true if the queue has emptied and false is the specified time
       * elapsed before the queue emptied.
       */
      def waitUntilEmpty(timeoutMillis: Int): Boolean = {
        val finishTime = System.currentTimeMillis + timeoutMillis
        while (!eventQueue.isEmpty) {
          if (System.currentTimeMillis > finishTime) {
            return false
          }
          /* Sleep rather than using wait/notify, because this is used only for testing and wait/notify
           * add overhead in the general case. */
          Thread.sleep(10)
        }
        true
      }

      def stop(): Unit = post(StreamingListenerShutdown)
    }

The full source shows all the pieces: an array of listeners, a blocking event queue, and the dispatch loop that notifies every registered listener of each event. It is a pleasure to read and a very reusable pattern.
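The same pattern, a bounded blocking queue drained by a single daemon dispatch thread, is easy to reproduce outside Spark. A minimal sketch with hypothetical event types (not the Spark classes):

```scala
import java.util.concurrent.LinkedBlockingQueue
import scala.collection.mutable.ArrayBuffer

sealed trait Event
case class BatchStarted(time: Long) extends Event
case object Shutdown extends Event

// Producers post events without blocking; one daemon thread drains the
// queue and fans each event out to every registered listener in order.
class MiniListenerBus(capacity: Int = 10000) {
  private val queue = new LinkedBlockingQueue[Event](capacity)
  private val listeners = ArrayBuffer[Event => Unit]()
  private val thread = new Thread("MiniListenerBus") {
    setDaemon(true)
    override def run(): Unit = {
      while (true) {
        queue.take() match {
          case Shutdown => return   // drain loop exits on shutdown
          case e        => listeners.synchronized { listeners.foreach(_(e)) }
        }
      }
    }
  }
  def start(): Unit = thread.start()
  def addListener(l: Event => Unit): Unit = listeners.synchronized { listeners += l }
  def post(e: Event): Boolean = queue.offer(e)   // non-blocking, like the real bus
  def stop(): Unit = { post(Shutdown); thread.join() }
}
```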

Enough admiring; on to the startup of ReceiverTracker (the component that supervises the receivers), starting with its start method:

    /** Start the actor and receiver execution thread. */
    def start() = synchronized {
      if (actor != null) {
        throw new SparkException("ReceiverTracker already started")
      }

      if (!receiverInputStreams.isEmpty) {
        actor = ssc.env.actorSystem.actorOf(Props(new ReceiverTrackerActor),
          "ReceiverTracker")
        receiverExecutor.start()
        logInfo("ReceiverTracker started")
      }
    }

Reading through it: once again, the first thing created is an Akka Actor, which of course exists to handle events. Its code:

    /** Actor to receive messages from the receivers. */
    private class ReceiverTrackerActor extends Actor {
      def receive = {
        case RegisterReceiver(streamId, typ, host, receiverActor) =>
          registerReceiver(streamId, typ, host, receiverActor, sender)
          sender ! true
        case AddBlock(receivedBlockInfo) =>
          addBlocks(receivedBlockInfo)
        case ReportError(streamId, message, error) =>
          reportError(streamId, message, error)
        case DeregisterReceiver(streamId, message, error) =>
          deregisterReceiver(streamId, message, error)
          sender ! true
      }
    }

This actor mainly listens for messages sent by the receivers, including receiver registration and deregistration and AddBlock messages for storing received data blocks. Once again, ReceiverTracker shows itself to be the component with the global view.

After ReceiverTracker.start initializes the actor that monitors the receivers, the subsequent call to receiverExecutor.start() is what actually starts the receivers. The referenced code fragment shows that receiverExecutor is an instance of ReceiverLauncher, whose start logic is as follows:

    /** This thread class runs all the receivers on the cluster. */
    class ReceiverLauncher {
      @transient val env = ssc.env
      @transient val thread = new Thread() {
        override def run() {
          try {
            SparkEnv.set(env)
            // Start the receivers
            startReceivers()
          } catch {
            case ie: InterruptedException => logInfo("ReceiverLauncher interrupted")
          }
        }
      }

      def start() {
        thread.start()
      }

      /**
       * Get the receivers from the ReceiverInputDStreams, distributes them to the
       * worker nodes as a parallel collection, and runs them.
       */
      private def startReceivers() {
        val receivers = receiverInputStreams.map(nis => {
          val rcvr = nis.getReceiver()
          rcvr.setReceiverId(nis.id)
          rcvr
        })

        // Right now, we only honor preferences if all receivers have them
        val hasLocationPreferences = receivers.map(_.preferredLocation.isDefined).reduce(_ && _)

        // Create the parallel collection of receivers to distributed them on the worker nodes
        val tempRDD =
          if (hasLocationPreferences) {
            val receiversWithPreferences = receivers.map(r => (r, Seq(r.preferredLocation.get)))
            ssc.sc.makeRDD[Receiver[_]](receiversWithPreferences)
          } else {
            ssc.sc.makeRDD(receivers, receivers.size)
          }

        // Function to start the receiver on the worker node
        val startReceiver = (iterator: Iterator[Receiver[_]]) => {
          if (!iterator.hasNext) {
            throw new SparkException(
              "Could not start receiver as object not found.")
          }
          val receiver = iterator.next()
          val executor = new ReceiverSupervisorImpl(receiver, SparkEnv.get)
          executor.start()
          executor.awaitTermination()
        }

        // Run the dummy Spark job to ensure that all slaves have registered.
        // This avoids all the receivers to be scheduled on the same node.
        if (!ssc.sparkContext.isLocal) {
          ssc.sparkContext.makeRDD(1 to 50, 50).map(x => (x, 1)).reduceByKey(_ + _, 20).collect()
        }

        // Distribute the receivers and start them
        logInfo("Starting " + receivers.length + " receivers")
        ssc.sparkContext.runJob(tempRDD, startReceiver)
        logInfo("All of the receivers have been terminated")
      }
    }

So ReceiverLauncher's start method likewise launches a thread that asynchronously collects all the receivers from the input streams, wraps each one in a ReceiverSupervisor, and starts them one by one. That is the broad picture; now the details:

The method above collects all the receivers from receiverInputStreams, which is built like this:

    private val inputStreams = new ArrayBuffer[InputDStream[_]]()

    def getReceiverInputStreams() = this.synchronized {
      inputStreams.filter(_.isInstanceOf[ReceiverInputDStream[_]])
        .map(_.asInstanceOf[ReceiverInputDStream[_]])
        .toArray
    }

It is simply the InputDStream array run through the function above, which filters for elements of type ReceiverInputDStream and casts them.

Once the full set of receivers has been extracted from the InputDStreams, they are wrapped into an RDD and distributed to the worker nodes. (Here lies one of Spark Streaming's clever tricks: receivers are shipped around like any other RDD.) The code:

    // Wrap the receivers into an RDD and distribute them to the worker nodes
    val tempRDD =
      if (hasLocationPreferences) {
        val receiversWithPreferences = receivers.map(r => (r, Seq(r.preferredLocation.get)))
        ssc.sc.makeRDD[Receiver[_]](receiversWithPreferences)
      } else {
        ssc.sc.makeRDD(receivers, receivers.size)
      }

    // On each worker node, iterate over the receivers, wrap each one in a
    // ReceiverSupervisor and start it
    val startReceiver = (iterator: Iterator[Receiver[_]]) => {
      if (!iterator.hasNext) {
        throw new SparkException(
          "Could not start receiver as object not found.")
      }
      val receiver = iterator.next()
      val executor = new ReceiverSupervisorImpl(receiver, SparkEnv.get)
      executor.start()
      executor.awaitTermination()
    }
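The trick is that runJob applies startReceiver to the iterator of each one-receiver partition on whichever worker hosts it. That contract can be mimicked locally (hypothetical FakeReceiver, and a single-threaded stand-in for runJob; the real thing runs each partition on a cluster node):

```scala
case class FakeReceiver(id: Int) { def start(): String = s"receiver-$id started" }

// Local stand-in for sc.runJob(rdd, f): apply the per-partition function
// to each partition's iterator and collect the results.
def runJob[T, R](partitions: Seq[Seq[T]], f: Iterator[T] => R): Seq[R] =
  partitions.map(p => f(p.iterator))

val receivers = (1 to 3).map(i => FakeReceiver(i))

// Mirrors the startReceiver closure above: one receiver per partition.
val startReceiver = (it: Iterator[FakeReceiver]) => {
  if (!it.hasNext) throw new Exception("Could not start receiver as object not found.")
  it.next().start()
}
```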

One detail here is that every receiver is wrapped in a ReceiverSupervisor before being started. The ReceiverSupervisor is the receiver's guardian: it handles the event logic around the receiver beyond merely starting it. A sequence diagram of ReceiverTracker and ReceiverSupervisor is worth adding at this point:

Since every Receiver is wrapped in a ReceiverSupervisor, let us look at ReceiverSupervisor's start method:

    /** Start the supervisor */
    def start() {
      onStart()
      startReceiver()
    }

    override protected def onStart() {
      blockGenerator.start()
    }

    /** Start receiver */
    def startReceiver(): Unit = synchronized {
      try {
        logInfo("Starting receiver")
        receiver.onStart()
        logInfo("Called receiver onStart")
        onReceiverStart()
        receiverState = Started
      } catch {
        case t: Throwable =>
          stop("Error starting receiver " + streamId, Some(t))
      }
    }

    /**
     * Send a RegisterReceiver message to the driver to report that this receiver is up.
     */
    override protected def onReceiverStart() {
      val msg = RegisterReceiver(
        streamId, receiver.getClass.getSimpleName, Utils.localHostName(), actor)
      val future = trackerActor.ask(msg)(askTimeout)
      Await.result(future, askTimeout)
    }

Together with the sequence diagram, the source shows that ReceiverSupervisor's start method triggers BlockGenerator's start method and the receiver's onStart method.

Below are a sequence diagram of the BlockGenerator call path and the Receiver class hierarchy.

The flow in which SocketReceiver receives and stores data:

The Receiver class hierarchy:

The class diagram above shows the Receiver inheritance stack. We will only look at the onStart method relevant to this demo, since ReceiverSupervisor calls the receiver's onStart and the receiver involved here is a SocketReceiver.

    private[streaming]
    class SocketReceiver[T: ClassTag](
        host: String,
        port: Int,
        bytesToObjects: InputStream => Iterator[T],
        storageLevel: StorageLevel
      ) extends Receiver[T](storageLevel) with Logging {

      def onStart() {
        // Start the thread that receives data over a connection
        new Thread("Socket Receiver") {
          setDaemon(true)
          override def run() { receive() }
        }.start()
      }

      def onStop() {
        // There is nothing much to do as the thread calling receive()
        // is designed to stop by itself isStopped() returns false
      }

      /** Create a socket connection and receive data until receiver is stopped */
      def receive() {
        var socket: Socket = null
        try {
          logInfo("Connecting to " + host + ":" + port)
          socket = new Socket(host, port)
          logInfo("Connected to " + host + ":" + port)
          val iterator = bytesToObjects(socket.getInputStream())
          while(!isStopped && iterator.hasNext) {
            store(iterator.next)
          }
          logInfo("Stopped receiving")
          restart("Retrying connecting to " + host + ":" + port)
        } catch {
          case e: java.net.ConnectException =>
            restart("Error connecting to " + host + ":" + port, e)
          case t: Throwable =>
            restart("Error receiving data", t)
        } finally {
          if (socket != null) {
            socket.close()
            logInfo("Closed socket to " + host + ":" + port)
          }
        }
      }
    }

As the code shows, SocketReceiver's onStart also launches a daemon thread that loops over the byte stream sent from the other end of the socket and calls store to save each item for later processing by actions.
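The bytesToObjects converter passed in by socketTextStream is SocketReceiver.bytesToLines. Here is a sketch of what such a converter and the receive-and-store loop look like (an assumed simplification, not the exact Spark implementation):

```scala
import java.io.{BufferedReader, ByteArrayInputStream, InputStream, InputStreamReader}

// Turn a raw InputStream of '\n'-delimited text into an Iterator[String],
// the shape SocketReceiver.bytesToLines provides.
def bytesToLines(in: InputStream): Iterator[String] = {
  val reader = new BufferedReader(new InputStreamReader(in, "UTF-8"))
  Iterator.continually(reader.readLine()).takeWhile(_ != null)
}

// Mirrors the receive() loop: while (!isStopped && iterator.hasNext) store(...)
def receiveAll(in: InputStream, store: String => Unit): Unit = {
  val it = bytesToLines(in)
  while (it.hasNext) store(it.next())
}
```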

Now the store method:

    /**
     * Store a single item of received data to Spark's memory.
     * These single items will be aggregated together into data blocks before
     * being pushed into Spark's memory.
     */
    def store(dataItem: T) {
      executor.pushSingle(dataItem)
    }

Combining this code with the "SocketReceiver receives and stores data" sequence diagram above, we can trace the path by which SocketReceiver receives data and the BlockManager stores it: single items are appended ("+=") to the current block buffer, completed blocks are handed over through a blocking-queue mechanism, and finally the received stream data is persisted by the BlockManager.

The rest of that path is straightforward to follow with the sequence diagram, so we will not break it down further.
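The buffering step can be sketched as follows. Note this hypothetical version flushes by block size, whereas the real BlockGenerator cuts blocks on a timer interval before handing them to the BlockManager:

```scala
import scala.collection.mutable.ArrayBuffer

// Aggregate single items into blocks of `blockSize`; `pushBlock` stands in
// for handing a completed block to the block-storage layer.
class MiniBlockGenerator[T](blockSize: Int, pushBlock: Seq[T] => Unit) {
  private val current = ArrayBuffer[T]()

  def pushSingle(item: T): Unit = {
    current += item                         // the "+=" append of a single item
    if (current.size >= blockSize) flush()
  }

  def flush(): Unit =
    if (current.nonEmpty) { pushBlock(current.toList); current.clear() }
}
```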

(4) Finally, JobGenerator's start method, taken on its own:

    /** Start generation of jobs */
    def start(): Unit = synchronized {
      if (eventActor != null) return // generator has already been started

      eventActor = ssc.env.actorSystem.actorOf(Props(new Actor {
        def receive = {
          case event: JobGeneratorEvent => processEvent(event)
        }
      }), "JobGenerator")

      if (ssc.isCheckpointPresent) {
        restart()
      } else {
        startFirstTime()
      }
    }

    /** Starts the generator for the first time */
    private def startFirstTime() {
      val startTime = new Time(timer.getStartTime())
      graph.start(startTime - graph.batchDuration)
      timer.start(startTime.milliseconds)
      logInfo("Started JobGenerator at " + startTime)
    }

    /** Restarts the generator based on the information in checkpoint */
    private def restart() {
      // If manual clock is being used for testing, then
      // either set the manual clock to the last checkpointed time,
      // or if the property is defined set it to that time
      if (clock.isInstanceOf[ManualClock]) {
        val lastTime = ssc.initialCheckpoint.checkpointTime.milliseconds
        val jumpTime = ssc.sc.conf.getLong("spark.streaming.manualClock.jump", 0)
        clock.asInstanceOf[ManualClock].setTime(lastTime + jumpTime)
      }

      val batchDuration = ssc.graph.batchDuration

      // Batches when the master was down, that is,
      // between the checkpoint and current restart time
      val checkpointTime = ssc.initialCheckpoint.checkpointTime
      val restartTime = new Time(timer.getRestartTime(graph.zeroTime.milliseconds))
      val downTimes = checkpointTime.until(restartTime, batchDuration)
      logInfo("Batches during down time (" + downTimes.size + " batches): "
        + downTimes.mkString(", "))

      // Batches that were unprocessed before failure
      val pendingTimes = ssc.initialCheckpoint.pendingTimes.sorted(Time.ordering)
      logInfo("Batches pending processing (" + pendingTimes.size + " batches): " +
        pendingTimes.mkString(", "))

      // Reschedule jobs for these times
      val timesToReschedule = (pendingTimes ++ downTimes).distinct.sorted(Time.ordering)
      logInfo("Batches to reschedule (" + timesToReschedule.size + " batches): " +
        timesToReschedule.mkString(", "))
      timesToReschedule.foreach(time =>
        jobScheduler.submitJobSet(JobSet(time, graph.generateJobs(time)))
      )

      // Restart the timer
      timer.start(restartTime.milliseconds)
      logInfo("Restarted JobGenerator at " + restartTime)
    }

The code above covers how jobs are generated and started; the details are checkpointed here, to be picked up in the next installment.
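The restart bookkeeping reduces to simple time arithmetic. A sketch with plain millisecond Longs (the real code uses Time and Duration; Scala's exclusive-end `until` range is used here to mirror checkpointTime.until(restartTime, batchDuration)):

```scala
// Hypothetical helper: merge the batches missed while the driver was down
// with the batches that were pending before the failure, dedupe, and sort.
def timesToReschedule(pendingTimes: Seq[Long], checkpointMs: Long,
                      restartMs: Long, batchMs: Long): Seq[Long] = {
  // Batch times from the checkpoint (inclusive) up to the restart time
  // (exclusive), spaced batchMs apart
  val downTimes = checkpointMs until restartMs by batchMs
  (pendingTimes ++ downTimes).distinct.sorted
}
```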
