Spark Streaming源码解读之流数据不断接收和全生命周期彻底研究和思考
本节的主要内容:
一、数据接受架构和设计模式
二、接受数据的源码解读
Spark Streaming不断持续的接收数据,具有Receiver的Spark 应用程序的考虑。
Receiver和Driver在不同进程,Receiver接收数据后要不断给Deriver汇报。
因为Driver负责调度,Receiver接收的数据如果不汇报给Deriver,Deriver调度时不会把接收的数据计算入调度系统中(如:数据ID,Block分片)。
思考Spark Streaming接收数据:
不断有循环器接收数据,接收数据要存储数据,将存储数据后需要汇报给Deriver,接收数据和存储数据不应该给同一个对象进行处理。
Spark Streaming接收数据从设计模式来讲是MVC的架构:
V:就是Driver
M:就是Receiver
C:就是ReceiverSupervisor
因为:
Receiver就是接收数据器,例如:可以从socketTextStream中获取数据。
ReceiverSupervisor就是存储数据的控制器,因为Receiver是通过ReceiverSupervisor来启动的,反过来讲Receiver在接收到数据后是通过ReceiverSupervisor来存储数据的。
然后将存储后的元数据汇报给Driver端。
V:就是Driver,操作元数据通过元数据指针,根据指针地址操作其他机器上具体数据内容,并将处理结果展示出来。
所以说:
Spark Streaming数据接收全生命周期可以看成是一个MVC模式,ReceiverSupervisor相当于是控制器(C),Receiver(M)、Driver(V)
源码分析:
1、Receiver类:
/**
* :: DeveloperApi ::
* Abstract class of a receiver that can be run on worker nodes to receive external data. A
* custom receiver can be defined by defining the functions `onStart()` and `onStop()`. `onStart()`
* should define the setup steps necessary to start receiving data,
* and `onStop()` should define the cleanup steps necessary to stop receiving data.
* Exceptions while receiving can be handled either by restarting the receiver with `restart(...)`
* or stopped completely by `stop(...)` or
*
* A custom receiver in Scala would look like this.
*
* {{{
* class MyReceiver(storageLevel: StorageLevel) extends NetworkReceiver[String](storageLevel) {
* def onStart() {
* // Setup stuff (start threads, open sockets, etc.) to start receiving data.
* // Must start new thread to receive data, as onStart() must be non-blocking.
*
* // Call store(...) in those threads to store received data into Spark's memory.
*
* // Call stop(...), restart(...) or reportError(...) on any thread based on how
* // different errors needs to be handled.
*
* // See corresponding method documentation for more details
* }
*
* def onStop() {
* // Cleanup stuff (stop threads, close sockets, etc.) to stop receiving data.
* }
* }
* }}}
*
* A custom receiver in Java would look like this.
*
* {{{
* class MyReceiver extends Receiver<String> {
* public MyReceiver(StorageLevel storageLevel) {
* super(storageLevel);
* }
*
* public void onStart() {
* // Setup stuff (start threads, open sockets, etc.) to start receiving data.
* // Must start new thread to receive data, as onStart() must be non-blocking.
*
* // Call store(...) in those threads to store received data into Spark's memory.
*
* // Call stop(...), restart(...) or reportError(...) on any thread based on how
* // different errors needs to be handled.
*
* // See corresponding method documentation for more details
* }
*
* public void onStop() {
* // Cleanup stuff (stop threads, close sockets, etc.) to stop receiving data.
* }
* }
* }}}
*/
@DeveloperApi
abstract class Receiver[T](val storageLevel: StorageLevel) extends Serializable { 2、ReceiverSupervisor类:
/**
* Abstract class that is responsible for supervising a Receiver in the worker.
* It provides all the necessary interfaces for handling the data received by the receiver.
*/
private[streaming] abstract class ReceiverSupervisor(
receiver: Receiver[_],
conf: SparkConf
) extends Logging {
ReceiverTracker发送一个个Job,每个Job有一个task,每个task中有一个ReceiverSupervisor,用于启动每个Receiver的,看ReceiverTracker的start方法:
/**
* 管理receiver的:启动、执行、重新启动
* 确定所有的输入流记录,有成员记录所有输入来源
* 需要输入流,为每个输入流启动一个receiver
* This class manages the execution of the receivers of ReceiverInputDStreams. Instance of
* this class must be created after all input streams have been added and StreamingContext.start()
* has been called because it needs the final set of input streams at the time of instantiation.
*dirver端
* @param skipReceiverLaunch Do not launch the receiver. This is useful for testing.
*/
private[streaming]
class ReceiverTracker(ssc: StreamingContext, skipReceiverLaunch: Boolean = false) extends Logging { private val receiverInputStreams = ssc.graph.getReceiverInputStreams()
private val receiverInputStreamIds = receiverInputStreams.map { _.id }
private val receivedBlockTracker = new ReceivedBlockTracker(
ssc.sparkContext.conf,
ssc.sparkContext.hadoopConfiguration,
receiverInputStreamIds,
ssc.scheduler.clock,
ssc.isCheckpointPresent,
Option(ssc.checkpointDir)
)
private val listenerBus = ssc.scheduler.listenerBus /** Enumeration to identify current state of the ReceiverTracker */
object TrackerState extends Enumeration {
type TrackerState = Value
val Initialized, Started, Stopping, Stopped = Value
}
import TrackerState._ /** State of the tracker. Protected by "trackerStateLock" */
@volatile private var trackerState = Initialized // endpoint is created when generator starts.
// This not being null means the tracker has been started and not stopped
private var endpoint: RpcEndpointRef = null private val schedulingPolicy = new ReceiverSchedulingPolicy() // Track the active receiver job number. When a receiver job exits ultimately, countDown will
// be called.
private val receiverJobExitLatch = new CountDownLatch(receiverInputStreams.size) /**
* Track all receivers' information. The key is the receiver id, the value is the receiver info.
* It's only accessed in ReceiverTrackerEndpoint.
*/
private val receiverTrackingInfos = new HashMap[Int, ReceiverTrackingInfo] /**
* Store all preferred locations for all receivers. We need this information to schedule
* receivers. It's only accessed in ReceiverTrackerEndpoint.
*/
private val receiverPreferredLocations = new HashMap[Int, Option[String]] /** Start the endpoint and receiver execution thread. */
def start(): Unit = synchronized {
if (isTrackerStarted) {
throw new SparkException("ReceiverTracker already started")
} if (!receiverInputStreams.isEmpty) {
endpoint = ssc.env.rpcEnv.setupEndpoint(
"ReceiverTracker", new ReceiverTrackerEndpoint(ssc.env.rpcEnv))
if (!skipReceiverLaunch) launchReceivers()
logInfo("ReceiverTracker started")
trackerState = Started
}
}
RDD中的元素必须要实现序列化,才能将RDD序列化传输给Executor端,Receiver就实现了Serializable接口,自定义的Receiver也必须实现Serializable接口。
@DeveloperApi
abstract class Receiver[T](val storageLevel: StorageLevel) extends Serializable {
处理Receiver接收到的数据,存储数据并汇报给Driver,Receiver是一条一条的接收数据的。
作用于rdd的function,内部就是一个个Receiver,代码里面需要启动的Receiver是谁,根据你输入的数据来源inputDStreams receiver,socketTextStream
相当于一个引用句柄socketReceiver,我们获得的Receiver是引用的描述,接收的数据其是下面的getReceiver产生的:
/**
* Get the receivers from the ReceiverInputDStreams, distributes them to the
* worker nodes as a parallel collection, and runs them.
*/
private def launchReceivers(): Unit = {
val receivers = receiverInputStreams.map(nis => {
val rcvr = nis.getReceiver()
rcvr.setReceiverId(nis.id)
rcvr
}) runDummySparkJob() logInfo("Starting " + receivers.length + " receivers")
endpoint.send(StartAllReceivers(receivers))
}
private[streaming]
class ReceiverTracker(ssc: StreamingContext, skipReceiverLaunch: Boolean = false) extends Logging {
private val receiverInputStreams = ssc.graph.getReceiverInputStreams()
private val receiverInputStreamIds = receiverInputStreams.map { _.id }
private val receivedBlockTracker = new ReceivedBlockTracker(
ssc.sparkContext.conf,
ssc.sparkContext.hadoopConfiguration,
receiverInputStreamIds,
ssc.scheduler.clock,
ssc.isCheckpointPresent,
Option(ssc.checkpointDir)
)
private[streaming]
class SocketInputDStream[T: ClassTag](
ssc_ : StreamingContext,
host: String,
port: Int,
bytesToObjects: InputStream => Iterator[T],
storageLevel: StorageLevel
) extends ReceiverInputDStream[T](ssc_) { def getReceiver(): Receiver[T] = {
new SocketReceiver(host, port, bytesToObjects, storageLevel)
}
} private[streaming]
class SocketReceiver[T: ClassTag](
host: String,
port: Int,
bytesToObjects: InputStream => Iterator[T],
storageLevel: StorageLevel
) extends Receiver[T](storageLevel) with Logging { def onStart() {
// Start the thread that receives data over a connection
new Thread("Socket Receiver") {
setDaemon(true)
override def run() { receive() }
}.start()
}
如果Receiver RDD为空,则默认创建一个RDD,主要处理Receiver 接收到的数据,将接收数据给ReceiverSupervisor存储数据,并将元数据汇报给ReceiverTracker,Receiver 接收数据是一条条的,从抽象讲,是while循环获取一条条数据。接收数据,合并成buffer,放入block队列,在ReceiverSupervisorImpl启动会调用BlockGenerator对象的start方法。
override protected def onStart() {
registeredBlockGenerators.foreach { _.start() }
}
/**
* Generates batches of objects received by a
* [[org.apache.spark.streaming.receiver.Receiver]] and puts them into appropriately
* named blocks at regular intervals. This class starts two threads,
* one to periodically start a new batch and prepare the previous batch of as a block,
* the other to push the blocks into the block manager.
*
* Note: Do not create BlockGenerator instances directly inside receivers. Use
* `ReceiverSupervisor.createBlockGenerator` to create a BlockGenerator and use it.
*/
private[streaming] class BlockGenerator(
listener: BlockGeneratorListener,
receiverId: Int,
conf: SparkConf,
clock: Clock = new SystemClock()
) extends RateLimiter(conf) with Logging { private case class Block(id: StreamBlockId, buffer: ArrayBuffer[Any]) /**
* The BlockGenerator can be in 5 possible states, in the order as follows.
*
* - Initialized: Nothing has been started
* - Active: start() has been called, and it is generating blocks on added data.
* - StoppedAddingData: stop() has been called, the adding of data has been stopped,
* but blocks are still being generated and pushed.
* - StoppedGeneratingBlocks: Generating of blocks has been stopped, but
* they are still being pushed.
* - StoppedAll: Everything has stopped, and the BlockGenerator object can be GCed.
*/
private object GeneratorState extends Enumeration {
type GeneratorState = Value
val Initialized, Active, StoppedAddingData, StoppedGeneratingBlocks, StoppedAll = Value
}
import GeneratorState._ private val blockIntervalMs = conf.getTimeAsMs("spark.streaming.blockInterval", "200ms")
require(blockIntervalMs > 0, s"'spark.streaming.blockInterval' should be a positive value") private val blockIntervalTimer =
new RecurringTimer(clock, blockIntervalMs, updateCurrentBuffer, "BlockGenerator")
private val blockQueueSize = conf.getInt("spark.streaming.blockQueueSize", 10)
private val blocksForPushing = new ArrayBlockingQueue[Block](blockQueueSize)
private val blockPushingThread = new Thread() { override def run() { keepPushingBlocks() } } @volatile private var currentBuffer = new ArrayBuffer[Any]
@volatile private var state = Initialized /** Start block generating and pushing threads. */
def start(): Unit = synchronized {
if (state == Initialized) {
state = Active
blockIntervalTimer.start()
blockPushingThread.start()
logInfo("Started BlockGenerator")
} else {
throw new SparkException(
s"Cannot start BlockGenerator as its not in the Initialized state [state = $state]")
}
}
BlockGenerator类是用来干什么的?从上述的源码注释可以说明该类来把一个Receiver接收到的数据合并到一个Block然后写入到BlockManager对象中。
该类内部有两个线程,一个是周期性把数据生成一批对象,然后把先前的一批数据封装成Block。另一个线程时把Block写入到BlockManager进行存储。
override def createBlockGenerator(
blockGeneratorListener: BlockGeneratorListener): BlockGenerator = {
// Cleanup BlockGenerators that have already been stopped
registeredBlockGenerators --= registeredBlockGenerators.filter{ _.isStopped() } val newBlockGenerator = new BlockGenerator(blockGeneratorListener, streamId, env.conf)
registeredBlockGenerators += newBlockGenerator
newBlockGenerator
}
BlockGenerator类继承自ReateLimiter类,说明我们不能限定接收数据的速度,但是可以限定存储数据的速度,转过来就限定流动的速度。
BlockGenerator类有一个定时器(默认每200ms将接收到的数据合并成block)和一个线程(把block写入到BlockManager),200ms会产生一个Block,即1秒钟生成5个Partition。太小则生成的数据片中数据太小,导致一个Task处理的数据少,性能差。实际经验得到不要低于50msprivate val blockIntervalMs = conf.getTimeAsMs("spark.streaming.blockInterval", "200ms")
require(blockIntervalMs > 0, s"'spark.streaming.blockInterval' should be a positive value")
private val blockIntervalTimer =
new RecurringTimer(clock, blockIntervalMs, updateCurrentBuffer, "BlockGenerator")
private val blockQueueSize = conf.getInt("spark.streaming.blockQueueSize", 10)
private val blocksForPushing = new ArrayBlockingQueue[Block](blockQueueSize)
private val blockPushingThread = new Thread() { override def run() { keepPushingBlocks() } }
微信公众号:DT_Spark
博客:http://blog.sina.com.cn/ilovepains
手机:18610086859
QQ:1740415547
邮箱:18610086859@vip.126.com
Spark发行版笔记10
Spark Streaming源码解读之流数据不断接收和全生命周期彻底研究和思考的更多相关文章
- Spark Streaming源码解读之流数据不断接收全生命周期彻底研究和思考
本期内容 : 数据接收架构设计模式 数据接收源码彻底研究 一.Spark Streaming数据接收设计模式 Spark Streaming接收数据也相似MVC架构: 1. Mode相当于Rece ...
- Spark Streaming源码解读之Driver中ReceiverTracker架构设计以具体实现彻底研究
本期内容 : ReceiverTracker的架构设计 消息循环系统 ReceiverTracker具体实现 一. ReceiverTracker的架构设计 1. ReceiverTracker可以以 ...
- Spark Streaming源码解读之JobScheduler内幕实现和深度思考
本期内容 : JobScheduler内幕实现 JobScheduler深度思考 JobScheduler 是整个Spark Streaming调度的核心,需要设置多线程,一条用于接收数据不断的循环, ...
- 15、Spark Streaming源码解读之No Receivers彻底思考
在前几期文章里讲了带Receiver的Spark Streaming 应用的相关源码解读,但是现在开发Spark Streaming的应用越来越多的采用No Receivers(Direct Appr ...
- Spark Streaming源码解读之生成全生命周期彻底研究与思考
本期内容 : DStream与RDD关系彻底研究 Streaming中RDD的生成彻底研究 问题的提出 : 1. RDD是怎么生成的,依靠什么生成 2.执行时是否与Spark Core上的RDD执行有 ...
- 9. Spark Streaming技术内幕 : Receiver在Driver的精妙实现全生命周期彻底研究和思考
原创文章,转载请注明:转载自 听风居士博客(http://www.cnblogs.com/zhouyf/) Spark streaming 程序需要不断接收新数据,然后进行业务逻辑 ...
- 16.Spark Streaming源码解读之数据清理机制解析
原创文章,转载请注明:转载自 听风居士博客(http://www.cnblogs.com/zhouyf/) 本期内容: 一.Spark Streaming 数据清理总览 二.Spark Streami ...
- Spark Streaming源码解读之数据清理内幕彻底解密
本期内容 : Spark Streaming数据清理原理和现象 Spark Streaming数据清理代码解析 Spark Streaming一直在运行的,在计算的过程中会不断的产生RDD ,如每秒钟 ...
- Spark Streaming源码解读之Receiver生成全生命周期彻底研究和思考
本期内容 : Receiver启动的方式设想 Receiver启动源码彻底分析 多个输入源输入启动,Receiver启动失败,只要我们的集群存在就希望Receiver启动成功,运行过程中基于每个Tea ...
随机推荐
- [水煮 ASP.NET Web API2 方法论](12-2)管理 OData 路由
问题 如何控制 OData 路由 解决方案 为了注册路由,可以使用 HttpConfigurationExtension 类中 MapODataServiceRoute 的扩展方法.对于单一路由这样 ...
- loadrunner中文件的操作
loadrunner中文件的操作 我们可以使用fopen().fscanf().fprintf().fclose()函数进行文件操作,但是因为LoadRunner不支持FILE数据类型,所以我们需要做 ...
- 微软企业库5.0 学习之路——第八步、使用Configuration Setting模块等多种方式分类管理企业库配置信息
在介绍完企业库几个常用模块后,我今天要对企业库的配置文件进行处理,缘由是我打开web.config想进行一些配置的时候发现web.config已经变的异常的臃肿(大量的企业库配置信息充斥其中),所以决 ...
- Python开发基础-Day23try异常处理、socket套接字基础1
异常处理 错误 程序里的错误一般分为两种: 1.语法错误,这种错误,根本过不了python解释器的语法检测,必须在程序执行前就改正 2.逻辑错误,人为造成的错误,如数据类型错误.调用方法错误等,这些解 ...
- dijkstra算法模板及其用法
Dijkstra算法 1.定义概览 Dijkstra(迪杰斯特拉)算法是典型的单源最短路径算法,用于计算一个节点到其他所有节点的最短路径.主要特点是以起始点为中心向外层层扩展,直到扩展到终点为止.Di ...
- Java 对象池实现
http://blog.csdn.net/bryantd/article/details/1100019 http://www.cnblogs.com/devinzhang/archive/2012/ ...
- 【Vijos 1607】【NOI 2009】植物大战僵尸
https://vijos.org/p/1607 vijos界面好漂亮O(∩_∩)O~~ 对于一个植物x,和一个它保护的植物y,连一条边<x,y>表示x保护y,对于每个植物再向它左方的植物 ...
- bzoj3012(Trie)
bzoj3012 题解 3012: [Usaco2012 Dec]First! Desdescription Bessie has been playing with strings again. S ...
- 【找规律】计蒜客17118 2017 ACM-ICPC 亚洲区(西安赛区)网络赛 E. Maximum Flow
题意:一张有n个点的图,结点被编号为0~n-1,i往所有编号比它大的点j连边,权值为i xor j.给你n,问你最大流. 打个表,别忘了把相邻两项的差打出来,你会发现神奇的规律……你会发现每个答案都是 ...
- 【置换群/模拟】NOIP2005-篝火晚会
[问题描述] 佳佳刚进高中,在军训的时候,由于佳佳吃苦耐劳,很快得到了教官的赏识,成为了“小教官”.在军训结束的那天晚上,佳佳被命令组织同学们进行篝火晚会.一共有n个同学,编号从1到n.一开始,同学们 ...