Spark Streaming源码解读之流数据不断接收和全生命周期彻底研究和思考

本节的主要内容：

一、数据接受架构和设计模式

二、接受数据的源码解读

Spark Streaming不断持续的接收数据，具有Receiver的Spark 应用程序的考虑。

Receiver和Driver在不同进程，Receiver接收数据后要不断给Deriver汇报。

因为Driver负责调度，Receiver接收的数据如果不汇报给Deriver，Deriver调度时不会把接收的数据计算入调度系统中（如：数据ID，Block分片）。

思考Spark Streaming接收数据：

不断有循环器接收数据，接收数据要存储数据，将存储数据后需要汇报给Deriver，接收数据和存储数据不应该给同一个对象进行处理。

Spark Streaming接收数据从设计模式来讲是MVC的架构：

V：就是Driver

M：就是Receiver

C：就是ReceiverSupervisor

因为：

Receiver就是接收数据器，例如：可以从socketTextStream中获取数据。

ReceiverSupervisor就是存储数据的控制器，因为Receiver是通过ReceiverSupervisor来启动的，反过来讲Receiver在接收到数据后是通过ReceiverSupervisor来存储数据的。

然后将存储后的元数据汇报给Driver端。

V：就是Driver，操作元数据通过元数据指针，根据指针地址操作其他机器上具体数据内容，并将处理结果展示出来。

所以说：

Spark Streaming数据接收全生命周期可以看成是一个MVC模式，ReceiverSupervisor相当于是控制器（C），Receiver(M)、Driver（V）

源码分析：

1、Receiver类：

/**
 * :: DeveloperApi ::
 * Abstract class of a receiver that can be run on worker nodes to receive external data. A
 * custom receiver can be defined by defining the functions `onStart()` and `onStop()`. `onStart()`
 * should define the setup steps necessary to start receiving data,
 * and `onStop()` should define the cleanup steps necessary to stop receiving data.
 * Exceptions while receiving can be handled either by restarting the receiver with `restart(...)`
 * or stopped completely by `stop(...)` or
 *
 * A custom receiver in Scala would look like this.
 *
 * {{{
 *  class MyReceiver(storageLevel: StorageLevel) extends NetworkReceiver[String](storageLevel) {
 *      def onStart() {
 *          // Setup stuff (start threads, open sockets, etc.) to start receiving data.
 *          // Must start new thread to receive data, as onStart() must be non-blocking.
 *
 *          // Call store(...) in those threads to store received data into Spark's memory.
 *
 *          // Call stop(...), restart(...) or reportError(...) on any thread based on how
 *          // different errors needs to be handled.
 *
 *          // See corresponding method documentation for more details
 *      }
 *
 *      def onStop() {
 *          // Cleanup stuff (stop threads, close sockets, etc.) to stop receiving data.
 *      }
 *  }
 * }}}
 *
 * A custom receiver in Java would look like this.
 *
 * {{{
 * class MyReceiver extends Receiver<String> {
 *     public MyReceiver(StorageLevel storageLevel) {
 *         super(storageLevel);
 *     }
 *
 *     public void onStart() {
 *          // Setup stuff (start threads, open sockets, etc.) to start receiving data.
 *          // Must start new thread to receive data, as onStart() must be non-blocking.
 *
 *          // Call store(...) in those threads to store received data into Spark's memory.
 *
 *          // Call stop(...), restart(...) or reportError(...) on any thread based on how
 *          // different errors needs to be handled.
 *
 *          // See corresponding method documentation for more details
 *     }
 *
 *     public void onStop() {
 *          // Cleanup stuff (stop threads, close sockets, etc.) to stop receiving data.
 *     }
 * }
 * }}}
 */
@DeveloperApi
abstract class Receiver[T](val storageLevel: StorageLevel) extends Serializable {

2、ReceiverSupervisor类：

/**
 * Abstract class that is responsible for supervising a Receiver in the worker.
 * It provides all the necessary interfaces for handling the data received by the receiver.
 */
private[streaming] abstract class ReceiverSupervisor(
    receiver: Receiver[_],
    conf: SparkConf
  ) extends Logging {

ReceiverTracker发送一个个Job，每个Job有一个task，每个task中有一个ReceiverSupervisor，用于启动每个Receiver的，看ReceiverTracker的start方法：

/**
  * 管理receiver的：启动、执行、重新启动
  * 确定所有的输入流记录，有成员记录所有输入来源
  * 需要输入流，为每个输入流启动一个receiver
 * This class manages the execution of the receivers of ReceiverInputDStreams. Instance of
 * this class must be created after all input streams have been added and StreamingContext.start()
 * has been called because it needs the final set of input streams at the time of instantiation.
 *dirver端
 * @param skipReceiverLaunch Do not launch the receiver. This is useful for testing.
 */
private[streaming]
class ReceiverTracker(ssc: StreamingContext, skipReceiverLaunch: Boolean = false) extends Logging {

  private val receiverInputStreams = ssc.graph.getReceiverInputStreams()
  private val receiverInputStreamIds = receiverInputStreams.map { _.id }
  private val receivedBlockTracker = new ReceivedBlockTracker(
    ssc.sparkContext.conf,
    ssc.sparkContext.hadoopConfiguration,
    receiverInputStreamIds,
    ssc.scheduler.clock,
    ssc.isCheckpointPresent,
    Option(ssc.checkpointDir)
  )
  private val listenerBus = ssc.scheduler.listenerBus

  /** Enumeration to identify current state of the ReceiverTracker */
  object TrackerState extends Enumeration {
    type TrackerState = Value
    val Initialized, Started, Stopping, Stopped = Value
  }
  import TrackerState._

  /** State of the tracker. Protected by "trackerStateLock" */
  @volatile private var trackerState = Initialized

  // endpoint is created when generator starts.
  // This not being null means the tracker has been started and not stopped
  private var endpoint: RpcEndpointRef = null

  private val schedulingPolicy = new ReceiverSchedulingPolicy()

  // Track the active receiver job number. When a receiver job exits ultimately, countDown will
  // be called.
  private val receiverJobExitLatch = new CountDownLatch(receiverInputStreams.size)

  /**
   * Track all receivers' information. The key is the receiver id, the value is the receiver info.
   * It's only accessed in ReceiverTrackerEndpoint.
   */
  private val receiverTrackingInfos = new HashMap[Int, ReceiverTrackingInfo]

  /**
   * Store all preferred locations for all receivers. We need this information to schedule
   * receivers. It's only accessed in ReceiverTrackerEndpoint.
   */
  private val receiverPreferredLocations = new HashMap[Int, Option[String]]

  /** Start the endpoint and receiver execution thread. */
  def start(): Unit = synchronized {
    if (isTrackerStarted) {
      throw new SparkException("ReceiverTracker already started")
    }

    if (!receiverInputStreams.isEmpty) {
      endpoint = ssc.env.rpcEnv.setupEndpoint(
        "ReceiverTracker", new ReceiverTrackerEndpoint(ssc.env.rpcEnv))
      if (!skipReceiverLaunch) launchReceivers()
      logInfo("ReceiverTracker started")
      trackerState = Started
    }
  }
RDD中的元素必须要实现序列化，才能将RDD序列化传输给Executor端，Receiver就实现了Serializable接口，自定义的Receiver也必须实现Serializable接口。
@DeveloperApi
abstract class Receiver[T](val storageLevel: StorageLevel) extends Serializable {
处理Receiver接收到的数据，存储数据并汇报给Driver，Receiver是一条一条的接收数据的。

作用于rdd的function，内部就是一个个Receiver，代码里面需要启动的Receiver是谁，根据你输入的数据来源inputDStreams receiver，socketTextStream

相当于一个引用句柄socketReceiver，我们获得的Receiver是引用的描述，接收的数据其是下面的getReceiver产生的：

/**
 * Get the receivers from the ReceiverInputDStreams, distributes them to the
 * worker nodes as a parallel collection, and runs them.
 */
private def launchReceivers(): Unit = {
  val receivers = receiverInputStreams.map(nis => {
    val rcvr = nis.getReceiver()
    rcvr.setReceiverId(nis.id)
    rcvr
  })

  runDummySparkJob()

  logInfo("Starting " + receivers.length + " receivers")
  endpoint.send(StartAllReceivers(receivers))
}

private[streaming]
class ReceiverTracker(ssc: StreamingContext, skipReceiverLaunch: Boolean = false) extends Logging {
  private val receiverInputStreams = ssc.graph.getReceiverInputStreams()
  private val receiverInputStreamIds = receiverInputStreams.map { _.id }
  private val receivedBlockTracker = new ReceivedBlockTracker(
    ssc.sparkContext.conf,
    ssc.sparkContext.hadoopConfiguration,
    receiverInputStreamIds,
    ssc.scheduler.clock,
    ssc.isCheckpointPresent,
    Option(ssc.checkpointDir)
  )

private[streaming]
class SocketInputDStream[T: ClassTag](
    ssc_ : StreamingContext,
    host: String,
    port: Int,
    bytesToObjects: InputStream => Iterator[T],
    storageLevel: StorageLevel
  ) extends ReceiverInputDStream[T](ssc_) {

  def getReceiver(): Receiver[T] = {
    new SocketReceiver(host, port, bytesToObjects, storageLevel)
  }
}

private[streaming]
class SocketReceiver[T: ClassTag](
    host: String,
    port: Int,
    bytesToObjects: InputStream => Iterator[T],
    storageLevel: StorageLevel
  ) extends Receiver[T](storageLevel) with Logging {

  def onStart() {
    // Start the thread that receives data over a connection
    new Thread("Socket Receiver") {
      setDaemon(true)
      override def run() { receive() }
    }.start()
  }

如果Receiver RDD为空，则默认创建一个RDD,主要处理Receiver 接收到的数据，将接收数据给ReceiverSupervisor存储数据，并将元数据汇报给ReceiverTracker，Receiver 接收数据是一条条的，从抽象讲，是while循环获取一条条数据。接收数据，合并成buffer，放入block队列，在ReceiverSupervisorImpl启动会调用BlockGenerator对象的start方法。

override protected def onStart() {
  registeredBlockGenerators.foreach { _.start() }
}

/**
 * Generates batches of objects received by a
 * [[org.apache.spark.streaming.receiver.Receiver]] and puts them into appropriately
 * named blocks at regular intervals. This class starts two threads,
 * one to periodically start a new batch and prepare the previous batch of as a block,
 * the other to push the blocks into the block manager.
 *
 * Note: Do not create BlockGenerator instances directly inside receivers. Use
 * `ReceiverSupervisor.createBlockGenerator` to create a BlockGenerator and use it.
 */
private[streaming] class BlockGenerator(
    listener: BlockGeneratorListener,
    receiverId: Int,
    conf: SparkConf,
    clock: Clock = new SystemClock()
  ) extends RateLimiter(conf) with Logging {

  private case class Block(id: StreamBlockId, buffer: ArrayBuffer[Any])

  /**
   * The BlockGenerator can be in 5 possible states, in the order as follows.
   *
   *  - Initialized: Nothing has been started
   *  - Active: start() has been called, and it is generating blocks on added data.
   *  - StoppedAddingData: stop() has been called, the adding of data has been stopped,
   *                       but blocks are still being generated and pushed.
   *  - StoppedGeneratingBlocks: Generating of blocks has been stopped, but
   *                             they are still being pushed.
   *  - StoppedAll: Everything has stopped, and the BlockGenerator object can be GCed.
   */
  private object GeneratorState extends Enumeration {
    type GeneratorState = Value
    val Initialized, Active, StoppedAddingData, StoppedGeneratingBlocks, StoppedAll = Value
  }
  import GeneratorState._

  private val blockIntervalMs = conf.getTimeAsMs("spark.streaming.blockInterval", "200ms")
  require(blockIntervalMs > 0, s"'spark.streaming.blockInterval' should be a positive value")

  private val blockIntervalTimer =
    new RecurringTimer(clock, blockIntervalMs, updateCurrentBuffer, "BlockGenerator")
  private val blockQueueSize = conf.getInt("spark.streaming.blockQueueSize", 10)
  private val blocksForPushing = new ArrayBlockingQueue[Block](blockQueueSize)
  private val blockPushingThread = new Thread() { override def run() { keepPushingBlocks() } }

  @volatile private var currentBuffer = new ArrayBuffer[Any]
  @volatile private var state = Initialized

  /** Start block generating and pushing threads. */
  def start(): Unit = synchronized {
    if (state == Initialized) {
      state = Active
      blockIntervalTimer.start()
      blockPushingThread.start()
      logInfo("Started BlockGenerator")
    } else {
      throw new SparkException(
        s"Cannot start BlockGenerator as its not in the Initialized state [state = $state]")
    }
  }
BlockGenerator类是用来干什么的？从上述的源码注释可以说明该类来把一个Receiver接收到的数据合并到一个Block然后写入到BlockManager对象中。
该类内部有两个线程，一个是周期性把数据生成一批对象，然后把先前的一批数据封装成Block。另一个线程时把Block写入到BlockManager进行存储。

override def createBlockGenerator(
    blockGeneratorListener: BlockGeneratorListener): BlockGenerator = {
  // Cleanup BlockGenerators that have already been stopped
  registeredBlockGenerators --= registeredBlockGenerators.filter{ _.isStopped() }

  val newBlockGenerator = new BlockGenerator(blockGeneratorListener, streamId, env.conf)
  registeredBlockGenerators += newBlockGenerator
  newBlockGenerator
}

BlockGenerator类继承自ReateLimiter类，说明我们不能限定接收数据的速度，但是可以限定存储数据的速度，转过来就限定流动的速度。

BlockGenerator类有一个定时器（默认每200ms将接收到的数据合并成block）和一个线程(把block写入到BlockManager)，200ms会产生一个Block，即1秒钟生成5个Partition。太小则生成的数据片中数据太小，导致一个Task处理的数据少，性能差。实际经验得到不要低于50msprivate val blockIntervalMs = conf.getTimeAsMs("spark.streaming.blockInterval", "200ms")

require(blockIntervalMs > 0, s"'spark.streaming.blockInterval' should be a positive value")
private val blockIntervalTimer =
  new RecurringTimer(clock, blockIntervalMs, updateCurrentBuffer, "BlockGenerator")
private val blockQueueSize = conf.getInt("spark.streaming.blockQueueSize", 10)
private val blocksForPushing = new ArrayBlockingQueue[Block](blockQueueSize)
private val blockPushingThread = new Thread() { override def run() { keepPushingBlocks() } }

微信公众号：DT_Spark

博客：http://blog.sina.com.cn/ilovepains

手机：18610086859

QQ：1740415547

邮箱：18610086859@vip.126.com

Spark发行版笔记10

Spark Streaming源码解读之流数据不断接收和全生命周期彻底研究和思考的更多相关文章

Spark Streaming源码解读之流数据不断接收全生命周期彻底研究和思考
本期内容 : 数据接收架构设计模式数据接收源码彻底研究一.Spark Streaming数据接收设计模式 Spark Streaming接收数据也相似MVC架构: 1. Mode相当于Rece ...
Spark Streaming源码解读之Driver中ReceiverTracker架构设计以具体实现彻底研究
本期内容 : ReceiverTracker的架构设计消息循环系统 ReceiverTracker具体实现一. ReceiverTracker的架构设计 1. ReceiverTracker可以以 ...
Spark Streaming源码解读之JobScheduler内幕实现和深度思考
本期内容 : JobScheduler内幕实现 JobScheduler深度思考 JobScheduler 是整个Spark Streaming调度的核心,需要设置多线程,一条用于接收数据不断的循环, ...
15、Spark Streaming源码解读之No Receivers彻底思考
在前几期文章里讲了带Receiver的Spark Streaming 应用的相关源码解读,但是现在开发Spark Streaming的应用越来越多的采用No Receivers(Direct Appr ...
Spark Streaming源码解读之生成全生命周期彻底研究与思考
本期内容 : DStream与RDD关系彻底研究 Streaming中RDD的生成彻底研究问题的提出 : 1. RDD是怎么生成的,依靠什么生成 2.执行时是否与Spark Core上的RDD执行有 ...
9. Spark Streaming技术内幕 : Receiver在Driver的精妙实现全生命周期彻底研究和思考
原创文章,转载请注明:转载自听风居士博客(http://www.cnblogs.com/zhouyf/) Spark streaming 程序需要不断接收新数据,然后进行业务逻辑 ...
16.Spark Streaming源码解读之数据清理机制解析
原创文章,转载请注明:转载自听风居士博客(http://www.cnblogs.com/zhouyf/) 本期内容: 一.Spark Streaming 数据清理总览二.Spark Streami ...
Spark Streaming源码解读之数据清理内幕彻底解密
本期内容 : Spark Streaming数据清理原理和现象 Spark Streaming数据清理代码解析 Spark Streaming一直在运行的,在计算的过程中会不断的产生RDD ,如每秒钟 ...
Spark Streaming源码解读之Receiver生成全生命周期彻底研究和思考
本期内容 : Receiver启动的方式设想 Receiver启动源码彻底分析多个输入源输入启动,Receiver启动失败,只要我们的集群存在就希望Receiver启动成功,运行过程中基于每个Tea ...

随机推荐

ASOP编译说明
具体说明https://source.android.com/source/ 源码下载https://mirrors.tuna.tsinghua.edu.cn/help/AOSP/ 1 搭建编译环境使 ...
SpringBoot学习：整合Mybatis，使用HikariCP超高性能数据源
一.添加pom依赖jar包:  <dependency> <groupId>org.mybatis.spring.boot</ ...
Nginx开启跨域访问
CORS on Nginx The following Nginx configuration enables CORS, with support for preflight requests. # ...
30、Flask实战第30天：cms模版抽离和个人信息页面完成
cms模版抽离新建一个cms_base.html文件作为基础模板,把cms_index.html的内容拷贝到cms_base.html中. 编辑 cms_base.html,把在不同页面会变动的部分 ...
sed 插入和替换
sed -i '/参考行/i\插入内容' *.ksh sed -i 's,原内容,替换后内容,g' *.ksh
Xamarin中VS无法连接Mac系统的解决办法
Xamarin中VS无法连接Mac系统的解决办法按照以下步骤排查:(1)确认Mac系统中安装Xamarin.iOS开发必备的组件,如Mono.Xamarin.iOS.(2)将Windows和Mac下 ...
JZYZOJ1372 [noi2002]荒岛野人扩展欧几里得
http://172.20.6.3/Problem_Show.asp?id=1372 想法其实很好想,但是我扩展欧几里得还是用得不熟练,几乎是硬套模板,大概因为今天一个下午状态都不大好.扩展欧几里得算 ...
[xsy2282]cake
题意:一个$n\times n$的有标号点阵,现在用一条直线把它们分成两部分,问有多少种不同的分法结论:方案数就是以点阵上的点为端点且不经过第三个点的线段数对一个满足要求的线段,将其绕中点顺时针转 ...
【扫描线】Gym - 101190E - Expect to Wait
假设初始人数为0, 将每个时刻在等待的人数写下来,就是求个和. 如果纵坐标看成人数,横坐标看成时间,就是求个面积. 因为初始人数不一定为零,所以离线后扫描线即可回答所有询问. #include< ...
【搜索】bzoj3109 [cqoi2013]新数独
搜索,没什么好说的.要注意读入. Code: #include<cstdio> #include<cstdlib> using namespace std; ][]= {{,, ...

Spark Streaming源码解读之流数据不断接收和全生命周期彻底研究和思考

Spark Streaming源码解读之流数据不断接收和全生命周期彻底研究和思考的更多相关文章

随机推荐

热门专题