spark streaming 1: SparkContex

StreamingContext 和SparkContex的用途是差不多的，作为spark stream的入口，提供配置、生成DStream等功能。

总体来看，spark stream包括如下模块：

/**
 * Main entry point for Spark Streaming functionality. It provides methods used to create
 * [[org.apache.spark.streaming.dstream.DStream]]s from various input sources. It can be either
 * created by providing a Spark master URL and an appName, or from a org.apache.spark.SparkConf
 * configuration (see core Spark documentation), or from an existing org.apache.spark.SparkContext.
 * The associated SparkContext can be accessed using `context.sparkContext`. After
 * creating and transforming DStreams, the streaming computation can be started and stopped
 * using `context.start()` and `context.stop()`, respectively.
 * `context.awaitTransformation()` allows the current thread to wait for the termination
 * of the context by `stop()` or by an exception.
 */
class StreamingContext private[streaming] (
    sc_ : SparkContext,
    cp_ : Checkpoint,
    batchDur_ : Duration
  ) extends Logging {

重要的属性：DStreamGraph有点像简洁版的DAG scheduler，负责根据某个时间间隔生成一序列JobSet，以及按照依赖关系序列化

private[streaming] val graph: DStreamGraph = {
  if (isCheckpointPresent) {
    cp_.graph.setContext(this)
    cp_.graph.restoreCheckpointData()
    cp_.graph
  } else {
    assert(batchDur_ != null, "Batch duration for streaming context cannot be null")
    val newGraph = new DStreamGraph()
    newGraph.setBatchDuration(batchDur_)
    newGraph
  }
}

private val nextReceiverInputStreamId = new AtomicInteger(0)

JobScheduler

private[streaming] val scheduler = new JobScheduler(this)

private[streaming] val waiter = new ContextWaiter

private[streaming] val progressListener = new StreamingJobProgressListener(this)

添加一个listener，他们会处理InputStream输入的数据

/** Add a [[org.apache.spark.streaming.scheduler.StreamingListener]] object for
  * receiving system events related to streaming.
  */
def addStreamingListener(streamingListener: StreamingListener) {
  scheduler.listenerBus.addListener(streamingListener)
}

启动调度器

/**
 * Start the execution of the streams.
 *
 * @throws SparkException if the context has already been started or stopped.
 */
def start(): Unit = synchronized {
  if (state == Started) {
    throw new SparkException("StreamingContext has already been started")
  }
  if (state == Stopped) {
    throw new SparkException("StreamingContext has already been stopped")
  }
  validate()
  sparkContext.setCallSite(DStream.getCreationSite())
  scheduler.start()
  state = Started
}

各种InputStream，后续再细看。

class PluggableInputDStream[T: ClassTag](
  @transient ssc_ : StreamingContext,
  receiver: Receiver[T]) extends ReceiverInputDStream[T](ssc_) {

  def getReceiver(): Receiver[T] = {
    receiver
  }
}

class SocketInputDStream[T: ClassTag](
    @transient ssc_ : StreamingContext,
    host: String,
    port: Int,
    bytesToObjects: InputStream => Iterator[T],
    storageLevel: StorageLevel
  ) extends ReceiverInputDStream[T](ssc_) {

  def getReceiver(): Receiver[T] = {
    new SocketReceiver(host, port, bytesToObjects, storageLevel)
  }
}

/**
 * An input stream that reads blocks of serialized objects from a given network address.
 * The blocks will be inserted directly into the block store. This is the fastest way to get
 * data into Spark Streaming, though it requires the sender to batch data and serialize it
 * in the format that the system is configured with.
 */
private[streaming]
class RawInputDStream[T: ClassTag](
    @transient ssc_ : StreamingContext,
    host: String,
    port: Int,
    storageLevel: StorageLevel
  ) extends ReceiverInputDStream[T](ssc_ ) with Logging {

  def getReceiver(): Receiver[T] = {
    new RawNetworkReceiver(host, port, storageLevel).asInstanceOf[Receiver[T]]
  }
}

/**
 * Create a input stream that monitors a Hadoop-compatible filesystem
 * for new files and reads them using the given key-value types and input format.
 * Files must be written to the monitored directory by "moving" them from another
 * location within the same file system. File names starting with . are ignored.
 * @param directory HDFS directory to monitor for new file
 * @tparam K Key type for reading HDFS file
 * @tparam V Value type for reading HDFS file
 * @tparam F Input format for reading HDFS file
 */
def fileStream[
  K: ClassTag,
  V: ClassTag,
  F <: NewInputFormat[K, V]: ClassTag
] (directory: String): InputDStream[(K, V)] = {
  new FileInputDStream[K, V, F](this, directory)
}

private[streaming]
class TransformedDStream[U: ClassTag] (
    parents: Seq[DStream[_]],
    transformFunc: (Seq[RDD[_]], Time) => RDD[U]
  ) extends DStream[U](parents.head.ssc) {

  require(parents.length > 0, "List of DStreams to transform is empty")
  require(parents.map(_.ssc).distinct.size == 1, "Some of the DStreams have different contexts")
  require(parents.map(_.slideDuration).distinct.size == 1,
    "Some of the DStreams have different slide durations")

  override def dependencies = parents.toList

  override def slideDuration: Duration = parents.head.slideDuration

  override def compute(validTime: Time): Option[RDD[U]] = {
    val parentRDDs = parents.map(_.getOrCompute(validTime).orNull).toSeq
    Some(transformFunc(parentRDDs, validTime))
  }
}

From WizNote

spark streaming 1: SparkContex的更多相关文章

Spark踩坑记——Spark Streaming+Kafka
[TOC] 前言在WeTest舆情项目中,需要对每天千万级的游戏评论信息进行词频统计,在生产者一端,我们将数据按照每天的拉取时间存入了Kafka当中,而在消费者一端,我们利用了spark strea ...
Spark Streaming+Kafka
Spark Streaming+Kafka 前言在WeTest舆情项目中,需要对每天千万级的游戏评论信息进行词频统计,在生产者一端,我们将数据按照每天的拉取时间存入了Kafka当中,而在消费者一端, ...
Storm介绍及与Spark Streaming对比
Storm介绍 Storm是由Twitter开源的分布式.高容错的实时处理系统,它的出现令持续不断的流计算变得容易,弥补了Hadoop批处理所不能满足的实时要求.Storm常用于在实时分析.在线机器学 ...
flume+kafka+spark streaming整合
1.安装好flume2.安装好kafka3.安装好spark4.流程说明: 日志文件->flume->kafka->spark streaming flume输入:文件 flume输 ...
spark streaming kafka example
// scalastyle:off println package org.apache.spark.examples.streaming import kafka.serializer.String ...
Spark Streaming中动态Batch Size实现初探
本期内容 : BatchDuration与 Process Time 动态Batch Size Spark Streaming中有很多算子,是否每一个算子都是预期中的类似线性规律的时间消耗呢? 例如: ...
Spark Streaming源码解读之No Receivers彻底思考
本期内容 : Direct Acess Kafka Spark Streaming接收数据现在支持的两种方式: 01. Receiver的方式来接收数据,及输入数据的控制 02. No Receive ...
Spark Streaming架构设计和运行机制总结
本期内容 : Spark Streaming中的架构设计和运行机制 Spark Streaming深度思考 Spark Streaming的本质就是在RDD基础之上加上Time ,由Time不断的运行 ...
Spark Streaming中空RDD处理及流处理程序优雅的停止
本期内容 : Spark Streaming中的空RDD处理 Spark Streaming程序的停止由于Spark Streaming的每个BatchDuration都会不断的产生RDD,空RDD ...

随机推荐

098、Swarm 如何实现 Failover （Swarm05）
参考https://www.cnblogs.com/CloudMan6/p/7898245.html 故障是在所难免的,容器可能崩溃,Docker Host 可能宕机,不过幸运的是,Swarm 已 ...
网速监控-nload
用来监控系统网卡实时网速的. 安装 yum install nload -y # 或 apt install nload -y 使用 # 直接运行默认监控第一个网卡, 使用上下方向键来切换网卡. nl ...
js鼠标点击特效，有关参数设置
效果图,用的faststone--录像--togif,黄色圆圈实际是不显示的博客后台管理设置本地新建一个demo.html文件,可以自行测试,要引入jquery文件哦来个“红橙黄绿蓝靛紫”的点击 ...
python cv2读取rtsp实时码流按时生成连续视频文件
代码实现 # coding: utf-8 import datetime import cv2 import os ip = '192.168.3.160'.replace("." ...
2019-2020-1 20199319《Linux内核原理与分析》第五周作业
系统调用的三层机制(上) 基础知识 1.通过库函数的方式进行系统调用,库函数用来把系统调用给封装起来. 2.CPU有四种不同的执行级别:0.1.2.3,数字越小,特权越高.Linux操作系统中采用了0 ...
Backtracking（一）
LeetCode中涉及到回溯的题目有通用的解题套路: 46. permutations 这一类回溯题目中的基础中的基础,无重复数字枚举: /* Given a collection of distin ...
uestc summer training #3 线段树优化建边
线段树建边 struct E { int value, modvalue; } a[MAXN << ]; pair<int, int> b[MAXN]; ], r[MAXN & ...
java基础笔试题一
1.Vector和ArrayList的区别答:Vector的方法都是同步的(Synchronized),是线程安全的(thread-safe),而ArrayList的方法不是,由于线程的同步必然要影 ...
在神经网络中weight decay
weight decay(权值衰减)的最终目的是防止过拟合.在损失函数中,weight decay是放在正则项(regularization)前面的一个系数,正则项一般指示模型的复杂度,所以weigh ...
vue 中监听窗口发生变化，触发监听事件， window.onresize && window.addEventListener('resize'，fn) ，window.onresize无效的处理方式
// 开始这样写,不执行 window.onresize = function() { console.log('窗口发生变化') } // 改成window监听事件 window.addEventL ...

spark streaming 1: SparkContex

spark streaming 1: SparkContex的更多相关文章

随机推荐

热门专题