spark streaming 5: InputDStream

InputDStream的继承关系。他们都是使用InputDStream这个抽象类的接口进行操作的。特别注意ReceiverInputDStream这个类，大部分时候我们使用的是它作为扩展的基类，因为它才能（更容易）使接收数据的工作分散到各个worker上执行，更符合分布式计算的理念。

所有的输入流都某个时间间隔将数据以block的形式保存到spark memory中，但以spark core不同的是，spark streaming默认是将对象序列化后保存到内存中。

/**
 * This is the abstract base class for all input streams. This class provides methods
 * start() and stop() which is called by Spark Streaming system to start and stop receiving data.
 * Input streams that can generate RDDs from new data by running a service/thread only on
 * the driver node (that is, without running a receiver on worker nodes), can be
 * implemented by directly inheriting this InputDStream. For example,
 * FileInputDStream, a subclass of InputDStream, monitors a HDFS directory from the driver for
 * new files and generates RDDs with the new files. For implementing input streams
 * that requires running a receiver on the worker nodes, use
 * [[org.apache.spark.streaming.dstream.ReceiverInputDStream]] as the parent class.
 *
 * @param ssc_ Streaming context that will execute this input stream
 */
abstract class InputDStream[T: ClassTag] (@transient ssc_ : StreamingContext)
  extends DStream[T](ssc_) {

  private[streaming] var lastValidTime: Time = null

  ssc.graph.addInputStream(this)

/**
 * Abstract class for defining any [[org.apache.spark.streaming.dstream.InputDStream]]
 * that has to start a receiver on worker nodes to receive external data.
 * Specific implementations of NetworkInputDStream must
 * define `the getReceiver()` function that gets the receiver object of type
 * [[org.apache.spark.streaming.receiver.Receiver]] that will be sent
 * to the workers to receive data.
 * @param ssc_ Streaming context that will execute this input stream
 * @tparam T Class type of the object of this stream
 */
abstract class ReceiverInputDStream[T: ClassTag](@transient ssc_ : StreamingContext)
  extends InputDStream[T](ssc_) {

  /** Keeps all received blocks information */
  private lazy val receivedBlockInfo = new HashMap[Time, Array[ReceivedBlockInfo]]

  /** This is an unique identifier for the network input stream. */
  val id = ssc.getNewReceiverStreamId()

/**
 * Gets the receiver object that will be sent to the worker nodes
 * to receive data. This method needs to defined by any specific implementation
 * of a NetworkInputDStream.
 */
def getReceiver(): Receiver[T]

最终都是以BlockRDD返回的

/** Ask ReceiverInputTracker for received data blocks and generates RDDs with them. */
override def compute(validTime: Time): Option[RDD[T]] = {
  // If this is called for any time before the start time of the context,
  // then this returns an empty RDD. This may happen when recovering from a
  // master failure
  if (validTime >= graph.startTime) {
    val blockInfo = ssc.scheduler.receiverTracker.getReceivedBlockInfo(id)
    receivedBlockInfo(validTime) = blockInfo
    val blockIds = blockInfo.map(_.blockId.asInstanceOf[BlockId])
    Some(new BlockRDD[T](ssc.sc, blockIds))
  } else {
    Some(new BlockRDD[T](ssc.sc, Array[BlockId]()))
  }
}

From WizNote

spark streaming 5: InputDStream的更多相关文章

Spark Streaming源码分析 – InputDStream
对于NetworkInputDStream而言,其实不是真正的流方式,将数据读出来后不是直接去处理,而是先写到blocks中,后面的RDD再从blocks中读取数据继续处理这就是一个将stream离散 ...
Spark Streaming消费Kafka Direct方式数据零丢失实现
使用场景 Spark Streaming实时消费kafka数据的时候,程序停止或者Kafka节点挂掉会导致数据丢失,Spark Streaming也没有设置CheckPoint(据说比较鸡肋,虽然可以 ...
spark streaming kafka1.4.1中的低阶api createDirectStream使用总结
转载:http://blog.csdn.net/ligt0610/article/details/47311771 由于目前每天需要从kafka中消费20亿条左右的消息,集群压力有点大,会导致job不 ...
spark streaming集成kafka
Kakfa起初是由LinkedIn公司开发的一个分布式的消息系统,后成为Apache的一部分,它使用Scala编写,以可水平扩展和高吞吐率而被广泛使用.目前越来越多的开源分布式处理系统如Clouder ...
spark streaming从指定offset处消费Kafka数据
spark streaming从指定offset处消费Kafka数据 -- : 770人阅读评论() 收藏举报分类: spark() 原文地址:http://blog.csdn.net/high ...
Spark streaming + Kafka 流式数据处理，结果存储至MongoDB、Solr、Neo4j（自用）
KafkaStreaming.scala文件 import kafka.serializer.StringDecoder import org.apache.spark.SparkConf impor ...
spark streaming 接收kafka消息之一 -- 两种接收方式
源码分析的spark版本是1.6. 首先,先看一下 org.apache.spark.streaming.dstream.InputDStream 的类说明: This is the abstrac ...
Spark Streaming消费Kafka Direct保存offset到Redis，实现数据零丢失和exactly once
一.概述上次写这篇文章文章的时候,Spark还是1.x,kafka还是0.8x版本,转眼间spark到了2.x,kafka也到了2.x,存储offset的方式也发生了改变,笔者根据上篇文章和网上文章 ...
Error- Overloaded method value createDirectStream in error Spark Streaming打包报错
直接上代码 StreamingExamples.setStreamingLogLevels() val Array(brokers, topics) = args // Create context ...

随机推荐

Java--java.util.stream.Collectors文档实例
// java.util.stream.Collectors 类的主要作用就是辅助进行各类有用的 reduction 操作,例如转变输出为 Collection,把 Stream 元素进行归组. pu ...
css and canvas实现圆形进度条
进度条效果: 话不多说,上代码使用css动画实现,看到一篇博客的启发,稍微修改了下, css实现的原理是用两个半圆一开始隐藏,再分别旋转180度,最后成为一个整圆半圆效果,一开始右边的半圆在盒 ...
vue项目默认IE以最高级别打开
只需要在index.html加入 <meta http-equiv="X-UA-Compatible" content="IE=Edge">
oracle中行转列操作
数据准备阶段: CREATE TABLE CC (Student NVARCHAR2(2),Course NVARCHAR2(2),Score INT); INSERT into CC sele ...
视频大文件分片上传（使用webuploader插件）
背景公司做网盘系统,一直在调用图片服务器的接口上传图片,以前写的,以为简单改一改就可以用最初要求 php 上传多种视频格式,支持大文件,并可以封面截图,时长统计问题 1.上传到阿里云服务器,13 ...
markdown实现点击链接下载文件
今天用Markdown工具,需要实现一个点连接下载文件的功能,看起来很多简单我也没多想就直接写了,并且单个页面测试的时候也挺正常,就发布了,但是发布后使用的时候发现问题了,浏览器中直接点击链接没反应, ...
bootstrap和JS实现下拉菜单
// bootstrap下拉菜单 <div class="btn-group"> <button id="button_text" type= ...
rank 和 ROW_NUMBER 区别
SELECT * , RANK() OVER ( PARTITION BY APP_NAME ORDER BY SETTING_NAME,SETTING_CODE ASC ) AS Rank FROM ...
ios系统保存校园网密码
相信ios用户每次登陆时无法保存必须要重新输入账号密码的问题困扰了很多同学,特别是苹果5用户(不要问为什么,屏幕本来就小) 现在我们就一起想办法来解决它吧! 首先,我们进入设置->Safari浏 ...
ALIENTEK 战舰ENC28J60 LWIP和UIP补充例程（LWIP WEB有惊喜）
前面的话:自从接触网络模块,到现在有一阵子时间了,未来必定是网络的世界.学一些网络方面的知识是有必要的.我们ALINTEK 推出的ENC28J60网络模块块作为入门还是不错的.详细见此贴:http:/ ...

spark streaming 5: InputDStream

spark streaming 5: InputDStream的更多相关文章

随机推荐

热门专题