DStream

1.1 Basics

1.1.1 Duration

Duration is Spark Streaming's time type; its unit is the millisecond.

A Duration can be created in the following ways:

1) new Duration(milliseconds)

Create it from a value in milliseconds.

2) Seconds(seconds)

Create it from a value in seconds.

3) Minutes(minutes)

Create it from a value in minutes.
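A minimal usage sketch of these constructors (the application name, master URL, and the 5-second batch interval below are illustrative assumptions, not part of the source):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Duration, Minutes, Seconds, StreamingContext}

val conf = new SparkConf().setAppName("DStreamDemo").setMaster("local[2]")

// Three equivalent ways of expressing one minute.
val d1: Duration = new Duration(60000)   // milliseconds
val d2: Duration = Seconds(60)           // seconds
val d3: Duration = Minutes(1)            // minutes

// The batch interval of a StreamingContext is itself a Duration.
val ssc = new StreamingContext(conf, Seconds(5))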

1.1.2 slideDuration

/** Time interval after which the DStream generates a RDD */
def slideDuration: Duration

slideDuration is the sliding interval of the DStream's time window; the DStream generates a new RDD every slideDuration.

1.1.3 dependencies

/** List of parent DStreams on which this DStream depends on */
def dependencies: List[DStream[_]]

dependencies is the list of parent DStreams on which this DStream depends.

1.1.4 compute

/** Method that generates a RDD for the given time */
def compute(validTime: Time): Option[RDD[T]]

compute generates the RDD for a given time.

1.1.5 zeroTime

// Time zero for the DStream
private[streaming] var zeroTime: Time = null

zeroTime is the start ("zero") time of the DStream.

1.1.6 rememberDuration

// Duration for which the DStream will remember each RDD created
private[streaming] var rememberDuration: Duration = null

rememberDuration is the length of time for which the DStream remembers (retains) each RDD it creates.

1.1.7 storageLevel

// Storage level of the RDDs in the stream
private[streaming] var storageLevel: StorageLevel = StorageLevel.NONE

storageLevel is the storage level of the RDDs in this DStream.

1.1.8 parentRememberDuration

// Duration for which the DStream requires its parent DStream to remember each RDD created
private[streaming] def parentRememberDuration = rememberDuration

parentRememberDuration is the length of time for which this DStream requires its parent DStreams to remember each RDD they create; by default it equals rememberDuration.

1.1.9 persist

/** Persist the RDDs of this DStream with the given storage level */
def persist(level: StorageLevel): DStream[T] = {
  if (this.isInitialized) {
    throw new UnsupportedOperationException(
      "Cannot change storage level of a DStream after streaming context has started")
  }
  this.storageLevel = level
  this
}

persist sets the storage level of the RDDs in this DStream; it can only be called before the streaming context has started.
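A short usage sketch, reusing the `ssc` from the earlier sketch; the socket source and the chosen storage level are illustrative assumptions:

import org.apache.spark.storage.StorageLevel

// An assumed input stream for the examples that follow.
val lines = ssc.socketTextStream("localhost", 9999)

// Cache each generated RDD as serialized objects in memory.
lines.persist(StorageLevel.MEMORY_ONLY_SER)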

1.1.10 checkpoint

/**
 * Enable periodic checkpointing of RDDs of this DStream
 * @param interval Time interval after which generated RDD will be checkpointed
 */
def checkpoint(interval: Duration): DStream[T] = {
  if (isInitialized) {
    throw new UnsupportedOperationException(
      "Cannot change checkpoint interval of a DStream after streaming context has started")
  }
  persist()
  checkpointDuration = interval
  this
}

checkpoint enables periodic checkpointing of this DStream's RDDs at the given interval; like persist, it must be called before the streaming context has started.
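A sketch of enabling checkpointing for the `lines` stream above; the checkpoint directory and the 30-second interval are illustrative assumptions (the interval should be a multiple of the batch interval):

import org.apache.spark.streaming.Seconds

// A checkpoint directory must be set on the context first (hypothetical path).
ssc.checkpoint("/tmp/streaming-checkpoint")

// Checkpoint the RDDs of `lines` every 30 seconds.
lines.checkpoint(Seconds(30))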

1.1.11 initialize

/**
 * Initialize the DStream by setting the "zero" time, based on which
 * the validity of future times is calculated. This method also recursively initializes
 * its parent DStreams.
 */
private[streaming] def initialize(time: Time) {
  if (zeroTime != null && zeroTime != time) {
    throw new SparkException(s"ZeroTime is already initialized to $zeroTime"
      + s", cannot initialize it again to $time")
  }
  zeroTime = time

  // Set the checkpoint interval to be slideDuration or 10 seconds, which ever is larger
  if (mustCheckpoint && checkpointDuration == null) {
    checkpointDuration = slideDuration * math.ceil(Seconds(10) / slideDuration).toInt
    logInfo(s"Checkpoint interval automatically set to $checkpointDuration")
  }

  // Set the minimum value of the rememberDuration if not already set
  var minRememberDuration = slideDuration
  if (checkpointDuration != null && minRememberDuration <= checkpointDuration) {
    // times 2 just to be sure that the latest checkpoint is not forgotten (#paranoia)
    minRememberDuration = checkpointDuration * 2
  }
  if (rememberDuration == null || rememberDuration < minRememberDuration) {
    rememberDuration = minRememberDuration
  }

  // Initialize the dependencies
  dependencies.foreach(_.initialize(zeroTime))
}

initialize sets the DStream's "zero" time, based on which the validity of future times is calculated, and recursively initializes its parent DStreams; it also fills in default checkpoint and remember durations.

1.1.12 getOrCompute

/**
 * Get the RDD corresponding to the given time; either retrieve it from cache
 * or compute-and-cache it.
 */
private[streaming] final def getOrCompute(time: Time): Option[RDD[T]] = {

getOrCompute returns the RDD corresponding to the given time, either retrieved from the cache or computed and then cached.

1.1.13 generateJob

/**
 * Generate a SparkStreaming job for the given time. This is an internal method that
 * should not be called directly. This default implementation creates a job
 * that materializes the corresponding RDD. Subclasses of DStream may override this
 * to generate their own jobs.
 */
private[streaming] def generateJob(time: Time): Option[Job] = {
  getOrCompute(time) match {
    case Some(rdd) =>
      val jobFunc = () => {
        val emptyFunc = { (iterator: Iterator[T]) => {} }
        context.sparkContext.runJob(rdd, emptyFunc)
      }
      Some(new Job(time, jobFunc))
    case None => None
  }
}

generateJob is an internal method that generates a Spark Streaming job for the given time; the default implementation simply materializes the corresponding RDD.

1.1.14 clearMetadata

/**
 * Clear metadata that are older than `rememberDuration` of this DStream.
 * This is an internal method that should not be called directly. This default
 * implementation clears the old generated RDDs. Subclasses of DStream may override
 * this to clear their own metadata along with the generated RDDs.
 */
private[streaming] def clearMetadata(time: Time) {

clearMetadata is an internal method that clears metadata (by default, the generated RDDs) older than this DStream's rememberDuration.

1.1.15 updateCheckpointData

/**
 * Refresh the list of checkpointed RDDs that will be saved along with checkpoint of
 * this stream. This is an internal method that should not be called directly. This is
 * a default implementation that saves only the file names of the checkpointed RDDs to
 * checkpointData. Subclasses of DStream (especially those of InputDStream) may override
 * this method to save custom checkpoint data.
 */
private[streaming] def updateCheckpointData(currentTime: Time) {

updateCheckpointData is an internal method that refreshes the list of checkpointed RDDs saved along with this stream's checkpoint.

1.2 Basic DStream Operations

1.2.1 map

/** Return a new DStream by applying a function to all elements of this DStream. */
def map[U: ClassTag](mapFunc: T => U): DStream[U] = {
  new MappedDStream(this, context.sparkContext.clean(mapFunc))
}

map applies a function to every element of the DStream, exactly like map on an RDD.
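For example, assuming the `lines: DStream[String]` stream from the persist sketch:

// One length value per input line, computed batch by batch.
val lineLengths = lines.map(line => line.length)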

1.2.2 flatMap

/**
 * Return a new DStream by applying a function to all elements of this DStream,
 * and then flattening the results
 */
def flatMap[U: ClassTag](flatMapFunc: T => Traversable[U]): DStream[U] = {
  new FlatMappedDStream(this, context.sparkContext.clean(flatMapFunc))
}

flatMap applies a function to every element of the DStream and flattens the results, just like flatMap on an RDD.
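For example, splitting the assumed `lines` stream into words:

// Split each line into words; the results of all lines are flattened
// into a single DStream[String] of words.
val words = lines.flatMap(_.split(" "))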

1.2.3 filter

/** Return a new DStream containing only the elements that satisfy a predicate. */
def filter(filterFunc: T => Boolean): DStream[T] = new FilteredDStream(this, filterFunc)

filter keeps only the elements of the DStream that satisfy a predicate, just like filter on an RDD.

1.2.4 glom

/**
 * Return a new DStream in which each RDD is generated by applying glom() to each RDD of
 * this DStream. Applying glom() to an RDD coalesces all elements within each partition into
 * an array.
 */
def glom(): DStream[Array[T]] = new GlommedDStream(this)

glom coalesces all elements within each partition of the DStream's RDDs into an array, just like glom on an RDD.

1.2.5 repartition

/**
 * Return a new DStream with an increased or decreased level of parallelism. Each RDD in the
 * returned DStream has exactly numPartitions partitions.
 */
def repartition(numPartitions: Int): DStream[T] = this.transform(_.repartition(numPartitions))

repartition changes the number of partitions of each RDD in the DStream, just like repartition on an RDD.

1.2.6 mapPartitions

/**
 * Return a new DStream in which each RDD is generated by applying mapPartitions() to each RDDs
 * of this DStream. Applying mapPartitions() to an RDD applies a function to each partition
 * of the RDD.
 */
def mapPartitions[U: ClassTag](
    mapPartFunc: Iterator[T] => Iterator[U],
    preservePartitioning: Boolean = false
  ): DStream[U] = {
  new MapPartitionedDStream(this, context.sparkContext.clean(mapPartFunc), preservePartitioning)
}

mapPartitions applies mapPartitions() to each RDD of the DStream, just like the RDD operation.

1.2.7 reduce

/**
 * Return a new DStream in which each RDD has a single element generated by reducing each RDD
 * of this DStream.
 */
def reduce(reduceFunc: (T, T) => T): DStream[T] =
  this.map(x => (null, x)).reduceByKey(reduceFunc, 1).map(_._2)

reduce reduces each RDD of the DStream to a single element, just like reduce on an RDD.
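For example, keeping the longest word of each batch from the assumed `words` stream:

// Longest word in each batch (one-element RDD per batch).
val longest = words.reduce((a, b) => if (a.length >= b.length) a else b)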

1.2.8 count

/**
 * Return a new DStream in which each RDD has a single element generated by counting each RDD
 * of this DStream.
 */
def count(): DStream[Long] = {
  this.map(_ => (null, 1L))
    .transform(_.union(context.sparkContext.makeRDD(Seq((null, 0L)), 1)))
    .reduceByKey(_ + _)
    .map(_._2)
}

count counts the elements of each RDD in the DStream, just like count on an RDD.

1.2.9 countByValue

/**
 * Return a new DStream in which each RDD contains the counts of each distinct value in
 * each RDD of this DStream. Hash partitioning is used to generate
 * the RDDs with `numPartitions` partitions (Spark's default number of partitions if
 * `numPartitions` not specified).
 */
def countByValue(numPartitions: Int = ssc.sc.defaultParallelism)(implicit ord: Ordering[T] = null)
  : DStream[(T, Long)] =
  this.map(x => (x, 1L)).reduceByKey((x: Long, y: Long) => x + y, numPartitions)

countByValue counts the occurrences of each distinct value in each RDD of the DStream, just like countByValue on an RDD.

1.2.10 foreachRDD

/**
 * Apply a function to each RDD in this DStream. This is an output operator, so
 * 'this' DStream will be registered as an output stream and therefore materialized.
 */
def foreachRDD(foreachFunc: (RDD[T], Time) => Unit) {
  // because the DStream is reachable from the outer object here, and because
  // DStreams can't be serialized with closures, we can't proactively check
  // it for serializability and so we pass the optional false to SparkContext.clean
  new ForEachDStream(this, context.sparkContext.clean(foreachFunc, false)).register()
}

foreachRDD applies a function to each RDD in the DStream; it is an output operation, so the DStream is registered as an output stream and materialized.
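A sketch that inspects a small sample of each batch on the driver, assuming the `words` stream from above:

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Time

// The function runs on the driver once per batch.
words.foreachRDD { (rdd: RDD[String], time: Time) =>
  val sample = rdd.take(5)
  println(s"Batch at $time: ${sample.mkString(", ")}")
}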

1.2.11 transform

/**
 * Return a new DStream in which each RDD is generated by applying a function
 * on each RDD of 'this' DStream.
 */
def transform[U: ClassTag](transformFunc: RDD[T] => RDD[U]): DStream[U] = {
  // because the DStream is reachable from the outer object here, and because
  // DStreams can't be serialized with closures, we can't proactively check
  // it for serializability and so we pass the optional false to SparkContext.clean
  transform((r: RDD[T], t: Time) => context.sparkContext.clean(transformFunc(r), false))
}

transform generates each RDD of the new DStream by applying an arbitrary RDD-to-RDD function to each RDD of this DStream.
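For example, de-duplicating the assumed `words` stream within each batch:

// Apply an arbitrary RDD-to-RDD operation to every batch.
val distinctWords = words.transform(rdd => rdd.distinct())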

1.2.12 transformWith

/**
 * Return a new DStream in which each RDD is generated by applying a function
 * on each RDD of 'this' DStream and 'other' DStream.
 */
def transformWith[U: ClassTag, V: ClassTag](
    other: DStream[U], transformFunc: (RDD[T], RDD[U]) => RDD[V]
  ): DStream[V] = {
  // because the DStream is reachable from the outer object here, and because
  // DStreams can't be serialized with closures, we can't proactively check
  // it for serializability and so we pass the optional false to SparkContext.clean
  val cleanedF = ssc.sparkContext.clean(transformFunc, false)
  transformWith(other, (rdd1: RDD[T], rdd2: RDD[U], time: Time) => cleanedF(rdd1, rdd2))
}

transformWith generates each RDD of the new DStream by applying a function to the corresponding RDDs of this DStream and another DStream.
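A sketch that combines two streams at the RDD level; the second socket source is an illustrative assumption:

import org.apache.spark.rdd.RDD

// Hypothetical second input stream.
val otherLines = ssc.socketTextStream("localhost", 9998)

// Per batch, intersect the RDD of `lines` with the RDD of `otherLines`.
val common = lines.transformWith(otherLines,
  (rdd1: RDD[String], rdd2: RDD[String]) => rdd1.intersection(rdd2))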

1.2.13 print

/**
 * Print the first ten elements of each RDD generated in this DStream. This is an output
 * operator, so this DStream will be registered as an output stream and there materialized.
 */
def print() {
  def foreachFunc = (rdd: RDD[T], time: Time) => {
    val first11 = rdd.take(11)
    println("-------------------------------------------")
    println("Time: " + time)
    println("-------------------------------------------")
    first11.take(10).foreach(println)
    if (first11.size > 10) println("...")
    println()
  }
  new ForEachDStream(this, context.sparkContext.clean(foreachFunc)).register()
}

print is an output operation that prints the first ten elements of each RDD generated by the DStream.

1.2.14 window

/**
 * Return a new DStream in which each RDD contains all the elements in seen in a
 * sliding window of time over this DStream. The new DStream generates RDDs with
 * the same interval as this DStream.
 * @param windowDuration width of the window; must be a multiple of this DStream's interval.
 */
def window(windowDuration: Duration): DStream[T] = window(windowDuration, this.slideDuration)

/**
 * Return a new DStream in which each RDD contains all the elements in seen in a
 * sliding window of time over this DStream.
 * @param windowDuration width of the window; must be a multiple of this DStream's
 *                       batching interval
 * @param slideDuration  sliding interval of the window (i.e., the interval after which
 *                       the new DStream will generate RDDs); must be a multiple of this
 *                       DStream's batching interval
 */
def window(windowDuration: Duration, slideDuration: Duration): DStream[T] = {
  new WindowedDStream(this, windowDuration, slideDuration)
}

window creates a windowed DStream from the given window length and slide interval; each RDD of the new DStream contains all elements seen in a sliding window of time over this DStream.
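For example, with the assumed 5-second batch interval, a 30-second window that slides every 10 seconds (both must be multiples of the batch interval):

import org.apache.spark.streaming.Seconds

// Each RDD of `windowedWords` holds the words of the last 30 seconds;
// a new RDD is produced every 10 seconds.
val windowedWords = words.window(Seconds(30), Seconds(10))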

1.2.15 reduceByWindow

/**
 * Return a new DStream in which each RDD has a single element generated by reducing all
 * elements in a sliding window over this DStream.
 * @param reduceFunc associative reduce function
 * @param windowDuration width of the window; must be a multiple of this DStream's
 *                       batching interval
 * @param slideDuration  sliding interval of the window (i.e., the interval after which
 *                       the new DStream will generate RDDs); must be a multiple of this
 *                       DStream's batching interval
 */
def reduceByWindow(
    reduceFunc: (T, T) => T,
    windowDuration: Duration,
    slideDuration: Duration
  ): DStream[T] = {
  this.reduce(reduceFunc).window(windowDuration, slideDuration).reduce(reduceFunc)
}

/**
 * Return a new DStream in which each RDD has a single element generated by reducing all
 * elements in a sliding window over this DStream. However, the reduction is done incrementally
 * using the old window's reduced value :
 *  1. reduce the new values that entered the window (e.g., adding new counts)
 *  2. "inverse reduce" the old values that left the window (e.g., subtracting old counts)
 * This is more efficient than reduceByWindow without "inverse reduce" function.
 * However, it is applicable to only "invertible reduce functions".
 * @param reduceFunc associative reduce function
 * @param invReduceFunc inverse reduce function
 * @param windowDuration width of the window; must be a multiple of this DStream's
 *                       batching interval
 * @param slideDuration  sliding interval of the window (i.e., the interval after which
 *                       the new DStream will generate RDDs); must be a multiple of this
 *                       DStream's batching interval
 */
def reduceByWindow(
    reduceFunc: (T, T) => T,
    invReduceFunc: (T, T) => T,
    windowDuration: Duration,
    slideDuration: Duration
  ): DStream[T] = {
  this.map(x => (1, x))
    .reduceByKeyAndWindow(reduceFunc, invReduceFunc, windowDuration, slideDuration, 1)
    .map(_._2)
}

reduceByWindow reduces all elements in a sliding window with reduceFunc; the second variant does the reduction incrementally, using an inverse reduce function to subtract the values that leave the window.
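A sketch of both variants on a stream of character counts derived from the assumed `lines` stream; the incremental variant additionally requires checkpointing to be enabled:

import org.apache.spark.streaming.Seconds

// Total character count over the last 30 seconds, recomputed every 10 seconds.
val charsInWindow = lines.map(_.length.toLong)
  .reduceByWindow(_ + _, Seconds(30), Seconds(10))

// Incremental variant: add counts entering the window, subtract counts leaving it.
// Requires ssc.checkpoint(...) to have been set.
val charsInWindowInc = lines.map(_.length.toLong)
  .reduceByWindow(_ + _, _ - _, Seconds(30), Seconds(10))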

1.2.16 countByWindow

/**
 * Return a new DStream in which each RDD has a single element generated by counting the number
 * of elements in a sliding window over this DStream. Hash partitioning is used to generate
 * the RDDs with Spark's default number of partitions.
 * @param windowDuration width of the window; must be a multiple of this DStream's
 *                       batching interval
 * @param slideDuration  sliding interval of the window (i.e., the interval after which
 *                       the new DStream will generate RDDs); must be a multiple of this
 *                       DStream's batching interval
 */
def countByWindow(windowDuration: Duration, slideDuration: Duration): DStream[Long] = {
  this.map(_ => 1L).reduceByWindow(_ + _, _ - _, windowDuration, slideDuration)
}

countByWindow counts the number of elements in a sliding window over the DStream.
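For example, counting the words of the last 30 seconds every 10 seconds (checkpointing must be enabled, since countByWindow uses the incremental reduceByWindow internally):

import org.apache.spark.streaming.Seconds

// One Long per output RDD: number of words seen in the last 30 seconds.
val wordCountInWindow = words.countByWindow(Seconds(30), Seconds(10))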

1.2.17 countByValueAndWindow

/**
 * Return a new DStream in which each RDD contains the count of distinct elements in
 * RDDs in a sliding window over this DStream. Hash partitioning is used to generate
 * the RDDs with `numPartitions` partitions (Spark's default number of partitions if
 * `numPartitions` not specified).
 * @param windowDuration width of the window; must be a multiple of this DStream's
 *                       batching interval
 * @param slideDuration  sliding interval of the window (i.e., the interval after which
 *                       the new DStream will generate RDDs); must be a multiple of this
 *                       DStream's batching interval
 * @param numPartitions  number of partitions of each RDD in the new DStream.
 */
def countByValueAndWindow(
    windowDuration: Duration,
    slideDuration: Duration,
    numPartitions: Int = ssc.sc.defaultParallelism)
    (implicit ord: Ordering[T] = null)
    : DStream[(T, Long)] =
{
  this.map(x => (x, 1L)).reduceByKeyAndWindow(
    (x: Long, y: Long) => x + y,
    (x: Long, y: Long) => x - y,
    windowDuration,
    slideDuration,
    numPartitions,
    (x: (T, Long)) => x._2 != 0L
  )
}

countByValueAndWindow counts the occurrences of each distinct value in a sliding window over the DStream.

1.2.18 union

/**
 * Return a new DStream by unifying data of another DStream with this DStream.
 * @param that Another DStream having the same slideDuration as this DStream.
 */
def union(that: DStream[T]): DStream[T] = new UnionDStream[T](Array(this, that))

/**
 * Return all the RDDs defined by the Interval object (both end times included)
 */
def slice(interval: Interval): Seq[RDD[T]] = {
  slice(interval.beginTime, interval.endTime)
}

union merges the data of this DStream with another DStream that has the same slideDuration; the slice(interval) overload shown above simply delegates to slice(fromTime, toTime) in the next section.
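For example, merging the two assumed socket streams from the earlier sketches:

// Both streams share the same batch interval, so they can be unified.
val allLines = lines.union(otherLines)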

1.2.19 slice

/**
 * Return all the RDDs between 'fromTime' to 'toTime' (both included)
 */
def slice(fromTime: Time, toTime: Time): Seq[RDD[T]] = {
  if (!isInitialized) {
    throw new SparkException(this + " has not been initialized")
  }
  if (!(fromTime - zeroTime).isMultipleOf(slideDuration)) {
    logWarning("fromTime (" + fromTime + ") is not a multiple of slideDuration ("
      + slideDuration + ")")
  }
  if (!(toTime - zeroTime).isMultipleOf(slideDuration)) {
    logWarning("toTime (" + fromTime + ") is not a multiple of slideDuration ("
      + slideDuration + ")")
  }
  val alignedToTime = toTime.floor(slideDuration)
  val alignedFromTime = fromTime.floor(slideDuration)
  logInfo("Slicing from " + fromTime + " to " + toTime +
    " (aligned to " + alignedFromTime + " and " + alignedToTime + ")")
  alignedFromTime.to(alignedToTime, slideDuration).flatMap(time => {
    if (time >= zeroTime) getOrCompute(time) else None
  })
}

slice returns the sequence of RDDs generated by this DStream between fromTime and toTime (both inclusive), aligned to the slide duration.
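A sketch that calls slice from inside an output operation, assuming the `words` stream from above (slice can only be used once the DStream has been initialized):

import org.apache.spark.streaming.Seconds

// Runs on the driver once per batch; fetches the RDDs of the last 30 seconds.
words.foreachRDD { (rdd, time) =>
  val recent = words.slice(time - Seconds(30), time)
  println(s"RDDs in the last 30 seconds at $time: ${recent.size}")
}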

1.2.20 saveAsObjectFiles

/**
 * Save each RDD in this DStream as a Sequence file of serialized objects.
 * The file name at each batch interval is generated based on `prefix` and
 * `suffix`: "prefix-TIME_IN_MS.suffix".
 */
def saveAsObjectFiles(prefix: String, suffix: String = "") {
  val saveFunc = (rdd: RDD[T], time: Time) => {
    val file = rddToFileName(prefix, suffix, time)
    rdd.saveAsObjectFile(file)
  }
  this.foreachRDD(saveFunc)
}

saveAsObjectFiles is an output operation that saves each RDD of the DStream as a SequenceFile of serialized objects.

1.2.21 saveAsTextFiles

/**
 * Save each RDD in this DStream as at text file, using string representation
 * of elements. The file name at each batch interval is generated based on
 * `prefix` and `suffix`: "prefix-TIME_IN_MS.suffix".
 */
def saveAsTextFiles(prefix: String, suffix: String = "") {
  val saveFunc = (rdd: RDD[T], time: Time) => {
    val file = rddToFileName(prefix, suffix, time)
    rdd.saveAsTextFile(file)
  }
  this.foreachRDD(saveFunc)
}

/**
 * Register this streaming as an output stream. This would ensure that RDDs of this
 * DStream will be generated.
 */
private[streaming] def register(): DStream[T] = {
  ssc.graph.addOutputStream(this)
  this
}
}

saveAsTextFiles is an output operation that saves each RDD of the DStream in text format; register (also shown above) registers this DStream as an output stream so that its RDDs will be generated.
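A closing sketch that writes the assumed `words` stream to text files and starts the context; the output path and suffix are illustrative assumptions:

// Each batch is written under a new directory named "<prefix>-<batch time in ms>.txt".
words.saveAsTextFiles("/tmp/out/words", "txt")

// Output operations only run once the context is started.
ssc.start()
ssc.awaitTermination()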

Please credit the source when reposting:

http://blog.csdn.net/sunbow0/article/details/43091247
