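For a long time I did not really understand window operations. Spark Streaming already processes data in micro-batches, so I could not see what window operations were needed for.


I had used Storm's window computation before, and it was genuinely painful. Spark Streaming makes it much simpler.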



window length - The duration of the window. -> N
sliding interval - The interval at which the window operation is performed. -> M



sdf sdfsd sdf


sdf sdfsd sdf


sdf sdfsd sdf
sdf sdfsd sdf 234





1:
----------------------------------------
sdf : 2
sdfsd : 1
----------------------------------------
2:
----------------------------------------
sdf : 4
sdfsd : 2
----------------------------------------
3:
----------------------------------------
sdf : 8
sdfsd : 4
234 : 1
----------------------------------------
4:
----------------------------------------
sdf : 6
sdfsd : 3
234 : 1
----------------------------------------
5:
----------------------------------------
sdf : 4
sdfsd : 2
234 : 1
----------------------------------------
6:
----------------------------------------
(none)
----------------------------------------
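The counts above can be reproduced without a cluster. The following is a minimal sketch in plain Scala collections, not Spark code: it assumes the three input groups arrived in three consecutive 10-second batches followed by empty batches, and that each window covers the last three batches (30 seconds sliding by 10). The names `WindowWordCount`, `batches`, `countWords` and `window` are made up for this illustration.

```scala
object WindowWordCount {
  // One entry per 10-second batch; each batch holds the lines received in it.
  // The three trailing empty batches are an assumption (no further input).
  val batches: Seq[Seq[String]] = Seq(
    Seq("sdf sdfsd sdf"),
    Seq("sdf sdfsd sdf"),
    Seq("sdf sdfsd sdf", "sdf sdfsd sdf 234"),
    Seq.empty,
    Seq.empty,
    Seq.empty
  )

  // Word counts for a single batch.
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines.flatMap(_.split(" ")).groupBy(identity).map { case (w, ws) => w -> ws.size }

  // Counts for the window that ends at batch index `end` (0-based)
  // and covers at most `length` batches, recomputed from scratch.
  def window(end: Int, length: Int): Map[String, Int] = {
    val start = math.max(0, end - length + 1)
    batches.slice(start, end + 1)
      .flatMap(b => countWords(b).toSeq)
      .groupBy(_._1)
      .map { case (w, pairs) => w -> pairs.map(_._2).sum }
  }

  def main(args: Array[String]): Unit =
    batches.indices.foreach { i =>
      println(s"${i + 1}: " + window(i, 3).toSeq.sortBy(_._1).mkString(", "))
    }
}
```

Window 3 is the first one that sees all three batches, which is why its counts peak there and then shrink as the early batches fall out of the window.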



import kafka.common.{OffsetAndMetadata, TopicAndPartition}
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaCluster, KafkaUtils, OffsetRange}

// The wrapper object and its name are illustrative; the original listing starts inside main.
object WindowDemo {
  def main(args: Array[String]): Unit = {
    // Positional arguments, each with a default when it is not supplied.
    val maxRatePerPartition = args.lift(0).getOrElse("3700")
    val master = args.lift(1).getOrElse("local")
    val duration = args.lift(2).map(_.toInt).getOrElse(10)
    val group = args.lift(3).getOrElse("brokers")
    val topic = args.lift(4).getOrElse("test")
    val brokerlist = args.lift(5).getOrElse("master:9092")

    // setMaster and the rate limit are assumed to be where `master` and
    // `maxRatePerPartition` were meant to go; the original listing never uses them.
    val sparkConf = new SparkConf()
      .setAppName("window")
      .setMaster(master)
      .set("spark.streaming.kafka.maxRatePerPartition", maxRatePerPartition)
    val ssc = new StreamingContext(sparkConf, Seconds(duration))

    val topicsSet = topic.split(",").toSet
    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> brokerlist,
      "group.id" -> group,
      "enable.auto.commit" -> "false")
    val kc = new KafkaCluster(kafkaParams)

    // Partitions of the requested topics; empty if the lookup fails.
    val topicAndPartition: Set[TopicAndPartition] =
      kc.getPartitions(topicsSet).fold(_ => Set.empty[TopicAndPartition], identity)

    // Start from the group's committed offsets; otherwise from the latest
    // leader offsets; otherwise from 0.
    val fromOffsets: Map[TopicAndPartition, Long] =
      kc.getConsumerOffsets(group, topicAndPartition) match {
        case Right(committed) => committed
        case Left(_) =>
          kc.getLatestLeaderOffsets(topicAndPartition) match {
            case Right(latest) => latest.map { case (tp, lo) => tp -> lo.offset }
            case Left(_) => topicAndPartition.map(tp => tp -> 0L).toMap
          }
      }

    val messageHandler = (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message)
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, fromOffsets, messageHandler)

    // Remember each batch's offset ranges so they can be committed after processing.
    var offsetsList = Array[OffsetRange]()
    val msgContent = messages.transform { rdd =>
      offsetsList = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd
    }

    // Word count over a 30-second window that slides every 10 seconds.
    val data = msgContent.flatMap(x => x._2.split(" "))
    val data2 = data.map(x => (x, 1))
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

    data2.foreachRDD { rdd =>
      rdd.foreach(x => println(x._1 + " = " + x._2))
      for (offsets <- offsetsList) {
        // Use the range's own topic so a comma-separated topic list also works.
        val tp = TopicAndPartition(offsets.topic, offsets.partition)
        val o = kc.setConsumerOffsetMetadata(group, Map(tp -> OffsetAndMetadata(offsets.untilOffset)))
        if (o.isLeft) {
          println(s"Error updating the offset to Kafka cluster: ${o.left.get}")
        }
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}



Function 1 (recompute the whole window each time):
reduceByKeyAndWindow(reduceFunc, windowDuration, slideDuration, numPartitions)
reduceByKeyAndWindow((a: Int, b: Int) => (a + b), Seconds(30), Seconds(10))

Function 2 (incremental):
reduceByKeyAndWindow(reduceFunc, invReduceFunc, windowDuration, slideDuration, numPartitions, filterFunc)
reduceByKeyAndWindow((a: Int, b: Int) => (a + b), (a: Int, b: Int) => (a - b), Seconds(30), Seconds(10))

Take a window of 3 batches that slides by 2 batches.

Function 1:
window3 : time1+time2+time3
window5 : time3+time4+time5

Function 2 (incremental):
window3 : (time1+time2)+time3, then checkpoint -> (time1+time2) | window3
window5 : window3 + time4+time5 - (time1+time2); only time4+time5 has to be computed, not time3
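The two strategies give the same answer, which a few lines of plain Scala can show. This is only an arithmetic sketch under assumptions: the per-batch totals in `batchTotals` are invented numbers, and `IncrementalWindow`, `full` and `incremental` are hypothetical names, not Spark APIs.

```scala
object IncrementalWindow {
  // Hypothetical per-batch totals for time1..time5 (index 0 is time1).
  val batchTotals: Vector[Int] = Vector(3, 5, 2, 7, 4)

  // Function 1: sum every batch in the window from scratch.
  def full(end: Int, length: Int): Int =
    batchTotals.slice(math.max(0, end - length + 1), end + 1).sum

  // Function 2: start from the previous window, subtract the batches
  // that left the window and add the batches that entered it.
  def incremental(previous: Int, left: Seq[Int], arrived: Seq[Int]): Int =
    previous - left.sum + arrived.sum

  def main(args: Array[String]): Unit = {
    val window3 = full(2, 3)                                      // time1+time2+time3
    val window5Full = full(4, 3)                                  // time3+time4+time5
    val window5Inc = incremental(window3, Seq(3, 5), Seq(7, 4))   // window3 - (time1+time2) + time4+time5
    println(s"full = $window5Full, incremental = $window5Inc")
  }
}
```

The incremental form never touches time3 directly; it reaches it only through the cached value of window3.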

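Under this interpretation, though, there is a problem. To make window7 cheap to compute, window5 has to compute time3+time4 and cache it, so time3 still has to be computed after all. The cache is also larger than the window itself: window3, for example, has to cache (time1+time2) plus window3 = time1+time2+time3, that is 2*time1+2*time2+time3, which is 5/3 of window3 if every batch is the same size. So the computation would actually be more complex than function 1 and use more resources.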



/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.streaming.dstream

import scala.reflect.ClassTag

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming._
import org.apache.spark.streaming.Duration

private[streaming]
class WindowedDStream[T: ClassTag](
    parent: DStream[T],
    _windowDuration: Duration,
    _slideDuration: Duration)
  extends DStream[T](parent.ssc) {

  if (!_windowDuration.isMultipleOf(parent.slideDuration)) {
    throw new Exception("The window duration of windowed DStream (" + _windowDuration + ") " +
    "must be a multiple of the slide duration of parent DStream (" + parent.slideDuration + ")")
  }

  if (!_slideDuration.isMultipleOf(parent.slideDuration)) {
    throw new Exception("The slide duration of windowed DStream (" + _slideDuration + ") " +
    "must be a multiple of the slide duration of parent DStream (" + parent.slideDuration + ")")
  }

  // Persist parent level by default, as those RDDs are going to be obviously reused.
  parent.persist(StorageLevel.MEMORY_ONLY_SER)

  def windowDuration: Duration = _windowDuration

  override def dependencies: List[DStream[_]] = List(parent)

  override def slideDuration: Duration = _slideDuration

  override def parentRememberDuration: Duration = rememberDuration + windowDuration

  override def persist(level: StorageLevel): DStream[T] = {
    // Do not let this windowed DStream be persisted as windowed (union-ed) RDDs share underlying
    // RDDs and persisting the windowed RDDs would store numerous copies of the underlying data.
    // Instead control the persistence of the parent DStream.
    // ... (body not shown in this excerpt)
  }

  override def compute(validTime: Time): Option[RDD[T]] = {
    val currentWindow = new Interval(validTime - windowDuration + parent.slideDuration, validTime)
    val rddsInWindow = parent.slice(currentWindow)
    // ... (rest of the method not shown in this excerpt)
  }
}


package org.apache.spark.streaming.dstream

import scala.collection.mutable.ArrayBuffer
import scala.reflect.ClassTag

import org.apache.spark.Partitioner
import org.apache.spark.rdd.{CoGroupedRDD, RDD}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Duration, Interval, Time}

private[streaming]
class ReducedWindowedDStream[K: ClassTag, V: ClassTag](
    parent: DStream[(K, V)],
    reduceFunc: (V, V) => V,
    invReduceFunc: (V, V) => V,
    filterFunc: Option[((K, V)) => Boolean],
    _windowDuration: Duration,
    _slideDuration: Duration,
    partitioner: Partitioner
  ) extends DStream[(K, V)](parent.ssc) {

  require(_windowDuration.isMultipleOf(parent.slideDuration),
    "The window duration of ReducedWindowedDStream (" + _windowDuration + ") " +
      "must be multiple of the slide duration of parent DStream (" + parent.slideDuration + ")"
  )

  require(_slideDuration.isMultipleOf(parent.slideDuration),
    "The slide duration of ReducedWindowedDStream (" + _slideDuration + ") " +
      "must be multiple of the slide duration of parent DStream (" + parent.slideDuration + ")"
  )

  // Reduce each batch of data using reduceByKey which will be further reduced by window
  // by ReducedWindowedDStream
  private val reducedStream = parent.reduceByKey(reduceFunc, partitioner)

  // Persist RDDs to memory by default as these RDDs are going to be reused.
  reducedStream.persist(StorageLevel.MEMORY_ONLY_SER)

  def windowDuration: Duration = _windowDuration

  override def dependencies: List[DStream[_]] = List(reducedStream)

  override def slideDuration: Duration = _slideDuration

  override val mustCheckpoint = true

  override def parentRememberDuration: Duration = rememberDuration + windowDuration

  override def persist(storageLevel: StorageLevel): DStream[(K, V)] = {
    // ... (body not shown in this excerpt)
  }

  override def checkpoint(interval: Duration): DStream[(K, V)] = {
    // reducedStream.checkpoint(interval)
    // ... (rest of the body not shown in this excerpt)
  }

  override def compute(validTime: Time): Option[RDD[(K, V)]] = {
    val reduceF = reduceFunc
    val invReduceF = invReduceFunc

    val currentTime = validTime
    val currentWindow = new Interval(currentTime - windowDuration + parent.slideDuration,
      currentTime)
    val previousWindow = currentWindow - slideDuration

    logDebug("Window time = " + windowDuration)
    logDebug("Slide time = " + slideDuration)
    logDebug("Zero time = " + zeroTime)
    logDebug("Current window = " + currentWindow)
    logDebug("Previous window = " + previousWindow)

    //  _____________________________
    // |  previous window   _________|___________________
    // |___________________|       current window        |  --------------> Time
    //                     |_____________________________|
    //
    // |________ _________|          |________ _________|
    //          |                             |
    //          V                             V
    //       old RDDs                     new RDDs
    //

    // Get the RDDs of the reduced values in "old time steps"
    val oldRDDs =
      reducedStream.slice(previousWindow.beginTime, currentWindow.beginTime - parent.slideDuration)
    logDebug("# old RDDs = " + oldRDDs.size)

    // Get the RDDs of the reduced values in "new time steps"
    val newRDDs =
      reducedStream.slice(previousWindow.endTime + parent.slideDuration, currentWindow.endTime)
    logDebug("# new RDDs = " + newRDDs.size)

    // Get the RDD of the reduced value of the previous window
    val previousWindowRDD =
      getOrCompute(previousWindow.endTime).getOrElse(ssc.sc.makeRDD(Seq[(K, V)]()))

    // Make the list of RDDs that needs to cogrouped together for reducing their reduced values
    val allRDDs = new ArrayBuffer[RDD[(K, V)]]() += previousWindowRDD ++= oldRDDs ++= newRDDs

    // Cogroup the reduced RDDs and merge the reduced values
    val cogroupedRDD = new CoGroupedRDD[K](allRDDs.toSeq.asInstanceOf[Seq[RDD[(K, _)]]],
      partitioner)
    // val mergeValuesFunc = mergeValues(oldRDDs.size, newRDDs.size) _

    val numOldValues = oldRDDs.size
    val numNewValues = newRDDs.size

    val mergeValues = (arrayOfValues: Array[Iterable[V]]) => {
      if (arrayOfValues.length != 1 + numOldValues + numNewValues) {
        throw new Exception("Unexpected number of sequences of reduced values")
      }
      // Getting reduced values "old time steps" that will be removed from current window
      val oldValues = (1 to numOldValues).map(i => arrayOfValues(i)).filter(!_.isEmpty).map(_.head)
      // Getting reduced values "new time steps"
      val newValues =
        (1 to numNewValues).map(i => arrayOfValues(numOldValues + i)).filter(!_.isEmpty).map(_.head)

      if (arrayOfValues(0).isEmpty) {
        // If previous window's reduce value does not exist, then at least new values should exist
        if (newValues.isEmpty) {
          throw new Exception("Neither previous window has value for key, nor new values found. " +
            "Are you sure your key class hashes consistently?")
        }
        // Reduce the new values
        newValues.reduce(reduceF) // return
      } else {
        // Get the previous window's reduced value
        var tempValue = arrayOfValues(0).head
        // If old values exists, then inverse reduce then from previous value
        if (!oldValues.isEmpty) {
          tempValue = invReduceF(tempValue, oldValues.reduce(reduceF))
        }
        // If new values exists, then reduce them with previous value
        if (!newValues.isEmpty) {
          tempValue = reduceF(tempValue, newValues.reduce(reduceF))
        }
        tempValue // return
      }
    }

    val mergedValuesRDD = cogroupedRDD.asInstanceOf[RDD[(K, Array[Iterable[V]])]]
      .mapValues(mergeValues)

    if (filterFunc.isDefined) {
      // ... (not shown in this excerpt)
    } else {
      // ... (not shown in this excerpt)
    }
  }
}






//  _____________________________
// |  previous window   _________|___________________
// |___________________|       current window        |  --------------> Time
//                     |_____________________________|
//
// |________ _________|     |    |________ _________|
//          |               |             |
//          V               V             V
//       old RDDs        key RDDs      new RDDs


previous window - old RDDs = key RDDs
key RDDs + new RDDs = current window

// Get the previous window's reduced value
var tempValue = arrayOfValues(0).head
// If old values exist, inverse-reduce them from the previous value
if (!oldValues.isEmpty) {
  tempValue = invReduceF(tempValue, oldValues.reduce(reduceF)) // user-defined function: first minus second
}
// If new values exist, reduce them with the previous value
if (!newValues.isEmpty) {
  tempValue = reduceF(tempValue, newValues.reduce(reduceF)) // user-defined function: first plus second
}

The result of the previous window then serves as the cache:

// Get the RDD of the reduced value of the previous window
val previousWindowRDD =
getOrCompute(previousWindow.endTime).getOrElse(ssc.sc.makeRDD(Seq[(K, V)]()))

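So every window after the first still has to process both the old RDDs and the new RDDs, which still looks fairly cumbersome. Is it really more efficient? That remains to be tested.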
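The per-key update that compute performs (previous window, minus old RDDs, plus new RDDs) can be tried out with ordinary Maps. The sketch below is a simplification, not the Spark implementation: Maps stand in for RDDs, addition and subtraction stand in for reduceFunc and invReduceFunc, a key missing from the previous window is treated as 0, and the names `MergeValuesSketch`, `Counts`, `reduceAll` and `slide` are hypothetical.

```scala
object MergeValuesSketch {
  type Counts = Map[String, Int]

  // Combine several per-batch count maps with the reduce function (here: addition).
  def reduceAll(batches: Seq[Counts]): Counts =
    batches.flatMap(_.toSeq).groupBy(_._1).map { case (k, kv) => k -> kv.map(_._2).sum }

  // current window = previous window - old batches + new batches, key by key.
  def slide(previous: Counts, oldBatches: Seq[Counts], newBatches: Seq[Counts]): Counts = {
    val oldSum = reduceAll(oldBatches)
    val newSum = reduceAll(newBatches)
    (previous.keySet ++ newSum.keySet).map { key =>
      // Inverse reduce: remove what left the window.
      val afterInverse = previous.getOrElse(key, 0) - oldSum.getOrElse(key, 0)
      // Reduce: add what entered the window.
      key -> (afterInverse + newSum.getOrElse(key, 0))
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val window3 = Map("sdf" -> 8, "sdfsd" -> 4, "234" -> 1)
    val batch1 = Map("sdf" -> 2, "sdfsd" -> 1)
    // Window 4 of the word-count run: batch 1 leaves, an empty batch arrives.
    println(slide(window3, Seq(batch1), Seq.empty))
  }
}
```

In this sketch a key whose count falls to 0 stays in the result with value 0; dropping such entries is presumably what the optional filterFunc argument is for.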

