I never quite understood window operations before. Spark Streaming already processes data in discrete batches, so why would you need windows on top of that?

The simplest scenario for windows is computing, every M seconds, the trending searches of the last N seconds. When M = N, as noted above, the window operation offers no particular advantage; but when M != N, the benefit of windowed computation becomes obvious.

Windowed computation in Storm was genuinely painful; Spark Streaming makes it far simpler.

Borrowing the figure and the example from the official documentation:

In short: every 10 seconds, count the words seen over the last 30 seconds.

Two parameters:
window length - The duration of the window. -> N
sliding interval - The interval at which the window operation is performed. -> M
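As a minimal sketch of how these two parameters appear in code (the DStream `lines` is hypothetical here, standing in for whatever input source you use; this snippet is not part of the original example):

// Hypothetical DStream `lines`; N = 30s window length, M = 10s slide interval.
val windowed = lines.window(Seconds(30), Seconds(10))
val wordCounts = windowed
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)   // word counts over the last 30s, emitted every 10s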

Input during each successive 10-second batch:

1:

sdf sdfsd sdf

2:

sdf sdfsd sdf

3:

sdf sdfsd sdf
sdf sdfsd sdf 234

Input

The expected results:

1:
----------------------------------------
sdf : 2
sdfsd : 1
----------------------------------------
2:
----------------------------------------
sdf : 4
sdfsd : 2
----------------------------------------
3:
----------------------------------------
sdf : 8
sdfsd : 4
234 : 1
----------------------------------------
4:
----------------------------------------
sdf : 6
sdfsd : 3
234 : 1
----------------------------------------
5:
----------------------------------------
sdf : 4
sdfsd : 2
234 : 1
----------------------------------------
6:
----------------------------------------
(no output)
----------------------------------------

Output

The code is as follows:

var maxRatePerPartition = "3700"
if (args.length > 1) maxRatePerPartition = args(0)
var master = "local"
if (args.length > 2) master = args(1)
var duration = 10
if (args.length > 3) duration = args(2).toInt
var group = "brokers"
if (args.length > 4) group = args(3)
var topic = "test"
if (args.length > 5) topic = args(4)
var brokerlist = "master:9092"
if (args.length > 6) brokerlist = args(5)

val sparkConf = new SparkConf()
  .setMaster(master)
  .setAppName("window")
val ssc = new StreamingContext(sparkConf, Seconds(duration))

val topicsSet = topic.split(",").toSet
val kafkaParams = Map[String, String](
  "metadata.broker.list" -> brokerlist,
  "group.id" -> group,
  "enable.auto.commit" -> "false")
val kc = new KafkaCluster(kafkaParams)

// Resolve the topic's partitions and the offsets to start reading from.
val topicAndPartition2 = kc.getPartitions(topicsSet)
var topicAndPartition: Set[TopicAndPartition] = Set[TopicAndPartition]()
var fromOffsets: Map[TopicAndPartition, Long] = Map[TopicAndPartition, Long]()
var bool = true
if (topicAndPartition2.isRight) topicAndPartition = kc.getPartitions(topicsSet).right.get
else {
  val lateOffsets = kc.getLatestLeaderOffsets(topicAndPartition)
  if (lateOffsets.isLeft) { topicAndPartition.foreach { x => fromOffsets += (x -> 0) } }
  else { lateOffsets.right.get.foreach(x => fromOffsets += (x._1 -> x._2.offset)) }
  bool = false
}
if (bool) {
  // Prefer the consumer group's committed offsets; fall back to the latest leader offsets.
  var temp = kc.getConsumerOffsets(group, topicAndPartition)
  if (temp.isRight) {
    fromOffsets = temp.right.get
  } else if (temp.isLeft) {
    val lateOffsets = kc.getLatestLeaderOffsets(topicAndPartition)
    if (lateOffsets.isLeft) { topicAndPartition.foreach { x => fromOffsets += (x -> 0) } }
    else { lateOffsets.right.get.foreach(x => fromOffsets += (x._1 -> x._2.offset)) }
  }
}

val messageHandler = (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message)
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets, messageHandler)

// Remember each batch's offset ranges so they can be committed back to Kafka.
var offsetsList = Array[OffsetRange]()
val msgContent = messages.transform(x => {
  offsetsList = x.asInstanceOf[HasOffsetRanges].offsetRanges
  x
})

// Count words over a 30-second window, sliding every 10 seconds.
val data = msgContent.flatMap(x => x._2.split(" "))
val data2 = data.map(x => (x, 1)).reduceByKeyAndWindow((a: Int, b: Int) => (a + b), Seconds(30), Seconds(10))
data2.foreachRDD(x => {
  x.foreach(x => {
    println(x._1 + " = " + x._2)
  })
})

// Commit the consumed offsets back to the Kafka cluster.
for (offsets <- offsetsList) {
  val topicAndPartition = TopicAndPartition(topic, offsets.partition)
  val o = kc.setConsumerOffsetMetadata(group, Map[TopicAndPartition, OffsetAndMetadata](topicAndPartition -> OffsetAndMetadata(offsets.untilOffset)))
  if (o.isLeft) {
    println(s"Error updating the offset to Kafka cluster: ${o.left.get}")
  }
}

ssc.start()
ssc.awaitTermination()

The actual output matched the expected results.
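If you only want to reproduce the windowed word count itself, the Kafka offset handling above is not essential. Here is a minimal, self-contained sketch of my own using a socket source instead (the object name, host and port are hypothetical; feed it with `nc -lk 9999`):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("window-demo")
    val ssc = new StreamingContext(conf, Seconds(10))          // 10-second batches

    // Hypothetical socket source; feed it with: nc -lk 9999
    val lines = ssc.socketTextStream("localhost", 9999)

    val counts = lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}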

reduceByKeyAndWindow comes in two overloads. Both produce the same result; the difference is the invReduceFunc. Function 1 recomputes each window directly, while Function 2 computes the new window incrementally (adding the batches that enter and subtracting the ones that leave), which is supposed to be more efficient.

Function 1 (no inverse function)
reduceByKeyAndWindow(reduceFunc, windowDuration, slideDuration, [numPartitions])
reduceByKeyAndWindow((a: Int, b: Int) => (a + b), Seconds(30), Seconds(10))

Function 2 (with invReduceFunc)
reduceByKeyAndWindow(reduceFunc, invReduceFunc, windowDuration, slideDuration, [numPartitions], [filterFunc])
reduceByKeyAndWindow((a: Int, b: Int) => (a + b), (a: Int, b: Int) => (a - b), Seconds(30), Seconds(10))
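One practical note (my addition, not from the original post): because the invReduceFunc variant carries the previous window's result forward as state, checkpointing must be enabled before it can be used. A minimal sketch, assuming `pairs` is the (word, 1) DStream and the checkpoint directory is an arbitrary placeholder:

ssc.checkpoint("/tmp/window-checkpoint")  // hypothetical path; required for the inverse-reduce variant

val counts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,   // reduceFunc: fold entering values in
  (a: Int, b: Int) => a - b,   // invReduceFunc: fold leaving values out
  Seconds(30), Seconds(10))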

Let's again take the figure above as the example.

According to zhouzhihubeyond's blog, the implementation principle is as follows:

Function 1 (full recomputation):
window3 : time1 + time2 + time3
window5 : time3 + time4 + time5

Function 2 (incremental):
window3 : (time1 + time2) + time3, then checkpoint -> (time1 + time2) | window3
window5 : window3 + time4 + time5 - (time1 + time2), so only time4 + time5 needs to be computed; time3 is not recomputed
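To make this concrete, here is a toy numeric check of the incremental identity in plain Scala (no Spark involved); the per-batch values are the counts of the word "sdf" from the example above:

object IncrementalWindowCheck {
  def main(args: Array[String]): Unit = {
    // Per-batch counts of "sdf": time1..time5.
    val t = Map(1 -> 2, 2 -> 2, 3 -> 4, 4 -> 0, 5 -> 0)

    // Function 1: recompute each window from scratch.
    val window3Full = t(1) + t(2) + t(3)                          // 8
    val window5Full = t(3) + t(4) + t(5)                          // 4

    // Function 2: reuse window3, subtract the batches that left, add the ones that entered.
    val window5Incr = window3Full - (t(1) + t(2)) + (t(4) + t(5)) // 8 - 4 + 0 = 4

    assert(window5Incr == window5Full)
    println(s"full = $window5Full, incremental = $window5Incr")
  }
}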

But this explanation raises a problem: to make the later window7 computation possible, at window5 you would have to compute and cache time3 + time4, so time3 still gets computed after all. Moreover, the amount of data that would have to be cached (for window3, for instance: (time1 + time2) + (window3 = time1 + time2 + time3) = 2*time1 + 2*time2 + time3 = 5/3 * window3) is 5/3 of the window itself. Computed that way, it would actually be more complicated than Function 1 and consume more resources.

First ask whether that is actually what happens, then ask why.

On to the source:

 /*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
 */

package org.apache.spark.streaming.dstream

import scala.reflect.ClassTag

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming._
import org.apache.spark.streaming.Duration

private[streaming]
class WindowedDStream[T: ClassTag](
    parent: DStream[T],
    _windowDuration: Duration,
    _slideDuration: Duration)
  extends DStream[T](parent.ssc) {

  if (!_windowDuration.isMultipleOf(parent.slideDuration)) {
    throw new Exception("The window duration of windowed DStream (" + _windowDuration + ") " +
      "must be a multiple of the slide duration of parent DStream (" + parent.slideDuration + ")")
  }

  if (!_slideDuration.isMultipleOf(parent.slideDuration)) {
    throw new Exception("The slide duration of windowed DStream (" + _slideDuration + ") " +
      "must be a multiple of the slide duration of parent DStream (" + parent.slideDuration + ")")
  }

  // Persist parent level by default, as those RDDs are going to be obviously reused.
  parent.persist(StorageLevel.MEMORY_ONLY_SER)

  def windowDuration: Duration = _windowDuration
  override def dependencies: List[DStream[_]] = List(parent)
  override def slideDuration: Duration = _slideDuration
  override def parentRememberDuration: Duration = rememberDuration + windowDuration

  override def persist(level: StorageLevel): DStream[T] = {
    // Do not let this windowed DStream be persisted as windowed (union-ed) RDDs share underlying
    // RDDs and persisting the windowed RDDs would store numerous copies of the underlying data.
    // Instead control the persistence of the parent DStream.
    parent.persist(level)
    this
  }

  override def compute(validTime: Time): Option[RDD[T]] = {
    val currentWindow = new Interval(validTime - windowDuration + parent.slideDuration, validTime)
    val rddsInWindow = parent.slice(currentWindow)
    Some(ssc.sc.union(rddsInWindow))
  }
}

The block above is the key source for Function 1 (WindowedDStream).

/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
 */

package org.apache.spark.streaming.dstream

import scala.collection.mutable.ArrayBuffer
import scala.reflect.ClassTag

import org.apache.spark.Partitioner
import org.apache.spark.rdd.{CoGroupedRDD, RDD}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Duration, Interval, Time}

private[streaming]
class ReducedWindowedDStream[K: ClassTag, V: ClassTag](
    parent: DStream[(K, V)],
    reduceFunc: (V, V) => V,
    invReduceFunc: (V, V) => V,
    filterFunc: Option[((K, V)) => Boolean],
    _windowDuration: Duration,
    _slideDuration: Duration,
    partitioner: Partitioner
  ) extends DStream[(K, V)](parent.ssc) {

  require(_windowDuration.isMultipleOf(parent.slideDuration),
    "The window duration of ReducedWindowedDStream (" + _windowDuration + ") " +
      "must be multiple of the slide duration of parent DStream (" + parent.slideDuration + ")"
  )

  require(_slideDuration.isMultipleOf(parent.slideDuration),
    "The slide duration of ReducedWindowedDStream (" + _slideDuration + ") " +
      "must be multiple of the slide duration of parent DStream (" + parent.slideDuration + ")"
  )

  // Reduce each batch of data using reduceByKey which will be further reduced by window
  // by ReducedWindowedDStream
  private val reducedStream = parent.reduceByKey(reduceFunc, partitioner)

  // Persist RDDs to memory by default as these RDDs are going to be reused.
  super.persist(StorageLevel.MEMORY_ONLY_SER)
  reducedStream.persist(StorageLevel.MEMORY_ONLY_SER)

  def windowDuration: Duration = _windowDuration
  override def dependencies: List[DStream[_]] = List(reducedStream)
  override def slideDuration: Duration = _slideDuration
  override val mustCheckpoint = true
  override def parentRememberDuration: Duration = rememberDuration + windowDuration

  override def persist(storageLevel: StorageLevel): DStream[(K, V)] = {
    super.persist(storageLevel)
    reducedStream.persist(storageLevel)
    this
  }

  override def checkpoint(interval: Duration): DStream[(K, V)] = {
    super.checkpoint(interval)
    // reducedStream.checkpoint(interval)
    this
  }

  override def compute(validTime: Time): Option[RDD[(K, V)]] = {
    val reduceF = reduceFunc
    val invReduceF = invReduceFunc

    val currentTime = validTime
    val currentWindow = new Interval(currentTime - windowDuration + parent.slideDuration,
      currentTime)
    val previousWindow = currentWindow - slideDuration

    logDebug("Window time = " + windowDuration)
    logDebug("Slide time = " + slideDuration)
    logDebug("Zero time = " + zeroTime)
    logDebug("Current window = " + currentWindow)
    logDebug("Previous window = " + previousWindow)

    //  _____________________________
    // |  previous window   _________|___________________
    // |___________________|       current window        |  --------------> Time
    //                     |_____________________________|
    //
    // |________ _________|          |________ _________|
    //          |                             |
    //          V                             V
    //       old RDDs                     new RDDs
    //

    // Get the RDDs of the reduced values in "old time steps"
    val oldRDDs =
      reducedStream.slice(previousWindow.beginTime, currentWindow.beginTime - parent.slideDuration)
    logDebug("# old RDDs = " + oldRDDs.size)

    // Get the RDDs of the reduced values in "new time steps"
    val newRDDs =
      reducedStream.slice(previousWindow.endTime + parent.slideDuration, currentWindow.endTime)
    logDebug("# new RDDs = " + newRDDs.size)

    // Get the RDD of the reduced value of the previous window
    val previousWindowRDD =
      getOrCompute(previousWindow.endTime).getOrElse(ssc.sc.makeRDD(Seq[(K, V)]()))

    // Make the list of RDDs that needs to cogrouped together for reducing their reduced values
    val allRDDs = new ArrayBuffer[RDD[(K, V)]]() += previousWindowRDD ++= oldRDDs ++= newRDDs

    // Cogroup the reduced RDDs and merge the reduced values
    val cogroupedRDD = new CoGroupedRDD[K](allRDDs.toSeq.asInstanceOf[Seq[RDD[(K, _)]]],
      partitioner)
    // val mergeValuesFunc = mergeValues(oldRDDs.size, newRDDs.size) _

    val numOldValues = oldRDDs.size
    val numNewValues = newRDDs.size

    val mergeValues = (arrayOfValues: Array[Iterable[V]]) => {
      if (arrayOfValues.length != 1 + numOldValues + numNewValues) {
        throw new Exception("Unexpected number of sequences of reduced values")
      }
      // Getting reduced values "old time steps" that will be removed from current window
      val oldValues = (1 to numOldValues).map(i => arrayOfValues(i)).filter(!_.isEmpty).map(_.head)
      // Getting reduced values "new time steps"
      val newValues =
        (1 to numNewValues).map(i => arrayOfValues(numOldValues + i)).filter(!_.isEmpty).map(_.head)

      if (arrayOfValues(0).isEmpty) {
        // If previous window's reduce value does not exist, then at least new values should exist
        if (newValues.isEmpty) {
          throw new Exception("Neither previous window has value for key, nor new values found. " +
            "Are you sure your key class hashes consistently?")
        }
        // Reduce the new values
        newValues.reduce(reduceF) // return
      } else {
        // Get the previous window's reduced value
        var tempValue = arrayOfValues(0).head
        // If old values exists, then inverse reduce then from previous value
        if (!oldValues.isEmpty) {
          tempValue = invReduceF(tempValue, oldValues.reduce(reduceF))
        }
        // If new values exists, then reduce them with previous value
        if (!newValues.isEmpty) {
          tempValue = reduceF(tempValue, newValues.reduce(reduceF))
        }
        tempValue // return
      }
    }

    val mergedValuesRDD = cogroupedRDD.asInstanceOf[RDD[(K, Array[Iterable[V]])]]
      .mapValues(mergeValues)

    if (filterFunc.isDefined) {
      Some(mergedValuesRDD.filter(filterFunc.get))
    } else {
      Some(mergedValuesRDD)
    }
  }
}

The block above is the key source for Function 2 (ReducedWindowedDStream).

As you can see, the key difference between them is the compute method.

Function 1's compute is trivial: it derives the current window interval from two points in time and simply collects the corresponding RDDs.

override def compute(validTime: Time): Option[RDD[T]] = {
  val currentWindow = new Interval(validTime - windowDuration + parent.slideDuration, validTime)
  val rddsInWindow = parent.slice(currentWindow)
  Some(ssc.sc.union(rddsInWindow))
}

Function 2's compute method is considerably more complex:

override def compute(validTime: Time): Option[RDD[(K, V)]] = {
  val reduceF = reduceFunc
  val invReduceF = invReduceFunc

  val currentTime = validTime
  val currentWindow = new Interval(currentTime - windowDuration + parent.slideDuration,
    currentTime)
  val previousWindow = currentWindow - slideDuration

  logDebug("Window time = " + windowDuration)
  logDebug("Slide time = " + slideDuration)
  logDebug("Zero time = " + zeroTime)
  logDebug("Current window = " + currentWindow)
  logDebug("Previous window = " + previousWindow)

  //  _____________________________
  // |  previous window   _________|___________________
  // |___________________|       current window        |  --------------> Time
  //                     |_____________________________|
  //
  // |________ _________|          |________ _________|
  //          |                             |
  //          V                             V
  //       old RDDs                     new RDDs
  //

  // Get the RDDs of the reduced values in "old time steps"
  val oldRDDs =
    reducedStream.slice(previousWindow.beginTime, currentWindow.beginTime - parent.slideDuration)
  logDebug("# old RDDs = " + oldRDDs.size)

  // Get the RDDs of the reduced values in "new time steps"
  val newRDDs =
    reducedStream.slice(previousWindow.endTime + parent.slideDuration, currentWindow.endTime)
  logDebug("# new RDDs = " + newRDDs.size)

  // Get the RDD of the reduced value of the previous window
  val previousWindowRDD =
    getOrCompute(previousWindow.endTime).getOrElse(ssc.sc.makeRDD(Seq[(K, V)]()))

  // Make the list of RDDs that needs to cogrouped together for reducing their reduced values
  val allRDDs = new ArrayBuffer[RDD[(K, V)]]() += previousWindowRDD ++= oldRDDs ++= newRDDs

  // Cogroup the reduced RDDs and merge the reduced values
  val cogroupedRDD = new CoGroupedRDD[K](allRDDs.toSeq.asInstanceOf[Seq[RDD[(K, _)]]],
    partitioner)
  // val mergeValuesFunc = mergeValues(oldRDDs.size, newRDDs.size) _

  val numOldValues = oldRDDs.size
  val numNewValues = newRDDs.size

  val mergeValues = (arrayOfValues: Array[Iterable[V]]) => {
    if (arrayOfValues.length != 1 + numOldValues + numNewValues) {
      throw new Exception("Unexpected number of sequences of reduced values")
    }
    // Getting reduced values "old time steps" that will be removed from current window
    val oldValues = (1 to numOldValues).map(i => arrayOfValues(i)).filter(!_.isEmpty).map(_.head)
    // Getting reduced values "new time steps"
    val newValues =
      (1 to numNewValues).map(i => arrayOfValues(numOldValues + i)).filter(!_.isEmpty).map(_.head)

    if (arrayOfValues(0).isEmpty) {
      // If previous window's reduce value does not exist, then at least new values should exist
      if (newValues.isEmpty) {
        throw new Exception("Neither previous window has value for key, nor new values found. " +
          "Are you sure your key class hashes consistently?")
      }
      // Reduce the new values
      newValues.reduce(reduceF) // return
    } else {
      // Get the previous window's reduced value
      var tempValue = arrayOfValues(0).head
      // If old values exists, then inverse reduce then from previous value
      if (!oldValues.isEmpty) {
        tempValue = invReduceF(tempValue, oldValues.reduce(reduceF))
      }
      // If new values exists, then reduce them with previous value
      if (!newValues.isEmpty) {
        tempValue = reduceF(tempValue, newValues.reduce(reduceF))
      }
      tempValue // return
    }
  }

  val mergedValuesRDD = cogroupedRDD.asInstanceOf[RDD[(K, Array[Iterable[V]])]]
    .mapValues(mergeValues)

  if (filterFunc.isDefined) {
    Some(mergedValuesRDD.filter(filterFunc.get))
  } else {
    Some(mergedValuesRDD)
  }
}

As shown below, I added a keyRDDs marker to the official diagram:

//  _____________________________
// |  previous window   _________|___________________
// |___________________|       current window        |  --------------> Time
//                     |_____________________________|
//
// |________ _________|     |    |________ _________|
//          |               |             |
//          V               V             V
//       old RDDs        keyRDDs      new RDDs
//

From the code, you can see that it performs the following operations:

previous window - old RDDs = key RDDs
key RDDs + new RDDs = current window

// Get the previous window's reduced value
var tempValue = arrayOfValues(0).head
// If old values exist, inverse-reduce them out of the previous window's value
if (!oldValues.isEmpty) {
  tempValue = invReduceF(tempValue, oldValues.reduce(reduceF)) // user-defined function: subtract the second argument from the first
}
// If new values exist, reduce them into the previous window's value
if (!newValues.isEmpty) {
  tempValue = reduceF(tempValue, newValues.reduce(reduceF)) // user-defined function: add the second argument to the first
}

The result of the previous window is then kept around as the cached value:

// Get the RDD of the reduced value of the previous window
val previousWindowRDD =
  getOrCompute(previousWindow.endTime).getOrElse(ssc.sc.makeRDD(Seq[(K, V)]()))

So every non-initial window still has to process both the old RDDs and the new RDDs, which still looks fairly involved. Is it really that much more efficient? That remains to be tested.
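One way to test it (a sketch of my own, not from the original post) is to run the same input once with each overload and compare the per-batch processing times reported through a StreamingListener; `ssc` here is assumed to be the StreamingContext created in the job above:

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

ssc.addStreamingListener(new StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    // processingDelay is the time (in ms) the batch spent being processed.
    val delayMs = batchCompleted.batchInfo.processingDelay.getOrElse(-1L)
    println(s"batch ${batchCompleted.batchInfo.batchTime}: processing took $delayMs ms")
  }
})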
