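Learning Flink, Part 11: Window & EventTime Examples
The previous part tried out Processing Time. This part looks at Event Time, and at Watermarks, which you have to pay attention to in any Event Time scenario.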
EventTime & Watermark
Event time programs must specify how to generate Event Time Watermarks, which is the mechanism that signals progress in event time.
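A program that runs on event time must specify how watermarks are generated.
The following is quoted from "从0到1学习flink" (Learning Flink from 0 to 1) and the official documentation: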
A stream processor that supports event time needs a way to measure the progress of event time. For example, a window operator that builds hourly windows needs to be notified when event time has passed beyond the end of an hour, so that the operator can close the window in progress.
Event time can progress independently of processing time (measured by wall clocks). For example, in one program the current event time of an operator may trail slightly behind the processing time (accounting for a delay in receiving the events), while both proceed at the same speed. On the other hand, another streaming program might progress through weeks of event time with only a few seconds of processing, by fast-forwarding through some historic data already buffered in a Kafka topic (or another message queue).
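The mechanism in Flink to measure progress in event time is watermarks. Watermarks flow as part of the data stream and carry a timestamp t. A Watermark(t) declares that event time has reached time t in that stream, meaning that there should be no more elements from the stream with a timestamp t' <= t (i.e. events with timestamps older than or equal to the watermark).
The figure below shows a stream of events with (logical) timestamps and watermarks flowing inline. In this example the events are in order (with respect to their timestamps), meaning that the watermarks are simply periodic markers in the stream.
stream_watermark_in_order
Watermarks are crucial for out-of-order streams, as illustrated below, where the events are not ordered by their timestamps. In general, a watermark is a declaration that by that point in the stream, all events up to a certain timestamp should have arrived. Once a watermark reaches an operator, the operator can advance its internal event time clock to the value of the watermark.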
stream_watermark_out_of_order
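My understanding: if the time characteristic in Flink is set to Event Time, watermarks must be configured, because they are the marker that tells Flink how far event time has progressed.
Once Watermark(time1) has been emitted, it asserts that every element with a timestamp time2 <= time1 should already have arrived, whether the stream is ordered or out of order. An element that still shows up with an older timestamp is a late element.
Who generates the watermarks? Sorry, it is the job code running inside Flink that generates them, not the data source system itself.
Does every element get its own watermark? A 1:1 mapping is possible but not required; it depends on what the job needs.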
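To make the declaration behind Watermark(t) concrete, here is a minimal plain-Java sketch. It does not use Flink, and the class and method names (WatermarkLateness, isLate, lateElements) are made up for illustration. It only checks which arriving timestamps contradict a watermark that has already been emitted.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java illustration of what Watermark(t) declares; not Flink's implementation. */
class WatermarkLateness {

    /** An element contradicts the watermark when its timestamp is at or below it. */
    static boolean isLate(long elementTimestamp, long currentWatermark) {
        return elementTimestamp <= currentWatermark;
    }

    /** Returns, in arrival order, the timestamps that the given watermark already covers. */
    static List<Long> lateElements(long[] timestampsInArrivalOrder, long watermark) {
        List<Long> late = new ArrayList<>();
        for (long ts : timestampsInArrivalOrder) {
            if (isLate(ts, watermark)) {
                late.add(ts);
            }
        }
        return late;
    }

    public static void main(String[] args) {
        long watermark = 11L;                      // Watermark(11) was already seen
        long[] arrivals = {15L, 9L, 12L, 11L, 20L}; // elements arriving afterwards
        System.out.println(lateElements(arrivals, watermark)); // prints [9, 11]
    }
}
```

How a job reacts to such late elements (dropping them, allowing some lateness, and so on) is a separate decision that this sketch does not cover.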
It is possible to generate a watermark on every single event. However, because each watermark causes some computation downstream, an excessive number of watermarks degrades performance.
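Watermarks in Parallel Streams
Watermarks are generated at, or directly after, source functions. Each parallel subtask of a source function usually generates its watermarks independently. These watermarks define the event time at that particular parallel source.
As the watermarks flow through the streaming program, they advance the event time at the operators where they arrive. Whenever an operator advances its event time, it generates a new watermark downstream for its successor operators.
Some operators consume multiple input streams: a union, for example, or operators following a keyBy(…) or partition(…) function. Such an operator's current event time is the minimum of its input streams' event times. As its input streams update their event times, so does the operator.
The figure below shows an example of events and watermarks flowing through parallel streams, and operators tracking event time.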
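The "minimum of the inputs" rule can be sketched in a few lines of plain Java. This is only an illustration of the rule described above, not Flink's internal code; the class MinWatermarkTracker and its method names are hypothetical.

```java
import java.util.Arrays;

/** Sketch of a multi-input operator tracking event time as the minimum over its inputs. */
class MinWatermarkTracker {
    private final long[] inputWatermarks;
    private long currentEventTime = Long.MIN_VALUE;

    MinWatermarkTracker(int numInputs) {
        inputWatermarks = new long[numInputs];
        // Until an input reports a watermark, it holds the operator back.
        Arrays.fill(inputWatermarks, Long.MIN_VALUE);
    }

    /** Records a watermark from one input and returns the operator's current event time. */
    long onWatermark(int inputIndex, long watermark) {
        // A single input's watermark never moves backwards.
        inputWatermarks[inputIndex] = Math.max(inputWatermarks[inputIndex], watermark);
        long min = Long.MAX_VALUE;
        for (long w : inputWatermarks) {
            min = Math.min(min, w);
        }
        // The operator's event time only ever advances.
        currentEventTime = Math.max(currentEventTime, min);
        return currentEventTime;
    }

    public static void main(String[] args) {
        MinWatermarkTracker op = new MinWatermarkTracker(2);
        System.out.println(op.onWatermark(0, 29)); // input 1 has not reported yet: Long.MIN_VALUE
        System.out.println(op.onWatermark(1, 14)); // prints 14
        System.out.println(op.onWatermark(1, 33)); // prints 29, input 0 is now the slowest
    }
}
```

The slowest input always determines the operator's event time, which is why a single stalled input can hold back every window downstream.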
flink_parallel_streams_watermarks
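From the figure, event time originates at the source, and so do the watermarks.
Data flows from the source through a map transformation and is then processed in a window.
The rest I did not understand...
About Timestamps and Watermarks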
In order to work with event time, Flink needs to know the events’ timestamps, meaning each element in the stream needs to have its event timestamp assigned. This is usually done by accessing/extracting the timestamp from some field in the element.
Timestamp assignment goes hand-in-hand with generating watermarks, which tell the system about progress in event time.
There are two ways to assign timestamps and generate watermarks:
- Directly in the data stream source
- Via a timestamp assigner / watermark generator: in Flink, timestamp assigners also define the watermarks to be emitted
Attention: Both timestamps and watermarks are specified as milliseconds since the Java epoch of 1970-01-01T00:00:00Z.
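To restate: in event time mode Flink must know the timestamp of every event, so each element in the stream needs a timestamp assigned, usually taken from a field of the element.
Assigning timestamps and generating watermarks are generally handled together (hand-in-hand).
There are two ways to assign timestamps and generate watermarks:
- Directly in the data source
- Via a timestamp assigner (also called a watermark generator). In Flink, a timestamp assigner is also a watermark generator.
Assigning Directly in the Data Source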
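Since timestamps and watermarks are both plain epoch milliseconds, converting an event's time field is usually a one-liner. The sketch below uses only the JDK; the class name EpochMillis and the sample values are made up for illustration.

```java
import java.time.Instant;
import java.util.Date;

/** Shows the millisecond-since-epoch values that timestamps and watermarks are expressed in. */
class EpochMillis {

    /** Parses an ISO-8601 instant such as 1970-01-01T00:00:01Z into epoch milliseconds. */
    static long toEpochMillis(String isoInstant) {
        return Instant.parse(isoInstant).toEpochMilli();
    }

    public static void main(String[] args) {
        System.out.println(toEpochMillis("1970-01-01T00:00:00Z")); // prints 0
        System.out.println(toEpochMillis("1970-01-01T00:00:01Z")); // prints 1000
        // java.util.Date#getTime() returns the same unit
        System.out.println(new Date(1000L).getTime());             // prints 1000
    }
}
```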
Stream sources can directly assign timestamps to the elements they produce, and they can also emit watermarks. When this is done, no timestamp assigner is needed. Note that if a timestamp assigner is used, any timestamps and watermarks provided by the source will be overwritten.
To assign a timestamp to an element in the source directly, the source must use the collectWithTimestamp(...) method on the SourceContext. To generate watermarks, the source must call the emitWatermark(Watermark) function.
For example, the earlier MySQL data source with Spring was implemented like this:
@Override
public void run(SourceContext<UrlInfo> sourceContext) throws Exception {
    log.info("------query ");
    if (urlInfoManager == null) {
        init();
    }
    List<UrlInfo> urlInfoList = urlInfoManager.queryAll();
    urlInfoList.parallelStream().forEach(urlInfo -> sourceContext.collect(urlInfo));
}
To attach a timestamp, call collectWithTimestamp; to generate a watermark, call emitWatermark.
The modified version:
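@Override
public void run(SourceContext<UrlInfo> sourceContext) throws Exception {
    log.info("------query ");
    if (urlInfoManager == null) {
        init();
    }
    List<UrlInfo> urlInfoList = urlInfoManager.queryAll();
    // Emit sequentially: a parallel stream would call the SourceContext from several threads at once
    for (UrlInfo urlInfo : urlInfoList) {
        // Use the event's own time when present, otherwise fall back to the current time
        long timestamp = urlInfo.getCurrentTime() == null
                ? System.currentTimeMillis()
                : urlInfo.getCurrentTime().getTime();
        // Emit the element with its timestamp (this replaces the plain collect(...) call,
        // so the element is not emitted twice)
        sourceContext.collectWithTimestamp(urlInfo, timestamp);
        // Emit a watermark for this element
        sourceContext.emitWatermark(new Watermark(timestamp));
    }
}
Note the two new calls: both the timestamp and the watermark are produced for every element.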
Assigning via Timestamp Assigners / Watermark Generators
Timestamp assigners take a stream and produce a new stream with timestamped elements and watermarks. If the original stream had timestamps and/or watermarks already, the timestamp assigner overwrites them.
Timestamp assigners are usually specified immediately after the data source, but it is not strictly required to do so. A common pattern, for example, is to parse (MapFunction) and filter (FilterFunction) before the timestamp assigner. In any case, the timestamp assigner needs to be specified before the first operation on event time (such as the first window operation). As a special case, when using Kafka as the source of a streaming job, Flink allows the specification of a timestamp assigner / watermark emitter inside the source (or consumer) itself. More information on how to do so can be found in the Kafka Connector documentation.
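In short, a timestamp assigner takes a stream as input and outputs a stream of elements carrying timestamps and watermarks. If the stream already had timestamps or watermarks, they are overwritten.
A timestamp assigner is usually specified right after the data source is set up, but this is not mandatory. A common pattern is to specify it after parsing and filtering; in any case it must be specified before the first operation that works on event time.
First, an example: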
public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    MysqlDSWithSpringForFlink streamSource = new MysqlDSWithSpringForFlink();
    DataStreamSource<UrlInfo> dataStreamSource = env.addSource(streamSource);
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
    SingleOutputStreamOperator<UrlInfo> withTimestampAndWatermarkStream =
            dataStreamSource
                    // Objects.equals (java.util.Objects) compares by value, unlike ==
                    .filter(o -> Objects.equals(o.getDomain(), UrlInfo.BAIDU))
                    .assignTimestampsAndWatermarks(new MyTimestampAndWatermarkAssigner());
    // Print the timestamped stream rather than the raw source stream
    withTimestampAndWatermarkStream.addSink(new PrintSinkFunction<>());
    env.execute("mysql Datasource with pool and spring");
}
As you can see, assignTimestampsAndWatermarks is applied after the filter.
With Periodic Watermarks
AssignerWithPeriodicWatermarks assigns timestamps and generates watermarks periodically (possibly depending on the stream elements, or purely based on processing time).
The interval (every n milliseconds) in which the watermark will be generated is defined via ExecutionConfig.setAutoWatermarkInterval(...). The assigner’s getCurrentWatermark() method will be called each time, and a new watermark will be emitted if the returned watermark is non-null and larger than the previous watermark.
To generate watermarks periodically rather than for every element, implement the AssignerWithPeriodicWatermarks interface. The interval, in milliseconds, is set through ExecutionConfig.setAutoWatermarkInterval.
public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    MysqlDSWithSpringForFlink streamSource = new MysqlDSWithSpringForFlink();
    DataStreamSource<UrlInfo> dataStreamSource = env.addSource(streamSource);
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
    // Set the watermark interval (milliseconds)
    ExecutionConfig config = env.getConfig();
    config.setAutoWatermarkInterval(300);
    SingleOutputStreamOperator<UrlInfo> withTimestampAndWatermarkStream =
            dataStreamSource
                    // Objects.equals (java.util.Objects) compares by value, unlike ==
                    .filter(o -> Objects.equals(o.getDomain(), UrlInfo.BAIDU))
                    .assignTimestampsAndWatermarks(new TimeLagWatermarkGenerator());
    // Print the timestamped stream rather than the raw source stream
    withTimestampAndWatermarkStream.addSink(new PrintSinkFunction<>());
    env.execute("mysql Datasource with pool and spring");
}
As you can see, the watermark interval is set through ExecutionConfig, and TimeLagWatermarkGenerator is added after the filter. Its code is as follows (taken from the official documentation, slightly modified):
import myflink.model.UrlInfo;
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.watermark.Watermark;

/**
 * This generator generates watermarks that are lagging behind processing time by a fixed amount.
 * It assumes that elements arrive in Flink after a bounded delay.
 */
public class TimeLagWatermarkGenerator implements AssignerWithPeriodicWatermarks<UrlInfo> {

    private final long maxTimeLag = 5000; // 5 seconds

    @Override
    public long extractTimestamp(UrlInfo element, long previousElementTimestamp) {
        // The event time comes from the element's own time field
        return element.getCurrentTime().getTime();
    }

    @Override
    public Watermark getCurrentWatermark() {
        // return the watermark as current time minus the maximum time lag
        return new Watermark(System.currentTimeMillis() - maxTimeLag);
    }
}
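The generator above derives its watermark from the wall clock. A different periodic strategy derives it from the data: remember the largest timestamp seen so far and stay a fixed bound behind it. The plain-Java sketch below illustrates only that logic under stated assumptions; it does not implement the Flink interface, and the class name BoundedOutOfOrdernessSketch and the 3500 ms bound are chosen for illustration.

```java
/** Sketch of a periodic watermark that trails the largest seen timestamp by a fixed bound. */
class BoundedOutOfOrdernessSketch {
    private final long maxOutOfOrdernessMillis;
    private long maxTimestampSeen = Long.MIN_VALUE;

    BoundedOutOfOrdernessSketch(long maxOutOfOrdernessMillis) {
        this.maxOutOfOrdernessMillis = maxOutOfOrdernessMillis;
    }

    /** Called once per element: remembers the largest timestamp and returns the element's own. */
    long extractTimestamp(long elementTimestamp) {
        maxTimestampSeen = Math.max(maxTimestampSeen, elementTimestamp);
        return elementTimestamp;
    }

    /** Called periodically: the watermark is the largest timestamp minus the allowed disorder. */
    long currentWatermark() {
        // Avoid arithmetic underflow before any element has been seen.
        if (maxTimestampSeen == Long.MIN_VALUE) {
            return Long.MIN_VALUE;
        }
        return maxTimestampSeen - maxOutOfOrdernessMillis;
    }

    public static void main(String[] args) {
        BoundedOutOfOrdernessSketch generator = new BoundedOutOfOrdernessSketch(3500L);
        generator.extractTimestamp(10_000L);
        generator.extractTimestamp(8_000L);  // out of order, does not move the maximum
        generator.extractTimestamp(12_000L);
        System.out.println(generator.currentWatermark()); // prints 8500
    }
}
```

Compared with the time-lag approach, this variant keeps working when historic data is replayed faster or slower than real time, because it never looks at the wall clock.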
With Punctuated Watermarks
To generate watermarks whenever a certain event indicates that a new watermark might be generated, use AssignerWithPunctuatedWatermarks. For this class Flink will first call the extractTimestamp(...) method to assign the element a timestamp, and then immediately call the checkAndGetNextWatermark(...) method on that element.
The checkAndGetNextWatermark(...) method is passed the timestamp that was assigned in the extractTimestamp(...) method, and can decide whether it wants to generate a watermark. Whenever the checkAndGetNextWatermark(...) method returns a non-null watermark, and that watermark is larger than the latest previous watermark, that new watermark will be emitted.
public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    MysqlDSWithSpringForFlink streamSource = new MysqlDSWithSpringForFlink();
    DataStreamSource<UrlInfo> dataStreamSource = env.addSource(streamSource);
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
    SingleOutputStreamOperator<UrlInfo> withTimestampAndWatermarkStream =
            dataStreamSource
                    // Objects.equals (java.util.Objects) compares by value, unlike ==
                    .filter(o -> Objects.equals(o.getDomain(), UrlInfo.BAIDU))
                    .assignTimestampsAndWatermarks(new PunctuatedAssigner());
    // Print the timestamped stream rather than the raw source stream
    withTimestampAndWatermarkStream.addSink(new PrintSinkFunction<>());
    env.execute("mysql Datasource with pool and spring");
}
import myflink.model.UrlInfo;
import org.apache.flink.streaming.api.functions.AssignerWithPunctuatedWatermarks;
import org.apache.flink.streaming.api.watermark.Watermark;

public class PunctuatedAssigner implements AssignerWithPunctuatedWatermarks<UrlInfo> {

    @Override
    public long extractTimestamp(UrlInfo element, long previousElementTimestamp) {
        return element.getCurrentTime().getTime();
    }

    @Override
    public Watermark checkAndGetNextWatermark(UrlInfo lastElement, long extractedTimestamp) {
        // Creates a new watermark with the given timestamp in milliseconds,
        // but only when the element carries a watermark marker.
        return lastElement.hasWatermarkMarker() ? new Watermark(extractedTimestamp) : null;
    }
}
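The emission rule quoted above (emit only when the returned watermark is non-null and larger than the previous one) can be simulated without Flink. In this sketch the marker flags stand in for hasWatermarkMarker(), and the class name PunctuatedEmissionSketch and the sample data are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Simulates which punctuated watermarks would actually be emitted for a sequence of elements. */
class PunctuatedEmissionSketch {

    /**
     * timestamps[i] is the extracted timestamp of element i; markers[i] says whether that
     * element asks for a watermark. Returns the watermarks that pass the emission rule.
     */
    static List<Long> emittedWatermarks(long[] timestamps, boolean[] markers) {
        List<Long> emitted = new ArrayList<>();
        long lastEmitted = Long.MIN_VALUE;
        for (int i = 0; i < timestamps.length; i++) {
            // Stand-in for checkAndGetNextWatermark: null means "no watermark for this element"
            Long candidate = markers[i] ? Long.valueOf(timestamps[i]) : null;
            // Emit only non-null candidates that are larger than the previous watermark
            if (candidate != null && candidate > lastEmitted) {
                emitted.add(candidate);
                lastEmitted = candidate;
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        long[] timestamps = {5L, 9L, 4L, 12L};
        boolean[] markers = {true, true, true, false};
        // 4 is dropped because it is not larger than 9; 12 carries no marker
        System.out.println(emittedWatermarks(timestamps, markers)); // prints [5, 9]
    }
}
```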
Kafka
When using Apache Kafka as a data source, each Kafka partition may have a simple event time pattern (ascending timestamps or bounded out-of-orderness). However, when consuming streams from Kafka, multiple partitions often get consumed in parallel, interleaving the events from the partitions and destroying the per-partition patterns (this is inherent in how Kafka’s consumer clients work).
In that case, you can use Flink’s Kafka-partition-aware watermark generation. Using that feature, watermarks are generated inside the Kafka consumer, per Kafka partition, and the per-partition watermarks are merged in the same way as watermarks are merged on stream shuffles.
For example, if event timestamps are strictly ascending per Kafka partition, generating per-partition watermarks with the ascending timestamps watermark generator will result in perfect overall watermarks.
The illustrations below show how to use the per-Kafka-partition watermark generation, and how watermarks propagate through the streaming dataflow in that case.
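A Kafka topic has multiple partitions, and each partition may follow its own event time pattern. On the consumer side, data from several partitions is consumed in parallel and interleaved, which destroys the per-partition event time pattern.
In that case, Flink's Kafka-partition-aware watermark generation can be used, as in the following code: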
public static void main(String[] args) throws Exception {
    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
    Properties properties = new Properties();
    properties.put("bootstrap.servers", "localhost:9092");
    properties.put("zookeeper.connect", "localhost:2181");
    properties.put("group.id", "metric-group");
    properties.put("auto.offset.reset", "latest");
    properties.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    properties.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    FlinkKafkaConsumer010<String> kafkaSource = new FlinkKafkaConsumer010<>(
            "testjin", // topic
            new SimpleStringSchema(),
            properties);
    // Attach the assigner to the consumer itself so that watermarks are generated per Kafka partition.
    // The consumer emits String records, so the JSON is parsed here just to read the timestamp.
    kafkaSource.assignTimestampsAndWatermarks(new AscendingTimestampExtractor<String>() {
        @Override
        public long extractAscendingTimestamp(String element) {
            return JSON.parseObject(element, UrlInfo.class).getCurrentTime().getTime();
        }
    });
    SingleOutputStreamOperator<UrlInfo> dataStreamSource = env.addSource(kafkaSource)
            .setParallelism(1)
            // map: convert one stream into another, here String --> UrlInfo
            .map(string -> JSON.parseObject(string, UrlInfo.class));
    env.execute("save url to db");
}
Note the use of AscendingTimestampExtractor, a timestamp assigner for timestamps that arrive in ascending order.
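Why per-partition generation matters can be shown with a small plain-Java simulation. It is a sketch only: it ignores details of the real extractor, and the class name PerPartitionWatermarkSketch and the sample timestamps are made up. Two partitions are each ascending, but their interleaving is not; tracking one watermark per partition and merging by minimum still gives a watermark that never overtakes an element.

```java
import java.util.Arrays;

/** Simulates interleaved Kafka partitions with one shared vs. per-partition ascending watermarks. */
class PerPartitionWatermarkSketch {

    /** Counts elements that break the ascending assumption when one generator sees all partitions. */
    static int ascendingViolations(long[] timestamps) {
        int violations = 0;
        long max = Long.MIN_VALUE;
        for (long ts : timestamps) {
            if (ts < max) {
                violations++;
            }
            max = Math.max(max, ts);
        }
        return violations;
    }

    /** Merged watermark after each element when every partition tracks its own ascending watermark. */
    static long[] mergedWatermarks(long[] timestamps, int[] partitions, int numPartitions) {
        long[] perPartition = new long[numPartitions];
        Arrays.fill(perPartition, Long.MIN_VALUE);
        long[] merged = new long[timestamps.length];
        for (int i = 0; i < timestamps.length; i++) {
            // Within a partition timestamps ascend, so the latest one is that partition's watermark
            perPartition[partitions[i]] = timestamps[i];
            long min = Long.MAX_VALUE;
            for (long w : perPartition) {
                min = Math.min(min, w);
            }
            merged[i] = min; // merged the same way as on a stream shuffle: the minimum
        }
        return merged;
    }

    public static void main(String[] args) {
        // Partition 0 carries 1, 4, 7 and partition 1 carries 2, 3, 9, interleaved on arrival
        long[] timestamps = {1L, 2L, 4L, 3L, 7L, 9L};
        int[] partitions = {0, 1, 0, 1, 0, 1};
        System.out.println(ascendingViolations(timestamps)); // prints 1 (the element 3 after 4)
        System.out.println(Arrays.toString(mergedWatermarks(timestamps, partitions, 2)));
    }
}
```

In the merged sequence the watermark stays at or below every element that arrives later, so the element with timestamp 3 is not treated as late.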
References:
http://www.54tianzhisheng.cn/2018/12/11/Flink-time/
https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/event_time.html
https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/event_timestamps_watermarks.html