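Learning Flink, Part 11: Window & EventTime Examples
The previous post tried out Processing Time. This one looks at Event Time and at Watermarks, which you have to deal with in any Event Time scenario.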
EventTime & Watermark
Event time programs must specify how to generate Event Time Watermarks, which is the mechanism that signals progress in event time.
A program based on event time must specify how its watermarks are generated.
The following is quoted from 《从0到1学习flink》 and the official documentation:
A stream processor that supports event time needs a way to measure the progress of event time. For example, a window operator that builds hourly windows needs to be notified when event time has passed beyond the end of an hour, so that the operator can close the window in progress.
Event time can progress independently of processing time (measured by wall clocks). For example, in one program the current event time of an operator may trail slightly behind the processing time (accounting for a delay in receiving the events), while both proceed at the same speed. On the other hand, another streaming program might progress through weeks of event time with only a few seconds of processing, by fast-forwarding through some historic data already buffered in a Kafka topic (or another message queue).
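The mechanism Flink uses to measure progress in Event Time is Watermarks. Watermarks flow as part of the data stream and carry a timestamp t. A Watermark(t) declares that event time has reached time t in that stream, meaning that there should be no more elements from the stream with a timestamp t' <= t (i.e. events with timestamps older than or equal to the watermark).
The figure below shows a stream of events with (logical) timestamps and watermarks flowing inline. In this example the events are in order (with respect to their timestamps), so the watermarks are simply periodic markers in the stream.
[Figure: stream_watermark_in_order]
Watermarks are crucial for out-of-order streams, as illustrated below, where the events are not ordered by their timestamps. In general a watermark is a declaration that by that point in the stream, all events up to a certain timestamp should have arrived. Once a watermark reaches an operator, the operator can advance its internal event time clock to the value of the watermark.
[Figure: stream_watermark_out_of_order]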
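My understanding: if the time characteristic in Flink is set to Event Time, watermarks must be set up as well, because they are the signal that tells Flink how far event time has progressed.
Once Watermark(time1) has been emitted, Flink assumes that every element whose timestamp time2 is at or before time1 has already arrived, whether the stream is in order or out of order. An element that shows up afterwards with such a timestamp counts as late.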
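The rule above can be tried out without Flink. The following is a minimal plain-Java sketch, not Flink code: the class `WatermarkLatenessSketch`, its `Item` type and the sample stream are hypothetical names made up for illustration. It walks a stream of events and watermarks and reports which events arrive after a watermark that already covers their timestamp.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java illustration of watermark semantics; this is not Flink code. */
class WatermarkLatenessSketch {

    /** One entry of the simulated stream: either an event or a watermark. */
    static final class Item {
        final long timestamp;
        final boolean isWatermark;

        Item(long timestamp, boolean isWatermark) {
            this.timestamp = timestamp;
            this.isWatermark = isWatermark;
        }
    }

    static Item event(long t) { return new Item(t, false); }

    static Item watermark(long t) { return new Item(t, true); }

    /** Returns the timestamps of events that arrive after a watermark that already covers them. */
    static List<Long> lateEvents(List<Item> stream) {
        long currentWatermark = Long.MIN_VALUE;
        List<Long> late = new ArrayList<>();
        for (Item item : stream) {
            if (item.isWatermark) {
                // Event time never moves backwards
                currentWatermark = Math.max(currentWatermark, item.timestamp);
            } else if (item.timestamp <= currentWatermark) {
                late.add(item.timestamp);
            }
        }
        return late;
    }

    public static void main(String[] args) {
        List<Item> stream = Arrays.asList(
                event(7), event(11), event(9), watermark(11),
                event(15), event(10), event(17), watermark(17), event(21));
        System.out.println(lateEvents(stream)); // prints [10]
    }
}
```

Event 9 is out of order but still arrives before Watermark(11), so it is fine; event 10 arrives after it and is therefore late.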
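Who generates the watermarks? The job code running inside Flink does, not the data source itself.
Does every element get its own watermark? It can be 1:1, but it does not have to be; it depends on your needs and the actual situation.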
It is possible to generate a watermark on every single event. However, because each watermark causes some computation downstream, an excessive number of watermarks degrades performance.
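Watermarks in Parallel Streams
Watermarks are generated at, or directly after, source functions. Each parallel subtask of a source function usually generates its watermarks independently. These watermarks define the event time at that particular parallel source.
As the watermarks flow through the streaming program, they advance the event time at the operators where they arrive. Whenever an operator advances its event time, it generates a new watermark downstream for its successor operators.
Some operators consume multiple input streams: a union, for example, or operators following a keyBy(…) or partition(…) function. Such an operator's current event time is the minimum of its input streams' event times. As its input streams update their event times, so does the operator.
The figure below shows an example of events and watermarks flowing through parallel streams, and operators tracking event time.
[Figure: flink_parallel_streams_watermarks]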
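According to the figure, event time originates at the source, and so do the watermarks.
The data flows from the source through a map transformation and is then processed in windows.
I have not figured out the rest of the figure yet.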
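The part of the figure that is easy to miss is the rule for operators with several inputs: the operator's event time is the minimum of the latest watermarks seen on each input. Below is a minimal plain-Java sketch of that bookkeeping. It is not Flink's implementation; the class `MinWatermarkSketch` and its method names are hypothetical.

```java
import java.util.Arrays;

/** Plain-Java illustration of how a multi-input operator tracks event time; not Flink code. */
class MinWatermarkSketch {
    private final long[] channelWatermarks;
    private long operatorEventTime = Long.MIN_VALUE;

    /** numInputs is assumed to be at least 1. */
    MinWatermarkSketch(int numInputs) {
        channelWatermarks = new long[numInputs];
        Arrays.fill(channelWatermarks, Long.MIN_VALUE);
    }

    /** Records a watermark from one input and returns the operator's event time afterwards. */
    long onWatermark(int input, long watermark) {
        // Each input keeps its own highest watermark
        channelWatermarks[input] = Math.max(channelWatermarks[input], watermark);
        // The operator may only advance to the slowest input
        long min = Arrays.stream(channelWatermarks).min().getAsLong();
        operatorEventTime = Math.max(operatorEventTime, min);
        return operatorEventTime;
    }

    public static void main(String[] args) {
        MinWatermarkSketch window = new MinWatermarkSketch(2);
        window.onWatermark(0, 29);                      // input 1 has not reported yet
        System.out.println(window.onWatermark(1, 14));  // prints 14
        System.out.println(window.onWatermark(1, 33));  // prints 29
    }
}
```

The operator stays at 14 until the slower input catches up, even though the other input is already at 29. One slow or idle input therefore holds back event time for the whole operator.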
About Timestamps and Watermarks
In order to work with event time, Flink needs to know the events’ timestamps, meaning each element in the stream needs to have its event timestamp assigned. This is usually done by accessing/extracting the timestamp from some field in the element.
Timestamp assignment goes hand-in-hand with generating watermarks, which tell the system about progress in event time.
There are two ways to assign timestamps and generate watermarks:
- Directly in the data stream source
- Via a timestamp assigner / watermark generator: in Flink, timestamp assigners also define the watermarks to be emitted
Attention: Both timestamps and watermarks are specified as milliseconds since the Java epoch of 1970-01-01T00:00:00Z.
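As a quick check of what "milliseconds since the epoch" means in practice, here is a small sketch using only the JDK. The class name `EpochMillisSketch` is made up for illustration; `Instant.parse`, `Instant.toEpochMilli` and `Date.getTime` are standard JDK calls.

```java
import java.time.Instant;
import java.util.Date;

/** Shows two common ways to obtain epoch milliseconds from a point in time. */
class EpochMillisSketch {

    /** Converts an ISO-8601 UTC instant such as "1970-01-01T00:00:01Z" to epoch milliseconds. */
    static long fromIsoInstant(String isoInstant) {
        return Instant.parse(isoInstant).toEpochMilli();
    }

    /** java.util.Date already stores epoch milliseconds; getTime() returns them unchanged. */
    static long fromDate(Date date) {
        return date.getTime();
    }

    public static void main(String[] args) {
        System.out.println(fromIsoInstant("1970-01-01T00:00:01Z")); // prints 1000
        System.out.println(fromDate(new Date(1000L)));              // prints 1000
    }
}
```

This is why the examples in this post can pass `getCurrentTime().getTime()` straight through as the event timestamp, assuming `getCurrentTime()` returns a `java.util.Date`.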
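In other words, with event time Flink must know the timestamp of every event, so every element in the stream needs a timestamp assigned, usually taken from a field of the element.
Assigning timestamps and generating watermarks usually go hand in hand.
There are two ways to assign timestamps and generate watermarks:
- Directly in the data source
- Through a timestamp assigner (also called a watermark generator). In Flink, a timestamp assigner is also a watermark generator.
Directly in the data source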
Stream sources can directly assign timestamps to the elements they produce, and they can also emit watermarks. When this is done, no timestamp assigner is needed. Note that if a timestamp assigner is used, any timestamps and watermarks provided by the source will be overwritten.
To assign a timestamp to an element in the source directly, the source must use the collectWithTimestamp(...) method on the SourceContext. To generate watermarks, the source must call the emitWatermark(Watermark) function.
Take the earlier MySQL data source with Spring as an example. It was implemented like this:
@Override
public void run(SourceContext<UrlInfo> sourceContext) throws Exception {
    log.info("------query ");
    if (urlInfoManager == null) {
        init();
    }
    List<UrlInfo> urlInfoList = urlInfoManager.queryAll();
    // Emit sequentially: a parallel stream would call collect() from several threads at once
    urlInfoList.forEach(sourceContext::collect);
}
To attach a timestamp, call collectWithTimestamp; to generate a watermark, call emitWatermark.
After the change it looks like this:
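@Override
public void run(SourceContext<UrlInfo> sourceContext) throws Exception {
    log.info("------query ");
    if (urlInfoManager == null) {
        init();
    }
    List<UrlInfo> urlInfoList = urlInfoManager.queryAll();
    // Emit sequentially so that watermarks go out in the same order as the elements.
    // Assumption: queryAll() returns the rows in ascending time order.
    urlInfoList.forEach(urlInfo -> {
        // Use the element's own time as its event timestamp; fall back to the wall clock if it is missing
        long timestamp = urlInfo.getCurrentTime() == null
                ? System.currentTimeMillis() : urlInfo.getCurrentTime().getTime();
        // Attach the timestamp; collectWithTimestamp already emits the element, so no extra collect() is needed
        sourceContext.collectWithTimestamp(urlInfo, timestamp);
        // Generate the watermark
        sourceContext.emitWatermark(new Watermark(timestamp));
    });
}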
Note the two added calls: both the timestamp and the watermark are produced for every single element.
Via Timestamp Assigners / Watermark Generators
Timestamp assigners take a stream and produce a new stream with timestamped elements and watermarks. If the original stream had timestamps and/or watermarks already, the timestamp assigner overwrites them.
Timestamp assigners are usually specified immediately after the data source, but it is not strictly required to do so. A common pattern, for example, is to parse (MapFunction) and filter (FilterFunction) before the timestamp assigner. In any case, the timestamp assigner needs to be specified before the first operation on event time (such as the first window operation). As a special case, when using Kafka as the source of a streaming job, Flink allows the specification of a timestamp assigner / watermark emitter inside the source (or consumer) itself. More information on how to do so can be found in the Kafka Connector documentation.
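A timestamp assigner takes a stream as input and outputs a stream of elements that carry timestamps and watermarks. Any timestamps and watermarks the stream already had are overwritten.
A timestamp assigner is usually specified right after the data source is created, but that is not mandatory. A common pattern is to specify it after parsing and filtering; in any case it must be specified before the first operation on event time.
Let's look at an example first: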
public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    MysqlDSWithSpringForFlink streamSource = new MysqlDSWithSpringForFlink();
    DataStreamSource<UrlInfo> dataStreamSource = env.addSource(streamSource);
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
    SingleOutputStreamOperator<UrlInfo> withTimestampAndWatermarkStream =
            dataStreamSource
                    // Objects.equals (java.util.Objects) compares by value; if getDomain() returns a String, == only compares references
                    .filter((FilterFunction<UrlInfo>) o -> Objects.equals(o.getDomain(), UrlInfo.BAIDU))
                    .assignTimestampsAndWatermarks(new MyTimestampAndWatermarkAssigner());
    // Print the filtered, timestamped stream rather than the raw source
    withTimestampAndWatermarkStream.addSink(new PrintSinkFunction<>());
    env.execute("mysql Datasource with pool and spring");
}
As you can see, assignTimestampsAndWatermarks is applied after the filter.
With Periodic Watermarks
AssignerWithPeriodicWatermarks assigns timestamps and generates watermarks periodically (possibly depending on the stream elements, or purely based on processing time).
The interval (every n milliseconds) in which the watermark will be generated is defined via ExecutionConfig.setAutoWatermarkInterval(...). The assigner's getCurrentWatermark() method will be called each time, and a new watermark will be emitted if the returned watermark is non-null and larger than the previous watermark.
To generate watermarks periodically rather than for every element, implement the AssignerWithPeriodicWatermarks interface. The interval is given in milliseconds and is set through ExecutionConfig.setAutoWatermarkInterval.
public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    MysqlDSWithSpringForFlink streamSource = new MysqlDSWithSpringForFlink();
    DataStreamSource<UrlInfo> dataStreamSource = env.addSource(streamSource);
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
    // Set the watermark interval (milliseconds)
    ExecutionConfig config = env.getConfig();
    config.setAutoWatermarkInterval(300);
    SingleOutputStreamOperator<UrlInfo> withTimestampAndWatermarkStream =
            dataStreamSource
                    // Objects.equals (java.util.Objects) compares by value; if getDomain() returns a String, == only compares references
                    .filter((FilterFunction<UrlInfo>) o -> Objects.equals(o.getDomain(), UrlInfo.BAIDU))
                    .assignTimestampsAndWatermarks(new TimeLagWatermarkGenerator());
    // Print the filtered, timestamped stream rather than the raw source
    withTimestampAndWatermarkStream.addSink(new PrintSinkFunction<>());
    env.execute("mysql Datasource with pool and spring");
}
Here the watermark interval is set through ExecutionConfig, and a TimeLagWatermarkGenerator is added after the filter. Its code is as follows (taken from the official site, slightly modified):
/**
 * This generator generates watermarks that are lagging behind processing time by a fixed amount.
 * It assumes that elements arrive in Flink after a bounded delay.
 */
public class TimeLagWatermarkGenerator implements AssignerWithPeriodicWatermarks<UrlInfo> {

    private final long maxTimeLag = 5000; // 5 seconds

    @Override
    public long extractTimestamp(UrlInfo element, long previousElementTimestamp) {
        return element.getCurrentTime().getTime();
    }

    @Override
    public Watermark getCurrentWatermark() {
        // return the watermark as current time minus the maximum time lag
        return new Watermark(System.currentTimeMillis() - maxTimeLag);
    }
}
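The generator above derives the watermark from the wall clock. The other periodic variant shown in the official documentation derives it from the data: it remembers the highest timestamp seen so far and lets the watermark trail behind it by a fixed bound. The following plain-Java sketch shows only that arithmetic. It does not implement the Flink interface; the class `BoundedOutOfOrdernessSketch` and the 3500 ms bound are illustrative assumptions.

```java
/** Plain-Java illustration of a bounded-out-of-orderness watermark; not a Flink assigner. */
class BoundedOutOfOrdernessSketch {
    private final long maxOutOfOrdernessMillis;
    private long currentMaxTimestamp = Long.MIN_VALUE;

    BoundedOutOfOrdernessSketch(long maxOutOfOrdernessMillis) {
        this.maxOutOfOrdernessMillis = maxOutOfOrdernessMillis;
    }

    /** Called once per element: remembers the highest timestamp seen so far. */
    long extractTimestamp(long elementTimestamp) {
        currentMaxTimestamp = Math.max(currentMaxTimestamp, elementTimestamp);
        return elementTimestamp;
    }

    /** Called periodically: the watermark trails the highest timestamp by the allowed bound. */
    long getCurrentWatermark() {
        if (currentMaxTimestamp == Long.MIN_VALUE) {
            return Long.MIN_VALUE; // nothing seen yet; also avoids arithmetic underflow
        }
        return currentMaxTimestamp - maxOutOfOrdernessMillis;
    }

    public static void main(String[] args) {
        BoundedOutOfOrdernessSketch generator = new BoundedOutOfOrdernessSketch(3500);
        generator.extractTimestamp(10_000);
        generator.extractTimestamp(9_000);   // out of order, does not move the maximum
        generator.extractTimestamp(12_000);
        System.out.println(generator.getCurrentWatermark()); // prints 8500
    }
}
```

With this approach the watermark only advances when new data arrives, whereas the time-lag generator keeps advancing with the clock even if the stream is idle.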
With Punctuated Watermarks
To generate watermarks whenever a certain event indicates that a new watermark might be generated, use AssignerWithPunctuatedWatermarks. For this class Flink will first call the extractTimestamp(...) method to assign the element a timestamp, and then immediately call the checkAndGetNextWatermark(...) method on that element.
The checkAndGetNextWatermark(...) method is passed the timestamp that was assigned in the extractTimestamp(...) method, and can decide whether it wants to generate a watermark. Whenever the checkAndGetNextWatermark(...) method returns a non-null watermark, and that watermark is larger than the latest previous watermark, that new watermark will be emitted.
public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    MysqlDSWithSpringForFlink streamSource = new MysqlDSWithSpringForFlink();
    DataStreamSource<UrlInfo> dataStreamSource = env.addSource(streamSource);
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
    SingleOutputStreamOperator<UrlInfo> withTimestampAndWatermarkStream =
            dataStreamSource
                    // Objects.equals (java.util.Objects) compares by value; if getDomain() returns a String, == only compares references
                    .filter((FilterFunction<UrlInfo>) o -> Objects.equals(o.getDomain(), UrlInfo.BAIDU))
                    .assignTimestampsAndWatermarks(new PunctuatedAssigner());
    // Print the filtered, timestamped stream rather than the raw source
    withTimestampAndWatermarkStream.addSink(new PrintSinkFunction<>());
    env.execute("mysql Datasource with pool and spring");
}
import myflink.model.UrlInfo;
import org.apache.flink.streaming.api.functions.AssignerWithPunctuatedWatermarks;
import org.apache.flink.streaming.api.watermark.Watermark;

public class PunctuatedAssigner implements AssignerWithPunctuatedWatermarks<UrlInfo> {

    @Override
    public long extractTimestamp(UrlInfo element, long previousElementTimestamp) {
        return element.getCurrentTime().getTime();
    }

    @Override
    public Watermark checkAndGetNextWatermark(UrlInfo lastElement, long extractedTimestamp) {
        // Create a new watermark with the given timestamp (milliseconds) only for marker elements
        return lastElement.hasWatermarkMarker() ? new Watermark(extractedTimestamp) : null;
    }
}
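The emission rule quoted above (non-null and larger than the previous watermark) can be simulated in a few lines. This is a plain-Java sketch, not Flink code; the class `PunctuatedEmissionSketch` and the sample data are hypothetical, and the boolean array stands in for `hasWatermarkMarker()`.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java illustration of which punctuated watermarks actually get emitted; not Flink code. */
class PunctuatedEmissionSketch {

    /**
     * timestamps[i] is the extracted timestamp of element i;
     * hasMarker[i] says whether that element carries a watermark marker.
     */
    static List<Long> emittedWatermarks(long[] timestamps, boolean[] hasMarker) {
        List<Long> emitted = new ArrayList<>();
        long lastEmitted = Long.MIN_VALUE;
        for (int i = 0; i < timestamps.length; i++) {
            // Mirrors checkAndGetNextWatermark: null means "no watermark for this element"
            Long candidate = hasMarker[i] ? Long.valueOf(timestamps[i]) : null;
            // Only a non-null watermark that is larger than the previous one is emitted
            if (candidate != null && candidate > lastEmitted) {
                lastEmitted = candidate;
                emitted.add(candidate);
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        long[] timestamps = {5, 9, 7, 12};
        boolean[] hasMarker = {false, true, true, true};
        System.out.println(emittedWatermarks(timestamps, hasMarker)); // prints [9, 12]
    }
}
```

The element with timestamp 7 carries a marker, but its watermark is dropped because Watermark(9) was already emitted.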
Kafka
When using Apache Kafka as a data source, each Kafka partition may have a simple event time pattern (ascending timestamps or bounded out-of-orderness). However, when consuming streams from Kafka, multiple partitions often get consumed in parallel, interleaving the events from the partitions and destroying the per-partition patterns (this is inherent in how Kafka’s consumer clients work).
In that case, you can use Flink’s Kafka-partition-aware watermark generation. Using that feature, watermarks are generated inside the Kafka consumer, per Kafka partition, and the per-partition watermarks are merged in the same way as watermarks are merged on stream shuffles.
For example, if event timestamps are strictly ascending per Kafka partition, generating per-partition watermarks with the ascending timestamps watermark generator will result in perfect overall watermarks.
The illustrations below show how to use the per-Kafka-partition watermark generation, and how watermarks propagate through the streaming dataflow in that case.
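A Kafka topic has several partitions, and each partition may follow its own simple event time pattern. On the consumer side, several partitions are read in parallel and their events are interleaved, which destroys the per-partition patterns.
In that case you can use Flink's Kafka-partition-aware watermark generation, as in the following code: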
public static void main(String[] args) throws Exception {
    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
    Properties properties = new Properties();
    properties.put("bootstrap.servers", "localhost:9092");
    properties.put("zookeeper.connect", "localhost:2181");
    properties.put("group.id", "metric-group");
    properties.put("auto.offset.reset", "latest");
    properties.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    properties.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    FlinkKafkaConsumer010<String> consumer = new FlinkKafkaConsumer010<>(
            "testjin", // topic
            new SimpleStringSchema(),
            properties);
    // Assign on the consumer itself: this is what makes the watermarks per Kafka partition.
    // Calling assignTimestampsAndWatermarks on the stream after map() would not be partition-aware.
    consumer.assignTimestampsAndWatermarks(new AscendingTimestampExtractor<String>() {
        @Override
        public long extractAscendingTimestamp(String element) {
            return JSON.parseObject(element, UrlInfo.class).getCurrentTime().getTime();
        }
    });
    SingleOutputStreamOperator<UrlInfo> dataStreamSource = env.addSource(consumer)
            .setParallelism(1)
            // map: convert one stream into another, here String --> UrlInfo
            .map(string -> JSON.parseObject(string, UrlInfo.class));
    env.execute("save url to db");
}
Note that AscendingTimestampExtractor is used here, i.e. a timestamp assigner for ascending timestamps.
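Why per-partition watermarks help can be seen in a small simulation. The sketch below is plain Java, not Flink or Kafka code; the class `PerPartitionWatermarkSketch` and the sample timestamps are made up. It also assumes that an ascending-timestamp watermark is "highest timestamp seen minus one", which is my understanding of AscendingTimestampExtractor and should be treated as an assumption.

```java
import java.util.Arrays;

/** Plain-Java illustration of per-partition watermarks merged by minimum; not Flink code. */
class PerPartitionWatermarkSketch {

    /** True when every timestamp is strictly greater than the one before it. */
    static boolean isAscending(long[] timestamps) {
        for (int i = 1; i < timestamps.length; i++) {
            if (timestamps[i] <= timestamps[i - 1]) {
                return false;
            }
        }
        return true;
    }

    /** Watermark of one ascending partition: highest timestamp seen minus one (assumption, see lead-in). */
    static long partitionWatermark(long[] timestamps) {
        if (timestamps.length == 0) {
            return Long.MIN_VALUE; // nothing read from this partition yet
        }
        return Arrays.stream(timestamps).max().getAsLong() - 1;
    }

    /** Overall watermark: the minimum over all partitions, as on any stream shuffle. */
    static long mergedWatermark(long[][] partitions) {
        long merged = Long.MAX_VALUE;
        for (long[] partition : partitions) {
            merged = Math.min(merged, partitionWatermark(partition));
        }
        return merged;
    }

    public static void main(String[] args) {
        long[] partition0 = {1, 4, 7};
        long[] partition1 = {2, 3, 9};
        long[] interleaved = {1, 2, 4, 3, 7, 9}; // what a consumer reading both might see
        System.out.println(isAscending(partition0) && isAscending(partition1)); // prints true
        System.out.println(isAscending(interleaved));                            // prints false
        System.out.println(mergedWatermark(new long[][] {partition0, partition1})); // prints 6
    }
}
```

Each partition is ascending on its own, the interleaved stream is not, and yet the merged watermark of 6 is still safe: neither partition can produce another element at or below 6.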
References:
http://www.54tianzhisheng.cn/2018/12/11/Flink-time/
https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/event_time.html
https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/event_timestamps_watermarks.html