If you are building a real-time streaming application, Event Time processing is one of the features that you will have to use sooner or later.

Since messages arrive out of order in most real-world use cases, the system you build needs some way to recognize that messages can arrive late and to handle them accordingly.

In this blog post, we will see why we need Event Time processing and how we can enable it in Apache Flink.

EventTime is the time at which an event occurred in the real-world and ProcessingTime is the time at which that event is processed by the Flink system.

To understand the importance of Event Time processing, we will first build a Processing Time based system and see its drawbacks.

We will create a SlidingWindow of size 10 seconds that slides every 5 seconds. At the end of each window, the system will emit the number of messages that were received during that window.

Once you understand how EventTime processing works with respect to a SlidingWindow, it will not be difficult to understand how it works for a TumblingWindow as well. So let’s get started.

ProcessingTime based system

For this example, we expect messages to have the format value,timestamp, where value is the message and timestamp is the time at which the message was generated at the source.

Since we are now building a Processing Time based system, the code below ignores the timestamp part.

It is important to understand that each message must carry the information about when it was generated.

Flink, or any other system, is not a magic box that can somehow figure this out by itself. Later we will see that Event Time processing extracts this timestamp information to handle late messages.

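import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

// senv is assumed to be an existing StreamExecutionEnvironment.
// 9999 is a placeholder; the original port number is missing from this listing.
val text = senv.socketTextStream("localhost", 9999)
// Keep only the value part of "value,timestamp" and pair it with a count of 1.
val counts = text.map { (m: String) => (m.split(",")(0), 1) }
  .keyBy(0)
  .timeWindow(Time.seconds(10), Time.seconds(5)) // 10-second window, sliding every 5 seconds
  .sum(1)
counts.print
senv.execute("ProcessingTime processing example")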

Case 1: Messages arrive without delay

Suppose the source generated three messages of type a, at the 13th second, the 13th second and the 16th second respectively.

(Hours and minutes are not important here, since the window size is only 10 seconds.)

These messages will fall into the windows as follows.

The first two messages, generated at the 13th second, will fall into both window1[5s-15s] and window2[10s-20s]. The third message, generated at the 16th second, will fall into window2[10s-20s] and window3[15s-25s].

The final counts emitted by window1, window2 and window3 will be (a,2), (a,3) and (a,1) respectively.
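The window assignment above can be checked with a small stand-alone sketch. This is plain Scala, not Flink code; the object and method names (SlidingWindows, windowsFor, countPerWindow) are made up for illustration, and times are simplified to whole seconds.

```scala
object SlidingWindows {
  val SizeSec = 10 // window size in seconds
  val SlideSec = 5 // slide interval in seconds

  /** All windows [start, end) that contain the given second, earliest first. */
  def windowsFor(tsSec: Int): List[(Int, Int)] = {
    val lastStart = tsSec - (tsSec % SlideSec) // latest window start at or before tsSec
    (lastStart until (tsSec - SizeSec) by -SlideSec)
      .map(start => (start, start + SizeSec))
      .toList
      .reverse
  }

  /** Number of messages per window, ordered by window start. */
  def countPerWindow(timesSec: Seq[Int]): List[((Int, Int), Int)] =
    timesSec.flatMap(windowsFor)
      .groupBy(identity)
      .map { case (w, hits) => (w, hits.size) }
      .toList
      .sortBy(_._1._1)
}
```

Calling SlidingWindows.countPerWindow(Seq(13, 13, 16)) gives a count of 2 for [5,15), 3 for [10,20) and 1 for [15,25), which matches the counts listed above.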

This output can be considered the expected behavior. Now we will look at what happens when one of the messages arrives late.

Case 2: Messages arrive with a delay

Now suppose one of the messages (generated at the 13th second) arrived with a delay of 6 seconds (at the 19th second), maybe due to some network congestion.

Can you guess which windows this message would fall into?

The delayed message fell into windows 2 and 3, since 19 is within the ranges 10-20 and 15-25.

It did not cause any problem for the calculation in window2 (because the message was supposed to fall into that window anyway), but it affected the results of window1 and window3.
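To make the effect concrete, here is a minimal sketch in plain Scala (not Flink code) that counts the same three messages once by the time they were generated and once by the time they arrived. The names Msg, eventSec, arrivalSec and counts are hypothetical, and times are simplified to whole seconds.

```scala
object ProcessingTimeDemo {
  final case class Msg(value: String, eventSec: Int, arrivalSec: Int)

  // The three windows from the example, as [start, end) in seconds.
  val windows = List((5, 15), (10, 20), (15, 25))

  /** Count messages per window using the chosen notion of time. */
  def counts(msgs: Seq[Msg], time: Msg => Int): List[Int] =
    windows.map { case (start, end) =>
      msgs.count(m => time(m) >= start && time(m) < end)
    }

  val msgs = Seq(
    Msg("a", eventSec = 13, arrivalSec = 13),
    Msg("a", eventSec = 13, arrivalSec = 19), // delayed by 6 seconds
    Msg("a", eventSec = 16, arrivalSec = 16)
  )

  val expected = counts(msgs, _.eventSec)   // what we want: counts by generation time
  val observed = counts(msgs, _.arrivalSec) // what a Processing Time system sees
}
```

Here expected is List(2, 3, 1) while observed is List(1, 3, 2): window 1 loses a message and window 3 gains one, exactly the two windows affected above.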

We will now try to fix this problem by using EventTime processing.

EventTime based system

To enable EventTime processing, we need a timestamp extractor that extracts the event time information from the message.

Remember that the messages were of the format value,timestamp. The extractTimestamp method gets the timestamp part and returns it as a Long.

Ignore the getCurrentWatermark method for now; we will come back to it later.

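import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
import org.apache.flink.streaming.api.watermark.Watermark

class TimestampExtractor extends AssignerWithPeriodicWatermarks[String] with Serializable {
  // Messages look like "value,timestamp"; return the timestamp part as a Long.
  override def extractTimestamp(e: String, prevElementTimestamp: Long): Long = {
    e.split(",")(1).toLong
  }
  // For now, the watermark simply follows the system clock.
  override def getCurrentWatermark(): Watermark = {
    new Watermark(System.currentTimeMillis)
  }
}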

Note: this example uses the AssignerWithPeriodicWatermarks interface. There is also another interface, AssignerWithPunctuatedWatermarks.

From the official documentation:

As described in timestamps and watermark handling, Flink provides abstractions that allow the programmer to assign their own timestamps and emit their own watermarks.

More specifically, one can do so by implementing one of the AssignerWithPeriodicWatermarks and AssignerWithPunctuatedWatermarks interfaces, depending on the use case.

In a nutshell, the first will emit watermarks periodically, while the second does so based on some property of the incoming records, e.g. whenever a special element is encountered in the stream.

We now need to set this timestamp extractor and also set the TimeCharacteristic to EventTime.

The rest of the code remains the same as in the ProcessingTime case.

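import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

senv.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
// 9999 is a placeholder; the original port number is missing from this listing.
val text = senv.socketTextStream("localhost", 9999)
  .assignTimestampsAndWatermarks(new TimestampExtractor)
val counts = text.map { (m: String) => (m.split(",")(0), 1) }
  .keyBy(0)
  .timeWindow(Time.seconds(10), Time.seconds(5)) // 10-second window, sliding every 5 seconds
  .sum(1)
counts.print
senv.execute("EventTime processing example")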

The result of running the above code is shown in the diagram below.

The results look better: windows 2 and 3 now emit the correct result, but window 1 is still wrong.

Flink did not assign the delayed message to window 3 because it now checked the message’s event time and understood that it did not fall in that window.

But why didn’t it assign the message to window 1?

The reason is that by the time the delayed message reached the system (at the 19th second), the evaluation of window 1 had already finished (at the 15th second).

Let us now try to fix this issue by using the Watermark.

Note that in window 2, the delayed message was still placed at the 19th second, not at the 13th second (its event time).

This depiction in the figure was intentional, to indicate that the messages within a window are not sorted according to their event time. (This might change in the future.)

Watermarks

Watermarks are a very important and interesting idea, and I will try to give you a brief overview of them.

If you are interested in learning more, you can watch this awesome talk from Google and also read this blog from dataArtisans.

A Watermark is essentially a timestamp. When an Operator in Flink receives a watermark, it understands (assumes) that it is not going to see any message older than that timestamp.

Hence a watermark can also be thought of as a way of telling Flink how far along it is in “EventTime”.

For the purpose of this example, think of it as a way of telling Flink how delayed a message can be.

In the last attempt, we set the watermark to the current system time. Flink was, therefore, not expecting any delayed messages.

We will now set the watermark to current time - 5 seconds, which tells Flink to expect messages to be delayed by at most 5 seconds. This works because each window is evaluated only when the watermark passes its end. Since our watermark is current time - 5 seconds, the first window [5s-15s] will be evaluated only at the 20th second. Similarly, the window [10s-20s] will be evaluated at the 25th second, and so on.

override def getCurrentWatermark(): Watermark = {
  // Watermark lags the system clock by 5 seconds (5000 ms).
  new Watermark(System.currentTimeMillis - 5000)
}

Here we are assuming that the event time is at most 5 seconds older than the current system time, but that is not always the case.

In many cases it is better to keep track of the maximum timestamp received so far (extracted from the messages) and subtract the expected delay from it.
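As a rough illustration of that idea, the sketch below keeps the largest timestamp seen so far and derives the watermark from it. It is plain Scala that only mirrors the shape of the extractor used earlier; it does not implement the Flink interface. The class name BoundedDelayWatermark and the maxDelayMs parameter are hypothetical.

```scala
class BoundedDelayWatermark(maxDelayMs: Long) {
  // Largest event timestamp seen so far; Long.MinValue means "nothing seen yet".
  private var maxSeenTs: Long = Long.MinValue

  /** Parse "value,timestamp" and remember the largest timestamp seen so far. */
  def extractTimestamp(message: String): Long = {
    val ts = message.split(",")(1).trim.toLong
    maxSeenTs = math.max(maxSeenTs, ts)
    ts
  }

  /** Watermark = largest event time seen so far minus the expected delay. */
  def currentWatermark: Option[Long] =
    if (maxSeenTs == Long.MinValue) None else Some(maxSeenTs - maxDelayMs)
}
```

With this approach the watermark advances only when newer events arrive, so it does not depend on the clock of the machine running the job. Depending on your Flink version, the built-in BoundedOutOfOrdernessTimestampExtractor follows the same idea and may be a better starting point than a hand-written class.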

The result of running the code after making the above changes is:

Finally we have the correct result. All three windows now emit the expected counts, which are (a,2), (a,3) and (a,1).
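The difference between the two watermark settings can be reproduced with a simplified model in plain Scala (not Flink code). It assumes the watermark is system time minus a fixed lag, that a window [start, end) fires once the watermark reaches its end, and that a message is counted only if it arrives before the window fires. The names Msg, firesAt and counts are hypothetical, and times are whole seconds.

```scala
object EventTimeDemo {
  final case class Msg(eventSec: Int, arrivalSec: Int)

  val windows = List((5, 15), (10, 20), (15, 25)) // [start, end) in seconds

  /** System second at which a window fires when watermark = system time - lagSec. */
  def firesAt(windowEnd: Int, lagSec: Int): Int = windowEnd + lagSec

  /** Per-window counts by event time, ignoring messages that arrive after firing. */
  def counts(msgs: Seq[Msg], lagSec: Int): List[Int] =
    windows.map { case (start, end) =>
      msgs.count { m =>
        m.eventSec >= start && m.eventSec < end && m.arrivalSec < firesAt(end, lagSec)
      }
    }

  // Two messages generated at second 13 (one arrives at 19) and one at second 16.
  val msgs = Seq(Msg(13, 13), Msg(13, 19), Msg(16, 16))
}
```

In this model, EventTimeDemo.counts(EventTimeDemo.msgs, 0) yields List(1, 3, 1), the result of the first Event Time attempt, and EventTimeDemo.counts(EventTimeDemo.msgs, 5) yields List(2, 3, 1), the correct result, because window 1 now fires at the 20th second instead of the 15th.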

Allowed Lateness

In our earlier approach, where the watermark was current time - delay, a window would not fire until the system time was past window_end_time + delay.

If you want to accommodate late events but still want the window to fire on time, you can use Allowed Lateness.

If allowed lateness is set, Flink will not discard a message unless the watermark is already past window_end_time + allowed lateness.

Once a late message is received, Flink will extract its timestamp and check whether it is within the allowed lateness. If it is, Flink will then check whether to FIRE the window or not (as per the Trigger set).

Hence, note that a window might fire multiple times in this approach, and you might want to make your sink idempotent if you need exactly-once processing.
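The rule described above can be sketched as a small decision function. This is a simplified model in plain Scala, not Flink's actual implementation (it ignores details such as millisecond boundaries and custom triggers), and the names classify, OnTime, LateFiring and Dropped are hypothetical.

```scala
object AllowedLatenessDemo {
  sealed trait Outcome
  case object OnTime extends Outcome     // the window has not fired yet
  case object LateFiring extends Outcome // the window already fired and may fire again
  case object Dropped extends Outcome    // past window end + allowed lateness

  /** Decide what happens to a message for a given window, based on the current watermark. */
  def classify(windowEnd: Long, allowedLateness: Long, watermark: Long): Outcome =
    if (watermark < windowEnd) OnTime
    else if (watermark < windowEnd + allowedLateness) LateFiring
    else Dropped
}
```

For a window ending at second 15 with 5 seconds of allowed lateness, a message arriving when the watermark is at 17 lands in the LateFiring case, which is why the window can emit more than one result. In Flink itself, allowed lateness is configured on the windowed stream, for example with allowedLateness(Time.seconds(5)) after timeWindow.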

Conclusion

The importance of real-time stream processing systems has grown lately, and having to deal with delayed messages is part of any such system you build.

In this blog post, we saw how late-arriving messages can affect the results of your system and how Apache Flink’s Event Time processing capabilities can be used to handle them.

That concludes the post. Thanks for reading!

Chinese translation: https://blog.csdn.net/a6822342/article/details/78064815
