https://ci.apache.org/projects/flink/flink-docs-release-1.0/apis/streaming/event_timestamps_watermarks.html

 

To work with Event Time, streaming programs need to set the time characteristic accordingly.

首先配置成,Event Time

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

 

Assigning Timestamps

In order to work with Event Time, Flink needs to know the events’ timestamps, meaning each element in the stream needs to get its event timestamp assigned. That happens usually by accessing/extracting the timestamp from some field in the element.

Timestamp assignment goes hand-in-hand with generating watermarks, which tell the system about the progress in event time.

There are two ways to assign timestamps and generate Watermarks:

  1. Directly in the data stream source
  2. Via a TimestampAssigner / WatermarkGenerator

接着,我们需要定义如何去获取event time和如何产生Watermark?

一种方式,在source中写死,

@Override
public void run(SourceContext<MyType> ctx) throws Exception {
while (/* condition */) {
MyType next = getNext();
ctx.collectWithTimestamp(next, next.getEventTimestamp()); if (next.hasWatermarkTime()) {
ctx.emitWatermark(new Watermark(next.getWatermarkTime()));
}
}
}

这种方式明显比较low,不太方便,并且这种方式是会被TimestampAssigner 覆盖掉的,

所以看看第二种方式,

Timestamp Assigners / Watermark Generators

Timestamp Assigners take a stream and produce a new stream with timestamped elements and watermarks. If the original stream had timestamps or watermarks already, the timestamp assigner overwrites those.

The timestamp assigners occur usually direct after the data source, but it is not strictly required to. A common pattern is for example to parse (MapFunction) and filter (FilterFunction) before the timestamp assigner. In any case, the timestamp assigner needs to occur before the first operation on event time (such as the first window operation).

一般在会在source后加些map,filter做些初始化或格式化

然后,在任意需要用到event time的操作之前,比如window,进行设置

给个例子,

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime); DataStream<MyEvent> stream = env.addSource(new FlinkKafkaConsumer09<MyEvent>(topic, schema, props)); DataStream<MyEvent> withTimestampsAndWatermarks = stream
.filter( event -> event.severity() == WARNING )
.assignTimestampsAndWatermarks(new MyTimestampsAndWatermarks()); withTimestampsAndWatermarks
.keyBy( (event) -> event.getGroup() )
.timeWindow(Time.seconds(10))
.reduce( (a, b) -> a.add(b) )
.addSink(...);

 

那么Timestamp Assigners如何实现,比如例子中给出的MyTimestampsAndWatermarks

有3种,

With Ascending timestamps

The simplest case for generating watermarks is the case where timestamps within one source occur in ascending order. In that case, the current timestamp can always act as a watermark, because no lower timestamps will occur any more.

DataStream<MyEvent> stream = ...

DataStream<MyEvent> withTimestampsAndWatermarks =
stream.assignTimestampsAndWatermarks(new AscendingTimestampExtractor<MyEvent>() { @Override
public long extractAscendingTimestamp(MyEvent element) {
return element.getCreationTime();
}
});

这种没人用吧,不如直接用processing time了

 

With Periodic Watermarks

The AssignerWithPeriodicWatermarks assigns timestamps and generate watermarks periodically (possibly depending the stream elements, or purely based on processing time).

The interval (every n milliseconds) in which the watermark will be generated is defined via ExecutionConfig.setAutoWatermarkInterval(...). Each time, the assigner’s getCurrentWatermark() method will be called, and a new Watermark will be emitted, if the returned Watermark is non-null and larger than the previous Watermark.

定期的发送,你可以通过ExecutionConfig.setAutoWatermarkInterval(...),来设置这个频率

/**
* This generator generates watermarks assuming that elements come out of order to a certain degree only.
* The latest elements for a certain timestamp t will arrive at most n milliseconds after the earliest
* elements for timestamp t.
*/
public class BoundedOutOfOrdernessGenerator extends AssignerWithPeriodicWatermarks<MyEvent> { private final long maxOutOfOrderness = 3500; // 3.5 seconds private long currentMaxTimestamp; @Override
public long extractTimestamp(MyEvent element, long previousElementTimestamp) {
long timestamp = element.getCreationTime();
currentMaxTimestamp = Math.max(timestamp, currentMaxTimestamp);
return timestamp;
} @Override
public Watermark getCurrentWatermark() {
// return the watermark as current highest timestamp minus the out-of-orderness bound
return new Watermark(currentMaxTimestamp - maxOutOfOrderness);
}
} /**
* This generator generates watermarks that are lagging behind processing time by a certain amount.
* It assumes that elements arrive in Flink after at most a certain time.
*/
public class TimeLagWatermarkGenerator extends AssignerWithPeriodicWatermarks<MyEvent> { private final long maxTimeLag = 5000; // 5 seconds @Override
public long extractTimestamp(MyEvent element, long previousElementTimestamp) {
return element.getCreationTime();
} @Override
public Watermark getCurrentWatermark() {
// return the watermark as current time minus the maximum time lag
return new Watermark(System.currentTimeMillis() - maxTimeLag);
}
}

 

上面给出两个case,区别是第一种,会以event time的Max,来设置watermark

第二种,是以当前的processing time来设置watermark

 

With Punctuated Watermarks

To generate Watermarks whenever a certain event indicates that a new watermark can be generated, use theAssignerWithPunctuatedWatermarks. For this class, Flink will first call the extractTimestamp(...) method to assign the element a timestamp, and then immediately call for that element the checkAndGetNextWatermark(...) method.

The checkAndGetNextWatermark(...) method gets the timestamp that was assigned in the extractTimestamp(...) method, and can decide whether it wants to generate a Watermark. Whenever the checkAndGetNextWatermark(...) method returns a non-null Watermark, and that Watermark is larger than the latest previous Watermark, that new Watermark will be emitted.

这种即,watermark不是由时间来触发的,而是以特定的event触发的,即本到某些特殊的event或message,才触发watermark

所以它的接口叫,checkAndGetNextWatermark

需要先check

public class PunctuatedAssigner extends AssignerWithPunctuatedWatermarks<MyEvent> {

    @Override
public long extractTimestamp(MyEvent element, long previousElementTimestamp) {
return element.getCreationTime();
} @Override
public Watermark checkAndGetNextWatermark(MyEvent lastElement, long extractedTimestamp) {
return element.hasWatermarkMarker() ? new Watermark(extractedTimestamp) : null;
}
}

Flink - Generating Timestamps / Watermarks的更多相关文章

  1. flink Window的Timestamps/Watermarks和allowedLateness的区别

    Watermartks是通过additional的时间戳来控制窗口激活的时间,allowedLateness来控制窗口的销毁时间.   注: 因为此特性包括官方文档在1.3-1.5版本均未做改变,所以 ...

  2. 【翻译】生成 Timestamps / Watermarks

    本文翻译自flink官网:https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/event_timestamps_waterm ...

  3. Flink - watermark生成

    参考,Flink - Generating Timestamps / Watermarks watermark,只有在有window的情况下才用到,所以在window operator前加上assig ...

  4. Flink Program Guide (3) -- Event Time (DataStream API编程指导 -- For Java)

    Event Time 本文翻译自DataStream API Docs v1.2的Event Time ------------------------------------------------ ...

  5. flink学习之十一-window&EventTime实例

    上面试了Processing Time,在这里准备看下Event Time,以及必须需要关注的,在ET场景下的Watermarks. EventTime & Watermark Event t ...

  6. Flink Program Guide (4) -- 时间戳和Watermark生成(DataStream API编程指导 -- For Java)

    时间戳和Watermark生成 本文翻译自Generating Timestamp / Watermarks --------------------------------------------- ...

  7. Flink学习(二)Flink中的时间

    摘自Apache Flink官网 最早的streaming 架构是storm的lambda架构 分为三个layer batch layer serving layer speed layer 一.在s ...

  8. Flink中的多source+event watermark测试

    这次需要做一个监控项目,全网日志的指标计算,上线的话,计算量应该是百亿/天 单个source对应的sql如下 最原始的sql select pro,throwable,level,ip,`count` ...

  9. [白话解析] Flink的Watermark机制

    [白话解析] Flink的Watermark机制 0x00 摘要 对于Flink来说,Watermark是个很难绕过去的概念.本文将从整体的思路上来说,运用感性直觉的思考来帮大家梳理Watermark ...

随机推荐

  1. 如何在多模型的情况下进行EF6的结构迁移

    所谓多模型就是在一个数据库中包含两个不同模型,或者换句话说就是两个不同DbContext的数据都放到同一个数据库中.这里的多模型不是指多租户的数据库(有谁知道EF很好处理多租户数据库的方案,可以联系我 ...

  2. ORACLE配置tnsnames.ora文件实例

    ORACLE配置tnsnames.ora文件实例客户机为了和服务器连接,必须先和服务器上的监听进程联络.ORACLE通过tnsnames.ora文件中的连接描述符来说明连接信息.一般tnsnames. ...

  3. 能加载文件或程序集“System.Data.SQLite”或它的某一个依赖项。试图加载格式不正确的程序。

    现象: 能加载文件或程序集“System.Data.SQLite”或它的某一个依赖项.试图加载格式不正确的程序.

  4. Android 用adb pull或push 拷贝手机文件到到电脑上,拷贝手机数据库到电脑上,拷贝电脑数据库到手机上

    先说一下adb命令配置,如果遇到adb不是内部或外部命令,也不是可运行的程序或批量文件.配置下环境变量 1.adb不是内部或外部命令,也不是可运行的程序或批量文件. 解决办法:在我的电脑-属性-高级计 ...

  5. js:语言精髓笔记13--语言技巧

    消除代码全局变量名占用: //本质是使用匿名函数: void function(x, y, z) { console.log(x + y + z); }(1,2,3); //要使函数内的变量不被释放, ...

  6. 数据采集器移动手持打印POS终端(PDA)商超抄单方案-PDA抄单无线开单+进销存软件

    PDA主要应用在业务员外出在超市能及时抄单发送回总公司,总公司办公人员及时在软件里面调出业务员的抄单数量进行配货,大大提供工作效率. 移动数据终端和无线基础架构相结合,面向零售与批发门店店员的解决方案 ...

  7. Ural 1018 (树形DP+背包+优化)

    题目链接: http://acm.hust.edu.cn/vjudge/problem/viewProblem.action?id=17662 题目大意:树枝上间连接着一坨坨苹果(不要在意'坨'),给 ...

  8. AppCache 离线存储 应用程序缓存 API 及注意事项

    使用ApplicationCache接口实现离线缓存 原文:http://www.mb5u.com/HTML5/html5_96464.html 推荐:html5 application cache遇 ...

  9. Git 远程操作详解

    Git是目前最流行的版本管理系统,学会Git几乎成了开发者的必备技能. Git有很多优势,其中之一就是远程操作非常简便.本文详细介绍5个Git命令,它们的概念和用法,理解了这些内容,你就会完全掌握Gi ...

  10. 外部调用JS文件时出现中文乱码的解决办法

    若测试网页的编码格式为:gb2312,而调用外部JS文件时出现了乱码(前提是JS文件无错误),则将调用的外部JS文件用记事本打开,然后再保存成编码格式为UTF-8的JS文件即可. 若测试网页的编码格式 ...