Having tried out Processing Time above, here we look at Event Time, and at what must be understood in the Event Time scenario: Watermarks.

EventTime & Watermark

Event time programs must specify how to generate Event Time Watermarks, which is the mechanism that signals progress in event time.

A program that runs on event time must specify how its watermarks are generated.

The following is quoted from 《从0到1学习flink》 (Learning Flink from 0 to 1) and the official documentation:

A stream processor that supports event time needs a way to measure the progress of event time. For example, a window operator that builds hourly windows needs to be notified when event time has passed beyond the end of an hour, so that the operator can close the window in progress.

Event time can progress independently of processing time (measured by wall clocks). For example, in one program the current event time of an operator may trail slightly behind the processing time (accounting for a delay in receiving the events), while both proceed at the same speed. On the other hand, another streaming program might progress through weeks of event time with only a few seconds of processing, by fast-forwarding through some historic data already buffered in a Kafka topic (or another message queue).

The mechanism in Flink to measure progress in event time is watermarks. Watermarks flow as part of the data stream and carry a timestamp t. A Watermark(t) declares that event time has reached time t in that stream, meaning that there should be no more elements from the stream with a timestamp t' <= t (i.e. no more events with timestamps older than or equal to the watermark).

The figure below shows a stream of events with (logical) timestamps, and watermarks flowing inline. In this example the events are in order (with respect to their timestamps), meaning the watermarks are simply periodic markers in the stream.

(Figure: stream_watermark_in_order)

Watermarks are crucial for out-of-order streams, as illustrated below, where the events are not ordered by their timestamps. In general a watermark is a declaration that by that point in the stream, all events up to a certain timestamp should have arrived. Once a watermark reaches an operator, the operator can advance its internal event time clock to the value of the watermark.

(Figure: stream_watermark_out_of_order)

My takeaway: if the time characteristic configured in Flink is Event Time, watermarks must be set up, as the signal that tells Flink how far event time has progressed.

If watermark(time1) has been established, then all elements in the stream with a timestamp time2 earlier than time1 should already have arrived and been processed, whether the stream is ordered or out of order. For example, once a watermark for 10:00:00 passes an hourly window operator, the 09:00-10:00 window can safely fire.

Who produces the watermarks? It is the job code running in Flink that produces them, not the datasource system itself.

Does every element get its own watermark? It can be 1:1, but it does not have to be; decide based on need and the actual situation.

It is possible to generate a watermark on every single event. However, because each watermark causes some computation downstream, an excessive number of watermarks degrades performance.

Watermarks in Parallel Streams

Watermarks are generated at, or directly after, source functions. Each parallel subtask of a source function usually generates its watermarks independently. These watermarks define the event time at that particular parallel source.

As the watermarks flow through the streaming program, they advance the event time at the operators where they arrive. Whenever an operator advances its event time, it generates a new watermark downstream for its successor operators.

Some operators consume multiple input streams: a union, for example, or operators following a keyBy(...) or partition(...) function. Such an operator's current event time is the minimum of its input streams' event times. As its input streams update their event times, so does the operator.
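
As a rough illustration of this minimum rule (a hedged sketch, not Flink's actual internal code; the class and method names here are made up), a two-input operator can be modeled as tracking the last watermark seen on each input and advancing its own event time only when the minimum of the two advances:

    // Hedged sketch of the two-input watermark combination rule described above.
    // Not Flink's real implementation.
    public class TwoInputWatermarkTracker {

        private long input1Watermark = Long.MIN_VALUE;
        private long input2Watermark = Long.MIN_VALUE;
        private long currentEventTime = Long.MIN_VALUE;

        /** A watermark arrived on input 1; returns the new downstream watermark, or null. */
        public Long onWatermark1(long timestamp) {
            input1Watermark = Math.max(input1Watermark, timestamp);
            return maybeAdvance();
        }

        /** A watermark arrived on input 2; returns the new downstream watermark, or null. */
        public Long onWatermark2(long timestamp) {
            input2Watermark = Math.max(input2Watermark, timestamp);
            return maybeAdvance();
        }

        private Long maybeAdvance() {
            // the operator's event time is the minimum of its inputs' event times
            long combined = Math.min(input1Watermark, input2Watermark);
            if (combined > currentEventTime) {
                currentEventTime = combined;
                return combined; // forward a new watermark downstream
            }
            return null; // event time did not advance, nothing to emit
        }
    }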

The figure below shows an example of events and watermarks flowing through parallel streams, and of operators tracking event time.

(Figure: flink_parallel_streams_watermarks)

As the figure shows, event time originates at the source, and the same is true of watermarks.

The data flows from the source through a map transformation and is then processed in windows.

The rest I haven't fully figured out yet, though presumably it is the minimum rule above at work: the two-input window operators advance their event time to the smaller of the watermarks arriving on their inputs.

About Timestamps and Watermarks

In order to work with event time, Flink needs to know the events’ timestamps, meaning each element in the stream needs to have its event timestamp assigned. This is usually done by accessing/extracting the timestamp from some field in the element.

Timestamp assignment goes hand-in-hand with generating watermarks, which tell the system about progress in event time.

There are two ways to assign timestamps and generate watermarks:

  1. Directly in the data stream source
  2. Via a timestamp assigner / watermark generator: in Flink, timestamp assigners also define the watermarks to be emitted

Attention: Both timestamps and watermarks are specified as milliseconds since the Java epoch of 1970-01-01T00:00:00Z.
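
For instance, converting a java.util.Date or an ISO date-time string to the required epoch milliseconds looks like this (a small illustration, not from the original article):

    // timestamps and watermarks are epoch milliseconds, so convert before use
    long fromDate = new java.util.Date().getTime();
    long fromInstant = java.time.Instant.parse("2019-01-01T00:00:00Z").toEpochMilli();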

Under event time, Flink must know each event's timestamp; that is, every element in the stream needs an event timestamp assigned, usually extracted from some field of the element.

Assigning timestamps and generating watermarks usually go hand in hand.

There are two ways to do both, as the quote above lists: directly in the data stream source, or via a timestamp assigner / watermark generator (in Flink, a timestamp assigner is at the same time a watermark generator).

Directly in the data source

Stream sources can directly assign timestamps to the elements they produce, and they can also emit watermarks. When this is done, no timestamp assigner is needed. Note that if a timestamp assigner is used, any timestamps and watermarks provided by the source will be overwritten.

To assign a timestamp to an element in the source directly, the source must use the collectWithTimestamp(...) method on the SourceContext. To generate watermarks, the source must call the emitWatermark(Watermark) function.

For example, the mysql datasource with spring from earlier was implemented like this:

    @Override
    public void run(SourceContext<UrlInfo> sourceContext) throws Exception {
        log.info("------query ");
        if (urlInfoManager == null) {
            init();
        }
        List<UrlInfo> urlInfoList = urlInfoManager.queryAll();
        urlInfoList.parallelStream().forEach(urlInfo -> sourceContext.collect(urlInfo));
    }

To attach a timestamp, collectWithTimestamp must be called; to generate a watermark, emitWatermark.

After the modification it looks like this:

    @Override
    public void run(SourceContext<UrlInfo> sourceContext) throws Exception {
        log.info("------query ");
        if (urlInfoManager == null) {
            init();
        }
        List<UrlInfo> urlInfoList = urlInfoManager.queryAll();
        urlInfoList.parallelStream().forEach(urlInfo -> {
            // attach the timestamp to the element
            sourceContext.collectWithTimestamp(urlInfo, System.currentTimeMillis());
            // emit a watermark; note that collectWithTimestamp already emits the
            // element, so no additional collect() call is needed
            sourceContext.emitWatermark(new Watermark(urlInfo.getCurrentTime() == null
                    ? System.currentTimeMillis()
                    : urlInfo.getCurrentTime().getTime()));
        });
    }

Note the two added calls: both the timestamp and the watermark are produced per element here.
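
As the quote earlier points out, one watermark per element can hurt downstream performance. One way to thin them out in a source like this is to emit a watermark only every N elements; the following is a hedged sketch (the counter and helper method are illustrative additions, not code from the original article):

    // Sketch: emit a watermark only every 100 elements instead of one per element,
    // since every watermark triggers computation downstream. These members would be
    // added to the source class above.
    private static final int WATERMARK_EVERY = 100;
    private final java.util.concurrent.atomic.AtomicInteger elementCounter =
            new java.util.concurrent.atomic.AtomicInteger();

    private void collectWithSparseWatermark(SourceContext<UrlInfo> sourceContext, UrlInfo urlInfo) {
        long timestamp = urlInfo.getCurrentTime() == null
                ? System.currentTimeMillis()
                : urlInfo.getCurrentTime().getTime();
        sourceContext.collectWithTimestamp(urlInfo, timestamp);
        // AtomicInteger because the source above iterates with parallelStream()
        if (elementCounter.incrementAndGet() % WATERMARK_EVERY == 0) {
            sourceContext.emitWatermark(new Watermark(timestamp));
        }
    }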

Specifying via Timestamp Assigners / Watermark Generators

Timestamp assigners take a stream and produce a new stream with timestamped elements and watermarks. If the original stream had timestamps and/or watermarks already, the timestamp assigner overwrites them.

Timestamp assigners are usually specified immediately after the data source, but it is not strictly required to do so. A common pattern, for example, is to parse (MapFunction) and filter (FilterFunction) before the timestamp assigner. In any case, the timestamp assigner needs to be specified before the first operation on event time (such as the first window operation). As a special case, when using Kafka as the source of a streaming job, Flink allows the specification of a timestamp assigner / watermark emitter inside the source (or consumer) itself. More information on how to do so can be found in the Kafka Connector documentation.

A timestamp assigner takes a stream and produces a new stream of timestamped elements and watermarks. If the original stream already had timestamps and/or watermarks, the assigner overwrites them.

A timestamp assigner is usually specified immediately after the data source is initialized, but this is not strictly required. A common pattern is to parse and filter first and then specify the timestamp assigner; in any case, it must be specified before the first event-time operation (such as the first window operation).

Let's look at an example first:

    public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        MysqlDSWithSpringForFlink streamSource = new MysqlDSWithSpringForFlink();

        DataStreamSource<UrlInfo> dataStreamSource = env.addSource(streamSource);

        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        SingleOutputStreamOperator<UrlInfo> withTimestampAndWatermarkStream = dataStreamSource
                .filter((FilterFunction<UrlInfo>) o -> o.getDomain() == UrlInfo.BAIDU)
                .assignTimestampsAndWatermarks(new MyTimestampAndWatermarkAssigner());

        // sink the timestamped/watermarked stream rather than the raw source stream
        withTimestampAndWatermarkStream.addSink(new PrintSinkFunction<>());

        env.execute("mysql Datasource with pool and spring");
    }

As you can see, an assignTimestampsAndWatermarks call is made after the filter.
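
The MyTimestampAndWatermarkAssigner above is the author's own class and its code is not shown in the original. A minimal sketch of what it might look like, assuming UrlInfo carries its event time in getCurrentTime() and allowing a few seconds of out-of-orderness:

    // Hedged sketch of MyTimestampAndWatermarkAssigner (the original code is not shown).
    public class MyTimestampAndWatermarkAssigner implements AssignerWithPeriodicWatermarks<UrlInfo> {

        private static final long MAX_OUT_OF_ORDERNESS = 3000; // allow 3 seconds of lateness
        private long currentMaxTimestamp = Long.MIN_VALUE + MAX_OUT_OF_ORDERNESS;

        @Override
        public long extractTimestamp(UrlInfo element, long previousElementTimestamp) {
            long timestamp = element.getCurrentTime().getTime();
            currentMaxTimestamp = Math.max(currentMaxTimestamp, timestamp);
            return timestamp;
        }

        @Override
        public Watermark getCurrentWatermark() {
            // lag behind the highest timestamp seen so far by the allowed out-of-orderness
            return new Watermark(currentMaxTimestamp - MAX_OUT_OF_ORDERNESS);
        }
    }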

With Periodic Watermarks -- generating watermarks periodically

AssignerWithPeriodicWatermarks assigns timestamps and generates watermarks periodically (possibly depending on the stream elements, or purely based on processing time).

The interval (every n milliseconds) in which the watermark will be generated is defined via ExecutionConfig.setAutoWatermarkInterval(...). The assigner's getCurrentWatermark() method will be called each time, and a new watermark will be emitted if the returned watermark is non-null and larger than the previous watermark.

To generate watermarks periodically rather than on every element, implement the AssignerWithPeriodicWatermarks interface. The interval, in milliseconds, is set via ExecutionConfig.setAutoWatermarkInterval(...).

    public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        MysqlDSWithSpringForFlink streamSource = new MysqlDSWithSpringForFlink();

        DataStreamSource<UrlInfo> dataStreamSource = env.addSource(streamSource);

        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        // set the watermark generation interval (in milliseconds)
        ExecutionConfig config = env.getConfig();
        config.setAutoWatermarkInterval(300);

        SingleOutputStreamOperator<UrlInfo> withTimestampAndWatermarkStream = dataStreamSource
                .filter((FilterFunction<UrlInfo>) o -> o.getDomain() == UrlInfo.BAIDU)
                .assignTimestampsAndWatermarks(new TimeLagWatermarkGenerator());

        withTimestampAndWatermarkStream.addSink(new PrintSinkFunction<>());

        env.execute("mysql Datasource with pool and spring");
    }

Here the watermark generation interval is set through ExecutionConfig, and a TimeLagWatermarkGenerator is attached after the filter. Its code is below (from the official docs, slightly modified):

    /**
     * This generator generates watermarks that are lagging behind processing time by a fixed amount.
     * It assumes that elements arrive in Flink after a bounded delay.
     */
    public class TimeLagWatermarkGenerator implements AssignerWithPeriodicWatermarks<UrlInfo> {

        private final long maxTimeLag = 5000; // 5 seconds

        @Override
        public long extractTimestamp(UrlInfo element, long previousElementTimestamp) {
            return element.getCurrentTime().getTime();
        }

        @Override
        public Watermark getCurrentWatermark() {
            // return the watermark as current time minus the maximum time lag
            return new Watermark(System.currentTimeMillis() - maxTimeLag);
        }
    }
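
For this common bounded-lateness case, Flink also ships a ready-made periodic assigner, BoundedOutOfOrdernessTimestampExtractor, so only the timestamp extraction needs to be written. A sketch for UrlInfo (assuming, as above, that getCurrentTime() holds the event time):

    import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
    import org.apache.flink.streaming.api.windowing.time.Time;

    // Sketch: the built-in bounded-out-of-orderness extractor applied to UrlInfo.
    public class UrlInfoTimestampExtractor extends BoundedOutOfOrdernessTimestampExtractor<UrlInfo> {

        public UrlInfoTimestampExtractor() {
            super(Time.seconds(5)); // tolerate elements up to 5 seconds late
        }

        @Override
        public long extractTimestamp(UrlInfo element) {
            return element.getCurrentTime().getTime();
        }
    }
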
With Punctuated Watermarks -- generating watermarks on specific events

To generate watermarks whenever a certain event indicates that a new watermark might be generated, use AssignerWithPunctuatedWatermarks. For this class Flink will first call the extractTimestamp(...) method to assign the element a timestamp, and then immediately call the checkAndGetNextWatermark(...) method on that element.

The checkAndGetNextWatermark(...) method is passed the timestamp that was assigned in the extractTimestamp(...) method, and can decide whether it wants to generate a watermark. Whenever the checkAndGetNextWatermark(...) method returns a non-null watermark, and that watermark is larger than the latest previous watermark, that new watermark will be emitted.

    public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        MysqlDSWithSpringForFlink streamSource = new MysqlDSWithSpringForFlink();

        DataStreamSource<UrlInfo> dataStreamSource = env.addSource(streamSource);

        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        SingleOutputStreamOperator<UrlInfo> withTimestampAndWatermarkStream = dataStreamSource
                .filter((FilterFunction<UrlInfo>) o -> o.getDomain() == UrlInfo.BAIDU)
                .assignTimestampsAndWatermarks(new PunctuatedAssigner());

        withTimestampAndWatermarkStream.addSink(new PrintSinkFunction<>());

        env.execute("mysql Datasource with pool and spring");
    }
    import myflink.model.UrlInfo;
    import org.apache.flink.streaming.api.functions.AssignerWithPunctuatedWatermarks;
    import org.apache.flink.streaming.api.watermark.Watermark;

    public class PunctuatedAssigner implements AssignerWithPunctuatedWatermarks<UrlInfo> {

        @Override
        public long extractTimestamp(UrlInfo element, long previousElementTimestamp) {
            return element.getCurrentTime().getTime();
        }

        @Override
        public Watermark checkAndGetNextWatermark(UrlInfo lastElement, long extractedTimestamp) {
            // hasWatermarkMarker() is a flag on the author's UrlInfo model marking events
            // that should trigger a watermark; emit one carrying the element's timestamp.
            return lastElement.hasWatermarkMarker() ? new Watermark(extractedTimestamp) : null;
        }
    }
About Kafka

When using Apache Kafka as a data source, each Kafka partition may have a simple event time pattern (ascending timestamps or bounded out-of-orderness). However, when consuming streams from Kafka, multiple partitions often get consumed in parallel, interleaving the events from the partitions and destroying the per-partition patterns (this is inherent in how Kafka’s consumer clients work).

In that case, you can use Flink’s Kafka-partition-aware watermark generation. Using that feature, watermarks are generated inside the Kafka consumer, per Kafka partition, and the per-partition watermarks are merged in the same way as watermarks are merged on stream shuffles.

For example, if event timestamps are strictly ascending per Kafka partition, generating per-partition watermarks with the ascending timestamps watermark generator will result in perfect overall watermarks.

The illustrations below show how to use the per-Kafka-partition watermark generation, and how watermarks propagate through the streaming dataflow in that case.

Kafka has multiple partitions, and each partition may follow its own simple event time pattern. On the consumer side, however, data from multiple partitions is processed in parallel, which interleaves events from different partitions and destroys the per-partition pattern.

In this case, Flink's Kafka-partition-aware watermark generation can be used, as in the following code:

    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        Properties properties = new Properties();
        properties.put("bootstrap.servers", "localhost:9092");
        properties.put("zookeeper.connect", "localhost:2181");
        properties.put("group.id", "metric-group");
        properties.put("auto.offset.reset", "latest");
        properties.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        SingleOutputStreamOperator<UrlInfo> dataStreamSource = env.addSource(
                new FlinkKafkaConsumer010<String>(
                        "testjin", // topic
                        new SimpleStringSchema(),
                        properties
                )
        ).setParallelism(1)
                // map: convert the stream of Strings into a stream of UrlInfo objects
                .map(string -> JSON.parseObject(string, UrlInfo.class));

        // assignTimestampsAndWatermarks returns a new stream; capture it,
        // otherwise the assigned timestamps and watermarks are lost
        SingleOutputStreamOperator<UrlInfo> withTimestamps = dataStreamSource
                .assignTimestampsAndWatermarks(new AscendingTimestampExtractor<UrlInfo>() {
                    @Override
                    public long extractAscendingTimestamp(UrlInfo element) {
                        return element.getCurrentTime().getTime();
                    }
                });

        env.execute("save url to db");
    }

Note that an AscendingTimestampExtractor is used, i.e. a timestamp assigner for streams whose timestamps are ascending.
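
One caveat: in the code above the assigner is applied to the stream after the map, which is the ordinary, non-partition-aware path. For the per-Kafka-partition watermarks that the quoted docs describe, the assigner has to be set on the FlinkKafkaConsumer itself before addSource. A hedged sketch follows, where urlInfoSchema is a hypothetical DeserializationSchema<UrlInfo> that the original article does not show:

    // Sketch: Kafka-partition-aware watermarks, set on the consumer itself.
    // urlInfoSchema is a hypothetical DeserializationSchema<UrlInfo>.
    FlinkKafkaConsumer010<UrlInfo> kafkaSource =
            new FlinkKafkaConsumer010<>("testjin", urlInfoSchema, properties);

    kafkaSource.assignTimestampsAndWatermarks(new AscendingTimestampExtractor<UrlInfo>() {
        @Override
        public long extractAscendingTimestamp(UrlInfo element) {
            return element.getCurrentTime().getTime();
        }
    });

    // watermarks are now generated per Kafka partition inside the consumer
    DataStream<UrlInfo> stream = env.addSource(kafkaSource);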

References:

http://www.54tianzhisheng.cn/2018/12/11/Flink-time/

https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/event_time.html

https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/event_timestamps_watermarks.html

Reposted from: https://www.jianshu.com/p/13b6d180adcb
