Please credit the original post when reposting: http://www.cnblogs.com/dongxiao-yang/p/7610412.html

1. Concepts

A watermark is a mechanism Flink introduced for event-time window computation. It is essentially a timestamp: a Flink source, or a user-defined watermark generator, produces it as a special system event, either periodically or when certain conditions are met. Watermarks flow to downstream operations just like ordinary stream events, and an operator that receives a watermark uses it to keep adjusting the event-time clock of the windows it manages.

( A watermark is a special event signaling that time in the event stream (i.e., the real-world timestamps in the event stream) has reached a certain point (say, 10am), and thus no event with timestamp earlier than 10am will arrive from now on. These watermarks are part of the data stream alongside regular events, and a Flink operator advances its event time clock to 10am once it has received a 10am watermark from all its upstream operations/sources)
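To make the "event time clock" idea concrete, the sketch below shows one way an operator with several input channels could track that clock: it remembers the last watermark received on each channel and advances the clock only to the minimum of those values, which is why it can move to 10am only after every upstream channel has delivered a 10am watermark. This is a simplified illustration, not Flink's internal implementation; the class and method names are made up.

import java.util.Arrays;

public class EventTimeClockSketch {

    private final long[] channelWatermarks;          // last watermark seen on each input channel
    private long currentEventTime = Long.MIN_VALUE;  // the operator's event time clock

    public EventTimeClockSketch(int numInputChannels) {
        channelWatermarks = new long[numInputChannels];
        Arrays.fill(channelWatermarks, Long.MIN_VALUE);
    }

    /** Called whenever a watermark arrives on one input channel. */
    public void onWatermark(int channel, long watermarkTimestamp) {
        channelWatermarks[channel] = Math.max(channelWatermarks[channel], watermarkTimestamp);

        // the clock is the minimum watermark across all input channels
        long newClock = Long.MAX_VALUE;
        for (long wm : channelWatermarks) {
            newClock = Math.min(newClock, wm);
        }
        if (newClock > currentEventTime) {
            currentEventTime = newClock;
            // at this point a real operator would fire event-time timers / windows
            // whose end time is <= currentEventTime
        }
    }
}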

2. TimestampAssigner and Watermark

First, event-time processing means Flink needs a place to extract the timestamp carried in every record, so each implementation of TimestampAssigner has to implement

long extractTimestamp(T element, long previousElementTimestamp);

This method extracts the event time of the current element, and that event time determines which downstream window (or windows) the element will be evaluated in.

Second, before data enters a window, a watermark generator has to produce the watermark corresponding to the current event time. Flink supports two kinds of watermark generators: Periodic and Punctuated. The former emits watermarks at a fixed interval (even when no records arrive), the latter only when a specific condition on the data is met. They require implementing, respectively,

Watermark getCurrentWatermark() and Watermark checkAndGetNextWatermark(T lastElement, long extractedTimestamp);

Here are a few sample implementations taken from the official documentation.

Periodic Watermarks

/**
 * This generator generates watermarks assuming that elements arrive out of order,
 * but only to a certain degree. The latest elements for a certain timestamp t will arrive
 * at most n milliseconds after the earliest elements for timestamp t.
 */
public class BoundedOutOfOrdernessGenerator implements AssignerWithPeriodicWatermarks<MyEvent> {

    private final long maxOutOfOrderness = 3500; // 3.5 seconds

    private long currentMaxTimestamp;

    @Override
    public long extractTimestamp(MyEvent element, long previousElementTimestamp) {
        long timestamp = element.getCreationTime();
        currentMaxTimestamp = Math.max(timestamp, currentMaxTimestamp);
        return timestamp;
    }

    @Override
    public Watermark getCurrentWatermark() {
        // return the watermark as current highest timestamp minus the out-of-orderness bound
        return new Watermark(currentMaxTimestamp - maxOutOfOrderness);
    }
}

/**
 * This generator generates watermarks that are lagging behind processing time by a fixed amount.
 * It assumes that elements arrive in Flink after a bounded delay.
 */
public class TimeLagWatermarkGenerator implements AssignerWithPeriodicWatermarks<MyEvent> {

    private final long maxTimeLag = 5000; // 5 seconds

    @Override
    public long extractTimestamp(MyEvent element, long previousElementTimestamp) {
        return element.getCreationTime();
    }

    @Override
    public Watermark getCurrentWatermark() {
        // return the watermark as current time minus the maximum time lag
        return new Watermark(System.currentTimeMillis() - maxTimeLag);
    }
}

Punctuated Watermarks

public class PunctuatedAssigner implements AssignerWithPunctuatedWatermarks<MyEvent> {

    @Override
    public long extractTimestamp(MyEvent element, long previousElementTimestamp) {
        return element.getCreationTime();
    }

    @Override
    public Watermark checkAndGetNextWatermark(MyEvent lastElement, long extractedTimestamp) {
        return lastElement.hasWatermarkMarker() ? new Watermark(extractedTimestamp) : null;
    }
}
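For completeness, attaching the punctuated assigner to a stream looks the same as the periodic case. A minimal sketch follows; MyEvent and mySource are placeholders standing in for whatever event type and source the job actually uses:

// checkAndGetNextWatermark() is called for every element, right after extractTimestamp();
// a non-null return value is emitted downstream as a new watermark.
DataStream<MyEvent> events = env.addSource(mySource);

DataStream<MyEvent> withTimestampsAndWatermarks =
        events.assignTimestampsAndWatermarks(new PunctuatedAssigner());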

3. Code Walkthrough

import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.functions.timestamps.AscendingTimestampExtractor;
import org.apache.flink.streaming.api.watermark.Watermark;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.StringUtils;

public class WindowWaterMark {

    public static void main(String[] args) throws Exception {
        String hostName = "localhost";
        Integer port = Integer.parseInt("8001");

        // set up the execution environment
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        env.setParallelism(2);
        // emit watermarks every 9 seconds so the call order is easy to observe
        env.getConfig().setAutoWatermarkInterval(9000);

        // get input data: lines of the form "<word> <eventTimeMillis>"
        DataStream<String> text = env.socketTextStream(hostName, port);

        DataStream<Tuple3<String, Long, Integer>> counts = text
                .filter(new FilterClass())
                .map(new LineSplitter())
                .assignTimestampsAndWatermarks(new AssignerWithPeriodicWatermarks<Tuple3<String, Long, Integer>>() {

                    private long currentMaxTimestamp = 0L;
                    private final long maxOutOfOrderness = 10000L;

                    @Override
                    public long extractTimestamp(Tuple3<String, Long, Integer> element,
                                                 long previousElementTimestamp) {
                        long timestamp = element.f1;
                        currentMaxTimestamp = Math.max(timestamp, currentMaxTimestamp);
                        System.out.println("get timestamp is " + timestamp
                                + " currentMaxTimestamp " + currentMaxTimestamp);
                        return timestamp;
                    }

                    @Override
                    public Watermark getCurrentWatermark() {
                        System.out.println("wall clock is " + System.currentTimeMillis()
                                + " new watermark " + (currentMaxTimestamp - maxOutOfOrderness));
                        return new Watermark(currentMaxTimestamp - maxOutOfOrderness);
                    }
                })
                .keyBy(0)
                .timeWindow(Time.seconds(20))
                // .allowedLateness(Time.seconds(10))
                .sum(2);

        counts.print();

        // execute program
        env.execute("Java WordCount from SocketTextStream Example");
    }

    public static final class LineSplitter implements MapFunction<String, Tuple3<String, Long, Integer>> {

        @Override
        public Tuple3<String, Long, Integer> map(String value) throws Exception {
            String[] tokens = value.toLowerCase().split("\\W+");
            long eventtime = Long.parseLong(tokens[1]);
            return new Tuple3<String, Long, Integer>(tokens[0], eventtime, 1);
        }
    }

    // unused in this job; kept from the original example as an alternative assigner
    private static class MyTimestamp extends AscendingTimestampExtractor<Tuple3<String, Long, Integer>> {

        private static final long serialVersionUID = 1L;

        @Override
        public long extractAscendingTimestamp(Tuple3<String, Long, Integer> element) {
            return element.f1;
        }
    }

    public static final class FilterClass implements FilterFunction<String> {

        @Override
        public boolean filter(String value) throws Exception {
            // drop empty lines so LineSplitter never sees a blank record
            return !StringUtils.isNullOrWhitespaceOnly(value);
        }
    }
}

The test code above deliberately sets autoWatermarkInterval to 9 seconds so that the order in which the methods are called is easy to observe (the socket source can be fed with a tool such as nc -lk 8001).

First, start the job without sending any data. After about 30 seconds the log output looks like this:

wall clock is 1506680562679 new watermark -10000

wall clock is 1506680562679 new watermark -10000

wall clock is 1506680571683 new watermark -10000

wall clock is 1506680571683 new watermark -10000

wall clock is 1506680580687 new watermark -10000

wall clock is 1506680580687 new watermark -10000

.........................

This shows that with periodic watermarks, getCurrentWatermark is still called periodically even when no data arrives. Each wall-clock value is printed twice, matching env.setParallelism(2): one watermark generator runs per parallel operator instance.
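This matches how the periodic assigner is driven: Flink polls getCurrentWatermark() on a fixed processing-time schedule (the autoWatermarkInterval) and forwards the result downstream only when it is larger than the last emitted watermark. The following is a rough, simplified sketch of that loop, not Flink's actual operator code; PeriodicEmitterSketch and emitDownstream are illustrative names.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.watermark.Watermark;

public class PeriodicEmitterSketch<T> {

    private final AssignerWithPeriodicWatermarks<T> assigner;
    private long lastEmitted = Long.MIN_VALUE;

    public PeriodicEmitterSketch(AssignerWithPeriodicWatermarks<T> assigner) {
        this.assigner = assigner;
    }

    public void start(long autoWatermarkIntervalMillis) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        // getCurrentWatermark() is polled on a timer, independently of incoming records
        timer.scheduleAtFixedRate(() -> {
            Watermark wm = assigner.getCurrentWatermark();
            if (wm != null && wm.getTimestamp() > lastEmitted) {
                lastEmitted = wm.getTimestamp();
                emitDownstream(wm);
            }
        }, autoWatermarkIntervalMillis, autoWatermarkIntervalMillis, TimeUnit.MILLISECONDS);
    }

    private void emitDownstream(Watermark wm) {
        // stand-in for forwarding the watermark to downstream operators
        System.out.println("emit " + wm);
    }
}

Each parallel instance of the assigner runs its own copy of this loop, which is why the log shows two wall clock lines per interval when parallelism is 2.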

Now send the record aaaa 1506590035000.

The log output is:

wall clock is 1506681868124 new watermark -10000

wall clock is 1506681877129 new watermark -10000

wall clock is 1506681877129 new watermark -10000

get timestamp is 1506590035000 currentMaxTimestamp 1506590035000

wall clock is 1506681886132 new watermark 1506590025000

wall clock is 1506681886132 new watermark -10000

wall clock is 1506681895136 new watermark -10000

wall clock is 1506681895136 new watermark 1506590025000

...........................................

The log above shows that extractTimestamp is invoked immediately when a record arrives, but the timing of the wall clock lines is completely unaffected by the incoming data. With periodic watermarks, the time and rate at which watermarks are produced are therefore independent of the input stream.

Note that the start time of a time window is computed as:

public static long getWindowStartWithOffset(long timestamp, long offset, long windowSize) {
    return timestamp - (timestamp - offset + windowSize) % windowSize;
}

So for the 20-second tumbling window in the test code above, the default window boundaries within each minute are (0~20), (20~40), (40~60). Our first record aaaa 1506590035000 corresponds to 2017/9/28 17:13:55, so it will be evaluated in the window (2017/9/28 17:13:40 ~ 2017/9/28 17:14:00).
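Plugging the first record into that formula (offset = 0, windowSize = 20000 ms for the 20-second window) shows how those boundaries come out:

// Worked example for the first record aaaa 1506590035000 (2017/9/28 17:13:55):
long timestamp  = 1506590035000L;
long offset     = 0L;
long windowSize = 20000L;
long start = timestamp - (timestamp - offset + windowSize) % windowSize;
// (1506590035000 - 0 + 20000) % 20000 = 15000
// start = 1506590035000 - 15000 = 1506590020000  -> 2017/9/28 17:13:40
// end   = start + 20000          = 1506590040000  -> 2017/9/28 17:14:00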

Continue sending data:

cc 1506590035000
cc 1506590035000
bb 1506590035000
aaaa 1506590035000
bb 1506590035000

aaaa 1506590041000 // bump the event time to 2017/9/28 17:14:01, past the end time of the previous window
bb 1506590041000
cc 1506590041000

The log output is:

get timestamp is 1506590041000 currentMaxTimestamp 1506590041000

wall clock is 1507522499419 new watermark 1506590031000

wall clock is 1507522499424 new watermark 1506590025000

wall clock is 1507522508422 new watermark 1506590031000

wall clock is 1507522508429 new watermark 1506590025000

wall clock is 1507522517426 new watermark 1506590031000

wall clock is 1507522517434 new watermark 1506590025000

get timestamp is 1506590041000 currentMaxTimestamp 1506590041000

wall clock is 1507522526429 new watermark 1506590031000

wall clock is 1507522526435 new watermark 1506590031000

wall clock is 1507522535431 new watermark 1506590031000

wall clock is 1507522535440 new watermark 1506590031000

wall clock is 1507522544433 new watermark 1506590031000

Notice that although the newer records are already past the end time of the first window, the current watermark produced by getCurrentWatermark is only 1506590031000 (2017/9/28 17:13:51), still below that end time, so Flink does not evaluate the window yet.
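The firing condition behind this behaviour is, roughly, that an event-time window is evaluated once the watermark reaches the window's maximum timestamp, i.e. its (exclusive) end time minus one millisecond. A simplified sketch of that check (not the actual code of Flink's EventTimeTrigger):

public class EventTimeFiringCheck {

    public static boolean shouldFire(long currentWatermark, long windowEnd) {
        long windowMaxTimestamp = windowEnd - 1; // last timestamp still inside the window
        return currentWatermark >= windowMaxTimestamp;
    }

    public static void main(String[] args) {
        long windowEnd = 1506590040000L;                            // 2017/9/28 17:14:00
        System.out.println(shouldFire(1506590031000L, windowEnd));  // false: watermark at 17:13:51, window stays open
        System.out.println(shouldFire(1506590041000L, windowEnd));  // true: watermark at 17:14:01, window fires
    }
}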

Keep sending data with a later event time:

aaaa 1506590051000
bb 1506590051000
cc 1506590051000

The log output is as follows:

get timestamp is 1506590051000 currentMaxTimestamp 1506590051000

get timestamp is 1506590051000 currentMaxTimestamp 1506590051000

get timestamp is 1506590051000 currentMaxTimestamp 1506590051000

wall clock is 1507522589449 new watermark 1506590041000

wall clock is 1507522589461 new watermark 1506590041000

1> (aaaa,1506590035000,2)

2> (cc,1506590035000,2)

2> (bb,1506590035000,2)

At this point the watermark (1506590051000 - 10000 = 1506590041000, i.e. 2017/9/28 17:14:01) has just passed the end time of the first window, so the whole (2017/9/28 17:13:40 ~ 2017/9/28 17:14:00) window is evaluated and the corresponding results are emitted. Each of aaaa, bb and cc was seen twice within that window; the records with event time 1506590041000 belong to the next window.

References

http://vishnuviswanath.com/flink_eventtime.html

https://data-artisans.com/blog/how-apache-flink-enables-new-streaming-applications-part-1

https://www.youtube.com/watch?v=3UfZN59Nsk8

Flink流计算编程--watermark(水位线)简介







