Flink - FLIP
https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
FLIP-1 : Fine Grained Recovery from Task Failures
When a task fails during execution, Flink currently resets the entire execution graph and triggers complete re-execution from the last completed checkpoint. This is more expensive than just re-executing the failed tasks.
如果一个task失败,当前需要完全停掉整个job恢复,这个明显太重了;
proposal
简单的方案,如果一个task失败,就把和它相连的整条pipeline都重启,但如果所有node都和该task相连,那还是要重启整个job
、
但这个方案太naive了,是否可以尽量的减少重启的范围?
如果要只重启fail的task,以及后续的tasks,而不想重启源,只有cache
每个node,把要发出去的Intermediate Result缓存下来,当一个node的task挂了后, 只需要从上一层node把Intermediate Result从发出来,就可以避免从source重启
至于如何cache Intermediate Result,在memory还是disk,还是其他,只是方案不同
Caching Intermediate Result
This type of data stream caches all elements since the latest checkpoint, possibly spilling them to disk, if the data exceeds the memory capacity.
When a downstream operator restarts from that checkpoint, it can simply re-read that data stream without requiring the producing operator to restart. Applicable to both batch (bounded) and streaming (unbounded) operations. When no checkpoints are used (batch), it needs to cache all data.
Memory-only caching Intermediate Result
Similar to the caching intermediate result, but discards sent data once the memory buffering capacity is exceeded. Acts as a “best effort” helper for recovery, which will bound recovery when checkpoints are frequent enough to hold data in between checkpoints in memory. On the other hand, it comes absolutely for free, it simply used memory that would otherwise not be used anyways.
Blocking Intermediate Result
This is applicable only to bounded intermediate results (batch jobs). It means that the consuming operator starts only after the entire bounded result has been produced. This bounds the cancellations/restarts downstream in batch jobs.
FLIP-2 Extending Window Function Metadata
Right now, in Flink a WindowFunction does not get a lot of information when a window fires.
The signature of WindowFunction is this:
public
interface
WindowFunction<IN, OUT, KEY, W
extends
Window>
extends
Function, Serializable {
void
apply(KEY key, W window, Iterable<IN> input, Collector<OUT> out);
}
i.e , the user code only has access to the key for which the window fired, the window for which we fired and the data of the window itself. In the future, we might like to extend the information available to the user function. We initially propose this as additional information:
Why/when did the window fire. Did it fire on time, i.e. when the watermark passed the end of the window. Did it fire early because of a speculative early trigger or did it fire on late-arriving data.
- How many times did we fire before for the current window. This would probably be an increasing index, such that each firing for a window can be uniquely identified.
当前在window functions中暴露出来的信息不够,需要给出更多的信息,比如why,when fire等
FLIP-3 - Organization of Documentation
FLIP-4 : Enhance Window Evictor
Right now, the ability of Window Evictor is limited
- The Evictor is called only before the WindowFunction. (There can be use cases where the elements have to be evicted after the WindowFunction is applied)
- Elements are evicted only from the beginning of the Window. (There can be cases where we need to allow eviction of elements from anywhere within in the Window as per the eviction logic that user wish to implement)
当前Evictor只是在WindowFunction 之前被执行,是否可以在WindowFunction 之后被执行?
当前的接口只是从beginning of the Window开始,是否可以从任意位置开始evict
FLIP-5: Only send data to each taskmanager once for broadcasts
Problem:
We experience some unexpected increase of data sent over the network for broadcasts with increasing number of slots per task manager.
降低在广播时,发送的冗余数据
当前状况是,
要达到的效果是,
每个taskmanager只发送一次
FLIP-6 - Flink Deployment and Process Model - Standalone, Yarn, Mesos, Kubernetes, etc.
核心想法是,把jobmanager的工作分离出来
增加两个新的模块, ResourceManager和dispatcher
The ResourceManager (introduced in Flink 1.1) is the cluster-manager-specific component. There is a generic base class, and specific implementations for:
YARN
Mesos
Standalone-multi-job (Standalone mode)
Self-contained-single-job (Docker/Kubernetes)
显然对于不同的资源管理平台,只需要实现不同的ResourceManager
于是JobManager, TaskManager和ResourceManager之间的关系就变成这样
TaskManager,向ResourceManager进行注册,并定期汇报solts的情况
JobManager会向ResourceManager请求slot,然后ResourceManager会选择TaskManager,告诉它向某JobManager提供slots
然后该TaskManager会直接联系JobManager去提供slots
同时JobManager会有slot pool,来保持申请到的slots
The SlotPool is a modification of what is currently the InstanceManager.
这样就算ResourceManager挂掉了,JobManager仍然可以继续使用已经申请的slots
The new design includes the concept of a Dispatcher. The dispatcher accepts job submissions from clients and starts the jobs on their behalf on a cluster manager.
In the future run, the dispatcher will also help with the following aspects:
The dispatcher is a cross-job service that can run a long-lived web dashboard
Future versions of the dispatcher should receive only HTTP calls and thus can act as a bridge in firewalled clusters
The dispatcher never executes code and can thus be viewed as a trusted process. It can run with higher privileges (superuser credentials) and spawn jobs on behalf of other users (acquiring their authentication tokens). Building on that, the dispatcher can manage user authentications
把Dispatcher从JobManager中分离出来的好处,
首先dispatcher是可以跨cluster的,是个long-lived web dashboard,比如后面如果一个cluster或jobmanager挂了,我可以简单的spawn到另外一个
第二,client到dispatcher是基于http的很容易穿过防火墙
第三,dispatcher可以当作类似proxy的作用,比如authentications
所以对于不同的cluster manager的具体架构如下,
Yarn,
Compared to the state in Flink 1.1, the new Flink-on-YARN architecture offers the following benefits:
The client directly starts the Job in YARN, rather than bootstrapping a cluster and after that submitting the job to that cluster. The client can hence disconnect immediately after the job was submitted
All user code libraries and config files are directly in the Application Classpath, rather than in the dynamic user code class loader
Containers are requested as needed and will be released when not used any more
The “as needed” allocation of containers allows for different profiles of containers (CPU / memory) to be used for different operators
这个架构把整个Flink都托管在Yarn内部
好处,
你不需要先拉起flink集群,然后再提交job,只需要直接提交job;Yarn的ResourcManager会先拉起Application Master,其中包含Resource Manager和Job Manager;然后当Flink resource manager需要资源时,会先和YARN ResourceManager请求,它会去创建container,其中包含TaskManager;
Mesos,
这个架构和Yarn类似,
Mesos-specific Fault Tolerance Aspects
ResourceManager and JobManager run inside a regular Mesos container. The Dispatcher is responsible for monitoring and restarting those containers in case they fail. The Dispatcher itself must be made highly available by a Mesos service like Marathon
Standalone,
The Standalone Setup is should keep compatibility with current Standalone Setups.
The role of the long running JobManager is now a “local dispatcher” process that spawns JobManagers with Jobs internally. The ResourceManager lives across jobs and handles TaskManager registration.
For highly-available setups, there are multiple dispatcher processes, competing for being leader, similar as the currently the JobManagers do.
Component Design and Details
更具体的步骤,
FLIP-7: Expose metrics to WebInterface
With the introduction of the metric system it is now time to make it easily accessible to users. As the WebInterface is the first stop for users for any details about Flink, it seems appropriate to expose the gathered metrics there as well.
The changes can be roughly broken down into 4 steps:
- Create a data-structure on the Job-/TaskManager containing a metrics snapshot
- Transfer this snapshot to the WebInterface back-end
- Store the snapshot in the WebRuntimeMonitor in an easily accessible way
- Expose the stored metrics to the WebInterface via REST API
FLIP-8: Rescalable Non-Partitioned State
要解决的问题是,当dynamic scaling的时候,如何解决状态的问题
如果没有状态,动态的scaling,需要做的只是把流量分到新的operator的并发上
但是对于状态,当增加并发的时候,需要把状态切分,而减少并发的时候,需要把状态合并
这个就比较麻烦了
同时在Flink里面,状态分为3部分,operator state, the function state and key-value states
其中对于key-value states的方案相对简单一些,https://docs.google.com/document/d/1G1OS1z3xEBOrYD4wSu-LuBCyPUWyFd9l3T9WyssQ63w/edit#
这里基本的思想,就是更细粒度的checkpoint;
原先是以task级别为粒度,这样加载的时候,只能加载一个task,如果一个task扩成2个,task级别的checkpoint也需要切分
而采用更细粒度的checkpoint独立存储,而不依赖task,这样就可以独立于task进行调度
比如对于key-value,创建一个叫key groups的概念,以key group作为一个checkpoint的单元
In order to efficiently distribute key-value states across the cluster, they should be grouped into key groups. Each key group represents a subset of the key space and is checkpointed as an independent unit. The key groups can then be re-assigned to different tasks if the DOP changes.
这样当发生增减operator的并发度的时候,只需要以key group为单位调度到新的operator上,同时在该operator上恢复相应的checkpoint即可,如图
然后,对于non-partitioned operator and function state,这个问题怎么解
比如对于kafkasource,4个partitions,2个source的并发
scaling down后,就会出现下图,左边的情况,因为只有s1 task了,他只会load他自己的checkpoint,而之前s2的checkpoint就没人管了
而理论上,我们是要达到右边的情况的
scaling up后,也会出现下图左边的case,因为S1,S2加载了原来的checkpoint,但是当前其实partition3,partition4已经不再分配到s2了
思路还是一样,把checkpoint的粒度变细,而不依赖于task,
FLIP-9: Trigger DSL
当前支持的trigger方式不够灵活,而且对late element只能drop,需要设计更为灵活和合理的DSL,用于描述Trigger policy
FLIP-10: Unify Checkpoints and Savepoints
Currently checkpoints and savepoints are handled in slightly different ways with respect to storing and restoring them. The main differences are that savepoints 1) are manually triggered, 2) persist checkpoint meta data, and 3) are not automatically discarded.
With this FLIP, I propose to allow to unify checkpoints and savepoints by allowing savepoints to be triggered automatically.
FLIP-11: Table API Stream Aggregations
The Table API is a declarative API to define queries on static and streaming tables. So far, only projection, selection, and union are supported operations on streaming tables. This FLIP proposes to add support for different types of aggregations on top of streaming tables. In particular, we seek to support:
Group-window aggregates, i.e., aggregates which are computed for a group of elements. A (time or row-count) window is required to bound the infinite input stream into a finite group.
Row-window aggregates, i.e., aggregates which are computed for each row, based on a window (range) of preceding and succeeding rows.
Each type of aggregate shall be supported on keyed/grouped or non-keyed/grouped data streams for streaming tables as well as batch tables.
Since time-windowed aggregates will be the first operation that require the definition of time, this FLIP does also discuss how the Table API handles time characteristics, timestamps, and watermarks.
FLIP-12: Asynchronous I/O Design and Implementation
I/O access, for the most case, is a time-consuming process, making the TPS for single operator much lower than in-memory computing, particularly for streaming job, when low latency is a big concern for users. Starting multiple threads may be an option to handle this problem, but the drawbacks are obvious: The programming model for end users may become more complicated as they have to implement thread model in the operator. Furthermore, they have to pay attention to coordinate with checkpointing.
流最终会碰到外部存储,那就会有IO瓶颈,比如写数据库,那么会阻塞整个流
这个问题怎么解?可以用多线程,这个会让编程模型比较复杂,说白了,不够优雅
所以解决方法其实就是用reactor模型,典型的解决I/O等待的方法
AsyncFunction: Async I/O will be triggered in AsyncFunction.
AsyncWaitOperator: An StreamOperator which will invoke AsyncFunction.
AsyncCollector: For each input streaming record, an AsyncCollector will be created and passed into user's callback to get the async i/o result.
AsyncCollectorBuffer: A buffer to keep all AsyncCollectors.
Emitter Thread: A working thread in AsyncCollectorBuffer, being signalled while some of AsyncCollectors have finished async i/o and emitting results to the following opeartors.
对于普通的operator,调用function,然后把数据用collector发送出去
但对于AsyncWaitOperator,无法直接得到结果,所以把AsyncCollector传入callback,当function触发callback的时候,再emit数据
但这样有个问题,emit的顺序是取决于,执行速度,如果对emit顺序没有要求应该可以
但如果模拟同步的行为,理论上,emit的顺序应该等同于收到的顺序
这样就需要一个buffer,去cache所有的AsyncCollector,即AsyncCollectorBuffer
当callback被执行时,某个AsyncCollector会被填充上数据,这个时候被mark成可发送
但是否发送,需要依赖一个外部的emitter
他会check,并决定是否真正的emit这个AsyncCollector,比如check是否它之前的所有的都已经emit,否则需要等待
这里还需要考虑的是, watermark,它必须要等到前面的数据都已经被成功emit后,才能被emit;这样才能保证一致性
Flink - FLIP的更多相关文章
- Blink: How Alibaba Uses Apache Flink
This is a guest post from Xiaowei Jiang, Senior Director of Alibaba’s search infrastructure team. Th ...
- 揭秘 Flink 1.9 新架构,Blink Planner 你会用了吗?
本文为 Apache Flink 新版本重大功能特性解读之 Flink SQL 系列文章的开篇,Flink SQL 系列文章由其核心贡献者们分享,涵盖基础知识.实践.调优.内部实现等各个方面,带你由浅 ...
- 修改代码150万行!与 Blink 合并后的 Apache Flink 1.9.0 究竟有哪些重大变更?
8月22日,Apache Flink 1.9.0 正式发布,早在今年1月,阿里便宣布将内部过去几年打磨的大数据处理引擎Blink进行开源并向 Apache Flink 贡献代码.当前 Flink 1. ...
- 开篇 | 揭秘 Flink 1.9 新架构,Blink Planner 你会用了吗?
本文为 Apache Flink 新版本重大功能特性解读之 Flink SQL 系列文章的开篇,Flink SQL 系列文章由其核心贡献者们分享,涵盖基础知识.实践.调优.内部实现等各个方面,带你由浅 ...
- Flink整合面向用户的数据流SDKs/API(Flink关于弃用Dataset API的论述)
动机 Flink提供了三种主要的sdk/API来编写程序:Table API/SQL.DataStream API和DataSet API.我们认为这个API太多了,建议弃用DataSet API,而 ...
- POJ 1753. Flip Game 枚举or爆搜+位压缩,或者高斯消元法
Flip Game Time Limit: 1000MS Memory Limit: 65536K Total Submissions: 37427 Accepted: 16288 Descr ...
- [LeetCode] Flip Game 翻转游戏之二
You are playing the following Flip Game with your friend: Given a string that contains only these tw ...
- [LeetCode] Flip Game 翻转游戏
You are playing the following Flip Game with your friend: Given a string that contains only these tw ...
- poj Flip Game 1753 (枚举)
Flip Game Time Limit: 1000MS Memory Limit: 65536K Total Submissions: 27005 Accepted: 11694 Descr ...
随机推荐
- CF 149D Coloring Brackets 区间dp ****
给一个给定括号序列,给该括号上色,上色有三个要求 1.只有三种上色方案,不上色,上红色,上蓝色 2.每对括号必须只能给其中的一个上色 3.相邻的两个不能上同色,可以都不上色 求0-len-1这一区间内 ...
- Bluez alpha版震撼发布!
经过Z.XML团队所有成员的不懈努力,我们的Bluez alpha版成功完成了!现在我们宣布:Bluez alpha版正式发布! 首先我们来向大家介绍下我们这个游戏: 这是一款即时RPG塔防类游戏.在 ...
- 【现代程序设计】homework-06
1) 把程序编译通过, 跑起来. 读懂程序,在你觉得比较难懂的地方加上一些注释,这样大家就能比较容易地了解这些程序在干什么. 把正确的 playPrev(GoMove) 的方法给实现了. 如果大家不会 ...
- 【vijos1066】弱弱的战壕 线段树
描述 永恒和mx正在玩一个即时战略游戏,名字嘛~~~~~~恕本人记性不好,忘了-_-b. mx在他的基地附近建立了n个战壕,每个战壕都是一个独立的作战单位,射程可以达到无限(“mx不赢定了?!?”永恒 ...
- android开发时,finish()跟System.exit(0)的区别
这两天在弄Android,遇到一个问题:所开发的小游戏中有背景音乐,玩的过程中始终有音乐在放着,然后在我退出游戏后,音乐还在播放! 我看了一下我最开始写的退出游戏的代码,就是简单的finish() ...
- 简单几何(求交点) UVA 11437 Triangle Fun
题目传送门 题意:三角形三等分点连线组成的三角形面积 分析:入门题,先求三等分点,再求交点,最后求面积.还可以用梅涅劳斯定理来做 /********************************** ...
- SQL 计算列
SQL计算列,可以解决一般标量计算(数学计算,如ColumnA*ColumnB)的问题,而子查询计算(如select sum(salary) from tableOther where id=’ABC ...
- css的引入方法
1.外部途径: 建立xx.css文件与html文件放在同一目录下 加入 <link rel="stylesheet" type="text/css" hr ...
- SplendidCRM 如何添加及使用中文语言包
SplendidCRM 功能很强大,也支持多国语言,但关于中文语言安装的介绍在网上一直都找到,自已摸索了一下,成功使SplendidCRM应用中文,以下是安装方法. 版本号:SplendidCRM 7 ...
- lal
import java.io.BufferedReader; import java.io.File; import java.io.FileInputStream; import java.io.F ...