Flink checkpoint source code analysis (Part 2)

When reposting, please credit the original address: http://www.cnblogs.com/dongxiao-yang/p/8260370.html

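Part 1 of this series covered the code on the JobManager side that periodically generates TriggerCheckpoint messages. This article continues by looking at how the TaskManager handles an incoming TriggerCheckpoint message and carries out the corresponding snapshot.

A TriggerCheckpoint message travels through the TaskManager along the path handleMessage -> handleCheckpointingMessage -> Task.triggerCheckpointBarrier: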

    public void triggerCheckpointBarrier(
            final long checkpointID,
            long checkpointTimestamp,
            final CheckpointOptions checkpointOptions) {

        final AbstractInvokable invokable = this.invokable;
        final CheckpointMetaData checkpointMetaData = new CheckpointMetaData(checkpointID, checkpointTimestamp);

        if (executionState == ExecutionState.RUNNING && invokable != null) {
            if (invokable instanceof StatefulTask) {
                // build a local closure
                final StatefulTask statefulTask = (StatefulTask) invokable;
                final String taskName = taskNameWithSubtask;
                final SafetyNetCloseableRegistry safetyNetCloseableRegistry =
                    FileSystemSafetyNet.getSafetyNetCloseableRegistryForThread();

                Runnable runnable = new Runnable() {
                    @Override
                    public void run() {
                        // set safety net from the task's context for checkpointing thread
                        LOG.debug("Creating FileSystem stream leak safety net for {}", Thread.currentThread().getName());
                        FileSystemSafetyNet.setSafetyNetCloseableRegistryForThread(safetyNetCloseableRegistry);

                        try {
                            boolean success = statefulTask.triggerCheckpoint(checkpointMetaData, checkpointOptions);
                            if (!success) {
                                checkpointResponder.declineCheckpoint(
                                        getJobID(), getExecutionId(), checkpointID,
                                        new CheckpointDeclineTaskNotReadyException(taskName));
                            }
                        }
                        catch (Throwable t) {
                            if (getExecutionState() == ExecutionState.RUNNING) {
                                failExternally(new Exception(
                                    "Error while triggering checkpoint " + checkpointID + " for " +
                                        taskNameWithSubtask, t));
                            } else {
                                LOG.debug("Encountered error while triggering checkpoint {} for " +
                                    "{} ({}) while being not in state running.", checkpointID,
                                    taskNameWithSubtask, executionId, t);
                            }
                        } finally {
                            FileSystemSafetyNet.setSafetyNetCloseableRegistryForThread(null);
                        }
                    }
                };
                executeAsyncCallRunnable(runnable, String.format("Checkpoint Trigger for %s (%s).", taskNameWithSubtask, executionId));
            }
            else {
                checkpointResponder.declineCheckpoint(jobId, executionId, checkpointID,
                        new CheckpointDeclineTaskNotCheckpointingException(taskNameWithSubtask));

                LOG.error("Task received a checkpoint request, but is not a checkpointing task - {} ({}).",
                        taskNameWithSubtask, executionId);
            }
        }
        else {
            LOG.debug("Declining checkpoint request for non-running task {} ({}).", taskNameWithSubtask, executionId);

            // send back a message that we did not do the checkpoint
            checkpointResponder.declineCheckpoint(jobId, executionId, checkpointID,
                    new CheckpointDeclineTaskNotReadyException(taskNameWithSubtask));
        }
    }
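The decision flow in triggerCheckpointBarrier can be condensed into a small standalone sketch. It is not Flink code: `TriggerSketch`, `Checkpointable` and `Responder` are hypothetical stand-ins for Task, StatefulTask and the checkpoint responder, and the error branch simply records a decline, whereas the real method fails the task when it is still RUNNING.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class TriggerSketch {
    enum State { RUNNING, FINISHED }

    /** Hypothetical stand-in for StatefulTask. */
    interface Checkpointable {
        boolean triggerCheckpoint(long checkpointId) throws Exception;
    }

    /** Hypothetical stand-in for the checkpoint responder: it only records declines. */
    static final class Responder {
        final List<String> declined = Collections.synchronizedList(new ArrayList<String>());

        void decline(long checkpointId, String reason) {
            declined.add(checkpointId + ":" + reason);
        }
    }

    private final ExecutorService asyncCalls = Executors.newSingleThreadExecutor();
    private final Object invokable;
    private final Responder responder;
    private volatile State state = State.RUNNING;

    TriggerSketch(Object invokable, Responder responder) {
        this.invokable = invokable;
        this.responder = responder;
    }

    void setState(State newState) {
        this.state = newState;
    }

    void triggerCheckpointBarrier(final long checkpointId) {
        // Branch 1: the task is not running, so the request is declined.
        if (state != State.RUNNING || invokable == null) {
            responder.decline(checkpointId, "task not ready");
            return;
        }
        // Branch 2: the task is running but cannot take checkpoints.
        if (!(invokable instanceof Checkpointable)) {
            responder.decline(checkpointId, "task is not checkpointing");
            return;
        }
        // Branch 3: hand the actual trigger call to a separate thread.
        final Checkpointable task = (Checkpointable) invokable;
        asyncCalls.execute(() -> {
            try {
                if (!task.triggerCheckpoint(checkpointId)) {
                    responder.decline(checkpointId, "task not ready");
                }
            } catch (Exception e) {
                // Simplification: the real code fails the task here if it is still RUNNING.
                responder.decline(checkpointId, "error: " + e.getMessage());
            }
        });
    }

    /** Stops the async executor and waits for queued trigger calls to finish. */
    void shutdownAndWait() {
        asyncCalls.shutdown();
        try {
            asyncCalls.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        Responder responder = new Responder();
        Checkpointable notReady = id -> false;
        TriggerSketch task = new TriggerSketch(notReady, responder);
        task.triggerCheckpointBarrier(1L);
        task.shutdownAndWait();
        System.out.println(responder.declined);
    }
}
```

The point of the sketch is that the caller never blocks on the checkpoint itself: the two cheap rejections happen inline, and everything else runs on the executor.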

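Under normal conditions, triggerCheckpointBarrier calls the triggerCheckpoint() method implemented in StreamTask, which then follows this call chain down to executeCheckpointing:

triggerCheckpoint -> performCheckpoint -> checkpointState -> CheckpointingOperation.executeCheckpointing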

    public void executeCheckpointing() throws Exception {
        startSyncPartNano = System.nanoTime();

        boolean failed = true;
        try {
            for (StreamOperator<?> op : allOperators) {
                checkpointStreamOperator(op);
            }

            if (LOG.isDebugEnabled()) {
                LOG.debug("Finished synchronous checkpoints for checkpoint {} on task {}",
                        checkpointMetaData.getCheckpointId(), owner.getName());
            }

            startAsyncPartNano = System.nanoTime();

            checkpointMetrics.setSyncDurationMillis((startAsyncPartNano - startSyncPartNano) / 1_000_000);

            // at this point we are transferring ownership over snapshotInProgressList for cleanup to the thread
            runAsyncCheckpointingAndAcknowledge();
            failed = false;

            if (LOG.isDebugEnabled()) {
                LOG.debug("{} - finished synchronous part of checkpoint {}." +
                        "Alignment duration: {} ms, snapshot duration {} ms",
                    owner.getName(), checkpointMetaData.getCheckpointId(),
                    checkpointMetrics.getAlignmentDurationNanos() / 1_000_000,
                    checkpointMetrics.getSyncDurationMillis());
            }
        // ..... (the excerpt ends here; the rest of the method is not shown)

executeCheckpointing performs two operations. The first is to call checkpointStreamOperator(op) on every StreamOperator that belongs to the task.

The code of checkpointStreamOperator:

    private void checkpointStreamOperator(StreamOperator<?> op) throws Exception {
        if (null != op) {
            // first call the legacy checkpoint code paths
            nonPartitionedStates.add(op.snapshotLegacyOperatorState(
                    checkpointMetaData.getCheckpointId(),
                    checkpointMetaData.getTimestamp(),
                    checkpointOptions));

            OperatorSnapshotResult snapshotInProgress = op.snapshotState(
                    checkpointMetaData.getCheckpointId(),
                    checkpointMetaData.getTimestamp(),
                    checkpointOptions);

            snapshotInProgressList.add(snapshotInProgress);
        } else {
            nonPartitionedStates.add(null);
            OperatorSnapshotResult emptySnapshotInProgress = new OperatorSnapshotResult();
            snapshotInProgressList.add(emptySnapshotInProgress);
        }
    }

The snapshotState(long checkpointId, long timestamp, CheckpointOptions checkpointOptions) method of StreamOperator is ultimately given a final implementation by AbstractStreamOperator:

    @Override
    public final OperatorSnapshotResult snapshotState(long checkpointId, long timestamp, CheckpointOptions checkpointOptions) throws Exception {

        KeyGroupRange keyGroupRange = null != keyedStateBackend ?
                keyedStateBackend.getKeyGroupRange() : KeyGroupRange.EMPTY_KEY_GROUP_RANGE;

        OperatorSnapshotResult snapshotInProgress = new OperatorSnapshotResult();

        CheckpointStreamFactory factory = getCheckpointStreamFactory(checkpointOptions);

        try (StateSnapshotContextSynchronousImpl snapshotContext = new StateSnapshotContextSynchronousImpl(
                checkpointId,
                timestamp,
                factory,
                keyGroupRange,
                getContainingTask().getCancelables())) {

            snapshotState(snapshotContext);

            snapshotInProgress.setKeyedStateRawFuture(snapshotContext.getKeyedStateStreamFuture());
            snapshotInProgress.setOperatorStateRawFuture(snapshotContext.getOperatorStateStreamFuture());

            if (null != operatorStateBackend) {
                snapshotInProgress.setOperatorStateManagedFuture(
                    operatorStateBackend.snapshot(checkpointId, timestamp, factory, checkpointOptions));
            }

            if (null != keyedStateBackend) {
                snapshotInProgress.setKeyedStateManagedFuture(
                    keyedStateBackend.snapshot(checkpointId, timestamp, factory, checkpointOptions));
            }
        } catch (Exception snapshotException) {
            try {
                snapshotInProgress.cancel();
            } catch (Exception e) {
                snapshotException.addSuppressed(e);
            }

            throw new Exception("Could not complete snapshot " + checkpointId + " for operator " +
                getOperatorName() + '.', snapshotException);
        }

        return snapshotInProgress;
    }

The snapshotState(snapshotContext) method called in the code above has its own concrete implementation in each concrete operator.
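This arrangement is the template method pattern: a final method fixes the overall snapshot procedure and delegates one step to a hook that each operator overrides. The sketch below illustrates only that shape. `BaseOperator`, `snapshotCustomState`, `SnapshotResult` and `CountingOperator` are hypothetical names, and the sketch throws an unchecked exception where the real method throws Exception.

```java
import java.util.ArrayList;
import java.util.List;

final class SnapshotTemplateSketch {

    /** Hypothetical, heavily simplified stand-in for OperatorSnapshotResult. */
    static final class SnapshotResult {
        final List<String> parts = new ArrayList<>();
    }

    abstract static class BaseOperator {

        // final: subclasses cannot change the overall snapshot procedure
        public final SnapshotResult snapshotState(long checkpointId) {
            SnapshotResult result = new SnapshotResult();
            try {
                // operator-specific step, supplied by the subclass
                snapshotCustomState(checkpointId, result);
                // common step, stands in for the state backend snapshots
                result.parts.add("managed-state@" + checkpointId);
            } catch (RuntimeException e) {
                // stands in for snapshotInProgress.cancel()
                result.parts.clear();
                throw new IllegalStateException(
                        "Could not complete snapshot " + checkpointId, e);
            }
            return result;
        }

        /** Hook with an empty default, overridden by concrete operators. */
        protected void snapshotCustomState(long checkpointId, SnapshotResult result) {
        }
    }

    /** Example operator that contributes one piece of custom state. */
    static final class CountingOperator extends BaseOperator {
        private final long count = 42;

        @Override
        protected void snapshotCustomState(long checkpointId, SnapshotResult result) {
            result.parts.add("count=" + count);
        }
    }

    public static void main(String[] args) {
        SnapshotResult result = new CountingOperator().snapshotState(7L);
        System.out.println(result.parts);
    }
}
```

Because the outer method is final, error handling and backend snapshots stay identical for every operator, and an operator author only has to think about the hook.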

The second operation in executeCheckpointing is the call to runAsyncCheckpointingAndAcknowledge, which performs all the work of persisting state to files and then sends acknowledgeCheckpoint back to the JobManager.

    private static final class AsyncCheckpointRunnable implements Runnable, Closeable {
        .....
        .....
                if (asyncCheckpointState.compareAndSet(CheckpointingOperation.AsynCheckpointState.RUNNING,
                        CheckpointingOperation.AsynCheckpointState.COMPLETED)) {

                    owner.getEnvironment().acknowledgeCheckpoint(
                        checkpointMetaData.getCheckpointId(),
                        checkpointMetrics,
                        subtaskState);
        .....
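The compareAndSet in this excerpt is what guarantees that the acknowledgement is sent at most once, and never after the runnable has been closed. A minimal sketch of that guard follows, assuming a three-value state (RUNNING, DISCARDED, COMPLETED). `AsyncAckSketch` and its `acks` counter are hypothetical, and the actual state persistence is left out.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

final class AsyncAckSketch implements Runnable {
    enum AsyncState { RUNNING, DISCARDED, COMPLETED }

    private final AtomicReference<AsyncState> state =
            new AtomicReference<>(AsyncState.RUNNING);

    /** Counts acknowledgements; stands in for the message to the JobManager. */
    final AtomicInteger acks = new AtomicInteger();

    @Override
    public void run() {
        // ... the state would be persisted here ...

        // Only the thread that moves RUNNING -> COMPLETED may acknowledge.
        if (state.compareAndSet(AsyncState.RUNNING, AsyncState.COMPLETED)) {
            acks.incrementAndGet();
        }
    }

    /** Cancels the operation if it has not completed yet. */
    void close() {
        state.compareAndSet(AsyncState.RUNNING, AsyncState.DISCARDED);
    }

    AsyncState currentState() {
        return state.get();
    }

    public static void main(String[] args) {
        AsyncAckSketch op = new AsyncAckSketch();
        op.run();
        op.close(); // no effect, the operation already completed
        System.out.println(op.currentState() + " acks=" + op.acks.get());
    }
}
```

Whichever of run() and close() wins the transition out of RUNNING decides the outcome, so a cancelled checkpoint can never be acknowledged by a late-finishing thread.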

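One addition: inside the performCheckpoint method mentioned above, before checkpointState is called, Flink first sends the CheckpointBarrier to the downstream tasks so that downstream operators can start their own checkpointing as early as possible. This is also the place where Flink's barrier mechanism generates barriers.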

    synchronized (lock) {
        if (isRunning) {
            // we can do a checkpoint

            // Since both state checkpointing and downstream barrier emission occurs in this
            // lock scope, they are an atomic operation regardless of the order in which they occur.
            // Given this, we immediately emit the checkpoint barriers, so the downstream operators
            // can start their checkpoint work as soon as possible
            operatorChain.broadcastCheckpointBarrier(
                    checkpointMetaData.getCheckpointId(),
                    checkpointMetaData.getTimestamp(),
                    checkpointOptions);

            checkpointState(checkpointMetaData, checkpointOptions, checkpointMetrics);
            return true;
        // ..... (the excerpt ends here)

    public void broadcastCheckpointBarrier(long id, long timestamp, CheckpointOptions checkpointOptions) throws IOException {
        try {
            CheckpointBarrier barrier = new CheckpointBarrier(id, timestamp, checkpointOptions);
            for (RecordWriterOutput<?> streamOutput : streamOutputs) {
                streamOutput.broadcastEvent(barrier);
            }
        }
        catch (InterruptedException e) {
            throw new IOException("Interrupted while broadcasting checkpoint barrier");
        }
    }
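The comment in performCheckpoint argues that barrier emission and state snapshotting are atomic because both happen inside the same lock scope. The sketch below makes that argument concrete under one assumption: record emission takes the same lock. `BarrierFirstSketch`, `emitRecord` and the `events` list are hypothetical, and returning false for a stopped task is a simplification of what the real method does in that branch.

```java
import java.util.ArrayList;
import java.util.List;

final class BarrierFirstSketch {
    private final Object lock = new Object();
    private volatile boolean running = true;

    /** Ordered log of everything sent downstream or snapshotted. */
    final List<String> events = new ArrayList<>();

    /** Emits a record; assumed to take the same lock as the checkpoint. */
    void emitRecord(String record) {
        synchronized (lock) {
            events.add("record-" + record);
        }
    }

    /** Emits the barrier first, then snapshots, both inside one lock scope. */
    boolean performCheckpoint(long checkpointId) {
        synchronized (lock) {
            if (!running) {
                return false;
            }
            // the barrier goes out first so downstream tasks can start early
            events.add("barrier-" + checkpointId);
            // the snapshot follows; no record can slip in between the two
            events.add("snapshot-" + checkpointId);
            return true;
        }
    }

    void stop() {
        running = false;
    }

    public static void main(String[] args) {
        BarrierFirstSketch task = new BarrierFirstSketch();
        task.emitRecord("a");
        task.performCheckpoint(1L);
        task.emitRecord("b");
        System.out.println(task.events);
    }
}
```

Since no record can be emitted between the barrier and the snapshot, the snapshot reflects exactly the records that precede the barrier in the output stream, whichever of the two steps runs first.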

The trigger path described above applies to source tasks. For the remaining, non-source operators, the method chain is:

StreamInputProcessor/StreamTwoInputProcessor.processInput() -> barrierHandler.getNextNonBlocked() -> processBarrier -> notifyCheckpoint -> triggerCheckpointOnBarrier
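For a non-source task the checkpoint is driven by barriers arriving on its input channels rather than by a message from the JobManager. The sketch below shows the general idea of barrier alignment, assuming exactly-once behaviour: a channel is blocked once its barrier arrives, and the checkpoint is triggered when every channel has delivered the barrier. `AlignmentSketch` and all of its members are hypothetical simplifications, not the logic of Flink's barrier handler.

```java
import java.util.HashSet;
import java.util.Set;

final class AlignmentSketch {
    private final int numChannels;
    private final Set<Integer> blockedChannels = new HashSet<>();
    private long currentCheckpointId = -1;

    /** Id of the last checkpoint triggered; stands in for notifyCheckpoint. */
    long lastTriggered = -1;

    AlignmentSketch(int numChannels) {
        if (numChannels < 1) {
            throw new IllegalArgumentException("need at least one channel");
        }
        this.numChannels = numChannels;
    }

    /** Returns true if this barrier completed the alignment. */
    boolean processBarrier(long checkpointId, int channel) {
        if (channel < 0 || channel >= numChannels) {
            throw new IllegalArgumentException("unknown channel " + channel);
        }
        if (checkpointId < currentCheckpointId) {
            return false; // stale barrier from an older checkpoint, ignored
        }
        if (checkpointId > currentCheckpointId) {
            // a newer checkpoint starts a fresh alignment
            currentCheckpointId = checkpointId;
            blockedChannels.clear();
        }
        blockedChannels.add(channel);

        if (blockedChannels.size() == numChannels) {
            // all inputs have reached the barrier: unblock and trigger
            blockedChannels.clear();
            lastTriggered = checkpointId;
            return true;
        }
        return false;
    }

    /** While a channel is blocked, its records must be buffered, not processed. */
    boolean isBlocked(int channel) {
        return blockedChannels.contains(channel);
    }

    public static void main(String[] args) {
        AlignmentSketch handler = new AlignmentSketch(2);
        System.out.println(handler.processBarrier(1L, 0));
        System.out.println(handler.processBarrier(1L, 1));
    }
}
```

With a single input channel the first barrier completes the alignment immediately, which matches the intuition that alignment only matters for tasks with several inputs.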
