flink dataset api使用及原理

随着大数据技术在各行各业的广泛应用，要求能对海量数据进行实时处理的需求越来越多，同时数据处理的业务逻辑也越来越复杂，传统的批处理方式和早期的流式处理框架也越来越难以在延迟性、吞吐量、容错能力以及使用便捷性等方面满足业务日益苛刻的要求。

在这种形势下，新型流式处理框架Flink通过创造性地把现代大规模并行处理技术应用到流式处理中来，极大地改善了以前的流式处理框架所存在的问题。

1.概述：

flink提供DataSet Api用户处理批量数据。flink先将接入数据转换成DataSet数据集，并行分布在集群的每个节点上；然后将DataSet数据集进行各种转换操作(map，filter等)，最后通过DataSink操作将结果数据集输出到外部系统。

2.数据接入

输入InputFormat

/**

 * The base interface for data sources that produces records.

 * <p>

 * The input format handles the following:

 * <ul>

 *   <li>It describes how the input is split into splits that can be processed in parallel.</li>

 *   <li>It describes how to read records from the input split.</li>

 *   <li>It describes how to gather basic statistics from the input.</li>

 * </ul>

 * <p>

 * The life cycle of an input format is the following:

 * <ol>

 *   <li>After being instantiated (parameterless), it is configured with a {@link Configuration} object.

 *       Basic fields are read from the configuration, such as a file path, if the format describes

 *       files as input.</li>

 *   <li>Optionally: It is called by the compiler to produce basic statistics about the input.</li>

 *   <li>It is called to create the input splits.</li>

 *   <li>Each parallel input task creates an instance, configures it and opens it for a specific split.</li>

 *   <li>All records are read from the input</li>

 *   <li>The input format is closed</li>

 * </ol>

 * <p>

 * IMPORTANT NOTE: Input formats must be written such that an instance can be opened again after it was closed. That

 * is due to the fact that the input format is used for potentially multiple splits. After a split is done, the

 * format's close function is invoked and, if another split is available, the open function is invoked afterwards for

 * the next split.

 *

 * @see InputSplit

 * @see BaseStatistics

 *

 * @param <OT> The type of the produced records.

 * @param <T> The type of input split.

 */

3.数据转换

DataSet：一组相同类型的元素。DataSet可以通过transformation转换成其它的DataSet。示例如下：

DataSet#map(org.apache.flink.api.common.functions.MapFunction)

DataSet#reduce(org.apache.flink.api.common.functions.ReduceFunction)

DataSet#join(DataSet)

DataSet#coGroup(DataSet)

其中，Function：用户定义的业务逻辑，支持java 8 lambda表达式

function的实现通过operator来做的，以map为例

    /**

     * Applies a Map transformation on this DataSet.

     *

     * <p>The transformation calls a {@link org.apache.flink.api.common.functions.MapFunction} for each element of the DataSet.

     * Each MapFunction call returns exactly one element.

     *

     * @param mapper The MapFunction that is called for each element of the DataSet.

     * @return A MapOperator that represents the transformed DataSet.

     *

     * @see org.apache.flink.api.common.functions.MapFunction

     * @see org.apache.flink.api.common.functions.RichMapFunction

     * @see MapOperator

     */

    public <R> MapOperator<T, R> map(MapFunction<T, R> mapper) {

        if (mapper == null) {

            throw new NullPointerException("Map function must not be null.");

        }

        String callLocation = Utils.getCallLocationName();

        TypeInformation<R> resultType = TypeExtractor.getMapReturnTypes(mapper, getType(), callLocation, true);

        return new MapOperator<>(this, resultType, clean(mapper), callLocation);

    }

其中，Operator

4.数据输出

DataSink：一个用来存储数据结果的操作。

输出OutputFormat

例如，可以csv输出

    /**

     * Writes a {@link Tuple} DataSet as CSV file(s) to the specified location with the specified field and line delimiters.

     *

     * <p><b>Note: Only a Tuple DataSet can written as a CSV file.</b>

      * For each Tuple field the result of {@link Object#toString()} is written.

     *

     * @param filePath The path pointing to the location the CSV file is written to.

     * @param rowDelimiter The row delimiter to separate Tuples.

     * @param fieldDelimiter The field delimiter to separate Tuple fields.

     * @param writeMode The behavior regarding existing files. Options are NO_OVERWRITE and OVERWRITE.

     *

     * @see Tuple

     * @see CsvOutputFormat

     * @see DataSet#writeAsText(String) Output files and directories

     */

    public DataSink<T> writeAsCsv(String filePath, String rowDelimiter, String fieldDelimiter, WriteMode writeMode) {

        return internalWriteAsCsv(new Path(filePath), rowDelimiter, fieldDelimiter, writeMode);

    }

    @SuppressWarnings("unchecked")

    private <X extends Tuple> DataSink<T> internalWriteAsCsv(Path filePath, String rowDelimiter, String fieldDelimiter, WriteMode wm) {

        Preconditions.checkArgument(getType().isTupleType(), "The writeAsCsv() method can only be used on data sets of tuples.");

        CsvOutputFormat<X> of = new CsvOutputFormat<>(filePath, rowDelimiter, fieldDelimiter);

        if (wm != null) {

            of.setWriteMode(wm);

        }

        return output((OutputFormat<T>) of);

    }

5.总结

　　1. flink通过InputFormat对各种数据源的数据进行读取转换成DataSet数据集

　　2. flink提供了丰富的转换操作，DataSet可以通过transformation转换成其它的DataSet，内部的实现是Function和Operator。

　　3. flink通过OutFormat将DataSet转换成DataSink，最终将数据写入到不同的存储介质。

参考资料：

【1】https://blog.51cto.com/13654660/2087705

flink dataset api使用及原理的更多相关文章

flink DataStream API使用及原理
传统的大数据处理方式一般是批处理式的,也就是说,今天所收集的数据,我们明天再把今天收集到的数据算出来,以供大家使用,但是在很多情况下,数据的时效性对于业务的成败是非常关键的. Spark 和 Flin ...
Flink DataSet API Programming Guide
https://ci.apache.org/projects/flink/flink-docs-release-0.10/apis/programming_guide.html Example ...
Apache Flink - Batch(DataSet API)
Flink DataSet API编程指南: Flink中的DataSet程序是实现数据集转换的常规程序(例如,过滤,映射,连接,分组).数据集最初是从某些来源创建的(例如,通过读取文件或从本地集合创 ...
Flink入门（五）——DataSet Api编程指南
Apache Flink Apache Flink 是一个兼顾高吞吐.低延迟.高性能的分布式处理框架.在实时计算崛起的今天,Flink正在飞速发展.由于性能的优势和兼顾批处理,流处理的特性,Flink ...
Apache Flink 1.12.0 正式发布，DataSet API 将被弃用，真正的流批一体
Apache Flink 1.12.0 正式发布 Apache Flink 社区很荣幸地宣布 Flink 1.12.0 版本正式发布!近 300 位贡献者参与了 Flink 1.12.0 的开发,提交 ...
Flink整合面向用户的数据流SDKs/API(Flink关于弃用Dataset API的论述)
动机 Flink提供了三种主要的sdk/API来编写程序:Table API/SQL.DataStream API和DataSet API.我们认为这个API太多了,建议弃用DataSet API,而 ...
【翻译】Flink Table Api & SQL — 用户定义函数
本文翻译自官网:User-defined Functions https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/tabl ...
8、Flink Table API & Flink Sql API
一.概述上图是flink的分层模型,Table API 和 SQL 处于最顶端,是 Flink 提供的高级 API 操作.Flink SQL 是 Flink 实时计算为简化计算模型,降低用户使用实时 ...
统一批处理流处理——Flink批流一体实现原理
实现批处理的技术许许多多,从各种关系型数据库的sql处理,到大数据领域的MapReduce,Hive,Spark等等.这些都是处理有限数据流的经典方式.而Flink专注的是无限流处理,那么他是怎么做到 ...

随机推荐

C#调用C/C++ DLL 参数传递和回调函数的总结
原文:C#调用C/C++ DLL 参数传递和回调函数的总结 Int型传入: Dll端: extern "C" __declspec(dllexport) int Add(int a ...
C#基础加强篇----委托、Lamada表达式和事件（上）
1.委托 C#的委托相当于C/C++中的函数指针.函数指针用指针获取一个函数的入口地址,实现对函数的操作. 委托与C/C++中的函数指针不同在于,委托是面向对象的,是引用类型,对委托的使用要先定义后实 ...
LIBCMTD.lib(exe_winmain.obj) : error LNK2019: 无法解析的外部符号 _WinMain@16，该符号在函数 "int __cdecl invoke_main(void)" (?invoke_main@@YAHXZ) 中被引用
这个问题是没找到程序入口在网上查这个问题,一般都是说两条: (若是win32程序) 一是在项目属性\CC++\预处理器\预处理器定义\里添加 _WINDOWS 一是在项目属性\链接\系统里选择窗 ...
Mysql下载(on windows-noinstall zip archive)
所有内容,都是针对Mysql5.7.18介绍. 1.首先你需要下载一个完整的包,Mysql目前有两个版本可以使用: a. MySql Enterprise Edition:企业版 b. MySql C ...
企业级架构 MVVM 模式指南 (WPF 和 Silverlight 实现) 译（2）
本书包含的章节内容第一章:表现模式,以一个例子呈献给读者表现模式的发展历程,我们会用包括MVC和MVP在内的各种方式实现一个收费项目的例子.沿此方向,我们会发现每一种模式的问题所在,这也是触发设计模 ...
百度 Echarts 地图表 js 引用路径
使用地图表格,除了需echarts,还需zrender,自行下载JS文件: 目标,做成这样的效果:http://echarts.baidu.com/doc/example/map3.html ...
FMX+Win32，窗口无法保持原样，应该是个bug
从FMX发布开始,一直有这问题,大家看看是不是一个bug,应该如何修复? 新建一个FMX Application,运行后,点击窗口标题栏右上角的“最大化”按钮,此时窗口是最大化的.在windows最底 ...
Python正则表达式进阶-零宽断言
1. 什么是零宽断言有时候在使用正则表达式做匹配的时候,我们希望匹配一个字符串,这个字符串的前面或后面需要是特定的内容,但我们又不想要前面或后面的这个特定的内容,这时候就需要零宽断言的帮助了.所谓零 ...
Codlility---MinPerimeterRectangle
Task description An integer N is given, representing the area of some rectangle. The area of a recta ...
java中list和Arrylist的区别
List:是一个有序的集合,可以包含重复的元素.提供了按索引访问的方式.它继承 Collection. List有两个重要的实现类:ArrayList 和 LinkedList ArrayList:我 ...

flink dataset api使用及原理

flink dataset api使用及原理的更多相关文章

随机推荐

热门专题