Hadoop Compatibility in Flink
18 Nov 2014 by Fabian Hüske (@fhueske)
Apache Hadoop is an industry standard for scalable analytical data processing. Many data analysis applications have been implemented as Hadoop MapReduce jobs and run in clusters around the world. Apache Flink can be an alternative to MapReduce and improves it in many dimensions. Among other features, Flink provides much better performance and offers APIs in Java and Scala, which are very easy to use. Similar to Hadoop, Flink’s APIs provide interfaces for Mapper and Reducer functions, as well as Input- and OutputFormats along with many more operators. While being conceptually equivalent, Hadoop’s MapReduce and Flink’s interfaces for these functions are unfortunately not source compatible.
Flink’s Hadoop Compatibility Package
To close this gap, Flink provides a Hadoop Compatibility package to wrap functions implemented against Hadoop’s MapReduce interfaces and embed them in Flink programs. This package was developed as part of a Google Summer of Code 2014 project.
With the Hadoop Compatibility package, you can reuse all your Hadoop
- InputFormats (mapred and mapreduce APIs)
- OutputFormats (mapred and mapreduce APIs)
- Mappers (mapred API)
- Reducers (mapred API)
in Flink programs without changing a line of code. Moreover, Flink also natively supports all Hadoop data types (Writables and WritableComparable).
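The WordCount example below uses the mapred API throughout; InputFormats and OutputFormats of the newer mapreduce API are wrapped in the same fashion. The following minimal sketch is not part of the original post: it assumes a HadoopInputFormat wrapper for the mapreduce API that takes the InputFormat, the key and value classes, and a Hadoop Job, and the package of the compatibility classes has changed between Flink versions.

// A hedged sketch: reading text with the mapreduce-API TextInputFormat.
// The import below reflects the flink-hadoop-compatibility layout of that era;
// later Flink versions moved these classes, so adjust the package as needed.
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.hadoopcompatibility.mapreduce.HadoopInputFormat;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class MapreduceApiInput {
  public static void main(String[] args) throws Exception {
    final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    // The Job carries the Hadoop configuration, including the input path
    Job job = Job.getInstance();
    TextInputFormat.addInputPath(job, new Path(args[0]));

    HadoopInputFormat<LongWritable, Text> hadoopInputFormat =
        new HadoopInputFormat<LongWritable, Text>(
            new TextInputFormat(), LongWritable.class, Text.class, job);

    // Each record arrives as a Tuple2 of the Hadoop key and value
    DataSet<Tuple2<LongWritable, Text>> text = env.createInput(hadoopInputFormat);
    // ... continue with wrapped Hadoop functions or native Flink transformations
  }
}

OutputFormats of the mapreduce API would be wrapped analogously, with the output path set on the Job.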
The following code snippet shows a simple Flink WordCount program that solely uses Hadoop data types, InputFormat, OutputFormat, Mapper, and Reducer functions.
// Definition of Hadoop Mapper function
public class Tokenizer implements Mapper<LongWritable, Text, Text, LongWritable> { ... }
// Definition of Hadoop Reducer function
public class Counter implements Reducer<Text, LongWritable, Text, LongWritable> { ... }

public static void main(String[] args) throws Exception {
  final String inputPath = args[0];
  final String outputPath = args[1];

  final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

  // Setup Hadoop's TextInputFormat
  HadoopInputFormat<LongWritable, Text> hadoopInputFormat =
      new HadoopInputFormat<LongWritable, Text>(
          new TextInputFormat(), LongWritable.class, Text.class, new JobConf());
  TextInputFormat.addInputPath(hadoopInputFormat.getJobConf(), new Path(inputPath));

  // Read a DataSet with the Hadoop InputFormat
  DataSet<Tuple2<LongWritable, Text>> text = env.createInput(hadoopInputFormat);

  DataSet<Tuple2<Text, LongWritable>> words = text
      // Wrap Tokenizer Mapper function
      .flatMap(new HadoopMapFunction<LongWritable, Text, Text, LongWritable>(new Tokenizer()))
      .groupBy(0)
      // Wrap Counter Reducer function (used as Reducer and Combiner)
      .reduceGroup(new HadoopReduceCombineFunction<Text, LongWritable, Text, LongWritable>(
          new Counter(), new Counter()));

  // Setup Hadoop's TextOutputFormat
  HadoopOutputFormat<Text, LongWritable> hadoopOutputFormat =
      new HadoopOutputFormat<Text, LongWritable>(
          new TextOutputFormat<Text, LongWritable>(), new JobConf());
  hadoopOutputFormat.getJobConf().set("mapred.textoutputformat.separator", " ");
  TextOutputFormat.setOutputPath(hadoopOutputFormat.getJobConf(), new Path(outputPath));

  // Output & Execute
  words.output(hadoopOutputFormat);
  env.execute("Hadoop Compat WordCount");
}
As you can see, Flink represents Hadoop key-value pairs as Tuple2<key, value> tuples. Note that the program uses Flink's groupBy() transformation to group data on the key field (field 0 of the Tuple2<key, value>) before it is given to the Reducer function. At the moment, the compatibility package does not evaluate custom Hadoop partitioners, sorting comparators, or grouping comparators.
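The bodies of the Tokenizer and Counter functions are elided in the snippet above; only their Hadoop interfaces are given in the original post. For readers unfamiliar with the mapred API, the following sketch shows one possible implementation of such functions (extending MapReduceBase for brevity; the names follow the example above).

// Possible implementations of the elided Hadoop functions (sketch only).
// Both classes would live in their own source files; MapReduceBase provides
// the empty configure() and close() methods required by the mapred interfaces.
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Splits each input line into words and emits a (word, 1) pair per word
public class Tokenizer extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {
  @Override
  public void map(LongWritable offset, Text line,
      OutputCollector<Text, LongWritable> out, Reporter reporter) throws IOException {
    for (String word : line.toString().toLowerCase().split("\\W+")) {
      if (!word.isEmpty()) {
        out.collect(new Text(word), new LongWritable(1L));
      }
    }
  }
}

// Sums up the counts emitted for each word; usable as Reducer and Combiner
public class Counter extends MapReduceBase
    implements Reducer<Text, LongWritable, Text, LongWritable> {
  @Override
  public void reduce(Text word, Iterator<LongWritable> counts,
      OutputCollector<Text, LongWritable> out, Reporter reporter) throws IOException {
    long sum = 0;
    while (counts.hasNext()) {
      sum += counts.next().get();
    }
    out.collect(word, new LongWritable(sum));
  }
}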
Hadoop functions can be used at any position within a Flink program and can of course also be mixed with native Flink functions. This means that instead of assembling a workflow of Hadoop jobs in an external driver method or using a workflow scheduler such as Apache Oozie, you can implement an arbitrarily complex Flink program consisting of multiple Hadoop Input- and OutputFormats, Mapper and Reducer functions. When executing such a Flink program, data will be pipelined between your Hadoop functions and will not be written to HDFS just for the purpose of data exchange.
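For example, a native Flink function can be chained directly onto the wrapped Hadoop functions of the WordCount above. The fragment below is not part of the original program; it reuses the words and hadoopOutputFormat variables from the snippet and filters the reduced counts with a plain Flink FilterFunction before writing them with the Hadoop OutputFormat.

// Sketch only: mix a native Flink function into the Hadoop-based WordCount.
// The filter consumes the output of the wrapped Hadoop Reducer in a pipelined
// fashion; nothing is written to HDFS between the two operators.
DataSet<Tuple2<Text, LongWritable>> frequentWords = words
    .filter(new FilterFunction<Tuple2<Text, LongWritable>>() {
      @Override
      public boolean filter(Tuple2<Text, LongWritable> wordCount) {
        // keep only words that occur more than once
        return wordCount.f1.get() > 1L;
      }
    });

// Write the filtered result with the Hadoop TextOutputFormat configured above
frequentWords.output(hadoopOutputFormat);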
What comes next?
While the Hadoop compatibility package is already very useful, we are currently working on a dedicated Hadoop Job operation to embed and execute Hadoop jobs as a whole in Flink programs, including their custom partitioning, sorting, and grouping code. With this feature, you will be able to chain multiple Hadoop jobs, mix them with Flink functions, and combine them with other operations such as Spargel operations (Pregel/Giraph-style jobs).
Summary
Flink lets you reuse a lot of the code you wrote for Hadoop MapReduce, including all data types, all Input- and OutputFormats, and the Mappers and Reducers of the mapred API. Hadoop functions can be used within Flink programs and mixed with all other Flink functions. Thanks to Flink's pipelined execution, Hadoop functions can be assembled arbitrarily without exchanging data via HDFS. Moreover, the Flink community is currently working on a dedicated Hadoop Job operation to support the execution of Hadoop jobs as a whole.
If you want to use Flink's Hadoop compatibility package, check out our documentation.