Hadoop Compatibility in Flink
18 Nov 2014 by Fabian Hüske (@fhueske)
Apache Hadoop is an industry standard for scalable analytical data processing. Many data analysis applications have been implemented as Hadoop MapReduce jobs and run in clusters around the world. Apache Flink can be an alternative to MapReduce and improves it in many dimensions. Among other features, Flink provides much better performance and offers APIs in Java and Scala, which are very easy to use. Similar to Hadoop, Flink’s APIs provide interfaces for Mapper and Reducer functions, as well as Input- and OutputFormats along with many more operators. While being conceptually equivalent, Hadoop’s MapReduce and Flink’s interfaces for these functions are unfortunately not source compatible.
Flink’s Hadoop Compatibility Package
To close this gap, Flink provides a Hadoop Compatibility package to wrap functions implemented against Hadoop’s MapReduce interfaces and embed them in Flink programs. This package was developed as part of a Google Summer of Code 2014 project.
With the Hadoop Compatibility package, you can reuse all your Hadoop

- InputFormats (mapred and mapreduce APIs)
- OutputFormats (mapred and mapreduce APIs)
- Mappers (mapred API)
- Reducers (mapred API)

in Flink programs without changing a line of code. Moreover, Flink also natively supports all Hadoop data types (Writables and WritableComparable).
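The full example below uses the mapred API, but the wrappers for the new mapreduce API look very similar. Here is a minimal sketch of reading with a mapreduce-API TextInputFormat; the exact package of this HadoopInputFormat variant and its constructor signature are assumptions here and may differ between Flink versions, so please check the documentation of your version:

// Sketch only: wrapping an InputFormat of the new mapreduce API.
// Assumes a HadoopInputFormat variant that is configured with a Hadoop Job instead of a JobConf.
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

Job job = Job.getInstance();
HadoopInputFormat<LongWritable, Text> hadoopInputFormat =
    new HadoopInputFormat<LongWritable, Text>(
        new TextInputFormat(), LongWritable.class, Text.class, job);
// "hdfs:///path/to/input" is only a placeholder path
TextInputFormat.addInputPath(job, new Path("hdfs:///path/to/input"));

DataSet<Tuple2<LongWritable, Text>> text = env.createInput(hadoopInputFormat);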
The following code snippet shows a simple Flink WordCount program that solely uses Hadoop data types, InputFormat, OutputFormat, Mapper, and Reducer functions.
// Definition of Hadoop Mapper function
public class Tokenizer implements Mapper<LongWritable, Text, Text, LongWritable> { ... }

// Definition of Hadoop Reducer function
public class Counter implements Reducer<Text, LongWritable, Text, LongWritable> { ... }

public static void main(String[] args) {
  final String inputPath = args[0];
  final String outputPath = args[1];

  final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

  // Setup Hadoop’s TextInputFormat
  HadoopInputFormat<LongWritable, Text> hadoopInputFormat =
      new HadoopInputFormat<LongWritable, Text>(
          new TextInputFormat(), LongWritable.class, Text.class, new JobConf());
  TextInputFormat.addInputPath(hadoopInputFormat.getJobConf(), new Path(inputPath));

  // Read a DataSet with the Hadoop InputFormat
  DataSet<Tuple2<LongWritable, Text>> text = env.createInput(hadoopInputFormat);

  DataSet<Tuple2<Text, LongWritable>> words = text
      // Wrap Tokenizer Mapper function
      .flatMap(new HadoopMapFunction<LongWritable, Text, Text, LongWritable>(new Tokenizer()))
      .groupBy(0)
      // Wrap Counter Reducer function (used as Reducer and Combiner)
      .reduceGroup(new HadoopReduceCombineFunction<Text, LongWritable, Text, LongWritable>(
          new Counter(), new Counter()));

  // Setup Hadoop’s TextOutputFormat
  HadoopOutputFormat<Text, LongWritable> hadoopOutputFormat =
      new HadoopOutputFormat<Text, LongWritable>(
          new TextOutputFormat<Text, LongWritable>(), new JobConf());
  hadoopOutputFormat.getJobConf().set("mapred.textoutputformat.separator", " ");
  TextOutputFormat.setOutputPath(hadoopOutputFormat.getJobConf(), new Path(outputPath));

  // Output & Execute
  words.output(hadoopOutputFormat);
  env.execute("Hadoop Compat WordCount");
}
As you can see, Flink represents Hadoop key-value pairs as Tuple2<key, value> tuples. Note that the program uses Flink’s groupBy() transformation to group data on the key field (field 0 of the Tuple2<key, value>) before it is given to the Reducer function. At the moment, the compatibility package does not evaluate custom Hadoop partitioners, sorting comparators, or grouping comparators.
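A Hadoop Reducer can also be applied without using it as a combiner. The following is a minimal sketch, assuming the compatibility package offers a reducer-only wrapper named HadoopReduceFunction that takes just the Reducer instance; treat the name and signature as an assumption and consult the documentation:

// Sketch only: wrapping the Counter Reducer without a combiner.
DataSet<Tuple2<Text, LongWritable>> counts = text
    // Wrap Tokenizer Mapper function
    .flatMap(new HadoopMapFunction<LongWritable, Text, Text, LongWritable>(new Tokenizer()))
    // Group on the key field (field 0) of the Tuple2
    .groupBy(0)
    // Wrap Counter Reducer function (no combining before the shuffle)
    .reduceGroup(new HadoopReduceFunction<Text, LongWritable, Text, LongWritable>(
        new Counter()));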
Hadoop functions can be used at any position within a Flink program and can, of course, be mixed with native Flink functions. This means that instead of assembling a workflow of Hadoop jobs in an external driver method or using a workflow scheduler such as Apache Oozie, you can implement an arbitrarily complex Flink program consisting of multiple Hadoop Input- and OutputFormats, Mapper, and Reducer functions. When executing such a Flink program, data will be pipelined between your Hadoop functions and will not be written to HDFS just for the purpose of data exchange.
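As a small illustration of such mixing, the Tuple2<Text, LongWritable> result of the wrapped Reducer can be consumed directly by a native Flink function. The following sketch converts the Hadoop data types into plain Java types with a regular Flink MapFunction; the conversion itself is only an example and not part of the compatibility package:

// Sketch only: a native Flink function operating on the output of wrapped Hadoop functions.
DataSet<Tuple2<String, Long>> plainWords = words
    .map(new MapFunction<Tuple2<Text, LongWritable>, Tuple2<String, Long>>() {
      @Override
      public Tuple2<String, Long> map(Tuple2<Text, LongWritable> value) {
        // Convert Hadoop Writable types into plain Java types
        return new Tuple2<String, Long>(value.f0.toString(), value.f1.get());
      }
    });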
What comes next?
While the Hadoop compatibility package is already very useful, we are currently working on a dedicated Hadoop Job operation to embed and execute Hadoop jobs as a whole in Flink programs, including their custom partitioning, sorting, and grouping code. With this feature, you will be able to chain multiple Hadoop jobs, mix them with Flink functions, and combine them with other operations such as Spargel operations (Pregel/Giraph-style jobs).
Summary
Flink lets you reuse a lot of the code you wrote for Hadoop MapReduce, including all data types, all Input- and OutputFormats, and the Mappers and Reducers of the mapred API. Hadoop functions can be used within Flink programs and mixed with all other Flink functions. Due to Flink’s pipelined execution, Hadoop functions can be arbitrarily assembled without data exchange via HDFS. Moreover, the Flink community is currently working on a dedicated Hadoop Job operation to support the execution of Hadoop jobs as a whole.
If you want to use Flink’s Hadoop compatibility package, check out our documentation.