Example Program


ExecutionEnvironment is the counterpart of Spark's SparkContext.

DataSet is the counterpart of Spark's RDD.




public class WordCountExample {
  public static void main(String[] args) throws Exception {
    final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    DataSet<String> text = env.fromElements(
      "Who's there?",
      "I think I hear them. Stand, ho! Who's there?");
    // split lines into (word, 1) pairs, group by the word, sum the counts
    DataSet<Tuple2<String, Integer>> wordCounts = text
      .flatMap(new LineSplitter())
      .groupBy(0)
      .sum(1);
    wordCounts.print();
  }

  public static class LineSplitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
    public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
      for (String word : line.split(" ")) {
        out.collect(new Tuple2<String, Integer>(word, 1));
      }
    }
  }
}
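
To make the data flow concrete, the following sketch models what flatMap, groupBy(0) and sum(1) compute for the word count, using plain Java collections instead of Flink. The names WordCountModel and countWords are made up for illustration; nothing here is distributed or part of the Flink API.

```java
import java.util.Map;
import java.util.TreeMap;

class WordCountModel {
    // Mimics flatMap(LineSplitter) -> groupBy(0) -> sum(1) on local data.
    static Map<String, Integer> countWords(String... lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {    // flatMap: emit (word, 1)
                counts.merge(word, 1, Integer::sum); // groupBy(0) + sum(1)
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords(
            "Who's there?",
            "I think I hear them. Stand, ho! Who's there?");
        System.out.println(counts);
    }
}
```

Because the lines are split on single spaces only, punctuation stays attached to the words, so "there?" is counted as its own word.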


Specifying Keys


1. Use tuple field indexes. The example below uses the first and second tuple fields as a composite key.

DataSet<Tuple3<Integer,String,Long>> input = // [...]
DataSet<Tuple3<Integer,String,Long>> grouped = input
  .groupBy(0, 1)
  .reduce(/*do something*/);


2. For POJOs, use field expressions.

// some ordinary POJO (Plain Old Java Object)
public class WC {
  public String word;
  public int count;
}
DataSet<WC> words = // [...]
DataSet<WC> wordCounts = words.groupBy("word").reduce(/*do something*/);


3. Use key selector functions.

// some ordinary POJO
public class WC {public String word; public int count;}
DataSet<WC> words = // [...]
DataSet<WC> wordCounts = words
  .groupBy(
    new KeySelector<WC, String>() {
      public String getKey(WC wc) { return wc.word; }
    })
  .reduce(/*do something*/);


Passing Functions to Flink

1. Implement a function interface.

class MyMapFunction implements MapFunction<String, Integer> {
  public Integer map(String value) { return Integer.parseInt(value); }
}
// data is an existing DataSet<String>
data.map(new MyMapFunction());

Or use an anonymous class:

data.map(new MapFunction<String, Integer>() {
  public Integer map(String value) { return Integer.parseInt(value); }
});


2. Use rich functions

Rich functions provide, in addition to the user-defined function (map, reduce, etc), four methods: open, close, getRuntimeContext, and setRuntimeContext.

These are useful for parameterizing the function (see Passing Parameters to Functions), creating and finalizing local state, accessing broadcast variables (see Broadcast Variables), and for accessing runtime information such as accumulators and counters (see Accumulators and Counters), and information on iterations (see Iterations).

Rich functions are used in the same way as regular functions. The difference is the four additional methods, which help in special cases such as passing parameters to a function or accessing broadcast variables, accumulators and counters, since these cases rely on open or getRuntimeContext.

class MyMapFunction extends RichMapFunction<String, Integer> {
  public Integer map(String value) { return Integer.parseInt(value); }
}


Execution Configuration

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
ExecutionConfig executionConfig = env.getConfig();
  • enableClosureCleaner() / disableClosureCleaner(). The closure cleaner is enabled by default. The closure cleaner removes unneeded references to the surrounding class of anonymous functions inside Flink programs. With the closure cleaner disabled, it might happen that an anonymous user function is referencing the surrounding class, which is usually not Serializable. This will lead to exceptions by the serializer.



  • getParallelism() / setParallelism(int parallelism). Sets the default parallelism for the job.


  • getExecutionRetryDelay() / setExecutionRetryDelay(long executionRetryDelay). Sets the delay in milliseconds that the system waits after a job has failed before re-executing it. The delay starts after all tasks have been successfully stopped on the TaskManagers; once the delay has passed, the tasks are restarted. This parameter is useful to delay re-execution so that certain time-out related failures can surface fully (such as broken connections that have not fully timed out) before a re-execution is attempted and immediately fails again for the same reason. This parameter only has an effect if the number of execution retries is one or more.

  • getExecutionMode() / setExecutionMode(). Sets the execution mode used to run the program. The default execution mode is PIPELINED. The execution mode defines whether data exchanges are performed in a batch or in a pipelined manner.


  • enableObjectReuse() / disableObjectReuse() By default, objects are not reused in Flink. Enabling the object reuse mode will instruct the runtime to reuse user objects for better performance. Keep in mind that this can lead to bugs when the user-code function of an operation is not aware of this behavior.


  • enableSysoutLogging() / disableSysoutLogging(). JobManager status updates are printed to System.out by default. This setting allows you to disable this behavior.


  • getGlobalJobParameters() / setGlobalJobParameters() This method allows users to set custom objects as a global configuration for the job. Since the ExecutionConfig is accessible in all user defined functions, this is an easy method for making configuration globally available in a job.


  • The remaining parameters all relate to serialization and are not listed here.
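
The warning on enableObjectReuse() above can be illustrated without Flink. In the sketch below, a loop hands the same mutable record to user code on every step, which is roughly what a runtime does when it reuses objects. Code that stores the reference sees every stored element change, while copying before storing avoids that. The names ObjectReuseSketch, Record, collectReferences and collectCopies are hypothetical and only serve the illustration.

```java
import java.util.ArrayList;
import java.util.List;

class ObjectReuseSketch {
    // A mutable record, standing in for a user object passed between operators.
    static final class Record {
        int value;
        Record(int value) { this.value = value; }
    }

    // Buggy under reuse: stores the reference to the single reused instance.
    static List<Record> collectReferences(int[] input) {
        Record reused = new Record(0);
        List<Record> buffer = new ArrayList<>();
        for (int v : input) {
            reused.value = v;   // the "runtime" overwrites the same object
            buffer.add(reused); // user code keeps the reference
        }
        return buffer;
    }

    // Safe under reuse: copies the record before keeping it.
    static List<Record> collectCopies(int[] input) {
        Record reused = new Record(0);
        List<Record> buffer = new ArrayList<>();
        for (int v : input) {
            reused.value = v;
            buffer.add(new Record(reused.value));
        }
        return buffer;
    }

    public static void main(String[] args) {
        int[] input = {1, 2, 3};
        System.out.println(collectReferences(input).get(0).value); // prints 3, not 1
        System.out.println(collectCopies(input).get(0).value);     // prints 1
    }
}
```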


Data Sinks

Data sinks consume DataSets and are used to store or return them. Data sink operations are described using an OutputFormat.

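You can also define a custom output format, for example to write to a database:

DataSet<Tuple3<String, Integer, Double>> myResult = // [...]

// write Tuple DataSet to a relational database
myResult.output(
  // build and configure OutputFormat
  JDBCOutputFormat.buildJDBCOutputFormat()
    .setDrivername("org.apache.derby.jdbc.EmbeddedDriver")
    .setDBUrl("jdbc:derby:memory:persons")
    .setQuery("insert into persons (name, age, height) values (?,?,?)")
    .finish());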



DataSet<Tuple3<Integer, String, Double>> tData = // [...]
DataSet<Tuple2<BookPojo, Double>> pData = // [...]
DataSet<String> sData = // [...]

// sort output on String field in ascending order
tData.print().sortLocalOutput(1, Order.ASCENDING);

// sort output on Double field in descending and Integer field in ascending order
tData.print().sortLocalOutput(2, Order.DESCENDING).sortLocalOutput(0, Order.ASCENDING);




final ExecutionEnvironment env = ExecutionEnvironment.createLocalEnvironment();

DataSet<String> lines = env.readTextFile(pathToTextFile);
// build your program

env.execute();



final ExecutionEnvironment env = ExecutionEnvironment.createLocalEnvironment();

// Create a DataSet from a list of elements
DataSet<Integer> myInts = env.fromElements(1, 2, 3, 4, 5);

// Create a DataSet from any Java collection
List<Tuple2<String, Integer>> data = ...
DataSet<Tuple2<String, Integer>> myTuples = env.fromCollection(data);

// Create a DataSet from an Iterator
Iterator<Long> longIt = ...
DataSet<Long> myLongs = env.fromCollection(longIt, Long.class);



DataSet<Tuple2<String, Integer>> myResult = ...

List<Tuple2<String, Integer>> outData = new ArrayList<Tuple2<String, Integer>>();
myResult.output(new LocalCollectionOutputFormat(outData));


Iteration Operators

Iterations implement loops in Flink programs. The iteration operators encapsulate a part of the program and execute it repeatedly, feeding back the result of one iteration (the partial solution) into the next iteration. There are two types of iterations in Flink: BulkIteration and DeltaIteration.
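
The difference between the two iteration types can be sketched in plain Java, without the Flink API. A bulk iteration recomputes the whole partial solution in every round. A delta iteration keeps a solution set and a workset, processes only the workset, and stops once nothing changes. The class and method names below are hypothetical, and the in-place updates within one round are a simplification of a real superstep.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.function.UnaryOperator;

class IterationSketch {
    // Bulk iteration: the step function recomputes the whole partial solution,
    // and its result is fed back for a fixed number of rounds.
    static int[] bulkIterate(int[] initial, int maxIterations, UnaryOperator<int[]> step) {
        int[] partialSolution = initial.clone();
        for (int i = 0; i < maxIterations; i++) {
            partialSolution = step.apply(partialSolution);
        }
        return partialSolution;
    }

    // Delta iteration: only vertices in the workset are processed, and only
    // changed vertices enter the next workset. Each vertex converges to the
    // smallest vertex id in its connected component.
    static int[] minLabelPropagation(int vertexCount, int[][] edges) {
        int[] solution = new int[vertexCount]; // solution set: vertex -> label
        Set<Integer> workset = new HashSet<>();
        for (int v = 0; v < vertexCount; v++) {
            solution[v] = v;
            workset.add(v);
        }
        while (!workset.isEmpty()) { // terminate when no element changed
            Set<Integer> nextWorkset = new HashSet<>();
            for (int v : workset) {
                for (int[] e : edges) {
                    int neighbor = e[0] == v ? e[1] : (e[1] == v ? e[0] : -1);
                    if (neighbor >= 0 && solution[v] < solution[neighbor]) {
                        solution[neighbor] = solution[v]; // update the solution set
                        nextWorkset.add(neighbor);        // feed the change back
                    }
                }
            }
            workset = nextWorkset;
        }
        return solution;
    }

    public static void main(String[] args) {
        int[] incremented = bulkIterate(new int[] {0, 10}, 3,
            s -> Arrays.stream(s).map(x -> x + 1).toArray());
        System.out.println(Arrays.toString(incremented)); // [3, 13]

        int[][] edges = {{0, 1}, {1, 2}, {3, 4}};
        System.out.println(Arrays.toString(minLabelPropagation(5, edges))); // [0, 0, 0, 3, 3]
    }
}
```

The delta variant does less work per round as the computation converges, because vertices whose label no longer changes drop out of the workset.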






Semantic Annotations

Semantic annotations can be used to give Flink hints about the behavior of a function.

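The goal is performance optimization. When the optimizer knows exactly how a function uses its input fields, it can make better decisions. For example, if it knows that certain fields are only forwarded, it can preserve their sorting or partitioning information.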


Forwarded Fields Annotation




@ForwardedFields("f0->f2") // first input field is forwarded to the third output field
public class MyMap implements
    MapFunction<Tuple2<Integer, Integer>, Tuple3<String, Integer, Integer>> {
  public Tuple3<String, Integer, Integer> map(Tuple2<Integer, Integer> val) {
    return new Tuple3<String, Integer, Integer>("foo", val.f1 / 2, val.f0);
  }
}
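
Why forwarding matters to the optimizer can be shown with a small plain Java sketch that repeats the logic of the MyMap example above on local data. If the input is sorted on f0 and f0 is copied unchanged into output field f2, the output is still sorted on f2, so no re-sort is needed. A computed field carries no such guarantee. The class ForwardedFieldSketch and its helper methods are hypothetical and do not use the Flink API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ForwardedFieldSketch {
    // Same logic as MyMap: input (f0, f1) becomes ("foo", f1 / 2, f0),
    // so input field f0 is forwarded unchanged to output field f2.
    static Object[] map(int[] in) {
        return new Object[] {"foo", in[1] / 2, in[0]};
    }

    static List<Object[]> mapAll(List<int[]> rows) {
        List<Object[]> out = new ArrayList<>();
        for (int[] row : rows) {
            out.add(map(row));
        }
        return out;
    }

    // True if the given integer field is in non-decreasing order.
    static boolean isSortedOn(List<Object[]> rows, int field) {
        for (int i = 1; i < rows.size(); i++) {
            if ((Integer) rows.get(i - 1)[field] > (Integer) rows.get(i)[field]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Input sorted on f0; f1 is in no particular order.
        List<int[]> input = Arrays.asList(
            new int[] {1, 40}, new int[] {2, 10}, new int[] {3, 30});
        List<Object[]> output = mapAll(input);
        System.out.println(isSortedOn(output, 2)); // true: order on the forwarded field survives
        System.out.println(isSortedOn(output, 1)); // false: the computed field is unordered
    }
}
```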


Non-Forwarded Fields



@NonForwardedFields("f1") // second field is not forwarded
public class MyMap implements
    MapFunction<Tuple2<Integer, Integer>, Tuple2<Integer, Integer>> {
  public Tuple2<Integer, Integer> map(Tuple2<Integer, Integer> val) {
    return new Tuple2<Integer, Integer>(val.f0, val.f1 / 2);
  }
}


Read Fields



@ReadFields("f0; f3") // f0 and f3 are read and evaluated by the function.
public class MyMap implements
    MapFunction<Tuple4<Integer, Integer, Integer, Integer>,
                Tuple2<Integer, Integer>> {
  public Tuple2<Integer, Integer> map(Tuple4<Integer, Integer, Integer, Integer> val) {
    if (val.f0 == 42) {
      return new Tuple2<Integer, Integer>(val.f0, val.f1);
    } else {
      return new Tuple2<Integer, Integer>(val.f3 + 10, val.f1);
    }
  }
}


Broadcast Variables

Broadcast variables allow you to make a data set available to all parallel instances of an operation, in addition to the regular input of the operation. This is useful for auxiliary data sets, or data-dependent parameterization. The data set will then be accessible at the operator as a Collection.

  • Broadcast: broadcast sets are registered by name via withBroadcastSet(DataSet, String), and
  • Access: accessible via getRuntimeContext().getBroadcastVariable(String) at the target operator.
// 1. The DataSet to be broadcasted
DataSet<Integer> toBroadcast = env.fromElements(1, 2, 3);

DataSet<String> data = env.fromElements("a", "b");

data.map(new RichMapFunction<String, String>() {
  @Override
  public void open(Configuration parameters) throws Exception {
    // 3. Access the broadcasted DataSet as a Collection
    Collection<Integer> broadcastSet = getRuntimeContext().getBroadcastVariable("broadcastSetName");
  }

  @Override
  public String map(String value) throws Exception {
    // [...]
  }
}).withBroadcastSet(toBroadcast, "broadcastSetName"); // 2. Broadcast the DataSet




Passing Parameters to Functions



// Pass the parameter through the function's constructor
DataSet<Integer> toFilter = env.fromElements(1, 2, 3);

toFilter.filter(new MyFilter(2));

private static class MyFilter implements FilterFunction<Integer> {

  private final int limit;

  public MyFilter(int limit) {
    this.limit = limit;
  }

  @Override
  public boolean filter(Integer value) throws Exception {
    return value > limit;
  }
}




// Pass the parameter through a Configuration object and withParameters
DataSet<Integer> toFilter = env.fromElements(1, 2, 3);

Configuration config = new Configuration();
config.setInteger("limit", 2);

toFilter.filter(new RichFilterFunction<Integer>() {
  private int limit;

  @Override
  public void open(Configuration parameters) throws Exception {
    limit = parameters.getInteger("limit", 0);
  }

  @Override
  public boolean filter(Integer value) throws Exception {
    return value > limit;
  }
}).withParameters(config);






Setting a custom global configuration

Configuration conf = new Configuration();
conf.setString("mykey", "myvalue");
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
env.getConfig().setGlobalJobParameters(conf);

Accessing values from the global configuration

public static final class Tokenizer extends RichFlatMapFunction<String, Tuple2<String, Integer>> {

  private String mykey;

  @Override
  public void open(Configuration parameters) throws Exception {
    ExecutionConfig.GlobalJobParameters globalParams = getRuntimeContext().getExecutionConfig().getGlobalJobParameters();
    Configuration globConf = (Configuration) globalParams;
    mykey = globConf.getString("mykey", null);
  }
  // ... more here ...
}


Accumulators & Counters


Flink currently has the following built-in accumulators. Each of them implements the Accumulator interface.

  • IntCounter, LongCounter and DoubleCounter: See below for an example using a counter.
  • Histogram: A histogram implementation for a discrete number of bins. Internally it is just a map from Integer to Integer. You can use this to compute distributions of values, e.g. the distribution of words-per-line for a word count program.
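private IntCounter numLines = new IntCounter();
getRuntimeContext().addAccumulator("num-lines", this.numLines);
// count anywhere in the function
this.numLines.add(1);
// finally, read the result from the JobExecutionResult returned by env.execute()
myJobExecutionResult.getAccumulatorResult("num-lines");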
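
The Histogram accumulator described above is a map from Integer to Integer, and the words-per-line use case can be modeled in plain Java. The sketch below is not Flink's Histogram class. HistogramSketch and its methods are hypothetical names, and merge stands in for combining the partial results of parallel instances.

```java
import java.util.Map;
import java.util.TreeMap;

class HistogramSketch {
    private final TreeMap<Integer, Integer> bins = new TreeMap<>();

    // Record one observation in the bin for this value.
    void add(int value) {
        bins.merge(value, 1, Integer::sum);
    }

    // Combine the partial result of another parallel instance into this one.
    void merge(HistogramSketch other) {
        other.bins.forEach((bin, count) -> bins.merge(bin, count, Integer::sum));
    }

    Map<Integer, Integer> result() {
        return bins;
    }

    // Words-per-line distribution for one partition of the input.
    static HistogramSketch wordsPerLine(String... lines) {
        HistogramSketch histogram = new HistogramSketch();
        for (String line : lines) {
            histogram.add(line.split(" ").length);
        }
        return histogram;
    }

    public static void main(String[] args) {
        HistogramSketch a = wordsPerLine("Who's there?");
        HistogramSketch b = wordsPerLine(
            "I think I hear them. Stand, ho! Who's there?", "Stand, ho!");
        a.merge(b);
        System.out.println(a.result()); // {2=2, 9=1}
    }
}
```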


Execution Plans


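final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
System.out.println(env.getExecutionPlan()); // dump the plan as JSON once the program has been defined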





The HTML document containing the visualizer is located under tools/planVisualizer.html.




Web Interface

Flink offers a web interface for submitting and executing jobs. If you choose to use this interface to submit your packaged program, you have the option to also see the plan visualization.

The script to start the web interface is located under bin/. After starting the web client (by default on port 8080), your program can be uploaded and will be added to the list of available programs on the left side of the interface.

In other words, the web interface can also be used to submit jobs and view their execution plans.
