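Apache Flink - Batch (DataSet API)

Flink DataSet API Programming Guide:

DataSet programs in Flink are regular programs that implement transformations on data sets (for example filtering, mapping, joining, grouping). The data sets are initially created from certain sources (for example by reading files or from local collections). Results are returned via sinks, which may write the data to (distributed) files or to standard output (for example the command-line terminal).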

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class WordCountExample {

    public static void main(String[] args) throws Exception {
        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<String> text = env.fromElements(
            "Who's there?",
            "I think I hear them. Stand, ho! Who's there?");

        // split lines into (word, 1) pairs, group by the word, sum the counts
        DataSet<Tuple2<String, Integer>> wordCounts = text
            .flatMap(new LineSplitter())
            .groupBy(0)
            .sum(1);

        wordCounts.print();
    }

    public static class LineSplitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
            for (String word : line.split(" ")) {
                out.collect(new Tuple2<String, Integer>(word, 1));
            }
        }
    }
}

The WordCount program
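To make the result of the pipeline above concrete, here is a minimal plain-Java sketch (no Flink dependency) that mimics what flatMap, groupBy(0), and sum(1) compute for the same two input lines. The class WordCountSketch and its countWords helper are hypothetical names used only for illustration; Flink itself runs these steps in parallel and makes no promise about output order.

```java
import java.util.Map;
import java.util.TreeMap;

final class WordCountSketch {

    // Mimics flatMap (split on single spaces) followed by groupBy(0).sum(1).
    static Map<String, Integer> countWords(String... lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords(
            "Who's there?",
            "I think I hear them. Stand, ho! Who's there?");
        // One "(word,count)" pair per line, similar to how a Tuple2 prints.
        counts.forEach((word, count) -> System.out.println("(" + word + "," + count + ")"));
    }
}
```

Because the split is on spaces only, punctuation stays attached to the words: "there?" and "Who's" are each counted twice, and "Stand," keeps its comma.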

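DataSet Transformations:

Data transformations transform one or more DataSets into a new DataSet. Programs can combine multiple transformations into sophisticated assemblies.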

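Transformation	Description

Map	Takes one element and produces one element.
    data.map(new MapFunction<String, Integer>() {
        public Integer map(String value) { return Integer.parseInt(value); }
    });

FlatMap	Takes one element and produces zero, one, or more elements.
    data.flatMap(new FlatMapFunction<String, String>() {
        public void flatMap(String value, Collector<String> out) {
            for (String s : value.split(" ")) {
                out.collect(s);
            }
        }
    });

MapPartition	Transforms a parallel partition in a single function call. The function gets the partition as an `Iterable` stream and can produce an arbitrary number of result values. The number of elements in each partition depends on the degree of parallelism and on previous operations.
    data.mapPartition(new MapPartitionFunction<String, Long>() {
        public void mapPartition(Iterable<String> values, Collector<Long> out) {
            long c = 0;
            for (String s : values) {
                c++;
            }
            out.collect(c);
        }
    });

Filter	Evaluates a boolean function for each element and retains those for which the function returns true.
    data.filter(new FilterFunction<Integer>() {
        public boolean filter(Integer value) { return value > 1000; }
    });

Reduce	Combines a group of elements into a single element by repeatedly combining two elements into one. Reduce may be applied on a full data set or on a grouped data set.
    data.reduce(new ReduceFunction<Integer>() {
        public Integer reduce(Integer a, Integer b) { return a + b; }
    });

ReduceGroup	Combines a group of elements into one or more elements. ReduceGroup may be applied on a full data set or on a grouped data set.
    data.reduceGroup(new GroupReduceFunction<Integer, Integer>() {
        public void reduce(Iterable<Integer> values, Collector<Integer> out) {
            int prefixSum = 0;
            for (Integer i : values) {
                prefixSum += i;
                out.collect(prefixSum);
            }
        }
    });

Aggregate	Aggregates a group of values into a single value. Aggregation functions can be thought of as built-in reduce functions. Aggregate may be applied on a full data set or on a grouped data set.
    DataSet<Tuple3<Integer, String, Double>> input = // [...]
    DataSet<Tuple3<Integer, String, Double>> output = input.aggregate(SUM, 0).and(MIN, 2);
  You can also use short-hand syntax for minimum, maximum, and sum aggregations.
    DataSet<Tuple3<Integer, String, Double>> input = // [...]
    DataSet<Tuple3<Integer, String, Double>> output = input.sum(0).andMin(2);

Distinct	Returns the distinct elements of a data set. It removes duplicate entries from the input DataSet, with respect to all fields of the elements or a subset of fields.
    data.distinct();

Join	Joins two data sets by creating all pairs of elements that are equal on their keys. Optionally uses a JoinFunction to turn the pair of elements into a single element, or a FlatJoinFunction to turn the pair of elements into arbitrarily many (including none) elements.
    result = input1.join(input2)
                   .where(0)       // key of the first input (tuple field 0)
                   .equalTo(1);    // key of the second input (tuple field 1)
  You can specify the way the runtime executes the join via Join Hints. The hints describe whether the join happens through partitioning or broadcasting, and whether it uses a sort-based or a hash-based algorithm. If no hint is specified, the system tries to estimate the input sizes and picks the best strategy according to those estimates.
    // This executes a join by broadcasting the first data set
    // using a hash table for the broadcast data
    result = input1.join(input2, JoinHint.BROADCAST_HASH_FIRST)
                   .where(0).equalTo(1);
  Note that the join transformation works only for equi-joins. Other join types need to be expressed using OuterJoin or CoGroup.

OuterJoin	Performs a left, right, or full outer join on two data sets. Outer joins are similar to regular (inner) joins and create all pairs of elements that are equal on their keys. In addition, the records of the "outer" side (left, right, or both in case of a full join) are preserved if no matching key is found on the other side. Matching pairs of elements (or one element and a `null` value for the other input) are given to a JoinFunction to turn the pair of elements into a single element, or to a FlatJoinFunction to turn the pair of elements into arbitrarily many (including none) elements.
    input1.leftOuterJoin(input2) // rightOuterJoin or fullOuterJoin for right or full outer joins
          .where(0)              // key of the first input (tuple field 0)
          .equalTo(1)            // key of the second input (tuple field 1)
          .with(new JoinFunction<String, String, String>() {
              public String join(String v1, String v2) {
                  // NOTE:
                  // - v2 might be null for leftOuterJoin
                  // - v1 might be null for rightOuterJoin
                  // - v1 OR v2 might be null for fullOuterJoin
              }
          });

CoGroup	The two-dimensional variant of the reduce operation. Groups each input on one or more fields and then joins the groups. The transformation function is called per pair of groups.
    data1.coGroup(data2)
         .where(0)
         .equalTo(1)
         .with(new CoGroupFunction<String, String, String>() {
             public void coGroup(Iterable<String> in1, Iterable<String> in2, Collector<String> out) {
                 out.collect(...);
             }
         });

Cross	Builds the Cartesian product (cross product) of two inputs, creating all pairs of elements. Optionally uses a CrossFunction to turn the pair of elements into a single element.
    DataSet<Integer> data1 = // [...]
    DataSet<String> data2 = // [...]
    DataSet<Tuple2<Integer, String>> result = data1.cross(data2);
  Note: Cross is potentially a very compute-intensive operation that can challenge even large compute clusters. It is advised to hint the system with the DataSet sizes by using crossWithTiny() and crossWithHuge().

Union	Produces the union of two data sets.
    DataSet<String> data1 = // [...]
    DataSet<String> data2 = // [...]
    DataSet<String> result = data1.union(data2);

Rebalance	Evenly rebalances the parallel partitions of a data set to eliminate data skew. Only Map-like transformations may follow a rebalance transformation.
    DataSet<String> in = // [...]
    DataSet<String> result = in.rebalance()
                               .map(new Mapper());

Hash-Partition	Hash-partitions a data set on a given key. Keys can be specified as position keys, expression keys, and key selector functions.
    DataSet<Tuple2<String,Integer>> in = // [...]
    DataSet<Integer> result = in.partitionByHash(0)
                                .mapPartition(new PartitionMapper());

Range-Partition	Range-partitions a data set on a given key. Keys can be specified as position keys, expression keys, and key selector functions.
    DataSet<Tuple2<String,Integer>> in = // [...]
    DataSet<Integer> result = in.partitionByRange(0)
                                .mapPartition(new PartitionMapper());

Custom Partition	Manually specifies a partitioning over the data. Note: This method works only on single-field keys.
    DataSet<Tuple2<String,Integer>> in = // [...]
    // partitioner is a Partitioner<K>, key is the field to partition on
    DataSet<Integer> result = in.partitionCustom(partitioner, key)
                                .mapPartition(new PartitionMapper());

Sort Partition	Locally sorts all partitions of a data set on a specified field in a specified order. Fields can be specified as tuple positions or field expressions. Sorting on multiple fields is done by chaining sortPartition() calls.
    DataSet<Tuple2<String,Integer>> in = // [...]
    DataSet<Integer> result = in.sortPartition(1, Order.ASCENDING)
                                .mapPartition(new PartitionMapper());

First-n	Returns the first n (arbitrary) elements of a data set. First-n can be applied on a regular data set, a grouped data set, or a grouped-sorted data set. Grouping keys can be specified as key selector functions or field position keys.
    DataSet<Tuple2<String,Integer>> in = // [...]
    // regular data set
    DataSet<Tuple2<String,Integer>> result1 = in.first(3);
    // grouped data set
    DataSet<Tuple2<String,Integer>> result2 = in.groupBy(0)
                                                .first(3);
    // grouped-sorted data set
    DataSet<Tuple2<String,Integer>> result3 = in.groupBy(0)
                                                .sortGroup(1, Order.ASCENDING)
                                                .first(3);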
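The OuterJoin entry in the table above says that the join function receives `null` for the side that has no matching key. The plain-Java sketch below illustrates that contract for a left outer join over small in-memory lists. It is not how Flink executes joins (Flink partitions or broadcasts the inputs and uses hash-based or sort-based strategies); LeftOuterJoinSketch and its leftOuterJoin method are hypothetical names, and the nested loop is chosen only for readability.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

final class LeftOuterJoinSketch {

    // Pairs every left entry with all right entries that share its key.
    // A left entry without any match is passed on once, with null as the right value.
    static <K, L, R, O> List<O> leftOuterJoin(
            List<Map.Entry<K, L>> left,
            List<Map.Entry<K, R>> right,
            BiFunction<L, R, O> joinFunction) {
        List<O> result = new ArrayList<>();
        for (Map.Entry<K, L> l : left) {
            boolean matched = false;
            for (Map.Entry<K, R> r : right) {
                if (l.getKey().equals(r.getKey())) {
                    result.add(joinFunction.apply(l.getValue(), r.getValue()));
                    matched = true;
                }
            }
            if (!matched) {
                result.add(joinFunction.apply(l.getValue(), null));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map.Entry<Integer, String>> users = List.of(Map.entry(1, "ann"), Map.entry(2, "bob"));
        List<Map.Entry<Integer, String>> cities = List.of(Map.entry(1, "Berlin"));
        // The join function has to handle a null right-hand value itself.
        System.out.println(leftOuterJoin(users, cities,
            (name, city) -> name + "@" + (city == null ? "unknown" : city)));
    }
}
```

The same null check belongs in a real JoinFunction passed to leftOuterJoin, rightOuterJoin, or fullOuterJoin; only the side that can be null differs.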

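The following transformations are available on data sets of Tuples:

Transformation	Description

Project	Selects a subset of fields from the tuples.
    DataSet<Tuple3<Integer, Double, String>> in = // [...]
    DataSet<Tuple2<String, Integer>> out = in.project(2,0);

MinBy / MaxBy	Selects a tuple from a group of tuples whose values of one or more fields are minimum (maximum). The fields used for comparison must be valid key fields, i.e., comparable. If multiple tuples have the minimum (maximum) field values, an arbitrary one of these tuples is returned. MinBy (MaxBy) may be applied on a full data set or on a grouped data set.
    DataSet<Tuple3<Integer, Double, String>> in = // [...]
    // a DataSet with a single tuple with minimum values for the Integer and String fields.
    DataSet<Tuple3<Integer, Double, String>> out = in.minBy(0, 2);
    // a DataSet with one tuple for each group with the minimum value for the Double field.
    DataSet<Tuple3<Integer, Double, String>> out2 = in.groupBy(2)
                                                      .minBy(1);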

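Data Sources:

Data sources create the initial data sets, for example from files or from Java collections. The general mechanism for creating data sets is abstracted behind an InputFormat. Flink comes with several built-in formats to create data sets from common file formats. Many of them have shortcut methods on the ExecutionEnvironment.

File-based:

readTextFile(path) / TextInputFormat - Reads files line-wise and returns them as Strings.
readTextFileWithValue(path) / TextValueInputFormat - Reads files line-wise and returns them as StringValues. StringValues are mutable strings.
readCsvFile(path) / CsvInputFormat - Parses files of comma (or another character) delimited fields. Returns a DataSet of tuples or POJOs. Supports the basic Java types and their Value counterparts as field types.
readFileOfPrimitives(path, Class) / PrimitiveInputFormat - Parses files of new-line (or another character sequence) delimited primitive data types such as String or Integer.
readFileOfPrimitives(path, delimiter, Class) / PrimitiveInputFormat - Parses files of primitive data types such as String or Integer, separated by the given delimiter.
readSequenceFile(Key, Value, path) / SequenceFileInputFormat - Creates a JobConf and reads the file from the specified path with type SequenceFileInputFormat, the Key class, and the Value class, and returns the records as Tuple2<Key, Value>.

Collection-based:

fromCollection(Collection) - Creates a data set from a java.util.Collection. All elements in the collection must be of the same type.
fromCollection(Iterator, Class) - Creates a data set from an iterator. The class specifies the data type of the elements returned by the iterator.
fromElements(T ...) - Creates a data set from the given sequence of objects. All objects must be of the same type.
fromParallelCollection(SplittableIterator, Class) - Creates a data set from an iterator, in parallel. The class specifies the data type of the elements returned by the iterator.
generateSequence(from, to) - Generates the sequence of numbers in the given interval, in parallel.

Generic:

readFile(inputFormat, path) / FileInputFormat - Accepts a file input format.
createInput(inputFormat) / InputFormat - Accepts a generic input format.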

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// read text file from the local file system
DataSet<String> localLines = env.readTextFile("file:///path/to/my/textfile");

// read text file from an HDFS running at nnHost:nnPort
DataSet<String> hdfsLines = env.readTextFile("hdfs://nnHost:nnPort/path/to/my/textfile");

// read a CSV file with three fields
DataSet<Tuple3<Integer, String, Double>> csvInput = env.readCsvFile("hdfs:///the/CSV/file")
    .types(Integer.class, String.class, Double.class);

// read a CSV file with five fields, taking only two of them
DataSet<Tuple2<String, Double>> csvSubset = env.readCsvFile("hdfs:///the/CSV/file")
    .includeFields("10010")  // take the first and the fourth field
    .types(String.class, Double.class);

// read a CSV file with three fields into a POJO (Person.class) with corresponding fields
DataSet<Person> csvPersons = env.readCsvFile("hdfs:///the/CSV/file")
    .pojoType(Person.class, "name", "age", "zipcode");

// read a file from the specified path of type SequenceFileInputFormat
DataSet<Tuple2<IntWritable, Text>> tuples =
    env.readSequenceFile(IntWritable.class, Text.class, "hdfs://nnHost:nnPort/path/to/file");

// creates a set from some given elements
DataSet<String> value = env.fromElements("Foo", "bar", "foobar", "fubar");

// generate a number sequence
DataSet<Long> numbers = env.generateSequence(1, 10000000);

// Read data from a relational database using the JDBC input format
DataSet<Tuple2<String, Integer>> dbData =
    env.createInput(
      JDBCInputFormat.buildJDBCInputFormat()
                     .setDrivername("org.apache.derby.jdbc.EmbeddedDriver")
                     .setDBUrl("jdbc:derby:memory:persons")
                     .setQuery("select name, age from persons")
                     .setRowTypeInfo(new RowTypeInfo(BasicTypeInfo.STRING_TYPE_INFO, BasicTypeInfo.INT_TYPE_INFO))
                     .finish()
    );

// Note: Flink's program compiler needs to infer the data types of the data items which are returned
// by an InputFormat. If this information cannot be automatically inferred, it is necessary to
// manually provide the type information as shown in the examples above.

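Configuring CSV Parsing:
- types(Class ... types) specifies the types of the fields to parse. Configuring the types of the parsed fields is mandatory. For the type Boolean.class, "True" (case-insensitive), "False" (case-insensitive), "1", and "0" are treated as booleans.
- lineDelimiter(String del) specifies the delimiter of individual records. The default line delimiter is the new-line character '\n'.
- fieldDelimiter(String del) specifies the delimiter that separates the fields of a record. The default field delimiter is the comma character ','.
- includeFields(boolean ... flag), includeFields(String mask), or includeFields(long bitMask) defines which fields to read from the input file. By default, the first n fields (as defined by the number of types in the types() call) are parsed.
- parseQuotedStrings(char quoteChar) enables quoted string parsing. A string is parsed as a quoted string if the first character of the string field is the quote character. Field delimiters within quoted strings are ignored. Quoted string parsing fails if the last character of a quoted string field is not the quote character, or if the quote character appears at some point that is not the start or the end of the quoted string field (unless the quote character is escaped using '\'). If quoted string parsing is enabled and the first character of the field is not the quote character, the string is parsed as an unquoted string. By default, quoted string parsing is disabled.
- ignoreComments(String commentPrefix) specifies a comment prefix. All lines that start with the specified comment prefix are not parsed and are ignored. By default, no lines are ignored.
- ignoreInvalidLines() enables lenient parsing, i.e., lines that cannot be parsed correctly are ignored. By default, lenient parsing is disabled and invalid lines raise an exception.
- ignoreFirstLine() configures the InputFormat to ignore the first line of the input file. By default, no line is ignored.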
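The field mask and the Boolean rule from the list above can be illustrated without Flink. The following plain-Java sketch shows what a String mask such as "10010" selects and which literals count as booleans. It is a simplified illustration, not Flink's CsvInputFormat: CsvFieldMaskSketch, selectFields, and parseBoolean are hypothetical names, only '1' is treated as "include", and quoted strings and comments are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class CsvFieldMaskSketch {

    // Keeps the fields whose position is marked '1' in the mask.
    static List<String> selectFields(String line, String fieldDelimiter, String mask) {
        // Pattern.quote treats the delimiter literally, so "|" is not read as a regex.
        String[] fields = line.split(Pattern.quote(fieldDelimiter), -1);
        List<String> selected = new ArrayList<>();
        for (int i = 0; i < mask.length() && i < fields.length; i++) {
            if (mask.charAt(i) == '1') {
                selected.add(fields[i]);
            }
        }
        return selected;
    }

    // Accepts "True"/"False" in any letter case, plus "1" and "0".
    static boolean parseBoolean(String field) {
        if (field.equalsIgnoreCase("true") || field.equals("1")) {
            return true;
        }
        if (field.equalsIgnoreCase("false") || field.equals("0")) {
            return false;
        }
        throw new IllegalArgumentException("Not a boolean field: " + field);
    }

    public static void main(String[] args) {
        // Mask "10010" keeps the first and the fourth of five fields.
        System.out.println(selectFields("Alice,30,10115,1.75,x", ",", "10010"));
        System.out.println(parseBoolean("TRUE") + " " + parseBoolean("0"));
    }
}
```

With the mask "10010", only two fields survive, which is why the matching types() call lists exactly two classes.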

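Recursive traversal of the input path directory: For file-based inputs, when the input path is a directory, nested files are not enumerated by default. Instead, only the files inside the base directory are read, while nested files are ignored. Recursive enumeration of nested files can be enabled through the recursive.file.enumeration configuration parameter, as in the following example.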

// enable recursive enumeration of nested input files
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// create a configuration object
Configuration parameters = new Configuration();

// set the recursive enumeration parameter
parameters.setBoolean("recursive.file.enumeration", true);

// pass the configuration to the data source
DataSet<String> logs = env.readTextFile("file:///path/with.nested/files")
    .withParameters(parameters);

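Reading compressed files: Flink currently supports transparent decompression of input files if they are marked with an appropriate file extension (.deflate, .gz, .gzip, .bz2, .xz). In particular, this means that no further configuration of the input formats is necessary and that any FileInputFormat supports compression, including custom input formats. Note that compressed files might not be read in parallel, which impacts job scalability.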

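Data Sinks:

Data sinks consume DataSets and are used to store or return them. Data sink operations are described using an OutputFormat. Flink comes with a variety of built-in output formats that are encapsulated behind operations on the DataSet:

writeAsText() / TextOutputFormat - Writes elements line-wise as Strings. The Strings are obtained by calling the toString() method of each element.
writeAsFormattedText() / TextOutputFormat - Writes elements line-wise as Strings. The Strings are obtained by calling a user-defined format() method for each element.
writeAsCsv(...) / CsvOutputFormat - Writes tuples as comma-separated value files. Row and field delimiters are configurable. The value for each field comes from the toString() method of the objects.
print() / printToErr() / print(String msg) / printToErr(String msg) - Prints the toString() value of each element on the standard out / standard error stream. Optionally, a prefix (msg) can be provided, which is prepended to the output. This helps to distinguish between different calls to print. If the parallelism is greater than 1, the output is also prepended with the identifier of the task that produced it.
write() / FileOutputFormat - Method and base class for custom file outputs. Supports custom object-to-bytes conversion.
output() / OutputFormat - The most generic output method, for data sinks that are not file-based (such as storing the result in a database).

Standard data sink methods: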

// text data
DataSet<String> textData = // [...]

// write DataSet to a file on the local file system
textData.writeAsText("file:///my/result/on/localFS");

// write DataSet to a file on an HDFS with a namenode running at nnHost:nnPort
textData.writeAsText("hdfs://nnHost:nnPort/my/result/on/localFS");

// write DataSet to a file and overwrite the file if it exists
textData.writeAsText("file:///my/result/on/localFS", WriteMode.OVERWRITE);

// tuples as lines with pipe as the separator "a|b|c"
DataSet<Tuple3<String, Integer, Double>> values = // [...]
values.writeAsCsv("file:///path/to/the/result/file", "\n", "|");

// this writes tuples in the text formatting "(a, b, c)", rather than as CSV lines
values.writeAsText("file:///path/to/the/result/file");

// this writes values as strings using a user-defined TextFormatter object
// (the formatter's type parameter must match the element type of the DataSet)
values.writeAsFormattedText("file:///path/to/the/result/file",
    new TextFormatter<Tuple3<String, Integer, Double>>() {
        public String format(Tuple3<String, Integer, Double> value) {
            return value.f1 + " - " + value.f0;
        }
    });

Using a custom output format:

DataSet<Tuple3<String, Integer, Double>> myResult = // [...]

// write Tuple DataSet to a relational database
myResult.output(
    // build and configure OutputFormat
    JDBCOutputFormat.buildJDBCOutputFormat()
                    .setDrivername("org.apache.derby.jdbc.EmbeddedDriver")
                    .setDBUrl("jdbc:derby:memory:persons")
                    .setQuery("insert into persons (name, age, height) values (?,?,?)")
                    .finish()
    );

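Locally Sorted Output:

The output of a data sink can be locally sorted on specified fields in specified orders, using tuple field positions or field expressions. This works for every output format.

The following examples show how to use this feature: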

DataSet<Tuple3<Integer, String, Double>> tData = // [...]
DataSet<Tuple2<BookPojo, Double>> pData = // [...]
DataSet<String> sData = // [...]

// sort output on String field in ascending order
tData.sortPartition(1, Order.ASCENDING).print();

// sort output on Double field in descending and Integer field in ascending order
tData.sortPartition(2, Order.DESCENDING).sortPartition(0, Order.ASCENDING).print();

// sort output on the "author" field of nested BookPojo in descending order
pData.sortPartition("f0.author", Order.DESCENDING).writeAsText(...);

// sort output on the full tuple in ascending order
tData.sortPartition("*", Order.ASCENDING).writeAsCsv(...);

// sort atomic type (String) output in descending order
sData.sortPartition("*", Order.DESCENDING).writeAsText(...);

Globally sorted output is not supported yet.
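The difference between locally and globally sorted output is easy to see on a toy example. The plain-Java sketch below treats each inner list as one parallel partition and sorts it on its own, which is roughly what sortPartition does conceptually. SortPartitionSketch and sortEachPartition are hypothetical names, and the two fixed partitions are an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class SortPartitionSketch {

    // Sorts every partition on its own; no element moves between partitions.
    static List<List<Integer>> sortEachPartition(List<List<Integer>> partitions) {
        List<List<Integer>> sorted = new ArrayList<>();
        for (List<Integer> partition : partitions) {
            List<Integer> copy = new ArrayList<>(partition);
            Collections.sort(copy);
            sorted.add(copy);
        }
        return sorted;
    }

    public static void main(String[] args) {
        List<List<Integer>> partitions = List.of(List.of(5, 1, 9), List.of(4, 8, 2));
        // Each partition is ordered, but the concatenation 1, 5, 9, 2, 4, 8 is not.
        System.out.println(sortEachPartition(partitions));
    }
}
```

Each output file (one per parallel task) is therefore ordered internally, while the files taken together are not. If a single, totally ordered output is needed, one commonly used option is to run the sort and the sink with a parallelism of 1, at the cost of losing parallelism for that step.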

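Iteration Operators:

Iterations implement loops in Flink programs. The iteration operators encapsulate a part of the program and execute it repeatedly, feeding the result of one iteration (the partial solution) back into the next iteration. There are two types of iterations in Flink: BulkIteration and DeltaIteration.

Bulk Iteration: To create a BulkIteration, call the iterate(int) method of the DataSet the iteration should start at. This returns an IterativeDataSet, which can be transformed with the regular operators. The single argument to the iterate call specifies the maximum number of iterations. To specify the end of an iteration, call the closeWith(DataSet) method on the IterativeDataSet to specify which transformation should be fed back to the next iteration. You can optionally specify a termination criterion with closeWith(DataSet, DataSet), which evaluates the second DataSet and terminates the iteration if this DataSet is empty. If no termination criterion is specified, the iteration terminates after the given maximum number of iterations.

The following example iteratively estimates the number Pi. The goal is to count the number of random points that fall into the unit circle. In each iteration, a random point is picked. If this point lies inside the unit circle, the count is incremented. Pi is then estimated as the resulting count divided by the number of iterations, multiplied by 4.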
```
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// Create initial IterativeDataSet
IterativeDataSet<Integer> initial = env.fromElements(0).iterate(10000);

DataSet<Integer> iteration = initial.map(new MapFunction<Integer, Integer>() {
    @Override
    public Integer map(Integer i) throws Exception {
        double x = Math.random();
        double y = Math.random();
        return i + ((x * x + y * y < 1) ? 1 : 0);
    }
});

// Iteratively transform the IterativeDataSet
DataSet<Integer> count = initial.closeWith(iteration);

// print() triggers execution of the program by itself, so no additional
// env.execute() call follows it (a second call would find no new sinks).
count.map(new MapFunction<Integer, Double>() {
    @Override
    public Double map(Integer count) throws Exception {
        return count / (double) 10000 * 4;
    }
}).print();
```
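For readers who want to see the feedback loop without a Flink runtime, the following plain-Java sketch mimics the semantics of a bulk iteration without a termination criterion: the result of each step becomes the input of the next one, for a fixed number of rounds. BulkIterationSketch, iterate, and estimatePi are hypothetical names, and the fixed random seed is an assumption added to make runs repeatable.

```java
import java.util.Random;
import java.util.function.IntUnaryOperator;

final class BulkIterationSketch {

    // Feeds the result of each step back in as the input of the next step,
    // for a fixed number of iterations.
    static int iterate(int initial, int maxIterations, IntUnaryOperator step) {
        int partialSolution = initial;
        for (int i = 0; i < maxIterations; i++) {
            partialSolution = step.applyAsInt(partialSolution);
        }
        return partialSolution;
    }

    // Counts random points inside the unit circle and scales the hit ratio by 4.
    static double estimatePi(int iterations, long seed) {
        Random random = new Random(seed);
        int hits = iterate(0, iterations, count -> {
            double x = random.nextDouble();
            double y = random.nextDouble();
            return count + ((x * x + y * y < 1) ? 1 : 0);
        });
        return hits / (double) iterations * 4;
    }

    public static void main(String[] args) {
        System.out.println(estimatePi(100_000, 42L));
    }
}
```

The step function plays the role of the map applied to the IterativeDataSet, and the loop plays the role of closeWith. The estimate gets closer to Pi as the number of iterations grows.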

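Delta Iterations: Defining a DeltaIteration is similar to defining a BulkIteration. For delta iterations, two data sets form the input to each iteration (the workset and the solution set), and two data sets are produced as the result of each iteration (the new workset and the solution set delta). To create a DeltaIteration, call iterateDelta(DataSet, int, int) (or iterateDelta(DataSet, int, int[]) for multiple key positions) on the initial solution set.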

// read the initial data sets
DataSet<Tuple2<Long, Double>> initialSolutionSet = // [...]
DataSet<Tuple2<Long, Double>> initialDeltaSet = // [...]

int maxIterations = 100;
int keyPosition = 0;

DeltaIteration<Tuple2<Long, Double>, Tuple2<Long, Double>> iteration = initialSolutionSet
    .iterateDelta(initialDeltaSet, maxIterations, keyPosition);

DataSet<Tuple2<Long, Double>> candidateUpdates = iteration.getWorkset()
    .groupBy(1)
    .reduceGroup(new ComputeCandidateChanges());

DataSet<Tuple2<Long, Double>> deltas = candidateUpdates
    .join(iteration.getSolutionSet())
    .where(0)
    .equalTo(0)
    .with(new CompareChangesToCurrent());

DataSet<Tuple2<Long, Double>> nextWorkset = deltas
    .filter(new FilterByThreshold());

iteration.closeWith(deltas, nextWorkset)
    .writeAsCsv(outputPath);
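The interplay of solution set, workset, and deltas can be hard to picture from the API calls alone. The plain-Java sketch below walks through the same loop structure on a small example, propagating the smallest vertex id through an undirected graph (connected components). It is an illustration of the idea only, not Flink's implementation: DeltaIterationSketch and connectedComponents are hypothetical names, the algorithm is chosen for illustration and does not come from the example above, and the adjacency map is assumed to be symmetric with every vertex present as a key.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DeltaIterationSketch {

    // solutionSet: vertex -> current component id, updated by the deltas.
    // workset: vertices whose component id changed in the previous iteration.
    static Map<Integer, Integer> connectedComponents(
            Map<Integer, List<Integer>> neighbors, int maxIterations) {
        Map<Integer, Integer> solutionSet = new HashMap<>();
        for (Integer vertex : neighbors.keySet()) {
            solutionSet.put(vertex, vertex);
        }
        List<Integer> workset = new ArrayList<>(neighbors.keySet());

        // Stops when the workset is empty or the iteration limit is reached.
        for (int i = 0; i < maxIterations && !workset.isEmpty(); i++) {
            // Candidate updates: each changed vertex offers its id to its neighbors.
            Map<Integer, Integer> candidates = new HashMap<>();
            for (Integer vertex : workset) {
                int componentId = solutionSet.get(vertex);
                for (Integer neighbor : neighbors.get(vertex)) {
                    candidates.merge(neighbor, componentId, Math::min);
                }
            }
            // Deltas: keep only candidates that improve on the current solution.
            List<Integer> nextWorkset = new ArrayList<>();
            for (Map.Entry<Integer, Integer> candidate : candidates.entrySet()) {
                if (candidate.getValue() < solutionSet.get(candidate.getKey())) {
                    solutionSet.put(candidate.getKey(), candidate.getValue());
                    nextWorkset.add(candidate.getKey());
                }
            }
            workset = nextWorkset;
        }
        return solutionSet;
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> graph = Map.of(
            1, List.of(2), 2, List.of(1, 3), 3, List.of(2),
            4, List.of(5), 5, List.of(4));
        // Vertices 1, 2, 3 end up with id 1; vertices 4 and 5 end up with id 4.
        System.out.println(connectedComponents(graph, 100));
    }
}
```

Only vertices that changed in the previous round do any work in the next one, which is the reason delta iterations can be cheaper than bulk iterations when most of the solution settles early.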