Getting Started with Flink's DataStream API

Types
Transformations
Defining UDFs

The API shown in this article is based on Flink 1.4.

def main(args: Array[String]) {

  // getExecutionEnvironment detects automatically whether to run locally or remotely. For a local run you can also call createLocalEnvironment().

  val env = StreamExecutionEnvironment.getExecutionEnvironment

  val remoteEnv = StreamExecutionEnvironment.createRemoteEnvironment("JMhost", 1234, "path/to/jarFile.jar to ship to the JobManager")  

  // Use event time as the time semantics. env has many other setters:

  // the state backend defaults to in-memory; change it with .setStateBackend(new ...)

  // enableCheckpointing, followed by the checkpoint settings

  env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

  env.getConfig.setAutoWatermarkInterval(1000L)

  // create a DataStream[SensorReading] from a stream source

  val sensorData: DataStream[SensorReading] = env

    // SensorSource is a class that extends RichParallelSourceFunction, one of the SourceFunction variants

    .addSource(new SensorSource)

    .setParallelism(4)

    // assign timestamps and watermarks (required for event time)

    .assignTimestampsAndWatermarks(new SensorTimeAssigner)

  val avgTemp: DataStream[SensorReading] = sensorData

    // convert Fahrenheit to Celsius with an inline lambda function

    .map( r => {

        val celsius = (r.temperature - 32) * (5.0 / 9.0)

        SensorReading(r.id, r.timestamp, celsius)

      } )

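    .keyBy(_.id) // no key is created here; the function only designates part of each record as the key

    // shortcut for .window(TumblingEventTimeWindows.of(Time.seconds(5))); if processing time were set above, it would be the shortcut for TumblingProcessingTimeWindows instead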

    .timeWindow(Time.seconds(5))

    // compute average temperature using a UDF

    .apply(new TemperatureAverager)

  avgTemp.print()

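  // When the job is submitted to a cluster, execute() ships the dataflow to the remote JobManager.

  // In IDE mode the JobManager and TaskManager run in the same JVM process, and execute() starts and runs Flink locally.

  // It then builds the execution plan, from the sources through all transformations, and finally runs that plan.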

  env.execute("Compute average sensor temperature")

}

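Note that the map and flatMap operators need an implicit TypeInformation, e.g. implicit val typeInfo = TypeInformation.of(classOf[OutputType]), where OutputType is the type produced by map. A better approach is to import org.apache.flink.streaming.api.scala._ (or org.apache.flink.api.scala._ for static data).

Everything from map to apply is a transformation operator. Such a method typically does two things: it obtains the operator's output type via reflection, and it returns an operator through transform. transform also registers the operation with the execution environment so that the DAG can be generated later.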
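
The 5-second tumbling window in the job above can be modeled without Flink to see which readings end up averaged together. The following is a plain-Scala sketch, not Flink code: it assumes windows with no offset and non-negative millisecond timestamps, and `TumblingWindowSketch`, `windowStart`, and `averages` are hypothetical names introduced for illustration.

```scala
// Plain-Scala model of keyBy(_.id) + tumbling event-time windows + average.
// This is an illustration of the semantics, not the Flink implementation.
object TumblingWindowSketch {
  // Hypothetical stand-in for the article's SensorReading case class.
  case class SensorReading(id: String, timestamp: Long, temperature: Double)

  // Start of the tumbling window that contains the timestamp
  // (assumes offset 0 and a non-negative timestamp).
  def windowStart(timestamp: Long, sizeMs: Long): Long =
    timestamp - (timestamp % sizeMs)

  // Group readings by (sensor id, window start) and average each group.
  def averages(readings: Seq[SensorReading], sizeMs: Long): Map[(String, Long), Double] =
    readings
      .groupBy(r => (r.id, windowStart(r.timestamp, sizeMs)))
      .map { case (key, rs) => key -> rs.map(_.temperature).sum / rs.size }
}
```

Unlike this batch-style model, Flink emits a window's result only once the watermark passes the end of that window.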

Types

Primitives

Java and Scala tuples

Java tuples are mutable: you can call setField(newValue, index), and indexes start at 0.

DataStream<Tuple2<String, Integer>> persons = env.fromElements(

  Tuple2.of("Adam", 17),

  Tuple2.of("Sarah", 23));

persons.filter(p -> p.f1 > 18);

Scala case classes
POJOs, including classes generated by Apache Avro

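A class qualifies as a POJO if it is a public class, has a public no-argument constructor, every field is either public or accessible through getters and setters that follow the default naming scheme, and every field has a type that Flink supports.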
Flink Value types

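Implement the serialization logic in the read() and write() methods of the org.apache.flink.types.Value interface.

Flink provides built-in Value types such as IntValue, DoubleValue, and StringValue, all of which are mutable.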
Some special types

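Scala's Either, Option, and Try types, and Flink's Java version of the Either type.

Primitive and object Array types, Java Enum types, and Hadoop Writable types.

Type inference for Java

If a function's return type is generic, add a returns() call. The source book does not spell out exactly which cases require it.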

.map(new MyMapFunction<Long, MyType>())

.returns(MyType.class);

.flatMap(new MyFlatMapFunction<String, Integer>())

.returns(new TypeHint<Integer>(){});

class MyFlatMapFunction<T, O> implements FlatMapFunction<T, O> {

   public void flatMap(T value, Collector<O> out) { ... }

}

TypeInformation

Custom classes used as keys.

TypeInformation maps fields from the types to fields in a flat schema. Basic types are mapped to single fields and tuples and case classes are mapped to as many fields as the class has. The flat schema must be valid for all type instances, thus variable length types like collections and arrays are not assigned to individual fields, but they are considered to be one field as a whole.

// Create TypeInformation and TypeSerializer for a 2-tuple in Scala

// get the execution config

val config = inputStream.executionConfig

...

// create the type information; requires import org.apache.flink.streaming.api.scala._

val tupleInfo: TypeInformation[(String, Double)] =

    createTypeInformation[(String, Double)]

// create a serializer

val tupleSerializer = tupleInfo.createSerializer(config)

Transformations

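To keep the Java and Scala APIs as similar as possible, Flink leaves out some Scala implicits, so pattern matching in particular behaves differently from Spark. If you need that support, import org.apache.flink.streaming.api.scala.extensions._ and use the corresponding method names, e.g. mapWith instead of map.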

Basic transformations are transformations on individual events.

// ids look like sensor_N; after the split, each id becomes two consecutive records

val sensorIds: DataStream[String] = ...

val splitIds: DataStream[String] = sensorIds

  .flatMap(id => id.split("_"))
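
To make the one-to-many behavior of flatMap concrete, the same function can be applied to an ordinary Scala collection. This is a plain-Scala sketch rather than Flink code, and `FlatMapSketch` and `splitIds` are hypothetical names.

```scala
// Plain-Scala illustration of the flatMap above; not Flink code.
object FlatMapSketch {
  // Each id such as "sensor_7" produces two output records: "sensor" and "7".
  def splitIds(ids: Seq[String]): Seq[String] =
    ids.flatMap(id => id.split("_").toSeq)
}
```

For example, `FlatMapSketch.splitIds(Seq("sensor_1", "sensor_2"))` yields `Seq("sensor", "1", "sensor", "2")`: the output stream carries no trace of which input record a value came from.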

KeyedStream transformations are transformations that are applied to events in the context of a key.

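If there are many unique keys, watch out for running out of memory.

ROLLING AGGREGATIONS, such as sum(), min(), and minBy() (which returns the event holding the minimum value). Only one of them can be applied at a time.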

val resultStream: DataStream[(Int, Int, Int)] = inputStream

  .keyBy(0) // key on first field of the tuple

  .sum(1)   // sum the second field of the tuple

// Output

// (1,2,2) followed by (1,7,2), (2,3,1) followed by (2,5,1)

// The key does not have to be a field of the record

val keyedStream = input.keyBy(value => math.max(value._1, value._2))
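
The rolling sum shown earlier can be modeled with a per-key map, which also shows why many unique keys cost memory: one state entry is kept per key. This is a plain-Scala sketch, not Flink code. `RollingSumSketch` is a hypothetical name, the input used in the note below is an assumption chosen to reproduce the output listed above, and keeping the first record's value in the non-aggregated third field simply mirrors that output.

```scala
import scala.collection.mutable

// Plain-Scala model of keyBy(0).sum(1); an illustration, not Flink code.
object RollingSumSketch {
  // Emits one updated record per input record, like a rolling aggregation.
  def rollingSum(input: Seq[(Int, Int, Int)]): Seq[(Int, Int, Int)] = {
    // One state entry per distinct key (first tuple field).
    val state = mutable.Map.empty[Int, (Int, Int, Int)]
    input.map { in =>
      val updated = state.get(in._1) match {
        // Add the second field; the third field keeps the first record's value.
        case Some(prev) => (prev._1, prev._2 + in._2, prev._3)
        case None       => in
      }
      state(in._1) = updated
      updated
    }
  }
}
```

With the assumed input `Seq((1, 2, 2), (2, 3, 1), (2, 2, 4), (1, 5, 3))`, the result is `Seq((1, 2, 2), (2, 3, 1), (2, 5, 1), (1, 7, 2))`, which matches the per-key sequences listed above.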

REDUCE works like an accumulator; any function with the signature (T, T) => T qualifies.

val reducedSensors = readings

  .keyBy(_.id)

  .reduce((r1, r2) => {

    val highestTimestamp = Math.max(r1.timestamp, r2.timestamp)

    SensorReading(r1.id, highestTimestamp, r1.temperature)

  })

Multi-stream transformations merge multiple streams into one stream or split one stream into multiple streams.

CONNECT, COMAP, AND COFLATMAP：

val keyedConnect: ConnectedStreams[(Int, Long), (Int, String)] = first

// repartition-repartition

  .connect(second)

  .keyBy(0, 0) // key both input streams on first attribute

// ConnectedStreams offers map and flatMap methods; for map, the function interface is:

CoMapFunction[IN1, IN2, OUT]

    > map1(IN1): OUT

    > map2(IN2): OUT

// The order in which map1 and map2 are called cannot be controlled; each is called as soon as an event arrives on its input

// connect streams with broadcast

val keyedConnect: ConnectedStreams[(Int, Long), (Int, String)] = first

// broadcast-forward

  .connect(second.broadcast) // replicate second and send a copy to every parallel task of first

// Example

// group sensor readings by their id

val keyed: KeyedStream[SensorReading, String] = tempReadings

  .keyBy(_.id)

// connect the two streams and raise an alert

// if the temperature and smoke levels are high

val alerts = keyed

  .connect(smokeReadings.broadcast)

  .flatMap(new RaiseAlertFlatMap)

class RaiseAlertFlatMap extends CoFlatMapFunction[SensorReading, SmokeLevel, Alert] {

  // note: this variable is not checkpointed

  var smokeLevel = SmokeLevel.Low

  override def flatMap1(in1: SensorReading, collector: Collector[Alert]): Unit = {

    // high chance of fire => true

    if (smokeLevel.equals(SmokeLevel.High) && in1.temperature > 100) {

      collector.collect(Alert("Risk of fire!", in1.timestamp))

    }

  }

  override def flatMap2(in2: SmokeLevel, collector: Collector[Alert]): Unit = {

    smokeLevel = in2

  }

}
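
Because the call order of flatMap1 and flatMap2 depends on arrival order, the same events can produce different alerts. The plain-Scala sketch below models the two inputs as one interleaved sequence to show this. It is not Flink code: `RaiseAlertSketch`, its `run` method, and the simplified `SmokeLevel` hierarchy are hypothetical stand-ins for the types used above.

```scala
// Plain-Scala model of RaiseAlertFlatMap over an interleaved event sequence.
object RaiseAlertSketch {
  case class SensorReading(id: String, timestamp: Long, temperature: Double)
  sealed trait SmokeLevel
  case object Low extends SmokeLevel
  case object High extends SmokeLevel
  case class Alert(message: String, timestamp: Long)

  // Left = event on the first input (flatMap1),
  // Right = event on the second input (flatMap2).
  def run(events: Seq[Either[SensorReading, SmokeLevel]]): Seq[Alert] = {
    var smokeLevel: SmokeLevel = Low // plain field, lost on failure
    events.flatMap {
      case Left(r) =>
        if (smokeLevel == High && r.temperature > 100)
          Seq(Alert("Risk of fire!", r.timestamp))
        else Seq.empty[Alert]
      case Right(level) =>
        smokeLevel = level // only updates state, emits nothing
        Seq.empty[Alert]
    }
  }
}
```

A hot reading that arrives before the `High` smoke level raises no alert, while the same reading arriving afterwards does.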

SPLIT [DATASTREAM -> SPLITSTREAM] AND SELECT

val inputStream: DataStream[(Int, String)] = ...

val splitted: SplitStream[(Int, String)] = inputStream

  .split(t => if (t._1 > 1000) Seq("large") else Seq("small"))

val large: DataStream[(Int, String)] = splitted.select("large")

val small: DataStream[(Int, String)] = splitted.select("small")

val all: DataStream[(Int, String)] = splitted.select("small", "large")

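Partitioning transformations reorganize stream events.

shuffle(): distributes events randomly across the downstream parallel tasks

rebalance(): evenly across all downstream tasks; e.g. each of two upstream tasks spreads its events over all 4 downstream tasks

rescale(): evenly across a subset of downstream tasks; e.g. with 2 upstream and 4 downstream tasks, each upstream task spreads its events over its own 2 downstream tasks

broadcast(): replicates the data and sends it to all downstream parallel tasks

global(): sends all events to the first parallel task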

partitionCustom()：
```
val numbers: DataStream[(Int)] = ...

numbers.partitionCustom(myPartitioner, 0)

object myPartitioner extends Partitioner[Int] {

  val r = scala.util.Random

  override def partition(key: Int, numPartitions: Int): Int = {

    if (key < 0) 0 else r.nextInt(numPartitions)

  }

}
```
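
The difference between rebalance() and rescale() is easiest to see as a mapping from each upstream task to the downstream tasks it sends to. The following is a plain-Scala sketch, not Flink code: `PartitioningSketch` and its methods are hypothetical names, and the rescale model assumes that the downstream parallelism is a multiple of the upstream parallelism.

```scala
// Plain-Scala model of which downstream tasks each upstream task feeds.
object PartitioningSketch {
  // rebalance: every upstream task distributes over all downstream tasks.
  def rebalanceTargets(upstream: Int, downstream: Int): Map[Int, List[Int]] =
    (0 until upstream).map(i => i -> (0 until downstream).toList).toMap

  // rescale: every upstream task distributes over its own subset only.
  def rescaleTargets(upstream: Int, downstream: Int): Map[Int, List[Int]] = {
    require(upstream > 0 && downstream % upstream == 0,
      "sketch assumes downstream is a multiple of upstream")
    val k = downstream / upstream // downstream tasks per upstream task
    (0 until upstream).map(i => i -> (i * k until (i + 1) * k).toList).toMap
  }
}
```

For 2 upstream and 4 downstream tasks, `rebalanceTargets(2, 4)` maps both upstream tasks to `List(0, 1, 2, 3)`, while `rescaleTargets(2, 4)` maps task 0 to `List(0, 1)` and task 1 to `List(2, 3)`, so rescale needs fewer connections between tasks.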
Task chaining and resource groups

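By default, two operators are chained when all of the following conditions hold:
- the upstream and downstream nodes have the same parallelism
- the downstream node has an in-degree of 1 (it receives no input from any other node)
- both nodes are in the same slot group (slot groups are explained below)
- the downstream node's chaining strategy is ALWAYS (it may chain with both upstream and downstream nodes; map, flatMap, filter, etc. default to ALWAYS)
- the upstream node's chaining strategy is ALWAYS or HEAD (HEAD may chain only with downstream nodes, not with upstream ones; sources default to HEAD)
- the data partitioning between the two nodes is forward (see the discussion of data stream partitioning)
- the user has not disabled chaining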
```
// The two mappers will be chained, and filter will not be chained to the first mapper.

someStream.filter(...).map(...).startNewChain().map(...)

// Do not chain the map operator

someStream.map(...).disableChaining()

// By default, if all source operators share a slot, all subsequent operators share that slot too. To avoid unwanted sharing, the setting below forces filter into the slot sharing group "name".

someStream.filter(...).slotSharingGroup("name")
```
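
The chaining conditions listed above amount to a single boolean check per edge of the job graph. The sketch below writes that check out in plain Scala. It is a model of the listed rules, not Flink's own code, and `ChainingSketch`, `Node`, and `isChainable` are hypothetical names.

```scala
// Plain-Scala model of the chaining conditions; not Flink's implementation.
object ChainingSketch {
  sealed trait ChainingStrategy
  case object Always extends ChainingStrategy
  case object Head extends ChainingStrategy
  case object Never extends ChainingStrategy

  // Hypothetical description of one operator in the job graph.
  case class Node(parallelism: Int, slotGroup: String,
                  strategy: ChainingStrategy, inDegree: Int)

  def isChainable(up: Node, down: Node,
                  forwardPartitioning: Boolean,
                  chainingEnabled: Boolean): Boolean =
    up.parallelism == down.parallelism &&            // same parallelism
      down.inDegree == 1 &&                          // single input
      up.slotGroup == down.slotGroup &&              // same slot group
      down.strategy == Always &&                     // downstream: ALWAYS
      (up.strategy == Always || up.strategy == Head) && // upstream: ALWAYS or HEAD
      forwardPartitioning &&                         // forward partitioning
      chainingEnabled                                // chaining not disabled
}
```

In this model, startNewChain() corresponds to switching an operator's strategy to `Head`, which makes the check fail for the edge leading into it while still allowing it to chain with its successor.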

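A task slot is the smallest unit of resource allocation inside a TaskManager. It represents a fixed-size subset of resources, and each TaskManager divides its resources equally among its slots.

By adjusting the number of task slots, users define how tasks are isolated from each other. One slot per TaskManager means that each task runs in its own JVM. Several slots per TaskManager means that several tasks run in the same JVM.

Tasks in the same JVM process, i.e. in slots of the same TaskManager, can share TCP connections (via multiplexing) and heartbeat messages, which reduces network traffic. They can also share some data structures, which lowers the per-task overhead to some extent.

Each slot can hold a single task or a pipeline of consecutive tasks, i.e. a task chain; chained tasks run in one thread, so no thread handover is needed between them. In addition, the SlotSharingGroup mentioned above lets you place non-chained tasks into the same slot. With slot sharing you do not have to work out the total number of tasks: the job needs only as many slots as its highest parallelism.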
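
The effect of slot sharing on the number of slots a job needs can be written as a small calculation: within one slot sharing group the highest parallelism counts, and separate groups add up. This is a plain-Scala sketch under that assumption, and `SlotSketch` and `requiredSlots` are hypothetical names.

```scala
// Plain-Scala estimate of required slots; an illustration, not Flink code.
object SlotSketch {
  // operators: (slot sharing group name, parallelism) per operator.
  def requiredSlots(operators: Seq[(String, Int)]): Int =
    operators
      .groupBy(_._1)                      // one bucket per slot sharing group
      .values
      .map(ops => ops.map(_._2).max)      // highest parallelism in the group
      .sum                                // groups cannot share slots
}
```

For example, three operators with parallelism 4, 4, and 2 in the default group need 4 slots, and moving one operator with parallelism 3 into a separate group "name" adds 3 more.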

Defining UDFs

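Flink serializes all UDFs, together with the arguments they receive, using Java's default serialization and ships them to the worker processes.

A rich function offers several methods beyond what a plain lambda provides. open() initializes the operator and is called once per task before the function's processing method is first invoked. Its Configuration parameter can be ignored (it is used by the DataSet API). There are also close(), getRuntimeContext(), and setRuntimeContext().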

class MyFlatMap extends RichFlatMapFunction[Int, (Int, Int)] {

  var subTaskIndex = 0

  override def open(configuration: Configuration): Unit = {

    subTaskIndex = getRuntimeContext.getIndexOfThisSubtask

    // do some initialization

    // e.g. establish a connection to an external system

  }

  override def flatMap(in: Int, out: Collector[(Int, Int)]): Unit = {

    // subtasks are 0-indexed

    if(in % 2 == subTaskIndex) {

      out.collect((subTaskIndex, in))

    }

    // do some more processing

  }

  override def close(): Unit = {

    // do some cleanup, e.g. close connections to external systems

  }

}

Using a rich function to read the global configuration

def main(args: Array[String]) : Unit = {

  val env = StreamExecutionEnvironment.getExecutionEnvironment

  // a Flink class, backed by a HashMap

  val conf = new Configuration()

  // set the parameter “keyWord” to “flink”

  conf.setString("keyWord", "flink")

  // set the configuration as global

  env.getConfig.setGlobalJobParameters(conf)

  val input: DataStream[String] = env.fromElements(

   "I love flink", "bananas", "apples", "flinky")

  input.filter(new MyFilterFunction)

    .print()

  env.execute()

}

class MyFilterFunction extends RichFilterFunction[String] {

  var keyWord = ""

  override def open(configuration: Configuration): Unit = {

    val globalParams = getRuntimeContext.getExecutionConfig.getGlobalJobParameters

    val globConf = globalParams.asInstanceOf[Configuration]

    // null is the default value

    keyWord = globConf.getString("keyWord", null)

  }

  override def filter(value: String): Boolean = {

    // use the keyWord parameter to filter out elements

    value.contains(keyWord)

  }

}

Additional notes:

parallelism: set a default on the env and override it on individual operators

Referencing: _.birthday._ denotes all fields of the birthday field

DataStream<Tuple3<Integer, Double, String>> in = // [...]

DataStream<Tuple2<String, Integer>> out = in.project(2,0);

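Flink also tries to keep its own third-party dependencies small, including transitive dependencies. When your job uses third-party libraries, either bundle all dependencies into the job JAR, or put them in Flink's lib directory so that Flink loads them on every start (in that case they can be left out of the JAR, as long as the cluster's Flink installation has them).

References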

Stream Processing with Apache Flink by Vasiliki Kalavri; Fabian Hueske