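Flink builds its own independent type system.

Reference: https://ci.apache.org/projects/flink/flink-docs-release-0.10/internals/types_serialization.html

Why build its own, rather than letting the programming language or the serialization framework handle types naturally, as other platforms do?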

Flink tries to know as much information about what types enter and leave user functions as possible. This stands in contrast to the approach to just assuming nothing and letting the programming language and serialization framework handle all types dynamically.

  • To allow using POJOs and grouping/joining them by referring to field names, Flink needs the type information to make checks (for typos and type compatibility) before the job is executed.

  • The more we know, the better serialization and data layout schemes the compiler/optimizer can develop. That is quite important for the memory usage paradigm in Flink (work on serialized data inside/outside the heap and make serialization very cheap).

  • For the upcoming logical programs (see roadmap draft) we need this to know the “schema” of functions.

  • Finally, it also spares users having to worry about serialization frameworks and having to register types at those frameworks.

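Note: what is a POJO? A Plain Old Java Object, which is just a fancy name for a lightweight, ordinary Java object.

The main reasons are:

First, type checking. Flink supports flexible field-based joins and groupings, so it must verify up front that a field can be used as a key, that is, whether it can take part in a join or a group.

Second, performance. Knowing the types makes it possible to choose better serialization and data layouts.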
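The first reason, checking key fields before a job runs, can be illustrated with plain Java reflection. This is only a minimal sketch and not Flink's implementation: the class `KeyFieldCheck`, its method `checkKey`, and the `Person` type are hypothetical names invented here, and treating "primitive or `Comparable`" as the criterion for a usable key is an assumption of the sketch.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

// Hypothetical helper, not part of Flink: validates a key field name up front.
final class KeyFieldCheck {

    // Example record type used only for this sketch.
    public static class Person {
        public String name;
        public int age;
        public Object payload; // neither primitive nor Comparable
    }

    /** Throws IllegalArgumentException if the field is missing or cannot act as a key. */
    static void checkKey(Class<?> type, String fieldName) {
        Field field;
        try {
            field = type.getField(fieldName); // public fields only, including inherited ones
        } catch (NoSuchFieldException e) {
            throw new IllegalArgumentException(
                "No public field '" + fieldName + "' in " + type.getSimpleName());
        }
        Class<?> fieldType = field.getType();
        // Assumption of this sketch: a key must be a primitive or implement Comparable.
        boolean comparable =
            fieldType.isPrimitive() || Comparable.class.isAssignableFrom(fieldType);
        if (Modifier.isStatic(field.getModifiers()) || !comparable) {
            throw new IllegalArgumentException(
                "Field '" + fieldName + "' cannot be used as a key");
        }
    }

    public static void main(String[] args) {
        checkKey(Person.class, "name"); // passes silently
        try {
            checkKey(Person.class, "nmae"); // typo is rejected before any data is processed
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The point of the sketch is the timing: the typo surfaces when the program is assembled, not halfway through processing the data.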

Flink mainly defines the following kinds of types:

Internally, Flink makes the following distinctions between types:

  • Basic types: All Java primitives and their boxed form, plus void, String, and Date.

  • Primitive arrays and Object arrays

  • Composite types

    • Flink Java Tuples (part of the Flink Java API)

    • Scala case classes (including Scala tuples)

    • POJOs: classes that follow a certain bean-like pattern

  • Scala auxiliary types (Option, Either, Lists, Maps, …)

  • Generic types: These will not be serialized by Flink itself, but by Kryo.

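Basic types

Arrays (both primitive arrays and object arrays)

Composite types, which include Flink Tuples, Scala case classes, and POJOs

Auxiliary types added for Scala

Generic types, which Flink does not serialize itself but hands over to Kryo

POJOs deserve special attention, because their fields can be referenced directly by name, which is very convenient:

dataSet.join(another).where("name").equalTo("personName")

So what exactly is Flink's definition of a POJO?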

  • The class is public and standalone (no non-static inner class)
  • The class has a public no-argument constructor
  • All fields in the class (and all superclasses) are either public or have a public getter and a setter method that follows the Java beans naming conventions for getters and setters.

It is that simple: any class that follows the rules above supports “by-name” field referencing.
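The three POJO rules can be approximated with reflection. The following is a simplified sketch, not Flink's actual analysis: `PojoRules`, `isPojo`, `hasAccessors`, and the three sample classes are hypothetical names, only `getX`/`setX` accessors are recognized (no `isX` for booleans), and skipping static and transient fields is an assumption of the sketch.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

// Hypothetical checker for this sketch; Flink's own analysis is more thorough.
final class PojoRules {

    public static class Good {            // satisfies all three rules
        public String name;
        private int age;
        public int getAge() { return age; }
        public void setAge(int age) { this.age = age; }
    }

    public static class NoSetter {        // private field without a setter
        private int age;
        public int getAge() { return age; }
    }

    public static class NoDefaultCtor {   // no public no-argument constructor
        public String name;
        public NoDefaultCtor(String name) { this.name = name; }
    }

    static boolean isPojo(Class<?> clazz) {
        int mods = clazz.getModifiers();
        // Rule 1: public and standalone (a non-static inner class is rejected).
        if (!Modifier.isPublic(mods)) return false;
        if (clazz.isMemberClass() && !Modifier.isStatic(mods)) return false;
        // Rule 2: public no-argument constructor.
        try {
            clazz.getConstructor();
        } catch (NoSuchMethodException e) {
            return false;
        }
        // Rule 3: every field, including superclass fields, is public or has accessors.
        for (Class<?> c = clazz; c != null && c != Object.class; c = c.getSuperclass()) {
            for (Field f : c.getDeclaredFields()) {
                int fm = f.getModifiers();
                if (Modifier.isStatic(fm) || Modifier.isTransient(fm) || f.isSynthetic()) {
                    continue; // assumption: these fields are not part of the record
                }
                if (!Modifier.isPublic(fm) && !hasAccessors(clazz, f)) return false;
            }
        }
        return true;
    }

    // Looks for public getX() and setX(fieldType) following the bean naming convention.
    static boolean hasAccessors(Class<?> clazz, Field f) {
        String name = f.getName();
        String suffix = Character.toUpperCase(name.charAt(0)) + name.substring(1);
        try {
            clazz.getMethod("get" + suffix);
            clazz.getMethod("set" + suffix, f.getType());
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isPojo(Good.class));
        System.out.println(isPojo(NoSetter.class));
        System.out.println(isPojo(NoDefaultCtor.class));
    }
}
```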

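The documentation also describes how types are handled in the Scala and Java APIs.

In Scala, manifests or type tags already solve the generic type erasure problem, so Flink mainly uses a macro to generate the TypeInformation, which keeps things convenient.

In Java, generic type erasure has to be dealt with explicitly: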

DataSet<SomeType> result = dataSet
.map(new MyGenericNonInferrableFunction<Long, SomeType>())
.returns(SomeType.class);

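In the example above, without the final hint there is no way to know at runtime what SomeType is, because it was already erased to Object at compile time.

So Flink provides the returns method for adding such hints.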
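The erasure problem itself can be reproduced with the JDK alone. The sketch below uses a hypothetical `GenericMapper` class as a stand-in for a generic user function; it is not a Flink type. It shows that a directly instantiated generic object carries no record of its type arguments, while an anonymous subclass does, because the compiler stores the type arguments in the subclass's signature.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;

final class ErasureDemo {

    // Hypothetical stand-in for a generic user function; not a Flink class.
    static class GenericMapper<IN, OUT> { }

    /**
     * Returns the type arguments recorded for the object's superclass,
     * or an empty array when none were recorded.
     */
    static Type[] recordedTypeArguments(Object function) {
        Type superType = function.getClass().getGenericSuperclass();
        if (superType instanceof ParameterizedType) {
            return ((ParameterizedType) superType).getActualTypeArguments();
        }
        return new Type[0];
    }

    public static void main(String[] args) {
        // Direct instantiation: <Long, String> exists only at compile time.
        System.out.println(
            recordedTypeArguments(new GenericMapper<Long, String>()).length);      // 0
        // Anonymous subclass: the type arguments are stored in its class signature.
        System.out.println(
            recordedTypeArguments(new GenericMapper<Long, String>() { }).length);  // 2
    }
}
```

When nothing can be recovered this way, the information has to come from the caller, which is the role the returns hint plays in the example above.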

 

Now let's look at the source code.

The base class is:

package org.apache.flink.api.common.typeinfo;
TypeInformation

Purpose: this type information class acts as the tool

  • to generate serializers and comparators
  • to perform semantic checks, such as whether the fields that are used as join/grouping keys actually exist
  • to bridge between the programming language's object model and a logical flat schema

The first two purposes are easy to understand.

For the last one, two concepts need to be clear:

arity: the number of fields the type contains directly
total number of fields: the number of fields in the entire schema of this type, including nested types

For example:

public class InnerType {
    public int id;
    public String text;
}

public class OuterType {
    public long timestamp;
    public InnerType nestedType;
}

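For InnerType, both the arity and the total number of fields are 2.

For OuterType, the arity is 2 but the total number of fields is 3, because the fields of the nested type are counted as well. This is how the programming language's object model is mapped to a flat logical schema.

The rules for counting fields are: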

  • Basic types are indivisible and are considered a single field.
  • Arrays and collections are one field.
  • Tuples and case classes represent as many fields as the class has fields.
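The difference between arity and total number of fields can be checked with a small reflection-based counter. This is a sketch of the counting rules only, not Flink's code: `FieldCounting`, `isAtomic`, `arity`, and `totalFields` are hypothetical names, only public instance fields are considered, and the set of types treated as a single field is a simplification chosen for the sketch. It does not guard against self-referencing types.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.Collection;

// Hypothetical counter that mirrors the rules above; not Flink's implementation.
final class FieldCounting {

    public static class InnerType {
        public int id;
        public String text;
    }

    public static class OuterType {
        public long timestamp;
        public InnerType nestedType;
    }

    // Simplification: primitives, boxed types, String, arrays and collections are one field.
    static boolean isAtomic(Class<?> t) {
        return t.isPrimitive() || t.isArray() || t == String.class
            || Number.class.isAssignableFrom(t) || t == Boolean.class || t == Character.class
            || Collection.class.isAssignableFrom(t);
    }

    /** Number of fields the type contains directly. */
    static int arity(Class<?> type) {
        if (isAtomic(type)) return 1;
        int count = 0;
        for (Field f : type.getFields()) {
            if (!Modifier.isStatic(f.getModifiers())) count++;
        }
        return count;
    }

    /** Number of fields in the flattened schema, nested types included. */
    static int totalFields(Class<?> type) {
        if (isAtomic(type)) return 1;
        int count = 0;
        for (Field f : type.getFields()) {
            if (!Modifier.isStatic(f.getModifiers())) count += totalFields(f.getType());
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(arity(InnerType.class) + " / " + totalFields(InnerType.class)); // 2 / 2
        System.out.println(arity(OuterType.class) + " / " + totalFields(OuterType.class)); // 2 / 3
    }
}
```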

 

IntegerTypeInfo
Let's take this class as an example and walk through it:
public class IntegerTypeInfo<T> extends NumericTypeInfo<T> 
public abstract class NumericTypeInfo<T> extends BasicTypeInfo<T> 
public class BasicTypeInfo<T> extends TypeInformation<T> implements AtomicType<T>

As you can see, IntegerTypeInfo ultimately extends BasicTypeInfo, and BasicTypeInfo, besides extending TypeInformation, also implements the AtomicType interface:

public interface AtomicType<T> {   

   TypeComparator<T> createComparator(boolean sortOrderAscending, ExecutionConfig executionConfig);
}
An atomic type is a type that is treated as one indivisible unit and where the entire type acts as a key. In contrast to atomic types are composite types, where the type information is aware of the individual fields and individual fields may be used as a key.

In other words, an atomic type cannot be split further; unlike a composite type it contains no other fields, so the whole value acts as the key. Basic types such as int are certainly atomic.
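The idea that an atomic value is its own key can be sketched with `java.util.Comparator`. The interface `SimpleAtomicType` below is a hypothetical, simplified analogue of AtomicType: it drops the ExecutionConfig parameter and returns a JDK `Comparator` instead of Flink's TypeComparator, so it only illustrates the shape of the contract.

```java
import java.util.Arrays;
import java.util.Comparator;

final class AtomicKeySketch {

    // Hypothetical, simplified analogue of AtomicType (no ExecutionConfig, JDK Comparator).
    interface SimpleAtomicType<T> {
        Comparator<T> createComparator(boolean sortOrderAscending);
    }

    // For an atomic type the whole value is the key, so the comparator compares values directly.
    static final SimpleAtomicType<Integer> INT = ascending ->
        ascending ? Comparator.<Integer>naturalOrder() : Comparator.<Integer>reverseOrder();

    /** Returns a sorted copy of the input, leaving the original array untouched. */
    static Integer[] sorted(Integer[] values, boolean ascending) {
        Integer[] copy = values.clone();
        Arrays.sort(copy, INT.createComparator(ascending));
        return copy;
    }

    public static void main(String[] args) {
        Integer[] data = {3, 1, 2};
        System.out.println(Arrays.toString(sorted(data, true)));  // [1, 2, 3]
        System.out.println(Arrays.toString(sorted(data, false))); // [3, 2, 1]
    }
}
```

A composite type would instead need to know which of its fields to extract before comparing, which is why the field-aware logic lives in the composite type information rather than here.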
 
BasicTypeInfo defines the TypeInfo constants for all basic types:
    public static final BasicTypeInfo<String> STRING_TYPE_INFO = new BasicTypeInfo<String>(String.class, new Class<?>[]{}, StringSerializer.INSTANCE, StringComparator.class);
public static final BasicTypeInfo<Boolean> BOOLEAN_TYPE_INFO = new BasicTypeInfo<Boolean>(Boolean.class, new Class<?>[]{}, BooleanSerializer.INSTANCE, BooleanComparator.class);
public static final BasicTypeInfo<Byte> BYTE_TYPE_INFO = new IntegerTypeInfo<Byte>(Byte.class, new Class<?>[]{Short.class, Integer.class, Long.class, Float.class, Double.class, Character.class}, ByteSerializer.INSTANCE, ByteComparator.class);
public static final BasicTypeInfo<Short> SHORT_TYPE_INFO = new IntegerTypeInfo<Short>(Short.class, new Class<?>[]{Integer.class, Long.class, Float.class, Double.class, Character.class}, ShortSerializer.INSTANCE, ShortComparator.class);
public static final BasicTypeInfo<Integer> INT_TYPE_INFO = new IntegerTypeInfo<Integer>(Integer.class, new Class<?>[]{Long.class, Float.class, Double.class, Character.class}, IntSerializer.INSTANCE, IntComparator.class);
public static final BasicTypeInfo<Long> LONG_TYPE_INFO = new IntegerTypeInfo<Long>(Long.class, new Class<?>[]{Float.class, Double.class, Character.class}, LongSerializer.INSTANCE, LongComparator.class);
public static final BasicTypeInfo<Float> FLOAT_TYPE_INFO = new FractionalTypeInfo<Float>(Float.class, new Class<?>[]{Double.class}, FloatSerializer.INSTANCE, FloatComparator.class);
public static final BasicTypeInfo<Double> DOUBLE_TYPE_INFO = new FractionalTypeInfo<Double>(Double.class, new Class<?>[]{}, DoubleSerializer.INSTANCE, DoubleComparator.class);
public static final BasicTypeInfo<Character> CHAR_TYPE_INFO = new BasicTypeInfo<Character>(Character.class, new Class<?>[]{}, CharSerializer.INSTANCE, CharComparator.class);
public static final BasicTypeInfo<Date> DATE_TYPE_INFO = new BasicTypeInfo<Date>(Date.class, new Class<?>[]{}, DateSerializer.INSTANCE, DateComparator.class);
public static final BasicTypeInfo<Void> VOID_TYPE_INFO = new BasicTypeInfo<Void>(Void.class, new Class<?>[]{}, VoidSerializer.INSTANCE, null);

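As you can see, Byte, Short, Integer, and Long all use IntegerTypeInfo.

The constructor takes four arguments. Taking INT_TYPE_INFO as an example:

The class object, Integer.class

The types the value may be cast to; an Integer can be cast to Long, Float, Double, or Character

The serializer instance, IntSerializer.INSTANCE

The comparator class, IntComparator.class

So Flink ships its own Serializer and Comparator for each of these basic types (VOID_TYPE_INFO is the exception, with no comparator).

Let's look at LongSerializer: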

@Override
public void serialize(Long record, DataOutputView target) throws IOException {
    target.writeLong(record.longValue());
}

This is very efficient: for a Long, only the actual long value is written, with nothing extra stored.
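The size difference can be measured with the JDK alone. In this sketch `java.io.DataOutputStream` stands in for Flink's DataOutputView (both offer `writeLong`), and Java's built-in object serialization stands in for a generic, type-agnostic serializer; the class `LongEncodingSize` and its methods are hypothetical names used only here.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

final class LongEncodingSize {

    /** Bytes produced when only the raw long value is written, as LongSerializer does. */
    static int rawSize(long value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeLong(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    /** Bytes produced by Java's generic object serialization for the same value. */
    static int javaSerializedSize(Long value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        System.out.println("raw long:           " + rawSize(42L) + " bytes");
        System.out.println("java serialization: " + javaSerializedSize(42L) + " bytes");
    }
}
```

The raw encoding is always 8 bytes, while the generic encoding also has to carry a stream header and a class descriptor, so it is considerably larger. Knowing the type in advance is what lets the serializer skip that metadata.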

 

NumericTypeInfo is just a specialized BasicTypeInfo:

    private static final Set<Class<?>> numericalTypes = Sets.<Class<?>>newHashSet(
Integer.class,
Long.class,
Double.class,
Byte.class,
Short.class,
Float.class,
Character.class
);

Only the classes listed above count as numeric types for NumericTypeInfo.

IntegerTypeInfo narrows the range further:

    private static final Set<Class<?>> integerTypes = Sets.<Class<?>>newHashSet(
Integer.class,
Long.class,
Byte.class,
Short.class,
Character.class
);
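The relationship between the two sets can be made explicit with a small sketch. It copies the class lists shown above into plain `java.util` sets (instead of the Guava `Sets` helper used in the source) and adds a hypothetical `classify` method; how Flink itself uses these sets beyond what is quoted here is not shown.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration of the numeric / integer classification; not Flink code.
final class NumericSets {

    static final Set<Class<?>> NUMERICAL = Collections.unmodifiableSet(
        new HashSet<>(Arrays.<Class<?>>asList(
            Integer.class, Long.class, Double.class, Byte.class,
            Short.class, Float.class, Character.class)));

    static final Set<Class<?>> INTEGER = Collections.unmodifiableSet(
        new HashSet<>(Arrays.<Class<?>>asList(
            Integer.class, Long.class, Byte.class, Short.class, Character.class)));

    /** Returns the most specific category that the class falls into. */
    static String classify(Class<?> clazz) {
        if (INTEGER.contains(clazz)) return "integer";
        if (NUMERICAL.contains(clazz)) return "numeric";
        return "other";
    }

    public static void main(String[] args) {
        System.out.println(classify(Integer.class)); // integer
        System.out.println(classify(Double.class));  // numeric
        System.out.println(classify(String.class));  // other
        System.out.println(NUMERICAL.containsAll(INTEGER)); // true: integer types are a subset
    }
}
```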
 

Besides the AtomicType implementations above, there are also TypeInfos for other kinds of types, such as arrays.

For example, BasicArrayTypeInfo.
