1. How exactly-once is implemented in Flink

  Both reading data from Kafka and writing results back to Kafka need to guarantee exactly-once. At the moment Flink has few sources that support exactly-once, the Kafka source being one of them; sinks that can achieve exactly-once are also few, e.g. the Kafka sink and StreamingFileSink, and all of them require checkpointing to be enabled. Below we take FlinkKafkaProducer as an example and dig into its source code to understand how exactly-once semantics are implemented in Flink.

1.1 Overall flow (the two-phase-commit protocol)

1. The JobManager periodically (through the CheckpointCoordinator) sends checkpoint requests to every subTask that contains state

2. Each subTask writes its own state to the corresponding state backend; one slot corresponds to one file, and the state of each subTask in that slot is written into that file

3. Each subTask sends a checkpoint-success message back to the JobManager

4. Once every subTask has reported success, the JobManager sends a success notification to all subTasks that implement the checkpoint hooks

5. The subTasks write the data to Kafka and commit the transaction to Kafka

Note: to keep the operator state and keyed state along one pipeline consistent, Flink uses the barrier mechanism: barriers are injected at the sources on the JobManager's request and flow through the job together with the data; each operator takes its snapshot when the barrier reaches it, so records behind the barrier are held back during the checkpoint and are not mixed into the current snapshot.
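To make this flow concrete, here is a minimal, hedged sketch (not taken from the original project; broker addresses, host names and the topic are placeholders) of turning on exactly-once for a Kafka sink: enable checkpointing in EXACTLY_ONCE mode and build the FlinkKafkaProducer with Semantic.EXACTLY_ONCE.

    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
    import org.apache.flink.streaming.connectors.kafka.KafkaSerializationSchema;
    import org.apache.kafka.clients.producer.ProducerRecord;

    import java.nio.charset.StandardCharsets;
    import java.util.Properties;

    public class ExactlyOnceKafkaSinkSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            // step 1 of the flow above: without checkpoints the two-phase commit never runs
            env.enableCheckpointing(30_000, CheckpointingMode.EXACTLY_ONCE);

            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "feng05:9092,feng06:9092,feng07:9092");
            // must not exceed the broker's transaction.max.timeout.ms (15 minutes by default)
            props.setProperty("transaction.timeout.ms", String.valueOf(5 * 60 * 1000));

            KafkaSerializationSchema<String> schema = new KafkaSerializationSchema<String>() {
                @Override
                public ProducerRecord<byte[], byte[]> serialize(String element, Long timestamp) {
                    return new ProducerRecord<>("result-topic", element.getBytes(StandardCharsets.UTF_8));
                }
            };

            // Semantic.EXACTLY_ONCE makes the sink open one Kafka transaction per checkpoint
            FlinkKafkaProducer<String> kafkaSink = new FlinkKafkaProducer<>(
                    "result-topic", schema, props, FlinkKafkaProducer.Semantic.EXACTLY_ONCE);

            env.socketTextStream("feng05", 8888).addSink(kafkaSink);

            env.execute("exactly-once kafka sink sketch");
        }
    }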

Question: on the Kafka side there is a transaction for the producer writing data in, and a transaction-related setting for the consumer reading data out — what is the relationship between the two?
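A short, hedged answer: the two are linked only through the downstream consumer's isolation.level. Data written inside the sink's transaction becomes visible once the transaction commits in phase 2, and only to consumers reading with isolation.level=read_committed; with the default read_uncommitted a consumer would also see the not-yet-committed data. A minimal sketch (topic, group id and brokers are placeholders):

    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

    import java.util.Properties;

    public class ReadCommittedConsumerSketch {

        // build a Kafka source that only sees data from committed transactions
        public static FlinkKafkaConsumer<String> buildSource() {
            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "feng05:9092,feng06:9092,feng07:9092");
            props.setProperty("group.id", "gid");
            // default is read_uncommitted, which would also expose pre-commit (phase 1) data
            props.setProperty("isolation.level", "read_committed");
            return new FlinkKafkaConsumer<>("result-topic", new SimpleStringSchema(), props);
        }
    }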

1.2 Source code walkthrough

(1) Start with the FlinkKafkaProducer class: it extends TwoPhaseCommitSinkFunction

(2) TwoPhaseCommitSinkFunction is the recommended base class for any SinkFunction that wants to implement exactly-once semantics. It implements two important interfaces: CheckpointedFunction and CheckpointListener

  • The CheckpointedFunction interface

This interface contains two methods, snapshotState and initializeState; the source code is shown below.

    public interface CheckpointedFunction {

        /**
         * This method is called when a snapshot for a checkpoint is requested. This acts as a hook to the function to
         * ensure that all state is exposed by means previously offered through {@link FunctionInitializationContext} when
         * the Function was initialized, or offered now by {@link FunctionSnapshotContext} itself.
         *
         * @param context the context for drawing a snapshot of the operator
         * @throws Exception
         */
        void snapshotState(FunctionSnapshotContext context) throws Exception;

        /**
         * This method is called when the parallel function instance is created during distributed
         * execution. Functions typically set up their state storing data structures in this method.
         *
         * @param context the context for initializing the operator
         * @throws Exception
         */
        void initializeState(FunctionInitializationContext context) throws Exception;

    }

  The snapshotState method is called when a checkpoint is taken: it snapshots the function and persists its state to the state backend. For this sink the state contains the transaction ID, the subTask index, and the Kafka connection information used for writing. If a checkpoint succeeded but a subTask did not manage to write its data to Kafka, recovery falls back to the most recent state saved here and the job continues from it.

  The initializeState method is used to restore state: if state was previously persisted to the state backend but the data was never successfully written to Kafka, this method restores the most recent state, after which the job resumes writing the data to Kafka. A small sketch of both callbacks follows.
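As a reference point, here is a minimal sketch of a function implementing both callbacks (modeled on the buffering-sink pattern from the Flink documentation, not on FlinkKafkaProducer itself): snapshotState persists the not-yet-flushed buffer, and initializeState re-creates the state handle and restores the buffer after a failure.

    import org.apache.flink.api.common.state.ListState;
    import org.apache.flink.api.common.state.ListStateDescriptor;
    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.runtime.state.FunctionInitializationContext;
    import org.apache.flink.runtime.state.FunctionSnapshotContext;
    import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
    import org.apache.flink.streaming.api.functions.sink.SinkFunction;

    import java.util.ArrayList;
    import java.util.List;

    public class BufferingSink implements SinkFunction<String>, CheckpointedFunction {

        private final int threshold;
        private transient ListState<String> checkpointedState;
        private final List<String> bufferedElements = new ArrayList<>();

        public BufferingSink(int threshold) {
            this.threshold = threshold;
        }

        @Override
        public void invoke(String value, Context context) throws Exception {
            bufferedElements.add(value);
            if (bufferedElements.size() >= threshold) {
                // flush to the external system here (omitted in this sketch)
                bufferedElements.clear();
            }
        }

        @Override
        public void snapshotState(FunctionSnapshotContext context) throws Exception {
            // called when a checkpoint is triggered: persist the unflushed buffer
            checkpointedState.clear();
            checkpointedState.addAll(bufferedElements);
        }

        @Override
        public void initializeState(FunctionInitializationContext context) throws Exception {
            // called on (re)start: create or restore the operator state
            ListStateDescriptor<String> descriptor =
                    new ListStateDescriptor<>("buffered-elements", Types.STRING);
            checkpointedState = context.getOperatorStateStore().getListState(descriptor);

            if (context.isRestored()) {
                for (String element : checkpointedState.get()) {
                    bufferedElements.add(element);
                }
            }
        }
    }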

  • The CheckpointListener interface

This interface contains a single method, notifyCheckpointComplete.

Its source code:

    /**
     * This interface must be implemented by functions/operations that want to receive
     * a commit notification once a checkpoint has been completely acknowledged by all
     * participants.
     */
    @PublicEvolving
    public interface CheckpointListener {

        /**
         * This method is called as a notification once a distributed checkpoint has been completed.
         *
         * Note that any exception during this method will not cause the checkpoint to
         * fail any more.
         *
         * @param checkpointId The ID of the checkpoint that has been completed.
         * @throws Exception
         */
        void notifyCheckpointComplete(long checkpointId) throws Exception;
    }

When is notifyCheckpointComplete called? Only after the subTasks of all partitions have acknowledged the checkpoint to the JobManager; the JobManager then tells each subTask that this checkpoint succeeded and that the next step can be taken. The implementation in TwoPhaseCommitSinkFunction is shown below:

    @Override
    public final void notifyCheckpointComplete(long checkpointId) throws Exception {
        // the following scenarios are possible here
        //
        // (1) there is exactly one transaction from the latest checkpoint that
        //     was triggered and completed. That should be the common case.
        //     Simply commit that transaction in that case.
        //
        // (2) there are multiple pending transactions because one previous
        //     checkpoint was skipped. That is a rare case, but can happen
        //     for example when:
        //
        //     - the master cannot persist the metadata of the last
        //       checkpoint (temporary outage in the storage system) but
        //       could persist a successive checkpoint (the one notified here)
        //
        //     - other tasks could not persist their status during
        //       the previous checkpoint, but did not trigger a failure because they
        //       could hold onto their state and could successfully persist it in
        //       a successive checkpoint (the one notified here)
        //
        //     In both cases, the prior checkpoint never reach a committed state, but
        //     this checkpoint is always expected to subsume the prior one and cover all
        //     changes since the last successful one. As a consequence, we need to commit
        //     all pending transactions.
        //
        // (3) Multiple transactions are pending, but the checkpoint complete notification
        //     relates not to the latest. That is possible, because notification messages
        //     can be delayed (in an extreme case till arrive after a succeeding checkpoint
        //     was triggered) and because there can be concurrent overlapping checkpoints
        //     (a new one is started before the previous fully finished).
        //
        // ==> There should never be a case where we have no pending transaction here
        //

        Iterator<Map.Entry<Long, TransactionHolder<TXN>>> pendingTransactionIterator = pendingCommitTransactions.entrySet().iterator();
        Throwable firstError = null;

        while (pendingTransactionIterator.hasNext()) {
            Map.Entry<Long, TransactionHolder<TXN>> entry = pendingTransactionIterator.next();
            Long pendingTransactionCheckpointId = entry.getKey();
            TransactionHolder<TXN> pendingTransaction = entry.getValue();
            if (pendingTransactionCheckpointId > checkpointId) {
                continue;
            }

            LOG.info("{} - checkpoint {} complete, committing transaction {} from checkpoint {}",
                name(), checkpointId, pendingTransaction, pendingTransactionCheckpointId);

            logWarningIfTimeoutAlmostReached(pendingTransaction);
            try {
                commit(pendingTransaction.handle);
            } catch (Throwable t) {
                if (firstError == null) {
                    firstError = t;
                }
            }

            LOG.debug("{} - committed checkpoint transaction {}", name(), pendingTransaction);

            pendingTransactionIterator.remove();
        }

        if (firstError != null) {
            throw new FlinkRuntimeException("Committing one of transactions failed, logging first encountered failure",
                firstError);
        }
    }

Note that besides notifying each subTask that this checkpoint succeeded, this method also commits the pending transactions: the commit(pendingTransaction.handle) call in the loop above dispatches to the sink's own commit implementation.

The commit method in FlinkKafkaProducer:

    @Override
    protected void commit(FlinkKafkaProducer.KafkaTransactionState transaction) {
        if (transaction.isTransactional()) {
            try {
                transaction.producer.commitTransaction();
            } finally {
                recycleTransactionalProducer(transaction.producer);
            }
        }
    }

  What happens if committing the transaction fails? That is fine: the job restarts according to its restart strategy and calls initializeState to restore the most recent state, then keeps writing data to Kafka and commits the transaction again. On that retry it is no longer the commit method that is called, but FlinkKafkaProducer's recoverAndCommit method (this part might also involve preCommit; I have not fully worked through that piece of the source), which first resumes the transaction and then commits it. Its source is below:

    @Override
    protected void recoverAndCommit(FlinkKafkaProducer.KafkaTransactionState transaction) {
        if (transaction.isTransactional()) {
            try (
                FlinkKafkaInternalProducer<byte[], byte[]> producer =
                    initTransactionalProducer(transaction.transactionalId, false)) {
                producer.resumeTransaction(transaction.producerId, transaction.epoch);
                producer.commitTransaction();
            } catch (InvalidTxnStateException | ProducerFencedException ex) {
                // That means we have committed this transaction before.
                LOG.warn("Encountered error {} while recovering transaction {}. " +
                        "Presumably this transaction has been already committed before",
                    ex,
                    transaction);
            }
        }
    }

Note: this guarantees that the checkpoint succeeds and that the transaction commit succeeds, but it cannot guarantee that the two succeed together atomically. That is still acceptable: even if the checkpoint succeeds and the transaction does not, the failed transaction is rolled back, the job restores its state from the state backend, writes the data to Kafka again, and commits the transaction.
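For this recovery path to work in practice, the job needs a restart strategy, and the Kafka transaction must not time out before the recommit can happen. A hedged configuration sketch (all values are illustrative, not from the original project):

    import org.apache.flink.api.common.restartstrategy.RestartStrategies;
    import org.apache.flink.api.common.time.Time;
    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.CheckpointConfig;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class RecoveryConfigSketch {
        public static void main(String[] args) {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // checkpoint every 30s in exactly-once mode
            env.enableCheckpointing(30_000, CheckpointingMode.EXACTLY_ONCE);

            // keep the last checkpoint after a cancel so the job can be resumed manually
            env.getCheckpointConfig().enableExternalizedCheckpoints(
                    CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

            // on failure: retry 3 times, 10s apart; initializeState() restores the last state
            env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, Time.seconds(10)));

            // note: the Kafka producer's transaction.timeout.ms must be larger than the
            // checkpoint interval plus the expected recovery time, otherwise the broker
            // aborts the pending transaction before recoverAndCommit() can run
        }
    }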

2 A custom two-phase-commit sink example

  The storage system a custom two-phase-commit sink writes to must support transactions, e.g. MySQL or Kafka 0.11 and later. In short, writing such a sink means extending TwoPhaseCommitSinkFunction and overriding its methods; see the example below.

A two-phase-commit sink for MySQL

The Druid connection pool

    package cn._51doit.flink.day11;

    import com.alibaba.druid.pool.DruidDataSourceFactory;

    import javax.sql.DataSource;
    import java.sql.Connection;
    import java.sql.SQLException;
    import java.util.Properties;

    public class DruidConnectionPool {

        private transient static DataSource dataSource = null;

        private transient static Properties props = new Properties();

        static {
            props.put("driverClassName", "com.mysql.jdbc.Driver");
            props.put("url", "jdbc:mysql://172.16.200.101:3306/bigdata?characterEncoding=UTF-8");
            props.put("username", "root");
            props.put("password", "123456");
            try {
                dataSource = DruidDataSourceFactory.createDataSource(props);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        private DruidConnectionPool() {
        }

        public static Connection getConnection() throws SQLException {
            return dataSource.getConnection();
        }
    }

MySqlTwoPhaseCommitSink

    package cn._51doit.flink.day11;

    import org.apache.flink.api.common.ExecutionConfig;
    import org.apache.flink.api.common.typeutils.base.VoidSerializer;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer;
    import org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction;

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class MySqlTwoPhaseCommitSink extends TwoPhaseCommitSinkFunction<Tuple2<String, Integer>, MySqlTwoPhaseCommitSink.ConnectionState, Void> {

        public MySqlTwoPhaseCommitSink() {
            super(new KryoSerializer<>(MySqlTwoPhaseCommitSink.ConnectionState.class, new ExecutionConfig()), VoidSerializer.INSTANCE);
        }

        // open a new transaction: get a connection and turn off auto-commit
        @Override
        protected MySqlTwoPhaseCommitSink.ConnectionState beginTransaction() throws Exception {

            System.out.println("=====> beginTransaction... ");
            //Class.forName("com.mysql.jdbc.Driver");
            //Connection conn = DriverManager.getConnection("jdbc:mysql://172.16.200.101:3306/bigdata?characterEncoding=UTF-8", "root", "123456");
            Connection connection = DruidConnectionPool.getConnection();
            connection.setAutoCommit(false);
            return new ConnectionState(connection);

        }

        // write each record inside the current (not yet committed) transaction
        @Override
        protected void invoke(MySqlTwoPhaseCommitSink.ConnectionState connectionState, Tuple2<String, Integer> value, Context context) throws Exception {
            Connection connection = connectionState.connection;
            PreparedStatement pstm = connection.prepareStatement("INSERT INTO t_wordcount (word, counts) VALUES (?, ?) ON DUPLICATE KEY UPDATE counts = ?");
            pstm.setString(1, value.f0);
            pstm.setInt(2, value.f1);
            pstm.setInt(3, value.f1);
            pstm.executeUpdate();
            pstm.close();
        }

        // phase 1: called when the checkpoint barrier arrives, before the state snapshot
        @Override
        protected void preCommit(MySqlTwoPhaseCommitSink.ConnectionState connectionState) throws Exception {
            System.out.println("=====> preCommit... " + connectionState);
        }

        // phase 2: called from notifyCheckpointComplete, after the checkpoint succeeded
        @Override
        protected void commit(MySqlTwoPhaseCommitSink.ConnectionState connectionState) {
            System.out.println("=====> commit... ");
            Connection connection = connectionState.connection;
            try {
                connection.commit();
                connection.close();
            } catch (SQLException e) {
                throw new RuntimeException("failed to commit the transaction", e);
            }
        }

        // called when the checkpoint (or the job) fails: roll back the pending transaction
        @Override
        protected void abort(MySqlTwoPhaseCommitSink.ConnectionState connectionState) {
            System.out.println("=====> abort... ");
            Connection connection = connectionState.connection;
            try {
                connection.rollback();
                connection.close();
            } catch (SQLException e) {
                throw new RuntimeException("failed to roll back the transaction", e);
            }
        }

        static class ConnectionState {

            private final transient Connection connection;

            ConnectionState(Connection connection) {
                this.connection = connection;
            }

        }

    }
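A possible way to wire this sink into a job (a hedged sketch, not from the original post; the socket host and port are placeholders). Checkpointing must be enabled, otherwise preCommit/commit are never triggered:

    import org.apache.flink.api.common.functions.FlatMapFunction;
    import org.apache.flink.api.java.functions.KeySelector;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.util.Collector;

    public class MySqlTwoPhaseCommitSinkDemo {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            // the two-phase commit is driven by checkpoints: no checkpoint, no commit
            env.enableCheckpointing(10_000, CheckpointingMode.EXACTLY_ONCE);

            env.socketTextStream("feng05", 8888)
                    .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                        @Override
                        public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                            for (String word : line.split(" ")) {
                                out.collect(Tuple2.of(word, 1));
                            }
                        }
                    })
                    .keyBy(new KeySelector<Tuple2<String, Integer>, String>() {
                        @Override
                        public String getKey(Tuple2<String, Integer> value) {
                            return value.f0;
                        }
                    })
                    .sum(1)
                    .addSink(new MySqlTwoPhaseCommitSink());

            env.execute("mysql two-phase-commit sink demo");
        }
    }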

3 Writing data to HBase

  Use HBase's idempotent writes (a Put with the same rowkey and column simply overwrites the same cell) combined with at-least-once delivery (Flink can restore its state, but data may be read again between two checkpoints) to achieve effectively exactly-once semantics. A small sketch of the checkpoint mode follows.
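A hedged sketch of the checkpoint mode this approach relies on; replaying records between two checkpoints is harmless because the rowkey makes the write idempotent:

    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class AtLeastOnceCheckpointSketch {
        public static void main(String[] args) {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            // AT_LEAST_ONCE is sufficient here: a replayed record produces a Put with the
            // same rowkey and column, so writing it again just overwrites the same cell
            env.enableCheckpointing(10_000, CheckpointingMode.AT_LEAST_ONCE);
        }
    }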

HBaseUtil

    package cn._51doit.flink.day11;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    /**
     * HBase utility class used to create an HBase Connection
     */
    public class HBaseUtil {
        /**
         * @param zkQuorum zookeeper address; separate multiple hosts with commas
         * @param port     zookeeper port
         * @return an HBase Connection
         */
        public static Connection getConnection(String zkQuorum, int port) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            conf.set("hbase.zookeeper.quorum", zkQuorum);
            conf.set("hbase.zookeeper.property.clientPort", port + "");
            Connection connection = ConnectionFactory.createConnection(conf);
            return connection;
        }
    }

MyHbaseSink

    package cn._51doit.flink.day11;

    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.api.java.utils.ParameterTool;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;

    import java.util.ArrayList;
    import java.util.List;

    public class MyHbaseSink extends RichSinkFunction<Tuple2<String, Double>> {

        private transient Connection connection;

        // not transient: these settings must survive serialization to the TaskManagers
        private Integer maxSize = 1000;

        private Long delayTime = 5000L;

        private transient Long lastInvokeTime;

        private transient List<Put> puts;

        public MyHbaseSink() {}

        public MyHbaseSink(Integer maxSize, Long delayTime) {
            this.maxSize = maxSize;
            this.delayTime = delayTime;
        }

        @Override
        public void open(Configuration parameters) throws Exception {
            super.open(parameters);

            ParameterTool params = (ParameterTool) getRuntimeContext()
                    .getExecutionConfig().getGlobalJobParameters();

            // create the HBase connection
            connection = HBaseUtil.getConnection(
                    params.getRequired("hbase.zookeeper.quorum"),
                    params.getInt("hbase.zookeeper.property.clientPort", 2181)
            );

            // transient fields are null after deserialization, so initialize them here
            puts = new ArrayList<>(maxSize);
            lastInvokeTime = System.currentTimeMillis();
        }

        @Override
        public void invoke(Tuple2<String, Double> value, Context context) throws Exception {

            String rk = value.f0;
            Put put = new Put(rk.getBytes());
            put.addColumn("data".getBytes(), "order".getBytes(), value.f1.toString().getBytes());

            // buffer the Put in the current batch
            puts.add(put);

            // use processing time to decide when to flush
            long currentTime = System.currentTimeMillis();

            // flush when the batch is full or the delay has elapsed
            if (puts.size() >= maxSize || currentTime - lastInvokeTime >= delayTime) {

                // get the HBase table
                Table table = connection.getTable(TableName.valueOf("myorder"));

                table.put(puts);

                puts.clear();

                lastInvokeTime = currentTime;

                table.close();
            }

        }

        @Override
        public void close() throws Exception {
            connection.close();
        }
    }
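A possible way to use this sink (a hedged sketch with placeholder hosts and sample records): the sink's open() method reads the ZooKeeper settings from the global job parameters, so they must be registered on the ExecutionConfig.

    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.api.java.utils.ParameterTool;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class MyHbaseSinkDemo {
        public static void main(String[] args) throws Exception {
            // e.g. --hbase.zookeeper.quorum feng05,feng06,feng07
            ParameterTool params = ParameterTool.fromArgs(args);

            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.enableCheckpointing(10_000);
            // open() of MyHbaseSink reads the connection settings from the global job parameters
            env.getConfig().setGlobalJobParameters(params);

            env.fromElements(Tuple2.of("o-1001", 88.8), Tuple2.of("o-1002", 66.6))
                    .addSink(new MyHbaseSink(500, 3000L));

            env.execute("hbase sink demo");
        }
    }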

4 ProtoBuf

  ProtoBuf is a serialization mechanism; the data is still stored as binary. Its strengths are fast serialization and deserialization, a small footprint (roughly 1/3 of the equivalent JSON), and cross-platform, cross-language support.

4.1 Trying out protobuf

(1) Create a Maven project

(2) Add the POM dependencies listed below

    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <modelVersion>4.0.0</modelVersion>

        <groupId>org.example</groupId>
        <artifactId>protobuf-bean</artifactId>
        <version>1.0-SNAPSHOT</version>

        <properties>
            <maven.compiler.source>1.8</maven.compiler.source>
            <maven.compiler.target>1.8</maven.compiler.target>
            <encoding>UTF-8</encoding>
        </properties>
        <dependencies>
            <dependency>
                <groupId>com.google.protobuf</groupId>
                <artifactId>protobuf-java</artifactId>
                <version>3.7.1</version>
            </dependency>

            <dependency>
                <groupId>com.google.protobuf</groupId>
                <artifactId>protobuf-java-util</artifactId>
                <version>3.7.1</version>
            </dependency>
            <dependency>
                <groupId>org.apache.kafka</groupId>
                <artifactId>kafka-clients</artifactId>
                <version>2.4.0</version>
            </dependency>

            <dependency>
                <groupId>junit</groupId>
                <artifactId>junit</artifactId>
                <version>4.12</version>
                <scope>test</scope>
            </dependency>
        </dependencies>
        <build>
            <extensions>
                <extension>
                    <groupId>kr.motd.maven</groupId>
                    <artifactId>os-maven-plugin</artifactId>
                    <version>1.6.2</version>
                </extension>
            </extensions>
            <plugins>
                <plugin>
                    <groupId>org.xolstice.maven.plugins</groupId>
                    <artifactId>protobuf-maven-plugin</artifactId>
                    <version>0.6.1</version>

                    <configuration>
                        <protocArtifact>
                            com.google.protobuf:protoc:3.7.1:exe:${os.detected.classifier}
                        </protocArtifact>
                        <pluginId>grpc-java</pluginId>
                    </configuration>
                    <executions>
                        <execution>
                            <goals>
                                <goal>compile</goal>
                                <goal>compile-custom</goal>
                            </goals>
                        </execution>
                    </executions>
                </plugin>

            </plugins>
        </build>

    </project>

(3) Create a proto folder under the main directory and put the xxx.proto file(s) in it, for example:

    syntax = "proto3";
    option java_package = "cn._51doit.proto";
    option java_outer_classname = "OrderProto";

    message Order {
        int32 id = 1;
        string time = 2;
        double money = 3;
    }

(4) The Maven plugins panel now shows a protobuf plugin; run its protobuf:compile goal and the corresponding protobuf bean classes (a schema usable from multiple languages) are generated under the project's target directory

(5) Move the generated proto bean into whatever package you want

The test below turns JSON data into a protoBuf bean, serializes it to binary, and then deserializes the binary back into a bean:

OrderProtoTest

    package cn._51doit.test;

    import cn._51doit.proto.OrderProto;
    import com.google.protobuf.InvalidProtocolBufferException;
    import com.google.protobuf.util.JsonFormat;

    import java.util.Arrays;

    public class OrderProtoTest {
        public static void main(String[] args) throws InvalidProtocolBufferException {
            String json = "{\"id\": 100, \"time\": \"2020-07-01\", \"money\": 66.66}";

            // create a builder for the generated class
            OrderProto.Order.Builder bean = OrderProto.Order.newBuilder();

            // copy the JSON data into the bean
            JsonFormat.parser().merge(json, bean);

            bean.setId(666);
            bean.setTime("2019-10-18");
            bean.setMoney(888.88);
            // serialize: bean -> byte array
            byte[] bytes = bean.build().toByteArray();

            System.out.println("binary: " + Arrays.toString(bytes));

            // deserialize: byte array -> bean
            OrderProto.Order order = OrderProto.Order.parseFrom(bytes);
            System.out.println("as object: " + order);
        }
    }

4.2 Sending data to Kafka as ProtoBuf binary

DataToKafka

    package cn._51doit.test;

    import cn._51doit.proto.DataBeanProto;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    import java.util.Properties;

    public class DataToKafka {
        public static void main(String[] args) {
            // 1. configuration
            Properties props = new Properties();
            // kafka brokers to connect to
            props.setProperty("bootstrap.servers", "feng05:9092,feng06:9092,feng07:9092");
            props.setProperty("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.setProperty("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");

            String topic = "dataproto";

            // 2. the kafka producer
            KafkaProducer<String, byte[]> producer = new KafkaProducer<String, byte[]>(props);
            DataBeanProto.DataBean.Builder bean = DataBeanProto.DataBean.newBuilder();
            DataBeanProto.DataBeans.Builder list = DataBeanProto.DataBeans.newBuilder();

            for (int i = 1; i <= 100; i++) {
                // set the bean's fields
                bean.setId(i);
                bean.setTitle("doit-" + i);
                bean.setUrl("www.51doit.cn");
                // append the bean to the list
                list.addDataBean(bean);
                // clear the builder for the next record
                bean.clear();

                if (list.getDataBeanCount() == 10) {
                    // serialize the batch of beans to protobuf binary
                    byte[] bytes = list.build().toByteArray();
                    ProducerRecord<String, byte[]> record = new ProducerRecord<>(topic, bytes);
                    producer.send(record); // send 10 records at a time
                    producer.flush();
                    list.clear();
                }
            }
            System.out.println("message send success");
            // release resources
            producer.close();
        }

    }

4.3 Integrating a Kafka serializer with Flume's KafkaChannel

  Requirements: (1) define a serializer for Kafka so that, before data is written into Kafka, it is converted into the corresponding binary form;

     (2) Flink then pulls this binary data from Kafka and converts it back into a ProtoBuf bean

(1) Implementing the Kafka serializer

  The rough idea: first generate a protoBuf bean, then define a serializer by implementing the Serializer interface and overriding its serialize method; the concrete logic is in the code below. Package this code and put the jar into Flume's lib directory; note that the protobuf-java-2.5.0.jar shipped in Flume's lib must be removed (or otherwise excluded) first.

KafkaProtoBufSerializer

    package cn._51doit.test;

    import cn._51doit.proto.UserProto;
    import com.google.protobuf.InvalidProtocolBufferException;
    import com.google.protobuf.util.JsonFormat;
    import org.apache.kafka.common.header.Headers;
    import org.apache.kafka.common.serialization.Serializer;

    import java.util.Map;

    public class KafkaProtoBufSerializer implements Serializer<byte[]> {

        @Override
        public void configure(Map<String, ?> configs, boolean isKey) {

        }

        @Override
        public byte[] serialize(String topic, byte[] data) {
            // convert the data handed from the source to the channel into ProtoBuf binary
            // each record is a JSON string
            String line = new String(data);
            UserProto.User.Builder bean = UserProto.User.newBuilder();
            // use the JsonFormat utility to set the JSON fields onto the bean
            try {
                JsonFormat.parser().merge(line, bean);
            } catch (InvalidProtocolBufferException e) {
                return null;
            }
            return bean.build().toByteArray(); // the ProtoBuf binary
        }

        @Override
        public byte[] serialize(String topic, Headers headers, byte[] data) {
            // delegate to the plain variant so both code paths produce the same bytes
            return serialize(topic, data);
        }

        @Override
        public void close() {

        }
    }

(2) Implementing the Flink-side Kafka deserializer

Note that besides the deserialization step, which turns the protoBuf binary stored in the given Kafka topic into a protoBuf bean, you also have to tell Flink how to serialize that bean (register a custom serializer class); only then can Flink ship these objects over the network while processing them.

DataBeanProto (the generated bean, cross-language)

Generated with the procedure from section 4.1

DataBeansDeserializer, the Flink deserialization schema

    package cn._51doit.flink.day11;

    import org.apache.flink.api.common.serialization.DeserializationSchema;
    import org.apache.flink.api.common.typeinfo.TypeInformation;

    import java.io.IOException;

    /**
     * A custom Flink deserialization schema
     */
    public class DataBeansDeserializer implements DeserializationSchema<DataBeanProto.DataBeans> {

        // deserialize the protobuf binary read from Kafka into a DataBeans bean
        @Override
        public DataBeanProto.DataBeans deserialize(byte[] message) throws IOException {
            return DataBeanProto.DataBeans.parseFrom(message);
        }

        @Override
        public boolean isEndOfStream(DataBeanProto.DataBeans nextElement) {
            return false;
        }

        @Override
        public TypeInformation<DataBeanProto.DataBeans> getProducedType() {
            return TypeInformation.of(DataBeanProto.DataBeans.class);
        }
    }

PBSerializer, the Kryo serializer

    package cn._51doit.flink.day11;

    import com.esotericsoftware.kryo.Kryo;
    import com.esotericsoftware.kryo.Serializer;
    import com.esotericsoftware.kryo.io.Input;
    import com.esotericsoftware.kryo.io.Output;
    import com.google.protobuf.Message;

    import java.lang.reflect.Method;
    import java.util.HashMap;

    public class PBSerializer extends Serializer<Message> {

        /* This cache never clears, but only scales like the number of
         * classes in play, which should not be very large.
         * We can replace with a LRU if we start to see any issues.
         */
        final protected HashMap<Class, Method> methodCache = new HashMap<Class, Method>();

        /**
         * This is slow, so we should cache to avoid killing perf:
         * See: http://www.jguru.com/faq/view.jsp?EID=246569
         */
        protected Method getParse(Class cls) throws Exception {
            Method meth = methodCache.get(cls);
            if (null == meth) {
                meth = cls.getMethod("parseFrom", new Class[]{ byte[].class });
                methodCache.put(cls, meth);
            }
            return meth;
        }

        // serialize: write the protobuf message as a length-prefixed byte array
        @Override
        public void write(Kryo kryo, Output output, Message mes) {
            byte[] ser = mes.toByteArray();
            output.writeInt(ser.length, true);
            output.writeBytes(ser);
        }

        // deserialize: read the length-prefixed bytes and call the class's parseFrom method
        @Override
        public Message read(Kryo kryo, Input input, Class<Message> pbClass) {
            try {
                int size = input.readInt(true);
                byte[] barr = new byte[size];
                input.readBytes(barr);
                return (Message) getParse(pbClass).invoke(null, barr);
            } catch (Exception e) {
                throw new RuntimeException("Could not create " + pbClass, e);
            }
        }
    }

The test class

ProtoBufDemo

    package cn._51doit.flink.day11;

    import cn._51doit.flink.day10.FlinkUtilsV2;
    import org.apache.flink.api.common.functions.FlatMapFunction;
    import org.apache.flink.api.java.utils.ParameterTool;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
    import org.apache.flink.util.Collector;

    public class ProtoBufDemo {

        public static void main(String[] args) throws Exception {

            ParameterTool parameters = ParameterTool.fromPropertiesFile(args[0]);

            DataStream<DataBeanProto.DataBeans> dataBeansStream = FlinkUtilsV2.createKafkaDataStream(parameters, "dataproto", "gid", DataBeansDeserializer.class);
            // register the custom Kryo serializer so the protobuf beans can be shipped over the network
            FlinkUtilsV2.getEnv().getConfig().registerTypeWithKryoSerializer(DataBeanProto.DataBeans.class, PBSerializer.class);
            FlinkUtilsV2.getEnv().getConfig().registerTypeWithKryoSerializer(DataBeanProto.DataBean.class, PBSerializer.class);

            SingleOutputStreamOperator<DataBeanProto.DataBean> dataBeanStream = dataBeansStream.flatMap(
                    new FlatMapFunction<DataBeanProto.DataBeans, DataBeanProto.DataBean>() {
                        @Override
                        public void flatMap(DataBeanProto.DataBeans list, Collector<DataBeanProto.DataBean> out) throws Exception {
                            for (DataBeanProto.DataBean dataBean : list.getDataBeanList()) {
                                out.collect(dataBean);
                            }
                        }
                    });

            dataBeanStream.print();

            FlinkUtilsV2.getEnv().execute();

        }
    }
