在生产中需要将一些数据发到kafka,而且需要做到EXACTLY_ONCE,kafka使用的版本为1.1.0,flink的版本为1.8.0,但是会很经常因为提交事务引起错误,甚至导致任务重启

kafka producer的配置如下

  def getKafkaProducer(kafkaAddr: String, targetTopicName: String, kafkaProducersPoolSize: Int): FlinkKafkaProducer[String] = {
val properties = new Properties()
properties.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, kafkaAddr)
properties.setProperty(ProducerConfig.TRANSACTION_TIMEOUT_CONFIG, 6000 * 6 + "")
// 设置了retries参数,可以在Kafka的Partition发生leader切换时,Flink不重启,而是做5次尝试:
properties.setProperty(ProducerConfig.RETRIES_CONFIG, "5")
properties.setProperty(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, String.valueOf(1048576 * 5))
val serial = new KeyedSerializationSchemaWrapper(new SimpleStringSchema())
//val producer = new FlinkKafkaProducer011[String](targetTopicName, serial, properties, Optional.of(new KafkaProducerPartitioner[String]()), Semantic.EXACTLY_ONCE, kafkaProducersPoolSize)
val producer = new FlinkKafkaProducer[String](targetTopicName, serial, properties, Optional.of(new KafkaProducerPartitioner[String]()), FlinkKafkaProducer.Semantic.EXACTLY_ONCE, kafkaProducersPoolSize)
producer.setWriteTimestampToKafka(true)
producer
}

Flink env如下

    val env = StreamExecutionEnvironment.getExecutionEnvironment
env.enableCheckpointing(60 * 1000 * 1, CheckpointingMode.EXACTLY_ONCE)
val config = env.getCheckpointConfig
//RETAIN_ON_CANCELLATION在job canceled的时候会保留externalized checkpoint state
config.enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION)
//用于指定checkpoint coordinator上一个checkpoint完成之后最小等多久可以出发另一个checkpoint,当指定这个参数时,maxConcurrentCheckpoints的值为1
config.setMinPauseBetweenCheckpoints(3000)
//用于指定运行中的checkpoint最多可以有多少个,如果有设置了minPauseBetweenCheckpoints,则maxConcurrentCheckpoints这个参数就不起作用了(大于1的值不起作用)
config.setMaxConcurrentCheckpoints(1)
//指定checkpoint执行的超时时间(单位milliseconds),超时没完成就会被abort掉
config.setCheckpointTimeout(30000)
//用于指定在checkpoint发生异常的时候,是否应该fail该task,默认为true,如果设置为false,则task会拒绝checkpoint然后继续运行
//https://issues.apache.org/jira/browse/FLINK-11662
config.setFailOnCheckpointingErrors(false)

然后经常会出现事务失效的问题,报错有很多种,大概为以下

java.lang.RuntimeException: Error while confirming checkpoint
at org.apache.flink.runtime.taskmanager.Task$2.run(Task.java:1218)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.flink.util.FlinkRuntimeException: Committing one of transactions failed, logging first encountered failure
at org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction.notifyCheckpointComplete(TwoPhaseCommitSinkFunction.java:296)
at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.notifyCheckpointComplete(AbstractUdfStreamOperator.java:130)
at org.apache.flink.streaming.runtime.tasks.StreamTask.notifyCheckpointComplete(StreamTask.java:684)
at org.apache.flink.runtime.taskmanager.Task$2.run(Task.java:1213)
... 5 more
Caused by: org.apache.kafka.common.errors.ProducerFencedException: Producer attempted an operation with an old epoch. Either there is a newer producer with the same transactionalId, or the producer's transaction has been expired by the broker.
org.apache.flink.streaming.connectors.kafka.FlinkKafkaException: Failed to send data to Kafka: Producer attempted an operation with an old epoch. Either there is a newer producer with the same transactionalId, or the producer's transaction has been expired by the broker.
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.checkErroneous(FlinkKafkaProducer.java:1002)
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.invoke(FlinkKafkaProducer.java:619)
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.invoke(FlinkKafkaProducer.java:97)
at org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction.invoke(TwoPhaseCommitSinkFunction.java:228)
at org.apache.flink.streaming.api.operators.StreamSink.processElement(StreamSink.java:56)
at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:202)
at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:105)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.kafka.common.errors.ProducerFencedException: Producer attempted an operation with an old epoch. Either there is a newer producer with the same transactionalId, or the producer's transaction has been expired by the broker.
org.apache.kafka.common.KafkaException: Cannot perform send because at least one previous transactional or idempotent request has failed with errors.
at org.apache.kafka.clients.producer.internals.TransactionManager.failIfNotReadyForSend(TransactionManager.java:278)
at org.apache.kafka.clients.producer.internals.TransactionManager.maybeAddPartitionToTransaction(TransactionManager.java:263)
at org.apache.kafka.clients.producer.KafkaProducer.doSend(KafkaProducer.java:804)
at org.apache.kafka.clients.producer.KafkaProducer.send(KafkaProducer.java:760)
at org.apache.flink.streaming.connectors.kafka.internal.FlinkKafkaInternalProducer.send(FlinkKafkaInternalProducer.java:105)
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.invoke(FlinkKafkaProducer.java:650)
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.invoke(FlinkKafkaProducer.java:97)
at org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction.invoke(TwoPhaseCommitSinkFunction.java:228)
at org.apache.flink.streaming.api.operators.StreamSink.processElement(StreamSink.java:56)
at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:202)
at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:105)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.kafka.common.errors.ProducerFencedException: Producer attempted an operation with an old epoch. Either there is a newer producer with the same transactionalId, or the producer's transaction has been expired by the broker.
Checkpoint failed: Failed to send data to Kafka: Producer attempted an operation with an old epoch. Either there is a newer producer with the same transactionalId, or the producer's transaction has been expired by the broker.

Checkpoint failed: Could not complete snapshot 11 for operator Sink: data_Sink (2/2).

这些错误基本涉及到两阶段提交、事务、checkpoint。

查看kafka documentation和研究ProducerConfig这个类后发现 kafka producer 在使用EXACTLY_ONCE的时候需要增加一些配置

the transaction timeout must be larger than the checkpoint interval, but smaller than the broker transaction.max.timeout.ms.

在getKafkaProducer增加以下配置后,出现原来的错误减少

    //checkpoint 间隔时间<TRANSACTION_TIMEOUT_CONFIG<kafka transaction.max.timeout.ms (默认900秒)
properties.setProperty(ProducerConfig.TRANSACTION_TIMEOUT_CONFIG, 1000 * 60 * 3 + "")
properties.setProperty(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, "1")
properties.setProperty(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true")

问题得到缓解。

参考:

https://www.cnblogs.com/wangzhuxing/p/10111831.html

http://www.heartthinkdo.com/?p=2040

http://romanmarkunas.com/web/blog/kafka-transactions-in-practice-1-producer/

Flink生产数据到Kafka频繁出现事务失效导致任务重启的更多相关文章

  1. kafka没配置好,导致服务器重启之后,topic丢失,topic里面的消息也丢失

    转,原文:https://blog.csdn.net/zfszhangyuan/article/details/53389916 ----------------------------------- ...

  2. Kafka科普系列 | Kafka中的事务是什么样子的?

    事务,对于大家来说可能并不陌生,比如数据库事务.分布式事务,那么Kafka中的事务是什么样子的呢? 在说Kafka的事务之前,先要说一下Kafka中幂等的实现.幂等和事务是Kafka 0.11.0.0 ...

  3. flink⼿手动维护kafka偏移量量

    flink对接kafka,官方模式方式是自动维护偏移量 但并没有考虑到flink消费kafka过程中,如果出现进程中断后的事情! 如果此时,进程中段: 1:数据可能丢失 从获取了了数据,但是在执⾏行行 ...

  4. java面试记录二:spring加载流程、springmvc请求流程、spring事务失效、synchronized和volatile、JMM和JVM模型、二分查找的实现、垃圾收集器、控制台顺序打印ABC的三种线程实现

    注:部分答案引用网络文章 简答题 1.Spring项目启动后的加载流程 (1)使用spring框架的web项目,在tomcat下,是根据web.xml来启动的.web.xml中负责配置启动spring ...

  5. Mysql引起的spring事务失效

    老项目加新功能,导致出现service调用service的情况..一共2张表有数据的添加删除.然后测试了一下事务,表A和表B,我在表B中抛了异常,但结果发现,表B回滚正常,但是表A并没有回滚.显示事务 ...

  6. spring声明式事务 同一类内方法调用事务失效

    只要避开Spring目前的AOP实现上的限制,要么都声明要事务,要么分开成两个类,要么直接在方法里使用编程式事务 [问题] Spring的声明式事务,我想就不用多介绍了吧,一句话“自从用了Spring ...

  7. SpringMvc配置 导致实事务失效

    SpringMVC回归MVC本质,简简单单的Restful式函数,没有任何基类之后,应该是传统Request-Response框架中最好用的了. Tips 1.事务失效的惨案 Spring MVC最打 ...

  8. Spring component-scan 的逻辑 、单例模式下多实例问题、事务失效

    原创内容,转发请保留:http://www.cnblogs.com/iceJava/p/6930118.html,谢谢 之前遇到该问题,今天查看了下 spring 4.x 的代码 一,先理解下 con ...

  9. spring事务失效情况分析

    详见:http://blog.yemou.net/article/query/info/tytfjhfascvhzxcyt113 <!--[if !supportLists]-->一.&l ...

  10. kafka producer生产数据到kafka异常:Got error produce response with correlation id 16 on topic-partition...Error: NETWORK_EXCEPTION

      kafka producer生产数据到kafka异常:Got error produce response with correlation id 16 on topic-partition... ...

随机推荐

  1. 我的vim配置相关

    谨以此文记录下之前的折腾.(后续可能还会折腾什么) 目标 我的目的很简单,就是希望能有一个启动快速的文本编辑器,可以简单的代码着色,vim键位,简单的文本修改,打开大点的文件不发愁,可以简单的form ...

  2. 1.mysql创建索引

    -- 创建一个普通索引(方式①)create index 索引名 ON 表名 (列名(索引键长度) [ASC|DESC]);-- 创建一个普通索引(方式②)alter table 表名 add ind ...

  3. 批量获取title

    1 import requests 2 from bs4 import BeautifulSoup 3 import pandas as pd 4 from openpyxl import Workb ...

  4. loadrunner---脚本录制常见问题

    一:loadrunner录制脚本时ie浏览器不弹出? 1.IE浏览器取消勾选[启用第三方浏览器扩展]启动IE,从[工具]进入[Internet选项],切到高级,去掉[启用第三方浏览器扩展(需要重启动) ...

  5. MAYA专用卸载工具,完全彻底卸载删除干净maya各种残留注册表和文件的方法和步骤

    maya专用卸载工具,完全彻底卸载删除干净maya各种残留注册表和文件的方法和步骤.如何卸载maya呢?有很多同学想把maya卸载后重新安装,但是发现maya安装到一半就失败了或者显示maya已安装或 ...

  6. BIP拓展js的使用

    __app.define("common_VM_Extend.js", function () {   var selectData = null;   var common_VM ...

  7. REMOTE HOST IDENTIFICATION HAS CHANGED!服务器重置后远程连接不上

    问题: 解决: 本地打开shell,重置key

  8. 远程连接linux桌面

    https://zhuanlan.zhihu.com/p/127265014 https://zhuanlan.zhihu.com/p/93438433

  9. WEB的安全性测试要素【转】

    原文链接:http://www.cnblogs.com/zgqys1980/archive/2009/05/13/1455710.html WEB的安全性测试要素 1.SQL Injection(SQ ...

  10. gorm去重查询 iris框架

    写练习 demo 时遇到需要进行去重查询,gorm没有db.distinct()的写法 // 数据库的表字段 type Pro_location_relation struct { Id int64 ...