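Environment:

Kafka 0.10

Spark 2.1.0

ZooKeeper 3.4.5-cdh5.14.0

On our company's Alibaba Cloud test machine, consumption stopped before the October 1 (National Day) holiday. After the holiday, using Spark Streaming to consume Kafka again under the same consumer group failed with the following error: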

As I regularly kill the servers running Kafka and the producers feeding it (yes, just for fun), things sometimes go a bit crazy, not entirely sure why but I got the error:

kafka.common.OffsetOutOfRangeError: FetchResponse(topic='my_messages', partition=0, error=1, highwaterMark=-1, messages=)
To fix it I added the “seek” setting: consumer.seek(0,2)

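Cause:

Kafka periodically cleans up its logs.

If a job has consumed a topic before, the offset for that topic is stored in ZooKeeper, and on startup we normally read that offset to resume from where the previous run stopped. Kafka, however, deletes old log data on a schedule. If cleanup runs, say, once a day and the stored offset dates from two days ago, that offset no longer falls inside the range of data Kafka still holds, so consuming from it fails with OffsetOutOfRangeException.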
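To make the cause concrete, here is a minimal sketch in plain Scala with no Kafka dependency. `PartitionLog`, `cleanUpTo`, and `isValid` are hypothetical names invented for this illustration, and the offsets are made-up numbers; the sketch only models how retention moves the earliest offset past a stored one.

```scala
object RetentionExample {
  // Hypothetical model of one partition: offsets from `earliest` up to `latest` can be requested.
  final case class PartitionLog(earliest: Long, latest: Long) {
    // Retention cleanup drops the oldest records, which moves `earliest` forward
    // (never backward, and never past `latest`).
    def cleanUpTo(newEarliest: Long): PartitionLog =
      copy(earliest = math.min(math.max(earliest, newEarliest), latest))

    // A stored offset is usable only while it still lies inside the retained range.
    def isValid(offset: Long): Boolean = offset >= earliest && offset <= latest
  }

  def main(args: Array[String]): Unit = {
    val storedOffset = 1200L                     // offset saved in ZooKeeper before the holiday
    val before = PartitionLog(earliest = 1000L, latest = 5000L)
    val after = before.cleanUpTo(3000L)          // retention removed everything below 3000

    println(before.isValid(storedOffset))        // true
    println(after.isValid(storedOffset))         // false: this is the out-of-range case
  }
}
```

The stored offset itself never changes; it becomes invalid only because the retained range moved away from it.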

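Fix:

When zk_offset < earliest_offset, zk_offset must be corrected to a valid value. (The code below also corrects the case zk_offset > latest_offset.)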

Complete code before the fix:

https://www.cnblogs.com/niutao/p/10547831.html

Key code after the fix:

/**
 * Returns the earliest (lowest) available offsets, taking new partitions into account.
 *
 * @param kafkaParams Kafka client configuration
 * @param topics      topics whose offsets should be fetched
 */
def getEarliestOffsets(kafkaParams: Map[String, Object], topics: Iterable[String]): Map[TopicPartition, Long] = {
  val newKafkaParams = mutable.Map[String, Object]()
  newKafkaParams ++= kafkaParams
  newKafkaParams.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
  val consumer = new KafkaConsumer[String, Array[Byte]](newKafkaParams)
  consumer.subscribe(topics)
  // poll once so that partitions get assigned to this consumer
  try {
    consumer.poll(0)
  } catch {
    case ex: NoOffsetForPartitionException =>
      log.warn(s"consumer topic partition offset not found:${ex.partition()}")
  }
  val parts = consumer.assignment().toSet
  // pause fetching, then move to the beginning and read the resulting positions
  consumer.pause(parts)
  consumer.seekToBeginning(parts)
  val offsets = parts.map(tp => tp -> consumer.position(tp)).toMap
  consumer.unsubscribe()
  consumer.close()
  offsets
}
/**
 * Returns the latest (highest) available offsets, taking new partitions into account.
 *
 * @param kafkaParams Kafka client configuration
 * @param topics      topics whose offsets should be fetched
 */
def getLatestOffsets(kafkaParams: Map[String, Object], topics: Iterable[String]): Map[TopicPartition, Long] = {
  val newKafkaParams = mutable.Map[String, Object]()
  newKafkaParams ++= kafkaParams
  newKafkaParams.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest")
  val consumer = new KafkaConsumer[String, Array[Byte]](newKafkaParams)
  consumer.subscribe(topics)
  // poll once so that partitions get assigned to this consumer
  try {
    consumer.poll(0)
  } catch {
    case ex: NoOffsetForPartitionException =>
      log.warn(s"consumer topic partition offset not found:${ex.partition()}")
  }
  val parts = consumer.assignment().toSet
  // pause fetching, then move to the end and read the resulting positions
  consumer.pause(parts)
  consumer.seekToEnd(parts)
  val offsets = parts.map(tp => tp -> consumer.position(tp)).toMap
  consumer.unsubscribe()
  consumer.close()
  offsets
}
val earliestOffsets = getEarliestOffsets(kafkaParams, topics)
val latestOffsets = getLatestOffsets(kafkaParams, topics)
for ((k, v) <- topicPartOffsetMap.toMap) {
  val current = v
  val earliest = earliestOffsets(k)
  val latest = latestOffsets(k)
  // Reset to the earliest available offset when the stored offset is out of range.
  if (current > latest || current < earliest) {
    log.warn("Correcting offset: " + current + " -> " + earliest)
    topicPartOffsetMap.put(k, earliest)
  }
}
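The correction loop above assumes every stored partition also appears in both the earliest and the latest map. As a hedged sketch only, and not the author's code, the same rule can be written as a pure function that leaves an offset untouched when its bounds are unknown. `OffsetCorrection` and `correctOffsets` are hypothetical names, and the type parameter `K` stands in for `TopicPartition` so the example runs without Kafka on the classpath.

```scala
object OffsetCorrection {
  // `K` stands in for TopicPartition (an assumption made to keep the sketch dependency-free).
  def correctOffsets[K](stored: Map[K, Long],
                        earliest: Map[K, Long],
                        latest: Map[K, Long]): Map[K, Long] =
    stored.map { case (partition, current) =>
      val corrected = (earliest.get(partition), latest.get(partition)) match {
        // out of range: fall back to the earliest available offset
        case (Some(lo), Some(hi)) if current < lo || current > hi => lo
        // in range, or bounds unknown for this partition: keep the stored offset
        case _ => current
      }
      partition -> corrected
    }
}
```

For example, with stored offsets `Map("t-0" -> 5L, "t-1" -> 50L)` and bounds 10 to 100 for both partitions, the result is `Map("t-0" -> 10L, "t-1" -> 50L)`. Whether to keep or reset an offset whose bounds are unknown is a design choice; this sketch keeps it.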

Complete code, ready to use as-is:

import kafka.utils.{ZKGroupTopicDirs, ZkUtils}
import org.apache.kafka.clients.consumer.{Consumer, ConsumerConfig, ConsumerRecord, KafkaConsumer, NoOffsetForPartitionException}
import org.apache.kafka.common.TopicPartition
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, HasOffsetRanges, KafkaUtils}
import org.apache.zookeeper.data.Stat
import org.slf4j.LoggerFactory

import scala.collection.JavaConversions._
import scala.collection.mutable
import scala.reflect.ClassTag
import scala.util.Try

/**
 * Utility class for Kafka connections and offset management.
 *
 * @param zkHosts     ZooKeeper address
 * @param kafkaParams Kafka startup parameters
 */
class KafkaManager(zkHosts: String, kafkaParams: Map[String, Object]) extends Serializable {
  // Logger obtained through slf4j (Logback backend)
  @transient private lazy val log = LoggerFactory.getLogger(getClass)
  // Client and connection needed to build the ZkUtils object
  val (zkClient, zkConnection) = ZkUtils.createZkClientAndConnection(zkHosts, 10000, 10000)
  // zkClient.setZkSerializer(new MyZkSerializer())
  // ZkUtils object used to access ZooKeeper
  val zkUtils = new ZkUtils(zkClient, zkConnection, false)

  /**
   * Returns the earliest (lowest) available offsets, taking new partitions into account.
   *
   * @param kafkaParams Kafka client configuration
   * @param topics      topics whose offsets should be fetched
   */
  def getEarliestOffsets(kafkaParams: Map[String, Object], topics: Iterable[String]): Map[TopicPartition, Long] = {
    val newKafkaParams = mutable.Map[String, Object]()
    newKafkaParams ++= kafkaParams
    newKafkaParams.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
    val consumer = new KafkaConsumer[String, Array[Byte]](newKafkaParams)
    consumer.subscribe(topics)
    // poll once so that partitions get assigned to this consumer
    try {
      consumer.poll(0)
    } catch {
      case ex: NoOffsetForPartitionException =>
        log.warn(s"consumer topic partition offset not found:${ex.partition()}")
    }
    val parts = consumer.assignment().toSet
    // pause fetching, then move to the beginning and read the resulting positions
    consumer.pause(parts)
    consumer.seekToBeginning(parts)
    val offsets = parts.map(tp => tp -> consumer.position(tp)).toMap
    consumer.unsubscribe()
    consumer.close()
    offsets
  }

  /**
   * Returns the latest (highest) available offsets, taking new partitions into account.
   *
   * @param kafkaParams Kafka client configuration
   * @param topics      topics whose offsets should be fetched
   */
  def getLatestOffsets(kafkaParams: Map[String, Object], topics: Iterable[String]): Map[TopicPartition, Long] = {
    val newKafkaParams = mutable.Map[String, Object]()
    newKafkaParams ++= kafkaParams
    newKafkaParams.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest")
    val consumer = new KafkaConsumer[String, Array[Byte]](newKafkaParams)
    consumer.subscribe(topics)
    // poll once so that partitions get assigned to this consumer
    try {
      consumer.poll(0)
    } catch {
      case ex: NoOffsetForPartitionException =>
        log.warn(s"consumer topic partition offset not found:${ex.partition()}")
    }
    val parts = consumer.assignment().toSet
    // pause fetching, then move to the end and read the resulting positions
    consumer.pause(parts)
    consumer.seekToEnd(parts)
    val offsets = parts.map(tp => tp -> consumer.position(tp)).toMap
    consumer.unsubscribe()
    consumer.close()
    offsets
  }

  /**
   * Returns the consumer's current offsets.
   *
   * @param consumer   the consumer
   * @param partitions topic partitions
   * @return the current position of each partition
   */
  def getCurrentOffsets(consumer: Consumer[_, _], partitions: Set[TopicPartition]): Map[TopicPartition, Long] = {
    partitions.map(tp => tp -> consumer.position(tp)).toMap
  }

  /**
   * Reads the Kafka offsets stored in ZooKeeper.
   *
   * @param topics  Kafka topics
   * @param groupId Kafka group ID
   * @return a Map[TopicPartition, Long] with the offset of every partition of every topic; if a partition has no stored offset yet, its earliest available offset is used
   */
  def readOffsets(topics: Seq[String], groupId: String): Map[TopicPartition, Long] = {
    val topicPartOffsetMap = collection.mutable.HashMap.empty[TopicPartition, Long]
    val partitionMap = zkUtils.getPartitionsForTopics(topics)
    // /consumers/<groupId>/offsets/<topic>/
    partitionMap.foreach(topicPartitions => {
      val zkGroupTopicDirs = new ZKGroupTopicDirs(groupId, topicPartitions._1)
      topicPartitions._2.foreach(partition => {
        val offsetPath = zkGroupTopicDirs.consumerOffsetDir + "/" + partition
        val tryGetKafkaOffset = Try {
          val offsetStatTuple = zkUtils.readData(offsetPath)
          if (offsetStatTuple != null) {
            log.info("Kafka offset details: topic:{}, partition:{}, offset:{}, ZK node path:{}", Seq[AnyRef](topicPartitions._1, partition.toString, offsetStatTuple._1, offsetPath): _*)
            topicPartOffsetMap.put(new TopicPartition(topicPartitions._1, Integer.valueOf(partition)), offsetStatTuple._1.toLong)
          }
        }
        if (tryGetKafkaOffset.isFailure) {
          // http://kafka.apache.org/0110/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html
          val consumer = new KafkaConsumer[String, Object](kafkaParams)
          val partitionList = List(new TopicPartition(topicPartitions._1, partition))
          consumer.assign(partitionList)
          val minAvailableOffset = consumer.beginningOffsets(partitionList).values.head
          consumer.close()
          log.warn("Kafka offset details: no previous ZK node:{}, topic:{}, partition:{}, ZK node path:{}, using earliest available offset:{}", Seq[AnyRef](tryGetKafkaOffset.failed.get.getMessage, topicPartitions._1, partition.toString, offsetPath, minAvailableOffset): _*)
          topicPartOffsetMap.put(new TopicPartition(topicPartitions._1, Integer.valueOf(partition)), minAvailableOffset)
        }
      })
    })
    // Handle data in Kafka that was lost or expired before it could be consumed:
    // "Offsets out of range with no configured reset policy for partition"
    // Fetch the earliest and latest offsets and correct any stored offset outside that range.
    val earliestOffsets = getEarliestOffsets(kafkaParams, topics)
    val latestOffsets = getLatestOffsets(kafkaParams, topics)
    for ((k, v) <- topicPartOffsetMap.toMap) {
      val current = v
      val earliest = earliestOffsets(k)
      val latest = latestOffsets(k)
      if (current > latest || current < earliest) {
        log.warn("Correcting offset: " + current + " -> " + earliest)
        topicPartOffsetMap.put(k, earliest)
      }
    }
    topicPartOffsetMap.toMap
  }

  /**
   * Wraps createDirectStream with support for stored Kafka offsets; used to create the Kafka stream.
   *
   * @param ssc    Spark Streaming context
   * @param topics Kafka topics
   * @tparam K Kafka message key type
   * @tparam V Kafka message value type
   * @return the Kafka stream
   */
  def createDirectStream[K: ClassTag, V: ClassTag](ssc: StreamingContext, topics: Seq[String]): InputDStream[ConsumerRecord[K, V]] = {
    val groupId = kafkaParams("group.id").toString
    val storedOffsets: Map[TopicPartition, Long] = readOffsets(topics, groupId)
    // val storedOffsets: Map[TopicPartition, Long] = getCurrentOffset(kafkaParams , topics)
    log.info("Kafka offset summary (format: (topic, partition, offset)): {}", storedOffsets.map(off => (off._1.topic, off._1.partition(), off._2)))
    val kafkaStream = KafkaUtils.createDirectStream[K, V](ssc, PreferConsistent, ConsumerStrategies.Subscribe[K, V](topics, kafkaParams, storedOffsets))
    kafkaStream
  }

  /**
   * Saves the consumed Kafka offsets to ZooKeeper.
   *
   * @param rdd            the Spark Streaming Kafka RDD, RDD[ConsumerRecord[K, V]]
   * @param storeEndOffset true = store the end offset, false = store the start offset
   */
  def persistOffsets[K, V](rdd: RDD[ConsumerRecord[K, V]], storeEndOffset: Boolean = true): Unit = {
    val groupId = kafkaParams("group.id").toString
    val offsetsList = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    offsetsList.foreach(or => {
      val zkGroupTopicDirs = new ZKGroupTopicDirs(groupId, or.topic)
      val offsetPath = zkGroupTopicDirs.consumerOffsetDir + "/" + or.partition
      val offsetVal = if (storeEndOffset) or.untilOffset else or.fromOffset
      zkUtils.updatePersistentPath(offsetPath, offsetVal + "" /*, JavaConversions.bufferAsJavaList(acls)*/)
      log.debug("Saved Kafka offset details: topic:{}, partition:{}, offset:{}, ZK node path:{}", Seq[AnyRef](or.topic, or.partition.toString, offsetVal.toString, offsetPath): _*)
    })
  }
}
