话不多说,可以看上篇博文,关于offset存储到zookeeper

https://www.cnblogs.com/niutao/p/10547718.html

本篇博文主要告诉你如何将offset写到Hbase做存储:

最后存储到Hbase的展现形式:

testDirect:co:1552667595000  column=info:0, timestamp=1552667594784, value=66
testDirect:co:1552667595000 column=info:1, timestamp=1552667594784, value=269
testDirect:co:1552667595000 column=info:2, timestamp=1552667594784, value=67
testDirect:co:1552667600000 column=info:0, timestamp=1552667599864, value=66
testDirect:co:1552667600000 column=info:1, timestamp=1552667599864, value=269
testDirect:co:1552667600000 column=info:2, timestamp=1552667599864, value=67
testDirect:co:1552667605000 column=info:0, timestamp=1552667604778, value=66
testDirect:co:1552667605000 column=info:1, timestamp=1552667604778, value=269
testDirect:co:1552667605000 column=info:2, timestamp=1552667604778, value=67
testDirect:co:1552667610000 column=info:0, timestamp=1552667609777, value=66
testDirect:co:1552667610000 column=info:1, timestamp=1552667609777, value=269
版本:
scala:2.11.8
spark:2.11
hbase:1.2.0-cdh5.14.0
遇到的问题:

`java.lang.IllegalStateException: Consumer is not subscribed to any topics or assigned any partitions`
分析原因:

从指定的主题或者分区获取数据,在poll之前,你没有订阅任何主题或分区是不行的,每一次poll,消费者都会尝试使用最后一次消费的offset作为接下来获取数据的start offset,最后一次消费的offset也可以通过seek(TopicPartition, long)设置或者自动设置
通过源码可以找到:
public ConsumerRecords<K, V> poll(long timeout) {
acquire();
try {
if (timeout < 0)
throw new IllegalArgumentException("Timeout must not be negative");
// 如果没有任何订阅,抛出异常
if (this.subscriptions.hasNoSubscriptionOrUserAssignment())
throw new IllegalStateException("Consumer is not subscribed to any topics or assigned any partitions"); // 一直poll新数据直到超时
long start = time.milliseconds();
// 距离超时还剩余多少时间
long remaining = timeout;
do {
// 获取数据,如果自动提交,则进行偏移量自动提交,如果设置offset重置,则进行offset重置
Map<TopicPartition, List<ConsumerRecord<K, V>>> records = pollOnce(remaining);
if (!records.isEmpty()) {
// 再返回结果之前,我们可以进行下一轮的fetch请求,避免阻塞等待
fetcher.sendFetches();
client.pollNoWakeup();
// 如果有拦截器进行拦截,没有直接返回
if (this.interceptors == null)
return new ConsumerRecords<>(records);
else
return this.interceptors.onConsume(new ConsumerRecords<>(records));
} long elapsed = time.milliseconds() - start;
remaining = timeout - elapsed;
} while (remaining > 0); return ConsumerRecords.empty();
} finally {
release();
}
}

解决:

因此,需要订阅当前的topic才能消费,我之前使用的api是:(适用于非新--已经被消费者消费过的)
因此,需要订阅当前的topic才能消费,我之前使用的api是:(适用于非新--已经被消费者消费过的)
`val inputDStream1 = KafkaUtils.createDirectStream[String, String](
ssc,
PreferConsistent,
Assign[String, String](
fromOffsets.keys,kafkaParams,fromOffsets)
)` 修改:(全新的topic,没有被消费者消费过)
`val inputDStream = KafkaUtils.createDirectStream[String, String](
ssc,
PreferConsistent,
Subscribe[String, String](topics, kafkaParams)
)`

完整代码:

package offsetInHbase
import kafka.utils.ZkUtils
import org.apache.hadoop.hbase.filter.PrefixFilter
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.{TableName, HBaseConfiguration}
import org.apache.hadoop.hbase.client.{Scan, Put, ConnectionFactory}
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.ConsumerStrategies._
import org.apache.spark.streaming.kafka010.{OffsetRange, HasOffsetRanges, KafkaUtils}
import org.apache.spark.streaming.kafka010.LocationStrategies._
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkContext, SparkConf}
/**
* Created by angel
*/
object KafkaOffsetsBlogStreamingDriver { def main(args: Array[String]) { if (args.length < 6) {
System.err.println("Usage: KafkaDirectStreamTest " +
"<batch-duration-in-seconds> " +
"<kafka-bootstrap-servers> " +
"<kafka-topics> " +
"<kafka-consumer-group-id> " +
"<hbase-table-name> " +
"<kafka-zookeeper-quorum>")
System.exit(1)
}
//5 cdh1:9092,cdh2:2181,cdh3:2181 testDirect co testDirect cdh1:2181,cdh2:2181,cdh3:2181 val batchDuration = args(0)
val bootstrapServers = args(1).toString
val topicsSet = args(2).toString.split(",").toSet
val consumerGroupID = args(3)
val hbaseTableName = args(4)
val zkQuorum = args(5)
val zkKafkaRootDir = "kafka"
val zkSessionTimeOut = 10000
val zkConnectionTimeOut = 10000 val sparkConf = new SparkConf().setAppName("Kafka-Offset-Management-Blog")
.setMaster("local[4]")//Uncomment this line to test while developing on a workstation
val sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sc, Seconds(batchDuration.toLong))
val topics = topicsSet.toArray
val topic = topics(0) val kafkaParams = Map[String, Object](
"bootstrap.servers" -> bootstrapServers,
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> consumerGroupID,
"auto.offset.reset" -> "latest",
"enable.auto.commit" -> (false: java.lang.Boolean)
) /*
Create a dummy process that simply returns the message as is.
*/
def processMessage(message:ConsumerRecord[String,String]):ConsumerRecord[String,String]={
message
} /*
Save Offsets into HBase
*/
def saveOffsets(
TOPIC_NAME:String,
GROUP_ID:String,
offsetRanges:Array[OffsetRange],
hbaseTableName:String,
batchTime: org.apache.spark.streaming.Time
) ={
val hbaseConf = HBaseConfiguration.create()
hbaseConf.addResource("src/main/resources/hbase-site.xml")
val conn = ConnectionFactory.createConnection(hbaseConf)
val table = conn.getTable(TableName.valueOf(hbaseTableName))
val rowKey = TOPIC_NAME + ":" + GROUP_ID + ":" + String.valueOf(batchTime.milliseconds)
val put = new Put(rowKey.getBytes)
for(offset <- offsetRanges){
put.addColumn(Bytes.toBytes("info"),Bytes.toBytes(offset.partition.toString),
Bytes.toBytes(offset.untilOffset.toString))
}
table.put(put)
conn.close()
} /*
Returns last committed offsets for all the partitions of a given topic from HBase in following cases.
- CASE 1: SparkStreaming job is started for the first time. This function gets the number of topic partitions from
Zookeeper and for each partition returns the last committed offset as 0
- CASE 2: SparkStreaming is restarted and there are no changes to the number of partitions in a topic. Last
committed offsets for each topic-partition is returned as is from HBase.
- CASE 3: SparkStreaming is restarted and the number of partitions in a topic increased. For old partitions, last
committed offsets for each topic-partition is returned as is from HBase as is. For newly added partitions,
function returns last committed offsets as 0
*/
def getLastCommittedOffsets(
TOPIC_NAME:String,
GROUP_ID:String,
hbaseTableName:String,
zkQuorum:String,
zkRootDir:String,
sessionTimeout:Int,
connectionTimeOut:Int
):Map[TopicPartition,Long] ={ val hbaseConf = HBaseConfiguration.create()
hbaseConf.addResource("src/main/resources/hbase-site.xml")
val zkUrl = zkQuorum+"/"+zkRootDir
val zkClientAndConnection = ZkUtils.createZkClientAndConnection(zkUrl,sessionTimeout,connectionTimeOut)
val zkUtils = new ZkUtils(zkClientAndConnection._1, zkClientAndConnection._2,false)
val zKNumberOfPartitionsForTopic = zkUtils.getPartitionsForTopics(Seq(TOPIC_NAME)).get(TOPIC_NAME).toList.head.size //Connect to HBase to retrieve last committed offsets
val conn = ConnectionFactory.createConnection(hbaseConf)
val table = conn.getTable(TableName.valueOf(hbaseTableName))
val startRow = TOPIC_NAME + ":" + GROUP_ID + ":" + String.valueOf(System.currentTimeMillis())
val stopRow = TOPIC_NAME + ":" + GROUP_ID + ":" + 0
val scan = new Scan()
val scanner = table.getScanner(scan.setStartRow(startRow.getBytes).setStopRow(stopRow.getBytes).setReversed(true))
val result = scanner.next()
//Set the number of partitions discovered for a topic in HBase to 0
var hbaseNumberOfPartitionsForTopic = 0
if (result != null){
//If the result from hbase scanner is not null, set number of partitions from hbase to the number of cells
//listCells 获取列族下的列
hbaseNumberOfPartitionsForTopic = result.listCells().size()
} val fromOffsets = collection.mutable.Map[TopicPartition,Long]()
//初始化时候的hbase
if(hbaseNumberOfPartitionsForTopic == 0){
// initialize fromOffsets to beginning
for (partition <- 0 to zKNumberOfPartitionsForTopic-1){
fromOffsets += (new TopicPartition(TOPIC_NAME,partition) -> 0)}
//增加了topic的分区数
} else if(zKNumberOfPartitionsForTopic > hbaseNumberOfPartitionsForTopic){
// handle scenario where new partitions have been added to existing kafka topic
for (partition <- 0 to hbaseNumberOfPartitionsForTopic-1){
val fromOffset = Bytes.toString(result.getValue(Bytes.toBytes("info"),Bytes.toBytes(partition.toString)))
fromOffsets += (new TopicPartition(TOPIC_NAME,partition) -> fromOffset.toLong)}
//将新增的分区也添加上
for (partition <- hbaseNumberOfPartitionsForTopic to zKNumberOfPartitionsForTopic-1){
fromOffsets += (new TopicPartition(TOPIC_NAME,partition) -> 0)}
} else {
//initialize fromOffsets from last run
for (partition <- 0 to hbaseNumberOfPartitionsForTopic-1 ){
val fromOffset = Bytes.toString(result.getValue(Bytes.toBytes("info"),Bytes.toBytes(partition.toString)))
fromOffsets += (new TopicPartition(TOPIC_NAME,partition) -> fromOffset.toLong)}
}
scanner.close()
conn.close()
fromOffsets.toMap
} val fromOffsets= getLastCommittedOffsets(
topic,
consumerGroupID,
hbaseTableName,
zkQuorum,
zkKafkaRootDir,
zkSessionTimeOut,
zkConnectionTimeOut)
//刚开始时候启动,全新的topic会报错
val inputDStream = KafkaUtils.createDirectStream[String, String](
ssc,
PreferConsistent,
Assign[String, String](
fromOffsets.keys,kafkaParams,fromOffsets)
)
//如果报错,则使用下面的api
// val inputDStream = KafkaUtils.createDirectStream[String, String](
// ssc,
// PreferConsistent,
// Subscribe[String, String](topics, kafkaParams)
// ) /*
For each RDD in a DStream apply a map transformation that processes the message.
*/
inputDStream.foreachRDD((rdd,batchTime) => {
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
offsetRanges.foreach(offset => println(offset.topic, offset.partition, offset.fromOffset,offset.untilOffset))
val newRDD = rdd.map(message => processMessage(message))
newRDD.count()
saveOffsets(topic,consumerGroupID,offsetRanges,hbaseTableName,batchTime) //save the offsets to HBase
}) println("Number of messages processed " + inputDStream.count())
ssc.start()
ssc.awaitTermination()
}
}

sparkStreaming消费kafka-1.0.1方式:direct方式(存储offset到Hbase)的更多相关文章

  1. SparkStreaming消费kafka中数据的方式

    有两种:Direct直连方式.Receiver方式 1.Receiver方式: 使用kafka高层次的consumer API来实现,receiver从kafka中获取的数据都保存在spark exc ...

  2. SparkStreaming消费Kafka,手动维护Offset到Mysql

    目录 说明 整体逻辑 offset建表语句 代码实现 说明 当前处理只实现手动维护offset到mysql,只能保证数据不丢失,可能会重复 要想实现精准一次性,还需要将数据提交和offset提交维护在 ...

  3. sparkstreaming消费kafka后bulk到es

    不使用es-hadoop的saveToES,与scala版本冲突问题太多.不使用bulkprocessor,异步提交,es容易oom,速度反而不快.使用BulkRequestBuilder同步提交. ...

  4. Spark Streaming消费Kafka Direct保存offset到Redis,实现数据零丢失和exactly once

    一.概述 上次写这篇文章文章的时候,Spark还是1.x,kafka还是0.8x版本,转眼间spark到了2.x,kafka也到了2.x,存储offset的方式也发生了改变,笔者根据上篇文章和网上文章 ...

  5. sparkStreaming读取kafka的两种方式

    概述 Spark Streaming 支持多种实时输入源数据的读取,其中包括Kafka.flume.socket流等等.除了Kafka以外的实时输入源,由于我们的业务场景没有涉及,在此将不会讨论.本篇 ...

  6. spark-streaming集成Kafka处理实时数据

    在这篇文章里,我们模拟了一个场景,实时分析订单数据,统计实时收益. 场景模拟 我试图覆盖工程上最为常用的一个场景: 1)首先,向Kafka里实时的写入订单数据,JSON格式,包含订单ID-订单类型-订 ...

  7. sparkStreaming消费kafka-1.0.1方式:direct方式(存储offset到zookeeper)-- 2

    参考上篇博文:https://www.cnblogs.com/niutao/p/10547718.html 同样的逻辑,不同的封装 package offsetInZookeeper /** * Cr ...

  8. sparkStreaming消费kafka-1.0.1方式:direct方式(存储offset到zookeeper)

    版本声明: kafka:1.0.1 spark:2.1.0 注意:在使用过程中可能会出现servlet版本不兼容的问题,因此在导入maven的pom文件的时候,需要做适当的排除操作 <?xml ...

  9. sparkStreaming消费kafka-0.8方式:direct方式(存储offset到zookeeper)

    生产中,为了保证kafka的offset的安全性,并且防止丢失数据现象,会手动维护偏移量(offset) 版本:kafka:0.8 其中需要注意的点: 1:获取zookeeper记录的分区偏移量 2: ...

随机推荐

  1. 防火墙iptables的简单使用

    规则定义 # service iptables start # chkconfig iptables on 想让规则生效,则shell命令行下执行 sh /bin/iptables.sh即可 [roo ...

  2. FS 日志空间限定

    一.说明: FS默认安装的log文件,仅仅的限制了每个文件的大小,但是没有限制文件的个数.这种情况下,在FS运行很长时间之后,会出现物理空间不够的情况,导致FS或者mysql 或者其他应用没有空间使用 ...

  3. Python2018-字符串中字符个数统计

    1 编写程序,完成以下要求: 统计字符串中,各个字符的个数 比如:"hello world" 字符串统计的结果为: h:1 e:1 l:3 o:2 d:1 r:1 w:1 prin ...

  4. simulate events

    windows system maintains a msg queue, and any process that supports msg will create an thread that h ...

  5. Servlet随笔

    HttpServlet中的getRequestURL.getRequestURI.getContextPath方法获取的字符串为 jsp文件会被编译成一个Servlet,该Servlet继承自Http ...

  6. [C]内存管理、内存泄露、堆栈

    原文地址:https://www.cnblogs.com/youthshouting/p/4280543.html,转载请注明源地址. 1.内存分配区间:         对于一个C语言程序而言,内存 ...

  7. centos7编译安装lnmp

    1.前言 本文适合于已经对Linux操作系统具有基本操作经验,并且能够在Linux或Windows上通过一键搭建工具或者yum命令行进行环境搭建的读者,阅读本文需具有一定的专业知识,本文不建议初学者阅 ...

  8. Confluence 6 为翻译显示用户界面的键(Key)名称

    这个功能在你使用 Confluence 用户界面为 Confluence 创建翻译的时候会非常有用.当你打开主面板的时候,在你访问的 URL 的最后面添加下面的文字:can add the follo ...

  9. 【层次聚类】python scipy实现

    层次聚类 原理 有一个讲得很清楚的博客:博客地址 主要用于:没有groundtruth,且不知道要分几类的情况 用scipy模块实现聚类 参考函数说明: pdist squareform linkag ...

  10. 20165314 2017-2018-2《Java程序设计》课程总结

    20165314 2017-2018-2<Java程序设计>课程总结 每周作业链接汇总 预备作业1:我期望的师生关系 预备作业2:C语言基础调查和java学习展望 预备作业3:Linux安 ...