Kafka's createDirectStream
The Spark Streaming Kafka API exposes two interfaces for obtaining a stream: createStream and createDirectStream.
createStream is the simpler one: given a topic, a group id, and a ZooKeeper quorum it returns a stream directly. Brokers and offsets are a black box that you never have to manage, but that also means they are out of your control, which is often unacceptable in a real project. Part of the source:
/**
* Create an input stream that pulls messages from Kafka Brokers.
* @param ssc StreamingContext object
* @param zkQuorum Zookeeper quorum (hostname:port,hostname:port,..)
* @param groupId The group id for this consumer
* @param topics Map of (topic_name -> numPartitions) to consume. Each partition is consumed
* in its own thread
* @param storageLevel Storage level to use for storing the received objects
* (default: StorageLevel.MEMORY_AND_DISK_SER_2)
* @return DStream of (Kafka message key, Kafka message value)
*/
def createStream(
    ssc: StreamingContext,
    zkQuorum: String,
    groupId: String,
    topics: Map[String, Int],
    storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
  ): ReceiverInputDStream[(String, String)] = {
  val kafkaParams = Map[String, String](
    "zookeeper.connect" -> zkQuorum, "group.id" -> groupId,
    "zookeeper.connection.timeout.ms" -> "10000")
  createStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topics, storageLevel)
}
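For comparison, a minimal receiver-based call would look roughly like the sketch below; the ZooKeeper quorum, group id and topic name are placeholders, not values from this project:
import org.apache.spark.streaming.kafka.KafkaUtils

// Receiver-based stream: offsets live in ZooKeeper under the given group id,
// so the application never touches them itself. "ssc" is an existing StreamingContext.
val receiverStream = KafkaUtils.createStream(
  ssc,
  "zk1:2181,zk2:2181",    // zkQuorum (placeholder)
  "my-group",             // groupId (placeholder)
  Map("my-topic" -> 1))   // topic -> number of consumer threads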
createDirectStream talks to Kafka directly, and you have to store the offsets yourself; the method's scaladoc explains this quite clearly. Part of the source:
/**
* Create an input stream that directly pulls messages from Kafka Brokers
* without using any receiver. This stream can guarantee that each message
* from Kafka is included in transformations exactly once (see points below).
*
* Points to note:
* - No receivers: This stream does not use any receiver. It directly queries Kafka
* - Offsets: This does not use Zookeeper to store offsets. The consumed offsets are tracked
* by the stream itself. For interoperability with Kafka monitoring tools that depend on
* Zookeeper, you have to update Kafka/Zookeeper yourself from the streaming application.
* You can access the offsets used in each batch from the generated RDDs (see
* [[org.apache.spark.streaming.kafka.HasOffsetRanges]]).
* - Failure Recovery: To recover from driver failures, you have to enable checkpointing
* in the [[StreamingContext]]. The information on consumed offset can be
* recovered from the checkpoint. See the programming guide for details (constraints, etc.).
* - End-to-end semantics: This stream ensures that every records is effectively received and
* transformed exactly once, but gives no guarantees on whether the transformed data are
* outputted exactly once. For end-to-end exactly-once semantics, you have to either ensure
* that the output operation is idempotent, or use transactions to output records atomically.
* See the programming guide for more details.
*
* @param ssc StreamingContext object
* @param kafkaParams Kafka <a href="http://kafka.apache.org/documentation.html#configuration">
* configuration parameters</a>. Requires "metadata.broker.list" or "bootstrap.servers"
* to be set with Kafka broker(s) (NOT zookeeper servers) specified in
* host1:port1,host2:port2 form.
* @param fromOffsets Per-topic/partition Kafka offsets defining the (inclusive)
* starting point of the stream
* @param messageHandler Function for translating each message and metadata into the desired type
* @tparam K type of Kafka message key
* @tparam V type of Kafka message value
* @tparam KD type of Kafka message key decoder
* @tparam VD type of Kafka message value decoder
* @tparam R type returned by messageHandler
* @return DStream of R
*/
def createDirectStream[
    K: ClassTag,
    V: ClassTag,
    KD <: Decoder[K]: ClassTag,
    VD <: Decoder[V]: ClassTag,
    R: ClassTag] (
      ssc: StreamingContext,
      kafkaParams: Map[String, String],
      fromOffsets: Map[TopicAndPartition, Long],
      messageHandler: MessageAndMetadata[K, V] => R
  ): InputDStream[R] = {
  val cleanedHandler = ssc.sc.clean(messageHandler)
  new DirectKafkaInputDStream[K, V, KD, VD, R](
    ssc, kafkaParams, fromOffsets, cleanedHandler)
}
and the overload that only takes a set of topic names:
/**
* Create an input stream that directly pulls messages from Kafka Brokers
* without using any receiver. This stream can guarantee that each message
* from Kafka is included in transformations exactly once (see points below).
*
* Points to note:
* - No receivers: This stream does not use any receiver. It directly queries Kafka
* - Offsets: This does not use Zookeeper to store offsets. The consumed offsets are tracked
* by the stream itself. For interoperability with Kafka monitoring tools that depend on
* Zookeeper, you have to update Kafka/Zookeeper yourself from the streaming application.
* You can access the offsets used in each batch from the generated RDDs (see
* [[org.apache.spark.streaming.kafka.HasOffsetRanges]]).
* - Failure Recovery: To recover from driver failures, you have to enable checkpointing
* in the [[StreamingContext]]. The information on consumed offset can be
* recovered from the checkpoint. See the programming guide for details (constraints, etc.).
* - End-to-end semantics: This stream ensures that every records is effectively received and
* transformed exactly once, but gives no guarantees on whether the transformed data are
* outputted exactly once. For end-to-end exactly-once semantics, you have to either ensure
* that the output operation is idempotent, or use transactions to output records atomically.
* See the programming guide for more details.
*
* @param ssc StreamingContext object
* @param kafkaParams Kafka <a href="http://kafka.apache.org/documentation.html#configuration">
* configuration parameters</a>. Requires "metadata.broker.list" or "bootstrap.servers"
* to be set with Kafka broker(s) (NOT zookeeper servers), specified in
* host1:port1,host2:port2 form.
* If not starting from a checkpoint, "auto.offset.reset" may be set to "largest" or "smallest"
* to determine where the stream starts (defaults to "largest")
* @param topics Names of the topics to consume
* @tparam K type of Kafka message key
* @tparam V type of Kafka message value
* @tparam KD type of Kafka message key decoder
* @tparam VD type of Kafka message value decoder
* @return DStream of (Kafka message key, Kafka message value)
*/
def createDirectStream[
    K: ClassTag,
    V: ClassTag,
    KD <: Decoder[K]: ClassTag,
    VD <: Decoder[V]: ClassTag] (
      ssc: StreamingContext,
      kafkaParams: Map[String, String],
      topics: Set[String]
  ): InputDStream[(K, V)] = {
  val messageHandler = (mmd: MessageAndMetadata[K, V]) => (mmd.key, mmd.message)
  val kc = new KafkaCluster(kafkaParams)
  val fromOffsets = getFromOffsets(kc, kafkaParams, topics)
  new DirectKafkaInputDStream[K, V, KD, VD, (K, V)](
    ssc, kafkaParams, fromOffsets, messageHandler)
}
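For reference, calling this simpler overload might look like the following sketch; it resolves the starting offsets on its own from auto.offset.reset, and the broker list and topic name are placeholders:
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Direct stream without explicit offsets: starts at "largest" or "smallest"
// according to auto.offset.reset. "ssc" is an existing StreamingContext;
// broker addresses and the topic are placeholders for illustration.
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("my-topic"))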
What this project needs, however, is manual control over the offsets, and that is exactly what the first overload adds with its two extra parameters: fromOffsets: Map[TopicAndPartition, Long] and messageHandler: MessageAndMetadata[K, V] => R.
The approach for building fromOffsets is roughly (implemented in the getOffset helper below):
1. Connect to ZooKeeper.
2. Get the topics and their partitions.
3. For every partition of every topic, read the offset stored in ZooKeeper (path: /consumers/[group id]/offsets/[topic]/[0 ... N]).
4. If that path does not exist yet, fall back to the latest offset held by the partition leader.
The corresponding code follows (useful sources to study: kafka.utils.ZkUtils, org.apache.spark.streaming.kafka.KafkaUtils, kafka.tools.GetOffsetShell, and the classes that call them):
private def getOffset = {
  val fromOffset: mutable.Map[TopicAndPartition, Long] = mutable.Map()
  val (zkClient, zkConnection) = ZkUtils.createZkClientAndConnection(kafkaZkQuorum, kafkaZkSessionTimeout, kafkaZkSessionTimeout)
  val zkUtil = new ZkUtils(zkClient, zkConnection, false)
  zkUtil.getPartitionsForTopics(kafkaTopic.split(",").toSeq)
    .foreach({ topic2Partition =>
      val topic = topic2Partition._1
      val partitions = topic2Partition._2
      val topicDirs = new ZKGroupTopicDirs(groupId, topic)
      partitions.foreach(partition => {
        val zkPath = s"${topicDirs.consumerOffsetDir}/$partition"
        zkUtil.makeSurePersistentPathExists(zkPath)
        val untilOffset = zkUtil.zkClient.readData[String](zkPath)
        val tp = TopicAndPartition(topic, partition)
        val offset = {
          if (null == untilOffset)
            getLatestLeaderOffsets(tp, zkUtil)
          else untilOffset.toLong
        }
        fromOffset += (tp -> offset)
      })
    })
  zkUtil.close()
  fromOffset.toMap
}
For messageHandler, simply reuse the function the second overload builds internally (here with the concrete String types this project uses):
val messageHandler = (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message)
Next comes getLatestLeaderOffsets, which asks the partition leader for its most recent offset:
private def getLatestLeaderOffsets(tp: TopicAndPartition, zkUtil: ZkUtils): Long = {
  try {
    val brokerId = zkUtil.getLeaderForPartition(tp.topic, tp.partition).get
    val brokerInfoString = zkUtil.readDataMaybeNull(s"${ZkUtils.BrokerIdsPath}/$brokerId")._1.get
    val brokerInfo = Json.parseFull(brokerInfoString).get.asInstanceOf[Map[String, Any]]
    val host = brokerInfo("host").asInstanceOf[String]
    val port = brokerInfo("port").asInstanceOf[Int]
    // Ask the partition leader directly for its latest offset.
    val consumer = new SimpleConsumer(host, port, 10000, 100000, "getLatestLeaderOffsets")
    val request = OffsetRequest(Map(tp -> PartitionOffsetRequestInfo(OffsetRequest.LatestTime, 1)))
    val offsets = consumer.getOffsetsBefore(request).partitionErrorAndOffsets(tp).offsets
    consumer.close()
    offsets.head
  } catch {
    case e: Throwable => throw new Exception("Failed to get the latest offset for " + tp, e)
  }
}
Finally, the call itself:
KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](ssc,
kafkaParams, getOffset, (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))
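Because the offsets are managed manually, each batch's end offsets also have to be written back to ZooKeeper, otherwise the next getOffset call (and any monitoring tool) will not see any progress; the scaladoc above points to HasOffsetRanges for this. A possible sketch, assuming stream is the DStream returned by the call above, and that groupId plus a live ZkUtils instance (zkUtil, created the same way as in getOffset) are in scope:
import kafka.utils.ZKGroupTopicDirs
import org.apache.spark.streaming.kafka.HasOffsetRanges

stream.foreachRDD { rdd =>
  // Only the RDDs produced directly by the direct stream carry the offset
  // ranges consumed in this batch, so read them before any transformation.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // ... business logic on rdd goes here ...

  // Persist the end offset of every partition back to
  // /consumers/[group id]/offsets/[topic]/[partition].
  offsetRanges.foreach { or =>
    val topicDirs = new ZKGroupTopicDirs(groupId, or.topic)
    zkUtil.updatePersistentPath(s"${topicDirs.consumerOffsetDir}/${or.partition}", or.untilOffset.toString)
  }
}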
Since the job has to recover from failures and run 24/7, checkpointing must be enabled, and the business logic has to live inside the function that creates the checkpointed StreamingContext. The code:
def main(args: Array[String]): Unit = {
  val run = gatewayIsEnable || urlAnalysIsEnable
  if (run) {
    val ssc = StreamingContext.getOrCreate(checkpointDir, createStreamingContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}

def createStreamingContext() = {
  val duration = SysConfig.duration(2)
  val sparkConf = new SparkConf().setAppName("cmhi")
  val ssc = new StreamingContext(sparkConf, Seconds(duration))
  ssc.checkpoint(checkpointDir)
  Osgi.init(ssc, debug)
  ssc
}