storm-kafka Source Walkthrough: KafkaSpout
from: http://blog.csdn.net/wzhg0508/article/details/40903919
(5) storm-kafka Source Walkthrough: KafkaSpout
Now we turn to the KafkaSpout source itself.
At startup, the open method performs some initialization:
- ........................
- _state = new ZkState(stateConf);
- _connections = new DynamicPartitionConnections(_spoutConfig, KafkaUtils.makeBrokerReader(conf, _spoutConfig));
- // using TransactionalState like this is a hack
- int totalTasks = context.getComponentTasks(context.getThisComponentId()).size();
- if (_spoutConfig.hosts instanceof StaticHosts) {
- _coordinator = new StaticCoordinator(_connections, conf, _spoutConfig, _state, context.getThisTaskIndex(), totalTasks, _uuid);
- } else {
- _coordinator = new ZkCoordinator(_connections, conf, _spoutConfig, _state, context.getThisTaskIndex(), totalTasks, _uuid);
- }
- ............
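Some code before and after this excerpt is omitted, and the metrics-related parts are not covered for now. The main work here is to initialize the ZooKeeper connection (ZkState) and to map Kafka partitions to brokers (by initializing DynamicPartitionConnections). The DynamicPartitionConnections constructor takes a brokerReader. Since we use ZkHosts, the KafkaUtils code shows that a ZkBrokerReader is used, so let's look at the ZkBrokerReader constructor: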
- public ZkBrokerReader(Map conf, String topic, ZkHosts hosts) {
- try {
- reader = new DynamicBrokersReader(conf, hosts.brokerZkStr, hosts.brokerZkPath, topic);
- cachedBrokers = reader.getBrokerInfo();
- lastRefreshTimeMs = System.currentTimeMillis();
- refreshMillis = hosts.refreshFreqSecs * 1000L;
- } catch (java.net.SocketTimeoutException e) {
- LOG.warn("Failed to update brokers", e);
- }
- }
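Note the refreshMillis field: it is the interval at which the partition information stored in zk is refreshed. It is derived from hosts.refreshFreqSecs and checked on every call to getCurrentBrokers: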
- //ZkBrokerReader
- @Override
- public GlobalPartitionInformation getCurrentBrokers() {
- long currTime = System.currentTimeMillis();
- if (currTime > lastRefreshTimeMs + refreshMillis) { // more than refreshMillis has elapsed since the last refresh
- try {
- LOG.info("brokers need refreshing because " + refreshMillis + "ms have expired");
- cachedBrokers = reader.getBrokerInfo();
- lastRefreshTimeMs = currTime;
- } catch (java.net.SocketTimeoutException e) {
- LOG.warn("Failed to update brokers", e);
- }
- }
- return cachedBrokers;
- }
- // Below is the DynamicBrokersReader code that it calls
- /**
- * Get all partitions with their current leaders
- */
- public GlobalPartitionInformation getBrokerInfo() throws SocketTimeoutException {
- GlobalPartitionInformation globalPartitionInformation = new GlobalPartitionInformation();
- try {
- int numPartitionsForTopic = getNumPartitions();
- String brokerInfoPath = brokerPath();
- for (int partition = 0; partition < numPartitionsForTopic; partition++) {
- int leader = getLeaderFor(partition);
- String path = brokerInfoPath + "/" + leader;
- try {
- byte[] brokerData = _curator.getData().forPath(path);
- Broker hp = getBrokerHost(brokerData);
- globalPartitionInformation.addPartition(partition, hp);
- } catch (org.apache.zookeeper.KeeperException.NoNodeException e) {
- LOG.error("Node {} does not exist ", path);
- }
- }
- } catch (SocketTimeoutException e) {
- throw e;
- } catch (Exception e) {
- throw new RuntimeException(e);
- }
- LOG.info("Read partition info from zookeeper: " + globalPartitionInformation);
- return globalPartitionInformation;
- }
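The refresh-on-read pattern used by getCurrentBrokers can be isolated in a few lines. The following is only a minimal sketch of that pattern, not storm-kafka code: the class name TimedCache, the injectable clock, and the choice to catch RuntimeException (the original catches SocketTimeoutException) are all assumptions made for illustration.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical helper: serves a cached value and reloads it once refreshMillis has elapsed. */
final class TimedCache<T> {
    private final Supplier<T> loader;   // stands in for reader.getBrokerInfo()
    private final LongSupplier clock;   // stands in for System.currentTimeMillis()
    private final long refreshMillis;
    private T cached;
    private long lastRefreshMs;

    TimedCache(Supplier<T> loader, LongSupplier clock, long refreshMillis) {
        this.loader = loader;
        this.clock = clock;
        this.refreshMillis = refreshMillis;
        this.cached = loader.get();          // initial load, as in the constructor above
        this.lastRefreshMs = clock.getAsLong();
    }

    T get() {
        long now = clock.getAsLong();
        if (now > lastRefreshMs + refreshMillis) {
            try {
                cached = loader.get();
                lastRefreshMs = now;
            } catch (RuntimeException e) {
                // Assumption: on a failed reload keep serving the stale value,
                // mirroring the warn-and-continue branch in the original.
            }
        }
        return cached;
    }

    public static void main(String[] args) {
        long[] now = {0L};
        int[] loads = {0};
        TimedCache<Integer> cache = new TimedCache<>(() -> ++loads[0], () -> now[0], 1000L);
        System.out.println(cache.get());  // value from the initial load
        now[0] = 1001L;                   // just past the refresh interval
        System.out.println(cache.get());  // value from the reload
    }
}
```

Passing the clock in as a parameter is purely a testing convenience; the original reads System.currentTimeMillis() directly.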
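GlobalPartitionInformation is an iterable class that stores the mapping between partitions and brokers. DynamicPartitionConnections maintains the relationship between Kafka consumers and partitions, that is, which partitions each consumer reads. This ConnectionInfo is initialized and updated from storm.kafka.ZkCoordinator. One point worth mentioning is that a KafkaSpout contains a SimpleConsumer; as the class below shows, each ConnectionInfo pairs one SimpleConsumer with the set of partitions read through it.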
- //storm.kafka.DynamicPartitionConnections
- static class ConnectionInfo {
- SimpleConsumer consumer;
- Set<Integer> partitions = new HashSet();
- public ConnectionInfo(SimpleConsumer consumer) {
- this.consumer = consumer;
- }
- }
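To make the bookkeeping concrete, here is a simplified model of how such a registry could behave. It is a sketch under assumptions, not the DynamicPartitionConnections implementation: ConnectionRegistry and its methods are hypothetical names, brokers are plain strings, and a string stands in for the real SimpleConsumer so the example runs without Kafka on the classpath.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical, simplified model of a broker -> (consumer, partitions) registry. */
final class ConnectionRegistry {
    static final class ConnectionInfo {
        final String consumer;                        // stand-in for SimpleConsumer
        final Set<Integer> partitions = new HashSet<>();
        ConnectionInfo(String consumer) { this.consumer = consumer; }
    }

    private final Map<String, ConnectionInfo> connections = new HashMap<>();

    /** Reuses the broker's existing consumer or creates one, then records the partition. */
    String register(String broker, int partition) {
        ConnectionInfo info =
            connections.computeIfAbsent(broker, b -> new ConnectionInfo("consumer@" + b));
        info.partitions.add(partition);
        return info.consumer;
    }

    /** Forgets the partition; drops the connection once no partition uses it. */
    void unregister(String broker, int partition) {
        ConnectionInfo info = connections.get(broker);
        if (info == null) {
            return;
        }
        info.partitions.remove(partition);
        if (info.partitions.isEmpty()) {
            connections.remove(broker);               // a real client would also close the consumer here
        }
    }

    int openConnections() { return connections.size(); }

    public static void main(String[] args) {
        ConnectionRegistry registry = new ConnectionRegistry();
        registry.register("broker1:9092", 0);
        registry.register("broker1:9092", 1);         // same broker, same consumer
        registry.register("broker2:9092", 2);
        System.out.println(registry.openConnections()); // 2
    }
}
```

The point of the model is that partitions served by the same broker share one connection, and a connection only goes away when its last partition is unregistered.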
Next, look at the ZkCoordinator class, starting with its constructor:
- //storm.kafka.ZkCoordinator
- public ZkCoordinator(DynamicPartitionConnections connections, Map stormConf, SpoutConfig spoutConfig, ZkState state, int taskIndex, int totalTasks, String topologyInstanceId, DynamicBrokersReader reader) {
- _spoutConfig = spoutConfig;
- _connections = connections;
- _taskIndex = taskIndex;
- _totalTasks = totalTasks;
- _topologyInstanceId = topologyInstanceId;
- _stormConf = stormConf;
- _state = state;
- ZkHosts brokerConf = (ZkHosts) spoutConfig.hosts;
- _refreshFreqMs = brokerConf.refreshFreqSecs * 1000;
- _reader = reader;
- }
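_refreshFreqMs is the interval at which the partition information in zk is refreshed into the local cache. Every call to nextTuple in KafkaSpout invokes ZkCoordinator's getMyManagedPartitions, and that method uses _refreshFreqMs to decide when to refresh the partition information: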
- //storm.kafka.ZkCoordinator
- @Override
- public List<PartitionManager> getMyManagedPartitions() {
- if (_lastRefreshTime == null || (System.currentTimeMillis() - _lastRefreshTime) > _refreshFreqMs) {
- refresh();
- _lastRefreshTime = System.currentTimeMillis();
- }
- return _cachedList;
- }
- @Override
- public void refresh() {
- try {
- LOG.info(taskId(_taskIndex, _totalTasks) + "Refreshing partition manager connections");
- GlobalPartitionInformation brokerInfo = _reader.getBrokerInfo();
- List<Partition> mine = KafkaUtils.calculatePartitionsForTask(brokerInfo, _totalTasks, _taskIndex);
- Set<Partition> curr = _managers.keySet();
- Set<Partition> newPartitions = new HashSet<Partition>(mine);
- newPartitions.removeAll(curr);
- Set<Partition> deletedPartitions = new HashSet<Partition>(curr);
- deletedPartitions.removeAll(mine);
- LOG.info(taskId(_taskIndex, _totalTasks) + "Deleted partition managers: " + deletedPartitions.toString());
- for (Partition id : deletedPartitions) {
- PartitionManager man = _managers.remove(id);
- man.close();
- }
- LOG.info(taskId(_taskIndex, _totalTasks) + "New partition managers: " + newPartitions.toString());
- for (Partition id : newPartitions) {
- PartitionManager man = new PartitionManager(_connections, _topologyInstanceId, _state, _stormConf, _spoutConfig, id);
- _managers.put(id, man);
- }
- } catch (Exception e) {
- throw new RuntimeException(e);
- }
- _cachedList = new ArrayList<PartitionManager>(_managers.values());
- LOG.info(taskId(_taskIndex, _totalTasks) + "Finished refreshing");
- }
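The algorithm that assigns partitions to each consumer is KafkaUtils.calculatePartitionsForTask(brokerInfo, _totalTasks, _taskIndex);
Its main job is to take the number of parallel tasks, compare it against the current partitions, and work out which partitions each consumer is responsible for reading. For the exact algorithm, see the kafka documentation.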
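For intuition, one common way to split partitions across tasks is a round-robin stride over the ordered partition list, combined with the set difference that refresh() uses to find new and deleted partitions. The sketch below assumes that scheme; it is not copied from KafkaUtils, and PartitionAssignment, partitionsForTask, and difference are hypothetical names. Partitions are plain integers here rather than Partition objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: round-robin partition assignment plus the diff used on refresh. */
final class PartitionAssignment {

    /** Task k takes partitions k, k + totalTasks, k + 2 * totalTasks, ... (assumed scheme). */
    static List<Integer> partitionsForTask(List<Integer> orderedPartitions, int totalTasks, int taskIndex) {
        if (taskIndex < 0 || taskIndex >= totalTasks) {
            throw new IllegalArgumentException("task index must be in [0, totalTasks)");
        }
        List<Integer> mine = new ArrayList<>();
        for (int i = taskIndex; i < orderedPartitions.size(); i += totalTasks) {
            mine.add(orderedPartitions.get(i));
        }
        return mine;  // empty when there are more tasks than partitions
    }

    /** Elements of a that are not in b, as in newPartitions / deletedPartitions above. */
    static <T> Set<T> difference(Set<T> a, Set<T> b) {
        Set<T> result = new HashSet<>(a);
        result.removeAll(b);
        return result;
    }

    public static void main(String[] args) {
        List<Integer> partitions = Arrays.asList(0, 1, 2, 3, 4);
        System.out.println(partitionsForTask(partitions, 2, 0)); // [0, 2, 4]
        System.out.println(partitionsForTask(partitions, 2, 1)); // [1, 3]
    }
}
```

With five partitions and two tasks, task 0 reads three partitions and task 1 reads two; if tasks outnumber partitions, the extra tasks simply get an empty list and stay idle.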
That completes the initialization in KafkaSpout. Next comes fetching and emitting data, so let's look at the nextTuple method:
- // storm.kafka.KafkaSpout
- @Override
- public void nextTuple() {
- List<PartitionManager> managers = _coordinator.getMyManagedPartitions();
- for (int i = 0; i < managers.size(); i++) {
- try {
- // in case the number of managers decreased
- _currPartitionIndex = _currPartitionIndex % managers.size();
- EmitState state = managers.get(_currPartitionIndex).next(_collector);
- if (state != EmitState.EMITTED_MORE_LEFT) {
- _currPartitionIndex = (_currPartitionIndex + 1) % managers.size();
- }
- if (state != EmitState.NO_EMITTED) {
- break;
- }
- } catch (FailedFetchException e) {
- LOG.warn("Fetch failed", e);
- _coordinator.refresh();
- }
- }
- long now = System.currentTimeMillis();
- if ((now - _lastUpdateMs) > _spoutConfig.stateUpdateIntervalMs) {
- commit();
- }
- }
As the code above shows, all the real work happens in PartitionManager: it reads the messages and then emits them. The main logic is in PartitionManager's next method:
- //returns false if it's reached the end of current batch
- public EmitState next(SpoutOutputCollector collector) {
- if (_waitingToEmit.isEmpty()) {
- fill();
- }
- while (true) {
- MessageAndRealOffset toEmit = _waitingToEmit.pollFirst();
- if (toEmit == null) {
- return EmitState.NO_EMITTED;
- }
- Iterable<List<Object>> tups = KafkaUtils.generateTuples(_spoutConfig, toEmit.msg);
- if (tups != null) {
- for (List<Object> tup : tups) {
- collector.emit(tup, new KafkaMessageId(_partition, toEmit.offset));
- }
- break;
- } else {
- ack(toEmit.offset);
- }
- }
- if (!_waitingToEmit.isEmpty()) {
- return EmitState.EMITTED_MORE_LEFT;
- } else {
- return EmitState.EMITTED_END;
- }
- }
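If the _waitingToEmit list is empty, next first calls fill() to fetch messages, then emits them one at a time. After each emitted message it breaks out of the loop and returns to KafkaSpout's nextTuple: EMITTED_MORE_LEFT if the batch fetched from this partition still has messages waiting, or EMITTED_END if the batch has been fully emitted. nextTuple uses that state to decide whether to stay on the current partition or move on to read and emit from the next one.
One thing to note: the spout does not emit every pending message of a partition and only then commit the offset to zk. Instead, after each emitted message it checks whether it is time to commit (based on the commit interval configured at startup). In the author's view, this is done to make failures easier to control.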
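The interplay between next and nextTuple is easier to follow in a small simulation. The sketch below keeps the index arithmetic and the three emit states from the code above, but everything else is a stand-in: FakeManager is a hypothetical replacement for PartitionManager that holds pre-fetched strings and never refills, and the output list replaces the collector.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Hypothetical simulation of the round-robin emit loop in nextTuple. */
final class RoundRobinEmitDemo {
    enum EmitState { EMITTED_MORE_LEFT, EMITTED_END, NO_EMITTED }

    /** Stand-in for PartitionManager: only holds already-fetched messages, no fill(). */
    static final class FakeManager {
        private final Deque<String> waitingToEmit = new ArrayDeque<>();

        FakeManager(String... msgs) { waitingToEmit.addAll(Arrays.asList(msgs)); }

        EmitState next(List<String> out) {
            String msg = waitingToEmit.pollFirst();
            if (msg == null) {
                return EmitState.NO_EMITTED;
            }
            out.add(msg);  // emit exactly one message per call
            return waitingToEmit.isEmpty() ? EmitState.EMITTED_END : EmitState.EMITTED_MORE_LEFT;
        }
    }

    private final List<FakeManager> managers;
    private int currPartitionIndex = 0;

    RoundRobinEmitDemo(List<FakeManager> managers) { this.managers = managers; }

    /** Emits at most one message per call, trying each manager at most once. */
    void nextTuple(List<String> out) {
        for (int i = 0; i < managers.size(); i++) {
            currPartitionIndex = currPartitionIndex % managers.size();
            EmitState state = managers.get(currPartitionIndex).next(out);
            if (state != EmitState.EMITTED_MORE_LEFT) {
                // batch finished or nothing to emit: move to the next partition
                currPartitionIndex = (currPartitionIndex + 1) % managers.size();
            }
            if (state != EmitState.NO_EMITTED) {
                break;  // something was emitted, return control to the caller
            }
        }
    }

    public static void main(String[] args) {
        RoundRobinEmitDemo spout = new RoundRobinEmitDemo(Arrays.asList(
            new FakeManager("a1", "a2"), new FakeManager(), new FakeManager("c1")));
        List<String> out = new ArrayList<>();
        for (int call = 0; call < 4; call++) {
            spout.nextTuple(out);
        }
        System.out.println(out); // [a1, a2, c1]
    }
}
```

The first partition is drained before the loop moves on, the empty partition is skipped within a single call, and the fourth call emits nothing because every queue is exhausted.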
KafkaSpout delegates its ack, fail, and commit operations entirely to PartitionManager. Here is the code:
- @Override
- public void ack(Object msgId) {
- KafkaMessageId id = (KafkaMessageId) msgId;
- PartitionManager m = _coordinator.getManager(id.partition);
- if (m != null) {
- m.ack(id.offset);
- }
- }
- @Override
- public void fail(Object msgId) {
- KafkaMessageId id = (KafkaMessageId) msgId;
- PartitionManager m = _coordinator.getManager(id.partition);
- if (m != null) {
- m.fail(id.offset);
- }
- }
- @Override
- public void deactivate() {
- commit();
- }
- @Override
- public void declareOutputFields(OutputFieldsDeclarer declarer) {
- declarer.declare(_spoutConfig.scheme.getOutputFields());
- }
- private void commit() {
- _lastUpdateMs = System.currentTimeMillis();
- for (PartitionManager manager : _coordinator.getMyManagedPartitions()) {
- manager.commit();
- }
- }
So PartitionManager is the core of KafkaSpout. It's very late, already past 3 a.m., so I'll follow up with the PartitionManager analysis in a later post. Good night.
- This post is part of the following column:
- Storm-kafka源码浅谈 (A Brief Look at the storm-kafka Source)