[Original] Big Data Basics: Flume (2) Sink Code Analysis
Flume sink core class structure
1 The core interface: Sink
org.apache.flume.Sink
/**
* <p>Requests the sink to attempt to consume data from attached channel</p>
* <p><strong>Note</strong>: This method should be consuming from the channel
* within the bounds of a Transaction. On successful delivery, the transaction
* should be committed, and on failure it should be rolled back.
* @return READY if 1 or more Events were successfully delivered, BACKOFF if
* no data could be retrieved from the channel feeding this sink
* @throws EventDeliveryException In case of any kind of failure to
* deliver data to the next hop destination.
*/
public Status process() throws EventDeliveryException;

public static enum Status {
  READY, BACKOFF
}
process() is the core method. It returns a Status with only two possible values, READY and BACKOFF; the caller reacts to this return value accordingly, as we will see below.
This is also the interface you implement when extending Flume with a custom sink, e.g. KuduSink.
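For reference, here is a minimal custom-sink sketch (a hypothetical LoggingSink, not code from Flume or the Kudu sink) that follows the contract described in the Javadoc above: consume inside a transaction, commit on success, roll back on failure, and return BACKOFF when the channel is empty.

import org.apache.flume.Channel;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.Transaction;
import org.apache.flume.conf.Configurable;
import org.apache.flume.sink.AbstractSink;

// Hypothetical example, not from the Flume sources: a sink that just prints event bodies.
public class LoggingSink extends AbstractSink implements Configurable {

  @Override
  public void configure(Context context) {
    // read sink properties from the agent configuration here
  }

  @Override
  public Status process() throws EventDeliveryException {
    Channel channel = getChannel();
    Transaction txn = channel.getTransaction();
    txn.begin();
    try {
      Event event = channel.take();
      if (event == null) {
        // nothing to consume: commit the empty transaction and ask the runner to back off
        txn.commit();
        return Status.BACKOFF;
      }
      // "deliver" the event to the destination; here we just print the body
      System.out.println(new String(event.getBody()));
      txn.commit();
      return Status.READY;
    } catch (Throwable t) {
      txn.rollback();
      throw new EventDeliveryException("Failed to deliver event", t);
    } finally {
      txn.close();
    }
  }
}

Such a sink is then referenced by its fully qualified class name in the agent configuration (for example a1.sinks.k1.type = com.example.LoggingSink, names hypothetical).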
2 The Sink wrapper: SinkProcessor
org.apache.flume.SinkProcessor
/**
* <p>
* Interface for a device that allows abstraction of the behavior of multiple
* sinks, always assigned to a SinkRunner
* </p>
* <p>
* A sink processors {@link SinkProcessor#process()} method will only be
* accessed by a single runner thread. However configuration methods
* such as {@link Configurable#configure} may be concurrently accessed.
*
* @see org.apache.flume.Sink
* @see org.apache.flume.SinkRunner
* @see org.apache.flume.sink.SinkGroup
*/
public interface SinkProcessor extends LifecycleAware, Configurable {
/**
* <p>Handle a request to poll the owned sinks.</p>
*
* <p>The processor is expected to call {@linkplain Sink#process()} on
* whatever sink(s) appropriate, handling failures as appropriate and
* throwing {@link EventDeliveryException} when there is a failure to
* deliver any events according to the delivery policy defined by the
* sink processor implementation. See specific implementations of this
* interface for delivery behavior and policies.</p>
*
* @return Returns {@code READY} if events were successfully consumed,
* or {@code BACKOFF} if no events were available in the channel to consume.
* @throws EventDeliveryException if the behavior guaranteed by the processor
* couldn't be carried out.
*/
Status process() throws EventDeliveryException;
This interface encapsulates the processing of either a single sink or a sink group; the commonly used implementations are:
1) Single sink
org.apache.flume.sink.DefaultSinkProcessor
@Override
public Status process() throws EventDeliveryException {
  return sink.process();
}
DefaultSinkProcessor.process() simply delegates to the inner sink's process().
2) Sink group
org.apache.flume.sink.LoadBalancingSinkProcessor
org.apache.flume.sink.FailoverSinkProcessor
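For contrast with DefaultSinkProcessor, here is a hypothetical, heavily simplified failover-style processor (not the actual FailoverSinkProcessor, which also tracks priorities and penalty timeouts): it tries the sinks of the group in order and only fails when every sink has failed.

import java.util.List;

import org.apache.flume.EventDeliveryException;
import org.apache.flume.Sink;

// Hypothetical illustration, not Flume source code: the essence of a failover-style
// sink processor is to try the configured sinks in priority order, return the status
// of the first one that succeeds, and throw only when all of them have failed.
public class SimpleFailoverProcessor {

  private final List<Sink> sinksInPriorityOrder;

  public SimpleFailoverProcessor(List<Sink> sinksInPriorityOrder) {
    this.sinksInPriorityOrder = sinksInPriorityOrder;
  }

  public Sink.Status process() throws EventDeliveryException {
    for (Sink sink : sinksInPriorityOrder) {
      try {
        return sink.process();  // the first sink that succeeds decides the status
      } catch (Exception ex) {
        // the real processor records the failure and temporarily penalizes the sink;
        // here we simply move on to the next sink in the group
      }
    }
    throw new EventDeliveryException("All sinks in the group failed");
  }
}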
3 The caller of the sink: SinkRunner
org.apache.flume.SinkRunner
/**
* <p>
* A driver for {@linkplain Sink sinks} that polls them, attempting to
* {@linkplain Sink#process() process} events if any are available in the
* {@link Channel}.
* </p>
*
* <p>
* Note that, unlike {@linkplain Source sources}, all sinks are polled.
* </p>
*
* @see org.apache.flume.Sink
* @see org.apache.flume.SourceRunner
*/
public class SinkRunner implements LifecycleAware {
...
private static final long backoffSleepIncrement = 1000;
private static final long maxBackoffSleep = 5000;
org.apache.flume.SinkRunner.PollingRunner
public static class PollingRunner implements Runnable {

  private SinkProcessor policy;
  private AtomicBoolean shouldStop;
  private CounterGroup counterGroup;

  @Override
  public void run() {
    logger.debug("Polling sink runner starting");

    while (!shouldStop.get()) {
      try {
        if (policy.process().equals(Sink.Status.BACKOFF)) {
          counterGroup.incrementAndGet("runner.backoffs");

          Thread.sleep(Math.min(
              counterGroup.incrementAndGet("runner.backoffs.consecutive")
                  * backoffSleepIncrement, maxBackoffSleep));
        } else {
          counterGroup.set("runner.backoffs.consecutive", 0L);
        }
      } catch (InterruptedException e) {
        logger.debug("Interrupted while processing an event. Exiting.");
        counterGroup.incrementAndGet("runner.interruptions");
      } catch (Exception e) {
        logger.error("Unable to deliver event. Exception follows.", e);

        if (e instanceof EventDeliveryException) {
          counterGroup.incrementAndGet("runner.deliveryErrors");
        } else {
          counterGroup.incrementAndGet("runner.errors");
        }

        try {
          Thread.sleep(maxBackoffSleep);
        } catch (InterruptedException ex) {
          Thread.currentThread().interrupt();
        }
      }
    }

    logger.debug("Polling runner exiting. Metrics:{}", counterGroup);
  }
}
Whether process() returns BACKOFF or throws an exception, the PollingRunner sleeps for a while before polling again (up to maxBackoffSleep, 5 seconds). So as soon as a sink keeps running into bad data (repeated exceptions), or a custom sink returns BACKOFF too eagerly, the Flume sink becomes very slow.
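To make the back-off schedule concrete, here is a small standalone demo (hypothetical code, but using the same constants as the snippet above): the sleep grows by backoffSleepIncrement for every consecutive BACKOFF and is capped at maxBackoffSleep, resetting only after a READY.

// Standalone illustration of the back-off arithmetic in PollingRunner.
public class BackoffDemo {
  public static void main(String[] args) {
    long backoffSleepIncrement = 1000;   // ms, same constant as in SinkRunner
    long maxBackoffSleep = 5000;         // ms, same constant as in SinkRunner
    long consecutive = 0;
    for (int poll = 1; poll <= 7; poll++) {
      consecutive++;                     // every poll returns BACKOFF in this demo
      long sleepMs = Math.min(consecutive * backoffSleepIncrement, maxBackoffSleep);
      System.out.println("poll " + poll + ": sleep " + sleepMs + " ms");
      // prints 1000, 2000, 3000, 4000, 5000, 5000, 5000
    }
  }
}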