Flume sink core class structure

1 Core interface: Sink

org.apache.flume.Sink

/**
 * <p>Requests the sink to attempt to consume data from attached channel</p>
 * <p><strong>Note</strong>: This method should be consuming from the channel
 * within the bounds of a Transaction. On successful delivery, the transaction
 * should be committed, and on failure it should be rolled back.
 * @return READY if 1 or more Events were successfully delivered, BACKOFF if
 * no data could be retrieved from the channel feeding this sink
 * @throws EventDeliveryException In case of any kind of failure to
 * deliver data to the next hop destination.
 */
public Status process() throws EventDeliveryException;

public static enum Status {
  READY, BACKOFF
}

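process() is the core method. It returns a status with only two possible values, READY and BACKOFF, and the caller acts on that return value, as section 3 shows.
This is also the interface to implement when extending Flume with a custom sink, for example KuduSink.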
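The contract in the Javadoc above (commit on success, roll back on failure, BACKOFF when the channel is empty) can be modelled in a few lines. The following is a minimal stand-alone sketch, not Flume code: ToyChannel, ToySink and DeliveryException are hypothetical stand-ins invented for illustration, and the real EventDeliveryException is a checked exception, whereas the stand-in here is unchecked to keep the example short.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Stand-alone model of the Sink.process() contract; none of these types are Flume classes. */
public class SinkContractSketch {

  /** Mirrors the two values of org.apache.flume.Sink.Status. */
  enum Status { READY, BACKOFF }

  /** Hypothetical stand-in for EventDeliveryException (unchecked here for brevity). */
  static class DeliveryException extends RuntimeException {
    DeliveryException(String message) { super(message); }
  }

  /** Hypothetical channel with a one-event "transaction". */
  static class ToyChannel {
    private final Deque<String> events = new ArrayDeque<>();
    private String inFlight; // event taken but not yet committed

    void put(String event) { events.addLast(event); }

    String take() {
      inFlight = events.pollFirst(); // null when the channel is empty
      return inFlight;
    }

    void commit() { inFlight = null; } // the taken event is gone for good

    void rollback() { // put the taken event back so it can be retried
      if (inFlight != null) {
        events.addFirst(inFlight);
        inFlight = null;
      }
    }

    int size() { return events.size(); }
  }

  /** Hypothetical sink: "delivers" into a list and fails on the event "bad". */
  static class ToySink {
    final List<String> delivered = new ArrayList<>();
    private final ToyChannel channel;

    ToySink(ToyChannel channel) { this.channel = channel; }

    Status process() {
      String event = channel.take();
      try {
        if (event == null) {
          channel.commit();
          return Status.BACKOFF; // nothing to consume
        }
        if (event.equals("bad")) {
          throw new DeliveryException("cannot deliver: " + event);
        }
        delivered.add(event);
        channel.commit();
        return Status.READY; // at least one event delivered
      } catch (DeliveryException e) {
        channel.rollback(); // failure: the event stays in the channel
        throw e;
      }
    }
  }

  public static void main(String[] args) {
    ToyChannel channel = new ToyChannel();
    ToySink sink = new ToySink(channel);

    channel.put("a");
    System.out.println(sink.process()); // READY
    System.out.println(sink.process()); // BACKOFF

    channel.put("bad");
    try {
      sink.process();
    } catch (DeliveryException e) {
      System.out.println("failed, events left in channel: " + channel.size());
    }
  }
}
```

Because the failed event is rolled back rather than dropped, the next call to process() sees the same event again. That is why a single undeliverable event can keep a sink failing repeatedly.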

2 Sink wrapper: SinkProcessor

org.apache.flume.SinkProcessor

/**
* <p>
* Interface for a device that allows abstraction of the behavior of multiple
* sinks, always assigned to a SinkRunner
* </p>
* <p>
* A sink processors {@link SinkProcessor#process()} method will only be
* accessed by a single runner thread. However configuration methods
* such as {@link Configurable#configure} may be concurrently accessed.
*
* @see org.apache.flume.Sink
* @see org.apache.flume.SinkRunner
* @see org.apache.flume.sink.SinkGroup
*/
public interface SinkProcessor extends LifecycleAware, Configurable {
/**
* <p>Handle a request to poll the owned sinks.</p>
*
* <p>The processor is expected to call {@linkplain Sink#process()} on
* whatever sink(s) appropriate, handling failures as appropriate and
* throwing {@link EventDeliveryException} when there is a failure to
* deliver any events according to the delivery policy defined by the
* sink processor implementation. See specific implementations of this
* interface for delivery behavior and policies.</p>
*
* @return Returns {@code READY} if events were successfully consumed,
* or {@code BACKOFF} if no events were available in the channel to consume.
* @throws EventDeliveryException if the behavior guaranteed by the processor
* couldn't be carried out.
*/
Status process() throws EventDeliveryException;
...
}

This interface wraps the processing of either a single sink or a sink group. Commonly used implementations are:

1) Single sink

org.apache.flume.sink.DefaultSinkProcessor

@Override
public Status process() throws EventDeliveryException {
  return sink.process();
}

DefaultSinkProcessor's process() simply delegates to the process() of the sink it wraps.

2) Sink group

org.apache.flume.sink.LoadBalancingSinkProcessor
org.apache.flume.sink.FailoverSinkProcessor

3 The caller of the sink: SinkRunner

org.apache.flume.SinkRunner

/**
* <p>
* A driver for {@linkplain Sink sinks} that polls them, attempting to
* {@linkplain Sink#process() process} events if any are available in the
* {@link Channel}.
* </p>
*
* <p>
* Note that, unlike {@linkplain Source sources}, all sinks are polled.
* </p>
*
* @see org.apache.flume.Sink
* @see org.apache.flume.SourceRunner
*/
public class SinkRunner implements LifecycleAware {
  ...
  private static final long backoffSleepIncrement = 1000;
  private static final long maxBackoffSleep = 5000;
  ...
}

org.apache.flume.SinkRunner.PollingRunner

public static class PollingRunner implements Runnable {

  private SinkProcessor policy;
  private AtomicBoolean shouldStop;
  private CounterGroup counterGroup;

  @Override
  public void run() {
    logger.debug("Polling sink runner starting");

    while (!shouldStop.get()) {
      try {
        if (policy.process().equals(Sink.Status.BACKOFF)) {
          counterGroup.incrementAndGet("runner.backoffs");

          Thread.sleep(Math.min(
              counterGroup.incrementAndGet("runner.backoffs.consecutive")
              * backoffSleepIncrement, maxBackoffSleep));
        } else {
          counterGroup.set("runner.backoffs.consecutive", 0L);
        }
      } catch (InterruptedException e) {
        logger.debug("Interrupted while processing an event. Exiting.");
        counterGroup.incrementAndGet("runner.interruptions");
      } catch (Exception e) {
        logger.error("Unable to deliver event. Exception follows.", e);
        if (e instanceof EventDeliveryException) {
          counterGroup.incrementAndGet("runner.deliveryErrors");
        } else {
          counterGroup.incrementAndGet("runner.errors");
        }
        try {
          Thread.sleep(maxBackoffSleep);
        } catch (InterruptedException ex) {
          Thread.currentThread().interrupt();
        }
      }
    }
    logger.debug("Polling runner exiting. Metrics:{}", counterGroup);
  }
}

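Whether process() returns BACKOFF or throws an exception, the runner thread sleeps before polling again. Consecutive BACKOFFs sleep 1 s, 2 s, 3 s and so on, capped at maxBackoffSleep (5 s), and any exception sleeps the full 5 s. As a result, a Flume sink becomes very slow when it hits a large amount of bad data, or when a custom sink returns BACKOFF.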
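To make the cost concrete, the sleep arithmetic in PollingRunner can be reproduced on its own. This is a small sketch that only reuses the two constants from the excerpt above. The class and method names (BackoffSchedule, sleepAfterBackoff, maxCallsPerMinuteWhenFailing) are made up for illustration, and the calls-per-minute figure ignores the time spent inside process() itself.

```java
/** Reproduces the sleep arithmetic of SinkRunner.PollingRunner; names are illustrative only. */
public class BackoffSchedule {

  // Same values as backoffSleepIncrement and maxBackoffSleep in the SinkRunner excerpt.
  static final long BACKOFF_SLEEP_INCREMENT = 1000;
  static final long MAX_BACKOFF_SLEEP = 5000;

  /** Sleep in milliseconds after the n-th consecutive BACKOFF (n starts at 1). */
  static long sleepAfterBackoff(long consecutiveBackoffs) {
    return Math.min(consecutiveBackoffs * BACKOFF_SLEEP_INCREMENT, MAX_BACKOFF_SLEEP);
  }

  /** Rough upper bound on process() calls per minute when every call throws. */
  static long maxCallsPerMinuteWhenFailing() {
    return 60_000 / MAX_BACKOFF_SLEEP;
  }

  public static void main(String[] args) {
    for (long n = 1; n <= 7; n++) {
      System.out.println("backoff #" + n + " -> sleep " + sleepAfterBackoff(n) + " ms");
    }
    System.out.println("process() calls per minute when every call throws: about "
        + maxCallsPerMinuteWhenFailing());
  }
}
```

The sleep grows linearly from 1000 ms and stays at 5000 ms from the fifth consecutive BACKOFF onward. A sink whose every call throws is polled only about 12 times per minute, so a custom sink should return BACKOFF, or let exceptions escape, only when it genuinely cannot make progress.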
