An Introduction to Apache Flume

Reposted from: http://blog.163.com/guaiguai_family/blog/static/20078414520138100562883/

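Flume is a log collection system open-sourced by Cloudera. Early versions depended on ZooKeeper; the current Flume NG dropped that dependency. I never used the earlier versions, but losing a global view of the whole log collection system seems a pity. On the other hand, Flume NG is easy to pick up and use, and it works reasonably well alongside a monitoring system, so there are pros and cons :-)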

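The diagram in the original post shows a common Flume deployment: servers send events to a local Flume agent, or have the local agent run tail -f on the log files. The logs are forwarded to several downstream Flume agents in the same data center, which write the data to HDFS and also keep a short-lived copy on local disk for debugging.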

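Flume's configuration file is easy to read, and the official documentation describes it in detail. Structurally it has two parts: first declare which sources, channels and sinks a Flume agent runs, then configure the properties of each source, channel and sink and how they connect to one another.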
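
As a rough illustration of that two-part layout, here is a minimal single-agent configuration in the style of the example in the Flume User Guide. The agent name a1 and the component names r1, c1 and k1 are arbitrary placeholders, not names used elsewhere in this post.

```properties
# Part 1: declare the components this agent runs
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Part 2a: configure each component
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

a1.sinks.k1.type = logger

# Part 2b: wire them together
# (a source lists "channels", a sink names exactly one "channel")
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```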

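source: where logs come from. A source can run an external command such as tail -f, or listen on a port and accept logs in Avro, Thrift or plain text-line format. Sources and channels have a many-to-many relationship. A source writes to its channels in either replicating (the default) or multiplexing mode; in the diagram, for example, the source in each log collector replicates every event into two channels.
channel: personally I think "queue" would be a better name, to avoid confusion with what a sink does. A channel buffers and distributes log events. Roughly speaking, channel to sink is one-to-many, and events can move from a channel to its sinks in three modes: default, failover and load_balance. The documentation calls this both a sink processor and a sink group; I find "sink group" easier to understand. A channel actually delivers events to a sink group, so channel to sink group is one-to-one (which is probably why a sink group is also called a sink processor: the name refers to the process of moving events from the channel to the sinks), while sink group to sink is one-to-many. In default mode a sink group may contain only one sink. failover always writes to the highest-priority sink first and falls back to the next-highest priority on failure. load_balance is the easy one: it spreads writes across the sinks, round-robin by default.
sink: writes logs out. Note that one sink can write to only one destination, such as a local file or a network port on a single remote host; load balancing is not done by the sink itself. That is why, in the diagram, the log emitter needs one sink configured for each log collector. Flume ships with many sinks: local file system, HDFS, HBase, Solr, Elasticsearch, Avro, Thrift and more. Commendably, both the HDFS and HBase sinks support Kerberos authentication. As you would expect from Cloudera, the Hadoop integration is good.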
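
To make the sink group idea concrete, the sketch below configures a failover group over two Avro sinks that read from the same channel. The names a1, c1, k1, k2 and g1 are assumptions for illustration, the host names and port mirror the placeholders used in the script later in this post, and the priority and penalty values are arbitrary.

```properties
a1.sinks = k1 k2
a1.sinkgroups = g1

# One sink per destination: each Avro sink targets a single collector
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = collector1
a1.sinks.k1.port = 3000

a1.sinks.k2.type = avro
a1.sinks.k2.channel = c1
a1.sinks.k2.hostname = collector2
a1.sinks.k2.port = 3000

# The sink group (sink processor) decides which sink gets the events
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
# A larger number means a higher priority; k2 is used only while k1 is failing
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
# Upper bound on the backoff period for a failed sink, in milliseconds
a1.sinkgroups.g1.processor.maxpenalty = 10000
```

Switching processor.type to load_balance (and dropping the priority lines) would spread events across k1 and k2 instead, which is what the emitter configuration generated by the script below does.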

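A single Flume agent process can contain multiple sources, channels and sinks, and the flows they form can be unrelated. For example, one source-channel-sink chain can collect access.log while another collects error.log, with no data exchanged between them. Multiple Flume agent processes can also run on the same machine. Note that events held in the memory channels of one agent process are shared, but Flume does not take this sharing into account when it estimates memory consumption.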
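
A minimal sketch of two unrelated flows inside one agent might look like the following. All component names and file paths here are hypothetical and chosen only for illustration.

```properties
a1.sources = access_src error_src
a1.channels = access_ch error_ch
a1.sinks = access_sink error_sink

# Flow 1: access.log -> access_ch -> local files
a1.sources.access_src.type = exec
a1.sources.access_src.command = tail -F /var/log/app/access.log
a1.sources.access_src.channels = access_ch
a1.channels.access_ch.type = memory
a1.sinks.access_sink.type = file_roll
a1.sinks.access_sink.channel = access_ch
a1.sinks.access_sink.sink.directory = /tmp/flume-demo/access

# Flow 2: error.log -> error_ch -> local files; shares nothing with flow 1
a1.sources.error_src.type = exec
a1.sources.error_src.command = tail -F /var/log/app/error.log
a1.sources.error_src.channels = error_ch
a1.channels.error_ch.type = memory
a1.sinks.error_sink.type = file_roll
a1.sinks.error_sink.channel = error_ch
a1.sinks.error_sink.sink.directory = /tmp/flume-demo/error
```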

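The Flume agent process checks its configuration file every thirty seconds and reloads it if it has changed. So although there is no ZooKeeper to manage configuration centrally, tools such as Puppet/Chef plus Nagios/Ganglia make this not much of a problem.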

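Flume has no direct support for categories the way Scribe does. Instead, it lets you attach headers to events, and a multiplexing channel selector can route events to different channels according to a header value, so the logs can be split apart at the end of the flow. With the HDFS sink, event header values can be substituted into the HDFS file name, so you can split logs by category without using a multiplexing channel selector at all.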
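
Both approaches can be sketched as follows, assuming events carry a header named category (as the script below arranges with a static interceptor). The agent, source, channel and sink names are placeholders, and the HDFS URL simply reuses the example path from the script.

```properties
# Approach 1: route by header with a multiplexing channel selector
a1.sources.r1.channels = c_access c_error c_other
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = category
a1.sources.r1.selector.mapping.access = c_access
a1.sources.r1.selector.mapping.error = c_error
# Events whose category matches no mapping go here
a1.sources.r1.selector.default = c_other

# Approach 2: keep a single channel and let the HDFS sink split by header.
# %{category} expands to the event's "category" header; the time escapes
# need a timestamp header, e.g. one added by the timestamp interceptor.
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode1:8020/user/gg/data/%{category}/%Y%m%d
```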

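Flume's design is quite flexible and simple. In my small test it was stable, but performance was unimpressive (my test may not have been rigorous), especially with the file channel. Flume buffers events inside the JVM, a design that is neither as clever nor as efficient as Kafka's. Another concern is that, unlike Kafka, it does not treat replication as a core part of the design; users have to configure it explicitly at every stage of the event flow, for example by giving each log collector an extra memory channel and an Avro sink that writes to another log collector. Without ZooKeeper's help, this is not really practical. If the project schedule allows, I think building sinks on top of Kafka is a more efficient, convenient and reliable log collection solution, as in LinkedIn's data pipeline architecture.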

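Below is a Perl script that generates the Flume configuration. I don't write the configuration by hand because many of the source, channel and sink settings are nearly identical, and maintaining them manually gets tiring. The tail command line in the script is long and may be displayed truncated; it should read: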

tail -F -n 0 --pid `ps -o ppid= \$\$` $log_file | sed -e \"s/^/host=`hostname --fqdn` category=$category:/\"

Which makes me wonder why tools like Scribe and Flume don't simply provide tail -f functionality out of the box...

#!/usr/bin/perl
#
# Emitter:
#   Server -> access.log -> tail -F -> Flume exec source(access) ->
#       Flume file channel(c1) -> Flume Avro Sinks with load balancing sink processor(g1: s1 s2 s3)
#
# Collector:
#   Flume Avro source with replicating channel selector(source1) ->
#       Flume memory channel(file1) -> Flume file roll sink(file1)
#       Flume memory channel(hdfs1) -> Flume HDFS sink(hdfs1)
#       Flume memory channel(hdfs2) -> Flume HDFS sink(hdfs2), in another data center
#       Flume memory channel(hbase1) -> Flume HBase sink(hbase1), the standard HBase sink uses
#                       hbase-site.xml to get server address, so can't use two HBase sinks
#                       except starting another Flume agent process.

use strict;
use warnings;
use Getopt::Long;

my %g_log_files = (
    "access" => [ qw( /tmp/access.log ) ],
);
my @g_collector_hosts = qw( collector1 collector2 collector3 );
my $g_emitter_avro_port = 3000;
my $g_emitter_thrift_port = 3001;
my $g_emitter_nc_port = 3002;
my $g_collector_avro_port = 3000;
my $g_flume_work_dir = "/tmp/flume";
my $g_data_dir = "/tmp/log-data";
my %g_hdfs_paths = (
    "hdfs1" => "hdfs://namenode1:8020/user/gg/data",
    "hdfs2" => "hdfs://namenode2:8020/user/gg/data",
);
my $g_emitter_conf = "emitter.properties";
my $g_collector_conf = "collector.properties";
my $g_overwrite_conf = 0;

# --force allows existing .properties files to be overwritten
GetOptions("force!" => \$g_overwrite_conf);

generate_emitter_config();
generate_collector_config();

exit(0);

#######################################
sub generate_emitter_config {
    my $conf = "";
    my $sources = join(" ", sort(keys %g_log_files));
    # One Avro sink per collector host: s1, s2, ...
    my $sinks = join(" ", map { "s$_" } (1 .. @g_collector_hosts));

    $conf .= <<EOF;
        emitter.sources = $sources avro1 thrift1 nc1
        emitter.channels = c1
        emitter.sinks = $sinks
        emitter.sinkgroups = g1

EOF

    for my $category (sort keys %g_log_files) {
        my $log_files = $g_log_files{$category};

        # The source is named after the category, so if a category lists
        # several files, the later entries override the earlier ones.
        for my $log_file (@$log_files) {
            $conf .= <<EOF;
        emitter.sources.$category.channels = c1
        emitter.sources.$category.type = exec
        emitter.sources.$category.command = tail -F -n 0 --pid `ps -o ppid= \$\$` $log_file | sed -e \"s/^/host=`hostname --fqdn` category=$category:/\"
        emitter.sources.$category.shell = /bin/sh -c
        emitter.sources.$category.restartThrottle = 5000
        emitter.sources.$category.restart = true
        emitter.sources.$category.logStdErr = true
        emitter.sources.$category.interceptors = i1 i2 i3
        emitter.sources.$category.interceptors.i1.type = timestamp
        emitter.sources.$category.interceptors.i2.type = host
        emitter.sources.$category.interceptors.i2.useIP = false
        emitter.sources.$category.interceptors.i3.type = static
        emitter.sources.$category.interceptors.i3.key = category
        emitter.sources.$category.interceptors.i3.value = $category

EOF
        }
    }

    $conf .=<<EOF;
        emitter.sources.avro1.channels = c1
        emitter.sources.avro1.type = avro
        emitter.sources.avro1.bind = localhost
        emitter.sources.avro1.port = $g_emitter_avro_port
        emitter.sources.avro1.interceptors = i1 i2 i3
        emitter.sources.avro1.interceptors.i1.type = timestamp
        emitter.sources.avro1.interceptors.i2.type = host
        emitter.sources.avro1.interceptors.i2.useIP = false
        emitter.sources.avro1.interceptors.i3.type = static
        emitter.sources.avro1.interceptors.i3.key = category
        emitter.sources.avro1.interceptors.i3.value = default

        emitter.sources.thrift1.channels = c1
        emitter.sources.thrift1.type = thrift
        emitter.sources.thrift1.bind = localhost
        emitter.sources.thrift1.port = $g_emitter_thrift_port
        emitter.sources.thrift1.interceptors = i1 i2 i3
        emitter.sources.thrift1.interceptors.i1.type = timestamp
        emitter.sources.thrift1.interceptors.i2.type = host
        emitter.sources.thrift1.interceptors.i2.useIP = false
        emitter.sources.thrift1.interceptors.i3.type = static
        emitter.sources.thrift1.interceptors.i3.key = category
        emitter.sources.thrift1.interceptors.i3.value = default

        emitter.sources.nc1.channels = c1
        emitter.sources.nc1.type = netcat
        emitter.sources.nc1.bind = localhost
        emitter.sources.nc1.port = $g_emitter_nc_port
        emitter.sources.nc1.max-line-length = 20480
        emitter.sources.nc1.interceptors = i1 i2 i3
        emitter.sources.nc1.interceptors.i1.type = timestamp
        emitter.sources.nc1.interceptors.i2.type = host
        emitter.sources.nc1.interceptors.i2.useIP = false
        emitter.sources.nc1.interceptors.i3.type = static
        emitter.sources.nc1.interceptors.i3.key = category
        emitter.sources.nc1.interceptors.i3.value = default

        emitter.channels.c1.type = file
        emitter.channels.c1.checkpointDir = $g_flume_work_dir/emitter-c1/checkpoint
#emitter.channels.c1.useDualCheckpoints = true
#emitter.channels.c1.backupCheckpointDir = $g_flume_work_dir/emitter-c1/checkpointBackup
        emitter.channels.c1.dataDirs = $g_flume_work_dir/emitter-c1/data

EOF

    my $i = 0;
    my $port = $g_collector_avro_port;
    my $onebox = is_one_box();
    for my $host (sort @g_collector_hosts) {
        ++$i;
        # When collectors share one machine, give each its own port
        $port += 1000 if $onebox;

        $conf .= <<EOF;
        emitter.sinks.s$i.channel = c1
        emitter.sinks.s$i.type = avro
        emitter.sinks.s$i.hostname = $host
        emitter.sinks.s$i.port = $port
        emitter.sinks.s$i.batch-size = 100
#emitter.sinks.s$i.reset-connection-interval = 600
        emitter.sinks.s$i.compression-type = deflate

EOF
    }

    $conf .=<<EOF;

        emitter.sinkgroups.g1.sinks = $sinks
        emitter.sinkgroups.g1.processor.type = load_balance
        emitter.sinkgroups.g1.processor.backoff = true
        emitter.sinkgroups.g1.processor.selector = round_robin

EOF

    # Strip the heredoc indentation (this also drops blank lines)
    $conf =~ s/^\s+//mg;

    die "$g_emitter_conf already exists!\n" if !$g_overwrite_conf && -e $g_emitter_conf;
    open my $fh, ">", $g_emitter_conf or die "Can't write $g_emitter_conf: $!\n";
    print $fh $conf;
    close $fh;
}

sub generate_collector_config {
    my $conf = "";
    # Each sink gets a memory channel of the same name
    my @sinks = qw(file1 hdfs1 hdfs2 hbase1);
    my $sinks = join(" ", @sinks);

    my $port = $g_collector_avro_port;
    my $onebox = is_one_box();
    $port += 1000 if $onebox;

    $conf .= <<EOF;
        collector.sources = source1
        collector.channels = $sinks
        collector.sinks = $sinks

        collector.sources.source1.channels = $sinks
        collector.sources.source1.type = avro
        collector.sources.source1.bind = 0.0.0.0
        collector.sources.source1.port = $port
        collector.sources.source1.compression-type = deflate
        collector.sources.source1.interceptors = i1 i2 i3 i4

        collector.sources.source1.interceptors.i1.type = timestamp
        collector.sources.source1.interceptors.i1.preserveExisting = true

        collector.sources.source1.interceptors.i2.type = host
        collector.sources.source1.interceptors.i2.preserveExisting = true
        collector.sources.source1.interceptors.i2.useIP = false

        collector.sources.source1.interceptors.i3.type = static
        collector.sources.source1.interceptors.i3.preserveExisting = true
        collector.sources.source1.interceptors.i3.key = category
        collector.sources.source1.interceptors.i3.value = default

        collector.sources.source1.interceptors.i4.type = host
        collector.sources.source1.interceptors.i4.preserveExisting = false
        collector.sources.source1.interceptors.i4.useIP = false
        collector.sources.source1.interceptors.i4.hostHeader = collector

EOF

    for my $sink (@sinks) {
        $conf .= <<EOF;
        collector.channels.$sink.type = memory
        collector.channels.$sink.capacity = 10000
        collector.channels.$sink.transactionCapacity = 100
        collector.channels.$sink.byteCapacityBufferPercentage = 20
        collector.channels.$sink.byteCapacity = 0

EOF
    }

    $conf .= <<EOF;

        collector.sinks.file1.channel = file1
        collector.sinks.file1.type = file_roll
        collector.sinks.file1.sink.directory = $g_data_dir/collector-$port-file1
        collector.sinks.file1.sink.rollInterval = 3600
        collector.sinks.file1.batchSize = 100
        collector.sinks.file1.sink.serializer = text
        collector.sinks.file1.sink.serializer.appendNewline = true
#collector.sinks.file1.sink.serializer = avro_event
#collector.sinks.file1.sink.serializer.syncIntervalBytes = 2048000
#collector.sinks.file1.sink.serializer.compressionCodec = snappy

        collector.sinks.hdfs1.channel = hdfs1
        collector.sinks.hdfs1.type = hdfs
        collector.sinks.hdfs1.hdfs.path = $g_hdfs_paths{hdfs1}/%{category}/%Y%m%d/%H
        collector.sinks.hdfs1.hdfs.filePrefix = %{collector}-$port
        collector.sinks.hdfs1.hdfs.rollInterval = 600
        collector.sinks.hdfs1.hdfs.rollSize = 0
        collector.sinks.hdfs1.hdfs.rollCount = 0
        collector.sinks.hdfs1.hdfs.idleTimeout = 0
        collector.sinks.hdfs1.hdfs.batchSize = 100
        collector.sinks.hdfs1.hdfs.codeC = snappy
        collector.sinks.hdfs1.hdfs.fileType = SequenceFile
#collector.sinks.hdfs1.serializer = text
#collector.sinks.hdfs1.serializer.appendNewline = true
        collector.sinks.hdfs1.serializer = avro_event
        collector.sinks.hdfs1.serializer.syncIntervalBytes = 2048000
        collector.sinks.hdfs1.serializer.compressionCodec = null
#collector.sinks.hdfs1.serializer.compressionCodec = snappy

        collector.sinks.hdfs2.channel = hdfs2
        collector.sinks.hdfs2.type = hdfs
        collector.sinks.hdfs2.hdfs.path = $g_hdfs_paths{hdfs2}/%{category}/%Y%m%d/%H
        collector.sinks.hdfs2.hdfs.filePrefix = %{collector}-$port
        collector.sinks.hdfs2.hdfs.rollInterval = 600
        collector.sinks.hdfs2.hdfs.rollSize = 0
        collector.sinks.hdfs2.hdfs.rollCount = 0
        collector.sinks.hdfs2.hdfs.idleTimeout = 0
        collector.sinks.hdfs2.hdfs.batchSize = 100
        collector.sinks.hdfs2.hdfs.codeC = snappy
        collector.sinks.hdfs2.hdfs.fileType = SequenceFile
#collector.sinks.hdfs2.serializer = text
#collector.sinks.hdfs2.serializer.appendNewline = true
        collector.sinks.hdfs2.serializer = avro_event
        collector.sinks.hdfs2.serializer.syncIntervalBytes = 2048000
        collector.sinks.hdfs2.serializer.compressionCodec = null
#collector.sinks.hdfs2.serializer.compressionCodec = snappy

        collector.sinks.hbase1.channel = hbase1
        collector.sinks.hbase1.type = hbase
        collector.sinks.hbase1.table = log
        collector.sinks.hbase1.columnFamily = log

EOF

    # Strip the heredoc indentation (this also drops blank lines)
    $conf =~ s/^\s+//mg;

    die "$g_collector_conf already exists!\n" if !$g_overwrite_conf && -e $g_collector_conf;
    open my $fh, ">", $g_collector_conf or die "Can't write $g_collector_conf: $!\n";
    print $fh $conf;
    close $fh;
}

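# True when the collector host list contains duplicates, i.e. several
# collectors run on the same machine and need distinct ports.
sub is_one_box {
    my %h = map { $_ => 1 } @g_collector_hosts;
    return scalar(keys %h) < scalar(@g_collector_hosts);
}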
