Flume Configuration and Notes (Repost)

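What Is Flume

A distributed framework for collecting and aggregating streaming event data
Typically used for log data
Compared with ad-hoc solutions, it has clear advantages:
- Reliable, scalable, manageable, customizable, and high-performance
- Declarative configuration that can be updated dynamically
- Contextual routing
- Load balancing and failover support
- Feature-rich
- Fully extensible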

Core Concepts

Event
Client
Agent
- Sources, Channels, Sinks
- Other components: Interceptors, Channel Selectors, Sink Processors

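Core Concept: Event

An Event is the basic unit of data that Flume transports. Flume moves data from its origin to its final destination in the form of events. An Event consists of optional headers and a byte array that carries the payload.

The payload is opaque to Flume.
Headers are an unordered collection of string key-value pairs; each key is unique within the collection.
Headers can be used for contextual routing.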

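import java.util.Map;

public interface Event {

    // Optional key-value metadata attached to the event
    public Map<String, String> getHeaders();

    public void setHeaders(Map<String, String> headers);

    // Payload carried by the event; opaque to Flume
    public byte[] getBody();

    public void setBody(byte[] body);

}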
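
To make the interface above more concrete, here is a minimal in-memory implementation. It is only a sketch: SketchEvent and ofLine are hypothetical names used for illustration, the interface is repeated so the snippet compiles on its own, and it does not try to reproduce the implementations that ship with Flume.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Copy of the interface shown above, repeated so this sketch compiles on its own.
interface Event {
    Map<String, String> getHeaders();
    void setHeaders(Map<String, String> headers);
    byte[] getBody();
    void setBody(byte[] body);
}

// Hypothetical implementation for illustration only; not Flume's own class.
final class SketchEvent implements Event {
    private Map<String, String> headers = new HashMap<>();
    private byte[] body = new byte[0];

    @Override public Map<String, String> getHeaders() { return headers; }
    @Override public void setHeaders(Map<String, String> headers) { this.headers = headers; }
    @Override public byte[] getBody() { return body; }
    @Override public void setBody(byte[] body) { this.body = body; }

    // Convenience factory (an assumption of this sketch): wrap one log line as the event body.
    static SketchEvent ofLine(String line) {
        SketchEvent event = new SketchEvent();
        event.setBody(line.getBytes(StandardCharsets.UTF_8));
        return event;
    }
}
```

A client would typically build one such event per log line and attach routing hints, for example a host name, as headers.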

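Core Concept: Client

A Client is an entity that wraps raw log data into events and sends them to one or more agents.

Examples:
- Flume log4j Appender
- A custom Client built with the Client SDK (org.apache.flume.api)
Its purpose is to decouple Flume from the systems that produce the data.
A Client is not required in a Flume topology.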

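Core Concept: Agent

An Agent contains Sources, Channels, Sinks, and other components, and uses them to move events from one node to the next or to their final destination.

Agents are the basic building blocks of a Flume flow.
Flume provides configuration, lifecycle management, and monitoring support for these components.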

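Core Concept: Source

A Source receives events, or produces them through some specific mechanism, and puts them in batches onto one or more Channels. There are two kinds of Source: event-driven and pollable.

Types of Source:
- Sources that integrate with well-known systems: Syslog, Netcat
- Sources that generate events themselves: Exec, SEQ
- IPC Sources for agent-to-agent communication: Avro
A Source must be associated with at least one Channel.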

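Core Concept: Channel

A Channel sits between a Source and a Sink and buffers incoming events. Once a Sink has successfully delivered events to the next-hop Channel or to the final destination, they are removed from the Channel.

Different Channels offer different levels of durability:
- Memory Channel: volatile
- File Channel: backed by a WAL (Write-Ahead Log)
- JDBC Channel: backed by an embedded database
Channels are transactional.
They provide weak ordering guarantees.
A Channel can work with any number of Sources and Sinks.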
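
The transactional behaviour described above (events leave the Channel only after the Sink has delivered them) can be sketched as follows. This is a single-threaded illustration under simplifying assumptions: SketchChannel and its method names are hypothetical, event bodies are plain strings, and real Flume channels are considerably more involved.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical single-threaded sketch of a bounded, transactional channel.
final class SketchChannel {
    private final int capacity;
    private final Deque<String> queue = new ArrayDeque<>();
    // Events handed to the sink in the currently open transaction.
    private final List<String> taken = new ArrayList<>();

    SketchChannel(int capacity) { this.capacity = capacity; }

    // Put a whole batch or nothing: the batch is rejected if it does not fit.
    boolean putBatch(List<String> batch) {
        if (queue.size() + taken.size() + batch.size() > capacity) return false;
        queue.addAll(batch);
        return true;
    }

    // Hand the next event to the sink, but remember it until commit or rollback.
    String take() {
        String event = queue.pollFirst();
        if (event != null) taken.add(event);
        return event;
    }

    // Delivery succeeded: the taken events are removed for good.
    void commit() { taken.clear(); }

    // Delivery failed: put the taken events back at the head in their original order.
    void rollback() {
        for (int i = taken.size() - 1; i >= 0; i--) queue.addFirst(taken.get(i));
        taken.clear();
    }

    int size() { return queue.size(); }
}
```

In this sketch a sink calls take() for each event it forwards, then commit() once the next hop has accepted them, or rollback() if forwarding failed, so that nothing is lost between hops.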

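Core Concept: Sink

A Sink delivers events to the next hop or to the final destination and removes them from the Channel once delivery succeeds.

Types of Sink:
- Terminal Sinks that store events at their final destination, e.g. HDFS, HBase
- Sinks that simply consume events, e.g. Null Sink
- IPC Sinks for agent-to-agent communication: Avro
A Sink must be attached to exactly one Channel.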

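Flow Reliability

Reliability is based on:
- Transactional exchange between Agents
- The durability characteristics of the Channels in the flow
Availability:
- Built-in load balancing support
- Built-in failover support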

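Core Concept: Interceptor

A Source can have a chain of Interceptors applied to it; they decorate and filter events where necessary, in a preconfigured order.

Built-in Interceptors can add headers to an event, such as a timestamp, a host name, or a static marker.
Custom Interceptors can inspect the event payload (that is, read the raw log line) and create specific headers where needed.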
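
The idea of an ordered interceptor chain that decorates headers can be sketched like this. It is an illustration only: SketchInterceptors and its methods are hypothetical, an interceptor is modelled as a plain function over the header map, and the header names are assumptions of the sketch rather than a statement about Flume's API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical sketch: an interceptor is a function that decorates the headers of one event.
final class SketchInterceptors {

    // Adds a "timestamp" header (epoch milliseconds) unless the event already carries one.
    static Consumer<Map<String, String>> timestamp(long nowMillis) {
        return headers -> headers.putIfAbsent("timestamp", Long.toString(nowMillis));
    }

    // Adds a fixed key/value marker to every event.
    static Consumer<Map<String, String>> staticHeader(String key, String value) {
        return headers -> headers.put(key, value);
    }

    // Applies the interceptors in the configured order and returns the decorated copy.
    static Map<String, String> apply(List<Consumer<Map<String, String>>> chain,
                                     Map<String, String> headers) {
        Map<String, String> out = new HashMap<>(headers);
        for (Consumer<Map<String, String>> interceptor : chain) {
            interceptor.accept(out);
        }
        return out;
    }
}
```

Because the chain runs in order, a later interceptor sees (and may overwrite) headers set by an earlier one.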

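Core Concept: Channel Selector

A Channel Selector lets a Source choose one or more Channels, out of all the Channels attached to it, based on preset criteria.

Built-in Channel Selectors:
- Replicating: the event is copied to every attached channel
- Multiplexing: the event is routed to specific channels based on a header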
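
The two selection strategies above can be sketched as simple lookups. The class, method, header, and channel names here are hypothetical, and the sketch only illustrates the routing decision, not how Flume implements it.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the two built-in selection strategies.
final class SketchSelectors {

    // Replicating: every event goes to all channels attached to the source.
    static List<String> replicating(List<String> allChannels) {
        return allChannels;
    }

    // Multiplexing: look up the value of one header in a mapping;
    // events with a missing or unmapped value go to the default channels.
    static List<String> multiplexing(Map<String, String> headers, String headerName,
                                     Map<String, List<String>> mapping,
                                     List<String> defaultChannels) {
        String value = headers.get(headerName);
        List<String> chosen = (value == null) ? null : mapping.get(value);
        return chosen != null ? chosen : defaultChannels;
    }
}
```

For example, with a mapping of "US" to channels c2 and c3 and a default of c4, an event whose header carries "US" lands on c2 and c3, while any unmapped value falls back to c4.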

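Core Concept: Sink Processor

Several Sinks can be combined into a Sink Group. A Sink Processor is responsible for invoking one Sink from a given Sink Group. It can load-balance across all the Sinks in the group, or fail over to another Sink when one fails.

Flume implements load balancing and failover through Sink Processors.
Built-in Sink Processors:
- Load Balancing Sink Processor: uses RANDOM, ROUND_ROBIN, or a custom selection algorithm
- Failover Sink Processor
- Default Sink Processor (single Sink)
Every Sink obtains events from its Channel by polling; the polling is driven by the Sink Runner.
The Sink Processor acts as a proxy for the Sink.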
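
A rough sketch of the two selection policies follows. SketchSinkProcessors is a hypothetical class, sinks are represented by their names, and the health check is passed in as a predicate; Flume's real processors are more elaborate than this.

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of how a sink processor might pick one sink from a group.
final class SketchSinkProcessors {
    private int next = 0; // round-robin cursor

    // Load balancing (ROUND_ROBIN): hand out the sinks in rotation, one per call.
    String roundRobin(List<String> sinks) {
        String chosen = sinks.get(next % sinks.size());
        next = (next + 1) % sinks.size();
        return chosen;
    }

    // Failover: sinks are listed from highest to lowest priority;
    // the first one that passes the health check is used.
    static String failover(List<String> sinksByPriority, Predicate<String> isHealthy) {
        for (String sink : sinksByPriority) {
            if (isHealthy.test(sink)) return sink;
        }
        return null; // no sink is currently available
    }
}
```

With sinks k1 and k2, roundRobin alternates between them on successive calls, while failover keeps returning k1 until the health check rejects it.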

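Summary

Flume Installation and Deployment

When collecting and receiving logs with Flume, pay attention to permissions. On Ubuntu the default user is not root, and some logs, such as system logs, can only be read with root privileges.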

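Collecting logs with Flume:

Starting the agents from a script:

nohup flume-ng agent -n agent -c conf -f flume/conf/flume-node.conf &
(starts the agent in the background, using the given agent name and configuration file)

nohup flume-ng agent -n agent -c conf -f flume/conf/flume-master.conf &
(the first "agent" is the flume-ng subcommand; the "agent" after -n is the agent name used in the configuration file)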

The following configuration ships logs from multiple machines to a Hadoop cluster:

master configuration:

Final parameters:

agent.sources = source1
agent.channels = memoryChannel
agent.sinks = sink1

# For each one of the sources, the type is defined
agent.sources.source1.type = avro
# Listen on this host's IP address and port to receive logs
agent.sources.source1.bind = 172.28.0.61
agent.sources.source1.port = 23004
# Use the memory channel
agent.sources.source1.channels = memoryChannel

# Each sink's type must be defined
#agent.sinks.loggerSink.channel = memoryChannel

# Each channel's type is defined.
agent.channels.memoryChannel.type = memory

# Other config values specific to each type of channel (sink or source)
# can be defined as well
# In this case, it specifies the capacity of the memory channel
agent.channels.memoryChannel.capacity = 10000
agent.channels.memoryChannel.transactionCapacity = 10000
agent.channels.memoryChannel.keep-alive = 1000

agent.sinks.sink1.type = hdfs
#agent.sinks.sink1.hdfs.path = hdfs://172.28.0.61:9000/hmbbs/%y-%m-%d/%H%M%S
# Write to the Hadoop cluster
agent.sinks.sink1.hdfs.path = hdfs://172.28.0.61:9000/hmbbs/%y-%m-%d
agent.sinks.sink1.hdfs.fileType = DataStream
agent.sinks.sink1.hdfs.writeFormat = TEXT
agent.sinks.sink1.hdfs.round = true
agent.sinks.sink1.hdfs.roundValue = 5
agent.sinks.sink1.hdfs.roundUnit = minute
agent.sinks.sink1.hdfs.rollInterval = 300
agent.sinks.sink1.hdfs.rollSize = 0
agent.sinks.sink1.hdfs.rollCount = 0
agent.sinks.sink1.hdfs.callTimeout = 100000
agent.sinks.sink1.hdfs.request-timeout = 100000
agent.sinks.sink1.hdfs.connect-timeout = 80000
agent.sinks.sink1.hdfs.useLocalTimeStamp = true
agent.sinks.sink1.channel = memoryChannel
agent.sinks.sink1.hdfs.filePrefix = ats-
#agent.sinks.k1.hdfs.fileSuffix = .log

node configuration:

agent.sources = exec-source
agent.sinks = sink1
agent.channels = memoryChannel

agent.sources.exec-source.type = exec
# File to monitor
agent.sources.exec-source.command = tail -F /home/mike/flumelog/tt.log
agent.sources.exec-source.channels = memoryChannel

# The original listing set this channel's type twice (memory, then file).
# A properties file keeps the last value, so the channel is effectively a
# file channel despite its name.
#agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.type = file
agent.channels.memoryChannel.capacity = 1000
agent.channels.memoryChannel.keep-alive = 1000

agent.sinks.sink1.type = avro
# Address and port of the receiving end, i.e. the master
agent.sinks.sink1.hostname = 172.28.0.61
agent.sinks.sink1.port = 23004
agent.sinks.sink1.channel = memoryChannel
#agent.sinks.sink1.rollInterval = 1000

#agent.sinks.hdfs-sink.type = hdfs
#agent.sinks.hdfs-sink.hdfs.path = hdfs://<Host-Name of name node>/
#agent.sinks.hdfs-sink.hdfs.filePrefix = apacheaccess
#agent.channels.ch1.type = memory
#agent.channels.ch1.capacity = 1000
#agent.sources.exec-source.channels = ch1
#agent.sinks.hdfs-sink.channel = ch1

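Key parameters:

agent.sinks.sink1.hdfs.rollInterval=30 (roll to a new file based on elapsed time; unit: seconds)

agent.sinks.sink1.hdfs.rollSize=0 (roll to a new file based on file size; unit: bytes)

agent.sinks.sink1.hdfs.rollCount=0 (roll to a new file based on the number of events, for example lines)

A value of 0 disables the corresponding trigger.

hdfs.request-timeout=80000 (unit: milliseconds)

agent.sinks.sink1.hdfs.connect-timeout=80000 (unit: milliseconds)

connect-timeout: Amount of time (ms) to allow for the first (handshake) request.

request-timeout: Amount of time (ms) to allow for requests after the first. (This one can be set larger.)

In the Flume User Guide these two descriptions belong to the Avro sink; the HDFS sink's own timeout property is hdfs.callTimeout.

The three roll parameters control when the sink closes the current file and starts a new one. For a change to take effect, delete the files under .flume/ on the client and then restart the client.

Set the timeouts generously; otherwise errors are likely.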
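
How the three roll settings combine can be sketched as a single predicate. SketchRollPolicy and shouldRoll are hypothetical names, and the sketch only captures the rule that any enabled trigger causes a roll while a value of 0 switches that trigger off; it is not the sink's actual code.

```java
// Hypothetical sketch of the roll decision for one open file.
final class SketchRollPolicy {

    // Returns true when any enabled trigger has been reached.
    // A setting of 0 disables the corresponding trigger.
    static boolean shouldRoll(long openSeconds, long bytesWritten, long eventsWritten,
                              long rollInterval, long rollSize, long rollCount) {
        boolean byTime = rollInterval > 0 && openSeconds >= rollInterval;
        boolean bySize = rollSize > 0 && bytesWritten >= rollSize;
        boolean byCount = rollCount > 0 && eventsWritten >= rollCount;
        return byTime || bySize || byCount;
    }
}
```

With rollInterval=300, rollSize=0 and rollCount=0, as in the master configuration, only the elapsed time matters: a file that has been open for 299 seconds stays open regardless of its size, and it is rolled once 300 seconds have passed.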

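Flume in a real environment:

ATS log processing:

ATS writes logs in a fixed format and produces a new file in that format at regular intervals. A scheduled job copies each such file into the directory that Flume monitors and then moves the original to a backup directory, where it remains available as the raw log for other uses. This works around the fact that Flume's log collection is not fully real-time.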
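
The copy-then-archive step described above could look roughly like the sketch below. All names (SketchAtsShipper, ship, demo, the directory and file names) are hypothetical, and copying under a temporary name before renaming is an added assumption of this sketch, meant to keep a half-written file from being picked up; the original post does not say how its job does this.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch of the scheduled "copy to watched directory, then archive" job.
final class SketchAtsShipper {

    // Copies a finished log file into the watched directory, then moves the original to backup.
    static void ship(Path logFile, Path watchedDir, Path backupDir) throws IOException {
        Files.createDirectories(watchedDir);
        Files.createDirectories(backupDir);
        Path name = logFile.getFileName();
        // Assumption of this sketch: copy under a temporary name, then rename into place.
        Path tmp = watchedDir.resolve(name + ".tmp");
        Files.copy(logFile, tmp, StandardCopyOption.REPLACE_EXISTING);
        Files.move(tmp, watchedDir.resolve(name), StandardCopyOption.ATOMIC_MOVE);
        // Keep the original as the raw log for other uses.
        Files.move(logFile, backupDir.resolve(name), StandardCopyOption.REPLACE_EXISTING);
    }

    // Small self-check with temporary directories; true when all three effects are observed.
    static boolean demo() {
        try {
            Path root = Files.createTempDirectory("ats-demo");
            Path log = Files.write(root.resolve("ats.log"),
                    "one line\n".getBytes(StandardCharsets.UTF_8));
            ship(log, root.resolve("watched"), root.resolve("backup"));
            return !Files.exists(log)
                    && Files.exists(root.resolve("watched").resolve("ats.log"))
                    && Files.exists(root.resolve("backup").resolve("ats.log"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In practice such a job would be triggered by cron or a similar scheduler each time ATS finishes a log file.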

Notes:

2. JAVA_OPTS in the flume-ng startup script under the bin directory should be set larger; otherwise out-of-memory errors occur. The default is 20 MB, as shown here:

JAVA_OPTS="-Xmx20m"

3. The capacity and transactionCapacity of the memory channel on the server side must be larger than those on the client side; otherwise an error like the following is reported:

1] (org.apache.flume.source.AvroSource.appendBatch:261) - Avro source r1: Unable to process event batch. Exception follows.

org.apache.flume.ChannelException: Unable to put batch on required channel: org.apache.flume.channel.MemoryChannel{name: ..}

ERROR hdfs.BucketWriter: Hit max consecutive under-replication rotations (30); will not continue rolling files under this path due to under-replication

This last error shows up when parameters are changed and right after startup, while a number of small files are being generated; after that the agent runs normally.

Latest configuration: replicate events to both HDFS and a local file

# master configuration, flume-master.conf

# Define the source, channels, and sinks
agent.sources = source1
agent.channels = memoryChannel1 memoryChannel2
agent.sinks = sink1 sink2

# Source parameters: this host's IP address and port; adjust to your environment
agent.sources.source1.type = avro
agent.sources.source1.selector.type = replicating
agent.sources.source1.bind = *.*.*.*
agent.sources.source1.port = 23004
agent.sources.source1.channels = memoryChannel1 memoryChannel2

# Add a timestamp interceptor; without it an exception is thrown at runtime
agent.sources.source1.interceptors = i1
agent.sources.source1.interceptors.i1.type = timestamp

# Channel parameters
agent.channels.memoryChannel1.type = memory
agent.channels.memoryChannel1.capacity = 10000
agent.channels.memoryChannel1.transactionCapacity = 10000
agent.channels.memoryChannel1.keep-alive = 1000

agent.channels.memoryChannel2.type = memory
agent.channels.memoryChannel2.capacity = 10000
agent.channels.memoryChannel2.transactionCapacity = 10000
agent.channels.memoryChannel2.keep-alive = 1000

# Sink parameters; change the Hadoop cluster address as needed
#agent.sinks.sink1.hdfs.path = hdfs://172.28.0.61:9000/hmbbs/%y-%m-%d/%H%M%S
agent.sinks.sink1.type = hdfs
agent.sinks.sink1.hdfs.path = hdfs://*.*.*.*:9000/sdg
agent.sinks.sink1.hdfs.fileType = DataStream
agent.sinks.sink1.hdfs.writeFormat = TEXT
agent.sinks.sink1.hdfs.round = true
agent.sinks.sink1.hdfs.roundValue = 5
agent.sinks.sink1.hdfs.roundUnit = minute
agent.sinks.sink1.hdfs.rollInterval = 60
agent.sinks.sink1.hdfs.rollSize = 0
agent.sinks.sink1.hdfs.rollCount = 0
agent.sinks.sink1.hdfs.callTimeout = 100000
agent.sinks.sink1.hdfs.request-timeout = 100000
agent.sinks.sink1.hdfs.connect-timeout = 80000
agent.sinks.sink1.hdfs.useLocalTimeStamp = true
agent.sinks.sink1.channel = memoryChannel1
agent.sinks.sink1.hdfs.filePrefix = ats-
#agent.sinks.k1.hdfs.fileSuffix = .log

# Write to a local file
agent.sinks.sink2.type = file_roll
agent.sinks.sink2.channel = memoryChannel2
agent.sinks.sink2.sink.rollInterval = 0
#agent.sinks.sink2.sink.serializer = TEXT
#agent.sinks.sink2.sink.batchSize = 1000
agent.sinks.sink2.sink.directory = /home/hadoop/atslog/
