Introduction to Flume

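Components:

Flume Agent

 A Flume deployment consists of one or more agents.
 Each agent is an independent daemon process (a JVM).
 An agent receives data from clients or from other agents and quickly forwards it to the next destination, which may be another agent.
 An agent consists mainly of three components: source, channel, and sink.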
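The agent-to-agent hand-off described above can be sketched as a two-hop flow, in which the first agent's Avro sink sends to the second agent's Avro source. This is only a minimal sketch: the agent name `a2`, the host `collector-host`, and port `4545` are placeholders, not values from the original setup.

```properties
# Hypothetical two-hop flow: a1 forwards events to a2 over Avro.
# collector-host and 4545 are placeholders for the machine running a2.

# --- First agent (a1): netcat source -> memory channel -> avro sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

a1.channels.c1.type = memory

a1.sinks.k1.type = avro
a1.sinks.k1.hostname = collector-host
a1.sinks.k1.port = 4545
a1.sinks.k1.channel = c1

# --- Second agent (a2): avro source -> memory channel -> logger sink
a2.sources = r1
a2.channels = c1
a2.sinks = k1

a2.sources.r1.type = avro
a2.sources.r1.bind = 0.0.0.0
a2.sources.r1.port = 4545
a2.sources.r1.channels = c1

a2.channels.c1.type = memory

a2.sinks.k1.type = logger
a2.sinks.k1.channel = c1
```

Both definitions can live in one file, because `flume-ng agent -n <name>` loads only the properties that start with the given agent name.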

agent source

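 A Flume source consumes events delivered to it by an external source (a data generator), such as a web server.
 The external source sends its events to Flume in a format that Flume recognizes (an event).
 When a Flume source receives an event, it stores the event in one or more channels.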
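As a sketch of a source writing to more than one channel, the configuration below fans one netcat source out to two memory channels, each drained by its own logger sink. The second channel `c2` and sink `k2` are names introduced here for illustration only.

```properties
# Hypothetical fan-out: one source, two channels, two sinks
a1.sources = r1
a1.channels = c1 c2
a1.sinks = k1 k2

a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# A source lists every channel it writes to, separated by spaces
a1.sources.r1.channels = c1 c2
# replicating (the default selector) copies each event to all listed channels
a1.sources.r1.selector.type = replicating

a1.channels.c1.type = memory
a1.channels.c2.type = memory

# Each sink reads from exactly one channel
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
a1.sinks.k2.type = logger
a1.sinks.k2.channel = c2
```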

agent channel

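 Channel: a passive store. The channel buffers each event until a sink consumes it.
 A channel is therefore a temporary storage container: it buffers the event data received from the source until sinks consume it, acting as a bridge between the source and the sink.
 Channel operations are transactional, which keeps data consistent during sending and receiving. A channel can also be connected to any number of sources and sinks.
 The maximum number of events a channel holds can be set with a parameter (capacity).
 When data loss is unacceptable, File Channel is usually chosen over Memory Channel.
   Memory Channel: keeps events in memory; very high throughput, but events can be lost if the agent process dies.
   File Channel: a transactional implementation backed by local disk; events are not lost (implemented with a write-ahead log).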
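A File Channel definition might look like the sketch below. The two directories are hypothetical paths chosen for illustration; the capacity values shown are believed to match the File Channel defaults and should be tuned to the workload.

```properties
a1.channels = c1
a1.channels.c1.type = file
# Hypothetical directories; use paths on a local disk the agent can write to
a1.channels.c1.checkpointDir = /var/flume/checkpoint
a1.channels.c1.dataDirs = /var/flume/data
# Maximum number of events the channel may hold
a1.channels.c1.capacity = 1000000
# Maximum number of events per transaction
a1.channels.c1.transactionCapacity = 10000
```

Swapping this in for a memory channel needs no change to the source or sink bindings, since they refer to the channel only by its name (`c1`).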

Monitoring a network port

netcat

 # example.conf: A single-node Flume configuration
 
 # Name the components on this agent
 a1.sources = r1
 a1.sinks = k1
 a1.channels = c1
 
 # Describe/configure the source
 a1.sources.r1.type = netcat
 a1.sources.r1.bind = localhost
 a1.sources.r1.port = 44444
 
 # Describe the sink
 a1.sinks.k1.type = logger
 
 # Use a channel which buffers events in memory
 a1.channels.c1.type = memory
 a1.channels.c1.capacity = 1000
 a1.channels.c1.transactionCapacity = 100
 
 # Bind the source and sink to the channel
 a1.sources.r1.channels = c1
 a1.sinks.k1.channel = c1

Start command: flume-ng agent -n a1 -c $FLUME_HOME/conf -f $FLUME_HOME/conf/example.conf -Dflume.root.logger=INFO,console

Monitoring a specific file

exec

 # Name the components on this agent
 a1.sources = r1
 a1.sinks = k1
 a1.channels = c1
 
 # Describe/configure the source
 a1.sources.r1.type = exec
 a1.sources.r1.command = tail -F /home/log/data.log
 a1.sources.r1.shell = /bin/bash -c
 
 # Describe the sink
 a1.sinks.k1.type = logger
 
 # Use a channel which buffers events in memory
 a1.channels.c1.type = memory
 
 # Bind the source and sink to the channel
 a1.sources.r1.channels = c1
 a1.sinks.k1.channel = c1

Start command: flume-ng agent -n a1 -c $FLUME_HOME/conf -f $FLUME_HOME/conf/exec-memory-logger.conf -Dflume.root.logger=INFO,console

Monitoring a log file with Flume and persisting it to HDFS

 # Name the components on this agent
 a1.sources = r1
 a1.sinks = k1
 a1.channels = c1
 
 # Describe/configure the source
 a1.sources.r1.type = exec
 a1.sources.r1.command = tail -F /home/data/data.log
 a1.sources.r1.channels = c1
 
 # Describe the sink
 a1.sinks.k1.type = hdfs
 a1.sinks.k1.channel = c1
 a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/
 a1.sinks.k1.hdfs.filePrefix = events-
 a1.sinks.k1.hdfs.fileType=DataStream
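 # Required: hdfs.path above uses time escapes (%y-%m-%d), so the sink needs a timestamp.
 # Keep this comment on its own line; a trailing "#..." would become part of the value.
 a1.sinks.k1.hdfs.useLocalTimeStamp = true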
 
 # Use a channel which buffers events in memory
 a1.channels.c1.type = memory
 
 # Bind the source and sink to the channel
 a1.sources.r1.channels = c1
 a1.sinks.k1.channel = c1

A version written by someone else

 # Name the components on this agent
 a1.sources = r1
 a1.sinks = k1
 a1.channels = c1
 
 # Describe/configure the source
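 ## exec means Flume runs the given command and reads data from its output
 a1.sources.r1.type = exec
 ## Use the tail command to read the data
 a1.sources.r1.command = tail -F /home/tuzq/software/flumedata/test.log
 a1.sources.r1.channels = c1
 
 # Describe the sink
 ## Write to HDFS; the sink type determines which parameters below apply
 a1.sinks.k1.type = hdfs
 ## A sink can connect to only one channel, while a source can be configured with several
 a1.sinks.k1.channel = c1
 ## Where the HDFS sink writes its files. The path is not fixed: the escape sequences are replaced with time values, so the output directory name changes over time
 a1.sinks.k1.hdfs.path = /flume/tailout/%y-%m-%d/%H%M/
 ## Prefix of the generated file names
 a1.sinks.k1.hdfs.filePrefix = events-
 ## Whether to round the timestamp down before it is substituted into the path; true enables rounding
 a1.sinks.k1.hdfs.round = true
 ## Round down to a multiple of 1 unit, so the directory changes every minute
 a1.sinks.k1.hdfs.roundValue = 1
 ## Unit used for rounding the timestamp: minute
 a1.sinks.k1.hdfs.roundUnit = minute
 ## Roll to a new file every 3 seconds
 a1.sinks.k1.hdfs.rollInterval = 3
 ## Roll once the file exceeds 20 bytes
 a1.sinks.k1.hdfs.rollSize = 20
 ## Roll after 5 events have been written
 a1.sinks.k1.hdfs.rollCount = 5
 ## Number of events written to the file before it is flushed to HDFS
 a1.sinks.k1.hdfs.batchSize = 10
 ## Use the local time instead of a timestamp taken from the event header
 a1.sinks.k1.hdfs.useLocalTimeStamp = true
 # File type to generate. The default is SequenceFile; DataStream writes plain text
 a1.sinks.k1.hdfs.fileType = DataStream
 
 # Use a channel which buffers events in memory
 ## Buffer events in memory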
 a1.channels.c1.type = memory
 a1.channels.c1.capacity = 1000
 a1.channels.c1.transactionCapacity = 100
 
 # Bind the source and sink to the channel
 a1.sources.r1.channels = c1
 a1.sinks.k1.channel = c1
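The roll values in the configuration above (3 seconds, 20 bytes, 5 events) are convenient for watching files appear during a demo, but they produce a very large number of tiny HDFS files. A hedged sketch of more conservative settings follows; the specific numbers are illustrative assumptions, not recommendations from the original author.

```properties
# Illustrative roll settings for larger output files (values are assumptions)
# Roll to a new file every 10 minutes
a1.sinks.k1.hdfs.rollInterval = 600
# Roll when the file reaches about 128 MB
a1.sinks.k1.hdfs.rollSize = 134217728
# 0 disables rolling based on the number of events
a1.sinks.k1.hdfs.rollCount = 0
```

A file is rolled as soon as any enabled condition is met, so setting `rollCount` to 0 leaves only time and size as triggers.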
