Hadoop-(Flume)

1. Introduction to Flume

1.1. Overview

  • Flume is a distributed, reliable, and highly available system for collecting, aggregating, and transporting massive volumes of log data.
  • Flume can collect source data in many forms (files, socket packets, directories, Kafka, and so on), and it can sink the collected data to external storage systems such as HDFS, HBase, Hive, and Kafka.
  • Typical collection requirements can be met with just a simple Flume configuration.
  • Flume also has good support for custom extensions for special scenarios, 
    so it can be used for most day-to-day data-collection scenarios.

1.2. How It Works

  1. The core role in a distributed Flume system is the agent; a Flume collection system is formed by connecting individual agents together.

  2. Each agent acts as a data courier and contains three components (a minimal wiring sketch follows this list):

    1. Source: the collection component, which connects to the data source to obtain data.

    2. Sink: the delivery component, which passes data on to the next agent or to the final storage system.

    3. Channel: the transport channel component, which moves data from the source to the sink.
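
As a minimal sketch of how the three components are wired together (the component names and values below are illustrative, not taken from this article; a complete working example follows in section 2):

  # one agent named a1 with one source, one channel, one sink
  a1.sources = r1
  a1.channels = c1
  a1.sinks = k1

  a1.sources.r1.type = netcat          # any supported source type
  a1.sources.r1.bind = localhost
  a1.sources.r1.port = 44444

  a1.channels.c1.type = memory

  a1.sinks.k1.type = logger

  # wiring: a source may feed several channels, a sink reads from exactly one channel
  a1.sources.r1.channels = c1
  a1.sinks.k1.channel = c1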

1.3. Flume Topology Diagrams

Simple topology

A single agent collecting data

Complex topology

Multiple agents chained in series

2. Flume Hands-On Example

Example: use the telnet command to send some data to a machine over the network, then use Flume to collect that data from the network port.

2.1. Installing and Deploying Flume

Step 1: Download, Extract, and Edit the Configuration File

Download URL:

http://archive.apache.org/dist/flume/1.8.0/apache-flume-1.8.0-bin.tar.gz

Installing Flume is very simple: just extract the archive. The only prerequisite is an existing Hadoop environment.

Upload the installation package to the node where the data source resides.

Here we install it on the third machine (node03).

  1. cd /export/softwares/
  2. tar -zxvf apache-flume-1.8.0-bin.tar.gz -C ../servers/
  3. cd /export/servers/apache-flume-1.8.0-bin/conf
  4. cp flume-env.sh.template flume-env.sh
  5. vim flume-env.sh
  6. export JAVA_HOME=/export/servers/jdk1.8.0_141
Step 2: Develop the Configuration File

Describe the collection plan for the data-collection requirement in a configuration file (the file name can be anything you like).

Configure the file for our network collection: 
create a new configuration file (the collection plan) under Flume's conf directory.

  1. vim /export/servers/apache-flume-1.8.0-bin/conf/netcat-logger.conf
  1. # Name the components of this agent
  2. a1.sources = r1
  3. a1.sinks = k1
  4. a1.channels = c1
  5. # Describe/configure the source component r1
  6. a1.sources.r1.type = netcat
  7. a1.sources.r1.bind = 192.168.174.
  8. a1.sources.r1.port =
  9. # Describe/configure the sink component k1
  10. a1.sinks.k1.type = logger
  11. # Describe/configure the channel component; here we use an in-memory buffer
  12. a1.channels.c1.type = memory
  13. a1.channels.c1.capacity =
  14. a1.channels.c1.transactionCapacity =
  15. # Wire up the connections between source, channel, and sink
  16. a1.sources.r1.channels = c1
  17. a1.sinks.k1.channel = c1

Step 3: Start the Agent with the Configuration File

Start a Flume agent on the appropriate node, pointing it at the collection-plan configuration file.

First use the simplest possible example to check that the environment works, 
then start the agent to collect data.

  1. bin/flume-ng agent -c conf -f conf/netcat-logger.conf -n a1 -Dflume.root.logger=INFO,console
  • -c conf specifies the directory containing Flume's own configuration files
  • -f conf/netcat-logger.conf specifies the collection plan we wrote
  • -n a1 specifies the name of this agent
Step 4: Install Telnet for Testing

Install the telnet client on node02 to simulate sending data.

  1. yum -y install telnet
  2. telnet node03 # use telnet to simulate sending data (see the note below on the port)
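
The port number for the telnet command was lost from this page; it must match the port configured on the netcat source in netcat-logger.conf. Assuming the port 44444 that Flume examples commonly use (an assumption, not a value from the original), the test looks like this:

  # connect to the netcat source (44444 is an assumed port)
  telnet node03 44444
  # type any line and press Enter; the agent's logger sink should print
  # the line as an event body on its console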

2.2. Collection Examples

2.2.3. Collecting a Directory into HDFS

Requirement

A particular directory on a server keeps receiving new files; whenever a new file appears, it must be collected into HDFS.

Approach

Based on the requirement, first define the following three key elements:

  1. The source component: monitor a file directory with spooldir
    1. It watches a directory; whenever a new file appears there, the file's contents are collected.
    2. Once a file has been collected, the agent automatically renames it with the suffix COMPLETED.
    3. The watched directory must never receive two files with the same name; otherwise the source reports an error and stops working.
  2. The sink component: the HDFS file system, i.e. the hdfs sink
  3. The channel component: either a file channel or a memory channel
Step 1: Flume Configuration File
  1. cd /export/servers/apache-flume-1.8.0-bin/conf
  2. mkdir -p /export/servers/dirfile
  3. vim spooldir.conf
  1. # Name the components on this agent
  2. a1.sources = r1
  3. a1.sinks = k1
  4. a1.channels = c1
  5. # Describe/configure the source
  6. ## Note: never drop files with duplicate names into the monitored directory
  7. a1.sources.r1.type = spooldir
  8. a1.sources.r1.spoolDir = /export/servers/dirfile
  9. a1.sources.r1.fileHeader = true
  10. # Describe the sink
  11. a1.sinks.k1.type = hdfs
  12. a1.sinks.k1.channel = c1
  13. a1.sinks.k1.hdfs.path = hdfs://node01:8020/spooldir/files/%y-%m-%d/%H%M/
  14. a1.sinks.k1.hdfs.filePrefix = events-
  15. # how often to round the directory path down (e.g. every 10 minutes)
  16. a1.sinks.k1.hdfs.round = true
  17. a1.sinks.k1.hdfs.roundValue =
  18. a1.sinks.k1.hdfs.roundUnit = minute
  19. # the roll* settings control how the file being written to HDFS is rolled
  20. # time interval (seconds)
  21. a1.sinks.k1.hdfs.rollInterval =
  22. # file size (bytes)
  23. a1.sinks.k1.hdfs.rollSize =
  24. # number of events
  25. a1.sinks.k1.hdfs.rollCount =
  26. a1.sinks.k1.hdfs.batchSize =
  27. # set a roll parameter to 0 to disable rolling on that criterion
  28. a1.sinks.k1.hdfs.useLocalTimeStamp = true
  29. # output file type; the default is SequenceFile, DataStream writes plain text
  30. a1.sinks.k1.hdfs.fileType = DataStream
  31. # Use a channel which buffers events in memory
  32. a1.channels.c1.type = memory
  33. # channel capacity (max events held in the channel)
  34. a1.channels.c1.capacity =
  35. # how many events are moved to the sink per transaction
  36. a1.channels.c1.transactionCapacity =
  37. # Bind the source and sink to the channel
  38. a1.sources.r1.channels = c1
  39. a1.sinks.k1.channel = c1

Channel parameters explained

capacity: the maximum number of events the channel can store 
transactionCapacity: the maximum number of events that can be taken from the source or delivered to the sink per transaction 
keep-alive: how long an event add or remove operation is allowed to wait
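
The concrete values were lost from this page. As a hedged reference, a memory channel is often configured with values like the following (the numbers are illustrative, not from the original):

  a1.channels.c1.type = memory
  # maximum number of events held in the channel
  a1.channels.c1.capacity = 1000
  # maximum events per put/take transaction; must not exceed capacity
  a1.channels.c1.transactionCapacity = 100
  # seconds a put/take may wait when the channel is full/empty (Flume's default is 3)
  a1.channels.c1.keep-alive = 3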

Step 2: Start Flume
  1. bin/flume-ng agent -c ./conf -f ./conf/spooldir.conf -n a1 -Dflume.root.logger=INFO,console
  2. # shortened form of the start command
Step 3: Upload Files to the Monitored Directory

Put different files into the directory below; note that file names must not repeat.

  1. cd /export/servers/dirfile

2.2.4. Collecting a File into HDFS

Requirement

A business system writes its logs with log4j, for example, and the log content keeps growing; the data appended to the log file must be collected into HDFS in real time.

Analysis

Based on the requirement, first define the following three key elements:

  • The source: monitor new file content with exec 'tail -F file'
  • The sink: the HDFS file system, i.e. the hdfs sink
  • The channel between source and sink: either a file channel or a memory channel
Step 1: Define the Flume Configuration File
  1. cd /export/servers/apache-flume-1.8.0-bin/conf
  2. vim tail-file.conf
  1. # Name the components on this agent
  2. a1.sources = r1
  3. a1.sinks = k1
  4. a1.channels = c1
  5. # Describe/configure the source
  6. a1.sources.r1.type = exec
  7. a1.sources.r1.command = tail -F /export/servers/taillogs/access_log
  8. a1.sources.r1.channels = c1
  9. # Describe the sink
  10. a1.sinks.k1.type = hdfs
  11. a1.sinks.k1.channel = c1
  12. a1.sinks.k1.hdfs.path = /flume/tailout/%y-%m-%d/%H-%M/
  13. a1.sinks.k1.hdfs.filePrefix = events-
  14. a1.sinks.k1.hdfs.round = true
  15. a1.sinks.k1.hdfs.roundValue =
  16. a1.sinks.k1.hdfs.roundUnit = minute
  17. a1.sinks.k1.hdfs.rollInterval =
  18. a1.sinks.k1.hdfs.rollSize =
  19. a1.sinks.k1.hdfs.rollCount =
  20. a1.sinks.k1.hdfs.batchSize =
  21. a1.sinks.k1.hdfs.useLocalTimeStamp = true
  23. # output file type; the default is SequenceFile, DataStream writes plain text
  23. a1.sinks.k1.hdfs.fileType = DataStream
  24. # Use a channel which buffers events in memory
  25. a1.channels.c1.type = memory
  26. a1.channels.c1.capacity =
  27. a1.channels.c1.transactionCapacity =
  28. # Bind the source and sink to the channel
  29. a1.sources.r1.channels = c1
  30. a1.sinks.k1.channel = c1
Step 2: Start Flume
  1. cd /export/servers/apache-flume-1.8.0-bin
  2. bin/flume-ng agent -c conf -f conf/tail-file.conf -n a1 -Dflume.root.logger=INFO,console
Step 3: Write a Shell Script That Continuously Appends to the File
  1. mkdir -p /export/servers/shells/
  2. cd /export/servers/shells/
  3. vim tail-file.sh
  1. #!/bin/bash
  2. while true
  3. do
  4. date >> /export/servers/taillogs/access_log;
  5. sleep 0.5;
  6. done
Step 4: Run the Script
  1. # create the directory
  2. mkdir -p /export/servers/taillogs
  3. # run the script
  4. sh /export/servers/shells/tail-file.sh

2.2.5. Cascading Agents

Analysis

The first agent collects the data from a file and sends it over the network to the second agent. 
The second agent receives the data sent by the first agent and saves it to HDFS.

Step 1: Install Flume on Node02

Copy the extracted Flume directory from node03 to node02.

  1. cd /export/servers
  2. scp -r apache-flume-1.8.0-bin/ node02:$PWD
Step 2: Configure Flume on Node02

Configure Flume on the node02 machine:

  1. cd /export/servers/apache-flume-1.8.0-bin/conf
  2. vim tail-avro-avro-logger.conf
  1. ##################
  2. # Name the components on this agent
  3. a1.sources = r1
  4. a1.sinks = k1
  5. a1.channels = c1
  6. # Describe/configure the source
  7. a1.sources.r1.type = exec
  8. a1.sources.r1.command = tail -F /export/servers/taillogs/access_log
  9. a1.sources.r1.channels = c1
  10. # Describe the sink
  11. ## the avro sink acts as a data sender
  12. a1.sinks = k1
  13. a1.sinks.k1.type = avro
  14. a1.sinks.k1.channel = c1
  15. a1.sinks.k1.hostname = node03
  16. a1.sinks.k1.port =
  17. a1.sinks.k1.batch-size =
  18. # Use a channel which buffers events in memory
  19. a1.channels.c1.type = memory
  20. a1.channels.c1.capacity =
  21. a1.channels.c1.transactionCapacity =
  22. # Bind the source and sink to the channel
  23. a1.sources.r1.channels = c1
  24. a1.sinks.k1.channel = c1
Step 3: Write Data into the File with the Script

Simply copy the script and data directories from node03 to node02 by running the following on node03:

  1. cd /export/servers
  2. scp -r shells/ taillogs/ node02:$PWD
Step 4: Flume Configuration File on Node03

Develop the Flume configuration file on the node03 machine:

  1. cd /export/servers/apache-flume-1.8.0-bin/conf
  2. vim avro-hdfs.conf
  1. # Name the components on this agent
  2. a1.sources = r1
  3. a1.sinks = k1
  4. a1.channels = c1
  5. # Describe/configure the source
  6. ## the avro source acts as a receiving server
  7. a1.sources.r1.type = avro
  8. a1.sources.r1.channels = c1
  9. a1.sources.r1.bind = node03
  10. a1.sources.r1.port =
  11. # Describe the sink
  12. a1.sinks.k1.type = hdfs
  13. a1.sinks.k1.hdfs.path = hdfs://node01:8020/av/%y-%m-%d/%H%M/
  14. a1.sinks.k1.hdfs.filePrefix = events-
  15. a1.sinks.k1.hdfs.round = true
  16. a1.sinks.k1.hdfs.roundValue =
  17. a1.sinks.k1.hdfs.roundUnit = minute
  18. a1.sinks.k1.hdfs.rollInterval =
  19. a1.sinks.k1.hdfs.rollSize =
  20. a1.sinks.k1.hdfs.rollCount =
  21. a1.sinks.k1.hdfs.batchSize =
  22. a1.sinks.k1.hdfs.useLocalTimeStamp = true
  23. # output file type; the default is SequenceFile, DataStream writes plain text
  24. a1.sinks.k1.hdfs.fileType = DataStream
  25. # Use a channel which buffers events in memory
  26. a1.channels.c1.type = memory
  27. a1.channels.c1.capacity =
  28. a1.channels.c1.transactionCapacity =
  29. # Bind the source and sink to the channel
  30. a1.sources.r1.channels = c1
  31. a1.sinks.k1.channel = c1
Step 5: Start in Order

Start the Flume process on the node03 machine:

  1. cd /export/servers/apache-flume-1.8.0-bin
  2. bin/flume-ng agent -c conf -f conf/avro-hdfs.conf -n a1 -Dflume.root.logger=INFO,console

Start the Flume process on the node02 machine:

  1. cd /export/servers/apache-flume-1.8.0-bin/
  2. bin/flume-ng agent -c conf -f conf/tail-avro-avro-logger.conf -n a1 -Dflume.root.logger=INFO,console

Start the shell script on node02 to generate the file:

  1. cd /export/servers/shells
  2. sh tail-file.sh

3. Flume High Availability: Failover

With the single-node Flume NG setup complete, we now build a highly available Flume NG cluster. The architecture is shown in the diagram below:

3.1. Role Assignment

The Flume agents and collectors are distributed as shown in the following table:

Name        HOST    Role
Agent1      node01  Web Server
Collector1  node02  AgentMstr1
Collector2  node03  AgentMstr2

As shown in the figure, data from Agent1 flows into both Collector1 and Collector2. Flume NG itself provides a failover mechanism that can switch over and recover automatically. In the figure there are three log-producing servers located in different machine rooms, and all of their logs must be collected into a single cluster for storage. Next we develop the configuration for the Flume NG cluster.

3.2. Installing and Configuring Node01

Copy the Flume installation package and the two file-generation directories from node03 to node01.

Run the following commands on node03:

  1. cd /export/servers
  2. scp -r apache-flume-1.8.0-bin/ node01:$PWD
  3. scp -r shells/ taillogs/ node01:$PWD

Develop the agent configuration file on the node01 machine:

  1. cd /export/servers/apache-flume-1.8.0-bin/conf
  2. vim agent.conf
  1. #agent1 name
  2. agent1.channels = c1
  3. agent1.sources = r1
  4. agent1.sinks = k1 k2
  5. #
  6. ##set group
  7. agent1.sinkgroups = g1
  8. #
  9. agent1.sources.r1.channels = c1
  10. agent1.sources.r1.type = exec
  11. agent1.sources.r1.command = tail -F /export/servers/taillogs/access_log
  12. #
  13. ##set channel
  14. agent1.channels.c1.type = memory
  15. agent1.channels.c1.capacity =
  16. agent1.channels.c1.transactionCapacity =
  17. #
  18. ## set sink1
  19. agent1.sinks.k1.channel = c1
  20. agent1.sinks.k1.type = avro
  21. agent1.sinks.k1.hostname = node02
  22. agent1.sinks.k1.port =
  23. #
  24. ## set sink2
  25. agent1.sinks.k2.channel = c1
  26. agent1.sinks.k2.type = avro
  27. agent1.sinks.k2.hostname = node03
  28. agent1.sinks.k2.port =
  29. #
  30. ##set sink group
  31. agent1.sinkgroups.g1.sinks = k1 k2
  32. #
  33. ##set failover
  34. agent1.sinkgroups.g1.processor.type = failover
  35. agent1.sinkgroups.g1.processor.priority.k1 =
  36. agent1.sinkgroups.g1.processor.priority.k2 =
  37. agent1.sinkgroups.g1.processor.maxpenalty =
  1. #agent1 name
  2. agent1.channels = c1
  3. agent1.sources = r1
  4. agent1.sinks = k1 k2
  5. #set group
  6. agent1.sinkgroups = g1
  7. #set channel
  8. agent1.channels.c1.type = memory
  9. agent1.channels.c1.capacity =
  10. agent1.channels.c1.transactionCapacity =
  11. agent1.sources.r1.channels = c1
  12. agent1.sources.r1.type = exec
  13. agent1.sources.r1.command = tail -F /root/logs/456.log
  14. # set sink1
  15. agent1.sinks.k1.channel = c1
  16. agent1.sinks.k1.type = avro
  17. agent1.sinks.k1.hostname = node02
  18. agent1.sinks.k1.port =
  19. # set sink2
  20. agent1.sinks.k2.channel = c1
  21. agent1.sinks.k2.type = avro
  22. agent1.sinks.k2.hostname = node03
  23. agent1.sinks.k2.port =
  24. #set sink group
  25. agent1.sinkgroups.g1.sinks = k1 k2
  26. #set failover
  27. agent1.sinkgroups.g1.processor.type = failover
  28. agent1.sinkgroups.g1.processor.priority.k1 =
  29. agent1.sinkgroups.g1.processor.priority.k2 =
  30. agent1.sinkgroups.g1.processor.maxpenalty =
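
The priority and maxpenalty values above were lost from this page. As a hedged sketch, a failover sink group typically gives the preferred collector the higher priority (the numbers below are illustrative, not from the original):

  agent1.sinkgroups.g1.sinks = k1 k2
  agent1.sinkgroups.g1.processor.type = failover
  # the sink with the higher priority is used first; k1 points at Collector1 (node02)
  agent1.sinkgroups.g1.processor.priority.k1 = 10
  agent1.sinkgroups.g1.processor.priority.k2 = 1
  # maximum back-off (in milliseconds) for a failed sink before it is retried
  agent1.sinkgroups.g1.processor.maxpenalty = 10000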

3.3. Configuring the Flume Collectors on Node02 and Node03

Edit the configuration file on the node02 machine:

  1. cd /export/servers/apache-flume-1.8.0-bin/conf
  2. vim collector.conf
  1. #set Agent name
  2. a1.sources = r1
  3. a1.channels = c1
  4. a1.sinks = k1
  5. #
  6. ##set channel
  7. a1.channels.c1.type = memory
  8. a1.channels.c1.capacity =
  9. a1.channels.c1.transactionCapacity =
  10. #
  11. ## other node,nna to nns
  12. a1.sources.r1.type = avro
  13. a1.sources.r1.bind = node02
  14. a1.sources.r1.port =
  15. a1.sources.r1.channels = c1
  16. #
  17. ##set sink to hdfs
  18. a1.sinks.k1.type=hdfs
  19. a1.sinks.k1.hdfs.path= hdfs://node01:8020/flume/failover/
  20. a1.sinks.k1.hdfs.fileType=DataStream
  21. a1.sinks.k1.hdfs.writeFormat=TEXT
  22. a1.sinks.k1.hdfs.rollInterval=
  23. a1.sinks.k1.channel=c1
  24. a1.sinks.k1.hdfs.filePrefix=%Y-%m-%d
  25. #
  1. # Name the components on this agent
  2. a1.sources = r1
  3. a1.sinks = k1
  4. a1.channels = c1
  5. # Describe/configure the source
  6. a1.sources.r1.type = avro
  7. a1.sources.r1.channels = c1
  8. a1.sources.r1.bind = node02
  9. a1.sources.r1.port =
  10. # Describe the sink
  11. a1.sinks.k1.type = logger
  12. # Use a channel which buffers events in memory
  13. a1.channels.c1.type = memory
  14. a1.channels.c1.capacity =
  15. a1.channels.c1.transactionCapacity =
  16. # Bind the source and sink to the channel
  17. a1.sources.r1.channels = c1
  18. a1.sinks.k1.channel = c1

Edit the configuration file on the node03 machine:

  1. cd /export/servers/apache-flume-1.8.0-bin/conf
  2. vim collector.conf
  1. #set Agent name
  2. a1.sources = r1
  3. a1.channels = c1
  4. a1.sinks = k1
  5. #
  6. ##set channel
  7. a1.channels.c1.type = memory
  8. a1.channels.c1.capacity =
  9. a1.channels.c1.transactionCapacity =
  10. #
  11. ## other node,nna to nns
  12. a1.sources.r1.type = avro
  13. a1.sources.r1.bind = node03
  14. a1.sources.r1.port =
  15. a1.sources.r1.channels = c1
  16. #
  17. ##set sink to hdfs
  18. a1.sinks.k1.type=hdfs
  19. a1.sinks.k1.hdfs.path= hdfs://node01:8020/flume/failover/
  20. a1.sinks.k1.hdfs.fileType=DataStream
  21. a1.sinks.k1.hdfs.writeFormat=TEXT
  22. a1.sinks.k1.hdfs.rollInterval=
  23. a1.sinks.k1.channel=c1
  24. a1.sinks.k1.hdfs.filePrefix=%Y-%m-%d
  1. # Name the components on this agent
  2. a1.sources = r1
  3. a1.sinks = k1
  4. a1.channels = c1
  5. # Describe/configure the source
  6. a1.sources.r1.type = avro
  7. a1.sources.r1.channels = c1
  8. a1.sources.r1.bind = node03
  9. a1.sources.r1.port =
  10. # Describe the sink
  11. a1.sinks.k1.type = logger
  12. # Use a channel which buffers events in memory
  13. a1.channels.c1.type = memory
  14. a1.channels.c1.capacity =
  15. a1.channels.c1.transactionCapacity =
  16. # Bind the source and sink to the channel
  17. a1.sources.r1.channels = c1
  18. a1.sinks.k1.channel = c1

3.4. Start in Order

Start Flume on the node03 machine:

  1. cd /export/servers/apache-flume-1.8.0-bin
  2. bin/flume-ng agent -n a1 -c conf -f conf/collector.conf -Dflume.root.logger=DEBUG,console

Start Flume on the node02 machine:

  1. cd /export/servers/apache-flume-1.8.0-bin
  2. bin/flume-ng agent -n a1 -c conf -f conf/collector.conf -Dflume.root.logger=DEBUG,console

Start Flume on the node01 machine:

  1. cd /export/servers/apache-flume-1.8.0-bin
  2. bin/flume-ng agent -n agent1 -c conf -f conf/agent.conf -Dflume.root.logger=DEBUG,console

Start the file-generation script on node01:

  1. cd /export/servers/shells
  2. sh tail-file.sh

3.5. Failover Test

Now let's test the high availability (failover) of the Flume NG cluster. The scenario: we upload files on the Agent1 node; because Collector1 is configured with a higher priority than Collector2, Collector1 collects the data and uploads it to the storage system first. We then kill Collector1, at which point Collector2 takes over collecting and uploading the logs. After that we manually restore the Flume service on the Collector1 node and upload files on Agent1 again; Collector1 resumes its higher-priority collection work. The screenshots are as follows:

Collector1 uploads first

Preview of the uploaded log content in the HDFS cluster

Collector1 goes down and Collector2 takes over the uploads

After the Collector1 service is restarted, Collector1 regains upload priority

4. Flume Load Balancing

Load balancing is a technique for handling a volume of requests that a single machine (a single process) cannot handle on its own. The Load Balancing Sink Processor implements this capability: in the figure below, Agent1 acts as a routing node that balances the events buffered in its channel across multiple sink components, each of which connects to a separate downstream agent. An example configuration is shown below, preceded by a minimal sketch of the sink-group settings.
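
As a minimal sketch (the property names are standard Flume settings; the selector choice is illustrative), the load-balancing sink group on the routing agent looks like this:

  a1.sinkgroups = g1
  a1.sinkgroups.g1.sinks = k1 k2
  a1.sinkgroups.g1.processor.type = load_balance
  # back off from a sink that fails so it is temporarily skipped
  a1.sinkgroups.g1.processor.backoff = true
  # distribute events round-robin across k1 and k2 (random is also supported)
  a1.sinkgroups.g1.processor.selector = round_robin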

Here we use three machines to simulate Flume load balancing.

The three machines are planned as follows:

node01: collects the data and sends it to node02 and node03

node02: receives part of node01's data

node03: receives part of node01's data

Step 1: Develop the Flume configuration on node01

node01 configuration:

  1. cd /export/servers/apache-flume-1.8.0-bin/conf
  2. vim load_banlancer_client.conf
  1. #agent name
  2. a1.channels = c1
  3. a1.sources = r1
  4. a1.sinks = k1 k2
  5. #set group
  6. a1.sinkgroups = g1
  7. #set channel
  8. a1.channels.c1.type = memory
  9. a1.channels.c1.capacity =
  10. a1.channels.c1.transactionCapacity =
  11. a1.sources.r1.channels = c1
  12. a1.sources.r1.type = exec
  13. a1.sources.r1.command = tail -F /export/servers/taillogs/access_log
  14. # set sink1
  15. a1.sinks.k1.channel = c1
  16. a1.sinks.k1.type = avro
  17. # downstream host and port
  18. a1.sinks.k1.hostname = node02
  19. a1.sinks.k1.port =
  20. # set sink2
  21. a1.sinks.k2.channel = c1
  22. a1.sinks.k2.type = avro
  23. # downstream host and port
  24. a1.sinks.k2.hostname = node03
  25. a1.sinks.k2.port =
  26. #set sink group
  27. a1.sinkgroups.g1.sinks = k1 k2
  28. #set sink processor
  29. # load balancing
  30. a1.sinkgroups.g1.processor.type = load_balance
  31. a1.sinkgroups.g1.processor.backoff = true
  32. # round-robin
  33. a1.sinkgroups.g1.processor.selector = round_robin
  34. a1.sinkgroups.g1.processor.selector.maxTimeOut=

Step 2: Develop the Flume configuration on node02

  1. cd /export/servers/apache-flume-1.8.0-bin/conf
  2. vim load_banlancer_server.conf
  1. # Name the components on this agent
  2. a1.sources = r1
  3. a1.sinks = k1
  4. a1.channels = c1
  5. # Describe/configure the source
  6. a1.sources.r1.type = avro
  7. a1.sources.r1.channels = c1
  8. a1.sources.r1.bind = node02
  9. a1.sources.r1.port =
  10. # Describe the sink
  11. a1.sinks.k1.type = logger
  12. # Use a channel which buffers events in memory
  13. a1.channels.c1.type = memory
  14. a1.channels.c1.capacity =
  15. a1.channels.c1.transactionCapacity =
  16. # Bind the source and sink to the channel
  17. a1.sources.r1.channels = c1
  18. a1.sinks.k1.channel = c1

Step 3: Develop the Flume configuration on node03

node03 configuration:

  1. cd /export/servers/apache-flume-1.8.0-bin/conf
  2. vim load_banlancer_server.conf
  1. # Name the components on this agent
  2. a1.sources = r1
  3. a1.sinks = k1
  4. a1.channels = c1
  5. # Describe/configure the source
  6. a1.sources.r1.type = avro
  7. a1.sources.r1.channels = c1
  8. a1.sources.r1.bind = node03
  9. a1.sources.r1.port =
  10. # Describe the sink
  11. a1.sinks.k1.type = logger
  12. # Use a channel which buffers events in memory
  13. a1.channels.c1.type = memory
  14. a1.channels.c1.capacity =
  15. a1.channels.c1.transactionCapacity =
  16. # Bind the source and sink to the channel
  17. a1.sources.r1.channels = c1
  18. a1.sinks.k1.channel = c1

Step 4: Start the Flume services

Start the Flume service on node03:

  1. cd /export/servers/apache-flume-1.8.0-bin
  2. bin/flume-ng agent -n a1 -c conf -f conf/load_banlancer_server.conf -Dflume.root.logger=DEBUG,console

Start the Flume service on node02:

  1. cd /export/servers/apache-flume-1.8.0-bin
  2. bin/flume-ng agent -n a1 -c conf -f conf/load_banlancer_server.conf -Dflume.root.logger=DEBUG,console

Start the Flume service on node01:

  1. cd /export/servers/apache-flume-1.8.0-bin
  2. bin/flume-ng agent -n a1 -c conf -f conf/load_banlancer_client.conf -Dflume.root.logger=DEBUG,console

Step 5: Run the script on node01 to generate data

cd /export/servers/shells

sh tail-file.sh

5. Flume Example: Static Interceptor

1. Scenario

Two log servers, A and B, produce logs in real time, mainly of the types access.log, nginx.log, and web.log.

The requirement is:

Collect access.log, nginx.log, and web.log from machines A and B onto machine C, and then store them centrally in HDFS.

In HDFS, however, the required directory layout is (see the sketch after this list for how the configuration produces it):

  1. /source/logs/access//**
  2. /source/logs/nginx/20180101/**
  3. /source/logs/web/20180101/**
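
These paths are produced on the collecting machine C by combining the header set by the static interceptor with the event timestamp. A short sketch of the relevant line from the node03 configuration shown later in this section:

  # %{type} resolves to the "type" header (access / nginx / web) added by the
  # static interceptor, and %Y%m%d resolves to the event date, e.g. 20180101
  a1.sinks.k1.hdfs.path = hdfs://node01:8020/source/logs/%{type}/%Y%m%d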

2. Scenario Analysis

Figure 1

3. Data Flow Analysis

4. Implementation

Server A has the IP 192.168.174.100

Server B has the IP 192.168.174.110

Server C is node03

Configuration on the collection side

Develop the Flume configuration file on the node01 and node02 servers:

  1. cd /export/servers/apache-flume-1.8.0-bin/conf
  2. vim exec_source_avro_sink.conf
  1. # Name the components on this agent
  2. a1.sources = r1 r2 r3
  3. a1.sinks = k1
  4. a1.channels = c1
  5. # Describe/configure the source
  6. a1.sources.r1.type = exec
  7. a1.sources.r1.command = tail -F /export/servers/taillogs/access.log
  8. a1.sources.r1.interceptors = i1
  9. a1.sources.r1.interceptors.i1.type = static
  10. ## the static interceptor inserts a user-defined key-value pair into the header of each collected event
  11. a1.sources.r1.interceptors.i1.key = type
  12. a1.sources.r1.interceptors.i1.value = access
  13. a1.sources.r2.type = exec
  14. a1.sources.r2.command = tail -F /export/servers/taillogs/nginx.log
  15. a1.sources.r2.interceptors = i2
  16. a1.sources.r2.interceptors.i2.type = static
  17. a1.sources.r2.interceptors.i2.key = type
  18. a1.sources.r2.interceptors.i2.value = nginx
  19. a1.sources.r3.type = exec
  20. a1.sources.r3.command = tail -F /export/servers/taillogs/web.log
  21. a1.sources.r3.interceptors = i3
  22. a1.sources.r3.interceptors.i3.type = static
  23. a1.sources.r3.interceptors.i3.key = type
  24. a1.sources.r3.interceptors.i3.value = web
  25. # Describe the sink
  26. a1.sinks.k1.type = avro
  27. a1.sinks.k1.hostname = node03
  28. a1.sinks.k1.port =
  29. # Use a channel which buffers events in memory
  30. a1.channels.c1.type = memory
  31. a1.channels.c1.capacity =
  32. a1.channels.c1.transactionCapacity =
  33. # Bind the source and sink to the channel
  34. a1.sources.r1.channels = c1
  35. a1.sources.r2.channels = c1
  36. a1.sources.r3.channels = c1
  37. a1.sinks.k1.channel = c1

Configuration on the server side

Develop the Flume configuration file on node03:

  1. cd /export/servers/apache-flume-1.8.0-bin/conf
  2. vim avro_source_hdfs_sink.conf
  1. a1.sources = r1
  2. a1.sinks = k1
  3. a1.channels = c1
  4. # define the source
  5. a1.sources.r1.type = avro
  6. a1.sources.r1.bind = node03
  7. a1.sources.r1.port =
  8. # add a timestamp interceptor
  9. a1.sources.r1.interceptors = i1
  10. a1.sources.r1.interceptors.i1.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
  11. # define the channel
  12. a1.channels.c1.type = memory
  13. a1.channels.c1.capacity =
  14. a1.channels.c1.transactionCapacity =
  15. # define the sink
  16. a1.sinks.k1.type = hdfs
  17. a1.sinks.k1.hdfs.path=hdfs://node01:8020/source/logs/%{type}/%Y%m%d
  18. a1.sinks.k1.hdfs.filePrefix =events
  19. a1.sinks.k1.hdfs.fileType = DataStream
  20. a1.sinks.k1.hdfs.writeFormat = Text
  21. # use the local time for time-based escape sequences
  22. a1.sinks.k1.hdfs.useLocalTimeStamp = true
  23. # do not roll files by event count
  24. a1.sinks.k1.hdfs.rollCount =
  25. # roll files by time
  26. a1.sinks.k1.hdfs.rollInterval =
  27. # roll files by size
  28. a1.sinks.k1.hdfs.rollSize =
  29. # number of events written to HDFS per batch
  30. a1.sinks.k1.hdfs.batchSize =
  31. # number of threads Flume uses for HDFS operations (open, write, etc.)
  32. a1.sinks.k1.hdfs.threadsPoolSize=
  33. # timeout for HDFS operations
  34. a1.sinks.k1.hdfs.callTimeout=
  35. # wire up source, channel, and sink
  36. a1.sources.r1.channels = c1
  37. a1.sinks.k1.channel = c1

File-generation script on the collection side

Develop a shell script on node01 and node02 to simulate data generation:

  1. cd /export/servers/shells
  2. vim server.sh
  1. #!/bin/bash
  2. while true
  3. do
  4. date >> /export/servers/taillogs/access.log;
  5. date >> /export/servers/taillogs/web.log;
  6. date >> /export/servers/taillogs/nginx.log;
  7. sleep 0.5;
  8. done

Start the services in order

Start Flume on node03 to collect the data:

  1. cd /export/servers/apache-flume-1.8.0-bin
  2. bin/flume-ng agent -c conf -f conf/avro_source_hdfs_sink.conf -name a1 -Dflume.root.logger=DEBUG,console

Start Flume on node01 and node02 to monitor the files:

  1. cd /export/servers/apache-flume-1.8.0-bin
  2. bin/flume-ng agent -c conf -f conf/exec_source_avro_sink.conf -name a1 -Dflume.root.logger=DEBUG,console

Start the file-generation script on node01 and node02:

  1. cd /export/servers/shells
  2. sh server.sh

5. Screenshots of the result

6. Flume Example 2: Custom Interceptor

Requirement:

After the data is collected, use a Flume interceptor to filter out the data we do not need and to encrypt the specified first field; only then is the data saved to HDFS.

Comparison of the raw data and the processed data

Figure 1: raw file content

Figure 2: the processed data collected on HDFS

Implementation steps

Step 1: Create a Maven Java project and add the dependencies

  1. <?xml version="1.0" encoding="UTF-8"?>
  2. <project xmlns="http://maven.apache.org/POM/4.0.0"
  3. xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  4. xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  5. <modelVersion>4.0.0</modelVersion>
  6. <groupId>cn.le.cloud</groupId>
  7. <artifactId>example-flume-intercepter</artifactId>
  8. <version>1.0-SNAPSHOT</version>
  9. <dependencies>
  10. <dependency>
  11. <groupId>org.apache.flume</groupId>
  12. <artifactId>flume-ng-sdk</artifactId>
  13. <version>1.8.0</version>
  14. </dependency>
  15. <dependency>
  16. <groupId>org.apache.flume</groupId>
  17. <artifactId>flume-ng-core</artifactId>
  18. <version>1.8.0</version>
  19. </dependency>
  20. </dependencies>
  21. <build>
  22. <plugins>
  23. <plugin>
  24. <groupId>org.apache.maven.plugins</groupId>
  25. <artifactId>maven-compiler-plugin</artifactId>
  26. <version>3.0</version>
  27. <configuration>
  28. <source>1.8</source>
  29. <target>1.8</target>
  30. <encoding>UTF-8</encoding>
  31. <!-- <verbal>true</verbal>-->
  32. </configuration>
  33. </plugin>
  34. <plugin>
  35. <groupId>org.apache.maven.plugins</groupId>
  36. <artifactId>maven-shade-plugin</artifactId>
  37. <version>3.1.1</version>
  38. <executions>
  39. <execution>
  40. <phase>package</phase>
  41. <goals>
  42. <goal>shade</goal>
  43. </goals>
  44. <configuration>
  45. <filters>
  46. <filter>
  47. <artifact>*:*</artifact>
  48. <excludes>
  49. <exclude>META-INF/*.SF</exclude>
  50. <exclude>META-INF/*.DSA</exclude>
  51. <exclude>META-INF/*.RSA</exclude>
  52. </excludes>
  53. </filter>
  54. </filters>
  55. <transformers>
  56. <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
  57. <mainClass></mainClass>
  58. </transformer>
  59. </transformers>
  60. </configuration>
  61. </execution>
  62. </executions>
  63. </plugin>
  64. </plugins>
  65. </build>
  66. </project>

Step 2: Write the custom Flume interceptor

  1. package cn.le.iterceptor;
  2. import com.google.common.base.Charsets;
  3. import org.apache.flume.Context;
  4. import org.apache.flume.Event;
  5. import org.apache.flume.interceptor.Interceptor;
  6. import java.security.MessageDigest;
  7. import java.security.NoSuchAlgorithmException;
  8. import java.util.ArrayList;
  9. import java.util.List;
  10. import java.util.regex.Matcher;
  11. import java.util.regex.Pattern;
  12. import static cn.le.iterceptor.CustomParameterInterceptor.Constants.*;
  13. public class CustomParameterInterceptor implements Interceptor {
  14. /** The field_separator.指明每一行字段的分隔符 */
  15. private final String fields_separator;
  16. /** The indexs.通过分隔符分割后,指明需要那列的字段 下标*/
  17. private final String indexs;
  18. /** The indexs_separator. 多个下标的分隔符*/
  19. private final String indexs_separator;
  20. /**
  21. *
  22. * @param indexs
  23. * @param indexs_separator
  24. */
  25. public CustomParameterInterceptor( String fields_separator,
  26. String indexs, String indexs_separator,String encrypted_field_index) {
  27. String f = fields_separator.trim();
  28. String i = indexs_separator.trim();
  29. this.indexs = indexs;
  30. this.encrypted_field_index=encrypted_field_index.trim();
  31. if (!f.equals("")) {
  32. f = UnicodeToString(f);
  33. }
  34. this.fields_separator =f;
  35. if (!i.equals("")) {
  36. i = UnicodeToString(i);
  37. }
  38. this.indexs_separator = i;
  39. }
  40. /*
  41. *
  42. * \t tab ('\u0009'), \n newline ('\u000A'), \r carriage return ('\u000D'), \f form feed ('\u000C'),
  43. * \a bell ('\u0007'), \e escape ('\u001B'), \cx the control character corresponding to x
  44. *
  45. * @param str
  46. * @return
  47. * @data:2015-6-30
  48. */
  49. /** The encrypted_field_index. 需要加密的字段下标*/
  50. private final String encrypted_field_index;
  51. public static String UnicodeToString(String str) {
  52. Pattern pattern = Pattern.compile("(\\\\u(\\p{XDigit}{4}))");
  53. Matcher matcher = pattern.matcher(str);
  54. char ch;
  55. while (matcher.find()) {
  56. ch = (char) Integer.parseInt(matcher.group(2), 16);
  57. str = str.replace(matcher.group(), ch + "");
  58. }
  59. return str;
  60. }
  61. /*
  62. * @see org.apache.flume.interceptor.Interceptor#intercept(org.apache.flume.Event)
  63. * 单个event拦截逻辑
  64. */
  65. public Event intercept(Event event) {
  66. if (event == null) {
  67. return null;
  68. }
  69. try {
  70. String line = new String(event.getBody(), Charsets.UTF_8);
  71. String[] fields_spilts = line.split(fields_separator);
  72. String[] indexs_split = indexs.split(indexs_separator);
  73. String newLine="";
  74. for (int i = 0; i < indexs_split.length; i++) {
  75. int parseInt = Integer.parseInt(indexs_split[i]);
  76. // MD5-encrypt the field that was marked for encryption
  77. if(!"".equals(encrypted_field_index)&&encrypted_field_index.equals(indexs_split[i])){
  78. newLine+=StringUtils.GetMD5Code(fields_spilts[parseInt]);
  79. }else{
  80. newLine+=fields_spilts[parseInt];
  81. }
  82. if(i!=indexs_split.length-1){
  83. newLine+=fields_separator;
  84. }
  85. }
  86. event.setBody(newLine.getBytes(Charsets.UTF_8));
  87. return event;
  88. } catch (Exception e) {
  89. return event;
  90. }
  91. }
  92. /*
  93. * @see org.apache.flume.interceptor.Interceptor#intercept(java.util.List)
  94. * 批量event拦截逻辑
  95. */
  96. public List<Event> intercept(List<Event> events) {
  97. List<Event> out = new ArrayList<Event>();
  98. for (Event event : events) {
  99. Event outEvent = intercept(event);
  100. if (outEvent != null) {
  101. out.add(outEvent);
  102. }
  103. }
  104. return out;
  105. }
  106. /*
  107. * @see org.apache.flume.interceptor.Interceptor#initialize()
  108. */
  109. public void initialize() {
  110. // TODO Auto-generated method stub
  111. }
  112. /*
  113. * @see org.apache.flume.interceptor.Interceptor#close()
  114. */
  115. public void close() {
  116. // TODO Auto-generated method stub
  117. }
  118. /**
  119. * 相当于自定义Interceptor的工厂类
  120. * 在flume采集配置文件中通过制定该Builder来创建Interceptor对象
  121. * 可以在Builder中获取、解析flume采集配置文件中的拦截器Interceptor的自定义参数:
  122. * 字段分隔符,字段下标,下标分隔符、加密字段下标 ...等
  123. * @author
  124. *
  125. */
  126. public static class Builder implements Interceptor.Builder {
  127. /** The fields_separator.指明每一行字段的分隔符 */
  128. private String fields_separator;
  129. /** The indexs.通过分隔符分割后,指明需要那列的字段 下标*/
  130. private String indexs;
  131. /** The indexs_separator. 多个下标下标的分隔符*/
  132. private String indexs_separator;
  133. /** The encrypted_field. 需要加密的字段下标*/
  134. private String encrypted_field_index;
  135. /*
  136. * @see org.apache.flume.conf.Configurable#configure(org.apache.flume.Context)
  137. */
  138. public void configure(Context context) {
  139. fields_separator = context.getString(FIELD_SEPARATOR, DEFAULT_FIELD_SEPARATOR);
  140. indexs = context.getString(INDEXS, DEFAULT_INDEXS);
  141. indexs_separator = context.getString(INDEXS_SEPARATOR, DEFAULT_INDEXS_SEPARATOR);
  142. encrypted_field_index= context.getString(ENCRYPTED_FIELD_INDEX, DEFAULT_ENCRYPTED_FIELD_INDEX);
  143. }
  144. /*
  145. * @see org.apache.flume.interceptor.Interceptor.Builder#build()
  146. */
  147. public Interceptor build() {
  148. return new CustomParameterInterceptor(fields_separator, indexs, indexs_separator,encrypted_field_index);
  149. }
  150. }
  151. /**
  152. * 常量
  153. *
  154. */
  155. public static class Constants {
  156. /** The Constant FIELD_SEPARATOR. */
  157. public static final String FIELD_SEPARATOR = "fields_separator";
  158. /** The Constant DEFAULT_FIELD_SEPARATOR. */
  159. public static final String DEFAULT_FIELD_SEPARATOR =" ";
  160. /** The Constant INDEXS. */
  161. public static final String INDEXS = "indexs";
  162. /** The Constant DEFAULT_INDEXS. */
  163. public static final String DEFAULT_INDEXS = "0";
  164. /** The Constant INDEXS_SEPARATOR. */
  165. public static final String INDEXS_SEPARATOR = "indexs_separator";
  166. /** The Constant DEFAULT_INDEXS_SEPARATOR. */
  167. public static final String DEFAULT_INDEXS_SEPARATOR = ",";
  168. /** The Constant ENCRYPTED_FIELD_INDEX. */
  169. public static final String ENCRYPTED_FIELD_INDEX = "encrypted_field_index";
  170. /** The Constant DEFAUL_TENCRYPTED_FIELD_INDEX. */
  171. public static final String DEFAULT_ENCRYPTED_FIELD_INDEX = "";
  172. /** The Constant PROCESSTIME. */
  173. public static final String PROCESSTIME = "processTime";
  174. /** The Constant PROCESSTIME. */
  175. public static final String DEFAULT_PROCESSTIME = "a";
  176. }
  177. /**
  178. * 工具类:字符串md5加密
  179. */
  180. public static class StringUtils {
  181. // 全局数组
  182. private final static String[] strDigits = { "0", "1", "2", "3", "4", "5",
  183. "6", "7", "8", "9", "a", "b", "c", "d", "e", "f" };
  184. // 返回形式为数字跟字符串
  185. private static String byteToArrayString(byte bByte) {
  186. int iRet = bByte;
  187. // System.out.println("iRet="+iRet);
  188. if (iRet < 0) {
  189. iRet += 256;
  190. }
  191. int iD1 = iRet / 16;
  192. int iD2 = iRet % 16;
  193. return strDigits[iD1] + strDigits[iD2];
  194. }
  195. // 返回形式只为数字
  196. private static String byteToNum(byte bByte) {
  197. int iRet = bByte;
  198. System.out.println("iRet1=" + iRet);
  199. if (iRet < 0) {
  200. iRet += 256;
  201. }
  202. return String.valueOf(iRet);
  203. }
  204. // 转换字节数组为16进制字串
  205. private static String byteToString(byte[] bByte) {
  206. StringBuffer sBuffer = new StringBuffer();
  207. for (int i = 0; i < bByte.length; i++) {
  208. sBuffer.append(byteToArrayString(bByte[i]));
  209. }
  210. return sBuffer.toString();
  211. }
  212. public static String GetMD5Code(String strObj) {
  213. String resultString = null;
  214. try {
  215. resultString = new String(strObj);
  216. MessageDigest md = MessageDigest.getInstance("MD5");
  217. // md.digest() 该函数返回值为存放哈希值结果的byte数组
  218. resultString = byteToString(md.digest(strObj.getBytes()));
  219. } catch (NoSuchAlgorithmException ex) {
  220. ex.printStackTrace();
  221. }
  222. return resultString;
  223. }
  224. }
  225. }

Step 3: Package the jar and upload it to the server

Package our interceptor into a jar and place it in Flume's lib directory.
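
As a hedged sketch (the exact jar file name depends on the artifactId and version declared in the pom above and on the shade plugin's output naming), the packaging and copy steps might look like this:

  # build the shaded jar from the project root
  mvn clean package
  # copy it into Flume's lib directory on the collection node (assumed paths)
  cp target/example-flume-intercepter-1.0-SNAPSHOT.jar /export/servers/apache-flume-1.8.0-bin/lib/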

Step 4: Develop the Flume configuration file

Develop the Flume configuration file on the third machine (node03):

  1. cd /export/servers/apache-flume-1.8.0-bin/conf
  2. vim spool-interceptor-hdfs.conf
  1. a1.channels = c1
  2. a1.sources = r1
  3. a1.sinks = s1
  4. #channel
  5. a1.channels.c1.type = memory
  6. a1.channels.c1.capacity=
  7. a1.channels.c1.transactionCapacity=
  8. #source
  9. a1.sources.r1.channels = c1
  10. a1.sources.r1.type = spooldir
  11. a1.sources.r1.spoolDir = /export/servers/intercept
  12. a1.sources.r1.batchSize=
  13. a1.sources.r1.inputCharset = UTF-8
  14. a1.sources.r1.interceptors =i1 i2
  15. a1.sources.r1.interceptors.i1.type =cn.le.iterceptor.CustomParameterInterceptor$Builder
  16. a1.sources.r1.interceptors.i1.fields_separator=\\u0009
  17. a1.sources.r1.interceptors.i1.indexs =,,,,
  18. a1.sources.r1.interceptors.i1.indexs_separator =\\u002c
  19. a1.sources.r1.interceptors.i1.encrypted_field_index =
  20. a1.sources.r1.interceptors.i2.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
  21. #sink
  22. a1.sinks.s1.channel = c1
  23. a1.sinks.s1.type = hdfs
  24. a1.sinks.s1.hdfs.path =hdfs://node01:8020/flume/intercept/%Y%m%d
  25. a1.sinks.s1.hdfs.filePrefix = event
  26. a1.sinks.s1.hdfs.fileSuffix = .log
  27. a1.sinks.s1.hdfs.rollSize =
  28. a1.sinks.s1.hdfs.rollInterval =
  29. a1.sinks.s1.hdfs.rollCount =
  30. a1.sinks.s1.hdfs.batchSize =
  31. a1.sinks.s1.hdfs.round = true
  32. a1.sinks.s1.hdfs.roundUnit = minute
  33. a1.sinks.s1.hdfs.threadsPoolSize =
  34. a1.sinks.s1.hdfs.useLocalTimeStamp = true
  35. a1.sinks.s1.hdfs.minBlockReplicas =
  36. a1.sinks.s1.hdfs.fileType =DataStream
  37. a1.sinks.s1.hdfs.writeFormat = Text
  38. a1.sinks.s1.hdfs.callTimeout =
  39. a1.sinks.s1.hdfs.idleTimeout =
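
The interceptor parameter values above were lost from this page. As an illustration only (these index values are assumptions, not the original values), keeping the first five tab-separated columns and MD5-encrypting the first one would look like this:

  a1.sources.r1.interceptors.i1.fields_separator = \\u0009    # fields are tab-separated
  a1.sources.r1.interceptors.i1.indexs = 0,1,2,3,4            # columns to keep (illustrative)
  a1.sources.r1.interceptors.i1.indexs_separator = \\u002c    # the indexes above are comma-separated
  a1.sources.r1.interceptors.i1.encrypted_field_index = 0     # MD5-encrypt the first column (illustrative)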

Step 5: Upload the test data

Upload our test data into the /export/servers/intercept directory; create the directory if it does not exist:

  1. mkdir -p /export/servers/intercept

The test data is as follows:

  1. 13601249301
  2. 13601249302
  3. 13601249303
  4. 13601249304
  5. 13601249305
  6. 13601249306
  7. 13601249307
  8. 13601249308
  9. 13601249309
  10. 13601249310
  11. 13601249311
  12. 13601249312

Step 6: Start Flume

  1. cd /export/servers/apache-flume-1.8.0-bin
  2. bin/flume-ng agent -c conf -f conf/spool-interceptor-hdfs.conf -name a1 -Dflume.root.logger=DEBUG,console
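
7. Custom Flume Source: Polling MySQL

The two classes below form a separate example: a custom Flume source that polls a MySQL table. MySqlSource implements PollableSource; on each process() call it runs a query through the QueryMySql helper, wraps every returned row in an event, hands the batch to the channel processor, and then records the latest id in a flume_meta table as an offset so that collection can resume where it left off after a restart.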


  1. package cn.le.flumesource;
  2. import org.apache.flume.Context;
  3. import org.apache.flume.Event;
  4. import org.apache.flume.EventDeliveryException;
  5. import org.apache.flume.PollableSource;
  6. import org.apache.flume.conf.Configurable;
  7. import org.apache.flume.event.SimpleEvent;
  8. import org.apache.flume.source.AbstractSource;
  9. import org.slf4j.Logger;
  10. import java.util.ArrayList;
  11. import java.util.HashMap;
  12. import java.util.List;
  13. import static org.slf4j.LoggerFactory.*;
  14. public class MySqlSource extends AbstractSource implements Configurable, PollableSource {
  15. //打印日志
  16. private static final Logger LOG = getLogger(MySqlSource.class);
  17. //定义sqlHelper
  18. private QueryMySql sqlSourceHelper;
  19. @Override
  20. public long getBackOffSleepIncrement() {
  21. return ;
  22. }
  23. @Override
  24. public long getMaxBackOffSleepInterval() {
  25. return ;
  26. }
  27. @Override
  28. public void configure(Context context) {
  29. //初始化
  30. sqlSourceHelper = new QueryMySql(context);
  31. }
  32. @Override
  33. public PollableSource.Status process() throws EventDeliveryException {
  34. try {
  35. //查询数据表
  36. List<List<Object>> result = sqlSourceHelper.executeQuery();
  37. //存放event的集合
  38. List<Event> events = new ArrayList<>();
  39. //存放event头集合
  40. HashMap<String, String> header = new HashMap<>();
  41. //如果有返回数据,则将数据封装为event
  42. if (!result.isEmpty()) {
  43. List<String> allRows = sqlSourceHelper.getAllRows(result);
  44. Event event = null;
  45. for (String row : allRows) {
  46. event = new SimpleEvent();
  47. event.setBody(row.getBytes());
  48. event.setHeaders(header);
  49. events.add(event);
  50. }
  51. //将event写入channel
  52. this.getChannelProcessor().processEventBatch(events);
  53. //更新数据表中的offset信息
  54. sqlSourceHelper.updateOffset2DB(result.size());
  55. }
  56. //等待时长
  57. Thread.sleep(sqlSourceHelper.getRunQueryDelay());
  58. return Status.READY;
  59. } catch (InterruptedException e) {
  60. LOG.error("Error procesing row", e);
  61. return Status.BACKOFF;
  62. }
  63. }
  64. @Override
  65. public synchronized void stop() {
  66. LOG.info("Stopping sql source {} ...", getName());
  67. try {
  68. //关闭资源
  69. sqlSourceHelper.close();
  70. } finally {
  71. super.stop();
  72. }
  73. }
  74. }
  1. package cn.le.flumesource;
  2. import org.apache.flume.Context;
  3. import org.apache.flume.conf.ConfigurationException;
  4. import org.apache.http.ParseException;
  5. import org.slf4j.Logger;
  6. import org.slf4j.LoggerFactory;
  7. import java.sql.*;
  8. import java.util.ArrayList;
  9. import java.util.List;
  10. import java.util.Properties;
  11. public class QueryMySql {
  12. private static final Logger LOG = LoggerFactory.getLogger(QueryMySql.class);
  13. private int runQueryDelay, //两次查询的时间间隔
  14. startFrom, //开始id
  15. currentIndex, //当前id
  16. recordSixe = , //每次查询返回结果的条数
  17. maxRow; //每次查询的最大条数
  18. private String table, //要操作的表
  19. columnsToSelect, //用户传入的查询的列
  20. customQuery, //用户传入的查询语句
  21. query, //构建的查询语句
  22. defaultCharsetResultSet;//编码集
  23. //上下文,用来获取配置文件
  24. private Context context;
  25. //为定义的变量赋值(默认值),可在flume任务的配置文件中修改
  26. private static final int DEFAULT_QUERY_DELAY = ;
  27. private static final int DEFAULT_START_VALUE = ;
  28. private static final int DEFAULT_MAX_ROWS = ;
  29. private static final String DEFAULT_COLUMNS_SELECT = "*";
  30. private static final String DEFAULT_CHARSET_RESULTSET = "UTF-8";
  31. private static Connection conn = null;
  32. private static PreparedStatement ps = null;
  33. private static String connectionURL, connectionUserName, connectionPassword;
  34. //加载静态资源
  35. static {
  36. Properties p = new Properties();
  37. try {
  38. p.load(QueryMySql.class.getClassLoader().getResourceAsStream("jdbc.properties"));
  39. connectionURL = p.getProperty("dbUrl");
  40. connectionUserName = p.getProperty("dbUser");
  41. connectionPassword = p.getProperty("dbPassword");
  42. Class.forName(p.getProperty("dbDriver"));
  43. } catch (Exception e) {
  44. LOG.error(e.toString());
  45. }
  46. }
  47. //获取JDBC连接
  48. private static Connection InitConnection(String url, String user, String pw) {
  49. try {
  50. Connection conn = DriverManager.getConnection(url, user, pw);
  51. if (conn == null)
  52. throw new SQLException();
  53. return conn;
  54. } catch (SQLException e) {
  55. e.printStackTrace();
  56. }
  57. return null;
  58. }
  59. //构造方法
  60. QueryMySql(Context context) throws ParseException {
  61. //初始化上下文
  62. this.context = context;
  63. //有默认值参数:获取flume任务配置文件中的参数,读不到的采用默认值
  64. this.columnsToSelect = context.getString("columns.to.select", DEFAULT_COLUMNS_SELECT);
  65. this.runQueryDelay = context.getInteger("run.query.delay", DEFAULT_QUERY_DELAY);
  66. this.startFrom = context.getInteger("start.from", DEFAULT_START_VALUE);
  67. this.defaultCharsetResultSet = context.getString("default.charset.resultset", DEFAULT_CHARSET_RESULTSET);
  68. //无默认值参数:获取flume任务配置文件中的参数
  69. this.table = context.getString("table");
  70. this.customQuery = context.getString("custom.query");
  71. connectionURL = context.getString("connection.url");
  72. connectionUserName = context.getString("connection.user");
  73. connectionPassword = context.getString("connection.password");
  74. conn = InitConnection(connectionURL, connectionUserName, connectionPassword);
  75. //校验相应的配置信息,如果没有默认值的参数也没赋值,抛出异常
  76. checkMandatoryProperties();
  77. //获取当前的id
  78. currentIndex = getStatusDBIndex(startFrom);
  79. //构建查询语句
  80. query = buildQuery();
  81. }
  82. //校验相应的配置信息(表,查询语句以及数据库连接的参数)
  83. private void checkMandatoryProperties() {
  84. if (table == null) {
  85. throw new ConfigurationException("property table not set");
  86. }
  87. if (connectionURL == null) {
  88. throw new ConfigurationException("connection.url property not set");
  89. }
  90. if (connectionUserName == null) {
  91. throw new ConfigurationException("connection.user property not set");
  92. }
  93. if (connectionPassword == null) {
  94. throw new ConfigurationException("connection.password property not set");
  95. }
  96. }
  97. //构建sql语句
  98. private String buildQuery() {
  99. String sql = "";
  100. //获取当前id
  101. currentIndex = getStatusDBIndex(startFrom);
  102. LOG.info(currentIndex + "");
  103. if (customQuery == null) {
  104. sql = "SELECT " + columnsToSelect + " FROM " + table;
  105. } else {
  106. sql = customQuery;
  107. }
  108. StringBuilder execSql = new StringBuilder(sql);
  109. //以id作为offset
  110. if (!sql.contains("where")) {
  111. execSql.append(" where ");
  112. execSql.append("id").append(">").append(currentIndex);
  113. return execSql.toString();
  114. } else {
  115. int length = execSql.toString().length();
  116. return execSql.toString().substring(0, length - String.valueOf(currentIndex).length()) + currentIndex;
  117. }
  118. }
  119. //执行查询
  120. List<List<Object>> executeQuery() {
  121. try {
  122. //每次执行查询时都要重新生成sql,因为id不同
  123. customQuery = buildQuery();
  124. //存放结果的集合
  125. List<List<Object>> results = new ArrayList<>();
  126. if (ps == null) {
  127. //
  128. ps = conn.prepareStatement(customQuery);
  129. }
  130. ResultSet result = ps.executeQuery(customQuery);
  131. while (result.next()) {
  132. //存放一条数据的集合(多个列)
  133. List<Object> row = new ArrayList<>();
  134. //将返回结果放入集合
  135. for (int i = 1; i <= result.getMetaData().getColumnCount(); i++) {
  136. row.add(result.getObject(i));
  137. }
  138. results.add(row);
  139. }
  140. LOG.info("execSql:" + customQuery + "\nresultSize:" + results.size());
  141. return results;
  142. } catch (SQLException e) {
  143. LOG.error(e.toString());
  144. // 重新连接
  145. conn = InitConnection(connectionURL, connectionUserName, connectionPassword);
  146. }
  147. return null;
  148. }
  149. //将结果集转化为字符串,每一条数据是一个list集合,将每一个小的list集合转化为字符串
  150. List<String> getAllRows(List<List<Object>> queryResult) {
  151. List<String> allRows = new ArrayList<>();
  152. if (queryResult == null || queryResult.isEmpty())
  153. return allRows;
  154. StringBuilder row = new StringBuilder();
  155. for (List<Object> rawRow : queryResult) {
  156. Object value = null;
  157. for (Object aRawRow : rawRow) {
  158. value = aRawRow;
  159. if (value == null) {
  160. row.append(",");
  161. } else {
  162. row.append(aRawRow.toString()).append(",");
  163. }
  164. }
  165. allRows.add(row.toString());
  166. row = new StringBuilder();
  167. }
  168. return allRows;
  169. }
  170. //更新offset元数据状态,每次返回结果集后调用。必须记录每次查询的offset值,为程序中断续跑数据时使用,以id为offset
  171. void updateOffset2DB(int size) {
  172. //以source_tab做为KEY,如果不存在则插入,存在则更新(每个源表对应一条记录)
  173. String sql = "insert into flume_meta(source_tab,currentIndex) VALUES('"
  174. + this.table
  175. + "','" + (recordSixe += size)
  176. + "') on DUPLICATE key update source_tab=values(source_tab),currentIndex=values(currentIndex)";
  177. LOG.info("updateStatus Sql:" + sql);
  178. execSql(sql);
  179. }
  180. //执行sql语句
  181. private void execSql(String sql) {
  182. try {
  183. ps = conn.prepareStatement(sql);
  184. LOG.info("exec::" + sql);
  185. ps.execute();
  186. } catch (SQLException e) {
  187. e.printStackTrace();
  188. }
  189. }
  190. //获取当前id的offset
  191. private Integer getStatusDBIndex(int startFrom) {
  192. //从flume_meta表中查询出当前的id是多少
  193. String dbIndex = queryOne("select currentIndex from flume_meta where source_tab='" + table + "'");
  194. if (dbIndex != null) {
  195. return Integer.parseInt(dbIndex);
  196. }
  197. //如果没有数据,则说明是第一次查询或者数据表中还没有存入数据,返回最初传入的值
  198. return startFrom;
  199. }
  200. //查询一条数据的执行语句(当前id)
  201. private String queryOne(String sql) {
  202. ResultSet result = null;
  203. try {
  204. ps = conn.prepareStatement(sql);
  205. result = ps.executeQuery();
  206. while (result.next()) {
  207. return result.getString(1);
  208. }
  209. } catch (SQLException e) {
  210. e.printStackTrace();
  211. }
  212. return null;
  213. }
  214. //关闭相关资源
  215. void close() {
  216. try {
  217. ps.close();
  218. conn.close();
  219. } catch (SQLException e) {
  220. e.printStackTrace();
  221. }
  222. }
  223. int getCurrentIndex() {
  224. return currentIndex;
  225. }
  226. void setCurrentIndex(int newValue) {
  227. currentIndex = newValue;
  228. }
  229. int getRunQueryDelay() {
  230. return runQueryDelay;
  231. }
  232. String getQuery() {
  233. return query;
  234. }
  235. String getConnectionURL() {
  236. return connectionURL;
  237. }
  238. private boolean isCustomQuerySet() {
  239. return (customQuery != null);
  240. }
  241. Context getContext() {
  242. return context;
  243. }
  244. public String getConnectionUserName() {
  245. return connectionUserName;
  246. }
  247. public String getConnectionPassword() {
  248. return connectionPassword;
  249. }
  250. String getDefaultCharsetResultSet() {
  251. return defaultCharsetResultSet;
  252. }
  253. }
