【Spark】SparkStreaming与flume进行整合
文章目录
注意事项
一、首先要保证安装了flume,flume相关安装文章可以看【Hadoop离线基础总结】日志采集框架Flume二、把flume的lib
目录下自带的过时的scala-library-2.10.5.jar
包替换成scala-library-2.11.8.jar
三、下载需要的jar包,下载地址献上:https://repo1.maven.org/maven2/org/apache/spark/spark-streaming-flume_2.11/2.2.0/spark-streaming-flume_2.11-2.2.0.jar
并把jar包也放到flume的lib
目录下
SparkStreaming从flume中poll数据
步骤
一、开发flume配置文件
在安装了flume的虚拟机执行以下操作命令
mkdir -p /export/servers/flume/flume-poll //受监控的文件夹
cd /export/servers/apache-flume-1.6.0-cdh5.14.0-bin/conf
vim flume-poll.conf
# 命名flume的各个组件
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# 配置source组件
a1.sources.r1.channels = c1
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /export/servers/flume/flume-poll
a1.sources.r1.fileHeader = true
# 配置channel组件 选用memory channel
a1.channels.c1.type =memory
a1.channels.c1.capacity = 20000
a1.channels.c1.transactionCapacity=5000
# 配置sink组件
a1.sinks.k1.channel = c1
a1.sinks.k1.type = org.apache.spark.streaming.flume.sink.SparkSink
a1.sinks.k1.hostname=node03
a1.sinks.k1.port = 8888
a1.sinks.k1.batchSize= 2000
二、启动flume
cd /export/servers/apache-flume-1.6.0-cdh5.14.0-bin/
bin/flume-ng agent -c conf -f conf/flume-poll.conf -n a1 -Dflume.root.logger=DEBUG,CONSOLE
三、开发sparkStreaming代码
1.创建maven工程,导入jar包
<properties>
<scala.version>2.11.8</scala.version>
<spark.version>2.2.0</spark.version>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-flume_2.11</artifactId>
<version>2.2.0</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>2.2.0</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.7.5</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.11</artifactId>
<version>2.2.0</version>
</dependency>
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.38</version>
</dependency>
</dependencies>
<build>
<sourceDirectory>src/main/scala</sourceDirectory>
<testSourceDirectory>src/test/scala</testSourceDirectory>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.0</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
<encoding>UTF-8</encoding>
<!-- <verbal>true</verbal>-->
</configuration>
</plugin>
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.2.0</version>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
<configuration>
<args>
<arg>-dependencyfile</arg>
<arg>${project.build.directory}/.scala_dependencies</arg>
</args>
</configuration>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>3.1.1</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
<transformers>
<transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
<mainClass></mainClass>
</transformer>
</transformers>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
2.开发代码
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.flume.{FlumeUtils, SparkFlumeEvent}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}
object SparkFlumePoll {
// 定义updateFunc函数
def updateFunc(newValues: Seq[Int],runningCount: Option[Int]): Option[Int] = {
Option(newValues.sum + runningCount.getOrElse(0))
}
def main(args: Array[String]): Unit = {
// 获取SparkConf
val sparkConf: SparkConf = new SparkConf().set("spark.driver.host", "localhost").setAppName("SparkFlume-Poll").setMaster("local[6]")
// 获取SparkContext
val sparkContext = new SparkContext(sparkConf)
// 设置日志级别
sparkContext.setLogLevel("WARN")
//获取StreamingContext
val streamingContext = new StreamingContext(sparkContext, Seconds(5))
streamingContext.checkpoint("./poll-Flume")
// 通过FlumeUtils调用createPollingStream方法获取flume中的数据
/*
createPollingStream所需参数:
ssc: StreamingContext,
hostname: String,
port: Int,
*/
val stream: ReceiverInputDStream[SparkFlumeEvent] = FlumeUtils.createPollingStream(streamingContext, "node03", 8888)
// 拿到数据后,所有的数据都会封装在SparkFlumeEvent中
// 将SparkFlumeEvent封装的数据转换为DStream
val line: DStream[String] = stream.map(x => {
// x代表SparkFlumeEvent封装对象,里面封装了event数据,通过以下方法转换成数组
val array: Array[Byte] = x.event.getBody.array()
// 将拿到的数组转换为String
val str = new String(array)
str
}
)
// 进行单词计数操作
val value: DStream[(String, Int)] = line.flatMap(_.split(" ")).map((_, 1)).updateStateByKey(updateFunc)
//输出结果
value.print()
streamingContext.start()
streamingContext.awaitTermination()
}
}
四、向监控目录中导入文本文件
控制台结果
-------------------------------------------
Time: 1586877095000 ms
-------------------------------------------
-------------------------------------------
Time: 1586877100000 ms
-------------------------------------------
20/04/14 23:11:44 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
20/04/14 23:11:44 WARN BlockManager: Block input-0-1586877094060 replicated to only 0 peer(s) instead of 1 peers
-------------------------------------------
Time: 1586877105000 ms
-------------------------------------------
(world,1)
(hive,2)
(hello,2)
(sqoop,1)
(test,1)
(abb,1)
-------------------------------------------
Time: 1586877110000 ms
-------------------------------------------
(world,1)
(hive,2)
(hello,2)
(sqoop,1)
(test,1)
(abb,1)
-------------------------------------------
Time: 1586877115000 ms
-------------------------------------------
(world,1)
(hive,2)
(hello,2)
(sqoop,1)
(test,1)
(abb,1)
20/04/14 23:11:57 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
20/04/14 23:11:57 WARN BlockManager: Block input-0-1586877094061 replicated to only 0 peer(s) instead of 1 peers
-------------------------------------------
Time: 1586877120000 ms
-------------------------------------------
(world,2)
(hive,4)
(hello,4)
(sqoop,2)
(test,2)
(abb,2)
-------------------------------------------
Time: 1586877125000 ms
-------------------------------------------
(world,2)
(hive,4)
(hello,4)
(sqoop,2)
(test,2)
(abb,2)
flume将数据push给SparkStreaming
步骤
一、开发flume配置文件
mkdir -p /export/servers/flume/flume-push/
cd /export/servers/apache-flume-1.6.0-cdh5.14.0-bin/conf
vim flume-push.conf
#push mode
a1.sources = r1
a1.sinks = k1
a1.channels = c1
#source
a1.sources.r1.channels = c1
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /export/servers/flume/flume-push
a1.sources.r1.fileHeader = true
#channel
a1.channels.c1.type =memory
a1.channels.c1.capacity = 20000
a1.channels.c1.transactionCapacity=5000
#sinks
a1.sinks.k1.channel = c1
a1.sinks.k1.type = avro
#注意这里的ip需要指定的是我们spark程序所运行的服务器的ip,也就是我们的localhost
a1.sinks.k1.hostname=192.168.0.105
a1.sinks.k1.port = 8888
a1.sinks.k1.batchSize= 2000
二、启动flume
cd /export/servers/apache-flume-1.6.0-cdh5.14.0-bin/
bin/flume-ng agent -c conf -f conf/flume-push.conf -n a1 -Dflume.root.logger=DEBUG,CONSOLE
三、开发代码
package cn.itcast.sparkstreaming.demo4
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.flume.{FlumeUtils, SparkFlumeEvent}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}
object SparkFlumePush {
def main(args: Array[String]): Unit = {
//获取SparkConf
val sparkConf: SparkConf = new SparkConf().setAppName("SparkFlume-Push").setMaster("local[6]").set("spark.driver.host", "localhost")
//获取SparkContext
val sparkContext = new SparkContext(sparkConf)
sparkContext.setLogLevel("WARN")
//获取StreamingContext
val streamingContext = new StreamingContext(sparkContext, Seconds(5))
val stream: ReceiverInputDStream[SparkFlumeEvent] = FlumeUtils.createStream(streamingContext, "192.168.0.105", 8888)
val value: DStream[String] = stream.map(x => {
val array: Array[Byte] = x.event.getBody.array()
val str = new String(array)
str
})
value.print()
streamingContext.start()
streamingContext.awaitTermination()
}
}
四、向监控目录中导入文本文件
控制台结果
-------------------------------------------
Time: 1586882385000 ms
-------------------------------------------
20/04/15 00:39:45 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
20/04/15 00:39:45 WARN BlockManager: Block input-0-1586882384800 replicated to only 0 peer(s) instead of 1 peers
-------------------------------------------
Time: 1586882390000 ms
-------------------------------------------
hello world
sqoop hive
abb test
hello hive
-------------------------------------------
Time: 1586882395000 ms
-------------------------------------------
【Spark】SparkStreaming与flume进行整合的更多相关文章
- 【Spark】SparkStreaming和Kafka的整合
文章目录 Streaming和Kafka整合 概述 使用0.8版本下Receiver DStream接收数据进行消费 步骤 一.启动Kafka集群 二.创建maven工程,导入jar包 三.创建一个k ...
- Spark Streaming从Flume Poll数据案例实战和内幕源码解密
本节课分成二部分讲解: 一.Spark Streaming on Polling from Flume实战 二.Spark Streaming on Polling from Flume源码 第一部分 ...
- 图解SparkStreaming与Kafka的整合,这些细节大家要注意!
前言 老刘是一名即将找工作的研二学生,写博客一方面是复习总结大数据开发的知识点,一方面是希望帮助更多自学的小伙伴.由于老刘是自学大数据开发,肯定会存在一些不足,还希望大家能够批评指正,让我们一起进步! ...
- Spark Streaming处理Flume数据练习
把Flume Source(netcat类型),从终端上不断给Flume Source发送消息,Flume把消息汇集到Sink(avro类型),由Sink把消息推送给Spark Streaming并处 ...
- cdh环境下,spark streaming与flume的集成问题总结
文章发自:http://www.cnblogs.com/hark0623/p/4170156.html 转发请注明 如何做集成,其实特别简单,网上其实就是教程. http://blog.csdn.n ...
- spark streaming集成flume
1. 安装flume flume安装,解压后修改flume_env.sh配置文件,指定java_home即可. cp hdfs jar包到flume lib目录下(否则无法抽取数据到hdfs上): $ ...
- Flume+Kafka整合
脚本生产数据---->flume采集数据----->kafka消费数据------->storm集群处理数据 日志文件使用log4j生成,滚动生成! 当前正在写入的文件在满足一定的数 ...
- demo2 Kafka+Spark Streaming+Redis实时计算整合实践 foreachRDD输出到redis
基于Spark通用计算平台,可以很好地扩展各种计算类型的应用,尤其是Spark提供了内建的计算库支持,像Spark Streaming.Spark SQL.MLlib.GraphX,这些内建库都提供了 ...
- hadoop 之 kafka 安装与 flume -> kafka 整合
62-kafka 安装 : flume 整合 kafka 一.kafka 安装 1.下载 http://kafka.apache.org/downloads.html 2. 解压 tar -zxvf ...
随机推荐
- Linux常用命令02(远程管理)
01 关机/重启 序号 命令 对应英文 作用 01 shutdown 选项 时间 shutdown 关机/重新启动 1.1 shutdown shutdown 命令可以 安全 关闭 或者 重新启动系统 ...
- Extjs简单的form+grid组合
采用的是Extjs4.2版本 http://localhost:49999/GridPanel/Index 该链接是本地连接,只是方便自己访问,读者无法正常访问. <script src=&qu ...
- Present CodeForces - 1323D (思维+二分)
题目大意比较简单,就是求一堆(二元组)的异或和. 思路:按位考虑,如果说第k位为1的话,那么一定有奇数个(二元组)在该位为1.二元组内的数是相加的,相加是可以进位的.所以第k位是0还是1,至于k为后边 ...
- sqli-labs通关教程----31~40关
第三十一关 这关一样,闭合变成(",简单测试,#号不能用 ?id=1") and ("1")=("1")--+ 第三十二关 这关会把我们的输 ...
- OkHttp 优雅封装 HttpUtils 之 上传下载解密
曾经在代码里放荡不羁,如今在博文中日夜兼行,只为今天与你分享成果.如果觉得本文有用,记得关注我,我将带给你更多. 还没看过第一篇文章的欢迎移步:OkHttp 优雅封装 HttpUtils 之气海雪山初 ...
- vue axios post请求下载文件,后台springmvc完整代码
注意请求时要设置responseType,不加会中文乱码,被这个坑困扰了大半天... axios post请求: download(index,row){ var ts = ...
- 详解 继承(下)—— super关键字 与 多态
接上篇博文--<详解 继承(上)-- 工具的抽象与分层> 废话不多说,进入正题: 本人在上篇"故弄玄虚",用super();解决了问题,这是为什么呢? 答曰:子类中所有 ...
- ADO.Net和Entity Framework的区别联系
它们有以下几点区别:1,ADO.Net是开发人员自己select.update等写sql语句,来实现对数据库的增删改查等操作:采用EF进行开发操作数据库的时候,只需要操作对象,这样做使开发更方便,此时 ...
- Python常用库-Psutil
背景 介绍一个处理进程的实用工具,这个是一个第三方库.应用主要有类似ps.cd.top,还有查看硬盘.内存使用情况等. 推荐的理由主要有 2 个,第一个是跨平台的,不管是OSX.Centos.Wind ...
- WTF Python:有趣且鲜为人知的Python特性
Python 是一个设计优美的解释型高级语言,它提供了很多能让程序员感到舒适的功能特性.但有的时候,Python 的一些输出结果对于初学者来说似乎并不是那么一目了然. 这个有趣的项目意在收集 Pyth ...