[spark]Spark Streaming教程
(一)官方入门示例
废话不说,先来个示例,有个感性认识再介绍。
这个示例来自spark自带的example,基本步骤如下:
(1)使用以下命令输入流消息:
$ nc -lk 9999
(2)在一个新的终端中运行NetworkWordCount,统计上面的词语数量并输出:
$ bin/run-example streaming.NetworkWordCount localhost 9999
(3)在第一步创建的输入流程中敲入一些内容,在第二步创建的终端中会看到统计结果,如:
第一个终端输入的内容:
hello world again
第二个端口的输出
-------------------------------------------
Time: 1436758706000 ms
-------------------------------------------
(again,1)
(hello,1)
(world,1)
简单解释一下,上面的示例通过手工敲入内容,并传给spark streaming统计单词数量,然后将结果打印出来。
附上代码:
package org.apache.spark.examples.streaming import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.storage.StorageLevel /**
* Counts words in UTF8 encoded, '\n' delimited text received from the network every second.
*
* Usage: NetworkWordCount
* and describe the TCP server that Spark Streaming would connect to receive data.
*
* To run this on your local machine, you need to first run a Netcat server
* `$ nc -lk 9999`
* and then run the example
* `$ bin/run-example org.apache.spark.examples.streaming.NetworkWordCount localhost 9999`
*/
object NetworkWordCount {
def main(args: Array[String]) {
if (args.length < 2) {
System.err.println("Usage: NetworkWordCount ")
System.exit(1)
} StreamingExamples.setStreamingLogLevels() // Create the context with a 1 second batch size
val sparkConf = new SparkConf().setAppName("NetworkWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(1)) // Create a socket stream on target ip:port and count the
// words in input stream of \n delimited text (eg. generated by 'nc')
// Note that no duplication in storage level only for running locally.
// Replication necessary in distributed scenario for fault tolerance.
val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
}
}
(二)Spark Streaming kafka示例
本示例使用java+maven来构建一个wordcount
1、创建项目,在pom.xml添加如下的依赖关系
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>1.7.0</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>1.7.0</version>
</dependency>
<dependency>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
<version>1.2.17</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.4.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.10</artifactId>
<version>1.4.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka_2.10</artifactId>
<version>1.4.0</version>
</dependency>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka_2.10</artifactId>
<version>0.8.2.1</version>
</dependency>
2、写代码,此部分代码使用了官方的代码:
package com.netease.gdc.kafkaStreaming; import java.util.Map;
import java.util.HashMap;
import java.util.regex.Pattern; import scala.Tuple2;
import com.google.common.collect.Lists;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaPairReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils; /**
* Consumes messages from one or more topics in Kafka and does wordcount.
*
* Usage: JavaKafkaWordCount
* is a list of one or more zookeeper servers that make quorum
* is the name of kafka consumer group
* is a list of one or more kafka topics to consume from
*is the number of threads the kafka consumer should use
*
* To run this example:
* `$ bin/run-example org.apache.spark.examples.streaming.JavaKafkaWordCount zoo01,zoo02, \
* zoo03 my-consumer-group topic1,topic2 1`
*/ public final class JavaKafkaWordCount {
private static final Pattern SPACE = Pattern.compile(" "); private JavaKafkaWordCount() {
} public static void main(String[] args) {
if (args.length < 4) {
System.err.println("Usage: JavaKafkaWordCount
");
System.exit(1);
} SparkConf sparkConf = new SparkConf().setAppName("JavaKafkaWordCount");
// Create the context with a 1 second batch size
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(2000)); int numThreads = Integer.parseInt(args[3]);
Map topicMap = new HashMap();
String[] topics = args[2].split(",");
for (String topic: topics) {
topicMap.put(topic, numThreads);
} JavaPairReceiverInputDStream messages =
KafkaUtils.createStream(jssc, args[0], args[1], topicMap); JavaDStream lines = messages.map(new Function() {
@Override
public String call(Tuple2 tuple2) {
return tuple2._2();
}
}); JavaDStream words = lines.flatMap(new FlatMapFunction() {
@Override
public Iterable call(String x) {
return Lists.newArrayList(SPACE.split(x));
}
}); JavaPairDStream wordCounts = words.mapToPair(
new PairFunction() {
@Override
public Tuple2 call(String s) {
return new Tuple2(s, 1);
}
}).reduceByKey(new Function2() {
@Override
public Integer call(Integer i1, Integer i2) {
return i1 + i2;
}
}); wordCounts.print();
jssc.start();
jssc.awaitTermination();
}
}
3、上传到服务器中然后编译
mvn clean package
4、提交job到spark中
/home/hadoop/spark/bin/spark-submit --jars ../mylib/metrics-core-2.2.0.jar,../mylib/zkclient-0.3.jar,../mylib/spark-streaming-kafka_2.10-1.4.0.jar,../mylib/kafka-clients-0.8.2.1.jar,../mylib/kafka_2.10-0.8.2.1.jar --class com.netease.gdc.kafkaStreaming.JavaKafkaWordCount --master spark://192.168.16.102:7077 target/kafkaStreaming-0.0.1-SNAPSHOT.jar 192.168.172.111:2181/kafka my-consumer-group test 3
当然,前提是kafka集群已经正常运行,且存在test这个topic
5、验证
打开一个console producer,输入内容,然后观察wordcount的结果。
结果形式如下:
(hi,1)
(三)基本步骤
本部分介绍创建一个spark streaming应用的基本步骤
1、构建依赖关系,以maven为例,需要在pom.xml中添加以下内容
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.10</artifactId>
<version>1.4.0</version>
</dependency>
如果需要使用其它数据源,则还需要将相应的依赖关系放入pom.xml。
如使用kafka作为数据源:
当然,spark的核心包也要包含:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.4.0</version>
</dependency>
[spark]Spark Streaming教程的更多相关文章
- Spark Streaming教程
废话不说,先来个示例,有个感性认识再介绍. 这个示例来自spark自带的example,基本步骤如下: (1)使用以下命令输入流消息: $ nc -lk 9999 (2)在一个新的终端中运行Net ...
- Spark Structured streaming框架(1)之基本使用
Spark Struntured Streaming是Spark 2.1.0版本后新增加的流计算引擎,本博将通过几篇博文详细介绍这个框架.这篇是介绍Spark Structured Streamin ...
- Spark Structured Streaming框架(2)之数据输入源详解
Spark Structured Streaming目前的2.1.0版本只支持输入源:File.kafka和socket. 1. Socket Socket方式是最简单的数据输入源,如Quick ex ...
- Spark的Streaming + Flume进行数据采集(flume主动推送或者Spark Stream主动拉取)
1.针对国外的开源技术,还是学会看国外的英文说明来的直接,迅速,这里简单贴一下如何看: 2.进入到flume的conf目录,创建一个flume-spark-push.sh的文件: [hadoop@sl ...
- Spark2.3(四十二):Spark Streaming和Spark Structured Streaming更新broadcast总结(二)
本次此时是在SPARK2,3 structured streaming下测试,不过这种方案,在spark2.2 structured streaming下应该也可行(请自行测试).以下是我测试结果: ...
- Spark2.2(三十三):Spark Streaming和Spark Structured Streaming更新broadcast总结(一)
背景: 需要在spark2.2.0更新broadcast中的内容,网上也搜索了不少文章,都在讲解spark streaming中如何更新,但没有spark structured streaming更新 ...
- Spark2.2(三十八):Spark Structured Streaming2.4之前版本使用agg和dropduplication消耗内存比较多的问题(Memory issue with spark structured streaming)调研
在spark中<Memory usage of state in Spark Structured Streaming>讲解Spark内存分配情况,以及提到了HDFSBackedState ...
- Spark2.3(三十五)Spark Structured Streaming源代码剖析(从CSDN和Github中看到别人分析的源代码的文章值得收藏)
从CSDN中读取到关于spark structured streaming源代码分析不错的几篇文章 spark源码分析--事件总线LiveListenerBus spark事件总线的核心是LiveLi ...
- Spark2.3(三十四):Spark Structured Streaming之withWaterMark和windows窗口是否可以实现最近一小时统计
WaterMark除了可以限定来迟数据范围,是否可以实现最近一小时统计? WaterMark目的用来限定参数计算数据的范围:比如当前计算数据内max timestamp是12::00,waterMar ...
随机推荐
- Impala与HBase整合
不多说,直接上干货! Impala可以通过Hive外部表方式和HBase进行整合,步骤如下: • 步骤1:创建hbase 表,向表中添加数据 create 'test_info', 'info' pu ...
- 分享一个vue常用的ui控件
vue学习文档 http://www.jianshu.com/p/8a272fc4e8e8 vux github ui demo:https://github.com/airyland/vux M ...
- Es5正则
##JSON(ES5) 前端后台都能识别的数据.(一种数据储存格式) XNL在JSON出来前 JSON不支持 undefinde和函数. 示列:let = '[{"useername&qu ...
- background 背景认知
background 背景 背景颜色 /*背景颜色为红色*/ p { background-color:ren; } 网页背景不仅可以设置颜色还可以插入图片 /*为背景插入图片*/ body { ba ...
- 洛谷 P2021 faebdc玩扑克
P2021 faebdc玩扑克 题目背景 faebdc和zky在玩一个小游戏 题目描述 zky有n个扑克牌,编号从1到n,zky把它排成一个序列,每次把最上方的扑克牌放在牌堆底,然后把下一张扑克牌拿出 ...
- [React] Create a queue of Ajax requests with redux-observable and group the results.
With redux-observable, we have the power of RxJS at our disposal - this means tasks that would other ...
- ubuntu终端sudo java提示“command not found”解决办法
我在ubuntu 12.04里想启动一个java程序,sudo java -jar xxx.jar,但是结果提示sudo:java:command not found. Ubuntu下用sudo运行j ...
- LeetCode 136 Single Number(仅仅出现一次的数字)
翻译 给定一个整型数组,除了某个元素外其余元素均出现两次. 找出这个仅仅出现一次的元素. 备注: 你的算法应该是一个线性时间复杂度. 你能够不用额外空间来实现它吗? 原文 Given an array ...
- actionmode-ActionMode以及它的menu使用
下图左边效果为Context Menu右边效果为ActionMode. ActionMode 其实就是替换在actionbar的位置上显示的一个控件.它跟actionbar一样,也是一种导航作用.只不 ...
- 为什么会出现NoSQL数据库
为什么会出现NoSQL数据库 一.总结 一句话总结:sql不支持分布式且且有性能瓶颈且不支持分布式,不同NoSQL适合不同的场景 1."不同的NoSQL数据库只适合不同的场景"这句 ...