Spark Structured streaming API支持的输出源有：Console、Memory、File和Foreach。其中Console在前两篇博文中已有详述，而Memory使用非常简单。本文着重介绍File和Foreach两种方式，并介绍如何在源码基本扩展新的输出方式。

1. File

　　Structured Streaming支持将数据以File形式保存起来，其中支持的文件格式有四种:json、text、csv和parquet。其使用方式也非常简单只需设置checkpointLocation和path即可。checkpointLocation是检查点保存的路径，而path是真实数据保存的路径。

如下所示的测试例子：

// Create DataFrame representing the stream of input lines from connection to host:port

val lines = spark.readStream

.format("socket")

.option("host", host)

.option("port", port)

.load()

// Split the lines into words

val words = lines.as[String].flatMap(_.split(" "))

// Generate running word count

val wordCounts = words.groupBy("value").count()

// Start running the query that prints the running counts to the console

val query = wordCounts.writeStream

.format("json")

.option("checkpointLocation","root/jar")

.option("path","/root/jar")

.start()

注意：

File形式不能设置"compelete"模型，只能设置"Append"模型。由于Append模型不能有聚合操作，所以将数据保存到外部File时，不能有聚合操作。

2. Foreach

　　foreach输出方式只需要实现ForeachWriter抽象类，并实现三个方法，当Structured Streaming接收到数据就会执行其三个方法，如下的测试示例：

* Licensed to the Apache Software Foundation (ASF) under one or more

* contributor license agreements. See the NOTICE file distributed with

* this work for additional information regarding copyright ownership.

* The ASF licenses this file to You under the Apache License, Version 2.0

* (the "License"); you may not use this file except in compliance with

* the License. You may obtain a copy of the License at

* http://www.apache.org/licenses/LICENSE-2.0

* Unless required by applicable law or agreed to in writing, software

* distributed under the License is distributed on an "AS IS" BASIS,

* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

* See the License for the specific language governing permissions and

* limitations under the License.

// scalastyle:off println

package org.apache.spark.examples.sql.streaming

import org.apache.spark.sql.SparkSession

/**

* Counts words in UTF8 encoded, '\n' delimited text received from the network.

* Usage: StructuredNetworkWordCount <hostname> <port>

* <hostname> and <port> describe the TCP server that Structured Streaming

* would connect to receive data.

* To run this on your local machine, you need to first run a Netcat server

* `$ nc -lk 9999`

* and then run the example

* `$ bin/run-example sql.streaming.StructuredNetworkWordCount

* localhost 9999`

object StructuredNetworkWordCount {

def main(args: Array[String]) {

if (args.length < 2) {

System.err.println("Usage: StructuredNetworkWordCount <hostname> <port>")

System.exit(1)

}

val host = args(0)

val port = args(1).toInt

val spark = SparkSession

.builder

.appName("StructuredNetworkWordCount")

.getOrCreate()

import spark.implicits._

// Create DataFrame representing the stream of input lines from connection to host:port

val lines = spark.readStream

.format("socket")

.option("host", host)

.option("port", port)

.load()

// Start running the query that prints the running counts to the console

val query = wordCounts.writeStream

.outputMode("append")

.foreach(new ForearchWriter[Row]{

override def open(partitionId:Long,version:Long):Boolean={

println("open")

return true

}

override def process(value:Row):Unit={

val spark = SparkSession.builder.getOrCreate()

val seq = value.mkString.split(" ")

val row = Row.fromSeq(seq)

val rowRDD:RDD[Row] = sparkContext.getOrCreate().parallelize[Row](Seq(row))

val userSchema = new StructType().add("name","String").add("age","String")

val peopleDF = spark.createDataFrame(rowRDD,userSchema)

peopleDF.createOrReplaceTempView(myTable)

spark.sql("select * from myTable").show()

}

override def close(errorOrNull:Throwable):Unit={

println("close")

}

})

.start()

query.awaitTermination()

}

// scalastyle:on println

　　上述程序是直接继承ForeachWriter类的接口，并实现了open()、process()、close()三个方法。若采用显示定义一个类来实现，需要注意Scala的泛型设计，如下所示：

class myForeachWriter[T<:Row](stream:CatalogTable) extends ForearchWriter[T]{

override def open(partionId:Long,version:Long):Boolean={

println("open")

true

}

override def process(value:T):Unit={

println(value)

}

override def close(errorOrNull:Throwable):Unit={

println("close")

}

3. 自定义

　　若上述Spark Structured Streaming API提供的数据输出源仍不能满足要求，那么还有一种方法可以使用：修改源码。

如下通过实现一种自定义的Console来介绍这种使用方式：

3.1 ConsoleSink

　　Spark有一个Sink接口，用户可以实现该接口的addBatch方法，其中的data参数是接收的数据，如下所示直接将其输出到控制台：

class ConsoleSink(streamName:String) extends Sink{

override def addBatch(batchId:Long, data;DataFrame):Unit = {

data.show()

}

3.2 DataStreamWriter

　　在用户自定义的输出形式时，并调用start()方法后，Spark框架会去调用DataStreamWriter类的start()方法。所以用户可以直接在该方法中添加自定义的输出方式，如我们向其传递上述创建的ConsoleSink类示例，如下所示：

def start():StreamingQuery={

if(source == "memory"){

...

}else if(source=="foreach"){

...

}else if(source=="consoleSink"){

val streamName:String = extraOption.get("streamName") mathc{

case Some(str):str

case None=>throw new AnalysisException("streamName option must be specified for Sink")

}

val sink = new consoleSink(streamName)

df.sparkSession.sessionState.streamingQueryManager.startQuery(

extraOption.get("queryName"),

extraOption.get("checkpointLocation"),

df,

sink,

outputMode,

useTempCheckpointLocaltion = true,

recoverFromCheckpointLocation = false,

trigger = trigger

)

}else{

...

}

3.3 Structured Streaming

　　在前两部修改和实现完成后，用户就可以按正常的Structured Streaming API方式使用了，唯一不同的是在输出形式传递的参数是"consoleSink"字符串，如下所示：

def execute(stream:CatalogTable):Unit={

val spark = SparkSession

.builder

.appName("StructuredNetworkWordCount")

.getOrCreate()

/**1. 获取数据对象DataFrame*/

val lines = spark.readStream

.format("socket")

.option("host", "localhost")

.option("port", 9999)

.load()

/**2. 启动Streaming开始接受数据源的信息*/

val query:StreamingQuery = lines.writeStream

.outputMode("append")

.format("consoleSink")

.option("streamName","myStream")

.start()

query.awaitTermination()

}

4. 参考文献

[1]. Structured Streaming Programming Guide.

[2]. Kafka Integration Guide.

Spark Structured Streaming框架(3)之数据输出源详解的更多相关文章

Spark Structured Streaming框架(2)之数据输入源详解
Spark Structured Streaming目前的2.1.0版本只支持输入源:File.kafka和socket. 1. Socket Socket方式是最简单的数据输入源,如Quick ex ...
Spark Structured Streaming框架（2）之数据输入源详解
Spark Structured Streaming目前的2.1.0版本只支持输入源:File.kafka和socket. 1. Socket Socket方式是最简单的数据输入源,如Quick ex ...
Spark Structured streaming框架（1）之基本使用
Spark Struntured Streaming是Spark 2.1.0版本后新增加的流计算引擎,本博将通过几篇博文详细介绍这个框架.这篇是介绍Spark Structured Streamin ...
Spark Structured Streaming框架(1)之基本用法
Spark Struntured Streaming是Spark 2.1.0版本后新增加的流计算引擎,本博将通过几篇博文详细介绍这个框架.这篇是介绍Spark Structured Streamin ...
Spark Structured Streaming框架(4)之窗口管理详解
1. 结构 1.1 概述 Structured Streaming组件滑动窗口功能由三个参数决定其功能:窗口时间.滑动步长和触发时间. 窗口时间:是指确定数据操作的长度: 滑动步长:是指窗口每次向前移 ...
Spark Structured Streaming框架(5)之进程管理
Structured Streaming提供一些API来管理Streaming对象.用户可以通过这些API来手动管理已经启动的Streaming,保证在系统中的Streaming有序执行. 1. St ...
Kafka：ZK+Kafka+Spark Streaming集群环境搭建（二十九）：推送avro格式数据到topic，并使用spark structured streaming接收topic解析avro数据
推送avro格式数据到topic 源代码:https://github.com/Neuw84/structured-streaming-avro-demo/blob/master/src/main/j ...
Kafka：ZK+Kafka+Spark Streaming集群环境搭建（十一）定制一个arvo格式文件发送到kafka的topic，通过Structured Streaming读取kafka的数据
将arvo格式数据发送到kafka的topic 第一步:定制avro schema: { "type": "record", "name": ...
DataFlow编程模型与Spark Structured streaming
流式(streaming)和批量( batch):流式数据,实际上更准确的说法应该是unbounded data(processing),也就是无边界的连续的数据的处理:对应的批量计算,更准确的说法是 ...

随机推荐

奥巴马(Obama)获胜演讲全文[中英对照]+高清视频下载
http://www.amznz.com/obama-speech/如果还有人对美国是否凡事都有可能存疑,还有人怀疑美国奠基者的梦想在我们所处的时代是否依然鲜活,还有人质疑我们的民主制度的力量,那么今 ...
高抛低吸T+0操作要领（目前行情短线炒作的必备技能）
最近的行情只能用操蛋来形容,但是危机中不乏机会.现在已经不是之前行情的思路,那着一个股票长线抱着,即使是好的牛股,也经不起目前行情的这么折腾.所以,现在最适合的操作方式就是高抛低吸.今天低吸保不准明 ...
zabbix根据graph name 做screen
下面亲测可用 #!/usr/bin/env python #coding:utf8 import urllib2 import sys import json import argparse #定义通 ...
CentOS下使用yum快速安装memcached
1. 查找Memcached yum search memcached 首先检查yum软件仓库中是否存在memcached,如果有直接进入第3步安装即可,否则执行第2步. 2. 安装第三方软件库(可 ...
Java 异常介绍
Java标准库内建了一些通用的异常,这些类以 Throwable 为顶层父类.Throwable又派生出 Error 类和 Exception 类. 错误:Error类以及他的子类的实例,代表了JVM ...
vuforia 中摄像机的开启与关闭
本文主要讲解的是Unity对Vuforia的开发中在原生调用摄像头上遇到的坑~Unity中调用设备摄像头打开或则关闭,或则开关扫描识别问题等等一些情况~ 下面先说说趟过的坑,再说说解决办法,或则目前没 ...
EasyNVR完美搭配腾讯云CDN/阿里云CDN进行RTMP、HLS直播加速的使用说明
1.相关资料入口腾讯云LVB EasyNVR.com 2.加速说明 2.1. 腾讯LVB加速 2.1.1. 开通服务腾讯云视频LVB开通入口 2.1.2. 登录进入控制台腾讯云直播控制台 2.1 ...
Java中线程和线程池
Java中开启多线程的三种方式 1.通过继承Thread实现 public class ThreadDemo extends Thread{ public void run(){ System.out ...
阿里巴巴fastjson 包的使用解析json数据
Fastjson是一个Java语言编写的高性能功能完善的JSON库.由阿里巴巴公司团队开发的. 主要特性主要体现在以下几个方面: 1.高性能 fastjson采用独创的算法,将parse的速度提升到极 ...
OSI模型第三层网络层-初识路由协议
1.路由协议: 顾名思义就是路由器所使用的协议. 分类: (1)按照作用范围分类,IGP(类型)内部网关协议(rip,ospf,isis),EGP(类型)边界路由协议(bgp) 把互联网比作整个世界土 ...

Spark Structured Streaming框架(3)之数据输出源详解

1. File

2. Foreach

3. 自定义

3.1 ConsoleSink

3.2 DataStreamWriter

3.3 Structured Streaming

4. 参考文献

Spark Structured Streaming框架(3)之数据输出源详解的更多相关文章

随机推荐

热门专题