Understanding Spark Structured Streaming in One Article: A Very Simple Source Code Walkthrough

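To help you understand how Structured Streaming runs, I will use one code example to walk through its basic execution flow and the principles behind it.

Here is a simple piece of code: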

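    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.ProcessingTime

    val spark = SparkSession
      .builder
      .appName("StructuredNetworkWordCount")
      .master("local[4]")
      .getOrCreate()

    // NOTE: the numeric value is missing in the source text.
    spark.conf.set("spark.sql.shuffle.partitions", )

    import spark.implicits._

    // Read lines of text from a socket source.
    val words = spark.readStream
      .format("socket")
      .option("host", "localhost")
      // NOTE: the port number is missing in the source text.
      .option("port", )
      .load()

    // Split each line into words and count the occurrences of each word.
    val df1 = words.as[String]
      .flatMap(_.split(" "))
      .toDF("word")
      .groupBy("word")
      .count()

    // Print the full result table to the console on every trigger.
    df1.writeStream
      .outputMode("complete")
      .format("console")
      // The 10-second interval is taken from the description below;
      // the literal itself is missing in the source text.
      .trigger(ProcessingTime("10 seconds"))
      .start()

    spark.streams.awaitAnyTermination()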

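This code is a word count. It reads data from a socket source, splits each line of text on " " into a Dataset of words, converts that into a DataFrame whose column is labeled "word", groups by the word column, and counts the occurrences of each word. The result is written to the console, with a batch interval of 10 seconds.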
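To make the "complete" output mode concrete, here is a minimal plain-Scala sketch (no Spark involved) of what the query computes: each batch of lines is split into words, and the running counts over everything seen so far are emitted after every batch. The names `CompleteModeSketch`, `countWords`, and `runBatches` are hypothetical and chosen only for illustration; Spark keeps this running state in its own state store rather than in a local `Map`.

```scala
object CompleteModeSketch {
  // Split each line on single spaces and count the words of one batch,
  // mirroring flatMap(_.split(" ")).groupBy("word").count().
  def countWords(lines: Seq[String]): Map[String, Long] =
    lines
      .flatMap(_.split(" "))
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size.toLong }

  // Fold the batches into a running total and keep the full table after each batch,
  // which is what the "complete" output mode writes to the sink.
  def runBatches(batches: Seq[Seq[String]]): Seq[Map[String, Long]] =
    batches
      .scanLeft(Map.empty[String, Long]) { (totals, batch) =>
        countWords(batch).foldLeft(totals) { case (acc, (word, n)) =>
          acc.updated(word, acc.getOrElse(word, 0L) + n)
        }
      }
      .tail // drop the initial empty state

  def main(args: Array[String]): Unit = {
    val outputs = runBatches(Seq(Seq("a b a"), Seq("b c")))
    outputs.zipWithIndex.foreach { case (table, batchId) =>
      println(s"Batch: $batchId -> $table")
    }
  }
}
```

In this sketch the second batch reports the word "a" again even though it did not appear in that batch, because the whole result table is emitted every time.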

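Now let's look at how it works. Spark relies on lazy evaluation. In the example above, nothing before the `writeStream` call performs any real computation on the data; the computation (the formulas, or functions) is only recorded in the DataFrame. Real computation begins only after `writeStream...start()` is called. Why?

Because this gives the Spark engine room to optimize.

For example:

Suppose a database stores people's names and ages, and I want to print to the console the names of the first ten people older than 20. My Spark code would look like this: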

    // df is assumed to be a Dataset[(String, Int)] of (name, age) pairs.
    df.filter { row =>
      row._2 > 20 }
      .show(10)

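Suppose every line were computed as soon as it executed. The `filter` step would then scan all the data in df and keep everyone older than 20, and the `show` step would pick the first 10 rows out of that result and print them.

See the problem? The output only needs 10 people older than 20, yet everyone was filtered. It is enough to keep filtering until 10 matches are found; the remaining rows never need to be examined. This is where Spark's lazy evaluation can optimize.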
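The difference between the two strategies can be seen without Spark. The following is a minimal sketch using plain Scala collections, where an eager `Seq` stands in for "compute every line immediately" and an `Iterator` stands in for lazy evaluation. The names `LazyFilterSketch`, `Person`, and `firstNames` are hypothetical; this only illustrates the idea and is not how Spark implements it internally.

```scala
object LazyFilterSketch {
  final case class Person(name: String, age: Int)

  // Returns the first `n` names with age > 20, plus how many rows the predicate tested.
  def firstNames(people: Seq[Person], n: Int, lazily: Boolean): (List[String], Int) = {
    var tested = 0
    val olderThan20 = (p: Person) => { tested += 1; p.age > 20 }
    val names =
      if (lazily) people.iterator.filter(olderThan20).take(n).map(_.name).toList
      else people.filter(olderThan20).take(n).map(_.name).toList
    (names, tested)
  }

  def main(args: Array[String]): Unit = {
    // Ages cycle through 10..49, so the first ten people are not older than 20.
    val people = (1 to 1000).map(i => Person(s"p$i", 10 + i % 40))
    val (eagerNames, eagerTested) = firstNames(people, 10, lazily = false)
    val (lazyNames, lazyTested) = firstNames(people, 10, lazily = true)
    println(s"same result: ${eagerNames == lazyNames}")
    println(s"rows tested eagerly: $eagerTested") // every row is tested
    println(s"rows tested lazily:  $lazyTested")  // stops at the 10th match
  }
}
```

Both variants return the same names; only the amount of work differs.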

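In Spark, no real computation happens before an output function (an action) is called; the plan is optimized first and only then executed. Let's look at the source code.

Here is the code that Structured Streaming runs each time a batch interval arrives: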

    private def runBatch(sparkSessionToRunBatch: SparkSession): Unit = {
      // Request unprocessed data from all sources.
      newData = reportTimeTaken("getBatch") {
        availableOffsets.flatMap {
          case (source, available)
            if committedOffsets.get(source).map(_ != available).getOrElse(true) =>
            val current = committedOffsets.get(source)
            val batch = source.getBatch(current, available)
            logDebug(s"Retrieving data from $source: $current -> $available")
            Some(source -> batch)
          case _ => None
        }
      }

      // A list of attributes that will need to be updated.
      var replacements = new ArrayBuffer[(Attribute, Attribute)]
      // Replace sources in the logical plan with data that has arrived since the last batch.
      val withNewSources = logicalPlan transform {
        case StreamingExecutionRelation(source, output) =>
          newData.get(source).map { data =>
            val newPlan = data.logicalPlan
            assert(output.size == newPlan.output.size,
              s"Invalid batch: ${Utils.truncatedString(output, ",")} != " +
              s"${Utils.truncatedString(newPlan.output, ",")}")
            replacements ++= output.zip(newPlan.output)
            newPlan
          }.getOrElse {
            LocalRelation(output)
          }
      }

      // Rewire the plan to use the new attributes that were returned by the source.
      val replacementMap = AttributeMap(replacements)
      val triggerLogicalPlan = withNewSources transformAllExpressions {
        case a: Attribute if replacementMap.contains(a) => replacementMap(a)
        case ct: CurrentTimestamp =>
          CurrentBatchTimestamp(offsetSeqMetadata.batchTimestampMs,
            ct.dataType)
        case cd: CurrentDate =>
          CurrentBatchTimestamp(offsetSeqMetadata.batchTimestampMs,
            cd.dataType, cd.timeZoneId)
      }

      reportTimeTaken("queryPlanning") {
        lastExecution = new IncrementalExecution(
          sparkSessionToRunBatch,
          triggerLogicalPlan,
          outputMode,
          checkpointFile("state"),
          currentBatchId,
          offsetSeqMetadata)
        lastExecution.executedPlan // Force the lazy generation of execution plan
      }

      val nextBatch =
        new Dataset(sparkSessionToRunBatch, lastExecution, RowEncoder(lastExecution.analyzed.schema))

      reportTimeTaken("addBatch") {
        sink.addBatch(currentBatchId, nextBatch)
      }

      awaitBatchLock.lock()
      try {
        // Wake up any threads that are waiting for the stream to progress.
        awaitBatchLockCondition.signalAll()
      } finally {
        awaitBatchLock.unlock()
      }
    }

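It is actually quite simple. Everything before the `new Dataset(...)` call is about resolving the user's code: fetching the new data, rewriting the logical plan for the current batch, and creating the batch execution object. The `triggerLogicalPlan` passed to `IncrementalExecution` is the user's logic with this batch's data plugged in; optimization and physical planning happen inside `IncrementalExecution` when `lastExecution.executedPlan` is forced. This object, together with `sparkSessionToRunBatch` (the runtime environment) and a `RowEncoder` (the serializer), makes up a new Dataset, which represents the computation that will eventually be executed on the worker nodes. The `sink.addBatch` call then hands this Dataset to the sink. Since we use the console as the downstream (sink), let's look at the console sink's `addBatch` code: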
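The "replace sources in the logical plan" step is the heart of `runBatch`, and it can be sketched with a toy plan tree. The types below (`Plan`, `StreamingSource`, `BatchData`, `Aggregate`) are hypothetical stand-ins invented for this illustration, not Catalyst classes; the point is only that the streaming placeholder is swapped for concrete batch data while the rest of the plan is kept.

```scala
object PlanRewriteSketch {
  // A tiny stand-in for a logical plan tree (hypothetical types).
  sealed trait Plan
  // Placeholder for data that has yet to arrive from a source.
  final case class StreamingSource(name: String) extends Plan
  // Concrete rows for one batch.
  final case class BatchData(rows: Seq[String]) extends Plan
  // Stands for an operator such as groupBy(column).count().
  final case class Aggregate(child: Plan, column: String) extends Plan

  // Rebuild the tree, swapping every streaming placeholder for this batch's data.
  // A source without new data becomes an empty batch.
  def withNewSources(plan: Plan, newData: Map[String, Seq[String]]): Plan = plan match {
    case StreamingSource(name)    => BatchData(newData.getOrElse(name, Seq.empty))
    case Aggregate(child, column) => Aggregate(withNewSources(child, newData), column)
    case batch: BatchData         => batch
  }

  def main(args: Array[String]): Unit = {
    val userPlan = Aggregate(StreamingSource("socket"), "word")
    println(withNewSources(userPlan, Map("socket" -> Seq("a b a"))))
    println(withNewSources(userPlan, Map.empty))
  }
}
```

The user's plan is built once; each trigger produces a fresh copy of it with only the leaves replaced, which is why the same query definition can be re-run for every batch.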

    override def addBatch(batchId: Long, data: DataFrame): Unit = synchronized {
      val batchIdStr = if (batchId <= lastBatchId) {
        s"Rerun batch: $batchId"
      } else {
        lastBatchId = batchId
        s"Batch: $batchId"
      }

      // scalastyle:off println
      println("-------------------------------------------")
      println(batchIdStr)
      println("-------------------------------------------")
      // scalastyle:off println
      data.sparkSession.createDataFrame(
        data.sparkSession.sparkContext.parallelize(data.collect()), data.schema)
        .show(numRowsToShow, isTruncated)
    }

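The key is the `.show` call at the end. `show` is a real action; everything before it in the user's query merely wraps operators. (The `data.collect()` call just before it is an action as well and goes through the same `withAction` path shown further down.) Let's look at the code behind `show`: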

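    private[sql] def showString(_numRows: Int, truncate: Int = 20): String = {
      val numRows = _numRows.max(0)
      // Fetch one extra row to find out whether more data exists.
      val takeResult = toDF().take(numRows + 1)
      val hasMoreData = takeResult.length > numRows
      val data = takeResult.take(numRows)
      // ... (the rest of the method is omitted here)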
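The `numRows + 1` idea in `showString` is a small, reusable trick: asking for one row more than requested reveals whether the output was cut off, without counting the whole data set. A minimal sketch on a plain `Iterator`, with the hypothetical names `ShowSketch` and `firstRows`:

```scala
object ShowSketch {
  // Returns at most `requested` rows and a flag telling whether more rows exist.
  def firstRows[T](rows: Iterator[T], requested: Int): (List[T], Boolean) = {
    val numRows = requested.max(0)              // negative requests are treated as 0
    val fetched = rows.take(numRows + 1).toList // one extra row as a probe
    val hasMoreData = fetched.length > numRows
    (fetched.take(numRows), hasMoreData)
  }

  def main(args: Array[String]): Unit = {
    println(firstRows(Iterator("a", "b", "c"), 2)) // the probe row exists, so more data remains
    println(firstRows(Iterator("a", "b"), 2))      // exactly two rows, nothing was cut off
  }
}
```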

The `take` call in `showString` leads here:

    def take(n: Int): Array[T] = head(n)

    def head(n: Int): Array[T] = withAction("head", limit(n).queryExecution)(collectFromPlan)

    private def withAction[U](name: String, qe: QueryExecution)(action: SparkPlan => U) = {
      try {
        qe.executedPlan.foreach { plan =>
          plan.resetMetrics()
        }
        val start = System.nanoTime()
        val result = SQLExecution.withNewExecutionId(sparkSession, qe) {
          action(qe.executedPlan)
        }
        val end = System.nanoTime()
        sparkSession.listenerManager.onSuccess(name, qe, end - start)
        result
      } catch {
        case e: Exception =>
          sparkSession.listenerManager.onFailure(name, qe, e)
          throw e
      }
    }

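The function name `withAction` tells us that the real computation is about to start. The `SQLExecution.withNewExecutionId` call inside it wraps the execution of the plan: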

    def withNewExecutionId[T](
        sparkSession: SparkSession,
        queryExecution: QueryExecution)(body: => T): T = {
      val sc = sparkSession.sparkContext
      val oldExecutionId = sc.getLocalProperty(EXECUTION_ID_KEY)
      if (oldExecutionId == null) {
        val executionId = SQLExecution.nextExecutionId
        sc.setLocalProperty(EXECUTION_ID_KEY, executionId.toString)
        executionIdToQueryExecution.put(executionId, queryExecution)
        val r = try {
          // sparkContext.getCallSite() would first try to pick up any call site that was previously
          // set, then fall back to Utils.getCallSite(); call Utils.getCallSite() directly on
          // streaming queries would give us call site like "run at <unknown>:0"
          val callSite = sparkSession.sparkContext.getCallSite()

          sparkSession.sparkContext.listenerBus.post(SparkListenerSQLExecutionStart(
            executionId, callSite.shortForm, callSite.longForm, queryExecution.toString,
            SparkPlanInfo.fromSparkPlan(queryExecution.executedPlan), System.currentTimeMillis()))
          try {
            body
          } finally {
            sparkSession.sparkContext.listenerBus.post(SparkListenerSQLExecutionEnd(
              executionId, System.currentTimeMillis()))
          }
        } finally {
          executionIdToQueryExecution.remove(executionId)
          sc.setLocalProperty(EXECUTION_ID_KEY, null)
        }
        r
      } else {
        // Don't support nested `withNewExecutionId`. This is an example of the nested
        // `withNewExecutionId`:
        //
        // class DataFrame {
        //   def foo: T = withNewExecutionId { something.createNewDataFrame().collect() }
        // }
        //
        // Note: `collect` will call withNewExecutionId
        // In this case, only the "executedPlan" for "collect" will be executed. The "executedPlan"
        // for the outer DataFrame won't be executed. So it's meaningless to create a new Execution
        // for the outer DataFrame. Even if we track it, since its "executedPlan" doesn't run,
        // all accumulator metrics will be 0. It will confuse people if we show them in Web UI.
        //
        // A real case is the `DataFrame.count` method.
        throw new IllegalArgumentException(s"$EXECUTION_ID_KEY is already set")
      }
    }

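Look at the `listenerBus.post(SparkListenerSQLExecutionStart(...))` call: it publishes an event that carries the execution id, the call site, a description of the query plan, and a timestamp. Note that the listener bus delivers these events to listeners on the driver (the web UI, for example); the actual work starts when `body` is evaluated, which runs `action(qe.executedPlan)` and submits the job whose tasks the workers then execute according to the plan. That is the basic, simple flow, and hopefully it is helpful for anyone getting started with Spark.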
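The control flow of `withNewExecutionId` (refuse nesting, announce the start, run the body, always announce the end and clean up) can be sketched in isolation. Everything below is a simplified, hypothetical stand-in: `ExecutionIdSketch` uses a `ThreadLocal` in place of the SparkContext local property and a plain buffer named `events` in place of the listener bus, and that buffer is not thread-safe.

```scala
import scala.collection.mutable.ArrayBuffer

object ExecutionIdSketch {
  // Stand-in for the EXECUTION_ID_KEY local property: None means "no execution active".
  private val currentId = new ThreadLocal[Option[Long]] {
    override def initialValue(): Option[Long] = None
  }
  private var nextId = 0L

  // Stand-in for the listener bus; a real bus would notify listeners asynchronously.
  val events: ArrayBuffer[String] = ArrayBuffer.empty[String]

  def withNewExecutionId[T](body: => T): T = {
    if (currentId.get().isDefined) {
      // Nested executions are rejected, as in the code above.
      throw new IllegalArgumentException("execution id is already set")
    }
    val id = synchronized { nextId += 1; nextId - 1 }
    currentId.set(Some(id))
    events += s"start $id"
    try body // the by-name argument is evaluated here, between the two events
    finally {
      events += s"end $id"
      currentId.set(None)
    }
  }

  def main(args: Array[String]): Unit = {
    val result = withNewExecutionId { 1 + 1 }
    println(result)
    println(events.mkString(", "))
  }
}
```

Because `body` is a by-name parameter, nothing in it runs until the surrounding bookkeeping is in place, which mirrors how the action on the executed plan is deferred until inside `withNewExecutionId`.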
