Before reading this article, please first read "Spark Sort Based Shuffle Memory Analysis".

The Spark Shuffle Read call stack is as follows:

1. org.apache.spark.rdd.ShuffledRDD#compute()

2. org.apache.spark.shuffle.ShuffleManager#getReader()

3. org.apache.spark.shuffle.hash.HashShuffleReader#read()

4. org.apache.spark.storage.ShuffleBlockFetcherIterator#initialize()

5. org.apache.spark.storage.ShuffleBlockFetcherIterator#splitLocalRemoteBlocks()

org.apache.spark.storage.ShuffleBlockFetcherIterator#sendRequest()

org.apache.spark.storage.ShuffleBlockFetcherIterator#fetchLocalBlocks()

The following are the classes and corresponding methods involved when fetchLocalBlocks() runs:

6. org.apache.spark.storage.BlockManager#getBlockData()

org.apache.spark.shuffle.ShuffleManager#shuffleBlockResolver()

ShuffleManager has two subclasses. For Hash Shuffle, the corresponding method is org.apache.spark.shuffle.hash.HashShuffleManager#shuffleBlockResolver(), which returns an org.apache.spark.shuffle.FileShuffleBlockResolver; FileShuffleBlockResolver#getBlockData() is then called to return the block data. For Sort Shuffle, the corresponding method is org.apache.spark.shuffle.sort.SortShuffleManager#shuffleBlockResolver(), which returns an org.apache.spark.shuffle.IndexShuffleBlockResolver; IndexShuffleBlockResolver#getBlockData() is then called to return the block data.
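
Which ShuffleManager implementation (and therefore which resolver) is in play is decided by configuration when SparkEnv is created. A minimal sketch, assuming the spark.shuffle.manager key and the short names used in the Spark 1.5 era; this is illustrative configuration code, not the quoted Spark source:

// Selecting the shuffle manager, which in turn determines whether
// FileShuffleBlockResolver or IndexShuffleBlockResolver serves the block data.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("shuffle-manager-demo")
  // "hash" -> HashShuffleManager (FileShuffleBlockResolver)
  // "sort" -> SortShuffleManager (IndexShuffleBlockResolver); "sort" is the default since Spark 1.2
  .set("spark.shuffle.manager", "sort")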

The following are the classes and corresponding methods involved when org.apache.spark.storage.ShuffleBlockFetcherIterator#sendRequest() runs:

7. org.apache.spark.network.shuffle.ShuffleClient#fetchBlocks

org.apache.spark.network.shuffle.ShuffleClient has two subclasses: ExternalShuffleClient and BlockTransferService. org.apache.spark.network.shuffle.BlockTransferService in turn has two subclasses, NettyBlockTransferService and NioBlockTransferService, corresponding to two different ways of fetching remote block data. In Spark 1.5.2 the NioBlockTransferService path has been deprecated and will be removed in a later release.

The methods are described below in the order of the call stack above. Only the main thread of execution is covered here; the details are discussed later.

ShuffledRDD#compute()

When a task runs, ShuffledRDD's compute method is invoked. Its code is as follows:

//org.apache.spark.rdd.ShuffledRDD#compute()
override def compute(split: Partition, context: TaskContext): Iterator[(K, C)] = {
  val dep = dependencies.head.asInstanceOf[ShuffleDependency[K, V, C]]
  // Obtain the reader via org.apache.spark.shuffle.ShuffleManager#getReader().
  // Whether Sort Shuffle or Hash Shuffle is in use, the reader is always
  // org.apache.spark.shuffle.hash.HashShuffleReader.
  SparkEnv.get.shuffleManager.getReader(dep.shuffleHandle, split.index, split.index + 1, context)
    .read()
    .asInstanceOf[Iterator[(K, C)]]
}

As can be seen, the core logic is to obtain a HashShuffleReader object via ShuffleManager#getReader() and then call HashShuffleReader#read() to read the shuffle data produced by the ShuffleMapTasks of the previous stage. Note that HashShuffleReader is used regardless of whether Hash Shuffle or Sort Shuffle is in effect.

HashShuffleReader#read()

Jumping into HashShuffleReader#read(), its source code is as follows:

/** Read the combined key-values for this reduce task */
override def read(): Iterator[Product2[K, C]] = {
  // Create the ShuffleBlockFetcherIterator; its constructor calls initialize(),
  // which runs splitLocalRemoteBlocks() to decide the fetch strategy:
  // remote data is read via sendRequest(), local data via fetchLocalBlocks().
  val blockFetcherItr = new ShuffleBlockFetcherIterator(
    context,
    blockManager.shuffleClient,
    blockManager,
    mapOutputTracker.getMapSizesByExecutorId(handle.shuffleId, startPartition),
    // Note: we use getSizeAsMb when no suffix is provided for backwards compatibility
    SparkEnv.get.conf.getSizeAsMb("spark.reducer.maxSizeInFlight", "48m") * 1024 * 1024)

  // Wrap the streams for compression based on configuration
  val wrappedStreams = blockFetcherItr.map { case (blockId, inputStream) =>
    blockManager.wrapForCompression(blockId, inputStream)
  }

  val ser = Serializer.getSerializer(dep.serializer)
  val serializerInstance = ser.newInstance()

  // Create a key/value iterator for each stream
  val recordIter = wrappedStreams.flatMap { wrappedStream =>
    // Note: the asKeyValueIterator below wraps a key/value iterator inside of a
    // NextIterator. The NextIterator makes sure that close() is called on the
    // underlying InputStream when all records have been read.
    serializerInstance.deserializeStream(wrappedStream).asKeyValueIterator
  }

  // Update the context task metrics for each record read.
  val readMetrics = context.taskMetrics.createShuffleReadMetricsForDependency()
  val metricIter = CompletionIterator[(Any, Any), Iterator[(Any, Any)]](
    recordIter.map(record => {
      readMetrics.incRecordsRead(1)
      record
    }),
    context.taskMetrics().updateShuffleReadMetrics())

  // An interruptible iterator must be used here in order to support task cancellation
  val interruptibleIter = new InterruptibleIterator[(Any, Any)](context, metricIter)

  val aggregatedIter: Iterator[Product2[K, C]] = if (dep.aggregator.isDefined) {
    if (dep.mapSideCombine) {
      // Read data that was already combined on the map side
      val combinedKeyValuesIterator = interruptibleIter.asInstanceOf[Iterator[(K, C)]]
      dep.aggregator.get.combineCombinersByKey(combinedKeyValuesIterator, context)
    } else {
      // Combine the values on the reduce side
      val keyValuesIterator = interruptibleIter.asInstanceOf[Iterator[(K, Nothing)]]
      dep.aggregator.get.combineValuesByKey(keyValuesIterator, context)
    }
  } else {
    require(!dep.mapSideCombine, "Map-side combine without Aggregator specified!")
    interruptibleIter.asInstanceOf[Iterator[Product2[K, C]]]
  }

  // Sort the output if a key ordering is defined
  dep.keyOrdering match {
    case Some(keyOrd: Ordering[K]) =>
      // Create an ExternalSorter to sort the data. Note that if spark.shuffle.spill is disabled,
      // the ExternalSorter won't spill to disk.
      val sorter = new ExternalSorter[K, C, C](ordering = Some(keyOrd), serializer = Some(ser))
      sorter.insertAll(aggregatedIter)
      context.taskMetrics().incMemoryBytesSpilled(sorter.memoryBytesSpilled)
      context.taskMetrics().incDiskBytesSpilled(sorter.diskBytesSpilled)
      context.internalMetricsToAccumulators(
        InternalAccumulator.PEAK_EXECUTION_MEMORY).add(sorter.peakMemoryUsedBytes)
      sorter.iterator
    case None =>
      aggregatedIter
  }
}
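
Which aggregation branch the reader takes is determined by the ShuffleDependency created by the upstream operator. A minimal local-mode sketch of the three cases (illustrative user code, not Spark source):

// Shows which branch of read() each common operator exercises.
import org.apache.spark.{SparkConf, SparkContext}

object ReadPathDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("read-path-demo"))
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    // reduceByKey enables map-side combine, so the reader calls combineCombinersByKey
    val reduced = pairs.reduceByKey(_ + _).collect()

    // groupByKey defines an aggregator but disables map-side combine,
    // so the reader calls combineValuesByKey
    val grouped = pairs.groupByKey().collect()

    // sortByKey sets keyOrdering, so the reader additionally sorts the output with ExternalSorter
    val sorted = pairs.sortByKey().collect()

    println(reduced.mkString(", "))
    println(grouped.mkString(", "))
    println(sorted.mkString(", "))
    sc.stop()
  }
}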

ShuffleBlockFetcherIterator#splitLocalRemoteBlocks()

splitLocalRemoteBlocks() determines the fetch strategy: the localBlocks variable records the BlockIds stored on the local machine, while remoteBlocks records all the BlockIds stored on remote machines.

Remote blocks are grouped into FetchRequests of roughly targetRequestSize (maxBytesInFlight / 5) bytes each, which are collected in

val remoteRequests = new ArrayBuffer[FetchRequest]

The source code of splitLocalRemoteBlocks() is as follows:

private[this] def splitLocalRemoteBlocks(): ArrayBuffer[FetchRequest] = {
  // Make remote requests at most maxBytesInFlight / 5 in length; the reason to keep them
  // smaller than maxBytesInFlight is to allow multiple, parallel fetches from up to 5
  // nodes, rather than blocking on reading output from one node.
  // maxBytesInFlight is the maximum amount of data requested at a time, 48 MB by default,
  // set via SparkEnv.get.conf.getSizeAsMb("spark.reducer.maxSizeInFlight", "48m") * 1024 * 1024
  val targetRequestSize = math.max(maxBytesInFlight / 5, 1L)
  logDebug("maxBytesInFlight: " + maxBytesInFlight + ", targetRequestSize: " + targetRequestSize)

  // Split local and remote blocks. Remote blocks are further split into FetchRequests of size
  // at most maxBytesInFlight in order to limit the amount of data in flight.
  val remoteRequests = new ArrayBuffer[FetchRequest]

  // Tracks total number of blocks (including zero sized blocks)
  var totalBlocks = 0
  for ((address, blockInfos) <- blocksByAddress) {
    totalBlocks += blockInfos.size
    // The data to fetch lives on the local machine
    if (address.executorId == blockManager.blockManagerId.executorId) {
      // Filter out zero-sized blocks and record the BlockIds that are local
      localBlocks ++= blockInfos.filter(_._2 != 0).map(_._1)
      numBlocksToFetch += localBlocks.size
    } else {
      // The data lives on a remote machine
      val iterator = blockInfos.iterator
      var curRequestSize = 0L
      var curBlocks = new ArrayBuffer[(BlockId, Long)]
      while (iterator.hasNext) {
        val (blockId, size) = iterator.next()
        // Skip empty blocks
        if (size > 0) {
          curBlocks += ((blockId, size))
          // Record the BlockIds that live on remote machines
          remoteBlocks += blockId
          numBlocksToFetch += 1
          curRequestSize += size
        } else if (size < 0) {
          throw new BlockException(blockId, "Negative block size " + size)
        }
        if (curRequestSize >= targetRequestSize) {
          // Add this FetchRequest
          remoteRequests += new FetchRequest(address, curBlocks)
          curBlocks = new ArrayBuffer[(BlockId, Long)]
          logDebug(s"Creating fetch request of $curRequestSize at $address")
          curRequestSize = 0
        }
      }
      // Add in the final request
      if (curBlocks.nonEmpty) {
        remoteRequests += new FetchRequest(address, curBlocks)
      }
    }
  }
  logInfo(s"Getting $numBlocksToFetch non-empty blocks out of $totalBlocks blocks")
  remoteRequests
}
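
To make the sizing concrete: with the default spark.reducer.maxSizeInFlight of 48 MB, targetRequestSize is max(48 MB / 5, 1) ≈ 9.6 MB, so a reducer keeps roughly five requests (to up to five different nodes) in flight at once. Below is a minimal, self-contained sketch of the same grouping logic; the block sizes are made up for illustration and this is not Spark source:

// Standalone illustration of how remote blocks are grouped into requests of about
// targetRequestSize bytes each.
object FetchRequestSizingDemo {
  def main(args: Array[String]): Unit = {
    val maxBytesInFlight = 48L * 1024 * 1024
    val targetRequestSize = math.max(maxBytesInFlight / 5, 1L) // ~9.6 MB

    val blockSizes = Seq(4L, 6L, 3L, 9L, 2L).map(_ * 1024 * 1024) // hypothetical blocks from one executor
    var curRequest = List.empty[Long]
    var curRequestSize = 0L
    var requests = List.empty[List[Long]]
    for (size <- blockSizes if size > 0) {
      curRequest ::= size
      curRequestSize += size
      if (curRequestSize >= targetRequestSize) { // close the request once it reaches the target
        requests ::= curRequest.reverse
        curRequest = Nil
        curRequestSize = 0L
      }
    }
    if (curRequest.nonEmpty) requests ::= curRequest.reverse // the final, possibly small, request
    // Prints: [4 MB + 6 MB], [3 MB + 9 MB], [2 MB]
    requests.reverse.foreach(r => println(r.map(_ / 1024 / 1024).mkString("request: ", " MB + ", " MB")))
  }
}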

ShuffleBlockFetcherIterator#fetchLocalBlocks()

fetchLocalBlocks() reads the local blocks by calling BlockManager's getBlockData method. Its source code is as follows:

private[this] def fetchLocalBlocks() {
  val iter = localBlocks.iterator
  while (iter.hasNext) {
    val blockId = iter.next()
    try {
      // Call BlockManager's getBlockData method
      val buf = blockManager.getBlockData(blockId)
      shuffleMetrics.incLocalBlocksFetched(1)
      shuffleMetrics.incLocalBytesRead(buf.size)
      buf.retain()
      results.put(new SuccessFetchResult(blockId, blockManager.blockManagerId, 0, buf))
    } catch {
      case e: Exception =>
        // If we see an exception, stop immediately.
        logError(s"Error occurred while fetching local blocks", e)
        results.put(new FailureFetchResult(blockId, blockManager.blockManagerId, e))
        return
    }
  }
}

Jumping to BlockManager's getBlockData method, its source code is as follows:

override def getBlockData(blockId: BlockId): ManagedBuffer = {
  if (blockId.isShuffle) {
    // First call ShuffleManager's shuffleBlockResolver method to get the ShuffleBlockResolver,
    // then call its getBlockData method
    shuffleManager.shuffleBlockResolver.getBlockData(blockId.asInstanceOf[ShuffleBlockId])
  } else {
    val blockBytesOpt = doGetLocal(blockId, asBlockResult = false)
      .asInstanceOf[Option[ByteBuffer]]
    if (blockBytesOpt.isDefined) {
      val buffer = blockBytesOpt.get
      new NioManagedBuffer(buffer)
    } else {
      throw new BlockNotFoundException(blockId.toString)
    }
  }
}

org.apache.spark.shuffle.ShuffleManager#shuffleBlockResolver() returns the corresponding ShuffleBlockResolver: org.apache.spark.shuffle.FileShuffleBlockResolver for Hash Shuffle, and org.apache.spark.shuffle.IndexShuffleBlockResolver for Sort Shuffle. The resolver's getBlockData method is then called and returns a FileSegmentManagedBuffer for the corresponding file segment.

The source code of FileShuffleBlockResolver#getBlockData is as follows:

override def getBlockData(blockId: ShuffleBlockId): ManagedBuffer = {
  // Files produced by Hash Shuffle's consolidate-files mechanism
  if (consolidateShuffleFiles) {
    // Search all file groups associated with this shuffle.
    val shuffleState = shuffleStates(blockId.shuffleId)
    val iter = shuffleState.allFileGroups.iterator
    while (iter.hasNext) {
      val segmentOpt = iter.next.getFileSegmentFor(blockId.mapId, blockId.reduceId)
      if (segmentOpt.isDefined) {
        val segment = segmentOpt.get
        return new FileSegmentManagedBuffer(
          transportConf, segment.file, segment.offset, segment.length)
      }
    }
    throw new IllegalStateException("Failed to find shuffle block: " + blockId)
  } else {
    // Files produced by the plain Hash Shuffle mechanism
    val file = blockManager.diskBlockManager.getFile(blockId)
    new FileSegmentManagedBuffer(transportConf, file, 0, file.length)
  }
}
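
Whether the consolidateShuffleFiles branch is taken is driven by configuration. A minimal sketch, assuming the spark.shuffle.consolidateFiles key used by Hash Shuffle in this era (false by default); illustrative configuration, not the quoted source:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.manager", "hash")
  // When enabled, map tasks running on the same core append to shared file groups,
  // so getBlockData must search allFileGroups for the matching segment.
  .set("spark.shuffle.consolidateFiles", "true")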

The source code of IndexShuffleBlockResolver#getBlockData is as follows:

override def getBlockData(blockId: ShuffleBlockId): ManagedBuffer = {
  // The block is actually going to be a range of a single map output file for this map, so
  // find out the consolidated file, then the offset within that from our index
  // Locate the index file using the shuffleId and mapId
  val indexFile = getIndexFile(blockId.shuffleId, blockId.mapId)

  val in = new DataInputStream(new FileInputStream(indexFile))
  try {
    // Seek to the entry for this block
    ByteStreams.skipFully(in, blockId.reduceId * 8)
    // Start offset of the data
    val offset = in.readLong()
    // End offset of the data
    val nextOffset = in.readLong()
    // Return the FileSegment
    new FileSegmentManagedBuffer(
      transportConf,
      getDataFile(blockId.shuffleId, blockId.mapId),
      offset,
      nextOffset - offset)
  } finally {
    in.close()
  }
}
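
The index file that this relies on is simply a sequence of longs: for N reduce partitions it stores N + 1 cumulative offsets, and partition r's bytes live in [offset(r), offset(r + 1)) of the single data file, which is why skipping reduceId * 8 bytes and reading two longs is enough. A minimal, self-contained sketch of that layout (in-memory streams instead of files; not Spark source):

// Builds an index for three hypothetical partition lengths and locates partition 2
// exactly the way getBlockData does.
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

object IndexLayoutDemo {
  def main(args: Array[String]): Unit = {
    // Suppose one map task wrote three partitions with these segment lengths.
    val lengths = Array(100L, 0L, 250L)

    // Build the index: cumulative offsets, starting at 0.
    val bytes = new ByteArrayOutputStream()
    val out = new DataOutputStream(bytes)
    var offset = 0L
    out.writeLong(offset)
    lengths.foreach { len => offset += len; out.writeLong(offset) }
    out.close()

    // Locate reduce partition 2: skip reduceId * 8 bytes, then read two longs.
    val reduceId = 2
    val in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray))
    in.skipBytes(reduceId * 8)
    val start = in.readLong()
    val end = in.readLong()
    in.close()
    println(s"partition $reduceId occupies data-file bytes [$start, $end)") // [100, 350)
  }
}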

ShuffleBlockFetcherIterator#sendRequest()

The sendRequest() method is used to fetch data from remote machines:

private[this] def sendRequest(req: FetchRequest) {
  logDebug("Sending request for %d blocks (%s) from %s".format(
    req.blocks.size, Utils.bytesToString(req.size), req.address.hostPort))
  bytesInFlight += req.size

  // so we can look up the size of each blockID
  val sizeMap = req.blocks.map { case (blockId, size) => (blockId.toString, size) }.toMap
  val blockIds = req.blocks.map(_._1.toString)
  val address = req.address

  // Fetch the data with ShuffleClient's fetchBlocks method.
  // There are two kinds of ShuffleClient: ExternalShuffleClient and BlockTransferService;
  // the default is BlockTransferService.
  shuffleClient.fetchBlocks(address.host, address.port, address.executorId, blockIds.toArray,
    new BlockFetchingListener {
      override def onBlockFetchSuccess(blockId: String, buf: ManagedBuffer): Unit = {
        // Only add the buffer to results queue if the iterator is not zombie,
        // i.e. cleanup() has not been called yet.
        if (!isZombie) {
          // Increment the ref count because we need to pass this to a different thread.
          // This needs to be released after use.
          buf.retain()
          results.put(new SuccessFetchResult(BlockId(blockId), address, sizeMap(blockId), buf))
          shuffleMetrics.incRemoteBytesRead(buf.size)
          shuffleMetrics.incRemoteBlocksFetched(1)
        }
        logTrace("Got remote block " + blockId + " after " + Utils.getUsedTimeMs(startTime))
      }

      override def onBlockFetchFailure(blockId: String, e: Throwable): Unit = {
        logError(s"Failed to get block(s) from ${req.address.host}:${req.address.port}", e)
        results.put(new FailureFetchResult(BlockId(blockId), address, e))
      }
    }
  )
}

As the code above shows, remote block data is fetched via shuffleClient.fetchBlocks. org.apache.spark.network.shuffle.ShuffleClient has two subclasses, ExternalShuffleClient and BlockTransferService, and org.apache.spark.network.shuffle.BlockTransferService in turn has two subclasses, NettyBlockTransferService and NioBlockTransferService. The shuffleClient object is defined in org.apache.spark.storage.BlockManager; its source code is as follows:

// The shuffleClient defined in org.apache.spark.storage.BlockManager
private[spark] val shuffleClient = if (externalShuffleServiceEnabled) {
  // Use ExternalShuffleClient to fetch remote block data
  val transConf = SparkTransportConf.fromSparkConf(conf, numUsableCores)
  new ExternalShuffleClient(transConf, securityManager, securityManager.isAuthenticationEnabled(),
    securityManager.isSaslEncryptionEnabled())
} else {
  // Use NettyBlockTransferService or NioBlockTransferService to fetch remote block data
  blockTransferService
}
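
The externalShuffleServiceEnabled flag comes from configuration. A minimal sketch, assuming the standard spark.shuffle.service.enabled key (the external shuffle service itself must also be running on each worker); this is illustrative and not the quoted source:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // When true, BlockManager uses ExternalShuffleClient, so shuffle files can still be served
  // after an executor exits (useful with dynamic allocation).
  .set("spark.shuffle.service.enabled", "true")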

The blockTransferService in the code above is initialized in SparkEnv, as follows:

// blockTransferService is initialized in org.apache.spark.SparkEnv
val blockTransferService =
  conf.get("spark.shuffle.blockTransferService", "netty").toLowerCase match {
    case "netty" =>
      new NettyBlockTransferService(conf, securityManager, numUsableCores)
    case "nio" =>
      logWarning("NIO-based block transfer service is deprecated, " +
        "and will be removed in Spark 1.6.0.")
      new NioBlockTransferService(conf, securityManager)
  }
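
So on this 1.5.x code path the transport implementation is switched through the spark.shuffle.blockTransferService key read above: "netty" is the default, "nio" is deprecated. A minimal sketch of setting it explicitly (illustrative only; there is rarely a reason to change the default):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // "netty" -> NettyBlockTransferService (default); "nio" -> NioBlockTransferService (deprecated)
  .set("spark.shuffle.blockTransferService", "netty")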
