Spark 2.0 shuffle service

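The shuffle service is a core part of Spark. This article covers the non-ExternalShuffleClient path and walks through the architecture of BlockTransferService. ShuffleClient is the foundation of the whole framework; besides close() from Closeable, it declares two methods, init and fetchBlocks.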

/** Provides an interface for reading shuffle files, either from an Executor or external service. */
public abstract class ShuffleClient implements Closeable {

  /**
   * Initializes the ShuffleClient, specifying this Executor's appId.
   * Must be called before any other method on the ShuffleClient.
   */
  public void init(String appId) { }

  /**
   * Fetch a sequence of blocks from a remote node asynchronously.
   *
   * Note that this API takes a sequence so the implementation can batch requests, and does not
   * return a future so the underlying implementation can invoke onBlockFetchSuccess as soon as
   * the data of a block is fetched, rather than waiting for all blocks to be fetched.
   */
  public abstract void fetchBlocks(
      String host,
      int port,
      String execId,
      String[] blockIds,
      BlockFetchingListener listener);
}
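To make the callback contract concrete, here is a minimal sketch, not Spark's implementation. It assumes a hypothetical in-memory client (InMemoryShuffleClientSketch) and a hypothetical listener interface that carries plain byte arrays instead of ManagedBuffer. The block IDs are illustrative only. Unlike the real API, the sketch runs synchronously. It only shows that one batch request produces one callback per block, and that init must come first.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in that mirrors the init/fetchBlocks shape of ShuffleClient.
final class InMemoryShuffleClientSketch {

    // Hypothetical stand-in for BlockFetchingListener; payloads are plain byte[] here.
    interface SketchListener {
        void onBlockFetchSuccess(String blockId, byte[] data);
        void onBlockFetchFailure(String blockId, Throwable exception);
    }

    private final Map<String, byte[]> store = new HashMap<>();
    private String appId; // set by init(); null means "not initialized"

    void init(String appId) {
        this.appId = appId;
    }

    void put(String blockId, byte[] data) {
        store.put(blockId, data);
    }

    // Takes a batch of ids and reports each block individually through the listener,
    // instead of returning a single future for the whole batch.
    void fetchBlocks(String[] blockIds, SketchListener listener) {
        if (appId == null) {
            throw new IllegalStateException("init(appId) must be called first");
        }
        for (String id : blockIds) {
            byte[] data = store.get(id);
            if (data != null) {
                listener.onBlockFetchSuccess(id, data);
            } else {
                listener.onBlockFetchFailure(id, new IllegalArgumentException("no such block: " + id));
            }
        }
    }

    // Runs one batch and records the callbacks in the order they arrive.
    static List<String> demo() {
        InMemoryShuffleClientSketch client = new InMemoryShuffleClientSketch();
        client.init("app-1");
        client.put("shuffle_0_0_0", "abc".getBytes(StandardCharsets.UTF_8));
        List<String> events = new ArrayList<>();
        client.fetchBlocks(new String[] {"shuffle_0_0_0", "shuffle_0_1_0"}, new SketchListener() {
            @Override
            public void onBlockFetchSuccess(String blockId, byte[] data) {
                events.add("ok:" + blockId + ":" + data.length);
            }

            @Override
            public void onBlockFetchFailure(String blockId, Throwable exception) {
                events.add("fail:" + blockId);
            }
        });
        return events;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because results arrive per block, a caller can start processing the first block while later ones are still in flight. This is the reason the API does not return a future for the whole batch.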

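The BlockFetchingListener interface has two callbacks. onBlockFetchSuccess is called once for each block that is fetched successfully. After it returns, the data is released automatically, so if the data is handed to another thread, the receiver must call retain() and release() itself or copy the data into a new buffer. onBlockFetchFailure is called at least once per block when fetching fails.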

public interface BlockFetchingListener extends EventListener {

  /**
   * Called once per successfully fetched block. After this call returns, data will be released
   * automatically. If the data will be passed to another thread, the receiver should retain()
   * and release() the buffer on their own, or copy the data to a new buffer.
   */
  void onBlockFetchSuccess(String blockId, ManagedBuffer data);

  /**
   * Called at least once per block upon failures.
   */
  void onBlockFetchFailure(String blockId, Throwable exception);
}
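The "copy before handing off" option can be sketched as follows. This is an illustration under stated assumptions, not Spark code. SketchBuffer is a hypothetical stand-in for ManagedBuffer that becomes unusable once released, and the explicit release() call in demo() plays the role of the framework releasing the buffer after the callback returns.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ListenerCopySketch {

    // Hypothetical stand-in for a ManagedBuffer whose memory is reclaimed after the callback.
    static final class SketchBuffer {
        private final ByteBuffer buf;
        private boolean released;

        SketchBuffer(byte[] bytes) {
            this.buf = ByteBuffer.wrap(bytes);
        }

        ByteBuffer nioByteBuffer() {
            if (released) {
                throw new IllegalStateException("buffer already released");
            }
            return buf.duplicate();
        }

        void release() {
            released = true;
        }
    }

    // Copies the bytes out so they stay valid after the original buffer is released.
    static byte[] copyOf(SketchBuffer data) {
        ByteBuffer src = data.nioByteBuffer();
        byte[] copy = new byte[src.remaining()];
        src.get(copy);
        return copy;
    }

    static int demo() {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            SketchBuffer data = new SketchBuffer(new byte[] {1, 2, 3, 4});
            // Inside the callback: copy first, then hand only the copy to the other thread.
            byte[] copy = copyOf(data);
            Future<Integer> sum = worker.submit(() -> {
                int total = 0;
                for (byte b : copy) {
                    total += b;
                }
                return total;
            });
            // Simulates the automatic release that happens once the callback returns.
            data.release();
            return sum.get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            worker.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Copying costs one extra allocation per block but keeps ownership simple. The retain()/release() route avoids the copy, at the price of making the receiving thread responsible for the final release.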

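BlockTransferService extends ShuffleClient and supplies shared implementations of some methods, such as the blocking fetchBlockSync and uploadBlockSync.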

private[spark]
abstract class BlockTransferService extends ShuffleClient with Closeable with Logging {

  /**
   * Initialize the transfer service by giving it the BlockDataManager that can be used to fetch
   * local blocks or put local blocks.
   */
  def init(blockDataManager: BlockDataManager): Unit

  /**
   * Tear down the transfer service.
   */
  def close(): Unit

  /**
   * Port number the service is listening on, available only after [[init]] is invoked.
   */
  def port: Int

  /**
   * Host name the service is listening on, available only after [[init]] is invoked.
   */
  def hostName: String

  /**
   * Fetch a sequence of blocks from a remote node asynchronously,
   * available only after [[init]] is invoked.
   *
   * Note that this API takes a sequence so the implementation can batch requests, and does not
   * return a future so the underlying implementation can invoke onBlockFetchSuccess as soon as
   * the data of a block is fetched, rather than waiting for all blocks to be fetched.
   */
  override def fetchBlocks(
      host: String,
      port: Int,
      execId: String,
      blockIds: Array[String],
      listener: BlockFetchingListener): Unit

  /**
   * Upload a single block to a remote node, available only after [[init]] is invoked.
   */
  def uploadBlock(
      hostname: String,
      port: Int,
      execId: String,
      blockId: BlockId,
      blockData: ManagedBuffer,
      level: StorageLevel,
      classTag: ClassTag[_]): Future[Unit]

  /**
   * A special case of [[fetchBlocks]], as it fetches only one block and is blocking.
   *
   * It is also only available after [[init]] is invoked.
   */
  def fetchBlockSync(host: String, port: Int, execId: String, blockId: String): ManagedBuffer = {
    // A monitor for the thread to wait on.
    val result = Promise[ManagedBuffer]()
    fetchBlocks(host, port, execId, Array(blockId),
      new BlockFetchingListener {
        override def onBlockFetchFailure(blockId: String, exception: Throwable): Unit = {
          result.failure(exception)
        }
        override def onBlockFetchSuccess(blockId: String, data: ManagedBuffer): Unit = {
          // Copy the bytes into a new buffer: `data` is released once this callback returns.
          val ret = ByteBuffer.allocate(data.size.toInt)
          ret.put(data.nioByteBuffer())
          ret.flip()
          result.success(new NioManagedBuffer(ret))
        }
      })
    // Block the calling thread until the listener completes the promise.
    ThreadUtils.awaitResult(result.future, Duration.Inf)
  }

  /**
   * Upload a single block to a remote node, available only after [[init]] is invoked.
   *
   * This method is similar to [[uploadBlock]], except this one blocks the thread
   * until the upload finishes.
   */
  def uploadBlockSync(
      hostname: String,
      port: Int,
      execId: String,
      blockId: BlockId,
      blockData: ManagedBuffer,
      level: StorageLevel,
      classTag: ClassTag[_]): Unit = {
    val future = uploadBlock(hostname, port, execId, blockId, blockData, level, classTag)
    ThreadUtils.awaitResult(future, Duration.Inf)
  }
}
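The pattern behind fetchBlockSync, which turns a callback-based fetch into a blocking call, can be sketched in plain Java with CompletableFuture standing in for the Scala Promise. The Listener and AsyncFetcher interfaces below are hypothetical and carry byte arrays rather than ManagedBuffer. This shows the idea, not the Spark code itself.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CompletableFuture;

final class FetchSyncSketch {

    // Hypothetical listener with the same two callbacks as BlockFetchingListener.
    interface Listener {
        void onBlockFetchSuccess(String blockId, byte[] data);
        void onBlockFetchFailure(String blockId, Throwable exception);
    }

    // Hypothetical asynchronous fetcher; it may invoke the listener on another thread.
    interface AsyncFetcher {
        void fetchBlocks(String[] blockIds, Listener listener);
    }

    // Fetches exactly one block and blocks the caller until a callback fires.
    static byte[] fetchBlockSync(AsyncFetcher fetcher, String blockId) {
        CompletableFuture<byte[]> result = new CompletableFuture<>();
        fetcher.fetchBlocks(new String[] {blockId}, new Listener() {
            @Override
            public void onBlockFetchSuccess(String id, byte[] data) {
                // Copy, mirroring the original: the source is only valid during the callback.
                result.complete(data.clone());
            }

            @Override
            public void onBlockFetchFailure(String id, Throwable exception) {
                result.completeExceptionally(exception);
            }
        });
        // join() waits without a timeout; a failure surfaces as CompletionException.
        return result.join();
    }

    static String demo() {
        // The fetcher answers from a separate thread to imitate a network callback.
        AsyncFetcher fetcher = (ids, listener) ->
            new Thread(() -> listener.onBlockFetchSuccess(
                ids[0], "payload".getBytes(StandardCharsets.UTF_8))).start();
        return new String(fetchBlockSync(fetcher, "block-1"), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

As in the Scala version, the wait is unbounded. A caller that needs a deadline would have to add its own timeout, for example by waiting on the future with a time limit.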

NettyBlockTransferService extends BlockTransferService.
