[Original] Big Data Fundamentals: Spark (9): How collect and take Are Implemented in Spark

In Spark there are two commonly used ways to bring computation results back to the driver: collect and take. How do they differ? Let's look at the code:

org.apache.spark.rdd.RDD

  /**
   * Return an array that contains all of the elements in this RDD.
   *
   * @note This method should only be used if the resulting array is expected to be small, as
   * all the data is loaded into the driver's memory.
   */
  def collect(): Array[T] = withScope {
    val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
    Array.concat(results: _*)
  }

  /**
   * Take the first num elements of the RDD. It works by first scanning one partition, and use the
   * results from that partition to estimate the number of additional partitions needed to satisfy
   * the limit.
   *
   * @note This method should only be used if the resulting array is expected to be small, as
   * all the data is loaded into the driver's memory.
   *
   * @note Due to complications in the internal implementation, this method will raise
   * an exception if called on an RDD of `Nothing` or `Null`.
   */
  def take(num: Int): Array[T] = withScope {
    val scaleUpFactor = Math.max(conf.getInt("spark.rdd.limit.scaleUpFactor", 4), 2)
    if (num == 0) {
      new Array[T](0)
    } else {
      val buf = new ArrayBuffer[T]
      val totalParts = this.partitions.length
      var partsScanned = 0
      while (buf.size < num && partsScanned < totalParts) {
        // The number of partitions to try in this iteration. It is ok for this number to be
        // greater than totalParts because we actually cap it at totalParts in runJob.
        var numPartsToTry = 1L
        if (partsScanned > 0) {
          // If we didn't find any rows after the previous iteration, quadruple and retry.
          // Otherwise, interpolate the number of partitions we need to try, but overestimate
          // it by 50%. We also cap the estimation in the end.
          if (buf.isEmpty) {
            numPartsToTry = partsScanned * scaleUpFactor
          } else {
            // the left side of max is >=1 whenever partsScanned >= 2
            numPartsToTry = Math.max((1.5 * num * partsScanned / buf.size).toInt - partsScanned, 1)
            numPartsToTry = Math.min(numPartsToTry, partsScanned * scaleUpFactor)
          }
        }
        val left = num - buf.size
        val p = partsScanned.until(math.min(partsScanned + numPartsToTry, totalParts).toInt)
        val res = sc.runJob(this, (it: Iterator[T]) => it.take(left).toArray, p)
        res.foreach(buf ++= _.take(num - buf.size))
        partsScanned += p.size
      }
      buf.toArray
    }
  }

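As the code shows, collect runs a single job over every partition, converts each partition's results into an array, and then concatenates those per-partition arrays into one array on the driver.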
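The shape of collect can be mimicked without a cluster. The following is only a minimal sketch, assuming each partition is represented by an in-memory Seq; `CollectSketch`, `runJobLocal` and `collectLocal` are hypothetical names made up for illustration and are not part of Spark's API.

```scala
import scala.reflect.ClassTag

object CollectSketch {
  // Hypothetical stand-in for sc.runJob: applies `func` to every partition's
  // iterator and returns one result per partition, in partition order.
  def runJobLocal[T, U: ClassTag](partitions: Seq[Seq[T]], func: Iterator[T] => U): Array[U] =
    partitions.map(p => func(p.iterator)).toArray

  // Same two steps as RDD.collect: one array per partition, then one concat.
  def collectLocal[T: ClassTag](partitions: Seq[Seq[T]]): Array[T] = {
    val results = runJobLocal(partitions, (iter: Iterator[T]) => iter.toArray)
    Array.concat(results: _*)
  }

  def main(args: Array[String]): Unit = {
    // Three assumed partitions, one of them empty.
    val parts = Seq(Seq(1, 2), Seq.empty[Int], Seq(3, 4, 5))
    println(collectLocal(parts).mkString(", "))
  }
}
```

Every partition is materialized before the concat, which is why the Scaladoc note warns that collect is only appropriate when the result is expected to be small.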

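take is more involved. It first computes a single partition, then uses the number of rows that came back to estimate how many more partitions it needs, computes those, and checks again whether it has enough results. This is an iterative process in which each round calls runJob on the next batch of partitions, and a batch is capped at scaleUpFactor (4 by default) times the number of partitions already scanned. The smaller num is, and the more rows the first partitions yield, the more likely take is to return within the first few rounds.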
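To see how the batches grow, the estimation rule can be pulled out into a pure function and run over in-memory data. This is a sketch under stated assumptions, not Spark code: `TakeSketch`, `nextPartsToTry` and `takeLocal` are hypothetical names, partitions are plain Seq values, and each entry of the returned round list stands for one runJob call in the real implementation.

```scala
import scala.collection.mutable.ArrayBuffer

object TakeSketch {
  // Growth rule copied from the take source: 1 partition in the first round;
  // afterwards either multiply by scaleUpFactor (nothing found yet) or
  // interpolate with a 50% overestimate, capped at partsScanned * scaleUpFactor.
  def nextPartsToTry(partsScanned: Int, bufSize: Int, num: Int, scaleUpFactor: Int): Long =
    if (partsScanned == 0) 1L
    else if (bufSize == 0) partsScanned.toLong * scaleUpFactor
    else {
      val estimate = Math.max((1.5 * num * partsScanned / bufSize).toInt - partsScanned, 1)
      Math.min(estimate.toLong, partsScanned.toLong * scaleUpFactor)
    }

  // Simulates take(num) over in-memory partitions. Returns the taken elements
  // and the partition range scanned in each round.
  def takeLocal[T](partitions: IndexedSeq[Seq[T]], num: Int, scaleUpFactor: Int = 4): (Seq[T], Seq[Range]) = {
    val factor = Math.max(scaleUpFactor, 2) // the source also clamps the factor to >= 2
    val buf = ArrayBuffer.empty[T]
    val rounds = ArrayBuffer.empty[Range]
    var partsScanned = 0
    while (buf.size < num && partsScanned < partitions.length) {
      val numPartsToTry = nextPartsToTry(partsScanned, buf.size, num, factor)
      val left = num - buf.size
      val p = partsScanned.until(math.min(partsScanned + numPartsToTry, partitions.length.toLong).toInt)
      // Each partition returns at most `left` rows; the buffer keeps only what is still missing.
      p.foreach(i => buf ++= partitions(i).take(left).take(num - buf.size))
      rounds += p
      partsScanned += p.size
    }
    (buf.toSeq, rounds.toSeq)
  }

  def main(args: Array[String]): Unit = {
    // Assumed data: 10 partitions holding one element each, take(5).
    val (taken, rounds) = takeLocal(Vector.tabulate(10)(i => Seq(i)), 5)
    println(s"taken=$taken rounds=$rounds")
  }
}
```

With ten single-element partitions and num = 5, the first round scans partition 0 only. The interpolation then suggests (1.5 * 5 * 1 / 1).toInt - 1 = 6 more partitions, which the cap reduces to 1 * 4 = 4, so the second round scans partitions 1 to 4 and the loop ends after two rounds.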
