[Original] Big Data Fundamentals: Spark (6) How Spark RDD Sort Is Implemented

Spark 2.1.1

In Spark, distributed data can be sorted with RDD.sortBy. How is that actually implemented? Let's look at the code:

org.apache.spark.rdd.RDD

  /**
   * Return this RDD sorted by the given key function.
   */
  def sortBy[K](
      f: (T) => K,
      ascending: Boolean = true,
      numPartitions: Int = this.partitions.length)
      (implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T] = withScope {
    this.keyBy[K](f)
        .sortByKey(ascending, numPartitions)
        .values
  }

  /**
   * Creates tuples of the elements in this RDD by applying `f`.
   */
  def keyBy[K](f: T => K): RDD[(K, T)] = withScope {
    val cleanedF = sc.clean(f)
    map(x => (cleanedF(x), x))
  }

org.apache.spark.rdd.OrderedRDDFunctions

  /**
   * Sort the RDD by key, so that each partition contains a sorted range of the elements. Calling
   * `collect` or `save` on the resulting RDD will return or output an ordered list of records
   * (in the `save` case, they will be written to multiple `part-X` files in the filesystem, in
   * order of the keys).
   */
  // TODO: this currently doesn't work on P other than Tuple2!
  def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
      : RDD[(K, V)] = self.withScope
  {
    val part = new RangePartitioner(numPartitions, self, ascending)
    new ShuffledRDD[K, V, V](self, part)
      .setKeyOrdering(if (ascending) ordering else ordering.reverse)
  }

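The code is fairly simple. sort is a transformation. You supply a key function f that says what to sort by, and keyBy applies it in a map step: item -> (f(item), item). A Partitioner is then created, which fixes the partitioning strategy (how many partitions, ascending or descending, and so on), and finally a ShuffledRDD is returned. The setKeyOrdering call makes the shuffle sort the records by key inside each partition.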
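The four steps above can be imitated on an ordinary local collection. The following is only a sketch, not Spark code: `SortBySketch`, `sortBySketch` and `partitionOf` are hypothetical names, the boundaries are passed in by hand instead of being computed by a RangePartitioner, and only the ascending case is covered.

```scala
object SortBySketch {
  // Hypothetical local stand-in for RDD.sortBy. `bounds` plays the role of
  // RangePartitioner.rangeBounds and must already be sorted ascending.
  def sortBySketch[T, K](data: Seq[T], bounds: Seq[K])(f: T => K)(
      implicit ord: Ordering[K]): Seq[Seq[T]] = {
    // Step 1 (keyBy): pair every element with its sort key.
    val keyed: Seq[(K, T)] = data.map(x => (f(x), x))

    // Step 2 (partitioner): the target partition is the number of bounds
    // that are strictly smaller than the key.
    def partitionOf(k: K): Int = bounds.count(b => ord.gt(k, b))

    // Step 3 (shuffle + key ordering): group by target partition, then
    // sort each partition by key.
    val grouped = keyed.groupBy { case (k, _) => partitionOf(k) }

    // Step 4 (values): drop the keys again, one output Seq per partition.
    (0 to bounds.length).map { p =>
      grouped.getOrElse(p, Seq.empty[(K, T)]).sortBy(_._1).map(_._2)
    }
  }
}
```

For instance, sorting Seq("pear", "fig", "banana", "kiwi", "apple") by length with bounds Seq(3, 4) gives three partitions, and reading them in partition order yields the words in ascending order of length.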

For how ShuffledRDD works, see https://www.cnblogs.com/barneywill/p/10158457.html

The part worth a closer look here is RangePartitioner:

org.apache.spark.RangePartitioner

/**
 * A [[org.apache.spark.Partitioner]] that partitions sortable records by range into roughly
 * equal ranges. The ranges are determined by sampling the content of the RDD passed in.
 *
 * @note The actual number of partitions created by the RangePartitioner might not be the same
 * as the `partitions` parameter, in the case where the number of sampled records is less than
 * the value of `partitions`.
 */
class RangePartitioner[K : Ordering : ClassTag, V](
    partitions: Int,
    rdd: RDD[_ <: Product2[K, V]],
    private var ascending: Boolean = true)
  extends Partitioner {

  // We allow partitions = 0, which happens when sorting an empty RDD under the default settings.
  require(partitions >= 0, s"Number of partitions cannot be negative but found $partitions.")

  private var ordering = implicitly[Ordering[K]]

  // An array of upper bounds for the first (partitions - 1) partitions
  private var rangeBounds: Array[K] = {
    if (partitions <= 1) {
      Array.empty
    } else {
      // This is the sample size we need to have roughly balanced output partitions, capped at 1M.
      val sampleSize = math.min(20.0 * partitions, 1e6)
      // Assume the input partitions are roughly balanced and over-sample a little bit.
      val sampleSizePerPartition = math.ceil(3.0 * sampleSize / rdd.partitions.length).toInt
      val (numItems, sketched) = RangePartitioner.sketch(rdd.map(_._1), sampleSizePerPartition)
      if (numItems == 0L) {
        Array.empty
      } else {
        // If a partition contains much more than the average number of items, we re-sample from it
        // to ensure that enough items are collected from that partition.
        val fraction = math.min(sampleSize / math.max(numItems, 1L), 1.0)
        val candidates = ArrayBuffer.empty[(K, Float)]
        val imbalancedPartitions = mutable.Set.empty[Int]
        sketched.foreach { case (idx, n, sample) =>
          if (fraction * n > sampleSizePerPartition) {
            imbalancedPartitions += idx
          } else {
            // The weight is 1 over the sampling probability.
            val weight = (n.toDouble / sample.length).toFloat
            for (key <- sample) {
              candidates += ((key, weight))
            }
          }
        }
        if (imbalancedPartitions.nonEmpty) {
          // Re-sample imbalanced partitions with the desired sampling probability.
          val imbalanced = new PartitionPruningRDD(rdd.map(_._1), imbalancedPartitions.contains)
          val seed = byteswap32(-rdd.id - 1)
          val reSampled = imbalanced.sample(withReplacement = false, fraction, seed).collect()
          val weight = (1.0 / fraction).toFloat
          candidates ++= reSampled.map(x => (x, weight))
        }
        RangePartitioner.determineBounds(candidates, partitions)
      }
    }
  }

  def numPartitions: Int = rangeBounds.length + 1

  private var binarySearch: ((Array[K], K) => Int) = CollectionsUtils.makeBinarySearch[K]

  def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[K]
    var partition = 0
    if (rangeBounds.length <= 128) {
      // If we have less than 128 partitions naive search
      while (partition < rangeBounds.length && ordering.gt(k, rangeBounds(partition))) {
        partition += 1
      }
    } else {
      // Determine which binary search method to use only once.
      partition = binarySearch(rangeBounds, k)
      // binarySearch either returns the match location or -[insertion point]-1
      if (partition < 0) {
        partition = -partition-1
      }
      if (partition > rangeBounds.length) {
        partition = rangeBounds.length
      }
    }
    if (ascending) {
      partition
    } else {
      rangeBounds.length - partition
    }
  }
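To make the lookup in getPartition easier to follow, here is a small standalone sketch of the same idea restricted to Int keys. `LookupSketch` and `partitionFor` are hypothetical names, and `java.util.Arrays.binarySearch` is used as a stand-in for the search function that Spark obtains from CollectionsUtils.makeBinarySearch.

```scala
import java.util.Arrays

object LookupSketch {
  // Hypothetical Int-only version of the lookup.
  // `bounds` must be sorted ascending and contain no duplicates.
  def partitionFor(bounds: Array[Int], key: Int, ascending: Boolean = true): Int = {
    var p = 0
    if (bounds.length <= 128) {
      // Few bounds: walk forward until the key no longer exceeds the bound.
      while (p < bounds.length && key > bounds(p)) p += 1
    } else {
      // Many bounds: binary search returns the match index,
      // or -(insertion point) - 1 when the key is not a bound itself.
      p = Arrays.binarySearch(bounds, key)
      if (p < 0) p = -p - 1
      if (p > bounds.length) p = bounds.length
    }
    // For a descending sort the partition index is mirrored.
    if (ascending) p else bounds.length - p
  }
}
```

With bounds Array(10, 20, 30), the keys 5 and 10 go to partition 0, the key 11 to partition 1, and the key 31 to partition 3. With ascending = false the key 5 lands in partition 3 instead.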

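rangeBounds is computed from the number of partitions, and its entries play much the same role as the pivot in QuickSort.

For example, suppose the cluster has 10 nodes and we sort 100 million records into 100 partitions. Ideally the 100 million records are split into 100 equal parts, each node holds 10 of them, and every part is then sorted locally, with no data skew.
That is hard to achieve, however. Splitting the data evenly really means choosing boundaries, i.e. the minimum and maximum key of each part, and doing that exactly requires a full pass over all of the data.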
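To see why exact boundaries are expensive, consider how they would be computed on a local collection. This is a minimal sketch with hypothetical names (`ExactBounds`, `exactBounds`): it has to sort every key before it can read off the boundaries, which is exactly the work a distributed sort cannot afford to do up front.

```scala
object ExactBounds {
  // Exact upper bounds for the first (partitions - 1) partitions.
  // Every key is sorted first, so this needs the full data set.
  def exactBounds[K: Ordering](keys: Seq[K], partitions: Int): Seq[K] = {
    require(partitions >= 1 && keys.length >= partitions,
      "need at least one key per partition")
    val sorted = keys.toVector.sorted
    // The i-th bound is the last key of the i-th equally sized slice.
    (1 until partitions).map(i => sorted(i * sorted.length / partitions - 1))
  }
}
```

For the twelve keys 1 to 12 (in any order) and 3 partitions this returns the bounds 4 and 8, so each partition receives exactly four keys.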

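Spark instead samples the data to learn its distribution and arrives at an approximately even split. Concretely, it targets sampleSize = min(20 × partitions, 1e6) samples overall and draws up to sampleSizePerPartition keys from each input partition (over-sampling by a factor of 3). Partitions that turn out to be much larger than average are re-sampled, so that every part of the data is represented in proportion to its size. The weighted samples are then scanned to determine the boundaries, which gives rangeBounds. With rangeBounds in hand, each of the 100 million records can be assigned to its new partition.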
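The sampling arithmetic and the boundary selection can be sketched as follows. The constants in `sampleSizes` are taken from the RangePartitioner code quoted earlier; `boundsFromSample` is a simplified, hypothetical version of the boundary step and is not the actual RangePartitioner.determineBounds. It assumes each candidate key carries a weight equal to the number of records it stands for.

```scala
object BoundsSketch {
  // Target sample sizes, using the same constants as the RangePartitioner code.
  def sampleSizes(partitions: Int, inputPartitions: Int): (Double, Int) = {
    val sampleSize = math.min(20.0 * partitions, 1e6)
    val perPartition = math.ceil(3.0 * sampleSize / inputPartitions).toInt
    (sampleSize, perPartition)
  }

  // Simplified boundary selection (assumption, see lead-in): walk the sorted
  // candidates and emit a bound each time the cumulative weight reaches the
  // next multiple of (total weight / partitions).
  def boundsFromSample[K](candidates: Seq[(K, Double)], partitions: Int)(
      implicit ord: Ordering[K]): Vector[K] = {
    val ordered = candidates.sortBy(_._1)
    val step = ordered.map(_._2).sum / partitions
    var cumWeight = 0.0
    var target = step
    var last = Option.empty[K]
    var emitted = 0
    val bounds = Vector.newBuilder[K]
    ordered.foreach { case (key, weight) =>
      if (emitted < partitions - 1) {
        cumWeight += weight
        // Skip keys equal to the previous bound so bounds stay strictly increasing.
        if (cumWeight >= target && last.forall(ord.lt(_, key))) {
          bounds += key
          last = Some(key)
          target += step
          emitted += 1
        }
      }
    }
    bounds.result()
  }
}
```

With 100 output partitions and 50 input partitions, `sampleSizes` gives a target of 2000 samples and 120 samples per input partition. Feeding the keys 1 to 100, each with weight 1.0, into `boundsFromSample` with 4 partitions yields the bounds 25, 50 and 75.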

One more question: if the sorted RDD is collected to the driver, is the resulting array still in sorted order?

org.apache.spark.rdd.RDD

  /**
   * Return an array that contains all of the elements in this RDD.
   *
   * @note This method should only be used if the resulting array is expected to be small, as
   * all the data is loaded into the driver's memory.
   */
  def collect(): Array[T] = withScope {
    val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
    Array.concat(results: _*)
  }

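The answer is yes. runJob returns the per-partition results in partition order and Array.concat joins them in that same order. After sortByKey, partition i only holds keys that sort before those in partition i + 1, and each partition is sorted internally, so the concatenated array is sorted as a whole.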
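The same concatenation step can be tried without a cluster. In this sketch `CollectOrder` and `collectSketch` are hypothetical names, and the input Seq stands in for the per-partition arrays that runJob hands back, indexed by partition id.

```scala
import scala.reflect.ClassTag

object CollectOrder {
  // Stand-in for collect(): `partitions(i)` is the already sorted content of
  // partition i, and all keys in partition i sort before those in partition i + 1.
  def collectSketch[T: ClassTag](partitions: Seq[Array[T]]): Array[T] =
    Array.concat(partitions: _*)
}
```

Concatenating Array(1, 2), Array(3, 5) and Array(8) in that order gives 1, 2, 3, 5, 8, so the global order follows directly from the partition order.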
