spark 2.1.1

spark中可以通过RDD.sortBy来对分布式数据进行排序,具体是如何实现的?来看代码:

org.apache.spark.rdd.RDD

  /**
* Return this RDD sorted by the given key function.
*/
def sortBy[K](
f: (T) => K,
ascending: Boolean = true,
numPartitions: Int = this.partitions.length)
(implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T] = withScope {
this.keyBy[K](f)
.sortByKey(ascending, numPartitions)
.values
} /**
* Creates tuples of the elements in this RDD by applying `f`.
*/
def keyBy[K](f: T => K): RDD[(K, T)] = withScope {
val cleanedF = sc.clean(f)
map(x => (cleanedF(x), x))
} /**
* Sort the RDD by key, so that each partition contains a sorted range of the elements. Calling
* `collect` or `save` on the resulting RDD will return or output an ordered list of records
* (in the `save` case, they will be written to multiple `part-X` files in the filesystem, in
* order of the keys).
*/
// TODO: this currently doesn't work on P other than Tuple2!
def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
: RDD[(K, V)] = self.withScope
{
val part = new RangePartitioner(numPartitions, self, ascending)
new ShuffledRDD[K, V, V](self, part)
.setKeyOrdering(if (ascending) ordering else ordering.reverse)
}

代码比较简单:sort是一个transformation操作,需要定义一个keyBy,即根据什么排序,然后会做一步map,即 item -> (keyBy(item), item),然后定义一个Partitioner,即分区策略(多少个分区,升序降序等),最后返回一个ShuffledRDD;

ShuffledRDD原理详见 https://www.cnblogs.com/barneywill/p/10158457.html

这里重点说下RangePartitioner:

org.apache.spark.RangePartitioner

/**
* A [[org.apache.spark.Partitioner]] that partitions sortable records by range into roughly
* equal ranges. The ranges are determined by sampling the content of the RDD passed in.
*
* @note The actual number of partitions created by the RangePartitioner might not be the same
* as the `partitions` parameter, in the case where the number of sampled records is less than
* the value of `partitions`.
*/
class RangePartitioner[K : Ordering : ClassTag, V](
partitions: Int,
rdd: RDD[_ <: Product2[K, V]],
private var ascending: Boolean = true)
extends Partitioner { // We allow partitions = 0, which happens when sorting an empty RDD under the default settings.
require(partitions >= 0, s"Number of partitions cannot be negative but found $partitions.") private var ordering = implicitly[Ordering[K]] // An array of upper bounds for the first (partitions - 1) partitions
private var rangeBounds: Array[K] = {
if (partitions <= 1) {
Array.empty
} else {
// This is the sample size we need to have roughly balanced output partitions, capped at 1M.
val sampleSize = math.min(20.0 * partitions, 1e6)
// Assume the input partitions are roughly balanced and over-sample a little bit.
val sampleSizePerPartition = math.ceil(3.0 * sampleSize / rdd.partitions.length).toInt
val (numItems, sketched) = RangePartitioner.sketch(rdd.map(_._1), sampleSizePerPartition)
if (numItems == 0L) {
Array.empty
} else {
// If a partition contains much more than the average number of items, we re-sample from it
// to ensure that enough items are collected from that partition.
val fraction = math.min(sampleSize / math.max(numItems, 1L), 1.0)
val candidates = ArrayBuffer.empty[(K, Float)]
val imbalancedPartitions = mutable.Set.empty[Int]
sketched.foreach { case (idx, n, sample) =>
if (fraction * n > sampleSizePerPartition) {
imbalancedPartitions += idx
} else {
// The weight is 1 over the sampling probability.
val weight = (n.toDouble / sample.length).toFloat
for (key <- sample) {
candidates += ((key, weight))
}
}
}
if (imbalancedPartitions.nonEmpty) {
// Re-sample imbalanced partitions with the desired sampling probability.
val imbalanced = new PartitionPruningRDD(rdd.map(_._1), imbalancedPartitions.contains)
val seed = byteswap32(-rdd.id - 1)
val reSampled = imbalanced.sample(withReplacement = false, fraction, seed).collect()
val weight = (1.0 / fraction).toFloat
candidates ++= reSampled.map(x => (x, weight))
}
RangePartitioner.determineBounds(candidates, partitions)
}
}
} def numPartitions: Int = rangeBounds.length + 1 private var binarySearch: ((Array[K], K) => Int) = CollectionsUtils.makeBinarySearch[K] def getPartition(key: Any): Int = {
val k = key.asInstanceOf[K]
var partition = 0
if (rangeBounds.length <= 128) {
// If we have less than 128 partitions naive search
while (partition < rangeBounds.length && ordering.gt(k, rangeBounds(partition))) {
partition += 1
}
} else {
// Determine which binary search method to use only once.
partition = binarySearch(rangeBounds, k)
// binarySearch either returns the match location or -[insertion point]-1
if (partition < 0) {
partition = -partition-1
}
if (partition > rangeBounds.length) {
partition = rangeBounds.length
}
}
if (ascending) {
partition
} else {
rangeBounds.length - partition
}
}

这里会根据partition的数量确定rangeBounds,rangeBounds很像QuickSort中的pivot,

举例来说:集群现在有10个节点,对1亿数据做排序,partition数量是100,最理想的情况是1亿数据平均分成100份,然后每个节点存放10份,然后各自排序就好,没有数据倾斜;
但是这个很难实现,要注意的是这里平分的过程实际上也是划分边界的过程,即确定每份的最小值和最大值边界,需要对全部数据遍历统计之后才能精确实现;

spark中采用的是一种通过对数据采样了解数据分布并最终达到近似精确的方式,具体实现为在从全部数据中采样sampleSize个数据,每个分区采样sampleSizePerPartition个,如果某些分区很大,会追加采样个数,这样保证采样过程尽可能的平均,然后针对采样数据进行探测划分边界,得到rangeBounds,有了rangeBounds之后就可以知道1亿数据中的每一条具体在哪个新的分区;

还有一个问题:在sort之后如果collect到driver,array数据还会保持排序状态吗?

org.apache.spark.rdd.RDD

  /**
* Return an array that contains all of the elements in this RDD.
*
* @note This method should only be used if the resulting array is expected to be small, as
* all the data is loaded into the driver's memory.
*/
def collect(): Array[T] = withScope {
val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
Array.concat(results: _*)
}

答案是肯定的;

【原创】大数据基础之Spark(6)Spark Rdd Sort实现原理的更多相关文章

  1. 【原创】大数据基础之Hadoop(1)HA实现原理

    有些工作只能在一台server上进行,比如master,这时HA(High Availability)首先要求部署多个server,其次要求多个server自动选举出一个active状态server, ...

  2. 大数据学习系列之七 ----- Hadoop+Spark+Zookeeper+HBase+Hive集群搭建 图文详解

    引言 在之前的大数据学习系列中,搭建了Hadoop+Spark+HBase+Hive 环境以及一些测试.其实要说的话,我开始学习大数据的时候,搭建的就是集群,并不是单机模式和伪分布式.至于为什么先写单 ...

  3. CentOS6安装各种大数据软件 第十章:Spark集群安装和部署

    相关文章链接 CentOS6安装各种大数据软件 第一章:各个软件版本介绍 CentOS6安装各种大数据软件 第二章:Linux各个软件启动命令 CentOS6安装各种大数据软件 第三章:Linux基础 ...

  4. 大数据平台搭建(hadoop+spark)

    大数据平台搭建(hadoop+spark) 一.基本信息 1. 服务器基本信息 主机名 ip地址 安装服务 spark-master 172.16.200.81 jdk.hadoop.spark.sc ...

  5. 大数据系列之并行计算引擎Spark部署及应用

    相关博文: 大数据系列之并行计算引擎Spark介绍 之前介绍过关于Spark的程序运行模式有三种: 1.Local模式: 2.standalone(独立模式) 3.Yarn/mesos模式 本文将介绍 ...

  6. 大数据系列之并行计算引擎Spark介绍

    相关博文:大数据系列之并行计算引擎Spark部署及应用 Spark: Apache Spark 是专为大规模数据处理而设计的快速通用的计算引擎. Spark是UC Berkeley AMP lab ( ...

  7. 【原创】大数据基础之Zookeeper(2)源代码解析

    核心枚举 public enum ServerState { LOOKING, FOLLOWING, LEADING, OBSERVING; } zookeeper服务器状态:刚启动LOOKING,f ...

  8. 【原创】大数据基础之Spark(4)RDD原理及代码解析

    一 简介 spark核心是RDD,官方文档地址:https://spark.apache.org/docs/latest/rdd-programming-guide.html#resilient-di ...

  9. 【原创】大数据基础之Spark(1)Spark Submit即Spark任务提交过程

    Spark2.1.1 一 Spark Submit本地解析 1.1 现象 提交命令: spark-submit --master local[10] --driver-memory 30g --cla ...

  10. 【原创】大数据基础之Hive(5)hive on spark

    hive 2.3.4 on spark 2.4.0 Hive on Spark provides Hive with the ability to utilize Apache Spark as it ...

随机推荐

  1. Scrapy中选择器的用法

    官方文档:https://doc.scrapy.org/en/latest/topics/selectors.html Using selectors Constructing selectors R ...

  2. 想要开发自己的PHP框架需要那些知识储备?

    作者:安正超链接:https://www.zhihu.com/question/26635323/answer/33812516来源:知乎著作权归作者所有.商业转载请联系作者获得授权,非商业转载请注明 ...

  3. Servlet生命周期和注解配置

    Servlet的生命周期和注解配置问题 /* Servlet? 运行在服务器上的小程序 定义浏览器访问到Tomcat的规则 一.生命周期? 1.创建 2.提供服务 3.被销毁 二.servlet3.0 ...

  4. Java BitSet使用场景和示例

    一.什么是BitSet? 注:以下内容来自JDK API: BitSet类实现了一个按需增长的位向量.位Set的每一个组件都有一个boolean值.用非负的整数将BitSet的位编入索引.可以对每个编 ...

  5. Debugging Beyond Visual Studio – WinDbg

    Getting started with WinDbg: 1. Download the Debugging Tools for Windows from the Microsoft website ...

  6. springdata 动态查询 是用来查询的 仅提供查询功能

    springdata 动态查询 是用来查询的 仅提供查询功能

  7. CodeForces 868F Yet Another Minimization Problem(决策单调性优化 + 分治)

    题意 给定一个序列 \(\{a_1, a_2, \cdots, a_n\}\),要把它分成恰好 \(k\) 个连续子序列. 每个连续子序列的费用是其中相同元素的对数,求所有划分中的费用之和的最小值. ...

  8. [linux]解除linux对多次登录密码错误的账户的锁定

    其他wheel账户下,执行: sudo pam_tally2 --user=username --reset

  9. 题解 CF540D 【Bad Luck Island】

    既然没有大佬写题解那本蒟蒻就厚颜无耻地写(水)一(经)下(验)吧 题目要求算出个种人单独留下的存活率 因为n,m,p的范围极小, 那么就可以方便地设3位dp状态dp[i][j][k]表示剩余i个石头, ...

  10. macOS Mojave待机耗电大

    这很有可能是待机时依然链接网络导致的.如果不需要待机时链接网络可以执行 sudo pmset -a tcpkeepalive 0 恢复则执行 sudo pmset -a tcpkeepalive 1