The Difference Between repartition and partitionBy in Spark

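Both repartition and partitionBy redistribute data across partitions, and both usually end up going through a HashPartitioner: repartition always uses one internally, while partitionBy uses whichever Partitioner you pass it, most commonly a HashPartitioner. One difference is that partitionBy is only available on a PairRDD. Yet when both are applied to the same PairRDD, the resulting layouts are not the same.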
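A minimal sketch for reproducing the comparison locally. It assumes Spark is on the classpath and runs in local mode; the object name, the `layouts` helper, and the sample data are illustrative and not taken from the original post.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object RepartitionVsPartitionBy {
  type Layout = Array[Array[(Int, String)]]

  // Returns the per-partition contents after partitionBy and after repartition.
  def layouts(): (Layout, Layout) = {
    val conf = new SparkConf()
      .setAppName("repartition-vs-partitionBy")
      .setMaster("local[2]") // assumption: local mode, for experimentation only
    val sc = new SparkContext(conf)
    try {
      // Hypothetical sample data: three keys, two records per key.
      val pairs = sc.parallelize(
        Seq((1, "a"), (1, "b"), (2, "c"), (2, "d"), (3, "e"), (3, "f")), 2)

      // glom() turns every partition into an array so the layout can be inspected.
      val byKey = pairs.partitionBy(new HashPartitioner(3)).glom().collect()
      val byRepartition = pairs.repartition(3).glom().collect()
      (byKey, byRepartition)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (byKey, byRepartition) = layouts()
    println("partitionBy(new HashPartitioner(3)):")
    byKey.zipWithIndex.foreach { case (p, i) => println(s"  partition $i: ${p.mkString(", ")}") }
    println("repartition(3):")
    byRepartition.zipWithIndex.foreach { case (p, i) => println(s"  partition $i: ${p.mkString(", ")}") }
  }
}
```

With Int keys and HashPartitioner(3), a key k goes to partition k mod 3, so both records for a given key share a partition. The repartition layout carries no such guarantee.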

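The partitionBy result is the one we would expect: records with the same key end up in the same partition. To see why repartition behaves differently, let's open its source: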

  /**
   * Return a new RDD that has exactly numPartitions partitions.
   *
   * Can increase or decrease the level of parallelism in this RDD. Internally, this uses
   * a shuffle to redistribute data.
   *
   * If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
   * which can avoid performing a shuffle.
   *
   * TODO Fix the Shuffle+Repartition data loss issue described in SPARK-23207.
   */
  def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    coalesce(numPartitions, shuffle = true)
  }

  /**
   * Return a new RDD that is reduced into `numPartitions` partitions.
   *
   * This results in a narrow dependency, e.g. if you go from 1000 partitions
   * to 100 partitions, there will not be a shuffle, instead each of the 100
   * new partitions will claim 10 of the current partitions. If a larger number
   * of partitions is requested, it will stay at the current number of partitions.
   *
   * However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
   * this may result in your computation taking place on fewer nodes than
   * you like (e.g. one node in the case of numPartitions = 1). To avoid this,
   * you can pass shuffle = true. This will add a shuffle step, but means the
   * current upstream partitions will be executed in parallel (per whatever
   * the current partitioning is).
   *
   * @note With shuffle = true, you can actually coalesce to a larger number
   * of partitions. This is useful if you have a small number of partitions,
   * say 100, potentially with a few partitions being abnormally large. Calling
   * coalesce(1000, shuffle = true) will result in 1000 partitions with the
   * data distributed using a hash partitioner. The optional partition coalescer
   * passed in must be serializable.
   */
  def coalesce(numPartitions: Int, shuffle: Boolean = false,
               partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
              (implicit ord: Ordering[T] = null)
      : RDD[T] = withScope {
    require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")
    if (shuffle) {
      /** Distributes elements evenly across output partitions, starting from a random partition. */
      val distributePartition = (index: Int, items: Iterator[T]) => {
        var position = new Random(hashing.byteswap32(index)).nextInt(numPartitions)
        items.map { t =>
          // Note that the hash code of the key will just be the key itself. The HashPartitioner
          // will mod it with the number of total partitions.
          position = position + 1
          (position, t)
        }
      } : Iterator[(Int, T)]

      // include a shuffle step so that our upstream tasks are still distributed
      new CoalescedRDD(
        new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
        new HashPartitioner(numPartitions)),
        numPartitions,
        partitionCoalescer).values
    } else {
      new CoalescedRDD(this, numPartitions, partitionCoalescer)
    }
  }

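Even on a PairRDD, repartition does not use the RDD's own keys. It generates a synthetic Int key for each element (a random starting position per input partition, incremented by one per element) and hash-partitions on that number, not on the original key!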
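The key-assignment logic can be imitated in plain Scala without Spark. The sketch below is a simplified simulation, not Spark's implementation: the object and method names are hypothetical, and the modulo step that HashPartitioner would perform during the shuffle is applied inline.

```scala
import java.util.Random
import scala.util.hashing

object SyntheticKeys {
  // Non-negative modulo, as a hash partitioner applies to key.hashCode.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Imitates distributePartition from the quoted source: the starting position
  // depends only on the input partition index, and every element gets the next
  // integer as its key. The element's own key is never consulted.
  def targetsForRepartition[T](index: Int, items: Seq[T], numPartitions: Int): Seq[(Int, T)] = {
    var position = new Random(hashing.byteswap32(index)).nextInt(numPartitions)
    items.map { t =>
      position = position + 1
      (nonNegativeMod(position, numPartitions), t)
    }
  }

  // Imitates partitionBy with a hash partitioner: the target partition is
  // derived from the record's own key.
  def targetsForPartitionBy[K, V](items: Seq[(K, V)], numPartitions: Int): Seq[(Int, (K, V))] =
    items.map { case kv @ (k, _) => (nonNegativeMod(k.hashCode, numPartitions), kv) }

  def main(args: Array[String]): Unit = {
    // Hypothetical input: four records that all share the key 1.
    val input = Seq((1, "a"), (1, "b"), (1, "c"), (1, "d"))
    println(s"repartition targets: ${targetsForRepartition(0, input, 3)}")
    println(s"partitionBy targets: ${targetsForPartitionBy(input, 3)}")
  }
}
```

Because the synthetic keys are consecutive integers, records that share a key are spread round-robin over the output partitions, whereas hashing the real key sends all of them to a single partition.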
