spark中产生shuffle的算子

Spark中产生shuffle的算子

作用	算子名	能否替换，由谁替换
去重	distinct()	不能
聚合	reduceByKey()	groupByKey
	groupBy()
	groupByKey()	reduceByKey
	aggregateByKey()
	combineByKey()
排序	sortByKey()
排序	sortBy()
重分区	coalesce()
重分区	repartition()
集合或者表操作	Intersection()
	Substract()
	SubstractByKey()
	Join()
	LeftOutJoin()

https://www.cnblogs.com/Alex-zqzy/p/9949117.html

去重

def distinct()

def distinct(numPartitions: Int)

聚合

def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]

def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)]

def groupBy[K](f: T => K, p: Partitioner):RDD[(K, Iterable[V])]

def groupByKey(partitioner: Partitioner):RDD[(K, Iterable[V])]

def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner): RDD[(K, U)]

def aggregateByKey[U: ClassTag](zeroValue: U, numPartitions: Int): RDD[(K, U)]

def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C): RDD[(K, C)]

def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C, numPartitions: Int): RDD[(K, C)]

def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C, partitioner: Partitioner, mapSideCombine: Boolean = true, serializer: Serializer = null): RDD[(K, C)]

排序

def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length): RDD[(K, V)]

def sortBy[K](f: (T) => K, ascending: Boolean = true, numPartitions: Int = this.partitions.length)(implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T]

重分区

def coalesce(numPartitions: Int, shuffle: Boolean = false, partitionCoalescer: Option[PartitionCoalescer] = Option.empty)

def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null)

集合或者表操作

def intersection(other: RDD[T]): RDD[T]

def intersection(other: RDD[T], partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]

def intersection(other: RDD[T], numPartitions: Int): RDD[T]

def subtract(other: RDD[T], numPartitions: Int): RDD[T]

def subtract(other: RDD[T], p: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]

def subtractByKey[W: ClassTag](other: RDD[(K, W)]): RDD[(K, V)]

def subtractByKey[W: ClassTag](other: RDD[(K, W)], numPartitions: Int): RDD[(K, V)]

def subtractByKey[W: ClassTag](other: RDD[(K, W)], p: Partitioner): RDD[(K, V)]

def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))]

def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]

def join[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (V, W))]

def leftOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (V, Option[W]))]

spark中产生shuffle的算子的更多相关文章

Spark中的各种action算子操作（java版）
在我看来,Spark编程中的action算子的作用就像一个触发器,用来触发之前的transformation算子.transformation操作具有懒加载的特性,你定义完操作之后并不会立即加载,只有 ...
Spark会产生shuffle的算子
去重 def distinct() def distinct(numPartitions: Int) 聚合 def reduceByKey(func: (V, V) => V, numParti ...
spark中map和mapPartitions算子的区别
区别: 1.map是对rdd中每一个元素进行操作 2.mapPartitions是对rdd中每个partition的迭代器进行操作 mapPartitions优点: 1.若是普通map,比如一个par ...
[Spark性能调优] 第三章 : Spark 2.1.0 中 Sort-Based Shuffle 产生的内幕
本課主題 Sorted-Based Shuffle 的诞生和介绍 Shuffle 中六大令人费解的问题 Sorted-Based Shuffle 的排序和源码鉴赏 Shuffle 在运行时的内存管理 ...
Spark 2.x 中 Sort-Based Shuffle 产生的内幕
本课主题 Sorted-Based Shuffle 的诞生和介绍 Shuffle 中六大令人费解的问题 Sorted-Based Shuffle 的排序和源码鉴赏 Shuffle 在运行时的内存管理 ...
Spark中shuffle的触发和调度
Spark中的shuffle是在干嘛? Shuffle在Spark中即是把父RDD中的KV对按照Key重新分区,从而得到一个新的RDD.也就是说原本同属于父RDD同一个分区的数据需要进入到子RDD的不 ...
spark性能调优（二）彻底解密spark的Hash Shuffle
装载:http://www.cnblogs.com/jcchoiling/p/6431969.html 引言 Spark HashShuffle 是它以前的版本,现在1.6x 版本默应是 Sort-B ...
spark中数据倾斜解决方案
数据倾斜导致的致命后果: 1 数据倾斜直接会导致一种情况:OOM. 2 运行速度慢,特别慢,非常慢,极端的慢,不可接受的慢. 搞定数据倾斜需要: 1.搞定shuffle 2.搞定业务场景 3 搞定 c ...
spark教程(13)-shuffle介绍
shuffle 简介 shuffle 描述了数据从 map task 输出到 reduce task 输入的过程,shuffle 是连接 map 和 reduce 的桥梁: shuffle 性能的高低 ...

随机推荐

UVA562(01背包均分问题)
Dividing coins Time Limit:3000MS Memory Limit:0KB 64bit IO Format:%lld & %llu Descriptio ...
DTP模型之二：（XA协议之二）jotm分布式事务实现
分布式事务是指操作多个数据库之间的事务,spring的org.springframework.transaction.jta.JtaTransactionManager,提供了分布式事务支持.如果使用 ...
开始我的技术bolg之旅
虽然做了快十年的IT但是真心觉得自己的水平很烂,IT这个行业keep learning很重要,之前每接触新东西都是随便学一下,有问题解决了就完事了,或者是再有问题临时抱佛脚,从来都没把学过或者做个的事 ...
如何使用最简单的方法将一个已经存在的工程中使用 cocaPodfile
在网上搜索的使用 cocaPods 安装一些优秀的框架,搜索的博客大多步骤都是非常的麻烦,这里的方法非常的简单,本篇仅仅作为以后备用. 第一步:首先找到我们的工程,在终端中输入 cd 拖入已经存在 ...
Game with Powers
题意: 有1~n,n个数字,两个人轮流操作,每一次一个人可以拿一个数字$x$,之后$x, x^2, x^3....x^t$全都被删掉. 给定n,问最优策略下谁赢. 解法: 考虑SG函数,可以注意到题目 ...
HDFS源码分析一-概述
HDFS 主要包含 NameNode, SecondaryNameNode, DataNode 以及 HDFS Client . 我们从以下这几部分讲: 1. HDFS概述 2. NameNode 实 ...
Flex屏蔽并自定义鼠标右键菜单
http://www.cnblogs.com/wuhenke/archive/2010/01/29/1659353.html Google Code上有一个RightClickManager的项目. ...
JQuery onload、ready 加载顺序
// ready 这个方法只是在页面所有的DOM加载完毕后就会触发 // 方式1 $(function(){ // do something }); // 方式2 $(document).ready( ...
Server.MapPath()相关
Server.MapPath()相关 1. Server.MapPath()介绍 Server.MapPath(string path)作用是返回与Web服务器上的指定虚拟路径相对应的物理文 ...
laravel 数据库连接Mysql
找到 config/database.php 'mysql' => [ 'driver' => 'mysql', //数据库的类型 'host' => env('DB_HOST', ...

spark中产生shuffle的算子

spark中产生shuffle的算子的更多相关文章

随机推荐

热门专题