【熵增】

由无序到有序

http://spark.apache.org/docs/latest/rdd-programming-guide.html#shuffle-operations

Shuffle operations

Certain operations within Spark trigger an event known as the shuffle. The shuffle is Spark’s mechanism for re-distributing data so that it’s grouped differently across partitions. This typically involves copying data across executors and machines, making the shuffle a complex and costly operation.

Background

To understand what happens during the shuffle we can consider the example of the reduceByKey operation. The reduceByKey operation generates a new RDD where all values for a single key are combined into a tuple - the key and the result of executing a reduce function against all values associated with that key. The challenge is that not all values for a single key necessarily reside on the same partition, or even the same machine, but they must be co-located to compute the result.

In Spark, data is generally not distributed across partitions to be in the necessary place for a specific operation. During computations, a single task will operate on a single partition - thus, to organize all the data for a single reduceByKey reduce task to execute, Spark needs to perform an all-to-all operation. It must read from all partitions to find all the values for all keys, and then bring together values across partitions to compute the final result for each key - this is called the shuffle.

Although the set of elements in each partition of newly shuffled data will be deterministic, and so is the ordering of partitions themselves, the ordering of these elements is not. If one desires predictably ordered data following shuffle then it’s possible to use:

  • mapPartitions to sort each partition using, for example, .sorted
  • repartitionAndSortWithinPartitions to efficiently sort partitions while simultaneously repartitioning
  • sortBy to make a globally ordered RDD

Operations which can cause a shuffle include repartition operations like repartition and coalesce‘ByKey operations (except for counting) like groupByKey and reduceByKey, and join operations like cogroup and join.

Performance Impact

The Shuffle is an expensive operation since it involves disk I/O, data serialization, and network I/O. To organize data for the shuffle, Spark generates sets of tasks - map tasks to organize the data, and a set of reduce tasks to aggregate it. This nomenclature comes from MapReduce and does not directly relate to Spark’s map and reduce operations.

Internally, results from individual map tasks are kept in memory until they can’t fit. Then, these are sorted based on the target partition and written to a single file. On the reduce side, tasks read the relevant sorted blocks.

Certain shuffle operations can consume significant amounts of heap memory since they employ in-memory data structures to organize records before or after transferring them. Specifically, reduceByKey and aggregateByKey create these structures on the map side, and 'ByKey operations generate these on the reduce side. When data does not fit in memory Spark will spill these tables to disk, incurring the additional overhead of disk I/O and increased garbage collection.

Shuffle also generates a large number of intermediate files on disk. As of Spark 1.3, these files are preserved until the corresponding RDDs are no longer used and are garbage collected. This is done so the shuffle files don’t need to be re-created if the lineage is re-computed. Garbage collection may happen only after a long period of time, if the application retains references to these RDDs or if GC does not kick in frequently. This means that long-running Spark jobs may consume a large amount of disk space. The temporary storage directory is specified by the spark.local.dirconfiguration parameter when configuring the Spark context.

Shuffle behavior can be tuned by adjusting a variety of configuration parameters. See the ‘Shuffle Behavior’ section within the Spark Configuration Guide.

grouped differently across partitions的更多相关文章

  1. 【原创】大数据基础之Spark(5)Shuffle实现原理及代码解析

    一 简介 Shuffle,简而言之,就是对数据进行重新分区,其中会涉及大量的网络io和磁盘io,为什么需要shuffle,以词频统计reduceByKey过程为例, serverA:partition ...

  2. Spark 学习笔记:(二)编程指引(Scala版)

    参考: http://spark.apache.org/docs/latest/programming-guide.html 后面懒得翻译了,英文记的,以后复习时再翻. 摘要:每个Spark appl ...

  3. HDU-3280 Equal Sum Partitions

    http://acm.hdu.edu.cn/showproblem.php?pid=3280 用了简单的枚举. Equal Sum Partitions Time Limit: 2000/1000 M ...

  4. HDU 3280 Equal Sum Partitions(二分查找)

    Equal Sum Partitions Time Limit: 2000/1000 MS (Java/Others)    Memory Limit: 32768/32768 K (Java/Oth ...

  5. Partitioning & Archiving tables in SQL Server (Part 2: Split, Merge and Switch partitions)

    Reference: http://blogs.msdn.com/b/felixmar/archive/2011/08/29/partitioning-amp-archiving-tables-in- ...

  6. Rotate partitions in DB2 on z

    Rotating partitions   You can use the ALTER TABLE statement to rotate any logical partition to becom ...

  7. How to choose the number of topics/partitions in a Kafka cluster?

    This is a common question asked by many Kafka users. The goal of this post is to explain a few impor ...

  8. rabbitmq之partitions

    集群为了保证数据一致性,在同步数据的同时也会通过节点之间的心跳通信来保证对方存活.那如果集群节点通信异常会发生什么,系统如何保障正常提供服务,使用何种策略回复呢? rabbitmq提供的处理脑裂的方法 ...

  9. 8 ways rich people view the world differently than the average person

    Self-made millionaire Steve Siebold spent 26 years interviewing some of the wealthiest people in the ...

随机推荐

  1. “百度杯”CTF比赛 十月场_Login

    题目在i春秋ctf大本营 打开页面是两个登录框,首先判断是不是注入 尝试了各种语句后,发现登录界面似乎并不存在注入 查看网页源代码,给出了一个账号 用帐密登陆后,跳转到到member.php网页,网页 ...

  2. CI调试应用程序

    该分析器将在页面下方显示基准测试结果,运行过的 SQL 语句,以及 $_POST 数据.这些信息有助于开发过程中的调试和优化. 在控制器中设置以下方法以激活该分析器: $this->output ...

  3. UE3客户端服务器GamePlay框架

    客户端(当前玩家)与服务器对应关系图: 整体上看,UE3的GamePlay框架使用的是MVC架构 ① 橙色的Actor对象及橙色箭头相连的成员变量只会被同步给Owner客户端 Controller:控 ...

  4. 计蒜客 UCloud 的安全秘钥(随机化+Hash)

    题目链接 UCloud 的安全秘钥 对于简单的版本,我们直接枚举每个子序列,然后sort一下判断是否完全一样即可. #include <bits/stdc++.h> using names ...

  5. 标题:如何使用ShareSDK实现Cocos2d-x的Android/iOS分享与授权

    Cocos2DX 简介 Cocos2d-x是一套成熟的开源跨平台游戏开发框架.其引擎提供了图形渲染.GUI.音频.网络.物理.用户输入等丰富的功能,被广泛应用于游戏开发及交互式应用的构建.引擎的核心采 ...

  6. 2017-11-07-noip模拟题

    T1 数学老师的报复 矩阵快速幂模板,类似于菲波那切数列的矩阵 [1,1]*[A,1 B,0] #include <cstdio> #define LL long long inline ...

  7. luogu P1402 酒店之王

    题目描述 XX酒店的老板想成为酒店之王,本着这种希望,第一步要将酒店变得人性化.由于很多来住店的旅客有自己喜好的房间色调.阳光等,也有自己所爱的菜,但是该酒店只有p间房间,一天只有固定的q道不同的菜. ...

  8. 老哥你真的知道ArrayList#sublist的正确用法么

    我们有这么一个场景,给你一个列表,可以动态的新增,但是最终要求列表升序,要求长度小于20,可以怎么做? 这个还不简单,几行代码就可以了 public List<Integer> trimL ...

  9. Web地图服务、WMS 请求方式、网络地图服务(WMS)的三大操作

    转自奔跑的熊猫原文 Web地图服务.WMS 请求方式.网络地图服务(WMS)的三大操作 1.GeoServer(地理信息系统服务器) GeoServer是OpenGIS Web 服务器规范的 J2EE ...

  10. spring计时工具类stopwatch用法

    public class Test { public static void main(String[] args) { StopWatch stopWatch = new StopWatch(); ...