How Many Partitions Does An RDD Have
From https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/how_many_partitions_does_an_rdd_have.html
For tuning and troubleshooting, it's often necessary to know how many paritions an RDD represents. There are a few ways to find this information:
View Task Execution Against Partitions Using the UI
When a stage executes, you can see the number of partitions for a given stage in the Spark UI. For example, the following simple job creates an RDD of 100 elements across 4 partitions, then distributes a dummy map task before collecting the elements back to the driver program:
scala> val someRDD = sc.parallelize(1 to 100, 4)
someRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:12
scala> someRDD.map(x => x).collect
res1: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100)
In Spark's application UI, you can see from the following screenshot that the "Total Tasks" represents the number of partitions:

View Partition Caching Using the UI
When persisting (a.k.a. caching) RDDs, it's useful to understand how many partitions have been stored. The example below is identical to the one prior, except that we'll now cache the RDD prior to processing it. After this completes, we can use the UI to understand what has been stored from this operation.
scala> someRDD.setName("toy").cache
res2: someRDD.type = toy ParallelCollectionRDD[0] at parallelize at <console>:12
scala> someRDD.map(x => x).collect
res3: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100)
Note from the screenshot that there are four partitions cached.

Inspect RDD Partitions Programatically
In the Scala API, an RDD holds a reference to it's Array of partitions, which you can use to find out how many partitions there are:
scala> val someRDD = sc.parallelize(1 to 100, 30)
someRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:12
scala> someRDD.partitions.size
res0: Int = 30
In the python API, there is a method for explicitly listing the number of partitions:
In [1]: someRDD = sc.parallelize(range(101),30)
In [2]: someRDD.getNumPartitions()
Out[2]: 30
Note in the examples above, the number of partitions was intentionally set to 30 upon initialization.
How Many Partitions Does An RDD Have的更多相关文章
- Spark核心概念之RDD
RDD: Resilient Distributed Dataset RDD的特点: 1.A list of partitions 一系列的分片:比如说64M一片:类似于Hadoop中的s ...
- RDD的依赖关系
RDD的依赖关系 Rdd之间的依赖关系通过rdd中的getDependencies来进行表示, 在提交job后,会通过在DAGShuduler.submitStage-->getMissingP ...
- RDD.scala(源码)
---- map. --- flatMap.fliter.distinct.repartition.coalesce.sample.randomSplit.randomSampleWithRange. ...
- Spark函数详解系列之RDD基本转换
摘要: RDD:弹性分布式数据集,是一种特殊集合 ‚ 支持多种来源 ‚ 有容错机制 ‚ 可以被缓存 ‚ 支持并行操作,一个RDD代表一个分区里的数据集 RDD有两种操作算子: ...
- Spark编程模型及RDD操作
转载自:http://blog.csdn.net/liuwenbo0920/article/details/45243775 1. Spark中的基本概念 在Spark中,有下面的基本概念.Appli ...
- 【原创】大数据基础之Spark(4)RDD原理及代码解析
一 简介 spark核心是RDD,官方文档地址:https://spark.apache.org/docs/latest/rdd-programming-guide.html#resilient-di ...
- Spark源码系列:RDD repartition、coalesce 对比
在上一篇文章中 Spark源码系列:DataFrame repartition.coalesce 对比 对DataFrame的repartition.coalesce进行了对比,在这篇文章中,将会对R ...
- 【Spark-core学习之二】 RDD和算子
环境 虚拟机:VMware 10 Linux版本:CentOS-6.5-x86_64 客户端:Xshell4 FTP:Xftp4 jdk1.8 scala-2.10.4(依赖jdk1.8) spark ...
- spark 算子之RDD
map map(func) Return a new distributed dataset formed by passing each element of the source through ...
随机推荐
- C# word生成html
引入 Aspose.Words public void ConvertToHtml(string wordPath, string savaPath) { try { Aspose.Words.Doc ...
- 死磕itchat源码--目录结构
阅读itchat源码时,先弄清itchat的目录结构 itchat │ config.py │ content.py │ core.py │ log.py │ returnvalues.py │ ut ...
- javascript 富文本 注意事项
富文本编辑器 div内嵌iframe iframe body contenteditable属性 true 整个iframe 即为可编辑框,创建时注意事项: 1.编辑 焦点问题 弹出新控件时为控件设置 ...
- Codeforces 789A Anastasia and pebbles( 水 )
链接:传送门 题意:这个人每次都去公园捡石子,她有两个口袋,每个口袋最多装 k 个石子,公园有 n 种石子,每种石子 w[i] 个,询问最少几天能将石子全部捡完 思路:排个序,尽量每天都多装,如果 k ...
- Ubuntu环境搭建svn服务器
记录一次使用Ubuntu环境搭建svn服务器的详细步骤 一.查看是否已经安装svn 命令:svn如果显示以下信息,说明已安装 二.卸载已安装的svn 命令:sudo apt-get remove ...
- MYSQL数据库常用的LINUX命令
此为我工作中常用MYSQL数据库命令 LINUX系统中启动 / 重启MYSQL服务:service mysql start / service mysql restart(重启服务) LINUX系 ...
- xunsearch实战经验总结
一.定义好配置文件(非常关键) a):如果需要做精确搜索建议对字段设定index=self,tokenizer = full,不然xunsearch会对字段做分词处理: b):数字区间搜索需设定 ty ...
- OA项目知识总结2
BaseAction的抽取 项目中的每个实体类都对应一个action 每个action都都要继承ActionSupport类 已以及实现ModelDriver接口 并且需要注入service 虽然 ...
- ioremap映射函数
一.ioremap() 函数 Linux在io.h头文件中声明了函数ioremap(),用来将I/O内存资源的物理地址映射到核心虚地址空间(3GB-4GB)中(这里是内核空间),原型如下: 1.ior ...
- codevs——T3657 括号序列
http://codevs.cn/problem/3657/ 时间限制: 1 s 空间限制: 256000 KB 题目等级 : 黄金 Gold 题解 题目描述 Description ...