From https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/how_many_partitions_does_an_rdd_have.html

For tuning and troubleshooting, it's often necessary to know how many paritions an RDD represents. There are a few ways to find this information:

View Task Execution Against Partitions Using the UI

When a stage executes, you can see the number of partitions for a given stage in the Spark UI. For example, the following simple job creates an RDD of 100 elements across 4 partitions, then distributes a dummy map task before collecting the elements back to the driver program:

scala> val someRDD = sc.parallelize(1 to 100, 4)
someRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:12 scala> someRDD.map(x => x).collect
res1: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100)

In Spark's application UI, you can see from the following screenshot that the "Total Tasks" represents the number of partitions:

View Partition Caching Using the UI

When persisting (a.k.a. caching) RDDs, it's useful to understand how many partitions have been stored. The example below is identical to the one prior, except that we'll now cache the RDD prior to processing it. After this completes, we can use the UI to understand what has been stored from this operation.

scala> someRDD.setName("toy").cache
res2: someRDD.type = toy ParallelCollectionRDD[0] at parallelize at <console>:12 scala> someRDD.map(x => x).collect
res3: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100)

Note from the screenshot that there are four partitions cached.

Inspect RDD Partitions Programatically

In the Scala API, an RDD holds a reference to it's Array of partitions, which you can use to find out how many partitions there are:

scala> val someRDD = sc.parallelize(1 to 100, 30)
someRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:12 scala> someRDD.partitions.size
res0: Int = 30

In the python API, there is a method for explicitly listing the number of partitions:

In [1]: someRDD = sc.parallelize(range(101),30)

In [2]: someRDD.getNumPartitions()
Out[2]: 30

Note in the examples above, the number of partitions was intentionally set to 30 upon initialization.

How Many Partitions Does An RDD Have的更多相关文章

  1. Spark核心概念之RDD

    RDD: Resilient Distributed Dataset RDD的特点: 1.A list of partitions       一系列的分片:比如说64M一片:类似于Hadoop中的s ...

  2. RDD的依赖关系

    RDD的依赖关系 Rdd之间的依赖关系通过rdd中的getDependencies来进行表示, 在提交job后,会通过在DAGShuduler.submitStage-->getMissingP ...

  3. RDD.scala(源码)

    ---- map. --- flatMap.fliter.distinct.repartition.coalesce.sample.randomSplit.randomSampleWithRange. ...

  4. Spark函数详解系列之RDD基本转换

    摘要:   RDD:弹性分布式数据集,是一种特殊集合 ‚ 支持多种来源 ‚ 有容错机制 ‚ 可以被缓存 ‚ 支持并行操作,一个RDD代表一个分区里的数据集   RDD有两种操作算子:         ...

  5. Spark编程模型及RDD操作

    转载自:http://blog.csdn.net/liuwenbo0920/article/details/45243775 1. Spark中的基本概念 在Spark中,有下面的基本概念.Appli ...

  6. 【原创】大数据基础之Spark(4)RDD原理及代码解析

    一 简介 spark核心是RDD,官方文档地址:https://spark.apache.org/docs/latest/rdd-programming-guide.html#resilient-di ...

  7. Spark源码系列:RDD repartition、coalesce 对比

    在上一篇文章中 Spark源码系列:DataFrame repartition.coalesce 对比 对DataFrame的repartition.coalesce进行了对比,在这篇文章中,将会对R ...

  8. 【Spark-core学习之二】 RDD和算子

    环境 虚拟机:VMware 10 Linux版本:CentOS-6.5-x86_64 客户端:Xshell4 FTP:Xftp4 jdk1.8 scala-2.10.4(依赖jdk1.8) spark ...

  9. spark 算子之RDD

    map map(func) Return a new distributed dataset formed by passing each element of the source through ...

随机推荐

  1. background及background-size

    background有以下几种属性: background-color background-position background-size background-repeat background ...

  2. js脚本捕获页面 GET 方式请求的参数?其实直接使用 window.location.search 获得

    js脚本捕获页面 GET 方式请求的参数?其实直接使用 window.location.search 获得

  3. 可横向滑动的vue tab组件

    示例 前端使用技术:框架->vue 组件>ly-tab一个用于移动端的可触摸滑动具有回弹效果的可复用Vue组件 ly-tab 介绍地址 ly-tab npm地址 使用步骤 1,引入包,定义 ...

  4. a标记地址的几种用法

    1.<a href="tel://号码"></a> 手机使用能自动拨打电话 //可以省略 2.<a href="mailto://邮箱&qu ...

  5. 解决time命令输出信息的重定向问题

    解决time命令输出信息的重定向问题 time命令的输出信息是打印在标准错误输出上的, 我们通过一个简单的尝试来验证一下. [root@web186 root]# time find . -name ...

  6. Vue JsonView 树形格式化代码插件

     组件代码(临时粘出来) <template> <div class="bgView"> <div :class="['json-view' ...

  7. 使用shell脚本定时备份web网站代码

    #!/bin/bash ############### common file ################ #备份文件存放目录 WEBBACK_DIR="/data/backup/ba ...

  8. Innodb性能优化之参数设置

    现在,Innodb是Mysql最多使用的存储引擎.其性能一直广受关注.本文通过基本的参数设置来提高其性能. innodb_buffer_pool_size 缓冲池大小.这是innodb参数中最重要的设 ...

  9. HDU 2815

    特判B不能大于等于C 高次同余方程 #include <iostream> #include <cstdio> #include <cstring> #includ ...

  10. 在IntelliJ IDEA中创建Web项目

    在IntelliJ IDEA中创建Web项目 在IntelliJ IDEA中创建Web项目1,创建Maven WebProject选择File>New>Project 出现New Proj ...