Spark RDD Actions: Simple Examples (Part 2)
foreach(f: T => Unit)
Applies the function f to every element of the RDD; f has no return value, so foreach is used purely for its side effects.
/**
* Applies a function f to all elements of this RDD.
*/
def foreach(f: T => Unit): Unit
scala> val rdd = sc.parallelize(1 to 9, 2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> rdd.foreach(x=>{println(x)})
1
2
3
4
5
6
7
8
9
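Note that foreach runs inside the executors: in local mode the println output shows up in the same console, but on a cluster it goes to the executors' stdout rather than the driver. A common way to bring a side-effect result back to the driver is an accumulator. A minimal sketch, assuming Spark 2.x and an existing SparkContext sc:

val acc = sc.longAccumulator("sum")              // driver-side accumulator
sc.parallelize(1 to 9, 2).foreach(x => acc.add(x))
println(acc.value)                               // 45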
foreachPartition(f: Iterator[T] => Unit)
Applies the function f to each partition of the RDD; f receives the partition's elements as an Iterator[T].
/**
* Applies a function f to each partition of this RDD.
*/
def foreachPartition(f: Iterator[T] => Unit): Unit
scala> val rdd = sc.parallelize(1 to 9, 2)
scala> rdd.foreachPartition(x=>{
| while(x.hasNext){
| println(x.next)
| }
| println("===========")
| }
| )
1
2
3
4
===========
5
6
7
8
9
===========
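foreachPartition is mainly useful when each partition needs some expensive one-time setup, such as a database connection. A minimal sketch; the Connection type and its methods below are hypothetical placeholders, not a real API:

rdd.foreachPartition { iter =>
  // val conn = Connection.open()   // hypothetical: open one connection per partition
  iter.foreach { x =>
    // conn.write(x)                // hypothetical per-element work
    println(x)
  }
  // conn.close()                   // hypothetical: release the resource once per partition
}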
getCheckpointFile
Returns the directory to which this RDD was checkpointed.
/**
* Gets the name of the directory to which this RDD was checkpointed.
* This is not defined if the RDD is checkpointed locally.
*/
def getCheckpointFile: Option[String]
scala> val rdd = sc.parallelize(1 to 9,2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:24
scala> rdd.checkpoint
/* Querying right after checkpoint returns None, which shows that checkpoint is lazy:
   nothing is written until an action materializes the RDD. */
scala> rdd.getCheckpointFile
res6: Option[String] = None
scala> rdd.count
res7: Long = 9
scala> rdd.getCheckpointFile
res8: Option[String] = Some(file:/home/check/ca771099-b1bf-46c8-9404-68b4ace7feeb/rdd-1)
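The transcript above only works if a checkpoint directory was set beforehand; that step is not shown. A minimal sketch, using the path implied by the output above:

sc.setCheckpointDir("/home/check")        // must be set before calling checkpoint
val rdd = sc.parallelize(1 to 9, 2)
rdd.checkpoint()                          // only marks the RDD; nothing is written yet
rdd.count()                               // the first action materializes the checkpoint
println(rdd.getCheckpointFile)            // Some(file:/home/check/.../rdd-...)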
getNumPartitions
Returns the number of partitions of this RDD.
/**
* Returns the number of partitions of this RDD.
*/
@Since("1.6.0")
final def getNumPartitions: Int = partitions.length
scala> val rdd = sc.parallelize(1 to 9,2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at <console>:24
scala> rdd.getNumPartitions
res9: Int = 2
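The partition count reflects how the RDD was created and any repartitioning applied afterwards. A minimal sketch:

val rdd = sc.parallelize(1 to 9, 2)
rdd.getNumPartitions                 // 2
rdd.repartition(4).getNumPartitions  // 4
rdd.coalesce(1).getNumPartitions     // 1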
getStorageLevel
Returns the RDD's current storage level, or StorageLevel.NONE if none is set.
/** Get the RDD's current storage level, or StorageLevel.NONE if none is set. */
def getStorageLevel: StorageLevel = storageLevel
scala> val rdd = sc.parallelize(1 to 9,2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at <console>:24
scala> rdd.getStorageLevel
res10: org.apache.spark.storage.StorageLevel = StorageLevel(1 replicas)
scala> rdd.cache
res11: rdd.type = ParallelCollectionRDD[3] at parallelize at <console>:24
scala> rdd.getStorageLevel
res12: org.apache.spark.storage.StorageLevel = StorageLevel(memory, deserialized, 1 replicas)
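cache() is shorthand for persist() with the default MEMORY_ONLY level; other levels can be requested explicitly with persist. A minimal sketch (the exact printed string may vary slightly across Spark versions):

import org.apache.spark.storage.StorageLevel
val rdd = sc.parallelize(1 to 9, 2)
rdd.persist(StorageLevel.MEMORY_AND_DISK)
rdd.getStorageLevel   // StorageLevel(disk, memory, deserialized, 1 replicas)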
isCheckpointed
Returns whether this RDD has been checkpointed and materialized.
/**
* Return whether this RDD is checkpointed and materialized, either reliably or locally.
*/
def isCheckpointed: Boolean
scala> val rdd = sc.parallelize(1 to 9,2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at <console>:24
scala> rdd.isCheckpointed
res13: Boolean = false
scala> rdd.checkpoint
scala> rdd.isCheckpointed
res15: Boolean = false
scala> rdd.count
res16: Long = 9
scala> rdd.isCheckpointed
res17: Boolean = true
isEmpty()
Returns whether the RDD is empty; calling it on an RDD of Nothing or Null raises an exception.
/**
* @note due to complications in the internal implementation, this method will raise an
* exception if called on an RDD of `Nothing` or `Null`. This may be come up in practice
* because, for example, the type of `parallelize(Seq())` is `RDD[Nothing]`.
* (`parallelize(Seq())` should be avoided anyway in favor of `parallelize(Seq[T]())`.)
* @return true if and only if the RDD contains no elements at all. Note that an RDD
* may be empty even when it has at least 1 partition.
*/
def isEmpty(): Boolean
scala> val rdd = sc.parallelize(Seq())
rdd: org.apache.spark.rdd.RDD[Nothing] = ParallelCollectionRDD[5] at parallelize at <console>:24
scala> rdd.isEmpty
org.apache.spark.SparkDriverExecutionException: Execution error
at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1187)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1656)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1871)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1884)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1897)
at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1305)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.take(RDD.scala:1279)
at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply$mcZ$sp(RDD.scala:1413)
at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1413)
at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1413)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.isEmpty(RDD.scala:1412)
... 48 elided
Caused by: java.lang.ArrayStoreException: [Ljava.lang.Object;
at scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:90)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1884)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1884)
at org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:59)
at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1183)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1656)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
scala> val rdd = sc.parallelize(Seq(1 to 9))
rdd: org.apache.spark.rdd.RDD[scala.collection.immutable.Range.Inclusive] = ParallelCollectionRDD[6] at parallelize at <console>:24
scala> rdd.isEmpty
res19: Boolean = false
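As the note above suggests, the exception can be avoided by giving the empty Seq an explicit element type. Also note that Seq(1 to 9) above produces an RDD with a single element (the Range itself), not nine elements. A minimal sketch:

val empty = sc.parallelize(Seq[Int]())
empty.isEmpty          // true
val nonEmpty = sc.parallelize(1 to 9)
nonEmpty.isEmpty       // false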
max()
Returns the maximum element of the RDD, as determined by the implicit Ordering[T].
/**
* Returns the max of this RDD as defined by the implicit Ordering[T].
* @return the maximum element of the RDD
* */
def max()(implicit ord: Ordering[T]): T
min()
Returns the minimum element of the RDD, as determined by the implicit Ordering[T].
/**
* Returns the min of this RDD as defined by the implicit Ordering[T].
* @return the minimum element of the RDD
* */
def min()(implicit ord: Ordering[T]): T
scala> val rdd = sc.parallelize(1 to 9)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[7] at parallelize at <console>:24
scala> rdd.max
res21: Int = 9
scala> rdd.min
res22: Int = 1
res22: Int = 1
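Both methods take an implicit Ordering[T], so an explicit ordering can be supplied to compare by something other than the natural order. A minimal sketch comparing pairs by their second field:

val pairs = sc.parallelize(Seq(("a", 3), ("b", 1), ("c", 2)))
val byValue = Ordering.by[(String, Int), Int](_._2)
pairs.max()(byValue)   // (a,3)
pairs.min()(byValue)   // (b,1)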
reduce(f: (T, T) => T)
Aggregates all elements of the RDD with the given binary operator; the operator must be commutative and associative.
/**
* Reduces the elements of this RDD using the specified commutative and
* associative binary operator.
*/
def reduce(f: (T, T) => T): T
scala> val rdd = sc.parallelize(1 to 9)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[7] at parallelize at <console>:24
scala> def func(x:Int, y:Int):Int={
| if(x >= y){
| x
| }else{
| y}
| }
func: (x: Int, y: Int)Int
scala> rdd.reduce(func(_,_))
res23: Int = 9
scala> rdd.reduce((x,y)=>{
| if(x>=y){
| x
| }else{
| y
| }
| }
| )
res24: Int = 9
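Because partial results from different partitions are combined in no fixed order, the function passed to reduce must be commutative and associative. For simple cases an operator is enough. A minimal sketch:

val rdd = sc.parallelize(1 to 9)
rdd.reduce(_ + _)     // 45
rdd.reduce(_ max _)   // 9, same result as the custom func above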
saveAsObjectFile(path: String)
Saves the RDD to files under the given directory, as a SequenceFile of serialized objects.
/**
* Save this RDD as a SequenceFile of serialized objects.
*/
def saveAsObjectFile(path: String): Unit
scala> val rdd = sc.parallelize(1 to 9)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[7] at parallelize at <console>:24
scala> rdd.saveAsObjectFile("/home/check/object")
[root@localhost ~]# ls /home/check/object/
part-00000 _SUCCESS
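An object file written this way can be read back with SparkContext.objectFile. A minimal sketch, reusing the path from the example above:

val restored = sc.objectFile[Int]("/home/check/object")
restored.collect().sorted   // Array(1, 2, ..., 9)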
saveAsTextFile(path: String)
Saves the RDD as a text file, using the string representation of each element.
/**
* Save this RDD as a text file, using string representations of elements.
*/
def saveAsTextFile(path: String): Unit
scala> val rdd = sc.parallelize(1 to 9)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[7] at parallelize at <console>:24
scala> rdd.saveAsTextFile("/home/check/text")
[root@localhost ~]# ls /home/check/text/part-00000
/home/check/text/part-00000
[root@localhost ~]# more /home/check/text/part-00000
1
2
3
4
5
6
7
8
9
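The text output can be read back with SparkContext.textFile; the elements come back as strings and must be parsed. A minimal sketch, reusing the path from the example above:

val lines = sc.textFile("/home/check/text")
lines.map(_.toInt).collect().sorted   // Array(1, 2, ..., 9)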
take(num: Int)
Returns the first num elements of the RDD.
/**
* Take the first num elements of the RDD. It works by first scanning one partition, and use the
* results from that partition to estimate the number of additional partitions needed to satisfy
* the limit.
*
* @note this method should only be used if the resulting array is expected to be small, as
* all the data is loaded into the driver's memory.
*
* @note due to complications in the internal implementation, this method will raise
* an exception if called on an RDD of `Nothing` or `Null`.
*/
def take(num: Int): Array[T]
scala> val rdd = sc.parallelize(1 to 9)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[13] at parallelize at <console>:24
scala> rdd.take(5)
res28: Array[Int] = Array(1, 2, 3, 4, 5)
takeOrdered(num: Int)
Returns the smallest num elements, in ascending order according to the implicit Ordering[T].
scala> val rdd = sc.parallelize(List(2,6,3,1,5,9))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[15] at parallelize at <console>:24
scala> rdd.takeOrdered(3)
res30: Array[Int] = Array(1, 2, 3)
takeSample(withReplacement: Boolean, num: Int, seed: Long)
Returns a fixed-size sampled subset of the RDD as an array, optionally sampling with replacement and with an optional random seed.
def takeSample(
    withReplacement: Boolean,
    num: Int,
    seed: Long = Utils.random.nextLong): Array[T]
scala> val rdd = sc.parallelize(List(2,6,3,1,5,9))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[18] at parallelize at <console>:24
scala> rdd.takeSample(true,6,8)
res34: Array[Int] = Array(5, 2, 2, 5, 3, 2)
scala> rdd.takeSample(false,6,8)
res35: Array[Int] = Array(9, 3, 2, 6, 1, 5)
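withReplacement controls whether the same element may be drawn more than once (which is why res34 contains duplicates while res35 does not), and a fixed seed makes the sample reproducible. A minimal sketch; note that, as far as I understand, when withReplacement is false and num is at least the RDD size, the result is simply the whole data set in random order:

val rdd = sc.parallelize(List(2, 6, 3, 1, 5, 9))
rdd.takeSample(withReplacement = true,  num = 3, seed = 8)   // may contain duplicates
rdd.takeSample(withReplacement = false, num = 3, seed = 8)   // distinct elements only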
top(num: Int)
Returns the largest num elements, in descending order according to the implicit Ordering[T].
/*
* @param num k, the number of top elements to return
* @param ord the implicit ordering for T
* @return an array of top elements
*/
def top(num: Int)(implicit ord: Ordering[T]): Array[T]
scala> val rdd = sc.parallelize(List(2,6,3,1,5,9))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[18] at parallelize at <console>:24
scala> rdd.top(3)
res37: Array[Int] = Array(9, 6, 5)
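top(num) behaves like takeOrdered(num) under the reversed ordering, so the two are interchangeable by flipping the Ordering. A minimal sketch:

val rdd = sc.parallelize(List(2, 6, 3, 1, 5, 9))
rdd.top(3)                                 // Array(9, 6, 5)
rdd.takeOrdered(3)(Ordering[Int].reverse)  // Array(9, 6, 5)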