map

map(func) Return a new distributed dataset formed by passing each element of the source through a function func.
Returns a new distributed dataset formed by applying the function func to each element of the source RDD.
    val rdd = session.sparkContext.parallelize(1 to 10)
    rdd.foreach(println)
    println("=========================")
    rdd.map(x => (x, 1)).foreach(println)

结果:

67891012345

=========================

(6,1)(7,1)(8,1)(9,1)(10,1)(1,1)(2,1)(3,1)(4,1)(5,1)

filter

filter(func) Return a new dataset formed by selecting those elements of the source on which func returns true.

Filters the elements with a user-defined predicate function.

    val rdd = session.sparkContext.parallelize(1 to 10)
    rdd.foreach(print)
    val rdd2 = rdd.filter(_ > 6)
    println("=========================")
    rdd2.foreach(print)

结果:

67891012345

=========================

78910

flatMap

flatMap(func) Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).

Maps each element of the RDD to zero or more elements with a user-defined function and returns the flattened results as a new RDD.

    val ds = session.sparkContext.textFile("D:/公司/test.txt")
    ds.foreach(println)
    val ds2 = ds.flatMap(x => {
      x.split(":")
    })
    println("===================")
    ds2.foreach(println)

结果:

{ "DEVICENAME": "����4", "LID": 170501310, "ADDRESS": "xxxx", "ID": 230001160 }

===================

{ "DEVICENAME"
"����4", "LID"
170501310, "ADDRESS"
"xxxx", "ID"
230001160 }

mapPartitions

mapPartitions(func) Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.

Similar to map, but it processes one whole partition at a time, like a batched version of map.
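
A minimal sketch, assuming the same `session` as in the earlier examples: the function receives an entire partition as an iterator, which makes it suitable for per-partition setup (for example, opening a connection once per partition rather than once per element).

    val rdd = session.sparkContext.parallelize(1 to 10, 2)
    val rdd2 = rdd.mapPartitions(iter => {
      // executed once per partition; iter holds all elements of that partition
      iter.map(x => (x, 1))
    })
    rdd2.foreach(println)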

mapPartitionsWithIndex

mapPartitionsWithIndex(func) Similar to mapPartitions, but also provides func with an integer value representing the index of the partition, so func must be of type (Int, Iterator<T>) => Iterator<U> when running on an RDD of type T.
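
A small sketch under the same assumptions: the extra Int argument is the index of the partition, which is handy for inspecting how elements are distributed.

    val rdd = session.sparkContext.parallelize(1 to 10, 2)
    rdd.mapPartitionsWithIndex((index, iter) => {
      // tag every element with the index of the partition it lives in
      iter.map(x => (index, x))
    }).foreach(println)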

sample

sample(withReplacement, fraction, seed) Sample a fraction fraction of the data, with or without replacement, using a given random number generator seed.

Draws a random sample from an RDD.

It takes three parameters:

withReplacement: a Boolean indicating whether elements may be sampled more than once (sampling with replacement).

fraction: the expected fraction of elements to return, between 0 and 1. For example, for an RDD of 10 elements and fraction = 0.5, roughly 5 elements are returned (the exact count is not guaranteed).

seed: the seed for the random number generator; by default a random seed is used. If a fixed (non-random) seed is supplied, repeated calls return the same sampled RDD.
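
As an illustration of the seed parameter (a sketch, assuming the usual `session`), two calls with the same fixed seed return the same sample:

    val rdd = session.sparkContext.parallelize(1 to 10)
    // fixed seed: both calls return exactly the same elements
    rdd.sample(false, 0.5, 100L).foreach(print)
    println()
    rdd.sample(false, 0.5, 100L).foreach(print)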

union

union(otherDataset) Return a new dataset that contains the union of the elements in the source dataset and the argument.

Merges two RDDs into one without removing duplicates.
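
A minimal sketch, assuming the same `session`:

    val rdd1 = session.sparkContext.parallelize(1 to 5)
    val rdd2 = session.sparkContext.parallelize(3 to 7)
    // 3, 4 and 5 appear twice in the result because union does not deduplicate
    rdd1.union(rdd2).foreach(print)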

intersection

intersection(otherDataset) Return a new RDD that contains the intersection of elements in the source dataset and the argument.

Returns the intersection of two RDDs, with duplicates removed.

    val rdd = session.sparkContext.parallelize(1 to 10)
    rdd.foreach(println)
    val rdd2 = rdd.sample(true, 0.5)
    println("==============")
    rdd2.foreach(println)
    val rdd3 = rdd.intersection(rdd2)
    println("==============")
    rdd3.foreach(println)

结果:

==============
89955
==============
958

distinct

distinct([numTasks]) Return a new dataset that contains the distinct elements of the source dataset.

Removes duplicate elements from an RDD. The optional argument is the number of tasks.

Internally it groups the elements and keeps only the first one in each group, as the source below shows:

/**
 * Return a new RDD containing the distinct elements in this RDD.
 */
def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)
}
    val rdd = session.sparkContext.parallelize(1 to 10)
    val rdd2 = rdd.sample(true, 0.5)
    rdd2.foreach(print)
    println("====================")
    val rdd3 = rdd2.distinct(10)
    rdd3.foreach(print)

结果:

7792224

==============

4279

groupByKey

groupByKey([numTasks]) When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. 
Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance. 
Note: By default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass an optional numTasks argument to set a different number of tasks.

Groups a dataset of (K, V) pairs by key and returns, for each key, the collection of its values. This operator requires a (K, V) pair RDD.

The official recommendation is to use reduceByKey or aggregateByKey instead when the grouping is followed by an aggregation.

groupByKey corresponds to a SQL GROUP BY and does not accept a user-defined function; to compute counts or other aggregates on top of the grouping, use reduceByKey or aggregateByKey.

groupBy

groupBy differs slightly from groupByKey: (1) groupBy lets you define the key yourself; (2) in the return value, groupBy yields [key, {(key, value1), (key, value2)}] while groupByKey yields [key, {value1, value2}].

    import scala.util.Random

    val seq = Seq[String]("spark", "hadoop", "spark")
    val rdd = session.sparkContext.parallelize(seq)
    val rdd2 = rdd.map(x => (x, 1)).groupBy(_._1) // the element itself is the key; same as groupByKey, but the return value differs slightly
    rdd2.foreach(println)
    println("==============")
    val rdd4 = rdd.map(x => (x, 1)).groupBy(x => {
      x._1 + new Random().nextInt(100) // the key can be user-defined
    })
    rdd4.foreach(println)
    println("==============")
    val rdd3 = rdd.map(x => (x, 1)).groupByKey() // by default the element is the key
    rdd3.foreach(println)

结果:

  (spark,CompactBuffer((spark,1), (spark,1)))
  (hadoop,CompactBuffer((hadoop,1)))
  ==============

  (spark92,CompactBuffer((spark,1)))
  (hadoop72,CompactBuffer((hadoop,1)))
  (spark46,CompactBuffer((spark,1)))

  ==============

  (spark,CompactBuffer(1, 1))
  (hadoop,CompactBuffer(1))

reduceByKey

reduceByKey(func, [numTasks]) When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.

For a (K, V) pair RDD, returns a dataset with a single entry per distinct key; the aggregated value for each key is produced by the reduce function given as the first argument.

    val seq = Seq[String]("spark", "hadoop", "spark")
    val rdd = session.sparkContext.parallelize(seq)
    val rdd2 = rdd.map(x => (x, 1)).reduceByKey(_ + _)
    rdd2.foreach(println)

结果:

  (spark,2)

  (hadoop,1)

Summary: groupBy, groupByKey, reduceByKey

1. groupBy lets you define the key yourself; in the return value, groupBy yields [key, {(key, value1), (key, value2)}] while groupByKey yields [key, {value1, value2}].

2. The first argument of reduceByKey(func, [numTasks]) is a user-defined function, so the result can be processed further. groupBy([numTasks]) and groupByKey([numTasks]) do not take a user-defined function; to implement something like word count with them, extra operators or custom code are needed.

3. reduceByKey and groupByKey differ internally, as the official comments make clear: reduceByKey merges the values on each node before the data is shuffled, similarly to a "combiner" in MapReduce.

reduceByKey

/**
 * Merge the values for each key using an associative and commutative reduce function. This will
 * also perform the merging locally on each mapper before sending results to a reducer, similarly
 * to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/
 * parallelism level.
 */
def reduceByKey(func: (V, V) => V): RDD[(K, V)] = self.withScope {
  reduceByKey(defaultPartitioner(self), func)
}

groupByKey

/**
 * Group the values for each key in the RDD into a single sequence. Allows controlling the
 * partitioning of the resulting key-value pair RDD by passing a Partitioner.
 * The ordering of elements within each group is not guaranteed, and may even differ
 * each time the resulting RDD is evaluated.
 *
 * @note This operation may be very expensive. If you are grouping in order to perform an
 * aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
 * or `PairRDDFunctions.reduceByKey` will provide much better performance.
 *
 * @note As currently implemented, groupByKey must be able to hold all the key-value pairs for any
 * key in memory. If a key has too many values, it can result in an [[OutOfMemoryError]].
 */
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
  // groupByKey shouldn't use map side combine because map side combine does not
  // reduce the amount of data shuffled and requires all map side data be inserted
  // into a hash table, leading to more objects in the old gen.
  val createCombiner = (v: V) => CompactBuffer(v)
  val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
  val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
  val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
    createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
  bufs.asInstanceOf[RDD[(K, Iterable[V])]]
}

The following example implements the classic word count in three different ways.

    val seq = Seq[String]("spark", "hadoop", "spark")
    val rdd = session.sparkContext.parallelize(seq)
    // reduceByKey
    rdd.map((_, 1)).reduceByKey(_ + _).foreach(println)
    // groupBy
    rdd.map((_, 1)).groupBy(_._1).map(x => {
      (x._1, x._2.map(_._2))
    }).map(x => (x._1, x._2.sum)).foreach(println)
    // groupByKey
    rdd.map((_, 1)).groupByKey().map(x => (x._1, x._2.sum)).foreach(println)

结果:

  (spark,2)
  (hadoop,1)

aggregateByKey

aggregateByKey(zeroValue)(seqOp, combOp, [numTasks]) When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral "zero" value. Allows an aggregated value type that is different than the input value type, while avoiding unnecessary allocations. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.

This operator is similar to reduceByKey. It takes three parameters: the first is an initial (zero) value, and the other two are user-defined functions.

For a (K, V) pair RDD, values are grouped and merged by key. Within each partition, each value is combined with the zero value by the seq function, producing an intermediate (K, V) pair per key; the per-partition results are then merged by key with the combine function (the first two values are combined, the result is combined with the next value, and so on), and each key with its final result is emitted as a new pair. In client mode, pay attention to the number of cores configured for the master.

    /**
     * a: the zero value / partial result; b: the value from the RDD pair
     */
    def seqs(a: Int, b: Int): Int = b

    def comb(a: Int, b: Int): Int = a + b

    def aggregateByKey(session: SparkSession): Unit = {
      val seq = Seq[String]("spark", "hadoop", "spark")
      val rdd = session.sparkContext.parallelize(seq).map((_, 1))
      // each element becomes ("spark", 1), ("hadoop", 1), ...
      rdd.aggregateByKey(2)(seqs, comb).foreach(println)
    }

结果:

  (spark,2)
  (hadoop,1)

sortByKey sortBy

sortByKey([ascending], [numTasks]) When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.

Sorting operators. sortBy works on any RDD, while sortByKey works on (K, V) pair RDDs. Because ordering matters, in client mode pay attention to the number of cores configured for the master.

    val seq = Seq[String]("spark", "hadoop", "apark", "spark", "hadoop", "spark")
    val rdd = session.sparkContext.parallelize(seq)
    rdd.sortBy(x => x, true, 1).foreach(println)
    println("===============")
    rdd.map((_, 1)).sortByKey(true, 1).foreach(println)

结果:

apark
hadoop
hadoop
spark
spark
spark

===============

(apark,1)
(hadoop,1)
(hadoop,1)
(spark,1)
(spark,1)
(spark,1)

join leftOuterJoin rightOuterJoin fullOuterJoin

join(otherDataset, [numTasks]) When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.

join is equivalent to a SQL inner join; the other variants correspond to the SQL left outer, right outer, and full outer joins.
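
A short sketch of the four variants on two small pair RDDs (assuming the usual `session`); the comments show the expected pairs, though the print order may vary:

    val left = session.sparkContext.parallelize(Seq(("spark", 1), ("hadoop", 1)))
    val right = session.sparkContext.parallelize(Seq(("spark", 2), ("flink", 2)))
    left.join(right).foreach(println)           // (spark,(1,2))
    left.leftOuterJoin(right).foreach(println)  // (spark,(1,Some(2))), (hadoop,(1,None))
    left.rightOuterJoin(right).foreach(println) // (spark,(Some(1),2)), (flink,(None,2))
    left.fullOuterJoin(right).foreach(println)  // all keys, values wrapped in Option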

cogroup

cogroup(otherDataset, [numTasks]) When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This operation is also called groupWith.

cogroup is similar to a full outer join (fullOuterJoin), except that cogroup groups the values by key instead of producing one row per matching pair, and it can combine more than two RDDs.

    val seq = Seq[String]("spark", "hadoop", "apark", "spark", "hadoop", "spark")
    val seq2 = Seq[String]("spark1", "hadoop", "apark1")
    val rdd = session.sparkContext.parallelize(seq).map((_, 1))
    val rdd2 = session.sparkContext.parallelize(seq2).map((_, 1))
    rdd.fullOuterJoin(rdd2).foreach(println)
    println("===================")
    rdd.cogroup(rdd2).collect().foreach(println)

结果:

  (apark,(Some(1),None))
  (apark1,(None,Some(1)))
  (spark1,(None,Some(1)))
  (spark,(Some(1),None))
  (spark,(Some(1),None))
  (spark,(Some(1),None))
  (hadoop,(Some(1),Some(1)))
  (hadoop,(Some(1),Some(1)))

  ===================

  (spark1,(CompactBuffer(),CompactBuffer(1)))
  (spark,(CompactBuffer(1, 1, 1),CompactBuffer()))
  (hadoop,(CompactBuffer(1, 1),CompactBuffer(1)))
  (apark1,(CompactBuffer(),CompactBuffer(1)))
  (apark,(CompactBuffer(1),CompactBuffer()))

cartesian

cartesian(otherDataset) When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements).

Computes the Cartesian product of two RDDs.
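
A minimal sketch (same `session` assumed); note that the result contains size(rdd1) × size(rdd2) elements, so it is expensive on large datasets:

    val rdd1 = session.sparkContext.parallelize(Seq(1, 2))
    val rdd2 = session.sparkContext.parallelize(Seq("a", "b"))
    // pairs: (1,a), (1,b), (2,a), (2,b) (order may vary)
    rdd1.cartesian(rdd2).foreach(println)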

pipe

pipe(command, [envVars]) Pipe each partition of the RDD through a shell command, e.g. a Perl or bash script. RDD elements are written to the process's stdin and lines output to its stdout are returned as an RDD of strings.

Pipes the RDD through an external program.
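
A hedged sketch, assuming a Unix-like environment where `wc` is on the PATH; the elements of each partition are written to the command's stdin, and each line the command prints becomes an element of the new RDD:

    val rdd = session.sparkContext.parallelize(Seq("spark", "hadoop", "spark"), 2)
    // each partition is fed to `wc -l`, so the result is one line count per partition
    rdd.pipe("wc -l").foreach(println)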

coalesce

coalesce(numPartitions) Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset.

repartition

repartition(numPartitions) Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.

Both operators repartition an RDD. coalesce takes a flag that controls whether a shuffle is performed (no shuffle by default, so it can only reduce the number of partitions), while repartition always performs a shuffle.
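
A minimal sketch (same `session` assumed) that prints the partition counts before and after:

    val rdd = session.sparkContext.parallelize(1 to 100, 10)
    println(rdd.getNumPartitions)                               // 10
    println(rdd.coalesce(2).getNumPartitions)                   // 2, no shuffle by default
    println(rdd.coalesce(20, shuffle = true).getNumPartitions)  // 20, increasing requires a shuffle
    println(rdd.repartition(20).getNumPartitions)               // 20, always shuffles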
