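RDD Part 5: Key-Value Transformation Operators
Transformation operators that process Key-Value data fall roughly into three groups: operators whose input and output partitions map one-to-one, aggregation operators, and join operators.
One-to-one mapping between input and output partitions
mapValues
mapValues applies a map operation to the Value of each (Key, Value) pair and leaves the Key untouched.
Each box represents an RDD partition. The function a => a + 2 adds 2 only to the value 1 in the pair (V1, 1), so the value becomes 3.
Source code: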
    /**
     * Pass each value in the key-value pair RDD through a map function without changing the keys;
     * this also retains the original RDD's partitioning.
     */
    def mapValues[U](f: V => U): RDD[(K, U)] = {
      val cleanF = self.context.clean(f)
      new MapPartitionsRDD[(K, U), (K, V)](self,
        (context, pid, iter) => iter.map { case (k, v) => (k, cleanF(v)) },
        preservesPartitioning = true)
    }
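As a usage illustration, here is a minimal sketch of calling mapValues on a small pair RDD. The object name, sample data, app name, and the local[2] master are assumptions made for the example and do not come from the original post. The second helper checks the "retains the original RDD's partitioning" remark from the doc comment by comparing mapValues with a plain map.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits; only required before Spark 1.3

// Hypothetical example object; names and data are illustrative.
object MapValuesExample {
  // Adds 2 to every value; keys pass through unchanged.
  def addTwo(sc: SparkContext): List[(String, Int)] = {
    val pairs = sc.parallelize(Seq(("V1", 1), ("V2", 2), ("V3", 3)), 2)
    pairs.mapValues(a => a + 2).collect().sortBy(_._1).toList
  }

  // Returns (mapValues kept the partitioner?, map kept a partitioner?).
  def partitionerKept(sc: SparkContext): (Boolean, Boolean) = {
    val partitioned = sc.parallelize(Seq(("V1", 1), ("V2", 2)), 2)
      .partitionBy(new HashPartitioner(2))
    val viaMapValues = partitioned.mapValues(_ + 2)
    // map may change keys, so Spark drops the partitioner here.
    val viaMap = partitioned.map { case (k, v) => (k, v + 2) }
    (viaMapValues.partitioner == partitioned.partitioner, viaMap.partitioner.isDefined)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("mapValues-example").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      println(addTwo(sc))          // List((V1,3), (V2,4), (V3,5))
      println(partitionerKept(sc)) // (true,false)
    } finally {
      sc.stop()
    }
  }
}
```

Keeping the partitioner matters when a later key-based operation uses the same partitioner, because Spark can then skip a shuffle.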
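Aggregation over a single RDD or two RDDs
(1) combineByKey
combineByKey aggregates a single RDD by key. For example, it can turn an RDD whose elements are (Int, Int) into an RDD whose elements are (Int, Seq[Int]).
The parameters of combineByKey are as follows:
- createCombiner: V => C. When no C exists yet for a key, creates one from V (for example, a Seq built from V).
- mergeValue: (C, V) => C. When a C already exists, merges V into it (for example, appends item V to the Seq C, or adds it to a running total).
- mergeCombiners: (C, C) => C. Merges two Cs.
- partitioner: Partitioner. The shuffle uses the Partitioner's strategy to assign records to partitions.
- mapSideCombine: Boolean = true. To reduce the amount of data transferred, much of the combining can be done on the map side first. For a sum, for example, the values of identical keys within one partition can be added up before the shuffle.
- serializer: Serializer = null. Data must be serialized for transfer, and users can supply a custom serializer.
Each box represents an RDD partition. combineByKey merges (V1, 2) and (V1, 1) into (V1, Seq(2, 1)).
Source code: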
    /**
     * Generic function to combine the elements for each key using a custom set of aggregation
     * functions. Turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a "combined type" C
     * Note that V and C can be different -- for example, one might group an RDD of type
     * (Int, Int) into an RDD of type (Int, Seq[Int]). Users provide three functions:
     *
     * - `createCombiner`, which turns a V into a C (e.g., creates a one-element list)
     * - `mergeValue`, to merge a V into a C (e.g., adds it to the end of a list)
     * - `mergeCombiners`, to combine two C's into a single one.
     *
     * In addition, users can control the partitioning of the output RDD, and whether to perform
     * map-side aggregation (if a mapper can produce multiple items with the same key).
     */
    def combineByKey[C](createCombiner: V => C,
        mergeValue: (C, V) => C,
        mergeCombiners: (C, C) => C,
        partitioner: Partitioner,
        mapSideCombine: Boolean = true,
        serializer: Serializer = null): RDD[(K, C)] = {
      require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
      if (keyClass.isArray) {
        if (mapSideCombine) {
          throw new SparkException("Cannot use map-side combining with array keys.")
        }
        if (partitioner.isInstanceOf[HashPartitioner]) {
          throw new SparkException("Default partitioner cannot partition array keys.")
        }
      }
      val aggregator = new Aggregator[K, V, C](
        self.context.clean(createCombiner),
        self.context.clean(mergeValue),
        self.context.clean(mergeCombiners))
      if (self.partitioner == Some(partitioner)) {
        self.mapPartitions(iter => {
          val context = TaskContext.get()
          new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
        }, preservesPartitioning = true)
      } else {
        new ShuffledRDD[K, V, C](self, partitioner)
          .setSerializer(serializer)
          .setAggregator(aggregator)
          .setMapSideCombine(mapSideCombine)
      }
    }

    /**
     * Simplified version of combineByKey that hash-partitions the output RDD.
     */
    def combineByKey[C](createCombiner: V => C,
        mergeValue: (C, V) => C,
        mergeCombiners: (C, C) => C,
        numPartitions: Int): RDD[(K, C)] = {
      combineByKey(createCombiner, mergeValue, mergeCombiners, new HashPartitioner(numPartitions))
    }
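The three user-supplied functions can be illustrated with the (V1, 2), (V1, 1) example from the text. This is a minimal sketch that assumes List[Int] as the combined type C and uses the three-argument combineByKey overload, which falls back to the default partitioner. The object name, sample data, and local master are illustrative. Values are sorted before comparison because the order in which partitions are merged is not guaranteed.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits; only required before Spark 1.3

// Hypothetical example object; names and data are illustrative.
object CombineByKeyExample {
  // Collects all values of each key into a list.
  def groupValues(sc: SparkContext): Map[String, List[Int]] = {
    val pairs = sc.parallelize(Seq(("V1", 2), ("V1", 1), ("V2", 3)), 2)
    val combined = pairs.combineByKey[List[Int]](
      (v: Int) => List(v),                         // createCombiner: first value seen for a key
      (c: List[Int], v: Int) => c :+ v,            // mergeValue: add a value to an existing list
      (c1: List[Int], c2: List[Int]) => c1 ++ c2)  // mergeCombiners: merge lists across partitions
    // Sort each list so the result does not depend on merge order.
    combined.collect().map { case (k, vs) => k -> vs.sorted }.toMap
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("combineByKey-example").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      println(groupValues(sc))
    } finally {
      sc.stop()
    }
  }
}
```

With the sample data, V1 maps to the values 1 and 2 and V2 maps to the single value 3.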
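(2) reduceByKey
reduceByKey is a simpler case: it only merges two values into one. createCombiner therefore just returns v unchanged, and mergeValue and mergeCombiners share the same logic, since both are the user-supplied function.
Each box represents an RDD partition. The user-defined function (A, B) => (A + B) adds the values of records with the same key, (V1, 2) and (V1, 1), producing (V1, 3).
Source code: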
    /**
     * Merge the values for each key using an associative reduce function. This will also perform
     * the merging locally on each mapper before sending results to a reducer, similarly to a
     * "combiner" in MapReduce.
     */
    def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = {
      combineByKey[V]((v: V) => v, func, func, partitioner)
    }

    /**
     * Merge the values for each key using an associative reduce function. This will also perform
     * the merging locally on each mapper before sending results to a reducer, similarly to a
     * "combiner" in MapReduce. Output will be hash-partitioned with numPartitions partitions.
     */
    def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)] = {
      reduceByKey(new HashPartitioner(numPartitions), func)
    }

    /**
     * Merge the values for each key using an associative reduce function. This will also perform
     * the merging locally on each mapper before sending results to a reducer, similarly to a
     * "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/
     * parallelism level.
     */
    def reduceByKey(func: (V, V) => V): RDD[(K, V)] = {
      reduceByKey(defaultPartitioner(self), func)
    }
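The relationship between reduceByKey and combineByKey can be checked with a short sketch. It sums values per key once with reduceByKey and once with the explicit combineByKey form that the source above delegates to, where the identity function serves as createCombiner. The object name, sample data, and local master are assumptions made for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits; only required before Spark 1.3

// Hypothetical example object; names and data are illustrative.
object ReduceByKeyExample {
  // Returns the per-key sums computed two ways: (reduceByKey, combineByKey).
  def sums(sc: SparkContext): (Map[String, Int], Map[String, Int]) = {
    val pairs = sc.parallelize(Seq(("V1", 2), ("V1", 1), ("V2", 5)), 2)
    val add = (a: Int, b: Int) => a + b
    val reduced = pairs.reduceByKey(add).collect().toMap
    // Same aggregation spelled out: identity createCombiner, add for both merges.
    val combined = pairs.combineByKey[Int]((v: Int) => v, add, add).collect().toMap
    (reduced, combined)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("reduceByKey-example").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (reduced, combined) = sums(sc)
      println(reduced == combined) // true
    } finally {
      sc.stop()
    }
  }
}
```

Because the same function is applied both within and across partitions, it should be associative, as the doc comment in the source states.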
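(3) partitionBy
partitionBy repartitions an RDD.
If the RDD's existing partitioner equals the given partitioner, no repartitioning takes place and the RDD itself is returned. Otherwise, a new ShuffledRDD is created from the given partitioner.
Each box represents an RDD partition. Under the new partitioning strategy, V1 and V2, which were originally in different partitions, end up in the same partition.
Source code: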
    /**
     * Return a copy of the RDD partitioned using the specified partitioner.
     */
    def partitionBy(partitioner: Partitioner): RDD[(K, V)] = {
      if (keyClass.isArray && partitioner.isInstanceOf[HashPartitioner]) {
        throw new SparkException("Default partitioner cannot partition array keys.")
      }
      if (self.partitioner == Some(partitioner)) {
        self
      } else {
        new ShuffledRDD[K, V, V](self, partitioner)
      }
    }
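The two branches in the source (return the same RDD, or build a new ShuffledRDD) can be observed with a small sketch. It assumes HashPartitioner instances, which compare equal when their partition counts match, and uses glom() to inspect which keys land in which partition. The object name, sample data, and local master are illustrative.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits; only required before Spark 1.3

// Hypothetical example object; names and data are illustrative.
object PartitionByExample {
  // Returns (equal partitioner gives back the same RDD?,
  //          different partitioner gives a new RDD?,
  //          every key sits in exactly one partition?).
  def check(sc: SparkContext): (Boolean, Boolean, Boolean) = {
    val pairs = sc.parallelize(Seq(("V1", 1), ("V2", 2), ("V1", 3), ("V2", 4)), 4)
    val byHash = pairs.partitionBy(new HashPartitioner(2))
    val again = byHash.partitionBy(new HashPartitioner(2))     // equal partitioner
    val different = byHash.partitionBy(new HashPartitioner(3)) // different partitioner
    // glom() turns each partition into an array so the layout can be inspected.
    val keysPerPartition = byHash.glom().collect().map(part => part.map(_._1).toSet)
    val eachKeyInOnePartition =
      Seq("V1", "V2").forall(k => keysPerPartition.count(_.contains(k)) == 1)
    (again eq byHash, different ne byHash, eachKeyInOnePartition)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("partitionBy-example").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      println(check(sc)) // (true,true,true)
    } finally {
      sc.stop()
    }
  }
}
```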
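(4) cogroup
cogroup co-partitions two RDDs. For Key-Value elements in the two RDDs, the elements that share a key within each RDD are gathered into a collection, and the result pairs each key with iterables over the corresponding collections from both RDDs: (K, (Iterable[V], Iterable[W])). The key is K, and the value is a tuple of the two collections of elements that share that key, one from each RDD.
The large boxes represent RDDs, and the small boxes inside them represent partitions. The records (U1, 1) and (U1, 2) from RDD1 and the record (U1, 2) from RDD2 are combined into (U1, ((1, 2), (2))).
Source code: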
    /**
     * For each key k in `this` or `other1` or `other2` or `other3`,
     * return a resulting RDD that contains a tuple with the list of values
     * for that key in `this`, `other1`, `other2` and `other3`.
     */
    def cogroup[W1, W2, W3](other1: RDD[(K, W1)],
        other2: RDD[(K, W2)],
        other3: RDD[(K, W3)],
        partitioner: Partitioner)
        : RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))] = {
      if (partitioner.isInstanceOf[HashPartitioner] && keyClass.isArray) {
        throw new SparkException("Default partitioner cannot partition array keys.")
      }
      val cg = new CoGroupedRDD[K](Seq(self, other1, other2, other3), partitioner)
      cg.mapValues { case Array(vs, w1s, w2s, w3s) =>
        (vs.asInstanceOf[Iterable[V]],
          w1s.asInstanceOf[Iterable[W1]],
          w2s.asInstanceOf[Iterable[W2]],
          w3s.asInstanceOf[Iterable[W3]])
      }
    }

    /**
     * For each key k in `this` or `other`, return a resulting RDD that contains a tuple with the
     * list of values for that key in `this` as well as `other`.
     */
    def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner)
        : RDD[(K, (Iterable[V], Iterable[W]))] = {
      if (partitioner.isInstanceOf[HashPartitioner] && keyClass.isArray) {
        throw new SparkException("Default partitioner cannot partition array keys.")
      }
      val cg = new CoGroupedRDD[K](Seq(self, other), partitioner)
      cg.mapValues { case Array(vs, w1s) =>
        (vs.asInstanceOf[Iterable[V]], w1s.asInstanceOf[Iterable[W]])
      }
    }

    /**
     * For each key k in `this` or `other1` or `other2`, return a resulting RDD that contains a
     * tuple with the list of values for that key in `this`, `other1` and `other2`.
     */
    def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)], partitioner: Partitioner)
        : RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))] = {
      if (partitioner.isInstanceOf[HashPartitioner] && keyClass.isArray) {
        throw new SparkException("Default partitioner cannot partition array keys.")
      }
      val cg = new CoGroupedRDD[K](Seq(self, other1, other2), partitioner)
      cg.mapValues { case Array(vs, w1s, w2s) =>
        (vs.asInstanceOf[Iterable[V]],
          w1s.asInstanceOf[Iterable[W1]],
          w2s.asInstanceOf[Iterable[W2]])
      }
    }

    /**
     * For each key k in `this` or `other1` or `other2` or `other3`,
     * return a resulting RDD that contains a tuple with the list of values
     * for that key in `this`, `other1`, `other2` and `other3`.
     */
    def cogroup[W1, W2, W3](other1: RDD[(K, W1)], other2: RDD[(K, W2)], other3: RDD[(K, W3)])
        : RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))] = {
      cogroup(other1, other2, other3, defaultPartitioner(self, other1, other2, other3))
    }

    /**
     * For each key k in `this` or `other`, return a resulting RDD that contains a tuple with the
     * list of values for that key in `this` as well as `other`.
     */
    def cogroup[W](other: RDD[(K, W)]): RDD[(K, (Iterable[V], Iterable[W]))] = {
      cogroup(other, defaultPartitioner(self, other))
    }

    /**
     * For each key k in `this` or `other1` or `other2`, return a resulting RDD that contains a
     * tuple with the list of values for that key in `this`, `other1` and `other2`.
     */
    def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)])
        : RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))] = {
      cogroup(other1, other2, defaultPartitioner(self, other1, other2))
    }

    /**
     * For each key k in `this` or `other`, return a resulting RDD that contains a tuple with the
     * list of values for that key in `this` as well as `other`.
     */
    def cogroup[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (Iterable[V], Iterable[W]))] = {
      cogroup(other, new HashPartitioner(numPartitions))
    }

    /**
     * For each key k in `this` or `other1` or `other2`, return a resulting RDD that contains a
     * tuple with the list of values for that key in `this`, `other1` and `other2`.
     */
    def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)], numPartitions: Int)
        : RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))] = {
      cogroup(other1, other2, new HashPartitioner(numPartitions))
    }

    /**
     * For each key k in `this` or `other1` or `other2` or `other3`,
     * return a resulting RDD that contains a tuple with the list of values
     * for that key in `this`, `other1`, `other2` and `other3`.
     */
    def cogroup[W1, W2, W3](other1: RDD[(K, W1)],
        other2: RDD[(K, W2)],
        other3: RDD[(K, W3)],
        numPartitions: Int)
        : RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))] = {
      cogroup(other1, other2, other3, new HashPartitioner(numPartitions))
    }
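The two-RDD case described in the text can be sketched as follows, using the (U1, 1), (U1, 2) and (U1, 2) records from the example. The extra key U2, which appears in only one RDD, is an addition for illustration: it shows that a key missing on one side still appears in the output with an empty collection. The object name and local master are likewise assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits; only required before Spark 1.3

// Hypothetical example object; names and data are illustrative.
object CogroupExample {
  // Groups the values of both RDDs by key; lists are sorted for a stable result.
  def grouped(sc: SparkContext): Map[String, (List[Int], List[Int])] = {
    val rdd1 = sc.parallelize(Seq(("U1", 1), ("U1", 2)), 2)
    val rdd2 = sc.parallelize(Seq(("U1", 2), ("U2", 3)), 2)
    rdd1.cogroup(rdd2)
      .collect()
      .map { case (k, (vs, ws)) => k -> (vs.toList.sorted, ws.toList.sorted) }
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("cogroup-example").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // U1 -> (List(1, 2), List(2)); U2 -> (List(), List(3))
      grouped(sc).toSeq.sortBy(_._1).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```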
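Join
(1) join
join first applies cogroup to the two RDDs being joined. In the resulting RDD, it takes the Cartesian product of the two element collections under each key and flattens the result, so that every pair under a key becomes its own record. The final result is an RDD[(K, (V, W))].
In essence, join co-partitions the data with the cogroup operator and then scatters the grouped data with flatMapValues.
The diagram shows a join of two RDDs. The large boxes represent RDDs, and the small boxes represent partitions. Elements that share the same key (for example, V1) are joined, giving (V1, (1, 1)) and (V1, (1, 2)).
Source code: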
    /**
     * Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each
     * pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and
     * (k, v2) is in `other`. Uses the given Partitioner to partition the output RDD.
     */
    def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = {
      this.cogroup(other, partitioner).flatMapValues( pair =>
        for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w)
      )
    }
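To make the cogroup-then-flatMapValues description concrete, the sketch below runs the built-in join and a hand-written equivalent side by side on the V1 example from the text. The unmatched key V2 is an addition for illustration, showing that an inner join drops keys present on only one side. The object name, sample data, and local master are assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits; only required before Spark 1.3

// Hypothetical example object; names and data are illustrative.
object JoinExample {
  type Joined = List[(String, (Int, Int))]

  // Returns (result of join, result of the manual cogroup + flatMapValues form).
  def joined(sc: SparkContext): (Joined, Joined) = {
    val left = sc.parallelize(Seq(("V1", 1), ("V2", 9)), 2)
    val right = sc.parallelize(Seq(("V1", 1), ("V1", 2)), 2)
    val viaJoin = left.join(right).collect().sorted.toList
    // Per key: Cartesian product of the two value collections, flattened into records.
    val manual = left.cogroup(right).flatMapValues { case (vs, ws) =>
      for (v <- vs.iterator; w <- ws.iterator) yield (v, w)
    }.collect().sorted.toList
    (viaJoin, manual)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("join-example").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (viaJoin, manual) = joined(sc)
      println(viaJoin)           // List((V1,(1,1)), (V1,(1,2)))
      println(viaJoin == manual) // true
    } finally {
      sc.stop()
    }
  }
}
```

V2 has no partner in the right RDD, so its Cartesian product is empty and it contributes no output records.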
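(2) leftOuterJoin and rightOuterJoin
leftOuterJoin and rightOuterJoin build on join by first checking whether one side has no elements for a key. If that side is empty, it is filled with None. Otherwise the elements are joined as usual, with the optional side wrapped in Some, and the result is returned.
Source code: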
    /**
     * Perform a left outer join of `this` and `other`. For each element (k, v) in `this`, the
     * resulting RDD will either contain all pairs (k, (v, Some(w))) for w in `other`, or the
     * pair (k, (v, None)) if no elements in `other` have key k. Uses the given Partitioner to
     * partition the output RDD.
     */
    def leftOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, Option[W]))] = {
      this.cogroup(other, partitioner).flatMapValues { pair =>
        if (pair._2.isEmpty) {
          pair._1.iterator.map(v => (v, None))
        } else {
          for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, Some(w))
        }
      }
    }

    /**
     * Perform a right outer join of `this` and `other`. For each element (k, w) in `other`, the
     * resulting RDD will either contain all pairs (k, (Some(v), w)) for v in `this`, or the
     * pair (k, (None, w)) if no elements in `this` have key k. Uses the given Partitioner to
     * partition the output RDD.
     */
    def rightOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner)
        : RDD[(K, (Option[V], W))] = {
      this.cogroup(other, partitioner).flatMapValues { pair =>
        if (pair._1.isEmpty) {
          pair._2.iterator.map(w => (None, w))
        } else {
          for (v <- pair._1.iterator; w <- pair._2.iterator) yield (Some(v), w)
        }
      }
    }
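The None and Some handling in the two outer joins can be seen in a short sketch. The sample data is made up for illustration: V1 exists on both sides, V2 only on the left, and V3 only on the right. The object name and local master are also assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits; only required before Spark 1.3

// Hypothetical example object; names and data are illustrative.
object OuterJoinExample {
  // Keeps every left record; a missing right side becomes None.
  def leftOuter(sc: SparkContext): List[(String, (Int, Option[Int]))] = {
    val left = sc.parallelize(Seq(("V1", 1), ("V2", 2)), 2)
    val right = sc.parallelize(Seq(("V1", 10), ("V3", 30)), 2)
    left.leftOuterJoin(right).collect().sortBy(_._1).toList
  }

  // Keeps every right record; a missing left side becomes None.
  def rightOuter(sc: SparkContext): List[(String, (Option[Int], Int))] = {
    val left = sc.parallelize(Seq(("V1", 1), ("V2", 2)), 2)
    val right = sc.parallelize(Seq(("V1", 10), ("V3", 30)), 2)
    left.rightOuterJoin(right).collect().sortBy(_._1).toList
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("outerJoin-example").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      println(leftOuter(sc))  // List((V1,(1,Some(10))), (V2,(2,None)))
      println(rightOuter(sc)) // List((V1,(Some(1),10)), (V3,(None,30)))
    } finally {
      sc.stop()
    }
  }
}
```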
Original article: http://blog.csdn.net/jasonding1354