A List of Spark Operators
val rdd1 = sc.parallelize(List(,,,,,,,,,))
val rdd2 = sc.parallelize(List(,,,,,,,,,)).map(_*).sortBy(x=>x,true)
val rdd3 = rdd2.filter(_>)
val rdd2 = sc.parallelize(List(,,,,,,,,,)).map(_*).sortBy(x=>x+"",true)
val rdd2 = sc.parallelize(List(,,,,,,,,,)).map(_*).sortBy(x=>x.toString,true)
val rdd4 = sc.parallelize(Array("a b c", "d e f", "h i j"))
rdd4.flatMap(_.split(' ')).collect
------------------------------------------------------------------
val rdd5 = sc.parallelize(List(List("a b c", "a b b"),List("e f g", "a f g"), List("h i j", "a a b")))
rdd5.flatMap(_.flatMap(_.split(" "))).collect
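The nested flatMap above can be tried without a cluster. The following is a minimal sketch that uses a plain Scala List in place of the RDD (an assumption made for illustration; the object name `FlatMapSketch` is hypothetical): the outer flatMap flattens the list of lists, and the inner one splits each line into words.

```scala
object FlatMapSketch {
  // A plain Scala collection stands in for the RDD; no Spark is required.
  val nested: List[List[String]] =
    List(List("a b c", "a b b"), List("e f g", "a f g"), List("h i j", "a a b"))

  // Outer flatMap flattens the list of lists;
  // inner flatMap splits every line into its words.
  val words: List[String] = nested.flatMap(_.flatMap(_.split(" ")))

  def main(args: Array[String]): Unit =
    println(words.mkString(","))
}
```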
val rdd6 = sc.parallelize(List(,,,))
val rdd7 = sc.parallelize(List(,,,))
val rdd8 = rdd6.union(rdd7)
rdd8.distinct.sortBy(x=>x).collect
--------------------------------------------
val rdd9 = rdd6.intersection(rdd7)
val rdd1 = sc.parallelize(List(("tom", ), ("jerry", ), ("kitty", )))
val rdd2 = sc.parallelize(List(("jerry", ), ("tom", ), ("shuke", )))
--------------------------------------------------------------------------
val rdd3 = rdd1.join(rdd2).collect
rdd3: Array[(String, (Int, Int))] = Array((tom,(,)), (jerry,(,)))
---------------------------------------------------------------------------
val rdd3 = rdd1.leftOuterJoin(rdd2).collect
rdd3: Array[(String, (Int, Option[Int]))] = Array((tom,(,Some())), (jerry,(,Some())), (kitty,(,None)))
---------------------------------------------------------------------------
val rdd3 = rdd1.rightOuterJoin(rdd2).collect
rdd3: Array[(String, (Option[Int], Int))] = Array((tom,(Some(),)), (jerry,(Some(),)), (shuke,(None,)))
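The difference between join and leftOuterJoin can be sketched with plain Scala lists. This is an illustration only, not Spark's implementation: the helper names and the sample values are hypothetical, and Spark does not guarantee the output order shown here.

```scala
object JoinSketch {
  // Hypothetical sample data; the numeric values are illustrative only.
  val left: List[(String, Int)]  = List(("tom", 1), ("jerry", 2), ("kitty", 3))
  val right: List[(String, Int)] = List(("jerry", 9), ("tom", 8), ("shuke", 7))

  // Inner join: keep only the keys that appear on both sides.
  def join[K, V, W](l: List[(K, V)], r: List[(K, W)]): List[(K, (V, W))] =
    for ((k, v) <- l; (k2, w) <- r if k == k2) yield (k, (v, w))

  // Left outer join: every left row survives; a missing right side becomes None.
  def leftOuterJoin[K, V, W](l: List[(K, V)], r: List[(K, W)]): List[(K, (V, Option[W]))] =
    l.flatMap { case (k, v) =>
      val matches = r.collect { case (k2, w) if k2 == k => w }
      if (matches.isEmpty) List((k, (v, Option.empty[W])))
      else matches.map(w => (k, (v, Option(w))))
    }

  val inner = join(left, right)
  val leftOuter = leftOuterJoin(left, right)
}
```

rightOuterJoin is the mirror image: every right row survives and the left side becomes an Option.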
val rdd1 = sc.parallelize(List(("tom", ), ("jerry", ), ("kitty", )))
val rdd2 = sc.parallelize(List(("jerry", ), ("tom", ), ("shuke", )))
val rdd3 = rdd1 union rdd2
val rdd4 = rdd3.groupByKey.collect
rdd4: Array[(String, Iterable[Int])] = Array((tom,CompactBuffer(, )), (shuke,CompactBuffer()), (kitty,CompactBuffer()), (jerry,CompactBuffer(, )))
-----------------------------------------------------------------------------------
val rdd5 = rdd4.map(x=>(x._1,x._2.sum))
rdd5: Array[(String, Int)] = Array((tom,), (shuke,), (kitty,), (jerry,))
val rdd1 = sc.parallelize(List(("tom", ), ("jerry", ), ("kitty", )))
val rdd2 = sc.parallelize(List(("jerry", ), ("tom", ), ("shuke", )))
val rdd3 = rdd1 union rdd2
val rdd6 = rdd3.reduceByKey(_+_).collect
rdd6: Array[(String, Int)] = Array((tom,), (shuke,), (kitty,), (jerry,))
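groupByKey followed by a sum and reduceByKey(_+_) produce the same totals. A minimal sketch with a plain Scala List standing in for the unioned RDD (the sample values and the object name are hypothetical):

```scala
object ReduceByKeySketch {
  // Hypothetical stand-in for `rdd1 union rdd2`.
  val pairs: List[(String, Int)] =
    List(("tom", 1), ("jerry", 2), ("kitty", 3), ("jerry", 9), ("tom", 8), ("shuke", 7))

  // groupByKey: collect all values of a key first ...
  val grouped: Map[String, List[Int]] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
  // ... then sum each group, as in rdd4.map(x => (x._1, x._2.sum)).
  val summedAfterGrouping: Map[String, Int] =
    grouped.map { case (k, vs) => k -> vs.sum }

  // reduceByKey: fold values into a running total per key, never building the groups.
  val reduced: Map[String, Int] =
    pairs.foldLeft(Map.empty[String, Int]) { case (acc, (k, v)) =>
      acc.updated(k, acc.get(k).fold(v)(_ + v))
    }
}
```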
val rdd1 = sc.parallelize(List(("tom", ), ("tom", ), ("jerry", ), ("kitty", )))
val rdd2 = sc.parallelize(List(("jerry", ), ("tom", ), ("shuke", )))
val rdd3 = rdd1.cogroup(rdd2).collect
rdd3: Array[(String, (Iterable[Int], Iterable[Int]))] = Array((tom,(CompactBuffer(, ),CompactBuffer())), (jerry,(CompactBuffer(),CompactBuffer())), (shuke,(CompactBuffer(),CompactBuffer())), (kitty,(CompactBuffer(),CompactBuffer())))
----------------------------------------------------------------------------------------
val rdd4 = rdd3.map(x=>(x._1,x._2._1.sum+x._2._2.sum))
rdd4: Array[(String, Int)] = Array((tom,), (jerry,), (shuke,), (kitty,))
val rdd1 = sc.parallelize(List("tom", "jerry"))
val rdd2 = sc.parallelize(List("tom", "kitty", "shuke"))
val rdd3 = rdd1.cartesian(rdd2).collect
rdd3: Array[(String, String)] = Array((tom,tom), (tom,kitty), (tom,shuke), (jerry,tom), (jerry,kitty), (jerry,shuke))
/**
* Return a new RDD by applying a function to each partition of this RDD.
*
* `preservesPartitioning` indicates whether the input function preserves the partitioner, which
* should be `false` unless this is a pair RDD and the input function doesn't modify the keys.
*/
def mapPartitions[U: ClassTag](
f: Iterator[T] => Iterator[U],
preservesPartitioning: Boolean = false): RDD[U] = withScope {
val cleanedF = sc.clean(f)
new MapPartitionsRDD(
this,
(context: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(iter),
preservesPartitioning)
}
/**
* Return a new RDD by applying a function to each partition of this RDD, while tracking the index
* of the original partition.
*
* `preservesPartitioning` indicates whether the input function preserves the partitioner, which
* should be `false` unless this is a pair RDD and the input function doesn't modify the keys.
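* `preservesPartitioning` indicates whether the returned RDD keeps the partitioner. It may be set to `true` only when the RDD is a key-value RDD and the function does not modify the keys. A non-key-value RDD generally has no partitioner, and once the keys of a key-value RDD are modified, the elements no longer satisfy the partitioner's placement. In both cases it must be `false`, meaning the returned RDD has no partitioner.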
*/
def mapPartitionsWithIndex[U: ClassTag]( // takes a function as its argument
f: (Int, Iterator[T]) => Iterator[U], // the function itself takes two arguments
preservesPartitioning: Boolean = false): RDD[U] = withScope {
val cleanedF = sc.clean(f)
new MapPartitionsRDD(
this,
(context: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(index, iter),
preservesPartitioning)
}
scala> val func = (index : Int,iter : Iterator[Int]) => {
| iter.toList.map(x=>"[PartID:" + index + ",val:" + x + "]").iterator
| }
func: (Int, Iterator[Int]) => Iterator[String] = <function2>
scala> val rdd1 = sc.parallelize(List(,,,,,,,,),)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[] at parallelize at <console>:
scala> rdd1.mapPartitionsWithIndex(func).collect
res0: Array[String] = Array([PartID:,val:], [PartID:,val:], [PartID:,val:], [PartID:,val:], [PartID:,val:], [PartID:,val:], [PartID:,val:], [PartID:,val:], [PartID:,val:])
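What mapPartitionsWithIndex hands to the function can be sketched without Spark. The version below models partitions as a list of lists, which is an assumption: the element values and the way they are split across two partitions are chosen for illustration, and `PartitionIndexSketch` is a hypothetical name.

```scala
object PartitionIndexSketch {
  // Assumption: two partitions modelled as nested lists; Spark's real slicing
  // of a parallelized collection is not reproduced here.
  val partitions: List[List[Int]] = List(List(1, 2, 3, 4), List(5, 6, 7, 8, 9))

  // Same shape as the function passed to mapPartitionsWithIndex:
  // it receives the partition index and an iterator over that partition.
  val func = (index: Int, iter: Iterator[Int]) =>
    iter.map(x => "[PartID:" + index + ",val:" + x + "]")

  // Call func once per partition, passing the partition's position as its index.
  val tagged: List[String] =
    partitions.zipWithIndex.flatMap { case (part, index) =>
      func(index, part.iterator).toList
    }
}
```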
/**
* Aggregate the elements of each partition, and then the results for all the partitions, using
* given combine functions and a neutral "zero value". This function can return a different result
* type, U, than the type of this RDD, T. Thus, we need one operation for merging a T into an U
* and one operation for merging two U's, as in scala.TraversableOnce. Both of these functions are
* allowed to modify and return their first argument instead of creating a new U to avoid memory
* allocation.
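* Aggregates the elements of the RDD. A zero value must be supplied, because accumulation needs a starting point. Elements are first aggregated within each partition using seqOp (folding the elements of type T into a per-partition result of type U); the per-partition results are then combined across partitions using combOp, and the final value is returned.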
*
* @param zeroValue the initial value for the accumulated result of each partition for the
* `seqOp` operator, and also the initial value for the combine results from
* different partitions for the `combOp` operator - this will typically be the
* neutral element (e.g. `Nil` for list concatenation or `0` for summation)
* @param seqOp an operator used to accumulate results within a partition
* @param combOp an associative operator used to combine results from different partitions
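* The first parameter list takes the initial value. The second takes two functions, each with two arguments and one result: the first function merges the elements within each partition, and the second merges the per-partition results.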
*/
def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U = withScope {
// Clone the zero value since we will also be serializing it as part of tasks
var jobResult = Utils.clone(zeroValue, sc.env.serializer.newInstance())
val cleanSeqOp = sc.clean(seqOp)
val cleanCombOp = sc.clean(combOp)
val aggregatePartition = (it: Iterator[T]) => it.aggregate(zeroValue)(cleanSeqOp, cleanCombOp)
val mergeResult = (index: Int, taskResult: U) => jobResult = combOp(jobResult, taskResult)
sc.runJob(this, aggregatePartition, mergeResult)
jobResult
}
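scala> val rdd1 = sc.parallelize(List(1,2,3,4,5,6,7,8,9), 2)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[] at parallelize at <console>:
// Sum the elements within each of the two partitions, then add the two partition sums to get the final result
scala> rdd1.aggregate(0)(_+_,_+_)
res0: Int = 45
// Take the maximum within each partition, then add the per-partition maxima to get the final result
scala> rdd1.aggregate(0)(math.max(_,_),_+_)
res1: Int = 13
// Note: the initial value takes part in every step. In the code below, partition 1 holds 1,2,3,4 and the initial value is 5, so its maximum is 5; partition 2 holds 5,6,7,8,9 and the initial value is 5, so its maximum is 9. The initial value also takes part in the final sum: 5+(5+9)=19
scala> rdd1.aggregate(5)(math.max(_,_),_+_)
res0: Int = 19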
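The way the initial value enters both stages can be reproduced with ordinary folds. This is a minimal sketch, not Spark's implementation: partitions are modelled as nested lists using the split described in the comment above (1,2,3,4 and 5,6,7,8,9), and the `aggregate` helper here is a hypothetical stand-in for RDD.aggregate.

```scala
object AggregateSketch {
  // Hypothetical stand-in for RDD.aggregate over explicitly listed partitions.
  def aggregate[T, U](partitions: List[List[T]], zero: U)(
      seqOp: (U, T) => U, combOp: (U, U) => U): U = {
    // Stage 1: fold every partition, each starting from the zero value.
    val perPartition = partitions.map(_.foldLeft(zero)(seqOp))
    // Stage 2: merge the partition results, again starting from the zero value.
    perPartition.foldLeft(zero)(combOp)
  }

  val parts: List[List[Int]] = List(List(1, 2, 3, 4), List(5, 6, 7, 8, 9))

  val sum        = aggregate(parts, 0)(_ + _, _ + _)            // 10 + 35
  val maxThenSum = aggregate(parts, 0)(math.max(_, _), _ + _)   // 4 + 9
  val withZero5  = aggregate(parts, 5)(math.max(_, _), _ + _)   // 5 + (5 + 9)
}
```

With two partitions the zero value is applied three times in total: once per partition and once more when the partition results are merged.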
-----------------------------------------------------------------------------------------------
scala> val rdd2 = sc.parallelize(List("a","b","c","d","e","f"),2)
rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[] at parallelize at <console>:
// Note: the partitions are computed in parallel, so their results can arrive in either order. Two outcomes are therefore possible:
scala> rdd2.aggregate("")(_+_,_+_)
res0: String = defabc
scala> rdd2.aggregate("")(_+_,_+_)
res2: String = abcdef
// This example shows more clearly how the initial value takes part in the computation: the initial value "=" is used three times
scala> rdd2.aggregate("=")(_+_,_+_)
res0: String = ==def=abc
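Both the order dependence and the three uses of "=" can be seen in a small simulation. The sketch below assumes two partitions, ("a","b","c") and ("d","e","f"), and uses a hypothetical `mergeOrder` parameter to stand for the order in which partition results happen to arrive; it does not model Spark's scheduler.

```scala
object StringAggregateSketch {
  // Hypothetical helper: mergeOrder lists partition indices in arrival order.
  def aggregate(partitions: List[List[String]], zero: String, mergeOrder: List[Int]): String = {
    // Each partition is folded starting from the zero value.
    val perPartition = partitions.map(_.foldLeft(zero)(_ + _))
    // Results are merged in arrival order, starting from the zero value again.
    mergeOrder.map(i => perPartition(i)).foldLeft(zero)(_ + _)
  }

  val parts: List[List[String]] = List(List("a", "b", "c"), List("d", "e", "f"))

  val inOrder  = aggregate(parts, "=", List(0, 1)) // partition 0 arrives first
  val reversed = aggregate(parts, "=", List(1, 0)) // partition 1 arrives first
}
```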
--------------------------------------------------------------------------------------
scala> val rdd3 = sc.parallelize(List("","","",""),)
rdd3: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[] at parallelize at <console>:
scala> rdd3.aggregate("")((x,y)=>math.max(x.length,y.length).toString,_+_)
res1: String =
scala> rdd3.aggregate("")((x,y)=>math.max(x.length,y.length).toString,_+_)
res3: String =
-------------------------------------------------------------------------------------------
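scala> val rdd4 = sc.parallelize(List("12","23","345",""),2)
rdd4: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[] at parallelize at <console>:
// Note: with the initial value included, the first partition holds "", "12", "23"; comparing them pairwise gives a minimum length of 1, because the intermediate result "0" itself has length 1. The second partition holds "", "345", ""; comparing them pairwise gives a minimum length of 0
scala> rdd4.aggregate("")((x,y)=>math.min(x.length,y.length).toString,_+_)
res4: String = 10
scala> rdd4.aggregate("")((x,y)=>math.min(x.length,y.length).toString,_+_)
res9: String = 01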
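The pairwise comparison described in the comment above is easier to follow when traced step by step. A minimal sketch, assuming the partition contents given in that comment ("12","23" and "345","") and using a plain foldLeft in place of Spark; the names are hypothetical.

```scala
object MinLengthSketch {
  // The seqOp from the example: minimum of the two lengths, turned back into a String.
  val seqOp = (x: String, y: String) => math.min(x.length, y.length).toString

  // Fold one partition starting from the empty-string zero value.
  def foldPartition(part: List[String]): String = part.foldLeft("")(seqOp)

  // "" vs "12" gives "0"; "0" (length 1) vs "23" gives "1".
  val first = foldPartition(List("12", "23"))
  // "" vs "345" gives "0"; "0" vs "" gives "0".
  val second = foldPartition(List("345", ""))

  // combOp is string concatenation; this is the outcome when partition 0 arrives first.
  val combined = "" + first + second
}
```

The surprising "1" in the first partition comes from the intermediate result "0", whose length is 1, not from the lengths of the original strings.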
------------------------------------------------------------------------------------
// Note the difference from the example above: the elements of this RDD are in a different order, which leads to a different result
scala> val rdd5 = sc.parallelize(List("","","",""),)
rdd5: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[] at parallelize at <console>:
scala> rdd5.aggregate("")((x,y)=>math.min(x.length,y.length).toString,(x,y)=>x+y)
res1: String =
// Define the RDD
scala> val pairRDD = sc.parallelize(List( ("cat",), ("cat", ), ("mouse", ),("cat", ), ("dog", ), ("mouse", )), )
pairRDD: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[] at parallelize at <console>:
// A custom function to pass to mapPartitionsWithIndex
scala> val func=(index:Int,iter:Iterator[(String, Int)])=>{
| iter.toList.map(x => "[partID:" + index + ", val: " + x + "]").iterator
| }
func: (Int, Iterator[(String, Int)]) => Iterator[String] = <function2>
// Inspect how the elements are distributed across partitions
scala> pairRDD.mapPartitionsWithIndex(func).collect
res2: Array[String] = Array([partID:, val: (cat,)], [partID:, val: (cat,)], [partID:, val: (mouse,)], [partID:, val: (cat,)], [partID:, val: (dog,)], [partID:, val: (mouse,)])
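// Note the difference between an initial value of 0 and other values
scala> pairRDD.aggregateByKey()(_+_,_+_).collect
res4: Array[(String, Int)] = Array((dog,), (cat,), (mouse,))
scala> pairRDD.aggregateByKey()(_+_,_+_).collect
res5: Array[(String, Int)] = Array((dog,), (cat,), (mouse,))
// The difference between the next three: the first is the easiest to understand. Because the initial value is 0, each partition outputs the largest value for each animal, and those per-partition maxima are then summed
scala> pairRDD.aggregateByKey(0)(math.max(_,_),_+_).collect
res6: Array[(String, Int)] = Array((dog,), (cat,), (mouse,))
// For the next two: the initial value is non-zero, so it has to be taken into account. The first partition holds ("cat",2), ("cat",5), ("mouse",4) and the initial value is 10. For each animal, every comparison of values includes the initial value, so the first partition outputs ("cat",10), ("mouse",10). The second partition is handled the same way and outputs (dog,12), (cat,12), (mouse,10). The summed result is therefore (dog,12), (cat,22), (mouse,20). Note that the initial value does not take part when the per-partition results are combined
scala> pairRDD.aggregateByKey(10)(math.max(_,_),_+_).collect
res7: Array[(String, Int)] = Array((dog,12), (cat,22), (mouse,20))
// This one is similar to the one above
scala> pairRDD.aggregateByKey()(math.max(_,_),_+_).collect
res8: Array[(String, Int)] = Array((dog,), (cat,), (mouse,))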
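The rule that the initial value is applied per key within each partition, but not when partition results are merged, can be sketched with plain collections. This is an illustration under assumptions, not Spark's implementation: the first partition follows the explanation above, while the second partition's contents (in particular the value for mouse) are assumed so that its per-partition output matches the (dog,12), (cat,12), (mouse,10) described there.

```scala
object AggregateByKeySketch {
  // Hypothetical stand-in for aggregateByKey over explicitly listed partitions.
  def aggregateByKey[K, V, U](partitions: List[List[(K, V)]], zero: U)(
      seqOp: (U, V) => U, combOp: (U, U) => U): Map[K, U] = {
    // Within a partition, every key starts from the zero value.
    val perPartition: List[Map[K, U]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, U]) { case (acc, (k, v)) =>
        acc.updated(k, seqOp(acc.getOrElse(k, zero), v))
      }
    }
    // Across partitions, results of the same key are merged; zero is not used again.
    perPartition.flatten.groupBy(_._1).map { case (k, kus) =>
      k -> kus.map(_._2).reduce(combOp)
    }
  }

  val parts: List[List[(String, Int)]] = List(
    List(("cat", 2), ("cat", 5), ("mouse", 4)),
    List(("cat", 12), ("dog", 12), ("mouse", 2)) // assumed contents
  )

  val sums      = aggregateByKey(parts, 0)(_ + _, _ + _)
  val maxZero10 = aggregateByKey(parts, 10)(math.max(_, _), _ + _)
}
```

This differs from aggregate on a plain RDD, where the zero value is used once more in the final merge.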
/**
* Return a copy of the RDD partitioned using the specified partitioner.
*/
def partitionBy(partitioner: Partitioner): RDD[(K, V)] = self.withScope {
if (keyClass.isArray && partitioner.isInstanceOf[HashPartitioner]) {
throw new SparkException("HashPartitioner cannot partition array keys.")
}
if (self.partitioner == Some(partitioner)) {
self
} else {
new ShuffledRDD[K, V, V](self, partitioner)
}
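}
repartition: returns a new RDD.
It repartitions the RDD into the specified number of partitions, and this involves a shuffle.
When the specified number of partitions is smaller than the current number, consider using coalesce instead, which can avoid the shuffle.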
scala> val rdd1 = sc.parallelize(Array(,,,,,,,),)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[] at parallelize at <console>:
scala> val rdd2 = rdd1.repartition()
rdd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[] at repartition at <console>:
scala> rdd2.partitions.length
res0: Int =
scala> val rdd3 = rdd2.coalesce(,true)
rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[] at coalesce at <console>:
scala> rdd3.partitions.length
res1: Int =
scala> val rdd1 = sc.parallelize(List(("a", ), ("b", ),("c", ),("d", ),("e", )))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[] at parallelize at <console>:
scala> rdd1.collectAsMap
res3: scala.collection.Map[String,Int] = Map(e -> , b -> , d -> , a -> , c -> )
(1) Source code
/**
* Generic function to combine the elements for each key using a custom set of aggregation
* functions. This method is here for backward compatibility. It does not provide combiner
* classtag information to the shuffle.
*
* @see [[combineByKeyWithClassTag]]
*/
def combineByKey[C](
createCombiner: V => C,
mergeValue: (C, V) => C,
mergeCombiners: (C, C) => C,
partitioner: Partitioner,
mapSideCombine: Boolean = true,
serializer: Serializer = null): RDD[(K, C)] = self.withScope {
combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners,
partitioner, mapSideCombine, serializer)(null)
}
/**
* Simplified version of combineByKeyWithClassTag that hash-partitions the output RDD.
* This method is here for backward compatibility. It does not provide combiner
* classtag information to the shuffle.
*
* @see [[combineByKeyWithClassTag]]
*/
def combineByKey[C](
createCombiner: V => C,
mergeValue: (C, V) => C,
mergeCombiners: (C, C) => C,
numPartitions: Int): RDD[(K, C)] = self.withScope {
combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners, numPartitions)(null)
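}
(2) Parameters:
First parameter, createCombiner: V => C: creates the combiner. For each key, it takes the first value encountered and returns it converted to the type you want to accumulate into.
Second parameter, mergeValue: (C, V) => C: a function that performs the local (per-partition) computation.
Third parameter, mergeCombiners: (C, C) => C: a function that combines the results of the local computations.
(3) Code example
// First declare two RDDs, then use zip to combine them into a single RDD, rdd6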
scala> val rdd4 = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), )
rdd4: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[] at parallelize at <console>:
scala> val rdd5 = sc.parallelize(List(,,,,,,,,), )
rdd5: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[] at parallelize at <console>:
scala> val rdd6 = rdd5.zip(rdd4)
rdd6: org.apache.spark.rdd.RDD[(Int, String)] = ZippedPartitionsRDD2[] at zip at <console>:
scala> rdd6.collect
res6: Array[(Int, String)] = Array((,dog), (,cat), (,gnu), (,salmon), (,rabbit), (,turkey), (,wolf), (,bear), (,bee))
// We want to group by key, collecting all values that share a key into a List
// The first argument, List(_), takes the first value of a key and puts it into a new list
// The second argument, (x:List[String],y:String)=>x :+ y, is the local computation: it appends a value to the List
// The third argument, (m:List[String],n:List[String])=>m++n, combines the results of the local computations
scala> val rdd7 = rdd6.combineByKey(List(_),(x:List[String],y:String)=>x :+ y,(m:List[String],n:List[String])=>m++n)
rdd7: org.apache.spark.rdd.RDD[(Int, List[String])] = ShuffledRDD[] at combineByKey at <console>:
scala> rdd7.collect
res7: Array[(Int, List[String])] = Array((,List(dog, cat, turkey)), (,List(wolf, bear, bee, salmon, rabbit, gnu)))
// The first argument can also be written in other ways, such as the following two
scala> val rdd7 = rdd6.combineByKey(_::List(),(x:List[String],y:String)=>x :+ y,(m:List[String],n:List[String])=>m++n).collect
rdd7: Array[(Int, List[String])] = Array((,List(turkey, dog, cat)), (,List(wolf, bear, bee, gnu, salmon, rabbit)))
scala> val rdd7 = rdd6.combineByKey(_::Nil,(x:List[String],y:String)=>x :+ y,(m:List[String],n:List[String])=>m++n).collect
rdd7: Array[(Int, List[String])] = Array((,List(turkey, dog, cat)), (,List(wolf, bear, bee, gnu, salmon, rabbit)))
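How the three functions of combineByKey cooperate can be sketched with plain collections. This is a simplified model rather than Spark's implementation: partitions are nested lists, the keys 1 and 2 and the split into three partitions are assumptions chosen for illustration, and the helper name is hypothetical.

```scala
object CombineByKeySketch {
  // Hypothetical stand-in for combineByKey over explicitly listed partitions.
  def combineByKey[K, V, C](partitions: List[List[(K, V)]])(
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): Map[K, C] = {
    val perPartition: List[Map[K, C]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
        acc.get(k) match {
          // First value of this key in the partition: build a combiner from it.
          case None    => acc.updated(k, createCombiner(v))
          // Later values: fold them into the existing combiner.
          case Some(c) => acc.updated(k, mergeValue(c, v))
        }
      }
    }
    // Combiners of the same key from different partitions are merged.
    perPartition.flatten.groupBy(_._1).map { case (k, kcs) =>
      k -> kcs.map(_._2).reduce(mergeCombiners)
    }
  }

  // Assumed keys and partitioning, for illustration only.
  val parts: List[List[(Int, String)]] = List(
    List((1, "dog"), (1, "cat"), (2, "gnu")),
    List((2, "salmon"), (2, "rabbit"), (1, "turkey")),
    List((2, "wolf"), (2, "bear"), (2, "bee"))
  )

  val grouped: Map[Int, List[String]] = combineByKey(parts)(
    (v: String) => List(v),
    (acc: List[String], v: String) => acc :+ v,
    (a: List[String], b: List[String]) => a ++ b
  )
}
```

In this model partition results are merged in partition order, so the lists come out in a fixed order; in Spark the order within each list can vary between runs, as the transcripts above show.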
scala> val rdd1 = sc.parallelize(List(("a", ), ("b", ), ("b", ), ("c", ), ("c", )))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[] at parallelize at <console>:
scala> rdd1.countByKey
res8: scala.collection.Map[String,Long] = Map(a -> 1, b -> 2, c -> 2)
scala> rdd1.countByValue
res9: scala.collection.Map[(String, Int),Long] = Map((c,) -> , (a,) -> , (b,) -> , (c,) -> )
scala> val rdd1 = sc.parallelize(List(("e", ), ("c", ), ("d", ), ("c", ), ("a", ),("b",)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[] at parallelize at <console>:
// Note: the range passed in is closed at both ends, so both bounds are included
scala> val rdd2 = rdd1.filterByRange("b","d")
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[] at filterByRange at <console>:
scala> rdd2.collect
res10: Array[(String, Int)] = Array((c,), (d,), (c,), (b,))
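The inclusive bounds of filterByRange can be illustrated with an ordinary filter. A minimal sketch on a plain List; the values attached to the keys are hypothetical, and `filterByRange` here is a local helper, not the Spark method.

```scala
object FilterByRangeSketch {
  // Keys follow the example above; the Int values are assumed.
  val pairs: List[(String, Int)] =
    List(("e", 5), ("c", 3), ("d", 4), ("c", 2), ("a", 1), ("b", 6))

  // Keep every pair whose key lies in [lower, upper], both ends included.
  def filterByRange[V](data: List[(String, V)], lower: String, upper: String): List[(String, V)] =
    data.filter { case (k, _) => k.compareTo(lower) >= 0 && k.compareTo(upper) <= 0 }

  val kept = filterByRange(pairs, "b", "d")
}
```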
scala> val rdd3 = sc.parallelize(List(("a", "1 2"), ("b", "3 4")))
rdd3: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[] at parallelize at <console>:
scala> val rdd4 = rdd3.flatMapValues(_.split(" "))
rdd4: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[] at flatMapValues at <console>:
scala> rdd4.collect
res11: Array[(String, String)] = Array((a,1), (a,2), (b,3), (b,4))
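mapValues: leaves the key unchanged and applies the function only to the value of each key-value pair, much like map. Unlike flatMapValues above, it does not change the key-value pairs that were passed in; it only transforms each value with the given function.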
scala> val rdd3 = sc.parallelize(List(("a",(,)),("b",(,))))
rdd3: org.apache.spark.rdd.RDD[(String, (Int, Int))] = ParallelCollectionRDD[] at parallelize at <console>:
scala> rdd3.mapValues(x=>x._1 + x._2).collect
res34: Array[(String, Int)] = Array((a,), (b,))
------------------------------------------------------------------------
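With flatMapValues the result is as follows: the value is broken apart completely (here into the characters of its string form), and each piece is paired with the key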
scala> rdd3.flatMapValues(x=>x + "").collect
res36: Array[(String, Char)] = Array((a,(), (a,), (a,,), (a,), (a,)), (b,(), (b,), (b,,), (b,), (b,)))
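The contrast between mapValues and flatMapValues can be reproduced on a plain List. A minimal sketch; the tuple value (1, 2) in the last example is an assumption, and the object name is hypothetical.

```scala
object MapValuesSketch {
  val pairs: List[(String, String)] = List(("a", "1 2"), ("b", "3 4"))

  // mapValues-like: one output pair per input pair, key untouched.
  val mapped: List[(String, Int)] =
    pairs.map { case (k, v) => (k, v.split(" ").length) }

  // flatMapValues-like: each value may expand into several pairs with the same key.
  val flatMapped: List[(String, String)] =
    pairs.flatMap { case (k, v) => v.split(" ").map(w => (k, w)) }

  // A tuple value turned into a String is split into characters,
  // which is why parentheses and commas appear as values.
  val tupleChars: List[(String, Char)] =
    (1, 2).toString.toList.map(c => ("a", c))
}
```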
scala> val rdd1 = sc.parallelize(List("dog", "wolf", "cat", "bear"), )
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[] at parallelize at <console>:
scala> val rdd2 = rdd1.map(x=>(x.length,x))
rdd2: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[] at map at <console>:
scala> rdd2.collect
res12: Array[(Int, String)] = Array((3,dog), (4,wolf), (3,cat), (4,bear))
-----------------------------------------------------------------------------
scala> val rdd3 = rdd2.foldByKey("")(_+_)
rdd3: org.apache.spark.rdd.RDD[(Int, String)] = ShuffledRDD[] at foldByKey at <console>:
scala> rdd3.collect
res13: Array[(Int, String)] = Array((4,bearwolf), (3,dogcat))
scala> val rdd3 = rdd2.foldByKey(" ")(_+_).collect
rdd3: Array[(Int, String)] = Array((4," bear wolf"), (3," dog cat"))
-----------------------------------------------------------------------------
// Compute a word count
val rdd = sc.textFile("hdfs://node-1.itcast.cn:9000/wc").flatMap(_.split(" ")).map((_, 1))
rdd.foldByKey(0)(_+_)
scala> val rdd1 = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), )
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[] at parallelize at <console>:
scala> val rdd2 = rdd1.keyBy(_.length)
rdd2: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[] at keyBy at <console>:
scala> rdd2.collect
res14: Array[(Int, String)] = Array((3,dog), (6,salmon), (6,salmon), (3,rat), (8,elephant))
scala> val rdd1 = sc.parallelize(List(("e", ), ("c", ), ("d", ), ("c", ), ("a", )))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[] at parallelize at <console>:
scala> rdd1.keys.collect
res16: Array[String] = Array(e, c, d, c, a)
scala> rdd1.values.collect
res17: Array[Int] = Array(, , , , )