Spark RDD Operators
I. Introduction to Spark RDD Operators
RDD (Resilient Distributed Dataset) is Spark's most basic abstraction of data: a distributed dataset that behaves much like a Scala collection. An RDD is an immutable, partitioned collection whose elements can be computed on in parallel. RDD characteristics:
Follows a dataflow-style computation model
Automatic fault tolerance
Location-aware scheduling
Scalability
RDDs let users cache a working set in memory across multiple queries and reuse it, which greatly speeds up repeated queries. RDD operators fall into two types:
1) Transformation
2) Action
II. Creating RDDs
RDD operators come in two types:
1) Transformation (lazy: evaluation is deferred)
2) Action (triggers a job)
Example:
scala> sc.textFile("/root/words.txt")
res8: org.apache.spark.rdd.RDD[String] = /root/words.txt MapPartitionsRDD[28] at textFile at <console>:25
scala> sc.textFile("/root/words.txt").flatMap(_.split(" "))
res9: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[31] at flatMap at <console>:25
scala> res8
res10: org.apache.spark.rdd.RDD[String] = /root/words.txt MapPartitionsRDD[28] at textFile at <console>:25
scala> res8.flatMap(_.split(" "))
res11: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[32] at flatMap at <console>:27
scala> res8.flatMap(_.split(" ")).map((_,1))
res12: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[34] at map at <console>:27
scala> res8.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_)
res13: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[37] at reduceByKey at <console>:27
scala> res13.collect
res14: Array[(String, Int)] = Array((is,1), (love,2), (capital,1), (Beijing,2), (China,2), (I,2), (of,1), (the,1))
scala> val rdd1 = sc.textFile("/root/words.txt")
rdd1: org.apache.spark.rdd.RDD[String] = /root/words.txt MapPartitionsRDD[39] at textFile at <console>:24
scala> rdd1.count
res15: Long = 3
scala> val list = List(1,3,5,7)
list: List[Int] = List(1, 3, 5, 7)
scala> list.map(_ * 100)
res16: List[Int] = List(100, 300, 500, 700)
scala> val rdd1 = sc.parallelize(list)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[40] at parallelize at <console>:26
scala> rdd1.map(_ * 100)
res17: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[41] at map at <console>:29
scala> res17.collect
res18: Array[Int] = Array(100, 300, 500, 700)
scala> val rdd2 = sc.makeRDD(list)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[42] at makeRDD at <console>:26
scala> val rdd1 = sc.parallelize(List(1,2,3,4,5),3)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[43] at parallelize at <console>:24
scala> val rdd2 = rdd1.map(_ * 1000)
rdd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[44] at map at <console>:26
scala> rdd2.collect
res19: Array[Int] = Array(1000, 2000, 3000, 4000, 5000)
III. Common Transformations
Characteristics of transformations:
1) They produce a new RDD.
2) They are lazy: evaluation is deferred until an action is called.
3) They do not store the actual data; they only record the transformation lineage.
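A minimal sketch of this laziness, assuming a running spark-shell with sc in scope (nums and scaled are illustrative names only): the map call returns immediately and merely records lineage; nothing is computed until collect runs.
val nums = sc.parallelize(List(1, 2, 3, 4))
val scaled = nums.map(_ * 10)          // lazy: no job is launched here
println(scaled.toDebugString)          // prints the recorded lineage of transformations
println(scaled.collect.mkString(","))  // the action triggers the actual computation: 10,20,30,40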
1. map(func)
2. flatMap(func)
3. sortBy (not exercised in the transcript below; see the sketch after it)
4. reduceByKey
scala> sc.textFile("/root/words.txt")
res8: org.apache.spark.rdd.RDD[String] = /root/words.txt MapPartitionsRDD[28] at textFile at <console>:25
scala> res8.flatMap(_.split(" "))
res11: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[32] at flatMap at <console>:27
scala> res8.flatMap(_.split(" ")).map((_,1))
res12: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[34] at map at <console>:27
scala> res8.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_)
res13: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[37] at reduceByKey at <console>:27
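sortBy is listed above but not shown in this transcript. A minimal sketch, assuming the same spark-shell session; sortBy takes a key function and an optional ascending flag:
val unsorted = sc.parallelize(List(3, 1, 4, 2))
unsorted.sortBy(x => x).collect                      // ascending by default: Array(1, 2, 3, 4)
unsorted.sortBy(x => x, ascending = false).collect   // descending: Array(4, 3, 2, 1)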
5. filter — keep only the elements matching a predicate
scala> val rdd1 = sc.parallelize(List(1,2,3,4,5))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[48] at parallelize at <console>:24
scala> rdd1.filter(_ % 2 == 0)
res24: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[49] at filter at <console>:27
scala> res24.collect
res25: Array[Int] = Array(2, 4)
6. union — union of two RDDs (duplicates are kept)
scala> val rdd2 = sc.parallelize(List(1,2,3,4,5,6,7))
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[50] at parallelize at <console>:24
scala> rdd1.union(rdd2)
res26: org.apache.spark.rdd.RDD[Int] = UnionRDD[51] at union at <console>:29
scala> res26.collect
res27: Array[Int] = Array(1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 7)
scala> rdd1 union rdd2
res28: org.apache.spark.rdd.RDD[Int] = UnionRDD[52] at union at <console>:29
scala> res28.collect
res29: Array[Int] = Array(1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 7)
7. groupByKey — group the values for each key
scala> val rdd3 = sc.parallelize(List(("Tom",18),("John",16),("Tom",20),("Mary",17),("John",23)))
rdd3: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[53] at parallelize at <console>:24
scala> rdd3.groupByKey
res30: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[54] at groupByKey at <console>:27
scala> res30.collect
res31: Array[(String, Iterable[Int])] = Array((John,CompactBuffer(16, 23)), (Tom,CompactBuffer(18, 20)), (Mary,CompactBuffer(17)))
8. intersection — intersection of two RDDs
scala> val rdd1 = sc.parallelize(List(1,2,3,4,5))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[55] at parallelize at <console>:24
scala> val rdd2 = sc.parallelize(List(1,2,3,4,5,6,7))
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[56] at parallelize at <console>:24
scala> rdd1.intersection(rdd2)
res33: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[62] at intersection at <console>:29
scala> res33.collect
res34: Array[Int] = Array(3, 4, 1, 5, 2)
9. join — inner join on the key
scala> val rdd1 = sc.parallelize(List(("Tom",18),("John",16),("Mary",21)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[65] at parallelize at <console>:24
scala> val rdd2 = sc.parallelize(List(("Tom",28),("John",26),("Cat",17)))
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[66] at parallelize at <console>:24
scala> rdd1.join(rdd2)
res39: org.apache.spark.rdd.RDD[(String, (Int, Int))] = MapPartitionsRDD[69] at join at <console>:29
scala> res39.collect
res40: Array[(String, (Int, Int))] = Array((John,(16,26)), (Tom,(18,28)))
10. leftOuterJoin — left outer join
Every key from the left RDD is kept; when the right RDD has a matching value it is wrapped in Some, otherwise the right side is None.
scala> val rdd1 = sc.parallelize(List(("Tom",18),("John",16),("Mary",21)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[65] at parallelize at <console>:24
scala> val rdd2 = sc.parallelize(List(("Tom",28),("John",26),("Cat",17)))
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[66] at parallelize at <console>:24
scala> rdd1.leftOuterJoin(rdd2)
res41: org.apache.spark.rdd.RDD[(String, (Int, Option[Int]))] = MapPartitionsRDD[72] at leftOuterJoin at <console>:29
scala> res41.collect
res42: Array[(String, (Int, Option[Int]))] = Array((John,(16,Some(26))), (Tom,(18,Some(28))), (Mary,(21,None)))
11. rightOuterJoin — right outer join
scala> val rdd1 = sc.parallelize(List(("Tom",18),("John",16),("Mary",21)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[65] at parallelize at <console>:24
scala> val rdd2 = sc.parallelize(List(("Tom",28),("John",26),("Cat",17)))
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[66] at parallelize at <console>:24
scala> rdd1.rightOuterJoin(rdd2).collect
res43: Array[(String, (Option[Int], Int))] = Array((John,(Some(16),26)), (Tom,(Some(18),28)), (Cat,(None,17)))
12. cartesian — Cartesian product
scala> val rdd1 = sc.parallelize(List("Tom","Mary"))
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[76] at parallelize at <console>:24
scala> val rdd2 = sc.parallelize(List("John","Joe"))
rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[77] at parallelize at <console>:24
scala> rdd1.cartesian(rdd2)
res45: org.apache.spark.rdd.RDD[(String, String)] = CartesianRDD[78] at cartesian at <console>:29
scala> res45.collect
res46: Array[(String, String)] = Array((Tom,John), (Tom,Joe), (Mary,John), (Mary,Joe))
IV. Common Actions
1. collect — bring all elements back to the driver as an array
scala> val rdd1 = sc.parallelize(List(1,2,3,4))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> rdd1.collect
res0: Array[Int] = Array(1, 2, 3, 4)
2. saveAsTextFile(path) — write the RDD out as text files (one file per partition)
Aside on how input data is split — suppose three files of 5 B, 5 B, and 600 B:
Target split size: 5 + 5 + 600 = 610 bytes, and 610 / 3 ≈ 203 bytes, so roughly:
the first 5 B file → one split
the second 5 B file → one split
the 600 B file → splits of about 203 B each (203 B, 203 B, and the remainder)
scala> val rdd1 = sc.parallelize(List(1,2,3,4))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> rdd1.saveAsTextFile("/root/RDD1")
// check the number of partitions
scala> rdd1.partitions.length
res3: Int = 4
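Since saveAsTextFile writes one part file per partition, the number of output files can be controlled by repartitioning first. A minimal sketch under the same assumptions (the output paths are only examples):
rdd1.coalesce(1).saveAsTextFile("/root/RDD1_single")   // a single part-00000 file
rdd1.repartition(8).saveAsTextFile("/root/RDD1_eight") // eight part files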
3. count — count the number of elements
scala> val rdd1 = sc.parallelize(List(1,2,3,4))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> rdd1.count
res2: Long = 4
4. reduce — aggregate all elements with a binary function
scala> val rdd2 = sc.parallelize(List(1,2,3,4),2)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:24
scala> rdd2.partitions.length
res4: Int = 2
scala> rdd2.reduce(_+_)
res5: Int = 10
5. countByKey() — count the number of occurrences of each key
scala> sc.parallelize(List(("Tom",18),("Tom",28),("John",14),("Mary",16)))
res9: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[4] at parallelize at <console>:25
scala> res9.count
res10: Long = 4
scala> res9.countByKey()
res11: scala.collection.Map[String,Long] = Map(Tom -> 2, Mary -> 1, John -> 1)
scala> res9.reduceByKey(_+_)
res12: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[7] at reduceByKey at <console>:27
scala> res9.reduceByKey(_+_).collect
res13: Array[(String, Int)] = Array((Tom,46), (Mary,16), (John,14))
6. take(n) — return the first n elements
scala> val rdd1 = sc.parallelize(List(1,2,3,4))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> rdd1.take(2)
res15: Array[Int] = Array(1, 2)
scala> val rdd3 = sc.parallelize(List(3,2,8,1,7))
rdd3: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9] at parallelize at <console>:24
scala> rdd3.take(2)
res17: Array[Int] = Array(3, 2)
7. first — return the first element of the RDD
scala> val rdd3 = sc.parallelize(List(3,2,8,1,7))
rdd3: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9] at parallelize at <console>:24
scala> rdd3.first
res18: Int = 3
8. takeOrdered(n) — return the n smallest elements (ascending order by default)
scala> val rdd3 = sc.parallelize(List(3,2,8,1,7))
rdd3: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9] at parallelize at <console>:24
scala> rdd3.takeOrdered(2)
res19: Array[Int] = Array(1, 2)
9. top(n) — return the n largest elements (descending order)
scala> val rdd3 = sc.parallelize(List(3,2,8,1,7))
rdd3: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9] at parallelize at <console>:24
scala> rdd3.top(2)
res20: Array[Int] = Array(8, 7)
V. Advanced Spark Operators
1. mapPartitionsWithIndex(func)
Processes each partition together with its partition index, which makes it easy to see which elements end up in which partition. It takes a function as a parameter, for example:
val func = (index:Int,iter:Iterator[(Int)]) => {iter.toList.map(x => "partID:" + index + "," + "datas:" + x + "]").iterator}
scala> val rdd3 = sc.parallelize(List(1,2,3,4,5,6,7),2)
rdd3: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[12] at parallelize at <console>:24
scala> val func = (index:Int,iter:Iterator[(Int)]) => {iter.toList.map(x => "partID:" + index + "," + "datas:" + x + "]").iterator}
func: (Int, Iterator[Int]) => Iterator[String] = <function2>
scala> rdd3.mapPartitionsWithIndex(func)
res21: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[13] at mapPartitionsWithIndex at <console>:29
scala> rdd3.mapPartitionsWithIndex(func).collect
res22: Array[String] = Array(partID:0,datas:1], partID:0,datas:2], partID:0,datas:3], partID:1,datas:4], partID:1,datas:5], partID:1,datas:6], partID:1,datas:7])
2. aggregate
Aggregates the whole RDD: first within each partition (the first function), then across partitions (the second function). Note that the zero value takes part both in every partition and in the final combine, and the order in which partition results are combined is not deterministic, which explains the alternating results in the transcript below. Related actions: max (largest element) and min (smallest element).
scala> val rdd3 = sc.parallelize(List(1,2,3,4,5,6,7),2)
rdd3: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[12] at parallelize at <console>:24
scala> rdd3.aggregate(0)(_+_,_+_)
res23: Int = 28
scala> rdd3.max
res24: Int = 7
scala> rdd3.min
res25: Int = 1
scala> val rdd3 = sc.parallelize(List(1,2,3,4,5,6,7),2)
rdd3: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[16] at parallelize at <console>:24
scala> rdd3.aggregate(0)(math.max(_,_),_+_)
res29: Int = 10
scala> rdd3.aggregate(10)(math.max(_,_),_+_)
res31: Int = 30
// 1+2+3+20 + 4+5+6+7+20 + 20 = 88
scala> rdd3.aggregate(20)(_+_,_+_)
res32: Int = 88
scala> val rdd4 = sc.parallelize(List("a","b","c","d","e"),2)
rdd4: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[17] at parallelize at <console>:24
scala> rdd4.aggregate("|")(_+_,_+_)
res33: String = ||ab|cde
scala> rdd4.aggregate("|")(_+_,_+_)
res34: String = ||cde|ab
scala> val rdd5 = sc.parallelize(List("12","23","234","3456"),2)
rdd5: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[18] at parallelize at <console>:24
scala> rdd5.aggregate("")((x,y) => math.max(x.length,y.length).toString,(x,y) => x+y)
res35: String = 24
scala> rdd5.aggregate("")((x,y) => math.max(x.length,y.length).toString,(x,y) => x+y)
res36: String = 42
scala> rdd5.aggregate("")((x,y) => math.max(x.length,y.length).toString,(x,y) => x+y)
res37: String = 24
scala> rdd5.aggregate("")((x,y) => math.max(x.length,y.length).toString,(x,y) => x+y)
res38: String = 42
scala> val rdd6 = sc.parallelize(List("12","23","345",""),2)
rdd6: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[19] at parallelize at <console>:24
scala> rdd6.aggregate("")((x,y) => math.min(x.length,y.length).toString,(x,y) => x+y)
res41: String = 01
scala> rdd6.aggregate("")((x,y) => math.min(x.length,y.length).toString,(x,y) => x+y)
res42: String = 10
scala> rdd6.aggregate("")((x,y) => math.min(x.length,y.length).toString,(x,y) => x+y)
res43: String = 01
scala> rdd6.aggregate("")((x,y) => math.min(x.length,y.length).toString,(x,y) => x+y)
res48: String = 10
scala> val rdd7 = sc.parallelize(List("12","23","","456"),2)
rdd7: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> rdd7.aggregate("")((x,y) => math.min(x.length,y.length).toString,(x,y) => x+y)
res1: String = 11
scala> ("").length
res2: Int = 0
scala> 0.length
<console>:24: error: value length is not a member of Int
       0.length
         ^
scala> 0.toString.length
res5: Int = 1
scala> rdd7.aggregate("0")((x,y) => math.min(x.length,y.length).toString,(x,y) => x+y)
res6: String = 011
scala> rdd7.aggregate("0")((x,y) => math.min(x.length,y.length).toString,(x,y) => x+y)
res7: String = 011
3. aggregateByKey
Aggregates the values of each key: first within each partition, then across partitions.
scala> val rdd8 = sc.parallelize(List(("cat",3),("cat",8),("mouse",6),("dog",8)))
rdd8: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[1] at parallelize at <console>:24
scala> def func(index:Int,iter:Iterator[(String,Int)]):Iterator[String] = {iter.toList.map(x => "partID:" + index + "," + "values:" + x + "]").iterator}
func: (index: Int, iter: Iterator[(String, Int)])Iterator[String]
scala> rdd8.mapPartitionsWithIndex(func).collect
res34: Array[String] = Array(partID:0,values:(cat,3)], partID:1,values:(cat,8)], partID:2,values:(mouse,6)], partID:3,values:(dog,8)])
scala> rdd8.aggregateByKey(0)(_+_,_+_).collect
res35: Array[(String, Int)] = Array((dog,8), (mouse,6), (cat,11))
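The per-partition function and the combine function do not have to be the same. A minimal sketch, assuming the same spark-shell session (pets is only an illustrative name): take the per-partition maximum for each key first, then sum those maxima across partitions.
val pets = sc.parallelize(List(("cat",2),("cat",5),("mouse",4),("cat",12),("dog",12),("mouse",2)), 2)
pets.aggregateByKey(0)(math.max(_,_), _+_).collect
// with two partitions of three pairs each, the per-partition maxima are (cat,5),(mouse,4) and (cat,12),(dog,12),(mouse,2),
// so the result is expected to contain (cat,17), (mouse,6), (dog,12) (ordering may vary)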
4. combineByKey
Both aggregateByKey and reduceByKey call combineByKey under the hood. It is the most general of these methods: values are first combined locally within each partition and then merged globally across partitions.
scala> val rdd1 = sc.textFile("hdfs://192.168.146.111:9000/words.txt").flatMap(_.split("\t")).map((_,1)).reduceByKey(_+_).collect
rdd1: Array[(String, Int)] = Array((haha,1), (heihei,1), (hello,3), (Beijing,1), (world,1), (China,1))
scala> val rdd2 = sc.textFile("hdfs://192.168.146.111:9000/words.txt").flatMap(_.split("\t")).map((_,1)).combineByKey(x => x,(m:Int,n:Int) => (m+n),(a:Int,b:Int) => (a+b)).collect
rdd2: Array[(String, Int)] = Array((haha,1), (heihei,1), (hello,3), (Beijing,1), (world,1), (China,1))
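The three arguments (createCombiner, mergeValue, mergeCombiners) are easier to see in a computation that reduceByKey cannot express directly, such as a per-key average. A minimal sketch under the same assumptions; scores and avg are illustrative names only.
val scores = sc.parallelize(List(("Tom", 88), ("Tom", 95), ("John", 91)), 2)
val sumCount = scores.combineByKey(
  (v: Int) => (v, 1),                                           // createCombiner: first value of a key in a partition -> (sum, count)
  (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),        // mergeValue: fold another value into the partition-local accumulator
  (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)) // mergeCombiners: merge accumulators across partitions
val avg = sumCount.mapValues { case (sum, count) => sum.toDouble / count }
// avg.collect is expected to contain (Tom,91.5) and (John,91.0)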
5. coalesce
coalesce(4, true) — the first argument is the target number of partitions (4 here), the second says whether to shuffle. repartition is implemented on top of coalesce with shuffle already enabled by default.
scala> val rdd2 = sc.parallelize(List(1,2,3,4,5,6,7,8,9),2)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[15] at parallelize at <console>:24
scala> rdd2.partitions.length
res42: Int = 2
scala> val rdd3 = rdd2.coalesce(4,true)
rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[19] at coalesce at <console>:26
scala> rdd3.partitions.length
res43: Int = 4
scala> val rdd4 = rdd3.repartition(5)
rdd4: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[23] at repartition at <console>:28
scala> rdd4.partitions.length
res44: Int = 5
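One caveat, shown as a minimal sketch under the same assumptions: without a shuffle, coalesce can only reduce the number of partitions, so asking for more partitions while leaving the shuffle flag at false leaves the count unchanged.
val narrow = rdd2.coalesce(4)   // shuffle defaults to false
narrow.partitions.length        // expected to stay at 2: growing the partition count needs a shuffle
val fewer = rdd2.coalesce(1)
fewer.partitions.length         // expected to be 1: shrinking works without a shuffle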
6. filterByRange
Filters a key-value RDD down to the elements whose keys fall within the given range (requires an ordering on the keys).
scala> val rdd6 = sc.parallelize(List(("a",3),("b",2),("d",5),("e",8)))
rdd6: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[24] at parallelize at <console>:24
scala> rdd6.filterByRange("b","d").collect
res45: Array[(String, Int)] = Array((b,2), (d,5))
scala> rdd6.filterByRange("b","e").collect
res46: Array[(String, Int)] = Array((b,2), (d,5), (e,8))
7. flatMapValues
Splits each value into multiple values and pairs every resulting value with the original key.
scala> val rdd7 = sc.parallelize(List(("a","3 6"),("b","2 5"),("d","5 8")))
rdd7: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[27] at parallelize at <console>:24
scala> rdd7.flatMapValues(_.split(" ")).collect
res47: Array[(String, String)] = Array((a,3), (a,6), (b,2), (b,5), (d,5), (d,8))
8. foldByKey
Example requirement: concatenate the strings that share the same key.
scala> val rdd8 = sc.parallelize(List("Tom","John","Mary","Joe"),2)
rdd8: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[31] at parallelize at <console>:24
scala> val rdd9 = rdd8.map(x => (x.length,x))
rdd9: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[32] at map at <console>:26
scala> rdd9.collect
res48: Array[(Int, String)] = Array((3,Tom), (4,John), (4,Mary), (3,Joe))
scala> rdd9.foldByKey("")(_+_).collect
res49: Array[(Int, String)] = Array((4,JohnMary), (3,JoeTom))
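foldByKey also works with a numeric zero value, behaving like reduceByKey with an explicit starting element. A minimal sketch under the same assumptions; the pairs below are illustrative only.
val pairs = sc.parallelize(List(("a", 1), ("a", 2), ("b", 3)))
pairs.foldByKey(0)(_ + _).collect   // expected to contain (a,3) and (b,3)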
9. foreach
Iterates over the elements; the supplied function runs on the executors, not on the driver.
import org.apache.spark.{SparkConf, SparkContext}

object ForeachDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ForeachDemo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    // create an RDD with 3 partitions
    val rdd1 = sc.parallelize(List(1,2,3,4,5),3)
    rdd1.foreach(println(_))
    sc.stop()
  }
}
Result: the numbers 1 through 5 are printed; since the RDD has three partitions, the order of the printed lines is not guaranteed.
10. keyBy
Builds key-value pairs by computing a key from each element. On the resulting pair RDD, keys returns just the keys and values returns just the values.
scala> val rdd2 = sc.parallelize(List("Tom","John","Jack"),3)
rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[37] at parallelize at <console>:24
scala> val rdd3 = rdd2.keyBy(_.length)
rdd3: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[38] at keyBy at <console>:26
scala> rdd3.collect
res60: Array[(Int, String)] = Array((3,Tom), (4,John), (4,Jack))
scala> rdd3.keys.collect
res61: Array[Int] = Array(3, 4, 4)
scala> rdd3.values.collect
res62: Array[String] = Array(Tom, John, Jack)
VI. RDD Parallelization Flow
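When a local collection is parallelized, Spark slices it into the requested number of partitions and runs one task per partition. A minimal sketch, assuming the same spark-shell session, that makes the slicing visible with mapPartitionsWithIndex (data and layout are illustrative names):
val data = List(1, 2, 3, 4, 5, 6)
val rdd = sc.parallelize(data, 3)   // the local collection is sliced into 3 partitions
val layout = rdd.mapPartitionsWithIndex { (index, iter) =>
  iter.map(x => "partition " + index + " -> " + x)
}
layout.collect.foreach(println)
// with 6 elements and 3 slices, each partition is expected to hold 2 consecutive elements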