Transformation Operators

Basic initialization

Java:

  static SparkConf conf = null;
  static JavaSparkContext sc = null;
  static {
      conf = new SparkConf();
      conf.setMaster("local").setAppName("TestTransformation");
      sc = new JavaSparkContext(conf);
  }

Scala:

  private val conf: SparkConf = new SparkConf().setAppName("TestTransformation").setMaster("local")
  private val sc: SparkContext = new SparkContext(conf)

1. map, flatMap, mapPartitions, mapPartitionsWithIndex

1.1 map

(1) Java 7

map is easy to understand: it passes the elements of the source JavaRDD into the call method one by one, and the value returned for each becomes an element of a new JavaRDD.

  public static void map(){
      //String[] names = {"张无忌","赵敏","周芷若"};
      List<String> list = Arrays.asList("张无忌", "赵敏", "周芷若");
      System.out.println(list.size());
      JavaRDD<String> listRDD = sc.parallelize(list);

      JavaRDD<String> nameRDD = listRDD.map(new Function<String, String>() {
          @Override
          public String call(String name) throws Exception {
              return "Hello " + name;
          }
      });

      nameRDD.foreach(new VoidFunction<String>() {
          @Override
          public void call(String s) throws Exception {
              System.out.println(s);
          }
      });
  }

(2) Java 8

  public static void map(){
      String[] names = {"张无忌", "赵敏", "周芷若"};
      List<String> list = Arrays.asList(names);
      JavaRDD<String> listRDD = sc.parallelize(list);

      JavaRDD<String> nameRDD = listRDD.map(name -> "Hello " + name);

      nameRDD.foreach(name -> System.out.println(name));
  }

(3) Scala

  def map(): Unit = {
      val list = List("张无忌", "赵敏", "周芷若")
      val listRDD = sc.parallelize(list)
      val nameRDD = listRDD.map(name => "Hello " + name)
      nameRDD.foreach(name => println(name))
  }

(4) Run results
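
Running locally, each element comes back with the greeting attached; the expected output (the Java 7 version also prints the list size, 3, first) is:

  Hello 张无忌
  Hello 赵敏
  Hello 周芷若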

(5) Summary

As the results show, map computes over every element of the source RDD; since the elements are passed in one by one, order is preserved, and the new RDD's elements stay in the same order as the source RDD's. This one-at-a-time, order-preserving behavior leads directly to flatMap.

1.2 flatMap

(1) Java 7

Like map, flatMap passes the RDD's elements into the call method one by one; what it adds over map is that each input element can produce any number of output elements, returned as an iterator and flattened into the new RDD.

  public static void flatMap(){
      List<String> list = Arrays.asList("张无忌 赵敏", "宋青书 周芷若");
      JavaRDD<String> listRDD = sc.parallelize(list);

      JavaRDD<String> nameRDD = listRDD
              .flatMap(new FlatMapFunction<String, String>() {
                  @Override
                  public Iterator<String> call(String line) throws Exception {
                      return Arrays.asList(line.split(" ")).iterator();
                  }
              })
              .map(new Function<String, String>() {
                  @Override
                  public String call(String name) throws Exception {
                      return "Hello " + name;
                  }
              });

      nameRDD.foreach(new VoidFunction<String>() {
          @Override
          public void call(String s) throws Exception {
              System.out.println(s);
          }
      });
  }

(2) Java 8

  public static void flatMap(){
      List<String> list = Arrays.asList("张无忌 赵敏", "宋青书 周芷若");
      JavaRDD<String> listRDD = sc.parallelize(list);

      JavaRDD<String> nameRDD = listRDD
              .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
              .map(name -> "Hello " + name);

      nameRDD.foreach(name -> System.out.println(name));
  }

(3) Scala

  def flatMap(): Unit = {
      val list = List("张无忌 赵敏", "宋青书 周芷若")
      val listRDD = sc.parallelize(list)

      val nameRDD = listRDD.flatMap(line => line.split(" ")).map(name => "Hello " + name)
      nameRDD.foreach(name => println(name))
  }

(4) Run results
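
Each input line splits into two names, so four greetings are expected:

  Hello 张无忌
  Hello 赵敏
  Hello 宋青书
  Hello 周芷若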

(5) Summary

flatMap's behavior makes it very handy whenever elements need to be added on the fly, for example when filling gaps in the source RDD.

map and flatMap both pass elements in one at a time, but sometimes an operation needs two elements of the RDD to interact (for example, computing interest on savings, where next month's interest is earned on the principal plus last month's balance). Those two operators cannot express this, and that is where mapPartitions comes in: it passes in an entire partition at once as an iterator and returns an iterator that yields the new partition's elements. Because the whole partition is available, logic that spans neighboring elements becomes possible, as the sketch below shows.
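
A minimal Scala sketch of that savings example (the 1% monthly rate, the deposit amounts, and the single partition are assumptions for illustration): inside mapPartitions the iterator lets each month's result build on the previous one, which plain map cannot do.

  def runningBalance(): Unit = {
      // one partition, so the running state covers the whole sequence
      val deposits = sc.parallelize(List(100.0, 200.0, 300.0), 1)
      val balances = deposits.mapPartitions(iterator => {
          var balance = 0.0
          iterator.map(deposit => {
              // last month's balance plus this month's deposit earns interest
              balance = (balance + deposit) * 1.01
              balance
          })
      })
      balances.foreach(println)
  }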

1.3 mapPartitions

(1) Java 7

  /**
   * map:
   *     processes one record at a time (file system, database, etc.)
   * mapPartitions:
   *     fetches a whole partition of data at a time (e.g. from HDFS)
   *     Normally mapPartitions is the higher-performance operator, because
   *     processing a partition at a time cuts down the number of data fetches.
   *
   *     But if the partitioning is chosen badly, a single partition may end up
   *     holding too much data.
   */
  public static void mapPartitions(){
      List<Integer> list = Arrays.asList(1, 2, 3, 4, 5, 6);
      // the second argument means this RDD has two partitions
      JavaRDD<Integer> listRDD = sc.parallelize(list, 2);

      listRDD.mapPartitions(new FlatMapFunction<Iterator<Integer>, String>() {
          @Override
          public Iterator<String> call(Iterator<Integer> iterator) throws Exception {
              ArrayList<String> array = new ArrayList<>();
              while (iterator.hasNext()){
                  array.add("hello " + iterator.next());
              }
              return array.iterator();
          }
      }).foreach(new VoidFunction<String>() {
          @Override
          public void call(String s) throws Exception {
              System.out.println(s);
          }
      });
  }

(2) Java 8

  public static void mapPartitions(){
      List<Integer> list = Arrays.asList(1, 2, 3, 4, 5, 6);
      JavaRDD<Integer> listRDD = sc.parallelize(list, 2);

      listRDD.mapPartitions(iterator -> {
          ArrayList<String> array = new ArrayList<>();
          while (iterator.hasNext()){
              array.add("hello " + iterator.next());
          }
          return array.iterator();
      }).foreach(name -> System.out.println(name));
  }

(3) Scala

  def mapPartitions(): Unit = {
      val list = List(1, 2, 3, 4, 5, 6)
      val listRDD = sc.parallelize(list, 2)

      listRDD.mapPartitions(iterator => {
          val newList: ListBuffer[String] = ListBuffer()
          while (iterator.hasNext) {
              newList.append("hello " + iterator.next())
          }
          newList.toIterator
      }).foreach(name => println(name))
  }

(4) Run results
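
With two partitions, (1, 2, 3) and (4, 5, 6), the expected output is hello 1 through hello 6; order is guaranteed within a partition, but the two partitions may interleave:

  hello 1
  hello 2
  hello 3
  hello 4
  hello 5
  hello 6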

1.4 mapPartitionsWithIndex

Like mapPartitions, each call receives and processes a whole partition of data, but here you also know the index of the partition being processed.

(1) Java 7

  public static void mapPartitionsWithIndex(){
      List<Integer> list = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8);
      JavaRDD<Integer> listRDD = sc.parallelize(list, 2);
      listRDD.mapPartitionsWithIndex(new Function2<Integer, Iterator<Integer>, Iterator<String>>() {
          @Override
          public Iterator<String> call(Integer index, Iterator<Integer> iterator) throws Exception {
              ArrayList<String> list1 = new ArrayList<>();
              while (iterator.hasNext()){
                  list1.add(index + "_" + iterator.next());
              }
              return list1.iterator();
          }
      }, true)
      .foreach(new VoidFunction<String>() {
          @Override
          public void call(String s) throws Exception {
              System.out.println(s);
          }
      });
  }

(2) Java 8

  public static void mapPartitionsWithIndex() {
      List<Integer> list = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8);
      JavaRDD<Integer> listRDD = sc.parallelize(list, 2);
      listRDD.mapPartitionsWithIndex((index, iterator) -> {
          ArrayList<String> list1 = new ArrayList<>();
          while (iterator.hasNext()){
              list1.add(index + "_" + iterator.next());
          }
          return list1.iterator();
      }, true)
      .foreach(str -> System.out.println(str));
  }

(3) Scala

  def mapPartitionsWithIndex(): Unit = {
      val list = List(1, 2, 3, 4, 5, 6, 7, 8)
      sc.parallelize(list, 2).mapPartitionsWithIndex((index, iterator) => {
          val listBuffer: ListBuffer[String] = new ListBuffer
          while (iterator.hasNext) {
              listBuffer.append(index + "_" + iterator.next())
          }
          listBuffer.iterator
      }, true)
      .foreach(println(_))
  }

(4) Run results
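
With two partitions, elements 1–4 land in partition 0 and 5–8 in partition 1, so the expected output is:

  0_1
  0_2
  0_3
  0_4
  1_5
  1_6
  1_7
  1_8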

2. reduce, reduceByKey

2.1 reduce

reduce merges all the elements of the RDD: each invocation of the call method receives two parameters and returns their merged result, and that result is then passed back into call together with the next element, merging again and again until only a single value remains.

(1) Java 7

  public static void reduce(){
      List<Integer> list = Arrays.asList(1, 2, 3, 4, 5, 6);
      JavaRDD<Integer> listRDD = sc.parallelize(list);

      Integer result = listRDD.reduce(new Function2<Integer, Integer, Integer>() {
          @Override
          public Integer call(Integer i1, Integer i2) throws Exception {
              return i1 + i2;
          }
      });
      System.out.println(result);
  }

(2) Java 8

  public static void reduce(){
      List<Integer> list = Arrays.asList(1, 2, 3, 4, 5, 6);
      JavaRDD<Integer> listRDD = sc.parallelize(list);

      Integer result = listRDD.reduce((x, y) -> x + y);
      System.out.println(result);
  }

(3) Scala

  def reduce(): Unit = {
      val list = List(1, 2, 3, 4, 5, 6)
      val listRDD = sc.parallelize(list)

      val result = listRDD.reduce((x, y) => x + y)
      println(result)
  }

(4) Run results
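
The sum of 1 through 6:

  21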

2.2 reduceByKey

reduceByKey merges only the values whose keys are equal among all the K,V pairs in the RDD.

(1) Java 7

  public static void reduceByKey(){
      List<Tuple2<String, Integer>> list = Arrays.asList(
              new Tuple2<String, Integer>("武当", 99),
              new Tuple2<String, Integer>("少林", 97),
              new Tuple2<String, Integer>("武当", 89),
              new Tuple2<String, Integer>("少林", 77)
      );
      JavaPairRDD<String, Integer> listRDD = sc.parallelizePairs(list);
      // reduceByKey groups pairs that share a key and combines their values with the call method
      JavaPairRDD<String, Integer> result = listRDD.reduceByKey(new Function2<Integer, Integer, Integer>() {
          @Override
          public Integer call(Integer i1, Integer i2) throws Exception {
              return i1 + i2;
          }
      });
      result.foreach(new VoidFunction<Tuple2<String, Integer>>() {
          @Override
          public void call(Tuple2<String, Integer> tuple) throws Exception {
              System.out.println("门派: " + tuple._1 + "->" + tuple._2);
          }
      });
  }

(2) Java 8

  public static void reduceByKey(){
      List<Tuple2<String, Integer>> list = Arrays.asList(
              new Tuple2<String, Integer>("武当", 99),
              new Tuple2<String, Integer>("少林", 97),
              new Tuple2<String, Integer>("武当", 89),
              new Tuple2<String, Integer>("少林", 77)
      );
      JavaPairRDD<String, Integer> listRDD = sc.parallelizePairs(list);

      JavaPairRDD<String, Integer> resultRDD = listRDD.reduceByKey((x, y) -> x + y);
      resultRDD.foreach(tuple -> System.out.println("门派: " + tuple._1 + "->" + tuple._2));
  }

(3) Scala

  def reduceByKey(): Unit = {
      val list = List(("武当", 99), ("少林", 97), ("武当", 89), ("少林", 77))
      val mapRDD = sc.parallelize(list)

      val resultRDD = mapRDD.reduceByKey(_ + _)
      resultRDD.foreach(tuple => println("门派: " + tuple._1 + "->" + tuple._2))
  }

(4) Run results
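
Each faction's scores are summed (99 + 89 and 97 + 77); output order may vary:

  门派: 武当->188
  门派: 少林->174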

3. union, join, and groupByKey

3.1 union

When two RDDs need to be combined, union and join come into play. union simply appends one RDD to the other, much like List's addAll method; and just as with List, when using union the two RDDs must have the same element type.

(1) Java 7

  public static void union(){
      final List<Integer> list1 = Arrays.asList(1, 2, 3, 4);
      final List<Integer> list2 = Arrays.asList(3, 4, 5, 6);
      final JavaRDD<Integer> rdd1 = sc.parallelize(list1);
      final JavaRDD<Integer> rdd2 = sc.parallelize(list2);
      rdd1.union(rdd2)
          .foreach(new VoidFunction<Integer>() {
              @Override
              public void call(Integer number) throws Exception {
                  System.out.println(number);
              }
          });
  }

(2) Java 8

  public static void union(){
      final List<Integer> list1 = Arrays.asList(1, 2, 3, 4);
      final List<Integer> list2 = Arrays.asList(3, 4, 5, 6);
      final JavaRDD<Integer> rdd1 = sc.parallelize(list1);
      final JavaRDD<Integer> rdd2 = sc.parallelize(list2);

      rdd1.union(rdd2).foreach(num -> System.out.println(num));
  }

(3) Scala

  def union(): Unit = {
      val list1 = List(1, 2, 3, 4)
      val list2 = List(3, 4, 5, 6)
      val rdd1 = sc.parallelize(list1)
      val rdd2 = sc.parallelize(list2)
      rdd1.union(rdd2).foreach(println(_))
  }

(4) Run results
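
union keeps duplicates, so all eight elements are expected, one per line:

  1 2 3 4 3 4 5 6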

3.2 groupByKey

(1) Java 7

union merely concatenates two RDDs; join is different. join resembles Hadoop's combine step, minus the sorting. But before join, a word about groupByKey, because join can be understood as union followed by groupByKey: groupBy groups an RDD's elements, using the call method's return value as the group name, and groupByKey, as the name suggests, groups the elements of a PairRDD that share the same key:

  public static void groupByKey(){
      List<Tuple2<String, String>> list = Arrays.asList(
              new Tuple2<>("武当", "张三丰"),
              new Tuple2<>("峨眉", "灭绝师太"),
              new Tuple2<>("武当", "宋青书"),
              new Tuple2<>("峨眉", "周芷若")
      );
      JavaPairRDD<String, String> listRDD = sc.parallelizePairs(list);

      JavaPairRDD<String, Iterable<String>> groupByKeyRDD = listRDD.groupByKey();
      groupByKeyRDD.foreach(new VoidFunction<Tuple2<String, Iterable<String>>>() {
          @Override
          public void call(Tuple2<String, Iterable<String>> tuple) throws Exception {
              String menpai = tuple._1;
              Iterator<String> iterator = tuple._2.iterator();
              String people = "";
              while (iterator.hasNext()){
                  people = people + iterator.next() + " ";
              }
              System.out.println("门派:" + menpai + " 人员:" + people);
          }
      });
  }

(2) Java 8

  public static void groupByKey(){
      List<Tuple2<String, String>> list = Arrays.asList(
              new Tuple2<>("武当", "张三丰"),
              new Tuple2<>("峨眉", "灭绝师太"),
              new Tuple2<>("武当", "宋青书"),
              new Tuple2<>("峨眉", "周芷若")
      );
      JavaPairRDD<String, String> listRDD = sc.parallelizePairs(list);

      JavaPairRDD<String, Iterable<String>> groupByKeyRDD = listRDD.groupByKey();
      groupByKeyRDD.foreach(tuple -> {
          String menpai = tuple._1;
          Iterator<String> iterator = tuple._2.iterator();
          String people = "";
          while (iterator.hasNext()){
              people = people + iterator.next() + " ";
          }
          System.out.println("门派:" + menpai + " 人员:" + people);
      });
  }

(3) Scala

  def groupByKey(): Unit = {
      val list = List(("武当", "张三丰"), ("峨眉", "灭绝师太"), ("武当", "宋青书"), ("峨眉", "周芷若"))
      val listRDD = sc.parallelize(list)
      val groupByKeyRDD = listRDD.groupByKey()
      groupByKeyRDD.foreach(t => {
          val menpai = t._1
          val iterator = t._2.iterator
          var people = ""
          while (iterator.hasNext) people = people + iterator.next + " "
          println("门派:" + menpai + " 人员:" + people)
      })
  }

(4) Run results
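
The expected grouping (output order may vary):

  门派:武当 人员:张三丰 宋青书
  门派:峨眉 人员:灭绝师太 周芷若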

3.3 join

(1) Java 7

join merges two PairRDDs and groups elements that share a key; it can be thought of as groupByKey combined with union.

  public static void join(){
      final List<Tuple2<Integer, String>> names = Arrays.asList(
              new Tuple2<Integer, String>(1, "东方不败"),
              new Tuple2<Integer, String>(2, "令狐冲"),
              new Tuple2<Integer, String>(3, "林平之")
      );
      final List<Tuple2<Integer, Integer>> scores = Arrays.asList(
              new Tuple2<Integer, Integer>(1, 99),
              new Tuple2<Integer, Integer>(2, 98),
              new Tuple2<Integer, Integer>(3, 97)
      );

      final JavaPairRDD<Integer, String> namesRDD = sc.parallelizePairs(names);
      final JavaPairRDD<Integer, Integer> scoresRDD = sc.parallelizePairs(scores);
      /**
       * <Integer,          student id
       *  Tuple2<String,    name
       *         Integer>>  score
       */
      final JavaPairRDD<Integer, Tuple2<String, Integer>> joinRDD = namesRDD.join(scoresRDD);
      // final JavaPairRDD<Integer, Tuple2<Integer, String>> join = scoresRDD.join(namesRDD);
      joinRDD.foreach(new VoidFunction<Tuple2<Integer, Tuple2<String, Integer>>>() {
          @Override
          public void call(Tuple2<Integer, Tuple2<String, Integer>> tuple) throws Exception {
              System.out.println("学号:" + tuple._1 + " 名字:" + tuple._2._1 + " 分数:" + tuple._2._2);
          }
      });
  }

(2) Java 8

  public static void join(){
      final List<Tuple2<Integer, String>> names = Arrays.asList(
              new Tuple2<Integer, String>(1, "东方不败"),
              new Tuple2<Integer, String>(2, "令狐冲"),
              new Tuple2<Integer, String>(3, "林平之")
      );
      final List<Tuple2<Integer, Integer>> scores = Arrays.asList(
              new Tuple2<Integer, Integer>(1, 99),
              new Tuple2<Integer, Integer>(2, 98),
              new Tuple2<Integer, Integer>(3, 97)
      );

      final JavaPairRDD<Integer, String> namesRDD = sc.parallelizePairs(names);
      final JavaPairRDD<Integer, Integer> scoresRDD = sc.parallelizePairs(scores);

      final JavaPairRDD<Integer, Tuple2<String, Integer>> joinRDD = namesRDD.join(scoresRDD);
      joinRDD.foreach(tuple -> System.out.println("学号:" + tuple._1 + " 姓名:" + tuple._2._1 + " 成绩:" + tuple._2._2));
  }

(3) Scala

  def join(): Unit = {
      val list1 = List((1, "东方不败"), (2, "令狐冲"), (3, "林平之"))
      val list2 = List((1, 99), (2, 98), (3, 97))
      val list1RDD = sc.parallelize(list1)
      val list2RDD = sc.parallelize(list2)

      val joinRDD = list1RDD.join(list2RDD)
      joinRDD.foreach(t => println("学号:" + t._1 + " 姓名:" + t._2._1 + " 成绩:" + t._2._2))
  }

(4) Run results
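
Each student id is matched with its name and score (output order may vary; the Java 7 version labels the fields 名字/分数 instead of 姓名/成绩):

  学号:1 姓名:东方不败 成绩:99
  学号:2 姓名:令狐冲 成绩:98
  学号:3 姓名:林平之 成绩:97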

4. sample, cartesian

4.1 sample

(1) Java 7

  public static void sample(){
      ArrayList<Integer> list = new ArrayList<>();
      for (int i = 1; i <= 100; i++){
          list.add(i);
      }
      JavaRDD<Integer> listRDD = sc.parallelize(list);
      /**
       * sample draws a sample from the RDD. It takes three parameters:
       * withReplacement: Boolean
       *     true:  sample with replacement
       *     false: sample without replacement
       * fraction: Double
       *     the fraction of the data to sample
       * seed: Long
       *     the random seed
       */
      JavaRDD<Integer> sampleRDD = listRDD.sample(false, 0.1, 0);
      sampleRDD.foreach(new VoidFunction<Integer>() {
          @Override
          public void call(Integer num) throws Exception {
              System.out.print(num + " ");
          }
      });
  }

(2) Java 8

  public static void sample(){
      ArrayList<Integer> list = new ArrayList<>();
      for (int i = 1; i <= 100; i++){
          list.add(i);
      }
      JavaRDD<Integer> listRDD = sc.parallelize(list);

      JavaRDD<Integer> sampleRDD = listRDD.sample(false, 0.1, 0);
      sampleRDD.foreach(num -> System.out.print(num + " "));
  }

(3) Scala

  def sample(): Unit = {
      val list = 1 to 100
      val listRDD = sc.parallelize(list)
      listRDD.sample(false, 0.1, 0).foreach(num => print(num + " "))
  }

(4) Run results
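
With fraction 0.1 over the numbers 1–100, roughly ten values are printed; exactly which ones depends on the seed and on Spark's sampler implementation, since fraction is an expected proportion rather than an exact count.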

4.2 cartesian

cartesian computes the Cartesian product of two RDDs.

(1) Java 7

  public static void cartesian(){
      List<String> list1 = Arrays.asList("A", "B");
      List<Integer> list2 = Arrays.asList(1, 2, 3);
      JavaRDD<String> list1RDD = sc.parallelize(list1);
      JavaRDD<Integer> list2RDD = sc.parallelize(list2);
      list1RDD.cartesian(list2RDD).foreach(new VoidFunction<Tuple2<String, Integer>>() {
          @Override
          public void call(Tuple2<String, Integer> tuple) throws Exception {
              System.out.println(tuple._1 + "->" + tuple._2);
          }
      });
  }

(2) Java 8

  public static void cartesian(){
      List<String> list1 = Arrays.asList("A", "B");
      List<Integer> list2 = Arrays.asList(1, 2, 3);
      JavaRDD<String> list1RDD = sc.parallelize(list1);
      JavaRDD<Integer> list2RDD = sc.parallelize(list2);
      list1RDD.cartesian(list2RDD).foreach(tuple -> System.out.println(tuple._1 + "->" + tuple._2));
  }

(3) Scala

  def cartesian(): Unit = {
      val list1 = List("A", "B")
      val list2 = List(1, 2, 3)
      val list1RDD = sc.parallelize(list1)
      val list2RDD = sc.parallelize(list2)
      list1RDD.cartesian(list2RDD).foreach(t => println(t._1 + "->" + t._2))
  }

(4) Run results
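
Every pairing of the two RDDs' elements, 2 × 3 = 6 in total:

  A->1
  A->2
  A->3
  B->1
  B->2
  B->3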

5. filter, distinct, intersection

5.1 filter

(1) Java 7

Keep only the even numbers:

  public static void filter(){
      List<Integer> list = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
      JavaRDD<Integer> listRDD = sc.parallelize(list);
      JavaRDD<Integer> filterRDD = listRDD.filter(new Function<Integer, Boolean>() {
          @Override
          public Boolean call(Integer num) throws Exception {
              return num % 2 == 0;
          }
      });
      filterRDD.foreach(new VoidFunction<Integer>() {
          @Override
          public void call(Integer num) throws Exception {
              System.out.print(num + " ");
          }
      });
  }

(2) Java 8

  public static void filter(){
      List<Integer> list = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
      JavaRDD<Integer> listRDD = sc.parallelize(list);
      JavaRDD<Integer> filterRDD = listRDD.filter(num -> num % 2 == 0);
      filterRDD.foreach(num -> System.out.print(num + " "));
  }

(3) Scala

  def filter(): Unit = {
      val list = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
      val listRDD = sc.parallelize(list)
      listRDD.filter(num => num % 2 == 0).foreach(num => print(num + " "))
  }

(4) Run results
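
Only the even numbers survive the predicate:

  2 4 6 8 10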

5.2 distinct

(1) Java 7

  public static void distinct(){
      List<Integer> list = Arrays.asList(1, 1, 2, 2, 3, 3, 4, 5);
      JavaRDD<Integer> listRDD = sc.parallelize(list);
      JavaRDD<Integer> distinctRDD = listRDD.distinct();
      distinctRDD.foreach(new VoidFunction<Integer>() {
          @Override
          public void call(Integer num) throws Exception {
              System.out.println(num);
          }
      });
  }

(2) Java 8

  public static void distinct(){
      List<Integer> list = Arrays.asList(1, 1, 2, 2, 3, 3, 4, 5);
      JavaRDD<Integer> listRDD = sc.parallelize(list);
      listRDD.distinct().foreach(num -> System.out.println(num));
  }

(3) Scala

  def distinct(): Unit = {
      val list = List(1, 1, 2, 2, 3, 3, 4, 5)
      sc.parallelize(list).distinct().foreach(println(_))
  }

(4) Run results
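
The duplicates are removed, leaving 1, 2, 3, 4, 5; distinct shuffles by hash, so the output order is not guaranteed.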

5.3 intersection

(1) Java 7

  public static void intersection(){
      List<Integer> list1 = Arrays.asList(1, 2, 3, 4);
      List<Integer> list2 = Arrays.asList(3, 4, 5, 6);
      JavaRDD<Integer> list1RDD = sc.parallelize(list1);
      JavaRDD<Integer> list2RDD = sc.parallelize(list2);
      list1RDD.intersection(list2RDD).foreach(new VoidFunction<Integer>() {
          @Override
          public void call(Integer num) throws Exception {
              System.out.println(num);
          }
      });
  }

(2) Java 8

  public static void intersection() {
      List<Integer> list1 = Arrays.asList(1, 2, 3, 4);
      List<Integer> list2 = Arrays.asList(3, 4, 5, 6);
      JavaRDD<Integer> list1RDD = sc.parallelize(list1);
      JavaRDD<Integer> list2RDD = sc.parallelize(list2);
      list1RDD.intersection(list2RDD).foreach(num -> System.out.println(num));
  }

(3) Scala

  def intersection(): Unit = {
      val list1 = List(1, 2, 3, 4)
      val list2 = List(3, 4, 5, 6)
      val list1RDD = sc.parallelize(list1)
      val list2RDD = sc.parallelize(list2)
      list1RDD.intersection(list2RDD).foreach(println(_))
  }

(4) Run results
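
Only the elements present in both RDDs are kept, so 3 and 4 are printed (in either order).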

6. coalesce, repartition, repartitionAndSortWithinPartitions

6.1 coalesce

Reduces the number of partitions: many -> fewer (by default coalesce does this without a shuffle).

(1) Java 7

  public static void coalesce(){
      List<Integer> list = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9);
      JavaRDD<Integer> listRDD = sc.parallelize(list, 3);
      listRDD.coalesce(1).foreach(new VoidFunction<Integer>() {
          @Override
          public void call(Integer num) throws Exception {
              System.out.println(num);
          }
      });
  }

(2) Java 8

  public static void coalesce() {
      List<Integer> list = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9);
      JavaRDD<Integer> listRDD = sc.parallelize(list, 3);
      listRDD.coalesce(1).foreach(num -> System.out.println(num));
  }

(3) Scala

  def coalesce(): Unit = {
      val list = List(1, 2, 3, 4, 5, 6, 7, 8, 9)
      sc.parallelize(list, 3).coalesce(1).foreach(println(_))
  }

(4) Run results
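
coalesce(1) concatenates the three partitions in order into one, so 1 through 9 are expected in sequence, one per line.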

6.2 repartition

Repartitions the data. The problem it solves: turning a small number of partitions into more partitions (repartition always performs a shuffle).

(1) Java 7

  public static void repartition(){
      List<Integer> list = Arrays.asList(1, 2, 3, 4);
      JavaRDD<Integer> listRDD = sc.parallelize(list, 1);
      listRDD.repartition(2).foreach(new VoidFunction<Integer>() {
          @Override
          public void call(Integer num) throws Exception {
              System.out.println(num);
          }
      });
  }

(2) Java 8

  public static void repartition(){
      List<Integer> list = Arrays.asList(1, 2, 3, 4);
      JavaRDD<Integer> listRDD = sc.parallelize(list, 1);
      listRDD.repartition(2).foreach(num -> System.out.println(num));
  }

(3) Scala

  def repartition(): Unit = {
      val list = List(1, 2, 3, 4)
      val listRDD = sc.parallelize(list, 1)
      listRDD.repartition(2).foreach(println(_))
  }

(4) Run results
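
The four elements are redistributed across two partitions, so all of 1, 2, 3, 4 are printed, but the order depends on how the shuffle splits them.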

6.3 repartitionAndSortWithinPartitions

repartitionAndSortWithinPartitions is a variant of repartition: unlike repartition, it sorts the records within each partition produced by the given partitioner. This is more efficient than calling repartition and then sorting inside each partition, because the sorting is pushed down into the shuffle machinery.

(1) Java 7

  public static void repartitionAndSortWithinPartitions(){
      List<Integer> list = Arrays.asList(1, 3, 55, 77, 33, 5, 23);
      JavaRDD<Integer> listRDD = sc.parallelize(list, 1);
      JavaPairRDD<Integer, Integer> pairRDD = listRDD.mapToPair(new PairFunction<Integer, Integer, Integer>() {
          @Override
          public Tuple2<Integer, Integer> call(Integer num) throws Exception {
              return new Tuple2<>(num, num);
          }
      });
      JavaPairRDD<Integer, Integer> partitionRDD = pairRDD.repartitionAndSortWithinPartitions(new Partitioner() {
          @Override
          public int getPartition(Object key) {
              Integer index = Integer.valueOf(key.toString());
              if (index % 2 == 0) {
                  return 0;
              } else {
                  return 1;
              }
          }

          @Override
          public int numPartitions() {
              return 2;
          }
      });
      partitionRDD.mapPartitionsWithIndex(new Function2<Integer, Iterator<Tuple2<Integer, Integer>>, Iterator<String>>() {
          @Override
          public Iterator<String> call(Integer index, Iterator<Tuple2<Integer, Integer>> iterator) throws Exception {
              final ArrayList<String> list1 = new ArrayList<>();
              while (iterator.hasNext()){
                  list1.add(index + "_" + iterator.next());
              }
              return list1.iterator();
          }
      }, false).foreach(new VoidFunction<String>() {
          @Override
          public void call(String s) throws Exception {
              System.out.println(s);
          }
      });
  }

(2) Java 8

  public static void repartitionAndSortWithinPartitions(){
      List<Integer> list = Arrays.asList(1, 4, 55, 66, 33, 48, 23);
      JavaRDD<Integer> listRDD = sc.parallelize(list, 1);
      JavaPairRDD<Integer, Integer> pairRDD = listRDD.mapToPair(num -> new Tuple2<>(num, num));
      pairRDD.repartitionAndSortWithinPartitions(new HashPartitioner(2))
          .mapPartitionsWithIndex((index, iterator) -> {
              ArrayList<String> list1 = new ArrayList<>();
              while (iterator.hasNext()){
                  list1.add(index + "_" + iterator.next());
              }
              return list1.iterator();
          }, false)
          .foreach(str -> System.out.println(str));
  }

(3) Scala

  def repartitionAndSortWithinPartitions(): Unit = {
      val list = List(1, 4, 55, 66, 33, 48, 23)
      val listRDD = sc.parallelize(list, 1)
      listRDD.map(num => (num, num))
          .repartitionAndSortWithinPartitions(new HashPartitioner(2))
          .mapPartitionsWithIndex((index, iterator) => {
              val listBuffer: ListBuffer[String] = new ListBuffer
              while (iterator.hasNext) {
                  listBuffer.append(index + "_" + iterator.next())
              }
              listBuffer.iterator
          }, false)
          .foreach(println(_))
  }

(4) Run results
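
For the Java 8 / Scala version, HashPartitioner(2) sends the even keys to partition 0 and the odd keys to partition 1, and each partition comes out sorted by key:

  0_(4,4)
  0_(48,48)
  0_(66,66)
  1_(1,1)
  1_(23,23)
  1_(33,33)
  1_(55,55)

(In the Java 7 version every key in the list is odd, so the custom partitioner puts everything, sorted, into partition 1.)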

7. cogroup, sortByKey, aggregateByKey

7.1 cogroup

Given the K,V elements of two (or more) RDDs, cogroup collects the values with the same key in each RDD into a separate collection. Unlike reduceByKey, it combines elements by key across both RDDs.

(1) Java 7

  public static void cogroup(){
      List<Tuple2<Integer, String>> list1 = Arrays.asList(
              new Tuple2<Integer, String>(1, "www"),
              new Tuple2<Integer, String>(2, "bbs")
      );

      List<Tuple2<Integer, String>> list2 = Arrays.asList(
              new Tuple2<Integer, String>(1, "cnblog"),
              new Tuple2<Integer, String>(2, "cnblog"),
              new Tuple2<Integer, String>(3, "very")
      );

      List<Tuple2<Integer, String>> list3 = Arrays.asList(
              new Tuple2<Integer, String>(1, "com"),
              new Tuple2<Integer, String>(2, "com"),
              new Tuple2<Integer, String>(3, "good")
      );

      JavaPairRDD<Integer, String> list1RDD = sc.parallelizePairs(list1);
      JavaPairRDD<Integer, String> list2RDD = sc.parallelizePairs(list2);
      JavaPairRDD<Integer, String> list3RDD = sc.parallelizePairs(list3);

      list1RDD.cogroup(list2RDD, list3RDD).foreach(new VoidFunction<Tuple2<Integer, Tuple3<Iterable<String>, Iterable<String>, Iterable<String>>>>() {
          @Override
          public void call(Tuple2<Integer, Tuple3<Iterable<String>, Iterable<String>, Iterable<String>>> tuple) throws Exception {
              System.out.println(tuple._1 + " " + tuple._2._1() + " " + tuple._2._2() + " " + tuple._2._3());
          }
      });
  }

(2) Java 8

  public static void cogroup(){
      List<Tuple2<Integer, String>> list1 = Arrays.asList(
              new Tuple2<Integer, String>(1, "www"),
              new Tuple2<Integer, String>(2, "bbs")
      );

      List<Tuple2<Integer, String>> list2 = Arrays.asList(
              new Tuple2<Integer, String>(1, "cnblog"),
              new Tuple2<Integer, String>(2, "cnblog"),
              new Tuple2<Integer, String>(3, "very")
      );

      List<Tuple2<Integer, String>> list3 = Arrays.asList(
              new Tuple2<Integer, String>(1, "com"),
              new Tuple2<Integer, String>(2, "com"),
              new Tuple2<Integer, String>(3, "good")
      );

      JavaPairRDD<Integer, String> list1RDD = sc.parallelizePairs(list1);
      JavaPairRDD<Integer, String> list2RDD = sc.parallelizePairs(list2);
      JavaPairRDD<Integer, String> list3RDD = sc.parallelizePairs(list3);

      list1RDD.cogroup(list2RDD, list3RDD).foreach(tuple ->
              System.out.println(tuple._1 + " " + tuple._2._1() + " " + tuple._2._2() + " " + tuple._2._3()));
  }

(3) Scala

  def cogroup(): Unit = {
      val list1 = List((1, "www"), (2, "bbs"))
      val list2 = List((1, "cnblog"), (2, "cnblog"), (3, "very"))
      val list3 = List((1, "com"), (2, "com"), (3, "good"))

      val list1RDD = sc.parallelize(list1)
      val list2RDD = sc.parallelize(list2)
      val list3RDD = sc.parallelize(list3)

      list1RDD.cogroup(list2RDD, list3RDD).foreach(tuple =>
          println(tuple._1 + " " + tuple._2._1 + " " + tuple._2._2 + " " + tuple._2._3))
  }

(4) Run results
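
Each key appears once with one collection of values per input RDD; key 3 has no values in list1RDD, so its first collection is empty (Spark typically prints these collections as CompactBuffer):

  1 CompactBuffer(www) CompactBuffer(cnblog) CompactBuffer(com)
  2 CompactBuffer(bbs) CompactBuffer(cnblog) CompactBuffer(com)
  3 CompactBuffer() CompactBuffer(very) CompactBuffer(good)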

7.2 sortByKey

The sortByKey function operates on Key-Value RDDs and sorts them by key. It is implemented in org.apache.spark.rdd.OrderedRDDFunctions as follows:

  def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.size)
      : RDD[(K, V)] =
  {
      val part = new RangePartitioner(numPartitions, self, ascending)
      new ShuffledRDD[K, V, V](self, part)
          .setKeyOrdering(if (ascending) ordering else ordering.reverse)
  }

As the implementation shows, it takes two parameters with the same meaning as in sortBy, so they are not explained again here. The RDD it returns is always a ShuffledRDD: sorting the source RDD requires a shuffle, and the result of a shuffle is a ShuffledRDD. The implementation is actually quite elegant: the RangePartitioner routes keys in the same range to the same partition, and the data within each partition is then sorted with the standard sort mechanism, so each partition can be sorted independently and no single node ever has to sort the full dataset at once. The following shows sortByKey in use:

(1) Java 7

  public static void sortByKey(){
      List<Tuple2<Integer, String>> list = Arrays.asList(
              new Tuple2<>(99, "张三丰"),
              new Tuple2<>(96, "东方不败"),
              new Tuple2<>(66, "林平之"),
              new Tuple2<>(98, "聂风")
      );
      JavaPairRDD<Integer, String> listRDD = sc.parallelizePairs(list);
      listRDD.sortByKey(false).foreach(new VoidFunction<Tuple2<Integer, String>>() {
          @Override
          public void call(Tuple2<Integer, String> tuple) throws Exception {
              System.out.println(tuple._2 + "->" + tuple._1);
          }
      });
  }

(2) Java 8

  public static void sortByKey(){
      List<Tuple2<Integer, String>> list = Arrays.asList(
              new Tuple2<>(99, "张三丰"),
              new Tuple2<>(96, "东方不败"),
              new Tuple2<>(66, "林平之"),
              new Tuple2<>(98, "聂风")
      );
      JavaPairRDD<Integer, String> listRDD = sc.parallelizePairs(list);
      listRDD.sortByKey(false).foreach(tuple -> System.out.println(tuple._2 + "->" + tuple._1));
  }

(3) Scala

  def sortByKey(): Unit = {
      val list = List((99, "张三丰"), (96, "东方不败"), (66, "林平之"), (98, "聂风"))
      sc.parallelize(list).sortByKey(false).foreach(tuple => println(tuple._2 + "->" + tuple._1))
  }

(4) Run results
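
With ascending = false the keys come out in descending order:

  张三丰->99
  聂风->98
  东方不败->96
  林平之->66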

7.3 aggregateByKey

aggregateByKey aggregates the values that share a key in a PairRDD, again starting from a neutral initial value. It is similar to aggregate, and as with aggregate the result type does not have to match the RDD's value type. Because aggregateByKey aggregates values per key, it still returns a pair RDD of keys and aggregated values, whereas aggregate returns a single non-RDD result; this is worth keeping in mind. Three aggregateByKey overloads are defined, but they all end up calling the same implementation.

(1) Java 7

  public static void aggregateByKey(){
      List<String> list = Arrays.asList("you,jump", "i,jump");
      JavaRDD<String> listRDD = sc.parallelize(list);
      listRDD.flatMap(new FlatMapFunction<String, String>() {
          @Override
          public Iterator<String> call(String line) throws Exception {
              return Arrays.asList(line.split(",")).iterator();
          }
      }).mapToPair(new PairFunction<String, String, Integer>() {
          @Override
          public Tuple2<String, Integer> call(String word) throws Exception {
              return new Tuple2<>(word, 1);
          }
      }).aggregateByKey(0, new Function2<Integer, Integer, Integer>() {
          @Override
          public Integer call(Integer i1, Integer i2) throws Exception {
              return i1 + i2;
          }
      }, new Function2<Integer, Integer, Integer>() {
          @Override
          public Integer call(Integer i1, Integer i2) throws Exception {
              return i1 + i2;
          }
      }).foreach(new VoidFunction<Tuple2<String, Integer>>() {
          @Override
          public void call(Tuple2<String, Integer> tuple) throws Exception {
              System.out.println(tuple._1 + "->" + tuple._2);
          }
      });
  }

(2) Java 8

  public static void aggregateByKey() {
      List<String> list = Arrays.asList("you,jump", "i,jump");
      JavaRDD<String> listRDD = sc.parallelize(list);
      listRDD.flatMap(line -> Arrays.asList(line.split(",")).iterator())
          .mapToPair(word -> new Tuple2<>(word, 1))
          .aggregateByKey(0, (x, y) -> x + y, (m, n) -> m + n)
          .foreach(tuple -> System.out.println(tuple._1 + "->" + tuple._2));
  }

(3) Scala

  def aggregateByKey(): Unit = {
      val list = List("you,jump", "i,jump")
      sc.parallelize(list)
          .flatMap(_.split(","))
          .map((_, 1))
          .aggregateByKey(0)(_ + _, _ + _)
          .foreach(tuple => println(tuple._1 + "->" + tuple._2))
  }

(4) Run results
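
The word counts over "you,jump" and "i,jump" (output order may vary):

  you->1
  jump->2
  i->1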
