一、基于排序机制的wordcount程序

1、要求

1、对文本文件内的每个单词都统计出其出现的次数。

2、按照每个单词出现次数的数量,降序排序。

2、代码实现

------java实现-------

package cn.spark.study.core;

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction; import scala.Tuple2; public class SortWordCount {
public static void main(String[] args) {
SparkConf conf = new SparkConf().setAppName("SortWordCount").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf); JavaRDD<String> lines = sc.textFile("D:\\test-file\\spark.txt"); JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() { private static final long serialVersionUID = 1L; @Override
public Iterable<String> call(String t) throws Exception {
return Arrays.asList(t.split(" "));
}
}); JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() { private static final long serialVersionUID = 1L; @Override
public Tuple2<String, Integer> call(String t) throws Exception {
return new Tuple2<String, Integer>(t, 1);
}
}); JavaPairRDD<String, Integer> wordCounts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() { private static final long serialVersionUID = 1L; @Override
public Integer call(Integer v1, Integer v2) throws Exception {
return v1 + v2;
}
}); // 到这里为止,就得到了每个单词出现的次数
// 但是,问题是,我们的新需求,是要按照每个单词出现次数的顺序,降序排序
// wordCounts RDD内的元素是什么?应该是这种格式的吧:(hello, 3) (you, 2)
// 我们需要将RDD转换成(3, hello) (2, you)的这种格式,才能根据单词出现次数进行排序把! // 进行key-value的反转映射
JavaPairRDD<Integer, String> countWords = wordCounts.mapToPair(new PairFunction<Tuple2<String,Integer>, Integer, String>() { private static final long serialVersionUID = 1L; @Override
public Tuple2<Integer, String> call(Tuple2<String, Integer> t) throws Exception {
return new Tuple2<Integer, String>(t._2, t._1);
}
}); //按照key进行排序
JavaPairRDD<Integer, String> sortedCountWords = countWords.sortByKey(false); //再次将value-key进行反转映射
JavaPairRDD<String, Integer> sortedWordCounts = sortedCountWords.mapToPair(new PairFunction<Tuple2<Integer,String>, String, Integer>() { private static final long serialVersionUID = 1L; @Override
public Tuple2<String, Integer> call(Tuple2<Integer, String> t) throws Exception {
return new Tuple2<String, Integer>(t._2, t._1);
}
}); // 到此为止,我们获得了按照单词出现次数排序后的单词计数
// 打印出来
sortedWordCounts.foreach(new VoidFunction<Tuple2<String,Integer>>() { private static final long serialVersionUID = 1L; @Override
public void call(Tuple2<String, Integer> t) throws Exception {
System.out.println(t._1 + " appears " + t._2 + " times.");
}
}); sc.close();
} } ---------scala实现--------- package cn.spark.study.core import org.apache.spark.SparkConf
import org.apache.spark.SparkContext /**
* @author Administrator
*/
object SortWordCount { def main(args: Array[String]) {
val conf = new SparkConf()
.setAppName("SortWordCount")
.setMaster("local")
val sc = new SparkContext(conf) val lines = sc.textFile("D:\\test-file\\spark.txt", 1)
val words = lines.flatMap { line => line.split(" ") }
val pairs = words.map { word => (word, 1) }
val wordCounts = pairs.reduceByKey(_ + _) val countWords = wordCounts.map(wordCount => (wordCount._2, wordCount._1))
val sortedCountWords = countWords.sortByKey(false)
val sortedWordCounts = sortedCountWords.map(sortedCountWord => (sortedCountWord._2, sortedCountWord._1)) sortedWordCounts.foreach(sortedWordCount => println(
sortedWordCount._1 + " appear " + sortedWordCount._2 + " times."))
} }

二、二次排序

1、要求

1、按照文件中的第一列排序。

2、如果第一列相同,则按照第二列排序。

2、java代码

###SecondarySortKey

package cn.spark.study.core;

import java.io.Serializable;

import scala.math.Ordered;

/**
* 自定义的二次排序key
* @author Administrator
*
*/
public class SecondarySortKey implements Ordered<SecondarySortKey>, Serializable { private static final long serialVersionUID = -2366006422945129991L; // 首先在自定义key里面,定义需要进行排序的列
private int first;
private int second; public SecondarySortKey(int first, int second) {
this.first = first;
this.second = second;
} @Override
public boolean $greater(SecondarySortKey other) {
if(this.first > other.getFirst()) {
return true;
} else if(this.first == other.getFirst() &&
this.second > other.getSecond()) {
return true;
}
return false;
} @Override
public boolean $greater$eq(SecondarySortKey other) {
if(this.$greater(other)) {
return true;
} else if(this.first == other.getFirst() &&
this.second == other.getSecond()) {
return true;
}
return false;
} @Override
public boolean $less(SecondarySortKey other) {
if(this.first < other.getFirst()) {
return true;
} else if(this.first == other.getFirst() &&
this.second < other.getSecond()) {
return true;
}
return false;
} @Override
public boolean $less$eq(SecondarySortKey other) {
if(this.$less(other)) {
return true;
} else if(this.first == other.getFirst() &&
this.second == other.getSecond()) {
return true;
}
return false;
} @Override
public int compare(SecondarySortKey other) {
if(this.first - other.getFirst() != 0) {
return this.first - other.getFirst();
} else {
return this.second - other.getSecond();
}
} @Override
public int compareTo(SecondarySortKey other) {
if(this.first - other.getFirst() != 0) {
return this.first - other.getFirst();
} else {
return this.second - other.getSecond();
}
} // 为要进行排序的多个列,提供getter和setter方法,以及hashcode和equals方法
public int getFirst() {
return first;
} public void setFirst(int first) {
this.first = first;
} public int getSecond() {
return second;
} public void setSecond(int second) {
this.second = second;
} @Override
public int hashCode() {
final int prime = 31;
int result = 1;
result = prime * result + first;
result = prime * result + second;
return result;
} @Override
public boolean equals(Object obj) {
if (this == obj)
return true;
if (obj == null)
return false;
if (getClass() != obj.getClass())
return false;
SecondarySortKey other = (SecondarySortKey) obj;
if (first != other.first)
return false;
if (second != other.second)
return false;
return true;
} } ###SecondarySort package cn.spark.study.core; import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction; import scala.Tuple2; /**
* 二次排序
* 1、实现自定义的key,要实现Ordered接口和Serializable接口,在key中实现自己对多个列的排序算法
* 2、将包含文本的RDD,映射成key为自定义key,value为文本的JavaPairRDD
* 3、使用sortByKey算子按照自定义的key进行排序
* 4、再次映射,剔除自定义的key,只保留文本行
* @author Administrator
*
*/
public class SecondarySort { public static void main(String[] args) {
SparkConf conf = new SparkConf()
.setAppName("SecondarySort")
.setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf); JavaRDD<String> lines = sc.textFile("D:\\test-file\\sort.txt"); JavaPairRDD<SecondarySortKey, String> pairs = lines.mapToPair( new PairFunction<String, SecondarySortKey, String>() { private static final long serialVersionUID = 1L; @Override
public Tuple2<SecondarySortKey, String> call(String line) throws Exception {
String[] lineSplited = line.split(" ");
SecondarySortKey key = new SecondarySortKey(
Integer.valueOf(lineSplited[0]),
Integer.valueOf(lineSplited[1]));
return new Tuple2<SecondarySortKey, String>(key, line);
} }); JavaPairRDD<SecondarySortKey, String> sortedPairs = pairs.sortByKey(); JavaRDD<String> sortedLines = sortedPairs.map( new Function<Tuple2<SecondarySortKey,String>, String>() { private static final long serialVersionUID = 1L; @Override
public String call(Tuple2<SecondarySortKey, String> v1) throws Exception {
return v1._2;
} }); sortedLines.foreach(new VoidFunction<String>() { private static final long serialVersionUID = 1L; @Override
public void call(String t) throws Exception {
System.out.println(t);
} }); sc.close();
} }

3、scala代码

###SecondSortKey

package cn.spark.study.core

/**
* @author Administrator
*/
class SecondSortKey(val first: Int, val second: Int)
extends Ordered[SecondSortKey] with Serializable { def compare(that: SecondSortKey): Int = {
if(this.first - that.first != 0) {
this.first - that.first
} else {
this.second - that.second
}
} } ###SecondSort package cn.spark.study.core import org.apache.spark.SparkConf
import org.apache.spark.SparkContext /**
* @author Administrator
*/
object SecondSort { def main(args: Array[String]): Unit = {
val conf = new SparkConf()
.setAppName("SecondSort")
.setMaster("local")
val sc = new SparkContext(conf) val lines = sc.textFile("D:\\test-file\\sort.txt", 1)
val pairs = lines.map { line => (
new SecondSortKey(line.split(" ")(0).toInt, line.split(" ")(1).toInt),
line)}
val sortedPairs = pairs.sortByKey()
val sortedLines = sortedPairs.map(sortedPair => sortedPair._2) sortedLines.foreach { sortedLine => println(sortedLine) }
} }

三、topn

1、要求

1、对文本文件内的数字,取最大的前3个。

2、对每个班级内的学生成绩,取出前3名。(分组取topn)

3、课后作业:用Scala来实现分组取topn。

2、获取文本内最大的前三个数

---------java实现----------

package cn.spark.study.core;

import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction; import scala.Tuple2; public class Top3 {
public static void main(String[] args) {
SparkConf conf = new SparkConf().setAppName("Top3Java").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf); JavaRDD<String> lines = sc.textFile("D:\\test-file\\top.txt"); JavaPairRDD<Integer, String> pairs = lines.mapToPair(new PairFunction<String, Integer, String>() { private static final long serialVersionUID = 1L; @Override
public Tuple2<Integer, String> call(String t) throws Exception {
return new Tuple2<Integer, String>(Integer.valueOf(t), t);
}
}); JavaPairRDD<Integer, String> sortedPairs = pairs.sortByKey(false);
JavaRDD<Integer> sortedNumbers = sortedPairs.map(new Function<Tuple2<Integer,String>, Integer>() { private static final long serialVersionUID = 1L; @Override
public Integer call(Tuple2<Integer, String> v1) throws Exception {
return v1._1;
}
}); List<Integer> sortedNumberList = sortedNumbers.take(3); //此时sortedNumberList是: [9, 7, 6]
for(Integer num : sortedNumberList) {
System.out.println(num);
} sc.close(); }
} ---------scala实现---------- package cn.spark.study.core import org.apache.spark.SparkConf
import org.apache.spark.SparkContext /**
* @author Administrator
*/
object Top3 { def main(args: Array[String]): Unit = {
val conf = new SparkConf()
.setAppName("Top3")
.setMaster("local")
val sc = new SparkContext(conf) val lines = sc.textFile("D:\\test-file\\top.txt", 1)
val pairs = lines.map { line => (line.toInt, line) }
val sortedPairs = pairs.sortByKey(false)
val sortedNumbers = sortedPairs.map(sortedPair => sortedPair._1)
val top3Number = sortedNumbers.take(3) for(num <- top3Number) {
println(num)
}
} }

3、对每个班级内的学生成绩,取出前3名。(分组取topn)

----java实现-----

package cn.spark.study.core;

import java.util.Arrays;
import java.util.Iterator; import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction; import scala.Tuple2; /**
* 分组取top3
* @author Administrator
*
*/
public class GroupTop3 { public static void main(String[] args) {
SparkConf conf = new SparkConf()
.setAppName("Top3")
.setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf); JavaRDD<String> lines = sc.textFile("D:\\test-file\\score.txt"); JavaPairRDD<String, Integer> pairs = lines.mapToPair( new PairFunction<String, String, Integer>() { private static final long serialVersionUID = 1L; @Override
public Tuple2<String, Integer> call(String line) throws Exception {
String[] lineSplited = line.split(" ");
return new Tuple2<String, Integer>(lineSplited[0],
//Integer.valueOf()可以将基本类型int转换为包装类型Integer,或者将String转换成Integer,String如果为Null或“”都会报错;
Integer.valueOf(lineSplited[1]));
} }); JavaPairRDD<String, Iterable<Integer>> groupedPairs = pairs.groupByKey(); JavaPairRDD<String, Iterable<Integer>> top3Score = groupedPairs.mapToPair( new PairFunction<Tuple2<String,Iterable<Integer>>, String, Iterable<Integer>>() { private static final long serialVersionUID = 1L; @Override
public Tuple2<String, Iterable<Integer>> call(
Tuple2<String, Iterable<Integer>> classScores)
throws Exception {
Integer[] top3 = new Integer[3]; String className = classScores._1;
Iterator<Integer> scores = classScores._2.iterator(); while(scores.hasNext()) {
Integer score = scores.next(); for(int i = 0; i < 3; i++) {
if(top3[i] == null) {
top3[i] = score;
break;
} else if(score > top3[i]) {
for(int j = 2; j > i; j--) {
top3[j] = top3[j - 1];
} top3[i] = score; break;
}
}
} return new Tuple2<String,
Iterable<Integer>>(className, Arrays.asList(top3));
} }); top3Score.foreach(new VoidFunction<Tuple2<String,Iterable<Integer>>>() { private static final long serialVersionUID = 1L; @Override
public void call(Tuple2<String, Iterable<Integer>> t) throws Exception {
System.out.println("class: " + t._1);
Iterator<Integer> scoreIterator = t._2.iterator();
while(scoreIterator.hasNext()) {
Integer score = scoreIterator.next();
System.out.println(score);
}
System.out.println("=======================================");
} }); sc.close();
} } -----scala实现------ package cn.spark.study.core import org.apache.spark.SparkConf
import org.apache.spark.SparkContext object GroupTop3 {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("GroupTop3Scala").setMaster("local")
val context = new SparkContext(conf)
val linesRDD = context.textFile("D:\\test-file\\score.txt")
val studentScores = linesRDD.map(line => (line.split(" ")(0), line.split(" ")(1).toInt))
val groupStudentScores = studentScores.groupByKey()
val result = groupStudentScores.map(student => {
val maxScore = new Array[Int](3)
val scores = student._2
for(score <- scores) {
var flag = true
for(i <- 0 until maxScore.length if flag) {
if(maxScore(i) == Nil) {
maxScore(i) = score
flag = false
}else{
if(maxScore(i) < score) {
for(j <- (i + 1 to maxScore.length - 1).reverse){
maxScore(j) = maxScore(j - 1)
}
maxScore(i) = score
flag = false
}
}
}
}
(student._1, maxScore)
}) result.foreach(result =>{
print(result._1 + "班级前三明成绩为:")
for(i <- 0 until result._2.length) {
if(i == 0) print(result._2(i))
else print("," + result._2(i))
}
println()
})
}
}

10、spark高级编程的更多相关文章

  1. Learning Spark中文版--第六章--Spark高级编程(2)

    Working on a Per-Partition Basis(基于分区的操作) 以每个分区为基础处理数据使我们可以避免为每个数据项重做配置工作.如打开数据库连接或者创建随机数生成器这样的操作,我们 ...

  2. Learning Spark中文版--第六章--Spark高级编程(1)

    Introduction(介绍) 本章介绍了之前章节没有涵盖的高级Spark编程特性.我们介绍两种类型的共享变量:用来聚合信息的累加器和能有效分配较大值的广播变量.基于对RDD现有的transform ...

  3. spark高级编程

    启动spark-shell 如果你有一个Hadoop 集群, 并且Hadoop 版本支持YARN, 通过为Spark master 设定yarn-client 参数值,就可以在集群上启动Spark 作 ...

  4. C#高级编程笔记 (6至10章节)运算符/委托/字符/正则/集合

    数学的复习,4^-2即是1/4/4的意思, 4^2是1*2*2的意思,而10^-2为0.01! 7.2运算符 符号 说明 例   ++ 操作数加1 int i=3; j=i++; 运算后i的值为4,j ...

  5. Spark Graphx编程指南

    问题导读1.GraphX提供了几种方式从RDD或者磁盘上的顶点和边集合构造图?2.PageRank算法在图中发挥什么作用?3.三角形计数算法的作用是什么?Spark中文手册-编程指南Spark之一个快 ...

  6. Apache Spark 2.2.0 中文文档 - Spark Streaming 编程指南 | ApacheCN

    Spark Streaming 编程指南 概述 一个入门示例 基础概念 依赖 初始化 StreamingContext Discretized Streams (DStreams)(离散化流) Inp ...

  7. Apache Spark 2.2.0 中文文档 - Spark Streaming 编程指南

    Spark Streaming 编程指南 概述 一个入门示例 基础概念 依赖 初始化 StreamingContext Discretized Streams (DStreams)(离散化流) Inp ...

  8. jQuery高级编程

    jquery高级编程1.jquery入门2.Javascript基础3.jQuery核心技术 3.1 jQuery脚本的结构 3.2 非侵扰事JavaScript 3.3 jQuery框架的结构 3. ...

  9. unix环境高级编程基础知识之第二篇(3)

    看了unix环境高级编程第三章,把代码也都自己敲了一遍,另主要讲解了一些IO函数,read/write/fseek/fcntl:这里主要是c函数,比较容易,看多了就熟悉了.对fcntl函数讲解比较到位 ...

随机推荐

  1. LOJ3146 APIO2019路灯(cdq分治+树状数组)

    每个时刻都形成若干段满足段内任意两点可达.将其视为若干正方形.则查询相当于求历史上某点被正方形包含的时刻数量.并且注意到每个时刻只有O(1)个正方形出现或消失,那么求出每个矩形的出现时间和消失时间,就 ...

  2. Java线程同步类容器和并发容器(四)

    同步类容器都是线程安全的,在某些场景下,需要枷锁保护符合操作,最经典ConcurrentModifiicationException,原因是当容器迭代的过程中,被并发的修改了内容. for (Iter ...

  3. REDISTEMPLATE如何注入到VALUEOPERATIONS

    REDISTEMPLATE如何注入到VALUEOPERATIONS 今天看到Spring操作redis  是可以将redisTemplate注入到ValueOperations,避免了ValueOpe ...

  4. 作业调度框架Quartz.NET-现学现用-01-快速入门

    原文:作业调度框架Quartz.NET-现学现用-01-快速入门 前言 你需要应用执行一个任务吗?这个任务每天或每周星期二晚上11:30,或许仅仅每个月的最后一天执行.一个自动执行而无须干预的任务在执 ...

  5. easyui-datagrid 假分页

    假分页就是将数据一下全查出来,利用前端来把所有数据进行分页

  6. Web应用的生命周期(客户端)

    典型的一个Web应用的生命周期从用户在浏览器输入一串URL,或者单击一个链接开始(就是访问一个页面).而这个生命周期的结束就是我们关闭这个页面. 响应流程: 用户输入一个 URL(生命周期开始) 客户 ...

  7. layui.js源码分析

      /*! @Title: Layui @Description:经典模块化前端框架 @Site: www.layui.com @Author: 贤心 @License:MIT */ ;!functi ...

  8. SQL----EXISTS 关键字

    转自:http://blog.sina.com.cn/s/blog_65dbc6df0100mvfx.html 1.EXISTS基本意思 英语解释就是存在,不过他的意思也差不多,相当于存在量词'З'. ...

  9. redis 异常 MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk

    MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. 解决方 ...

  10. javascript_15-undefined 和 is not defined 的区别

    undefined 和 is not defined //1 console.log(a); // is not defined //2 var a; console.log(a); //undefi ...