Key-Value Pair Operations in Spark
1. Introduction to PairRDD

Spark provides a number of operations specific to RDDs of key-value pairs; such an RDD is called a pair RDD (JavaPairRDD in the Java API). Pair RDDs expose interfaces for operating on each key in parallel and for regrouping data across nodes. They are commonly used for aggregation, and an initial ETL step often turns raw data into key-value form. The example below builds a pair RDD from a plain RDD of lines with mapToPair(), taking the first word of each line as the key and the whole line as the value:
List<String> list = new ArrayList<String>();
list.add("this is a test");
list.add("how are you?");
list.add("do you love me?");
list.add("can you tell me?");
JavaRDD<String> lines = sc.parallelize(list);
JavaPairRDD<String, String> pairs = lines.mapToPair(
        new PairFunction<String, String, String>() {
            public Tuple2<String, String> call(String s) throws Exception {
                // use the first word as the key and the whole line as the value
                return new Tuple2<String, String>(s.split(" ")[0], s);
            }
        }
);
2. Transformations on Pair RDDs

Transformations on two pair RDDs (the examples assume rdd = {(1, 2), (3, 4), (3, 6)} and other = {(3, 9)}):

| Function | Purpose | Example | Result |
| --- | --- | --- | --- |
| subtractByKey | Remove elements whose key is present in the other RDD | rdd.subtractByKey(other) | {(1, 2)} |
| join | Inner join of two RDDs | rdd.join(other) | {(3, (4, 9)), (3, (6, 9))} |
| rightOuterJoin | Right outer join of two RDDs | rdd.rightOuterJoin(other) | {(3, (4, 9)), (3, (6, 9))} |
| leftOuterJoin | Left outer join of two RDDs | rdd.leftOuterJoin(other) | {(1, (2, None)), (3, (4, 9)), (3, (6, 9))} |
| cogroup | Group data sharing the same key from both RDDs | rdd.cogroup(other) | {(1, ([2], [])), (3, ([4, 6], [9]))} |
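To make the table concrete, here is a minimal sketch (assuming the same JavaSparkContext sc used throughout this article) that reproduces the join and subtractByKey rows:

List<Tuple2<Integer, Integer>> rddList = Arrays.asList(
        new Tuple2<Integer, Integer>(1, 2),
        new Tuple2<Integer, Integer>(3, 4),
        new Tuple2<Integer, Integer>(3, 6));
List<Tuple2<Integer, Integer>> otherList = Arrays.asList(
        new Tuple2<Integer, Integer>(3, 9));
JavaPairRDD<Integer, Integer> rdd = sc.parallelizePairs(rddList);
JavaPairRDD<Integer, Integer> other = sc.parallelizePairs(otherList);
// inner join keeps only keys present in both RDDs
System.out.println(rdd.join(other).collect());          // [(3,(4,9)), (3,(6,9))]
// subtractByKey drops every pair whose key appears in other
System.out.println(rdd.subtractByKey(other).collect()); // [(1,2)]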
Pair RDDs are still RDDs, so ordinary transformations also apply. Here filter keeps only the pairs whose value (the full line) is shorter than 20 characters:

JavaPairRDD<String, String> result = pairs.filter(
        new Function<Tuple2<String, String>, Boolean>() {
            public Boolean call(Tuple2<String, String> value) throws Exception {
                return value._2().length() < 20;
            }
        }
);
for (Tuple2<String, String> tuple : result.collect()) {
    System.out.println(tuple._1() + ": " + tuple._2());
}
strLine.add("how are you");
strLine.add("I am ok");
strLine.add("do you love me");
JavaRDD<String> input=sc.parallelize(strLine);
JavaRDD<String> words=input.flatMap(
new FlatMapFunction<String, String>() {
public Iterable<String> call(String s) throws Exception {
return Arrays.asList(s.split(" "));
}
}
);
JavaPairRDD<String,Integer> result=words.mapToPair(
new PairFunction<String, String, Integer>() {
public Tuple2<String, Integer> call(String s) throws Exception {
return new Tuple2(s, 1);
}
}
).reduceByKey(
new org.apache.spark.api.java.function.Function2<Integer, Integer, Integer>() {
public Integer call(Integer v1, Integer v2) throws Exception {
return v1 + v2;
}
}
) ;
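If only the counts are needed on the driver, the same result can be had more compactly with the countByValue() action on the words RDD from above (a sketch; countByValue returns a java.util.Map from value to count):

Map<String, Long> counts = words.countByValue();
System.out.println(counts);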
foldByKey is similar to fold on a plain RDD: it takes a zero value of the same type as the values plus a merge function, and combines the values of each key:

List<Tuple2<Integer, Integer>> list = new ArrayList<Tuple2<Integer, Integer>>();
list.add(new Tuple2<Integer, Integer>(1, 1));
list.add(new Tuple2<Integer, Integer>(1, 3));
list.add(new Tuple2<Integer, Integer>(2, 2));
list.add(new Tuple2<Integer, Integer>(2, 8));
JavaPairRDD<Integer, Integer> nums = sc.parallelizePairs(list);
JavaPairRDD<Integer, Integer> results = nums.foldByKey(0, new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer v1, Integer v2) throws Exception {
        return v1 + v2;
    }
});
// prints 1->4 and 2->10
for (Tuple2<Integer, Integer> tuple : results.collect()) {
    System.out.println(tuple._1() + "->" + tuple._2());
}
combineByKey is the most general per-key aggregation function. It takes three functions: createCombiner (build the initial accumulator when a key is first seen in a partition), mergeValue (fold another value into an existing accumulator), and mergeCombiners (merge accumulators from different partitions). The complete program below computes a per-key average:

public class AvgCount implements Serializable {
    private int total_;
    private int num_;

    public AvgCount(int total, int num) {
        total_ = total;
        num_ = num;
    }

    public float avg() {
        return total_ / (float) num_;
    }

    // createCombiner(): the accumulator for a newly seen key
    static Function<Integer, AvgCount> createAcc = new Function<Integer, AvgCount>() {
        public AvgCount call(Integer x) {
            return new AvgCount(x, 1);
        }
    };

    // mergeValue(): fold one more value into an accumulator
    static Function2<AvgCount, Integer, AvgCount> addAndCount = new Function2<AvgCount, Integer, AvgCount>() {
        public AvgCount call(AvgCount a, Integer x) throws Exception {
            a.total_ += x;
            a.num_ += 1;
            return a;
        }
    };

    // mergeCombiners(): merge accumulators built in different partitions
    static Function2<AvgCount, AvgCount, AvgCount> combine = new Function2<AvgCount, AvgCount, AvgCount>() {
        public AvgCount call(AvgCount a, AvgCount b) throws Exception {
            a.total_ += b.total_;
            a.num_ += b.num_;
            return a;
        }
    };

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local").setAppName("my app");
        JavaSparkContext sc = new JavaSparkContext(conf);
        List<Tuple2<Integer, Integer>> list = new ArrayList<Tuple2<Integer, Integer>>();
        list.add(new Tuple2<Integer, Integer>(1, 1));
        list.add(new Tuple2<Integer, Integer>(1, 3));
        list.add(new Tuple2<Integer, Integer>(2, 2));
        list.add(new Tuple2<Integer, Integer>(2, 8));
        JavaPairRDD<Integer, Integer> nums = sc.parallelizePairs(list);
        JavaPairRDD<Integer, AvgCount> avgCounts = nums.combineByKey(createAcc, addAndCount, combine);
        Map<Integer, AvgCount> countMap = avgCounts.collectAsMap();
        for (Map.Entry<Integer, AvgCount> entry : countMap.entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue().avg());
        }
    }
}
These aggregation functions can also take the number of partitions for the returned RDD as an extra argument; for example, to spread the result over 10 partitions:

JavaPairRDD<Integer, AvgCount> avgCounts = nums.combineByKey(createAcc, addAndCount, combine, 10);
4. Grouping Data

groupByKey gathers the values sharing a key into an Iterable:

List<Tuple2<Integer, Integer>> list1 = new ArrayList<Tuple2<Integer, Integer>>();
list1.add(new Tuple2<Integer, Integer>(1, 1));
list1.add(new Tuple2<Integer, Integer>(2, 2));
list1.add(new Tuple2<Integer, Integer>(1, 3));
list1.add(new Tuple2<Integer, Integer>(2, 4));
JavaPairRDD<Integer, Integer> nums1 = sc.parallelizePairs(list1);
JavaPairRDD<Integer, Iterable<Integer>> results = nums1.groupByKey();
// iterate over the result; note how the Iterable values are traversed
for (Tuple2<Integer, Iterable<Integer>> tuple : results.collect()) {
    System.out.print(tuple._1() + ": ");
    Iterator<Integer> it = tuple._2().iterator();
    while (it.hasNext()) {
        System.out.print(it.next() + " ");
    }
    System.out.println();
}

Note that when grouping is only a prelude to an aggregation, reduceByKey, foldByKey, or combineByKey is usually more efficient, since values are combined locally before the shuffle.
cogroup groups the data from two RDDs sharing the same key; for each key it yields a tuple of two Iterables, one per source RDD:

List<Tuple2<Integer, Integer>> list1 = new ArrayList<Tuple2<Integer, Integer>>();
List<Tuple2<Integer, Integer>> list2 = new ArrayList<Tuple2<Integer, Integer>>();
list1.add(new Tuple2<Integer, Integer>(1, 1));
list1.add(new Tuple2<Integer, Integer>(2, 2));
list1.add(new Tuple2<Integer, Integer>(1, 3));
list1.add(new Tuple2<Integer, Integer>(2, 4));
list1.add(new Tuple2<Integer, Integer>(3, 4));
list2.add(new Tuple2<Integer, Integer>(1, 1));
list2.add(new Tuple2<Integer, Integer>(1, 3));
list2.add(new Tuple2<Integer, Integer>(2, 3));
JavaPairRDD<Integer, Integer> nums1 = sc.parallelizePairs(list1);
JavaPairRDD<Integer, Integer> nums2 = sc.parallelizePairs(list2);
JavaPairRDD<Integer, Tuple2<Iterable<Integer>, Iterable<Integer>>> results = nums1.cogroup(nums2);
for (Tuple2<Integer, Tuple2<Iterable<Integer>, Iterable<Integer>>> tuple : results.collect()) {
    System.out.print(tuple._1() + " [ ");
    Iterator<Integer> it1 = tuple._2()._1().iterator();
    while (it1.hasNext()) {
        System.out.print(it1.next() + " ");
    }
    System.out.print("] [ ");
    Iterator<Integer> it2 = tuple._2()._2().iterator();
    while (it2.hasNext()) {
        System.out.print(it2.next() + " ");
    }
    System.out.print("] \n");
}
5. Sorting Data

On a plain RDD, sortBy sorts by the result of a key function; the second argument selects ascending (true) or descending (false) order and the third is the number of partitions:

JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(1, 5, 3, 2, 6, 3));
JavaRDD<Integer> results = nums.sortBy(new Function<Integer, Object>() {
    public Object call(Integer v1) throws Exception {
        return v1;
    }
}, false, 1);
for (Integer a : results.collect()) {
    System.out.println(a);
}
On a pair RDD, sortByKey sorts by key and accepts a custom Comparator, which must also be Serializable so it can be shipped to the executors:

List<Tuple2<Integer, Integer>> list1 = new ArrayList<Tuple2<Integer, Integer>>();
list1.add(new Tuple2<Integer, Integer>(1, 1));
list1.add(new Tuple2<Integer, Integer>(2, 2));
list1.add(new Tuple2<Integer, Integer>(1, 3));
list1.add(new Tuple2<Integer, Integer>(2, 4));
list1.add(new Tuple2<Integer, Integer>(3, 4));
JavaPairRDD<Integer, Integer> nums1 = sc.parallelizePairs(list1);
class Comp implements Comparator<Integer>, Serializable {
    public int compare(Integer a, Integer b) {
        return a.compareTo(b);
    }
}
JavaPairRDD<Integer, Integer> results = nums1.sortByKey(new Comp());
for (Tuple2<Integer, Integer> tuple : results.collect()) {
    System.out.println(tuple._1() + ": " + tuple._2());
}
6. Data Partitioning

A common use case: userData maps each user to the topics they subscribe to, and events logs the links each user actually visited. Joining the two and filtering counts the visits to links outside the user's subscriptions:

List<Tuple2<String, Iterable<String>>> list1 = new ArrayList<Tuple2<String, Iterable<String>>>();
list1.add(new Tuple2<String, Iterable<String>>("zhou", Arrays.asList("it", "math")));
list1.add(new Tuple2<String, Iterable<String>>("gan", Arrays.asList("money", "book")));
JavaPairRDD<String, Iterable<String>> userData = sc.parallelizePairs(list1);
List<Tuple2<String, String>> list2 = new ArrayList<Tuple2<String, String>>();
list2.add(new Tuple2<String, String>("zhou", "it"));
list2.add(new Tuple2<String, String>("zhou", "stock"));
list2.add(new Tuple2<String, String>("gan", "money"));
list2.add(new Tuple2<String, String>("gan", "book"));
JavaPairRDD<String, String> events = sc.parallelizePairs(list2);
JavaPairRDD<String, Tuple2<Iterable<String>, String>> joined = userData.join(events);
long a = joined.filter(
        new Function<Tuple2<String, Tuple2<Iterable<String>, String>>, Boolean>() {
            public Boolean call(Tuple2<String, Tuple2<Iterable<String>, String>> tuple) throws Exception {
                boolean has = false;
                Iterable<String> user = tuple._2()._1();
                String link = tuple._2()._2();
                for (String s : user) {
                    if (s.compareTo(link) == 0) {
                        has = true;
                        break;
                    }
                }
                // keep only the visits whose link is not in the user's subscription list
                return !has;
            }
        }
).count();
System.out.println(a);
Running this repeatedly would hash and shuffle userData on every join. partitionBy fixes a partitioning for the RDD once; because partitionBy is a transformation that returns a new RDD, the result should be persisted so it is not recomputed and re-shuffled:

List<Tuple2<String, Iterable<String>>> list1 = new ArrayList<Tuple2<String, Iterable<String>>>();
list1.add(new Tuple2<String, Iterable<String>>("zhou", Arrays.asList("it", "math")));
list1.add(new Tuple2<String, Iterable<String>>("gan", Arrays.asList("money", "book")));
JavaPairRDD<String, Iterable<String>> userData = sc.parallelizePairs(list1);
// partitionBy is a transformation: persist the new RDD it returns
userData = userData.partitionBy(new HashPartitioner(100)).persist(StorageLevel.MEMORY_AND_DISK());
Optional<Partitioner> partitioner = userData.partitioner();
System.out.println(partitioner.isPresent()); // true
System.out.println(partitioner.get());       // the HashPartitioner set above
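Once userData is partitioned and persisted this way, a subsequent join can reuse its partitioning rather than re-shuffling it. A sketch, reusing the events RDD from the example above:

// userData keeps its HashPartitioner; only events is shuffled to match it
JavaPairRDD<String, Tuple2<Iterable<String>, String>> joined = userData.join(events);
System.out.println(joined.count());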
A larger example: a simplified PageRank. links maps each page to the pages it links to, ranks starts every page at 1.0, and each iteration distributes a page's rank evenly over its outgoing links, then recomputes every rank as 0.15 + 0.85 × the contributions received (Iterables below is Guava's com.google.common.collect.Iterables):

public class PageRank {
    private static class Sum implements Function2<Double, Double, Double> {
        public Double call(Double a, Double b) {
            return a + b;
        }
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf();
        conf.setAppName("my spark app");
        conf.setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> inputs = sc.textFile("C:\\url.txt");
        /*
        # contents of url.txt (each line: a page, then one page it links to):
        www.baidu.com www.hao123.com
        www.baidu.com www.2345.com
        www.baidu.com www.zhouyang.com
        www.hao123.com www.baidu.com
        www.hao123.com www.zhouyang.com
        www.zhouyang.com www.baidu.com
        */
        JavaPairRDD<String, Iterable<String>> links = inputs.mapToPair(
                new PairFunction<String, String, String>() {
                    public Tuple2<String, String> call(String s) throws Exception {
                        String[] parts = s.split(" ");
                        return new Tuple2<String, String>(parts[0], parts[1]);
                    }
                }
        ).distinct().groupByKey().cache();
        JavaPairRDD<String, Double> ranks = links.mapValues(
                new Function<Iterable<String>, Double>() {
                    public Double call(Iterable<String> v1) throws Exception {
                        return 1.0;
                    }
                }
        );
        for (int current = 0; current < 10; current++) {
            // calculate each URL's contribution to the rank of the URLs it links to
            JavaPairRDD<String, Double> contribs = links.join(ranks).values()
                    .flatMapToPair(new PairFlatMapFunction<Tuple2<Iterable<String>, Double>, String, Double>() {
                        public Iterable<Tuple2<String, Double>> call(Tuple2<Iterable<String>, Double> s) {
                            int urlCount = Iterables.size(s._1());
                            List<Tuple2<String, Double>> results = new ArrayList<Tuple2<String, Double>>();
                            for (String n : s._1()) {
                                results.add(new Tuple2<String, Double>(n, s._2() / urlCount));
                            }
                            return results;
                        }
                    });
            // recalculate URL ranks from the neighbor contributions
            ranks = contribs.reduceByKey(new Sum()).mapValues(new Function<Double, Double>() {
                public Double call(Double sum) {
                    return 0.15 + sum * 0.85;
                }
            });
        }
        // collect all URL ranks and dump them to the console
        List<Tuple2<String, Double>> output = ranks.collect();
        for (Tuple2<?, ?> tuple : output) {
            System.out.println(tuple._1() + " has rank: " + tuple._2() + ".");
        }
    }
}
Finally, a custom partitioner: extend org.apache.spark.Partitioner and override numPartitions(), getPartition(), and equals(). The partitioner below hashes only the domain of a URL, so all pages from the same host end up in the same partition (a matching hashCode() is added here, since equals() is overridden):

class MyPartitioner extends Partitioner {
    int num; // number of partitions

    public MyPartitioner(int num) { // constructor: set the partition count
        this.num = num;
    }

    @Override
    public int numPartitions() { // report the number of partitions
        return num;
    }

    @Override
    public int getPartition(Object key) {
        String url = (String) key;
        String domain = "";
        try {
            domain = new URL(url).getHost(); // extract the domain name
        } catch (MalformedURLException e) {
            e.printStackTrace();
        }
        int code = domain.hashCode() % num; // hash the domain into a partition
        if (code < 0) { // getPartition() must return a non-negative number
            code += num;
        }
        return code;
    }

    @Override
    public boolean equals(Object obj) {
        if (obj instanceof MyPartitioner) { // same type of partitioner?
            return ((MyPartitioner) obj).num == num; // equal iff same partition count
        } else {
            return false;
        }
    }

    @Override
    public int hashCode() { // equals() is overridden, so hashCode() must agree with it
        return num;
    }
}
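A brief usage sketch (hypothetical data; MyPartitioner is the class above, and the keys must be full URLs so that getHost() succeeds):

JavaPairRDD<String, Integer> pages = sc.parallelizePairs(Arrays.asList(
        new Tuple2<String, Integer>("http://www.baidu.com/a", 1),
        new Tuple2<String, Integer>("http://www.baidu.com/b", 2),
        new Tuple2<String, Integer>("http://www.hao123.com/c", 3)));
// all keys sharing a domain now live in the same partition
JavaPairRDD<String, Integer> partitioned = pages.partitionBy(new MyPartitioner(4));
System.out.println(partitioned.partitioner().get());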