MapReduce: Top N, Secondary Sort, and Map-Side Join

Top N

Given an input of (key, value) pairs, we want to build a Top N list. This is a filtering pattern: it examines a particular subset of the input data, for example to observe user behavior.

Solution

Assume first that every key is unique (the extra aggregation needed for non-unique keys is covered later). Split the input into chunks and send each chunk to a mapper. Each mapper builds a local Top N list and sends it to a single reducer, which produces the final Top N list. For most MapReduce algorithms, funneling all data into one reducer would unbalance the load and create a bottleneck, but here each mapper emits only N records: with, say, 1000 mappers, the lone reducer handles just 1000 × N records.

Data

Approach

The computation passes data between stages as key/value pairs.

Maintain what is effectively a min-heap out of a fixed-size sorted structure; the usual choices are SortedMap<K, V> and TreeMap<K, V>. Whenever topN.size() > N, remove the first entry (the one with the smallest count). To compute a Bottom N instead, treat the structure as a max-heap and evict the last (largest) entry.
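The fixed-size TreeMap trick can be sketched outside MapReduce. This is a minimal illustration, not code from the original post; the class name, threshold, and sample data are all made up for the example:

```java
import java.util.TreeMap;

public class TopNSketch {

    // Keep only the n entries with the largest keys, mirroring the
    // mapper's fixed-size TreeMap technique.
    public static TreeMap<Integer, String> topN(int[] counts, String[] ids, int n) {
        TreeMap<Integer, String> top = new TreeMap<Integer, String>();
        for (int i = 0; i < counts.length; i++) {
            top.put(counts[i], ids[i]);
            if (top.size() > n) {
                // firstKey() is the smallest key in a TreeMap, so removing
                // it keeps the n largest entries seen so far.
                top.remove(top.firstKey());
            }
        }
        return top;
    }

    public static void main(String[] args) {
        int[] counts = {5, 17, 2, 42, 9};
        String[] ids = {"a", "b", "c", "d", "e"};
        // The surviving entries, ascending by count, are 9=e, 17=b, 42=d.
        System.out.println(topN(counts, ids, 3));
    }
}
```

Note that because the map is keyed by count, two records with the same count overwrite each other; the mapper below shares this limitation.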

Code

Mapper

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;
import java.util.TreeMap;

public class TopTenMapper extends Mapper<Object, Text, NullWritable, Text> {

    // Keyed by reputation; TreeMap keeps keys sorted ascending, so
    // firstKey() is always the smallest reputation seen so far.
    // Records with equal reputations overwrite one another.
    private TreeMap<Integer, Text> visitTimesMap = new TreeMap<Integer, Text>();

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        if (value == null) {
            return;
        }
        String[] strs = value.toString().split(" ");
        if (strs.length < 2) {
            return;
        }
        String reputation = strs[1];
        visitTimesMap.put(Integer.parseInt(reputation), new Text(value));
        if (visitTimesMap.size() > 10) {
            visitTimesMap.remove(visitTimesMap.firstKey());
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // Emit this mapper's local top ten once all of its input has been seen.
        for (Text t : visitTimesMap.values()) {
            context.write(NullWritable.get(), t);
        }
    }
}

Reducer

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;
import java.util.TreeMap;

public class TopTenReduce extends Reducer<NullWritable, Text, NullWritable, Text> {

    private TreeMap<Integer, Text> visitTimesMap = new TreeMap<Integer, Text>();

    @Override
    protected void reduce(NullWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Merge the local top-ten lists from every mapper, again keeping
        // only the ten records with the largest reputations.
        for (Text value : values) {
            String[] strs = value.toString().split(" ");
            visitTimesMap.put(Integer.parseInt(strs[1]), new Text(value));
            if (visitTimesMap.size() > 10) {
                visitTimesMap.remove(visitTimesMap.firstKey());
            }
        }
        for (Text t : visitTimesMap.values()) {
            context.write(NullWritable.get(), t);
        }
    }
}

Job

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TopTenJob {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        if (args.length != 2) {
            System.out.println("Usage: TopTenDriver <in> <out>");
            System.exit(2);
        }
        Job job = Job.getInstance(conf);
        job.setJarByClass(TopTenJob.class);
        job.setMapperClass(TopTenMapper.class);
        job.setReducerClass(TopTenReduce.class);
        // A single reducer merges the mappers' local top-ten lists.
        job.setNumReduceTasks(1);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Result

Solution for Non-Unique Keys

Now suppose the input keys are not unique, i.e. the same key can appear with different values. The problem is then solved in two stages.

Stage 1: run a MapReduce job that groups tuples sharing a key and sums their counts, converting the non-unique keys into unique ones. Its output is unique pairs <key, value> where value is the summed count.

Stage 2: feed the unique key/value pairs from stage 1 into the unique-key solution above to compute the Top N.
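Stage 1 is ordinary key aggregation (the same shape as word count). Its effect can be sketched in plain Java; the class name and sample data here are illustrative, not from the original post:

```java
import java.util.HashMap;
import java.util.Map;

public class AggregateSketch {

    // Collapse duplicate keys by summing their counts, mimicking what the
    // stage-1 MapReduce job does across its mappers and reducers.
    public static Map<String, Integer> aggregate(String[][] pairs) {
        Map<String, Integer> totals = new HashMap<String, Integer>();
        for (String[] p : pairs) {
            String key = p[0];
            int count = Integer.parseInt(p[1]);
            Integer old = totals.get(key);
            totals.put(key, old == null ? count : old + count);
        }
        // Every key is now unique; these pairs feed the unique-key Top N job.
        return totals;
    }

    public static void main(String[] args) {
        String[][] pairs = {{"cat", "2"}, {"dog", "3"}, {"cat", "5"}};
        System.out.println(aggregate(pairs));
    }
}
```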

Total Order Sort

Generating Random Data

# createdata.sh: emit 100000 random integers (bash $RANDOM, range 0-32767)
for i in {1..100000}
do
    echo $RANDOM
done

sh createdata.sh > data1
sh createdata.sh > data2
sh createdata.sh > data3
sh createdata.sh > data4

MyPartitioner

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Partitioner;

public class MyPartitioner extends Partitioner<IntWritable, IntWritable> {

    @Override
    public int getPartition(IntWritable key, IntWritable value, int numPartitions) {
        int keyInt = key.get();
        // $RANDOM values are distributed fairly evenly over [0, 32767],
        // so these cut points give three partitions of comparable size.
        if (keyInt > 20000) {
            return 2;
        } else if (keyInt > 10000) {
            return 1;
        } else {
            return 0;
        }
    }
}

Mysort

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class Mysort extends WritableComparator {

    public Mysort() {
        super(IntWritable.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        IntWritable v1 = (IntWritable) a;
        IntWritable v2 = (IntWritable) b;
        // Reversed comparison: keys are sorted in descending order.
        return v2.compareTo(v1);
    }
}

SimpleMapper

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class SimpleMapper extends Mapper<LongWritable, Text, IntWritable, IntWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Convert the Text value to an IntWritable.
        IntWritable intValue = new IntWritable(Integer.parseInt(value.toString()));
        // Emit it as both key and value.
        context.write(intValue, intValue);
    }
}

SimpleReducer

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class SimpleReducer extends Reducer<IntWritable, IntWritable, IntWritable, NullWritable> {

    @Override
    protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Write one line per occurrence so duplicate numbers are preserved.
        for (IntWritable value : values) {
            context.write(value, NullWritable.get());
        }
    }
}

SimpleJob

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SimpleJob {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(SimpleJob.class);
        job.setPartitionerClass(MyPartitioner.class);
        job.setSortComparatorClass(Mysort.class);
        job.setMapperClass(SimpleMapper.class);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setReducerClass(SimpleReducer.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(NullWritable.class);
        // One reduce task per partition defined in MyPartitioner.
        job.setNumReduceTasks(3);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}

Total Order Sort by Sampling

With the hand-written partitioner above, partitions correspond one-to-one to reduce tasks, and the output files, read in order, form a totally sorted result. The risk is that the author may not know the data well enough: the partitions can end up holding very different numbers of records, which causes data skew.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import java.io.IOException;

public class TotalOrderingPartition extends Configured implements Tool {

    // Mapper: parse the text key into an IntWritable value.
    static class SimpleMapper extends Mapper<Text, Text, Text, IntWritable> {
        @Override
        protected void map(Text key, Text value, Context context)
                throws IOException, InterruptedException {
            IntWritable intWritable = new IntWritable(Integer.parseInt(key.toString()));
            context.write(key, intWritable);
        }
    }

    // Reducer: emit each value once per occurrence.
    static class SimpleReducer extends Reducer<Text, IntWritable, IntWritable, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            for (IntWritable value : values) {
                context.write(value, NullWritable.get());
            }
        }
    }

    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        Job job = Job.getInstance(conf);
        job.setJarByClass(TotalOrderingPartition.class);
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setNumReduceTasks(3);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(NullWritable.class);
        // Sample the input to build the partition file that
        // TotalOrderPartitioner will use.
        TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), new Path(args[2]));
        InputSampler.Sampler<Text, Text> sampler =
                new InputSampler.SplitSampler<Text, Text>(1000, 10);
        InputSampler.writePartitionFile(job, sampler);
        job.setPartitionerClass(TotalOrderPartitioner.class);
        job.setMapperClass(SimpleMapper.class);
        job.setReducerClass(SimpleReducer.class);
        job.setJobName("Ite Blog");
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // Hard-coded paths: input, output, and the partition file.
        args = new String[]{"/input/simple", "/output/random_simple1", "/output/random_simple2"};
        int exitCode = ToolRunner.run(new TotalOrderingPartition(), args);
        System.exit(exitCode);
    }
}

By sampling the input to choose partition boundaries, each partition receives roughly the same number of records, so the job's total running time is not dragged out by one reducer finishing late.
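The idea behind sampled boundaries can be sketched in plain Java: sort a sample, pick numPartitions - 1 evenly spaced cut points, then route each key by binary search. The class name and sample data are illustrative, not from the original post or the Hadoop API:

```java
import java.util.Arrays;

public class SampledPartitionSketch {

    // Pick numPartitions - 1 cut points from a sorted sample so that each
    // partition receives roughly the same share of keys.
    public static int[] cutPoints(int[] sample, int numPartitions) {
        int[] sorted = sample.clone();
        Arrays.sort(sorted);
        int[] cuts = new int[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            cuts[i - 1] = sorted[i * sorted.length / numPartitions];
        }
        return cuts;
    }

    // Route a key to the partition whose range contains it.
    public static int partition(int key, int[] cuts) {
        int p = Arrays.binarySearch(cuts, key);
        // binarySearch returns -(insertionPoint) - 1 when the key is absent.
        return p >= 0 ? p : -p - 1;
    }
}
```

TotalOrderPartitioner does essentially this, except the cut points come from the partition file written by InputSampler.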

Secondary Sort

Unlike a total order sort, a secondary sort orders, within the reduce phase, some attribute of the intermediate values associated with each intermediate key. The values delivered to each reducer can be sorted ascending or descending: records are first ordered by the first field, and records with equal first fields are then ordered by the second field, without disturbing the first-field order.

Solution

  1. Sort inside the reducer. The reducer receives (k2, list(v2)), sorts the list in its own memory to obtain an ordered list(v2), and then processes it. With too many values this easily overflows the heap.
  2. Use the MapReduce framework's own sort machinery to order the reducer values. Append part or all of the value to the natural key to build a composite key and sort on that. Because the sorting is done by the framework (spilling to disk as needed), memory does not overflow.

Workflow

  1. Build a composite key (K, v1), where K is the natural key and v1 is the secondary key. Whatever attribute the reducer values should be ordered by is injected into the natural key; together they form the composite key.

    For key k2, suppose the mappers generate the pairs (k2,A1),(k2,A2),…,(k2,An). Write each Ai as (ai,bi), where ai is the field to sort by and bi stands for the remaining m-tuple of fields (ai2,ai3,…,aim). The mapper output can then be written as:

    (k2,(a1,b1)),(k2,(a2,b2)),…,(k2,(an,bn))

    The natural key is k2; appending ai yields the composite key (k2,ai), so the mapper finally emits:

    ((k2,a1),(a1,b1)),((k2,a2),(a2,b2)),…,((k2,an),(an,bn))

  2. Plug two extra classes into the framework. A custom partitioner guarantees that all records with the same natural key reach the same reducer (one reducer may still receive several natural keys, say both k2 and k3); a custom grouping comparator then groups those records by natural key.

  3. With the grouping comparator in place, the values arrive at the reducer already ordered: the framework's sort guarantees that records reach the reducer sorted by natural key and, within each key, by value.

Implementation Classes

MyWritable

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class MyWritable implements WritableComparable<MyWritable> {

    private int first;
    private int second;

    public MyWritable() {
    }

    public MyWritable(int first, int second) {
        set(first, second);
    }

    public void set(int first, int second) {
        this.first = first;
        this.second = second;
    }

    public int getFirst() {
        return first;
    }

    public int getSecond() {
        return second;
    }

    public void write(DataOutput out) throws IOException {
        out.writeInt(first);
        out.writeInt(second);
    }

    public void readFields(DataInput in) throws IOException {
        first = in.readInt();
        second = in.readInt();
    }

    @Override
    public int hashCode() {
        return first * 163 + second;
    }

    @Override
    public boolean equals(Object o) {
        if (o instanceof MyWritable) {
            MyWritable ip = (MyWritable) o;
            return first == ip.first && second == ip.second;
        }
        return false;
    }

    @Override
    public String toString() {
        return first + "\t" + second;
    }

    public int compareTo(MyWritable ip) {
        int cmp = compare(first, ip.first);
        if (cmp != 0) {
            return cmp;
        }
        return compare(second, ip.second);
    }

    public static int compare(int a, int b) {
        return (a < b ? -1 : (a == b ? 0 : 1));
    }
}

MyComparator

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class MyComparator extends WritableComparator {

    protected MyComparator() {
        super(MyWritable.class, true);
    }

    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
        MyWritable ip1 = (MyWritable) w1;
        MyWritable ip2 = (MyWritable) w2;
        int cmp = MyWritable.compare(ip1.getFirst(), ip2.getFirst());
        if (cmp != 0) {
            return cmp;
        }
        // Negated: first field ascending, second field descending.
        return -MyWritable.compare(ip1.getSecond(), ip2.getSecond());
    }
}

MyPartitioner

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Partitioner;

public class MyPartitioner extends Partitioner<MyWritable, NullWritable> {

    @Override
    public int getPartition(MyWritable key, NullWritable value, int numPartitions) {
        // Partition on the natural key (the first field) only, so every
        // record with the same natural key lands in the same reducer.
        // (Hashing the whole composite key would scatter them.)
        return (key.getFirst() & Integer.MAX_VALUE) % numPartitions;
    }
}

MyGroup

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class MyGroup extends WritableComparator {

    protected MyGroup() {
        super(MyWritable.class, true);
    }

    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
        MyWritable ip1 = (MyWritable) w1;
        MyWritable ip2 = (MyWritable) w2;
        // Group by the natural key only.
        return MyWritable.compare(ip1.getFirst(), ip2.getFirst());
    }
}

SecondarySortMapper

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class SecondarySortMapper extends Mapper<LongWritable, Text, MyWritable, NullWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] strs = value.toString().split(" ");
        if (strs.length < 2) {
            return;
        }
        int year = Integer.parseInt(strs[0]);
        int temperature = Integer.parseInt(strs[1]);
        // The composite key carries both fields; no separate value is needed.
        context.write(new MyWritable(year, temperature), NullWritable.get());
    }
}

SecondarySortReducer

import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Reducer;

public class SecondarySortReducer extends Reducer<MyWritable, NullWritable, MyWritable, NullWritable> {

    @Override
    protected void reduce(MyWritable key, Iterable<NullWritable> values, Context context)
            throws IOException, InterruptedException {
        // Iterating the values makes the grouped key expose each composite
        // instance in sorted order.
        for (NullWritable val : values) {
            context.write(key, NullWritable.get());
        }
    }
}

SecondarySortJob

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SecondarySortJob {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(SecondarySortJob.class);
        job.setMapperClass(SecondarySortMapper.class);
        job.setPartitionerClass(MyPartitioner.class);
        job.setSortComparatorClass(MyComparator.class);
        job.setGroupingComparatorClass(MyGroup.class);
        job.setReducerClass(SecondarySortReducer.class);
        job.setOutputKeyClass(MyWritable.class);
        job.setOutputValueClass(NullWritable.class);
        job.setNumReduceTasks(1);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}

Input Data

2016 32
2017 38
2016 31
2016 39
2015 35
2017 34
2016 37
2017 36
2018 35
2015 31
2018 34
2018 33

Sorted Output

2015 35
2015 31
2016 39
2016 37
2016 32
2016 31
2017 38
2017 36
2017 34
2018 35
2018 34
2018 33

Joins

Reduce-Side Join

TableBean

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class TableBean implements Writable {

    private String orderId;
    private String pid;
    private int amount;
    private String pname;
    private String flag; // "0" = order record, "1" = product record

    public TableBean() {
        super();
    }

    public TableBean(String orderId, String pid, int amount, String pname, String flag) {
        super();
        this.orderId = orderId;
        this.pid = pid;
        this.amount = amount;
        this.pname = pname;
        this.flag = flag;
    }

    @Override
    public String toString() {
        return orderId + "\t" + pname + "\t" + amount;
    }

    public String getOrderId() {
        return orderId;
    }

    public void setOrderId(String orderId) {
        this.orderId = orderId;
    }

    public String getPid() {
        return pid;
    }

    public void setPid(String pid) {
        this.pid = pid;
    }

    public int getAmount() {
        return amount;
    }

    public void setAmount(int amount) {
        this.amount = amount;
    }

    public String getPname() {
        return pname;
    }

    public void setPname(String pname) {
        this.pname = pname;
    }

    public String getFlag() {
        return flag;
    }

    public void setFlag(String flag) {
        this.flag = flag;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(orderId);
        out.writeUTF(pid);
        out.writeInt(amount);
        out.writeUTF(pname);
        out.writeUTF(flag);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        orderId = in.readUTF();
        pid = in.readUTF();
        amount = in.readInt();
        pname = in.readUTF();
        flag = in.readUTF();
    }
}

TableMapper

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

import java.io.IOException;

public class TableMapper extends Mapper<LongWritable, Text, Text, TableBean> {

    Text k = new Text();
    TableBean tableBean = new TableBean();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Get the name of the file this split came from.
        FileSplit split = (FileSplit) context.getInputSplit();
        String name = split.getPath().getName();
        String line = value.toString();
        if (name.startsWith("order")) {
            // Order record: orderId \t pid \t amount
            String[] fields = line.split("\t");
            tableBean.setOrderId(fields[0]);
            tableBean.setPid(fields[1]);
            tableBean.setAmount(Integer.parseInt(fields[2]));
            tableBean.setPname("");
            tableBean.setFlag("0");
            k.set(fields[1]);
            context.write(k, tableBean);
        } else {
            // Product record: pid \t pname
            String[] fields = line.split("\t");
            tableBean.setOrderId("");
            tableBean.setPid(fields[0]);
            tableBean.setAmount(0);
            tableBean.setPname(fields[1]);
            tableBean.setFlag("1");
            k.set(fields[0]);
            context.write(k, tableBean);
        }
    }
}

TableReducer

import org.apache.commons.beanutils.BeanUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;
import java.lang.reflect.InvocationTargetException;
import java.util.ArrayList;
import java.util.List;

public class TableReducer extends Reducer<Text, TableBean, NullWritable, TableBean> {

    // Example input for one key (pid = 01):
    //   order record:   1001  01  1
    //   product record: 01    小米
    @Override
    protected void reduce(Text key, Iterable<TableBean> values, Context context)
            throws IOException, InterruptedException {
        // Collects the records from the order table.
        List<TableBean> orderBeans = new ArrayList<>();
        // Holds the single record from the pd (product) table.
        TableBean pBean = new TableBean();
        for (TableBean value : values) {
            if (value.getFlag().equals("0")) {
                // Order record: copy it out, because Hadoop reuses the
                // object handed to the iterator.
                TableBean oBean = new TableBean();
                try {
                    BeanUtils.copyProperties(oBean, value);
                    orderBeans.add(oBean);
                } catch (IllegalAccessException | InvocationTargetException e) {
                    e.printStackTrace();
                }
            } else {
                // Product record from pd.txt.
                try {
                    BeanUtils.copyProperties(pBean, value);
                } catch (IllegalAccessException | InvocationTargetException e) {
                    e.printStackTrace();
                }
            }
        }
        // Fill in the product name and emit the joined rows.
        for (TableBean orderBean : orderBeans) {
            orderBean.setPname(pBean.getPname());
            context.write(NullWritable.get(), orderBean);
        }
    }
}

TableDriver

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class TableDriver {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // 1 Create the job.
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration);
        // 2 Set the local path of this program's jar.
        job.setJarByClass(TableDriver.class);
        // 3 Set the mapper/reducer classes for this job.
        job.setMapperClass(TableMapper.class);
        job.setReducerClass(TableReducer.class);
        // 4 Set the mapper output key/value types.
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(TableBean.class);
        // 5 Set the final output key/value types.
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(TableBean.class);
        // 6 Set the input and output directories.
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // 7 Submit the job, its configuration, and its jar to YARN.
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}

Input Data

Result

1001 小米 1
1002 华为 2
1003 格力 3

Map-Side Join

DistributedCacheMapper

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.HashMap;

public class DistributedCacheMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    Text k = new Text();
    HashMap<String, String> map = new HashMap<>();

    /**
     * Cache the pd (product) table before any records are mapped.
     */
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        URI[] cacheFiles = context.getCacheFiles();
        // Open the cached file and wrap it in a buffered reader.
        FileInputStream fis = new FileInputStream(cacheFiles[0].getPath());
        InputStreamReader isr = new InputStreamReader(fis, "UTF-8");
        BufferedReader br = new BufferedReader(isr);
        String line;
        // Read the product table line by line: pid \t pname.
        while (StringUtils.isNotEmpty(line = br.readLine())) {
            String[] fields = line.split("\t");
            map.put(fields[0], fields[1]);
        }
        // Closing the outermost reader closes the whole stream chain.
        br.close();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        // Look up pname by the pid in the second column.
        String[] fields = line.split("\t");
        String pid = fields[1];
        String pname = map.get(pid);
        k.set(line + "\t" + pname);
        context.write(k, NullWritable.get());
    }
}

DistributedCacheDriver

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.net.URI;

public class DistributedCacheDriver {

    public static void main(String[] args) throws Exception {
        // 1 Create the job.
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration);
        // 2 Set the jar path.
        job.setJarByClass(DistributedCacheDriver.class);
        // 3 Attach the mapper.
        job.setMapperClass(DistributedCacheMapper.class);
        // 4 Set the final output key/value types.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        // 5 Set the input and output paths.
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // 6 Load the cached table into memory (a local path by default).
        job.addCacheFile(new URI(args[2]));
        // 7 A map-side join needs no reduce phase, so use zero reduce tasks.
        job.setNumReduceTasks(0);
        // 8 Submit.
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}

The map-side join runs fast: the small table is distributed to every map task before the computation starts, and the whole join finishes in the map phase, whereas the reduce-side join needs a shuffle and a reduce phase. The trade-off is the JVM heap limit: the cached table must fit in each map task's memory. When the cluster has enough memory and the join type fits the pattern (e.g. an inner join or a left outer join with the small table cached), this model is a good choice; when both datasets are large, the reduce-side join is the better option. Other join strategies exist as well and can be chosen to match the business case.
