Hadoop 2.x Advanced: Secondary Sort and MapReduce Join
I. Understanding the Secondary Sort Example
- 1. Requirement: sort by the first field first; when first fields are equal, sort by the second field

     Unsorted input            Sorted output
     a,1                       a,1
     b,1                       a,2
     a,2          [sort]       a,100
     b,6          ===>         b,-3
     c,2                       b,-2
     b,-2                      b,1
     a,100                     b,6
     b,-3                      c,-7
     c,-7                      c,2
- 2. The [MapReduce process]
-    1> Input data is fed into map() through the input stage
-    2> map() filters the records layer by layer until we have the data we actually want
-    3> Custom counters can be added inside the filtering logic
-    4> Filtered records are written to the context and move into the shuffle stage
-    5> Most of the shuffle work can be said to happen on the map() side
-    6> Inside the shuffle, records first pass through partitioning. The default partitioner (HashPartitioner) computes
-       (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks; see the sketch after this list. By default there is a
-       single reduce task, so reduce produces a single output file. From my own output tests: even if you configure a
-       custom partitioner class, it is never invoked unless you also raise the number of reduce tasks above one; with
-       a single reducer every record lands in the one partition regardless
-    7> Partitioning is followed by sorting within each partition. Sorting is by key, and we can define a custom key
-       type with its own ordering rule. For a secondary sort we define a composite key whose comparison looks at the
-       first field, and only when the first fields are equal compares the second field. After the sort comes grouping,
-       which by default is also by key; custom grouping requires implementing RawComparator
-    8> Grouping is followed by the merge sort of the map outputs, and then reduce. To be precise: the partition decides
-       which reduce task a record goes to, and the number of reduce tasks decides how many output files the reduce
-       stage produces; grouping decides which records are handed to a single reduce() call. I think of grouping as a
-       shuffle-stage optimization for the program
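For reference, here is a minimal sketch of the default partitioning rule described in 6>. It mirrors the logic of Hadoop's built-in HashPartitioner; the class name here is made up for illustration:

    import org.apache.hadoop.mapreduce.Partitioner;

    // Sketch of the default hash-partitioning rule: the bitmask keeps
    // the hash non-negative before the modulo picks a reduce task.
    public class DefaultStylePartitioner<K, V> extends Partitioner<K, V> {
        @Override
        public int getPartition(K key, V value, int numReduceTasks) {
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }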
- 3. The [secondary sort] design
-    1> Given the data above, we can define a custom key type holding both the first and the second field, plus a
-       comparator stating that ordering is primarily by the first field. The custom type needs to implement
-       WritableComparable (or, equivalently, implement Writable and Comparable separately); whichever is more convenient
-    2> Next, partitioning. This example produces a single sorted file, so no custom partitioner is needed, and as noted
-       above, a custom partitioner class would not be invoked anyway with one reducer. A custom partitioner extends
-       Partitioner (note: extends), and we override the partitioning rule ourselves
-    3> Then grouping. Grouping is still worth having as an optimization; our rule groups by the first field of the
-       custom key type. Grouping requires implementing RawComparator
-    4> Finally, consider what else could be optimized: the volume of the data source, whether fields are always present,
-       field lengths and types, whether to use a combiner or a custom compression codec (see the sketch after the
-       effect table), and how negative values behave. Since the comparator already compares the second field
-       numerically, there is no need for the old trick of adding and subtracting a large constant to make negatives
-       sort correctly
- Effect:

     Source data    After map()     After shuffle        After reduce()
     a,1            a#1,1           a#1 [1,2,100]        a 1
     b,1            b#1,1           b#-3 [-3,-2,1,6]     a 2
     a,2            a#2,2           c#-7 [-7,2]          a 100
     b,6            b#6,6                                b -3
     c,2            c#2,2                                b -2
     b,-2           b#-2,-2                              b 1
     a,100          a#100,100                            b 6
     b,-3           b#-3,-3                              c -7
     c,-7           c#-7,-7                              c 2
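As one concrete optimization from 4>, map output can be compressed from the driver. A minimal sketch, assuming Hadoop 2.x property names (DefaultCodec is always available; SnappyCodec is faster but needs native libraries on every node):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.DefaultCodec;

    public class CompressionConfig {
        // Build a Configuration with map-output compression enabled;
        // pass the result to Job.getInstance(...) in the driver.
        public static Configuration withMapOutputCompression() {
            Configuration conf = new Configuration();
            conf.setBoolean("mapreduce.map.output.compress", true);
            conf.setClass("mapreduce.map.output.compress.codec",
                    DefaultCodec.class, CompressionCodec.class);
            return conf;
        }
    }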
II. Secondary Sort Example Code
SSortMr.java  ## main class
============

package com.bigdata_senior.SSortMr;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SSortMr {

    // Mapper class: emits (composite key, second field) pairs
    private static class SSortMapper
            extends Mapper<LongWritable, Text, SecondaryWritable, LongWritable> {
        private SecondaryWritable mapOutKey = new SecondaryWritable();
        private LongWritable mapOutValue = new LongWritable();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String lineValue = value.toString();
            String[] strValue = lineValue.split(",");
            mapOutKey.set(strValue[0], Integer.valueOf(strValue[1]));
            mapOutValue.set(Integer.valueOf(strValue[1]));
            context.write(mapOutKey, mapOutValue);
            System.out.println("key-->" + mapOutKey + " value-->" + mapOutValue);
        }
    }

    // Reduce class: values arrive already sorted by the composite key
    private static class SSortReduce
            extends Reducer<SecondaryWritable, LongWritable, Text, LongWritable> {
        private Text reduceOutKey = new Text();

        @Override
        public void reduce(SecondaryWritable key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            for (LongWritable value : values) {
                reduceOutKey.set(key.getFirst() + "#" + key.getSecond());
                context.write(reduceOutKey, value);
            }
        }
    }

    // Driver
    public int run(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration, this.getClass().getSimpleName());
        job.setJarByClass(this.getClass());
        //job.setNumReduceTasks(3);
        // input
        Path inPath = new Path(args[0]);
        FileInputFormat.addInputPath(job, inPath);
        // output
        Path outPath = new Path(args[1]);
        FileOutputFormat.setOutputPath(job, outPath);
        // mapper
        job.setMapperClass(SSortMapper.class);
        job.setMapOutputKeyClass(SecondaryWritable.class);
        job.setMapOutputValueClass(LongWritable.class);
        // partitioner
        //job.setPartitionerClass(SecondaryPartionerCLass.class);
        // grouping
        job.setGroupingComparatorClass(SecondaryGroupClass.class);
        // reduce
        job.setReducerClass(SSortReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        // submit job
        boolean isSuccess = job.waitForCompletion(true);
        return isSuccess ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // NOTE: these hard-coded paths override any command-line arguments
        args = new String[]{
            "hdfs://hadoop09-linux-01.ibeifeng.com:8020/user/liuwl/tmp/sortmr/input",
            "hdfs://hadoop09-linux-01.ibeifeng.com:8020/user/liuwl/tmp/sortmr/output13"
        };
        // run job
        int status = new SSortMr().run(args);
        System.exit(status);
    }
}
SecondaryWritable.java  ## custom data type
======================

package com.bigdata_senior.SSortMr;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class SecondaryWritable implements WritableComparable<SecondaryWritable> {

    private String first;
    private int second;

    public SecondaryWritable() {}

    public SecondaryWritable(String first, int second) {
        this.set(first, second);
    }

    public void set(String first, int second) {
        this.first = first;
        this.second = second;
    }

    public String getFirst() {
        return first;
    }

    public void setFirst(String first) {
        this.first = first;
    }

    public int getSecond() {
        return second;
    }

    public void setSecond(int second) {
        this.second = second;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(this.first);
        out.writeInt(this.second);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.first = in.readUTF();
        this.second = in.readInt();
    }

    @Override
    public int compareTo(SecondaryWritable o) {
        // primary order: first field; only on ties, the second field
        int comp = this.first.compareTo(o.first);
        if (0 != comp) {
            return comp;
        }
        return Integer.valueOf(this.second).compareTo(Integer.valueOf(o.second));
    }

    @Override
    public String toString() {
        return first + "#" + second;
    }

    @Override
    public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result + ((first == null) ? 0 : first.hashCode());
        result = prime * result + second;
        return result;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj)
            return true;
        if (obj == null)
            return false;
        if (getClass() != obj.getClass())
            return false;
        SecondaryWritable other = (SecondaryWritable) obj;
        if (first == null) {
            if (other.first != null)
                return false;
        } else if (!first.equals(other.first))
            return false;
        if (second != other.second)
            return false;
        return true;
    }
}
SecondaryPartionerCLass.java  ## custom partitioner (commented out in the driver)
============================

package com.bigdata_senior.SSortMr;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Partitioner;

public class SecondaryPartionerCLass extends Partitioner<SecondaryWritable, LongWritable> {

    @Override
    public int getPartition(SecondaryWritable key, LongWritable value,
            int numPartitions) {
        // partition on the first field only, so records with equal
        // first fields always reach the same reduce task
        return (key.getFirst().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

To activate it, uncomment job.setPartitionerClass(SecondaryPartionerCLass.class) in the driver and also call job.setNumReduceTasks(n) with n > 1; with the default single reducer it is never invoked.
SecondaryGroupClass.java  ## custom grouping comparator
========================

package com.bigdata_senior.SSortMr;

import java.util.Arrays;
import org.apache.hadoop.io.RawComparator;
import org.apache.hadoop.io.WritableComparator;

public class SecondaryGroupClass implements RawComparator<SecondaryWritable> {

    @Override
    public int compare(SecondaryWritable o1, SecondaryWritable o2) {
        System.out.println("o1: " + o1.toString() + " o2: " + o2.toString());
        // group on the first field only
        return o1.getFirst().compareTo(o2.getFirst());
    }

    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        System.out.println("b1: " + Arrays.toString(b1) + " b2: " + Arrays.toString(b2));
        // compare the serialized keys starting at their offsets, minus the
        // trailing 4-byte int, so only the UTF-encoded first field counts
        return WritableComparator.compareBytes(b1, s1, l1 - 4, b2, s2, l2 - 4);
    }
}
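Equivalently, and this is a sketch rather than part of the original example, the same grouping rule can extend WritableComparator, which deserializes the keys for you and avoids hand-written byte arithmetic:

    package com.bigdata_senior.SSortMr;

    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    // Alternative grouping comparator. Passing true to super() tells the
    // parent to create key instances, so compare() receives deserialized
    // SecondaryWritable objects instead of raw bytes.
    public class SecondaryGroupComparator extends WritableComparator {

        protected SecondaryGroupComparator() {
            super(SecondaryWritable.class, true);
        }

        @Override
        @SuppressWarnings("rawtypes")
        public int compare(WritableComparable a, WritableComparable b) {
            SecondaryWritable k1 = (SecondaryWritable) a;
            SecondaryWritable k2 = (SecondaryWritable) b;
            return k1.getFirst().compareTo(k2.getFirst());
        }
    }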
An alternative:  ## workable for small data, but extremely resource-hungry on large data

SSortMr2.java
=============
package com.bigdata_senior.SSortMr2;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SSortMr2 {

    // Mapper class: plain Text key, so only the first field drives the shuffle sort
    private static class SSortMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private Text mapOutKey = new Text();
        private LongWritable mapOutValue = new LongWritable();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String lineValue = value.toString();
            String[] strValue = lineValue.split(",");
            mapOutKey.set(strValue[0]);
            mapOutValue.set(Integer.valueOf(strValue[1]));
            context.write(mapOutKey, mapOutValue);
            System.out.println("key-->" + mapOutKey + " value-->" + mapOutValue);
        }
    }

    // Reduce class: buffers all values for a key in memory and sorts them;
    // this buffering is what makes the approach expensive on large data
    private static class SSortReduce extends Reducer<Text, LongWritable, Text, LongWritable> {
        private LongWritable reduceOutValue = new LongWritable();

        @Override
        public void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            List<Long> longList = new ArrayList<Long>();
            for (LongWritable value : values) {
                longList.add(value.get());
            }
            Collections.sort(longList);
            for (Long value : longList) {
                System.out.println("key--> " + key + " value--> " + value);
                reduceOutValue.set(value);
                context.write(key, reduceOutValue);
            }
        }
    }

    // Driver
    public int run(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration, this.getClass().getSimpleName());
        job.setJarByClass(this.getClass());
        // input
        Path inPath = new Path(args[0]);
        FileInputFormat.addInputPath(job, inPath);
        // output
        Path outPath = new Path(args[1]);
        FileOutputFormat.setOutputPath(job, outPath);
        // mapper
        job.setMapperClass(SSortMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        // reduce
        job.setReducerClass(SSortReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        // submit job
        boolean isSuccess = job.waitForCompletion(true);
        return isSuccess ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // NOTE: these hard-coded paths override any command-line arguments
        args = new String[]{
            "hdfs://hadoop09-linux-01.ibeifeng.com:8020/user/liuwl/tmp/sortmr/input",
            "hdfs://hadoop09-linux-01.ibeifeng.com:8020/user/liuwl/tmp/sortmr/output22"
        };
        // run job
        int status = new SSortMr2().run(args);
        System.exit(status);
    }
}
III. MapReduce Join, in Brief
- 1. A join is a combination
- 2. That is, data from two or more sources is combined into a single output
- 3. After learning Hive, the MapReduce join no longer feels like a key skill, because handling a join in raw MapReduce has drawbacks:
-    1> the number of tables the join handles is fixed in the code
-    2> the operations are tedious and the filtering varied, so it is easy to miss cases
-    3> resource consumption is heavy
- 4. A MapReduce join roughly works like this: load both tables, and since their records arrive intermixed, tag them with a custom data type so the two tables can be told apart; reduce() then pulls out each table's records by tag and emits the joined result. There are of course other ways to handle a join, for example loading one (small) table into a collection in setup(), as sketched below
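For completeness, here is a minimal sketch of that setup()-based map-side join. It is an assumption-laden illustration, not part of the original example: the class name is invented, and it assumes a small customers file was shipped with job.addCacheFile() so that it appears in the task's working directory as customer.txt:

    package com.bigdata_senior.joinMr;

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical map-side join: the small "customer" table is loaded into
    // memory once in setup(), and each order record is joined in map() with
    // no shuffle or reduce at all (set the job's reduce tasks to 0).
    public class MapSideJoinMapper extends Mapper<LongWritable, Text, NullWritable, Text> {

        private final Map<String, String> customers = new HashMap<String, String>();
        private final Text outValue = new Text();

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // "customer.txt" is an assumed symlink name created by
            // job.addCacheFile(new URI("/path/customer.txt#customer.txt"))
            BufferedReader reader = new BufferedReader(new FileReader("customer.txt"));
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split(",");   // cid,cname,phone
                customers.put(fields[0], fields[1] + "," + fields[2]);
            }
            reader.close();
        }

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // assumed order layout: cid,oid,price,date
            String[] fields = value.toString().split(",");
            String customerInfo = customers.get(fields[0]);
            if (customerInfo == null) {
                return; // inner join: skip orders with no matching customer
            }
            outValue.set(fields[0] + "," + customerInfo + ","
                    + fields[1] + "," + fields[2] + "," + fields[3]);
            context.write(NullWritable.get(), outValue);
        }
    }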
IV. MapReduce Join Code Example
JoinMr.java  ## main class
===========

package com.bigdata_senior.joinMr;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JoinMr {

    // Mapper class: tags each record with the table it came from
    private static class WordCountMapper extends
            Mapper<LongWritable, Text, LongWritable, JoinWritable> {
        private LongWritable mapoutputkey = new LongWritable();
        private JoinWritable mapoutputvalue = new JoinWritable();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String lineValue = value.toString();
            String[] strValue = lineValue.split(",");
            // customer records have 3 fields, order records have 4; skip anything else
            int length = strValue.length;
            if (3 != length && 4 != length) {
                return;
            }
            // get cid
            Long cid = Long.valueOf(strValue[0]);
            // get the second field (customer name, or the first order field)
            String cname = strValue[1];
            // set customer
            if (3 == length) {
                String phone = strValue[2];
                mapoutputkey.set(cid);
                mapoutputvalue.set("customer", cname + "," + phone);
            }
            // set order
            if (4 == length) {
                String price = strValue[2];
                String date = strValue[3];
                mapoutputkey.set(cid);
                mapoutputvalue.set("order", cname + "," + price + "," + date);
            }
            context.write(mapoutputkey, mapoutputvalue);
        }
    }

    // Reduce class: all records for one cid arrive together; split them by tag
    private static class WordCountReduce extends
            Reducer<LongWritable, JoinWritable, NullWritable, Text> {
        private Text outputValue = new Text();

        @Override
        public void reduce(LongWritable key, Iterable<JoinWritable> values, Context context)
                throws IOException, InterruptedException {
            String customerInfo = null;
            List<String> orderList = new ArrayList<String>();
            for (JoinWritable value : values) {
                if ("customer".equals(value.getTag())) {
                    customerInfo = value.getData();
                    System.out.println(customerInfo);
                } else if ("order".equals(value.getTag())) {
                    orderList.add(value.getData());
                }
            }
            // emit one joined line per order (customerInfo stays null
            // if the cid had no customer record)
            for (String order : orderList) {
                outputValue.set(key.get() + "," + customerInfo + "," + order);
                context.write(NullWritable.get(), outputValue);
            }
        }
    }

    // Driver
    public int run(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration, this.getClass().getSimpleName());
        job.setJarByClass(this.getClass());
        // input
        Path inPath = new Path(args[0]);
        FileInputFormat.addInputPath(job, inPath);
        // output
        Path outPath = new Path(args[1]);
        FileOutputFormat.setOutputPath(job, outPath);
        // mapper
        job.setMapperClass(WordCountMapper.class);
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(JoinWritable.class);
        // reduce
        job.setReducerClass(WordCountReduce.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        // submit job
        boolean isSuccess = job.waitForCompletion(true);
        return isSuccess ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // NOTE: these hard-coded paths override any command-line arguments
        args = new String[]{
            "hdfs://hadoop09-linux-01.ibeifeng.com:8020/user/liuwl/tmp/join/input",
            "hdfs://hadoop09-linux-01.ibeifeng.com:8020/user/liuwl/tmp/join/output2"
        };
        // run job
        int status = new JoinMr().run(args);
        System.exit(status);
    }
}
JoinWritable.java  ## custom data type
=================

package com.bigdata_senior.joinMr;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class JoinWritable implements Writable {

    private String tag;   // "customer" or "order"
    private String data;  // the remaining comma-separated fields

    public JoinWritable() {}

    public JoinWritable(String tag, String data) {
        this.set(tag, data);
    }

    public void set(String tag, String data) {
        this.setTag(tag);
        this.setData(data);
    }

    public String getTag() {
        return tag;
    }

    public void setTag(String tag) {
        this.tag = tag;
    }

    public String getData() {
        return data;
    }

    public void setData(String data) {
        this.data = data;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(this.getTag());
        out.writeUTF(this.getData());
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.setTag(in.readUTF());
        this.setData(in.readUTF());
    }

    @Override
    public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result + ((data == null) ? 0 : data.hashCode());
        result = prime * result + ((tag == null) ? 0 : tag.hashCode());
        return result;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj)
            return true;
        if (obj == null)
            return false;
        if (getClass() != obj.getClass())
            return false;
        JoinWritable other = (JoinWritable) obj;
        if (data == null) {
            if (other.data != null)
                return false;
        } else if (!data.equals(other.data))
            return false;
        if (tag == null) {
            if (other.tag != null)
                return false;
        } else if (!tag.equals(other.tag))
            return false;
        return true;
    }

    @Override
    public String toString() {
        return tag + "," + data;
    }
}
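To make the data flow concrete, here is a hypothetical example; the field values are invented, and only the 3-field/4-field record shapes come from the mapper above:

    customers (cid,cname,phone):       orders (cid,oid,price,date):
    1,Alice,13800000001                1,O1001,35.5,2016-01-02
    2,Bob,13800000002                  1,O1002,12.0,2016-01-05
                                       2,O1003,99.9,2016-01-07

    joined output (cid,cname,phone,oid,price,date):
    1,Alice,13800000001,O1001,35.5,2016-01-02
    1,Alice,13800000001,O1002,12.0,2016-01-05
    2,Bob,13800000002,O1003,99.9,2016-01-07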