Sorting by the Mapper's Output Value in the MapReduce Shuffle Phase

ZKe

-----------------

In the MapReduce framework, the Mapper output is grouped by key during the Shuffle phase and is also sorted by key. That is why the Reducer output we see is ordered by key.

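We can get output ordered by value as well. Because the framework only sorts keys, the Mapper emits the value as its output key, and a custom comparator is registered with job.setSortComparatorClass(IntWritableComparator.class); (the ordering rule and the key type are whatever you define yourself).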
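The idea can be tried out without a cluster. The following plain-Java simulation is only a sketch of it: the class `ShuffleSortSketch`, its method, and the sample data are hypothetical, and no Hadoop classes are involved. A `TreeMap` with a descending comparator stands in for the Shuffle phase.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ShuffleSortSketch {

    // Simulates map -> shuffle -> reduce for (id, count) records.
    // Returns lines "id<TAB>count" ordered by count, largest first.
    public static List<String> sortByCountDescending(Map<String, Integer> records) {
        // "Map" step: emit (count, id), so the value takes the key position.
        // "Shuffle" step: the TreeMap groups ids by count and keeps the keys
        // in the order given by the comparator, here descending.
        TreeMap<Integer, List<String>> shuffled =
                new TreeMap<>(Comparator.<Integer>reverseOrder());
        for (Map.Entry<String, Integer> e : records.entrySet()) {
            shuffled.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        // "Reduce" step: swap back and emit (id, count) for every grouped id.
        List<String> output = new ArrayList<>();
        for (Map.Entry<Integer, List<String>> group : shuffled.entrySet()) {
            for (String id : group.getValue()) {
                output.add(id + "\t" + group.getKey());
            }
        }
        return output;
    }

    public static void main(String[] args) {
        Map<String, Integer> records = new LinkedHashMap<>();
        records.put("u1", 3);
        records.put("u2", 9);
        records.put("u3", 5);
        sortByCountDescending(records).forEach(System.out::println);
    }
}
```

With the sample data, the ids come out in the order u2, u3, u1, that is, by count from largest to smallest.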

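An entity class used as a key must not only implement the Comparable interface, it must also override the readFields and write methods (in Hadoop, the WritableComparable interface combines both requirements). After that, define a comparator for the entity.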

Here we define an entity class with a String id and an int count as its fields, and we sort by count.

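// Requires java.io.DataInput, java.io.DataOutput, java.io.IOException
// and org.apache.hadoop.io.WritableComparable.
static class Record implements WritableComparable<Record> {

        private String personalId;
        private int count;

        // Hadoop creates Writable instances by reflection, so a no-argument constructor is needed.
        public Record() {
        }

        public Record(String id, int count) {
            this.personalId = id;
            this.count = count;
        }

        // Parses one input line of the form "id<TAB>count".
        public Record(String line) {
            String[] fields = line.split("\t");
            this.personalId = fields[0];
            this.count = Integer.parseInt(fields[1]);
        }

        /*
         * Deserialization method
         * @author 180512235 ZhaoKe
         */
        @Override
        public void readFields(DataInput arg0) throws IOException {
            this.personalId = arg0.readUTF();
            this.count = arg0.readInt();
        }

        // Serialization method; fields are written in the same order readFields reads them
        @Override
        public void write(DataOutput arg0) throws IOException {
            arg0.writeUTF(this.personalId);
            arg0.writeInt(this.count);
        }

        // Descending order by count; returns 0 when the counts are equal
        @Override
        public int compareTo(Record o) {
            return Integer.compare(o.count, this.count);
        }

        public String getPersonalId() {
            return this.personalId;
        }

        public int getCount() {
            return this.count;
        }
    }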
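To see why write and readFields have to agree on the field order, here is a minimal round trip using only java.io. It is a sketch under assumptions: `RecordRoundTrip` and `SimpleRecord` are hypothetical stand-ins for the entity above, and the in-memory byte buffer replaces whatever stream the framework would actually hand to these methods.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class RecordRoundTrip {

    // Hypothetical stand-in for the Record entity, reduced to its two fields.
    static class SimpleRecord {
        String personalId;
        int count;

        // Reads the fields in exactly the order write() produced them.
        void readFields(DataInput in) throws IOException {
            personalId = in.readUTF();
            count = in.readInt();
        }

        void write(DataOutput out) throws IOException {
            out.writeUTF(personalId);
            out.writeInt(count);
        }
    }

    // Serializes a record to bytes, reads it back into a fresh instance,
    // and returns the copy formatted as "id<TAB>count".
    public static String roundTrip(String id, int count) {
        try {
            SimpleRecord original = new SimpleRecord();
            original.personalId = id;
            original.count = count;

            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(buffer)) {
                original.write(out);
            }

            SimpleRecord copy = new SimpleRecord();
            try (DataInputStream in =
                    new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
                copy.readFields(in);
            }
            return copy.personalId + "\t" + copy.count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("u42", 7));
    }
}
```

If readFields read the int before the string, the copy would be decoded from the wrong bytes, which is why the two methods are always written as a matched pair.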

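The comparator is shown below. It compares the IntWritable keys emitted by the Mapper and orders them from largest to smallest.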

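    // Requires org.apache.hadoop.io.IntWritable, org.apache.hadoop.io.WritableComparable
    // and org.apache.hadoop.io.WritableComparator.
    static class IntWritableComparator extends WritableComparator {

        /*
         * Constructor: registers IntWritable as the key class to compare;
         * true lets the parent class create key instances for deserialization
         */
        public IntWritableComparator() {
            super(IntWritable.class, true);
        }

        /*
         * Override compare to define the custom ordering rule
         */
        @Override
        @SuppressWarnings("rawtypes")
        public int compare(WritableComparable a, WritableComparable b) {
            // Downcast to the concrete key type
            IntWritable ia = (IntWritable) a;
            IntWritable ib = (IntWritable) b;
            // Swapped operands give descending order
            return ib.compareTo(ia);
        }
    }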
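Whether the descending rule lives in a compareTo method or in a comparator, it should return 0 for equal counts, a negative number when the left side sorts first, and a positive number otherwise. A rule such as `a < b ? 1 : -1` never returns 0 and so breaks that contract for ties. The following is a small sketch of a rule that keeps it; `DescendingCompareSketch` and its methods are hypothetical names and use only the standard library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DescendingCompareSketch {

    // Descending order: the larger count sorts first, equal counts compare as 0.
    public static int compareDescending(int leftCount, int rightCount) {
        return Integer.compare(rightCount, leftCount);
    }

    // Returns a sorted copy so the caller's list is left untouched.
    public static List<Integer> sortDescending(List<Integer> counts) {
        List<Integer> sorted = new ArrayList<>(counts);
        sorted.sort(DescendingCompareSketch::compareDescending);
        return sorted;
    }

    public static void main(String[] args) {
        System.out.println(sortDescending(Arrays.asList(3, 9, 5, 9))); // prints [9, 9, 5, 3]
        System.out.println(compareDescending(5, 5));                   // prints 0
    }
}
```

Integer.compare is also preferable to subtracting the two values, since subtraction can overflow for large counts.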

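The Mapper and Reducer are shown below. Neither of them does any sorting itself, because the Shuffle phase calls the comparator on its own. The Mapper emits count as the key and the id as the value, and the Reducer swaps them back. Record is only used here to parse each input line.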

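    // Requires org.apache.hadoop.io.{IntWritable, LongWritable, Text}
    // and org.apache.hadoop.mapreduce.{Mapper, Reducer}.
    static class SortMapper extends Mapper<LongWritable, Text, IntWritable, Text> {

        private Record r;

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Parse the line, then emit (count, id) so that the framework sorts by count
            r = new Record(value.toString());
            context.write(new IntWritable(r.getCount()), new Text(r.getPersonalId()));
        }
    }

    static class SortReducer extends Reducer<IntWritable, Text, Text, IntWritable> {

        @Override
        protected void reduce(IntWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            // Keys arrive already sorted; swap back to (id, count) for the final output
            for (Text value : values) {
                context.write(value, key);
            }
        }
    }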

The main method is shown below; feel free to use it as a template.

    // Requires org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.FileSystem,
    // org.apache.hadoop.fs.Path, org.apache.hadoop.mapreduce.Job,
    // org.apache.hadoop.mapreduce.lib.input.FileInputFormat,
    // org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
    // and org.apache.log4j.BasicConfigurator.
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        String inputFile = "hdfs://master:9000/user/root/finalClassDesign/originData/submitTop10output/";
        String outputFile = "hdfs://master:9000/user/root/finalClassDesign/originData/sortedSubmitTop10/";
        BasicConfigurator.configure();
        Configuration conf = new Configuration();
//        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
//        if(otherArgs.length != 2){
//            System.err.println("Usage:wordcount<in><out>");
//            System.exit(2);
//        }
        Job job = Job.getInstance(conf, "WordCount");
        job.setJarByClass(SortByMapReduce.class);
        job.setMapperClass(SortMapper.class);
        job.setReducerClass(SortReducer.class);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setSortComparatorClass(IntWritableComparator.class);  // Important: do not forget to register the comparator here
//        Path path = new Path(otherArgs[1]);
        Path path = new Path(outputFile);
        FileSystem fileSystem = path.getFileSystem(conf);
        // Delete the output directory if it already exists; the job refuses to start otherwise
        if (fileSystem.exists(path)) {
            fileSystem.delete(path, true);
        }
//        FileInputFormat.setInputPaths(job, new Path(args[0]));
//        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        FileInputFormat.setInputPaths(job, new Path(inputFile));
        FileOutputFormat.setOutputPath(job, new Path(outputFile));
        boolean res = job.waitForCompletion(true);
        if (res) {
            System.out.println("===========waitForCompletion:" + res + "==========");
        }
        System.exit(res ? 0 : 1);
    }
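The template hardcodes its HDFS paths and keeps the command-line handling commented out. One possible middle ground, sketched here as an assumption rather than as part of the original job, is to take the paths from the arguments when exactly two are given and fall back to the defaults otherwise. `PathArgsSketch`, `resolvePaths`, and the sample paths are hypothetical.

```java
public class PathArgsSketch {

    // Returns {input, output}: taken from args when exactly two are supplied,
    // otherwise the given defaults.
    public static String[] resolvePaths(String[] args, String defaultInput, String defaultOutput) {
        if (args != null && args.length == 2) {
            return new String[] { args[0], args[1] };
        }
        return new String[] { defaultInput, defaultOutput };
    }

    public static void main(String[] args) {
        // Hypothetical default paths, for illustration only.
        String[] paths = resolvePaths(args, "/tmp/sort/in", "/tmp/sort/out");
        System.out.println("input:  " + paths[0]);
        System.out.println("output: " + paths[1]);
    }
}
```

The two resolved strings could then be passed to `new Path(...)` in place of inputFile and outputFile.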
