Hadoop MapReduce Secondary Sort Explained

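1. Sample data: files w1.csv through w5.csv, 2,000 records each. The first column is a year chosen at random from 1990 to 2000, and the second column is a random value from 1 to 100. The goal of this secondary-sort example is to find the maximum value for each year. In practice the maximum for every year is 100, but the point here is to find it by way of MapReduce secondary sort.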

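2. Core concepts:

1) Partitioning. Assume the data set is huge. To increase parallelism, all records are partitioned with a hash function in a way that guarantees records of the same year land in the same partition, that is, in the same reducer.

2) Key comparator. After a record is split, the year and the value together form a composite key (which is simply an object). Because the key is composite, a key comparator is needed to sort it; its core compare method orders by year ascending and value descending.
Each mapper then sorts all the records it is responsible for according to this key comparator.

3) Grouping comparator. In the reduce phase we only want the maximum, which is simply the first record after sorting. If nothing is changed in the reduce phase, the output contains every distinct (year, value) pair in year-ascending, value-descending order (identical keys fall into one group, so duplicates collapse into a single output line).
To get only the single largest record, set a grouping comparator. We need just one record per year, so the grouping comparator groups by year only. After grouping, reduce is called once per year, and the key it receives is the first key of the group, which is the record with the largest value.
This way of taking the maximum is really a trick: it relies on the sorting behavior of the map and reduce phases.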
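The sort-then-take-first idea can be simulated without a cluster. The following is a minimal plain-Java sketch, not Hadoop code: `SecondarySortSketch`, `KEY_ORDER`, and `maxPerYear` are hypothetical names, records are modeled as `int[]{year, value}`, and a single in-memory sort stands in for the shuffle.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation of the sort-then-group idea; no Hadoop classes are used.
class SecondarySortSketch {

    // Stand-in for the key comparator: year ascending, value descending.
    static final Comparator<int[]> KEY_ORDER = (a, b) -> {
        int byYear = Integer.compare(a[0], b[0]);
        return byYear != 0 ? byYear : Integer.compare(b[1], a[1]); // swapped arguments: descending
    };

    // Stand-in for grouping by year: keep only the first record seen for each year.
    static Map<Integer, Integer> maxPerYear(List<int[]> records) {
        List<int[]> sorted = new ArrayList<>(records);
        sorted.sort(KEY_ORDER);
        Map<Integer, Integer> result = new LinkedHashMap<>();
        for (int[] record : sorted) {
            result.putIfAbsent(record[0], record[1]); // later records of the same year are ignored
        }
        return result;
    }

    public static void main(String[] args) {
        List<int[]> records = Arrays.asList(
                new int[]{1991, 7}, new int[]{1990, 42}, new int[]{1990, 100},
                new int[]{1991, 88}, new int[]{1990, 3});
        System.out.println(maxPerYear(records)); // prints {1990=100, 1991=88}
    }
}
```

Because the sort already places the largest value of each year first, no explicit max computation is needed; that is the whole trick.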

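3. IntPair: in this example each line is parsed and both the year and the value go into the key, so a custom IntPair class is needed. In a production environment you can define whatever key classes the requirements call for.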

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// Composite key: first holds the year, second holds the data value.
public class IntPair implements WritableComparable<IntPair> {
    private int first;
    private int second;

    // No-arg constructor, needed so the framework can create instances when deserializing.
    public IntPair() {
    }

    public IntPair(int first, int second) {
        this.first = first;
        this.second = second;
    }

    public int getFirst() {
        return first;
    }

    public void setFirst(int first) {
        this.first = first;
    }

    public int getSecond() {
        return second;
    }

    public void setSecond(int second) {
        this.second = second;
    }

    // Natural order: first ascending, then second ascending.
    @Override
    public int compareTo(IntPair o) {
        int result = Integer.compare(first, o.getFirst());
        if (result == 0) {
            result = Integer.compare(second, o.getSecond());
        }
        return result;
    }

    // Helper used by the comparators to compare a single int field.
    public static int compare(int first1, int first2) {
        return Integer.compare(first1, first2);
    }

    // Serialization: the two fields are written and read in the same order.
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(first);
        out.writeInt(second);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        first = in.readInt();
        second = in.readInt();
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        IntPair intPair = (IntPair) o;
        return first == intPair.first && second == intPair.second;
    }

    @Override
    public int hashCode() {
        return 31 * first + second;
    }

    // Tab-separated, which is how the key appears in the text output.
    @Override
    public String toString() {
        return first + "\t" + second;
    }
}
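The write and readFields methods only need java.io interfaces, so the serialization round trip can be checked outside Hadoop. The sketch below uses a hypothetical stand-in class `Pair` with the same two method bodies; `PairRoundTrip`, `serialize`, and `deserialize` are illustrative names, not part of the original code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class PairRoundTrip {

    // Hypothetical stand-in for IntPair: same write/readFields bodies, no Hadoop dependency.
    static final class Pair {
        int first;
        int second;

        void write(DataOutput out) throws IOException {
            out.writeInt(first);
            out.writeInt(second);
        }

        void readFields(DataInput in) throws IOException {
            first = in.readInt();
            second = in.readInt();
        }
    }

    // Serializes a pair into a byte array (two 4-byte ints).
    static byte[] serialize(Pair pair) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            pair.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Rebuilds a pair from bytes, reading the fields in the order they were written.
    static Pair deserialize(byte[] data) {
        Pair pair = new Pair();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            pair.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return pair;
    }

    public static void main(String[] args) {
        Pair pair = new Pair();
        pair.first = 1990;
        pair.second = 100;
        byte[] data = serialize(pair);
        Pair copy = deserialize(data);
        System.out.println(data.length + " bytes, year=" + copy.first + ", value=" + copy.second);
        // prints: 8 bytes, year=1990, value=100
    }
}
```

The read order must mirror the write order; swapping the two readInt calls would silently exchange year and value.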

4. RecordParser: parses each record and guards against malformed data.

public class RecordParser {
    private int year;
    private int data;
    private boolean valid;

    public int getYear() {
        return year;
    }

    public int getData() {
        return data;
    }

    public boolean isValid() {
        return valid;
    }

    // Parses one CSV line of the form "year,data"; marks the record invalid
    // when a column is missing or is not an integer.
    public void parse(String value) {
        String[] sValue = value.split(",");
        try {
            year = Integer.parseInt(sValue[0]);
            data = Integer.parseInt(sValue[1]);
            valid = true;
        } catch (NumberFormatException | ArrayIndexOutOfBoundsException e) {
            valid = false;
        }
    }
}
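To see which lines such a parser accepts, here is a small standalone sketch. `ParseSketch.parseLine` is a hypothetical helper that follows the same split-and-parseInt approach and returns null for a malformed line; it is an illustration, not the class used by the job.

```java
import java.util.Arrays;

class ParseSketch {

    // Returns {year, data}, or null when the line is malformed (hypothetical helper).
    static int[] parseLine(String line) {
        String[] parts = line.split(",");
        try {
            return new int[]{Integer.parseInt(parts[0]), Integer.parseInt(parts[1])};
        } catch (NumberFormatException | ArrayIndexOutOfBoundsException e) {
            return null; // missing column or non-numeric text
        }
    }

    public static void main(String[] args) {
        String[] lines = {"1990,100", "1990", "year,value", "1990, 100"};
        for (String line : lines) {
            System.out.println("\"" + line + "\" -> " + Arrays.toString(parseLine(line)));
        }
    }
}
```

Note that Integer.parseInt does not trim whitespace, so a line such as "1990, 100" is rejected. If the input may contain spaces after the comma, trimming each field before parsing would be needed.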

5. Partitioner

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Partitioner;

/**
 * @Author: xu.dm
 * @Date: 2019/2/21 11:56
 * @Description: Partitions by key so that records with the same key.first go to the
 * same partition. The generic types must match the mapper's output types.
 */
public class FirstPartitioner extends Partitioner<IntPair,IntWritable> {
    /**
     * Get the partition number for a given key (hence record) given the total
     * number of partitions i.e. number of reduce-tasks for the job.
     *
     * <p>Typically a hash function on all or a subset of the key.</p>
     *
     * @param key           the key to be partitioned.
     * @param value         the entry value.
     * @param numPartitions the total number of partitions.
     * @return the partition number for the <code>key</code>.
     */
    @Override
    public int getPartition(IntPair key, IntWritable value, int numPartitions) {
        // Only the year takes part in the hash, so the value never affects the partition.
        return Math.abs(key.getFirst() * 127) % numPartitions;
    }
}
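The partition arithmetic can be tried on its own. In this minimal sketch, `partitionOf` repeats the expression used in getPartition, while `maskedPartitionOf` is an alternative added here as an assumption (it is not in the original): masking the sign bit avoids the corner case in which Math.abs returns a negative number for Integer.MIN_VALUE. The class and method names are hypothetical.

```java
class PartitionSketch {

    // Same expression as in getPartition, without Hadoop types.
    static int partitionOf(int year, int numPartitions) {
        return Math.abs(year * 127) % numPartitions;
    }

    // Alternative (assumption, not from the original): clear the sign bit instead of Math.abs.
    static int maskedPartitionOf(int year, int numPartitions) {
        return ((year * 127) & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 3; // e.g. mapreduce.job.reduces=3
        for (int year = 1990; year <= 2000; year++) {
            System.out.println(year + " -> partition " + partitionOf(year, numPartitions));
        }
    }
}
```

Every record of a given year maps to the same partition number regardless of its value, which is what lets a single reducer see all of that year's data.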

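6. Key comparator: used to sort keys in the map phase and when map outputs are merged on the reduce side. If no grouping comparator is set, the key comparator is also used to group keys in the reduce phase.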

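import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

/**
 * @Author: xu.dm
 * @Date: 2019/2/21 11:59
 * @Description: Key comparator.
 * Orders IntPair by first ascending and second descending; applied when the mapper sorts.
 * As a result, the first record for any given year is the one with the largest value.
 */
public class KeyComparator extends WritableComparator {
    protected KeyComparator() {
        super(IntPair.class, true); // true: create key instances so compare() receives deserialized objects
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        IntPair p1 = (IntPair) a;
        IntPair p2 = (IntPair) b;
        int result = IntPair.compare(p1.getFirst(), p2.getFirst());
        if (result == 0) {
            result = -IntPair.compare(p1.getSecond(), p2.getSecond()); // negate to get descending order
        }
        return result;
    }
}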

7. Grouping comparator: this is the most important part; see the comments.

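import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

/**
 * @Author: xu.dm
 * @Date: 2019/2/21 12:16
 * @Description: Grouping comparator, applied in the reduce phase after the data reaches
 * the reducer and before records are merged into groups.
 * Goal in this example: make sure records of the same year end up in the same group.
 * The key comparator has already sorted keys by year ascending and value descending.
 * This comparator compares by year only, which means [1990,100] and [1990,00] are
 * treated as the same group.
 * reduce() is called once per group, and the key it receives is the first key of that
 * group. So after the reduce phase only the largest value of each year is written,
 * and all other records of that year are dropped.
 * This is how the example obtains the maximum in a roundabout way.
 */
public class GroupComparator extends WritableComparator {
    public GroupComparator() {
        super(IntPair.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        IntPair p1 = (IntPair) a;
        IntPair p2 = (IntPair) b;
        return IntPair.compare(p1.getFirst(), p2.getFirst());
    }
}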
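How the choice of grouping comparator changes what each reduce call sees can be sketched in plain Java. This is a simplified model, assuming the keys are already sorted and that a new group starts whenever the comparator reports that two neighboring keys differ; `GroupingSketch`, `reduceCalls`, `BY_YEAR`, and `BY_FULL_KEY` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java model of reduce-side grouping; keys are int[]{year, value}.
class GroupingSketch {

    // Groups by year only, like a grouping comparator on the first field.
    static final Comparator<int[]> BY_YEAR = (a, b) -> Integer.compare(a[0], b[0]);

    // Groups by the whole key, which is what happens when only identical keys are merged.
    static final Comparator<int[]> BY_FULL_KEY = (a, b) -> {
        int byYear = Integer.compare(a[0], b[0]);
        return byYear != 0 ? byYear : Integer.compare(a[1], b[1]);
    };

    // One row per simulated reduce call: {year, value of the group's first key, records in group}.
    static List<int[]> reduceCalls(List<int[]> sortedKeys, Comparator<int[]> grouping) {
        List<int[]> calls = new ArrayList<>();
        int[] first = null;    // first key of the current group
        int[] previous = null; // key seen just before the current one
        int size = 0;
        for (int[] key : sortedKeys) {
            if (previous != null && grouping.compare(previous, key) != 0) {
                calls.add(new int[]{first[0], first[1], size}); // close the finished group
                first = null;
                size = 0;
            }
            if (first == null) {
                first = key;
            }
            previous = key;
            size++;
        }
        if (first != null) {
            calls.add(new int[]{first[0], first[1], size});
        }
        return calls;
    }

    public static void main(String[] args) {
        List<int[]> sorted = Arrays.asList(
                new int[]{1990, 100}, new int[]{1990, 100}, new int[]{1990, 99}, new int[]{1991, 88});
        for (int[] call : reduceCalls(sorted, BY_YEAR)) {
            System.out.println("by year: " + Arrays.toString(call));
        }
        for (int[] call : reduceCalls(sorted, BY_FULL_KEY)) {
            System.out.println("by key:  " + Arrays.toString(call));
        }
    }
}
```

Grouping by year yields one call per year, headed by the largest value, with the sizes of all that year's records lumped together. Grouping by the full key yields one call per distinct (year, value) pair, which is the form needed for a per-pair count.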

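8. Mapper: if you only want the largest value per year, the IntWritable in Mapper<LongWritable,Text,IntPair,IntWritable> can be replaced with NullWritable. IntWritable is kept here because, with a small change, the program can also output a count for every (year, value) pair.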

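import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DataMapper extends Mapper<LongWritable,Text,IntPair,IntWritable> {
    private RecordParser parser = new RecordParser();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        parser.parse(value.toString());
        if (parser.isValid()) {
            // Year and value both go into the key; the value 1 is only used for counting.
            context.write(new IntPair(parser.getYear(), parser.getData()), new IntWritable(1));
            context.getCounter("MapValidData", "dataCounter").increment(1); // counter; the total should be 10,000 records
        }
    }
}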

9. Reducer

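import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

public class DataReducer extends Reducer<IntPair,IntWritable,IntPair,IntWritable> {
    @Override
    protected void reduce(IntPair key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        // Because of the grouping comparator, [1990,100] and [1990,00] are treated as the same group,
        // so a count taken here would mix all values of the year together. To get a correct count
        // for each value within a year, the grouping comparator must be disabled.
//        for(IntWritable val:values){
//            sum+=val.get();
//        }
        // With the loop above commented out, sum stays 0 and the key is the first (largest) key of the group.
        context.write(key, new IntWritable(sum));
    }
}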

10. Job

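import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class DataSecondarySort extends Configured implements Tool {
    /**
     * Execute the command with the given arguments.
     *
     * @param args command specific arguments.
     * @return exit code.
     * @throws Exception
     */
    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        // Check the configuration before it is used to create the job.
        if (conf == null) {
            return -1;
        }
//        conf.set("mapreduce.job.ubertask.enable","true");
        Job job = Job.getInstance(conf, "Secondary Sort");

        job.setJarByClass(DataSecondarySort.class);
        job.setMapperClass(DataMapper.class);
        job.setPartitionerClass(FirstPartitioner.class);
        job.setSortComparatorClass(KeyComparator.class);
        // Decides how keys are grouped in the reduce phase.
        job.setGroupingComparatorClass(GroupComparator.class);
        job.setReducerClass(DataReducer.class);
//        job.setNumReduceTasks(2); // For very large data, set the number of reducers (which is also the number of partitions) as needed; via Tool it can also be set on the command line.

        job.setOutputKeyClass(IntPair.class);
        // If only the maximum is needed, the mapper, the reducer, and this output value class can all use NullWritable.
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        Path outPath = new Path(args[1]);
        FileSystem fileSystem = outPath.getFileSystem(conf);
        // Delete the output path if it already exists.
        if (fileSystem.exists(outPath)) {
            fileSystem.delete(outPath, true);
        }

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new DataSecondarySort(), args);
        System.exit(exitCode);
    }
}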

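11. When computing the maximum, the result looks like this (the third column is 0 because the sum loop in the reducer is commented out):

[hadoop@bigdata-senior01 ~]$ hadoop fs -cat /output3/part-r-00000 | more
1990    100    0
1991    100    0
1992    100    0
1993    100    0
1994    100    0
1995    100    0
1996    100    0
1997    100    0
1998    100    0
1999    100    0
2000    100    0

To get a count for each (year, value) pair instead, comment out the call that sets the grouping comparator and re-enable the sum loop in the reducer. The output then lists every distinct pair, with the largest value of each year first:

[hadoop@bigdata-senior01 ~]$ hadoop fs -cat /output/part-r-00000 | more
1990    100    10
1990    99    15
1990    98    10
1990    97    9
1990    96    6
1990    95    4
1990    94    12
1990    93    9
1990    92    12
1990    91    13
1990    90    8
1990    89    9
... ...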

For multiple partitions, call job.setNumReduceTasks(n) or specify the number on the command line:

[hadoop@bigdata-senior01 ~]$ hadoop jar DataSecondarySort.jar -D mapreduce.job.reduces=3 /sampler /output3

[hadoop@bigdata-senior01 ~]$ hadoop fs -ls /output3
Found 4 items
-rw-r--r-- 1 hadoop supergroup 0 2019-02-21 15:29 /output3/_SUCCESS
-rw-r--r-- 1 hadoop supergroup 2868 2019-02-21 15:29 /output3/part-r-00000
-rw-r--r-- 1 hadoop supergroup 3860 2019-02-21 15:29 /output3/part-r-00001
-rw-r--r-- 1 hadoop supergroup 3850 2019-02-21 15:29 /output3/part-r-00002

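To sum up: grouping is the concept most easily misunderstood, and sorting and grouping are really the heart of MapReduce.

Summary of the steps:

1. Decide from the data volume whether to partition and how many partitions to write. This can be set in the job or specified on the command line.

2. Design the data structure and abstract it as an object. Define the object's sort rule, implement the comparison interface, and settle on the key comparator.

3. Define the grouping rule, group and aggregate as the case requires, and implement the grouping comparator.

4. In the job configuration, set the partitioner class, the sort comparator class, the grouping comparator class, and the output classes.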
