The Job class
    /**
     * Define the comparator that controls which keys are grouped together
     * for a single call to
     * {@link Reducer#reduce(Object, Iterable,
     *                       org.apache.hadoop.mapreduce.Reducer.Context)}
     * @param cls the raw comparator to use
     * @throws IllegalStateException if the job is submitted
     * @see #setCombinerKeyGroupingComparatorClass(Class)
     */
    public void setGroupingComparatorClass(Class<? extends RawComparator> cls
                                           ) throws IllegalStateException {
      ensureState(JobState.DEFINE);
      conf.setOutputValueGroupingComparator(cls);
    }
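Where does this get called from user code? A minimal driver sketch, assuming hypothetical MyReducer and MyGroupComparator classes (stand-ins for illustration, not from the source being read):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    // Hypothetical driver fragment wiring a custom grouping comparator.
    Job job = Job.getInstance(new Configuration(), "grouping-demo");
    job.setReducerClass(MyReducer.class);            // MyReducer: placeholder
    // Must run while the job is still in the DEFINE state; after submission,
    // ensureState(JobState.DEFINE) makes this call throw IllegalStateException.
    job.setGroupingComparatorClass(MyGroupComparator.class);  // placeholder class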
 
The JobConf class
The setOutputValueGroupingComparator method in JobConf:
    /**
     * Set the user defined {@link RawComparator} comparator for
     * grouping keys in the input to the reduce.
     *
     * <p>This comparator should be provided if the equivalence rules for keys
     * for sorting the intermediates are different from those for grouping keys
     * before each call to
     * {@link Reducer#reduce(Object, java.util.Iterator, OutputCollector, Reporter)}.</p>
     *
     * <p>For key-value pairs (K1,V1) and (K2,V2), the values (V1, V2) are passed
     * in a single call to the reduce function if K1 and K2 compare as equal.</p>
     *
     * <p>Since {@link #setOutputKeyComparatorClass(Class)} can be used to control
     * how keys are sorted, this can be used in conjunction to simulate
     * <i>secondary sort on values</i>.</p>
     *
     * <p><i>Note</i>: This is not a guarantee of the reduce sort being
     * <i>stable</i> in any sense. (In any case, with the order of available
     * map-outputs to the reduce being non-deterministic, it wouldn't make
     * that much sense.)</p>
     *
     * @param theClass the comparator class to be used for grouping keys.
     *                 It should implement <code>RawComparator</code>.
     * @see #setOutputKeyComparatorClass(Class)
     * @see #setCombinerKeyGroupingComparator(Class)
     */
    public void setOutputValueGroupingComparator(
        Class<? extends RawComparator> theClass) {
      setClass(JobContext.GROUP_COMPARATOR_CLASS,
               theClass, RawComparator.class);
    }
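setClass simply records the comparator class in the job configuration under the key JobContext.GROUP_COMPARATOR_CLASS. A rough round-trip sketch using Hadoop's generic Configuration API (the literal key string is my reading of what GROUP_COMPARATOR_CLASS resolves to in Hadoop 2.x; treat it as illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.RawComparator;

    Configuration conf = new Configuration();
    // Store: persisted by class name, after checking it implements RawComparator.
    conf.setClass("mapreduce.job.output.group.comparator.class",
                  MyGroupComparator.class, RawComparator.class);  // placeholder class
    // Load: a null default means "not set", which is exactly what
    // getOutputValueGroupingComparator tests for before falling back.
    Class<? extends RawComparator> cls =
        conf.getClass("mapreduce.job.output.group.comparator.class",
                      null, RawComparator.class);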
 
Press Ctrl+O in the IDE and locate getOutputValueGroupingComparator:
    /**
     * Get the user defined {@link WritableComparable} comparator for
     * grouping keys of inputs to the reduce.
     *
     * @return comparator set by the user for grouping values.
     * @see #setOutputValueGroupingComparator(Class) for details.
     */
    public RawComparator getOutputValueGroupingComparator() {
      Class<? extends RawComparator> theClass = getClass(
        JobContext.GROUP_COMPARATOR_CLASS, null, RawComparator.class);
      if (theClass == null) {
        return getOutputKeyComparator();
      }
      return ReflectionUtils.newInstance(theClass, this);
    }
 
So who calls getOutputValueGroupingComparator?
The ReduceTask class
In ReduceTask:
(No comparator field is declared on the class; the return value is simply captured in a local variable.)
    RawComparator comparator = job.getOutputValueGroupingComparator();
The comparator obtained here is exactly our custom xxxG. So next, search for where comparator is used:
    if (useNewApi) {
      runNewReducer(job, umbilical, reporter, rIter, comparator,
                    keyClass, valueClass);
    } else {
      runOldReducer(job, umbilical, reporter, rIter, comparator,
                    keyClass, valueClass);
    }
There are two paths because of the split between the new and old APIs.
 
So, locate the runNewReducer method:
    private <INKEY,INVALUE,OUTKEY,OUTVALUE> void runNewReducer(JobConf job,
                       final TaskUmbilicalProtocol umbilical,
                       final TaskReporter reporter,
                       RawKeyValueIterator rIter,
                       RawComparator<INKEY> comparator,
                       Class<INKEY> keyClass,
                       Class<INVALUE> valueClass
                       ) throws IOException, InterruptedException,
                                ClassNotFoundException {
      // wrap value iterator to report progress.
      final RawKeyValueIterator rawIter = rIter;
      rIter = new RawKeyValueIterator() {
        public void close() throws IOException {
          rawIter.close();
        }
        public DataInputBuffer getKey() throws IOException {
          return rawIter.getKey();
        }
        public Progress getProgress() {
          return rawIter.getProgress();
        }
        public DataInputBuffer getValue() throws IOException {
          return rawIter.getValue();
        }
        public boolean next() throws IOException {
          boolean ret = rawIter.next();
          reporter.setProgress(rawIter.getProgress().getProgress());
          return ret;
        }
      };
      // make a task context so we can get the classes
      org.apache.hadoop.mapreduce.TaskAttemptContext taskContext =
        new org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl(job,
            getTaskID(), reporter);
      // make a reducer
      org.apache.hadoop.mapreduce.Reducer<INKEY,INVALUE,OUTKEY,OUTVALUE> reducer =
        (org.apache.hadoop.mapreduce.Reducer<INKEY,INVALUE,OUTKEY,OUTVALUE>)
          ReflectionUtils.newInstance(taskContext.getReducerClass(), job);
      org.apache.hadoop.mapreduce.RecordWriter<OUTKEY,OUTVALUE> trackedRW =
        new NewTrackingRecordWriter<OUTKEY, OUTVALUE>(this, taskContext);
      job.setBoolean("mapred.skip.on", isSkipping());
      job.setBoolean(JobContext.SKIP_RECORDS, isSkipping());
      org.apache.hadoop.mapreduce.Reducer.Context
           reducerContext = createReduceContext(reducer, job, getTaskID(),
                                                 rIter, reduceInputKeyCounter,
                                                 reduceInputValueCounter,
                                                 trackedRW,
                                                 committer,
                                                 reporter, comparator, keyClass,
                                                 valueClass);
      try {
        reducer.run(reducerContext);
      } finally {
        trackedRW.close(reducerContext);
      }
    }
runNewReducer receives the comparator parameter and hands it on to createReduceContext.
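For context, the reducer.run(reducerContext) call at the end of runNewReducer is what consumes these grouping decisions. Its essence, paraphrased from the new-API Reducer (details vary slightly across versions):

    // Paraphrased sketch of org.apache.hadoop.mapreduce.Reducer#run.
    public void run(Context context) throws IOException, InterruptedException {
      setup(context);
      // nextKey() advances to the next group; getValues() keeps yielding values
      // for as long as the grouping comparator reports the key as "the same".
      while (context.nextKey()) {
        reduce(context.getCurrentKey(), context.getValues(), context);
      }
      cleanup(context);
    }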
 
The Task class
The createReduceContext method inside Task:
    @SuppressWarnings("unchecked")
    protected static <INKEY,INVALUE,OUTKEY,OUTVALUE>
    org.apache.hadoop.mapreduce.Reducer<INKEY,INVALUE,OUTKEY,OUTVALUE>.Context
    createReduceContext(org.apache.hadoop.mapreduce.Reducer
                          <INKEY,INVALUE,OUTKEY,OUTVALUE> reducer,
                        Configuration job,
                        org.apache.hadoop.mapreduce.TaskAttemptID taskId,
                        RawKeyValueIterator rIter,
                        org.apache.hadoop.mapreduce.Counter inputKeyCounter,
                        org.apache.hadoop.mapreduce.Counter inputValueCounter,
                        org.apache.hadoop.mapreduce.RecordWriter<OUTKEY,OUTVALUE> output,
                        org.apache.hadoop.mapreduce.OutputCommitter committer,
                        org.apache.hadoop.mapreduce.StatusReporter reporter,
                        RawComparator<INKEY> comparator,
                        Class<INKEY> keyClass, Class<INVALUE> valueClass
    ) throws IOException, InterruptedException {
      org.apache.hadoop.mapreduce.ReduceContext<INKEY, INVALUE, OUTKEY, OUTVALUE>
      reduceContext =
        new ReduceContextImpl<INKEY, INVALUE, OUTKEY, OUTVALUE>(job, taskId,
                                                                rIter,
                                                                inputKeyCounter,
                                                                inputValueCounter,
                                                                output,
                                                                committer,
                                                                reporter,
                                                                comparator,
                                                                keyClass,
                                                                valueClass);
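The excerpt stops here; in the sources I have seen, the method then wraps reduceContext and returns it, roughly as follows (paraphrased, version-dependent):

    // Paraphrased continuation of Task.createReduceContext.
    org.apache.hadoop.mapreduce.Reducer<INKEY,INVALUE,OUTKEY,OUTVALUE>.Context
        reducerContext =
          new WrappedReducer<INKEY, INVALUE, OUTKEY, OUTVALUE>()
              .getReducerContext(reduceContext);
    return reducerContext;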
 
The ReduceContextImpl class
In ReduceContextImpl, find the constructor:
    public ReduceContextImpl(Configuration conf, TaskAttemptID taskid,
                             RawKeyValueIterator input,
                             Counter inputKeyCounter,
                             Counter inputValueCounter,
                             RecordWriter<KEYOUT,VALUEOUT> output,
                             OutputCommitter committer,
                             StatusReporter reporter,
                             RawComparator<KEYIN> comparator,
                             Class<KEYIN> keyClass,
                             Class<VALUEIN> valueClass
                            ) throws InterruptedException, IOException {
      super(conf, taskid, output, committer, reporter);
      this.input = input;
      this.inputKeyCounter = inputKeyCounter;
      this.inputValueCounter = inputValueCounter;
      this.comparator = comparator;
      this.serializationFactory = new SerializationFactory(conf);
      this.keyDeserializer = serializationFactory.getDeserializer(keyClass);
      this.keyDeserializer.open(buffer);
      this.valueDeserializer = serializationFactory.getDeserializer(valueClass);
      this.valueDeserializer.open(buffer);
      hasMore = input.next();
      this.keyClass = keyClass;
      this.valueClass = valueClass;
      this.conf = conf;
      this.taskid = taskid;
    }
 
Now search for comparator within ReduceContextImpl:
    /**
     * Advance to the next key/value pair.
     */
    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
      if (!hasMore) {
        key = null;
        value = null;
        return false;
      }
      firstValue = !nextKeyIsSame;
      DataInputBuffer nextKey = input.getKey();
      currentRawKey.set(nextKey.getData(), nextKey.getPosition(),
                        nextKey.getLength() - nextKey.getPosition());
      buffer.reset(currentRawKey.getBytes(), 0, currentRawKey.getLength());
      key = keyDeserializer.deserialize(key);
      DataInputBuffer nextVal = input.getValue();
      buffer.reset(nextVal.getData(), nextVal.getPosition(), nextVal.getLength()
          - nextVal.getPosition());
      value = valueDeserializer.deserialize(value);

      currentKeyLength = nextKey.getLength() - nextKey.getPosition();
      currentValueLength = nextVal.getLength() - nextVal.getPosition();

      if (isMarked) {
        backupStore.write(nextKey, nextVal);
      }

      hasMore = input.next();
      if (hasMore) {
        nextKey = input.getKey();
        nextKeyIsSame = comparator.compare(currentRawKey.getBytes(), 0,
                                       currentRawKey.getLength(),
                                       nextKey.getData(),
                                       nextKey.getPosition(),
                                       nextKey.getLength() - nextKey.getPosition()
                                           ) == 0;
      } else {
        nextKeyIsSame = false;
      }
      inputValueCounter.increment(1);
      return true;
    }
This compare call is the one declared on the RawComparator interface:

    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);

Common key types such as Text and IntWritable ship with implementations of this method. The result feeds nextKeyIsSame, which is exactly what decides whether the next record still belongs to the current reduce group.
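How do such types satisfy this byte-level signature? Unless an optimized raw comparator is registered for the type, WritableComparator's default path deserializes both keys and delegates to compareTo. A paraphrased sketch of that default logic (treat the exact field names as illustrative):

    // Paraphrased sketch of WritableComparator's default raw compare.
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
      try {
        // Deserialize both raw keys into reusable WritableComparable instances...
        buffer.reset(b1, s1, l1);
        key1.readFields(buffer);
        buffer.reset(b2, s2, l2);
        key2.readFields(buffer);
      } catch (IOException e) {
        throw new RuntimeException(e);
      }
      // ...then fall back to the object-level comparison: key1.compareTo(key2).
      return compare(key1, key2);
    }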
 
(1) When no grouping comparator is set
    if (theClass == null) {
      return getOutputKeyComparator();
    }
    /**
     * Get the {@link RawComparator} comparator used to compare keys.
     *
     * @return the {@link RawComparator} comparator used to compare keys.
     */
    public RawComparator getOutputKeyComparator() {
      Class<? extends RawComparator> theClass = getClass(
        JobContext.KEY_COMPARATOR, null, RawComparator.class);
      if (theClass != null)
        return ReflectionUtils.newInstance(theClass, this);
      return WritableComparator.get(getMapOutputKeyClass().asSubclass(WritableComparable.class), this);
    }
 
Without job.setGroupingComparatorClass(xxxG.class), the default applies: grouping falls back to the comparator that sorts the map output keys, i.e. the compare of whatever class the key belongs to, such as Text's.
So by default, grouping reuses the sort comparator (more precisely, its comparison method). That comparator in turn comes in two flavors:
1. the compareTo method of the key's own class;
2. the compare method of a custom comparator class.
Whichever flavor is in effect, grouping and sorting would then share one and the same set of rules, either both flavor 1 or both flavor 2, which is rarely what we want. That is why we usually define a custom grouping class alongside the custom sort comparator; flavor 1 is sketched right below, and flavor 2 further down.
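Flavor 1, made concrete: a hypothetical composite key whose compareTo drives the sort, the classic secondary-sort setup (CompositeKey and its fields are invented for illustration, not from the source above):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    // Hypothetical composite key: sorts by (word, count) so values arrive ordered.
    public class CompositeKey implements WritableComparable<CompositeKey> {
      private String word;   // the "natural" key we actually want to group by
      private int count;     // the value smuggled into the key for sorting

      public void write(DataOutput out) throws IOException {
        out.writeUTF(word);
        out.writeInt(count);
      }
      public void readFields(DataInput in) throws IOException {
        word = in.readUTF();
        count = in.readInt();
      }
      // Flavor 1: the key class's own compareTo defines the sort order.
      public int compareTo(CompositeKey other) {
        int cmp = word.compareTo(other.word);
        return cmp != 0 ? cmp : Integer.compare(count, other.count);
      }
      public String getWord() { return word; }
    }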
 
(2) When a grouping comparator is set

    return ReflectionUtils.newInstance(theClass, this);

If we do call job.setGroupingComparatorClass(xxxG.class), then it is our custom grouping class xxxG that gets instantiated here.
This xxxG should extend WritableComparator and override its compare method, e.g.:

    public static class SelfGroupComparator extends WritableComparator {

Overriding compare is all it takes; the invocation path is then identical to that of the sort comparator. A fuller sketch follows.
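A minimal sketch of such a grouping class, reusing the hypothetical CompositeKey from the flavor-1 sketch above:

    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    // Hypothetical grouping comparator: groups CompositeKeys by word only, so
    // one reduce() call sees all counts for a word, already sorted by count.
    public static class SelfGroupComparator extends WritableComparator {
      public SelfGroupComparator() {
        // 'true' asks WritableComparator to create reusable key instances,
        // so its default raw compare can deserialize and delegate to us.
        super(CompositeKey.class, true);
      }
      @Override
      public int compare(WritableComparable a, WritableComparable b) {
        return ((CompositeKey) a).getWord().compareTo(((CompositeKey) b).getWord());
      }
    }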
 
 
I would recommend approach 2, a dedicated comparator class.
 
 
Alt+Left Arrow jumps back to the previous spot you were viewing in the source.
