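MRJobConfig (interface)
     declares the constant COMBINE_CLASS_ATTR:
     public static final String COMBINE_CLASS_ATTR = "mapreduce.job.combine.class";
     ---- sub-interface JobContext (F4 in the IDE shows the type hierarchy)
           declares the method getCombinerClass
             ---- implementing class JobContextImpl
                 implements getCombinerClass: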
                 public Class<? extends Reducer<?,?,?,?>> getCombinerClass()
                          throws ClassNotFoundException {
                      return (Class<? extends Reducer<?,?,?,?>>)
                        conf.getClass(COMBINE_CLASS_ATTR, null);
                 }
                 JobContextImpl implements JobContext, which extends MRJobConfig,
                 so it inherits the constant COMBINE_CLASS_ATTR from MRJobConfig.
                 ---- subclass Job
                     public void setCombinerClass(Class<? extends Reducer> cls
                                                  ) throws IllegalStateException {
                       ensureState(JobState.DEFINE);
                       conf.setClass(COMBINE_CLASS_ATTR, cls, Reducer.class);
                     }
                 Job extends JobContextImpl, so it also sees COMBINE_CLASS_ATTR.
                 setCombinerClass stores our class in the configuration under that key,
                 and getCombinerClass later reads it back from the same key.
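
To make the set/get round trip concrete, here is a minimal sketch in plain Java with no Hadoop dependency. MiniConf, MiniReducer and SumCombiner are hypothetical stand-ins I made up for Configuration, Reducer and a user combiner; only the key string comes from the source above. The sketch assumes the class is stored by name and loaded again on read.

```java
import java.util.HashMap;
import java.util.Map;

class CombinerConfigDemo {
    static final String COMBINE_CLASS_ATTR = "mapreduce.job.combine.class";

    // Hypothetical stand-ins for Reducer and a user-defined combiner.
    static class MiniReducer {}
    static class SumCombiner extends MiniReducer {}

    // Hypothetical stand-in for the job configuration: class values are stored by name.
    static class MiniConf {
        private final Map<String, String> props = new HashMap<>();

        // Store the class name under the key, after checking it is a subtype of xface.
        void setClass(String key, Class<?> cls, Class<?> xface) {
            if (!xface.isAssignableFrom(cls)) {
                throw new IllegalArgumentException(cls + " is not a " + xface.getName());
            }
            props.put(key, cls.getName());
        }

        // Load the class back by name, or return the default when the key is unset.
        Class<?> getClass(String key, Class<?> defaultValue) {
            String name = props.get(key);
            if (name == null) {
                return defaultValue;
            }
            try {
                return Class.forName(name);
            } catch (ClassNotFoundException e) {
                throw new IllegalStateException(e);
            }
        }
    }

    public static void main(String[] args) {
        MiniConf conf = new MiniConf();
        // Nothing has been set yet, so the default (null) comes back.
        System.out.println(conf.getClass(COMBINE_CLASS_ATTR, null));
        // This models job.setCombinerClass(SumCombiner.class).
        conf.setClass(COMBINE_CLASS_ATTR, SumCombiner.class, MiniReducer.class);
        // This models getCombinerClass(): the same key yields the class we stored.
        System.out.println(conf.getClass(COMBINE_CLASS_ATTR, null).getSimpleName());
    }
}
```

The point is only that the setter and the getter never talk to each other directly; they meet through the shared key in the configuration.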
 
 
MRJobConfig
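    ---- sub-interface JobContext
        declares getCombinerClass
        ---- implementing class JobContextImpl
            ---- subclass Job
            ---- subclass TaskAttemptContextImpl (implements TaskAttemptContext)
        ---- sub-interface TaskAttemptContext
            inherits getCombinerClass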
 
Task   
   $CombinerRunner (inner class of Task)
            This inner class has a static factory method, create:
            public static <K,V> CombinerRunner<K,V> create(JobConf job,
                               TaskAttemptID taskId,
                               Counters.Counter inputCounter,
                               TaskReporter reporter,
                               org.apache.hadoop.mapreduce.OutputCommitter committer
                              ) throws ClassNotFoundException
            {
                  Class<? extends Reducer<K,V,K,V>> cls =
                    (Class<? extends Reducer<K,V,K,V>>) job.getCombinerClass();
                  if (cls != null) {
                    return new OldCombinerRunner(cls, job, inputCounter, reporter);
                  }
                  // make a task context so we can get the classes
                  org.apache.hadoop.mapreduce.TaskAttemptContext taskContext =
                    new org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl(job, taskId,
                        reporter);
                  Class<? extends org.apache.hadoop.mapreduce.Reducer<K,V,K,V>> newcls =
                    (Class<? extends org.apache.hadoop.mapreduce.Reducer<K,V,K,V>>)
                       taskContext.getCombinerClass();
                  if (newcls != null) {
                    return new NewCombinerRunner<K,V>(newcls, job, taskId, taskContext,
                                                      inputCounter, reporter, committer);
                  }
                  return null;
            }
                  This first part appears to be the old API (org.apache.hadoop.mapred):
                  Class<? extends Reducer<K,V,K,V>> cls =
                          (Class<? extends Reducer<K,V,K,V>>) job.getCombinerClass();
                  if (cls != null) {
                          return new OldCombinerRunner(cls, job, inputCounter, reporter);
                  }
                  and this part is the new API (org.apache.hadoop.mapreduce):
                  org.apache.hadoop.mapreduce.TaskAttemptContext taskContext =
                    new org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl(job, taskId,
                        reporter);
                  Class<? extends org.apache.hadoop.mapreduce.Reducer<K,V,K,V>> newcls =
                    (Class<? extends org.apache.hadoop.mapreduce.Reducer<K,V,K,V>>)
                       taskContext.getCombinerClass();
                  if (newcls != null) {
                    return new NewCombinerRunner<K,V>(newcls, job, taskId, taskContext,
                                                      inputCounter, reporter, committer);
                  }
                  return null;
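                  (The fully qualified names are presumably there because the old and new APIs both define
                   a Reducer, a TaskAttemptContext and so on. Without the package names, the casts and the
                   generics, the logic is much easier to read.)
                  TaskAttemptContext is a sub-interface of JobContext, so it inherits getCombinerClass.
                  The call is polymorphic: what actually runs is getCombinerClass of the implementing class
                  TaskAttemptContextImpl (which extends JobContextImpl, where the method is implemented).
                  So in the end it reads COMBINE_CLASS_ATTR, i.e. the class xxxC that we passed to job.setCombinerClass.
                    xxxC is assigned to newcls, and newcls is passed to the NewCombinerRunner constructor as the reducerClass parameter: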
                      NewCombinerRunner(Class reducerClass,
                          JobConf job,
                          org.apache.hadoop.mapreduce.TaskAttemptID taskId,
                          org.apache.hadoop.mapreduce.TaskAttemptContext context,
                          Counters.Counter inputCounter,
                          TaskReporter reporter,
                          org.apache.hadoop.mapreduce.OutputCommitter committer)
                      {
                          super(inputCounter, job, reporter);
                          this.reducerClass = reducerClass;
                          this.taskId = taskId;
                          keyClass = (Class<K>) context.getMapOutputKeyClass();
                          valueClass = (Class<V>) context.getMapOutputValueClass();
                          comparator = (RawComparator<K>) context.getCombinerKeyGroupingComparator();
                          this.committer = committer;
                      }
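
Stripped of package names, casts and generics, the decision made in create can be sketched as below. This is a plain-Java model, not Hadoop code: the configuration is a simple Map, the runners are represented by strings, and chooseRunner is a hypothetical name. The old-API key "mapred.combiner.class" is my understanding of what JobConf reads and should be treated as an assumption.

```java
import java.util.HashMap;
import java.util.Map;

class CombinerLookupDemo {
    // Key names as I understand them; treat the old one in particular as an assumption.
    static final String OLD_KEY = "mapred.combiner.class";
    static final String NEW_KEY = "mapreduce.job.combine.class";

    // Returns a description of the runner that would be built,
    // or null when no combiner is configured at all.
    static String chooseRunner(Map<String, String> conf) {
        String oldCls = conf.get(OLD_KEY);
        if (oldCls != null) {
            return "OldCombinerRunner(" + oldCls + ")"; // old API wins if present
        }
        String newCls = conf.get(NEW_KEY);
        if (newCls != null) {
            return "NewCombinerRunner(" + newCls + ")"; // otherwise try the new API
        }
        return null; // no combiner: the caller spills map output directly
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println(chooseRunner(conf)); // null
        conf.put(NEW_KEY, "WCReduce");
        System.out.println(chooseRunner(conf)); // NewCombinerRunner(WCReduce)
    }
}
```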
Task          
  MapTask
        $MapOutputBuffer
            private CombinerRunner<K,V> combinerRunner;
            combinerRunner is created in init() and used in sortAndSpill(), which $SpillThread calls ($ marks an inner class):
                combinerRunner = CombinerRunner.create(job, getTaskID(),
                                             combineInputCounter,
                                             reporter, null);
                // at this point we hold the combiner that was configured (or null if none was set)
                if (combinerRunner == null) {
                      // spill directly
                      DataInputBuffer key = new DataInputBuffer();
                      while (spindex < mend &&
                          kvmeta.get(offsetFor(spindex % maxRec) + PARTITION) == i) {
                        final int kvoff = offsetFor(spindex % maxRec);
                        int keystart = kvmeta.get(kvoff + KEYSTART);
                        int valstart = kvmeta.get(kvoff + VALSTART);
                        key.reset(kvbuffer, keystart, valstart - keystart);
                        getVBytesForOffset(kvoff, value);
                        writer.append(key, value);
                        ++spindex;
                      }
                } else {
                      int spstart = spindex;
                      while (spindex < mend &&
                          kvmeta.get(offsetFor(spindex % maxRec)
                                    + PARTITION) == i) {
                        ++spindex;
                      }
                      // Note: we would like to avoid the combiner if we've fewer
                      // than some threshold of records for a partition
                      if (spstart != spindex) {
                        combineCollector.setWriter(writer);
                        RawKeyValueIterator kvIter =
                          new MRResultIterator(spstart, spindex);
                        combinerRunner.combine(kvIter, combineCollector);
                      }
                }
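
The two branches above can be modelled in a few lines of plain Java. This is a sketch under simplifying assumptions: one partition only, records already sorted by key, keys and counts held as strings, and a hypothetical spill method with a BiFunction standing in for the combiner. It is not how MapOutputBuffer stores records.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

class SpillDemo {
    // Writes one partition's sorted (word, count) records.
    // combiner == null models the "spill directly" branch.
    static List<String> spill(List<String[]> sortedRecords,
                              BiFunction<String, List<Integer>, Integer> combiner) {
        List<String> out = new ArrayList<>();
        if (combiner == null) {
            for (String[] kv : sortedRecords) {
                out.add(kv[0] + "=" + kv[1]); // every record is written as-is
            }
            return out;
        }
        int i = 0;
        while (i < sortedRecords.size()) {
            String key = sortedRecords.get(i)[0];
            List<Integer> values = new ArrayList<>();
            // Equal keys are adjacent because the records are sorted before spilling,
            // so the combiner sees <key, list of values>, not single pairs.
            while (i < sortedRecords.size() && sortedRecords.get(i)[0].equals(key)) {
                values.add(Integer.parseInt(sortedRecords.get(i)[1]));
                i++;
            }
            out.add(key + "=" + combiner.apply(key, values));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> records = List.of(
            new String[] {"a", "1"}, new String[] {"a", "1"}, new String[] {"b", "1"});
        System.out.println(spill(records, null)); // [a=1, a=1, b=1]
        System.out.println(spill(records,
            (k, vs) -> vs.stream().mapToInt(Integer::intValue).sum())); // [a=2, b=1]
    }
}
```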
            
            Now look at the combine method,
            in Task's inner class NewCombinerRunner:
            public void combine(RawKeyValueIterator iterator,
                                OutputCollector<K,V> collector)
                throws IOException, InterruptedException,ClassNotFoundException
            {
              // make a reducer
              org.apache.hadoop.mapreduce.Reducer<K,V,K,V> reducer =
                (org.apache.hadoop.mapreduce.Reducer<K,V,K,V>)
                  ReflectionUtils.newInstance(reducerClass, job);
              org.apache.hadoop.mapreduce.Reducer.Context
                   reducerContext = createReduceContext(reducer, job, taskId,
                                                        iterator, null, inputCounter,
                                                        new OutputConverter(collector),
                                                        committer,
                                                        reporter, comparator, keyClass,
                                                        valueClass);
              reducer.run(reducerContext);
            }
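            The reducerClass above is the xxxC we passed in.
            An xxxC object is created by reflection and cast up to Reducer,
            and then run is called on it (xxxC does not define run, so the one inherited from Reducer is used).
            In the Reducer class, run looks like this: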
            /**
           * Advanced application writers can use the
           * {@link #run(org.apache.hadoop.mapreduce.Reducer.Context)} method to
           * control how the reduce task works.
           */
          public void run(Context context) throws IOException, InterruptedException {
            setup(context);
            try {
              while (context.nextKey()) {
                reduce(context.getCurrentKey(), context.getValues(), context);
                // If a back up store is used, reset it
                Iterator<VALUEIN> iter = context.getValues().iterator();
                if(iter instanceof ReduceContext.ValueIterator) {
                  ((ReduceContext.ValueIterator<VALUEIN>)iter).resetBackupStore();       
                }
              }
            } finally {
              cleanup(context);
            }
          }
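
The run method is a template method: it fixes the order setup, reduce per key, cleanup, and leaves reduce as the hook. A minimal plain-Java sketch of that shape follows. MiniReducer and SumReducer are hypothetical stand-ins, and the grouped input is a plain Map rather than a Hadoop Context.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TemplateRunDemo {
    // Hypothetical stand-in for Reducer: run() fixes the call order, reduce() is the hook.
    static class MiniReducer {
        final List<String> output = new ArrayList<>();

        protected void setup() {}
        protected void cleanup() {}

        // Default behaviour: pass every value through unchanged.
        protected void reduce(String key, Iterable<Integer> values) {
            for (int v : values) {
                output.add(key + "=" + v);
            }
        }

        public void run(Map<String, List<Integer>> grouped) {
            setup();
            try {
                for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
                    reduce(e.getKey(), e.getValue()); // dispatches to the subclass override
                }
            } finally {
                cleanup();
            }
        }
    }

    // Plays the role of xxxC: it only overrides reduce and inherits run.
    static class SumReducer extends MiniReducer {
        @Override
        protected void reduce(String key, Iterable<Integer> values) {
            int total = 0;
            for (int v : values) {
                total += v;
            }
            output.add(key + "=" + total);
        }
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        grouped.put("a", List.of(1, 1));
        grouped.put("b", List.of(1));
        MiniReducer r = new SumReducer(); // declared as the base type, like the cast in combine()
        r.run(grouped);
        System.out.println(r.output); // [a=2, b=1]
    }
}
```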
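          Because of polymorphism, the reduce called here is the one overridden in the subclass xxxC
         (if a subclass overrides a method, the subclass version is what actually runs).
          So a custom combiner class should extend Reducer and override reduce.
          Its input looks like this (taking wordcount as the example):
       reduce(Text key, Iterable<IntWritable> values, Context context)
       where key is the word and values is the list of its counts: value1, value2, ...
       Note that the values already arrive as a list, i.e. <key, list<value1, value2, value3, ...>>.
       (I reached this conclusion because the combiner I used was WCReduce,
        i.e. the same class served as both reducer and combiner. If the input had the form <key, value>,
        combining would be impossible. Merging equal keys and summing their counts are two separate steps:
        on the map side, pairs with the same key are first grouped into <key, value list>; only then does the
        combiner receive that regrouped input, process it (here, by summing) and write to the context,
        after which the output enters the shuffle towards the reduce side.
        I then added a loop in reduce that printed every value with System.out.println before summing.
        The sum came out as 0, because after that first pass the iterator was already at the end.
        )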
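
The "sum is 0 after printing" effect can be reproduced without Hadoop. The sketch below assumes, as my reading of the behaviour described above, that the values object hands back the same underlying iterator every time. OnePassValues is a hypothetical class written only to model that.

```java
import java.util.Iterator;
import java.util.List;

class OnePassDemo {
    // Hypothetical Iterable that always returns the same underlying iterator,
    // so a second for-each loop starts where the first one stopped.
    static class OnePassValues implements Iterable<Integer> {
        private final Iterator<Integer> shared;

        OnePassValues(List<Integer> data) {
            this.shared = data.iterator();
        }

        @Override
        public Iterator<Integer> iterator() {
            return shared;
        }
    }

    static int sum(Iterable<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        Iterable<Integer> values = new OnePassValues(List.of(1, 1, 1));
        System.out.println(sum(values)); // 3: the first pass consumes everything
        System.out.println(sum(values)); // 0: nothing is left for a second pass
    }
}
```

If both the values and their sum are needed, doing both inside a single loop avoids the problem.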
Try the code below yourself: enable and comment out the two loops in different combinations, and you will reach the same conclusion.
  // reduce
      public static class WCReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
          private final IntWritable ValueOut = new IntWritable();

          @Override
          protected void reduce(Text key, Iterable<IntWritable> values,
                  Context context) throws IOException, InterruptedException {
              // Loop 1: print each value. This pass consumes the values.
              for (IntWritable value : values) {
                  System.out.println(value.get() + "--");
              }

              // Loop 2: sum the values and emit the total.
              // Enable it alone, or together with loop 1, and compare the output.
  //            int total = 0;
  //            for (IntWritable value : values) {
  //                total += value.get();
  //            }
  //            ValueOut.set(total);
  //            context.write(key, ValueOut);
          }

      }

  // in the driver:
  job.setCombinerClass(WCReduce.class);
 
 
