mapreduce (2) Implementing an inverted index with MapReduce (1): a combiner pre-aggregates map output locally before it is sent to the reducers

1 Approach:
Sample input (file name, followed by the file's contents):
0.txt MapReduce is simple
1.txt MapReduce is powerfull is simple
2.txt Hello MapReduce bye MapReduce

1 map function: context.write(word:docid, 1), i.e. the map function emits word:docid as its output key and 1 as the value
output key        output value
MapReduce:0.txt 1
is:0.txt 1
simple:0.txt 1
MapReduce:1.txt 1
is:1.txt 1
powerfull:1.txt 1
is:1.txt 1
simple:1.txt 1
Hello:2.txt 1
MapReduce:2.txt 1
bye:2.txt 1
MapReduce:2.txt 1
2 combine function: values that share the same key (word:docid) are summed, then context.write(word, docid:count) is called, i.e. word becomes the output key and docid:count the output value
input key    summed input values  =>  output key    output value
MapReduce:0.txt 1 => MapReduce 0.txt:1
is:0.txt 1        => is 0.txt:1
simple:0.txt 1    => simple 0.txt:1
MapReduce:1.txt 1 => MapReduce 1.txt:1
is:1.txt 2        => is 1.txt:2
powerfull:1.txt 1 => powerfull 1.txt:1
simple:1.txt 1    => simple 1.txt:1
Hello:2.txt 1     => Hello 2.txt:1
MapReduce:2.txt 2 => MapReduce 2.txt:2
bye:2.txt 1       => bye 2.txt:1
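3 Partitioner function: HashPartitioner
Omitted. Note that Hadoop applies the partitioner to the map output key (word:docid) before the combiner runs, so the key rewritten by the combiner does not influence partitioning. With the default single reduce task this makes no difference.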
4 Reducer function: it only concatenates the strings (the code below terminates each entry with ";")
output key    output value
MapReduce 0.txt:1;1.txt:1;2.txt:2;
is 0.txt:1;1.txt:2;
simple 0.txt:1;1.txt:1;
powerfull 1.txt:1;
Hello 2.txt:1;
bye 2.txt:1;
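
The four steps above can be traced in memory without a cluster. The following is a minimal sketch in plain Java, not the Hadoop job itself: the class name InvertedIndexSketch and its methods are hypothetical, and the shuffle and partitioning are replaced by ordinary maps. It assumes every document is processed as a whole.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory walk-through of the map, combine and reduce steps (no Hadoop involved). */
class InvertedIndexSketch {

    /** Map step: one "word:docid" key per token; each key stands for the pair <word:docid, 1>. */
    static List<String> map(String docId, String text) {
        List<String> keys = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            keys.add(word + ":" + docId);
        }
        return keys;
    }

    /** Combine step: sum the 1s per "word:docid" key, then re-key as word -> "docid:count". */
    static Map<String, List<String>> combine(List<String> mapKeys) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String key : mapKeys) {
            counts.merge(key, 1, Integer::sum);
        }
        Map<String, List<String>> byWord = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int sep = e.getKey().indexOf(':'); // first ':' separates word and docid
            String word = e.getKey().substring(0, sep);
            String docId = e.getKey().substring(sep + 1);
            byWord.computeIfAbsent(word, w -> new ArrayList<>()).add(docId + ":" + e.getValue());
        }
        return byWord;
    }

    /** Reduce step: concatenate the postings of each word, each terminated by ';'. */
    static Map<String, String> reduce(Map<String, List<String>> byWord) {
        Map<String, String> index = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : byWord.entrySet()) {
            StringBuilder sb = new StringBuilder();
            for (String posting : e.getValue()) {
                sb.append(posting).append(';');
            }
            index.put(e.getKey(), sb.toString());
        }
        return index;
    }

    /** Runs all three steps over a map of docid -> contents. */
    static Map<String, String> buildIndex(Map<String, String> docs) {
        List<String> mapOutput = new ArrayList<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            mapOutput.addAll(map(doc.getKey(), doc.getValue()));
        }
        return reduce(combine(mapOutput));
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("0.txt", "MapReduce is simple");
        docs.put("1.txt", "MapReduce is powerfull is simple");
        docs.put("2.txt", "Hello MapReduce bye MapReduce");
        buildIndex(docs).forEach((word, postings) -> System.out.println(word + "\t" + postings));
    }
}
```

Because the sketch sees each file in one piece, the counts per docid are complete; that is exactly the assumption the real job cannot always rely on.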

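// This part seems problematic. A combiner is essentially a local reduce, so what happens when a file is larger than 64 MB (128 MB in Hadoop 2.x)? Could one file be divided into two splits, and would the <word:docid, count> tallies computed here then be wrong?
// To be safe, the index can be built with two MapReduce jobs instead: http://www.cnblogs.com/i80386/p/3600174.html
A combiner pre-aggregates map output locally, within each map task, before the data is shuffled to the reducers. Hadoop may run it zero, one, or several times, so a job should not depend on the combiner for correctness.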
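
The concern can be made concrete with a small sketch in plain Java. It assumes a hypothetical situation in which 1.txt spans two splits and each split contains the word "is" once, so two map tasks each report 1.txt:1. The class PostingMerge and both methods are illustrative names. mergeByDoc is one possible alternative reducer logic, not the two-job solution from the linked post.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Compares plain concatenation with merging postings that share a docid. */
class PostingMerge {

    /** What the reducer in this article does: concatenate every "docid:count" it receives. */
    static String concat(List<String> postings) {
        StringBuilder sb = new StringBuilder();
        for (String posting : postings) {
            sb.append(posting).append(';');
        }
        return sb.toString();
    }

    /** Alternative: add up the counts of postings that refer to the same docid. */
    static String mergeByDoc(List<String> postings) {
        Map<String, Integer> perDoc = new TreeMap<>();
        for (String posting : postings) {
            // The count follows the last ':'; the docid itself may be a URI containing colons.
            int sep = posting.lastIndexOf(':');
            String docId = posting.substring(0, sep);
            int count = Integer.parseInt(posting.substring(sep + 1));
            perDoc.merge(docId, count, Integer::sum);
        }
        StringBuilder sb = new StringBuilder();
        perDoc.forEach((docId, count) -> sb.append(docId).append(':').append(count).append(';'));
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical reducer input for "is": 1.txt was read by two map tasks.
        List<String> postingsForIs = Arrays.asList("0.txt:1", "1.txt:1", "1.txt:1");
        System.out.println(concat(postingsForIs));     // 0.txt:1;1.txt:1;1.txt:1;
        System.out.println(mergeByDoc(postingsForIs)); // 0.txt:1;1.txt:2;
    }
}
```

With concatenation the same document appears twice with partial counts. Summing per docid in the reducer gives one entry per document regardless of how the file was split.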

2 The code is as follows:
package proj;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class InvertedIndex {

    // Emits <word:docid, "1"> for every token of the input line.
    public static class InvertedIndexMapper extends
            Mapper<Object, Text, Text, Text> {
        private Text keyInfo = new Text();
        private Text valueInfo = new Text();
        private FileSplit split;

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // The full path of the file behind this split serves as the docid.
            split = (FileSplit) context.getInputSplit();
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                keyInfo.set(itr.nextToken() + ":" + split.getPath().toString());
                valueInfo.set("1");
                context.write(keyInfo, valueInfo);
            }
        }
    }

    // This part seems problematic. A combiner is essentially a local reduce, so what happens
    // when a file is larger than 64 MB (128 MB in Hadoop 2.x)? Could one file be divided into
    // two splits, and would the <word:docid, count> tallies computed here then be wrong?
    // To be safe, the index can be built with two MapReduce jobs instead:
    // http://www.cnblogs.com/i80386/p/3600174.html

    // Sums the counts for each word:docid key and re-keys the record as <word, docid:count>.
    public static class InvertedIndexCombiner extends
            Reducer<Text, Text, Text, Text> {
        private Text info = new Text();

        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (Text value : values) {
                sum += Integer.parseInt(value.toString());
            }
            // The first ':' separates the word from the docid (the docid may contain more colons).
            int splitIndex = key.toString().indexOf(":");
            info.set(key.toString().substring(splitIndex + 1) + ":" + sum);
            key.set(key.toString().substring(0, splitIndex));
            context.write(key, info);
        }
    }

    // Concatenates all docid:count entries of a word, each terminated by ';'.
    public static class InvertedIndexReducer extends
            Reducer<Text, Text, Text, Text> {
        private Text result = new Text();

        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            StringBuilder buff = new StringBuilder();
            for (Text val : values) {
                buff.append(val.toString()).append(";");
            }
            result.set(buff.toString());
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args)
                .getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: InvertedIndex <in> <out>");
            System.exit(2);
        }
        // Deprecated in Hadoop 2.x, where Job.getInstance(conf, "InvertedIndex") replaces it.
        Job job = new Job(conf, "InvertedIndex");
        job.setJarByClass(InvertedIndex.class);
        job.setMapperClass(InvertedIndexMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setCombinerClass(InvertedIndexCombiner.class);
        job.setReducerClass(InvertedIndexReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
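
The job above never sets the number of reduce tasks, so it runs with Hadoop's default of one and every record reaches the same reducer. With more reducers the partition of a record matters. The sketch below applies the formula used by HashPartitioner, (hashCode & Integer.MAX_VALUE) % numReduceTasks, to the map output keys of one word. It is an illustration under a stated assumption: String.hashCode stands in for Text.hashCode, which computes a different value, so the partition numbers printed here will not match a real cluster. PartitionSketch is a hypothetical name.

```java
/** Illustrates hash partitioning of word:docid keys (String.hashCode as a stand-in for Text.hashCode). */
class PartitionSketch {

    /** Same formula as HashPartitioner.getPartition, applied to a String key. */
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        String[] mapOutputKeys = {"MapReduce:0.txt", "MapReduce:1.txt", "MapReduce:2.txt"};
        for (int reducers : new int[] {1, 3}) {
            for (String key : mapOutputKeys) {
                System.out.println(reducers + " reduce task(s): " + key
                        + " -> partition " + partition(key, reducers));
            }
        }
    }
}
```

Since the partition is derived from word:docid rather than from word, nothing guarantees that all entries of one word end up at the same reducer once there is more than one. That is a further reason to treat the single-job version as a demonstration only.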

The run produces the following output:

Hello    hdfs://localhost:9000/user/root/in/2.txt:1;
MapReduce    hdfs://localhost:9000/user/root/in/2.txt:2;hdfs://localhost:9000/user/root/in/0.txt:1;hdfs://localhost:9000/user/root/in/1.txt:1;
bye    hdfs://localhost:9000/user/root/in/2.txt:1;
is    hdfs://localhost:9000/user/root/in/0.txt:1;hdfs://localhost:9000/user/root/in/1.txt:2;
powerfull    hdfs://localhost:9000/user/root/in/1.txt:1;
simple    hdfs://localhost:9000/user/root/in/1.txt:1;hdfs://localhost:9000/user/root/in/0.txt:1;
