Inverted Index in MapReduce

0. Background reading on inverted indexes:

http://blog.csdn.net/pzasdq/article/details/51442856

1. Three source log files:

a.txt

hello tom

hello jerry

hello tom

b.txt

hello jerry

hello jerry

tom jerry

c.txt

hello jerry

hello tom

The desired result lists, for each word, how many times it occurs in each file:

hello   a.txt->3 b.txt->2 c.txt->2

jerry   b.txt->3 a.txt->1 c.txt->1

tom     a.txt->2 b.txt->1 c.txt->1
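Before moving to Hadoop, the target computation can be checked in memory. The following is a minimal plain-Java sketch, not the MapReduce job itself: the class name InvertedIndexSketch and its methods are hypothetical, the three files are hard-coded from the samples above, and the ordering (highest count first, ties by file name) is an assumption read off the desired result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class InvertedIndexSketch {

    // Builds word -> (file name -> occurrence count) from in-memory "files".
    public static Map<String, Map<String, Integer>> build(Map<String, List<String>> files) {
        Map<String, Map<String, Integer>> index = new TreeMap<>();
        for (Map.Entry<String, List<String>> file : files.entrySet()) {
            for (String line : file.getValue()) {
                for (String word : line.trim().split("\\s+")) {
                    if (word.isEmpty()) {
                        continue; // blank line
                    }
                    index.computeIfAbsent(word, w -> new TreeMap<>())
                         .merge(file.getKey(), 1, Integer::sum);
                }
            }
        }
        return index;
    }

    // Formats one word's postings: highest count first, ties in file-name order.
    public static String format(String word, Map<String, Integer> postings) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(postings.entrySet());
        // List.sort is stable, so equal counts keep the TreeMap's file-name order.
        entries.sort((a, b) -> b.getValue().compareTo(a.getValue()));
        StringBuilder sb = new StringBuilder(word);
        for (Map.Entry<String, Integer> e : entries) {
            sb.append(' ').append(e.getKey()).append("->").append(e.getValue());
        }
        return sb.toString();
    }

    // The three sample files from this post.
    public static Map<String, List<String>> sampleFiles() {
        Map<String, List<String>> files = new LinkedHashMap<>();
        files.put("a.txt", Arrays.asList("hello tom", "hello jerry", "hello tom"));
        files.put("b.txt", Arrays.asList("hello jerry", "hello jerry", "tom jerry"));
        files.put("c.txt", Arrays.asList("hello jerry", "hello tom"));
        return files;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Map<String, Integer>> e : build(sampleFiles()).entrySet()) {
            System.out.println(format(e.getKey(), e.getValue()));
        }
    }
}
```

Running main prints one line per word in the same shape as the desired result, which gives a reference to compare the job's output against.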

2. The code:

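 import java.io.IOException;
 import java.util.ArrayList;
 import java.util.Collections;
 import java.util.Comparator;
 import java.util.List;
 import java.util.Map;
 import java.util.TreeMap;

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.LongWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.Mapper;
 import org.apache.hadoop.mapreduce.Reducer;
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
 import org.apache.hadoop.mapreduce.lib.input.FileSplit;
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

 public class InverseIndex {

     // The key is the plain word in every stage and every value has the form "file->count".
     // Hadoop may run the combiner zero, one, or several times, so the combiner must read
     // and write the same (word, file->count) shape that the mapper emits.

     // Sums the "file->count" values of one word into a map of file name -> total count.
     private static Map<String, Integer> sumPerFile(Iterable<Text> values) {
         Map<String, Integer> counts = new TreeMap<String, Integer>();
         for (Text text : values) {
             String s = text.toString();
             int sep = s.lastIndexOf("->");
             String file = s.substring(0, sep);
             int n = Integer.parseInt(s.substring(sep + 2));
             Integer old = counts.get(file);
             counts.put(file, old == null ? n : old + n);
         }
         return counts;
     }

     public static class IndexMapper extends Mapper<LongWritable, Text, Text, Text>{
         private Text k = new Text();
         private Text v = new Text();

         @Override
         protected void map(LongWritable key, Text value,Mapper<LongWritable, Text, Text, Text>.Context context)
                 throws IOException, InterruptedException {
             // The input split tells us which file this mapper is reading,
             // e.g. path = hdfs://itcast:9000/ii/a.txt; keep only the file name (a.txt).
             FileSplit inputSplit = (FileSplit)context.getInputSplit();
             String fileName = inputSplit.getPath().getName();
             // k2,v2 is hello, a.txt->1
             v.set(fileName + "->1");
             for (String word : value.toString().trim().split("\\s+")) {
                 if (word.isEmpty()) {
                     continue; // blank line
                 }
                 k.set(word);
                 context.write(k, v);
             }
         }
     }

     public static class IndexCombiner extends Reducer<Text, Text, Text, Text>{
         private Text v = new Text();

         @Override
         protected void reduce(Text key, Iterable<Text> values,Reducer<Text, Text, Text, Text>.Context context)
                 throws IOException, InterruptedException {
             // hello {a.txt->1, a.txt->1, a.txt->1}   ----->   hello, a.txt->3
             for (Map.Entry<String, Integer> e : sumPerFile(values).entrySet()) {
                 v.set(e.getKey() + "->" + e.getValue());
                 context.write(key, v);
             }
         }
     }

     public static class IndexReducer extends Reducer<Text, Text, Text, Text>{
         private Text v = new Text();

         @Override
         protected void reduce(Text key, Iterable<Text> values,Reducer<Text, Text, Text, Text>.Context context)
                 throws IOException, InterruptedException {
             // The framework groups the values of all mappers by key before calling reduce,
             // which is why values arrives as an Iterable.
             // hello {a.txt->3, b.txt->2, c.txt->2}   ----->   hello   a.txt->3 b.txt->2 c.txt->2
             List<Map.Entry<String, Integer>> entries =
                     new ArrayList<Map.Entry<String, Integer>>(sumPerFile(values).entrySet());
             // Highest count first; the sort is stable, so ties stay in file-name order.
             Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
                 @Override
                 public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                     return b.getValue().compareTo(a.getValue());
                 }
             });
             StringBuilder result = new StringBuilder();
             for (Map.Entry<String, Integer> e : entries) {
                 if (result.length() > 0) {
                     result.append("\t");
                 }
                 result.append(e.getKey()).append("->").append(e.getValue());
             }
             v.set(result.toString());
             context.write(key, v);
         }
     }

     public static void main(String[] args) throws Exception {
         Configuration conf = new Configuration();
         Job job = Job.getInstance(conf);
         job.setJarByClass(InverseIndex.class);

         job.setMapperClass(IndexMapper.class);
         job.setMapOutputKeyClass(Text.class);
         job.setMapOutputValueClass(Text.class);
         job.setCombinerClass(IndexCombiner.class);
         FileInputFormat.setInputPaths(job, new Path(args[0]));

         job.setReducerClass(IndexReducer.class);
         job.setOutputKeyClass(Text.class);
         job.setOutputValueClass(Text.class);
         FileOutputFormat.setOutputPath(job, new Path(args[1]));

         // Exit code 0 means the job succeeded, 1 means it failed.
         System.exit(job.waitForCompletion(true) ? 0 : 1);
     }
 }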
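Hadoop treats the combiner as an optional optimization and may apply it zero, one, or several times to a map task's output, so the final result should not depend on how often it runs. The sketch below illustrates that property in plain Java, without any Hadoop classes. The class CombinerRunsSketch and its methods are hypothetical names, and the "file->count" value format is assumed for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CombinerRunsSketch {

    // One combine pass: merges "file->count" strings that refer to the same file.
    // Input and output have the same shape, so the pass can be applied repeatedly.
    public static List<String> combine(List<String> values) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String value : values) {
            int sep = value.lastIndexOf("->");
            String file = value.substring(0, sep);
            int n = Integer.parseInt(value.substring(sep + 2));
            counts.merge(file, n, Integer::sum);
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            out.add(e.getKey() + "->" + e.getValue());
        }
        return out;
    }

    // Final reduce step: the same merge, joined into one output line for the word.
    public static String reduce(String word, List<String> values) {
        return word + "\t" + String.join(" ", combine(values));
    }

    public static void main(String[] args) {
        // Values emitted for the word "hello" while reading a.txt.
        List<String> mapOutput = Arrays.asList("a.txt->1", "a.txt->1", "a.txt->1");

        String zeroPasses = reduce("hello", mapOutput);
        String onePass = reduce("hello", combine(mapOutput));
        String twoPasses = reduce("hello", combine(combine(mapOutput)));

        System.out.println(zeroPasses);
        System.out.println(onePass);
        System.out.println(twoPasses);
    }
}
```

All three lines printed by main are identical, because the combine step leaves the key untouched and emits values in the same format it consumes. A combiner that rewrites the key, or changes the value format, only produces the right answer when it happens to run exactly once.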

3. Package the code as a jar and run it from the command line:

hadoop jar /root/itcastmr.jar itcastmr.inverseindex.InverseIndex /user/root/InverseIndex /InverseIndexResult

Then check the result files under /InverseIndexResult.
