MapReduce Inverted Index

An inverted index is used mainly for full-text search: it maps each word to the documents (for example, URLs) that contain it.

file1:

MapReduce is simple

file2:

MapReduce is powerful is simple

file3:

Hello MapReduce bye MapReduce

After building the inverted index, the result is:

Hello     file3.txt:1;
MapReduce     file3.txt:2;file1.txt:1;file2.txt:1;
bye     file3.txt:1;
is     file1.txt:1;file2.txt:2;
powerful     file2.txt:1;
simple     file2.txt:1;file1.txt:1;
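
To make the target structure concrete, here is a minimal in-memory sketch in plain Java (no Hadoop) that builds the same index from the three sample files. The class name InvertedIndexDemo and the helpers build and format are hypothetical names chosen for illustration, and the file names file1.txt to file3.txt are assumed from the listing above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

class InvertedIndexDemo {

    // Builds word -> (file name -> occurrence count) from in-memory documents.
    static Map<String, Map<String, Integer>> build(Map<String, String> docs) {
        Map<String, Map<String, Integer>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String word : doc.getValue().trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue; // an empty document yields one empty token
                }
                index.computeIfAbsent(word, w -> new TreeMap<>())
                     .merge(doc.getKey(), 1, Integer::sum);
            }
        }
        return index;
    }

    // Renders one posting list as "file:count;file:count;".
    static String format(Map<String, Integer> postings) {
        StringBuilder sb = new StringBuilder();
        postings.forEach((file, count) ->
                sb.append(file).append(':').append(count).append(';'));
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("file1.txt", "MapReduce is simple");
        docs.put("file2.txt", "MapReduce is powerful is simple");
        docs.put("file3.txt", "Hello MapReduce bye MapReduce");
        build(docs).forEach((word, postings) ->
                System.out.println(word + "\t" + format(postings)));
    }
}
```

Because the sketch keeps postings in a TreeMap, the files inside each list come out sorted by name, so the line for MapReduce reads file1.txt:1;file2.txt:1;file3.txt:2;. The order inside a posting list carries no meaning; in a real job it depends on the order in which values reach the reducer.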

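The approach is as follows. The mapper walks through each file and, for every word, emits the key word:filename with the value 1. The combiner sums the values that share a key, which gives the number of times the word occurs in that file, and re-emits the word as the key with filename:count as the value. The reducer then joins the filename:count values for each word into one list.

There is a bug in this design, though. A combiner only sees the output of a single map task, yet here it is used to produce the count for a whole file. If a file is large enough to be divided into several input splits, its content is handled by several map tasks, possibly on different nodes, and the same word then gets several partial counts for that file instead of one total. Hadoop also does not guarantee that the combiner is run exactly once, which this design depends on. So the code is only suitable for small files.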
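
The large-file problem can be reproduced without a cluster. The plain-Java sketch below simulates two map tasks (each with its own combiner) that receive two halves of file3.txt, then compares a reducer that only concatenates values, as in this post, with one that sums the partial counts per file. All names here (SplitFileDemo, mapAndCombine, reduceByConcatenation, reduceBySumming) are hypothetical, and summing in the reducer is shown as one possible remedy, not as part of the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SplitFileDemo {

    // Simulates one map task plus its combiner: counts the words in one chunk
    // of a file and emits word -> "filename:count" pairs.
    static List<String[]> mapAndCombine(String filename, String chunk) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : chunk.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        List<String[]> out = new ArrayList<>();
        counts.forEach((w, c) -> out.add(new String[] {w, filename + ":" + c}));
        return out;
    }

    // Reducer that only concatenates whatever "filename:count" values arrive.
    static Map<String, String> reduceByConcatenation(List<String[]> pairs) {
        Map<String, String> result = new TreeMap<>();
        for (String[] p : pairs) {
            result.merge(p[0], p[1] + ";", String::concat);
        }
        return result;
    }

    // Alternative reducer (an assumption, not the post's code): sums the
    // partial counts of each file before formatting the list.
    static Map<String, String> reduceBySumming(List<String[]> pairs) {
        Map<String, Map<String, Integer>> totals = new TreeMap<>();
        for (String[] p : pairs) {
            int sep = p[1].lastIndexOf(':');
            String file = p[1].substring(0, sep);
            int count = Integer.parseInt(p[1].substring(sep + 1));
            totals.computeIfAbsent(p[0], w -> new TreeMap<>())
                  .merge(file, count, Integer::sum);
        }
        Map<String, String> result = new TreeMap<>();
        totals.forEach((word, files) -> {
            StringBuilder sb = new StringBuilder();
            files.forEach((f, c) -> sb.append(f).append(':').append(c).append(';'));
            result.put(word, sb.toString());
        });
        return result;
    }

    public static void main(String[] args) {
        // file3.txt arrives as two chunks, as if it spanned two input splits.
        List<String[]> pairs = new ArrayList<>();
        pairs.addAll(mapAndCombine("file3.txt", "Hello MapReduce"));
        pairs.addAll(mapAndCombine("file3.txt", "bye MapReduce"));
        System.out.println(reduceByConcatenation(pairs).get("MapReduce")); // file3.txt:1;file3.txt:1;
        System.out.println(reduceBySumming(pairs).get("MapReduce"));       // file3.txt:2;
    }
}
```

With concatenation the word MapReduce ends up with two entries for the same file; summing per file in the reducer restores the single total, at the cost of parsing the values again.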

PS: the file name is obtained with String filename = ((FileSplit) context.getInputSplit()).getPath().getName(); there does not seem to be anything else worth noting.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class MyMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    public void map(LongWritable ikey, Text ivalue, Context context)
            throws IOException, InterruptedException {
        // Name of the file that the current input split belongs to
        String filename = ((FileSplit) context.getInputSplit()).getPath().getName();

        StringTokenizer st = new StringTokenizer(ivalue.toString());
        while (st.hasMoreTokens()) {
            // Composite key "word:filename"; the value "1" marks one occurrence
            String key = st.nextToken() + ":" + filename;
            context.write(new Text(key), new Text("1"));
        }
    }
}

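import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MyCombiner extends Reducer<Text, Text, Text, Text> {

    @Override
    public void reduce(Text _key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Sum the "1" values the mapper emitted for this word:filename key
        int sum = 0;
        for (Text val : values) {
            sum += Integer.parseInt(val.toString());
        }

        // Split "word:filename" and re-emit it as word -> "filename:count"
        StringTokenizer st = new StringTokenizer(_key.toString(), ":");
        String key = st.nextToken();
        String value = st.nextToken() + ":" + sum;
        context.write(new Text(key), new Text(value));
    }
}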
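
One detail of the key handling is worth a small illustration. The combiner splits the composite key with a StringTokenizer on ":", so a word that itself contains a colon (a time such as 12:30, for instance) is cut short. The sketch below contrasts that with splitting at the last colon. It assumes that file names never contain ":" and that every key contains at least one colon; CompositeKeyDemo and both method names are hypothetical.

```java
import java.util.Arrays;
import java.util.StringTokenizer;

class CompositeKeyDemo {

    // Splitting as in the combiner: takes the first two ':'-separated tokens.
    static String[] splitWithTokenizer(String compositeKey) {
        StringTokenizer st = new StringTokenizer(compositeKey, ":");
        return new String[] {st.nextToken(), st.nextToken()};
    }

    // Alternative: split at the last ':' so that colons inside the word survive.
    // Assumes the key contains a ':' and the file name does not.
    static String[] splitAtLastColon(String compositeKey) {
        int sep = compositeKey.lastIndexOf(':');
        return new String[] {
            compositeKey.substring(0, sep),
            compositeKey.substring(sep + 1)
        };
    }

    public static void main(String[] args) {
        // Composite key for the token "12:30" found in file1.txt
        String key = "12:30" + ":" + "file1.txt";
        System.out.println(Arrays.toString(splitWithTokenizer(key))); // [12, 30]
        System.out.println(Arrays.toString(splitAtLastColon(key)));   // [12:30, file1.txt]
    }
}
```

For ordinary words without a colon both methods return the same pair, so the difference only matters for input that is not plain prose.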

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MyReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    public void reduce(Text _key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Join all "filename:count" values for this word into one list
        StringBuilder filelist = new StringBuilder();
        for (Text val : values) {
            filelist.append(val.toString()).append("; ");
        }
        context.write(_key, new Text(filelist.toString()));
    }
}
