I. A Brief Introduction to Inverted Indexes

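An inverted index (also called a postings file or an inverted file) is an indexing method used in full-text search. It stores a mapping from each word to the locations where that word occurs in a document or a set of documents.

It is the most commonly used data structure in document retrieval systems.

Take English as an example. The following texts are to be indexed: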

T0 = "it is what it is"
T1 = "what is it"
T2 = "it is a banana"

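From these we obtain the following inverted file index, where each word maps to the set of text numbers that contain it: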

 "a":      {2}

 "banana": {2}

 "is":     {0, 1, 2}

 "it":     {0, 1, 2}

 "what":   {0, 1}

A query for the terms "what", "is" and "it" is answered by intersecting their sets: {0, 1} ∩ {0, 1, 2} ∩ {0, 1, 2} = {0, 1}
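The lookup above can be sketched in plain Java. This is a minimal illustration and not part of the MapReduce program: the class name InvertedIndexDemo and the methods build and search are hypothetical names chosen for this example, and splitting on whitespace is assumed to be sufficient for the English sample texts.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper class for illustration only.
class InvertedIndexDemo {

    /** Builds term -> sorted set of document ids; docs.get(i) has id i. */
    static Map<String, Set<Integer>> build(List<String> docs) {
        Map<String, Set<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            // Assumption: whitespace splitting is enough for this English example.
            for (String term : docs.get(docId).split("\\s+")) {
                if (term.isEmpty()) {
                    continue;
                }
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
        return index;
    }

    /** Returns the ids of the documents that contain every query term. */
    static Set<Integer> search(Map<String, Set<Integer>> index, String... terms) {
        Set<Integer> result = null;
        for (String term : terms) {
            Set<Integer> postings = index.getOrDefault(term, Collections.emptySet());
            if (result == null) {
                result = new TreeSet<>(postings);   // first term: copy its set
            } else {
                result.retainAll(postings);         // later terms: intersect
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("it is what it is", "what is it", "it is a banana");
        Map<String, Set<Integer>> index = build(docs);
        System.out.println(index);
        System.out.println(search(index, "what", "is", "it"));
    }
}
```

Sorted sets are used here only to make the printed index easy to compare with the listing above; a real search engine would typically store compressed posting lists instead.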

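For Chinese word segmentation, an open-source Chinese tokenizer can be used; this post uses ik-analyzer.

Prepare a few text files and write some content into them for testing.

The content of file1.txt is as follows: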

其实我们发现，互联网裁员潮频现甚至要高于其它行业领域

The content of file2.txt is as follows:

面对寒冬，互联网企业不得不调整人员结构，优化雇员的投入产出

The content of file3.txt is as follows:

在互联网内部，因为内部竞争机制以及要与竞争对手拼进度

The content of file4.txt is as follows (two lines):

互联网大公司职员尽管能够从复杂性和专业分工中受益

互联网企业不得不调整人员结构

II. Adding Dependencies

In addition to the core Hadoop JARs, add lucene-analyzers-common and ik-analyzers for Chinese word segmentation:

    <!-- Lucene analysis module -->
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-analyzers-common</artifactId>
      <version>6.0.0</version>
    </dependency>
    <!-- IK tokenizer -->
    <dependency>
      <groupId>cn.bestwu</groupId>
      <artifactId>ik-analyzers</artifactId>
      <version>5.1.0</version>
    </dependency>

III. The MapReduce Program

For how to configure the IK tokenizer under Lucene 6.0, see http://blog.csdn.net/napoay/article/details/51911875. The MapReduce program is as follows.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

/**
 * Created by bee on 4/4/17.
 *
 * Builds an inverted index: the mapper emits (term, file name) pairs, and the
 * reducer counts each term's occurrences per file and in total.
 * IKAnalyzer6x and HadoopUtil are not imported, so they are expected to be
 * helper classes in the same package.
 */
public class InvertIndexIk {

    public static class InvertMapper extends Mapper<Object, Text, Text, Text> {

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            // Name of the file that the current input split was read from
            String filename = ((FileSplit) context.getInputSplit()).getPath().getName();
            Text fname = new Text(filename);

            IKAnalyzer6x analyzer = new IKAnalyzer6x(true);
            String line = value.toString();
            StringReader reader = new StringReader(line);

            // The first argument of tokenStream is a field name, not the text itself
            try (TokenStream tokenStream = analyzer.tokenStream("content", reader)) {
                tokenStream.reset();
                CharTermAttribute termAttribute = tokenStream.getAttribute(CharTermAttribute.class);
                // Emit one (term, file name) pair per token
                while (tokenStream.incrementToken()) {
                    Text word = new Text(termAttribute.toString());
                    context.write(word, fname);
                }
                tokenStream.end();
            }
        }
    }

    public static class InvertReducer extends Reducer<Text, Text, Text, Text> {

        @Override
        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            // File name -> number of occurrences of this term in that file
            Map<String, Integer> map = new HashMap<String, Integer>();
            for (Text val : values) {
                map.merge(val.toString(), 1, Integer::sum);
            }

            // Total frequency of the term across all files
            int termFreq = 0;
            for (int count : map.values()) {
                termFreq += count;
            }

            context.write(key, new Text(map.toString() + "  " + termFreq));
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // Input and output paths are hard-coded for this test run
        String[] otherargs = new String[]{"input/InvertIndex", "output"};
        if (otherargs.length != 2) {
            System.err.println("Usage: InvertIndexIk <in> <out>");
            System.exit(2);
        }

        // Remove any previous output directory; the job fails if it already exists
        HadoopUtil.deleteDir(otherargs[1]);

        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "InvertIndexIk");
        job.setJarByClass(InvertIndexIk.class);
        job.setMapperClass(InvertIndexIk.InvertMapper.class);
        job.setReducerClass(InvertIndexIk.InvertReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(otherargs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherargs[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
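To make the data flow easier to follow without a Hadoop cluster, the map, shuffle, and reduce steps can be imitated in plain Java. This is only a rough sketch under stated assumptions: the class LocalInvertIndex and its methods are hypothetical names, whitespace splitting stands in for the IK tokenizer, and the in-memory grouping stands in for Hadoop's shuffle. It uses a sorted map for the per-file counts, so the order of file names may differ from the job's HashMap-based output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process imitation of the job; not Hadoop code.
class LocalInvertIndex {

    /** Map step: emit one (term, fileName) pair per token of the line. */
    static List<String[]> map(String fileName, String line) {
        List<String[]> pairs = new ArrayList<>();
        // Assumption: tokens are separated by whitespace (stand-in for IK).
        for (String term : line.trim().split("\\s+")) {
            if (!term.isEmpty()) {
                pairs.add(new String[]{term, fileName});
            }
        }
        return pairs;
    }

    /** Shuffle step: group the emitted file names by term, terms sorted. */
    static Map<String, List<String>> shuffle(List<String[]> pairs) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] pair : pairs) {
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        return grouped;
    }

    /** Reduce step: count occurrences per file and append the total. */
    static String reduce(List<String> fileNames) {
        Map<String, Integer> perFile = new TreeMap<>();
        int total = 0;
        for (String fileName : fileNames) {
            perFile.merge(fileName, 1, Integer::sum);
            total++;
        }
        return perFile + "  " + total;
    }

    /** Runs all three steps over fileName -> lines and returns term -> value. */
    static Map<String, String> run(Map<String, List<String>> files) {
        List<String[]> pairs = new ArrayList<>();
        for (Map.Entry<String, List<String>> file : files.entrySet()) {
            for (String line : file.getValue()) {
                pairs.addAll(map(file.getKey(), line));
            }
        }
        Map<String, String> output = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> group : shuffle(pairs).entrySet()) {
            output.put(group.getKey(), reduce(group.getValue()));
        }
        return output;
    }

    public static void main(String[] args) {
        Map<String, List<String>> files = new LinkedHashMap<>();
        files.put("a.txt", Arrays.asList("internet company", "internet layoffs"));
        files.put("b.txt", Arrays.asList("internet winter"));
        run(files).forEach((term, value) -> System.out.println(term + "\t" + value));
    }
}
```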

IV. Results

The output is as follows:

专业分工	{file4.txt=1}  1
中	{file4.txt=1}  1
其实	{file1.txt=1}  1
互联网	{file1.txt=1, file3.txt=1, file4.txt=2, file2.txt=1}  5
人员	{file4.txt=1, file2.txt=1}  2
企业	{file4.txt=1, file2.txt=1}  2
优化	{file2.txt=1}  1
内部	{file3.txt=2}  2
发现	{file1.txt=1}  1
受益	{file4.txt=1}  1
复杂性	{file4.txt=1}  1
大公司	{file4.txt=1}  1
寒冬	{file2.txt=1}  1
投入产出	{file2.txt=1}  1
拼	{file3.txt=1}  1
潮	{file1.txt=1}  1
现	{file1.txt=1}  1
竞争对手	{file3.txt=1}  1
竞争机制	{file3.txt=1}  1
结构	{file4.txt=1, file2.txt=1}  2
职员	{file4.txt=1}  1
行业	{file1.txt=1}  1
裁员	{file1.txt=1}  1
要与	{file3.txt=1}  1
调整	{file4.txt=1, file2.txt=1}  2
进度	{file3.txt=1}  1
雇员	{file2.txt=1}  1
面对	{file2.txt=1}  1
领域	{file1.txt=1}  1
频	{file1.txt=1}  1
高于	{file1.txt=1}  1

The output has three columns: the term, the term's frequency in each individual file, and its total frequency across all files.
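As an illustration of this three-column layout, the sketch below parses one output line back into its parts. The class IndexLine and its parse method are hypothetical names introduced for this example. It assumes the line has the shape "term {file=count, ...}  total" and that file names contain neither commas nor braces. The sample term in main is written with Unicode escapes and spells 互联网.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical parser for one line of the job output; illustration only.
class IndexLine {
    final String term;
    final Map<String, Integer> perFile;
    final int total;

    IndexLine(String term, Map<String, Integer> perFile, int total) {
        this.term = term;
        this.perFile = perFile;
        this.total = total;
    }

    /** Splits a line into the term, the per-file counts, and the total. */
    static IndexLine parse(String line) {
        int open = line.indexOf('{');
        int close = line.lastIndexOf('}');
        if (open < 0 || close < open) {
            throw new IllegalArgumentException("Unexpected line format: " + line);
        }
        // Column 1: everything before the opening brace
        String term = line.substring(0, open).trim();

        // Column 2: comma-separated "fileName=count" entries inside the braces
        Map<String, Integer> perFile = new LinkedHashMap<>();
        String body = line.substring(open + 1, close).trim();
        if (!body.isEmpty()) {
            for (String entry : body.split(",\\s*")) {
                int eq = entry.lastIndexOf('=');
                perFile.put(entry.substring(0, eq).trim(),
                        Integer.parseInt(entry.substring(eq + 1).trim()));
            }
        }

        // Column 3: the number after the closing brace
        int total = Integer.parseInt(line.substring(close + 1).trim());
        return new IndexLine(term, perFile, total);
    }

    public static void main(String[] args) {
        IndexLine parsed = parse(
                "\u4e92\u8054\u7f51\t{file1.txt=1, file3.txt=1, file4.txt=2, file2.txt=1}  5");
        int sum = 0;
        for (int count : parsed.perFile.values()) {
            sum += count;
        }
        // The per-file counts should add up to the total in the third column.
        System.out.println(parsed.perFile + " sums to " + sum + ", total column is " + parsed.total);
    }
}
```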

V. References

1. https://zh.wikipedia.org/wiki/倒排索引

2. Using the IK Tokenizer with Lucene 6.0
