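Inverted Index with MapReduce

Indexes:

What is an index: an index is a data structure that helps a database retrieve data efficiently. An index is built on a database table. It holds the values of certain columns of the table together with the addresses of the corresponding records, and stores them in a data structure. The most common choices for that structure are hash tables and B+ trees.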
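To make the definition concrete, here is a minimal sketch in plain Java. It is not how any particular database implements indexes. The `TABLE` data, the column numbers, and the use of list positions as record "addresses" are all illustrative assumptions. A `HashMap` stands in for a hash index, and a `TreeMap` (a red-black tree) stands in for a B+ tree only in the sense that both keep keys ordered.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ColumnIndexSketch {
    // Hypothetical table: each row is {name, city}.
    // The position in the list plays the role of the record address.
    static final List<String[]> TABLE = List.of(
            new String[] {"alice", "Paris"},
            new String[] {"bob", "Berlin"},
            new String[] {"carol", "Paris"});

    // Build an index on one column: column value -> addresses of the rows holding it.
    static Map<String, List<Integer>> buildIndex(int column, Map<String, List<Integer>> index) {
        for (int address = 0; address < TABLE.size(); address++) {
            String value = TABLE.get(address)[column];
            index.computeIfAbsent(value, k -> new ArrayList<>()).add(address);
        }
        return index;
    }

    public static void main(String[] args) {
        // Hash index: fast exact-match lookup on the city column.
        Map<String, List<Integer>> hashIndex = buildIndex(1, new HashMap<>());
        System.out.println(hashIndex.get("Paris")); // prints [0, 2]

        // Ordered index: also supports range queries such as "cities before C".
        TreeMap<String, List<Integer>> orderedIndex = new TreeMap<>();
        buildIndex(1, orderedIndex);
        System.out.println(orderedIndex.headMap("C")); // prints {Berlin=[1]}
    }
}
```

The lookup returns addresses rather than rows, which is the point of an index: the table itself is only touched for the rows that match.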

A detailed analysis of indexes: https://blog.csdn.net/meiLin_Ya/article/details/80854232

Let the code do the talking. First, take a look at my data:

package com.huhu.day05;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class InvertedIndex implements Tool {

	private Configuration conf;

	public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {

		private FileSplit split;
		private Text va = new Text();

		@Override
		protected void setup(Context context) throws IOException, InterruptedException {
			// Remember which input file this mapper is reading.
			split = (FileSplit) context.getInputSplit();
		}

		@Override
		protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
			String[] line = value.toString().split(" ");
			// Debug output: the words of the current line.
			System.err.println(String.join(" ", line));
			String filename = split.getPath().getName();
			for (String s : line) {
				// key.get() is the byte offset of the line within the file;
				// indexOf(s) is the first position of the word within the line.
				va.set("fileName:" + filename + ":" + key.get() + "\tindex position:" + value.toString().indexOf(s) + "\t");
				context.write(new Text("search term:" + s + "\r"), new Text(va));
			}
		}
	}

	public static class MyReduce extends Reducer<Text, Text, Text, Text> {

		@Override
		protected void reduce(Text key, Iterable<Text> values, Context context)
				throws IOException, InterruptedException {
			// Concatenate every location emitted for this word.
			StringBuilder sb = new StringBuilder();
			for (Text v : values) {
				sb.append(v.toString());
			}
			context.write(key, new Text(sb.toString()));
		}
	}

	public static void main(String[] args) throws Exception {
		InvertedIndex t = new InvertedIndex();
		Configuration conf = t.getConf();
		String[] other = new GenericOptionsParser(conf, args).getRemainingArgs();
		if (other.length != 2) {
			// Only a warning: run() uses hard-coded paths.
			System.err.println("number is fail");
		}
		int run = ToolRunner.run(conf, t, args);
		System.exit(run);
	}

	@Override
	public Configuration getConf() {
		// Create the configuration lazily and keep it, so every caller sees the same instance.
		if (conf == null) {
			conf = new Configuration();
		}
		return conf;
	}

	@Override
	public void setConf(Configuration conf) {
		// Store the configuration passed in by ToolRunner instead of discarding it.
		this.conf = conf;
	}

	@Override
	public int run(String[] other) throws Exception {
		Configuration con = getConf();
		Job job = Job.getInstance(con);
		job.setJarByClass(InvertedIndex.class);
		job.setMapperClass(MyMapper.class);
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(Text.class);
		// Default partitioner
		// job.setPartitionerClass(HashPartitioner.class);
		job.setReducerClass(MyReduce.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(Text.class);
		FileInputFormat.addInputPath(job, new Path("hdfs://ry-hadoop1:8020/in/day05/InvertedIndex"));
		Path path = new Path("hdfs://ry-hadoop1:8020/out/day05.txt");
		// Resolve the file system from the path so the HDFS URI is honored.
		FileSystem fs = path.getFileSystem(con);
		// Delete any previous output, otherwise the job refuses to start.
		if (fs.exists(path)) {
			fs.delete(path, true);
		}
		FileOutputFormat.setOutputPath(job, path);
		return job.waitForCompletion(true) ? 0 : 1;
	}

}
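The mapper and reducer logic can be tried without a cluster. The following is a minimal plain-Java sketch of the same idea, not the Hadoop job itself. The `map`, `reduce`, and `build` helpers, the sample file names, and the posting format `file:lineOffset:position` are illustrative assumptions. One deliberate difference: `indexOf(s)` on the whole line always reports the first occurrence of a word, so the sketch searches from the end of the previous word instead, which gives repeated words distinct positions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class InvertedIndexSketch {
    // "Map" step: emit (word, "file:lineOffset:position") for every word in one line.
    static List<String[]> map(String fileName, long lineOffset, String line) {
        List<String[]> pairs = new ArrayList<>();
        int from = 0; // where the search for the next word starts
        for (String word : line.split(" ")) {
            if (word.isEmpty()) {
                continue; // consecutive spaces produce empty tokens
            }
            int position = line.indexOf(word, from);
            pairs.add(new String[] {word, fileName + ":" + lineOffset + ":" + position});
            from = position + word.length();
        }
        return pairs;
    }

    // "Reduce" step: group the emitted values by word, as shuffle plus reducer would.
    static Map<String, List<String>> reduce(List<String[]> pairs) {
        Map<String, List<String>> index = new TreeMap<>();
        for (String[] pair : pairs) {
            index.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        return index;
    }

    // Run both steps over hypothetical in-memory "files" (file name -> lines).
    static Map<String, List<String>> build(Map<String, List<String>> files) {
        List<String[]> pairs = new ArrayList<>();
        for (Map.Entry<String, List<String>> file : files.entrySet()) {
            long offset = 0; // mimics the line-start offset Hadoop passes as the map key
            for (String line : file.getValue()) {
                pairs.addAll(map(file.getKey(), offset, line));
                offset += line.length() + 1; // +1 for the newline, assuming single-byte characters
            }
        }
        return reduce(pairs);
    }

    public static void main(String[] args) {
        Map<String, List<String>> files = new LinkedHashMap<>();
        files.put("a.txt", List.of("hello world", "hello hadoop"));
        files.put("b.txt", List.of("hadoop hello hello"));
        build(files).forEach((word, postings) -> System.out.println(word + "\t" + postings));
    }
}
```

In a real job the grouping happens across machines during the shuffle, and the order of values within one key is not guaranteed, so the posting order shown by this single-process sketch should not be relied on.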

Indexes matter:

Details: https://blog.csdn.net/meiLin_Ya/article/details/80854232
