Requirement:

Process the three files above with MapReduce and produce output in the following format:

hello c.txt-->2 b.txt-->2 a.txt-->3
jerry c.txt-->1 b.txt-->3 a.txt-->1
tom c.txt-->1 b.txt-->1 a.txt-->2
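
The three source files themselves appear earlier in the post. Purely for illustration — this is a reconstruction consistent with the expected counts, not the actual data — tab-separated files like the following would produce exactly the output above:

a.txt (words separated by tabs):
hello	tom
hello	jerry	tom
hello

b.txt:
hello	jerry
jerry	tom
jerry	hello

c.txt:
hello	jerry
hello	tom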

Approach:

We need two steps:

1. This is the earlier word-count exercise, except that each word key now also carries the name of the file it came from. For example, a line containing hello and tom in a.txt produces the map outputs <hello-->a.txt, 1> and <tom-->a.txt, 1>. Aggregated, this gives:

hello-->a.txt 3
hello-->b.txt 2
hello-->c.txt 2
jerry-->a.txt 1
jerry-->b.txt 3
jerry-->c.txt 1
tom-->a.txt 2
tom-->b.txt 1
tom-->c.txt 1

2. Convert the key from the hello-->a.txt form back to plain hello, moving the file name and count into the value, so the shuffle groups all of a word's per-file counts together. This gives:

hello c.txt-->2 b.txt-->2 a.txt-->3
jerry c.txt-->1 b.txt-->3 a.txt-->1
tom c.txt-->1 b.txt-->1 a.txt-->2

The source files are as follows:

InverseIndexStepOne.java:

package cn.darrenchan.hadoop.mr.ii;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InverseIndexStepOne {

    public static class StepOneMapper extends
            Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Take one line of input
            String line = value.toString();
            // Split it into individual words
            String[] fields = line.split("\t");
            // Get the file split this line belongs to
            FileSplit inputSplit = (FileSplit) context.getInputSplit();
            // Extract the file name from the split
            String fileName = inputSplit.getPath().getName();
            for (String field : fields) {
                // Emit k: hello-->a.txt  v: 1
                context.write(new Text(field + "-->" + fileName),
                        new LongWritable(1));
            }
        }
    }

    public static class StepOneReducer extends
            Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values,
                Context context) throws IOException, InterruptedException {
            // Input looks like <hello-->a.txt, {1,1,1,...}>; sum the 1s
            long count = 0;
            for (LongWritable value : values) {
                count += value.get();
            }
            context.write(key, new LongWritable(count));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        job.setJarByClass(InverseIndexStepOne.class);

        job.setMapperClass(StepOneMapper.class);
        job.setReducerClass(StepOneReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);

        // If the output path given on the command line already exists, delete it first
        Path outputPath = new Path(args[1]);
        FileSystem fileSystem = FileSystem.get(conf);
        if (fileSystem.exists(outputPath)) {
            fileSystem.delete(outputPath, true);
        }

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, outputPath);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
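
One optional tweak, not in the original code: because StepOneReducer only sums LongWritable counts — an associative, commutative operation — the same class can also serve as a combiner, pre-aggregating each mapper's output before the shuffle. A single hypothetical line added to main() would enable it:

        // Hypothetical addition: run the summing reducer locally on each
        // mapper's output to shrink the data shuffled to the reducer.
        job.setCombinerClass(StepOneReducer.class);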

InverseIndexStepTwo.java:

package cn.darrenchan.hadoop.mr.ii;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InverseIndexStepTwo {

    // Input k: line start offset  v: a step-one line such as "hello-->a.txt	3"
    // Map output: <hello, a.txt-->3>
    public static class StepTwoMapper extends
            Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            // fields[0] = "hello", fields[1] = "a.txt<TAB>3"
            String[] fields = line.split("-->");
            // TextOutputFormat separates key and value with a tab, hence split("\t")
            String[] strings = fields[1].split("\t");
            context.write(new Text(fields[0]),
                    new Text(strings[0] + "-->" + strings[1]));
        }
    }

    // The reducer receives <hello, {a.txt-->3, b.txt-->2, c.txt-->1}>
    // and emits k: hello  v: a.txt-->3 b.txt-->2 c.txt-->1
    public static class StepTwoReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            StringBuilder result = new StringBuilder();
            for (Text value : values) {
                result.append(value).append(" ");
            }
            context.write(key, new Text(result.toString().trim()));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        job.setJarByClass(InverseIndexStepTwo.class);

        job.setMapperClass(StepTwoMapper.class);
        job.setReducerClass(StepTwoReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // If the output path given on the command line already exists, delete it first
        Path outputPath = new Path(args[1]);
        FileSystem fileSystem = FileSystem.get(conf);
        if (fileSystem.exists(outputPath)) {
            fileSystem.delete(outputPath, true);
        }

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, outputPath);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
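
The two steps above are submitted as separate jobs. As a sketch of an alternative — not in the original post; the class name InverseIndexDriver and the three-argument layout are assumptions — both jobs could be chained in one driver that launches step two only if step one succeeds:

package cn.darrenchan.hadoop.mr.ii;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical combined driver: step one's output directory is step two's input.
// (For brevity this sketch omits the delete-if-exists handling shown above.)
public class InverseIndexDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path src = new Path(args[0]); // e.g. /ii/srcdata
        Path mid = new Path(args[1]); // e.g. /ii/output1 (intermediate)
        Path out = new Path(args[2]); // e.g. /ii/output2 (final)

        Job one = Job.getInstance(conf, "inverse-index-step-one");
        one.setJarByClass(InverseIndexDriver.class);
        one.setMapperClass(InverseIndexStepOne.StepOneMapper.class);
        one.setReducerClass(InverseIndexStepOne.StepOneReducer.class);
        one.setOutputKeyClass(Text.class);
        one.setOutputValueClass(LongWritable.class);
        FileInputFormat.setInputPaths(one, src);
        FileOutputFormat.setOutputPath(one, mid);

        // Stop here if step one fails, so step two never sees partial output
        if (!one.waitForCompletion(true)) {
            System.exit(1);
        }

        Job two = Job.getInstance(conf, "inverse-index-step-two");
        two.setJarByClass(InverseIndexDriver.class);
        two.setMapperClass(InverseIndexStepTwo.StepTwoMapper.class);
        two.setReducerClass(InverseIndexStepTwo.StepTwoReducer.class);
        two.setOutputKeyClass(Text.class);
        two.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(two, mid);
        FileOutputFormat.setOutputPath(two, out);

        System.exit(two.waitForCompletion(true) ? 0 : 1);
    }
}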

First, upload the three source files to the /ii/srcdata directory on HDFS. Assuming they are named a.txt, b.txt, and c.txt locally, this could be done with:
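
hadoop fs -mkdir -p /ii/srcdata
hadoop fs -put a.txt b.txt c.txt /ii/srcdata

Then package the classes into ii.jar and submit the step-one job: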

hadoop jar ii.jar cn.darrenchan.hadoop.mr.ii.InverseIndexStepOne /ii/srcdata /ii/output1

The job prints the following run information:

17/03/01 17:55:38 INFO client.RMProxy: Connecting to ResourceManager at weekend110/192.168.230.134:8032
17/03/01 17:55:38 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
17/03/01 17:55:39 INFO input.FileInputFormat: Total input paths to process : 3
17/03/01 17:55:39 INFO mapreduce.JobSubmitter: number of splits:3
17/03/01 17:55:40 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1488372977056_0001
17/03/01 17:55:41 INFO impl.YarnClientImpl: Submitted application application_1488372977056_0001
17/03/01 17:55:41 INFO mapreduce.Job: The url to track the job: http://weekend110:8088/proxy/application_1488372977056_0001/
17/03/01 17:55:41 INFO mapreduce.Job: Running job: job_1488372977056_0001
17/03/01 17:55:52 INFO mapreduce.Job: Job job_1488372977056_0001 running in uber mode : false
17/03/01 17:55:52 INFO mapreduce.Job: map 0% reduce 0%
17/03/01 17:56:11 INFO mapreduce.Job: map 33% reduce 0%
17/03/01 17:56:12 INFO mapreduce.Job: map 100% reduce 0%
17/03/01 17:56:18 INFO mapreduce.Job: map 100% reduce 100%
17/03/01 17:56:18 INFO mapreduce.Job: Job job_1488372977056_0001 completed successfully
17/03/01 17:56:18 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=382
FILE: Number of bytes written=372665
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=402
HDFS: Number of bytes written=138
HDFS: Number of read operations=12
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=3
Launched reduce tasks=1
Data-local map tasks=3
Total time spent by all maps in occupied slots (ms)=51196
Total time spent by all reduces in occupied slots (ms)=3018
Total time spent by all map tasks (ms)=51196
Total time spent by all reduce tasks (ms)=3018
Total vcore-seconds taken by all map tasks=51196
Total vcore-seconds taken by all reduce tasks=3018
Total megabyte-seconds taken by all map tasks=52424704
Total megabyte-seconds taken by all reduce tasks=3090432
Map-Reduce Framework
Map input records=8
Map output records=16
Map output bytes=344
Map output materialized bytes=394
Input split bytes=312
Combine input records=0
Combine output records=0
Reduce input groups=9
Reduce shuffle bytes=394
Reduce input records=16
Reduce output records=9
Spilled Records=32
Shuffled Maps =3
Failed Shuffles=0
Merged Map outputs=3
GC time elapsed (ms)=1077
CPU time spent (ms)=6740
Physical memory (bytes) snapshot=538701824
Virtual memory (bytes) snapshot=1450766336
Total committed heap usage (bytes)=379793408
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=90
File Output Format Counters
Bytes Written=138

The step-one result is:

hello-->a.txt 3
hello-->b.txt 2
hello-->c.txt 2
jerry-->a.txt 1
jerry-->b.txt 3
jerry-->c.txt 1
tom-->a.txt 2
tom-->b.txt 1
tom-->c.txt 1

Next, run the step-two job:

hadoop jar ii.jar cn.darrenchan.hadoop.mr.ii.InverseIndexStepTwo /ii/output1 /ii/output2

The job prints the following run information:

17/03/01 18:03:31 INFO client.RMProxy: Connecting to ResourceManager at weekend110/192.168.230.134:8032
17/03/01 18:03:31 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
17/03/01 18:03:31 INFO input.FileInputFormat: Total input paths to process : 1
17/03/01 18:03:31 INFO mapreduce.JobSubmitter: number of splits:1
17/03/01 18:03:32 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1488372977056_0003
17/03/01 18:03:32 INFO impl.YarnClientImpl: Submitted application application_1488372977056_0003
17/03/01 18:03:32 INFO mapreduce.Job: The url to track the job: http://weekend110:8088/proxy/application_1488372977056_0003/
17/03/01 18:03:32 INFO mapreduce.Job: Running job: job_1488372977056_0003
17/03/01 18:03:38 INFO mapreduce.Job: Job job_1488372977056_0003 running in uber mode : false
17/03/01 18:03:38 INFO mapreduce.Job: map 0% reduce 0%
17/03/01 18:03:43 INFO mapreduce.Job: map 100% reduce 0%
17/03/01 18:03:47 INFO mapreduce.Job: map 100% reduce 100%
17/03/01 18:03:48 INFO mapreduce.Job: Job job_1488372977056_0003 completed successfully
17/03/01 18:03:48 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=162
FILE: Number of bytes written=185553
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=249
HDFS: Number of bytes written=112
HDFS: Number of read operations=6
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=2605
Total time spent by all reduces in occupied slots (ms)=2725
Total time spent by all map tasks (ms)=2605
Total time spent by all reduce tasks (ms)=2725
Total vcore-seconds taken by all map tasks=2605
Total vcore-seconds taken by all reduce tasks=2725
Total megabyte-seconds taken by all map tasks=2667520
Total megabyte-seconds taken by all reduce tasks=2790400
Map-Reduce Framework
Map input records=9
Map output records=9
Map output bytes=138
Map output materialized bytes=162
Input split bytes=111
Combine input records=0
Combine output records=0
Reduce input groups=3
Reduce shuffle bytes=162
Reduce input records=9
Reduce output records=3
Spilled Records=18
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=138
CPU time spent (ms)=820
Physical memory (bytes) snapshot=218480640
Virtual memory (bytes) snapshot=726454272
Total committed heap usage (bytes)=137433088
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=138
File Output Format Counters
Bytes Written=112

The final result is:

hello c.txt-->2 b.txt-->2 a.txt-->3
jerry c.txt-->1 b.txt-->3 a.txt-->1
tom c.txt-->1 b.txt-->1 a.txt-->2
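
The finished index can be read back from HDFS with, for example (part-r-00000 being the default output file name for a single-reducer job):

hadoop fs -cat /ii/output2/part-r-00000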
