MapReduce Programming: Inverted Index

Task requirements:

// Input file format

18661629496 110

13107702446 110

1234567 120

2345678 120

987654 110

2897839274 18661629496

// Output file format

110 18661629496|13107702446|987654|18661629496|13107702446|987654|

120 1234567|2345678|1234567|2345678|

18661629496 2897839274|2897839274|
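The transformation from the input format to the output format can be tried out without a cluster. The following is a minimal plain-Java sketch, not the Hadoop job itself: the class name `InvertedIndexSketch` and the method `invert` are hypothetical names introduced here for illustration. It treats the first field of each line as `anum` and the second as `bnum`, emits (`bnum`, `anum`) pairs, groups them by key, and joins each group with `|`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java simulation of the map and reduce steps; no Hadoop involved. */
final class InvertedIndexSketch {

    /** Groups every anum under its bnum and joins each group with a trailing "|". */
    static Map<String, String> invert(List<String> lines) {
        // TreeMap keeps the keys sorted, much as keys arrive sorted at a reducer.
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length < 2) {
                continue; // malformed record: skipped
            }
            // "Map" step: emit (bnum, anum).
            grouped.computeIfAbsent(parts[1], k -> new ArrayList<>()).add(parts[0]);
        }
        // "Reduce" step: concatenate all values that share a key.
        Map<String, String> result = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : grouped.entrySet()) {
            StringBuilder out = new StringBuilder();
            for (String anum : entry.getValue()) {
                out.append(anum).append('|');
            }
            result.put(entry.getKey(), out.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "18661629496 110",
                "13107702446 110",
                "1234567 120",
                "2345678 120",
                "987654 110",
                "2897839274 18661629496");
        // Print key and value separated by a tab, the default of Hadoop's TextOutputFormat.
        invert(input).forEach((bnum, anums) -> System.out.println(bnum + "\t" + anums));
    }
}
```

This sketch keeps values in input order. A real MapReduce job gives no such guarantee for the order of values within one key.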

Writing the MapReduce program:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Test2 {

    enum Counter {
        LINESKIP // counts malformed lines that were skipped
    }

    public static class Map extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString(); // read one source record
            try {
                // Parse the record, e.g. "18661629496 110"
                String[] lineSplit = line.split(" ");
                String anum = lineSplit[0];
                String bnum = lineSplit[1];
                // Emit the inverted pair, e.g. (110, 18661629496)
                context.write(new Text(bnum), new Text(anum));
            } catch (ArrayIndexOutOfBoundsException e) {
                // The line had fewer than two fields: count it and skip it
                context.getCounter(Counter.LINESKIP).increment(1);
                return;
            }
        }
    }

    public static class Reduce extends Reducer<Text, Text, Text, Text> {
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Concatenate every value for this key, each followed by "|"
            StringBuilder out = new StringBuilder();
            for (Text value : values) {
                out.append(value.toString()).append("|");
            }
            context.write(key, new Text(out.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        if (args.length != 2) {
            System.err.println("Please specify the input and output paths");
            System.exit(2);
        }
        // Job configuration
        Job job = new Job(conf, "telephone"); // job name
        // Classes
        job.setJarByClass(Test2.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        // Map output types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        // Job output types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // Input and output paths
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Exit when the job completes
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
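The mapper above relies on catching `ArrayIndexOutOfBoundsException` to detect lines with fewer than two fields. As an alternative, here is a small sketch that validates the field count explicitly and also tolerates tabs or repeated spaces. It is plain Java with no Hadoop dependency, and `RecordParser`, `Record`, and `parse` are hypothetical names used only for this illustration. Unlike the original, it rejects lines with more than two fields, which is an assumption about what counts as malformed.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Explicit validation of "anum bnum" lines, as an alternative to catching an exception. */
final class RecordParser {

    /** One parsed input record. */
    static final class Record {
        final String anum;
        final String bnum;

        Record(String anum, String bnum) {
            this.anum = anum;
            this.bnum = bnum;
        }
    }

    /** Returns the parsed record, or an empty Optional if the line is malformed. */
    static Optional<Record> parse(String line) {
        // Split on any run of whitespace so tabs and double spaces are accepted.
        String[] parts = line.trim().split("\\s+");
        if (parts.length != 2) {
            return Optional.empty();
        }
        return Optional.of(new Record(parts[0], parts[1]));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("18661629496 110", "", "malformed", "1234567\t120");
        long skipped = 0; // plays the role of the LINESKIP counter
        for (String line : lines) {
            Optional<Record> record = parse(line);
            if (record.isPresent()) {
                System.out.println(record.get().bnum + " <- " + record.get().anum);
            } else {
                skipped++;
            }
        }
        System.out.println("skipped=" + skipped);
    }
}
```

Inside a real mapper, the empty case is where `context.getCounter(Counter.LINESKIP).increment(1)` would go.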

Packaging the MapReduce program as a JAR file:

1. Right-click the project name -> Export -> Java -> JAR file

2. Choose where to save the JAR file

3. Select the main class

4. Run the JAR file

[liuqingjie@master hadoop-0.20.2]$ bin/hadoop jar /home/liuqingjie/test2.jar /user/liuqingjie/in /user/liuqingjie/out

15/05/14 01:46:47 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.

15/05/14 01:46:47 INFO input.FileInputFormat: Total input paths to process : 2

15/05/14 01:46:48 INFO mapred.JobClient: Running job: job_201505132004_0005

15/05/14 01:46:49 INFO mapred.JobClient: map 0% reduce 0%

15/05/14 01:46:57 INFO mapred.JobClient: map 100% reduce 0%

15/05/14 01:47:09 INFO mapred.JobClient: map 100% reduce 100%

……………………………………………………………………………………

View the results

[liuqingjie@master hadoop-0.20.2]$ bin/hadoop dfs -cat ./out/*

cat: Source must be a file.

110 18661629496|13107702446|987654|18661629496|13107702446|987654|

120 1234567|2345678|1234567|2345678|

18661629496 2897839274|2897839274|
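In the result above, every number appears twice under its key. The job log reports two input paths, so one possible explanation is that the same records were read from both files, though the source does not say. If repeated values are unwanted, the concatenation step could drop duplicates first. The sketch below shows the idea in plain Java; `DistinctJoin` and `joinDistinct` are hypothetical names, not part of the original program.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Joins values with a trailing "|", keeping only the first occurrence of each value. */
final class DistinctJoin {

    static String joinDistinct(Iterable<String> values) {
        // LinkedHashSet removes duplicates while preserving first-seen order.
        Set<String> seen = new LinkedHashSet<>();
        for (String value : values) {
            seen.add(value);
        }
        StringBuilder out = new StringBuilder();
        for (String value : seen) {
            out.append(value).append('|');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList(
                "18661629496", "13107702446", "987654",
                "18661629496", "13107702446", "987654");
        System.out.println("110\t" + joinDistinct(values));
    }
}
```

In an actual reducer the set should store `value.toString()` rather than the `Text` object, because Hadoop reuses the same `Text` instance across iterations.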
