Hadoop MapReduce: Listing the Files That Contain Each Word

The input consists of four files; their contents are shown below.

Implementation:

Mapper:

package cn.tedu.invert;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class InvertMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Get the name of the file that this input split belongs to
        FileSplit fileSplit = (FileSplit) context.getInputSplit();
        String pathName = fileSplit.getPath().getName();

        // Split the line into words on runs of whitespace
        String[] words = value.toString().trim().split("\\s+");

        // Emit (word, file name) so that each word is paired with the file it appears in
        for (String word : words) {
            if (word.isEmpty()) {
                continue; // a blank line yields one empty token; skip it
            }
            context.write(new Text(word), new Text(pathName));
        }
    }
}
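One detail worth checking in the mapper is how each line is broken into words. The following is a minimal sketch in plain Java (no Hadoop needed) that compares splitting on a single space with splitting on runs of whitespace. The class name `TokenizeSketch`, its method names, and the sample line are made up for illustration and do not come from the original post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TokenizeSketch {
    // Splitting on a single space: consecutive spaces produce empty tokens,
    // and tabs are not treated as separators.
    static List<String> splitOnSpace(String line) {
        return Arrays.asList(line.split(" "));
    }

    // Splitting on runs of whitespace and dropping empty tokens.
    static List<String> splitOnWhitespace(String line) {
        List<String> tokens = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String line = "hello  world\thadoop"; // hypothetical input line
        System.out.println(splitOnSpace(line));
        System.out.println(splitOnWhitespace(line));
    }
}
```

With a single-space split, an empty string would be emitted as a key and end up in the job output as a "word", so the whitespace-based variant is usually the safer choice when the input is not strictly single-space separated.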

Reducer:

package cn.tedu.invert;

import java.io.IOException;
import java.util.HashSet;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        // A hash set stores no duplicates, so repeated file names are dropped
        HashSet<String> set = new HashSet<>();
        for (Text text : values) {
            set.add(text.toString());
        }

        // Join the distinct file names with spaces (the order is unspecified for a HashSet)
        context.write(key, new Text(String.join(" ", set)));
    }
}
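To see what the mapper and reducer compute together, here is a rough local sketch in plain Java that builds the same word-to-files index in memory, without a cluster. It is only an illustration: the class `InvertedIndexSketch`, the file names `a.txt` and `b.txt`, and their contents are hypothetical and are not the four files from the post. It also uses `TreeMap` and `TreeSet` so that the printed order is stable, whereas the reducer above uses a `HashSet`, whose iteration order is unspecified.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class InvertedIndexSketch {
    // Builds word -> distinct file names. The inner loop plays the role of map
    // (pair each word with its file); the set plays the role of reduce (deduplicate).
    static Map<String, Set<String>> build(Map<String, String> files) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> file : files.entrySet()) {
            for (String word : file.getValue().trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue; // empty file content
                }
                index.computeIfAbsent(word, k -> new TreeSet<>()).add(file.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // Hypothetical file names and contents, for illustration only.
        Map<String, String> files = new LinkedHashMap<>();
        files.put("a.txt", "hello hadoop");
        files.put("b.txt", "hello hello world");

        // One line per word: the word, a tab, then the files that contain it.
        build(files).forEach((word, names) ->
                System.out.println(word + "\t" + String.join(" ", names)));
    }
}
```

In this sketch `hello` maps to both files even though it occurs twice in `b.txt`, which is the deduplication the reducer's set provides.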

Driver:

package cn.tedu.invert;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InvertDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "JobName");
        job.setJarByClass(InvertDriver.class);

        job.setMapperClass(InvertMapper.class);
        job.setReducerClass(InvertReducer.class);

        // The map output types are not set separately, so they default to these as well
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Input directory holding the source files; the output directory must not exist yet
        FileInputFormat.setInputPaths(job, new Path("hdfs://192.168.74.129:9000/text/invert"));
        FileOutputFormat.setOutputPath(job, new Path("hdfs://192.168.74.129:9000/result/invert_result"));

        // Wait for the job to finish and report success or failure through the exit code
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Result:
