Hadoop Notes: A MapReduce Application Example (WordCount)

A MapReduce application example (WordCount)

1. WordCount

Purpose:

Count how many times each word occurs in the input files.

The output is sorted alphabetically by word.

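Map phase: each input line is split into words, and a (word, 1) pair is emitted for every word.

Reduce phase: all pairs that share the same word are grouped together, and their values are summed to give that word's count.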
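The two phases can be imitated in plain Java, without a cluster, to see how the data flows. This is only a rough sketch: LocalWordCount and its map, shuffle and reduce methods are hypothetical names invented for this note, not Hadoop APIs, and a TreeMap stands in for the sorting that Hadoop performs between the two phases.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical stand-alone class: it imitates the data flow only and is not part of Hadoop.
final class LocalWordCount {

    // Map step: split one line on whitespace and emit a (word, 1) pair per word.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<>(tokens.nextToken(), 1));
        }
        return pairs;
    }

    // Shuffle step: group the emitted values by word; the TreeMap keeps the words sorted.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce step: sum the list of values collected for one word.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs map, shuffle and reduce in sequence over all input lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(pairs).entrySet()) {
            counts.put(e.getKey(), reduce(e.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        // Two made-up input lines, standing in for the contents of the input files.
        List<String> lines = Arrays.asList("hello world", "hello hadoop");
        System.out.println(wordCount(lines)); // {hadoop=1, hello=2, world=1}
    }
}
```

In the real job, Hadoop does the grouping and sorting itself and may spread the work across many machines; only the map and reduce logic has to be written by hand.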

WordCount source code

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {
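    // Mapper: input is (byte offset of the line, line text); output is (word, 1).
    public static class WordCountMap extends
            Mapper<LongWritable, Text, Text, IntWritable> {
        // Both objects are reused for every emitted pair instead of allocating per word.
        private final IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            // Split the line on whitespace and emit (word, 1) for each token.
            StringTokenizer token = new StringTokenizer(line);
            while (token.hasMoreTokens()) {
                word.set(token.nextToken());
                context.write(word, one);
            }
        }
    }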

    // Reducer: receives (word, [1, 1, ...]) and writes (word, total count).
    public static class WordCountReduce extends
            Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf);
        job.setJarByClass(WordCount.class);
        job.setJobName("wordcount");
        // Types of the job's output key and value.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(WordCountMap.class);
        job.setReducerClass(WordCountReduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        // args[0] is the input path, args[1] is the output path.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Wait for the job to finish and report success or failure through the exit code.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Code walkthrough:

First, the import statements bring in the required classes.

Next, the WordCount class is defined:

public class WordCount{}

This class contains two nested classes, one called WordCountMap and the other called WordCountReduce:

public static class WordCountMap
public static class WordCountReduce

The first class, WordCountMap, declares the Mapper's input types as LongWritable (key) and Text (value), and its output types as Text (key) and IntWritable (value):

    extends Mapper<LongWritable,Text,Text,IntWritable>{}

Here, one represents a single occurrence of a word:

private final IntWritable one = new IntWritable(1);

Next comes the map operation:

public void map(LongWritable key,Text value,Context context)

The map operation tokenizes each line of input. For every word it finds, it writes out one pair made of the word and one:

            word.set(token.nextToken());
            context.write(word,one);
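One consequence of this tokenizing step is worth seeing in isolation: StringTokenizer with its default delimiters splits on whitespace only, so punctuation and letter case stay part of the token, and "Hello," is counted separately from "hello". The sketch below shows that behaviour in plain Java. TokenizerDemo and the normalize helper are hypothetical additions for illustration and are not part of the WordCount job above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.StringTokenizer;

// Hypothetical demo class, not part of the WordCount job.
final class TokenizerDemo {

    // Splits a line the same way the mapper does: on whitespace only.
    static List<String> tokenize(String line) {
        List<String> words = new ArrayList<>();
        StringTokenizer token = new StringTokenizer(line);
        while (token.hasMoreTokens()) {
            words.add(token.nextToken());
        }
        return words;
    }

    // Assumed optional clean-up: lowercase the token and drop everything
    // that is not a letter or a digit.
    static String normalize(String word) {
        return word.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}\\p{N}]", "");
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello, world! hello")); // [Hello,, world!, hello]
        System.out.println(normalize("Hello,"));              // hello
    }
}
```

If punctuation-insensitive counts are wanted, a clean-up step along these lines could be applied to each token before word.set is called; whether that is appropriate depends on the input data.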

The second class, WordCountReduce, extends the Reducer class. Its input types are Text and IntWritable, and its output types are also Text and IntWritable:

public static class WordCountReduce
    extends Reducer<Text,IntWritable,Text,IntWritable>{}

The Reducer adds up the values for each word:

        sum+=val.get();

Finally, the main function configures the job: the input and output directories, the job name, the classes the job uses, and so on:

public static void main(String[] args) throws Exception{}

WordCount steps:

Write WordCount.java, containing the Mapper class and the Reducer class
Compile WordCount.java: javac -classpath
Package: jar -cvf WordCount.jar classes/*
Submit the job: hadoop jar WordCount.jar WordCount input output

In detail:

Check that Hadoop is running with jps, and confirm that NameNode, DataNode, TaskTracker, JobTracker and SecondaryNameNode have all started

Write the Java program

vim WordCount.java

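Save the file, then compile it. The program depends on JAR files that ship with Hadoop, so on the command line they have to be added with -classpath; an IDE can compile it directly.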

javac -classpath /opt/hadoop-1.2.1/hadoop-core-1.2.1.jar:/opt/hadoop-1.2.1/lib/commons-cli-1.2.jar -d word_count_class/ WordCount.java
cd word_count_class/
ls

The word_count_class directory now contains three compiled files: WordCount.class, WordCount$WordCountMap.class and WordCount$WordCountReduce.class
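One source file yields three class files because javac writes every nested class to its own file, named after the outer class, a $ sign, and the nested class. A small sketch of the same naming rule follows; OuterDemo and InnerDemo are hypothetical names used only for this illustration.

```java
// OuterDemo and InnerDemo are hypothetical names used only for this illustration.
final class OuterDemo {

    // A static nested class, declared the same way as WordCountMap inside WordCount.
    static class InnerDemo { }

    // Returns the binary name, which is also the base name of the generated .class file.
    static String nestedBinaryName() {
        return InnerDemo.class.getName();
    }

    public static void main(String[] args) {
        // When compiled in the default package this prints OuterDemo$InnerDemo,
        // and javac produces OuterDemo.class and OuterDemo$InnerDemo.class.
        System.out.println(nestedBinaryName());
    }
}
```

All of these files have to go into the jar, since Hadoop loads the Mapper and Reducer by these names at run time.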

Package the compiled files into a jar

jar -cvf wordcount.jar *.class

Looking at the source file WordCount.java again, the program takes two arguments, the input path and the output path:

    FileInputFormat.addInputPath(job,new Path(args[0]));
    FileOutputFormat.setOutputPath(job,new Path(args[1]));

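Look in the local input directory (cd input/): it contains two files, file1 and file2, each holding some text. Upload file1 and file2 to Hadoop (the commands below are run from the directory that contains input/):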

hadoop fs -mkdir input_wordcount
hadoop fs -put input/* input_wordcount/
hadoop fs -ls input_wordcount
hadoop fs -cat input_wordcount/file1

Once the input files have been uploaded to Hadoop, the job can be submitted

hadoop jar word_count_class/wordcount.jar WordCount input_wordcount output_wordcount

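The input is input_wordcount and the output is output_wordcount. The output_wordcount directory must not exist beforehand: Hadoop creates it, and the job fails if it is already there.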

List the output files

hadoop fs -ls output_wordcount

Note the last file in the listing, here part-r-00000, which holds the result

hadoop fs -cat output_wordcount/part-r-00000
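With TextOutputFormat, each line of the part file is a key and a value separated by a tab, which for this job means a word followed by its count. If the result needs further processing outside Hadoop, it can be read back with a few lines of Java. The following is a minimal sketch under that assumption: PartFileParser is a hypothetical helper, and the sample lines are made up rather than taken from a real run.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for reading WordCount output lines; not part of Hadoop.
final class PartFileParser {

    // Parses lines of the form "word<TAB>count" into a map that keeps the file's order.
    static Map<String, Integer> parse(Iterable<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            int tab = line.lastIndexOf('\t');
            if (tab < 0) {
                throw new IllegalArgumentException("no tab separator in line: " + line);
            }
            String word = line.substring(0, tab);
            int count = Integer.parseInt(line.substring(tab + 1).trim());
            counts.put(word, count);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample lines in the same shape as part-r-00000.
        List<String> sample = Arrays.asList("hadoop\t1", "hello\t2", "world\t1");
        System.out.println(parse(sample)); // {hadoop=1, hello=2, world=1}
    }
}
```

A job that runs with more than one reducer writes several part files (part-r-00000, part-r-00001, and so on), each sorted on its own, so all of them would have to be read to get the full result.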

That is the complete WordCount process, run here on Hadoop 1.2.1.