Implementing Some Classic Examples with MapReduce

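At work, MapReduce statistics jobs are usually generated and run automatically through Hive or Pig, but that is no reason to forget plain MapReduce. This article records a few classic examples implemented directly in MapReduce, including an inverted index, data deduplication, and others. They are worth mastering.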

1. Implementing an inverted index with MapReduce

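An inverted index, also known as a postings file or inverted file, is an indexing method used in full-text search to store a mapping from each word to its locations in a document or a set of documents. It is the most commonly used data structure in document retrieval systems. With an inverted index, the list of documents containing a given word can be obtained quickly.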

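It is called "inverted" because the lookup goes from the words inside a document back to the document identifiers, which makes searching very large files fast. Search engines use inverted indexes to perform searches, and the inverted index is also the principle on which Lucene is built.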

Suppose there are two files: a.txt contains "hello you hello" and b.txt contains "hello hans". After building the inverted index, the expected result is:

"hello" "a.txt:2;b.txt:1"

"you" "a.txt:1"

"hans" "b.txt:1"

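Working backwards from the result: to output "hello" "a.txt:2;b.txt:1", the reduce output must be <hello,a.txt:2;b.txt:1>, so the reduce input must be <hello,a.txt:2> and <hello,b.txt:1>. The reduce input is the map output, but on analysis the map side cannot directly emit data such as <hello,a.txt:2>, because each map call sees only one line of input at a time. A combiner can serve as the intermediate step here. Its input is <hello:a.txt,1>, <hello:a.txt,1>, <hello:b.txt,1>, which it can turn into data that meets the reduce input requirement, and emitting <hello:a.txt,1> from the map side is straightforward. The process is shown in Figure 1.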

Figure 1. How the MapReduce inverted index works
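The data flow described above can be traced without a cluster. The following is a minimal plain-Java sketch, not Hadoop code: the class name InvertedIndexSketch and its methods are hypothetical, and the map, combine and reduce stages are imitated with in-memory maps purely to show how the <word:file, 1> pairs turn into the final postings.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;

// Hypothetical in-memory imitation of the map -> combine -> reduce flow.
public class InvertedIndexSketch {

    // "Map" + "combine": emit <word:file, 1> per token and sum the ones per key.
    public static Map<String, Integer> mapAndCombine(Map<String, String> files) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, String> file : files.entrySet()) {
            for (String token : file.getValue().trim().split("\\s+")) {
                if (token.isEmpty()) {
                    continue; // skip empty files or lines
                }
                counts.merge(token.toLowerCase() + ":" + file.getKey(), 1, Integer::sum);
            }
        }
        return counts;
    }

    // "Reduce": regroup by word and join the file:count entries with ';'.
    public static Map<String, String> reduce(Map<String, Integer> combined) {
        Map<String, StringJoiner> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> e : combined.entrySet()) {
            String[] parts = e.getKey().split(":", 2); // [word, file]
            grouped.computeIfAbsent(parts[0], k -> new StringJoiner(";"))
                   .add(parts[1] + ":" + e.getValue());
        }
        Map<String, String> index = new TreeMap<>();
        grouped.forEach((word, postings) -> index.put(word, postings.toString()));
        return index;
    }

    public static Map<String, String> build(Map<String, String> files) {
        return reduce(mapAndCombine(files));
    }

    public static void main(String[] args) {
        Map<String, String> files = new LinkedHashMap<>();
        files.put("a.txt", "hello you hello");
        files.put("b.txt", "hello hans");
        build(files).forEach((word, postings) -> System.out.println(word + "\t" + postings));
    }
}
```

For the two sample files this yields hello mapped to a.txt:2;b.txt:1, you mapped to a.txt:1 and hans mapped to b.txt:1, matching the expected result listed earlier.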

The implementation is as follows:

package com.hicoor.hadoop.mapreduce.reverse;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

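// Utility class
class StringUtil {

    // Return the last path segment, e.g. the file name of an HDFS path.
    public static String getShortPath(String filePath) {
        if (filePath.length() == 0)
            return filePath;
        return filePath.substring(filePath.lastIndexOf("/") + 1);
    }

    // Split str by regex and return the part at index, or "" if it does not exist.
    public static String getSplitByIndex(String str, String regex, int index) {
        String[] splits = str.split(regex);
        // Use <= here: index == splits.length is already out of bounds
        if (splits.length <= index)
            return "";
        return splits[index];
    }
}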

public class InverseIndex {

    // Map: emit <word:fileName, "1"> for every token in the line.
    public static class ReverseWordMapper extends
            Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void map(LongWritable key, Text value,
                Mapper<LongWritable, Text, Text, Text>.Context context)
                throws IOException, InterruptedException {
            // The input split identifies the file the current line belongs to
            FileSplit split = (FileSplit) context.getInputSplit();
            String fileName = StringUtil.getShortPath(split.getPath()
                    .toString());
            StringTokenizer st = new StringTokenizer(value.toString());
            while (st.hasMoreTokens()) {
                String word = st.nextToken().toLowerCase();
                word = word + ":" + fileName;
                context.write(new Text(word), new Text("1"));
            }
        }
    }

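    // Combine: sum the counts of each <word:fileName> key and re-key the
    // result as <word, "fileName:count">.
    // Caveat: Hadoop treats a combiner as an optional optimization and may
    // run it zero, one, or several times. A job that relies on the combiner
    // to change the key format, as this one does, is therefore not guaranteed
    // to produce the expected output for every input.
    public static class ReverseWordCombiner extends
            Reducer<Text, Text, Text, Text> {

        @Override
        protected void reduce(Text key, Iterable<Text> values,
                Reducer<Text, Text, Text, Text>.Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (Text value : values) {
                sum += Long.parseLong(value.toString());
            }
            String newKey = StringUtil.getSplitByIndex(key.toString(), ":", 0);
            String fileKey = StringUtil
                    .getSplitByIndex(key.toString(), ":", 1);
            context.write(new Text(newKey),
                    new Text(fileKey + ":" + sum));
        }
    }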

    // Reduce: join every "fileName:count" value of a word with ';',
    // matching the expected output format "a.txt:2;b.txt:1".
    public static class ReverseWordReducer extends Reducer<Text, Text, Text, Text> {

        @Override
        protected void reduce(Text key, Iterable<Text> values,
                Reducer<Text, Text, Text, Text>.Context context)
                throws IOException, InterruptedException {
            StringBuilder sb = new StringBuilder();
            for (Text v : values) {
                if (sb.length() > 0) {
                    sb.append(';');
                }
                sb.append(v.toString());
            }
            context.write(key, new Text(sb.toString()));
        }
    }

    private static final String FILE_IN_PATH = "hdfs://hadoop0:9000/reverse/in/";
    private static final String FILE_OUT_PATH = "hdfs://hadoop0:9000/reverse/out/";

    public static void main(String[] args) throws IOException,
            URISyntaxException, ClassNotFoundException, InterruptedException {
        System.setProperty("hadoop.home.dir", "D:\\desktop\\hadoop-2.6.0");
        Configuration conf = new Configuration();
        // Delete the output directory if it already exists
        FileSystem fileSystem = FileSystem.get(new URI(FILE_OUT_PATH), conf);
        if (fileSystem.exists(new Path(FILE_OUT_PATH))) {
            fileSystem.delete(new Path(FILE_OUT_PATH), true);
        }
        Job job = Job.getInstance(conf, "InverseIndex");
        job.setJarByClass(InverseIndex.class);
        job.setMapperClass(ReverseWordMapper.class);
        job.setCombinerClass(ReverseWordCombiner.class);
        job.setReducerClass(ReverseWordReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(FILE_IN_PATH));
        FileOutputFormat.setOutputPath(job, new Path(FILE_OUT_PATH));
        // Exit with a non-zero status if the job fails
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
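One point about the design above deserves care: Hadoop treats a combiner as an optional optimization and may run it zero, one, or several times, so a combiner whose output format differs from its input format is fragile. A common way around this is to keep the word as the key from the start and let the map side emit values of the form "file:1". The plain-Java sketch below illustrates that idea only. It is not the code of this article, the class CombinerSafeMerge and its methods are hypothetical, and the Hadoop types are replaced by strings and lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;

// Hypothetical sketch: all values of one word arrive as "file:count" strings.
public class CombinerSafeMerge {

    // Sum the counts per file. Parsing from the last ':' keeps file names intact.
    public static Map<String, Long> sumPerFile(Iterable<String> values) {
        Map<String, Long> totals = new TreeMap<>();
        for (String value : values) {
            int sep = value.lastIndexOf(':');
            String file = value.substring(0, sep);
            long count = Long.parseLong(value.substring(sep + 1));
            totals.merge(file, count, Long::sum);
        }
        return totals;
    }

    // Combiner role: output has the same "file:count" format as the input,
    // so applying it zero, one, or many times gives the same totals.
    public static List<String> combine(Iterable<String> values) {
        List<String> out = new ArrayList<>();
        sumPerFile(values).forEach((file, count) -> out.add(file + ":" + count));
        return out;
    }

    // Reducer role: sum once more, then join the postings with ';'.
    public static String reduce(Iterable<String> values) {
        StringJoiner joined = new StringJoiner(";");
        for (String posting : combine(values)) {
            joined.add(posting);
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        // Map output for the word "hello" from the sample files
        List<String> mapOutput = Arrays.asList("a.txt:1", "a.txt:1", "b.txt:1");
        System.out.println(reduce(mapOutput));                   // combiner skipped
        System.out.println(reduce(combine(mapOutput)));          // combiner ran once
        System.out.println(reduce(combine(combine(mapOutput)))); // combiner ran twice
    }
}
```

All three calls print a.txt:2;b.txt:1, which is the property the job needs: the final result does not depend on how often the combine step runs.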

2. Implementing a Top-K query with MapReduce

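The Top-K problem is to find the K highest-ranked records under some criterion in a massive data set, for example the 3 users with the largest balances among all deposit records. When the data set is small it can be loaded into the memory of a single machine and processed there, but when it is very large the work has to be distributed with MapReduce. The problem can be handled with HiveQL, or by writing a MapReduce program yourself.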

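How it works: each map task finds and emits the top K records of the data it processes. All map output is then handed to a single reduce task, which finds and returns the final top K records. The process is shown in Figure 2.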

Figure 2. The Top-K process with MapReduce

Note that there must be exactly one reduce task here, and no Combiner needs to be set.
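The two-stage selection can be sketched in plain Java before looking at the Hadoop version. This is only an illustration under assumptions: the class TopKSketch and its methods are hypothetical, each inner list stands in for the input of one map task, and the amounts are the sample deposits used in this section. It shows why the approach is correct: any record in the global top K must also be in the top K of its own split, so merging the local winners loses nothing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical in-memory imitation of "local top K per map, global top K in one reduce".
public class TopKSketch {

    // Stands in for one map task, or for the single reduce task: keep the K largest values.
    public static List<Long> topK(List<Long> values, int k) {
        List<Long> sorted = new ArrayList<>(values);
        sorted.sort(Comparator.reverseOrder());
        return new ArrayList<>(sorted.subList(0, Math.min(k, sorted.size())));
    }

    // Collect the local candidates of every split, then select the global top K once.
    public static List<Long> globalTopK(List<List<Long>> splits, int k) {
        List<Long> candidates = new ArrayList<>();
        for (List<Long> split : splits) {
            candidates.addAll(topK(split, k));
        }
        return topK(candidates, k);
    }

    public static void main(String[] args) {
        List<Long> deposit1 = Arrays.asList(125L, 23L, 365L, 15L, 188L);
        List<Long> deposit2 = Arrays.asList(236L, 115L, 18L, 785L, 214L);
        // prints [785, 365, 236]
        System.out.println(globalTopK(Arrays.asList(deposit1, deposit2), 3));
    }
}
```

With more than one reduce task, each reducer would see only part of the candidates and produce its own top K, which is why the job is limited to a single reducer.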

Suppose there are two files, deposit1.txt and deposit2.txt, with the following contents (the columns are the user name and the deposit amount):

deposit1.txt

p1    125

p2    23

p3    365

p4    15

p5    188

deposit2.txt

p6    236

p7    115

p8    18

p9    785

p10    214

The task is to find the 3 users with the largest deposits. A reference implementation:

package com.hicoor.hadoop.mapreduce;

import java.io.IOException;
import java.net.URI;
import java.util.Comparator;
import java.util.TreeMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapReduceTopKDemo {

    public static final int K = 3;

    // A TreeMap sorts by key in ascending order by default;
    // this method returns a TreeMap sorted in descending order.
    private static TreeMap<Long, String> getDescSortTreeMap() {
        return new TreeMap<Long, String>(new Comparator<Long>() {
            @Override
            public int compare(Long o1, Long o2) {
                return o2.compareTo(o1);
            }
        });
    }

    static class TopKMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

        // Keyed by amount, so within one map task a later user with the
        // same amount replaces an earlier one.
        private TreeMap<Long, String> map = getDescSortTreeMap();

        @Override
        protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, LongWritable, Text>.Context context)
                throws IOException, InterruptedException {
            String line = value.toString().trim();
            // Check the content with isEmpty(); == would only compare references
            if (line.isEmpty()) return;
            // \\s+ accepts both tabs and spaces between the two columns
            String[] splits = line.split("\\s+");
            if (splits.length < 2) return;
            map.put(Long.parseLong(splits[1]), splits[0]);
            // Keep only the K largest records
            if (map.size() > K) {
                // Records are sorted by key in descending order, so only the last one has to go
                map.remove(map.lastKey());
            }
        }

        @Override
        protected void cleanup(Mapper<LongWritable, Text, LongWritable, Text>.Context context)
                throws IOException, InterruptedException {
            // Emit the local top K once this task has seen all of its input
            for (Long num : map.keySet()) {
                context.write(new LongWritable(num), new Text(map.get(num)));
            }
        }
    }

    static class TopKReducer extends Reducer<LongWritable, Text, Text, LongWritable> {

        private TreeMap<Long, String> map = getDescSortTreeMap();

        @Override
        protected void reduce(LongWritable key, Iterable<Text> value, Reducer<LongWritable, Text, Text, LongWritable>.Context context)
                throws IOException, InterruptedException {
            // Users from different map tasks may share an amount; join their names with ','
            StringBuilder ps = new StringBuilder();
            for (Text val : value) {
                if (ps.length() > 0) {
                    ps.append(',');
                }
                ps.append(val.toString());
            }
            map.put(key.get(), ps.toString());
            // Keep only the K largest records
            if (map.size() > K) {
                map.remove(map.lastKey());
            }
        }

        @Override
        protected void cleanup(Reducer<LongWritable, Text, Text, LongWritable>.Context context)
                throws IOException, InterruptedException {
            // Write the final top K in descending order of amount
            for (Long num : map.keySet()) {
                context.write(new Text(map.get(num)), new LongWritable(num));
            }
        }
    }

    private final static String FILE_IN_PATH = "hdfs://cluster1/topk/in";
    private final static String FILE_OUT_PATH = "hdfs://cluster1/topk/out";

    /* Top-K problem: find the K highest-ranked records under some criterion in a
     * massive data set, e.g. the 3 users with the largest balances among all deposit records.
     * 1) Test input (columns: user account, deposit balance):
     *         p1    125
     *         p2    23
     *         p3    365
     *         p4    15
     *         p5    188
     *         p6    236
     *         p7    115
     *         p8    18
     *         p9    785
     *         p10    214
     * 2) Output:
     *         p9      785
     *         p3      365
     *         p6      236
     */

    public static void main(String[] args) throws Exception {
        System.setProperty("hadoop.home.dir", "D:\\desktop\\hadoop-2.6.0");
        Configuration conf = getHAConfiguration();
        // Delete the output directory if it already exists
        FileSystem fileSystem = FileSystem.get(new URI(FILE_OUT_PATH), conf);
        if (fileSystem.exists(new Path(FILE_OUT_PATH))) {
            fileSystem.delete(new Path(FILE_OUT_PATH), true);
        }
        Job job = Job.getInstance(conf, "MapReduce TopK Demo");
        job.setMapperClass(TopKMapper.class);
        job.setJarByClass(MapReduceTopKDemo.class);
        job.setReducerClass(TopKReducer.class);
        // The global top K is only correct with a single reduce task
        job.setNumReduceTasks(1);
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(FILE_IN_PATH));
        FileOutputFormat.setOutputPath(job, new Path(FILE_OUT_PATH));
        // Exit with a non-zero status if the job fails
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    // Client-side settings for the HDFS HA nameservice "cluster1"
    private static Configuration getHAConfiguration() {
        Configuration conf = new Configuration();
        conf.set("dfs.nameservices", "cluster1");
        conf.set("dfs.ha.namenodes.cluster1", "hadoop1,hadoop2");
        conf.set("dfs.namenode.rpc-address.cluster1.hadoop1", "172.19.7.31:9000");
        conf.set("dfs.namenode.rpc-address.cluster1.hadoop2", "172.19.7.32:9000");
        conf.set("dfs.client.failover.proxy.provider.cluster1", "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        return conf;
    }

}

The result is:

p9      785

p3      365

p6      236
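The sample data contains no equal amounts, so the result above is complete. With a TreeMap keyed by amount, however, two users holding the same amount inside one task share a single entry and one of them is dropped. A possible alternative, shown here as a hedged plain-Java sketch rather than as part of the original job, keeps whole records in a bounded min-heap. The class TopKWithTies, the Deposit type and the tied amounts in main are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: top K over whole records, so equal amounts do not collide.
public class TopKWithTies {

    public static final class Deposit {
        public final String user;
        public final long amount;

        public Deposit(String user, long amount) {
            this.user = user;
            this.amount = amount;
        }
    }

    public static List<Deposit> topK(List<Deposit> deposits, int k) {
        Comparator<Deposit> byAmount = Comparator.comparingLong((Deposit d) -> d.amount);
        // Min-heap: the smallest of the current candidates sits at the head
        PriorityQueue<Deposit> heap = new PriorityQueue<>(byAmount);
        for (Deposit d : deposits) {
            heap.offer(d);
            if (heap.size() > k) {
                heap.poll(); // evict the smallest candidate
            }
        }
        List<Deposit> result = new ArrayList<>(heap);
        result.sort(byAmount.reversed()); // largest amount first
        return result;
    }

    public static void main(String[] args) {
        // Made-up data with a tie: p1 and p2 both hold 300
        List<Deposit> deposits = Arrays.asList(
                new Deposit("p1", 300), new Deposit("p2", 300),
                new Deposit("p3", 100), new Deposit("p4", 500));
        for (Deposit d : topK(deposits, 3)) {
            System.out.println(d.user + "\t" + d.amount);
        }
    }
}
```

Here p4 is listed first and both p1 and p2 survive. The order between the two tied users is not defined by this sketch.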
