Preface

This chapter explains how to use Hadoop's distributed cache, which makes data that every task needs available on all nodes of the cluster.

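Preparation

Dataset: ufo, 60,000 records. The dataset consists of UFO sighting records with the fields listed below; the fields of each record are separated by tabs. See http://www.cnblogs.com/cafebabe-yun/p/8679994.html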

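Sighting date: when the UFO sighting occurred
Recorded date: when the sighting was reported
Location: where the sighting occurred
Shape: shape of the UFO
Duration: how long the sighting lasted
Description: brief description of the sighting

Example: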

19950915 19950915 Redmond, WA 6 min. Young man w/ 2 co-workers witness tiny, distinctly white round disc drifting slowly toward NE. Flew in dir. 90 deg. to winds.

Data to share: the mapping from state abbreviations to full state names

Data:

AL      Alabama
AK      Alaska
AZ      Arizona
AR      Arkansas
CA      California

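Introduction to the Distributed Cache

Purpose: with the distributed cache, common read-only files needed by map and reduce tasks are made available on every node of the cluster.

Using the Distributed Cache

Task: use the shared data to replace state abbreviations with full state names

Save the shared data above as states.txt, with a tab between the abbreviation and the full name (the mapper below splits each line on "\t")
Upload states.txt to HDFS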

hadoop dfs -put states.txt states.txt
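Because the mapper shown later splits each line of states.txt on a tab, a file that was copied from a web page with spaces instead of tabs would silently produce no mappings. The following is a minimal sketch of a pre-upload check in plain Java; the class name StatesFileCheck and its method are hypothetical and not part of the original job.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the original job: reports lines of a
// states file that do not have exactly two non-empty tab-separated fields.
final class StatesFileCheck {

    // Returns the 1-based numbers of the malformed lines; blank lines are skipped.
    static List<Integer> malformedLines(List<String> lines) {
        List<Integer> bad = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            if (line.trim().isEmpty()) {
                continue;
            }
            // Limit -1 keeps trailing empty fields so "AL\t" is reported too.
            String[] parts = line.split("\t", -1);
            if (parts.length != 2 || parts[0].trim().isEmpty() || parts[1].trim().isEmpty()) {
                bad.add(i + 1);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "AL\tAlabama",      // tab-separated: accepted
                "AK      Alaska",   // spaces only: reported
                "",                 // blank: skipped
                "AZ\tArizona");     // tab-separated: accepted
        System.out.println(malformedLines(sample));
    }
}
```

In the sample, only the second line is reported. To check the real file, the lines could be read with Files.readAllLines(Paths.get("states.txt")) and passed to the same method.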

Write UFORecordValidationMapper.java

import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

// First mapper in the chain: passes through only well-formed records.
public class UFORecordValidationMapper extends MapReduceBase implements Mapper<LongWritable, Text, LongWritable, Text> {

    @Override
    public void map(LongWritable key, Text value, OutputCollector<LongWritable, Text> output, Reporter reporter) throws IOException {
        String line = value.toString();
        if (validate(line)) {
            output.collect(key, value);
        }
    }

    // A record is valid when it has exactly six tab-separated fields.
    private boolean validate(String str) {
        String[] parts = str.split("\t");
        return parts.length == 6;
    }
}
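One detail of validate() is worth knowing: String.split with a single argument drops trailing empty strings, so a record whose last field is empty yields five parts and is rejected, while an empty field in the middle is kept. Whether the ufo dataset contains such records is not stated here, so the following is only an illustration with a made-up record; the class name SplitLimitDemo is hypothetical.

```java
// Illustration only: how String.split treats trailing empty fields.
final class SplitLimitDemo {

    // Field count as computed in validate(): trailing empty strings are dropped.
    static int countDefault(String record) {
        return record.split("\t").length;
    }

    // Field count with a negative limit: trailing empty strings are kept.
    static int countKeepEmpty(String record) {
        return record.split("\t", -1).length;
    }

    public static void main(String[] args) {
        // Made-up record whose last field (the description) is empty.
        String record = "19950915\t19950915\tRedmond, WA\tdisc\t6 min.\t";
        System.out.println(countDefault(record));   // 5
        System.out.println(countKeepEmpty(record)); // 6
    }
}
```

If records with an empty description should count as valid, passing -1 as the limit in validate() would be one way to accept them; the original code treats them as invalid.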

Write UFOLocation2.java

import java.io.*;
import java.util.*;
import java.net.*;
import java.util.regex.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.*;

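public class UFOLocation2 {

    public static class MapClass extends MapReduceBase implements Mapper<LongWritable, Text, Text, LongWritable> {

        private final static LongWritable one = new LongWritable(1);
        // Two letters followed only by non-letters up to the end, e.g. "WA" in "Redmond, WA".
        private static Pattern locationPattern = Pattern.compile("[a-zA-Z]{2}[^a-zA-Z]*$");
        // State abbreviation -> full name, loaded from the cached file.
        private Map<String, String> stateNames;

        // Called once per task before any map() call: load the cached file.
        @Override
        public void configure(JobConf job) {
            try {
                Path[] cacheFiles = DistributedCache.getLocalCacheFiles(job);
                if (cacheFiles == null || cacheFiles.length == 0) {
                    throw new IOException("No file found in the distributed cache.");
                }
                setupStateMap(cacheFiles[0].toString());
            } catch (IOException e) {
                // Fail the task with the cause attached instead of exiting the JVM.
                throw new RuntimeException("Error reading state file.", e);
            }
        }

        private void setupStateMap(String fileName) throws IOException {
            Map<String, String> stateCache = new HashMap<String, String>();
            // try-with-resources closes the reader even if reading fails.
            try (BufferedReader reader = new BufferedReader(new FileReader(fileName))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] splits = line.split("\t");
                    // Skip blank or malformed lines instead of failing on splits[1].
                    if (splits.length >= 2) {
                        stateCache.put(splits[0], splits[1]);
                    }
                }
            }
            stateNames = stateCache;
        }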

        @Override
        public void map(LongWritable key, Text value, OutputCollector<Text, LongWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            String[] fields = line.split("\t");
            // Field 2 is the location; the validation mapper guarantees six fields.
            String location = fields[2].trim();
            if (location.length() >= 2) {
                Matcher matcher = locationPattern.matcher(location);
                if (matcher.find()) {
                    int start = matcher.start();
                    String state = location.substring(start, start + 2);
                    // Emit (full state name, 1); the reducer sums the ones.
                    output.collect(new Text(lookupState(state.toUpperCase())), one);
                }
            }
        }

        // Returns the full name, or the abbreviation itself when it is not in the map.
        private String lookupState(String state) {
            String fullName = stateNames.get(state);
            if (fullName == null || "".equals(fullName)) {
                fullName = state;
            }
            return fullName;
        }
    }

    public static void main(String...args) throws Exception {
        Configuration config = new Configuration();
        JobConf conf = new JobConf(config, UFOLocation2.class);
        conf.setJobName("UFOLocation2");

        // HDFS path of the uploaded file. A relative -put lands in the current
        // user's HDFS home directory, so /user/root assumes the job runs as root.
        DistributedCache.addCacheFile(new URI("/user/root/states.txt"), conf);

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(LongWritable.class);

        // Chain: validation mapper first, then the location mapper.
        JobConf mapconf1 = new JobConf(false);
        ChainMapper.addMapper(conf, UFORecordValidationMapper.class, LongWritable.class, Text.class, LongWritable.class, Text.class, true, mapconf1);
        JobConf mapconf2 = new JobConf(false);
        ChainMapper.addMapper(conf, MapClass.class, LongWritable.class, Text.class, Text.class, LongWritable.class, true, mapconf2);
        conf.setMapperClass(ChainMapper.class);

        // LongSumReducer adds up the counts per state, both as combiner and reducer.
        conf.setCombinerClass(LongSumReducer.class);
        conf.setReducerClass(LongSumReducer.class);

        FileInputFormat.setInputPaths(conf, args[0]);
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
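The location pattern used in MapClass can be tried outside Hadoop to see what it accepts. The sketch below reuses the same regular expression in a standalone class; the class name LocationPatternDemo and the sample strings other than "Redmond, WA" are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Standalone illustration of the state extraction done in MapClass.map().
final class LocationPatternDemo {

    // Same expression as locationPattern in MapClass.
    private static final Pattern LOCATION_PATTERN = Pattern.compile("[a-zA-Z]{2}[^a-zA-Z]*$");

    // Returns the two matched letters in upper case, or null when nothing matches.
    static String extractState(String location) {
        String trimmed = location.trim();
        Matcher matcher = LOCATION_PATTERN.matcher(trimmed);
        if (!matcher.find()) {
            return null;
        }
        int start = matcher.start();
        return trimmed.substring(start, start + 2).toUpperCase();
    }

    public static void main(String[] args) {
        System.out.println(extractState("Redmond, WA"));             // WA
        System.out.println(extractState("Seattle (north of), wa.")); // WA
        System.out.println(extractState("Paris"));                   // IS
        System.out.println(extractState("42"));                      // null
    }
}
```

The third sample shows a limitation: any location that ends in two letters matches, so "Paris" yields "IS". In the job, such a value is not found in the state map and is emitted unchanged by lookupState.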

Compile the two files above (the Hadoop jars must be on the classpath)

javac UFORecordValidationMapper.java UFOLocation2.java

Package the compiled classes into a jar

jar cvf ufo.jar UFO*class

Submit the jar to Hadoop and run it

hadoop jar ufo.jar UFOLocation2 ufo.tsv output

Copy the result from Hadoop to the local machine

hadoop dfs -get output/part-00000 ufo_result.txt

View the result

more ufo_result.txt
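With the default text output format, each line of the result file holds a key and a value separated by a tab, here a state name and its sighting count. As a rough sketch of how the result could be ranked locally, the class below sorts such lines by count; the class name ResultRanking is hypothetical and the counts in main are invented, not real job output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical post-processing helper for "name<TAB>count" result lines.
final class ResultRanking {

    // Sorts the lines by count, largest first; lines that do not parse are skipped.
    static List<String> sortByCount(List<String> lines) {
        List<String[]> rows = new ArrayList<>();
        for (String line : lines) {
            String[] parts = line.split("\t");
            if (parts.length == 2 && parts[1].trim().matches("\\d+")) {
                rows.add(new String[] {parts[0], parts[1].trim()});
            }
        }
        rows.sort((a, b) -> Long.compare(Long.parseLong(b[1]), Long.parseLong(a[1])));
        List<String> sorted = new ArrayList<>();
        for (String[] row : rows) {
            sorted.add(row[0] + "\t" + row[1]);
        }
        return sorted;
    }

    public static void main(String[] args) {
        // Invented counts, for illustration only.
        List<String> sample = Arrays.asList("Alabama\t3", "California\t10", "Texas\t7");
        for (String line : sortByCount(sample)) {
            System.out.println(line);
        }
    }
}
```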
