Hadoop Learning Path (6): Implementing a Custom Partitioner in MapReduce

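The partitioner that MapReduce uses by default is HashPartitioner.

How it works: it computes the hash of each key emitted by the map phase and takes that value modulo the number of reduce tasks. The result determines which reduce task fetches that key-value pair.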
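The hash-then-modulo rule described above can be sketched in plain Java. This is only an illustration: it uses String keys and String.hashCode(), whereas a real job hashes the Hadoop key type (Text here), so the indexes it prints may differ from those of an actual job. The class name HashPartitionSketch is made up for this example.

```java
// Sketch of the default hash-partitioning rule, using String keys instead of Hadoop's Text.
final class HashPartitionSketch {

    // Mask off the sign bit so that a negative hashCode cannot produce a negative index,
    // then take the remainder modulo the number of reduce tasks.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 4;
        for (String key : new String[] {"Dear", "Bear", "River", "Car"}) {
            System.out.println(key + " -> reduce task " + partitionFor(key, numReduceTasks));
        }
    }
}
```

Equal keys always produce the same index, so all pairs with the same key end up at the same reduce task; which keys share a reduce task, however, is decided by the hash and not by the programmer.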

To customize partitioning, extend Partitioner and override its getPartition() method.


Appendix: the input text to be counted

Dear Dear Bear Bear River Car Dear Dear  Bear Rive

Dear Dear Bear Bear River Car Dear Dear  Bear Rive

The custom partitioner class must be registered in the main function.

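Custom partitioner class:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

import java.util.HashMap;
import java.util.Map;

// Text is the type of the key emitted by the map phase; IntWritable is the type of the value.
public class CustomPartitioner extends Partitioner<Text, IntWritable> {

    // Maps each known word to the index of the reduce task that should receive it.
    private static final Map<String, Integer> dict = new HashMap<String, Integer>();

    static {
        dict.put("Dear", 0);
        dict.put("Bear", 1);
        dict.put("River", 2);
        dict.put("Car", 3);
    }

    @Override
    public int getPartition(Text text, IntWritable intWritable, int numPartitions) {
        Integer partitionIndex = dict.get(text.toString());
        if (partitionIndex == null) {
            // A word missing from the dictionary would otherwise cause a NullPointerException
            // when the null Integer is unboxed. Falling back to hash partitioning is one
            // possible choice here, not part of the original design.
            return (text.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
        return partitionIndex;
    }
}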

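Note: the map output is a key-value pair <K,V>. In the lookup dict.get(text.toString()), text is the key K of that pair, and partitionIndex is the partition number that the dictionary assigns to K, not the content of K itself.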
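To make the relationship between key, partition number and output file concrete, here is a small plain-Java sketch with no Hadoop classes involved. The class name DictPartitionSketch and the hash fallback for unknown words are assumptions of this sketch; the part-r-0000N names are the file names a job with the default file output format typically writes, one per reduce task.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java stand-in for the dictionary lookup done by the custom partitioner.
final class DictPartitionSketch {

    private static final Map<String, Integer> DICT = new HashMap<>();

    static {
        DICT.put("Dear", 0);
        DICT.put("Bear", 1);
        DICT.put("River", 2);
        DICT.put("Car", 3);
    }

    // Returns the partition number assigned to the key.
    // Unknown keys fall back to hashing (an assumption of this sketch).
    static int partitionFor(String key, int numPartitions) {
        Integer index = DICT.get(key);
        if (index == null) {
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
        return index;
    }

    // Name of the reducer output file for a partition, e.g. partition 2 -> part-r-00002.
    static String outputFileFor(int partition) {
        return String.format("part-r-%05d", partition);
    }

    public static void main(String[] args) {
        for (String word : new String[] {"Dear", "Bear", "River", "Car"}) {
            int partition = partitionFor(word, 4);
            System.out.println(word + " -> partition " + partition + " -> " + outputFileFor(partition));
        }
    }
}
```

With four reduce tasks, each of the four words gets an output file of its own, which is the point of replacing the hash rule with an explicit dictionary.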

Mapper class:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class WordCountMap extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split on any run of whitespace so that both space-separated and
        // tab-separated input produce one token per word.
        String[] words = value.toString().trim().split("\\s+");
        for (String word : words) {
            if (word.isEmpty()) {
                continue; // a blank input line yields a single empty token
            }
            // Each word occurs once; emit it as an intermediate result.
            context.write(new Text(word), new IntWritable(1));
        }
    }
}
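The mapper's tokenization step is easy to check in isolation. The sketch below compares splitting on a tab character with splitting on any run of whitespace, using a line shaped like the sample text; the class and method names are made up for this example. If the input is separated by spaces, a tab split returns the whole line as a single "word", which the partitioner's dictionary does not contain.

```java
// Compares two ways of splitting an input line into words.
final class TokenizeSketch {

    // Splits only where a tab character occurs.
    static String[] splitOnTab(String line) {
        return line.split("\t");
    }

    // Splits on any run of spaces or tabs, ignoring leading and trailing whitespace.
    static String[] splitOnWhitespace(String line) {
        return line.trim().split("\\s+");
    }

    public static void main(String[] args) {
        // Space-separated line, including one double space as in the sample text.
        String line = "Dear Dear Bear Bear River Car Dear Dear  Bear";
        System.out.println("tab split:        " + splitOnTab(line).length + " token(s)");
        System.out.println("whitespace split: " + splitOnWhitespace(line).length + " token(s)");
    }
}
```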

Reducer class (the listing published under this heading repeated the mapper; what follows is a minimal sketch of the WordCountReduce class that main refers to, assuming it simply sums the counts for each word):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class WordCountReduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Add up the 1s emitted by the mapper for this word.
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

main function:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class WordCountMain {

    public static void main(String[] args) throws IOException,
            ClassNotFoundException, InterruptedException {
        // Check for null before reading the length; exactly two paths are expected.
        if (args == null || args.length != 2) {
            System.out.println("please input Path!");
            System.exit(1);
        }

        Configuration configuration = new Configuration();
        configuration.set("mapreduce.job.jar","/home/bruce/project/kkbhdp01/target/com.kaikeba.hadoop-1.0-SNAPSHOT.jar");
        Job job = Job.getInstance(configuration, WordCountMain.class.getSimpleName());

        // Locate the job jar through a class it contains
        job.setJarByClass(WordCountMain.class);

        // The input/output formats can be set through the job
        //job.setInputFormatClass(TextInputFormat.class);
        //job.setOutputFormatClass(TextOutputFormat.class);

        // Set the input/output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Set the classes that handle the map and reduce phases
        job.setMapperClass(WordCountMap.class);
        // Map-side combine
        //job.setCombinerClass(WordCountReduce.class);
        job.setReducerClass(WordCountReduce.class);

        // If map and reduce emit the same key/value types, setting the reduce output types is enough;
        // otherwise the map and reduce output types must be set separately
        //job.setMapOutputKeyClass(.class)

        // Set the types of the final output key/value
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Register the custom partitioner; its dictionary uses indexes 0 to 3, so four reduce tasks are needed
        job.setPartitionerClass(CustomPartitioner.class);
        job.setNumReduceTasks(4);

        // Submit the job, wait for it to finish, and report failure through the exit code
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
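The number of reduce tasks and the partitioner have to agree: every index the partitioner returns must lie between 0 and the number of reduce tasks minus 1. As far as I know, Hadoop fails the map task with an "Illegal partition" error when an index falls outside that range. The following sketch shows a check that could be run before submitting the job; PartitionRangeCheck, sampleDict and requireInRange are hypothetical names, not part of Hadoop.

```java
import java.util.HashMap;
import java.util.Map;

// Verifies that every partition index in a word dictionary fits the configured reducer count.
final class PartitionRangeCheck {

    // The same word-to-partition assignments as in the custom partitioner.
    static Map<String, Integer> sampleDict() {
        Map<String, Integer> dict = new HashMap<>();
        dict.put("Dear", 0);
        dict.put("Bear", 1);
        dict.put("River", 2);
        dict.put("Car", 3);
        return dict;
    }

    // Throws if any index is negative or not smaller than numReduceTasks.
    static void requireInRange(Map<String, Integer> dict, int numReduceTasks) {
        for (Map.Entry<String, Integer> entry : dict.entrySet()) {
            int partition = entry.getValue();
            if (partition < 0 || partition >= numReduceTasks) {
                throw new IllegalArgumentException("partition " + partition + " for key "
                        + entry.getKey() + " does not fit " + numReduceTasks + " reduce tasks");
            }
        }
    }

    public static void main(String[] args) {
        requireInRange(sampleDict(), 4);
        System.out.println("all partition indexes fit 4 reduce tasks");
    }
}
```

Calling requireInRange(sampleDict(), 3) throws, because "Car" is assigned to partition 3 and only partitions 0 to 2 would exist.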

Program arguments for main: the first argument is the input path and the second is the output path.
