Hadoop in Practice: MapReduce WordCount (Part 5)

Environment:

Master IP: 192.168.80.128 (master), running NameNode, SecondaryNameNode, and ResourceManager

Slave IP: 192.168.80.129 (slave1), running DataNode and NodeManager

Slave IP: 192.168.80.130 (slave2), running DataNode and NodeManager

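1. Prepare the files

1) Create the input directory on HDFS (-p also creates any missing parent directories)

hadoop fs -mkdir -p /user/joe/wordcount/input

2) Create a local directory

mkdir -p /home/chenyun/data/mapreduce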

3) Create file01

cd /home/chenyun/data/mapreduce
touch file01

vi file01

Write the following content into file01:

Hello World, Bye World!

4) Create file02

cd /home/chenyun/data/mapreduce
touch file02

vi file02

Write the following content into file02:

Hello Hadoop, Goodbye to hadoop.

5) Upload the local files file01 and file02 to the HDFS directory /user/joe/wordcount/input

hadoop fs -put /home/chenyun/data/mapreduce/file01 /user/joe/wordcount/input

hadoop fs -put /home/chenyun/data/mapreduce/file02 /user/joe/wordcount/input

2. Write the MapReduce program

1) Write the MapReduce program in Eclipse

// The run commands below use com.accp.mapreduce.WordCount, so the class must live in this package.
package com.accp.mapreduce;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.StringUtils;

public class WordCount {

	public static class TokenizerMapper extends
			Mapper<Object, Text, Text, IntWritable> {

		// Custom counter that tracks how many words the mappers have emitted.
		static enum CountersEnum {
			INPUT_WORDS
		}

		private final static IntWritable one = new IntWritable(1);
		private Text word = new Text();

		private boolean caseSensitive;
		private Set<String> patternsToSkip = new HashSet<String>();

		private Configuration conf;

		// Runs once per map task: reads the job options and loads the skip patterns.
		@Override
		public void setup(Context context) throws IOException,
				InterruptedException {
			conf = context.getConfiguration();
			caseSensitive = conf.getBoolean("wordcount.case.sensitive", true);
			if (conf.getBoolean("wordcount.skip.patterns", false)) {
				URI[] patternsURIs = Job.getInstance(conf).getCacheFiles();
				for (URI patternsURI : patternsURIs) {
					Path patternsPath = new Path(patternsURI.getPath());
					// The cached file is available in the task's working directory under its base name.
					String patternsFileName = patternsPath.getName();
					parseSkipFile(patternsFileName);
				}
			}
		}

		// Reads one regular expression per line from the cached patterns file.
		private void parseSkipFile(String fileName) {
			// try-with-resources closes the reader even if reading fails.
			try (BufferedReader reader = new BufferedReader(new FileReader(fileName))) {
				String pattern = null;
				while ((pattern = reader.readLine()) != null) {
					patternsToSkip.add(pattern);
				}
			} catch (IOException ioe) {
				System.err
						.println("Caught exception while parsing the cached file '"
								+ fileName + "': "
								+ StringUtils.stringifyException(ioe));
			}
		}

		// Called once per input line: emits (word, 1) for every whitespace-separated token.
		@Override
		public void map(Object key, Text value, Context context)
				throws IOException, InterruptedException {
			String line = (caseSensitive) ? value.toString() : value.toString()
					.toLowerCase();
			// Remove every match of each skip pattern before tokenizing.
			for (String pattern : patternsToSkip) {
				line = line.replaceAll(pattern, "");
			}
			StringTokenizer itr = new StringTokenizer(line);
			while (itr.hasMoreTokens()) {
				word.set(itr.nextToken());
				context.write(word, one);
				Counter counter = context.getCounter(
						CountersEnum.class.getName(),
						CountersEnum.INPUT_WORDS.toString());
				counter.increment(1);
			}
		}
	}

	// Sums the counts for each word; also used as the combiner on the map side.
	public static class IntSumReducer extends
			Reducer<Text, IntWritable, Text, IntWritable> {

		private IntWritable result = new IntWritable();

		@Override
		public void reduce(Text key, Iterable<IntWritable> values,
				Context context) throws IOException, InterruptedException {
			int sum = 0;
			for (IntWritable val : values) {
				sum += val.get();
			}
			result.set(sum);
			context.write(key, result);
		}
	}

	public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();
		// GenericOptionsParser consumes generic options such as -Dkey=value
		// and leaves the application arguments in remainingArgs.
		GenericOptionsParser optionParser = new GenericOptionsParser(conf, args);
		String[] remainingArgs = optionParser.getRemainingArgs();
		if ((remainingArgs.length != 2) && (remainingArgs.length != 4)) {
			System.err
					.println("Usage: wordcount <in> <out> [-skip skipPatternFile]");
			System.exit(2);
		}
		Job job = Job.getInstance(conf, "word count");
		job.setJarByClass(WordCount.class);
		job.setMapperClass(TokenizerMapper.class);
		job.setCombinerClass(IntSumReducer.class);
		job.setReducerClass(IntSumReducer.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);

		List<String> otherArgs = new ArrayList<String>();
		for (int i = 0; i < remainingArgs.length; ++i) {
			if ("-skip".equals(remainingArgs[i])) {
				// Ship the patterns file to every task through the distributed cache.
				job.addCacheFile(new Path(remainingArgs[++i]).toUri());
				job.getConfiguration().setBoolean("wordcount.skip.patterns",
						true);
			} else {
				otherArgs.add(remainingArgs[i]);
			}
		}
		FileInputFormat.addInputPath(job, new Path(otherArgs.get(0)));
		FileOutputFormat.setOutputPath(job, new Path(otherArgs.get(1)));

		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}

}
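To make the data flow easier to follow, here is a minimal sketch that mimics the map and reduce steps in plain Java, without Hadoop. The class name LocalWordCount and its count method are hypothetical and exist only for this illustration; the two input lines are the contents of file01 and file02 from step 1. Because StringTokenizer splits on whitespace only, punctuation stays attached to the words, so "World," and "World!" are counted as different keys.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical helper class for illustration only; not part of the job above.
public class LocalWordCount {

    // Mimics TokenizerMapper + IntSumReducer in memory: every whitespace-separated
    // token contributes a 1, and the 1s are summed per token.
    // TreeMap keeps the keys sorted for readable output.
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                counts.merge(itr.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Hello World, Bye World!",           // file01
                "Hello Hadoop, Goodbye to hadoop."); // file02
        count(lines).forEach((word, n) -> System.out.println(word + "\t" + n));
    }
}
```

Running the sketch locally is a quick way to check what to expect from the cluster job before submitting it.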

2) Export the project as mapreduce.jar

3) Upload the jar to the following directory on master:

/home/chenyun/project/mapreduce

3. Run WordCount

1) Run the job

hadoop jar /home/chenyun/project/mapreduce/mapreduce.jar com.accp.mapreduce.WordCount /user/joe/wordcount/input /user/joe/wordcount/output

2) View the results

hadoop fs -cat /user/joe/wordcount/output/part-r-00000

=======================================================================================================================

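4. Filter out characters that should not be counted

1) Create /home/chenyun/data/mapreduce/patterns.txt locally and add the following lines to it. Each line is applied as a regular expression, and every match is removed from the input before the words are counted:

\.

\,

\!

to

2) Upload the file to HDFS

hadoop fs -put /home/chenyun/data/mapreduce/patterns.txt /user/joe/wordcount

3) Run the job

hadoop jar /home/chenyun/project/mapreduce/mapreduce.jar com.accp.mapreduce.WordCount -Dwordcount.case.sensitive=true /user/joe/wordcount/input /user/joe/wordcount/output1 -skip /user/joe/wordcount/patterns.txt

4) View the results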

hadoop fs -cat /user/joe/wordcount/output1/part-r-00000
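The mapper removes every match of each pattern with String.replaceAll before tokenizing. The sketch below reproduces just that step outside Hadoop; SkipPatternSketch and strip are hypothetical names used only here. It also illustrates a caveat: the bare pattern to matches inside longer words as well, so a word-boundary pattern such as \bto\b (an assumption, not part of the patterns.txt above) may be safer on real text.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class for illustration only; not part of the job above.
public class SkipPatternSketch {

    // Applies every pattern as a regular expression and deletes all matches,
    // the same way the mapper's loop over patternsToSkip does.
    static String strip(String line, List<String> patterns) {
        for (String pattern : patterns) {
            line = line.replaceAll(pattern, "");
        }
        return line;
    }

    public static void main(String[] args) {
        // The four lines of patterns.txt, written as Java string literals.
        List<String> patterns = Arrays.asList("\\.", "\\,", "\\!", "to");

        System.out.println(strip("Hello World, Bye World!", patterns));
        System.out.println(strip("Hello Hadoop, Goodbye to hadoop.", patterns));

        // Caveat: "to" is removed from inside words too.
        System.out.println(strip("tomato", Arrays.asList("to")));
        // Assumed alternative: only remove "to" as a whole word.
        System.out.println(strip("tomato to", Arrays.asList("\\bto\\b")));
    }
}
```

Removing "to" leaves two consecutive spaces behind, which is harmless because StringTokenizer skips runs of whitespace.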

======================================================================================================================

5. Count words ignoring case

1) Run the job

hadoop jar /home/chenyun/project/mapreduce/mapreduce.jar com.accp.mapreduce.WordCount -Dwordcount.case.sensitive=false /user/joe/wordcount/input /user/joe/wordcount/output5 -skip /user/joe/wordcount/patterns.txt

2) View the results

hadoop fs -cat /user/joe/wordcount/output5/part-r-00000
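With wordcount.case.sensitive=false the mapper lowercases each line first and only then applies the skip patterns, so a pattern containing uppercase letters would never match in this mode. Below is a minimal sketch of the whole per-line pipeline, assuming the same two input files and the same patterns.txt; CaseInsensitiveSketch and count are hypothetical names used only for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical helper class for illustration only; not part of the job above.
public class CaseInsensitiveSketch {

    // Same order of operations as TokenizerMapper.map:
    // optional lowercasing, then pattern removal, then tokenizing and counting.
    static Map<String, Integer> count(List<String> lines, List<String> patterns,
            boolean caseSensitive) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String raw : lines) {
            String line = caseSensitive ? raw : raw.toLowerCase();
            for (String pattern : patterns) {
                line = line.replaceAll(pattern, "");
            }
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                counts.merge(itr.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Hello World, Bye World!",           // file01
                "Hello Hadoop, Goodbye to hadoop."); // file02
        List<String> patterns = Arrays.asList("\\.", "\\,", "\\!", "to");

        // false corresponds to -Dwordcount.case.sensitive=false
        count(lines, patterns, false)
                .forEach((word, n) -> System.out.println(word + "\t" + n));
    }
}
```

In this mode "Hadoop" and "hadoop" collapse into a single key, whereas the case-sensitive run in section 4 keeps them apart.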
