Word Count - MapReduceJob

POM file

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

	xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">

	<modelVersion>4.0.0</modelVersion>

	<groupId>com.zuoyan</groupId>

	<artifactId>hadoop</artifactId>

	<version>0.0.1-SNAPSHOT</version>

	<packaging>jar</packaging>

	<name>hadoop</name>

	<url>http://maven.apache.org</url>

	<properties>

		<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>

	</properties>

	<dependencies>

		<dependency>

			<groupId>junit</groupId>

			<artifactId>junit</artifactId>

			<version>3.8.1</version>

			<scope>test</scope>

		</dependency>

		<!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client -->

		<dependency>

			<groupId>org.apache.hadoop</groupId>

			<artifactId>hadoop-client</artifactId>

			<version>3.0.0</version>

		</dependency>

		<!-- https://mvnrepository.com/artifact/com.janeluo/ikanalyzer -->

		<dependency>

		    <groupId>com.janeluo</groupId>

		    <artifactId>ikanalyzer</artifactId>

		    <version>2012_u6</version>

		</dependency>

	</dependencies>

	<build>

		<plugins>

			<plugin>

				<artifactId>maven-assembly-plugin</artifactId>

				<configuration>

					<appendAssemblyId>false</appendAssemblyId>

					<descriptorRefs>

						<descriptorRef>jar-with-dependencies</descriptorRef>

					</descriptorRefs>

					<archive>

						<manifest>

							<!-- Specify the class that contains the main method entry point -->

							<mainClass>com.zuoyan.hadoop.FirstMapReduceJob</mainClass>

<!-- 							<mainClass>com.geotmt.hadoop.hdfs.FirstMapReduceJob</mainClass> -->

						</manifest>

					</archive>

				</configuration>

				<executions>

					<execution>

						<id>make-assembly</id>

						<phase>package</phase>

						<goals>

							<goal>single</goal>

						</goals>

					</execution>

				</executions>

			</plugin>

			<plugin>

				<groupId>org.apache.maven.plugins</groupId>

				<artifactId>maven-compiler-plugin</artifactId>

				<version>3.6.2</version>

				<configuration>

					<source>1.8</source>

					<target>1.8</target>

					<encoding>UTF-8</encoding>

				</configuration>

			</plugin>

		</plugins>

	</build>

</project>

Word Count - Implementation

package com.zuoyan.hadoop;

import java.io.ByteArrayInputStream;

import java.io.IOException;

import java.io.InputStream;

import java.io.InputStreamReader;

import java.io.Reader;

import org.apache.hadoop.conf.Configuration;

import org.apache.hadoop.fs.Path;

import org.apache.hadoop.io.IntWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Job;

import org.apache.hadoop.mapreduce.Mapper;

import org.apache.hadoop.mapreduce.Reducer;

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import org.apache.hadoop.util.GenericOptionsParser;

import org.wltea.analyzer.core.IKSegmenter;

import org.wltea.analyzer.core.Lexeme;

/**

 * Word count

 *

 */

public class FirstMapReduceJob {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>{

        private final static IntWritable one = new IntWritable(1);

        private Text word = new Text();

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {

            /*

             * Default English tokenization (split on whitespace)

             *

            StringTokenizer itr = new StringTokenizer(value.toString());

            while (itr.hasMoreTokens()) {

                word.set(itr.nextToken());

                context.write(word, one);

            }

            */

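            /*

             * Chinese word segmentation using the IK segmenter

             */

            // Text.getBytes() returns the backing array; only the first getLength() bytes are valid

            InputStream inputStream = new ByteArrayInputStream(value.getBytes(), 0, value.getLength());

            // Text stores UTF-8, so decode explicitly rather than with the platform default charset

            Reader reader = new InputStreamReader(inputStream, "UTF-8");

            // The second argument (true) enables IK's smart segmentation mode

            IKSegmenter iKSegmenter = new IKSegmenter(reader, true);

            Lexeme t;

            while ((t = iKSegmenter.next()) != null) {

                // Reuse the Writable fields instead of allocating new objects for every token

                word.set(t.getLexemeText());

                context.write(word, one);

            }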

            // Option 2: get information about the input file (split)

//            context.getInputSplit().getLocationInfo();

        }

    }

    public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values,Context context ) throws IOException, InterruptedException {

            int sum = 0;

            for (IntWritable val : values) {

                sum += val.get();

            }

            result.set(sum);

            context.write(key, result);

        }

    }

    public static void main(String[] args) throws Exception {

        Configuration conf = new Configuration();

        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();

        if (otherArgs.length != 2) {

            System.err.println("Usage: wordcount <in> <out>");

            System.exit(2);

        }

        // Job.getInstance replaces the deprecated Job constructor

        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(FirstMapReduceJob.class);

        job.setMapperClass(TokenizerMapper.class);

        job.setCombinerClass(IntSumReducer.class);

        job.setReducerClass(IntSumReducer.class);

        job.setOutputKeyClass(Text.class);

        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));

        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);

    }

}
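
To see what the mapper and reducer compute without a cluster, the same logic can be sketched in plain Java. This is only an illustration under stated assumptions: the class `LocalWordCount` and its methods are hypothetical names that are not part of the job above, whitespace tokenization (as in the commented-out `StringTokenizer` variant) stands in for the IK segmenter, and an in-memory map stands in for Hadoop's shuffle and reduce.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical local stand-in for the Hadoop job: no cluster, no IK segmenter.
class LocalWordCount {

    // "Map" step: split one input line on whitespace and emit each token.
    static List<String> map(String line) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            tokens.add(itr.nextToken());
        }
        return tokens;
    }

    // "Shuffle" and "reduce" steps: group identical tokens and sum a 1 for each occurrence.
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted keys, like reducer output
        for (String line : lines) {
            for (String token : map(line)) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = count(Arrays.asList("hello hadoop", "hello world"));
        System.out.println(counts); // prints {hadoop=1, hello=2, world=1}
    }
}
```

Because summing is associative and commutative, the same reducer class can also serve as the combiner, which is why the job registers `IntSumReducer` for both roles.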
