We saw the internals of MapReduce in the last post. Now we can make a small change to the WordCount program and create a JAR to be executed by Hadoop.
If we look at the result of the WordCount we ran before, we see that the lines of the file were split only on spaces, so all other punctuation characters and symbols remained attached to the words and effectively skew the count. For example, we can see entries like these:

    KING 1
    KING'S 1
    KING. 5
    King 6
    King, 4
    King. 2
    King; 1

The word is always king, but sometimes it appears in upper case, sometimes in lower case, and sometimes with a punctuation character after it. So we'd like to update the behaviour of the WordCount program to count all the occurrences of a word regardless of case, punctuation, and other symbols. To do so, we have to modify the code of the mapper, since that's where we read the data from the file and split it.
If we look at the code of the mapper:

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

we see that the StringTokenizer takes the line as its parameter. Before tokenizing, we remove all the symbols from the line using a regular expression that maps each of these symbols to a space:

    String cleanLine = value.toString().toLowerCase().replaceAll("[_|$#<>\\[\\]\\*/\\\\,;,.\\-:()?!\"']", " ");

that simply says "if you see any of the characters _, |, $, #, <, >, [, ], *, /, \, ,, ;, ., -, :, (, ), ?, !, ", ', transform it into a space". All the backslashes are needed to correctly escape the characters inside the regular expression. Then we trim each token to avoid empty tokens:

    itr.nextToken().trim()

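To see the effect on a concrete line, here is a small standalone sketch that applies the same regex and tokenizer (the sample input line is made up for illustration; it is not from the original post):

    import java.util.StringTokenizer;

    public class CleanLineDemo {
        public static void main(String[] args) {
            // A made-up sample line, for illustration only.
            String line = "The KING'S men; the King, the king.";
            // The same cleaning regex used in the mapper below.
            String cleanLine = line.toLowerCase()
                    .replaceAll("[_|$#<>\\[\\]\\*/\\\\,;,.\\-:()?!\"']", " ");
            System.out.println(cleanLine); // prints: "the king s men  the king  the king "
            StringTokenizer itr = new StringTokenizer(cleanLine);
            while (itr.hasMoreTokens()) {
                // StringTokenizer skips runs of whitespace, so no empty tokens show up.
                System.out.println(itr.nextToken().trim());
            }
        }
    }

Note that KING'S gets split into the two tokens king and s, which is consistent with the final result counting KING'S as an occurrence of king.
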
So now the code looks like this (the updates to the original file are the @Override annotation, the cleanLine preprocessing, and the call to trim()):

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            String cleanLine = value.toString().toLowerCase().replaceAll("[_|$#<>\\^=\\[\\]\\*/\\\\,;,.\\-:()?!\"']", " ");
            StringTokenizer itr = new StringTokenizer(cleanLine);
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken().trim());
                context.write(word, one);
            }
        }
    }

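For completeness, here is a minimal sketch of the rest of the class, the reducer and the driver; it follows the standard Hadoop WordCount example, so the version in the repository may differ in some details:

    // Requires the usual imports: java.io.IOException, org.apache.hadoop.conf.Configuration,
    // org.apache.hadoop.fs.Path, org.apache.hadoop.io.IntWritable, org.apache.hadoop.io.Text,
    // org.apache.hadoop.mapreduce.Job, org.apache.hadoop.mapreduce.Reducer,
    // org.apache.hadoop.mapreduce.lib.input.FileInputFormat,
    // org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum all the 1s emitted by the mapper for this word.
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        // Configure and submit the job; args[0] is the input path, args[1] the output path.
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
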
The complete code of the class is available on my GitHub repository.

We now want to create the JAR to be executed. To do so, we go to the output directory of our WordCount build (the directory that contains the .class files) and create a manifest file named manifest_file that contains something like this:

    Manifest-Version: 1.0
    Main-Class: samples.wordcount.WordCount

in which we tell the JAR which class to execute at startup. More details can be found in the official Java documentation. Note that there's no need to add classpath information, because the JAR will be run by Hadoop, which already has a classpath containing all the needed libraries.
Now we can launch the command:

    $ jar cmf manifest_file wordcount.jar .

which creates a JAR named wordcount.jar containing all the classes found starting from the current directory (the . parameter), using the content of manifest_file to create its manifest.
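We can now run the batch as we saw before; for example, with something like the following, where the HDFS input and output paths are placeholders (they are not taken from the original post):

    $ hadoop jar wordcount.jar /user/hadoop/input /user/hadoop/output
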
Looking at the result file, we can check that there are no more symbols or punctuation characters; the count for the word king is now the sum of all the occurrences we found before:

    king 20

which is what we were looking for.

from: http://andreaiacono.blogspot.com/2014/02/creating-jar-for-hadoop.html
