来自:http://hadoopi.wordpress.com/2014/06/05/hadoop-add-third-party-libraries-to-mapreduce-job/

Anybody working with Hadoop should have already faced a same common issue: How to add third-party libraries to your MapReduce job.

Add libjars option

The first solution, maybe the most common one, consists on adding libraries using -libjars parameter on CLI. To make it work, your class MyClass must useGenericOptionsParser class. Easiest way is to implement the Hadoop Tool interface as described in post Hadoop: Implementing the Tool interface for MapReduce driver.

$ export LIBJARS=/path/jar1,/path/jar2
$ hadoop jar /path/to/my.jar com.wordpress.hadoopi.MyClass -libjars ${LIBJARS} value

This will obviously work only when playing with CLI, so how the heck can we add such external jar files when not using CLI ?

Add jar files to Hadoop classpath

You could certainly upload external jar files to each tasktracker and update HADOOOP_CLASSPATH accordingly, but are you really willing to bother Ops team each time you need to add a new jar ? Works well on a single server node, but are you going to upload such jar across all of the 10, 100 or even more Hadoop nodes ? This approach does not scale at all !

Create a fat jar

Another approach is to create a fat jar, which is a JAR that contains your classes as well as your third-party classes (see this Cloudera blog post for more details). Be aware this Jar will not only contain your classes, but might also include all your project dependencies (such as Hadoop libraries) unless you explicitly exclude them (using provided tag).
Here is an example of maven plugin you will need to set up

            <plugin>
<artifactId>maven-assembly-plugin</artifactId>
<configuration>
<archive>
<manifest>
<mainClass></mainClass>
</manifest>
</archive>
<descriptorRefs>
<descriptorRef>
jar-with-dependencies
</descriptorRef>
</descriptorRefs>
</configuration>
<executions>
<execution>
<id>make-assembly</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>

Following a “mvn clean package” command, your fat JAR will be located in maven project’s target directory as follows

drwxr-xr-x  2 antoine  staff        68 Jun 10 09:30 archive-tmp
drwxr-xr-x 3 antoine staff 102 Jun 10 09:29 classes
drwxr-xr-x 3 antoine staff 102 Jun 10 09:29 generated-sources
drwxr-xr-x 3 antoine staff 102 Jun 10 09:29 generated-test-sources
drwxr-xr-x 3 antoine staff 102 Jun 10 09:29 maven-archiver
drwxr-xr-x 4 antoine staff 136 Jun 10 09:29 myproject-1.0-SNAPSHOT
-rw-r--r-- 1 antoine staff 63880020 Jun 10 09:30 myproject-1.0-SNAPSHOT-jar-with-dependencies.jar
drwxr-xr-x 4 antoine staff 136 Jun 10 09:29 surefire-reports
drwxr-xr-x 4 antoine staff 136 Jun 10 09:29 test-classes

In above example, note the actual size of your JAR file (61MB). Quite fat, isn’t it ?
You can ensure all dependencies have been added by firing up below command

$ jar -tf myproject-1.0-SNAPSHOT-jar-with-dependencies.jar

META-INF/
META-INF/MANIFEST.MF
com/aamend/hadoop/allMyClasses.class
...
com/others/allMyDependencies.class
...

Use Distributed cache

I am always following such approach when using third-party libraries in my MapReduce jobs. One would say such approach is not elegant, but I can work without annoying anyone from Ops team :). I first create a directory “lib” in my HDFS home directory (“/user/hadoopi/”). You could even use “/tmp”, it does not matter. I then create a static method that

  1. Locate the jar file that includes the class I need
  2. Upload this jar to Hadoop HDFS
  3. Add the uploaded jar file to Hadoop distributed cache

Simply add the following lines to some Utils class.

    private static void addJarToDistributedCache(
Class classToAdd, Configuration conf)
throws IOException { // Retrieve jar file for class2Add
String jar = classToAdd.getProtectionDomain().
getCodeSource().getLocation().
getPath();
File jarFile = new File(jar); // Declare new HDFS location
Path hdfsJar = new Path("/user/hadoopi/lib/"
+ jarFile.getName()); // Mount HDFS
FileSystem hdfs = FileSystem.get(conf); // Copy (override) jar file to HDFS
hdfs.copyFromLocalFile(false, true,
new Path(jar), hdfsJar); // Add jar to distributed classPath
DistributedCache.addFileToClassPath(hdfsJar, conf);
}

The only thing you need to remember is to add this class prior to Job submission…

    public static void main(String[] args) throws Exception {

        // Create Hadoop configuration
Configuration conf = new Configuration(); // Add 3rd-party libraries
addJarToDistributedCache(MyFirstClass.class, conf);
addJarToDistributedCache(MySecondClass.class, conf); // Create my job
Job job = new Job(conf, "Hadoop-classpath");
.../...
}

Here you are, your MapReduce is now able to use any external JAR file.

Hadoop: Add third-party libraries to MapReduce job的更多相关文章

  1. Hadoop:使用Mrjob框架编写MapReduce

    Mrjob简介 Mrjob是一个编写MapReduce任务的开源Python框架,它实际上对Hadoop Streaming的命令行进行了封装,因此接粗不到Hadoop的数据流命令行,使我们可以更轻松 ...

  2. 【Cloud Computing】Hadoop环境安装、基本命令及MapReduce字数统计程序

    [Cloud Computing]Hadoop环境安装.基本命令及MapReduce字数统计程序 1.虚拟机准备 1.1 模板机器配置 1.1.1 主机配置 IP地址:在学校校园网Wifi下连接下 V ...

  3. 十九、Hadoop学记笔记————Hbase和MapReduce

    概要: hadoop和hbase导入环境变量: 要运行Hbase中自带的MapReduce程序,需要运行如下指令,可在官网中找到: 如果遇到如下问题,则说明Hadoop的MapReduce没有权限访问 ...

  4. hadoop源码分析(2):Map-Reduce的过程解析

    一.客户端 Map-Reduce的过程首先是由客户端提交一个任务开始的. 提交任务主要是通过JobClient.runJob(JobConf)静态函数实现的: public static Runnin ...

  5. Hadoop学习之旅三:MapReduce

    MapReduce编程模型 在Google的一篇重要的论文MapReduce: Simplified Data Processing on Large Clusters中提到,Google公司有大量的 ...

  6. Hadoop:使用原生python编写MapReduce

    功能实现 功能:统计文本文件中所有单词出现的频率功能. 下面是要统计的文本文件 [/root/hadooptest/input.txt] foo foo quux labs foo bar quux ...

  7. Hadoop学习记录(4)|MapReduce原理|API操作使用

    MapReduce概念 MapReduce是一种分布式计算模型,由谷歌提出,主要用于搜索领域,解决海量数据计算问题. MR由两个阶段组成:Map和Reduce,用户只需要实现map()和reduce( ...

  8. Hadoop 学习笔记 (十一) MapReduce 求平均成绩

    china:张三 78李四 89王五 96赵六 67english张三 80李四 82王五    84赵六 86math张三 88李四 99王五 66赵六 77 import java.io.IOEx ...

  9. Hadoop 学习笔记 (十) MapReduce实现排序 全局变量

    一些疑问:1 全排序的话,最后的应该sortJob.setNumReduceTasks(1);2 如果多个reduce task都去修改 一个静态的 IntWritable ,IntWritable会 ...

随机推荐

  1. android的logcat详细用法!

    from://http://www.miui.com/article-272-1.html [技术交流]android的logcat详细用法! logcat是Android中一个命令行工具,可以用于得 ...

  2. EM算法与混合高斯模型

    非常早就想看看EM算法,这个算法在HMM(隐马尔科夫模型)得到非常好的应用.这个算法公式太多就手写了这部分主体部分. 好的參考博客:最大似然预计到EM,讲了详细样例通熟易懂. JerryLead博客非 ...

  3. Shell 命令行快捷键

    在shell命令终端中.Ctrl+n相当于方向向下的方向键,Ctrl+p相当于方向向上的方向键. 在命令终端中通过它们或者方向键能够实现对历史命令的高速查找.这也是高速输入命令的技巧. 在命令终端中能 ...

  4. 动态规划算法——最长公共子序列问题(java实现)

    已知序列X=(A,B,C,A,B,D,A)和序列Y=(B,A,D,B,A),求它们的最长公共子序列S. /* * LCSLength.java * Version 1.0.0 * Created on ...

  5. spring4 quartz2 集群动态任务

    实现定时任务的执行,而且要求定时周期是不固定的.测试地址:http://sms.reyo.cn 生产环境:nginx+tomcat+quartz2.2.1+spring4.2.1 集群. 实现功能:可 ...

  6. 低版本系统兼容的ActionBar(七)自定义Actionbar标题栏字体

    这个自定义字体其实和ActionBar有关,但之前写AtionBar的时候没考虑到修改字体样式,今天看到一篇专门写这个的文章就贴上使用方式.╮(╯▽╰)╭,不得不说Actionbar的那个样式真是让人 ...

  7. MVC的Ajax异步请求

    @using (Ajax.BeginForm("GetTime","order",new AjaxOptions() { Confirm="你确认这么 ...

  8. cross validation笔记

    preface:做实验少不了交叉验证,平时常用from sklearn.cross_validation import train_test_split,用train_test_split()函数将数 ...

  9. Java 文件路径相关

    不得不说Java的文件路径弄得很复杂, 有编译目录和resource目录什么的和解释型语言(PHP)的就是不一样 搞了好几年java一直没认真去研究这些个破路径怎么回事, 每次都忘记, 梳理一下备忘 ...

  10. NOI2015滚粗记

    我的第一次也是最后一次NOI 好像写的晚了许多……可能一谈到退役总会有些伤感,并不愿去面对…… 一路走来已有5年,虽然我总在说“其实我好好学的时间只有半年”,但那也不过是给自己是蒟蒻找的借口吧...一 ...