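1. Overview

While monitoring the cluster today, I noticed that the load on the NameNode (NN) was too high. Adjusting the NN's resources and repackaging the applications running on it relieved the load for the time being, but on reflection that is not a long-term solution. The problem prompted me to revisit our application deployment architecture, and that revealed the underlying issue: the packaged applications were deployed directly on the Hadoop cluster. That works, but it has a cost. If the applications run on the DataNodes (DN), over time they may compete with the DataNodes for resources; if they run on the NN, they interfere with the NN's own work. After some discussion we concluded that applications should not interfere with the Hadoop cluster at all. The two should be loosely coupled, with all applications deployed on a separate AppServer cluster. The rest of this post walks through that setup.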

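2. Analysis of the Application Deployment

Because the applications used to run directly on the Hadoop cluster, they affected the cluster to some degree. When we develop Hadoop applications locally, we can run the Hadoop code directly, and the only thing we need from the cluster is its HDFS address. So why not deploy the application separately? A local development machine can be seen as one node of an AppServer cluster. Following that idea, we package the application on its own and deploy it to an independent AppServer cluster that only needs the Hadoop cluster's HDFS address. Note that the AppServer cluster must be on the same network segment as the Hadoop cluster. The decoupled deployment architecture is shown in the following diagram:

As the diagram shows, the AppServer cluster submits jobs to the Hadoop cluster, and the data exchange between the two requires only the HDFS address and the Hadoop Java API. Applications on the AppServer do not disturb the normal operation of the Hadoop cluster.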
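Since the AppServer nodes must be able to reach the NameNodes over the network, a small pre-flight check can help before deploying. The following is only a sketch under assumptions: the class name, the `host:port` argument format, and the 2-second timeout are illustrative and not part of the original project. It uses only the Java standard library and simply tests whether a TCP connection can be opened.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical pre-flight check; not part of the original project. */
final class NameNodeReachability {

    /** Splits "host:port" into an unresolved socket address. */
    static InetSocketAddress parse(String hostPort) {
        int idx = hostPort.lastIndexOf(':');
        if (idx <= 0 || idx == hostPort.length() - 1) {
            throw new IllegalArgumentException("expected host:port, got " + hostPort);
        }
        String host = hostPort.substring(0, idx);
        int port = Integer.parseInt(hostPort.substring(idx + 1));
        return InetSocketAddress.createUnresolved(host, port);
    }

    /** Returns true if a TCP connection can be opened within timeoutMillis. */
    static boolean isReachable(InetSocketAddress address, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            // Resolve the host name here so that DNS failures count as "unreachable".
            socket.connect(new InetSocketAddress(address.getHostString(), address.getPort()),
                    timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Pass each NameNode RPC address as an argument, e.g. the two HA NameNodes.
        for (String hostPort : args) {
            System.out.println(hostPort + " reachable: " + isReachable(parse(hostPort), 2000));
        }
    }
}
```

Run it on an AppServer node with the RPC addresses of both NameNodes as arguments. A successful TCP connection only shows that the port is reachable; it does not prove that HDFS itself is healthy.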

3. Example

The following example uses WordCountV2. The code is shown below:

package cn.hadoop.hdfs.main;

import java.io.IOException;
import java.util.Random;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import cn.hadoop.hdfs.util.SystemConfig;

/**
 * @Date Apr 23, 2015
 *
 * @Author dengjie
 *
 * @Note WordCount is a classic MapReduce example, often called the "hello world" of Hadoop.
 *       It splits the input file into words (map phase), shuffles and sorts the pairs,
 *       aggregates the counts (reduce phase), and finally writes the result to HDFS.
 */
public class WordCountV2 {

    private static Logger logger = LoggerFactory.getLogger(WordCountV2.class);

    private static Configuration conf;

    /**
     * Connection settings for the high-availability (HA) cluster.
     */
    static {
        String tag = SystemConfig.getProperty("dev.tag");
        // RPC addresses of the two NameNodes, read from the project's own configuration
        String[] hosts = SystemConfig.getPropertyArray(tag + ".hdfs.host", ",");
        conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://cluster1");
        conf.set("dfs.nameservices", "cluster1");
        conf.set("dfs.ha.namenodes.cluster1", "nna,nns");
        conf.set("dfs.namenode.rpc-address.cluster1.nna", hosts[0]);
        conf.set("dfs.namenode.rpc-address.cluster1.nns", hosts[1]);
        conf.set("dfs.client.failover.proxy.provider.cluster1",
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
    }

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);

        private Text word = new Text();

        /**
         * Source file: a b b
         *
         * After map:
         *
         * a 1
         * b 1
         * b 1
         */
        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());// tokenize the whole line
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());// next whitespace-delimited word
                context.write(word, one);// emit (word, 1) for every occurrence
            }
        }
    }

    /**
     * Before reduce:
     *
     * a 1
     * b 1
     * b 1
     *
     * After reduce:
     *
     * a 1
     * b 2
     */
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException,
                InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();// sum the counts grouped under this key
            }
            result.set(sum);
            context.write(key, result);// one output record per key
        }
    }

    public static void main(String[] args) {
        try {
            if (args.length < 1) {
                logger.info("args length is 0");
                run("hello.txt");
            } else {
                logger.info("args length is not 0");
                run(args[0]);
            }
        } catch (Exception ex) {
            ex.printStackTrace();
            logger.error(ex.getMessage());
        }
    }

    private static void run(String name) throws Exception {
        long randName = new Random().nextLong();// random name so each run writes to a new output directory
        logger.info("output name is [" + randName + "]");

        Job job = Job.getInstance(conf);
        job.setJarByClass(WordCountV2.class);
        job.setMapperClass(TokenizerMapper.class);// Mapper class
        job.setCombinerClass(IntSumReducer.class);// Combiner class
        job.setReducerClass(IntSumReducer.class);// Reducer class
        job.setOutputKeyClass(Text.class);// output key type
        job.setOutputValueClass(IntWritable.class);// output value type

        String sysInPath = SystemConfig.getProperty("hdfs.input.path.v2");
        String realInPath = String.format(sysInPath, name);
        String syOutPath = SystemConfig.getProperty("hdfs.output.path.v2");
        String realOutPath = String.format(syOutPath, randName);

        FileInputFormat.addInputPath(job, new Path(realInPath));// input path
        FileOutputFormat.setOutputPath(job, new Path(realOutPath));// output path

        System.exit(job.waitForCompletion(true) ? 0 : 1);// exit once the MR job has finished
    }
}
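To make the data flow described in the comments of WordCountV2 concrete without a cluster, here is a minimal plain-Java sketch of the same "a b b" example. It deliberately does not use any Hadoop API: the class and method names are illustrative only, and the sorted map merely stands in for the shuffle and sort step.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

/** Plain-Java illustration of the map and reduce phases; it does not use Hadoop. */
final class WordCountFlowSketch {

    /** Map phase: emit a (word, 1) pair for every whitespace-separated token. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            pairs.add(new SimpleEntry<String, Integer>(itr.nextToken(), 1));
        }
        return pairs;
    }

    /** Stand-in for shuffle/sort plus reduce: group by key (sorted) and sum the values. */
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = map("a b b");
        System.out.println(pairs);          // prints [a=1, b=1, b=1]
        System.out.println(reduce(pairs));  // prints {a=1, b=2}
    }
}
```

Because summing is associative, the same summing logic can serve as both combiner and reducer, which is why WordCountV2 registers IntSumReducer for both roles.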

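WordCountV2 runs correctly in the local IDE, as the screenshot shows:

4. Packaging and Deploying the Application

Next, we package the WordCountV2 application and deploy it to the AppServer1 node. Because the project uses a Maven layout, we can package it directly with Maven: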

mvn assembly:assembly

Then we upload the packaged JAR to the AppServer1 node with scp:

scp hadoop-ubas-1.0.-jar-with-dependencies.jar hadoop@apps:~/

Next, we run the packaged application on the AppServer1 node:

java -jar hadoop-ubas-1.0.-jar-with-dependencies.jar

Unfortunately, it fails with the following error:

java.io.IOException: No FileSystem for scheme: hdfs
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:)
    at org.apache.hadoop.fs.FileSystem.access$(FileSystem.java:)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.addInputPath(FileInputFormat.java:)
    at cn.hadoop.hdfs.main.WordCountV2.run(WordCountV2.java:)
    at cn.hadoop.hdfs.main.WordCountV2.main(WordCountV2.java:)
-- :: ERROR [WordCountV2.main] - No FileSystem for scheme: hdfs

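5. Error Analysis

First, let's pin down the cause. The same packaged JAR runs fine on the Hadoop cluster and produces correct results, so why does it fail on a node outside the cluster? Is the architecture itself wrong? After carefully examining the error message and our Maven dependencies, the cause became clear: we packaged the application with Maven's assembly plugin. When the assembly plugin merges all the JARs into a single file, every copy of META-INF/services/org.apache.hadoop.fs.FileSystem overwrites the previous one, and only the last one added survives. In this case the FileSystem list from hadoop-common overwrote the list from hadoop-hdfs, so DistributedFileSystem was no longer declared, which produces the error above. With the cause identified, what remains is the fix. The solution I found is to set the FileSystem implementations explicitly when loading the Hadoop Configuration: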

conf.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());

conf.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName());
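The overwrite described above can be illustrated with a small plain-Java simulation. This is only a sketch under assumptions: it is not how the assembly plugin is implemented, the class and method names are made up for illustration, and each service file is abbreviated to a single entry (the real files list more implementations).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Simulates combining same-named META-INF/services files; not a real packager. */
final class ServiceFileMergeSketch {

    /** "Last one wins": each JAR's copy of the file replaces the previous copy. */
    static List<String> overwrite(List<List<String>> copiesInJarOrder) {
        List<String> surviving = new ArrayList<>();
        for (List<String> copy : copiesInJarOrder) {
            surviving = new ArrayList<>(copy);
        }
        return surviving;
    }

    /** Merge: concatenate all copies, dropping duplicates but keeping the order. */
    static List<String> merge(List<List<String>> copiesInJarOrder) {
        Set<String> merged = new LinkedHashSet<>();
        for (List<String> copy : copiesInJarOrder) {
            merged.addAll(copy);
        }
        return new ArrayList<>(merged);
    }

    public static void main(String[] args) {
        // Abbreviated contents: first the copy from hadoop-hdfs, then the one from hadoop-common.
        List<List<String>> copies = Arrays.asList(
                Arrays.asList("org.apache.hadoop.hdfs.DistributedFileSystem"),
                Arrays.asList("org.apache.hadoop.fs.LocalFileSystem"));

        // prints [org.apache.hadoop.fs.LocalFileSystem]
        System.out.println(overwrite(copies));
        // prints [org.apache.hadoop.hdfs.DistributedFileSystem, org.apache.hadoop.fs.LocalFileSystem]
        System.out.println(merge(copies));
    }
}
```

With the overwrite strategy the hdfs scheme has no declared implementation, which is why setting fs.hdfs.impl explicitly works around the problem. An alternative not used in this post is to merge the service files at packaging time, for example with the maven-shade-plugin's ServicesResourceTransformer.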

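Next, we repackage the application and run it again on the AppServer1 node. This time it runs correctly and produces the word counts. The run log is shown below: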

[hadoop@apps example]$ java -jar hadoop-ubas-1.0.-jar-with-dependencies.jar
-- :: INFO  [SystemConfig.main] - Successfully loaded default properties.
-- :: INFO  [WordCountV2.main] - args length is
-- :: INFO  [WordCountV2.main] - output name is []
-- :: WARN  [NativeCodeLoader.main] - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
-- :: INFO  [deprecation.main] - session.id is deprecated. Instead, use dfs.metrics.session-id
-- :: INFO  [JvmMetrics.main] - Initializing JVM Metrics with processName=JobTracker, sessionId=
-- :: WARN  [JobSubmitter.main] - Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
-- :: INFO  [FileInputFormat.main] - Total input paths to process :
-- :: INFO  [JobSubmitter.main] - number of splits:
-- :: INFO  [JobSubmitter.main] - Submitting tokens for job: job_local519626586_0001
-- :: INFO  [Job.main] - The url to track the job: http://localhost:8080/
-- :: INFO  [Job.main] - Running job: job_local519626586_0001
-- :: INFO  [LocalJobRunner.Thread-] - OutputCommitter set in config null
-- :: INFO  [LocalJobRunner.Thread-] - OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
-- :: INFO  [LocalJobRunner.Thread-] - Waiting for map tasks
-- :: INFO  [LocalJobRunner.LocalJobRunner Map Task Executor #] - Starting task: attempt_local519626586_0001_m_000000_0
-- :: INFO  [Task.LocalJobRunner Map Task Executor #] -  Using ResourceCalculatorProcessTree : [ ]
-- :: INFO  [MapTask.LocalJobRunner Map Task Executor #] - Processing split: hdfs://cluster1/home/hdfs/test/in/hello.txt:0+24
-- :: INFO  [MapTask.LocalJobRunner Map Task Executor #] - (EQUATOR)  kvi ()
-- :: INFO  [MapTask.LocalJobRunner Map Task Executor #] - mapreduce.task.io.sort.mb:
-- :: INFO  [MapTask.LocalJobRunner Map Task Executor #] - soft limit at
-- :: INFO  [MapTask.LocalJobRunner Map Task Executor #] - bufstart = ; bufvoid =
-- :: INFO  [MapTask.LocalJobRunner Map Task Executor #] - kvstart = ; length =
-- :: INFO  [MapTask.LocalJobRunner Map Task Executor #] - Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
-- :: INFO  [LocalJobRunner.LocalJobRunner Map Task Executor #] -
-- :: INFO  [MapTask.LocalJobRunner Map Task Executor #] - Starting flush of map output
-- :: INFO  [MapTask.LocalJobRunner Map Task Executor #] - Spilling map output
-- :: INFO  [MapTask.LocalJobRunner Map Task Executor #] - bufstart = ; bufend = ; bufvoid =
-- :: INFO  [MapTask.LocalJobRunner Map Task Executor #] - kvstart = (); kvend = (); length = /
-- :: INFO  [MapTask.LocalJobRunner Map Task Executor #] - Finished spill
-- :: INFO  [Task.LocalJobRunner Map Task Executor #] - Task:attempt_local519626586_0001_m_000000_0 is done. And is in the process of committing
-- :: INFO  [LocalJobRunner.LocalJobRunner Map Task Executor #] - map
-- :: INFO  [Task.LocalJobRunner Map Task Executor #] - Task 'attempt_local519626586_0001_m_000000_0' done.
-- :: INFO  [LocalJobRunner.LocalJobRunner Map Task Executor #] - Finishing task: attempt_local519626586_0001_m_000000_0
-- :: INFO  [LocalJobRunner.Thread-] - map task executor complete.
-- :: INFO  [LocalJobRunner.Thread-] - Waiting for reduce tasks
-- :: INFO  [LocalJobRunner.pool--thread-] - Starting task: attempt_local519626586_0001_r_000000_0
-- :: INFO  [Task.pool--thread-] -  Using ResourceCalculatorProcessTree : [ ]
-- :: INFO  [ReduceTask.pool--thread-] - Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@
-- :: INFO  [MergeManagerImpl.pool--thread-] - MergerManager: memoryLimit=, maxSingleShuffleLimit=, mergeThreshold=, ioSortFactor=, memToMemMergeOutputsThreshold=
-- :: INFO  [EventFetcher.EventFetcher for fetching Map Completion Events] - attempt_local519626586_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
-- :: INFO  [LocalFetcher.localfetcher#] - localfetcher# about to shuffle output of map attempt_local519626586_0001_m_000000_0 decomp:  len:  to MEMORY
-- :: INFO  [InMemoryMapOutput.localfetcher#] - Read  bytes from map-output for attempt_local519626586_0001_m_000000_0
-- :: INFO  [MergeManagerImpl.localfetcher#] - closeInMemoryFile -> map-output of size: , inMemoryMapOutputs.size() -> , commitMemory -> , usedMemory ->
-- :: INFO  [EventFetcher.EventFetcher for fetching Map Completion Events] - EventFetcher is interrupted.. Returning
-- :: INFO  [LocalJobRunner.pool--thread-] -  /  copied.
-- :: INFO  [MergeManagerImpl.pool--thread-] - finalMerge called with  in-memory map-outputs and  on-disk map-outputs
-- :: INFO  [Merger.pool--thread-] - Merging  sorted segments
-- :: INFO  [Merger.pool--thread-] - Down to the last merge-pass, with  segments left of total size:  bytes
-- :: INFO  [MergeManagerImpl.pool--thread-] - Merged  segments,  bytes to disk to satisfy reduce memory limit
-- :: INFO  [MergeManagerImpl.pool--thread-] - Merging  files,  bytes from disk
-- :: INFO  [MergeManagerImpl.pool--thread-] - Merging  segments,  bytes from memory into reduce
-- :: INFO  [Merger.pool--thread-] - Merging  sorted segments
-- :: INFO  [Merger.pool--thread-] - Down to the last merge-pass, with  segments left of total size:  bytes
-- :: INFO  [LocalJobRunner.pool--thread-] -  /  copied.
-- :: INFO  [deprecation.pool--thread-] - mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
-- :: INFO  [Job.main] - Job job_local519626586_0001 running in uber mode : false
-- :: INFO  [Job.main] -  map % reduce %
-- :: INFO  [Task.pool--thread-] - Task:attempt_local519626586_0001_r_000000_0 is done. And is in the process of committing
-- :: INFO  [LocalJobRunner.pool--thread-] -  /  copied.
-- :: INFO  [Task.pool--thread-] - Task attempt_local519626586_0001_r_000000_0 is allowed to commit now
-- :: INFO  [FileOutputCommitter.pool--thread-] - Saved output of task 'attempt_local519626586_0001_r_000000_0' to hdfs://cluster1/home/hdfs/test/out/6876390710620561863/_temporary/0/task_local519626586_0001_r_000000
-- :: INFO  [LocalJobRunner.pool--thread-] - reduce > reduce
-- :: INFO  [Task.pool--thread-] - Task 'attempt_local519626586_0001_r_000000_0' done.
-- :: INFO  [LocalJobRunner.pool--thread-] - Finishing task: attempt_local519626586_0001_r_000000_0
-- :: INFO  [LocalJobRunner.Thread-] - reduce task executor complete.
-- :: INFO  [Job.main] -  map % reduce %
-- :: INFO  [Job.main] - Job job_local519626586_0001 completed successfully
-- :: INFO  [Job.main] - Counters:
    File System Counters
        FILE: Number of bytes read=
        FILE: Number of bytes written=
        FILE: Number of read operations=
        FILE: Number of large read operations=
        FILE: Number of write operations=
        HDFS: Number of bytes read=
        HDFS: Number of bytes written=
        HDFS: Number of read operations=
        HDFS: Number of large read operations=
        HDFS: Number of write operations=
    Map-Reduce Framework
        Map input records=
        Map output records=
        Map output bytes=
        Map output materialized bytes=
        Input split bytes=
        Combine input records=
        Combine output records=
        Reduce input groups=
        Reduce shuffle bytes=
        Reduce input records=
        Reduce output records=
        Spilled Records=
        Shuffled Maps =
        Failed Shuffles=
        Merged Map outputs=
        GC time elapsed (ms)=
        CPU time spent (ms)=
        Physical memory (bytes) snapshot=
        Virtual memory (bytes) snapshot=
        Total committed heap usage (bytes)=
    Shuffle Errors
        BAD_ID=
        CONNECTION=
        IO_ERROR=
        WRONG_LENGTH=
        WRONG_MAP=
        WRONG_REDUCE=
    File Input Format Counters
        Bytes Read=
    File Output Format Counters
        Bytes Written=

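6. Summary

Note that the deployment architecture itself is sound and the approach is correct; the problem lay in the packaging, which needs particular care. If you export the JAR from an IDE, check that the required dependencies are actually included. The same caution applies to common third-party packaging tools such as Fat Jar.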
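One way to check a packaged JAR before deploying it is to look at which FileSystem implementations its service file still declares. The following is a minimal sketch using only java.util.jar; the class name and command-line usage are assumptions for illustration and are not part of the original project.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

/** Hypothetical helper that lists the FileSystem implementations declared in a JAR. */
final class ServiceFileInspector {

    static final String SERVICE_FILE = "META-INF/services/org.apache.hadoop.fs.FileSystem";

    /** Returns the declared class names, or an empty list if the service file is absent. */
    static List<String> declaredFileSystems(String jarPath) throws IOException {
        List<String> declared = new ArrayList<>();
        try (JarFile jar = new JarFile(jarPath)) {
            JarEntry entry = jar.getJarEntry(SERVICE_FILE);
            if (entry == null) {
                return declared;
            }
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(jar.getInputStream(entry), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // Service files allow '#' comments; keep only the class name part.
                    int hash = line.indexOf('#');
                    String name = (hash >= 0 ? line.substring(0, hash) : line).trim();
                    if (!name.isEmpty()) {
                        declared.add(name);
                    }
                }
            }
        }
        return declared;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ServiceFileInspector <path-to-jar>");
            return;
        }
        for (String name : declaredFileSystems(args[0])) {
            System.out.println(name);
        }
    }
}
```

If org.apache.hadoop.hdfs.DistributedFileSystem is missing from the output for the jar-with-dependencies file, the hdfs scheme will not be resolved through the service file, and the explicit fs.hdfs.impl setting shown earlier (or a packaging step that merges service files) is needed.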

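7. Closing Remarks

That is all for this post. If you run into problems while studying this material, you can join the discussion group or email me, and I will do my best to help. Let's keep learning together!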

High-Availability Hadoop Platform: Application JAR Deployment