今天在测试spark-sql运行在yarn上的过程中,无意间从日志中发现了一个问题:

spark-sql --master yarn
// :: INFO Client: Requesting a new application from cluster with  NodeManagers
// :: INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster ( MB per container)
// :: INFO Client: Will allocate AM container, with MB memory including MB overhead
// :: INFO Client: Setting up container launch context for our AM
// :: INFO Client: Preparing resources for our AM container
// :: INFO Client: Uploading resource file:/home/spark/software/source/compile/deploy_spark/assembly/target/scala-2.10/spark-assembly-1.3.0-SNAPSHOT-hadoop2.3.0-cdh5.0.0.jar -> hdfs://hadoop000:8020/user/spark/.sparkStaging/application_1416381870014_0093/spark-assembly-1.3.0-SNAPSHOT-hadoop2.3.0-cdh5.0.0.jar
// :: INFO Client: Setting up the launch environment for our AM container

再开启一个spark-sql命令行,从日志中再次发现:

// :: INFO Client: Requesting a new application from cluster with  NodeManagers
// :: INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster ( MB per container)
// :: INFO Client: Will allocate AM container, with MB memory including MB overhead
// :: INFO Client: Setting up container launch context for our AM
// :: INFO Client: Preparing resources for our AM container
// :: INFO Client: Uploading resource file:/home/spark/software/source/compile/deploy_spark/assembly/target/scala-2.10/spark-assembly-1.3.0-SNAPSHOT-hadoop2.3.0-cdh5.0.0.jar -> hdfs://hadoop000:8020/user/spark/.sparkStaging/application_1416381870014_0094/spark-assembly-1.3.0-SNAPSHOT-hadoop2.3.0-cdh5.0.0.jar
// :: INFO Client: Setting up the launch environment for our AM container

然后查看HDFS上的文件:

hadoop fs -ls hdfs://hadoop000:8020/user/spark/.sparkStaging/
drwx------   - spark supergroup           -- : hdfs://hadoop000:8020/user/spark/.sparkStaging/application_1416381870014_0093
drwx------ - spark supergroup -- : hdfs://hadoop000:8020/user/spark/.sparkStaging/application_1416381870014_0094

每个Application都会上传一个spark-assembly-x.x.x-SNAPSHOT-hadoopx.x.x-cdhx.x.x.jar的jar包,影响HDFS的性能以及占用HDFS的空间。

在Spark文档(http://spark.apache.org/docs/latest/running-on-yarn.html)中发现spark.yarn.jar属性,将spark-assembly-xxxxx.jar存放在hdfs://hadoop000:8020/spark_lib/下

在spark-defaults.conf添加属性配置:

spark.yarn.jar hdfs://hadoop000:8020/spark_lib/spark-assembly-1.3.0-SNAPSHOT-hadoop2.3.0-cdh5.0.0.jar

再次启动spark-sql --master yarn观察日志:

14/12/29 15:39:02 INFO Client: Requesting a new application from cluster with 1 NodeManagers
14/12/29 15:39:02 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
14/12/29 15:39:02 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
14/12/29 15:39:02 INFO Client: Setting up container launch context for our AM
14/12/29 15:39:02 INFO Client: Preparing resources for our AM container
14/12/29 15:39:02 INFO Client: Source and destination file systems are the same. Not copying hdfs://hadoop000:8020/spark_lib/spark-assembly-1.3.0-SNAPSHOT-hadoop2.3.0-cdh5.0.0.jar
14/12/29 15:39:02 INFO Client: Setting up the launch environment for our AM container

观察HDFS上文件

hadoop fs -ls hdfs://hadoop000:8020/user/spark/.sparkStaging/application_1416381870014_0097

该Application对应的目录下没有spark-assembly-xxxxx.jar了,从而节省assembly包上传的过程以及HDFS空间占用。

我在测试过程中遇到了类似如下的错误:

Application application_xxxxxxxxx_yyyy failed 2 times due to AM Container for application_xxxxxxxxx_yyyy 

exited with exitCode: -1000 due to: java.io.FileNotFoundException: File /tmp/hadoop-spark/nm-local-dir/filecache does not exist

在/tmp/hadoop-spark/nm-local-dir路径下创建filecache文件夹即可解决报错问题。

Spark On Yarn中spark.yarn.jar属性的使用的更多相关文章

  1. Spark HA 配置中spark.deploy.zookeeper.url 的意思

    Spark HA的配置网上很多,最近我在看王林的Spark的视频,要付费的.那个人牛B吹得很大,本事应该是有的,但是有本事,不一定就是好老师.一开始吹中国第一,吹着吹着就变成世界第一.就算你真的是世界 ...

  2. Guava com.google.common.base.Stopwatch Spark程序在yarn中 MethodNotFound

    今天在公司提交一个Spark 读取hive中的数据,写入JanusGraph 的app,自己本地调试没有问题,放入环境中提交到yarn 中时,发现app 跑不起. yarn 中日志,也比较明显,app ...

  3. Spark基本工作流程及YARN cluster模式原理(读书笔记)

    Spark基本工作流程及YARN cluster模式原理 转载请注明出处:http://www.cnblogs.com/BYRans/ Spark基本工作流程 相关术语解释 Spark应用程序相关的几 ...

  4. 【原创】大数据基础之Spark(2)Spark on Yarn:container memory allocation容器内存分配

    spark 2.1.1 最近spark任务(spark on yarn)有一个报错 Diagnostics: Container [pid=5901,containerID=container_154 ...

  5. 017 Spark的运行模式(yarn模式)

    1.关于mapreduce on yarn 来提交job的流程 yarn=resourcemanager(RM)+nodemanager(NM) client向RM提交任务 RM向NM分配applic ...

  6. <YARN><MRv2><Spark on YARN>

    MRv1 VS MRv2 MRv1: - JobTracker: 资源管理 & 作业控制- 每个作业由一个JobInProgress控制,每个任务由一个TaskInProgress控制.由于每 ...

  7. Spark记录-源码编译spark2.2.0(结合Hive on Spark/Hive on MR2/Spark on Yarn)

    #spark2.2.0源码编译 #组件:mvn-3.3.9 jdk-1.8 #wget http://mirror.bit.edu.cn/apache/spark/spark-2.2.0/spark- ...

  8. Spark(十二) -- Spark On Yarn & Spark as a Service & Spark On Tachyon

    Spark On Yarn: 从0.6.0版本其,就可以在在Yarn上运行Spark 通过Yarn进行统一的资源管理和调度 进而可以实现不止Spark,多种处理框架并存工作的场景 部署Spark On ...

  9. Spark提交任务(Standalone和Yarn)

    Spark Standalone模式提交任务 Cluster模式: ./spark-submit  \--master spark://node01:7077  \--deploy-mode clus ...

随机推荐

  1. Rhel6-cacti+nagios+ganglia(nginx)配置文档

    (lnmp平台) 系统环境: rhel6 x86_64 iptables and selinux disabled 主机: 192.168.122.185 server85.example.com 1 ...

  2. 《大象-Think In UML》读书笔记2

    什么是UML? UML本身并没有包含软件方法,而仅仅是一种语言,一种建模用的语言,而所有的语言都是基本词汇和语法两部分构成的,UML也不例外.UML中定义了一些建立模型所需要的.表达某种特定含义的基本 ...

  3. 2016 - 1- 23 iOS中xml解析 (!!!!!!!有坑要解决!!!!!!)

    一: iOS中xml解析的几种方式简介 1.官方原生 NSXMLParser :SAX方式解析,使用起来比较简单 2.第三方框架 libxml2 :纯C 同时支持DOM与SAX GDataXML: D ...

  4. php大力力 [045节] 兄弟连高洛峰 PHP教程 2014年[已发布,点击下载]

    http://www.verycd.com/topics/2843130/ 第1部分 WEB开发入门篇第1章LAMP网站构建1.[2014]兄弟连高洛峰 PHP教程1.1.1 新版视频形式介绍[已发布 ...

  5. 关于Let和var声明变量的区别

    Let是ES6中添加进来的一个关键字,用于声明变量,其法与var声明变量相同,不同点在于其作用域(块级). 举例可以看出其具体差别 for(var i=0;i<5;i++){ console.l ...

  6. Binary Tree Inorder Traversal -- LeetCode 94

    Given a binary tree, return the inorder traversal of its nodes' values. For example:Given binary tre ...

  7. HDU 4123 (2011 Asia FZU contest)(树形DP + 维护最长子序列)(bfs + 尺取法)

    题意:告诉一张带权图,不存在环,存下每个点能够到的最大的距离,就是一个长度为n的序列,然后求出最大值-最小值不大于Q的最长子序列的长度. 做法1:两步,第一步是根据图计算出这个序列,大姐头用了树形DP ...

  8. 跨服务器之间的session共享

    跨服务器之间的Session共享方案需求变得迫切起来,最终催生了多种解决方案,下面列举4种较为可行的方案进行对比探讨: 1. 基于NFS的Session共享 NFS是Net FileSystem的简称 ...

  9. 转: SQL Server索引的维护 - 索引碎片、填充因子

    转:http://www.cnblogs.com/kissdodog/archive/2013/06/14/3135412.html 实际上,索引的维护主要包括以下两个方面: 页拆分 碎片 这两个问题 ...

  10. lnmp 在nginx中配置相应的错误页面error_page

    1. 创建自己的404.html页面 2.更改nginx.conf在http定义区域加入: fastcgi_intercept_errors on; 3.更改nginx.conf(或单独网站配置文件, ...