【原创】大数据基础之Spark(1)Spark Submit即Spark任务提交过程
Spark2.1.1
一 Spark Submit本地解析
1.1 现象
提交命令:
spark-submit --master local[10] --driver-memory 30g --class app.package.AppClass app-1.0.jar
进程:
hadoop 225653 0.0 0.0 11256 364 ? S Aug24 0:00 bash /$spark-dir/bin/spark-class org.apache.spark.deploy.SparkSubmit --master local[10] --driver-memory 30g --class app.package.AppClass app-1.0.jar
hadoop 225654 0.0 0.0 34424 2860 ? Sl Aug24 0:00 /$jdk_dir/bin/java -Xmx128m -cp /spark-dir/jars/* org.apache.spark.launcher.Main org.apache.spark.deploy.SparkSubmit --master local[10] --driver-memory 30g --class app.package.AppClass app-1.0.jar
1.2 执行过程
1.2.1 脚本执行
-bash-4.1$ cat bin/spark-submit
#!/usr/bin/env bash
if [ -z "${SPARK_HOME}" ]; then
source "$(dirname "$0")"/find-spark-home
fi# disable randomized hash for string in Python 3.3+
export PYTHONHASHSEED=0exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"
注释:这里执行了另一个脚本spark-class,具体如下:
-bash-4.1$ cat bin/spark-class
...
build_command() {
"$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
printf "%d\0" $?
}CMD=()
while IFS= read -d '' -r ARG; do
CMD+=("$ARG")
done < <(build_command "$@")...
CMD=("${CMD[@]:0:$LAST}")
exec "${CMD[@]}"
注释:这里执行java class: org.apache.spark.launcher.Main,并传入参数,具体如下:
1.2.2 代码执行
org.apache.spark.launcher.Main
... builder = new SparkSubmitCommandBuilder(help); ... List<String> cmd = builder.buildCommand(env); ... List<String> bashCmd = prepareBashCommand(cmd, env); for (String c : bashCmd) { System.out.print(c); System.out.print('\0'); } ...
注释:其中会调用SparkSubmitCommandBuilder来生成Spark Submit命令,具体如下:
org.apache.spark.launcher.SparkSubmitCommandBuilder
... private List<String> buildSparkSubmitCommand(Map<String, String> env)
...
addOptionString(cmd, System.getenv("SPARK_SUBMIT_OPTS"));
addOptionString(cmd, System.getenv("SPARK_JAVA_OPTS"));
...
String driverExtraJavaOptions = config.get(SparkLauncher.DRIVER_EXTRA_JAVA_OPTIONS);
...
if (isClientMode) {
...
addOptionString(cmd, driverExtraJavaOptions);
...
}
... addPermGenSizeOpt(cmd); cmd.add("org.apache.spark.deploy.SparkSubmit"); cmd.addAll(buildSparkSubmitArgs()); return cmd; ...
注释:这里创建了本地命令,其中java class:org.apache.spark.deploy.SparkSubmit,同时会把各种JavaOptions放到启动命令里(比如SPARK_JAVA_OPTS,DRIVER_EXTRA_JAVA_OPTIONS等),具体如下:
org.apache.spark.deploy.SparkSubmit
def main(args: Array[String]): Unit = { val appArgs = new SparkSubmitArguments(args) //parse command line parameter if (appArgs.verbose) { // scalastyle:off println printStream.println(appArgs) // scalastyle:on println } appArgs.action match { case SparkSubmitAction.SUBMIT => submit(appArgs) case SparkSubmitAction.KILL => kill(appArgs) case SparkSubmitAction.REQUEST_STATUS => requestStatus(appArgs) } } private def submit(args: SparkSubmitArguments): Unit = { val (childArgs, childClasspath, sysProps, childMainClass) = prepareSubmitEnvironment(args) //merge all parameters from: command line, properties file, system property, etc... def doRunMain(): Unit = { ... runMain(childArgs, childClasspath, sysProps, childMainClass, args.verbose) ... } ... private[deploy] def prepareSubmitEnvironment(args: SparkSubmitArguments) : (Seq[String], Seq[String], Map[String, String], String) = { if (deployMode == CLIENT || isYarnCluster) { childMainClass = args.mainClass ... if (isYarnCluster) { childMainClass = "org.apache.spark.deploy.yarn.Client" ... private def runMain( childArgs: Seq[String], childClasspath: Seq[String], sysProps: Map[String, String], childMainClass: String, verbose: Boolean): Unit = { // scalastyle:off println if (verbose) { printStream.println(s"Main class:\n$childMainClass") printStream.println(s"Arguments:\n${childArgs.mkString("\n")}") printStream.println(s"System properties:\n${sysProps.mkString("\n")}") printStream.println(s"Classpath elements:\n${childClasspath.mkString("\n")}") printStream.println("\n") } // scalastyle:on println val loader = if (sysProps.getOrElse("spark.driver.userClassPathFirst", "false").toBoolean) { new ChildFirstURLClassLoader(new Array[URL](0), Thread.currentThread.getContextClassLoader) } else { new MutableURLClassLoader(new Array[URL](0), Thread.currentThread.getContextClassLoader) } Thread.currentThread.setContextClassLoader(loader) for (jar <- childClasspath) { addJarToClasspath(jar, loader) } for ((key, value) <- sysProps) { System.setProperty(key, value) } var mainClass: Class[_] = null try { mainClass = Utils.classForName(childMainClass) } catch { ... val mainMethod = mainClass.getMethod("main", new Array[String](0).getClass) ... mainMethod.invoke(null, childArgs.toArray) ...
注释:这里首先会解析命令行参数,比如mainClass,准备运行环境包括System Property以及classpath等,然后使用一个新的classloader:ChildFirstURLClassLoader来加载用户的mainClass,然后反射调用mainClass的main方法,这样用户的app.package.AppClass的main方法就开始执行了。
org.apache.spark.SparkConf
class SparkConf(loadDefaults: Boolean) extends Cloneable with Logging with Serializable { import SparkConf._ /** Create a SparkConf that loads defaults from system properties and the classpath */ def this() = this(true) ... if (loadDefaults) { loadFromSystemProperties(false) } private[spark] def loadFromSystemProperties(silent: Boolean): SparkConf = { // Load any spark.* system properties for ((key, value) <- Utils.getSystemProperties if key.startsWith("spark.")) { set(key, value, silent) } this }
注释:这里可以看到spark是怎样加载配置的
1.2.3 --verbose
spark-submit --master local[*] --class app.package.AppClass --jars /$other-dir/other.jar --driver-memory 1g --verbose app-1.0.jar
输出示例:
Main class:
app.package.AppClass
Arguments:System properties:
spark.executor.logs.rolling.maxSize -> 1073741824
spark.driver.memory -> 1g
spark.driver.extraLibraryPath -> /$hadoop-dir/lib/native
spark.eventLog.enabled -> true
spark.eventLog.compress -> true
spark.executor.logs.rolling.time.interval -> daily
SPARK_SUBMIT -> true
spark.app.name -> app.package.AppClass
spark.driver.extraJavaOptions -> -XX:+PrintGCDetails -XX:+UseG1GC -XX:G1HeapRegionSize=32M -XX:+UseGCOverheadLimit -XX:+ExplicitGCInvokesConcurrent -XX:+HeapDumpOnOutOfMemoryError -XX:-UseCompressedClassPointers -XX:CompressedClassSpaceSize=3G -XX:+PrintGCTimeStamps -Xloggc:/export/Logs/hadoop/g1gc.log
spark.jars -> file:/$other-dir/other.jar
spark.sql.adaptive.enabled -> true
spark.submit.deployMode -> client
spark.executor.logs.rolling.maxRetainedFiles -> 10
spark.executor.extraClassPath -> /usr/lib/hadoop/lib/hadoop-lzo.jar
spark.eventLog.dir -> hdfs://myhdfs/spark/history
spark.master -> local[*]
spark.sql.crossJoin.enabled -> true
spark.driver.extraClassPath -> /usr/lib/hadoop/lib/hadoop-lzo.jar
Classpath elements:
file:/$other-dir/other.jar
file:/app-1.0.jar
启动时添加--verbose参数后,可以输出所有的运行时信息,有助于判断问题。
【原创】大数据基础之Spark(1)Spark Submit即Spark任务提交过程的更多相关文章
- 【原创】大数据基础之Zookeeper(2)源代码解析
核心枚举 public enum ServerState { LOOKING, FOLLOWING, LEADING, OBSERVING; } zookeeper服务器状态:刚启动LOOKING,f ...
- 【原创】大数据基础之Spark(6)Spark Rdd Sort实现原理
spark 2.1.1 spark中可以通过RDD.sortBy来对分布式数据进行排序,具体是如何实现的?来看代码: org.apache.spark.rdd.RDD /** * Return thi ...
- 【原创】大数据基础之Spark(2)Spark on Yarn:container memory allocation容器内存分配
spark 2.1.1 最近spark任务(spark on yarn)有一个报错 Diagnostics: Container [pid=5901,containerID=container_154 ...
- 【原创】大数据基础之Hive(5)hive on spark
hive 2.3.4 on spark 2.4.0 Hive on Spark provides Hive with the ability to utilize Apache Spark as it ...
- 【原创】大数据基础之Spark(9)spark部署方式yarn/mesos
1 下载解压 https://spark.apache.org/downloads.html $ wget http://mirrors.shu.edu.cn/apache/spark/spark-2 ...
- 大数据基础知识问答----spark篇,大数据生态圈
Spark相关知识点 1.Spark基础知识 1.Spark是什么? UCBerkeley AMPlab所开源的类HadoopMapReduce的通用的并行计算框架 dfsSpark基于mapredu ...
- 大数据与可靠性会碰撞出什么样的Spark?
可靠性工程领域的可靠性评估,可靠性仿真计算,健康检测与预管理(PHM)技术,可靠性试验,都需要大规模数据来进行支撑才能产生好的效果,以往这些数据都是不全并且收集困难,而随着互联网+的大数据时代的来临, ...
- 【原创】大数据基础之词频统计Word Count
对文件进行词频统计,是一个大数据领域的hello word级别的应用,来看下实现有多简单: 1 Linux单机处理 egrep -o "\b[[:alpha:]]+\b" test ...
- 【原创】大数据基础之Flink(1)简介、安装、使用
Flink 1.7 官方:https://flink.apache.org/ 一 简介 Apache Flink is an open source platform for distributed ...
随机推荐
- .Net Core应用框架Util介绍(六)
前面介绍了Util是如何封装以降低Angular应用的开发成本. 现在把关注点移到服务端,本文将介绍分层架构各构造块及基类,并对不同层次的开发人员应如何进行业务开发提供一些建议. Util分层架构介绍 ...
- TensorFlow基础
TensorFlow基础 SkySeraph 2017 Email:skyseraph00#163.com 更多精彩请直接访问SkySeraph个人站点:www.skyseraph.com Over ...
- KeyError: 'Spider not found: test'
Error Msg: File "c:\python36\lib\site-packages\scrapy\cmdline.py", line 157, in _run_comma ...
- python部署galery集群
galery.py文件内容 import pexpect import os import configparser HOSTNAME_DB1='db1' HOSTNAME_DB2='db2' HOS ...
- PS教您与粗壮的胳膊拜拜
Step 01在Photoshop 中打开素材图片,图中圈出的地方是需要调整的. Step 02用[套索工具]圈出胳膊及周围的环境. Step 03单击右键,选择[羽化],设置[羽化半径]为20 像素 ...
- openstack搭建之-nova配置(10)
一. base节点设置数据库 mysql -u root -proot CREATE DATABASE nova_api; CREATE DATABASE nova; CREATE DATABASE ...
- 为什么开源外围包安装指导都是按照到/usr/local/目录下,/usr/local与/usr的区别
很多应用都安装在/usr/local下面,那么,这些应用为什么选择这个目录呢?Automake工具定义了下面的一组变量: Directory variable Default value prefix ...
- php提供的用户密码加密函数
在实际项目中,对用户的密码加密基本上采用的 md5加盐的方式, php5.5后提供了一个加密函数,不需要手动加盐,不需要去维护盐值, $str = "123456"; $pwd ...
- poj-2195(最小费用流)
题意:给你一个n*m的地图,H代表这个点有一个房子,m代表这个点是一个人,每次h走一步就花费一,问最小花费使得每个人能进入一个房间 代码:建立一个源点和汇点,每个人和源点相连,每个房子和汇点相连,每个 ...
- python基础3 字符串常用方法
一. 基础数据类型 总览 int:用于计算,计数,运算等. 1,2,3,100...... str:'这些内容[]' 用户少量数据的存储,便于操作. bool: True, False,两种状态 ...