【原创】大数据基础之Spark(1)Spark Submit即Spark任务提交过程
Spark2.1.1
一 Spark Submit本地解析
1.1 现象
提交命令:
spark-submit --master local[10] --driver-memory 30g --class app.package.AppClass app-1.0.jar
进程:
hadoop 225653 0.0 0.0 11256 364 ? S Aug24 0:00 bash /$spark-dir/bin/spark-class org.apache.spark.deploy.SparkSubmit --master local[10] --driver-memory 30g --class app.package.AppClass app-1.0.jar
hadoop 225654 0.0 0.0 34424 2860 ? Sl Aug24 0:00 /$jdk_dir/bin/java -Xmx128m -cp /spark-dir/jars/* org.apache.spark.launcher.Main org.apache.spark.deploy.SparkSubmit --master local[10] --driver-memory 30g --class app.package.AppClass app-1.0.jar
1.2 执行过程
1.2.1 脚本执行
-bash-4.1$ cat bin/spark-submit
#!/usr/bin/env bash
if [ -z "${SPARK_HOME}" ]; then
source "$(dirname "$0")"/find-spark-home
fi# disable randomized hash for string in Python 3.3+
export PYTHONHASHSEED=0exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"
注释:这里执行了另一个脚本spark-class,具体如下:
-bash-4.1$ cat bin/spark-class
...
build_command() {
"$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
printf "%d\0" $?
}CMD=()
while IFS= read -d '' -r ARG; do
CMD+=("$ARG")
done < <(build_command "$@")...
CMD=("${CMD[@]:0:$LAST}")
exec "${CMD[@]}"
注释:这里执行java class: org.apache.spark.launcher.Main,并传入参数,具体如下:
1.2.2 代码执行
org.apache.spark.launcher.Main
... builder = new SparkSubmitCommandBuilder(help); ... List<String> cmd = builder.buildCommand(env); ... List<String> bashCmd = prepareBashCommand(cmd, env); for (String c : bashCmd) { System.out.print(c); System.out.print('\0'); } ...
注释:其中会调用SparkSubmitCommandBuilder来生成Spark Submit命令,具体如下:
org.apache.spark.launcher.SparkSubmitCommandBuilder
... private List<String> buildSparkSubmitCommand(Map<String, String> env)
...
addOptionString(cmd, System.getenv("SPARK_SUBMIT_OPTS"));
addOptionString(cmd, System.getenv("SPARK_JAVA_OPTS"));
...
String driverExtraJavaOptions = config.get(SparkLauncher.DRIVER_EXTRA_JAVA_OPTIONS);
...
if (isClientMode) {
...
addOptionString(cmd, driverExtraJavaOptions);
...
}
... addPermGenSizeOpt(cmd); cmd.add("org.apache.spark.deploy.SparkSubmit"); cmd.addAll(buildSparkSubmitArgs()); return cmd; ...
注释:这里创建了本地命令,其中java class:org.apache.spark.deploy.SparkSubmit,同时会把各种JavaOptions放到启动命令里(比如SPARK_JAVA_OPTS,DRIVER_EXTRA_JAVA_OPTIONS等),具体如下:
org.apache.spark.deploy.SparkSubmit
def main(args: Array[String]): Unit = { val appArgs = new SparkSubmitArguments(args) //parse command line parameter if (appArgs.verbose) { // scalastyle:off println printStream.println(appArgs) // scalastyle:on println } appArgs.action match { case SparkSubmitAction.SUBMIT => submit(appArgs) case SparkSubmitAction.KILL => kill(appArgs) case SparkSubmitAction.REQUEST_STATUS => requestStatus(appArgs) } } private def submit(args: SparkSubmitArguments): Unit = { val (childArgs, childClasspath, sysProps, childMainClass) = prepareSubmitEnvironment(args) //merge all parameters from: command line, properties file, system property, etc... def doRunMain(): Unit = { ... runMain(childArgs, childClasspath, sysProps, childMainClass, args.verbose) ... } ... private[deploy] def prepareSubmitEnvironment(args: SparkSubmitArguments) : (Seq[String], Seq[String], Map[String, String], String) = { if (deployMode == CLIENT || isYarnCluster) { childMainClass = args.mainClass ... if (isYarnCluster) { childMainClass = "org.apache.spark.deploy.yarn.Client" ... private def runMain( childArgs: Seq[String], childClasspath: Seq[String], sysProps: Map[String, String], childMainClass: String, verbose: Boolean): Unit = { // scalastyle:off println if (verbose) { printStream.println(s"Main class:\n$childMainClass") printStream.println(s"Arguments:\n${childArgs.mkString("\n")}") printStream.println(s"System properties:\n${sysProps.mkString("\n")}") printStream.println(s"Classpath elements:\n${childClasspath.mkString("\n")}") printStream.println("\n") } // scalastyle:on println val loader = if (sysProps.getOrElse("spark.driver.userClassPathFirst", "false").toBoolean) { new ChildFirstURLClassLoader(new Array[URL](0), Thread.currentThread.getContextClassLoader) } else { new MutableURLClassLoader(new Array[URL](0), Thread.currentThread.getContextClassLoader) } Thread.currentThread.setContextClassLoader(loader) for (jar <- childClasspath) { addJarToClasspath(jar, loader) } for ((key, value) <- sysProps) { System.setProperty(key, value) } var mainClass: Class[_] = null try { mainClass = Utils.classForName(childMainClass) } catch { ... val mainMethod = mainClass.getMethod("main", new Array[String](0).getClass) ... mainMethod.invoke(null, childArgs.toArray) ...
注释:这里首先会解析命令行参数,比如mainClass,准备运行环境包括System Property以及classpath等,然后使用一个新的classloader:ChildFirstURLClassLoader来加载用户的mainClass,然后反射调用mainClass的main方法,这样用户的app.package.AppClass的main方法就开始执行了。
org.apache.spark.SparkConf
class SparkConf(loadDefaults: Boolean) extends Cloneable with Logging with Serializable { import SparkConf._ /** Create a SparkConf that loads defaults from system properties and the classpath */ def this() = this(true) ... if (loadDefaults) { loadFromSystemProperties(false) } private[spark] def loadFromSystemProperties(silent: Boolean): SparkConf = { // Load any spark.* system properties for ((key, value) <- Utils.getSystemProperties if key.startsWith("spark.")) { set(key, value, silent) } this }
注释:这里可以看到spark是怎样加载配置的
1.2.3 --verbose
spark-submit --master local[*] --class app.package.AppClass --jars /$other-dir/other.jar --driver-memory 1g --verbose app-1.0.jar
输出示例:
Main class:
app.package.AppClass
Arguments:System properties:
spark.executor.logs.rolling.maxSize -> 1073741824
spark.driver.memory -> 1g
spark.driver.extraLibraryPath -> /$hadoop-dir/lib/native
spark.eventLog.enabled -> true
spark.eventLog.compress -> true
spark.executor.logs.rolling.time.interval -> daily
SPARK_SUBMIT -> true
spark.app.name -> app.package.AppClass
spark.driver.extraJavaOptions -> -XX:+PrintGCDetails -XX:+UseG1GC -XX:G1HeapRegionSize=32M -XX:+UseGCOverheadLimit -XX:+ExplicitGCInvokesConcurrent -XX:+HeapDumpOnOutOfMemoryError -XX:-UseCompressedClassPointers -XX:CompressedClassSpaceSize=3G -XX:+PrintGCTimeStamps -Xloggc:/export/Logs/hadoop/g1gc.log
spark.jars -> file:/$other-dir/other.jar
spark.sql.adaptive.enabled -> true
spark.submit.deployMode -> client
spark.executor.logs.rolling.maxRetainedFiles -> 10
spark.executor.extraClassPath -> /usr/lib/hadoop/lib/hadoop-lzo.jar
spark.eventLog.dir -> hdfs://myhdfs/spark/history
spark.master -> local[*]
spark.sql.crossJoin.enabled -> true
spark.driver.extraClassPath -> /usr/lib/hadoop/lib/hadoop-lzo.jar
Classpath elements:
file:/$other-dir/other.jar
file:/app-1.0.jar
启动时添加--verbose参数后,可以输出所有的运行时信息,有助于判断问题。
【原创】大数据基础之Spark(1)Spark Submit即Spark任务提交过程的更多相关文章
- 【原创】大数据基础之Zookeeper(2)源代码解析
核心枚举 public enum ServerState { LOOKING, FOLLOWING, LEADING, OBSERVING; } zookeeper服务器状态:刚启动LOOKING,f ...
- 【原创】大数据基础之Spark(6)Spark Rdd Sort实现原理
spark 2.1.1 spark中可以通过RDD.sortBy来对分布式数据进行排序,具体是如何实现的?来看代码: org.apache.spark.rdd.RDD /** * Return thi ...
- 【原创】大数据基础之Spark(2)Spark on Yarn:container memory allocation容器内存分配
spark 2.1.1 最近spark任务(spark on yarn)有一个报错 Diagnostics: Container [pid=5901,containerID=container_154 ...
- 【原创】大数据基础之Hive(5)hive on spark
hive 2.3.4 on spark 2.4.0 Hive on Spark provides Hive with the ability to utilize Apache Spark as it ...
- 【原创】大数据基础之Spark(9)spark部署方式yarn/mesos
1 下载解压 https://spark.apache.org/downloads.html $ wget http://mirrors.shu.edu.cn/apache/spark/spark-2 ...
- 大数据基础知识问答----spark篇,大数据生态圈
Spark相关知识点 1.Spark基础知识 1.Spark是什么? UCBerkeley AMPlab所开源的类HadoopMapReduce的通用的并行计算框架 dfsSpark基于mapredu ...
- 大数据与可靠性会碰撞出什么样的Spark?
可靠性工程领域的可靠性评估,可靠性仿真计算,健康检测与预管理(PHM)技术,可靠性试验,都需要大规模数据来进行支撑才能产生好的效果,以往这些数据都是不全并且收集困难,而随着互联网+的大数据时代的来临, ...
- 【原创】大数据基础之词频统计Word Count
对文件进行词频统计,是一个大数据领域的hello word级别的应用,来看下实现有多简单: 1 Linux单机处理 egrep -o "\b[[:alpha:]]+\b" test ...
- 【原创】大数据基础之Flink(1)简介、安装、使用
Flink 1.7 官方:https://flink.apache.org/ 一 简介 Apache Flink is an open source platform for distributed ...
随机推荐
- mysql8 安装笔记
环境 redhat6.8 ,官网下载 rpm x64 Bund 安装包 安装 rpm -ivh xxx.rpm 安装一系列的rpm. mysql 会创建 mysql 用户及组./etc/my.cnf ...
- 封装的通过微信JS-SDK实现自定义分享到朋友圈或者朋友的ES6类!
引言: 我们经常在做微信H5的过程中需要自定义分享网页,这个如何实现呢?请看如下的封装的ES6类及使用说明! /** * @jssdk js对象,包括appId,timestamp,nonceStr, ...
- REST命令控制Player
本文用Postman工具演示通过REST控制Cnario Playr 注意:Player的REST通信默认关闭,使用前需要从Setting>>Remote devices打开Use RES ...
- vagrant之常用操作
基本操作: 查看版本: vagrant -v 初始化: vagrant init 启动虚拟机: vagrant up 关闭虚拟机: vagrant halt 重启虚拟机: vagrant reload ...
- static:get()什么意思
在类里面static关键词相当于self关键词
- 数据标记系列——图像分割 & PolygonRNN++(二)
实践 1.export PATH=~/anaconda3/bin:$PATH 2.Anaconda3 中创建新环境 Conda create –name=labelme_polyrnn_pp pyth ...
- 【win7】安装开发环境
1. 通用版主分支合并到v3,并删除data下无用文件或添加data有用文件 2. xampp php7与php5切换 是否可以行? 换phpstudy 默认支持php 32位,而我们要下载支持64的 ...
- SpringBoot返回date日期格式化,解决返回为TIMESTAMP时间戳格式或8小时时间差
问题描述 在Spring Boot项目中,使用@RestController注解,返回的java对象中若含有date类型的属性,则默认输出为TIMESTAMP时间戳格式 ,如下所示: 解决方案 ...
- EntityFramework Core笔记:查询数据(3)
1. 基本查询 1.1 加载全部数据 using System.Linq; using (var context = new LibingContext()) { var roles = contex ...
- 利用zabbix api添加、删除、禁用主机
python环境配置yum -y install python-pip安装argparse模块pip install -i https://pypi.douban.com/simple/ argpar ...