Spark2.1.1

一 Spark Submit本地解析

1.1 现象

提交命令:

spark-submit --master local[10] --driver-memory 30g --class app.package.AppClass app-1.0.jar

进程:

hadoop 225653 0.0 0.0 11256 364 ? S Aug24 0:00 bash /$spark-dir/bin/spark-class org.apache.spark.deploy.SparkSubmit --master local[10] --driver-memory 30g --class app.package.AppClass app-1.0.jar

hadoop 225654 0.0 0.0 34424 2860 ? Sl Aug24 0:00 /$jdk_dir/bin/java -Xmx128m -cp /spark-dir/jars/* org.apache.spark.launcher.Main org.apache.spark.deploy.SparkSubmit --master local[10] --driver-memory 30g --class app.package.AppClass app-1.0.jar

1.2 执行过程

1.2.1 脚本执行

-bash-4.1$ cat bin/spark-submit
#!/usr/bin/env bash

if [ -z "${SPARK_HOME}" ]; then
source "$(dirname "$0")"/find-spark-home
fi

# disable randomized hash for string in Python 3.3+
export PYTHONHASHSEED=0

exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"

注释:这里执行了另一个脚本spark-class,具体如下:

-bash-4.1$ cat bin/spark-class

...

build_command() {
"$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
printf "%d\0" $?
}

CMD=()
while IFS= read -d '' -r ARG; do
CMD+=("$ARG")
done < <(build_command "$@")

...

CMD=("${CMD[@]:0:$LAST}")
exec "${CMD[@]}"

注释:这里执行java class: org.apache.spark.launcher.Main,并传入参数,具体如下:

1.2.2 代码执行

org.apache.spark.launcher.Main
...

        builder = new SparkSubmitCommandBuilder(help);

...

    List<String> cmd = builder.buildCommand(env);

...

      List<String> bashCmd = prepareBashCommand(cmd, env);

      for (String c : bashCmd) {

        System.out.print(c);

        System.out.print('\0');

      }

...

注释:其中会调用SparkSubmitCommandBuilder来生成Spark Submit命令,具体如下:

org.apache.spark.launcher.SparkSubmitCommandBuilder
...

  private List<String> buildSparkSubmitCommand(Map<String, String> env)
...
addOptionString(cmd, System.getenv("SPARK_SUBMIT_OPTS"));
addOptionString(cmd, System.getenv("SPARK_JAVA_OPTS"));
...
String driverExtraJavaOptions = config.get(SparkLauncher.DRIVER_EXTRA_JAVA_OPTIONS);
...
if (isClientMode) {
...
addOptionString(cmd, driverExtraJavaOptions);
...
}
... addPermGenSizeOpt(cmd); cmd.add("org.apache.spark.deploy.SparkSubmit"); cmd.addAll(buildSparkSubmitArgs()); return cmd; ...

注释:这里创建了本地命令,其中java class:org.apache.spark.deploy.SparkSubmit,同时会把各种JavaOptions放到启动命令里(比如SPARK_JAVA_OPTS,DRIVER_EXTRA_JAVA_OPTIONS等),具体如下:

org.apache.spark.deploy.SparkSubmit
  def main(args: Array[String]): Unit = {

    val appArgs = new SparkSubmitArguments(args) //parse command line parameter

    if (appArgs.verbose) {

      // scalastyle:off println

      printStream.println(appArgs)

      // scalastyle:on println

    }

    appArgs.action match {

      case SparkSubmitAction.SUBMIT => submit(appArgs)

      case SparkSubmitAction.KILL => kill(appArgs)

      case SparkSubmitAction.REQUEST_STATUS => requestStatus(appArgs)

    }

  }

    private def submit(args: SparkSubmitArguments): Unit = {

    val (childArgs, childClasspath, sysProps, childMainClass) = prepareSubmitEnvironment(args) //merge all parameters from: command line, properties file, system property, etc...

    def doRunMain(): Unit = {

      ...

        runMain(childArgs, childClasspath, sysProps, childMainClass, args.verbose)

      ...

    }

         ...

  private[deploy] def prepareSubmitEnvironment(args: SparkSubmitArguments)

      : (Seq[String], Seq[String], Map[String, String], String) = {

    if (deployMode == CLIENT || isYarnCluster) {

      childMainClass = args.mainClass

      ...

    if (isYarnCluster) {

      childMainClass = "org.apache.spark.deploy.yarn.Client"

      ...

  private def runMain(

      childArgs: Seq[String],

      childClasspath: Seq[String],

      sysProps: Map[String, String],

      childMainClass: String,

      verbose: Boolean): Unit = {

    // scalastyle:off println

    if (verbose) {

      printStream.println(s"Main class:\n$childMainClass")

      printStream.println(s"Arguments:\n${childArgs.mkString("\n")}")

      printStream.println(s"System properties:\n${sysProps.mkString("\n")}")

      printStream.println(s"Classpath elements:\n${childClasspath.mkString("\n")}")

      printStream.println("\n")

    }

    // scalastyle:on println

    val loader =

      if (sysProps.getOrElse("spark.driver.userClassPathFirst", "false").toBoolean) {

        new ChildFirstURLClassLoader(new Array[URL](0),

          Thread.currentThread.getContextClassLoader)

      } else {

        new MutableURLClassLoader(new Array[URL](0),

          Thread.currentThread.getContextClassLoader)

      }

    Thread.currentThread.setContextClassLoader(loader)

    for (jar <- childClasspath) {

      addJarToClasspath(jar, loader)

    }

    for ((key, value) <- sysProps) {

      System.setProperty(key, value)

    }

    var mainClass: Class[_] = null

    try {

      mainClass = Utils.classForName(childMainClass)

    } catch {

    ...

    val mainMethod = mainClass.getMethod("main", new Array[String](0).getClass)

    ...

      mainMethod.invoke(null, childArgs.toArray)

      ...

注释:这里首先会解析命令行参数,比如mainClass,准备运行环境包括System Property以及classpath等,然后使用一个新的classloader:ChildFirstURLClassLoader来加载用户的mainClass,然后反射调用mainClass的main方法,这样用户的app.package.AppClass的main方法就开始执行了。

org.apache.spark.SparkConf
class SparkConf(loadDefaults: Boolean) extends Cloneable with Logging with Serializable {

  import SparkConf._

  /** Create a SparkConf that loads defaults from system properties and the classpath */

  def this() = this(true)

...

  if (loadDefaults) {

    loadFromSystemProperties(false)

  }

  private[spark] def loadFromSystemProperties(silent: Boolean): SparkConf = {

    // Load any spark.* system properties

    for ((key, value) <- Utils.getSystemProperties if key.startsWith("spark.")) {

      set(key, value, silent)

    }

    this

  }

注释:这里可以看到spark是怎样加载配置的

1.2.3 --verbose

spark-submit --master local[*] --class app.package.AppClass --jars /$other-dir/other.jar  --driver-memory 1g --verbose app-1.0.jar

输出示例:

Main class:
app.package.AppClass
Arguments:

System properties:
spark.executor.logs.rolling.maxSize -> 1073741824
spark.driver.memory -> 1g
spark.driver.extraLibraryPath -> /$hadoop-dir/lib/native
spark.eventLog.enabled -> true
spark.eventLog.compress -> true
spark.executor.logs.rolling.time.interval -> daily
SPARK_SUBMIT -> true
spark.app.name -> app.package.AppClass
spark.driver.extraJavaOptions -> -XX:+PrintGCDetails -XX:+UseG1GC -XX:G1HeapRegionSize=32M -XX:+UseGCOverheadLimit -XX:+ExplicitGCInvokesConcurrent -XX:+HeapDumpOnOutOfMemoryError -XX:-UseCompressedClassPointers -XX:CompressedClassSpaceSize=3G -XX:+PrintGCTimeStamps -Xloggc:/export/Logs/hadoop/g1gc.log
spark.jars -> file:/$other-dir/other.jar
spark.sql.adaptive.enabled -> true
spark.submit.deployMode -> client
spark.executor.logs.rolling.maxRetainedFiles -> 10
spark.executor.extraClassPath -> /usr/lib/hadoop/lib/hadoop-lzo.jar
spark.eventLog.dir -> hdfs://myhdfs/spark/history
spark.master -> local[*]
spark.sql.crossJoin.enabled -> true
spark.driver.extraClassPath -> /usr/lib/hadoop/lib/hadoop-lzo.jar
Classpath elements:
file:/$other-dir/other.jar
file:/app-1.0.jar

启动时添加--verbose参数后,可以输出所有的运行时信息,有助于判断问题。

【原创】大数据基础之Spark(1)Spark Submit即Spark任务提交过程的更多相关文章

  1. 【原创】大数据基础之Zookeeper(2)源代码解析

    核心枚举 public enum ServerState { LOOKING, FOLLOWING, LEADING, OBSERVING; } zookeeper服务器状态:刚启动LOOKING,f ...

  2. 【原创】大数据基础之Spark(6)Spark Rdd Sort实现原理

    spark 2.1.1 spark中可以通过RDD.sortBy来对分布式数据进行排序,具体是如何实现的?来看代码: org.apache.spark.rdd.RDD /** * Return thi ...

  3. 【原创】大数据基础之Spark(2)Spark on Yarn:container memory allocation容器内存分配

    spark 2.1.1 最近spark任务(spark on yarn)有一个报错 Diagnostics: Container [pid=5901,containerID=container_154 ...

  4. 【原创】大数据基础之Hive(5)hive on spark

    hive 2.3.4 on spark 2.4.0 Hive on Spark provides Hive with the ability to utilize Apache Spark as it ...

  5. 【原创】大数据基础之Spark(9)spark部署方式yarn/mesos

    1 下载解压 https://spark.apache.org/downloads.html $ wget http://mirrors.shu.edu.cn/apache/spark/spark-2 ...

  6. 大数据基础知识问答----spark篇,大数据生态圈

    Spark相关知识点 1.Spark基础知识 1.Spark是什么? UCBerkeley AMPlab所开源的类HadoopMapReduce的通用的并行计算框架 dfsSpark基于mapredu ...

  7. 大数据与可靠性会碰撞出什么样的Spark?

    可靠性工程领域的可靠性评估,可靠性仿真计算,健康检测与预管理(PHM)技术,可靠性试验,都需要大规模数据来进行支撑才能产生好的效果,以往这些数据都是不全并且收集困难,而随着互联网+的大数据时代的来临, ...

  8. 【原创】大数据基础之词频统计Word Count

    对文件进行词频统计,是一个大数据领域的hello word级别的应用,来看下实现有多简单: 1 Linux单机处理 egrep -o "\b[[:alpha:]]+\b" test ...

  9. 【原创】大数据基础之Flink(1)简介、安装、使用

    Flink 1.7 官方:https://flink.apache.org/ 一 简介 Apache Flink is an open source platform for distributed ...

随机推荐

  1. WinForm调用钉钉获取考勤结果

    关注点: 1.钉钉AccessToken的获取和防止过期 2.使用TPL并行编程调用钉钉接口 需求详解 公司前台有个大屏,领导想显示全部员工的考勤结果统计情况和车间的实时监控视频,还有车间的看板.简单 ...

  2. PS对街拍女孩照片增加质感

    看到原图时,我的内心是抗拒的,灰蒙蒙毫无质感可言,手机app大概都拍得比这好看(捂脸笑哭). 大概是因为偏背光,光线暧昧不够强烈,且50 1.4这只镜头锐度还欠佳的缘故.所以平时3天修完图的我,这次拖 ...

  3. xcode8 使用Instruments检测定位并解决iOS内存泄露

    https://www.jianshu.com/p/9bc7e65fc247 2017.07.27 17:24* 字数 628 阅读 1319评论 6喜欢 21 简介: 虽然苹果出了ARC(自动内存管 ...

  4. [转帖]Linux中的15个基本‘ls’命令示例

    Linux中的15个基本‘ls’命令示例 https://linux.cn/article-5109-1.html ls -lt 和 ls -ltr 来查看文件新旧顺序. list time rese ...

  5. Python——Flash框架——用户认证

    一.认证扩展 1.Flask-Login:管理已登录用户的用户会话 2.Werkzeug:计算几码散列值并进行核对 3.itsdangerous:生成并核对加密安全令牌 二.Werkzeug gene ...

  6. git生成ssh公钥方法--远程连接github仓库

    先配置全局的用户名和邮箱 $ git config --global user.name "runoob" $ git config --global user.email tes ...

  7. Hive 口袋手册

    2019-04-01 关键字:Hive 学习总结.Hive 基础 . Hive 进阶 .Hive 调优 . Hive 入门手册.Hive PDF 下载 本篇文章系本人就目前所掌握的知识对 Apache ...

  8. 初识 go 语言:方法,接口及并发

    目录 方法,接口及并发 方法 接口 并发 信道 结束语 前言: go语言的第四篇文章,主要讲述go语言中的方法,包括指针,结构体,数组,切片,映射,函数闭包等,每个都提供了示例,可直接运行. 方法,接 ...

  9. postgres 基本操作

    登陆: $ psql -U <user> -d <dbname> 数据库操作: $ \l      //查看库 $  \c <dbname>   //切换库 // ...

  10. Netty 源码分析

    https://segmentfault.com/a/1190000007282628 netty社区-简书闪电侠 :https://netty.io/wiki/related-articles.ht ...