Spark Source Code Analysis: spark-submit and spark-class

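With the earlier spark-shell walkthrough behind us, these two scripts are much easier to read. The previous post analyzing spark-shell is a useful reference.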

Spark-submit

if [ -z "${SPARK_HOME}" ]; then
  export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
fi

# disable randomized hash for string in Python 3.3+
export PYTHONHASHSEED=0

exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"

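As in spark-shell, the script first checks whether ${SPARK_HOME} is set. It then launches spark-class with org.apache.spark.deploy.SparkSubmit as the first argument, followed by every argument that spark-submit itself received ("$@"), for example the ones spark-shell passed along.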
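To make the forwarding concrete, here is a minimal sketch. The functions fake_spark_class and fake_spark_submit are hypothetical stand-ins for the real scripts (they are not part of Spark); they only show how "$@" keeps each argument intact, including one that contains a space.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for spark-class: prints each received argument on its own line.
fake_spark_class() {
  printf '%s\n' "$@"
}

# Hypothetical stand-in for spark-submit: prepends the main class name
# and forwards its own arguments unchanged via "$@".
fake_spark_submit() {
  fake_spark_class org.apache.spark.deploy.SparkSubmit "$@"
}

fake_spark_submit --master local --name "my app" app.jar
```

The quoted "my app" arrives as a single argument. With an unquoted $@ it would be split into two words.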

Spark-class

if [ -z "${SPARK_HOME}" ]; then
  export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
fi

. "${SPARK_HOME}"/bin/load-spark-env.sh

# Find the java binary
if [ -n "${JAVA_HOME}" ]; then
  RUNNER="${JAVA_HOME}/bin/java"
else
  if [ `command -v java` ]; then
    RUNNER="java"
  else
    echo "JAVA_HOME is not set" >&2
    exit 1
  fi
fi

# Find assembly jar
SPARK_ASSEMBLY_JAR=
if [ -f "${SPARK_HOME}/RELEASE" ]; then
  ASSEMBLY_DIR="${SPARK_HOME}/lib"
else
  ASSEMBLY_DIR="${SPARK_HOME}/assembly/target/scala-$SPARK_SCALA_VERSION"
fi

GREP_OPTIONS=
num_jars="$(ls -1 "$ASSEMBLY_DIR" | grep "^spark-assembly.*hadoop.*\.jar$" | wc -l)"
if [ "$num_jars" -eq "0" -a -z "$SPARK_ASSEMBLY_JAR" -a "$SPARK_PREPEND_CLASSES" != "1" ]; then
  echo "Failed to find Spark assembly in $ASSEMBLY_DIR." 1>&2
  echo "You need to build Spark before running this program." 1>&2
  exit 1
fi
if [ -d "$ASSEMBLY_DIR" ]; then
  ASSEMBLY_JARS="$(ls -1 "$ASSEMBLY_DIR" | grep "^spark-assembly.*hadoop.*\.jar$" || true)"
  if [ "$num_jars" -gt "1" ]; then
    echo "Found multiple Spark assembly jars in $ASSEMBLY_DIR:" 1>&2
    echo "$ASSEMBLY_JARS" 1>&2
    echo "Please remove all but one jar." 1>&2
    exit 1
  fi
fi

SPARK_ASSEMBLY_JAR="${ASSEMBLY_DIR}/${ASSEMBLY_JARS}"

LAUNCH_CLASSPATH="$SPARK_ASSEMBLY_JAR"

# Add the launcher build dir to the classpath if requested.
if [ -n "$SPARK_PREPEND_CLASSES" ]; then
  LAUNCH_CLASSPATH="${SPARK_HOME}/launcher/target/scala-$SPARK_SCALA_VERSION/classes:$LAUNCH_CLASSPATH"
fi

export _SPARK_ASSEMBLY="$SPARK_ASSEMBLY_JAR"

# For tests
if [[ -n "$SPARK_TESTING" ]]; then
  unset YARN_CONF_DIR
  unset HADOOP_CONF_DIR
fi

# The launcher library will print arguments separated by a NULL character, to allow arguments with
# characters that would be otherwise interpreted by the shell. Read that in a while loop, populating
# an array that will be used to exec the final command.
CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@")
exec "${CMD[@]}"

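This script is the one that does the real work, so let's look closely at where the true entry point is.

First, as before, it sets the project home directory: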

if [ -z "${SPARK_HOME}" ]; then
  export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
fi

Next, it loads some environment variables:

. "${SPARK_HOME}"/bin/load-spark-env.sh

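load-spark-env.sh sets up the assembly-related information, such as the SPARK_SCALA_VERSION value used later to locate the assembly directory.

The script then looks for java and assigns it to the RUNNER variable: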

# Find the java binary
if [ -n "${JAVA_HOME}" ]; then
  RUNNER="${JAVA_HOME}/bin/java"
else
  if [ `command -v java` ]; then
    RUNNER="java"
  else
    echo "JAVA_HOME is not set" >&2
    exit 1
  fi
fi

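The middle of the script is a large chunk of assembly-related logic: it locates the assembly directory, fails if no spark-assembly jar is found, and fails if more than one is found.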
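As a rough sketch of the counting step in that chunk, the snippet below applies the same file-name pattern to a temporary directory. The helper count_assembly_jars and the jar file names are made up for illustration; the real script inlines this logic with ls, grep and wc.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count files in a directory whose names look like a Spark assembly jar.
count_assembly_jars() {
  local dir="$1"
  # grep -c prints the number of matching lines; "|| true" keeps a zero count
  # (grep exit status 1) from being treated as a failure.
  ls -1 "$dir" 2>/dev/null | grep -c '^spark-assembly.*hadoop.*\.jar$' || true
}

demo_dir="$(mktemp -d)"
# Illustrative file names only; one matches the pattern, one does not.
touch "$demo_dir/spark-assembly-1.6.0-hadoop2.6.0.jar" "$demo_dir/other.jar"
echo "matching jars: $(count_assembly_jars "$demo_dir")"
rm -rf "$demo_dir"
```

A count of 0 corresponds to the "Failed to find Spark assembly" branch, and a count above 1 to the "Found multiple Spark assembly jars" branch.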

The most important part is the following:

CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@")
exec "${CMD[@]}"

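The while loop reads NUL-separated ARG values and appends each one to CMD. Its input comes from the process substitution, which runs "$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@". org.apache.spark.launcher.Main is therefore the first Spark class that actually gets executed.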

That class lives in the launcher module. A quick look at its code:

public static void main(String[] argsArray) throws Exception {
    ...
    List<String> args = new ArrayList<String>(Arrays.asList(argsArray));
    String className = args.remove(0);
    ...
    // Create the command builder
    AbstractCommandBuilder builder;
    if (className.equals("org.apache.spark.deploy.SparkSubmit")) {
      try {
        builder = new SparkSubmitCommandBuilder(args);
      } catch (IllegalArgumentException e) {
        ...
      }
    } else {
      builder = new SparkClassCommandBuilder(className, args);
    }
    List<String> cmd = builder.buildCommand(env); // The builder parses the arguments and builds the command
    ...
    // Print the resulting command arguments
    if (isWindows()) {
      System.out.println(prepareWindowsCommand(cmd, env));
    } else {
      List<String> bashCmd = prepareBashCommand(cmd, env);
      for (String c : bashCmd) {
        System.out.print(c);
        System.out.print('\0');
      }
    }
}

The output printed by launcher.Main is stored in CMD.

The script then executes the command:

exec "${CMD[@]}"

This is where the requested Spark class actually starts running.
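The NUL-separated hand-off between launcher.Main and the script can be tried in isolation. In this sketch, emit_nul_separated is a hypothetical stand-in for launcher.Main, and the final command is run directly rather than through exec so the surrounding shell keeps going.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for launcher.Main: prints each argument followed by a NUL byte.
emit_nul_separated() {
  printf '%s\0' "$@"
}

CMD=()
# IFS= and -r keep whitespace and backslashes intact; -d '' makes read stop at NUL.
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <(emit_nul_separated echo "hello  world" 'a*b')

echo "argument count: ${#CMD[@]}"
# Run the collected command; spark-class uses exec "${CMD[@]}" at this point.
"${CMD[@]}"
```

Because NUL cannot appear inside an argument, spaces, quotes and glob characters all survive the trip into the array unchanged.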

Finally, a few words about the exec command. It is easiest to understand alongside a couple of related commands:

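source runs the target script inside the current shell. In other words, the current shell takes over the target script's work and executes it itself; no new process is created for the script.
exec does not create a new process: it replaces the program running in the current process with the target command, so the process ID stays the same. The rest of the original script therefore never runs, because the shell that was interpreting it has been replaced. (Copy-on-write belongs to fork, which does create a new process and copies memory pages only when they are written; exec involves no such copying.)
sh starts a new shell to run the script, which means a new child process is created.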

Here is a simple example with four scripts:

xingoo-test-1.sh

exec -c sh /home/xinghl/test/xingoo-test-2.sh

xingoo-test-2.sh

while true
do
        echo "a2"
        sleep 3
done

xingoo-test-3.sh

sh /home/xinghl/test/xingoo-test-2.sh

xingoo-test-4.sh

source /home/xinghl/test/xingoo-test-2.sh

Running xingoo-test-1.sh and running xingoo-test-4.sh have the same effect: in both cases only one shell process is running.

Running xingoo-test-3.sh produces two shell processes.
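The same difference can be observed without long-running scripts by printing process IDs. This is a small sketch that assumes bash is on the PATH; the variable names exec_pids and child_pids are only for illustration.

```shell
#!/usr/bin/env bash
# exec: the inner bash replaces the outer one, so both lines show the same PID.
exec_pids="$(bash -c 'echo $$; exec bash -c "echo \$\$"')"

# Plain child: the outer bash still has to run the trailing ":" afterwards,
# so the inner bash is started as a separate process with its own PID.
child_pids="$(bash -c 'echo $$; bash -c "echo \$\$"; :')"

# Unquoted on purpose: joins the two captured lines into one for display.
echo "exec : $(echo $exec_pids)"
echo "child: $(echo $child_pids)"
```

The "exec" line prints the same number twice, while the "child" line prints two different numbers.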

References

What is the difference between source, sh, bash, and ./ on Linux (linux里source、sh、bash、./有什么区别)