Spark 2.1.1

1. Startup command

The command that starts the Spark Thrift server:

$SPARK_HOME/sbin/start-thriftserver.sh

The script then runs:

org.apache.spark.deploy.SparkSubmit --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

2. Startup process and code analysis

For the Hive Thrift code, see: https://www.cnblogs.com/barneywill/p/10185168.html

HiveThriftServer2 is the core class of Spark Thrift. It extends Hive's HiveServer2:

org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 extends org.apache.hive.service.server.HiveServer2

Startup sequence:

HiveThriftServer2.main

SparkSQLEnv.init (sparkConf sparkSession sparkContext sqlContext)

HiveThriftServer2.init

addService(ThriftBinaryCLIService)

HiveThriftServer2.start

ThriftBinaryCLIService.run

TServer.serve

Class structure [interface or parent class -> subclass]:

TServer->TThreadPoolServer

TProcessorFactory->SQLPlainProcessorFactory

TProcessor->TSetIpAddressProcessor

ThriftCLIService->ThriftBinaryCLIService

CLIService->SparkSQLCLIService (core subclass)

Service initialization sequence:

CLIService.init

SparkSQLCLIService.init

addService(SparkSQLSessionManager)

initCompositeService

SparkSQLSessionManager.init

addService(SparkSQLOperationManager)

initCompositeService

SparkSQLOperationManager.init
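The initialization sequence above follows a composite-service pattern: a parent service registers its children with addService and then initializes them in registration order. Below is a minimal, dependency-free sketch of that pattern. All class names in it (Service, CompositeService, CliService, SessionManager, OperationManager) are simplified, hypothetical stand-ins, not the actual Hive or Spark classes.

```scala
import scala.collection.mutable.ArrayBuffer

object CompositeServiceSketch {
  // Records the order in which services are initialized (illustration only).
  val initLog = ArrayBuffer.empty[String]

  // Hypothetical, simplified stand-in for a service.
  class Service(val name: String) {
    def init(): Unit = initLog += name
  }

  // A composite service initializes itself, then every child it registered.
  class CompositeService(serviceName: String) extends Service(serviceName) {
    private val children = ArrayBuffer.empty[Service]
    protected def addService(s: Service): Unit = children += s
    override def init(): Unit = {
      super.init()
      children.foreach(_.init())
    }
  }

  class OperationManager extends Service("OperationManager")

  class SessionManager extends CompositeService("SessionManager") {
    override def init(): Unit = {
      addService(new OperationManager) // register the child before initializing
      super.init()
    }
  }

  class CliService extends CompositeService("CLIService") {
    override def init(): Unit = {
      addService(new SessionManager)
      super.init()
    }
  }
}
```

Calling `new CompositeServiceSketch.CliService().init()` walks the tree top-down, which mirrors the CLIService -> SessionManager -> OperationManager order listed above.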

3. DDL execution

DDL execution has to interact with the Hive metastore.

Start from the execution plan:

spark-sql> explain create table test_table(id string);
== Physical Plan ==
ExecutedCommand
+- CreateTableCommand CatalogTable(
Table: `test_table`
Created: Wed Dec 19 18:04:15 CST 2018
Last Access: Thu Jan 01 07:59:59 CST 1970
Type: MANAGED
Schema: [StructField(id,StringType,true)]
Provider: hive
Storage(InputFormat: org.apache.hadoop.mapred.TextInputFormat, OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat)), false
Time taken: 0.28 seconds, Fetched 1 row(s)

The execution plan shows the concrete Command involved, here CreateTableCommand:

org.apache.spark.sql.execution.command.tables

case class CreateTableCommand(table: CatalogTable, ifNotExists: Boolean) extends RunnableCommand {
  override def run(sparkSession: SparkSession): Seq[Row] = {
    sparkSession.sessionState.catalog.createTable(table, ifNotExists)
    Seq.empty[Row]
  }
}

As shown here, the command simply delegates the request to sparkSession.sessionState.catalog:
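The delegation can be illustrated with a small, dependency-free sketch: the command only carries its arguments and hands the real work to a catalog object. The names Catalog, RunnableCommand (redefined here in simplified form) and CreateTable are hypothetical stand-ins for this illustration, not Spark's classes.

```scala
import scala.collection.mutable

object RunnableCommandSketch {
  // Hypothetical stand-in for the session catalog: it only tracks table names.
  class Catalog {
    private val tables = mutable.Set.empty[String]

    def createTable(name: String, ifNotExists: Boolean): Unit = {
      if (tables.contains(name)) {
        if (!ifNotExists) throw new IllegalStateException(s"Table $name already exists")
      } else {
        tables += name
      }
    }

    def tableExists(name: String): Boolean = tables.contains(name)
  }

  // A command carries its arguments as data and does its work in run().
  trait RunnableCommand {
    def run(catalog: Catalog): Seq[String]
  }

  // No logic of its own: just a call into the catalog, then an empty result.
  case class CreateTable(name: String, ifNotExists: Boolean) extends RunnableCommand {
    override def run(catalog: Catalog): Seq[String] = {
      catalog.createTable(name, ifNotExists)
      Seq.empty
    }
  }
}
```

Because the command holds no state beyond its arguments, all of the interesting behavior (and all of the locking discussed later) lives in whatever catalog it is given.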

org.apache.spark.sql.internal.SessionState

  /**
   * Internal catalog for managing table and database states.
   */
  lazy val catalog = new SessionCatalog(
    sparkSession.sharedState.externalCatalog,
    sparkSession.sharedState.globalTempViewManager,
    functionResourceLoader,
    functionRegistry,
    conf,
    newHadoopConf())

The session catalog is built on top of sparkSession.sharedState.externalCatalog:

org.apache.spark.sql.internal.SharedState

  /**
   * A catalog that interacts with external systems.
   */
  val externalCatalog: ExternalCatalog =
    SharedState.reflect[ExternalCatalog, SparkConf, Configuration](
      SharedState.externalCatalogClassName(sparkContext.conf),
      sparkContext.conf,
      sparkContext.hadoopConfiguration)

...

  private val HIVE_EXTERNAL_CATALOG_CLASS_NAME = "org.apache.spark.sql.hive.HiveExternalCatalog"

  private def externalCatalogClassName(conf: SparkConf): String = {
    conf.get(CATALOG_IMPLEMENTATION) match {
      case "hive" => HIVE_EXTERNAL_CATALOG_CLASS_NAME
      case "in-memory" => classOf[InMemoryCatalog].getCanonicalName
    }
  }

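As shown here, the external catalog is instantiated by reflection from the class name returned by externalCatalogClassName. When the catalog implementation is "hive", that name is hard-coded as org.apache.spark.sql.hive.HiveExternalCatalog: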
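A minimal sketch of the same idea, choosing a class name from configuration and instantiating it reflectively, is shown below. To keep it dependency-free, two JDK list classes stand in for the Hive-backed and in-memory catalogs. The key name `catalogImplementation`, the fallback to "in-memory", and the helper names are assumptions of this sketch, not Spark's API.

```scala
object CatalogReflectionSketch {
  // Hypothetical setting name used only by this sketch.
  val ImplKey = "catalogImplementation"

  // Map the configured implementation to a fully qualified class name.
  // The JDK list classes are placeholders for the two catalog classes.
  def className(impl: String): String = impl match {
    case "hive"      => "java.util.LinkedList"
    case "in-memory" => "java.util.ArrayList"
  }

  // Load the class by name and call its no-argument constructor reflectively.
  def newInstance(conf: Map[String, String]): java.util.List[String] = {
    val impl = conf.getOrElse(ImplKey, "in-memory")
    Class.forName(className(impl))
      .getDeclaredConstructor()
      .newInstance()
      .asInstanceOf[java.util.List[String]]
  }
}
```

The caller only sees the common interface. Which concrete class backs it is decided by a string, which is why the Hive catalog can live in a separate module that the core code never references at compile time.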

org.apache.spark.sql.hive.HiveExternalCatalog

  /**
   * A Hive client used to interact with the metastore.
   */
  val client: HiveClient = {
    HiveUtils.newClientForMetadata(conf, hadoopConf)
  }

  private def withClient[T](body: => T): T = synchronized {
    try {
      body
    } catch {
      case NonFatal(exception) if isClientException(exception) =>
        val e = exception match {
          // Since we are using shim, the exceptions thrown by the underlying method of
          // Method.invoke() are wrapped by InvocationTargetException
          case i: InvocationTargetException => i.getCause
          case o => o
        }
        throw new AnalysisException(
          e.getClass.getCanonicalName + ": " + e.getMessage, cause = Some(e))
    }
  }

  override def createDatabase(
      dbDefinition: CatalogDatabase,
      ignoreIfExists: Boolean): Unit = withClient {
    client.createDatabase(dbDefinition, ignoreIfExists)
  }

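Every DDL method in this class runs inside withClient, and withClient is synchronized, so DDL calls on the same catalog instance are serialized. Each method simply delegates the request to client. Next, let's see what client is: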
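The effect of wrapping every method in a synchronized by-name block can be sketched without Spark. In the sketch below, CatalogException is a hypothetical stand-in for AnalysisException, and the Catalog class with its call counter exists only for illustration.

```scala
import java.lang.reflect.InvocationTargetException
import scala.util.control.NonFatal

object WithClientSketch {
  // Hypothetical stand-in for Spark's AnalysisException.
  class CatalogException(msg: String, cause: Throwable) extends Exception(msg, cause)

  class Catalog {
    private var calls = 0

    // Run `body` while holding this object's lock, so calls are serialized,
    // and rethrow non-fatal failures as CatalogException.
    private def withClient[T](body: => T): T = synchronized {
      try body
      catch {
        case NonFatal(exception) =>
          val e = exception match {
            // Unwrap reflective calls so the real cause is reported.
            case i: InvocationTargetException if i.getCause != null => i.getCause
            case o => o
          }
          throw new CatalogException(e.getClass.getName + ": " + e.getMessage, e)
      }
    }

    // Returns the number of successful calls so far.
    def createDatabase(name: String): Int = withClient {
      if (name.isEmpty) throw new IllegalArgumentException("empty name")
      calls += 1
      calls
    }
  }
}
```

Since the unsynchronized counter is only touched inside withClient, concurrent callers never lose an update. The flip side is that only one DDL call per catalog instance makes progress at a time.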

org.apache.spark.sql.hive.client.IsolatedClientLoader

  /** The isolated client interface to Hive. */
  private[hive] def createClient(): HiveClient = {
    if (!isolationOn) {
      return new HiveClientImpl(version, sparkConf, hadoopConf, config, baseClassLoader, this)
    }
    // Pre-reflective instantiation setup.
    logDebug("Initializing the logger to avoid disaster...")
    val origLoader = Thread.currentThread().getContextClassLoader
    Thread.currentThread.setContextClassLoader(classLoader)

    try {
      classLoader
        .loadClass(classOf[HiveClientImpl].getName)
        .getConstructors.head
        .newInstance(version, sparkConf, hadoopConf, config, classLoader, this)
        .asInstanceOf[HiveClient]
    } catch {

So client is an instance of org.apache.spark.sql.hive.client.HiveClientImpl:

org.apache.spark.sql.hive.client.HiveClientImpl

  def withHiveState[A](f: => A): A = retryLocked {
    val original = Thread.currentThread().getContextClassLoader
    // Set the thread local metastore client to the client associated with this HiveClientImpl.
    Hive.set(client)
    // The classloader in clientLoader could be changed after addJar, always use the latest
    // classloader
    state.getConf.setClassLoader(clientLoader.classLoader)
    // setCurrentSessionState will use the classLoader associated
    // with the HiveConf in `state` to override the context class loader of the current
    // thread.
    shim.setCurrentSessionState(state)
    val ret = try f finally {
      Thread.currentThread().setContextClassLoader(original)
      HiveCatalogMetrics.incrementHiveClientCalls(1)
    }
    ret
  }

  private def retryLocked[A](f: => A): A = clientLoader.synchronized {

...

  override def createDatabase(
      database: CatalogDatabase,
      ignoreIfExists: Boolean): Unit = withHiveState {
    client.createDatabase(
      new HiveDatabase(
        database.name,
        database.description,
        database.locationUri,
        Option(database.properties).map(_.asJava).orNull),
        ignoreIfExists)
  }

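Every DDL method in this class runs inside withHiveState, which calls retryLocked, and retryLocked is synchronized on clientLoader. Here too the request is delegated straight to client, which at this level is Hive's own class org.apache.hadoop.hive.ql.metadata.Hive.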
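The save, swap, and restore handling of the thread's context class loader in withHiveState can be sketched in isolation. The helper name withLoader and the private lock object are assumptions of this sketch. Unlike the real method, which lets the shim switch the loader, the sketch sets it directly.

```scala
object WithLoaderSketch {
  // Shared lock, playing the role of clientLoader.synchronized.
  private val lock = new Object

  // Run `f` with `loader` as the thread's context class loader while holding
  // the lock, and always restore the previous loader afterwards.
  def withLoader[A](loader: ClassLoader)(f: => A): A = lock.synchronized {
    val original = Thread.currentThread().getContextClassLoader
    Thread.currentThread().setContextClassLoader(loader)
    try f
    finally Thread.currentThread().setContextClassLoader(original)
  }
}
```

The `finally` clause is what keeps a failing metastore call from leaking the isolated class loader into unrelated code that later runs on the same thread.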

4. DML execution

DML execution eventually ends up in spark.sql.

SQL execution sequence:

CLIService.executeStatement (returns an OperationHandle)

SessionManager.getSession

SessionManager.openSession

SparkSQLSessionManager.openSession

SparkSQLOperationManager.sessionToContexts.set (on openSession: maps the session to its sqlContext)

HiveSession.executeStatement

HiveSessionImpl.executeStatementInternal

OperationManager.newExecuteStatementOperation

SparkSQLOperationManager.newExecuteStatementOperation

SparkSQLOperationManager.sessionToContexts.get (looks up the sqlContext by session)

ExecuteStatementOperation.run

SparkExecuteStatementOperation.run

SparkExecuteStatementOperation.execute

SQLContext.sql (the familiar Spark SQL)

So, starting from the initialization of SparkSQLCLIService, each class's implementation is replaced one by one with a Spark subclass, for example:

org.apache.spark.sql.hive.thriftserver.SparkSQLSessionManager extends org.apache.hive.service.cli.session.SessionManager
org.apache.spark.sql.hive.thriftserver.server.SparkSQLOperationManager extends org.apache.hive.service.cli.operation.OperationManager
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation extends org.apache.hive.service.cli.operation.ExecuteStatementOperation

In this way the underlying implementation is swapped out.
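A compact sketch of this substitution, again with hypothetical, simplified stand-ins rather than the real Hive and Spark classes: the subclass overrides the factory method, looks up the per-session context in a map, and returns an operation that runs on a different engine while callers keep using the base type.

```scala
import java.util.concurrent.ConcurrentHashMap

object OperationOverrideSketch {
  // Hypothetical, simplified stand-ins for the Hive base classes.
  class Operation(val statement: String) {
    def run(): String = s"hive runs: $statement"
  }

  class OperationManager {
    def newExecuteStatementOperation(session: String, statement: String): Operation =
      new Operation(statement)
  }

  // Stand-in for a per-session SQLContext.
  case class Context(name: String) {
    def sql(statement: String): String = s"$name runs: $statement"
  }

  // Same interface as Operation, but execution goes through the context.
  class SparkOperation(ctx: Context, sqlText: String) extends Operation(sqlText) {
    override def run(): String = ctx.sql(sqlText)
  }

  // Spark-style subclass: overrides the factory method only.
  class SparkOperationManager extends OperationManager {
    // Filled when a session is opened, read when a statement is executed.
    val sessionToContexts = new ConcurrentHashMap[String, Context]()

    override def newExecuteStatementOperation(session: String, statement: String): Operation = {
      val ctx = sessionToContexts.get(session)
      require(ctx != null, s"no context registered for session $session")
      new SparkOperation(ctx, statement)
    }
  }
}
```

Code that holds only an `OperationManager` reference never needs to know which engine it is talking to, which is the property that lets the Thrift front end stay unchanged.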

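As for why Hive's HiveServer2 can be extended so easily, see sql/hive-thriftserver in the Spark source. It appears that the Hive 1.2 code was heavily modified there, which will make future upgrades harder.
As for why Spark put so much effort into extending HiveServer2 instead of reimplementing it, the likely reason is interface compatibility: users of Hive Thrift can migrate smoothly to Spark Thrift because the only change is the connection URL. In practice, though, Spark Thrift and Hive Thrift still behave differently in many ways for the same SQL.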
