When Spark SQL executes insert overwrite table, the number of files written to the new table or new partition may be 200, or it may be any other number. Why the difference?

First, a look at the flow Spark SQL follows when executing insert overwrite table (a minimal example statement is sketched after the list):

  • 1 Create a temporary directory, for example

    • .hive-staging_hive_2018-06-23_00-39-39_825_3122897139441535352-2312/-ext-10000
  • 2 Write the data into the temporary directory;
  • 3 Call loadTable or loadPartition to move the data from the temporary directory to the final directory;
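
For reference, a minimal statement that goes through this flow might look like the sketch below (database, table and partition names are hypothetical):

  // hypothetical tables; the data first lands under .hive-staging_*/-ext-10000, then is moved into the partition
  spark.sql("""
    INSERT OVERWRITE TABLE db.target_table PARTITION (dt = '2018-06-23')
    SELECT id, name FROM db.source_table WHERE dt = '2018-06-23'
  """)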

The corresponding code is:

org.apache.spark.sql.hive.execution.InsertIntoHiveTable

  case class InsertIntoHiveTable(
      table: MetastoreRelation,
      partition: Map[String, Option[String]],
      child: SparkPlan,
      overwrite: Boolean,
      ifNotExists: Boolean) extends UnaryExecNode {
    ...
    protected[sql] lazy val sideEffectResult: Seq[InternalRow] = {
      ...
      val tmpLocation = getExternalTmpPath(tableLocation, hadoopConf)
      val fileSinkConf = new FileSinkDesc(tmpLocation.toString, tableDesc, false)
      ...
      @transient val outputClass = writerContainer.newSerializer(table.tableDesc).getSerializedClass
      saveAsHiveFile(child.execute(), outputClass, fileSinkConf, jobConfSer, writerContainer)
      ...

    private def saveAsHiveFile(
        rdd: RDD[InternalRow],
        valueClass: Class[_],
        fileSinkConf: FileSinkDesc,
        conf: SerializableJobConf,
        writerContainer: SparkHiveWriterContainer): Unit = {
      assert(valueClass != null, "Output value class not set")
      conf.value.setOutputValueClass(valueClass)

      val outputFileFormatClassName = fileSinkConf.getTableInfo.getOutputFileFormatClassName
      assert(outputFileFormatClassName != null, "Output format class not set")
      conf.value.set("mapred.output.format.class", outputFileFormatClassName)

      FileOutputFormat.setOutputPath(
        conf.value,
        SparkHiveWriterContainer.createPathFromString(fileSinkConf.getDirName(), conf.value))

      log.debug("Saving as hadoop file of type " + valueClass.getSimpleName)
      writerContainer.driverSideSetup()
      sqlContext.sparkContext.runJob(rdd, writerContainer.writeToFile _)
      writerContainer.commitJob()
    }

First, step 1, creating the temporary directory, i.e. getExternalTmpPath:

  val stagingDir = hadoopConf.get("hive.exec.stagingdir", ".hive-staging")

  def getExternalTmpPath(path: Path, hadoopConf: Configuration): Path = {
    val extURI: URI = path.toUri
    if (extURI.getScheme == "viewfs") {
      getExtTmpPathRelTo(path.getParent, hadoopConf)
    } else {
      new Path(getExternalScratchDir(extURI, hadoopConf), "-ext-10000")
    }
  }

  private def getExternalScratchDir(extURI: URI, hadoopConf: Configuration): Path = {
    getStagingDir(new Path(extURI.getScheme, extURI.getAuthority, extURI.getPath), hadoopConf)
  }

  private def getStagingDir(inputPath: Path, hadoopConf: Configuration): Path = {
    val inputPathUri: URI = inputPath.toUri
    val inputPathName: String = inputPathUri.getPath
    val fs: FileSystem = inputPath.getFileSystem(hadoopConf)
    val stagingPathName: String =
      if (inputPathName.indexOf(stagingDir) == -1) {
        new Path(inputPathName, stagingDir).toString
      } else {
        inputPathName.substring(0, inputPathName.indexOf(stagingDir) + stagingDir.length)
      }
    val dir: Path =
      fs.makeQualified(
        new Path(stagingPathName + "_" + executionId + "-" + TaskRunner.getTaskRunnerID))
    logDebug("Created staging dir = " + dir + " for path = " + inputPath)
    try {
      if (!FileUtils.mkdir(fs, dir, true, hadoopConf)) {
        throw new IllegalStateException("Cannot create staging directory '" + dir.toString + "'")
      }
      fs.deleteOnExit(dir)
    } catch {
      case e: IOException =>
        throw new RuntimeException(
          "Cannot create staging directory '" + dir.toString + "': " + e.getMessage, e)
    }
    return dir
  }

  private def executionId: String = {
    val rand: Random = new Random
    val format = new SimpleDateFormat("yyyy-MM-dd_HH-mm-ss_SSS", Locale.US)
    "hive_" + format.format(new Date) + "_" + Math.abs(rand.nextLong)
  }

The temporary directory name is composed of 【.hive-staging (from config hive.exec.stagingdir)】_【hive (hard-coded)】_【2018-06-23_00-39-39_825 (timestamp, yyyy-MM-dd_HH-mm-ss_SSS)】_【3122897139441535352 (random number)】-【2312 (task runner id)】/-ext-10000 (hard-coded).
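
As a side note, the prefix of this directory can be changed through hive.exec.stagingdir. A hedged sketch (assumption: spark.hadoop.* entries are copied into the Hadoop Configuration that getExternalTmpPath reads):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .enableHiveSupport()
    .config("spark.hadoop.hive.exec.stagingdir", ".my-staging")  // default is ".hive-staging"
    .getOrCreate()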

Next, the file-writing step, i.e.

sqlContext.sparkContext.runJob(rdd, writerContainer.writeToFile _)

org.apache.spark.SparkContext

  /**
   * Run a job on all partitions in an RDD and return the results in an array.
   */
  def runJob[T, U: ClassTag](rdd: RDD[T], func: (TaskContext, Iterator[T]) => U): Array[U] = {
    runJob(rdd, func, 0 until rdd.partitions.length)
  }

So the write is executed partition by partition of the RDD: the RDD produces as many output files as it has partitions. The RDD comes from child.execute, i.e. SparkPlan.execute. Next, SparkPlan:
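
A small sketch (not from the Spark source; `spark` is an existing SparkSession) illustrating that runJob invokes the function once per RDD partition, which is why one output file is produced per partition:

  // 4 partitions -> the function runs 4 times -> (in saveAsHiveFile) 4 output files
  val rdd = spark.sparkContext.parallelize(1 to 100, numSlices = 4)
  val sizes = spark.sparkContext.runJob(rdd, (iter: Iterator[Int]) => iter.size)
  assert(sizes.length == rdd.partitions.length)  // one result per partition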

org.apache.spark.sql.execution.SparkPlan

  final def execute(): RDD[InternalRow] = executeQuery {
    doExecute()
  }

  protected def doExecute(): RDD[InternalRow]

doExecute is an abstract method; each step of the execution plan corresponds to a SparkPlan subclass, e.g. Project corresponds to ProjectExec and SortMergeJoin corresponds to SortMergeJoinExec.
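
One hedged way to see which physical operators (SparkPlan subclasses) a query maps to is to print its plan; the table name below is hypothetical:

  val df = spark.sql("SELECT dept, count(*) AS cnt FROM db.source_table GROUP BY dept")
  df.explain()                         // prints the physical plan; each operator is a SparkPlan subclass
  // programmatic access: df.queryExecution.executedPlan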

The SparkPlan is generated by SparkPlanner; next, SparkPlanner:

org.apache.spark.sql.execution.SparkPlanner

  def numPartitions: Int = conf.numShufflePartitions

This directly reads SQLConf.numShufflePartitions; next, SQLConf:

org.apache.spark.sql.internal.SQLConf

  val SHUFFLE_PARTITIONS = SQLConfigBuilder("spark.sql.shuffle.partitions")
    .doc("The default number of partitions to use when shuffling data for joins or aggregations.")
    .intConf
    .createWithDefault(200)

  def numShufflePartitions: Int = getConf(SHUFFLE_PARTITIONS)

This reads the configuration spark.sql.shuffle.partitions, which defaults to 200. So how is this partition count used? Next, BasicOperators:
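
A hedged sketch of changing this default for a session (50 is just an example value):

  // takes effect for queries planned after this point
  spark.conf.set("spark.sql.shuffle.partitions", "50")
  // equivalent in spark-sql / a SQL script:
  // SET spark.sql.shuffle.partitions=50;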

org.apache.spark.sql.execution.SparkStrategies.BasicOperators

    def numPartitions: Int = self.numPartitions

    def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
      ...
      case logical.RepartitionByExpression(expressions, child, nPartitions) =>
        exchange.ShuffleExchange(HashPartitioning(
          expressions, nPartitions.getOrElse(numPartitions)), planLater(child)) :: Nil

So the shuffle step creates its HashPartitioning from numPartitions. If the SQL requires a shuffle (e.g. it contains a join or group by), then by default 200 files are written; if there is no shuffle, the number of files is determined by the upstream operators such as HiveTableScan and Filter, i.e. by the number of partitions they produce.

This can also be seen from the execution plan: when there is a shuffle, the plan usually contains a step like:

:  +- Exchange(coordinator id: 371605426) hashpartitioning(id#60, 200), coordinator[target post-shuffle partition size: 67108864]

The 200 in hashpartitioning(id#60, 200) is the default value of spark.sql.shuffle.partitions.
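
In practice, when 200 post-shuffle files are too many (or too few), two hedged ways to control the count before the write (table names are hypothetical):

  // 1) explicitly repartition (or coalesce) the DataFrame before inserting:
  spark.sql("SELECT id, name FROM db.source_table")
    .repartition(10)                                   // 10 partitions -> 10 files
    .write.mode("overwrite").insertInto("db.target_table")

  // 2) or lower spark.sql.shuffle.partitions for the whole query, as shown earlier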

Appendix: the ShuffleExchange code:

org.apache.spark.sql.execution.exchange.ShuffleExchange

  def apply(newPartitioning: Partitioning, child: SparkPlan): ShuffleExchange = {
    ShuffleExchange(newPartitioning, child, coordinator = Option.empty[ExchangeCoordinator])
  }

  protected override def doExecute(): RDD[InternalRow] = attachTree(this, "execute") {
    // Returns the same ShuffleRowRDD if this plan is used by multiple plans.
    if (cachedShuffleRDD == null) {
      cachedShuffleRDD = coordinator match {
        case Some(exchangeCoordinator) =>
          val shuffleRDD = exchangeCoordinator.postShuffleRDD(this)
          assert(shuffleRDD.partitions.length == newPartitioning.numPartitions)
          shuffleRDD
        case None =>
          val shuffleDependency = prepareShuffleDependency()
          preparePostShuffleRDD(shuffleDependency)
      }
    }
    cachedShuffleRDD
  }

  /**
   * Returns a [[ShuffleDependency]] that will partition rows of its child based on
   * the partitioning scheme defined in `newPartitioning`. Those partitions of
   * the returned ShuffleDependency will be the input of shuffle.
   */
  private[exchange] def prepareShuffleDependency()
    : ShuffleDependency[Int, InternalRow, InternalRow] = {
    ShuffleExchange.prepareShuffleDependency(
      child.execute(), child.output, newPartitioning, serializer)
  }

  /**
   * Returns a [[ShuffledRowRDD]] that represents the post-shuffle dataset.
   * This [[ShuffledRowRDD]] is created based on a given [[ShuffleDependency]] and an optional
   * partition start indices array. If this optional array is defined, the returned
   * [[ShuffledRowRDD]] will fetch pre-shuffle partitions based on indices of this array.
   */
  private[exchange] def preparePostShuffleRDD(
      shuffleDependency: ShuffleDependency[Int, InternalRow, InternalRow],
      specifiedPartitionStartIndices: Option[Array[Int]] = None): ShuffledRowRDD = {
    // If an array of partition start indices is provided, we need to use this array
    // to create the ShuffledRowRDD. Also, we need to update newPartitioning to
    // update the number of post-shuffle partitions.
    specifiedPartitionStartIndices.foreach { indices =>
      assert(newPartitioning.isInstanceOf[HashPartitioning])
      newPartitioning = UnknownPartitioning(indices.length)
    }
    new ShuffledRowRDD(shuffleDependency, specifiedPartitionStartIndices)
  }

  /**
   * Returns a [[ShuffleDependency]] that will partition rows of its child based on
   * the partitioning scheme defined in `newPartitioning`. Those partitions of
   * the returned ShuffleDependency will be the input of shuffle.
   */
  def prepareShuffleDependency(
      rdd: RDD[InternalRow],
      outputAttributes: Seq[Attribute],
      newPartitioning: Partitioning,
      serializer: Serializer): ShuffleDependency[Int, InternalRow, InternalRow] = {
    val part: Partitioner = newPartitioning match {
      case RoundRobinPartitioning(numPartitions) => new HashPartitioner(numPartitions)
      case HashPartitioning(_, n) =>
        new Partitioner {
          override def numPartitions: Int = n
          // For HashPartitioning, the partitioning key is already a valid partition ID, as we use
          // `HashPartitioning.partitionIdExpression` to produce partitioning key.
          override def getPartition(key: Any): Int = key.asInstanceOf[Int]
        }
      case RangePartitioning(sortingExpressions, numPartitions) =>
        // Internally, RangePartitioner runs a job on the RDD that samples keys to compute
        // partition bounds. To get accurate samples, we need to copy the mutable keys.
        val rddForSampling = rdd.mapPartitionsInternal { iter =>
          val mutablePair = new MutablePair[InternalRow, Null]()
          iter.map(row => mutablePair.update(row.copy(), null))
        }
        implicit val ordering = new LazilyGeneratedOrdering(sortingExpressions, outputAttributes)
        new RangePartitioner(numPartitions, rddForSampling, ascending = true)
      case SinglePartition =>
        new Partitioner {
          override def numPartitions: Int = 1
          override def getPartition(key: Any): Int = 0
        }
      case _ => sys.error(s"Exchange not implemented for $newPartitioning")
      // TODO: Handle BroadcastPartitioning.
    }
    def getPartitionKeyExtractor(): InternalRow => Any = newPartitioning match {
      case RoundRobinPartitioning(numPartitions) =>
        // Distributes elements evenly across output partitions, starting from a random partition.
        var position = new Random(TaskContext.get().partitionId()).nextInt(numPartitions)
        (row: InternalRow) => {
          // The HashPartitioner will handle the `mod` by the number of partitions
          position += 1
          position
        }
      case h: HashPartitioning =>
        val projection = UnsafeProjection.create(h.partitionIdExpression :: Nil, outputAttributes)
        row => projection(row).getInt(0)
      case RangePartitioning(_, _) | SinglePartition => identity
      case _ => sys.error(s"Exchange not implemented for $newPartitioning")
    }
    val rddWithPartitionIds: RDD[Product2[Int, InternalRow]] = {
      if (needToCopyObjectsBeforeShuffle(part, serializer)) {
        rdd.mapPartitionsInternal { iter =>
          val getPartitionKey = getPartitionKeyExtractor()
          iter.map { row => (part.getPartition(getPartitionKey(row)), row.copy()) }
        }
      } else {
        rdd.mapPartitionsInternal { iter =>
          val getPartitionKey = getPartitionKeyExtractor()
          val mutablePair = new MutablePair[Int, InternalRow]()
          iter.map { row => mutablePair.update(part.getPartition(getPartitionKey(row)), row) }
        }
      }
    }

    // Now, we manually create a ShuffleDependency. Because pairs in rddWithPartitionIds
    // are in the form of (partitionId, row) and every partitionId is in the expected range
    // [0, part.numPartitions - 1]. The partitioner of this is a PartitionIdPassthrough.
    val dependency =
      new ShuffleDependency[Int, InternalRow, InternalRow](
        rddWithPartitionIds,
        new PartitionIdPassthrough(part.numPartitions),
        serializer)

    dependency
  }
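
For HashPartitioning, the partition id each row is routed to comes from HashPartitioning.partitionIdExpression, which is essentially pmod(hash(partition expressions), numPartitions), where hash is Spark's Murmur3-based hash. A rough SQL-level sketch of the same computation (table/column names hypothetical):

  // which of the 200 post-shuffle partitions (and hence output files) each row would land in
  spark.sql("SELECT id, pmod(hash(id), 200) AS partition_id FROM db.source_table")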
