When Spark SQL executes insert overwrite table, the number of files written to the new table or new partition may be 200, or it may be any other number. Why the difference?

First, a look at the flow Spark SQL follows when executing insert overwrite table (a minimal example statement is sketched after the list):

  • 1 Create a temporary directory, for example

    • .hive-staging_hive_2018-06-23_00-39-39_825_3122897139441535352-2312/-ext-10000
  • 2 Write the data into the temporary directory;
  • 3 Call loadTable or loadPartition to move the data from the temporary directory to the final directory;
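
For reference, a minimal statement that goes through this flow might look like the sketch below (database, table and partition names are hypothetical):

  // hypothetical tables; the data first lands under .hive-staging_*/-ext-10000, then is moved into the partition
  spark.sql("""
    INSERT OVERWRITE TABLE db.target_table PARTITION (dt = '2018-06-23')
    SELECT id, name FROM db.source_table WHERE dt = '2018-06-23'
  """)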

The corresponding code is:

org.apache.spark.sql.hive.execution.InsertIntoHiveTable

  case class InsertIntoHiveTable(
      table: MetastoreRelation,
      partition: Map[String, Option[String]],
      child: SparkPlan,
      overwrite: Boolean,
      ifNotExists: Boolean) extends UnaryExecNode {
    ...
    protected[sql] lazy val sideEffectResult: Seq[InternalRow] = {
      ...
      val tmpLocation = getExternalTmpPath(tableLocation, hadoopConf)
      val fileSinkConf = new FileSinkDesc(tmpLocation.toString, tableDesc, false)
      ...
      @transient val outputClass = writerContainer.newSerializer(table.tableDesc).getSerializedClass
      saveAsHiveFile(child.execute(), outputClass, fileSinkConf, jobConfSer, writerContainer)
      ...

    private def saveAsHiveFile(
        rdd: RDD[InternalRow],
        valueClass: Class[_],
        fileSinkConf: FileSinkDesc,
        conf: SerializableJobConf,
        writerContainer: SparkHiveWriterContainer): Unit = {
      assert(valueClass != null, "Output value class not set")
      conf.value.setOutputValueClass(valueClass)

      val outputFileFormatClassName = fileSinkConf.getTableInfo.getOutputFileFormatClassName
      assert(outputFileFormatClassName != null, "Output format class not set")
      conf.value.set("mapred.output.format.class", outputFileFormatClassName)

      FileOutputFormat.setOutputPath(
        conf.value,
        SparkHiveWriterContainer.createPathFromString(fileSinkConf.getDirName(), conf.value))

      log.debug("Saving as hadoop file of type " + valueClass.getSimpleName)
      writerContainer.driverSideSetup()
      sqlContext.sparkContext.runJob(rdd, writerContainer.writeToFile _)
      writerContainer.commitJob()
    }

First, step 1, creating the temporary directory, i.e. getExternalTmpPath:

  val stagingDir = hadoopConf.get("hive.exec.stagingdir", ".hive-staging")

  def getExternalTmpPath(path: Path, hadoopConf: Configuration): Path = {
    val extURI: URI = path.toUri
    if (extURI.getScheme == "viewfs") {
      getExtTmpPathRelTo(path.getParent, hadoopConf)
    } else {
      new Path(getExternalScratchDir(extURI, hadoopConf), "-ext-10000")
    }
  }

  private def getExternalScratchDir(extURI: URI, hadoopConf: Configuration): Path = {
    getStagingDir(new Path(extURI.getScheme, extURI.getAuthority, extURI.getPath), hadoopConf)
  }

  private def getStagingDir(inputPath: Path, hadoopConf: Configuration): Path = {
    val inputPathUri: URI = inputPath.toUri
    val inputPathName: String = inputPathUri.getPath
    val fs: FileSystem = inputPath.getFileSystem(hadoopConf)
    val stagingPathName: String =
      if (inputPathName.indexOf(stagingDir) == -1) {
        new Path(inputPathName, stagingDir).toString
      } else {
        inputPathName.substring(0, inputPathName.indexOf(stagingDir) + stagingDir.length)
      }
    val dir: Path =
      fs.makeQualified(
        new Path(stagingPathName + "_" + executionId + "-" + TaskRunner.getTaskRunnerID))
    logDebug("Created staging dir = " + dir + " for path = " + inputPath)
    try {
      if (!FileUtils.mkdir(fs, dir, true, hadoopConf)) {
        throw new IllegalStateException("Cannot create staging directory '" + dir.toString + "'")
      }
      fs.deleteOnExit(dir)
    } catch {
      case e: IOException =>
        throw new RuntimeException(
          "Cannot create staging directory '" + dir.toString + "': " + e.getMessage, e)
    }
    return dir
  }

  private def executionId: String = {
    val rand: Random = new Random
    val format = new SimpleDateFormat("yyyy-MM-dd_HH-mm-ss_SSS", Locale.US)
    "hive_" + format.format(new Date) + "_" + Math.abs(rand.nextLong)
  }

The temporary directory name is composed of 【.hive-staging (from config hive.exec.stagingdir)】_【hive (hard-coded)】_【2018-06-23_00-39-39_825 (timestamp, yyyy-MM-dd_HH-mm-ss_SSS)】_【3122897139441535352 (random number)】-【2312 (task runner id)】/-ext-10000 (hard-coded).
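
As a side note, the prefix of this directory can be changed through hive.exec.stagingdir. A hedged sketch (assumption: spark.hadoop.* entries are copied into the Hadoop Configuration that getExternalTmpPath reads):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .enableHiveSupport()
    .config("spark.hadoop.hive.exec.stagingdir", ".my-staging")  // default is ".hive-staging"
    .getOrCreate()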

Next, the file-writing step, i.e.

sqlContext.sparkContext.runJob(rdd, writerContainer.writeToFile _)

org.apache.spark.SparkContext

  /**
   * Run a job on all partitions in an RDD and return the results in an array.
   */
  def runJob[T, U: ClassTag](rdd: RDD[T], func: (TaskContext, Iterator[T]) => U): Array[U] = {
    runJob(rdd, func, 0 until rdd.partitions.length)
  }

So the write is executed partition by partition of the RDD: the RDD produces as many output files as it has partitions. The RDD comes from child.execute, i.e. SparkPlan.execute. Next, SparkPlan:
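
A small sketch (not from the Spark source; `spark` is an existing SparkSession) illustrating that runJob invokes the function once per RDD partition, which is why one output file is produced per partition:

  // 4 partitions -> the function runs 4 times -> (in saveAsHiveFile) 4 output files
  val rdd = spark.sparkContext.parallelize(1 to 100, numSlices = 4)
  val sizes = spark.sparkContext.runJob(rdd, (iter: Iterator[Int]) => iter.size)
  assert(sizes.length == rdd.partitions.length)  // one result per partition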

org.apache.spark.sql.execution.SparkPlan

  final def execute(): RDD[InternalRow] = executeQuery {
    doExecute()
  }

  protected def doExecute(): RDD[InternalRow]

doExecute is an abstract method; each step of the execution plan corresponds to a SparkPlan subclass, e.g. Project corresponds to ProjectExec and SortMergeJoin corresponds to SortMergeJoinExec.
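
One hedged way to see which physical operators (SparkPlan subclasses) a query maps to is to print its plan; the table name below is hypothetical:

  val df = spark.sql("SELECT dept, count(*) AS cnt FROM db.source_table GROUP BY dept")
  df.explain()                         // prints the physical plan; each operator is a SparkPlan subclass
  // programmatic access: df.queryExecution.executedPlan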

The SparkPlan is generated by SparkPlanner; next, SparkPlanner:

org.apache.spark.sql.execution.SparkPlanner

  def numPartitions: Int = conf.numShufflePartitions

This directly reads SQLConf.numShufflePartitions; next, SQLConf:

org.apache.spark.sql.internal.SQLConf

  val SHUFFLE_PARTITIONS = SQLConfigBuilder("spark.sql.shuffle.partitions")
    .doc("The default number of partitions to use when shuffling data for joins or aggregations.")
    .intConf
    .createWithDefault(200)

  def numShufflePartitions: Int = getConf(SHUFFLE_PARTITIONS)

This reads the configuration spark.sql.shuffle.partitions, which defaults to 200. So how is this partition count used? Next, BasicOperators:
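
A hedged sketch of changing this default for a session (50 is just an example value):

  // takes effect for queries planned after this point
  spark.conf.set("spark.sql.shuffle.partitions", "50")
  // equivalent in spark-sql / a SQL script:
  // SET spark.sql.shuffle.partitions=50;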

org.apache.spark.sql.execution.SparkStrategies.BasicOperators

    def numPartitions: Int = self.numPartitions

    def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
      ...
      case logical.RepartitionByExpression(expressions, child, nPartitions) =>
        exchange.ShuffleExchange(HashPartitioning(
          expressions, nPartitions.getOrElse(numPartitions)), planLater(child)) :: Nil

So the shuffle step creates its HashPartitioning from numPartitions. If the SQL requires a shuffle (e.g. it contains a join or group by), then by default 200 files are written; if there is no shuffle, the number of files is determined by the upstream operators such as HiveTableScan and Filter, i.e. by the number of partitions they produce.

This can also be seen from the execution plan: when there is a shuffle, the plan usually contains a step like:

:  +- Exchange(coordinator id: 371605426) hashpartitioning(id#60, 200), coordinator[target post-shuffle partition size: 67108864]

The 200 in hashpartitioning(id#60, 200) is the default value of spark.sql.shuffle.partitions.
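
In practice, when 200 post-shuffle files are too many (or too few), two hedged ways to control the count before the write (table names are hypothetical):

  // 1) explicitly repartition (or coalesce) the DataFrame before inserting:
  spark.sql("SELECT id, name FROM db.source_table")
    .repartition(10)                                   // 10 partitions -> 10 files
    .write.mode("overwrite").insertInto("db.target_table")

  // 2) or lower spark.sql.shuffle.partitions for the whole query, as shown earlier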

Appendix: the ShuffleExchange code:

org.apache.spark.sql.execution.exchange.ShuffleExchange

  def apply(newPartitioning: Partitioning, child: SparkPlan): ShuffleExchange = {
    ShuffleExchange(newPartitioning, child, coordinator = Option.empty[ExchangeCoordinator])
  }

  protected override def doExecute(): RDD[InternalRow] = attachTree(this, "execute") {
    // Returns the same ShuffleRowRDD if this plan is used by multiple plans.
    if (cachedShuffleRDD == null) {
      cachedShuffleRDD = coordinator match {
        case Some(exchangeCoordinator) =>
          val shuffleRDD = exchangeCoordinator.postShuffleRDD(this)
          assert(shuffleRDD.partitions.length == newPartitioning.numPartitions)
          shuffleRDD
        case None =>
          val shuffleDependency = prepareShuffleDependency()
          preparePostShuffleRDD(shuffleDependency)
      }
    }
    cachedShuffleRDD
  }

  /**
   * Returns a [[ShuffleDependency]] that will partition rows of its child based on
   * the partitioning scheme defined in `newPartitioning`. Those partitions of
   * the returned ShuffleDependency will be the input of shuffle.
   */
  private[exchange] def prepareShuffleDependency()
    : ShuffleDependency[Int, InternalRow, InternalRow] = {
    ShuffleExchange.prepareShuffleDependency(
      child.execute(), child.output, newPartitioning, serializer)
  }

  /**
   * Returns a [[ShuffledRowRDD]] that represents the post-shuffle dataset.
   * This [[ShuffledRowRDD]] is created based on a given [[ShuffleDependency]] and an optional
   * partition start indices array. If this optional array is defined, the returned
   * [[ShuffledRowRDD]] will fetch pre-shuffle partitions based on indices of this array.
   */
  private[exchange] def preparePostShuffleRDD(
      shuffleDependency: ShuffleDependency[Int, InternalRow, InternalRow],
      specifiedPartitionStartIndices: Option[Array[Int]] = None): ShuffledRowRDD = {
    // If an array of partition start indices is provided, we need to use this array
    // to create the ShuffledRowRDD. Also, we need to update newPartitioning to
    // update the number of post-shuffle partitions.
    specifiedPartitionStartIndices.foreach { indices =>
      assert(newPartitioning.isInstanceOf[HashPartitioning])
      newPartitioning = UnknownPartitioning(indices.length)
    }
    new ShuffledRowRDD(shuffleDependency, specifiedPartitionStartIndices)
  }

  /**
   * Returns a [[ShuffleDependency]] that will partition rows of its child based on
   * the partitioning scheme defined in `newPartitioning`. Those partitions of
   * the returned ShuffleDependency will be the input of shuffle.
   */
  def prepareShuffleDependency(
      rdd: RDD[InternalRow],
      outputAttributes: Seq[Attribute],
      newPartitioning: Partitioning,
      serializer: Serializer): ShuffleDependency[Int, InternalRow, InternalRow] = {
    val part: Partitioner = newPartitioning match {
      case RoundRobinPartitioning(numPartitions) => new HashPartitioner(numPartitions)
      case HashPartitioning(_, n) =>
        new Partitioner {
          override def numPartitions: Int = n
          // For HashPartitioning, the partitioning key is already a valid partition ID, as we use
          // `HashPartitioning.partitionIdExpression` to produce partitioning key.
          override def getPartition(key: Any): Int = key.asInstanceOf[Int]
        }
      case RangePartitioning(sortingExpressions, numPartitions) =>
        // Internally, RangePartitioner runs a job on the RDD that samples keys to compute
        // partition bounds. To get accurate samples, we need to copy the mutable keys.
        val rddForSampling = rdd.mapPartitionsInternal { iter =>
          val mutablePair = new MutablePair[InternalRow, Null]()
          iter.map(row => mutablePair.update(row.copy(), null))
        }
        implicit val ordering = new LazilyGeneratedOrdering(sortingExpressions, outputAttributes)
        new RangePartitioner(numPartitions, rddForSampling, ascending = true)
      case SinglePartition =>
        new Partitioner {
          override def numPartitions: Int = 1
          override def getPartition(key: Any): Int = 0
        }
      case _ => sys.error(s"Exchange not implemented for $newPartitioning")
      // TODO: Handle BroadcastPartitioning.
    }
    def getPartitionKeyExtractor(): InternalRow => Any = newPartitioning match {
      case RoundRobinPartitioning(numPartitions) =>
        // Distributes elements evenly across output partitions, starting from a random partition.
        var position = new Random(TaskContext.get().partitionId()).nextInt(numPartitions)
        (row: InternalRow) => {
          // The HashPartitioner will handle the `mod` by the number of partitions
          position += 1
          position
        }
      case h: HashPartitioning =>
        val projection = UnsafeProjection.create(h.partitionIdExpression :: Nil, outputAttributes)
        row => projection(row).getInt(0)
      case RangePartitioning(_, _) | SinglePartition => identity
      case _ => sys.error(s"Exchange not implemented for $newPartitioning")
    }
    val rddWithPartitionIds: RDD[Product2[Int, InternalRow]] = {
      if (needToCopyObjectsBeforeShuffle(part, serializer)) {
        rdd.mapPartitionsInternal { iter =>
          val getPartitionKey = getPartitionKeyExtractor()
          iter.map { row => (part.getPartition(getPartitionKey(row)), row.copy()) }
        }
      } else {
        rdd.mapPartitionsInternal { iter =>
          val getPartitionKey = getPartitionKeyExtractor()
          val mutablePair = new MutablePair[Int, InternalRow]()
          iter.map { row => mutablePair.update(part.getPartition(getPartitionKey(row)), row) }
        }
      }
    }

    // Now, we manually create a ShuffleDependency. Because pairs in rddWithPartitionIds
    // are in the form of (partitionId, row) and every partitionId is in the expected range
    // [0, part.numPartitions - 1]. The partitioner of this is a PartitionIdPassthrough.
    val dependency =
      new ShuffleDependency[Int, InternalRow, InternalRow](
        rddWithPartitionIds,
        new PartitionIdPassthrough(part.numPartitions),
        serializer)

    dependency
  }
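
For HashPartitioning, the partition id each row is routed to comes from HashPartitioning.partitionIdExpression, which is essentially pmod(hash(partition expressions), numPartitions), where hash is Spark's Murmur3-based hash. A rough SQL-level sketch of the same computation (table/column names hypothetical):

  // which of the 200 post-shuffle partitions (and hence output files) each row would land in
  spark.sql("SELECT id, pmod(hash(id), 200) AS partition_id FROM db.source_table")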
