Spark SQL Catalyst源代码分析之Analyzer
/** Spark SQL源代码分析系列文章*/
前面几篇文章解说了Spark SQL的核心运行流程和Spark SQL的Catalyst框架的Sql Parser是如何接受用户输入sql,经过解析生成Unresolved Logical Plan的。
我们记得Spark SQL的运行流程中还有一个核心的组件式Analyzer,本文将会介绍Analyzer在Spark SQL里起到了什么作用。
Analyzer位于Catalyst的analysis package下。主要职责是将Sql Parser 未能Resolved的Logical Plan 给Resolved掉。
watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQvb29wc29vbQ==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/Center" alt="" />
一、Analyzer构造
Analyzer会使用Catalog和FunctionRegistry将UnresolvedAttribute和UnresolvedRelation转换为catalyst里全类型的对象。
Analyzer里面有fixedPoint对象,一个Seq[Batch].
class Analyzer(catalog: Catalog, registry: FunctionRegistry, caseSensitive: Boolean)
extends RuleExecutor[LogicalPlan] with HiveTypeCoercion { // TODO: pass this in as a parameter.
val fixedPoint = FixedPoint(100) val batches: Seq[Batch] = Seq(
Batch("MultiInstanceRelations", Once,
NewRelationInstances),
Batch("CaseInsensitiveAttributeReferences", Once,
(if (caseSensitive) Nil else LowercaseAttributeReferences :: Nil) : _*),
Batch("Resolution", fixedPoint,
ResolveReferences ::
ResolveRelations ::
NewRelationInstances ::
ImplicitGenerate ::
StarExpansion ::
ResolveFunctions ::
GlobalAggregates ::
typeCoercionRules :_*),
Batch("AnalysisOperators", fixedPoint,
EliminateAnalysisOperators)
)
Analyzer里的一些对象解释:
FixedPoint:相当于迭代次数的上限。
/** A strategy that runs until fix point or maxIterations times, whichever comes first. */
case class FixedPoint(maxIterations: Int) extends Strategy
Batch: 批次,这个对象是由一系列Rule组成的,採用一个策略(策略事实上是迭代几次的别名吧,eg:Once)
/** A batch of rules. */。
protected case class Batch(name: String, strategy: Strategy, rules: Rule[TreeType]*)
Rule:理解为一种规则,这样的规则会应用到Logical Plan 从而将UnResolved 转变为Resolved
abstract class Rule[TreeType <: TreeNode[_]] extends Logging { /** Name for this rule, automatically inferred based on class name. */
val ruleName: String = {
val className = getClass.getName
if (className endsWith "$") className.dropRight(1) else className
} def apply(plan: TreeType): TreeType
}
Strategy:最大的运行次数,假设运行次数在最大迭代次数之前就达到了fix point,策略就会停止,不再应用了。
/**
* An execution strategy for rules that indicates the maximum number of executions. If the
* execution reaches fix point (i.e. converge) before maxIterations, it will stop.
*/
abstract class Strategy { def maxIterations: Int }
Analyzer解析主要是依据这些Batch里面定义的策略和Rule来对Unresolved的逻辑计划进行解析的。
这里Analyzer类本身并未定义运行的方法,而是要从它的父类RuleExecutor[LogicalPlan]寻找,Analyzer也实现了HiveTypeCosercion,这个类是參考Hive的类型自己主动兼容转换的原理。如图:
RuleExecutor:运行Rule的运行环境,它会将包括了一系列的Rule的Batch进行运行,这个过程都是串行的。
详细的运行方法定义在apply里:
能够看到这里是一个while循环,每一个batch下的rules都对当前的plan进行作用,这个过程是迭代的,直到达到Fix Point或者最大迭代次数。
def apply(plan: TreeType): TreeType = {
var curPlan = plan batches.foreach { batch =>
val batchStartPlan = curPlan
var iteration = 1
var lastPlan = curPlan
var continue = true // Run until fix point (or the max number of iterations as specified in the strategy.
while (continue) {
curPlan = batch.rules.foldLeft(curPlan) {
case (plan, rule) =>
val result = rule(plan) //这里将调用各个不同Rule的apply方法,将UnResolved Relations,Attrubute和Function进行Resolve
if (!result.fastEquals(plan)) {
logger.trace(
s"""
|=== Applying Rule ${rule.ruleName} ===
|${sideBySide(plan.treeString, result.treeString).mkString("\n")}
""".stripMargin)
} result //返回作用后的result plan
}
iteration += 1
if (iteration > batch.strategy.maxIterations) { //假设迭代次数已经大于该策略的最大迭代次数,就停止循环
logger.info(s"Max iterations ($iteration) reached for batch ${batch.name}")
continue = false
} if (curPlan.fastEquals(lastPlan)) { //假设在多次迭代中不再变化,由于plan有个unique id,就停止循环。 logger.trace(s"Fixed point reached for batch ${batch.name} after $iteration iterations.")
continue = false
}
lastPlan = curPlan
} if (!batchStartPlan.fastEquals(curPlan)) {
logger.debug(
s"""
|=== Result of Batch ${batch.name} ===
|${sideBySide(plan.treeString, curPlan.treeString).mkString("\n")}
""".stripMargin)
} else {
logger.trace(s"Batch ${batch.name} has no effect.")
}
} curPlan //返回Resolved的Logical Plan
}
二、Rules介绍
2.1、MultiInstanceRelation
Batch("MultiInstanceRelations", Once,
NewRelationInstances)
trait MultiInstanceRelation {
def newInstance: this.type
}
object NewRelationInstances extends Rule[LogicalPlan] {
def apply(plan: LogicalPlan): LogicalPlan = {
val localRelations = plan collect { case l: MultiInstanceRelation => l} //将logical plan应用partial function得到全部MultiInstanceRelation的plan的集合
val multiAppearance = localRelations
.groupBy(identity[MultiInstanceRelation]) //group by操作
.filter { case (_, ls) => ls.size > 1 } //假设仅仅取size大于1的进行兴许操作
.map(_._1)
.toSet //更新plan,使得每一个实例的expId是唯一的。
plan transform {
case l: MultiInstanceRelation if multiAppearance contains l => l.newInstance
}
}
}
2.2、LowercaseAttributeReferences
object LowercaseAttributeReferences extends Rule[LogicalPlan] {
def apply(plan: LogicalPlan): LogicalPlan = plan transform {
case UnresolvedRelation(databaseName, name, alias) =>
UnresolvedRelation(databaseName, name, alias.map(_.toLowerCase))
case Subquery(alias, child) => Subquery(alias.toLowerCase, child)
case q: LogicalPlan => q transformExpressions {
case s: Star => s.copy(table = s.table.map(_.toLowerCase))
case UnresolvedAttribute(name) => UnresolvedAttribute(name.toLowerCase)
case Alias(c, name) => Alias(c, name.toLowerCase)()
case GetField(c, name) => GetField(c, name.toLowerCase)
}
}
}
2.3、ResolveReferences
object ResolveReferences extends Rule[LogicalPlan] {
def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
case q: LogicalPlan if q.childrenResolved =>
logger.trace(s"Attempting to resolve ${q.simpleString}")
q transformExpressions {
case u @ UnresolvedAttribute(name) =>
// Leave unchanged if resolution fails. Hopefully will be resolved next round.
val result = q.resolve(name).getOrElse(u)//转化为NamedExpression
logger.debug(s"Resolving $u to $result")
result
}
}
}
2.4、 ResolveRelations
Catalog对象里面维护了一个tableName,Logical Plan的HashMap结果。
object ResolveRelations extends Rule[LogicalPlan] {
def apply(plan: LogicalPlan): LogicalPlan = plan transform {
case UnresolvedRelation(databaseName, name, alias) =>
catalog.lookupRelation(databaseName, name, alias)
}
}
2.5、ImplicitGenerate
object ImplicitGenerate extends Rule[LogicalPlan] {
def apply(plan: LogicalPlan): LogicalPlan = plan transform {
case Project(Seq(Alias(g: Generator, _)), child) =>
Generate(g, join = false, outer = false, None, child)
}
}
2.6 StarExpansion
object StarExpansion extends Rule[LogicalPlan] {
def apply(plan: LogicalPlan): LogicalPlan = plan transform {
// Wait until children are resolved
case p: LogicalPlan if !p.childrenResolved => p
// If the projection list contains Stars, expand it.
case p @ Project(projectList, child) if containsStar(projectList) =>
Project(
projectList.flatMap {
case s: Star => s.expand(child.output) //展开,将输入的Attributeexpand(input: Seq[Attribute]) 转化为Seq[NamedExpression]
case o => o :: Nil
},
child)
case t: ScriptTransformation if containsStar(t.input) =>
t.copy(
input = t.input.flatMap {
case s: Star => s.expand(t.child.output)
case o => o :: Nil
}
)
// If the aggregate function argument contains Stars, expand it.
case a: Aggregate if containsStar(a.aggregateExpressions) =>
a.copy(
aggregateExpressions = a.aggregateExpressions.flatMap {
case s: Star => s.expand(a.child.output)
case o => o :: Nil
}
)
}
/**
* Returns true if `exprs` contains a [[Star]].
*/
protected def containsStar(exprs: Seq[Expression]): Boolean =
exprs.collect { case _: Star => true }.nonEmpty
}
}
2.7 ResolveFunctions
object ResolveFunctions extends Rule[LogicalPlan] {
def apply(plan: LogicalPlan): LogicalPlan = plan transform {
case q: LogicalPlan =>
q transformExpressions {
case u @ UnresolvedFunction(name, children) if u.childrenResolved =>
registry.lookupFunction(name, children) //看是否注冊了当前udf
}
}
}
2.8 GlobalAggregates
object GlobalAggregates extends Rule[LogicalPlan] {
def apply(plan: LogicalPlan): LogicalPlan = plan transform {
case Project(projectList, child) if containsAggregates(projectList) =>
Aggregate(Nil, projectList, child)
} def containsAggregates(exprs: Seq[Expression]): Boolean = {
exprs.foreach(_.foreach {
case agg: AggregateExpression => return true
case _ =>
})
false
}
}
2.9 typeCoercionRules
val typeCoercionRules =
PropagateTypes ::
ConvertNaNs ::
WidenTypes ::
PromoteStrings ::
BooleanComparisons ::
BooleanCasts ::
StringToIntegralCasts ::
FunctionArgumentConversion ::
CastNulls ::
Nil
2.10 EliminateAnalysisOperators
object EliminateAnalysisOperators extends Rule[LogicalPlan] {
def apply(plan: LogicalPlan): LogicalPlan = plan transform {
case Subquery(_, child) => child //遇到Subquery,不反悔本身,返回它的Child,即删除了该元素
case LowerCaseSchema(child) => child
}
}
三、实践
val exec = sqlContext.sql("select mobile as mb, sid as id, mobile*2 multi2mobile, count(1) times from (select * from temp_shengli_mobile)a where pfrom_id=0.0 group by mobile, sid, mobile*2")
14/07/21 18:23:32 DEBUG SparkILoop$SparkILoopInterpreter: Invoking: public static java.lang.String $line47.$eval.$print()
14/07/21 18:23:33 INFO Analyzer: Max iterations (2) reached for batch MultiInstanceRelations
14/07/21 18:23:33 INFO Analyzer: Max iterations (2) reached for batch CaseInsensitiveAttributeReferences
14/07/21 18:23:33 DEBUG Analyzer$ResolveReferences$: Resolving 'pfrom_id to pfrom_id#5
14/07/21 18:23:33 DEBUG Analyzer$ResolveReferences$: Resolving 'mobile to mobile#2
14/07/21 18:23:33 DEBUG Analyzer$ResolveReferences$: Resolving 'sid to sid#1
14/07/21 18:23:33 DEBUG Analyzer$ResolveReferences$: Resolving 'mobile to mobile#2
14/07/21 18:23:33 DEBUG Analyzer$ResolveReferences$: Resolving 'mobile to mobile#2
14/07/21 18:23:33 DEBUG Analyzer$ResolveReferences$: Resolving 'sid to sid#1
14/07/21 18:23:33 DEBUG Analyzer$ResolveReferences$: Resolving 'mobile to mobile#2
14/07/21 18:23:33 DEBUG Analyzer:
=== Result of Batch Resolution ===
!Aggregate ['mobile,'sid,('mobile * 2) AS c2#27], ['mobile AS mb#23,'sid AS id#24,('mobile * 2) AS multi2mobile#25,COUNT(1) AS times#26L] Aggregate [mobile#2,sid#1,(CAST(mobile#2, DoubleType) * CAST(2, DoubleType)) AS c2#27], [mobile#2 AS mb#23,sid#1 AS id#24,(CAST(mobile#2, DoubleType) * CAST(2, DoubleType)) AS multi2mobile#25,COUNT(1) AS times#26L]
! Filter ('pfrom_id = 0.0) Filter (CAST(pfrom_id#5, DoubleType) = 0.0)
Subquery a Subquery a
! Project [*] Project [data_date#0,sid#1,mobile#2,pverify_type#3,create_time#4,pfrom_id#5,p_status#6,pvalidate_time#7,feffect_time#8,plastupdate_ip#9,update_time#10,status#11,preserve_int#12]
! UnresolvedRelation None, temp_shengli_mobile, None Subquery temp_shengli_mobile
! SparkLogicalPlan (ExistingRdd [data_date#0,sid#1,mobile#2,pverify_type#3,create_time#4,pfrom_id#5,p_status#6,pvalidate_time#7,feffect_time#8,plastupdate_ip#9,update_time#10,status#11,preserve_int#12], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:174) 14/07/21 18:23:33 DEBUG Analyzer:
=== Result of Batch AnalysisOperators ===
!Aggregate ['mobile,'sid,('mobile * 2) AS c2#27], ['mobile AS mb#23,'sid AS id#24,('mobile * 2) AS multi2mobile#25,COUNT(1) AS times#26L] Aggregate [mobile#2,sid#1,(CAST(mobile#2, DoubleType) * CAST(2, DoubleType)) AS c2#27], [mobile#2 AS mb#23,sid#1 AS id#24,(CAST(mobile#2, DoubleType) * CAST(2, DoubleType)) AS multi2mobile#25,COUNT(1) AS times#26L]
! Filter ('pfrom_id = 0.0) Filter (CAST(pfrom_id#5, DoubleType) = 0.0)
! Subquery a Project [data_date#0,sid#1,mobile#2,pverify_type#3,create_time#4,pfrom_id#5,p_status#6,pvalidate_time#7,feffect_time#8,plastupdate_ip#9,update_time#10,status#11,preserve_int#12]
! Project [*] SparkLogicalPlan (ExistingRdd [data_date#0,sid#1,mobile#2,pverify_type#3,create_time#4,pfrom_id#5,p_status#6,pvalidate_time#7,feffect_time#8,plastupdate_ip#9,update_time#10,status#11,preserve_int#12], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:174)
! UnresolvedRelation None, temp_shengli_mobile, None
四、总结
原创文章,转载请注明:
转载自:OopsOutOfMemory盛利的Blog,作者: OopsOutOfMemory
本文链接地址:http://blog.csdn.net/oopsoom/article/details/38025185
注:本文基于署名-非商业性使用-禁止演绎 2.5 中国大陆(CC BY-NC-ND 2.5 CN)协议,欢迎转载、转发和评论,可是请保留本文作者署名和文章链接。如若须要用于商业目的或者与授权方面的协商,请联系我。
Spark SQL Catalyst源代码分析之Analyzer的更多相关文章
- Spark SQL Catalyst源代码分析之TreeNode Library
/** Spark SQL源代码分析系列文章*/ 前几篇文章介绍了Spark SQL的Catalyst的核心执行流程.SqlParser,和Analyzer,本来打算直接写Optimizer的,可是发 ...
- Spark SQL Catalyst源代码分析Optimizer
/** Spark SQL源代码分析系列*/ 前几篇文章介绍了Spark SQL的Catalyst的核心运行流程.SqlParser,和Analyzer 以及核心类库TreeNode,本文将具体解说S ...
- Spark SQL Catalyst源代码分析之UDF
/** Spark SQL源代码分析系列文章*/ 在SQL的世界里,除了官方提供的经常使用的处理函数之外.一般都会提供可扩展的对外自己定义函数接口,这已经成为一种事实的标准. 在前面Spark SQL ...
- 第三篇:Spark SQL Catalyst源码分析之Analyzer
/** Spark SQL源码分析系列文章*/ 前面几篇文章讲解了Spark SQL的核心执行流程和Spark SQL的Catalyst框架的Sql Parser是怎样接受用户输入sql,经过解析生成 ...
- 第八篇:Spark SQL Catalyst源码分析之UDF
/** Spark SQL源码分析系列文章*/ 在SQL的世界里,除了官方提供的常用的处理函数之外,一般都会提供可扩展的对外自定义函数接口,这已经成为一种事实的标准. 在前面Spark SQL源码分析 ...
- 第五篇:Spark SQL Catalyst源码分析之Optimizer
/** Spark SQL源码分析系列文章*/ 前几篇文章介绍了Spark SQL的Catalyst的核心运行流程.SqlParser,和Analyzer 以及核心类库TreeNode,本文将详细讲解 ...
- 第六篇:Spark SQL Catalyst源码分析之Physical Plan
/** Spark SQL源码分析系列文章*/ 前面几篇文章主要介绍的是spark sql包里的的spark sql执行流程,以及Catalyst包内的SqlParser,Analyzer和Optim ...
- 第四篇:Spark SQL Catalyst源码分析之TreeNode Library
/** Spark SQL源码分析系列文章*/ 前几篇文章介绍了Spark SQL的Catalyst的核心运行流程.SqlParser,和Analyzer,本来打算直接写Optimizer的,但是发现 ...
- 第二篇:Spark SQL Catalyst源码分析之SqlParser
/** Spark SQL源码分析系列文章*/ Spark SQL的核心执行流程我们已经分析完毕,可以参见Spark SQL核心执行流程,下面我们来分析执行流程中各个核心组件的工作职责. 本文先从入口 ...
随机推荐
- javascript必须知道的知识要点(一)
该文章不详细叙述各知识要点的具体内容,仅把要点列出来,供大家学习的时候参照,或者检测自己是否熟练掌握了javascript,清楚各个部分的内容. 语句 注释 输出 字面量 变量 数据类型 typeof ...
- MongoDB索引05-30学习笔记
MongoDB 索引 索引通常能够极大的提高查询的效率,如果没有索引,MongoDB在读取数据时必须扫描集合中的每个文件并选取那些符合查询条件的记录. 这种扫描全集合的查询效率是非常低的,特别在处理大 ...
- SMTP协议详解
简单邮件传输协议 (Simple Mail Transfer Protocol, SMTP) 是在Internet传输email的事实标准. SMTP是一个相对简单的基于文本的协议.在其之上指定了一条 ...
- LeetCode Weekly Contest 20
1. 520. Detect Capital 题目描述的很清楚,直接写,注意:字符串长度为1的时候,大写和小写都是满足要求的,剩下的情况单独判断.还有:我感觉自己写的代码很丑,判断条件比较多,需要改进 ...
- Spark Scala语言学习系列之完成HelloWorld程序(三种方式)
三种方式完成HelloWorld程序 分别采用在REPL,命令行(scala脚本)和Eclipse下运行hello world. 一.Scala REPL. windows下安装好scala后,直接C ...
- 控件——DataGridview
控件:DataGridview 用来显示数据, 可以显示和编辑来自多种不同类型的数据源的表格数据. 一.两种显示数据的方式:手动,后台代码 主要通过后台代码:先建立三大类 然后绑定 ...
- WordPress浏览次数统计插件:WP-Postviews使用
WP-Postviews使用 1.要让你的博客在页面上显示浏览次数,你需要修改你博客当前使用的主题,在主循环中插入以下代码: 1 <?php if(function_exists('the_vi ...
- hdu2853 Assignment 完美匹配 多校联赛的好题
PS:好题.不看题解绝对AC不了. 题解来源: http://blog.csdn.net/niushuai666/article/details/7176290 http://www.cnblogs. ...
- 使用vs2017创建项目并添加到git中
参考 https://blog.csdn.net/qq373591361/article/details/71194651 https://blog.csdn.net/boonya/article/d ...
- B站真的是一个神奇的地方,初次用Python爬取弹幕。
"网上冲浪""886""GG""沙发"--如果你用过这些,那你可能是7080后: "杯具"" ...