【原创】大数据基础之Hive(2)Hive SQL执行过程之SQL解析过程
Hive SQL解析过程
SQL->AST(Abstract Syntax Tree)->Task(MapRedTask,FetchTask)->QueryPlan(Task集合)->Job(Yarn)
SQL解析会在两个地方进行:
- 一个是SQL执行前compile,具体在Driver.compile,为了创建QueryPlan;
- 一个是explain,具体在ExplainSemanticAnalyzer.analyzeInternal,为了创建ExplainTask;
SQL执行过程
1 compile过程(SQL->AST(Abstract Syntax Tree)->QueryPlan)
org.apache.hadoop.hive.ql.Driver
public int compile(String command, boolean resetTaskIds, boolean deferClose) {
...
ParseDriver pd = new ParseDriver();
ASTNode tree = pd.parse(command, ctx);
tree = ParseUtils.findRootNonNullToken(tree);
...
BaseSemanticAnalyzer sem = SemanticAnalyzerFactory.get(queryState, tree);
...
sem.analyze(tree, ctx);
...
// Record any ACID compliant FileSinkOperators we saw so we can add our transaction ID to
// them later.
acidSinks = sem.getAcidFileSinks(); LOG.info("Semantic Analysis Completed"); // validate the plan
sem.validate();
acidInQuery = sem.hasAcidInQuery();
perfLogger.PerfLogEnd(CLASS_NAME, PerfLogger.ANALYZE); if (isInterrupted()) {
return handleInterruption("after analyzing query.");
} // get the output schema
schema = getSchema(sem, conf);
plan = new QueryPlan(queryStr, sem, perfLogger.getStartTime(PerfLogger.DRIVER_RUN), queryId,
queryState.getHiveOperation(), schema);
...
compile过程为先由ParseDriver将SQL转换为ASTNode,然后由BaseSemanticAnalyzer对ASTNode进行分析,最后将BaseSemanticAnalyzer传入QueryPlan构造函数来创建QueryPlan;
1)将SQL转换为ASTNode过程如下(SQL->AST(Abstract Syntax Tree))
org.apache.hadoop.hive.ql.parse.ParseDriver
public ASTNode parse(String command, Context ctx, boolean setTokenRewriteStream)
throws ParseException {
if (LOG.isDebugEnabled()) {
LOG.debug("Parsing command: " + command);
} HiveLexerX lexer = new HiveLexerX(new ANTLRNoCaseStringStream(command));
TokenRewriteStream tokens = new TokenRewriteStream(lexer);
if (ctx != null) {
if ( setTokenRewriteStream) {
ctx.setTokenRewriteStream(tokens);
}
lexer.setHiveConf(ctx.getConf());
}
HiveParser parser = new HiveParser(tokens);
if (ctx != null) {
parser.setHiveConf(ctx.getConf());
}
parser.setTreeAdaptor(adaptor);
HiveParser.statement_return r = null;
try {
r = parser.statement();
} catch (RecognitionException e) {
e.printStackTrace();
throw new ParseException(parser.errors);
} if (lexer.getErrors().size() == 0 && parser.errors.size() == 0) {
LOG.debug("Parse Completed");
} else if (lexer.getErrors().size() != 0) {
throw new ParseException(lexer.getErrors());
} else {
throw new ParseException(parser.errors);
} ASTNode tree = (ASTNode) r.getTree();
tree.setUnknownTokenBoundaries();
return tree;
}
2)analyze过程(AST(Abstract Syntax Tree)->Task)
org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer
public void analyze(ASTNode ast, Context ctx) throws SemanticException {
initCtx(ctx);
init(true);
analyzeInternal(ast);
}
其中analyzeInternal是抽象方法,由不同的子类实现,比如DDLSemanticAnalyzer,SemanticAnalyzer,UpdateDeleteSemanticAnalyzer,ExplainSemanticAnalyzer等;
analyzeInternal主要的工作是将ASTNode转化为Task,包括可能的optimize,过程比较复杂,这里不贴代码;
3)创建QueryPlan过程如下(Task->QueryPlan)
org.apache.hadoop.hive.ql.QueryPlan
public QueryPlan(String queryString, BaseSemanticAnalyzer sem, Long startTime, String queryId,
HiveOperation operation, Schema resultSchema) {
this.queryString = queryString; rootTasks = new ArrayList<Task<? extends Serializable>>(sem.getAllRootTasks());
reducerTimeStatsPerJobList = new ArrayList<ReducerTimeStatsPerJob>();
fetchTask = sem.getFetchTask();
// Note that inputs and outputs can be changed when the query gets executed
inputs = sem.getAllInputs();
outputs = sem.getAllOutputs();
linfo = sem.getLineageInfo();
tableAccessInfo = sem.getTableAccessInfo();
columnAccessInfo = sem.getColumnAccessInfo();
idToTableNameMap = new HashMap<String, String>(sem.getIdToTableNameMap()); this.queryId = queryId == null ? makeQueryId() : queryId;
query = new org.apache.hadoop.hive.ql.plan.api.Query();
query.setQueryId(this.queryId);
query.putToQueryAttributes("queryString", this.queryString);
queryProperties = sem.getQueryProperties();
queryStartTime = startTime;
this.operation = operation;
this.autoCommitValue = sem.getAutoCommitValue();
this.resultSchema = resultSchema;
}
可见只是简单的将BaseSemanticAnalyzer中的内容拷贝出来,其中最重要的是sem.getAllRootTasks和sem.getFetchTask;
2 execute过程(QueryPlan->Job)
org.apache.hadoop.hive.ql.Driver
public int execute(boolean deferClose) throws CommandNeedRetryException {
...
// Add root Tasks to runnable
for (Task<? extends Serializable> tsk : plan.getRootTasks()) {
// This should never happen, if it does, it's a bug with the potential to produce
// incorrect results.
assert tsk.getParentTasks() == null || tsk.getParentTasks().isEmpty();
driverCxt.addToRunnable(tsk);
}
...
// Loop while you either have tasks running, or tasks queued up
while (driverCxt.isRunning()) { // Launch upto maxthreads tasks
Task<? extends Serializable> task;
while ((task = driverCxt.getRunnable(maxthreads)) != null) {
TaskRunner runner = launchTask(task, queryId, noName, jobname, jobs, driverCxt);
if (!runner.isRunning()) {
break;
}
}
... private TaskRunner launchTask(Task<? extends Serializable> tsk, String queryId, boolean noName,
String jobname, int jobs, DriverContext cxt) throws HiveException {
...
TaskRunner tskRun = new TaskRunner(tsk, tskRes);
...
tskRun.start();
...
tskRun.runSequential();
...
Driver.run中从QueryPlan中取出Task,并逐个launchTask,launchTask过程为将Task包装为TaskRunner,并最终调用TaskRunner.runSequential,下面看TaskRunner:
org.apache.hadoop.hive.ql.exec.TaskRunner
public void runSequential() {
int exitVal = -101;
try {
exitVal = tsk.executeTask();
...
这里直接调用Task.executeTask
org.apache.hadoop.hive.ql.exec.Task
public int executeTask() {
...
int retval = execute(driverContext);
...
这里execute是抽象方法,由子类实现,比如DDLTask,MapRedTask等,着重看MapRedTask,因为大部分的Task都是MapRedTask:
org.apache.hadoop.hive.ql.exec.mr.MapRedTask
public int execute(DriverContext driverContext) {
...
if (!runningViaChild) {
// we are not running this mapred task via child jvm
// so directly invoke ExecDriver
return super.execute(driverContext);
}
...
这里直接调用父类方法,也就是ExecDriver.execute,下面看:
org.apache.hadoop.hive.ql.exec.mr.ExecDriver
protected transient JobConf job;
...
public int execute(DriverContext driverContext) {
...
JobClient jc = null; MapWork mWork = work.getMapWork();
ReduceWork rWork = work.getReduceWork();
...
if (mWork.getNumMapTasks() != null) {
job.setNumMapTasks(mWork.getNumMapTasks().intValue());
}
...
job.setNumReduceTasks(rWork != null ? rWork.getNumReduceTasks().intValue() : 0);
job.setReducerClass(ExecReducer.class);
...
jc = new JobClient(job);
...
rj = jc.submitJob(job);
this.jobID = rj.getJobID();
...
这里将Task转化为Job提交到Yarn执行;
SQL Explain过程
另外一个SQL解析的过程是explain,在ExplainSemanticAnalyzer中将ASTNode转化为ExplainTask:
org.apache.hadoop.hive.ql.parse.ExplainSemanticAnalyzer
public void analyzeInternal(ASTNode ast) throws SemanticException {
...
ctx.setExplain(true);
ctx.setExplainLogical(logical); // Create a semantic analyzer for the query
ASTNode input = (ASTNode) ast.getChild(0);
BaseSemanticAnalyzer sem = SemanticAnalyzerFactory.get(queryState, input);
sem.analyze(input, ctx);
sem.validate(); ctx.setResFile(ctx.getLocalTmpPath());
List<Task<? extends Serializable>> tasks = sem.getAllRootTasks();
if (tasks == null) {
tasks = Collections.emptyList();
} FetchTask fetchTask = sem.getFetchTask();
if (fetchTask != null) {
// Initialize fetch work such that operator tree will be constructed.
fetchTask.getWork().initializeForFetch(ctx.getOpContext());
} ParseContext pCtx = null;
if (sem instanceof SemanticAnalyzer) {
pCtx = ((SemanticAnalyzer)sem).getParseContext();
} boolean userLevelExplain = !extended
&& !formatted
&& !dependency
&& !logical
&& !authorize
&& (HiveConf.getBoolVar(ctx.getConf(), HiveConf.ConfVars.HIVE_EXPLAIN_USER) && HiveConf
.getVar(conf, HiveConf.ConfVars.HIVE_EXECUTION_ENGINE).equals("tez"));
ExplainWork work = new ExplainWork(ctx.getResFile(),
pCtx,
tasks,
fetchTask,
sem,
extended,
formatted,
dependency,
logical,
authorize,
userLevelExplain,
ctx.getCboInfo()); work.setAppendTaskType(
HiveConf.getBoolVar(conf, HiveConf.ConfVars.HIVEEXPLAINDEPENDENCYAPPENDTASKTYPES)); ExplainTask explTask = (ExplainTask) TaskFactory.get(work, conf); fieldList = explTask.getResultSchema();
rootTasks.add(explTask);
}
【原创】大数据基础之Hive(2)Hive SQL执行过程之SQL解析过程的更多相关文章
- 【原创】大数据基础之Spark(4)RDD原理及代码解析
一 简介 spark核心是RDD,官方文档地址:https://spark.apache.org/docs/latest/rdd-programming-guide.html#resilient-di ...
- CentOS6安装各种大数据软件 第八章:Hive安装和配置
相关文章链接 CentOS6安装各种大数据软件 第一章:各个软件版本介绍 CentOS6安装各种大数据软件 第二章:Linux各个软件启动命令 CentOS6安装各种大数据软件 第三章:Linux基础 ...
- 【原创】大数据基础之Benchmark(2)TPC-DS
tpc 官方:http://www.tpc.org/ 一 简介 The TPC is a non-profit corporation founded to define transaction pr ...
- 【原创】大数据基础之Zookeeper(2)源代码解析
核心枚举 public enum ServerState { LOOKING, FOLLOWING, LEADING, OBSERVING; } zookeeper服务器状态:刚启动LOOKING,f ...
- 【原创】大数据基础之Hive(5)性能调优Performance Tuning
1 compress & mr hive默认的execution engine是mr hive> set hive.execution.engine;hive.execution.eng ...
- 【原创】大数据基础之Hive(1)Hive SQL执行过程之代码流程
hive 2.1 hive执行sql有两种方式: 执行hive命令,又细分为hive -e,hive -f,hive交互式: 执行beeline命令,beeline会连接远程thrift server ...
- 【原创】大数据基础之Hive(5)hive on spark
hive 2.3.4 on spark 2.4.0 Hive on Spark provides Hive with the ability to utilize Apache Spark as it ...
- 【原创】大数据基础之Hive(3)最简绿色部署
hadoop部署参考:https://www.cnblogs.com/barneywill/p/10428098.html 1 拷贝到所有服务器上并解压 # ansible all-servers - ...
- 了解大数据的技术生态系统 Hadoop,hive,spark(转载)
首先给出原文链接: 原文链接 大数据本身是一个很宽泛的概念,Hadoop生态圈(或者泛生态圈)基本上都是为了处理超过单机尺度的数据处理而诞生的.你能够把它比作一个厨房所以须要的各种工具. 锅碗瓢盆,各 ...
随机推荐
- [Oracle]Sqlplus 中使用 new_value
通过再sqlplus 中使用 new_value,可以把从表中查询出来的值,放置到 变量中.然后使用变量时,类似与宏定义一样,就可以像使用表中字段一样方便. 这使得sqlplus 的脚本具备和pl/s ...
- Bootstrap开发框架界面的调整处理
我在之前介绍了很多关于Boostrap的框架方面的文章,主要是介绍各种插件的使用居多,不过有时候觉得基于Metronic的Boostrap框架的界面效果不够紧凑,希望对它进行一定的调整,那么我们应该如 ...
- aelf帮助C#工程师10分钟零门槛搭建DAPP&私有链开发环境
aelf是一个可扩展的去中心化云计算区块链平台,支持高性能合约并行执行.原生多链数据交互.存储使用高性能分布式数据库. aelf整个系统可以在windows.osx及linux运行,团队在osx环境下 ...
- Linux(Ubuntu)使用日记(七)------终端控制器Terminator安装使用
1.目的 实现分屏效果,如图: 如果使用系统自带的终端,可能会使这种效果: 综上所述,知道我们为什么要安装Terminator了吧. 2.安装过程 Terminator 的安装非常方便,在 Ubunt ...
- pci设备驱动相关
pci 设备注册及查找: https://www.cnblogs.com/image-eye/archive/2012/02/15/2352912.html PFN https://nieyong.g ...
- 解决PHP乱码
如果你的PHP页面出现了乱码,之需要在页面的开始处加入下面代码就可以了. <?php header("content-type:text/html;charset=utf-8" ...
- HDU 2571 命运(简单dp)
传送门 真是刷越多题,越容易满足.算是一道很简单的DP了.终于可以自己写出来了. 二维矩阵每个点都有一个幸运值,要求从左上走到右下最多能积累多少幸运值. 重点就是左上右下必须都踩到. dp[i][j] ...
- CentOS下安装nvm
1安装版本管理工具git yum install git 查看git版本 git --version 2 安装Node.js版本管理工具nvm curl -o- https://raw.githubu ...
- mpvue——支持less
安装 安装less和less-loader,我用的是淘宝源,你也可以直接npm $ cnpm install less less-loader --save 配置 打开build目录下的webpack ...
- mysql 如何优化left join
今天遇到一个left join优化的问题,搞了一下午,中间查了不少资料,对MySQL的查询计划还有查询优化有了更进一步的了解,做一个简单的记录: select c.* from hotel_info_ ...