Chapter 4 Syntax Analysis

This chapter is devoted to parsing methods that are typically used in compilers.

We first present the basic concepts, then techniques suitable for hand implementation, and finally algorithms that have been used in automated tools. Since programs may contain syntactic errors, we discuss extensions of the parsing methods for recovery from common errors.

By design, every programming language has precise rules that prescribe the syntactic structure of well-formed programs. In C, for example, a program is made up of functions, a function out of declarations and statements, a statement out of expressions, and so on. The syntax of programming language constructs can be specified by context-free grammars or BNF (Backus-Naur Form) notation, introduced in Section 2.2. Grammars order significant benefits for both language designers and compiler writers.

  • A grammar gives a precise, yet easy-to-understand, syntactic specification of a programming language.
  • From certain classes of grammars, we can construct automatically an efficient parser that determines the syntactic structure of a source program.
  • As a side benefit, the parser-construction process can reveal syntactic ambiguities and trouble spots that might have slipped through the initial design phase of a language.
  • The structure imparted to a language by a properly designed grammar is useful for translating source programs into correct object co de and for detecting errors.
  • A grammar allows a language to be evolved or developed iteratively, by adding new constructs to perform new tasks. These new constructs can be integrated more easily into an implementation that follows the grammatical structure of the language.

4.1 Introduction

In this section, we examine the way the parser fits into a typical compiler. We then lo ok at typical grammars for arithmetic expressions. Grammars for expressions suffice for illustrating the essence of parsing, since parsing techniques for expressions carry over to most programming constructs. This section ends with a discussion of error handling, since the parser must respond gracefully to finding that its input cannot be generated by its grammar.

4.1.1 The Role of the Parser

In our compiler model, the parser obtains a string of tokens from the lexical analyzer, as shown in Fig. 4.1, and verifies that the string of token names can be generated by the grammar for the source language. We expect the parser to rep ort any syntax errors in an intelligible fashion and to recover from commonly occurring errors to continue processing the remainder of the program.

Conceptually, for well-formed programs, the parser constructs a parse tree and passes it to the rest of the compiler for further processing. In fact, the parse tree need not be constructed explicitly, since checking and translation actions can be interspersed with parsing, as we shall see. Thus, the parser and the rest of the front end could well be implemented by a single module.

Figure 4.1: Position of parser in compiler model

There are three general types of parsers for grammars: universal, top-down, and bottom-up. Universal parsing methods such as the Cocke-Younger-Kasami algorithm and Earley's algorithm can parse any grammar (see the bibliographic notes). These general methods are, however, to off inefficient to use in production compilers.

The methods commonly used in compilers can be classified as being either top-down or bottom-up. As implied by their names, top-down methods build parse trees from the top (root) to the bottom (leaves), while bottom-up methods start from the leaves and work their way up to the root. In either case, the input to the parser is scanned from left to right, one symbol at a time.

The most efficient top-down and bottom-up methods work only for subclasses of grammars, but several of these classes, particularly, LL and LR grammars, are expressive enough to describe most of the syntactic constructs in modern programming languages. Parsers implemented by hand often use LL grammars; for example, the predictive-parsing approach of Section 2.4.2 works for LL grammars. Parsers for the larger class of LR grammars are usually constructed using automated tools.

In this chapter, we assume that the output of the parser is some representation of the parse tree for the stream of tokens that comes from the lexical analyzer. In practice, there are a number of tasks that might be conducted during parsing, such as collecting information ab out various tokens into the symbol table, performing type checking and other kinds of semantic analysis, and generating intermediate co de. We have lumped all of these activities into the "rest of the front end" box in Fig. 4.1. These activities will be covered in detail in subsequent chapters.

4.1.2 Representative Grammars

Some of the grammars that will be examined in this chapter are presented here for ease of reference. Constructs that begin with keywords like while or int, are relatively easy to parse, because the keyword guides the choice of the grammar production that must be applied to match the input. We therefore concentrate on expressions, which present more of challenge, because of the associativity and precedence of operators.

Associativity and precedence are captured in the following grammar, which is similar to ones used in Chapter 2 for describing expressions, terms, and factors. E represents expressions consisting of terms separated by + signs, T represents terms consisting of factors separated by * signs, and F represents factors that can be either parenthesized expressions or identifiers:

E→E + T | T

T→T * F | F

F→( E ) | id

(4.1)

Expression grammar (4.1) belongs to the class of LR grammars that are suitable for bottom-up parsing. This grammar can be adapted to handle additional operators and additional levels of precedence. However, it cannot be used for top-down parsing because it is left recursive.

The following non-left-recursive variant of the expression grammar (4.1) will be used for top-down parsing:

E→T E'

E'→+ T E' | ϵ

T→F T'

T'→* F T' | ϵ

F→( E ) | id

(4.2)

The following grammar treats + and * alike, so it is useful for illustrating techniques for handling ambiguities during parsing:

E→E + E | E * E | ( E ) | id

(4.3)

Here, E represents expressions of all types. Grammar (4.3) permits more than one parse tree for expressions like a + b * c.

4.1.3 Syntax Error Handling

The remainder of this section considers the nature of syntactic errors and general strategies for error recovery. Two of these strategies, called panic-mode and phrase-level recovery, are discussed in more detail in connection with specific parsing methods.

If a compiler had to process only correct programs, its design and implementation would be simplified greatly. However, a compiler is expected to assist the programmer in locating and tracking down errors that inevitably creep into programs, despite the programmer's best efforts. Strikingly, few languages have been designed with error handling in mind, even though errors are so common-place. Our civilization would be radically different if spoken languages had the same requirements for syntactic accuracy as computer languages. Most programming language specifications do not describe how a compiler should respond to errors; error handling is left to the compiler designer. Planning the error handling right from the start can both simplify the structure of a compiler and improve its handling of errors.

Common programming errors can occur at many different levels.

  • Lexical errors include misspellings of identifiers, keywords, or operators -- e.g., the use of an identifier elipseSize instead of ellipseSize -- and missing quotes around text intended as a string.
  • Syntactic errors include misplaced semicolons or extra or missing braces; that is, "{" or "}." As another example, in C or Java, the appearance of a case statement without an enclosing switch is a syntactic error (however, this situation is usually allowed by the parser and caught later in the processing, as the compiler attempts to generate code).
  • Semantic errors include type mismatches between operators and operands, e.g., the return of a value in a Java method with result type void.
  • Logical errors can be anything from incorrect reasoning on the part of the programmer to the use in a C program of the assignment operator = instead of the comparison operator ==. The program containing = may be well formed; however, it may not reflect the programmer's intent.

The precision of parsing methods allows syntactic errors to be detected very efficiently. Several parsing methods, such as the LL and LR methods, detect an error as soon as possible; that is, when the stream of tokens from the lexical analyzer cannot be parsed further according to the grammar for the language.

More precisely, they have the viable-prefix property, meaning that they detect that an error has occurred as soon as they see a prefix of the input that cannot be completed to form a string in the language.

Another reason for emphasizing error recovery during parsing is that many errors app ear syntactic, whatever their cause, and are exposed when parsing cannot continue. A few semantic errors, such as type mismatches, can also be detected efficiently; however, accurate detection of semantic and logical errors at compile time is in general a difficult task.

The error handler in a parser has goals that are simple to state but challenging to realize:

  • Report the presence of errors clearly and accurately.
  • Recover from each error quickly enough to detect subsequent errors.
  • Add minimal overhead to the processing of correct programs.

Fortunately, common errors are simple ones, and a relatively straightforward error-handling mechanism often suffices.

How should an error handler rep ort the presence of an error? At the very least, it must rep ort the place in the source program where an error is detected, because there is a good chance that the actual error occurred within the previous few tokens. A common strategy is to print the offending line with a pointer to the position at which an error is detected.

4.1.4 Error-Recovery Strategies

Once an error is detected, how should the parser recover? Although no strategy has proven itself universally acceptable, a few methods have broad applicability. The simplest approach is for the parser to quit with an informative error message when it detects the first error. Additional errors are often uncovered if the parser can restore itself to a state where processing of the input can continue with reasonable hopes that the further processing will provide meaningful diagnostic information. If errors pile up, it is better for the compiler to give up after exceeding some error limit than to produce an annoying avalanche of "spurious" errors.

The balance of this section is devoted to the following recovery strategies: panic-mode, phrase-level, error-productions, and global-correction.

Panic-Mode Recovery

With this method, on discovering an error, the parser discards input symbols one at a time until one of a designated set of synchronizing tokens is found. The synchronizing tokens are usually delimiters, such as semicolon or }, whose role in the source program is clear and unambiguous. The compiler designer must select the synchronizing tokens appropriate for the source language. While panic-mode correction often skips a considerable amount of input without checking it for additional errors, it has the advantage of simplicity, and, unlike some methods to be considered later, is guaranteed not to go into an infinite loop.

Phrase-Level Recovery

On discovering an error, a parser may perform local correction on the remaining input; that is, it may replace a prefix of the remaining input by some string that allows the parser to continue. A typical local correction is to replace a comma by a semicolon, delete an extraneous semicolon, or insert a missing semicolon.

The choice of the local correction is left to the compiler designer. Of course, we must be careful to choose replacements that do not lead to infinite loops, as would be the case, for example, if we always inserted something on the input ahead of the current input symbol.

Phrase-level replacement has been used in several error-repairing compilers, as it can correct any input string. Its major drawback is the difficulty it has in coping with situations in which the actual error has occurred before the point of detection.

Error Productions

By anticipating common errors that might be encountered, we can augment the grammar for the language at hand with productions that generate the erroneous constructs. A parser constructed from a grammar augmented by these error productions detects the anticipated errors when an error production is used during parsing. The parser can then generate appropriate error diagnostics about the erroneous construct that has been recognized in the input.

Global Correction

Ideally, we would like a compiler to make as few changes as possible in processing an incorrect input string. There are algorithms for choosing a minimal sequence of changes to obtain a globally least-cost correction. Given an incorrect input string x and grammar G, these algorithms will find a parse tree for a related stringy, such that the number of insertions, deletions, and changes of tokens required to transform x into y is as small as possible. Unfortunately, these methods are in general too costly to implement in terms of time and space, so these techniques are currently only of theoretical interest.

Do note that a closest correct program may not be what the programmer had in mind. Nevertheless, the notion of least-cost correction provides a yardstick for evaluating error-recovery techniques, and has been used for finding optimal replacement strings for phrase-level recovery.

Chapter 4 Syntax Analysis的更多相关文章

  1. SQLChop、SQLWall(Druid)、PHP Syntax Parser Analysis

    catalog . introduction . sqlchop sourcecode analysis . SQLWall(Druid) . PHP Syntax Parser . SQL Pars ...

  2. Image Processing, Analysis & and Machine Vision - A MATLAB Companion

    Contents目录 Chapter 0: Introduction to the companion book本辅导书简介 Chapter 1: Introduction 简介 Viewing an ...

  3. Thinking in Java,Fourth Edition(Java 编程思想,第四版)学习笔记(八)之Reusing Classes

    The trick is to use the classes without soiling the existing code. 1. composition--simply create obj ...

  4. Compiler Theory(编译原理)、词法/语法/AST/中间代码优化在Webshell检测上的应用

    catalog . 引论 . 构建一个编译器的相关科学 . 程序设计语言基础 . 一个简单的语法制导翻译器 . 简单表达式的翻译器(源代码示例) . 词法分析 . 生成中间代码 . 词法分析器的实现 ...

  5. webkit模块介绍

    一.Webkit模块   用到的第三方库如下:   cairo 一个2D绘图库 casqt Unicode处理用的库,从QT中抽取部分代码形成的 expat 一个XML SAX解析器的库 freety ...

  6. How Browsers Work: Behind the scenes of modern web browsers

    http://www.html5rocks.com/en/tutorials/internals/howbrowserswork/#Parser_Lexer_combination Grammars ...

  7. Impala 源码分析-FE

    By yhluo 2015年7月29日 Impala 3 Comments Impala 源代码目录结构 SQL 解析 Impala 的 SQL 解析与执行计划生成部分是由 impala-fronte ...

  8. 44个JAVA代码质量管理工具(转)

    1. CodePro AnalytixIt’s a great tool (Eclipse plugin) for improving software quality. It has the nex ...

  9. 大白话Vue源码系列(02):编译器初探

    阅读目录 编译器代码藏在哪 Vue.prototype.$mount 构建 AST 的一般过程 Vue 构建的 AST 题接上文,上回书说到,Vue 的编译器模块相对独立且简单,那咱们就从这块入手,先 ...

随机推荐

  1. mysql常用命令用法

    Mysql帮助文档地址:http://dev.mysql.com/doc/ 1.创建数据库: create database database_name; 2.选择数据库: use database_ ...

  2. Python之字符串计算(计算器)

    Python之字符串计算(计算器) import re expression = '-1-2*((60+2*(-3-40.0+42425/5)*(9-2*5/3+357/553/3*99/4*2998 ...

  3. AD采集精度中的LSB

    测量范围+5V, 精度10位,LSB=0.0048 精度16位,LSB=0.000076951 测量范围+-5V, 精度10位,LSB=0.009765625,大约为0.01 精度16位,LSB=0. ...

  4. 洛谷P3373 线段树2(补上注释了)

    毒瘤题.找了一下午+晚上的BUG,才发现原来query_tree写的是a%p; 真的是一个教训 UPD:2019.6.18 #include<iostream> #include<c ...

  5. 听dalao讲课 7.26

    XFZ今天讲了些关于多项式求ln和多项式求导以及多项式求积分的东西 作为一个连导数和积分根本就不会的蒟蒻,就像在听天书,所以不得不补点前置知识 1.积分 积分是微积分学与数学分析里的一个核心概念.通常 ...

  6. [NOIP2005] 提高组 洛谷P1053 篝火晚会

    题目描述 佳佳刚进高中,在军训的时候,由于佳佳吃苦耐劳,很快得到了教官的赏识,成为了“小教官”.在军训结束的那天晚上,佳佳被命令组织同学们进行篝火晚会.一共有n个同学,编号从1到n.一开始,同学们按照 ...

  7. 【PowerDesigner】PowerDesigner之CDM、PDM、SQL之间转换

    有关CDM.PDM.SQL之间转换以及不同数据库之间库表Sql的移植,首先要了解的是它们各自的用途.这里就简单的描述一下,不做详细的解释了. CDM:概念数据模型.CDM就是以其自身方式来描述E-R图 ...

  8. Linux下汇编语言学习笔记15 ---

    这是17年暑假学习Linux汇编语言的笔记记录,参考书目为清华大学出版社 Jeff Duntemann著 梁晓辉译<汇编语言基于Linux环境>的书,喜欢看原版书的同学可以看<Ass ...

  9. Floyd算法——保存路径——输出路径 HDU1385

    题目链接 http://acm.hdu.edu.cn/showproblem.php?pid=1385 参考 http://blog.csdn.net/shuangde800/article/deta ...

  10. Java 读取Excel内容并保存进数据库

    读取Excel中内容,并保存进数据库 步骤 建立数据库连接 读取文件内容 (fileInputStream 放进POI的对应Excel读取接口,实现Excel文件读取) 获取文件各种内容(总列数,总行 ...