Chapter 4 Syntax Analysis

This chapter is devoted to parsing methods that are typically used in compilers.

We first present the basic concepts, then techniques suitable for hand implementation, and finally algorithms that have been used in automated tools. Since programs may contain syntactic errors, we discuss extensions of the parsing methods for recovery from common errors.

By design, every programming language has precise rules that prescribe the syntactic structure of well-formed programs. In C, for example, a program is made up of functions, a function out of declarations and statements, a statement out of expressions, and so on. The syntax of programming language constructs can be specified by context-free grammars or BNF (Backus-Naur Form) notation, introduced in Section 2.2. Grammars order significant benefits for both language designers and compiler writers.

  • A grammar gives a precise, yet easy-to-understand, syntactic specification of a programming language.
  • From certain classes of grammars, we can construct automatically an efficient parser that determines the syntactic structure of a source program.
  • As a side benefit, the parser-construction process can reveal syntactic ambiguities and trouble spots that might have slipped through the initial design phase of a language.
  • The structure imparted to a language by a properly designed grammar is useful for translating source programs into correct object co de and for detecting errors.
  • A grammar allows a language to be evolved or developed iteratively, by adding new constructs to perform new tasks. These new constructs can be integrated more easily into an implementation that follows the grammatical structure of the language.

4.1 Introduction

In this section, we examine the way the parser fits into a typical compiler. We then lo ok at typical grammars for arithmetic expressions. Grammars for expressions suffice for illustrating the essence of parsing, since parsing techniques for expressions carry over to most programming constructs. This section ends with a discussion of error handling, since the parser must respond gracefully to finding that its input cannot be generated by its grammar.

4.1.1 The Role of the Parser

In our compiler model, the parser obtains a string of tokens from the lexical analyzer, as shown in Fig. 4.1, and verifies that the string of token names can be generated by the grammar for the source language. We expect the parser to rep ort any syntax errors in an intelligible fashion and to recover from commonly occurring errors to continue processing the remainder of the program.

Conceptually, for well-formed programs, the parser constructs a parse tree and passes it to the rest of the compiler for further processing. In fact, the parse tree need not be constructed explicitly, since checking and translation actions can be interspersed with parsing, as we shall see. Thus, the parser and the rest of the front end could well be implemented by a single module.

Figure 4.1: Position of parser in compiler model

There are three general types of parsers for grammars: universal, top-down, and bottom-up. Universal parsing methods such as the Cocke-Younger-Kasami algorithm and Earley's algorithm can parse any grammar (see the bibliographic notes). These general methods are, however, to off inefficient to use in production compilers.

The methods commonly used in compilers can be classified as being either top-down or bottom-up. As implied by their names, top-down methods build parse trees from the top (root) to the bottom (leaves), while bottom-up methods start from the leaves and work their way up to the root. In either case, the input to the parser is scanned from left to right, one symbol at a time.

The most efficient top-down and bottom-up methods work only for subclasses of grammars, but several of these classes, particularly, LL and LR grammars, are expressive enough to describe most of the syntactic constructs in modern programming languages. Parsers implemented by hand often use LL grammars; for example, the predictive-parsing approach of Section 2.4.2 works for LL grammars. Parsers for the larger class of LR grammars are usually constructed using automated tools.

In this chapter, we assume that the output of the parser is some representation of the parse tree for the stream of tokens that comes from the lexical analyzer. In practice, there are a number of tasks that might be conducted during parsing, such as collecting information ab out various tokens into the symbol table, performing type checking and other kinds of semantic analysis, and generating intermediate co de. We have lumped all of these activities into the "rest of the front end" box in Fig. 4.1. These activities will be covered in detail in subsequent chapters.

4.1.2 Representative Grammars

Some of the grammars that will be examined in this chapter are presented here for ease of reference. Constructs that begin with keywords like while or int, are relatively easy to parse, because the keyword guides the choice of the grammar production that must be applied to match the input. We therefore concentrate on expressions, which present more of challenge, because of the associativity and precedence of operators.

Associativity and precedence are captured in the following grammar, which is similar to ones used in Chapter 2 for describing expressions, terms, and factors. E represents expressions consisting of terms separated by + signs, T represents terms consisting of factors separated by * signs, and F represents factors that can be either parenthesized expressions or identifiers:

E→E + T | T

T→T * F | F

F→( E ) | id

(4.1)

Expression grammar (4.1) belongs to the class of LR grammars that are suitable for bottom-up parsing. This grammar can be adapted to handle additional operators and additional levels of precedence. However, it cannot be used for top-down parsing because it is left recursive.

The following non-left-recursive variant of the expression grammar (4.1) will be used for top-down parsing:

E→T E'

E'→+ T E' | ϵ

T→F T'

T'→* F T' | ϵ

F→( E ) | id

(4.2)

The following grammar treats + and * alike, so it is useful for illustrating techniques for handling ambiguities during parsing:

E→E + E | E * E | ( E ) | id

(4.3)

Here, E represents expressions of all types. Grammar (4.3) permits more than one parse tree for expressions like a + b * c.

4.1.3 Syntax Error Handling

The remainder of this section considers the nature of syntactic errors and general strategies for error recovery. Two of these strategies, called panic-mode and phrase-level recovery, are discussed in more detail in connection with specific parsing methods.

If a compiler had to process only correct programs, its design and implementation would be simplified greatly. However, a compiler is expected to assist the programmer in locating and tracking down errors that inevitably creep into programs, despite the programmer's best efforts. Strikingly, few languages have been designed with error handling in mind, even though errors are so common-place. Our civilization would be radically different if spoken languages had the same requirements for syntactic accuracy as computer languages. Most programming language specifications do not describe how a compiler should respond to errors; error handling is left to the compiler designer. Planning the error handling right from the start can both simplify the structure of a compiler and improve its handling of errors.

Common programming errors can occur at many different levels.

  • Lexical errors include misspellings of identifiers, keywords, or operators -- e.g., the use of an identifier elipseSize instead of ellipseSize -- and missing quotes around text intended as a string.
  • Syntactic errors include misplaced semicolons or extra or missing braces; that is, "{" or "}." As another example, in C or Java, the appearance of a case statement without an enclosing switch is a syntactic error (however, this situation is usually allowed by the parser and caught later in the processing, as the compiler attempts to generate code).
  • Semantic errors include type mismatches between operators and operands, e.g., the return of a value in a Java method with result type void.
  • Logical errors can be anything from incorrect reasoning on the part of the programmer to the use in a C program of the assignment operator = instead of the comparison operator ==. The program containing = may be well formed; however, it may not reflect the programmer's intent.

The precision of parsing methods allows syntactic errors to be detected very efficiently. Several parsing methods, such as the LL and LR methods, detect an error as soon as possible; that is, when the stream of tokens from the lexical analyzer cannot be parsed further according to the grammar for the language.

More precisely, they have the viable-prefix property, meaning that they detect that an error has occurred as soon as they see a prefix of the input that cannot be completed to form a string in the language.

Another reason for emphasizing error recovery during parsing is that many errors app ear syntactic, whatever their cause, and are exposed when parsing cannot continue. A few semantic errors, such as type mismatches, can also be detected efficiently; however, accurate detection of semantic and logical errors at compile time is in general a difficult task.

The error handler in a parser has goals that are simple to state but challenging to realize:

  • Report the presence of errors clearly and accurately.
  • Recover from each error quickly enough to detect subsequent errors.
  • Add minimal overhead to the processing of correct programs.

Fortunately, common errors are simple ones, and a relatively straightforward error-handling mechanism often suffices.

How should an error handler rep ort the presence of an error? At the very least, it must rep ort the place in the source program where an error is detected, because there is a good chance that the actual error occurred within the previous few tokens. A common strategy is to print the offending line with a pointer to the position at which an error is detected.

4.1.4 Error-Recovery Strategies

Once an error is detected, how should the parser recover? Although no strategy has proven itself universally acceptable, a few methods have broad applicability. The simplest approach is for the parser to quit with an informative error message when it detects the first error. Additional errors are often uncovered if the parser can restore itself to a state where processing of the input can continue with reasonable hopes that the further processing will provide meaningful diagnostic information. If errors pile up, it is better for the compiler to give up after exceeding some error limit than to produce an annoying avalanche of "spurious" errors.

The balance of this section is devoted to the following recovery strategies: panic-mode, phrase-level, error-productions, and global-correction.

Panic-Mode Recovery

With this method, on discovering an error, the parser discards input symbols one at a time until one of a designated set of synchronizing tokens is found. The synchronizing tokens are usually delimiters, such as semicolon or }, whose role in the source program is clear and unambiguous. The compiler designer must select the synchronizing tokens appropriate for the source language. While panic-mode correction often skips a considerable amount of input without checking it for additional errors, it has the advantage of simplicity, and, unlike some methods to be considered later, is guaranteed not to go into an infinite loop.

Phrase-Level Recovery

On discovering an error, a parser may perform local correction on the remaining input; that is, it may replace a prefix of the remaining input by some string that allows the parser to continue. A typical local correction is to replace a comma by a semicolon, delete an extraneous semicolon, or insert a missing semicolon.

The choice of the local correction is left to the compiler designer. Of course, we must be careful to choose replacements that do not lead to infinite loops, as would be the case, for example, if we always inserted something on the input ahead of the current input symbol.

Phrase-level replacement has been used in several error-repairing compilers, as it can correct any input string. Its major drawback is the difficulty it has in coping with situations in which the actual error has occurred before the point of detection.

Error Productions

By anticipating common errors that might be encountered, we can augment the grammar for the language at hand with productions that generate the erroneous constructs. A parser constructed from a grammar augmented by these error productions detects the anticipated errors when an error production is used during parsing. The parser can then generate appropriate error diagnostics about the erroneous construct that has been recognized in the input.

Global Correction

Ideally, we would like a compiler to make as few changes as possible in processing an incorrect input string. There are algorithms for choosing a minimal sequence of changes to obtain a globally least-cost correction. Given an incorrect input string x and grammar G, these algorithms will find a parse tree for a related stringy, such that the number of insertions, deletions, and changes of tokens required to transform x into y is as small as possible. Unfortunately, these methods are in general too costly to implement in terms of time and space, so these techniques are currently only of theoretical interest.

Do note that a closest correct program may not be what the programmer had in mind. Nevertheless, the notion of least-cost correction provides a yardstick for evaluating error-recovery techniques, and has been used for finding optimal replacement strings for phrase-level recovery.

Chapter 4 Syntax Analysis的更多相关文章

  1. SQLChop、SQLWall(Druid)、PHP Syntax Parser Analysis

    catalog . introduction . sqlchop sourcecode analysis . SQLWall(Druid) . PHP Syntax Parser . SQL Pars ...

  2. Image Processing, Analysis & and Machine Vision - A MATLAB Companion

    Contents目录 Chapter 0: Introduction to the companion book本辅导书简介 Chapter 1: Introduction 简介 Viewing an ...

  3. Thinking in Java,Fourth Edition(Java 编程思想,第四版)学习笔记(八)之Reusing Classes

    The trick is to use the classes without soiling the existing code. 1. composition--simply create obj ...

  4. Compiler Theory(编译原理)、词法/语法/AST/中间代码优化在Webshell检测上的应用

    catalog . 引论 . 构建一个编译器的相关科学 . 程序设计语言基础 . 一个简单的语法制导翻译器 . 简单表达式的翻译器(源代码示例) . 词法分析 . 生成中间代码 . 词法分析器的实现 ...

  5. webkit模块介绍

    一.Webkit模块   用到的第三方库如下:   cairo 一个2D绘图库 casqt Unicode处理用的库,从QT中抽取部分代码形成的 expat 一个XML SAX解析器的库 freety ...

  6. How Browsers Work: Behind the scenes of modern web browsers

    http://www.html5rocks.com/en/tutorials/internals/howbrowserswork/#Parser_Lexer_combination Grammars ...

  7. Impala 源码分析-FE

    By yhluo 2015年7月29日 Impala 3 Comments Impala 源代码目录结构 SQL 解析 Impala 的 SQL 解析与执行计划生成部分是由 impala-fronte ...

  8. 44个JAVA代码质量管理工具(转)

    1. CodePro AnalytixIt’s a great tool (Eclipse plugin) for improving software quality. It has the nex ...

  9. 大白话Vue源码系列(02):编译器初探

    阅读目录 编译器代码藏在哪 Vue.prototype.$mount 构建 AST 的一般过程 Vue 构建的 AST 题接上文,上回书说到,Vue 的编译器模块相对独立且简单,那咱们就从这块入手,先 ...

随机推荐

  1. JavaScript关键字

    JavaScript关键字 制作人:全心全意 abstract continue finally instanceof private this boolean default float int p ...

  2. 有向图连通分量SCC

    在无向图中,如果从顶点vi到顶点vj有路径,则称vi和vj连通.如果图中任意两个顶点之间都连通,则称该图为连通图,否则,称该图为非连通图,则其中的极大连通子图称为连通分量,这里所谓的极大是指子图中包含 ...

  3. wamp下mysql错误提示乱码的解法

    出处:http://blog.csdn.net/jsship/article/details/42914217 运行mysql命令时,出现的错误提示是乱码 :    [Err] 1064 - Erre ...

  4. list嵌套,int与str的用法,replace

    #*************************replace(待改,改动值),返回很重要 A = [['libai',89]] A[0][0]=A[0][0].replace('a','af') ...

  5. python 用 PIL 模块 画验证码

    PIL 简单绘画 def get_code_img(request): from PIL import Image, ImageDraw, ImageFont import random def ra ...

  6. 开发调式时生成dump文件

    开发调式时,对程序生成dump文件:1:需要生成的时机,加Thread.sleep(600*1000).2:打开jvisualvm找到该程序进程号.3:jmap.

  7. 7-10 公路村村通(30 分)(最小生成树Prim算法)

    7-10 公路村村通(30 分) 现有村落间道路的统计数据表中,列出了有可能建设成标准公路的若干条道路的成本,求使每个村落都有公路连通所需要的最低成本. 输入格式: 输入数据包括城镇数目正整数N(≤1 ...

  8. 郁闷的出纳员(bzoj 1503)

    Description OIER公司是一家大型专业化软件公司,有着数以万计的员工.作为一名出纳员,我的任务之一便是统计每位员工的工资.这本来是一份不错的工作,但是令人郁闷的是,我们的老板反复无常,经常 ...

  9. 289. Game of Live

    According to the Wikipedia's article: "The Game of Life, also known simply as Life, is a cellul ...

  10. Spring实战读书笔记

    Spring实战读书笔记 Spring-core Spring之旅 - DI 和 AOP 概念 spring 的Bean容器 spring 的 核心模块 Spring的核心策略 POJO 最小侵入式编 ...