Chapter 12  Lexer and parser generators (ocamllex, ocamlyacc)
This chapter describes two program generators: ocamllex, which produces a lexical analyzer from a set of regular expressions with associated semantic actions, and ocamlyacc, which produces a parser from a grammar with associated semantic actions.
These program generators are very close to the well-known lex and yacc commands that can be found in most C programming environments. This chapter assumes a working knowledge of lex and yacc: while it describes the input syntax for ocamllex and ocamlyacc and the main differences with lex and yacc, it does not explain the basics of writing a lexer or parser description in lex and yacc. Readers unfamiliar with lex and yacc are referred to ``Compilers: principles, techniques, and tools'' by Aho, Sethi and Ullman (Addison-Wesley, 1986), or ``Lex & Yacc'' by Levine, Mason and Brown (O'Reilly, 1992).
12.1 Overview of ocamllex
The ocamllex command produces a lexical analyzer from a set of regular expressions with attached semantic actions, in the style of lex. Assuming the input file is lexer.mll, executing

        ocamllex lexer.mll

produces Caml code for a lexical analyzer in file lexer.ml. This file defines one lexing function per entry point in the lexer definition. These functions have the same names as the entry points. Lexing functions take as argument a lexer buffer, and return the semantic attribute of the corresponding entry point.
Lexer buffers are an abstract data type implemented in the standard library module Lexing. The functions Lexing.from_channel, Lexing.from_string and Lexing.from_function create lexer buffers that read from an input channel, a character string, or any reading function, respectively. (See the description of module Lexing in chapter 20.)
When used in conjunction with a parser generated by ocamlyacc, the semantic actions compute a value belonging to the type token defined by the generated parsing module. (See the description of ocamlyacc below.)
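For instance, a generated lexer can be driven by hand as follows. This is a minimal sketch: the module name Lexer, the entry point token, and the fact that one call returns one token are assumptions drawn from a hypothetical lexer definition, not part of ocamllex itself.

(* Sketch: feed a string to a generated lexer and read one token.
   Lexer and token are hypothetical names from a lexer definition. *)
let () =
  let lexbuf = Lexing.from_string "1 + 2" in
  let tok = Lexer.token lexbuf in    (* one call = one token *)
  ignore tok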
12.2 Syntax of lexer definitions
The format of lexer definitions is as follows:
{ header }
let ident = regexp ...
rule entrypoint =
parse regexp { action }
| ...
| regexp { action }
and entrypoint =
parse ...
and ...
{ trailer }
Comments are delimited by (* and *), as in Caml.
12.2.1 Header and trailer
The header and trailer sections are arbitrary Caml text enclosed in curly braces. Either or both can be omitted. If present, the header text is copied as is at the beginning of the output file and the trailer text at the end. Typically, the header section contains the open directives required by the actions, and possibly some auxiliary functions used in the actions.
12.2.2 Naming regular expressions
Between the header and the entry points, one can give names to frequently-occurring regular expressions. This is written let ident = regexp. In regular expressions that follow this declaration, the identifier ident can be used as shorthand for regexp.
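For example, common character classes can be named once and reused. In this sketch, the token constructors INT and IDENT are illustrative (they would come from the parser's token type), not predefined:

let digit  = ['0'-'9']
let letter = ['a'-'z' 'A'-'Z']

rule token = parse
    digit+                      { INT (int_of_string (Lexing.lexeme lexbuf)) }
  | letter (letter | digit)*    { IDENT (Lexing.lexeme lexbuf) }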
12.2.3 Entry points
The names of the entry points must be valid identifiers for Caml values (starting with a lowercase letter). Each entry point becomes a Caml function that takes one argument of type Lexing.lexbuf. Characters are read from the Lexing.lexbuf argument and matched against the regular expressions provided in the rule, until a prefix of the input matches one of the rules. The corresponding action is then evaluated and returned as the result of the function.
If several regular expressions match a prefix of the input, the ``longest match'' rule applies: the regular expression that matches the longest prefix of the input is selected. In case of a tie, the regular expression that occurs earlier in the rule is selected.
12.2.4 Regular expressions
The regular expressions are in the style of lex, with a more Caml-like syntax.
'char'
- A character constant, with the same syntax as Objective Caml character constants. Match the denoted character.
_
- (Underscore.) Match any character.
eof
- Match the end of the lexer input.
  Note: On some systems, with interactive input, an end-of-file may be followed by more characters. However, ocamllex will not correctly handle regular expressions that contain eof followed by something else.
"string"
- A string constant, with the same syntax as Objective Caml string constants. Match the corresponding sequence of characters.
[ character-set ]
- Match any single character belonging to the given character set. Valid character sets are: single character constants 'c'; ranges of characters 'c1' - 'c2' (all characters between c1 and c2, inclusive); and the union of two or more character sets, denoted by concatenation.
[ ^ character-set ]
- Match any single character not belonging to the given character set.
regexp *
- (Repetition.) Match the concatenation of zero or more strings that match regexp.
regexp +
- (Strict repetition.) Match the concatenation of one or more strings that match regexp.
regexp ?
- (Option.) Match either the empty string, or a string matching regexp.
regexp1 | regexp2
- (Alternative.) Match any string that matches either regexp1 or regexp2.
regexp1 regexp2
- (Concatenation.) Match the concatenation of two strings, the first matching regexp1, the second matching regexp2.
( regexp )
- Match the same strings as regexp.
ident
- Reference the regular expression bound to ident by an earlier let ident = regexp definition.
Concerning the precedences of operators, * and + have highest precedence, followed by ?, then concatenation, then | (alternation).
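For example, the two named expressions below match the same strings, since repetition binds tighter than concatenation, which in turn binds tighter than alternation (an illustrative sketch):

let loose    = 'a' | 'b' 'c' *
let explicit = 'a' | ('b' ('c' *))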
12.2.5 Actions
The actions are arbitrary Caml expressions. They are evaluated in a context where the identifier lexbuf is bound to the current lexer buffer. Some typical uses for lexbuf, in conjunction with the operations on lexer buffers provided by the Lexing standard library module, are listed below.
Lexing.lexeme lexbuf
- Return the matched string.
Lexing.lexeme_char lexbuf n
- Return the nth character in the matched string. The first character corresponds to n = 0.
Lexing.lexeme_start lexbuf
- Return the absolute position in the input text of the beginning of the matched string. The first character read from the input text has position 0.
Lexing.lexeme_end lexbuf
- Return the absolute position in the input text of the end of the matched string. The first character read from the input text has position 0.
entrypoint lexbuf
- (Where entrypoint is the name of another entry point in the same lexer definition.) Recursively call the lexer on the given entry point. Useful for lexing nested comments, for example.
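The last idiom is the standard way to handle nested constructs. The sketch below skips nested comments by having the comment entry point call itself; the token name EOF and the error message are illustrative, not prescribed by ocamllex:

rule token = parse
    "(*"    { comment lexbuf; token lexbuf }    (* skip the comment, then resume *)
  | eof     { EOF }
  | _       { token lexbuf }

and comment = parse
    "(*"    { comment lexbuf; comment lexbuf }  (* nested: skip it, then finish the outer one *)
  | "*)"    { () }
  | eof     { failwith "unterminated comment" }
  | _       { comment lexbuf }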
12.2.6 Reserved identifiers
All identifiers starting with __ocaml_lex are reserved for use by ocamllex; do not use any such identifier in your programs.
12.3 Overview of ocamlyacc
The ocamlyacc command produces a parser from a context-free grammar specification with attached semantic actions, in the style of yacc. Assuming the input file is grammar.mly, executing

        ocamlyacc options grammar.mly

produces Caml code for a parser in the file grammar.ml, and its interface in file grammar.mli.
The generated module defines one parsing function per entry point in the grammar. These functions have the same names as the entry points. Parsing functions take as arguments a lexical analyzer (a function from lexer buffers to tokens) and a lexer buffer, and return the semantic attribute of the corresponding entry point. Lexical analyzer functions are usually generated from a lexer specification by the ocamllex program. Lexer buffers are an abstract data type implemented in the standard library module Lexing. Tokens are values from the concrete type token, defined in the interface file grammar.mli produced by ocamlyacc.
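Concretely, for a grammar with an entry point main declared with %type <int> main, the generated grammar.mli would contain declarations along these lines. This is only a sketch: the actual token constructors depend on the grammar's %token declarations.

type token =
    INT of int
  | PLUS
  | EOL
val main : (Lexing.lexbuf -> token) -> Lexing.lexbuf -> int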
12.4 Syntax of grammar definitions
Grammar definitions have the following format:
%{
header
%}
declarations
%%
rules
%%
trailer
Comments are enclosed between /* and */ (as in C) in the ``declarations'' and ``rules'' sections, and between (* and *) (as in Caml) in the ``header'' and ``trailer'' sections.
12.4.1 Header and trailer
The header and the trailer sections are Caml code that is copied as is into file grammar.ml. Both sections are optional. The header goes at the beginning of the output file; it usually contains open directives and auxiliary functions required by the semantic actions of the rules. The trailer goes at the end of the output file.
12.4.2 Declarations
Declarations are given one per line. They all start with a % sign.
%token symbol ... symbol
- Declare the given symbols as tokens (terminal symbols). These symbols are added as constant constructors for the token concrete type.
%token < type > symbol ... symbol
- Declare the given symbols as tokens with an attached attribute of the given type. These symbols are added as constructors with arguments of the given type for the token concrete type. The type part is an arbitrary Caml type expression, except that all type constructor names must be fully qualified (e.g. Modname.typename) for all types except standard built-in types, even if the proper open directives (e.g. open Modname) were given in the header section. That's because the header is copied only to the .ml output file, but not to the .mli output file, while the type part of a %token declaration is copied to both. (See the example after this list.)
%start symbol ... symbol
- Declare the given symbols as entry points for the grammar. For each entry point, a parsing function with the same name is defined in the output module. Non-terminals that are not declared as entry points have no such parsing function. Start symbols must be given a type with the %type directive below.
%type < type > symbol ... symbol
- Specify the type of the semantic attributes for the given symbols. This is mandatory for start symbols only. Other nonterminal symbols need not be given types by hand: these types will be inferred when running the output files through the Objective Caml compiler (unless the -s option is in effect). The type part is an arbitrary Caml type expression, except that all type constructor names must be fully qualified, as explained above for %token.
%left symbol ... symbol
%right symbol ... symbol
%nonassoc symbol ... symbol
- Associate precedences and associativities to the given symbols. All symbols on the same line are given the same precedence. They have higher precedence than symbols declared before in a %left, %right or %nonassoc line. They have lower precedence than symbols declared after in a %left, %right or %nonassoc line. The symbols are declared to associate to the left (%left), to the right (%right), or to be non-associative (%nonassoc). The symbols are usually tokens. They can also be dummy nonterminals, for use with the %prec directive inside the rules.
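For instance, with a supporting module Ast defining a type expr (hypothetical names), the token declaration must spell out the module path even though the header opens Ast:

%{ open Ast %}                /* copied to the .ml file only */
%token <Ast.expr> QUOTED      /* Ast.expr, not expr: this type also lands in the .mli */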
12.4.3 Rules
The syntax for rules is as usual:
nonterminal :
symbol ... symbol { semantic-action }
| ...
| symbol ... symbol { semantic-action }
;
Rules can also contain the %prec symbol directive in the right-hand side part, to override the default precedence and associativity of the rule with the precedence and associativity of the given symbol.
Semantic actions are arbitrary Caml expressions that are evaluated to produce the semantic attribute attached to the defined nonterminal. The semantic actions can access the semantic attributes of the symbols in the right-hand side of the rule with the $ notation: $1 is the attribute for the first (leftmost) symbol, $2 is the attribute for the second symbol, etc.
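For instance, in the rule below (which reappears in the complete example of section 12.6), $1 and $3 are the attributes of the two expr occurrences:

expr:
    expr PLUS expr   { $1 + $3 }
;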
The rules may contain the special symbol error to indicate resynchronization points, as in yacc.
Actions occurring in the middle of rules are not supported.
Nonterminal symbols are like regular Caml symbols, except that they cannot end with ' (single quote).
12.4.4 Error handling
Error recovery is supported as follows: when the parser reaches an error state (no grammar rules can apply), it calls a function named parse_error with the string "syntax error" as argument. The default parse_error function does nothing and returns, thus initiating error recovery (see below). The user can define a customized parse_error function in the header section of the grammar file.
The parser also enters error recovery mode if one of the grammar actions raises the Parsing.Parse_error exception.
In error recovery mode, the parser discards states from the stack until it reaches a place where the error token can be shifted. It then discards tokens from the input until it finds three successive tokens that can be accepted, and starts processing with the first of these. If no state can be uncovered where the error token can be shifted, then the parser aborts by raising the Parsing.Parse_error exception.
Refer to documentation on yacc for more details and guidance on how to use error recovery.
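For instance, a custom parse_error can be defined in the grammar header to report the message instead of silently starting recovery. A minimal sketch:

%{
(* Print the message on stderr; returning normally still triggers
   error recovery. Raising an exception here would instead abort the parse. *)
let parse_error msg =
  prerr_endline msg
%}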
12.5 Options
The ocamlyacc command recognizes the following options:
-v
- Generate a description of the parsing tables and a report on conflicts resulting from ambiguities in the grammar. The description is put in file grammar.output.
-b prefix
- Name the output files prefix.ml, prefix.mli, prefix.output, instead of the default naming convention.
At run-time, the ocamlyacc-generated parser can be debugged by setting the p option in the OCAMLRUNPARAM environment variable (see section 10.2). This causes the pushdown automaton executing the parser to print a trace of its actions (tokens shifted, rules reduced, etc.). The trace mentions rule numbers and state numbers that can be interpreted by looking at the file grammar.output generated by ocamlyacc -v.
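For example, with the calculator built in the next section (shell syntax):

        OCAMLRUNPARAM='p' ./calc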
12.6 A complete example
The all-time favorite: a desk calculator. This program reads arithmetic expressions on standard input, one per line, and prints their values. Here is the grammar definition:
/* File parser.mly */
%token <int> INT
%token PLUS MINUS TIMES DIV
%token LPAREN RPAREN
%token EOL
%left PLUS MINUS /* lowest precedence */
%left TIMES DIV /* medium precedence */
%nonassoc UMINUS /* highest precedence */
%start main /* the entry point */
%type <int> main
%%
main:
    expr EOL                  { $1 }
;
expr:
    INT                       { $1 }
  | LPAREN expr RPAREN        { $2 }
  | expr PLUS expr            { $1 + $3 }
  | expr MINUS expr           { $1 - $3 }
  | expr TIMES expr           { $1 * $3 }
  | expr DIV expr             { $1 / $3 }
  | MINUS expr %prec UMINUS   { - $2 }
;
Here is the definition for the corresponding lexer:
(* File lexer.mll *)
{
open Parser (* The type token is defined in parser.mli *)
exception Eof
}
rule token = parse
    [' ' '\t']   { token lexbuf }     (* skip blanks *)
  | ['\n' ]      { EOL }
  | ['0'-'9']+   { INT(int_of_string(Lexing.lexeme lexbuf)) }
  | '+'          { PLUS }
  | '-'          { MINUS }
  | '*'          { TIMES }
  | '/'          { DIV }
  | '('          { LPAREN }
  | ')'          { RPAREN }
  | eof          { raise Eof }
Here is the main program, which combines the parser with the lexer:
(* File calc.ml *)
let _ =
  try
    let lexbuf = Lexing.from_channel stdin in
    while true do
      let result = Parser.main Lexer.token lexbuf in
        print_int result; print_newline(); flush stdout
    done
  with Lexer.Eof ->
    exit 0
To compile everything, execute:
ocamllex lexer.mll # generates lexer.ml
ocamlyacc parser.mly # generates parser.ml and parser.mli
ocamlc -c parser.mli
ocamlc -c lexer.ml
ocamlc -c parser.ml
ocamlc -c calc.ml
ocamlc -o calc lexer.cmo parser.cmo calc.cmo
12.7 Common errors
ocamllex: transition table overflow, automaton is too big
- The deterministic automata generated by ocamllex are limited to at most 32767 transitions. The message above indicates that your lexer definition is too complex and overflows this limit. This is commonly caused by lexer definitions that have separate rules for each of the alphabetic keywords of the language, as in the following example.

rule token = parse
    "keyword1"   { KWD1 }
  | "keyword2"   { KWD2 }
  | ...
  | "keyword100" { KWD100 }
  | ['A'-'Z' 'a'-'z'] ['A'-'Z' 'a'-'z' '0'-'9' '_'] *
                 { IDENT(Lexing.lexeme lexbuf) }

To keep the generated automata small, rewrite those definitions with only one general ``identifier'' rule, followed by a hashtable lookup to separate keywords from identifiers:

{ let keyword_table = Hashtbl.create 53
  let _ =
    List.iter (fun (kwd, tok) -> Hashtbl.add keyword_table kwd tok)
              [ "keyword1", KWD1;
                "keyword2", KWD2; ...
                "keyword100", KWD100 ]
}
rule token = parse
    ['A'-'Z' 'a'-'z'] ['A'-'Z' 'a'-'z' '0'-'9' '_'] *
      { let s = Lexing.lexeme lexbuf in
        try
          Hashtbl.find keyword_table s
        with Not_found ->
          IDENT s }