[Python] ply, a lexing and parsing module

Official manual: http://www.dabeaz.com/ply/ply.html

All of the examples below come from the official manual.

Take basic arithmetic as the running example: x = 3 + 42 * (s - t)

Lexical analysis:

The input first has to be split into individual tokens:

'x','=', '3', '+', '42', '*', '(', 's', '-', 't', ')'

Each piece is also given a name that identifies what kind of thing it is. These names are used later, during syntax analysis.

('ID','x'), ('EQUALS','='), ('NUMBER','3'),
('PLUS','+'), ('NUMBER','42'), ('TIMES','*'),
('LPAREN','('), ('ID','s'), ('MINUS','-'),
('ID','t'), ('RPAREN',')')
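Before bringing in ply, the splitting and naming step above can be sketched with nothing but the standard re module. This is only an illustration of the idea, not how ply is implemented; TOKEN_SPEC, MASTER_RE and tokenize are names made up for this sketch.

```python
import re

# Hypothetical token table for this sketch: (token name, regular expression).
TOKEN_SPEC = [
    ('NUMBER', r'\d+'),
    ('ID',     r'[A-Za-z_]\w*'),
    ('EQUALS', r'='),
    ('PLUS',   r'\+'),
    ('MINUS',  r'-'),
    ('TIMES',  r'\*'),
    ('LPAREN', r'\('),
    ('RPAREN', r'\)'),
    ('SKIP',   r'[ \t]+'),   # whitespace: matched, but not reported
]

# One combined pattern with a named group per token type.
MASTER_RE = re.compile('|'.join('(?P<%s>%s)' % pair for pair in TOKEN_SPEC))

def tokenize(text):
    """Return a list of (token name, matched text) pairs."""
    result = []
    pos = 0
    while pos < len(text):
        m = MASTER_RE.match(text, pos)
        if m is None:
            raise ValueError('Illegal character %r at position %d' % (text[pos], pos))
        if m.lastgroup != 'SKIP':
            result.append((m.lastgroup, m.group()))
        pos = m.end()
    return result

pairs = tokenize('x = 3 + 42 * (s - t)')
print(pairs)
```

The printed list should contain the same eleven pairs as listed above, in the same order.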

Example:

# ------------------------------------------------------------
# calclex.py
#
# tokenizer for a simple expression evaluator for
# numbers and +,-,*,/
# ------------------------------------------------------------
import ply.lex as lex

# List of token names.   This is always required
tokens = (
    'NUMBER',
    'PLUS',
    'MINUS',
    'TIMES',
    'DIVIDE',
    'LPAREN',
    'RPAREN',
)

# Regular expression rules for simple tokens
t_PLUS    = r'\+'
t_MINUS   = r'-'
t_TIMES   = r'\*'
t_DIVIDE  = r'/'
t_LPAREN  = r'\('
t_RPAREN  = r'\)'

# A regular expression rule with some action code
def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

# Define a rule so we can track line numbers
def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)

# A string containing ignored characters (spaces and tabs)
t_ignore  = ' \t'

# Error handling rule
def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

# Build the lexer
lexer = lex.lex()

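Notes:

The naming scheme is fixed. The list of token names must be called tokens, and the rule for each token must be named t_ followed by the token name (t_NUMBER for NUMBER, for example).

Simple tokens can be defined directly as string variables holding a regular expression. A token that needs action code has to be written as a function, and that function must carry the token's regular expression as its docstring.

In the example above, the body of t_NUMBER begins with r'\d+'. In an ordinary Python program that line would look pointless, but ply requires it: it is the pattern the lexer uses to recognize a NUMBER.

See the official manual for further caveats.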
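The reason a bare string at the top of a function can carry a pattern is that Python stores it as the function's docstring, which any code can read back through __doc__. The sketch below shows only that mechanism with the standard library. It is a simplification, not ply's actual code: here t_NUMBER receives the matched text directly rather than a token object, and match_number is a helper invented for the illustration.

```python
import re

def t_NUMBER(value):
    r'\d+'
    # Action code: convert the matched text to an integer.
    return int(value)

# The string on the first line of the body is available as the docstring.
pattern = re.compile(t_NUMBER.__doc__)

def match_number(text):
    """Match a NUMBER at the start of text, using the rule's docstring as the pattern."""
    m = pattern.match(text)
    if m is None:
        return None
    return t_NUMBER(m.group())

number = match_number('42 * 7')
print(t_NUMBER.__doc__, number)
```

One consequence worth keeping in mind: docstrings are removed when Python runs with the -OO option, so anything that depends on them needs extra care in that mode.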

Usage:

# Test it out
data = '''
3 + 4 * 10
  + -20 *2
'''

# Give the lexer some input
lexer.input(data)

# Tokenize
while True:
    tok = lexer.token()
    if not tok:
        break      # No more input
    print(tok)

Output (each token is printed as LexToken(type, value, lineno, lexpos)):

$ python example.py
LexToken(NUMBER,3,2,1)
LexToken(PLUS,'+',2,3)
LexToken(NUMBER,4,2,5)
LexToken(TIMES,'*',2,7)
LexToken(NUMBER,10,2,10)
LexToken(PLUS,'+',3,14)
LexToken(MINUS,'-',3,16)
LexToken(NUMBER,20,3,18)
LexToken(TIMES,'*',3,20)
LexToken(NUMBER,2,3,21)

Syntax analysis:

The grammar of a basic arithmetic expression looks like this:

expression : expression + term
           | expression - term
           | term

term       : term * factor
           | term / factor
           | factor

factor     : NUMBER
           | ( expression )
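To see how the three levels of this grammar give * and / a higher precedence than + and -, here is a small hand-written evaluator that follows the grammar rule by rule. Note that this is a recursive-descent sketch, a different technique from the table-driven LALR parser that ply generates, so the left-recursive rules are expressed as loops. The function name evaluate and the simplified regex tokenizer are assumptions made for this illustration.

```python
import re

def evaluate(text):
    """Evaluate an arithmetic expression following the grammar above."""
    # Simplified tokenizer for this sketch: characters that match nothing
    # (such as spaces) are silently skipped.
    tokens = re.findall(r'\d+|[-+*/()]', text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def expression():
        # expression : expression + term | expression - term | term
        value = term()
        while peek() in ('+', '-'):
            if advance() == '+':
                value += term()
            else:
                value -= term()
        return value

    def term():
        # term : term * factor | term / factor | factor
        value = factor()
        while peek() in ('*', '/'):
            if advance() == '*':
                value *= factor()
            else:
                value /= factor()
        return value

    def factor():
        # factor : NUMBER | ( expression )
        tok = advance()
        if tok == '(':
            value = expression()
            if advance() != ')':
                raise SyntaxError('expected )')
            return value
        if tok is None or not tok.isdigit():
            raise SyntaxError('unexpected token %r' % (tok,))
        return int(tok)

    result = expression()
    if peek() is not None:
        raise SyntaxError('unexpected token %r' % (peek(),))
    return result

print(evaluate('3 + 4 * 10'))
print(evaluate('(3 + 4) * 10'))
```

Because expression only ever combines whole terms, a multiplication is always finished before the surrounding addition sees its value, which is exactly what the layering of the grammar encodes.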

Implementing the arithmetic parser with ply:

# Yacc example
import ply.yacc as yacc

# Get the token map from the lexer.  This is required.
from calclex import tokens

def p_expression_plus(p):
    'expression : expression PLUS term'
    p[0] = p[1] + p[3]

def p_expression_minus(p):
    'expression : expression MINUS term'
    p[0] = p[1] - p[3]

def p_expression_term(p):
    'expression : term'
    p[0] = p[1]

def p_term_times(p):
    'term : term TIMES factor'
    p[0] = p[1] * p[3]

def p_term_div(p):
    'term : term DIVIDE factor'
    p[0] = p[1] / p[3]

def p_term_factor(p):
    'term : factor'
    p[0] = p[1]

def p_factor_num(p):
    'factor : NUMBER'
    p[0] = p[1]

def p_factor_expr(p):
    'factor : LPAREN expression RPAREN'
    p[0] = p[2]

# Error rule for syntax errors
def p_error(p):
    print("Syntax error in input!")

# Build the parser
parser = yacc.yacc()

while True:
    try:
        # Python 3: input() replaces the Python 2 raw_input()
        s = input('calc > ')
    except EOFError:
        break
    if not s:
        continue
    result = parser.parse(s)
    print(result)

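As with the lexer, the parser relies on a fixed naming scheme: every grammar rule function must start with p_. The rest of the name is free; the examples here follow the pattern p_<symbol>_<action>.

Each function must begin with a docstring that states the grammar rule, in the form "name : symbol symbol ...", with the symbols separated by whitespace. The symbols on the right of the colon can be token names from the lexer, such as PLUS and NUMBER above, or names defined by other rules. The name on the left of the colon is what the function produces. Inside the function, p[0] receives the result and p[1], p[2], ... hold the values of the symbols on the right, in order. In this way the parser combines tokens into a structured result.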
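The indexing convention is easy to check by hand, because a rule function is an ordinary Python function. In the sketch below a plain list stands in for the object that ply passes to the rule; that stand-in is an assumption made purely for illustration, and no parser is involved.

```python
def p_expression_plus(p):
    'expression : expression PLUS term'
    p[0] = p[1] + p[3]

# Stand-in for the object ply passes to the rule (an assumption for this sketch):
# slot 0 receives the result, and slots 1..3 hold the values of
# "expression", "PLUS" and "term", in that order.
p = [None, 3, '+', 4]
p_expression_plus(p)
print(p[0])
```

The slot numbers simply follow the position of each symbol in the docstring, which is why the operator value at index 2 is skipped in the addition.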

Rules that produce the same symbol can be merged, for example by defining addition and subtraction together:

def p_expression(p):
    '''expression : expression PLUS term
                  | expression MINUS term'''
    if p[2] == '+':
        p[0] = p[1] + p[3]
    elif p[2] == '-':
        p[0] = p[1] - p[3]

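Notice that the docstring has changed: when a rule has more than one alternative, the alternatives are separated by |, each starting on a new line. The format of this string is fixed, but the order of PLUS and MINUS makes no difference; either one can come first.

For more details, see the manual.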