Simple Matching

LPeg is a powerful notation for matching text data, which is more capable than Lua string patterns and standard regular expressions. However, like any language you need to know the basic words and how to combine them.

The best way to learn is to play with patterns in an interactive session, first by defining some shortcuts:

$ lua -llpeg
Lua 5.1.4 Copyright (C) 1994-2008 Lua.org, PUC-Rio
> match = lpeg.match -- match a pattern against a string
> P = lpeg.P -- match a string literally
> S = lpeg.S -- match anything in a set
> R = lpeg.R -- match anything in a range

If you dont' want to create shorcuts manually, you can do this:

> setmetatable(_ENV or _G, { __index = lpeg or require"lpeg" })
    

I don't recommend doing this in serious code, but to explore LPeg, it is very convenient.

Matches occur against the start of the string, and successful matches return the position immediately after the successful match, or nil if unsuccesful. (Here I'm using the fact that f'x' is equivalent to f('x') in Lua; using single quotes has the same meaning as double quotes.)

> = match(P'a','aaa')
2
> = match(P'a','123')
nil

It works like string.find, except it only returns one index.

You can match against ranges or sets of characters:

> = match(R'09','123')
2
> = match(S'123','123')
2

Matching more than one item is done with the ^ operator. In this case, the match is equivalent to the Lua pattern '^a+' - one or more occurrances of 'a':

> = match(P'a'^1,'aaa')
4

Combining patterns in order is done with the * operator. This is equivalent to '^ab*' - one 'a' followed by zero or more 'b's:

> = match(P'a'*P'b'^0,'abbc')
4

So far, lpeg is giving us a more verbose way of expressing regular expressions, but these patterns are composible - they can be easily built up from simpler patterns, without awkward string operations. In this way, lpeg patterns can be made to be easier to read than their equivalent regular expressions. Note that you can often leave out an explicit P call when constructing patterns, if one of the arguments is already a pattern:

> maybe_a = P'a'^-1  -- one or zero matches of 'a'
> match_ab = maybe_a * 'b'
> = match(match_ab, 'ab')
3
> = match(match_ab, 'b')
2
> = match(match_ab, 'aaab')
nil

The + operator means either one or the other pattern:

> either_ab = (P'a' + P'b')^1 -- sequence of either 'a' or 'b'
> = either_ab:match 'aaa'
4
> = either_ab:match 'bbaa'
5

Note that the pattern object has a match method!

Of course, S'ab'^1 would be a shorter way to say this, but the arguments here can be arbitary patterns.

Basic Captures

Getting the index after a match is all very well, and you can then use string.sub to extract the strings. But there are ways of explicitly asking for captures:

> C = lpeg.C  -- captures a match
> Ct = lpeg.Ct -- a table with all captures from the pattern

The first is equivalent to how '(...)' is used in Lua patterns (or '\(...\)' in regular expressions)

> digit = R'09' -- anything from '0' to '9'
> digits = digit^1 -- a sequence of at least one digit
> cdigits= C(digits) -- capture digits
> = cdigits:match '123'
123

So to get the string value, enclose the pattern in C.

This pattern doesn't cover a general integer, which may have a '+' or '-' up front:

> int = S'+-'^-1 * digits
> = match(C(int),'+23')
+23

Unlike with Lua patterns or regular expressions, you don't have to worry about escaping 'magic' characters - every character in a string stands for itself: '(','+','*', etc are just their ASCII equilvalents.

A special kind of capture is provided by the / operator - it passes the captured string through a function or a table. Here I'm adding one to the result, just to show that the result has been converted into a number with tonumber:

> =  match(int/tonumber,'+123') + 1
124

Note that multiple captures can be returned by a match, just like string.match. This is equivalent to '^(a+)(b+)':

> = match(C(P'a'^1) * C(P'b'^1), 'aabbbb')
aa bbbb

Building more complicated Patterns

Consider general floating-point numbers:

> function maybe(p) return p^-1 end
> digits = R'09'^1
> mpm = maybe(S'+-')
> dot = '.'
> exp = S'eE'
> float = mpm * digits * maybe(dot*digits) * maybe(exp*mpm*digits)
> = match(C(float),'2.3')
2.3
> = match(C(float),'-2')
-2
> = match(C(float),'2e-02')
2e-02

This lpeg pattern is easier to read than the regular expression equivalent '[-+]?[0-9]+\.?[0-9]+([eE][+-]?[0-9]+)?'; shorter is always better! One reason is that we can work with patterns as expressions: factor out common patterns, write functions for convenience and clarity, etc. Note that there is no penalty for writing things out in this fashion; lpeg remains a very fast way to parse text!

More complicated structures can be composed from these building blocks. Consider the task of parsing a list of floating point numbers. A list is a number followed by zero or more groups consisting of a comma and a number:

> listf = C(float) * (',' * C(float))^0
> = listf:match '2,3,4'
2 3 4

That's cool, but it would be even cooler to have this as an actual list. This is where lpeg.Ct comes in; it collects all the captures within a pattern into a table.

= match(Ct(listf),'1,2,3')
table: 0x84fe628

Stock Lua does not pretty-print tables, but you can use [? Microlight] for this job:

> tostring = require 'ml'.tstring
> = match(Ct(listf),'1,2,3')
{"1","2","3"}

The values are still strings. It's better to write listf so that it converts its captures:

> floatc = float/tonumber
> listf = floatc * (',' * floatc)^0

This way of capturing lists is very general, since you can put any expression that captures in the place of floatc. But this list pattern is still too restrictive, because generally we want to ignore whitespace

> sp = P' '^0  -- zero or more spaces (like '%s*')
> function space(pat) return sp * pat * sp end -- surrond a pattern with optional space
> floatc = space(float/tonumber)
> listc = floatc * (',' * floatc)^0
> = match(Ct(listc),' 1,2, 3')
{1,2,3}

It's a matter of taste, but here I prefer to allow optional space around the items, rather than allowing space specifically around the delimiter ','.

With lpeg, we can be programmers again with pattern matching, and reuse patterns:

function list(pat)
pat = space(pat)
return pat * (',' * pat)^0
end

So, a list of identifiers (according to the usual rules):

> idenchar = R('AZ','az')+P'_'
> iden = idenchar * (idenchar+R'09')^0
> = list(C(iden)):match 'hello, dolly, _x, s23'
"hello" "dolly" "_x" "s23"

Using explicit ranges seems old-fashioned and error-prone. A more portable solution is to use the lpeg equivalent of character classes, which are by definition locale-independent:

> l = {}
> lpeg.locale(l)
> for k in pairs(l) do print(k) end
"punct"
"alpha"
"alnum"
"digit"
"graph"
"xdigit"
"upper"
"space"
"print"
"cntrl"
"lower"
> iden = (l.alpha+P'_') * (l.alnum+P'_')^0

Given this definition of list, it's easy to define a simple subset of the common CSV format, where each record is a list separated by a linefeed:

> rlistf =  list(float/tonumber)
> csv = Ct( (Ct(listf)+'\n')^1 )
> = csv:match '1,2.3,3\n10,20, 30\n'
{{1,2.3,3},{10,20,30}}

One good reason to learn lpeg is that it performs very satisfactorily. This pattern is a lot faster than parsing the data with Lua string matching.

String Substitution

I will show that lpeg can do all that string.gsub can do, and more generally and flexibly.

One operator that we have not used yet is -, which means 'either/or'. Consider the problem of matching double-quoted strings. In the simplest case, they are a double-quote followed by any characters which are not a double-quote, followed by a closing double-quote. P(1) matches any single character, i.e. it is the equivalent of '.' in string patterns. A string may be empty, so we match zero or more non-quote characters:

> Q = P'"'
> str = Q * (P(1) - Q)^0 * Q
> = C(str):match '"hello"'
"\"hello\""

Or you may want to extract the contents of the string, without quotes. In this context, just using 1 instead of P(1) is not ambiguous, and in fact this is how you will usually see this 'any x which is not a P' pattern:

> str2 = Q * C((1 - Q)^0) * Q
> = str2:match '"hello"'
"hello"

This pattern is obviously generalizable; often the terminating pattern is not the same as the final pattern:

function extract_quote(openp,endp)
openp = P(openp)
endp = endp and P(endp) or openp
local upto_endp = (1 - endp)^1
return openp * C(upto_endp) * endp
end
> return  extract_quote('(',')'):match '(and more)'
"and more"
> = extract_quote('[[',']]'):match '[[long string]]'
"long string"

Now consider translating Markdown code (back-slash enclosed text) into the format understood by the Lua wiki (double-brace enclosed text). The naive way is to extract the string and concatenate the result, but this is clumsy and (as we will see) limit our options tremendously.

function subst(openp,repl,endp)
openp = P(openp)
endp = endp and P(endp) or openp
local upto_endp = (1 - endp)^1
return openp * C(upto_endp)/repl * endp
end
> =  subst('`','{{%1}}'):match '`code`'
"{{code}}"
> = subst('_',"''%1''"):match '_italics_'
"''italics''"

We've come across the capture-processing operator / before, using tonumber to convert numbers. It also understands strings in a very similar format to string.gsub, where %n means the n-th capture.

This operation can be expressed exactly as:

> = string.gsub('_italics_','^_([^_]+)_',"''%1''")
"''italics''"

But the advantage is that we don't have to build up a custom string pattern and worry about escaping 'magic' characters like '(' and ').

lpeg.Cs is a substitution capture, and it provides a more general module of global string substitution. In the lpeg manual, there is this equivalent to string.gsub:

function gsub (s, patt, repl)
patt = P(patt)
local p = Cs ((patt / repl + 1)^0)
return p:match(s)
end > = gsub('hello dog, dog!','dog','cat')
"hello cat, cat!"

To understand the difference, here's that pattern using plain C:

> p = C((P'dog'/'cat' + 1)^0)
> = p:match 'hello dog, dog!'
"hello dog, dog!" "cat" "cat"

The C here just captures the whole match, and each '/' adds a new capture with the value of the replacement string.

With Cs, everything gets captured, and a string is built out of all the captures. Some of those captures get modified by '/', and so we have substitutions.

In Markdown, block quoted lines begin with '> '.

lf = P'\n'
rest_of_line_nl = C((1 - lf)^0*lf) -- capture chars upto \n
quoted_line = '> '*rest_of_line_nl -- block quote lines start with '> '
-- collect the quoted lines and put inside [[[..]]]
quote = Cs (quoted_line^1)/"[[[\n%1]]]\n" > = quote:match '> hello\n> dolly\n'
"[[[
> hello
> dolly
]]]
"

That's not quite right - Cs captures everything, including the '> '. But we can force some captures to return empty strings: }}}

function empty(p)
return C(p)/''
end quoted_line = empty ('> ') * rest_of_line_nl
...

Now things will work correctly!

Here is the program used to convert this document from Markdown to Lua wiki format:

local lpeg = require 'lpeg'

local P,S,C,Cs,Cg = lpeg.P,lpeg.S,lpeg.C,lpeg.Cs,lpeg.Cg

local test = [[
## A title here _we go_ and `a:bonzo()`: one line
two line
three line and `more_or_less_something` [A reference](http://bonzo.dog) > quoted
> lines ]] function subst(openp,repl,endp)
openp = P(openp) -- make sure it's a pattern
endp = endp and P(endp) or openp
-- pattern is 'bracket followed by any number of non-bracket followed by bracket'
local contents = C((1 - endp)^1)
local patt = openp * contents * endp
if repl then patt = patt/repl end
return patt
end function empty(p)
return C(p)/''
end lf = P'\n'
rest_of_line = C((1 - lf)^1)
rest_of_line_nl = C((1 - lf)^0*lf) -- indented code block
indent = P'\t' + P' '
indented = empty(indent)*rest_of_line_nl
-- which we'll assume are Lua code
block = Cs(indented^1)/' [[[!Lua\n%1]]]\n' -- use > to get simple quoted block
quoted_line = empty('> ')*rest_of_line_nl
quote = Cs (quoted_line^1)/"[[[\n%1]]]\n" code = subst('`','{{%1}}')
italic = subst('_',"''%1''")
bold = subst('**',"'''%1'''")
rest_of_line = C((1 - lf)^1)
title1 = P'##' * rest_of_line/'=== %1 ==='
title2 = P'###' * rest_of_line/'== %1 ==' url = (subst('[',nil,']')*subst('(',nil,')'))/'[%2 %1]' item = block + title1 + title2 + code + italic + bold + quote + url + 1
text = Cs(item^1) if arg[1] then
local f = io.open(arg[1])
test = f:read '*a'
f:close()
end print(text:match(test))

Due to an escaping problem with this Wiki, I had to substitute '[' for '{', etc in this source. Be warned!

SteveDonovan, 12 June 2012


Group and back captures

This section will dissect the behavior of group and back captures (Cg() and Cb() respectively).

Group captures (hereafter "groups") come in two flavors: named and anonymous.

    Cg(C"baz" * C"qux", "name") -- named group.

    Cg(C"foo" * C"bar")         -- anonymous group.
    

Let's first get the easy one out of the way: named groups inside table captures.

    Ct(Cc"foo" * Cg(Cc"bar" * Cc"baz", "TAG")* Cc"qux"):match""
--> { "foo", "qux", TAG = "bar" }

In a table capture, the value of the first capture inside the group ("bar") is assigned to the corresponding key ("TAG") in the table. As you can see, Cc"baz" got lost in the process. The label must be a string (or a number that will be automatically converted to a string).

Note that the group must be a direct child of the table, otherwise, the table capture will not handle it:

    Ct(C(Cg(1,"foo"))):match"a"
--> {"a"}

Of captures and values

Before delving into groups proper, we must first explore a subtlety in the way captures handle their subcaptures.

Some captures operate on the values produced by their subcaptures, while others operate on the capture objects. This is sometimes counter-intuitive.

Let's take the following pattern:

    (1 * C( C"b" * C"c" ) * 1):match"abcd"
--> "bc", "b", "c"

As you can see, it inserts three values in the capture stream.

Let's wrap it in a table capture:

    Ct(1 * C( C"b" * C"c" ) * 1):match"abcd"
--> { "bc", "b", "c" }

Ct() operates on values. In the last example, the three values that are inserted in order in the table.

Now, let's try a substitution capture:

    Cs(1 * C( C"b" * C"c" ) * 1):match"abcd"
--> "abcd"

Cs() operates on captures. It scans the first level of its nested captures, and only takes the first value of each one. In the above example, "b" and "c" are thus discarded. Here's another example that may make things more clear:

    function the_func (bcd)
assert(bcd == "bcd")
return "B", "C", "D"
end Ct(1 * ( C"bcd" / the_func ) * 1):match"abcde"
--> {"B", "C", "D"} -- All values are inserted. Cs(1 * ( C"bcd" / the_func ) * 1):match"abcde"
--> "aBe" -- the "C" and "D" have been discarded.

A more detailed account of the by value / by capture behaviour of each kind of capture will be the topic of another section.


Capture opacity

Another important thing to realise is that most captures shadow their subcaptures, but some don't. As you can see in the last example, the value of C"bcd" is passed to the /function capture, but it doesn't end in the final capture list. Ct() and Cs() are also opaque in this regard. They only produce, resectively, one table or one string.

On the other hand, C() is transparent. As we've seen above, the subcaptures of C() are also inserted in the stream.

    C(C"b" * C"c"):match"bc" --> "bc", "b", "c"
    

The only transparent captures are C() and the anonymous Cg().


Anonymous groups

Cg() wraps its subcaptures in a single capture object, but doesn't produce anything of its own. Depending on the context, either all of its values will be inserted, or only the first one.

Here are a few examples for the anonymous groups:

    (1 * Cg(C"b" * C"c" * C"d") * 1):match"abcde"
--> "b", "c", "d" Ct(1 * Cg(C"b" * C"c" * C"d") * 1):match"abcde"
--> { "b", "c", "d" } Cs(1 * Cg(C"b" * C"c" * C"d") * 1):match"abcde"
--> "abe" -- "c" and "d" are dropped.

Where this behavior is useful? In folding captures.

Let's write a very basic calculator, that adds or substracts one digit numbers.

    function calc(a, op, b)
a, b = tonumber(a), tonumber(b)
if op == "+" then
return a + b
else
return a - b
end
end digit = R"09" calculate = Cf(
C(digit) * Cg( C(S"+-") * C(digit) )^0
, calc
)
calculate:match"1+2-3+4"
--> 4

The capture tree will look like this [*]:

    {"Cf", func = calc, children = {
{"C", val = "1"},
{"Cg", children = {
{"C", val = "+"},
{"C", val = "2"}
} },
{"Cg", children = {
{"C", val = "-"},
{"C", val = "3"}
} },
{"Cg", children = {
{"C", val = "+"},
{"C", val = "4"}
} }
} }

You probably see where this is going... Like Cs(), Cf() operates on capture objects. It will first extract the first value of the first capture, and use it as the initial value. If there are no more captures, this value becomes the value of the Cf().

But we have more captures. In our case, it will pass all the values of the second capture (the group) to calc(), tacked after the value of the first one. Here's the evaluation of the above Cf()

    first_arg = "1"
next_ones: "+", "2"
first_arg = calc("1", "+", "2") -- 3, calc() returns numbers next_ones: "-", "3"
first_arg = calc(3, "-", "3") next_ones: "+", "4"
first_arg = calc(0, "+", "4") return first_arg -- Tadaaaa.

[*] Actually, at match time, the capture objects only store their bounds and auxiliary data (like calc() for the Cf()). The actual values are produced sequencially after the match has completed, but, it makes things more clear as displayed above. In the above example, the values of the nested C() and Cg(C(),C()) are actually produced one at a time, at each corresponding cycle of the folding process.


Named groups

The ( named Cg() / Cb() ) pair has a behavior similar to the anonymous Cg(), but the values captured in the named Cg() are not inserted locally. They are teleported, and end up inserted in the stream at the place of the Cb().

Here's an example:

    ( 1 * Cg(C"bc", "FOOO") * C"d" * 1 * Cb"FOOO" * Cb"FOOO"):match"abcde"
-- > "d", "bc", "bc"

Warp... and duplication if there is more than one Cb(). Another example:

    ( 1 * Cg(C"b" * C"c" * C"d", "FOOO") * C"e" * Ct(Cb"FOOO") ):match"abcde"
--> "e", { "b", "c", "d" }

Usually, for the sake of clarty, in my code, I alias Cg() to Tag(). I use the former for anonymous groups, and the latter for named groups.

Cb"FOOO" will look back for a corresponding Cg() that has succeeded. It goes back and up in the tree, and consumes captures. In other words, it searches its elder siblings, and the elder siblings of its parents, but not the parents themselves. Neither does it test the children of the siblings/siblings of ancestors.

It proceeds as follows (start from the [ #### ] <--- [[ START ]] and follow the numbers back up).

The [ numbered ] captures are the captures that are tested on order. The ones marked with [ ** ] are not, for the various reasons listed. This is hairy, but AFAICT complete.

    Cg(-- [ ** ] ... This one would have been seen,
-- if the search hadn't stopped at *the one*.
"Too late, mate."
, "~@~"
) * Cg( -- [ 3 ] The search ends here. <--------------[[ Stop ]]
"This is *the one*!"
, "~@~"
) * Cg(-- [ ** ] ... The great grand parent.
-- Cg with the right tag, but direct ancestor,
-- thus not checked. Cg( -- [ 2 ] ... Cg, but not the right tag. Skipped.
Cg( -- [ ** ] good tag but masked by the parent (whatever its type)
"Masked"
, "~@~"
)
, "BADTAG"
) * C( -- [ ** ] ... grand parent. Not even checked. (
Cg( -- [ ** ] ... This subpattern will fail after Cg() succeeds.
-- The group is thus removed from the capture tree, and will
-- not be found dureing the lookup.
"FAIL"
, "~@~"
)
* false
) + Cmt( -- [ ** ] ... Direct parent. Not assessed.
C(1) -- [ 1 ] ... Not a Cg. Skip. * Cb"~@~" -- [ #### ] <----------------- [[ START HERE ]] --
, function(subject, index, cap1, cap2)
return assert(cap2 == "This is *the one*!")
end
)
)
, "~@~" -- [ ** ] This label goes with the great grand parent.
)

[转]http://lua-users.org/wiki/LpegTutorial的更多相关文章

  1. 【搬运工】——初识Lua(转)

    使用 Lua 编写可嵌入式脚本 Lua 提供了高级抽象,却又没失去与硬件的关联. 虽然编译性编程语言和脚本语言各自具有自己独特的优点,但是如果我们使用这两种类型的语言来编写大型的应用程序会是什么样子呢 ...

  2. [翻译]lpeg入门教程

    原文地址:http://lua-users.org/wiki/LpegTutorial 简单匹配 LPeg是一个用于文本匹配的有力表达方式,比Lua原生的字符串匹配和标准正则表达式更优异.但是,就像其 ...

  3. 025_lua脚本语言

    一.--cat /opt/nginx/conf/conf.dlua_package_path '/opt/nginx/conf/lua/?.lua;;'; --lua模块路径,其中”;;”表示默认搜索 ...

  4. 打印Lua的Table对象

    小伙伴们再也不用为打印lua的Table对象而苦恼了, 本人曾也苦恼过,哈哈 不过今天刚完成了这个东西, 以前在网上搜过打印table的脚本,但是都感觉很不理想,于是,自己造轮子了~ 打印的效果,自己 ...

  5. wireshark lua脚本

    1.目的:解析rssp2协议   2.如何使用wireshark lua插件 将编写的(假设为rssp2.lua)lua文本,放入wireshark 安装目录下,放哪里都行只要dofile添加了路径. ...

  6. lua UT测试工具

    luaunit Luaunit is a unit-testing framework for Lua, in the spirit of many others unit-testing frame ...

  7. LuaSrcDiet工具介绍(lua源码处理软件)

    Diet Food Diet (nutrition), the sum of the food consumed by an organism or group Dieting, the delibe ...

  8. Prototype based langue LUA

    Prototype-based programming https://en.wikipedia.org/wiki/Prototype-based_programming Prototype-base ...

  9. lua unit test introduction

    Unit Test Unit testing is about testing your code during development, not in production. Typically y ...

随机推荐

  1. WPFの静态资源(StaticResource)和动态资源(DynamicResource)

    下面是前台代码: <Window.Resources>        <TextBlock x:Key="res1" Text="好好学习"/ ...

  2. C#浅拷贝与深拷贝区别

    也许会有人这样解释C# 中浅拷贝与深拷贝区别: 浅拷贝是对引用类型拷贝地址,对值类型直接进行拷贝. 不能说它完全错误,但至少还不够严谨.比如:string 类型咋说? 其实,我们可以通过实践来寻找答案 ...

  3. SecureCRT配色

    SecureCRT是一款支持SSH(SSH1和SSH2)的终端仿真程序,简单的说是Windows下登录UNIX或Linux服务器主机的软件.SecureCRT支持SSH,同时支持Telnet和rlog ...

  4. CSS的压缩 方法与解压

    为什么要压缩CSS? 1.大网站节约流量 2.加快访问速度 工具:Dreamweaver(手工替换,个人感觉任何文本编辑器都可以)不过DW可以还原 CSS压缩与CSS代码压缩还原方法,CSS压缩工具有 ...

  5. Deep Learning 11_深度学习UFLDL教程:数据预处理(斯坦福大学深度学习教程)

    理论知识:UFLDL数据预处理和http://www.cnblogs.com/tornadomeet/archive/2013/04/20/3033149.html 数据预处理是深度学习中非常重要的一 ...

  6. Python中类的特殊方法详解

    本文和大家分享的主要是python语言中类的特殊方法相关用法,希望对大家有帮助. 构造序列 1._len_(self) 2._getitem_(self,key) 3._setitem_(self,k ...

  7. Python类的实例属性详解

    实例属性 1.类被实例化后才会具有的属性 2.一般在_init_()方法中创建并初始化 3.直接使用即定义:self.<属性名> 4.引用方法:self.<属性名> 5.sel ...

  8. SPSS数据分析—多元方差分析

    之前的单因素方差分析和多因素方差分析,都在针对一个因变量,而实际工作中,经常会碰到多个因变量的情况,如果单纯的将其拆分为多个单因变量的做法不妥,需要使用多元方差分析或因子分析 多元方差分析与一元方差分 ...

  9. 终于解决了IE8不支持数组的indexOf方法,array的IndexOf方法

    /* 终于解决了IE8不支持数组的indexOf方法 */ if (!Array.prototype.indexOf) { Array.prototype.indexOf = function (el ...

  10. Codeforces Round #371 (Div. 2) C. Sonya and Queries

    题目链接 分析:01trie树,很容易就看出来了,也没什么好说的.WA了一发是因为没有看见如果数字位数大于01序列的时候01序列也要补全0.我没有晚上爬起来打,白天发现过的人极多. /******** ...