peg.ms is a pattern-matching library for MiniScript, based on Parsing Expression Grammars (PEGs).
(To debug or pretty print large objects there is a ._str
method implemented for almost each class.)
(The examples use import "peg"
for simple looks. Use ensureImport
to avoid rebuilding pegGrammar
.)
Parse a comma-separated list of numbers surrounded by brackets.
import "peg"
listOfNumbers = new peg.Grammar
listOfNumbers.init "list : '[' space number (',' space number)* space ']' " +
"number <- ([+-]? [0-9]+) {tonumber} " +
"space <- ' '* "
listOfNumbers.capture "tonumber", function(match, subcaptures, arg, ctx)
return match.fragment.val
end function
print listOfNumbers.parse("[]").captures.list // []
print listOfNumbers.parse("[11,22,33]").captures.list // [11, 22, 33]
print listOfNumbers.parse("[ 44, 55 ]").captures.list // [44, 55]
Add "peg-ms/lib"
to the import paths or copy the file peg-ms/lib/peg.ms
to the project's lib/
.
Most of standard PEG syntax is implemented.
Syntax | Description | Precedence |
---|---|---|
'string' or "string" |
literal string | |
[class] |
character class | |
. |
any character | |
name |
non terminal | |
( p ) |
grouping / control of precedence | |
p ? |
optional match | 4 |
p * |
zero or more repetitions | 4 |
p + |
one or more repetitions | 4 |
& p |
and predicate | 3 |
! p |
not predicate | 3 |
p1 p2 |
concatenation | 2 |
p1 / p2 |
ordered choice | 1 |
name1 <- p1 name2 <- p2 ... |
rule definitions |
Comments begin with #
and go until the end of line.
Character classes and literals may include escapes: \t
, \r
, \n
, \[
, \]
, \'
, \"
, \\
and \uXXXX
(hexadecimal unicode).
Additionally, the library supports some nonstandard syntax:
Syntax | Description | Precedence |
---|---|---|
p {} |
anonymous capture | 4 |
p {name} |
function capture | 4 |
p {name:} |
key capture | 4 |
p <name> |
match-time action | 4 |
p <name!> |
match-time error | 4 |
name <- $ |
dynamic inclusion rule | |
name : p |
default initial rule | |
(+name) , (-name) , (/name) |
position flags | |
%if flag% p1 %else% p2 %end% |
if-patterns |
Suffixes {…}
and <…>
have the same precedence as ?*+
.
Create a grammar with new
and initialize it with .init(DEFINITIONS)
:
import "peg"
g = new peg.Grammar
g.init "A <- 'a'+"
If the grammar has many rules, one of them can be set as the initial rule by passing its name as the second parameter to .init
:
import "peg"
g = new peg.Grammar
g.init "A <- 'a'+ C <- B B <- A",
"C"
Alternatively, mark the initial rule with :
import "peg"
g = new peg.Grammar
g.init "A <- 'a'+ C: B B <- A" // (C: B) is the initial rule
To inspect the pattern tree built from the PEG string use ._str
:
import "peg"
g = new peg.Grammar
g.init "A <- 'a'+ C: B B <- A"
print g._str // Grammar(C) ...
To parse a subject text, use .parse(TEXT)
:
import "peg"
g = new peg.Grammar
g.init "A <- 'a'+ C: B B <- A"
result = g.parse("aaa")
The full list of parameters to .parse
are the following:
Parameter | Type | Default | Description |
---|---|---|---|
subject |
string |
text to parse | |
start |
number |
0 |
(optional) position in the subject to start parsing |
arg |
null |
(optional) parameter for capture and match-time action callbacks | |
initialRule |
string |
null |
(optional) initial rule |
The .parse
method returns a result map with the following fields:
Field | Type | Description |
---|---|---|
result.length |
number or null |
length of the parsed portion of a subject |
result.match |
Match or null |
tree of matched fragments |
result.errors |
list of Error s |
list of encountered errors |
result.captures.list |
list |
list of values captured by {} and {name} |
result.captures.map |
map |
map of values captured by {key:} |
result.capture |
if there's exactly one value in the result.captures.list , returns that value (otherwise null ) |
There's also a ._str
method to inspect the result object:
import "peg"
g = new peg.Grammar
g.init "A <- 'a'+"
result = g.parse("aaa")
print result._str // ParseResult( ... )
To check for success or failure, compare result.length
with null
:
import "peg"
g = new peg.Grammar
g.init "A <- 'a'+"
result = g.parse("aaa")
print result.length // 3 (successful parsing)
result = g.parse("bbb")
print result.length == null // 1 (parsing failed)
If the subject matches, the result.match
field will contain the tree of matched fragments (which also has ._str
):
import "peg"
g = new peg.Grammar
g.init "A <- 'a'+"
result = g.parse("aaa")
print result.match._str // Match( ... ) ...
Each Match
node has the following fields:
Field | Type | Description |
---|---|---|
match.start |
number |
starting position in the subject |
match.length |
number |
length of a match |
match.fragment |
string |
matched text (subject[start : start + length] ) |
match.pattern |
Pattern |
pattern object that produced the match |
match.children |
list of Match es |
submatches |
Match is only returned for inspection/debugging. To extract useful info from parsed texts, use captures:
import "peg"
g = new peg.Grammar
g.init "A <- 'a'+{} "
result = g.parse("aaa")
print result.captures.list // ["aaa"]
Here we used an anonymous capture suffix ({}
, empty braces) after 'a'+
. Changing it to 'a'{}+
will capture each individual 'a'
:
import "peg"
g = new peg.Grammar
g.init "A <- 'a' {} + "
result = g.parse("aaa")
print result.captures.list // ["a", "a", "a"]
The field result.captures.list
is always a list. If you know that only one value should be captured, use result.capture
(no -s):
import "peg"
g = new peg.Grammar
g.init "A <- 'a'+{}"
result = g.parse("aaa")
print result.capture // "aaa"
Another form of capture is a key capture suffix ({key:}
) which puts captured values into result.captures.map
:
import "peg"
g = new peg.Grammar
g.init "A <- [0-9]+{int:} '.' [0-9]+{fract:}"
result = g.parse("3.14")
print result.captures.map.int // 3
print result.captures.map.fract // 14
And at last, a function capture suffix ({name}
). It invokes a callback name
registered with grammar.capture(NAME, CALLBACK)
:
import "peg"
g = new peg.Grammar
g.init "A <- 'a' {} + {slashes}"
g.capture "slashes", function(_,subs,_,_)
return subs.list.join("/")
end function
result = g.parse("aaa")
print result.capture // "a/a/a"
In the example above we only use the second parameter to the callback (subs
). Similar to ParseResult.captures
it has fields .list
and .map
that contain values produced by subpatterns. The full callback's signature is as follows:
Parameter | Type | Description |
---|---|---|
match |
Match |
matched portion of the subject |
subcaptures |
{"list":..., "map":...} |
values captured by the subpatterns |
arg |
optional third parameter to .parse |
|
ctx |
ParseContext |
collection of data associated with current call to .parse |
The return value of the callback becomes the sole capture inside captures.list
(all other subcaptures from subs.list
and subs.map
get dropped). If it's desired to keep the subcaptures, then instead of returning a capture, modify subs
in place AND return the very subs
object:
import "peg"
g = new peg.Grammar
g.init "A <- 'a' {} + {slashes}"
g.capture "slashes", function(_,subs,_,_)
subs.list.push subs.list.join("/")
return subs
end function
result = g.parse("aaa")
print result.captures.list // ["a", "a", "a", "a/a/a"]
Another special case is when the callback returns an instance of the peg.Error
class which will immediately stop the parsing and populate result.errors
with the error.
import "peg"
g = new peg.Grammar
g.init "A <- 'a'+{crash}"
g.capture "crash", function(match,_,_,_)
error = new peg.Error
error.init "message - " + match.fragment
return error
end function
result = g.parse("aaa")
print result.length == null // 1 (failure)
print result.errors[0]._str // SemanticError(message - aaa)
Note, captures are never produced inside predicates &
and !
.
While capture callbacks are executed only after the whole grammar matched, the match-time actions (<name>
) are invoked immediately when the corresponding pattern matches. The callbacks are registered with grammar.matchTime(NAME, CALLBACK)
.
import "peg"
g = new peg.Grammar
g.init "A <- ('a' 'b' <upper> 'c') {}"
g.matchTime "upper", function(match,_,_,_)
match.fragment = match.fragment.upper
return match
end function
result = g.parse("abc")
print result.capture // "aBc"
In the example above we modify the match object converting it to upper case. The full callback's signature is this:
Parameter | Type | Description |
---|---|---|
match |
Match or null |
matched fragment (or null if the pattern failed) |
subcaptures |
function that returns {"list":..., "map":...} object |
captures from subpatterns (only evaluated if gets invoked) |
arg |
optional argument to .parse |
|
ctx |
ParseContext |
collection of data associated with current call to .parse |
Unlike in function captures, the parameter match
may be null
, so check before manipulating.
The return value determines whether the match succeeds or fails: returning null
means failure and returning a Match
objects means success (it doesn't have to be the same match object as in the match
parameter). Even if null
is returned and thus the match fails, the parsing itself doesn't stop because failure of a pattern doesn't yet mean failure of the whole grammar.
Still, if it's desired to signal an error, it can be pushed into ctx.syntaxErrors
or passed to ctx.addSyntaxError(NAME, MESSAGE)
:
import "peg"
g = new peg.Grammar
g.init "A <- 'a' 'b' <whine> 'c'"
g.matchTime "whine", function(_,_,_,ctx)
ctx.addSyntaxError "whine", "just complaining"
return null
end function
result = g.parse("abc")
print result.errors[0]._str // SyntaxError(just complaining; ... )
And the following black magic can be used to set capture values inside a match-time action:
import "peg"
g = new peg.Grammar
g.init "A <- 'a' 'b' <brackets> 'c'"
g.matchTime "brackets", function(match,_,_,_)
match.capture = function(_,_,_,_)
return "[" + match.fragment + "]"
end function
return match
end function
result = g.parse("abc")
print result.capture // "[b]"
A form with an exclamation mark (<name!>
) is a shortcut for "report syntax error if the pattern fails". You don't have to define a callback.
It's not necessary to use a PEG string to create a grammar. It's possible to build the grammar from library level classes:
import "peg"
g = new peg.Grammar
g.init
g.addRule "A",
peg.makeRuleRef("B").withCaptureTag
g.addRule "B",
peg.makeLiteral("foo")
g.setDefaultRule "A"
result = g.parse("foo")
print result.capture // "foo"
In the example above, the call to grammar.init
even without args is still mandatory.
The full list of pattern classes:
"Classic" new / init | Factory | Equivalent in PEG |
---|---|---|
p = new peg.Literal |
p = peg.makeLiteral("string") |
'string' |
p = new peg.CharSet |
p = peg.makeCharSet("!@#") |
[!@#] |
p = new peg.CharRange |
p = peg.makeCharRange("0", "9") |
[0-9] |
p = new peg.AnyChar |
p = peg.makeAnyChar |
. |
p = new peg.RuleRef |
p = peg.makeRuleRef("Foo") |
Foo |
p = new peg.Optional |
p = peg.makeOptional(q) |
q ? |
p = new peg.ZeroOrMore |
p = peg.makeZeroOrMore(q) |
q * |
p = new peg.OneOrMore |
p = peg.makeOneOrMore(q) |
q + |
p = new peg.And |
p = peg.makeAnd(q) |
& q |
p = new peg.Not |
p = peg.makeNot(q) |
! q |
p = new peg.Concat |
p = peg.makeConcat([q, r, ...]) |
q r ... |
p = new peg.Choice |
p = peg.makeChoice([q, r, ...]) |
q / r / ... |
p = new peg.Grammar |
p = peg.makeGrammar("A <- q B <- r") |
A <- q B <- r |
The Grammar
itself is also a pattern and so one grammar can be embedded inside another grammar.
The capture and match-time markers don't have pattern classes of their own. Instead, a property ("tag") is assigned to a subpattern p
:
Method | PEG |
---|---|
p.withCaptureTag |
p {} |
p.withCaptureTag "key:" |
p {key:} |
p.withCaptureTag "name" |
p {name} |
p.withMatchTimeTag "name" |
p <name> |
To create a grammar where some rules only become known at the time of invokation of the .parse
method, use dynamic inclusions (Rule <- $
). To make it work, the arg
parameter to .parse
should be a map with a pair RULENAME -> PATTERN
.
import "peg"
g = new peg.Grammar
g.init "A: B {} B <- $"
result = g.parse("foo", 0, {"B": peg.patternOrLiteral("foo")})
print result.capture // "foo"
(Most of these are ported from the LPeg manual.)
In this example we don't use capture syntax, so result.captures
is empty.
import "peg"
equalAB = new peg.Grammar
equalAB.init " S : 'a' B / 'b' A / '' " +
" A <- 'a' S / 'b' A A " +
" B <- 'b' S / 'a' B B "
result = equalAB.parse("abbabbbbbb")
print result.length // 4
print result.match.fragment // "abba"
print result.errors // []
print result.captures.list // []
print result.captures.map // {}
We know that the parse was successful because result.length
is not null
.
Here we use a simple (anonymous) capture for each individual number and then we add the whole list with {add}
.
import "peg"
addNumbers = new peg.Grammar
addNumbers.init " number <- [0-9]+ {} " +
" addNumbers : ( number ( ',' number ) * ) {add} "
addNumbers.capture "add", function(match, subcaptures, arg, ctx)
add = 0
for s in subcaptures.list
add += s.val
end for
return add
end function
print addNumbers.parse("10,30,43").capture // 83
This example illustrates the use of {key:}
captures:
import "peg"
stringUpper = new peg.Grammar
stringUpper.init " name <- [a-z]+ {str:} " +
" stringUpper : ( name '^' {up:} ? ) {upper} "
stringUpper.capture "upper", function(match, subcaptures, arg, ctx)
s = subcaptures.map.str
if subcaptures.map.hasIndex("up") then s = s.upper
return s
end function
print stringUpper.parse("foo").capture // "foo"
print stringUpper.parse("foo^").capture // "FOO"
Here the construction '^' {up:} ?
only produces a capture if optional ^
symbol is matched, so the {upper}
callback checks subcaptures.map.hasIndex("up")
to make its desision.
Note that a different sequence '^' ? {up:}
would instead capture an empty string if the ^
symbol was missing.
import "peg"
nameValueList = new peg.Grammar
nameValueList.init " space <- [ \t\n\r] * " +
" name <- [a-zA-Z] + {} space " +
" sep <- [,;] space " +
" pair <- ( name '=' space name sep ? ) " +
" list : pair * {list} "
nameValueList.capture "list", function(match, subcaptures, arg, ctx)
vals = {}
for i in range(0, subcaptures.list.len - 1, 2)
vals[subcaptures.list[i]] = subcaptures.list[i + 1]
end for
return vals
end function
vals = nameValueList.parse("a=b, c = hi; next = pi").capture
print vals.a // "b"
print vals.c // "hi"
print vals.next // "pi"
In this example we use a dynamic inclusion sep <- $
the pattern of which becomes known only in the call to the split
function.
import "peg"
splitString = new peg.Grammar
splitString.init " sep <- $ " +
" elem <- ( ! sep . ) * {} " +
" result : elem ( sep elem ) * "
split = function(s, sep)
sep = peg.patternOrLiteral(sep)
return splitString.parse(s, 0, {"sep": sep}).captures.list
end function
print split("a b c", " ") // ["a", "b", "c"]
print split("a//b//c", "//") // ["a", "b", "c"]
spaces = peg.makeOneOrMore(
peg.makeCharSet(" " + char(9)))
print split("a b c", spaces) // ["a", "b", "c"]
We used patternOrLiteral
to convert the sep
string to a Literal
pattern in case it wasn't already a pattern. This allows us to use complex patterns as string separators.
Parsing with Grammar.parse
is always rooted -- it tries to match only at the beginning of the subject (or beginning from index start
if it's given as a second parameter to parse
).
This example shows how to use recursive rules to search for a pattern anywhere in the subject.
import "peg"
searchForPatt = new peg.Grammar
searchForPatt.init " pattern <- $ " +
" search : pattern {patt} / . search "
searchForPatt.capture "patt", function(match, subcaptures, arg, ctx)
return [match.start, match.start + match.length]
end function
search = function(s, pattern)
pattern = peg.patternOrLiteral(pattern)
return searchForPatt.parse(s, 0, {"pattern": pattern}).capture
end function
result = search("hello world!", "world")
print result // [6, 11]
print "hello world!"[result[0]:result[1]] // "world"
import "peg"
balanced = new peg.Grammar
balanced.init " bparen : '(' ( ! [()] . / bparen ) * ')' "
isBalanced = function(s)
return balanced.parse(s).length == s.len
end function
print isBalanced("( ((() ) () ))") // 1
print isBalanced("((()") // 0
import "peg"
gsubGrammar = new peg.Grammar
gsubGrammar.init " pattern <- $ " +
" gsub : ( ( pattern {sub} / . {} ) * ) "
gsubGrammar.capture "sub", function(match, subcaptures, arg, ctx)
return arg.repl
end function
gsub = function(s, patt, repl)
arg = {}
arg.pattern = peg.patternOrLiteral(patt)
arg.repl = repl
return gsubGrammar.parse(s, 0, arg).captures.list.join("")
end function
print gsub("hello foo! goodbye foo!", "foo", "world") // "hello world! goodbye world!"
import "peg"
csvGrammar = new peg.Grammar
csvGrammar.init " quot <- [""] " +
" newline <- [\r\n] " +
" qstr <- quot ( ! quot . / quot quot ) * {qstr} quot " +
" str <- ( ! ( ',' / newline / quot ) . ) * {} " +
" field <- qstr / str " +
" record : field ( ',' field ) * ( newline / !. ) "
csvGrammar.capture "qstr", function(match, subcaptures, arg, ctx)
return match.fragment.replace("""" + """", """")
end function
print csvGrammar.parse("foo,""bar"",baz").captures.list // ["foo", "bar", "baz"]
This example demonstrates the use of the match-time actions.
import "peg"
longStr = new peg.Grammar
longStr.init " longstr : open ( ! close . ) * {} close " +
" open <- '[' '=' * <startEq> '[' " +
" close <- ']' '=' * <endEq> ']' "
longStr.matchTime "startEq", function(match, subcaptures, arg, ctx)
if match != null then
arg.startEq = match.fragment
end if
return match
end function
longStr.matchTime "endEq", function(match, subcaptures, arg, ctx)
if match != null then
if match.fragment.len == arg.startEq.len then return match
match = null
end if
return match
end function
print longStr.parse("[==[foo]=]bar]==]", 0, {}).capture // "foo]=]bar"
First, we used the match-time callback <startEq>
to record the sequence of initial =
symbols into arg
map.
Then we used <endEq>
to compare the number of closing =
s with what we've previously recorded. If they don't match we return null
forcing the parse
method to believe that the match failed and thus it needs to keep searching for the next closing ]=*]
.
import "peg"
arithExp = new peg.Grammar
arithExp.init " Space <- [ " + char(9) + char(10) + char(13) + "] * " +
" Number <- ( '-' ? [0-9] + ) {} Space " +
" TermOp <- [+-] {} Space " +
" FactorOp <- [*/] {} Space " +
" Open <- '(' Space " +
" Close <- ')' Space " +
" Exp : Space ( Term ( TermOp Term ) * ) {eval} " +
" Term <- ( Factor ( FactorOp Factor ) * ) {eval} " +
" Factor <- Number / Open Exp Close "
arithExp.capture "eval", function(match, subcaptures, arg, ctx)
_val = function(x)
if x isa string then return x.val else return x
end function
acc = _val(subcaptures.list[0])
for i in range(1, subcaptures.list.len - 1, 2)
op = subcaptures.list[i]
x = _val(subcaptures.list[i + 1])
if op == "+" then
acc += x
else if op == "-" then
acc -= x
else if op == "*" then
acc *= x
else if op == "/" then
acc /= x
end if
end for
return acc
end function
print arithExp.parse("3 + 5*9 / (1+1) - 12").capture // 13.5