This document describes the syntax of the Ohm language, which is a variant of parsing expression grammars (PEGs). If you have experience with PEGs, the Ohm syntax will mostly look familiar, but there are a few important differences to note:
- When naming rules, case matters: whitespace is implicitly skipped inside a rule application if the rule name begins with an uppercase letter. For further information, see Syntactic vs. Lexical Rules.
- Grammars are purely about recognition: they do not contain semantic actions (those are defined separately) or bindings. The separation of semantic actions is one of the defining features of Ohm -- we believe that it improves modularity and makes both grammars and semantics easier to understand.
- Alternation expressions support case names, which are used in inline rule declarations. This makes semantic actions for alternation expressions simpler and less error-prone.
- Ohm does not (yet) support semantic predicates.
Ohm is closely related to OMeta, another PEG-based language for parsing and pattern matching. Like OMeta, Ohm supports a few features not supported by many PEG parsing frameworks:
- Rule applications can accept parameters. This makes it possible to write higher-order rules, such as the built-in
ListOf
rule. - Grammars can be extended in an object-oriented way -- see Defining, Extending, and Overriding Rules.
Arithmetic {
Expr = "1 + 1"
}
This is a grammar named "Arithmetic", which has a single rule named "Expr". The right hand side of Expr is known as a "rule body". A rule body may be any valid parsing expression.
Here is a full list of the different kinds of parsing expressions supported by Ohm:
"hello there"
Matches exactly the characters contained inside the quotation marks.
Special characters ("
, \
, and '
) can be escaped with a backslash -- e.g., "\""
will match a literal quote character in the input stream. Other valid escape sequences include: \b
(backspace), \f
(form feed), \n
(line feed), \r
(carriage return), and \t
(tab), as well as \x
followed by 2 hex digits and \u
followed by 4 hex digits, for matching characters by code point.
start..end
Matches exactly one character whose character code is between the terminals start and end (inclusive). E.g., "a".."c"
will match 'a'
, 'b'
, or 'c'
. Note: start and end must be 1-character Terminal expressions.
ruleName
Matches the body of the rule named ruleName. For example, the built-in rule letter
will parse a string of length 1 that is a letter.
ruleName<expr>
Matches the body of the parameterized rule named ruleName, substituting the parsing expression expr as its first parameter. For parameterized rules with more than one parameter, the parameters are comma-separated, e.g. ListOf<field, ";">
.
expr *
Matches the expression expr repeated 0 or more times. E.g., "a"*
will match ''
, 'a'
, 'aa'
, ...
Inside a syntactic rule -- any rule whose name begins with an upper-case letter -- spaces before a match are automatically skipped. E.g., "a"*
will match " a a"
as well as "aa"
. See the documentation on syntactic and lexical rules for more information.
expr +
Matches the expression expr repeated 1 or more times. E.g., letter+
will match 'x'
, 'xA'
, ...
As with the *
operator, spaces are skipped when used in a syntactic rule.
expr ?
Tries to match the expression expr, succeeding whether it matches or not. No input is consumed if it does not match.
expr1 expr2
Matches the expression expr1
followed by expr2
. E.g., "grade" letter
will match 'gradeA'
, 'gradeB'
, ...
As with the *
and +
operators, spaces are skipped when used in a syntactic rule. E.g., "grade" letter
will match ' grade A'
as well as 'gradeA'
.
expr1 | expr2
Matches the expression expr1
, and if that does not succeed, matches the expression expr2
. E.g., letter | digit
will match 'a'
, '9'
, ...
& expr
Succeeds if the expression expr
can be matched, but does not consume anything from the input stream. Usually used as part of a sequence, e.g. letter &digit
will match 'a9'
, but only consume 'a'. &"a" letter+
will match any string of letters that begins with 'a'.
~ expr
Succeeds if the expression expr
cannot be matched, and does not consume anything from the input stream. Usually used as part of a sequence, e.g., ~"\n" any
will consume any single character that is not a new line character.
# expr
Matches expr as if in a lexical context. This can be used to prevent whitespace skipping before an expression that appears in the body of a syntactic rule. For further information, see Syntactic vs. Lexical Rules.
(See src/built-in-rules.ohm.)
any
: Matches the next character in the input stream, if one exists.
letter
: Matches a single character which is a letter (either uppercase or lowercase).
lower
: Matches a single lowercase letter.
upper
: Matches a single uppercase letter.
digit
: Matches a single character which is a digit from 0 to 9.
hexDigit
: Matches a single character which is a either digit or a letter from A-F.
alnum
: Matches a single letter or digit; equivalent to letter | digit
.
space
: Matches a single whitespace character (e.g., space, tab, newline, etc.)
end
: Matches the end of the input stream. Equivalent to ~any
.
caseInsensitive<terminal>
: Matches terminal, but ignoring any differences in casing (based on the simple, single-character Unicode case mappings). E.g., caseInsensitive<"ohm">
will match 'Ohm'
, 'OHM'
, etc.
ListOf<elem, sep>
: Matches the expression elem zero or more times, separated by something that matches the expression sep. E.g., ListOf<letter, ",">
will match ''
, 'a'
, and 'a, b, c'
.
NonemptyListOf<elem, sep>
: Like ListOf
, but matches elem at least one time.
listOf<elem, sep>
: Similar to ListOf<elem, sep>
but interpreted as lexical rule.
grammarName <: supergrammarName { ... }
Declares a grammar named grammarName
which inherits from supergrammarName
.
In the three forms below, the rule body may optionally begin with a |
character, which will be
ignored. Also note that in rule names, case is significant.
ruleName = expr
Defines a new rule named ruleName
in the grammar, with the parsing expression expr
as the rule body. Throws an error if a rule with that name already exists in the grammar or one of its supergrammars.
ruleName := expr
Defines a rule named ruleName
, overriding a rule of the same name in a supergrammar. Throws an error if no rule with that name exists in a supergrammar.
ruleName += expr
Extends a supergrammar rule named ruleName
, throwing an error if no rule with that name exists in a supergrammar. The rule body will effectively be expr | oldBody
, where oldBody
is the rule body as defined in the supergrammar.
ruleName<arg1, ..., argN> = expr
Defines a new rule named ruleName
which has n parameters. In the rule body expr, the parameter names (e.g. arg1) may be used as rule applications. E.g., Repeat<x> = x x
.
Rule declarations may optionally have a description, which is a parenthesized "comment" following the name of the rule in its declaration. Rule descriptions are used to produce better error messages for end users of a language when input is not recognized. For example:
ident (an identifier)
= ~keyword name
expr -- caseName
When a parsing expression is followed by the characters --
and a name, it signals an inline rule declaration. This is most commonly used in alternation expressions to ensure that each branch has the same arity. For example, the following declaration:
AddExp = AddExp "+" MulExp -- plus
| MulExp
is equivalent to:
AddExp = AddExp_plus
| MulExp
AddExp_plus = AddExp "+" MulExp
A syntactic rule is a rule whose name begins with an uppercase letter, and lexical rule is one whose name begins with a lowercase letter. The difference between lexical and syntactic rules is that syntactic rules implicitly skip whitespace characters.
For the purposes of a syntactic rule, a "whitespace character" is anything that matches its enclosing grammar's "space" rule. The default implementation of "space" matches ' ', '\t', '\n', '\r', and any other character that is considered whitespace in the ES5 spec.