This repository has been archived by the owner on Jul 27, 2023. It is now read-only.


syntax-reference.md


Syntax Reference

This document describes the syntax of the Ohm language, which is a variant of parsing expression grammars (PEGs). If you have experience with PEGs, the Ohm syntax will mostly look familiar, but there are a few important differences to note:

  • When naming rules, case matters: whitespace is implicitly skipped inside a rule application if the rule name begins with an uppercase letter. For further information, see Syntactic vs. Lexical Rules.
  • Grammars are purely about recognition: they do not contain semantic actions (those are defined separately) or bindings. The separation of semantic actions is one of the defining features of Ohm -- we believe that it improves modularity and makes both grammars and semantics easier to understand.
  • Alternation expressions support case names, which are used in inline rule declarations. This makes semantic actions for alternation expressions simpler and less error-prone.
  • Ohm does not (yet) support semantic predicates.

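Ohm is closely related to OMeta, another PEG-based language for parsing and pattern matching. Like OMeta, Ohm supports a few features not supported by many PEG parsing frameworks, including parameterized rules (see Parameterized Rules) and grammar inheritance (see Grammar Inheritance).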

Terminology

Arithmetic {
  Expr = "1 + 1"
}

This is a grammar named "Arithmetic", which has a single rule named "Expr". The right-hand side of Expr is known as a "rule body". A rule body may be any valid parsing expression.

Parsing Expressions

Here is a full list of the different kinds of parsing expressions supported by Ohm:

Terminals

"hello there"

Matches exactly the characters contained inside the quotation marks.

Special characters (", \, and ') can be escaped with a backslash -- e.g., "\"" will match a literal quote character in the input stream. Other valid escape sequences include: \b (backspace), \f (form feed), \n (line feed), \r (carriage return), and \t (tab), as well as \x followed by 2 hex digits and \u followed by 4 hex digits, for matching characters by code point.
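As a small illustration of the escape sequences listed above, the following sketch defines one rule per escape style. The grammar name and rule names are made up for this example.

```ohm
Escapes {
  quote   = "\""       // a literal double-quote character
  newline = "\n"       // a line feed
  tab     = "\t"       // a tab
  letterA = "\x41"     // 'A', written as \x followed by 2 hex digits
  omega   = "\u03A9"   // 'Ω', written as \u followed by 4 hex digits
}
```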

Terminal Range

start..end

Matches exactly one character whose character code is between the terminals start and end (inclusive). E.g., "a".."c" will match 'a', 'b', or 'c'. Note: start and end must be 1-character Terminal expressions.
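For instance, ranges can be combined with alternation to describe small character classes. This is only a sketch, and the grammar and rule names are hypothetical:

```ohm
Ranges {
  lowerHex = "0".."9" | "a".."f"   // one character: 0-9 or a-f
  octal    = "0".."7"              // one character: 0-7
}
```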

Rule Application

ruleName

Matches the body of the rule named ruleName. For example, the built-in rule letter matches a single character that is a letter.

ruleName<expr>

Matches the body of the parameterized rule named ruleName, substituting the parsing expression expr as its first parameter. For parameterized rules with more than one parameter, the parameters are comma-separated, e.g. ListOf<field, ";">.
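Both forms of rule application can be seen in the short sketch below, which reuses the ListOf<field, ";"> example from the text. The grammar name Fields and the rules Record and field are assumptions made for illustration.

```ohm
Fields {
  Record = ListOf<field, ";">   // applies the built-in ListOf with two arguments
  field  = letter+              // applies the built-in rule letter
}
```

Because Record is a syntactic rule, an input such as 'a; bc; d' should match just as 'a;bc;d' does.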

Repetition operators: *, +, ?

expr *

Matches the expression expr repeated 0 or more times. E.g., "a"* will match '', 'a', 'aa', ...

Inside a syntactic rule -- any rule whose name begins with an upper-case letter -- spaces before a match are automatically skipped. E.g., "a"* will match " a a" as well as "aa". See the documentation on syntactic and lexical rules for more information.

expr +

Matches the expression expr repeated 1 or more times. E.g., letter+ will match 'x', 'xA', ...

As with the * operator, spaces are skipped when used in a syntactic rule.

expr ?

Tries to match the expression expr, succeeding whether it matches or not. No input is consumed if it does not match.
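The three repetition operators are often used together. The following is a minimal sketch with made-up rule names; all of the rules are lexical, so no whitespace is skipped.

```ohm
Numbers {
  number   = "-"? digit+ fraction?   // optional sign, 1+ digits, optional fraction
  fraction = "." digit+
  blanks   = " "*                    // zero or more space characters
}
```

With number as the start rule, inputs such as '7', '-42', and '3.14' match, while '' and '-' do not, because digit+ requires at least one digit.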

Sequence

expr1 expr2

Matches the expression expr1 followed by expr2. E.g., "grade" letter will match 'gradeA', 'gradeB', ...

As with the * and + operators, spaces are skipped when used in a syntactic rule. E.g., "grade" letter will match ' grade A' as well as 'gradeA'.

Alternation

expr1 | expr2

Matches the expression expr1, and if that does not succeed, matches the expression expr2. E.g., letter | digit will match 'a', '9', ...
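Because the alternatives are tried in order, the order can matter when one alternative is a prefix of another. A small sketch (the names Choice and op are hypothetical):

```ohm
Choice {
  op = "<=" | "<"   // tries "<=" first, then falls back to "<"
}
```

If the alternatives were written as "<" | "<=", the first branch would already succeed on the '<' of '<=', so the second branch would never be tried at that position.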

Lookahead: &

& expr

Succeeds if the expression expr can be matched, but does not consume anything from the input stream. Usually used as part of a sequence, e.g. letter &digit will match 'a9', but only consume 'a'. &"a" letter+ will match any string of letters that begins with 'a'.

Negative Lookahead: ~

~ expr

Succeeds if the expression expr cannot be matched, and does not consume anything from the input stream. Usually used as part of a sequence, e.g., ~"\n" any will consume any single character that is not a newline character.
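The two lookahead operators are typically combined with sequences and repetition. The sketch below uses invented rule names and is meant only to show common patterns, not a recommended grammar:

```ohm
Lookahead {
  comment = "//" (~"\n" any)*         // everything up to, but not including, the newline
  keyword = ("if" | "else") ~alnum    // matches 'if', but not the 'if' at the start of 'iffy'
  ident   = ~keyword letter alnum*    // an identifier that is not a keyword
  aWord   = &"a" letter+              // a string of letters that begins with 'a'
}
```

Here ident matches 'iffy' but not 'if': for 'if', keyword succeeds, so ~keyword fails.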

Lexification: #

# expr

Matches expr as if in a lexical context. This can be used to prevent whitespace skipping before an expression that appears in the body of a syntactic rule. For further information, see Syntactic vs. Lexical Rules.
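As a sketch of where this is useful, suppose a key must be followed immediately by a colon, while whitespace is still allowed before the value. The grammar and rule names here are assumptions for illustration:

```ohm
Assoc {
  Pair  = #(key ":") value   // no whitespace skipping inside #( ... )
  key   = letter+
  value = digit+
}
```

With this grammar, 'a:1' and 'a: 1' should match, but 'a : 1' should not, because the space between the key and the colon is not skipped.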

Built-in Rules

(See src/built-in-rules.ohm.)

any: Matches the next character in the input stream, if one exists.

letter: Matches a single character which is a letter (either uppercase or lowercase).

lower: Matches a single lowercase letter.

upper: Matches a single uppercase letter.

digit: Matches a single character which is a digit from 0 to 9.

hexDigit: Matches a single character which is either a digit or a letter from A-F.

alnum: Matches a single letter or digit; equivalent to letter | digit.

space: Matches a single whitespace character (e.g., space, tab, or newline).

end: Matches the end of the input stream. Equivalent to ~any.

caseInsensitive<terminal>: Matches terminal, but ignoring any differences in casing (based on the simple, single-character Unicode case mappings). E.g., caseInsensitive<"ohm"> will match 'Ohm', 'OHM', etc.

ListOf<elem, sep>: Matches the expression elem zero or more times, separated by something that matches the expression sep. E.g., ListOf<letter, ","> will match '', 'a', and 'a, b, c'.

NonemptyListOf<elem, sep>: Like ListOf, but matches elem at least one time.

listOf<elem, sep>: Similar to ListOf<elem, sep>, but interpreted as a lexical rule.
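A few of the built-in rules can be combined as in the sketch below. The grammar Query and the rules Select and column are hypothetical names chosen for this example.

```ohm
Query {
  Select = caseInsensitive<"select"> NonemptyListOf<column, ",">
  column = letter alnum*   // a letter followed by letters or digits
}
```

Since Select is a syntactic rule, both 'SELECT a, b2' and 'select a,b2' should be accepted.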

Grammar Syntax

Grammar Inheritance

grammarName <: supergrammarName { ... }

Declares a grammar named grammarName which inherits from supergrammarName.

Defining, Extending, and Overriding Rules

In the three forms below, the rule body may optionally begin with a | character, which will be ignored. Also note that in rule names, case is significant.

ruleName = expr

Defines a new rule named ruleName in the grammar, with the parsing expression expr as the rule body. Throws an error if a rule with that name already exists in the grammar or one of its supergrammars.

ruleName := expr

Defines a rule named ruleName, overriding a rule of the same name in a supergrammar. Throws an error if no rule with that name exists in a supergrammar.

ruleName += expr

Extends a supergrammar rule named ruleName, throwing an error if no rule with that name exists in a supergrammar. The rule body will effectively be expr | oldBody, where oldBody is the rule body as defined in the supergrammar.
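The three forms can be seen side by side in the following sketch, which also shows grammar inheritance. The grammar and rule names are invented for illustration.

```ohm
Base {
  Stmt   = "print" number
  number = digit+
}

Extended <: Base {
  Stmt   += "echo" number   // body is effectively: "echo" number | "print" number
  number := hexDigit+       // replaces the definition inherited from Base
  ident   = letter alnum*   // a new rule; the name is not used in Base
}
```

Note that the branch added with += has the same number of components as the original Stmt body, which keeps the alternatives consistent with one another.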

Parameterized Rules

ruleName<arg1, ..., argN> = expr

Defines a new rule named ruleName which has N parameters. In the rule body expr, the parameter names (e.g. arg1) may be used as rule applications. E.g., Repeat<x> = x x.
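A parameterized rule with several parameters might look like the sketch below. The names Params, Start, and Wrapped are assumptions made for this example.

```ohm
Params {
  Start = Wrapped<"(", digit, ")">
  Wrapped<open, body, close> = open body close   // each parameter is used like a rule application
}
```

Applying Wrapped<"(", digit, ")"> behaves like the sequence "(" digit ")", so Start should match '(7)' and, being syntactic, '( 7 )' as well.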

Rule Descriptions

Rule declarations may optionally have a description, which is a parenthesized "comment" following the name of the rule in its declaration. Rule descriptions are used to produce better error messages for end users of a language when input is not recognized. For example:

ident (an identifier)
  = ~keyword name

Inline Rule Declarations

expr -- caseName

When a parsing expression is followed by the characters -- and a name, it signals an inline rule declaration. This is most commonly used in alternation expressions to ensure that each branch has the same arity. For example, the following declaration:

AddExp = AddExp "+" MulExp  -- plus
       | MulExp

is equivalent to:

AddExp = AddExp_plus
       | MulExp
AddExp_plus = AddExp "+" MulExp

Syntactic vs. Lexical Rules

A syntactic rule is a rule whose name begins with an uppercase letter, and a lexical rule is one whose name begins with a lowercase letter. The difference between lexical and syntactic rules is that syntactic rules implicitly skip whitespace characters.

For the purposes of a syntactic rule, a "whitespace character" is anything that matches its enclosing grammar's "space" rule. The default implementation of "space" matches ' ', '\t', '\n', '\r', and any other character that is considered whitespace in the ES5 spec.
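The difference can be sketched with two rules that have the same body and differ only in the case of their first letter. The example also extends the space rule, on the assumption that a grammar may want comments to be skipped wherever whitespace is skipped. All names other than space are hypothetical.

```ohm
Spacing {
  Sum = num "+" num             // syntactic: whitespace is skipped between the parts
  sum = num "+" num             // lexical: the parts must be directly adjacent
  num = digit+
  space += comment              // anything matching comment now also counts as whitespace
  comment = "//" (~"\n" any)*
}
```

With Sum as the start rule, both '1+2' and '1 + 2' should match. With sum as the start rule, only '1+2' should match.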