Parser combinators for TypeScript and Rust.
Write your language in BBNF
(better
backus-naur form) .
Handles left recursion and left factoring. Performance focused whilst maintaining readability and ease of use.
TypeScript:
import { string, regex } from "@mkbabb/parse-that";
const heyy = regex(/hey+t/);
heyy.parse("heyyyyyyyyyt"); // => "heyyyyyyyyyt"
Rust:
use parse_that::string;
let heyy = string("heyyyyyyyyyt");
heyy.parse("heyyyyyyyyyt"); // => "heyyyyyyyyyt"
Or with a grammar:
import { string, match, generateParserFromBBNF } from "@mkbabb/parse-that";
const grammar = `
expr = term, { ("+" | "-"), term };
term = factor, { ("*" | "/"), factor };
factor = number | "(", expr, ")";
number = /[0-9]+/;
whatzupwitu = "whatzupwitu";
`;
const [nonterminals, ast] = generateParserFromBBNF(grammar);
const expr = nonterminals.expr;
expr.parse("1 + whatzupwitu * 3"); // => [1, "+", ["whatzupwitu", "*", 3]]
nice.
- parse
[that]
Here's a benchmark comparing the following libraries, all parsing the same JSON
grammar:
- Chevrotain
- Parsimmon
- this library, with standard combinators: here
- this library, using a generated parser from BBNF: here
The file used is a 3.8 MB JSON
file, containing ~10K lines of JSON
. Whitespace is
randomly inserted to make the file a bit more difficult to parse (makes the file about
5x the size). Benchmark is run 100 times.
name | hz | min | max | mean | p75 | p99 | p995 | p999 | rme | samples |
---|---|---|---|---|---|---|---|---|---|---|
Standard | 8.1013 | 119.20 | 127.90 | 123.44 | 127.08 | 127.90 | 127.90 | 127.90 | ±2.64% | 10 |
BBNF | 5.1979 | 183.18 | 216.12 | 192.38 | 195.59 | 216.12 | 216.12 | 216.12 | ±3.97% | 10 |
Chevrotain | 6.8699 | 129.91 | 167.46 | 145.56 | 158.21 | 167.46 | 167.46 | 167.46 | ±5.70% | 10 |
Parsimmon | 4.0256 | 246.69 | 250.34 | 248.41 | 249.09 | 250.34 | 250.34 | 250.34 | ±1.54% | 10 |
Standard - test/benchmarks/json.bench.ts > JSON Parser
1.18x faster than Chevrotain
1.56x faster than BBNF
2.01x faster than Parsimmon
Probably not the most scientific comparison, but it's generally about 10-30% faster than what I've tested it against.
Have a look inside benchmarks if you're curious.
The approach taken in the Rust implementation is almost identical to that of the
TypeScript one - there's much to improve here. The benchmark is run on a 2.3 MB JSON
file (canada.json
), containing ~10K lines of JSON
. Benchmark is run 100 times.
TODO: but it averages around 80 MB/s on my machine.
Debugging is made 🌈pretty🌈 by using the debug
combinator - but you must run in
development
mode (vite build --mode development
) to see the output.
As output, you'll see a few things:
- A header containing:
- parsing status (
Ok
orErr
) - current offset into the input string
- debug node's name
- stringified current parser - a bit like the BBNF format
- parsing status (
- A body containing:
- A maximum of 10 lines of the input string, with the current offset into the
parse string denoted by
^
- Line numbers for each line currently displayed
- A maximum of 10 lines of the input string, with the current offset into the
parse string denoted by
The blue
color indicates that that variable is an BBNF nonterminal - yellow
is the
stringified parser.
Better Backus-Naur Form is a simple and readable way to describe a language. A better EBNF.
See the BBNF for BBNF (meta right) at bbnf.bbnf.
With your grammar in hand, call the generateParserFromBBNF
function within
bbnf.ts and you'll be returned two objects:
[nonterminals, ast]: [BBNFNonterminals, BBNFAST] = generateParserFromBBNF(grammar);
a JavaScript object containing all of the parsed nonterminals in your language, and the
AST for your language. Each nonterminal is a Parser
object - use it as you would any
other parser.
Fully featured, and self-parsing, so the BBNF parser-generator is written in BBNF. Checkout the self-parsing example (+ formatting), at bbnf.test.ts.
It's a mesh between your run-of-the-mill EBNF and
W3C's EBNF. So stuff like ?
and *
are
allowed - but it also supports [ ]
and { }
for optional and repeated elements.
Set-like subtraction is supported, so you can do things like a - b
to mean "a, but not
b".
Here's a list of operators:
A?
|[ A ]
: optional AA*
|{ A }
: repeated A (0 or more)A+
: repeated A (1 or more)A | B
: A or B - higher precedence thanA, B
A, B
: A followed by BA - B
: A, but not BA >> B
: A, then B, but only return BA << B
: A, then B, but only return A( A )
: grouping - maximum precedence
And yes emojis are supported. Epsilon even has a special value for it - ε
(or just use
the word epsilon).
This library fully supports left recursion (either direct or indirect) - highly ambiguous grammars.
So grammars like (trivial direct left recursion):
expr = expr , "+" , expr
| integer
| string ;
and like (highly ambiguous):
ms = "s";
mSL = ( mSL , mSL , ms ) ? ;
mz = "z" ;
mZ = mZ | mY | mz ;
mY = mZ, mSL ;
are supported.
Our scheme is multifaceted and optimized depending upon a few factors:
Since BBNF allows for one to easily grab ahold of the AST, we can optimize its structure rather easily. Left recursion is "removed" (it's actually just factored out) using a four-pass algorithm:
-
- Sort the nonterminals topologically
-
- Remove indirect left recursion
-
- Remove left recursion
-
- Factorize
For the above example each pass would look something like this:
Input grammar:
expr = expr , "+" , expr
| integer
| string ;
Sorted & removed left recursion:
expr = integer, expr_0
| string, expr_0 ;
expr_0 = "+" , expr , expr_0
| ε ;
Factorized:
expr = (integer | string) , expr_0 ;
expr_0 = "+" , expr , expr_0
| ε
If any remaining left recursion is found, it's handled via the combinators.
Left recursion can be handled via two combinators: memoize
and mergeMemos
.
mergeMemos
must be applied directly one's left-recursive call, and memoize
must be
applied to the entire parser.
Here's an example:
...
const expression = Parser.lazy(() =>
all(expression, operators.then(expression).opt()).mergeMemos().or(number)
)
.memoize();
...
How this is done under the hood is by using a memoization table and a few clever tricks to detect if the current parser is in a left-recursive call. See the left recursion document for more information, and the memoization tests for examples.
Though left recursion is supported, it's absolutely not optimal. If it can be factored out via the BBNF parser generator generally the performance will quite fine, but if it cannot you may run into some performance issues. This stems, among other things, primarily from JavaScript's lack of tail call optimization. Again, see the left recursion document for more information.
A pretty-printing submodule, inspired by Prettier, is included in the Rust
implementation. I wanted to have something similar in Rust for my own projects, but I
couldn't find anything that was simple enough to use. So I wrote my own. Check out the
pretty
docs for more information.
See api.md for all of the API information.
See the test directory for fully explained and working examples.
- EBNF
- Left recursion information
- Notes On Parsing Ebnf
- Notes On The Formal Theory Of Parsing
- Removing Left Recursion From Context Free Grammars
- A new top-down parsing algorithm to accommodate ambiguity and left recursion in polynomial time
- Modular and efficient top-down parsing for ambiguous left-recursive grammars
Other great parsing libraries 🎉: