Skip to content

Latest commit

 

History

History
401 lines (297 loc) · 15 KB

generated-parser.md

File metadata and controls

401 lines (297 loc) · 15 KB

using the generated parser

When you run owl with the -c option, it generates parser code as a C header file.

$ owl -c grammar.owl
// This file was generated by the Owl parsing tool.
...

Once you're done admiring the beautiful parser code which has filled your terminal buffer, you may decide you want to write it to a file, either by using -o or by redirecting stdout:

$ owl -c grammar.owl -o parser.h
$ owl -c grammar.owl > parser.h

integrating the parser

The header file has two parts (in single-file library style): a header-like part and an implementation-like part. By default, including the header includes only the header-like part. To include the implementation as well, define OWL_PARSER_IMPLEMENTATION before using #include:

// Include parser implementation.
#define OWL_PARSER_IMPLEMENTATION
#include "parser.h"

The implementation should be included by a single .c file somewhere in your project.

step-by-step

Here are some steps you can follow to create a new program that uses a generated parser:

  1. Write a grammar for your language in your-grammar.owl (you can test it out before you compile it by running owl your-grammar.owl).

  2. Run owl -c your-grammar.owl -o parser.h to generate the parser.h header file.

  3. Create a file called main.c that looks like:

    #define OWL_PARSER_IMPLEMENTATION
    #include "parser.h"
    
    int main()
    {
        struct owl_tree *tree;
        tree = owl_tree_create_from_file(stdin);
        owl_tree_print(tree);
        owl_tree_destroy(tree);
        return 0;
    }
  4. Build and run main.c using your favorite build system (or by running cc -o test main.c && ./test).

  5. Type some input matching your grammar into standard input to test that things are working. You should see a parse tree matching your input. Make sure to mark the end of the input using Control+D (or however you specify "end of input").

creating a tree

Owl uses the struct owl_tree type to represent a parse tree. There are three ways to create a tree:

from a string

struct owl_tree *tree = owl_tree_create_from_string(string);

The string parameter must be a null-terminated C string.

The tree returned by owl_create_tree_from_string may reference pieces of its string argument. It's important to keep the string around until you're done with the tree.

from a file

struct owl_tree *tree = owl_tree_create_from_file(file);

The file parameter must be an open FILE *.

Owl will copy the contents of the file into an internal buffer, so feel free to close the file after calling this function.

using options

struct owl_tree *tree = owl_tree_create_with_options(options);

Either options.file or options.string must be set (but not both). Use this function if you want to provide a custom tokenize function (see user defined tokens). More options may be available in the future.

reporting errors

There are a few kinds of errors that can happen while creating a tree (see the table below). If one of these errors happens, the owl_create_tree_... functions return an error tree. Calling any function other than owl_tree_destroy on an error tree will print the error and exit.

error type what it means error range
ERROR_INVALID_FILE The argument to owl_tree_create_from_file was null, or there was an error while reading it. None.
ERROR_INVALID_OPTIONS The options argument to owl_tree_create_with_options either had both options.file and options.string set, or it had neither. None.
ERROR_INVALID_TOKEN Part of the text didn't match any valid token. A range that begins with the first unrecognized character.
ERROR_UNEXPECTED_TOKEN The parser encountered an out-of-place token that didn't fit the grammar. The range of the unexpected token.
ERROR_MORE_INPUT_NEEDED The input is valid so far, but incomplete; more tokens are necessary to complete it. A range positioned at the end of the input.

To handle the error yourself, you can use owl_tree_get_error. The range parameter is optional; if you pass a non-null pointer, it will be filled with a byte range as described in the table above.

struct source_range range;
switch (owl_tree_get_error(tree, &range)) {
case ERROR_NONE:
    // No error; continue with the rest of the program.
    break;
case ERROR_INVALID_FILE:
    fprintf(stderr, "error: invalid file\n");
    exit(1);
default:
    fprintf(stderr, "error: at range %zu %zu\n", range.start, range.end);
    exit(1);
}

cleaning up

When you're done with a tree, use owl_tree_destroy(tree) to reclaim its memory. Calling owl_tree_destroy on a null value is okay (it does nothing).

inside the tree

Each time a rule matches part of the input, Owl records details of the match in a hierarchical structure—that's the parse tree. Let's see what this tree looks like for a list-matching grammar that begins like:

list = item (',' item)*

If the list rule matches the input, Owl will record

  • the range of text that list matches, and
  • match details for each item in the list.

This recorded data is represented by a parsed_list struct:

struct parsed_list {
    struct source_range range;
    struct owl_ref item;
};

Note that item is a struct owl_ref, not a struct parsed_item. To save memory and improve locality, Owl stores parse tree data in a compressed format. A struct owl_ref is like a pointer into this compressed data. The parsed_item_get function unpacks a ref into a struct parsed_item and returns it.

struct parsed_item first_item = parsed_item_get(list.item);

There's a function like this for every rule:

rule name match data type unpacking function
list struct parsed_list parsed_list_get
item struct parsed_item parsed_item_get

iterating

The list.item field is a reference to the first item in the list. Calling owl_next on an item returns a reference to the next item. You can keep calling owl_next in a loop to iterate over the list:

while (!list.item.empty) {
    struct parsed_item item = parsed_item_get(list.item);
    // ...do something with item...
    list.item = owl_next(list.item);
}

Mutating list.item doesn't change the parse tree—it just changes the local, unpacked copy. You can also use a for loop if you don't want to reassign anything:

for (struct owl_ref r = list.item; !r.empty; r = owl_next(r)) {
    struct parsed_item item = parsed_item_get(r);
    // ...do something with item...
}

Each struct owl_ref is valid for as long as the tree is. You can store them and reuse them as much as you want.

named options

If a rule has named options, the chosen option is indicated by the type field in the match struct. For example, say our item rule looks like this:

item =
  'nil' : nothing
  number : numeric
  [ '[' list ']' ] : list

with three named options (nothing, numeric, and list). Its struct has a type field to indicate which option was chosen.

struct parsed_item {
    struct source_range range;
    enum parsed_type type; // <----
    struct owl_ref number;
    struct owl_ref list;
};

and a parsed_type enum is generated with three values.

enum parsed_type {
    PARSED_LIST = 1,
    PARSED_NOTHING,
    PARSED_NUMERIC,
};

All option names go into the same parsed_type enum to avoid unneccessary prefixing.

Fields which can't appear in patterns of the given type always have their empty flag set. For example, if item.type is PARSED_NUMERIC, item.list.empty will always be set.

renames

Most of the time, the rule name works fine as a field name. But occasionally you want to distinguish between multiple references to the same rule:

set-index = expr '[' expr ']' '=' expr

The @ operator can be used to specify a field name for each reference.

set-index = expr@array '[' expr@index ']' '=' expr@value

The new names appear in the fields of the parse tree struct:

struct parsed_set_index {
    struct source_range range;
    struct owl_ref array;
    struct owl_ref index;
    struct owl_ref value;
};

To keep the types of the references consistent, Owl makes sure each field name can only refer to one kind of rule.

# This isn't allowed:
# bad-plus = identifier@bad-name '+' expr@bad-name

getting the root match

The first rule (or root rule) in the grammar matches the entire input. This root match is our starting point in the parse tree. If list is the first rule in the grammar, Owl will generate an owl_tree_get_parsed_list function which returns the root match:

struct parsed_list list = owl_tree_get_parsed_list(tree);

token data

The identifier, integer, number, and string tokens record data about their match like rules do. This data can also be unpacked using parsed_identifier_get, parsed_integer_get, parsed_number_get, or parsed_string_get.

struct parsed_identifier {
    struct source_range range;
    const char *identifier;
    size_t length;
};

struct parsed_integer {
    struct source_range range;
    uint64_t integer;
};

struct parsed_number {
    struct source_range range;
    double number;
};

struct parsed_string {
    struct source_range range;
    const char *string;
    size_t length;
    bool has_escapes;
};

struct parsed_identifier parsed_identifier_get(struct owl_ref);
struct parsed_integer parsed_integer_get(struct owl_ref);
struct parsed_number parsed_number_get(struct owl_ref);
struct parsed_string parsed_string_get(struct owl_ref);

If has_escapes is true, the string data is owned by the owl_tree—otherwise, it's a direct reference to the parsed text.

If you need more than the built-in identifier, integer, number, and string token types, you can define your own token types using .token.

If a grammar has any of these user-defined tokens, owl_tree_create_with_options accepts two extra parameters in its owl_tree_options struct:

struct owl_tree_options {
    // ...
    owl_token_func_t tokenize;
    void *tokenize_info;
};

The tokenize function matches a user-defined token. For example, if your grammar needs to match individual digits:

.token digit '7' '3'

…then you might write this tokenize function:

struct owl_token match_digit(const char *string, void *info)
{
    if ('0' <= string[0] && string[0] <= '9') {
        return (struct owl_token){
            .type = OWL_TOKEN_DIGIT,
            .length = 1,
            .data.integer = string[0] - '0'
        };
    } else
        return owl_token_no_match;
}

…and pass it into owl_tree_create_with_options like this:

struct owl_tree_options options = {
    .string = "2",
    .tokenize = match_digit,
};
struct owl_tree *tree = owl_tree_create_with_options(options);
// ...

Each time Owl's tokenizer steps forward, it calls the tokenize function (if it's not NULL), passing the remaining zero-terminated input as string and the tokenize_info field provided in owl_tree_options as info. The tokenize function then returns an owl_token struct representing details about the match:

struct owl_token {
    enum owl_token_type type;
    size_t length;
    union {
        uint64_t integer;
        double real;
        void *pointer;
    } data;
};

A return value of owl_token_no_match (or any value with the length field set to zero) indicates no match. Otherwise, the length field indicates how long the match is, and type indicates which type of token it is. Each user-defined token type (like .token digit) corresponds to a value in the owl_token_type enum (like OWL_TOKEN_DIGIT). Tokens with type OWL_WHITESPACE will be treated as whitespace instead of as a custom token.

If there's a conflict between a user-defined token match and another token match, the longest match will be chosen, with ties going to keywords first, then user-defined token types, then other token types (like identifier and so on).

User-defined tokens can be unpacked into parsed_... structs just like rules and built-in tokens.

struct parsed_digit {
    struct source_range range;
    union {
        uint64_t integer;
        double real;
        void *pointer;
    } data;
};

Any data returned via the data field in owl_token will appear in the parsed_... struct automatically.

custom prefix

The -p option specifies a custom prefix for names in the generated parser code.

$ owl -c grammar.owl -p asdf -o parser.h

Instead of owl_ref and parsed_integer_get, parser.h will use names like asdf_ref and asdf_integer_get.

function index

ROOT is the root rule name. RULE ranges over all rules.

name arguments return value
owl_next An owl_ref. The next ref matching the corresponding field in the rule, or an empty ref.
owl_refs_equal Two owl_ref values. true if the refs refer to the same match; false otherwise.
owl_tree_create_from_file A FILE * to read from. The file is read into an intermediate string and may be closed immediately. A new tree.
owl_tree_create_from_string A null-terminated string to parse. You retain ownership and must keep the string around until the tree is destroyed. A new tree.
owl_tree_create_with_options An owl_tree_options struct—use this to specify a custom tokenize function. A new tree.
owl_tree_destroy An owl_tree * to destroy, freeing its resources back to the system. May be NULL. None.
owl_tree_get_error An owl_tree * and an error_range out-parameter. The error range may be NULL. An error which interrupted parsing, or ERROR_NONE if there was no error.
owl_tree_get_parsed_ROOT An owl_tree *. A parsed_ROOT struct corresponding to the root match.
owl_tree_print An owl_tree * to print to stdout (typically for debugging purposes). Must not be NULL. None.
owl_tree_root_ref An owl_tree *. The ref corresponding to the root match.
parsed_identifier_get An owl_ref corresponding to an identifier match. A parsed_identifier struct corresponding to the identifier match.
parsed_integer_get An owl_ref corresponding to an integer match. A parsed_integer struct corresponding to the integer match.
parsed_number_get An owl_ref corresponding to a number match. A parsed_number struct corresponding to the number match.
parsed_string_get An owl_ref corresponding to a string match. A parsed_string struct corresponding to the string match.
parsed_RULE_get An owl_ref corresponding to a match for RULE. A parsed_RULE struct corresponding to the ref's match.
parsed_TOKEN_get An owl_ref corresponding to a match for the user-defined token TOKEN. A parsed_TOKEN struct corresponding to the ref's match.