
feat: update parser to use iterators for blocks and inlines #118

Merged: 82 commits merged into main, Nov 19, 2023
Conversation

@mhatzl (Contributor) commented Nov 10, 2023

This is a major rework of both the block and inline parsers.
Block and inline parsing now use a similar iterator-based approach,
which hopefully makes implementing new elements much easier.

Relevant decisions made in this PR

Token Slice instead of Symbol Slice

Block parsing was done over a symbol slice, where a Symbol represented one grapheme
with added line and column information.
While porting the iterator approach to inlines, iterating over symbols turned out to be
too difficult and slow, because next and peeking_next had to create each inline token from one or more symbols. Using a "cache layer" to keep all "peeked" tokens slightly improved the performance, but significantly increased the complexity.
After talking to @nfejzic, it was decided to convert all symbols to a token representation
that still distinguishes keywords from plain text,
but compresses most graphemes/symbols into one token.
This new token slice makes parsing and matching functions easier to implement.
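The compression idea above can be sketched as follows. This is a minimal illustration, not the PR's actual API: the TokenKind variants, the Token fields, and to_tokens are made-up names, and chars stand in for graphemes with position information.

```rust
// Hypothetical sketch: keyword graphemes stay individual tokens, while runs
// of non-keyword graphemes are compressed into a single Plain token.

#[derive(Debug, PartialEq)]
enum TokenKind {
    Star,  // keyword grapheme, kept as its own token
    Tick,  // keyword grapheme, kept as its own token
    Plain, // run of non-keyword graphemes compressed into one token
}

#[derive(Debug, PartialEq)]
struct Token {
    kind: TokenKind,
    start: usize, // offset of the first compressed symbol
    len: usize,   // number of symbols compressed into this token
}

fn to_tokens(symbols: &[char]) -> Vec<Token> {
    let mut tokens: Vec<Token> = Vec::new();
    for (i, sym) in symbols.iter().enumerate() {
        let kind = match sym {
            '*' => Some(TokenKind::Star),
            '`' => Some(TokenKind::Tick),
            _ => None,
        };
        match kind {
            Some(kind) => tokens.push(Token { kind, start: i, len: 1 }),
            None => match tokens.last_mut() {
                // extend the preceding Plain token instead of creating a new one
                Some(t) if t.kind == TokenKind::Plain => t.len += 1,
                _ => tokens.push(Token { kind: TokenKind::Plain, start: i, len: 1 }),
            },
        }
    }
    tokens
}
```

Because a token keeps the start offset and length of the symbols it compresses, line and column information can still be recovered from the original symbol positions.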

Parser chaining

To nest the symbol iterator for block parsing, the iterator had to be cloned,
and its progress had to be manually synchronized when "unnesting".
This cloning and updating created unnecessary parser overhead,
because only one parser can make progress at a time.
With the change to a token slice, iterating over this slice makes it easy to jump to any position inside the iterator.
Checkpoint and rollback functionality was added to the token iterator,
so that an element parser can roll back to a previously created checkpoint if it did not produce a valid element.

Parser functions for elements now take ownership of the parser,
and must return the (possibly modified) parser together with the optionally created element.
This allows element parsers to be chained, rolling the parser back whenever a parser does not produce an element. This return tuple has the negative side effect that "?" cannot be used for early exit
when the token iterator returns no new tokens.
However, this is considered a small trade-off, because early exit was mostly used only at the start of a parser function.
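A rough sketch of this ownership-based chaining is shown below, under the assumption of a simplified parser over chars. Parser, Element, parse_bold, parse_plain, and parse_any are all made-up names for illustration, not the PR's API.

```rust
// Sketch: element parsers own the parser and hand it back, so the caller can
// roll back to the pre-call state when no element was produced.

#[derive(Debug, PartialEq)]
enum Element {
    Bold,
    Plain,
}

struct Parser<'t> {
    tokens: &'t [char],
    index: usize,
}

fn parse_bold(mut p: Parser<'_>) -> (Parser<'_>, Option<Element>) {
    if p.tokens.get(p.index) == Some(&'*') {
        p.index += 1;
        (p, Some(Element::Bold))
    } else {
        (p, None) // no `?` early exit: the parser must always be returned
    }
}

fn parse_plain(mut p: Parser<'_>) -> (Parser<'_>, Option<Element>) {
    if p.index < p.tokens.len() {
        p.index += 1;
        (p, Some(Element::Plain))
    } else {
        (p, None)
    }
}

/// Try element parsers in order, rolling back between failed attempts.
fn parse_any(mut p: Parser<'_>) -> (Parser<'_>, Option<Element>) {
    let checkpoint = p.index;
    let parsers = [parse_bold as fn(Parser) -> (Parser, Option<Element>), parse_plain];
    for parse in parsers {
        let (returned, elem) = parse(p);
        p = returned;
        if elem.is_some() {
            return (p, elem);
        }
        p.index = checkpoint; // rollback before trying the next parser
    }
    (p, None)
}
```

Returning the parser by value is what makes the rollback safe: a failed parser cannot accidentally keep a stale iterator alive, because there is only ever one parser making progress.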

Implicit substitutions

Because implicit substitutions should take precedence over other block and inline elements,
it was decided to move the handling of these substitutions into the conversion from symbols to tokens.
However, this conversion is not yet implemented, because correct direct URI, arrow, and emoji substitution adds complexity that would further increase the size of this PR.
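To illustrate why substitution during tokenization gives it precedence: once the substitution is baked into a token, no element parser ever sees the individual symbols. The sketch below uses the arrow substitution "-->" as an example; the Token shape, the substituted glyph, and tokenize are assumptions for illustration only.

```rust
// Sketch: an implicit substitution handled while converting symbols to
// tokens, so downstream parsers only ever see the substituted token.

#[derive(Debug, PartialEq)]
enum Token {
    ImplicitSubst(&'static str), // already-substituted content
    Plain(String),
}

fn tokenize(input: &str) -> Vec<Token> {
    let mut tokens: Vec<Token> = Vec::new();
    let mut rest = input;
    while !rest.is_empty() {
        if let Some(stripped) = rest.strip_prefix("-->") {
            // substitution happens here, before any element parsing
            tokens.push(Token::ImplicitSubst("⟶"));
            rest = stripped;
        } else {
            let mut chars = rest.chars();
            let c = chars.next().unwrap();
            // merge into a preceding Plain token instead of one token per char
            match tokens.last_mut() {
                Some(Token::Plain(s)) => s.push(c),
                _ => tokens.push(Token::Plain(c.to_string())),
            }
            rest = chars.as_str();
        }
    }
    tokens
}
```

URI and emoji substitution would need more lookahead than this prefix check, which is the complexity the PR defers.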

Prepare to handle logic in verbatim context

The new parser provides context structs for blocks and inlines
to correctly handle whitespace merging and escaping, and to prepare for detecting logic constructs
in verbatim contexts,
e.g. inline verbatim: {$my-var} with logic
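One way such a context struct could steer behavior is sketched below: a flag that switches off whitespace merging inside verbatim content. InlineContext, its field, and merge_whitespace are hypothetical names, not the PR's actual context API.

```rust
// Sketch: a parsing context that changes whitespace handling in verbatim.

#[derive(Default)]
struct InlineContext {
    /// Inside verbatim, whitespace is kept as-is and most keywords are plain,
    /// but logic constructs (e.g. {$my-var}) must still be detected.
    in_verbatim: bool,
}

fn merge_whitespace(content: &str, ctx: &InlineContext) -> String {
    if ctx.in_verbatim {
        content.to_string() // keep verbatim content untouched
    } else {
        // outside verbatim, runs of whitespace merge into a single space
        content.split_whitespace().collect::<Vec<_>>().join(" ")
    }
}
```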

This context handling is needed for advanced matching and element spans.
Implicit substitutions and whitespace are now merged into Plain,
which makes snapshots more readable.
@mhatzl mhatzl marked this pull request as ready for review November 12, 2023 13:25
@mhatzl (Contributor, Author) commented Nov 12, 2023

The remaining todo! macros are intentional, because NamedSubstitution is not yet implemented.
The base iterator file in commons is commented out, because it contains code that could serve as a template for direct grapheme-iterator-to-token-slice conversion.

At the moment, conversion from str to the token slice is done by first creating the symbol slice and then creating the token slice. With a small dynamic cache in the SymbolIterator, it should be possible to convert directly from str to the token slice using the GraphemeIterator.
This should significantly improve overall parsing performance and memory usage.
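The proposed direct conversion could look roughly like this: feed graphemes straight into token creation instead of materializing a full symbol slice first. To keep the sketch std-only, chars stand in for the grapheme clusters the real GraphemeIterator would yield, a Peekable plays the role of the small dynamic cache, and the Token shape is made up.

```rust
// Sketch: streaming str-to-token conversion with no intermediate symbol Vec.

#[derive(Debug, PartialEq)]
enum Token {
    Keyword(char),
    Plain(String),
}

fn tokenize_direct(input: &str) -> Vec<Token> {
    let mut tokens: Vec<Token> = Vec::new();
    // Peekable acts as the "small dynamic cache" over the grapheme stream.
    let mut graphemes = input.chars().peekable();
    while let Some(&c) = graphemes.peek() {
        if c == '*' || c == '`' {
            graphemes.next();
            tokens.push(Token::Keyword(c));
        } else {
            // collect a run of non-keyword graphemes into one Plain token
            let mut plain = String::new();
            while let Some(&c) = graphemes.peek() {
                if c == '*' || c == '`' {
                    break;
                }
                plain.push(c);
                graphemes.next();
            }
            tokens.push(Token::Plain(plain));
        }
    }
    tokens
}
```

Skipping the intermediate symbol slice avoids one full allocation pass over the input, which is where the expected performance and memory win comes from.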

@nfejzic (Collaborator) left a comment

I only had time for this much right now, and I think these comments will also take some time to resolve. I will review the rest at some later point.

Resolved review threads on:
commons/src/lexer/position.rs
commons/src/lexer/symbol/iterator.rs (3 threads)
commons/src/lexer/token/implicit/mod.rs
commons/src/lexer/token/iterator/mod.rs (2 threads)
commons/src/lexer/token/iterator/slice.rs
commons/src/lexer/token/mod.rs
commons/src/parsing.rs
@mhatzl mhatzl changed the title feat: major parser update to use iterators for blocks and inlines feat: update parser to use iterators for blocks and inlines Nov 19, 2023
@nfejzic (Collaborator) left a comment

This is a very large PR, and most of it is not reviewed. However, we discussed it internally and concluded that there are time constraints at the moment.

Because of that, I will rely on the knowledge and confidence of @mhatzl that this works correctly, and approve the PR.

@mhatzl mhatzl merged commit c1a4827 into main Nov 19, 2023
3 checks passed