-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: update parser to use iterators for blocks and inlines #118
Conversation
Needed for advanced matching and element spans.
Implicit substs and whitespaces are now merged into Plain. This makes snapshots more readable.
Remaining At the moment, conversion from |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had time for this much right now, but I think these comments will also take some time to resolve. Will review the rest at some later point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a very large PR, and most of it is not reviewed. However, we internally discussed about it and concluded that there are some time constraints at the moment.
Because of that, I will rely on knowledge and confidence of @mhatzl that this works correctly, and approve the PR.
This is a major rework of both block and inline parser.
Now block and inline parsing use a similar iterator based approach,
which hopefully makes implementing new elements much easier.
Relevant decisions you made in this PR
Token Slice instead of Symbol Slice
Block parsing was done over a symbol slice, and a
Symbol
represented one graphemewith added line and column information.
While trying to port the iterator approach to inlines, it became too difficult
and slow to iterate over symbols, because
next
andpeeking_next
would individually create the inline token from one or more symbols. Using a "cache layer" to keep all "peeked" tokens slightly improved the performance, but significantly increased the complexity.After talking to @nfejzic, it was decided to convert all symbols to token representations
that still allowed to distinguish between keywords and plain text,
but compressed most graphemes/symbols into one token.
This new token slice makes it easier to implement parsing and matching functions.
Parser chaining
To nest the symbol iterator for block parsing, the iterator was cloned,
and the progress had to be manually updated when "unnesting".
This cloning and updating created an unnecessary parser overhead,
because there is always just one parser that can make progress.
With the change to iterate over a token slice, iterating over this slice allows to easily jump inside the iterator.
A
checkpoint
androllback
functionality was added to the token iterator,to allow element parsing, and rollback to a previous created checkpoint if an element parser did not lead to a valid element.
Parser functions for elements now take ownership of the parser,
and must return the optionally modified parser and the optionally created element.
This allows to chain element parsers, but rollback the parser if some parser did not lead to an element. This return tuple has the negative side-effect, that "?" cannot be used for early exit
if the token iterator does not return new tokens.
However, this is considered to be a small tradeoff, because early exit was mostly only used at the start of a parser function.
Implicit substitutions
Because implicit substitutions should take precedence over other block and inline elements,
it was decided to move handling of these substitutions into the conversion from symbols to tokens.
However, this conversion is not yet implemented, because correct direct URI, arrow and emoji substitution will add more complexity, which would further increase the size of this PR.
Prepare to handle logic in verbatim context
The new parser provides context structs for blocks and inlines
to correctly handle whitespace merging, escaping, and prepare to detect logic constructs
in verbatim context.
e.g.
inline verbatim {$my-var} with logic