
feat: update parser to use iterators for blocks and inlines #118

Merged: 82 commits merged into main, Nov 19, 2023
Conversation

@mhatzl (Contributor) commented Nov 10, 2023

This is a major rework of both the block and inline parsers.
Block and inline parsing now use a similar iterator-based approach,
which hopefully makes implementing new elements much easier.

Relevant decisions made in this PR

Token Slice instead of Symbol Slice

Block parsing was done over a symbol slice, where a Symbol represented one grapheme
with added line and column information.
While porting the iterator approach to inlines, iterating over symbols turned out to be
too difficult and slow, because next and peeking_next had to create each inline token from one or more symbols. Using a "cache layer" to keep all "peeked" tokens slightly improved the performance, but significantly increased the complexity.
After talking to @nfejzic, it was decided to convert all symbols to a token representation
that still distinguishes keywords from plain text,
but compresses most graphemes/symbols into one token.
This new token slice makes parsing and matching functions easier to implement.
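The compression idea above can be sketched as follows. This is a minimal illustration, not the PR's actual API: the TokenKind variants, the Token fields, and to_tokens are made-up names, and chars stand in for graphemes with position information.

```rust
// Hypothetical sketch: keyword graphemes stay individual tokens, while runs
// of non-keyword graphemes are compressed into a single Plain token.

#[derive(Debug, PartialEq)]
enum TokenKind {
    Star,  // keyword grapheme, kept as its own token
    Tick,  // keyword grapheme, kept as its own token
    Plain, // run of non-keyword graphemes compressed into one token
}

#[derive(Debug, PartialEq)]
struct Token {
    kind: TokenKind,
    start: usize, // offset of the first compressed symbol
    len: usize,   // number of symbols compressed into this token
}

fn to_tokens(symbols: &[char]) -> Vec<Token> {
    let mut tokens: Vec<Token> = Vec::new();
    for (i, sym) in symbols.iter().enumerate() {
        let kind = match sym {
            '*' => Some(TokenKind::Star),
            '`' => Some(TokenKind::Tick),
            _ => None,
        };
        match kind {
            Some(kind) => tokens.push(Token { kind, start: i, len: 1 }),
            None => match tokens.last_mut() {
                // extend the preceding Plain token instead of creating a new one
                Some(t) if t.kind == TokenKind::Plain => t.len += 1,
                _ => tokens.push(Token { kind: TokenKind::Plain, start: i, len: 1 }),
            },
        }
    }
    tokens
}
```

Because a token keeps the start offset and length of the symbols it compresses, line and column information can still be recovered from the original symbol positions.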

Parser chaining

To nest the symbol iterator for block parsing, the iterator had to be cloned,
and its progress had to be manually synchronized when "unnesting".
This cloning and updating created unnecessary parser overhead,
because only one parser can make progress at a time.
With the change to a token slice, iterating over this slice makes it easy to jump to any position inside the iterator.
Checkpoint and rollback functionality was added to the token iterator,
so that an element parser can roll back to a previously created checkpoint if it did not produce a valid element.

Parser functions for elements now take ownership of the parser,
and must return the (possibly modified) parser together with the optionally created element.
This allows element parsers to be chained, rolling the parser back whenever a parser does not produce an element. This return tuple has the negative side effect that "?" cannot be used for early exit
when the token iterator returns no new tokens.
However, this is considered a small trade-off, because early exit was mostly used only at the start of a parser function.
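A rough sketch of this ownership-based chaining is shown below, under the assumption of a simplified parser over chars. Parser, Element, parse_bold, parse_plain, and parse_any are all made-up names for illustration, not the PR's API.

```rust
// Sketch: element parsers own the parser and hand it back, so the caller can
// roll back to the pre-call state when no element was produced.

#[derive(Debug, PartialEq)]
enum Element {
    Bold,
    Plain,
}

struct Parser<'t> {
    tokens: &'t [char],
    index: usize,
}

fn parse_bold(mut p: Parser<'_>) -> (Parser<'_>, Option<Element>) {
    if p.tokens.get(p.index) == Some(&'*') {
        p.index += 1;
        (p, Some(Element::Bold))
    } else {
        (p, None) // no `?` early exit: the parser must always be returned
    }
}

fn parse_plain(mut p: Parser<'_>) -> (Parser<'_>, Option<Element>) {
    if p.index < p.tokens.len() {
        p.index += 1;
        (p, Some(Element::Plain))
    } else {
        (p, None)
    }
}

/// Try element parsers in order, rolling back between failed attempts.
fn parse_any(mut p: Parser<'_>) -> (Parser<'_>, Option<Element>) {
    let checkpoint = p.index;
    let parsers = [parse_bold as fn(Parser) -> (Parser, Option<Element>), parse_plain];
    for parse in parsers {
        let (returned, elem) = parse(p);
        p = returned;
        if elem.is_some() {
            return (p, elem);
        }
        p.index = checkpoint; // rollback before trying the next parser
    }
    (p, None)
}
```

Returning the parser by value is what makes the rollback safe: a failed parser cannot accidentally keep a stale iterator alive, because there is only ever one parser making progress.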

Implicit substitutions

Because implicit substitutions should take precedence over other block and inline elements,
it was decided to move the handling of these substitutions into the conversion from symbols to tokens.
However, this conversion is not yet implemented, because correct direct URI, arrow, and emoji substitution adds complexity that would further increase the size of this PR.
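To illustrate why substitution during tokenization gives it precedence: once the substitution is baked into a token, no element parser ever sees the individual symbols. The sketch below uses the arrow substitution "-->" as an example; the Token shape, the substituted glyph, and tokenize are assumptions for illustration only.

```rust
// Sketch: an implicit substitution handled while converting symbols to
// tokens, so downstream parsers only ever see the substituted token.

#[derive(Debug, PartialEq)]
enum Token {
    ImplicitSubst(&'static str), // already-substituted content
    Plain(String),
}

fn tokenize(input: &str) -> Vec<Token> {
    let mut tokens: Vec<Token> = Vec::new();
    let mut rest = input;
    while !rest.is_empty() {
        if let Some(stripped) = rest.strip_prefix("-->") {
            // substitution happens here, before any element parsing
            tokens.push(Token::ImplicitSubst("⟶"));
            rest = stripped;
        } else {
            let mut chars = rest.chars();
            let c = chars.next().unwrap();
            // merge into a preceding Plain token instead of one token per char
            match tokens.last_mut() {
                Some(Token::Plain(s)) => s.push(c),
                _ => tokens.push(Token::Plain(c.to_string())),
            }
            rest = chars.as_str();
        }
    }
    tokens
}
```

URI and emoji substitution would need more lookahead than this prefix check, which is the complexity the PR defers.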

Prepare to handle logic in verbatim context

The new parser provides context structs for blocks and inlines
to correctly handle whitespace merging and escaping, and to prepare for detecting logic constructs
in verbatim contexts,
e.g. inline verbatim: {$my-var} with logic
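One way such a context struct could steer behavior is sketched below: a flag that switches off whitespace merging inside verbatim content. InlineContext, its field, and merge_whitespace are hypothetical names, not the PR's actual context API.

```rust
// Sketch: a parsing context that changes whitespace handling in verbatim.

#[derive(Default)]
struct InlineContext {
    /// Inside verbatim, whitespace is kept as-is and most keywords are plain,
    /// but logic constructs (e.g. {$my-var}) must still be detected.
    in_verbatim: bool,
}

fn merge_whitespace(content: &str, ctx: &InlineContext) -> String {
    if ctx.in_verbatim {
        content.to_string() // keep verbatim content untouched
    } else {
        // outside verbatim, runs of whitespace merge into a single space
        content.split_whitespace().collect::<Vec<_>>().join(" ")
    }
}
```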

This context handling is needed for advanced matching and element spans.
Implicit substitutions and whitespace are now merged into Plain,
which makes snapshots more readable.
@mhatzl mhatzl marked this pull request as ready for review November 12, 2023 13:25
@mhatzl (Contributor, Author) commented Nov 12, 2023

The remaining todo! macros are intentional, because NamedSubstitution is not yet implemented.
The base iterator file in commons is commented out, because it contains code that could serve as a template for direct grapheme-iterator-to-token-slice conversion.

At the moment, conversion from str to the token slice is done by first creating the symbol slice and then creating the token slice. With a small dynamic cache in the SymbolIterator, it should be possible to convert directly from str to the token slice using the GraphemeIterator.
This should significantly improve overall parsing performance and memory usage.
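The proposed direct conversion could look roughly like this: feed graphemes straight into token creation instead of materializing a full symbol slice first. To keep the sketch std-only, chars stand in for the grapheme clusters the real GraphemeIterator would yield, a Peekable plays the role of the small dynamic cache, and the Token shape is made up.

```rust
// Sketch: streaming str-to-token conversion with no intermediate symbol Vec.

#[derive(Debug, PartialEq)]
enum Token {
    Keyword(char),
    Plain(String),
}

fn tokenize_direct(input: &str) -> Vec<Token> {
    let mut tokens: Vec<Token> = Vec::new();
    // Peekable acts as the "small dynamic cache" over the grapheme stream.
    let mut graphemes = input.chars().peekable();
    while let Some(&c) = graphemes.peek() {
        if c == '*' || c == '`' {
            graphemes.next();
            tokens.push(Token::Keyword(c));
        } else {
            // collect a run of non-keyword graphemes into one Plain token
            let mut plain = String::new();
            while let Some(&c) = graphemes.peek() {
                if c == '*' || c == '`' {
                    break;
                }
                plain.push(c);
                graphemes.next();
            }
            tokens.push(Token::Plain(plain));
        }
    }
    tokens
}
```

Skipping the intermediate symbol slice avoids one full allocation pass over the input, which is where the expected performance and memory win comes from.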

@nfejzic (Collaborator) left a comment

I only had time for this much right now, and I think these comments will also take some time to resolve. I will review the rest at some later point.

Resolved review threads on:
commons/src/lexer/position.rs
commons/src/lexer/symbol/iterator.rs (3 threads)
commons/src/lexer/token/implicit/mod.rs
commons/src/lexer/token/iterator/mod.rs (2 threads)
commons/src/lexer/token/iterator/slice.rs
commons/src/lexer/token/mod.rs
commons/src/parsing.rs
@mhatzl mhatzl changed the title feat: major parser update to use iterators for blocks and inlines feat: update parser to use iterators for blocks and inlines Nov 19, 2023
@nfejzic (Collaborator) left a comment

This is a very large PR, and most of it is not reviewed. However, we discussed it internally and concluded that there are time constraints at the moment.

Because of that, I will rely on the knowledge and confidence of @mhatzl that this works correctly, and approve the PR.

@mhatzl mhatzl merged commit c1a4827 into main Nov 19, 2023
3 checks passed