feat: create SymbolIterator for block parsing #106

mhatzl · 2023-09-24T14:52:23Z

This PR adds a SymbolIterator to iterate over the Symbol slice returned from the Scanner.
It enables convenient tokenization and parsing of block elements
by allowing iterators to be nested and by accepting functions that match the block end and strip line prefixes.

This is part of the preparation for parsing more complex elements such as lists.

Decisions

Nesting iterators

Because block elements may allow nesting, iterators may also be nested
to make the parsing of elements nesting-agnostic.
A parent iterator only forwards symbols to its children if the parent itself is not at its end,
and only after an optional prefix has been stripped.

Iterators keep track of their nesting depth, to allow nesting of sequences like (outer (inner)).
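
The nesting described above can be sketched roughly as follows. This is a simplified illustration and not the code from this PR: `NestedIter`, `EndFn`, `nest()`, `update()` and `collect()` are hypothetical names, and `char` stands in for `Symbol`. In this sketch, a child consumes its own end symbol, while the ends of its ancestors are only peeked.

```rust
/// Hypothetical stand-in for the crate's `Symbol` type.
type Symbol = char;

/// Returns `true` if the remaining symbols start with the end of a block.
type EndFn = fn(&[Symbol]) -> bool;

/// Simplified nested iterator over a symbol slice (illustrative only).
struct NestedIter<'a> {
    symbols: &'a [Symbol],
    index: usize,
    depth: usize,
    /// End functions of all ancestors. They are only peeked, never consumed.
    parent_ends: Vec<EndFn>,
    /// End function of this iterator. A match consumes one symbol.
    own_end: Option<EndFn>,
    done: bool,
}

impl<'a> NestedIter<'a> {
    fn root(symbols: &'a [Symbol]) -> Self {
        NestedIter { symbols, index: 0, depth: 0, parent_ends: Vec::new(), own_end: None, done: false }
    }

    /// Creates a child one level deeper that starts at the current position.
    fn nest(&self, end: EndFn) -> NestedIter<'a> {
        let mut parent_ends = self.parent_ends.clone();
        parent_ends.extend(self.own_end);
        NestedIter {
            symbols: self.symbols,
            index: self.index,
            depth: self.depth + 1,
            parent_ends,
            own_end: Some(end),
            done: false,
        }
    }

    /// Moves this iterator to the position its direct child has reached.
    fn update(&mut self, child: &NestedIter<'a>) {
        debug_assert_eq!(self.depth + 1, child.depth, "update must come from a direct child");
        self.index = child.index;
    }
}

impl<'a> Iterator for NestedIter<'a> {
    type Item = Symbol;

    fn next(&mut self) -> Option<Symbol> {
        if self.done {
            return None;
        }
        let symbols = self.symbols;
        let rest = &symbols[self.index..];
        let own_end = !rest.is_empty() && self.own_end.map_or(false, |end| end(rest));
        if own_end {
            // Consume the end symbol so that ancestors never see it.
            self.index += 1;
        }
        // Nothing is forwarded once this iterator or any ancestor is at its end.
        if rest.is_empty() || own_end || self.parent_ends.iter().any(|end| end(rest)) {
            self.done = true;
            return None;
        }
        self.index += 1;
        Some(rest[0])
    }
}

fn is_close(rest: &[Symbol]) -> bool {
    rest.first() == Some(&')')
}

/// Collects the text seen at every nesting depth, innermost first.
fn collect(iter: &mut NestedIter<'_>, out: &mut Vec<(usize, String)>) {
    let mut text = String::new();
    while let Some(symbol) = iter.next() {
        if symbol == '(' {
            let mut child = iter.nest(is_close);
            collect(&mut child, out);
            iter.update(&child);
        } else {
            text.push(symbol);
        }
    }
    out.push((iter.depth, text));
}
```

Running `collect()` on the symbols of `(outer (inner))` yields `inner` at depth 2 and `outer ` at depth 1, because each closing parenthesis is consumed by the iterator that was opened last.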

Functions/Closures for end/prefix matching

We decided to allow functions to be passed for end/prefix matching instead of multiple Symbol
slices. This allows fine-grained matching and makes it possible to distinguish between peeked and
consumed matching.

Prefix matching is always consuming, because nested iterators must not see prefix symbols of parent
elements.
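
To illustrate the difference between peeked and consumed matching, here is a minimal sketch. `Matcher`, `matches()`, `consumed_matches()` and `block_content()` are hypothetical names for this example and not necessarily the API added in this PR; `char` again stands in for `Symbol`.

```rust
/// Hypothetical stand-in for the crate's `Symbol` type.
type Symbol = char;

/// Cursor handed to end and prefix functions (hypothetical, simplified).
struct Matcher<'a> {
    symbols: &'a [Symbol],
    index: usize,
}

impl<'a> Matcher<'a> {
    /// Peeked matching: reports a match but leaves the position untouched.
    fn matches(&self, sequence: &[Symbol]) -> bool {
        self.symbols[self.index..].starts_with(sequence)
    }

    /// Consumed matching: on a match, the position moves past the sequence.
    fn consumed_matches(&mut self, sequence: &[Symbol]) -> bool {
        let matched = self.matches(sequence);
        if matched {
            self.index += sequence.len();
        }
        matched
    }
}

/// Returns the content of one block and the index at which the block stopped.
/// `end` decides where the block stops; `prefix` is called at every line start.
fn block_content(
    symbols: &[Symbol],
    end: impl Fn(&mut Matcher) -> bool,
    prefix: impl Fn(&mut Matcher) -> bool,
) -> (String, usize) {
    let mut matcher = Matcher { symbols, index: 0 };
    let mut content = String::new();
    let mut line_start = true;

    while matcher.index < symbols.len() {
        // The prefix must be consumed before any content is forwarded.
        // A line without the prefix no longer belongs to this block.
        if line_start && !prefix(&mut matcher) {
            break;
        }
        if matcher.index >= symbols.len() || end(&mut matcher) {
            break;
        }
        let symbol = symbols[matcher.index];
        matcher.index += 1;
        line_start = symbol == '\n';
        content.push(symbol);
    }

    (content, matcher.index)
}
```

With the prefix closure `|m| m.consumed_matches(&['>', ' '])`, the content of `> a\n> b\nc` is `a\nb\n`, so a nested parser never sees the `> ` prefix. An end closure using `consumed_matches()` leaves the position after the end sequence, while one using `matches()` leaves the end sequence in place for the parent.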

Future work

Use SymbolIterator for inlines

Currently, inline parsing is done on a symbol slice.
Inline parsing internally uses get_symbol(index) quite often, which makes it hard to adopt the
iterator approach. For now, blocks retrieve a Symbol vector for inline content,
and then pass this to the inline parser.

With the new iterator approach supporting end detection, it might be possible to use
the same tokenization and parsing approach as done for blocks.
The handling of ambiguous tokens like *** won't be trivial, but I am confident that
it can be done by also matching for Bold and Italic directly in the ambiguous parser.

Changing inline parsing will take time, and possibly break things.
I would suggest that we first implement more blocks to see if the iterator approach is feasible.
If the new approach seems to work well for blocks, we can try to adopt it for inlines as well.

Iterators all the way

The scanner creates the vector of Symbols using the grapheme segmenter of icu.
This segmenter itself is an iterator, so we might be able to directly embed this iterator
in the SymbolIteratorRoot. This would require some form of a symbol cache to allow peeking,
but using an internal VecDeque should do the trick.
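
A rough sketch of such a cache, assuming nothing about the real SymbolIteratorRoot: `PeekCache` and `peek_nth()` are hypothetical names, and any iterator (here `str::chars()` in place of the grapheme segmenter) can act as the source.

```rust
use std::collections::VecDeque;

/// Wraps any iterator and caches items that were peeked but not yet consumed.
struct PeekCache<I: Iterator> {
    source: I,
    cache: VecDeque<I::Item>,
}

impl<I: Iterator> PeekCache<I> {
    fn new(source: I) -> Self {
        PeekCache { source, cache: VecDeque::new() }
    }

    /// Returns the item `n` positions ahead without consuming anything.
    fn peek_nth(&mut self, n: usize) -> Option<&I::Item> {
        // Pull items from the source until the cache reaches position `n`.
        while self.cache.len() <= n {
            let item = self.source.next()?;
            self.cache.push_back(item);
        }
        self.cache.get(n)
    }
}

impl<I: Iterator> Iterator for PeekCache<I> {
    type Item = I::Item;

    fn next(&mut self) -> Option<I::Item> {
        // Serve cached items first so that peeking never reorders the stream.
        self.cache.pop_front().or_else(|| self.source.next())
    }
}
```

Only the symbols between the consumed position and the furthest peek are held in memory, so the cache stays small as long as parsers do not peek far ahead.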

Note: This effectively removes the option to get a slice of the remaining symbols,
and therefore the option to get size hints for the remaining symbols.
The size hint could still be approximated by using the UTF-8 length of the input string.
This won't be accurate on Windows, because newlines have a UTF-8 length of two due to \r\n.
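
As a small illustration of that approximation: every symbol occupies at least one byte, so the UTF-8 length is a valid upper bound but not an exact count. Both function names below are hypothetical, and `count_symbols()` is a deliberately simplified stand-in for grapheme segmentation that only merges `\r\n` into one symbol.

```rust
/// Size hint derived from the UTF-8 byte length of the input.
/// Each symbol needs at least one byte, so the length is an upper bound.
fn utf8_size_hint(input: &str) -> (usize, Option<usize>) {
    let lower = usize::from(!input.is_empty());
    (lower, Some(input.len()))
}

/// Simplified symbol count: `\r\n` counts as one symbol, every other `char` as one.
/// Real grapheme segmentation merges more sequences than this.
fn count_symbols(input: &str) -> usize {
    input.chars().count() - input.matches("\r\n").count()
}
```

For `"a\r\nb"` the hint reports an upper bound of 4 while only 3 symbols exist. Multi-byte characters widen the gap in the same way.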

I am also unsure about the effect on performance and memory usage when switching to a full iterator approach.
In theory, memory usage should improve, because we would not allocate memory for all symbols.
Compared to the allocation needed for the transformed content, this improvement might be negligible
though.

mhatzl · 2023-09-24T16:38:56Z

I had to fix icu dependencies and check in Cargo.lock, because icu has internal version conflicts between 1.2.0 and 1.3.0.

Clippy currently fails due to old code.
This is fixed in PR #105 so I will merge it once this PR is merged into main.

mhatzl · 2023-09-24T16:41:39Z

I also implemented rendering for newline and whitespace, because I wanted to make sure the resulting HTML is still correct using the iterator approach.

mhatzl · 2023-09-25T17:57:40Z

After seeing the failed actions due to icu dependencies yesterday,
I updated to icu 1.3.0, which allows us to use the grapheme iterator without any generated data.
This also allows us to remove the pin to a specific version, because we only need icu_segmenter and icu_locid, both of which probably won't create version conflicts.

Behavior remains the same, because this was the default anyways.

nfejzic

This is a big one, and big ones always have lots of comments. Some are just meant to start a little discussion and require no action; others are suggestions to (hopefully) improve the implementation ever so slightly.

Overall this looks very nice, and it solves a huge problem of parsing nested content!

Edit:

The scanner creates the vector of Symbols using the grapheme segmenter of icu.
This segmenter itself is an iterator, so we might be able to directly embed this iterator
in the SymbolIteratorRoot. This would require some form of a symbol cache to allow peeking,
but using an internal VecDeque should do the trick.

Regarding this: inspired by it, I'm working on a crate called ribbon that should help here. It's an abstraction driven by an iterator that allows simultaneous access to multiple items yielded by that iterator. This could be useful in multiple places (iterating over symbols, inline tokenization, the inline token resolver, etc.).

nfejzic · 2023-09-26T19:23:23Z

commons/src/scanner/mod.rs

+        Scanner::new()
+    }
+}

-        Self {
-            provider: self.provider.clone(),
-            segmenter,
-        }
+impl Default for Scanner {
+    fn default() -> Self {
+        Self::new()


To be honest I would remove the Scanner::new function at this point and just implement and use Default since we don't pass any params to new anyway.

I decided to fully remove Scanner.
Instead I made scan_str() a public function in the scanner module.

nfejzic · 2023-09-26T19:24:16Z

commons/src/scanner/mod.rs

-// Note: Run `cargo build` before re-generating the data to ensure the newest binary is inspected by icu4x-datagen.
-include!("./icu_data/mod.rs");
-impl_data_provider!(IcuDataProvider);
+use position::{Offset, Position as SymPos};


It could be a good idea to rename Position to SymPos in general, since that's what it actually is 🤔

I think SymbolPosition would be better in that case, and I would also change Offset to SymbolOffset.

SymbolPosition is looooong 😆. It does read better though. Both options are fine for me, you're free to choose whatever you find better 👍🏻.

P.S. if you can't choose, then choose randomly 🤣

Or we keep the names and move them into the symbol module?
I was thinking about this option, but then scanner becomes a bit useless?
But removing scanner, by moving symbol up did not seem right to me.

nfejzic · 2023-09-26T19:31:05Z

commons/src/scanner/symbol/iterator/matcher.rs

+
+impl<'input> EndMatcher for SymbolIterator<'input> {
+    fn is_empty_line(&mut self) -> bool {
+        // Note: Multiple matches may be set in the match closure, so we need to ensure that all start at the same index


This is an extreme nitpick, but I think convention is to use upper case for NOTE, FIXME, TODO etc. Uppercase versions get highlighted (at least in my editor) 🙈.

You can decide to change this or leave it, just wanted to mention it 👀

I did not know that about NOTE.
But would you also write NOTE in doc-comments?
Usually I write **Note:** in doc-comments, but because normal comments have no formatting, I stayed with Note here, because it felt more consistent.

I would have to look through the code to replace all Note occurrences. Maybe keep it as is for now, and change them all in a new PR?

Hmm good question. Generally I wouldn't write any form of Note: in doc-comments. Would probably just explain it, something like Note that ....

You can decide if and when you want to change this; it's not an important part of this PR anyway.

I think notes in doc-comments can be useful.
Think of them more like GitHub alerts.

It helps to highlight information that is especially relevant to a user.
Making a section bold does not make it readable, so you use alerts to highlight those sections instead.

I know, but in general keep in mind that it's not necessary most of the time. If something is that important, maybe a separate heading is a better option. Otherwise we can just explain it. I also use NOTE: often, but we should probably reserve it for special cases. It kind of loses its purpose if we overuse it; I just want us to be aware of that.

nfejzic

Nice!

nfejzic · 2023-09-29T19:29:56Z

SymbolPosition is looooong 😆. It does read better though. Both options are fine for me, you're free to choose whatever you find better 👍🏻.

P.S. if you can't choose, then choose randomly 🤣

mhatzl · 2023-09-29T19:45:18Z

I will wait for PR #105 to be merged, and then merge this one to resolve the clippy warnings that keep failing the action.

mhatzl · 2023-10-01T17:29:03Z

I now merged changes from PR #105.
There were some merge conflicts, but all tests are passing now, and it should be ok to merge this PR.

For the lexer tests in inline, I had to remove the EOI symbol before running the tests.
With the EOI symbol, all lexer tests failed with an unwrap on None.
Assuming that we will update all of inline parsing to be based on iterators, I would leave this workaround as is for now.
Inlines won't see the EOI symbol for now anyway, because block elements strip it for them.
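
The workaround boils down to something like the following sketch. `SymbolKind` here is a reduced, hypothetical enum and `strip_eoi()` is a made-up helper name, so the actual test code may look different.

```rust
/// Reduced, hypothetical version of the symbol kinds relevant here.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SymbolKind {
    Plain,
    Newline,
    Eoi,
}

/// Drops a trailing end-of-input symbol, if there is one.
fn strip_eoi(symbols: &[SymbolKind]) -> &[SymbolKind] {
    match symbols.split_last() {
        Some((SymbolKind::Eoi, rest)) => rest,
        _ => symbols,
    }
}
```

Input without a trailing EOI symbol is returned unchanged, so the helper is safe to call on symbols that a block element has already stripped.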

mhatzl and others added 25 commits September 21, 2023 18:02

feat: create SymbolIterator

a2ca1d2

feat: switch block parser to SymbolIterator

998d291

feat: add itertools for SymbolIterator

f1dc373

feat: switch to nesting symbol iterators

de52811

fix: add prefix line test for symbol iterator

5398be0

feat: simplify iterator nesting parsers

aba8224

Merge branch 'symbol-iterator' of https://github.com/Unimarkup/unimar…

88c3064

…kup-rs into symbol-iterator

fix: correct heading end closure to detect heading

fbefb50

fix: ignore newlines between elements

cd608b3

feat: make end-fn optional for new symbol iterator

32778c9

fix: change end fns to get SymboliterMatcher

1a5c5b0

fix: remove new_line from SymbolIterRoot

b8d430b

fix: remove remaining symbols from tokenize output

6ad4a8b

fix: correct prefix consumption for symbol iterator

c73286f

fix: fix endless loop in peeking_next()

27d8d70

fix: correct iterator length calculation

71171f3

fix: prevent plain from merging with newline token

57f5f72

fix: implement rendering for whitespace inlines

16c2a60

fix: add comment why reset_peek() is needed

1df4d76

fix: update verbatim to work with symbol iterator

f7cbbf8

arch: split iterator into multiple files

6c3c28e

fix: add documentation for the symbol iterator

ee317d2

feat: add nesting depth to symbol iterator

0d2c225

fix: add EOI symbol to match end as empty line

dd903f5

fix: remove EOI symbol for lexer tests

b74c089

mhatzl requested a review from nfejzic September 24, 2023 14:52

mhatzl added 2 commits September 24, 2023 16:57

fix: pin zerovec crate to specific version

45f4a1f

fix: resolve icu dependency problems

8487538

check in lock file to prevent this in the future

feat: update icu to not need any generated data

e1751f5

mhatzl added 2 commits September 25, 2023 20:19

fix: remove crate_authors!() due to clippy warning

f31143b

Behavior remains the same, because this was the default anyways.

chore: remove lock file from vc after icu bump

3746027

nfejzic requested changes Sep 28, 2023

mhatzl and others added 14 commits September 29, 2023 16:55

fix: add blankline for better readability

f8bab51

Co-authored-by: Nadir Fejzić <nadirfejzo@gmail.com>

fix: use debug_assert!() instead of cfg(debug_assertions)

b63b902

Co-authored-by: Nadir Fejzić <nadirfejzo@gmail.com>

fix: make peeking_next() more compact

0ad2063

fix: use owned Vec to create Paragraph from

17e1956

fix: use iter::once() to create end sequence

b20952f

Co-authored-by: Nadir Fejzić <nadirfejzo@gmail.com>

fix: remove double dot at end of sentence

0dc18ad

Co-authored-by: Nadir Fejzić <nadirfejzo@gmail.com>

fix: map length before unwrap of remaining_symbols

85f46ff

Co-authored-by: Nadir Fejzić <nadirfejzo@gmail.com>

fix: improve comments for SymbolIterator

6e12f23

fix: remove Scanner struct

7235dfb

Provide `scan_str()` as standalone function.

fix: restrict visibility of iterator index fns

02c4505

fix: remove duplicate From<> impls for iterators

0d5c8ab

fix: remove *curr* prefix for iterator functions

d489076

fix: remove *curr* prefix from index in root iterator

01a148b

fix: add assert to ensure update done on act parent

d710917

Assert only in debug mode.

mhatzl requested a review from nfejzic September 29, 2023 17:04

nfejzic previously approved these changes Sep 29, 2023

mhatzl mentioned this pull request Sep 30, 2023

feat: run snapshot tests dynamically in unimarkup-core crate #105

Merged

chore: merge branch 'main' into symbol-iterator

a69de7a

mhatzl dismissed nfejzic’s stale review via a69de7a October 1, 2023 17:20

nfejzic approved these changes Oct 2, 2023

nfejzic merged commit dd98ae2 into main Oct 2, 2023
4 checks passed

nfejzic deleted the symbol-iterator branch October 2, 2023 07:54
