Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: create SymbolIterator for block parsing #106

Merged
merged 45 commits into from
Oct 2, 2023
Merged

Conversation

mhatzl
Copy link
Contributor

@mhatzl mhatzl commented Sep 24, 2023

This PR adds a SymbolIterator to iterate over the Symbol slice returned from the Scanner.
It enables convenient tokenization and parsing for block elements,
by allowing to nest iterators, and define functions to match for block end and to strip line prefixes.

This is part of preparation for parsing more complex elements like lists.

Decisions

  1. Nesting iterators

    Because block elements may allow nesting, iterators may also be nested
    to make parsing of elements nesting agnostic.
    Parent iterators only forward symbols to children, if the parent is not at the end,
    and an optional prefix is stripped.

    Iterators keep track of their nesting depth, to allow nesting of sequences like (outer (inner)).

  2. Functions/Closures for end/prefix matching

    It was decided to allow functions to be passed for end/prefix matching instead of multiple Symbol
    slices. This allows fine grained matching, and allows to distinguish between peeked and consumed
    matching.

    Prefix matching is always consuming, because nested iterators must not see prefix symbols of parent
    elements.

Future work

  • Use SymbolIterator for inlines

    Currently, inline parsing is done on a symbol slice.
    Inline parsing internally uses get_symbol(index) quite often, which makes it hard to adopt the
    iterator approach. For now, blocks retrieve a Symbol vector for inline content,
    and then pass this to the inline parser.

    With the new iterator approach supporting end detection, it might be possible to use
    the same tokenization and parsing approach as done for blocks.
    The handling of ambiguous tokens like *** won't be trivial, but I am confident that
    it can be done by also matching for Bold and Italic directly in the ambiguous parser.

    Changing inline parsing will take time, and possibly break things.
    I would suggest that we first implement more blocks to see if the iterator approach is feasible.
    If the new approach seems to work well for blocks, we can try to adopt it also for inlines.

  • Iterators all the way

    The scanner creates the vector of Symbols using the grapheme segmenter of icu.
    This segmenter itself is an iterator, so we might be able to directly embed this iterator
    in the SymbolIteratorRoot. This would require some form of a symbol cache to allow peeking,
    but using an internal VecDeque should do the trick.

    Note: This effectively removes the option to get a slice of remaining symbols,
    and therefore the option to get size hints of remaining symbols.
    The size hint could still be approximated by using the utf8 length of the input string.
    This won't be accurate on Windows, because new lines have utf8 length of two due to \r\n.

    I am also unsure about the effect on performance and memory usage, when switching to full iterator.
    Memory should in theory become better, because we do not allocate memory for all symbols.
    Compared to allocation needed for the transformed content, this improvement might be neglectable
    though.

@mhatzl mhatzl requested a review from nfejzic September 24, 2023 14:52
@mhatzl
Copy link
Contributor Author

mhatzl commented Sep 24, 2023

I had to fix icu dependencies and check in Cargo.lock, because icu has internal version conflicts between 1.2.0 and 1.3.0.

Clippy currently fails due to old code.
This is fixed in PR #105 so I will merge it once this PR is merged into main.

@mhatzl
Copy link
Contributor Author

mhatzl commented Sep 24, 2023

I also implemented rendering for newline and whitespace, because I wanted to make sure the resulting HTML is still correct using the iterator approach.

@mhatzl
Copy link
Contributor Author

mhatzl commented Sep 25, 2023

After seeing the failed actions due to icu dependencies yesterday,
I updated to icu 1.3.0, which allows us to use the grapheme iterator without any generated data.
This also allows to remove the pin to a specific version, because we only need icu_segmenter and icu_locid both of which probably won't create version conflicts.

Copy link
Collaborator

@nfejzic nfejzic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a big one, and big ones always have lots of comments. Some might be just do discuss a little bit and require no action, some are suggestions to (hopefully) improve the implementation ever so slightly.

Overall this looks very nice, and it solves a huge problem of parsing nested content!

Edit:

The scanner creates the vector of Symbols using the grapheme segmenter of icu.
This segmenter itself is an iterator, so we might be able to directly embed this iterator
in the SymbolIteratorRoot. This would require some form of a symbol cache to allow peeking,
but using an internal VecDeque should do the trick.

Regarding this, inspired by this I'm working on a crate called ribbon that should help with this. It's an abstraction driven by an iterator, allowing for simultaneous access over multiple items yielded by an iterator at the same time. This could be useful at multiple places (iterating over symbols, inlines tokenization, inlines token resolver etc.).

Comment on lines 19 to 25
Scanner::new()
}
}

Self {
provider: self.provider.clone(),
segmenter,
}
impl Default for Scanner {
fn default() -> Self {
Self::new()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest I would remove the Scanner::new function at this point and just implement and use Default since we don't pass any params to new anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided to fully remove Scanner.
Instead I made scan_str() a public function in the scanner module.

// Note: Run `cargo build` before re-generating the data to ensure the newest binary is inspected by icu4x-datagen.
include!("./icu_data/mod.rs");
impl_data_provider!(IcuDataProvider);
use position::{Offset, Position as SymPos};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be a good idea to rename Position to SymPos in general, since that's what it actually is 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think SymbolPosition would be better in that case, and I would also change Offset to SymbolOffset.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SymbolPosition is looooong 😆. It does read better though. Both options are fine for me, you're free to choose whatever you find better 👍🏻.

P.S. if you can't choose, then choose randomly 🤣

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or we keep the names and move them into the symbol module?
I was thinking about this option, but then scanner becomes a bit useless?
But removing scanner, by moving symbol up did not seem right to me.

commons/src/scanner/symbol/iterator/matcher.rs Outdated Show resolved Hide resolved

impl<'input> EndMatcher for SymbolIterator<'input> {
fn is_empty_line(&mut self) -> bool {
// Note: Multiple matches may be set in the match closure, so we need to ensure that all start at the same index
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an extreme nitpick, but I think convention is to use upper case for NOTE, FIXME, TODO etc. Uppercase versions get highlighted (at least in my editor) 🙈.

You can decide to change this or leave it, just wanted to mention it 👀

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not know that about NOTE.
But would you also write NOTE in doc-comments?
Usually I write **Note:** in doc-comments, but because normal comments have no formatting, I stayed with Note here, because it felt more consistent.

Would have to look through the code so replace all Note. Maybe keeping it as is for now, and change all in a new PR?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm good question. Generally I wouldn't write any form of Note: in doc-comments. Would probably just explain it, something like Note that ....

You can decide if and when you want to change this, it's not important part of this PR anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think notes in doc-comments can be useful.
Think of them more like GitHub alerts.

It helps to highlight information that is especially relevant to a user.
Making a section bold does not make it readable, so you use alerts to highlight those sections instead.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know, but just in general keep in mind that it's not necessary most of the time. If something is that important, maybe separate heading is a better option. Otherwise we can just explain it. I also use NOTE: often, but we should probably reserve it for special cases. It kind of loses it's purpose if we over-use it, just want us to be aware of that.

commons/src/scanner/symbol/iterator/matcher.rs Outdated Show resolved Hide resolved
commons/src/scanner/symbol/iterator/mod.rs Outdated Show resolved Hide resolved
commons/src/scanner/symbol/iterator/mod.rs Outdated Show resolved Hide resolved
commons/src/scanner/symbol/iterator/mod.rs Outdated Show resolved Hide resolved
commons/src/scanner/symbol/iterator/root.rs Outdated Show resolved Hide resolved
commons/src/scanner/symbol/iterator/mod.rs Outdated Show resolved Hide resolved
@mhatzl mhatzl requested a review from nfejzic September 29, 2023 17:04
nfejzic
nfejzic previously approved these changes Sep 29, 2023
Copy link
Collaborator

@nfejzic nfejzic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

// Note: Run `cargo build` before re-generating the data to ensure the newest binary is inspected by icu4x-datagen.
include!("./icu_data/mod.rs");
impl_data_provider!(IcuDataProvider);
use position::{Offset, Position as SymPos};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SymbolPosition is looooong 😆. It does read better though. Both options are fine for me, you're free to choose whatever you find better 👍🏻.

P.S. if you can't choose, then choose randomly 🤣

commons/src/scanner/symbol/iterator/root.rs Show resolved Hide resolved
commons/src/scanner/symbol/iterator/root.rs Show resolved Hide resolved
commons/src/test_runner/mod.rs Outdated Show resolved Hide resolved
@mhatzl
Copy link
Contributor Author

mhatzl commented Sep 29, 2023

I will wait for PR #105 to be merged, and then merge this one to resolve the clippy warnings that keep failing the action.

@mhatzl
Copy link
Contributor Author

mhatzl commented Oct 1, 2023

I now merged changes from PR #105.
There were some merge conflicts, but all tests are passing now, and it should be ok to merge this PR.

For lexer tests in inline I had to remove the EOI symbol before running the tests.
With the EOI symbol, all lexer tests failed with an unwrap on None.
Assuming that we will update all of inline parsing to be based on iterators, I would leave this workaround as is for now.
Because inlines for now won't see the EOI symbol, because block elements strip it for them.

@nfejzic nfejzic merged commit dd98ae2 into main Oct 2, 2023
4 checks passed
@nfejzic nfejzic deleted the symbol-iterator branch October 2, 2023 07:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants