-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: create SymbolIterator for block parsing #106
Conversation
…kup-rs into symbol-iterator
check in lock file to prevent this in the future
I had to fix Clippy currently fails due to old code. |
I also implemented rendering for newline and whitespace, because I wanted to make sure the resulting HTML is still correct using the iterator approach. |
After seeing the failed actions due to |
Behavior remains the same, because this was the default anyways.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a big one, and big ones always have lots of comments. Some might be just do discuss a little bit and require no action, some are suggestions to (hopefully) improve the implementation ever so slightly.
Overall this looks very nice, and it solves a huge problem of parsing nested content!
Edit:
The scanner creates the vector of Symbols using the grapheme segmenter of icu.
This segmenter itself is an iterator, so we might be able to directly embed this iterator
in the SymbolIteratorRoot. This would require some form of a symbol cache to allow peeking,
but using an internal VecDeque should do the trick.
Regarding this, inspired by this I'm working on a crate called ribbon
that should help with this. It's an abstraction driven by an iterator, allowing for simultaneous access over multiple items yielded by an iterator at the same time. This could be useful at multiple places (iterating over symbols, inlines tokenization, inlines token resolver etc.).
commons/src/scanner/mod.rs
Outdated
Scanner::new() | ||
} | ||
} | ||
|
||
Self { | ||
provider: self.provider.clone(), | ||
segmenter, | ||
} | ||
impl Default for Scanner { | ||
fn default() -> Self { | ||
Self::new() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest I would remove the Scanner::new
function at this point and just implement and use Default
since we don't pass any params to new
anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I decided to fully remove Scanner
.
Instead I made scan_str()
a public function in the scanner
module.
// Note: Run `cargo build` before re-generating the data to ensure the newest binary is inspected by icu4x-datagen. | ||
include!("./icu_data/mod.rs"); | ||
impl_data_provider!(IcuDataProvider); | ||
use position::{Offset, Position as SymPos}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could be a good idea to rename Position
to SymPos
in general, since that's what it actually is 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think SymbolPosition
would be better in that case, and I would also change Offset
to SymbolOffset
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SymbolPosition
is looooong 😆. It does read better though. Both options are fine for me, you're free to choose whatever you find better 👍🏻.
P.S. if you can't choose, then choose randomly 🤣
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or we keep the names and move them into the symbol
module?
I was thinking about this option, but then scanner
becomes a bit useless?
But removing scanner, by moving symbol
up did not seem right to me.
|
||
impl<'input> EndMatcher for SymbolIterator<'input> { | ||
fn is_empty_line(&mut self) -> bool { | ||
// Note: Multiple matches may be set in the match closure, so we need to ensure that all start at the same index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an extreme nitpick, but I think convention is to use upper case for NOTE, FIXME, TODO etc. Uppercase versions get highlighted (at least in my editor) 🙈.
You can decide to change this or leave it, just wanted to mention it 👀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not know that about NOTE
.
But would you also write NOTE
in doc-comments?
Usually I write **Note:**
in doc-comments, but because normal comments have no formatting, I stayed with Note
here, because it felt more consistent.
Would have to look through the code so replace all Note
. Maybe keeping it as is for now, and change all in a new PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm good question. Generally I wouldn't write any form of Note:
in doc-comments. Would probably just explain it, something like Note that ...
.
You can decide if and when you want to change this, it's not important part of this PR anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think notes in doc-comments can be useful.
Think of them more like GitHub alerts.
It helps to highlight information that is especially relevant to a user.
Making a section bold does not make it readable, so you use alerts to highlight those sections instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know, but just in general keep in mind that it's not necessary most of the time. If something is that important, maybe separate heading is a better option. Otherwise we can just explain it. I also use NOTE:
often, but we should probably reserve it for special cases. It kind of loses it's purpose if we over-use it, just want us to be aware of that.
Co-authored-by: Nadir Fejzić <nadirfejzo@gmail.com>
Co-authored-by: Nadir Fejzić <nadirfejzo@gmail.com>
Co-authored-by: Nadir Fejzić <nadirfejzo@gmail.com>
Co-authored-by: Nadir Fejzić <nadirfejzo@gmail.com>
Co-authored-by: Nadir Fejzić <nadirfejzo@gmail.com>
Provide `scan_str()` as standalone function.
Assert only in debug mode.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
// Note: Run `cargo build` before re-generating the data to ensure the newest binary is inspected by icu4x-datagen. | ||
include!("./icu_data/mod.rs"); | ||
impl_data_provider!(IcuDataProvider); | ||
use position::{Offset, Position as SymPos}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SymbolPosition
is looooong 😆. It does read better though. Both options are fine for me, you're free to choose whatever you find better 👍🏻.
P.S. if you can't choose, then choose randomly 🤣
I will wait for PR #105 to be merged, and then merge this one to resolve the clippy warnings that keep failing the action. |
I now merged changes from PR #105. For lexer tests in inline I had to remove the |
This PR adds a
SymbolIterator
to iterate over theSymbol
slice returned from theScanner
.It enables convenient tokenization and parsing for block elements,
by allowing to nest iterators, and define functions to match for block end and to strip line prefixes.
This is part of preparation for parsing more complex elements like lists.
Decisions
Nesting iterators
Because block elements may allow nesting, iterators may also be nested
to make parsing of elements nesting agnostic.
Parent iterators only forward symbols to children, if the parent is not at the end,
and an optional prefix is stripped.
Iterators keep track of their nesting depth, to allow nesting of sequences like
(outer (inner))
.Functions/Closures for end/prefix matching
It was decided to allow functions to be passed for end/prefix matching instead of multiple Symbol
slices. This allows fine grained matching, and allows to distinguish between peeked and consumed
matching.
Prefix matching is always consuming, because nested iterators must not see prefix symbols of parent
elements.
Future work
Use
SymbolIterator
for inlinesCurrently, inline parsing is done on a symbol slice.
Inline parsing internally uses
get_symbol(index)
quite often, which makes it hard to adopt theiterator approach. For now, blocks retrieve a
Symbol
vector for inline content,and then pass this to the inline parser.
With the new iterator approach supporting end detection, it might be possible to use
the same tokenization and parsing approach as done for blocks.
The handling of ambiguous tokens like
***
won't be trivial, but I am confident thatit can be done by also matching for Bold and Italic directly in the ambiguous parser.
Changing inline parsing will take time, and possibly break things.
I would suggest that we first implement more blocks to see if the iterator approach is feasible.
If the new approach seems to work well for blocks, we can try to adopt it also for inlines.
Iterators all the way
The scanner creates the vector of
Symbol
s using the grapheme segmenter oficu
.This segmenter itself is an iterator, so we might be able to directly embed this iterator
in the
SymbolIteratorRoot
. This would require some form of a symbol cache to allow peeking,but using an internal
VecDeque
should do the trick.Note: This effectively removes the option to get a slice of remaining symbols,
and therefore the option to get size hints of remaining symbols.
The size hint could still be approximated by using the utf8 length of the input string.
This won't be accurate on Windows, because new lines have utf8 length of two due to
\r\n
.I am also unsure about the effect on performance and memory usage, when switching to full iterator.
Memory should in theory become better, because we do not allocate memory for all symbols.
Compared to allocation needed for the transformed content, this improvement might be neglectable
though.