feat: create SymbolIterator for block parsing #106

Merged
merged 45 commits into from
Oct 2, 2023
Changes from 14 commits
Commits
a2ca1d2
feat: create SymbolIterator
mhatzl Sep 21, 2023
998d291
feat: switch block parser to SymbolIterator
mhatzl Sep 21, 2023
f1dc373
feat: add itertools for SymbolIterator
mhatzl Sep 22, 2023
de52811
feat: switch to nesting symbol iterators
mhatzl Sep 22, 2023
5398be0
fix: add prefix line test for symbol iterator
mhatzl Sep 22, 2023
aba8224
feat: simplify iterator nesting parsers
nfejzic Sep 22, 2023
88c3064
Merge branch 'symbol-iterator' of https://github.com/Unimarkup/unimar…
mhatzl Sep 22, 2023
fbefb50
fix: correct heading end closure to detect heading
mhatzl Sep 22, 2023
cd608b3
fix: ignore newlines between elements
mhatzl Sep 22, 2023
32778c9
feat: make end-fn optional for new symbol iterator
mhatzl Sep 22, 2023
1a5c5b0
fix: change end fns to get SymboliterMatcher
mhatzl Sep 22, 2023
b8d430b
fix: remove new_line from SymbolIterRoot
mhatzl Sep 22, 2023
6ad4a8b
fix: remove remaining symbols from tokenize output
mhatzl Sep 23, 2023
c73286f
fix: correct prefix consumption for symbol iterator
mhatzl Sep 23, 2023
27d8d70
fix: fix endless loop in peeking_next()
mhatzl Sep 23, 2023
71171f3
fix: correct iterator length calculation
mhatzl Sep 23, 2023
57f5f72
fix: prevent plain from merging with newline token
mhatzl Sep 23, 2023
16c2a60
fix: implement rendering for whitespace inlines
mhatzl Sep 23, 2023
1df4d76
fix: add comment why reset_peek() is needed
mhatzl Sep 23, 2023
f7cbbf8
fix: update verbatim to work with symbol iterator
mhatzl Sep 23, 2023
6c3c28e
arch: split iterator into multiple files
mhatzl Sep 23, 2023
ee317d2
fix: add documentation for the symbol iterator
mhatzl Sep 24, 2023
0d2c225
feat: add nesting depth to symbol iterator
mhatzl Sep 24, 2023
dd903f5
fix: add EOI symbol to match end as empty line
mhatzl Sep 24, 2023
b74c089
fix: remove EOI symbol for lexer tests
mhatzl Sep 24, 2023
45f4a1f
fix: pin zerovec crate to specific version
mhatzl Sep 24, 2023
8487538
fix: resolve icu dependency problems
mhatzl Sep 24, 2023
e1751f5
feat: update icu to not need any generated data
mhatzl Sep 25, 2023
f31143b
fix: remove crate_authors!() due to clippy warning
mhatzl Sep 25, 2023
3746027
chore: remove lock file from vc after icu bump
mhatzl Sep 25, 2023
f8bab51
fix: add blankline for better readability
mhatzl Sep 29, 2023
b63b902
fix: use `debug_assert!()` instead of `cfg(debug_assertions)`
mhatzl Sep 29, 2023
0ad2063
fix: make peeking_next() more compact
mhatzl Sep 29, 2023
17e1956
fix: use owned Vec to create Paragraph from
mhatzl Sep 29, 2023
b20952f
fix: use `iter::once()` to create end sequence
mhatzl Sep 29, 2023
0dc18ad
fix: remove double dot at end of sentence
mhatzl Sep 29, 2023
85f46ff
fix: map length before unwrap of remaining_symbols
mhatzl Sep 29, 2023
6e12f23
fix: improve comments for SymbolIterator
mhatzl Sep 29, 2023
7235dfb
fix: remove Scanner struct
mhatzl Sep 29, 2023
02c4505
fix: restrict visibility of iterator index fns
mhatzl Sep 29, 2023
0d5c8ab
fix: remove duplicate From<> impls for iterators
mhatzl Sep 29, 2023
d489076
fix: remove *curr* prefix for iterator functions
mhatzl Sep 29, 2023
01a148b
fix: remove *curr* prefix from index in root iterator
mhatzl Sep 29, 2023
d710917
fix: add assert to ensure update done on act parent
mhatzl Sep 29, 2023
a69de7a
chore: merge branch 'main' into symbol-iterator
mhatzl Oct 1, 2023
131 changes: 55 additions & 76 deletions commons/src/scanner/mod.rs
@@ -1,4 +1,5 @@
//! Scanner and helper types and traits for structurization of Unimarkup input.
//! Functionality, iterators, helper types and traits to get [`Symbol`]s from `&str`.
//! These [`Symbol`]s and iterators are used to convert the input into a Unimarkup document.

use icu_segmenter::GraphemeClusterSegmenter;

@@ -9,87 +10,65 @@ mod symbol;
use position::{Offset, Position as SymPos};
pub use symbol::{iterator::*, Symbol, SymbolKind};
Collaborator

It could be a good idea to rename Position to SymPos in general, since that's what it actually is 🤔

Contributor Author

I think SymbolPosition would be better in that case, and I would also change Offset to SymbolOffset.

Collaborator

SymbolPosition is looooong 😆. It does read better though. Both options are fine for me, you're free to choose whatever you find better 👍🏻.

P.S. if you can't choose, then choose randomly 🤣

Contributor Author

Or we keep the names and move them into the symbol module?
I was thinking about this option, but then scanner becomes a bit useless.
And removing scanner by moving symbol up did not seem right to me.


#[derive(Debug)]
pub struct Scanner {
segmenter: GraphemeClusterSegmenter,
}
/// Scans given input and returns vector of [`Symbol`]s needed to convert the input to Unimarkup content.
pub fn scan_str(input: &str) -> Vec<Symbol<'_>> {
let segmenter = GraphemeClusterSegmenter::new();

impl Clone for Scanner {
fn clone(&self) -> Self {
Scanner::new()
}
}

impl Default for Scanner {
fn default() -> Self {
Self::new()
}
}

impl Scanner {
pub fn new() -> Self {
let segmenter = GraphemeClusterSegmenter::new();

Self { segmenter }
}
let mut symbols: Vec<Symbol> = Vec::new();
let mut curr_pos: SymPos = SymPos::default();
let mut prev_offset = 0;

pub fn scan_str<'s>(&self, input: &'s str) -> Vec<Symbol<'s>> {
let mut symbols: Vec<Symbol> = Vec::new();
let mut curr_pos: SymPos = SymPos::default();
let mut prev_offset = 0;
// skip(1) to ignore break at start of input
for offset in segmenter.segment_str(input).skip(1) {
if let Some(grapheme) = input.get(prev_offset..offset) {
let mut kind = SymbolKind::from(grapheme);

// skip(1) to ignore break at start of input
for offset in self.segmenter.segment_str(input).skip(1) {
if let Some(grapheme) = input.get(prev_offset..offset) {
let mut kind = SymbolKind::from(grapheme);

let end_pos = if kind == SymbolKind::Newline {
SymPos {
line: (curr_pos.line + 1),
..Default::default()
}
} else {
SymPos {
line: curr_pos.line,
col_utf8: (curr_pos.col_utf8 + grapheme.len()),
col_utf16: (curr_pos.col_utf16 + grapheme.encode_utf16().count()),
col_grapheme: (curr_pos.col_grapheme + 1),
}
};

if curr_pos.col_utf8 == 1 && kind == SymbolKind::Newline {
// newline at the start of line -> Blankline
kind = SymbolKind::Blankline;
let end_pos = if kind == SymbolKind::Newline {
SymPos {
line: (curr_pos.line + 1),
..Default::default()
}
} else {
SymPos {
line: curr_pos.line,
col_utf8: (curr_pos.col_utf8 + grapheme.len()),
col_utf16: (curr_pos.col_utf16 + grapheme.encode_utf16().count()),
col_grapheme: (curr_pos.col_grapheme + 1),
}
};

symbols.push(Symbol {
input,
kind,
offset: Offset {
start: prev_offset,
end: offset,
},
start: curr_pos,
end: end_pos,
});

curr_pos = end_pos;
if curr_pos.col_utf8 == 1 && kind == SymbolKind::Newline {
// newline at the start of line -> Blankline
kind = SymbolKind::Blankline;
}
prev_offset = offset;
}

symbols.push(Symbol {
input,
kind: SymbolKind::EOI,
offset: Offset {
start: prev_offset,
end: prev_offset,
},
start: curr_pos,
end: curr_pos,
});

// last offset not needed, because break at EOI is always available
symbols
symbols.push(Symbol {
input,
kind,
offset: Offset {
start: prev_offset,
end: offset,
},
start: curr_pos,
end: end_pos,
});

curr_pos = end_pos;
}
prev_offset = offset;
}

symbols.push(Symbol {
input,
kind: SymbolKind::EOI,
offset: Offset {
start: prev_offset,
end: prev_offset,
},
start: curr_pos,
end: curr_pos,
});

// last offset not needed, because break at EOI is always available
symbols
}
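
For readers following the hunk above: a minimal sketch of the position bookkeeping in `scan_str`, with several assumptions. It iterates over `char`s instead of grapheme clusters (the PR uses `GraphemeClusterSegmenter`), drops `col_grapheme` and offsets, and assumes positions start at line 1, column 1, which is inferred from the `col_utf8 == 1` check. `Kind`, `Pos`, and `scan_sketch` are hypothetical names, not part of the PR.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind {
    Plain,
    Newline,
    Blankline,
    Eoi,
}

#[derive(Debug, Clone, Copy, PartialEq)]
struct Pos {
    line: usize,
    col_utf8: usize,
    col_utf16: usize,
}

impl Default for Pos {
    // Assumption: lines and columns are 1-based.
    fn default() -> Self {
        Pos { line: 1, col_utf8: 1, col_utf16: 1 }
    }
}

/// Returns `(kind, start, end)` per symbol, followed by one end-of-input marker.
fn scan_sketch(input: &str) -> Vec<(Kind, Pos, Pos)> {
    let mut symbols = Vec::new();
    let mut curr = Pos::default();

    for ch in input.chars() {
        let mut kind = if ch == '\n' { Kind::Newline } else { Kind::Plain };

        // A newline moves to the next line and resets the columns.
        let end = if kind == Kind::Newline {
            Pos { line: curr.line + 1, ..Default::default() }
        } else {
            Pos {
                line: curr.line,
                col_utf8: curr.col_utf8 + ch.len_utf8(),
                col_utf16: curr.col_utf16 + ch.len_utf16(),
            }
        };

        // A newline at the start of a line is reported as a blank line.
        if curr.col_utf8 == 1 && kind == Kind::Newline {
            kind = Kind::Blankline;
        }

        symbols.push((kind, curr, end));
        curr = end;
    }

    // Zero-width end-of-input marker at the final position.
    symbols.push((Kind::Eoi, curr, curr));
    symbols
}

fn main() {
    let symbols = scan_sketch("a\n\n\u{e4}");
    let kinds: Vec<Kind> = symbols.iter().map(|s| s.0).collect();
    assert_eq!(
        kinds,
        vec![Kind::Plain, Kind::Newline, Kind::Blankline, Kind::Plain, Kind::Eoi]
    );
    // U+00E4 takes two UTF-8 bytes but only one UTF-16 code unit.
    assert_eq!(symbols[4].1, Pos { line: 3, col_utf8: 3, col_utf16: 2 });
}
```

The end-of-input marker has identical start and end positions, which matches the zero-width `EOI` symbol pushed after the loop in the diff.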
10 changes: 5 additions & 5 deletions commons/src/scanner/symbol/iterator/matcher.rs
@@ -11,6 +11,7 @@ use super::SymbolIterator;

/// Function type to notify an iterator if an end was reached.
pub type IteratorEndFn = Rc<dyn (Fn(&mut dyn EndMatcher) -> bool)>;

/// Function type to consume prefix sequences of a new line.
pub type IteratorPrefixFn = Rc<dyn (Fn(&mut dyn PrefixMatcher) -> bool)>;

@@ -103,7 +104,7 @@ impl<'input> EndMatcher for SymbolIterator<'input> {
let is_empty_line = self.is_empty_line();

if is_empty_line {
self.set_curr_index(self.peek_index()); // To consume peeked symbols
self.set_index(self.peek_index()); // To consume peeked symbols
}

is_empty_line
@@ -126,21 +127,20 @@ impl<'input> EndMatcher for SymbolIterator<'input> {
let matched = self.matches(sequence);

if matched {
self.set_curr_index(self.peek_index()); // To consume peeked symbols
self.set_index(self.peek_index()); // To consume peeked symbols
}

matched
}

fn at_depth(&self, depth: usize) -> bool {
self.curr_depth() == depth
self.depth() == depth
}
}

impl<'input> PrefixMatcher for SymbolIterator<'input> {
fn consumed_prefix(&mut self, sequence: &[SymbolKind]) -> bool {
#[cfg(debug_assertions)]
assert!(
debug_assert!(
!sequence.contains(&SymbolKind::Newline),
"Newline symbol in prefix match is not allowed."
);
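
A short sketch of the `debug_assert!` change in the hunk above. `#[cfg(debug_assertions)] assert!(...)` removes the whole statement from release builds, whereas `debug_assert!` keeps the condition type-checked in every build and only evaluates it when debug assertions are enabled. The function `starts_with_prefix` is a hypothetical stand-in working on `char`s, not the PR's `consumed_prefix`.

```rust
/// Hypothetical stand-in for a prefix matcher: reports whether `input` starts with `sequence`.
fn starts_with_prefix(input: &[char], sequence: &[char]) -> bool {
    // Evaluated only when debug assertions are enabled, but type-checked in all builds.
    debug_assert!(
        !sequence.contains(&'\n'),
        "Newline symbol in prefix match is not allowed."
    );

    input.starts_with(sequence)
}

fn main() {
    let input: Vec<char> = "* item".chars().collect();

    assert!(starts_with_prefix(&input, &['*', ' ']));
    assert!(!starts_with_prefix(&input, &['-', ' ']));
}
```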
77 changes: 38 additions & 39 deletions commons/src/scanner/symbol/iterator/mod.rs
@@ -16,6 +16,10 @@ pub use root::*;
/// It allows to add matcher functions to notify the iterator,
/// when an end of an element is reached, or what prefixes to strip on a new line.
/// Additionally, the iterator may be nested to enable transparent iterating for nested elements.
///
/// *Transparent* meaning that the nested iterator does not see [`Symbol`]s consumed by the wrapped (parent) iterator.
/// In other words, wrapped iterators control which [`Symbol`]s will be passed to their nested iterator.
/// Therefore, each nested iterator only sees those [`Symbol`]s that are relevant to its scope.
#[derive(Clone)]
pub struct SymbolIterator<'input> {
/// The [`SymbolIteratorKind`] of this iterator.
@@ -33,8 +37,6 @@ pub struct SymbolIterator<'input> {
}

/// The [`SymbolIteratorKind`] defines the kind of a [`SymbolIterator`].
///
/// **Note:** This enables iterator nesting.
#[derive(Clone)]
pub enum SymbolIteratorKind<'input> {
/// Defines an iterator as being nested.
@@ -94,25 +96,25 @@ impl<'input> SymbolIterator<'input> {

/// The current nested depth this iterator is at.
/// The root iterator starts at 0, and every iterator created using [`Self::nest()`] is one depth higher than its parent.
pub fn curr_depth(&self) -> usize {
pub fn depth(&self) -> usize {
self.depth
}

/// Returns the current index this iterator is in the [`Symbol`] slice of the root iterator.
pub fn curr_index(&self) -> usize {
pub fn index(&self) -> usize {
match &self.kind {
SymbolIteratorKind::Nested(parent) => parent.curr_index(),
SymbolIteratorKind::Root(root) => root.curr_index,
SymbolIteratorKind::Nested(parent) => parent.index(),
SymbolIteratorKind::Root(root) => root.index,
}
}

/// Sets the current index of this iterator to the given index.
pub fn set_curr_index(&mut self, index: usize) {
pub(super) fn set_index(&mut self, index: usize) {
if index >= self.start_index {
match self.kind.borrow_mut() {
SymbolIteratorKind::Nested(parent) => parent.set_curr_index(index),
SymbolIteratorKind::Nested(parent) => parent.set_index(index),
SymbolIteratorKind::Root(root) => {
root.curr_index = index;
root.index = index;
root.peek_index = index;
}
}
@@ -128,8 +130,8 @@ impl<'input> SymbolIterator<'input> {
}

/// Sets the peek index of this iterator to the given index.
pub fn set_peek_index(&mut self, index: usize) {
if index >= self.curr_index() {
fn set_peek_index(&mut self, index: usize) {
if index >= self.index() {
match self.kind.borrow_mut() {
SymbolIteratorKind::Nested(parent) => parent.set_peek_index(index),
SymbolIteratorKind::Root(root) => {
@@ -143,7 +145,7 @@ impl<'input> SymbolIterator<'input> {
///
/// **Note:** Needed to reset peek index after using `peeking_next()`.
pub fn reset_peek(&mut self) {
self.set_peek_index(self.curr_index());
self.set_peek_index(self.index());
}

/// Returns the maximal remaining symbols in this iterator.
@@ -186,7 +188,7 @@ impl<'input> SymbolIterator<'input> {
) -> SymbolIterator<'input> {
SymbolIterator {
kind: SymbolIteratorKind::Nested(Box::new(self.clone())),
start_index: self.curr_index(),
start_index: self.index(),
depth: self.depth + 1,
prefix_match,
end_match,
@@ -199,6 +201,13 @@ impl<'input> SymbolIterator<'input> {
/// **Note:** Only updates the parent if `self` is nested.
pub fn update(self, parent: &mut Self) {
if let SymbolIteratorKind::Nested(self_parent) = self.kind {
// Make sure it actually is the parent.
// It is not possible to check more precisely, because other indices are expected to be different due to `clone()`.
debug_assert_eq!(
self_parent.start_index, parent.start_index,
"Updated iterator is not the actual parent of this iterator."
);

*parent = *self_parent;
}
}
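
To make the "transparent" nesting and the `update()` step above more concrete, here is a heavily simplified sketch. It is not the PR's implementation: it works on `char`s, has no peeking or prefix matching, and `SketchIter`, `Kind`, `EndFn`, and `remaining` are hypothetical names. It only shows that a nested iterator advances a clone of its parent, stops at its own end matcher, and that `update()` writes the progress back.

```rust
use std::rc::Rc;

/// End matcher: gets the not yet consumed symbols and reports whether the element ends here.
type EndFn = Rc<dyn Fn(&[char]) -> bool>;

#[derive(Clone)]
enum Kind<'a> {
    Root { symbols: &'a [char], index: usize },
    Nested(Box<SketchIter<'a>>),
}

#[derive(Clone)]
struct SketchIter<'a> {
    kind: Kind<'a>,
    end_match: Option<EndFn>,
}

impl<'a> SketchIter<'a> {
    fn root(symbols: &'a [char]) -> Self {
        SketchIter { kind: Kind::Root { symbols, index: 0 }, end_match: None }
    }

    /// Symbols that were not consumed yet, looked up in the root.
    fn remaining(&self) -> &'a [char] {
        match &self.kind {
            Kind::Root { symbols, index } => {
                let all: &'a [char] = *symbols;
                &all[*index..]
            }
            Kind::Nested(parent) => parent.remaining(),
        }
    }

    /// Creates a nested iterator that works on a clone of `self`.
    fn nest(&self, end_match: Option<EndFn>) -> Self {
        SketchIter { kind: Kind::Nested(Box::new(self.clone())), end_match }
    }

    /// Writes the progress of a nested iterator back to its parent.
    fn update(self, parent: &mut Self) {
        if let Kind::Nested(self_parent) = self.kind {
            *parent = *self_parent;
        }
    }
}

impl<'a> Iterator for SketchIter<'a> {
    type Item = char;

    fn next(&mut self) -> Option<char> {
        // Own end reached: stop without consuming the end sequence.
        if let Some(end) = &self.end_match {
            if end(self.remaining()) {
                return None;
            }
        }

        match &mut self.kind {
            Kind::Root { symbols, index } => {
                let symbol = symbols.get(*index).copied();
                if symbol.is_some() {
                    *index += 1;
                }
                symbol
            }
            // The parent decides which symbols are passed on.
            Kind::Nested(parent) => parent.next(),
        }
    }
}

fn main() {
    let symbols: Vec<char> = "ab*cd".chars().collect();
    let mut parent = SketchIter::root(&symbols);
    let mut nested = parent.nest(Some(Rc::new(|rest: &[char]| rest.first() == Some(&'*'))));

    let inner: String = nested.by_ref().collect();
    assert_eq!(inner, "ab");

    // The parent is untouched until the nested iterator is written back.
    assert_eq!(parent.remaining().len(), 5);
    nested.update(&mut parent);
    assert_eq!(parent.remaining().iter().collect::<String>(), "*cd");
}
```

Because the nested iterator owns a clone, forgetting to call `update()` leaves the parent at its old position, which is presumably why the PR adds a `debug_assert_eq!` on `start_index` to catch updates against the wrong parent.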
@@ -233,21 +242,11 @@ impl<'input> SymbolIterator<'input> {
}
}

impl<'input> From<&'input [Symbol<'input>]> for SymbolIterator<'input> {
fn from(value: &'input [Symbol<'input>]) -> Self {
SymbolIterator {
kind: SymbolIteratorKind::Root(SymbolIteratorRoot::from(value)),
start_index: 0,
depth: 0,
prefix_match: None,
end_match: None,
iter_end: false,
}
}
}

impl<'input> From<&'input Vec<Symbol<'input>>> for SymbolIterator<'input> {
fn from(value: &'input Vec<Symbol<'input>>) -> Self {
impl<'input, T> From<T> for SymbolIterator<'input>
where
T: Into<&'input [Symbol<'input>]>,
{
fn from(value: T) -> Self {
SymbolIterator {
kind: SymbolIteratorKind::Root(SymbolIteratorRoot::from(value)),
start_index: 0,
@@ -322,21 +321,21 @@ mod test {

use itertools::{Itertools, PeekingNext};

use crate::scanner::{PrefixMatcher, Scanner, SymbolKind};
use crate::scanner::{PrefixMatcher, SymbolKind};

use super::SymbolIterator;

#[test]
fn peek_while_index() {
let symbols = Scanner::new().scan_str("## ");
let symbols = crate::scanner::scan_str("## ");

let mut iterator = SymbolIterator::from(&symbols);
let mut iterator = SymbolIterator::from(&*symbols);
let hash_cnt = iterator
.peeking_take_while(|symbol| symbol.kind == SymbolKind::Hash)
.count();

let next_symbol = iterator.nth(hash_cnt);
let curr_index = iterator.curr_index();
let curr_index = iterator.index();

assert_eq!(hash_cnt, 2, "Hash symbols in input not correctly detected.");
assert_eq!(curr_index, 3, "Current index was not updated correctly.");
@@ -353,14 +352,14 @@ mod test {

#[test]
fn peek_next() {
let symbols = Scanner::new().scan_str("#*");
let symbols = crate::scanner::scan_str("#*");

let mut iterator = SymbolIterator::from(&symbols);
let mut iterator = SymbolIterator::from(&*symbols);

let peeked_symbol = iterator.peeking_next(|_| true);
let next_symbol = iterator.next();
let next_peeked_symbol = iterator.peeking_next(|_| true);
let curr_index = iterator.curr_index();
let curr_index = iterator.index();

assert_eq!(curr_index, 1, "Current index was not updated correctly.");
assert_eq!(
@@ -387,9 +386,9 @@ mod test {

#[test]
fn reach_end() {
let symbols = Scanner::new().scan_str("text*");
let symbols = crate::scanner::scan_str("text*");

let mut iterator = SymbolIterator::from(&symbols).nest(
let mut iterator = SymbolIterator::from(&*symbols).nest(
None,
Some(Rc::new(|matcher| matcher.matches(&[SymbolKind::Star]))),
);
@@ -415,7 +414,7 @@ mod test {

#[test]
fn with_nested_and_parent_prefix() {
let symbols = Scanner::new().scan_str("a\n* *b");
let symbols = crate::scanner::scan_str("a\n* *b");

let iterator = SymbolIterator::with(
&symbols,
@@ -452,7 +451,7 @@ mod test {

#[test]
fn depth_matcher() {
let symbols = Scanner::new().scan_str("[o [i]]");
let symbols = crate::scanner::scan_str("[o [i]]");

let mut iterator = SymbolIterator::with(
&symbols,
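
A note on the `From<T>` impl and the `&*symbols` calls in the tests of this file. The two concrete impls were replaced by one generic impl bounded by `T: Into<&'input [Symbol<'input>]>`. As far as I know, the standard library has no conversion from `&Vec<T>` into `&[T]` via `Into`, which would explain why the tests now reborrow the vector as a slice. The sketch below reproduces the pattern with a hypothetical `SliceIter` over `char`s; it is not the PR's type.

```rust
/// Hypothetical wrapper that borrows a slice, standing in for `SymbolIterator`.
struct SliceIter<'a> {
    symbols: &'a [char],
}

impl<'a, T> From<T> for SliceIter<'a>
where
    T: Into<&'a [char]>,
{
    fn from(value: T) -> Self {
        SliceIter { symbols: value.into() }
    }
}

impl<'a> SliceIter<'a> {
    fn len(&self) -> usize {
        self.symbols.len()
    }
}

fn main() {
    let symbols: Vec<char> = "## ".chars().collect();

    // `&*symbols` reborrows the Vec as `&[char]`, which satisfies the bound.
    let from_reborrow = SliceIter::from(&*symbols);
    // `as_slice()` is an equivalent, more explicit spelling.
    let from_as_slice = SliceIter::from(symbols.as_slice());

    // `SliceIter::from(&symbols)` is expected to be rejected,
    // because `&Vec<char>` does not implement `Into<&[char]>`.
    assert_eq!(from_reborrow.len(), 3);
    assert_eq!(from_as_slice.len(), 3);
}
```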