feat: update parser to use iterators for blocks and inlines #118

Merged
merged 82 commits into from
Nov 19, 2023
4d03b25
fix: replace depth matching with scope nesting
mhatzl Oct 24, 2023
c98cd96
feat: add Any symbol to help with matching
mhatzl Oct 24, 2023
d2fb4b5
feat: store prev symbol of iterator
mhatzl Oct 24, 2023
b18a241
fix: ensure consumed matches are consumed in parser
mhatzl Oct 24, 2023
95f94b0
feat: add match fn to check if prev was a space
mhatzl Oct 25, 2023
f54cc5c
feat: use symbol iterator to parse inlines
mhatzl Oct 25, 2023
4a6461c
feat: use token iterator on top of symbol iterator
mhatzl Oct 26, 2023
b01f940
feat: implement bold italic parsing
mhatzl Oct 27, 2023
2564188
fix: adapt prev & cached token & fix bolditalic
mhatzl Oct 27, 2023
8226336
feat: add strikethrough format
mhatzl Oct 27, 2023
64f3f3d
feat: add implicit subst layer in token iterator
mhatzl Oct 28, 2023
ef375fb
feat: add inline verbatim
mhatzl Oct 28, 2023
8eb4125
fix: allow implicit subst after inline verbatim
mhatzl Oct 28, 2023
8efb196
feat: add textbox support for inlines
mhatzl Oct 28, 2023
327de12
feat: add generic ambiguous inline parsing
mhatzl Oct 29, 2023
982eaa1
fix: mark implicit ends in ambiguous formatting
mhatzl Oct 29, 2023
7d37090
feat: create generic distinct format parser
mhatzl Oct 29, 2023
87d9e7e
fix: iterator max length calculation
mhatzl Oct 29, 2023
5280b51
arch: improve end calc for distinct formats
mhatzl Oct 29, 2023
07e6cfb
fix: update token positions for ambiguous splits
mhatzl Oct 29, 2023
883dc31
fix: cleanup new inline parser
mhatzl Oct 29, 2023
a0685aa
arch: rename commons module scanner to lexer
mhatzl Oct 29, 2023
f65f22e
feat: add context to inline element parsers
mhatzl Oct 29, 2023
e134615
fix: move DirectUri out of implicit subst
mhatzl Oct 29, 2023
f30f331
fix: remove duplicated implicit kinds
mhatzl Oct 29, 2023
1086f07
feat: update snapshot testing to new inline parsing
mhatzl Oct 29, 2023
187cafd
fix: reset peek index if peeking is not accepted
mhatzl Oct 30, 2023
28df88e
fix: handle blankline escape and prevent panics
mhatzl Oct 30, 2023
6f13456
fix: correct whitespace and newline handling
mhatzl Oct 30, 2023
fce6d08
fix: allow Whitespace to be pushed into Plain
mhatzl Oct 30, 2023
cd162f0
fix: prevent implicit close from spanning line end
mhatzl Oct 30, 2023
ac718cc
fix: correctly set prev token for consumed matches
mhatzl Oct 30, 2023
9131848
fix: update tests for token iterator
mhatzl Oct 30, 2023
913ad8f
fix: set one whitespace token per symbol
mhatzl Oct 31, 2023
7cb66a0
fix: return correct prev token in inline iter
mhatzl Oct 31, 2023
c5a12e7
fix: set correct input for Plain in release mode
mhatzl Oct 31, 2023
6216c8e
fix: use fixed-size array for open format map
mhatzl Oct 31, 2023
76fcf7d
fix: remove unnecessary peek-while calls
mhatzl Oct 31, 2023
bd4a0fd
fix: create cached token iterator to improve perf
mhatzl Oct 31, 2023
9f02f41
feat: add caching layer between symbols and tokens
mhatzl Oct 31, 2023
338558a
fix: reduce iterator cloning
mhatzl Oct 31, 2023
83e67b8
fix: remove implicits iterator from iterator chain
mhatzl Nov 1, 2023
94aece0
fix: cache first peeked token per iterator
mhatzl Nov 1, 2023
407a6f5
fix: assert parser fns consume iter only for Some
mhatzl Nov 1, 2023
17b7c0c
fix: set todo! for snapshot impl of new inlines
mhatzl Nov 1, 2023
b9c4477
feat: use token slice as base for iterators
mhatzl Nov 1, 2023
10edf58
fix: cleanup warnings and dead code
mhatzl Nov 2, 2023
903642a
feat: use mut iter ref for nesting
mhatzl Nov 2, 2023
0478d70
fix: remove unnecessary boxes
mhatzl Nov 2, 2023
fb9a929
fix: return referenced tokens in take_to_end
mhatzl Nov 2, 2023
682e452
fix: nest iterators without cloning
mhatzl Nov 2, 2023
0161407
fix: use iterator cloning again
mhatzl Nov 2, 2023
84d7423
fix: replace iterator update with progress
mhatzl Nov 2, 2023
f8ac9dd
feat: update paragraph and heading to token iterator
mhatzl Nov 4, 2023
cc9290d
fix: consume contiguous whitespace in inlines
mhatzl Nov 4, 2023
1c5b614
fix: pass iter infos from inline to block parsers
mhatzl Nov 4, 2023
f656671
fix: set match index for blankline match
mhatzl Nov 4, 2023
79753d4
fix: adapt auto-id generation for headings
mhatzl Nov 5, 2023
753faae
feat: use parser chaining in inlines
mhatzl Nov 6, 2023
6d7bfb7
fix: update verbatim block to use token iterator
mhatzl Nov 6, 2023
ae2282b
fix: use token iterator for bullet list parsing
mhatzl Nov 7, 2023
d40f932
fix: add overline to format kinds
mhatzl Nov 7, 2023
ec0b54d
fix: handle blanklines for prefix matching
mhatzl Nov 9, 2023
20fdd9b
fix: implement element trait for bullet list
mhatzl Nov 9, 2023
965986b
fix: handle implicit block ends if outer ends
mhatzl Nov 9, 2023
5205a98
fix: cleanup block parser and renderer and add doc
mhatzl Nov 10, 2023
cff3800
arch: rename verbatim_block in snapshots
mhatzl Nov 10, 2023
6d23272
fix: remove old inline parsing
mhatzl Nov 10, 2023
5fa4164
fix: remove old iterators
mhatzl Nov 10, 2023
1c8a242
fix: remove old doc test
mhatzl Nov 10, 2023
ee23132
fix: remove todo! except for named substitution
mhatzl Nov 12, 2023
9490d8e
fix: ignore end for plain str if implicit closed
mhatzl Nov 14, 2023
01bd2ad
fix: add basic verbatim block tests
mhatzl Nov 14, 2023
f046b7f
fix: add rendering for escaped plain and blankline
mhatzl Nov 16, 2023
ac9266b
fix: clarify iterator nesting doc and naming
mhatzl Nov 16, 2023
cb91d3b
fix: use more elegant code from review feedback
mhatzl Nov 16, 2023
38c4840
fix: apply suggestions from review
mhatzl Nov 16, 2023
5870ef1
fix: correct doc for token specific functions
mhatzl Nov 16, 2023
70175a9
fix: apply suggestion from review
mhatzl Nov 16, 2023
8573b65
fix: handle newlines after blocks
mhatzl Nov 19, 2023
ec5e8b3
fix: handle sub-headings directly after heading
mhatzl Nov 19, 2023
8a30879
arch: rename `to_plain_string` to `as_unimarkup`
mhatzl Nov 19, 2023
1 change: 0 additions & 1 deletion Cargo.toml
@@ -35,4 +35,3 @@ clap = { version = "4.2.7", features = ["derive", "cargo", "env"] }
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
serde_yaml = "0.8.23"
ribbon = "0.7.0"
1 change: 1 addition & 0 deletions commons/Cargo.toml
@@ -19,6 +19,7 @@ serde.workspace = true
serde_json.workspace = true
serde_yaml.workspace = true
once_cell = { workspace = true, optional = true }
icu_properties = "1.3.2"
icu_segmenter = "1.3.0"
icu_locid = "1.3.0"
regex = { version = "1.8.1", optional = true }
2 changes: 1 addition & 1 deletion commons/README.md
@@ -3,7 +3,7 @@
This crate provides common functionalities needed in other Unimarkup crates.

- `config` ... Contains the config struct defining what arguments are available for compilation
- `scanner` ... Contains the `SymbolIterator` used to transform string input into Unimarkup symbols
- `lexer` ... Contains the `TokenIterator` used to iterate over tokenized input containing Unimarkup elements
- `test_runner` ... Contains convenience traits and macros to create automated snapshot tests

# License
14 changes: 8 additions & 6 deletions commons/src/scanner/mod.rs → commons/src/lexer/mod.rs
@@ -1,16 +1,18 @@
//! Functionality, iterators, helper types and traits to get [`Symbol`]s from `&str`.
//! These [`Symbol`]s and iterators are used to convert the input into a Unimarkup document.
//! Functionality, iterators, helper types and traits to get [`Tokens`](token::Token)s from `&str`.
//! These [`Tokens`](token::Token)s and iterators are used to convert the input into a Unimarkup document.

use icu_segmenter::GraphemeClusterSegmenter;

pub mod position;
pub mod span;
mod symbol;
pub mod symbol;
pub mod token;

use position::{Offset, Position as SymPos};
pub use symbol::{iterator::*, Symbol, SymbolKind};

/// Scans given input and returns vector of [`Symbol`]s needed to convert the input to Unimarkup content.
use self::symbol::{Symbol, SymbolKind};

/// Scans given input and returns vector of [`Symbol`]s needed to convert the input to [Token](token::Token)s.
pub fn scan_str(input: &str) -> Vec<Symbol<'_>> {
let segmenter = GraphemeClusterSegmenter::new();

@@ -55,7 +57,7 @@ pub fn scan_str(input: &str) -> Vec<Symbol<'_>> {

symbols.push(Symbol {
input,
kind: SymbolKind::EOI,
kind: SymbolKind::Eoi,
offset: Offset {
start: prev_offset,
end: prev_offset,
35 changes: 23 additions & 12 deletions commons/src/scanner/position.rs → commons/src/lexer/position.rs
@@ -5,31 +5,42 @@ use std::ops::{Add, AddAssign, Sub, SubAssign};

use super::span::SpanLen;

/// Indicates position of a symbol in a Unimarkup document. Both line and column
/// Indicates position of a symbol or token in a Unimarkup document. Both line and column
/// counting starts from 1.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub struct Position {
/// Line the symbol is found at
/// Line the symbol or token is found at
pub line: usize,
/// Column at which the symbol is located in line, when encoded as UTF8
/// Column at which the symbol or token is located in line, when encoded as UTF8
pub col_utf8: usize,
/// Column at which the symbol is located in line, when encoded as UTF16
/// Column at which the symbol or token is located in line, when encoded as UTF16
pub col_utf16: usize,
/// Column at which the symbol is located in line, when counting graphemes
/// Column at which the symbol or token is located in line, when counting graphemes
pub col_grapheme: usize,
}

/// Symbol offset in the original input.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub(crate) struct Offset {
/// Start offset of a symbol, inclusive. This is the same as the end offset
/// of the previous symbol.
/// Symbol or token offset in the original input.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub struct Offset {
/// Start offset of a symbol or token, inclusive. This is the same as the end offset
/// of the previous symbol or token.
pub start: usize,
/// End offset of a symbol, exclusive. This is the same as the start offset
/// of the next symbol.
/// End offset of a symbol or token, exclusive. This is the same as the start offset
/// of the next symbol or token.
pub end: usize,
}

impl Offset {
pub fn extend(&mut self, other: Offset) {
debug_assert!(
self.start <= other.start,
"Tried to extend self by another offset that started earlier."
);

self.end = self.end.max(other.end)
}
}

impl Position {
pub fn new(line: usize, column: usize) -> Self {
Self {
File renamed without changes.
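The `Offset::extend` method added in `position.rs` can be tried in isolation. The following is a minimal sketch, assuming a stripped-down copy of `Offset` (fewer derives than in the diff) and a hypothetical `main` that treats two adjacent symbols as one growing token. It is an illustration, not code from this PR.

```rust
/// Simplified stand-in for the `Offset` struct from the diff (assumption: fewer derives).
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct Offset {
    /// Start offset, inclusive.
    pub start: usize,
    /// End offset, exclusive.
    pub end: usize,
}

impl Offset {
    /// Grows `self` so that it also covers `other`. The start is left untouched.
    pub fn extend(&mut self, other: Offset) {
        debug_assert!(
            self.start <= other.start,
            "Tried to extend self by another offset that started earlier."
        );
        self.end = self.end.max(other.end);
    }
}

fn main() {
    // Two adjacent symbols: the end of the first is the start of the second.
    let mut token = Offset { start: 0, end: 2 };
    let next_symbol = Offset { start: 2, end: 5 };

    token.extend(next_symbol);
    assert_eq!(token, Offset { start: 0, end: 5 });

    // Extending by an offset that is already covered changes nothing.
    token.extend(Offset { start: 1, end: 3 });
    assert_eq!(token, Offset { start: 0, end: 5 });
}
```

Because only `end` is updated, the method relies on offsets being merged in input order, which is what the `debug_assert!` checks in debug builds.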
109 changes: 109 additions & 0 deletions commons/src/lexer/symbol/iterator.rs
@@ -0,0 +1,109 @@
use itertools::PeekingNext;

use crate::lexer::{Symbol, SymbolKind};

#[derive(Debug, Clone)]
pub struct SymbolIterator<'slice, 'input> {
/// The [`Symbol`] slice the iterator was created for.
symbols: &'slice [Symbol<'input>],
/// The current index of the iterator inside the [`Symbol`] slice.
pub(super) index: usize,
/// The peek index of the iterator inside the [`Symbol`] slice.
pub(super) peek_index: usize,
}

impl<'slice, 'input, T> From<T> for SymbolIterator<'slice, 'input>
where
T: Into<&'slice [Symbol<'input>]>,
{
fn from(value: T) -> Self {
SymbolIterator {
symbols: value.into(),
index: 0,
peek_index: 0,
}
}
}

impl<'slice, 'input> Iterator for SymbolIterator<'slice, 'input> {
type Item = &'slice Symbol<'input>;

fn next(&mut self) -> Option<Self::Item> {
let symbol = self.symbols.get(self.index)?;

self.index += 1;
self.peek_index = self.index;

Some(symbol)
}

fn size_hint(&self) -> (usize, Option<usize>) {
(0, Some(self.max_len()))
}
}

impl<'slice, 'input> PeekingNext for SymbolIterator<'slice, 'input> {
fn peeking_next<F>(&mut self, accept: F) -> Option<Self::Item>
where
Self: Sized,
F: FnOnce(&Self::Item) -> bool,
{
let symbol = self.symbols.get(self.peek_index).filter(accept)?;
self.peek_index += 1;
Some(symbol)
}
}

impl<'slice, 'input> SymbolIterator<'slice, 'input> {
/// Returns the maximum length of the remaining [`Symbol`]s this iterator might return.
///
/// **Note:** This length does not consider parent iterators, or matching functions.
/// Therefore, the returned number of [`Symbol`]s might differ, but cannot be larger than this length.
pub fn max_len(&self) -> usize {
self.symbols.len().saturating_sub(self.index)
}

/// Returns `true` if no more [`Symbol`]s are available.
pub fn is_empty(&self) -> bool {
self.max_len() == 0
}

/// Returns the current index this iterator is in the [`Symbol`] slice of the root iterator.
pub fn index(&self) -> usize {
self.index
}

/// Sets the current index of this iterator to the given index.
pub(crate) fn set_index(&mut self, index: usize) {
debug_assert!(self.index <= index, "Tried to move the iterator backward.");

self.index = index;
self.peek_index = index;
}

/// Returns the index used to peek.
pub(crate) fn peek_index(&self) -> usize {
self.peek_index
}

/// Sets the peek index of this iterator to the given index.
pub(crate) fn set_peek_index(&mut self, index: usize) {
if self.index() <= index {
self.peek_index = index;
}
}

pub fn reset_peek(&mut self) {
self.set_peek_index(self.index());
}

/// Returns the next [`Symbol`] without changing the current index.
pub fn peek(&mut self) -> Option<&'slice Symbol<'input>> {
self.symbols.get(self.peek_index)
}

/// Returns the [`SymbolKind`] of the peeked [`Symbol`].
pub fn peek_kind(&mut self) -> Option<SymbolKind> {
self.peek().map(|s| s.kind)
}
}
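The interplay between `index` and `peek_index` in `SymbolIterator` is easier to follow on a reduced example. The sketch below assumes a hypothetical `PeekIter` over a slice of `char` instead of `Symbol`, and offers `peeking_next` as an inherent method rather than implementing `itertools::PeekingNext`, so it compiles without external crates. It mirrors the logic shown in the diff but is not the PR's type.

```rust
/// Simplified, hypothetical stand-in for `SymbolIterator`.
#[derive(Debug, Clone)]
pub struct PeekIter<'slice> {
    items: &'slice [char],
    index: usize,
    peek_index: usize,
}

impl<'slice> PeekIter<'slice> {
    pub fn new(items: &'slice [char]) -> Self {
        Self { items, index: 0, peek_index: 0 }
    }

    /// Advances only the peek index, and only if `accept` returns `true`.
    pub fn peeking_next<F>(&mut self, accept: F) -> Option<&'slice char>
    where
        F: FnOnce(&&'slice char) -> bool,
    {
        let item = self.items.get(self.peek_index).filter(accept)?;
        self.peek_index += 1;
        Some(item)
    }

    /// Discards all peeked items by moving the peek index back to the index.
    pub fn reset_peek(&mut self) {
        self.peek_index = self.index;
    }
}

impl<'slice> Iterator for PeekIter<'slice> {
    type Item = &'slice char;

    fn next(&mut self) -> Option<Self::Item> {
        let item = self.items.get(self.index)?;
        self.index += 1;
        // Consuming an item also discards any pending peeks.
        self.peek_index = self.index;
        Some(item)
    }
}

fn main() {
    let input = ['*', '*', 'a'];
    let mut iter = PeekIter::new(&input);

    // Peek over both stars without consuming them.
    assert_eq!(iter.peeking_next(|c| **c == '*'), Some(&'*'));
    assert_eq!(iter.peeking_next(|c| **c == '*'), Some(&'*'));
    // 'a' is rejected, so the peek index stays where it is.
    assert_eq!(iter.peeking_next(|c| **c == '*'), None);

    // `next` still starts at the first star and resets the peek index to 1.
    assert_eq!(iter.next(), Some(&'*'));
    assert_eq!(iter.peeking_next(|_| true), Some(&'*'));

    // After a reset, peeking starts again right behind the consumed item.
    iter.reset_peek();
    assert_eq!(iter.peeking_next(|_| true), Some(&'*'));
}
```

Keeping two indices lets a parser look ahead arbitrarily far and then either commit with `next` or back out with `reset_peek`, without cloning the iterator.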
@@ -2,25 +2,33 @@

use core::fmt;

use icu_properties::sets::CodePointSetDataBorrowed;

use super::position::{Offset, Position};

pub mod iterator;

pub const TERMINAL_PUNCTUATION: CodePointSetDataBorrowed<'static> =
icu_properties::sets::terminal_punctuation();

/// Possible kinds of Symbol found in Unimarkup document.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
#[derive(Default, Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum SymbolKind {
/// Hash symbol (#) used for headings
Hash,
/// Regular text with no semantic meaning
#[default]
Plain,
/// Unicode terminal punctuation
TerminalPunctuation,
/// Any non-linebreaking whitespace
Whitespace,
/// A line break literal (for example `\n` or '\r\n')
Newline,
/// End of Unimarkup document
EOI,
Eoi,
/// The backslash (`\`) is used for escaping other symbols.
Backslash,
/// Hash symbol (#) used for headings
Hash,
/// The star (`*`) literal is used for various elements.
Star,
/// The minus (`-`) literal is used for various elements.
@@ -43,6 +51,10 @@ pub enum SymbolKind {
Quote,
/// The dollar (`$`) literal is used for math mode formatting.
Dollar,
/// A colon literal (`:`) is used as marker (e.g. for alias substitutions `::heart::`).
Colon,
/// A dot literal (`.`).
Dot,
/// The open parentheses (`(`) literal is used for additional data to text group elements (e.g.
/// image insert).
OpenParenthesis,
@@ -56,21 +68,42 @@ pub enum SymbolKind {
OpenBrace,
/// The close brace (`}`) literal is used for inline attributes.
CloseBrace,
/// A colon literal (`:`) is used as marker (e.g. for alias substitutions `::heart::`).
Colon,
}

impl Default for SymbolKind {
fn default() -> Self {
Self::Plain
}
}

impl SymbolKind {
pub fn is_not_keyword(&self) -> bool {
matches!(
self,
SymbolKind::Newline | SymbolKind::Whitespace | SymbolKind::Plain | SymbolKind::EOI
SymbolKind::Newline | SymbolKind::Whitespace | SymbolKind::Plain | SymbolKind::Eoi
)
}

pub fn is_keyword(&self) -> bool {
!self.is_not_keyword()
}

pub fn is_open_parenthesis(&self) -> bool {
matches!(
self,
SymbolKind::OpenParenthesis | SymbolKind::OpenBracket | SymbolKind::OpenBrace
)
}

pub fn is_close_parenthesis(&self) -> bool {
matches!(
self,
SymbolKind::CloseParenthesis | SymbolKind::CloseBracket | SymbolKind::CloseBrace
)
}

pub fn is_parenthesis(&self) -> bool {
self.is_open_parenthesis() || self.is_close_parenthesis()
}

pub fn is_space(&self) -> bool {
matches!(
self,
SymbolKind::Newline | SymbolKind::Whitespace | SymbolKind::Eoi
)
}
}
@@ -80,7 +113,7 @@ impl SymbolKind {
pub struct Symbol<'a> {
/// Original input the symbol is found in.
pub input: &'a str,
pub(crate) offset: Offset,
pub offset: Offset,
/// Kind of the symbol, e.g. a hash (#)
pub kind: SymbolKind,
/// Start position of the symbol in input.
@@ -141,7 +174,7 @@ impl Symbol<'_> {
/// # Examples
///
/// ```
/// use unimarkup_commons::scanner::{scan_str, Symbol};
/// use unimarkup_commons::lexer::{scan_str, symbol::Symbol};
///
/// let input = "This is a string";
/// let symbols: Vec<_> = scan_str(input);
@@ -212,27 +245,47 @@ impl From<&str> for SymbolKind {
"{" => SymbolKind::OpenBrace,
"}" => SymbolKind::CloseBrace,
":" => SymbolKind::Colon,
"." => SymbolKind::Dot,
symbol
if symbol != "\n"
&& symbol != "\r\n"
&& symbol.starts_with(char::is_whitespace) =>
{
SymbolKind::Whitespace
}
_ => SymbolKind::Plain,
_ => {
let mut kind = SymbolKind::Plain;

if let Some(c) = value.chars().next() {
if TERMINAL_PUNCTUATION.contains(c) {
kind = SymbolKind::TerminalPunctuation;
}
}

kind
}
}
}
}

impl SymbolKind {
pub fn as_str(&self) -> &str {
match self {
SymbolKind::Plain | SymbolKind::TerminalPunctuation => {
#[cfg(debug_assertions)]
panic!(
"Tried to create &str from '{:?}', which has undefined &str representation.",
self
);

#[cfg(not(debug_assertions))]
""
}
SymbolKind::Hash => "#",
SymbolKind::Plain => "",
SymbolKind::Tick => "`",
SymbolKind::Whitespace => " ",
SymbolKind::Newline => "\n",
SymbolKind::EOI => "",
SymbolKind::Eoi => "",
SymbolKind::Backslash => "\\",
SymbolKind::Star => "*",
SymbolKind::Minus => "-",
@@ -251,6 +304,7 @@ impl SymbolKind {
SymbolKind::OpenBrace => "{",
SymbolKind::CloseBrace => "}",
SymbolKind::Colon => ":",
SymbolKind::Dot => ".",
}
}
}
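The new fallback arm in `From<&str> for SymbolKind` only assigns `TerminalPunctuation` after all explicit literals have been tried. The sketch below illustrates that ordering under loud assumptions: `Kind` is a hypothetical, reduced enum, and `is_terminal_punctuation` is a hand-written ASCII subset that stands in for `TERMINAL_PUNCTUATION.contains(c)` from `icu_properties`, so the example needs no external crates. The newline arm is also an assumption about the part of the `match` that is not visible in the diff.

```rust
/// Hypothetical, reduced version of `SymbolKind`.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub enum Kind {
    #[default]
    Plain,
    TerminalPunctuation,
    Whitespace,
    Newline,
    Colon,
    Dot,
}

/// Hypothetical replacement for `TERMINAL_PUNCTUATION.contains(c)`.
/// Only a small ASCII subset of the Unicode property is listed here.
fn is_terminal_punctuation(c: char) -> bool {
    matches!(c, '!' | ',' | '.' | ':' | ';' | '?')
}

impl From<&str> for Kind {
    fn from(value: &str) -> Self {
        match value {
            // Explicit literals come first, so `:` and `.` never reach the fallback.
            ":" => Kind::Colon,
            "." => Kind::Dot,
            "\n" | "\r\n" => Kind::Newline,
            symbol if symbol.starts_with(char::is_whitespace) => Kind::Whitespace,
            // Fallback: inspect the first char of the grapheme.
            _ => match value.chars().next() {
                Some(c) if is_terminal_punctuation(c) => Kind::TerminalPunctuation,
                _ => Kind::Plain,
            },
        }
    }
}

fn main() {
    assert_eq!(Kind::from("!"), Kind::TerminalPunctuation);
    // `.` and `:` count as terminal punctuation, but their own kinds win.
    assert_eq!(Kind::from("."), Kind::Dot);
    assert_eq!(Kind::from(":"), Kind::Colon);
    assert_eq!(Kind::from(" "), Kind::Whitespace);
    assert_eq!(Kind::from("\r\n"), Kind::Newline);
    assert_eq!(Kind::from("a"), Kind::Plain);
    assert_eq!(Kind::default(), Kind::Plain);
}
```

The ordering matters because the dot and colon also carry the Unicode `Terminal_Punctuation` property. Matching them as literals first keeps `Dot` and `Colon` distinct from the generic punctuation kind.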