[REQ] Simplify token iterator #129

Open
mhatzl opened this issue Mar 20, 2024 · 1 comment
Labels
waiting-on-assignee Issue/PR author or reviewer is awaiting response from assignee

Comments

mhatzl commented Mar 20, 2024

Is your feature request related to other issues/PRs?

unimarkup/specification#55 unimarkup/specification#56

Remove matching fns

Prefix matching will be done on the number of indent spaces.
Therefore, only a number needs to be stored instead of a complex matching fn.
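A minimal sketch of what storing only a number could look like. `PrefixIndent` and `strip` are hypothetical names, and the sketch works on raw line strings for brevity, while the real iterator works on tokens.

```rust
/// Hypothetical prefix state: only the number of indent spaces is stored.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PrefixIndent(pub usize);

impl PrefixIndent {
    /// Returns the line content after the prefix, or `None` if the line
    /// has fewer leading spaces than required.
    pub fn strip<'a>(&self, line: &'a str) -> Option<&'a str> {
        let leading = line.bytes().take_while(|b| *b == b' ').count();
        if leading >= self.0 {
            // Spaces are single-byte ASCII, so slicing at `self.0` is safe.
            Some(&line[self.0..])
        } else {
            None
        }
    }
}
```

Extra spaces beyond the prefix stay part of the content, so nested elements could apply their own prefix count on top.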

For end matching, it might be possible to provide an enum to cover all needed scenarios.
Storing the enum directly in the iterator should make parsing faster, and significantly reduce complexity. Scoping is still needed for enclosed elements.

  • Blankline ... Iterator ends on blankline, or if iterator end is reached

    Needed for Paragraph, Quote Block, Line Block.

  • NewlineMatch(Vec) ... Matches given tokens once a newline is matched

    Needed for enclosed blocks, assuming unimarkup/specification#56 ([REQ] Relax enclosing block parsing) gets accepted.

  • BlankOrNewlineMatch(Vec) ... Either ends on blankline, or matches given tokens once a newline is matched

    Needed for Heading and lists, because they do not require a blank line in case the next line starts with the element keyword.

  • Match(Token) ... Ends if the token is matched

    Needed mostly for inline elements, but also for tables.

  • MatchEither(Token1, Token2) ... Ends if either Token1 or Token2 matches

    Needed to handle ambiguous inline elements.
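The variants above could be sketched roughly as follows. The `Token` enum is a stand-in, the `Vec` payloads are assumed to hold tokens, and `ends_at` is a hypothetical helper. Treating end of input as an end for every variant is also an assumption, since the list only states it for `Blankline`.

```rust
/// Stand-in token type; the real Unimarkup token type is not shown here.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Token {
    Newline,
    Blankline,
    Hash,
    Star,
    Pipe,
    Text,
}

/// End conditions as listed above (payload types are assumptions).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum EndMatch {
    Blankline,
    NewlineMatch(Vec<Token>),
    BlankOrNewlineMatch(Vec<Token>),
    Match(Token),
    MatchEither(Token, Token),
}

/// True if `upcoming` starts with a newline directly followed by `seq`.
fn newline_then(upcoming: &[Token], seq: &[Token]) -> bool {
    upcoming.first() == Some(&Token::Newline) && upcoming[1..].starts_with(seq)
}

impl EndMatch {
    /// Checks whether the iterator should end before `upcoming[0]`,
    /// where `upcoming` is the slice of not-yet-consumed tokens.
    pub fn ends_at(&self, upcoming: &[Token]) -> bool {
        let first = match upcoming.first() {
            Some(token) => token,
            // Assumption: reaching the end of input ends every iterator.
            None => return true,
        };
        match self {
            EndMatch::Blankline => *first == Token::Blankline,
            EndMatch::NewlineMatch(seq) => newline_then(upcoming, seq),
            EndMatch::BlankOrNewlineMatch(seq) => {
                *first == Token::Blankline || newline_then(upcoming, seq)
            }
            EndMatch::Match(token) => first == token,
            EndMatch::MatchEither(a, b) => first == a || first == b,
        }
    }
}
```

Because the enum is plain data, it can be stored in the iterator and compared without calling through a stored matching fn.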

Try to combine block and inline iterator

The inline iterator is currently needed, because base tokens are converted to inline tokens,
and open formats are stored in a slice.

  • Generic token iterator

    It might be possible to make the base token iterator generic.
    The generic token type must have functions to determine if a token is a newline, blankline, or EOI.

    With tokens being generic, a conversion layer may be added to convert base tokens to inline tokens. This has the benefit of reducing API duplication, because base and inline iterators get merged.

  • Use end matching for inline formats

    Inline formats use an open format map to determine if a format is open or not.
    This open map is needed to decide if a keyword should open or close a format.
    With iterators being nested, it might be possible to add a function that checks whether a format is already open (by having a parent parser that handles the respective format), or not.
    If this works, no open map would be needed for inline parsing, which makes inline parsing much easier.

    To achieve this, iterators must know for what element parsing they are used.
    This could be done by adding a field of type ElementKind. To resolve ambiguous tokens, it must be possible to cache exactly one token.
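Both ideas above might be combined along these lines. This is only a sketch: `IterToken`, `ScopedIter`, `nested`, and `is_open` are hypothetical names, the `ElementKind` variants are made up, and the single-token cache is left out. A child holds a shared reference to its parent, so the sketch does not model how a child would advance the parent.

```rust
/// Hypothetical element kinds; only the type name `ElementKind` is given above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ElementKind {
    Paragraph,
    Bold,
    Italic,
}

/// Functions a generic token type would have to provide.
pub trait IterToken {
    fn is_newline(&self) -> bool;
    fn is_blankline(&self) -> bool;
    fn is_eoi(&self) -> bool;
}

/// Generic token iterator that knows which element it parses and its parent.
pub struct ScopedIter<'a, T> {
    tokens: &'a [T],
    pos: usize,
    kind: ElementKind,
    parent: Option<&'a ScopedIter<'a, T>>,
}

impl<'a, T: IterToken> ScopedIter<'a, T> {
    /// Creates a root iterator without a parent.
    pub fn new(tokens: &'a [T], kind: ElementKind) -> Self {
        Self { tokens, pos: 0, kind, parent: None }
    }

    /// Creates a nested iterator that starts at the current position.
    pub fn nested<'s>(&'s self, kind: ElementKind) -> ScopedIter<'s, T> {
        ScopedIter { tokens: self.tokens, pos: self.pos, kind, parent: Some(self) }
    }

    /// Walks this iterator and all parents to see if `kind` is being parsed.
    /// This would replace the lookup in a separate open format map.
    pub fn is_open(&self, kind: ElementKind) -> bool {
        let mut scope = Some(self);
        while let Some(current) = scope {
            if current.kind == kind {
                return true;
            }
            scope = current.parent;
        }
        false
    }
}

impl<'a, T: IterToken> Iterator for ScopedIter<'a, T> {
    type Item = &'a T;

    /// Yields tokens until the slice ends or an EOI token is reached.
    fn next(&mut self) -> Option<&'a T> {
        let tokens = self.tokens;
        let token = tokens.get(self.pos)?;
        if token.is_eoi() {
            return None;
        }
        self.pos += 1;
        Some(token)
    }
}
```

With this shape, a keyword would close a format if `is_open` returns true for that format, and open a new nested iterator otherwise.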

@mhatzl mhatzl added the waiting-on-assignee Issue/PR author or reviewer is awaiting response from assignee label Mar 20, 2024
mhatzl commented Mar 20, 2024

It must be possible to get the number of prefix spaces of all parent iterators for correct indentation of block quoted logic.

Unimarkup block content in the logic part is started with """.
For consistent indentation, the prefix for all enclosed lines must be set so that the content starts at the same "visual depth" as the leftmost double quote (in left-to-right flow).

let block = """
            # Heading

            Paragraph
            """;

The number of spaces that were skipped by parent iterators makes it easy to calculate the needed indentation, because with this information, the start.col_grapheme value of the first quote token can be used.

prefix_indent = start.col_grapheme - sum_parent_indents

Without this information, the start.col_grapheme value of the first token an iterator returns after a newline would need to be kept.
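The calculation above could be sketched as a small helper. The function name and the slice of parent indents are assumptions, and whether `col_grapheme` is zero- or one-based is not stated here, so callers would need to stay consistent with the token positions they use.

```rust
/// Computes the prefix indent for lines enclosed by `"""`.
///
/// `start_col_grapheme` is the column of the first quote token, and
/// `parent_indents` holds the spaces already skipped by each parent iterator.
/// Returns `None` if the parents skipped more than the quote's column,
/// which would indicate inconsistent input.
pub fn prefix_indent(start_col_grapheme: usize, parent_indents: &[usize]) -> Option<usize> {
    let sum_parent_indents: usize = parent_indents.iter().sum();
    start_col_grapheme.checked_sub(sum_parent_indents)
}
```

Using `checked_sub` avoids an underflow panic in debug builds when the input is inconsistent.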
