Esrap TODO

Optimizations

Common special variable for cache and heads

Since access to special variables can be slow, it may make sense to store *cache* and *head* in slots of a context structure, and to store that structure in a new special variable *context*.
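A minimal sketch of this idea; the names context, *context*, and cached-result are hypothetical, not current Esrap API:

```lisp
;; Hypothetical sketch: bundle the per-parse state into one structure
;; so call sites only look up a single special variable.
(defstruct (context (:constructor make-context ()))
  (cache (make-hash-table :test #'equal) :read-only t)
  (heads (make-hash-table :test #'eql) :read-only t))

(defvar *context* nil
  "Bound to a fresh CONTEXT for the duration of one parse.")

;; Accessors then go through the single special:
(defun cached-result (key)
  (gethash key (context-cache *context*)))
```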

Error vs Success results

We’re interested in the production/failure, and the position.

The vast majority of parses result in failures, and we cons up a failed-parse for each. Unless the whole parse fails, the only thing we are interested in is the position.

…and even if the whole parse fails, we use the additional information only to display a fancy error message.

So: unless parse has been called with :debug t, use a fixnum indicating the position to represent a failed parse.
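Sketched with a hypothetical result representation (not the current one): a failure becomes just its fixnum position, so failures cons nothing, while successes stay structured.

```lisp
;; Hypothetical representation: a fixnum result *is* the failure
;; position; anything else is a successful result object.
(defstruct successful-parse production position)

(defun failed-parse-p (result)
  (typep result 'fixnum))

(defun result-position (result)
  (if (failed-parse-p result)
      result                              ; the fixnum is the position
      (successful-parse-position result)))
```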

Results of sequence expressions

In the case of a success when matching a sequence, we don't really need the whole result object for each submatch. Maybe return the result as multiple values from each rule, and have with-cached-result cons up the object only when storing it?
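A hypothetical sketch of that scheme: rule functions return production, position, and success as multiple values, and only the caching layer conses them into a storable object. Both function names are illustrative, not Esrap internals.

```lisp
;; A rule function returns (values production new-position success-p)
;; instead of consing a result object per submatch.
(defun match-literal (string text position)
  (let ((end (+ position (length string))))
    (if (and (<= end (length text))
             (string= string text :start2 position :end2 end))
        (values string end t)
        (values nil position nil))))

;; Only the caching layer conses the values into a storable object.
(defmacro with-cached-result ((key cache) &body body)
  `(multiple-value-bind (production position successp) (progn ,@body)
     (setf (gethash ,key ,cache) (list production position successp))
     (values production position successp)))
```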

Character ranges

(or "0" "1" "2" "3" "4" "5" "6" "7" "8" "9") can be implemented with a range-check for char-code. Similarly (or "foobar" "foo" "bar") can be implemented using tricks from pkhuong's STRING-CASE.
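The ten-way OR over the digit characters collapses into a single range check; a sketch (the function names are illustrative):

```lisp
;; CHAR<= does the range check directly; the standard guarantees the
;; digit characters 0-9 have contiguous, increasing codes.
(defun digit-char-range-p (ch)
  (char<= #\0 ch #\9))

;; Equivalently, as an explicit CHAR-CODE comparison:
(defun digit-code-range-p (ch)
  (<= (char-code #\0) (char-code ch) (char-code #\9)))
```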

STARTED Cache

The cache is a big bottleneck. Try out a few different designs. Early experiments show that while it’s easy to make something that conses less, it’s not trivial to make it much faster than the current simple hash-table based version.

Some statistics:

  • 5-10% of positions in a given text have only failure results. If we can efficiently record the rules these are for…
  • 40-50% of positions in a given text end up with exactly one successful result, irrespective of number of failures. Not sure if we can use this.
  • 75% of positions in a given text end up with results. This should make a good estimate for the size of the cache needed.

GC is another related bottleneck. Not because we cons so much, but because we have this massive cache that keeps being written to, so we have boxed objects on dirty pages.

To reduce the GC pressure, first optimize the result handling. If the issue persists, see the first option below.

Maybe: map each rule to a position cache. In the position cache we need to be able to differentiate between three states: no result, success, and failure, and to be able to store the result. If we store results in a single global result vector and use N bits per position in the position cache: 0 means no result, 1 means failure, and anything else is the position of the result object in the global vector.
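A hypothetical sketch of that encoding, with an offset of 2 so that the sentinel values 0 and 1 never collide with a vector index (all names are illustrative):

```lisp
;; Global result vector; a position-cache entry N >= 2 means the
;; result object lives at index (- N 2) here.
(defvar *results* (make-array 0 :adjustable t :fill-pointer t))

(defun note-success (position-cache position result)
  (let ((index (vector-push-extend result *results*)))
    (setf (aref position-cache position) (+ index 2))))

(defun cached-state (position-cache position)
  (let ((entry (aref position-cache position)))
    (case entry
      (0 :no-result)                     ; sentinel: nothing cached yet
      (1 :failure)                       ; sentinel: cached failure
      (t (aref *results* (- entry 2)))))) ; cached success object
```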

Maybe: PCL-style multikey cache.

Maybe: Basic two-level cache. (Version of this on a branch.)

Features

Thread safety

Parsing is currently thread-safe if parse has been compiler-macroexpanded. *RULES* needs locking, but isn't used during actual parsing.

Add define-expression-syntax

STARTED Grammar objects

Rules should be contained in grammars, so that symbols like cl:if can refer to different rules in different contexts. Grammars can also enforce rule numbering, making caching results easier. It should be possible to inherit from other grammars.

Optimizations leading to old rules being used are in principle fine

but I would like to have that behind a flag so it can be turned off for debugging. No need to export or document the flag.

Accepting designators is good

No circular dependencies sounds good to me

Overall I value backwards compatibility highly.

If keeping it doesn't make things clearly worse or the implementation much harder, I would keep it. Therefore I would prefer ADD-RULE & co. to default to the grammar at runtime, and have it as either a keyword or an optional argument, analogously to how INTERN & co. work.

DEFRULE on the other hand

should IMO choose the grammar (if not explicitly given) at compile-time if at all possible. Earlier is better there, I suspect.

Naming things with string designators instead of symbols.

I see the attraction, but this means that all rules in a grammar are public and prone to conflict, no?

If I have a grammar G1 in package BAR that specifies a rule called BAR:WHITESPACE and a G2 in FOO that specifies FOO:WHITESPACE, then G3 in QUUX can use both without problems. If both rules are really called "WHITESPACE", things can get confusing pretty quickly…

Grammar names in tests

STARTED Character classes

Have standard-grammar instances that define things like digit, whitespace, ascii, etc.

This will probably be done in a different system.

CANCELED Transform Subseq

(defrule decimal (+ (or "0" "1" ...))
  (:subseq-function parse-integer))

Character ranges

Make it easy to specify character ranges, e.g. (char #\0 #\9).

Improvements

Run all tests in evaluated mode

Get rid of *current-cell*

Documentation strings

Structures and classes have a mixture of documentation strings and documentation comments. Which do we want? After deciding, make this consistent.

Write tests for removing rules

STARTED Reference example files from manual

Write tests for %expression-[direct-]dependencies

Better error reports

For parse errors, in particular "incomplete parse" errors, provide a better description of where the parse actually failed, which rule or rules were involved, and what input was expected.

Tests for rule tracing functionality

Remove concat

Bugs

NEW Tracing does not work for interpreted rules

NEW Recursive rules cannot be removed

%expression-direct-dependencies returns duplicates