This document records this project's design principles. Inspired by https://matklad.github.io/2021/02/06/ARCHITECTURE.md.html
The speed and memory footprint of image reference parsing are highly unlikely to ever matter to a program handling gigabytes of images. However, optimizing the parsing is fun, and fun is encouraged in this repository.
This library takes an input ASCII string (a slice of bytes) and parses the length of each section of an image reference.
Using ASCII only avoids allocating Unicode `char`s, which each weigh 4 bytes.
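As a rough illustration of the point above, the sketch below scans raw bytes without ever decoding a `char`. The function `leading_alnum_len` and its character rule are hypothetical and are not this library's API; only the sizes of `char` and `u8` are facts about Rust itself.

```rust
use std::mem::size_of;

/// A `char` is always 4 bytes in Rust; a `u8` is 1 byte.
pub const CHAR_SIZE: usize = size_of::<char>();
pub const BYTE_SIZE: usize = size_of::<u8>();

/// Hypothetical helper: counts the leading bytes that are ASCII lowercase
/// letters or digits. It works directly on bytes, so no `char` is decoded.
pub fn leading_alnum_len(src: &[u8]) -> usize {
    src.iter()
        .take_while(|b| b.is_ascii_lowercase() || b.is_ascii_digit())
        .count()
}
```

For example, `leading_alnum_len(b"ubuntu:22.04")` stops at the `:` and returns 6.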
Re-parsing bytes costs time and memory. Peeking one byte ahead is ok. Re-parsing sections on error to find an invalid character is also ok as long as the benchmarks don't regress.
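One way to read the "peek one byte ahead" rule is sketched below. The grammar is an assumption made up for illustration (lowercase alphanumerics, with a single `.` or `-` allowed only between two alphanumerics), and `component_len` is a hypothetical name. When a separator is followed by a valid byte, both are consumed together, so the peeked byte is never examined twice.

```rust
/// Hypothetical character class used by the sketch below.
fn is_alnum(b: u8) -> bool {
    b.is_ascii_lowercase() || b.is_ascii_digit()
}

/// Returns the length of the longest valid prefix of `src` under the
/// assumed grammar. Each byte is inspected at most once.
pub fn component_len(src: &[u8]) -> usize {
    let mut i = 0;
    while i < src.len() {
        let b = src[i];
        if is_alnum(b) {
            i += 1;
        } else if (b == b'.' || b == b'-')
            && i > 0
            && src.get(i + 1).map_or(false, |&next| is_alnum(next))
        {
            // Peek succeeded: consume the separator and the peeked byte together.
            i += 2;
        } else {
            // Invalid byte, or a separator with nothing valid after it.
            break;
        }
    }
    i
}
```

A leading or trailing separator, or two separators in a row, ends the prefix early: `component_len(b"a..b")` returns 1.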
`&str`s are expensive: they cost 2 `usize`s (a pointer and a length).
Prefer holding one `&str` and many short lengths in memory, then splitting new `&str`s using the lengths on demand.
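A minimal sketch of this layout follows. The `Reference` struct, its field names, and the assumption that sections are separated by exactly one `/` and one `:` are all hypothetical simplifications, not the library's real types. The struct stores one `&str` plus three `u8` lengths and slices sub-strings only when asked.

```rust
use std::mem::size_of;

/// Hypothetical layout: one borrowed source string plus short lengths.
pub struct Reference<'a> {
    src: &'a str,
    domain_len: u8,
    path_len: u8,
    tag_len: u8,
}

impl<'a> Reference<'a> {
    /// Assumes the caller already parsed the lengths from `src`.
    pub fn new(src: &'a str, domain_len: u8, path_len: u8, tag_len: u8) -> Self {
        Reference { src, domain_len, path_len, tag_len }
    }

    /// Splits the domain out of the source on demand.
    pub fn domain(&self) -> &'a str {
        &self.src[..self.domain_len as usize]
    }

    /// Skips the '/' separator that follows the domain.
    pub fn path(&self) -> &'a str {
        let start = self.domain_len as usize + 1;
        &self.src[start..start + self.path_len as usize]
    }

    /// Skips the '/' after the domain and the ':' after the path.
    pub fn tag(&self) -> &'a str {
        let start = self.domain_len as usize + 1 + self.path_len as usize + 1;
        &self.src[start..start + self.tag_len as usize]
    }
}

/// True when the compact layout is smaller than storing three `&str`s.
pub fn saves_space() -> bool {
    size_of::<Reference<'static>>() < 3 * size_of::<&str>()
}
```

With `Reference::new("example.com/app:v1", 11, 3, 2)`, the accessors return `"example.com"`, `"app"`, and `"v1"` without storing any extra `&str`.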
Use the smallest unsigned integer size that can represent the length of a section of an image reference.
Since most sections of an image reference are under 255 ASCII characters long, most lengths can be represented using a `u8`.
The encoded section of the digest is technically unbounded, but practically can be measured with a `u16`.
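The narrowing described above could look roughly like the following. The `TooLong` error type and the function names are assumptions for illustration; the conversions use the standard `TryFrom` trait, which fails rather than truncates when a value does not fit.

```rust
use std::convert::TryFrom;

/// Hypothetical error for a section that exceeds its length type.
#[derive(Debug, PartialEq, Eq)]
pub struct TooLong;

/// Narrows a section length to `u8`, rejecting anything above 255.
pub fn short_len(n: usize) -> Result<u8, TooLong> {
    u8::try_from(n).map_err(|_| TooLong)
}

/// Narrows an encoded-digest length to `u16`, rejecting anything above 65535.
pub fn digest_len(n: usize) -> Result<u16, TooLong> {
    u16::try_from(n).map_err(|_| TooLong)
}
```

Returning an error on overflow keeps an oversized input from silently wrapping into a short, wrong length.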
Since all lengths can be 0, treat 0 as the `None` value rather than using extra space for an `Option<Length>`.
Temporarily converting a length to an `Option<Length>` is ok, since it's roughly equivalent to using a temporary `bool` while checking `len == 0`.
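A sketch of the zero-as-`None` convention, under the assumption that `Length` is a thin wrapper around a `u8` (the wrapper and its methods are hypothetical here). An `Option<u8>` occupies 2 bytes because `u8` has no spare bit pattern for `None`, whereas the wrapper stays at 1 byte.

```rust
use std::mem::size_of;

/// Hypothetical length wrapper where 0 means "section absent".
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Length(u8);

impl Length {
    pub const NONE: Length = Length(0);

    pub fn new(n: u8) -> Self {
        Length(n)
    }

    /// Temporary conversion for call sites that want `Option` ergonomics.
    pub fn as_option(self) -> Option<u8> {
        if self.0 == 0 { None } else { Some(self.0) }
    }
}

/// Bytes saved per stored length compared to `Option<u8>`.
pub fn bytes_saved() -> usize {
    size_of::<Option<u8>>() - size_of::<Length>()
}
```

The `Option` only lives on the stack for the duration of the check, so the stored representation stays one byte per length.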
Record invariants using `debug_assert!(..)` instead of `assert!(..)` to avoid extra computation in release mode.
Put extra debugging variables behind `#[cfg(debug_assertions)]` conditional-compilation attributes.
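The two rules above combine as in the following sketch. The `tag_len` function and its simplified character set are assumptions for illustration. Note that `debug_assert!` still type-checks its condition in release builds, so an assertion that mentions a debug-only variable needs the same `#[cfg(debug_assertions)]` attribute as the variable.

```rust
/// Hypothetical scanner: counts leading bytes that are ASCII alphanumerics,
/// '.', '_' or '-', capped at 255 so the result fits in a `u8`.
pub fn tag_len(src: &[u8]) -> u8 {
    // Debug-only bookkeeping; this binding does not exist in release builds.
    #[cfg(debug_assertions)]
    let input_len = src.len();

    let mut len: u8 = 0;
    // `take(255)` guarantees `len += 1` cannot overflow.
    for &b in src.iter().take(u8::MAX as usize) {
        if !(b.is_ascii_alphanumeric() || matches!(b, b'.' | b'_' | b'-')) {
            break;
        }
        len += 1;
    }

    // Invariant: the reported length never exceeds the input length.
    // The attribute is required because `input_len` is debug-only.
    #[cfg(debug_assertions)]
    debug_assert!(len as usize <= input_len);

    len
}
```

For example, `tag_len(b"v1.2.3@sha256")` stops at the `@` and returns 6.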
To keep the library size small and keep ownership of all of the relevant logic, I chose not to use the excellent `regex` crate since:
- writing the parsers as pure functions avoids issues of cross-thread resource contention.
- I think `regex` relies on pointer-sized offsets for capture groups, which cancels out the short-length optimizations. A scan through the `regex` and `regex-automata` docs and issues didn't reveal a way to use `u8`s. If you know a way to get `regex` to use custom offset sizes, please let me know in this repo's issues!