
Introduce incremental read and deserialize for text & binary #139

Merged: nickbabcock merged 31 commits into master from lex2 on Dec 21, 2023
Conversation

@nickbabcock (Contributor) commented Dec 13, 2023

Previously jomini APIs required all data to be in memory.

This is less than ideal when working with large save files (100 MB+).

This PR introduces a host of APIs that work off Read implementations.

For example, here is how we could find the binary max nesting depth from stdin:

use std::{error, io};

fn main() -> Result<(), Box<dyn error::Error>> {
    let stdin = io::stdin().lock();
    let mut reader = jomini::binary::TokenReader::new(stdin);

    // Track nesting depth as open/close tokens stream in.
    let mut current_depth = 0;
    let mut max_depth = 0;
    while let Some(token) = reader.next()? {
        match token {
            jomini::binary::Token::Open => {
                current_depth += 1;
                max_depth = max_depth.max(current_depth);
            }
            jomini::binary::Token::Close => current_depth -= 1,
            _ => {}
        }
    }

    println!("{}", max_depth);
    Ok(())
}

APIs for deserialization are similar and are denoted with from_reader.
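For example, a minimal sketch of deserializing a text save straight from a file (the Save struct is illustrative and the exact from_utf8_reader entry point is an assumption here; consult the docs for the real signatures):

use serde::Deserialize;
use std::{error, fs};

#[derive(Debug, Deserialize)]
struct Save {
    // Illustrative field; real save models are much larger.
    date: String,
}

fn main() -> Result<(), Box<dyn error::Error>> {
    let file = fs::File::open("gamestate")?;
    // Assumed reader-based counterpart to the slice-based deserializer;
    // data is consumed incrementally instead of being buffered whole.
    let save: Save = jomini::text::de::from_utf8_reader(file)?;
    println!("{}", save.date);
    Ok(())
}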

Results when deserializing EU4 saves on x86:

  • Text saves: 33% decrease in latency, 80% reduction in peak memory usage
  • Binary saves: 10% increase in latency, 50% reduction in peak memory usage

Binary saves saw a smaller improvement because the binary deserializer was already smart enough to incrementally parse buffered data instead of first parsing it to a tape, as the text deserializer must. For other games with smaller models, the reduction in memory usage is expected to be even more pronounced (90%+).

It is a shame that using the binary incremental deserialization API came with a small performance cost for EU4 saves. I believe part of this is due to how DEFLATE works. I previously wrote an article on how inflating from a byte slice was 20% faster than from a stream. The streaming inflate implementation performs a lot more memcpy, as it must juggle "access to 32KiB of the previously decompressed data". In the end, a 50% reduction in peak memory seemed worth it.
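To illustrate the tradeoff outside of jomini (flate2 here is a stand-in for whatever inflate implementation is in play, not code from this PR):

use flate2::read::DeflateDecoder;
use std::io::{self, Read};

// Streaming: constant memory, but the decoder must maintain a 32 KiB
// window of previously decompressed data, which adds memcpy traffic.
fn inflate_streaming<R: Read>(compressed: R) -> DeflateDecoder<R> {
    DeflateDecoder::new(compressed)
}

// Buffered: inflate everything up front. Faster per byte, but peak
// memory is the full decompressed size of the save.
fn inflate_buffered<R: Read>(compressed: R) -> io::Result<Vec<u8>> {
    let mut out = Vec::new();
    DeflateDecoder::new(compressed).read_to_end(&mut out)?;
    Ok(out)
}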

In the docs, incremental text APIs are marked as experimental as they use a different parsing algorithm that is geared more towards save files. I have not yet fleshed out ergonomic equivalents for more esoteric game syntax (like parameter definitions). Game files can still be parsed with the experimental APIs, but these APIs may change in the future based on feedback.
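As a sketch of the experimental surface, here is the text analog of the binary example above (the token variant names are assumed to mirror the binary ones):

use std::{error, io};

fn main() -> Result<(), Box<dyn error::Error>> {
    let stdin = io::stdin().lock();
    let mut reader = jomini::text::TokenReader::new(stdin);

    // Count how many containers the document opens.
    let mut objects = 0;
    while let Some(token) = reader.next()? {
        if matches!(token, jomini::text::Token::Open) {
            objects += 1;
        }
    }

    println!("{}", objects);
    Ok(())
}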

As part of this PR, the incremental binary deserializer has been promoted to handle all buffered data, so from_slice no longer parses the data to a tape as an intermediate step.

The new incremental APIs are not totally symmetrical. The binary format has a Lexer that is a zero cost scanner over a slice of data. There is no text equivalent, as I don't think one is conducive to constructing the higher level reading abstractions with good performance. It is no problem for the binary implementation to restart reading a string that crosses a read boundary, as binary string operations are independent of string length. Contrast that with text, where starting over means every byte of the string is examined again (and the same goes for any whitespace). This could be worked around by owning data or communicating state, but that complication doesn't seem worth it compared to bundling everything inside a TokenReader that keeps state local and returns borrowed data.
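A simplified sketch of that asymmetry (the u16 length prefix mirrors how the binary format encodes strings; error handling and encodings are elided):

// Binary: a little-endian u16 length prefix says exactly how many bytes
// to take, so retrying a string after a short read costs a bounds check
// no matter how long the string is.
fn binary_string(data: &[u8]) -> Option<&[u8]> {
    let len = u16::from_le_bytes([*data.get(0)?, *data.get(1)?]) as usize;
    data.get(2..2 + len)
}

// Text: the end of a token is only discovered by scanning, so retrying
// after a short read revisits every byte (plus any surrounding whitespace).
fn text_string(data: &[u8]) -> Option<&[u8]> {
    let end = data.iter().position(|b| b.is_ascii_whitespace())?;
    Some(&data[..end])
}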

There are some aspects of the resulting API I don't love:

  • Now there are two "token" types for each format (e.g., TextToken and text::Token). I don't like how they share the same semantic name. In hindsight, I wish I had named the elements of a tape something like TapeNode (simdjson calls them tape_ref). This is something I may consider if 1.0 is ever reached.
  • I'm deliberating on whether TokenReader is the best name or whether BinaryReader and TextReader would be more appropriate.
  • I don't like the thought of maintaining another deserializer implementation, but I don't see a way around it.
  • There are a few places in the code where I felt the need to circumvent the borrow checker, as readers are essentially mutable lending iterators, which makes it difficult to use a token with any other API. I would love to solve this (the tension is sketched after this list).
    // This makes me sad
    let de = unsafe { &mut *(self.de as *mut _) };
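
The underlying tension, sketched with invented names: each token borrows the reader's internal buffer, so the exclusive borrow taken by next lives as long as the token does, and safe Rust rejects holding a token across the next call, which is exactly what a deserializer handing tokens to serde wants to do.

struct Reader {
    buf: Vec<u8>,
}

impl Reader {
    // A "lending" next: the returned slice borrows from &mut self, so
    // the exclusive borrow lasts as long as the returned token.
    fn next(&mut self) -> Option<&[u8]> {
        Some(&self.buf)
    }
}

fn demo(r: &mut Reader) {
    let token = r.next();
    // r.next(); // error[E0499]: cannot borrow `*r` as mutable more than once
    let _ = token;
}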

Future plans:

  • Immediately start adopting incremental APIs for save file deserialization
  • Consider them for melting save files
  • Keep tape parsing for game files for the mid-level DOM-like APIs that have no replacement

Closes #135

@nickbabcock nickbabcock merged commit 59f3159 into master Dec 21, 2023
7 checks passed
@nickbabcock nickbabcock deleted the lex2 branch December 21, 2023 03:22
nickbabcock added a commit that referenced this pull request Apr 2, 2024
I look to have accidentally removed them as part of #139