
Introduce incremental read and deserialize for text & binary #139

Merged: nickbabcock merged 31 commits into master from lex2 on Dec 21, 2023
Conversation

@nickbabcock (Contributor) commented Dec 13, 2023

Previously jomini APIs required all data to be in memory.

This is less than ideal when working with large save files (100 MB+).

This PR introduces a host of APIs that work off Read implementations.

For example, here is how we could find the binary max nesting depth from stdin:

use std::{error, io};

fn main() -> Result<(), Box<dyn error::Error>> {
    let stdin = io::stdin().lock();
    let mut reader = jomini::binary::TokenReader::new(stdin);

    // Track nesting depth as open/close tokens stream in.
    let mut current_depth = 0;
    let mut max_depth = 0;
    while let Some(token) = reader.next()? {
        match token {
            jomini::binary::Token::Open => {
                current_depth += 1;
                max_depth = max_depth.max(current_depth);
            }
            jomini::binary::Token::Close => current_depth -= 1,
            _ => {}
        }
    }

    println!("{}", max_depth);
    Ok(())
}

APIs for deserialization are similar and are denoted with from_reader.
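For example, a minimal sketch of deserializing a text save straight from a file (the Save struct is illustrative and the exact from_utf8_reader entry point is an assumption here; consult the docs for the real signatures):

use serde::Deserialize;
use std::{error, fs};

#[derive(Debug, Deserialize)]
struct Save {
    // Illustrative field; real save models are much larger.
    date: String,
}

fn main() -> Result<(), Box<dyn error::Error>> {
    let file = fs::File::open("gamestate")?;
    // Assumed reader-based counterpart to the slice-based deserializer;
    // data is consumed incrementally instead of being buffered whole.
    let save: Save = jomini::text::de::from_utf8_reader(file)?;
    println!("{}", save.date);
    Ok(())
}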

Results when deserializing EU4 saves on x86:

  • Text saves: 33% decrease in latency, 80% reduction in peak memory usage
  • Binary saves: 10% increase in latency, 50% reduction in peak memory usage

Binary saves saw a smaller improvement because the binary deserializer was already smart enough to incrementally parse buffered data instead of first parsing it to a tape, as the text deserializer must. For other games with smaller models, the reduction in memory usage is expected to be even more pronounced (90%+).

It is a shame that using the binary incremental deserialization API came with a small performance cost for EU4 saves. I believe part of this is due to how DEFLATE works. I previously wrote an article on how inflating from a byte slice was 20% faster than from a stream. The streaming inflate implementation performs a lot more memcpy, as it must juggle "access to 32KiB of the previously decompressed data". In the end, a 50% reduction in peak memory seemed worth it.
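To illustrate the tradeoff outside of jomini (flate2 here is a stand-in for whatever inflate implementation is in play, not code from this PR):

use flate2::read::DeflateDecoder;
use std::io::{self, Read};

// Streaming: constant memory, but the decoder must maintain a 32 KiB
// window of previously decompressed data, which adds memcpy traffic.
fn inflate_streaming<R: Read>(compressed: R) -> DeflateDecoder<R> {
    DeflateDecoder::new(compressed)
}

// Buffered: inflate everything up front. Faster per byte, but peak
// memory is the full decompressed size of the save.
fn inflate_buffered<R: Read>(compressed: R) -> io::Result<Vec<u8>> {
    let mut out = Vec::new();
    DeflateDecoder::new(compressed).read_to_end(&mut out)?;
    Ok(out)
}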

In the docs, incremental text APIs are marked as experimental as they use a different parsing algorithm that is geared more towards save files. I have not yet fleshed out ergonomic equivalents for more esoteric game syntax (like parameter definitions). Game files can still be parsed with the experimental APIs, but these APIs may change in the future based on feedback.
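As a sketch of the experimental surface, here is the text analog of the binary example above (the token variant names are assumed to mirror the binary ones):

use std::{error, io};

fn main() -> Result<(), Box<dyn error::Error>> {
    let stdin = io::stdin().lock();
    let mut reader = jomini::text::TokenReader::new(stdin);

    // Count how many containers the document opens.
    let mut objects = 0;
    while let Some(token) = reader.next()? {
        if matches!(token, jomini::text::Token::Open) {
            objects += 1;
        }
    }

    println!("{}", objects);
    Ok(())
}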

As part of this PR, the incremental binary deserializer has been promoted to handle all buffered data, so from_slice no longer parses the data to a tape as an intermediate step.

The new incremental APIs are not totally symmetrical. The binary format has a Lexer that is a zero cost scanner over a slice of data. There is no text equivalent, as I don't think one is conducive to constructing the higher level reading abstractions with good performance. It is no problem for the binary implementation to restart reading a string that crosses a read boundary, as binary string operations are independent of string length. Contrast that with text, where starting over means every byte of the string is examined again (and the same goes for any whitespace). This could be worked around by owning data or communicating state, but that complication doesn't seem worth it compared to bundling everything inside a TokenReader that keeps state local and returns borrowed data.
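A simplified sketch of that asymmetry (the u16 length prefix mirrors how the binary format encodes strings; error handling and encodings are elided):

// Binary: a little-endian u16 length prefix says exactly how many bytes
// to take, so retrying a string after a short read costs a bounds check
// no matter how long the string is.
fn binary_string(data: &[u8]) -> Option<&[u8]> {
    let len = u16::from_le_bytes([*data.get(0)?, *data.get(1)?]) as usize;
    data.get(2..2 + len)
}

// Text: the end of a token is only discovered by scanning, so retrying
// after a short read revisits every byte (plus any surrounding whitespace).
fn text_string(data: &[u8]) -> Option<&[u8]> {
    let end = data.iter().position(|b| b.is_ascii_whitespace())?;
    Some(&data[..end])
}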

There are some aspects of the resulting API I don't love:

  • Now there are two "token" types for each format (e.g., TextToken and text::Token). I don't like how they share the same semantic name. In hindsight, I wish I had named the elements of a tape something like TapeNode (simdjson calls them tape_ref). This is something I may consider if 1.0 is ever reached.
  • I'm deliberating on whether TokenReader is the best name or whether BinaryReader and TextReader would be more appropriate.
  • I don't like the thought of maintaining another deserializer implementation, but I don't see a way around it.
  • There are a few places in the code where I felt the need to circumvent the borrow checker, as readers are essentially mutable lending iterators, which makes it difficult to use a token with any other API. I would love to solve this (the tension is sketched after this list).
    // This makes me sad
    let de = unsafe { &mut *(self.de as *mut _) };
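
The underlying tension, sketched with invented names: each token borrows the reader's internal buffer, so the exclusive borrow taken by next lives as long as the token does, and safe Rust rejects holding a token across the next call, which is exactly what a deserializer handing tokens to serde wants to do.

struct Reader {
    buf: Vec<u8>,
}

impl Reader {
    // A "lending" next: the returned slice borrows from &mut self, so
    // the exclusive borrow lasts as long as the returned token.
    fn next(&mut self) -> Option<&[u8]> {
        Some(&self.buf)
    }
}

fn demo(r: &mut Reader) {
    let token = r.next();
    // r.next(); // error[E0499]: cannot borrow `*r` as mutable more than once
    let _ = token;
}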

Future plans:

  • Immediately start adopting incremental APIs for save file deserialization
  • Consider them for melting save files
  • Keep tape parsing for game files for the mid-level DOM-like APIs that have no replacement

Closes #135

@nickbabcock nickbabcock merged commit 59f3159 into master Dec 21, 2023
7 checks passed
@nickbabcock nickbabcock deleted the lex2 branch December 21, 2023 03:22
nickbabcock added a commit that referenced this pull request Apr 2, 2024
I look to have accidentally removed them as part of #139