Introduce incremental read and deserialize for text & binary (#139)
Previously jomini APIs required all data to be in memory.

This is less than ideal when working with large save files (100 MB+).

This PR introduces a host of APIs that work off `Read` implementations.

For example, here is how we could find the binary max nesting depth from stdin:

```rust
use std::{error, io};

fn main() -> Result<(), Box<dyn error::Error>> {
    let stdin = io::stdin().lock();
    let mut reader = jomini::binary::TokenReader::new(stdin);

    let mut current_depth = 0;
    let mut max_depth = 0;
    while let Some(token) = reader.next()? {
        match token {
            jomini::binary::Token::Open => {
                current_depth += 1;
                max_depth = max_depth.max(current_depth);
            },
            jomini::binary::Token::Close => current_depth -= 1,
            _ => {}
        }
    }

    println!("{}", max_depth);
    Ok(())
}
```

APIs for deserialization are similar and are denoted with `from_reader`.

Results when deserializing EU4 saves on x86:
 - Text saves: 33% decrease in latency, 80% reduction in peak memory usage
 - Binary saves: 10% increase in latency, 50% reduction in peak memory usage

Binary saves saw a smaller improvement because the binary deserializer was already smart enough to incrementally parse buffered data, instead of first parsing it to a tape like the text deserializer. The reduction in memory usage is expected to be even more pronounced (90%+) for other games with smaller models.

It is a shame that using the binary incremental deserialization API came with a small performance cost for EU4 saves. I believe part of this is due to how DEFLATE works. [I previously wrote an article](https://nickb.dev/blog/deflate-yourself-for-faster-rust-zips/) on how inflating from a byte slice was 20% faster than inflating from a stream. The streaming inflate implementation performs many more `memcpy` calls, as it must juggle ["access to 32KiB of the previously decompressed data"](https://docs.rs/miniz_oxide/0.7.1/miniz_oxide/inflate/core/fn.decompress.html). In the end, a 50% reduction in peak memory seemed worth it.

In the docs, incremental text APIs are marked as experimental as they use a different parsing algorithm that is geared more towards save files. I have not yet fleshed out ergonomic equivalents for more esoteric game syntax (like parameter definitions). Game files can still be parsed with the experimental APIs, but these APIs may change in the future based on feedback.

As part of this PR, the incrementally deserializing binary implementation has been promoted to handle all buffered data; a `from_slice` function no longer parses the data to a tape as an intermediate step.

The new incremental APIs are not totally symmetrical. The binary format has a `Lexer` that is a zero-cost scanner over a slice of data. There is no such equivalent for the text format, as I don't think one is conducive to building the higher-level reading abstractions with good performance. It is no problem for the binary implementation to start reading a string again if it crosses a read boundary, as string operations are independent of the string's length. Contrast that with text, where starting over means re-examining every byte of the string (and the same goes for any whitespace). This could be worked around by owning data or communicating state, but that complication doesn't seem worth it over bundling everything inside a `TokenReader` that keeps state local and returns borrowed data.
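The asymmetry can be made concrete with a toy sketch (hypothetical helpers, not jomini APIs): a length-prefixed binary string reveals its extent up front, so restarting after a buffer refill is cheap, while a text scalar's end is only discovered by scanning byte by byte.

```rust
// Binary: a u16 length prefix tells us the string's extent up front, so
// restarting the read after a buffer refill costs O(1) besides the copy.
fn binary_string(buf: &[u8]) -> Option<&[u8]> {
    let len = u16::from_le_bytes([*buf.first()?, *buf.get(1)?]) as usize;
    buf.get(2..2 + len)
}

// Text: the end of a scalar is only discovered by scanning for a
// delimiter, so a restart re-examines every byte seen so far.
fn text_scalar(buf: &[u8]) -> &[u8] {
    let end = buf
        .iter()
        .position(|b| matches!(b, b' ' | b'\t' | b'\n' | b'=' | b'{' | b'}'))
        .unwrap_or(buf.len());
    &buf[..end]
}

fn main() {
    assert_eq!(binary_string(&[0x03, 0x00, b'E', b'N', b'G']), Some(&b"ENG"[..]));
    assert_eq!(text_scalar(b"ENG rest"), b"ENG");
}
```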

There are some aspects of the resulting API I don't love:
 - Now there are two "token" types for each format (eg: `TextToken` and `text::Token`). I don't like how they have the same semantic name. In hindsight, I wish I had named the elements of a tape something like `TapeNode` (simdjson calls them `tape_ref`). This is something I may consider if 1.0 is ever reached.
 - I'm deliberating on whether `TokenReader` is the best name or if `BinaryReader` and `TextReader` are more appropriate.
 - I don't like the thought of maintaining another deserializer implementation, but I don't see a way around it.
 - There are a few places in the code where I felt the need to circumvent the borrow checker, as readers are essentially mutable lending iterators, which makes it difficult to use a token with any other API. I would love to solve this.
   ```rust
   // This makes me sad
   let de = unsafe { &mut *(self.de as *mut _) };
   ```
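The borrow-checker friction can be reproduced with a minimal mutable lending iterator (a simplified stand-in, not the real `TokenReader`): each yielded item borrows the reader's buffer, so the reader stays mutably borrowed for the item's lifetime and can't be handed to another API alongside it without first copying the item out.

```rust
// A stand-in lending reader: each "token" borrows the internal buffer.
struct LendingReader {
    data: Vec<u8>,
    pos: usize,
}

impl LendingReader {
    // The returned slice ties up `&mut self` for its whole lifetime.
    fn next(&mut self) -> Option<&[u8]> {
        if self.pos >= self.data.len() {
            return None;
        }
        self.pos += 1;
        Some(&self.data[self.pos - 1..self.pos])
    }
}

fn main() {
    let mut reader = LendingReader { data: vec![1, 2, 3], pos: 0 };
    let mut seen = Vec::new();
    while let Some(token) = reader.next() {
        // While `token` is alive, `reader` cannot be mutably borrowed again,
        // so passing both to another API is rejected. Copying the token out
        // sidesteps the borrow (at the cost of an allocation).
        seen.push(token.to_vec());
    }
    assert_eq!(seen, vec![vec![1], vec![2], vec![3]]);
}
```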
nickbabcock authored Dec 21, 2023
1 parent 7853832 commit 59f3159
Showing 19 changed files with 5,636 additions and 2,172 deletions.
93 changes: 15 additions & 78 deletions README.md
@@ -24,7 +24,9 @@ Converters](https://github.com/ParadoxGameConverters) and

## Quick Start

Below is a demonstration on parsing plaintext data using jomini tools.
Below is a demonstration of deserializing plaintext data using serde.
Several additional serde-like attributes are used to reconcile the serde
data model with the structure of these files.

```rust
use jomini::{
@@ -71,9 +73,9 @@ let actual: Model = jomini::text::de::from_windows1252_slice(data)?;
assert_eq!(actual, expected);
```

## Binary Parsing
## Binary Deserialization

Parsing data encoded in the binary format is done in a similar fashion but with a couple extra steps for the caller to supply:
Deserializing data encoded in the binary format is done in a similar fashion but with a couple extra steps for the caller to supply:

- How text should be decoded (typically Windows-1252 or UTF-8)
- How rational (floating point) numbers are decoded
@@ -84,7 +86,7 @@ Implementors be warned, not only does each Paradox game have a different binary
Below is an example that defines a sample binary format and uses a hashmap token lookup.

```rust
use jomini::{BinaryDeserializer, Encoding, JominiDeserialize, Windows1252Encoding};
use jomini::{Encoding, JominiDeserialize, Windows1252Encoding, binary::BinaryFlavor};
use std::{borrow::Cow, collections::HashMap};

#[derive(JominiDeserialize, PartialEq, Debug)]
@@ -116,8 +118,7 @@ let data = [ 0x82, 0x2d, 0x01, 0x00, 0x0f, 0x00, 0x03, 0x00, 0x45, 0x4e, 0x47 ];
let mut map = HashMap::new();
map.insert(0x2d82, "field1");

let actual: MyStruct = BinaryDeserializer::builder_flavor(BinaryTestFlavor)
.deserialize_slice(&data[..], &map)?;
let actual: MyStruct = BinaryTestFlavor.deserialize_slice(&data[..], &map)?;
assert_eq!(actual, MyStruct { field1: "ENG".to_string() });
```

@@ -126,59 +127,14 @@ without any duplication.

One can configure the behavior when a token is unknown (ie: fail immediately or try to continue).

### Ondemand Deserialization

The ondemand deserializer is a one-shot deserialization mode that is often faster
and more memory efficient, as it does not parse the input into an intermediate
tape and instead deserializes directly from the input.

It is instantiated and used similarly to `BinaryDeserializer`.

```rust
use jomini::OndemandBinaryDeserializer;
// [...snip code from previous example...]

let actual: MyStruct = OndemandBinaryDeserializer::builder_flavor(BinaryTestFlavor)
.deserialize_slice(&data[..], &map)?;
assert_eq!(actual, MyStruct { field1: "ENG".to_string() });
```

### Direct identifier deserialization with `token` attribute

There may be some performance loss during binary deserialization as
tokens are resolved to strings via a `TokenResolver` and then matched against the
string representations of a struct's fields.

We can fix this issue by directly encoding the expected token value into the struct:

```rust
#[derive(JominiDeserialize, PartialEq, Debug)]
struct MyStruct {
#[jomini(token = 0x2d82)]
field1: String,
}

// Empty token to string resolver
let map = HashMap::<u16, String>::new();

let actual: MyStruct = BinaryDeserializer::builder_flavor(BinaryTestFlavor)
.deserialize_slice(&data[..], &map)?;
assert_eq!(actual, MyStruct { field1: "ENG".to_string() });
```

A couple of notes:

- This does not obviate the need for the token to string resolver, as tokens may be used as values.
- If the `token` attribute is specified on one field of a struct, it must be specified on all fields of that struct.

## Caveats

Caller is responsible for:
Before calling any Jomini API, callers are expected to:

- Determining the correct format (text or binary) ahead of time
- Stripping off any header that may be present (eg: `EU4txt` / `EU4bin`)
- Providing the token resolver for the binary format
- Providing the conversion to reconcile how, for example, a date may be encoded as an integer in
- Determine the correct format (text or binary) ahead of time.
- Strip off any header that may be present (eg: `EU4txt` / `EU4bin`)
- Provide the token resolver for the binary format
- Provide the conversion to reconcile how, for example, a date may be encoded as an integer in
the binary format, but as a string when in plaintext.

## The Mid-level API
@@ -199,6 +155,9 @@ for (key, _op, value) in reader.fields() {
}
```

For an even lower level of parsing, see the respective binary and text
documentation.

The mid-level API also provides the excellent utility of converting the
plaintext Clausewitz format to JSON when the `json` feature is enabled.

@@ -211,28 +170,6 @@ let actual = reader.json().to_string()?;
assert_eq!(actual, r#"{"foo":"bar"}"#);
```

## One Level Lower

At the lowest layer, one can interact with the raw data directly via `TextTape`
and `BinaryTape`.

```rust
use jomini::{TextTape, TextToken, Scalar};

let data = b"foo=bar";

assert_eq!(
TextTape::from_slice(&data[..])?.tokens(),
&[
TextToken::Unquoted(Scalar::new(b"foo")),
TextToken::Unquoted(Scalar::new(b"bar")),
]
);
```

If one will only use `TextTape` and `BinaryTape` then `jomini` can be compiled without default
features, resulting in a build without dependencies.

## Write API

There are two targeted use cases for the write API. One is when a text tape is on hand.
23 changes: 16 additions & 7 deletions benches/jomini_bench.rs
@@ -3,11 +3,9 @@ use criterion::{
};
use flate2::read::GzDecoder;
use jomini::{
binary::{
de::OndemandBinaryDeserializerBuilder, BinaryFlavor, BinaryTapeParser, TokenResolver,
},
binary::{BinaryFlavor, BinaryTapeParser, TokenResolver},
common::Date,
BinaryDeserializer, BinaryTape, Encoding, Scalar, TextTape, Utf8Encoding, Windows1252Encoding,
BinaryTape, Encoding, Scalar, TextTape, Utf8Encoding, Windows1252Encoding,
};
use std::{borrow::Cow, io::Read};

@@ -125,15 +123,26 @@ pub fn binary_deserialize_benchmark(c: &mut Criterion) {
group.throughput(Throughput::Bytes(data.len() as u64));
group.bench_function("ondemand", |b| {
b.iter(|| {
let _res: Gamestate = OndemandBinaryDeserializerBuilder::with_flavor(BinaryTestFlavor)
let _res: Gamestate = BinaryTestFlavor
.deserializer()
.deserialize_slice(&data[..], &MyBinaryResolver)
.unwrap();
})
});
group.bench_function("ondemand-reader", |b| {
b.iter(|| {
let _res: Gamestate = BinaryTestFlavor
.deserializer()
.deserialize_reader(&data[..], &MyBinaryResolver)
.unwrap();
})
});
group.bench_function("tape", |b| {
b.iter(|| {
let _res: Gamestate = BinaryDeserializer::builder_flavor(BinaryTestFlavor)
.deserialize_slice(&data[..], &MyBinaryResolver)
let tape = BinaryTape::from_slice(&data[..]).unwrap();
let _res: Gamestate = BinaryTestFlavor
.deserializer()
.deserialize_tape(&tape, &MyBinaryResolver)
.unwrap();
})
});
11 changes: 11 additions & 0 deletions fuzz/fuzz_targets/fuzz_binary.rs
@@ -62,6 +62,17 @@ fuzz_target!(|data: &[u8]| {
hash.insert(0x354eu16, "selector");
hash.insert(0x209u16, "localization");

let mut lexer = jomini::binary::Lexer::new(data);
let mut reader = jomini::binary::TokenReader::builder().buffer_len(100).build(data);

loop {
match (lexer.read_token(), reader.read()) {
(Ok(t1), Ok(t2)) => assert_eq!(t1, t2),
(Err(e1), Err(e2)) => { break; }
(x, y) => panic!("{:?} {:?}", x, y),
}
}

let mut utape = jomini::BinaryTape::default();
let ures =
jomini::binary::BinaryTapeParser.parse_slice_into_tape_unoptimized(&data, &mut utape);
11 changes: 11 additions & 0 deletions fuzz/fuzz_targets/fuzz_text.rs
@@ -98,6 +98,17 @@ where
}

fuzz_target!(|data: &[u8]| {
let mut reader = jomini::text::TokenReader::new(data);
let mut i = 0;
while let Ok(Some(x)) = reader.next() {
if matches!(x, jomini::text::Token::Open) {
i += 1;
if i % 2 == 1 {
let _ = reader.skip_container();
}
}
}

let _: Result<Meta, _> = jomini::TextTape::from_slice(&data).and_then(|tape| {
let tokens = tape.tokens();
for (i, token) in tokens.iter().enumerate() {
