Introduce incremental read and deserialize for text & binary (#139)
Previously jomini APIs required all data to be in memory.

This is less than ideal when working with large save files (100 MB+).

This PR introduces a host of APIs that work off `Read` implementations.

For example, here is how we could find the binary max nesting depth from stdin:

```rust
use std::{error, io};

fn main() -> Result<(), Box<dyn error::Error>> {
    let stdin = io::stdin().lock();
    let mut reader = jomini::binary::TokenReader::new(stdin);

    let mut current_depth = 0;
    let mut max_depth = 0;
    while let Some(token) = reader.next()? {
        match token {
            jomini::binary::Token::Open => {
                current_depth += 1;
                max_depth = max_depth.max(current_depth);
            },
            jomini::binary::Token::Close => current_depth -= 1,
            _ => {}
        }
    }

    println!("{}", max_depth);
    Ok(())
}
```

APIs for deserialization are similar and are denoted with `from_reader`.

Results when deserializing EU4 saves on x86:
 - Text saves: 33% decrease in latency, 80% reduction in peak memory usage
 - Binary saves: 10% increase in latency, 50% reduction in peak memory usage

Binary saves saw a smaller improvement because the binary deserializer was already smart enough to incrementally parse buffered data, instead of first parsing it to a tape like the text deserializer. The reduction in memory usage is expected to be even more pronounced (90%+) for other games with smaller models.

It is a shame that using the binary incremental deserialization API came with a small performance cost for EU4 saves. I believe part of this is due to how DEFLATE works. [I previously wrote an article](https://nickb.dev/blog/deflate-yourself-for-faster-rust-zips/) on how inflating from a byte slice was 20% faster than inflating from a stream. The streaming inflate implementation performs many more `memcpy` calls, as it must juggle ["access to 32KiB of the previously decompressed data"](https://docs.rs/miniz_oxide/0.7.1/miniz_oxide/inflate/core/fn.decompress.html). In the end, a 50% reduction in peak memory seemed worth it.

In the docs, incremental text APIs are marked as experimental as they use a different parsing algorithm that is geared more towards save files. I have not yet fleshed out ergonomic equivalents for more esoteric game syntax (like parameter definitions). Game files can still be parsed with the experimental APIs, but these APIs may change in the future based on feedback.

As part of this PR, the incrementally deserializing binary implementation has been promoted to handle all buffered data; a `from_slice` function no longer parses the data to a tape as an intermediate step.

The new incremental APIs are not totally symmetrical. The binary format has a `Lexer` that is a zero-cost scanner over a slice of data. There is no such equivalent for the text format, as I don't think one is conducive to building the higher-level reading abstractions with good performance. It is no problem for the binary implementation to start reading a string again if it crosses a read boundary, as string operations are independent of the string's length. Contrast that with text, where starting over means re-examining every byte of the string (and the same goes for any whitespace). This could be worked around by owning data or communicating state, but that complication doesn't seem worth it over bundling everything inside a `TokenReader` that keeps state local and returns borrowed data.
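The asymmetry can be made concrete with a toy sketch (hypothetical helpers, not jomini APIs): a length-prefixed binary string reveals its extent up front, so restarting after a buffer refill is cheap, while a text scalar's end is only discovered by scanning byte by byte.

```rust
// Binary: a u16 length prefix tells us the string's extent up front, so
// restarting the read after a buffer refill costs O(1) besides the copy.
fn binary_string(buf: &[u8]) -> Option<&[u8]> {
    let len = u16::from_le_bytes([*buf.first()?, *buf.get(1)?]) as usize;
    buf.get(2..2 + len)
}

// Text: the end of a scalar is only discovered by scanning for a
// delimiter, so a restart re-examines every byte seen so far.
fn text_scalar(buf: &[u8]) -> &[u8] {
    let end = buf
        .iter()
        .position(|b| matches!(b, b' ' | b'\t' | b'\n' | b'=' | b'{' | b'}'))
        .unwrap_or(buf.len());
    &buf[..end]
}

fn main() {
    assert_eq!(binary_string(&[0x03, 0x00, b'E', b'N', b'G']), Some(&b"ENG"[..]));
    assert_eq!(text_scalar(b"ENG rest"), b"ENG");
}
```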

There are some aspects of the resulting API I don't love:
 - Now there are two "token" types for each format (eg: `TextToken` and `text::Token`). I don't like how they have the same semantic name. In hindsight, I wish I had named the elements of a tape something like `TapeNode` (simdjson calls them `tape_ref`). This is something I may consider if 1.0 is ever reached.
 - I'm deliberating on whether `TokenReader` is the best name or if `BinaryReader` and `TextReader` are more appropriate.
 - I don't like the thought of maintaining another deserializer implementation, but I don't see a way around it.
 - There are a few places in the code where I felt the need to circumvent the borrow checker, as readers are essentially mutable lending iterators, which makes it difficult to use a token with any other API. I would love to solve this.
   ```rust
   // This makes me sad
   let de = unsafe { &mut *(self.de as *mut _) };
   ```
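The borrow-checker friction can be reproduced with a minimal mutable lending iterator (a simplified stand-in, not the real `TokenReader`): each yielded item borrows the reader's buffer, so the reader stays mutably borrowed for the item's lifetime and can't be handed to another API alongside it without first copying the item out.

```rust
// A stand-in lending reader: each "token" borrows the internal buffer.
struct LendingReader {
    data: Vec<u8>,
    pos: usize,
}

impl LendingReader {
    // The returned slice ties up `&mut self` for its whole lifetime.
    fn next(&mut self) -> Option<&[u8]> {
        if self.pos >= self.data.len() {
            return None;
        }
        self.pos += 1;
        Some(&self.data[self.pos - 1..self.pos])
    }
}

fn main() {
    let mut reader = LendingReader { data: vec![1, 2, 3], pos: 0 };
    let mut seen = Vec::new();
    while let Some(token) = reader.next() {
        // While `token` is alive, `reader` cannot be mutably borrowed again,
        // so passing both to another API is rejected. Copying the token out
        // sidesteps the borrow (at the cost of an allocation).
        seen.push(token.to_vec());
    }
    assert_eq!(seen, vec![vec![1], vec![2], vec![3]]);
}
```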
nickbabcock authored Dec 21, 2023
1 parent 7853832 commit 59f3159
Showing 19 changed files with 5,636 additions and 2,172 deletions.
93 changes: 15 additions & 78 deletions README.md
@@ -24,7 +24,9 @@ Converters](https://github.com/ParadoxGameConverters) and

## Quick Start

Below is a demonstration on parsing plaintext data using jomini tools.
Below is a demonstration of deserializing plaintext data using serde.
Several additional serde-like attributes are used to reconcile the serde
data model with the structure of these files.

```rust
use jomini::{
@@ -71,9 +73,9 @@ let actual: Model = jomini::text::de::from_windows1252_slice(data)?;
assert_eq!(actual, expected);
```

## Binary Parsing
## Binary Deserialization

Parsing data encoded in the binary format is done in a similar fashion but with a couple extra steps for the caller to supply:
Deserializing data encoded in the binary format is done in a similar fashion but with a couple extra steps for the caller to supply:

- How text should be decoded (typically Windows-1252 or UTF-8)
- How rational (floating point) numbers are decoded
@@ -84,7 +86,7 @@ Implementors be warned, not only does each Paradox game have a different binary
Below is an example that defines a sample binary format and uses a hashmap token lookup.

```rust
use jomini::{BinaryDeserializer, Encoding, JominiDeserialize, Windows1252Encoding};
use jomini::{Encoding, JominiDeserialize, Windows1252Encoding, binary::BinaryFlavor};
use std::{borrow::Cow, collections::HashMap};

#[derive(JominiDeserialize, PartialEq, Debug)]
@@ -116,8 +118,7 @@ let data = [ 0x82, 0x2d, 0x01, 0x00, 0x0f, 0x00, 0x03, 0x00, 0x45, 0x4e, 0x47 ];
let mut map = HashMap::new();
map.insert(0x2d82, "field1");

let actual: MyStruct = BinaryDeserializer::builder_flavor(BinaryTestFlavor)
.deserialize_slice(&data[..], &map)?;
let actual: MyStruct = BinaryTestFlavor.deserialize_slice(&data[..], &map)?;
assert_eq!(actual, MyStruct { field1: "ENG".to_string() });
```

@@ -126,59 +127,14 @@ without any duplication.

One can configure the behavior when a token is unknown (ie: fail immediately or try to continue).

### Ondemand Deserialization

The ondemand deserializer is a one-shot deserialization mode that is often faster
and more memory efficient, as it does not parse the input into an intermediate
tape and instead deserializes directly from the input.

It is instantiated and used similarly to `BinaryDeserializer`.

```rust
use jomini::OndemandBinaryDeserializer;
// [...snip code from previous example...]

let actual: MyStruct = OndemandBinaryDeserializer::builder_flavor(BinaryTestFlavor)
.deserialize_slice(&data[..], &map)?;
assert_eq!(actual, MyStruct { field1: "ENG".to_string() });
```

### Direct identifier deserialization with `token` attribute

There may be some performance loss during binary deserialization as
tokens are resolved to strings via a `TokenResolver` and then matched against the
string representations of a struct's fields.

We can fix this issue by directly encoding the expected token value into the struct:

```rust
#[derive(JominiDeserialize, PartialEq, Debug)]
struct MyStruct {
#[jomini(token = 0x2d82)]
field1: String,
}

// Empty token to string resolver
let map = HashMap::<u16, String>::new();

let actual: MyStruct = BinaryDeserializer::builder_flavor(BinaryTestFlavor)
.deserialize_slice(&data[..], &map)?;
assert_eq!(actual, MyStruct { field1: "ENG".to_string() });
```

A couple of notes:

- This does not obviate the need for the token to string resolver, as tokens may be used as values.
- If the `token` attribute is specified on one field of a struct, it must be specified on all fields of that struct.

## Caveats

Caller is responsible for:
Before calling any Jomini API, callers are expected to:

- Determining the correct format (text or binary) ahead of time
- Stripping off any header that may be present (eg: `EU4txt` / `EU4bin`)
- Providing the token resolver for the binary format
- Providing the conversion to reconcile how, for example, a date may be encoded as an integer in
- Determine the correct format (text or binary) ahead of time.
- Strip off any header that may be present (eg: `EU4txt` / `EU4bin`)
- Provide the token resolver for the binary format
- Provide the conversion to reconcile how, for example, a date may be encoded as an integer in
the binary format, but as a string when in plaintext.

## The Mid-level API
@@ -199,6 +155,9 @@ for (key, _op, value) in reader.fields() {
}
```

For an even lower level of parsing, see the respective binary and text
documentation.

The mid-level API also provides the excellent utility of converting the
plaintext Clausewitz format to JSON when the `json` feature is enabled.

@@ -211,28 +170,6 @@ let actual = reader.json().to_string()?;
assert_eq!(actual, r#"{"foo":"bar"}"#);
```

## One Level Lower

At the lowest layer, one can interact with the raw data directly via `TextTape`
and `BinaryTape`.

```rust
use jomini::{TextTape, TextToken, Scalar};

let data = b"foo=bar";

assert_eq!(
TextTape::from_slice(&data[..])?.tokens(),
&[
TextToken::Unquoted(Scalar::new(b"foo")),
TextToken::Unquoted(Scalar::new(b"bar")),
]
);
```

If one will only use `TextTape` and `BinaryTape` then `jomini` can be compiled without default
features, resulting in a build without dependencies.

## Write API

There are two targeted use cases for the write API. One is when a text tape is on hand.
23 changes: 16 additions & 7 deletions benches/jomini_bench.rs
@@ -3,11 +3,9 @@ use criterion::{
};
use flate2::read::GzDecoder;
use jomini::{
binary::{
de::OndemandBinaryDeserializerBuilder, BinaryFlavor, BinaryTapeParser, TokenResolver,
},
binary::{BinaryFlavor, BinaryTapeParser, TokenResolver},
common::Date,
BinaryDeserializer, BinaryTape, Encoding, Scalar, TextTape, Utf8Encoding, Windows1252Encoding,
BinaryTape, Encoding, Scalar, TextTape, Utf8Encoding, Windows1252Encoding,
};
use std::{borrow::Cow, io::Read};

@@ -125,15 +123,26 @@ pub fn binary_deserialize_benchmark(c: &mut Criterion) {
group.throughput(Throughput::Bytes(data.len() as u64));
group.bench_function("ondemand", |b| {
b.iter(|| {
let _res: Gamestate = OndemandBinaryDeserializerBuilder::with_flavor(BinaryTestFlavor)
let _res: Gamestate = BinaryTestFlavor
.deserializer()
.deserialize_slice(&data[..], &MyBinaryResolver)
.unwrap();
})
});
group.bench_function("ondemand-reader", |b| {
b.iter(|| {
let _res: Gamestate = BinaryTestFlavor
.deserializer()
.deserialize_reader(&data[..], &MyBinaryResolver)
.unwrap();
})
});
group.bench_function("tape", |b| {
b.iter(|| {
let _res: Gamestate = BinaryDeserializer::builder_flavor(BinaryTestFlavor)
.deserialize_slice(&data[..], &MyBinaryResolver)
let tape = BinaryTape::from_slice(&data[..]).unwrap();
let _res: Gamestate = BinaryTestFlavor
.deserializer()
.deserialize_tape(&tape, &MyBinaryResolver)
.unwrap();
})
});
11 changes: 11 additions & 0 deletions fuzz/fuzz_targets/fuzz_binary.rs
@@ -62,6 +62,17 @@ fuzz_target!(|data: &[u8]| {
hash.insert(0x354eu16, "selector");
hash.insert(0x209u16, "localization");

let mut lexer = jomini::binary::Lexer::new(data);
let mut reader = jomini::binary::TokenReader::builder().buffer_len(100).build(data);

loop {
match (lexer.read_token(), reader.read()) {
(Ok(t1), Ok(t2)) => assert_eq!(t1, t2),
(Err(e1), Err(e2)) => { break; }
(x, y) => panic!("{:?} {:?}", x, y),
}
}

let mut utape = jomini::BinaryTape::default();
let ures =
jomini::binary::BinaryTapeParser.parse_slice_into_tape_unoptimized(&data, &mut utape);
11 changes: 11 additions & 0 deletions fuzz/fuzz_targets/fuzz_text.rs
@@ -98,6 +98,17 @@ where
}

fuzz_target!(|data: &[u8]| {
let mut reader = jomini::text::TokenReader::new(data);
let mut i = 0;
while let Ok(Some(x)) = reader.next() {
if matches!(x, jomini::text::Token::Open) {
i += 1;
if i % 2 == 1 {
let _ = reader.skip_container();
}
}
}

let _: Result<Meta, _> = jomini::TextTape::from_slice(&data).and_then(|tape| {
let tokens = tape.tokens();
for (i, token) in tokens.iter().enumerate() {
