-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce incremental read and deserialize for text & binary #139
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
nickbabcock
added a commit
that referenced
this pull request
Apr 2, 2024
I look to have accidentally removed them as part of #139
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Previously jomini APIs required all data to be in memory.
This is less than ideal when working with large save files (100 MB+).
This PR introduces a host of APIs that work off
Read
implementations.For example, here is how we could find the binary max nesting depth from stdin:
APIs for deserialization are similar and are denoted with
from_reader
Results when deserializing EU4 saves on x86
The reason why binary saves saw a smaller impact is that the binary deserializer was already smart enough to incrementally parse the buffered data instead of needing to parse it to a tape first like the text deserializer. The reduction in memory usage is expected to be even more pronounced (90%+) for other games with smaller models.
It is a shame that using the binary incremental deserialization API came with a small performance cost for EU4 saves. I believe part of this is due how DEFLATE works. I previously wrote an article on how inflating from a byte slice was 20% faster than from a stream. The inflate stream implementation has a lot more
memcpy
due to it needing "access to 32KiB of the previously decompressed data" that it needs to juggle. In the end, a 50% reduction in peak memory seemed worth it.In the docs, incremental text APIs are marked as experimental as they use a different parsing algorithm that is geared more towards save files. I have not yet fleshed out ergonomic equivalents for more esoteric game syntax (like parameter definitions). Game files can still be parsed with the experimental APIs, but these APIs may change in the future based on feedback.
As part of this PR, the incrementally deserializing binary implementation has been promoted to handle all buffered data, and no longer will a
from_slice
function parse the data to a tape in an intermediate step.The new incremental APIs are not totally symmetrical. The binary format has a
Lexer
that is a zero cost scanner over a slice of data. There is no such equivalent for the text data as it I don't think it is conducive to being used to construct the higher level reading abstractions with good performance. It is no problem for the binary implementations to start reading a string again if it crosses a read boundary, as string operations are independent of their length. Contrast that with text, where if one starts over, each byte of the string would be looked at again (and same goes for any whitespace). This can be worked around by owning data or communicating state, but this complication doesn't seem worth it over bundling everything inside aTokenReader
that keeps state local and returns borrowed data.There are some aspects of the resulting API I don't love:
TextToken
andtext::Token
). I don't like how they have the same semantic name. In hindsight, I wish I named the elements of a tape something likeTapeNode
(simdjson calles themtape_ref
). This is something I may consider if 1.0 will ever be reached.TokenReader
is the best name or ifBinaryReader
andTextReader
are more appropriate.Future plans:
Closes #135