Add MVP LZ4 component #661
base: master
Conversation
Co-authored-by: Jacob Strieb <99368685+rbs-jacob@users.noreply.github.com>
I'm attempting to unpack an LZ4 (legacy) compressed kernel with this unpacker, but I'm getting a runtime error. Using the standard lz4 tool (v1.9.4), the image unpacks. I wonder if this is a bug in the Python lz4 module.

I pushed a fix for this. The LZ4 legacy format uses a block size of 8 MB, but the original packer/unpacker did not account for a block size. I also updated the tests and added new assets greater than 8 MB to exercise this failure.
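For reference, a rough sketch of the legacy framing the fix accounts for (a hypothetical stdlib-only helper, not the PR's code): a legacy frame is the magic 0x184C2102 followed by repeated [4-byte little-endian compressed size][compressed block] records, and each block decompresses to at most 8 MiB, so the blocks carry no uncompressed-size header and the decompressor must be told that bound explicitly.

```python
import struct

LZ4_LEGACY_MAGIC = 0x184C2102  # legacy frame magic number
LEGACY_BLOCK_SIZE = 8 * (1 << 20)  # each block decompresses to at most 8 MiB

def split_legacy_blocks(frame: bytes) -> list:
    """Split a legacy LZ4 frame into its raw compressed blocks."""
    (magic,) = struct.unpack_from("<I", frame, 0)
    assert magic == LZ4_LEGACY_MAGIC, "not a legacy LZ4 frame"
    offset, blocks = 4, []
    while offset + 4 <= len(frame):
        # 4-byte little-endian compressed block size, then the block itself
        (block_size,) = struct.unpack_from("<I", frame, offset)
        offset += 4
        blocks.append(frame[offset : offset + block_size])
        offset += block_size
    return blocks
```

Each returned block would then be fed to a block-level decompressor with an explicit 8 MiB upper bound (e.g. `lz4.block.decompress(block, uncompressed_size=LEGACY_BLOCK_SIZE)`), which is exactly the parameter the original code was missing.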
rbs-jacob left a comment
Most things are fairly minor, and several are completely optional.
```python
# Read the original content
initial_data = test_case.input_file.read_bytes()

modification = b"OFRAK"

# Create resource and unpack
resource = await ofrak_context.create_root_resource_from_file(test_case.test_file)
await resource.unpack()

# Verify it has the expected tag
assert resource.has_tag(Lz4Data)

# Get the child and verify initial content
child = await resource.get_only_child()
child_data = await child.get_data()
assert child_data == initial_data

# Modify the data
child.queue_patch(Range.from_size(0, len(modification)), modification)
await child.save()

# Pack it back
await resource.pack()

# Verify the repacked data by unpacking it again
repacked_data = await resource.get_data()
verify_resource = await ofrak_context.create_root_resource(
    "repacked_test.lz4", data=repacked_data
)
await verify_resource.unpack()

verify_child = await verify_resource.get_only_child()
verified_data = await verify_child.get_data()
assert verified_data.startswith(modification)
```
Suggested change:

```python
resource = await ofrak_context.create_root_resource_from_file(test_case.test_file)
await resource.unpack()
assert resource.has_tag(Lz4Data)

initial_data = test_case.input_file.read_bytes()
child = await resource.get_only_child()
child_data = await child.get_data()
assert child_data == initial_data

modification = b"OFRAK"
child.queue_patch(Range.from_size(0, len(modification)), modification)
await child.save()
await resource.pack()

repacked_data = await resource.get_data()
verify_resource = await ofrak_context.create_root_resource(
    "repacked_test.lz4", data=repacked_data
)
await verify_resource.unpack()
verify_child = await verify_resource.get_only_child()
verified_data = await verify_child.get_data()
assert verified_data.startswith(modification)
```
In my opinion, these comments just add noise. This code is pretty self-explanatory and doesn't need them. Explaining that resource.pack means "pack it back" is actively harmful to code readability.
I only bring this up to get ahead of it, since I imagine there will be increasing LLM assistance in OFRAK contributions. We will want to agree on whether comments like these are appropriate in order to use LLMs more effectively and ensure they don't delay code review.
Not sure if there's a Claude setting to get it to chill out with the comments? I've heard they help it "think through" writing code and do a better job. If that's the case, maybe add a step at the end to automatically clean up/remove many of them, specified in a Claude rules file somewhere?
There is a direct suggestion in this comment as an example of how much more concise the code could be, with the added benefit of logically grouping the expressions. But the general comment applies to most of these test functions.
I'd love to find a way to solve this and/or add it to the contributor guidelines so I don't have to point it out in nitpicky comments like this one!
```python
Test the functionality of the LZ4 component, including unpacking,
modifying, and repacking LZ4-compressed data.
```
I don't have issues with the tests in this file, but I'm curious why you didn't use the unpack-modify-pack test pattern (or other test patterns)?
```python
# LZ4 legacy block size (uncompressed) is 8 MB (see https://github.com/lz4/lz4/blob/67a385a170d2dc331a25677e0d20d96eef0450c5/programs/lz4io.c#L86)
decompressed_data += lz4.block.decompress(
    compressed_block,
    uncompressed_size=8 * (1 << 20),
```
Suggested change:

```python
uncompressed_size=8 * 1024 * 1024,
```

Sometimes I find it clearer to express MB in terms of KB squared. Feel free to ignore this – just a matter of taste.
For context, I chose to use the same syntax that the lz4 repo uses here. I agree it's less intuitive, but I wanted to ensure it matches the reference implementation.
```python
    Compression level can be specified via config (default: 0).
    """

    targets = (Lz4ModernData,)
```
Is it intentional that none of these target and pack the skippable data?
Yes. There doesn't seem to be a good use case for now to actually pack skippable data -- once we need this we can add it in.
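For context, packing skippable data would be close to trivial anyway, because skippable frames are opaque by design: a magic number in the range 0x184D2A50–0x184D2A5F, a 4-byte little-endian content size, then that many bytes of arbitrary user data. A minimal sketch of walking past one (a hypothetical stdlib-only helper, not part of this PR):

```python
import struct

def skip_skippable_frame(data: bytes, offset: int = 0) -> int:
    """Return the offset just past a skippable frame starting at `offset`."""
    (magic,) = struct.unpack_from("<I", data, offset)
    # Skippable frame magics occupy the range 0x184D2A50..0x184D2A5F
    assert 0x184D2A50 <= magic <= 0x184D2A5F, "not a skippable frame"
    (size,) = struct.unpack_from("<I", data, offset + 4)
    return offset + 8 + size  # 4-byte magic + 4-byte size + payload
```

Repacking would just be the inverse: re-emit the magic, the size, and the stored payload bytes unchanged.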
```python
child_data = await lz4_child.get_data()

# LZ4 legacy format uses 8 MB blocks (see https://github.com/lz4/lz4/blob/67a385a170d2dc331a25677e0d20d96eef0450c5/programs/lz4io.c#L86)
LEGACY_BLOCK_SIZE = 8 * (1 << 20)  # 8 MB
```
Suggested change:

```python
LEGACY_BLOCK_SIZE = 8 * 1024 * 1024  # 8 MB
```

Same comment as above about using KB squared. Once again, feel free to ignore.
```python
# Append block size + compressed block data
compressed_block_size = len(compressed_block)
lz4_compressed += compressed_block_size.to_bytes(4, "little") + compressed_block
```
Have you tried running this for moderately large files? Does it run slowly? I've had speedups from appending byte strings to a list, and then concatenating the list in one go. Curious if that would apply here.
I agree. @rbs-afflitto I'll fix this for you.
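The pattern rbs-jacob describes avoids quadratic copying: `bytes += ...` in a loop reallocates and copies the growing buffer on every iteration, while collecting chunks in a list and joining once at the end is linear in the total size. A minimal sketch of the rewritten loop (hypothetical helper name, stdlib only):

```python
def concat_blocks(blocks):
    """Concatenate length-prefixed blocks with a single join instead of +=."""
    parts = []
    for block in blocks:
        # Same framing as the original loop: 4-byte little-endian size, then data
        parts.append(len(block).to_bytes(4, "little"))
        parts.append(block)
    return b"".join(parts)
```

A `bytearray` accumulated with `+=` is another common fix, since it grows in place rather than reallocating an immutable `bytes` object each time.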
One sentence summary of this PR (This should go in the CHANGELOG!)
Add LZ4 compression format unpackers and packers with support for all frame types (modern, legacy, skippable)
Link to Related Issue(s)
N/A.
Please describe the changes in your request.
LZ4 components.
Lz4Unpacker currently supports unpacking the modern LZ4 format (Lz4ModernData), the legacy format (Lz4LegacyData), and skippable data (Lz4SkippableData).
Lz4Packer supports repacking the modern LZ4 format (Lz4ModernData), matching the block/checksum information extracted during unpacking. Compression level can be specified via config.
Lz4LegacyPacker supports repacking the legacy LZ4 format (Lz4LegacyData) with compression level support (default/fast/high modes). Compression level can be specified via config.
Anyone you think should look at this, specifically?
@rbs-jacob