
Conversation

whyitfor (Contributor) commented Oct 23, 2025

  • I have reviewed the OFRAK contributor guide and attest that this pull request is in accordance with it.
  • I have made or updated a changelog entry for the changes in this pull request.

One sentence summary of this PR (This should go in the CHANGELOG!)
Add LZ4 compression format unpackers and packers with support for all frame types (modern, legacy, skippable)

Link to Related Issue(s)
N/A.

Please describe the changes in your request.
LZ4 components.

Lz4Unpacker currently supports unpacking the modern LZ4 frame format (Lz4ModernData),
the legacy format (Lz4LegacyData), and skippable frames (Lz4SkippableData).

Lz4Packer supports repacking the modern LZ4 format (Lz4ModernData), matching block/checksum
information extracted during unpacking. Compression level can be specified via config.

Lz4LegacyPacker supports repacking the legacy LZ4 format (Lz4LegacyData); the compression
level (default/fast/high modes) can be specified via config.
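
For orientation, here is a minimal sketch of the unpack-patch-repack flow these components enable. The resource calls mirror the tests discussed below; the file name and patch offset are illustrative, and the import paths are from memory rather than from this PR:

```python
from ofrak import OFRAKContext
from ofrak_type.range import Range

async def patch_lz4(ofrak_context: OFRAKContext) -> bytes:
    # Unpacking tags the resource (e.g. Lz4ModernData) and exposes the
    # decompressed payload as a single child resource
    resource = await ofrak_context.create_root_resource_from_file("firmware.lz4")
    await resource.unpack()

    child = await resource.get_only_child()
    child.queue_patch(Range.from_size(0, 5), b"OFRAK")
    await child.save()

    # Packing recompresses the child, reusing the block/checksum information
    # recorded at unpack time; a compression level can be supplied via config
    await resource.pack()
    return await resource.get_data()
```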

Anyone you think should look at this, specifically?
@rbs-jacob

whyitfor and others added 2 commits October 26, 2025 19:22
Co-authored-by: Jacob Strieb <99368685+rbs-jacob@users.noreply.github.com>
whyitfor requested a review from rbs-jacob October 30, 2025 16:33
rbs-afflitto (Collaborator) commented Nov 4, 2025

I'm attempting to unpack an LZ4 (legacy) compressed kernel with this unpacker, but I'm getting a runtime error:

Invalid LZ4 legacy format: header says 4623489 bytes but found 17143553

Using the standard lz4 tool (v1.9.4), the image unpacks. I wonder if this is a bug in the Python lz4 module.

17143561 seems to be the file size of the lz4 file.

rbs-afflitto (Collaborator) commented:

I pushed a fix for this. The LZ4 legacy format uses a block size of 8 MB, but the original packer/unpacker did not split the data into blocks at all. I also updated the tests and added new assets larger than 8 MB to cover this failure.
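
For reference, a minimal sketch of the block-by-block handling the fix implies; the function name and loop structure are illustrative rather than the actual patch, while the 8 MB constant and the lz4.block.decompress call mirror the snippets reviewed below:

```python
import lz4.block

LZ4_LEGACY_MAGIC = 0x184C2102        # legacy frame magic number
LEGACY_BLOCK_SIZE = 8 * 1024 * 1024  # legacy frames use 8 MB uncompressed blocks

def decompress_legacy(data: bytes) -> bytes:
    assert int.from_bytes(data[:4], "little") == LZ4_LEGACY_MAGIC
    offset, decompressed = 4, bytearray()
    while offset + 4 <= len(data):
        # Each block is prefixed with its 4-byte little-endian compressed size
        block_size = int.from_bytes(data[offset : offset + 4], "little")
        offset += 4
        compressed_block = data[offset : offset + block_size]
        offset += block_size
        # The final block may decompress to less than 8 MB; as in the unpacker
        # snippet below, uncompressed_size bounds the output buffer size
        decompressed += lz4.block.decompress(
            compressed_block, uncompressed_size=LEGACY_BLOCK_SIZE
        )
    return bytes(decompressed)
```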

rbs-jacob marked this pull request as ready for review November 6, 2025 17:26
rbs-jacob (Member) left a comment:

Most of these are fairly minor, and several are completely optional.

Comment on lines +76 to +109
# Read the original content
initial_data = test_case.input_file.read_bytes()

modification = b"OFRAK"

# Create resource and unpack
resource = await ofrak_context.create_root_resource_from_file(test_case.test_file)
await resource.unpack()

# Verify it has the expected tag
assert resource.has_tag(Lz4Data)

# Get the child and verify initial content
child = await resource.get_only_child()
child_data = await child.get_data()
assert child_data == initial_data

# Modify the data
child.queue_patch(Range.from_size(0, len(modification)), modification)
await child.save()

# Pack it back
await resource.pack()

# Verify the repacked data by unpacking it again
repacked_data = await resource.get_data()
verify_resource = await ofrak_context.create_root_resource(
    "repacked_test.lz4", data=repacked_data
)
await verify_resource.unpack()

verify_child = await verify_resource.get_only_child()
verified_data = await verify_child.get_data()
assert verified_data.startswith(modification)
rbs-jacob (Member):

Suggested change (replacing the snippet above with):

resource = await ofrak_context.create_root_resource_from_file(test_case.test_file)
await resource.unpack()
assert resource.has_tag(Lz4Data)
initial_data = test_case.input_file.read_bytes()
child = await resource.get_only_child()
child_data = await child.get_data()
assert child_data == initial_data
modification = b"OFRAK"
child.queue_patch(Range.from_size(0, len(modification)), modification)
await child.save()
await resource.pack()
repacked_data = await resource.get_data()
verify_resource = await ofrak_context.create_root_resource(
    "repacked_test.lz4", data=repacked_data
)
await verify_resource.unpack()
verify_child = await verify_resource.get_only_child()
verified_data = await verify_child.get_data()
assert verified_data.startswith(modification)

In my opinion, these comments just add noise. This code is pretty self-explanatory and doesn't need them. Explaining that resource.pack means "pack it back" is actively harmful to code readability.

I only bring this up to get ahead of it, since I imagine there will be increasing LLM assistance when making OFRAK contributions. We will want to agree on whether comments like these are appropriate or not in order to more effectively use LLMs, and ensure they don't delay code review.

Not sure if there is a Claude setting to get it to chill out with the comments? I've heard they help it "think through" writing code, and do a better job. If that's the case, maybe an additional step at the end to go through and automatically clean up/remove many of them, specified in a Claude rules file somewhere?

There is a direct suggestion in this comment as an example of how much more concise the code could be, with the added benefit of logical grouping for the expressions. But the general comment applies to most of these test functions.

I'd love to find a way to solve this and/or add it to the contributor guidelines so I don't have to point it out in nitpicky comments like this one!

Comment on lines +2 to +3
Test the functionality of the LZ4 component, including unpacking,
modifying, and repacking LZ4-compressed data.
rbs-jacob (Member):

I don't have issues with the tests in this file, but I'm curious why you didn't use the unpack-modify-pack test pattern (or other test patterns)?
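
(For readers outside the project: the unpack-modify-pack pattern referred to here is a reusable test scaffold. The sketch below is a generic illustration of that shape, with hypothetical class and method names, not OFRAK's actual helper:)

```python
import abc

class UnpackModifyPackPattern(abc.ABC):
    """Hypothetical scaffold: subclasses supply the format-specific steps."""

    async def test_unpack_modify_pack(self, ofrak_context):
        resource = await self.create_root_resource(ofrak_context)
        await resource.unpack()
        await self.modify(resource)
        await resource.pack()
        await self.verify(resource)

    @abc.abstractmethod
    async def create_root_resource(self, ofrak_context):
        ...

    @abc.abstractmethod
    async def modify(self, resource):
        ...

    @abc.abstractmethod
    async def verify(self, resource):
        ...
```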

# LZ4 legacy block size (uncompressed) is 8 MB (see https://github.com/lz4/lz4/blob/67a385a170d2dc331a25677e0d20d96eef0450c5/programs/lz4io.c#L86)
decompressed_data += lz4.block.decompress(
    compressed_block,
    uncompressed_size=8 * (1 << 20),
rbs-jacob (Member):

Suggested change:
- uncompressed_size=8 * (1 << 20),
+ uncompressed_size=8 * 1024 * 1024,

Sometimes I find it clearer to express MB in terms of KB squared. Feel free to ignore this – just a matter of taste.

rbs-afflitto (Collaborator):

For context, I chose to use the same syntax that the lz4 repo uses here. I agree it's less intuitive, but I wanted to ensure it matches the reference implementation.

Compression level can be specified via config (default: 0).
"""

targets = (Lz4ModernData,)
rbs-jacob (Member):

Is it intentional that none of these target and pack the skippable data?

whyitfor (Contributor Author):

Yes. There doesn't seem to be a good use case for now to actually pack skippable data; once we need this, we can add it in.
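
For context, the skippable frame layout is simple enough that a packer could be added later with little effort: a magic number in the range 0x184D2A50 through 0x184D2A5F, a 4-byte little-endian size field, then that many bytes of opaque data. A minimal parse sketch (function name illustrative):

```python
def parse_skippable_frame(data: bytes) -> bytes:
    # Skippable frame magic numbers span 0x184D2A50 through 0x184D2A5F
    magic = int.from_bytes(data[:4], "little")
    assert 0x184D2A50 <= magic <= 0x184D2A5F
    # 4-byte little-endian size of the opaque user data that follows
    size = int.from_bytes(data[4:8], "little")
    return data[8 : 8 + size]
```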

child_data = await lz4_child.get_data()

# LZ4 legacy format uses 8 MB blocks (see https://github.com/lz4/lz4/blob/67a385a170d2dc331a25677e0d20d96eef0450c5/programs/lz4io.c#L86)
LEGACY_BLOCK_SIZE = 8 * (1 << 20) # 8 MB
rbs-jacob (Member):

Suggested change:
- LEGACY_BLOCK_SIZE = 8 * (1 << 20) # 8 MB
+ LEGACY_BLOCK_SIZE = 8 * 1024 * 1024 # 8 MB

Same comment as above about using KB squared. Once again, feel free to ignore.


# Append block size + compressed block data
compressed_block_size = len(compressed_block)
lz4_compressed += compressed_block_size.to_bytes(4, "little") + compressed_block
rbs-jacob (Member):

Have you tried running this for moderately large files? Does it run slowly? I've had speedups from appending byte strings to a list, and then concatenating the list in one go. Curious if that would apply here.
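
A sketch of the suggested pattern: accumulate the per-block byte strings in a list and concatenate once, avoiding the repeated reallocation of bytes += in a loop. The input list here is a hypothetical stand-in for the packer's block loop:

```python
# Hypothetical stand-in for the packer's already-compressed blocks
compressed_blocks = [b"\x00" * 10, b"\x01" * 20]

chunks = []
for compressed_block in compressed_blocks:
    # 4-byte little-endian compressed size, then the block itself
    chunks.append(len(compressed_block).to_bytes(4, "little"))
    chunks.append(compressed_block)
lz4_compressed = b"".join(chunks)  # single concatenation instead of repeated +=
```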

whyitfor (Contributor Author):

I agree. @rbs-afflitto, I'll fix this for you.
