feat(integration-tests): Extract tarballs with tar to strip top-level directory; Fix path cleanup utilities. #1661
Sequence Diagram

```mermaid
sequenceDiagram
    participant Test as Test Code
    participant Func as _download_and_extract_dataset
    participant Curl as curl
    participant Tar as tar
    participant FS as Filesystem

    Test->>Func: Call with dataset name & URL

    rect rgb(200, 220, 240)
        Note over Func: Old Flow (shutil)
        Func->>Curl: Download tarball
        Curl-->>Func: Save file
        Func->>FS: shutil.unpack_archive()
        FS-->>Func: Extract complete
    end

    rect rgb(220, 240, 200)
        Note over Func: New Flow (curl + tar)
        Func->>Curl: Download tarball
        Curl-->>Func: Save file
        Func->>FS: Apply permissions (gu+w)
        Func->>Func: Locate tar executable
        Func->>Tar: tar -xf with -C target dir
        Tar-->>Func: Extract complete
        alt has_leading_directory_component = true
            Func->>FS: Strip leading directory
            FS-->>Func: Done
        end
    end

    Func-->>Test: Return IntegrationTestLogs
```
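The new flow above can be sketched roughly as follows. This is an illustrative reconstruction from the diagram, not the PR's exact code: the function name, parameter list, and error handling here are assumptions.

```python
import shutil
import subprocess
from pathlib import Path


def download_and_extract(
    tarball_url: str,
    tarball_path: Path,
    extraction_dir: Path,
    has_leading_directory_component: bool = True,
) -> None:
    """Download a tarball with curl and extract it with tar, optionally
    stripping the single top-level directory inside the archive."""
    curl_bin = shutil.which("curl")
    tar_bin = shutil.which("tar")
    if curl_bin is None or tar_bin is None:
        raise RuntimeError("curl and tar must be available on PATH")

    # --fail turns HTTP errors into a non-zero exit code; --location follows redirects.
    subprocess.run(
        [curl_bin, "--fail", "--location", "--output", str(tarball_path), tarball_url],
        check=True,
    )

    extraction_dir.mkdir(parents=True, exist_ok=True)
    tar_cmd = [tar_bin, "-xf", str(tarball_path), "-C", str(extraction_dir)]
    if has_leading_directory_component:
        # Drop the archive's single top-level directory so files land
        # directly in extraction_dir.
        tar_cmd.append("--strip-components=1")
    subprocess.run(tar_cmd, check=True)
```

`--strip-components` is supported by both GNU tar and the BSD tar shipped with macOS, which is what keeps the flow portable across the two platforms.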
Actionable comments posted: 2
📒 Files selected for processing (1)
integration-tests/tests/fixtures/integration_test_logs.py (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: jackluo923
Repo: y-scope/clp PR: 1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:27-32
Timestamp: 2025-07-07T17:41:15.655Z
Learning: In CLP installation scripts, consistency across platform scripts is prioritized over defensive programming improvements. For example, when extracting Task binaries with tar in `install-prebuilt-packages.sh`, the extraction pattern should remain consistent with other platform scripts rather than adding defensive flags like `--strip-components=1` to handle potential tarball layout changes.
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: integration-tests/tests/utils/config.py:121-123
Timestamp: 2025-08-20T22:07:04.953Z
Learning: For the CLP integration tests codebase, do not suggest deriving tarball filenames from URLs instead of hard-coding ".tar.gz" extension. The user has explicitly rejected this suggestion.
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: integration-tests/tests/fixtures/integration_test_logs.py:54-56
Timestamp: 2025-08-17T16:10:38.722Z
Learning: For PR #1100 (feat(integration-tests): Add CLP package integration tests boilerplate), do not raise cache weakness problems related to the pytest cache implementation in the integration test logs fixtures.
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1050
File: components/clp-package-utils/clp_package_utils/scripts/archive_manager.py:178-178
Timestamp: 2025-08-14T05:06:59.010Z
Learning: In the decompress.py script, the --dataset argument was only added to the json_extraction_parser but not to the ir_extraction_parser, causing AttributeError when handle_extract_stream_cmd tries to access parsed_args.dataset for IR extraction commands.
🧬 Code graph analysis (1)
integration-tests/tests/fixtures/integration_test_logs.py (2)
- integration-tests/tests/utils/config.py (1): `IntegrationTestLogs` (104-128)
- integration-tests/tests/utils/utils.py (1): `unlink` (65-86)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: package-image
- GitHub Check: lint-check (macos-15)
- GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (1)
integration-tests/tests/fixtures/integration_test_logs.py (1)
83-112: curl/tar/chmod pipeline and `has_leading_directory_component` flag look solid

The explicit curl + tar flow with `--fail`/`--location`, `-C` into the extraction dir, and the optional `--strip-components 1` governed by `has_leading_directory_component` is clear and should behave well on both macOS and Linux. The post-extraction `chmod -R gu+w` on both the tarball and extraction directory aligns with the goal of making cached datasets easily overwritable between runs.

Also applies to: 118-120
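To make the stripping behaviour concrete, here is a small self-contained demo. It uses pure-Python `tarfile` to stay platform-independent, whereas the PR itself shells out to `tar`, so treat it only as an illustration of the layout difference, not the fixture's actual code.

```python
import tarfile
import tempfile
from pathlib import Path

tmp = Path(tempfile.mkdtemp())

# Build a tarball whose members all sit under one top-level directory.
src = tmp / "postgresql"
src.mkdir()
(src / "logs.txt").write_text("sample log line\n")
archive = tmp / "postgresql.tar.gz"
with tarfile.open(archive, "w:gz") as tf:
    tf.add(src, arcname="postgresql")

# Extract while dropping the leading component, like `tar --strip-components=1`.
dest = tmp / "extracted"
dest.mkdir()
with tarfile.open(archive) as tf:
    for member in tf.getmembers():
        head, _, tail = member.name.partition("/")
        if not tail:
            continue  # skip the top-level directory entry itself
        member.name = tail
        tf.extract(member, dest)

# Files now land directly under dest, with no extra "postgresql/" nesting.
```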
```python
integration_test_logs.tarball_path.unlink(missing_ok=True)
shutil.rmtree(integration_test_logs.extraction_dir)
integration_test_logs.extraction_dir.mkdir(parents=True, exist_ok=False)
```
`shutil.rmtree` will raise on first run when the extraction directory does not exist

Calling `shutil.rmtree(integration_test_logs.extraction_dir)` unconditionally will raise a `FileNotFoundError` on the very first run (before the directory has ever been created), causing the fixture to fail before any download/extraction happens. The earlier helper-based cleanup was tolerant of missing paths.
```diff
-integration_test_logs.tarball_path.unlink(missing_ok=True)
-shutil.rmtree(integration_test_logs.extraction_dir)
-integration_test_logs.extraction_dir.mkdir(parents=True, exist_ok=False)
+integration_test_logs.tarball_path.unlink(missing_ok=True)
+try:
+    shutil.rmtree(integration_test_logs.extraction_dir)
+except FileNotFoundError:
+    pass
+integration_test_logs.extraction_dir.mkdir(parents=True, exist_ok=False)
```

If you still want the previous "force remove with sudo" behaviour for stubborn paths, you could instead wire this back through `tests.utils.utils.unlink`, but the above is the minimal fix for the first-run failure.
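The failure mode is easy to reproduce in isolation. The snippet below also shows `ignore_errors=True`, a one-line alternative to the try/except:

```python
import shutil
import tempfile
from pathlib import Path

missing = Path(tempfile.mkdtemp()) / "never-created"

# Unconditional rmtree raises FileNotFoundError when the path does not exist.
try:
    shutil.rmtree(missing)
    raised = False
except FileNotFoundError:
    raised = True
print(raised)  # True

# ignore_errors=True tolerates the missing directory, so a first run succeeds.
shutil.rmtree(missing, ignore_errors=True)
missing.mkdir(parents=True, exist_ok=False)  # safe to recreate afterwards
```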
This is correct, I did get an error when I ran this directly after a fresh build. The rabbit's committable suggestion works.
@quinntaylormitchell, thank you for confirming the issue and validating the fix! I appreciate you testing it on a fresh build. The try/except approach should ensure the fixture works reliably on first run while still cleaning up stale directories on subsequent runs.
resolved by using the new helper.
quinntaylormitchell
left a comment
Looks mostly good; a few comments.
Also, I would write a little blurb in the "Validation performed" section.
```diff
-"param name:
-:param tarball_url:
-:param has_leading_directory_component: Whether all files inside the tarball are stored under a
-    single top level directory. Defaults to True.
-:return: The IntegrationTestPathConfig object with its associated logs properly downloaded,
-    extracted, and permission changed to be overritable.
-"""
+:param name:
+:param tarball_url:
+:param has_leading_directory_component: Whether all files inside the tarball are stored under a
+    single top-level directory.
+:return: The IntegrationTestLogs object describing the downloaded and extracted logs.
+"""
```
clarity and typo fixes.
revamped the docstring for clarity.
```python
integration_test_logs.tarball_path.unlink(missing_ok=True)
shutil.rmtree(integration_test_logs.extraction_dir)
integration_test_logs.extraction_dir.mkdir(parents=True, exist_ok=False)
```
This is correct, I did get an error when I ran this directly after a fresh build. The rabbit's committable suggestion works.
```python
integration_test_logs.tarball_path.unlink(missing_ok=True)
shutil.rmtree(integration_test_logs.extraction_dir)
integration_test_logs.extraction_dir.mkdir(parents=True, exist_ok=False)
```
A couple things about these three lines (in addition to the rabbit comment, which is correct):

- I remember that in `tests.utils.utils`, you define a function called `unlink`; you should use it. That being said, I've realized that function doesn't actually do what it says in the docstring. It says `Remove a file or directory at path.`, but it only uses `shutil.rmtree`, which doesn't remove files (afaik). I think that we should do either one of two things: A. we should update that function to something like `def remove_file_or_dir`, make it do what the docstring says, and then use it everywhere in our integration tests, or B. we should remove that function entirely and use only a mixture of `shutil.rmtree` for directories and `os.unlink` for files.
- Why don't we make `extraction_dir` in the `IntegrationTestLogs` `__post_init__`?
- Discussed in other posts.
- I think `_download_and_extract_dataset` is a giant post-init setup for `IntegrationTestLogs`, so we don't really need to put any setup into the actual `__post_init__`.
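Option A could look roughly like this. The name `remove_file_or_dir` comes from the comment above; the symlink handling is an extra assumption on my part, prompted by the broken-symlink discussion elsewhere in this review:

```python
import shutil
from pathlib import Path


def remove_file_or_dir(path: Path) -> None:
    """Remove a file, symlink, or directory at `path`; do nothing if it is absent."""
    # Check is_symlink() first: it is True even for broken symlinks, and we
    # must unlink the link itself rather than rmtree through its target.
    if path.is_symlink() or path.is_file():
        path.unlink()
    elif path.is_dir():
        shutil.rmtree(path)
```

Because every branch is a no-op when the path is absent, callers get the missing-path tolerance the fixture needs on first run for free.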
quinntaylormitchell
left a comment
looking mostly good; a few nits but that's all
```python
if not path_to_remove.exists():
    return
```
coderabbit left a comment on my implementation of this in #1801, not sure if you want to address it or not:
"Broken symlinks will not be removed.
Path.exists() follows symlinks and returns False for broken symlinks, causing the function to return early without removing them. This could leave stale symlinks in test directories."
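The behaviour CodeRabbit describes is easy to verify: `Path.exists()` follows symlinks, so a dangling link looks nonexistent even though the link entry is still on disk, while `Path.is_symlink()` and `os.path.lexists()` do see it:

```python
import os
import tempfile
from pathlib import Path

tmp = Path(tempfile.mkdtemp())
link = tmp / "dangling"
link.symlink_to(tmp / "no-such-target")  # target never created -> broken symlink

print(link.exists())          # False: exists() follows the (dead) link
print(link.is_symlink())      # True: the link entry itself is still there
print(os.path.lexists(link))  # True: lexists() does not follow symlinks

# An existence guard based on exists() would therefore return early and
# leave the stale symlink behind; unlink() still removes it.
link.unlink()
print(os.path.lexists(link))  # False
```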
Updated my implementation.
```python
subprocess.run([chmod_bin, "-R", "gu+w", tarball_path_str], check=True)
subprocess.run([chmod_bin, "-R", "gu+w", extract_path_str], check=True)
```
```diff
-subprocess.run([chmod_bin, "-R", "gu+w", tarball_path_str], check=True)
-subprocess.run([chmod_bin, "-R", "gu+w", extract_path_str], check=True)
+subprocess.run([chmod_bin, "gu+w", tarball_path_str], check=True)
+subprocess.run([chmod_bin, "--recursive", "gu+w", extract_path_str], check=True)
```
a couple nits:

- I don't think we need the `-R` flag in the first one because the tarball is a single file (no recursion needed).
- For the second line, change `-R` to `--recursive` for readability.
Agree with the first one. For the second one, I think macOS doesn't support a lot of long options:
https://www.unix.com/man_page/osx/1/chmod/
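Since BSD `chmod` on macOS indeed lacks most GNU long options, a portable alternative (just a sketch; the PR keeps shelling out to `chmod`, and this helper name is made up) is to set the write bits from Python:

```python
import os
import stat
from pathlib import Path


def add_user_group_write(root: Path) -> None:
    """Standard-library equivalent of `chmod -R gu+w` for a file or directory tree."""
    def _add_write(p: Path) -> None:
        p.chmod(p.stat().st_mode | stat.S_IWUSR | stat.S_IWGRP)

    _add_write(root)  # handles the single-file case (e.g. the tarball) too
    if root.is_dir():
        for dirpath, dirnames, filenames in os.walk(root):
            for name in dirnames + filenames:
                _add_write(Path(dirpath) / name)
```

This sidesteps the GNU-vs-BSD flag question entirely, at the cost of a little more code than one `subprocess.run` call.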
ohhhh ok good to know
quinntaylormitchell
left a comment
lgtm, ready for maintainer review!
junhaoliao
left a comment
mostly lgtm. no major issue found except the maintainability concern with `unlink()`. the rest of the comments are straight-forward and we should be able to merge after addressing them
Description
This PR belongs to the integration-tests feature plan described in #1657.
This change replaces the previous archive extraction workflow, which used `shutil.unpack_archive`, with an explicit `tar`-based implementation. The new approach correctly handles stripping a leading directory component when the tarball includes a top-level folder. For example, the old workflow extracted `build/integration-tests/postgresql.tar.gz` with an unnecessary extra level of nesting due to the top-level directory preserved inside the archive. Since we wish to control the extraction directory name, we explicitly remove that leading component.

Because `shutil.unpack_archive` does not support stripping directory components, switching to `tar` provides the required extraction results. The chosen extraction flags remain compatible with both macOS and Linux.

Checklist
breaking change.
Validation performed
Summary by CodeRabbit