feat(integration-tests): Extract tarballs with `tar` to strip top-level directory; Fix path cleanup utilities. by Bill-hbrhbr · Pull Request #1661 · y-scope/clp

Bill-hbrhbr · 2025-11-25T20:12:33Z

Description

This PR belongs to the integration-tests feature plan described in #1657.

This change replaces the previous archive extraction workflow, which used shutil.unpack_archive, with an explicit tar-based implementation. The new approach correctly strips the leading directory component when the tarball includes a top-level folder. For example, the old workflow extracted build/integration-tests/postgresql.tar.gz into:

build/integration-tests/postgresql/postgresql/postgresql.log

which adds an unnecessary level of nesting because the top-level directory inside the archive is preserved. Since we want to control the extraction directory name, we explicitly remove that leading component. The extraction path becomes:

build/integration-tests/postgresql/postgresql.log

Because shutil.unpack_archive does not support stripping directory components, switching to tar provides the required extraction results. The chosen extraction flags remain compatible with both macOS and Linux.
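
A minimal sketch of the extraction step described above, assuming a `tar` executable on the PATH that accepts `-C` and `--strip-components` (GNU tar and bsdtar both do). The function name `extract_tarball` and its signature are illustrative and are not taken from the PR's code:

```python
import shutil
import subprocess
from pathlib import Path


def extract_tarball(
    tarball_path: Path,
    extraction_dir: Path,
    has_leading_directory_component: bool = True,
) -> None:
    """Extract a tarball into extraction_dir, optionally dropping its top-level folder."""
    tar_bin = shutil.which("tar")
    if tar_bin is None:
        raise RuntimeError("tar executable not found")

    extraction_dir.mkdir(parents=True, exist_ok=True)
    cmd = [tar_bin, "-xf", str(tarball_path), "-C", str(extraction_dir)]
    if has_leading_directory_component:
        # Drop the first path component of every archive member so that
        # postgresql/postgresql.log lands at <extraction_dir>/postgresql.log.
        cmd.extend(["--strip-components", "1"])
    subprocess.run(cmd, check=True)
```

With the flag left at its default, the archive's own top-level folder name no longer matters; the caller alone decides the name of the extraction directory.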

Checklist

- The PR satisfies the contribution guidelines.
- Necessary docs have been updated, OR no docs need to be updated.
- This is a breaking change and that has been indicated in the PR title, OR this isn't a breaking change.

Validation performed

Summary by CodeRabbit

Chores
- Enhanced internal testing infrastructure with improved reliability and robustness. Modernized dataset extraction workflows with better error handling mechanisms. Implemented flexible options for directory component management during extraction. Updated permission handling to ensure consistent access across extraction operations.

To see the specific tasks where the Asana app for GitHub is being used, see below:
- https://app.asana.com/0/0/1212888482202074

coderabbitai · 2025-11-25T20:12:44Z

Walkthrough

Modified _download_and_extract_dataset function to replace shutil.unpack_archive with curl and tar-based extraction. Added has_leading_directory_component parameter to optionally strip leading directory components. Updated permission handling for both tarball and extraction directory, with added tar executable validation and per-step error handling.

Changes

Cohort / File(s)	Summary
Download and Extract Dataset Function `integration-tests/tests/fixtures/integration_test_logs.py`	Replaced `shutil.unpack_archive`-based extraction with curl and tar subprocess calls; added `has_leading_directory_component` parameter (defaults `True`); updated permission handling to apply `gu+w` to both tarball and extraction directory; added tar executable lookup and per-step error handling; added function docstring documenting new parameter behaviour.

Sequence Diagram

sequenceDiagram
    participant Test as Test Code
    participant Func as _download_and_extract_dataset
    participant Curl as curl
    participant Tar as tar
    participant FS as Filesystem

    Test->>Func: Call with dataset name & URL
    
    rect rgb(200, 220, 240)
    Note over Func: Old Flow (shutil)
    Func->>Curl: Download tarball
    Curl-->>Func: Save file
    Func->>FS: shutil.unpack_archive()
    FS-->>Func: Extract complete
    end
    
    rect rgb(220, 240, 200)
    Note over Func: New Flow (curl + tar)
    Func->>Curl: Download tarball
    Curl-->>Func: Save file
    Func->>FS: Apply permissions (gu+w)
    Func->>Func: Locate tar executable
    Func->>Tar: tar -xf with -C target dir
    Tar-->>Func: Extract complete
    alt has_leading_directory_component = true
        Func->>FS: Strip leading directory
        FS-->>Func: Done
    end
    end
    
    Func-->>Test: Return IntegrationTestLogs

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Subprocess error handling: Verify tar executable validation and per-step error handling for curl and tar commands
Permission handling changes: Confirm that applying gu+w to both tarball and extraction directory is intentional and secure
Directory stripping logic: Review conditional logic for has_leading_directory_component parameter and edge cases
Backward compatibility: Validate that defaulting has_leading_directory_component to True maintains existing behaviour for current callers

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the main changes: switching to tar-based extraction and fixing path cleanup utilities, which aligns with the file modifications in the changeset.


coderabbitai

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 21b8da6 and 3ad919f.

📒 Files selected for processing (1)

integration-tests/tests/fixtures/integration_test_logs.py (2 hunks)

🧰 Additional context used

🧠 Learnings (1)

📓 Common learnings

Learnt from: jackluo923
Repo: y-scope/clp PR: 1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:27-32
Timestamp: 2025-07-07T17:41:15.655Z
Learning: In CLP installation scripts, consistency across platform scripts is prioritized over defensive programming improvements. For example, when extracting Task binaries with tar in `install-prebuilt-packages.sh`, the extraction pattern should remain consistent with other platform scripts rather than adding defensive flags like `--strip-components=1` to handle potential tarball layout changes.

Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: integration-tests/tests/utils/config.py:121-123
Timestamp: 2025-08-20T22:07:04.953Z
Learning: For the CLP integration tests codebase, do not suggest deriving tarball filenames from URLs instead of hard-coding ".tar.gz" extension. The user has explicitly rejected this suggestion.

Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: integration-tests/tests/fixtures/integration_test_logs.py:54-56
Timestamp: 2025-08-17T16:10:38.722Z
Learning: For PR #1100 (feat(integration-tests): Add CLP package integration tests boilerplate), do not raise cache weakness problems related to the pytest cache implementation in the integration test logs fixtures.

Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Learnt from: junhaoliao
Repo: y-scope/clp PR: 1050
File: components/clp-package-utils/clp_package_utils/scripts/archive_manager.py:178-178
Timestamp: 2025-08-14T05:06:59.010Z
Learning: In the decompress.py script, the --dataset argument was only added to the json_extraction_parser but not to the ir_extraction_parser, causing AttributeError when handle_extract_stream_cmd tries to access parsed_args.dataset for IR extraction commands.

🧬 Code graph analysis (1)

integration-tests/tests/fixtures/integration_test_logs.py (2)

integration-tests/tests/utils/config.py (1)

IntegrationTestLogs (104-128)

integration-tests/tests/utils/utils.py (1)

unlink (65-86)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)

GitHub Check: package-image
GitHub Check: lint-check (macos-15)
GitHub Check: lint-check (ubuntu-24.04)

🔇 Additional comments (1)

integration-tests/tests/fixtures/integration_test_logs.py (1)

83-112: curl/tar/chmod pipeline and has_leading_directory_component flag look solid

The explicit curl+tar flow with --fail/--location, -C into the extraction dir, and the optional --strip-components 1 governed by has_leading_directory_component is clear and should behave well on both macOS and Linux. The post-extraction chmod -R gu+w on both the tarball and extraction directory aligns with the goal of making cached datasets easily overwritable between runs.

Also applies to: 118-120
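
As a rough illustration of the download step mentioned in this comment, the sketch below only builds the curl argument list. `--fail`, `--location`, and `--output` are real curl options; the helper name `build_curl_command` and the example URL are hypothetical and do not come from the PR:

```python
from pathlib import Path


def build_curl_command(url: str, tarball_path: Path, curl_bin: str = "curl") -> list[str]:
    """Build a curl invocation that saves `url` to `tarball_path`."""
    return [
        curl_bin,
        "--fail",  # exit non-zero on HTTP error responses
        "--location",  # follow redirects
        "--output",
        str(tarball_path),
        url,
    ]


# Placeholder URL for illustration only.
cmd = build_curl_command("https://example.com/dataset.tar.gz", Path("dataset.tar.gz"))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` so that a failed download raises `CalledProcessError` before any extraction is attempted.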

coderabbitai · 2025-11-25T20:15:40Z

integration-tests/tests/fixtures/integration_test_logs.py

+    integration_test_logs.tarball_path.unlink(missing_ok=True)
+    shutil.rmtree(integration_test_logs.extraction_dir)
+    integration_test_logs.extraction_dir.mkdir(parents=True, exist_ok=False)


⚠️ Potential issue | 🔴 Critical

shutil.rmtree will raise on first run when the extraction directory does not exist

Calling shutil.rmtree(integration_test_logs.extraction_dir) unconditionally will raise a FileNotFoundError on the very first run (before the directory has ever been created), causing the fixture to fail before any download/extraction happens. The earlier helper-based cleanup was tolerant of missing paths.

-    integration_test_logs.tarball_path.unlink(missing_ok=True)
-    shutil.rmtree(integration_test_logs.extraction_dir)
-    integration_test_logs.extraction_dir.mkdir(parents=True, exist_ok=False)
+    integration_test_logs.tarball_path.unlink(missing_ok=True)
+    try:
+        shutil.rmtree(integration_test_logs.extraction_dir)
+    except FileNotFoundError:
+        pass
+    integration_test_logs.extraction_dir.mkdir(parents=True, exist_ok=False)

If you still want the previous “force remove with sudo” behaviour for stubborn paths, you could instead wire this back through tests.utils.utils.unlink, but the above is the minimal fix for the first-run failure.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-    integration_test_logs.tarball_path.unlink(missing_ok=True)
-    shutil.rmtree(integration_test_logs.extraction_dir)
-    integration_test_logs.extraction_dir.mkdir(parents=True, exist_ok=False)
+    integration_test_logs.tarball_path.unlink(missing_ok=True)
+    try:
+        shutil.rmtree(integration_test_logs.extraction_dir)
+    except FileNotFoundError:
+        pass
+    integration_test_logs.extraction_dir.mkdir(parents=True, exist_ok=False)

🤖 Prompt for AI Agents

In integration-tests/tests/fixtures/integration_test_logs.py around lines 71 to 73, calling shutil.rmtree(integration_test_logs.extraction_dir) will raise FileNotFoundError on first run; update the cleanup to tolerate missing directories by either checking existence before removal or calling shutil.rmtree(integration_test_logs.extraction_dir, ignore_errors=True) so the fixture no longer fails when the extraction dir doesn't exist.

This is correct, I did get an error when I ran this directly after a fresh build. The rabbit's committable suggestion works.

@quinntaylormitchell, thank you for confirming the issue and validating the fix! I appreciate you testing it on a fresh build. The try/except approach should ensure the fixture works reliably on first run while still cleaning up stale directories on subsequent runs.

resolved by using the new helper.

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

quinntaylormitchell

Looks mostly good; a few comments.

Also, I would write a little blurb in the "Validation performed" section.

quinntaylormitchell · 2025-11-26T15:34:29Z

integration-tests/tests/fixtures/integration_test_logs.py

+    "param name:
+    :param tarball_url:
+    :param has_leading_directory_component: Whether all files inside the tarball are stored under a
+    single top level directory. Defaults to True.
+    :return: The IntegrationTestPathConfig object with its associated logs properly downloaded,
+    extracted, and permission changed to be overritable.
+    """


Suggested change

-    "param name:
-    :param tarball_url:
-    :param has_leading_directory_component: Whether all files inside the tarball are stored under a
-    single top level directory. Defaults to True.
-    :return: The IntegrationTestPathConfig object with its associated logs properly downloaded,
-    extracted, and permission changed to be overritable.
-    """
+    :param name:
+    :param tarball_url:
+    :param has_leading_directory_component: Whether all files inside the tarball are stored under a
+    single top-level directory.
+    :return: The IntegrationTestLogs object describing the downloaded and extracted logs.
+    """

clarity and typo fixes.

revamped the docstring for clarity.

quinntaylormitchell · 2025-11-26T15:52:24Z

integration-tests/tests/fixtures/integration_test_logs.py

+    integration_test_logs.tarball_path.unlink(missing_ok=True)
+    shutil.rmtree(integration_test_logs.extraction_dir)
+    integration_test_logs.extraction_dir.mkdir(parents=True, exist_ok=False)


This is correct, I did get an error when I ran this directly after a fresh build. The rabbit's committable suggestion works.

quinntaylormitchell · 2025-11-26T16:06:50Z

integration-tests/tests/fixtures/integration_test_logs.py

+    integration_test_logs.tarball_path.unlink(missing_ok=True)
+    shutil.rmtree(integration_test_logs.extraction_dir)
+    integration_test_logs.extraction_dir.mkdir(parents=True, exist_ok=False)


A couple things about these three lines (in addition to the rabbit comment, which is correct):

I remember that in tests.utils.utils, you define a function called unlink; you should use it. That being said, I've realized that function doesn't actually do what it says in the docstring. It says "Remove a file or directory at path.", but it only uses shutil.rmtree, which doesn't remove files (afaik). I think that we should do either one of two things:

A. we should update that function to something like def remove_file_or_dir, make it do what the docstring says, and then use it everywhere in our integration tests, or
B. we should remove that function entirely and use only a mixture of shutil.rmtree for directories and os.unlink for files.

Why don't we make extraction_dir in the IntegrationTestLogs __post_init__?

discussed in other posts.

I think _download_and_extract_dataset is a giant post init setup for IntegrationTestLogs, so we don't really need to put any set up into the actual __post_init__
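
For reference, option A from the comment above could look roughly like the sketch below. It borrows the reviewer's proposed name `remove_file_or_dir`; the body is an assumption and is not the `remove_path` helper that was eventually merged. It relies on the fact that `shutil.rmtree` only handles real directories, while `Path.unlink` handles files and symlinks:

```python
import shutil
from pathlib import Path


def remove_file_or_dir(path: Path) -> None:
    """Remove a file, symlink, or directory tree at `path`; do nothing if it is absent."""
    if path.is_dir() and not path.is_symlink():
        # A real directory: remove it together with its contents.
        shutil.rmtree(path)
    else:
        # A regular file, a symlink (even a broken one), or a missing path.
        path.unlink(missing_ok=True)
```

Because a missing path is tolerated, a helper like this also avoids the first-run FileNotFoundError raised by an unconditional shutil.rmtree.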

integration-tests/tests/fixtures/integration_test_logs.py

quinntaylormitchell

looking mostly good; a few nits but that's all

integration-tests/tests/utils/utils.py

quinntaylormitchell · 2026-02-03T16:15:05Z

integration-tests/tests/utils/utils.py

+    if not path_to_remove.exists():
+        return


coderabbit left a comment on my implementation of this in #1801, not sure if you want to address it or not:

"Broken symlinks will not be removed.

Path.exists() follows symlinks and returns False for broken symlinks, causing the function to return early without removing them. This could leave stale symlinks in test directories."

Updated my implementation.
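
The behaviour quoted above can be reproduced with a short sketch. `os.path.lexists` is a standard-library call that does not follow symlinks; the helper names `entry_exists` and `demo_broken_symlink` are made up for illustration:

```python
import os
import tempfile
from pathlib import Path


def entry_exists(path: Path) -> bool:
    """Return True if anything exists at `path`, including a broken symlink."""
    return os.path.lexists(path)


def demo_broken_symlink() -> tuple[bool, bool]:
    """Create a symlink to a missing target and report both existence checks."""
    with tempfile.TemporaryDirectory() as tmp:
        link = Path(tmp) / "stale-link"
        link.symlink_to(Path(tmp) / "missing-target")
        # Path.exists() follows the link; os.path.lexists() inspects the link itself.
        return link.exists(), entry_exists(link)


print(demo_broken_symlink())  # (False, True)
```

An early return guarded by `path.exists()` therefore skips broken symlinks, whereas a guard based on `os.path.lexists` (or on `path.is_symlink()`) still sees them.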

quinntaylormitchell · 2026-02-03T21:23:09Z

integration-tests/tests/fixtures/integration_test_logs.py

+    subprocess.run([chmod_bin, "-R", "gu+w", tarball_path_str], check=True)
+    subprocess.run([chmod_bin, "-R", "gu+w", extract_path_str], check=True)


Suggested change

-    subprocess.run([chmod_bin, "-R", "gu+w", tarball_path_str], check=True)
-    subprocess.run([chmod_bin, "-R", "gu+w", extract_path_str], check=True)
+    subprocess.run([chmod_bin, "gu+w", tarball_path_str], check=True)
+    subprocess.run([chmod_bin, "--recursive", "gu+w", extract_path_str], check=True)

a couple nits:

I don't think we need the -R flag in the first one because the tarball is a single file (no recursion needed)

for the second line, change -R to --recursive for readability

Agree with the first one. For the second one, I think macOS doesn't support a lot of long options.
https://www.unix.com/man_page/osx/1/chmod/

ohhhh ok good to know
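
Putting the outcome of this thread together, a sketch of the permission step might look like this: no recursion for the single tarball file, and the short `-R` flag for the directory so the same command works with both GNU and BSD (macOS) chmod. The function name `make_writable` is hypothetical:

```python
import shutil
import subprocess
from pathlib import Path


def make_writable(tarball_path: Path, extraction_dir: Path) -> None:
    """Grant user and group write permission on the tarball and the extracted tree."""
    chmod_bin = shutil.which("chmod")
    if chmod_bin is None:
        raise RuntimeError("chmod executable not found")

    # The tarball is a single file, so no recursion is needed.
    subprocess.run([chmod_bin, "gu+w", str(tarball_path)], check=True)
    # Use the short -R flag; the long --recursive spelling is GNU-specific.
    subprocess.run([chmod_bin, "-R", "gu+w", str(extraction_dir)], check=True)
```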

quinntaylormitchell

lgtm, ready for maintainer review!

junhaoliao

mostly lgtm. no major issue found except the maintainability concern with unlink(). the rest of the comments are straightforward and we should be able to merge after addressing them

integration-tests/tests/utils/utils.py

integration-tests/tests/fixtures/integration_test_logs.py

Tar extract switch from shutil to tar and use subprocess

3ad919f

Bill-hbrhbr requested a review from a team as a code owner November 25, 2025 20:12

Bill-hbrhbr mentioned this pull request Nov 25, 2025

feat: clp-s ordered compression and decompression identity transformation integration tests. #1657

Open

Bill-hbrhbr requested a review from quinntaylormitchell November 25, 2025 20:14

coderabbitai bot reviewed Nov 25, 2025

quinntaylormitchell requested changes Nov 26, 2025

junhaoliao added this to the Backlog milestone Jan 19, 2026

Bill-hbrhbr added 4 commits February 2, 2026 19:18

Update docstring and add remove_path helper

82464e8

Merge branch 'main' into it-strip-tarball-leading-component

0dc20a4

Move remove_path location

519eddd

Use up to date helpers

81da2cc

Bill-hbrhbr requested a review from quinntaylormitchell February 2, 2026 19:39

Merge branch 'main' into it-strip-tarball-leading-component

36eab7f

quinntaylormitchell reviewed Feb 3, 2026

Address review comments

b835039

Bill-hbrhbr requested a review from quinntaylormitchell February 4, 2026 11:49

quinntaylormitchell previously approved these changes Feb 4, 2026

Bill-hbrhbr changed the title ~~feat(integration-tests): Use tar for extracting tarball downloads with optional leading directory component stripping.~~ feat(integration-tests): Extract tarballs with tar to strip the top-level directory. Feb 4, 2026

junhaoliao requested changes Feb 5, 2026

junhaoliao modified the milestones: Backlog, Mid-February 2026 Feb 5, 2026

Address review comment

a44fa1e

Bill-hbrhbr dismissed quinntaylormitchell’s stale review via a44fa1e February 11, 2026 13:42

Merge branch 'main' into it-strip-tarball-leading-component

fb980c3

Bill-hbrhbr requested a review from junhaoliao February 11, 2026 13:43

Bill-hbrhbr added 2 commits February 11, 2026 08:49

Fix lint

de26125

Remove unlink() and replace with remove_path()

62f286b

Bill-hbrhbr changed the title ~~feat(integration-tests): Extract tarballs with tar to strip the top-level directory.~~ feat(integration-tests): Extract tarballs with tar to strip top-level directory; Fix path cleanup utilities. Feb 11, 2026

Merge branch 'main' into it-strip-tarball-leading-component

ba8a374

Merge branch 'main' into it-strip-tarball-leading-component

1cf8f42

junhaoliao approved these changes Feb 24, 2026

Merge branch 'main' into it-strip-tarball-leading-component

4ce6245

Bill-hbrhbr merged commit 4485405 into y-scope:main Feb 25, 2026
21 checks passed

Bill-hbrhbr deleted the it-strip-tarball-leading-component branch February 25, 2026 16:48
