Skip to content

feat(integration-tests): Extract tarballs with tar to strip top-level directory; Fix path cleanup utilities.#1661

Merged
Bill-hbrhbr merged 14 commits intoy-scope:mainfrom
Bill-hbrhbr:it-strip-tarball-leading-component
Feb 25, 2026
Merged

feat(integration-tests): Extract tarballs with tar to strip top-level directory; Fix path cleanup utilities.#1661
Bill-hbrhbr merged 14 commits intoy-scope:mainfrom
Bill-hbrhbr:it-strip-tarball-leading-component

Conversation

@Bill-hbrhbr
Copy link
Contributor

@Bill-hbrhbr Bill-hbrhbr commented Nov 25, 2025

Description

This PR belongs to the integration-tests feature plan described in #1657.

This change replaces the previous archive extraction workflow, which used shutil.unpack_archive, with an explicit tar based implementation. The new approach correctly handles stripping a leading directory component when the tarball includes a top level folder. For example, the old workflow extracted build/integration-tests/postgresql.tar.gz into:

build/integration-tests/postgresql/postgresql/postgresql.log

which adds an unnecessary extra level of nesting due to the preserved top level directory inside the archive. Since we wish to control the extraction directory name, we explicitly remove that leading component. The extraction path becomes:

build/integration-tests/postgresql/postgresql.log

Because shutil.unpack_archive does not support stripping directory components, switching to tar provides the required extraction results. The chosen extraction flags remain compatible with both macOS and Linux.

Checklist

  • The PR satisfies the contribution guidelines.
  • Necessary docs have been updated, OR no docs need to be updated.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.

Validation performed

Summary by CodeRabbit

  • Chores
    • Enhanced internal testing infrastructure with improved reliability and robustness. Modernized dataset extraction workflows with better error handling mechanisms. Implemented flexible options for directory component management during extraction. Updated permission handling to ensure consistent access across extraction operations.

✏️ Tip: You can customize this high-level summary in your review settings.


@Bill-hbrhbr Bill-hbrhbr requested a review from a team as a code owner November 25, 2025 20:12
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Nov 25, 2025

Walkthrough

Modified _download_and_extract_dataset function to replace shutil.unpack_archive with curl and tar-based extraction. Added has_leading_directory_component parameter to optionally strip leading directory components. Updated permission handling for both tarball and extraction directory, with added tar executable validation and per-step error handling.

Changes

Cohort / Cohort / File(s) Summary
Download and Extract Dataset Function
integration-tests/tests/fixtures/integration_test_logs.py
Replaced shutil.unpack_archive-based extraction with curl and tar subprocess calls; added has_leading_directory_component parameter (defaults True); updated permission handling to apply gu+w to both tarball and extraction directory; added tar executable lookup and per-step error handling; added function docstring documenting new parameter behaviour.

Sequence Diagram

sequenceDiagram
    participant Test as Test Code
    participant Func as _download_and_extract_dataset
    participant Curl as curl
    participant Tar as tar
    participant FS as Filesystem

    Test->>Func: Call with dataset name & URL
    
    rect rgb(200, 220, 240)
    Note over Func: Old Flow (shutil)
    Func->>Curl: Download tarball
    Curl-->>Func: Save file
    Func->>FS: shutil.unpack_archive()
    FS-->>Func: Extract complete
    end
    
    rect rgb(220, 240, 200)
    Note over Func: New Flow (curl + tar)
    Func->>Curl: Download tarball
    Curl-->>Func: Save file
    Func->>FS: Apply permissions (gu+w)
    Func->>Func: Locate tar executable
    Func->>Tar: tar -xf with -C target dir
    Tar-->>Func: Extract complete
    alt has_leading_directory_component = true
        Func->>FS: Strip leading directory
        FS-->>Func: Done
    end
    end
    
    Func-->>Test: Return IntegrationTestLogs
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Subprocess error handling: Verify tar executable validation and per-step error handling for curl and tar commands
  • Permission handling changes: Confirm that applying gu+w to both tarball and extraction directory is intentional and secure
  • Directory stripping logic: Review conditional logic for has_leading_directory_component parameter and edge cases
  • Backward compatibility: Validate that defaulting has_leading_directory_component to True maintains existing behaviour for current callers
🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main changes: switching to tar-based extraction and fixing path cleanup utilities, which aligns with the file modifications in the changeset.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Tip

Issue Planner is now in beta. Read the docs and try it out! Share your feedback on Discord.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 21b8da6 and 3ad919f.

📒 Files selected for processing (1)
  • integration-tests/tests/fixtures/integration_test_logs.py (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: jackluo923
Repo: y-scope/clp PR: 1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:27-32
Timestamp: 2025-07-07T17:41:15.655Z
Learning: In CLP installation scripts, consistency across platform scripts is prioritized over defensive programming improvements. For example, when extracting Task binaries with tar in `install-prebuilt-packages.sh`, the extraction pattern should remain consistent with other platform scripts rather than adding defensive flags like `--strip-components=1` to handle potential tarball layout changes.
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: integration-tests/tests/utils/config.py:121-123
Timestamp: 2025-08-20T22:07:04.953Z
Learning: For the CLP integration tests codebase, do not suggest deriving tarball filenames from URLs instead of hard-coding ".tar.gz" extension. The user has explicitly rejected this suggestion.
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: integration-tests/tests/fixtures/integration_test_logs.py:54-56
Timestamp: 2025-08-17T16:10:38.722Z
Learning: For PR #1100 (feat(integration-tests): Add CLP package integration tests boilerplate), do not raise cache weakness problems related to the pytest cache implementation in the integration test logs fixtures.
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1050
File: components/clp-package-utils/clp_package_utils/scripts/archive_manager.py:178-178
Timestamp: 2025-08-14T05:06:59.010Z
Learning: In the decompress.py script, the --dataset argument was only added to the json_extraction_parser but not to the ir_extraction_parser, causing AttributeError when handle_extract_stream_cmd tries to access parsed_args.dataset for IR extraction commands.
🧬 Code graph analysis (1)
integration-tests/tests/fixtures/integration_test_logs.py (2)
integration-tests/tests/utils/config.py (1)
  • IntegrationTestLogs (104-128)
integration-tests/tests/utils/utils.py (1)
  • unlink (65-86)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: package-image
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (1)
integration-tests/tests/fixtures/integration_test_logs.py (1)

83-112: curl/tar/chmod pipeline and has_leading_directory_component flag look solid

The explicit curl+tar flow with --fail/--location, -C into the extraction dir, and the optional --strip-components 1 governed by has_leading_directory_component is clear and should behave well on both macOS and Linux. The post-extraction chmod -R gu+w on both the tarball and extraction directory aligns with the goal of making cached datasets easily overwritable between runs.

Also applies to: 118-120

Comment on lines 71 to 73
integration_test_logs.tarball_path.unlink(missing_ok=True)
shutil.rmtree(integration_test_logs.extraction_dir)
integration_test_logs.extraction_dir.mkdir(parents=True, exist_ok=False)
Copy link
Contributor

@coderabbitai coderabbitai bot Nov 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

shutil.rmtree will raise on first run when the extraction directory does not exist

Calling shutil.rmtree(integration_test_logs.extraction_dir) unconditionally will raise a FileNotFoundError on the very first run (before the directory has ever been created), causing the fixture to fail before any download/extraction happens. The earlier helper-based cleanup was tolerant of missing paths.

-    integration_test_logs.tarball_path.unlink(missing_ok=True)
-    shutil.rmtree(integration_test_logs.extraction_dir)
-    integration_test_logs.extraction_dir.mkdir(parents=True, exist_ok=False)
+    integration_test_logs.tarball_path.unlink(missing_ok=True)
+    try:
+        shutil.rmtree(integration_test_logs.extraction_dir)
+    except FileNotFoundError:
+        pass
+    integration_test_logs.extraction_dir.mkdir(parents=True, exist_ok=False)

If you still want the previous “force remove with sudo” behaviour for stubborn paths, you could instead wire this back through tests.utils.utils.unlink, but the above is the minimal fix for the first-run failure.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
integration_test_logs.tarball_path.unlink(missing_ok=True)
shutil.rmtree(integration_test_logs.extraction_dir)
integration_test_logs.extraction_dir.mkdir(parents=True, exist_ok=False)
integration_test_logs.tarball_path.unlink(missing_ok=True)
try:
shutil.rmtree(integration_test_logs.extraction_dir)
except FileNotFoundError:
pass
integration_test_logs.extraction_dir.mkdir(parents=True, exist_ok=False)
🤖 Prompt for AI Agents
In integration-tests/tests/fixtures/integration_test_logs.py around lines 71 to
73, calling shutil.rmtree(integration_test_logs.extraction_dir) will raise
FileNotFoundError on first run; update the cleanup to tolerate missing
directories by either checking existence before removal or calling
shutil.rmtree(integration_test_logs.extraction_dir, ignore_errors=True) so the
fixture no longer fails when the extraction dir doesn't exist.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is correct, I did get an error when I ran this directly after a fresh build. The rabbit's committable suggestion works.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@quinntaylormitchell, thank you for confirming the issue and validating the fix! I appreciate you testing it on a fresh build. The try/except approach should ensure the fixture works reliably on first run while still cleaning up stale directories on subsequent runs.

   (\(\
   ( -.-)
   o_(")(")
   [Happy to help!]

🧠 Learnings used
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: integration-tests/tests/fixtures/integration_test_logs.py:54-56
Timestamp: 2025-08-17T16:10:38.722Z
Learning: For PR #1100 (feat(integration-tests): Add CLP package integration tests boilerplate), do not raise cache weakness problems related to the pytest cache implementation in the integration test logs fixtures.

Learnt from: jackluo923
Repo: y-scope/clp PR: 1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:27-32
Timestamp: 2025-07-07T17:41:15.655Z
Learning: In CLP installation scripts, consistency across platform scripts is prioritized over defensive programming improvements. For example, when extracting Task binaries with tar in `install-prebuilt-packages.sh`, the extraction pattern should remain consistent with other platform scripts rather than adding defensive flags like `--strip-components=1` to handle potential tarball layout changes.

Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: integration-tests/tests/utils/config.py:121-123
Timestamp: 2025-08-20T22:07:04.953Z
Learning: For the CLP integration tests codebase, do not suggest deriving tarball filenames from URLs instead of hard-coding ".tar.gz" extension. The user has explicitly rejected this suggestion.

Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resolved by using the new helper.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Copy link
Collaborator

@quinntaylormitchell quinntaylormitchell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks mostly good; a few comments.

Also, I would write a little blurb in the "Validation performed" section.

Comment on lines 55 to 61
"param name:
:param tarball_url:
:param has_leading_directory_component: Whether all files inside the tarball are stored under a
single top level directory. Defaults to True.
:return: The IntegrationTestPathConfig object with its associated logs properly downloaded,
extracted, and permission changed to be overritable.
"""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"param name:
:param tarball_url:
:param has_leading_directory_component: Whether all files inside the tarball are stored under a
single top level directory. Defaults to True.
:return: The IntegrationTestPathConfig object with its associated logs properly downloaded,
extracted, and permission changed to be overritable.
"""
:param name:
:param tarball_url:
:param has_leading_directory_component: Whether all files inside the tarball are stored under a
single top-level directory.
:return: The IntegrationTestLogs object describing the donloaded and extracted logs.
"""

clarity and typo fixes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

revamped the docstring for clarity.

Comment on lines 71 to 73
integration_test_logs.tarball_path.unlink(missing_ok=True)
shutil.rmtree(integration_test_logs.extraction_dir)
integration_test_logs.extraction_dir.mkdir(parents=True, exist_ok=False)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is correct, I did get an error when I ran this directly after a fresh build. The rabbit's committable suggestion works.

Comment on lines 71 to 73
integration_test_logs.tarball_path.unlink(missing_ok=True)
shutil.rmtree(integration_test_logs.extraction_dir)
integration_test_logs.extraction_dir.mkdir(parents=True, exist_ok=False)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple things about these three lines (in addition to the rabbit comment, which is correct):

  1. I remember that in tests.utils.utils, you define a function called unlink; you should use it. That being said, I've realized that function doesn't actually do what it says in the docstring. It says Remove a file or directory at path., but it only uses shutil.rmtree, which doesn't remove files (afaik). I think that we should do either one of two things: A. we should update that function to something like def remove_file_or_dir, make it do what the docstring says, and then use it everywhere in our integration tests, or B. we should remove that function entirely and use only a mixture of shutil.rmtree for directories and os.unlink for files.
  2. Why don't we make extraction_dir in the IntegrationTestLogs __post_init__?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. discussed in other posts.
  2. I think _download_and_extract_dataset is a giant post init setup for IntegrationTestLogs, so we don't really need to put any set up into the actual __post_init__

@junhaoliao junhaoliao added this to the Backlog milestone Jan 19, 2026
Copy link
Collaborator

@quinntaylormitchell quinntaylormitchell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking mostly good; a few nits but that's all

Comment on lines 104 to 105
if not path_to_remove.exists():
return
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

coderabbit left a comment on my implementation of this in #1801, not sure if you want to address it or not:

"Broken symlinks will not be removed.

Path.exists() follows symlinks and returns False for broken symlinks, causing the function to return early without removing them. This could leave stale symlinks in test directories."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated my implementation.

Comment on lines 111 to 112
subprocess.run([chmod_bin, "-R", "gu+w", tarball_path_str], check=True)
subprocess.run([chmod_bin, "-R", "gu+w", extract_path_str], check=True)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
subprocess.run([chmod_bin, "-R", "gu+w", tarball_path_str], check=True)
subprocess.run([chmod_bin, "-R", "gu+w", extract_path_str], check=True)
subprocess.run([chmod_bin, "gu+w", tarball_path_str], check=True)
subprocess.run([chmod_bin, "--recursive", "gu+w", extract_path_str], check=True)

a couple nits:

  1. I don't think we need the -R flag in the first one because the tarball is a single file (no recursion needed)
  2. for the second line, change -R to --recursive for readability

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with the first one. For the second one, I think MacOS doesn't support a lot of long options.
https://www.unix.com/man_page/osx/1/chmod/

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ohhhh ok good to know

Copy link
Collaborator

@quinntaylormitchell quinntaylormitchell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, ready for maintainer review!

@Bill-hbrhbr Bill-hbrhbr changed the title feat(integration-tests): Use tar for extracting tarball downloads with optional leading directory component stripping. feat(integration-tests): Extract tarballs with tar to strip the top-level directory. Feb 4, 2026
Copy link
Member

@junhaoliao junhaoliao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mostly lgtm. no major issue found except the maintainability concern with unlink(). the rest of the comments are straight-forward and we should be able to merge after addressing them

@junhaoliao junhaoliao modified the milestones: Backlog, Mid-February 2026 Feb 5, 2026
@Bill-hbrhbr Bill-hbrhbr changed the title feat(integration-tests): Extract tarballs with tar to strip the top-level directory. feat(integration-tests): Extract tarballs with tar to strip top-level directory; Fix path cleanup utilities. Feb 11, 2026
@Bill-hbrhbr Bill-hbrhbr merged commit 4485405 into y-scope:main Feb 25, 2026
21 checks passed
@Bill-hbrhbr Bill-hbrhbr deleted the it-strip-tarball-leading-component branch February 25, 2026 16:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants