
feat(integration-tests): Test package compression.#1801

Open
quinntaylormitchell wants to merge 20 commits into y-scope:main from quinntaylormitchell:testing-package-compression-1

Conversation

@quinntaylormitchell
Collaborator

@quinntaylormitchell quinntaylormitchell commented Dec 18, 2025

Description

This PR implements integration tests for package compression. The following new features are added:

  • New tests test_clp_json_compression_json_multifile and test_clp_text_compression_text_multifile.
  • Functions run_package_compression_script and verify_compression.
  • New class PackageCompressionJob.
  • Two sample datasets: json-multifile and text-multifile.

Future package compression tests can be added to the test_clp_* files, following the same format as the existing test_clp_*_compression_* tests.

NOTE: verify_compression does not check the correctness of compression when testing clp-json; implementing this check requires #1299 to be merged.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and it has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Ran the following command from clp/integration-tests:

uv run pytest -m 'package'

All clp-json and clp-text tests passed.

Summary by CodeRabbit

  • Tests

    • Added end-to-end compression validation tests for JSON and text multi-file datasets, including setup, execution and verification steps.
  • Documentation

    • Added metadata READMEs describing dataset composition, file counts, event volumes and timestamp ranges for JSON and text test datasets.
  • Chores

    • Improved test infrastructure and utilities: compression job execution, verification, directory cleanup utilities and a test environment variable to locate test assets.

@quinntaylormitchell quinntaylormitchell requested a review from a team as a code owner December 18, 2025 20:09
@coderabbitai
Contributor

coderabbitai bot commented Dec 18, 2025

Walkthrough

Adds integration test tooling and datasets for package compression workflows: pytest environment var, new utilities to run/verify compression and manage paths, tests for JSON and text multifile packages, and sample log datasets for both formats.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Pytest config**<br>`integration-tests/.pytest.ini` | Added environment variable `D:INTEGRATION_TESTS_PROJECT_ROOT=./`. |
| **Path & config utilities**<br>`integration-tests/tests/utils/config.py`, `integration-tests/tests/fixtures/path_configs.py` | Added `package_test_scripts_dir` and `package_decompression_dir` fields, new properties (`compress_script_path`, `decompress_script_path`, `clp_*_test_data_path`), a `clear_package_archives()` method, and a new `PackageCompressionJob` dataclass; `fixt_package_path_config` now populates `package_test_scripts_dir` from the env var. |
| **Execution helper**<br>`integration-tests/tests/utils/package_utils.py` | Added `run_package_compression_script()` to build and execute compression commands using `PackageCompressionJob` and the test config. |
| **Verification utilities**<br>`integration-tests/tests/utils/asserting_utils.py` | Added `verify_package_compression()` to validate compression results; for clp-text, it decompresses archives and compares the decompressed output to the original dataset, using an extraction command and cleanup utilities. |
| **Filesystem utilities**<br>`integration-tests/tests/utils/utils.py` | Added `clear_directory()` and `remove_path()` helpers to safely remove directory contents and individual paths. |
| **Compression tests**<br>`integration-tests/tests/package_tests/clp_json/test_clp_json.py`, `integration-tests/tests/package_tests/clp_text/test_clp_text.py` | Replaced placeholder tests with full compression workflow tests (`test_clp_json_compression_json_multifile`, `test_clp_text_compression_text_multifile`): validate the instance, clear archives, construct a `PackageCompressionJob`, run compression, verify results, and clean up. |
| **Test data — JSON multifile**<br>`integration-tests/tests/package_tests/clp_json/data/json-multifile/README.md`, `integration-tests/tests/package_tests/clp_json/data/json-multifile/logs/*` | Added a README and five `*.jsonl` log files, each containing 8 timestamped JSONL events for STS-135 across multiple dates. |
| **Test data — Text multifile**<br>`integration-tests/tests/package_tests/clp_text/data/text-multifile/README.md`, `integration-tests/tests/package_tests/clp_text/data/text-multifile/logs/*` | Added a README and five `*.txt` log files, each containing 10 timestamped text entries for Apollo 17 mission days. |

Sequence Diagram(s)

sequenceDiagram
    actor Test as Compression Test
    participant Exec as Compression Script
    participant FS as File System
    participant Decomp as Decompression Script

    Test->>FS: Clear archives
    Test->>Exec: Run compression job (config, dataset path, flags)
    Exec->>FS: Read dataset files
    Exec->>FS: Write compressed archive
    Test->>FS: Locate compressed archive
    Test->>Decomp: Run decompression (temp config)
    Decomp->>FS: Extract archive -> decompressed dir
    Test->>FS: Compare decompressed dir vs original dataset
    Test->>FS: Assert equality and cleanup
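The final compare step in the diagram amounts to a recursive content comparison between two directory trees; a minimal stdlib sketch of that idea (the PR's actual `is_dir_tree_content_equal` helper may differ):

```python
from pathlib import Path


def is_dir_tree_content_equal(a: Path, b: Path) -> bool:
    """Return True iff both trees contain the same relative files with identical bytes."""
    # Index each tree by path relative to its root so layouts can be compared.
    files_a = {p.relative_to(a): p for p in a.rglob("*") if p.is_file()}
    files_b = {p.relative_to(b): p for p in b.rglob("*") if p.is_file()}
    if files_a.keys() != files_b.keys():
        return False
    # Same file set; now compare contents byte-for-byte.
    return all(files_a[rel].read_bytes() == files_b[rel].read_bytes() for rel in files_a)
```

Comparing bytes rather than line counts catches any lossy step in the compress/decompress round trip.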

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly summarizes the main change: adding integration tests for package compression functionality. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 96.55%, which is sufficient. The required threshold is 80.00%. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 Ruff (0.14.14)
integration-tests/tests/fixtures/path_configs.py

ruff failed
Cause: Failed to load extended configuration /tools/yscope-dev-utils/exports/lint-configs/python/ruff.toml (/integration-tests/pyproject.toml extends /tools/yscope-dev-utils/exports/lint-configs/python/ruff.toml)
Cause: Failed to read /tools/yscope-dev-utils/exports/lint-configs/python/ruff.toml
Cause: No such file or directory (os error 2)

integration-tests/tests/package_tests/clp_json/test_clp_json.py

ruff failed
Cause: Failed to load extended configuration /tools/yscope-dev-utils/exports/lint-configs/python/ruff.toml (/integration-tests/pyproject.toml extends /tools/yscope-dev-utils/exports/lint-configs/python/ruff.toml)
Cause: Failed to read /tools/yscope-dev-utils/exports/lint-configs/python/ruff.toml
Cause: No such file or directory (os error 2)

integration-tests/tests/package_tests/clp_text/test_clp_text.py

ruff failed
Cause: Failed to load extended configuration /tools/yscope-dev-utils/exports/lint-configs/python/ruff.toml (/integration-tests/pyproject.toml extends /tools/yscope-dev-utils/exports/lint-configs/python/ruff.toml)
Cause: Failed to read /tools/yscope-dev-utils/exports/lint-configs/python/ruff.toml
Cause: No such file or directory (os error 2)

  • 3 others

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

📜 Review details

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9e9a86b and 3aacf69.

📒 Files selected for processing (15)
  • integration-tests/tests/conftest.py (1 hunks)
  • integration-tests/tests/data/__init__.py (1 hunks)
  • integration-tests/tests/data/datasets/json-multifile/README.md (1 hunks)
  • integration-tests/tests/data/datasets/json-multifile/logs/sts-135-2011-07-08.jsonl (1 hunks)
  • integration-tests/tests/data/datasets/json-multifile/logs/sts-135-2011-07-09.jsonl (1 hunks)
  • integration-tests/tests/data/datasets/json-multifile/logs/sts-135-2011-07-11.jsonl (1 hunks)
  • integration-tests/tests/data/datasets/json-multifile/logs/sts-135-2011-07-19.jsonl (1 hunks)
  • integration-tests/tests/data/datasets/json-multifile/logs/sts-135-2011-07-21.jsonl (1 hunks)
  • integration-tests/tests/data/jobs/compression/__init__.py (1 hunks)
  • integration-tests/tests/data/jobs/compression/compression_jobs.py (1 hunks)
  • integration-tests/tests/fixtures/package_config.py (2 hunks)
  • integration-tests/tests/fixtures/package_instance.py (2 hunks)
  • integration-tests/tests/test_package_start.py (3 hunks)
  • integration-tests/tests/utils/clp_job_utils.py (1 hunks)
  • integration-tests/tests/utils/config.py (3 hunks)
🧰 Additional context used
🧠 Learnings (12)
📓 Common learnings
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: integration-tests/tests/fixtures/integration_test_logs.py:54-56
Timestamp: 2025-08-17T16:10:38.722Z
Learning: For PR #1100 (feat(integration-tests): Add CLP package integration tests boilerplate), do not raise cache weakness problems related to the pytest cache implementation in the integration test logs fixtures.
Learnt from: quinntaylormitchell
Repo: y-scope/clp PR: 1125
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:267-291
Timestamp: 2025-09-15T22:20:40.750Z
Learning: For CLP compression jobs, the team has decided to fail the entire job immediately upon encountering any invalid input path, rather than continuing to process valid paths. This decision was made during PR #1125 development.
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1156
File: components/core/CMakeLists.txt:772-772
Timestamp: 2025-08-09T04:07:27.083Z
Learning: In the CLP project's CMakeLists.txt, when reviewing changes related to the ${zstd_TARGET} variable usage in test linking, the team is planning a refactoring PR to improve this mechanism. Guards for undefined target variables should be deferred to that separate PR rather than being added in focused dependency migration PRs.
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: integration-tests/tests/test_identity_transformation.py:133-137
Timestamp: 2025-07-28T08:33:57.487Z
Learning: In CLP integration tests, the `run_and_assert` utility function should only be used for commands that test CLP package functionality (like clp/clp-s compression/decompression operations). For processing or helper commands (like jq, sort, etc.), direct `subprocess.run` calls should be used instead, as they serve different purposes and don't need the same error handling patterns.
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: integration-tests/tests/utils/config.py:121-123
Timestamp: 2025-08-20T22:07:04.953Z
Learning: For the CLP integration tests codebase, do not suggest deriving tarball filenames from URLs instead of hard-coding ".tar.gz" extension. The user has explicitly rejected this suggestion.
📚 Learning: 2025-08-20T08:36:38.391Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: integration-tests/tests/__init__.py:1-1
Timestamp: 2025-08-20T08:36:38.391Z
Learning: In the CLP project, __init__.py files should include short module-level docstrings describing the package purpose, as required by their ruff linting configuration, rather than being left empty.

Applied to files:

  • integration-tests/tests/data/jobs/compression/__init__.py
  • integration-tests/tests/data/__init__.py
📚 Learning: 2025-01-16T16:58:43.190Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.

Applied to files:

  • integration-tests/tests/data/jobs/compression/__init__.py
📚 Learning: 2025-09-28T15:00:22.170Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 1340
File: components/job-orchestration/job_orchestration/executor/compress/compression_task.py:528-528
Timestamp: 2025-09-28T15:00:22.170Z
Learning: In components/job-orchestration/job_orchestration/executor/compress/compression_task.py, there is a suggestion to refactor from passing logger as a parameter through multiple functions to creating a ClpCompressor class that takes the logger as a class member, with current helper functions becoming private member functions.

Applied to files:

  • integration-tests/tests/data/jobs/compression/__init__.py
  • integration-tests/tests/utils/clp_job_utils.py
  • integration-tests/tests/test_package_start.py
  • integration-tests/tests/data/jobs/compression/compression_jobs.py
📚 Learning: 2025-09-15T22:20:40.750Z
Learnt from: quinntaylormitchell
Repo: y-scope/clp PR: 1125
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:267-291
Timestamp: 2025-09-15T22:20:40.750Z
Learning: For CLP compression jobs, the team has decided to fail the entire job immediately upon encountering any invalid input path, rather than continuing to process valid paths. This decision was made during PR #1125 development.

Applied to files:

  • integration-tests/tests/data/jobs/compression/__init__.py
📚 Learning: 2025-09-25T05:13:13.298Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-package-utils/clp_package_utils/controller.py:217-223
Timestamp: 2025-09-25T05:13:13.298Z
Learning: The compression scheduler service in CLP runs with CLP_UID_GID (current user's UID:GID) rather than CLP_SERVICE_CONTAINER_UID_GID (999:999), unlike infrastructure services such as database, queue, redis, and results cache which run with the service container UID:GID.

Applied to files:

  • integration-tests/tests/data/jobs/compression/__init__.py
📚 Learning: 2025-07-23T09:54:45.185Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Applied to files:

  • integration-tests/tests/data/jobs/compression/__init__.py
  • integration-tests/tests/utils/clp_job_utils.py
📚 Learning: 2025-08-08T06:59:42.436Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1152
File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613
Timestamp: 2025-08-08T06:59:42.436Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR #1152 discussion.

Applied to files:

  • integration-tests/tests/data/jobs/compression/__init__.py
  • integration-tests/tests/test_package_start.py
📚 Learning: 2025-07-28T08:33:57.487Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: integration-tests/tests/test_identity_transformation.py:133-137
Timestamp: 2025-07-28T08:33:57.487Z
Learning: In CLP integration tests, the `run_and_assert` utility function should only be used for commands that test CLP package functionality (like clp/clp-s compression/decompression operations). For processing or helper commands (like jq, sort, etc.), direct `subprocess.run` calls should be used instead, as they serve different purposes and don't need the same error handling patterns.

Applied to files:

  • integration-tests/tests/data/jobs/compression/__init__.py
  • integration-tests/tests/test_package_start.py
  • integration-tests/tests/data/__init__.py
📚 Learning: 2025-08-17T16:10:38.722Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: integration-tests/tests/fixtures/integration_test_logs.py:54-56
Timestamp: 2025-08-17T16:10:38.722Z
Learning: For PR #1100 (feat(integration-tests): Add CLP package integration tests boilerplate), do not raise cache weakness problems related to the pytest cache implementation in the integration test logs fixtures.

Applied to files:

  • integration-tests/tests/test_package_start.py
  • integration-tests/tests/fixtures/package_instance.py
  • integration-tests/tests/conftest.py
  • integration-tests/tests/fixtures/package_config.py
  • integration-tests/tests/data/__init__.py
📚 Learning: 2025-08-08T21:15:10.905Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 1100
File: integration-tests/tests/test_identity_transformation.py:87-95
Timestamp: 2025-08-08T21:15:10.905Z
Learning: In the CLP project's integration tests (Python code), variable names should use "logs" (plural) rather than "log" (singular) when referring to test logs or log-related entities, as this aligns with the naming conventions used throughout the codebase.

Applied to files:

  • integration-tests/tests/test_package_start.py
  • integration-tests/tests/data/__init__.py
📚 Learning: 2025-08-16T10:24:29.316Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: integration-tests/pyproject.toml:11-13
Timestamp: 2025-08-16T10:24:29.316Z
Learning: In the CLP project's integration-tests directory, the integration tests package is not intended to be shipped or distributed - only the tests are leveraged for local testing purposes, so console script entry points should not be included in pyproject.toml.

Applied to files:

  • integration-tests/tests/data/__init__.py
🧬 Code graph analysis (4)
integration-tests/tests/utils/clp_job_utils.py (1)
integration-tests/tests/utils/config.py (2)
  • PackageCompressionJob (125-144)
  • PackageJobList (148-151)
integration-tests/tests/test_package_start.py (1)
integration-tests/tests/fixtures/package_instance.py (1)
  • fixt_package_instance (21-52)
integration-tests/tests/fixtures/package_config.py (4)
integration-tests/tests/utils/clp_job_utils.py (1)
  • build_package_job_list (32-48)
integration-tests/tests/utils/clp_mode_utils.py (2)
  • get_clp_config_from_mode (87-98)
  • get_required_component_list (101-116)
integration-tests/tests/utils/config.py (2)
  • PackageConfig (155-194)
  • PackagePathConfig (68-121)
integration-tests/tests/utils/port_utils.py (1)
  • assign_ports_from_base (39-61)
integration-tests/tests/data/jobs/compression/compression_jobs.py (1)
integration-tests/tests/utils/config.py (1)
  • PackageCompressionJob (125-144)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: check-generated
🔇 Additional comments (15)
integration-tests/tests/data/jobs/compression/__init__.py (1)

1-1: LGTM!

The module docstring is clear and follows the project's ruff linting requirements for init.py files.

integration-tests/tests/data/__init__.py (1)

1-1: LGTM!

The module docstring appropriately describes the package purpose and complies with the project's linting requirements.

integration-tests/tests/data/datasets/json-multifile/logs/sts-135-2011-07-21.jsonl (1)

1-8: LGTM!

The test dataset is well-structured with consistent JSON Lines formatting, sequential timestamps, and appropriate mission event data for integration testing.

integration-tests/tests/data/datasets/json-multifile/logs/sts-135-2011-07-19.jsonl (1)

1-8: LGTM!

The test dataset contains well-formed JSON Lines with sequential timestamps and appropriate re-entry sequence events for integration testing.

integration-tests/tests/data/datasets/json-multifile/logs/sts-135-2011-07-08.jsonl (1)

1-8: LGTM!

The test dataset is properly formatted with sequential timestamps and realistic launch sequence events suitable for integration testing.

integration-tests/tests/data/datasets/json-multifile/logs/sts-135-2011-07-09.jsonl (1)

1-8: LGTM!

The test dataset contains properly formatted JSON Lines with sequential timestamps and appropriate ISS docking sequence events for integration testing.

integration-tests/tests/utils/config.py (3)

8-8: LGTM!

The Any import is appropriately used for the flexible typing of flag values in PackageCompressionJob.


147-152: LGTM!

The PackageJobList dataclass is well-defined with appropriate typing for holding a list of compression jobs.


173-174: LGTM!

The addition of the optional package_job_list field to PackageConfig appropriately extends the configuration to support test job management.

integration-tests/tests/data/datasets/json-multifile/README.md (1)

1-16: LGTM!

The dataset metadata accurately reflects the five JSONL files, with correct counts, type, and timestamp ranges verified against the actual dataset files.

integration-tests/tests/data/datasets/json-multifile/logs/sts-135-2011-07-11.jsonl (1)

1-8: LGTM! Test data is well-structured.

The JSON Lines test data is well-formed with consistent timestamps, sequential line indices, and appropriate field values for integration testing.

integration-tests/tests/fixtures/package_config.py (1)

48-60: LGTM! Job list building logic is correct.

The fixture correctly builds the package job list based on the CLI options, passing the appropriate filter and conditionally skipping job creation when --no-jobs is specified.

integration-tests/tests/fixtures/package_instance.py (1)

32-38: LGTM! Skip logic correctly handles job filtering.

The conditional skip logic appropriately handles the case where a mode has no jobs matching the filter, while still allowing package startup/shutdown testing when --no-jobs is explicitly specified.

integration-tests/tests/utils/clp_job_utils.py (1)

32-48: LGTM! Job list building logic is clear and correct.

The function appropriately filters compression jobs and returns None when no jobs match, which integrates well with the fixture logic in package_instance.py.

integration-tests/tests/conftest.py (1)

31-36: LGTM! The --no-jobs option is well-defined.

The boolean flag and help text clearly communicate the option's purpose for testing package startup and shutdown without running jobs.

Contributor

@Bill-hbrhbr Bill-hbrhbr left a comment


These are some ideas we’ve touched on before but never really circled back to or explored in depth. I think I have a good understanding of your overall design, and it makes sense to me. That said, I wonder if we could lean toward a more modular approach rather than bundling all package-related tests together.

In particular, I’m a bit hesitant about adding more pytest parser options. They work, but I worry they might make the testing setup harder to reason about over time. I’d be interested in discussing whether we can achieve the same goals using pytest’s existing mechanisms while keeping things simpler and more modular.

@junhaoliao junhaoliao modified the milestones: Backlog, February 2026 Jan 19, 2026
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
integration-tests/tests/fixtures/package_test_config.py (1)

37-41: Define package_job_list before use.

package_job_list is referenced but never set in this fixture, which will raise a NameError when the fixture runs. Consider deriving it from the new CLI options and build_package_job_list.

🛠️ Suggested fix
-from tests.utils.config import PackageModeConfig, PackagePathConfig, PackageTestConfig
+from tests.utils.config import PackageModeConfig, PackagePathConfig, PackageTestConfig
+from tests.utils.clp_job_utils import build_package_job_list
@@
     # Construct PackageTestConfig.
+    job_name_contains = request.config.getoption("--job-name-contains")
+    no_jobs = request.config.getoption("--no-jobs")
+    package_job_list = None if no_jobs else build_package_job_list(
+        mode_config.mode_name, job_name_contains
+    )
     package_test_config = PackageTestConfig(
         path_config=fixt_package_path_config,
         mode_config=mode_config,
         base_port=base_port,
         package_job_list=package_job_list,
     )

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
integration-tests/tests/utils/config.py (1)

101-117: Create package_decompression_dir before use.
It’s initialised but never created, so decompression can fail if the script expects the directory to exist.

🛠️ Proposed fix
         # Create directories if they do not already exist.
         self.temp_config_dir.mkdir(parents=True, exist_ok=True)
+        self.package_decompression_dir.mkdir(parents=True, exist_ok=True)
         self.clp_log_dir.mkdir(parents=True, exist_ok=True)
integration-tests/tests/package_tests/clp_json/test_clp_json.py (1)

59-117: Add cleanup to avoid test cross-contamination.
The TODOs indicate cleanup is pending; leaving var/data, var/log, and var/tmp can leak state across runs. Consider a try/finally cleanup in the test or a fixture finalizer.

🧹 Suggested cleanup pattern
-from tests.utils.utils import resolve_path_env_var
+from tests.utils.utils import resolve_path_env_var, unlink
...
     package_test_config = fixt_package_instance.package_test_config
-    run_package_compression_script(compression_job, package_test_config)
-
-    # Check the correctness of compression.
-    verify_compression(compression_job, package_test_config)
+    try:
+        run_package_compression_script(compression_job, package_test_config)
+        # Check the correctness of compression.
+        verify_compression(compression_job, package_test_config)
+    finally:
+        path_config = package_test_config.path_config
+        unlink(path_config.clp_package_dir / "var" / "data")
+        unlink(path_config.clp_package_dir / "var" / "log")
+        unlink(path_config.clp_package_dir / "var" / "tmp")

If you want, I can draft a reusable fixture/helper for cleanup.

🤖 Fix all issues with AI agents
In `@integration-tests/tests/utils/asserting_utils.py`:
- Around line 109-151: In verify_compression, the output_path is built as
decompression_dir / path_to_dataset which collapses to path_to_dataset when
path_to_dataset is absolute; change the construction so output_path points to
the extracted dataset directory under decompression_dir (e.g., use the dataset
basename via path_to_dataset.name or Path(path_to_dataset).name) before calling
is_dir_tree_content_equal; update references to decompression_dir,
path_to_dataset and output_path accordingly so the comparison checks the
decompressed copy, not the original.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🤖 Fix all issues with AI agents
In
`@integration-tests/tests/package_tests/clp_text/sample_datasets/text-multifile/logs/apollo-17_day08.txt`:
- Around line 1-15: Update the two suggested style refinements in the log
entries: change the phrase in the e006 entry ("flags and commemorative plaques
documented in high resolution imagery") to use the compound adjective
"high-resolution imagery", and change the phrase in the e013 entry ("crew rests
briefly prior to ascent timeline") to the more concise "crew rests briefly
before ascent timeline". Locate and edit the exact lines containing "e006" and
"e013" to apply these textual replacements.

In
`@integration-tests/tests/package_tests/clp_text/sample_datasets/text-multifile/README.md`:
- Line 4: The line containing "Unstructured text" has a trailing space causing
markdownlint MD009; edit the README.md and remove the trailing whitespace at the
end of that line (the line that exactly reads "Unstructured text") so it ends
with no extra space.

In `@integration-tests/tests/package_tests/clp_text/test_clp_text.py`:
- Around line 59-63: Update the docstring in test_clp_text.py to reference the
correct package and dataset: replace the incorrect "clp-json" with "clp-text"
and "json-multifile" with "text-multifile" so the docstring accurately describes
that the clp-text package compresses the text-multifile dataset; locate the
docstring at the top of the test function in this file and make the two string
edits.
- Around line 64-91: Call clear_package_archives() on the test package path
config before running compression to ensure no pre-existing archives affect
results; retrieve the config from fixt_package_instance.package_test_config (or
assign package_path_config =
fixt_package_instance.package_test_config.path_config) and invoke
package_path_config.clear_package_archives() prior to creating/running
PackageCompressionJob and run_package_compression_script so the test starts with
a clean archive state.

In `@integration-tests/tests/utils/config.py`:
- Around line 140-143: The clear_package_archives method should guard against a
missing archives directory before calling clean_directory: check the computed
archives_dir (from self.clp_package_dir / "var" / "data" / "archives") exists
and is a directory (e.g., archives_dir.exists() or archives_dir.is_dir()) and
only call clean_directory(archives_dir) when it does; if it doesn't exist,
simply return/do nothing to avoid raising FileNotFoundError from
clean_directory.

In `@integration-tests/tests/utils/utils.py`:
- Around line 13-23: The clean_directory function should handle a non-existent
directory gracefully to avoid FileNotFoundError in callers like
clear_package_archives: update clean_directory to check directory.exists() (or
directory.is_dir()) and simply return when the path is missing (or else
explicitly document that it raises FileNotFoundError in the docstring if you
want failure semantics); keep the rest of the logic
(iterdir/unlink/shutil.rmtree) unchanged and reference the function name
clean_directory and the caller clear_package_archives when making the change so
tests that expect absent archive dirs no longer fail unexpectedly.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In
`@integration-tests/tests/package_tests/clp_text/sample_datasets/text-multifile/logs/apollo-17_day01.txt`:
- Around line 1-10: The timestamp milliseconds in the log lines (e.g., the
sequence starting with "1972-12-07T05:33:00.000 apollo-17 d01 e000 ..." showing
.000, .999, .998, ...) form an artificial decreasing pattern; decide whether
this was intentional and if not, normalize the milliseconds across the entries
(for example set all timestamps to .000 or to realistic randomized ms) in the
apollo-17_day01 log block so the data generation artifact is removed; update the
log lines that include the pattern (.000/.999/.998/... shown in the sample
entries) to use the chosen consistent or randomized millisecond format.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@integration-tests/tests/utils/asserting_utils.py`:
- Around line 143-146: The second assignment to decompression_dir is
redundant—remove the repeated line "decompression_dir =
package_test_config.path_config.package_decompression_dir" so the function
reuses the earlier decompression_dir defined at line 129; then keep the
subsequent computations that use decompression_dir (the output_path =
decompression_dir / path_to_dataset.relative_to(path_to_dataset.anchor) line)
intact, ensuring compression_job.path_to_dataset and path_to_dataset variables
remain used as before.
- Around line 122-124: The branch handling mode in asserting_utils.py currently
does a no-op "assert True" for ("clp-json", "clp-presto"), so replace that
placeholder with an explicit test skip or warning: import pytest (or use the
existing test logger) and call pytest.skip("Waiting for PR 1299") inside the
mode in ("clp-json", "clp-presto") branch (or log a clear warning and return
from verify_compression) so the behavior is explicit; update the branch
referencing the mode variable and the placeholder assertion to perform the
skip/log instead of assert True.

@quinntaylormitchell quinntaylormitchell changed the title feat(integration-tests): Add data classes, preliminary infra, and test datasets for package compression testing. feat(integration-tests): Test package compression. Jan 29, 2026
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@integration-tests/tests/utils/asserting_utils.py`:
- Around line 143-151: The test currently calls pytest.fail before calling
unlink(decompression_dir), so decompression_dir is only cleaned on success; wrap
the verification and cleanup in a try/finally so unlink(decompression_dir)
always runs: compute path_to_dataset and output_path, run the
is_dir_tree_content_equal check inside the try block and call pytest.fail on
mismatch, and perform unlink(decompression_dir) in the finally block to
guarantee cleanup; reference the existing names compression_job,
path_to_dataset, output_path, is_dir_tree_content_equal, pytest.fail, and unlink
to locate where to implement the try/finally.

@Bill-hbrhbr Bill-hbrhbr left a comment

First pass.

Comment on lines 76 to 77
resolve_path_env_var("INTEGRATION_TESTS_DIR")
/ "tests"
Contributor

Let's make this path a part of IntegrationTestPathConfig. We can name it test_scripts_dir or test_source_dir. In hindsight, should've named the other attribute test_build_dir. We can fix it in a later PR.

Collaborator Author

Considering that the package test code doesn't use IntegrationTestPathConfig anywhere at the moment, it seems a bit weird to introduce that class throughout the package test code just to store one path (because then we'd have to pass IntegrationTestPathConfig as a parameter to every package test function). My next thought is "What if we stored it in PackagePathConfig instead?", but that doesn't really seem right either.

I think ultimately what we need is an ironed-out (standardized) way to store paths that are relevant to all integration tests (maybe that's what should get stored in pytest.ini?), and then a different way to store paths for integration tests of different types (maybe PackagePathConfig for package, BinaryPathConfig for binaries, something like that).

Let's talk about this offline; it doesn't seem like it's a blocker for this PR in particular though.

Contributor

My approach to path definitions in the integration tests project is as follows:

  1. pytest.ini defines environment variable to path mappings with sensible defaults, while still allowing users to override locations as needed. For example, CLP_CORE_BINS_DIR typically points to clp/build/core, but users may choose a different build location such as clp/components/core/build. New additions to pytest.ini should not contain any paths that can be fully derived from existing ones (i.e. it's not possible for them to have custom locations).

  2. All paths defined in pytest.ini are resolved centrally in path_configs.py and exposed through session-scoped fixtures.

  3. Derived paths, such as the path to the clp-s binary or to the package compression script compress.py, are provided as @property attributes on these session-scoped fixture classes.
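The three-layer approach above can be sketched as follows; `CorePathConfig`, the exact env-var handling, and the `clp_s_binary` property are illustrative shapes under the stated conventions, not the project's actual code:

```python
import os
from pathlib import Path


def resolve_path_env_var(name: str) -> Path:
    """Resolve an environment variable (defaulted in pytest.ini) to an absolute path."""
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(f"Environment variable {name} is not set.")
    return Path(value).resolve()


class CorePathConfig:
    """Centralizes paths rooted at CLP_CORE_BINS_DIR (hypothetical shape)."""

    def __init__(self) -> None:
        # Layers 1 and 2: the overridable root comes from pytest.ini via an
        # env var and is resolved once, centrally.
        self.core_bins_dir = resolve_path_env_var("CLP_CORE_BINS_DIR")

    @property
    def clp_s_binary(self) -> Path:
        # Layer 3: derived paths are properties, never separate pytest.ini
        # entries, because they are fully determined by the root.
        return self.core_bins_dir / "clp-s"
```

In the real suite such an object would be exposed through a session-scoped fixture (decorated with `@pytest.fixture(scope="session")`), so every test shares one resolved instance.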

run_and_assert(stop_cmd, timeout=DEFAULT_CMD_TIMEOUT_SECONDS)


def run_package_compression_script(
Contributor

This function doesn't feel like it's unique to running the compression job script. This function can be generalized to running any binary/script that takes option flags first and then positional arguments in the end.
Of course there are also cases where these two scenarios are interleaved. E.g.

clp-s x --ordered archive decomp

where the binary is first followed by pos arg, then by an option, and eventually by pos args again. Anyways let's not account for this situation now.
What I suggest is moving the core of this function into a new helper called build_command, and calling run_and_assert externally. That means you should also generalize PackageCompressionJob to CommandArgumentBuilder (or something that fits), and compression_job.path_to_dataset then becomes the one and only positional argument for CommandArgumentBuilder.
CommandArgumentBuilder shouldn’t be a dataclass since it will hold construction logic, and it can have an API called build() (instead of the build_command() as I suggested before).
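A minimal sketch of the suggested builder; the class name and `build()` API follow the suggestion above, while the fluent helper methods and the flag tokens in the test usage are illustrative placeholders:

```python
from pathlib import Path


class CommandArgumentBuilder:
    """Builds a command of the shape: <program> <options...> <positional args...>.

    Interleaved option/positional layouts (e.g. `clp-s x --ordered archive decomp`)
    are deliberately out of scope, matching the review discussion.
    """

    def __init__(self, program: str) -> None:
        self._program = program
        self._options: list[str] = []
        self._positional_args: list[str] = []

    def add_option(self, *tokens: str) -> "CommandArgumentBuilder":
        # A flag and its value arrive as separate tokens, e.g. ("--key", "value").
        self._options.extend(tokens)
        return self

    def add_positional_arg(self, arg: str) -> "CommandArgumentBuilder":
        self._positional_args.append(arg)
        return self

    def build(self) -> list[str]:
        # Options always precede positional arguments in the built command.
        return [self._program, *self._options, *self._positional_args]
```

The caller would then pass `builder.build()` to `run_and_assert(...)` externally, keeping command construction and execution separate.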

Collaborator Author

In concept I agree with you. I don't see that there is any real reason to generalize it within the context of this PR though; how about I leave it as-is for now, and then when I put up the search PRs, I can generalize it to build_command then?

pytest.fail("Mode mismatch: running configuration does not match intended configuration.")


def verify_compression(
Contributor

Suggested change
-def verify_compression(
+def verify_package_compression(

Also this seems more like part of the testing code instead of code that can be reused. I don't know how I feel about putting this in asserting_utils instead of directly in test_clp_text and test_clp_json

Collaborator Author

Agree with the name change.

I do see the value in having a version of this function that is specific to each mode to avoid the multiplexing that's going on here, but that means that each future mode module (e.g., test_clp_presto, test_clp_json_s3, etc.) will need one too. I think that keeping it here is ultimately the better call, especially as the number of modes will grow faster than the number of ways we verify compression (e.g., the future test_clp_presto and test_clp_json_s3 will both use the same verification workflow as clp-json).

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@integration-tests/tests/package_tests/clp_json/test_clp_json.py`:
- Around line 111-117: This test file
(integration-tests/tests/package_tests/clp_json/test_clp_json.py) currently
contains only TODOs and a bare "assert True" which yields false CI positives;
update the placeholder test to be explicitly skipped or xfailed until
implemented by either adding pytest import and decorating the test function with
`@pytest.mark.skip`(reason="TODO: implement compression/search assertions") or
`@pytest.mark.xfail`(strict=False, reason="TODO: implement"), or call
pytest.skip("TODO: implement") at the top of the test body; ensure you reference
the existing test function name in your change so the test runner marks it
skipped/xfail instead of passing.

In
`@integration-tests/tests/package_tests/clp_text/data/text-multifile/logs/apollo-17_day10.txt`:
- Line 4: Replace the unhyphenated phrase "ground based" with the compound
adjective "ground-based" in the log line containing "navigation sightings
compare well with ground based tracking solutions" so the text reads "navigation
sightings compare well with ground-based tracking solutions"; update the entry
in apollo-17_day10.txt accordingly.

In `@integration-tests/tests/utils/asserting_utils.py`:
- Around line 110-156: The verify_package_compression function currently checks
mode for "clp-json" and "clp-text" but lacks an else branch, so unsupported or
misconfigured modes silently pass; add an explicit failure when
package_test_config.mode_config.mode_name is not recognized (i.e., add an else
after the existing if/elif) and call pytest.fail with a clear message that
includes the unexpected mode and relevant identifiers (e.g., package_test_config
or compression_job.path_to_dataset) to fail-fast on mode-related
misconfigurations.

In `@integration-tests/tests/utils/utils.py`:
- Around line 117-118: The early-return check uses path_to_remove.exists() which
skips broken symlinks; change the guard to allow symlinks (including broken
ones) to be removed by checking either os.path.lexists(path_to_remove) or by
replacing the condition with "if not (path_to_remove.exists() or
path_to_remove.is_symlink()): return" so the code will not return when
path_to_remove is a broken symlink; update the logic around path_to_remove in
utils.py accordingly.
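The symlink-aware guard this prompt describes could look like the following sketch of `unlink`; the removal logic around the guard is assumed, not copied from the project:

```python
import shutil
from pathlib import Path


def unlink(path_to_remove: Path) -> None:
    """Remove a file, directory tree, or symlink; do nothing if nothing is there.

    `Path.exists()` follows symlinks, so a broken symlink reports False and
    would be skipped; checking `is_symlink()` as well ensures it is removed.
    """
    if not (path_to_remove.exists() or path_to_remove.is_symlink()):
        return
    if path_to_remove.is_dir() and not path_to_remove.is_symlink():
        shutil.rmtree(path_to_remove)
    else:
        path_to_remove.unlink()
```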

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@integration-tests/tests/utils/asserting_utils.py`:
- Line 147: Add the missing timeout to the decompression invocation: change the
run_and_assert(decompress_cmd) call to pass timeout=DEFAULT_CMD_TIMEOUT_SECONDS;
if DEFAULT_CMD_TIMEOUT_SECONDS is not in scope, import it from where other tests
pull it (or define it locally using the same value) so run_and_assert and
decompress_cmd use the consistent timeout to prevent indefinite hangs.

In `@integration-tests/tests/utils/config.py`:
- Around line 161-172: Change PackageCompressionJob so options and
positional_args default to empty lists using
dataclasses.field(default_factory=list) instead of typing them as list[str] |
None; update any consuming code such as run_package_compression_script to treat
these attributes as lists (remove None checks/conditional extends) and rely on
them being empty by default, and adjust any callers that currently pass None to
omit those parameters.

]

# Run decompression command and assert that it succeeds.
run_and_assert(decompress_cmd)
Contributor

⚠️ Potential issue | 🟡 Minor

Missing timeout on run_and_assert for the decompression command.

All other run_and_assert calls in package_utils.py pass timeout=DEFAULT_CMD_TIMEOUT_SECONDS. Without a timeout here, a stalling decompression process will hang the test indefinitely.

🛡️ Proposed fix

You'll need to import the constant or define a local one:

+from tests.utils.package_utils import DEFAULT_CMD_TIMEOUT_SECONDS

Then:

         # Run decompression command and assert that it succeeds.
-        run_and_assert(decompress_cmd)
+        run_and_assert(decompress_cmd, timeout=DEFAULT_CMD_TIMEOUT_SECONDS)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-run_and_assert(decompress_cmd)
+from tests.utils.package_utils import DEFAULT_CMD_TIMEOUT_SECONDS
+# ... (other imports and code)
+run_and_assert(decompress_cmd, timeout=DEFAULT_CMD_TIMEOUT_SECONDS)
🤖 Prompt for AI Agents
In `@integration-tests/tests/utils/asserting_utils.py` at line 147, Add the
missing timeout to the decompression invocation: change the
run_and_assert(decompress_cmd) call to pass timeout=DEFAULT_CMD_TIMEOUT_SECONDS;
if DEFAULT_CMD_TIMEOUT_SECONDS is not in scope, import it from where other tests
pull it (or define it locally using the same value) so run_and_assert and
decompress_cmd use the consistent timeout to prevent indefinite hangs.

Comment on lines +161 to +172
@dataclass(frozen=True)
class PackageCompressionJob:
"""A compression job for a package test."""

#: The absolute path to the dataset (either a file or directory).
path_to_original_dataset: Path

#: Options to specify in the compression command.
options: list[str] | None

#: Positional arguments to specify in the compression command (do not put paths to compress)
positional_args: list[str] | None
Contributor

🧹 Nitpick | 🔵 Trivial

Consider using empty lists as defaults instead of None.

Using None for options and positional_args forces every caller to pass None explicitly (since frozen=True requires all fields), and the consuming code in run_package_compression_script must null-check before extending. Using field(default_factory=list) would simplify both construction and consumption.

♻️ Proposed refactor

In config.py:

+from dataclasses import dataclass, field
+
 @dataclass(frozen=True)
 class PackageCompressionJob:
     """A compression job for a package test."""

     #: The absolute path to the dataset (either a file or directory).
     path_to_original_dataset: Path

     #: Options to specify in the compression command.
-    options: list[str] | None
+    options: list[str] = field(default_factory=list)

     #: Positional arguments to specify in the compression command (do not put paths to compress)
-    positional_args: list[str] | None
+    positional_args: list[str] = field(default_factory=list)

In package_utils.py:

-    if compression_job.options is not None:
-        compress_cmd.extend(compression_job.options)
-
-    if compression_job.positional_args is not None:
-        compress_cmd.extend(compression_job.positional_args)
+    compress_cmd.extend(compression_job.options)
+    compress_cmd.extend(compression_job.positional_args)

Callers would then omit options and positional_args when not needed:

PackageCompressionJob(path_to_original_dataset=some_path)
🤖 Prompt for AI Agents
In `@integration-tests/tests/utils/config.py` around lines 161 - 172, Change
PackageCompressionJob so options and positional_args default to empty lists
using dataclasses.field(default_factory=list) instead of typing them as
list[str] | None; update any consuming code such as
run_package_compression_script to treat these attributes as lists (remove None
checks/conditional extends) and rely on them being empty by default, and adjust
any callers that currently pass None to omit those parameters.
