Add tests for the multi-table synthesizer code #69

lotif · 2025-10-22T21:19:44Z

PR Type

Fix

Short Description

Clickup Ticket(s): https://app.clickup.com/t/868fuke6e

As I started to refactor the functions in models/clavaddpm/synthesizer.py, I noticed some of the functions are never executed in our tests because the current synthesizer tests only check for single table.

Here, I am adding the test also for the multi-table synthesizer code so I can refactor it safely.

Also being done here:

Better way of checking if a test is being run in github actions by checking an environment variable that only exists there
A few fixes to the current synthesizer code
Fixing the test data to include a missing table relation that is important for the synthesizer, and re-generating and uploading the assertion data from github because it affects the results of existing tests

Tests Added

Fixed some existing tests
Added tests/integration/models/clavaddpm/test_synthesizer.py::test_clava_syntheesize_multi_table

Summary by CodeRabbit

Bug Fixes
- Table cleaning now removes all columns with names containing "_id" when producing cleaned outputs.
Tests
- Integration tests now detect CI and use stricter assertions on CI; non‑CI runs emit warnings and use looser tolerances.
- Added an end‑to‑end multi‑table synthesis integration test and updated/standardized related test fixtures and expectations to include a placeholder relation.
- Updated tests to account for the placeholder column in expectations.
Chores
- CI workflows and test runs now set pytest logging to WARNING and perform an environment check to detect CI.

…efault class method and update function signatures to use Transformations directly."

…for better type safety and clarity in clustering function signatures.

…afety and streamline model retrieval in fine-tuning and training functions.

…arameters dataclasses for improved structure and type safety in model configuration across fine-tuning and training functions."

…YCond enum in dataset, model, and training modules for improved type safety and clarity in handling y column conditions."

…eplace string literals for loss type specification in fine-tuning and training modules, enhancing type safety and clarity.

…ng literals for scheduler specification in fine-tuning and training modules, enhancing type safety and clarity.

…sSecondMomentResampler to accept num_timesteps directly, replacing the diffusion object dependency, and enhance the ScheduleSampler enum with a method for creating samplers."

…ical_forward_backward_log and _compute_top_k functions to utilize the new ReductionMethod enum for improved type safety and consistency.

…e to the gaussian diffusion file

…o marcelo/classes-and-enums-2

lotif · 2025-10-24T17:57:42Z

.github/workflows/integration_tests.yml


+      - name: Is running on CI environment (GitHub Actions)?
+        run: |
+          python -c "import os; print('Result: ', os.getenv('GITHUB_ACTIONS', 'Not set'))"


Adding this just so we can see if this environment variable is set in case we need to debug it later.

lotif · 2025-10-24T17:58:15Z

src/midst_toolkit/models/clavaddpm/synthesizer.py

+
+        before_matching_dir = save_dir / "before_matching"
+        before_matching_dir.mkdir(parents=True, exist_ok=True)
+        with open(before_matching_dir / "synthetic_tables.pkl", "wb") as file:


Making sure the directory exists before saving a file in it.

src/midst_toolkit/models/clavaddpm/synthesizer.py

lotif · 2025-10-24T17:59:53Z

tests/integration/assets/multi_table/dataset_meta.json

@@ -1,5 +1,6 @@
 {
    "relation_order": [
+        [null, "account"],


This relation order was missing and it is needed for the synthesizer code. This is also the reason why the assertion data had to be re-generated and re-uploaded.

lotif · 2025-10-24T18:00:51Z

tests/integration/models/clavaddpm/test_model.py

    # In the else block, we set a tolerance that would work across platforms
    # however, it is way too high of a tolerance.
-    if torch.allclose(model_data[model_layers[0]], expected_model_data[expected_model_layers[0]]):
+    if is_running_on_ci_environment():


This is a better check for those if conditions.

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tests/integration/models/clavaddpm/test_model.py (1)
278-278: Fix inconsistent filename spelling.

This line still references syntetic_data.json (with typo), while line 336 was corrected to synthetic_data.json. Both tests should use the correctly spelled filename.

Apply this diff to fix the inconsistency:
-    with open("tests/integration/assets/single_table/assertion_data/syntetic_data.json", "r") as f:
+    with open("tests/integration/assets/single_table/assertion_data/synthetic_data.json", "r") as f:

🧹 Nitpick comments (8)

.github/workflows/integration_tests.yml (1)
61-64: CI env echo is fine; simpler shell is enough.

You can avoid spinning up Python here.

Apply:
-      - name: Is running on CI environment (GitHub Actions)?
-        run: |
-          python -c "import os; print('Result: ', os.getenv('GITHUB_ACTIONS', 'Not set'))"
+      - name: Is running on CI environment (GitHub Actions)?
+        run: echo "Result: ${GITHUB_ACTIONS:-Not set}"
src/midst_toolkit/models/clavaddpm/synthesizer.py (1)
790-793: Directory creation before persisting: good; prefer explicit pickle protocol.

Use highest protocol for speed/size.
-        with open(before_matching_dir / "synthetic_tables.pkl", "wb") as file:
-            pickle.dump(synthetic_tables, file)
+        with open(before_matching_dir / "synthetic_tables.pkl", "wb") as file:
+            pickle.dump(synthetic_tables, file, protocol=pickle.HIGHEST_PROTOCOL)
tests/integration/utils.py (1)
4-11: CI detection works; small robustness tweak.

Make it case-insensitive to avoid surprises.
-    return os.getenv("GITHUB_ACTIONS", "false") == "true"
+    return os.getenv("GITHUB_ACTIONS", "false").lower() == "true"
tests/integration/models/clavaddpm/test_synthesizer.py (2)
66-66: Minor: fix test name typo.

Keeps repo tidy and aids discovery.
-def test_clava_syntheesize_multi_table(tmp_path: Path):
+def test_clava_synthesize_multi_table(tmp_path: Path):
73-86: Unpack training return to improve readability and ensure updated tables are used.

The return type is tuple[Tables, dict[...]]. Unpacking avoids tuple indexing and ensures the updated tables from training (with added placeholder column for root tables) is passed to synthesis, rather than the original tables.
-    models = clava_training(tables, relation_order, tmp_path, DIFFUSION_CONFIG, CLASSIFIER_CONFIG, device=DEVICE)
+    tables, models = clava_training(tables, relation_order, tmp_path, DIFFUSION_CONFIG, CLASSIFIER_CONFIG, device=DEVICE)
@@
-        models[1],
+        models,
The DEVICE variable already correctly falls back to "cpu" on systems without CUDA support.
tests/integration/models/clavaddpm/test_model.py (3)
293-307: Consider using pytest warnings for better visibility.

The CI-aware testing approach is reasonable for handling platform differences. However, logging.warning may not be visible in test output depending on the test runner configuration.

Consider using pytest's warning mechanism for better visibility:
-        logging.warning("Not running on CI, assertions are made with a higher tolerance.")
+        pytest.warns(UserWarning, match="Not running on CI, assertions are made with a higher tolerance.")
Or use pytest.skip with a reason if the assertions should truly be skipped on non-CI environments:
    if not is_running_on_ci_environment():
        pytest.skip("Strict assertions only run on CI due to platform differences")
350-365: CI-aware assertions follow consistent pattern.

Moving model_layers extraction before the conditional block avoids duplication. The same recommendation about using pytest warnings (from lines 293-307) applies here as well.

385-389: Conditional sample assertions completely skipped on non-CI.

Unlike previous CI-aware checks that use looser tolerances, this block completely skips assertions on non-CI environments, reducing test coverage.

Consider adding at least a basic sanity check on non-CI:
     else:
         logging.warning("Not running on CI, skipping detailed assertions.")
+        # Basic sanity checks even on non-CI
+        assert conditional_sample.shape == expected_conditional_sample.shape
+        assert not torch.isnan(conditional_sample).any()
+        assert not torch.isinf(conditional_sample).any()
This ensures the operation doesn't produce invalid results even if exact values differ across platforms.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 75b33f1 and 503e800.

⛔ Files ignored due to path filters (2)

tests/integration/assets/multi_table/assertion_data/cleaned_tables.pkl is excluded by !**/*.pkl
tests/integration/assets/multi_table/assertion_data/diffusion_parameters.pkl is excluded by !**/*.pkl

📒 Files selected for processing (8)

.github/workflows/integration_tests.yml (1 hunks)
src/midst_toolkit/models/clavaddpm/synthesizer.py (2 hunks)
tests/integration/assets/multi_table/assertion_data/syntetic_data.json (0 hunks)
tests/integration/assets/multi_table/assertion_data/synthetic_data.json (1 hunks)
tests/integration/assets/multi_table/dataset_meta.json (1 hunks)
tests/integration/models/clavaddpm/test_model.py (10 hunks)
tests/integration/models/clavaddpm/test_synthesizer.py (1 hunks)
tests/integration/utils.py (1 hunks)

💤 Files with no reviewable changes (1)

tests/integration/assets/multi_table/assertion_data/syntetic_data.json

🧰 Additional context used

🧬 Code graph analysis (2)

tests/integration/models/clavaddpm/test_synthesizer.py (6)

src/midst_toolkit/common/random.py (2)

set_all_random_seeds (11-55)

unset_all_random_seeds (58-67)

src/midst_toolkit/models/clavaddpm/clustering.py (1)

clava_clustering (28-85)

src/midst_toolkit/models/clavaddpm/data_loaders.py (1)

load_multi_table (18-75)

src/midst_toolkit/models/clavaddpm/synthesizer.py (1)

clava_synthesizing (652-829)

src/midst_toolkit/models/clavaddpm/train.py (1)

clava_training (46-138)

tests/integration/utils.py (1)

is_running_on_ci_environment (4-11)

tests/integration/models/clavaddpm/test_model.py (2)

tests/integration/utils.py (1)

is_running_on_ci_environment (4-11)

src/midst_toolkit/models/clavaddpm/gaussian_multinomial_diffusion.py (1)

conditional_sample (1307-1355)

🪛 Ruff (0.14.1)

tests/integration/models/clavaddpm/test_synthesizer.py

93-95: pickle and modules that wrap it can be unsafe when used to deserialize untrusted data, possible security issue

(S301)

🔇 Additional comments (8)

tests/integration/assets/multi_table/dataset_meta.json (1)

3-3: Relation root entry looks correct.

JSON null mapping to Python None aligns with synthesizer logic for root tables.

tests/integration/assets/multi_table/assertion_data/synthetic_data.json (1)

1-51: Fixture rename and content LGTM.

No issues spotted.

src/midst_toolkit/models/clavaddpm/synthesizer.py (1)

806-811: Verify if _synthetic_with_id.csv output is required by external systems before deciding.

The concern is valid: after removing "_id" columns at lines 806-810, the condition at line 820 (if f"{cleaned_key}_id" in cleaned_val.columns) will always be false, and _synthetic_with_id.csv will no longer be written.

No consumers of _synthetic_with_id.csv found within the repository. However, this doesn't rule out external dependencies or undocumented usages. If backward compatibility is required, the proposed refactor is technically sound—final_tables remains available and contains "_id" columns, so both file variants can be written as suggested.

tests/integration/models/clavaddpm/test_synthesizer.py (1)

17-21: No issues found—clustering_method support is correct.

Verification confirms the implementation fully supports "kmeans_and_gmm". The enum ClusteringMethod.KMEANS_AND_GMM exists with this exact value, it's documented in the _run_clustering function docstring (line 125), and the implementation has explicit handling for it (clustering.py line 535). The test configuration is valid and consistent with the codebase.

Likely an incorrect or invalid review comment.

tests/integration/models/clavaddpm/test_model.py (4)

2-2: LGTM! Imports support CI-aware testing.

The logging and CI detection imports are properly utilized throughout the test file to enable environment-aware assertions.

Also applies to: 19-19

245-246: LGTM! Relation order correctly reflects multi-table hierarchy.

The updated assertions properly capture the hierarchical structure where account is the root table and trans is its child.

336-336: LGTM! Typo fixed in filename.

The correction from syntetic_data.json to synthetic_data.json is good, but ensure all references are updated (see comment on line 278).

404-404: The placeholder column is guaranteed to exist for the account table.

Based on the test data structure, relation_order is [[None, "account"], ["account", "trans"]], making account a top-level table (parent=None). Since the clava_clustering function only adds the placeholder column to tables where parent is None (as confirmed in clustering.py:83), the placeholder column will always be present for the account table at line 404. The code is correct and safe; no defensive error handling is needed.

tests/integration/models/clavaddpm/test_synthesizer.py

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

tests/integration/models/clavaddpm/test_synthesizer.py (1)
91-96: Acknowledged: pickle usage flagged in past review.

This was already noted in previous review comments. The pickle.loads pattern is acceptable for repository-controlled test fixtures but carries security risks with untrusted data.

Based on past review comments, consider refactoring to use pickle.load with a file handle:
-        expected_cleaned_tables = pickle.loads(
-            Path("tests/integration/assets/multi_table/assertion_data/cleaned_tables.pkl").read_bytes(),
-        )
+        with open("tests/integration/assets/multi_table/assertion_data/cleaned_tables.pkl", "rb") as f:
+            expected_cleaned_tables = pickle.load(f)

🧹 Nitpick comments (1)

tests/integration/models/clavaddpm/test_synthesizer.py (1)
72-72: Clarify the tuple unpacking for better readability.

The code works correctly but is confusing: models at line 72 receives the full tuple (tables, models_dict) from clava_training, then line 83 uses models[1] to extract the models dictionary. This makes the variable name models misleading.

Apply this diff for clearer code:
-    models = clava_training(tables, relation_order, tmp_path, DIFFUSION_CONFIG, CLASSIFIER_CONFIG, device=DEVICE)
+    tables, models = clava_training(tables, relation_order, tmp_path, DIFFUSION_CONFIG, CLASSIFIER_CONFIG, device=DEVICE)
 
     # TODO: Temporary, we should refactor those configs
     configs = deepcopy(SYNTHESIZING_CONFIG)
     configs["general"]["workspace_dir"] = str(tmp_path)
 
     cleaned_tables, _, _ = clava_synthesizing(
         tables,
         relation_order,
         tmp_path,
         all_group_lengths_prob_dicts,
-        models[1],
+        models,
         configs,
     )
Also applies to: 78-85

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ef0c2a1 and 5a9bc1e.

📒 Files selected for processing (3)

tests/integration/models/clavaddpm/test_model.py (10 hunks)
tests/integration/models/clavaddpm/test_synthesizer.py (1 hunks)
tests/integration/utils.py (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (2)

tests/integration/models/clavaddpm/test_model.py
tests/integration/utils.py

🧰 Additional context used

🧬 Code graph analysis (1)

tests/integration/models/clavaddpm/test_synthesizer.py (6)

src/midst_toolkit/common/random.py (2)

set_all_random_seeds (11-55)

unset_all_random_seeds (58-67)

src/midst_toolkit/models/clavaddpm/clustering.py (1)

clava_clustering (28-85)

src/midst_toolkit/models/clavaddpm/data_loaders.py (1)

load_multi_table (18-75)

src/midst_toolkit/models/clavaddpm/synthesizer.py (1)

clava_synthesizing (652-829)

src/midst_toolkit/models/clavaddpm/train.py (1)

clava_training (46-138)

tests/integration/utils.py (1)

is_running_on_ci_environment (4-11)

🪛 Ruff (0.14.1)

tests/integration/models/clavaddpm/test_synthesizer.py

92-94: pickle and modules that wrap it can be unsafe when used to deserialize untrusted data, possible security issue

(S301)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)

GitHub Check: run-code-check
GitHub Check: integration-tests
GitHub Check: unit-tests

🔇 Additional comments (6)

tests/integration/models/clavaddpm/test_synthesizer.py (6)

1-13: LGTM!

Imports are well-organized and all appear necessary for the integration test.

16-61: LGTM!

Configuration constants are well-structured and appropriate for integration testing. The workspace_dir: None placeholder is correctly set to tmp_path later in the test.

64-68: LGTM!

Excellent use of deterministic seeds and tmp_path for reproducible, isolated testing.

70-76: LGTM!

The workflow setup correctly loads data, performs clustering, and configures the synthesizer with proper test isolation using tmp_path.

88-89: LGTM!

Shape assertions provide a basic sanity check for the synthesized tables.

101-101: LGTM!

Proper cleanup of random seeds after test execution.

tests/integration/models/clavaddpm/test_synthesizer.py

emersodb

Overall, this looks good to me.

src/midst_toolkit/models/clavaddpm/synthesizer.py

tests/integration/assets/multi_table/assertion_data/synthetic_data.json

… by changing the log level in the workflow

lotif and others added 30 commits September 29, 2025 11:18

Merge "Refactor Transformations handling: replace get_T_dict with a d…

c059d93

…efault class method and update function signatures to use Transformations directly."

Refactor clustering method handling: Introduce ClusteringMethod enum …

9f0290a

…for better type safety and clarity in clustering function signatures.

Refactor model handling: Introduce ModelType enum for improved type s…

80073f2

…afety and streamline model retrieval in fine-tuning and training functions.

Merge "Refactor model parameters: Introduce ModelParameters and RTDLP…

1d43783

…arameters dataclasses for improved structure and type safety in model configuration across fine-tuning and training functions."

Merge "Refactor y condition handling: Replace string literals with Is…

f7d69ec

…YCond enum in dataset, model, and training modules for improved type safety and clarity in handling y column conditions."

Refactor Gaussian loss handling: Introduce GaussianLossType enum to r…

01ed9c1

…eplace string literals for loss type specification in fine-tuning and training modules, enhancing type safety and clarity.

Refactor scheduler handling: Introduce Scheduler enum to replace stri…

0dbc6b4

…ng literals for scheduler specification in fine-tuning and training modules, enhancing type safety and clarity.

Merge "Refactor sampler initialization: Update UniformSampler and Los…

128da65

…sSecondMomentResampler to accept num_timesteps directly, replacing the diffusion object dependency, and enhance the ScheduleSampler enum with a method for creating samplers."

Enhance metric and loss handling: Refactor loss computation in _numer…

f158b44

…ical_forward_backward_log and _compute_top_k functions to utilize the new ReductionMethod enum for improved type safety and consistency.

Transforming a lot of literals into enums

6503788

WIP renaming RTDL, cat and num and data splits

d599334

Using more data splits and adding types for gaussian parametrization

1950f5c

Adding enum for YType

3e8237c

Merge branch 'main' into marcelo/classes-and-enums-2

c957dec

Renaming Scheduler to SchedulerType and moving it and GaussianLossTyp…

4d0707b

…e to the gaussian diffusion file

Merge remote-tracking branch 'origin/marcelo/classes-and-enums-2' int…

b7db96e

…o marcelo/classes-and-enums-2

WIP CR by David

30d0a0d

Merge branch 'main' into marcelo/classes-and-enums-2

779b108

Cont'd CR comments by David

0e3a42a

Merge remote-tracking branch 'origin/marcelo/classes-and-enums-2' int…

8600ccb

…o marcelo/classes-and-enums-2

Adding TODO

bc67266

WIp starting the breakdown

774e99b

Renames

1168e93

Last breakdown

bf05c3c

Removing ignore

d90ed2c

Finished refactoring

96414be

Merge branch 'main' into marcelo/remove-ignores

aafd66c

Renamings, mostly

594d9cd

More enums

3a2b203

Adding datasplits class

972947d

lotif added 3 commits October 24, 2025 13:07

equals on each key instead

b06db63

Uploading updated tensors for github

a3a439c

Removing file upload

503e800

lotif changed the title ~~Marcelo/add syth multi table test~~ Add tests for the multi-table synthesizer code Oct 24, 2025

lotif commented Oct 24, 2025

View reviewed changes

src/midst_toolkit/models/clavaddpm/synthesizer.py Show resolved Hide resolved

lotif commented Oct 24, 2025

View reviewed changes

lotif requested review from ElahehBassak, amrit110, bzamanlooy, emersodb, fatemetkl, masi-sh and sarakodeiri October 24, 2025 18:01

lotif marked this pull request as ready for review October 24, 2025 18:01

coderabbitai bot reviewed Oct 24, 2025

View reviewed changes

tests/integration/models/clavaddpm/test_synthesizer.py Show resolved Hide resolved

lotif added 2 commits October 24, 2025 14:24

Addressing coderabbit comment

ef0c2a1

More code comments by coderabbit

5a9bc1e

coderabbitai bot reviewed Oct 24, 2025

View reviewed changes

tests/integration/models/clavaddpm/test_synthesizer.py Outdated Show resolved Hide resolved

emersodb approved these changes Oct 24, 2025

View reviewed changes

src/midst_toolkit/models/clavaddpm/synthesizer.py Show resolved Hide resolved

tests/integration/assets/multi_table/assertion_data/synthetic_data.json Show resolved Hide resolved

lotif and others added 4 commits October 24, 2025 16:47

Changing to log warning and making sure it appears in the github logs…

7e1e07f

… by changing the log level in the workflow

Merge branch 'main' into marcelo/add-syth-multi-table-test

bbfd6b7

better way of getting ID columns

88f1d8c

Didn't work, applying David's suggestion instead

b8e9990

lotif requested a review from emersodb October 27, 2025 15:55

lotif merged commit dabf10a into main Oct 27, 2025
6 of 7 checks passed

lotif deleted the marcelo/add-syth-multi-table-test branch October 27, 2025 16:04

lotif restored the marcelo/add-syth-multi-table-test branch October 27, 2025 16:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add tests for the multi-table synthesizer code #69

Add tests for the multi-table synthesizer code #69

Uh oh!

lotif commented Oct 22, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

lotif Oct 24, 2025

Uh oh!

lotif Oct 24, 2025

Uh oh!

Uh oh!

lotif Oct 24, 2025

Uh oh!

lotif Oct 24, 2025

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

emersodb left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Add tests for the multi-table synthesizer code #69

Add tests for the multi-table synthesizer code #69

Uh oh!

Conversation

lotif commented Oct 22, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Type

Short Description

Tests Added

Summary by CodeRabbit

Uh oh!

lotif Oct 24, 2025

Choose a reason for hiding this comment

Uh oh!

lotif Oct 24, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

lotif Oct 24, 2025

Choose a reason for hiding this comment

Uh oh!

lotif Oct 24, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

emersodb left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

lotif commented Oct 22, 2025 •

edited by coderabbitai bot

Loading