
Enhance data contract quality rules generation with schema validation support#1043

Open
vb-dbrks wants to merge 11 commits into main from feat-odcs-schema-check

Conversation

@vb-dbrks
Contributor

Changes

  1. Updated DQGenerator to support schema validation rules generation from ODCS contracts, ensuring dataset schemas match contract definitions.
  2. Added generate_schema_validation parameter to generate_rules_from_contract method, defaulting to True.
  3. Enhanced documentation to reflect the new schema validation feature and its requirements.
  4. Introduced InvalidPhysicalTypeError for better error handling when physical types are missing or invalid in schema properties.
  5. Updated integration tests to cover schema validation scenarios, ensuring rules are generated correctly.

Linked issues

Resolves #1016

Tests

  • manually tested
  • added unit tests
  • added integration tests
  • added end-to-end tests
  • added performance tests

Copilot AI (Contributor) left a comment
Pull request overview

This PR enhances the data contract quality rules generation feature to support schema validation based on ODCS (Open Data Contract Standard) v3.x contracts. The changes introduce automatic generation of has_valid_schema rules that ensure dataset schemas match contract definitions, resolving issue #1016.

Changes:

  • Added schema validation rules generation via a new generate_schema_validation parameter (defaults to True)
  • Introduced InvalidPhysicalTypeError for better error handling when physicalType is missing or invalid
  • Updated sample contracts, unit tests, and integration tests to cover schema validation scenarios
  • Enhanced documentation to describe the new feature

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
src/databricks/labs/dqx/errors.py Added InvalidPhysicalTypeError exception class for schema validation errors
src/databricks/labs/dqx/profiler/generator.py Added generate_schema_validation parameter to generate_rules_from_contract method
src/databricks/labs/dqx/datacontract/contract_rules_generator.py Implemented schema validation logic with Unity Catalog type validation and DDL generation
src/databricks/labs/dqx/datacontract/__init__.py Updated module docstring to reflect ODCS v3.x support
tests/unit/test_datacontract_generator.py Added comprehensive unit tests for schema validation including error cases and edge cases
tests/integration/test_datacontract_integration.py Added integration tests covering all code paths for schema validation
tests/datacontract_helpers.py Added helper function to filter schema validation rules
tests/__init__.py Added package initialization file
tests/resources/sample_datacontract.yaml Updated all properties with physicalType (Unity Catalog types) and added comprehensive data types schema
docs/dqx/src/pages/index.tsx Added Data Contract capability to homepage
docs/dqx/docs/reference/profiler.mdx Updated API reference to document generate_schema_validation parameter
docs/dqx/docs/guide/data_contract_quality_rules_generation.mdx Added schema validation section to user guide


@vb-dbrks vb-dbrks requested review from ghanse and removed request for grusin-db February 23, 2026 14:48
@github-actions

github-actions bot commented Feb 23, 2026

✅ 573/573 passed, 3 flaky, 33 skipped, 8h34m16s total

Flaky tests:

  • 🤪 test_e2e_workflow_for_patterns (10.004s)
  • 🤪 test_e2e_workflow_for_patterns_exclude_patterns (6.49s)
  • 🤪 test_e2e_workflow_for_patterns_exclude_output (7.862s)

Running from acceptance #4037

@codecov

codecov bot commented Feb 23, 2026

Codecov Report

❌ Patch coverage is 97.27891% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 91.77%. Comparing base (1b0c113) to head (33e0c10).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
.../labs/dqx/datacontract/contract_rules_generator.py 97.26% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1043      +/-   ##
==========================================
+ Coverage   91.54%   91.77%   +0.22%     
==========================================
  Files          98       98              
  Lines        8945     9091     +146     
==========================================
+ Hits         8189     8343     +154     
+ Misses        756      748       -8     

☔ View full report in Codecov by Sentry.

@ghanse left a comment (Collaborator)
Left a few comments.

}
)
_UNITY_COMPLEX_PREFIXES: tuple[str, ...] = (
"ARRAY<",

"ARRAY<", "MAP<", and "STRUCT<" are unreachable entries here. In _validate_unity_physical_type, all three are handled by their dedicated validators (_validate_array_type, _validate_map_type, _validate_struct_type) before the generic _UNITY_COMPLEX_PREFIXES fallback is reached. The fallback only ever fires for GEOGRAPHY(, GEOMETRY(, and INTERVAL .

Having them in the set is misleading — it implies the fallback path handles them, but they are intercepted earlier. Remove the three dead entries:

_UNITY_COMPLEX_PREFIXES: tuple[str, ...] = (
    "GEOGRAPHY(",
    "GEOMETRY(",
    "INTERVAL ",
)

    odcs: OpenDataContractStandard,
    default_criticality: str,
) -> list[dict]:
    """Generate one dataset-level has_valid_schema rule per ODCS schema. Always strict."""

strict=True means the check will fail if the dataset has any columns not declared in the contract. This is reasonable for strongly-typed contracts, but many real-world tables have additive schema evolution (extra columns are fine, missing required columns are not).

There is currently no way for callers to get non-strict schema validation — even passing strict=False at a higher level is impossible because this is hardcoded.

Consider accepting a strict parameter in _generate_schema_validation_rules_for_schema (and threading it from generate_rules_from_contract), defaulting to True to preserve current behaviour.

vb-dbrks (Contributor, Author) replied:

@mwojtyczka - this one is also another trade-off. In the purist sense, contracts should be enforceable and strict, since that is their true purpose. In the real world, though, I understand that customers might be creating 1 Data Product : N Data Contracts with overlapping schemas, so strict enforcement would become a blocker. But this becomes one more knob. Any thoughts?

Collaborator replied:
We should at least generate the checks with a strict parameter for clarity. The user can choose whether to customize the generated check for permissive or strict validation. We could also log a message about the chosen schema validation mode.

def _temp_contract_path_with_content(content: str) -> str:
    """Write raw content to a temp YAML file and return its path. Caller must os.unlink(path)."""
    with tempfile.NamedTemporaryFile(mode="w", suffix=".yaml", delete=False) as f:
        f.write(content)

The "Caller must os.unlink(path)" comment documents a resource-leak footgun. Looking at the one call site (test_error_invalid_contract_raises_odcs_error), it correctly wraps cleanup in try/finally, but this is easy to forget in future tests.

Compare with _generate_rules_from_temp_contract just above, which handles cleanup internally. A context manager or fixture achieves the same without pushing cleanup responsibility to callers:

import contextlib
import os
import tempfile


@contextlib.contextmanager
def _temp_contract_file(content: str):
    with tempfile.NamedTemporaryFile(mode="w", suffix=".yaml", delete=False) as f:
        f.write(content)
        path = f.name
    try:
        yield path
    finally:
        try:
            os.unlink(path)
        except OSError:
            pass

vb-dbrks added 10 commits March 3, 2026 10:54
… support

Changes
- Updated `DQGenerator` to support schema validation rules generation from ODCS contracts, ensuring dataset schemas match contract definitions.
- Added `generate_schema_validation` parameter to `generate_rules_from_contract` method, defaulting to `True`.
- Enhanced documentation to reflect the new schema validation feature and its requirements.
- Introduced `InvalidPhysicalTypeError` for better error handling when physical types are missing or invalid in schema properties.
- Updated integration tests to cover schema validation scenarios, ensuring rules are generated correctly.

Tests
- [x] manually tested
- [x] added unit tests
- [x] added integration tests
Changes:
- Updated the regex pattern for DECIMAL type validation to capture precision and scale.
- Modified the validation logic to return a normalized DECIMAL format without spaces in the generated DDL.
- Added a unit test to ensure DECIMAL types with spaces are correctly normalized in schema validation.

Tests:
- [x] added unit tests for DECIMAL normalization
Changes:
- Updated logic to backtick-escape column names with special characters or those starting with a digit, adhering to Databricks/ANSI identifier rules.
- Modified unit test to validate backtick-escaping for new cases, including identifiers starting with a digit.

Tests:
- [x] updated unit tests for special character handling
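The escaping rule this commit describes can be sketched as below, assuming the usual Databricks/ANSI convention that unquoted identifiers match `[A-Za-z_][A-Za-z0-9_]*` and embedded backticks are doubled; the helper name is hypothetical.

```python
import re

# Identifiers that need no quoting: letters/underscores/digits, not digit-leading
_SAFE_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def escape_column_name(name: str) -> str:
    if _SAFE_IDENTIFIER.match(name):
        return name
    # Backtick-quote, doubling any embedded backticks per the quoting rules
    return f"`{name.replace('`', '``')}`"
```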
…ulesGenerator

Changes:
- Added a note in the documentation to clarify that schema validation requires a `physicalType` field for each property in the contract.
- Updated the logic in `DataContractRulesGenerator` to ensure that complex physical types are normalized to uppercase for consistent DDL generation.
- Introduced a new unit test to validate the normalization of complex types during schema validation.

Tests:
- [x] added unit tests for schema validation and complex type normalization
Changes:
- Modified the physicalType values in the schema properties of `dqx_demo_datacontract_odcs.py` to use uppercase formats (e.g., STRING, DATE, INT, DECIMAL) for consistency across the data contract definitions.

Tests:
- [x] verified schema integrity after updates
Changes:
- Clarified the description of generated rule types in `dqx_demo_datacontract_odcs.py` and `data_contract_quality_rules_generation.mdx`.
- Adjusted language for consistency and clarity, including updates to the phrasing of rule types and the benefits of data contract-based generation.

Tests:
- [x] reviewed documentation for accuracy and clarity
…rror handling

Changes:
- Introduced methods for extracting and validating complex types (ARRAY, MAP, STRUCT) in the DataContractRulesGenerator.
- Implemented recursive validation for physical types, ensuring compliance with Spark's limits on DECIMAL precision and scale.
- Added error handling for unmatched angle brackets and invalid inner types, raising InvalidPhysicalTypeError where appropriate.
- Updated integration and unit tests to cover new validation scenarios, including nested types and DECIMAL constraints.

Tests:
- [x] added unit tests for complex type validation
- [x] updated integration tests for schema validation scenarios
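The recursive validation this commit describes could be sketched as follows. This handles only ARRAY for brevity (the PR also covers MAP and STRUCT), uses a hypothetical primitive set, and enforces Spark's DECIMAL limits (precision at most 38, scale at most precision).

```python
import re


class InvalidPhysicalTypeError(ValueError):
    """Raised when a physicalType is missing, malformed, or out of range."""


_PRIMITIVES = {"STRING", "INT", "BIGINT", "DOUBLE", "BOOLEAN", "DATE", "TIMESTAMP"}
_DECIMAL_RE = re.compile(r"^DECIMAL\((\d+),(\d+)\)$")


def validate_type(t: str) -> None:
    t = t.strip().upper().replace(" ", "")
    if t in _PRIMITIVES:
        return
    m = _DECIMAL_RE.match(t)
    if m:
        precision, scale = int(m.group(1)), int(m.group(2))
        # Spark limits: precision <= 38 and scale <= precision
        if precision > 38 or scale > precision:
            raise InvalidPhysicalTypeError(f"DECIMAL out of range: {t}")
        return
    if t.startswith("ARRAY<"):
        if not t.endswith(">"):
            raise InvalidPhysicalTypeError(f"unmatched angle bracket: {t}")
        validate_type(t[len("ARRAY<"):-1])  # recurse into the element type
        return
    raise InvalidPhysicalTypeError(f"unknown physical type: {t}")
```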
… update documentation

Changes:
- Removed the `generate_schema_validation` parameter from the `generate_rules_from_contract` method, ensuring that schema validation rules are always generated when a schema is defined in the contract.
- Updated documentation to clarify that schema validation rules are generated by default and to enhance descriptions related to schema validation.
- Adjusted integration and unit tests to reflect the changes in schema validation behavior, ensuring that rules are generated correctly without the need for the parameter.

Tests:
- [x] updated unit tests for schema validation
- [x] modified integration tests to validate new behavior
…est.py

Changes:
- Removed `get_schema_validation_rules` from `datacontract_helpers.py` and added it to `conftest.py` for better organization.
- Updated import statements in integration and unit tests to reflect the new location of the `get_schema_validation_rules` function.

Tests:
- [x] verified that tests pass with the updated imports


Development

Successfully merging this pull request may close these issues.

[FEATURE]: Data Contract Implicit Checks to Include Schema Validation

4 participants