Enhance data contract quality rules generation with schema validation support#1043
Conversation
Pull request overview
This PR enhances the data contract quality rules generation feature to support schema validation based on ODCS (Open Data Contract Standard) v3.x contracts. The changes introduce automatic generation of has_valid_schema rules that ensure dataset schemas match contract definitions, resolving issue #1016.
Changes:
- Added schema validation rules generation via a new `generate_schema_validation` parameter (defaults to `True`)
- Introduced `InvalidPhysicalTypeError` for better error handling when `physicalType` is missing or invalid
- Updated sample contracts, unit tests, and integration tests to cover schema validation scenarios
- Enhanced documentation to describe the new feature
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| src/databricks/labs/dqx/errors.py | Added InvalidPhysicalTypeError exception class for schema validation errors |
| src/databricks/labs/dqx/profiler/generator.py | Added generate_schema_validation parameter to generate_rules_from_contract method |
| src/databricks/labs/dqx/datacontract/contract_rules_generator.py | Implemented schema validation logic with Unity Catalog type validation and DDL generation |
| src/databricks/labs/dqx/datacontract/__init__.py | Updated module docstring to reflect ODCS v3.x support |
| tests/unit/test_datacontract_generator.py | Added comprehensive unit tests for schema validation including error cases and edge cases |
| tests/integration/test_datacontract_integration.py | Added integration tests covering all code paths for schema validation |
| tests/datacontract_helpers.py | Added helper function to filter schema validation rules |
| tests/__init__.py | Added package initialization file |
| tests/resources/sample_datacontract.yaml | Updated all properties with physicalType (Unity Catalog types) and added comprehensive data types schema |
| docs/dqx/src/pages/index.tsx | Added Data Contract capability to homepage |
| docs/dqx/docs/reference/profiler.mdx | Updated API reference to document generate_schema_validation parameter |
| docs/dqx/docs/guide/data_contract_quality_rules_generation.mdx | Added schema validation section to user guide |
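The core flow the table above describes, reading ODCS schema properties and emitting a dataset-level `has_valid_schema` rule with a DDL-style expected schema, can be sketched in plain Python. This is a hypothetical simplification (`build_ddl_schema`, `make_schema_rule`, and the rule-dict shape are illustrative, not the PR's exact API); the real implementation lives in `src/databricks/labs/dqx/datacontract/contract_rules_generator.py`:

```python
class InvalidPhysicalTypeError(ValueError):
    """Raised when a contract property is missing a usable physicalType."""


def build_ddl_schema(properties: list[dict]) -> str:
    """Build a Spark DDL schema string from ODCS schema properties."""
    columns = []
    for prop in properties:
        physical_type = prop.get("physicalType")
        if not physical_type:
            raise InvalidPhysicalTypeError(
                f"Property {prop.get('name')!r} has no physicalType"
            )
        # Normalize to uppercase for consistent DDL generation.
        columns.append(f"{prop['name']} {physical_type.upper()}")
    return ", ".join(columns)


def make_schema_rule(schema_name: str, properties: list[dict], criticality: str = "error") -> dict:
    """Emit one dataset-level has_valid_schema rule for an ODCS schema (illustrative shape)."""
    return {
        "name": f"{schema_name}_has_valid_schema",
        "criticality": criticality,
        "check": {
            "function": "has_valid_schema",
            "arguments": {"expected_schema": build_ddl_schema(properties), "strict": True},
        },
    }


rule = make_schema_rule("orders", [
    {"name": "order_id", "physicalType": "bigint"},
    {"name": "amount", "physicalType": "decimal(10,2)"},
])
print(rule["check"]["arguments"]["expected_schema"])  # order_id BIGINT, amount DECIMAL(10,2)
```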
Two resolved review comments on src/databricks/labs/dqx/datacontract/contract_rules_generator.py (marked outdated).
✅ 573/573 passed, 3 flaky, 33 skipped, 8h34m16s total
Running from acceptance #4037
Codecov Report: ❌ Patch coverage is
Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1043      +/-   ##
==========================================
+ Coverage   91.54%   91.77%   +0.22%
==========================================
  Files          98       98
  Lines        8945     9091     +146
==========================================
+ Hits         8189     8343     +154
+ Misses        756      748       -8
```
Two more resolved review comments on src/databricks/labs/dqx/datacontract/contract_rules_generator.py (marked outdated).
Review context (src/databricks/labs/dqx/datacontract/contract_rules_generator.py):

```python
_UNITY_COMPLEX_PREFIXES: tuple[str, ...] = (
    "ARRAY<",
    # ... snippet truncated in the review view
```
"ARRAY<", "MAP<", and "STRUCT<" are unreachable entries here. In _validate_unity_physical_type, all three are handled by their dedicated validators (_validate_array_type, _validate_map_type, _validate_struct_type) before the generic _UNITY_COMPLEX_PREFIXES fallback is reached. The fallback only ever fires for GEOGRAPHY(, GEOMETRY(, and INTERVAL .
Having them in the tuple is misleading: it implies the fallback path handles them, but they are intercepted earlier. Remove the three dead entries:
```python
_UNITY_COMPLEX_PREFIXES: tuple[str, ...] = (
    "GEOGRAPHY(",
    "GEOMETRY(",
    "INTERVAL ",
)
```

Review context of the next thread (same file):

```python
    odcs: OpenDataContractStandard,
    default_criticality: str,
) -> list[dict]:
    """Generate one dataset-level has_valid_schema rule per ODCS schema. Always strict."""
```
`strict=True` means the check will fail if the dataset has any columns not declared in the contract. This is reasonable for strongly typed contracts, but many real-world tables have additive schema evolution (extra columns are fine; missing required columns are not).

There is currently no way for callers to get non-strict schema validation; even passing `strict=False` at a higher level is impossible because this is hardcoded.

Consider accepting a `strict` parameter in `_generate_schema_validation_rules_for_schema` (and threading it from `generate_rules_from_contract`), defaulting to `True` to preserve current behaviour.
@mwojtyczka - this one is another trade-off. In the purist sense, contracts should be enforceable and strict, as that is their true purpose. But in the real world, I understand that customers might create 1 Data Product : N Data Contracts with overlapping schemas, so strict enforcement would become a blocker. It does add one more knob, though. Any thoughts?
We should at least generate the checks with a strict parameter for clarity. The user can choose whether to customize the generated check for permissive or strict validation. We could also log a message about the chosen schema validation mode.
Review context (tests):

```python
def _temp_contract_path_with_content(content: str) -> str:
    """Write raw content to a temp YAML file and return its path. Caller must os.unlink(path)."""
    with tempfile.NamedTemporaryFile(mode="w", suffix=".yaml", delete=False) as f:
        f.write(content)
```
The "Caller must os.unlink(path)" comment documents a resource-leak footgun. Looking at the one call site (test_error_invalid_contract_raises_odcs_error), it correctly wraps cleanup in try/finally, but this is easy to forget in future tests.
Compare with _generate_rules_from_temp_contract just above, which handles cleanup internally. A context manager or fixture achieves the same without pushing cleanup responsibility to callers:
```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def _temp_contract_file(content: str):
    with tempfile.NamedTemporaryFile(mode="w", suffix=".yaml", delete=False) as f:
        f.write(content)
    path = f.name
    try:
        yield path
    finally:
        try:
            os.unlink(path)
        except OSError:
            pass
```

… support

Changes:
- Updated `DQGenerator` to support schema validation rules generation from ODCS contracts, ensuring dataset schemas match contract definitions.
- Added `generate_schema_validation` parameter to the `generate_rules_from_contract` method, defaulting to `True`.
- Enhanced documentation to reflect the new schema validation feature and its requirements.
- Introduced `InvalidPhysicalTypeError` for better error handling when physical types are missing or invalid in schema properties.
- Updated integration tests to cover schema validation scenarios, ensuring rules are generated correctly.

Tests:
- [x] manually tested
- [x] added unit tests
- [x] added integration tests
Changes:
- Updated the regex pattern for DECIMAL type validation to capture precision and scale.
- Modified the validation logic to return a normalized DECIMAL format without spaces in the generated DDL.
- Added a unit test to ensure DECIMAL types with spaces are correctly normalized in schema validation.

Tests:
- [x] added unit tests for DECIMAL normalization
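A regex along the lines this commit describes, capturing precision and scale and emitting `DECIMAL(p,s)` without spaces, might look like the following (a hypothetical helper, not the PR's exact pattern):

```python
import re

# Capture precision and scale, tolerating spaces around the parts.
_DECIMAL_RE = re.compile(r"^DECIMAL\s*\(\s*(\d+)\s*,\s*(\d+)\s*\)$", re.IGNORECASE)


def normalize_decimal(physical_type: str) -> str:
    """Normalize e.g. 'decimal( 10 , 2 )' to 'DECIMAL(10,2)' for DDL generation."""
    match = _DECIMAL_RE.match(physical_type.strip())
    if not match:
        return physical_type.upper()  # non-DECIMAL types: just uppercase
    precision, scale = match.groups()
    return f"DECIMAL({precision},{scale})"
```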
Changes:
- Updated logic to backtick-escape column names with special characters or those starting with a digit, adhering to Databricks/ANSI identifier rules.
- Modified unit test to validate backtick-escaping for new cases, including identifiers starting with a digit.

Tests:
- [x] updated unit tests for special character handling
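The escaping rule this commit describes, backticking names with special characters or a leading digit, can be sketched as follows (hypothetical helper; the real logic lives in the generator):

```python
import re

# Plain ANSI identifiers: a letter or underscore followed by letters, digits, underscores.
_PLAIN_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def escape_column_name(name: str) -> str:
    """Backtick-escape identifiers that are not plain ANSI identifiers."""
    if _PLAIN_IDENTIFIER.match(name):
        return name
    # Escape embedded backticks by doubling them, per Databricks SQL quoting rules.
    return "`" + name.replace("`", "``") + "`"
```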
…ulesGenerator

Changes:
- Added a note in the documentation to clarify that schema validation requires a `physicalType` field for each property in the contract.
- Updated the logic in `DataContractRulesGenerator` to ensure that complex physical types are normalized to uppercase for consistent DDL generation.
- Introduced a new unit test to validate the normalization of complex types during schema validation.

Tests:
- [x] added unit tests for schema validation and complex type normalization
Changes:
- Modified the physicalType values in the schema properties of `dqx_demo_datacontract_odcs.py` to use uppercase formats (e.g., STRING, DATE, INT, DECIMAL) for consistency across the data contract definitions.

Tests:
- [x] verified schema integrity after updates
Changes:
- Clarified the description of generated rule types in `dqx_demo_datacontract_odcs.py` and `data_contract_quality_rules_generation.mdx`.
- Adjusted language for consistency and clarity, including updates to the phrasing of rule types and the benefits of data contract-based generation.

Tests:
- [x] reviewed documentation for accuracy and clarity
…rror handling

Changes:
- Introduced methods for extracting and validating complex types (ARRAY, MAP, STRUCT) in the DataContractRulesGenerator.
- Implemented recursive validation for physical types, ensuring compliance with Spark's limits on DECIMAL precision and scale.
- Added error handling for unmatched angle brackets and invalid inner types, raising InvalidPhysicalTypeError where appropriate.
- Updated integration and unit tests to cover new validation scenarios, including nested types and DECIMAL constraints.

Tests:
- [x] added unit tests for complex type validation
- [x] updated integration tests for schema validation scenarios
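The two checks this commit names, matched angle brackets in nested types and Spark's DECIMAL limits (precision 1..38, scale between 0 and the precision), can be sketched independently of the real validator (hypothetical helpers, not the PR's code):

```python
import re


class InvalidPhysicalTypeError(ValueError):
    """Raised for unmatched brackets or out-of-range DECIMAL precision/scale."""


def check_brackets(physical_type: str) -> None:
    """Raise if angle brackets in a nested type string are unbalanced."""
    depth = 0
    for ch in physical_type:
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
            if depth < 0:
                raise InvalidPhysicalTypeError(f"Unmatched '>' in {physical_type!r}")
    if depth != 0:
        raise InvalidPhysicalTypeError(f"Unmatched '<' in {physical_type!r}")


def check_decimal(physical_type: str) -> None:
    """Raise if any DECIMAL(p,s) in the string exceeds Spark's limits."""
    for precision, scale in re.findall(r"DECIMAL\(\s*(\d+)\s*,\s*(\d+)\s*\)", physical_type.upper()):
        p, s = int(precision), int(scale)
        # Spark allows precision 1..38 and 0 <= scale <= precision.
        if not (1 <= p <= 38 and 0 <= s <= p):
            raise InvalidPhysicalTypeError(f"DECIMAL({p},{s}) exceeds Spark limits")
```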
… update documentation

Changes:
- Removed the `generate_schema_validation` parameter from the `generate_rules_from_contract` method, ensuring that schema validation rules are always generated when a schema is defined in the contract.
- Updated documentation to clarify that schema validation rules are generated by default and to enhance descriptions related to schema validation.
- Adjusted integration and unit tests to reflect the changes in schema validation behavior, ensuring that rules are generated correctly without the need for the parameter.

Tests:
- [x] updated unit tests for schema validation
- [x] modified integration tests to validate new behavior
…est.py

Changes:
- Removed `get_schema_validation_rules` from `datacontract_helpers.py` and added it to `conftest.py` for better organization.
- Updated import statements in integration and unit tests to reflect the new location of the `get_schema_validation_rules` function.

Tests:
- [x] verified that tests pass with the updated imports
…a Contract Integration Errors
Force-pushed 1a31e31 to e5d2ee0.
Changes
Linked issues
Resolves #1016
Tests