Automated Versioning of Checks #1044
STEFANOVIVAS wants to merge 25 commits into databrickslabs:main from
Conversation
…t fingerprint in result df.
Code review
Found 2 issues:
dqx/src/databricks/labs/dqx/checks_storage.py Lines 148 to 153 in 92a5d6b
dqx/src/databricks/labs/dqx/checks_storage.py Lines 430 to 441 in 92a5d6b
🤖 Generated with Claude Code
Additional code review findings (lower confidence)
Found 4 additional issues (scored below primary threshold):
dqx/src/databricks/labs/dqx/checks_storage.py Lines 356 to 378 in 92a5d6b
dqx/src/databricks/labs/dqx/checks_serializer.py Lines 387 to 395 in 92a5d6b
dqx/src/databricks/labs/dqx/config.py Lines 287 to 301 in 92a5d6b
dqx/src/databricks/labs/dqx/checks_storage.py Lines 147 to 158 in 92a5d6b
…spark.catalog.tableExists fails for tables with special characters
… a rule_set_fingerprint in lakebase storage
…config.mode=overwrite for lakebase storage
…nt columns exists in delta table
```python
"""
check_rows = df.where(f"run_config_name = '{run_config_name}'").collect()
filtered_df = df.where(f"run_config_name = '{run_config_name}'")
if filtered_df.isEmpty():
```
filtered_df.isEmpty() triggers a full Spark job just to check for zero rows, before you do the actual .collect() a few lines below. If the DataFrame is empty, collect() already returns [] naturally. This early check doubles the number of Spark actions in the common (non-empty) path.
Remove the guard entirely, or if an early log is important, keep it but note the extra cost:
```python
check_rows = filtered_df.collect()
if not check_rows:
    logger.info(f"No checks found for run_config_name '{run_config_name}'.")
    return []
```

```python
filtered_df = filtered_df.where((F.col("rule_set_fingerprint") == rule_set_fingerprint)&(F.col("run_config_name") == run_config_name))
else:
    rule_set_fingerprint=filtered_df.select(F.col("rule_set_fingerprint")).where(F.col("run_config_name") == run_config_name).orderBy(F.col("created_at").desc()).limit(1).collect()[0][0]
```
Three problems on this line:
1. Crash risk: .collect()[0][0] raises IndexError if created_at is NULL for all rows or filtered_df is empty at this point.
2. Redundant filter: .where(F.col("run_config_name") == run_config_name) is applied a second time here even though filtered_df was already filtered by run_config_name earlier (line 374).
3. Line length: This is 200+ characters — well over the project's style limit.
Suggested rewrite:
```python
result = (
    filtered_df
    .select(F.col("rule_set_fingerprint"))
    .orderBy(F.col("created_at").desc())
    .limit(1)
    .collect()
)
if not result:
    return []
rule_set_fingerprint = result[0][0]
```

```python
"criticality": check_dict.get("criticality", "error"),
"function": check_dict.get("check", {}).get("function"),
"arguments": check_dict.get("check", {}).get("arguments"),
"filter": check_dict.get("filter"),
```
for_each_column is not included in fingerprint_data. Two rules that differ only in for_each_column (e.g. applying the same check to different column lists) will produce the same fingerprint — incorrect deduplication and missed version changes.
```python
fingerprint_data = {
    ...
    "filter": check_dict.get("filter"),
    "for_each_column": check_dict.get("check", {}).get("for_each_column"),  # add this
}
```

```python
logger.info("Rule version columns exist or added.")

normalized_checks = self._normalize_checks(checks, config)
rule_set_fingerprint = normalized_checks[0].get("rule_set_fingerprint")
```
normalized_checks[0] raises IndexError if checks is empty. The same pattern exists in TableChecksStorageHandler.save (where rules_df.select("rule_set_fingerprint").first() is slightly safer, but still asymmetric).
```python
rule_set_fingerprint = normalized_checks[0].get("rule_set_fingerprint") if normalized_checks else None
```

Or guard earlier:

```python
if not checks:
    logger.info("No checks to save — skipping.")
    return
```

```python
config_load = TableChecksStorageConfig(
    location=table_name,
    rule_set_fingerprint="e27b1748e670c8bceeb8449ac494f22bd80a934a30a3c86919547de56790bc00",
```
Hardcoded hash values in tests are brittle. If the fingerprinting algorithm, input data, or canonical serialisation ever changes, this test will break silently (a wrong fingerprint yields an empty result, so assertions may pass against the wrong data) or raise a confusing error.
Compute the expected fingerprint from the same input data instead:
```python
from databricks.labs.dqx.checks_serializer import compute_rule_set_fingerprint

config_load = TableChecksStorageConfig(
    location=table_name,
    rule_set_fingerprint=compute_rule_set_fingerprint(INPUT_CHECKS[:1]),
)
```

Same applies to the equivalent hardcoded hash in test_save_and_load_checks_from_lakebase_table.py.
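The brittleness can be demonstrated without Spark or DQX at all. In this hypothetical sketch, `fp_v1` and `fp_v2` are toy fingerprint functions (not DQX's real `compute_rule_set_fingerprint`), differing only in canonical serialisation; such a change silently invalidates any pinned digest literal:

```python
import hashlib
import json

def fp_v1(checks: list[dict]) -> str:
    # Toy fingerprint: canonical JSON of the rule list, hashed with SHA-256.
    return hashlib.sha256(json.dumps(checks, sort_keys=True).encode()).hexdigest()

def fp_v2(checks: list[dict]) -> str:
    # Same rules, but the canonical serialisation changed (compact separators).
    return hashlib.sha256(
        json.dumps(checks, sort_keys=True, separators=(",", ":")).encode()
    ).hexdigest()

checks = [{"criticality": "error", "check": {"function": "is_not_null"}}]
pinned = fp_v1(checks)  # imagine this hex digest pasted into a test as a literal

# A harmless serialisation tweak invalidates the pinned literal...
assert fp_v2(checks) != pinned
# ...while a test that recomputes the expectation from the same input
# (as the review suggests) keeps tracking the code under test:
assert fp_v1(checks) == pinned
```

A test that recomputes its expected value exercises the round trip (save, then load by fingerprint) rather than the stability of one historical hash.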
…rageHandler and LakebaseStorageHandler
…VAS/dqx into feature/versioning_check_rules
Changes
Adds automated versioning of rules: each saved rule set gets a fingerprint, so storage handlers can detect and version changes to the check definitions.
Linked issues
Resolves #672
Tests