Conversation
All commits in PR should be signed (`git commit -S ...`). See https://docs.github.com/en/authentication/managing-commit-signature-verification/signing-commits

✅ 550/550 passed, 3 flaky, 35 skipped, 8h44m42s total
Flaky tests:
Running from acceptance #4028
Codecov Report
❌ Patch coverage is
Additional details and impacted files

```
@@            Coverage Diff             @@
##             main    #1025      +/-   ##
==========================================
- Coverage   91.24%   90.60%   -0.64%
==========================================
  Files          64       99      +35
  Lines        6703     8971    +2268
==========================================
+ Hits         6116     8128    +2012
- Misses        587      843     +256
```

☔ View full report in Codecov by Sentry.
Pull request overview
Adds a new install_skills() helper to deploy bundled DQX “skills” (SKILL.md + runnable examples) into a Databricks workspace folder for Databricks Assistant discovery.
Changes:
- Introduces src/databricks/labs/dqx/skills.py with install_skills() that uploads package-bundled skill content to the workspace.
- Bundles a skills tree under src/databricks/labs/dqx/resources/skills/ including SKILL.md docs and runnable example scripts.
- Updates exports/docs/perf artifacts to reflect new skills/examples and benchmark additions.
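A minimal usage sketch based on the description above; the explicit folder value is a made-up example:

```python
# Sketch only, assuming the helper behaves as this PR describes; by default it
# deploys to /Users/{current_user}/.assistant/skills/dqx.
from databricks.labs.dqx.skills import install_skills

install_skills()  # deploy to the default per-user folder
install_skills(folder="/Users/someone@example.com/.assistant/skills/dqx")  # explicit target
```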
Reviewed changes
Copilot reviewed 32 out of 33 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/perf/.benchmarks/baseline.json | Updates stored benchmark baselines to include new benchmark cases. |
| src/databricks/labs/dqx/skills.py | Adds install_skills() implementation that discovers packaged resources and uploads them to Workspace. |
| src/databricks/labs/dqx/resources/skills/dqx/SKILL.md | Root skill entrypoint for DQX assistant discovery and routing to sub-skills. |
| src/databricks/labs/dqx/resources/skills/dqx/checks/SKILL.md | Documents authoring/applying checks and references examples. |
| src/databricks/labs/dqx/resources/skills/dqx/checks/row-level/SKILL.md | Adds row-level checks reference + examples index. |
| src/databricks/labs/dqx/resources/skills/dqx/checks/dataset-level/SKILL.md | Adds dataset-level checks reference + examples index. |
| src/databricks/labs/dqx/resources/skills/dqx/checks/custom/SKILL.md | Adds custom checks guidance + examples index. |
| src/databricks/labs/dqx/resources/skills/dqx/profiling/SKILL.md | Adds profiling skill content and runnable example references. |
| src/databricks/labs/dqx/resources/skills/dqx/checks/examples/*.py | Adds runnable example scripts to be deployed alongside skills. |
| src/databricks/labs/dqx/__init__.py | Exports install_skills from the package root. |
| docs/dqx/docs/reference/benchmarks.mdx | Updates benchmark documentation tables with new entries. |
| demo.ipynb | Adds an executable demo notebook showing dqx.install_skills() usage. |
Comments suppressed due to low confidence (6)

src/databricks/labs/dqx/skills.py:1
- Converting importlib.resources.files(...) to a filesystem Path via Path(str(...)) will break when the package is installed as a zip/pex/egg (resources aren't guaranteed to exist on disk), and rglob/read_bytes won't work reliably. Keep the object as an importlib.resources.abc.Traversable and either (a) recursively walk it via iterdir() and use Traversable.read_bytes(), or (b) use importlib.resources.as_file() around the directory Traversable to obtain a temporary real path before doing rglob().

src/databricks/labs/dqx/skills.py:1
- ws.workspace.mkdirs(target_dir) is invoked for every file, which can add a lot of redundant API calls for larger skill trees. Track which directories have already been created (e.g., a set[str]) and only call mkdirs the first time each directory is encountered.

src/databricks/labs/dqx/skills.py:1
- install_skills() adds new behavior that is sensitive to environment (current user lookup, resource discovery, workspace upload). Please add unit tests that mock WorkspaceClient to verify: (1) default folder resolution uses the current user, (2) the correct target paths are computed for nested resources, and (3) uploads are invoked with overwrite enabled.
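A sketch of option (b) from the first comment, assuming the resource layout this PR describes; note that as_file() on a directory Traversable requires Python 3.12+:

```python
# as_file() extracts the directory to a temporary real path when the package
# is imported from a zip archive, so rglob() works on the result.
from importlib.resources import as_file, files
from pathlib import Path

skills_dir = files("databricks.labs.dqx.resources") / "skills" / "dqx"
with as_file(skills_dir) as skills_root:  # a real Path, possibly a temp copy
    deploy_files = sorted(p for p in Path(skills_root).rglob("*") if p.is_file())
    # ... compute relative paths and upload while the context manager is open
```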
src/databricks/labs/dqx/__init__.py
Outdated
| __all__ = ["install_skills"] |

Defining __all__ as ["install_skills"] changes `from databricks.labs.dqx import *` behavior to export only install_skills, which can be an unintended breaking change for users relying on star-imports. Consider removing __all__ entirely, or appending to an existing/exported list (e.g., include previously intended public names) rather than restricting exports to only this symbol.
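A sketch of the non-breaking alternative; the pre-existing name shown is a hypothetical stand-in for whatever the package already intends to export:

```python
# Append rather than restrict, so star-import behavior is preserved.
__all__ = [
    "DQEngine",  # hypothetical: an existing public name
    "install_skills",
]
```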
demo.ipynb
Outdated
| "execution_count": 2, | ||
| "id": "c234d4f6", | ||
| "metadata": {}, | ||
| "outputs": [ | ||
| { | ||
| "name": "stderr", | ||
| "output_type": "stream", | ||
| "text": [ | ||
| "/Users/grzegorz.rusin/repos/dqx/.venv/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", | ||
| " from .autonotebook import tqdm as notebook_tqdm\n" | ||
| ] | ||
| } | ||
| ], |
The committed notebook output includes a local filesystem path and a user email address, which are sensitive and typically should not be checked in. Clear notebook outputs before committing (and consider removing machine-specific setup like sys.path.insert(...) or replacing it with a more portable approach) so the repo doesn’t leak developer environment details.
| "execution_count": 2, | |
| "id": "c234d4f6", | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "name": "stderr", | |
| "output_type": "stream", | |
| "text": [ | |
| "/Users/grzegorz.rusin/repos/dqx/.venv/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", | |
| " from .autonotebook import tqdm as notebook_tqdm\n" | |
| ] | |
| } | |
| ], | |
| "execution_count": null, | |
| "id": "c234d4f6", | |
| "metadata": {}, | |
| "outputs": [], |
demo.ipynb
Outdated
| "output_type": "stream", | ||
| "text": [ | ||
| "\u001b[90m12:38:26\u001b[0m \u001b[1m\u001b[32m INFO\u001b[0m \u001b[1m[databricks.sdk] Using Databricks Metadata Service authentication\u001b[0m\n", | ||
| "\u001b[90m12:38:27\u001b[0m \u001b[1m\u001b[32m INFO\u001b[0m \u001b[1m[d.l.dqx.skills] Deploying skills to: /Users/grzegorz.rusin@databricks.com/.assistant/skills/dqx\u001b[0m\n", |
There was a problem hiding this comment.
The committed notebook output includes a local filesystem path and a user email address, which are sensitive and typically should not be checked in. Clear notebook outputs before committing (and consider removing machine-specific setup like sys.path.insert(...) or replacing it with a more portable approach) so the repo doesn’t leak developer environment details.
mwojtyczka
left a comment
This is really cool!
Feedback:
- Please sign the commits.
- We need to reduce the maintenance effort. All of these examples are already available in the docs, so we need a better way to manage this and avoid duplication, e.g. store the examples in one place and source both the docs and the skills from it, or document the examples in each check function. The usage examples for DQX are likewise already in the docs; we should centralize this.
mwojtyczka
left a comment
Great addition — bundled SKILL.md files are well-structured and the runnable examples are a nice touch. A few correctness issues in skills.py worth fixing, plus a gap between the PR description and the actual diff.
Missing __init__.py export: The PR description states that install_skills is exported from databricks.labs.dqx.__init__ and that `from databricks.labs import dqx; dqx.install_skills()` works. But src/databricks/labs/dqx/__init__.py is not in this diff; the function is only reachable via `from databricks.labs.dqx.skills import install_skills`. Either the description is aspirational and __init__.py still needs updating, or the claim should be removed. A sketch of the missing re-export follows.
| folder: Target workspace folder. Defaults to |
| ``/Users/{current_user}/.assistant/skills/dqx``. |
| """ |
| ws = WorkspaceClient() |
WorkspaceClient() is not injectable — blocks unit testing
Constructing the client inside the function means tests must either hit a real Databricks workspace or monkey-patch the constructor. The rest of the codebase (e.g. DQEngine, SettingsManager) accepts ws as a constructor argument precisely to allow mocking.
Suggested signature:
```python
def install_skills(folder: str | None = None, ws: WorkspaceClient | None = None) -> None:
    if ws is None:
        ws = WorkspaceClient()
```

This keeps the zero-argument convenience call while enabling install_skills(ws=mock_ws) in tests.
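A hedged test sketch built on the suggested signature; the asserted call assumes the upload behavior described elsewhere in this review:

```python
from unittest.mock import MagicMock

from databricks.labs.dqx.skills import install_skills

def test_install_skills_accepts_injected_client():
    ws = MagicMock()
    ws.current_user.me.return_value.user_name = "tester@example.com"
    install_skills(ws=ws)
    ws.workspace.upload.assert_called()  # skill files were pushed to the workspace
```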
| user_name = ws.current_user.me().user_name |
| folder = f"/Users/{user_name}/.assistant/skills/dqx" |

| skills_root = Path(str(files("databricks.labs.dqx.resources") / "skills" / "dqx")) |
Path(str(Traversable)) + rglob breaks for zip-packaged installs
files(...) returns an importlib.resources.abc.Traversable, not necessarily a real filesystem Path. Converting it with Path(str(...)) works for editable installs and extracted wheels, but silently fails (or raises NotADirectoryError) when the package is imported from inside a zip archive (e.g., a compiled .egg or certain CI environments).
For zip-safe resource access the idiomatic approach is to use Traversable APIs throughout:
```python
from importlib.resources import files

skills_root = files("databricks.labs.dqx.resources").joinpath("skills").joinpath("dqx")

def _rglob_traversable(node):
    for child in node.iterdir():
        if child.is_dir():
            yield from _rglob_traversable(child)
        else:
            yield child

deploy_files = sorted(_rglob_traversable(skills_root), key=lambda f: str(f))
```

This avoids the Path conversion entirely and works correctly regardless of how the package is packaged.
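A possible continuation of that sketch staying on Traversable APIs; the prefix-stripping, io.BytesIO wrapper, and overwrite flag are assumptions mirroring the WorkspaceClient.workspace.upload usage described in this PR:

```python
import io

prefix = str(skills_root)
for src_file in deploy_files:
    # Derive a forward-slash relative path from the Traversable's string form.
    relative = str(src_file)[len(prefix) + 1 :].replace("\\", "/")
    ws.workspace.upload(
        f"{folder}/{relative}",
        io.BytesIO(src_file.read_bytes()),
        overwrite=True,
    )
```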
| for src_file in deploy_files: |
| relative = src_file.relative_to(skills_root) |
| target_path = f"{folder}/{relative}" |
Path object in f-string uses OS-native separators — breaks on Windows
relative is a pathlib.Path. On Windows, str(relative) produces backslash-separated paths (e.g., checks\examples\01_row_null_empty.py), so the target workspace path becomes /.../checks\examples\01_row_null_empty.py, which the Databricks API will reject.
Use relative.as_posix() to always get forward slashes:
target_path = f"{folder}/{relative.as_posix()}"| target_path = f"{folder}/{relative}" | ||
| target_dir = "/".join(target_path.split("/")[:-1]) | ||
|
|
||
| ws.workspace.mkdirs(target_dir) |
mkdirs is called once per file instead of once per unique directory
With 25 example files across ~5 directories, this makes 25 mkdirs API calls when only 5 are needed. mkdirs is idempotent so it doesn't produce incorrect results, but it generates unnecessary network round-trips.
Collect the unique parent directories before the loop:
```python
# Create each unique parent directory once, then upload the files.
unique_dirs = {
    f"{folder}/{f.relative_to(skills_root).as_posix()}".rsplit("/", 1)[0]
    for f in deploy_files
}
for d in sorted(unique_dirs):
    ws.workspace.mkdirs(d)
for src_file in deploy_files:
    ...
    ws.workspace.upload(target_path, ...)
```

or more simply, track which directories have already been created in a set inside the loop.
PR: Add install_skills() -- deploy DQX agent skills to Databricks Assistant
Branch: f/agent_skills → main | 3 commits | +2109 lines

Key Changes

1. New install_skills() command (src/databricks/labs/dqx/skills.py)
One-liner to deploy DQX agent skills into a Databricks workspace so that Databricks Assistant can use them.
Reads bundled SKILL.md files and example scripts from package resources via importlib.resources, then uploads them with WorkspaceClient.workspace.upload().

2. Bundled skill content (src/databricks/labs/dqx/resources/skills/)
Six SKILL.md files covering the full DQX surface, structured for the assistant's skill discovery:
- dqx/SKILL.md
- checks/SKILL.md
- checks/row-level/SKILL.md
- checks/dataset-level/SKILL.md
- checks/custom/SKILL.md
- profiling/SKILL.md
Each SKILL.md includes a parameter table at the top that explicitly marks required vs optional arguments to prevent the assistant from generating invalid check definitions (e.g. calling is_in_range without both min_limit and max_limit), as illustrated below.
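A hedged illustration of that pitfall, assuming the dict shape of DQX check definitions documented elsewhere; the column name and limits are made up:

```python
# A well-formed check: is_in_range carries both of its required limits.
valid_check = {
    "criticality": "error",
    "check": {
        "function": "is_in_range",
        "arguments": {"column": "age", "min_limit": 0, "max_limit": 120},
    },
}
# Dropping min_limit or max_limit yields exactly the invalid definition the
# parameter tables are designed to steer the assistant away from.
```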
3. Runnable examples (checks/examples/)
21 self-contained Python scripts (one per check category) bundled as package resources and deployed alongside skills:
- 01–09: row-level checks (null, list, comparison, range, regex, datetime, network, SQL, complex types)
- 10–16: dataset-level checks (unique, aggregation, FK, compare, freshness, schema, SQL query)
- 17–19: custom checks (SQL window, Python row, Python dataset)
- 20–21: profiling
All SKILL.md files cross-reference relevant examples.
4. Package wiring (__init__.py, resources/__init__.py)
install_skills is exported from databricks.labs.dqx.__init__, so both import styles work (see the snippet below).
No pyproject.toml build changes needed -- hatchling already includes non-Python files under src/.

Breaking changes: None. This is purely additive.
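The two import styles referenced above, assuming the __init__.py re-export is in place:

```python
from databricks.labs.dqx import install_skills

# or:
from databricks.labs import dqx
dqx.install_skills()
```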
Examples

[Screenshot: Databricks AI Assistant with skills loaded]

[Screenshot: unskilled Databricks AI Assistant]