Fixed absolute path normalisation in source code analysis #2920


Merged: 6 commits merged into main from fix-bug-2882 on Oct 10, 2024

Conversation

ericvergnaud (Contributor)

Changes

The Workspace API does not support relative subpaths such as `/a/b/../c`. This PR fixes the issue by resolving workspace paths before calling the API.
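Below is a minimal sketch of the pattern the fix applies, assuming a lookup object with a `cwd` attribute as in the diff further down; the helper name `to_absolute` is illustrative, not the verbatim implementation:

```python
from pathlib import Path

def to_absolute(cwd: Path, path: Path) -> Path:
    # Anchor a possibly-relative path at the current working directory,
    # then normalise it: resolve() collapses ".." segments
    # ("/a/b/../c" -> "/a/c") before any Workspace API call is made.
    absolute_path = cwd / path
    return absolute_path.resolve()
```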

Linked issues

Resolves #2882
Requires databrickslabs/blueprint#156
Requires databrickslabs/blueprint#157

Functionality

None

Tests

  • added integration tests

@ericvergnaud ericvergnaud requested a review from a team as a code owner October 10, 2024 14:16
@nfx nfx requested a review from asnare October 10, 2024 14:19

github-actions bot commented Oct 10, 2024

✅ 43/43 passed, 2 flaky, 2 skipped, 1h29m52s total

Flaky tests:

  • 🤪 test_running_real_workflow_linter_job (1m33.223s)
  • 🤪 test_job_dlt_task_linter_unhappy_path (20m28.949s)

Running from acceptance #6617

@nfx nfx changed the title from "Fix bug 2882" to "Fixed absolute path normalisation in source code analysis" on Oct 10, 2024
@nfx nfx (Collaborator) left a comment:

lgtm

@@ -41,6 +41,7 @@ def resolve(self, path_lookup: PathLookup, path: Path) -> Path | None:
         return None."""
         # check current working directory first
         absolute_path = path_lookup.cwd / path
+        absolute_path = absolute_path.resolve()
A collaborator commented on the added line:

let's make sure it doesn't fail
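On that concern: `pathlib.Path.resolve()` defaults to `strict=False`, so it does not raise for paths that do not exist. For illustration, a purely lexical normalisation, which touches no filesystem and therefore cannot fail on a missing path, could look like this (a sketch, not the blueprint implementation):

```python
from pathlib import PurePosixPath

def normalise(path: PurePosixPath) -> PurePosixPath:
    # Collapse "." and ".." segments of an absolute workspace path
    # without touching the filesystem.
    parts: list[str] = []
    for part in path.parts:
        if part == "..":
            if len(parts) > 1:  # never pop the "/" anchor
                parts.pop()
        elif part != ".":
            parts.append(part)
    return PurePosixPath(*parts)

assert str(normalise(PurePosixPath("/a/b/../c"))) == "/a/c"
```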

@nfx nfx temporarily deployed to account-admin October 10, 2024 16:05 — with GitHub Actions Inactive
@nfx nfx merged commit 9ac48a5 into main Oct 10, 2024
7 checks passed
@nfx nfx deleted the fix-bug-2882 branch October 10, 2024 16:36
nfx added a commit that referenced this pull request Oct 10, 2024
* Added `google-cloud-storage` to known list ([#2827](#2827)). In this release, we have added the `google-cloud-storage` library, along with its various modules and sub-modules, to our project's known list in a JSON file. Additionally, we have included the `google-crc32c` and `google-resumable-media` libraries. These libraries provide functionalities such as content addressable storage, checksum calculation, and resumable media upload and download. This change is a partial resolution to issue [#1931](#1931), which is likely related to the integration or usage of these libraries in the project. Software engineers should take note of these additions and how they may impact the project's functionality.
* Added `google-crc32c` to known list ([#2828](#2828)). With this commit, we have added the `google-crc32c` library to our system's known list, addressing part of issue [#1931](#1931). This addition enhances the overall functionality of the system by providing efficient and high-speed CRC32C computation when utilized. The `google-crc32c` library is known for its performance and reliability, and by incorporating it into our system, we aim to improve the efficiency and robustness of the CRC32C computation process. This enhancement is part of our ongoing efforts to optimize the system and ensure a more efficient experience for our end-users. With this change, users can expect faster and more reliable CRC32C computations in their applications.
* Added `holidays` to known list ([#2906](#2906)). In this release, we have expanded the known list in our open-source library to include a new `holidays` category, aimed at supporting tracking of holidays for different countries, religions, and financial institutions. This category includes several subcategories, such as calendars, countries, deprecation, financial holidays, groups, helpers, holiday base, mixins, observed holiday base, registry, and utils. Each subcategory contains an empty list, allowing for future data storage related to holidays. This change partially resolves issue [#1931](#1931), and represents a significant step towards supporting a more comprehensive range of holiday tracking needs in our library. Software engineers may utilize this new feature to build applications that require tracking and management of various holidays and related data.
* Added `htmlmin` to known list ([#2907](#2907)). In this update, we have added the `htmlmin` library to the `known.json` configuration file's list of known libraries. This addition enables the use and management of `htmlmin` and its components, including `htmlmin.command`, `htmlmin.decorator`, `htmlmin.escape`, `htmlmin.main`, `htmlmin.middleware`, `htmlmin.parser`, `htmlmin.python3html`, and `htmlmin.python3html.parser`. This change partially addresses issue [#1931](#1931), which may have been caused by the integration or usage of `htmlmin`. Software engineers can now utilize `htmlmin` and its features in their projects, thanks to this enhancement.
* Document preparing external locations when creating catalogs ([#2915](#2915)). Databricks Labs' UCX tool has been updated to incorporate the preparation of external locations when creating catalogs during the upgrade to Unity Catalog (UC). This enhancement involves the addition of new documentation outlining how to physically separate data in storage within UC, adhering to Databricks' best practices. The `create-catalogs-schemas` command has been updated to create UC catalogs and schemas based on a mapping file, allowing users to reuse previously created external locations or establish new ones outside of UCX. For data separation, users can leverage external locations when using subpaths, providing flexibility in data management during the upgrade process.
* Fixed `KeyError` from `assess_workflows` task ([#2919](#2919)). In this release, we have made significant improvements to error handling in our open-source library. We have fixed a `KeyError` in the `assess_workflows` task and modified the `_safe_infer_internal` and `_unsafe_infer_internal` methods to handle both `InferenceError` and `KeyError` during inference. When an error occurs, we now log the error message with the node and yield an `Uninferable` object (a sketch of this pattern follows this list). Additionally, we have updated the `do_infer_values` method of the `_LocalInferredValue` class to yield an iterator of iterables of `NodeNG` objects. We have added multiple unit tests for inferring values in Python code, including cases for handling externally defined values and their absence. These changes ensure that our library can handle errors more gracefully and provide more informative feedback during inference, making it more robust and easier to use in software engineering projects.
* Fixed `OSError: [Errno 95]` bug in `assess_workflows` task by skipping GIT-sourced workflows from static code analysis ([#2924](#2924)). In this release, we have resolved the `OSError: [Errno 95]` bug in the `assess_workflows` task that occurred while performing static code analysis on GIT-sourced workflows. A new attribute `Source` has been introduced in the `jobs` module of the `databricks.sdk.service` package to identify the source of a notebook task. If the notebook task source is GIT, a new `DependencyProblem` is raised, indicating that notebooks in GIT should be analyzed using the `databricks labs ucx lint-local-code` CLI command. The `_register_notebook` method has been updated to check if the notebook task source is GIT and return an appropriate `DependencyProblem` message (sketched after this list). This change enhances the reliability of the `assess_workflows` task by avoiding the aforementioned bug and provides a more informative message when notebooks are sourced from GIT. This change is part of our ongoing effort to improve the project's quality and reliability and benefits software engineers who adopt the project.
* Fixed absolute path normalisation in source code analysis ([#2920](#2920)). In this release, we have addressed an issue with the Workspace API not supporting relative subpaths such as `/a/b/../c`; this is fixed by resolving workspace paths before calling the API. This fix is backward compatible and ensures the correct behavior of the source code analysis. Additionally, we have added integration tests; this commit was co-authored by Eric Vergnaud and Serge Smertin. Furthermore, we have added a new test case that supports relative grand-parent paths in the dependency graph construction, utilizing a new `NotebookLoader` class. This loader is responsible for loading the notebook content and metadata given a path, and this new test case exercises the path resolution logic when a notebook depends on another notebook located two levels up in the directory hierarchy. These changes improve the robustness and reliability of the source code analysis in the presence of relative paths.
* Fixed downloading wheel libraries from DBFS on mounted Azure Storage failing with access denied ([#2918](#2918)). In this release, we have introduced enhancements to the library's handling of registering and downloading wheel libraries from DBFS on mounted Azure Storage, addressing an issue that resulted in access denied errors. The changes include improved error handling with the addition of a `try-except` block to handle potential `BadRequest` exceptions and the inclusion of three new methods to register different types of libraries. The `_register_requirements_txt` method reads requirements files and registers each library specified in the file, logging a warning message for any references to other requirements or constraints files. The `_register_whl` method creates a temporary copy of the given wheel file in the local file system and registers it, while the `_register_egg` method checks the runtime version and yields a `DependencyProblem` if the version is greater than (14, 0). These changes simplify the code and enhance error handling while addressing the reported issues related to registering libraries. The changes are implemented in the `jobs.py` file located in the `databricks/labs/ucx/source_code` directory, which also includes the import of the `BadRequest` exception class from `databricks.sdk.errors`.
* Fixed issue with migrating MANAGED hive_metastore table to UC ([#2892](#2892)). In this release, we have implemented changes to address the issue of migrating HMS (Hive Metastore) managed tables to UC (Unity Catalog) as EXTERNAL. Historically, deleting a managed table also removed the underlying data, leading to potential data loss and making the UC table unusable. The new approach provides options to mitigate these issues, including migrating as EXTERNAL or cloning the data to maintain integrity. These changes aim to prevent accidental data deletion, ensure data recoverability, and avoid inconsistencies when new data is added to either HMS or UC. We have introduced new class attributes, methods, and parameters in relevant modules such as `WorkspaceConfig`, `Table`, `migrate_tables`, and `install.py`. These modifications support the new migration strategies and allow for more flexibility in managing how tables are migrated and how data is handled. The upgrade process can be triggered using the `migrate-tables` UCX command or by running the table migration workflows deployed to the workspace. Thorough testing and documentation have been performed to minimize risks of data inconsistencies during migration. It is crucial to understand the implications of these changes and carefully consider the trade-offs before migrating managed tables to UC as EXTERNAL.
* Improve creating UC catalogs ([#2898](#2898)). In this release, the process of creating Unity Catalog (UC) catalogs has been significantly improved with the resolution of several issues discovered during manual testing. The `databricks labs ucx create-ucx-catalog/create-catalogs-schemas` command has been updated to ensure a better user experience and enhance consistency. Changes include requesting the catalog location even if the catalog already exists, eliminating multiple loops over storage locations, and improving logging and matching storage locations. The code now includes new checks to avoid requesting a catalog's storage location if it already exists and updates the behavior of the `_create_catalog_validate` and `_validate_location` methods. Additionally, new unit tests have been added to verify these changes. Under the hood, a new method, `get_catalog`, has been introduced to the `WorkspaceClient` class, and several test functions, such as `test_create_ucx_catalog_skips_when_ucx_catalogs_exists` and `test_create_all_catalogs_schemas_creates_catalogs`, have been implemented to ensure the proper functioning of the updated command. This release addresses issue [#2879](#2879) and enhances the overall process of creating UC catalogs, making it more efficient and reliable.
* Improve logging when skipping a grant in `create-catalogs-schemas` ([#2917](#2917)). In this release, the logging for skipping grants in the `_update_principal_acl` method of the `CatalogSchema` class has been improved. The code now logs a more detailed message when it cannot identify a UC grant for a specific grant object, indicating that the grant is a legacy grant that is not supported in UC, along with the grant's action type and associated object. This change provides more context for debugging and troubleshooting purposes. Additionally, the functionality of using a `DENY` grant instead of a `USAGE` grant for a specific principal and schema in the hive metastore has been introduced. The test case `test_catalog_schema_acl()` in the `test_catalog_schema.py` file has been updated to reflect this new behavior. A new test case `test_create_all_catalogs_schemas_logs_untranslatable_grant(caplog)` has also been added to verify the new logging behavior for skipping legacy grants that are not supported in UC. These changes improve the logging system and enhance the `CatalogSchema` class functionality in the open-source library.
* Verify migration progress prerequisites during UCX catalog creation ([#2912](#2912)). In this update, a new method `verify()` has been added to the `verify_progress_tracking` object in the `workspace_context` object to verify the prerequisites for UCX catalog creation. The prerequisites include the existence of a UC metastore, a UCX catalog, and a successful `assessment` job run. If the assessment job is pending or running, the code will wait up to 1 hour for it to finish before considering the prerequisites unmet. This feature includes modifications to the `create-ucx-catalog` CLI command and adds unit tests. This resolves issue [#2816](#2816) and ensures that the migration progress prerequisites are met before creating the UCX catalog. The `VerifyProgressTracking` class has been added to the `databricks.labs.ucx.progress.install` module and is used in the `Application` class. The changes include a new `timeout` argument to specify the waiting time for pending or running assessment jobs. The commit also includes several new unit tests for the `VerifyProgressTracking` class and modifications to the `test_install.py` file in the `tests/unit/progress` directory. The code has been manually tested and meets the requirements.
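A minimal sketch of the inference hardening described in the `assess_workflows` entry above, using `astroid`; the function name and logger are illustrative, not the project's exact code:

```python
import logging

from astroid.exceptions import InferenceError
from astroid.nodes import NodeNG
from astroid.util import Uninferable

logger = logging.getLogger(__name__)

def safe_infer(node: NodeNG):
    # Yield inferred values; degrade to Uninferable instead of crashing.
    try:
        yield from node.infer()
    except (InferenceError, KeyError) as e:  # KeyError: see #2919
        logger.debug("Inference failed for node %s: %s", node, e)
        yield Uninferable
```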
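And a sketch of the GIT-source check from the `OSError: [Errno 95]` entry; `jobs.Source` and `NotebookTask.source` come from the Databricks SDK as described above, while the problem reporting is simplified here to a plain string:

```python
from databricks.sdk.service import jobs

def notebook_task_problem(task: jobs.NotebookTask) -> str | None:
    # GIT-sourced notebooks cannot be fetched through the Workspace API,
    # so static analysis skips them and points at the local linter instead.
    if task.source == jobs.Source.GIT:
        return ("notebooks in GIT should be analyzed using the "
                "`databricks labs ucx lint-local-code` CLI command")
    return None
```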
@nfx nfx mentioned this pull request Oct 10, 2024
nfx pushed a commit that referenced this pull request Oct 14, 2024
Dependency update: ensure we install with at least version 0.9.1 of `databricks-labs-blueprint` (#2950)

## Changes

In #2920 we introduced a change to ensure that notebook paths are
normalised, resolving #2882. This change depended on an upstream fix
(databrickslabs/blueprint#157) included in the 0.9.1 release of the
dependency. This PR ensures that we run against that release or later.
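To make the resulting constraint concrete, here is a hedged runtime equivalent of the `pyproject.toml` pin (`databricks-labs-blueprint>=0.9.1,<0.10`, per the release notes below); the actual enforcement lives in `pyproject.toml`:

```python
from importlib.metadata import version

# Assert at runtime that the installed blueprint carries the resolve() fix.
installed = tuple(int(p) for p in version("databricks-labs-blueprint").split(".")[:3])
assert (0, 9, 1) <= installed < (0, 10, 0), installed
```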
nfx added a commit that referenced this pull request Oct 14, 2024
* Added `imbalanced-learn` to known list ([#2943](#2943)). A new open-source library, "imbalanced-learn," has been added to the project's known list of libraries, providing various functionalities for handling imbalanced datasets. The addition includes modules such as "imblearn", "imblearn._config", "imblearn._min_dependencies", "imblearn._version", "imblearn.base", and many others, enabling features such as over-sampling, under-sampling, combining sampling techniques, and creating ensembles. This change partially resolves issue [#1931](#1931), which may have been related to the handling of imbalanced datasets, thereby enhancing the project's ability to manage such datasets.
* Added `importlib_resources` to known list ([#2944](#2944)). In this update, we've added the `importlib_resources` package to the known list in the `known.json` file. This package offers a consistent and straightforward interface for accessing resources such as data files and directories in Python packages. It includes several modules, including `importlib_resources`, `importlib_resources._adapters`, `importlib_resources._common`, `importlib_resources._functional`, `importlib_resources._itertools`, `importlib_resources.abc`, `importlib_resources.compat`, `importlib_resources.compat.py38`, `importlib_resources.compat.py39`, `importlib_resources.future`, `importlib_resources.future.adapters`, `importlib_resources.readers`, and `importlib_resources.simple`. These modules provide various functionalities for handling resources within a Python package. By adding this package to the known list, we enable its usage and integration with the project's codebase. This change partially addresses issue [#1931](#1931), improving the management and accessibility of resources within our Python packages.
* Dependency update: ensure we install with at least version 0.9.1 of `databricks-labs-blueprint` ([#2950](#2950)). In the updated `pyproject.toml` file, the version constraint for the `databricks-labs-blueprint` dependency has been revised to range between 0.9.1 and 0.10, specifically targeting 0.9.1 or higher. This modification ensures the incorporation of a fixed upstream issue (databrickslabs/blueprint[#157](#157)), which was integrated in the 0.9.1 release. This adjustment was triggered by a preceding change ([#2920](#2920)) that standardized notebook paths, thereby addressing issue [#2882](#2882), which was dependent on this upstream correction. By embracing this upgrade, users can engage the most recent dependency version, thereby ensuring the remediation of the aforementioned issue.
* Fixed an issue with source table deleted after migration ([#2927](#2927)). In this release, we have addressed an issue where a source table was marked as migrated even after it was deleted following migration. An exception handling mechanism has been added to the `is_migrated` method to return `True` and log a warning message if the source table does not exist, indicating that it has been migrated. A new test function, `test_migration_index_deleted_source`, has also been included to verify the migration index behavior when the source table no longer exists. This function creates a source and destination table, sets the destination table's `upgraded_from` property to the source table, drops the source table, and checks if the migration index contains the source table and if an error message was recorded, indicating that the source table no longer exists. The `get_seen_tables` method remains unchanged in this diff.
* Improve robustness of `sqlglot` failure handling ([#2952](#2952)). This PR introduces changes to improve the robustness of error handling around the `sqlglot` library, specifically targeting issues with inadequate parsing quality. The `collect_table_infos` method has been updated and renamed to `collect_used_tables` to accurately gather information about tables used in a SQL expression. The `lint_expression` and `collect_tables` methods have also been updated to use the new `collect_used_tables` method for better accuracy. Additionally, methods such as `find_all`, `walk_expressions`, and the test suite for the SQL parser have been enhanced to handle potential failures and unsupported SQL syntax more gracefully, by returning empty lists or logging warning messages instead of raising errors (a sketch of this pattern follows this list). These changes aim to improve the reliability and robustness of `sqlglot` failure handling, enabling the linter to handle unexpected input more effectively.
* Log warnings when mounts are discovered on incorrect cluster type ([#2929](#2929)). The `migrate-tables` command in the ucx project's CLI now includes a verification step to ensure the successful completion of a prerequisite assessment workflow before execution. If this workflow has not been completed, a warning message is logged and the command is not executed. A new exception handling mechanism has been implemented for the `dbutils.fs.mounts()` method, which logs a warning and skips mount point discovery if an exception is raised. A new unit test has been added to verify that a warning is logged when attempting to discover mounts on an incompatible cluster type. The diff also includes a new method `VerifyProgressTracking` for verifying progress tracking and updates to existing test methods to include verification of successful runs and error handling before assessment. These changes improve the handling of edge cases in the mount point discovery process, add warnings for mounts on incorrect cluster types, and increase test coverage with progress tracking verification.
* `create-uber-principal` fixes and improvements ([#2941](#2941)). This change introduces fixes and improvements to the `create-uber-principal` functionality within the `databricks-sdk-py` project, specifically targeting the Azure access module. The main enhancements include addressing an issue with the Databricks warehouses API by adding the `set_workspace_warehouse_config_wrapper` function, modifying the command to request the uber principal name only when necessary, improving storage account crawl logic, and introducing new methods to manage workspace-level configurations. Error handling mechanisms have been fortified through added and modified try-except blocks. Additionally, several unit and integration tests have been implemented and verified to ensure the functionality is correct and running smoothly. These changes improve the overall robustness and versatility of the `create-uber-principal` command, directly addressing issues [#2764](#2764), [#2771](#2771), and progressing on [#2949](#2949).
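As referenced in the `sqlglot` entry above, a minimal sketch of the defensive parsing pattern; the function name matches the release notes, while the body is illustrative:

```python
import logging

import sqlglot
from sqlglot import expressions

logger = logging.getLogger(__name__)

def collect_used_tables(sql: str) -> list[str]:
    # Return the table names used in a query; on parse failure, warn and
    # return an empty list instead of raising.
    try:
        tree = sqlglot.parse_one(sql, dialect="databricks")
    except sqlglot.errors.ParseError as e:
        logger.warning(f"Failed to parse SQL: {e}")
        return []
    return [table.name for table in tree.find_all(expressions.Table)]
```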
@nfx nfx mentioned this pull request Oct 14, 2024
Successfully merging this pull request may close these issues.

[BUG]: assess_workflow failing with "Parsing Python code failed"