@acrylJonny (Collaborator) commented Nov 27, 2025

Add Storage Lineage to Hive/Hive Metastore + Code Refactoring

Summary

Adds storage lineage support to Hive and Hive Metastore connectors, enabling lineage tracking between Hive tables and their underlying storage locations (S3, Azure, GCS, HDFS, DBFS). Also refactors Hive sources into a clean directory structure.

Key Changes

Storage Lineage (Opt-in Feature)

New configuration options (disabled by default):

  • emit_storage_lineage: Enable storage lineage extraction
  • hive_storage_lineage_direction: Set direction (upstream or downstream)
  • include_column_lineage: Enable column-level lineage
  • storage_platform_instance: Platform instance for storage URNs

Supported platforms: S3, Azure (ADLS/ABFS), GCS, HDFS, DBFS, local files
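
To make the platform list concrete, here is a minimal sketch of how a table's storage location could be mapped to a platform by its URI scheme. The scheme table, the platform identifiers, and the function name `detect_storage_platform` are illustrative assumptions and are not taken from `hive/storage_lineage.py`.

```python
from typing import Optional
from urllib.parse import urlparse

# Assumed mapping from URI scheme to storage platform name (illustrative only).
_SCHEME_TO_PLATFORM = {
    "s3": "s3",
    "s3a": "s3",
    "s3n": "s3",
    "abfs": "abs",
    "abfss": "abs",
    "wasb": "abs",
    "wasbs": "abs",
    "gs": "gcs",
    "hdfs": "hdfs",
    "dbfs": "dbfs",
    "file": "file",
}


def detect_storage_platform(location: str) -> Optional[str]:
    """Return the platform for a storage location, or None if unrecognized."""
    location = location.strip()
    scheme = urlparse(location).scheme.lower()
    if not scheme and location.startswith("/"):
        # Bare absolute paths are treated as local files in this sketch.
        return "file"
    return _SCHEME_TO_PLATFORM.get(scheme)
```

Returning `None` for unknown schemes lets the caller decide whether to skip lineage for that table or report a warning.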

Code Refactoring

  • Moved hive.py → hive/hive_source.py
  • Moved hive_metastore.py → hive/hive_metastore_source.py
  • Created hive/storage_lineage.py with shared logic
  • Extracted HiveStorageLineageConfigMixin to eliminate duplication
  • Updated setup.py entry points to use fully qualified paths

Code Quality Improvements

  • Converted HiveStorageLineageConfig to Pydantic model
  • Created LineageDirection and StoragePlatform StrEnums for type safety
  • Fixed unsafe exception handling in get_db_schema (now raises ValueError for invalid input)
  • Improved error reporting in get_workunits_internal (specific exceptions + proper logging)
  • Removed 50+ redundant comments
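
As a rough illustration of the enum and `ValueError` items above, the sketch below parses a raw config string into a direction enum and fails loudly on bad input. It uses `(str, Enum)` from the standard library as a portable stand-in for a StrEnum; the helper name `parse_direction` is hypothetical, and the real `LineageDirection` in the PR may be defined differently.

```python
from enum import Enum


class LineageDirection(str, Enum):
    # Values mirror the two directions named in the PR description.
    UPSTREAM = "upstream"
    DOWNSTREAM = "downstream"


def parse_direction(raw: str) -> LineageDirection:
    """Convert a config string to LineageDirection, raising ValueError if invalid."""
    try:
        return LineageDirection(raw.strip().lower())
    except ValueError:
        allowed = ", ".join(d.value for d in LineageDirection)
        raise ValueError(
            f"Invalid lineage direction {raw!r}; expected one of: {allowed}"
        ) from None
```

Because the enum subclasses `str`, members still compare equal to the plain strings used in existing recipes.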

PR Checks

  • The PR conforms to DataHub's Contributing Guideline (particularly PR Title Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@acrylJonny acrylJonny marked this pull request as draft November 27, 2025 16:02
@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Nov 27, 2025
codecov bot commented Nov 27, 2025

Codecov Report

❌ Patch coverage is 83.97933% with 62 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...c/datahub/ingestion/source/sql/hive/hive_source.py | 79.06% | 36 Missing ⚠️ |
| ...tahub/ingestion/source/sql/hive/storage_lineage.py | 90.41% | 16 Missing ⚠️ |
| ...ingestion/source/sql/hive/hive_metastore_source.py | 77.27% | 10 Missing ⚠️ |

    description="Simplify v2 field paths to v1 by default. If the schema has Union or Array types, still falls back to v2",
)

emit_storage_lineage: bool = Field(
Contributor

do we usually call this emit_... or include_...?

    default=False,
    description="Whether to emit storage-to-Hive lineage",
)
hive_storage_lineage_direction: str = Field(
Contributor

this ability to choose direction, do we have something similar in other sources?

anyway, it could be a Literal https://docs.pydantic.dev/1.10/usage/types/#literal-type
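
A minimal sketch of the `Literal` idea, using only the standard library. In a Pydantic model the `Literal` annotation on the field would be validated by Pydantic itself; the dataclass below is a hypothetical stand-in for the source config, so it checks the allowed values by hand. The class name and the `"upstream"` default are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Literal, get_args

Direction = Literal["upstream", "downstream"]


@dataclass
class StorageLineageSettings:  # hypothetical stand-in for the source config
    emit_storage_lineage: bool = False
    hive_storage_lineage_direction: Direction = "upstream"  # assumed default

    def __post_init__(self) -> None:
        # Dataclasses do not enforce annotations, so check the Literal by hand.
        allowed = get_args(Direction)
        if self.hive_storage_lineage_direction not in allowed:
            raise ValueError(
                f"hive_storage_lineage_direction must be one of {allowed}, "
                f"got {self.hive_storage_lineage_direction!r}"
            )
```

Compared with a plain `str` field, the allowed values live in the type itself, so type checkers and generated docs can see them.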

Collaborator Author

We already have this option in the main hive source (i.e. not the hive metastore source).

Collaborator Author

Therefore I'm adding it here as well for consistency. One of the main reasons for the option is that Spark can often show lineage only into the files rather than the metastore, depending on how the table is updated.
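
The effect of the direction setting can be sketched as follows. This assumes that `upstream` means the storage location is upstream of the Hive table and `downstream` means the reverse; the `LineageEdge` type, the function name, and the example URNs are hypothetical and only illustrate the idea.

```python
from typing import NamedTuple


class LineageEdge(NamedTuple):
    upstream_urn: str
    downstream_urn: str


def build_storage_edge(table_urn: str, storage_urn: str, direction: str) -> LineageEdge:
    """Pick which side of the table/storage pair is the upstream entity."""
    if direction == "upstream":
        # Assumed meaning: storage feeds the Hive table.
        return LineageEdge(upstream_urn=storage_urn, downstream_urn=table_urn)
    if direction == "downstream":
        # Assumed meaning: the Hive table feeds the storage location.
        return LineageEdge(upstream_urn=table_urn, downstream_urn=storage_urn)
    raise ValueError(f"Unknown direction: {direction!r}")


# Hypothetical example URNs, for illustration only.
TABLE = "urn:li:dataset:(urn:li:dataPlatform:hive,db.orders,PROD)"
STORAGE = "urn:li:dataset:(urn:li:dataPlatform:s3,bucket/warehouse/db/orders,PROD)"
edge = build_storage_edge(TABLE, STORAGE, "upstream")
```

With Spark writing files directly, choosing the direction lets the file-level lineage connect to the Hive table on the expected side.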

@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Dec 3, 2025
@datahub-cyborg datahub-cyborg bot added needs-review Label for PRs that need review from a maintainer. and removed pending-submitter-response Issue/request has been reviewed but requires a response from the submitter labels Dec 3, 2025
if self._COMPLEX_TYPE.match(fields[0].nativeDataType) and isinstance(
    fields[0].type.type, NullTypeClass
):
    assert len(fields) == 1
@aikido-pr-checks aikido-pr-checks bot Dec 3, 2025

Dangerous use of assert - low severity
When Python runs in optimized mode, assert statements are not executed. This mode is enabled with the -O command line flag or the PYTHONOPTIMIZE environment variable, and may be turned on in production. Any safety check done using assert will then be skipped.

Remediation: Raise an exception instead of using assert.
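
A minimal sketch of that remediation, assuming the check only needs to guarantee that exactly one field is present. The helper name `require_single_field` is hypothetical; unlike an `assert`, the explicit `raise` still runs under `python -O`.

```python
from typing import Sequence


def require_single_field(fields: Sequence[object]) -> object:
    """Return the only element of fields, or raise if the count is not one."""
    if len(fields) != 1:
        raise ValueError(
            f"Expected exactly one field for a complex type, got {len(fields)}"
        )
    return fields[0]
```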

Contributor

Great docs!

Still, a new user may wonder: which one should I use, hive or hive-metastore? Is hive going to be deprecated? Is there feature parity, or which features does one have that the other lacks?

And for existing users of hive: what's the plan? Is it going to be deprecated eventually? Are new features going to be implemented in the hive one too?

This creates URNs like:

```
urn:li:dataset:(urn:li:dataPlatform:hive,database.table,prod-hive)
```
Contributor

both hive and hive-metastore generate URNs with the hive platform in the URN, right?
so if a user moves from hive to hive-metastore, that won't be a breaking change in the identity of the assets, right?

Collaborator Author

That is correct. You can move between the two: the generated URNs are ultimately the same, and the sources only differ in how they read the metadata (metastore db vs thrift). This hasn't changed.

@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Dec 4, 2025
Comment on lines -783 to +840
- # add view properties
- properties: Dict[str, str] = {
-     "is_view": "True",
- }
+ properties: Dict[str, str] = {"is_view": "True"}
Contributor

this is not new behaviour

anyway, why not use the SubTypes aspect here to track this as a view?

Collaborator Author

happy to make this a view subtype - just didn't want to introduce this as it is a change in the connector. Happy to do so if you think that's wise.

"type": coltype,
"nullable": True,
"default": None,
"full_type": orig_col_type, # pass it through
Contributor

so full_type is what's really missing in the original non-patched method?

what would be the impact of missing it?
a patch is heavy tech debt, so I'm wondering if the value is worth it

Collaborator Author

@acrylJonny acrylJonny Dec 4, 2025

This is pre-existing -

result.append(
    {
        "name": col_name,
        "type": coltype,
        "nullable": True,
        "default": None,
        "full_type": orig_col_type, # pass it through
        "comment": _comment,

Contributor

so this patch is not new? is that what you mean?

Comment on lines +136 to +137
HiveDialect.get_view_names = get_view_names_patched
HiveDialect.get_view_definition = get_view_definition_patched
Contributor

HiveDialect, this comes from our fork, right? https://github.com/acryldata/PyHive
Could we just fix it there?
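
For context on why the patch placement matters, the sketch below shows what a class-level assignment like the one above does in general. `Dialect` is a stand-in class written for this example, not PyHive's `HiveDialect`, and the patched function body is made up.

```python
from typing import List


class Dialect:  # stand-in for a third-party dialect class
    def get_view_names(self, schema: str) -> List[str]:
        return []


def get_view_names_patched(self: Dialect, schema: str) -> List[str]:
    # Illustrative replacement behaviour only.
    return [f"{schema}.example_view"]


before = Dialect()
# Assigning on the class replaces the method for every instance in the
# process, including instances created before the patch was applied.
Dialect.get_view_names = get_view_names_patched
after = Dialect()
```

Since the change is process-wide and lives outside the patched library, fixing the method in the library itself keeps the behaviour in one place.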

Comment on lines +398 to +403
except (ValueError, TypeError, AttributeError) as e:
    logger.warning(
        f"Failed to create storage dataset MCPs for {storage_location}: {e}",
        exc_info=True,
    )
    return
Contributor

this exception swallowing pattern, is that what we usually do in other sources when emission fails?
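
One possible variation on the pattern in question, sketched under assumptions: besides logging, the failure is also recorded on a report object so it shows up in the run summary instead of only in the logs. `SketchReport`, `emit_storage_workunits`, and the `build` callable are hypothetical names for this example and do not reflect how any existing source is written.

```python
import logging
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, List

logger = logging.getLogger(__name__)


@dataclass
class SketchReport:  # hypothetical stand-in for a source report
    warnings: List[str] = field(default_factory=list)

    def warn(self, message: str) -> None:
        self.warnings.append(message)


def emit_storage_workunits(
    storage_location: str,
    build: Callable[[str], Iterable[str]],
    report: SketchReport,
) -> Iterator[str]:
    """Yield work units for a location; on failure, log and record a warning."""
    try:
        workunits = list(build(storage_location))
    except (ValueError, TypeError, AttributeError) as e:
        message = f"Failed to create storage dataset MCPs for {storage_location}: {e}"
        logger.warning(message, exc_info=True)
        report.warn(message)  # keeps the failure visible after the run
        return
    yield from workunits
```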

@sgomezvillamor
Contributor

the code coverage report shows little coverage of failure/exception scenarios; we could improve that a bit, mainly in hive_source.py

also concerning: the lack of coverage for dbapi_get_columns_patched

@acrylJonny
Collaborator Author

> the code coverage report shows little coverage of failure/exception scenarios; we could improve that a bit, mainly in hive_source.py
>
> also concerning: the lack of coverage for dbapi_get_columns_patched

I'll add these today so that we can get this signed off.
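
A rough sketch of what a failure-path test could look like, using only `unittest` from the standard library. The function under test, `safe_get_columns`, is a hypothetical wrapper written for this example and is not the PR's `dbapi_get_columns_patched`; the project's real tests may be structured differently.

```python
import logging
import unittest
from typing import List
from unittest import mock

logger = logging.getLogger("hive_sketch")


def safe_get_columns(inspector, table: str) -> List[dict]:
    """Hypothetical wrapper: return [] and log a warning when the lookup fails."""
    try:
        return inspector.get_columns(table)
    except (ValueError, TypeError) as e:
        logger.warning("Failed to get columns for %s: %s", table, e)
        return []


class FailurePathTest(unittest.TestCase):
    def test_lookup_error_is_logged_and_swallowed(self) -> None:
        inspector = mock.Mock()
        # Force the failure branch instead of relying on real bad input.
        inspector.get_columns.side_effect = ValueError("bad type string")
        with self.assertLogs("hive_sketch", level="WARNING") as captured:
            result = safe_get_columns(inspector, "db.orders")
        self.assertEqual(result, [])
        self.assertIn("db.orders", captured.output[0])
```

Forcing the exception with `side_effect` exercises the `except` branch directly, which is the kind of line the coverage report flags as missing.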

@datahub-cyborg datahub-cyborg bot added needs-review Label for PRs that need review from a maintainer. pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed pending-submitter-response Issue/request has been reviewed but requires a response from the submitter needs-review Label for PRs that need review from a maintainer. labels Dec 4, 2025
Labels

ingestion PR or Issue related to the ingestion of metadata pending-submitter-response Issue/request has been reviewed but requires a response from the submitter

3 participants