[DOP-22366] Basic column lineage handling in consumer #155

Merged (11 commits) on Feb 19, 2025
13 changes: 13 additions & 0 deletions docs/changelog/next_release/155.feature.rst
@@ -0,0 +1,13 @@
Implement basic column lineage handler in consumer.

There are 2 DTO classes:

* ``ColumnLineageDTO`` - describes a tuple (operation, source_dataset, target_dataset).
* ``DatasetColumnRelationDTO`` - describes source_column -> target_column (optional) relations within a ``ColumnLineageDTO``.
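
For orientation, the two DTOs could be shaped roughly as below. This is a minimal sketch, not the code from this PR: the field names (``operation_id``, ``source_dataset_id``, ``target_dataset_id``, ``column_relations``) and the ``unique_key`` helper are hypothetical, and only the class names and the optional ``target_column`` come from the description above.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class DatasetColumnRelationDTO:
    # One source_column -> target_column relation.
    source_column: str
    # Optional, as stated in the changelog entry.
    target_column: Optional[str] = None


@dataclass
class ColumnLineageDTO:
    # Hypothetical identifiers for the (operation, source_dataset, target_dataset) tuple.
    operation_id: str
    source_dataset_id: int
    target_dataset_id: int
    column_relations: List[DatasetColumnRelationDTO] = field(default_factory=list)

    @property
    def unique_key(self) -> Tuple[str, int, int]:
        # Hypothetical helper: the tuple that identifies this lineage record.
        return (self.operation_id, self.source_dataset_id, self.target_dataset_id)
```

Keeping the relation DTO frozen makes it hashable, so duplicate relations can be collapsed with a ``set`` before they are attached to a ``ColumnLineageDTO``.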

The extractor builds a list of ``ColumnLineageDTO`` objects for each operation, and then performs a bulk insert.
Table rows are immutable, so the insert uses ``ON CONFLICT DO NOTHING``, with a few additional checks to reduce database IO.
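
The bulk insert pattern can be illustrated with a small self-contained sketch. It uses the standard-library ``sqlite3`` module purely as a stand-in database (SQLite 3.24+ accepts ``ON CONFLICT ... DO NOTHING``); the table name, column names, and both functions are hypothetical, and the "additional checks" mentioned above are not reproduced here.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical table: the primary key is the (operation, source, target) tuple.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS column_lineage (
            operation_id TEXT NOT NULL,
            source_dataset_id INTEGER NOT NULL,
            target_dataset_id INTEGER NOT NULL,
            PRIMARY KEY (operation_id, source_dataset_id, target_dataset_id)
        )
        """
    )


def bulk_insert_column_lineage(conn: sqlite3.Connection, rows) -> int:
    # Rows are immutable, so a conflicting row is skipped instead of updated.
    conn.executemany(
        """
        INSERT INTO column_lineage (operation_id, source_dataset_id, target_dataset_id)
        VALUES (?, ?, ?)
        ON CONFLICT (operation_id, source_dataset_id, target_dataset_id) DO NOTHING
        """,
        rows,
    )
    conn.commit()
    # Return the total number of rows stored after the insert.
    return conn.execute("SELECT COUNT(*) FROM column_lineage").fetchone()[0]
```

Sending the same batch twice leaves the table unchanged, so a consumer that re-processes a message does not need a separate existence check to stay correct.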

The OpenLineage integration for Spark before v1.23 (or with ``columnLineage.datasetLineageEnabled=false``, which is still the default)
produced INDIRECT lineage for each combination of source_column x target_column,
which is almost a cartesian product. It is VERY expensive to handle, so we ignore these cases.
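
One possible reading of "ignore" is to drop INDIRECT transformations from the per-column lineage before building DTOs. The sketch below assumes a simplified input shape (target column name mapped to a list of input fields, each with a ``field`` name and a ``transformations`` list whose items carry a ``type``); that shape and the ``drop_indirect`` function are illustrative assumptions, not the consumer's actual code.

```python
from typing import Any, Dict, List

# Assumed simplified shape: {target_column: [{"field": ..., "transformations": [{"type": ...}]}]}
Fields = Dict[str, List[Dict[str, Any]]]


def drop_indirect(fields: Fields) -> Fields:
    result: Fields = {}
    for target_column, input_fields in fields.items():
        kept = []
        for input_field in input_fields:
            # Keep only transformations that are not marked INDIRECT.
            remaining = [
                t for t in input_field.get("transformations", [])
                if t.get("type") != "INDIRECT"
            ]
            if remaining:
                kept.append({**input_field, "transformations": remaining})
        # A target column with no remaining inputs is omitted entirely.
        if kept:
            result[target_column] = kept
    return result
```

With N source columns and M target columns, the near-cartesian INDIRECT output is on the order of N x M relations per dataset pair, which is why filtering it out before the bulk insert saves most of the work.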