Getting column lineage from subquery with column name specified #638

Open
Ricardop1 opened this issue Jul 17, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@Ricardop1

Describe the bug
Column-level relations are wrong when a subquery (intermediate table) is given an alias with explicit column names. The columns are attributed to a subquery named after the column list, (col_1_renamed, col_2_renamed), instead of to the specified alias, Inter_table.

SQL

insert into db_test.table_target
select col_1_renamed, col_2_renamed from
(Select col1, max(col2) from db_test.table_src
where
  col1 = "AV" and
  col2 is not null
) AS Inter_table(col_1_renamed, col_2_renamed);

To Reproduce

from sqllineage.runner import LineageRunner, LineageLevel

query = """ insert into db_test.table_target
select col_1_renamed, col_2_renamed from
(Select col1, max(col2) from db_test.table_src
where
  col1 = "AV" and
  col2 is not null
) AS Inter_table(col_1_renamed, col_2_renamed);
"""
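# Parse the statement with the Teradata dialect.
result = LineageRunner(query, dialect="teradata")
# Print each cytoscape element (node or edge) on its own line.
for element in result.to_cytoscape(LineageLevel.COLUMN):
    print(element)

Output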
{'data': {'id': 'db_test.table_src.col1', 'parent': 'db_test.table_src', 'parent_candidates': [{'name': 'db_test.table_src', 'type': 'Table'}], 'type': 'Column'}}
{'data': {'id': '(col_1_renamed, col_2_renamed).col1', 'parent': '(col_1_renamed, col_2_renamed)', 'parent_candidates': [{'name': '(col_1_renamed, col_2_renamed)', 'type': 'SubQuery'}], 'type': 'Column'}}
{'data': {'id': 'db_test.table_src.col2', 'parent': 'db_test.table_src', 'parent_candidates': [{'name': 'db_test.table_src', 'type': 'Table'}], 'type': 'Column'}}
{'data': {'id': '(col_1_renamed, col_2_renamed).max(col2)', 'parent': '(col_1_renamed, col_2_renamed)', 'parent_candidates': [{'name': '(col_1_renamed, col_2_renamed)', 'type': 'SubQuery'}], 'type': 'Column'}}
{'data': {'id': '(col_1_renamed, col_2_renamed).col_1_renamed', 'parent': '(col_1_renamed, col_2_renamed)', 'parent_candidates': [{'name': '(col_1_renamed, col_2_renamed)', 'type': 'SubQuery'}], 'type': 'Column'}}
{'data': {'id': 'db_test.table_target.col_1_renamed', 'parent': 'db_test.table_target', 'parent_candidates': [{'name': 'db_test.table_target', 'type': 'Table'}], 'type': 'Column'}}
{'data': {'id': '(col_1_renamed, col_2_renamed).col_2_renamed', 'parent': '(col_1_renamed, col_2_renamed)', 'parent_candidates': [{'name': '(col_1_renamed, col_2_renamed)', 'type': 'SubQuery'}], 'type': 'Column'}}
{'data': {'id': 'db_test.table_target.col_2_renamed', 'parent': 'db_test.table_target', 'parent_candidates': [{'name': 'db_test.table_target', 'type': 'Table'}], 'type': 'Column'}}
{'data': {'id': 'db_test.table_src', 'type': 'Table'}}
{'data': {'id': '(col_1_renamed, col_2_renamed)', 'type': 'SubQuery'}}
{'data': {'id': 'db_test.table_target', 'type': 'Table'}}
{'data': {'id': 'e0', 'source': 'db_test.table_src.col1', 'target': '(col_1_renamed, col_2_renamed).col1'}}
{'data': {'id': 'e1', 'source': 'db_test.table_src.col2', 'target': '(col_1_renamed, col_2_renamed).max(col2)'}}
{'data': {'id': 'e2', 'source': '(col_1_renamed, col_2_renamed).col_1_renamed', 'target': 'db_test.table_target.col_1_renamed'}}
{'data': {'id': 'e3', 'source': '(col_1_renamed, col_2_renamed).col_2_renamed', 'target': 'db_test.table_target.col_2_renamed'}}
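
To make the reported lineage easier to read, the edge elements can be reduced to source -> target pairs. The snippet below is a minimal sketch and not part of sqllineage: `extract_edges` is a hypothetical helper, and `elements` is a hand-copied subset of the output above (one node plus the four edges).

```python
def extract_edges(elements):
    """Return (source, target) pairs from cytoscape-style elements."""
    edges = []
    for element in elements:
        data = element["data"]
        # Edge elements carry both 'source' and 'target'; node elements do not.
        if "source" in data and "target" in data:
            edges.append((data["source"], data["target"]))
    return edges


# Hand-copied subset of the output above (assumption: the shape is unchanged).
elements = [
    {"data": {"id": "db_test.table_src", "type": "Table"}},
    {"data": {"id": "e0", "source": "db_test.table_src.col1",
              "target": "(col_1_renamed, col_2_renamed).col1"}},
    {"data": {"id": "e1", "source": "db_test.table_src.col2",
              "target": "(col_1_renamed, col_2_renamed).max(col2)"}},
    {"data": {"id": "e2", "source": "(col_1_renamed, col_2_renamed).col_1_renamed",
              "target": "db_test.table_target.col_1_renamed"}},
    {"data": {"id": "e3", "source": "(col_1_renamed, col_2_renamed).col_2_renamed",
              "target": "db_test.table_target.col_2_renamed"}},
]

for source, target in extract_edges(elements):
    print(f"{source} -> {target}")
```

Listed this way, a second problem becomes visible: the subquery's outputs (col1, max(col2)) and the columns read from it (col_1_renamed, col_2_renamed) are separate nodes with no edge between them, so the chain from db_test.table_src to db_test.table_target is broken.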

Expected behavior
The code should detect Inter_table as the subquery, not (col_1_renamed, col_2_renamed) as a whole. The column lineage, as source -> target, should be:
db_test.table_src.col1 -> Inter_table.col_1_renamed
db_test.table_src.col2 -> Inter_table.col_2_renamed

Inter_table.col_1_renamed -> db_test.table_target.col_1_renamed
Inter_table.col_2_renamed -> db_test.table_target.col_2_renamed
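
The expected lineage can also be written down as data and compared with the edges actually reported, which is roughly what a regression test for this issue would check. This is only a sketch: the `actual` set is copied by hand from the output in this report rather than produced by calling sqllineage, and the variable names are illustrative.

```python
# Expected edges, as listed in this report.
expected = {
    ("db_test.table_src.col1", "Inter_table.col_1_renamed"),
    ("db_test.table_src.col2", "Inter_table.col_2_renamed"),
    ("Inter_table.col_1_renamed", "db_test.table_target.col_1_renamed"),
    ("Inter_table.col_2_renamed", "db_test.table_target.col_2_renamed"),
}

# Edges e0..e3 from the reported output (hand-copied, not computed here).
bad_alias = "(col_1_renamed, col_2_renamed)"
actual = {
    ("db_test.table_src.col1", f"{bad_alias}.col1"),
    ("db_test.table_src.col2", f"{bad_alias}.max(col2)"),
    (f"{bad_alias}.col_1_renamed", "db_test.table_target.col_1_renamed"),
    (f"{bad_alias}.col_2_renamed", "db_test.table_target.col_2_renamed"),
}

missing = sorted(expected - actual)      # expected but not reported
unexpected = sorted(actual - expected)   # reported but not expected

for source, target in missing:
    print(f"missing:    {source} -> {target}")
for source, target in unexpected:
    print(f"unexpected: {source} -> {target}")
```

None of the four expected edges appears in the reported output, because every edge touches a column whose parent is the mis-named subquery.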

Python version

  • 3.10.12

SQLLineage version (available via sqllineage --version):

  • 1.5.3

Additional context
Looks like there is a problem when detecting intermediate tables

Ricardop1 added the bug (Something isn't working) label on Jul 17, 2024
@reata
Owner

reata commented Feb 7, 2025

I wasn't aware that one can also specify column names for a subquery like this. This syntax has never been handled by the analyzer, so naturally the result is incorrect.

I did some research, and it seems this syntax is not Teradata-specific. PostgreSQL and MySQL support it as well.
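
For reference, the column list in `AS Inter_table(col_1_renamed, col_2_renamed)` applies positionally: the first name renames the subquery's first output column, the second name the second, and so on. The sketch below illustrates only that mapping. It is independent of sqllineage's internals, `rename_subquery_columns` and its parameters are hypothetical names, and it assumes the list provides exactly one name per output column.

```python
def rename_subquery_columns(alias, select_exprs, alias_columns):
    """Map each subquery output expression to its renamed, alias-qualified column.

    Assumption: alias_columns has exactly one name per output expression.
    """
    if len(select_exprs) != len(alias_columns):
        raise ValueError("alias column list must match the subquery's output columns")
    # Pair expressions and names by position, then qualify with the alias.
    return {
        expr: f"{alias}.{name}"
        for expr, name in zip(select_exprs, alias_columns)
    }


# Values taken from the query in this report.
mapping = rename_subquery_columns(
    "Inter_table",
    ["col1", "max(col2)"],
    ["col_1_renamed", "col_2_renamed"],
)
for expr, column in mapping.items():
    print(f"{expr} -> {column}")
```

With a mapping like this, the outputs col1 and max(col2) would resolve to Inter_table.col_1_renamed and Inter_table.col_2_renamed, which is the parent and naming the report asks for.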

reata changed the title from "Bug when Getting column lineage on Teradata with intermediate tables" to "Bug when Getting column lineage on subquery with column name specified" on Feb 7, 2025
reata added the enhancement (New feature or request) label and removed the bug (Something isn't working) label on Feb 7, 2025
reata changed the title from "Bug when Getting column lineage on subquery with column name specified" to "Getting column lineage from subquery with column name specified" on Feb 7, 2025