A tool to analyze the impact of schema drift (disappearing columns) in dbt source tables by generating column lineage reports.
This project helps track how changes in source table columns (such as columns being removed or renamed) affect downstream dbt models. It uses column lineage data generated by dbt-colibri to create HTML reports showing direct and downstream usage of source columns.
- Python 3.6+
- dbt-colibri (install via `pip install dbt-colibri`)
- Install dbt-colibri:

  ```shell
  pip install dbt-colibri
  ```
- In your dbt project directory, run the following commands to generate the required manifest:

  ```shell
  dbt compile
  dbt docs generate
  colibri generate
  ```

  This will create `colibri-manifest.json` in your dbt project's `dist/` directory.
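Before pointing the script at the manifest, you can sanity-check that it was produced and parses correctly. A minimal check, assuming the default `dist/` location (it only verifies the file exists and is valid JSON; it makes no assumption about the manifest's internal structure):

```python
import json
from pathlib import Path


def check_manifest(path):
    """Return a short status string for a colibri manifest file."""
    p = Path(path)
    if not p.exists():
        return "missing: re-run `dbt compile`, `dbt docs generate`, `colibri generate`"
    # Will raise json.JSONDecodeError if the file is truncated or corrupt.
    data = json.loads(p.read_text())
    return f"ok: {len(data)} top-level keys"


print(check_manifest("dist/colibri-manifest.json"))
```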
- Update the `MANIFEST_PATH` variable in `schema_drift_impact.py` to point to your `colibri-manifest.json` file (e.g., `../integration_tests/dist/colibri-manifest.json`).
- Modify the `INPUTS` list in `schema_drift_impact.py` to specify the source tables and columns you want to analyze. Each input should be a dictionary with:
  - `source`: The fully qualified source name (e.g., `"source.integration_tests.kdrogaieva.source_table"`)
  - `source_column`: The column name to track (e.g., `"dummy_varchar"`)
- Set `OUTPUT_FORMAT` to `"html"` (default) to generate an HTML report, or `"json"` for JSON output.
- Run the script:

  ```shell
  python schema_drift_impact.py
  ```

- If using HTML output, the report is saved as `column_lineage_report.html` and also printed to the console.
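Putting the configuration steps together, the top of `schema_drift_impact.py` might look like this (the path and source names are the illustrative examples from above, not required values):

```python
# Example configuration for schema_drift_impact.py.
# Path and source names are illustrative; adjust them for your project.
MANIFEST_PATH = "../integration_tests/dist/colibri-manifest.json"

INPUTS = [
    {
        "source": "source.integration_tests.kdrogaieva.source_table",
        "source_column": "dummy_varchar",
    },
]

OUTPUT_FORMAT = "html"  # or "json"
```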
The report shows, for each specified source column:

- Direct Usage: Models that directly reference the column
- Downstream Usage: Models that depend on the direct-usage models
- Errors: Any issues encountered during the analysis, or a note that nothing was found if the column is not used anywhere
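Downstream usage amounts to a transitive walk of the column lineage graph. A minimal sketch of that traversal, assuming a simplified parent-to-children adjacency mapping (a stand-in for illustration, not the actual colibri manifest format):

```python
from collections import deque


def downstream_models(lineage, start):
    """Breadth-first walk over a column lineage graph.

    `lineage` maps a node ("model.column") to the list of nodes that
    read from it; returns every node reachable from `start`.
    """
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Toy lineage: the source column feeds a staging model, which feeds a mart.
lineage = {
    "source_table.dummy_varchar": ["stg_model.dummy_varchar"],
    "stg_model.dummy_varchar": ["mart_model.dummy_varchar"],
}
print(sorted(downstream_models(lineage, "source_table.dummy_varchar")))
# ['mart_model.dummy_varchar', 'stg_model.dummy_varchar']
```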
- `schema_drift_impact.py`: Main script to generate reports
- `colibri_lineage.py`: Library for processing column lineage data
- `column_lineage_report.html`: Generated HTML report (example)
This tool is particularly useful for:
- Assessing the impact of schema changes in source systems
- Identifying which dbt models would be affected by column removals
- Planning data migration or refactoring efforts
- Ensuring data pipeline reliability when source schemas evolve