KaterynaD/schema_drift_impact

schema_drift_impact

A tool to analyze the impact of schema drift (disappearing columns) in dbt source tables by generating column lineage reports.

Overview

This project helps track how changes in source table columns (such as columns being removed or renamed) affect downstream dbt models. It uses column lineage data generated by dbt-colibri to create HTML reports showing direct and downstream usage of source columns.

Dependencies

  • Python 3.6+
  • dbt-colibri (install via pip install dbt-colibri)

Setup

  1. Install dbt-colibri:

    pip install dbt-colibri
  2. In your dbt project directory, run the following commands to generate the required manifest:

    dbt compile
    dbt docs generate
    colibri generate

    This will create colibri-manifest.json in your dbt project's dist/ directory.
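Before moving on, it can help to confirm that the manifest was actually written. The following is a minimal sketch, not part of this repository: `load_manifest` is a hypothetical helper, and it assumes only that the file is valid JSON, without relying on any particular layout inside it.

```python
import json
from pathlib import Path


def load_manifest(path):
    """Hypothetical helper: load a colibri manifest and return the parsed JSON.

    Raises FileNotFoundError with a hint if the file is missing.
    """
    manifest_path = Path(path)
    if not manifest_path.is_file():
        raise FileNotFoundError(
            f"{manifest_path} not found; run 'colibri generate' in the dbt project first"
        )
    with manifest_path.open(encoding="utf-8") as fh:
        return json.load(fh)


if __name__ == "__main__":
    # Path from the Setup step, relative to the dbt project directory.
    try:
        manifest = load_manifest("dist/colibri-manifest.json")
    except FileNotFoundError as exc:
        print(exc)
    else:
        # Only the top-level type is reported; the internal layout is not assumed.
        print(f"Loaded manifest, top-level type: {type(manifest).__name__}")
```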

Usage

  1. Update the MANIFEST_PATH variable in schema_drift_impact.py to point to your colibri-manifest.json file (e.g., ../integration_tests/dist/colibri-manifest.json).

  2. Modify the INPUTS list in schema_drift_impact.py to specify the source tables and columns you want to analyze. Each input should be a dictionary with:

    • source: The fully qualified source name (e.g., "source.integration_tests.kdrogaieva.source_table")
    • source_column: The column name to track (e.g., "dummy_varchar")
  3. Set the OUTPUT_FORMAT to "html" (default) to generate an HTML report, or "json" for JSON output.

  4. Run the script:

    python schema_drift_impact.py
  5. If using HTML output, the report will be saved as column_lineage_report.html and also printed to the console.
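Taken together, the settings from steps 1 to 3 might look like the sketch below. The variable names and example values come from the steps above; `validate_config` is a hypothetical helper added here for illustration and is not assumed to exist in schema_drift_impact.py.

```python
# Values mirror the examples given in the usage steps.
MANIFEST_PATH = "../integration_tests/dist/colibri-manifest.json"

INPUTS = [
    {
        "source": "source.integration_tests.kdrogaieva.source_table",
        "source_column": "dummy_varchar",
    },
]

OUTPUT_FORMAT = "html"  # or "json"


def validate_config(inputs, output_format):
    """Hypothetical helper (not part of the repo): return a list of config problems."""
    problems = []
    if output_format not in ("html", "json"):
        problems.append(f"unsupported OUTPUT_FORMAT: {output_format!r}")
    for i, item in enumerate(inputs):
        for key in ("source", "source_column"):
            # Each input must be a dict with a non-empty value for both keys.
            if not isinstance(item, dict) or not item.get(key):
                problems.append(f"INPUTS[{i}] is missing {key!r}")
    return problems


print(validate_config(INPUTS, OUTPUT_FORMAT))  # prints []
```

Checking the inputs up front is a design choice of this sketch: a mistyped key would otherwise only surface later as a column that appears to be unused.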

Output

The report shows for each specified source column:

  • Direct Usage: Models that directly reference the column
  • Downstream Usage: Models that depend on the direct usage models
  • Errors: Any issues encountered during analysis, or a notification that nothing was found if the column is not used anywhere
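The difference between direct and downstream usage can be illustrated with a small breadth-first traversal. This is a sketch under stated assumptions: the `COLUMN_USAGE` and `MODEL_CHILDREN` dictionaries, the model names, and the `impact` function are all hypothetical, the toy data does not reflect the colibri-manifest.json layout, and the traversal works at model level rather than reproducing how colibri_lineage.py resolves column lineage.

```python
from collections import deque

# Hypothetical toy lineage, NOT the colibri-manifest.json layout.
# Maps each (source, column) pair to the models that reference it directly.
COLUMN_USAGE = {
    ("source.integration_tests.kdrogaieva.source_table", "dummy_varchar"): ["stg_model"],
}
# Maps each model to the models that select from it.
MODEL_CHILDREN = {
    "stg_model": ["int_model"],
    "int_model": ["mart_model"],
}


def impact(source, column):
    """Return direct usage, downstream usage, and errors for one source column."""
    direct = list(COLUMN_USAGE.get((source, column), []))
    if not direct:
        # Mirrors the "nothing found" notification described above.
        return {"direct": [], "downstream": [], "errors": ["nothing found"]}
    seen = set(direct)
    downstream = []
    queue = deque(direct)
    while queue:
        # Breadth-first walk over models that depend on the current model.
        for child in MODEL_CHILDREN.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                downstream.append(child)
                queue.append(child)
    return {"direct": direct, "downstream": downstream, "errors": []}


report = impact("source.integration_tests.kdrogaieva.source_table", "dummy_varchar")
print(report)
```

With the toy data above, `stg_model` is the direct usage, while `int_model` and `mart_model` are reported as downstream.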

Files

  • schema_drift_impact.py: Main script to generate reports
  • colibri_lineage.py: Library for processing column lineage data
  • column_lineage_report.html: Generated HTML report (example)

Purpose

This tool is particularly useful for:

  • Assessing the impact of schema changes in source systems
  • Identifying which dbt models would be affected by column removals
  • Planning data migration or refactoring efforts
  • Ensuring data pipeline reliability when source schemas evolve
