Automatic Schema Lineage Discovery for SQL and Notebook Pipelines — includes Algorithm 1 implementation, SDG builder, synthetic corpus generator, and Colab quickstart for reproducible experiments.

habiiibo03/StructureLineage

🧩 StructureLineage

StructureLineage is a framework for generating synthetic pipelines, building Schema Dependency Graphs (SDGs), and evaluating lineage mappings with precision/recall metrics.

Author: Habib Maicha


🚀 Quickstart in Google Colab

Run the full pipeline interactively with one click:

Open In Colab

The Quickstart notebook demonstrates how to:

  1. Clone the StructureLineage repo
  2. Generate a synthetic project
  3. Build the Schema Dependency Graph (SDG)
  4. Evaluate precision/recall against ground truth
  5. ✅ Auto-verify pipeline success
  6. 📊 Visualize the SDG (tables vs views)
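As a rough illustration of step 4, precision and recall can be computed over sets of lineage edges. This is a minimal sketch, not the repository's evaluation code: the `(source, target)` edge encoding, the `precision_recall` helper, and the sample node names are all assumptions made for the example.

```python
from typing import Iterable, Set, Tuple

Edge = Tuple[str, str]  # (source, target): hypothetical edge encoding


def precision_recall(predicted: Iterable[Edge], truth: Iterable[Edge]) -> Tuple[float, float]:
    """Return (precision, recall) of predicted lineage edges against ground truth."""
    pred: Set[Edge] = set(predicted)
    gold: Set[Edge] = set(truth)
    true_pos = len(pred & gold)
    # Convention chosen for this sketch: an empty denominator yields 1.0.
    precision = true_pos / len(pred) if pred else 1.0
    recall = true_pos / len(gold) if gold else 1.0
    return precision, recall


# Made-up edges: two of the three predictions match the ground truth.
predicted = {("table_0", "view_0"), ("table_1", "view_0"), ("table_2", "view_1")}
truth = {("table_0", "view_0"), ("table_1", "view_0"), ("table_1", "view_1")}
p, r = precision_recall(predicted, truth)
print(f"precision={p:.2f} recall={r:.2f}")
```

Comparing edge sets keeps the metric independent of how the graph is stored; a column-level evaluation would only need a richer edge tuple.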

📊 Example SDG Visualization

Here’s a simplified layered view of tables (blue) and views (green):

Schema Dependency Graph
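A layered view like the one described above can be derived from dependency depth. The sketch below uses only the standard library (`graphlib`, available in Python 3.9+) and is not taken from the repository; the `deps` mapping, the `layers` and `color` helpers, and the rule "no inputs means base table" are assumptions for illustration.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical SDG: each node maps to the set of nodes it reads from.
deps: Dict[str, Set[str]] = {
    "table_0": set(),
    "table_1": set(),
    "view_0": {"table_0", "table_1"},
    "view_1": {"view_0"},
}


def layers(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Group nodes into layers; each node sits one layer below its deepest input."""
    depth: Dict[str, int] = {}
    # static_order() yields every node after all of its predecessors.
    for node in TopologicalSorter(deps).static_order():
        depth[node] = 1 + max((depth[d] for d in deps.get(node, ())), default=-1)
    result: List[List[str]] = [[] for _ in range(max(depth.values(), default=-1) + 1)]
    for node, d in depth.items():
        result[d].append(node)
    return [sorted(layer) for layer in result]


def color(node: str) -> str:
    # Assumption: nodes without inputs are base tables, everything else is a view.
    return "blue" if not deps.get(node) else "green"


for i, layer in enumerate(layers(deps)):
    print(i, [(n, color(n)) for n in layer])
```

The resulting layer index and color per node could then be handed to a plotting library such as networkx as positions and node colors.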


⚙️ Requirements

  • Python 3.9+
  • sqlglot, duckdb, networkx, pandas, pytest

Install dependencies:

pip install -r requirements.txt

🖥️ Getting Started Locally

Clone the repository and run the pipeline on your machine:

git clone https://github.com/habiiibo03/StructureLineage.git
cd StructureLineage
pip install -r requirements.txt

Generate a synthetic project:

python -m src.tools.gen_synthetic examples/local_project --n_tables 3 --n_views 3
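To give a feel for what a synthetic project with known ground truth might contain, here is a toy generator. It is only a sketch under stated assumptions: the README does not describe the output of `src.tools.gen_synthetic`, so the `make_toy_project` function, the `table_i`/`view_j` naming, the two-column schema, and the seeding are all hypothetical.

```python
import random
from typing import List, Set, Tuple

Edge = Tuple[str, str]  # (source, target)


def make_toy_project(n_tables: int, n_views: int, seed: int = 0) -> Tuple[List[str], Set[Edge]]:
    """Return (SQL statements, ground-truth lineage edges) for a toy project."""
    if n_tables < 1:
        raise ValueError("need at least one base table")
    rng = random.Random(seed)  # fixed seed keeps the corpus reproducible
    statements: List[str] = []
    truth: Set[Edge] = set()
    nodes: List[str] = []
    for i in range(n_tables):
        name = f"table_{i}"
        statements.append(f"CREATE TABLE {name} (id INTEGER, value INTEGER);")
        nodes.append(name)
    for j in range(n_views):
        name = f"view_{j}"
        # A view only reads from nodes created earlier, so the graph stays acyclic.
        sources = rng.sample(nodes, k=min(len(nodes), rng.randint(1, 2)))
        select = f"SELECT a.id, a.value FROM {sources[0]} AS a"
        if len(sources) == 2:
            select += f" JOIN {sources[1]} AS b ON a.id = b.id"
        statements.append(f"CREATE VIEW {name} AS {select};")
        truth.update((src, name) for src in sources)
        nodes.append(name)
    return statements, truth


statements, truth = make_toy_project(n_tables=3, n_views=3)
print("\n".join(statements))
```

Recording the edges at generation time is what makes a later precision/recall check possible, since the expected lineage is known by construction.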

Build the Schema Dependency Graph:

python -m src.sl_core.build_sdg examples/local_project 
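Conceptually, building an SDG means mapping each table or view to the relations it reads from. The sketch below is a deliberately naive, regex-based stand-in and not the repository's builder. The requirements list sqlglot, which suggests the actual code uses a real SQL parser. The `toy_build_sdg` function and the sample statements are hypothetical, and the regexes do not handle CTEs, quoted identifiers, or schema-qualified names.

```python
import re
from typing import Dict, List, Set

CREATE_VIEW = re.compile(r"CREATE\s+VIEW\s+(\w+)\s+AS\s+(.*)", re.IGNORECASE | re.DOTALL)
CREATE_TABLE = re.compile(r"CREATE\s+TABLE\s+(\w+)", re.IGNORECASE)
SOURCE = re.compile(r"\b(?:FROM|JOIN)\s+(\w+)", re.IGNORECASE)


def toy_build_sdg(statements: List[str]) -> Dict[str, Set[str]]:
    """Map each table or view to the set of relations it reads from."""
    sdg: Dict[str, Set[str]] = {}
    for stmt in statements:
        text = stmt.strip()
        view = CREATE_VIEW.match(text)
        if view:
            name, body = view.groups()
            # Every identifier that follows FROM or JOIN counts as an input.
            sdg[name] = set(SOURCE.findall(body))
            continue
        table = CREATE_TABLE.match(text)
        if table:
            sdg.setdefault(table.group(1), set())  # base tables have no inputs
    return sdg


sql = [
    "CREATE TABLE orders (id INTEGER, customer_id INTEGER);",
    "CREATE TABLE customers (id INTEGER, name TEXT);",
    "CREATE VIEW order_names AS SELECT o.id, c.name "
    "FROM orders AS o JOIN customers AS c ON o.customer_id = c.id;",
]
sdg = toy_build_sdg(sql)
print(sdg)
```

A parser-based version would walk the syntax tree instead of matching keywords, which is what makes nested queries and aliases tractable.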

Evaluate results:

pytest

🧪 Testing

Run unit tests with:

pytest 
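For readers unfamiliar with pytest, a test is a plain function whose name starts with `test_` and which uses bare `assert` statements. The example below is a hypothetical test, not one from the repository: it checks that a dependency mapping is acyclic, using the standard-library `graphlib` module, and the `is_acyclic` helper is an assumption made for the sketch.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, Set


def is_acyclic(sdg: Dict[str, Set[str]]) -> bool:
    """Return True when the dependency mapping contains no cycle."""
    try:
        TopologicalSorter(sdg).prepare()  # raises CycleError on a cycle
    except CycleError:
        return False
    return True


def test_sdg_is_acyclic() -> None:
    sdg = {"table_0": set(), "view_0": {"table_0"}, "view_1": {"view_0"}}
    assert is_acyclic(sdg)


def test_cycle_is_detected() -> None:
    sdg = {"view_0": {"view_1"}, "view_1": {"view_0"}}
    assert not is_acyclic(sdg)
```

Saved as a file whose name starts with `test_`, these functions would be collected automatically when `pytest` runs.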
