DuckDB and Great Expectations Local Lab

A small local project for learning how to build ETL pipelines and data quality checks, all running in Docker.

This project shows how DuckDB and Great Expectations work together in a simple, clean setup. Everything runs locally in Docker, so no cloud account is needed. It is a small, safe place to learn, test ideas, and explore data engineering concepts.


Why DuckDB

DuckDB is fast and easy to use. It runs inside your Python process, so there is no server to install or manage.
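
For example, a complete query session is just a few lines in Python (a quick illustration, not code from the repo):

import duckdb

# DuckDB runs in-process: no server to start, nothing to configure
duckdb.sql("SELECT 42 AS answer").show()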

Important features used in this project

  • Very fast for analytics
  • Easy to read CSV and JSON files
  • SQL support
  • A single file to store the database
  • Good for local demos and small projects
  • Perfect for beginners who want to learn data pipelines

In this project, DuckDB is used to (see the sketch after this list)

  • Load raw CSV files
  • Create a simple fact table
  • Run SQL queries
  • Store everything inside a file in data/warehouse
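
A minimal sketch of that flow, assuming an events.csv in data/raw with user_id, event_date, event_count, and country columns. The schema and table names here are illustrative; the real logic lives in etl/load_to_duckdb.py.

import duckdb

con = duckdb.connect("data/warehouse/analytics_test.duckdb")  # single-file database

con.execute("CREATE SCHEMA IF NOT EXISTS raw")
con.execute("CREATE SCHEMA IF NOT EXISTS marts")

# DuckDB reads CSV files directly and infers column types
con.execute("""
    CREATE OR REPLACE TABLE raw.events AS
    SELECT * FROM read_csv_auto('data/raw/events.csv')
""")

# Build a simple fact table: one row per user, day, and country
con.execute("""
    CREATE OR REPLACE TABLE marts.fact_user_events AS
    SELECT user_id, event_date, country, SUM(event_count) AS event_count
    FROM raw.events
    GROUP BY user_id, event_date, country
""")

print(con.execute("SELECT COUNT(*) FROM marts.fact_user_events").fetchone()[0], "rows loaded")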

Why Great Expectations

Great Expectations is a tool to check the quality of your data. It helps you make sure your data is correct before you use it in reports or models.

Important features used in this project

  • Check for missing values
  • Check for unique combinations
  • Check value ranges
  • Check allowed values
  • Clear pass or fail output
  • Save detailed results into a JSON file

In this project, Great Expectations is used to (see the sketch after this list)

  • Validate the fact table built by DuckDB
  • Show a clear summary of which checks passed
  • Save a full validation report to data/validations
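
A minimal sketch of those checks, assuming the classic Great Expectations pandas API (ge.from_pandas, from the 0.x releases; newer versions use a context-based API). The table name and the allowed country list are illustrative.

import json
import duckdb
import great_expectations as ge

con = duckdb.connect("data/warehouse/analytics_test.duckdb", read_only=True)
df = con.execute("SELECT * FROM marts.fact_user_events").fetchdf()

dataset = ge.from_pandas(df)

# Mirror the checks listed above
results = [
    dataset.expect_column_values_to_not_be_null("user_id"),
    dataset.expect_compound_columns_to_be_unique(["user_id", "event_date"]),
    dataset.expect_column_values_to_be_between("event_count", min_value=0),
    dataset.expect_column_values_to_be_in_set("country", ["US", "IN", "DE"]),
]

for result in results:
    status = "PASSED" if result.success else "FAILED"
    print(status, result.expectation_config.expectation_type)

# Save the detailed results as JSON
with open("data/validations/results.json", "w") as f:
    json.dump([r.to_json_dict() for r in results], f, indent=2)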

Why use DuckDB and Great Expectations together

DuckDB gives you a small and fast local data warehouse. Great Expectations gives you a clear way to check and trust your data.

Using both together helps you learn how real data teams work. It follows the pattern: ETL → Validate → Use or Publish.

This is a very common pattern in production systems. Here you get a small version of it to learn from.


Project Structure

duckdb-gx-docker-lab/
  data/
    raw/                   input CSV files
    warehouse/             DuckDB file created after ETL

  etl/
    load_to_duckdb.py      ETL script
    run_validations.py     Data quality script

  docker/
    Dockerfile             environment for running everything

  docker-compose.yml       Docker service setup
  requirements.txt         Python packages

Short summary of the scripts

load_to_duckdb.py

  • Creates schemas
  • Loads CSV files
  • Builds a fact table
  • Shows a small data summary

run_validations.py

  • Reads the fact table
  • Runs Great Expectations
  • Shows clear pass or fail checks
  • Saves detailed results

How to set up and run the project

Step 1. Build the Docker image

docker compose build

Step 2. Run the ETL

docker compose run --rm etl python etl/load_to_duckdb.py

Step 3. Run the data quality checks

docker compose run --rm etl python etl/run_validations.py

Example output from ETL

DuckDB and Great Expectations Demo ETL
Using DuckDB at: /app/data/warehouse/analytics_test.duckdb

1. Creating schemas
2. Loading CSV files
3. Building fact table

ETL finished successfully
Fact table summary:
- Rows: 3
- Distinct users: 3
- Date range: 2024-05-01 to 2024-05-02

Example output from validations

Great Expectations: Data Validation
Reading from DuckDB file

Dataset summary:
- Rows: 3
- Distinct users: 3
- Date range: 2024-05-01 to 2024-05-02

Running checks

Check results:
- PASSED user_id not null
- PASSED unique user_id and event_date
- PASSED event_count >= 0
- PASSED country in allowed list

Validation file saved in data/validations
All checks passed

Practice ideas

ETL improvements

  • Add new columns
  • Add filters
  • Add joins (sketched below)
  • Create a second fact table
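
For example, the join idea could look like this, assuming a users.csv file with a user_id column. File, schema, and column names are illustrative.

import duckdb

con = duckdb.connect("data/warehouse/analytics_test.duckdb")

# Load a users dimension, then enrich the fact table with a join
con.execute("""
    CREATE OR REPLACE TABLE raw.users AS
    SELECT * FROM read_csv_auto('data/raw/users.csv')
""")
con.execute("""
    CREATE OR REPLACE TABLE marts.fact_user_events_enriched AS
    SELECT f.*, u.signup_date
    FROM marts.fact_user_events f
    LEFT JOIN raw.users u USING (user_id)
""")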

Data Quality improvements

  • Check allowed event types
  • Check timestamp formats
  • Check for future dates
  • Check that user_id exists in the users table (sketched below)
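
The referential check, for example, can be written as a plain SQL anti-join in DuckDB. Table names here are illustrative.

import duckdb

con = duckdb.connect("data/warehouse/analytics_test.duckdb", read_only=True)

# Count fact rows whose user_id has no match in the users table
orphans = con.execute("""
    SELECT COUNT(*)
    FROM marts.fact_user_events f
    ANTI JOIN raw.users u USING (user_id)
""").fetchone()[0]

print("PASSED" if orphans == 0 else f"FAILED: {orphans} unknown user_ids")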

Architecture improvements

  • Add a staging table
  • Add a write, audit, publish flow (sketched below)
  • Add a simple rollback idea
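
A small sketch of the write, audit, publish idea in DuckDB. The audit rule and table names are illustrative.

import duckdb

con = duckdb.connect("data/warehouse/analytics_test.duckdb")

# Write: build the new version of the table under a staging name
con.execute("""
    CREATE OR REPLACE TABLE marts.fact_user_events_staging AS
    SELECT user_id, event_date, country, SUM(event_count) AS event_count
    FROM raw.events
    GROUP BY user_id, event_date, country
""")

# Audit: run a cheap sanity check before anyone can read the data
bad_rows = con.execute(
    "SELECT COUNT(*) FROM marts.fact_user_events_staging WHERE event_count < 0"
).fetchone()[0]

if bad_rows == 0:
    # Publish: swap the audited table into place
    con.execute("DROP TABLE IF EXISTS marts.fact_user_events")
    con.execute("ALTER TABLE marts.fact_user_events_staging RENAME TO fact_user_events")
else:
    # Rollback: keep the old table, throw away the staging copy
    con.execute("DROP TABLE marts.fact_user_events_staging")
    raise SystemExit(f"Audit failed: {bad_rows} rows with negative event_count")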

Visual ideas

  • Use DuckDB inside a Jupyter notebook
  • Query DuckDB with Polars (sketched below)
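
Querying DuckDB with Polars takes one extra method call once polars is in requirements.txt. A quick illustration, with the table name from the earlier sketches:

import duckdb

con = duckdb.connect("data/warehouse/analytics_test.duckdb", read_only=True)

# .pl() returns the query result as a Polars DataFrame
df = con.execute("SELECT * FROM marts.fact_user_events").pl()
print(df.head())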

Learning resources

DuckDB: https://duckdb.org

Great Expectations: https://greatexpectations.io
