A small local project to learn how to build ETL and data quality checks using Docker.
This project shows how DuckDB and Great Expectations work together in a simple, clean setup. Everything runs locally in Docker, so no cloud account is needed. It is a small, safe place to learn, test ideas, and explore data engineering concepts.
DuckDB is fast and easy to use. It runs inside your Python process, so no database server is needed. Its main strengths:
- Very fast for analytics
- Easy to read CSV and JSON files
- SQL support
- One single file to store the database
- Good for local demos and small projects
- Perfect for beginners who want to learn data pipelines
In this project, DuckDB is used to:
- Load raw CSV files
- Create a simple fact table
- Run SQL queries
- Store everything inside a file in data/warehouse
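To get a feel for how little code this needs, here is a minimal sketch. The CSV path and column names are assumptions for illustration, not files shipped with this repo:

```python
import duckdb

# No server to start: DuckDB runs inside the Python process.
# A CSV file can be queried directly with plain SQL.
events_per_country = duckdb.sql("""
    SELECT country, COUNT(*) AS events
    FROM 'data/raw/events.csv'
    GROUP BY country
    ORDER BY events DESC
""").df()

print(events_per_country)
```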
Great Expectations is a tool for checking the quality of your data. It helps you make sure the data is correct before it is used in reports or models. With it you can:
- Check for missing values
- Check for unique combinations
- Check value ranges
- Check allowed values
- Clear pass or fail output
- Save detailed results into a JSON file
In this project, Great Expectations is used to:
- Validate the fact table built by DuckDB
- Show a clear summary of which checks passed
- Save a full validation report to data/validations
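For a feel of what these checks look like in code, here is a rough sketch using the legacy pandas-style Great Expectations interface (newer releases use a different API). The toy data and the allowed country list are made up:

```python
import pandas as pd
import great_expectations as ge

# Toy data shaped like the fact table used later in this project.
df = ge.from_pandas(pd.DataFrame({
    "user_id": [1, 2, 3],
    "event_date": ["2024-05-01", "2024-05-01", "2024-05-02"],
    "event_count": [5, 2, 7],
    "country": ["DE", "US", "DE"],
}))

df.expect_column_values_to_not_be_null("user_id")                   # missing values
df.expect_compound_columns_to_be_unique(["user_id", "event_date"])  # unique combinations
df.expect_column_values_to_be_between("event_count", min_value=0)   # value ranges
df.expect_column_values_to_be_in_set("country", ["DE", "US"])       # allowed values

result = df.validate()  # clear pass or fail, with details per expectation
print("All checks passed" if result.success else "Some checks failed")
```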
DuckDB gives you a small and fast local data warehouse. Great Expectations gives you a clear way to check and trust your data.
Using both together shows how real data teams work: the pipeline follows the pattern ETL → Validate → Use or Publish.
This is a very common pattern in production systems; here you get a small version of it to learn from.
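One way to wire that gate, sketched very roughly: build into a staging table, validate it, and only publish when every check passes. The table names, the raw_events source, and the SQL checks below are stand-ins, not the actual scripts in this repo:

```python
import duckdb

con = duckdb.connect("data/warehouse/analytics_test.duckdb")

# ETL: build into a staging table, not directly into the published name.
con.execute("""
    CREATE OR REPLACE TABLE staging_fact_events AS
    SELECT user_id, event_date, COUNT(*) AS event_count
    FROM raw_events
    GROUP BY user_id, event_date
""")

# Validate: a simple SQL stand-in for the Great Expectations run.
bad_rows = con.execute("""
    SELECT COUNT(*) FROM staging_fact_events
    WHERE user_id IS NULL OR event_count < 0
""").fetchone()[0]

# Publish only if the checks pass; otherwise keep the previous table.
if bad_rows == 0:
    con.execute("CREATE OR REPLACE TABLE fact_events AS SELECT * FROM staging_fact_events")
else:
    print("Validation failed; fact_events was not updated")

con.close()
```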
Project layout:

    duckdb-gx-docker-lab/
        data/
            raw/                 input CSV files
            warehouse/           DuckDB file created after ETL
        etl/
            load_to_duckdb.py    ETL script
            run_validations.py   data quality script
        docker/
            Dockerfile           environment for running everything
            docker-compose.yml   Docker service setup
        requirements.txt         Python packages

The ETL script (etl/load_to_duckdb.py):
- Creates schemas
- Loads CSV files
- Builds a fact table
- Shows a small data summary
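A stripped-down sketch of what such a script can look like. Schema, table, and column names here are assumptions and may differ from the real script:

```python
import duckdb

con = duckdb.connect("data/warehouse/analytics_test.duckdb")

# 1. Create schemas
con.execute("CREATE SCHEMA IF NOT EXISTS raw")
con.execute("CREATE SCHEMA IF NOT EXISTS analytics")

# 2. Load CSV files
con.execute("""
    CREATE OR REPLACE TABLE raw.events AS
    SELECT * FROM read_csv_auto('data/raw/events.csv')
""")

# 3. Build a fact table: one row per user and day
con.execute("""
    CREATE OR REPLACE TABLE analytics.fact_events AS
    SELECT user_id, event_date, COUNT(*) AS event_count, any_value(country) AS country
    FROM raw.events
    GROUP BY user_id, event_date
""")

# 4. Show a small data summary
rows, users, first_day, last_day = con.execute("""
    SELECT COUNT(*), COUNT(DISTINCT user_id), MIN(event_date), MAX(event_date)
    FROM analytics.fact_events
""").fetchone()
print(f"Rows: {rows}, distinct users: {users}, date range: {first_day} to {last_day}")

con.close()
```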
The validation script (etl/run_validations.py):
- Reads the fact table
- Runs Great Expectations
- Shows clear pass or fail checks
- Saves detailed results
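Roughly, such a script can be sketched like this, again using the legacy pandas-style Great Expectations interface. The table name and the output file name are assumptions:

```python
import json
from pathlib import Path

import duckdb
import great_expectations as ge

# Read the fact table built by the ETL step into pandas.
con = duckdb.connect("data/warehouse/analytics_test.duckdb", read_only=True)
fact = con.execute("SELECT * FROM analytics.fact_events").df()
con.close()

# Run the expectations.
dataset = ge.from_pandas(fact)
dataset.expect_column_values_to_not_be_null("user_id")
dataset.expect_column_values_to_be_between("event_count", min_value=0)
result = dataset.validate()

# Show clear pass or fail output per check.
for check in result.results:
    status = "PASSED" if check.success else "FAILED"
    print(f"- {status} {check.expectation_config.expectation_type}")

# Save the detailed results as JSON.
out_dir = Path("data/validations")
out_dir.mkdir(parents=True, exist_ok=True)
(out_dir / "validation_result.json").write_text(json.dumps(result.to_json_dict(), indent=2))
```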
Build the image and run the two scripts:

    docker compose build
    docker compose run --rm etl python etl/load_to_duckdb.py
    docker compose run --rm etl python etl/run_validations.py

Example output from the ETL step:

    DuckDB and Great Expectations Demo ETL
    Using DuckDB at: /app/data/warehouse/analytics_test.duckdb
    1. Creating schemas
    2. Loading CSV files
    3. Building fact table
    ETL finished successfully
    Fact table summary:
    - Rows: 3
    - Distinct users: 3
    - Date range: 2024-05-01 to 2024-05-02
Example output from the validation step:

    Great Expectations: Data Validation
    Reading from DuckDB file
    Dataset summary:
    - Rows: 3
    - Distinct users: 3
    - Date range: 2024-05-01 to 2024-05-02
    Running checks
    Check results:
    - PASSED user_id not null
    - PASSED unique user_id and event_date
    - PASSED event_count >= 0
    - PASSED country in allowed list
    Validation file saved in data/validations
    All checks passed
Ideas to extend the ETL (a join example is sketched after this list):
- Add new columns
- Add filters
- Add joins
- Create a second fact table
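For example, a second fact table built with a join and a filter might look like this. The users table and its columns are assumptions:

```python
import duckdb

con = duckdb.connect("data/warehouse/analytics_test.duckdb")

# Hypothetical second fact table: daily event counts per country,
# joining a users dimension and filtering out rows without a country.
con.execute("""
    CREATE OR REPLACE TABLE analytics.fact_country_daily AS
    SELECT u.country, f.event_date, SUM(f.event_count) AS event_count
    FROM analytics.fact_events AS f
    JOIN raw.users AS u USING (user_id)
    WHERE u.country IS NOT NULL
    GROUP BY u.country, f.event_date
""")

con.close()
```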
Ideas for more data checks (sketched after this list):
- Check allowed event types
- Check timestamp formats
- Check for future dates
- Check that user_id exists in the users table
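These could be expressed roughly like this, again with the legacy pandas-style interface. The table and column names (raw.events, raw.users, event_type) and the allowed values are assumptions:

```python
from datetime import date

import duckdb
import great_expectations as ge

con = duckdb.connect("data/warehouse/analytics_test.duckdb", read_only=True)
events = con.execute("SELECT * FROM raw.events").df()
known_users = con.execute("SELECT user_id FROM raw.users").df()["user_id"].tolist()
con.close()

# Assumes raw.events has user_id, event_type and event_date columns.
dataset = ge.from_pandas(events)

# Allowed event types (values are made up).
dataset.expect_column_values_to_be_in_set("event_type", ["click", "view", "purchase"])

# Timestamp format: dates stored as ISO strings.
dataset.expect_column_values_to_match_strftime_format("event_date", "%Y-%m-%d")

# No future dates: ISO date strings compare correctly as plain text.
dataset.expect_column_values_to_be_between("event_date", max_value=str(date.today()))

# Referential check: every user_id must exist in the users table.
dataset.expect_column_values_to_be_in_set("user_id", known_users)

print(dataset.validate().success)
```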
Ideas to make the pipeline more production-like:
- Add a staging table
- Add a write, audit, publish flow
- Add a simple rollback idea
Other tools to try (a Polars example follows this list):
- Use DuckDB inside a Jupyter notebook
- Query DuckDB with Polars
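For instance, DuckDB query results can be handed straight to Polars; the same lines work in a Jupyter notebook cell. The table name is an assumption:

```python
import duckdb
import polars as pl

con = duckdb.connect("data/warehouse/analytics_test.duckdb", read_only=True)

# .pl() materialises the query result as a Polars DataFrame.
events = con.sql("SELECT * FROM analytics.fact_events").pl()
con.close()

# From here on it is plain Polars.
print(events.group_by("country").agg(pl.col("event_count").sum()))
```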