
Roman Coins

End-to-end ELT pipeline project

Extracting, Loading, and Transforming data on Roman Coins gathered from wildwinds.com

Tools: Python, PostgreSQL, Docker, FastAPI, Airbyte, MinIO, Dagster, DuckDB, dbt

Web scraper: Scrapes data on coins of the Roman Empire from wildwinds.com and loads it into a PostgreSQL server. Because the site requires a 30-second delay between page requests, scraping takes several hours to complete; the data is loaded into the server as it arrives.
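As a rough illustration of that loop (not the repo's actual code), the sketch below assumes a requests-based fetch and a psycopg2 insert; the parse_coin function, table name, and connection details are all hypothetical:

```python
import time
import requests
import psycopg2  # assumes a PostgreSQL server is already running

# Hypothetical connection details; the real pipeline reads these from .env
conn = psycopg2.connect(host="localhost", dbname="roman_coins",
                        user="postgres", password="postgres")

def parse_coin(html: str) -> dict:
    """Hypothetical parser: extract one coin's fields from a wildwinds page."""
    return {"ruler": "Augustus", "description": html[:100]}

def scrape(urls):
    for url in urls:
        resp = requests.get(url, headers={"User-Agent": "roman-coins-scraper"})
        resp.raise_for_status()
        coin = parse_coin(resp.text)
        with conn, conn.cursor() as cur:  # each row is committed as it arrives
            cur.execute(
                "INSERT INTO roman_coins (ruler, description) VALUES (%s, %s)",
                (coin["ruler"], coin["description"]),
            )
        time.sleep(30)  # wildwinds.com requires 30 seconds between requests
```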

API: Serves data from the roman coins dataset, and allows data addition and manipulation via POST, PUT, and PATCH endpoints. Data is continuously added while the web scraper runs.
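For illustration only, a GET/POST pair in FastAPI might look like the sketch below; the repo's actual routes, models, and database wiring may differ:

```python
from typing import Optional
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Coin(BaseModel):
    ruler: str
    description: Optional[str] = None

@app.get("/coins/{coin_id}")
def read_coin(coin_id: int):
    # In the real API this would query PostgreSQL for the row.
    return {"id": coin_id}

@app.post("/coins")
def create_coin(coin: Coin):
    # In the real API this would insert the row into PostgreSQL.
    return coin
```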

Airbyte connector: A custom Airbyte source connector streams incremental data from the API to a standalone MinIO bucket.
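To show the general shape of an incremental HTTP source built on the Airbyte CDK (the class name, cursor field, and URL below are assumptions, not the repo's connector):

```python
from airbyte_cdk.sources.streams.http import HttpStream

class Coins(HttpStream):
    url_base = "http://localhost:8000/"  # assumed API address
    primary_key = "id"
    cursor_field = "created"             # assumed incremental cursor

    def path(self, **kwargs):
        return "coins"

    def next_page_token(self, response):
        return None  # assume single-page responses

    def parse_response(self, response, **kwargs):
        yield from response.json()

    def get_updated_state(self, current_stream_state, latest_record):
        # Keep the newest cursor value seen so far.
        latest = latest_record.get(self.cursor_field, "")
        current = (current_stream_state or {}).get(self.cursor_field, "")
        return {self.cursor_field: max(latest, current)}
```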

MinIO: Resilient storage for the incoming data stream. Airbyte delivers records "at least once," so some duplicated data is acceptable at this stage; deduplication is easily handled by dbt at the next stage of the pipeline.
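For reference, writing one record to a MinIO bucket with the minio Python client looks roughly like this; the bucket name and default dev credentials are assumptions:

```python
import io
import json
from minio import Minio

client = Minio("localhost:9000", access_key="minioadmin",
               secret_key="minioadmin", secure=False)  # assumed dev credentials

if not client.bucket_exists("roman-coins"):
    client.make_bucket("roman-coins")

payload = json.dumps({"id": 1, "ruler": "Augustus"}).encode()
client.put_object("roman-coins", "coins/1.json",
                  io.BytesIO(payload), length=len(payload),
                  content_type="application/json")
```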

Dagster: Sensors trigger Airbyte syncs and DuckDB loads on a minute-by-minute basis.
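A minimal Dagster sensor sketch, for illustration; the job body and the check for new MinIO objects are placeholders rather than the repo's real logic:

```python
from dagster import RunRequest, job, op, sensor

@op
def load_to_duckdb():
    # Placeholder: read newly staged objects from MinIO into DuckDB.
    pass

@job
def duckdb_load_job():
    load_to_duckdb()

@sensor(job=duckdb_load_job, minimum_interval_seconds=60)
def minute_sensor(context):
    # The real sensor would first check whether new data has landed.
    yield RunRequest(run_key=None)
```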

DuckDB: Local data warehouse, loaded from MinIO by the Dagster-triggered load jobs.
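A load step into DuckDB can be as small as the following sketch; the database file, staging path, and table name are assumptions:

```python
import duckdb

con = duckdb.connect("warehouse.duckdb")
con.execute("""
    CREATE OR REPLACE TABLE raw_coins AS
    SELECT * FROM read_json_auto('staged/*.json')
""")
```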

dbt: Transforms data within the data warehouse, including deduplicating the records replicated by Airbyte.
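The repo's actual dbt models are SQL; purely to illustrate the deduplication idea, here is the equivalent expressed through DuckDB's Python API, with assumed table and key names:

```python
import duckdb

con = duckdb.connect("warehouse.duckdb")
con.execute("""
    CREATE OR REPLACE TABLE coins AS
    SELECT * FROM raw_coins
    QUALIFY row_number() OVER (PARTITION BY id ORDER BY id) = 1
""")
```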

Requirements:

Docker
Docker Compose
Airbyte

To Run:

Step 1: Ensure Docker and Airbyte are both up and running.

Step 2: (Optional) Set preferred credentials/variables in the project's .env file.

Step 3: Run the following terminal commands:

git clone https://github.com/vbalalian/roman_coins_data_pipeline.git
cd roman_coins_data_pipeline
docker compose up

This will run the web scraper, the API, MinIO, and Dagster; then build the custom Airbyte connector, configure the API-Airbyte-MinIO connection, and trigger Airbyte syncs and DuckDB load jobs automatically using sensors.