
Roman Coins

End-to-end ELT pipeline project

Extracting, Loading, and Transforming data on Roman Coins gathered from wildwinds.com

Tools: Python, PostgreSQL, Docker, FastAPI, Airbyte, MinIO, Dagster, DuckDB, dbt

Web scraper: Scrapes data on coins of the Roman Empire from wildwinds.com and loads it into a PostgreSQL server. Because the site requires a 30-second delay between page requests, scraping takes several hours to complete; the data is loaded into the server as it arrives.
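As a rough illustration of that loop (not the repo's actual code), the sketch below assumes a requests-based fetch and a psycopg2 insert; the parse_coin function, table name, and connection details are all hypothetical:

```python
import time
import requests
import psycopg2  # assumes a PostgreSQL server is already running

# Hypothetical connection details; the real pipeline reads these from .env
conn = psycopg2.connect(host="localhost", dbname="roman_coins",
                        user="postgres", password="postgres")

def parse_coin(html: str) -> dict:
    """Hypothetical parser: extract one coin's fields from a wildwinds page."""
    return {"ruler": "Augustus", "description": html[:100]}

def scrape(urls):
    for url in urls:
        resp = requests.get(url, headers={"User-Agent": "roman-coins-scraper"})
        resp.raise_for_status()
        coin = parse_coin(resp.text)
        with conn, conn.cursor() as cur:  # each row is committed as it arrives
            cur.execute(
                "INSERT INTO roman_coins (ruler, description) VALUES (%s, %s)",
                (coin["ruler"], coin["description"]),
            )
        time.sleep(30)  # wildwinds.com requires 30 seconds between requests
```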

API: Serves data from the roman coins dataset, and allows data addition and manipulation via POST, PUT, and PATCH endpoints. Data is continuously added while the web scraper runs.
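For illustration only, a GET/POST pair in FastAPI might look like the sketch below; the repo's actual routes, models, and database wiring may differ:

```python
from typing import Optional
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Coin(BaseModel):
    ruler: str
    description: Optional[str] = None

@app.get("/coins/{coin_id}")
def read_coin(coin_id: int):
    # In the real API this would query PostgreSQL for the row.
    return {"id": coin_id}

@app.post("/coins")
def create_coin(coin: Coin):
    # In the real API this would insert the row into PostgreSQL.
    return coin
```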

Airbyte connector: A custom Airbyte source connector streams incremental data from the API to a standalone MinIO bucket.
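To show the general shape of an incremental HTTP source built on the Airbyte CDK (the class name, cursor field, and URL below are assumptions, not the repo's connector):

```python
from airbyte_cdk.sources.streams.http import HttpStream

class Coins(HttpStream):
    url_base = "http://localhost:8000/"  # assumed API address
    primary_key = "id"
    cursor_field = "created"             # assumed incremental cursor

    def path(self, **kwargs):
        return "coins"

    def next_page_token(self, response):
        return None  # assume single-page responses

    def parse_response(self, response, **kwargs):
        yield from response.json()

    def get_updated_state(self, current_stream_state, latest_record):
        # Keep the newest cursor value seen so far.
        latest = latest_record.get(self.cursor_field, "")
        current = (current_stream_state or {}).get(self.cursor_field, "")
        return {self.cursor_field: max(latest, current)}
```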

MinIO: Resilient storage for the incoming data stream. Airbyte delivers records "at least once," so some duplicated data is acceptable at this stage; deduplication is easily handled by dbt at the next stage of the pipeline.
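For reference, writing one record to a MinIO bucket with the minio Python client looks roughly like this; the bucket name and default dev credentials are assumptions:

```python
import io
import json
from minio import Minio

client = Minio("localhost:9000", access_key="minioadmin",
               secret_key="minioadmin", secure=False)  # assumed dev credentials

if not client.bucket_exists("roman-coins"):
    client.make_bucket("roman-coins")

payload = json.dumps({"id": 1, "ruler": "Augustus"}).encode()
client.put_object("roman-coins", "coins/1.json",
                  io.BytesIO(payload), length=len(payload),
                  content_type="application/json")
```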

Dagster: Sensors trigger Airbyte syncs and DuckDB loads on a minute-by-minute basis.
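A minimal Dagster sensor sketch, for illustration; the job body and the check for new MinIO objects are placeholders rather than the repo's real logic:

```python
from dagster import RunRequest, job, op, sensor

@op
def load_to_duckdb():
    # Placeholder: read newly staged objects from MinIO into DuckDB.
    pass

@job
def duckdb_load_job():
    load_to_duckdb()

@sensor(job=duckdb_load_job, minimum_interval_seconds=60)
def minute_sensor(context):
    # The real sensor would first check whether new data has landed.
    yield RunRequest(run_key=None)
```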

DuckDB: Local data warehouse, loaded from MinIO by the Dagster-triggered load jobs.
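A load step into DuckDB can be as small as the following sketch; the database file, staging path, and table name are assumptions:

```python
import duckdb

con = duckdb.connect("warehouse.duckdb")
con.execute("""
    CREATE OR REPLACE TABLE raw_coins AS
    SELECT * FROM read_json_auto('staged/*.json')
""")
```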

dbt: Transforms data within the data warehouse, including deduplicating the records replicated by Airbyte.
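The repo's actual dbt models are SQL; purely to illustrate the deduplication idea, here is the equivalent expressed through DuckDB's Python API, with assumed table and key names:

```python
import duckdb

con = duckdb.connect("warehouse.duckdb")
con.execute("""
    CREATE OR REPLACE TABLE coins AS
    SELECT * FROM raw_coins
    QUALIFY row_number() OVER (PARTITION BY id ORDER BY id) = 1
""")
```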

Requirements:

Docker
Docker Compose
Airbyte

To Run:

Step 1: Ensure Docker and Airbyte are both up and running.

Step 2: (Optional) Set preferred credentials/variables in the project's .env file.

Step 3: Run the following terminal commands:

git clone https://github.com/vbalalian/roman_coins_data_pipeline.git
cd roman_coins_data_pipeline
docker compose up

This will run the web scraper, the API, MinIO, and Dagster; then build the custom Airbyte connector, configure the API-Airbyte-MinIO connection, and trigger Airbyte syncs and DuckDB load jobs automatically using sensors.