A modern data analytics platform combining MLB statistics with fantasy baseball insights.
Getting Started • Architecture • Documentation • Contributing
- Docker and Docker Compose
- Python 3.9+
- Make (optional, but recommended)
- Cloudflare R2 bucket populated with MLB stats data (see MLB Stats Pipeline below)
Before running this project, you need to populate your Cloudflare R2 bucket with MLB statistics data using the MLB Stats Dagster Pipeline:
- Clone and set up the MLB Stats Pipeline:

  ```bash
  git clone https://github.com/waaronmorris/mlb_stats_dagster.git
  cd mlb_stats_dagster
  ```
- Configure the pipeline:

  ```bash
  cp .env.example .env
  # Edit .env with your Cloudflare R2 credentials and MLB season configuration
  ```
- Start the pipeline using Docker Compose:

  ```bash
  docker-compose up -d
  ```
- Access the Dagster UI at http://localhost:3000 and run the pipeline to populate your R2 bucket with MLB statistics data.

- Once the pipeline has completed and your R2 bucket contains the MLB stats data, return here to continue setup. A quick way to verify the bucket contents is sketched below.
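If you want to confirm the bucket is populated before continuing, any S3-compatible client can list it. A minimal sketch with the AWS CLI (the bucket name and account ID are placeholders for your own values):

```bash
# List the Parquet objects the Dagster pipeline wrote to R2.
# Assumes your R2 access key and secret are configured as an AWS CLI profile.
aws s3 ls "s3://<your-r2-bucket>/" --recursive \
  --endpoint-url "https://<account-id>.r2.cloudflarestorage.com"
```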
- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/mlb-stats.git
  cd mlb-stats
  ```
- Set up environment (an example env/.env follows this list):

  ```bash
  cp env/.env.default env/.env
  # Edit env/.env with your configuration, including:
  # - CLOUDFLARE_R2_ACCESS_KEY
  # - CLOUDFLARE_R2_SECRET_KEY
  # - CLOUDFLARE_R2_BUCKET_NAME
  ln -s env/.env .env
  ```
- Start services:

  ```bash
  docker-compose up -d
  ```
- Access Superset at http://localhost:8088.
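For reference, a minimal env/.env sketch using only the variables named above (all values are placeholders):

```bash
# env/.env: replace the placeholder values with your own Cloudflare R2 credentials.
CLOUDFLARE_R2_ACCESS_KEY=your-r2-access-key-id
CLOUDFLARE_R2_SECRET_KEY=your-r2-secret-access-key
CLOUDFLARE_R2_BUCKET_NAME=your-r2-bucket
```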
High-level architecture:

```mermaid
graph LR
    R2[(Cloudflare R2)]
    DDB[(DuckDB)]
    DBT[dbt Transformations]
    SUP[Superset Dashboards]

    R2 -->|Raw Data| DDB
    DDB -->|Source Tables| DBT
    DBT -->|Transformed Models| DDB
    DDB -->|Analytics Tables| SUP
```
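The R2 to DuckDB edge above relies on DuckDB reading Parquet directly from S3-compatible object storage via its httpfs extension. A minimal sketch of a one-off spot check; the endpoint account ID, object prefix, and file name are placeholders, and the environment variables are the ones from env/.env:

```bash
# Hypothetical spot check: read one raw Parquet file straight from R2.
duckdb <<SQL
INSTALL httpfs; LOAD httpfs;
SET s3_endpoint='<account-id>.r2.cloudflarestorage.com';
SET s3_url_style='path';
SET s3_access_key_id='${CLOUDFLARE_R2_ACCESS_KEY}';
SET s3_secret_access_key='${CLOUDFLARE_R2_SECRET_KEY}';
SELECT count(*) FROM read_parquet('s3://${CLOUDFLARE_R2_BUCKET_NAME}/<prefix>/players.parquet');
SQL
```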
End-to-end data flow:

```mermaid
flowchart TD
    A[MLB Stats Dagster Pipeline] -->|Parquet Files| B[(Cloudflare R2)]
    B -->|Source Data| C[dbt Processing]
    C -->|Transformed Data| D[(DuckDB Analytics)]
    D -->|Metrics| E[Superset Dashboards]
```
The ETL pipeline consists of several key stages, with initial data loading occurring during container build:
- MLB Stats Dagster Pipeline
  - External pipeline that provides structured Parquet files
  - Data is uploaded to Cloudflare R2
  - Initial data is loaded during container build

- Data Processing (dbt)
  - Sources: Direct mappings to R2 Parquet files
  - Staging: Cleaned and standardized data models
  - Intermediate: Core business logic and relationships
  - Marts: Analytics-ready aggregated tables (run order sketched after this list)

- Data Loading
  - Transformed data loaded into DuckDB analytics tables
  - Optimized for query performance with appropriate indexes
  - Partitioned by season and update frequency

- Data Consumption
  - Superset dashboards for visualization
  - Interactive analytics queries
  - Performance-optimized views
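The staging, intermediate, and marts layers map onto dbt's node selection. A minimal sketch, assuming the models under dbt/ are organized into directories named after each layer and that you run it from the dbt project directory:

```bash
# Build each layer in dependency order, then run tests.
# Directory-based selectors are an assumption about this project's dbt layout.
dbt run --select staging
dbt run --select intermediate
dbt run --select marts
dbt test
```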
Key Features:
- MLB Stats Dagster pipeline for data ingestion
- Centralized R2 storage
- Data quality checks at ingestion
- Full data lineage tracking
- Automated recovery procedures
- Performance optimization through partitioning
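Partitioning pays off when queries filter on the partition column. A hedged sketch, assuming a Hive-style season=YYYY layout in R2 (the actual object layout may differ; S3 settings as in the earlier DuckDB sketch):

```bash
# Hypothetical partition-pruned read: only the season=2024 files are scanned.
duckdb <<SQL
INSTALL httpfs; LOAD httpfs;
SET s3_endpoint='<account-id>.r2.cloudflarestorage.com';
SET s3_access_key_id='${CLOUDFLARE_R2_ACCESS_KEY}';
SET s3_secret_access_key='${CLOUDFLARE_R2_SECRET_KEY}';
SELECT season, player_id, sum(home_runs) AS hr
FROM read_parquet('s3://${CLOUDFLARE_R2_BUCKET_NAME}/game_stats/season=*/*.parquet',
                  hive_partitioning = true)
WHERE season = '2024'
GROUP BY season, player_id;
SQL
```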
Core data model:

```mermaid
erDiagram
    PLAYERS ||--o{ GAME_STATS : has
    PLAYERS {
        int player_id
        string name
        string team
        string position
    }
    GAME_STATS {
        int game_id
        int player_id
        date game_date
        float batting_avg
        int home_runs
        int rbis
    }
    GAME_STATS ||--o{ FANTASY_POINTS : generates
    FANTASY_POINTS {
        int game_id
        int player_id
        float points
        string category
    }
```
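The model above supports straightforward fantasy aggregations. A hedged example query; the DuckDB file name and exact table casing are assumptions, while the columns come from the diagram:

```bash
# Top fantasy scorers across all categories, joined through player_id.
duckdb analytics.duckdb <<'SQL'
SELECT p.name, p.team, SUM(fp.points) AS total_points
FROM players AS p
JOIN fantasy_points AS fp USING (player_id)
GROUP BY p.name, p.team
ORDER BY total_points DESC
LIMIT 10;
SQL
```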
- Data Storage
  - Cloudflare R2: Object storage for raw data
  - DuckDB: High-performance analytics database
  - PostgreSQL: Metadata and user management

- Processing & Analytics
  - dbt: Data transformation
  - Apache Superset: Visualization and dashboards
  - Redis: Caching and queue management

- Infrastructure
  - Docker: Containerization
  - Make: Development automation
```
.
├── env/                   # Environment configuration
│   ├── .env.default       # Default environment variables
│   ├── .env.schema        # Environment variables documentation
│   └── README.md          # Environment setup guide
├── docker/                # Docker configuration files
├── superset/              # Superset configuration and dashboards
├── dbt/                   # Data transformation models
├── scripts/               # Utility scripts
├── docker-compose.yml     # Service definitions
├── Makefile               # Development commands
└── README.md              # This file
```
Common development tasks are automated through the Makefile:
```bash
make up     # Start all services
make down   # Stop all services
make logs   # View logs
make test   # Run tests
make clean  # Clean up containers and volumes
```
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- MLB Stats API for providing baseball statistics
- Apache Superset community
- DuckDB team
For support, please open an issue or contact the maintainers.