A complete automated ML system that demonstrates production MLOps practices. Features Apache Airflow orchestration, MLflow experiment tracking, FastAPI model serving, and Streamlit frontend - all containerized with Docker.


🏡 Automated ML Pipeline for California Housing Price Prediction

This project demonstrates a fully automated MLOps pipeline for predicting California housing prices. It covers the entire lifecycle from data ingestion and cleaning to model training, evaluation, and deployment, orchestrated by Apache Airflow and containerized with Docker.

The pipeline is scheduled to run automatically every 5 minutes, so the deployed model always reflects the most recently processed data and the latest training run.

✨ Features

  • Automated Pipeline Orchestration: Uses Apache Airflow to schedule and manage the end-to-end ML workflow.
  • Data Engineering:
    • Loads raw California Housing data.
    • Performs data cleaning (imputation of missing values, outlier removal).
    • Splits data into training and testing sets.
  • Model Engineering:
    • Applies feature scaling using StandardScaler.
    • Trains a RandomForestRegressor model.
    • Evaluates model performance (MSE, R2 Score).
    • Tracks experiments, logs parameters, metrics, and models using MLflow.
    • Persists the trained model and scaler for deployment.
  • Model Deployment:
    • API Service: A FastAPI application serving model predictions via a RESTful endpoint.
    • User Interface: A Streamlit web application for interactive predictions, which communicates with the FastAPI service.
    • Containerization: Both API and App are deployed as separate Docker containers.
  • Persistent Storage: Data, models, and MLflow runs are persisted using Docker volumes.
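Concretely, the data/model engineering stages follow the familiar scikit-learn flow: clean, split, scale, train, evaluate. The sketch below is a stand-in, not the repository's actual scripts; it uses synthetic data and illustrative column names:

```python
# Minimal sketch of the model-engineering stage on synthetic data.
# Column names and coefficients are illustrative only.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "MedInc": rng.uniform(0.5, 15, 500),
    "HouseAge": rng.uniform(1, 52, 500),
    "AveRooms": rng.uniform(1, 10, 500),
})
# Synthetic target loosely mimicking a housing-price relationship
df["MedHouseVal"] = 0.4 * df["MedInc"] + 0.01 * df["HouseAge"] + rng.normal(0, 0.2, 500)

# Split, scale, train, evaluate
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="MedHouseVal"), df["MedHouseVal"],
    test_size=0.2, random_state=42,
)
scaler = StandardScaler().fit(X_train)
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(scaler.transform(X_train), y_train)

preds = model.predict(scaler.transform(X_test))
print("MSE:", mean_squared_error(y_test, preds))
print("R2:", r2_score(y_test, preds))
```

In the real pipeline this is where MLflow logging would wrap the training call, and the fitted `model` and `scaler` would be pickled to the `models/` volume for the API to load.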

🛠️ Technologies Used

  • Orchestration: Apache Airflow
  • Data Handling: Pandas, scikit-learn
  • Model Tracking: MLflow
  • Web API: FastAPI
  • Frontend App: Streamlit
  • Containerization: Docker, Docker Compose

📂 Repository Structure

```
├── README.md
├── code
│   ├── datasets          # Data Engineering stage scripts
│   │   └── data_engineering.py
│   ├── deployment        # Deployment stage (API & App)
│   │   ├── api           # FastAPI model API
│   │   │   ├── Dockerfile
│   │   │   ├── requirements.txt
│   │   │   └── main.py
│   │   └── app           # Streamlit prediction UI
│   │       ├── Dockerfile
│   │       ├── requirements.txt
│   │       └── main.py
│   └── models            # Model Engineering stage scripts
│       └── model_engineering.py
├── data
│   ├── processed         # Output of data engineering (train/test CSVs)
│   └── raw               # Raw dataset (input for data engineering)
├── docker-compose.yml    # Defines all services
├── models                # Trained ML models and scalers
├── notebooks             # Jupyter notebooks (e.g. for EDA)
└── services
    └── airflow
        ├── dags          # Airflow DAG definitions
        │   └── ml_pipeline.py
        ├── Dockerfile
        ├── requirements.txt
        └── logs          # Airflow task logs
```

🚀 Getting Started

Follow these steps to set up and run the automated ML pipeline.

Prerequisites

  • Docker: Ensure Docker is installed and running on your system.

1. Clone the Repository

```shell
git clone https://github.com/examplefirstaccount/california-housing-mlops.git
cd california-housing-mlops
```

2. Initialize Airflow Database

Before starting all services, you need to initialize Airflow's database and create an admin user.

```shell
docker compose up airflow-init --build
```

Wait for this command to complete. You should see output indicating the airflow user has been created.

3. Start All Services

Now, bring up all other services (Postgres, Airflow Webserver, Scheduler, MLflow, FastAPI, Streamlit) in detached mode.

```shell
docker compose up -d --build
```

This command will:

  • Build the api and app Docker images.
  • Start all containers as defined in docker-compose.yml.
  • Create Docker volumes for pgdata (Postgres data) and mlruns (MLflow artifacts).

4. Access the UIs

Once all services are up and running (this might take a few moments for containers to stabilize), you can access the various UIs:

  • Airflow UI: http://localhost:8080 (Login: airflow / airflow)
  • MLflow UI: http://localhost:5000
  • Streamlit App: http://localhost:8501
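Under the hood, the Streamlit app sends feature values to the FastAPI service and displays the returned prediction. A hedged sketch of that call is below; the internal hostname `api:8000`, the `/predict` path, and the `prediction` response key are assumptions for illustration, not taken from this repository:

```python
# Hypothetical client-side call from the Streamlit app to the FastAPI service.
# Hostname, port, path, and response schema are assumptions.
import json
from urllib import request

API_URL = "http://api:8000/predict"  # docker-compose service name (assumed)

def build_request(features: dict, url: str = API_URL) -> request.Request:
    """Encode a feature payload as a JSON POST request."""
    body = json.dumps(features).encode("utf-8")
    return request.Request(url, data=body,
                           headers={"Content-Type": "application/json"})

def predict(features: dict) -> float:
    """POST features to the API and return the predicted house value."""
    with request.urlopen(build_request(features)) as resp:
        return json.load(resp)["prediction"]  # response key is an assumption
```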

5. Run the ML Pipeline

  1. Enable the DAG: Navigate to the Airflow UI (http://localhost:8080).
  2. Find the DAG named ml_pipeline_california_housing.
  3. Toggle the DAG from "Off" to "On".
  4. The pipeline is scheduled to run every 5 minutes. You can also manually trigger a run by clicking the "Play" button icon on the DAG row.
  5. Monitor the DAG runs in the Airflow UI. Each run consists of three main stages: data_engineering, model_engineering, and redeploy_api.
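Ignoring Airflow specifics, the run order of those three stages reduces to a simple chain. This stand-in for services/airflow/dags/ml_pipeline.py uses placeholder bodies; in the real DAG each function would be wrapped in an Airflow operator:

```python
# Illustrative stand-in for the three DAG stages; bodies are placeholders.
def data_engineering() -> str:
    # load raw CSV, impute missing values, remove outliers, write train/test CSVs
    return "data/processed"

def model_engineering(processed_dir: str) -> str:
    # scale features, train RandomForestRegressor, log run to MLflow, persist model
    return "models/model.pkl"

def redeploy_api(model_path: str) -> str:
    # notify the FastAPI container to reload the fresh artifacts
    return f"reloaded {model_path}"

# Equivalent of the DAG edge: data_engineering >> model_engineering >> redeploy_api
result = redeploy_api(model_engineering(data_engineering()))
print(result)
```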

The redeploy_api task ensures that after a new model is trained, the FastAPI container picks up the latest model artifacts from the /models volume; this is triggered by a simple API call.

Screenshots

Airflow UI

Screenshot of the Airflow DAG ml_pipeline_california_housing showing successful runs.

MLflow UI

Screenshot of the MLflow UI displaying logged experiments, runs, metrics, and models.

Streamlit App

Screenshot of the Streamlit application's prediction interface.

Stopping Services

To stop and remove all running Docker containers:

```shell
docker compose down
```

To remove all containers, networks, and associated volumes (this will also delete your Postgres data and MLflow runs):

```shell
docker compose down --volumes
```
