This project demonstrates a fully automated MLOps pipeline for predicting California housing prices. It covers the entire lifecycle from data ingestion and cleaning to model training, evaluation, and deployment, orchestrated by Apache Airflow and containerized with Docker.
The pipeline is designed to run automatically every 5 minutes, keeping the deployed model up to date with the freshest processed data and the most recently trained model.
- Automated Pipeline Orchestration: Uses Apache Airflow to schedule and manage the end-to-end ML workflow.
- Data Engineering:
- Loads raw California Housing data.
- Performs data cleaning (imputation of missing values, outlier removal).
- Splits data into training and testing sets.
- Model Engineering:
- Applies feature scaling using `StandardScaler`.
- Trains a `RandomForestRegressor` model.
- Evaluates model performance (MSE, R2 Score).
- Tracks experiments, logs parameters, metrics, and models using MLflow.
- Persists the trained model and scaler for deployment.
- Model Deployment:
- API Service: A FastAPI application serving model predictions via a RESTful endpoint.
- User Interface: A Streamlit web application for interactive predictions, communicating with the FastAPI service.
- Containerization: Both API and App are deployed as separate Docker containers.
- Persistent Storage: Data, models, and MLflow runs are persisted using Docker volumes.
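As a rough illustration of the model engineering stage, here is a minimal sketch. The function name, hyperparameters, and feature handling are assumptions for illustration; the actual logic lives in `code/models/model_engineering.py` and additionally logs everything to MLflow and persists the artifacts.

```python
# Minimal sketch of the model engineering stage -- names and
# hyperparameters here are illustrative, not the project's exact code.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def train_and_evaluate(X, y, n_estimators=100, random_state=42):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state
    )
    # Fit the scaler on the training split only, then reuse it at inference time.
    scaler = StandardScaler().fit(X_train)
    model = RandomForestRegressor(n_estimators=n_estimators, random_state=random_state)
    model.fit(scaler.transform(X_train), y_train)

    preds = model.predict(scaler.transform(X_test))
    metrics = {
        "mse": mean_squared_error(y_test, preds),
        "r2": r2_score(y_test, preds),
    }
    # In the real pipeline these would be logged via mlflow.log_metrics(...)
    # and the model + scaler saved to the shared /models volume.
    return model, scaler, metrics
```

Both the fitted model and the scaler are persisted together, so the API can apply the exact same transformation at prediction time.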
- Orchestration: Apache Airflow
- Data Handling: Pandas, scikit-learn
- Model Tracking: MLflow
- Web API: FastAPI
- Frontend App: Streamlit
- Containerization: Docker, Docker Compose
├── README.md
├── code
│ ├── datasets # Data Engineering stage scripts
│ │ └── data_engineering.py
│ ├── deployment # Deployment stage (API & App)
│ │ ├── api # FastAPI model API
│ │ │ ├── Dockerfile
│ │ │ ├── requirements.txt
│ │ │ └── main.py
│ │ └── app # Streamlit prediction UI
│ │ ├── Dockerfile
│ │ ├── requirements.txt
│ │ └── main.py
│ └── models # Model Engineering stage scripts
│ └── model_engineering.py
├── data
│ ├── processed # Output of data engineering (train/test CSVs)
│ └── raw # Raw dataset (input for data engineering)
├── docker-compose.yml # Defines all services
├── models # Trained ML models and scalers
├── notebooks # Jupyter notebooks (e.g. for EDA)
└── services
└── airflow
├── dags # Airflow DAG definitions
│ └── ml_pipeline.py
├── Dockerfile
├── requirements.txt
└── logs # Airflow task logs
Follow these steps to set up and run the automated ML pipeline.
- Docker: Ensure Docker is installed and running on your system.
```bash
git clone https://github.com/examplefirstaccount/california-housing-mlops.git
cd california-housing-mlops
```

Before starting all services, you need to initialize Airflow's database and create an admin user.

```bash
docker compose up airflow-init --build
```

Wait for this command to complete. You should see output indicating the `airflow` user has been created.
Now, bring up all other services (Postgres, Airflow Webserver, Scheduler, MLflow, FastAPI, Streamlit) in detached mode.
```bash
docker compose up -d --build
```

This command will:

- Build the `api` and `app` Docker images.
- Start all containers as defined in `docker-compose.yml`.
- Create Docker volumes for `pgdata` (Postgres) and `mlruns` (MLflow artifacts).
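For orientation, the volume wiring might look roughly like the fragment below. This is a hedged sketch only: the service names, build paths, and mount points are assumptions and may differ from the project's actual `docker-compose.yml`.

```yaml
# Illustrative fragment only -- service names and paths are assumptions.
services:
  api:
    build: ./code/deployment/api
    volumes:
      - ./models:/models      # trained model + scaler shared with the pipeline
  mlflow:
    volumes:
      - mlruns:/mlruns        # MLflow artifacts persisted across restarts
  postgres:
    volumes:
      - pgdata:/var/lib/postgresql/data

volumes:
  pgdata:
  mlruns:
```

Named volumes keep Postgres data and MLflow runs alive across `docker compose down` / `up` cycles, which is what makes the 5-minute retraining loop safe to restart.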
Once all services are up and running (this might take a few moments for containers to stabilize), you can access the various UIs:
- Airflow UI: `http://localhost:8080` (Login: `airflow` / `airflow`)
- MLflow UI: `http://localhost:5000`
- Streamlit App: `http://localhost:8501`
- Enable the DAG: navigate to the Airflow UI (`http://localhost:8080`).
- Find the DAG named `ml_pipeline_california_housing`.
- Toggle the DAG from "Off" to "On".
- The pipeline is scheduled to run every 5 minutes. You can also manually trigger a run by clicking the "Play" button icon on the DAG row.
- Monitor the DAG runs in the Airflow UI. Each run consists of three main stages: `data_engineering`, `model_engineering`, and `redeploy_api`.
The `redeploy_api` task ensures that, after a new model is trained, the FastAPI container picks up the latest model artifacts from the `/models` volume (triggered by a simple API call).
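The reload pattern behind that API call can be sketched as follows. This is a hypothetical simplification: the actual endpoint name and artifact filenames in `code/deployment/api/main.py` may differ.

```python
# Hypothetical sketch of the model-reload pattern behind redeploy_api.
# In the real service this function would back a FastAPI endpoint
# (e.g. POST /reload) that the Airflow task calls after training.
import pickle
from pathlib import Path

MODEL_DIR = Path("/models")  # shared Docker volume

_artifacts = {"model": None, "scaler": None}

def reload_artifacts(model_dir: Path = MODEL_DIR) -> dict:
    """Re-read the latest model and scaler from the shared volume."""
    # Filenames are assumptions; the project may use different artifact names.
    for name in ("model", "scaler"):
        with open(model_dir / f"{name}.pkl", "rb") as f:
            _artifacts[name] = pickle.load(f)
    return _artifacts
```

Because the artifacts live on a mounted volume, the container never needs to be rebuilt; re-reading the files is enough for subsequent predictions to use the new model.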
*Screenshot of the Airflow DAG `ml_pipeline_california_housing` showing successful runs.*

*Screenshot of the MLflow UI displaying logged experiments, runs, metrics, and models.*

*Screenshot of the Streamlit application's prediction interface.*
To stop and remove all running Docker containers:
```bash
docker compose down
```

To remove all containers, networks, and associated volumes (this will delete your Postgres data and MLflow runs):

```bash
docker compose down --volumes
```