This project demonstrates a fully automated MLOps pipeline for predicting California housing prices. It covers the entire lifecycle from data ingestion and cleaning to model training, evaluation, and deployment, orchestrated by Apache Airflow and containerized with Docker.
The pipeline is designed to run automatically every 5 minutes, keeping the deployed model up to date with the freshest processed data and the most recently trained model.
- Automated Pipeline Orchestration: Uses Apache Airflow to schedule and manage the end-to-end ML workflow.
- Data Engineering:
- Loads raw California Housing data.
- Performs data cleaning (imputation of missing values, outlier removal).
- Splits data into training and testing sets.
- Model Engineering:
- Applies feature scaling using `StandardScaler`.
- Trains a `RandomForestRegressor` model.
- Evaluates model performance (MSE, R2 Score).
- Tracks experiments, logs parameters, metrics, and models using MLflow.
- Persists the trained model and scaler for deployment.
- Model Deployment:
- API Service: A FastAPI application serving model predictions via a RESTful endpoint.
- User Interface: A Streamlit web application for interactive predictions, communicating with the FastAPI service.
- Containerization: Both API and App are deployed as separate Docker containers.
- Persistent Storage: Data, models, and MLflow runs are persisted using Docker volumes.
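As a rough illustration of the model engineering stage, here is a minimal sketch. The function name, hyperparameters, and feature handling are assumptions for illustration; the actual logic lives in `code/models/model_engineering.py` and additionally logs everything to MLflow and persists the artifacts.

```python
# Minimal sketch of the model engineering stage -- names and
# hyperparameters here are illustrative, not the project's exact code.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def train_and_evaluate(X, y, n_estimators=100, random_state=42):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state
    )
    # Fit the scaler on the training split only, then reuse it at inference time.
    scaler = StandardScaler().fit(X_train)
    model = RandomForestRegressor(n_estimators=n_estimators, random_state=random_state)
    model.fit(scaler.transform(X_train), y_train)

    preds = model.predict(scaler.transform(X_test))
    metrics = {
        "mse": mean_squared_error(y_test, preds),
        "r2": r2_score(y_test, preds),
    }
    # In the real pipeline these would be logged via mlflow.log_metrics(...)
    # and the model + scaler saved to the shared /models volume.
    return model, scaler, metrics
```

Both the fitted model and the scaler are persisted together, so the API can apply the exact same transformation at prediction time.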
- Orchestration: Apache Airflow
- Data Handling: Pandas, scikit-learn
- Model Tracking: MLflow
- Web API: FastAPI
- Frontend App: Streamlit
- Containerization: Docker, Docker Compose
├── README.md
├── code
│ ├── datasets # Data Engineering stage scripts
│ │ └── data_engineering.py
│ ├── deployment # Deployment stage (API & App)
│ │ ├── api # FastAPI model API
│ │ │ ├── Dockerfile
│ │ │ ├── requirements.txt
│ │ │ └── main.py
│ │ └── app # Streamlit prediction UI
│ │ ├── Dockerfile
│ │ ├── requirements.txt
│ │ └── main.py
│ └── models # Model Engineering stage scripts
│ └── model_engineering.py
├── data
│ ├── processed # Output of data engineering (train/test CSVs)
│ └── raw # Raw dataset (input for data engineering)
├── docker-compose.yml # Defines all services
├── models # Trained ML models and scalers
├── notebooks # Jupyter notebooks (e.g. for EDA)
└── services
└── airflow
├── dags # Airflow DAG definitions
│ └── ml_pipeline.py
├── Dockerfile
├── requirements.txt
└── logs # Airflow task logs
Follow these steps to set up and run the automated ML pipeline.
- Docker: Ensure Docker is installed and running on your system.
```bash
git clone https://github.com/examplefirstaccount/california-housing-mlops.git
cd california-housing-mlops
```

Before starting all services, you need to initialize Airflow's database and create an admin user.

```bash
docker compose up airflow-init --build
```

Wait for this command to complete. You should see output indicating the `airflow` user has been created.
Now, bring up all other services (Postgres, Airflow Webserver, Scheduler, MLflow, FastAPI, Streamlit) in detached mode.
```bash
docker compose up -d --build
```

This command will:

- Build the `api` and `app` Docker images.
- Start all containers as defined in `docker-compose.yml`.
- Create Docker volumes for `pgdata` (Postgres) and `mlruns` (MLflow artifacts).
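For orientation, the volume wiring might look roughly like the fragment below. This is a hedged sketch only: the service names, build paths, and mount points are assumptions and may differ from the project's actual `docker-compose.yml`.

```yaml
# Illustrative fragment only -- service names and paths are assumptions.
services:
  api:
    build: ./code/deployment/api
    volumes:
      - ./models:/models      # trained model + scaler shared with the pipeline
  mlflow:
    volumes:
      - mlruns:/mlruns        # MLflow artifacts persisted across restarts
  postgres:
    volumes:
      - pgdata:/var/lib/postgresql/data

volumes:
  pgdata:
  mlruns:
```

Named volumes keep Postgres data and MLflow runs alive across `docker compose down` / `up` cycles, which is what makes the 5-minute retraining loop safe to restart.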
Once all services are up and running (this might take a few moments for containers to stabilize), you can access the various UIs:
- Airflow UI: `http://localhost:8080` (Login: `airflow` / `airflow`)
- MLflow UI: `http://localhost:5000`
- Streamlit App: `http://localhost:8501`
- Enable the DAG: navigate to the Airflow UI (`http://localhost:8080`).
- Find the DAG named `ml_pipeline_california_housing`.
- Toggle the DAG from "Off" to "On".
- The pipeline is scheduled to run every 5 minutes. You can also manually trigger a run by clicking the "Play" button icon on the DAG row.
- Monitor the DAG runs in the Airflow UI. Each run consists of three main stages: `data_engineering`, `model_engineering`, and `redeploy_api`.
The `redeploy_api` task ensures that, after a new model is trained, the FastAPI container picks up the latest model artifacts from the `/models` volume (triggered by a simple API call).
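The reload pattern behind that API call can be sketched as follows. This is a hypothetical simplification: the actual endpoint name and artifact filenames in `code/deployment/api/main.py` may differ.

```python
# Hypothetical sketch of the model-reload pattern behind redeploy_api.
# In the real service this function would back a FastAPI endpoint
# (e.g. POST /reload) that the Airflow task calls after training.
import pickle
from pathlib import Path

MODEL_DIR = Path("/models")  # shared Docker volume

_artifacts = {"model": None, "scaler": None}

def reload_artifacts(model_dir: Path = MODEL_DIR) -> dict:
    """Re-read the latest model and scaler from the shared volume."""
    # Filenames are assumptions; the project may use different artifact names.
    for name in ("model", "scaler"):
        with open(model_dir / f"{name}.pkl", "rb") as f:
            _artifacts[name] = pickle.load(f)
    return _artifacts
```

Because the artifacts live on a mounted volume, the container never needs to be rebuilt; re-reading the files is enough for subsequent predictions to use the new model.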
*Screenshot of the Airflow DAG `ml_pipeline_california_housing` showing successful runs.*

*Screenshot of the MLflow UI displaying logged experiments, runs, metrics, and models.*

*Screenshot of the Streamlit application's prediction interface.*
To stop and remove all running Docker containers:
```bash
docker compose down
```

To remove all containers, networks, and associated volumes (this will delete your Postgres data and MLflow runs):

```bash
docker compose down --volumes
```