Predict a student's Mathematics score from demographics and study-related inputs using a productionized ML pipeline
Built with scikit-learn | Flask Web UI | Dockerized | Deployed on AWS (Elastic Beanstalk & ECR→EC2 via GitHub Actions)
- 🎯 Demo
- ✨ Features
- 📁 Project Structure
- 📊 Data
- 🤖 Model Overview
- 🚀 Quickstart
- ⚙️ Configuration
- 🌐 Routes / API
- 🔧 Training Pipeline
- 🎯 Inference Pipeline
- 📝 Logging & Errors
- 🔄 CI/CD (GitHub Actions → ECR → EC2)
- ☁️ Deployment on AWS Elastic Beanstalk
- 📸 Screenshots
- 🐛 Troubleshooting
- 🔒 Security & Cost Notes
- 🗺️ Roadmap
- 📄 License
- 🙏 Acknowledgements
🔗 Live URL: http://<your-ec2-public-ip>:8080/
(hosted on EC2, port 8080)
💡 See Screenshots for the homepage, form, and prediction result.
| Feature | Description |
|---|---|
| 🎯 Math Score Prediction | Predict from 7 inputs (gender, race/ethnicity, parental education, lunch, test prep, reading score, writing score) |
| 🧰 End-to-End Pipeline | Ingestion → Transform → Model Training → Evaluation → Persisted Artifacts |
| 🌐 Flask Web UI | Simple two-page flow (index & predict) |
| 🐳 Dockerized App | Easy to run locally or in the cloud |
| 🚀 Two AWS Deployments | • Elastic Beanstalk (Python platform) • GitHub Actions → Amazon ECR → EC2 self-hosted runner |
.
├─ 📂 .ebextensions/
│ └─ python.config # Beanstalk: WSGI path / platform opts (optional)
├─ 📂 .github/
│ └─ 📂 workflows/
│ └─ main.yaml # CI (build) + CD (push to ECR & run on EC2)
├─ 📂 artifacts/
│ ├─ data.csv # Raw snapshot used locally
│ ├─ train.csv, test.csv # Split datasets
│ ├─ preprocessor.pkl # Saved ColumnTransformer (OneHot + Scale + Impute)
│ └─ model.pkl # Trained best model serialized
├─ 📂 logs/ # (Optional) runtime/training logs
├─ 📂 notebook/ # (Optional) experiments
├─ 📂 src/
│ ├─ 📂 components/
│ │ ├─ data_ingestion.py # Read dataset, write artifacts/{raw,train,test}.csv
│ │ ├─ data_transformation.py # Build & persist sklearn preprocessor
│ │ ├─ model_trainer.py # Train/evaluate models, save best model
│ │ └─ 📂 artifacts/ # (component-specific outputs if any)
│ ├─ 📂 pipeline/
│ │ ├─ train_pipeline.py # (optional) training entrypoint
│ │ └─ predict_pipeline.py # Load artifacts & predict on new data
│ ├─ exception.py # CustomException with context
│ ├─ logger.py # Logging helper
│ └─ utils.py # save_object, evaluate_models, helpers
├─ 📂 templates/
│ ├─ index.html # Landing page
│ └─ home.html # Form + prediction result
├─ app.py # Flask app (Gunicorn entrypoint: `app:application`)
├─ Dockerfile # 3.11-slim base + gunicorn
├─ requirements.txt # Pinned libs compatible with train/infer
├─ setup.py # (optional) packaging
└─ README.md
The dataset contains student demographics and study-related attributes, with `math_score` as the target.
- `gender`
- `race_ethnicity`
- `parental_level_of_education`
- `lunch`
- `test_preparation_course`
- `reading_score`
- `writing_score`
- `math_score` (target)
The pipeline performs an 80/20 train/test split (`random_state=42`) and persists `train.csv` and `test.csv` for reproducibility.
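The split step can be sketched as follows (the actual code lives in `src/components/data_ingestion.py`; the helper name and the tiny stand-in frame here are illustrative, not the repo's exact code):

```python
# Sketch of the reproducible 80/20 split described above.
import pandas as pd
from sklearn.model_selection import train_test_split

def split_dataset(df: pd.DataFrame, test_size: float = 0.2, seed: int = 42):
    """Fixed seed so the persisted train.csv / test.csv are reproducible."""
    return train_test_split(df, test_size=test_size, random_state=seed)

# Tiny stand-in frame just to show the shapes:
df = pd.DataFrame({"reading_score": range(10), "math_score": range(10)})
train_df, test_df = split_dataset(df)  # 8 train rows, 2 test rows
```

Because the seed is pinned, rerunning ingestion regenerates byte-identical splits.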
⚠️ Note: Use your own dataset or ensure you have the right to use and distribute it.
- Linear Regression, Lasso, Ridge
- K-Nearest Neighbors Regressor
- Decision Tree, Random Forest Regressor
- XGBRegressor (XGBoost)
- CatBoost Regressor
Each model underwent hyperparameter tuning via a modular search routine:
| Model | Hyperparameters Tuned |
|---|---|
| Ridge | `alpha`: [0.1, 0.5, 1.0, 5.0, 10.0] |
| Lasso | `alpha`: [0.001, 0.01, 0.1, 1.0] |
| Random Forest | `n_estimators`, `max_depth`, `min_samples_split`, `min_samples_leaf` |
| XGBoost | `learning_rate`, `n_estimators`, `max_depth`, `subsample` |
| CatBoost | `iterations`, `learning_rate`, `depth` |
| KNN | `n_neighbors`: [3, 5, 7, 9, 11] |
| Decision Tree | `max_depth`, `min_samples_split`, `min_samples_leaf` |
Tuning Strategy:
- GridSearchCV with 5-fold cross-validation
- Automated hyperparameter selection in `src/utils.py`
- Modular design allows easy parameter updates
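The search pattern above can be sketched with one candidate (a minimal sketch, assuming plain `GridSearchCV`; the grid matches the Ridge row of the table, and the synthetic data stands in for the real features):

```python
# Hedged sketch of the GridSearchCV tuning loop (the repo's version lives
# in src/utils.py and iterates over all candidate models).
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data as a stand-in for the transformed features.
X, y = make_regression(n_samples=200, n_features=5, noise=1.0, random_state=42)

search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.1, 0.5, 1.0, 5.0, 10.0]},  # grid from the table
    cv=5,           # 5-fold cross-validation, as stated above
    scoring="r2",
)
search.fit(X, y)
best_alpha = search.best_params_["alpha"]
```

Swapping the estimator and `param_grid` covers every row of the table, which is why the search lives in one shared utility.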
| Model | Test R² | RMSE | Best Parameters |
|---|---|---|---|
| Ridge (Best) | 0.8806 | 5.39 | `alpha=1.0` |
| Linear Regression | 0.8803 | - | Default |
| CatBoost | 0.852 | - | `depth=6, iterations=100` |
| Random Forest | 0.847 | - | `n_estimators=100, max_depth=10` |
| XGBoost | 0.822 | - | `learning_rate=0.1, n_estimators=100` |
| KNN | 0.784 | - | `n_neighbors=5` |
| Decision Tree | 0.760 | - | `max_depth=5` |
- `artifacts/preprocessor.pkl` — OneHotEncoder (categoricals) + StandardScaler (numericals), with imputers
- `artifacts/model.pkl` — best model by test R² (with optimal hyperparameters)
Prerequisites: Python 3.11 recommended
# 1) Clone the repository
git clone https://github.com/<you>/Complete_ML_Project.git
cd Complete_ML_Project
# 2) Create virtual environment
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
# 3) Install dependencies
pip install --upgrade pip
pip install -r requirements.txt
# 4) Run the application
python app.py
# OR using gunicorn:
# gunicorn -w 2 -k gthread -b 0.0.0.0:8080 app:application
# 5) Open http://localhost:8080
# Build the image
docker build -t student-performance:latest .
# Run the container (container port 8080 → host 8080)
docker run --rm -p 8080:8080 student-performance:latest
# Open http://localhost:8080
| Setting | Value | Description |
|---|---|---|
| Port | 8080 | Container binds to `0.0.0.0:8080` (change with `-p HOST:CONTAINER`) |
| Artifacts | `artifacts/` | Expects `preprocessor.pkl` and `model.pkl` at runtime |
| Env Vars | None | No environment variables required for basic usage |
| Method | Endpoint | Description |
|---|---|---|
| GET | `/` | Home / landing page (`templates/index.html`) |
| GET | `/predictdata` | Renders form (`templates/home.html`) |
| POST | `/predictdata` | Accepts form inputs → returns predicted Math score |
Reads the dataset → writes `artifacts/data.csv`, `train.csv`, `test.csv`
Numerical Features (`reading_score`, `writing_score`):
- SimpleImputer(median)
- StandardScaler

Categorical Features (`gender`, `race_ethnicity`, `parental_level_of_education`, `lunch`, `test_preparation_course`):
- SimpleImputer(most_frequent)
- OneHotEncoder
- Scaling

→ Persisted as `artifacts/preprocessor.pkl`
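The preprocessor described above can be sketched as a `ColumnTransformer` (assumptions: `with_mean=False` for the post-one-hot scaling step and `handle_unknown="ignore"`, which are common choices but not confirmed by the repo):

```python
# Sketch of the preprocessor built in src/components/data_transformation.py.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_cols = ["reading_score", "writing_score"]
cat_cols = ["gender", "race_ethnicity", "parental_level_of_education",
            "lunch", "test_preparation_course"]

num_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
cat_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ("scaler", StandardScaler(with_mean=False)),  # sparse-safe scaling
])

preprocessor = ColumnTransformer([
    ("num", num_pipe, num_cols),
    ("cat", cat_pipe, cat_cols),
])

# Fit on a tiny stand-in frame to show the output shape:
demo = pd.DataFrame({
    "reading_score": [72, 90], "writing_score": [74, 88],
    "gender": ["female", "male"], "race_ethnicity": ["group B", "group C"],
    "parental_level_of_education": ["bachelor's degree", "some college"],
    "lunch": ["standard", "free/reduced"],
    "test_preparation_course": ["none", "completed"],
})
X = preprocessor.fit_transform(demo)  # 2 numeric + 10 one-hot columns here
```

Persisting this fitted object (via `save_object` in `src/utils.py`) is what guarantees training and inference apply identical transforms.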
- Trains multiple regression models with hyperparameter tuning
- Uses GridSearchCV for optimal parameter selection
- Evaluates models using cross-validation
- Reports comprehensive metrics on train/test sets
- Implemented in modular fashion in `src/components/model_trainer.py`
- Selects the best model by R² score on the test set
- Ridge Regression selected (R² = 0.8806) with optimal hyperparameters
- Model saved with fitted parameters as `artifacts/model.pkl`
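The selection step reduces to "fit every candidate, keep the highest test R²" (a sketch; the repo's `evaluate_models` in `src/utils.py` also runs GridSearchCV per model, which is omitted here — the helper name `pick_best` is illustrative):

```python
# Hedged sketch of best-model selection by test R².
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def pick_best(models, X_train, y_train, X_test, y_test):
    """Fit each candidate and return the one with the highest test R²."""
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = r2_score(y_test, model.predict(X_test))
    best_name = max(scores, key=scores.get)
    return best_name, models[best_name], scores

# Synthetic stand-in data with two of the candidate families:
X, y = make_regression(n_samples=200, n_features=5, noise=1.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
best_name, best_model, scores = pick_best(
    {"ridge": Ridge(), "lasso": Lasso()}, X_tr, y_tr, X_te, y_te)
```

The winner is then serialized to `artifacts/model.pkl` alongside the fitted preprocessor.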
`PredictPipeline` loads `preprocessor.pkl` and `model.pkl`, applies the same transforms, and returns a numeric Math score prediction.

Inputs from the form:
- `gender`
- `race_ethnicity`
- `parental_level_of_education`
- `lunch`
- `test_preparation_course`
- `reading_score`
- `writing_score`
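End to end, inference looks roughly like this (a sketch of `src/pipeline/predict_pipeline.py`; the constructor signature and `pickle` usage are assumptions, and the example row values are made up):

```python
# Hedged sketch of the inference path: load both artifacts, transform, predict.
import pickle
import pandas as pd

class PredictPipeline:
    def __init__(self, preprocessor_path="artifacts/preprocessor.pkl",
                 model_path="artifacts/model.pkl"):
        self.preprocessor_path = preprocessor_path
        self.model_path = model_path

    def predict(self, features: pd.DataFrame):
        """Apply the persisted transforms, then the persisted model."""
        with open(self.preprocessor_path, "rb") as f:
            preprocessor = pickle.load(f)
        with open(self.model_path, "rb") as f:
            model = pickle.load(f)
        return model.predict(preprocessor.transform(features))

# One row shaped like a form submission (columns match the list above):
row = pd.DataFrame([{
    "gender": "female",
    "race_ethnicity": "group B",
    "parental_level_of_education": "bachelor's degree",
    "lunch": "standard",
    "test_preparation_course": "none",
    "reading_score": 72,
    "writing_score": 74,
}])
# math_pred = PredictPipeline().predict(row)  # needs both pickles on disk
```

Keeping the transform inside the pipeline means the Flask view only has to assemble the DataFrame.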
- `src/logger.py` — standard logging with info statements around key pipeline steps
- `src/exception.py` — `CustomException` with filename/line/context to ease debugging
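A `CustomException` of this shape can be sketched as follows (the message format and constructor signature are assumptions; the repo's `src/exception.py` may differ in detail):

```python
# Hedged sketch of an exception wrapper that records where the error occurred.
import sys

class CustomException(Exception):
    """Wraps an exception with the file name and line number of the failure."""
    def __init__(self, error, error_detail=sys):
        _, _, tb = error_detail.exc_info()
        if tb is not None:
            self.message = (f"Error in {tb.tb_frame.f_code.co_filename} "
                            f"line {tb.tb_lineno}: {error}")
        else:
            self.message = str(error)
        super().__init__(self.message)

# Usage: re-raise with context from inside an except block.
try:
    1 / 0
except Exception as e:
    wrapped = CustomException(e)  # message includes file, line, and cause
```

Pipeline components raise this instead of the bare exception, so logs point straight at the failing file and line.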
- CI: Lint/tests (placeholder)
- Build & Push: Docker image → Amazon ECR
- Deploy: Self-hosted runner on EC2 pulls and runs:
docker rm -f ml_project_container || true
docker run -d --name ml_project_container -p 8080:8080 <image-uri>
| Secret | Description |
|---|---|
| `AWS_ACCESS_KEY_ID` | AWS access key |
| `AWS_SECRET_ACCESS_KEY` | AWS secret key |
| `AWS_REGION` | e.g., `us-east-2` |
| `AWS_ACCOUNT_ID` | 12-digit account ID |
| `ECR_REPOSITORY_NAME` | e.g., `studentperformance` |
- Labels: `self-hosted`, `Linux`, `X64`
- Setup: install Docker (`apt install -y docker.io`), enable the service, and add the runner user to the `docker` group
Alternative to the ECR/EC2 pipeline
- Platform: Python 3.9/3.13 on Amazon Linux 2023
- App Code: Your repo zipped or linked via CodePipeline
`.ebextensions/python.config` example:

    option_settings:
      "aws:elasticbeanstalk:container:python":
        WSGIPath: application:application

💡 If your entry is `app.py` with `application = Flask(__name__)`, set `WSGIPath: app:application`.
| Issue | Solution |
|---|---|
| Invalid reference format during deploy | Ensure the `AWS_ACCOUNT_ID`, `AWS_REGION`, `ECR_REPOSITORY_NAME` secrets are set |
| Cannot connect to Docker daemon on runner | Start Docker: `sudo systemctl enable --now docker`; add the runner user to the `docker` group |
| ECR auth errors | Ensure the IAM policy includes `ecr:GetAuthorizationToken` and repo push/pull actions |
| Port not reachable | EC2 Security Group must allow TCP 8080; if `ufw` is active: `sudo ufw allow 8080/tcp` |
- ✅ Prefer GitHub OIDC + IAM role over long-lived AWS keys
- ✅ Restrict SG ingress (ideally your IP only) or front app with load balancer/HTTPS reverse proxy
- ✅ Watch EC2/ECR costs; prune unused images, stop instances when idle
- Add HTTPS via Nginx/Caddy + Let's Encrypt on EC2
- Versioned image tags (`:sha-<GITHUB_SHA>`) and blue/green deploys
- Add tests + lint checks in CI
- Optional REST API endpoint for programmatic prediction
- Model monitoring and retraining pipeline
- Performance metrics dashboard
This project is licensed under the MIT License - see the LICENSE file for details.
- scikit-learn, XGBoost, CatBoost
- Flask & Jinja
- AWS (ECR/EC2/Beanstalk)
- GitHub Actions
- Docker
Made by Rroopesh Hari
⭐ Star this repo if you find it helpful!