felixfaruix/ML-Advanced-Stacking-Regression

House Prices - Advanced Stacking Regression via Optuna

Tech stack: Python, Jupyter, scikit-learn, Pandas, NumPy, XGBoost, LightGBM, CatBoost, Optuna

Project Overview

A machine learning pipeline for predicting house prices that combines stacked ensemble learning, automated hyperparameter optimization with Optuna, and feature engineering. The project follows production-oriented ML practices: custom scikit-learn transformers, statistical outlier detection, and automated model selection.

Custom Transformers

  • LogTransformer: Skewness reduction with sparse feature detection
  • OutliersRemoval: Statistical outlier elimination for training data
  • CatEncoder: Combined one-hot and ordinal encoding with missing value handling
  • TotalArea: Domain-specific feature combination (basement + ground floor)
  • TotalBaths: Weighted bathroom counting (full=1.0, half=0.5)
  • HighlyCorrelatedFeatures: Automated multicollinearity reduction
  • MedianImputer: Robust missing value imputation
  • AgeCalculator: Temporal feature engineering (house age from build/sale years)
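As an illustration of how transformers like TotalBaths and AgeCalculator can plug into a scikit-learn pipeline, here is a minimal sketch. It is not the repository's code: the column names (FullBath, HalfBath, BsmtFullBath, BsmtHalfBath, YearBuilt, YrSold), the output column names, and the choice to count basement bathrooms are all assumptions based on the usual layout of the House Prices dataset.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class TotalBaths(BaseEstimator, TransformerMixin):
    """Add a weighted bathroom count: full baths count 1.0, half baths 0.5."""

    # Column names are assumptions, not taken from the repository.
    def __init__(self, full_cols=("FullBath", "BsmtFullBath"),
                 half_cols=("HalfBath", "BsmtHalfBath")):
        self.full_cols = full_cols
        self.half_cols = half_cols

    def fit(self, X, y=None):
        return self  # stateless: nothing is learned from the data

    def transform(self, X):
        X = X.copy()  # avoid mutating the caller's DataFrame
        full = X[list(self.full_cols)].sum(axis=1)
        half = X[list(self.half_cols)].sum(axis=1)
        X["TotalBaths"] = full + 0.5 * half
        return X


class AgeCalculator(BaseEstimator, TransformerMixin):
    """Add the house age at sale time (sale year minus build year)."""

    def __init__(self, built_col="YearBuilt", sold_col="YrSold"):
        self.built_col = built_col
        self.sold_col = sold_col

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        X["HouseAge"] = X[self.sold_col] - X[self.built_col]
        return X


# Tiny made-up sample, only to show the transformers in action.
houses = pd.DataFrame({
    "FullBath": [2, 1], "BsmtFullBath": [1, 0],
    "HalfBath": [1, 0], "BsmtHalfBath": [0, 1],
    "YearBuilt": [2003, 1976], "YrSold": [2008, 2007],
})

features = Pipeline([("baths", TotalBaths()), ("age", AgeCalculator())])
engineered = features.fit_transform(houses)
print(engineered[["TotalBaths", "HouseAge"]])
```

Because both classes inherit from BaseEstimator and TransformerMixin, they can be cloned, grid-searched, and chained like any built-in scikit-learn step.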

Model Evaluation Pipeline

  • Optuna Objective Function: Automated ensemble discovery with cross-validation scoring
  • Performance Metrics: RMSE, R², MAE for comprehensive evaluation
  • Residual Analysis: Error distribution examination for model validation
  • Visualization: Optimization history and parameter importance plots

Data Processing Strategy

  • Training Pipeline: Includes outlier removal and feature engineering
  • Validation/Test Pipeline: Consistent preprocessing without outlier removal
  • Missing Value Handling: Differential imputation
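The split between training and validation preprocessing can be illustrated with the sketch below. It makes several assumptions that are not from the repository: an IQR rule stands in for the statistical outlier detection, the column names GrLivArea and LotFrontage are borrowed from the House Prices dataset, the data is synthetic, and a Ridge model replaces the stacked ensemble. The point is only that rows are dropped from the training set, while validation rows all pass through the same fitted preprocessing.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline


def drop_iqr_outliers(X, y, column, k=1.5):
    """Drop rows whose `column` value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = X[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    # Rows with a missing value are kept; imputation handles them later.
    keep = X[column].between(q1 - k * iqr, q3 + k * iqr) | X[column].isna()
    return X[keep], y[keep]


# Synthetic data with a few missing values and three injected outliers.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "GrLivArea": rng.normal(1500, 400, 200),
    "LotFrontage": rng.normal(70, 15, 200),
})
y = pd.Series(100 * X["GrLivArea"] + rng.normal(0, 5000, 200))
X.loc[rng.random(200) < 0.1, "LotFrontage"] = np.nan
X.loc[[0, 1, 2], "GrLivArea"] = 10_000  # implausibly large living areas

# Simple positional split: first 160 rows train, last 40 validate.
X_train, y_train = X.iloc[:160], y.iloc[:160]
X_valid, y_valid = X.iloc[160:], y.iloc[160:]

# Outlier removal is applied to the training rows only.
X_train_clean, y_train_clean = drop_iqr_outliers(X_train, y_train, "GrLivArea")

# The same pipeline (fitted on training data) preprocesses both sets.
model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("ridge", Ridge()),
])
model.fit(X_train_clean, y_train_clean)
valid_predictions = model.predict(X_valid)

print(f"Training rows: {len(X_train)} -> {len(X_train_clean)} after outlier removal")
print(f"Validation rows scored: {len(valid_predictions)}")
```

Keeping row removal outside the shared pipeline means validation and test data are never silently shortened, so every row still receives a prediction.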

Getting Started

pip install -r requirements.txt