This repository contains an exploration and implementation of various machine learning models to predict housing prices. The project was developed as part of the Machine Learning course at the Carlos III University of Madrid.
The project centered on understanding the relationship between various input variables and housing prices. The primary goal was to build a model that predicts house prices from these features.
- One-Hot Encoding: Applied to all categorical variables. Irrelevant variables were discarded.
- Normalization: Z-score normalization was applied to real (float) and integer features. The price column, however, received special treatment due to its particular distribution.
- Visualizations: Helper functions were used to plot distributions and relationships between variables.
- Correlation Matrix: Computed and plotted to visualize relationships among variables.
- Data Splitting: The dataset was divided into training (80%) and testing (20%) samples, as shown in the sketch after this list.
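A minimal sketch of this preprocessing pipeline, assuming pandas and scikit-learn; the file name and column names (`barrio`, `aseos`, `hab`, `rating`, `m2`, `precio`) are illustrative assumptions, not the project's actual schema:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("houses.csv")  # hypothetical input file

# Discard irrelevant variables, then one-hot encode the categorical ones
df = df.drop(columns=["id"])                 # assumed irrelevant column
df = pd.get_dummies(df, columns=["barrio"])  # assumed categorical column

# Z-score normalization of the numeric features (price handled separately)
numeric_cols = ["aseos", "hab", "rating", "m2"]  # assumed numeric columns
df[numeric_cols] = (df[numeric_cols] - df[numeric_cols].mean()) / df[numeric_cols].std()

# 80/20 train/test split
X = df.drop(columns=["precio"])
y = df["precio"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```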
A key component that set our model apart was the exploration of non-linear relationships between features in our dataset. This exploration allowed us to create a unique set of input features; in particular, a combination of bathrooms, bedrooms, and ratings (aseos + hab * rating) showed a linear correlation with the price.
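Reusing the DataFrame from the sketch above, the engineered feature can be built in one line; the column names `aseos`, `hab`, and `rating` follow the formula in the text, but their exact spelling in the dataset is an assumption:

```python
# Combined feature: bathrooms + bedrooms * rating
df["aseos_hab_rating"] = df["aseos"] + df["hab"] * df["rating"]

# Inspect its linear correlation with the price
print(df["aseos_hab_rating"].corr(df["precio"]))
```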
- Baseline Model: Linear regression from scikit-learn was used as the baseline. Its performance was surprisingly competitive.
- Other Models Tested: RandomForestRegressor, ElasticNet, Lasso, Ridge, DecisionTreeRegressor, KNeighborsRegressor, GradientBoostingRegressor, AdaBoostRegressor, and CatBoostRegressor.
- Model Optimization: Hyperparameter tuning focused in particular on GradientBoostingRegressor and CatBoostRegressor; see the sketch below.
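An illustrative hyperparameter search for GradientBoostingRegressor using scikit-learn's GridSearchCV; the grid values below are assumptions, not the values actually tuned in the project:

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {  # assumed search space
    "n_estimators": [200, 500],
    "learning_rate": [0.05, 0.1],
    "max_depth": [3, 5],
}
search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_absolute_percentage_error",
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_, -search.best_score_)  # best MAPE as a fraction
```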
- MAE (Mean Absolute Error): Measures the average absolute difference between predicted and actual values.
- MAPE (Mean Absolute Percentage Error): Expresses the error as a percentage, which is more interpretable and scale-independent. Both metrics are computed in the sketch below.
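Both metrics are available directly in scikit-learn; this sketch assumes a fitted model (here, the grid-search winner from the sketch above) and the train/test split from the preprocessing sketch:

```python
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

y_pred = search.best_estimator_.predict(X_test)
print("MAE: ", mean_absolute_error(y_test, y_pred))
# mean_absolute_percentage_error returns a fraction; multiply by 100 for %
print("MAPE:", 100 * mean_absolute_percentage_error(y_test, y_pred))
```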
The standout model in terms of performance was CatBoostRegressor, achieving MAPE values around 10-12% on the test set. In the competition held in class, its MAPE was 12.91%, the best result among all groups and 0.64 percentage points better than the second-best group.
You can test the model yourself by running the following commands:

```bash
cd validator
python validator.py
```