This project aims to predict individual income using a complete Machine Learning workflow.
The dataset contains demographic, education, employment, and household-related attributes, and the model uses a Decision Tree Regressor with full hyperparameter tuning to estimate income values.
- Perform data cleaning and preprocessing.
- Apply Ordinal and One-Hot Encoding to categorical features.
- Use log transformation to reduce skewness in the target variable.
- Split data into training and testing sets.
- Use GridSearchCV to optimize tree hyperparameters.
- Evaluate model performance (R², RMSE).
- Visualize predicted vs actual income values.
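The preprocessing and splitting steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the column names, category order, and toy data are placeholders for the real dataset's schema.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split

# Toy stand-in data; the real dataset's columns differ.
df = pd.DataFrame({
    "education": ["High School", "Bachelor", "Master", "Bachelor"],
    "occupation": ["Sales", "Tech", "Tech", "Admin"],
    "age": [25, 32, 41, 29],
    "income": [30000.0, 52000.0, 78000.0, 45000.0],
})

# Ordinal encoding for ordered categories, one-hot for nominal ones.
preprocess = ColumnTransformer([
    ("edu", OrdinalEncoder(categories=[["High School", "Bachelor", "Master"]]),
     ["education"]),
    ("occ", OneHotEncoder(handle_unknown="ignore"), ["occupation"]),
], remainder="passthrough")  # numeric columns pass through unchanged

X = df.drop(columns="income")
# log1p reduces right-skew in the target; invert predictions later with expm1.
y = np.log1p(df["income"])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Fit encoders on the training split only, then transform both splits.
X_train_enc = preprocess.fit_transform(X_train)
X_test_enc = preprocess.transform(X_test)
```

Fitting the encoders on the training split only (and merely transforming the test split) avoids leaking test-set information into preprocessing.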
- Python
- Pandas
- NumPy
- Scikit-Learn
- Plotly
- Jupyter Notebook
- Google Colab
- Algorithm: Decision Tree Regression
- Tuning: `max_depth`, `min_samples_leaf`, `min_samples_split`
- Scoring: Negative Mean Squared Error (`neg_mean_squared_error`)
- Target: Income (log-transformed during training)
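The tuning setup above can be sketched with scikit-learn as follows. The grid values and the synthetic stand-in data are illustrative assumptions, not the exact ones used in the notebooks.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (the real project uses the dataset's features).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X[:, 0] * 2 + rng.normal(size=200)

param_grid = {
    "max_depth": [3, 5, 10, None],
    "min_samples_leaf": [1, 5, 10],
    "min_samples_split": [2, 10, 20],
}

# GridSearchCV maximizes the score, so scikit-learn negates MSE:
# a lower error becomes a higher (less negative) score.
search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)
print(search.best_params_)
```

`search.best_estimator_` then holds the refit tree with the best-scoring combination, ready for evaluation on the held-out test set.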
Due to the synthetic nature of the dataset, the model shows:
- Training R²: low
- Testing R²: low
This indicates underfitting, meaning the dataset lacks strong relationships between features and income.
Despite this, the project demonstrates a clean, end-to-end ML pipeline suitable for learning and experimentation.
- data.csv – Dataset used for training and testing the model
- Income Prediction Project No LogTransformation.ipynb – Main notebook containing the full ML workflow without the log transformation (higher accuracy)
- Income Prediction with Log Trasnfromation.ipynb – Variant of the notebook with the log transformation applied to the target (lower accuracy)
- README.md β Project documentation
Due to the synthetic nature of the dataset and the chosen model (Decision Tree Regression), the model underfits, with low R² scores in both notebooks.
- Income Prediction Project No LogTransformation.ipynb – R² = 1.68%
- Income Prediction with Log Trasnfromation.ipynb – R² = -8.54%
This demonstrates a realistic challenge when datasets lack strong feature–target relationships.
- Try ensemble models: RandomForest, GradientBoosting, XGBoost
- Use a more realistic dataset
- Apply advanced feature engineering to extract meaningful patterns
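As a sketch of the first suggestion, ensemble regressors can be swapped in for the decision tree with minimal code changes. The data here is synthetic stand-in data, and XGBoost is omitted since it lives in a separate package:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data with a nonlinear signal, as an illustrative stand-in.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = X[:, 0] - X[:, 1] ** 2 + rng.normal(scale=0.5, size=300)

for model in (RandomForestRegressor(n_estimators=200, random_state=42),
              GradientBoostingRegressor(random_state=42)):
    # 5-fold cross-validated R² gives a quick apples-to-apples comparison.
    scores = cross_val_score(model, X, y, scoring="r2", cv=5)
    print(type(model).__name__, round(float(scores.mean()), 3))
```

Because both estimators follow the scikit-learn API, they drop straight into the existing `GridSearchCV` pipeline (with their own parameter grids, e.g. `n_estimators` and `max_depth`).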
Developed by Samir Mohamed as part of a regression machine learning practice project.