This project aims to predict credit scores (categorized as 'Good', 'Standard', and 'Poor') using machine learning techniques. The entire ML pipeline includes data extraction, cleaning, preprocessing, model training, feature selection, dimensionality reduction, and deployment using Streamlit. The project applies Mutual Information for feature selection and Principal Component Analysis (PCA) for dimensionality reduction.
Project link: Credit_Score_Prediction
- Getting Started
- Prerequisites
- Data Extraction & Cleaning
- Data Preprocessing
- Feature Selection Using Mutual Information
- Dimensionality Reduction with PCA
- Model Training
- Model Evaluation
- Pipeline Integration
- Model Deployment on Streamlit
- Real World Examples
- Results
- Contributing
- License
- Acknowledgments
Follow these instructions to set up and run the project locally for development and testing purposes.
Ensure that you have Python installed along with the required libraries. Download Python from python.org and install the libraries using pip.
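A typical install, assuming the stack used throughout this README (the exact dependency list may differ): `pip install pandas scikit-learn xgboost matplotlib streamlit joblib`.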
The dataset used for this project is stored in the `data` directory. The primary data file is `credit_score_data.csv`, which contains features such as `Annual_Income`, `Monthly_Inhand_Salary`, `Interest_Rate`, and others relevant to predicting credit scores.
- Data is loaded from CSV files using Pandas (see the cleaning sketch after this list).
- Initial exploration is conducted to understand the data's structure and contents.
- Missing values are handled using imputation techniques.
- Outliers are detected and treated.
- Irrelevant features are removed to streamline the dataset.
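A minimal cleaning sketch with pandas. The specific strategies shown (median/mode imputation, percentile capping for outliers) and the dropped identifier columns are illustrative assumptions, not necessarily the exact choices made in this project:

```python
import pandas as pd

# Load the raw data (path taken from the repository layout described above)
df = pd.read_csv("data/credit_score_data.csv")

# Initial exploration: shape, dtypes, and missing-value counts
print(df.shape)
print(df.dtypes)
print(df.isna().sum())

# Impute missing values: median for numeric columns, mode for categorical ones
num_cols = df.select_dtypes(include="number").columns
cat_cols = df.select_dtypes(exclude="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())
df[cat_cols] = df[cat_cols].fillna(df[cat_cols].mode().iloc[0])

# Treat outliers by capping numeric features at the 1st/99th percentiles
for col in num_cols:
    lo, hi = df[col].quantile([0.01, 0.99])
    df[col] = df[col].clip(lo, hi)

# Remove identifier-style columns with no predictive value (column names assumed)
df = df.drop(columns=["ID", "Customer_ID", "Name", "SSN"], errors="ignore")
```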
The preprocessing steps involve scaling numerical features, encoding categorical variables, and splitting the dataset into training and test sets, as sketched after the list below.
- Standardization is applied to numerical features to ensure all features contribute equally to the model.
- Categorical variables are converted into numerical form using techniques like one-hot encoding.
- The dataset is split into training and testing sets using an 80-20 ratio.
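A preprocessing sketch with scikit-learn (a recent version with `sparse_output` is assumed). It continues from the cleaned `df` above, and the target column name `Credit_Score` is an assumption:

```python
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# 'Credit_Score' is the assumed name of the target column
X = df.drop(columns=["Credit_Score"])
y = df["Credit_Score"]

# Standardize numeric features and one-hot encode categorical ones
numeric = X.select_dtypes(include="number").columns
categorical = X.select_dtypes(exclude="number").columns
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=False), categorical),
])

# 80-20 train/test split, stratified to preserve class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_train_t = preprocess.fit_transform(X_train)
X_test_t = preprocess.transform(X_test)
```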
Feature selection is performed using Mutual Information to identify the features most relevant to predicting credit scores; a sketch follows the list below.
- Measures the dependency between each feature and the target variable.
- Helps in selecting features that provide the most information about the target.
- Top Features: Bar plot showcasing features with the highest mutual information scores.
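A sketch of the mutual-information scoring and bar plot, continuing from the preprocessing step above; the cut-off of 15 features is arbitrary:

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Score each preprocessed feature against the credit-score target
mi = mutual_info_classif(X_train_t, y_train, random_state=42)
mi_scores = pd.Series(mi, index=preprocess.get_feature_names_out()).sort_values()

# Bar plot of the highest-scoring features
mi_scores.tail(15).plot.barh(title="Top features by mutual information")
plt.tight_layout()
plt.show()
```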
Principal Component Analysis (PCA) is used to reduce the dimensionality of the feature space while retaining the most important information (see the example after this list).
- Reduces the number of features by transforming them into a new set of uncorrelated features (principal components).
- Helps in speeding up the model training process and reducing overfitting.
- PCA Components: Scree plot to visualize the explained variance by each principal component.
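A PCA sketch with a scree plot; the 95% explained-variance threshold is an illustrative choice, not a value taken from the project:

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Keep enough components to explain ~95% of the variance (threshold assumed)
pca = PCA(n_components=0.95, random_state=42)
X_train_pca = pca.fit_transform(X_train_t)
X_test_pca = pca.transform(X_test_t)

# Scree plot: explained variance ratio per principal component
components = range(1, len(pca.explained_variance_ratio_) + 1)
plt.plot(components, pca.explained_variance_ratio_, marker="o")
plt.xlabel("Principal component")
plt.ylabel("Explained variance ratio")
plt.title("Scree plot")
plt.show()
```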
The model training process focuses on XGBoost (Extreme Gradient Boosting) because of its strong performance and efficiency on classification tasks, especially with structured/tabular data such as the dataset used in this project. A training sketch follows the list of advantages below.
- High Accuracy: XGBoost consistently performs well in competitions and real-world applications because it can effectively capture complex patterns in data.
- Handling Missing Values: XGBoost has built-in handling for missing values, which is crucial in datasets with incomplete information.
- Regularization: It includes L1 (Lasso) and L2 (Ridge) regularization techniques to prevent overfitting, making it robust against noise and irrelevant features.
- Scalability: XGBoost is highly optimized for both parallel and distributed computing, making it suitable for large datasets.
- Tree Pruning: Grows trees to a maximum depth (max_depth) and prunes splits that do not yield a positive gain, which keeps training computationally efficient.
- Flexibility: XGBoost can handle both regression and classification tasks with binary or multi-class outputs, making it versatile for different problem types.
- Interpretability: Feature importance scores can be easily extracted, aiding in the interpretability of the model results.
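A training sketch on the PCA-reduced features; the hyperparameters are illustrative defaults rather than tuned values from the project:

```python
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

# XGBoost expects integer class labels, so encode 'Good'/'Poor'/'Standard'
label_enc = LabelEncoder()
y_train_enc = label_enc.fit_transform(y_train)
y_test_enc = label_enc.transform(y_test)

# Illustrative hyperparameters, not tuned values
model = XGBClassifier(
    n_estimators=300,
    max_depth=6,
    learning_rate=0.1,
    eval_metric="mlogloss",
    random_state=42,
)
model.fit(X_train_pca, y_train_enc)
```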
The models are evaluated using various metrics, including accuracy, precision, recall, and F1-score; an evaluation sketch follows the list below.
- Provides detailed performance metrics for each class (Good, Standard, Poor).
- A matrix showing the actual vs. predicted classifications.
- Helps in understanding model errors.
- ROC (Receiver Operating Characteristic) curve plots the true positive rate against the false positive rate.
- AUC (Area Under the Curve) measures the model's ability to distinguish between classes.
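An evaluation sketch covering the classification report, confusion matrix, and one-vs-rest ROC AUC, continuing from the training step above:

```python
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

y_pred = model.predict(X_test_pca)
y_proba = model.predict_proba(X_test_pca)

# Per-class precision, recall, and F1-score
print(classification_report(y_test_enc, y_pred, target_names=label_enc.classes_))

# Actual vs. predicted class counts
print(confusion_matrix(y_test_enc, y_pred))

# One-vs-rest AUC summarizes how well the classes are separated
print("ROC AUC:", roc_auc_score(y_test_enc, y_proba, multi_class="ovr"))
```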
- The entire ML pipeline from data extraction to model training and evaluation is integrated for streamlined processing.
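A sketch of how the steps can be chained into a single scikit-learn `Pipeline` and persisted for the app; the artifact file names are assumptions:

```python
import joblib
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier

# Chain preprocessing, dimensionality reduction, and the classifier into one object
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("pca", PCA(n_components=0.95, random_state=42)),
    ("model", XGBClassifier(eval_metric="mlogloss", random_state=42)),
])
pipeline.fit(X_train, y_train_enc)

# Persist the fitted pipeline and the label encoder for the Streamlit app
joblib.dump(pipeline, "credit_score_pipeline.joblib")
joblib.dump(label_enc, "label_encoder.joblib")
```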
- The trained model is deployed using Streamlit for easy access and interaction.
- Users can input new data and obtain credit score predictions in real time.
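A minimal `app.py` sketch; the artifact paths and the three input fields are illustrative, and it assumes the saved pipeline was trained on exactly those features (a real app would expose every feature the model expects):

```python
# app.py
import joblib
import pandas as pd
import streamlit as st

# Artifact paths are assumptions; adjust to wherever the pipeline is saved
pipeline = joblib.load("credit_score_pipeline.joblib")
label_enc = joblib.load("label_encoder.joblib")

st.title("Credit Score Prediction")

# Only a few inputs are shown; a real app must collect every feature
# the pipeline was trained on
annual_income = st.number_input("Annual_Income", min_value=0.0)
monthly_salary = st.number_input("Monthly_Inhand_Salary", min_value=0.0)
interest_rate = st.number_input("Interest_Rate", min_value=0.0)

if st.button("Predict"):
    row = pd.DataFrame([{
        "Annual_Income": annual_income,
        "Monthly_Inhand_Salary": monthly_salary,
        "Interest_Rate": interest_rate,
    }])
    pred = pipeline.predict(row)[0]
    st.success(f"Predicted credit score: {label_enc.inverse_transform([pred])[0]}")
```

Launch it with `streamlit run app.py`.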
- Examples include predicting credit scores for various profiles to illustrate the model's real-world applicability.
- The final trained models are saved and can be used to classify new instances of credit scores.
- Contributions are welcome! Please feel free to fork the repository, make your changes, and create a pull request.
- This project is licensed under the MIT License - see the LICENSE file for more details.
- Special thanks to data providers and open-source contributors who helped make this project possible.