This project was created as part of CSE343, the Machine Learning course at IIIT Delhi.
Employee attrition refers to an employee's voluntary or involuntary departure from a workforce. Organizations spend considerable resources hiring and training talented employees, and every employee is critical to a company's success. Our goal was to predict employee attrition and identify the factors that contribute to an employee leaving. We trained several classification models on our dataset and assessed their performance using accuracy, precision, recall and F1 score. We also analyzed the dataset to identify the key factors behind attrition. The project is intended to help organizations gain insight into what drives attrition and thereby improve their retention rate.
We trained and evaluated 9 supervised machine learning classification models:
- Logistic Regression
- Naive Bayes
- Decision Tree
- Random Forest
- AdaBoost
- Support Vector Machine
- Linear Discriminant Analysis
- Multilayer Perceptron
- K-Nearest Neighbors
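The nine models listed above could be instantiated with scikit-learn roughly as follows. This is a minimal sketch rather than the project's actual code: the choice of scikit-learn classes (for example `GaussianNB` for Naive Bayes), the hyperparameters, the helper name `build_models` and the synthetic data from `make_classification` are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def build_models(seed=0):
    """Return one untrained estimator per model family (hypothetical helper)."""
    return {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Naive Bayes": GaussianNB(),
        "Decision Tree": DecisionTreeClassifier(random_state=seed),
        "Random Forest": RandomForestClassifier(random_state=seed),
        "AdaBoost": AdaBoostClassifier(random_state=seed),
        "Support Vector Machine": SVC(random_state=seed),
        "Linear Discriminant Analysis": LinearDiscriminantAnalysis(),
        "Multilayer Perceptron": MLPClassifier(max_iter=500, random_state=seed),
        "K-Nearest Neighbors": KNeighborsClassifier(),
    }


# Synthetic, imbalanced stand-in data (NOT the attrition dataset).
X, y = make_classification(
    n_samples=300, n_features=10, weights=[0.84, 0.16], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Fit every model and record its test accuracy.
scores = {}
for name, model in build_models().items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Keeping the estimators in a dictionary makes it straightforward to run the same training and evaluation loop over every model and every dataset variant.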
We trained our models on 6 different versions of the dataset:
- Imbalanced
- Undersampled
- Oversampled
- PCA
- Undersampling With PCA
- Oversampling With PCA
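One way to derive the six variants listed above is sketched below. Several details are assumptions, since the text does not specify them: random resampling is used here (the project may have used a different technique such as SMOTE), PCA is applied after resampling and keeps 95% of the variance, and the helper names `random_resample` and `with_pca` as well as the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample


def random_resample(X, y, mode, seed=0):
    """Balance the classes by random over- or undersampling (assumed method)."""
    classes, counts = np.unique(y, return_counts=True)
    # Oversampling grows every class to the largest class size;
    # undersampling shrinks every class to the smallest class size.
    target = int(counts.max() if mode == "over" else counts.min())
    X_parts, y_parts = [], []
    for cls, count in zip(classes, counts):
        X_cls = X[y == cls]
        if count != target:
            X_cls = resample(
                X_cls,
                replace=(mode == "over"),  # sample with replacement only when growing
                n_samples=target,
                random_state=seed,
            )
        X_parts.append(X_cls)
        y_parts.append(np.full(target, cls))
    return np.vstack(X_parts), np.concatenate(y_parts)


def with_pca(X, n_components=0.95):
    """Standardize the features, then keep enough components for 95% variance."""
    X_scaled = StandardScaler().fit_transform(X)
    return PCA(n_components=n_components).fit_transform(X_scaled)


# Synthetic stand-in data with an imbalanced label (NOT the attrition dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = np.array([0] * 168 + [1] * 32)

X_under, y_under = random_resample(X, y, "under")
X_over, y_over = random_resample(X, y, "over")

variants = {
    "Imbalanced": (X, y),
    "Undersampled": (X_under, y_under),
    "Oversampled": (X_over, y_over),
    "PCA": (with_pca(X), y),
    "Undersampling With PCA": (with_pca(X_under), y_under),
    "Oversampling With PCA": (with_pca(X_over), y_over),
}

for name, (X_v, y_v) in variants.items():
    print(name, X_v.shape, np.bincount(y_v))
```

In practice, resampling and PCA are usually fitted on the training split only, so that duplicated or transformed rows do not leak into the test set.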
To get the best performance, hyperparameters were tuned using RandomizedSearchCV and GridSearchCV, and 5-fold cross-validation was performed on the training set. Appropriate graphs and figures were used to aid model interpretability. Because the classes are imbalanced, accuracy alone is a biased metric for the attrition decision, so we evaluated each model on accuracy, precision, recall and F1 score.
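The tuning procedure described above could look roughly like the sketch below, which runs a coarse randomized search followed by a narrower grid search, both with 5-fold cross-validation. The parameter ranges, the choice of random forest as the tuned model, the F1 scoring and the synthetic data are assumptions for illustration, not the project's actual settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic, imbalanced stand-in for the training set (NOT the attrition data).
X_train, y_train = make_classification(
    n_samples=200, n_features=10, weights=[0.8, 0.2], random_state=0
)

# Hypothetical search space for the coarse randomized search.
param_space = {
    "n_estimators": [25, 50],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 3],
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_space,
    n_iter=4,        # sample only 4 of the 12 possible combinations
    cv=5,            # 5-fold cross-validation on the training set
    scoring="f1",    # F1 is less misleading than accuracy on imbalanced labels
    random_state=0,
)
random_search.fit(X_train, y_train)

# Refine one parameter exhaustively around the best randomized result.
best = random_search.best_params_
param_grid = {
    "n_estimators": [best["n_estimators"]],
    "max_depth": [best["max_depth"]],
    "min_samples_leaf": [1, 2, 3],
}
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    cv=5,
    scoring="f1",
)
grid_search.fit(X_train, y_train)

print("Randomized search best:", random_search.best_params_)
print("Grid search best:", grid_search.best_params_)
print("Cross-validated F1:", round(grid_search.best_score_, 3))
```

A randomized search is cheap when the space is large, while a grid search is exhaustive over a small space, so combining them in this order is a common pattern.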
We used the IBM Employee Attrition dataset from Kaggle. It contains 35 columns and 1470 rows and has a mix of numerical and categorical features. A sample row is shown below.
The figure below shows feature importance for the random forest model trained with oversampling. The most important features were MonthlyIncome, followed by OverTime and Age, while the least important were PerformanceRating, Gender and BusinessTravel.
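A ranking like the one described above can be read from a fitted random forest through its `feature_importances_` attribute. The sketch below uses a small synthetic table whose label is deliberately driven by a single column, so its ranking says nothing about the real dataset; the data, the three-column selection and the labeling rule are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in rows; only the column names are borrowed from the dataset.
rng = np.random.default_rng(0)
n_rows = 400
X = pd.DataFrame(
    {
        "MonthlyIncome": rng.normal(6500, 4700, n_rows),
        "Age": rng.integers(18, 61, n_rows),
        "Gender": rng.integers(0, 2, n_rows),
    }
)
# Artificial label that depends only on MonthlyIncome (assumption for the demo).
y = (X["MonthlyIncome"] < 4000).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; sort them from most to least important.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor high-cardinality numerical features, so they are best treated as a rough guide rather than a definitive ranking.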
The best performance was obtained by the Random Forest model with PCA and oversampling: accuracy of 99.2%, precision of 98.6%, recall of 99.8% and F1 score of 99.2%.
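As a quick consistency check on the figures above, the F1 score is the harmonic mean of precision and recall, so it can be recomputed from the two reported values. The function name `f1_from` is a hypothetical helper introduced for this sketch.

```python
def f1_from(precision, recall):
    """F1 score as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for the best model.
f1 = f1_from(0.986, 0.998)
print(f"F1 = {f1:.1%}")  # prints "F1 = 99.2%"
```

The recomputed value agrees with the reported F1 score of 99.2%.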
The Jupyter notebook can be run on Google Colab or locally using Anaconda Navigator.
Steps to run using Google Colab:
- Upload the dataset
- Click on Runtime -> Run all / Restart and Run all