Employee Attrition Prediction Using Machine Learning

This project was created as a part of CSE343, Machine Learning Course at IIIT Delhi.

Group Members

Introduction

Employee attrition refers to an employee’s voluntary or involuntary departure from a workforce. Organizations spend considerable resources hiring and training talented employees, and every employee is critical to a company’s success. Our goal was to predict employee attrition and identify the factors that contribute to an employee leaving. We trained various classification models on our dataset and assessed their performance using accuracy, precision, recall and F1 score. We also analyzed the dataset to identify the key factors behind attrition. The project is intended to help organizations gain insight into what drives attrition and thus improve their retention rate.

Methodology

Machine Learning Models

We trained and evaluated 9 supervised machine learning classification models:

Logistic Regression
Naive Bayes
Decision Tree
Random Forest
AdaBoost
Support Vector Machine
Linear Discriminant Analysis
Multilayer Perceptron
K-Nearest Neighbors
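
As a rough sketch of how the nine model families listed above could be set up, the snippet below instantiates them with scikit-learn and scores each with 5-fold cross-validation. The scikit-learn classes, the `GaussianNB` variant of Naive Bayes, the hyperparameter values, and the synthetic data are all assumptions made for illustration; they are not the project's actual configuration.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data (assumption): 300 rows, minority class ~16%.
X, y = make_classification(
    n_samples=300, n_features=10, weights=[0.84], random_state=0
)

# One estimator per model family named in the list above.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "Support Vector Machine": SVC(),
    "Linear Discriminant Analysis": LinearDiscriminantAnalysis(),
    "Multilayer Perceptron": MLPClassifier(max_iter=500, random_state=0),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
}

# Mean F1 score over 5 stratified folds for each model.
scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(model, X, y, cv=5, scoring="f1").mean()
    print(f"{name:<30}{scores[name]:.3f}")
```

Keeping the estimators in a dictionary makes it easy to run the same evaluation loop over every dataset variant.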

Datasets

We trained our models on 6 different versions of the dataset:

Imbalanced
Undersampled
Oversampled
PCA
Undersampling With PCA
Oversampling With PCA
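
The README does not say which resampling method was used, so the following is only a minimal sketch of the idea behind the undersampled and oversampled variants, using plain random resampling on a toy list of labels. The function names `random_undersample` and `random_oversample` and the toy data are hypothetical; the project may have used a different technique or library.

```python
import random
from collections import Counter


def _group_by_class(rows, labels):
    """Collect rows into a dict keyed by class label."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return groups


def random_undersample(rows, labels, seed=0):
    """Randomly drop rows so every class has as many rows as the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, labels)
    target = min(len(group) for group in groups.values())
    out_rows, out_labels = [], []
    for label, group in groups.items():
        for row in rng.sample(group, target):
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


def random_oversample(rows, labels, seed=0):
    """Randomly duplicate rows so every class has as many rows as the largest class."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, labels)
    target = max(len(group) for group in groups.values())
    out_rows, out_labels = [], []
    for label, group in groups.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Toy imbalanced data (made up): 8 employees stayed, 2 left.
rows = list(range(10))
labels = ["No"] * 8 + ["Yes"] * 2

under_rows, under_labels = random_undersample(rows, labels)
over_rows, over_labels = random_oversample(rows, labels)

print("original:    ", Counter(labels))
print("undersampled:", Counter(under_labels))
print("oversampled: ", Counter(over_labels))
```

Undersampling discards majority-class rows, while oversampling repeats minority-class rows; both leave the classes balanced.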

Further, to get the best performance, hyperparameter tuning was carried out using RandomizedSearchCV and GridSearchCV. K-fold cross-validation with 5 folds was also performed on the training set. For model interpretability, we used appropriate graphs and figures. Because the classes are imbalanced, accuracy alone is a biased metric for the attrition decision, so we evaluated each model on all of the following classification metrics: accuracy, precision, recall and F1 score.
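
To illustrate why accuracy alone can mislead on imbalanced data, here is a small sketch that computes the four metrics from scratch. The helper `classification_metrics` and the 100-employee example are hypothetical and use made-up numbers, not results from this project.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up example: 100 employees, 16 of whom leave (label 1).
y_true = [1] * 16 + [0] * 84

# A useless model that predicts "stays" for everyone.
always_stay = [0] * 100
m_always_stay = classification_metrics(y_true, always_stay)

# A model that catches 12 of the 16 leavers and raises 6 false alarms.
better = [1] * 12 + [0] * 4 + [1] * 6 + [0] * 78
m_better = classification_metrics(y_true, better)

print(m_always_stay)
print(m_better)
```

The always-stay model reaches 84% accuracy while never identifying a single leaver, which its recall and F1 score of zero make visible.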

Dataset

We used the IBM Employee Attrition dataset from Kaggle. It contains 35 columns and 1470 rows and has a mix of numerical and categorical features. A sample row is shown below.

Results

The figure below shows feature importance for the random forest trained with oversampling. The most important features were MonthlyIncome, followed by OverTime and Age, while the least important were PerformanceRating, Gender and BusinessTravel.
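
A ranking like this can be produced along the following lines. This is a sketch only: it assumes scikit-learn's impurity-based `feature_importances_` attribute was the source of the importances, which the README does not state, and it uses synthetic data with placeholder feature names rather than the real columns.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with placeholder names (assumption); the real project used the IBM dataset.
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Pair each feature with its importance and sort from most to least important.
ranked = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name:<12}{importance:.3f}")
```

Impurity-based importances sum to 1 across features, so each value can be read as a share of the forest's total impurity reduction.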

Best Performing Model

The best performance was obtained with the Random Forest model with PCA and oversampling: an accuracy of 99.2%, precision of 98.6%, recall of 99.8% and F1 score of 99.2%.

Instructions to run

The Jupyter Notebook can be run using Google Colab or locally using Anaconda Navigator.

Steps to run using Google Colab

Upload the dataset
Click on Runtime -> Run all / Restart and Run all


aastha985/Employee_Attrition_Prediction
