Logistic Regression for Absenteeism Prediction

This case study was provided and developed during the "Data Science Course 2023: Complete Data Science Bootcamp". The aim of the project is to predict whether an employee of a company will be excessively absent from work, based on 11 features across 700 unique entries. The original dataset and the descriptions of its features are provided in the Absenteeism_data.csv and AbsenteeismFeatures.pdf files respectively. The whole course can be found here: https://www.udemy.com/course/the-data-science-course-complete-data-science-bootcamp/

The approach consists of two main parts. The first is the preprocessing phase (Absenteeism-Preprocessing.ipynb file), where targeted operations validate, clean, prepare and save our source data for further analysis. The second is the model development (Absenteeism-Logistic-Regression.ipynb file), where, after some more data manipulation, a Logistic Regression algorithm is applied to produce the desired estimations, organized in a way that can give meaningful insights in the future.

In the preprocessing phase, we start by taking a first look at our dataset, checking for missing values and identifying some quick but useful information. Then we go deeper into manipulating the columns (features) of our dataset. More precisely: 1) we drop the 'ID' feature since it carries no predictive meaning, 2) we inspect the 'Reasons for Absence' column in order to group the 28 different reasons into 4 separate categories (details explained in the .pdf file), a step that will significantly help us avoid multicollinearity issues in our regression later on, 3) we change the type of the 'Date' column and extract the 'Month' and the 'Day of the Week' values, since they will be more useful in our regression analysis, and 4) we group the 4 different levels of education (high school, graduate, postgraduate and masters/doctors) into two main categories. Finally, we save a copy of our preprocessed dataset in .csv format in the same directory as the notebooks we are working on.
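The four steps above can be sketched with pandas as follows. This is a minimal illustration on made-up rows, not the notebook's actual code: the column names, the boundaries of the four reason groups, the date format and the numeric education codes are all assumptions chosen for the example.

```python
import pandas as pd

# Made-up sample rows; column names, codes and date format are assumptions.
raw = pd.DataFrame({
    "ID": [11, 36, 3, 7],
    "Reason for Absence": [26, 0, 23, 7],
    "Date": ["07/07/2015", "14/07/2015", "15/07/2015", "16/07/2015"],
    "Education": [1, 1, 3, 2],
})

# Hypothetical grouping of the 28 reason codes into 4 categories.
# Code 0 (no reason given) gets no column and acts as the baseline,
# which avoids perfectly collinear dummy variables.
REASON_GROUPS = {
    "Reason_1": list(range(1, 15)),
    "Reason_2": list(range(15, 18)),
    "Reason_3": list(range(18, 22)),
    "Reason_4": list(range(22, 29)),
}


def preprocess(frame):
    # 1) Drop the identifier column.
    result = frame.drop(columns=["ID"])

    # 2) Replace the reason code with one 0/1 column per group.
    for name, codes in REASON_GROUPS.items():
        result[name] = result["Reason for Absence"].isin(codes).astype(int)
    result = result.drop(columns=["Reason for Absence"])

    # 3) Parse the date, then keep only the month and the weekday (Monday = 0).
    dates = pd.to_datetime(result["Date"], format="%d/%m/%Y")
    result["Month"] = dates.dt.month
    result["Day of the Week"] = dates.dt.weekday
    result = result.drop(columns=["Date"])

    # 4) Collapse education into two categories: code 1 versus anything higher.
    result["Education"] = (result["Education"] > 1).astype(int)
    return result


clean = preprocess(raw)
print(clean)
```

Calling clean.to_csv with a file path and index=False would then store the result next to the notebooks.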

Now comes the most interesting part, the development of the regression. Before we reach that step, we need to perform some final preprocessing procedures: separating the inputs from the target column, standardizing the inputs so that we bring them to a common scale without distorting the differences in the range of their values, and lastly shuffling and splitting our dataset into train and test subsets. Next comes the training of our model, which yields useful information such as the coefficients and the intercept of the Logistic Regression as well as the odds ratio of each feature, stored in a summary table ready for later analysis. In the end, after testing the model and presenting the predicted probabilities of each outcome (1: the employee will be excessively absent in the future, 0: the employee will not), we save the model and the scaler (standardization) in an appropriate format so that we can reuse them in the future.
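The modelling steps described above might look roughly like the scikit-learn sketch below. It runs on synthetic data, and the feature names, the 80/20 split, the random seeds and the file name are assumptions made for illustration. Note that the sketch fits the scaler on the training subset only, which is a common variation on the order described above.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: names and values are made up for illustration.
rng = np.random.default_rng(0)
features = ["Transportation Expense", "Age", "Children"]
inputs = pd.DataFrame(rng.normal(size=(200, 3)), columns=features)
noise = rng.normal(scale=0.5, size=200)
targets = (
    inputs["Transportation Expense"] + 0.5 * inputs["Children"] + noise > 0
).astype(int)

# Shuffle and split into train and test subsets (assumed 80/20).
x_train, x_test, y_train, y_test = train_test_split(
    inputs, targets, train_size=0.8, shuffle=True, random_state=20
)

# Standardize the inputs, then train the logistic regression.
scaler = StandardScaler().fit(x_train)
model = LogisticRegression().fit(scaler.transform(x_train), y_train)

# Summary table: intercept and coefficients with their exponentials.
# For the intercept row, the exponential is the baseline odds rather
# than an odds ratio.
summary = pd.DataFrame({
    "Feature": ["Intercept"] + features,
    "Coefficient": np.concatenate([model.intercept_, model.coef_[0]]),
})
summary["Odds ratio"] = np.exp(summary["Coefficient"])
summary = summary.sort_values("Odds ratio", ascending=False).reset_index(drop=True)

# Evaluate on the test subset and get the probability of outcome 1.
test_accuracy = model.score(scaler.transform(x_test), y_test)
probabilities = model.predict_proba(scaler.transform(x_test))[:, 1]

# Save the model and the scaler, then load them back for reuse.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "model_and_scaler.pickle")  # hypothetical name
    with open(path, "wb") as handle:
        pickle.dump({"model": model, "scaler": scaler}, handle)
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

print(summary)
```

An odds ratio close to 1 (a coefficient close to 0) indicates a feature that barely changes the odds of excessive absence, so the sorted table makes such features easy to spot.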

Results - (Future Update)

The results will be further analyzed and visualized in Tableau in a future update, which will also cover the deployment of the project. Meanwhile, the model's score of 0.742 suggests there is room for further experimentation; alternative approaches such as Linear Regression or even Neural Networks might yield better predictions.
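If the 0.742 figure is the value returned by the score method of scikit-learn's LogisticRegression (an assumption, since the source does not say how it was computed), it is the mean accuracy: the fraction of observations whose predicted class matches the actual one. A minimal sketch with made-up labels:

```python
import numpy as np


def accuracy(predicted, actual):
    """Fraction of predictions that match the actual labels."""
    predicted = np.asarray(predicted)
    actual = np.asarray(actual)
    return float(np.mean(predicted == actual))


# Made-up labels: 6 of the 8 predictions are correct.
predicted = [1, 0, 1, 1, 0, 0, 1, 0]
actual = [1, 0, 0, 1, 0, 1, 1, 0]
score = accuracy(predicted, actual)
print(score)
```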


Update: Deployment-Integration

This update delivers the deployment of the study, with predictions on a new dataset of 40 employees described by the same 11 features (Absenteeism_new_data.csv file). Everything is included in a single Jupyter Notebook (Absenteeism-Deployment.ipynb file), accompanied by a final version of both the preprocessing and the regression parts in a Python module (absenteeism_module.py file). Furthermore, the results are analyzed in Tableau, providing a dashboard (Absenteeism Dashboard.png file) with useful insights into the relationships between 4 features (age, reasons for absence, transportation expense and number of children) and the predicted probability of an employee being excessively absent from work in the future.

In detail, having incorporated the first two Jupyter Notebooks of the project (Absenteeism-Preprocessing.ipynb and Absenteeism-Logistic-Regression.ipynb) into one Python module, we just need to store this file and the deployment Notebook in the same folder. When we run the Notebook, the first two lines of code import the whole Python module and display the fresh dataset (without targets). The next three lines call the methods we need from the Python module we created, essentially the two commands for preprocessing the data and displaying the output of the model. Finally, the command in the sixth line exports those outputs, including the predicted probability of each employee being excessively absent, to the Absenteeism_predictions.csv file.
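The deployment flow described above could be organized along the lines of the sketch below. The class name AbsenteeismModel, its method names and the reduced feature list are hypothetical and are not taken from absenteeism_module.py. The model and scaler are trained inline on synthetic data so that the example runs on its own, whereas the real module would load the saved ones.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Illustrative subset of features; the real project uses more.
FEATURES = ["Transportation Expense", "Age", "Children"]


class AbsenteeismModel:
    """Hypothetical wrapper that bundles preprocessing and prediction."""

    def __init__(self, model, scaler):
        self.model = model
        self.scaler = scaler
        self.data = None

    def load_and_clean_data(self, frame):
        # Stand-in for the full preprocessing: keep only the model inputs.
        self.data = frame[FEATURES].copy()
        return self.data

    def predicted_outputs(self):
        # Scale the inputs, then append probability and predicted class.
        scaled = self.scaler.transform(self.data)
        outputs = self.data.copy()
        outputs["Probability"] = self.model.predict_proba(scaled)[:, 1]
        outputs["Prediction"] = self.model.predict(scaled)
        return outputs


# Synthetic training data so the sketch is self-contained.
rng = np.random.default_rng(1)
train = pd.DataFrame(rng.normal(size=(100, 3)), columns=FEATURES)
labels = (train["Age"] + rng.normal(scale=0.5, size=100) > 0).astype(int)
scaler = StandardScaler().fit(train)
model = LogisticRegression().fit(scaler.transform(train), labels)

# 40 made-up new entries without targets, mirroring the new-data file.
new_data = pd.DataFrame(rng.normal(size=(40, 3)), columns=FEATURES)
new_data.insert(0, "ID", list(range(1, 41)))

wrapper = AbsenteeismModel(model, scaler)
wrapper.load_and_clean_data(new_data)
predictions = wrapper.predicted_outputs()

# Passing a path such as "Absenteeism_predictions.csv" would write a file;
# without a path, to_csv returns the CSV text.
csv_text = predictions.to_csv(index=False)
print(predictions.head())
```

Keeping the scaler inside the wrapper ensures that new data is standardized with the same parameters that were used during training.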

This file can now be imported into Tableau, where it is analyzed to produce a dashboard with 3 visualizations. More precisely, we plot the predicted probabilities of those 40 employees being excessively absent from work against: 1) their transportation expenses along with the number of children they have, 2) their age and 3) their declared reason for absence in the past. From the graphs, we can conclude the following:

  1. there is a loose positive correlation between transportation expenses and the probability of excessive absence; people with no children do not exhibit a high probability of excessive absence and generally spend little on transportation, while people with 1 or 2 children do not usually spend more than $240 per month on transportation,
  2. most of the individuals in our dataset are 40 years old or younger, but the older ones mostly show a higher probability of being excessively absent, and
  3. the expected probability of an individual being excessively absent because of a reason from the first group (serious reasons) is higher than 50%, while the opposite holds for the fourth group (light reasons); we cannot derive any meaningful insights for the second and the third group, as none of our observations has been away from work because of a reason in the second group, and only very few, symmetrically distributed observations have specified one of the reasons in the third group.
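The relationships read off the dashboard can also be checked numerically. The sketch below uses pandas on a handful of made-up rows; the column names and values are assumptions and do not come from Absenteeism_predictions.csv.

```python
import pandas as pd

# Made-up predictions for illustration only.
predictions = pd.DataFrame({
    "Transportation Expense": [118, 179, 225, 260, 291, 361],
    "Children": [0, 0, 1, 2, 1, 2],
    "Reason group": [4, 4, 1, 1, 4, 1],
    "Probability": [0.10, 0.20, 0.65, 0.70, 0.35, 0.90],
})

# Pearson correlation between transportation expense and probability.
expense_corr = predictions["Transportation Expense"].corr(
    predictions["Probability"]
)

# Average predicted probability per reason group and per number of children.
mean_by_reason = predictions.groupby("Reason group")["Probability"].mean()
mean_by_children = predictions.groupby("Children")["Probability"].mean()

print(expense_corr)
print(mean_by_reason)
print(mean_by_children)
```

With only 40 employees in the real file, such summaries are best read as rough indications rather than firm conclusions.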

A screenshot of the dashboard can be found in the .png file.

The interactive dashboard can be accessed through the .twbx file or through the link below:

https://public.tableau.com/app/profile/konstantinos.karras/viz/AbsenteeismDashboard_16796600243480/AbsenteeismDashboard