Analysing and applying ML methods to a salary database

The purpose of this project is to apply data transformations and machine learning techniques to predict the expected salary of a person given a few descriptors such as their job title, education and experience.

Project materials

The main technical narrative, walkthroughs, and experiments live in the Jupyter notebook: HRdataNotebook.ipynb.
A business-focused slide deck summarizing value, robustness, and generalization guidance is available in HR_Salary_Analysis_Briefing.pdf.

The Method

The analysis is performed through the methods of a utility class, SupervisedDataframe, which was created to manage the pandas data structures for a typical EDA and fast-modeling job. SupervisedDataframe has tools that apply data transformations and feature engineering to the training and validation sets together. The two sets are nevertheless kept properly isolated when transformations that depend on the target or on training statistics (such as grouped statistics) are applied. Its main capabilities are:

Acquiring data from csv files and segmenting the dataframe into training and validation sets
Plotting variables against specific variables
Checking for null values and optionally deleting rows with null values
Creating grouped statistics (mean, median, var) based on the categorical columns of the training set
One hot encoding and scaling
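The isolation between training and validation data during grouped statistics can be illustrated with plain pandas. The following is a minimal sketch, not the SupervisedDataframe implementation: the helper name add_group_stats, the column names jobType and salary, and the toy values are all assumptions made for illustration. The statistics are computed on the training rows only and then joined onto both sets, so no validation information leaks into the features.

```python
import pandas as pd


def add_group_stats(train, valid, group_col, target_col):
    """Hypothetical helper: attach per-category target statistics.

    The statistics are computed from `train` only and then merged
    onto both frames, keeping the validation set isolated.
    """
    stats = (
        train.groupby(group_col)[target_col]
        .agg(["mean", "median", "var"])
        .add_prefix(f"{group_col}_{target_col}_")
        .reset_index()
    )
    train_out = train.merge(stats, on=group_col, how="left")
    valid_out = valid.merge(stats, on=group_col, how="left")
    return train_out, valid_out


# Toy data with assumed column names.
train = pd.DataFrame(
    {"jobType": ["CEO", "CEO", "JANITOR", "JANITOR"], "salary": [200, 220, 60, 80]}
)
valid = pd.DataFrame({"jobType": ["CEO", "JANITOR", "INTERN"]})

train_gs, valid_gs = add_group_stats(train, valid, "jobType", "salary")
print(valid_gs)
```

A category that appears only in the validation set (INTERN in this toy example) receives missing values for the grouped statistics, which then have to be imputed or handled by the model.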

Usage notes before model training

To keep train/test alignment intact while preparing your data:

Run check_nulls(erase=True) to drop only training rows that contain missing feature values (the test set is never truncated). The method prints the per-column NA counts for the train and test splits to help you decide whether deletion or imputation is appropriate.
Run to_one_hot([...]) after defining your categorical feature list. One-hot columns are fit on the training categories and reindexed so the test set always has the same columns (unseen test categories are safely ignored instead of shifting columns).
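The behaviour described in these two notes can be sketched with plain pandas calls. This is an illustration under stated assumptions, not the code behind check_nulls or to_one_hot: the column names degree and years and the toy values are invented for the example. It reports NA counts per split, drops incomplete rows from the training set only, and reindexes the one-hot test columns to the training layout.

```python
import pandas as pd

# Toy splits with assumed column names.
train = pd.DataFrame({"degree": ["BSc", "MSc", None, "PhD"], "years": [1, 5, 3, 10]})
test = pd.DataFrame({"degree": ["MSc", "MBA", None], "years": [2, 7, 4]})

# Per-column NA counts for both splits, side by side.
na_counts = pd.DataFrame({"train": train.isna().sum(), "test": test.isna().sum()})
print(na_counts)

# Drop incomplete rows from the training set only; the test set keeps every row.
train_clean = train.dropna().reset_index(drop=True)

# Fit the one-hot layout on the training categories.
train_oh = pd.get_dummies(train_clean, columns=["degree"], dtype=int)

# Encode the test set, then force it into the training column layout:
# categories unseen in training (MBA) are dropped, and training
# categories missing from the test set are added as all-zero columns.
test_oh = pd.get_dummies(test, columns=["degree"], dtype=int).reindex(
    columns=train_oh.columns, fill_value=0
)

print(test_oh)
```

In this sketch a test row with an unseen or missing category ends up with zeros in every one-hot column, so the column positions never shift between the two splits.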

The Data

The data we will work on in this project are:

1,000,000 training examples, complete with salary information
1,000,000 test examples, with no salary information attached

While the dataset examined here is a simplified model of a real-world job market, the procedures implemented here should transfer readily to more complex scenarios with a richer variety of categories.

As we will see when looking at the properties of each feature, at first glance the data seem to have been machine-generated from a uniform distribution. We therefore expect a lot of noise, which will prevent us from predicting the target with very high accuracy. Nevertheless, the data do show some interesting patterns that will eventually drive our hypotheses in building an appropriate model.
