- Created a tool that estimates data science salaries (MAE ~ $11K) to help data scientists and data analysts negotiate their income when they get a job offer.
- Scraped over 1000 job descriptions from Glassdoor using Python and Selenium.
- Engineered features from the text of each job description to quantify the value companies put on Python, Excel, AWS, and Spark, using different encoding techniques.
- Optimized Linear, Lasso, and Random Forest regressors using GridSearchCV to find the best model; the Random Forest Regressor performed best and was chosen as the final model.
- Built a client-facing API using Flask.
Python Version: 3.7
Packages: pandas, numpy, sklearn, matplotlib, seaborn, selenium, flask, json, pickle
For Web Framework Requirements: pip install -r requirements.txt
Scraped over 1000 job postings from glassdoor.com. For each job, I collected the following fields (a minimal sketch of the Selenium scraping loop follows the list):
- Job title
- Salary Estimate
- Job Description
- Rating
- Company
- Location
- Company Headquarters
- Company Size
- Company Founded Date
- Type of Ownership
- Industry
- Sector
- Revenue
- Competitors
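The sketch below shows the general shape of the Selenium loop, not the exact scraper: the CSS selectors and the search URL are illustrative placeholders, since Glassdoor's page structure changes over time and the real selectors have to be found by inspecting the page.

```python
# Rough sketch of the scraping loop (selectors and URL are illustrative, not Glassdoor's real markup)
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes chromedriver is available on PATH
driver.get("https://www.glassdoor.com/Job/data-scientist-jobs.htm")  # placeholder search URL

jobs = []
for card in driver.find_elements(By.CSS_SELECTOR, "li.job-listing"):  # hypothetical selector
    card.click()  # open the listing so the full description loads
    jobs.append({
        "Job Title": card.find_element(By.CSS_SELECTOR, ".job-title").text,        # hypothetical
        "Salary Estimate": card.find_element(By.CSS_SELECTOR, ".salary").text,     # hypothetical
        "Job Description": driver.find_element(By.CSS_SELECTOR, ".desc").text,     # hypothetical
        # ... remaining fields (Rating, Company, Location, Headquarters, Size, etc.)
        #     collected the same way from the listing pane
    })

pd.DataFrame(jobs).to_csv("glassdoor_jobs.csv", index=False)
```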
After scraping the data, I needed to clean it up so that it was usable for the model. I made the following changes and created new variables according to the use case (a minimal pandas sketch of these steps follows the list):
- Parsed numeric data out of salary
- Made columns for employer-provided salary, hourly wages, minimum salary, maximum salary, and average salary.
- Removed rows without salary
- Parsed rating out of company text
- Made a new column for company state
- Added a column for whether the job was located at the company’s headquarters (job location matches the headquarters location), since this can play an important role in salary prediction.
- Transformed founded date into age of company
- Made columns for whether different skills were listed in the job description:
  - Python
  - R
  - Excel
  - AWS
  - Spark
- Columns for simplified job title and seniority
- Column for description length
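A minimal pandas sketch of a few of these cleaning steps, assuming the scraped column names listed above; the "-1" missing-value marker, the regex, and the 2020 reference year are illustrative assumptions:

```python
import pandas as pd

df = pd.read_csv("glassdoor_jobs.csv")

# Drop rows without a salary estimate (assumed here to be marked "-1")
df = df[df["Salary Estimate"] != "-1"]

# Flag employer-provided salaries and hourly wages
salary = df["Salary Estimate"].str.lower()
df["hourly"] = salary.str.contains("per hour").astype(int)
df["employer_provided"] = salary.str.contains("employer provided salary").astype(int)

# Parse numeric min/max/average salary out of strings like "$85K-$120K (Glassdoor est.)"
nums = salary.str.replace(r"[^0-9\-]", "", regex=True).str.split("-", expand=True)
df["min_salary"] = pd.to_numeric(nums[0])
df["max_salary"] = pd.to_numeric(nums[1])
df["avg_salary"] = (df["min_salary"] + df["max_salary"]) / 2

# Company age, job state, and whether the posting is at headquarters
df["age"] = 2020 - df["Founded"]                       # assumes 2020 as the scrape year
df["job_state"] = df["Location"].str.split(",").str[-1].str.strip()
df["at_headquarters"] = (df["Location"] == df["Headquarters"]).astype(int)

# Binary skill flags and description length from the job description text
desc = df["Job Description"].str.lower()
for skill in ["python", "excel", "aws", "spark"]:
    df[skill] = desc.str.contains(skill).astype(int)
# "R" needs a more careful pattern (e.g. "r studio" or ", r,") to avoid matching every description
df["desc_len"] = df["Job Description"].str.len()
```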
I looked at the distributions of the data and the value counts for the various categorical variables. Below are a few highlights from the pivot tables.
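The pivot tables were along the lines of the sketch below, continuing from the cleaned DataFrame above; `job_simp` is a hypothetical name for the simplified-job-title column.

```python
# Example value counts and pivot tables from the EDA (pivot_table aggregates by mean)
print(df["job_simp"].value_counts())
print(pd.pivot_table(df, index="job_simp", values="avg_salary"))
print(pd.pivot_table(df, index="job_state", values="avg_salary")
        .sort_values("avg_salary", ascending=False).head(10))
```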
First, I transformed the categorical variables into dummy variables (one-hot encoding). I also split the data into train and test sets with a test size of 20%.
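A minimal sketch of the encoding and split, assuming the cleaned DataFrame from above with `avg_salary` as the target:

```python
from sklearn.model_selection import train_test_split

# One-hot encode the categorical columns (in practice, select only the
# model-relevant columns before calling get_dummies)
df_dum = pd.get_dummies(df)

X = df_dum.drop("avg_salary", axis=1)
y = df_dum["avg_salary"].values

# 80/20 train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```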
I tried three different models and evaluated them using Mean Absolute Error (MAE). I chose MAE because it is relatively easy to interpret and outliers aren’t particularly problematic for this type of model. The three models were:
- Multiple Linear Regression – Baseline for the model
- Lasso Regression – Because of the sparse data from the many categorical variables, I thought a regularized regression like lasso would be effective.
- Random Forest – Again, given the sparsity of the data, I thought this would be a good fit; after hyperparameter optimization using GridSearchCV it gave the best performance.
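A sketch of the model comparison, continuing from the split above: each model is scored with cross-validated (negative) MAE, and the Random Forest is tuned with GridSearchCV. The lasso alpha and the parameter grid are illustrative values, not the exact grid used.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.metrics import mean_absolute_error

# Baseline multiple linear regression
lm = LinearRegression()
print(np.mean(cross_val_score(lm, X_train, y_train, scoring="neg_mean_absolute_error", cv=3)))

# Lasso regression (alpha is illustrative; worth sweeping over a range)
lasso = Lasso(alpha=0.1)
print(np.mean(cross_val_score(lasso, X_train, y_train, scoring="neg_mean_absolute_error", cv=3)))

# Random forest tuned with GridSearchCV (illustrative grid)
rf = RandomForestRegressor(random_state=42)
params = {"n_estimators": [50, 100, 200], "max_features": ["sqrt", "log2", None]}
gs = GridSearchCV(rf, params, scoring="neg_mean_absolute_error", cv=3)
gs.fit(X_train, y_train)

# Final check of the best model on the held-out test set
pred = gs.best_estimator_.predict(X_test)
print(mean_absolute_error(y_test, pred))
```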
As a final step, I built a Flask API endpoint hosted on a local web server, following along with the TDS tutorial. The API endpoint takes in a request with a list of values from a job listing and returns an estimated salary.
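A minimal sketch of the Flask endpoint, assuming the tuned model has been pickled and that the request body carries the already-encoded feature values in training order; the file path, dict key, and route name are illustrative.

```python
import pickle
import numpy as np
from flask import Flask, request, jsonify

app = Flask(__name__)

# Load the pickled Random Forest model (path and dict key are illustrative)
with open("models/model_file.p", "rb") as f:
    model = pickle.load(f)["model"]

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON body like {"input": [v1, v2, ...]} with features in training order
    data = request.get_json()["input"]
    x = np.array(data).reshape(1, -1)
    prediction = model.predict(x)[0]
    return jsonify({"response": float(prediction)})

if __name__ == "__main__":
    app.run(debug=True)
```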