
jnunez03/KaggleVisualization


Project: Visualizing NYC Payroll Data 📈

Data can be found here. Everything shown came from my own questions and decisions. I wanted to take my own fresh approach to this data set.

  • Note 1: I will try to reproduce this in blog format and/or as a Jupyter notebook, with the code explained line by line, so stay tuned for that!

  • Note 2: If a graph is hard to read, click on it. In most cases it opens in a new tab or in a zoomed view, which makes it clearer.

  • Note 3: I use a lot of boxplots. Here is a refresher: Interpreting Boxplots.

Code can be found in .py file above!!

Breakdown of the agencies with the most employees per year.

There are a lot of different departments, and many could be grouped by context, for example "Education", and merged into one category. For instance, the colleges could be merged instead of keeping "Hostos CC, Laguardia CC, etc." as separate index rows. (I ended up using this approach for the boxplots in the "all occupations pay vs. years of experience" plots!)

Data
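A minimal sketch of the merging idea described above, using pandas. The column name "Agency Name", the employee counts, and the mapping itself are assumptions made up for illustration, not taken from the .py file in this repo.

```python
import pandas as pd

# Toy data: the column names and counts are hypothetical.
df = pd.DataFrame({
    "Agency Name": ["HOSTOS CC", "LAGUARDIA CC", "DEPT OF EDUCATION", "POLICE DEPARTMENT"],
    "employees": [10, 20, 300, 150],
})

# Hypothetical mapping from individual agencies to a broader category.
category_map = {
    "HOSTOS CC": "Colleges",
    "LAGUARDIA CC": "Colleges",
}

# Agencies without an entry in the mapping keep their own name.
df["Category"] = df["Agency Name"].map(category_map).fillna(df["Agency Name"])

# One row per merged category instead of one row per agency.
employees_by_category = df.groupby("Category")["employees"].sum()
print(employees_by_category)
```

Keeping the mapping in a plain dictionary makes it easy to extend one category at a time.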

Here is the mean Base Salary and Gross Salary and how they changed over the years.

Data

Data

  • The axes on these graphs differ slightly because Base Salary is stored as an integer and Gross Salary as a float. This could easily be fixed by converting the Gross Pay variable to an integer.

Who were the lowest paid employees and in what years? (I could have done a breakdown year by year and graphed the change.)
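The conversion mentioned above could look roughly like this in pandas. The column names "Base Salary" and "Regular Gross Paid" and the values are assumptions for illustration.

```python
import pandas as pd

# Toy data: column names and values are hypothetical.
df = pd.DataFrame({
    "Base Salary": [52000, 61000, 75000],
    "Regular Gross Paid": [50311.47, 64020.90, 74999.80],
})

# Round first so that 64020.90 becomes 64021 instead of being truncated to 64020.
df["Gross Paid (int)"] = df["Regular Gross Paid"].round().astype("int64")

print(df.dtypes)
```

Note that `astype("int64")` raises an error if the column contains missing values, so those would need to be dropped or filled first.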

Data

Who were the highest paid employees and in what years?

Data

Top Salary by Borough over the years! 2014 to 2017.

Data Data Data Data

Some Distributions

base

gross

These are the distributions yearly.

Data

With Kernel Densities.

Data

Normal Distribution Added (*The normal curve for 2017 is three graphs up; I had a problem with pixel density when rendering it.)

nested_distributions

QQ Plots

The Q-Q plot, or quantile-quantile plot, is a graphical tool to help us assess if a set of data plausibly came from some theoretical distribution such as a Normal or exponential. ... A Q-Q plot is a scatterplot created by plotting two sets of quantiles against one another.

  • This is a visual check for normality in the data.
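As a rough sketch of what a Q-Q plot measures, `scipy.stats.probplot` computes the theoretical and sample quantiles and fits a straight line through them. The samples below are synthetic stand-ins (an assumption), not the payroll data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-ins: one roughly normal sample, one right-skewed sample.
normal_sample = rng.normal(loc=60000, scale=10000, size=2000)
skewed_sample = rng.lognormal(mean=11, sigma=0.6, size=2000)

# probplot returns ((theoretical quantiles, ordered values), (slope, intercept, r)).
# r is the correlation of the Q-Q points with the fitted line: closer to 1 = straighter.
(_, _), (_, _, r_normal) = stats.probplot(normal_sample, dist="norm")
(_, _), (_, _, r_skewed) = stats.probplot(skewed_sample, dist="norm")

print(f"normal sample r = {r_normal:.4f}")
print(f"skewed sample r = {r_skewed:.4f}")
```

Passing `plot=plt` (with `import matplotlib.pyplot as plt`) to `probplot` draws the plot itself instead of only returning the numbers.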

2014 QQplot of Gross Pay

2014

2015 QQplot of Gross Pay

2015

2016 QQplot of Gross Pay

2016

2017 QQplot of Gross Pay

2017

None of these plots look like they would pass through the origin if they were straight lines. Even for data that follow a normal distribution, a Q-Q plot is almost never perfect, but the curved departure from the quantile line seen here is an indicator of skewed data with heavy tails (high kurtosis). The shapes resemble what a chi-squared or Student's t distribution would produce. In this case the bulk of the data looks roughly normal, but with a clear right skew!

There is some skewness in our Gross Paid data.

This leads me to study the other variables: Total OT Paid and Total Other Pay. Looking further, we can check whether this "other pay" is what causes the skewness. However, a Q-Q plot of Base Salary shows the same skewness, which tells me it is not due to OT paid or other pay. It also confirms the right skew that is visible in the histograms above.
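The skewness discussed above can also be put into a single number with `Series.skew()` in pandas. The salaries below are invented for illustration: a positive value means a longer right tail, and a value near zero means a symmetric sample.

```python
import pandas as pd

# Invented salaries: one symmetric sample, one with a single very high earner.
symmetric = pd.Series([30000, 40000, 50000, 60000, 70000], dtype="float64")
right_skewed = pd.Series([30000, 35000, 40000, 45000, 50000, 150000], dtype="float64")

sym_skew = symmetric.skew()
right_skew = right_skewed.skew()

print(f"symmetric sample skewness:    {sym_skew:.3f}")
print(f"right-skewed sample skewness: {right_skew:.3f}")
```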

Top 8 Average Paid Jobs?

top8avgpaidjobs

In General, how does experience relate to base pay & gross pay?

Caveat: someone could be hired in any year with experience from outside NYC! The experience here does not reflect overall experience, only experience as a registered worker in NYC. For example, take a random NYC worker who started in 2006. In 2017, we could expect that person's salary to be somewhere around $50,000 to $80,000. However, someone who comes in with more experience could get paid $120,000 and still show up in the 2017 section of the plot, because they were not registered as a worker in NYC until 2017. (I hope this explanation is clear. Still, there is a "slight trend" where, on average, more experience = more pay!)

boxplot_general

However, no strong correlation was detected. Let's see how this differs from gross pay!

This is across all occupations for workers who started from 2000 to 2017 and were paid a minimum of $20,000 annually.

spearmanr

  • The score is moderate. Spearman's rank correlation measures how monotonic the relationship between two variables is, so it is high only when one variable consistently increases or consistently decreases with the other (it can't do both in the same plot, like a sine curve).
  • The score ranges from -1 to 1. A score close to 1 (or -1) indicates a nearly perfect increasing (or decreasing) relationship. A score close to 0 means there is little or no monotonic relationship.
  • .47 tells us there is a moderate positive correlation between Gross Pay and Years of Experience. Let's graph it!
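To illustrate the points above, here is a small sketch with `scipy.stats.spearmanr` on made-up numbers (not the payroll data). A strictly increasing but non-linear relationship scores 1, while a relationship that goes down and then up scores about 0.

```python
import numpy as np
from scipy import stats

x = np.arange(1, 11)

# Strictly increasing, but far from linear: ranks still agree perfectly.
y_monotonic = x ** 3

# Symmetric "U" shape: decreases, then increases, so there is no monotonic trend.
y_u_shape = (x - 5.5) ** 2

rho_monotonic, _ = stats.spearmanr(x, y_monotonic)
rho_u_shape, _ = stats.spearmanr(x, y_u_shape)

print(f"monotonic: rho = {rho_monotonic:.2f}")
print(f"U-shaped:  rho = {rho_u_shape:.2f}")
```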

We see an obvious upward trend!

yoe_gross_plot

This differs greatly from what Base Salary showed, which suggests Gross Salary is the better indicator of Pay vs. Experience!

yoe_gross_newboxplot

Let's Analyze this Difference: Gross Pay - Base Pay

grossminusbase

  • Mean: -2636.90 (which means Base Salary was greater than Gross Salary on average)
  • Median: -1076.7
  • Standard Deviation: 11029
  • Minimum positive value: .11 (11 cents...taxes must have been brutal!) Caveat: I did this without taking care of outliers!

Our QQ plot below just shows that we have A LOT of extreme values! In statistical terms: fat tails!
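The summary statistics above could be computed along these lines. The column names "Base Salary" and "Regular Gross Paid" and the four rows are hypothetical, so the printed numbers will not match the ones listed.

```python
import pandas as pd

# Toy data: column names and values are hypothetical.
df = pd.DataFrame({
    "Base Salary": [50000.0, 60000.0, 70000.0, 80000.0],
    "Regular Gross Paid": [48000.0, 60000.11, 65000.0, 83000.0],
})

# Positive = gross pay exceeded base salary; negative = gross pay fell short.
diff = df["Regular Gross Paid"] - df["Base Salary"]

summary = {
    "mean": diff.mean(),
    "median": diff.median(),
    "std": diff.std(),
    "min_positive": diff[diff > 0].min(),  # smallest strictly positive difference
}
print(summary)
```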

grossminusbaseqqplot

Conclusion!

  • When we looked at Base Salary above, we did not see a strong correlation between Base Salary and Years of Experience; the boxplot only revealed a slight upward trend. When we looked at Gross Salary, we saw a stronger trend, reflected in a Spearman rank correlation of .47, which is a moderate score. Across all occupations, the more experience you have, the more money you tend to end up making, regardless of your base salary! We know that there are other forms of payment such as bonuses, OT paid, stocks, bonds, etc.

NEXT!

I chose to look at the "jobs with the most employees" (the most common jobs) and find the distribution of gross pay!

alljobs3 0

  • Note: clicking on the graph makes it easier to read!
  • Only 1 of the variables is indecipherable. If anyone wants to take a shot at figuring it out, please let me know!
  • I chose the axis range so that outliers are visible. Obviously, this takes away from the aesthetics and is likely unnecessary, but it is for observational purposes!
  • (In reality, the dots far outside the boxes are outliers and don't need to be shown. There is a technique to calculate outliers beforehand and remove them from the plot.)
  • You can see which jobs have employees who are paid far above the normal pay. There are also many jobs where all employees are paid within the whiskers of the boxplot (1.5 times the IQR, the Inter-Quartile Range, beyond the box) and show no outliers at all.
  • You can also see that some employees are paid well below the 25th percentile, which could be due to lack of experience and/or a host of other reasons.
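One common technique for flagging outliers before plotting is the 1.5 × IQR rule, which is also what boxplot whiskers use by default in matplotlib and seaborn. The sketch below applies it to made-up pay values; it is not necessarily the approach used in the .py file.

```python
import pandas as pd

# Made-up gross pay values (in thousands of dollars) for one job title.
pay = pd.Series([40, 42, 45, 47, 50, 52, 55, 300], dtype="float64")

q1 = pay.quantile(0.25)
q3 = pay.quantile(0.75)
iqr = q3 - q1

# Anything beyond 1.5 * IQR from the box is treated as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

is_outlier = (pay < lower) | (pay > upper)
outliers = pay[is_outlier].tolist()
kept = pay[~is_outlier]

print(f"bounds: [{lower}, {upper}]")
print(f"outliers: {outliers}")
```

If the goal is only a cleaner picture, passing `showfliers=False` to `seaborn.boxplot` or `matplotlib.pyplot.boxplot` hides the outlier dots without changing the data.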

Which variables are related? Besides the obvious ones, OT hours and OT paid are the most strongly correlated pair (makes sense)!

This is across all occupations for workers who started from 2000 to 2017 and were paid a minimum of $20,000 annually.

heatmap

Here is another way to view the heatmap: the seaborn.pairplot function shows which variables are correlated using actual scatter plots!

pairplot
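The numbers behind a correlation heatmap come from a correlation matrix. Here is a minimal sketch with `DataFrame.corr()` on synthetic data, where OT pay is deliberately generated from OT hours. The column names, the assumed rate of about 45 per OT hour, and all values are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500

# Synthetic data: OT pay is built from OT hours plus noise, the rest are independent.
ot_hours = rng.uniform(0, 400, n)
df = pd.DataFrame({
    "Base Salary": rng.normal(60000, 12000, n),
    "OT Hours": ot_hours,
    "Total OT Paid": ot_hours * 45 + rng.normal(0, 500, n),
    "Total Other Pay": rng.normal(2000, 800, n),
})

corr = df.corr()  # pairwise Pearson correlations

# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()
top_pair = pairs.idxmax()

print(corr.round(2))
print("most correlated pair:", top_pair)
```

Calling `seaborn.heatmap(corr, annot=True)` on the same matrix draws the heatmap, and `seaborn.pairplot(df)` draws the grid of scatter plots.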


Next: Analyzing Teacher Data!

  • My subset: teachers paid annually who were registered in 2017, with start dates from 2000 to 2017 (~18 years of experience down to ~1 year of experience).
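A subset like the one described above could be built roughly as follows. The column names ("Fiscal Year", "Title Description", "Pay Basis", "Agency Start Date") and the values "TEACHER" and "per Annum" are assumptions about the dataset layout, and the rows are invented.

```python
import pandas as pd

# Invented rows: column names and category values are assumptions.
df = pd.DataFrame({
    "Fiscal Year": [2017, 2017, 2017, 2016, 2017],
    "Title Description": ["TEACHER", "TEACHER", "POLICE OFFICER", "TEACHER", "TEACHER"],
    "Pay Basis": ["per Annum", "per Annum", "per Annum", "per Annum", "per Hour"],
    "Agency Start Date": ["2000-09-05", "2017-02-01", "2010-07-01", "2005-09-06", "2012-09-04"],
    "Base Salary": [84000.0, 51000.0, 78000.0, 70000.0, 33.18],
})

df["Agency Start Date"] = pd.to_datetime(df["Agency Start Date"])
start_year = df["Agency Start Date"].dt.year

# Teachers on an annual pay basis, in the 2017 records, who started between 2000 and 2017.
teachers_2017 = df[
    (df["Fiscal Year"] == 2017)
    & (df["Title Description"] == "TEACHER")
    & (df["Pay Basis"] == "per Annum")
    & start_year.between(2000, 2017)
].copy()

# A 2017 start counts as ~1 year of experience, a 2000 start as ~18.
teachers_2017["Years of Experience"] = 2017 - teachers_2017["Agency Start Date"].dt.year + 1

print(teachers_2017[["Agency Start Date", "Base Salary", "Years of Experience"]])
```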

Distribution

Let's see how the pay is distributed based on years of experience! aabase aagros

  • One thing to note: in 2015 there was a huge increase in teachers who started working, with 3,430. For 2017, however, the data are incomplete (we only have 149 teachers who started in 2017). That explains the discrepancy: the mean base salary drops drastically, but with more data it would probably look closer to 2015 and 2016.
  • Shame on me for not plotting this the other way around! The downward trend only shows that as experience decreases, so does pay.
  • There was not much 2017 data, which is why the value is so low. The outliers make it even lower.

numnum

How do Variables relate to each other?

There seems to be no relationship between total other pay and how much you make!

basewtotalotherpay2016 grosswtotalotherpay2016

In 2016, there does seem to be a relationship between base salary and gross salary.

basewithgross2016

Base Vs. Gross Pay For 2017 (n=149, smaller than 2016 sample size)

basewithgross2017

Teachers in their first year of work don't seem to make much!

Gross Pay

teachersfirstyear

Base Pay (Large discrepancy with Gross Pay!)

2017basesalary_histogram

Teachers with ~18 years experience make on average $84,482 (USD)!

teacherwith17yearexp11

~22 Years of Experience? (Average is $90,263)!

teachers1995_exp

To top it off, here are Boxplot Pay distributions based on the "YEAR" a teacher started working for registered teachers in 2017.

This is Base Pay BTW! (Which differs largely from Gross Pay!)

teacherboxplot

Gross Pay (I did not try to fix it, so you can see the outliers and the huge differences from Base Pay)

teachergrosshisto

What I Learned?

  • My intuition led me to expect something different from what the data showed.
  • I learned so much more Python that I feel like a connoisseur.
  • Visualizing data definitely gives the overall picture of what is hidden in just words and numbers. Sure, you could do this in Excel, but with nowhere near the caliber and flexibility that Python offers.
  • Seeing negative salaries and very small salaries was shocking. It didn't throw me off, because I knew that some people are hired as poll workers or seasonally and may have salaries under 10,000. So most of my analysis used a subset of the overall data that leaves out the "seasonal" or temporary workers, to get a better picture of true annual salaries.

What I would love to do?

  • I want to implement some machine learning just to play with it. COMING SOON!
  • Also, I would have liked to add a map of the state of New York, color-coded by county according to the number of employees or salary. Note that there were counties outside of NY in this dataset, because some people work remotely.
Data from Kaggle

About

Created by: Justin Nuñez
