Use SQL to analyze data on employees of a company from the 1980s and 1990s using six CSV files.
The first step before performing any queries was to sketch out an ERD of the tables. I used QuickDBD to quickly see how the tables corresponding to each CSV file would relate to each other.
Using the ERD, I created the schema for the tables in my database, specifying data types, and setting primary and foreign keys as appropriate. I then imported the data into these tables.
There were nine main questions I wanted to answer, the first eight being done through PostgreSQL.
For my first query, I wanted to find the employee number, last name, first name, gender, and salary of every employee. The information I needed was contained on two tables, employees and salaries, so I selected the appropriate columns and then joined those tables with an inner join on emp_no, which both tables had.
For the second query, I wanted to list all employees who were hired in 1986. I selected the first_name, last_name, and hire_date columns from the employees table and then used the WHERE clause to set the date of hire condition.
The third query looked into managers, listing the department number and name, and the managers’ name, employee number, and employment dates. The relevant information was stored in four separate tables this time, which required three inner joins. As before, I selected the relevant columns and then joined the dept_manager table with the employees table, and the employees table with the dept_emp table, both on emp_no. I joined the dept_emp table with the departments table on dept_no.
For the fourth query, I wanted to list the departments of every employee along with their full name and employee number. The information I needed was stored on two tables, employees and departments. However, the tables did not share any primary/foreign keys. I therefore, had to use a third table (dept_emp) to join the other to together. I first joined employees to dept_emp on the shared emp_no, and then dept_emp to departments on the shared dept_no. Through these joins, I was able to display the department name of each employee along with their name and employee number.
For the fifth query, I wanted to list all employees whose first name is "Hercules" and last names begin with "B." To do this, I set two conditions joined by the AND clause, and made use of the SQL wildcard character % to search for last names beginning with “B”.
The sixth query was much like the fourth, except I wanted to list information only for employees in the Sales department. Instead of repeating the fourth query, I created a view of it, named emp_info, and queried that, setting the condition of the department name being “Sales”.
The seventh query was like the sixth, but looked into employees in the Development department as well. Here, I used the same query as in the sixth, but for the condition, I used a subquery to search for the department name being “Sales” or “Development”.
For the eighth query, I wanted to count how many employees shared the same last name. To do this, I selected the last_name column from the employees table and then also used the COUNT() function on the same column. I then grouped the data by last names and ordered the data by the count of last names in descending order to see the last names that were most shared among employees.
Finally, I wanted to determine the average salary for each position in the company and graph my findings. To accomplish this, I made use of SQL Alchemy to import the database into Pandas. After connecting to the database and creating an engine, I imported emp_no and salary from the salaries table and emp_no and title from the titles table. I then performed an inner join of the two tables on the shared emp_no before grouping the data by title. I then took the mean of the salaries per title and rounded by two decimal points. I then plot the data in a bar chart to find that there was not much variance in salary amounts between each employee title.