-
Notifications
You must be signed in to change notification settings - Fork 11
2b. SQL Joins
When creating a database, it is important to think about how data will be stored. This is known as normalization.
There are essentially three ideas that are aimed at database normalization:
- Are the tables storing logical groupings of the data?
- Can I make changes in a single location, rather than in many tables for the same information?
- Can I access and manipulate data quickly and efficiently?
JOINs are useful for allowing us to pull data from multiple tables.
In order to write a JOIN
, we need a SELECT
and FROM
clause just like any other query. You will notice, we have introduced two new parts to our regular queries: JOIN
and ON
.
-
JOIN
introduces the second table from which you would like to pull data -
ON
tells you how you would like to merge the tables in theFROM
andJOIN
statements together.
JOINs are like a second FROM
clause. It identifies the table where the data that we want to join lives. Finally, we need to specify the relationship between the two tables. This specifies a logical statement to combine the table in from the join table. These are referred to as INNER JOINS
. To learn about rows that DO NOT appear in both tables we need to use LEFT JOIN
and RIGHT JOIN
.
SELECT orders.*
FROM orders
JOIN accounts
ON orders.account_id = accounts.id;
We are able to pull data from two tables:
- orders
- accounts
Above, we are only pulling data from the orders table since in the SELECT
statement we only reference columns from the orders table.
The ON
statement holds the two columns that get linked across the two tables.
If we wanted to only pull individual elements from either the orders or accounts table, we can do this by using the exact same information in the FROM and ON statements. However, in your SELECT
statement, you will need to know how to specify tables and columns in the SELECT
statement:
- The table name is always before the period, i.e
tablename
. - The column you want from that table is always after the period, i.e
tablename.column
.
For example, if we want to pull only the account name and the dates, but none of the other columns, we can do this with the following query:
SELECT accounts.name, orders.occurred_at
FROM orders
JOIN accounts
ON orders.account_id = accounts.id;
This query only pulls two columns, not all the information in these two tables.
Alternatively, the below query pulls all the columns from both the accounts and orders table.
SELECT *
FROM orders
JOIN accounts
ON orders.account_id = accounts.id;
And the first query pulls all the information from only the orders table:
SELECT orders.*
FROM orders
JOIN accounts
ON orders.account_id = accounts.id;
Joining tables allows you access to each of the tables in the SELECT statement through the table name, and the columns will always follow a .
after the table name.
Pull all the data from the accounts table, and all the data from the orders table.
SELECT orders.*, accounts.*
FROM accounts
JOIN orders
ON accounts.id = orders.account_id;
Pull standard_qty, gloss_qty, and poster_qty from the orders table, and the website and the primary_poc from the accounts table.
SELECT orders.standard_qty, orders.gloss_qty, orders.poster_qty, accounts.website, accounts.primary_poc
FROM accounts
JOIN orders
ON orders.account_id = accounts.id;
If we wanted to join all three of these tables, we could use the same logic. The code below pulls all of the data from all of the joined tables.
SELECT *
FROM web_events
JOIN accounts
ON web_events.account_id = accounts.id
JOIN orders
ON accounts.id = orders.account_id
JOIN
holds a table, and ON
is a link for our PK to equal the FK.
To pull specific columns, the SELECT
statement will need to specify the table that you are wishing to pull the column from, as well as the column name. We could pull only three columns in the above by changing the select statement to the below, but maintain the rest of the JOIN
information:
SELECT web_events.channel, accounts.name, orders.total
Provide a table for all web_events associated with account name of Walmart
. There should be three columns. Be sure to include the primary_poc, time of the event, and the channel for each event. Additionally, you might choose to add a fourth column to assure only Walmart events were chosen.
With Alias
SELECT a.primary_poc, w.occurred_at, w.channel, a.name
FROM web_events w
JOIN accounts a
ON w.account_id = a.id
WHERE a.name = 'Walmart';
Without Alias
SELECT accounts.primary_poc, web_events.occurred_at, web_events.channel, accounts.name
FROM web_events
JOIN accounts
ON accounts.id = web_events.account_id
WHERE accounts.name = 'Walmart'
Provide a table that provides the region for each sales_rep along with their associated accounts. Your final table should include three columns: the region name, the sales rep name, and the account name. Sort the accounts alphabetically (A-Z) by account name.
NOTE: columns need unique names hence r.name region, s.name rep, a.name account
SELECT r.name region, s.name rep, a.name account
FROM sales_reps s
JOIN region r
ON s.region_id = r.id
JOIN accounts a
ON a.sales_rep_id = s.id
ORDER BY a.name;
Provide the name for each region for every order, as well as the account name and the unit price they paid (total_amt_usd/total) for the order. Your final table should have 3 columns: region name, account name, and unit price. A few accounts have 0 for total, so I divided by (total + 0.01) to assure not dividing by zero.
SELECT r.name region, a.name account, o.total_amt_usd/(o.total + 0.01) unit_price
FROM region r
JOIN sales_reps s
ON s.region_id = r.id
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id;
If we want to include data that is not included in both tables, but only one of the tables that we are using in our JOIN
statement, there are 3 types of JOINS we might use:
-
LEFT JOIN
(This is usually used most often) RIGHT JOIN
FULL OUTER JOIN
When we write our JOIN
statements, the table listed in the FROM
clause is the left table, while the table in the JOIN in considered the right table.
SELECT
FROM left table
LEFT JOIN right table
- Images from Udacity: Data Foundations Nanodegree program