Commit

Merge pull request #5 from ucdavisdatalab/eh_comments
My comments on the SQL reader
MicheleTobias committed Apr 17, 2024
2 parents 3b40248 + 442cbd0 commit d0ac0c7
Showing 6 changed files with 63 additions and 23 deletions.
25 changes: 17 additions & 8 deletions 01_concepts.Rmd
@@ -6,22 +6,25 @@ A **relational database** is a collection of tables (organized in rows and colum

Database tables are analogous to CSV files, spreadsheets in Excel, or data frames in programming languages like R or Python.

Ideally each table can be connected to another table by a column that both tables have that store the information to match up the rows. This column is called a **key**. For example, a key commonly used on campus is your student or employee ID number.
Ideally each table can be connected to another table by a column that is present
in both tables. That column may have different numbers of observations in each
table, but the values will match up. This column is called a **key**. For
example, a key commonly used on campus is your student or employee ID number.

Let's look at an example dataset of fictional student data with data about courses, grades, and employment. Can we say anything about the relationship between course grades and employment based on this data?

**Table: Student**

|ID|Name|
|--|--|
|:--|:--|
|123|Jane Smith|
|456|Maria Martinez|
|789|Paul Jones|

**Table: Courses**

|ID|Course|Grade|
|--|--|--|
|:--|:--|:--|
|123|Calculus|A-|
|456|Calculus|A|
|789|Calculus|C+|
@@ -32,7 +35,7 @@ Let's look at an example dataset of fictional student data with data about cours
**Table: Employment**

|ID|Position|Employer|HoursPerWeek|
|--|--|--|--|
|:--|:--|:--|:--|
|123|Student Assistant|University Research Lab|5|
|456|Customer Service|Alumni Center|5|
|456|Research Assistant|University Research Lab|15|
@@ -45,14 +48,14 @@ Let's look at an example dataset of fictional student data with data about cours
## What is SQL?

SQL stands for **structured query language**. SQL is a programming language that allows you to request (query) information from a database using a standard set of keywords. You can pronounce SQL as "ess cue ell" or "sequel".

<!-- love the pronunciation guide -->

### What kinds of questions can SQL answer?

SQL excels at extracting and combining information from large datasets. Some questions you might ask with SQL include:

* How many items are there in my data with a specific label?
* What re the unique values in a given column?
* What are the unique values in a given column?
* Which records (rows) relate to a specific time period in my data?
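
As a rough sketch of what these questions look like as queries, the example below runs against a small made-up table. The table name `samples` and its columns `label` and `collected` are hypothetical and exist only for this illustration; the keywords themselves are covered later in the reader.

```sql
-- Hypothetical scratch table, created only for this illustration.
CREATE TEMPORARY TABLE samples (
    id        INTEGER,
    label     TEXT,
    collected TEXT  -- date stored as text in YYYY-MM-DD form
);

INSERT INTO samples VALUES
    (1, 'soil',  '2023-11-02'),
    (2, 'water', '2024-01-15'),
    (3, 'soil',  '2024-02-20');

-- How many items have a specific label?
SELECT COUNT(*) FROM samples WHERE label = 'soil';

-- What are the unique values in a given column?
SELECT DISTINCT label FROM samples;

-- Which rows relate to a specific time period?
SELECT * FROM samples
WHERE collected BETWEEN '2024-01-01' AND '2024-12-31';
```

Storing dates as `YYYY-MM-DD` text is a choice made for this sketch: it lets `BETWEEN` compare dates correctly as plain strings.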


@@ -69,6 +72,9 @@ For this workshop, we'll use [SQLite][sqlite], which is a simple and widely-used

Every RDBMS has its own implementation or "dialect" of SQL. In other words, the set of SQL keywords supported differs slightly from one RDBMS to another, and sometimes queries have to be written differently, but the basics are the same. Details about the supported keywords for a given RDBMS can be found in that system's documentation. The keywords covered in this workshop are supported by most systems.
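
One small illustration of a dialect difference is limiting a result to a handful of rows. The sketch below assumes you are connected to the workshop database and uses its *items* table; the SQL Server form appears only as a comment because it will not run in SQLite.

```sql
-- SQLite (and also PostgreSQL and MySQL): LIMIT goes at the end of the query.
SELECT *
FROM items
LIMIT 5;

-- Microsoft SQL Server uses TOP after SELECT instead.
-- This line is commented out because SQLite does not support TOP:
-- SELECT TOP 5 * FROM items;
```

Both forms return at most five rows. Without an `ORDER BY` clause, which five rows you get is up to the database.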

<!-- I might add something here about how when googling, it is important to specify -->
<!-- which version of SQL you are using (ex sqlite, postgres, ms server) -->

Some RDBMS allow you to add functions with extensions. For example, the PostGIS extension adds keywords to PostgreSQL to allow you to work with location information and do spatial analysis.


@@ -86,13 +92,16 @@ SQL has major advantages in several areas important to researchers:
+ Typically faster to run a process in a database than in a spreadsheet
+ Store lots of data (compare with Excel's row limits)
* Data management
+ One database file stores many, many tables which is represented as one file in your file browser
+ One database file is the equivalent of many, many spreadsheet files (like CSV or XLSX files)
+ Write a query instead of making new files or tabs

What does SQL not do well?

* Most RDBMS do not visualize data. However, you can connect your database to visualization tools to perform these kinds of tasks seamlessly.
* SQL is designed to work with tabular data. If your data is another type - for example graph data or tree data - you might want to explore other database types.
* The SQL language is designed for data querying, not data analysis. If you want
to run statistics on your data, you can connect to your database from a
programming language like R or Python, or from statistical software.
* SQL assumes you work with tabular data. If your data is another type - for example graph data or tree data - you might want to explore other database types.

<!--
* You can manage multiple data sources
2 changes: 2 additions & 0 deletions 02_the-library-checkouts-database.Rmd
@@ -27,6 +27,8 @@ important for determining what types of questions you can answer with SQL!

Here's an ERD for the Library Checkouts database:

<!-- the loans-in-house is spelled wrong (laons-in-house) -->

![](images/DataDiagram_ucd_library.png)

Let's break down the components of the ERD:
3 changes: 3 additions & 0 deletions 03_setup_database.Rmd
@@ -48,6 +48,9 @@ Let's connect to the database that we'll be using for this workshop:
6. Click the "Connect to the database" icon ![](images/sqlitestudio-connect-db-icon.png) [**4**]
- You are now connected to the database and can execute SQL against the database!

<!-- What are the [3] and [4] for? Also the file isn't called lcdb, so you
might want to change how you refer to it or change the name of the file -->


### Saving Scripts

40 changes: 28 additions & 12 deletions 04_hands-on-with-sql-code.Rmd
@@ -2,6 +2,8 @@

We just learned that SQL is a language that allows us to interact with and manage a database. Let's learn some SQL queries to get some hands-on experience.

<!-- I would add a sentence or two here about how to open up the SQL editor -->

## Viewing Data

### SELECT & FROM
@@ -14,13 +16,17 @@ Now click the *Execute all* button. ![alt text](images/Button_Execute.PNG)

This query asks the database to select everything (* means "everything") from the table *items*. It ends with a semicolon to tell the database that this is the end of our request.

SQL doesn't care if you add extra white space (spaces, tabs, or new lines)
to your query to make it easier to read. All that matters is that you use the
correct keyword structure and end your query with a semicolon (;). Because of
this, the query below does exactly the same thing as the first query we ran.

```
SELECT
*
FROM
items;
```
The above query does exactly the same thing as the first one, hence the need for the end of query indicator. We can use new lines to help us organize large queries to make them easier to read.

SQL ignores capitalization, spaces, and new lines in a query. Some tools which
use SQL also ignore semicolons. However, it's conventional to:
@@ -47,7 +53,7 @@ FROM items;
### Unique Values

What if we now want to knowwhat all the possible languages are in our data set? We could scroll through the results and try to keep track of unique values, but that is tedious - and we'll likely miss some, especially if they are uncommon.
What if we now want to know what all the possible languages are in our data set? We could scroll through the results and try to keep track of unique values, but that is tedious - and we'll likely miss some, especially if they are uncommon.

Instead we can use the `SELECT DISTINCT` keywords on one or more columns to show
all the unique values.
@@ -295,6 +301,10 @@ There will be times where we want to find only the rows that do not satisfy some

Below is a query to find items that ***do not*** have a certain number of recalls - in this case, we're excluding items with 0, 1, or 3 recalls.

<!--Why would you want to exclude items with 0, 1, 3 recalls? I mean it doesn't
matter that much, but it seems a bit arbitrary. What about excluding checkouts
that happened during the pandemic (2020-22?) -->

```
SELECT *
FROM items
@@ -391,7 +401,9 @@ Notice here how we asked for two columns - the `library_code` and the count of
`item_id`.

> **CHALLENGE**:
> You can also `GROUP BY` more than one column by listing the columns to group by with each column name separated by a comma. How would you find the total number of times a patron checked out in each library?
> You can also `GROUP BY` more than one column by listing the columns to group by with each column name separated by a comma. How would you find the total number of times a patron checked out an item at each library?
<!-- Are you going to bring up the use of -1 as the missing value for patron_id? -->

### Having

@@ -411,7 +423,7 @@ Now we've seen how we can use functions to aggregate data and how grouping data

## Joining Data

Joining tables allows us to combine information from more than one table into a new table. The tables need to have a ***key*** column to be able to link the tables together. A key is a column that contains information that allows it to relate to information in another table. In our Library Checkouts ERD, the *item_id* column in *itmes* is a key column that links to *item_id* in *checkouts*.
Joining tables allows us to combine information from more than one table into a new table. The tables need to have a ***key*** column to be able to link the tables together. A key is a column that contains information that allows it to relate to information in another table. In our Library Checkouts ERD, the *item_id* column in *items* is a key column that links to *item_id* in *checkouts*.
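
To make the idea of a key concrete, here is a minimal sketch that rebuilds two of the fictional tables from the concepts chapter as temporary tables and links them on their shared ID column. The lowercase names `student` and `courses`, and the choice to store the ID as an integer, are assumptions made for this example; these tables are not part of the Library Checkouts database.

```sql
-- Scratch copies of the fictional Student and Courses tables.
CREATE TEMPORARY TABLE student (id INTEGER, name TEXT);
CREATE TEMPORARY TABLE courses (id INTEGER, course TEXT, grade TEXT);

INSERT INTO student VALUES
    (123, 'Jane Smith'),
    (456, 'Maria Martinez'),
    (789, 'Paul Jones');

INSERT INTO courses VALUES
    (123, 'Calculus', 'A-'),
    (456, 'Calculus', 'A'),
    (789, 'Calculus', 'C+');

-- Rows are matched wherever the key values are equal.
SELECT
    student.name,
    courses.course,
    courses.grade
FROM student
INNER JOIN courses ON student.id = courses.id;
```

Because the tables are temporary, they disappear when you disconnect from the database.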


### JOIN Types
@@ -471,8 +483,8 @@ SELECT
items.title,
checkouts.item_id,
checkouts.due_date
FROM checkouts
INNER JOIN items ON items.item_id = checkouts.item_id;
FROM items
INNER JOIN checkouts ON items.item_id = checkouts.item_id;
```

We interpret the `INNER JOIN` query as, "all books that have been checked out."
@@ -487,22 +499,20 @@ SELECT
items.title,
checkouts.item_id,
checkouts.due_date
FROM checkouts
LEFT JOIN items ON items.item_id = checkouts.item_id;
FROM items
LEFT JOIN checkouts ON items.item_id = checkouts.item_id;
```

We interpret the `LEFT JOIN` query as, "all books, whether or not they have been checked out."
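
The difference between the two join types is easiest to see on a table small enough to check by eye. The sketch below uses hypothetical scratch tables (`demo_student` and `demo_employment`, loosely based on the fictional student data from the concepts chapter) rather than the workshop database.

```sql
CREATE TEMPORARY TABLE demo_student (id INTEGER, name TEXT);
CREATE TEMPORARY TABLE demo_employment (id INTEGER, position TEXT);

INSERT INTO demo_student VALUES
    (123, 'Jane Smith'),
    (456, 'Maria Martinez'),
    (789, 'Paul Jones');

-- Only students 123 and 456 have jobs in this sample; 456 has two.
INSERT INTO demo_employment VALUES
    (123, 'Student Assistant'),
    (456, 'Customer Service'),
    (456, 'Research Assistant');

-- INNER JOIN: 3 rows. Paul Jones (789) is dropped because he has no match.
SELECT demo_student.name, demo_employment.position
FROM demo_student
INNER JOIN demo_employment ON demo_student.id = demo_employment.id;

-- LEFT JOIN: 4 rows. Paul Jones is kept, with NULL in the position column.
SELECT demo_student.name, demo_employment.position
FROM demo_student
LEFT JOIN demo_employment ON demo_student.id = demo_employment.id;
```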

You might be thinking, what would happen if the tables in the `LEFT JOIN` were flipped? We would get the same result as the `INNER JOIN` query! That's because there's no instances where a checkout without a book could ever happen!

> **CHALLENGE**:
> Can you write a query that contains the title of the books and the ID of the patrons that checked them out?
## Subqueries

So far we've been working with one `SELECT` statement, but we can actually combine multiple `SELECT` statements using subqueries. Subqueries are nested queries enclosed in parentheses that can be used with other keywords like `JOIN` and `WHERE`. Below are 2 examples of these use cases.

You can think of a subquery as a process where you write a query to create a table,, then query the table you just constructed. This can be especially helpful with large complex tables where simplifying helps you understand the query better, or when you need to complete a multi-step query and don't want to make extra tables or views (something we'll cover in the next sections).
You can think of a subquery as a process where you write a query to create a table, then query the table you just constructed. This can be especially helpful with large complex tables where simplifying helps you understand the query better, or when you need to complete a multi-step query and don't want to make extra tables or views (something we'll cover in the next sections).
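
Here is a minimal sketch of that two-step idea, assuming the workshop's *checkouts* table and its *item_id* column. The inner query builds a table of checkout counts per item, and the outer query then filters that table. The names `counts` and `n_checkouts` and the threshold of 10 are illustrative choices.

```sql
SELECT
    counts.item_id,
    counts.n_checkouts
FROM (
    -- Step 1: build a table with one row per item and its number of checkouts.
    SELECT
        item_id,
        COUNT(*) AS n_checkouts
    FROM checkouts
    GROUP BY item_id
) AS counts
-- Step 2: query the table we just constructed.
WHERE counts.n_checkouts > 10;
```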

Let's first look at a subquery in the `WHERE` clause:

@@ -577,6 +587,10 @@ CREATE TEMPORARY TABLE mircoform AS
) AS microforms ON checkouts.item_id = microforms.item_id;
```

<!-- when I run this code, a new table does not appear on the databases pane on
the left where it lists out the tables in the database. This isn't necessarily
an issue but you may get some questions about whether it has worked. -->
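
One way to confirm that a temporary table was created, assuming SQLite: temporary tables are recorded in a separate schema table named `sqlite_temp_master`, which can be queried like any other table. The table `temp_demo` below is a throwaway example, not part of the workshop data.

```sql
-- Create a throwaway temporary table.
CREATE TEMPORARY TABLE temp_demo AS
SELECT 1 AS example_value;

-- List the temporary tables that exist in the current connection.
SELECT name
FROM sqlite_temp_master
WHERE type = 'table';
```

Temporary tables exist only for the current connection, so they disappear once you disconnect from the database.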

In much the same way we made the new table, we can make a view:

```
@@ -636,6 +650,8 @@ WHERE receiving_date IS NULL;

The `SET` keyword specifically targets just the *receiving_date* column and replaces *NULL* values with "*N/A*" when the condition is met in the `WHERE` clause. It leaves the other values alone. If the `WHERE` clause is removed, it will set all values in the whole column to "*N/A*", overwriting every existing value, so proceed with caution!
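
Because an `UPDATE` is hard to undo, it can help to preview which rows the `WHERE` clause matches before changing anything. The sketch below uses a hypothetical scratch table, `receiving_demo`, rather than the workshop data.

```sql
CREATE TEMPORARY TABLE receiving_demo (item_id INTEGER, receiving_date TEXT);

INSERT INTO receiving_demo VALUES
    (1, '2020-03-01'),
    (2, NULL),
    (3, NULL);

-- Preview: how many rows would the UPDATE change?
SELECT COUNT(*)
FROM receiving_demo
WHERE receiving_date IS NULL;

-- Apply the change to the matching rows only.
UPDATE receiving_demo
SET receiving_date = 'N/A'
WHERE receiving_date IS NULL;

-- Row 1 keeps its date; rows 2 and 3 now contain 'N/A'.
SELECT * FROM receiving_demo;
```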

<!-- why would you prefer N/A over NULL for this? -->

### Add & Populate a Column

Sometimes we want to make a new column and add data into it. Let's make a new column called *year* in the *patrons* table and populate it with the year parsed from the *creation_date* column.
@@ -655,7 +671,7 @@ Now we update all values with the results of a string parsing cution that return
```
UPDATE patrons
SET year = substr(creation_date, -4, 4)
WHERE state IS NOT NULL;
WHERE creation_date IS NOT NULL;
```

The function substr() creates a substring from a string object - in this case, our *creation_date* string. The second argument, -4, indicates the position to start the substring from. Negative values tell the function to start from the right side of the string (or the end of the string) rather than the left. Finally, the third argument indicates how many characters to include. We chose 4 because our date string has a 4 digit year.
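
To see how the arguments behave, `substr()` can be tried on a literal string. The date format below is an assumption for illustration; the actual format of *creation_date* may differ.

```sql
-- Start 4 characters from the end and take 4 characters.
SELECT substr('04/17/2024', -4, 4);  -- returns '2024'

-- A positive start position counts from the left instead.
SELECT substr('04/17/2024', 1, 2);   -- returns '04'
```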
3 changes: 3 additions & 0 deletions 05_conclusion.Rmd
@@ -1,3 +1,6 @@
# Conclusion

We covered a wide variety of SQL processes you might need in setting up a database and querying data. Did we cover everything you might need to know? Of course not. It's only a 2-hour workshop and SQL is a big language, but we've learned enough terminology and seen enough typical workflows for you to get started. To help you learn more and expand your SQL skills, we've assembled a list of resources in the Resources section of the reader.

<!-- I'm not sure an entire conclusions section is really necessary. I might
just add this at the top of the resources section -->
13 changes: 10 additions & 3 deletions index.Rmd
@@ -48,15 +48,22 @@ After this workshop learners should be able to:

#### Prerequisites {-}

No prior programming experience is necessary. We recommend learners either attend or review the written materials for DataLab's[Overview of Databases & Data Storage Technologies](https://ucdavisdatalab.github.io/workshop_intro_to_databases/) workshop.
No prior programming experience is necessary. We recommend learners either attend or review the written materials for DataLab's [Overview of Databases & Data Storage Technologies](https://ucdavisdatalab.github.io/workshop_intro_to_databases/) workshop.

Before the workshop, learners
should:

* Install [SQLiteStudio][sqlitestudio] and verify that it runs. See the
[install guide][install] for details.
<!-- I think the link to the SQLiteStudio website here is confusing. -->
* Install SQLiteStudio using the [install guide][install] and verify that it runs.
* Download the file `2024-04-09_library-data.sqlite` from [this link][materials].

> **NOTE**:
> If you have a Mac (OSX), you will need to right-click on the SQLiteStudio
> installer and select open. If you open the installer regularly, the Mac
> operating system will block the installer from running.
<!-- Pamela told me to remove my references to GradPathways from my presentation
I would check with her about including this -->
Please see these [recommendations for making SQLiteStudio easier to read](https://veroniiiica.com/sqlitestudio-and-low-vision/), particularly for those with low vision and those who use a screen reader.

[sqlite]: https://sqlite.org/