---
title: "Week 1"
format: html
---

## Our Composable Database System

- Client: R/RStudio
- Database Engine: DuckDB
- Data Storage: single file in `data/` folder

 
## Connecting to our database

To access the data, we need to create a database connection. We use `dbConnect()` from the `DBI` package to do this. The first argument specifies the database engine (`duckdb()`), and the second provides the file location (`"data/GiBleed_5.3_1.1.duckdb"`).

```{r}
#| context: setup
library(duckdb)
library(DBI)

con <- DBI::dbConnect(duckdb::duckdb(), 
                      "data/GiBleed_5.3_1.1.duckdb")
```

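Once the connection is open, we can use `con` (our database connection) to send queries to the database. Each SQL chunk in this document refers to it through the `connection: "con"` option.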

:::{.callout-note}
## Keep in Mind: SQL ignores letter case

These are the same to the database engine: 

```
SELECT person_id FROM person
```

```
select PERSON_ID FROM person
```

And so on. Our convention is that we capitalize SQL clauses such as `SELECT` so you can differentiate them from other information.
:::
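
One caveat worth keeping in mind: case-insensitivity applies to SQL keywords and to table and column names, not to the text values stored in the data. Here is a minimal sketch that uses literal values only, so it doesn't depend on any table (the aliases `one` and `same_text` are made up for illustration):

```sql
-- Keywords are case-insensitive: both statements run the same way
select 1 as one;
SELECT 1 AS one;

-- String values are compared exactly, so letter case matters here
SELECT 'Female' = 'female' AS same_text;   -- false
SELECT 'Female' = 'Female' AS same_text;   -- true
```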

## Looking at the Entire Database

A good first step is to list every table in the database; we can do this with `SHOW TABLES`:

```{sql}
#| connection: "con"
SHOW TABLES;
```

We can get further information about the tables within our database using `DESCRIBE`. This will give us more information about each table, such as its column names and column types.

```{sql}
#| connection: "con"
DESCRIBE;
```


We'll look at a few tables in our work:

  - `person` - Contains personal & demographic data
  - `procedure_occurrence` - procedures performed on patients and when they happened
  - `condition_occurrence` - patient conditions (such as illnesses) and when they occurred
  - `concept` - the names of the concepts that the IDs in the three tables above map to

We'll talk much more later about the relationships between these tables.

## `SELECT` and `FROM`

If we want to see the contents of a table, we can use `SELECT` and `FROM`.

```
SELECT *          -- select all columns
  FROM person     -- from the person table
  LIMIT 10;       -- return only 10 rows
```

```{sql}
#| connection: "con"
SELECT * FROM person LIMIT 10
```

1. Why are there both a `birth_datetime` column and separate `month_of_birth`, `day_of_birth`, and `year_of_birth` columns? Aren't these redundant?

## Try it Out

Look at the first few rows of `procedure_occurrence`. 

```{sql}
#| eval: FALSE
#| connection: "con"
SELECT * FROM ____ LIMIT 10;
```

1. Why is there a `person_id` column in this table as well?

## `SELECT`ing a few columns in our table

We can use the `SELECT` clause to grab specific columns in our data. 

```
SELECT person_id, birth_datetime, gender_concept_id -- Columns in our table
  FROM person;                                      -- Our Table
```

```{sql}
#| connection: "con"
SELECT person_id, birth_datetime, gender_concept_id 
  FROM person
  LIMIT 10;
```

## Try it Out

What happens if we ask for a column that doesn't exist in our data?

```{sql}
#| connection: "con"
#| eval: false
SELECT person_id, birth_datetime, gender_concept_id, blah
  FROM person;
```

## Check on Learning

Add `race_concept_id` and `year_of_birth` to your `SELECT` query:

```{sql}
#| connection: "con"
#| eval: false
SELECT person_id, birth_datetime, gender_concept_id, ____, ____
  FROM person;
```


## `WHERE` - filtering our table

Adding `WHERE` to our SQL statement lets us add filtering to our query:

```{sql}
#| connection: "con"
SELECT person_id, gender_source_value, race_source_value, year_of_birth 
  FROM person 
  WHERE year_of_birth < 1980
```
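
`WHERE` accepts other comparison operators besides `<`. The sketch below reuses the `person` table and `year_of_birth` column from the query above. The cutoff year 1980 is an arbitrary choice for illustration, and the exact-match query may well return no rows in our data:

```sql
-- Exact match on a number: people born in 1980
SELECT person_id, year_of_birth
  FROM person
  WHERE year_of_birth = 1980;

-- Greater than or equal to: people born in 1980 or later
SELECT person_id, year_of_birth
  FROM person
  WHERE year_of_birth >= 1980
  LIMIT 10;
```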

One critical thing to know is that you don't need to include the columns you're filtering on in the `SELECT` part of the statement. For example, the following query still filters on `year_of_birth` (this time with a different cutoff), even though we have removed it from our `SELECT`:

```{sql}
#| connection: "con"
SELECT person_id, gender_source_value, race_source_value 
  FROM person 
  WHERE year_of_birth < 2000
```

:::{.callout-note}
### Quick Note

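For R users, notice the similarity of `select()` with `SELECT`, and of `filter()` with `WHERE`. Assuming `person` is available in R as a data frame (or as a `dbplyr` table), we can rewrite the above in `dplyr` code as:

```r
person |>
  filter(year_of_birth < 2000) |>                            # WHERE
  select(person_id, gender_source_value, race_source_value)  # SELECT
```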

A lot of `dplyr` was inspired by SQL. In fact, there is a package called `dbplyr` that translates `dplyr` statements into SQL. A lot of us use it, and it's pretty handy.
:::

## `COUNT` - how many rows?

Sometimes you want to know the *size* of your result, not necessarily return the entire set of results. That is what `COUNT` is for. 

```{sql}
#| connection: "con"
SELECT COUNT(*) 
  FROM person
  WHERE year_of_birth < 2000
```

Similarly, when we want to count the number of `person_id`s returned, we can use `COUNT(person_id)`:

```{sql}
#| connection: "con"
SELECT COUNT(person_id) 
  FROM person
  WHERE year_of_birth > 1970
```

Let's switch gears to the `procedure_occurrence` table. Let's count the overall number of `procedure_concept_id`s in our table:

```{sql}
#| connection: "con"
SELECT COUNT(procedure_concept_id)
  FROM procedure_occurrence
```

Hmmm. That's quite a lot, but are there repeat `procedure_concept_id`s?

When you have repeated values in the rows, `COUNT(DISTINCT )` can help you find the number of unique values in a column:

```{sql}
#| connection: "con"
SELECT COUNT(DISTINCT procedure_concept_id)
  FROM procedure_occurrence
```
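
The three forms of `COUNT` can give different answers on the same column. As a small sketch, the query below builds a four-row toy column inline with `VALUES`, so it runs without touching any table. The names `t` and `x` are made up for illustration and are not part of our database.

```sql
SELECT COUNT(*)          AS n_rows,      -- counts every row: 4
       COUNT(x)          AS n_non_null,  -- skips the NULL: 3
       COUNT(DISTINCT x) AS n_distinct   -- unique non-NULL values: 2
  FROM (VALUES (1), (1), (2), (NULL)) AS t(x);
```

`COUNT(*)` counts rows, while `COUNT(column)` counts only the rows where that column has a value.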

## Check on Learning

Count the distinct values of `gender_source_value` in `person`:

```{sql}
#| connection: "con"
#| eval: false
SELECT COUNT(DISTINCT ____)
  FROM ____;
```

## Keys: Linking tables together

One of the important properties of data in a relational database is that a table contains no *repeat rows*. Each table that meets this restriction has what is called a *primary key*: a column (or combination of columns) whose value uniquely identifies each row.

We can use `DESCRIBE` to get more information (the metadata) about a table, such as its column names and data types.

```{sql}
#| connection: "con"
DESCRIBE person
```

Scanning the rows, which field/column is the primary key for `person`?

Try and find the *primary key* for `procedure_occurrence`. What is it?

```{sql}
#| connection: "con"
DESCRIBE procedure_occurrence
```

We'll see that keys need to be unique (so they can map to each row). In fact, each key is a way to connect one table to another.

What column is the same in both tables? That is a hint for what we'll cover next week: `JOIN`ing tables.
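
The uniqueness requirement for keys can be checked with the counting tools from earlier: if a column has as many distinct values as the table has rows, it could serve as a key. This is a sketch on a hypothetical three-row table named `toy`, built inline so it doesn't give away the answers for our real tables. The names `toy`, `id`, and `label` are made up for illustration.

```sql
WITH toy AS (
  SELECT *
    FROM (VALUES (1, 'a'), (2, 'a'), (3, 'b')) AS t(id, label)
)
SELECT COUNT(*)              AS n_rows,    -- 3
       COUNT(DISTINCT id)    AS n_ids,     -- 3: every row has its own id
       COUNT(DISTINCT label) AS n_labels   -- 2: 'a' repeats, so not a key
  FROM toy;
```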

## Data Types

If you look at the `column_type` for one of the `DESCRIBE` statements above, you'll notice there are different data types:

- `INTEGER`
- `TIMESTAMP`
- `DATE`
- `VARCHAR`

Each column of a database needs to be *typed*. The *data type* of a column determines what kinds of calculations or operations we can do on it. For example, we can do *date arithmetic* on `DATE` and `TIMESTAMP` columns, such as asking the engine to calculate the date 5 days after each value.
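
As a rough sketch of date arithmetic in DuckDB: the first query uses only a literal date, so it runs on its own. The second assumes that `birth_datetime` in `person` is stored as a `TIMESTAMP`. The aliases are made up for illustration.

```sql
-- Adding an integer to a DATE moves it forward by that many days
SELECT DATE '2020-01-01' + 5 AS five_days_later;   -- 2020-01-06

-- The same idea on a column, this time using an INTERVAL
SELECT person_id,
       birth_datetime,
       birth_datetime + INTERVAL 5 DAY AS five_days_after_birth
  FROM person
  LIMIT 10;
```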

You can see all of the [datatypes that are available in DuckDB here](https://duckdb.org/docs/sql/data_types/overview.html).

::: {.callout-warning}
## Keep in Mind: Beware the Smart Quotes

**Beware cutting and pasting code from Microsoft Word**

Microsoft products such as Word will transform straight double quotes `"` into what are called *smart quotes*: `“”`. This is bad for us, because it breaks our code.

```
"This is the Original String"
```

will transform into:

```
“This is the Original String”
```

It is very hard to see the difference between these, but if you cut and paste the bottom one (from a Word document), your code will not run. That's because smart quotes aren't the straight quotes that code needs to mark the start and end of a string.

Just be aware that you might have to fix these quotes if you're cutting and pasting from a Microsoft product (although Google Docs is also guilty of this).

Oftentimes, you can disable this in Word/Google Docs; otherwise, be prepared to replace the smart quotes with straight quotes.
:::
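
In SQL specifically, the two kinds of straight quotes do different jobs, which is one more reason to keep them straight. This is a minimal sketch, and the aliases `my_string` and `"Person ID"` are made up for illustration:

```sql
-- Straight single quotes mark a string value
SELECT 'This is the Original String' AS my_string;

-- Straight double quotes mark a name, such as a column label with a space
SELECT person_id AS "Person ID"
  FROM person
  LIMIT 5;
```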

## Always close the connection

When we're done, it's best to close the connection with `dbDisconnect()`. 

```{r}
dbDisconnect(con)
```