generated from fhdsl/OTTR_Quarto
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathweek1.qmd
287 lines (197 loc) · 7.74 KB
/
week1.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
---
title: "Week 1"
format: html
---
## Our Composable Database System
- Client: R/RStudio
- Database Engine: DuckDB
- Data Storage: single file in `data/` folder
## Connecting to our database
To access the data, we need to create a database connection. We use `dbConnect()` from the `DBI` package to do this. The first argument specifies the Database engine (`duckdb()`), and the second provides the file location: ("data/synthea-smaller_breast_cancer.db").
```{r}
#| context: setup
library(duckdb)
library(DBI)
con <- DBI::dbConnect(duckdb::duckdb(),
"data/GiBleed_5.3_1.1.duckdb")
```
Once open, we can use `con` (our database connection)
:::{.callout-note}
## Keep in Mind: SQL ignores letter case
These are the same to the database engine:
```
SELECT person_id FROM person
```
```
select PERSON_ID FROM person
```
And so on. Our convention is that we capitalize SQL clauses such as `SELECT` so you can differentiate them from other information.
:::
## Looking at the Entire Database
One of the first things we can learn is to show the contents of the entire database; we can do this with `SHOW TABLES`:
```{sql}
#| connection: "con"
SHOW TABLES;
```
We can get further information about the tables within our database using `DESCRIBE`; This will give us more information about
```{sql}
#| connection: "con"
DESCRIBE;
```
We'll look at a few tables in our work:
- `person` - Contains personal & demographic data
- `procedure_occurrence` - procedures performed on patients and when they happened
- `condition_occurrence` - patient conditions (such as illnesses) and when they occurred
- `concept` - contains the specific information (names of concepts) that map into all three above tables
We'll talk much more later about the relationships between these tables.
## `SELECT` and `FROM`
If we want to see the contents of a table, we can use `SELECT` and `FROM`.
```
SELECT * # select all columns
FROM person # from the person table
LIMIT 10; # return only 10 rows
```
```{sql}
#| connection: "con"
SELECT * FROM person LIMIT 10
```
1. Why are there `birth_datetime` and the `month_of_birth`, `day_of_birth`, `year_of_birth` - aren't these redundant?
## Try it Out
Look at the first few rows of `procedure_occurrence`.
```{sql}
#| eval: FALSE
#| connection: "con"
SELECT * FROM ____ LIMIT 10;
```
1. Why is there a `person_id` column in this table as well?
## `SELECT`ing a few columns in our table
We can use the `SELECT` clause to grab specific columns in our data.
```
SELECT person_id, birth_datetime, gender_concept_id # Columns in our table
FROM person; # Our Table
```
```{sql}
#| connection: "con"
SELECT person_id, birth_datetime, gender_concept_id
FROM person
LIMIT 10;
```
## Try it Out
What happens if we ask for a column that doesn't exist in our data?
```{sql}
#| connection: "con"
#| eval: false
SELECT person_id, birth_datetime, gender_concept_id, blah
FROM person;
```
## Check on Learning
Add `race_concept_id` and `year_of_birth` to your `SELECT` query:
```{sql}
#| connection: "con"
#| eval: false
SELECT person_id, birth_datetime, gender_concept_id, ____, ____
FROM person;
```
## `WHERE` - filtering our table
Adding `WHERE` to our SQL statement lets us add filtering to our query:
```{sql}
#| connection: "con"
SELECT person_id, gender_source_value, race_source_value, year_of_birth
FROM person
WHERE year_of_birth < 1980
```
One critical thing to know is that you don't need to include the columns you're filtering on in the `SELECT` part of the statement. For example, we could do the following as well, removing `year_of_birth` from our `SELECT`:
```{sql}
#| connection: "con"
SELECT person_id, gender_source_value, race_source_value
FROM person
WHERE year_of_birth < 2000
```
:::{.callout-note}
### Quick Note
For R users, notice the similarity of `select()` with `SELECT`. We can rewrite the above in `dplyr` code as:
```r
person |>
select(person_id, gender_source_value, race_source_value)
```
A lot of `dplyr` was inspired by SQL. In fact, there is a package called `dbplyr` that translates `dplyr` statements into SQL. A lot of us use it, and it's pretty handy.
:::
## `COUNT` - how many rows?
Sometimes you want to know the *size* of your result, not necessarily return the entire set of results. That is what `COUNT` is for.
```{sql}
#| connection: "con"
SELECT COUNT(*)
FROM person
WHERE year_of_birth < 2000
```
Similarly, when we want to count the number of `person_id`s returned, we can use `COUNT(person_id)`:
```{sql}
#| connection: "con"
SELECT COUNT(person_id)
FROM person
WHERE year_of_birth > 1970
```
Let's switch gears to the `procedure_concept_id` table. Let's count the overall number of `procedure_concept_id`s in our table:
```{sql}
#| connection: "con"
SELECT COUNT(procedure_concept_id)
FROM procedure_occurrence
```
Hmmm. That's quite a lot, but are there repeat `procedure_concept_id`s?
When you have repeated values in the rows, `COUNT(DISTINCT )` can help you find the number of unique values in a column:
```{sql}
#| connection: "con"
SELECT COUNT(DISTINCT procedure_concept_id)
FROM procedure_occurrence
```
## Check on Learning
Count the distinct values of `gender_source_value` in `person`:
```{sql}
#| connection: "con"
#| eval: false
SELECT COUNT(DISTINCT --------------)
FROM -------;
```
## Keys: Linking tables together
One of the important properties of data in a relational database is that there are no *repeat rows* in the database. Each table that meets this restriction has what is called a *primary key*.
We can use `DESCRIBE` to get more information (the metadata) about a table. This gives us information about our tables.
```{sql}
#| connection: "con"
DESCRIBE person
```
Scanning the rows, which field/column is the primary key for `person`?
Try and find the *primary key* for `procedure_occurrence`. What is it?
```{sql}
#| connection: "con"
DESCRIBE procedure_occurrence
```
We'll see that keys need to be unique (so they can map to each row). In fact, each key is a way to connect one table to another.
What column is the same in both tables? That is a hint for what we'll cover next week: `JOIN`ing tables.
## Data Types
If you look at the `column_type` for one of the `DESCRIBE` statements above, you'll notice there are different data types:
- `INTEGER`
- `TIMESTAMP`
- `DATE`
- `VARCHAR`
Each column of a database needs to be *typed*. The *data type* of a column determines what kinds of calculations or operations we can do on them. For example, we can do things like `date arithmetic` on `DATETIME` columns, asking the engine to calculate 5 days after the dates.
You can see all of the [datatypes that are available in DuckDB here](https://duckdb.org/docs/sql/data_types/overview.html).
::: {.callout-warning}
## Keep in Mind: Beware the Smart Quotes
**Beware cutting and pasting code from Microsoft Word**
Microsoft products such as Word, will transform double quotes `"` into what are called *smart quotes*: `“”`. This is bad for us, because it breaks our code.
```
"This is the Original String"
```
will transform into:
```
“This is the Original String”
```
It is very hard to see the difference between these, but if you cut and paste the bottom one (from a word document), your code will not run. That's because the smart quotes aren't double quotes, which you need to specify a string.
Just be aware that you might have to fix these quotes if you're cutting / pasting from a Microsoft product (although google is also guilty of this).
Oftentimes, you can disable this in Word/Google Docs, or be prepared to replace the smart quotes with double quotes.
:::
## Always close the connection
When we're done, it's best to close the connection with `dbDisconnect()`.
```{r}
dbDisconnect(con)
```