forked from PsyTeachR/ads-v2
-
Notifications
You must be signed in to change notification settings - Fork 2
/
07-joins.qmd
630 lines (428 loc) · 26.1 KB
/
07-joins.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
# Data Relations {#sec-joins}
## Intended Learning Outcomes {#sec-ilo-joins .unnumbered}
- Be able to match related data across multiple tables
- Be able to combine data from multiple files
## Functions used {#functions-joins}
```{r, include = FALSE}
# load tidyverse packages separately so auto-links work in `func()` notation
library(readr)
library(dplyr)
library(tibble)
library(purrr)
library(stringr)
```
* built-in (you can always use these without loading any packages)
* base:: `library()`, `dir.create()`, `list.files()`, `as.numeric()`, `as.factor()`, `as.character()`, `as.data.frame()`
* tidyverse (you can use all these with `library(tidyverse)`)
* readr:: `write_csv()`, `read_csv()`
* dplyr:: `left_join()`, `right_join()`, `inner_join()`, `full_join()`, `semi_join()`, `anti_join()`, `bind_rows()`, `bind_cols()`, `intersect()`, `union()`, `setdiff()`, `mutate()`
* tibble:: `tibble()`, `as_tibble()`
* purrr:: `map_df()`
* stringr:: `str_replace_all()`
## Walkthrough video {#sec-walkthrough-joins .unnumbered}
There is a walkthrough video of this chapter available via [Echo360.](https://echo360.org.uk/media/519bb538-58b3-4710-85df-d71c56001a5d/public) Please note that there may have been minor edits to the book since the video was recorded. Where there are differences, the book should always take precedence.
## Set-up {#sec-setup-joins}
First, create a new project for the work we'll do in this chapter named `r path("07-relations")`. Second, open and save a new R Markdown document named `relations.Rmd`, delete the welcome text, and load the required packages for this chapter.
```{r setup-joins, message=FALSE, verbatim="r setup, include=FALSE"}
library(tidyverse) # includes readr & tibble
```
Download the [Data transformation cheatsheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-transformation.pdf).
## Loading data {#sec-joins-data}
The data you want to report on or visualise are often in more than one file (or more than one tab of an excel file or googlesheet). You might need to join up a table of customer information with a table of orders, or combine the monthly social media reports across several months.
For this demo, rather than loading in data, we'll create two small data tables from scratch using the `tibble()` function.
`customers` has id, city and postcode for five customers 1-5.
- `1:5` will fill the variable `id` with all integers between 1 and 5.
- `city` and `code` both use the `c()` function to enter multiple strings. Note that each entry is contained within its own quotation marks, apart from missing data, which is recorded as `NA`.
- When entering data like this, it's important that the order of each variable matches up. So number 1 will correspond to "Port Ellen" and "PA42 7DU".
```{r subject}
customers <- tibble(
id = 1:5,
city = c("Port Ellen", "Dufftown", NA, "Aberlour", "Tobermory"),
postcode = c("PA42 7DU", "AB55 4DH", NA, "AB38 7RY", "PA75 6NR")
)
```
`r kable(customers, caption="Demo customers table.")`
`orders` has customer id and the number of items ordered. Some customers from the previous table have no orders, some have more than one order, and some are not in the customer table.
```{r orders}
orders <- tibble(
id = c(2, 3, 4, 4, 5, 5, 6, 6, 7),
items = c(10, 18, 21, 23, 9, 11, 11, 12, 3)
)
```
`r kable(orders, caption="Demo orders table.")`
## Mutating Joins
`r glossary("Mutating joins")` act like the `dplyr::mutate()` function in that they add new columns to one table based on values in another table. (We'll learn more about the `mutate()` function in @sec-tidy).)
All the mutating joins have this basic syntax:
`****_join(x, y, by = NULL, suffix = c(".x", ".y"))`
- `x` = the first (left) table
- `y` = the second (right) table
- `by` = what columns to match on. If you leave this blank, it will match on all columns with the same names in the two tables.
- `suffix` = if columns have the same name in the two tables, but you aren't joining by them, they get a suffix to make them unambiguous. This defaults to ".x" and ".y", but you can change it to something more meaningful.
::: {.callout-note}
You can leave out the `by` argument if you're matching on all of the columns with the same name, but it's good practice to always specify it so your code is robust to changes in the loaded data.
:::
### left_join() {#sec-left_join}
::: join
<img src="images/joins/left_join.png"/>
:::
A `left_join` keeps all the data from the first (left) table and adds anything that matches from the second (right) table. If the right table has more than one match for a row in the left table, there will be more than one row in the joined table (see ids 4 and 5).
```{r left_join}
left_data <- left_join(customers, orders, by = "id")
left_data
```
::: join
<img src="images/joins/left_join_rev.png"/>
:::
The order you specify the tables matters, in the below code we have reversed the order and so the result is all rows from the `orders` table joined to any matching rows from the `customers` table.
```{r left_join-2}
left2_data <- left_join(orders, customers, by = "id")
left2_data
```
### right_join() {#sec-right_join}
::: join
<img src="images/joins/right_join.png"/>
:::
A `right_join` keeps all the data from the second (right) table and joins anything that matches from the first (left) table.
```{r right_join}
right_data <- right_join(customers, orders, by = "id")
right_data
```
::: {.callout-note}
This table has the same information as `left_join(orders, customers, by = "id")`, but the columns are in a different order (left table, then right table).
:::
### inner_join() {#sec-inner_join}
::: join
<img src="images/joins/inner_join.png"/>
:::
An `inner_join` returns all the rows that have a match in both tables. Changing the order of the tables will change the order of the columns, but not which rows are kept.
```{r inner_join}
inner_data <- inner_join(customers, orders, by = "id")
inner_data
```
### full_join() {#sec-full_join}
::: join
<img src="images/joins/full_join.png"/>
:::
A `full_join` lets you join up rows in two tables while keeping all of the information from both tables. If a row doesn't have a match in the other table, the other table's column values are set to `NA`.
```{r full_join}
full_data <- full_join(customers, orders, by = "id")
full_data
```
## Filtering Joins
`r glossary("Filtering joins")` act like the `dplyr::filter()` function in that they keep and remove rows from the data in one table based on the values in another table. The result of a filtering join will only contain rows from the left table and have the same number or fewer rows than the left table. We'll learn more about the `filter()` function in @sec-wrangle.
### semi_join() {#sec-semi_join}
::: join
<img src="images/joins/semi_join.png"/>
:::
A `semi_join` returns all rows from the left table where there are matching values in the right table, keeping just columns from the left table.
```{r semi_join}
semi_data <- semi_join(customers, orders, by = "id")
semi_data
```
::: {.callout-note}
Unlike an inner join, a semi join will never duplicate the rows in the left table if there is more than one matching row in the right table.
:::
::: join
<img src="images/joins/semi_join_rev.png"/>
:::
Order matters in a semi join.
```{r semi_join-2}
semi2_data <- semi_join(orders, customers, by = "id")
semi2_data
```
### anti_join() {#sec-anti_join}
::: join
<img src="images/joins/anti_join.png"/>
:::
An `anti_join` return all rows from the left table where there are *not* matching values in the right table, keeping just columns from the left table.
```{r anti_join}
anti_data <- anti_join(customers, orders, by = "id")
anti_data
```
::: join
<img src="images/joins/anti_join_rev.png"/>
:::
Order matters in an anti join.
```{r anti_join-2}
anti2_data <- anti_join(orders, customers, by = "id")
anti2_data
```
## Multiple joins
The `****_join()` functions are all **two-table verbs**, that is, you can only join together two tables at a time. However, you may often need to join together multiple tables. To do so, you simply need to add on additional joins. You can do this by creating an intermediate object or more efficiently by using a pipe.
```{r}
# create a table of overall customer satisfaction scores
satisfaction <- tibble(
id = 1:5,
satisfaction = c(4, 3, 2, 3, 1)
)
# perform the initial join
join_1 <- left_join(customers, orders, by = "id")
# perform the second join on the new object
join_2 <- left_join(join_1, satisfaction,
by = "id")
```
```{r}
# more efficient method using the pipe
pipe_join <- customers %>%
left_join(orders, by = "id") %>%
left_join(satisfaction, by = "id")
```
::: {.callout-warning}
At every stage of any analysis you should **check your output** to ensure that what you created is what you intended to create, but this is particularly true of joins. You should be familiar enough with your data through routine checks using functions like `glimpse()`, `str()`, and `summary()` to have a rough idea of what the join should result in. At the very least, you should know whether the joined object should result in more or fewer variables and observations.
If you have a multi-line join like in the above piped example, build up the code and check the output at each stage.
:::
## Binding Joins
`r glossary("Binding joins")` bind one table to another by adding their rows or columns together.
### bind_rows() {#sec-bind_rows}
You can combine the rows of two tables with `bind_rows`.
Here we'll add customer data for customers 6-9 and bind that to the original customer table.
```{r bind_rows}
new_customers <- tibble(
id = 6:9,
city = c("Falkirk", "Ardbeg", "Doogal", "Kirkwall"),
postcode = c("FK1 4RS", "PA42 7EA", "G81 4SJ", "KW15 1SE")
)
bindr_data <- bind_rows(customers, new_customers)
bindr_data
```
The columns just have to have the same names, they don't have to be in the same order. Any columns that differ between the two tables will just have `NA` values for entries from the other table.
If a row is duplicated between the two tables (like id 5 below), the row will also be duplicated in the resulting table. If your tables have the exact same columns, you can use `union()` (see @sec-union) to avoid duplicates.
```{r bind-rows-union}
new_customers <- tibble(
id = 5:9,
postcode = c("PA75 6NR", "FK1 4RS", "PA42 7EA", "G81 4SJ", "KW15 1SE"),
city = c("Tobermory", "Falkirk", "Ardbeg", "Doogal", "Kirkwall"),
new = c(1,2,3,4,5)
)
bindr2_data <- bind_rows(customers, new_customers)
bindr2_data
```
### bind_cols() {#sec-bind_cols}
You can merge two tables with the same number of rows using `bind_cols`. This is only useful if the two tables have the same number of rows in the exact same order.
```{r bind_cols}
new_info <- tibble(
colour = c("red", "orange", "yellow", "green", "blue")
)
bindc_data <- bind_cols(customers, new_info)
bindc_data
```
::: {.callout-note}
The only advantage of `bind_cols()` over a mutating join is when the tables don't have any IDs to join by and you have to rely solely on their order. Otherwise, you should use a mutating join (all four mutating joins result in the same output when all rows in each table have exactly one match in the other table).
:::
### Importing multiple files
If you need to import and bind a whole folder full of files that have the same structure, get a list of all the files you want to combine. It's easiest if they're all in the same directory, although you can use a pattern to select the files you want if they have a systematic naming structure.
First, save the two customer tables to CSV files. The `dir.create()` function makes a folder called "data". The `showWarnings = FALSE` argument means that you won't get a warning if the folder already exists, it just won't do anything.
```{r}
# write our data to a new folder for the demo
dir.create("data", showWarnings = FALSE)
write_csv(x = customers, file = "data/customers1.csv")
write_csv(x = new_customers, file = "data/customers2.csv")
```
Next, retrieve a list of all file names in the data folder that contain the string "customers"
```{r}
files <- list.files(
path = "data",
pattern = "customers",
full.names = TRUE
)
files
```
Next, we'll iterate over this list to read in the data from each file. Whilst this won't be something we cover in detail in the core resources of this course, `r glossary("iteration")` is an important concept to know about. Iteration is where you perform the same task on multiple different inputs. As a general rule of thumb, if you find yourself copying and pasting the same thing more than twice, there's a more efficient and less error-prone way to do it, although these functions do typically require a stronger grasp of programming.
The <pkg>purrr</pkg> package contains functions to help with iteration. `purrr::map_df()` maps a function to a list and returns a data frame (table) of the results.
* `.x` is the list of file paths
* `.f` specifies the function to map to each of those file paths.
* The resulting object `all_files` will be a data frame that combines all the files together, similar to if you had imported them separately and then used `bind_rows()`.
```{r, message=FALSE}
all_files <- purrr::map_df(.x = files, .f = read_csv)
```
::: {.callout-note}
If all of your data files have the _exact_ same structure (i.e., the same number of columns with the same names), you can just use `read_csv()` and set the `file` argument to a vector of the file names. However, the pattern above is safer if you might have a file with an extra or missing column.
```{r, eval=FALSE}
files <- c("data/customers1.csv", "data/customers1.csv")
all_files <- read_csv(files)
```
:::
```{r, echo=FALSE, results='hide'}
# clean up temporary files
file.remove("data/customers1.csv")
file.remove("data/customers2.csv")
```
## Set Operations
`r glossary("Set operations")` compare two tables and return rows that match (intersect), are in either table (union), or are in one table but not the other (setdiff).
### intersect() {#sec-intersect}
`dplyr::intersect()` returns all rows in two tables that match exactly. The columns don't have to be in the same order, but they have to have the same names.
```{r intersect}
new_customers <- tibble(
id = 5:9,
postcode = c("PA75 6NR", "FK1 4RS", "PA42 7EA", "G81 4SJ", "KW15 1SE"),
city = c("Tobermory", "Falkirk", "Ardbeg", "Doogal", "Kirkwall")
)
intersect_data <- intersect(customers, new_customers)
intersect_data
```
::: {.callout-warning}
If you've forgotten to load dplyr or the tidyverse, `r glossary("base R")` also has a `base::intersect()` function that doesn't work like `dplyr::intersect()`. The error message can be confusing and looks something like this:
```{r base-intersect, error = TRUE}
base::intersect(customers, new_customers)
```
:::
### union() {#sec-union}
`dplyr::union()` returns all the rows from both tables, removing duplicate rows, unlike `bind_rows()`.
```{r union}
union_data <- union(customers, new_customers)
union_data
```
::: {.callout-warning}
If you've forgotten to load dplyr or the tidyverse, `r glossary("base R")` also has a `base::union()` function. You usually won't get an error message, but the output won't be what you expect.
```{r base-union}
base::union(customers, new_customers)
```
:::
### setdiff() {#sec-setdiff}
`dplyr::setdiff` returns rows that are in the first table, but not in the second table.
```{r setdiff}
setdiff_data <- setdiff(customers, new_customers)
setdiff_data
```
Order matters for `setdiff`.
```{r setdiff-order}
setdiff2_data <- setdiff(new_customers, customers)
setdiff2_data
```
::: {.callout-warning}
If you've forgotten to load dplyr or the tidyverse, `r glossary("base R")` also has a `base::setdiff()` function. You usually won't get an error message, but the output might not be what you expect because `base::setdiff()` expects columns to be in the same order, so id 5 here registers as different between the two tables.
```{r base-setdiff}
base::setdiff(customers, new_customers)
```
:::
## Conflicting variable types
As we covered in @sec-data, when you import or create data, R will do its best to set each column to an appropriate data type. However, sometimes it gets it wrong or sometimes there's something in the way the data has been encoded in the original spreadsheet that causes the data type to be different than expected. When joining datasets by common columns, it's important that not only are the variable names identical, but the data type of those variables is identical.
Let's recreate our `new_customers` dataset but this time, we'll specify that `id` is a `r glossary("character")` variable.
```{r}
new_customers2 <- tibble(
id = as.character(5:9),
postcode = c("PA75 6NR", "FK1 4RS", "PA42 7EA", "G81 4SJ", "KW15 1SE"),
city = c("Tobermory", "Falkirk", "Ardbeg", "Doogal", "Kirkwall")
)
str(new_customers2)
```
If you try to join this dataset to any of the other datasets where `id` is stored as a `r glossary("numeric")` variable, it will produce an error.
```{r error=TRUE}
inner_join(customers, new_customers2)
```
The same goes for `bind_rows()`:
```{r error=TRUE}
bind_rows(customers, new_customers2)
```
As alternative method to change variable types from what we showed you in @sec-data is to use the `as.***` functions. If you type `as.` into a code chunk, you will see that there are a huge number of these functions for transforming variables and datasets to different types. Exactly which one you need will depend on the data you have, but a few commonly used ones are:
* `as.numeric()` - convert a variable to `r glossary("numeric")`. Useful for when you have a variable of real numbers that have been encoded as character. Any values that can't be turned into numbers (e.g., if you have the word "missing" in cells that you have no data for), will be returned as `NA`.
* `as.factor()` - convert a variable to a `r glossary("factor")`. You can set the factor levels and labels manually, or use the default order (alphabetical).
* `as.character()` - convert a variable to `r glossary("character")` data.
* `as_tibble()` and `as.data.frame()` - convert a list object (not a variable) to a tibble or a data frame (two different table formats). This isn't actually relevant to what we're discussing here, but it's a useful one to be aware of because sometimes you'll run into issues where you get an error that specifically requests your data is a tibble or data frame type and you can use this function to overwrite your object.
To use these functions on a variable we can use `mutate()` to overwrite the variable with that variable as the new data type:
```{r}
new_customers2 <- new_customers2 %>%
mutate(id = as.numeric(id))
```
Once you've done this, the joins will now work:
```{r}
inner_join(orders, new_customers2)
```
## Exercises
There's lots of different use cases for the `****_join()` functions. These exercises will allow you to practice different joins. If you have any examples of where joins might be helpful in your own work, please post them on Teams in the week 6 channel, as having many concrete examples can help distinguish between the different joins.
### Grade data
The University of Glasgow's Schedule A grading scheme uses a 22-point alphanumeric scale (there's more information in your summative report [assessment information sheet](https://sway.office.com/k0CnXGd6RjbVokkR){target="_blank"}). Each alphanumeric grade (e.g., B2) has an underlying numeric Grade Point (e.g., 16).
Often when we're working with student grades they are provided to us in only one of these forms, but we need to be able to go between the two. For example, we need the numeric form in order to be able to calculate descriptive statistics about the mean grade, but we need the alphanumeric form to release to student records.
* Download <a href="data/grade_data1.csv" download>grade_data.csv</a>, <a href="data/grade_data2.csv" download>grade_data2.csv</a> and <a href="data/scheduleA.csv" download>scheduleA.csv</a> into your data folder.
* Read in `scheduleA.csv` and save it to an object named `schedule`.
* Read in `grade_data1.csv` and save it to an object named `grades1`.
* Read in `grade_data2.csv` and save it to an object named `grades2`.
```{r, message=FALSE, webex.hide = TRUE}
schedule <- read_csv("data/scheduleA.csv")
grades1 <- read_csv("data/grade_data1.csv")
grades2 <- read_csv("data/grade_data2.csv")
```
### Matching the variable types
At UofG, all students are given a GUID, a numeric ID number. However, that ID number is also then combined with the first letter from your surname to create your username that is used with your email address. For example, if your ID is 1234567 and your surname is Nordmann, your username would be 1234567n. From a data wrangling perspective this is very annoying because the numeric ID will be stored as numeric data, but the username will be stored as character because of the letter at the end. `grades1` has a numeric id whilst `grades2` has the additional letter. In order to join these datasets, we need to standardise the variables.
First, remove the letter character from `id` using the function `stringr::str_replace_all()`, which replaces text that matches a pattern. Here, we're using the pattern `"[a-z]"`, which matches all lowercase letters a through z, and replacing them with `""`. See the help for `?about_search_regex` for more info about how to set patterns (these can get really complex).
```{r}
grades1 <- grades1 %>%
mutate(id = str_replace_all(
id, # the variable you want to search
pattern = "[a-z]", # find all letters a-z
replacement = "" # replace with nothing
))
```
Now, transform the data type of `id` so that it matches the data type in `grades2`.
```{r, results='hide', webex.hide = TRUE}
# check variable types
glimpse(grades1)
glimpse(grades2)
grades1 <- grades1 %>%
mutate(id = as.numeric(id))
```
### Complete records
In this example, we want to join the grade data to schedule A so that each student with a grade has both the grade and the grade point. But we also want a complete record of all students on the course, so students with missing grades should still be included in the data.
* Join `grades1` and `scheduleA` and store this table in an object named `exam_all`.
* Do the same for `grades2` and save it in `essay_all`.
* Both `exam_all` and `essay_all` should have 100 observations of 4 variables.
`r hide("Hint")`
You want to keep all of the data from `grade_data1` and `grade_data2`, but you only want the alphanumeric grades from `schedule` for the Grade Point values that exist in `grades`. E.g., if no-one was awarded an F1, your final dataset shouldn't have that in it.
`r unhide()`
```{r, webex.hide = TRUE}
exam_all <- left_join(grades1, schedule, by = "Points")
essay_all <- left_join(grades2, schedule, by = "Points")
```
### Missing data
Alternatively, you may wish to have a dataset that only contains data for students who submitted each assessment and have a grade. First, run `summary()` on both `exam_all` and `essay_all`.
* How many exam grades are missing? `r fitb(sum(is.na(exam_all$Grade)))`
* How many essay grades are missing? `r fitb(sum(is.na(essay_all$Grade)))`
Now, create an object `exam_grades` that joins together `grades1` and `schedule`, but this time the resulting object should only contain data from students who have a grade. Do the same but for `grades2` and store it in `essay_grades`.
Before you do this, given what you know about how many data points are missing in each data set:
* How many observations should `exam_grades` have? `r fitb(sum(!is.na(exam_all$Grade)))`
* How many observations should `essay_grades` have? `r fitb(sum(!is.na(essay_all$Grade)))`
```{r, webex.hide = TRUE}
exam_grades <- inner_join(grades1, schedule, by = "Points")
essay_grades <- inner_join(grades2, schedule, by = "Points")
```
`r hide("Alternative solution")`
It's worth noting that in reality you wouldn't actually go back to the raw data and do another join to get this dataset, you could just remove all the missing response by adding `%>% drop_na()` to `exam_all` and `essay_all`. However, for the purposes of teaching joins, we'll do it this slightly artificial way.
`r unhide()`
Now, create a dataset `completes` that joins the grades for students who have a grade for **both** the essay and the exam.
* Because both `exam_grades` and `essay_grades` have the variables `Assessment`, `Points` and `Grades` that are named the same, but have different data, you should amend the suffix so that the resulting variables are named `Points_exam` and `Points_essay` etc. You may need to consult the help documentation to see an example to figure this out.
* Clean up the file with `select()` and only keep the variables `id`, `Grade_exam`, and `Grade_essay`
```{r, webex.hide = TRUE}
completes <- inner_join(exam_grades, essay_grades,
by = "id",
suffix = c("_exam", "_essay")) %>%
select(id, Grade_exam, Grade_essay)
```
* How many students have a grade for both the exam and the essay? `r fitb(nrow(completes))`
Now create a dataset `no_essay` that contains students that have a grade for the exam, but not the essay.
```{r, webex.hide = TRUE}
no_essay <- anti_join(exam_grades, essay_grades, by = "id")
```
- How many students have a grade for the exam but not the essay? `r fitb(nrow(no_essay))`
Finally, now make a dataset `no_exam` that contains students have have a grade for the essay but not the exam
```{r, webex.hide = TRUE}
no_exam <- anti_join(essay_grades, exam_grades, by = "id")
```
- How many students have a grade for the exam but not the essay? `r fitb(nrow(no_exam))`
### Share on Teams
Once you're done, share your knitted html file and the Rmd file for the exercises on Teams in the Week 07 channel.
## Glossary {#sec-glossary-joins}
```{r, echo = FALSE, results='asis'}
glossary_table()
```
## Further resources {#sec-resources-joins}
* [Data transformation cheatsheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-transformation.pdf)
* [Chapter 19: Joins](https://r4ds.hadley.nz/joins) in *R for Data Science*
* [Chapter 26: Iteration](https://r4ds.hadley.nz/iteration) in *R for Data Science*.
* [purrr cheatsheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/purrr.pdf)
```{r, include = FALSE}
# clean up temp datasets
files <- c("data/avatar_na.csv", "data/family.csv")
file.exists(files) %>%
`[`(files, .) %>%
file.remove()
```