forked from rstats-tln/transform-data-with-dplyr
-
Notifications
You must be signed in to change notification settings - Fork 5
/
transform_1.Rmd
581 lines (368 loc) · 14.5 KB
/
transform_1.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
---
title: "Data transformation using dplyr (aka five verbs)"
author: "Taavi Päll"
date: "2021-09-16"
output: github_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Introduction
I our previous classes we were working with small and cleaned up **mgp** dataset to go through steps of creating some of the most common visualisation types.
In data analysis workflow you need to plot out data at two points:
1. During **exploratory data analysis** where you learn to know your dataset and
2. **reporting** when you try to communicate what you have found.
Importantly, data analysis is not a linear process, but an iterative process of data transformation, modeling and visualisation.
Here we add another step to data analysis process: data transformation.
In most cases you need to transform your data during analysis, because in real life you rarely start with a dataset that is in the right form for visualisation and modeling.
Usually you will need to:
- summarise your data,
- create new variables,
- rename variables,
- reorder the observations.
We are going to use the dplyr library from tidyverse to learn how to carry out these tasks.
## Sources
Again, we are going to follow closely R4DS book chapter "Data transformation" available
- from http://r4ds.had.co.nz/transform.html.
- More examples are available from https://rstats-tartu.github.io/lectures/tidyverse.html#dplyr-ja-selle-viis-verbi
Load tidyverse library and dataset:
```{r}
library(tidyverse)
library(lubridate) # library to work with dates and time
library(here) # (always) locate files in your project folder P.S. load here after lubridate, because lubridate has also (now deprecated) function called here
```
### COVID19 data
Estonian COVID19 tests data was downloaded from Estonian Health Board open data portal <https://www.terviseamet.ee/et/koroonaviirus/avaandmed> and contains positive and negative test results with test dates, including metadata about subject gender, age group, country, and county.
Whole dataset was sampled down and includes 5% of the original data.
Dataset was downloaded and prepared using [get_data.R](scripts/get_data.R) script
Let's import covid_tests.csv file from data subfolder.
As we cleverly assigned short name to our dataset after preprocessing, we can use same name for our imported object for better readability of code.
It's very difficult to follow, when all object are named "df1", "df2" etc or "m", "m1" etc.
```{r}
(covid_tests <- read_csv(here("data", "covid_tests.csv")))
```
`here()` uses clever heuristics to identify your working directory and updates file paths respectively:
```{r}
here("data", "covid_tests.csv")
```
Set up here like so, cd to your working directory and run:
```{r}
here::set_here()
```
You could also just run:
```{r, eval=FALSE}
covid_tests <- read_csv("data/covid_tests.csv")
```
## dplyr basics
Most of the data transformation tasks can be carried out using five verbs from dplyr library:
- Pick observations by their values (filter()).
- Reorder the rows (arrange()).
- Pick variables by their names (select()).
- Create new variables with functions of existing variables (mutate()).
- Collapse many values down to a single summary (summarise()).
- These can all be used in conjunction with group_by() which changes the scope of each function from operating on the entire dataset to operating on it group-by-group.
These six functions provide the verbs for a language of data manipulation.
All verbs work similarly:
- The first argument is a data frame.
- The subsequent arguments describe what to do with the data frame, using the variable names (without quotes).
- The result is a new data frame.
Together these properties make it easy to chain together multiple simple steps to achieve a complex result. Let's dive in and see how these verbs work.
## Filter rows with filter()
filter() allows you to subset observations based on their values.
The first argument is the name of the data frame. The second and subsequent arguments are the expressions that filter the data frame.
For example, we can filter only positive test results (P):
```{r}
filter(covid_tests, ResultValue == "P")
```
dplyr runs the filtering operation and returns a new data frame.
dplyr functions never modify their inputs, so if you want to save the result, you'll need to use the assignment operator, `<-`, like so:
Assign bacterial viruses to object phages:
```{r}
pos_tests <- filter(covid_tests, ResultValue == "P")
pos_tests
```
### Comparisons
What is this == operator? Why not use = to check equality:
```{r, eval=FALSE}
filter(covid_tests, ResultValue = "P")
```
It appears that = is another assignment operator besides ->
There's another common problem you might encounter when using ==: floating point numbers.
Although, theoretically TRUE, following comparisons return FALSE!
```{r}
sqrt(2) ^ 2 == 2
1/49 * 49 == 1
```
This is because computers and R use finite precision arithmetic and cannot store an infinite number of digits.
This can be overcome by using near() function instead of ==:
```{r}
near(sqrt(2) ^ 2, 2)
near(1 / 49 * 49, 1)
```
### Logical operators
Multiple comparisons within filter() function are combined with comma "," which means "and" (&).
In case of "and" all comparisons must evaluate to TRUE for observations to be returned.
Together, logical (boolean) operators are:
- & is AND,
- | is OR,
- ! is NOT
The following code finds all tests from "Tartu maakond" OR "Harju maakond":
```{r}
filter(covid_tests, County == "Tartu maakond" | County == "Harju maakond")
```
You can't write something like `filter(covid_tests, County == "Tartu maakond" | "Harju maakond")` and in case of numeric
variables this will give you wrong answer instead of Error, so be careful:
```{r, eval=FALSE}
filter(covid_tests, County == "Tartu maakond" | "Harju maakond")
```
A useful short-hand for this problem is x %in% y. This will select every row where x is one of the values in y:
Filter observations where County is "Tartu maakond" and "Harju maakond":
```{r}
filter(covid_tests, County %in% c("Tartu maakond", "Harju maakond"))
```
What happens under the hood:
```{r, eval=FALSE}
head(covid_tests$County) %in% c("Tartu maakond", "Harju maakond")
```
Remember that:
- !(x & y) is the same as !x | !y
- !(x | y) is the same as !x & !y
For example, if you wanted to find tests that are not from two oldest age groups -- "80-84" and "üle 85", you could use either of the following two filters:
```{r}
age_groups <- filter(covid_tests, !(AgeGroup == "80-84" | AgeGroup == "üle 85"))
```
```{r}
unique(age_groups$AgeGroup)
```
```{r}
!all(c("80-84", "üle 85") %in% age_groups$AgeGroup)
```
Previous expression can be rewritten like so:
```{r}
filter(covid_tests, !(AgeGroup == "80-84") , !(AgeGroup == "üle 85"))
```
OR
```{r}
filter(covid_tests, AgeGroup != "80-84" , AgeGroup != "üle 85")
```
Which one from these three is more explicit?
Note that comma here in filter function between logical evaluations means "&".
"Small than", "bigger than" and "NOT":
```{r}
3 >= c(2, 3, 4)
3 <= c(2, 3, 4)
3 != c(2, 3, 4)
```
### Missing values
One important feature of R that can make comparison tricky are missing values, or "NA"s ("not availables").
NA represents an unknown value so missing values are "contagious":
almost any operation involving an unknown value will also be unknown.
```{r}
NA > 5
10 == NA
NA + 10
NA / 2
```
As Rsudio already might suggest, if you want to determine if a value is missing, use is.na():
```{r}
x <- NA
is.na(x)
```
Let's use is.na() within filter to filter rows with missing "AgeGroup":
```{r}
filter(covid_tests, is.na(AgeGroup))
```
Ok. Now we got all rows with missing "AgeGroup", how would you change this code to really exclude these rows with missing data (Hint: !FALSE):
```{r}
filter(covid_tests, !is.na(AgeGroup))
```
There are other functions that remove rows with NA in any of the columns from your data frame and keep only "complete cases":
```{r}
na.exclude(covid_tests)
na.omit(covid_tests)
drop_na(covid_tests)
```
```{r}
drop_na(covid_tests)
```
```{r}
drop_na(covid_tests, AgeGroup)
```
### Finding non-exact matches
Often you find yourself in need to filter categorical variables based on some non-exact matching, for example when
- values are too long,
- there are too many unique unkown values,
- some observations that belong to the same category have slightly different values (`foo bar` and `foo-bar` and `foobar`).
To solve this problem more elegantly, you can use regular expressions. (for help see ?`regular expression`)
regular expressions are covered in *tidyverse* by the **stringr** package.
The stringr function that is useful within filter is str_detect().
```{r}
library(stringr)
fruits <- c("banana", "foo TRUMP bar", "foo bar", "foo-bar", "foobar", "foo")
str_detect(fruits, "foo(.+)?bar")
```
```{r}
str_which(fruits, "foo(.*)?bar")
```
```{r}
fruits[str_detect(fruits, "foo(.*)?bar")]
```
As you can see, str_detect() returns logical vector.
So how this should work when you run str_detect inside filter?
Tests from "Harju" county:
```{r}
filter(covid_tests, str_detect(County, "Harju"))
```
### Exercises
1. Find all tests that
- are done in 1 Aug 2021 and after:
```{r}
aug <- filter(covid_tests, StatisticsDate >= "2021-08-01")
min(aug$StatisticsDate)
max(aug$StatisticsDate)
```
- belong to AgeGroup *0-4*
```{r}
ageg <- filter(covid_tests, AgeGroup == "0-4")
unique(ageg$AgeGroup)
sum(is.na(ageg$AgeGroup))
```
- were released between 1 Jan 2021 - 1 Mar 2021, including these days:
```{r}
betw <- filter(covid_tests, StatisticsDate >= "2021-01-01", StatisticsDate <= "2021-03-01")
min(betw$StatisticsDate)
max(betw$StatisticsDate)
n_distinct(betw$StatisticsDate)
```
2. there is also between() function in dplyr. What does it do? How can you use it to find tests done in 2021 between week 8-10? Find "wk" and "yr" variables in data.
```{r}
filter(covid_tests, yr == 2021, between(wk, 8, 10))
```
## Arrange rows with arrange()
arrange() works similarly to filter() except that instead of selecting rows, it changes their order.
It takes a data frame and a set of column names to order by.
If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns:
```{r}
arrange(covid_tests, StatisticsDate)
```
Use desc() to re-order by a column in descending order. You can combine variables:
```{r}
arrange(covid_tests, AgeGroup, desc(StatisticsDate))
```
Missing values are always sorted at the end, even with desc() function:
```{r}
df <- tibble(x = c(5, NA, 2))
arrange(df, x)
```
```{r}
arrange(df, desc(x))
```
### Get unique rows with distinct()
Sometimes observations become duplicated during data wrangling, sometimes you need to get unique combinations of observations.
dplyr has distinct() function to retain only unique rows from input table.
covid_tests data has `r nrow(covid_tests)` rows, but what if we would like to get only unique wk and yr combinations..
First let's select organism_name and tax_id columns
```{r}
yrwk <- select(covid_tests, yr, wk)
yrwk
```
Here we are... with distinct organism_name tax_id combinations.
```{r}
distinct(yrwk)
```
distinct() works on tables
```{r}
?distinct
```
unique() function from base R works on vectors
```{r}
x <- rep(letters[1:3], each = 3) # simulate character vector x with repetitions
x
unique(x) # select unique set of values from x
```
### Exercises
1. How could you use arrange() to sort covid_tests with missing County values to the start? (Hint: use is.na()).
```{r}
```
2. Sort covid_tests to find most recent tests:
```{r}
```
## Select columns with select()
select() allows you to rapidly zoom in on a useful subset of columns using operations based on the names of the variables.
Select first three columns:
```{r}
select(covid_tests, 1:3)
```
Select columns from Gender to ResultValue and other way around:
```{r}
select(covid_tests, Gender:ResultValue)
select(covid_tests, ResultValue:Gender)
```
!!! select works in both directions: L>R and R<L!
Exlude column Country:
```{r}
select(covid_tests, -Country)
```
Another way, exclude as a vector, in case of vector use quoted variable names!:
```{r}
vars_out <- c("Country", "yr", "wk")
select(covid_tests, -vars_out)
```
> Use minus sign to exclude variables!
> Submit variables to select as character vector!
There are a number of __helper functions you can use within select()__:
- starts_with("abc"): matches names that begin with "abc".
Select columns that start with "bill"
```{r}
library(palmerpenguins)
select(penguins, starts_with("bill"))
```
- ends_with("xyz"): matches names that end with "xyz".
Select columns that end with "mm"
```{r}
select(penguins, ends_with("mm"))
```
- contains("ijk"): matches names that contain "ijk".
All columns that contain word "length"
```{r}
select(penguins, contains("length"))
```
- matches("(.)\\1"): selects variables that match a regular expression.
This one matches any variables that contain repeated characters. You'll learn more about regular expressions in strings.
```{r, eval = FALSE}
matches("^abc") # same as starts_with("abc")
matches("xyz$") # same as ends_with("xyz")
matches("ijk") # same as contains("ijk")
```
Select columns/variables with bill measurements:
```{r}
select(penguins, matches("bill.*mm"))
```
- num_range("V", 1:10000) matches V1, V2 and V3.
- everything() is useful if you have a handful of variables you'd like to move to the start of the data frame.
Move columns tax_id, size_kb and gc to the start of the data frame, and keep all other columns.
You can rearrange the order of columns, let's move date and test result in front.
```{r}
select(covid_tests, StatisticsDate, ResultValue, everything())
```
See ?select for more details.
### Exercises
1. What happens if you include the name of a variable (e.g. gc) multiple times in a select() call?
```{r}
```
2. What does the one_of() function do?
Why might it be helpful in conjunction with this vector?
```{r}
(vars <- c("StatisticsDate", "ResultValue", "flipper_length_mm"))
select(covid_tests, one_of(vars))
```
What happens if you try to select columns just by using this vector:
```{r, eval=FALSE}
select(covid_tests, c("StatisticsDate", "ResultValue", "flipper_length_mm"))
```
3. Select all variables from 'penguins' dataset that contain string 'FLIPPER' (note case!).
Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?
```{r}
?select
select(penguins, contains("FLIPPER", ignore.case = TRUE))
```