-
Notifications
You must be signed in to change notification settings - Fork 5
/
Copy path0002_tidyverse_datatable.Rmd
840 lines (608 loc) · 27.8 KB
/
0002_tidyverse_datatable.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
---
title: "data.table and tidyverse"
description: Learn to flexibly manipulate datasets with data.table and tidyverse (mostly dplyr)
output:
radix::radix_article:
toc: true
toc_depth: 3
editor_options:
chunk_output_type: console
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,
R.options = list(width = 80),
tidy = TRUE,
tidy.opts = list(width.cutoff = 80))
```
```{r wrap-hook, include=FALSE}
# function that adds knitr parameter to control line width
# https://github.com/yihui/knitr-examples/blob/master/077-wrap-output.Rmd
library(knitr)
hook_output = knit_hooks$get('output')
knit_hooks$set(output = function(x, options) {
# this hook is used only when the linewidth option is not NULL
if (!is.null(n <- options$linewidth)) {
x = knitr:::split_lines(x)
# any lines wider than n should be wrapped
if (any(nchar(x) > n)) x = strwrap(x, width = n)
x = paste(x, collapse = '\n')
}
hook_output(x, options)
})
```
Get source code for this RMarkdown script [here](https://github.com/hauselin/rtutorialsite/blob/master/0002_tidyverse_datatable.Rmd).
## Consider being a patron and supporting my work?
[Donate and become a patron](https://donorbox.org/support-my-teaching): If you find value in what I do and have learned something from my site, please consider becoming a patron. It takes me many hours to research, learn, and put together tutorials. Your support really matters.
## Load packages/libraries
Use `library()` to load packages at the top of each R script.
```{r loading packages, results="hide", message=FALSE, warning=FALSE}
library(data.table); library(tidyverse)
library(hausekeep)
# if you don't have my package and want to use it, you'll have to install it from github
# devtools::install_github("hauselin/hausekeep") # you might have to install devtools first (see above)
```
## Reading data into R
Read file in a directory and save the data as an object in the environment by using the assignment `<-` operator. `
If you don't have the dataset, right click [**here**](https://raw.githubusercontent.com/hauselin/rtutorialsite/master/sleep.csv) to download and save `sleep.csv` dataset. If you're following the tutorial step by step, you should also create a data folder in your current folder, and put the sleep.csv file inside the data folder.
```{r read data, results="hide", message=FALSE, warning=FALSE}
df1 <- read.csv("./data/sleep.csv") # base R read.csv() function
# same as df1 <- read.csv("data/sleep.csv")
# READ: assign the output return by read.csv("data/sleep.csv") into df1
df2 <- fread("./data/sleep.csv") # fread() from library(data.table)
# my favorite is fread from data.table
df3 <- fread("data/sleep.csv") # same as above
# or download data from website directly
url <- "https://raw.githubusercontent.com/hauselin/rtutorialsite/master/data/sleep.csv"
df_url <- fread(url)
```
<aside>
The `sleep` dataset is actually a built-in dataset in R. Try typing `sleep` in your console and `?sleep` for more info on this dataset. R has a lot of built-in datasets; type `data()` in the console to see what dataests are available.
</aside>
I **always** use `fread()` from the `data.table` to read data now. It's super intelligent and fast (reads gigabytes of data in just a few seconds).
The `.` in the file path simply refers to the current working directory, so it can be dropped. And `..` can be used to refer to the parent directory. If your current directory is `/home/desktop`, then `.` refers to just that, and `..` refers to the parent directory of your current directory, which is the `/home` directory.
## Comparing the outputs of read.csv(x) and fread(x)
```{r}
df1 # output of read.csv("./data/sleep.csv")
class(df1) # read.csv("./data/sleep.csv")
```
How's the output different from the one above?
```{r}
df3 # output of fread("./data/sleep.csv")
class(df3) # fread("./data/sleep.csv")
```
How's the output different from the two outputs above?
### Reading URLs and other formats
Check out the csv (comma separated values) data [here](https://raw.githubusercontent.com/hauselin/rtutorialsite/master/data/sleep.csv). You can read data directly off a website.
Most of these read functions can import/read different types of files (e.g., csv, txt, URLs) as long as the raw data are formatted properly (e.g., separated by commas, tabs). But if you're trying to read proprietary data formats (e.g., SPSS datasets, Excel sheets), you'll need to use other libraries (e.g., `readxl`, `foreign`) to read those data into R.
```{r}
df_url <- fread("https://raw.githubusercontent.com/hauselin/rtutorialsite/master/data/sleep.csv")
df_url # print data to console; same dataset fread("data/sleep.csv")
```
## Summarizing objects
You can summarize objects quickly by using `summary()`, `str()`, `glimpse()`, or `print(x, n)`.
To view the first/last few items of an object, use `head()` or `tail()`.
```{r}
df3 # see entire data.table
summary(df3) # we use summary() for many many other purposes
str(df3)
glimpse(df3)
head(df3)
head(df3, n = 3) # what does this do?
tail(df3, n = 2)
dim(df3) # dimensions (row by columns)
```
Use pipes `%>%` to summarize objects
```{r}
df3 %>% head(n = 2)
df3 %>% head(2) # why does this work?
df3 %>% summary() # does this work? why?
```
## Using `$` and `[]` to extract elements using their names
```{r}
names(df3)
df3$extra # extracts column/variable as a vector
df3$group
df3$ID
```
```{r}
# create a list with named items a, b, c
myList <- list(a = -999, b = c(TRUE, FALSE, T, T), c = c('my_amazing_list'))
class(myList)
names(myList)
myList # note the structure of a list ($ signs tell you how to get items)
myList$a
myList$b
myList$c
```
<aside>
Lists can store objects of different classes (unlike most other objects, which expect all items to be of same class).
</aside>
```{r}
# same as df1$extra, but use characters (in '') to extract elements
df1['extra']
```
** BUT the syntax above only works for the `data.frame` class!**
```{r, error=TRUE}
df3['extra'] # fails!!!
```
If it's a `data.table` class, you do it differently (so know the classes of your objects)! But more on `data.table` later on.
```{r}
df3[, 'extra'] # df3[i, j] (i is row, and j is column)
# df3$extra also works for data.table
```
## Writing/saving dataframes or datatables as csv files
**IMPORTANT**: The functions below overwrite any existing files that have the same name and you can't recover the original file if you've overwritten it!
```{r, eval=FALSE}
# save data to your working directory
# data.table's fwrite
fwrite(df3, 'example1_df3.csv')
# save data to the data inside your current directory, assuming it exists!
fwrite(df3, 'data/example1_df3.csv')
```
The more common base R function is `write.csv()` but I never use it now and always use `fwrite` from `data.table`, which works for any type of dataframe you want to save (not limited to `data.table`).
Here's the base R function. Note that you'll have to specify `row.names = F` to prevent `write.csv()` from adding an extra column with row numbers when it saves the data (another reason to use `fwrite` instead).
```{r, eval=FALSE}
# saves in your working directory
write.csv(df3, 'example1_df3.csv', row.names = F)
# saves in your data directory (assumes data directory exists!)
write.csv(df3, './data/example2_df3.csv', row.names = F)
```
## tidyverse: a collection of R packages
[tidyverse:](https://www.tidyverse.org/packages/)
> The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
Included packages: `ggplot2`, `dplyr`, `tidyr`, `stringr` etc. [see official website for documentation](https://www.tidyverse.org/packages/)
```{r}
library(tidyverse)
```
I'm running `library(tidyverse)` here for educational purposes; you only need to load packages **once** per R session you're running (we already loaded our packages at the top of this script).
### Manipulating datasets with `dplyr` (a package in tidyverse)
Read in data from a csv file (stored in "./data/simpsonsParadox.csv"). Right-click to download and save the data [here](https://raw.githubusercontent.com/hauselin/rtutorialsite/master/data/simpsonsParadox.csv) (you can also use the `fread()` function to read and download it directly from the URL; see code below)
* `fread()`: a function from `data.table` (fast-read, hence fread) that is **VERY** fast and powerful, and much better than `read.csv()` or `read.table()` from base R
```{r}
df4 <- fread("./data/simpsonsParadox.csv")
# or download data directly from URL
url <- "https://raw.githubusercontent.com/hauselin/rtutorialsite/master/data/simpsonsParadox.csv"
df4 <- fread(url)
class(df4) # note it's a data.table AND a data.frame
dim(df4) # no. of rows and columns
```
```{r}
glimpse(df4) # have a glimpse of the data (quick summary of data and column classes)
# str(df4) # similar to glimpse
```
`glimpse()` summarizes your data and tells you how many rows/columns you have and the class of each column.
* `dbl`: double (a type of number with decimal places)
* `chr`: character/string
### Select columns/variables with `select()`
Select with names
```{r}
select(df4, iq) # just prints the output to console without saving
df4_iq <- select(df4, iq) # if you want to save as a new object
df4_iq # print df4_iq
```
```{r}
# select without assigning/saving output to a variable/object
select(df4, class, grades)
select(df4, iq, grades)
```
Select multiple columns in sequence with :
```{r}
select(df4, iq:class)
```
Select with numbers
```{r}
select(df4, 1, 3)
select(df4, 1:3) # what does 1:3 do? run 1:3 in your console
```
How can we reorder columns with `select()`?
```{r}
select(df4, 3:1) # column 3, then 2, then 1
```
Select with `starts_with()` or `ends_with()`
```{r}
select(df4, starts_with("c"))
select(df4, starts_with("g"))
select(df4, starts_with("g"), ends_with("s"))
```
Dropping columns with `-`
```{r}
select(df4, -grades) # what should you get?
select(df4, -ends_with("s")) # what should you get?
select(df4, -ends_with("s"), class, -1) # what should you get?
```
Renaming while selecting columns/variables
```{r}
select(df4, intelligence = iq) # select iq and rename it to intelligence
```
Other options for `select()` include `matches()`, `contains()`. For more information, see tutorial/vignette [here](https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html). For official documentation, see [here](https://cran.r-project.org/web/packages/dplyr/dplyr.pdf).
### Select rows with `slice()`
```{r}
slice(df4, 1:5) # rows 1 to 5
slice(df4, c(1, 3, 5, 7, 9)) # rows 1, 3, 5, 7, 9
slice(df4, seq(from = 1, to = 10, by = 2)) # same as above, but using sequence function (from 1 to 10, by/in steps of 2)
slice(df4, -c(1:39)) # remove rows 1 to 39
```
### Filtering or subsetting data/rows with `filter()`
While `select()` acts on columns, `filter()` acts on rows. It chooses/subsets rows based on criteria you specify.
How many classes are there in this dataset? How many unique classes?
```{r}
df4$class
unique(df4$class) # unique classes
df4$class %>% unique() # same as above but with pipes
```
Filter rows that match one criterion
```{r}
filter(df4, class == "a") # how many rows of data do we have now?
filter(df4, class == 'b') # R accepths single or double quotations
df4_classA <- filter(df4, class == 'a') # to save filtered data as an object
df4_classA
```
Filter rows that match multiple criteria with `%in%`
Let's say you want to get the rows where the `class` variable is `a` or `b`. You might write the following:
```{r, eval=FALSE, error=TRUE}
filter(df4, class == c("a", "b"))
```
What's wrong? Look at the output and compare with `filter(df4, class == "a")` and `filter(df4, class == "b")`. How many rows should you expect from `filter(df4, class == c("a", "b"))`? How many rows did you get?
Here's how to do it correctly. You use `%in%` if you want to match multiple criteria. `==` only works if you're matching by just ONE criterion
```{r}
filter(df4, class %in% c("a", "b")) # check number of rows of output
```
Filter by range
```{r}
filter(df4, iq > 120) # iq greater than 120
filter(df4, iq >= 115) # iq greater than or equal to 120 (> and =, without space in between)
filter(df4, grades < 20) # grades less than 20
filter(df4, grades <= 30) # grades less than 30
filter(df4, grades <= 30 & iq > 120) # the intelligent but lazy students
```
Filter for values between a range with `%between%`
```{r}
filter(df4, grades %between% c(40, 45)) # grades between 40 and 45
filter(df4, between(grades, 40, 45)) # same as above
```
If you read the documentation for `?between`, you'll learn that `%between%` and `between` are shortcuts for `x >= left & x <= right`.
```{r}
filter(df4, grades >= 40 & grades <= 45)
```
Filter through negating `!`
```{r}
filter(df4, class != "a") # class IS NOT "a" (! and =, without space in between)
```
How would you remove values that match "a" and "b"?
```{r, eval=FALSE, warning=FALSE, message=FALSE, echo=TRUE}
# the following is WRONG so I'm not running it, but try it yourself
filter(df4, class !%in% c("a", "b")) # syntax error...
```
<aside>
RStudio underlines potential errors for you (like Microsoft Word highlighting typos and grammar errors).
</aside>
```{r}
filter(df4, !class %in% c("a", "b")) # correct syntax
```
### Operators
`==`, `%in%`, `%between%`, `>`, `>=`, `<`, `<=`, `!=` are operators. See [here](https://www.statmethods.net/management/operators.html) for more information.
### Rename variables/columns with `rename()`
```{r}
df4_renamed <- rename(df4, intelligence = iq, examGrades = grades, classroom = class) # rename(data, newname = oldname)
df4_renamed
glimpse(df4_renamed)
names(df4_renamed)
```
### Create new columns/variables with `mutate()`
```{r}
# no assignment to object/variable, so only the output is printed; nothing's saved
mutate(df4,
iqMeanCentered = iq - mean(iq, na.rm = T), # substract mean iq from each iq
gradesRound0dp = round(grades)) # round grades
```
```{r}
mutate(df4, student_number = 1:nrow(df4)) # assign row/student number
```
```{r}
nrow(df4) # number of rows; ncol() gives you number of columns
1:nrow(df4) # generate sequence of numbers from 1 to number of rows of dataset
```
```{r}
mutate(df4, student_number = 1:nrow(df4))
```
### Sorting or arranging data rows with `arrange()`
```{r}
arrange(df4, iq) # ascending order of iq
arrange(df4, iq, grades) # ascending order of iq then grades
arrange(df4, desc(iq)) # descending order of iq
arrange(df4, desc(class)) # descending order of class
```
`order()` is another function (base R function) people often use to arrange rows. See [here](https://www.statmethods.net/management/sorting.html) for more information.
### Combining functions with pipes `%>%`
```{r}
df4 # reminder what the dataset looks like
```
You can combine multiple functions with `%>%`. After adding each function/line, you can check your output before you add the next function/line. This way, you can build really complicated and long code/syntax without nesting functions!
NOTE that comments # are placed AFTER the pipe symbol!
```{r}
df4 %>%
mutate(gradesRound = round(grades), studentNo = 1:nrow(df4)) %>% # round grades, add subject number: .N is a shortcut for nrow(df4)
select(-grades) %>% # remove original grades variable
select(studentNo, class, iq, gradesRound) %>% # reorder columns
rename(intelligence = iq, examGrades = gradesRound, classroom = class) %>% # rename variables
filter(intelligence %between% c(90, 105)) # select only those with intelligence between 90 and 105
```
The code above doesn't save the output as an object. Below we save it as an object.
```{r}
df5 <- df4 %>%
mutate(gradesRound = round(grades), studentNo = 1:nrow(df4)) %>% # round grades, add subject number: .N is a shortcut for nrow(df4)
select(-grades) %>% # remove original grades variable
select(studentNo, class, iq, gradesRound) %>% # reorder columns
rename(intelligence = iq, examGrades = gradesRound, classroom = class) %>% # rename variables
filter(intelligence %between% c(90, 105)) # select only those with intelligence between 90 and 105
df5 # check output
```
You can even pipe the output to any other functions (e.g., plotting functions)
```{r}
df4 %>%
mutate(gradesRound = round(grades), studentNo = 1:nrow(df4)) %>% # round grades, add subject number: .N is a shortcut for nrow(df4)
select(-grades) %>% # remove original grades variable
select(studentNo, class, iq, gradesRound) %>% # reorder columns
rename(intelligence = iq, examGrades = gradesRound, classroom = class) %>% # rename variables
filter(intelligence %between% c(80, 115)) %>% # select only those with intelligence between 80 and 115
ggplot(aes(x = intelligence, y = examGrades)) + # note the + sign! (ggplot uses + sign)
geom_point() + # add each data point
geom_smooth(method = 'lm', se = T) # fit regression line with standard error (se = TRUE)
```
Higher intelligence, worse grades? What's going on? We figure out why later on. And more on `ggplot2` package in future tutorials.
### Compute summary statistics with `summarize()` or `summarise()`
```{r}
df5 %>%
group_by(classroom) %>% # group by classroom
summarise(iqMean = mean(intelligence, na.rm = T))
```
```{r}
df5 %>%
group_by(classroom) %>% # group by classroom
summarize(iqClassMean = mean(intelligence, na.rm = T),
examGradesClassMean = mean(examGrades, na.rm = T))
```
Same code but with original dataset (dimensions: 40 x 3)
```{r}
df4 %>%
group_by(class) %>% # grouping by class
summarise(iqClassMean = mean(iq, na.rm = T),
examGradesClassMean = mean(grades, na.rm = T))
```
Group by multiple variables/conditions
Randomly generate gender of student for each row of data with `sample()`
```{r}
sample(x = c("female", "male"), size = 40, replace = T) # what is this doing
```
```{r}
df4$gender <- sample(x = c("female", "male"), size = 40, replace = T)
# df4 <- mutate(df4, gender = sample(x = c("female", "male"), size = .N, replace = T)) # save output as above
```
Because the gender labels are generated randomly, you'll get different values each time you re-run the code.
```{r}
df4
```
Compute mean for each class by gender
```{r}
df4 %>%
group_by(class, gender) %>% # group by class then gender
summarise(iqClassMean = mean(iq, na.rm = T),
examGradesClassMean = mean(grades, na.rm = T))
```
## More `dplyr` and `tidyverse` information
For much more information, see the following sites
* [dplyr tutorial/vignette](https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html)
* [official dplyr documentation](https://cran.r-project.org/web/packages/dplyr/dplyr.pdf)
* [tidyverse](https://www.tidyverse.org/)
## Supercharging your workflow with data.table()
While the syntax of `tidyverse` and `dplyr` functions are really easy to understand, they sometimes can be quite long-winded. Using pipes `%>%` makes your code readable, but is a bit long to read sometimes. Now we'll see how `data.table()` can shorten all that code while maintaining readability. Also, `data.table()` is MUCH faster, which is especially useful when dealing with bigger datasets (hundreds of MBs and GBs and even TBs).
If you use `fread('filename')` to read your dataset into R, then your object is already a `data.table`. Check it with `class(objectName)`.
```{r}
df4
class(df4)
df1
class(df1)
```
If your object isn't a `data.table`, you can convert it to one using `setDT()`.
```{r}
setDT(df1) # setDT() also works (and works without reassignment: no need to use <-)
class(df1) # but setDT() doesn't convert your dataset to a tibble class at the same time
```
### `data.table()` basics: [i, j, by]
`data.table` uses a special but extremely concise syntax that only works with objects that have the `data.table` class associated with them. If you try to use this special syntax on other classes, you'll screw up big time. So check your class or try to convert to or use `data.table` whenever possible!
data.table[i, j, by]
* i: row (lets you perform row operations like `filter()` and `slice()`)
* j: column (lets you perform column operations like `select()` and `summarize()` and `mutate()`)
* by: group by (equivalent to `group_by()`)
### Filter data.table via i
```{r}
df4
class(df4) # is it a data.table?
```
Different ways to filter via i
```{r}
df4[i = gender == 'female',] # just female (j, by are NULL)
df4[gender == 'female',] # i parameter is not required
df4[i = gender != 'female',] # not female
df4[gender != 'female',]
df4[grades > 85,] # rows where grades > 85
# same as filter(df4, grades > 85), but much more concise
df4[iq >= 123 & grades < 50] # smart AND failed
df4[iq >= 123 | grades < 50] # smart OR failed
```
Slice (select rows) with indices via i
```{r}
df4[1:3] # rows 1 to 3
df4[35:.N] # rows 35 to last row
```
`.N` is a shortcut for the last index. If a dataset has 50 rows, the `.N` has the value 50.
```{r}
df4[.N] # last row (.N has the value 40 for our dataset)
df4[40] # don't believe? try it
df4[(.N-3):.N] # last four rows
df4[(40-3):40] # same
df4[37:40] # same
```
### Selecting columns via j
```{r}
df4[, j = grades] # vector
df4[, grades] # same as above, but note the comma, which indicates j via position (i is before the first comma)
class(df4[, grades]) # not a data.table!
df4$grades # vector (same as above)
class(df4$grades) # not a data.table!
```
How to select columns and keep them as data.table?
```{r}
df4[, .(grades)] # output looks like a table (tibble + local data table)
class(df4[, .(grades)]) # still a data.table!
```
```{r}
df4[, j = .(grades, gender, iq)] # select multiple columns
df4[, .(grades,gender, iq)] # same as above and we often omit j =
# select(df4, grades, gender, iq) # same as above but with select()
df4[, grades:gender] # select grades to gender
# select(df4, grades:gender) # same as above but with select()
# df4[, .(grades:gender)] # this version doesn't work at the moment
```
```{r}
df4[, c(2, 3, 4)] # via column index/number
df4[, -c(2, 3, 4)] # via column index/number (minus/not columns 1, 3, 4)
df4[, 1:3] # via column index/number (1 to 3)
```
Other ways to select columns
```{r}
df4[1:4, "grades"] # rows 1 to 4; column grades
df4[c(2, 5, 8), c("grades", "iq")] # rows 2, 5, 8; column grades and iq
```
Column names are stored in an object
```{r, error=TRUE}
cols <- c("gender", "class")
df4[, cols] # doesn't work!
df4[, ..cols] # works! (special syntax)
```
### Chaining with `data.table` ("piping")
```{r}
df4[1:5, 1:3][grades < 80, ][iq > 95, ] # data.table chaining (or piping)
df4[1:5, 1:3] %>% filter(grades < 80) %>% filter(iq > 95) # same result as above
```
### Summarize data via j
```{r}
# compute grand mean iq and rename variable as iq_grand_mean
df4[, j = .(iq_grand_mean = mean(iq, na.rm = T))]
df4[, .(iq_grand_mean = mean(iq, na.rm = T))] # also works
# also works, but no renaming and returns a vector (not a data.table!)
df4[, mean(iq, na.rm = T)]
```
Compare output with `summary()`
```{r}
summary(df4) # check mean
```
What about other statistics and variables? Standard deviation?
```{r}
df4[, .(iq_grand_mean = mean(iq, na.rm = T),
iq_sd = sd(iq, na.rm = T),
grades_grand_mean = mean(grades, na.rm = T),
grades_sd = sd(grades, na.rm = T))]
```
Extra stuff...
```{r}
# standard way to fit regression models
lm(formula = grades ~ iq, data = df4) # y predicted by x (grades predicted by iq)
# fit linear regression (lm) inside data.table
df4[, lm(formula = grades ~ iq)]
```
Again, note the negative relationship between iq and grades. We'll explore why in future tutorials.
The point here is to show how powerful `j` is in `data.table`. `j` accepts any function. You can't use this syntax if your object is not a `data.table`.
```{r}
df4[, summary(lm(formula = grades ~ iq))] # more extensive output
# summary(lm(formula = grades ~ iq, data = df4)) # standard way to fit regression models
```
Or use the `summaryh()` function in the `hausekeep` package to get APA-formatted results and effect size estimates.
If you don't have my package and want to use it, you'll have to install it from github. Run `devtools::install_github("hauselin/hausekeep")` in your R console (you might have to install devtools `install.packages(devtools)` first).
```{r}
df4[, summaryh(lm(formula = grades ~ iq))]
# summaryh(lm(formula = grades ~ iq, data = df4)) # standard way to fit regression models
```
### Compute summary statistics and apply functions to j by groups
What if we want the mean iq and grades for each class? Here is where `data.table` is much more concise than `dplyr` and `tidyverse`.
`data.table` syntax
```{r}
df4[, .(iqMean = mean(iq, na.rm = T)), by = class]
# omitting by also works, by not recommended because it makes your code very hard to read
df4[, .(iqMean = mean(iq, na.rm = T)), class] # not recommended
```
`dplyr` syntax
```{r}
df4 %>%
group_by(class) %>%
summarize(iqMean = mean(iq, na.rm = T))
```
Summarize by class and gender
```{r}
df4[, .(iqMean = mean(iq, na.rm = T)), by = .(class, gender)]
df4[, .(iqMean = mean(iq, na.rm = T)), keyby = .(gender, class)] # summarize and sort/arrange by class then gender
```
Summarize by booleans
```{r}
df4[, .(iqMean = mean(iq, na.rm = T)), by = .(gender == "male")]
df4[, .(iqMean = mean(iq, na.rm = T)), by = .(gender == "male", class == "a")]
```
Combining pipes with `data.table` and `ggplot`
```{r}
df4[, .(iqMean = mean(iq, na.rm = T)), .(class, gender)] %>% # compute class/gender mean
ggplot(aes(class, iqMean, col = gender)) + # plot mean values
geom_point(position = position_dodge(0.5)) # plot points and dodge points to avoid overlapping
```
Extra cool stuff again...
* Fit model to entire dataset (grades ~ iq) and use summaryh to summarize model results
```{r}
df4[, coef(lm(grades ~ iq)), by = class] # get the coefficients for each class
df4[, summaryh(lm(grades ~ iq))]
df4[, summaryh(lm(grades ~ iq)), by = class] # fit model to each class separately
```
What we fit just one model to all the data (all 40 rows), what's the relationship between iq and grades? Positive or negative?
And what happens when we fit the model to each class separately, what's the relationship between iq and grades? Positive or negative? We'll explore these relationships in depth in future tutorials.
### Creating new variables/columns and reassigning in data.tables with `:=`
```{r}
df4[, class := toupper(class)] # convert to upper case
head(df4) # print the first few rows of df4
df4[, class := tolower(class)] # convert to lower case
head(df4)
df4[, sex := gender] # make a copy of column
# same as df4$sex <- df4$gender
head(df4)
df4[, sex := substr(sex, 1, 1)] # take only first character
# same as df4$sex <- substr(df4$sex, 1, 1)
head(df4)
df4[, iqCopy := iq]
head(df4)
df4[iqCopy < 100, iqCopy := NA] # convert values less than 100 to NA
head(df4)
df4[is.na(iqCopy)] # filter via i (show only rows where iqCopy is NA)
# same as filter(df4, is.na(iqCopy))
```
```{r, error=TRUE, eval=FALSE}
df4[iqCopy == NA] # DOESN'T WORK!!! use is.na()
```
Remove a column by assigning a `NULL` to that column
```{r}
df4[, iqCopy := NULL]
# same as df4$iqCopy <- NULL
glimpse(df4)
df4[, sex := NULL]
glimpse(df4)
```
### Renaming with `setnames()`
You don't need to reassign with <- if you use `setnames()`! See `?setnames` to see how this function works.
```{r}
setnames(df4, "iq", "intelligence") # setnames(datatable, oldname, newname)
# if you rename all variables, you don't need to provide the oldname
setnames(df4, c("intelligence", "scores", "classroom", "sex"))
```
## More `data.table` information
For more `data.table` information and tips and tricks, google for them...
* [tutorial/vignette](https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html)
* [official documentation](https://cran.r-project.org/web/packages/data.table/data.table.pdf)
* [my own collection of resources](https://hausetutorials.netlify.com/posts/2019-04-11-datatable-resources/)
## Support my work
[Support my work and become a patron here](https://donorbox.org/support-my-teaching)!