-
Notifications
You must be signed in to change notification settings - Fork 5
/
Copy path0001_rbasics.Rmd
522 lines (358 loc) · 19.9 KB
/
0001_rbasics.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
---
title: "Programming and data science basics"
description: R, programming, and data science basics
output:
radix::radix_article:
toc: true
toc_depth: 3
editor_options:
chunk_output_type: console
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,
R.options = list(width = 80),
tidy = TRUE,
tidy.opts = list(width.cutoff = 80))
```
```{r wrap-hook, include=FALSE}
# function that adds knitr parameter to control line width
# https://github.com/yihui/knitr-examples/blob/master/077-wrap-output.Rmd
library(knitr)
hook_output = knit_hooks$get('output')
knit_hooks$set(output = function(x, options) {
# this hook is used only when the linewidth option is not NULL
if (!is.null(n <- options$linewidth)) {
x = knitr:::split_lines(x)
# any lines wider than n should be wrapped
if (any(nchar(x) > n)) x = strwrap(x, width = n)
x = paste(x, collapse = '\n')
}
hook_output(x, options)
})
```
Get source code for this RMarkdown script [here](https://github.com/hauselin/rtutorialsite/blob/master/0001_rbasics.Rmd).
## Consider being a patron and supporting my work?
[Donate and become a patron](https://donorbox.org/support-my-teaching): If you find value in what I do and have learned something from my site, please consider becoming a patron. It takes me many hours to research, learn, and put together tutorials. Your support really matters.
## What is data science?
* Cleaning, wrangling, and munging data
* Summarizing and visualizing data
* Fitting models to data
* Evaluating fitted models
## Setting up: Installing R packages/libraries `install.packages()`
Use the `install.packages()` function to install packages from CRAN (The Comprehensive R Archive Network), which hosts official releases of different packages (also known as libraries) written by R users (people like you and I can write these packages). For more info, see this [tutorial on installation](0000_install.html).
Install packages all at once, calling `install.packages()` function just once, using the `c()` (combine/concatenate) to combine all your package names into one big vector (more on what vectors and classes are later).
<aside>
Install packages once and you'll have them on your computer. You only need to update them regularly in the future. No need to rerun `install.packages()` every time you want to use these packages.
</aside>
```{r installing packages2, eval=FALSE}
install.packages(c("data.table", "tidyverse"))
```
## Using/loading R packages when you begin a new RStudio session `library()`
Use `library()` to load packages and use semi-colon (;) to load multiple packages in the same line. I always load the packages below whenever I start a new RStudio session. Sometimes you'll see people using `require()` instead of `library()`. Both works!
```{r loading packages, results="hide", message=FALSE, warning=FALSE}
library(data.table); library(tidyverse)
```
## Changing R default option settings
I also strongly recommend changing a few default R options. Click on **RStudio -> Preferences -> General Tab** and (un)check the boxes below.
* You might want to change your default working directory at the top (mine is `~Dropbox/Working Datasets`). This directory will be where RStudio saves all your work automatically if you don't manually specify/change your working directory later on (more on directories later on).
* By default, RStudio reloads your previously saved work whenever you reopen it, which can often be disastrous (just like you might not want Microsoft Word to always reopen the document you last worked on every single time you open it). So we are disabling (unchecking) relevant features (**uncheck** the highlighted box below and select **Never**).
![Changing RStudio preferences](./attachments/R_preference_settings.png)
## Working/current directory: Where are you and where should you be? `getwd()`
The **working directory** (also known as current directory) is the folder (also known as directory) you're currently in. If you're navigating your computer with your mouse, you'll be clicking and moving into different folders. Directory is just a different way of saying 'location' or where you're at right now (which folder you're currently in).
```{r}
getwd() # prints in the console your current directory
```
<aside>
("getwd" stands for get working directory)
</aside>
The path above tells you where your working directory is now. It's conceptually equivalent to you opening a window and using your mouse to manually navigate to that folder.
The `[1]` in the output refers to the element number of your output.
To change your working directory (to where your project folder is located), use the `setwd()`. This function is easy to use, but the difficulty for most beginners is getting the path to your folder (e.g., `"/Users/Hause/Dropbox/Working Projects/RDataScience"`) so you can specify something like `setwd("/Users/Hause/Dropbox/Working Projects/RDataScience")`.
### Two ways to change/set your working directory (both uses `setwd()`).
1. Go to your menubar (at the top). Click **Help** and search for **set working directory**. RStudio will tell you how to do it via the **Session** menu. Select **Set Working Directory** and **Choose Directory**. Then navigate to your project directory and you're done.
![Set working directory via menubar](./attachments/setdirectory_01.png)
2. On one of the RStudio panes, you'll see a **Files** tab (click on it). Use that pane to navigate to your project directory. When you're there, click **More** and **Set As Working Directory**.
![Set working directory via Files panel](./attachments/setdirectory_02.png)
Whether you choose method 1 or 2, you should see your new directory being set in the console panel, which should look something like `> setwd("your/path/is/here")`. **COPY AND PASTE THIS OUTPUT (but without the >) to your current script**. So you should be copying something like this:
```{r set working directory, eval=FALSE}
setwd("your/path/is/here")
```
## Getting help via ? or `help()`
To get help and read the documentation for any function, use `?` or the `help()` function. Equivalent ways to get help:
```r
# get help for mean function
help(mean)
?mean # also works
?mean() # also works
# 3 ways to get help for setwd function
?setwd
?setwd()
help(setwd)
```
Beginners will often find the **Examples** section of the documentation (at the bottom of the document) most useful. Try copying and pasting the code from that section into your script to see how that function works.
![Learning to use documentation](./attachments/read_doc.png)
### Which package does a function come from?
If you ask for help using `?` or `help()`, at the top left corner of the documentation, you'll see something that looks like `functionName{anotherName}`, the name within the `{}` tells you which package a particular function comes from.
![Learning to read documentation](./attachments/read_doc2.png)
```r
?lm # linear regression, lm{stats} (comes from stats package, which comes with R)
?mean # mean, mean{base} (comes from base package, which comes with R)
```
To explicitly specify which package to use, you use the following syntax: `package::function`. Usually we don't need to explicitly specify which package unless different packages have functions with the same names.
```{r}
mean(c(1, 2, 3)) # compute mean
base::mean(c(1, 2, 3)) # same as above (explicitly specifies the base package)
# fit linear model/regression function lm(y ~ x, data) (built-in dataset mtcars)
lm(mpg ~ cyl, data = mtcars)
stats::lm(mpg ~ cyl, data = mtcars) # (explicitly specifies the stats package)
```
## Objects, variables, and classes
Objects are 'things' in your environment. Just like physical objects (e.g., jeans, frying pan) in your physical environment.
Different objects have different **properties**, and thus belong to different categories/classes: Jeans belong to clothes and frying pan belongs to utensils. Same with programming language objects: different objects in your environment belong to different categories/classes (often also known as **type**).
Different categories/classes have different **properties** and **actions** associated with them. You wear your jeans but not your frying pan; you can fry eggs with your pan but not your jeans. Same with different programming objects, which includes **vectors, lists, dataframes, datatables, matrices, characters (also known as strings), numerics, integers, booleans** (and the list goes on).
Different programming objects/classes/categories have different **properties** and **actions** associated with them, just like things in everyday life. Birds have wings (a property/feature) and can fly (an action).
You can check the class/category/type of an object with the `class()` function.
```{r}
class(mean)
```
**Note (courtesy of John Eusebio)**: You can fry your jeans on your frying pan. It's just not recommended. Same thing with some R functions. You can do some things, like use a for loop to iterate through and operate on every element of a matrix, but you shouldn't. There are tools that are made to do stuff like matrix operations much more efficiently and effectively.
Creating objects (specifically vectors) with the `<-` assignment operator
Keyboard shortcut for `<-` is `Alt -`
```{r}
variable1 <- 10 # assign/store the value 10 in variable1
variable2 <- 2000
v3 <- variable1 + variable2 # add variable1 to variable2
variable1; variable2; # print variables to console
v3 # print variables to console
print(v3) # same as above v3 (print explicitly prints output to console)
```
Check your panes and click on the **Environment** tab. What do you see there right now? What's new?
![Creating new variables in the environment](./attachments/environment_pane.png)
Note that R variable names can only begin with characters/letters, not numbers or other symbols.
```{r}
class(variable1)
class(variable2)
class(v3)
```
```{r}
v4 <- c(1, 2, 3, 4, 5) # 'c' stands for concatenate or combine
v4 # prints v4, which is a vector
```
What does `c()` do?
**Vectors** are objects that store data of the same **class**
`c()` combines values of the same class into a vector (or list).
<aside>Like how you put all your clothes (one category/class of objects) into your wardrobe (vector)!</aside>
```{r}
v4
class(v4)
```
```{r}
roomsInHouse <- c("Kitchen", "Bedroom") # a vector with all characters
roomsInHouse
```
Note that all the values in `roomsInHouse` are in quotation marks "", meaning that the values are all characters (a category or class of objects in R).
```{r}
class("Date of Birth")
```
```{r}
class(12121999)
```
```{r}
mixedClasses <- c("Date of Birth", 12121999)
mixedClasses # what class is this? why?
```
```{r}
class(mixedClasses)
```
Why is it character?
```{r}
sleep # dataframe that is built into R (comes with any R installation)
# for more datasets that came with R, type data()
```
```{r}
class(sleep)
```
Booleans (class called "logical"): `TRUE` or `FALSE` values. `TRUE` is actually coded as 1 and `FALSE` coded as 0. Must be all upper-case (`TRUE`, `T`, `FALSE`, `F`). Lower-case doesn't work!
```{r}
booleanExample <- c(T, F)
class(booleanExample) # true
class(c(TRUE, FALSE))
class(F) # false
class(c(T, F))
```
### Side note: variable naming conventions
Also, you can use different variable naming conventions. I use both the snake case and camel case (just two of the many conventions).
* [snake case](https://en.wikipedia.org/wiki/Snake_case): `my_variable`, `rooms_in_house`, `model_output` (looks like a snake)
* [camel case](https://en.wikipedia.org/wiki/Camel_case): `myVariable`, `roomsInHouse`, `modelOutput` (looks like a camel)
![Camel case picture from [Wikipedia](https://en.wikipedia.org/wiki/Camel_case)](./attachments/camelcase.png)
## Indices and indexing with [i, j]
| | Column 1 | Column 2 | Column 3 |
| :-------: | :----------: | :----------: | :----------: |
| **Row 1** | i = 1, j = 1 | i = 1, j = 2 | i = 1, j = 3 |
| **Row 2** | i = 2, j = 1 | i = 2, j = 2 | i = 2, j = 3 |
| **Row 3** | i = 3, j = 1 | i = 3, j = 2 | i = 3, j = 3 |
i: row index, j: column index
Index is just a fancy way of saying numbering or counting. `R` begins counting from 1 (i.e., indexing begins from 1), whereas many other languages like Python and JavaScript begin from 0.
```{r}
# create a matrix with values 10, 20, 30, 40, and make it 2 rows long
example_matrix <- matrix(c(10, 20, 30, 40), nrow = 2)
example_matrix
```
To select specific values in the matrix, use the [i, j] syntax, where i refers to the row number and j refers to the column number.
```{r}
# what does this return? ('return' is a way to say "output" or "spit/print out")
example_matrix[1, 2]
```
What indices would you use to get the value 20 specifically?
```{r}
example_matrix[2, 1]
```
## Using functions
Functions take some **input**, transform that **input**, and **spits out** (**return** is the technical term) some output.
```{r}
mean(c(10, 20, 30)) # mean function
```
What is the input to the `mean()` function above? What is the output?
* `mean()` is the function you **call** (technical term)
* `c(10, 20, 30)` is the **input** you provide to the function `mean()`
* the **function call** `mean(c(10, 20, 30))` **returns** the value 20
In case you're more familiar with math notation $y = f(x)$: `x` is the input to the function $f()$, and `y` is the output of the function $f(x)$.
What are the inputs to the `matrix()` function below? How is the `matrix()` function transforming your input? What do you think the output will be?
```{r}
matrix(c(10, 20, 30, 40, 50, 60), nrow = 6)
```
```{r}
matrix(c(10, 20, 30, 40, 50, 60), ncol = 6)
```
```{r}
matrix(c(10, 20, 30, 40, 50, 60), nrow = 2)
```
```{r}
matrix(c(10, 20, 30, 40, 50, 60), nrow = 2, byrow = TRUE)
# how does this output differ from the one above?
```
```{r}
paste0(c("a", "b", "c"), 1:3) # what is the paste0 function doing?
```
```{r}
paste(c("a", "b", "c"), 1:3) # How is paste() different from paste0() above?
```
```{r}
paste(c("a", "b", "c"), 1:3, sep = '_ _ _') # What's happening?
```
## Piping or chaining functions with `%>%`
Often we apply multiple functions in succession. We say we **wrap** or **nest** functions within functions. Math notation: $y = h(g(f(x)))$: apply function f to x, then function g to that output, and then function h to that output.
```{r}
x <- c(10.1, 10.1, 20.3, 20.3, 30.2, 30.7, 30.7)
x
```
```{r}
mean(x) # mean of values in x
```
```{r}
unique(x) # unique values in x
```
```{r}
mean(unique(x)) # mean of unique values in x
```
```{r}
round(mean(unique(x))) # round of mean of unique values in x
```
As you can see, wrapping or nesting functions within functions can be quite difficult to read. We can get rid of such nesting/wrapping with pipes `%>%`, which is available to you when you load the `tidyverse` package with `library(tidyverse)` above. Read `%>%` as "then".
Keyboard shortcut for `%>%`: Shift-Command-M (Mac) or Shift-Ctrl-M (Windows)
Take output of `x`, then apply `unique()` to that output.
```{r}
x %>% unique()
```
Take output of `x`, then apply `unique()` to that output, then apply `mean()` to that output.
```{r}
x %>% unique() %>% mean()
```
Take output of `x`, then apply `unique()` to that output, then apply `mean()` to that output, then apply `round()` to that output.
```{r}
x %>% unique() %>% mean() %>% round()
```
Same outputs below
```{r}
round(mean(unique(x))) # less legible
x %>% unique() %>% mean() %>% round() # more legible
```
Piping with `%>%` makes it easier to read your code (reading left to right: `x %>% unique() %>% mean() %>% round()`), rather than from inside to outside: `round(mean(unique(x)))`.
If your pipes get too long, you can separate pipes into different lines.
```{r}
x %>% # one line at a time (press enter after each pipe)
unique() %>%
mean() %>%
round()
```
In R, this approach of using pipes `%>%` is called **piping**. In other languages like Python, it's called **chaining**.
## Functions and argument order
Make sure to specify your function arguments in the correct order. If not, make sure to specify the name of each argument you're using!
```{r}
numbers <- c(1:3, NA)
numbers
```
**Argument**-**value** pairs: argument is x, value is `numbers` (defined above as `r numbers`)
```{r}
mean(x = numbers) # what happened?
```
**Argument-value** pairs
* argument1 = x, value = `numbers`
* argument2 = na.rm, value = `TRUE`
```{r}
# remove missing (NA) values by giving the value TRUE to the na.rm argument
mean(x = numbers, na.rm = TRUE)
```
### Reading function documentation
To improve as a programmer, you'll need to learn to read and understand documentation. R's documentation is quite intuitive once you understand how to read it.
```{r}
?mean # show help documentation
```
Decoding the **Usage** and **Arguments** sections
**Usage section**
`mean(x, ...)`: the function `mean()` can be called/used by providing one argument (`x`), which has no default values, so some value/object MUST be provided
`mean(x, trim=0, na.rm=FALSE, ...)`: tells you which arguments have default parameters and what they are (`trim` parameter has a default value of `0`, and `na.rm` parameter has a default value of `FALSE`—so `NA` or missing values are not removed by default); because `trim` and `na.rm` have default values, you don't need to supply them when calling the function `mean()`
**Arguments section**
This section provides more details on each parameter of the function.
![Argument-value pairs](./attachments/function_argument_value.png)
Note that `TRUE` can be written simply as `T` and `FALSE` can be written as `F`.
```{r}
mean(na.rm = T, x = numbers) # does this work?
```
```{r}
mean(numbers, na.rm = T) # does this work?
```
```{r}
mean(na.rm = T, numbers) # NEVER DO THIS EVEN IF IT WORKS!!! BAD PRACTICE!!!
```
Why is this so bad?
```{r, error=TRUE}
mean(numbers, TRUE) # what happens? why?
```
The function fails to run. Why? Type `?mean` in the console and read the documentation to figure out why. What arguments do the `mean` function also have? What's the expected order of the arguments?
What are the default values to the arguments? How do you know if there are default values?
## Pressing Tab key to autocomplete! Tab will be your best friend!
We often will never remember all the arguments all the functions will require (there are just too many functions!). But when we type the function name followed by brackets `mean()`, the cursor will automatically move between the brackets. You can press the `Tab` key on your keyboard to get RStudio to tell you what arguments this function expects.
Be creative! Tab and autocomplete works in MANY other situations! Explore! (variables, filenames, directory paths etc.)
## Good practices for reproducible research
* One directory/folder per project (create an [RProject](https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects) for each directory/project)
* Ensure your working directory is set properly before you begin (use RProjects!)
* Ensure your environment is cleared before you begin
* Load your libraries at the top of each script
* Give your variables and objects sensible names
* Optional: save and restore your work with `save.image()` and `load()`
## Four-step philosophy
1. Know your subgoals and especially your end goals
2. Know what you’re passing in to functions
3. Know what your functions return you
4. Know how to verify or summarize what your functions return you
## Common beginner errors
* Not looking at the output in the console (treating R like a black box)
* Console still expects more code: + rather than > (press Escape to get rid of +)
* Naming your variables un-systematically and calling the wrong variable because of typos
* Not knowing what data you’re giving a function
* Not knowing what class of data a function expects
* Not knowing what class of data your function returns
* Not learning how to **properly** use [stackoverflow](https://meta.stackoverflow.com/questions/252149/how-does-a-new-user-get-started-on-stack-overflow)
## Support my work
[Support my work and become a patron here](https://donorbox.org/support-my-teaching)!