-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathsetup.Rmd
191 lines (141 loc) · 7.51 KB
/
setup.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
# R Setup
## Preparing your environment for `R`
The Institute and Faculty of Actuaries have provided their [own guide](https://www.actuaries.org.uk/system/files/field/document/R-Guide_technical.pdf) to getting up and running with `R`.
The steps to have `R` working is dependant on your operating system. The following resources _should_ allow for your local installation of `R` to be relatively painless:
1. Download and install `R` from [CRAN](https://cran.rstudio.com/)^[CRAN is the The Comprehensive R Archive Network - read more on the [CRAN](https://cran.rstudio.com/) website].
2. Download and install an integrated development environment, a strong recommendation is [RStudio Desktop](https://rstudio.com/products/rstudio/download/#download).
## Basic interations with `R`
`R` is case-sensitive! We add comments to our `R` code using the `#` symbol on any line. A key concept when working with `R` is that the preference is to work with **vectorised** operations (over concepts like for loops). As an example we start with `1:10`{.R} which uses the colon operator (`:`{.R}) to generate a sequence starting with 1 and ending with 10 in steps of 1. The output is a numeric **vector** of integers. Let's see this in `R`:
```{r setup-vector-intro}
# This is the syntax for comments in R
(1:10) + 2 # Notice how we add element-wise in R
```
At the most basic level, `R` vectors can be of atomic modes:
- integer,
- numeric (equivalently, double),
- logical which take on the Boolean types: TRUE or FALSE and can be coerced into integers as 1 and 0 respectively,
- character which will be apparent in `R` with the wrapper "",
- complex, and
- raw
This book focuses on using `R` to solve actuarial statistical problems and will not explore the depths of the `R` language^[I fear this is already too indepth for "basic interactions with `R`" but for those that want to jump down the rabbit hole, see Hadley Wickham's book [Advanced R](https://adv-r.hadley.nz/).].
`R` has the usual arithmetic operators you'd expect with any programming language:
- `+`, `-`, `*`, `/` for addition, subtraction, multiplication and division,
- `^` for exponentiation,
- `%%` for modulo arithmetic (remainder after division)
- `%/%` for integer division
We **assign** values to **variables** using the `<-` *("assignment")* operator^[We can also assign values using the more familiar `=` symbol. In general this is discouraged, listen to [Hadley Wickham](https://style.tidyverse.org/syntax.html#assignment-1).].
```{r setup-vector-variable, collapse=TRUE}
x <- 1:10
y <- x + 2
x <- x + x # Notice that we can re-assign values to variables
z <- x + 2
y
z
```
Even though $z$ is assigned the same way as we assigned $y$, note that $y \neq z$ so execution order matters in `R`. All of $x$, $y$ and $z$ are **vectors** in `R`.
## Functions in `R`
We can add **functions** to `R` via the format `function_name(arguments = values, ...)`{.R}:
```{r setup-function}
# c() is the "combine" function, used often to create vectors
# Note we can also nest functions within functions
x <- c(1:3, 6:20, 21:42, c(43, 44))
# Another function with arguments:
y <- sample(x, size = 3)
y
```
There are a lot of in-built functions in `R` that we may need:
- `factorial(x)`
- `choose(n, k)` - for binomial coefficients
- `exp(x)`
- `log(x)` - by default in base $e$
- `gamma(x)`
- `abs(x)` - absolute value
- `sqrt(x)`
- `sum(x)`
- `mean(x)`
- `median(x)`
- `var(x)`
- `sd(x)`
- `quantile(x, 0.75)`
- `set.seed(seed)` - for reproducibility of random number generation
- `sample(x, size)`
`R` has an in-built help function `?`{.R} which can be used to read the documentation on any function as well as topic areas. For example have a look at `?Special`{.R} for more details about in-built `R` functions for the beta and gamma functions.
## Data structures in `R`
We have already seen **vectors** as a data structure that is very common in `R`. We can identify the structure of an `R` "object" using the `str(object)`{.R} function.
### Matrices {-}
Next we introduce the **matrix** structure. When interacting with matrices in `R` it is important to note that **matrix multiplication** requires the `%*%` syntax:
```{r setup-matrix}
first_matrix <- matrix(1:9, byrow = TRUE, nrow = 3)
first_matrix %*% first_matrix
```
### Dataframes {-}
A `data.frame` is a very popular data structure used in `R`. Each input variable has to have the same length but can be of different types (*strings, integers, booleans, etc.*).
```{r setup-dataframe}
# Input vectors for the data.frame
name <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
surface_gravity <- c(0.38, 0.904, 1, 0.3794, 2.528, 1.065, 0.886, 1.14)
# Create a data.frame from the vectors
solar_system <- data.frame(name, surface_gravity)
str(solar_system)
```
### Lists {-}
A `list` is a versatile data structure in `R` as their elements can be of any type, including lists themselves. In fact a `data.frame` is a specific implementation of a `list` which allows columns in a `data.frame` to have different types, unlike a `matrix`.
We will come across a number of functions that return a `list` type whilst working with actuarial statistics in `R`. For example when we look at linear models we will make use of the `lm(formula, data, ...)`{.R} function which returns a `list`.
```{r setup-list-1, collapse = TRUE}
# Use Orange dataset
df <- Orange
# Fit a linear model to predict circumference from age
fitted_lm <- lm(circumference ~ age, df)
# Size of the list
length(fitted_lm)
# Element names
names(fitted_lm)
```
We can access elements in the list using **subsetting**, noting the use of the `[[` operator. Here we subset on "_age_" within the "_coefficient_" element in the `list` we called "_fitted_lm_":
```{r setup-list-2, collapse = TRUE}
# Select [[1]] 1st element in the list, sub-select [2] 2nd element from that
fitted_lm[[1]][2]
# fitted_lm$coefficient is a shorthand for fitted_lm[["coefficient"]]
fitted_lm$coefficients[2]
# Select element using matching character vector "age"
fitted_lm$coefficients["age"]
# Select elements using matching character vectors
fitted_lm[["coefficients"]]["age"]
```
## Logical expressions in `R`
R has built in logic expressions:
| Operator | Description |
|-|-|
| < (<=) | less than (or equal to) |
| > (>=) | greater than (or equal to) |
| == | exactly equal to |
| ! | NOT |
| & | AND (*element-wise*) |
| \| | OR (*element-wise*) |
| != | not equal to |
We can use logical expressions to effectively filter data via **subsetting** the data using the `[...]`{.R} syntax:
```{r setup-subsetting}
x <- 1:10
x[x != 5 & x < 7]
```
We can select objects using the **\$** symbol (see `?Extract`{.R} for more help):
```{r setup-selecting}
#data.frame[rows to select, columns to select]
solar_system[solar_system$name == "Jupiter", c(1:2)]
```
## Extending `R` with packages
We can extend `R`'s functionality by loading **packages**:
```{r setup-packages}
# Load the ggplot2 package
library(ggplot2)
```
Did you get an error from `R` trying this? To load packages they need to be **installed** using `install.packages("package name")`{.R}.
## Importing data
`R` can import a wide variety of file formats, including:
- _.csv_
- _.RData_
- _.txt_
We can import these using `read.csv(file, header, sep)`, `load(file)` and `read.table(file, header, sep)` respectively, where:
- _file_ is the file name to be read,
- _header_ is a logical value to determine if the first line contains the variable names, and
- _sep_ is the specified field separator character, e.g. for _.csv_ files by default `sep = ","` given it is a comma separated value file type.