setup.Rmd

# R Setup

## Preparing your environment for `R`

The Institute and Faculty of Actuaries have provided their [own guide](https://www.actuaries.org.uk/system/files/field/document/R-Guide_technical.pdf) to getting up and running with `R`.

The steps to have `R` working is dependant on your operating system. The following resources _should_ allow for your local installation of `R` to be relatively painless:

1. Download and install `R` from [CRAN](https://cran.rstudio.com/)^[CRAN is the The Comprehensive R Archive Network - read more on the [CRAN](https://cran.rstudio.com/) website].
2. Download and install an integrated development environment, a strong recommendation is [RStudio Desktop](https://rstudio.com/products/rstudio/download/#download).

## Basic interations with `R`

`R` is case-sensitive! We add comments to our `R` code using the `#` symbol on any line. A key concept when working with `R` is that the preference is to work with **vectorised** operations (over concepts like for loops). As an example we start with `1:10`{.R} which uses the colon operator (`:`{.R}) to generate a sequence starting with 1 and ending with 10 in steps of 1. The output is a numeric **vector** of integers. Let's see this in `R`:

```{r setup-vector-intro}
# This is the syntax for comments in R
(1:10) + 2 # Notice how we add element-wise in R
```

At the most basic level, `R` vectors can be of atomic modes:

- integer,
- numeric (equivalently, double),
- logical which take on the Boolean types: TRUE or FALSE and can be coerced into integers as 1 and 0 respectively,
- character which will be apparent in `R` with the wrapper "",
- complex, and
- raw

This book focuses on using `R` to solve actuarial statistical problems and will not explore the depths of the `R` language^[I fear this is already too indepth for "basic interactions with `R`" but for those that want to jump down the rabbit hole, see Hadley Wickham's book [Advanced R](https://adv-r.hadley.nz/).]. 
`R` has the usual arithmetic operators you'd expect with any programming language:

- `+`, `-`, `*`, `/` for addition, subtraction, multiplication and division,
- `^` for exponentiation,
- `%%` for modulo arithmetic (remainder after division)
- `%/%` for integer division

We **assign** values to **variables** using the `<-` *("assignment")* operator^[We can also assign values using the more familiar `=` symbol. In general this is discouraged, listen to [Hadley Wickham](https://style.tidyverse.org/syntax.html#assignment-1).].

```{r setup-vector-variable, collapse=TRUE}
x <- 1:10
y <- x + 2
x <- x + x # Notice that we can re-assign values to variables
z <- x + 2
y
z
```

Even though $z$ is assigned the same way as we assigned $y$, note that $y \neq z$ so execution order matters in `R`. All of $x$, $y$ and $z$ are **vectors** in `R`.

## Functions in `R`

We can add **functions** to `R` via the format `function_name(arguments = values, ...)`{.R}:

```{r setup-function}
# c() is the "combine" function, used often to create vectors
# Note we can also nest functions within functions
x <- c(1:3, 6:20, 21:42, c(43, 44))
# Another function with arguments:
y <- sample(x, size = 3)
y
```

There are a lot of in-built functions in `R` that we may need:

- `factorial(x)`
- `choose(n, k)` - for binomial coefficients
- `exp(x)`
- `log(x)` - by default in base $e$
- `gamma(x)`
- `abs(x)` - absolute value
- `sqrt(x)`
- `sum(x)`
- `mean(x)`
- `median(x)`
- `var(x)`
- `sd(x)`
- `quantile(x, 0.75)`
- `set.seed(seed)` - for reproducibility of random number generation
- `sample(x, size)`

`R` has an in-built help function `?`{.R} which can be used to read the documentation on any function as well as topic areas. For example have a look at `?Special`{.R} for more details about in-built `R` functions for the beta and gamma functions.

## Data structures in `R`

We have already seen **vectors** as a data structure that is very common in `R`. We can identify the structure of an `R` "object" using the `str(object)`{.R} function.

### Matrices {-}

Next we introduce the **matrix** structure. When interacting with matrices in `R` it is important to note that **matrix multiplication** requires the `%*%` syntax:

```{r setup-matrix}
first_matrix <- matrix(1:9, byrow = TRUE, nrow = 3)
first_matrix %*% first_matrix
```

### Dataframes {-}

A `data.frame` is a very popular data structure used in `R`. Each input variable has to have the same length but can be of different types (*strings, integers, booleans, etc.*).

```{r setup-dataframe}
# Input vectors for the data.frame
name <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
surface_gravity <- c(0.38, 0.904, 1, 0.3794, 2.528, 1.065, 0.886, 1.14)
# Create a data.frame from the vectors
solar_system <- data.frame(name, surface_gravity)
str(solar_system)
```

### Lists {-}

A `list` is a versatile data structure in `R` as their elements can be of any type, including lists themselves. In fact a `data.frame` is a specific implementation of a `list` which allows columns in a `data.frame` to have different types, unlike a `matrix`.

We will come across a number of functions that return a `list` type whilst working with actuarial statistics in `R`. For example when we look at linear models we will make use of the `lm(formula, data, ...)`{.R} function which returns a `list`. 

```{r setup-list-1, collapse = TRUE}
# Use Orange dataset
df <- Orange
# Fit a linear model to predict circumference from age
fitted_lm <- lm(circumference ~ age, df)
# Size of the list
length(fitted_lm)
# Element names
names(fitted_lm)
```

We can access elements in the list using **subsetting**, noting the use of the `[[` operator. Here we subset on "_age_"  within the "_coefficient_" element in the `list` we called "_fitted_lm_":

```{r setup-list-2, collapse = TRUE}
# Select [[1]] 1st element in the list, sub-select [2] 2nd element from that
fitted_lm[[1]][2] 
# fitted_lm$coefficient is a shorthand for fitted_lm[["coefficient"]] 
fitted_lm$coefficients[2] 
# Select element using matching character vector "age"
fitted_lm$coefficients["age"]
# Select elements using matching character vectors
fitted_lm[["coefficients"]]["age"]
```

## Logical expressions in `R`

R has built in logic expressions:

| Operator | Description |
|-|-|
| < (<=) | less than (or equal to) |
| > (>=) | greater than (or equal to) |
| == | exactly equal to |
| ! | NOT |
| & | AND (*element-wise*) |
| \| | OR (*element-wise*) |
| != | not equal to |

We can use logical expressions to effectively filter data via **subsetting** the data using the `[...]`{.R} syntax:

```{r setup-subsetting}
x <- 1:10
x[x != 5 & x < 7]
```

We can select objects using the **\$** symbol (see `?Extract`{.R} for more help):

```{r setup-selecting}
#data.frame[rows to select, columns to select]
solar_system[solar_system$name == "Jupiter", c(1:2)]
```

## Extending `R` with packages

We can extend `R`'s functionality by loading **packages**:

```{r setup-packages}
# Load the ggplot2 package
library(ggplot2)
```

Did you get an error from `R` trying this? To load packages they need to be **installed** using `install.packages("package name")`{.R}.

## Importing data

`R` can import a wide variety of file formats, including:

- _.csv_
- _.RData_
- _.txt_

We can import these using `read.csv(file, header, sep)`, `load(file)` and `read.table(file, header, sep)` respectively, where:

- _file_ is the file name to be read,
- _header_ is a logical value to determine if the first line contains the variable names, and
- _sep_ is the specified field separator character, e.g. for _.csv_ files by default `sep = ","` given it is a comma separated value file type.