<!DOCTYPE html>
<html>
  <head>
    <title>Machine Learning 101</title>
    <meta charset="utf-8">
    <meta name="author" content="Sarah Romanes @sarah_romanes" />
    <link href="libs/remark-css/kunoichi.css" rel="stylesheet" />
    <link href="libs/remark-css/ninjutsu.css" rel="stylesheet" />
    <link href="libs/font-awesome-animation/font-awesome-animation-emi.css" rel="stylesheet" />
    <script src="libs/fontawesome/js/fontawesome-all.min.js"></script>
    <link href="libs/font-awesome/css/fontawesome-all.min.css" rel="stylesheet" />
    <link rel="stylesheet" href="assets/custom.css" type="text/css" />
  </head>
  <body>
    <textarea id="source">
class: center, middle, inverse, title-slide

# Machine Learning 101
## Model Assessment in R
### <br><br>Sarah Romanes   <span>&lt;i class="fab  fa-twitter faa-float animated "&gt;&lt;/i&gt;&amp;nbsp;@sarah_romanes</span>
### <br>17-Oct-2018<br><br><span>&lt;i class="fas  fa-link faa-vertical animated " style=" color:white;"&gt;&lt;/i&gt;&amp;nbsp;bit.ly/rladies-sydney-ML-2</span>

---


layout: false
class: bg-main3 split-30 hide-slide-number

.column[

]
.column.slide-in-right[.content.vmiddle[
.sliderbox.shade_main.pad1[
.font5[Welcome!]
]
]]

---

class: split-two white

.column.bg-main1[.content.vmiddle.center[

# Overview

&lt;br&gt;

### This two-part R-Ladies workshop is designed to give you a small taste of the large field known as .orange[**Machine Learning**]! Last week, we covered supervised learning techniques, and this week we will cover model assessment.

&lt;br&gt;

### You can always visit last week's slides at .orange[bit.ly/r-ladies-ML-1] with the corresponding <i class="fab  fa-github "></i> repo [here](https://github.com/sarahromanes/r-ladies-ML-1).

]]
.column.bg-main3[.content.vmiddle.center[
&lt;img src="images/machinewar.png" width="70%"&gt;

##### [SMBC](https://www.smbc-comics.com/comic/rise-of-the-machines)


]]

---

# .purple[Recap - What *is* Machine Learning?]

&lt;br&gt;

### Machine learning is concerned with finding functions `\(Y=f(X)+\epsilon\)` that best **predict** outputs (responses), given data inputs (predictors).

&lt;br&gt;

&lt;center&gt;

  &lt;img src="images/ml-process.png" width="50%"&gt;

&lt;/center&gt;

&lt;br&gt;

### Mathematically, Machine Learning problems are simply *optimisation* problems, which we will use .purple[<i class="fab  fa-r-project "></i>] to help us solve!
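
---

# .purple[Sketch: model fitting as optimisation]

### To make the optimisation view concrete, here is a minimal sketch on simulated data (the data and the helper name `rss` are assumptions for illustration, not part of the workshop material). A straight line is fitted by numerically minimising the residual sum of squares, and the answer is compared with `lm`.

```r
# Simulated data (hypothetical, not the workshop's Income data)
set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100, sd = 0.5)

# Objective: residual sum of squares for a line with
# intercept beta[1] and slope beta[2]
rss <- function(beta) sum((y - beta[1] - beta[2] * x)^2)

# Minimise the objective numerically, starting from (0, 0)
opt <- optim(par = c(0, 0), fn = rss, method = "BFGS")

# lm() solves the same least-squares problem directly
fit <- lm(y ~ x)

rbind(optim = opt$par, lm = coef(fit))
```

#### The two rows should agree closely, since both minimise the same objective.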

---

layout: false
class: bg-main3 split-30 hide-slide-number

.column[

]
.column.slide-in-right[.content.vmiddle[
.sliderbox.shade_main.pad1[
.font5[Performance Metrics]
]
]]

---

class: split-two white 

.column.bg-main1[.content[
&lt;br&gt;

# Regression - revisited 

### Recall the simulated dataset `Income`, which looked at the relationship between `Education` (years) and `Income` (thousands).


```r
data &lt;- read.csv("data/Income.csv")
head(data)
```

```
##   Education   Income
## 1  10.00000 26.65884
## 2  10.40134 27.30644
## 3  10.84281 22.13241
## 4  11.24415 21.16984
## 5  11.64548 15.19263
## 6  12.08696 26.39895
```

### How can we assess how well a regression model fits the data?

]]

.column[.content.vmiddle.center[


&lt;img src="index_files/figure-html/unnamed-chunk-2-1.png" width="504" /&gt;


]]

---

class: split-two white 

.column.bg-main1[.content[
&lt;br&gt;

# Is a straight line the best fit?

&lt;br&gt;

&lt;br&gt;

&lt;br&gt;

## Last week we considered fitting a .orange[Linear] model to this data, resulting in the following line...

#### (Revise how this was fit [here](https://sarahromanes.github.io/r-ladies-ML-1/#9).)


]]

.column[.content.vmiddle.center[


&lt;img src="index_files/figure-html/unnamed-chunk-3-1.png" width="504" /&gt;


]]

---

class: split-two white 

.column.bg-main1[.content[
&lt;br&gt;

# Is a straight line the best fit?

&lt;br&gt;

&lt;br&gt;

&lt;br&gt;

## ... however there are many types of curves we could have fit to this dataset!

]]

.column[.content.vmiddle.center[

&lt;img src="images/curves.jpg" width="60%"&gt;

]]


---
class: split-two white 

.column.bg-main1[.content[
&lt;br&gt;

# Is a straight line the best fit?

&lt;br&gt;

&lt;br&gt;

&lt;br&gt;

## ... however there are many types of curves we could have fit to this dataset!

&lt;br&gt;

## Consider now the .orange[*polynomial*] curve of degree 17. Does it fit better than our straight line?

]]

.column[.content.vmiddle.center[


&lt;img src="index_files/figure-html/unnamed-chunk-4-1.png" width="504" /&gt;


]]


---

class: split-two white

.column.bg-main1[.content[

&lt;br&gt;

# Regression metrics: the RMSE

&lt;br&gt;


# The RMSE is the .orange[Root Mean Squared Error], which is a measure of how far our fitted line lies from the observations (on average).


]]
.column[.content.vmiddle.center[
&lt;img src="images/rmse.png" width="80%"&gt;
]]
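
---

# .purple[Sketch: computing the RMSE]

### Written out, the RMSE is `\(\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}\)`, where `\(\hat{y}_i\)` is the fitted value for observation `\(i\)`. A minimal sketch on toy numbers follows; the helper name `rmse` and the values are made up for illustration.

```r
# Root mean squared error between observed and predicted values
# (rmse is a helper name chosen for this sketch)
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Toy values: the residuals are 0, 0 and -2, so the mean squared
# error is 4/3 and the RMSE is sqrt(4/3)
actual    <- c(1, 2, 3)
predicted <- c(1, 2, 5)
rmse(actual, predicted)
```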


---

class: split-60 white

.column.bg-main1[.content[

&lt;br&gt;

# Regression metrics: the RMSE

&lt;br&gt;


```r
fit.lm &lt;- lm(data=data, Income~Education)
fit.poly &lt;- lm(data=data, Income~poly(Education, 17))

fitted.vals.lm &lt;- predict(fit.lm, 
                          data.frame(Education=data$Education))
fitted.vals.poly &lt;- predict(fit.poly, 
                            data.frame(Education=data$Education))
```

#### RMSE - Linear Model 

```r
sqrt(mean((data$Income-fitted.vals.lm)^2))
```

```
## [1] 5.461576
```
#### RMSE - Polynomial Model 

```r
sqrt(mean((data$Income-fitted.vals.poly)^2)) 
```

```
## [1] 2.803986
```


]]
.column.bg-main3[.content.vmiddle.center[

## We can find the RMSE of the two models proposed before.


## .orange[The polynomial fit has a lower RMSE. Does this mean we should use it?]

]]


---
# .purple[Bias and Variance]

&lt;br&gt;

## In order to minimise test error on new data points, statistical theory (out of the scope of this workshop) tells us that we need to select a function that achieves *low variance* and *low bias*.

&lt;br&gt;

--

## - .pink[**Variance**] refers to the amount by which our predictions would change if we fitted the model using a different training set. The more flexible the model, the higher the variance.

--

## - .pink[**Bias**] refers to the error that is introduced by the approximation we are making with our model (representing complicated data by a simple model). The simpler the model, the higher the bias.


---

class: middle center bg-main1

# High variance (overfitted) models can be problematic...

&lt;img src="images/bird.jpg" width="70%"&gt;

---

class: middle center bg-main1

# ... as well as high bias (underfitted) models!

&lt;img src="images/constraints.png" width="70%"&gt;

---

# .purple[The Bias-Variance Tradeoff]

&lt;br&gt;

## As you may have guessed, there is a trade-off: a more flexible model has lower bias but higher variance, whilst a simpler model has lower variance but higher bias.

&lt;br&gt;

&lt;center&gt;

  &lt;img src="images/tradeoff.png" width="50%"&gt;

&lt;/center&gt;

&lt;br&gt;

### This is known as the .pink[Bias-Variance Tradeoff], and will be revisited in the next section!
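
---

# .purple[Sketch: training error vs held-out error]

### A rough way to see the trade-off is to fit polynomials of increasing degree and compare the error on the points used for fitting with the error on points held back. The sketch below uses simulated data; the curve, noise level, degrees and split size are all assumptions for illustration.

```r
# Simulated data (hypothetical): a smooth curve plus noise
set.seed(1)
n <- 100
x <- runif(n, 10, 22)
y <- 20 + 40 * plogis(x - 16) + rnorm(n, sd = 5)
sim <- data.frame(x = x, y = y)

# Hold out 30 points that the models never see during fitting
test.inds <- sample(n, 30)
train <- sim[-test.inds, ]
test  <- sim[test.inds, ]

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Fit polynomials of increasing flexibility and record both errors
degrees <- c(1, 3, 10)
res <- data.frame(degree = degrees, train = NA_real_, test = NA_real_)
for (i in seq_along(degrees)) {
  fit <- lm(y ~ poly(x, degrees[i]), data = train)
  res$train[i] <- rmse(train$y, fitted(fit))
  res$test[i]  <- rmse(test$y, predict(fit, newdata = test))
}
res
```

#### The training RMSE cannot increase as the degree grows, because each polynomial contains the lower-degree ones as special cases. The held-out RMSE carries no such guarantee, which is why it is the column to watch.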


---
class: split-two white 

.column.bg-main1[.content[
&lt;br&gt;

# The middle ground!

&lt;br&gt;

## For this dataset, we might consider the `loess` model as a nice middle ground between bias and variance.


```r
*fit.loess &lt;- loess(data=data, Income~Education)
fitted.vals.loess &lt;- predict(fit.loess,
                             data.frame(
                             Education=data$Education))
rmse.loess &lt;- sqrt(mean((data$Income-fitted.vals.loess)^2)) 
rmse.loess
```

```
## [1] 3.806408
```


]]

.column[.content.vmiddle.center[


&lt;img src="index_files/figure-html/unnamed-chunk-9-1.png" width="504" /&gt;


]]

---

class: split-two white 

.column.bg-main1[.content[

&lt;br&gt;

# The classification setting

&lt;br&gt;

### Recall the dataset `Exam`, where two exam scores are given for each student, and a Label represents whether they passed or failed the course.


```r
data&lt;- read.csv("data/Exam.csv", header=T)
head(data,4)
```

```
##      Exam1    Exam2 Label
## 1 34.62366 78.02469     0
## 2 30.28671 43.89500     0
## 3 35.84741 72.90220     0
## 4 60.18260 86.30855     1
```


### How do we assess model accuracy for a classifier?
]]

.column[.content.vmiddle.center[
&lt;img src="index_files/figure-html/unnamed-chunk-11-1.png" width="504" /&gt;
]]

---
class: split-two white

.column.bg-main1[.content[

&lt;br&gt;

# Classification metrics

&lt;br&gt;


# The .orange[Resubstitution Error Rate] is a measure of the proportion of data points we predict incorrectly when we try to predict all the points we used to fit the model.


]]
.column[.content.vmiddle.center[
&lt;img src="images/rser.png" width="80%"&gt;
]]
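
---

# .purple[Sketch: an error rate by hand]

### Before fitting any model, the calculation itself can be sketched on a handful of made-up labels (the vectors below are hypothetical and only illustrate the arithmetic).

```r
# Toy labels (hypothetical): 1 = pass, 0 = fail
actual    <- c(0, 0, 1, 1, 1, 0, 1, 1)
predicted <- c(0, 1, 1, 1, 0, 0, 1, 1)

# Proportion of points whose predicted label differs from the truth
err <- mean(predicted != actual)
err      # 2 of the 8 labels differ, so this is 0.25

# Accuracy is the complement of the error rate
1 - err
```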


---

class: split-60 white

.column.bg-main1[.content[

&lt;br&gt;

# The Resubstitution Error Rate

&lt;br&gt;


```r
*fit.glm &lt;-  glm(data=data,
*           Label ~ .,
*           family=binomial(link="logit"))
```
 
]]
.column.bg-main3[.content.vmiddle.center[
# First, we fit a `glm` against all predictors like we did last week.
]]

---

class: split-60 white

.column.bg-main1[.content[
&lt;br&gt;

# The Resubstitution Error Rate


&lt;br&gt;


```r
fit.glm &lt;-  glm(data=data, 
            Label ~ ., 
            family=binomial(link="logit"))  

*fitted.vals &lt;- round(predict(fit.glm,
*                            newdata = data[, c("Exam1", "Exam2")],
*                            type="response"))
```
]]
.column.bg-main3[.content.vmiddle.center[
# We then predict the values of all the data points we used to build the model.
#### (Remember, for `glm`, we predict the *probability* of being class 1, and then round).

]]
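
---

# .purple[Sketch: from probabilities to labels]

### Rounding the predicted probability amounts to choosing a cut-off of 0.5. The sketch below, on simulated data, makes that cut-off explicit; the data, the variable names and the alternative cut-off of 0.7 are assumptions for illustration.

```r
# Simulated data (hypothetical): one predictor, binary label
set.seed(1)
x <- rnorm(200)
label <- rbinom(200, size = 1, prob = plogis(2 * x))
sim <- data.frame(x = x, label = label)

fit <- glm(label ~ x, data = sim, family = binomial(link = "logit"))

# type = "response" returns probabilities, not class labels
probs <- predict(fit, newdata = sim, type = "response")

# Rounding is the same as thresholding the probability at 0.5 ...
pred.round <- round(probs)
pred.half  <- as.numeric(probs > 0.5)

# ... but an explicit threshold can be moved, e.g. to be stricter
# about predicting class 1 (0.7 is an arbitrary illustrative value)
pred.strict <- as.numeric(probs > 0.7)

table(round = pred.round, strict = pred.strict)
```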

---


class: split-60 white

.column.bg-main1[.content[
&lt;br&gt;

# The Resubstitution Error Rate


&lt;br&gt;


```r
fit.glm &lt;-  glm(data=data, 
            Label ~ ., 
            family=binomial(link="logit"))  

fitted.vals &lt;- round(predict(fit.glm,
                             newdata = data[, c("Exam1", "Exam2")], 
                             type="response")) 
*err &lt;- mean(fitted.vals!=data$Label)
err
```

```
## [1] 0.11
```
]]
.column.bg-main3[.content.vmiddle.center[
# We can then find the .orange[*resubstitution error rate*] by looking at the proportion of labels that were not correctly predicted.

]]


---


class: split-60 white

.column.bg-main1[.content[

&lt;br&gt;

# The Confusion Matrix <i class="fas  fa-question-circle "></i>

&lt;br&gt;


```r
fit.glm &lt;-  glm(data=data, 
            Label ~ ., 
            family=binomial(link="logit"))  
fitted.vals &lt;- round(predict(fit.glm, 
                             newdata = data[, c("Exam1", "Exam2")], 
                             type="response"))
```


```r
library(caret)
*confusion.glm &lt;- confusionMatrix(as.factor(fitted.vals),
*                                as.factor(data$Label))
confusion.glm$table
```

```
##           Reference
## Prediction  0  1
##          0 34  5
##          1  6 55
```
]]
.column.bg-main3[.content.vmiddle.center[
## We can examine how our model predicted all the data points, using `confusionMatrix` from the `caret` package. Note that `confusionMatrix` requires factor inputs.

]]


---

class: split-60 white

.column.bg-main1[.content[

&lt;br&gt;

# The Confusion Matrix <i class="fas  fa-question-circle "></i>

&lt;br&gt;


```r
fit.glm &lt;-  glm(data=data, 
            Label ~ ., 
            family=binomial(link="logit"))  
fitted.vals &lt;- round(predict(fit.glm, 
                             newdata = data[, c("Exam1", "Exam2")], 
                             type="response"))
```


```r
library(caret)
*confusion.glm &lt;- confusionMatrix(as.factor(fitted.vals),
*                                as.factor(data$Label))
*confusion.glm$table
```

```
##           Reference
## Prediction  0  1
##          0 34  5
##          1  6 55
```
]]
.column.bg-main3[.content.vmiddle.center[

## Looking at the `table` output, reading *vertically*, we can assess model performance. Out of the 40 failures in our dataset, `glm` successfully predicts 34. Out of the 60 passes in our data, `glm` successfully predicts 55.
]]
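
---

# .purple[Sketch: summaries from the confusion matrix]

### The counts in the table are enough to recover the summaries by hand. The sketch below re-enters the four counts shown on the slide as a plain matrix (the object name `tab` is chosen for illustration) and needs only base R.

```r
# The confusion matrix from the slide, entered by hand
# (rows = prediction, columns = reference)
tab <- matrix(c(34, 6, 5, 55), nrow = 2,
              dimnames = list(Prediction = c("0", "1"),
                              Reference  = c("0", "1")))

# Correct predictions sit on the diagonal
accuracy <- sum(diag(tab)) / sum(tab)
accuracy        # 0.89, i.e. 1 minus the resubstitution error of 0.11

# Reading vertically: proportion of each true class predicted correctly
diag(tab) / colSums(tab)
```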


---

class: middle center bg-main1

# Your turn!

## <span>&lt;i class="fas  fa-sync-alt fa-4x faa-spin animated "&gt;&lt;/i&gt;</span>
---

class: split-60 white

.column.bg-main1[.content[

&lt;br&gt;

# Classifying and assessing model fit for Diabetes data

&lt;br&gt;


```r
data &lt;- read.csv("data/diabetes.csv")
dim(data)
```

```
## [1] 768   9
```

```r
str(data)
'data.frame':	768 obs. of  9 variables:
 $ pregnant : int  6 1 8 1 0 5 3 10 2 8 ...
 $ glucose  : int  148 85 183 89 137 116 78 115 197 125 ...
 $ hg       : int  72 66 64 66 40 74 50 0 70 96 ...
 $ thickness: int  35 29 0 23 35 0 32 0 45 0 ...
 $ insulin  : int  0 0 0 94 168 0 88 0 543 0 ...
 $ bmi      : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
 $ pedigree : num  0.627 0.351 0.672 0.167 2.288 ...
 $ age      : int  50 31 32 21 33 30 26 29 53 54 ...
*$ class    : int  1 0 1 0 1 0 1 0 1 1 ...
```


]]


.column.bg-main3[.content.vmiddle.center[

# Build a machine learning model to predict whether or not the patients in the dataset have diabetes (`class`).

# Further, assess the performance of .orange[*your model*] on classifying `class`.
]]

---


layout: false
class: bg-main3 split-30 hide-slide-number

.column[

]
.column.slide-in-right[.content.vmiddle[
.sliderbox.shade_main.pad1[
.font5[Cross Validation]
]
]]

---

class: middle center bg-main1

# Building reliable and accurate models is paramount in machine learning...

&lt;img src="images/netflix.jpg" width="70%"&gt;


---

class: middle center bg-main1

# ... however, it can be hard to get new data to assess model performance!

&lt;img src="images/newdata.jpg" width="70%"&gt;
---

class: split-two white

.column.bg-main1[.content[

&lt;br&gt;

# Cross Validation

&lt;br&gt;


##  Often, we want to see how well our model can predict new data points. However, it is often impossible to get completely new data.

## So, we split our data into .orange[*training*] and .orange[*testing*] sets to evaluate performance, treating the *testing* data as new data points.


]]
.column[.content.vmiddle.center[
&lt;img src="images/Cv1.png" width="80%"&gt;
]]
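
---

# .purple[Sketch: a single training/testing split]

### The simplest version of this idea is one random split. The sketch below uses `mtcars`, which ships with R, as a stand-in dataset; the 80/20 proportion and the model `mpg ~ wt` are assumptions chosen for illustration.

```r
# mtcars ships with R; it stands in for the workshop data here
set.seed(1)
n <- nrow(mtcars)

# Randomly pick roughly 20% of the rows as the testing set
test.inds <- sample(n, size = round(0.2 * n))
train <- mtcars[-test.inds, ]
test  <- mtcars[test.inds, ]

# Fit on the training rows only ...
fit <- lm(mpg ~ wt, data = train)

# ... and measure the error on rows the model has never seen
test.rmse <- sqrt(mean((test$mpg - predict(fit, newdata = test))^2))
test.rmse
```

#### A single split depends heavily on which rows happen to be held out, which is the motivation for rotating the testing set.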


---

class: split-two white

.column.bg-main1[.content[

&lt;br&gt;

# Cross Validation

&lt;br&gt;

## To do this, we split our data into .orange[K folds].

&lt;br&gt;


## The first fold is treated as a testing set, and the method is fit on the remaining K − 1 folds. 


]]
.column[.content.vmiddle.center[
&lt;img src="images/Cv1.png" width="80%"&gt;
]]

---

class: split-two white

.column.bg-main1[.content[

&lt;br&gt;

# Cross Validation

&lt;br&gt;

&lt;br&gt;


## The misclassification error rate is then computed on the observations in the held-out fold. 

## This procedure is .orange[repeated K] times; each time, a different group of observations is treated as a testing set.


]]
.column[.content.vmiddle.center[
&lt;img src="images/Cv2.png" width="80%"&gt;
]]


---

class: split-two white

.column.bg-main1[.content[

&lt;br&gt;

# Cross Validation

&lt;br&gt;

&lt;br&gt;


# The .orange[CV error rate] is then calculated as the average of these K error rates.


]]
.column[.content.vmiddle.center[
&lt;img src="images/Cv3.png" width="80%"&gt;
]]
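
---

# .purple[The CV error rate as a formula]

### The averaging step can be written compactly. The notation below is chosen for this sketch: `\(\text{Err}_j\)` is the misclassification error rate on fold `\(j\)`, and `\(n_j\)` is the number of points in that fold.

`$$\text{CV}_{(K)} = \frac{1}{K}\sum_{j=1}^{K}\text{Err}_j, \qquad \text{Err}_j = \frac{\text{number of misclassified points in fold } j}{n_j}$$`

#### For example, with `\(K = 5\)` and fold error rates of 0.2, 0.3, 0.25, 0.25 and 0.3, the CV error rate is `\(1.3 / 5 = 0.26\)`.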


---

# .purple[How many folds?]

&lt;br&gt;

&lt;center&gt;

  &lt;img src="images/splits.png" width="70%"&gt;

&lt;/center&gt;

&lt;br&gt;

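#### Generally, .pink[K between 5 and 10] is a good compromise: too many folds make the K training sets almost identical, so the CV estimate becomes highly variable (variance), whilst too few folds leave each model with too few training points (bias).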

---

class: split-60 white

.column.bg-main1[.content[

&lt;br&gt;

# 5-fold CV for Diabetes `knn` model 

&lt;br&gt;


```r
library(class) # for KNN function

K &lt;-  5 # number of folds
n &lt;- nrow(data) # number of data points

X &lt;- data[,-9] # predictors
y &lt;- as.factor(data$class) # response
```


]]
.column.bg-main3[.content.vmiddle.center[
# Let's perform a .orange[5-fold] CV for the `knn` model.  First, we split up our response and predictors.
]]


---

class: split-60 white

.column.bg-main1[.content[

&lt;br&gt;

# K-fold CV in <i class="fab  fa-r-project "></i> using `cvTools`

&lt;br&gt;


```r
library(cvTools)
set.seed(1)
*cvSets &lt;- cvFolds(n,K)
```


```r
str(cvSets)
List of 5
 $ n      : num 768 
 $ K      : num 5
 $ R      : num 1
 $ subsets: int [1:768, 1] 110 386 469 511 39 485 741 563 572 759 ... 
 $ which  : int [1:768] 1 2 3 4 5 1 2 3 4 5 ... 
```
]]
.column.bg-main3[.content.vmiddle.center[
##  Using the `cvTools` package, we can use the function `cvFolds` to separate our `n` data points into `K` partitions for cross validation.

#### Note, we `set.seed(1)` to ensure reproducible results.
]]


---

class: split-60 white

.column.bg-main1[.content[

&lt;br&gt;

# K-fold CV in <i class="fab  fa-r-project "></i> using `cvTools`

&lt;br&gt;


```r
library(cvTools)
set.seed(1)
cvSets &lt;- cvFolds(n,K) 
```


```r
str(cvSets)
List of 5
 $ n      : num 768 
 $ K      : num 5
 $ R      : num 1
*$ subsets: int [1:768, 1] 110 386 469 511 39 485 741 563 572 759 ...
*$ which  : int [1:768] 1 2 3 4 5 1 2 3 4 5 ...
```
]]
.column.bg-main3[.content.vmiddle.center[
##  The important outputs are `subsets` and `which`. 
&lt;br&gt;

## `which` gives the fold of each position, and `subsets` gives the data index stored at that position. E.g., data point 110 belongs to fold 1, data point 386 to fold 2, etc.
]]
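
---

# .purple[Sketch: what the fold assignment looks like]

### The idea behind `subsets` and `which` can be imitated in base R. This is a sketch of the concept under my own assumptions, not the source code of `cvTools`; `fold.id` plays the role of `which`.

```r
# A base-R imitation of the two fields, for illustration only
set.seed(1)
n <- 768
K <- 5

subsets <- sample(n)                   # random permutation of 1..n
fold.id <- rep(1:K, length.out = n)    # fold labels 1,2,...,K,1,2,...

# Row indices of the data points assigned to fold 1
fold.1 <- subsets[fold.id == 1]
head(fold.1)

# Every data point lands in exactly one fold,
# and fold sizes differ by at most one
table(fold.id)
```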
---

class: split-60 white

.column.bg-main1[.content[

&lt;br&gt;

# K-fold CV in <i class="fab  fa-r-project "></i> using `cvTools`

&lt;br&gt;


```r
error.fold &lt;- c()
*for(j in 1:K){
  
  fold.j &lt;- which(cvSets$which==j)
  test.inds &lt;- cvSets$subsets[fold.j]
  
  X.test &lt;- X[test.inds,]
  X.train &lt;- X[-test.inds,]
  y.test &lt;- y[test.inds]
  y.train &lt;- y[-test.inds]
  
  fit &lt;- knn(X.train, X.test, y.train, k=9)
  
  error.fold[j] &lt;- sum(fit!=y.test)
*} 

cv.error &lt;- sum(error.fold)/n
cv.error
```
]]
.column.bg-main3[.content.vmiddle.center[
# Now, we set up a for loop to iterate through the test/training folds.
]]

---

class: split-60 white

.column.bg-main1[.content[
&lt;br&gt;

# K-fold CV in <i class="fab  fa-r-project "></i> using `cvTools`

&lt;br&gt;


```r
error.fold &lt;- c()
for(j in 1:K){ 
  
* fold.j &lt;- which(cvSets$which==j)
* test.inds &lt;- cvSets$subsets[fold.j]
  
  X.test &lt;- X[test.inds,]
  X.train &lt;- X[-test.inds,]
  y.test &lt;- y[test.inds]
  y.train &lt;- y[-test.inds]
  
  fit &lt;- knn(X.train, X.test, y.train, k=9)
  
  error.fold[j] &lt;- sum(fit!=y.test)
}

cv.error &lt;- sum(error.fold)/n
cv.error
```
]]
.column.bg-main3[.content.vmiddle.center[
# We determine which indices belong to the current fold, and determine which indices are to be designated as *testing*.
]]

---

class: split-60 white

.column.bg-main1[.content[

&lt;br&gt;

# K-fold CV in <i class="fab  fa-r-project "></i> using `cvTools`

&lt;br&gt;


```r
error.fold &lt;- c()
for(j in 1:K){ 
  
  fold.j &lt;- which(cvSets$which==j)
  test.inds &lt;- cvSets$subsets[fold.j]
  
* X.test &lt;- X[test.inds,]
* X.train &lt;- X[-test.inds,]
* y.test &lt;- y[test.inds]
* y.train &lt;- y[-test.inds]
  
  fit &lt;- knn(X.train, X.test, y.train, k=9)
  
  error.fold[j] &lt;- sum(fit!=y.test)
}

cv.error &lt;- sum(error.fold)/n
cv.error
```
]]
.column.bg-main3[.content.vmiddle.center[
# We then form *training* and *testing* data and response objects.
]]

---

class: split-60 white

.column.bg-main1[.content[
&lt;br&gt;

# K-fold CV in <i class="fab  fa-r-project "></i> using `cvTools`

&lt;br&gt;


```r
error.fold &lt;- c()
for(j in 1:K){
  
  fold.j &lt;- which(cvSets$which==j)
  test.inds &lt;- cvSets$subsets[fold.j]
  
  X.test &lt;- X[test.inds,]
  X.train &lt;- X[-test.inds,]
  y.test &lt;- y[test.inds]
  y.train &lt;- y[-test.inds]
  
* fit &lt;- knn(X.train, X.test, y.train, k=9)
  
  error.fold[j] &lt;- sum(fit!=y.test)
}

cv.error &lt;- sum(error.fold)/n
cv.error
```
]]
.column.bg-main3[.content.vmiddle.center[
# We **train** our model using `X.train` and `y.train`, and predict class labels for the points in `X.test`.
]]

---

class: split-60 white

.column.bg-main1[.content[
&lt;br&gt;

# K-fold CV in <i class="fab  fa-r-project "></i> using `cvTools`

&lt;br&gt;


```r
error.fold &lt;- c()
for(j in 1:K){ 
  
  fold.j &lt;- which(cvSets$which==j)
  test.inds &lt;- cvSets$subsets[fold.j]
  
  X.test &lt;- X[test.inds,]
  X.train &lt;- X[-test.inds,]
  y.test &lt;- y[test.inds]
  y.train &lt;- y[-test.inds]
  
  fit &lt;- knn(X.train, X.test, y.train, k=9)
  
* error.fold[j] &lt;- sum(fit!=y.test)
}

cv.error &lt;- sum(error.fold)/n
cv.error
```
]]
.column.bg-main3[.content.vmiddle.center[
# We then see how well we were able to match the true test labels, `y.test`. We store this, as we repeat this K times in total.
]]


---

class: split-60 white

.column.bg-main1[.content[
&lt;br&gt;

# K-fold CV in <i class="fab  fa-r-project "></i> using `cvTools`

&lt;br&gt;


```r
error.fold &lt;- c()
for(j in 1:K){ 
  
  fold.j &lt;- which(cvSets$which==j)
  test.inds &lt;- cvSets$subsets[fold.j]
  
  X.test &lt;- X[test.inds,]
  X.train &lt;- X[-test.inds,]
  y.test &lt;- y[test.inds]
  y.train &lt;- y[-test.inds]
  
  fit &lt;- knn(X.train, X.test, y.train, k=9)
  
  error.fold[j] &lt;- sum(fit!=y.test)
}

*cv.error &lt;- sum(error.fold)/n
cv.error
```

```
## [1] 0.2630208
```
]]
.column.bg-main3[.content.vmiddle.center[
# Since each data point gets tested **exactly once**, we can calculate the cross validation error as the sum of the total errors, divided through by the number of samples.
]]
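
---

# .purple[Sketch: pooled vs averaged CV error]

### Dividing the total number of errors by `n` is a size-weighted average of the per-fold error rates. It equals the plain average exactly when all folds have the same size. The error counts below are hypothetical and only illustrate the arithmetic.

```r
# Fold sizes for n = 768 and K = 5 (768 does not divide evenly by 5)
fold.size <- c(154, 154, 154, 153, 153)

# Hypothetical misclassification counts per fold, for illustration only
error.fold <- c(40, 38, 45, 41, 39)

# Pooled version used on the slide: total errors over total points
pooled <- sum(error.fold) / sum(fold.size)

# Plain average of the K per-fold error rates
averaged <- mean(error.fold / fold.size)

c(pooled = pooled, averaged = averaged)
```

#### With fold sizes this close, the two numbers differ only far down the decimal places.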

---

class: middle center bg-main1

# Your turn!

## <span>&lt;i class="fas  fa-sync-alt fa-4x faa-spin animated "&gt;&lt;/i&gt;</span>


---

class: split-two white 

.column.bg-main1[.content.vmiddle.center[

# .orange[Task 1]

## We want to know which value of k (neighbours) from `k.vals = 1:50` fits the data best. Embed the CV loop on the previous slide in another loop to determine which value of `k` gives the lowest error.

]]

.column.bg-main3[.content.vmiddle.center[


# .orange[Task 2]

## Fit a CV loop for `randomForest`. Does it perform better than the best value of `k` you found in Task 1?

]]


---

# .purple[Repeated K fold CV]

&lt;br&gt;

## Often we want to repeat the CV process, to get a range of error values.

&lt;center&gt;

  &lt;img src="images/repeat1.png" width="90%"&gt;

&lt;/center&gt;

---

# .purple[Repeated K fold CV]

&lt;br&gt;

## Often we want to repeat the CV process, to get a range of error values.

&lt;center&gt;

  &lt;img src="images/repeat2.png" width="90%"&gt;

&lt;/center&gt;


---
# .purple[Repeated K fold CV]

&lt;br&gt;

## Often we want to repeat the CV process, to get a range of error values.

&lt;center&gt;

  &lt;img src="images/repeat3.png" width="90%"&gt;

&lt;/center&gt;


---


class: split-60 white

.column.bg-main1[.content[
&lt;br&gt;

# Repeated K-fold CV in <i class="fab  fa-r-project "></i>

&lt;br&gt;


```r
*reps &lt;- 10
*cv.error &lt;- c()
*for(i in 1:reps){
  set.seed(i)
  cvSets &lt;- cvFolds(n,K)
  
  error.fold &lt;- c()
  for(j in 1:K){
    
    fold.j &lt;- which(cvSets$which==j)
    test.inds &lt;- cvSets$subsets[fold.j]
    
    X.test &lt;- X[test.inds,]
    X.train &lt;- X[-test.inds,]
    y.test &lt;- y[test.inds]
    y.train &lt;- y[-test.inds]
    
    fit &lt;- knn(X.train, X.test, y.train, k=9)
    
    error.fold[j] &lt;- sum(fit!=y.test)
  }
* cv.error[i] &lt;- sum(error.fold)/n
*} 
```
]]
.column.bg-main3[.content.vmiddle.center[
# To perform a **repeated** cross validation, all we do is embed our existing CV loop into another loop that repeats the process `reps` times.
]]
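
---

# .purple[Sketch: the CV loop as a function]

### One possible tidy-up is to wrap a single CV run in a function, so that repeating it is a one-liner. In this sketch the helper name `cv.knn` is made up, the folds are built in base R rather than with `cvTools`, and `iris` stands in for the diabetes data.

```r
library(class)  # for knn()

# One K-fold CV run for knn (hypothetical helper for this sketch)
cv.knn <- function(X, y, K = 5, k = 9) {
  n <- nrow(X)
  subsets <- sample(n)                    # shuffled row indices
  fold.id <- rep(1:K, length.out = n)     # fold of each position
  errors <- 0
  for (j in 1:K) {
    test.inds <- subsets[fold.id == j]
    fit <- knn(X[-test.inds, ], X[test.inds, ], y[-test.inds], k = k)
    errors <- errors + sum(fit != y[test.inds])
  }
  errors / n                              # pooled CV error rate
}

# iris ships with R and stands in for the diabetes data here
set.seed(1)
cv.error <- replicate(10, cv.knn(iris[, 1:4], iris$Species))
summary(cv.error)
boxplot(cv.error, ylab = "5-fold CV error")
```

#### Unlike the loop on the previous slide, this sets the seed once rather than once per repeat. Either way, each repeat draws a fresh random partition.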
---
class: split-two white

.column.bg-main1[.content[
&lt;br&gt;

# Repeated K-fold CV in <i class="fab  fa-r-project "></i>

&lt;br&gt;

&lt;br&gt;

# We can then easily visualise our repeated cross validation errors!


]]
.column[.content.vmiddle.center[


&lt;img src="index_files/figure-html/unnamed-chunk-34-1.png" width="504" /&gt;


]]

---

class: middle center bg-main1

# Your turn!

## <span>&lt;i class="fas  fa-sync-alt fa-4x faa-spin animated "&gt;&lt;/i&gt;</span>

---

class: split-two white 

.column.bg-main1[.content.vmiddle.center[

# .orange[Task 1]

## Perform a 10 repeat, 5 fold Cross Validation, for `randomForest`.

]]

.column.bg-main3[.content.vmiddle.center[


# .orange[Task 2]

## Boxplot the results of `knn` (with best k) vs `randomForest`. Which model performs better?

]]


---

layout: false
class: bg-main3 split-30 hide-slide-number

.column[

]
.column.slide-in-right[.content.vmiddle[
.sliderbox.shade_main.pad1[
.font5[Easy CV with `caret`]
]
]]


---
class: split-60 white

.column.bg-main1[.content[
&lt;br&gt;

# Easy K-fold CV in <i class="fab  fa-r-project "></i>!

&lt;br&gt;


```r
*fitControl &lt;- trainControl(## 5-fold CV
*                          method = "repeatedcv",
*                          number =5,
*                          ## repeated ten times
*                          repeats = 10)

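# caret fits a regression if the response is an integer, so make it a factor
data$class &lt;- as.factor(data$class)

set.seed(1)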
knnFit1 &lt;- train(class ~ ., data = data, 
                 method = "knn", 
                 trControl = fitControl)
knnFit1
```
]]
.column.bg-main3[.content.vmiddle.center[

# Using the `caret` package, we first set up the details of the CV, including how many repeats we want, and how many folds.
]]

---
class: split-60 white

.column.bg-main1[.content[
&lt;br&gt;

# Easy K-fold CV in <i class="fab  fa-r-project "></i>!


```r
*knnFit1 &lt;- train(class ~ ., data = data,
*                method = "knn",
*                trControl = fitControl)
knnFit1
```

```
## k-Nearest Neighbors 
## 
## 768 samples
##   8 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 10 times) 
## Summary of sample sizes: 615, 614, 614, 614, 615, 614, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa    
##   5  0.7143664  0.3508650
##   7  0.7255573  0.3751169
##   9  0.7350547  0.3936117
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
```
]]
.column.bg-main3[.content.vmiddle.center[

# Next, we specify the response and its predictors, as well as what method we want to use, and the training parameters. .orange[**So easy!**]

#### (Note, Accuracy is 1-CV Error).
]]
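
---

# .purple[Sketch: from Accuracy back to CV error]

### To compare `caret`'s output with the hand-coded CV, the accuracies can be converted back to error rates. The values below are copied from the output on the previous slide; the object names are chosen for illustration.

```r
# Accuracy values copied from the caret output on the slide
accuracy <- c("5" = 0.7143664, "7" = 0.7255573, "9" = 0.7350547)

# CV error is the complement of accuracy
cv.error.caret <- 1 - accuracy
cv.error.caret

# The k with the highest accuracy is the k with the lowest CV error
names(which.min(cv.error.caret))
```

#### For k = 9 this gives a CV error of about 0.265, close to the 0.263 from the single hand-coded 5-fold run.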

---

class: split-two white 

.column.bg-main1[.content[
&lt;br&gt;

# But, but, why make us do all that work before??

&lt;br&gt;


### - Not every machine learning method is covered by the `caret` package, so it is important you know how to code up your own.

&lt;br&gt;

### - Also, by knowing how to code up the CV from scratch, you will have a deeper understanding of how it works!


]]

.column[.content.vmiddle.center[

&lt;iframe src="https://giphy.com/embed/9G3wg7lH5DpxC" width="467" height="480" frameBorder="0" class="giphy-embed" allowFullScreen&gt;&lt;/iframe&gt;&lt;p&gt;&lt;a href="https://giphy.com/gifs/9G3wg7lH5DpxC"&gt;via GIPHY&lt;/a&gt;&lt;/p&gt;

]]


---

class: middle center bg-main1

# Your turn!

## <span>&lt;i class="fas  fa-sync-alt fa-4x faa-spin animated "&gt;&lt;/i&gt;</span>
---

class: middle center bg-main3

# Fit a 10 repeat, 5 fold CV for `knn` and `randomForest` using `caret`. Do you get similar results to manual CV?

## You can find the method specifications for `caret` [here](https://rdrr.io/cran/caret/man/models.html).

---

class: bg-main1

# Big thanks go to:

&lt;br&gt;

## - Emi Tanaka for developing the Kunoichi themed slides, and also Yihui Xie for developing `xaringan`!
## - The R-Ladies Sydney organisers - you guys rock!

## <span>&lt;i class="fas  fa-thumbs-up fa-5x faa-float animated "&gt;&lt;/i&gt;</span>
    </textarea>
<script src="https://remarkjs.com/downloads/remark-latest.min.js"></script>
<script>var slideshow = remark.create({
"ratio": "16:9",
"highlightStyle": "github",
"highlightLines": true,
"countIncrementalSlides": false
});
if (window.HTMLWidgets) slideshow.on('afterShowSlide', function (slide) {
  window.dispatchEvent(new Event('resize'));
});
(function() {
  var d = document, s = d.createElement("style"), r = d.querySelector(".remark-slide-scaler");
  if (!r) return;
  s.type = "text/css"; s.innerHTML = "@page {size: " + r.style.width + " " + r.style.height +"; }";
  d.head.appendChild(s);
})();</script>

<script>
(function() {
  var i, text, code, codes = document.getElementsByTagName('code');
  for (i = 0; i < codes.length;) {
    code = codes[i];
    if (code.parentNode.tagName !== 'PRE' && code.childElementCount === 0) {
      text = code.textContent;
      if (/^\\\((.|\s)+\\\)$/.test(text) || /^\\\[(.|\s)+\\\]$/.test(text) ||
          /^\$\$(.|\s)+\$\$$/.test(text) ||
          /^\\begin\{([^}]+)\}(.|\s)+\\end\{[^}]+\}$/.test(text)) {
        code.outerHTML = code.innerHTML;  // remove <code></code>
        continue;
      }
    }
    i++;
  }
})();
</script>
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
(function () {
  var script = document.createElement('script');
  script.type = 'text/javascript';
  script.src  = 'https://cdn.bootcss.com/mathjax/2.7.1/MathJax.js?config=TeX-MML-AM_CHTML';
  if (location.protocol !== 'file:' && /^https?:/.test(script.src))
    script.src  = script.src.replace(/^https?:/, '');
  document.getElementsByTagName('head')[0].appendChild(script);
})();
</script>
  </body>
</html>