```{r include = FALSE}
if(!knitr:::is_html_output())
{
options("width"=56)
knitr::opts_chunk$set(tidy.opts=list(width.cutoff=56, indent = 2), tidy = TRUE)
knitr::opts_chunk$set(fig.pos = 'H')
}
```
# Assessing Model Performance {#model-performance}
This chapter covers **methodological aspects of the error** in predictive models, how to measure it through **cross-validation**, and how this relates to the **bootstrapping** technique. It also covers how these strategies are used internally by some predictive models such as _random forest_ or _gradient boosting machines_.
There is also a section about how to validate models when time is involved, which is similar to classical train/test validation.
<br>
## Knowing the Error {#knowing_the_error}
**Methodological Aspects on Model Validation**
```{r Error-in-predictive-models, echo=FALSE, out.width="30%"}
knitr::include_graphics("model_performance/magnifier.png")
```
<br>
### What's this about?
Once we've built a predictive model, how sure are we about its quality? Did it capture general patterns _-information-_ (excluding the _-noise-_)?
<br>
#### What sort of data?
This approach differs from the one covered in [Out-of-Time Validation](#out-of-time_validation). It can be used even when it is not possible to filter cases by date, for example when we have a snapshot of the data at a certain point in time and no new information will be generated.
Examples include health research data from a small group of people, a survey, or data available on the internet for practice purposes. Adding new cases is either expensive, impractical, unethical or even impossible. The `heart_disease` data that comes with the `funModeling` package is one such example.
<br>
### Reducing unexpected behavior
When a model is trained, it just sees a part of reality. It's a sample from a population that cannot be entirely seen.
There are lots of ways to validate a model (Accuracy / ROC curves / Lift / Gain / etc). All of these metrics are **subject to variance**, which implies **getting different values**. If we remove some cases and then fit a new model, we'll see a _slightly_ different value.
Imagine we build a model and achieve an accuracy of `81`. Now we remove 10% of the cases and fit a new model, and the accuracy is `78.4`. **What is the real accuracy?** The one obtained with 100% of the data, or the one based on 90%? For example, if the model runs live in a production environment, it will see **other cases** and the accuracy will move to a new value.
_So what is the real value? The one to report?_ **Re-sampling** and **cross-validation** techniques average the metric over different samples and testing criteria in order to approximate the most trustworthy value.
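
The effect can be reproduced with a few lines of base R. This is only a sketch: the built-in `infert` data, the formula, and the `0.5` cut point are stand-ins chosen for illustration, and the accuracy is measured on the same rows used to fit each model.

```r
set.seed(123)

# Accuracy of a logistic regression at a 0.5 cut point,
# measured on the same rows used to fit it
fit_accuracy <- function(d) {
  fit <- glm(case ~ spontaneous + induced, data = d, family = binomial)
  predicted <- as.integer(predict(fit, type = "response") >= 0.5)
  mean(predicted == d$case)
}

# Accuracy using 100% of the cases
acc_full <- fit_accuracy(infert)

# Accuracy after randomly removing 10% of the cases, repeated 20 times
acc_subsets <- replicate(20, {
  keep <- sample(nrow(infert), size = round(0.9 * nrow(infert)))
  fit_accuracy(infert[keep, ])
})

acc_full
summary(acc_subsets)
```

Each run of the 90% version sees a different set of cases, so `acc_subsets` holds a spread of values rather than one "true" accuracy.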
<br>
**But why remove cases?**
Removing cases like that makes no sense in itself, but it gives an idea of how sensitive the accuracy metric is. Remember we're working with a sample from an _unknown population_.
If we had a fully deterministic model, one that contained 100% of the cases we are studying and whose predictions were 100% accurate in all cases, we wouldn't need any of this.
Since we always analyze samples, we just need to get closer to the _real and unknown truth_ of the data through repetition, re-sampling, cross-validation, and so on...
<br>
### Let's illustrate this with Cross-Validation (CV)
```{r Cross-Validation, echo=FALSE, out.width="80%", fig.cap="k-fold cross validation", out.extra=''}
knitr::include_graphics("model_performance/k-fold_cross_validation.png")
```
_Image credit: Sebastian Raschka_ Ref. [@evaluate_model]
<br>
#### CV short summary
* Splits the data into random groups, let's say `10`, equally sized. These groups are commonly called `folds`, represented by the `'k'` letter.
* Take `9` folds, build a model, and then apply the model to the remaining fold (the one which was left out). This will return the metric we want: accuracy, ROC, Kappa, etc. We're using accuracy in this example.
* Repeat this `k` times (`10` in our example). So we'll get `10` different accuracies. The final result will be the average of all of them.
This average is the one used to evaluate whether a model is good or not, and also the one to include in a report.
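
The steps above can be sketched by hand in base R. To keep the example free of extra packages, it uses a toy nearest-centroid classifier on `iris`; that classifier and the helper name `predict_nearest_centroid` are assumptions made for this sketch, and any other model could take their place.

```r
set.seed(42)
k <- 10

# Randomly assign each row to one of k folds of equal size
fold_id <- sample(rep(seq_len(k), length.out = nrow(iris)))

# Toy classifier (an assumption for this sketch): predict the species
# whose centroid (mean of each input variable) is closest to the case
predict_nearest_centroid <- function(train, test) {
  centroids <- aggregate(. ~ Species, data = train, FUN = mean)
  inputs <- setdiff(names(centroids), "Species")
  x <- as.matrix(test[, inputs])
  # Squared distance from every test case (rows) to every centroid (columns)
  d <- sapply(seq_len(nrow(centroids)), function(j) {
    rowSums(sweep(x, 2, unlist(centroids[j, inputs]))^2)
  })
  as.character(centroids$Species[max.col(-d, ties.method = "first")])
}

# For each fold: train on the other k-1 folds, test on the one left out
accuracies <- sapply(seq_len(k), function(i) {
  train <- iris[fold_id != i, ]
  test <- iris[fold_id == i, ]
  mean(predict_nearest_centroid(train, test) == as.character(test$Species))
})

accuracies        # one accuracy per fold
mean(accuracies)  # the cross-validated accuracy to report
```

Every row is used for testing exactly once, and never by a model that saw it during training.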
<br>
#### Practical example
There are 150 rows in the `iris` data frame. Using the [caret package](http://topepo.github.io/caret/index.html) to build a `random forest` with `cross-validation` leads to the -internal- construction of 10 random forests, each one based on 135 rows (9/10 * 150) and reporting an accuracy based on the remaining 15 (1/10 * 150) cases. This procedure is repeated 10 times.
This part of the output:
```{r caret-cross-validation-output, echo=FALSE, out.width="86%", fig.cap="caret cross validation output", out.extra=''}
knitr::include_graphics("model_performance/caret_cross_validation_output.png")
```
`Summary of sample sizes: 135, 135, 135, 135, 135, 135, ... `: each 135 represents a training sample. There are 10 in total, but the output is truncated.
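
A call along the following lines should produce this kind of output. It is a sketch rather than the exact code behind the screenshot: the seed is arbitrary, and the block only runs if the `caret` and `randomForest` packages are installed.

```r
# Assumes the caret and randomForest packages are installed
if (requireNamespace("caret", quietly = TRUE) &&
    requireNamespace("randomForest", quietly = TRUE)) {
  library(caret)
  set.seed(999)

  # 10-fold cross-validation: 9 folds to train, 1 fold to test, 10 times
  fit_control <- trainControl(method = "cv", number = 10)
  fit_rf <- train(Species ~ ., data = iris, method = "rf",
                  trControl = fit_control)

  print(fit_rf)                    # includes "Summary of sample sizes"
  print(fit_rf$resample$Accuracy)  # one accuracy per held-out fold
}
```

By default `train` also tries a few values of the `mtry` parameter, so it fits one set of 10 forests per candidate value. The accuracies in `fit_rf$resample` belong to the selected value.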
Rather than a single number -the average-, we can see a distribution:
```{r Accuracy-predictive-models, echo=FALSE, out.width="50%", fig.cap="Visual analysis of the accuracy distribution", out.extra=''}
knitr::include_graphics("model_performance/accuracy_distribution_plot.png")
```
```{r Accuracy-predictive-models-2, echo=FALSE, out.width="70%", fig.cap="Accuracy distribution", out.extra=''}
knitr::include_graphics("model_performance/accuracy_distribution.png")
```
* The min/max accuracy will be between `~0.8` and `~1`.
* The mean is the one reported by `caret`.
* 50% of the time it will range between `~0.93` and `~1`.
Recommended reading by Rob Hyndman, creator of the `forecast` package: _Why every statistician should know about cross-validation?_ [@why_cross_validation]
<br>
### But what is Error?
It is the sum of the (squared) **Bias**, the **Variance** and the **_unexplained error_** -the inner noise in the data, which the model will never be able to reduce-.
These three elements make up the reported error.
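
For a numerical target measured with squared error, this sum is usually written as follows. The notation is an assumption of this sketch and is not used elsewhere in the book: $f$ is the true relationship, $\hat{f}$ the fitted model, and $\sigma^2$ the variance of the noise.

$$
\mathrm{E}\big[(y - \hat{f}(x))^2\big] =
\underbrace{\big(\mathrm{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2} +
\underbrace{\mathrm{E}\big[(\hat{f}(x) - \mathrm{E}[\hat{f}(x)])^2\big]}_{\text{Variance}} +
\underbrace{\sigma^2}_{\text{unexplained error}}
$$

The expectation is taken over the different training samples we could have drawn, which is what re-sampling tries to imitate.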
#### What is the nature of Bias and Variance?
When the model doesn't work well, there may be several causes:
* **Model too complicated**: Let's say we have lots of input variables. This is related to **high variance**: the model will overfit the training data and have poor accuracy on unseen data because it is too tailored to the cases it has seen.
* **Model too simple**: On the other hand, the model may not be capturing all the information from the data due to its simplicity. This is related to **high bias**.
* **Not enough input data**: Data forms shapes in an n-dimensional space (where `n` is the number of input plus target variables). If there are not enough points, this shape is not developed well enough.
More info in _"In Machine Learning, What is Better: More Data or better Algorithms"_ [@more_data_or_better_algorithms].
```{r bias-variance, echo=FALSE, out.width="75%", fig.cap="Bias vs. variance tradeoff", out.extra=''}
knitr::include_graphics("model_performance/bias_variance.png")
```
_Image credit: Scott Fortmann-Roe_ [@bias_variance_tradeoff]. It also contains an intuitive way of understanding error through bias and variance by means of an animation.
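
A small simulation can make the "too simple" and "too complicated" cases concrete. Everything here is an assumption made for illustration: the data is a simulated sine wave plus noise, and the polynomial degree plays the role of model complexity.

```r
set.seed(7)

# Simulated data (an assumption for this sketch): a sine wave plus noise
simulate <- function(n) {
  x <- runif(n, 0, 2 * pi)
  data.frame(x = x, y = sin(x) + rnorm(n, sd = 0.3))
}
train <- simulate(50)
test <- simulate(1000)

# Mean squared error of a fitted model on a given data set
mse <- function(fit, d) mean((d$y - predict(fit, newdata = d))^2)

# Polynomial degree as a proxy for model complexity
degrees <- c(1, 3, 15)
errors <- t(sapply(degrees, function(deg) {
  fit <- lm(y ~ poly(x, deg), data = train)
  c(degree = deg, train_mse = mse(fit, train), test_mse = mse(fit, test))
}))
errors
```

The training error can only go down as the degree grows, because each model contains the simpler ones. The error on unseen data is the one that tells whether the extra complexity captured information or noise.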
<br>
#### Complexity vs Accuracy Tradeoff
```{r accuracy-machine-learning, echo=FALSE, out.width="100px"}
knitr::include_graphics("model_performance/complexity_accuracy_balance.png")
```
Bias and variance are related in the sense that when one goes down the other tends to go up, so there is a **tradeoff** between them. A practical example of this is the Akaike Information Criterion (AIC) model quality measure.
**AIC** is used as a heuristic to pick the best **time series model** in the `auto.arima` function of the `forecast` package in `R` [@arima_modeling_in_r]. It chooses the model with the lowest AIC.
The lower, the better: a better fit to the data lowers the value, while a higher number of parameters increases it.
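
The same idea can be tried without `auto.arima`, using the `arima` function that ships with R. This is only a sketch: the built-in `lh` series and the three candidate orders are arbitrary choices, and the search that `auto.arima` performs is more elaborate than this.

```r
# Candidate time series models for the built-in `lh` series
# (the orders are arbitrary choices for this sketch)
candidates <- list(
  "AR(1)" = c(1, 0, 0),
  "AR(3)" = c(3, 0, 0),
  "ARMA(1,1)" = c(1, 0, 1)
)
fits <- lapply(candidates, function(ord) arima(lh, order = ord))

# AIC = 2 * (number of estimated parameters) - 2 * log-likelihood
aic_by_hand <- sapply(fits, function(fit) {
  n_params <- length(coef(fit)) + 1  # coefficients plus the noise variance
  2 * n_params - 2 * fit$loglik
})

aic_by_hand
sapply(fits, AIC)              # same values, computed by R
names(which.min(aic_by_hand))  # model with the lowest AIC
```

The log-likelihood term rewards fit, and the parameter count is the penalty for complexity.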
<br>
#### Bootstrapping vs Cross-Validation
* **Bootstrapping** is mostly used when estimating a parameter.
* **Cross-Validation** is the choice when choosing among different predictive models.
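
As a minimal sketch of the first point, the bootstrap below estimates how much the mean of `iris$Sepal.Length` would vary from sample to sample. The variable, the number of resamples, and the 95% level are arbitrary choices for illustration.

```r
set.seed(2024)
x <- iris$Sepal.Length

# Resample the data with replacement many times and
# recompute the parameter (here, the mean) on each resample
boot_means <- replicate(2000, mean(sample(x, replace = TRUE)))

# 95% percentile interval for the mean
ci <- quantile(boot_means, c(0.025, 0.975))

mean(x)         # estimate from the original sample
sd(boot_means)  # bootstrap standard error
ci
```

Each resample has the same size as the original data, so some cases appear more than once and others are left out.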
Note: For deeper coverage of bias and variance, please see [@bias_variance_tradeoff] and [@more_data_or_better_algorithms] at the bottom of the page.
### Any advice on practice?
It depends on the data, but it's common to find examples such as `10 fold CV`, plus repetition: `10 fold CV, repeated 5 times`. Other times we find: `5 fold CV, repeated 3 times`.
The reported value is the average of the desired metric. It's also recommended to use `ROC`, since it is less affected by unbalanced target variables.
Since these validation techniques are **time consuming**, consider choosing a model which runs fast, allowing model tuning, testing different configurations, and trying different variables in a "short" amount of time. [Random Forests](https://en.wikipedia.org/wiki/Random_forest) are an excellent option which gives **fast** and **accurate** results. More on Random Forest overall performance in [@do_we_need_hundred_models].
Another good option is **gradient boosting machines**. They have more parameters to tune than random forest, but at least in R their implementation works fast.
#### Going back to bias and variance
* Random Forest focuses on decreasing variance, while...
* Gradient boosting machine focuses on decreasing bias. More info in _"Gradient boosting machine vs random forest"_ [@gbm_vs_random_forest].
<br>
### Don't forget: Data Preparation
Tweaking the input data by transforming and cleaning it has an impact on model quality, sometimes more than optimizing the model through its parameters.
This point is expanded in the [Data Preparation](#data_preparation) chapter.
### Final thoughts
* Validating models through re-sampling / cross-validation helps us to estimate the "real" error present in the data. If the model runs in the future, that is the error we should expect.
* Another advantage is **model tuning**, which avoids overfitting when selecting the best parameters for a certain model: [Example in caret](https://topepo.github.io/caret/model-training-and-tuning.html). The equivalent in **Python** is included in [Scikit Learn](http://scikit-learn.org/stable/modules/cross_validation.html).
* The best test is the one made by you, suited to your data and needs. Try different models and analyze the tradeoff between time consumption and any accuracy metric.
> These re-sampling techniques could be among the powerful tools behind sites like stackoverflow.com or collaborative open-source software: having many opinions produces a less-biased solution.
But each opinion has to be reliable. Imagine asking different doctors for a medical diagnosis.
<br>
### Further reading
* Tutorial: [Cross validation for predictive analytics using R](http://www.milanor.net/blog/cross-validation-for-predictive-analytics-using-r)
* Tutorial by Max Kuhn (caret's creator): [Comparing Different Species of Cross-Validation](http://appliedpredictivemodeling.com/blog/2014/11/27/vpuig01pqbklmi72b8lcl3ij5hj2qm)
* The cross-validation approach can also be applied to time-dependent models, check the other section: [Out-of-time Validation](#out-of-time_validation).
<br>
---
```{r, echo=FALSE}
knitr::include_graphics("introduction/spacer_bar.png")
```
---
<br>
## Out-of-Time Validation {#out-of-time_validation}
```{r error-in-machine-learning, echo=FALSE, out.width="100px"}
knitr::include_graphics("model_performance/calendar.png")
```
### What's this about?
Once we've built a predictive model, how sure are we that it captured general patterns and not just the data it has seen (overfitting)?
Will it perform well when it is in production / running live? What is the expected error?
<br>
### What sort of data?
If it's generated over time and -let's say- every day we have new cases like _"page visits on a website"_, or _"new patients arriving at a medical center"_, one strong validation is the **Out-Of-Time** approach.
<br>
### Out-Of-Time Validation Example
**How to?**
Imagine we are building the model on **Jan-01**, and to build it we use all the data **up to Oct-31**. Between these two dates, there are 2 months.
When predicting a **binary/two class variable** (or multi-class), it's quite straightforward: with the model we've built -with data <= **Oct-31**- we score the data on that exact day, and then we measure how the users/patients/persons/cases evolved during those two months.
Since the output of a binary model should be a number indicating the likelihood for each case to belong to a particular class ([Scoring Data](#scoring_data) chapter), we test what the **model "_said_" on Oct-31 against what actually happened on Jan-01**.
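
The comparison can be sketched with simulated data. Everything in this block is hypothetical: the `usage` input, the `make_snapshot` helper, and the assumption that the model is trained on an older snapshot whose outcomes were already known by Oct-31.

```r
set.seed(10)

# Hypothetical snapshot: one input variable and a binary outcome
# observed two months after the inputs were measured
make_snapshot <- function(n) {
  d <- data.frame(usage = rnorm(n))
  d$target <- rbinom(n, 1, plogis(-1 + 1.5 * d$usage))
  d
}

history <- make_snapshot(500)  # inputs and outcomes fully known by Oct-31
oct31 <- make_snapshot(500)    # inputs as of Oct-31, outcome seen on Jan-01

# Build the model using only information available up to Oct-31
fit <- glm(target ~ usage, data = history, family = binomial)

# What the model "said" on Oct-31...
oct31$score <- predict(fit, newdata = oct31, type = "response")
oct31$predicted <- as.integer(oct31$score >= 0.5)

# ...against what actually happened by Jan-01
table(predicted = oct31$predicted, actual = oct31$target)
mean(oct31$predicted == oct31$target)
```

The key point is that no information observed after Oct-31 is used to build the model or to produce the scores.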
<br>
The following **validation workflow** may be helpful when building a predictive model involving time.
```{r model-performance-workflow, echo=FALSE, out.width="100%", fig.cap="A validation workflow for time dependant problems", out.extra=''}
knitr::include_graphics("model_performance/model_validation_workflow.png")
```
_[Enlarge image.](http://datascienceheroes.com/img/blog/model_validation_workflow.png)_
<br>
### Using Gain and Lift Analysis
This analysis is explained in another chapter ([Gain & Lift](#gain_and_lift)) and it can be used following the out-of-time validation.
Keeping only those cases that were `negative` on `Oct-31`, we get the `score` returned by the model on that date, and the `target` variable is the value that those cases had on `Jan-1`.
### How about a numerical target variable?
Here common sense and business needs play a bigger role. A numerical outcome can take any value and can increase or decrease over time, so we may have to consider both scenarios when deciding what we consider success. This is the case of linear regression.
**Example scenario**: We measure the usage of some web app (like home banking), where the usual pattern is that users use it more as the days pass.
Examples:
* Predicting the concentration of a certain substance in the blood.
* Predicting page visits.
* Time series analysis.
In these cases we also have the difference between **"what was expected" vs. "what it is"**.
This difference can take any value. It is the error, or residual.
```{r gain-lift-analysis-in-r, echo=FALSE, out.width="100%", fig.cap="Prediction and error analysis", out.extra=''}
knitr::include_graphics("model_performance/numerical_variable.png")
```
If the model is good, this error should be **white noise**, more info in the _"Time series analysis and regression"_ section of [@white_noise]. It follows a normal curve and mainly has the following properties:
* The error should be **around 0** -_the model must tend its error to 0_-.
* The standard deviation of this error **must be finite** -to avoid unpredictable outliers-.
* There must be no correlation between the errors.
* **Normal distribution**: expect the majority of errors around 0, with bigger errors appearing in a **smaller proportion** as the error increases -the likelihood of finding bigger errors decreases exponentially-.
```{r error-curve-in-r, echo=FALSE, out.width="60%", fig.cap="A nice error curve (normal distribution)", out.extra=''}
knitr::include_graphics("model_performance/normal_error_curve.png")
```
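
These properties can be checked with functions that ship with R. The sketch below uses simulated data where the linear model is correct by construction, so the residuals should look like white noise. The choice of tests (Ljung-Box and Shapiro-Wilk) and the lag of 10 are assumptions for this example, not the only option.

```r
set.seed(1)

# Simulated example (an assumption for this sketch): a linear model
# fitted to data that really is linear plus normal noise
x <- seq(1, 100)
y <- 3 + 0.5 * x + rnorm(100, sd = 2)
errors <- residuals(lm(y ~ x))

mean(errors)  # should be around 0
sd(errors)    # finite spread

# No correlation between errors: Ljung-Box test
# (a large p-value gives no evidence of autocorrelation)
autocorrelation_test <- Box.test(errors, lag = 10, type = "Ljung-Box")
autocorrelation_test

# Normal distribution: Shapiro-Wilk test
# (a large p-value gives no evidence against normality)
normality_test <- shapiro.test(errors)
normality_test
```

With real data, a small p-value in either test suggests there is still some pattern in the errors that the model did not capture.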
<br>
### Final thoughts
* **Out-of-Time Validation** is a powerful validation tool to simulate running the model in production, with data that may **not need to depend on sampling**.
* **Error analysis** is a big topic in data science. Key concepts on it are covered in [Knowing the error](#knowing_the_error).
<br>
---
```{r, echo=FALSE}
knitr::include_graphics("introduction/spacer_bar.png")
```
---
<br>
## Gain and Lift Analysis {#gain_and_lift}
### What is this about?
Both metrics are extremely useful to validate the quality of a predictive model with a binary outcome. More info in [scoring data](#scoring_data).
Make sure we have the latest `funModeling` version (>= 1.3).
```{r lib-loading, results="hide", message=FALSE}
# Loading funModeling
library(funModeling)
```
```{r model-performance-gain-curve, fig.width=12, fig.height=3, fig.cap="Gain and lift curves", out.extra=''}
# Create a GLM model
fit_glm=glm(has_heart_disease ~ age + oldpeak, data=heart_disease, family = binomial)
# Get the scores/probabilities for each row
heart_disease$score=predict(fit_glm, newdata=heart_disease, type='response')
# Plot the gain and lift curve
gain_lift(data=heart_disease, score='score', target='has_heart_disease')
```
### How to interpret it?
First, each case is ordered according to its score value, that is, the likelihood of belonging to the less representative class.
Then the `Gain` column accumulates the positive class for each 10% of rows -the `Population` column-.
So for the first row, it can be read as:
_"The first 10 percent of the population, ordered by score, collects 20.86% of total positive cases"_
For example, if we are sending emails based on this model, and we have a budget to reach only **20%** of our users, what share of the total responses should we expect to get? **Answer: 35.97%**
<br>
### What about not using a model?
If we **don't use a model** and we randomly select 20% of the users, what share of the positive cases should we expect to reach? Well, 20%. That is the meaning of the **dashed line**, which starts at 0% and ends at 100%. Hopefully, with the predictive model we'll beat randomness.
The **Lift** column represents the ratio between the `Gain` and the _gain by chance_. Taking Population=20% as an example, the model is **1.8 times better** than randomness (35.97 / 20 ≈ 1.8).
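
To see where these two columns come from, they can be computed by hand. This is a sketch, not the code inside `gain_lift`: the built-in `infert` data and the formula are stand-ins, and cases with tied scores are simply kept in their original order.

```r
# Scores from a logistic regression on the built-in `infert` data
# (dataset and formula are stand-ins chosen for this sketch)
fit <- glm(case ~ age + parity + spontaneous + induced,
           data = infert, family = binomial)
score <- predict(fit, type = "response")
target <- infert$case

# Order the cases from highest to lowest score
target_sorted <- target[order(score, decreasing = TRUE)]

population <- seq(10, 100, by = 10)
# Number of rows included in the top 10%, 20%, ..., 100%
n_rows <- ceiling(length(target_sorted) * population / 100)

# Gain: percentage of all positive cases captured in each top segment
gain <- 100 * cumsum(target_sorted)[n_rows] / sum(target_sorted)
# Lift: gain compared with the gain expected by chance (the population %)
lift <- gain / population

data.frame(Population = population, Gain = round(gain, 2), Lift = round(lift, 2))
```

The last row always shows a gain of 100% and a lift of 1: once the whole population is selected, the model and randomness capture the same cases.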
<br>
#### Using the cut point `r emo::ji("scissors")`
What value of the score reaches 30% of the population?
Answer: `0.56`
The cut point allows us to segment the data.
<br>
#### Comparing models
In a good model, the gain will reach the 100% "at the beginning" of the population, representing that it separates the classes.
When comparing models, a quick metric is to see if the gain at the beginning of the population (10-30%) is higher.
As a result, the model with a higher gain at the beginning will have captured more information from data.
Let's illustrate it...
```{r predictive-model-comparison, echo=FALSE, out.width="100%", fig.cap="Comparing the gain and lift curves for two models", out.extra=''}
knitr::include_graphics("model_performance/model_comparison.png")
```
_[Enlarge image.](http://datascienceheroes.com/img/blog/model_comparison.png)_
<br>
**Cumulative Gain Analysis**: Model 1 reaches the ~20% of positive cases around the 10% of the population, while model 2 reaches a similar proportion approaching the 20% of the population. _Model 1 is better._
**Lift analysis**: Same as before, but it is also suspicious that not every lift number follows a decreasing pattern. Maybe the model is not ordering the first percentiles of the population well.
The same ordering concepts are seen in the [Profiling target using cross_plot](#profiling_target_cross_plot) chapter.
<br>
---
```{r, echo=FALSE}
knitr::include_graphics("introduction/spacer_bar.png")
```
---
<br>
## Scoring Data {#scoring_data}
### The intuition behind
Events can occur, or not... although we don't have _tomorrow's newspaper_ `r emo::ji("newspaper")`, we can make a good guess about how it is going to be.
```{r error-curve, echo=FALSE, out.width="170px"}
knitr::include_graphics("scoring/cover.png")
```
The future is undoubtedly attached to *uncertainty*, and this uncertainty can be estimated.
<br>
#### And there are different targets...
For now, this book will cover the classical `Yes`/`No` target -also known as binary or two-class prediction.
So, this estimation is the _value of truth_ of an event to happen, therefore a probabilistic value between 0 and 1.
#### Two-label vs. multi-label outcome
Please note this chapter is written for a binary outcome (two-label outcome), but a **multi-label** target can be seen as a generalization of the binary case.
For example, with a target that has 4 different values, there can be 4 models that each predict the likelihood of belonging to a particular class or not, and then a higher-level model which takes the results of those 4 models and predicts the final class.
<br>
#### Say what? `r emo::ji("hushed")`
Some examples:
- Is this client going to buy this product?
- Is this patient going to get better?
- Is certain event going to happen in the next few weeks?
The answers to these last questions are True or False, but **the essence is to have a score**, or a number indicating the likelihood of a certain event to happen.
<br>
#### But we need more control...
Many machine learning resources show the simplified version -which is good to start- getting the final class as an output. Let's say:
Simplified approach:
* Question: _Is this person going to have heart disease?_
* Answer: "No"
But there is something else before the "Yes/No" answer, and this is the score:
* Question: _What is the likelihood for this person of having heart disease?_
* Answer: "25%"
<br>
So first we get the score, and then according to our needs we set the **cut point**. And this is **really** important.
### Let's see an example
```{r scoring-1, echo=FALSE, out.width="100%", fig.cap="Simple dataset example", out.extra='' }
knitr::include_graphics("scoring/tbl_example_1.png")
```
Example table showing the following:
* `id`=identity
* `x1`,`x2` and `x3` input variables
* `target`=variable to predict
```{r scoring-machine-learning, echo=FALSE, out.width="50%", fig.cap="Getting the score (predictive model output)", out.extra=''}
knitr::include_graphics("scoring/tbl_example_2.png")
```
Forgetting about the input variables... After the creation of the predictive model, like a random forest, we are interested in the **scores**, even though our final goal is to deliver a `yes`/`no` predicted variable.
For example, the following 2 sentences express the same: _The likelihood of being `yes` is `0.8`_ <=> _The probability of being `no` is `0.2`_
Maybe it is understood, but the score usually refers to the less representative class: `yes`.
---
`r emo::ji("raised_hand")` **R Syntax** -_skip it if you don't want to see code_-
The following statement will return the score:
`score = predict(randomForestModel, data, type = "prob")[, 2]`
Please note that for other models this syntax may vary a little, but the concept **will remain the same**, even for other languages.
Here `prob` indicates we want the probabilities (or scores).
The `predict` function with the `type="prob"` parameter returns a matrix of 15 rows and 2 columns: the 1st indicates the likelihood of being `no` while the 2nd one shows the same for class `yes`.
Since the target variable can be `no` or `yes`, the `[, 2]` returns the likelihood of being -in this case- `yes` (which is the complement of the `no` likelihood).
---
<br>
### It's all about the cut point `r emo::ji("straight_ruler")`
```{r scoring-machine-learning-9, echo=FALSE, out.width="50%", fig.cap="Cases ordered by highest score", out.extra=''}
knitr::include_graphics("scoring/tbl_example_3.png")
```
Now the table is ordered by descending score.
This shows how to extract the final class, with the cut point at `0.5` by default. Tweaking the cut point may lead to a better classification.
> Accuracy metrics or the confusion matrix are always attached to a certain cut point value.
<br>
After assigning the cut point, we can see the classification results getting the famous:
* `r emo::ji("white_check_mark")` **True Positive** (TP): It's _true_, that the classification is _positive_, or, "the model hit correctly the positive (`yes`) class".
* `r emo::ji("white_check_mark")` **True Negative** (TN): Same as before, but with negative class (`no`).
* `r emo::ji("x")` **False Positive** (FP): It's _false_, that the classification is _positive_, or, "the model missed, it predicted `yes` but the result was `no`"
* `r emo::ji("x")` **False Negative** (FN): Same as before, but with negative class, "the model predicted negative, but it was positive", or, "the model predicted `no`, but the class was `yes`"
```{r scoring-machine-learning-2, echo=FALSE, out.width="100%", fig.cap="Assigning the predicted label (cutoff=0.5)", out.extra='' }
knitr::include_graphics("scoring/tbl_example_4.png")
```
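
The four counts can be obtained with a few comparisons. The scores below are made up for this sketch (they are not the values in the table above), keeping the same class distribution of 5 `yes` and 10 `no`.

```r
# Hypothetical scores for 15 cases (5 `yes`, 10 `no`), made up for this sketch
data <- data.frame(
  target = c(rep("yes", 5), rep("no", 10)),
  score = c(0.95, 0.90, 0.85, 0.82, 0.45,
            0.88, 0.60, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10, 0.08, 0.05)
)

# Assign the predicted label according to the cut point
cut_point <- 0.5
data$predicted <- ifelse(data$score >= cut_point, "yes", "no")

tp <- sum(data$predicted == "yes" & data$target == "yes")
tn <- sum(data$predicted == "no" & data$target == "no")
fp <- sum(data$predicted == "yes" & data$target == "no")
fn <- sum(data$predicted == "no" & data$target == "yes")

c(TP = tp, TN = tn, FP = fp, FN = fn)

# The same counts as a confusion matrix
table(predicted = data$predicted, actual = data$target)
```

Changing `cut_point` and re-running the block shows how the four counts move together.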
<br>
### The best and the worst scenario
Just like Zen does, the analysis of the extremes will help to find the middle point.
`r emo::ji("thumbsup")` The best scenario is when **TP** and **TN** rates are 100%. That means the model correctly predicts all the `yes` and all the `no`; _(as a result, **FP** and **FN** rates are 0%)_.
But wait `r emo::ji("raised_hand")` ! If we find a perfect classification, probably it's because of overfitting!
`r emo::ji("thumbsdown")` The worst scenario -the opposite to last example- is when **FP** and **FN** rates are 100%. Not even randomness can achieve such an awful scenario.
_Why?_ If the classes are balanced, 50/50, flipping a coin will get around half of the results right. This is the common baseline to test if the model is better than randomness.
<br>
In the example provided, class distribution is 5 for `yes`, and 10 for `no`; so: 33.3% (5/15) is `yes`.
<br>
---
### Comparing classifiers
#### Comparing classification results
`r emo::ji("question")` **Trivia**: Is a model which correctly predicts this 33.3% (TP rate=100%) a good one?
_Answer_: It depends on how many `yes` the model predicted.
<br>
A classifier that always predicts `yes` will have a TP rate of 100%, but it is absolutely useless since lots of its `yes` predictions will actually be `no`. As a matter of fact, the FP rate will also be 100%.
#### Comparing ordering label based on score
A classifier must be trustworthy, and this is what the **ROC** curve measures when plotting the TP rate vs the FP rate. The higher the proportion of TP over FP, the higher the Area Under the ROC Curve (AUC).
The intuition behind the ROC curve is to get a **sanity measure** regarding the **score**: how well it orders the label. Ideally, all the positive labels must be at the top, and the negative ones at the bottom.
<br>
```{r scoring-machine-learning-3, echo=FALSE, out.width="100%", fig.cap="Comparing two predictive model scores", out.extra=''}
knitr::include_graphics("scoring/tbl_example_5.png")
```
<br>
`model 1` will have a higher AUC than `model 2`.
Wikipedia has an extensive and good article on this: https://en.wikipedia.org/wiki/Receiver_operating_characteristic
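
The ordering idea can be checked numerically. The sketch below computes the AUC from the ranks of the scores, which is equivalent to the probability that a randomly chosen positive case gets a higher score than a randomly chosen negative one. The helper name `auc` and the six example cases are hypothetical, not the values in the figure above.

```r
# AUC as the probability that a random positive case gets a higher
# score than a random negative one (rank-based formula)
auc <- function(score, is_positive) {
  r <- rank(score)  # ties get the average rank
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical example: 3 positive and 3 negative cases scored by two models
is_positive <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)
score_model_1 <- c(0.9, 0.8, 0.7, 0.4, 0.3, 0.2)  # all positives at the top
score_model_2 <- c(0.9, 0.4, 0.3, 0.8, 0.7, 0.2)  # positives mixed with negatives

auc_1 <- auc(score_model_1, is_positive)  # perfect ordering: 1
auc_2 <- auc(score_model_2, is_positive)  # 5 of 9 pairs well ordered: 0.556
c(model_1 = auc_1, model_2 = auc_2)
```

Note that the AUC depends only on how the scores order the cases, not on any particular cut point.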
Here is the comparison of 4 models, given a cut point of 0.5:
```{r roc-curve-machine-learning, echo=FALSE, out.width="100%", fig.cap="Comparing four predictive models", out.extra=''}
knitr::include_graphics("scoring/4_models_roc.png")
```
<br>
---
### Hands on R!
We'll be analyzing three scenarios based on different cut-points.
```{r, eval=FALSE}
# install.packages("rpivotTable")
# rpivotTable creates a pivot table dynamically and also supports plots;
# more info at: https://github.com/smartinsightsfromdata/rpivotTable
library(rpivotTable)
## reading the tab-separated data
data=read.delim(file="https://goo.gl/ac5AkG", sep="\t", header = TRUE, stringsAsFactors=FALSE)
```
#### Scenario 1: cut point @ `0.5`
Classical confusion matrix, indicating how many cases fall in the intersection of real vs predicted value:
```{r, eval=FALSE}
data$predicted_target=ifelse(data$score>=0.5, "yes", "no")
rpivotTable(data = data, rows = "predicted_target", cols="target", aggregatorName = "Count", rendererName = "Table", width="100%", height="400px")
```
```{r Scoring-machine-learning-5, echo=FALSE, out.width="85%", fig.cap="Confusion matrix (metric: count)", out.extra=''}
knitr::include_graphics("scoring/count_1.png")
```
Another view, now each column sums **100%**. Good to answer the following questions:
```{r, eval=FALSE, fig.cap="Confusion matrix (metric: percentage per column)", out.extra=''}
rpivotTable(data = data, rows = "predicted_target", cols="target", aggregatorName = "Count as Fraction of Columns", rendererName = "Table", width="100%", height="400px")
```
```{r Scoring-machine-learning-6, echo=FALSE, out.width="85%", fig.cap="Confusion matrix (cutpoint at 0.5)", out.extra=''}
knitr::include_graphics("scoring/percentage_1.png")
```
* _What is the percentage of real `yes` values captured by the model? Answer: 80%._ Also known as **Recall**, **Sensitivity** or True Positive Rate (TPR).
* _What is the percentage of `yes` thrown by the model? 40%._
So, from the last two sentences:
**The model labels 4 out of 10 cases as `yes`, and it captures 80% of the real `yes` cases.**
<br>
Another view: with 15 cases, the model throws 6 `yes` predictions and 4 of them are correct, so it hits about 2 out of every 3 `yes` predictions _(4/6=0.667)_. This is the **Precision** (PPV).
Note: This last kind of analysis can also be found when building association rules (market basket analysis) and decision tree models.
<br>
#### Scenario 2: cut point @ `0.4`
Time to change the cut point to `0.4`, so the amount of `yes` will be higher:
```{r, eval=FALSE}
data$predicted_target=ifelse(data$score>=0.4, "yes", "no")
rpivotTable(data = data, rows = "predicted_target", cols="target", aggregatorName = "Count as Fraction of Columns", rendererName = "Table", width="100%", height="400px")
```
```{r Scoring-machine-learning-7, echo=FALSE, out.width="85%", fig.cap="Confusion matrix (cutpoint at 0.4)", out.extra=''}
knitr::include_graphics("scoring/percentage_2.png")
```
Now the model captures `100%` of `yes` (TP), so the total amount of `yes` produced by the model increased to `46.7%`, but at no cost since the *TN and FP remained the same* :thumbsup:.
<br>
#### Scenario 3: cut point @ `0.8`
Want to decrease the FP rate? Set the cut point to a higher value, for example `0.8`, which will cause the number of `yes` produced by the model to decrease:
```{r, eval=FALSE}
data$predicted_target=ifelse(data$score>=0.8, "yes", "no")
rpivotTable(data = data, rows = "predicted_target", cols="target", aggregatorName = "Count as Fraction of Columns", rendererName = "Table", width="100%", height="400px")
```
```{r scoring-machine-learning-8, echo=FALSE, out.width="85%", fig.cap="Confusion matrix (cutpoint at 0.8)", out.extra=''}
knitr::include_graphics("scoring/percentage_3.png")
```
<br>
Now the FP rate decreased to `10%` (from `20%`), and the model still captures `80%` of the real `yes` cases (TP rate), which is the same rate as the one obtained with a cut point of `0.5` :thumbsup:.
**Increasing the cut point to `0.8` improved the model at no cost.**
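
The three scenarios can be compared side by side without the pivot table. The scores below are hypothetical: they were made up so that the rates match the ones reported in this section, and they are not the data read from the file above. The helper name `rates_at` is also an assumption of this sketch.

```r
# Hypothetical scores for 15 cases (5 `yes`, 10 `no`), made up for this sketch
target <- c(rep("yes", 5), rep("no", 10))
score <- c(0.95, 0.90, 0.85, 0.82, 0.45,
           0.88, 0.60, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10, 0.08, 0.05)

# TP rate, FP rate and share of `yes` predictions for a given cut point
rates_at <- function(cut_point) {
  predicted_yes <- score >= cut_point
  c(cut_point = cut_point,
    tp_rate = mean(predicted_yes[target == "yes"]),
    fp_rate = mean(predicted_yes[target == "no"]),
    predicted_yes = mean(predicted_yes))
}

rates <- t(sapply(c(0.4, 0.5, 0.8), rates_at))
rates
```

Each row is one scenario, which makes it easier to see which rates change and which stay the same when the cut point moves.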
<br>
### Conclusions
* This chapter has focused on the essence of predicting a binary variable: To produce a score or likelihood number which **orders** the target variable.
* A predictive model maps the input with the output.
* There is no single best **cut point value**: it depends on the project needs, and is constrained by the rates of `False Positive` and `False Negative` we can accept.
This book addresses general aspects on model performance in [Knowing the error](#knowing_the_error).
---
```{r, echo=FALSE}
knitr::include_graphics("introduction/spacer_bar.png")
```
---