-
Notifications
You must be signed in to change notification settings - Fork 0
/
presentation.rmd
857 lines (531 loc) · 29.6 KB
/
presentation.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
---
title: "Prediction of Coronary Artery Disease"
author: "Francesco Vo & Marco Uderzo"
date: "2023-04-30"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
```
```{r, include=FALSE}
library(MASS)
library(pROC)
library(class)
library(ggplot2)
library(ggcorrplot)
library(gridExtra)
library(corrplot)
library(correlation)
library(ggm)
library(igraph)
library(tidymodels)
library(naivebayes)
library(ROCR)
library(glmnet)
library(stats)
data <- read.csv("data/heart_data.csv", stringsAsFactors = T)
attach(data)
```
Cardiovascular diseases (CVDs) are the number one cause of death globally.
Coronary Heart Disease is a common and very deadly occurrence, and this dataset contains 11 features that can be used to predict it.
People with cardiovascular diseases or who are at high cardiovascular risk need early detection wherein statistical learning models can be of great help.
## Project goal and dataset description
https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction
The dataset consists of medical data from 5 different hospitals.
This data informs us of the presence of coronary heart disease.
Our main goal is to predict the presence of this condition given the other parameters in the dataset.
This dataset has a number of 12 parameters and presents us 918 observations.
The parameters are:
1. **Age**: age of the patient [years]
2. **Sex**: sex of the patient [M: male, F: female]
3. **ChestPainType**: chest pain type [TA: typical angina, ATA: atypical angina, NAP: non-anginal pain, ASY: asymptomatic]. Angina is a type of chest pain caused by reduced blood flow to the heart. Angina is a symptom of coronary heart disease.
4. **RestingBP**: resting blood pressure [mmHg]
5. **Cholesterol**: serum cholesterol [mm/dl]
6. **FastingBS**: fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]. This measures the blood sugar level after an overnight fast.
7. **RestingECG**: resting electrocardiogram results [Normal: normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes' criteria]
8. **MaxHR**: maximum heart rate achieved [numeric value between 60 and 202]
9. **ExerciseAngina**: exercise-induced angina [Y: yes, N: no]
10. **Oldpeak**: oldpeak = ST [numeric value between -2.6 and 6.2]. ST depression refers to a finding on an electrocardiogram, wherein the trace in the ST segment is abnormally low below the baseline. Oldpeak measures the depression of the ST slope induced by exercise relative to rest.
11. **ST_Slope**: the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]
12. **HeartDisease**: output class [1: heart disease, 0: normal]
<br>
## Data visualization and cleaning
In this section we want to visualize our data and clean it. In order to that we have to:
* Visualize the distribution of the target variable
* Visualize the distribution of the predictors
* Deal with missing values
* Deal with outliers
* Find correlations between the parameters
<br>
### Data balance check
```{r, include=TRUE}
prop.table(table(HeartDisease))
```
The target variable is quite balanced.
<br>
### Continuous variables
Now we plot the continuous variables with respect to the target variable.
```{r, include=FALSE}
# continuous variables
cont.idx <- c(1,4,5,8,10)
colours <- c("#F8766D", "#00BFC4")
# Age
age.plot <- ggplot(data, aes(x=Age, group=HeartDisease,
fill=factor(HeartDisease))) +
geom_density(alpha=0.4) +
ggtitle("Age - Density Plot") + xlab("Age") +
guides(fill = guide_legend(title="Heart disease"))
# RestingBP
restingBP.plot <- ggplot(data, aes(x=RestingBP, group=HeartDisease,
fill=factor(HeartDisease))) +
geom_density(alpha=0.4) +
ggtitle("RestingBP - Density Plot") + xlab("RestingBP") +
guides(fill = guide_legend(title="Heart disease"))
# Cholesterol
chol.plot <- ggplot(data, aes(x=Cholesterol, group=HeartDisease,
fill=factor(HeartDisease))) +
geom_density(alpha=0.4) +
ggtitle("Cholesterol - Density Plot") + xlab("Cholesterol") +
guides(fill = guide_legend(title="Heart disease"))
# MaxHR
maxHR.plot <- ggplot(data, aes(x=MaxHR, group=HeartDisease,
fill=factor(HeartDisease))) +
geom_density(alpha=0.4) +
ggtitle("MaxHR - Density Plot") + xlab("MaxHR") +
guides(fill = guide_legend(title="Heart disease"))
# Oldpeak
oldpeak.plot <- ggplot(data, aes(x=Oldpeak, group=HeartDisease,
fill=factor(HeartDisease))) +
geom_density(alpha=0.4) +
ggtitle("Oldpeak - Density Plot") + xlab("Oldpeak") +
guides(fill = guide_legend(title="Heart disease"))
# FastingBS
fastingBS.plot <- ggplot(data, aes(x=FastingBS, group=HeartDisease,
fill=factor(HeartDisease))) +
geom_bar(alpha=0.5, position="dodge") +
guides(fill = guide_legend(title="Heart disease"))
```
```{r, include=TRUE}
chol.plot
oldpeak.plot
fastingBS.plot
restingBP.plot
```
```{r, include=TRUE}
age.box <- boxplot(Age ~ HeartDisease, col=colours)
maxHR.box <- boxplot(MaxHR ~ HeartDisease, col=colours)
```
<br>
### Categorical variables
```{r, include=FALSE}
# Sex
sex.plot <- ggplot(data, aes(x=Sex, group=HeartDisease,
fill=factor(HeartDisease))) +
geom_bar(alpha=0.5, position="dodge") +
guides(fill = guide_legend(title="Heart disease"))
# ChestPainType
cpt.plot <- ggplot(data, aes(x=ChestPainType, group=HeartDisease,
fill=factor(HeartDisease))) +
geom_bar(alpha=0.5, position="dodge") +
guides(fill = guide_legend(title="Heart disease"))
# RestingECG
restingECG.plot <- ggplot(data, aes(x=RestingECG, group=HeartDisease,
fill=factor(HeartDisease))) +
geom_bar(alpha=0.5, position="dodge") +
guides(fill = guide_legend(title="Heart disease"))
# ExerciseAngina
exAn.plot <- ggplot(data, aes(x=ExerciseAngina, group=HeartDisease,
fill=factor(HeartDisease))) +
geom_bar(alpha=0.5, position="dodge") +
guides(fill = guide_legend(title="Heart disease"))
# ST_Slope
st.plot <- ggplot(data, aes(x=ST_Slope, group=HeartDisease,
fill=factor(HeartDisease))) +
geom_bar(alpha=0.5, position="dodge") +
guides(fill = guide_legend(title="Heart disease"))
```
```{r, include=TRUE}
cpt.plot
restingECG.plot
exAn.plot
st.plot
```
### Missing values
We have seen from the plots that there are 172 variables in Cholesterol that have value equal to 0. Also we have noted that patients that have Cholesterol equal to 0 are very likely to have the heart disease.
One possibility is that the measurements were taken after the patient was dead, but if we inspect we see that the rows with missing data have RestingBP and MaxHR greater than 0 and RestingECG readings include "Normal" so it safe to assume that these patients are alive and the Cholesterol value was incorrectly recorded.
Another guess is that the "serum cholesterol" measured by this variable combines the HDL and LDL values. High HDL and LDL would cancel each other out, making this variable less useful.
For this reason we decided not to use this variable in the models and as we are going to see it doesn't affect much the predictions.
``` {r, include=TRUE}
head(data[Cholesterol == 0, c("RestingBP", "MaxHR", "RestingECG")])
```
<br>
### Outliers
For each continuous variable we see that:
* Age: no outliers
* RestingBP: the presence of outliers with high value is plausible
* MaxHR: no outliers
* Oldpeak: the values are in the range [-2, 6], there are many values equal to 0
<br>
### Correlations
```{r, include=TRUE}
cont.idx <- c(1,4,8,10)
cor.data <- cor(data[,cont.idx])
corrplot(cor.data,
method="color",
diag=F,
tl.cex=0.4,
number.cex=0.5,
tl.col="black",
addCoef.col="grey50",
cl.pos="n")
```
<br>
### Graph
We visualize the igraph to see the relationships between the continuous variables.
```{r, include=TRUE}
S <- var(data[,cont.idx])
R <- -cov2cor(solve(S))
G <- abs(R)>0.1
diag(G) <- 0
```
![](img/igraph2.png){width="50%"}
<br>
### Data Preparation
We divide our data in the training set and the test set.
```{r, include=TRUE}
set.seed(123)
split <- initial_split(data[,-c(5)], prop=0.75)
train <- training(split)
test <- testing(split)
```
```{r, include=FALSE}
calculate.metrics <- function(conf.mat) {
acc <- sum(diag(conf.mat))/sum(conf.mat)
prec <- conf.mat[2,2] / sum(conf.mat[,2])
rec <- conf.mat[2,2] / sum(conf.mat[2,])
f1.score <- 2*prec*rec/(prec+rec)
fn.rate <- conf.mat[1,2]/sum(conf.mat[,2])
out <- list(acc, prec, rec, f1.score, fn.rate)
return(out)
}
model.plot.roc <- function(predm, labl) {
pred <- prediction(predm, labl)
perf <- performance(pred, measure="tpr", x.measure="fpr")
plot(perf, main="ROC")
abline(a=0, b= 1)
auc.perf <- performance(pred, measure = "auc")
return(auc.perf@y.values)
}
```
<br>
## Evaluation
We chose to try different classification models, like Logistic Regression, Naive Bayes, LDA and QDA. For the former, we also used Lasso Regression and Ridge Regression. The goal is to predict whether or not heart disease is present in the patient.
<br>
### LDA
Linear Discriminant Analysis (LDA) is a classification algorithm, where a discriminant rule tries to divide the data points into K disjoint regions, where K is the number of classes (in this case K = 2).
To do that we assume that our data follows a normal distribution and the classes have the same variance, and then we define a discriminant function, which tells us how likely is the observation to follow into a particular class.
$$\delta_{k}(x) = log(f_{k}) + log(\pi_{k}) $$
The decision boundary separating the classes is the set of x where two discriminant functions have the same value. Therefore, any data that falls on the decision boundary is equally likely from all the classes.
```{r, include=TRUE}
lda.fit <- lda(HeartDisease~., data=train)
lda.fit
lda.pred <- predict(lda.fit, test, type="response")
lda.res <- lda.pred$posterior
# if we want to minimize the false positives the best result is when t = 0.6
lda.pred.best <- as.factor(ifelse(lda.res[,2] > 0.6, 1, 0))
lda.conf.mat <- table(test$HeartDisease, lda.pred.best)
lda.conf.mat
# accuracy, precision, recall, f1 score
lda.metrics <- calculate.metrics(lda.conf.mat)
lda.metrics
```
The Receiver Operating Characteristics traces out two types of error as we vary the threshold for the posterior probability:
* The true positive rate or sensitivity: the fraction of patients having the heart condition that are correctly identified
* The false positive rate: the fraction of patients without the conditions that we classify incorrectly with having the disease
The dotted line represents the “no information”
classifier; this is what we would expect if the predictors are not associated with probability of having the heart disease.
```{r, include=TRUE}
# ROC
lda.auc <- model.plot.roc(lda.res[,2], test$HeartDisease)
lda.auc
# ldahist(lda.pred$x[,1], g=lda.pred$class, col=2)
```
<br>
### QDA
Quadratic Discriminant Analysis (QDA) doesn't assume the equal variance of the classes. For this reason the decision boundary is not linear but quadratic.
```{r, include=TRUE}
qda.fit <- qda(HeartDisease~., data=train)
qda.pred <- predict(qda.fit, test)
qda.pred.best <- as.factor(ifelse(lda.res[,2] > 0.5, 1, 0))
qda.conf.mat <- table(qda.pred.best, test$HeartDisease)
qda.conf.mat
qda.metrics <- calculate.metrics(qda.conf.mat)
qda.metrics
qda.auc <- model.plot.roc(qda.pred$posterior[,2], test$HeartDisease)
qda.auc
```
<br>
### Generalized Linear Model and Simple Logistic Regression
Logistic Regression is the easiest and most common model to perform binary classification. The family set is binomial, as the dependent variable is binary.
```{r, include=TRUE}
glm.model <- glm(data=train, HeartDisease~., family="binomial")
glm_summary <- summary(glm.model)
glm_summary
```
We calculate the odds of success given the R-squared value, so that we evaluate the model error. R-squared is the percentage of the dependent variable variation that a linear model explains. 0% represents a model that does not explain any of the variation in the response variable around its mean. The mean of the dependent variable predicts the dependent variable as well as the regression model.
```{r, include=TRUE}
r2 <- 1 - (glm_summary$deviance/glm_summary$null.deviance)
r2
```
We run the model on the test data, with an initial threshold value of 0.6, to be tuned later through Variable Selection
```{r, include=TRUE}
prediction.glm.model <- predict(glm.model, newdata=test, type="response")
prediction.glm.model.binary <- ifelse(prediction.glm.model > 0.6, 1, 0)
conf_matrix <- table(test$HeartDisease, prediction.glm.model.binary)
glm.model.metrics <- calculate.metrics(conf_matrix)
glm.model.metrics
glm.model.auc <- model.plot.roc(prediction.glm.model, test$HeartDisease)
glm.model.auc
```
### Backward Variable Selection using p-value
All statistical tests have a null hypothesis. For most tests, the null hypothesis is that there is no relationship between your variables of interest or that there is no difference among groups.
The p value, or probability value, tells how likely it is that the data could have occurred under the null hypothesis. It does this by calculating the likelihood of the test statistic, which is the number calculated by a statistical test using the data.
The p value tells how often one would expect to see a test statistic as extreme or more extreme than the one calculated by the statistical test if the null hypothesis of that test was true. The p value gets smaller as the test statistic calculated from the data gets further away from the range of test statistics predicted by the null hypothesis.
We found out that Age, RestingECG and MaxHR, and RestingBP are not good enough predictors, having a p-value >= 0.05. We thus only kept predictors with a p-value < 0.05. The only exception is ST_SlopeUp which has a p-value > 0.5. The reason why we are keeping this is that ST segment depression is usually a good predictor of abnormality in the heart electrophysiology, consistent with Myocardial Infarction due to the blockage of at least of one the coronary arteries. [Diderholm et al.](https://pubmed.ncbi.nlm.nih.gov/11741361/)
```{r, include=TRUE}
glm.model.1 <- update(glm.model, ~. - Age)
glm.model.2 <- update(glm.model.1, ~. - RestingECG)
glm.model.3 <- update(glm.model.2, ~. - MaxHR)
glm.model.4 <- update(glm.model.3, ~. - RestingBP)
summary(glm.model.4)
```
Let's now compute the R-squared and the Variance Inflation Factor (VIF) for each model
R-squared is a measure of how well the model explains the data and of how much variance in the dependent variable is explained by the independent variables in the model. The VIF, on the other hand, is a measure of how much the variance of the estimated regression coefficient for a given independent variable is inflated due to multicollinearity.
A VIF value of 1 indicates no multicollinearity. Values between 1 and 5 indicates moderate correlation between a given predictor variable and other predictor variables in the model, but not severe enough to require attention. A value greater than 5 indicates potentially severe correlation between a given predictor variable and other predictor variables in the model. In this case, the coefficient estimates and p-values in the regression output are likely unreliable.
```{r, include=TRUE}
glm.models = list(glm.model, glm.model.1, glm.model.2, glm.model.3, glm.model.4)
i=0
for (mdl in glm.models){
r2 <- 1 - (mdl$deviance/mdl$null.deviance)
vif <- 1/(1-r2)
cat("GLM Model ", i, " - ", "VIF: ", vif, "\n")
i=i+1
}
```
Each model's VIF indicate some minor, although not severe, correlation between predictor
### Bayesian Information Criterion (BIC)
The Bayesian Information Criterion, often abbreviated BIC, is a metric that is used to compare the goodness of fit of different regression models.
```{r, include=TRUE}
glm.models = list(glm.model, glm.model.1, glm.model.2, glm.model.3, glm.model.4)
i=0
for (mdl in glm.models){
glm.bic <- BIC(mdl)
cat("GLM Model ", i, " - ","BIC: ", glm.bic, "\n")
i=i+1
}
```
Lower BIC means better model. So, the best model is glm.model.4 with a BIC of 522.2917.
### Analysis of Deviance
Deviance is a quality-of-fit statistic for a model that is often used for statistical hypothesis testing.
```{r, include=TRUE}
# Full model
glm.model.full <- glm(data=train, HeartDisease~., family="binomial")
deviance(glm.model.full)
pi.hat <- predict(glm.model.full, type="response")
DF <- -2*sum(train$HeartDisease*log(pi.hat)+(1-train$HeartDisease)*log(1-pi.hat))
df.F <- glm.model.full$df.residual
glm.model.reduced <- glm.model.4
deviance(glm.model.reduced)
pi.hat <- predict(glm.model.reduced, type="response")
DR <- -2*sum(train$HeartDisease*log(pi.hat)+(1-train$HeartDisease)*log(1-pi.hat))
df.R <- glm.model.reduced$df.residual
# Deviance difference test
dev.test <- DR-DF
df <- df.R- df.F
pvalue <- 1-pchisq(dev.test, df)
pvalue
# deviance difference using the anova() function
anova(glm.model.reduced, glm.model.full, test="Chisq")
```
In this case, the high p-value of 0.2043 tells us that the reduced model is not significantly different from the full model. Therefore, the removed predictors are indeed not useful.
### Threshold Selection
We tried different thresholds and we noticed that the best ones are 0.4 and 0.5, depending on whether to optimize for accuracy or recall.
.
```{r, include=TRUE}
thresholds = c(0.3, 0.4, 0.5, 0.6)
for (thr in thresholds) {
prediction.glm.model.4 <- predict(glm.model.4, newdata=test, type="response")
prediction.glm.model.4.binary <- ifelse(prediction.glm.model.4 > thr, 1, 0)
conf_matrix <- table(test$HeartDisease, prediction.glm.model.4.binary)
conf_matrix
glm.model.4.metrics <- calculate.metrics(conf_matrix)
cat("Threshold: ", thr, "\n")
cat("Accuracy, Precision, Rec, F1-Score", paste(glm.model.4.metrics, collapse = ", "), "\n")
}
prediction.glm.model.4 <- predict(glm.model.4, newdata=test, type="response")
prediction.glm.model.4.binary <- ifelse(prediction.glm.model.4 > 0.4, 1, 0)
conf_matrix <- table(test$HeartDisease, prediction.glm.model.4.binary)
conf_matrix
glm.model.4.metrics <- calculate.metrics(conf_matrix)
cat("Accuracy, Precision, Rec, F1-Score", paste(glm.model.4.metrics, collapse = ", "), "\n")
```
We need to minimize the amount of patients coming in with CHD symptoms that are wrongly predicted not to have CHD. This is because discharging the patient and them having a heart attack outside of the hospital greatly lowers their probability of survival. Therefore, although all metrics are important to consider and therefore a compromise has to be made, we need to find the one that maximizes Recall, so that it increases the prediction of patients with the disease, while trying not to lose much on the other metrics.
The best overall threshold is 0.4.
Let's also plot the ROC Curve for the final model.
```{r, include=TRUE}
glm.model.4.auc <- model.plot.roc(prediction.glm.model.4, test$HeartDisease)
glm.model.4.auc
```
### Lasso Regression
Lasso regression is a regularization technique. It is used over regression methods for a more accurate prediction. This model uses shrinkage (L1 Regularization). The lasso procedure encourages simple, sparse models (i.e. models with fewer parameters).
```{r, include=TRUE}
X <- model.matrix(glm.model)
y <- train$HeartDisease
lasso.model <- cv.glmnet(X, y, family = "binomial", type.measure = "class") #cross-validation handled by glmnet
lasso.coef <- coef(lasso.model, s = "lambda.min")
lasso.vars <- rownames(lasso.coef)[-1][lasso.coef[-1,] != 0]
cat("Selected variables with Lasso Regression:\n\n", paste(lasso.vars, collapse = "\n"))
lasso.X_test <- model.matrix(glm.model, data = test)
lasso.y_pred <- predict(lasso.model, newx = lasso.X_test, s = "lambda.min", type = "response")
lasso.y_pred_class <- ifelse(lasso.y_pred > 0.6, 1, 0) # Convert probabilities to classes
lasso.model
lasso.coef
```
There are some differences between what Lasso decided to remove and what we removed before through Variable Selection. Lasso removed RestingECG, RestingBP and ChestPainTypeTA, while keeping Age and MaxHR.
Let's consider what this might imply, by looking at some of the medical literature.
- Age: Age is an independent risk factor for cardiovascular disease in adults, and the risk increases with age, but these risks are compounded by additional factors. [(Rodgers et al.)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6616540/)
- RestingECG: Resting ECG abnormality has been shown to be a strong predictor for mortality and major adverse cardiac events (MACE) in healthy subjects as well as for high-risk populations. [Kaolawanich et al.](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8714441/#:~:text=Resting%20ECG%20abnormality%20has%20been,populations%20%5B10%2C%2011%5D.)
- MaxHR: Resting heart rate was an independent predictor of coronary artery disease, stroke, sudden death and non-cardiovascular diseases over all of the studies combined. [(Zhang et al.)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5056889/)
- RestingBP: Morning HBP is a strong predictor of future CAD and stroke events [(Kario et al.)](https://pubmed.ncbi.nlm.nih.gov/27150682/). With increasing age, there was a gradual shift from DBP to SBP and then to PP (Pulse Pressure) as predictors of CHD risk. In patients <50 years of age, DBP was the strongest predictor. Age 50 to 59 years was a transition period when all 3 BP indexes were comparable predictors, and from 60 years of age on, DBP was negatively related to CHD risk so that PP became superior to SBP. [(Franklin et al.)](https://www.ahajournals.org/doi/10.1161/01.cir.103.9.1245).
From this we can make a couple of observations and comparisons between choices of predictors from medical literature and our statistical analysis:
- First of all, the remaining predictors might be more useful for predicting CHD/CAD, whereas the removed ones contribute less to the prediction. Nonetheless, it is important to consider that we are dealing with statistical models and not with real clinical situations, so, in the clinical sense, all of those predictors are to be considered important to assess the patient's health.
- RestingECG and RestingBP have been discarded in both of the methods used. Although Resting Blood Pressure is an important predictor, our dataset only contains Systolic Blood Pressure readings, which are not enough to be considered a strong predictors, as seen by the complex difference in the importance of Blood Pressure Types (SBP, DBP, PP) in different Age ranges.
- Max Heart Rate is kept by Lasso Regression, which is a good thing, considering that Heart Rate at rest can be used to assess how efficiently the heart is working. Higher HR at rest means that the heart is working harder without an obvious reason, such as physical exertion, and this can indicate abnormalities in its functions.
```{r, include=TRUE}
lasso.conf.mat <- table(lasso.y_pred_class, test$HeartDisease)
lasso.conf.mat
```
Let's calculate the metrics
```{r, include=TRUE}
lasso.metrics <- calculate.metrics(lasso.conf.mat)
lasso.metrics
lasso.auc <- model.plot.roc(lasso.y_pred, test$HeartDisease)
lasso.auc
```
### Ridge Regression
Ridge regression is a model tuning method that is used to analyse any data that suffers from multicollinearity. It performs L2 regularization.
Ridge Regression can also be used to check the coefficient estimates, that represent the expected change in the response variable for a one-unit increase in each predictor, holding all other predictors constant.
The negative signs on some of the coefficients indicate that an increase in the corresponding predictor is associated with a decrease in the response variable, while positive signs indicate an increase in the predictor is associated with an increase in the response variable.
```{r, include=TRUE}
X <- model.matrix(glm.model)
y <- train$HeartDisease
ridge.fit <- cv.glmnet(X, y, family = "binomial", alpha = 0, type.measure = "deviance") #cross-validation handled by glmnet
coef(ridge.fit, s = "lambda.min")
ridge.X_test <- model.matrix(glm.model, data = test)
ridge.y_pred <- predict(ridge.fit, newx = ridge.X_test, s = "lambda.min", type = "response")
ridge.y_pred_class <- ifelse(ridge.y_pred > 0.6, 1, 0) # Convert probabilities to classes
lambda_min_value <- ridge.fit$lambda.min
lambda_min_value
```
```{r, include=TRUE}
ridge.conf.mat <- table(ridge.y_pred_class, test$HeartDisease)
ridge.conf.mat
```
```{r, include=TRUE}
ridge.metrics <- calculate.metrics(ridge.conf.mat)
ridge.metrics
ridge.auc <- model.plot.roc(ridge.y_pred, test$HeartDisease)
ridge.auc
```
### Naive Bayes Classifier
One other model we tried is the Naive Bayes Classifier. It is a classification technique based on Bayes’ Theorem with an independence assumption among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.
```{r, include=TRUE}
train$HeartDisease <- as.factor(train$HeartDisease)
test$HeartDisease <- as.factor(test$HeartDisease)
naivebayes.model <- naive_bayes(HeartDisease~., data=test)
naivebayes.prediction <- predict(naivebayes.model, newx = test, type = "prob")
naivebayes.y_pred_class <- ifelse(naivebayes.prediction > 0.6, 1, 0)
naivebayes.conf.mat <- table(naivebayes.y_pred_class[,2], test$HeartDisease)
naivebayes.metrics <- calculate.metrics(naivebayes.conf.mat)
naivebayes.metrics
naivebayes.auc <- model.plot.roc(naivebayes.prediction[,2], test$HeartDisease)
naivebayes.auc
```
### Model Comparison
```{r, include=TRUE}
# acc, prec, rec, f1.score
lda.acc <- lda.metrics[1][[1]]
lda.prec <- lda.metrics[2][[1]]
lda.rec <- lda.metrics[3][[1]]
lda.f1 <- lda.metrics[4][[1]]
lda.fn <- lda.metrics[5][[1]]
lda.auc <- lda.auc[[1]]
qda.acc <- qda.metrics[1][[1]]
qda.prec <- qda.metrics[2][[1]]
qda.rec <- qda.metrics[3][[1]]
qda.f1 <- qda.metrics[4][[1]]
qda.fn <- qda.metrics[5][[1]]
qda.auc <- qda.auc[[1]]
glm.4.acc <- glm.model.4.metrics[[1]][[1]]
glm.4.prec <- glm.model.4.metrics[2][[1]]
glm.4.rec <- glm.model.4.metrics[3][[1]]
glm.4.f1 <- glm.model.4.metrics[4][[1]]
glm.4.fn <- glm.model.4.metrics[5][[1]]
glm.4.auc <- glm.model.4.auc[[1]]
lasso.acc <- lasso.metrics[1][[1]]
lasso.prec <- lasso.metrics[2][[1]]
lasso.rec <- lasso.metrics[3][[1]]
lasso.f1 <- lasso.metrics[4][[1]]
lasso.fn <- lasso.metrics[5][[1]]
lasso.auc <- lasso.auc[[1]]
ridge.acc <- ridge.metrics[1][[1]]
ridge.prec <- ridge.metrics[2][[1]]
ridge.rec <- ridge.metrics[3][[1]]
ridge.f1 <- ridge.metrics[4][[1]]
ridge.fn <- ridge.metrics[5][[1]]
ridge.auc <- ridge.auc[[1]]
naivebayes.acc <- naivebayes.metrics[1][[1]]
naivebayes.prec <- naivebayes.metrics[2][[1]]
naivebayes.rec <- naivebayes.metrics[3][[1]]
naivebayes.f1 <- naivebayes.metrics[4][[1]]
naivebayes.fn <- naivebayes.metrics[5][[1]]
naivebayes.auc <- naivebayes.auc[[1]]
acc <- c(lda.acc, qda.acc, glm.4.acc, lasso.acc, ridge.acc, naivebayes.acc)
prec <- c(lda.prec, qda.prec, glm.4.prec, lasso.prec, ridge.prec, naivebayes.prec)
rec <- c(lda.rec, qda.rec, glm.4.rec, lasso.rec, ridge.rec, naivebayes.rec)
f1.score <- c(lda.f1, qda.f1, glm.4.f1, lasso.f1, ridge.f1, naivebayes.f1)
fn.rate <- c(lda.fn, qda.fn, glm.4.fn, lasso.fn, ridge.fn, naivebayes.fn)
auc <- c(lda.auc, qda.auc, glm.4.auc, lasso.auc, ridge.auc, naivebayes.auc)
model_performance <- matrix(c(acc, prec, rec, f1.score, fn.rate, auc), nrow = 6, ncol = 6, byrow = FALSE)
model_names <- c("LDA", "QDA", "Logistic Regression", "Lasso Regression", "Ridge Regression", "Naive Bayes")
performance_summary <- data.frame(Model = model_names, model_performance)
colnames(performance_summary) <- c("Model", "Accuracy", "Precision", "Recall", "F1 Score", "FN Rate", "AUC")
print(performance_summary)
```
```{r, include=TRUE}
roc.lda <- roc(test$HeartDisease, lda.res[,2])
roc.qda <- roc(test$HeartDisease, qda.pred$posterior[,2])
roc.glm <- roc(test$HeartDisease, prediction.glm.model.4)
roc.lasso <- roc(test$HeartDisease, lasso.y_pred)
roc.ridge <- roc(test$HeartDisease, ridge.y_pred)
roc.naivebayes <- roc(test$HeartDisease, naivebayes.prediction[,2])
plot(roc.lda, col = "blue", print.auc = FALSE, print.auc.y = 0, lwd=1.5)
lines(roc.qda, col = "red", print.auc = FALSE, print.auc.y = 0, lwd=1.5)
lines(roc.glm, col = "green", print.auc = FALSE, print.auc.y = 0, lwd=1.5)
lines(roc.lasso, col = "orange", print.auc = FALSE, print.auc.y = 0, lwd=1.5)
lines(roc.ridge, col = "pink", print.auc = FALSE, print.auc.y = 0, lwd=1.5)
lines(roc.naivebayes, col = "cyan", print.auc = FALSE, print.auc.y = 0,lwd=1.5)
legend("bottomright", legend = c("LDA", "QDA", "GLM","Lasso", "Ridge", "Naive Bayes"), col = c("blue", "red", "green", "orange", "pink", "cyan"), lty = 1)
```
## Conclusions
All models perform relatively good, with similar ROCs and each with their own strengths and weaknesses in their metrics.
Considering Recall as most important, the best models are:
- GLM with Logistic Regression (Best threshold: 0.4): maximizes Recall
- LDA: Lowest FN-Rate
- GLM with Ridge
Considering all models, the very best predictors are:
- Sex[“M”]: males are more likely to suffer from CAD.
- ChestPainType[“ATA, “NAP”]: Atypical Angina is a known symptom preceding Myocardial Infarction. Non-Anginal Pain is non-ischemic related.
- ST_Slope[“Up”, “Flat”]: ST Segment Depression is a known anomaly of the electrophysiology of heart due to coronary blockage and therefore myocardial ischemia.
- Oldpeak: variations in ST Segment Depression induced by exercise relative to rest indicate myocardial ischemia.