---
title: "Classifiers"
---
```{r}
library(tidyverse)  # data wrangling (dplyr, readr, etc.)
library(MLmetrics)  # classification metrics (Accuracy, AUC, Precision, ...)
```
First, let's load the data.
```{r}
df <- read_csv("~/sandbox/laheart2.csv")
```
# Cleaning
Now, let's do some basic cleaning: remove unneeded columns and make sure the data types are correct.
DEATH_YR is redundant with DEATH and would let the classifiers "cheat," so remove it.
ID won't be helpful either, so drop it as well.
```{r}
df <- df %>%
  select(-c(ID, DEATH_YR)) %>%
  mutate(across(c(DEATH, MD_50, MD_62, CL_STATUS, IHD_DX), as.factor))
df
```
# Preview
Let's look at the summary statistics and a preview of the data.
```{r}
summary(df)
df
```
# Training and Testing
First things first: let's split the data into training and testing sets.
```{r}
set.seed(123) # Set the seed to make it reproducible
train <- sample_frac(df, 0.8)
test <- setdiff(df, train)
```
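Note that `setdiff()` on data frames is a row-wise set difference, so any duplicate rows in `df` would end up in neither set. Here is a minimal alternative sketch using row indices, which avoids that issue (the `train_idx`, `train2`, and `test2` names are just for illustration):
```{r}
# Index-based 80/20 split; keeps duplicate rows and guarantees a clean partition
set.seed(123)
train_idx <- sample(nrow(df), size = floor(0.8 * nrow(df)))
train2 <- df[train_idx, ]
test2  <- df[-train_idx, ]
```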
# Decision Trees
Build a decision tree model using the rpart package.
Load the required packages.
```{r}
library(rpart)
library(rpart.plot) # For pretty trees
```
Build the model.
```{r}
tree <- rpart(DEATH ~ ., method="class", data=train)
```
```{r}
tree
```
```{r}
printcp(tree)
```
Let's look at a graphical rendering of the decision tree.
```{r}
rpart.plot(tree, extra=2, type=2)
```
Look at how the model predicts on the testing data.
```{r}
predicted <- predict(tree, test, type="class")
```
Let's look at the confusion matrix.
```{r}
actual <- test$DEATH
table(actual, predicted)
```
Let's check the accuracy and other metrics of the classifier on the testing data.
```{r}
print(sprintf("Accuracy: %.3f", Accuracy(y_true=actual, y_pred=predicted)))
print(sprintf("AUC: %.3f", AUC(y_pred=predicted, y_true=actual)))
print(sprintf("Precision: %.3f", Precision(y_true=actual, y_pred=predicted)))
print(sprintf("Recall: %.3f", Recall(y_true=actual, y_pred=predicted)))
print(sprintf("F1 Score: %.3f", F1_Score(predicted, actual)))
print(sprintf("Sensitivity: %.3f", Sensitivity(y_true=actual, y_pred=predicted)))
print(sprintf("Specificity: %.3f", Specificity(y_true=predicted, y_pred=actual)))
```
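As a sanity check, the headline numbers can also be derived directly from the confusion matrix. A minimal sketch, assuming DEATH is coded 0/1 with 1 as the positive class (MLmetrics picks its own default positive class, so pass `positive=` explicitly if the per-class numbers don't line up):
```{r}
cm <- table(actual, predicted)
sum(diag(cm)) / sum(cm)        # accuracy: correct predictions on the diagonal
cm["1", "1"] / sum(cm["1", ])  # sensitivity / recall: true positives among actual positives
cm["0", "0"] / sum(cm["0", ])  # specificity: true negatives among actual negatives
```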
# Naive Bayes
Load the required packages.
```{r}
library(e1071)
```
Build the model. Note that this implementation models continuous variables with a per-class Gaussian (mean and standard deviation).
```{r}
nb <- naiveBayes(DEATH ~ ., data=train)
nb
```
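You can see this Gaussian assumption directly in the fitted object: `nb$tables` holds, for each numeric predictor, the per-class mean and standard deviation, and the class-conditional likelihood of a value is just a normal density. A minimal sketch, using a hypothetical continuous column `SBP_50` (substitute any numeric column actually present in the data):
```{r}
# Per-class mean (column 1) and standard deviation (column 2) for one predictor
nb$tables$SBP_50
# Class-conditional likelihoods of an example value, one per class
dnorm(140, mean = nb$tables$SBP_50[, 1], sd = nb$tables$SBP_50[, 2])
```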
Look at how the model predicts on the testing data.
```{r}
predicted.nb <- predict(nb, test, type="class")
```
Let's look at the confusion matrix.
```{r}
actual.nb <- test$DEATH
table(actual.nb, predicted.nb)
```
Let's check the accuracy and other metrics of the classifier on the testing data.
```{r}
print(sprintf("Accuracy: %.3f", Accuracy(y_true=actual.nb, y_pred=predicted.nb)))
print(sprintf("AUC: %.3f", AUC(y_true=actual.nb, y_pred=predicted.nb)))
print(sprintf("Precision: %.3f", Precision(y_true=actual.nb, y_pred=predicted.nb)))
print(sprintf("Recall: %.3f", Recall(y_true=actual.nb, y_pred=predicted.nb)))
print(sprintf("F1 Score: %.3f", F1_Score(predicted.nb, actual.nb)))
print(sprintf("Sensitivity: %.3f", Sensitivity(y_true=actual.nb, y_pred=predicted.nb)))
print(sprintf("Specificity: %.3f", Specificity(y_true=predicted.nb, y_pred=actual.nb)))
```