---
title: "Classifiers"
---
```{r}
library(tidyverse)  # data wrangling (dplyr, readr, etc.)
library(MLmetrics)  # classification metrics (Accuracy, AUC, Precision, ...)
```
First, let's load the data.
```{r}
df <- read_csv("~/sandbox/laheart2.csv")
```
# Cleaning
Now, let's do some basic cleaning: remove unneeded columns and make sure the data types are correct.
DEATH_YR is redundant with DEATH and would let the classifiers "cheat," so remove it.
ID won't be helpful either, so drop it as well.
```{r}
df <- df %>%
  select(-c(ID, DEATH_YR)) %>%
  mutate(across(c(DEATH, MD_50, MD_62, CL_STATUS, IHD_DX), as.factor))
df
```
# Preview
Let's look at the summary statistics and a preview of the data.
```{r}
summary(df)
df
```
# Training and Testing
First things first: let's split the data into training and testing sets.
```{r}
set.seed(123) # Set the seed to make it reproducible
train <- sample_frac(df, 0.8)
test <- setdiff(df, train)
```
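Note that `setdiff()` on data frames is a row-wise set difference, so any duplicate rows in `df` would end up in neither set. Here is a minimal alternative sketch using row indices, which avoids that issue (the `train_idx`, `train2`, and `test2` names are just for illustration):
```{r}
# Index-based 80/20 split; keeps duplicate rows and guarantees a clean partition
set.seed(123)
train_idx <- sample(nrow(df), size = floor(0.8 * nrow(df)))
train2 <- df[train_idx, ]
test2  <- df[-train_idx, ]
```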
# Decision Trees
Build a decision tree model using the rpart package.
Load the required packages.
```{r}
library(rpart)
library(rpart.plot) # For pretty trees
```
Build the model.
```{r}
tree <- rpart(DEATH ~ ., method="class", data=train)
```
```{r}
tree
```
```{r}
printcp(tree)
```
Let's look at a graphical rendering of the decision tree.
```{r}
rpart.plot(tree, extra=2, type=2)
```
Look at how the model predicts on the testing data.
```{r}
predicted <- predict(tree, test, type="class")
```
Let's look at the confusion matrix.
```{r}
actual <- test$DEATH
table(actual, predicted)
```
Let's check the accuracy and other metrics of the classifier on the testing data.
```{r}
print(sprintf("Accuracy: %.3f", Accuracy(y_true=actual, y_pred=predicted)))
print(sprintf("AUC: %.3f", AUC(y_pred=predicted, y_true=actual)))
print(sprintf("Precision: %.3f", Precision(y_true=actual, y_pred=predicted)))
print(sprintf("Recall: %.3f", Recall(y_true=actual, y_pred=predicted)))
print(sprintf("F1 Score: %.3f", F1_Score(predicted, actual)))
print(sprintf("Sensitivity: %.3f", Sensitivity(y_true=actual, y_pred=predicted)))
print(sprintf("Specificity: %.3f", Specificity(y_true=predicted, y_pred=actual)))
```
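As a sanity check, the headline numbers can also be derived directly from the confusion matrix. A minimal sketch, assuming DEATH is coded 0/1 with 1 as the positive class (MLmetrics picks its own default positive class, so pass `positive=` explicitly if the per-class numbers don't line up):
```{r}
cm <- table(actual, predicted)
sum(diag(cm)) / sum(cm)        # accuracy: correct predictions on the diagonal
cm["1", "1"] / sum(cm["1", ])  # sensitivity / recall: true positives among actual positives
cm["0", "0"] / sum(cm["0", ])  # specificity: true negatives among actual negatives
```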
# Naive Bayes
Load the required packages.
```{r}
library(e1071)
```
Build the model. Note that this implementation models continuous variables with a per-class Gaussian (mean and standard deviation).
```{r}
nb <- naiveBayes(DEATH ~ ., data=train)
nb
```
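You can see this Gaussian assumption directly in the fitted object: `nb$tables` holds, for each numeric predictor, the per-class mean and standard deviation, and the class-conditional likelihood of a value is just a normal density. A minimal sketch, using a hypothetical continuous column `SBP_50` (substitute any numeric column actually present in the data):
```{r}
# Per-class mean (column 1) and standard deviation (column 2) for one predictor
nb$tables$SBP_50
# Class-conditional likelihoods of an example value, one per class
dnorm(140, mean = nb$tables$SBP_50[, 1], sd = nb$tables$SBP_50[, 2])
```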
Look at how the model predicts on the testing data.
```{r}
predicted.nb <- predict(nb, test, type="class")
```
Let's look at the confusion matrix.
```{r}
actual.nb <- test$DEATH
table(actual.nb, predicted.nb)
```
Let's check the accuracy and other metrics of the classifier on the testing data.
```{r}
print(sprintf("Accuracy: %.3f", Accuracy(y_true=actual.nb, y_pred=predicted.nb)))
print(sprintf("AUC: %.3f", AUC(y_true=actual.nb, y_pred=predicted.nb)))
print(sprintf("Precision: %.3f", Precision(y_true=actual.nb, y_pred=predicted.nb)))
print(sprintf("Recall: %.3f", Recall(y_true=actual.nb, y_pred=predicted.nb)))
print(sprintf("F1 Score: %.3f", F1_Score(predicted.nb, actual.nb)))
print(sprintf("Sensitivity: %.3f", Sensitivity(y_true=actual.nb, y_pred=predicted.nb)))
print(sprintf("Specificity: %.3f", Specificity(y_true=predicted.nb, y_pred=actual.nb)))
```