Nazarkovsky/Heart-Attack-Prediction

Heart-Attack-Prediction

Abstract

The project was prepared and submitted within the Brazilian "Bootcamp Data Science na prática" by Neuron.

See also here

The dataset was provided by the Bootcamp administration from Kaggle. The description of the data is uploaded here as a TXT.

The certificate

Methods

For a typical classification procedure with a binary output, the following metrics were used to select the optimal model: misclassification rate (MR), F1, Generalized R2 (Nagelkerke, or Cragg and Uhler, R2), and Mean Abs Dev (the average of the absolute differences between the real output and the predicted output). The project is supported by figures found in the Issues section. The information about R2 for classification is here
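As a rough illustration of the four metrics listed above, the following Python sketch computes them for a binary target. The function names, the 0.5 decision threshold and the toy data are assumptions made for illustration only; the project does not state how its own figures were computed. Mean Abs Dev is taken here against the predicted probability of the positive class, and Generalized R2 uses the Nagelkerke form (the Cox and Snell R2 divided by its maximum attainable value).

```python
import math


def misclassification_rate(y_true, y_pred):
    """Share of observations whose predicted class differs from the real one."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mean_abs_dev(y_true, prob):
    """Average absolute difference between the 0/1 target and P(class = 1)."""
    return sum(abs(t - p) for t, p in zip(y_true, prob)) / len(y_true)


def generalized_r2(y_true, prob, eps=1e-12):
    """Nagelkerke R2; assumes both classes occur in y_true."""
    n = len(y_true)
    base = sum(y_true) / n  # intercept-only model predicts the class share
    ll_null = sum(t * math.log(base) + (1 - t) * math.log(1 - base) for t in y_true)
    ll_model = 0.0
    for t, p in zip(y_true, prob):
        p = min(max(p, eps), 1 - eps)  # keep log() finite
        ll_model += t * math.log(p) + (1 - t) * math.log(1 - p)
    cox_snell = 1 - math.exp(2 * (ll_null - ll_model) / n)
    max_r2 = 1 - math.exp(2 * ll_null / n)
    return cox_snell / max_r2


# Toy data (hypothetical, not taken from the project)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
prob = [0.9, 0.2, 0.7, 0.4, 0.1, 0.6, 0.8, 0.3]
y_pred = [1 if p >= 0.5 else 0 for p in prob]

print(misclassification_rate(y_true, y_pred))
print(f1_score(y_true, y_pred))
print(mean_abs_dev(y_true, prob))
print(generalized_r2(y_true, prob))
```

MR and F1 depend only on the predicted classes, while Mean Abs Dev and Generalized R2 depend on the predicted probabilities, which is why the latter two can still separate models whose MR and F1 coincide.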

The data were stratified and split: 75% for training and 25% for validation.
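A stratified 75/25 split keeps the class proportions of the target roughly equal in both parts. The sketch below shows one way to do this with the Python standard library; the function name `stratified_split`, the `seed` parameter and the toy labels are hypothetical and are not taken from the project.

```python
import random
from collections import defaultdict


def stratified_split(labels, valid_fraction=0.25, seed=0):
    """Return (train_idx, valid_idx) with class shares preserved in each part."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, valid_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # random order within each class
        n_valid = round(len(indices) * valid_fraction)
        valid_idx.extend(indices[:n_valid])
        train_idx.extend(indices[n_valid:])
    return sorted(train_idx), sorted(valid_idx)


# Toy target with 40 negatives and 60 positives (hypothetical)
labels = [0] * 40 + [1] * 60
train_idx, valid_idx = stratified_split(labels)
print(len(train_idx), len(valid_idx))
```

Splitting within each class separately is what makes the split stratified: a plain random 75/25 split could, by chance, leave the validation part with a different share of positive cases than the training part.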

The machine learning models were: Naive Bayes, K-Nearest Neighbors or KNN (up to 155 neighbors, Euclidean distance, uniform weights for all points), Multiple Logistic Regression, Generalized Regression techniques (Lasso, Elastic Net, Ridge, Double Lasso), SVM (linear kernel function, cost = 1), Classification Tree, Boosted Tree and Bootstrap Forest (10 trees, 3 terms sampled per split, learning rate 0.1).

Results and Discussions

In training, the most effective of the classification models was Bootstrap Forest. On validation, however, the majority of models gave the same MR (7.69%) and the same F1 (0.92), and their Generalized R2 values were also identical. Mean Abs Dev therefore served as the tie-breaker: among these models, logistic regression had the lowest Mean Abs Dev. Its fit data, ROC curve and confusion matrix are given here.

As for the non-parametric KNN, both MR and F1 improved on validation at K = 92 (the number of neighbors with the minimal validation MR): MR = 5.77% and F1 = 0.939. See the graphical summary here; the table of MR for all 155 neighbor counts is in the uploaded file "KNN-iterations summary by MR.xlsx".
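The selection of K by minimal validation MR can be sketched as follows. This is a minimal pure-Python illustration with Euclidean distance and uniform (unweighted) votes, not the procedure actually used in the project; the function names `knn_predict` and `sweep_k` and the toy points are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k nearest training points (Euclidean, uniform weights)."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    # On a tied vote, the class seen first (that of the nearest neighbor) wins.
    return votes.most_common(1)[0][0]


def sweep_k(train_X, train_y, valid_X, valid_y, k_max):
    """Return (best_k, {k: validation MR}) for k = 1..k_max."""
    results = {}
    for k in range(1, k_max + 1):
        errors = sum(
            knn_predict(train_X, train_y, x, k) != y
            for x, y in zip(valid_X, valid_y)
        )
        results[k] = errors / len(valid_y)
    best_k = min(results, key=results.get)  # smallest k among equal minima
    return best_k, results


# Toy data (hypothetical): two clusters plus one mislabeled point near the first
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5), (0.6, 0.6)]
train_y = [0, 0, 0, 1, 1, 1, 1]
valid_X = [(0.5, 0.5), (5.5, 5.5)]
valid_y = [0, 1]

best_k, results = sweep_k(train_X, train_y, valid_X, valid_y, k_max=5)
print(best_k, results)
```

In the toy data a single neighbor is misled by the mislabeled point, while a larger K outvotes it, which is the same effect that makes an intermediate K preferable to very small values. With a sweep up to 155 neighbors, the same loop would simply run with `k_max=155`.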

An overall comparison of actual vs. predicted target values for the set of models on validation is here, and in general here.