The project was prepared and submitted within the Brazilian "Bootcamp Data Science na prática" by Neuron.
See also here
The dataset, originally from Kaggle, was provided by the Bootcamp administration. The data description is uploaded here as a TXT file.
In order to perform the typical classification procedure for a binary output, the following metrics are required to select the optimal model: misclassification rate (MR), F1 score, Generalized R2 (Nagelkerke, or Cragg and Uhler, R2), and Mean Abs Dev (the average of the absolute differences between the actual and predicted outputs). The project is supported by figures found in the Issues section. Information about R2 for classification is here
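The four metrics above can be sketched in plain Python as follows (a minimal illustration under standard definitions, not the project's actual implementation; the Generalized R2 follows the usual Nagelkerke likelihood-ratio formula computed from predicted probabilities):

```python
import numpy as np

def misclassification_rate(y_true, y_pred):
    """Share of observations whose predicted class differs from the actual one."""
    return float(np.mean(np.asarray(y_true) != np.asarray(y_pred)))

def f1(y_true, y_pred):
    """F1 score for the positive class (label 1)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return 2 * tp / (2 * tp + fp + fn)

def mean_abs_dev(y_true, p_pred):
    """Average absolute difference between actual labels and predicted probabilities."""
    return float(np.mean(np.abs(np.asarray(y_true, float) - np.asarray(p_pred, float))))

def nagelkerke_r2(y_true, p_pred):
    """Generalized R2 (Nagelkerke / Cragg and Uhler) from predicted probabilities."""
    y = np.asarray(y_true, float)
    p = np.clip(np.asarray(p_pred, float), 1e-12, 1 - 1e-12)
    n = len(y)
    ll_model = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    p0 = y.mean()  # intercept-only (null) model probability
    ll_null = n * (p0 * np.log(p0) + (1 - p0) * np.log(1 - p0))
    cox_snell = 1 - np.exp(2 * (ll_null - ll_model) / n)
    return cox_snell / (1 - np.exp(2 * ll_null / n))
```

Nagelkerke's R2 rescales the Cox-Snell R2 by its maximum attainable value, so a perfect probabilistic model scores close to 1.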
The data were stratified and split: 75% for training, 25% for validation.
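A stratified 75/25 split of this kind can be sketched with scikit-learn (the DataFrame and its column names here are hypothetical stand-ins for the actual Kaggle data):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Kaggle dataset; column names are hypothetical.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature_1": rng.normal(size=200),
    "feature_2": rng.normal(size=200),
    "target": rng.integers(0, 2, size=200),
})

X = df.drop(columns=["target"])
y = df["target"]

# stratify=y keeps the class ratio (nearly) identical in both subsets.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
```

Stratification matters for binary targets because an imbalanced random split would bias both the fitted models and the validation metrics.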
The machine learning models: Naive Bayes; K-Nearest Neighbors (KNN; up to 155 neighbors, estimated with Euclidean distances and uniform point weights); Multiple Logistic Regression; Generalized Regression techniques (Lasso, Elastic Net, Ridge, Double Lasso); SVM (linear kernel function, cost = 1); Classification Tree; Boosted Tree; and Bootstrap Forest (10 trees, 3 terms sampled per split, learning rate 0.1).
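Rough open-source counterparts of this model set can be sketched with scikit-learn (hyperparameters follow the text where stated and defaults elsewhere; the data are synthetic, and Double Lasso, a JMP-specific procedure, has no direct sklearn analogue):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Synthetic binary-classification data stands in for the project's dataset.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Approximate counterparts of the listed models; Double Lasso is omitted.
models = {
    "Naive Bayes": GaussianNB(),
    "KNN": KNeighborsClassifier(n_neighbors=92, metric="euclidean", weights="uniform"),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Lasso (L1) Logistic": LogisticRegression(penalty="l1", solver="liblinear"),
    "Ridge (L2) Logistic": LogisticRegression(penalty="l2", max_iter=1000),
    "Elastic Net Logistic": LogisticRegression(
        penalty="elasticnet", solver="saga", l1_ratio=0.5, max_iter=5000
    ),
    "SVM (linear, C=1)": SVC(kernel="linear", C=1.0),
    "Classification Tree": DecisionTreeClassifier(random_state=0),
    "Boosted Tree": GradientBoostingClassifier(learning_rate=0.1, random_state=0),
    "Bootstrap Forest": RandomForestClassifier(
        n_estimators=10, max_features=3, random_state=0
    ),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = 1 - model.score(X_valid, y_valid)  # validation MR
```

Fitting every candidate on the same training split and scoring it on the same validation split keeps the MR comparison fair across models.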
Among the classification models, the most effective trained model was Bootstrap Forest. On validation, the majority of models gave MR = 7.69% at an equal F1 (0.92), and their Generalized R2 values were also identical. Among those models, logistic regression had the minimal Mean Abs Dev. Its fit data, ROC curve, and confusion matrix are given here.
As for the non-parametric KNN, MR and F1 improved on validation at K = 92 (the number of neighbors giving the minimal MR on validation): MR = 5.77% and F1 = 0.939. See the graphical summary here; the table of all 155 neighbor counts vs. MR is in the uploaded file "KNN-iterations summary by MR.xlsx".
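The K selection described above can be sketched as a sweep over K = 1..155, keeping the K with minimal validation MR (synthetic data stands in for the project's dataset, so the best K found here will differ from the project's K = 92):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data in place of the project's dataset.
X, y = make_classification(n_samples=400, n_features=5, random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

# Sweep K = 1..155 and record the validation misclassification rate for each,
# mirroring the iteration table in "KNN-iterations summary by MR.xlsx".
mr_by_k = {}
for k in range(1, 156):
    knn = KNeighborsClassifier(n_neighbors=k, metric="euclidean", weights="uniform")
    knn.fit(X_tr, y_tr)
    mr_by_k[k] = 1 - knn.score(X_va, y_va)

# Choose the neighbor count that minimizes validation MR.
best_k = min(mr_by_k, key=mr_by_k.get)
```

Note that selecting K on the validation set makes the reported validation MR slightly optimistic; a held-out test set or cross-validation would give an unbiased estimate.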
An overall comparison of actual vs. predicted target values for the set of models on validation is here, and in general is here