This project demonstrates a comprehensive approach to detecting malware using a combination of Machine Learning (ML) and Deep Learning (DL) models, alongside MLflow for experiment tracking.
- Enhanced Overview
- Clone the Repository
- Machine Learning Algorithms
- Top 3 ML Algorithms
- Deep Learning Algorithms
- MLflow Integration
- Imports the necessary libraries for data manipulation, visualization, and ML
- Loads the Microsoft Malware Prediction dataset from Kaggle
- Handles missing values and outliers
- Performs feature encoding and scaling
- Uses various feature selection techniques (see the sketch after this list), such as:
- Chi-Square test
- ANOVA
- Mutual Information
- Kendall's correlation
- Implements multiple dimensionality reduction techniques (also sketched below):
- LDA (Linear Discriminant Analysis)
- t-SNE
- UMAP
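As an illustration of the feature selection step, here is a minimal sketch using scikit-learn's univariate selectors. The file name, target handling, and `k` value are placeholders rather than the project's exact configuration:

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2, f_classif, mutual_info_classif

# Assumes an already encoded, scaled (non-negative) training frame; the file name is illustrative.
df = pd.read_csv("train_preprocessed.csv")
X = df.drop(columns=["HasDetections"])  # HasDetections is the competition's binary target
y = df["HasDetections"]

# Chi-Square test (requires non-negative features, e.g. after MinMax scaling)
chi2_selector = SelectKBest(score_func=chi2, k=20).fit(X, y)

# ANOVA F-test
anova_selector = SelectKBest(score_func=f_classif, k=20).fit(X, y)

# Mutual Information
mi_selector = SelectKBest(score_func=mutual_info_classif, k=20).fit(X, y)

# Kendall's correlation of each feature with the target
kendall_scores = X.apply(lambda col: col.corr(y, method="kendall")).abs().sort_values(ascending=False)

print(X.columns[chi2_selector.get_support()])
```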
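The dimensionality reduction step can be sketched in the same spirit, reusing `X` and `y` from above; the component counts and the t-SNE sample size are illustrative:

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.manifold import TSNE
import umap  # pip install umap-learn

# LDA is supervised; with a binary target it yields a single discriminant component.
X_lda = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y)

# t-SNE is expensive, so it is typically run on a subsample for visualization.
X_tsne = TSNE(n_components=2, random_state=42).fit_transform(X.sample(10_000, random_state=42))

# UMAP scales better to large tables and also gives a 2-D embedding.
X_umap = umap.UMAP(n_components=2, random_state=42).fit_transform(X)
```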
To start using this project, first clone the repository:

```bash
git clone https://github.com/Abdelrahman-Elshahed/Microsoft_Maleware_Prediction_Kaggle_Competition.git
```
- Logistic Regression: uses the logit function to model the probability of malware. Accuracy: 50.47%
- K-Nearest Neighbors (KNN): classifies each sample by majority vote of its nearest neighbors. Accuracy: 51.52%
- Random Forest: combines multiple decision trees via bagging to reduce overfitting. Accuracy: 60.84%
- Decision Tree: splits on features recursively, yielding an interpretable model. Accuracy: 56.09%
- AdaBoost: iteratively re-weights training samples to emphasize those previously misclassified. Accuracy: 62.65%
- Gradient Boosting (GBM): adds weak learners incrementally, each fitting the residuals of the previous ones. Accuracy: 62.84%
- XGBoost: highly efficient tree boosting with advanced regularization. Accuracy: 63.87%
- LightGBM: uses leaf-wise tree growth and gradient-based sampling (see the training sketch after this list). Accuracy: 63.72%
- CatBoost: handles categorical features internally, reducing manual encoding. Accuracy: 63.98%
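For concreteness, one of the boosted models (LightGBM here) can be trained and evaluated along these lines, reusing `X` and `y` from the preprocessing sketch; the hyperparameters are placeholders, not the tuned values behind the reported accuracy:

```python
from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score

# Hold out a validation split for honest accuracy/AUC estimates.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = LGBMClassifier(n_estimators=500, learning_rate=0.05, num_leaves=64, random_state=42)
model.fit(X_train, y_train)

val_pred = model.predict(X_val)
val_proba = model.predict_proba(X_val)[:, 1]
print("Accuracy:", accuracy_score(y_val, val_pred))
print("AUC:", roc_auc_score(y_val, val_proba))
```

The other tree-based and linear models follow the same fit/predict pattern, which makes it straightforward to compare them on the same validation split.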
- Multi-Layer Perceptron (MLP): a dense feed-forward network that learns nonlinear decision boundaries via backpropagation (sketched below). Accuracy: 62.68%
- Gated Recurrent Unit (GRU): handles time-dependent data through gating mechanisms in recurrent layers. Accuracy: 62.93%
- Autoencoder: learns efficient encodings by reconstructing its inputs, used for feature extraction. Accuracy: 62.45%
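A minimal Keras sketch of an MLP of this kind is shown below; the layer sizes and training settings are illustrative, not the exact architecture behind the reported accuracy:

```python
import tensorflow as tf

n_features = X_train.shape[1]  # X_train/y_train from the split above

mlp = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(n_features,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # binary output: malware detected or not
])

mlp.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=["accuracy", tf.keras.metrics.AUC(name="auc")],
)
mlp.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=10, batch_size=1024)
```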
MLflow tracks experiments centrally, storing metrics (accuracy, precision, recall, AUC) and model artifacts for every run. It integrates with the libraries used above, so runs can be compared side by side in a local or remote MLflow UI.
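A typical run could be logged roughly as follows; the experiment name and parameters are illustrative, and `model`, `y_val`, `val_pred`, and `val_proba` come from the LightGBM sketch above:

```python
import mlflow
import mlflow.sklearn
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

mlflow.set_experiment("malware-prediction")

with mlflow.start_run(run_name="lightgbm-baseline"):
    mlflow.log_params({"n_estimators": 500, "learning_rate": 0.05, "num_leaves": 64})
    mlflow.log_metrics({
        "accuracy": accuracy_score(y_val, val_pred),
        "precision": precision_score(y_val, val_pred),
        "recall": recall_score(y_val, val_pred),
        "auc": roc_auc_score(y_val, val_proba),
    })
    mlflow.sklearn.log_model(model, "model")
```

Runs can then be browsed and compared by launching the tracking UI with `mlflow ui`.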