
Microsoft Malware Prediction

This project demonstrates a comprehensive approach to detecting malware using a combination of Machine Learning (ML) and Deep Learning (DL) models, alongside MLflow for experiment tracking.

Enhanced Overview

1. Data Preparation & Preprocessing:

  • Imports the libraries needed for data manipulation, visualization, and ML
  • Loads the Microsoft malware dataset from Kaggle
  • Handles missing values and outliers
  • Performs feature encoding and scaling
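
The imputation and scaling steps above can be sketched in plain Python. The column name and values below are illustrative, not taken from the real dataset (the project itself would do this with pandas/scikit-learn):

```python
# Minimal preprocessing sketch: median imputation + min-max scaling.
# Feature name and values are made up for illustration.

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = sorted(v for v in values if v is not None)
    mid = len(observed) // 2
    if len(observed) % 2:
        median = observed[mid]
    else:
        median = (observed[mid - 1] + observed[mid]) / 2
    return [median if v is None else v for v in values]

def min_max_scale(values):
    """Scale values linearly into the [0, 1] range."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0          # guard against constant columns
    return [(v - lo) / span for v in values]

av_install_age = [12, None, 30, 7, None, 45]   # hypothetical feature (days)
imputed = impute_median(av_install_age)
scaled = min_max_scale(imputed)
```

Real pipelines would use `SimpleImputer` and `MinMaxScaler` from scikit-learn, which handle these edge cases per column.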

2. Feature Engineering & Selection:

  • Applies several feature selection techniques, including:
    • Chi-Square test
    • ANOVA
    • Mutual Information
    • Kendall's correlation
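
To illustrate the first of these, the chi-square statistic for a binary feature against the malware label can be computed from a 2x2 contingency table. The counts below are toy numbers, not real dataset statistics:

```python
# Chi-square statistic for a contingency table:
# sum over cells of (observed - expected)^2 / expected,
# where expected = row_total * col_total / grand_total.

def chi_square(table):
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Toy counts: rows = feature value (0/1), columns = (clean, malware).
table = [[40, 10],   # feature = 0: mostly clean
         [15, 35]]   # feature = 1: mostly malware
stat = chi_square(table)
```

A larger statistic means a stronger feature-label association; in practice scikit-learn's `chi2` with `SelectKBest` does this across all features at once.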

3. Dimensionality Reduction:

  • Implements multiple dimensionality reduction techniques:
    • LDA (Linear Discriminant Analysis)
    • t-SNE
    • UMAP
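
Of these three, LDA is simple enough to sketch by hand for two classes in 2-D: project the data onto the direction w = Sw⁻¹(m1 − m0), where Sw is the within-class scatter matrix. The points below are toy data (the project itself would use scikit-learn's `LinearDiscriminantAnalysis`; t-SNE and UMAP require their own libraries):

```python
# Two-class Fisher LDA in 2-D, by hand.

def mean(points):
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]

def scatter(points, m):
    """2x2 scatter matrix: sum of outer products of (x - m)."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for x, y in points:
        dx, dy = x - m[0], y - m[1]
        s[0][0] += dx * dx; s[0][1] += dx * dy
        s[1][0] += dy * dx; s[1][1] += dy * dy
    return s

def lda_direction(class0, class1):
    m0, m1 = mean(class0), mean(class1)
    s0, s1 = scatter(class0, m0), scatter(class1, m1)
    sw = [[s0[0][0] + s1[0][0], s0[0][1] + s1[0][1]],
          [s0[1][0] + s1[1][0], s0[1][1] + s1[1][1]]]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[ sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det,  sw[0][0] / det]]
    d = [m1[0] - m0[0], m1[1] - m0[1]]
    return [inv[0][0] * d[0] + inv[0][1] * d[1],
            inv[1][0] * d[0] + inv[1][1] * d[1]]

clean   = [(1.0, 2.0), (2.0, 1.0), (1.5, 1.5)]   # toy class 0
malware = [(5.0, 5.0), (6.0, 6.0), (5.5, 5.5)]   # toy class 1
w = lda_direction(clean, malware)
proj0 = [w[0] * x + w[1] * y for x, y in clean]
proj1 = [w[0] * x + w[1] * y for x, y in malware]
```

Projecting onto w maps each 2-D point to a single coordinate along which the two classes separate.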

Clone the Repository

To start using this project, first clone the repository:

git clone https://github.com/Abdelrahman-Elshahed/Microsoft_Maleware_Prediction_Kaggle_Competition.git

Machine Learning Algorithms

  • Logistic Regression
    Uses the logistic (sigmoid) function to model the probability that a sample is malware.
    Accuracy: 50.47%

  • K-Nearest Neighbors (KNN)
    Classifies each sample by majority vote among its k nearest neighbors.
    Accuracy: 51.52%

  • Random Forest
    Combines multiple decision trees to reduce overfitting via bagging.
    Accuracy: 60.84%

  • Decision Tree
    Splits features recursively for interpretability.
    Accuracy: 56.09%

  • AdaBoost
    Iteratively adjusts weights to emphasize misclassified samples.
    Accuracy: 62.65%

  • Gradient Boosting (GBM)
    Minimizes residuals incrementally by adding weak learners.
    Accuracy: 62.84%

  • XGBoost
    Highly efficient tree boosting with advanced regularization.
    Accuracy: 63.87%

  • LightGBM
    Uses leaf-wise tree growth and gradient-based one-side sampling (GOSS).
    Accuracy: 63.72%

  • CatBoost
    Handles categorical features internally, reducing manual encoding.
    Accuracy: 63.98%
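
Of the models above, KNN is simple enough to sketch from scratch: classify a query by majority vote among its k nearest training points in Euclidean distance. The points and labels below are toy values, not the Kaggle features:

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Majority vote among the k training points nearest to query.
    `train` is a list of (feature_vector, label) pairs."""
    neighbors = sorted(
        train,
        key=lambda item: math.dist(item[0], query),
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((0.1, 0.2), "clean"), ((0.2, 0.1), "clean"),
         ((0.9, 0.8), "malware"), ((0.8, 0.9), "malware"),
         ((0.85, 0.85), "malware")]
```

In practice scikit-learn's `KNeighborsClassifier` adds tree-based neighbor search and distance weighting, which matter at the scale of this dataset.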

Top 3 ML Algorithms

  1. CatBoost

  2. XGBoost

  3. LightGBM

Deep Learning Algorithms

  • Multi-Layer Perceptron (MLP)
    A dense feed-forward network that learns nonlinear boundaries via backpropagation.
    Accuracy: 62.68%

  • Gated Recurrent Unit (GRU)
    Handles sequential dependencies via gating mechanisms in recurrent layers.
    Accuracy: 62.93%

  • Autoencoder
    Learns efficient encodings and reconstructs inputs for feature extraction.
    Accuracy: 62.45%
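
The backpropagation the MLP relies on can be shown in miniature: a one-hidden-layer network trained with per-sample gradient descent on a toy AND problem. This is a pure-Python illustration only; the project's networks would be built with a deep learning framework:

```python
import math, random

random.seed(0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy task (logical AND), illustrative only.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

# 2 inputs -> 2 hidden units -> 1 output; last weight of each row is the bias.
w_hidden = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
w_out = [random.uniform(-1, 1) for _ in range(3)]

def forward(x):
    h = [sigmoid(w[0] * x[0] + w[1] * x[1] + w[2]) for w in w_hidden]
    y = sigmoid(w_out[0] * h[0] + w_out[1] * h[1] + w_out[2])
    return h, y

def loss():
    return sum((forward(x)[1] - t) ** 2 for x, t in data) / len(data)

initial_loss = loss()
lr = 0.5
for _ in range(2000):
    for x, t in data:
        h, y = forward(x)
        d_y = (y - t) * y * (1 - y)          # output delta (MSE * sigmoid')
        for j in range(2):                    # backprop into hidden layer
            d_h = d_y * w_out[j] * h[j] * (1 - h[j])
            w_hidden[j][0] -= lr * d_h * x[0]
            w_hidden[j][1] -= lr * d_h * x[1]
            w_hidden[j][2] -= lr * d_h
        w_out[0] -= lr * d_y * h[0]           # update output weights last,
        w_out[1] -= lr * d_y * h[1]           # after they were used above
        w_out[2] -= lr * d_y
final_loss = loss()
```

The same forward/backward pattern, scaled up and batched, is what frameworks automate for the MLP, GRU, and autoencoder above.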

MLflow Integration

MLflow tracks experiments centrally, storing metrics (accuracy, precision, recall, AUC) and model artifacts. It integrates with the libraries used here, so runs can be compared side by side in a local or remote MLflow UI.