This project demonstrates a comprehensive approach to detecting malware using a combination of Machine Learning (ML) and Deep Learning (DL) models, alongside MLflow for experiment tracking.
- Enhanced Overview
- Clone the Repository
- Machine Learning Algorithms
- Top 3 ML Algorithms
- Deep Learning Algorithms
- MLflow Integration
- Imports the necessary libraries for data manipulation, visualization, and ML
- Loads the Microsoft Malware Prediction dataset from Kaggle
- Handles missing values and outliers
- Performs feature encoding and scaling
- Uses various feature selection techniques (see the sketch after this list), such as:
- Chi-Square test
- ANOVA
- Mutual Information
- Kendall's correlation
- Implements multiple dimensionality reduction techniques (also sketched below):
- LDA (Linear Discriminant Analysis)
- t-SNE
- UMAP
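As an illustration of the feature selection step, here is a minimal sketch using scikit-learn's univariate selectors. The file name, target handling, and `k` value are placeholders rather than the project's exact configuration:

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2, f_classif, mutual_info_classif

# Assumes an already encoded, scaled (non-negative) training frame; the file name is illustrative.
df = pd.read_csv("train_preprocessed.csv")
X = df.drop(columns=["HasDetections"])  # HasDetections is the competition's binary target
y = df["HasDetections"]

# Chi-Square test (requires non-negative features, e.g. after MinMax scaling)
chi2_selector = SelectKBest(score_func=chi2, k=20).fit(X, y)

# ANOVA F-test
anova_selector = SelectKBest(score_func=f_classif, k=20).fit(X, y)

# Mutual Information
mi_selector = SelectKBest(score_func=mutual_info_classif, k=20).fit(X, y)

# Kendall's correlation of each feature with the target
kendall_scores = X.apply(lambda col: col.corr(y, method="kendall")).abs().sort_values(ascending=False)

print(X.columns[chi2_selector.get_support()])
```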
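The dimensionality reduction step can be sketched in the same spirit, reusing `X` and `y` from above; the component counts and the t-SNE sample size are illustrative:

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.manifold import TSNE
import umap  # pip install umap-learn

# LDA is supervised; with a binary target it yields a single discriminant component.
X_lda = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y)

# t-SNE is expensive, so it is typically run on a subsample for visualization.
X_tsne = TSNE(n_components=2, random_state=42).fit_transform(X.sample(10_000, random_state=42))

# UMAP scales better to large tables and also gives a 2-D embedding.
X_umap = umap.UMAP(n_components=2, random_state=42).fit_transform(X)
```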
To start using this project, first clone the repository:

```bash
git clone https://github.com/Abdelrahman-Elshahed/Microsoft_Maleware_Prediction_Kaggle_Competition.git
```
- Logistic Regression: uses the logit function to model the probability of malware. Accuracy: 50.47%
- K-Nearest Neighbors (KNN): classifies each sample by majority vote of its nearest neighbors. Accuracy: 51.52%
- Random Forest: combines multiple decision trees via bagging to reduce overfitting. Accuracy: 60.84%
- Decision Tree: splits on features recursively, yielding an interpretable model. Accuracy: 56.09%
- AdaBoost: iteratively re-weights training samples to emphasize those previously misclassified. Accuracy: 62.65%
- Gradient Boosting (GBM): adds weak learners incrementally, each fitting the residuals of the previous ones. Accuracy: 62.84%
- XGBoost: highly efficient tree boosting with advanced regularization. Accuracy: 63.87%
- LightGBM: uses leaf-wise tree growth and gradient-based sampling (see the training sketch after this list). Accuracy: 63.72%
- CatBoost: handles categorical features internally, reducing manual encoding. Accuracy: 63.98%
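For concreteness, one of the boosted models (LightGBM here) can be trained and evaluated along these lines, reusing `X` and `y` from the preprocessing sketch; the hyperparameters are placeholders, not the tuned values behind the reported accuracy:

```python
from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score

# Hold out a validation split for honest accuracy/AUC estimates.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = LGBMClassifier(n_estimators=500, learning_rate=0.05, num_leaves=64, random_state=42)
model.fit(X_train, y_train)

val_pred = model.predict(X_val)
val_proba = model.predict_proba(X_val)[:, 1]
print("Accuracy:", accuracy_score(y_val, val_pred))
print("AUC:", roc_auc_score(y_val, val_proba))
```

The other tree-based and linear models follow the same fit/predict pattern, which makes it straightforward to compare them on the same validation split.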
- Multi-Layer Perceptron (MLP): a dense feed-forward network that learns nonlinear decision boundaries via backpropagation (sketched below). Accuracy: 62.68%
- Gated Recurrent Unit (GRU): handles time-dependent data through gating mechanisms in recurrent layers. Accuracy: 62.93%
- Autoencoder: learns efficient encodings by reconstructing its inputs, used for feature extraction. Accuracy: 62.45%
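A minimal Keras sketch of an MLP of this kind is shown below; the layer sizes and training settings are illustrative, not the exact architecture behind the reported accuracy:

```python
import tensorflow as tf

n_features = X_train.shape[1]  # X_train/y_train from the split above

mlp = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(n_features,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # binary output: malware detected or not
])

mlp.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=["accuracy", tf.keras.metrics.AUC(name="auc")],
)
mlp.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=10, batch_size=1024)
```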
MLflow tracks experiments centrally, storing metrics (accuracy, precision, recall, AUC) and model artifacts for every run. It integrates with the libraries used above, so runs can be compared side by side in a local or remote MLflow UI.
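A typical run could be logged roughly as follows; the experiment name and parameters are illustrative, and `model`, `y_val`, `val_pred`, and `val_proba` come from the LightGBM sketch above:

```python
import mlflow
import mlflow.sklearn
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

mlflow.set_experiment("malware-prediction")

with mlflow.start_run(run_name="lightgbm-baseline"):
    mlflow.log_params({"n_estimators": 500, "learning_rate": 0.05, "num_leaves": 64})
    mlflow.log_metrics({
        "accuracy": accuracy_score(y_val, val_pred),
        "precision": precision_score(y_val, val_pred),
        "recall": recall_score(y_val, val_pred),
        "auc": roc_auc_score(y_val, val_proba),
    })
    mlflow.sklearn.log_model(model, "model")
```

Runs can then be browsed and compared by launching the tracking UI with `mlflow ui`.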