A malware detection system built with various machine learning and deep learning models, incorporating data preprocessing, feature engineering, and model evaluation, with MLflow experiment tracking for performance comparison and reproducibility.


Microsoft Malware Prediction

This project demonstrates a comprehensive approach to detecting malware using a combination of Machine Learning (ML) and Deep Learning (DL) models, alongside MLflow for experiment tracking.

Table of Contents

  • Enhanced Overview
  • Clone the Repository
  • Machine Learning Algorithms
  • Top 3 ML Algorithms
  • Deep Learning Algorithms
  • MLflow Integration

Enhanced Overview

1. Data Preparation & Preprocessing:

  • Imports the necessary libraries for data manipulation, visualization, and ML
  • Loads the Microsoft malware dataset from Kaggle
  • Handles missing values and outliers
  • Performs feature encoding and scaling
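A minimal sketch of these preprocessing steps, assuming a pandas/scikit-learn workflow (the tiny in-memory frame below stands in for the Kaggle `train.csv`; the column names mirror the competition schema but the values are made up):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Stand-in for the Kaggle data; the real project loads train.csv instead
df = pd.DataFrame({
    "SmartScreen": ["ExistsNotSet", None, "RequireAdmin", "ExistsNotSet"],
    "Census_TotalPhysicalRAM": [4096.0, 8192.0, None, 16384.0],
    "HasDetections": [0, 1, 0, 1],
})

# Handle missing values: mode for categoricals, median for numerics
df["SmartScreen"] = df["SmartScreen"].fillna(df["SmartScreen"].mode()[0])
df["Census_TotalPhysicalRAM"] = df["Census_TotalPhysicalRAM"].fillna(
    df["Census_TotalPhysicalRAM"].median())

# Encode categoricals and scale numerics
df["SmartScreen"] = LabelEncoder().fit_transform(df["SmartScreen"])
df[["Census_TotalPhysicalRAM"]] = StandardScaler().fit_transform(
    df[["Census_TotalPhysicalRAM"]])
```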

2. Feature Engineering & Selection:

  • Applies several feature selection techniques, including:
    • Chi-Square test
    • ANOVA
    • Mutual Information
    • Kendall's correlation
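All four techniques can be sketched with scikit-learn and SciPy on synthetic data (features are shifted non-negative because the Chi-Square test requires it):

```python
import numpy as np
from scipy.stats import kendalltau
from sklearn.datasets import make_classification
from sklearn.feature_selection import (SelectKBest, chi2, f_classif,
                                       mutual_info_classif)

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X = np.abs(X)  # chi2 requires non-negative features

chi2_scores, _ = chi2(X, y)                              # Chi-Square test
anova_scores, _ = f_classif(X, y)                        # ANOVA F-test
mi_scores = mutual_info_classif(X, y, random_state=42)   # Mutual Information
tau = [kendalltau(X[:, j], y)[0] for j in range(X.shape[1])]  # Kendall's tau

# Keep the 5 features with the highest ANOVA F-scores
top5 = SelectKBest(f_classif, k=5).fit(X, y).get_support(indices=True)
```

Each scorer ranks features independently, so comparing the rankings they produce is a useful sanity check before committing to a subset.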

3. Dimensionality Reduction:

  • Implements multiple dimensionality reduction techniques:
    • LDA (Linear Discriminant Analysis)
    • t-SNE
    • UMAP
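A short sketch of the first two techniques using scikit-learn (UMAP lives in the third-party `umap-learn` package, noted in a comment rather than imported here):

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.manifold import TSNE

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# LDA is supervised: at most (n_classes - 1) components, i.e. 1 for a
# binary malware/clean target
X_lda = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y)

# t-SNE is an unsupervised non-linear embedding, mainly for visualization
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

# UMAP follows the same fit_transform pattern via umap-learn:
#   import umap; X_umap = umap.UMAP(n_components=2).fit_transform(X)
```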

Clone the Repository

To start using this project, first clone the repository:

git clone https://github.com/Abdelrahman-Elshahed/Microsoft_Maleware_Prediction_Kaggle_Competition.git

Machine Learning Algorithms

  • Logistic Regression
    Uses the logit function to model malware probability.
    Accuracy: 50.47%

  • K-Nearest Neighbors (KNN)
    Classifies by majority vote of nearest neighbors.
    Accuracy: 51.52%

  • Random Forest
    Combines multiple decision trees to reduce overfitting via bagging.
    Accuracy: 60.84%

  • Decision Tree
    Splits features recursively for interpretability.
    Accuracy: 56.09%

  • AdaBoost
    Iteratively adjusts weights to emphasize misclassified samples.
    Accuracy: 62.65%

  • Gradient Boosting (GBM)
    Minimizes residuals incrementally by adding weak learners.
    Accuracy: 62.84%

  • XGBoost
    Highly efficient tree boosting with advanced regularization.
    Accuracy: 63.87%

  • LightGBM
    Uses leaf-wise growth and gradient-based sampling.
    Accuracy: 63.72%

  • CatBoost
    Handles categorical features internally, reducing manual encoding.
    Accuracy: 63.98%
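The scikit-learn members of this lineup can be trained and compared with a few lines; the sketch below uses synthetic data in place of the Kaggle set (XGBoost, LightGBM, and CatBoost follow the same fit/predict pattern via their own packages):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
    "adaboost": AdaBoostClassifier(random_state=0),
}

# Fit each model and score it on the held-out split
scores = {name: accuracy_score(y_te, m.fit(X_tr, y_tr).predict(X_te))
          for name, m in models.items()}
```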

Top 3 ML Algorithms

  1. CatBoost

  2. XGBoost

  3. LightGBM

Deep Learning Algorithms

  • Multi-Layer Perceptron (MLP)
    A dense network that learns nonlinear decision boundaries via backpropagation.
    Accuracy: 62.68%

  • Gated Recurrent Unit (GRU)
    Handles sequential dependencies via gating mechanisms in recurrent layers.
    Accuracy: 62.93%

  • Autoencoder
    Learns compact encodings by reconstructing its inputs, useful for feature extraction.
    Accuracy: 62.45%
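As a minimal stand-in for the MLP (the project's networks are likely built in a deep learning framework; scikit-learn's `MLPClassifier` is used here to keep the sketch self-contained, and GRU/autoencoder architectures would be built analogously in Keras or PyTorch):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Two hidden layers trained by backpropagation (Adam optimizer by default)
mlp = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
mlp.fit(X_tr, y_tr)
acc = accuracy_score(y_te, mlp.predict(X_te))
```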

MLflow Integration

MLflow tracks experiments centrally, storing metrics (accuracy, precision, recall, AUC) and model artifacts. It integrates seamlessly with various libraries, enabling comparison of runs on a local or remote MLflow UI.
