
Microsoft Malware Prediction

This project demonstrates a comprehensive approach to detecting malware using a combination of Machine Learning (ML) and Deep Learning (DL) models, alongside MLflow for experiment tracking.

Enhanced Overview

1. Data Preparation & Preprocessing:

  • Imports the libraries needed for data manipulation, visualization, and ML
  • Loads the Microsoft malware dataset from Kaggle
  • Handles missing values and outliers
  • Performs feature encoding and scaling
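
The imputation and scaling steps above can be sketched in plain Python. The column name and values below are illustrative, not taken from the real dataset (the project itself would do this with pandas/scikit-learn):

```python
# Minimal preprocessing sketch: median imputation + min-max scaling.
# Feature name and values are made up for illustration.

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = sorted(v for v in values if v is not None)
    mid = len(observed) // 2
    if len(observed) % 2:
        median = observed[mid]
    else:
        median = (observed[mid - 1] + observed[mid]) / 2
    return [median if v is None else v for v in values]

def min_max_scale(values):
    """Scale values linearly into the [0, 1] range."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0          # guard against constant columns
    return [(v - lo) / span for v in values]

av_install_age = [12, None, 30, 7, None, 45]   # hypothetical feature (days)
imputed = impute_median(av_install_age)
scaled = min_max_scale(imputed)
```

Real pipelines would use `SimpleImputer` and `MinMaxScaler` from scikit-learn, which handle these edge cases per column.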

2. Feature Engineering & Selection:

  • Applies several feature selection techniques, including:
    • Chi-Square test
    • ANOVA
    • Mutual Information
    • Kendall's correlation
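
To illustrate the first of these, the chi-square statistic for a binary feature against the malware label can be computed from a 2x2 contingency table. The counts below are toy numbers, not real dataset statistics:

```python
# Chi-square statistic for a contingency table:
# sum over cells of (observed - expected)^2 / expected,
# where expected = row_total * col_total / grand_total.

def chi_square(table):
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Toy counts: rows = feature value (0/1), columns = (clean, malware).
table = [[40, 10],   # feature = 0: mostly clean
         [15, 35]]   # feature = 1: mostly malware
stat = chi_square(table)
```

A larger statistic means a stronger feature-label association; in practice scikit-learn's `chi2` with `SelectKBest` does this across all features at once.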

3. Dimensionality Reduction:

  • Implements multiple dimensionality reduction techniques:
    • LDA (Linear Discriminant Analysis)
    • t-SNE
    • UMAP
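
Of these three, LDA is simple enough to sketch by hand for two classes in 2-D: project the data onto the direction w = Sw⁻¹(m1 − m0), where Sw is the within-class scatter matrix. The points below are toy data (the project itself would use scikit-learn's `LinearDiscriminantAnalysis`; t-SNE and UMAP require their own libraries):

```python
# Two-class Fisher LDA in 2-D, by hand.

def mean(points):
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]

def scatter(points, m):
    """2x2 scatter matrix: sum of outer products of (x - m)."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for x, y in points:
        dx, dy = x - m[0], y - m[1]
        s[0][0] += dx * dx; s[0][1] += dx * dy
        s[1][0] += dy * dx; s[1][1] += dy * dy
    return s

def lda_direction(class0, class1):
    m0, m1 = mean(class0), mean(class1)
    s0, s1 = scatter(class0, m0), scatter(class1, m1)
    sw = [[s0[0][0] + s1[0][0], s0[0][1] + s1[0][1]],
          [s0[1][0] + s1[1][0], s0[1][1] + s1[1][1]]]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[ sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det,  sw[0][0] / det]]
    d = [m1[0] - m0[0], m1[1] - m0[1]]
    return [inv[0][0] * d[0] + inv[0][1] * d[1],
            inv[1][0] * d[0] + inv[1][1] * d[1]]

clean   = [(1.0, 2.0), (2.0, 1.0), (1.5, 1.5)]   # toy class 0
malware = [(5.0, 5.0), (6.0, 6.0), (5.5, 5.5)]   # toy class 1
w = lda_direction(clean, malware)
proj0 = [w[0] * x + w[1] * y for x, y in clean]
proj1 = [w[0] * x + w[1] * y for x, y in malware]
```

Projecting onto w maps each 2-D point to a single coordinate along which the two classes separate.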

Clone the Repository

To start using this project, first clone the repository:

git clone https://github.com/Abdelrahman-Elshahed/Microsoft_Maleware_Prediction_Kaggle_Competition.git

Machine Learning Algorithms

  • Logistic Regression
    Uses the logistic (sigmoid) function to model the probability that a sample is malware.
    Accuracy: 50.47%

  • K-Nearest Neighbors (KNN)
    Classifies each sample by majority vote among its k nearest neighbors.
    Accuracy: 51.52%

  • Random Forest
    Combines multiple decision trees to reduce overfitting via bagging.
    Accuracy: 60.84%

  • Decision Tree
    Splits features recursively for interpretability.
    Accuracy: 56.09%

  • AdaBoost
    Iteratively adjusts weights to emphasize misclassified samples.
    Accuracy: 62.65%

  • Gradient Boosting (GBM)
    Minimizes residuals incrementally by adding weak learners.
    Accuracy: 62.84%

  • XGBoost
    Highly efficient tree boosting with advanced regularization.
    Accuracy: 63.87%

  • LightGBM
    Uses leaf-wise tree growth and gradient-based one-side sampling (GOSS).
    Accuracy: 63.72%

  • CatBoost
    Handles categorical features internally, reducing manual encoding.
    Accuracy: 63.98%
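
Of the models above, KNN is simple enough to sketch from scratch: classify a query by majority vote among its k nearest training points in Euclidean distance. The points and labels below are toy values, not the Kaggle features:

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Majority vote among the k training points nearest to query.
    `train` is a list of (feature_vector, label) pairs."""
    neighbors = sorted(
        train,
        key=lambda item: math.dist(item[0], query),
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((0.1, 0.2), "clean"), ((0.2, 0.1), "clean"),
         ((0.9, 0.8), "malware"), ((0.8, 0.9), "malware"),
         ((0.85, 0.85), "malware")]
```

In practice scikit-learn's `KNeighborsClassifier` adds tree-based neighbor search and distance weighting, which matter at the scale of this dataset.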

Top 3 ML Algorithms

  1. CatBoost

  2. XGBoost

  3. LightGBM

Deep Learning Algorithms

  • Multi-Layer Perceptron (MLP)
    A dense feed-forward network that learns nonlinear boundaries via backpropagation.
    Accuracy: 62.68%

  • Gated Recurrent Unit (GRU)
    Handles sequential dependencies via gating mechanisms in recurrent layers.
    Accuracy: 62.93%

  • Autoencoder
    Learns efficient encodings and reconstructs inputs for feature extraction.
    Accuracy: 62.45%
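
The backpropagation the MLP relies on can be shown in miniature: a one-hidden-layer network trained with per-sample gradient descent on a toy AND problem. This is a pure-Python illustration only; the project's networks would be built with a deep learning framework:

```python
import math, random

random.seed(0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy task (logical AND), illustrative only.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

# 2 inputs -> 2 hidden units -> 1 output; last weight of each row is the bias.
w_hidden = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
w_out = [random.uniform(-1, 1) for _ in range(3)]

def forward(x):
    h = [sigmoid(w[0] * x[0] + w[1] * x[1] + w[2]) for w in w_hidden]
    y = sigmoid(w_out[0] * h[0] + w_out[1] * h[1] + w_out[2])
    return h, y

def loss():
    return sum((forward(x)[1] - t) ** 2 for x, t in data) / len(data)

initial_loss = loss()
lr = 0.5
for _ in range(2000):
    for x, t in data:
        h, y = forward(x)
        d_y = (y - t) * y * (1 - y)          # output delta (MSE * sigmoid')
        for j in range(2):                    # backprop into hidden layer
            d_h = d_y * w_out[j] * h[j] * (1 - h[j])
            w_hidden[j][0] -= lr * d_h * x[0]
            w_hidden[j][1] -= lr * d_h * x[1]
            w_hidden[j][2] -= lr * d_h
        w_out[0] -= lr * d_y * h[0]           # update output weights last,
        w_out[1] -= lr * d_y * h[1]           # after they were used above
        w_out[2] -= lr * d_y
final_loss = loss()
```

The same forward/backward pattern, scaled up and batched, is what frameworks automate for the MLP, GRU, and autoencoder above.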

MLflow Integration

MLflow tracks experiments centrally, storing metrics (accuracy, precision, recall, AUC) and model artifacts. It integrates with the libraries used here, so runs can be compared side by side in a local or remote MLflow UI.