Malware Image Classification with PyTorch

🌟 Project Overview

This repository represents the first part of a comprehensive malware classification project using deep learning techniques. The project applies computer vision approaches to malware classification by treating binary files as images.
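
As a rough illustration of what "treating binary files as images" can mean: each byte of the file becomes one 8-bit grayscale pixel, and the byte stream is wrapped into rows of a fixed width. The sketch below is not the code used to build Malimg or this project's pipeline; the function names and the default width of 256 are assumptions made for the example.

```python
import math
from typing import Tuple


def bytes_to_grayscale(data: bytes, width: int = 256) -> Tuple[int, int, bytes]:
    """Map each byte to one 8-bit pixel and wrap the stream into rows of `width`.

    Returns (width, height, pixels). The last row is zero-padded so the
    pixel buffer is exactly width * height bytes long.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    height = math.ceil(len(data) / width)
    pixels = data.ljust(width * height, b"\x00")
    return width, height, pixels


def to_pgm(width: int, height: int, pixels: bytes) -> bytes:
    """Serialize the pixels as a binary PGM (P5) grayscale image."""
    header = f"P5\n{width} {height}\n255\n".encode("ascii")
    return header + pixels


# Usage: ten bytes wrapped at width 4 give a 4x3 image with two padding pixels.
w, h, px = bytes_to_grayscale(bytes(range(10)), width=4)
pgm = to_pgm(w, h, px)
```

PGM is used here only because it needs no third-party library; any image library could take the same `(width, height, pixels)` triple.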

The main contributions of this first phase are:

  • A highly modular and reusable training framework
  • Extensive experiment tracking and logging capabilities
  • Flexible model architecture with support for both custom CNNs and transfer learning
  • Comprehensive configuration management system

Information about the Malimg dataset is available here. I downloaded it from here using the Kaggle API and this script.

Next steps: I plan to experiment with conditional GANs and diffusion models from Hugging Face; this is a work in progress. The first phase, built with deep learning and PyTorch, is complete. You can find a notebook to test this project here.

🏗️ Project Architecture

The project follows a highly modular architecture designed for extensibility and reusability:

malware_detection/
│
├── notebooks/
│   ├── 01_eda.ipynb
│   └── 02_model_training.ipynb
│
├── src/
│   ├── data/         
│   │   └── dataset_analyser.py 
│   │
│   ├── models/
│   │   ├── base.py           
│   │   ├── cnn.py           
│   │   └── model_factory.py 
│   │
│   ├── training/
│   │   ├── experiment.py   
│   │   ├── callbacks.py    
│   │   └── trainer.py       
│   │
│   ├── utils/
│   │   ├── image_analyser.py   
│   │   ├── kaggle_downloader.py  
│   │   └── logger_setup.py
│   │
│   └── visualization/
│       ├── panel_dashboard.py   
│       ├── plot_results.py  
│       └── plotters.py
│
├── config/
│   └── config.yaml          

🛠️ Framework Components

Experiment Management System

The Experiment class provides a robust framework for:

  • Automatic experiment tracking and logging
  • Checkpoint management with best model saving
  • Training history visualization
  • Performance metrics tracking
  • JSON-based results export
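
The tracking and export behaviour listed above could look roughly like the following. This is a minimal sketch under assumptions, not the repository's Experiment class: `ExperimentLog`, its `monitor` argument, and the JSON layout are all hypothetical names chosen for the example.

```python
import json
from pathlib import Path
from typing import Dict, List, Optional


class ExperimentLog:
    """Hypothetical stand-in for the tracking part of an experiment class."""

    def __init__(self, name: str, monitor: str = "val_loss"):
        self.name = name
        self.monitor = monitor  # metric used to decide which epoch is best
        self.history: List[Dict[str, float]] = []
        self.best_epoch: Optional[int] = None

    def log_epoch(self, **metrics: float) -> bool:
        """Record one epoch; return True if it is the best so far (lower is better)."""
        self.history.append(metrics)
        epoch = len(self.history) - 1
        is_best = (
            self.best_epoch is None
            or metrics[self.monitor] < self.history[self.best_epoch][self.monitor]
        )
        if is_best:
            self.best_epoch = epoch  # a real trainer would save a checkpoint here
        return is_best

    def export(self, path: Path) -> None:
        """Write the run name, best epoch and full history as JSON."""
        payload = {
            "name": self.name,
            "best_epoch": self.best_epoch,
            "history": self.history,
        }
        path.write_text(json.dumps(payload, indent=2))
```

A training loop would call `log_epoch(val_loss=..., val_acc=...)` once per epoch and `export(...)` at the end of the run.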

Callback System

Modular callback system inspired by Keras:

  • EarlyStopping: Prevents overfitting
  • ModelCheckpoint: Saves model states
  • ReduceLROnPlateau: Adaptive learning rate
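
To make the callback idea concrete, here is a minimal early-stopping sketch. It is not the implementation in callbacks.py; the `patience` and `min_delta` parameters and the `step` method are assumptions modelled loosely on the Keras callback of the same name.

```python
class EarlyStopping:
    """Signal a stop when the monitored loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 5, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta  # smallest decrease that counts as an improvement
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Update the state with one epoch's loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Usage: with patience=2, two epochs in a row without improvement trigger a stop.
stopper = EarlyStopping(patience=2)
decisions = [stopper.step(loss) for loss in [1.0, 0.9, 0.95, 0.97]]
```

The same pattern (keep a best value, count epochs without improvement) also underlies checkpointing and learning-rate reduction on plateau; only the action taken differs.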

Model Factory

Flexible model creation system supporting:

  • Custom CNN architectures
  • Transfer learning with timm models
  • Automatic architecture configuration
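
One way such a factory could be organised is a small registry for custom architectures plus a fallback to `timm.create_model` for transfer learning. The sketch below is an assumption about the design, not the contents of model_factory.py; the config keys (`type`, `name`, `num_classes`, `pretrained`) and the `register` decorator are hypothetical.

```python
from typing import Any, Callable, Dict

# Registry of custom model builders, keyed by a short name (hypothetical design).
_CUSTOM_BUILDERS: Dict[str, Callable[..., Any]] = {}


def register(name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Decorator that adds a custom model builder to the registry."""
    def wrap(builder: Callable[..., Any]) -> Callable[..., Any]:
        _CUSTOM_BUILDERS[name] = builder
        return builder
    return wrap


def create_model(cfg: Dict[str, Any]) -> Any:
    """Build a model from a config dict with keys: type, name, num_classes."""
    num_classes = cfg["num_classes"]
    if cfg["type"] == "custom":
        if cfg["name"] not in _CUSTOM_BUILDERS:
            raise ValueError(f"unknown custom model: {cfg['name']!r}")
        return _CUSTOM_BUILDERS[cfg["name"]](num_classes=num_classes)
    if cfg["type"] == "transfer":
        # Imported lazily so custom models work even without timm installed.
        import timm
        return timm.create_model(
            cfg["name"],
            pretrained=cfg.get("pretrained", True),
            num_classes=num_classes,  # timm replaces the classifier head to match
        )
    raise ValueError(f"unknown model type: {cfg['type']!r}")
```

With this layout, switching from a baseline CNN to, say, a timm ResNet is a config change (`type: transfer`, `name: resnet50`) and needs no change to the training code.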

🚀 Training System

The training system features:

  • Configuration-driven training setup
  • Comprehensive metric logging
  • Real-time performance visualization
  • Automatic checkpoint management

Configuration Management

Centralized YAML configuration for:

  • Model architecture
  • Training parameters
  • Data augmentation
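
Once the YAML file has been parsed into a dictionary (for example with PyYAML's `yaml.safe_load`), the three sections above can be validated into typed objects. The following is a sketch only: every field name and default value is a placeholder invented for the example and does not reflect the actual keys in config.yaml.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class ModelConfig:
    type: str = "custom"          # placeholder keys, not the real schema
    name: str = "baseline_cnn"
    pretrained: bool = False


@dataclass
class TrainingConfig:
    epochs: int = 20
    batch_size: int = 32
    learning_rate: float = 1e-3


@dataclass
class AugmentationConfig:
    horizontal_flip: bool = False
    rotation_degrees: float = 0.0


@dataclass
class Config:
    model: ModelConfig = field(default_factory=ModelConfig)
    training: TrainingConfig = field(default_factory=TrainingConfig)
    augmentation: AugmentationConfig = field(default_factory=AugmentationConfig)

    @classmethod
    def from_dict(cls, raw: Dict[str, Any]) -> "Config":
        """Build a Config from a parsed YAML dict; missing sections use defaults."""
        return cls(
            model=ModelConfig(**raw.get("model", {})),
            training=TrainingConfig(**raw.get("training", {})),
            augmentation=AugmentationConfig(**raw.get("augmentation", {})),
        )


# Usage: only the overridden value needs to appear in the file.
cfg = Config.from_dict({"training": {"epochs": 5}})
```

Because dataclass constructors reject unknown keyword arguments, a misspelled key in the file fails with a `TypeError` at load time instead of being silently ignored.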

📈 Results

Both baseline CNN and transfer learning models achieved excellent performance:

Baseline Model

  • Training metrics: >95%
  • Validation metrics: >95%
  • Test metrics: >95%

Note: One interesting observation was the consistent misclassification of 'Autorun.K' as 'Yuner.A', suggesting potential similarities in their binary patterns. You can check these results here.
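
A confusion like this can be surfaced by counting the (true, predicted) pairs on which the model was wrong. The helper below is a hypothetical sketch, and the labels in the usage example are toy data made up for illustration, not this project's predictions.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def most_confused_pairs(
    y_true: Sequence[str], y_pred: Sequence[str], top: int = 3
) -> List[Tuple[Tuple[str, str], int]]:
    """Return the most frequent (true, predicted) pairs among the errors."""
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return errors.most_common(top)


# Toy labels for illustration only.
y_true = ["Autorun.K", "Autorun.K", "Yuner.A", "FamilyC"]
y_pred = ["Yuner.A", "Yuner.A", "Yuner.A", "FamilyD"]
pairs = most_confused_pairs(y_true, y_pred)
# pairs[0] is (("Autorun.K", "Yuner.A"), 2) for this toy input
```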

Transfer Learning Models

  • Training metrics: 98-99%
  • Validation metrics: 98-99%
  • Test metrics: 98-99%

I obtained these results by training a ResNet-50 model. I decided not to include them in the experiments folder because no further tuning was needed, and you can use the notebook to test any model that timm makes available. The main focus of the project remains on the next phase.

🔮 Future Work

This repository represents Phase 1 of a larger project. Phase 2 will explore more advanced approaches:

Upcoming Features

  • Generative AI Integration
    • Conditional GANs for malware image generation
    • Diffusion models using Hugging Face's ecosystem

🚀 How to Run

I recommend running this notebook on Google Colab to take advantage of its free GPU and preconfigured environment.

  1. Go to Google Colab
  2. Clone the repository
  3. Install dependencies (if needed):
pip install -r requirements.txt
  4. Configure the Kaggle API and config.yaml
  5. Run all the cells