Malware Image Classification with PyTorch

🌟 Project Overview

This repository represents the first part of a comprehensive malware classification project using deep learning techniques. The project applies computer vision approaches to malware classification by treating binary files as images.
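As a rough illustration of the "binary as image" idea, the sketch below maps each byte of a file to one grayscale pixel and wraps the byte stream into rows of a fixed width. The function name `bytes_to_grayscale`, the default width of 256, and the zero-padding of the last row are assumptions made for this example. Malimg already ships as images, so the repository does not necessarily contain this step.

```python
import math

import numpy as np


def bytes_to_grayscale(data: bytes, width: int = 256) -> np.ndarray:
    """Reshape raw bytes into a 2-D uint8 array (one byte = one pixel)."""
    if width <= 0:
        raise ValueError("width must be positive")
    pixels = np.frombuffer(data, dtype=np.uint8)
    height = math.ceil(len(pixels) / width)
    # Zero-pad the tail so the bytes fill a complete rectangle.
    padded = np.zeros(height * width, dtype=np.uint8)
    padded[: len(pixels)] = pixels
    return padded.reshape(height, width)


# Toy input: 769 bytes wrap into 4 rows of 256 pixels (the last row is padded).
image = bytes_to_grayscale(bytes(range(256)) * 3 + b"\x7f", width=256)
print(image.shape)  # (4, 256)
```

The resulting array can be saved as a PNG or fed to a CNN like any other single-channel image.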

The main contributions of this first phase are:

A highly modular and reusable training framework
Extensive experiment tracking and logging capabilities
Flexible model architecture with support for both custom CNNs and transfer learning
Comprehensive configuration management system

You can find information about the Malimg dataset here. I downloaded it from here using the Kaggle API and this script.

Next steps: I plan to experiment with Conditional GANs and diffusion models from Hugging Face; this part is still a work in progress. The first phase, built with deep learning and PyTorch, is complete. You can find a notebook to test this project here.

🏗️ Project Architecture

The project follows a highly modular architecture designed for extensibility and reusability:

malware_detection/
│
├── notebooks/
│   ├── 01_eda.ipynb
│   └── 02_model_training.ipynb
│
├── src/
│   ├── data/         
│   │   └── dataset_analyser.py 
│   │
│   ├── models/
│   │   ├── base.py           
│   │   ├── cnn.py           
│   │   └── model_factory.py 
│   │
│   ├── training/
│   │   ├── experiment.py   
│   │   ├── callbacks.py    
│   │   └── trainer.py       
│   │
│   ├── utils/
│   │   ├── image_analyser.py   
│   │   ├── kaggle_downloader.py  
│   │   └── logger_setup.py
│   │
│   └── visualization/
│       ├── panel_dashboard.py   
│       ├── plot_results.py  
│       └── plotters.py
│
├── config/
│   └── config.yaml

🛠️ Framework Components

Experiment Management System

The Experiment class provides a robust framework for:

Automatic experiment tracking and logging
Checkpoint management with best model saving
Training history visualization
Performance metrics tracking
JSON-based results export
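To make the tracking and export duties listed above concrete, here is a minimal sketch. `ExperimentLog`, its methods, and the `results.json` file name are hypothetical stand-ins, not the repository's actual `Experiment` API. The sketch only records per-epoch metrics, remembers which epoch had the lowest validation loss, and writes everything to JSON.

```python
import json
import tempfile
from pathlib import Path


class ExperimentLog:
    """Hypothetical stand-in for an experiment tracker (not the repo's class)."""

    def __init__(self, name, root):
        self.dir = Path(root) / name
        self.dir.mkdir(parents=True, exist_ok=True)
        self.history = []
        self.best_epoch = None
        self.best_val_loss = float("inf")

    def log_epoch(self, epoch, val_loss, **metrics):
        """Record one epoch; return True if val_loss is the best seen so far."""
        self.history.append({"epoch": epoch, "val_loss": val_loss, **metrics})
        if val_loss < self.best_val_loss:
            self.best_val_loss = val_loss
            self.best_epoch = epoch
            return True
        return False

    def export(self):
        """Write the history and the best epoch to a JSON file."""
        out = self.dir / "results.json"
        payload = {"best_epoch": self.best_epoch, "history": self.history}
        out.write_text(json.dumps(payload, indent=2))
        return out


with tempfile.TemporaryDirectory() as tmp:
    log = ExperimentLog("baseline_v1", tmp)
    for epoch, val_loss in enumerate([0.9, 0.5, 0.7], start=1):
        if log.log_epoch(epoch, val_loss=val_loss):
            pass  # a real run would save the model weights here
    saved = json.loads(log.export().read_text())

print(saved["best_epoch"])  # 2
```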

Callback System

Modular callback system inspired by Keras:

EarlyStopping: Prevents overfitting
ModelCheckpoint: Saves model states
ReduceLROnPlateau: Adaptive learning rate
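The early-stopping idea can be sketched in a few lines. The class below is an illustrative version written for this README, assuming a monitored validation loss where lower is better. The parameter names `patience` and `min_delta` follow the Keras convention, and the code in `src/training/callbacks.py` may differ.

```python
class EarlyStopping:
    """Stop when the monitored loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.stale_epochs = 0

    def step(self, val_loss):
        """Update the state with a new loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
        return self.stale_epochs >= self.patience


stopper = EarlyStopping(patience=2)
losses = [0.80, 0.60, 0.61, 0.62, 0.40]  # made-up values for illustration
stopped_at = None
for epoch, loss in enumerate(losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break

print(stopped_at)  # 4
```

Note that the run stops before the improvement at epoch 5 is ever seen, which is the usual trade-off when choosing a small `patience`.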

Model Factory

Flexible model creation system supporting:

Custom CNN architectures
Transfer learning with timm models
Automatic architecture configuration
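One common way to build such a factory is a registry that maps a model type to a builder function. The sketch below is an assumption about how this could look, not the contents of `model_factory.py`: the names `MODEL_BUILDERS`, `register`, `create_model`, and the config keys are all hypothetical. The only third-party call is `timm.create_model`, which accepts `pretrained`, `num_classes`, and `in_chans`. PyTorch and timm are imported lazily, so the registry itself loads without them.

```python
from typing import Any, Callable, Dict

MODEL_BUILDERS: Dict[str, Callable[..., Any]] = {}


def register(kind: str):
    """Decorator that adds a builder function to the registry."""
    def wrap(fn):
        MODEL_BUILDERS[kind] = fn
        return fn
    return wrap


@register("custom_cnn")
def build_custom_cnn(num_classes: int, channels=(32, 64), in_chans: int = 1, **_: Any):
    from torch import nn  # lazy import: only needed when this builder runs

    layers, in_ch = [], in_chans
    for out_ch in channels:
        layers += [nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)]
        in_ch = out_ch
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(in_ch, num_classes)]
    return nn.Sequential(*layers)


@register("transfer")
def build_transfer(num_classes: int, name: str = "resnet50",
                   pretrained: bool = True, in_chans: int = 3, **_: Any):
    import timm  # lazy import: only needed for transfer learning

    return timm.create_model(name, pretrained=pretrained,
                             num_classes=num_classes, in_chans=in_chans)


def create_model(config: Dict[str, Any]):
    """Dispatch on config['type'] and pass the remaining keys to the builder."""
    params = dict(config)
    kind = params.pop("type")
    if kind not in MODEL_BUILDERS:
        raise ValueError(f"unknown model type {kind!r}; expected one of {sorted(MODEL_BUILDERS)}")
    return MODEL_BUILDERS[kind](**params)
```

With this layout, a call such as `create_model({"type": "transfer", "name": "resnet50", "num_classes": 25})` would return a timm ResNet-50 with a 25-way head (Malimg has 25 families), and adding a new architecture only requires one more decorated builder.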

🚀 Training System

The training system features:

Configuration-driven training setup
Comprehensive metric logging
Real-time performance visualization
Automatic checkpoint management

Configuration Management

Centralized YAML configuration for:

Model architecture
Training parameters
Data augmentation
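The three configuration areas above could be represented as typed sections once the YAML file has been parsed. The sketch below is only an assumption about a possible layout: every key and default value is hypothetical, and the real `config/config.yaml` may use different names. It starts from a plain dict, such as the one `yaml.safe_load` would return, to stay free of third-party imports.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class ModelConfig:
    type: str = "custom_cnn"
    num_classes: int = 25


@dataclass
class TrainingConfig:
    epochs: int = 20
    batch_size: int = 32
    learning_rate: float = 1e-3


@dataclass
class AugmentationConfig:
    horizontal_flip: bool = False
    rotation_degrees: float = 0.0


@dataclass
class Config:
    model: ModelConfig = field(default_factory=ModelConfig)
    training: TrainingConfig = field(default_factory=TrainingConfig)
    augmentation: AugmentationConfig = field(default_factory=AugmentationConfig)

    @classmethod
    def from_dict(cls, raw: Dict[str, Any]) -> "Config":
        # Unknown keys raise TypeError, which surfaces typos in the file early.
        return cls(
            model=ModelConfig(**raw.get("model", {})),
            training=TrainingConfig(**raw.get("training", {})),
            augmentation=AugmentationConfig(**raw.get("augmentation", {})),
        )


# Hypothetical parsed content; missing sections fall back to the defaults.
raw = {"model": {"type": "transfer"}, "training": {"epochs": 5}}
cfg = Config.from_dict(raw)
print(cfg.model.type, cfg.training.epochs, cfg.training.batch_size)  # transfer 5 32
```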

📈 Results

Both baseline CNN and transfer learning models achieved excellent performance:

Baseline Model

Training metrics: >95%
Validation metrics: >95%
Test metrics: >95%

Note: One interesting observation was the consistent misclassification of 'Autorun.K' as 'Yuner.A', suggesting potential similarities in their binary patterns. You can check these results here.
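A confusion like the one noted above can be found by counting (true, predicted) label pairs that disagree. The helper `most_confused_pairs` is a hypothetical name, and the labels below are made-up toy data rather than the project's actual predictions.

```python
from collections import Counter


def most_confused_pairs(y_true, y_pred, top=3):
    """Return the most frequent (true, predicted) pairs where the labels differ."""
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return errors.most_common(top)


# Toy labels for illustration only.
y_true = ["Autorun.K", "Autorun.K", "Yuner.A", "Allaple.A", "Autorun.K"]
y_pred = ["Yuner.A", "Yuner.A", "Yuner.A", "Allaple.A", "Autorun.K"]

pairs = most_confused_pairs(y_true, y_pred)
print(pairs)  # [(('Autorun.K', 'Yuner.A'), 2)]
```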

Transfer Learning Models

Training metrics: 98-99%
Validation metrics: 98-99%
Test metrics: 98-99%

I obtained these results by training a ResNet-50 model. I decided not to include them in the experiments folder because no further tuning was needed, and you can use the notebook to test any model that timm makes available. The main focus of the project remains on the next phase.

🔮 Future Work

This repository represents Phase 1 of a larger project. Phase 2 will explore more advanced approaches:

Upcoming Features

Generative AI Integration
- Conditional GANs for malware image generation
- Diffusion models using Hugging Face's ecosystem

🚀 How to Run

I recommend running the notebook on Google Colab to take advantage of the free GPU and the preconfigured environment.

Go to Google Colab
Clone the repository
Install dependencies (if needed):

pip install -r requirements.txt

Configure the Kaggle API and config.yaml
Run all the cells

Silvano315/Malware-classification-model
