This repository represents the first part of a comprehensive malware classification project using deep learning techniques. The project applies computer vision approaches to malware classification by treating binary files as images.
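To make the "binaries as images" idea concrete, here is a minimal sketch of how raw bytes can be laid out as a grayscale image. The function name `bytes_to_grayscale` and the fixed width of 256 are illustrative assumptions, not part of this repository; the Malimg dataset already ships as images, so this step is shown only to explain the representation.

```python
import math

import numpy as np


def bytes_to_grayscale(data: bytes, width: int = 256) -> np.ndarray:
    """Map each byte to one pixel (0-255) and wrap rows at a fixed width."""
    if not data:
        raise ValueError("empty input")
    height = math.ceil(len(data) / width)
    # Zero-pad the tail so the buffer fills the last row completely.
    padded = np.zeros(height * width, dtype=np.uint8)
    padded[: len(data)] = np.frombuffer(data, dtype=np.uint8)
    return padded.reshape(height, width)


# 770 bytes at width 256 need 4 rows; the last row is mostly padding.
image = bytes_to_grayscale(bytes(range(256)) * 3 + b"\x90\x90", width=256)
print(image.shape)  # (4, 256)
```

Each row of the resulting array is a contiguous slice of the file, which is why structurally similar binaries tend to produce visually similar textures.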
The main contributions of this first phase are:
- A highly modular and reusable training framework
- Extensive experiment tracking and logging capabilities
- Flexible model architecture with support for both custom CNNs and transfer learning
- Comprehensive configuration management system
Information about the Malimg dataset is available here. I downloaded it from here using the Kaggle API and this script.
Next steps: I plan to experiment with Conditional GANs and diffusion models from Hugging Face to explore new ideas; this part is a work in progress. The first phase, built with deep learning and PyTorch, is complete. You can find a notebook to test this project here.
The project follows a highly modular architecture designed for extensibility and reusability:
malware_detection/
│
├── notebooks/
│ ├── 01_eda.ipynb
│ └── 02_model_training.ipynb
│
├── src/
│ ├── data/
│ │ └── dataset_analyser.py
│ │
│ ├── models/
│ │ ├── base.py
│ │ ├── cnn.py
│ │ └── model_factory.py
│ │
│ ├── training/
│ │ ├── experiment.py
│ │ ├── callbacks.py
│ │ └── trainer.py
│ │
│ ├── utils/
│ │ ├── image_analyser.py
│ │ ├── kaggle_downloader.py
│ │ └── logger_setup.py
│ │
│ └── visualization/
│ ├── panel_dashboard.py
│ ├── plot_results.py
│ └── plotters.py
│
├── config/
│ └── config.yaml
The `Experiment` class provides a robust framework for:
- Automatic experiment tracking and logging
- Checkpoint management with best model saving
- Training history visualization
- Performance metrics tracking
- JSON-based results export
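The tracking and export responsibilities listed above can be sketched roughly as follows. `ExperimentLog` is a hypothetical stand-in written for illustration; its method names and JSON layout are assumptions and do not reflect the actual interface in `src/training/experiment.py`.

```python
import json
import tempfile
from pathlib import Path


class ExperimentLog:
    """Hypothetical stand-in: records per-epoch metrics and exports JSON."""

    def __init__(self, name: str, output_dir: Path, monitor: str = "val_loss"):
        self.name = name
        self.output_dir = Path(output_dir)
        self.monitor = monitor
        self.history = []
        self.best_epoch = None
        self.best_value = float("inf")

    def log_epoch(self, epoch: int, **metrics: float) -> bool:
        """Store metrics; return True when the monitored value improved."""
        self.history.append({"epoch": epoch, **metrics})
        value = metrics[self.monitor]
        if value < self.best_value:
            self.best_value, self.best_epoch = value, epoch
            return True  # a caller would save its "best model" checkpoint here
        return False

    def export(self) -> Path:
        """Write the full history and best epoch to a JSON file."""
        self.output_dir.mkdir(parents=True, exist_ok=True)
        path = self.output_dir / f"{self.name}_results.json"
        payload = {
            "name": self.name,
            "best_epoch": self.best_epoch,
            "best_value": self.best_value,
            "history": self.history,
        }
        path.write_text(json.dumps(payload, indent=2))
        return path


log = ExperimentLog("demo", Path(tempfile.mkdtemp()))
improved = [log.log_epoch(e, val_loss=v) for e, v in enumerate([0.9, 0.5, 0.7])]
results_path = log.export()
```

Returning a boolean from `log_epoch` keeps checkpointing decisions in the caller, so the log itself never needs to know about model objects.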
Modular callback system inspired by Keras:
- `EarlyStopping`: Prevents overfitting
- `ModelCheckpoint`: Saves model states
- `ReduceLROnPlateau`: Adaptive learning rate
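As an illustration of the callback idea, the following is a minimal early-stopping sketch. The `on_epoch_end` hook name and the `patience` and `min_delta` parameters follow the Keras convention and are assumptions here; the callbacks in `src/training/callbacks.py` may be structured differently.

```python
class EarlyStopping:
    """Stop when the monitored loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.wait = 0  # epochs elapsed since the last improvement

    def on_epoch_end(self, val_loss: float) -> bool:
        """Return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.wait = 0
        else:
            self.wait += 1
        return self.wait >= self.patience


stopper = EarlyStopping(patience=2)
val_losses = [0.80, 0.60, 0.61, 0.62, 0.40]
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.on_epoch_end(loss):
        stopped_at = epoch
        break
```

With `patience=2`, the loop stops at epoch 3 and never sees the later improvement to 0.40, which is the usual trade-off when choosing a small patience value.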
Flexible model creation system supporting:
- Custom CNN architectures
- Transfer learning with timm models
- Automatic architecture configuration
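A factory along these lines could dispatch on a configuration string. This is a sketch under assumptions: the `model_type` values, the keyword names, and the small CNN layout are invented for illustration and are not taken from `src/models/model_factory.py`. The only library call relied on is `timm.create_model(name, pretrained=..., num_classes=...)`.

```python
def create_model(model_type: str, num_classes: int, **kwargs):
    """Dispatch on a config string; heavy imports happen only when needed."""
    if model_type == "custom_cnn":
        import torch.nn as nn

        # Hypothetical layout: one conv block per entry in `channels`.
        channels = kwargs.get("channels", (16, 32))
        layers, in_ch = [], kwargs.get("in_channels", 1)
        for out_ch in channels:
            layers += [
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(2),
            ]
            in_ch = out_ch
        # Global pooling makes the classifier independent of input size.
        layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(in_ch, num_classes)]
        return nn.Sequential(*layers)

    if model_type == "timm":
        import timm

        # timm replaces the classifier head to match `num_classes`.
        return timm.create_model(
            kwargs.get("name", "resnet50"),
            pretrained=kwargs.get("pretrained", True),
            num_classes=num_classes,
        )

    raise ValueError(f"unknown model_type: {model_type!r}")
```

For example, `create_model("timm", num_classes=25, name="resnet50")` would request a pretrained ResNet-50 with a 25-way head, matching the 25 malware families in Malimg.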
The training system features:
- Configuration-driven training setup
- Comprehensive metric logging
- Real-time performance visualization
- Automatic checkpoint management
Centralized YAML configuration for:
- Model architecture
- Training parameters
- Data augmentation
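One way to consume such a configuration is to map the three sections onto dataclasses, so that typos in the file fail early. All field names and default values below are hypothetical and are not the actual schema of `config/config.yaml`.

```python
from dataclasses import dataclass, field


@dataclass
class ModelConfig:
    type: str = "custom_cnn"
    num_classes: int = 25


@dataclass
class TrainingConfig:
    epochs: int = 10
    batch_size: int = 32
    learning_rate: float = 1e-3


@dataclass
class Config:
    model: ModelConfig = field(default_factory=ModelConfig)
    training: TrainingConfig = field(default_factory=TrainingConfig)
    augmentation: dict = field(default_factory=dict)

    @classmethod
    def from_dict(cls, raw: dict) -> "Config":
        # Unknown keys raise TypeError, which surfaces typos in the YAML early.
        return cls(
            model=ModelConfig(**raw.get("model", {})),
            training=TrainingConfig(**raw.get("training", {})),
            augmentation=dict(raw.get("augmentation", {})),
        )


# In practice `raw` would come from PyYAML, e.g. yaml.safe_load(file_handle).
raw = {
    "model": {"type": "timm", "num_classes": 25},
    "training": {"epochs": 20, "learning_rate": 3e-4},
    "augmentation": {"horizontal_flip": False},
}
config = Config.from_dict(raw)
```

Values omitted from the file fall back to the dataclass defaults, so `config.training.batch_size` is still 32 in this example.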
Both baseline CNN and transfer learning models achieved excellent performance:
Baseline CNN:
- Training metrics: >95%
- Validation metrics: >95%
- Test metrics: >95%
Note: One interesting observation was the consistent misclassification of 'Autorun.K' as 'Yuner.A', which suggests that the two families have similar binary patterns. You can check these results here.
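A systematic confusion like this one can be found by counting (true, predicted) label pairs among the errors. The sketch below uses toy labels invented for illustration; they are not the project's predictions, and `top_confusions` is a hypothetical helper.

```python
from collections import Counter


def top_confusions(y_true, y_pred, k=3):
    """Count (true, predicted) pairs where the two labels differ."""
    pairs = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return pairs.most_common(k)


# Toy labels for illustration only; these are not real model outputs.
y_true = ["Autorun.K", "Autorun.K", "Yuner.A", "Allaple.A", "Autorun.K"]
y_pred = ["Yuner.A", "Yuner.A", "Yuner.A", "Allaple.A", "Autorun.K"]
confusions = top_confusions(y_true, y_pred)
print(confusions)  # [(('Autorun.K', 'Yuner.A'), 2)]
```

Because the pairs are ordered, a one-directional confusion (A predicted as B, but not B as A) shows up clearly in the counts.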
Transfer learning (ResNet-50):
- Training metrics: 98-99%
- Validation metrics: 98-99%
- Test metrics: 98-99%
I obtained these results by training a ResNet-50 model. I decided not to include them in the experiments folder because no further tuning was needed, and you can use the notebook to test any model that timm makes available. The main focus of the project remains on the next phase.
This repository represents Phase 1 of a larger project. Phase 2 will explore more advanced approaches:
- Generative AI Integration
- Conditional GANs for malware image generation
- Diffusion models using Hugging Face's ecosystem
I recommend running this notebook on Google Colab to take advantage of the free GPU and the preconfigured environment.
- Go to Google Colab
- Clone the repository
- Install dependencies (if needed):
pip install -r requirements.txt
- Configure the Kaggle API and `config.yaml`
- Run all the cells