This is the repository for the final group project for the lecture Machine Learning in Cyber Security at CISPA / Saarland University.
It contains an experiment on malware classification on the EMBER dataset with neural networks and their vulnerability to evasion attacks, in particular FGSM (Fast Gradient Sign Method). We also propose adversarial training as a defense mechanism against such attacks.
📂mlcysec_final_project
┣ 📂 adv_samples
┃ ┣ 📜 adv_examples.jsonl
┃ ┣ 📜 perturbed_example_pretty.json
┃ ┣ 📜 selected_samples_original.jsonl
┃ ┗ 📜 selected_samples_perturbed.jsonl
┣ 📂 ember_github
┣ 📂 model
┃ ┣ 📜 EmberNet2.pth
┃ ┣ 📜 EmberNet2_hist.pth
┃ ┣ 📜 EmberNetRobust.pth
┃ ┣ 📜 EmberNetRobust_hist.pth
┃ ┗ 📜 scaler.pkl
┣ 📜 README.md
┣ 📦 adverserial_gen.py
┣ 📦 ember_net.py
┣ 📜 ember_nn.ipynb
┣ 📜 ember_nn_robust.ipynb
┣ 📦 evaluation.py
┣ 📦 plots.py
┗ 📦 preprocessing.py
The folder adv_samples contains several samples from our adversarial sample set. The folder ember_github is a fork of the EMBER repository. The folder model contains our pretrained models, including their training history.
The notebooks ember_nn.ipynb and ember_nn_robust.ipynb contain our training process and pipeline with results. The first notebook covers training, the FGSM attack, and adversarial sample generation; the second covers adversarial training and the FGSM attack. The module ember_net.py contains our neural network definition; adverserial_gen.py, evaluation.py, plots.py and preprocessing.py are further modules we created to modularize the project. The modules are well documented, so feel free to have a look.
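
For readers new to FGSM, the attack step can be illustrated with a minimal, dependency-free sketch. It is not the code from adverserial_gen.py: the notebooks attack a neural network, whereas this example assumes a tiny logistic model with made-up weights so that the input gradient can be written by hand. All names (`predict`, `fgsm`, `weights`, `epsilon`) are hypothetical and chosen for illustration only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Probability that feature vector x is malicious under a toy logistic model."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def loss_gradient_wrt_input(weights, bias, x, label):
    """Gradient of the binary cross-entropy loss with respect to x: (p - y) * w."""
    p = predict(weights, bias, x)
    return [(p - label) * w for w in weights]


def sign(value):
    return (value > 0) - (value < 0)


def fgsm(weights, bias, x, label, epsilon):
    """Move every feature by epsilon in the direction that increases the loss."""
    grad = loss_gradient_wrt_input(weights, bias, x, label)
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]


# Toy values (assumptions, not taken from the project).
weights = [2.0, -1.0, 0.5]
bias = -0.5
x = [1.0, 0.2, 0.4]
label = 1  # 1 = malicious

x_adv = fgsm(weights, bias, x, label, epsilon=0.3)
print("score before:", predict(weights, bias, x))
print("score after: ", predict(weights, bias, x_adv))
```

Each feature changes by at most `epsilon`, and the malicious score of the perturbed sample drops. Adversarial training, as used for the robust model, feeds such perturbed samples back into training together with their true labels.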
To use the provided notebooks and models, first download the EMBER dataset from its repository. We used the 2018 feature version. For everything to work out of the box, place the dataset in a folder called ember2018 inside the root folder of this project. You can also specify a different folder if necessary.
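
Before opening a notebook, it can help to confirm that the dataset landed where expected. The following is a small sketch, assuming the extracted EMBER 2018 archive contains the raw feature files train_features_0.jsonl to train_features_5.jsonl and test_features.jsonl; the helper name `missing_ember_files` is hypothetical and not part of this project.

```python
from pathlib import Path

# Assumed layout of the extracted EMBER 2018 archive.
EXPECTED_FILES = [f"train_features_{i}.jsonl" for i in range(6)] + ["test_features.jsonl"]


def missing_ember_files(data_dir="ember2018"):
    """Return the expected raw feature files that are not present in data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_ember_files()
    if missing:
        print("Missing dataset files:", ", ".join(missing))
    else:
        print("Dataset folder looks complete.")
```

If you store the dataset elsewhere, pass that folder as `data_dir`.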
You can then open one of the notebooks, which are self-explanatory. It is important to first vectorize the dataset and collect the hashes of each sample. Both steps only need to be done once; you can trigger them by setting the parameters of the data preprocessing accordingly.
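
To illustrate why the hash collection only needs to run once, here is a hedged sketch of the idea. It is not the implementation in preprocessing.py. It assumes that each line of a raw EMBER feature file is a JSON object with a `sha256` field, and it uses a hypothetical helper `collect_hashes` with a hypothetical cache file.

```python
import json
from pathlib import Path


def collect_hashes(jsonl_paths, cache_path):
    """Collect the sha256 of every sample, reusing a cache file if one exists."""
    cache = Path(cache_path)
    if cache.is_file():
        # A previous run already did the work, so just load the result.
        return json.loads(cache.read_text(encoding="utf-8"))

    hashes = []
    for path in jsonl_paths:
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                if line.strip():  # skip blank lines
                    hashes.append(json.loads(line)["sha256"])

    cache.write_text(json.dumps(hashes), encoding="utf-8")
    return hashes
```

The first call reads all feature files and writes the cache; later calls return the cached list without touching the raw data. Vectorization follows the same pattern: the vectorized features are written to disk once and read back afterwards.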
Please note that this project will most likely not work on Windows.