This repository contains the code implementing our approach for the paper "Fusing Feature Engineering and Deep Learning: A Case Study for Malware Classification". With it, you can train the XGBoost model that achieved the highest 10-fold cross-validation accuracy and the lowest logarithmic loss on the Microsoft Malware Classification Challenge dataset.
To extract the static features from both the hexadecimal representation of the malware's binary content and the assembly language source code, you need to install the pe_extractor library. Clone the repository and install it as follows:
git clone https://github.com/danielgibert/pe_extractor.git
cd pe_extractor/
pip install -e .
However, if you just want to train the XGBoost model and replicate the experiments, it is not necessary to install the aforementioned library, as we provide the extracted features for the samples in the training and test sets. You will, nevertheless, need to install the following libraries:
- pandas
- xgboost
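Both can be installed with pip, for example:
pip install pandas xgboost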
If you want to train your own CNN models to extract the deep features, you will need to install tensorflow, although this is not required to replicate the experiments. In addition, you will need to download the Microsoft Malware Classification Challenge dataset.
The code has been organized into two main directories:
- data/ directory. It contains the individual features extracted for all samples in the training and test sets of the Microsoft Malware Classification Challenge dataset.
- src/ directory. It contains the code to preprocess the files, run k-fold cross-validation, and train the CNNs and XGBoost models.
  - src/kfold_cross_validation/. It contains the code to run k-fold cross-validation on the training set (a minimal sketch of this procedure is shown after this list).
  - src/preprocessing/. It contains the code to preprocess the hexadecimal content and the assembly language source code and extract the feature representation of the executables.
  - src/tensorflow/. It contains the code to train the CNNs.
  - src/xgboost/. It contains the code to train the XGBoost model.
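As a rough illustration of the k-fold cross-validation step, the hedged sketch below runs 10-fold cross-validation with scikit-learn and XGBoost on a concatenated feature CSV. The file name and the "Id"/"Class" column names are illustrative assumptions, not necessarily the exact ones used by our scripts; the Microsoft dataset labels the nine families 1 to 9.

import pandas as pd
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, log_loss

# Assumed concatenated training CSV with an "Id" column and a "Class" label column (1-9)
train_df = pd.read_csv("../../data/feature_files/train/feature_concatenation/some_subset.csv")
X = train_df.drop(columns=["Id", "Class"]).values
y = train_df["Class"].values - 1  # XGBoost expects class labels starting at 0

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
accuracies, losses = [], []
for train_idx, val_idx in skf.split(X, y):
    model = xgb.XGBClassifier(objective="multi:softprob", eval_metric="mlogloss")
    model.fit(X[train_idx], y[train_idx])
    probs = model.predict_proba(X[val_idx])
    accuracies.append(accuracy_score(y[val_idx], probs.argmax(axis=1)))
    losses.append(log_loss(y[val_idx], probs, labels=list(range(9))))
print("10-fold accuracy:", sum(accuracies) / len(accuracies))
print("10-fold log loss:", sum(losses) / len(losses))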
We provide various CSV files containing the subsets of features for all samples in the training and test sets of the Microsoft Malware Classification Challenge dataset. These CSV files are stored in the following directories:
- Training CSV files path: malware_classification_with_gradient_boosting_and_deep_features/data/feature_files/train/
- Test CSV files path: malware_classification_with_gradient_boosting_and_deep_features/data/feature_files/test/
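As a quick sanity check that the files are in place, any of these CSVs can be inspected with pandas. The file name below is a hypothetical placeholder; substitute any CSV from the train/ or test/ directories.

import pandas as pd

# Hypothetical file name; replace with any CSV from data/feature_files/train/ or test/
df = pd.read_csv("../../data/feature_files/train/some_feature_subset.csv")
print(df.shape)               # number of samples x number of columns
print(list(df.columns[:10]))  # first few column names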
The first step is to generate a CSV file for training and another for testing containing the feature subsets used in our experiments. To do so, run the following BASH scripts, located in src/preprocessing/:
- malware_classification_with_gradient_boosting_and_deep_features/src/preprocessing/generate_single_csv_files_almost_all_features.sh
- malware_classification_with_gradient_boosting_and_deep_features/src/preprocessing/generate_single_csv_files_with_subsets_from_forward_stepwise_selection_algorithm.sh
- malware_classification_with_gradient_boosting_and_deep_features/src/preprocessing/generate_single_csv_files_with_various_subsets_of_features.sh
These scripts concatenate the various feature vectors (one for each type of feature) into a single feature vector.
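Conceptually, this concatenation can be pictured as merging per-feature CSV files on the sample identifier. The sketch below is only an illustration; the file names and the "Id" column are assumptions rather than the exact ones used by the scripts.

import pandas as pd
from functools import reduce

# Hypothetical per-feature CSV names; in practice these are the files under data/feature_files/
feature_files = ["feature_group_a.csv", "feature_group_b.csv", "feature_group_c.csv"]
frames = [pd.read_csv(path) for path in feature_files]

# Merge on the sample identifier so every row carries the full, concatenated feature vector
merged = reduce(lambda left, right: left.merge(right, on="Id"), frames)
merged.to_csv("concatenated_features.csv", index=False)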
The second step is to train the XGBoost model. We provide various BASH scripts to replicate our experiments:
- malware_classification_with_gradient_boosting_and_deep_features/src/xgboost/train_and_test_forward_stepwise_subset_models_script.sh
- malware_classification_with_gradient_boosting_and_deep_features/src/xgboost/train_and_test_individual_models_script.sh
- malware_classification_with_gradient_boosting_and_deep_features/src/xgboost/train_and_test_K_features_models_script.sh
- malware_classification_with_gradient_boosting_and_deep_features/src/xgboost/train_and_test_subset_models_script.sh
You can train the final model by executing the following Python script from the src/xgboost/ directory:
python train_and_test_xgboost_model_version2.py ../../data/feature_files/train/feature_concatenation/asm_view_features_plus_bytes_md_unigrams_cnn_without_pixel_intensity.csv ../../data/feature_files/test/feature_concatenation/asm_view_features_plus_bytes_md_unigrams_cnn_without_pixel_intensity.csv models/asm_view_features_plus_bytes_md_unigrams_cnn_without_pixel_intensity/best_hyperparameters_model hyperparameters/best_hyperparameters.json output/asm_view_features_plus_bytes_md_unigrams_cnn_without_pixel_intensity/best_hyperparameters_test.csv
The aforementioned scripts will generate a CSV file containing the probabilities assigned by the XGBoost model to each sample in the test set. Afterwards, you need to submit the resulting probabilities to Kaggle to get the logarithmic loss on the test set.
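For reference, the hedged sketch below gives a rough idea of what such a training script does: it fits an XGBoost classifier on the concatenated training features and writes the class probabilities for the test samples in a Kaggle-style submission file. The file names, column names ("Id", "Class", "Prediction1" to "Prediction9") and hyperparameters are illustrative assumptions, not the exact values used by train_and_test_xgboost_model_version2.py or stored in best_hyperparameters.json.

import pandas as pd
import xgboost as xgb

train_df = pd.read_csv("train_features.csv")   # assumed concatenated training CSV
test_df = pd.read_csv("test_features.csv")     # assumed concatenated test CSV

X_train = train_df.drop(columns=["Id", "Class"]).values
y_train = train_df["Class"].values - 1          # classes 1-9 mapped to 0-8 for XGBoost
X_test = test_df.drop(columns=["Id"]).values

# Illustrative hyperparameters; the real ones are loaded from best_hyperparameters.json
model = xgb.XGBClassifier(objective="multi:softprob", n_estimators=500,
                          max_depth=6, learning_rate=0.05, eval_metric="mlogloss")
model.fit(X_train, y_train)

# Write per-class probabilities in a Kaggle-style submission format
probs = model.predict_proba(X_test)
submission = pd.DataFrame(probs, columns=[f"Prediction{i}" for i in range(1, 10)])
submission.insert(0, "Id", test_df["Id"].values)
submission.to_csv("submission.csv", index=False)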
We provide a script that automatically performs the aforementioned steps:
cd src/
sh run.sh
If you find this work useful in your research, please consider citing:
@article{gibert2022,
author = {Daniel Gibert and
Jordi Planes and
Carles Mateu and
Quan Le},
title = {Fusing Feature Engineering and Deep Learning: A Case Study for Malware Classification},
journal = {Expert Systems with Applications},
publisher = {Elsevier},
year = {2022},
}
If you have any trouble, please contact daniel DOT gibertlla @ gmail DOT com.