Fusing Feature Engineering and Deep Learning: A Case Study for Malware Classification

This repository contains the code implementing the approach presented in the paper "Fusing Feature Engineering and Deep Learning: A Case Study for Malware Classification". Using it, you can train the XGBoost model that achieved the highest 10-fold cross-validation accuracy and the lowest logarithmic loss on the Microsoft Malware Classification Challenge dataset.

Install

To extract the static features from both the hexadecimal representation of a malware's binary content and its assembly language source code, you need to install the pe_extractor library. Clone the repository and install the library as follows:

git clone https://github.com/danielgibert/pe_extractor.git
cd pe_extractor/
pip install -e .

However, if you just want to train the XGBoost model to replicate the experiments, it is not necessary to install the aforementioned library, as we provide the extracted features for the samples in the training and test sets. Nevertheless, you will need to install the following libraries:

pandas
xgboost
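
Both can be installed with pip:

pip install pandas xgboost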

If you want to train your own CNN models to extract deep features, you will also need to install tensorflow, but this is not required to replicate the experiments. In addition, you will need to download the Microsoft Malware Classification Challenge dataset.
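
The dataset is hosted on Kaggle. As an example, assuming the Kaggle CLI is installed and configured with your API credentials, it can be downloaded with:

kaggle competitions download -c malware-classification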

Code Organization

The code has been organized into two main directories:

  • data/ directory. It contains the individual features extracted for all samples in the training and test sets of the Microsoft Malware Classification Challenge dataset.
  • src/ directory. It contains the code to preprocess the files, run k-fold cross validation, and train the CNNs and XGBoost models.
    • src/kfold_cross_validation/. It contains the code to run k-fold cross validation on the training set (see the sketch after this list).
    • src/preprocessing/. It contains the code to preprocess the hexadecimal content and the assembly language source code, extracting the feature representation of the executables.
    • src/tensorflow/. It contains the code to train the CNNs.
    • src/xgboost/. It contains the code to train the XGBoost model.
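
For reference, the following is a minimal sketch of 10-fold stratified cross validation using the xgboost API. The input path and the "Id"/"Class" column names are illustrative assumptions; the scripts in src/kfold_cross_validation/ take these as arguments.

import pandas as pd
import xgboost as xgb

# Illustrative path and column names; adapt them to the actual CSV files.
train_df = pd.read_csv("train_features.csv")
X = train_df.drop(columns=["Id", "Class"])
y = train_df["Class"] - 1  # Kaggle labels range from 1 to 9; XGBoost expects 0 to 8

dtrain = xgb.DMatrix(X, label=y)
params = {"objective": "multi:softprob", "num_class": 9}

# 10-fold stratified cross validation, tracking the multi-class logarithmic loss.
cv_results = xgb.cv(params, dtrain, num_boost_round=200, nfold=10,
                    stratified=True, metrics="mlogloss", seed=42)
print(cv_results["test-mlogloss-mean"].min())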

Steps to reproduce the paper

We provide various CSV files containing the subsets of features for all samples in the training and test sets of the Microsoft Malware Classification Challenge dataset. These CSV files are stored in the following directories:

  • Training CSV files path: malware_classification_with_gradient_boosting_and_deep_features/data/feature_files/train/
  • Test CSV files path: malware_classification_with_gradient_boosting_and_deep_features/data/feature_files/test/

Step 1: Feature Concatenation

The first step is to generate the training and test CSV files containing the final subsets of features used in our experiments. To do so, run the following BASH scripts:

  • malware_classification_with_gradient_boosting_and_deep_features/src/preprocessing/generate_single_csv_files_almost_all_features.sh
  • malware_classification_with_gradient_boosting_and_deep_features/src/preprocessing/generate_single_csv_files_with_subsets_from_forward_stepwise_selection_algorithm.sh
  • malware_classification_with_gradient_boosting_and_deep_features/src/preprocessing/generate_single_csv_files_with_various_subsets_of_features.sh

These scripts concatenate the various feature vectors (one for each type of feature) into a single feature vector.
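
Conceptually, each script performs a column-wise join of the per-feature CSV files on the sample identifier. Below is a minimal sketch of this idea with pandas, assuming every per-feature CSV shares an "Id" column (the filenames and column name are illustrative, not the repository's actual ones):

import functools
import pandas as pd

# Illustrative per-feature CSV files; the real ones live under data/feature_files/.
feature_csvs = ["asm_features.csv", "byte_unigrams.csv", "cnn_features.csv"]

# Load each per-feature table and merge them column-wise on the sample identifier.
frames = [pd.read_csv(path) for path in feature_csvs]
merged = functools.reduce(lambda a, b: a.merge(b, on="Id"), frames)
merged.to_csv("concatenated_features.csv", index=False)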

Step 2: Training the XGBoost model and generating predictions for the test set

The second step is to train the XGBoost model. We provide various BASH scripts to replicate our experiments:

  • malware_classification_with_gradient_boosting_and_deep_features/src/xgboost/train_and_test_forward_stepwise_subset_models_script.sh
  • malware_classification_with_gradient_boosting_and_deep_features/src/xgboost/train_and_test_individual_models_script.sh
  • malware_classification_with_gradient_boosting_and_deep_features/src/xgboost/train_and_test_K_features_models_script.sh
  • malware_classification_with_gradient_boosting_and_deep_features/src/xgboost/train_and_test_subset_models_script.sh

You can train the final model by executing the following Python script:

python train_and_test_xgboost_model_version2.py \
    ../../data/feature_files/train/feature_concatenation/asm_view_features_plus_bytes_md_unigrams_cnn_without_pixel_intensity.csv \
    ../../data/feature_files/test/feature_concatenation/asm_view_features_plus_bytes_md_unigrams_cnn_without_pixel_intensity.csv \
    models/asm_view_features_plus_bytes_md_unigrams_cnn_without_pixel_intensity/best_hyperparameters_model \
    hyperparameters/best_hyperparameters.json \
    output/asm_view_features_plus_bytes_md_unigrams_cnn_without_pixel_intensity/best_hyperparameters_test.csv

The aforementioned scripts will generate a CSV file containing the probabilities assigned by the XGBoost model to each sample in the test set. Afterwards, you need to submit the resulting probabilities to Kaggle to get the logarithmic loss on the test set.
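
For reference, the following is a minimal sketch of what such a training script does, assuming the concatenated CSVs contain an "Id" column, a "Class" label column with the Kaggle labels 1-9 in the training file, and hyperparameters stored as a JSON dictionary; all names below are illustrative.

import json
import pandas as pd
import xgboost as xgb

# Illustrative paths and column names; the actual script takes them as arguments.
train_df = pd.read_csv("train_features.csv")
test_df = pd.read_csv("test_features.csv")

X_train = train_df.drop(columns=["Id", "Class"])
y_train = train_df["Class"] - 1  # map Kaggle labels 1-9 to 0-8
X_test = test_df.drop(columns=["Id"])

with open("best_hyperparameters.json") as f:
    params = json.load(f)
params.update({"objective": "multi:softprob", "num_class": 9})

dtrain = xgb.DMatrix(X_train, label=y_train)
booster = xgb.train(params, dtrain, num_boost_round=500)

# Kaggle expects one probability column per malware family, Prediction1 to Prediction9.
probs = booster.predict(xgb.DMatrix(X_test))
submission = pd.DataFrame(probs, columns=["Prediction%d" % i for i in range(1, 10)])
submission.insert(0, "Id", test_df["Id"])
submission.to_csv("submission.csv", index=False)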

Run.sh script

We provide a script that automatically performs the aforementioned steps.

cd src/
sh run.sh

Citing

If you find this work useful in your research, please consider citing:

@article{gibert2022,
  author    = {Daniel Gibert and
               Jordi Planes and
               Carles Mateu and
               Quan Le},
  title     = {Fusing Feature Engineering and Deep Learning: A Case Study for Malware Classification},
  journal   = {Expert Systems with Applications},
  publisher = {Elsevier},
  year      = {2022},
}

Contact

If you have any trouble, please contact daniel DOT gibertlla @ gmail DOT com
