This repository contains some of the core Python scripts used in the master thesis:
Attribution of Malware Binaries to APT Actors using an Ensemble Classifier
Disclaimer: Due to the integration of third-party tools and confidential code, the entire MAA pipeline cannot be disclosed. Furthermore, please note that the APT samples and extracted features cannot be shared due to TLP restrictions. However, you may have a look at open-source malware repositories to collect the samples referenced in the datasets.
Two datasets were used to evaluate the proposed ensemble classifier: APTClass and cyber-research.
APTClass is an annotated meta-dataset for MAA published by Gray et al. The ground truth is based on threat intelligence reports published by government departments, anti-virus and security companies.
- Paper: Identifying Authorship Style in Malicious Binaries: Techniques, Challenges & Datasets
- Dataset: Bitbucket (15,660 samples by 164 APT groups according to paper)
Some incorrect assignments were found in the dataset; they were reported to the authors and are described in the thesis. The applied corrections were committed to a dedicated Bitbucket repository in a traceable manner. The authors of APTClass adopted the majority of the corrections into the original dataset, with some minor exceptions.
cyber-research is an annotated dataset including all referenced samples for MAA published by Coen Boot. The ground truth is based on threat intelligence reports published by security vendors.
- Paper: Applying Supervised Learning on Malware Authorship Attribution
- Dataset: GitHub (3,594 samples by 12 APT groups according to paper)
Similar to APTClass, a few mistakes were noticed in this dataset as well; they are documented in the pull request cyber-research/APTMalware#2.
All configurable variables are stored in an environment file .env at the top level of the project directory, which contains connection information, hyper-parameters and local paths.
An overview of the environment variables is given in the configuration file models/config.py.
Default values were assigned to the majority of the variables; these defaults were used in the final MAA pipeline described in the master thesis.
However, the following variables must be defined by yourself:
SAMPLE_FOLDER=/samples
DATABASE_URL="mysql://username:password@localhost:3306/aptclass?charset=utf8mb4"
MONGODB_URL="mongodb://localhost:27017/"
APTCLASS_CSV=/home/user/aptclass/2022-aug-aptclass_dataset.csv
CYBERRESEARCH_CSV=/home/user/cyberresearch/overview.csv
NUMPY_FILE_MODEL_A=/home/user/model_A_feature_vector_aptclass.npz
NUMPY_FILE_MODEL_B=/home/user/model_B_feature_vector_aptclass.npz
NUMPY_FILE_MODEL_C=/home/user/model_C_feature_vector_aptclass.npz
Since the environment file is used by all Python modules of this MAA pipeline, the working directory has to be the project directory.
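The actual settings class lives in models/config.py and may differ; the following is only a minimal sketch of how such a .env file could be loaded, assuming pydantic v1 style BaseSettings and the variable names from the example above:

```python
# Minimal sketch of an environment-backed configuration.
# Assumes pydantic v1 BaseSettings (pydantic-settings in newer versions);
# the real definitions live in models/config.py and may differ.
from pydantic import BaseSettings


class Settings(BaseSettings):
    sample_folder: str = "/samples"
    database_url: str                      # DATABASE_URL, must be set in .env
    mongodb_url: str = "mongodb://localhost:27017/"
    aptclass_csv: str                      # APTCLASS_CSV
    cyberresearch_csv: str                 # CYBERRESEARCH_CSV
    # further paths (e.g. NUMPY_FILE_MODEL_A/B/C) omitted for brevity

    class Config:
        # resolved relative to the working directory, hence the project root
        env_file = ".env"


settings = Settings()
print(settings.database_url)
```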
The proposed malware authorship attribution pipeline is composed of four components, illustrated in the figure below.
This component prepares the dataset by transforming the information about samples, APT groups, countries and aliases into a relational database using SQLModel. The relations between the entities are depicted in the following UML diagram.
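Purely as an illustration of this entity model (not the actual schema from this repository), two of the entities could be declared with SQLModel roughly like this:

```python
# Illustrative SQLModel entities; class and field names are placeholders,
# the real schema in this repository may differ.
from typing import List, Optional
from sqlmodel import Field, Relationship, SQLModel


class Group(SQLModel, table=True):
    id: Optional[int] = Field(default=None, primary_key=True)
    name: str
    country: Optional[str] = None
    samples: List["Sample"] = Relationship(back_populates="group")


class Sample(SQLModel, table=True):
    id: Optional[int] = Field(default=None, primary_key=True)
    sha256: str = Field(index=True)
    file_type: Optional[str] = None
    group_id: Optional[int] = Field(default=None, foreign_key="group.id")
    group: Optional[Group] = Relationship(back_populates="samples")
```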
A parser for both datasets can be found in the datasets folder; it filters out samples that do not comply with the MAA assumptions (e.g. unsupported file types or multiple actors) and splits the dataset into eight folds.
However, you must first request access to the original and updated APTClass CSV files 2021-jan-aptclass_dataset.csv and 2022-nov-aptclass_dataset.csv from Gray et al. before I am allowed to give you access to the corrected dataset 2022-aug-aptclass_dataset.csv.
Furthermore, the cyber-research dataset does not include the file types in overview.csv by default, so take a look at my forked repository marius-benthin/cyber-research.
- APTClass: 11,787 portable executables by 82 APT groups
- cyber-research: 2,867 portable executables by 12 APT groups
It is assumed that the samples and extracted artifacts are stored in the following folder structure indexed by SHA-256.
/
├─ 00/
│  ├─ 00A1…F6BC/
│  │  ├─ 00A1…F6BC
│  │  ├─ 00A1…F6BC.c
│  │  ├─ 00A1…F6BC.ast
│  │  ├─ 00A1…F6BC.idb
│  │  ├─ 00A1…F6BC.i64
│  │  └─ 00A1…F6BC_vmray.json
├─ 01/
├─ …/
└─ FF/
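As a small illustration, resolving the paths of a sample and its artifacts in this structure could look like the following sketch (it assumes the SAMPLE_FOLDER variable from above and uppercase SHA-256 names; the helper names are made up):

```python
# Sketch: resolve artifact paths in the SHA-256 indexed folder structure.
# Assumes SAMPLE_FOLDER from .env and the layout shown above; names are illustrative.
import os
from pathlib import Path

SAMPLE_FOLDER = Path(os.environ.get("SAMPLE_FOLDER", "/samples"))


def sample_dir(sha256: str) -> Path:
    """Return the per-sample directory, e.g. /samples/00/00A1…F6BC/."""
    sha256 = sha256.upper()
    return SAMPLE_FOLDER / sha256[:2] / sha256


def artifact_path(sha256: str, suffix: str = "") -> Path:
    """Return a path such as <dir>/<sha256>.c or <dir>/<sha256>_vmray.json."""
    return sample_dir(sha256) / f"{sha256.upper()}{suffix}"
```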
The sample pre-processing component is divided into a static and dynamic stream.
First, all samples in the datasets are unpacked with UnpacMe. The service offers its own API, which can be used to automatically upload the samples and download the results. A packed sample may contain one or multiple portable executables that have to be sorted into the folder structure as well.
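For illustration, such an upload-and-poll loop could be sketched as below; the endpoint paths and the authorization header format are assumptions, so please consult the official UnpacMe API documentation for the authoritative interface:

```python
# Sketch: submit a sample to UnpacMe and fetch the unpacking results.
# NOTE: base URL, endpoint paths and auth header are assumptions; check the
# official UnpacMe API documentation before use.
import time
import requests

API = "https://api.unpac.me/api/v1"           # assumed base URL
HEADERS = {"Authorization": "Key <api-key>"}  # assumed auth scheme


def unpack(sample_path: str) -> dict:
    with open(sample_path, "rb") as f:
        r = requests.post(f"{API}/private/upload", headers=HEADERS, files={"file": f})
    r.raise_for_status()
    task_id = r.json()["id"]

    # poll until the analysis is finished, then return the result document
    while True:
        status = requests.get(f"{API}/public/status/{task_id}", headers=HEADERS).json()
        if status.get("status") == "complete":
            return requests.get(f"{API}/public/results/{task_id}", headers=HEADERS).json()
        time.sleep(10)
```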
Subsequently, the user-related strings were extracted using the disassembler IDA Pro together with the IDAPython plugin BinAuthor.
Since IDA Pro 7.7 was used, the BinAuthor plugin had to be upgraded to Python 3 first (see marius-benthin/BinAuthor).
To automate the process, a FastAPI application was developed that accepts multiple samples in one IDA Pro session and disassembles them in parallel.
The source code and the Docker image can be found at pre-processing/binauthor/.
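The snippet below is only a simplified, illustrative sketch of the idea behind that service: a FastAPI endpoint that stores an uploaded sample and hands it to a headless IDA Pro (idat64) run of an extraction script. Script and directory names are placeholders, not the actual implementation from pre-processing/binauthor/:

```python
# Simplified sketch of a FastAPI wrapper around headless IDA Pro runs.
# Paths and the extraction script name are placeholders.
import subprocess
from pathlib import Path

from fastapi import FastAPI, UploadFile

app = FastAPI()
WORKDIR = Path("/tmp/binauthor")
WORKDIR.mkdir(parents=True, exist_ok=True)


@app.post("/extract")
async def extract(sample: UploadFile):
    target = WORKDIR / sample.filename
    target.write_bytes(await sample.read())
    # -A: autonomous mode, -S: run the (placeholder) extraction script
    subprocess.run(["idat64", "-A", "-Sextract_strings.py", str(target)], check=True)
    return {"sample": sample.filename, "status": "processed"}
```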
Simultaneously, all unpacked samples were converted to C-like pseudocode using the Hex-Rays decompiler.
This was also done in parallel within one IDA Pro session using a FastAPI application.
The source code and the Docker image can be found at pre-processing/decompiler/.
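Internally, such a wrapper boils down to an IDAPython script executed headlessly; a stripped-down sketch (not the actual script from pre-processing/decompiler/) that dumps the Hex-Rays pseudocode of all functions could look like this:

```python
# Stripped-down IDAPython sketch: decompile every function with Hex-Rays and
# write the pseudocode next to the input file. Run via idat64 -A -S<script>.
import ida_auto
import ida_hexrays
import ida_nalt
import ida_pro
import idautils

ida_auto.auto_wait()                          # wait for auto-analysis to finish
out_path = ida_nalt.get_input_file_path() + ".c"

if ida_hexrays.init_hexrays_plugin():
    with open(out_path, "w") as out:
        for ea in idautils.Functions():
            try:
                out.write(str(ida_hexrays.decompile(ea)) + "\n")
            except ida_hexrays.DecompilationFailure:
                continue                      # skip functions Hex-Rays cannot handle

ida_pro.qexit(0)                              # terminate the headless IDA session
```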
Afterwards, the decompiled code was transformed into an abstract syntax tree with the fuzzy parser Joern (Version 0.2.5) and joern-tools.
To do this, the code was first loaded into the Neo4j database deployed via the Docker image neepl/joern and afterwards exported using the joern-tools.
# index C file located in mounted folder code/
java -jar /joern/bin/joern.jar /code
# start Neo4j database
/var/lib/neo4j/bin/neo4j start
# generate abstract syntax tree
echo 'queryNodeIndex("type:Function").id' | joern-lookup -g | joern-plot-ast > /code/test.ast
Please note that the most recent Joern version is recommended for new attribution pipelines because it is faster and more stable. However, to remain reproducible with the approach of Caliskan et al., this pipeline sticks with the old version.
Since the samples unpack themselves in memory during dynamic analysis anyway, the original, possibly packed samples were used. All portable executables are executed with VMRay Analyzer on a Windows 10 64-bit system for a maximum of three minutes without Internet connection. Afterwards, the JSON report of the Analysis Results Reference v1 has been downloaded and stored next to the sample in the folder structure.
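A rough sketch of how behavior artifacts could be pulled from such a report for the later bi-gram extraction is shown below; the JSON field names are assumptions and may not match the Analysis Results Reference v1 exactly:

```python
# Sketch: collect behavior artifacts from a VMRay report for 2-gram extraction.
# The field names below are assumptions, not the documented report schema.
import json
from pathlib import Path
from typing import List


def load_behavior_artifacts(report_path: Path) -> List[str]:
    report = json.loads(report_path.read_text())
    artifacts = report.get("artifacts", {})                      # assumed key
    collected: List[str] = []
    for kind in ("files", "mutexes", "registry", "processes"):   # assumed keys
        for entry in artifacts.get(kind, []):
            name = entry.get("filename") or entry.get("name")    # assumed fields
            if name:
                collected.append(name)
    return collected
```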
The feature extraction component decomposes the user-related strings and behavior artifacts into 3-grams and 2-grams respectively and calculates the mutual information for the derived numpy feature vector.
Furthermore, the abstract syntax tree nodes are simply counted and also stored in a numpy feature vector.
Since the feature space for the n-grams gets very large, only the top 10% of n-grams are selected using mutual information. This feature selection has to be conducted eight times, because the 8-fold cross-validation splits the dataset into eight training sets of seven folds each.
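A minimal, illustrative sketch of this n-gram counting and mutual-information selection with scikit-learn (toy data, not the thesis implementation):

```python
# Sketch: character 3-grams from user-related strings, reduced to the top 10%
# by mutual information; this is repeated per training split of the 8-fold CV.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectPercentile, mutual_info_classif

strings = ["CreateFileA", "WinExec", "GetProcAddress", "VirtualAlloc"]  # toy data
labels = np.array([0, 1, 0, 1])                                         # APT group ids

# decompose each string into character 3-grams and count them
vectorizer = CountVectorizer(analyzer="char", ngram_range=(3, 3))
X = vectorizer.fit_transform(strings)

# keep only the top 10% of n-grams ranked by mutual information with the label
selector = SelectPercentile(mutual_info_classif, percentile=10)
X_selected = selector.fit_transform(X, labels)

# store the reduced feature vectors, analogous to the NUMPY_FILE_MODEL_* files
np.savez("model_A_feature_vector.npz", X=X_selected.toarray(), y=labels)
```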
The last component is the authorship identification, which first employs a feedforward neural network for the user-related string tri-grams and two random forests for the AST nodes and dynamic behavior bi-grams. Subsequently, the predictions are passed to the ensemble classifier with soft voting to output the final attribution.
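Conceptually, the soft voting averages the class probabilities of the three base models and attributes each sample to the highest-scoring APT group. A minimal sketch with illustrative hyper-parameters (not the tuned thesis configuration):

```python
# Sketch of the soft-voting idea: average the class probabilities of the three
# base models and pick the APT group with the highest averaged probability.
# Hyper-parameters are illustrative, not the thesis configuration.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier


def soft_vote(X_strings, X_ast, X_behavior, y,
              X_strings_test, X_ast_test, X_behavior_test):
    # one model per feature source, all trained on the same samples/labels
    model_a = MLPClassifier(hidden_layer_sizes=(128,), max_iter=500).fit(X_strings, y)
    model_b = RandomForestClassifier(n_estimators=100).fit(X_ast, y)
    model_c = RandomForestClassifier(n_estimators=100).fit(X_behavior, y)

    # average the per-class probabilities and take the argmax
    proba = (
        model_a.predict_proba(X_strings_test)
        + model_b.predict_proba(X_ast_test)
        + model_c.predict_proba(X_behavior_test)
    ) / 3.0
    return model_a.classes_[np.argmax(proba, axis=1)]
```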