This is an implementation of the Deep Cost-sensitive Kernel Machine (DCKM) model as described in the paper "Deep Cost-sensitive Kernel Machine for Binary Software Vulnerability Detection", which was accepted at PAKDD 2020.
The DCKM model combines deep learning, kernel methods, and a cost-sensitive approach to efficiently detect potential vulnerabilities in binary software.
The DCKM model consists of three primary elements: an embedding layer that vectorizes machine instructions, a Bidirectional Recurrent Neural Network that captures temporal information from a sequence of machine instructions, and a novel Cost-sensitive Kernel Machine that operates in the random feature space to predict vulnerability with minimal cost-sensitive loss.
The model is trained on two binary datasets: NDSS18, and 6 open-source, a new real-world binary dataset whose source code was collected from six open-source projects.
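
To make the idea of a kernel machine in a random feature space more concrete, here is a minimal pure-Python sketch. It is not the code from main.py and not the exact formulation of the paper: it assumes a Gaussian kernel approximated with random Fourier features, and it uses a generic cost-weighted hinge loss as a stand-in for the cost-sensitive loss. The names `make_random_feature_map`, `cost_sensitive_hinge`, `gamma`, `cost_fn` and `cost_fp` are hypothetical.

```python
import math
import random


def make_random_feature_map(input_dim, num_random_features, gamma=1.0, seed=0):
    """Build a random Fourier feature map (illustrative assumption).

    The inner product of two mapped vectors approximates the Gaussian
    kernel k(x, y) = exp(-gamma * ||x - y||^2).
    """
    rng = random.Random(seed)
    # For this kernel, frequencies are drawn from N(0, 2 * gamma) per dimension.
    std = math.sqrt(2.0 * gamma)
    omegas = [[rng.gauss(0.0, std) for _ in range(input_dim)]
              for _ in range(num_random_features)]
    offsets = [rng.uniform(0.0, 2.0 * math.pi)
               for _ in range(num_random_features)]
    scale = math.sqrt(2.0 / num_random_features)

    def feature_map(x):
        return [scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
                for w, b in zip(omegas, offsets)]

    return feature_map


def cost_sensitive_hinge(score, label, cost_fn=5.0, cost_fp=1.0):
    """Generic cost-weighted hinge loss (a stand-in, not the paper's loss).

    label is 1 (vulnerable) or 0 (non-vulnerable); errors on vulnerable
    functions are weighted by cost_fn, errors on the others by cost_fp.
    """
    y = 1.0 if label == 1 else -1.0
    cost = cost_fn if label == 1 else cost_fp
    return cost * max(0.0, 1.0 - y * score)


if __name__ == "__main__":
    phi = make_random_feature_map(input_dim=3, num_random_features=512)
    x, y = [0.1, 0.2, 0.3], [0.2, 0.0, 0.4]
    approx = sum(a * b for a, b in zip(phi(x), phi(y)))
    exact = math.exp(-1.0 * sum((a - b) ** 2 for a, b in zip(x, y)))
    print("approximate kernel:", approx, "exact kernel:", exact)
```

In the actual model, the input to such a feature map would be the representation produced by the Bidirectional RNN rather than a hand-written vector.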
| Dataset | Platform | #Non-vul | #Vul | #Binaries |
|---|---|---|---|---|
| NDSS18 | Windows | 8,999 | 8,978 | 17,977 |
| NDSS18 | Linux | 6,955 | 7,349 | 14,304 |
| NDSS18 | Whole | 15,954 | 16,327 | 32,281 |
| 6 open-source | Windows | 26,621 | 328 | 26,949 |
| 6 open-source | Linux | 25,660 | 290 | 25,950 |
| 6 open-source | Whole | 52,281 | 618 | 52,899 |
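
The statistics above show why a cost-sensitive approach matters: NDSS18 is roughly balanced, while only about 1.2% of the functions in 6 open-source are vulnerable. The short sketch below recomputes these fractions from the "Whole" rows; the dictionary layout and the function name `vulnerable_fraction` are illustrative choices, not part of the repository.

```python
# Counts taken from the "Whole" rows of the dataset statistics.
counts = {
    "NDSS18": {"non_vul": 15954, "vul": 16327},
    "6 open-source": {"non_vul": 52281, "vul": 618},
}


def vulnerable_fraction(non_vul, vul):
    """Fraction of functions in a dataset that are labelled vulnerable."""
    return vul / (non_vul + vul)


fractions = {name: vulnerable_fraction(**c) for name, c in counts.items()}

for name, fraction in fractions.items():
    print("{}: {:.2%} vulnerable".format(name, fraction))
```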
Each dataset folder contains two kinds of files: binaries-x-y.data, which holds functions compiled into binaries for a given platform (Windows or Linux) and architecture (x86 or x64), and labels-x-y.data, which holds the corresponding labels. Here x is '32' or '64' and y is 'windows' or 'linux'.
Note:
- We convert machine instructions from hexadecimal to decimal format. Each byte of instruction information then becomes a number from 0 to 255, which makes it easier to compute the frequency vector used to obtain the instruction information embedding (see the section "Data Processing and Embedding" of the paper).
- We use the special character '|' to separate the opcode from the instruction information. For example, '131|131,196,8' has the opcode '131' and the instruction information '131,196,8'. Additionally, functions are separated by '-----'.
...
-----
85|85
137|137,229
104|104,0,0,0,0
106|106,1
232|232,252,255,255,255
131|131,196,8
255|255,117,8
232|232,252,255,255,255
131|131,196,4
144|144
201|201
195|195
-----
85|85
137|137,229
131|131,236,16
...
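
The format above can be read with a few lines of Python. The sketch below is not taken from main.py: the names `hex_bytes_to_decimal`, `parse_functions` and `frequency_vector` are hypothetical, the hexadecimal input is assumed to be space-separated bytes, and the frequency vector is assumed to be a plain count of each byte value from 0 to 255 (see the paper for the exact definition).

```python
from collections import Counter

FUNCTION_SEPARATOR = "-----"


def hex_bytes_to_decimal(hex_string):
    """Convert space-separated hex bytes, e.g. '83 c4 08', to decimals."""
    return [int(byte, 16) for byte in hex_string.split()]


def parse_functions(text):
    """Split the content of a binaries file into functions.

    Each function is a list of (opcode, instruction_info) pairs, where both
    parts are lists of integers in the range 0..255.
    """
    functions, current = [], []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line == FUNCTION_SEPARATOR:
            if current:
                functions.append(current)
                current = []
            continue
        opcode_part, info_part = line.split("|", 1)
        opcode = [int(value) for value in opcode_part.split(",")]
        info = [int(value) for value in info_part.split(",")]
        current.append((opcode, info))
    if current:
        functions.append(current)
    return functions


def frequency_vector(instruction_info):
    """Count how often each byte value 0..255 occurs (assumed definition)."""
    counts = Counter(instruction_info)
    return [counts.get(value, 0) for value in range(256)]


if __name__ == "__main__":
    sample = "-----\n85|85\n137|137,229\n131|131,196,8\n-----\n85|85\n"
    for function in parse_functions(sample):
        print(len(function), "instructions")
```

Parsing the opcode as a list rather than a single integer is a defensive choice in case an opcode spans more than one byte; the excerpt above only shows single-byte opcodes.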
Note: labels '0' and '1' denote a non-vulnerable and a vulnerable function, respectively.
0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
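
A labels file in this form can be loaded and sanity-checked as follows. This is a minimal sketch that assumes the labels are whitespace-separated; `parse_labels` and `class_counts` are hypothetical helper names, not functions from the repository.

```python
def parse_labels(text):
    """Parse whitespace-separated 0/1 labels into a list of integers."""
    labels = [int(token) for token in text.split()]
    invalid = [value for value in labels if value not in (0, 1)]
    if invalid:
        raise ValueError("unexpected label values: {}".format(invalid[:5]))
    return labels


def class_counts(labels):
    """Return (number of non-vulnerable, number of vulnerable) functions."""
    num_vul = sum(labels)
    return len(labels) - num_vul, num_vul


if __name__ == "__main__":
    labels = parse_labels("0 0 0 0 0 0 1 0 0 0")
    print(class_counts(labels))
```

Comparing the number of labels with the number of functions parsed from the matching binaries file is a cheap way to confirm that the two files are aligned.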
- Python >= 3.5
- TensorFlow >= 1.12
Command to run:
python main.py
Model parameters: See the initialization function of the DCKM class (in main.py) for the hyperparameter settings. Running the default setting reproduces the best result of the experiment on the whole 6 open-source dataset, which outperforms the baselines in all performance measures of interest, including the cost-sensitive loss, F1 score, and AUC (see Table III of the paper).
Some parameters are crucial for obtaining good results after 100 epochs:
- embedding_dimension: the dimension of the instruction embedding. Set it to 100 for 6_projects and 64 for NDSS18.
- hidden_size: the number of hidden units of the Bidirectional RNN. 128 units work well for 6_projects and 256 units for NDSS18.
- num_random_features: the dimension of the random feature space into which machine instruction representations are mapped. It depends on the data size: use 512 or 1024, or 2048 for larger datasets.
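
The recommendations above can be collected in a small lookup table. The helper below is only an illustration: `suggested_hyperparameters` is a hypothetical name, the default of 512 random features is an arbitrary pick among the listed values, and how these values are passed to the DCKM class depends on its initialization function in main.py.

```python
# Per-dataset values taken from the recommendations above.
SUGGESTED = {
    "6_projects": {"embedding_dimension": 100, "hidden_size": 128},
    "NDSS18": {"embedding_dimension": 64, "hidden_size": 256},
}

# Random feature dimensions mentioned above; larger values suit larger data.
ALLOWED_RANDOM_FEATURES = (512, 1024, 2048)


def suggested_hyperparameters(dataset, num_random_features=512):
    """Return a dict with the suggested settings for a dataset."""
    if dataset not in SUGGESTED:
        raise KeyError("unknown dataset: {}".format(dataset))
    if num_random_features not in ALLOWED_RANDOM_FEATURES:
        raise ValueError("num_random_features should be 512, 1024 or 2048")
    params = dict(SUGGESTED[dataset])
    params["num_random_features"] = num_random_features
    return params


if __name__ == "__main__":
    print(suggested_hyperparameters("NDSS18", num_random_features=1024))
```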
To test a saved DCKM model, set the running_mode parameter to '0' and rerun main.py.