Deep Cost-sensitive Kernel Machine Model

This is an implementation of the Deep Cost-sensitive Kernel Machine (DCKM) model described in the paper "Deep Cost-sensitive Kernel Machine for Binary Software Vulnerability Detection", which was accepted at PAKDD 2020.

The DCKM model combines deep learning, kernel methods, and a cost-sensitive approach to efficiently detect potential vulnerabilities in binary software.

The DCKM model consists of three primary elements: an embedding layer that vectorizes machine instructions, a Bidirectional Recurrent Neural Network that takes into account temporal information from a sequence of machine instructions, and a novel Cost-sensitive Kernel Machine that operates in the random feature space to predict vulnerability with minimal cost-sensitive loss.
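To make the phrase "random feature space" more concrete, here is a minimal sketch of a random Fourier feature map for an RBF kernel, written in plain Python. It is an illustration under stated assumptions, not the code in main.py: the function names, the choice of an RBF kernel, the value of gamma, and the toy input vectors are all hypothetical, and the paper's exact formulation may differ. In the model, the input to such a map would be the representation produced by the Bidirectional RNN.

```python
import math
import random


def make_random_features(input_dim, num_random_features, gamma, seed=0):
    """Sample the parameters of a random Fourier feature map that approximates
    the RBF kernel k(x, y) = exp(-gamma * ||x - y||^2). (Hypothetical helper.)"""
    rng = random.Random(seed)
    std = math.sqrt(2.0 * gamma)  # frequencies are drawn from N(0, 2 * gamma * I)
    weights = [[rng.gauss(0.0, std) for _ in range(input_dim)]
               for _ in range(num_random_features)]
    offsets = [rng.uniform(0.0, 2.0 * math.pi)
               for _ in range(num_random_features)]
    return weights, offsets


def transform(x, weights, offsets):
    """Map x to the random feature space: z_i = sqrt(2 / D) * cos(w_i . x + b_i)."""
    scale = math.sqrt(2.0 / len(weights))
    return [scale * math.cos(sum(w_j * x_j for w_j, x_j in zip(w, x)) + b)
            for w, b in zip(weights, offsets)]


def rbf_kernel(x, y, gamma):
    """Exact RBF kernel value, used here only for comparison."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


# Toy vectors standing in for two function representations (assumed values).
x = [0.1, 0.3, -0.2, 0.5]
y = [0.0, 0.4, -0.1, 0.2]
gamma = 0.5

weights, offsets = make_random_features(len(x), 2048, gamma)
zx = transform(x, weights, offsets)
zy = transform(y, weights, offsets)

approx = sum(a * b for a, b in zip(zx, zy))  # inner product in feature space
exact = rbf_kernel(x, y, gamma)
print("approximate kernel:", approx)
print("exact kernel:      ", exact)
```

The inner product of two mapped vectors approximates the kernel value, so a linear classifier trained on the mapped vectors behaves approximately like a kernel machine. Using more random features tightens the approximation at the cost of more computation.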

The model is trained on two binary datasets: NDSS18 and 6 open-source, a new real-world binary dataset whose source code was collected from six open-source projects.

Datasets

The statistics of the two binary datasets

Dataset	Platform	#Non-vul	#Vul	#Binaries
NDSS18	Windows	8,999	8,978	17,977
NDSS18	Linux	6,955	7,349	14,304
NDSS18	Whole	15,954	16,327	32,281
6 open-source	Windows	26,621	328	26,949
6 open-source	Linux	25,660	290	25,950
6 open-source	Whole	52,281	618	52,899
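The two datasets differ sharply in class balance, which is the motivation for a cost-sensitive loss. The short calculation below uses only the "Whole" rows of the table above; the helper names are illustrative.

```python
# Counts copied from the "Whole" rows of the statistics table.
counts = {
    "NDSS18": {"non_vul": 15954, "vul": 16327},
    "6 open-source": {"non_vul": 52281, "vul": 618},
}


def vulnerable_fraction(non_vul, vul):
    """Share of all binaries that are labeled vulnerable."""
    return vul / (non_vul + vul)


def imbalance_ratio(non_vul, vul):
    """Number of non-vulnerable binaries per vulnerable binary."""
    return non_vul / vul


for name, c in counts.items():
    print("{}: {:.2%} vulnerable, {:.1f} non-vulnerable per vulnerable".format(
        name, vulnerable_fraction(**c), imbalance_ratio(**c)))
```

NDSS18 is close to balanced, while in 6 open-source only about 1.2% of the binaries are vulnerable (roughly 85 non-vulnerable binaries for each vulnerable one). On data like this, a classifier that ignores misclassification costs can reach high accuracy while missing most vulnerabilities.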

Data format

Each dataset folder contains two files: binaries-x-y.data, which holds functions compiled into binaries for two platforms (Windows/Linux) and two architectures (x86/x64), and labels-x-y.data, which holds the corresponding labels. Here x is '32' or '64' and y is 'windows' or 'linux'.
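The naming scheme can be enumerated with a few lines of Python. This is a sketch only: the helper name and the example folder path are hypothetical, and whether a given folder actually holds every x/y combination is not stated here.

```python
import itertools
import os

ARCHITECTURES = ("32", "64")      # x in the file name
PLATFORMS = ("windows", "linux")  # y in the file name


def dataset_files(dataset_dir):
    """Return (binaries_path, labels_path) pairs for every x/y combination.
    (Hypothetical helper; it only builds names and does not touch the disk.)"""
    pairs = []
    for arch, platform in itertools.product(ARCHITECTURES, PLATFORMS):
        suffix = "{}-{}.data".format(arch, platform)
        pairs.append((os.path.join(dataset_dir, "binaries-" + suffix),
                      os.path.join(dataset_dir, "labels-" + suffix)))
    return pairs


# "datasets/NDSS18" is an assumed path used only for illustration.
for binaries_path, labels_path in dataset_files(os.path.join("datasets", "NDSS18")):
    print(binaries_path, labels_path)
```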

An example of the content of binary files

Note:

We convert machine instructions from hexadecimal to decimal format. Each value in the instruction information is then a number from 0 to 255, which makes it easier to compute the frequency vector of the instruction information and obtain the instruction information embedding (see the section "Data Processing and Embedding" of the paper).

We use the special character '|' to separate the opcode from the instruction information. For example, '131|131,196,8' has the opcode '131' and the instruction information '131,196,8'. Additionally, functions are separated by '-----'.

...
-----
85|85
137|137,229
104|104,0,0,0,0
106|106,1
232|232,252,255,255,255
131|131,196,8
255|255,117,8
232|232,252,255,255,255
131|131,196,4
144|144
201|201
195|195
-----
85|85
137|137,229
131|131,236,16
...
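The format above can be read with a small parser. The sketch below is not taken from utils.py; the function names are hypothetical, and the frequency vector is only one plausible reading of "frequency vector of instruction information" (a count of each value from 0 to 255), so the paper's exact embedding input may differ.

```python
def parse_functions(text):
    """Split file content into functions. Each function is a list of
    (opcode, instruction_info) pairs, both given as lists of decimal ints."""
    functions, current = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "-----":  # function separator
            if current:
                functions.append(current)
                current = []
            continue
        opcode_part, info_part = line.split("|", 1)
        opcode = [int(v) for v in opcode_part.split(",")]
        info = [int(v) for v in info_part.split(",")]
        current.append((opcode, info))
    if current:
        functions.append(current)
    return functions


def byte_frequency(info, size=256):
    """Count how often each value in 0..255 occurs in the instruction info."""
    freq = [0] * size
    for value in info:
        freq[value] += 1
    return freq


# Sample content copied from the example above (elisions removed).
sample = """-----
85|85
137|137,229
104|104,0,0,0,0
106|106,1
232|232,252,255,255,255
131|131,196,8
255|255,117,8
232|232,252,255,255,255
131|131,196,4
144|144
201|201
195|195
-----
85|85
137|137,229
131|131,236,16
"""

functions = parse_functions(sample)
print(len(functions), "functions;", len(functions[0]), "instructions in the first")
opcode, info = functions[0][4]
print("opcode", opcode, "info", info, "count of 255:", byte_frequency(info)[255])
```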

An example of the content of label files

Note: labels '0' and '1' denote a non-vulnerable and a vulnerable function, respectively.

0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
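A label file in this form can be read by splitting on whitespace. The sketch below assumes that the i-th label belongs to the i-th function of the matching binaries file, which the example suggests but does not state explicitly; the helper names are hypothetical.

```python
def parse_labels(text):
    """Parse whitespace-separated 0/1 labels into a list of ints."""
    labels = [int(token) for token in text.split()]
    unexpected = sorted(set(labels) - {0, 1})
    if unexpected:
        raise ValueError("unexpected label values: {}".format(unexpected))
    return labels


def pair_with_labels(functions, labels):
    """Pair each function with its label, assuming both are in the same order."""
    if len(functions) != len(labels):
        raise ValueError("{} functions but {} labels".format(
            len(functions), len(labels)))
    return list(zip(functions, labels))


# A short prefix of the example above.
labels = parse_labels("0 0 0 0 0 0 1 0 0 0")
print("vulnerable:", sum(labels), "of", len(labels))

# Placeholder function names stand in for parsed functions.
pairs = pair_with_labels(["f{}".format(i) for i in range(10)], labels)
print([name for name, label in pairs if label == 1])
```

Checking that the number of labels matches the number of functions catches a mismatched binaries/labels pair early, before training starts.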

Model implementation

Environment preparation

Python >= 3.5
TensorFlow >= 1.12

Model training and evaluation

Command to run:

python main.py

Model parameters: see the initializer of the DCKM class (in main.py) for the hyperparameter settings. Run the default settings to obtain the best result of the experiment on the whole 6 open-source dataset, which outperforms the baselines on all performance measures of interest, including the cost-sensitive loss, F1 score, and AUC (see Table III of the paper).

Some parameters are crucial for obtaining good results after 100 epochs:

embedding_dimension: the dimension of the embedding. Set it to 100 for 6_projects and 64 for NDSS18.
hidden_size: the number of hidden units of the Bidirectional RNN. 128 units work well for 6_projects and 256 units for NDSS18.
num_random_features: the dimension of the random feature space to which machine instruction representations are mapped. It depends on the data size, so set it to 512 or 1024, or to 2048 for larger datasets.
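The recommendations above can be collected in a small lookup table. This helper is hypothetical and is not part of main.py; it only restates the values listed above, and how they are passed to the DCKM class is left to the code in main.py.

```python
# Values restated from the parameter notes above.
RECOMMENDED = {
    "6_projects": {"embedding_dimension": 100, "hidden_size": 128},
    "NDSS18": {"embedding_dimension": 64, "hidden_size": 256},
}
SUGGESTED_RANDOM_FEATURES = (512, 1024, 2048)


def recommended_settings(dataset, num_random_features):
    """Return the suggested hyperparameters for a dataset (hypothetical helper)."""
    if dataset not in RECOMMENDED:
        raise KeyError("unknown dataset: {}".format(dataset))
    if num_random_features not in SUGGESTED_RANDOM_FEATURES:
        raise ValueError("num_random_features should be one of {}".format(
            SUGGESTED_RANDOM_FEATURES))
    settings = dict(RECOMMENDED[dataset])  # copy so the table stays unchanged
    settings["num_random_features"] = num_random_features
    return settings


print(recommended_settings("NDSS18", 1024))
```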

Model test

To test a saved DCKM model, set the running_mode parameter to '0' and rerun main.py.
