This repo contains the source code for the paper "The Benefit of GNNs for Network Traffic Analysis" submitted to the 2nd International Workshop on Graph Neural Networking (GNNet@CoNEXT'23).
Feel free to explore the notebooks, experiment with the library, and adapt the code to your own research or applications. For any questions, issues, or suggestions, please open an issue on this repository.
- How to reproduce the results in the paper
- Repository structure
- Notebook folder
- Data structure
- API reference

## How to Reproduce the Results in the Paper

Note: This guide assumes a Debian-like system (tested on Ubuntu 20.04 & Debian 11).
- Clone this repository.
- Download the gzip data file from https://mplanestore.polito.it:5001/sharing/1LhiKV1ko. A password is required: request it via mail to luca.gioacchini@polito.it. Please refer to the data section for an overview of the data structure.
- Unzip the downloaded archive into a subfolder of this repository called `data`:

  ```shell
  mkdir -p data && tar -xzf gnn-for-darknet-data.tar.gz -C data --strip-components=1
  ```

- Install the `virtualenv` library (Python 3 is assumed):

  ```shell
  pip3 install --user virtualenv
  ```

- Create a new virtual environment and activate it:

  ```shell
  virtualenv darknet-gnn-env
  source darknet-gnn-env/bin/activate
  ```

- Install the required libraries:

  ```shell
  pip3 install -r requirements.txt
  ```

- Run the notebooks described next. For example, to run the first notebook:

  ```shell
  jupyter-lab 00-dataset-characterization.ipynb
  ```

- When you are done exploring the notebooks, remember to deactivate the virtual environment:

  ```shell
  deactivate
  ```
## Repository Structure

The repository is organized as follows:

- The `notebook` folder contains Jupyter notebooks that replicate the experiments presented in the paper.
- The `src` folder contains the source code and libraries providing the tools needed to implement and reproduce the experiments of the paper. This library encapsulates the functions, methods, classes, and models used in the notebooks, so users can streamline their workflow and easily experiment with different components.
- The `docs` folder contains the code documentation.
- The `requirements.txt` file lists the required Python packages and their versions.
## Notebook Folder

The `notebook` folder contains Jupyter notebooks that demonstrate how to reproduce the experiments described in the paper. Each notebook corresponds to a specific experiment and provides step-by-step instructions and explanations. The notebooks are designed to be self-contained and easy to follow.

- The notebook `00_dataset_characterization` contains the main code to characterize both the filtered dataset (in total and on a daily basis) and the resulting temporal graph.
- The notebook `01_dataset_generation` contains the main code to (i) process the raw traces, filtering out unwanted data; (ii) generate bipartite graphs from the filtered traces and extract node features; (iii) generate the textual corpora that will be processed by the NLP algorithms.
- The notebook `02_embeddings_generation` contains the main code to (i) produce NLP embeddings through i-DarkVec; (ii) produce (t)GNN embeddings without node features; (iii) produce (t)GNN embeddings with node features; (iv) produce embeddings to evaluate the impact of the parameters (history and number of training epochs).
- The notebook `03_classification` contains the main code to run the final k-Nearest-Neighbors classification pipeline. The main experiments are: (i) the main table with classification performance; (ii) the impact of the history parameter (the temporal aspect of tGNN); (iii) the impact of the number of training epochs for incremental training.
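The k-Nearest-Neighbors step above can be sketched as follows. This is a minimal illustration, not the repository's actual pipeline: it assumes the embeddings are available as a numpy matrix aligned with an array of ground-truth labels, and uses a plain majority-vote k-NN with Euclidean distance.

```python
import numpy as np

def knn_classify(train_X, train_y, test_X, k=5):
    """Majority-vote k-NN with Euclidean distance (pure numpy sketch)."""
    preds = []
    for x in test_X:
        # Distance from the query embedding to every training embedding
        dists = np.linalg.norm(train_X - x, axis=1)
        # Labels of the k closest training points
        neighbors = train_y[np.argsort(dists)[:k]]
        # Majority vote among the neighbors
        labels, counts = np.unique(neighbors, return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)
```

In the paper's setting, `train_X` would hold the (t)GNN or i-DarkVec embeddings of labeled src_ip addresses and `train_y` their ground-truth classes.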
## Data Structure

The downloaded `data` folder is organized as follows:

- The `corpus` folder contains the NLP corpora. Each pickle file contains a list of numpy arrays (sequences of strings) for a snapshot and is named `corpus_DATE.pkl`, where `DATE` refers to the considered snapshot.
- The `features` folder contains the node features. Each csv file has V rows, where V is the number of vertices of the graph, and F columns, where F is the number of features per node. Each file is named `features_DATE.csv`, where `DATE` refers to the considered snapshot.
- The `graph` folder contains the graph obtained for each snapshot. Each txt file contains 4 columns (source node, destination node, edge weight, label). Each file is named `DATE.txt`, where `DATE` refers to the considered snapshot.
- The `ground_truth` folder contains the full ground truth. The file `ground_truth.csv` has two columns (src_ip, label): the first contains the source IP addresses, the second the ground truth labels.
- The `gnn_embeddings` folder contains the csv files of the embeddings generated through GNNs. Each file is indexed by the src_ip addresses active in the considered snapshot and has E columns, where E is the embedding size. The possible values of `MODEL` are `gcn`, `gcngru`, `igcn` and `igcngru`.
  - The files containing embeddings generated without node features are named `embeddings_MODEL_DATE.csv`, where `DATE` refers to the considered snapshot and `MODEL` to the used GNN.
  - The files containing embeddings generated with node features are named `embeddings_MODEL_features_DATE.csv`, where `DATE` refers to the considered snapshot and `MODEL` to the used GNN.
  - The files containing embeddings generated for the history evaluation are named `embeddings_MODEL_features_Hhist_DATE.csv`, where `DATE` refers to the considered snapshot, `MODEL` to the used GNN and `hist` is the value of the history parameter.
  - The files containing embeddings generated for the training evaluation are named `embeddings_MODEL_features_eeEPOCHS_DATE.csv`, where `DATE` refers to the considered snapshot, `MODEL` to the used GNN and `EPOCHS` is the number of training epochs.
- The `nlp_embeddings` folder contains the csv files of the embeddings generated through i-DarkVec. Each file is named `embeddings_idarkvec_DATE.csv`, where `DATE` refers to the considered snapshot. Each file is indexed by the src_ip addresses active in the considered snapshot and has E columns, where E is the embedding size.
- The `raw` folder contains the raw data pre-processed with the code reported in `01_dataset_generation`. Each csv file contains the following columns (ts, ethtype, src_ip, src_port, dst_ip, dst_port, pck_len, tcp_flags, mirai, tcp_seq, ttl, t_mss, t_win, t_ts, t_sack, t_sackp, interval). Each file is named `raw_DATE.csv`, where `DATE` refers to the considered snapshot.
- The `traces` folder contains the pure raw traces. Each file is named `trace_YYMMDD_HH_MM_SS_MS.log.gz` and each row refers to a packet received by the darknet.
- The `results` folder contains the final results to be plotted.
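As an illustration of the layout above, one snapshot can be loaded with a few lines of pandas. This is a sketch under assumptions, not code from the repository: the `data_dir` argument and the whitespace-separated graph format are guesses based on the descriptions above, so adjust paths and separators to your local copy.

```python
import pandas as pd

def load_snapshot(data_dir, date):
    """Load one snapshot: graph edges, node features, and the ground truth."""
    # 4 columns per edge, assumed whitespace-separated: src, dst, weight, label
    edges = pd.read_csv(
        f"{data_dir}/graph/{date}.txt", sep=r"\s+", header=None,
        names=["src", "dst", "weight", "label"],
    )
    # V rows (one per vertex) and F feature columns, indexed by node
    features = pd.read_csv(f"{data_dir}/features/features_{date}.csv", index_col=0)
    # Two columns: src_ip, label
    ground_truth = pd.read_csv(f"{data_dir}/ground_truth/ground_truth.csv")
    return edges, features, ground_truth
```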
## API Reference

Please refer to the API reference for the complete code documentation.