SADE (Software Architecture with Document Embeddings) is a library for studying and recovering the architectures of complex software systems. Our approach combines document embeddings of the source code, provided by Doc2Vec, with the existing structure of the codebase, captured by the call graphs produced by CScout.
To our knowledge, document embeddings have not previously been used to study the architecture of a software system. SADE constructs a geometric graph over a pseudo-metric space and iteratively forms communities in this graph with the Louvain algorithm, creating clusters that represent software modules. The proposed evaluation metrics for software clusterings are stability, authoritativeness (closeness to the ground truth), and extremity (avoiding the creation of very small or very large clusters).
This project was developed for the ESEC/FSE 2019 Student Research Competition. You can read the paper here, as well as the slides.
The software is released under the MIT License.
Installing system- or user-wide (with sudo if system-wide):

```shell
make install
```

Installing in a virtual environment using `virtualenv`:

```shell
make install_venv
```
With SADE you can analyze your C project using the components it provides. The steps below show how. We will be using CScout for static call-graph analysis.
For defining the modules of the system, each file must map to a grain. You should generate a `modules.json` file with the following format:

```json
{
    "boo.c": "boograin",
    "foo.c": "foograin"
}
```
You can do this manually, but if the project is strictly organized into grains (e.g. top-level directories) you can use the `autogen_module` tool to generate the module definition:

```shell
autogen_module.py --suffix .c --suffix .h -d 1 >modules.json
```

where `-d` specifies the depth at which the modules are split. An example is located at `examples/linux/modules.json`.
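As an illustration only (this is a sketch, not the actual tool), the depth-1 mapping that `autogen_module.py` produces can be thought of as assigning each matching file to its top-level directory:

```python
import json
from pathlib import PurePosixPath

def guess_modules(paths, suffixes=(".c", ".h"), depth=1):
    """Map each matching file to its directory prefix of the given depth."""
    modules = {}
    for p in paths:
        path = PurePosixPath(p)
        if path.suffix in suffixes and len(path.parts) > depth:
            modules[p] = "/".join(path.parts[:depth])
    return modules

# Two hypothetical kernel-style paths for illustration.
print(json.dumps(guess_modules(["mm/page_alloc.c", "net/core/dev.c"])))
```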
You can also set the `--suffix` arguments manually for other languages. For example, for a C++ project:

```shell
autogen_module.py --suffix .cpp --suffix .h -d 1 >modules.json
```
After creating the `modules.json` definitions file you can proceed to generate the Doc2Vec embeddings using Gensim, with the source code preprocessed by spaCy through the following pipeline:
- Stop-word Removal
- Tokenization
- Lemmatization
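A minimal stdlib-only stand-in for this pipeline (the real implementation uses spaCy; the stop-word list here is a tiny illustrative sample, and lemmatization is omitted because it needs a real NLP library):

```python
import re

# Tiny illustrative stop-word list; spaCy ships a much larger one.
STOP_WORDS = {"the", "a", "an", "if", "for", "while", "return", "int", "void"}

def preprocess(text):
    tokens = re.findall(r"[a-z_]+", text.lower())        # tokenization
    return [t for t in tokens if t not in STOP_WORDS]    # stop-word removal
```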
You can generate the embeddings with the `embeddings.py` script using:

```shell
embeddings.py -m modules.json -o embeddings.bin -p params.json
```
You can configure the model further by passing parameters with the `-p` flag as a `params.json` file.

An example `params.json` file:

```json
{
    "size": 200,
    "epochs": 1000,
    "window": 10,
    "min_count": 10,
    "workers": 7,
    "sample": 1E-3
}
```
For the purposes of our research we have trained document embeddings on the Linux Kernel codebase v4.21. From here you can download the embeddings produced with gensim:
- Document Embeddings (One-top directory Level without Identifier Splitting)
- Document Embeddings (One-top directory Level with Identifier Splitting)
- Document Embeddings (Source Code File Level)
Generate the `make.cs` file via:

```shell
csmake
```

If you have a multi-core machine you can use the classic `-j` flag:

```shell
csmake -j7
```
After generating the `make.cs` file you can analyze it with CScout via:

```shell
cscout make.cs
```
CScout may complain about undefined names. What you can do is place their respective definitions in `cscout-pre-defs.h` (processed before `csmake`) and in `cscout-post-defs.h`. For more information, please refer to the CScout documentation.
An example of such a configuration for the Linux Kernel 4.x codebase is located at `examples/linux`.
Finally, you can send `GET` requests to CScout and get responses through its REST API.
For example:

```shell
# Call graph (functions)
curl -X GET "http://localhost:8081/cgraph.txt" >graph.txt
```
You can get all the call graphs by running `scripts/get_graphs_rest.sh`.
A pre-generated call graph of Linux Kernel 4.21 (20.3 million lines of source code) can be found here. The call graphs come in the format:

```
u1 v1
u2 v2
// more edges
un vn
```

where each line `ui vi` is a directed edge from `ui` to `vi`.
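Since NetworkX is already a dependency, loading this edge list into a directed graph is straightforward (toy edges shown; note the `//` comment marker must be passed explicitly, since NetworkX defaults to `#`):

```python
import networkx as nx

# A few made-up caller/callee pairs in the "u v" format described above.
edge_lines = [
    "init_task schedule",
    "schedule context_switch",
    "// more edges",
]
g = nx.parse_edgelist(edge_lines, comments="//", create_using=nx.DiGraph)
print(g.number_of_edges())
```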
The call graph was generated on an Intel(R) Xeon(R) CPU E5-1410 0 @ 2.80GHz with 72G of RAM.
After generating the embeddings you can use the `layerize.py` tool to get the proposed layered architecture:

```shell
layerize.py -e embeddings.bin -g graph.txt >layers.bunch
```

to export it to a `.bunch` file. The format of a bunch file is:

```
Layer0= File1, File2, File3
```

or export to JSON with:

```shell
layerize.py -e embeddings.bin -g graph.txt --export json >layers.json
```
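For illustration, converting a clustering to the `.bunch` format is a one-liner — assuming, as a guess at the JSON layout, that it maps layer names to lists of files:

```python
def layers_to_bunch(layers):
    """Render {'Layer0': ['File1', ...], ...} as .bunch lines."""
    return "\n".join(f"{name}= " + ", ".join(files)
                     for name, files in layers.items())

print(layers_to_bunch({"Layer0": ["File1", "File2", "File3"]}))
```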
Once you have generated the layered architecture, and if an existing one serves as ground truth (such as the Linux layers located at `examples/linux/ground_truth.json`), you can compare the architectures with the MoJoFM metric provided in the `mojo` package via:

```python
import mojo
mojo.mojo('proposed_layers.bunch', 'ground_truth.bunch', '-fm')
```
SADE was developed in Python 3.x using the following libraries:
- Gensim
- spaCy
- sklearn
- NetworkX
You can use SADE with a different static call-graph analysis tool for your preferred language. The format SADE understands is of the form:

```
foo.c boo.c
```

which indicates a directed edge from `foo.c` to `boo.c`.
The module definitions are, as explained above, contained in JSON files.
The clustering results are, as explained above, contained in JSON or Bunch files.
You can cite the project using the following bibliographic entries:
```bibtex
@inproceedings{sade,
  title={Software Clusterings with Vector Semantics and the Call Graph},
  author={Papachristou, Marios},
  year={2019},
  booktitle={ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE)},
  organization={Association for Computing Machinery}
}

@misc{call_graph,
  title={Linux Kernel 4.21 Call Graph},
  DOI={10.5281/zenodo.2652487},
  publisher={Zenodo},
  author={Papachristou, Marios},
  year={2019}
}

@misc{sade_source_code,
  title={Software Architecture with Document Embeddings and the Call Graph Source Code},
  DOI={10.5281/zenodo.2673033},
  publisher={Zenodo},
  author={Papachristou, Marios},
  year={2019}
}
```