ATLAS (Automated Tree-based Language Analysis System) aims to generate combined multi-code view graphs that can be used with various types of machine learning models (sequence models, graph neural networks, etc).
Tool Demonstration link: https://youtu.be/QGuJZhj9CTA
ATLAS is a CLI tool for generating customized source code representations (codeviews) from C and C++ programs, supporting both method-level and file-level code snippets. ATLAS can be used to generate over 15 possible combinations of codeviews for both languages, including:
- AST (Abstract Syntax Tree)
- CFG (Control Flow Graph)
- DFG (Data Flow Graph)
- SDFG (Statement-level Data Flow Graph with Reaching Definitions)
- Combined graphs (any combination of the above)
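To make "any combination of the above" concrete, the following minimal sketch enumerates every non-empty subset of the four views as a comma-separated string of the kind passed to --graphs. The names ast, cfg, and dfg appear in this README's examples; sdfg is assumed by analogy, and the README does not state whether the CLI accepts every subset.

```python
from itertools import combinations

# "ast", "cfg", and "dfg" appear in this README's --graphs examples;
# "sdfg" is an assumed spelling, by analogy with the others.
CODEVIEWS = ["ast", "cfg", "dfg", "sdfg"]


def graph_combinations(views=CODEVIEWS):
    """Return every non-empty subset of views as a comma-separated string."""
    specs = []
    for size in range(1, len(views) + 1):
        for subset in combinations(views, size):
            specs.append(",".join(subset))
    return specs


specs = graph_combinations()
for spec in specs:
    print(spec)
```

Counting only these four views gives 15 non-empty subsets.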
ATLAS is designed to be easily extendable to various programming languages. This is primarily because we use tree-sitter, a highly efficient incremental parser that supports over 40 languages.
There are two ways to set up ATLAS: using Docker (recommended for quick usage) or using a Python virtual environment (recommended for development).
Docker provides an isolated environment with all dependencies pre-installed.
1. Build the Docker image:
docker build -t atlas .

That's it! You're ready to generate graphs using Docker.
1. Create a new virtual environment:
python -m venv .venv

2. Activate the environment:
source .venv/bin/activate # On Linux/Mac
# or
.venv\Scripts\activate # On Windows

3. Install the package in development mode:
pip install -e .

4. Install GraphViz (optional, for visualization):
GraphViz is only required if you want to generate DOT or PNG output files.
Ubuntu/Debian:
sudo apt install graphviz

macOS:
brew install graphviz

Windows: Download from graphviz.org
There are two ways to generate graphs: using Docker or using the CLI directly (after virtual environment setup).
Docker commands mount your current directory to /work inside the container, so output files appear in your working directory.
Single File Analysis:
docker run --rm -v "$(pwd):/work" -w /work atlas \
--lang cpp \
--code-file ./examples/single/test_single.cpp \
--graphs "ast,cfg,dfg" \
--output all

Folder Analysis (Multi-file Projects):
docker run --rm -v "$(pwd):/work" -w /work atlas \
--lang c \
--code-folder ./examples/multi \
--combined-name "multi_file_example" \
--graphs "cfg,dfg" \
--output all

With Additional Options:
# Generate only JSON output
docker run --rm -v "$(pwd):/work" -w /work atlas \
--lang c \
--code-file ./examples/single/test_single.c \
--graphs cfg \
--output json
# With collapsed nodes and last-def tracking
docker run --rm -v "$(pwd):/work" -w /work atlas \
--lang c \
--code-file ./examples/single/test_single.c \
--graphs "dfg" \
--collapsed \
--last-def

After setting up via virtual environment, use the atlas command directly.
Output Location: All generated files (JSON, DOT, PNG) are saved to the output/ directory in your current working directory. The directory is created automatically if it doesn't exist.
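For batch runs it can be convenient to drive the CLI from a script. The sketch below builds an argument list using only flags that appear in this README and then lists what ended up in output/. The helper names atlas_argv and run_atlas are hypothetical and are not part of ATLAS; the names of the generated files are not documented here, so the sketch only lists them.

```python
import subprocess
from pathlib import Path


def atlas_argv(lang, graphs, code_file=None, code_folder=None, output="all"):
    """Build an argument list for the atlas CLI (hypothetical helper)."""
    if (code_file is None) == (code_folder is None):
        raise ValueError("pass exactly one of code_file or code_folder")
    argv = ["atlas", "--lang", lang, "--graphs", ",".join(graphs), "--output", output]
    if code_file is not None:
        argv += ["--code-file", str(code_file)]
    else:
        argv += ["--code-folder", str(code_folder)]
    return argv


def run_atlas(argv, workdir="."):
    """Run atlas inside workdir and return the file names found in output/."""
    subprocess.run(argv, cwd=workdir, check=True)
    return sorted(p.name for p in (Path(workdir) / "output").iterdir())


# Building the command does not require atlas to be installed.
single = atlas_argv("c", ["cfg", "dfg"], code_file="test.c", output="json")
print(" ".join(single))
```

Calling run_atlas(single) requires the atlas command to be on PATH, as described in the virtual environment setup.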
The attributes and options supported by the CLI are well documented and can be viewed by running:
atlas --help

Single File Analysis:
Generate a combined CFG and DFG graph for a C++ file:
atlas --lang "cpp" --code-file ./test.cpp --graphs "cfg,dfg"

Generate an AST for a C file with output in JSON format:
atlas --lang "c" --code-file ./example.c --graphs "ast" --output "json"

Folder Analysis (Multi-file Projects):
ATLAS can analyze entire projects by combining multiple source files from a folder:
atlas --lang "c" --code-folder ./project/src --graphs "cfg,dfg" --output "json"

This will:
- Recursively scan the folder for all .c and .h files
- Combine them into a single temporary file (preserving includes, declarations, definitions)
- Generate the requested codeviews from the combined source
- Output results to the output/ directory
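The scan-and-combine steps can be pictured with a small sketch. This is not the code in ATLAS; it is a simplified illustration under stated assumptions: files are found with pathlib, headers are placed before other files, and each file is prefixed with a comment naming its origin. The function names and the demo file names are invented for this example.

```python
import tempfile
from pathlib import Path


def collect_sources(folder, suffixes=(".c", ".h")):
    """Recursively list files under folder whose suffix is in suffixes."""
    root = Path(folder)
    return sorted(p for p in root.rglob("*") if p.is_file() and p.suffix in suffixes)


def combine_sources(folder, suffixes=(".c", ".h")):
    """Concatenate the collected files into one string, headers first."""
    root = Path(folder)
    files = collect_sources(root, suffixes)
    # Assumption of this sketch: headers go first so that declarations
    # precede the definitions that rely on them.
    files.sort(key=lambda p: (p.suffix != ".h", str(p)))
    parts = []
    for path in files:
        rel = path.relative_to(root).as_posix()
        parts.append(f"// ---- {rel} ----\n{path.read_text()}")
    return "\n".join(parts)


# Demo on a throwaway folder; the file names are made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp)
    (demo / "sub").mkdir()
    (demo / "main.c").write_text("int main(void) { return f(); }\n")
    (demo / "sub" / "f.h").write_text("int f(void);\n")
    (demo / "notes.txt").write_text("not source code\n")
    found = [p.name for p in collect_sources(demo)]
    merged = combine_sources(demo)

print(merged)
```

A real merger also has to deal with duplicate includes and include guards, which this sketch leaves out.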
You can customize the combined output file name:
atlas --lang "cpp" --code-folder ./mylib --combined-name "myproject" --graphs "ast,cfg"

Inline Code Analysis:
You can also analyze code snippets directly without a file:
atlas --lang "c" --code "int main() { int x = 5; return x; }" --graphs "ast,cfg"

Additional CLI Options:
| Option | Description |
|---|---|
| --output | Output format: json, dot, or all (dot also generates PNG). Default: all |
| --collapsed | Collapse duplicate variable nodes into a single node in DFG |
| --last-def | Add last definition information to DFG edges (shows where variables were last defined) |
| --blacklisted | Comma-separated list of AST node types to exclude from the graph |
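The idea behind last-definition tracking can be illustrated on straight-line code. The sketch below is not ATLAS's implementation or its edge format: it takes a hand-written list of statements with the variables each one defines and uses, and links every use to the statement that most recently defined that variable. Branches and loops need a full reaching-definitions analysis, which this sketch deliberately omits.

```python
def last_def_edges(statements):
    """Link each variable use to its most recent definition.

    statements: list of (stmt_id, defined_vars, used_vars) in execution order.
    Returns a list of (def_stmt_id, use_stmt_id, variable) tuples.
    """
    last_def = {}
    edges = []
    for stmt_id, defs, uses in statements:
        # Resolve uses before recording this statement's own definitions,
        # so that `x = x + 1` reads the previous definition of x.
        for var in uses:
            if var in last_def:
                edges.append((last_def[var], stmt_id, var))
        for var in defs:
            last_def[var] = stmt_id
    return edges


# Hand-annotated version of:
#   1: int x = 5;
#   2: int y = x + 1;
#   3: x = y * 2;
#   4: return x;
program = [
    (1, ["x"], []),
    (2, ["y"], ["x"]),
    (3, ["x"], ["y"]),
    (4, [], ["x"]),
]
edges = last_def_edges(program)
for def_stmt, use_stmt, var in edges:
    print(f"{var}: defined at {def_stmt}, used at {use_stmt}")
```

Here the use of x in statement 4 is linked to statement 3 rather than statement 1, because statement 3 redefines x.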
Flag-Codeview Compatibility:
| Flag | AST | CFG | DFG |
|---|---|---|---|
| --collapsed | ✓ | ✗ | ✗ |
| --blacklisted | ✓ | ✗ | ✗ |
| --last-def | ✗ | ✗ | ✓ |
| --last-use | ✗ | ✗ | ✓ |
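A script that assembles ATLAS commands could check flags against this table before running them. The sketch below mirrors the table exactly as printed; the function name ineffective_flags is hypothetical, and the README does not say how the CLI itself reacts when a flag is combined with a codeview it does not apply to.

```python
# Compatibility table as printed above: flag -> codeviews it applies to.
FLAG_CODEVIEWS = {
    "--collapsed": {"ast"},
    "--blacklisted": {"ast"},
    "--last-def": {"dfg"},
    "--last-use": {"dfg"},
}


def ineffective_flags(graphs, flags):
    """Return the listed flags that apply to none of the requested codeviews.

    graphs: comma-separated string as passed to --graphs, e.g. "cfg,dfg".
    flags:  iterable of flag names; flags missing from the table are ignored.
    """
    requested = {g.strip().lower() for g in graphs.split(",") if g.strip()}
    return [
        flag
        for flag in flags
        if flag in FLAG_CODEVIEWS and not (FLAG_CODEVIEWS[flag] & requested)
    ]


print(ineffective_flags("cfg,dfg", ["--collapsed", "--last-def"]))
```

With graphs set to "cfg,dfg", only --collapsed is reported, since --last-def applies to the requested DFG.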
Examples:
# Generate all output formats (DOT, JSON, PNG)
atlas --lang "c" --code-file test.c --graphs "cfg" --output "all"
# Collapse duplicate variable nodes in DFG
atlas --lang "cpp" --code-file test.cpp --graphs "ast" --collapsed
# Add last definition tracking to DFG
atlas --lang "c" --code-file test.c --graphs "dfg" --last-def
# Exclude specific AST node types
atlas --lang "c" --code-file test.c --graphs "ast,cfg" --blacklisted "comment,string_literal"

While ATLAS provides method-level and file-level support for both C and C++, it's important to note the following limitations and known issues:
- Syntax Errors in Code: To ensure accurate codeviews, the input code must be free of syntax errors. Code with syntax errors may not be correctly parsed and displayed in the generated codeviews. Note that the code does not need to be compilable, only syntactically valid.
In addition to the general limitations, the tool has the following limitations specific to C++:
- Limited Template Metaprogramming Support: Complex template metaprogramming patterns may not be fully captured in the generated codeviews.
- Partial Preprocessor Directive Support: Preprocessor directives (e.g., #define, #ifdef) are parsed but not fully processed. Conditional compilation may not be accurately reflected in the codeviews.
- Limited Support for Advanced C++ Features: Some advanced C++ features such as:
  - Complex inheritance hierarchies
  - Multiple inheritance with virtual functions
  - Template specializations
  - SFINAE patterns
  - Concepts (C++20)

  may not be fully represented in the generated codeviews.
CLI Command:
atlas --lang "cpp" --code-file paper_assets/function_pointers.cpp --graphs "cfg,dfg"

C++ Code Snippet (function_pointers.cpp):
#include <iostream>
void f1(int times) {
if(!times)
return;
std::cout << "In f1()\n";
f1(times-1);
}
void f2() {
std::cout << "In f2()\n";
}
int main() {
void (*fptr_1)(int);
void (*fptr_2)(void);
fptr_1 = &f1;
fptr_2 = &f2;
int var = 0;
std::cin >> var;
(var > 0) ? fptr_1(3) : fptr_2();
}

Generated Codeview:
CLI Command:
atlas --lang "cpp" --code-file paper_assets/pass_by_reference.cpp --graphs "cfg,dfg"

C++ Code Snippet (pass_by_reference.cpp):
#include <iostream>
class TestClass {
public:
int x;
TestClass(int _x) {
x = _x + 20;
}
void f1(int& a) {
a += 100;
a -= x;
}
};
int main() {
TestClass obj(30);
int k = 0;
obj.f1(k);
std::cout << k; // prints 50
return 0;
}

Generated Codeview:
The code is structured in the following way:
- Preprocessing (src/atlas/utils/): The multi_file_merger.py module combines multiple source files from a folder into a single file for analysis.
- Parsing (src/atlas/tree_parser/): For each code-view, first the source code is parsed using the tree-sitter parser. The Parser and ParserDriver are implemented with various functionalities commonly required by all code-views. Language-specific features are further developed in the language-specific parsers (c_parser.py, cpp_parser.py).
- Codeview Generation (src/atlas/codeviews/): This directory contains the core logic for the various codeviews:
  - AST/ - Abstract Syntax Tree (language-agnostic)
  - CFG/ - Control Flow Graph (language-specific: CFG_c.py, CFG_cpp.py)
  - DFG/ - Data Flow Graph (language-agnostic)
  - SDFG/ - Statement-level Data Flow Graph (language-specific: SDFG_c.py, SDFG_cpp.py)
  - combined_graph/ - Combines multiple codeviews into a single graph
- CLI Entry Point (src/atlas/cli.py): The CLI implementation using Typer. The drivers can also be directly imported and used as a Python package.
- Node Definitions (src/atlas/utils/): c_nodes.py and cpp_nodes.py define AST node type categorizations used throughout the codebase.
This tool builds upon the tree-sitter parsing framework and is inspired by research on source code representation learning for AI-driven software engineering tasks.

