This repository provides the code and dataset for our ASE 2025 paper:
Chaopeng Dong, Jingdong Guo, Shouguo Yang, Yi Li, Dongliang Fang, Yang Xiao, Yongle Chen, Limin Sun. Advancing Binary Code Similarity Detection via Context-Content Fusion and LLM Verification. In IEEE/ACM International Conference on Automated Software Engineering (ASE 2025).
The project is organized as follows:
- `DBs/`: Dataset lists, stripped binaries, and other essential data files. You can download the dataset from figshare.
- `core/`: Core implementation, including context construction, feature extraction, and LLM integration logic.
- `IDA_scripts/`: IDA Pro scripts used to extract binary-level features and function representations from binaries.
- `utils/`: Supporting utility functions (data preprocessing, evaluation, etc.).
- `saved/`: Experimental results and intermediate files.
Before running Co2FuLL, please ensure your environment meets the following requirements:
- **Operating System**: The experiments were conducted on Ubuntu 22.04 LTS. Other systems may work, but unexpected errors might occur due to environment differences.

- **Python Environment**: Python 3.9 is used. We recommend managing the environment via conda. A pre-configured environment file (`environment.yml`) is provided for convenience:

  ```
  conda env create -f environment.yml
  conda activate co2full-public
  ```

  Alternatively, you may create a clean environment manually and install the required dependencies.

  In addition, install the necessary packages into your IDA Python environment:

  ```
  pip install cptools networkx loguru --target="/path/to/IDA Python/DIR/"
  ```

- **IDA Pro**: We use IDA Pro v7.5 for binary feature extraction. Since IDA Pro is commercial software, please install it manually and configure the following paths in `settings.py`:

  ```python
  # Replace the IDA paths with your own
  IDA_PATH = Path(getenv("IDA_PATH", "/data/Application/idapro-7.5/idat64"))
  IDA32_PATH = Path(getenv("IDA32_PATH", "/data/Application/idapro-7.5/idat"))
  ```
To reproduce the experimental results presented in our paper, follow these three major steps:
- Feature Extraction
- Candidate Retrieval
- LLM Verification
Feature extraction involves generating various binary representations and metadata for downstream tasks.
- Generate `.idb` files (IDA Pro analysis results):

  ```
  python IDA_scripts/cli_idbs.py
  ```

- Generate dependency graphs (DGs):

  ```
  python IDA_scripts/cli_DG.py
  ```

- Extract code snippets (for top-5 candidate functions):

  ```
  python IDA_scripts/cli_code.py -input DBs/Binkit-1.0-dataset/top5_for_llm-idb_path2func_eas.json
  ```

- Generate model embeddings. Follow the corresponding repositories to generate embeddings for your test functions.

  The embedding file is expected to store a dictionary mapping each function to its embedding vector. Each key should follow the format `binary@function_address`, and the file should be named like `Binkit-1.0-normal-strip_testing_hermessim_embedding_250123.pkl` for data loading. More details can be found in `load_model_embeddings` in `context_exp.py`. An example is provided below:

  ```
  {
    "lightning-2.1.2_gcc-4.9.4_arm_64_O3_liblightning.so.1.0.0.elf.strip@0x60a8": [1.12573838e+00, -8.25750828e-03, ..., 3.89225304e-01]
  }
  ```
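To make the expected file layout concrete, here is a minimal sketch that writes and reads back an embedding pickle in this format. The binary name and vector values are toy examples; the authoritative loading logic is `load_model_embeddings` in `context_exp.py`.

```python
import pickle

# Hypothetical toy embedding file in the expected format: a dict mapping
# "binary@function_address" keys to embedding vectors (shortened here).
embeddings = {
    "lightning-2.1.2_gcc-4.9.4_arm_64_O3_liblightning.so.1.0.0.elf.strip@0x60a8":
        [1.12573838, -0.00825751, 0.38922530],
}

path = "Binkit-1.0-normal-strip_testing_hermessim_embedding_250123.pkl"
with open(path, "wb") as f:
    pickle.dump(embeddings, f)

# Load the file back and split each key into its binary and function address.
with open(path, "rb") as f:
    loaded = pickle.load(f)

for key, vector in loaded.items():
    binary, func_addr = key.rsplit("@", 1)
    print(binary, func_addr, len(vector))
```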
In this stage, Co2FuLL fuses contextual and content-based similarities to retrieve semantically equivalent functions from a large function pool.
We explore:
- 4 models
- 3 sub-tasks: `xc`, `xa`, `xm`
- 5 configurations: `base`, `base+context`, `base+context(import)`, `base+context(string)`, `base+context(direct)`
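Conceptually, the retrieval stage scores each candidate by combining a content-based similarity (function embeddings) with a context-based similarity. The sketch below illustrates this with cosine similarity and a simple mixing weight `alpha`; both the weighting scheme and the names are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def fused_score(q_content, c_content, q_context, c_context, alpha=0.5):
    # Blend content similarity with context similarity; alpha is an
    # illustrative mixing weight, not the paper's actual fusion rule.
    return (alpha * cosine_sim(q_content, c_content)
            + (1.0 - alpha) * cosine_sim(q_context, c_context))

# Toy query and two candidates: cand_a matches the query in both views.
q_content, q_context = np.array([1.0, 0.0]), np.array([0.0, 1.0])
pool = {
    "cand_a": (np.array([1.0, 0.0]), np.array([0.0, 1.0])),
    "cand_b": (np.array([0.0, 1.0]), np.array([1.0, 0.0])),
}
ranked = sorted(pool,
                key=lambda n: fused_score(q_content, pool[n][0],
                                          q_context, pool[n][1]),
                reverse=True)
print(ranked)  # cand_a ranks first
```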
Example usage:
```
python context_exp.py -h
```

Help output:

```
usage: context_exp.py [-h] [-data_path DATA_PATH] [-exp EXP]

Function retrieval enhancement experiments

options:
  -h, --help            show this help message and exit
  -data_path DATA_PATH  Testing dataset path
  -exp EXP              Experiment name (xc, xa, xm)
```
After retrieving top-K candidates, Co2FuLL leverages Large Language Models (LLMs) to verify and confirm the true match. This verification step improves precision and interpretability.
In our experiments, we evaluate:
- 7 LLMs
  - Qwen2.5-7B (14B, 72B)
  - Qwen2.5-Coder-14B
  - DeepSeek-V3, DeepSeek-R1
  - GPT-4o
- 6 prompt designs
  - Zero-Shot, Few-Shot, CoT-Lite, CoT-Pro, CoT-Self, Critique
- 5 LLM settings
  - top_p/temperature: 1.0/0.5, 0.5/0.5, 0.5/1.0, 1.0/1.0, 1.0/0.0
Example usage:
```
python LLM_exp.py -h
```

Help output:

```
usage: LLM_exp.py [-h] [-data_path DATA_PATH] [-api_key API_KEY] [-n_jobs N_JOBS]
                  [-save_dir SAVE_DIR] [-url URL] [-model MODEL] [-exp EXP]

BCSD LLM experiments

options:
  -h, --help            show this help message and exit
  -data_path DATA_PATH  Path to top-K results
  -api_key API_KEY      API key for LLM service
  -n_jobs N_JOBS        Number of parallel threads
  -save_dir SAVE_DIR    Directory to save results
  -url URL              API request endpoint
  -model MODEL          LLM model name
  -exp EXP              Experiment name
```
Note: Different API vendors may use different names for the same LLM model (for example, deepseek-v3 may appear as deepseek-chat). Please make sure to adjust the LLM name according to the API naming convention of the service you are using.
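One way to handle this is a small alias table that maps a canonical model name to the name each vendor expects. The table and helper below are a hypothetical sketch (only the DeepSeek-V3/`deepseek-chat` pair comes from the note above); extend it for the vendors you actually use and pass the resolved name via `-model`.

```python
# Hypothetical alias table: the same model can appear under different names
# depending on the API vendor (e.g. DeepSeek-V3 served as "deepseek-chat").
MODEL_ALIASES = {
    ("deepseek-v3", "deepseek"): "deepseek-chat",
}

def resolve_model_name(model: str, vendor: str) -> str:
    """Return the vendor-specific model name, falling back to the input."""
    return MODEL_ALIASES.get((model.lower(), vendor.lower()), model)
```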
If you find this work useful, please cite our paper:
```bibtex
@inproceedings{dong2025co2full,
  title={Advancing Binary Code Similarity Detection via Context-Content Fusion and LLM Verification},
  author={Dong, Chaopeng and Guo, Jingdong and Yang, Shouguo and Li, Yi and Fang, Dongliang and Xiao, Yang and Chen, Yongle and Sun, Limin},
  booktitle={Proceedings of the IEEE/ACM International Conference on Automated Software Engineering (ASE)},
  year={2025}
}
```

If you encounter any issues or have questions about the code or dataset, please feel free to contact:
- Chaopeng Dong: dongchaopeng@iie.ac.cn