Skip to content

GentleCP/Co2FuLL-public

Repository files navigation

Co2FuLL

This repository provides the code and dataset for our ASE 2025 paper:

Chaopeng Dong, Jingdong Guo, Shouguo Yang, Yi Li, Dongliang Fang, Yang Xiao, Yongle Chen, Limin Sun. Advancing Binary Code Similarity Detection via Context-Content Fusion and LLM Verification. In IEEE/ACM International Conference on Automated Software Engineering (ASE 2025).


📁 Project Structure

The project is organized as follows:

  • DBs/ Contains dataset lists, stripped binaries, and other essential data files. You can download the dataset from figshare.

  • core/ Includes the core implementation, such as context construction, feature extraction, and LLM integration logic.

  • IDA_scripts/ Provides IDA-Pro scripts used to extract binary-level features and function representations from binaries.

  • utils/ Contains supporting utility functions (data preprocessing, evaluation, etc.).

  • saved/ Stores experimental results and intermediate files.


⚙️ Environment Setup

Before running Co2FuLL, please ensure your environment meets the following requirements:

  • Operating System The experiments were conducted on Ubuntu 22.04 LTS. While other systems may work, unexpected errors might occur due to environment differences.

    • Python Environment Python 3.9 is used. We recommend managing the environment via conda. A pre-configured environment file (environment.yml) is provided for convenience:

      conda env create -f environment.yml
      conda activate co2full-public

      Alternatively, you may create a clean environment manually and install required dependencies.

      • Apart from that, install the necessary packages under your IDA python environment
      pip install cptools networkx loguru --target="/path/to/IDA Python/DIR/"
  • IDA-Pro We use IDA-Pro v7.5 for binary feature extraction. Since IDA-Pro is commercial software, please install it manually and configure the following paths in settings.py:

    # Replace IDA path with your own
    IDA_PATH = Path(getenv("IDA_PATH", "/data/Application/idapro-7.5/idat64"))
    IDA32_PATH = Path(getenv("IDA32_PATH", "/data/Application/idapro-7.5/idat"))

🔁 Reproducing the Experiments

To reproduce the experimental results presented in our paper, follow these three major steps:

  1. Feature Extraction
  2. Candidate Retrieval
  3. LLM Verification

🧩 Step 1: Feature Extraction

Feature extraction involves generating various binary representations and metadata for downstream tasks.

  1. Generate .idb files (IDA-Pro analysis results):

    python IDA_scripts/cli_idbs.py
  2. Generate dependency graphs (DGs):

    python IDA_scripts/cli_DG.py
  3. Extract code snippets (for top-5 candidate functions):

    python IDA_scripts/cli_code.py -input DBs/Binkit-1.0-dataset/top5_for_llm-idb_path2func_eas.json
  4. Generate model embeddings Follow the corresponding repositories to generate embeddings for your test functions:

    The embedding file is expected to store a dictionary mapping each function to its corresponding embedding vector. Each key should follow the format binary@function_address. Your embedding file should be named like Binkit-1.0-normal-strip_testing_hermessim_embedding_250123.pkl for data loading. More details could be found in context_exp.py load_model_embeddings. An example is provided below.

    {
    "lightning-2.1.2_gcc-4.9.4_arm_64_O3_liblightning.so.1.0.0.elf.strip@0x60a8": [1.12573838e+00 -8.25750828e-03, ..., 3.89225304e-01]
    }

🧭 Step 2: Candidate Retrieval

In this stage, Co2FuLL fuses contextual and content-based similarities to retrieve semantically equivalent functions from a large function pool.

We explore:

  • 4 models
  • 3 sub-tasks: xc, xa, xm
  • 5 configurations: base, base+context, base+context(import), base+context(string), base+context(direct)

Example usage:

python context_exp.py -h

Help output:

usage: context_exp.py [-h] [-data_path DATA_PATH] [-exp EXP]

Function retrieval enhancement experiments

options:
  -h, --help            show this help message and exit
  -data_path DATA_PATH  Testing dataset path
  -exp EXP              Experiment name (xc, xa, xm)

🤖 Step 3: LLM Verification

After retrieving top-K candidates, Co2FuLL leverages Large Language Models (LLMs) to verify and confirm the true match. This verification step improves precision and interpretability.

In our experiments, we evaluate:

  • 7 LLMs
    • Qwen-2.5-7B (14B, 72B)
    • Qwen2.5-Coder-14B
    • DeepSeek-V3, DeepSeek-R1
    • GPT-4o
  • 6 prompt designs
    • Zero-shot, Few-Shot, CoT-Lite, CoT-Pro, CoT-Self, Critique
  • 5 LLM settings
    • top_p/temperature: 1.0/0.5, 0.5/0.5, 0.5/1.0, 1.0/1.0, 1.0/0.0

Example usage:

python LLM_exp.py -h

Help output:

usage: LLM_exp.py [-h] [-data_path DATA_PATH] [-api_key API_KEY] [-n_jobs N_JOBS]
                  [-save_dir SAVE_DIR] [-url URL] [-model MODEL] [-exp EXP]

BCSD LLM experiments

options:
  -h, --help            Show this help message and exit
  -data_path DATA_PATH  Path to top-K results
  -api_key API_KEY      API key for LLM service
  -n_jobs N_JOBS        Number of parallel threads
  -save_dir SAVE_DIR    Directory to save results
  -url URL              API request endpoint
  -model MODEL          LLM model name
  -exp EXP              Experiment name

Note: Different API vendors may use different names for the same LLM model (for example, deepseek-v3 may appear as deepseek-chat). Please make sure to adjust the LLM name according to the API naming convention of the service you are using.


📜 Citation

If you find this work useful, please cite our paper:

@inproceedings{dong2025co2full,
  title={Advancing Binary Code Similarity Detection via Context-Content Fusion and LLM Verification},
  author={Dong, Chaopeng and Guo, Jingdong and Yang, Shouguo and Li, Yi and Fang, Dongliang and Xiao, Yang and Chen, Yongle and Sun, Limin},
  booktitle={Proceedings of the IEEE/ACM International Conference on Automated Software Engineering (ASE)},
  year={2025}
}

📬 Contact

If you encounter any issues or have questions about the code or dataset, please feel free to contact:

About

The official repository of Co2FuLL, accepted by ASE 2025.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages