This repository contains the source code of SoundnessBench, a benchmark designed to thoroughly evaluate the soundness of neural network (NN) verifiers. It addresses the critical lack of ground-truth labels in previous benchmarks for NN verifiers by providing numerous unverifiable instances with hidden counterexamples, so that it can effectively reveal internal bugs of NN verifiers if they falsely claim verifiability on those unverifiable instances. SoundnessBench aims to support developers in evaluating and improving NN verifiers. See our paper for more details.
[Paper] Testing Neural Network Verifiers: A Soundness Benchmark with Hidden Counterexamples
SoundnessBench is hosted on HuggingFace. To directly download the benchmark, use
git clone https://huggingface.co/datasets/SoundnessBench/SoundnessBench
The downloaded benchmark should contain a total of 26 models across 9 distinct NN architectures with different input sizes and perturbation radii. The table below shows the 9 architectures.
Name | Model Architecture | Activation Function |
---|---|---|
CNN 1 Conv | Conv 10 × 3 × 3, FC 1000, FC 100, FC 20, FC 2 | ReLU |
CNN 2 Conv | Conv 5 × 3 × 3, Conv 10 × 3 × 3, FC 1000, FC 100, FC 20, FC 2 | ReLU |
CNN 3 Conv | Conv 5 × 3 × 3, Conv 10 × 3 × 3, Conv 20 × 3 × 3, FC 1000, FC 100, FC 20, FC 2 | ReLU |
CNN AvgPool | Conv 10 × 3 × 3, AvgPool 3 × 3, FC 1000, FC 100, FC 20, FC 2 | ReLU |
MLP 4 Hidden | FC 100, FC 1000, FC 1000, FC 1000, FC 20, FC 2 | ReLU |
MLP 5 Hidden | FC 100, FC 1000, FC 1000, FC 1000, FC 1000, FC 20, FC 2 | ReLU |
CNN Tanh | Conv 10 × 3 × 3, FC 1000, FC 100, FC 20, FC 2 | Tanh |
CNN Sigmoid | Conv 10 × 3 × 3, FC 1000, FC 100, FC 20, FC 2 | Sigmoid |
VIT | Modified VIT [17] with patch size 1 × 1, 2 attention heads and embedding size 16 | ReLU |
Each folder should contain:
model.onnx
: Model in ONNX format with both model architecture and parametersvnnlib/
: A folder of instances in VNN-LIB formatinstances.csv
: A list of VNN-LIB filesmodel.pt
: Model checkpoint in PyTorch format with parameters only (not needed for verification)data.pt
: Raw data with instances (not needed for verification)
The format of our benchmarks follows VNN-COMP and their benchmarks.
You may run your verifier on our benchmark to produce a results.csv
file containing the results.
The format of this CSV file should follow the format in VNN-COMP, where each line contains the result on one instance.
This CSV file can be automatically produced if you use the official script from VNN-COMP to run your verifier, with three scripts provided by the verifier.
See an example of the result file.
We provide an evaluation script to quickly count the results in a results CSV file. Our script would report metrics including:
clean_instance_verified_ratio
clean_instance_falsified_ratio
unverifiable_instance_verified_ratio
unverifiable_instance_falsified_ratio
A non-zero value for unverifiable_instance_verified_ratio
indicates unsound results by the verifier.
Run the evaluation script by:
python eval.py <path to results.csv> <path to metrics.csv>
In this section, we provide steps for running several verifiers included in our paper.
Our experiments require conda
(e.g., miniconda).
For verification with NeuralSAT and Marabou on certain networks, Gurobi is needed, and please obtain a license. You may be able to obtain a free academic license.
You can install alpha-beta-CROWN, NeuralSAT, and PyRAT using the following command:
bash install.sh
Running Marabou is a bit more complex. Our experiments use Docker to install Marabou. Please set up a Docker environment and mount marabou_vnncomp_2023 or marabou_vnncomp_2024 in your container. After that, run the following to install Marabou:
conda create -y --name marabou python=3.8
conda activate marabou
cd ~/marabou/vnncomp
bash install_tool.sh
cp -r ~/marabou/opt ~/
We provide the config file used in our experiments for alpha-beta-crown here.
To run alpha-beta-CROWN with activation split:
python run_all_verification.py --verifier abcrown --model_folder <path_to_soundness_bench> --config_dir <path_to_config_file>
To run alpha-beta-CROWN with input split:
python run_all_verification.py --verifier abcrown --model_folder <path_to_soundness_bench> --split_type input
To run NeuralSAT with activation split:
python run_all_verification.py --verifier neuralsat --model_folder <path_to_soundness_bench>
To run NeuralSAT with input split:
python run_all_verification.py --verifier neuralsat --model_folder <path_to_soundness_bench> --split_type input
To run PyRAT:
python run_all_verification.py --verifier pyrat --model_folder <path_to_soundness_bench>
We use Docker to run Marabou. Please specify the container name in the script. If you want to run a specific version of Marabou, just specify the name of the container corresponding to its version.
# Replace <marabou_container> with your Marabou container.
python run_all_verification.py --verifier marabou --model_folder <path_to_soundness_bench> --container_name <marabou_container>
The run_all_verification.py
script generates <toolname>_results.csv
files in the ./results
directory. You can use the eval.py
script to count the results in these files.
python eval.py <path to results.csv> <path to metrics.csv>
We provide an easy pipeline for users to train new models that contain hidden adversarial examples.
synthetic_data_generation.py
generates synthetic data with pre-defined hidden counterexamplesadv_training.py
trains models on the synthetically generated dataset using a two-objective training frameworkcross_attack_evaluation.py
uses very strong adversarial attacks to make sure that no trivial counterexamples can be found and saves those instances into VNNLIB filesmodels
contain the definitions of all NN architectures we use in SoundnessBench
conda create -n "SoundnessBench" python=3.11
conda activate SoundnessBench
pip install -r requirements.txt
To train a new model, you can run the following command, which will first call synthetic_data_generation.py
to generate the synthetic dataset used for training, and then apply a two-objective training framework (see our paper for details) to the model.
python adv_training.py
# GENERAL OPTIONS
--fname {model, FNAME}
--model {synthetic_mlp_default, MODEL} # model architecture
# DATASET OPTIONS
--dataset {synthetic1d, synthetic2d}
--synthetic_size {10, SYNTHETIC_SIZE} # number of hidden instances generated
--input_size {5, INPUT_SIZE} # input dimension
--input_channel {1, INPUT_CHANNEL} # number of input channels
--data_range {0.1, DATA_RANGE} # the range of the dataset [-c, c]
--epsilon {0.1, EPSILON} # perturbation radius
# TRAINING OPTIONS
--batch_size {512, BATCH_SIZE}
--epochs {5000, EPOCHS}
--lr-max {1e-3, LR-MAX}
--seed {0, SEED}
# ADDITIONAL TECHNIQUES OPTIONS
# See our paper for details
--margin_obj
--counter_margin {0.01, COUNTER_MARGIN}
--window_size {1, WINDOW_SIZE}
Below are some example commands:
# Example training command for a MLP model
python adv_training.py --fname model --model synthetic_mlp_default --dataset synthetic1d --synthetic_size 10 --input_size 5 --epsilon 0.02 --epochs 5000 --lr-max 1e-3 --alpha 0.02 --margin_obj --window_size 300
# Example training command for a CNN model
python adv_training.py --fname model --model synthetic_cnn_default --dataset synthetic2d --synthetic_size 10 --input_size 5 --epsilon 0.05 --epochs 5000 --lr-max 1e-3 --alpha 0.0005 --margin_obj --window_size 300
After the training has completed, run the following command to evaluate the trained model to see if the counterexamples are truly hidden and generate onnx model and VNNLIB files. The parameters should be consistent with the parameters during training to avoid errors.
python cross_attack_evalution.py --fname FNAME # filename of your trained model
--model MODEL # model architecture
--dataset {synthetic1d, synthetic2d}
--data_range DATA_RANGE # data range previously used
--epsilon EPSILON # perturbation radius previously used
--batch_size {512, BATCH_SIZE}
--result_path {result, PATH} # root path of models' data directory
--output_path {verification, PATH} # path to store VNNLIB
Below are some example commands:
# For the previously trained MLP model:
python cross_attack_evaluation.py --fname model --model synthetic_mlp_default --dataset synthetic1d --epsilon 0.02 --pgd --result_path verification/model
# For the previously trained CNN model:
python cross_attack_evaluation.py --fname model --model synthetic_cnn_default --dataset synthetic2d --epsilon 0.02 --pgd --result_path verification/model
Please kindly cite our paper if find SoundnessBench helpful.
@misc{zhou2024testingneuralnetworkverifiers,
title={Testing Neural Network Verifiers: A Soundness Benchmark with Hidden Counterexamples},
author={Xingjian Zhou and Hongji Xu and Andy Xu and Zhouxing Shi and Cho-Jui Hsieh and Huan Zhang},
year={2024},
eprint={2412.03154},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2412.03154},
}
Please reach out to us if you have any questions or suggestions. You can submit an issue or pull request, or send an email to:
- Harry Zhou (hzhou27@g.ucla.edu), UCLA
- Hongji Xu (hx84@duke.edu), Duke University
- Andy Xu (Xuandy05@gmail.com), UCLA
- Zhouxing Shi (zshi@cs.ucla.edu), UCLA
Thank you for your interest in SoundnessBench. We hope that our work will serve as a valuable resource for developers building and improving NN verifiers in the future.