CANDOR-Bench (Continuous Approximate Nearest neighbor search under Dynamic Open-woRld Streams) is a benchmarking framework designed to evaluate in-memory ANNS algorithms under realistic, dynamic data stream conditions.
CANDY-Benchmark/
├── benchmark/
├── big-ann-benchmarks/ # Core benchmarking framework (Dynamic Open-World conditions)
│ ├── benchmark/
│ │ ├── algorithms/
│ │ ├── concurrent/ # Concurrent Track
│ │ ├── congestion/ # Congestion Track
│ │ ├── main.py
│ │ ├── runner.py
│ │ └── ……
│ ├── create_dataset.py
│ ├── requirements_py3.10.txt
│ ├── logging.conf
│ ├── neurips21/
│ ├── neurips23/ # NeurIPS'23 benchmark configurations and scripts
│ │ ├── concurrent/ # Concurrent Track
│ │ ├── congestion/ # Congestion Track
│ │ ├── filter/
│ │ ├── ood/
│ │ ├── runbooks/ # Dynamic benchmark scenario definitions (e.g., T1, T3, etc.)
│ │ ├── sparse/
│ │ ├── streaming/
│ │ └── ……
│ └── ……
├── GTI/ # Integrated GTI algorithm source
├── IP-DiskANN/ # Integrated IP-DiskANN algorithm source
├── src/ # Main algorithm implementations
├── include/ # C++ header files
├── thirdparty/ # External dependencies
├── Dockerfile # Docker build recipe
├── requirements.txt
├── setup.py # Python package setup
└── ……
Our evaluation involves the following datasets and algorithms.
| Category | Name | Description | Dimension | Data Size | Query Size | Code Identifier |
|---|---|---|---|---|---|---|
| Real-world | SIFT | Image | 128 | 1M | 10K | sift |
| | OpenImagesStreaming | Image | 512 | 1M | 10K | \ |
| | Sun | Image | 512 | 79K | 200 | sun |
| | SIFT100M | Image | 128 | 100M | 10K | sift100M |
| | Trevi | Image | 4096 | 100K | 200 | sift |
| | Msong | Audio | 420 | 990K | 200 | msong |
| | COCO | Multi-Modal | 768 | 100K | 500 | coco |
| | Glove | Text | 100 | 1.192M | 200 | glove |
| | MSTuring | Text | 100 | 30M | 10K | msturing |
| Synthetic | Gaussian | i.i.d. values | Adjustable | 500K | 1000 | \ |
| | Blob | Gaussian Blobs | 768 | 500K | 1000 | \ |
| | WTE | Text | 768 | 100K | 100 | \ |
| | FreewayML | Constructed | 128 | 100K | 1K | \ |
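For intuition about the synthetic entries above, i.i.d. Gaussian vectors and Gaussian blobs can be generated with a few lines of standard-library Python. This is an illustrative sketch, not the benchmark's actual generator; the function names, cluster count, and spread below are assumptions for the example.

```python
import random

def gaussian_dataset(n, dim, seed=42):
    """i.i.d. Gaussian vectors, in the spirit of the 'Gaussian' dataset."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]

def blob_dataset(n, dim, centers=4, spread=0.1, seed=42):
    """Gaussian blobs: points scattered around a few random centroids,
    in the spirit of the 'Blob' dataset."""
    rng = random.Random(seed)
    cs = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(centers)]
    return [[x + rng.gauss(0.0, spread) for x in rng.choice(cs)]
            for _ in range(n)]

vectors = gaussian_dataset(1000, 16)
blobs = blob_dataset(1000, 16)
```

Seeding makes the generated data reproducible across runs, which matters when comparing algorithms on the same stream.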
| Category | Algorithm Name | Description | Code Identifier |
|---|---|---|---|
| Tree-based | SPTAG | Space-partitioning tree structure for efficient data segmentation. | candy_sptag |
| LSH-based | LSH | Data-independent hashing to reduce dimensionality and approximate nearest neighbors. | faiss_lsh |
| | LSHAPG | LSH-driven optimization using LSB-Tree to differentiate graph regions. | candy_lshapg |
| Clustering-based | PQ | Product quantization for efficient clustering into compact subspaces. | faiss_pq |
| | IVFPQ | Inverted index with product quantization for hierarchical clustering. | faiss_IVFPQ |
| | OnlinePQ | Incremental updates of centroids in product quantization for streaming data. | faiss_onlinepq |
| | Puck | Non-orthogonal inverted indexes with multiple quantization optimized for large-scale datasets. | puck |
| | SCANN | Small-bit quantization to improve register utilization. | faiss_fast_scan |
| Graph-based | NSW | Navigable Small World graph for fast nearest neighbor search. | faiss_NSW |
| | HNSW | Hierarchical Navigable Small World for scalable search. | faiss_HNSW |
| | MNRU | Enhances HNSW with efficient updates to prevent unreachable points in dynamic environments. | candy_mnru |
| | Cufe | Enhances FreshDiskANN with batched neighbor expansion. | cufe |
| | Pyanns | Enhances FreshDiskANN with fixed-size huge pages for optimized memory access. | pyanns |
| | IPDiskANN | Enables efficient in-place deletions for FreshDiskANN, improving update performance without reconstruction. | ipdiskann |
| | GTI | Hybrid tree-graph indexing for efficient, dynamic high-dimensional search, with optimized updates and construction. | gti |
We strongly recommend using Docker to build and run this project.
There are many algorithm libraries with complex dependencies. Setting up the environment locally can be difficult and error-prone. Docker provides a consistent and reproducible environment, saving you time and avoiding compatibility issues.
Note: Building the Docker image may take 15–30 minutes depending on your network and hardware, so please be patient.
To build the project using Docker, simply use the provided Dockerfile located in the root directory. This ensures a consistent and reproducible environment for all dependencies and build steps.
- To initialize and update all submodules in the project, you can run:
git submodule update --init --recursive
- You can build the Docker image with:
docker build -t <your-image-name> .
- Once the image is built, you can run a container from it using the following command.
docker run -it <your-image-name>
- After entering the container, navigate to the project directory:
cd /app/big-ann-benchmarks
Prepare the dataset and compute the ground truth
cd big-ann-benchmarks
bash scripts/compute_general.sh
Run general experiments
bash scripts/run_general.sh
Wait for the experiments to complete, then generate the results (written to gen-congestion.csv):
python3 data_exporter.py --output gen --track congestion
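The exported results revolve around metrics such as recall@k, the fraction of the true top-k neighbors an algorithm actually returned. As a point of reference, here is a minimal recall@k computation over ID lists; this is a sketch of the standard definition, not the exporter's internal code.

```python
def recall_at_k(groundtruth, results, k):
    """Average over queries of |true top-k ∩ returned top-k| / k.

    groundtruth and results are parallel lists: one list of neighbor
    IDs per query (ground truth vs. what the algorithm returned).
    """
    hits = sum(len(set(gt[:k]) & set(res[:k]))
               for gt, res in zip(groundtruth, results))
    return hits / (k * len(groundtruth))

# One query, k=2: the algorithm found 1 of the 2 true neighbors.
score = recall_at_k([[1, 2]], [[1, 3]], k=2)  # 0.5
```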
All the following operations are performed in the root directory of big-ann-benchmarks.
Create a small sample dataset. For example, to create a dataset with 10000 20-dimensional random floating point vectors, run:
python3 create_dataset.py --dataset random-xs
To see a complete list of datasets, run the following:
python3 create_dataset.py --help
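For intuition, a random dataset like the one above can be materialized in an fbin-style layout (two little-endian uint32 values for count and dimension, followed by raw float32 data). This is an illustrative sketch under that layout assumption; the exact files create_dataset.py emits may differ, and the function name and path here are hypothetical.

```python
import os
import random
import struct
import tempfile

def write_random_dataset(path, n=10000, dim=20, seed=0):
    """Write n random float32 vectors in a simple fbin-style layout:
    <count:uint32><dim:uint32> header, then n*dim little-endian floats."""
    rng = random.Random(seed)
    with open(path, "wb") as f:
        f.write(struct.pack("<II", n, dim))
        for _ in range(n):
            f.write(struct.pack(f"<{dim}f", *(rng.random() for _ in range(dim))))

path = os.path.join(tempfile.gettempdir(), "random-xs.fbin")
write_random_dataset(path)
```

The resulting file size is exactly 8 + n * dim * 4 bytes, which is a quick sanity check that a download or conversion completed.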
To evaluate an algorithm under the congestion track, use the following command:
python3 run.py \
--neurips23track congestion \
--algorithm "$ALGO" \
--nodocker \
--rebuild \
--runbook_path "$PATH" \
--dataset "$DS"
- algorithm "$ALGO": Name of the algorithm to evaluate. Valid names can be found in the "Code Identifier" column (the last column) of the "summary of algorithms" table.
- dataset "$DS": Name of the dataset to use.
- runbook_path "$PATH": Path to the runbook file describing the test scenario. For example, the runbook path for the general experiment is
neurips23/runbooks/congestion/general_experiment/general_experiment.yaml.
- rebuild: Rebuild the target before running.
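Conceptually, a runbook is an ordered sequence of insert, delete, and search steps that run.py replays against the index, which is what makes the stream "dynamic". The toy replay below illustrates the idea against a brute-force stand-in index; the real runbooks are YAML files with richer step types, and the class and step encoding here are invented for the example.

```python
import heapq

class NaiveIndex:
    """Brute-force stand-in for an ANNS index, for illustrating replay."""
    def __init__(self):
        self.vecs = {}

    def insert(self, vid, vec):
        self.vecs[vid] = vec

    def delete(self, vid):
        self.vecs.pop(vid, None)

    def search(self, q, k=1):
        # Exact k-NN by squared L2 distance over the currently live vectors.
        dist = lambda vid: sum((a - b) ** 2 for a, b in zip(self.vecs[vid], q))
        return heapq.nsmallest(k, self.vecs, key=dist)

# A miniature "runbook": (operation, payload) steps replayed in order.
runbook = [
    ("insert", (0, [0.0, 0.0])),
    ("insert", (1, [5.0, 5.0])),
    ("search", [4.9, 5.1]),   # finds id 1
    ("delete", 1),
    ("search", [4.9, 5.1]),   # id 1 is gone; finds id 0
]

index = NaiveIndex()
for op, arg in runbook:
    if op == "insert":
        index.insert(*arg)
    elif op == "delete":
        index.delete(arg)
    elif op == "search":
        print(index.search(arg, k=1))
```

Because deletions change the answer set mid-stream, ground truth must be recomputed at checkpoints rather than once up front, which is exactly what the next step addresses.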
To compute ground truth for a runbook, use the provided script, which computes it at various checkpoints:
python3 benchmark/congestion/compute_gt.py \
--runbook "$PATH" \
--dataset "$DS" \
--gt_cmdline_tool ./DiskANN/build/apps/utils/compute_groundtruth
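What the ground-truth tool computes at each checkpoint is, in essence, the exact top-k neighbors of every query over the vectors alive at that point. A pure-Python equivalent of that core computation is sketched below; the DiskANN compute_groundtruth binary referenced above is a far faster compiled implementation, and this function's name and signature are illustrative.

```python
import heapq

def compute_groundtruth(base, queries, k):
    """Exact top-k IDs per query, by squared L2 distance (brute force)."""
    gt = []
    for q in queries:
        dist = lambda i: sum((a - b) ** 2 for a, b in zip(base[i], q))
        gt.append(heapq.nsmallest(k, range(len(base)), key=dist))
    return gt

base = [[0.0, 0.0], [1.0, 1.0], [10.0, 10.0]]
neighbors = compute_groundtruth(base, [[0.9, 0.9]], k=2)  # [[1, 0]]
```

Brute force is O(queries × base size), which is why precomputing and caching ground truth per checkpoint is worthwhile for the larger datasets.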
- To make the results available for post-processing, change permissions of the results folder
chmod 777 -R results/
- The following command will summarize all results files into a single csv file
python3 data_export.py --out "$OUT" --track congestion
The --out parameter "$OUT" should be adjusted according to the testing scenario; for example, the value corresponding to the general experiment is gen.
Common values include:
- gen
- batch
- event
- conceptDrift
- randomContamination
- randomDrop
- wordContamination
- bulkDeletion
- batchDeletion
- multiModal
- ……
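At its simplest, summarizing the per-run result files amounts to concatenating CSVs that share a header into one table. The sketch below shows that reduced view using only the standard library; the real data_export.py does more than concatenation (metric aggregation, track filtering), and the function name and column names here are illustrative.

```python
import csv
import io

def merge_result_csvs(files):
    """Concatenate CSV streams sharing one header into (header, rows)."""
    header, rows = None, []
    for f in files:
        reader = csv.reader(f)
        h = next(reader)           # each file's header row
        if header is None:
            header = h             # keep the first header only
        rows.extend(reader)        # append the data rows
    return header, rows

# Example with in-memory files standing in for per-run result CSVs.
a = io.StringIO("algo,recall\nhnsw,0.95\n")
b = io.StringIO("algo,recall\npq,0.80\n")
header, rows = merge_result_csvs([a, b])
```

Writing `header` and `rows` back out with csv.writer then yields a single summary file analogous to gen-congestion.csv.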