[FEA] Binary IVF Flat Index #1099

Open
tarang-jain wants to merge 152 commits into rapidsai:main from tarang-jain:binary-kmeans

tarang-jain commented Jul 9, 2025

Depends on rapidsai/raft#2770

Implements a binary IVF Flat index, i.e. support for the BitwiseHamming metric in the IVF Flat index.

Key Features

1. Binary Index Structure

  • Added binary_centers_ field to store cluster centers as packed uint8_t arrays for binary data
  • Index automatically detects BitwiseHamming metric and configures itself for binary operation
  • Only uint8_t inputs are supported with BitwiseHamming, and each newly added kernel has a single instantiation

2. K-means Clustering for Binary Data

The clustering approach for binary data required special handling:

  • Expanded Space Clustering: Binary data (uint8_t) is expanded to signed representation (int8_t) where each bit becomes ±1

    • 0 → -1, 1 → +1 transformation enables meaningful centroid computation
    • Clustering performed using L2 distance in the expanded dimensional space
  • Centroid Quantization: After computing float centroids in expanded space, they are converted back to binary format:

    • Centroids are stored as packed uint8_t arrays
    • KMeans (coarse) prediction is done on these quantized centroids with the BitwiseHamming distance.
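
The expand, average, and quantize steps above can be sketched on the host as follows. This is a minimal illustration, not the code in this PR: the function names are hypothetical, the bit order within a byte (least significant bit first) is an assumption, and so is the quantization rule (a centroid component greater than 0 becomes bit 1, anything else becomes bit 0).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Expand packed bits to a +/-1 vector: bit 0 -> -1, bit 1 -> +1.
// Assumption: bit i (LSB first) of byte b maps to expanded dimension 8*b + i.
std::vector<int8_t> expand_bits(const std::vector<uint8_t>& packed) {
  std::vector<int8_t> out(packed.size() * 8);
  for (std::size_t b = 0; b < packed.size(); ++b) {
    for (int i = 0; i < 8; ++i) {
      out[b * 8 + i] = ((packed[b] >> i) & 1) ? 1 : -1;
    }
  }
  return out;
}

// Float centroid in the expanded space: the mean of the expanded rows.
// All rows are assumed to have the same number of bytes.
std::vector<float> mean_centroid(const std::vector<std::vector<uint8_t>>& rows) {
  std::vector<float> c(rows.at(0).size() * 8, 0.0f);
  for (const auto& r : rows) {
    const std::vector<int8_t> e = expand_bits(r);
    for (std::size_t j = 0; j < c.size(); ++j) c[j] += e[j];
  }
  for (auto& v : c) v /= static_cast<float>(rows.size());
  return c;
}

// Quantize a float centroid back to packed bits.
// Assumption: component > 0 -> bit 1, otherwise (including ties at 0) -> bit 0.
std::vector<uint8_t> quantize_centroid(const std::vector<float>& c) {
  std::vector<uint8_t> packed(c.size() / 8, 0);
  for (std::size_t j = 0; j < c.size(); ++j) {
    if (c[j] > 0.0f) {
      packed[j / 8] = static_cast<uint8_t>(packed[j / 8] | (1u << (j % 8)));
    }
  }
  return packed;
}
```

Under this rule the quantized centroid is a per-bit majority vote over the cluster members. Clustering with L2 in the expanded space is consistent with Hamming distance on the original bits, because two ±1 vectors that differ in h positions have squared L2 distance 4h.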

3. Distance Kernels

Coarse Search (Cluster Selection)

  • Implemented a specialized bitwise_hamming_distance_op that computes pairwise query-to-centroid distances
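
For illustration, coarse search over packed centroids amounts to computing one Hamming distance per (query, centroid) pair and keeping the closest clusters. The sketch below is a host-side approximation with hypothetical names (`hamming_bytes`, `select_clusters`); it assumes row-major centroid storage and does not reproduce the bitwise_hamming_distance_op interface from this PR.

```cpp
#include <algorithm>
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hamming distance between two packed bit vectors of n_bytes bytes each.
uint32_t hamming_bytes(const uint8_t* a, const uint8_t* b, std::size_t n_bytes) {
  uint32_t d = 0;
  for (std::size_t i = 0; i < n_bytes; ++i) {
    d += static_cast<uint32_t>(std::bitset<8>(a[i] ^ b[i]).count());
  }
  return d;
}

// Return the indices of the n_probes centroids closest to the query.
// Assumption: `centers` is row-major, n_lists rows of query.size() bytes.
// Ties are broken by the lower centroid index.
std::vector<std::size_t> select_clusters(const std::vector<uint8_t>& query,
                                         const std::vector<uint8_t>& centers,
                                         std::size_t n_probes) {
  const std::size_t n_bytes = query.size();
  const std::size_t n_lists = centers.size() / n_bytes;

  // One distance per centroid (one row of the pairwise distance matrix).
  std::vector<uint32_t> dist(n_lists);
  for (std::size_t l = 0; l < n_lists; ++l) {
    dist[l] = hamming_bytes(query.data(), centers.data() + l * n_bytes, n_bytes);
  }

  std::vector<std::size_t> order(n_lists);
  std::iota(order.begin(), order.end(), std::size_t{0});
  n_probes = std::min(n_probes, n_lists);
  std::partial_sort(order.begin(), order.begin() + n_probes, order.end(),
                    [&](std::size_t x, std::size_t y) {
                      return dist[x] < dist[y] || (dist[x] == dist[y] && x < y);
                    });
  order.resize(n_probes);
  return order;
}
```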

Fine-Grained Search (Within Clusters)

Extended the interleaved scan kernel (ivf_flat_interleaved_scan.cuh) with specialized templates for BitwiseHamming:

  • Veclen-based optimization: Different code paths based on vectorization width

    • Veclen=16,8,4: Load data as uint32_t, use __popc(x ^ y) for 4-byte Hamming distance
    • Veclen=1,2: Byte-wise XOR and population count
  • Efficient memory access patterns:

    • Maintains interleaved data layout for coalesced memory access
    • Specialized loadAndComputeDist templates for uint8_t that leverage vectorized loads
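
The two code paths can be illustrated on the host as shown below. This is a sketch only: the function names are hypothetical, `std::bitset::count` stands in for the CUDA `__popc` intrinsic, and the interleaved list layout is not modeled. It shows why the wider path is cheaper: one XOR and one population count cover 4 bytes instead of 1.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Byte-wise path (analogous to the Veclen=1,2 case): XOR and popcount per byte.
uint32_t hamming_bytewise(const uint8_t* x, const uint8_t* y, std::size_t n) {
  uint32_t d = 0;
  for (std::size_t i = 0; i < n; ++i) {
    d += static_cast<uint32_t>(std::bitset<8>(x[i] ^ y[i]).count());
  }
  return d;
}

// Word-wise path (analogous to the Veclen=4,8,16 case): load 4 bytes as one
// uint32_t, then a single XOR and popcount per word. Any tail bytes (n not a
// multiple of 4) fall back to the byte-wise loop.
uint32_t hamming_wordwise(const uint8_t* x, const uint8_t* y, std::size_t n) {
  uint32_t d = 0;
  std::size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    uint32_t a, b;
    std::memcpy(&a, x + i, sizeof(a));  // memcpy avoids unaligned-access issues
    std::memcpy(&b, y + i, sizeof(b));
    d += static_cast<uint32_t>(std::bitset<32>(a ^ b).count());
  }
  return d + hamming_bytewise(x + i, y + i, n - i);
}
```

Both functions return the same distance for any input; only the number of XOR and popcount operations differs.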

Binary size increase (as of 10/17/2025):

  • branch-25.12 (CUDA 12.9 + x86): 1232.414 MB
  • This PR (CUDA 12.9 + x86): 1251.051 MB

That is an increase of 18.637 MB, about 1.5%.

copy-pr-bot bot commented Jul 9, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

tarang-jain commented

/ok to test 07354d1

tarang-jain commented

/ok to test 07e1837

tarang-jain changed the base branch from main to release/26.02 on January 20, 2026, 21:42

cjnolet commented Jan 28, 2026

/ok to test d546471

tarang-jain changed the base branch from release/26.02 to main on March 17, 2026, 00:38
tarang-jain requested a review from a team as a code owner on March 17, 2026, 00:38

Labels

  • cpp
  • feature request (New feature or request)
  • non-breaking (Introduces a non-breaking change)

5 participants