Description
Problem
ScalableBloomFilter is constructed inside ShardedIndex.__init__() with hardcoded defaults:
- initial_capacity=10_000_000
- fpr=0.01 (1%)
- growth_factor=2.0
There's no way to tune these through the ShardedIndex constructor. For large indexes (tens of millions of vectors), a lower FPR may be desirable. For small indexes, 10M initial capacity wastes memory.
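To put rough numbers on the memory concern, the sketch below applies the textbook sizing formula for a single classic Bloom filter, m = -n ln(p) / (ln 2)^2 bits. The helper names `bloom_bits` and `bloom_megabytes` are made up for illustration, and the actual ScalableBloomFilter may allocate differently (for example by tightening the FPR per growth stage), so treat the results as order-of-magnitude estimates only.

```python
import math


def bloom_bits(capacity: int, fpr: float) -> int:
    """Optimal bit count for one classic Bloom filter: m = -n * ln(p) / (ln 2)^2."""
    return math.ceil(-capacity * math.log(fpr) / (math.log(2) ** 2))


def bloom_megabytes(capacity: int, fpr: float) -> float:
    """Same estimate expressed in megabytes (10^6 bytes)."""
    return bloom_bits(capacity, fpr) / 8 / 1_000_000


if __name__ == "__main__":
    # Current hardcoded defaults vs. a small index with the same FPR.
    for capacity, fpr in [(10_000_000, 0.01), (1_000, 0.01), (10_000_000, 0.001)]:
        print(f"capacity={capacity:>10,} fpr={fpr}: ~{bloom_megabytes(capacity, fpr):.4f} MB")
```

Under this estimate the current defaults reserve roughly 12 MB per filter up front, while a filter sized for 1,000 items at the same FPR needs only about a kilobyte. With one filter per ISCC-UNIT type and per simprint type, the difference adds up across small indexes.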
Proposal
Accept optional kwargs on ShardedIndex.__init__():
ShardedIndex(
    ...,
    bloom_fpr=0.001,  # default: 0.01
    bloom_initial_capacity=1_000,  # default: 10_000_000
)
Pass these through to ScalableBloomFilter() construction in __init__ and rebuild_bloom().
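One possible shape for the pass-through is sketched below. It is not the library's actual code: `ScalableBloomFilter` here is a stand-in dataclass that only mirrors the three parameters named in this issue, and `ShardedIndexSketch` and `_new_bloom` are hypothetical names. The point is that storing the two values once and building the filter through a single helper keeps `__init__` and `rebuild_bloom()` from drifting apart.

```python
from dataclasses import dataclass


@dataclass
class ScalableBloomFilter:
    """Stand-in for the real filter; only mirrors the parameters named in this issue."""

    initial_capacity: int = 10_000_000
    fpr: float = 0.01
    growth_factor: float = 2.0


class ShardedIndexSketch:
    """Hypothetical, stripped-down index showing only the bloom parameter plumbing."""

    def __init__(
        self,
        *,
        bloom_fpr: float = 0.01,
        bloom_initial_capacity: int = 10_000_000,
    ) -> None:
        if not 0.0 < bloom_fpr < 1.0:
            raise ValueError("bloom_fpr must be between 0 and 1 (exclusive)")
        if bloom_initial_capacity < 1:
            raise ValueError("bloom_initial_capacity must be at least 1")
        # Remember the settings so a later rebuild uses the same values.
        self._bloom_fpr = bloom_fpr
        self._bloom_initial_capacity = bloom_initial_capacity
        self._bloom = self._new_bloom()

    def _new_bloom(self) -> ScalableBloomFilter:
        # Single construction site shared by __init__ and rebuild_bloom().
        return ScalableBloomFilter(
            initial_capacity=self._bloom_initial_capacity,
            fpr=self._bloom_fpr,
        )

    def rebuild_bloom(self) -> None:
        # A real implementation would re-add every stored key after this line.
        self._bloom = self._new_bloom()
```

Keeping the defaults equal to the current hardcoded values means existing callers see no behavior change. The keyword-only arguments are a design choice of this sketch, not something the proposal requires.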
Context
iscc-search creates multiple ShardedIndex128 and ShardedNphdIndex instances per index (one per ISCC-UNIT type, one per simprint type). Some of these will be small (thousands of vectors) while others may be large (tens of millions). Being able to tune bloom filter parameters per-index would help with both memory efficiency and lookup performance.
The current 1% FPR is reasonable for most cases — this is a nice-to-have, not blocking.