Configurable bloom filter parameters via ShardedIndex constructor #11

@titusz

Problem

ScalableBloomFilter is constructed inside ShardedIndex.__init__() with hardcoded defaults:

  • initial_capacity=10_000_000
  • fpr=0.01 (1%)
  • growth_factor=2.0

There's no way to tune these through the ShardedIndex constructor. For large indexes (tens of millions of vectors), a lower FPR may be desirable. For small indexes, 10M initial capacity wastes memory.
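
To put rough numbers on the memory side, here is a back-of-the-envelope sketch using the textbook sizing formula for a single classic Bloom filter, m = -n * ln(p) / (ln 2)^2. This is an assumption for illustration: the real ScalableBloomFilter may size its stages differently, and the helper names `bloom_bits` and `bloom_mib` are made up here.

```python
import math


def bloom_bits(capacity: int, fpr: float) -> int:
    """Optimal bit count for a classic Bloom filter holding `capacity` items at `fpr`."""
    return math.ceil(-capacity * math.log(fpr) / (math.log(2) ** 2))


def bloom_mib(capacity: int, fpr: float) -> float:
    """Same estimate expressed in MiB."""
    return bloom_bits(capacity, fpr) / 8 / 2**20


if __name__ == "__main__":
    # Compare the current defaults against a small index and a stricter FPR.
    for capacity, fpr in [(10_000_000, 0.01), (1_000, 0.01), (10_000_000, 0.001)]:
        print(f"capacity={capacity:>10,} fpr={fpr}: ~{bloom_mib(capacity, fpr):.4f} MiB")
```

Under this formula, a 1% FPR costs roughly 9.6 bits per item of capacity, so the 10M default reserves on the order of 11 MiB per filter even when the index only ever holds a few thousand vectors.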

Proposal

Accept optional kwargs on ShardedIndex.__init__():

ShardedIndex(
    ...,
    bloom_fpr=0.001,              # default: 0.01
    bloom_initial_capacity=1_000,  # default: 10_000_000
)

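Pass these through to the ScalableBloomFilter() construction in both __init__() and rebuild_bloom(), so a rebuilt filter keeps the configured values instead of falling back to the defaults.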
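
A minimal sketch of the pass-through, to make the proposal concrete. Everything here is a stand-in: the `ScalableBloomFilter` dataclass only records its parameters (its keyword names are taken from the defaults listed above and may not match the real signature), the other `ShardedIndex` constructor arguments are omitted, and the `_new_bloom` helper is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ScalableBloomFilter:
    """Stand-in for the real filter; it only records its parameters."""

    initial_capacity: int = 10_000_000
    fpr: float = 0.01
    growth_factor: float = 2.0


class ShardedIndex:
    """Hypothetical slice of the constructor; all non-bloom arguments are left out."""

    def __init__(
        self,
        *,
        bloom_fpr: float = 0.01,
        bloom_initial_capacity: int = 10_000_000,
    ) -> None:
        if not 0.0 < bloom_fpr < 1.0:
            raise ValueError("bloom_fpr must be strictly between 0 and 1")
        if bloom_initial_capacity < 1:
            raise ValueError("bloom_initial_capacity must be positive")
        # Store the settings so rebuild_bloom() can reuse them later.
        self._bloom_fpr = bloom_fpr
        self._bloom_initial_capacity = bloom_initial_capacity
        self._bloom = self._new_bloom()

    def _new_bloom(self) -> ScalableBloomFilter:
        # Single construction site shared by __init__ and rebuild_bloom.
        return ScalableBloomFilter(
            initial_capacity=self._bloom_initial_capacity,
            fpr=self._bloom_fpr,
        )

    def rebuild_bloom(self) -> None:
        # The real method would also re-add existing keys; omitted in this sketch.
        self._bloom = self._new_bloom()
```

Routing both call sites through one helper keeps the defaults in a single place, so existing callers that pass no bloom arguments would see the same behavior as today.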

Context

iscc-search creates multiple ShardedIndex128 and ShardedNphdIndex instances per index (one per ISCC-UNIT type, one per simprint type). Some of these will be small (thousands of vectors) while others may be large (tens of millions). Being able to tune bloom filter parameters per-index would help with both memory efficiency and lookup performance.

The current 1% FPR is reasonable for most cases — this is a nice-to-have, not blocking.
