Skip to content

Comments

feat(dedup): implement Candidate Generation v2 with Multi-Key Blocking (#334)#338

Open
ZohaibHassan16 wants to merge 1 commit intoHawksight-AI:mainfrom
ZohaibHassan16:feature/candidate-gen-v2-334
Open

feat(dedup): implement Candidate Generation v2 with Multi-Key Blocking (#334)#338
ZohaibHassan16 wants to merge 1 commit intoHawksight-AI:mainfrom
ZohaibHassan16:feature/candidate-gen-v2-334

Conversation

@ZohaibHassan16
Copy link
Collaborator

Description

This PR implements Candidate Generation v2 to address the $O(N^2)$ "pair explosion" during deduplication. By moving away from naive first-character blocking and introducing token-based multi-key blocking and deterministic budgeting, we've significantly optimized the candidate pool before any heavy semantic scoring occurs.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)

  • New feature (non-breaking change which adds functionality)

  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

  • Documentation update

  • Performance improvement

  • Code refactoring

Related Issues

Changes Made

  • Multi-Key Blocking Indexer: Introduced _build_block_indexes with support for normalized token prefixes and phonetic (Soundex) keys.

  • Deduplicated Pair Generation: Implemented _generate_candidate_pairs to union candidate pairs across overlapping blocks without redundant calculations.

  • Candidate Budgeting: Addressed adversarial latency spikes via _cap_candidate_pairs using a deterministic max_candidates_per_entity limit.

  • Orchestration: Refactored batch_calculate_similarity to use the new v2 pipeline while maintaining legacy compatibility.

Testing

  • Tested locally

  • Added tests for new functionality

  • Package builds successfully (python -m build)

Test Commands

# Run deduplication benchmarks to verify latency drop
pytest benchmarks/quality_assurance/test_deduplication.py

Benchmark Results (V1 vs V2)

Test Case (N=500) | Legacy Mean Time | V2 Mean Time | Improvement -- | -- | -- | --

worst_case | ~4.46s | ~0.88s | 80.2% Reduction

scaling_opt | ~0.28s | ~0.95s | Minimal overhead

Documentation

  • Updated relevant documentation

  • Added code examples if applicable

  • Updated API reference if adding new APIs

  • Updated cookbook if adding new examples

  • No documentation changes needed (internal optimization)

Breaking Changes

Breaking Changes: No (Logic is opt-in via candidate_strategy="blocking_v2". Default behavior remains identical to legacy.)

Checklist

  • My code follows the project's style guidelines

  • I have performed a self-review of my code

  • I have commented my code, particularly in hard-to-understand areas

  • My changes generate no new warnings

  • Package builds successfully

@KaifAhmad1 KaifAhmad1 self-requested a review February 21, 2026 07:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Sub-Issue 1] Candidate Generation v2 to Cut Pair Explosion

1 participant