Merged
4 changes: 2 additions & 2 deletions .jules/bolt.md
@@ -11,5 +11,5 @@
 **Action:** When calculating similarity scores or dot products on vectors represented as Python lists, always prefer `map(operator.mul, a, b)` wrapped in `sum()` over list comprehensions or generator expressions.

 ## 2024-08-14 - Optimizing Sparse Dictionary Intersections in Math Hot Loops
-**Learning:** When computing cosine similarity or dot products for sparse dictionaries (like TF-IDF score mappings) in Python, creating sets for key intersection (`set(v1.keys()) & set(v2.keys())`) adds significant overhead due to set allocation and hashing. Iterating directly over the items of the smaller dictionary and checking for key existence in the larger dictionary (`sum(v * v2[k] for k, v in v1.items() if k in v2)`) operates in O(min(N, M)) time with fewer memory allocations and is roughly 30-40% faster in execution time.
-**Action:** Replace `set()` intersection calls with smaller-dictionary iteration logic (`if len(v1) > len(v2): v1, v2 = v2, v1`) when calculating intersections of sparse dicts in tight performance paths.
+**Learning:** When computing cosine similarity or dot products for sparse dictionaries (like TF-IDF score mappings) in Python, creating sets for key intersection (`set(v1.keys()) & set(v2.keys())`) adds significant overhead due to set allocation and hashing. Iterating directly over the items of the smaller dictionary with a single lookup into the larger dictionary (`val = v2.get(k, sentinel)`) avoids double hashing, keeps O(min(N, M)) complexity, and is roughly 30-40% faster in execution time while still handling `0.0` values correctly.
+**Action:** Replace `set()` intersection calls with smaller-dictionary iteration logic (`if len(v1) > len(v2): v1, v2 = v2, v1`) and use a sentinel-backed `dict.get` to keep one lookup per key: `sentinel = object(); dot = sum(v * val for k, v in v1.items() if (val := v2.get(k, sentinel)) is not sentinel)` in tight performance paths.
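A minimal sketch of the sentinel-backed lookup described in the entry above, next to the set-intersection version it replaces. The names `dot_set_intersection`, `dot_sentinel`, and `_MISSING` are hypothetical, chosen for illustration, and are not taken from the repository. No timing claim is made here; the sketch only shows that the two forms agree.

```python
from typing import Dict

# Unique sentinel object (hypothetical name); it can never be a stored float.
_MISSING = object()


def dot_set_intersection(v1: Dict[str, float], v2: Dict[str, float]) -> float:
    """Baseline: build two sets, intersect them, then index both dicts."""
    return sum(v1[k] * v2[k] for k in set(v1.keys()) & set(v2.keys()))


def dot_sentinel(v1: Dict[str, float], v2: Dict[str, float]) -> float:
    """Iterate the smaller dict and do one `get` per key into the larger one."""
    if len(v1) > len(v2):
        v1, v2 = v2, v1
    dot = 0.0
    for k, v in v1.items():
        other = v2.get(k, _MISSING)
        if other is not _MISSING:  # a stored 0.0 still counts as present
            dot += v * other
    return dot
```

The identity check against a private sentinel distinguishes a missing key from any stored value, including `0.0`, which a truthiness test on the result of `v2.get(k)` would not.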
6 changes: 4 additions & 2 deletions app/rag/simple_index.py
@@ -278,10 +278,12 @@ def cosine_similarity(v1: Dict[str, float], v2: Dict[str, float]) -> float:
     if len(v1) > len(v2):
         v1, v2 = v2, v1

+    missing = object()
     dot = 0.0
     for k, v in v1.items():
-        if k in v2:
-            dot += v * v2[k]
+        v2_val = v2.get(k, missing)
+        if v2_val is not missing:
+            dot += v * v2_val

     # Bolt Optimization: math.hypot is ~5x faster than math.sqrt(sum(v*v))
     norm1 = math.hypot(*v1.values())
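For context, a self-contained sketch of how the pieces in this hunk could fit together. The function name `cosine_similarity_sketch`, the zero-norm guard, and the final division are assumptions, since the rest of the function is not shown in the diff. `math.hypot` with more than two arguments requires Python 3.8 or later.

```python
import math
from typing import Dict


def cosine_similarity_sketch(v1: Dict[str, float], v2: Dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors stored as dicts (illustrative only)."""
    # Iterate the smaller dict so the loop runs min(len(v1), len(v2)) times.
    if len(v1) > len(v2):
        v1, v2 = v2, v1

    missing = object()  # sentinel distinguishing "absent" from any stored value
    dot = 0.0
    for k, v in v1.items():
        v2_val = v2.get(k, missing)  # single lookup per key
        if v2_val is not missing:
            dot += v * v2_val

    # Euclidean norms; hypot() with no arguments returns 0.0 for an empty dict.
    norm1 = math.hypot(*v1.values())
    norm2 = math.hypot(*v2.values())

    # Assumed guard: treat similarity with a zero vector as 0.0.
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)
```

Swapping `v1` and `v2` is safe here because both the dot product and the product of the norms are symmetric in their arguments.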