diff --git a/.jules/bolt.md b/.jules/bolt.md
index 4ee2ce6..4cbfc58 100644
--- a/.jules/bolt.md
+++ b/.jules/bolt.md
@@ -11,5 +11,5 @@
 **Action:** When calculating similarity scores or dot products on vectors represented as Python lists, always prefer `map(operator.mul, a, b)` wrapped in `sum()` over list comprehensions or generator expressions.
 
 ## 2024-08-14 - Optimizing Sparse Dictionary Intersections in Math Hot Loops
-**Learning:** When computing cosine similarity or dot products for sparse dictionaries (like TF-IDF score mappings) in Python, creating sets for key intersection (`set(v1.keys()) & set(v2.keys())`) adds significant overhead due to set allocation and hashing. Iterating directly over the items of the smaller dictionary and checking for key existence in the larger dictionary (`sum(v * v2[k] for k, v in v1.items() if k in v2)`) operates in O(min(N, M)) time with fewer memory allocations and is roughly 30-40% faster in execution time.
-**Action:** Replace `set()` intersection calls with smaller-dictionary iteration logic (`if len(v1) > len(v2): v1, v2 = v2, v1`) when calculating intersections of sparse dicts in tight performance paths.
+**Learning:** When computing cosine similarity or dot products for sparse dictionaries (like TF-IDF score mappings) in Python, creating sets for key intersection (`set(v1.keys()) & set(v2.keys())`) adds significant overhead due to set allocation and hashing. Iterating directly over the items of the smaller dictionary with a single lookup into the larger dictionary (`val = v2.get(k, sentinel)`) avoids double hashing, keeps O(min(N, M)) complexity, and is roughly 30-40% faster in execution time while still handling `0.0` values correctly.
+**Action:** Replace `set()` intersection calls with smaller-dictionary iteration logic (`if len(v1) > len(v2): v1, v2 = v2, v1`) and use a sentinel-backed `dict.get` to keep one lookup per key: `sentinel = object(); dot = sum(v * val for k, v in v1.items() if (val := v2.get(k, sentinel)) is not sentinel)` in tight performance paths.
diff --git a/app/rag/simple_index.py b/app/rag/simple_index.py
index 59fc17a..0add42f 100644
--- a/app/rag/simple_index.py
+++ b/app/rag/simple_index.py
@@ -278,10 +278,12 @@ def cosine_similarity(v1: Dict[str, float], v2: Dict[str, float]) -> float:
     if len(v1) > len(v2):
         v1, v2 = v2, v1
 
+    missing = object()
     dot = 0.0
     for k, v in v1.items():
-        if k in v2:
-            dot += v * v2[k]
+        v2_val = v2.get(k, missing)
+        if v2_val is not missing:
+            dot += v * v2_val
 
     # Bolt Optimization: math.hypot is ~5x faster than math.sqrt(sum(v*v))
     norm1 = math.hypot(*v1.values())
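
Review note: the sentinel-based lookup in this change can be exercised on its own with a sketch like the one below. It is not code from the repository. The names `sparse_dot`, `sparse_dot_sets`, and `_MISSING` are hypothetical, chosen only for illustration, and the set-intersection version is included purely as a reference to compare results against.

```python
from typing import Dict

# Hypothetical sentinel: a unique object that can never be a stored score.
_MISSING = object()


def sparse_dot(v1: Dict[str, float], v2: Dict[str, float]) -> float:
    """Dot product over the keys shared by two sparse vectors."""
    # Iterate the smaller mapping so the loop runs min(len(v1), len(v2)) times.
    if len(v1) > len(v2):
        v1, v2 = v2, v1
    dot = 0.0
    for key, value in v1.items():
        # One lookup per key instead of `key in v2` followed by `v2[key]`.
        other = v2.get(key, _MISSING)
        if other is not _MISSING:
            dot += value * other
    return dot


def sparse_dot_sets(v1: Dict[str, float], v2: Dict[str, float]) -> float:
    """Reference version that builds an explicit key-set intersection."""
    return sum(v1[k] * v2[k] for k in set(v1.keys()) & set(v2.keys()))


a = {"cat": 0.5, "dog": 0.0, "emu": 2.0}
b = {"dog": 3.0, "emu": 0.25, "fox": 1.0, "gnu": 4.0}
print(sparse_dot(a, b))
```

The identity check against the sentinel (`is not _MISSING`) is what lets a stored `0.0` count as present. A truthiness test on the result of `v2.get(key)` would skip such entries, which is harmless for a dot product but would be wrong for any logic that counts or records shared keys.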