Streamline sparse cosine dot-product lookups #195

Merged
madara88645 merged 2 commits into bolt-tfidf-sparse-dict-intersection-5659582442749844468 from codex/sub-pr-193 on Mar 18, 2026

Codex AI commented Mar 18, 2026

Frequent TF-IDF cosine similarity calls were still performing two dictionary lookups per key in the hot loop (a membership test followed by a fetch), even after the earlier intersection optimization, which added needless hashing overhead.

  • cosine_similarity: Iterate the smaller TF-IDF dict and use a sentinel-backed get to collapse membership + fetch into one lookup while preserving zero-value correctness.
  • Bolt note: Document the single-lookup sparse-dict pattern as the recommended approach for hot paths.

Example:

missing = object()
dot = 0.0
for k, v in v1.items():  # v1 is the smaller dict
    v2_val = v2.get(k, missing)
    if v2_val is not missing:
        dot += v * v2_val
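
For context, here is a minimal sketch of how the loop above might sit inside a complete function. The actual body of cosine_similarity in app/rag/simple_index.py is not shown in this thread, so the signature, the empty-input handling, and the norm computation below are assumptions made for illustration only.

```python
import math

# Unique sentinel: it can never be equal to (or identical with) a stored weight,
# so a key present with value 0.0 is still distinguished from a missing key.
_MISSING = object()


def cosine_similarity(v1: dict[str, float], v2: dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors stored as term -> weight dicts."""
    if not v1 or not v2:
        return 0.0  # assumed convention for empty vectors

    # Iterate the smaller dict so the loop runs min(len(v1), len(v2)) times.
    if len(v1) > len(v2):
        v1, v2 = v2, v1

    dot = 0.0
    for k, v in v1.items():
        v2_val = v2.get(k, _MISSING)  # one lookup: membership + fetch
        if v2_val is not _MISSING:
            dot += v * v2_val

    norm1 = math.sqrt(sum(x * x for x in v1.values()))
    norm2 = math.sqrt(sum(x * x for x in v2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # assumed convention for all-zero vectors
    return dot / (norm1 * norm2)
```

Because the swap only changes which dict is iterated, the result is symmetric in its two arguments.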

vercel bot commented Mar 18, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| compiler | Ready | Preview, Comment | Mar 18, 2026 8:37pm |

Codex AI commented Mar 18, 2026

Just as a heads up, I was blocked by some firewall rules while working on your feedback. Expand below for details.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • openaipublic.blob.core.windows.net
    • Triggering command: /usr/bin/python python -m pytest (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Codex AI changed the title from "[WIP] [performance improvement] Optimize TF-IDF cosine similarity" to "Streamline sparse cosine dot-product lookups" on Mar 18, 2026
Codex AI requested a review from madara88645 on March 18, 2026 20:37
madara88645 marked this pull request as ready for review on March 18, 2026 20:37
Copilot AI review requested due to automatic review settings on March 18, 2026 20:37
madara88645 merged commit 6ff01af into bolt-tfidf-sparse-dict-intersection-5659582442749844468 on Mar 18, 2026
4 checks passed
madara88645 deleted the codex/sub-pr-193 branch on March 18, 2026 20:38

Copilot AI left a comment

Pull request overview

This PR optimizes the TF-IDF cosine similarity hot loop used by semantic chunking by reducing sparse-vector dot-product lookups from two dictionary operations per key (an in membership test followed by [] indexing) to a single sentinel-backed dict.get, and documents the pattern for future use.

Changes:

  • Update cosine_similarity to use a sentinel-backed v2.get(k, sentinel) during sparse dot-product accumulation.
  • Update .jules/bolt.md to recommend the single-lookup sparse-dict intersection pattern for performance-critical paths.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| app/rag/simple_index.py | Collapses membership + fetch into one dict lookup per key in the sparse dot-product loop. |
| .jules/bolt.md | Documents the sentinel-backed dict.get pattern as the preferred sparse intersection optimization. |
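
The difference between the two-operation form and the single dict.get can be made visible with a small experiment. The sketch below is illustrative only: CountingKey and both helper functions are hypothetical names, not code from this PR. It counts how often a key is hashed on CPython. Real TF-IDF keys are strings, whose hashes CPython caches, so in practice the saving comes from skipping the second table probe rather than from re-hashing.

```python
class CountingKey:
    """Hypothetical key type that counts how often it is hashed."""

    hash_calls = 0

    def __init__(self, name):
        self.name = name

    def __hash__(self):
        CountingKey.hash_calls += 1
        return hash(self.name)

    def __eq__(self, other):
        return isinstance(other, CountingKey) and self.name == other.name


def dot_two_lookups(v1, v2):
    """Membership test followed by indexing: two lookups per shared key."""
    dot = 0.0
    for k, v in v1.items():
        if k in v2:
            dot += v * v2[k]
    return dot


def dot_single_lookup(v1, v2):
    """Sentinel-backed get: one lookup per key, shared or not."""
    missing = object()
    dot = 0.0
    for k, v in v1.items():
        v2_val = v2.get(k, missing)
        if v2_val is not missing:
            dot += v * v2_val
    return dot


a, b, c, d = (CountingKey(n) for n in "abcd")
v1 = {a: 1.0, b: 2.0, c: 3.0}
v2 = {a: 0.5, c: 0.0, d: 4.0}  # c is present with a stored weight of 0.0

CountingKey.hash_calls = 0
two = dot_two_lookups(v1, v2)
two_hashes = CountingKey.hash_calls

CountingKey.hash_calls = 0
one = dot_single_lookup(v1, v2)
one_hashes = CountingKey.hash_calls

print(two, one, two_hashes, one_hashes)
```

Both functions return the same dot product, including for the key stored with weight 0.0, because the sentinel is compared by identity rather than by truthiness.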

Comment on lines +281 to 283
missing = object()
dot = 0.0
for k, v in v1.items():
madara88645 added a commit that referenced this pull request Mar 18, 2026
…ictionary intersection (#193)

* Optimize TF-IDF cosine similarity dictionary intersection

Avoid allocating expensive sets and performing set intersection when calculating
cosine similarity for sparse dictionaries (like TF-IDF scores). Instead, iterate over the smaller
dictionary items directly checking for key existence in the larger dictionary.
This provides roughly a 30-40% speed-up in this calculation hot loop during chunking.

* Streamline sparse cosine dot-product lookups (#195)

* Initial plan

* Optimize sparse cosine similarity lookups

---------

Co-authored-by: openai-code-agent[bot] <242516109+Codex@users.noreply.github.com>

---------

Co-authored-by: Codex <242516109+Codex@users.noreply.github.com>
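
The change described in the #193 commit message above, moving from set intersection to iterating the smaller dictionary, can be sketched as follows. Neither function is taken from the repository: the set-based version is a reconstruction of what the message says was replaced, and both function names are hypothetical.

```python
def dot_via_set_intersection(v1, v2):
    """Assumed earlier approach: build two key sets, intersect, then index both dicts."""
    common = set(v1) & set(v2)
    return sum(v1[k] * v2[k] for k in common)


def dot_via_smaller_dict(v1, v2):
    """Approach described in the commit message: no set allocation."""
    # Iterate whichever dict has fewer entries and probe the larger one.
    if len(v1) > len(v2):
        v1, v2 = v2, v1
    dot = 0.0
    for k, v in v1.items():
        if k in v2:
            dot += v * v2[k]
    return dot


doc_a = {"sparse": 0.5, "cosine": 2.0, "tfidf": 1.0}
doc_b = {"cosine": 0.25, "tfidf": 4.0, "index": 1.0, "chunk": 3.0}

set_based = dot_via_set_intersection(doc_a, doc_b)
loop_based = dot_via_smaller_dict(doc_a, doc_b)
print(set_based, loop_based)
```

The second version still performs the membership test plus fetch per shared key, which is the remaining double lookup that #195 collapses into a single sentinel-backed get.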