Skip to content

Added a c++ warper which is used by pyengine to connect with ROCKS #23

Merged
Digvijay-x1 merged 4 commits intomainfrom
feat/title-and-snippit
Dec 10, 2025
Merged

Added a c++ warper which is used by pyengine to connect with ROCKS #23
Digvijay-x1 merged 4 commits intomainfrom
feat/title-and-snippit

Conversation

@Digvijay-x1
Copy link
Owner

@Digvijay-x1 Digvijay-x1 commented Dec 10, 2025

Summary by CodeRabbit

  • New Features

    • Search results now show clickable titles linking to each result, green-formatted URLs, and page snippets for richer previews.
    • Indexer now extracts page titles and generates 200-character snippets for improved result previews.
    • Ranking batches metadata lookups so top results reliably include URL, title (fallback to URL), and snippet.
  • Tests

    • Added tests to verify title extraction and snippet generation.
  • Chores

    • Build updated to include a native RocksDB reader extension.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link

coderabbitai bot commented Dec 10, 2025

Walkthrough

Refactors C++ HTML extraction to return structured content (text + title) and generate 200‑char snippets; updates indexer to store title/snippet; adds a pybind11 RocksDBReader extension and build setup; updates Python ranker to batch-fetch metadata (url, title, snippet) and surfaces them in the search UI.

Changes

Cohort / File(s) Summary
Search Results Display
API/app/views/search/index.html.erb
Replace placeholder link/text with dynamic fields: title links to result['url'], display result['url'], and render result['snippet']; remove static Doc ID and placeholder snippet.
C++ Content Extraction (API)
cpp/indexer/src/utils.hpp
Remove clean_text(); add struct ExtractedContent { std::string text; std::string title; } and ExtractedContent extract_content(GumboNode* node);.
C++ Content Extraction (Impl)
cpp/indexer/src/utils.cpp
Implement extract_content_recursive(...) and extract_content(...) that accumulate text, capture TITLE element as title, insert spaces between siblings, and preserve SCRIPT/STYLE exclusions.
Indexer Integration
cpp/indexer/src/main.cpp
Use extract_content(); generate 200‑char snippet (clean \n/\r), capture title, and UPDATE DB with doc_length, title, and snippet in one statement; adjust flow and resource cleanup.
C++ Tests
cpp/indexer/tests/test_utils.cpp
Replace clean_text() calls with extract_content(...).text; add test_extract_title() asserting title and text; invoke new test.
Python RocksDB Binding
python/ranker/rocksdb_client.cpp, python/ranker/setup.py
Add pybind11 module rocksdb_client exposing RocksDBReader (ctor, get, close, dtor); setup.py defines extension build using pybind11, C++17, linking RocksDB.
Ranker Engine & Build
python/ranker/engine.py, python/ranker/requirements.txt, python/ranker/Dockerfile
Replace rocksdict usage with RocksDBReader; batch-fetch metadata for top doc IDs into a meta_map and populate result url, title (fallback), and snippet (fallback); add pybind11 to requirements and update Dockerfile to COPY project and run pip3 install . to build the extension.

Sequence Diagram(s)

sequenceDiagram
    participant CIndexer as C++ Indexer
    participant RocksDB as RocksDB (on disk)
    participant Reader as rocksdb_client.RocksDBReader
    participant Ranker as Python Ranker
    participant UI as Web UI

    Note over CIndexer: On crawl/parse
    CIndexer->>CIndexer: extract_content(HTML) → {text, title}
    CIndexer->>CIndexer: generate snippet (200 chars), clean \n/\r
    CIndexer->>RocksDB: UPDATE doc SET title, snippet, doc_length

    Note over Ranker: On search request
    Ranker->>Ranker: compute top_doc_ids
    Ranker->>Reader: batch get(metadata_keys for top_doc_ids)
    Reader->>RocksDB: read keys
    RocksDB-->>Reader: return bytes (url/title/snippet)
    Reader-->>Ranker: metadata map
    Ranker->>UI: return results with url, title, snippet
    UI->>UI: render title as link, show url, display snippet
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45–75 minutes

  • Inspect C++ memory/resource handling and exception safety around Gumbo and the new extractor.
  • Verify TITLE extraction and spacing logic in utils.cpp to avoid content loss/duplication.
  • Review RocksDBReader lifecycle, error handling, and pybind11 bindings in python/ranker/rocksdb_client.cpp and setup.py.
  • Validate ranker batch-fetch logic, fallbacks, and Docker build step for the extension.

Possibly related PRs

Poem

🐰 I nibble HTML bits with glee,
Titles and snippets set free for thee.
Batch the metadata, hop to the store,
Serve tidy results — now oh so much more.
A happy rabbit, hopping to explore! 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 15.79% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title mentions adding a C++ wrapper for ROCKS connection, which aligns with the rocksdb_client.cpp addition and setup changes, but significantly understates the broader refactoring including content extraction improvements, title/snippet extraction, and database updates across multiple files.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feat/title-and-snippit

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (9)
cpp/indexer/src/main.cpp (2)

131-134: Content extraction & snippet generation are correct; consider minor snippet polish

Using extract_content(output->root) and then deriving plain_text/title is correct and consistent with the new API. Generating a 200‑character snippet and normalizing \n/\r to spaces is safe and simple.

If you want to refine snippet quality later (optional):

  • Trim leading/trailing whitespace and collapse multiple spaces.
  • Optionally avoid including <title> text in the snippet, since you already display the title separately.

These are UX tweaks, not correctness issues.

Also applies to: 136-141


171-174: DB UPDATE extension looks correct; consider explicit width for doc_length

Updating doc_length, title, and snippet in one UPDATE with exec_params is good and keeps things consistent with the new extraction logic.

If doc_length is stored in a narrow integer column, tokens.size() (a size_t) could, in extreme cases, overflow that type. Not urgent, but you might:

  • Ensure documents.doc_length is BIGINT (or equivalent), and/or
  • Cast tokens.size() to a specific width type that matches the column.

Functionally this is fine as‑is.

cpp/indexer/src/utils.cpp (2)

37-61: Recursive extraction is correct; consider hardening TITLE handling and spacing

The recursion over GumboNode is sound: it skips SCRIPT/STYLE, accumulates text, and captures the <title> text without risking crashes.

A couple of non-blocking refinements you might consider:

  • TITLE robustness: Right now you only inspect children.data[0]. On documents where the first child is whitespace or a non‑text node, content.title will stay empty even if later children hold the real title. Iterating children until you find the first non‑empty GUMBO_NODE_TEXT would be more robust.
  • TITLE duplication in body text: Because the <title> children are still traversed recursively, the title text ends up both in content.title and in content.text. That may be fine (it boosts title terms in the index), but if you want the snippet/body text to exclude the title, you could skip adding <title>’s text to content.text while still populating content.title.

These are behavior/quality tweaks, not correctness issues.


63-66: Wrapper around extract_content_recursive is clean; consider internal linkage for helper

The extract_content wrapper is straightforward and matches the header declaration and downstream usage in main.cpp. Return-by-value is fine here and will benefit from RVO.

To avoid exporting unnecessary symbols from this TU, you could give extract_content_recursive internal linkage (e.g., put it in an anonymous namespace or mark it static) since it’s only used inside utils.cpp. Optional, but keeps the public surface smaller.

python/ranker/requirements.txt (1)

3-3: Consider pinning pybind11 for reproducible builds.

Since the C++ extension build depends on pybind11, leaving it unpinned can cause surprise build or ABI issues when the library updates. Consider constraining to a tested version/range to keep Docker and local builds stable.

python/ranker/setup.py (1)

1-18: Extension setup is fine; consider extra pybind11 include and basic package metadata.

The rocksdb_client Extension definition is minimal and should work, but for robustness you may want to include both pybind11.get_include() and pybind11.get_include(user=True) and add basic metadata (version, description) to setup() so this wheel is easier to manage if you ever publish or reuse it.

python/ranker/engine.py (2)

7-13: Be explicit about when you want to fall back to the mock index.

The ROCKSDB_AVAILABLE/self.index_db checks plus the mock index fallback are great for local dev, but in a production setting this would silently serve stubbed results if the extension fails to import or RocksDB fails to open. Consider gating the mock behavior behind an environment flag or raising hard in non‑dev environments so search failures don’t masquerade as low‑quality results.

Also applies to: 38-55


206-247: Batch metadata fetch is good; consider handling missing rows and narrowing the exception.

The single query using top_doc_ids and the meta_map lookup is a solid optimization, but documents without metadata rows are currently dropped from results entirely; you may want to provide a simple fallback entry (mock URL/title/snippet) instead to preserve ranking integrity. Also, the except Exception as e around this block could be narrowed to psycopg2.Error (and perhaps KeyError) so you don’t accidentally hide unrelated bugs.

-            except Exception as e:
-                print(f"Error fetching metadata: {e}")
+            except psycopg2.Error as e:
+                print(f"Error fetching metadata from Postgres: {e}")
python/ranker/rocksdb_client.cpp (1)

9-41: RocksDBReader binding looks correct; consider releasing the GIL during reads.

The wrapper’s lifecycle and error handling are solid. If you expect concurrent queries, you can optionally release the Python GIL around the blocking db->Get call so multiple threads can issue reads in parallel:

-    py::object get(const py::bytes& key) {
+    py::object get(const py::bytes& key) {
         if (!is_open) return py::none();
         
         std::string key_str = key;
         std::string value;
-        rocksdb::Status status = db->Get(rocksdb::ReadOptions(), key_str, &value);
+        rocksdb::Status status;
+        {
+            py::gil_scoped_release release;
+            status = db->Get(rocksdb::ReadOptions(), key_str, &value);
+        }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 75aeef8 and 65d14fd.

📒 Files selected for processing (9)
  • API/app/views/search/index.html.erb (1 hunks)
  • cpp/indexer/src/main.cpp (2 hunks)
  • cpp/indexer/src/utils.cpp (1 hunks)
  • cpp/indexer/src/utils.hpp (1 hunks)
  • python/ranker/Dockerfile (1 hunks)
  • python/ranker/engine.py (3 hunks)
  • python/ranker/requirements.txt (1 hunks)
  • python/ranker/rocksdb_client.cpp (1 hunks)
  • python/ranker/setup.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
cpp/indexer/src/main.cpp (2)
cpp/indexer/src/utils.cpp (2)
  • extract_content (63-67)
  • extract_content (63-63)
cpp/indexer/src/utils.hpp (1)
  • extract_content (22-22)
python/ranker/engine.py (1)
python/ranker/rocksdb_client.cpp (2)
  • RocksDBReader (13-21)
  • RocksDBReader (23-25)
python/ranker/rocksdb_client.cpp (1)
python/ranker/engine.py (1)
  • close (251-269)
🪛 GitHub Actions: Search Engine CI
python/ranker/Dockerfile

[error] 22-22: Build step failed: cmake ../src; make; ctest --output-on-failure failed due to compile errors in indexer/tests/test_utils.cpp: 'clean_text' is not a member of 'indexer'.

🪛 Ruff (0.14.8)
python/ranker/engine.py

236-236: Do not catch blind exception: Exception

(BLE001)

🔇 Additional comments (4)
cpp/indexer/src/utils.hpp (1)

17-22: ExtractedContent API and declaration look consistent and minimal

The struct and extract_content declaration are straightforward, match the implementation in cpp/indexer/src/utils.cpp, and clearly document the new title extraction behavior. No issues from an API/contract perspective.

python/ranker/Dockerfile (2)

24-30: pip3 install . flow looks good; ensure Docker build context matches setup.py location.

This Dockerfile assumes the build context is the python/ranker directory so that COPY . . and pip3 install . pick up setup.py for the RocksDB extension. If you build from the repo root, this may either fail or install the wrong project; make sure your CI/local docker build command uses python/ranker as the context.


20-32: CI failure is currently in C++ indexer tests, not this Dockerfile.

The reported pipeline error (clean_text not a member of indexer in indexer/tests/test_utils.cpp) is unrelated to these Docker changes but will still block merging. Just flagging that you’ll need to address the indexer test compilation issue elsewhere in this PR or a follow‑up.

API/app/views/search/index.html.erb (1)

16-18: Validate and sanitize result['url'] before using in href attribute.

The code directly embeds result['url'] into the href without validation. If URLs originate from crawled content or external sources, unsafe schemes like javascript: could enable XSS attacks. Implement server-side URL scheme validation (restricting to http/https) before passing data to the view, or use Rails helpers like link_to with the url parameter, which provides built-in sanitization.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
cpp/indexer/tests/test_utils.cpp (1)

44-53: Consider renaming test functions to match the new API.

The comment at line 44 correctly reflects the new extract_content API, but the function name test_clean_text_simple is stale. This inconsistency also applies to test_clean_text_ignores_script and test_clean_text_ignores_style.

-// --- Test: extract_content ---
-void test_clean_text_simple() {
+// --- Test: extract_content ---
+void test_extract_content_simple() {

Similar renames for the other functions would improve clarity:

  • test_clean_text_ignores_scripttest_extract_content_ignores_script
  • test_clean_text_ignores_styletest_extract_content_ignores_style
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 65d14fd and ec1e706.

📒 Files selected for processing (1)
  • cpp/indexer/tests/test_utils.cpp (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
cpp/indexer/tests/test_utils.cpp (2)
cpp/indexer/src/utils.hpp (1)
  • extract_content (22-22)
cpp/indexer/src/utils.cpp (2)
  • extract_content (63-67)
  • extract_content (63-63)
🔇 Additional comments (1)
cpp/indexer/tests/test_utils.cpp (1)

75-84: LGTM!

Well-structured test that properly validates both title and text fields of the new ExtractedContent struct. Good coverage for the new title extraction feature.

@Digvijay-x1
Copy link
Owner Author

@coderabbitai full review .

@coderabbitai
Copy link

coderabbitai bot commented Dec 10, 2025

@Digvijay-x1: I'll perform a full review of all the changes in this PR.

✅ Actions performed

Full review triggered.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (9)
API/app/views/search/index.html.erb (1)

16-18: Result card now correctly uses dynamic URL/title/snippet

The view now cleanly wires result['url'], result['title'], and result['snippet'] into the UI and should align with the new backend payload shape. If you later decide to open links in a new tab, consider adding target="_blank" and rel="noopener noreferrer" to the anchor, but it’s fine as-is.

python/ranker/requirements.txt (1)

3-3: pybind11 as a runtime requirement—consider scoping it to build time

Adding pybind11 here works for the Docker build, but it’s primarily a build-time dependency for setup.py, not something the running service needs. Long term, consider moving it into a proper build-system requirement (e.g., pyproject.toml or setup_requires) and pinning a version for reproducible builds.

python/ranker/Dockerfile (1)

28-30: Dockerfile correctly builds the C++ extension; minor image optimizations possible

Copying the source and running pip3 install . is the right way to ensure rocksdb_client is compiled into the image. If image size or build caching becomes a concern later, you could add --no-cache-dir to this pip install or move to a multi-stage build, but it’s not required now.

python/ranker/setup.py (1)

1-18: pybind11-based rocksdb_client extension definition looks sound

The extension wiring (name, source file, include dir via pybind11.get_include(), libraries=["rocksdb"], and -std=c++17) is coherent for building rocksdb_client in this repo. For broader reuse, you might later:

  • Add basic metadata (version, description) to setup().
  • Move pybind11 into a pyproject.toml build-system section instead of importing it at runtime in setup.py.

Not blockers for this PR.

cpp/indexer/tests/test_utils.cpp (1)

44-83: Tests correctly migrated to extract_content and validate title extraction

The switch to indexer::extract_content(...).text in the existing tests plus the new test_extract_title() gives good coverage for both body text and <title> handling, and wiring it into main() ensures it runs with the rest of the suite. If you revisit naming later, you might rename the test_clean_text_* functions to better reflect the new API, but functionally this is solid.

Also applies to: 136-145

cpp/indexer/src/main.cpp (1)

130-141: Indexer now persists title/snippet correctly; snippet generation is intentionally simple

Using ExtractedContent to derive plain_text and title, then updating doc_length, title, and snippet in a single UPDATE is consistent and avoids extra round trips. The 200-character snippet derived from plain_text with newline stripping is a reasonable first pass; if you want nicer snippets later, you could cut at the last space before 200 characters to avoid mid-word truncation, but the current approach is functionally fine.

Also applies to: 171-175

python/ranker/rocksdb_client.cpp (1)

1-57: RocksDBReader wrapper is well-structured; confirm Python call sites pass bytes keys

The wrapper cleanly:

  • Opens RocksDB read-only in the constructor and throws on failure.
  • Manages the rocksdb::DB* lifetime via close() and the destructor.
  • Returns None for missing keys and bytes for hits, with errors surfaced as RuntimeError.

One integration detail to double-check: get takes a py::bytes parameter, so Python callers must pass bytes (e.g., key.encode("utf-8")), not plain str, or they’ll get a type error. If you expect str keys, you could either adjust the signature to accept std::string or handle both types explicitly.

python/ranker/engine.py (2)

226-235: Documents missing from DB are silently dropped.

If a doc_id exists in sorted_docs but not in meta_map (e.g., deleted from DB after indexing, or data sync issues), it's silently excluded from results. This could return fewer than k results unexpectedly.

Consider adding a fallback for missing metadata:

                     for doc_id, score in sorted_docs:
                         if doc_id in meta_map:
                             meta = meta_map[doc_id]
                             results.append({
                                 "id": doc_id,
                                 "url": meta['url'],
                                 "score": score,
                                 "title": meta['title'] if meta['title'] else meta['url'],
                                 "snippet": meta['snippet'] if meta['snippet'] else "No preview available."
                             })
+                        else:
+                            # Document exists in index but metadata missing
+                            print(f"Warning: Missing metadata for doc_id {doc_id}")

236-237: Consider catching specific database exceptions.

Static analysis flagged catching bare Exception. While this provides robustness, catching psycopg2.Error would be more precise and avoid masking unexpected errors.

-            except Exception as e:
+            except psycopg2.Error as e:
                 print(f"Error fetching metadata: {e}")
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 75aeef8 and ec1e706.

📒 Files selected for processing (10)
  • API/app/views/search/index.html.erb (1 hunks)
  • cpp/indexer/src/main.cpp (2 hunks)
  • cpp/indexer/src/utils.cpp (1 hunks)
  • cpp/indexer/src/utils.hpp (1 hunks)
  • cpp/indexer/tests/test_utils.cpp (4 hunks)
  • python/ranker/Dockerfile (1 hunks)
  • python/ranker/engine.py (3 hunks)
  • python/ranker/requirements.txt (1 hunks)
  • python/ranker/rocksdb_client.cpp (1 hunks)
  • python/ranker/setup.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
python/ranker/engine.py (1)
python/ranker/rocksdb_client.cpp (2)
  • RocksDBReader (13-21)
  • RocksDBReader (23-25)
cpp/indexer/tests/test_utils.cpp (2)
cpp/indexer/src/utils.cpp (2)
  • extract_content (63-67)
  • extract_content (63-63)
cpp/indexer/src/utils.hpp (1)
  • extract_content (22-22)
python/ranker/rocksdb_client.cpp (1)
python/ranker/engine.py (1)
  • close (251-269)
🪛 Ruff (0.14.8)
python/ranker/engine.py

236-236: Do not catch blind exception: Exception

(BLE001)

🔇 Additional comments (6)
cpp/indexer/src/utils.hpp (1)

17-23: ExtractedContent API cleanly extends text extraction with a title field

Defining ExtractedContent with text and title and exposing extract_content(GumboNode*) is a clear, minimal extension of the previous API. The comment about clean text plus the “also extracts the title” note makes the intent obvious, and it keeps the header stable for future additions if needed.

cpp/indexer/src/utils.cpp (2)

44-59: Title text is duplicated into content.text.

When a <title> element is encountered, the title is extracted (lines 44-51), but the function continues to recurse into children (lines 53-59), causing the title text to also be appended to content.text. If you intend to keep title separate from body text, add a return or skip recursion for title elements.

Additionally, only the first child text node is captured. Titles like <title>Part1<span>Part2</span></title> would only capture "Part1".

         if (node->v.element.tag == GUMBO_TAG_TITLE) {
             if (node->v.element.children.length > 0) {
                 GumboNode* title_text = static_cast<GumboNode*>(node->v.element.children.data[0]);
                 if (title_text->type == GUMBO_NODE_TEXT) {
                     content.title = title_text->v.text.text;
                 }
             }
+            return; // Don't add title text to body content
         }

63-67: LGTM!

Clean wrapper function that initializes the struct and delegates to the recursive helper.

python/ranker/engine.py (3)

7-13: LGTM!

Clean migration from rocksdict to the custom RocksDBReader extension with appropriate fallback handling.


45-48: LGTM!

Correct instantiation of RocksDBReader with the path, matching the C++ constructor signature from rocksdb_client.cpp.


239-247: LGTM!

Good fallback handling that provides mock data when the database is unavailable, ensuring the UI degrades gracefully.

@Digvijay-x1 Digvijay-x1 merged commit d6d675c into main Dec 10, 2025
12 checks passed
@Digvijay-x1 Digvijay-x1 deleted the feat/title-and-snippit branch December 10, 2025 17:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant