Added a c++ warper which is used by pyengine to connect with ROCKS #23
Added a c++ warper which is used by pyengine to connect with ROCKS #23Digvijay-x1 merged 4 commits intomainfrom
Conversation
WalkthroughRefactors C++ HTML extraction to return structured content (text + title) and generate 200‑char snippets; updates indexer to store title/snippet; adds a pybind11 RocksDBReader extension and build setup; updates Python ranker to batch-fetch metadata (url, title, snippet) and surfaces them in the search UI. Changes
Sequence Diagram(s)sequenceDiagram
participant CIndexer as C++ Indexer
participant RocksDB as RocksDB (on disk)
participant Reader as rocksdb_client.RocksDBReader
participant Ranker as Python Ranker
participant UI as Web UI
Note over CIndexer: On crawl/parse
CIndexer->>CIndexer: extract_content(HTML) → {text, title}
CIndexer->>CIndexer: generate snippet (200 chars), clean \n/\r
CIndexer->>RocksDB: UPDATE doc SET title, snippet, doc_length
Note over Ranker: On search request
Ranker->>Ranker: compute top_doc_ids
Ranker->>Reader: batch get(metadata_keys for top_doc_ids)
Reader->>RocksDB: read keys
RocksDB-->>Reader: return bytes (url/title/snippet)
Reader-->>Ranker: metadata map
Ranker->>UI: return results with url, title, snippet
UI->>UI: render title as link, show url, display snippet
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45–75 minutes
Possibly related PRs
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (9)
cpp/indexer/src/main.cpp (2)
131-134: Content extraction & snippet generation are correct; consider minor snippet polishUsing
extract_content(output->root)and then derivingplain_text/titleis correct and consistent with the new API. Generating a 200‑character snippet and normalizing\n/\rto spaces is safe and simple.If you want to refine snippet quality later (optional):
- Trim leading/trailing whitespace and collapse multiple spaces.
- Optionally avoid including
<title>text in the snippet, since you already display the title separately.These are UX tweaks, not correctness issues.
Also applies to: 136-141
171-174: DB UPDATE extension looks correct; consider explicit width fordoc_lengthUpdating
doc_length,title, andsnippetin oneUPDATEwithexec_paramsis good and keeps things consistent with the new extraction logic.If
doc_lengthis stored in a narrow integer column,tokens.size()(asize_t) could, in extreme cases, overflow that type. Not urgent, but you might:
- Ensure
documents.doc_lengthisBIGINT(or equivalent), and/or- Cast
tokens.size()to a specific width type that matches the column.Functionally this is fine as‑is.
cpp/indexer/src/utils.cpp (2)
37-61: Recursive extraction is correct; consider hardening TITLE handling and spacingThe recursion over
GumboNodeis sound: it skips SCRIPT/STYLE, accumulates text, and captures the<title>text without risking crashes.A couple of non-blocking refinements you might consider:
- TITLE robustness: Right now you only inspect
children.data[0]. On documents where the first child is whitespace or a non‑text node,content.titlewill stay empty even if later children hold the real title. Iterating children until you find the first non‑emptyGUMBO_NODE_TEXTwould be more robust.- TITLE duplication in body text: Because the
<title>children are still traversed recursively, the title text ends up both incontent.titleand incontent.text. That may be fine (it boosts title terms in the index), but if you want the snippet/body text to exclude the title, you could skip adding<title>’s text tocontent.textwhile still populatingcontent.title.These are behavior/quality tweaks, not correctness issues.
63-66: Wrapper around extract_content_recursive is clean; consider internal linkage for helperThe
extract_contentwrapper is straightforward and matches the header declaration and downstream usage inmain.cpp. Return-by-value is fine here and will benefit from RVO.To avoid exporting unnecessary symbols from this TU, you could give
extract_content_recursiveinternal linkage (e.g., put it in an anonymous namespace or mark itstatic) since it’s only used insideutils.cpp. Optional, but keeps the public surface smaller.python/ranker/requirements.txt (1)
3-3: Consider pinningpybind11for reproducible builds.Since the C++ extension build depends on
pybind11, leaving it unpinned can cause surprise build or ABI issues when the library updates. Consider constraining to a tested version/range to keep Docker and local builds stable.python/ranker/setup.py (1)
1-18: Extension setup is fine; consider extra pybind11 include and basic package metadata.The
rocksdb_clientExtension definition is minimal and should work, but for robustness you may want to include bothpybind11.get_include()andpybind11.get_include(user=True)and add basic metadata (version, description) tosetup()so this wheel is easier to manage if you ever publish or reuse it.python/ranker/engine.py (2)
7-13: Be explicit about when you want to fall back to the mock index.The
ROCKSDB_AVAILABLE/self.index_dbchecks plus the mock index fallback are great for local dev, but in a production setting this would silently serve stubbed results if the extension fails to import or RocksDB fails to open. Consider gating the mock behavior behind an environment flag or raising hard in non‑dev environments so search failures don’t masquerade as low‑quality results.Also applies to: 38-55
206-247: Batch metadata fetch is good; consider handling missing rows and narrowing the exception.The single query using
top_doc_idsand themeta_maplookup is a solid optimization, but documents without metadata rows are currently dropped fromresultsentirely; you may want to provide a simple fallback entry (mock URL/title/snippet) instead to preserve ranking integrity. Also, theexcept Exception as earound this block could be narrowed topsycopg2.Error(and perhapsKeyError) so you don’t accidentally hide unrelated bugs.- except Exception as e: - print(f"Error fetching metadata: {e}") + except psycopg2.Error as e: + print(f"Error fetching metadata from Postgres: {e}")python/ranker/rocksdb_client.cpp (1)
9-41: RocksDBReader binding looks correct; consider releasing the GIL during reads.The wrapper’s lifecycle and error handling are solid. If you expect concurrent queries, you can optionally release the Python GIL around the blocking
db->Getcall so multiple threads can issue reads in parallel:- py::object get(const py::bytes& key) { + py::object get(const py::bytes& key) { if (!is_open) return py::none(); std::string key_str = key; std::string value; - rocksdb::Status status = db->Get(rocksdb::ReadOptions(), key_str, &value); + rocksdb::Status status; + { + py::gil_scoped_release release; + status = db->Get(rocksdb::ReadOptions(), key_str, &value); + }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
API/app/views/search/index.html.erb(1 hunks)cpp/indexer/src/main.cpp(2 hunks)cpp/indexer/src/utils.cpp(1 hunks)cpp/indexer/src/utils.hpp(1 hunks)python/ranker/Dockerfile(1 hunks)python/ranker/engine.py(3 hunks)python/ranker/requirements.txt(1 hunks)python/ranker/rocksdb_client.cpp(1 hunks)python/ranker/setup.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
cpp/indexer/src/main.cpp (2)
cpp/indexer/src/utils.cpp (2)
extract_content(63-67)extract_content(63-63)cpp/indexer/src/utils.hpp (1)
extract_content(22-22)
python/ranker/engine.py (1)
python/ranker/rocksdb_client.cpp (2)
RocksDBReader(13-21)RocksDBReader(23-25)
python/ranker/rocksdb_client.cpp (1)
python/ranker/engine.py (1)
close(251-269)
🪛 GitHub Actions: Search Engine CI
python/ranker/Dockerfile
[error] 22-22: Build step failed: cmake ../src; make; ctest --output-on-failure failed due to compile errors in indexer/tests/test_utils.cpp: 'clean_text' is not a member of 'indexer'.
🪛 Ruff (0.14.8)
python/ranker/engine.py
236-236: Do not catch blind exception: Exception
(BLE001)
🔇 Additional comments (4)
cpp/indexer/src/utils.hpp (1)
17-22: ExtractedContent API and declaration look consistent and minimalThe struct and
extract_contentdeclaration are straightforward, match the implementation incpp/indexer/src/utils.cpp, and clearly document the new title extraction behavior. No issues from an API/contract perspective.python/ranker/Dockerfile (2)
24-30:pip3 install .flow looks good; ensure Docker build context matchessetup.pylocation.This Dockerfile assumes the build context is the
python/rankerdirectory so thatCOPY . .andpip3 install .pick upsetup.pyfor the RocksDB extension. If you build from the repo root, this may either fail or install the wrong project; make sure your CI/localdocker buildcommand usespython/rankeras the context.
20-32: CI failure is currently in C++ indexer tests, not this Dockerfile.The reported pipeline error (
clean_textnot a member ofindexerinindexer/tests/test_utils.cpp) is unrelated to these Docker changes but will still block merging. Just flagging that you’ll need to address the indexer test compilation issue elsewhere in this PR or a follow‑up.API/app/views/search/index.html.erb (1)
16-18: Validate and sanitizeresult['url']before using inhrefattribute.The code directly embeds
result['url']into the href without validation. If URLs originate from crawled content or external sources, unsafe schemes likejavascript:could enable XSS attacks. Implement server-side URL scheme validation (restricting tohttp/https) before passing data to the view, or use Rails helpers likelink_towith theurlparameter, which provides built-in sanitization.
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (1)
cpp/indexer/tests/test_utils.cpp (1)
44-53: Consider renaming test functions to match the new API.The comment at line 44 correctly reflects the new
extract_contentAPI, but the function nametest_clean_text_simpleis stale. This inconsistency also applies totest_clean_text_ignores_scriptandtest_clean_text_ignores_style.-// --- Test: extract_content --- -void test_clean_text_simple() { +// --- Test: extract_content --- +void test_extract_content_simple() {Similar renames for the other functions would improve clarity:
test_clean_text_ignores_script→test_extract_content_ignores_scripttest_clean_text_ignores_style→test_extract_content_ignores_style
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
cpp/indexer/tests/test_utils.cpp(4 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
cpp/indexer/tests/test_utils.cpp (2)
cpp/indexer/src/utils.hpp (1)
extract_content(22-22)cpp/indexer/src/utils.cpp (2)
extract_content(63-67)extract_content(63-63)
🔇 Additional comments (1)
cpp/indexer/tests/test_utils.cpp (1)
75-84: LGTM!Well-structured test that properly validates both
titleandtextfields of the newExtractedContentstruct. Good coverage for the new title extraction feature.
|
@coderabbitai full review . |
|
@Digvijay-x1: I'll perform a full review of all the changes in this PR. ✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (9)
API/app/views/search/index.html.erb (1)
16-18: Result card now correctly uses dynamic URL/title/snippetThe view now cleanly wires
result['url'],result['title'], andresult['snippet']into the UI and should align with the new backend payload shape. If you later decide to open links in a new tab, consider addingtarget="_blank"andrel="noopener noreferrer"to the anchor, but it’s fine as-is.python/ranker/requirements.txt (1)
3-3: pybind11 as a runtime requirement—consider scoping it to build timeAdding
pybind11here works for the Docker build, but it’s primarily a build-time dependency forsetup.py, not something the running service needs. Long term, consider moving it into a proper build-system requirement (e.g.,pyproject.tomlorsetup_requires) and pinning a version for reproducible builds.python/ranker/Dockerfile (1)
28-30: Dockerfile correctly builds the C++ extension; minor image optimizations possibleCopying the source and running
pip3 install .is the right way to ensurerocksdb_clientis compiled into the image. If image size or build caching becomes a concern later, you could add--no-cache-dirto this pip install or move to a multi-stage build, but it’s not required now.python/ranker/setup.py (1)
1-18: pybind11-based rocksdb_client extension definition looks soundThe extension wiring (name, source file, include dir via
pybind11.get_include(),libraries=["rocksdb"], and-std=c++17) is coherent for buildingrocksdb_clientin this repo. For broader reuse, you might later:
- Add basic metadata (version, description) to
setup().- Move pybind11 into a
pyproject.tomlbuild-system section instead of importing it at runtime insetup.py.Not blockers for this PR.
cpp/indexer/tests/test_utils.cpp (1)
44-83: Tests correctly migrated to extract_content and validate title extractionThe switch to
indexer::extract_content(...).textin the existing tests plus the newtest_extract_title()gives good coverage for both body text and<title>handling, and wiring it intomain()ensures it runs with the rest of the suite. If you revisit naming later, you might rename thetest_clean_text_*functions to better reflect the new API, but functionally this is solid.Also applies to: 136-145
cpp/indexer/src/main.cpp (1)
130-141: Indexer now persists title/snippet correctly; snippet generation is intentionally simpleUsing
ExtractedContentto deriveplain_textandtitle, then updatingdoc_length,title, andsnippetin a singleUPDATEis consistent and avoids extra round trips. The 200-character snippet derived fromplain_textwith newline stripping is a reasonable first pass; if you want nicer snippets later, you could cut at the last space before 200 characters to avoid mid-word truncation, but the current approach is functionally fine.Also applies to: 171-175
python/ranker/rocksdb_client.cpp (1)
1-57: RocksDBReader wrapper is well-structured; confirm Python call sites pass bytes keysThe wrapper cleanly:
- Opens RocksDB read-only in the constructor and throws on failure.
- Manages the
rocksdb::DB*lifetime viaclose()and the destructor.- Returns
Nonefor missing keys andbytesfor hits, with errors surfaced asRuntimeError.One integration detail to double-check:
gettakes apy::bytesparameter, so Python callers must passbytes(e.g.,key.encode("utf-8")), not plainstr, or they’ll get a type error. If you expectstrkeys, you could either adjust the signature to acceptstd::stringor handle both types explicitly.python/ranker/engine.py (2)
226-235: Documents missing from DB are silently dropped.If a
doc_idexists insorted_docsbut not inmeta_map(e.g., deleted from DB after indexing, or data sync issues), it's silently excluded from results. This could return fewer thankresults unexpectedly.Consider adding a fallback for missing metadata:
for doc_id, score in sorted_docs: if doc_id in meta_map: meta = meta_map[doc_id] results.append({ "id": doc_id, "url": meta['url'], "score": score, "title": meta['title'] if meta['title'] else meta['url'], "snippet": meta['snippet'] if meta['snippet'] else "No preview available." }) + else: + # Document exists in index but metadata missing + print(f"Warning: Missing metadata for doc_id {doc_id}")
236-237: Consider catching specific database exceptions.Static analysis flagged catching bare
Exception. While this provides robustness, catchingpsycopg2.Errorwould be more precise and avoid masking unexpected errors.- except Exception as e: + except psycopg2.Error as e: print(f"Error fetching metadata: {e}")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
API/app/views/search/index.html.erb(1 hunks)cpp/indexer/src/main.cpp(2 hunks)cpp/indexer/src/utils.cpp(1 hunks)cpp/indexer/src/utils.hpp(1 hunks)cpp/indexer/tests/test_utils.cpp(4 hunks)python/ranker/Dockerfile(1 hunks)python/ranker/engine.py(3 hunks)python/ranker/requirements.txt(1 hunks)python/ranker/rocksdb_client.cpp(1 hunks)python/ranker/setup.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
python/ranker/engine.py (1)
python/ranker/rocksdb_client.cpp (2)
RocksDBReader(13-21)RocksDBReader(23-25)
cpp/indexer/tests/test_utils.cpp (2)
cpp/indexer/src/utils.cpp (2)
extract_content(63-67)extract_content(63-63)cpp/indexer/src/utils.hpp (1)
extract_content(22-22)
python/ranker/rocksdb_client.cpp (1)
python/ranker/engine.py (1)
close(251-269)
🪛 Ruff (0.14.8)
python/ranker/engine.py
236-236: Do not catch blind exception: Exception
(BLE001)
🔇 Additional comments (6)
cpp/indexer/src/utils.hpp (1)
17-23: ExtractedContent API cleanly extends text extraction with a title fieldDefining
ExtractedContentwithtextandtitleand exposingextract_content(GumboNode*)is a clear, minimal extension of the previous API. The comment about clean text plus the “also extracts the title” note makes the intent obvious, and it keeps the header stable for future additions if needed.cpp/indexer/src/utils.cpp (2)
44-59: Title text is duplicated intocontent.text.When a
<title>element is encountered, the title is extracted (lines 44-51), but the function continues to recurse into children (lines 53-59), causing the title text to also be appended tocontent.text. If you intend to keep title separate from body text, add areturnor skip recursion for title elements.Additionally, only the first child text node is captured. Titles like
<title>Part1<span>Part2</span></title>would only capture "Part1".if (node->v.element.tag == GUMBO_TAG_TITLE) { if (node->v.element.children.length > 0) { GumboNode* title_text = static_cast<GumboNode*>(node->v.element.children.data[0]); if (title_text->type == GUMBO_NODE_TEXT) { content.title = title_text->v.text.text; } } + return; // Don't add title text to body content }
63-67: LGTM!Clean wrapper function that initializes the struct and delegates to the recursive helper.
python/ranker/engine.py (3)
7-13: LGTM!Clean migration from
rocksdictto the customRocksDBReaderextension with appropriate fallback handling.
45-48: LGTM!Correct instantiation of
RocksDBReaderwith the path, matching the C++ constructor signature fromrocksdb_client.cpp.
239-247: LGTM!Good fallback handling that provides mock data when the database is unavailable, ensuring the UI degrades gracefully.
Summary by CodeRabbit
New Features
Tests
Chores
✏️ Tip: You can customize this high-level summary in your review settings.