Skip to content

Comments

Basic ranker Implementation#19

Merged
Digvijay-x1 merged 9 commits intomainfrom
basic-ranker
Dec 9, 2025
Merged

Basic ranker Implementation#19
Digvijay-x1 merged 9 commits intomainfrom
basic-ranker

Conversation

@Digvijay-x1
Copy link
Owner

@Digvijay-x1 Digvijay-x1 commented Dec 8, 2025

Added ranker logic based on BM25 method in engine.py .
Added RocksDb

Summary by CodeRabbit

  • New Features

    • Production BM25 ranker integrated for search; responses include result count and latency.
    • Health endpoint reports ranker readiness (may show degraded if unavailable).
    • App root now directs users to the search page.
  • Infrastructure

    • Containers and Python runtime standardized to Ubuntu 22.04; image and runtime updates applied.
    • Services use environment variables for DB credentials and share a mounted crawl-data volume.
    • Runtime deps expanded to include RocksDB support, psycopg2, numpy and Cython.
  • CI

    • CI installs native search/storage libraries and builds Python dependencies accordingly.
  • Documentation

    • Added example .env and setup/run instructions to README.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link

coderabbitai bot commented Dec 8, 2025

Walkthrough

Activates API root to search#index, updates several Docker bases to Ubuntu 22.04, injects DB envs and a shared volume in docker-compose, adds a BM25 Ranker (RocksDB + Postgres with in-memory fallbacks) integrated into the Flask app, and updates Python deps and CI native-lib install steps.

Changes

Cohort / File(s) Summary
Route Configuration
API/config/routes.rb
Activates root route: root "search#index" (replaced commented # root "posts#index").
Docker Base Image Updates
cpp/indexer/Dockerfile, python/ranker/Dockerfile
cpp/indexer/Dockerfile: base → ubuntu:22.04. python/ranker/Dockerfile: base changed from python:3.9-slimubuntu:22.04, installs system deps, creates Python3 symlink, installs Cython before requirements, uses python3 as entry.
Compose / Service Configuration
docker-compose.yml, .env.example
Replaces hard-coded Postgres creds with ${DB_*} vars; adds ./data/crawled_pages:/shared_data volume for ranker_service; injects DB envs into ranker_service/indexer_service; adds .env.example with DB_USER/DB_PASS/DB_NAME.
Ranker Implementation & Integration
python/ranker/engine.py, python/ranker/app.py, python/ranker/requirements.txt
Adds Ranker class implementing BM25 with RocksDB inverted index + Postgres metadata and in-memory fallbacks; integrates a global Ranker instance into the Flask app with init/cleanup and reinit-on-demand; updates /health and /search to use ranker, measure latency, and return meta; requirements updated (Cython<3, rocksdict, psycopg2-binary, numpy; Flask loosened).
CI / Build Dependencies
.github/workflows/ci.yml
Adds step to install RocksDB and compression lib dev packages and installs Cython prior to Python dependencies.
Docs / README
README.md
Adds Setup section showing environment variables and docker-compose run instructions.

Sequence Diagram(s)

sequenceDiagram
    %% Participants
    participant Client
    participant App as Flask (python/ranker/app.py)
    participant Ranker as Ranker Engine (python/ranker/engine.py)
    participant RocksDB
    participant Postgres

    Client->>App: GET /search?query=...
    activate App

    alt ranker not initialized
        App->>Ranker: initialize()
        activate Ranker
        Ranker->>Postgres: connect & fetch stats
        Postgres-->>Ranker: stats / error
        Ranker->>RocksDB: open read-only index
        alt RocksDB available
            RocksDB-->>Ranker: postings
        else RocksDB unavailable
            Ranker-->>Ranker: use in-memory index
        end
        Ranker-->>App: initialized
        deactivate Ranker
    end

    App->>Ranker: search(query, k=10)
    activate Ranker
    Ranker->>Ranker: tokenize & gather postings
    Ranker->>Ranker: compute BM25 scores
    Ranker->>Postgres: fetch metadata for top-k
    alt Postgres available
        Postgres-->>Ranker: metadata
    else
        Ranker-->>Ranker: use mock metadata
    end
    Ranker-->>App: results + meta
    deactivate Ranker

    App->>App: measure latency, build response
    App-->>Client: {results, count, latency_ms}
    deactivate App
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Focus review areas:
    • python/ranker/engine.py: BM25 math (IDF/TF/length), total_docs/avgdl computation, and edge-case handling.
    • DB queries and SQL safety/error handling in engine.py.
    • RocksDB open/read and resource cleanup patterns.
    • Race conditions and reinitialization logic around the global ranker in python/ranker/app.py.
    • Dockerfile/CI changes for native deps (building rocksdict/psycopg2 binaries) and their caching.

Poem

🐰 I hopped through code on a search-lit trail,
RocksDB hummed while Postgres wagged its tail.
BM25 danced and results took flight,
Latency counted in morning light,
A rabbit cheers: indexes set sail! 🥕✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 27.27% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately reflects the main changes - implementing a basic BM25-based ranker with RocksDB integration, matching the PR objectives.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch basic-ranker

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (7)
docker-compose.yml (1)

26-36: Consider adding depends_on for postgres_service.

The ranker service connects to Postgres (via engine.py), but doesn't declare a dependency. This may cause connection failures during startup if ranker initializes before Postgres is ready.

   ranker_service:
     build: ./python/ranker
     ports:
       - "5000:5000"
     volumes:
       - ./python/ranker:/app # Hot-reloading for Python
       - ./data/crawled_pages:/shared_data
     environment:
       - FLASK_ENV=development
+    depends_on:
+      - postgres_service
     networks:
       - search_net
python/ranker/requirements.txt (1)

1-4: Consider pinning dependency versions for reproducible builds.

All dependencies are unpinned, and the RocksDB wrapper points to a Git repository's default branch. This can cause build inconsistencies across environments.

-flask
-git+https://github.com/twmht/python-rocksdb.git
-psycopg2-binary
-numpy
+flask==3.0.0
+git+https://github.com/twmht/python-rocksdb.git@<commit-sha>
+psycopg2-binary==2.9.9
+numpy==1.26.0
python/ranker/app.py (2)

26-27: Use !r conversion for exception representation in f-string.

Per static analysis (RUF010), use {e!r} instead of {str(e)} for a more informative representation.

-            return jsonify({"error": f"Ranker not initialized: {str(e)}"}), 500
+            return jsonify({"error": f"Ranker not initialized: {e!r}"}), 500

29-29: Query is lowercased twice.

The query is lowercased here (line 29) and again in engine.py at line 69 (query.lower().split()). Consider removing one to avoid redundant processing.

Either remove .lower() here or in engine.py:69.

python/ranker/engine.py (3)

103-103: No validation on postings string format.

If the posting list contains malformed data (non-integer values), int(d) will raise a ValueError. Consider adding error handling.

-            doc_ids = [int(d) for d in postings_str.split(',')]
+            try:
+                doc_ids = [int(d) for d in postings_str.split(',') if d.strip()]
+            except ValueError as e:
+                print(f"Malformed posting list for token {token}: {e}")
+                continue

129-143: Database queries in a loop may cause performance issues.

Executing a separate query for each document in the results loop (N+1 query pattern) is inefficient. Consider batching the metadata fetch.

         if self.db_conn:
             try:
                 with self.db_conn.cursor() as cur:
-                    for doc_id, score in sorted_docs:
-                        cur.execute("SELECT url FROM documents WHERE id = %s", (doc_id,))
-                        row = cur.fetchone()
-                        if row:
-                            results.append({
-                                "id": doc_id,
-                                "url": row[0],
-                                "score": score,
-                                "title": row[0] # Use URL as title for now
-                            })
+                    doc_ids = [doc_id for doc_id, _ in sorted_docs]
+                    cur.execute("SELECT id, url FROM documents WHERE id = ANY(%s)", (doc_ids,))
+                    url_map = {row[0]: row[1] for row in cur.fetchall()}
+                    for doc_id, score in sorted_docs:
+                        url = url_map.get(doc_id)
+                        if url:
+                            results.append({
+                                "id": doc_id,
+                                "url": url,
+                                "score": score,
+                                "title": url  # Use URL as title for now
+                            })
             except Exception as e:
                 print(f"Error fetching metadata: {e}")

17-27: Database connection is never closed and lacks connection pooling.

The Postgres connection is held open indefinitely without proper lifecycle management. For a Flask application, consider using connection pooling (e.g., psycopg2.pool) or closing connections properly.

For now, at minimum add a cleanup method:

def close(self):
    if self.db_conn:
        self.db_conn.close()
        self.db_conn = None
    if self.index_db:
        del self.index_db
        self.index_db = None

For production, consider using psycopg2.pool.ThreadedConnectionPool or an ORM like SQLAlchemy with built-in connection management.

Also applies to: 129-143

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2c0902a and 74a9ab3.

📒 Files selected for processing (7)
  • API/config/routes.rb (1 hunks)
  • cpp/indexer/Dockerfile (1 hunks)
  • docker-compose.yml (1 hunks)
  • python/ranker/Dockerfile (1 hunks)
  • python/ranker/app.py (1 hunks)
  • python/ranker/engine.py (1 hunks)
  • python/ranker/requirements.txt (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
python/ranker/app.py (1)
python/ranker/engine.py (2)
  • Ranker (14-154)
  • search (64-154)
python/ranker/engine.py (1)
python/ranker/app.py (1)
  • search (20-43)
🪛 Ruff (0.14.7)
python/ranker/app.py

11-11: Do not catch blind exception: Exception

(BLE001)


26-26: Do not catch blind exception: Exception

(BLE001)


27-27: Use explicit conversion flag

Replace with conversion flag

(RUF010)

python/ranker/engine.py

25-25: Do not catch blind exception: Exception

(BLE001)


39-39: Do not catch blind exception: Exception

(BLE001)


60-60: Do not catch blind exception: Exception

(BLE001)


89-89: Do not catch blind exception: Exception

(BLE001)


142-142: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: docker-build
🔇 Additional comments (5)
cpp/indexer/Dockerfile (1)

1-1: Good: Pinning to a specific Ubuntu version.

Using ubuntu:22.04 instead of ubuntu:latest improves build reproducibility and consistency across services.

python/ranker/Dockerfile (1)

1-28: LGTM!

The Dockerfile properly installs system dependencies required for building python-rocksdb, and correctly installs Cython<3 before the requirements to ensure the RocksDB wrapper builds successfully.

API/config/routes.rb (1)

11-11: LGTM!

Setting the root route to search#index aligns with the search-focused nature of this application.

python/ranker/app.py (1)

7-12: Module-level initialization may cause import-time failures.

If Ranker initialization fails, the exception is caught and logged, but ranker remains None. This is acceptable given the fallback in the search endpoint. However, consider whether initialization errors should be more specific.

python/ranker/engine.py (1)

17-27: Hardcoded credentials in default values.

While environment variables are used, the defaults include plaintext credentials. This is acceptable for local development but ensure production deployments override these values.

Verify that production deployments provide proper environment variables and don't rely on these defaults.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Nitpick comments (1)
python/ranker/engine.py (1)

129-143: Optimize metadata fetching to avoid N+1 queries.

The loop performs individual database queries for each result, creating an N+1 query problem. For k=10, this makes 10 separate queries instead of a single batch query.

Consider batching the metadata fetch:

         results = []
         if self.db_conn:
             try:
                 with self.db_conn.cursor() as cur:
-                    for doc_id, score in sorted_docs:
-                        cur.execute("SELECT url FROM documents WHERE id = %s", (doc_id,))
-                        row = cur.fetchone()
-                        if row:
-                            results.append({
-                                "id": doc_id,
-                                "url": row[0],
-                                "score": score,
-                                "title": row[0] # Use URL as title for now
-                            })
+                    doc_ids = [doc_id for doc_id, _ in sorted_docs]
+                    if doc_ids:
+                        placeholders = ','.join(['%s'] * len(doc_ids))
+                        cur.execute(f"SELECT id, url FROM documents WHERE id IN ({placeholders})", doc_ids)
+                        doc_metadata = {row[0]: row[1] for row in cur.fetchall()}
+                        
+                        for doc_id, score in sorted_docs:
+                            if doc_id in doc_metadata:
+                                results.append({
+                                    "id": doc_id,
+                                    "url": doc_metadata[doc_id],
+                                    "score": score,
+                                    "title": doc_metadata[doc_id]  # Use URL as title for now
+                                })
             except Exception as e:
                 print(f"Error fetching metadata: {e}")
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 74a9ab3 and 5ffcbbc.

📒 Files selected for processing (3)
  • .github/workflows/ci.yml (1 hunks)
  • python/ranker/engine.py (1 hunks)
  • python/ranker/requirements.txt (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
python/ranker/engine.py (1)
python/ranker/app.py (1)
  • search (20-43)
🪛 Ruff (0.14.8)
python/ranker/engine.py

25-25: Do not catch blind exception: Exception

(BLE001)


39-39: Do not catch blind exception: Exception

(BLE001)


60-60: Do not catch blind exception: Exception

(BLE001)


89-89: Do not catch blind exception: Exception

(BLE001)


142-142: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: docker-build
🔇 Additional comments (4)
.github/workflows/ci.yml (1)

130-133: LGTM! Proper RocksDB dependencies for CI.

The system dependencies correctly align with the RocksDB integration and match the runtime requirements for the python-rocksdb package.

python/ranker/engine.py (3)

1-12: LGTM! Graceful RocksDB fallback handling.

The defensive import pattern with the ROCKSDB_AVAILABLE flag provides good resilience when the RocksDB library is unavailable.


52-62: LGTM! Division by zero issue addressed.

The _calculate_avgdl method correctly handles the case where the database returns None or 0 by falling back to 100.0, preventing the division by zero issue mentioned in previous reviews.


80-94: LGTM! Mock fallback logic corrected.

The fallback to the mock index now correctly uses else:, ensuring it triggers whenever self.index_db is None, addressing the issue from previous reviews.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (4)
docker-compose.yml (1)

1-1: Consider removing obsolete version: key to satisfy Compose/CI warning.

Compose v2 no longer requires the top-level version: field and your CI already flags it as obsolete. Dropping that line will avoid the warning and make the file compatible with current Compose recommendations.

python/ranker/app.py (1)

8-15: Broad exception catch on Ranker init hides root causes.

Catching bare Exception here means config errors, DB connection failures, and programming bugs are all flattened into a single log line and the app silently degrades. Consider either:

  • Letting initialization errors bubble up (fail fast in non-dev), or
  • Narrowing the exception types (e.g., DB/rocksdb related) and logging full details (including traceback) so misconfigurations are obvious.
python/ranker/engine.py (2)

16-37: DB credential validation is good; consider separating config errors from runtime connection failures.

Requiring DB_USER/DB_PASS and raising on missing env vars is a good step. Right now that ValueError is caught together with connection errors and simply logged, leaving db_conn=None and silently dropping to mock behavior.

If you want misconfiguration to be more obvious, consider:

  • Validating env vars before the try block so a missing credential fails fast, or
  • Catching psycopg2.Error separately from ValueError and handling them differently (e.g., re‑raise config errors).

181-205: Minor BM25 implementation cleanups (no behavior change).

A couple of small tweaks can improve clarity and satisfy the linter without changing semantics:

  • Avoid the multi‑statement line for N:
-            N = self.total_docs
-            if N == 0: N = 1 # Avoid division by zero issues if DB is empty
+            N = self.total_docs or 1  # Avoid division by zero issues if DB is empty
  • Optionally, add a brief comment that doc_ids currently encode only document IDs (no per‑doc TF), so tf = 1 is intentional until the index schema is expanded.

These are non‑functional cleanups but make the scoring loop easier to read and lint‑clean.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fac0362 and 56b5236.

📒 Files selected for processing (4)
  • docker-compose.yml (1 hunks)
  • python/ranker/app.py (1 hunks)
  • python/ranker/engine.py (1 hunks)
  • python/ranker/requirements.txt (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
python/ranker/engine.py (1)
python/ranker/app.py (1)
  • search (22-45)
🪛 GitHub Actions: Search Engine CI
docker-compose.yml

[warning] 1-1: the attribute version is obsolete, it will be ignored; remove it to avoid potential confusion

🪛 Ruff (0.14.8)
python/ranker/app.py

13-13: Do not catch blind exception: Exception

(BLE001)


28-28: Do not catch blind exception: Exception

(BLE001)


29-29: Use explicit conversion flag

Replace with conversion flag

(RUF010)

python/ranker/engine.py

25-25: Abstract raise to an inner function

(TRY301)


25-25: Avoid specifying long messages outside the exception class

(TRY003)


34-34: Do not catch blind exception: Exception

(BLE001)


48-48: Do not catch blind exception: Exception

(BLE001)


70-70: Do not catch blind exception: Exception

(BLE001)


82-82: Do not catch blind exception: Exception

(BLE001)


110-110: Do not catch blind exception: Exception

(BLE001)


151-151: Do not catch blind exception: Exception

(BLE001)


184-184: Multiple statements on one line (colon)

(E701)


222-222: Do not catch blind exception: Exception

(BLE001)


242-242: Do not catch blind exception: Exception

(BLE001)

🔇 Additional comments (3)
python/ranker/requirements.txt (1)

1-5: Pinned versions look good; consider commit-hash for git dependency later.

The explicit version pins (including the rocksdb tag) significantly improve reproducibility and align with earlier feedback. If you need even stricter determinism in the future, you could pin the git dependency to a commit hash instead of a tag, but this is already a solid improvement.

python/ranker/app.py (1)

23-45: Global Ranker with a single psycopg2 connection may be unsafe under threaded servers.

ranker is a module‑level singleton, and Ranker internally holds one psycopg2 connection that is reused across requests. Under the default threaded Flask dev server or any threaded WSGI server, this can lead to undefined behavior because psycopg2 connections are not designed to be shared across threads.

Consider:

  • Creating DB connections per request (or per thread) via a small connection factory/pool, or
  • Ensuring the service runs in a process‑only (non‑threaded) model and documenting that assumption, or
  • Refactoring Ranker so it doesn't own a shared connection but instead uses a connection pool abstraction.

This will make the search endpoint more robust as you scale beyond single‑threaded dev usage.

python/ranker/engine.py (1)

86-112: Verify column-name consistency between _calculate_avgdl and _get_doc_lengths against actual schema.

The review identifies a potential mismatch: _calculate_avgdl queries SELECT AVG(doc_length) FROM documents while _get_doc_lengths queries SELECT id, length FROM documents. If your database schema uses doc_length as the column name, the mismatch will cause _get_doc_lengths to fail silently and degrade ranking quality. Cross-reference both methods with your actual documents table schema to confirm which column name is correct, then update both queries to use the same column consistently.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
python/ranker/requirements.txt (1)

1-5: Re‑introduce version pins for reproducible builds

All dependencies here are effectively “latest” (flask, rocksdict, psycopg2-binary, numpy), with only an upper bound on Cython. This can lead to non‑reproducible builds and surprise breakages when new major versions ship.

Consider pinning or at least constraining versions for each package (e.g., >=min_known_good,<next_major) and aligning those with the versions installed in your Docker/CI setup.

🧹 Nitpick comments (5)
python/ranker/engine.py (5)

16-37: Clarify DB env validation vs silent fallback

Right now ValueError for missing DB_USER/DB_PASS is immediately swallowed by the broad except Exception in Lines [34]-[36], so the app silently runs without Postgres (just logging to stdout). That’s convenient for dev but can hide misconfiguration in non-dev environments.

Consider either:

  • Moving the DB_USER/DB_PASS validation outside the try so it fails fast, or
  • Restricting the except to connection-related errors (e.g., psycopg2.OperationalError) and letting configuration errors propagate.

This keeps the “mock/fallback” story for local runs but makes production failures more explicit.


141-167: Postings retrieval and mock fallback behavior is solid, but consider narrower exception handling

The logic to read postings from RocksDB when available and fall back to mock_index when self.index_db is falsy is correct and addresses the earlier incomplete fallback issue.

You’re still catching a broad Exception around RocksDB access (Lines [146]-[151]). If possible, narrow that to the specific rocksdict exception(s) you expect (I/O errors, corruption), so unexpected bugs surface rather than being silently logged.


175-201: BM25 loop is correct, but N=0 guard and style could be cleaned up

The BM25 formula is implemented correctly for this simplified setting, and the doc_len fallback to avgdl ensures the denominator stays safe.

Two minor polish points:

  • The if N == 0: N = 1 on Line [183] works but is harder to read; consider splitting to two lines or handling the 0 case once when initializing self.total_docs.
  • If you ever move beyond TF=1, you may want to compute n_qi as the count of distinct docs containing the term (e.g., len(set(doc_ids))) depending on how the index is structured.

Not blockers, but worth keeping in mind as the ranker matures.


206-223: Avoid N+1 metadata queries for top results

For each result you issue a separate SELECT url FROM documents WHERE id = %s (Lines [211]-[213]). With k=10 this is acceptable, but as k grows this becomes an N+1 pattern.

You can batch fetch URLs for all top doc IDs in one query, similar to _get_doc_lengths:

  1. Build a list of doc_ids from sorted_docs.
  2. Execute a single SELECT id, url FROM documents WHERE id IN %s.
  3. Map IDs to URLs in Python and assemble results.

This reduces DB round‑trips and keeps latency predictable.


235-262: Resource cleanup and context management are well handled

The close, __enter__, __exit__, and __del__ methods correctly close both RocksDB and Postgres and null out the handles. This integrates nicely with both explicit .close() calls and context‑manager usage.

One minor consideration: since __del__ timing is not guaranteed, ensure your primary code paths call close() (or use with Ranker()), and treat __del__ as best‑effort cleanup only. Otherwise, the implementation looks good.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 56b5236 and e8a3988.

📒 Files selected for processing (2)
  • python/ranker/engine.py (1 hunks)
  • python/ranker/requirements.txt (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
python/ranker/engine.py (1)
python/ranker/app.py (1)
  • search (22-45)
🪛 Ruff (0.14.8)
python/ranker/engine.py

25-25: Abstract raise to an inner function

(TRY301)


25-25: Avoid specifying long messages outside the exception class

(TRY003)


34-34: Do not catch blind exception: Exception

(BLE001)


47-47: Do not catch blind exception: Exception

(BLE001)


69-69: Do not catch blind exception: Exception

(BLE001)


81-81: Do not catch blind exception: Exception

(BLE001)


109-109: Do not catch blind exception: Exception

(BLE001)


150-150: Do not catch blind exception: Exception

(BLE001)


183-183: Multiple statements on one line (colon)

(E701)


221-221: Do not catch blind exception: Exception

(BLE001)


241-241: Do not catch blind exception: Exception

(BLE001)


250-250: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: docker-build
🔇 Additional comments (3)
python/ranker/engine.py (3)

7-14: RocksDB import guard and mock fallback look good

The conditional import plus clear warning and in-memory mock index give a sensible dev fallback when RocksDB isn’t available. No changes needed here.


61-84: Global stats helpers are safe against division-by-zero

_calculate_avgdl now returns a non-zero default (100.0) when AVG is NULL/0 or the query fails, and _get_total_docs has a reasonable default when DB is unavailable. Together with the later if N == 0 guard, this avoids division-by-zero issues in BM25.

Looks good as a basic, defensive implementation.


113-126: Query preprocessing now correctly mirrors indexer behavior

Lowercasing, stripping non‑alphanumeric chars, splitting on whitespace, and filtering out tokens with length < 3 aligns well with the described C++ indexer tokenization. This should fix many of the earlier “query doesn’t match index” issues.

No changes needed.

@Digvijay-x1
Copy link
Owner Author

@copilot merge this PR if all test passes including the coderabbitai

Copy link
Contributor

Copilot AI commented Dec 9, 2025

@Digvijay-x1 I've opened a new pull request, #20, to work on those changes. Once the pull request is ready, I'll request review from you.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
python/ranker/engine.py (2)

183-183: Split compound statement for readability.

For better readability and linter compliance, split the inline conditional across two lines.

Apply this diff:

-            if N == 0: N = 1 # Avoid division by zero issues if DB is empty
+            if N == 0:
+                N = 1  # Avoid division by zero issues if DB is empty

34-34: Consider more specific exception types.

While the broad Exception catching provides defensive robustness, catching more specific exceptions (e.g., psycopg2.Error, IOError, rocksdict.RocksDBError) would improve error diagnostics and prevent masking unexpected issues.

Example for database operations:

         try:
             # ... database operations ...
-        except Exception as e:
+        except (psycopg2.Error, psycopg2.OperationalError) as e:
             print(f"Database error: {e}")

Also applies to: 47-47, 69-69, 81-81, 109-109, 150-150, 221-221, 241-241, 250-250

.env.example (1)

1-3: Good password placeholder, consider username guidance.

The change_me_in_production placeholder for DB_PASS clearly indicates customization is needed. While DB_USER=admin is common for local development, you might consider adding a comment suggesting users also customize the username for production deployments.

Optional enhancement:

+# Change these values for production deployments
 DB_USER=admin
 DB_PASS=change_me_in_production
 DB_NAME=search_engine
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e8a3988 and 01f2683.

📒 Files selected for processing (4)
  • .env.example (1 hunks)
  • README.md (1 hunks)
  • docker-compose.yml (3 hunks)
  • python/ranker/engine.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • docker-compose.yml
🧰 Additional context used
🧬 Code graph analysis (1)
python/ranker/engine.py (1)
python/ranker/app.py (1)
  • search (22-45)
🪛 dotenv-linter (4.0.0)
.env.example

[warning] 2-2: [UnorderedKey] The DB_PASS key should go before the DB_USER key

(UnorderedKey)


[warning] 3-3: [UnorderedKey] The DB_NAME key should go before the DB_PASS key

(UnorderedKey)

🪛 Ruff (0.14.8)
python/ranker/engine.py

25-25: Abstract raise to an inner function

(TRY301)


25-25: Avoid specifying long messages outside the exception class

(TRY003)


34-34: Do not catch blind exception: Exception

(BLE001)


47-47: Do not catch blind exception: Exception

(BLE001)


69-69: Do not catch blind exception: Exception

(BLE001)


81-81: Do not catch blind exception: Exception

(BLE001)


109-109: Do not catch blind exception: Exception

(BLE001)


150-150: Do not catch blind exception: Exception

(BLE001)


183-183: Multiple statements on one line (colon)

(E701)


221-221: Do not catch blind exception: Exception

(BLE001)


241-241: Do not catch blind exception: Exception

(BLE001)


250-250: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: docker-build
🔇 Additional comments (9)
python/ranker/engine.py (8)

1-13: LGTM! Clean import structure with graceful fallback.

The optional RocksDB dependency handling with a fallback flag is a solid pattern for managing external dependencies.


21-25: Security improvement confirmed.

Requiring explicit DB_USER and DB_PASS environment variables without weak defaults addresses the previous security concern. The clear error message helps developers configure the system properly.


61-71: Division-by-zero protection confirmed.

The fallback to 100.0 when avg is None or 0 properly addresses the previous concern about potential division by zero in BM25 calculations.


85-111: Column name consistency confirmed.

Both queries now correctly use the doc_length column for word counts, addressing the previous mismatch concern. The single vs. multiple ID handling is implemented correctly.


118-128: Query tokenization aligned with indexer.

The preprocessing pipeline (lowercase, remove non-alphanumeric, filter length ≥ 3) now matches the indexer's tokenization logic, addressing the previous mismatch concern. This ensures queries can find indexed documents correctly.


141-154: Improved fallback logic confirmed.

The fallback to mock index now triggers when self.index_db is None, properly handling cases where RocksDB is available but the index file doesn't exist. This addresses the previous incomplete fallback concern.


174-201: BM25 implementation with tracked placeholders.

The BM25 scoring logic is sound with appropriate safety checks (N=0 handling, doc_len fallbacks). The TODO comment at lines 189-191 properly tracks the TF=1 simplification, addressing the previous concern about untracked placeholders.


235-262: Comprehensive resource cleanup implemented.

The cleanup implementation with close(), context manager support (__enter__/__exit__), and __del__ fallback addresses the previous concern about resource leaks. Setting connections to None in finally blocks is good practice.

README.md (1)

3-24: Clear setup documentation added.

The step-by-step environment configuration and Docker instructions provide a smooth onboarding experience. The placeholder password text clearly indicates customization is needed.

@Digvijay-x1 Digvijay-x1 merged commit 6d14f88 into main Dec 9, 2025
7 checks passed
@Digvijay-x1 Digvijay-x1 deleted the basic-ranker branch December 9, 2025 08:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants