Conversation
…, updating Docker and app to use it.
…for ranker, and set API root to search index.
WalkthroughActivates API root to search#index, updates several Docker bases to Ubuntu 22.04, injects DB envs and a shared volume in docker-compose, adds a BM25 Ranker (RocksDB + Postgres with in-memory fallbacks) integrated into the Flask app, and updates Python deps and CI native-lib install steps. Changes
Sequence Diagram(s)sequenceDiagram
%% Participants
participant Client
participant App as Flask (python/ranker/app.py)
participant Ranker as Ranker Engine (python/ranker/engine.py)
participant RocksDB
participant Postgres
Client->>App: GET /search?query=...
activate App
alt ranker not initialized
App->>Ranker: initialize()
activate Ranker
Ranker->>Postgres: connect & fetch stats
Postgres-->>Ranker: stats / error
Ranker->>RocksDB: open read-only index
alt RocksDB available
RocksDB-->>Ranker: postings
else RocksDB unavailable
Ranker-->>Ranker: use in-memory index
end
Ranker-->>App: initialized
deactivate Ranker
end
App->>Ranker: search(query, k=10)
activate Ranker
Ranker->>Ranker: tokenize & gather postings
Ranker->>Ranker: compute BM25 scores
Ranker->>Postgres: fetch metadata for top-k
alt Postgres available
Postgres-->>Ranker: metadata
else
Ranker-->>Ranker: use mock metadata
end
Ranker-->>App: results + meta
deactivate Ranker
App->>App: measure latency, build response
App-->>Client: {results, count, latency_ms}
deactivate App
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (7)
docker-compose.yml (1)
26-36: Consider addingdepends_onforpostgres_service.The ranker service connects to Postgres (via
engine.py), but doesn't declare a dependency. This may cause connection failures during startup if ranker initializes before Postgres is ready.ranker_service: build: ./python/ranker ports: - "5000:5000" volumes: - ./python/ranker:/app # Hot-reloading for Python - ./data/crawled_pages:/shared_data environment: - FLASK_ENV=development + depends_on: + - postgres_service networks: - search_netpython/ranker/requirements.txt (1)
1-4: Consider pinning dependency versions for reproducible builds.All dependencies are unpinned, and the RocksDB wrapper points to a Git repository's default branch. This can cause build inconsistencies across environments.
-flask -git+https://github.com/twmht/python-rocksdb.git -psycopg2-binary -numpy +flask==3.0.0 +git+https://github.com/twmht/python-rocksdb.git@<commit-sha> +psycopg2-binary==2.9.9 +numpy==1.26.0python/ranker/app.py (2)
26-27: Use!rconversion for exception representation in f-string.Per static analysis (RUF010), use
{e!r}instead of{str(e)}for a more informative representation.- return jsonify({"error": f"Ranker not initialized: {str(e)}"}), 500 + return jsonify({"error": f"Ranker not initialized: {e!r}"}), 500
29-29: Query is lowercased twice.The query is lowercased here (line 29) and again in
engine.pyat line 69 (query.lower().split()). Consider removing one to avoid redundant processing.Either remove
.lower()here or inengine.py:69.python/ranker/engine.py (3)
103-103: No validation on postings string format.If the posting list contains malformed data (non-integer values),
int(d)will raise aValueError. Consider adding error handling.- doc_ids = [int(d) for d in postings_str.split(',')] + try: + doc_ids = [int(d) for d in postings_str.split(',') if d.strip()] + except ValueError as e: + print(f"Malformed posting list for token {token}: {e}") + continue
129-143: Database queries in a loop may cause performance issues.Executing a separate query for each document in the results loop (N+1 query pattern) is inefficient. Consider batching the metadata fetch.
if self.db_conn: try: with self.db_conn.cursor() as cur: - for doc_id, score in sorted_docs: - cur.execute("SELECT url FROM documents WHERE id = %s", (doc_id,)) - row = cur.fetchone() - if row: - results.append({ - "id": doc_id, - "url": row[0], - "score": score, - "title": row[0] # Use URL as title for now - }) + doc_ids = [doc_id for doc_id, _ in sorted_docs] + cur.execute("SELECT id, url FROM documents WHERE id = ANY(%s)", (doc_ids,)) + url_map = {row[0]: row[1] for row in cur.fetchall()} + for doc_id, score in sorted_docs: + url = url_map.get(doc_id) + if url: + results.append({ + "id": doc_id, + "url": url, + "score": score, + "title": url # Use URL as title for now + }) except Exception as e: print(f"Error fetching metadata: {e}")
17-27: Database connection is never closed and lacks connection pooling.The Postgres connection is held open indefinitely without proper lifecycle management. For a Flask application, consider using connection pooling (e.g.,
psycopg2.pool) or closing connections properly.For now, at minimum add a cleanup method:
def close(self): if self.db_conn: self.db_conn.close() self.db_conn = None if self.index_db: del self.index_db self.index_db = NoneFor production, consider using
psycopg2.pool.ThreadedConnectionPoolor an ORM like SQLAlchemy with built-in connection management.Also applies to: 129-143
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
API/config/routes.rb(1 hunks)cpp/indexer/Dockerfile(1 hunks)docker-compose.yml(1 hunks)python/ranker/Dockerfile(1 hunks)python/ranker/app.py(1 hunks)python/ranker/engine.py(1 hunks)python/ranker/requirements.txt(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
python/ranker/app.py (1)
python/ranker/engine.py (2)
Ranker(14-154)search(64-154)
python/ranker/engine.py (1)
python/ranker/app.py (1)
search(20-43)
🪛 Ruff (0.14.7)
python/ranker/app.py
11-11: Do not catch blind exception: Exception
(BLE001)
26-26: Do not catch blind exception: Exception
(BLE001)
27-27: Use explicit conversion flag
Replace with conversion flag
(RUF010)
python/ranker/engine.py
25-25: Do not catch blind exception: Exception
(BLE001)
39-39: Do not catch blind exception: Exception
(BLE001)
60-60: Do not catch blind exception: Exception
(BLE001)
89-89: Do not catch blind exception: Exception
(BLE001)
142-142: Do not catch blind exception: Exception
(BLE001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: docker-build
🔇 Additional comments (5)
cpp/indexer/Dockerfile (1)
1-1: Good: Pinning to a specific Ubuntu version.Using
ubuntu:22.04instead ofubuntu:latestimproves build reproducibility and consistency across services.python/ranker/Dockerfile (1)
1-28: LGTM!The Dockerfile properly installs system dependencies required for building
python-rocksdb, and correctly installsCython<3before the requirements to ensure the RocksDB wrapper builds successfully.API/config/routes.rb (1)
11-11: LGTM!Setting the root route to
search#indexaligns with the search-focused nature of this application.python/ranker/app.py (1)
7-12: Module-level initialization may cause import-time failures.If Ranker initialization fails, the exception is caught and logged, but
rankerremainsNone. This is acceptable given the fallback in the search endpoint. However, consider whether initialization errors should be more specific.python/ranker/engine.py (1)
17-27: Hardcoded credentials in default values.While environment variables are used, the defaults include plaintext credentials. This is acceptable for local development but ensure production deployments override these values.
Verify that production deployments provide proper environment variables and don't rely on these defaults.
… length default in engine.
There was a problem hiding this comment.
Actionable comments posted: 5
🧹 Nitpick comments (1)
python/ranker/engine.py (1)
129-143: Optimize metadata fetching to avoid N+1 queries.The loop performs individual database queries for each result, creating an N+1 query problem. For
k=10, this makes 10 separate queries instead of a single batch query.Consider batching the metadata fetch:
results = [] if self.db_conn: try: with self.db_conn.cursor() as cur: - for doc_id, score in sorted_docs: - cur.execute("SELECT url FROM documents WHERE id = %s", (doc_id,)) - row = cur.fetchone() - if row: - results.append({ - "id": doc_id, - "url": row[0], - "score": score, - "title": row[0] # Use URL as title for now - }) + doc_ids = [doc_id for doc_id, _ in sorted_docs] + if doc_ids: + placeholders = ','.join(['%s'] * len(doc_ids)) + cur.execute(f"SELECT id, url FROM documents WHERE id IN ({placeholders})", doc_ids) + doc_metadata = {row[0]: row[1] for row in cur.fetchall()} + + for doc_id, score in sorted_docs: + if doc_id in doc_metadata: + results.append({ + "id": doc_id, + "url": doc_metadata[doc_id], + "score": score, + "title": doc_metadata[doc_id] # Use URL as title for now + }) except Exception as e: print(f"Error fetching metadata: {e}")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
.github/workflows/ci.yml(1 hunks)python/ranker/engine.py(1 hunks)python/ranker/requirements.txt(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
python/ranker/engine.py (1)
python/ranker/app.py (1)
search(20-43)
🪛 Ruff (0.14.8)
python/ranker/engine.py
25-25: Do not catch blind exception: Exception
(BLE001)
39-39: Do not catch blind exception: Exception
(BLE001)
60-60: Do not catch blind exception: Exception
(BLE001)
89-89: Do not catch blind exception: Exception
(BLE001)
142-142: Do not catch blind exception: Exception
(BLE001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: docker-build
🔇 Additional comments (4)
.github/workflows/ci.yml (1)
130-133: LGTM! Proper RocksDB dependencies for CI.The system dependencies correctly align with the RocksDB integration and match the runtime requirements for the python-rocksdb package.
python/ranker/engine.py (3)
1-12: LGTM! Graceful RocksDB fallback handling.The defensive import pattern with the
ROCKSDB_AVAILABLEflag provides good resilience when the RocksDB library is unavailable.
52-62: LGTM! Division by zero issue addressed.The
_calculate_avgdlmethod correctly handles the case where the database returnsNoneor0by falling back to100.0, preventing the division by zero issue mentioned in previous reviews.
80-94: LGTM! Mock fallback logic corrected.The fallback to the mock index now correctly uses
else:, ensuring it triggers wheneverself.index_dbisNone, addressing the issue from previous reviews.
…r dependency installation.
…mic document statistics, and robust database connection handling.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (4)
docker-compose.yml (1)
1-1: Consider removing obsoleteversion:key to satisfy Compose/CI warning.Compose v2 no longer requires the top-level
version:field and your CI already flags it as obsolete. Dropping that line will avoid the warning and make the file compatible with current Compose recommendations.python/ranker/app.py (1)
8-15: Broad exception catch on Ranker init hides root causes.Catching bare
Exceptionhere means config errors, DB connection failures, and programming bugs are all flattened into a single log line and the app silently degrades. Consider either:
- Letting initialization errors bubble up (fail fast in non-dev), or
- Narrowing the exception types (e.g., DB/rocksdb related) and logging full details (including traceback) so misconfigurations are obvious.
python/ranker/engine.py (2)
16-37: DB credential validation is good; consider separating config errors from runtime connection failures.Requiring
DB_USER/DB_PASSand raising on missing env vars is a good step. Right now thatValueErroris caught together with connection errors and simply logged, leavingdb_conn=Noneand silently dropping to mock behavior.If you want misconfiguration to be more obvious, consider:
- Validating env vars before the
tryblock so a missing credential fails fast, or- Catching
psycopg2.Errorseparately fromValueErrorand handling them differently (e.g., re‑raise config errors).
181-205: Minor BM25 implementation cleanups (no behavior change).A couple of small tweaks can improve clarity and satisfy the linter without changing semantics:
- Avoid the multi‑statement line for
N:- N = self.total_docs - if N == 0: N = 1 # Avoid division by zero issues if DB is empty + N = self.total_docs or 1 # Avoid division by zero issues if DB is empty
- Optionally, add a brief comment that
doc_idscurrently encode only document IDs (no per‑doc TF), sotf = 1is intentional until the index schema is expanded.These are non‑functional cleanups but make the scoring loop easier to read and lint‑clean.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
docker-compose.yml(1 hunks)python/ranker/app.py(1 hunks)python/ranker/engine.py(1 hunks)python/ranker/requirements.txt(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
python/ranker/engine.py (1)
python/ranker/app.py (1)
search(22-45)
🪛 GitHub Actions: Search Engine CI
docker-compose.yml
[warning] 1-1: the attribute version is obsolete, it will be ignored; remove it to avoid potential confusion
🪛 Ruff (0.14.8)
python/ranker/app.py
13-13: Do not catch blind exception: Exception
(BLE001)
28-28: Do not catch blind exception: Exception
(BLE001)
29-29: Use explicit conversion flag
Replace with conversion flag
(RUF010)
python/ranker/engine.py
25-25: Abstract raise to an inner function
(TRY301)
25-25: Avoid specifying long messages outside the exception class
(TRY003)
34-34: Do not catch blind exception: Exception
(BLE001)
48-48: Do not catch blind exception: Exception
(BLE001)
70-70: Do not catch blind exception: Exception
(BLE001)
82-82: Do not catch blind exception: Exception
(BLE001)
110-110: Do not catch blind exception: Exception
(BLE001)
151-151: Do not catch blind exception: Exception
(BLE001)
184-184: Multiple statements on one line (colon)
(E701)
222-222: Do not catch blind exception: Exception
(BLE001)
242-242: Do not catch blind exception: Exception
(BLE001)
🔇 Additional comments (3)
python/ranker/requirements.txt (1)
1-5: Pinned versions look good; consider commit-hash for git dependency later.The explicit version pins (including the rocksdb tag) significantly improve reproducibility and align with earlier feedback. If you need even stricter determinism in the future, you could pin the git dependency to a commit hash instead of a tag, but this is already a solid improvement.
python/ranker/app.py (1)
23-45: Global Ranker with a single psycopg2 connection may be unsafe under threaded servers.
rankeris a module‑level singleton, andRankerinternally holds one psycopg2 connection that is reused across requests. Under the default threaded Flask dev server or any threaded WSGI server, this can lead to undefined behavior because psycopg2 connections are not designed to be shared across threads.Consider:
- Creating DB connections per request (or per thread) via a small connection factory/pool, or
- Ensuring the service runs in a process‑only (non‑threaded) model and documenting that assumption, or
- Refactoring
Rankerso it doesn't own a shared connection but instead uses a connection pool abstraction.This will make the search endpoint more robust as you scale beyond single‑threaded dev usage.
python/ranker/engine.py (1)
86-112: Verify column-name consistency between_calculate_avgdland_get_doc_lengthsagainst actual schema.The review identifies a potential mismatch:
_calculate_avgdlqueriesSELECT AVG(doc_length) FROM documentswhile_get_doc_lengthsqueriesSELECT id, length FROM documents. If your database schema usesdoc_lengthas the column name, the mismatch will cause_get_doc_lengthsto fail silently and degrade ranking quality. Cross-reference both methods with your actual documents table schema to confirm which column name is correct, then update both queries to use the same column consistently.
…nvironment variables for database configuration
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
python/ranker/requirements.txt (1)
1-5: Re‑introduce version pins for reproducible buildsAll dependencies here are effectively “latest” (
flask,rocksdict,psycopg2-binary,numpy), with only an upper bound on Cython. This can lead to non‑reproducible builds and surprise breakages when new major versions ship.Consider pinning or at least constraining versions for each package (e.g.,
>=min_known_good,<next_major) and aligning those with the versions installed in your Docker/CI setup.
🧹 Nitpick comments (5)
python/ranker/engine.py (5)
16-37: Clarify DB env validation vs silent fallbackRight now
ValueErrorfor missingDB_USER/DB_PASSis immediately swallowed by the broadexcept Exceptionin Lines [34]-[36], so the app silently runs without Postgres (just logging to stdout). That’s convenient for dev but can hide misconfiguration in non-dev environments.Consider either:
- Moving the
DB_USER/DB_PASSvalidation outside thetryso it fails fast, or- Restricting the
exceptto connection-related errors (e.g.,psycopg2.OperationalError) and letting configuration errors propagate.This keeps the “mock/fallback” story for local runs but makes production failures more explicit.
141-167: Postings retrieval and mock fallback behavior is solid, but consider narrower exception handlingThe logic to read postings from RocksDB when available and fall back to
mock_indexwhenself.index_dbis falsy is correct and addresses the earlier incomplete fallback issue.You’re still catching a broad
Exceptionaround RocksDB access (Lines [146]-[151]). If possible, narrow that to the specific rocksdict exception(s) you expect (I/O errors, corruption), so unexpected bugs surface rather than being silently logged.
175-201: BM25 loop is correct, but N=0 guard and style could be cleaned upThe BM25 formula is implemented correctly for this simplified setting, and the
doc_lenfallback toavgdlensures the denominator stays safe.Two minor polish points:
- The
if N == 0: N = 1on Line [183] works but is harder to read; consider splitting to two lines or handling the0case once when initializingself.total_docs.- If you ever move beyond TF=1, you may want to compute
n_qias the count of distinct docs containing the term (e.g.,len(set(doc_ids))) depending on how the index is structured.Not blockers, but worth keeping in mind as the ranker matures.
206-223: Avoid N+1 metadata queries for top resultsFor each result you issue a separate
SELECT url FROM documents WHERE id = %s(Lines [211]-[213]). Withk=10this is acceptable, but askgrows this becomes an N+1 pattern.You can batch fetch URLs for all top doc IDs in one query, similar to
_get_doc_lengths:
- Build a list of doc_ids from
sorted_docs.- Execute a single
SELECT id, url FROM documents WHERE id IN %s.- Map IDs to URLs in Python and assemble
results.This reduces DB round‑trips and keeps latency predictable.
235-262: Resource cleanup and context management are well handledThe
close,__enter__,__exit__, and__del__methods correctly close both RocksDB and Postgres and null out the handles. This integrates nicely with both explicit.close()calls and context‑manager usage.One minor consideration: since
__del__timing is not guaranteed, ensure your primary code paths callclose()(or usewith Ranker()), and treat__del__as best‑effort cleanup only. Otherwise, the implementation looks good.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
python/ranker/engine.py(1 hunks)python/ranker/requirements.txt(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
python/ranker/engine.py (1)
python/ranker/app.py (1)
search(22-45)
🪛 Ruff (0.14.8)
python/ranker/engine.py
25-25: Abstract raise to an inner function
(TRY301)
25-25: Avoid specifying long messages outside the exception class
(TRY003)
34-34: Do not catch blind exception: Exception
(BLE001)
47-47: Do not catch blind exception: Exception
(BLE001)
69-69: Do not catch blind exception: Exception
(BLE001)
81-81: Do not catch blind exception: Exception
(BLE001)
109-109: Do not catch blind exception: Exception
(BLE001)
150-150: Do not catch blind exception: Exception
(BLE001)
183-183: Multiple statements on one line (colon)
(E701)
221-221: Do not catch blind exception: Exception
(BLE001)
241-241: Do not catch blind exception: Exception
(BLE001)
250-250: Do not catch blind exception: Exception
(BLE001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: docker-build
🔇 Additional comments (3)
python/ranker/engine.py (3)
7-14: RocksDB import guard and mock fallback look goodThe conditional import plus clear warning and in-memory mock index give a sensible dev fallback when RocksDB isn’t available. No changes needed here.
61-84: Global stats helpers are safe against division-by-zero
_calculate_avgdlnow returns a non-zero default (100.0) when AVG is NULL/0 or the query fails, and_get_total_docshas a reasonable default when DB is unavailable. Together with the laterif N == 0guard, this avoids division-by-zero issues in BM25.Looks good as a basic, defensive implementation.
113-126: Query preprocessing now correctly mirrors indexer behaviorLowercasing, stripping non‑alphanumeric chars, splitting on whitespace, and filtering out tokens with length < 3 aligns well with the described C++ indexer tokenization. This should fix many of the earlier “query doesn’t match index” issues.
No changes needed.
|
@copilot merge this PR if all test passes including the coderabbitai |
|
@Digvijay-x1 I've opened a new pull request, #20, to work on those changes. Once the pull request is ready, I'll request review from you. |
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (3)
python/ranker/engine.py (2)
183-183: Split compound statement for readability.For better readability and linter compliance, split the inline conditional across two lines.
Apply this diff:
- if N == 0: N = 1 # Avoid division by zero issues if DB is empty + if N == 0: + N = 1 # Avoid division by zero issues if DB is empty
34-34: Consider more specific exception types.While the broad
Exceptioncatching provides defensive robustness, catching more specific exceptions (e.g.,psycopg2.Error,IOError,rocksdict.RocksDBError) would improve error diagnostics and prevent masking unexpected issues.Example for database operations:
try: # ... database operations ... - except Exception as e: + except (psycopg2.Error, psycopg2.OperationalError) as e: print(f"Database error: {e}")Also applies to: 47-47, 69-69, 81-81, 109-109, 150-150, 221-221, 241-241, 250-250
.env.example (1)
1-3: Good password placeholder, consider username guidance.The
change_me_in_productionplaceholder for DB_PASS clearly indicates customization is needed. WhileDB_USER=adminis common for local development, you might consider adding a comment suggesting users also customize the username for production deployments.Optional enhancement:
+# Change these values for production deployments DB_USER=admin DB_PASS=change_me_in_production DB_NAME=search_engine
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
.env.example(1 hunks)README.md(1 hunks)docker-compose.yml(3 hunks)python/ranker/engine.py(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- docker-compose.yml
🧰 Additional context used
🧬 Code graph analysis (1)
python/ranker/engine.py (1)
python/ranker/app.py (1)
search(22-45)
🪛 dotenv-linter (4.0.0)
.env.example
[warning] 2-2: [UnorderedKey] The DB_PASS key should go before the DB_USER key
(UnorderedKey)
[warning] 3-3: [UnorderedKey] The DB_NAME key should go before the DB_PASS key
(UnorderedKey)
🪛 Ruff (0.14.8)
python/ranker/engine.py
25-25: Abstract raise to an inner function
(TRY301)
25-25: Avoid specifying long messages outside the exception class
(TRY003)
34-34: Do not catch blind exception: Exception
(BLE001)
47-47: Do not catch blind exception: Exception
(BLE001)
69-69: Do not catch blind exception: Exception
(BLE001)
81-81: Do not catch blind exception: Exception
(BLE001)
109-109: Do not catch blind exception: Exception
(BLE001)
150-150: Do not catch blind exception: Exception
(BLE001)
183-183: Multiple statements on one line (colon)
(E701)
221-221: Do not catch blind exception: Exception
(BLE001)
241-241: Do not catch blind exception: Exception
(BLE001)
250-250: Do not catch blind exception: Exception
(BLE001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: docker-build
🔇 Additional comments (9)
python/ranker/engine.py (8)
1-13: LGTM! Clean import structure with graceful fallback.The optional RocksDB dependency handling with a fallback flag is a solid pattern for managing external dependencies.
21-25: Security improvement confirmed.Requiring explicit DB_USER and DB_PASS environment variables without weak defaults addresses the previous security concern. The clear error message helps developers configure the system properly.
61-71: Division-by-zero protection confirmed.The fallback to 100.0 when avg is None or 0 properly addresses the previous concern about potential division by zero in BM25 calculations.
85-111: Column name consistency confirmed.Both queries now correctly use the
doc_lengthcolumn for word counts, addressing the previous mismatch concern. The single vs. multiple ID handling is implemented correctly.
118-128: Query tokenization aligned with indexer.The preprocessing pipeline (lowercase, remove non-alphanumeric, filter length ≥ 3) now matches the indexer's tokenization logic, addressing the previous mismatch concern. This ensures queries can find indexed documents correctly.
141-154: Improved fallback logic confirmed.The fallback to mock index now triggers when
self.index_dbis None, properly handling cases where RocksDB is available but the index file doesn't exist. This addresses the previous incomplete fallback concern.
174-201: BM25 implementation with tracked placeholders.The BM25 scoring logic is sound with appropriate safety checks (N=0 handling, doc_len fallbacks). The TODO comment at lines 189-191 properly tracks the TF=1 simplification, addressing the previous concern about untracked placeholders.
235-262: Comprehensive resource cleanup implemented.The cleanup implementation with
close(), context manager support (__enter__/__exit__), and__del__fallback addresses the previous concern about resource leaks. Setting connections to None in finally blocks is good practice.README.md (1)
3-24: Clear setup documentation added.The step-by-step environment configuration and Docker instructions provide a smooth onboarding experience. The placeholder password text clearly indicates customization is needed.
Added ranker logic based on BM25 method in engine.py .
Added RocksDb
Summary by CodeRabbit
New Features
Infrastructure
CI
Documentation
✏️ Tip: You can customize this high-level summary in your review settings.