Description
Currently, qmd relies on node-llama-cpp with a local Qwen model (qwen3-reranker) for the reranking step. While this fits the project's fantastic "all-local" design, it creates a massive bottleneck for users running on CPU-only machines when querying non-English text.
To illustrate the performance degradation on a CPU-only setup:
- English text: ~1.2 seconds
- Dutch text: ~72 seconds
This ~60x latency spike makes the CLI effectively unusable for non-English or multilingual knowledge bases without dedicated GPU acceleration.
Proposed Solution
Introduce an optional configuration to route the reranking step through the Cohere Rerank API (specifically their multilingual models like rerank-multilingual-v3.0 or v3.5).
While qmd is built to track SOTA local approaches, providing an opt-in API fallback for this specific step would allow users without GPUs to maintain fast, interactive, and high-quality search speeds regardless of the language.
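A rough sketch of what the API path could look like, targeting Cohere's v2 REST rerank endpoint (`POST https://api.cohere.com/v2/rerank`). The helper names below are hypothetical and not part of the current qmd codebase; the request/response shapes follow Cohere's published API, but treat this as an illustration rather than a drop-in implementation:

```typescript
// Hypothetical helpers; not existing qmd code.
interface RerankRequest {
  model: string;
  query: string;
  documents: string[];
  top_n: number;
}

// Build the JSON body for Cohere's v2 rerank endpoint.
function buildRerankRequest(
  query: string,
  documents: string[],
  topN = 10,
): RerankRequest {
  return { model: "rerank-multilingual-v3.0", query, documents, top_n: topN };
}

// Send query + chunks to Cohere instead of the local qwen3-reranker.
// Expected response shape: { results: [{ index, relevance_score }, ...] }
async function cohereRerank(
  apiKey: string,
  query: string,
  documents: string[],
): Promise<number[]> {
  const res = await fetch("https://api.cohere.com/v2/rerank", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(buildRerankRequest(query, documents)),
  });
  if (!res.ok) throw new Error(`Cohere rerank failed: HTTP ${res.status}`);
  const data = await res.json();
  // Return document indices ordered by relevance, highest first.
  return data.results.map((r: { index: number }) => r.index);
}
```

Since the candidate chunks already come out of the local BM25 + vector stages, only the final top-N candidates would ever leave the machine, which keeps the API payload (and cost) small.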
Expected Behavior
- Users can provide a COHERE_API_KEY via environment variables or the configuration file.
- When enabled, the rerank step bypasses the local qwen3-reranker execution and sends the query and document chunks directly to Cohere's endpoint.
- Users keep the benefits of local BM25 and vector semantic search but bypass the severe CPU cross-encoder compute penalty.
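The opt-in behavior above could be sketched as a small backend-selection step. `COHERE_API_KEY` is the env var this proposal introduces; all other names here are illustrative, not qmd's actual config API:

```typescript
// Hypothetical selection logic; names are illustrative only.
type RerankBackend = "cohere" | "local";

interface QmdConfig {
  cohereApiKey?: string; // e.g. read from the configuration file
}

// Use the Cohere API only when a key is explicitly provided via the
// environment or the config file; otherwise keep the default local
// qwen3-reranker path, preserving the all-local behavior.
function chooseRerankBackend(
  env: Record<string, string | undefined>,
  config: QmdConfig = {},
): RerankBackend {
  const key = env.COHERE_API_KEY ?? config.cohereApiKey;
  return key ? "cohere" : "local";
}
```

Keeping the local path as the default means existing users see no behavior change unless they deliberately set a key.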
Alternatives Considered
- Smaller local multilingual models: These might offer a slight speedup but usually sacrifice too much accuracy or still heavily bottleneck a standard CPU.
- Skipping the rerank step entirely: This avoids the latency but defeats the purpose of the hybrid search pipeline and significantly degrades context extraction quality.