
feat(mcp): expose skipRerank and candidateLimit in query tool #435

Open
DmitryPogodaev wants to merge 1 commit into tobi:main from DmitryPogodaev:feat/skip-rerank-mcp


Problem

On CPU-only servers (no GPU), the LLM reranker model (Qwen3-Reranker-0.6B) takes ~2 seconds per document to score. A typical query with 20 candidates takes 30-40 seconds — far exceeding the 1-2s timeouts used by automated RAG hooks.

The internal structuredSearch already supports skipRerank and candidateLimit, but neither is exposed through the MCP query tool.

Changes

  • skipRerank (boolean, optional): added to MCP query tool schema. When true, returns results scored by RRF fusion only — no LLM rerank. Queries complete in 30-50ms instead of 30-40s.
  • candidateLimit: was declared in the MCP schema but never forwarded to store.search(). Now passed through.
  • Added candidateLimit to the SearchOptions interface.
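The forwarding described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `QueryToolArgs`, `toSearchOptions`, and the exact fields of `SearchOptions` are assumed shapes chosen for the example, and the real QMD types may differ.

```typescript
// Hypothetical shapes for illustration only; the real QMD types may differ.
interface QueryToolArgs {
  query: string;
  limit?: number;
  candidateLimit?: number;
  skipRerank?: boolean;
}

interface SearchOptions {
  limit?: number;
  candidateLimit?: number;
  skipRerank?: boolean;
}

// Copy only the options the caller actually set, so the store's own
// defaults still apply when a field is omitted from the tool call.
function toSearchOptions(args: QueryToolArgs): SearchOptions {
  const options: SearchOptions = {};
  if (args.limit !== undefined) options.limit = args.limit;
  if (args.candidateLimit !== undefined) options.candidateLimit = args.candidateLimit;
  if (args.skipRerank !== undefined) options.skipRerank = args.skipRerank;
  return options;
}

// Example: arguments an automated hook might send to the query tool.
const fastHookOptions = toSearchOptions({
  query: "deploy notes",
  candidateLimit: 20,
  skipRerank: true,
});
console.log(fastHookOptions);
```

Leaving unset fields out of the options object (rather than passing `undefined` or a hard-coded default) keeps existing callers on whatever defaults the store already uses.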

Use case

Automated RAG hooks (e.g. Telegram bot preprocessing) on VPS without GPU, where the reranker model is prohibitively slow. skipRerank: true gives fast approximate results; the LLM reranker remains available for interactive / CLI use.
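For readers unfamiliar with what "RRF fusion only" produces, here is a minimal sketch of reciprocal rank fusion. It assumes two already-ranked ID lists (for example keyword and vector hits) and the commonly used constant k = 60; `rrfFuse` and the sample document IDs are hypothetical, and QMD's actual fusion code may weight or combine lists differently.

```typescript
// Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank(d)),
// using 1-based ranks. k = 60 is a conventional default, assumed here.
function rrfFuse(
  rankedLists: string[][],
  k = 60,
): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const rank = index + 1;
      const previous = scores.get(id) ?? 0;
      scores.set(id, previous + 1 / (k + rank));
    });
  }
  // Highest fused score first.
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (a, b) => b.score - a.score,
  );
}

// Hypothetical ranked results from two retrievers.
const keywordHits = ["doc-a", "doc-b", "doc-c"];
const vectorHits = ["doc-b", "doc-d", "doc-a"];
const fused = rrfFuse([keywordHits, vectorHits]);
console.log(fused.map((entry) => entry.id));
```

Because the fused score depends only on rank positions, it needs no model inference, which is consistent with the millisecond-level latencies reported for skipRerank=true.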

Performance

| | With rerank | skipRerank=true |
|---|---|---|
| Cold (model load) | 30-40s | 400-500ms |
| Warm | 30-40s (rerank dominates) | 30-50ms |

Tested on AMD EPYC 8-core (no GPU), QMD 2.0.1, node-llama-cpp 3.18.1.

On CPU-only servers, LLM reranking (0.6B model) takes ~2s per document,
making the query tool unusable with timeouts under 30s.

This commit:
- Adds `skipRerank` boolean parameter to the MCP `query` tool schema.
  When true, returns results scored by RRF fusion only (no LLM rerank).
- Passes `candidateLimit` through to structuredSearch (was declared in
  schema but never forwarded to the store).

Use case: automated RAG hooks with 1-2s timeouts on VPS without GPU.
With skipRerank=true, queries complete in 30-50ms instead of 30-40s.

fxstein commented Mar 18, 2026

Was on my todo list as well. Should now make the MCP server pretty complete.
