
feat: support remote embedding and reranking via OpenAI-compatible endpoints#446

Open
CountofGlamorgan wants to merge 1 commit into tobi:main from CountofGlamorgan:jamar/remote-embed-rerank

Conversation

@CountofGlamorgan

Motivation

On machines with limited VRAM (e.g. 16GB unified memory MacBooks), loading embedding + reranking models locally via node-llama-cpp is slow with long cold-start times. Meanwhile, a separate GPU machine (e.g. a PC with RTX 5090) can serve these models via llama-server with near-instant response over a local network.

This PR adds native support for offloading embedding and reranking to remote OpenAI-compatible API servers, while keeping query expansion running locally (since it uses a QMD fine-tuned model).

Relation to #415

PR #415 by @shyuan introduced a similar concept. This PR builds on the same direction but addresses several limitations:

| Feature | #415 | This PR |
| --- | --- | --- |
| URL configuration | Single `QMD_REMOTE_URL` | Separate `QMD_REMOTE_EMBED_URL` + `QMD_REMOTE_RERANK_URL` |
| Multi-host deployment | — | ✅ (embed and rerank on different servers/ports) |
| Circuit breaker | — | ✅ Per-endpoint (2 failures → 10min cooldown → half-open retry) |
| Dimension validation | — | ✅ First response locks expected dimensions |
| Connect timeout | — | `QMD_REMOTE_CONNECT_TIMEOUT` (default 500ms) |
| Read timeout | — | `QMD_REMOTE_READ_TIMEOUT` (default 10000ms) |
| Error handling | Returns score=0 on failure | Throws immediately (no silent degradation) |
| Qwen3 instruct format | — | ✅ Auto-detects and injects instruct prefix |
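The "first response locks expected dimensions" behavior can be sketched roughly as follows; the class and field names here are illustrative assumptions, not the PR's actual code:

```typescript
// Sketch: lock the embedding dimension on the first remote response and
// reject any later response whose dimension differs (guards against the
// remote server silently swapping models).
class DimensionLock {
  private expected: number | null = null;

  check(embedding: number[]): void {
    if (this.expected === null) {
      this.expected = embedding.length; // first response sets the lock
    } else if (embedding.length !== this.expected) {
      throw new Error(
        `Embedding dimension changed: expected ${this.expected}, got ${embedding.length}`,
      );
    }
  }
}
```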

Summary of changes

  • src/remote-llm.ts — RemoteLLM class: HTTP calls to OpenAI-compatible /v1/embeddings and /v1/rerank endpoints, with per-endpoint circuit breaker, dimension locking, configurable timeouts, and Qwen3 instruct formatting
  • src/hybrid-llm.ts — HybridLLM class: routes embed/rerank → remote, generate/expandQuery → local LlamaCpp
  • src/llm.ts — Extended LLM interface to support generic default instances (not just LlamaCpp)
  • src/store.ts — Decoupled from LlamaCpp concrete type
  • src/cli/qmd.ts — Env var detection + HybridLLM initialization
  • test/remote-llm.test.ts — Tests covering remote calls, circuit breaker, dimension validation, Qwen3 formatting, timeouts
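The per-endpoint circuit breaker above might look roughly like this minimal sketch; the threshold and cooldown values match the PR's description (2 failures → 10min cooldown → half-open), but the class and method names are assumptions, not the actual implementation:

```typescript
// Sketch: a per-endpoint circuit breaker. After `threshold` consecutive
// failures the breaker opens; once `cooldownMs` has elapsed it allows a
// half-open probe request; a success closes it again.
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;

  constructor(
    private readonly threshold = 2,
    private readonly cooldownMs = 10 * 60 * 1000, // 10 minutes
  ) {}

  canRequest(now = Date.now()): boolean {
    if (this.openedAt === null) return true; // closed: allow
    return now - this.openedAt >= this.cooldownMs; // half-open after cooldown
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null; // close the breaker
  }

  recordFailure(now = Date.now()): void {
    this.failures++;
    if (this.failures >= this.threshold) this.openedAt = now; // (re)open
  }
}
```

One breaker instance per endpoint means an embedding-server outage never blocks reranking, and vice versa.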

Environment variables

QMD_REMOTE_EMBED_URL=http://host:8080     # Remote embedding server
QMD_REMOTE_RERANK_URL=http://host:8081     # Remote reranker (can differ from embed)
QMD_REMOTE_API_KEY=optional                # Bearer token auth
QMD_REMOTE_CONNECT_TIMEOUT=500             # Connect timeout ms (default 500)
QMD_REMOTE_READ_TIMEOUT=10000              # Read timeout ms (default 10000)

No env vars set = behavior completely unchanged (pure local).
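A minimal sketch of the env-var detection, assuming a config shape and helper name invented for illustration (the variable names and defaults are from the list above; the actual wiring in src/cli/qmd.ts may differ):

```typescript
// Sketch: read remote-offload settings from the environment.
// Returning null means "no env vars set" → pure local behavior.
interface RemoteConfig {
  embedUrl?: string;
  rerankUrl?: string;
  apiKey?: string;
  connectTimeoutMs: number;
  readTimeoutMs: number;
}

function readRemoteConfig(
  env: Record<string, string | undefined> = process.env,
): RemoteConfig | null {
  const embedUrl = env.QMD_REMOTE_EMBED_URL;
  const rerankUrl = env.QMD_REMOTE_RERANK_URL;
  if (!embedUrl && !rerankUrl) return null; // unchanged local behavior
  return {
    embedUrl,
    rerankUrl,
    apiKey: env.QMD_REMOTE_API_KEY,
    connectTimeoutMs: Number(env.QMD_REMOTE_CONNECT_TIMEOUT ?? 500),
    readTimeoutMs: Number(env.QMD_REMOTE_READ_TIMEOUT ?? 10000),
  };
}
```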

Tested with

  • Two llama-server instances on PC (RTX 5090) serving Qwen3-Embedding-8B (Q8_0) and Qwen3-Reranker-4B (Q8_0) over 10GbE
  • qmd query end-to-end: local query expansion → remote embedding (189ms) → remote reranking (837ms)
  • Circuit breaker: service down → 2 failures → open state → 10min cooldown → half-open recovery
  • npm run build passes
  • npx vitest run test/remote-llm.test.ts passes
  • No regressions in existing tests

feat: support remote embedding and reranking via OpenAI-compatible endpoints

Add RemoteLLM and HybridLLM classes to offload embedding and reranking
to remote OpenAI-compatible API servers (e.g. llama-server), while
keeping query expansion running locally.

Key features:
- Separate QMD_REMOTE_EMBED_URL and QMD_REMOTE_RERANK_URL for
  independent deployment (different hosts/ports)
- Per-endpoint circuit breaker (2 failures -> 10min cooldown -> half-open)
- Embedding dimension lock (first response locks expected dims)
- Qwen3 instruct formatting detection and prefix injection
- Configurable connect/read timeouts
- Errors throw immediately (no silent fallback to score=0)
- Fully backwards compatible (no env vars = pure local behavior)

Relates to tobi#415 - addresses single-URL limitation, adds reliability
features (circuit breaker, dimension validation, timeout control).
@shyuan

shyuan commented Mar 22, 2026

Great work on this PR! The circuit breaker, dimension validation, connect/read timeout separation, and extractOriginalEmbeddingText() approach are all significantly more robust than what I did in #415. I'm happy to close mine in favor of this one.

A few things I ran into while testing #415 with oMLX that might be worth considering:

1. Unicode surrogate sanitization

When chunking splits a multi-byte emoji in half (common in Twitter/social media data), the resulting unpaired surrogates crash remote tokenizers:

TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]

I added a sanitization step before sending to the remote API:

t.replace(/[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]|\uFFFD/g, "")
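Wrapped in a helper for clarity (the function name is illustrative; the regex is the one above), the sanitization step might look like:

```typescript
// Sketch: strip unpaired UTF-16 surrogates and U+FFFD replacement
// characters before sending text to a remote tokenizer. A valid
// surrogate pair (e.g. an intact emoji) is left untouched.
function sanitizeSurrogates(t: string): string {
  return t.replace(
    /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]|\uFFFD/g,
    "",
  );
}
```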

2. Batch retry fallback

When a batch of 32 embeddings fails (e.g. one text has broken encoding), throwing the entire batch loses 31 good results. In #415 I retry each text individually on batch failure to isolate the bad input:

// Batch failed → retry each text individually to isolate bad inputs
for (let i = 0; i < texts.length; i++) {
  try {
    const resp = await fetch(url, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ input: [texts[i]] }),
    });
    // ...parse the single embedding from resp and push it to results
  } catch (e) {
    results.push(null); // only this one text fails
  }
}

This reduced errors from 47 to 0 in my 469-document test corpus.

3. Rerank timeout

Reranking 40 chunks with Qwen3-Reranker can take 30-360 seconds depending on cold/warm start. The default 10s read timeout will likely hit AbortError on first query. In #415 I used a separate 5-minute timeout for rerank. Your circuit breaker would handle the retry, but the first few queries would all fail until the model is warm.


If you're open to it, I'd like to branch off your PR and submit the above improvements (surrogate sanitization, batch retry, rerank timeout) as a PR to your fork — so they can be folded into this PR cleanly. Let me know!
