
feat: support remote embedding and reranking via OpenAI-compatible endpoints#446

Open
CountofGlamorgan wants to merge 1 commit into tobi:main from CountofGlamorgan:jamar/remote-embed-rerank

Conversation

@CountofGlamorgan

Motivation

On machines with limited VRAM (e.g. 16GB unified memory MacBooks), loading embedding + reranking models locally via node-llama-cpp is slow with long cold-start times. Meanwhile, a separate GPU machine (e.g. a PC with RTX 5090) can serve these models via llama-server with near-instant response over a local network.

This PR adds native support for offloading embedding and reranking to remote OpenAI-compatible API servers, while keeping query expansion running locally (since it uses a QMD fine-tuned model).

Relation to #415

PR #415 by @shyuan introduced a similar concept. This PR builds on the same direction but addresses several limitations:

| Feature | #415 | This PR |
| --- | --- | --- |
| URL configuration | Single `QMD_REMOTE_URL` | Separate `QMD_REMOTE_EMBED_URL` + `QMD_REMOTE_RERANK_URL` |
| Multi-host deployment | — | ✅ (embed and rerank on different servers/ports) |
| Circuit breaker | — | ✅ Per-endpoint (2 failures → 10min cooldown → half-open retry) |
| Dimension validation | — | ✅ First response locks expected dimensions |
| Connect timeout | — | `QMD_REMOTE_CONNECT_TIMEOUT` (default 500ms) |
| Read timeout | — | `QMD_REMOTE_READ_TIMEOUT` (default 10000ms) |
| Error handling | Returns score=0 on failure | Throws immediately (no silent degradation) |
| Qwen3 instruct format | — | ✅ Auto-detects and injects instruct prefix |
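The "first response locks expected dimensions" behavior can be sketched roughly as follows; the class and field names here are illustrative assumptions, not the PR's actual code:

```typescript
// Sketch: lock the embedding dimension on the first remote response and
// reject any later response whose dimension differs (guards against the
// remote server silently swapping models).
class DimensionLock {
  private expected: number | null = null;

  check(embedding: number[]): void {
    if (this.expected === null) {
      this.expected = embedding.length; // first response sets the lock
    } else if (embedding.length !== this.expected) {
      throw new Error(
        `Embedding dimension changed: expected ${this.expected}, got ${embedding.length}`,
      );
    }
  }
}
```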

Summary of changes

  • src/remote-llm.ts — RemoteLLM class: HTTP calls to OpenAI-compatible /v1/embeddings and /v1/rerank endpoints, with per-endpoint circuit breaker, dimension locking, configurable timeouts, and Qwen3 instruct formatting
  • src/hybrid-llm.ts — HybridLLM class: routes embed/rerank → remote, generate/expandQuery → local LlamaCpp
  • src/llm.ts — Extended LLM interface to support generic default instances (not just LlamaCpp)
  • src/store.ts — Decoupled from LlamaCpp concrete type
  • src/cli/qmd.ts — Env var detection + HybridLLM initialization
  • test/remote-llm.test.ts — Tests covering remote calls, circuit breaker, dimension validation, Qwen3 formatting, timeouts
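The per-endpoint circuit breaker above might look roughly like this minimal sketch; the threshold and cooldown values match the PR's description (2 failures → 10min cooldown → half-open), but the class and method names are assumptions, not the actual implementation:

```typescript
// Sketch: a per-endpoint circuit breaker. After `threshold` consecutive
// failures the breaker opens; once `cooldownMs` has elapsed it allows a
// half-open probe request; a success closes it again.
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;

  constructor(
    private readonly threshold = 2,
    private readonly cooldownMs = 10 * 60 * 1000, // 10 minutes
  ) {}

  canRequest(now = Date.now()): boolean {
    if (this.openedAt === null) return true; // closed: allow
    return now - this.openedAt >= this.cooldownMs; // half-open after cooldown
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null; // close the breaker
  }

  recordFailure(now = Date.now()): void {
    this.failures++;
    if (this.failures >= this.threshold) this.openedAt = now; // (re)open
  }
}
```

One breaker instance per endpoint means an embedding-server outage never blocks reranking, and vice versa.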

Environment variables

QMD_REMOTE_EMBED_URL=http://host:8080     # Remote embedding server
QMD_REMOTE_RERANK_URL=http://host:8081     # Remote reranker (can differ from embed)
QMD_REMOTE_API_KEY=optional                # Bearer token auth
QMD_REMOTE_CONNECT_TIMEOUT=500             # Connect timeout ms (default 500)
QMD_REMOTE_READ_TIMEOUT=10000              # Read timeout ms (default 10000)

No env vars set = behavior completely unchanged (pure local).
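A minimal sketch of the env-var detection, assuming a config shape and helper name invented for illustration (the variable names and defaults are from the list above; the actual wiring in src/cli/qmd.ts may differ):

```typescript
// Sketch: read remote-offload settings from the environment.
// Returning null means "no env vars set" → pure local behavior.
interface RemoteConfig {
  embedUrl?: string;
  rerankUrl?: string;
  apiKey?: string;
  connectTimeoutMs: number;
  readTimeoutMs: number;
}

function readRemoteConfig(
  env: Record<string, string | undefined> = process.env,
): RemoteConfig | null {
  const embedUrl = env.QMD_REMOTE_EMBED_URL;
  const rerankUrl = env.QMD_REMOTE_RERANK_URL;
  if (!embedUrl && !rerankUrl) return null; // unchanged local behavior
  return {
    embedUrl,
    rerankUrl,
    apiKey: env.QMD_REMOTE_API_KEY,
    connectTimeoutMs: Number(env.QMD_REMOTE_CONNECT_TIMEOUT ?? 500),
    readTimeoutMs: Number(env.QMD_REMOTE_READ_TIMEOUT ?? 10000),
  };
}
```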

Tested with

  • Two llama-server instances on PC (RTX 5090) serving Qwen3-Embedding-8B (Q8_0) and Qwen3-Reranker-4B (Q8_0) over 10GbE
  • qmd query end-to-end: local query expansion → remote embedding (189ms) → remote reranking (837ms)
  • Circuit breaker: service down → 2 failures → open state → 10min cooldown → half-open recovery
  • npm run build passes
  • npx vitest run test/remote-llm.test.ts passes
  • No regressions in existing tests

feat: support remote embedding and reranking via OpenAI-compatible endpoints

Add RemoteLLM and HybridLLM classes to offload embedding and reranking
to remote OpenAI-compatible API servers (e.g. llama-server), while
keeping query expansion running locally.

Key features:
- Separate QMD_REMOTE_EMBED_URL and QMD_REMOTE_RERANK_URL for
  independent deployment (different hosts/ports)
- Per-endpoint circuit breaker (2 failures -> 10min cooldown -> half-open)
- Embedding dimension lock (first response locks expected dims)
- Qwen3 instruct formatting detection and prefix injection
- Configurable connect/read timeouts
- Errors throw immediately (no silent fallback to score=0)
- Fully backwards compatible (no env vars = pure local behavior)

Relates to tobi#415 - addresses single-URL limitation, adds reliability
features (circuit breaker, dimension validation, timeout control).
@shyuan

shyuan commented Mar 22, 2026

Great work on this PR! The circuit breaker, dimension validation, connect/read timeout separation, and extractOriginalEmbeddingText() approach are all significantly more robust than what I did in #415. I'm happy to close mine in favor of this one.

A few things I ran into while testing #415 with oMLX that might be worth considering:

1. Unicode surrogate sanitization

When chunking splits a multi-byte emoji in half (common in Twitter/social media data), the resulting unpaired surrogates crash remote tokenizers:

TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]

I added a sanitization step before sending to the remote API:

t.replace(/[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]|\uFFFD/g, "")
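Wrapped in a helper for clarity (the function name is illustrative; the regex is the one above), the sanitization step might look like:

```typescript
// Sketch: strip unpaired UTF-16 surrogates and U+FFFD replacement
// characters before sending text to a remote tokenizer. A valid
// surrogate pair (e.g. an intact emoji) is left untouched.
function sanitizeSurrogates(t: string): string {
  return t.replace(
    /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]|\uFFFD/g,
    "",
  );
}
```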

2. Batch retry fallback

When a batch of 32 embeddings fails (e.g. one text has broken encoding), throwing the entire batch loses 31 good results. In #415 I retry each text individually on batch failure to isolate the bad input:

// Batch failed → retry each text individually to isolate bad inputs
for (let i = 0; i < texts.length; i++) {
  try {
    const resp = await fetch(url, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ input: [texts[i]] }),
    });
    // ...parse the single embedding from resp and push it to results
  } catch (e) {
    results.push(null); // only this one text fails
  }
}

This reduced errors from 47 to 0 in my 469-document test corpus.

3. Rerank timeout

Reranking 40 chunks with Qwen3-Reranker can take 30-360 seconds depending on cold/warm start. The default 10s read timeout will likely hit AbortError on first query. In #415 I used a separate 5-minute timeout for rerank. Your circuit breaker would handle the retry, but the first few queries would all fail until the model is warm.


If you're open to it, I'd like to branch off your PR and submit the above improvements (surrogate sanitization, batch retry, rerank timeout) as a PR to your fork — so they can be folded into this PR cleanly. Let me know!
