feat: support remote embedding and reranking via OpenAI-compatible endpoints #446

CountofGlamorgan wants to merge 1 commit into tobi:main
Conversation
Add RemoteLLM and HybridLLM classes to offload embedding and reranking to remote OpenAI-compatible API servers (e.g. llama-server), while keeping query expansion running locally.

Key features:
- Separate QMD_REMOTE_EMBED_URL and QMD_REMOTE_RERANK_URL for independent deployment (different hosts/ports)
- Per-endpoint circuit breaker (2 failures -> 10min cooldown -> half-open)
- Embedding dimension lock (first response locks expected dims)
- Qwen3 instruct formatting detection and prefix injection
- Configurable connect/read timeouts
- Errors throw immediately (no silent fallback to score=0)
- Fully backwards compatible (no env vars = pure local behavior)

Relates to tobi#415: addresses the single-URL limitation and adds reliability features (circuit breaker, dimension validation, timeout control).
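The per-endpoint circuit breaker described above (2 failures -> 10min cooldown -> half-open) can be sketched roughly as follows. This is a minimal sketch, not the PR's actual code; the class and method names (`CircuitBreaker`, `canRequest`, etc.) are illustrative only.

```typescript
// Sketch of a per-endpoint circuit breaker: 2 consecutive failures open
// the circuit; after a 10-minute cooldown it becomes half-open, and the
// next trial request's outcome decides whether it closes or re-opens.
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly maxFailures = 2,
    private readonly cooldownMs = 10 * 60 * 1000,
  ) {}

  canRequest(now = Date.now()): boolean {
    if (this.failures < this.maxFailures) return true; // closed: allow
    return now - this.openedAt >= this.cooldownMs;     // half-open after cooldown
  }

  recordSuccess(): void {
    this.failures = 0; // close the circuit again
  }

  recordFailure(now = Date.now()): void {
    this.failures++;
    if (this.failures >= this.maxFailures) this.openedAt = now; // (re-)open
  }
}
```

One breaker instance would be kept per endpoint (embed and rerank), so a slow rerank server cannot take down embedding requests.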
Great work on this PR! The circuit breaker, dimension validation, and connect/read timeout separation are all solid additions. A few things I ran into while testing #415 with oMLX that might be worth considering:

1. Unicode surrogate sanitization

When chunking splits a multi-byte emoji in half (common in Twitter/social media data), the resulting unpaired surrogates crash remote tokenizers. I added a sanitization step before sending to the remote API:

```js
t.replace(/[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]|\uFFFD/g, "")
```

2. Batch retry fallback

When a batch of 32 embeddings fails (e.g. one text has broken encoding), throwing away the entire batch loses 31 good results. In #415 I retry each text individually on batch failure to isolate the bad input:

```js
// Batch failed → retry individually to isolate bad inputs
for (let i = 0; i < texts.length; i++) {
  try {
    const resp = await fetch(url, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ input: [texts[i]] }),
    });
    // ...
  } catch (e) {
    results.push(null); // only this one fails
  }
}
```

This reduced errors from 47 to 0 in my 469-document test corpus.

3. Rerank timeout

Reranking 40 chunks with Qwen3-Reranker can take 30-360 seconds depending on cold/warm start, so the default 10s read timeout will likely be hit.

If you're open to it, I'd like to branch off your PR and submit the above improvements (surrogate sanitization, batch retry, rerank timeout) as a PR to your fork, so they can be folded into this PR cleanly. Let me know!
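For reference, the surrogate-sanitization one-liner above can be wrapped as a small helper. This is a sketch; `sanitizeSurrogates` is an illustrative name, not code from either PR:

```typescript
// Strips unpaired UTF-16 surrogate halves (e.g. an emoji split in two by
// chunking) and U+FFFD replacement characters, using the regex quoted in
// the comment above, so remote tokenizers never see invalid sequences.
function sanitizeSurrogates(t: string): string {
  return t.replace(
    /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]|\uFFFD/g,
    ""
  );
}

// A properly paired emoji survives intact...
console.log(sanitizeSurrogates("ok \u{1F600}")); // "ok 😀"
// ...while a lone high surrogate (emoji cut in half) is dropped.
console.log(sanitizeSurrogates("broken \uD83D end")); // "broken  end"
```

Note the lookbehind `(?<!…)` requires a modern JS engine (Node 10+), which any setup running this PR will have.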
Motivation
On machines with limited VRAM (e.g. 16GB unified-memory MacBooks), loading embedding + reranking models locally via node-llama-cpp is slow, with long cold-start times. Meanwhile, a separate GPU machine (e.g. a PC with an RTX 5090) can serve these models via llama-server with near-instant responses over a local network.

This PR adds native support for offloading embedding and reranking to remote OpenAI-compatible API servers, while keeping query expansion running locally (since it uses a QMD fine-tuned model).
Relation to #415
PR #415 by @shyuan introduced a similar concept. This PR builds on the same direction but addresses several limitations:
- Single `QMD_REMOTE_URL` is split into `QMD_REMOTE_EMBED_URL` + `QMD_REMOTE_RERANK_URL`, so embedding and reranking can be served from different hosts/ports
- Configurable timeouts: `QMD_REMOTE_CONNECT_TIMEOUT` (default 500ms) and `QMD_REMOTE_READ_TIMEOUT` (default 10000ms)

Summary of changes
- `src/remote-llm.ts`: `RemoteLLM` class with HTTP calls to OpenAI-compatible `/v1/embeddings` and `/v1/rerank` endpoints, per-endpoint circuit breaker, dimension locking, configurable timeouts, and Qwen3 instruct formatting
- `src/hybrid-llm.ts`: `HybridLLM` class that routes embed/rerank to the remote server and generate/expandQuery to local LlamaCpp
- `src/llm.ts`: extended the `LLM` interface to support generic default instances (not just `LlamaCpp`)
- `src/store.ts`: decoupled from the `LlamaCpp` concrete type
- `src/cli/qmd.ts`: env var detection + `HybridLLM` initialization
- `test/remote-llm.test.ts`: tests covering remote calls, circuit breaker, dimension validation, Qwen3 formatting, timeouts

Environment variables
No env vars set = behavior completely unchanged (pure local).
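As an example, a deployment that splits embedding and reranking across two ports on a LAN GPU box might look like the following. The host and port values are illustrative; only the env var names come from this PR:

```shell
# Point qmd at llama-server instances on a remote GPU machine.
export QMD_REMOTE_EMBED_URL="http://192.168.1.50:8081"
export QMD_REMOTE_RERANK_URL="http://192.168.1.50:8082"
export QMD_REMOTE_CONNECT_TIMEOUT=500    # ms: fail fast if the host is down
export QMD_REMOTE_READ_TIMEOUT=10000     # ms: per-request read budget
```

Unsetting all four variables restores pure local behavior.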
Tested with
- `llama-server` instances on a PC (RTX 5090) serving Qwen3-Embedding-8B (Q8_0) and Qwen3-Reranker-4B (Q8_0) over 10GbE
- `qmd query` end-to-end: local query expansion → remote embedding (189ms) → remote reranking (837ms)
- `npm run build` passes
- `npx vitest run test/remote-llm.test.ts` passes