feat: add OpenAI-compatible and Gemini API embedding support#427

Open
ysys143 wants to merge 2 commits into tobi:main from ysys143:feat/api-embedding

Conversation

@ysys143 ysys143 commented Mar 18, 2026

What

Adds external embedding API support as an alternative to local GGUF models. Three environment variables activate API mode:

| Variable | Description | Example |
|---|---|---|
| `QMD_EMBED_API_URL` | Base URL (presence activates API mode) | `https://api.openai.com/v1` |
| `QMD_EMBED_API_KEY` | API key | `sk-...` |
| `QMD_EMBED_API_MODEL` | Model name | `text-embedding-3-small` |

API type is auto-detected from the URL — no extra config needed:

  • URL contains `googleapis.com` → Gemini (`batchEmbedContents` endpoint)
  • Otherwise → OpenAI-compatible format (works with Ollama, LM Studio, OpenAI, etc.)
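The detection rule above can be sketched in a few lines. This is an illustrative sketch, not the PR's actual code; the function and type names are hypothetical:

```typescript
// Hypothetical sketch of the URL-based API detection described above.
type EmbedApiKind = "gemini" | "openai";

function detectEmbedApi(baseUrl: string): EmbedApiKind {
  // Any URL containing googleapis.com is treated as Gemini
  // (batchEmbedContents); everything else falls back to the
  // OpenAI-compatible embeddings format.
  return baseUrl.includes("googleapis.com") ? "gemini" : "openai";
}
```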

Usage

```bash
# OpenAI
export QMD_EMBED_API_URL="https://api.openai.com/v1"
export QMD_EMBED_API_KEY="sk-..."
export QMD_EMBED_API_MODEL="text-embedding-3-small"
qmd embed && qmd vsearch "test query"

# Gemini (free tier)
export QMD_EMBED_API_URL="https://generativelanguage.googleapis.com/v1beta"
export QMD_EMBED_API_KEY="AIza..."
export QMD_EMBED_API_MODEL="gemini-embedding-001"
qmd embed && qmd vsearch "test query"

# Ollama (OpenAI-compatible)
export QMD_EMBED_API_URL="http://localhost:11434/v1"
export QMD_EMBED_API_MODEL="nomic-embed-text"
qmd embed && qmd vsearch "test query"
```

Why

Local GGUF embedding is great for privacy and offline use, but has tradeoffs:

  • Requires downloading a model (~300MB+)
  • Slower on CPU, needs Metal/CUDA for reasonable speed
  • Limited model selection

API mode is useful when:

  • Speed matters (cloud APIs are 2–5× faster in practice)
  • You want higher-dimensional embeddings (3072 vs 768)
  • Running in CI or on a machine without GPU

Implementation notes

  • `formatQueryForEmbedding` / `formatDocForEmbedding` return raw text when `modelUri` starts with `api:` — cloud models don't use nomic-style task prefixes
  • Model stored in `content_vectors.model` as `api:<model-name>` — switching between local and API requires re-embedding (same as switching local models, existing behavior)
  • Vector dimension is set dynamically on first embed, so any dimension works without schema changes
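For the OpenAI-compatible path, the request shape is the standard embeddings endpoint (`POST {base}/embeddings` with `{model, input}`). A minimal sketch of building that request; the helper name and types are illustrative, not the PR's actual code:

```typescript
// Hypothetical request builder for the OpenAI-compatible embeddings
// endpoint described above (works with OpenAI, Ollama, LM Studio, etc.).
interface EmbedRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

function buildEmbedRequest(
  baseUrl: string,
  apiKey: string,
  model: string,
  texts: string[],
): EmbedRequest {
  return {
    // Standard OpenAI embeddings route: POST {base}/embeddings
    url: `${baseUrl.replace(/\/$/, "")}/embeddings`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`,
      },
      // The API accepts a batch of inputs in one call, so all chunk
      // texts can be embedded per request rather than one at a time.
      body: JSON.stringify({ model, input: texts }),
    },
  };
}
```

The response carries vectors under `data[i].embedding`, which is also where the dimension is read dynamically on first embed.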

Benchmark (63 chunks, Korean+English, M3 Max)

| Model | Per chunk | Cost (63 chunks) | Dims |
|---|---|---|---|
| embeddinggemma-300M (local) | 72ms | $0 | 768 |
| text-embedding-3-small | 16ms | $0.000290 | 1536 |
| text-embedding-3-large | 13ms | $0.001882 | 3072 |
| gemini-embedding-001 | 38ms | free | 3072 |
| gemini-embedding-2-preview | 41ms | free | 3072 |

Full benchmark results + script in `docs/embedding-benchmark.md` and `scripts/benchmark-embed.ts`.

Relation to other PRs

#406 and #415 also address remote embedding, though the approaches differ.

Happy to collaborate with the other authors or adjust the approach based on feedback.

