
feat(gemini): add GeminiDenseEmbedder text embedding provider#751

Open
chethanuk wants to merge 7 commits into volcengine:main from chethanuk:feat/gemini-landing

Conversation


@chethanuk (Contributor) commented Mar 18, 2026

Description

Phase 1 of #566: Adds GeminiDenseEmbedder — a production-ready, opt-in text-only dense embedding provider backed by Google's google-genai SDK. Volcengine remains the default; Gemini is enabled via provider: "gemini" in ov.conf. No existing behaviour changes.

This PR implements Phase 1 (text-only). Phase 2 (multimodal embedding) is tracked in #566 — the issue remains open.

Phased Implementation

  • Phase 1 (this PR) — Text-only dense embedding: 8 task types, MRL dimensions 1-3072, SDK retry, async batching, non-symmetric routing via the #702 pattern
  • Phase 2 (planned) — Multimodal embedding: image/audio/video via the Parts API, cross-modal retrieval, aligned with the #718 approach

Key capabilities

  • Text-only — supports_multimodal = False; multimodal is Phase 2
  • Non-symmetric routing — get_query_embedder() / get_document_embedder() route query_param/document_param to Gemini task types via the factory (follows feat(embedding): combine document embedder and query embedder to avoi… #702 pattern)
  • Per-call task_type + title — override at call time; all 8 Gemini task types supported
  • Case-insensitive config — task_type, query_param, and document_param are auto-uppercased during config normalization
  • SDK-native retry — HttpRetryOptions(attempts=3, exp_base=2, max_delay=30s) with an import guard for SDK < 0.8
  • Async concurrent batching — async_embed_batch() dispatches 100-text chunks in parallel via anyio
  • Empty-text guard — returns zero-vectors with warning log; preserves index alignment
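The non-symmetric routing above can be sketched roughly as follows. This is an illustrative stand-in, not the actual PR code: `route_task_type` and `GEMINI_TASK_TYPES` are hypothetical names, though the config keys (`query_param`, `document_param`), the case-insensitive handling, and the eight task types follow the PR description.

```python
# Hypothetical sketch of the query/document task-type routing described
# above; function and constant names are illustrative, not the PR's API.

GEMINI_TASK_TYPES = {
    "RETRIEVAL_QUERY", "RETRIEVAL_DOCUMENT", "SEMANTIC_SIMILARITY",
    "CLASSIFICATION", "CLUSTERING", "QUESTION_ANSWERING",
    "FACT_VERIFICATION", "CODE_RETRIEVAL_QUERY",
}

def route_task_type(config: dict, *, is_query: bool) -> str:
    """Pick the Gemini task type for query vs. document embedding."""
    key = "query_param" if is_query else "document_param"
    default = "RETRIEVAL_QUERY" if is_query else "RETRIEVAL_DOCUMENT"
    task = config.get(key, default).upper()  # case-insensitive config
    if task not in GEMINI_TASK_TYPES:
        raise ValueError(f"unknown task_type: {task}")
    return task
```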
flowchart LR
    subgraph Config["ov.conf / EmbeddingModelConfig"]
        CFG["provider: gemini\napi_key: sk-…\ntask_type: RETRIEVAL_DOCUMENT\ndimension: 1536"]
    end

    subgraph Embedder["GeminiDenseEmbedder(DenseEmbedderBase)"]
        direction TB
        BC["_build_config(task_type?, title?)"]
        E["embed(text, task_type?, title?)"]
        EB["embed_batch(texts, titles?)"]
        AEB["async_embed_batch(texts)"]
        E --> BC
        EB --> BC
        AEB --> BC
    end

    subgraph SDK["google-genai SDK"]
        direction TB
        HTTP["HttpOptions\n(retry: 3× exp backoff)"]
        SYNC["models.embed_content()"]
        ASYNC["aio.models.embed_content()\n(semaphore-bounded)"]
        HTTP --> SYNC
        HTTP --> ASYNC
    end

    Config --> Embedder
    BC --> SYNC
    AEB --> ASYNC
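Written out as a config fragment, the block in the diagram might look like the following. The key/value pairs mirror the diagram and PR description; the surrounding structure and the api_key value are placeholders, not the exact `examples/ov.conf.example` contents.

```yaml
# Hypothetical ov.conf fragment enabling the Gemini provider
# (keys mirror the diagram above; api_key is a placeholder)
embedding:
  provider: gemini
  api_key: "sk-..."               # Google API key
  task_type: RETRIEVAL_DOCUMENT   # any case accepted; auto-uppercased
  dimension: 1536                 # MRL truncation target (1-3072)
```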

Related Issue

Partial implementation of #566 (Phase 1: text-only dense embedding; Phase 2: multimodal — issue stays open)

Type of Change

  • Bug fix
  • New feature (non-breaking, opt-in via config)
  • Breaking change
  • Documentation update
  • Refactoring
  • Performance improvement
  • Test update

Changes Made

  • openviking/models/embedder/gemini_embedders.py — GeminiDenseEmbedder(DenseEmbedderBase): _build_config(), per-call task_type/title, SDK retry, __repr__, empty-text guard with warning log
  • openviking_cli/utils/config/embedding_config.py — registers "gemini" provider; validates + auto-uppercases task_type/query_param/document_param; factory routes context for non-symmetric mode
  • openviking/models/embedder/__init__.py — exports GeminiDenseEmbedder; annotation: "text-only; multimodal planned"
  • examples/ov.conf.example — Gemini config snippet
  • pyproject.toml — google-genai>=0.8 dep; optional gemini-async extra for anyio
  • tests/unit/test_gemini_embedder.py — 41 unit tests (mock-only, no API key needed)
  • tests/unit/test_embedding_config_gemini.py — 18 config validation tests (incl. case-insensitive task_type)
  • tests/integration/ — 49 integration tests (auto-skip without GOOGLE_API_KEY)

Testing

Suite                             Tests   Live key required
test_gemini_embedder.py             41    No
test_embedding_config_gemini.py     18    No
test_gemini_e2e.py                   7    Yes
test_gemini_embedding_it.py         22    Yes
test_gemini_openviking_it.py        20    Yes

108 passed, 0 skipped  ← with GOOGLE_API_KEY
 59 passed, 49 skipped ← without GOOGLE_API_KEY (CI)

Checklist

  • Code follows project coding style (ruff + black formatted)
  • Self-review completed
  • Tests added and passing
  • Tested on Linux and macOS

- Adds GeminiDenseEmbedder backed by google-genai>=0.8.0 SDK
- Supports all 8 task types, MRL dimensions 1-3072, async batch via anyio
- Wires gemini provider into EmbeddingModelConfig factory + context routing
- Adds unit tests (mocked, no API key) and IT tests (auto-skip without GOOGLE_API_KEY)
- Updates pyproject.toml with google-genai, anyio, pytest-asyncio, pytest-cov deps
…t from IT test

- GeminiDenseEmbedder.supports_multimodal = False (text-only provider)
- Remove ModalContent/Vectorize import from test_gemini_e2e.py (doesn't exist in gem-v1)
- Remove TestGeminiE2EMultimodalEmbedding class (multimodal out of scope for text-only slice)
- embed_batch() fallback: pass is_query through to self.embed()
- Remove redundant _l2_normalize() calls in embed/embed_batch/async_embed_batch;
  Gemini API already returns L2-normalized vectors, truncate_and_normalize is sufficient
- Remove now-dead _l2_normalize() function and unused math import
- embedding_config.py: fix typo "Secretfor" → "Secret for" in sk field description
- conftest.py: narrow broad except Exception to (ImportError, ModuleNotFoundError, AttributeError)
- test_gemini_embedding_it.py: construct fake API key dynamically to avoid hardcoded secret pattern
- test_gemini_openviking_it.py: add dimension assertion in test_dimension_variant_add_search;
  fix tautology assertion (>= 0 → > 0) in test_session_search_smoke
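The truncate-then-normalize step the commit above relies on might look roughly like this. The helper name `truncate_and_normalize` comes from the commit message; the body is an assumption about what such a helper does (MRL truncation to the configured dimension, then L2 renormalization).

```python
import math

# Illustrative sketch of MRL truncate-then-renormalize; the helper name
# is from the commit message, the implementation is an assumption.
def truncate_and_normalize(vector: list[float], dim: int) -> list[float]:
    v = vector[:dim]  # MRL: keep the leading `dim` components
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else v
```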
…dd pytest-xdist

- embed()/embed_batch(): return zero vector immediately for empty/whitespace-only
  text, avoiding non-descriptive Gemini API errors
- embed_batch(): filter empty strings pre-API call; reassemble results in original
  order so callers receive a zero vector at each empty position
- Remove _l2_normalize() function and all call sites (embed, embed_batch,
  async_embed_batch): Gemini API already returns L2-normalized vectors;
  truncate_and_normalize() is sufficient
- Remove unused `import math` (was only used by _l2_normalize)
- Add pytest-xdist>=3.0 to [dependency-groups].dev for parallel test runs
- Add 2 unit tests: test_embed_empty_string_returns_zero_vector,
  test_embed_batch_skips_empty_strings (50 unit tests total)
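The filter-and-reassemble logic described in the commit above can be sketched as below. `embed_batch_with_empty_guard` and `call_api` are illustrative stand-ins for the real embed_batch() and the Gemini batch call; the zero-vector-per-empty-slot behaviour follows the commit message.

```python
# Sketch of embed_batch()'s empty-text handling: filter empty strings
# before the API call, then reassemble results in the original order so
# each empty position receives a zero vector. Names are illustrative.
def embed_batch_with_empty_guard(texts, dimension, call_api):
    nonempty = [(i, t) for i, t in enumerate(texts) if t and t.strip()]
    results = [[0.0] * dimension for _ in texts]  # zero vector per slot
    if nonempty:
        vectors = call_api([t for _, t in nonempty])
        for (i, _), vec in zip(nonempty, vectors):
            results[i] = vec
    return results
```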
…ry, repr

- Add _build_config() to merge per-call task_type/title with instance defaults
- embed() gains keyword-only task_type= and title= overrides
- embed_batch() gains keyword-only task_type= and titles=; falls back to
  per-item embed() calls when titles is provided (per-document metadata)
- async_embed_batch() switches to _build_config() (no new args)
- Client constructed with HttpOptions(retry_options=HttpRetryOptions(
  attempts=3, initial_delay=1.0, max_delay=30.0, exp_base=2.0))
  when SDK >= 0.8.0; falls back to plain Client on ImportError
- Add __repr__: GeminiDenseEmbedder(model=..., dim=..., task_type=...)
- Remove now-dead self._embed_config from __init__
- 8 new unit tests in TestBuildConfig; fix test_init_stores_fields
  assertion to allow http_options kwarg
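The import-guarded retry construction described above might be structured like this. The HttpOptions/HttpRetryOptions names and parameters are taken from the commit message; the wrapper function is a hypothetical sketch, with None signalling "construct a plain Client".

```python
# Sketch of the SDK-version guard: build retry-enabled HttpOptions when
# google-genai >= 0.8 exposes HttpRetryOptions, else fall back to None
# (i.e. a plain Client). The wrapper name is illustrative.
def build_http_options():
    try:
        from google.genai.types import HttpOptions, HttpRetryOptions
    except ImportError:
        return None  # SDK < 0.8 or not installed: no retry options
    return HttpOptions(
        retry_options=HttpRetryOptions(
            attempts=3, initial_delay=1.0, max_delay=30.0, exp_base=2.0
        )
    )
```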
@chethanuk
Contributor Author

@qin-ctx Please review and merge this


@ZaynJarvis
Collaborator

#702 is merged for query and document embedding support.

get_query_embedder() is not the expected usage.

btw check if you can improve based on #718


@qin-ptr left a comment


Review Summary

This PR adds Google Gemini as a text embedding provider with excellent test coverage (107 tests) and solid error handling. However, there are three blocking issues that need to be addressed:

  1. Design mismatch: Issue #566 explicitly requests multimodal retrieval support, but this PR only implements text-only embedding (supports_multimodal = False). The PR description should clarify this is a phased implementation with a roadmap for multimodal support, or change the issue reference to indicate partial completion.

  2. Documentation inconsistency: Code comment claims "multimodal" but implementation is text-only.

  3. CI failure: lint job failed due to formatting issues in test_gemini_embedder.py.

Once these are resolved, the PR will be ready to merge. The implementation quality is strong, with comprehensive testing, graceful error handling, and proper SDK retry integration.

🤖 I am a bot owned by @qin-ctx.

- OpenAI: Dense only
- Volcengine: Dense, Sparse, Hybrid
- Jina AI: Dense only
- Google Gemini: Dense only (multimodal)

[Bug] (blocking)

Comment says "Google Gemini: Dense only (multimodal)" but the actual implementation has GeminiDenseEmbedder.supports_multimodal = False. This is contradictory.

Should either:

  • Change to "Google Gemini: Dense only (text-only; multimodal planned)"
  • Or remove the "(multimodal)" annotation entirely until multimodal support is actually implemented

# text-embedding-004: 768 fixed-dim legacy model, does not support MRL truncation
# Future gemini-embedding-*: default 3072 via _default_dimension() fallback
# Future text-embedding-*: default 768 via _default_dimension() prefix rule
supports_multimodal: bool = False # text-only; multimodal planned separately

[Design] (blocking)

This PR claims to close issue #566, which explicitly requests "first-class Gemini Embedding 2 support for multimodal retrieval". The issue emphasizes:

  • "it should not reduce everything to plain text before embedding"
  • "multimodal inputs handled as multimodal, not flattened by default"

However, this implementation sets supports_multimodal = False and only handles text. This is a fundamental mismatch between the issue requirement and the PR implementation.

Recommendation: Either:

  1. Update the PR description to clarify this is phase 1 (text-only) with a roadmap for phase 2 (multimodal), and change the issue reference to "Partial implementation of [Feature]: Add first-class Gemini Embedding 2 support for multimodal retrieval #566", keeping the issue open
  2. Or implement multimodal support in this PR as originally requested by the issue

The current "Closes: #566" statement is misleading and will cause users to expect multimodal functionality that doesn't exist.

@@ -0,0 +1,479 @@
# Copyright (c) 2026 Beijing Volcano Engine Technology Co., Ltd.

[Bug] (blocking)

CI lint job failed because this file needs formatting:

Would reformat: tests/unit/test_gemini_embedder.py

Please run black tests/unit/test_gemini_embedder.py and commit the formatted version.

if backend is not None and provider is None:
data["provider"] = backend
for key in ("query_param", "document_param"):
for key in ("input_type",):

[Suggestion] (non-blocking)

The for key in ("input_type",): loop normalizes input_type to lowercase, but the new Gemini provider uses task_type instead of input_type. Consider also normalizing task_type here, or document that task_type must be uppercase (which is currently enforced in the Gemini validator at line 176).

task_type: Optional[str] = None,
title: Optional[str] = None,
) -> EmbedResult:
if not text or not text.strip():

[Suggestion] (non-blocking)

Returning a zero vector for empty text is a reasonable fallback, but it happens silently. Consider logging a warning so callers know their empty input was handled specially:

if not text or not text.strip():
    logger.warning("Empty text passed to embed(), returning zero vector")
    return EmbedResult(dense_vector=[0.0] * self._dimension)

This helps with debugging when unexpected empty strings appear in production.

…fig normalization, empty-text warning

- Fix "(multimodal)" annotation to "(text-only; multimodal planned)" to match supports_multimodal=False (B1)
- Run black formatter on test_gemini_embedder.py (B3)
- Add auto-uppercase normalization for task_type/query_param/document_param in sync_provider_backend so lowercase config values pass validation (NB1)
- Add logger.warning on empty text in embed() for production debuggability (NB2)
- Add test_gemini_task_type_case_insensitive test (NB1)
- Fix ruff import sorting in __init__.py
@chethanuk
Contributor Author

Thanks for the thorough review! All issues addressed in commit bcfd612:

@qin-ptr — blocking fixes

@qin-ptr — non-blocking (adopted both)

  • NB1 task_type normalization: Added auto-uppercase in sync_provider_backend for task_type, query_param, document_param. Users can now write task_type: "retrieval_document" (any case) in config. Added test test_gemini_task_type_case_insensitive.
  • NB2 Empty text warning: Added logger.warning("Empty text passed to embed(), returning zero vector")
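The NB1 normalization can be sketched as follows. The key names match the PR; `normalize_gemini_keys` is an illustrative stand-in for the logic added to sync_provider_backend, not the actual function.

```python
# Sketch of the auto-uppercase normalization (NB1): lowercase config
# values for these keys now pass validation. Function name is a stand-in.
def normalize_gemini_keys(data: dict) -> dict:
    for key in ("task_type", "query_param", "document_param"):
        value = data.get(key)
        if isinstance(value, str):
            data[key] = value.upper()
    return data
```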

@ZaynJarvis — #702 pattern + #718

Verification

ruff check: All checks passed!
pytest: 59 passed (41 embedder + 18 config, incl. 1 new normalization test)
black --check: 1 file would be left unchanged

CI uses ruff format (not black); re-ran ruff format to match.
@chethanuk force-pushed the feat/gemini-landing branch from 5ac0ebc to 1586f65 on March 19, 2026 at 10:12