feat(gemini): add GeminiDenseEmbedder text embedding provider #751

chethanuk wants to merge 7 commits into volcengine:main
Conversation
- Adds GeminiDenseEmbedder backed by the google-genai>=0.8.0 SDK
- Supports all 8 task types, MRL dimensions 1-3072, async batch via anyio
- Wires the gemini provider into the EmbeddingModelConfig factory + context routing
- Adds unit tests (mocked, no API key) and IT tests (auto-skip without GOOGLE_API_KEY)
- Updates pyproject.toml with google-genai, anyio, pytest-asyncio, pytest-cov deps
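The async batch path mentioned above can be sketched as follows. The PR dispatches chunks via anyio; this sketch uses stdlib `asyncio.gather` so it is self-contained, and `embed_chunk` is a hypothetical stand-in for the real SDK call, not the PR's actual code.

```python
# Sketch of the chunked async batch pattern (the PR uses anyio task
# groups; asyncio.gather is used here so the sketch is stdlib-only).
import asyncio

CHUNK = 100  # the PR dispatches 100-text chunks in parallel

async def embed_chunk(chunk: list[str]) -> list[list[float]]:
    # Hypothetical stand-in for aio.models.embed_content();
    # returns one vector per input text.
    return [[0.0, 0.0] for _ in chunk]

async def async_embed_batch(texts: list[str]) -> list[list[float]]:
    # Split into fixed-size chunks, embed all chunks concurrently,
    # then flatten the per-chunk results back into the original order.
    chunks = [texts[i:i + CHUNK] for i in range(0, len(texts), CHUNK)]
    chunk_results = await asyncio.gather(*(embed_chunk(c) for c in chunks))
    return [vec for result in chunk_results for vec in result]
```

Because `asyncio.gather` preserves argument order, the flattened result lines up one-to-one with the input texts.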
…t from IT test

- GeminiDenseEmbedder.supports_multimodal = False (text-only provider)
- Remove ModalContent/Vectorize import from test_gemini_e2e.py (doesn't exist in gem-v1)
- Remove TestGeminiE2EMultimodalEmbedding class (multimodal out of scope for the text-only slice)
- embed_batch() fallback: pass is_query through to self.embed()
- Remove redundant _l2_normalize() calls in embed/embed_batch/async_embed_batch; the Gemini API already returns L2-normalized vectors, so truncate_and_normalize is sufficient
- Remove the now-dead _l2_normalize() function and unused math import
- embedding_config.py: fix typo "Secretfor" → "Secret for" in the sk field description
- conftest.py: narrow broad except Exception to (ImportError, ModuleNotFoundError, AttributeError)
- test_gemini_embedding_it.py: construct fake API key dynamically to avoid hardcoded secret pattern
- test_gemini_openviking_it.py: add dimension assertion in test_dimension_variant_add_search; fix tautology assertion (>= 0 → > 0) in test_session_search_smoke
…dd pytest-xdist

- embed()/embed_batch(): return a zero vector immediately for empty/whitespace-only text, avoiding non-descriptive Gemini API errors
- embed_batch(): filter empty strings pre-API call; reassemble results in the original order so callers receive a zero vector at each empty position
- Remove _l2_normalize() function and all call sites (embed, embed_batch, async_embed_batch): the Gemini API already returns L2-normalized vectors; truncate_and_normalize() is sufficient
- Remove unused `import math` (was only used by _l2_normalize)
- Add pytest-xdist>=3.0 to [dependency-groups].dev for parallel test runs
- Add 2 unit tests: test_embed_empty_string_returns_zero_vector, test_embed_batch_skips_empty_strings (50 unit tests total)
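The filter-then-reassemble step described above can be sketched like this. The helper names (`call_api`, the `dim` parameter) are hypothetical stand-ins, not the PR's actual code.

```python
# Sketch of the empty-string handling: filter empties before the API
# call, then reassemble results so each empty input position gets a
# zero vector. `call_api` stands in for the real Gemini batch call.
def call_api(texts: list[str]) -> list[list[float]]:
    # Hypothetical stand-in: one dummy vector per non-empty input.
    return [[1.0, 1.0] for _ in texts]

def embed_batch(texts: list[str], dim: int = 2) -> list[list[float]]:
    # Remember the indices of non-empty texts so results can be
    # put back in the caller's original order.
    non_empty = [(i, t) for i, t in enumerate(texts) if t and t.strip()]
    vectors = call_api([t for _, t in non_empty]) if non_empty else []
    # Start with a zero vector at every position...
    out: list[list[float]] = [[0.0] * dim for _ in texts]
    # ...then overwrite the positions that actually got embedded.
    for (i, _), vec in zip(non_empty, vectors):
        out[i] = vec
    return out
```

This keeps the output list the same length and order as the input, so callers can zip results back to their documents without special-casing empties.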
…ry, repr

- Add _build_config() to merge per-call task_type/title with instance defaults
- embed() gains keyword-only task_type= and title= overrides
- embed_batch() gains keyword-only task_type= and titles=; falls back to per-item embed() calls when titles is provided (per-document metadata)
- async_embed_batch() switches to _build_config() (no new args)
- Client constructed with HttpOptions(retry_options=HttpRetryOptions(attempts=3, initial_delay=1.0, max_delay=30.0, exp_base=2.0)) when SDK >= 0.8.0; falls back to plain Client on ImportError
- Add __repr__: GeminiDenseEmbedder(model=..., dim=..., task_type=...)
- Remove now-dead self._embed_config from __init__
- 8 new unit tests in TestBuildConfig; fix test_init_stores_fields assertion to allow http_options kwarg
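The override-merging described in the first bullet can be sketched as below. The class and field names follow the commit message but are simplified stand-ins, not the PR's exact code.

```python
# Sketch of merging per-call overrides with instance defaults.
from dataclasses import dataclass
from typing import Optional

@dataclass
class EmbedConfig:
    task_type: Optional[str] = None
    title: Optional[str] = None

class Embedder:
    def __init__(self, task_type: str = "RETRIEVAL_DOCUMENT"):
        # Instance-level default used when no per-call override is given.
        self._task_type = task_type

    def _build_config(self, task_type: Optional[str] = None,
                      title: Optional[str] = None) -> EmbedConfig:
        # Per-call arguments win; otherwise fall back to the instance default.
        return EmbedConfig(
            task_type=task_type or self._task_type,
            title=title,
        )
```

With this shape, `embed()`, `embed_batch()`, and `async_embed_batch()` can all route through one merge point instead of each re-implementing the fallback logic.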
@qin-ctx Please review and merge this
qin-ptr left a comment
Review Summary
This PR adds Google Gemini as a text embedding provider with excellent test coverage (107 tests) and solid error handling. However, there are three blocking issues that need to be addressed:
1. Design mismatch: Issue #566 explicitly requests multimodal retrieval support, but this PR only implements text-only embedding (supports_multimodal = False). The PR description should clarify this is a phased implementation with a roadmap for multimodal support, or change the issue reference to indicate partial completion.
2. Documentation inconsistency: a code comment claims "multimodal" but the implementation is text-only.
3. CI failure: the lint job failed due to formatting issues in test_gemini_embedder.py.
Once these are resolved, the PR will be ready to merge. The implementation quality is strong, with comprehensive testing, graceful error handling, and proper SDK retry integration.
🤖 I am a bot owned by @qin-ctx.
```
- OpenAI: Dense only
- Volcengine: Dense, Sparse, Hybrid
- Jina AI: Dense only
- Google Gemini: Dense only (multimodal)
```
[Bug] (blocking)
Comment says "Google Gemini: Dense only (multimodal)" but the actual implementation has GeminiDenseEmbedder.supports_multimodal = False. This is contradictory.
Should either:
- Change to "Google Gemini: Dense only (text-only; multimodal planned)"
- Or remove the "(multimodal)" annotation entirely until multimodal support is actually implemented
```python
# text-embedding-004: 768 fixed-dim legacy model, does not support MRL truncation
# Future gemini-embedding-*: default 3072 via _default_dimension() fallback
# Future text-embedding-*: default 768 via _default_dimension() prefix rule
supports_multimodal: bool = False  # text-only; multimodal planned separately
```
[Design] (blocking)
This PR claims to close issue #566, which explicitly requests "first-class Gemini Embedding 2 support for multimodal retrieval". The issue emphasizes:
- "it should not reduce everything to plain text before embedding"
- "multimodal inputs handled as multimodal, not flattened by default"
However, this implementation sets supports_multimodal = False and only handles text. This is a fundamental mismatch between the issue requirement and the PR implementation.
Recommendation: Either:
- Update the PR description to clarify this is phase 1 (text-only) with a roadmap for phase 2 (multimodal), and change the issue reference to "Partial implementation of [Feature]: Add first-class Gemini Embedding 2 support for multimodal retrieval #566", keeping the issue open
- Or implement multimodal support in this PR as originally requested by the issue
The current "Closes: #566" statement is misleading and will cause users to expect multimodal functionality that doesn't exist.
```
@@ -0,0 +1,479 @@
# Copyright (c) 2026 Beijing Volcano Engine Technology Co., Ltd.
```
[Bug] (blocking)
CI lint job failed because this file needs formatting:
Would reformat: tests/unit/test_gemini_embedder.py
Please run black tests/unit/test_gemini_embedder.py and commit the formatted version.
```python
if backend is not None and provider is None:
    data["provider"] = backend
for key in ("query_param", "document_param"):
for key in ("input_type",):
```
[Suggestion] (non-blocking)
The for key in ("input_type",): loop normalizes input_type to lowercase, but the new Gemini provider uses task_type instead of input_type. Consider also normalizing task_type here, or document that task_type must be uppercase (which is currently enforced in the Gemini validator at line 176).
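One way to act on this suggestion is to fold all four keys into a single normalization pass, lowercasing `input_type` (as the existing loop does) while uppercasing the Gemini-facing keys so lowercase config values pass the uppercase validator. This is a sketch against a plain `data` dict, not the PR's actual `sync_provider_backend` code.

```python
# Sketch: normalize casing for both the existing input_type key and
# the Gemini-facing task_type/query_param/document_param keys.
def normalize_embedding_keys(data: dict) -> dict:
    for key in ("input_type", "task_type", "query_param", "document_param"):
        value = data.get(key)
        if isinstance(value, str):
            # input_type stays lowercase (existing behavior); Gemini
            # task types are validated as uppercase, so uppercase them.
            data[key] = value.lower() if key == "input_type" else value.upper()
    return data
```

This keeps the lowercase convention for `input_type` untouched while making Gemini task types case-insensitive at the config boundary.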
```python
task_type: Optional[str] = None,
title: Optional[str] = None,
) -> EmbedResult:
if not text or not text.strip():
```
[Suggestion] (non-blocking)
Returning a zero vector for empty text is a reasonable fallback, but it happens silently. Consider logging a warning so callers know their empty input was handled specially:
```python
if not text or not text.strip():
    logger.warning("Empty text passed to embed(), returning zero vector")
    return EmbedResult(dense_vector=[0.0] * self._dimension)
```

This helps with debugging when unexpected empty strings appear in production.
…fig normalization, empty-text warning

- Fix "(multimodal)" annotation to "(text-only; multimodal planned)" to match supports_multimodal=False (B1)
- Run black formatter on test_gemini_embedder.py (B3)
- Add auto-uppercase normalization for task_type/query_param/document_param in sync_provider_backend so lowercase config values pass validation (NB1)
- Add logger.warning on empty text in embed() for production debuggability (NB2)
- Add test_gemini_task_type_case_insensitive test (NB1)
- Fix ruff import sorting in __init__.py
Thanks for the thorough review! All issues addressed in commit bcfd612: @qin-ptr — blocking fixes
@qin-ptr — non-blocking (adopted both)
@ZaynJarvis — #702 pattern + #718
Verification
CI uses ruff format (not black); re-ran ruff format to match.
5ac0ebc to 1586f65

Description
Phase 1 of #566: Adds `GeminiDenseEmbedder`, a production-ready, opt-in, text-only dense embedding provider backed by Google's `google-genai` SDK. Volcengine remains the default; Gemini is enabled via `provider: "gemini"` in ov.conf. No existing behaviour changes.

Phased Implementation
Key capabilities
- `supports_multimodal = False`; multimodal is Phase 2
- `get_query_embedder()` / `get_document_embedder()` route `query_param`/`document_param` to Gemini task types via the factory (follows the #702 pattern: "feat(embedding): combine document embedder and query embedder to avoi…")
- Per-call `task_type` + `title`: override at call time; all 8 Gemini task types
- `task_type`, `query_param`, `document_param` auto-uppercased in config normalization
- `HttpRetryOptions(attempts=3, exp_base=2, max_delay=30s)` with an import guard for SDK < 0.8
- `async_embed_batch()` dispatches 100-text chunks in parallel via `anyio`

```mermaid
flowchart LR
    subgraph Config["ov.conf / EmbeddingModelConfig"]
        CFG["provider: gemini\napi_key: sk-…\ntask_type: RETRIEVAL_DOCUMENT\ndimension: 1536"]
    end
    subgraph Embedder["GeminiDenseEmbedder(DenseEmbedderBase)"]
        direction TB
        BC["_build_config(task_type?, title?)"]
        E["embed(text, task_type?, title?)"]
        EB["embed_batch(texts, titles?)"]
        AEB["async_embed_batch(texts)"]
        E --> BC
        EB --> BC
        AEB --> BC
    end
    subgraph SDK["google-genai SDK"]
        direction TB
        HTTP["HttpOptions\n(retry: 3× exp backoff)"]
        SYNC["models.embed_content()"]
        ASYNC["aio.models.embed_content()\n(semaphore-bounded)"]
        HTTP --> SYNC
        HTTP --> ASYNC
    end
    Config --> Embedder
    BC --> SYNC
    AEB --> ASYNC
```

Related Issue
Partial implementation of #566 (Phase 1: text-only dense embedding; Phase 2: multimodal — issue stays open)
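For reference, the opt-in configuration might look like the snippet below. The key names are taken from the PR's flowchart (provider, api_key, task_type, dimension); the exact nesting and format of ov.conf is an assumption here, so treat `examples/ov.conf.example` in the PR as authoritative.

```yaml
# Assumed shape only; see examples/ov.conf.example for the real format.
embedding:
  provider: "gemini"
  api_key: "sk-..."            # GOOGLE_API_KEY credential
  task_type: "RETRIEVAL_DOCUMENT"
  dimension: 1536              # MRL truncation target, 1-3072
```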
Type of Change
Changes Made
- `openviking/models/embedder/gemini_embedders.py`: `GeminiDenseEmbedder(DenseEmbedderBase)` with `_build_config()`, per-call task_type/title, SDK retry, `__repr__`, empty-text guard with warning log
- `openviking_cli/utils/config/embedding_config.py`: registers the `"gemini"` provider; validates and auto-uppercases `task_type`/`query_param`/`document_param`; factory routes context for non-symmetric mode
- `openviking/models/embedder/__init__.py`: exports `GeminiDenseEmbedder`; annotation: "text-only; multimodal planned"
- `examples/ov.conf.example`: Gemini config snippet
- `pyproject.toml`: `google-genai>=0.8` dep; optional `gemini-async` extra for `anyio`
- `tests/unit/test_gemini_embedder.py`: 41 unit tests (mock-only, no API key needed)
- `tests/unit/test_embedding_config_gemini.py`: 18 config validation tests (incl. case-insensitive task_type)
- `tests/integration/`: 49 integration tests (auto-skip without `GOOGLE_API_KEY`)

Testing
- `test_gemini_embedder.py`
- `test_embedding_config_gemini.py`
- `test_gemini_e2e.py`
- `test_gemini_embedding_it.py`
- `test_gemini_openviking_it.py`

Checklist