LCORE-1723: Cross-Encoder Reranking for enhanced RAG #1566
Anxhela21 wants to merge 3 commits into lightspeed-core:main from
Conversation
Walkthrough

Adds a cross-encoder reranking pipeline: a new reranker configuration model and AppConfig property, a sentence-transformers dependency (duplicated in pyproject), a BYOK rerank boost constant, cached CrossEncoder loading, score normalization and BYOK boosting; it also refactors chunk retrieval and integrates reranking into build_rag_context.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant Builder as build_rag_context
    participant BYOK as _fetch_byok_chunks
    participant Solr as _fetch_solr_chunks
    participant Merge as MergeCandidates
    participant Reranker as CrossEncoderReranker
    participant Booster as BYOKBooster
    participant Selector as SelectTopK
    participant Context as BuildRAGContext
    Client->>Builder: query, configuration
    Builder->>BYOK: fetch (max_chunks = 2×BYOK_MAX)
    BYOK-->>Builder: byok_chunks
    Builder->>Solr: fetch (max_chunks = 2×BYOK_MAX)
    Solr-->>Builder: solr_chunks
    Builder->>Merge: combine candidates
    Merge-->>Builder: merged_candidates
    alt configuration.reranker.enabled
        Builder->>Reranker: (query, merged_candidates)
        Reranker->>Reranker: predict scores for (query, chunk.content)
        Reranker->>Reranker: min-max normalize -> [0,1]
        Reranker-->>Builder: reranked_chunks
        Builder->>Booster: apply BYOK boost (multiply non-OKP scores)
        Booster-->>Builder: boosted_chunks
        Builder->>Selector: sort & select top BYOK_MAX
    else
        Builder->>Selector: sort by original scores & select top BYOK_MAX
    end
    Selector-->>Builder: final_chunks
    Builder->>Context: build referenced_documents from final_chunks
    Context-->>Client: RAGContext
```
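To make the normalize-and-boost steps in the diagram concrete, here is a small self-contained sketch. The constant name mirrors the PR's BYOK_RAG_RERANK_BOOST (its value and the list-based chunk representation are assumptions for illustration, not the merged implementation):

```python
from typing import Final

# Mirrors the PR's BYOK_RAG_RERANK_BOOST constant; the value is assumed.
BYOK_RAG_RERANK_BOOST: Final[float] = 1.2


def min_max_normalize(scores: list[float]) -> list[float]:
    """Map raw cross-encoder scores into [0, 1]; identical scores map to 0.5."""
    if not scores:
        return []
    if len(scores) == 1:
        return [1.0]
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.5] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def apply_byok_boost(scores: list[float], is_okp: list[bool]) -> list[float]:
    """Multiply non-OKP (BYOK) scores by the boost factor, as in the diagram."""
    return [
        s if okp else s * BYOK_RAG_RERANK_BOOST
        for s, okp in zip(scores, is_okp, strict=True)
    ]
```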
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~55 minutes
🚥 Pre-merge checks: ✅ 5 passed
Force-pushed e6b6d66 to 880a566
Force-pushed 0131555 to 82caaf6
Actionable comments posted: 7
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pyproject.toml`:
- Around line 75-79: Remove the duplicate dependency entry for
"sentence-transformers>=2.0.0" in pyproject.toml (the repeated listing with the
same comment), leaving only a single declaration; also check the llslibdev group
requirement (">=5.0.0") and confirm whether the runtime floor should be raised
to ">=5.0.0" to match required features, updating the single remaining
"sentence-transformers" entry accordingly if the code needs 5.0.0+.
In `@src/constants.py`:
- Around line 195-196: Add a Final[float] type annotation to the new constant
BYOK_RAG_RERANK_BOOST so it matches the other constants' immutability checks;
update the declaration for BYOK_RAG_RERANK_BOOST to use Final[float] and ensure
Final is imported from typing at the top of the file if it isn't already.
In `@src/models/config.py`:
- Around line 1828-1866: The declaration _explicitly_configured: bool =
PrivateAttr(default=False) makes _explicitly_configured always present, and the
value-comparison in mark_as_explicitly_configured() misses cases where a user
explicitly provided a field equal to its default; change _explicitly_configured
to a tri-state (e.g. PrivateAttr(default=None)) and rewrite
mark_as_explicitly_configured to use the Pydantic-provided
self.model_fields_set: compute the set of reranker field names you care about
(enabled, model, model_id, provider_id, top_k_multiplier, byok_boost,
okp_boost), check if any of those names are in self.model_fields_set and if so
set self._explicitly_configured = True (leave it None/False otherwise), and
update the consuming code (the hasattr(...) check elsewhere) to test for
_explicitly_configured is not None / True instead of hasattr so the auto-enable
branch can run correctly.
- Around line 1815-1826: The three RerankerConfiguration fields
(top_k_multiplier, byok_boost, okp_boost) are declared but not used; update the
reranking code to read these values from the configuration instead of using
hard-coded constants: replace the hard-coded 2x in the pool_size calculation in
vector_search.py with configuration.reranker.top_k_multiplier when computing
pool_size (or its equivalent variable), change the boost applied in
_apply_byok_rerank_boost to use configuration.reranker.byok_boost instead of
constants.BYOK_RAG_RERANK_BOOST, and ensure OKP chunks receive
configuration.reranker.okp_boost (apply the multiplier where OKP chunks
currently pass through unmodified). Keep existing defaults from the
RerankerConfiguration when reading these fields so behavior is unchanged if the
user omits them.
In `@src/utils/vector_search.py`:
- Around line 28-56: The module-level cache _cross_encoder_models and loader
_get_cross_encoder are not concurrency-safe and can instantiate
CrossEncoder(model_name) multiple times on cold start; fix by serializing loads
for a given model_name (single-flight) — e.g., add a module-level dict of
asyncio.Lock objects (or a per-name asyncio.Event/Future) and acquire the lock
for model_name inside _get_cross_encoder (or convert the loader to a synchronous
function, apply functools.cache, and call it via asyncio.to_thread) so only one
task constructs CrossEncoder(model_name) and others await the result, then store
the loaded model (or None on failure) into _cross_encoder_models before
releasing the lock; ensure any exception handling still sets the cache to None
to avoid repeated retries.
- Around line 722-728: The function build_rag_context currently takes an unused
parameter moderation_decision (annotated with # pylint:
disable=unused-argument); either remove that parameter from the function
signature and delete the pylint disable, and update its three callers (the calls
in the functions at src/app/endpoints/query.py,
src/app/endpoints/streaming_query.py, and src/app/endpoints/responses.py) to
stop passing moderation_decision, or if you intend to use it later, keep the
parameter but replace the pylint disable with a short TODO comment in
build_rag_context explaining the planned future use and why moderation_decision
is intentionally ignored; reference the function name build_rag_context and the
three caller sites when making the change so the signature and call sites remain
consistent.
- Around line 59-134: The async function _rerank_chunks_with_cross_encoder
currently calls the blocking sentence_transformers.CrossEncoder.predict
synchronously (model.predict), which will block the asyncio event loop; change
the call to run in a thread via asyncio.to_thread (e.g., scores = await
asyncio.to_thread(model.predict, pairs)) so inference executes off the event
loop, preserving async responsiveness, and keep the subsequent score processing
logic unchanged; ensure to import asyncio if not already present.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 89ed94e6-8009-41d5-8e4c-3f2892881c37
📒 Files selected for processing (5)
pyproject.toml, src/configuration.py, src/constants.py, src/models/config.py, src/utils/vector_search.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: bandit
- GitHub Check: build-pr
- GitHub Check: unit_tests (3.12)
- GitHub Check: integration_tests (3.12)
- GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Use absolute imports for internal modules: `from authentication import get_auth_dependency`
Import FastAPI dependencies with: `from fastapi import APIRouter, HTTPException, Request, status, Depends`
Import Llama Stack client with: `from llama_stack_client import AsyncLlamaStackClient`
Check `constants.py` for shared constants before defining new ones
All modules start with descriptive docstrings explaining purpose
Use `logger = get_logger(__name__)` from `log.py` for module logging
Type aliases defined at module level for clarity
Use `Final[type]` as type hint for all constants
All functions require docstrings with brief descriptions
Complete type annotations for parameters and return types in functions
Use `typing_extensions.Self` for model validators in Pydantic models
Use modern union type syntax `str | int` instead of `Union[str, int]`
Use `Optional[Type]` for optional type hints
Use snake_case with descriptive, action-oriented function names (get_, validate_, check_)
Avoid in-place parameter modification anti-patterns; return new data structures instead
Use `async def` for I/O operations and external API calls
Handle `APIConnectionError` from Llama Stack in error handling
Use standard log levels with clear purposes: debug, info, warning, error
All classes require descriptive docstrings explaining purpose
Use PascalCase for class names with standard suffixes: Configuration, Error/Exception, Resolver, Interface
Use ABC for abstract base classes with `@abstractmethod` decorators
Use `@model_validator` and `@field_validator` for Pydantic model validation
Complete type annotations for all class attributes; use specific types, not `Any`
Follow Google Python docstring conventions with Parameters, Returns, Raises, and Attributes sections
Files:
src/constants.py, src/configuration.py, src/models/config.py, src/utils/vector_search.py
**/constants.py
📄 CodeRabbit inference engine (AGENTS.md)
Central `constants.py` for shared constants with descriptive comments
Files:
src/constants.py
src/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Pydantic models extend `ConfigurationBase` for config, `BaseModel` for data models
Files:
src/constants.pysrc/configuration.pysrc/models/config.pysrc/utils/vector_search.py
src/**/config*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/**/config*.py: All config uses Pydantic models extending `ConfigurationBase`
Base class sets `extra="forbid"` to reject unknown fields in Pydantic models
Use `@field_validator` and `@model_validator` for custom validation in Pydantic models
Use type hints like `Optional[FilePath]`, `PositiveInt`, `SecretStr` in Pydantic models
Files:
src/configuration.py, src/models/config.py
🧠 Learnings (6)
📚 Learning: 2026-03-17T11:34:53.242Z
Learnt from: are-ces
Repo: lightspeed-core/lightspeed-stack PR: 1335
File: docs/config.md:411-411
Timestamp: 2026-03-17T11:34:53.242Z
Learning: In the lightspeed-stack project (`src/models/config.py`, `docs/config.md`), the internal Solr filter `is_chunk:true` (defined as `SOLR_CHUNK_FILTER_QUERY` in `src/constants.py`) is always injected by the system for OKP searches and is intentionally hidden from users. The `chunk_filter_query` field in `OkpConfiguration` is user-facing and additive-only, but the documentation must NOT mention the internal `is_chunk:true` behavior — this is a deliberate design decision by the maintainers.
Applied to files:
src/constants.py, src/utils/vector_search.py
📚 Learning: 2026-04-19T15:40:25.624Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-stack PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-04-19T15:40:25.624Z
Learning: Applies to src/**/config*.py : All config uses Pydantic models extending `ConfigurationBase`
Applied to files:
src/models/config.py
📚 Learning: 2026-04-19T15:40:25.624Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-stack PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-04-19T15:40:25.624Z
Learning: Applies to src/**/*.py : Pydantic models extend `ConfigurationBase` for config, `BaseModel` for data models
Applied to files:
src/models/config.py
📚 Learning: 2026-01-12T10:58:40.230Z
Learnt from: blublinsky
Repo: lightspeed-core/lightspeed-stack PR: 972
File: src/models/config.py:459-513
Timestamp: 2026-01-12T10:58:40.230Z
Learning: In lightspeed-core/lightspeed-stack, for Python files under src/models, when a user claims a fix is done but the issue persists, verify the current code state before accepting the fix. Steps: review the diff, fetch the latest changes, run relevant tests, reproduce the issue, search the codebase for lingering references to the original problem, confirm the fix is applied and not undone by subsequent commits, and validate with local checks to ensure the issue is resolved.
Applied to files:
src/models/config.py
📚 Learning: 2026-02-25T07:46:33.545Z
Learnt from: asimurka
Repo: lightspeed-core/lightspeed-stack PR: 1211
File: src/models/responses.py:8-16
Timestamp: 2026-02-25T07:46:33.545Z
Learning: In the Python codebase, requests.py should use OpenAIResponseInputTool as Tool while responses.py uses OpenAIResponseTool as Tool. This difference is intentional due to differing schemas for input vs output tools in llama-stack-api. Apply this distinction consistently to other models under src/models (e.g., ensure request-related tools use the InputTool variant and response-related tools use the ResponseTool variant). If adding new tools, choose the corresponding InputTool or Tool class based on whether the tool represents input or output, and document the rationale in code comments.
Applied to files:
src/models/config.py
📚 Learning: 2026-01-30T13:33:40.479Z
Learnt from: are-ces
Repo: lightspeed-core/lightspeed-stack PR: 1073
File: src/app/endpoints/query_v2.py:575-586
Timestamp: 2026-01-30T13:33:40.479Z
Learning: In `src/app/endpoints/query_v2.py`, the `parse_referenced_documents_from_responses_api` function filters referenced documents to include only those with `doc_title` or `doc_url` because these documents are displayed in the frontend. Documents with only `doc_id` are intentionally excluded as they wouldn't provide useful information to end users.
Applied to files:
src/utils/vector_search.py
🔇 Additional comments (3)
src/utils/vector_search.py (1)
722-746: pool_size/top_k ignore the configured top_k_multiplier.
pool_size = 2 * constants.BYOK_RAG_MAX_CHUNKS hard-codes the 2× factor, but RerankerConfiguration.top_k_multiplier (default 2.0) is intended to govern this exact value. Likewise, _apply_byok_rerank_boost(...) reads the multiplier from constants.BYOK_RAG_RERANK_BOOST, ignoring reranker.byok_boost and reranker.okp_boost. See the comment on src/models/config.py RerankerConfiguration for the full picture and a suggested wiring diff.

src/configuration.py (1)

469-475: LGTM. The new reranker property follows the same pattern as the surrounding rag/okp properties. Since reranker is a non-optional field with a default_factory in Configuration, the access is safe once _configuration is loaded.

src/models/config.py (1)

1822-1823: Remove unused placeholder fields or wire them up.
model_id and provider_id in RerankerConfiguration are not consumed by src/utils/vector_search.py (only model and enabled are read). If these fields are reserved for future Llama Stack-backed reranker provider support, they should either be wired up or removed to avoid leaking unimplemented schema to users. The Pydantic v2 protected namespace concern does not apply here: the project uses Pydantic v2.10.6+, where the default protected_namespaces setting only flags direct conflicts with methods like model_validate and model_dump, not fields with the model_ prefix.

> Likely an incorrect or invalid review comment.
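For illustration, a hedged sketch of the wiring this comment asks for, assuming the RerankerConfiguration field names from this PR (top_k_multiplier, byok_boost, okp_boost); the helper names are hypothetical:

```python
def compute_pool_size(top_k: int, top_k_multiplier: float) -> int:
    """Fetch top_k_multiplier x top_k candidates before reranking."""
    return max(top_k, int(top_k * top_k_multiplier))


def boost_score(score: float, is_okp: bool, byok_boost: float, okp_boost: float) -> float:
    """Apply the per-source multiplier read from configuration.reranker."""
    return score * (okp_boost if is_okp else byok_boost)
```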
```python
# Lazy-loaded cross-encoder models for reranking RAG chunks (CPU-bound, use in thread).
# Cache models by name to avoid reloading the same model multiple times.
# Not a constant; pylint invalid-name is disabled for this module-level singleton.
_cross_encoder_models: dict[str, Any] = {}  # pylint: disable=invalid-name


def _get_cross_encoder(model_name: str) -> Any:
    """Return the lazy-loaded cross-encoder model for reranking.

    Args:
        model_name: Name of the cross-encoder model to load.

    Returns:
        Loaded CrossEncoder model instance, or None if loading fails.
    """
    if model_name not in _cross_encoder_models:
        try:
            from sentence_transformers import (  # pylint: disable=import-outside-toplevel
                CrossEncoder,
            )

            _cross_encoder_models[model_name] = CrossEncoder(model_name)
            logger.info("Loaded cross-encoder for RAG reranking: %s", model_name)
        except Exception as e:  # pylint: disable=broad-exception-caught
            logger.warning(
                "Could not load cross-encoder for reranking (%s): %s", model_name, e
            )
            _cross_encoder_models[model_name] = None
    return _cross_encoder_models[model_name]
```
Cross-encoder cache is not safe against concurrent cold-start loads.
_cross_encoder_models is a plain module-level dict. With multiple concurrent requests arriving while the cache is cold (e.g. right after process start), each await-suspended task can re-enter _get_cross_encoder before any prior task has populated the cache, causing CrossEncoder(model_name) to be instantiated several times — wasteful CPU/RAM on startup and possible OOM if the model is large.
A single-flight pattern with asyncio.Lock (or functools.cache if you make the loader sync and only call it via asyncio.to_thread) avoids this:
♻️ Single-flight loader sketch
```diff
-_cross_encoder_models: dict[str, Any] = {}  # pylint: disable=invalid-name
+_cross_encoder_models: dict[str, Any] = {}  # pylint: disable=invalid-name
+_cross_encoder_load_lock = asyncio.Lock()
+
+
+async def _get_cross_encoder(model_name: str) -> Any:
+    if model_name in _cross_encoder_models:
+        return _cross_encoder_models[model_name]
+    async with _cross_encoder_load_lock:
+        if model_name in _cross_encoder_models:
+            return _cross_encoder_models[model_name]
+        try:
+            from sentence_transformers import CrossEncoder  # pylint: disable=import-outside-toplevel
+            model = await asyncio.to_thread(CrossEncoder, model_name)
+            _cross_encoder_models[model_name] = model
+            logger.info("Loaded cross-encoder for RAG reranking: %s", model_name)
+        except Exception as e:  # pylint: disable=broad-exception-caught
+            logger.warning("Could not load cross-encoder (%s): %s", model_name, e)
+            _cross_encoder_models[model_name] = None
+    return _cross_encoder_models[model_name]
```

Note: when running uvicorn with workers > 1, each process still gets its own copy of the model; that is unavoidable here, but worth keeping in mind for memory budgeting.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
```python
async def _rerank_chunks_with_cross_encoder(
    query: str,
    chunks: list[RAGChunk],
    top_k: int,
    model_name: str = "cross-encoder/ms-marco-MiniLM-L6-v2",
) -> list[RAGChunk]:
    """Rerank chunks using configurable cross-encoder model.

    Args:
        query: The search query
        chunks: RAG chunks to rerank
        top_k: Number of top chunks to return
        model_name: Cross-encoder model name to use

    Returns:
        Top top_k chunks sorted by cross-encoder score (descending)
    """
    if not chunks:
        return []

    try:
        # Get the cached cross-encoder model
        model = _get_cross_encoder(model_name)
        if model is None:
            raise RuntimeError(f"Failed to load cross-encoder model: {model_name}")

        logger.debug("Using cross-encoder model: %s", model_name)

        # Create query-chunk pairs for scoring
        pairs = [(query, chunk.content) for chunk in chunks]
        scores = model.predict(pairs)

        if hasattr(scores, "tolist"):
            scores = scores.tolist()

        # Normalize scores to [0,1] range using min-max normalization
        if len(scores) > 1:
            min_score = min(scores)
            max_score = max(scores)
            score_range = max_score - min_score
            if score_range > 0:
                normalized_scores = [(score - min_score) / score_range for score in scores]
            else:
                # All scores are identical, assign 0.5 to all
                normalized_scores = [0.5] * len(scores)
        else:
            # Single score, assign 1.0
            normalized_scores = [1.0] * len(scores)

        # Combine normalized scores with chunks and sort by score (descending)
        indexed = list(zip(normalized_scores, chunks, strict=True))
        indexed.sort(key=lambda x: x[0], reverse=True)
        top_indexed = indexed[:top_k]

        # Return RAGChunk list with normalized cross-encoder scores [0,1]
        return [
            RAGChunk(
                content=chunk.content,
                source=chunk.source,
                score=float(score),
                attributes=chunk.attributes,
            )
            for score, chunk in top_indexed
        ]

    except Exception as e:  # pylint: disable=broad-exception-caught
        logger.warning(
            "Cross-encoder reranking failed, falling back to original scoring: %s", e
        )
        # Fallback: sort by original score and take top_k
        sorted_chunks = sorted(
            chunks,
            key=lambda c: c.score if c.score is not None else float("-inf"),
            reverse=True,
        )
        return sorted_chunks[:top_k]
```
🧩 Analysis chain
🌐 Web query:
sentence-transformers CrossEncoder predict synchronous blocking
💡 Result:
The CrossEncoder.predict method in sentence-transformers performs synchronous, blocking inference by default. It computes similarity scores for input sentence pairs in a blocking manner on the specified device(s), returning results only after completion. Key details:

- Input: list of sentence pairs, e.g., [['sentence1a', 'sentence1b'], ['sentence2a', 'sentence2b']].
- Default behavior: batched processing (batch_size=32), progress bar optional, returns numpy array or torch.Tensor of scores.
- Blocking: no asynchronous option mentioned; the call waits for computation to finish.
- Multi-device: supports multiprocessing via device list (e.g., ["cuda:0", "cuda:1"]) or manual pool for parallel GPUs/CPUs, but still synchronous from the caller's perspective.
- Example:

```python
from sentence_transformers import CrossEncoder
model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
scores = model.predict([('Query1', 'Doc1'), ('Query1', 'Doc2')])
# Returns array of scores immediately after computation.
```

This is confirmed in official documentation as the standard synchronous prediction workflow.
Citations:
- 1: https://sbert.net/docs/cross_encoder/usage/usage.html
- 2: https://www.sbert.net/docs/package%5Freference/cross%5Fencoder/cross%5Fencoder.html
- 3: https://www.sbert.net/examples/applications/cross-encoder/README.html
🏁 Script executed:

```bash
# First, locate and inspect the file to confirm line numbers and context
fd -t f "vector_search.py" | head -5
```

Repository: lightspeed-core/lightspeed-stack
Length of output: 142

🏁 Script executed:

```bash
# Check Python version requirement in project
fd -t f -e "pyproject.toml" -o -e "setup.py" -o -e ".python-version" | head -10
```

Repository: lightspeed-core/lightspeed-stack
Length of output: 193

🏁 Script executed:

```bash
# Inspect the vector_search.py file around the identified lines
cat -n src/utils/vector_search.py | sed -n '59,100p'
```

Repository: lightspeed-core/lightspeed-stack
Length of output: 1768

🏁 Script executed:

```bash
# Also check the _get_cross_encoder implementation to assess cold-start blocking
cat -n src/utils/vector_search.py | sed -n '34,56p'
```

Repository: lightspeed-core/lightspeed-stack
Length of output: 1164

🏁 Script executed:

```bash
# Check Python version requirement
cat pyproject.toml | grep -A 5 "python"
```

Repository: lightspeed-core/lightspeed-stack
Length of output: 309

🏁 Script executed:

```bash
# Also check setup.py if it exists
[ -f setup.py ] && cat setup.py | head -30 || echo "No setup.py found"
```

Repository: lightspeed-core/lightspeed-stack
Length of output: 94

🏁 Script executed:

```bash
# Verify the line 28 comment mentioned in the review about CPU-bound
cat -n src/utils/vector_search.py | sed -n '1,35p'
```

Repository: lightspeed-core/lightspeed-stack
Length of output: 1581

🏁 Script executed:

```bash
# Check if asyncio is already imported at the top of the file
head -30 src/utils/vector_search.py | grep -E "^import|^from"
```

Repository: lightspeed-core/lightspeed-stack
Length of output: 605
model.predict() is synchronous and blocks the event loop inside an async function.
sentence_transformers.CrossEncoder.predict is a CPU-bound PyTorch call that runs synchronously. Calling it directly at line 89 inside async def _rerank_chunks_with_cross_encoder blocks the asyncio event loop for the entire inference duration — preventing the worker from handling other concurrent requests. The module-level comment at line 28 explicitly notes "CPU-bound, use in thread", yet the code does not implement this.
Use asyncio.to_thread (available in Python 3.9+, and your project requires ≥3.12) to offload the blocking inference so the event loop remains responsive.
Proposed fix

```diff
  # Create query-chunk pairs for scoring
  pairs = [(query, chunk.content) for chunk in chunks]
- scores = model.predict(pairs)
+ scores = await asyncio.to_thread(model.predict, pairs)
```
This seems a valid concern
@are-ces, agreed — this is a valid issue. The fix is straightforward since asyncio is already imported. Here's the minimal change:
```diff
  # Create query-chunk pairs for scoring
  pairs = [(query, chunk.content) for chunk in chunks]
- scores = model.predict(pairs)
+ scores = await asyncio.to_thread(model.predict, pairs)
```

As a secondary note, there's also a cold-start blocking concern: on the first call for a given model_name, _get_cross_encoder calls CrossEncoder(model_name) synchronously (model download + load), which also blocks the event loop. You may want to wrap that call in asyncio.to_thread as well:

```diff
- model = _get_cross_encoder(model_name)
+ model = await asyncio.to_thread(_get_cross_encoder, model_name)
```

The module-level comment even acknowledges this with # CPU-bound, use in thread, so both fixes align with the documented intent.
```diff
 async def build_rag_context(  # pylint: disable=too-many-locals
     client: AsyncLlamaStackClient,
-    moderation_decision: str,
+    moderation_decision: str,  # pylint: disable=unused-argument
     query: str,
     vector_store_ids: Optional[list[str]],
     solr: Optional[SolrVectorSearchRequest] = None,
 ) -> RAGContext:
```
🧩 Analysis chain
🏁 Script executed:

```bash
#!/bin/bash
# Find all call sites of build_rag_context to confirm impact of removing the parameter.
rg -nP --type=py -C2 '\bbuild_rag_context\s*\('
```

Repository: lightspeed-core/lightspeed-stack
Length of output: 2197

🏁 Script executed:

```bash
# Read the build_rag_context function implementation
sed -n '722,800p' src/utils/vector_search.py
```

Repository: lightspeed-core/lightspeed-stack
Length of output: 3006

🏁 Script executed:

```bash
# Check the exact line numbers and context for call sites
echo "=== query.py ===" && sed -n '170,185p' src/app/endpoints/query.py
echo -e "\n=== streaming_query.py ===" && sed -n '228,245p' src/app/endpoints/streaming_query.py
echo -e "\n=== responses.py ===" && sed -n '343,360p' src/app/endpoints/responses.py
```

Repository: lightspeed-core/lightspeed-stack
Length of output: 1836

🏁 Script executed:

```bash
# Verify that moderation_decision is not used anywhere in the function body
rg -A30 'async def build_rag_context' src/utils/vector_search.py | grep -i 'moderation_decision'
```

Repository: lightspeed-core/lightspeed-stack
Length of output: 141
Remove unused moderation_decision parameter or document its intended future use.
The parameter is annotated with # pylint: disable=unused-argument and is not referenced anywhere in the function body (nor in the docstring). The function builds RAG context without consulting moderation status, since moderation checks are already performed by callers before deciding whether to proceed. Remove the parameter and update the three call sites:
- src/app/endpoints/query.py:179
- src/app/endpoints/streaming_query.py:234
- src/app/endpoints/responses.py:349
Alternatively, if there is planned future use, add a TODO comment explaining the intent instead of silently passing the query to vector stores for queries that may have been blocked.
Suggested signature cleanup

```diff
 async def build_rag_context(  # pylint: disable=too-many-locals
     client: AsyncLlamaStackClient,
-    moderation_decision: str,  # pylint: disable=unused-argument
     query: str,
     vector_store_ids: Optional[list[str]],
     solr: Optional[SolrVectorSearchRequest] = None,
 ) -> RAGContext:
```
) -> RAGContext:🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/utils/vector_search.py` around lines 722 - 728, The function
build_rag_context currently takes an unused parameter moderation_decision
(annotated with # pylint: disable=unused-argument); either remove that parameter
from the function signature and delete the pylint disable, and update its three
callers (the calls in the functions at src/app/endpoints/query.py,
src/app/endpoints/streaming_query.py, and src/app/endpoints/responses.py) to
stop passing moderation_decision, or if you intend to use it later, keep the
parameter but replace the pylint disable with a short TODO comment in
build_rag_context explaining the planned future use and why moderation_decision
is intentionally ignored; reference the function name build_rag_context and the
three caller sites when making the change so the signature and call sites remain
consistent.
Signed-off-by: Anxhela Coba <acoba@redhat.com>
Force-pushed cce6b82 to d664680
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/utils/vector_search.py (1)
642-685: ⚠️ Potential issue | 🟠 Major: The OKP fetch path never actually expands the rerank candidate pool.

_fetch_solr_rag() accepts max_chunks, but the value never reaches vector_io.query() because _build_query_params(solr) still uses the default k. build_rag_context() also calls _fetch_solr_rag() without max_chunks=pool_size. As a result, OKP stays capped at the default retrieval size, so the reranker never sees the larger Solr pool described in this PR.

♻️ Suggested fix

```diff
- params = _build_query_params(solr)
+ params = _build_query_params(solr, k=limit)
```

```diff
- solr_chunks_task = _fetch_solr_rag(client, query, solr)
+ solr_chunks_task = _fetch_solr_rag(
+     client, query, solr, max_chunks=pool_size
+ )
```

Also applies to: 748-755
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/utils/vector_search.py` around lines 642 - 685, The Solr fetch never expands the rerank candidate pool because _fetch_solr_rag ignores its max_chunks arg and _build_query_params still uses the default k; also build_rag_context doesn't pass pool_size into _fetch_solr_rag. Fix by threading max_chunks through: update _fetch_solr_rag to pass max_chunks (or computed limit) into the params returned by _build_query_params (ensure _build_query_params accepts a k/limit argument and sets params['k'] accordingly), call client.vector_io.query with those params, and modify build_rag_context to call _fetch_solr_rag(..., max_chunks=pool_size) so the Solr query and vector_io.query receive the larger pool_size for reranking (refer to functions _fetch_solr_rag, _build_query_params, build_rag_context, vector_io.query, and variables max_chunks/pool_size).
♻️ Duplicate comments (3)
src/utils/vector_search.py (2)
79-90: ⚠️ Potential issue | 🔴 Critical: Offload CrossEncoder.predict() off the event loop.

_rerank_chunks_with_cross_encoder() is async, but model.predict() is a synchronous inference call. Awaiting this path under load will stall the event loop for the full rerank duration and block unrelated requests.

♻️ Minimal fix

```diff
- model = _get_cross_encoder(model_name)
+ model = await _get_cross_encoder(model_name)
@@
- scores = model.predict(pairs)
+ scores = await asyncio.to_thread(model.predict, pairs)
```

According to the official sentence-transformers documentation, is `CrossEncoder.predict(...)` a synchronous/blocking call, and is `asyncio.to_thread(...)` the appropriate way to invoke it from Python async code?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/utils/vector_search.py` around lines 79 - 90, _rerank_chunks_with_cross_encoder is async but calls the synchronous CrossEncoder.predict (via model.predict) which blocks the event loop; change the call to run the blocking inference off the loop using asyncio.to_thread (or loop.run_in_executor) so prediction runs in a worker thread and returns awaitably. Locate the call to model.predict in _rerank_chunks_with_cross_encoder and replace the direct call with an awaited asyncio.to_thread(lambda: model.predict(pairs)) (or equivalent) so the CrossEncoder.predict invocation is executed off the event loop.
34-56: ⚠️ Potential issue | 🟠 Major: Serialize cold-start loads for each cross-encoder model.

This cache is still racy: concurrent requests can all miss _cross_encoder_models and instantiate the same model before the first load stores it, which is an expensive CPU/RAM spike for large rerankers. Wrap model construction in a per-model single-flight lock so only one task loads each model name.

♻️ Suggested single-flight pattern

```diff
 _cross_encoder_models: dict[str, Any] = {}  # pylint: disable=invalid-name
+_cross_encoder_load_locks: dict[str, asyncio.Lock] = {}  # pylint: disable=invalid-name

-def _get_cross_encoder(model_name: str) -> Any:
+async def _get_cross_encoder(model_name: str) -> Any:
     """Return the lazy-loaded cross-encoder model for reranking.
@@
-    if model_name not in _cross_encoder_models:
-        try:
-            from sentence_transformers import (  # pylint: disable=import-outside-toplevel
-                CrossEncoder,
-            )
-
-            _cross_encoder_models[model_name] = CrossEncoder(model_name)
-            logger.info("Loaded cross-encoder for RAG reranking: %s", model_name)
-        except Exception as e:  # pylint: disable=broad-exception-caught
-            logger.warning(
-                "Could not load cross-encoder for reranking (%s): %s", model_name, e
-            )
-            _cross_encoder_models[model_name] = None
+    if model_name in _cross_encoder_models:
+        return _cross_encoder_models[model_name]
+
+    lock = _cross_encoder_load_locks.setdefault(model_name, asyncio.Lock())
+    async with lock:
+        if model_name in _cross_encoder_models:
+            return _cross_encoder_models[model_name]
+        try:
+            from sentence_transformers import (  # pylint: disable=import-outside-toplevel
+                CrossEncoder,
+            )
+
+            _cross_encoder_models[model_name] = await asyncio.to_thread(
+                CrossEncoder, model_name
+            )
+            logger.info("Loaded cross-encoder for RAG reranking: %s", model_name)
+        except Exception as e:  # pylint: disable=broad-exception-caught
+            logger.warning(
+                "Could not load cross-encoder for reranking (%s): %s", model_name, e
+            )
+            _cross_encoder_models[model_name] = None
     return _cross_encoder_models[model_name]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/utils/vector_search.py` around lines 34 - 56, The _get_cross_encoder function is racy because multiple callers can simultaneously miss _cross_encoder_models and each instantiate the same heavy model; fix it by introducing a per-model single-flight lock (e.g., a module-level dict _cross_encoder_locks: Dict[str, threading.Lock]) and wrap the check-and-create logic for a given model_name inside that lock: obtain or create the lock for model_name (use setdefault), acquire it (with context manager) before re-checking _cross_encoder_models[model_name], then instantiate CrossEncoder and assign to _cross_encoder_models[model_name] inside the locked region (ensure exceptions still set the cache to None and release the lock in finally). Update references to _get_cross_encoder, _cross_encoder_models, and logger accordingly.

src/models/config.py (1)
1829-1861: ⚠️ Potential issue | 🟠 Major: _explicitly_configured never makes the auto-enable branch work as intended.

PrivateAttr(default=False) means _explicitly_configured is always present, so not hasattr(self.reranker, "_explicitly_configured") is always false and the auto-enable path never runs. The value-comparison validator also misses explicit input that matches defaults, e.g. reranker: { enabled: true }. Use model_fields_set to mark explicit input and test the flag value directly.

♻️ Suggested fix

```diff
 class RerankerConfiguration(ConfigurationBase):
@@
     _explicitly_configured: bool = PrivateAttr(default=False)
@@
     @model_validator(mode="after")
     def mark_as_explicitly_configured(self) -> Self:
         """Mark this configuration as explicitly set when instantiated from user input."""
-        # Only mark as explicitly configured if we're not using all default values
-        # This allows auto-enabling when user hasn't touched reranker settings
-        # Check if any field differs from default values
-        default_model = "cross-encoder/ms-marco-MiniLM-L6-v2"
-        default_top_k_multiplier = 2.0
-        default_byok_boost = 1.2
-        default_okp_boost = 1.0
-
-        # Check if any setting differs from defaults (indicates explicit configuration)
-        current_values = [
-            self.enabled,
-            self.model,
-            self.top_k_multiplier,
-            self.byok_boost,
-            self.okp_boost,
-        ]
-        default_values = [
-            True,
-            default_model,
-            default_top_k_multiplier,
-            default_byok_boost,
-            default_okp_boost,
-        ]
-
-        if current_values != default_values:
+        if self.model_fields_set:
             self._explicitly_configured = True
         return self
@@
-        if (
-            has_byok
-            and has_okp
-            and not hasattr(self.reranker, "_explicitly_configured")
-        ):
+        if has_byok and has_okp and not self.reranker._explicitly_configured:  # pylint: disable=no-member
             if not self.reranker.enabled:
                 logger.info(
                     "Automatically enabling reranker: Both BYOK RAG (%d entries) and OKP "
                     "are configured. Reranking improves result quality when multiple "
```

In Pydantic v2, do `PrivateAttr(default=...)` attributes always exist on model instances, and does `model_fields_set` track fields explicitly provided in the input even when the provided value equals the field default?

Also applies to: 2154-2168
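The verification question above can be answered with a tiny runnable check (Pydantic v2; the model here is a stand-in for illustration, not the repo's class):

```python
from pydantic import BaseModel, PrivateAttr


class Reranker(BaseModel):
    enabled: bool = True
    _explicitly_configured: bool = PrivateAttr(default=False)


print(Reranker().model_fields_set)                    # set() -- nothing passed explicitly
print(Reranker(enabled=True).model_fields_set)        # {'enabled'} -- tracked even though it equals the default
print(hasattr(Reranker(), "_explicitly_configured"))  # True -- so the hasattr() guard never fires
```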
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/models/config.py` around lines 1829 - 1861, The _explicitly_configured flag is never meaningful because PrivateAttr always exists; update the mark_as_explicitly_configured validator (and the similar validator at lines ~2154-2168) to use model_fields_set to detect which fields were explicitly provided and set self._explicitly_configured = True when any reranker-related fields (e.g., "enabled", "model", "top_k_multiplier", "byok_boost", "okp_boost") appear in self.model_fields_set or when the parent model passed reranker in its input; also change any checks that use hasattr(self.reranker, "_explicitly_configured") to inspect the flag's value directly (self.reranker._explicitly_configured) so the auto-enable branch can run when the user explicitly provided values even if they match defaults.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 32b1995a-8892-4d14-a537-53427a6d3620
📒 Files selected for processing (7)
pyproject.toml, src/configuration.py, src/constants.py, src/models/config.py, src/utils/vector_search.py, tests/unit/models/config/test_reranker_configuration.py, tests/unit/utils/test_vector_search.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
- GitHub Check: build-pr
- GitHub Check: unit_tests (3.13)
- GitHub Check: unit_tests (3.12)
- GitHub Check: integration_tests (3.12)
- GitHub Check: integration_tests (3.13)
- GitHub Check: Pylinter
- GitHub Check: mypy
- GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
🧰 Additional context used
📓 Path-based instructions (5)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Use absolute imports for internal modules: `from authentication import get_auth_dependency`
Import FastAPI dependencies with: `from fastapi import APIRouter, HTTPException, Request, status, Depends`
Import Llama Stack client with: `from llama_stack_client import AsyncLlamaStackClient`
Check `constants.py` for shared constants before defining new ones
All modules start with descriptive docstrings explaining purpose
Use `logger = get_logger(__name__)` from `log.py` for module logging
Type aliases defined at module level for clarity
Use `Final[type]` as type hint for all constants
All functions require docstrings with brief descriptions
Complete type annotations for parameters and return types in functions
Use `typing_extensions.Self` for model validators in Pydantic models
Use modern union type syntax `str | int` instead of `Union[str, int]`
Use `Optional[Type]` for optional type hints
Use snake_case with descriptive, action-oriented function names (get_, validate_, check_)
Avoid in-place parameter modification anti-patterns; return new data structures instead
Use `async def` for I/O operations and external API calls
Handle `APIConnectionError` from Llama Stack in error handling
Use standard log levels with clear purposes: debug, info, warning, error
All classes require descriptive docstrings explaining purpose
Use PascalCase for class names with standard suffixes: Configuration, Error/Exception, Resolver, Interface
Use ABC for abstract base classes with `@abstractmethod` decorators
Use `@model_validator` and `@field_validator` for Pydantic model validation
Complete type annotations for all class attributes; use specific types, not `Any`
Follow Google Python docstring conventions with Parameters, Returns, Raises, and Attributes sections
Files:
src/constants.py, src/configuration.py, tests/unit/models/config/test_reranker_configuration.py, src/models/config.py, tests/unit/utils/test_vector_search.py, src/utils/vector_search.py
**/constants.py
📄 CodeRabbit inference engine (AGENTS.md)
Central `constants.py` for shared constants with descriptive comments
Files:
src/constants.py
src/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Pydantic models extend `ConfigurationBase` for config, `BaseModel` for data models
Files:
src/constants.py, src/configuration.py, src/models/config.py, src/utils/vector_search.py
src/**/config*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/**/config*.py: All config uses Pydantic models extending `ConfigurationBase`
Base class sets `extra="forbid"` to reject unknown fields in Pydantic models
Use `@field_validator` and `@model_validator` for custom validation in Pydantic models
Use type hints like `Optional[FilePath]`, `PositiveInt`, `SecretStr` in Pydantic models
Files:
src/configuration.py, src/models/config.py
tests/{unit,integration}/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
tests/{unit,integration}/**/*.py: Use pytest for all unit and integration tests
Do not use unittest; pytest is the standard for this project
Use `pytest-mock` for AsyncMock objects in tests
Use marker `pytest.mark.asyncio` for async tests
Unit tests require 60% coverage, integration tests 10%
Files:
tests/unit/models/config/test_reranker_configuration.py, tests/unit/utils/test_vector_search.py
🧠 Learnings (9)
📚 Learning: 2026-04-19T15:40:25.624Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-stack PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-04-19T15:40:25.624Z
Learning: Applies to **/*.py : Use Final[type] as type hint for all constants
Applied to files:
src/constants.py
📚 Learning: 2026-04-19T15:40:25.624Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-stack PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-04-19T15:40:25.624Z
Learning: Applies to **/constants.py : Central `constants.py` for shared constants with descriptive comments
Applied to files:
src/constants.py
📚 Learning: 2026-03-17T11:34:53.242Z
Learnt from: are-ces
Repo: lightspeed-core/lightspeed-stack PR: 1335
File: docs/config.md:411-411
Timestamp: 2026-03-17T11:34:53.242Z
Learning: In the lightspeed-stack project (`src/models/config.py`, `docs/config.md`), the internal Solr filter `is_chunk:true` (defined as `SOLR_CHUNK_FILTER_QUERY` in `src/constants.py`) is always injected by the system for OKP searches and is intentionally hidden from users. The `chunk_filter_query` field in `OkpConfiguration` is user-facing and additive-only, but the documentation must NOT mention the internal `is_chunk:true` behavior — this is a deliberate design decision by the maintainers.
Applied to files:
src/constants.py, src/models/config.py, tests/unit/utils/test_vector_search.py, src/utils/vector_search.py
📚 Learning: 2026-04-19T15:40:25.624Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-stack PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-04-19T15:40:25.624Z
Learning: Always check `pyproject.toml` for existing dependencies before adding new ones
Applied to files:
pyproject.toml
📚 Learning: 2026-04-19T15:40:25.624Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-stack PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-04-19T15:40:25.624Z
Learning: Applies to src/**/config*.py : All config uses Pydantic models extending `ConfigurationBase`
Applied to files:
tests/unit/models/config/test_reranker_configuration.py, src/models/config.py
📚 Learning: 2026-04-19T15:40:25.624Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-stack PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-04-19T15:40:25.624Z
Learning: Applies to src/**/*.py : Pydantic models extend `ConfigurationBase` for config, `BaseModel` for data models
Applied to files:
src/models/config.py
📚 Learning: 2026-01-12T10:58:40.230Z
Learnt from: blublinsky
Repo: lightspeed-core/lightspeed-stack PR: 972
File: src/models/config.py:459-513
Timestamp: 2026-01-12T10:58:40.230Z
Learning: In lightspeed-core/lightspeed-stack, for Python files under src/models, when a user claims a fix is done but the issue persists, verify the current code state before accepting the fix. Steps: review the diff, fetch the latest changes, run relevant tests, reproduce the issue, search the codebase for lingering references to the original problem, confirm the fix is applied and not undone by subsequent commits, and validate with local checks to ensure the issue is resolved.
Applied to files:
src/models/config.py
📚 Learning: 2026-02-25T07:46:33.545Z
Learnt from: asimurka
Repo: lightspeed-core/lightspeed-stack PR: 1211
File: src/models/responses.py:8-16
Timestamp: 2026-02-25T07:46:33.545Z
Learning: In the Python codebase, requests.py should use OpenAIResponseInputTool as Tool while responses.py uses OpenAIResponseTool as Tool. This difference is intentional due to differing schemas for input vs output tools in llama-stack-api. Apply this distinction consistently to other models under src/models (e.g., ensure request-related tools use the InputTool variant and response-related tools use the ResponseTool variant). If adding new tools, choose the corresponding InputTool or Tool class based on whether the tool represents input or output, and document the rationale in code comments.
Applied to files:
src/models/config.py
📚 Learning: 2026-01-30T13:33:40.479Z
Learnt from: are-ces
Repo: lightspeed-core/lightspeed-stack PR: 1073
File: src/app/endpoints/query_v2.py:575-586
Timestamp: 2026-01-30T13:33:40.479Z
Learning: In `src/app/endpoints/query_v2.py`, the `parse_referenced_documents_from_responses_api` function filters referenced documents to include only those with `doc_title` or `doc_url` because these documents are displayed in the frontend. Documents with only `doc_id` are intentionally excluded as they wouldn't provide useful information to end users.
Applied to files:
src/utils/vector_search.py
```python
    for chunk in rag_chunks:
        attrs = chunk.attributes or {}
        doc_url = (
            attrs.get("reference_url") or attrs.get("doc_url") or attrs.get("docs_url")
        )
        doc_id = attrs.get("document_id") or attrs.get("doc_id")
        dedup_key = doc_url or doc_id or chunk.source or ""
        if not dedup_key or dedup_key in seen:
            continue
        seen.add(dedup_key)
        parsed_url: Optional[AnyUrl] = None
        if doc_url:
            try:
                parsed_url = AnyUrl(doc_url)
            except Exception:  # pylint: disable=broad-exception-caught
                parsed_url = None
        result.append(
            ReferencedDocument(
                doc_title=attrs.get("title"),
                doc_url=parsed_url,
                source=chunk.source,
            )
        )
    return result
```
Skip referenced documents that have neither title nor URL.
This helper now emits ReferencedDocument objects even when both doc_title and doc_url are missing. Those entries are intentionally filtered out elsewhere because the frontend cannot render them usefully, so this regresses the final referenced-doc list after reranking.
♻️ Suggested guard
```diff
 for chunk in rag_chunks:
     attrs = chunk.attributes or {}
     doc_url = (
         attrs.get("reference_url") or attrs.get("doc_url") or attrs.get("docs_url")
     )
+    title = attrs.get("title")
     doc_id = attrs.get("document_id") or attrs.get("doc_id")
     dedup_key = doc_url or doc_id or chunk.source or ""
     if not dedup_key or dedup_key in seen:
         continue
     seen.add(dedup_key)
@@
-    result.append(
-        ReferencedDocument(
-            doc_title=attrs.get("title"),
-            doc_url=parsed_url,
-            source=chunk.source,
-        )
-    )
+    if title is None and parsed_url is None:
+        continue
+    result.append(
+        ReferencedDocument(
+            doc_title=title,
+            doc_url=parsed_url,
+            source=chunk.source,
+        )
+    )
```

Based on learnings: In src/app/endpoints/query_v2.py, the parse_referenced_documents_from_responses_api function filters referenced documents to include only those with doc_title or doc_url because these documents are displayed in the frontend.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/utils/vector_search.py` around lines 182 - 205, The loop that builds
ReferencedDocument from rag_chunks currently appends entries even when both
title and URL are missing; update the logic in the loop (the section that
computes attrs, doc_url, parsed_url and before result.append) to skip/continue
when both attrs.get("title") is falsy and parsed_url is falsy/None so you only
append ReferencedDocument when there is a doc_title or a doc_url; keep existing
dedup_key/seen handling and AnyUrl parsing but add a guard (using the computed
title and parsed_url) to avoid creating empty ReferencedDocument objects.
Signed-off-by: Anxhela Coba <acoba@redhat.com>
are-ces left a comment
The example config in the PR description must be outdated, because

```yaml
top_k_multiplier: 2.0   # Fetch 2x chunks, rerank, keep top_k
byok_boost: 1.2         # Boost factor for BYOK sources
okp_boost: 1.0          # Boost factor for OKP sources
```

are missing from the configuration in the PR.
Would be nice to see a configured reranker in the example configuration examples/lightspeed-stack-byok-okp-rag.yaml.
Please update unit and integration tests; fix Black and Linting failing tests. TY!
```python
        # Create query-chunk pairs for scoring
        pairs = [(query, chunk.content) for chunk in chunks]
        scores = model.predict(pairs)
```
As CodeRabbit points out, this is synchronous and blocks the process, please make async
```diff
     solr_chunks_task = _fetch_solr_rag(client, query, solr)

-    (byok_chunks, byok_docs), (solr_chunks, solr_docs) = await asyncio.gather(
+    (byok_chunks, _), (solr_chunks, _) = await asyncio.gather(
```
The new _referenced_documents_from_rag_chunks (line 169) duplicates the same extraction logic already present in _process_byok_rag_chunks_for_documents and _process_solr_chunks_for_documents.
On top of that, those two per-source functions are effectively dead code in the current flow, their return values are discarded on line 754:
```python
(byok_chunks, _), (solr_chunks, _) = await asyncio.gather(
    byok_chunks_task, solr_chunks_task
)
```

...and then _referenced_documents_from_rag_chunks re-extracts the same documents from the final chunks at line 785.
Suggestion: Instead of discarding the per-source documents and re-extracting them, keep the documents returned by _fetch_byok_rag and _fetch_solr_rag, then after reranking, filter/reorder them to match whichever chunks survived the rerank. This avoids duplicating the extraction logic and is more efficient — one lookup pass instead of a second extraction. _referenced_documents_from_rag_chunks can then be removed.
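A hedged sketch of that suggestion; the attribute names (chunk.source, doc.source) follow the hunks quoted above, and the helper name is hypothetical:

```python
def docs_for_surviving_chunks(docs, final_chunks):
    """Keep the per-source documents, filtered and ordered by the reranked chunks."""
    by_source = {doc.source: doc for doc in docs}
    seen: set = set()
    ordered = []
    for chunk in final_chunks:
        doc = by_source.get(chunk.source)
        if doc is not None and chunk.source not in seen:
            seen.add(chunk.source)
            ordered.append(doc)
    return ordered
```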
```python
        # pylint: disable=no-member
        has_okp = (
            constants.OKP_RAG_ID in self.rag.inline
            or constants.OKP_RAG_ID in self.rag.tool
```
Why should the re-ranked be active if OKP is in RAG as a Tool?
this is looking if okp is enabled either as inline or tool since we can configure okp either way unless I'm mistaken?
```python
            Self: The validated configuration instance with reranker potentially enabled.
        """
        # Check if BYOK RAG entries are configured
        has_byok = len(self.byok_rag) > 0
```
Should also check that there are any entries registered in inline RAG
```python
    Returns:
        RAGContext containing formatted context text and referenced documents
    """
    if moderation_decision == "blocked":
```
Why was this removed? This if was there to prevent doing RAG in case of an unsafe input
```python
    client: AsyncLlamaStackClient,
    query: str,
    solr: Optional[SolrVectorSearchRequest] = None,
    max_chunks: Optional[int] = None,
```
This parameter is not used
```python
    byok_chunks_task = _fetch_byok_rag(
        client, query, vector_store_ids, max_chunks=pool_size
    )
    solr_chunks_task = _fetch_solr_rag(client, query, solr)
```
Why are we not increasing the pool size for OKP chunks? I might miss something here...
| """ | ||
| if model_name not in _cross_encoder_models: | ||
| try: | ||
| from sentence_transformers import ( # pylint: disable=import-outside-toplevel |
This dep should be moved on top
```toml
    # To be able to fix multiple CVEs, also LCORE-1117
    "requests>=2.33.0",
    # Used for RAG chunk reranking (cross-encoder)
    "sentence-transformers>=2.0.0",
```
yes, it needs to be updated to pass CI etc.
thanks missed that - will update uv.lock
```python
class RerankerConfiguration(ConfigurationBase):
    """Reranker configuration for RAG chunk reranking."""

    enabled: bool = True
```
nit: can be initialized as Field so it would be possible to specify title and description.
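A sketch of that nit using pydantic.Field, which does accept title and description; the metadata strings are illustrative, and the real class extends ConfigurationBase rather than BaseModel:

```python
from pydantic import BaseModel, Field


class RerankerConfiguration(BaseModel):
    """Reranker configuration for RAG chunk reranking."""

    enabled: bool = Field(
        default=True,
        title="Enable reranking",
        description="Rerank merged BYOK/OKP chunks with a cross-encoder.",
    )
```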
Description
Cross-Encoder Reranking for Enhanced RAG Quality
Summary
Implemented neural reranking using cross-encoder models to improve RAG chunk relevance scoring. This enhancement replaces simple vector similarity with semantic relevance scoring when combining results from multiple knowledge sources (BYOK + OKP) that use different search types (e.g., hybrid, semantic).
Key Features
Flexible Configuration
```yaml
reranker:
  enabled: true
  model: "cross-encoder/ms-marco-MiniLM-L6-v2"
  top_k_multiplier: 2.0   # Fetch 2x chunks, rerank, keep top_k
  byok_boost: 1.2         # Boost factor for BYOK sources
  okp_boost: 1.0          # Boost factor for OKP sources
```
Auto-Enable
Error Handling
Technical Implementation
Modified Files
Type of change
Tools used to create PR
Identify any AI code assistants used in this PR (for transparency and review context)
Related Tickets & Documents
Checklist before requesting a review
Testing
Summary by CodeRabbit