Conversation

@AkhileshNegi (Collaborator) commented Dec 17, 2025

Summary

Target issue is #492

Checklist

Before submitting a pull request, please ensure that you mark these tasks.

  • Ran fastapi run --reload app/main.py or docker compose up in the repository root and tested.
  • If you've fixed a bug or added code, ensure it is tested and has test cases.

Notes

Summary by CodeRabbit

  • Refactor

    • Reworked evaluation into service-driven flows, moving endpoint orchestration to dedicated services and consolidating public evaluation API exports.
  • New Features

    • New dataset upload flow with stricter CSV validation, sanitization, optional object-store upload, and persistent dataset tracking.
    • Evaluation lifecycle now starts and returns runs via services; status retrieval can include trace scores with resync support.
  • Documentation

    • Endpoints enriched with clearer docstrings.
  • Tests

    • Tests updated to reference the new service locations.


@AkhileshNegi self-assigned this Dec 17, 2025
@coderabbitai bot commented Dec 17, 2025

📝 Walkthrough

Moves CSV validation, dataset upload, and evaluation orchestration from route handlers into new service modules; routes now delegate to services, export surfaces are adjusted, and tests are updated to patch the new service paths.
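
To make the shape of the refactor concrete, here is a minimal, self-contained sketch of the route-to-service delegation described above. The function names echo ones mentioned in this PR (upload_dataset, upload_dataset_endpoint), but the signatures and bodies are illustrative stand-ins, not the actual code.

from dataclasses import dataclass


@dataclass
class Dataset:
    id: int
    name: str


# Service layer (business logic, e.g. app/services/evaluation/dataset.py)
def upload_dataset(csv_content: bytes, dataset_name: str) -> Dataset:
    """Parse the CSV and persist a dataset record (toy stand-in for the real service)."""
    rows = csv_content.decode().splitlines()
    return Dataset(id=1, name=f"{dataset_name} ({len(rows) - 1} rows)")


# Route layer (thin HTTP wrapper, e.g. app/api/routes/evaluation.py)
def upload_dataset_endpoint(csv_content: bytes, dataset_name: str) -> dict:
    dataset = upload_dataset(csv_content, dataset_name)  # delegate orchestration to the service
    return {"success": True, "data": {"id": dataset.id, "name": dataset.name}}


print(upload_dataset_endpoint(b"question,answer\nQ1,A1\n", "demo"))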

Changes

  • Service Layer (new/expanded): backend/app/services/evaluation/__init__.py, backend/app/services/evaluation/dataset.py, backend/app/services/evaluation/evaluation.py, backend/app/services/evaluation/validators.py
    New evaluation service API: dataset upload, evaluation orchestration, config building, score fetching, and CSV validators (name sanitization, CSV validation/parsing, file limits).
  • API Routes (refactored): backend/app/api/routes/evaluation.py
    Endpoints delegate to service functions; upload_dataset renamed to upload_dataset_endpoint; removed in-endpoint CSV/Langfuse/OpenAI orchestration; added docstrings and _dataset_to_response mapper.
  • CRUD exports: backend/app/crud/evaluations/__init__.py
    Now imports/re-exports save_score and fetch_trace_scores_from_langfuse; explicit __all__ block removed.
  • Tests (patch targets updated): backend/app/tests/api/routes/test_evaluation.py
    Mock/patch targets updated to new service import paths (app.services.evaluation.dataset.*) only.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Client
    participant API as Evaluation API
    participant Svc as Dataset Service
    participant Storage as ObjectStore
    participant Langfuse
    participant DB

    Client->>API: POST /evaluation/datasets (file, name, duplication_factor)
    API->>Svc: validate_csv_file(file) & sanitize_dataset_name(name)
    Svc->>Svc: parse_csv_items(csv_content)
    alt object store available
        Svc->>Storage: upload_csv(csv_content) -> object_store_url
    end
    Svc->>Langfuse: upload_dataset_to_langfuse(csv_content) -> langfuse_id
    Svc->>DB: create EvaluationDataset(metadata, object_store_url?, langfuse_id)
    Svc-->>API: DatasetUploadResponse
    API-->>Client: 201 Created
sequenceDiagram
    autonumber
    participant Client
    participant API as Evaluation API
    participant EvalSvc as Evaluation Service
    participant DB
    participant OpenAI
    participant Langfuse

    Client->>API: POST /evaluation/runs (dataset_id, config, assistant_id)
    API->>EvalSvc: start_evaluation(dataset_id, experiment_name, config, assistant_id)
    EvalSvc->>DB: fetch dataset -> verify langfuse_id
    EvalSvc->>EvalSvc: build_evaluation_config(config, assistant_id)
    EvalSvc->>DB: create EvaluationRun (status: pending)
    EvalSvc->>Langfuse: start_evaluation_batch(langfuse_dataset_id, config)
    Langfuse-->>EvalSvc: batch started / error
    EvalSvc->>DB: update EvaluationRun (status, external ids)
    EvalSvc-->>API: return EvaluationRun
    API-->>Client: 202 Accepted

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • Prajna1999

Poem

🐰 I nibbled rows and parsed each line,
I stitched names tidy, CSVs fine,
Langfuse hummed and storage gave a wink,
Runs now leap where endpoints used to think—
Hop, hop, services made it shine! ✨

Pre-merge checks

❌ Failed checks (1 inconclusive)

  • Title check (❓ Inconclusive): The title 'Evaluation: Refactor' is overly generic and vague. It lacks specificity about what was refactored or what the primary goal of the refactoring is. Resolution: provide a more descriptive title that clearly indicates the main refactoring objective, such as 'Refactor evaluation endpoints to use service-driven architecture' or 'Extract evaluation logic into service layer'.

✅ Passed checks (2 passed)

  • Description Check (✅ Passed): Check skipped - CodeRabbit's high-level summary is enabled.
  • Docstring Coverage (✅ Passed): Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.

@AkhileshNegi added the enhancement label Dec 17, 2025
@AkhileshNegi marked this pull request as ready for review on December 18, 2025 at 06:53
@coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/api/routes/evaluation.py (1)

52-82: Fix event loop blocking: async endpoint calling sync I/O.

The endpoint is async def (required for await validate_csv_file), but line 72 calls the synchronous upload_dataset service, which performs blocking I/O operations (database queries, Langfuse API calls, object store uploads). This blocks FastAPI's event loop and degrades concurrency.

🔎 Recommended fixes (choose one):

Option 1 (simplest): Run the sync service in a thread pool

+from fastapi.concurrency import run_in_threadpool
+
 async def upload_dataset_endpoint(
     ...
 ) -> APIResponse[DatasetUploadResponse]:
     """Upload an evaluation dataset."""
     # Validate and read CSV file
     csv_content = await validate_csv_file(file)
 
     # Upload dataset using service
-    dataset = upload_dataset(
+    dataset = await run_in_threadpool(
+        upload_dataset,
         session=_session,
         csv_content=csv_content,
         dataset_name=dataset_name,
         description=description,
         duplication_factor=duplication_factor,
         organization_id=auth_context.organization.id,
         project_id=auth_context.project.id,
     )
 
     return APIResponse.success_response(data=_dataset_to_response(dataset))

Option 2: Make the service async (requires refactoring service layer)

This would require making upload_dataset and its dependencies async, which is a larger refactor but provides better async integration.

Note: The other endpoints (evaluate, get_evaluation_run_status) are correctly defined as sync (def), so FastAPI automatically runs them in a thread pool.

🧹 Nitpick comments (5)
backend/app/services/evaluation/dataset.py (1)

131-138: Consider sanitizing exception details in client-facing error message.

Exposing raw exception messages ({e}) could leak internal implementation details (stack traces, connection strings, etc.) to API clients.

🔎 Apply this diff to provide a generic error message:
     except Exception as e:
         logger.error(
             f"[upload_dataset] Failed to upload dataset to Langfuse | {e}",
             exc_info=True,
         )
         raise HTTPException(
-            status_code=500, detail=f"Failed to upload dataset to Langfuse: {e}"
+            status_code=500, detail="Failed to upload dataset to Langfuse. Please check your Langfuse credentials and try again."
         )
backend/app/services/evaluation/evaluation.py (3)

23-28: Consider more specific type hints for config parameter.

Using dict without type parameters reduces type safety. As per coding guidelines requiring type hints, consider using dict[str, Any] for clarity.

🔎 Apply this diff:
+from typing import Any
+
 def build_evaluation_config(
     session: Session,
-    config: dict,
+    config: dict[str, Any],
     assistant_id: str | None,
     project_id: int,
-) -> dict:
+) -> dict[str, Any]:

101-109: Consider more specific type hints for config parameter here as well.

Same suggestion as build_evaluation_config for consistency.


313-319: Consider sanitizing exception details in error message.

Similar to the dataset service, exposing raw exception messages could leak internal details.

🔎 Apply this diff:
     except Exception as e:
         logger.error(
             f"[get_evaluation_with_scores] Failed to fetch trace info | "
             f"evaluation_id={evaluation_id} | error={e}",
             exc_info=True,
         )
-        return eval_run, f"Failed to fetch trace info from Langfuse: {str(e)}"
+        return eval_run, "Failed to fetch trace info from Langfuse. Please try again later."
backend/app/api/routes/evaluation.py (1)

67-67: Consider enriching docstrings with parameter and behavior details.

The added docstrings are present but minimal, mostly restating the function names. For route handlers that delegate to services, consider documenting:

  • Key parameter constraints or validation rules
  • Important response scenarios (e.g., 404 vs 422)
  • Side effects or async behavior notes

Example for upload_dataset_endpoint:

"""
Upload an evaluation dataset from a CSV file.

Validates CSV format and content, uploads to Langfuse and optional object store,
and persists metadata. Runs synchronously in a thread pool to avoid blocking.
"""

This is optional since the Swagger descriptions are loaded from external markdown files.

Also applies to: 96-96, 124-124, 156-156, 200-200, 225-225, 269-269

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d26401b and d990add.

📒 Files selected for processing (7)
  • backend/app/api/routes/evaluation.py (10 hunks)
  • backend/app/crud/evaluations/__init__.py (4 hunks)
  • backend/app/services/evaluation/__init__.py (1 hunks)
  • backend/app/services/evaluation/dataset.py (1 hunks)
  • backend/app/services/evaluation/evaluation.py (1 hunks)
  • backend/app/services/evaluation/validators.py (1 hunks)
  • backend/app/tests/api/routes/test_evaluation.py (7 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
backend/app/services/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Implement business logic in services located in backend/app/services/

Files:

  • backend/app/services/evaluation/validators.py
  • backend/app/services/evaluation/dataset.py
  • backend/app/services/evaluation/__init__.py
  • backend/app/services/evaluation/evaluation.py
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Always add type hints to all function parameters and return values in Python code
Prefix all log messages with the function name in square brackets: logger.info(f"[function_name] Message {mask_string(sensitive_value)}")
Use Python 3.11+ with type hints throughout the codebase

Files:

  • backend/app/services/evaluation/validators.py
  • backend/app/crud/evaluations/__init__.py
  • backend/app/services/evaluation/dataset.py
  • backend/app/services/evaluation/__init__.py
  • backend/app/services/evaluation/evaluation.py
  • backend/app/tests/api/routes/test_evaluation.py
  • backend/app/api/routes/evaluation.py
backend/app/tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use factory pattern for test fixtures in backend/app/tests/

Files:

  • backend/app/tests/api/routes/test_evaluation.py
backend/app/api/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

backend/app/api/**/*.py: Define FastAPI REST endpoints in backend/app/api/ organized by domain
Load Swagger endpoint descriptions from external markdown files instead of inline strings using load_description("domain/action.md")

Files:

  • backend/app/api/routes/evaluation.py
🧬 Code graph analysis (4)
backend/app/crud/evaluations/__init__.py (2)
backend/app/crud/evaluations/core.py (1)
  • save_score (259-295)
backend/app/crud/evaluations/langfuse.py (1)
  • fetch_trace_scores_from_langfuse (304-543)
backend/app/services/evaluation/__init__.py (3)
backend/app/services/evaluation/dataset.py (1)
  • upload_dataset (24-163)
backend/app/services/evaluation/evaluation.py (2)
  • build_evaluation_config (23-98)
  • get_evaluation_with_scores (229-329)
backend/app/services/evaluation/validators.py (2)
  • parse_csv_items (118-174)
  • sanitize_dataset_name (23-68)
backend/app/services/evaluation/evaluation.py (7)
backend/app/crud/assistants.py (1)
  • get_assistant_by_id (19-30)
backend/app/crud/evaluations/core.py (3)
  • create_evaluation_run (14-61)
  • get_evaluation_run_by_id (103-141)
  • save_score (259-295)
backend/app/crud/evaluations/langfuse.py (1)
  • fetch_trace_scores_from_langfuse (304-543)
backend/app/crud/evaluations/dataset.py (1)
  • get_dataset_by_id (107-140)
backend/app/crud/evaluations/batch.py (1)
  • start_evaluation_batch (118-212)
backend/app/models/evaluation.py (1)
  • EvaluationRun (170-319)
backend/app/utils.py (2)
  • get_langfuse_client (212-248)
  • get_openai_client (179-209)
backend/app/api/routes/evaluation.py (3)
backend/app/services/evaluation/evaluation.py (1)
  • get_evaluation_with_scores (229-329)
backend/app/services/evaluation/dataset.py (1)
  • upload_dataset (24-163)
backend/app/services/evaluation/validators.py (1)
  • validate_csv_file (71-115)
🔇 Additional comments (29)
backend/app/tests/api/routes/test_evaluation.py (7)

56-64: LGTM! Mock paths correctly updated to reflect service layer refactoring.

The patch targets are properly updated from app.api.routes.evaluation.* to app.services.evaluation.dataset.*, aligning with the new service-oriented architecture.


142-150: Mock paths consistently updated for empty rows test.


188-196: Mock paths consistently updated for default duplication test.


229-237: Mock paths consistently updated for custom duplication test.


270-278: Mock paths consistently updated for description test.


362-370: Mock paths consistently updated for boundary minimum test.


407-410: Mock paths consistently updated for Langfuse configuration failure test.

backend/app/crud/evaluations/__init__.py (3)

4-9: LGTM! Properly exports save_score from core module.

The new export aligns with the service layer's need to persist evaluation scores.


26-31: LGTM! Properly exports fetch_trace_scores_from_langfuse from langfuse module.

This enables the evaluation service to retrieve trace scores via the CRUD layer's public API.


39-70: LGTM! __all__ correctly updated with new exports.

Both save_score and fetch_trace_scores_from_langfuse are properly listed in the appropriate sections.

backend/app/services/evaluation/validators.py (4)

13-20: LGTM! Well-defined security constants.

Clear file upload limits with appropriate MIME type allowances including text/plain for compatibility.


23-68: LGTM! Well-documented sanitization with comprehensive rules.

The function handles edge cases properly, includes clear examples in the docstring, and follows coding guidelines with type hints.


100-115: LGTM! Thorough file size validation.

Properly seeks to end for size check and resets position before reading.
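
In outline, that size check might look like the following; this is a sketch, not the PR's validate_csv_file, and the limit constant is an assumption.

# Illustrative size check for an uploaded file object: measure by seeking to the
# end, then rewind so the content can still be read afterwards.
import io

MAX_FILE_SIZE = 5 * 1024 * 1024          # assumed limit, not the PR's actual constant


def check_size(fileobj: io.IOBase) -> int:
    fileobj.seek(0, io.SEEK_END)          # jump to the end to learn the size
    size = fileobj.tell()
    fileobj.seek(0)                       # reset so later reads start at the beginning
    if size > MAX_FILE_SIZE:
        raise ValueError(f"File too large: {size} bytes (max {MAX_FILE_SIZE})")
    return size


print(check_size(io.BytesIO(b"question,answer\nQ,A\n")))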


118-174: LGTM! Robust CSV parsing with case-insensitive headers.

Good approach to normalize headers while preserving original column names for row access. Properly filters empty rows and re-raises HTTPException without wrapping.
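
For readers unfamiliar with the pattern, here is a small, self-contained sketch of case-insensitive header handling in the spirit of parse_csv_items; the required column names ("question", "answer") and the error type are assumptions, not the PR's actual rules.

# Illustrative only: normalize CSV headers case-insensitively while keeping the
# original column names for row access. Not the actual parse_csv_items.
import csv
import io


def parse_items(csv_text: str) -> list[dict[str, str]]:
    reader = csv.DictReader(io.StringIO(csv_text))
    # Map lowercased header -> original header so lookups ignore case.
    header_map = {name.lower(): name for name in (reader.fieldnames or [])}
    required = {"question", "answer"}
    if not required.issubset(header_map):
        raise ValueError(f"CSV must contain columns: {sorted(required)}")
    items = []
    for row in reader:
        question = (row[header_map["question"]] or "").strip()
        answer = (row[header_map["answer"]] or "").strip()
        if question or answer:            # skip fully empty rows
            items.append({"question": question, "answer": answer})
    return items


print(parse_items("Question,ANSWER\nWhat is 2+2?,4\n,\n"))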

backend/app/services/evaluation/dataset.py (4)

24-32: LGTM! Well-typed function signature.

All parameters have proper type hints including the union type for optional description. As per coding guidelines, type hints are correctly applied.


58-74: LGTM! Proper sanitization with informative logging.

Logging follows the [function_name] prefix convention as per coding guidelines.


86-108: LGTM! Graceful degradation for object store failures.

Non-fatal handling allows the workflow to continue when object storage is unavailable, which is appropriate for optional storage backends.
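
The non-fatal pattern described here can be sketched as follows; the upload function and variable names are stand-ins, not the repository's code.

# Treat the object-store upload as optional: log the failure and continue with
# None instead of aborting the whole dataset upload.
import logging

logger = logging.getLogger(__name__)


def upload_csv_to_store(csv_content: bytes) -> str:
    raise ConnectionError("object store unreachable")     # simulate an outage


def store_dataset(csv_content: bytes) -> str | None:
    object_store_url: str | None = None
    try:
        object_store_url = upload_csv_to_store(csv_content)
    except Exception as exc:
        # Non-fatal: the dataset can still be created without an object-store copy.
        logger.warning("[store_dataset] Object store upload failed | %s", exc)
    return object_store_url


print(store_dataset(b"question,answer\n"))                # None, but no exception raised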


140-163: LGTM! Clean metadata construction and database persistence.

The metadata structure is well-organized and the final logging confirms successful creation.

backend/app/services/evaluation/__init__.py (1)

1-32: LGTM! Clean public API surface definition.

Well-organized exports with clear categorization. The __all__ list provides a clear contract for the evaluation services package.

backend/app/services/evaluation/evaluation.py (8)

47-86: LGTM! Robust assistant config merging logic.

Properly handles the merge precedence where provided config values override assistant values, and conditionally adds tools when vector stores are available.
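
A compact sketch of that precedence rule (provided values win over assistant defaults; tools attached only when vector stores exist). The keys and structure are assumptions, not the real schema.

# Illustrative merge: start from assistant-derived defaults, let explicitly
# provided config values override them, and attach tools only if the assistant
# has vector stores.
from typing import Any


def merge_config(provided: dict[str, Any], assistant: dict[str, Any]) -> dict[str, Any]:
    merged = {**assistant, **{k: v for k, v in provided.items() if v is not None}}
    vector_store_ids = assistant.get("vector_store_ids") or []
    if vector_store_ids:
        merged["tools"] = [{"type": "file_search", "vector_store_ids": vector_store_ids}]
    return merged


assistant_defaults = {"model": "gpt-4o", "temperature": 0.2, "vector_store_ids": ["vs_1"]}
provided_config = {"temperature": 0.7, "model": None}
# Provided temperature wins, the assistant's model is kept, and tools are attached.
print(merge_config(provided_config, assistant_defaults))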


88-98: LGTM! Proper validation for direct config usage.

Clear error message when model is missing and no assistant_id is provided.


142-170: LGTM! Thorough dataset validation with clear error messages.

Properly validates both dataset existence and Langfuse ID presence before proceeding.


179-200: LGTM! Clean client initialization and evaluation run creation.

Follows the documented workflow steps clearly.


229-236: LGTM! Well-typed return signature.

The tuple[EvaluationRun | None, str | None] return type clearly communicates the possible outcomes.


276-286: LGTM! Proper caching logic for trace scores.

Correctly checks both completion status and existing cached scores before fetching from Langfuse.


321-329: LGTM! Clean score persistence flow.

Uses the separate-session save_score function correctly to persist scores after the primary fetch operation.


219-226: The error handling behavior is correct and intentional. start_evaluation_batch properly sets status = "failed" with an error_message before re-raising the exception. The exception is then caught in start_evaluation, and session.refresh(eval_run) retrieves the updated failure state from the database. The approach is sound—error details are persisted before exception propagation, ensuring the evaluation run maintains failure context when returned to the caller.
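
A schematic illustration of this persist-then-reraise pattern follows; the run object, status values, and function names are stand-ins, not the repository's actual code.

# Record the failure on the run before re-raising, so a caller that catches the
# exception still sees the failed state afterwards.
class EvalRun:
    def __init__(self) -> None:
        self.status = "pending"
        self.error_message: str | None = None


def start_batch(run: EvalRun) -> None:
    try:
        raise RuntimeError("Langfuse batch failed")      # simulate an upstream error
    except Exception as exc:
        run.status = "failed"                            # persist failure context first...
        run.error_message = str(exc)
        raise                                            # ...then propagate the exception


def start_evaluation() -> EvalRun:
    run = EvalRun()
    try:
        start_batch(run)
    except Exception:
        pass                                             # run already carries the failure state
    return run


run = start_evaluation()
print(run.status, run.error_message)                     # failed Langfuse batch failed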

backend/app/api/routes/evaluation.py (2)

1-23: LGTM: Clean service-oriented refactoring.

The refactoring to delegate business logic to service modules improves separation of concerns and maintainability. The route handlers are now thin wrappers focused on HTTP concerns.


294-295: Graceful degradation pattern is intentional and well-designed.

Lines 294-295 correctly use APIResponse.failure_response(), which is explicitly designed to support returning data alongside errors. The failure_response() method signature includes an optional data: Optional[T] = None parameter, and the response is constructed with both success=False, error=error_message, and data=data. This pattern is appropriate for cases where the evaluation run exists but additional enrichment (like trace info) fails, allowing clients to receive partial data along with error context.
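
As a rough sketch of that partial-data pattern, the field names below follow the description above; this is not the project's actual APIResponse class.

# Illustrative generic response wrapper that can carry data alongside an error,
# mirroring the failure_response(data=...) behaviour described above.
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

T = TypeVar("T")


@dataclass
class APIResponse(Generic[T]):
    success: bool
    data: Optional[T] = None
    error: Optional[str] = None

    @classmethod
    def success_response(cls, data: T) -> "APIResponse[T]":
        return cls(success=True, data=data)

    @classmethod
    def failure_response(cls, error: str, data: Optional[T] = None) -> "APIResponse[T]":
        return cls(success=False, error=error, data=data)


# The run exists but trace enrichment failed: return it together with error context.
print(APIResponse.failure_response(error="Failed to fetch trace info", data={"id": 42}))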

Comment on lines +84 to +90
    # Security validation: Check file extension
    file_ext = Path(file.filename).suffix.lower()
    if file_ext not in ALLOWED_EXTENSIONS:
        raise HTTPException(
            status_code=422,
            detail=f"Invalid file type. Only CSV files are allowed. Got: {file_ext}",
        )

⚠️ Potential issue | 🟡 Minor

Handle potential None filename.

file.filename can be None for certain upload scenarios. Path(None) will raise TypeError.

🔎 Apply this diff to handle None filename:
     # Security validation: Check file extension
+    if not file.filename:
+        raise HTTPException(
+            status_code=422,
+            detail="File must have a filename",
+        )
     file_ext = Path(file.filename).suffix.lower()
🤖 Prompt for AI Agents
In backend/app/services/evaluation/validators.py around lines 84 to 90,
file.filename can be None which makes Path(file.filename) raise TypeError; guard
against a missing filename first and return the same 422 HTTPException with a
clear message if it's None (e.g., "Missing filename or invalid upload"),
otherwise safely compute the suffix (lowercased) from the filename and continue
the existing allowed-extension check and error path.

@coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/api/routes/evaluation.py (1)

51-81: Add logging to endpoint and address blocking I/O pattern.

Two issues found:

  1. Missing logging: This endpoint lacks logging despite other endpoints in the file logging operations with [function_name] prefix. Per coding guidelines, add a log statement prefixed with [upload_dataset_endpoint].

  2. Blocking I/O in async context: upload_dataset is a synchronous function that performs blocking I/O operations (database writes, object store uploads, Langfuse API calls). Calling it directly in an async endpoint blocks the event loop. Consider wrapping it with asyncio.to_thread() or refactoring the service function to be async.

Suggested fix for logging:

async def upload_dataset_endpoint(
    ...
) -> APIResponse[DatasetUploadResponse]:
    """Upload an evaluation dataset."""
+   logger.info(
+       f"[upload_dataset_endpoint] Uploading dataset | "
+       f"name={dataset_name} | org_id={auth_context.organization.id} | "
+       f"project_id={auth_context.project.id}"
+   )
    # Validate and read CSV file
    csv_content = await validate_csv_file(file)

    # Upload dataset using service
    dataset = upload_dataset(
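
For the blocking-I/O point, here is a minimal, self-contained sketch of the asyncio.to_thread approach mentioned above; the function names are illustrative, not the repository's code.

# Run a blocking, synchronous service call without stalling the event loop.
# upload_dataset here is a stand-in for the real synchronous service function.
import asyncio
import time


def upload_dataset(dataset_name: str) -> dict:
    time.sleep(0.1)                      # simulate blocking DB / Langfuse / object-store I/O
    return {"name": dataset_name}


async def upload_dataset_endpoint(dataset_name: str) -> dict:
    # Offload the sync call to a worker thread so the event loop stays responsive.
    return await asyncio.to_thread(upload_dataset, dataset_name)


print(asyncio.run(upload_dataset_endpoint("demo")))
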
🧹 Nitpick comments (4)
backend/app/api/routes/evaluation.py (4)

33-43: Add type hint for dataset parameter.

The dataset parameter is missing a type hint. As per coding guidelines, all function parameters should have type hints.

🔎 Suggested fix:
-def _dataset_to_response(dataset) -> DatasetUploadResponse:
+def _dataset_to_response(dataset: "Dataset") -> DatasetUploadResponse:

You'll need to import the Dataset model or use the appropriate type from your models. If you want to avoid a circular import, you can use a string annotation or TYPE_CHECKING block.
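
A small sketch of the TYPE_CHECKING approach mentioned here; the model name and import path are placeholders, not the project's actual ones.

# Annotate with the model type without importing it at runtime, avoiding circular imports.
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:  # evaluated only by type checkers, never at runtime
    from app.models.evaluation import Dataset  # placeholder import path


def _dataset_to_response(dataset: Dataset) -> dict:
    """Convert a dataset model to a response payload (illustrative body)."""
    return {"id": dataset.id, "name": dataset.name}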


89-110: Add logging for list operation.

This endpoint lacks logging. Per coding guidelines, consider adding a log statement prefixed with [list_datasets_endpoint] to track dataset listing operations.

🔎 Suggested addition:
 def list_datasets_endpoint(
     ...
 ) -> APIResponse[list[DatasetUploadResponse]]:
     """List evaluation datasets."""
+    logger.info(
+        f"[list_datasets_endpoint] Listing datasets | "
+        f"org_id={auth_context.organization.id} | "
+        f"project_id={auth_context.project.id} | limit={limit} | offset={offset}"
+    )
     # Enforce maximum limit
     if limit > 100:
         limit = 100

186-210: Add logging for evaluation start.

This endpoint lacks logging for the evaluation start operation. Starting an evaluation is a significant action that should be logged for observability and debugging purposes.

🔎 Suggested addition:
 def evaluate(
     ...
 ) -> APIResponse[EvaluationRunPublic]:
     """Start an evaluation run."""
+    logger.info(
+        f"[evaluate] Starting evaluation | dataset_id={dataset_id} | "
+        f"experiment_name={experiment_name} | assistant_id={assistant_id} | "
+        f"org_id={auth_context.organization.id} | project_id={auth_context.project.id}"
+    )
     eval_run = start_evaluation(

247-296: Add logging for evaluation status fetch.

This endpoint lacks logging. Consider adding a log statement prefixed with [get_evaluation_run_status] for consistency with other endpoints like get_dataset.

🔎 Suggested addition:
 def get_evaluation_run_status(
     ...
 ) -> APIResponse[EvaluationRunPublic]:
     """Get evaluation run status with optional trace info."""
+    logger.info(
+        f"[get_evaluation_run_status] Fetching evaluation | id={evaluation_id} | "
+        f"get_trace_info={get_trace_info} | resync_score={resync_score} | "
+        f"org_id={auth_context.organization.id}"
+    )
     if resync_score and not get_trace_info:
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d990add and 8e5ba77.

📒 Files selected for processing (1)
  • backend/app/api/routes/evaluation.py (10 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
backend/app/api/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

backend/app/api/**/*.py: Define FastAPI REST endpoints in backend/app/api/ organized by domain
Load Swagger endpoint descriptions from external markdown files instead of inline strings using load_description("domain/action.md")

Files:

  • backend/app/api/routes/evaluation.py
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Always add type hints to all function parameters and return values in Python code
Prefix all log messages with the function name in square brackets: logger.info(f"[function_name] Message {mask_string(sensitive_value)}")
Use Python 3.11+ with type hints throughout the codebase

Files:

  • backend/app/api/routes/evaluation.py
🔇 Additional comments (2)
backend/app/api/routes/evaluation.py (2)

113-142: LGTM!

Proper logging with [get_dataset] prefix, complete type hints, and appropriate error handling for the 404 case.


145-178: LGTM!

Proper logging with [delete_dataset] prefix at entry and exit. Type hints are complete.

@coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/crud/evaluations/__init__.py (1)

1-37: Add an explicit __all__ list to declare the public API.

Per PEP 8, modules should explicitly declare the names in their public API using the __all__ attribute. This module currently imports 21+ functions from six submodules without defining __all__, which means all imported names become part of the public API by default. Using __all__ allows you to explicitly specify the public interface of your packages and modules and prevents accidental usage of objects that shouldn't be used from outside the module. Define an __all__ list after the module docstring containing the functions intended as part of the public API.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8e5ba77 and 7ba9b43.

📒 Files selected for processing (2)
  • backend/app/crud/evaluations/__init__.py (2 hunks)
  • backend/app/services/evaluation/__init__.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • backend/app/services/evaluation/__init__.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Always add type hints to all function parameters and return values in Python code
Prefix all log messages with the function name in square brackets: logger.info(f"[function_name] Message {mask_string(sensitive_value)}")
Use Python 3.11+ with type hints throughout the codebase

Files:

  • backend/app/crud/evaluations/__init__.py
🧬 Code graph analysis (1)
backend/app/crud/evaluations/__init__.py (2)
backend/app/crud/evaluations/core.py (1)
  • save_score (259-295)
backend/app/crud/evaluations/langfuse.py (1)
  • fetch_trace_scores_from_langfuse (304-543)
🔇 Additional comments (2)
backend/app/crud/evaluations/__init__.py (2)

8-8: LGTM!

The import of save_score is correctly added and maintains alphabetical ordering with other imports from the core module.


28-28: LGTM!

The import of fetch_trace_scores_from_langfuse is correctly added and maintains alphabetical ordering with other imports from the langfuse module.

@codecov bot commented Dec 18, 2025

Codecov Report

❌ Patch coverage is 65.77540% with 64 lines in your changes missing coverage. Please review.

Files with missing lines (patch coverage, lines missing):

  • backend/app/services/evaluation/evaluation.py: 34.28% (46 missing ⚠️)
  • backend/app/services/evaluation/validators.py: 83.05% (10 missing ⚠️)
  • backend/app/services/evaluation/dataset.py: 84.09% (7 missing ⚠️)
  • backend/app/api/routes/evaluation.py: 90.90% (1 missing ⚠️)


@coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
backend/app/api/routes/evaluation.py (3)

5-36: Critical: Missing imports causing pipeline failure.

The code references require_permission and Permission on lines 60, 99, 129, 162, 199, 232, and 262, but these are not imported. This is causing the pytest collection to fail with NameError: name 'require_permission' is not defined.

🔎 Add missing imports
 from app.api.deps import AuthContextDep, SessionDep
+from app.api.deps import require_permission, Permission
 from app.crud.evaluations import (

43-53: Add type hint for dataset parameter.

The dataset parameter is missing a type hint, which violates the coding guideline requiring type hints on all function parameters.

Based on coding guidelines, all function parameters must have type hints.

🔎 Add type hint
-def _dataset_to_response(dataset) -> DatasetUploadResponse:
+def _dataset_to_response(dataset: EvaluationDataset) -> DatasetUploadResponse:
     """Convert a dataset model to a DatasetUploadResponse."""

Note: You may need to import EvaluationDataset from the appropriate module if not already imported.


101-122: Unify inconsistent auth_context attribute access patterns across all endpoints.

This function uses auth_context.organization_.id and auth_context.project_.id (non-optional property accessors), while other endpoints like upload_dataset_endpoint (lines 88-89) and start_evaluation (lines 221-222) use auth_context.organization.id and auth_context.project.id (optional attribute accessors). All endpoints have REQUIRE_PROJECT permission dependencies, ensuring these values are never None; however, the codebase should consistently use the non-optional property accessors (organization_ and project_) for semantic clarity and fail-fast error handling.

Update lines 114-115 and other inconsistent endpoints to use the underscore versions for uniformity.

♻️ Duplicate comments (3)
backend/app/api/routes/evaluation.py (3)

131-155: Verify inconsistent auth_context attribute access.

Same issue as in list_datasets_endpoint: this function uses auth_context.organization_.id and auth_context.project_.id (with underscores) on lines 139-140 and 146-147, while refactored endpoints use the non-underscore version. This inconsistency needs to be resolved.


164-192: Verify inconsistent auth_context attribute access.

Same issue: this function uses auth_context.organization_.id and auth_context.project_.id (with underscores) on lines 172-173, while refactored endpoints use the non-underscore version.


234-255: Verify inconsistent auth_context attribute access.

Same issue: this function uses auth_context.organization_.id and auth_context.project_.id (with underscores) on lines 243-244 and 250-251, while refactored endpoints use the non-underscore version.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7ba9b43 and 09c4f4a.

📒 Files selected for processing (1)
  • backend/app/api/routes/evaluation.py
🧰 Additional context used
📓 Path-based instructions (2)
backend/app/api/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

backend/app/api/**/*.py: Define FastAPI REST endpoints in backend/app/api/ organized by domain
Load Swagger endpoint descriptions from external markdown files instead of inline strings using load_description("domain/action.md")

Files:

  • backend/app/api/routes/evaluation.py
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Always add type hints to all function parameters and return values in Python code
Prefix all log messages with the function name in square brackets: logger.info(f"[function_name] Message {mask_string(sensitive_value)}")
Use Python 3.11+ with type hints throughout the codebase

Files:

  • backend/app/api/routes/evaluation.py
🧬 Code graph analysis (1)
backend/app/api/routes/evaluation.py (4)
backend/app/services/evaluation/evaluation.py (1)
  • get_evaluation_with_scores (229-329)
backend/app/services/evaluation/dataset.py (1)
  • upload_dataset (24-163)
backend/app/services/evaluation/validators.py (1)
  • validate_csv_file (71-115)
backend/app/utils.py (3)
  • APIResponse (33-57)
  • success_response (40-43)
  • failure_response (46-57)
🪛 GitHub Actions: Kaapi CI
backend/app/api/routes/evaluation.py

[error] 60-60: NameError: name 'require_permission' is not defined. This occurred during pytest collection while running: 'coverage run --source=app -m pytest'.

🔇 Additional comments (3)
backend/app/api/routes/evaluation.py (3)

62-92: LGTM: Clean service delegation.

The refactored endpoint properly delegates CSV validation and dataset upload to the service layer, maintaining clear separation of concerns. The error handling is appropriately managed by the service functions.


201-225: LGTM: Simplified evaluation orchestration.

The endpoint correctly delegates evaluation orchestration to start_evaluation, simplifying the route handler and moving business logic to the service layer as intended by this refactor.


264-312: LGTM: Proper error handling for evaluation scores.

The endpoint correctly handles the tuple return from get_evaluation_with_scores and appropriately uses failure_response when an error is present. This maintains a consistent API response structure while delegating the complex Langfuse interaction to the service layer.
