
sft mvp #298

Open
distributedstatemachine wants to merge 10 commits into main from feat/training

Conversation

@distributedstatemachine
Contributor

@distributedstatemachine distributedstatemachine commented Dec 18, 2025

Summary

Brief description of what this PR does.

Related Issues

Closes #(issue number)

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Performance improvement
  • Code refactoring

Changes Made

List the main changes in this PR:

Testing

How Has This Been Tested?

Describe the tests you ran to verify your changes.

  • Unit tests pass (cargo test)
  • Integration tests pass
  • Manual testing completed

Test Configuration

  • OS:
  • Rust version:
  • Bittensor version (if applicable):

Checklist

  • My code follows the project's style guidelines
  • I have run cargo fmt to format my code
  • I have run cargo clippy and addressed all warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Additional Context

Add any other context about the PR here.

Summary by CodeRabbit

  • New Features

    • Introduced Basilica Training SDK with comprehensive Python API for LoRA fine-tuning, including session management, training operations, sampling, and checkpoint management.
    • Added training service with HTTP endpoints for forward/backward passes, optimization steps, and text generation.
    • Replaced previous evaluator scaffolding with Affine and AgentGym evaluators for improved flexibility.
  • Documentation

    • Added training service README and comprehensive example scripts demonstrating SDK usage.

@coderabbitai
Contributor

coderabbitai bot commented Dec 18, 2025

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6ea43c74-df6a-484a-83fd-bb46996bfef2

📥 Commits

Reviewing files that changed from the base of the PR and between f03f1d8 and fa32881.

📒 Files selected for processing (1)
  • .gitignore
✅ Files skipped from review due to trivial changes (1)
  • .gitignore

Walkthrough

This PR introduces the Basilica Training SDK and supporting infrastructure for LoRA-based model fine-tuning. It includes Python SDK clients, a FastAPI training backend service, Kubernetes resources (CRDs, operators, gateways), Docker Compose configuration, example scripts, and comprehensive testing.

Changes

Cohort / File(s) Summary
Python SDK Public API Surface
crates/basilica-sdk-python/python/basilica/training/__init__.py
Re-exports core public API (clients, data types, exceptions) at module level; sets __version__ to "0.2.0"; defines consolidated __all__ list for clean package imports.
SDK Core Clients
crates/basilica-sdk-python/python/basilica/training/service_client.py, training_client.py, sampling_client.py, rest_client.py
Four client implementations: ServiceClient as the entry point with capability probing, TrainingClient for training operations (forward/backward, optimizer steps), SamplingClient for text generation, and RestClient for run/checkpoint management. All use httpx with a ThreadPoolExecutor to provide async wrappers via APIFuture.
SDK Data Types & Exceptions
crates/basilica-sdk-python/python/basilica/training/types.py, exceptions.py
Introduces data classes (Datum, ModelInput, SamplingParams, SampleResponse, ForwardBackwardResult, ForwardResult) and GetServerCapabilitiesResponse for API contracts. Defines TrainingError hierarchy with 9 exception types for error handling and validation.
SDK Tests
crates/basilica-sdk-python/python/basilica/training/tests/test_clients.py, tests/__init__.py
Comprehensive test coverage for APIFuture, data types, ServiceClient, TrainingClient, RestClient, and GetServerCapabilitiesResponse with mocked HTTP interactions.
Training Backend Service Implementation
services/training-service/src/backend.py, server.py, Dockerfile, pyproject.toml, README.md, src/__init__.py
Backend service using PEFT LoRA for HuggingFace model fine-tuning. Includes TrainingBackend class managing session lifecycle, forward/backward passes, optimizer steps, sampling, and checkpoint save/load via S3/R2. FastAPI server exposes endpoints for session management, training, and inference.
Training Backend Tests
services/training-service/tests/test_backend.py, tests/__init__.py
Unit tests covering session lifecycle, forward-backward passes, optimization, sampling with logprobs, checkpoint persistence, status tracking, and error handling for nonexistent sessions.
Kubernetes CRDs & Operators
orchestrator/k8s/training/training-session-crd.yaml, operator-deployment.yaml, training-service-rbac.yaml
CustomResourceDefinition for TrainingSession with LoRA config, optimizer settings, GPU resources, and checkpoint storage. Operator deployment with RBAC for basilica-operator managing training sessions and related cluster resources.
Kubernetes Gateway & Infrastructure
orchestrator/k8s/training/local-dev/gateway.yaml, namespace.yaml
Gateway API setup with GatewayClass, HTTP listener on port 8080, ReferenceGrant for cross-namespace routing. Three namespaces: basilica-system, envoy-gateway-system, u-testuser.
Kubernetes Session Examples
orchestrator/k8s/training/training-session-example.yaml
Two example TrainingSession manifests for Llama-3.1-8B and Llama-3.1-70B with LoRA rank configurations, R2 checkpoint storage, and GPU resource specifications.
End-to-End Testing & Orchestration
scripts/local-training-e2e.sh
Comprehensive bash script for local development: cluster provisioning (k3d), Gateway/Envoy setup, image builds, storage/FUSE configuration, Postgres API backend, and end-to-end training workflow with port-forwarding and checkpoint management. Exposes CLI commands for cluster-up, deploy, test, logs, reset-db, etc.
Integration Testing
scripts/test-training-sdk.sh
Integration test script validating API health, endpoint availability, token authentication, and running Python examples and pytest suite.
Example Scripts
examples/training_example.py, examples/11_agentgym.py
training_example.py demonstrates end-to-end SDK usage with main(), sampling_example(), async_example(), and rest_client_example() entry points. 11_agentgym.py replaces LLM evaluator scaffolding with Affine and AgentGym evaluators, adding per-request API key handling via environment variables.
Docker Compose & Configuration
docker-compose.training.yml, .gitignore
Docker Compose with training-service and basilica-api services, GPU constraints, health checks, restart policies, and volume mounts. Gitignore updated to exclude .venv.

Sequence Diagram

sequenceDiagram
    participant User
    participant ServiceClient
    participant TrainingClient
    participant HTTPClient as HTTP/Remote
    participant TrainingBackend
    participant Model as HuggingFace Model
    
    User->>ServiceClient: create_lora_training_client(base_model, rank, ...)
    ServiceClient->>HTTPClient: POST /sessions (create)
    HTTPClient->>TrainingBackend: create_session(base_model)
    TrainingBackend->>Model: Load base model & tokenizer
    Model-->>TrainingBackend: Model loaded
    TrainingBackend-->>HTTPClient: session_id, status
    HTTPClient-->>ServiceClient: CreateSessionResponse
    ServiceClient->>ServiceClient: Wrap in TrainingClient
    ServiceClient-->>User: TrainingClient instance
    
    User->>TrainingClient: forward_backward(input_ids, labels)
    TrainingClient->>HTTPClient: POST /forward_backward
    HTTPClient->>TrainingBackend: forward_backward(tokens, labels)
    TrainingBackend->>Model: forward pass + loss
    Model-->>TrainingBackend: logits
    TrainingBackend->>TrainingBackend: backward pass
    TrainingBackend-->>HTTPClient: loss, logprobs, tokens_processed
    HTTPClient-->>TrainingClient: ForwardBackwardResult
    TrainingClient-->>User: APIFuture (loss, logprobs)
    
    User->>TrainingClient: optim_step()
    TrainingClient->>HTTPClient: POST /optim_step
    HTTPClient->>TrainingBackend: optim_step()
    TrainingBackend->>TrainingBackend: gradient clip & optimizer.step()
    TrainingBackend-->>HTTPClient: step count
    HTTPClient-->>TrainingClient: OptimStepResponse
    TrainingClient-->>User: step count
    
    User->>TrainingClient: save_state(path)
    TrainingClient->>HTTPClient: POST /save_state
    HTTPClient->>TrainingBackend: save_state(path)
    TrainingBackend->>Model: Save LoRA weights to S3/R2
    Model-->>TrainingBackend: Checkpoint saved
    TrainingBackend-->>HTTPClient: checkpoint path
    HTTPClient-->>TrainingClient: SaveStateResponse
    TrainingClient-->>User: checkpoint path
sequenceDiagram
    participant Admin
    participant kubectl as kubectl/K8s API
    participant Operator
    participant TrainingSessionCRD
    participant PodController
    participant TrainingPod as Training Pod
    
    Admin->>kubectl: kubectl apply training-session.yaml
    kubectl->>TrainingSessionCRD: Create TrainingSession resource
    TrainingSessionCRD-->>kubectl: resource created
    
    Operator->>TrainingSessionCRD: Watch for new TrainingSession
    TrainingSessionCRD-->>Operator: TrainingSession event (Added)
    Operator->>Operator: Validate spec (baseModel, gpuResources)
    Operator->>Operator: Build training pod spec
    Operator->>PodController: Create Pod with training-service:local image
    PodController-->>Operator: Pod created
    
    Operator->>TrainingSessionCRD: Update status (phase: loading_model)
    TrainingSessionCRD-->>Operator: Status updated
    
    PodController->>TrainingPod: Launch container
    TrainingPod->>TrainingPod: Load model, initialize LoRA
    TrainingPod-->>PodController: Ready (healthcheck passes)
    
    Operator->>TrainingSessionCRD: Update status (phase: ready)
    Operator->>TrainingSessionCRD: Set endpoint URL
    TrainingSessionCRD-->>Operator: Status updated
    
    Admin->>kubectl: Query TrainingSession status
    kubectl->>TrainingSessionCRD: Get resource
    TrainingSessionCRD-->>kubectl: Return phase=ready, endpoint
    kubectl-->>Admin: Training session ready

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • epappas
  • itzlambda

Poem

🐰 A training SDK hops into place,
With clients and models in full embrace,
LoRA fine-tunes with a quantum leap,
Checkpoints saved, gradients deep,
From K8s to FastAPI's grace! ✨

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

  • Title check — ⚠️ Warning: The title 'sft mvp' is vague and non-descriptive, using abbreviations without context; it does not clearly communicate the scope or purpose of the substantial changes across training infrastructure, SDKs, examples, and Kubernetes configurations. Resolution: replace it with a descriptive title such as 'Add Basilica Training SDK and end-to-end training infrastructure' or 'Introduce LoRA training service with Python SDK and Kubernetes deployment'.
  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 70.88%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (1 passed)
  • Description Check — ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 15

🧹 Nitpick comments (32)
examples/11_agentgym.py (1)

265-265: Remove unnecessary f-string prefix.

This line doesn't contain any placeholders, so the f prefix is unnecessary.

🔎 Apply this diff:
-    print(f"TTL: 1 hour (auto-cleanup)")
+    print("TTL: 1 hour (auto-cleanup)")
justfile (2)

962-970: Consider using pushd/popd for directory navigation.

The script uses cd to navigate into the SDK directory and cd ../.. to return. While this works correctly with set -euo pipefail, using pushd/popd would make directory changes more robust and self-documenting.

🔎 View suggested refactor:
     # Install Python SDK in editable mode
     echo "Installing Python SDK..."
     if command -v uv &> /dev/null; then
         uv pip install -e crates/basilica-sdk-python
     else
         # Use maturin directly for standard venv
-        cd crates/basilica-sdk-python
+        pushd crates/basilica-sdk-python > /dev/null
         maturin develop --release
-        cd ../..
+        popd > /dev/null
     fi

972-976: Apply the same pushd/popd pattern for stub generation.

The stub generation step uses the same cd pattern. Consider applying the same pushd/popd refactor here for consistency.

🔎 View suggested refactor:
     # Generate type stubs
     echo "Generating type stubs..."
-    cd crates/basilica-sdk-python
+    pushd crates/basilica-sdk-python > /dev/null
     cargo run --bin stub_gen --features stub-gen
-    cd ../..
+    popd > /dev/null
crates/basilica-sdk-python/python/basilica/training/__init__.py (2)

177-181: Remove or utilize the unused loss_fn parameter.

The loss_fn parameter is documented but never used in the method implementation or sent to the API endpoint. This could confuse users who expect different loss functions to work.

Consider either removing this parameter until the feature is implemented, or documenting it as reserved for future use.

🔎 Option 1: Remove the unused parameter:
     def forward_backward(
         self,
         data: List[Datum],
-        loss_fn: str = "cross_entropy",
     ) -> ForwardBackwardResult:
         """
         Compute forward pass and gradients.
 
         Args:
             data: List of training examples
-            loss_fn: Loss function (currently only "cross_entropy")
 
         Returns:
             ForwardBackwardResult with loss and logprobs
         """

589-606: Consider sorting __all__, though current grouping is also maintainable.

Static analysis suggests sorting the __all__ list alphabetically. However, the current organization by category (client, config types, data types, exceptions) is logical and aids maintainability. Either approach is acceptable.

🔎 If you prefer alphabetical sorting:
 # Export all public symbols
 __all__ = [
-    # Client
-    "TrainingClient",
-    "TrainingSession",
-    # Config types
-    "LoraConfig",
-    "OptimizerConfig",
     "CheckpointStorage",
-    "GpuResources",
-    # Data types
     "Datum",
     "ForwardBackwardResult",
-    "Sample",
-    # Exceptions
+    "GpuResources",
+    "LoraConfig",
+    "OptimizerConfig",
+    "Sample",
+    "SessionNotFoundError",
+    "TrainingClient",
     "TrainingError",
-    "SessionNotFoundError",
+    "TrainingSession",
     "TrainingServiceError",
 ]
examples/training_example.py (1)

112-135: Consider renaming the unused loop variable.

The loop variable step is not used within the loop body. Python convention is to use _step or _ for intentionally unused variables.

🔎 Apply this diff:
-    for step in range(num_steps):
+    for _step in range(num_steps):
         # Forward-backward pass (accumulate gradients)
         total_loss = 0.0
scripts/local-training-e2e.sh (1)

427-427: Consider using single quotes in trap command (optional nitpick).

While the current code works correctly (since $PF_PID is expanded at trap setup time), shell best practice is to use single quotes for trap commands to make the intention explicit.

🔎 Apply this diff:
-    trap "kill $PF_PID 2>/dev/null" EXIT
+    trap 'kill '"$PF_PID"' 2>/dev/null' EXIT
crates/basilica-api/src/api/middleware/scope.rs (1)

180-184: Add test coverage for the new training session scope mappings.

The scope mappings are correctly implemented and follow the existing patterns. However, the test suite in this file (lines 196-406) does not include test cases for the new training session endpoints.

🔎 Add test cases for training session routes

Add the following test cases to the test_required_scope_mapping function:

// Test training session endpoints
let req = Request::builder()
    .method(Method::POST)
    .uri("/sessions")
    .body(Body::empty())
    .unwrap();
assert_eq!(get_required_scope(&req), Some("training:create".to_string()));

let req = Request::builder()
    .method(Method::GET)
    .uri("/sessions")
    .body(Body::empty())
    .unwrap();
assert_eq!(get_required_scope(&req), Some("training:list".to_string()));

let req = Request::builder()
    .method(Method::GET)
    .uri("/sessions/ts-123456")
    .body(Body::empty())
    .unwrap();
assert_eq!(get_required_scope(&req), Some("training:view".to_string()));

let req = Request::builder()
    .method(Method::DELETE)
    .uri("/sessions/ts-123456")
    .body(Body::empty())
    .unwrap();
assert_eq!(get_required_scope(&req), Some("training:delete".to_string()));
services/training-service/Dockerfile (2)

15-21: Consider production-optimized installation and exclude test files.

The current setup uses editable mode installation (-e ".") and includes test files in the production image:

  • Line 17: Editable install is intended for development, not production deployment
  • Line 21: Including tests/ directory adds unnecessary bloat to the production image
🔎 Apply this diff for production optimization:
 # Install Python dependencies
 COPY pyproject.toml .
 RUN pip install --no-cache-dir --upgrade pip && \
-    pip install --no-cache-dir -e "."
+    pip install --no-cache-dir .
 
 # Copy source code
 COPY src/ src/
-COPY tests/ tests/

If you need tests in the image for integration testing, consider using a multi-stage build with separate dev and prod stages.


6-9: Consider removing git if not required at runtime.

Git is installed as a system dependency but may only be needed during development or for certain ML libraries that clone repos. If the training service doesn't require git at runtime, removing it reduces the attack surface.

If git is only needed for development or specific edge cases, consider documenting why it's required or removing it to minimize the image size and security footprint.

docker-compose.training.yml (1)

33-37: Consider making GPU count configurable.

The GPU count is hardcoded to 1. For flexibility across different development environments, consider using an environment variable.

🔎 Example approach:
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: ${GPU_COUNT:-1}  # Default to 1 if not set
          capabilities: [gpu]

Then developers can override with:

GPU_COUNT=2 docker-compose -f docker-compose.training.yml up
services/training-service/README.md (2)

128-128: Unused import in SDK example.

The Datum import is shown but never used in the example code. Either demonstrate its usage or remove it to avoid confusion for users.

-from basilica.training import TrainingClient, Datum
+from basilica.training import TrainingClient

169-187: Add language specifier to architecture diagram code block.

The fenced code block for the architecture diagram is missing a language specifier. Use text or plaintext to satisfy linters and improve rendering consistency.

-```
+```text
 ┌─────────────────────────────────────────┐
orchestrator/k8s/training/local-dev/operator-deployment.yaml (2)

12-32: Consider tightening RBAC for production variants.

Using wildcard verbs ("*") grants full permissions on these resources. While acceptable for local development, ensure production manifests follow least-privilege principles by specifying explicit verbs (get, list, watch, create, update, patch, delete) as needed.


63-79: Add securityContext to harden the container.

Even for local development, adding a securityContext improves security posture and catches permission issues early.

🔎 Suggested addition:
       containers:
         - name: operator
           image: basilica-operator:local
           imagePullPolicy: Never
+          securityContext:
+            allowPrivilegeEscalation: false
+            runAsNonRoot: true
+            readOnlyRootFilesystem: true
+            capabilities:
+              drop:
+                - ALL
           env:
services/training-service/src/backend.py (4)

93-100: Consider handling explicit CPU device with bfloat16.

When device="cpu" is passed explicitly (not as a fallback), dtype remains bfloat16, which has limited CPU support. The current logic only converts dtype when falling back from CUDA.

🔎 Suggested fix:
-        if device == "cuda" and not torch.cuda.is_available():
-            # On Mac, use CPU to avoid MPS bfloat16 issues
-            self.device = "cpu"
-            self.dtype = torch.float32  # bfloat16 not well supported on CPU
-        else:
-            self.device = device
-            self.dtype = dtype
+        if device == "cuda" and not torch.cuda.is_available():
+            # On Mac, use CPU to avoid MPS bfloat16 issues
+            self.device = "cpu"
+            self.dtype = torch.float32  # bfloat16 not well supported on CPU
+        elif device == "cpu" and dtype == torch.bfloat16:
+            # bfloat16 has limited CPU support
+            self.device = "cpu"
+            self.dtype = torch.float32
+        else:
+            self.device = device
+            self.dtype = dtype

117-155: Security consideration: trust_remote_code=True executes arbitrary code.

While often necessary for newer models, this setting allows execution of untrusted code from model repositories. Consider making this configurable or documenting the security implications.

🔎 Suggested approach:
 def __init__(
     self,
     model_cache_dir: str = "/models",
     checkpoint_dir: str = "/checkpoints",
     device: str = "cuda",
     dtype: torch.dtype = torch.bfloat16,
+    trust_remote_code: bool = True,
 ):
+    self.trust_remote_code = trust_remote_code

Then use self.trust_remote_code in _load_base_model.


176-178: Incomplete seeding for full reproducibility.

torch.manual_seed only sets the CPU seed. For reproducible GPU training, CUDA seeds should also be set.

🔎 Apply this diff:
         if seed is not None:
             torch.manual_seed(seed)
+            if torch.cuda.is_available():
+                torch.cuda.manual_seed_all(seed)

476-480: Consider explicit resource cleanup in delete_session.

Deleting from the sessions dict doesn't immediately free GPU memory. For long-running services, consider explicit cleanup.

🔎 Suggested enhancement:
     def delete_session(self, session_id: str) -> None:
         """Delete a training session."""
         if session_id in self.sessions:
+            # Explicitly delete model to help free GPU memory
+            session = self.sessions[session_id]
+            del session.model
+            del session.optimizer
             del self.sessions[session_id]
+            # Suggest garbage collection for GPU memory
+            if torch.cuda.is_available():
+                torch.cuda.empty_cache()
             logger.info("session_deleted", session_id=session_id)
orchestrator/k8s/training/local-dev/gateway.yaml (1)

25-39: ReferenceGrant is namespace-specific to u-testuser.

The ReferenceGrant only allows HTTPRoutes from the u-testuser namespace. For a more flexible local-dev setup, consider using a wildcard or label selector if Gateway API supports it, or document that additional ReferenceGrants are needed for other user namespaces.

crates/basilica-operator/src/controllers/training_session_controller.rs (5)

239-248: String-based error checking is fragile.

Checking e.to_string().contains("409") is fragile. Consider using proper error type matching if the error type supports it.


250-251: External endpoint is hardcoded.

The external endpoint https://api.basilica.ai/sessions/{}/ is hardcoded. Consider making this configurable via environment variable or controller configuration for different environments.

🔎 Suggested approach:
// At the top of the file, add:
fn get_external_base_url() -> String {
    std::env::var("EXTERNAL_API_BASE_URL")
        .unwrap_or_else(|_| "https://api.basilica.ai".to_string())
}

// Then in handle_initializing:
let external_endpoint = format!("{}/sessions/{}/", get_external_base_url(), name);

277-293: MVP placeholder: health check not implemented.

As noted in the comments, the actual health check (GET /health) is not implemented. Consider adding a TODO or tracking issue.

Do you want me to open an issue to track implementing the health check endpoint verification?


331-333: Pod deletion error is silently ignored.

When TTL expires, pod deletion errors are discarded with let _ =. Consider logging the error even if proceeding with termination.

🔎 Suggested fix:
-                    let _ = self.client.delete_pod(namespace, &pod_name).await;
+                    if let Err(e) = self.client.delete_pod(namespace, &pod_name).await {
+                        warn!(error = %e, pod = %pod_name, "failed to delete pod on TTL expiry");
+                    }

367-392: Resource defaults are hardcoded; consider making configurable.

CPU/memory limits (16/64Gi for GPU, 2/4Gi for CPU-only) are hardcoded. For flexibility across different cluster configurations, these could be made configurable via the TrainingSessionSpec or environment variables.

services/training-service/tests/test_backend.py (2)

73-76: Token IDs may exceed model vocabulary size.

torch.randint(0, 1000, ...) generates token IDs up to 999, but facebook/opt-125m has a vocabulary size of ~50k, so this is safe. However, for robustness, consider using the tokenizer's vocab size.

🔎 More robust approach:
# Get tokenizer vocab size for safer random tokens
tokenizer = backend.tokenizers.get(small_model)
if tokenizer:
    vocab_size = len(tokenizer)
    input_ids = torch.randint(0, vocab_size, (1, 32))
else:
    input_ids = torch.randint(0, 1000, (1, 32))

165-166: Assertion len(result.text) >= 0 is always true.

This assertion doesn't test anything meaningful since string length is always non-negative.

🔎 Consider a more meaningful assertion:
-        assert len(result.text) >= 0  # May be empty for some models
-        assert len(result.token_ids) > 0
+        # Verify generation produced output (may be empty text due to special tokens)
+        assert isinstance(result.text, str)
+        assert len(result.token_ids) > 0
+        assert result.finish_reason in ("stop", "length")
crates/basilica-operator/src/crd/training_session.rs (1)

211-213: Consider using a pinned image tag instead of :latest.

The default image uses the :latest tag, which can lead to inconsistent behavior across deployments and make rollbacks difficult. Consider pinning to a specific version tag or using a semantic version.

🔎 Suggested improvement:
 fn default_image() -> String {
-    "basilica/training:latest".into()
+    "basilica/training:v0.1.0".into()  // Use a specific version
 }
crates/basilica-api/src/api/routes/training.rs (2)

274-281: UUID truncation reduces uniqueness guarantees.

Taking only the first segment of a UUID (8 hex characters) significantly reduces the UUID's uniqueness. This provides only 2^32 (~4 billion) possible values instead of the full 2^128, increasing collision risk in high-throughput scenarios.

Consider using the full UUID or a larger portion:

🔎 Suggested improvement:
     let session_id = format!(
         "ts-{}",
-        uuid::Uuid::new_v4()
-            .to_string()
-            .split('-')
-            .next()
-            .unwrap()
+        uuid::Uuid::new_v4().to_string().replace('-', "")  // Full UUID without hyphens
     );

Or keep it short but use more characters:

     let session_id = format!(
         "ts-{}",
-        uuid::Uuid::new_v4()
-            .to_string()
-            .split('-')
-            .next()
-            .unwrap()
+        &uuid::Uuid::new_v4().simple().to_string()[..16]  // First 16 hex chars (64 bits)
     );

334-334: Hardcoded API domain should be configurable.

The endpoint domain "api.basilica.ai" is hardcoded, which makes it inflexible for different environments (dev, staging, production) or on-premise deployments.

Consider making this configurable via environment variable or AppState:

🔎 Suggested improvement:
+    let api_domain = std::env::var("API_DOMAIN").unwrap_or_else(|_| "api.basilica.ai".to_string());
     let endpoint = format!(
-        "https://api.basilica.ai/sessions/{}/",
+        "https://{}/sessions/{}/",
+        api_domain,
         session_id
     );
services/training-service/src/server.py (2)

190-194: Simplify target_modules handling.

The logic to handle target_modules is more complex than necessary. You can rely on the backend's LoraConfiguration defaults instead.

🔎 Simplified approach:
         lora_kwargs = {}
         if request.lora_config:
             lora_kwargs = request.lora_config.model_dump(exclude_none=True)
-            if "target_modules" not in lora_kwargs or lora_kwargs["target_modules"] is None:
-                lora_kwargs.pop("target_modules", None)

The exclude_none=True already handles None values, and the backend's LoraConfiguration will apply defaults for missing fields.


289-312: Inconsistent error handling compared to create_session.

The sample endpoint only catches ValueError, while create_session also catches generic exceptions and logs them. For consistency and better observability, add generic exception handling here as well.

🔎 Apply this fix:
     except ValueError as e:
         raise HTTPException(status_code=status.HTTP_404_NOT_FOUND, detail=str(e))
+    except Exception as e:
+        logger.exception("sample_failed", error=str(e))
+        raise HTTPException(status_code=status.HTTP_500_INTERNAL_SERVER_ERROR, detail=str(e))

Apply similar fixes to save_state and load_state endpoints.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3118197 and 11a5072.

📒 Files selected for processing (34)
  • crates/basilica-api/src/api/middleware/scope.rs (1 hunks)
  • crates/basilica-api/src/api/mod.rs (1 hunks)
  • crates/basilica-api/src/api/routes/mod.rs (1 hunks)
  • crates/basilica-api/src/api/routes/training.rs (1 hunks)
  • crates/basilica-api/src/config/mod.rs (2 hunks)
  • crates/basilica-api/src/k8s/trait.rs (1 hunks)
  • crates/basilica-api/src/server.rs (3 hunks)
  • crates/basilica-operator/src/controllers/mod.rs (1 hunks)
  • crates/basilica-operator/src/controllers/training_session_controller.rs (1 hunks)
  • crates/basilica-operator/src/crd/mod.rs (1 hunks)
  • crates/basilica-operator/src/crd/training_session.rs (1 hunks)
  • crates/basilica-operator/src/k8s_client.rs (6 hunks)
  • crates/basilica-operator/src/runtime.rs (6 hunks)
  • crates/basilica-sdk-python/python/basilica/training/__init__.py (1 hunks)
  • docker-compose.training.yml (1 hunks)
  • examples/11_agentgym.py (2 hunks)
  • examples/training_example.py (1 hunks)
  • justfile (1 hunks)
  • orchestrator/k8s/training/local-dev/gateway.yaml (1 hunks)
  • orchestrator/k8s/training/local-dev/namespace.yaml (1 hunks)
  • orchestrator/k8s/training/local-dev/operator-deployment.yaml (1 hunks)
  • orchestrator/k8s/training/training-service-rbac.yaml (1 hunks)
  • orchestrator/k8s/training/training-session-crd.yaml (1 hunks)
  • orchestrator/k8s/training/training-session-example.yaml (1 hunks)
  • scripts/api/compose.dev.yml (1 hunks)
  • scripts/local-training-e2e.sh (1 hunks)
  • services/training-service/Dockerfile (1 hunks)
  • services/training-service/README.md (1 hunks)
  • services/training-service/pyproject.toml (1 hunks)
  • services/training-service/src/__init__.py (1 hunks)
  • services/training-service/src/backend.py (1 hunks)
  • services/training-service/src/server.py (1 hunks)
  • services/training-service/tests/__init__.py (1 hunks)
  • services/training-service/tests/test_backend.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (11)
crates/basilica-api/src/config/mod.rs (1)
crates/basilica-api/src/server.rs (1)
  • new (559-1165)
crates/basilica-api/src/api/mod.rs (1)
crates/basilica-api/src/api/routes/training.rs (4)
  • create_session (255-343)
  • list_sessions (478-582)
  • get_session (346-425)
  • delete_session (431-475)
crates/basilica-operator/src/k8s_client.rs (1)
crates/basilica-operator/src/crd/training_session.rs (2)
  • default (34-41)
  • default (80-86)
crates/basilica-api/src/server.rs (1)
crates/basilica-validator/src/api/client.rs (1)
  • new (24-34)
examples/11_agentgym.py (1)
crates/basilica-sdk-python/python/basilica/deployment.py (1)
  • url (183-191)
examples/training_example.py (2)
crates/basilica-sdk-python/python/basilica/training/__init__.py (9)
  • Datum (84-89)
  • TrainingError (111-114)
  • create_session (446-534)
  • base_model (168-170)
  • session_id (163-165)
  • forward_backward (177-233)
  • optim_step (235-253)
  • save_state (303-332)
  • sample (255-301)
services/training-service/src/backend.py (5)
  • create_session (163-224)
  • forward_backward (226-292)
  • optim_step (294-317)
  • save_state (383-432)
  • sample (319-381)
crates/basilica-operator/src/controllers/training_session_controller.rs (1)
crates/basilica-operator/src/crd/training_session.rs (5)
  • new (316-322)
  • new (374-391)
  • default (34-41)
  • default (80-86)
  • default (145-151)
crates/basilica-api/src/api/routes/training.rs (2)
crates/basilica-operator/src/crd/training_session.rs (10)
  • default_rank (44-46)
  • default_alpha (47-49)
  • default_dropout (50-52)
  • default (34-41)
  • default (80-86)
  • default (145-151)
  • default_learning_rate (89-91)
  • default_weight_decay (92-94)
  • new (316-322)
  • new (374-391)
crates/basilica-api/src/k8s/trait.rs (1)
  • kube_client (7-7)
scripts/local-training-e2e.sh (3)
crates/basilica-api/src/server.rs (1)
  • run (1188-1226)
crates/basilica-operator/src/runtime.rs (1)
  • run (162-361)
orchestrator/cluster-manager/src/clustermgr/commands/deployments.py (1)
  • deployments (103-213)
services/training-service/tests/test_backend.py (1)
services/training-service/src/backend.py (12)
  • LoraConfiguration (16-32)
  • OptimizerConfiguration (36-44)
  • TrainingBackend (80-511)
  • create_session (163-224)
  • forward_backward (226-292)
  • optim_step (294-317)
  • sample (319-381)
  • save_state (383-432)
  • load_state (434-462)
  • get_session_status (464-474)
  • delete_session (476-480)
  • list_sessions (482-484)
services/training-service/src/server.py (1)
services/training-service/src/backend.py (14)
  • ForwardBackwardResult (62-67)
  • LoraConfiguration (16-32)
  • OptimizerConfiguration (36-44)
  • SampleResult (71-77)
  • TrainingBackend (80-511)
  • list_sessions (482-484)
  • create_session (163-224)
  • get_session_status (464-474)
  • delete_session (476-480)
  • forward_backward (226-292)
  • optim_step (294-317)
  • sample (319-381)
  • save_state (383-432)
  • load_state (434-462)
🪛 Checkov (3.2.334)
orchestrator/k8s/training/local-dev/operator-deployment.yaml

[medium] 8-33: Minimize wildcard use in Roles and ClusterRoles

(CKV_K8S_49)


[medium] 47-79: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[medium] 47-79: Minimize the admission of root containers

(CKV_K8S_23)

🪛 markdownlint-cli2 (0.18.1)
services/training-service/README.md

169-169: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Ruff (0.14.8)
examples/11_agentgym.py

265-265: f-string without any placeholders

Remove extraneous f prefix

(F541)

examples/training_example.py

57-57: Do not catch blind exception: Exception

(BLE001)


112-112: Loop control variable step not used within loop body

Rename unused step to _step

(B007)

crates/basilica-sdk-python/python/basilica/training/__init__.py

180-180: Unused method argument: loss_fn

(ARG002)


224-224: Avoid specifying long messages outside the exception class

(TRY003)


226-226: Avoid specifying long messages outside the exception class

(TRY003)


247-247: Avoid specifying long messages outside the exception class

(TRY003)


249-249: Avoid specifying long messages outside the exception class

(TRY003)


291-291: Avoid specifying long messages outside the exception class

(TRY003)


293-293: Avoid specifying long messages outside the exception class

(TRY003)


327-327: Avoid specifying long messages outside the exception class

(TRY003)


329-329: Avoid specifying long messages outside the exception class

(TRY003)


355-355: Avoid specifying long messages outside the exception class

(TRY003)


357-357: Avoid specifying long messages outside the exception class

(TRY003)


371-371: Avoid specifying long messages outside the exception class

(TRY003)


373-373: Avoid specifying long messages outside the exception class

(TRY003)


384-384: Avoid specifying long messages outside the exception class

(TRY003)


436-438: Avoid specifying long messages outside the exception class

(TRY003)


527-527: Avoid specifying long messages outside the exception class

(TRY003)


552-552: Avoid specifying long messages outside the exception class

(TRY003)


554-554: Avoid specifying long messages outside the exception class

(TRY003)


573-573: Avoid specifying long messages outside the exception class

(TRY003)


589-606: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

🪛 Shellcheck (0.11.0)
scripts/local-training-e2e.sh

[warning] 372-372: i appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 427-427: Use single quotes, otherwise this expands now rather than when signalled.

(SC2064)

🔇 Additional comments (82)
examples/11_agentgym.py (5)

1-27: Excellent documentation structure!

The module docstring clearly describes the evaluation flow, lists available task types for both suites, and provides usage examples. This makes the example very accessible to users.


39-76: Clean implementation with flexible API key handling!

The function properly handles optional API keys by conditionally populating environment variables. The deployment configuration with resource limits (500m CPU, 1Gi memory, 180s timeout) is appropriate for evaluation workloads.


84-125: Consistent pattern with good parameterization!

The function follows the same structure as deploy_affine_evaluator, ensuring code consistency. The env_name parameter elegantly controls both the deployment name and Docker image tag.


133-214: Helpful usage examples with good UX!

The usage printing functions provide clear, actionable curl examples. The fallback to "YOUR_CHUTES_API_KEY" when the API key is not provided ensures the examples remain useful in all scenarios.


217-270: Well-structured main function with good user guidance!

The main function provides clear warnings when the API key is missing, parses CLI arguments to select the appropriate evaluator, and prints comprehensive usage information. The overall flow is intuitive and user-friendly.

justfile (1)

937-947: Excellent improvements to the Python SDK workflow!

The additions enhance the development experience significantly:

  • Strict bash error handling with set -euo pipefail
  • Smart venv strategy supporting both modern uv and traditional python3 -m venv
  • Conditional dependency installation (maturin only for standard venv)
  • Clear, helpful output messages guiding developers

These changes align well with modern Python development practices and make the SDK setup more reliable.

Also applies to: 957-959, 978-984

crates/basilica-sdk-python/python/basilica/training/__init__.py (5)

1-40: LGTM! Well-documented module setup.

The module docstring provides clear usage examples and authentication guidance. The imports are appropriate for the SDK functionality.


43-109: LGTM! Well-structured data models.

The dataclass definitions are clean with appropriate type hints and default values. The use of default_factory for mutable defaults (lists) is correct.


111-127: LGTM! Clean exception hierarchy.

The exception classes provide a clear hierarchy for error handling in the SDK.


235-384: LGTM! Consistent method implementations.

The remaining methods follow a consistent pattern with proper error handling. The close method correctly handles 404 errors for idempotency, which is a good practice for cleanup operations.


387-586: LGTM! Well-designed client with proper authentication and lifecycle management.

The TrainingClient class is well-structured with:

  • Proper API key validation and error handling using response.is_success property to check for 2xx responses
  • Flexible configuration via environment variables or parameters
  • Context manager support for resource cleanup
  • Comprehensive error handling including specific 404 detection for missing sessions

The 300-second timeout is an appropriate explicit configuration for training operations, where requests may involve long-running tasks, compared to httpx's default 5-second timeout for network inactivity. All httpx API usage is correct.

services/training-service/src/__init__.py (1)

1-3: LGTM!

Clean package initialization following Python conventions. The version string is consistent with the SDK versioning pattern.

crates/basilica-api/src/k8s/trait.rs (1)

129-131: LGTM!

The comments clearly document the division of responsibilities between the API and operator for TrainingSession CRD operations, which will help future maintainers understand the architecture.

crates/basilica-api/src/config/mod.rs (2)

49-56: LGTM!

The dev_mode flag is well-designed for local development workflows. The documentation clearly indicates production requirements, and the default value (false) ensures production-safe behavior out of the box.


66-67: LGTM!

The default initialization correctly sets dev_mode to false and provides an empty string for validator_hotkey, which aligns with the validation logic in server.rs that requires the hotkey in production mode.

examples/training_example.py (1)

39-59: LGTM!

The client initialization includes proper environment variable validation and user-friendly error messages. While the exception handler is broad, this is acceptable for example code where the goal is to provide clear guidance to users.

scripts/local-training-e2e.sh (7)

1-38: LGTM!

The script header is well-structured with clear documentation, proper error handling (set -euo pipefail), and a clean logging abstraction using color-coded functions.


40-65: LGTM!

Comprehensive prerequisite checks with helpful error messages. The automatic k3d installation improves the developer experience.


288-409: LGTM!

The E2E test orchestration is well-structured with proper error handling, health checks, and clear logging. The API token handling gracefully falls back when authentication is unavailable.


411-542: Training steps implementation is solid.

The function properly manages port-forwarding with cleanup, performs training operations, and validates results at each step. Error handling is comprehensive.


544-591: LGTM!

The status and logging functions provide comprehensive visibility into the cluster state and component logs, essential for debugging E2E tests.


613-662: LGTM!

Excellent documentation that clearly explains the architecture, usage, and what each test step does. This will significantly improve the developer experience.


664-701: LGTM!

Clean command dispatch with proper error handling for unknown commands.

crates/basilica-api/src/api/routes/mod.rs (1)

14-14: LGTM!

The training module export follows the established pattern and properly exposes the new Training Session API routes.

crates/basilica-operator/src/crd/mod.rs (1)

5-5: LGTM!

The training_session module export follows the established pattern and properly exposes the TrainingSession CRD types for use by the operator.

crates/basilica-operator/src/controllers/mod.rs (1)

6-6: LGTM!

The training_session_controller module export follows the established pattern and properly integrates the TrainingSession controller into the operator.

orchestrator/k8s/training/local-dev/namespace.yaml (1)

1-18: LGTM!

The namespace declarations are well-structured and follow Kubernetes best practices. The user namespace labeling (basilica.ai/user: testuser) is a good pattern for identifying and filtering user-specific resources.

services/training-service/tests/__init__.py (1)

1-1: LGTM!

Standard Python package initializer with appropriate docstring.

crates/basilica-api/src/api/mod.rs (1)

105-115: LGTM!

The training session routes are properly integrated into the protected routes with appropriate handler bindings. The comment clarifying that training operations are routed directly by Envoy Gateway is helpful architectural context.

orchestrator/k8s/training/training-service-rbac.yaml (1)

1-72: LGTM! RBAC configuration follows least-privilege principles.

The RBAC configuration is well-structured:

  • Appropriate read-only access to Secrets, ConfigMaps, and Nodes
  • Full lifecycle management for TrainingSession CRDs, Pods, and Services (required for operator functionality)
  • Event creation for observability
  • Finalizer access for proper cleanup

The permissions align with the controller's responsibilities for managing training session workloads.

docker-compose.training.yml (1)

1-80: LGTM! Well-configured Docker Compose for local development.

The configuration is well-structured with:

  • Appropriate health checks with sufficient start period for model loading (120s)
  • Proper service dependencies
  • GPU resource allocation
  • Volume persistence for models and checkpoints
  • Restart policies for resilience

The debug logging and SQLite database are appropriate for local development.

crates/basilica-operator/src/runtime.rs (3)

46-50: LGTM!

The TrainingSessionCtx struct follows the established pattern used by other controller contexts (JobCtx, RentalCtx, UserDeploymentCtx).


135-160: LGTM!

The reconcile and error policy functions follow the same pattern as existing controllers (reconcile_user_deployment, error_policy_user_deployment), with appropriate structured logging including namespace and session name.


333-359: LGTM!

The training sessions controller stream is correctly wired into the operator's main loop with consistent error handling and properly included in the futures::join! macro.

orchestrator/k8s/training/training-session-example.yaml (1)

1-94: LGTM!

These example manifests are well-documented and demonstrate both a comprehensive configuration (8B model with all optional fields) and a minimal configuration (70B model with defaults). The inline comments explaining each field are helpful for users.

crates/basilica-api/src/server.rs (3)

564-585: LGTM!

The dev mode implementation correctly creates a stub validator client pointing to a non-routable address (localhost:1) that will fail fast if accidentally invoked. The warning message clearly communicates limitations to developers.


902-981: LGTM!

Appropriately gating the validator health check and rental health check tasks behind dev_mode since these require a functional validator client.


1097-1123: Gate credit exhaustion check task behind !config.bittensor.dev_mode.

The credit exhaustion check task uses a validator client for stopping community rentals, but it's only gated by billing client presence—not by dev_mode. In dev_mode, the validator client is a stub pointing to a non-existent endpoint, so stop_community_rental_internal calls fail (the errors are logged but otherwise swallowed). While this won't crash the application, community rentals won't actually be stopped in dev mode.

The health check task (line 903) already uses the !config.bittensor.dev_mode pattern and should be followed here for consistency.

orchestrator/k8s/training/training-session-crd.yaml (1)

1-183: LGTM overall!

The CRD is well-designed with:

  • Clear required fields and sensible defaults
  • Proper validation constraints (min/max on numeric fields)
  • Comprehensive status tracking with appropriate phases
  • Useful printer columns for kubectl get output
  • Short names for convenience (ts, tsession)
services/training-service/src/backend.py (10)

1-13: LGTM - Imports and logger setup are appropriate.

Standard imports for the training backend with HuggingFace, PEFT, and PyTorch libraries.


15-33: LGTM - LoRA configuration with sensible defaults.

The alpha/rank ratio of 2 and the target modules covering standard projection layers align with common LoRA practices.


35-45: LGTM - Optimizer configuration with standard AdamW defaults.


47-59: LGTM - SessionState dataclass properly tracks training session state.


61-78: LGTM - Result dataclasses provide clear return structures.


226-292: LGTM - Forward-backward implementation is correct.

The weighted loss computation properly handles the shift for next-token prediction, and allowing multiple forward_backward calls before optim_step enables gradient accumulation.
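For reference, the shift means the logits at position t score the token at position t+1. A minimal pure-Python sketch of the weighted loss (a hypothetical helper standing in for the torch implementation):

```python
import math

def weighted_next_token_loss(logits, labels, weights):
    """Weighted cross-entropy under the next-token shift.

    logits: per-position logit rows (length T); labels: token ids (length T);
    weights: per-target weights (length T-1). logits[t] predicts labels[t+1].
    """
    total, denom = 0.0, 0.0
    for t in range(len(labels) - 1):
        row = logits[t]
        log_z = math.log(sum(math.exp(x) for x in row))   # log partition function
        log_p = row[labels[t + 1]] - log_z                # log-softmax at the gold token
        total += -log_p * weights[t]
        denom += weights[t]
    return total / denom

logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
loss = weighted_next_token_loss(logits, labels=[0, 1, 2], weights=[1.0, 1.0])
assert loss > 0
```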


294-317: LGTM - Optimizer step implementation is correct.

The gradient clipping, optimizer step, and gradient zeroing sequence is properly implemented.
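That sequence can be sketched in plain Python (a hypothetical SGD-style update standing in for AdamW, with a global-norm clip assumed for grad_clip):

```python
import math

def clip_grad_norm(grads, max_norm):
    """Scale grads in place so their global L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        for i in range(len(grads)):
            grads[i] *= scale
    return norm

def optim_step(params, grads, lr=1e-4, max_norm=1.0):
    clip_grad_norm(grads, max_norm)    # 1. clip accumulated gradients
    for i in range(len(params)):
        params[i] -= lr * grads[i]     # 2. apply the update
    for i in range(len(grads)):
        grads[i] = 0.0                 # 3. zero grads for the next accumulation
    return params
```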


319-381: LGTM - Text generation implementation handles edge cases well.

The temperature=0 (greedy decoding), top_k=0 (no filtering), and finish_reason detection are all correctly implemented.
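Those edge cases reduce to the following torch-free sketch of the sampling semantics (an illustration, not the service's generate call):

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=0, rng=random):
    """temperature=0 -> greedy argmax; top_k=0 -> no filtering."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    if top_k > 0:  # keep only the top_k largest logits
        kth = sorted(logits, reverse=True)[top_k - 1]
        logits = [x if x >= kth else float("-inf") for x in logits]
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    probs = [math.exp(x - m) for x in scaled]  # unnormalized, numerically stable
    z = sum(probs)
    r, acc = rng.random() * z, 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1
```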


383-432: LGTM - Checkpoint save implementation is comprehensive.

Correctly saves adapter weights and optionally the full training state including optimizer, step count, and configurations.


492-511: LGTM - Helper methods are correctly implemented.

_get_session provides proper validation, and _compute_logprobs correctly handles the shift alignment for next-token prediction.
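The alignment convention can be stated concisely (a pure-Python illustration, assuming logits[t] scores token t+1, so a length-T sequence yields T-1 log-probs):

```python
import math

def sequence_logprobs(logits, token_ids):
    """Per-token log-probs under the next-token shift."""
    out = []
    for t in range(len(token_ids) - 1):
        row = logits[t]
        log_z = math.log(sum(math.exp(x) for x in row))
        out.append(row[token_ids[t + 1]] - log_z)
    return out  # length len(token_ids) - 1

lp = sequence_logprobs([[0.0, 1.0], [1.0, 0.0]], [0, 1, 0])
assert len(lp) == 2 and all(v < 0 for v in lp)
```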

orchestrator/k8s/training/local-dev/gateway.yaml (1)

1-24: LGTM - Gateway configuration is appropriate for local development.

The GatewayClass and Gateway resources are correctly configured for Envoy Gateway with HTTP listener on port 8080.

crates/basilica-operator/src/k8s_client.rs (6)

23-24: LGTM - Import added for TrainingSession CRD types.


77-91: LGTM - Trait methods follow established patterns.

The new TrainingSession CRUD methods are consistent with existing CRD operations in the trait.


183-183: LGTM - Storage map added for TrainingSession CRDs in mock client.


363-408: LGTM - MockK8sClient TrainingSession implementations follow existing patterns.

The implementations are consistent with other CRD operations (BasilicaJob, GpuRental, etc.) in the mock client.


1038-1084: LGTM - KubeClient TrainingSession implementations are correct.

The real Kubernetes API implementations follow the established patterns with proper error handling for 409 conflicts and use of Server-Side Apply for status updates.


1664-1693: LGTM - RateLimitedKubeClient properly delegates with rate limiting.

All four TrainingSession methods correctly await rate limiting before delegating to the inner client.

crates/basilica-operator/src/controllers/training_session_controller.rs (6)

1-28: LGTM - Imports and constants are appropriate.

The constants for gateway name, namespace, and service port are well-defined for Envoy Gateway integration.


165-211: LGTM - Scheduling handler correctly monitors pod phases.

The state machine logic properly handles Running, Pending, and Failed pod states, with appropriate fallback to Pending if pod is missing.


486-489: restart_policy: Never means failed pods won't auto-recover.

With Never, if the training container crashes, the pod enters Failed state permanently. This is intentional for training jobs (to preserve state/logs), but ensure the controller handles this appropriately.

The controller does handle Failed pods in handle_scheduling by setting an error status. Verify this is the desired behavior for production.


497-541: LGTM - Service spec is correctly configured.

The selector matches the pod label, and the service port correctly targets the container port.


543-599: LGTM - HTTPRoute construction is correct.

The route properly configures path matching, URL rewriting to strip the session prefix, and backend reference to the training service.


601-682: LGTM - Unit tests cover key resource builder functionality.

Tests verify pod and service construction with expected names, ports, and GPU resource configuration.

Consider adding tests for the controller's reconcile logic using MockK8sClient to verify state transitions.

services/training-service/tests/test_backend.py (7)

1-31: LGTM - Test fixtures are well-structured.

The fixtures properly set up the backend with temp directories and handle CUDA/CPU device selection. Using facebook/opt-125m is appropriate for fast testing.


34-62: LGTM - Session creation tests cover happy path and error case.


64-112: LGTM - Forward/backward tests verify core training functionality.

Tests appropriately verify loss computation and token counting for both standard and weighted loss scenarios.


114-147: LGTM - Optimizer step test verifies multi-step training workflow.


188-212: Test may not catch the optimizer parameter bug.

Due to the bug identified in load_state (optimizer references stale parameters after loading), this test might pass but actual training after load_state would fail. Consider adding a test that does training after loading to verify end-to-end correctness.

🔎 Suggested additional test:
def test_training_after_load_checkpoint(self, backend, session_id, small_model):
    """Test that training works correctly after loading a checkpoint."""
    backend.create_session(
        session_id=session_id,
        base_model=small_model,
        lora_config=LoraConfiguration(rank=8),
        optimizer_config=OptimizerConfiguration(),
    )

    input_ids = torch.randint(0, 1000, (1, 32))
    attention_mask = torch.ones_like(input_ids)
    labels = input_ids.clone()

    # Train and save
    backend.forward_backward(session_id, input_ids, attention_mask, labels)
    backend.optim_step(session_id)
    path = backend.save_state(session_id, "checkpoint-1")

    # Load and train more
    backend.load_state(session_id, path)
    result = backend.forward_backward(session_id, input_ids, attention_mask, labels)
    step = backend.optim_step(session_id)
    
    # Verify training continued correctly
    assert step == 2
    assert result.loss > 0

214-227: LGTM - Status test verifies configuration is correctly stored.


229-267: LGTM - Session management and error handling tests are comprehensive.

Tests cover deletion, listing multiple sessions, and proper error raising for nonexistent sessions.

crates/basilica-operator/src/crd/training_session.rs (6)

9-42: LGTM!

The LoraConfig structure is well-defined with sensible defaults (rank=32, alpha=64, dropout=0.05) and appropriate validation ranges. The target modules cover standard attention projections commonly used in transformer fine-tuning.


62-97: LGTM!

OptimizerConfig provides sensible defaults for fine-tuning (learning_rate=1e-4, weight_decay=0.01, grad_clip=1.0). The configuration is appropriate for LoRA training scenarios.


99-123: LGTM!

CheckpointStorage provides flexible configuration for multiple storage backends with appropriate optional fields for credentials, region, and custom endpoints.


125-156: LGTM!

GpuResources struct provides appropriate configuration for GPU requirements with sensible defaults (count=1) and validation ranges (1-8 GPUs, 8-256 GB memory).


221-264: LGTM!

TrainingSessionPhase enum provides a clear lifecycle progression with appropriate requeue intervals for each phase. The terminal state handling and string conversion methods are correctly implemented.


266-427: LGTM!

TrainingSessionStatus provides comprehensive state tracking with well-implemented builder methods. The automatic timestamp updates and phase transitions (e.g., setting Failed phase in with_error()) ensure consistent state management. The tests adequately cover the builder pattern and phase logic.

crates/basilica-api/src/api/routes/training.rs (2)

40-135: LGTM!

Request types are well-structured with defaults that match the CRD definitions. The use of serde(default) attributes ensures backward compatibility when fields are omitted.


427-475: LGTM!

The delete handler correctly uses typed error matching (ae.code == 404) and appropriately treats 404 as a non-error (idempotent delete). The implementation is robust and follows best practices.

services/training-service/src/server.py (5)

1-46: LGTM!

The initialization logic properly uses environment variables for configuration and correctly handles device selection. The lifespan context manager ensures proper backend lifecycle management.


82-161: LGTM!

Response models are well-structured with appropriate field types. The SessionStatusResponse correctly includes training-specific fields (lora_rank, learning_rate) that are internal to the service.


263-267: The logprobs wrapping is correct. The backend's ForwardBackwardResult returns logprobs: List[float] (a flat list for a single sequence), while ForwardBackwardResponse expects logprobs: List[List[float]] (batched format). Wrapping the result in a list converts it to the required API contract.


176-179: list_sessions() is a trivial in-memory operation and does not require blocking prevention.

The list_sessions() method simply returns list(self.sessions.keys()) — building a list of dictionary keys involves no I/O, heavy computation, or GPU operations. This will not block the event loop.

Likely an incorrect or invalid review comment.


61-70: The Python service uses Adam optimizer parameters not exposed in the CRD.

The backend internally uses beta1, beta2, and epsilon when configuring the AdamW optimizer, and the Python API accepts these parameters. However, the Rust CRD only exposes learning_rate, weight_decay, and grad_clip. This is intentional: the Python training service provides fine-grained control over Adam parameters, while the CRD presents a simpler interface for Kubernetes users. The field defaults match industry standards for Adam optimization and are properly validated.
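A minimal sketch of that split (hypothetical function name; the beta/epsilon values shown are the common AdamW defaults the comment refers to):

```python
def adamw_kwargs(learning_rate=1e-4, weight_decay=0.01,
                 beta1=0.9, beta2=0.999, epsilon=1e-8):
    """The CRD exposes only learning_rate/weight_decay (plus grad_clip);
    beta1/beta2/epsilon fall back to these service-side defaults."""
    return {
        "lr": learning_rate,
        "betas": (beta1, beta2),
        "eps": epsilon,
        "weight_decay": weight_decay,
    }
```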

Comment on lines +181 to +188
fn user_namespace(user_id: &str) -> String {
// Sanitize user_id for K8s namespace (lowercase, alphanumeric + hyphens)
let sanitized: String = user_id
.chars()
.map(|c| if c.is_alphanumeric() { c.to_ascii_lowercase() } else { '-' })
.collect();
format!("u-{}", sanitized)
}

⚠️ Potential issue | 🟠 Major

Potential namespace collision risk.

The sanitization logic could create collisions: different user IDs like "user_123" and "user-123" would both map to "u-user-123". Additionally, the function doesn't enforce Kubernetes namespace name length limits (max 63 characters).

Consider using a hash-based approach for deterministic, collision-free namespace names:

🔎 Suggested fix:
 fn user_namespace(user_id: &str) -> String {
-    // Sanitize user_id for K8s namespace (lowercase, alphanumeric + hyphens)
-    let sanitized: String = user_id
-        .chars()
-        .map(|c| if c.is_alphanumeric() { c.to_ascii_lowercase() } else { '-' })
-        .collect();
-    format!("u-{}", sanitized)
+    // Use hash to ensure unique, valid K8s namespace name
+    use std::collections::hash_map::DefaultHasher;
+    use std::hash::{Hash, Hasher};
+    
+    let mut hasher = DefaultHasher::new();
+    user_id.hash(&mut hasher);
+    let hash = hasher.finish();
+    
+    format!("u-{:x}", hash)  // Produces names like "u-9f86d081884c7d65"
 }
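The collision is easy to demonstrate with a direct Python port of the sanitization rule (illustrative only; Rust's is_alphanumeric and Python's isalnum differ on some Unicode classes):

```python
def user_namespace(user_id: str) -> str:
    """Port of the Rust rule: lowercase alphanumerics, everything else -> '-'."""
    sanitized = "".join(c.lower() if c.isalnum() else "-" for c in user_id)
    return f"u-{sanitized}"

# Distinct user IDs collapse onto the same namespace:
assert user_namespace("user_123") == user_namespace("user-123") == "u-user-123"
```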

},
"ttlSeconds": req.ttl_seconds,
"seed": req.seed,
"enableBilling": true

⚠️ Potential issue | 🟡 Minor

Hardcoded enableBilling prevents user override.

The enableBilling field is always set to true, ignoring any potential user preference. Consider adding it as an optional field in CreateSessionRequest.

🔎 Suggested fix:

Add the field to CreateSessionRequest:

 pub struct CreateSessionRequest {
     pub base_model: String,
     pub checkpoint_storage: CheckpointStorageRequest,
     // ... other fields ...
+    #[serde(default = "default_enable_billing")]
+    pub enable_billing: bool,
 }
+
+fn default_enable_billing() -> bool {
+    true
+}

Then use it in build_training_session_crd:

-            "enableBilling": true
+            "enableBilling": req.enable_billing

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In crates/basilica-api/src/api/routes/training.rs around line 244, the JSON
hardcodes "enableBilling": true which prevents users from overriding billing;
add an optional field (Option<bool>) named enableBilling to the
CreateSessionRequest struct, update request deserialization/validation as
needed, and change build_training_session_crd to read request.enableBilling and
set the CRD field to that value (use false or a sensible default when None)
instead of the hardcoded true.

Comment on lines +284 to +293
if let Err(e) = k8s_client.create_namespace(&namespace).await {
// Ignore "already exists" errors
if !e.to_string().contains("409") && !e.to_string().contains("AlreadyExists") {
error!(error = %e, namespace = %namespace, "Failed to create namespace");
apimetrics::record_request("training.create", "POST", start, false);
return Err(ApiError::Internal {
message: format!("Failed to create namespace: {}", e),
});
}
}

🛠️ Refactor suggestion | 🟠 Major

Fragile error detection using string matching.

Checking for "409" or "AlreadyExists" in the error string is brittle and may miss actual errors or incorrectly ignore failures if the error message format changes.

🔎 Apply this fix to use typed error matching:
     if let Err(e) = k8s_client.create_namespace(&namespace).await {
-        // Ignore "already exists" errors
-        if !e.to_string().contains("409") && !e.to_string().contains("AlreadyExists") {
+        // Ignore "already exists" errors using typed matching
+        if !matches!(e, kube::Error::Api(ref ae) if ae.code == 409) {
             error!(error = %e, namespace = %namespace, "Failed to create namespace");
             apimetrics::record_request("training.create", "POST", start, false);
             return Err(ApiError::Internal {
                 message: format!("Failed to create namespace: {}", e),
             });
         }
     }

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In crates/basilica-api/src/api/routes/training.rs around lines 284 to 293,
replace the fragile string-based checks against e.to_string() with typed error
matching: match the error (likely kube::Error) and specifically handle
kube::Error::Api(ae) where ae.code == 409 or ae.reason == "AlreadyExists" as the
non-fatal case; for all other error variants log the error with context, record
the metric, and return the ApiError::Internal as before. Ensure you import the
kube types if needed and keep the existing logging/metrics/return behavior for
real failures.

POSTGRES_DB: basilica_api
ports:
- "5432:5432"
- "5433:5432" # Use 5433 to avoid conflict with local Postgres

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for the compose.dev.yml file and check its content
echo "=== Checking compose.dev.yml ===" 
if [ -f "scripts/api/compose.dev.yml" ]; then
  head -20 scripts/api/compose.dev.yml
  echo ""
  sed -n '8,12p' scripts/api/compose.dev.yml
else
  echo "File not found at expected path"
fi

echo ""
echo "=== Searching for localhost:5432 references in codebase ===" 
rg -n 'localhost:5432|127\.0\.0\.1:5432' --type-list 2>/dev/null | head -20 || echo "No direct references found or rg not available"

echo ""
echo "=== Searching in .env and documentation files ===" 
find . -type f \( -name ".env*" -o -name "README*" -o -name "*.md" \) -exec grep -l "5432" {} \; 2>/dev/null | head -10

Repository: one-covenant/basilica

Length of output: 1092


🏁 Script executed:

#!/bin/bash
# Check the documentation files that mention 5432 for context
echo "=== Checking docs/validator.md for postgres references ==="
grep -n -C 3 "5432" ./docs/validator.md 2>/dev/null || echo "No matches found"

echo ""
echo "=== Checking orchestrator/README.md for postgres references ==="
grep -n -C 3 "5432" ./orchestrator/README.md 2>/dev/null || echo "No matches found"

echo ""
echo "=== Checking for any DATABASE_URL or connection string patterns ==="
rg -n "DATABASE_URL|postgres.*localhost" --type md --type yaml 2>/dev/null | head -20 || echo "No matches found"

echo ""
echo "=== Look for local development or setup instructions ==="
find . -type f -name "*.md" -exec grep -l "docker-compose\|docker compose\|local.*setup\|development.*setup" {} \; 2>/dev/null | grep -i dev | head -5

Repository: one-covenant/basilica

Length of output: 3472


Update docs/validator.md to reflect the new PostgreSQL port for local development.

The port change in scripts/api/compose.dev.yml is correct with a clear inline comment. However, docs/validator.md contains example connection strings (lines 933 and 2743) showing localhost:5432 which would now be incorrect. These examples should be updated to use localhost:5433 if they apply to the API service running with the new compose setup. Note that internal Docker references (e.g., postgres:5432 in other compose files) are unaffected and correct.

🤖 Prompt for AI Agents
In docs/validator.md (lines ~933 and ~2743) update the example connection
strings that reference localhost:5432 to localhost:5433 so they match the API
local dev compose mapping (scripts/api/compose.dev.yml line 10) — leave any
internal Docker host references (e.g., postgres:5432) unchanged; verify both
examples still parse correctly and run a quick search to ensure no other
localhost:5432 examples remain that should be localhost:5433.

Comment on lines +1 to +39
# Use NVIDIA CUDA image for GPU support, or Python slim for CPU-only
ARG BASE_IMAGE=python:3.11-slim
FROM ${BASE_IMAGE}

# Install system dependencies
RUN apt-get update && apt-get install -y \
curl \
git \
&& rm -rf /var/lib/apt/lists/*

# Set up working directory
WORKDIR /app

# Install Python dependencies
COPY pyproject.toml .
RUN pip install --no-cache-dir --upgrade pip && \
pip install --no-cache-dir -e "."

# Copy source code
COPY src/ src/
COPY tests/ tests/

# Create directories for models and checkpoints
RUN mkdir -p /models /checkpoints

# Environment variables
ENV PYTHONUNBUFFERED=1
ENV MODEL_CACHE_DIR=/models
ENV CHECKPOINT_DIR=/checkpoints

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=120s --retries=3 \
CMD curl -f http://localhost:8000/health || exit 1

# Expose port
EXPOSE 8000

# Run service
CMD ["python", "-m", "uvicorn", "src.server:app", "--host", "0.0.0.0", "--port", "8000"]

⚠️ Potential issue | 🟠 Major

Add non-root user for security.

The container runs as root by default, which violates the principle of least privilege and creates security risks. Container processes should run as non-root users.

🔎 Apply this diff to add a non-root user:
 # Install system dependencies
 RUN apt-get update && apt-get install -y \
     curl \
     git \
     && rm -rf /var/lib/apt/lists/*
 
+# Create non-root user
+RUN useradd -m -u 1000 -s /bin/bash appuser && \
+    mkdir -p /models /checkpoints && \
+    chown -R appuser:appuser /app /models /checkpoints
+
 # Set up working directory
 WORKDIR /app
 
 # Install Python dependencies
 COPY pyproject.toml .
 RUN pip install --no-cache-dir --upgrade pip && \
     pip install --no-cache-dir -e "."
 
 # Copy source code
 COPY src/ src/
 COPY tests/ tests/
 
-# Create directories for models and checkpoints
-RUN mkdir -p /models /checkpoints
-
 # Environment variables
 ENV PYTHONUNBUFFERED=1
 ENV MODEL_CACHE_DIR=/models
 ENV CHECKPOINT_DIR=/checkpoints
 
+# Switch to non-root user
+USER appuser
+
 # Health check
 HEALTHCHECK --interval=30s --timeout=10s --start-period=120s --retries=3 \
     CMD curl -f http://localhost:8000/health || exit 1

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +6 to +19
dependencies = [
"torch>=2.1.0",
"transformers>=4.36.0",
"peft>=0.7.0",
"accelerate>=0.25.0",
"safetensors>=0.4.0",
"tokenizers>=0.15.0",
"boto3>=1.34.0",
"fastapi>=0.109.0",
"uvicorn>=0.27.0",
"pydantic>=2.5.0",
"structlog>=24.1.0",
"httpx>=0.27.0",
]

🛠️ Refactor suggestion | 🟠 Major

Pin dependency versions to avoid unexpected breaking changes.

All dependencies use >= constraints without upper bounds, which can lead to:

  • Unexpected breaking changes when major versions are released
  • Inconsistent behavior across deployments
  • Difficult debugging when dependencies change behavior
🔎 Recommended approach for dependency management:

Consider one of these strategies:

Option 1: Pin exact versions in pyproject.toml (strictest)

dependencies = [
    "torch==2.1.0",
    "transformers==4.36.0",
    # ...
]

Option 2: Use compatible version ranges (recommended)

dependencies = [
    "torch>=2.1.0,<3.0.0",
    "transformers>=4.36.0,<5.0.0",
    # ...
]

Option 3: Use a requirements.txt or poetry.lock for pinning

Additionally, maintain a requirements-lock.txt or use Poetry with poetry.lock to ensure reproducible builds while keeping pyproject.toml flexible for development.

This is especially critical for ML dependencies like torch and transformers where API changes are common.

🤖 Prompt for AI Agents
In services/training-service/pyproject.toml around lines 6 to 19 the dependency
constraints are open-ended (using >=) which can introduce breaking changes;
update each dependency to either a pinned exact version (e.g., ==) or,
preferably, a compatible upper-bounded range (e.g., >=x.y.z,<next-major) for all
listed packages (torch, transformers, peft, accelerate, safetensors, tokenizers,
boto3, fastapi, uvicorn, pydantic, structlog, httpx), and ensure you commit a
lockfile (poetry.lock or a generated requirements-lock.txt) so builds are
reproducible; for ML libs like torch/transformers prefer tight upper bounds
(<3.0.0, <5.0.0 respectively) to avoid major-breaking upgrades.

Comment on lines +199 to +201
# Wrap with LoRA
model = get_peft_model(base, peft_config)
model.print_trainable_parameters()

⚠️ Potential issue | 🟡 Minor

get_peft_model modifies base model; print_trainable_parameters uses stdout.

  1. get_peft_model can modify the base model in place, which may cause issues if the base model is cached and shared across sessions.
  2. print_trainable_parameters() outputs to stdout rather than the structured logger.
🔎 Suggested fixes:
-        # Wrap with LoRA
-        model = get_peft_model(base, peft_config)
-        model.print_trainable_parameters()
+        # Wrap with LoRA (using copy to avoid modifying cached base model)
+        import copy
+        base_copy = copy.deepcopy(base) if model_name in self.base_models else base
+        model = get_peft_model(base_copy, peft_config)
+        
+        # Log trainable parameters
+        trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
+        total_params = sum(p.numel() for p in model.parameters())
+        logger.info(
+            "trainable_parameters",
+            trainable=trainable_params,
+            total=total_params,
+            percentage=100 * trainable_params / total_params,
+        )

Committable suggestion skipped: line range outside the PR's diff.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 10

♻️ Duplicate comments (2)
crates/basilica-operator/src/controllers/training_session_controller.rs (2)

203-212: Status update condition still missing endpoint and start_time checks.

The condition now includes tokens_processed, but changes to endpoint or start_time still won't trigger status updates. This can cause the CRD status to become stale when these fields change during phase transitions.

Consider deriving PartialEq on TrainingSessionStatus and comparing directly, or extend the condition to include these fields.
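The derive-and-compare approach can be sketched as follows (the struct and field names are illustrative stand-ins, not the actual TrainingSessionStatus definition):

```rust
// Illustrative stand-in; the real TrainingSessionStatus has more fields
// and lives in the operator crate.
#[derive(Clone, Debug, PartialEq, Default)]
pub struct SessionStatus {
    pub phase: String,
    pub tokens_processed: u64,
    pub endpoint: Option<String>,
    pub start_time: Option<String>,
}

// With PartialEq derived, one comparison covers every field, so fields like
// endpoint and start_time can never be silently left out of the condition.
pub fn needs_status_update(current: &SessionStatus, desired: &SessionStatus) -> bool {
    current != desired
}
```

The trade-off is that a whole-struct comparison also triggers updates on fields that may intentionally be ignored today, so it should be adopted deliberately rather than as a drop-in.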


239-251: Using is_err() conflates "not found" with other errors.

This issue was flagged in a previous review. The code treats any error from get_pod as "pod doesn't exist" and attempts creation. Network failures, permission errors, or API server issues would be masked and could lead to spurious creation attempts.

🔎 Suggested fix using proper error discrimination:
-        // Check if pod already exists
-        let existing_pod = self.client.get_pod(namespace, &pod_name).await;
-        if existing_pod.is_err() {
+        // Check if pod already exists
+        let pod_exists = match self.client.get_pod(namespace, &pod_name).await {
+            Ok(_) => true,
+            Err(e) if e.to_string().contains("NotFound") || e.to_string().contains("not found") => false,
+            Err(e) => {
+                error!(error = %e, "failed to check pod existence");
+                return Ok(status.with_error(format!("Failed to check pod: {}", e)));
+            }
+        };
+        if !pod_exists {
             // Create the pod
🧹 Nitpick comments (9)
.gitignore (1)

102-102: Remove redundant .venv pattern.

Both .venv/ (line 102) and .venv (line 137) are now in the file. In gitignore syntax, .venv/ already matches the .venv directory, so carrying both patterns is redundant.

Keep only one. Since Python virtual environments are always directories, .venv/ is sufficient and more explicit about intent.

🔎 Apply this diff to consolidate on the bare pattern at line 102 (and then drop the duplicate at line 137):
  # Runtime State
  *.pid
  *.lock
  tmp/
  temp/
- .venv/
+ .venv

  # Binaries

Or keep the explicit directory pattern and remove the new one:

- .venv

Also applies to: 137-137

scripts/local-training-e2e.sh (5)

58-62: Consider verifying the k3d install script before execution.

Piping curl directly to bash can execute arbitrary code from a remote URL without verification. For a local dev script this is often acceptable, but consider downloading first, inspecting, then executing, or using a package manager.

🔎 Safer alternative:
     # Install k3d if needed
     if ! command -v k3d &> /dev/null; then
         log_info "Installing k3d..."
-        curl -s https://raw.githubusercontent.com/k3d-io/k3d/main/install.sh | bash
+        # Option 1: Use package manager
+        # brew install k3d
+        # Option 2: Download and inspect before running
+        local install_script="/tmp/k3d-install.sh"
+        curl -sL https://raw.githubusercontent.com/k3d-io/k3d/main/install.sh -o "$install_script"
+        log_info "Downloaded install script to $install_script - review if needed"
+        bash "$install_script"
+        rm -f "$install_script"
     fi

343-380: Consider extracting the curl logic to reduce duplication.

The authenticated and unauthenticated curl calls are nearly identical. This could be simplified with a helper function or by conditionally building the headers.

🔎 Optional refactor:
# Build curl args
CURL_ARGS=(-s -X POST "$API_URL/sessions" -H "Content-Type: application/json")
if [ -n "$AUTH_HEADER" ]; then
    CURL_ARGS+=(-H "$AUTH_HEADER")
fi

RESPONSE=$(curl "${CURL_ARGS[@]}" -d '{
    "baseModel": "facebook/opt-125m",
    ...
}')

401-409: Unused loop variable i.

The loop counter i is never used inside the loop. Use _ to indicate an intentionally unused variable, which also silences shellcheck warning SC2034.

🔎 Apply this diff:
-    for i in {1..60}; do
+    for _ in {1..60}; do

449-456: Consider using single quotes for the trap command.

The current trap with double quotes works correctly here because $PF_PID is expanded when the trap is set, capturing the actual PID. However, using single quotes is more idiomatic and makes the intent clearer while silencing shellcheck warning SC2064.

🔎 Apply this diff:
     kubectl port-forward -n "$NAMESPACE" "pod/$POD_NAME" 8001:8000 &
     PF_PID=$!
     sleep 3

     # Cleanup on exit
-    trap "kill $PF_PID 2>/dev/null" EXIT
+    trap 'kill $PF_PID 2>/dev/null' EXIT

500-502: Arbitrary sleep for model loading.

The sleep 10 is a fixed wait that may be insufficient for slower systems or unnecessary for faster ones. Consider polling the session status endpoint to confirm the model is loaded, similar to how pod readiness is checked elsewhere.
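The fixed wait could be replaced with a small polling helper along these lines (a sketch; the health endpoint path and port in the usage comment are assumptions about the local setup, not taken from the script):

```shell
# Hypothetical helper: retry a command once per second until it succeeds
# or the timeout (in seconds) elapses.
wait_until() {
    local timeout="$1"
    shift
    local elapsed=0
    until "$@"; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
}

# Usage sketch: poll an assumed health endpoint instead of `sleep 10`.
# wait_until 60 curl -sf http://localhost:8001/health >/dev/null
```

This keeps fast systems fast while giving slow systems the full timeout, matching how pod readiness is already handled elsewhere in the script.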

crates/basilica-operator/src/controllers/training_session_controller.rs (3)

267-279: Same is_err() issue for network policy check.

While the network policy failure is non-fatal (as noted in the comment), distinguishing error types would provide better observability. Consider logging differently for "not found" vs other errors.


361-370: Fragile string-based error detection for conflict.

Checking e.to_string().contains("409") or contains("AlreadyExists") is brittle and may break if error message format changes. Consider using typed error variants if the K8sClient trait supports them.


808-889: Good unit test coverage for resource builders.

The tests validate the core resource construction logic. Consider adding tests for:

  • Error cases (missing UID once that validation is added)
  • CPU-only mode (gpu_count = 0)
  • HTTPRoute construction
  • NetworkPolicy construction
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 11a5072 and 40e914d.

📒 Files selected for processing (4)
  • .gitignore (1 hunks)
  • crates/basilica-operator/src/controllers/training_session_controller.rs (1 hunks)
  • scripts/local-training-e2e.sh (1 hunks)
  • services/training-service/README.md (1 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.18.1)
services/training-service/README.md

93-93: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


244-244: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Shellcheck (0.11.0)
scripts/local-training-e2e.sh

[warning] 401-401: i appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 456-456: Use single quotes, otherwise this expands now rather than when signalled.

(SC2064)

🔇 Additional comments (14)
services/training-service/README.md (1)

1-305: Comprehensive and well-organized documentation.

The README effectively documents the Training Service with clear sections covering prerequisites, quick start, API endpoints, configuration, and troubleshooting. The architecture diagram, example commands, and E2E testing guidance provide excellent user onboarding. The documentation properly aligns with the TrainingSession CRD workflow and operator-driven resource provisioning introduced in this PR.

scripts/local-training-e2e.sh (12)

1-38: LGTM!

Good script setup with set -euo pipefail, clear documentation, and useful logging helpers with color-coded output.


67-96: LGTM!

Well-structured cluster setup with proper idempotency check, reasonable port mappings, and appropriate wait for node readiness.


98-127: LGTM with minor note.

Good fallback logic for helm vs kubectl installation. The sleep 10 on line 121 before the kubectl wait is somewhat arbitrary but acceptable for a local dev script since the actual readiness is verified by the subsequent wait command.


129-150: LGTM!

Clean image build and load workflow for local development.


152-183: LGTM!

Well-organized deployment functions with proper sequencing and readiness checks.


185-211: LGTM!

Good defensive handling for stale Postgres volumes with the user existence check before proceeding.


213-286: LGTM!

Database reset, API key generation, and API runner functions are well-organized for the local dev workflow.


288-312: LGTM!

Good cleanup function with proper wait for pod termination.


509-548: LGTM!

Well-structured training loop with proper error handling and informative logging for each step.


573-620: LGTM!

Status and log viewing functions provide good observability into the cluster state.


622-640: LGTM!

Comprehensive cleanup that handles Postgres, cluster deletion, and generated artifacts.


642-734: LGTM!

Excellent help documentation with clear quick-start guide, and well-organized main command dispatch.

crates/basilica-operator/src/controllers/training_session_controller.rs (1)

399-415: LGTM - MVP placeholder acknowledged.

The comment clearly documents that health checks are deferred. The immediate transition to Ready is acceptable for MVP.

Comment on lines +253 to +265
// Check if service already exists
let existing_svc = self.client.get_service(namespace, &svc_name).await;
if existing_svc.is_err() {
// Create the service
let svc = build_training_service(session, namespace, name)?;
match self.client.create_service(namespace, &svc).await {
Ok(_) => info!(service = %svc_name, "created training service"),
Err(e) => {
error!(error = %e, "failed to create training service");
return Ok(status.with_error(format!("Failed to create service: {}", e)));
}
}
}

⚠️ Potential issue | 🟡 Minor

Same is_err() issue for service existence check.

Apply the same fix pattern as recommended for the pod check to properly distinguish "not found" from other errors.
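The fix pattern the comment refers to, sketched with a placeholder error type (the real client error is presumably kube::Error or a wrapper around it, so the variant names here are assumptions):

```rust
// Illustrative error type standing in for whatever the K8s client returns.
#[derive(Debug)]
pub enum ClientError {
    NotFound,
    Api(u16),
    Other(String),
}

// Only a "not found" result means it is safe to create the resource;
// any other failure is surfaced instead of being masked.
pub fn should_create(check: Result<(), ClientError>) -> Result<bool, ClientError> {
    match check {
        Ok(()) => Ok(false),                    // resource already exists
        Err(ClientError::NotFound) => Ok(true), // safe to create
        Err(ClientError::Api(404)) => Ok(true), // HTTP 404 from the API server
        Err(e) => Err(e),                       // network/auth/etc: real failure
    }
}
```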

Comment on lines +314 to +321
"Failed" => {
let message = pod
.status
.as_ref()
.and_then(|s| s.message.clone())
.unwrap_or_else(|| "Pod failed".into());
Ok(status.with_error(message))
}

⚠️ Potential issue | 🟠 Major

Pod failure doesn't transition session to Failed phase.

When the pod enters "Failed" state, only an error message is set via with_error(), but the session phase remains "Scheduling". This could cause the reconciler to keep checking a failed pod indefinitely.

🔎 Consider transitioning to Failed phase:
                     "Failed" => {
                         let message = pod
                             .status
                             .as_ref()
                             .and_then(|s| s.message.clone())
                             .unwrap_or_else(|| "Pod failed".into());
-                        Ok(status.with_error(message))
+                        Ok(status
+                            .with_phase(TrainingSessionPhase::Failed)
+                            .with_error(message))
                     }
🤖 Prompt for AI Agents
In crates/basilica-operator/src/controllers/training_session_controller.rs
around lines 314 to 321, when matching a Pod with "Failed" status the code only
calls with_error(...) and does not update the session phase; change the status
update to set the session phase to Failed as well (e.g., call the method that
sets phase to TrainingSessionPhase::Failed or use
status.with_phase(...).with_error(message)) so the session transitions out of
Scheduling and the reconciler stops repeatedly processing the failed pod.

Comment on lines +389 to +394
Err(e) => {
warn!(error = %e, "service not found");
// Recreate service
let svc = build_training_service(session, namespace, name)?;
self.client.create_service(namespace, &svc).await?;
Ok(status)

⚠️ Potential issue | 🟡 Minor

Inconsistent error propagation.

Here, create_service errors propagate via ?, returning an Err. In handle_pending (lines 258-264), service creation errors return Ok(status.with_error(...)). This inconsistency could cause different reconcile behaviors for similar failures.

Consider using consistent error handling strategy across all phase handlers.

🤖 Prompt for AI Agents
crates/basilica-operator/src/controllers/training_session_controller.rs around
lines 389 to 394: the handler currently propagates create_service errors with
`?` while other handlers (e.g., handle_pending at lines 258-264) convert similar
failures into an Ok(status.with_error(...)) result; make the behavior consistent
by catching the error from self.client.create_service(...) instead of using `?`
and return Ok(status.with_error(format!("create_service failed: {}", e))) (or
alternatively change the other handler to propagate — pick one consistent
strategy across all phase handlers), ensuring the status is updated with the
error and the function returns Ok(status) rather than bubbling the error.

Comment on lines +453 to +455
// Delete the pod
let _ = self.client.delete_pod(namespace, &pod_name).await;
return Ok(status.with_phase(TrainingSessionPhase::Terminated));

⚠️ Potential issue | 🟡 Minor

Silently ignoring pod deletion result.

The let _ = discards any error from delete_pod, which could mask cleanup failures. At minimum, log the outcome for observability.

🔎 Suggested fix:
-                    let _ = self.client.delete_pod(namespace, &pod_name).await;
+                    if let Err(e) = self.client.delete_pod(namespace, &pod_name).await {
+                        warn!(error = %e, pod = %pod_name, "failed to delete pod on TTL expiry");
+                    }
🤖 Prompt for AI Agents
In crates/basilica-operator/src/controllers/training_session_controller.rs
around lines 453-455, replace the ignored call `let _ =
self.client.delete_pod(...).await;` with code that captures the Result from
delete_pod, matches on it, and logs the outcome (include namespace and pod_name)
— log success at debug/info and log failures with error details at warn/error so
deletion failures are observable; keep returning
Ok(status.with_phase(TrainingSessionPhase::Terminated)) afterwards (or
optionally propagate the error if you want to treat deletion failure as fatal).

Comment on lines +586 to +593
let owner_ref = OwnerReference {
api_version: "basilica.ai/v1".to_string(),
kind: "TrainingSession".to_string(),
name: name.to_string(),
uid: session.metadata.uid.clone().unwrap_or_default(),
controller: Some(true),
block_owner_deletion: Some(true),
};

⚠️ Potential issue | 🟠 Major

Owner reference UID defaults to empty string if missing.

If session.metadata.uid is None, the owner reference will have an empty UID. Kubernetes requires a valid UID for owner references to work correctly for garbage collection. This should likely return an error instead.

🔎 Suggested fix:
+    let uid = session.metadata.uid.clone()
+        .ok_or_else(|| anyhow::anyhow!("TrainingSession missing UID"))?;
+
     let owner_ref = OwnerReference {
         api_version: "basilica.ai/v1".to_string(),
         kind: "TrainingSession".to_string(),
         name: name.to_string(),
-        uid: session.metadata.uid.clone().unwrap_or_default(),
+        uid,
         controller: Some(true),
         block_owner_deletion: Some(true),
     };
🤖 Prompt for AI Agents
In crates/basilica-operator/src/controllers/training_session_controller.rs
around lines 586 to 593, the OwnerReference is being constructed using
session.metadata.uid.clone().unwrap_or_default() which will create an empty UID
when the session UID is missing; Kubernetes requires a valid UID so instead
check for the UID and return an error if absent (e.g., use
session.metadata.uid.clone().ok_or_else(||anyhow!("missing metadata.uid for
TrainingSession {name}"))?), propagate the error by ensuring the enclosing
function returns a Result, and only build the OwnerReference when a non-empty
UID is present.

Comment on lines +646 to +653
let owner_ref = OwnerReference {
api_version: "basilica.ai/v1".to_string(),
kind: "TrainingSession".to_string(),
name: name.to_string(),
uid: session.metadata.uid.clone().unwrap_or_default(),
controller: Some(true),
block_owner_deletion: Some(true),
};

⚠️ Potential issue | 🟠 Major

Same empty UID issue in owner reference.

Same concern as build_training_pod - the owner reference will have an empty UID if the session metadata UID is missing. Apply the same fix pattern.

🤖 Prompt for AI Agents
In crates/basilica-operator/src/controllers/training_session_controller.rs
around lines 646 to 653, the OwnerReference is being built with
session.metadata.uid.clone().unwrap_or_default(), which can produce an empty
UID; mirror the fix used in build_training_pod by retrieving the UID via
session.metadata.uid.clone().ok_or_else(|| anyhow::anyhow!("missing session
UID"))? and propagate/return the error if missing, then use the guaranteed UID
when constructing the OwnerReference instead of unwrapping to a default empty
string.

Comment on lines +692 to +731
let route_json = serde_json::json!({
"apiVersion": "gateway.networking.k8s.io/v1",
"kind": "HTTPRoute",
"metadata": {
"name": route_name,
"namespace": namespace,
"labels": {
"app": "basilica-training",
"session": session_name
}
},
"spec": {
"parentRefs": [{
"name": DEFAULT_GATEWAY_NAME,
"namespace": DEFAULT_GATEWAY_NAMESPACE
}],
"rules": [{
"matches": [{
"path": {
"type": "PathPrefix",
"value": path_prefix
}
}],
"filters": [{
"type": "URLRewrite",
"urlRewrite": {
"path": {
"type": "ReplacePrefixMatch",
"replacePrefixMatch": "/"
}
}
}],
"backendRefs": [{
"name": backend_service,
"namespace": namespace,
"port": port
}]
}]
}
});

⚠️ Potential issue | 🟠 Major

HTTPRoute missing owner reference for garbage collection.

Unlike the Pod and Service, this HTTPRoute doesn't have an ownerReferences field. When the TrainingSession is deleted, this route will be orphaned and must be manually cleaned up.

🔎 Add owner reference to metadata:

The function would need the session parameter to access the UID:

 fn build_training_http_route(
+    session: &TrainingSession,
     session_name: &str,
     namespace: &str,
     backend_service: &str,
     port: u16,
 ) -> Result<DynamicObject> {
     let route_name = format!("training-route-{}", session_name);
     let path_prefix = format!("/sessions/{}/", session_name);
+    let uid = session.metadata.uid.clone()
+        .ok_or_else(|| anyhow::anyhow!("TrainingSession missing UID"))?;

     let route_json = serde_json::json!({
         // ...
         "metadata": {
             "name": route_name,
             "namespace": namespace,
             "labels": {
                 "app": "basilica-training",
                 "session": session_name
-            }
+            },
+            "ownerReferences": [{
+                "apiVersion": "basilica.ai/v1",
+                "kind": "TrainingSession",
+                "name": session_name,
+                "uid": uid,
+                "controller": true,
+                "blockOwnerDeletion": true
+            }]
         },
         // ...
🤖 Prompt for AI Agents
crates/basilica-operator/src/controllers/training_session_controller.rs around
lines 692 to 731: the generated HTTPRoute JSON is missing
metadata.ownerReferences so the route will be orphaned when the TrainingSession
is deleted; update the function signature to accept the TrainingSession (or its
UID and controller metadata) and add an ownerReferences entry in metadata with
apiVersion, kind, name, uid from the session, controller: true and
blockOwnerDeletion: true so Kubernetes garbage collection removes the HTTPRoute
when the TrainingSession is deleted.

Comment on lines +755 to +762
let owner_ref = OwnerReference {
api_version: "basilica.ai/v1".into(),
kind: "TrainingSession".into(),
name: name.to_string(),
uid: session.metadata.uid.clone().unwrap_or_default(),
controller: Some(true),
block_owner_deletion: Some(true),
};

⚠️ Potential issue | 🟠 Major

Same empty UID issue in NetworkPolicy owner reference.

Apply the same fix pattern as recommended for the other resource builders.

🤖 Prompt for AI Agents
In crates/basilica-operator/src/controllers/training_session_controller.rs
around lines 755 to 762, the OwnerReference is being built with uid:
session.metadata.uid.clone().unwrap_or_default(), which injects an empty string
when UID is missing; change this to preserve the optional nature of the UID
(e.g., use session.metadata.uid.clone() or
session.metadata.uid.clone().filter(|s| !s.is_empty())) so the
OwnerReference.uid is None when there is no UID, mirroring the fix used in the
other resource builders.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (7)
services/training-service/README.md (2)

138-157: Add language specification to fenced code block.

Line 138 begins a code block without a language identifier, which fails the MD040 markdown lint rule.

🔎 Apply this diff to specify the language:
-```
+```bash
 [STEP] Running training steps via port-forward...

289-321: Add language specification to fenced code block.

Line 289 begins an ASCII architecture diagram code block without a language identifier, which fails the MD040 markdown lint rule.

🔎 Apply this diff to specify the language:
-```
+```text
 ┌─────────────────────────────────────────────────────────────┐
crates/basilica-api/src/api/routes/training.rs (4)

288-297: Fragile error detection using string matching.

String-based error checks (e.to_string().contains("409")) are brittle and may incorrectly classify errors if message formats change. The kube crate provides typed error variants that should be used instead.

🔎 Consider using typed error matching:
     if let Err(e) = k8s_client.create_namespace(&namespace).await {
-        // Ignore "already exists" errors
-        if !e.to_string().contains("409") && !e.to_string().contains("AlreadyExists") {
+        // Ignore "already exists" errors using typed matching
+        if !matches!(e, kube::Error::Api(ref ae) if ae.code == 409) {
             error!(error = %e, namespace = %namespace, "Failed to create namespace");
             apimetrics::record_request("training.create", "POST", start, false);
             return Err(ApiError::Internal {
                 message: format!("Failed to create namespace: {}", e),
             });
         }
     }

363-374: Fragile error detection using string matching.

Similar to the namespace creation handler, this uses string matching to detect 404 errors instead of typed error matching with kube::Error variants.


495-521: Confusing sentinel error pattern for empty list handling.

Using an error with message "empty" as a sentinel value (line 502) and then checking for it with string matching (line 513) is an anti-pattern that makes the code harder to understand and maintain.

🔎 Consider cleaner error handling:
     // List all TrainingSession CRDs in the user's namespace
-    let sessions = api
+    let sessions_result = api
         .list(&ListParams::default().labels(&format!("user={}", auth.user_id)))
-        .await
-        .map_err(|e| {
-            // If namespace doesn't exist, return empty list
-            if e.to_string().contains("404") || e.to_string().contains("NotFound") {
-                return ApiError::Internal {
-                    message: "empty".to_string(), // Sentinel for empty list
-                };
-            }
-            error!(error = %e, "Failed to list sessions");
-            ApiError::Internal {
-                message: format!("Failed to list sessions: {}", e),
-            }
-        });
+        .await;
 
-    let sessions = match sessions {
-        Ok(list) => list,
-        Err(e) if e.to_string().contains("empty") => {
+    let sessions = match sessions_result {
+        Ok(list) => list,
+        Err(kube::Error::Api(ae)) if ae.code == 404 => {
+            // Namespace doesn't exist, return empty list
             apimetrics::record_request("training.list", "GET", start, true);
             return Ok(Json(vec![]));
         }
         Err(e) => {
+            error!(error = %e, "Failed to list sessions");
             apimetrics::record_request("training.list", "GET", start, false);
-            return Err(e);
+            return Err(ApiError::Internal {
+                message: format!("Failed to list sessions: {}", e),
+            });
         }
     };

185-192: Namespace collision risk remains.

The sanitization could create collisions when different user IDs map to the same namespace (e.g., "user_123" and "user-123" both become "u-user-123"). This was previously flagged but hasn't been addressed.

crates/basilica-sdk-python/python/basilica/training/__init__.py (1)

128-203: Add validation for empty data list.

The forward_backward method accesses examples[0] at lines 180 and 185 without verifying the list is non-empty, and line 174 calls max() on the list which will raise ValueError if empty.

🔎 Apply this diff to add validation:
     ) -> float:
         """
         Compute forward pass and gradients.

         Args:
             data: List of training examples (Datum objects or dicts with
                   'input_ids' and 'labels' keys)
             loss_fn: Loss function to use. Options:
                 - "cross_entropy": Standard NLL loss (default)
                 - "importance_sampling": Policy gradient with importance weighting
                 - "ppo": Proximal Policy Optimization with clipping
                 - "dpo": Direct Preference Optimization

         Returns:
             Loss value

         Example:
             >>> # Standard supervised learning
             >>> loss = session.forward_backward([{"input_ids": [1, 2, 3]}])

             >>> # RL with importance sampling
             >>> loss = session.forward_backward(data, loss_fn="importance_sampling")
         """
+        if not data:
+            raise TrainingError("data list cannot be empty")
+        
         # Normalize input
         examples = []
         for d in data:

Based on the web search results, calling max() on an empty sequence raises ValueError: max() arg is an empty sequence.
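As a standalone illustration (plain Python, with `ValueError` standing in for the SDK's `TrainingError`, and `max_seq_len` a hypothetical stand-in for the padding-length computation), the guard must run before `max()` or `examples[0]` is ever reached:

```python
def max_seq_len(data: list) -> int:
    """Hypothetical sketch of the padding-length computation."""
    # Guard first: max() raises ValueError on an empty sequence, and
    # any examples[0] access would raise IndexError.
    if not data:
        raise ValueError("data list cannot be empty")
    return max(len(d["input_ids"]) for d in data)
```

With the guard in place the caller sees a clear, intentional error instead of a bare `ValueError` from `max()`.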

🧹 Nitpick comments (1)
crates/basilica-api/src/api/middleware/scope.rs (1)

180-231: Add test coverage for training route scope mappings.

The new training session scope mappings lack corresponding test cases in the test_required_scope_mapping function. This creates a gap in validation coverage.

🔎 Consider adding tests for the new training routes:

Add these test cases to test_required_scope_mapping (after line 417):

// Test training session endpoints
let req = Request::builder()
    .method(Method::POST)
    .uri("/sessions")
    .body(Body::empty())
    .unwrap();
assert_eq!(get_required_scope(&req), Some("training:create".to_string()));

let req = Request::builder()
    .method(Method::GET)
    .uri("/sessions")
    .body(Body::empty())
    .unwrap();
assert_eq!(get_required_scope(&req), Some("training:list".to_string()));

let req = Request::builder()
    .method(Method::GET)
    .uri("/sessions/ts-123")
    .body(Body::empty())
    .unwrap();
assert_eq!(get_required_scope(&req), Some("training:view".to_string()));

let req = Request::builder()
    .method(Method::DELETE)
    .uri("/sessions/ts-123")
    .body(Body::empty())
    .unwrap();
assert_eq!(get_required_scope(&req), Some("training:delete".to_string()));

let req = Request::builder()
    .method(Method::POST)
    .uri("/sessions/ts-123/internal")
    .body(Body::empty())
    .unwrap();
assert_eq!(get_required_scope(&req), Some("training:use".to_string()));

let req = Request::builder()
    .method(Method::POST)
    .uri("/sessions/ts-123/internal/train-123/forward_backward")
    .body(Body::empty())
    .unwrap();
assert_eq!(get_required_scope(&req), Some("training:use".to_string()));
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 40e914d and 0c1cf26.

📒 Files selected for processing (7)
  • crates/basilica-api/src/api/middleware/scope.rs (1 hunks)
  • crates/basilica-api/src/api/mod.rs (1 hunks)
  • crates/basilica-api/src/api/routes/training.rs (1 hunks)
  • crates/basilica-api/src/bin/gen_api_key.rs (1 hunks)
  • crates/basilica-sdk-python/python/basilica/training/__init__.py (1 hunks)
  • examples/training_example.py (1 hunks)
  • services/training-service/README.md (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • crates/basilica-api/src/api/mod.rs
🧰 Additional context used
🧬 Code graph analysis (2)
crates/basilica-sdk-python/python/basilica/training/__init__.py (3)
crates/basilica-api/src/api/routes/training.rs (5)
  • forward_backward (733-757)
  • optim_step (761-784)
  • sample (788-812)
  • get_session (350-429)
  • list_sessions (482-586)
services/training-service/src/backend.py (4)
  • forward_backward (226-292)
  • optim_step (294-317)
  • sample (319-381)
  • list_sessions (482-484)
services/training-service/src/server.py (5)
  • forward_backward (244-273)
  • optim_step (277-283)
  • sample (290-311)
  • get_session (221-227)
  • list_sessions (177-179)
crates/basilica-api/src/api/routes/training.rs (2)
crates/basilica-operator/src/crd/training_session.rs (6)
  • default_rank (44-46)
  • default_alpha (47-49)
  • default_dropout (50-52)
  • default (34-41)
  • default (80-86)
  • default (145-151)
crates/basilica-api/src/k8s/trait.rs (1)
  • kube_client (7-7)
🪛 markdownlint-cli2 (0.18.1)
services/training-service/README.md

38-38: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


138-138: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


289-289: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Ruff (0.14.8)
crates/basilica-sdk-python/python/basilica/training/__init__.py

201-201: Avoid specifying long messages outside the exception class

(TRY003)


215-215: Avoid specifying long messages outside the exception class

(TRY003)


253-253: Avoid specifying long messages outside the exception class

(TRY003)


273-273: Avoid specifying long messages outside the exception class

(TRY003)


290-290: Avoid specifying long messages outside the exception class

(TRY003)


297-297: Avoid specifying long messages outside the exception class

(TRY003)


305-305: Avoid specifying long messages outside the exception class

(TRY003)


346-346: Avoid specifying long messages outside the exception class

(TRY003)


414-414: Avoid specifying long messages outside the exception class

(TRY003)


423-423: Avoid specifying long messages outside the exception class

(TRY003)


431-431: Avoid specifying long messages outside the exception class

(TRY003)


435-435: Avoid specifying long messages outside the exception class

(TRY003)


451-451: Avoid specifying long messages outside the exception class

(TRY003)


468-468: Avoid specifying long messages outside the exception class

(TRY003)


470-470: Avoid specifying long messages outside the exception class

(TRY003)


474-474: Avoid specifying long messages outside the exception class

(TRY003)


488-488: Avoid specifying long messages outside the exception class

(TRY003)


507-513: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

examples/training_example.py

18-18: Loop control variable i not used within loop body

Rename unused i to _i

(B007)

🔇 Additional comments (2)
crates/basilica-api/src/bin/gen_api_key.rs (1)

22-24: LGTM!

The addition of "training:*" to the default scopes is appropriate and aligns with the new training session endpoints introduced in this PR.

examples/training_example.py (1)

1-27: LGTM!

The example effectively demonstrates the training SDK workflow with session creation, training loop, and sampling. The code is clear and appropriate for an example script.

Comment on lines +38 to +48
```
=== Creating session for facebook/opt-125m ===
Session: ts-a1b2c3d4

=== Training (3 steps) ===
Step 1: loss=6.2612
Step 2: loss=5.7423
Step 3: loss=5.7501

Sample: The quick brown -> fox jumps over the lazy
```

⚠️ Potential issue | 🟡 Minor

Add language specification to fenced code block.

Line 38 begins an example output code block without a language identifier, which fails the MD040 markdown lint rule.

🔎 Apply this diff to specify the language:
-```
+```text
 === Creating session for facebook/opt-125m ===
Suggested change

```text
=== Creating session for facebook/opt-125m ===
Session: ts-a1b2c3d4

=== Training (3 steps) ===
Step 1: loss=6.2612
Step 2: loss=5.7423
Step 3: loss=5.7501

Sample: The quick brown -> fox jumps over the lazy
```
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)

38-38: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
services/training-service/README.md around lines 38 to 48: the fenced code block
starting at line 38 lacks a language spec and triggers markdown linting; change
the opening fence from ``` to ```text (and keep the closing ``` unchanged) so
the example output is explicitly marked as plain text.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

♻️ Duplicate comments (5)
crates/basilica-api/src/api/routes/training.rs (5)

200-207: Potential namespace collision risk.

Different user IDs like "user_123" and "user-123" both map to "u-user-123", causing namespace collisions. Additionally, Kubernetes namespace names have a 63-character limit that isn't enforced here.

Based on past review comments, a hash-based approach was suggested to ensure unique, valid namespace names.
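As a minimal sketch of such a hash-based approach (in Python for illustration; the real handler is Rust, and this is not the project's actual `user_namespace`), appending a short digest of the raw user ID keeps names unique and within Kubernetes' 63-character limit:

```python
import hashlib
import re

def user_namespace(user_id: str) -> str:
    """Sanitize user_id to DNS-1123 form, then append a short digest of
    the raw ID so distinct IDs that sanitize identically stay distinct."""
    sanitized = re.sub(r"[^a-z0-9-]", "-", user_id.lower()).strip("-")
    digest = hashlib.sha256(user_id.encode()).hexdigest()[:8]
    # "u-" prefix + truncated body + "-" + digest, capped at 63 chars
    body = sanitized[: 63 - len("u-") - 1 - len(digest)]
    return f"u-{body}-{digest}"

# "user_123" and "user-123" sanitize identically but get distinct names
assert user_namespace("user_123") != user_namespace("user-123")
```

The digest also bounds the name length for arbitrarily long user IDs, which the current sanitization does not.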


283-283: Hardcoded enableBilling prevents user override.

The enableBilling field is always true, ignoring potential user preferences.

Based on past review comments, consider adding an optional field to CreateSessionRequest with a default of true.


323-332: Fragile error detection using string matching.

Checking for "409" or "AlreadyExists" in the error string is brittle and may incorrectly ignore real errors if the error message format changes.

Based on past review comments, use typed error matching like the CRD creation does at line 353: kube::Error::Api(ae) if ae.code == 409.


530-556: Confusing sentinel error pattern for empty list handling.

Using an ApiError::Internal with message "empty" as a sentinel value is a code smell that makes the code harder to maintain.

Based on past review comments, use proper typed error matching with early return for 404 cases, similar to how delete_session handles it at line 497.


398-409: Fix string-based error matching in session retrieval.

String-based error matching at lines 398-409 (and in list_sessions) is fragile and inconsistent with typed error handling used elsewhere in the file. Replace with typed kube::Error::Api pattern matching:

-    let session = api.get(&session_id).await.map_err(|e| {
-        if e.to_string().contains("404") || e.to_string().contains("NotFound") {
+    let session = api.get(&session_id).await.map_err(|e| match &e {
+        kube::Error::Api(ae) if ae.code == 404 => {
             ApiError::NotFound {
                 message: format!("Session {} not found", session_id),
             }
-        } else {
+        }
+        _ => {
             error!(error = %e, session_id = %session_id, "Failed to get session");
             ApiError::Internal {
                 message: format!("Failed to get session: {}", e),
             }
         }
     })?;

Apply the same fix to the list_sessions function.

🧹 Nitpick comments (9)
crates/basilica-api/src/api/routes/training.rs (3)

313-320: Consider collision risk with short session IDs.

Using only the first segment of UUID v4 (8 hex characters = ~4 billion possibilities) could lead to collisions as the number of sessions grows, especially across all users. While unlikely in the short term, full UUIDs would eliminate this risk entirely.

🔎 Suggested fix:
     let session_id = format!(
         "ts-{}",
-        uuid::Uuid::new_v4()
-            .to_string()
-            .split('-')
-            .next()
-            .unwrap()
+        uuid::Uuid::new_v4().to_string().replace("-", "")
     );

Or simply use the standard hyphenated UUID:

-    let session_id = format!(
-        "ts-{}",
-        uuid::Uuid::new_v4()
-            .to_string()
-            .split('-')
-            .next()
-            .unwrap()
-    );
+    let session_id = format!("ts-{}", uuid::Uuid::new_v4());
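The birthday bound makes the risk concrete: with 16^8 ≈ 4.3 billion possible 8-hex-character prefixes, the chance of at least one collision passes 1% at roughly 9,300 sessions. A rough back-of-envelope (not tied to any Basilica code):

```python
import math

def birthday_collision_prob(n_items: int, space: int) -> float:
    """Approximate P(at least one collision) for n_items uniform draws
    from `space` values: 1 - exp(-n(n-1) / (2 * space))."""
    return 1.0 - math.exp(-n_items * (n_items - 1) / (2.0 * space))

space = 16 ** 8  # 8 hex characters ~= 4.3 billion IDs
for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} sessions: {birthday_collision_prob(n, space):.4f}")
```

A full 32-character UUID (16^32 values) pushes the same probability far below any practical concern.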

634-709: Improve proxy error handling specificity.

The proxy always returns ApiError::ServiceUnavailable regardless of the underlying error type. This loses important context—a 404 from the training service means the session doesn't exist, while a 500 means an internal error.

🔎 Suggested improvement:

Parse the kube error and map specific HTTP status codes:

     let response = kube_client
         .request::<serde_json::Value>(request)
         .await
-        .map_err(|e| {
+        .map_err(|e| match &e {
+            kube::Error::Api(ae) if ae.code == 404 => {
+                ApiError::NotFound {
+                    message: format!("Training session {} not found", session_id),
+                }
+            }
+            _ => {
                 error!(
                     error = %e,
                     service = %service_name,
                     namespace = %namespace,
                     path = %path,
                     "Failed to proxy request to training service"
                 );
                 ApiError::ServiceUnavailable
+            }
         })?;

711-903: Consider reducing duplication in proxy handlers.

All seven proxy handler functions follow an identical pattern with only the path and metric name differing. While the current code is clear, extracting a macro or higher-order function could reduce maintenance burden.

💡 Example approach using a macro:
macro_rules! proxy_handler {
    ($name:ident, $metric:expr, $method:expr, $path_fn:expr) => {
        pub async fn $name(
            State(state): State<AppState>,
            Extension(auth): Extension<AuthContext>,
            Path(session_id): Path<String>,
            body: Option<Json<serde_json::Value>>,
        ) -> Result<axum::response::Response> {
            let start = Instant::now();
            let k8s_client = state.k8s.as_ref().ok_or(ApiError::ServiceUnavailable)?;
            let kube_client = k8s_client.kube_client();
            let namespace = user_namespace(&auth.user_id);

            let result = proxy_to_training_service(
                &kube_client,
                &namespace,
                &session_id,
                &$path_fn(&session_id),
                $method,
                body.map(|Json(v)| v),
            )
            .await;

            apimetrics::record_request($metric, $method, start, result.is_ok());
            result
        }
    };
}

This is a nice-to-have optimization, not a blocking issue.

crates/basilica-sdk-python/python/basilica/training/types.py (1)

169-176: Use asyncio.get_running_loop() instead of deprecated get_event_loop().

asyncio.get_event_loop() is deprecated when called with no running event loop (since Python 3.10). Since result_async is an async method, a loop is guaranteed to be running, so get_running_loop() is the recommended, unambiguous call.

🔎 Apply this diff:
     async def result_async(self, timeout: Optional[float] = None):
         """Wait for operation to complete (async)."""
-        loop = asyncio.get_event_loop()
+        loop = asyncio.get_running_loop()
         if timeout is not None:
             return await asyncio.wait_for(
                 loop.run_in_executor(None, self._future.result), timeout=timeout
             )
         return await loop.run_in_executor(None, self._future.result)
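A self-contained illustration of the pattern, independent of the SDK (all names here are illustrative only):

```python
import asyncio
import concurrent.futures
import time

def slow_result() -> str:
    time.sleep(0.05)  # stand-in for a blocking Future.result() call
    return "done"

async def result_async(pool, timeout=None):
    # Safe inside a coroutine: a loop is guaranteed to be running,
    # so get_running_loop() never needs a fallback.
    loop = asyncio.get_running_loop()
    fut = loop.run_in_executor(pool, slow_result)
    if timeout is not None:
        return await asyncio.wait_for(fut, timeout=timeout)
    return await fut

with concurrent.futures.ThreadPoolExecutor() as pool:
    print(asyncio.run(result_async(pool, timeout=1.0)))  # prints "done"
```

If the executor call exceeds the timeout, asyncio.wait_for raises TimeoutError, matching the SDK's timeout semantics.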
crates/basilica-sdk-python/python/basilica/training/sampling_client.py (1)

146-150: Consider extracting endpoint resolution to reduce duplication.

The endpoint resolution logic here duplicates _get_endpoint() with a different suffix. This is minor, but could be consolidated.

🔎 Example refactor:
+    def _get_endpoint(self, operation: str = "sample") -> str:
+        """Get the endpoint for the given operation."""
+        if self._session_id and self._internal_id:
+            return f"/sessions/{self._session_id}/internal/{self._internal_id}/{operation}"
+        return f"/{operation}"
+
-    def _get_endpoint(self) -> str:
-        """Get the sampling endpoint."""
-        if self._session_id and self._internal_id:
-            return f"/sessions/{self._session_id}/internal/{self._internal_id}/sample"
-        return "/sample"

Then use self._get_endpoint("compute_logprobs") at line 148.

crates/basilica-sdk-python/python/basilica/training/exceptions.py (3)

8-14: Use explicit | None type annotation for optional parameter.

Per PEP 484, implicit Optional (using = None without a None type) is discouraged. This improves type checker compatibility.

🔎 Apply this diff:
-    def __init__(self, message: str, status_code: int = None):
+    def __init__(self, message: str, status_code: int | None = None):

54-62: Use explicit | None type annotation.

Same issue as the base class - retry_after: float = None should use explicit optional type.

🔎 Apply this diff:
-    def __init__(self, retry_after: float = None):
+    def __init__(self, retry_after: float | None = None):

72-75: Consider adding a default message to CheckpointError.

Unlike other exceptions, CheckpointError has no custom __init__, so callers must always provide a message. Consider adding a default for consistency with AuthenticationError and InsufficientResourcesError.

🔎 Apply this diff:
 class CheckpointError(TrainingError):
     """Checkpoint operation failed."""
 
-    pass
+    def __init__(self, message: str = "Checkpoint operation failed"):
+        super().__init__(message)
crates/basilica-sdk-python/python/basilica/training/service_client.py (1)

388-399: Async client is not properly closed.

The async client is lazily created but never closed, which can cause resource leaks. Consider adding an async close method or async context manager support.

🔎 Suggested approach:
async def close_async(self):
    """Close async resources."""
    if self._async_client:
        await self._async_client.aclose()
        self._async_client = None

async def __aenter__(self):
    return self

async def __aexit__(self, *args):
    await self.close_async()
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0c1cf26 and 0aa3e6b.

📒 Files selected for processing (8)
  • crates/basilica-api/src/api/routes/training.rs (1 hunks)
  • crates/basilica-sdk-python/python/basilica/training/__init__.py (1 hunks)
  • crates/basilica-sdk-python/python/basilica/training/exceptions.py (1 hunks)
  • crates/basilica-sdk-python/python/basilica/training/sampling_client.py (1 hunks)
  • crates/basilica-sdk-python/python/basilica/training/service_client.py (1 hunks)
  • crates/basilica-sdk-python/python/basilica/training/training_client.py (1 hunks)
  • crates/basilica-sdk-python/python/basilica/training/types.py (1 hunks)
  • examples/training_example.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
crates/basilica-sdk-python/python/basilica/training/sampling_client.py (3)
crates/basilica-sdk-python/python/basilica/training/types.py (7)
  • ModelInput (48-66)
  • SampleResponse (100-106)
  • SamplingParams (70-93)
  • to_dict (37-44)
  • to_dict (84-93)
  • from_string (64-66)
  • result_async (169-176)
crates/basilica-sdk-python/python/basilica/training/exceptions.py (1)
  • TrainingError (8-14)
crates/basilica-sdk-python/python/basilica/training/training_client.py (2)
  • sample (395-428)
  • close (430-435)
crates/basilica-sdk-python/python/basilica/training/service_client.py (5)
crates/basilica-sdk-python/python/basilica/training/types.py (1)
  • GetServerCapabilitiesResponse (127-132)
crates/basilica-sdk-python/python/basilica/training/training_client.py (4)
  • TrainingClient (22-465)
  • base_model (70-72)
  • session_id (65-67)
  • close (430-435)
crates/basilica-sdk-python/python/basilica/training/sampling_client.py (2)
  • SamplingClient (16-186)
  • close (178-180)
crates/basilica-sdk-python/python/basilica/training/exceptions.py (4)
  • TrainingError (8-14)
  • SessionTimeoutError (36-44)
  • AuthenticationError (47-51)
  • ValidationError (65-69)
crates/basilica-api/src/api/routes/training.rs (2)
  • get_session (385-464)
  • list_sessions (517-621)
crates/basilica-api/src/api/routes/training.rs (2)
crates/basilica-operator/src/crd/training_session.rs (9)
  • default_rank (44-46)
  • default_alpha (47-49)
  • default_dropout (50-52)
  • default (34-41)
  • default (80-86)
  • default (145-151)
  • default_ttl (214-216)
  • new (316-322)
  • new (374-391)
services/training-service/src/backend.py (4)
  • create_session (163-224)
  • forward_backward (226-292)
  • optim_step (294-317)
  • sample (319-381)
crates/basilica-sdk-python/python/basilica/training/__init__.py (5)
crates/basilica-sdk-python/python/basilica/training/types.py (8)
  • Datum (17-44)
  • ModelInput (48-66)
  • SamplingParams (70-93)
  • SampleResponse (100-106)
  • ForwardBackwardResult (110-115)
  • ForwardResult (119-123)
  • GetServerCapabilitiesResponse (127-132)
  • APIFuture (138-184)
crates/basilica-sdk-python/python/basilica/training/exceptions.py (10)
  • TrainingError (8-14)
  • SessionNotFoundError (17-22)
  • SessionNotReadyError (25-33)
  • SessionTimeoutError (36-44)
  • AuthenticationError (47-51)
  • RateLimitError (54-62)
  • ValidationError (65-69)
  • CheckpointError (72-75)
  • ModelNotFoundError (78-83)
  • InsufficientResourcesError (86-90)
crates/basilica-sdk-python/python/basilica/training/service_client.py (1)
  • ServiceClient (24-399)
crates/basilica-sdk-python/python/basilica/training/training_client.py (1)
  • TrainingClient (22-465)
crates/basilica-sdk-python/python/basilica/training/sampling_client.py (1)
  • SamplingClient (16-186)
🪛 Ruff (0.14.8)
crates/basilica-sdk-python/python/basilica/training/types.py

158-158: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


189-198: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

crates/basilica-sdk-python/python/basilica/training/training_client.py

125-125: Avoid specifying long messages outside the exception class

(TRY003)


188-188: Avoid specifying long messages outside the exception class

(TRY003)


253-253: Avoid specifying long messages outside the exception class

(TRY003)


278-278: Avoid specifying long messages outside the exception class

(TRY003)


299-299: Avoid specifying long messages outside the exception class

(TRY003)


319-319: Avoid specifying long messages outside the exception class

(TRY003)


339-339: Avoid specifying long messages outside the exception class

(TRY003)


381-381: Avoid specifying long messages outside the exception class

(TRY003)


426-426: Avoid specifying long messages outside the exception class

(TRY003)


434-434: Avoid specifying long messages outside the exception class

(TRY003)

crates/basilica-sdk-python/python/basilica/training/sampling_client.py

99-99: Avoid specifying long messages outside the exception class

(TRY003)


154-154: Avoid specifying long messages outside the exception class

(TRY003)

crates/basilica-sdk-python/python/basilica/training/service_client.py

64-66: Avoid specifying long messages outside the exception class

(TRY003)


154-156: Avoid specifying long messages outside the exception class

(TRY003)


190-190: Avoid specifying long messages outside the exception class

(TRY003)


199-199: Avoid specifying long messages outside the exception class

(TRY003)


207-207: Avoid specifying long messages outside the exception class

(TRY003)


235-235: Avoid specifying long messages outside the exception class

(TRY003)


327-327: Avoid specifying long messages outside the exception class

(TRY003)


350-350: Avoid specifying long messages outside the exception class

(TRY003)


352-352: Avoid specifying long messages outside the exception class

(TRY003)


356-356: Avoid specifying long messages outside the exception class

(TRY003)


374-374: Avoid specifying long messages outside the exception class

(TRY003)

crates/basilica-sdk-python/python/basilica/training/exceptions.py

11-11: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


57-57: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


95-106: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

crates/basilica-sdk-python/python/basilica/training/__init__.py

69-96: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

🔇 Additional comments (16)
examples/training_example.py (2)

45-77: LGTM!

The training loop demonstrates the SDK API well, with proper context manager usage for session cleanup, clear forward-backward/optim-step pattern, and checkpoint saving.


108-140: LGTM!

The async example properly demonstrates the async API pattern with try/finally for cleanup.

crates/basilica-sdk-python/python/basilica/training/types.py (2)

16-45: LGTM!

The Datum dataclass is well-documented with clear field semantics and a clean to_dict() serialization method.


69-93: LGTM!

SamplingParams provides sensible defaults and the to_dict() method handles None gracefully for stop_sequences.

crates/basilica-sdk-python/python/basilica/training/training_client.py (2)

219-258: LGTM!

The optim_step method correctly handles optional hyperparameter overrides and updates the internal step counter from the server response.


430-441: LGTM!

The close() method appropriately ignores 404 responses (idempotent deletion) and uses wait=False for executor shutdown during cleanup, which is acceptable since the session is being terminated.

crates/basilica-sdk-python/python/basilica/training/sampling_client.py (2)

26-48: LGTM!

The constructor clearly documents the dual-use pattern for session-bound vs standalone sampling, with appropriate optional parameters.


56-105: LGTM!

The sample() method cleanly handles payload construction, optional model hints, and gracefully handles both single and batch response formats with data.get("samples", [data]).
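The data.get("samples", [data]) idiom normalizes single-object and batch responses into one shape. In isolation:

```python
def extract_samples(data: dict) -> list:
    # Batch responses wrap results in a "samples" list; single-sample
    # responses return the object directly. Either way, callers get a list.
    return data.get("samples", [data])

batch = {"samples": [{"text": "a"}, {"text": "b"}]}
single = {"text": "a"}
```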

crates/basilica-sdk-python/python/basilica/training/__init__.py (2)

1-35: LGTM!

Excellent module docstring with a practical quick-start example and clear documentation of supported loss functions and authentication.


39-96: LGTM!

The module correctly re-exports all public symbols from submodules, and the __all__ list is well-organized with logical grouping (clients, types, exceptions) that aids discoverability.

crates/basilica-sdk-python/python/basilica/training/service_client.py (6)

36-73: LGTM!

The initialization logic handles API key and endpoint configuration correctly with appropriate fallbacks to environment variables and sensible defaults.


75-90: LGTM!

Capability query methods are clean and correctly use raise_for_status() for error handling.


247-307: No readiness polling for restored sessions.

Unlike create_lora_training_client which polls until the session is ready, create_training_client_from_state and create_training_client_from_state_with_optimizer return immediately after the API call. If session restoration is asynchronous, the returned TrainingClient may reference a session that isn't ready yet.

Consider adding similar readiness polling, or document that restoration is synchronous.
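If restoration does turn out to be asynchronous, the readiness poll from create_lora_training_client could be factored out and reused. A generic sketch (the poll interval, timeout, phase strings, and get_phase callable are illustrative assumptions):

```python
import time

def wait_until_ready(get_phase, timeout_s: float = 60.0, interval_s: float = 1.0) -> str:
    """Poll get_phase() until it returns "ready", failing on error or timeout."""
    deadline = time.monotonic() + timeout_s
    phase = "unknown"
    while time.monotonic() < deadline:
        phase = get_phase()
        if phase == "ready":
            return phase
        if phase == "failed":
            raise RuntimeError("session restoration failed")
        time.sleep(interval_s)
    raise TimeoutError(f"session not ready after {timeout_s}s (last phase: {phase})")
```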


309-333: LGTM!

Clean validation ensuring at least one model reference is provided before creating the sampling client.


365-376: LGTM!

Simple implementation that correctly handles errors and returns session list.


163-187: Empty checkpointStorage fields should be required.

The bucket and path fields in checkpointStorage are set to empty strings without validation. Since storage operations require these values, either provide non-empty defaults, require them as parameters, or add validation to reject empty values at the API level.
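A minimal validation of the kind this comment asks for, rejecting empty bucket/path up front (field names taken from the request shape in this PR; the helper itself is a sketch, not the PR's implementation):

```python
def validate_checkpoint_storage(storage: dict) -> None:
    """Reject checkpointStorage payloads with missing or empty bucket/path."""
    for key in ("bucket", "path"):
        value = storage.get(key, "")
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"checkpointStorage.{key} must be a non-empty string")

validate_checkpoint_storage({"backend": "s3", "bucket": "ckpts", "path": "runs/1"})
```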

Comment on lines +121 to +129
#[derive(Debug, Clone, Deserialize)]
#[serde(rename_all = "camelCase")]
pub struct CheckpointStorageRequest {
pub backend: String,
pub bucket: String,
pub path: String,
#[serde(default)]
pub credentials_secret: Option<String>,
}

⚠️ Potential issue | 🟡 Minor

Document accepted backend values.

The backend field has no validation or documentation indicating which values are accepted (e.g., "s3", "gcs", "local"). Consider adding validation or at minimum doc comments listing the supported storage backends.

🔎 Suggested improvement:

Add documentation:

 #[derive(Debug, Clone, Deserialize)]
 #[serde(rename_all = "camelCase")]
 pub struct CheckpointStorageRequest {
+    /// Storage backend type. Supported values: "s3", "gcs"
     pub backend: String,
     pub bucket: String,
     pub path: String,
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
#[derive(Debug, Clone, Deserialize)]
#[serde(rename_all = "camelCase")]
pub struct CheckpointStorageRequest {
    /// Storage backend type. Supported values: "s3", "gcs"
    pub backend: String,
    pub bucket: String,
    pub path: String,
    #[serde(default)]
    pub credentials_secret: Option<String>,
}
🤖 Prompt for AI Agents
In crates/basilica-api/src/api/routes/training.rs around lines 121 to 129, the
CheckpointStorageRequest struct's backend field lacks documentation or
validation about accepted values; add a doc comment above the backend field
enumerating supported backends (e.g., "s3", "gcs", "local") and, if practical,
enforce validation by either changing the type to a dedicated enum with
Deserialize implementations for those variants or adding a simple runtime check
where the struct is deserialized/used that returns a clear error for unsupported
values.

Comment on lines +347 to +363
resp = self._client.get(f"/sessions/{session_id}")

if resp.status_code == 404:
raise TrainingError(f"Session {session_id} not found")
if not resp.is_success:
raise TrainingError(f"Failed to get session: {resp.text}")

data = resp.json()
if data.get("phase") != "ready":
raise TrainingError(f"Session not ready: {data.get('phase')}")

return TrainingClient(
client=self._client,
session_id=session_id,
internal_id=f"train-{session_id}",
base_model=data.get("baseModel", "unknown"),
)

🛠️ Refactor suggestion | 🟠 Major

Use specific exception types from exceptions.py.

The module defines SessionNotFoundError and SessionNotReadyError specifically for these cases, but generic TrainingError is used instead. Using the specific exceptions improves error handling for callers.

🔎 Apply this diff:
+from .exceptions import (
+    TrainingError,
+    SessionTimeoutError,
+    SessionNotFoundError,
+    SessionNotReadyError,
+    AuthenticationError,
+    ValidationError,
+)

Then update the method:

         if resp.status_code == 404:
-            raise TrainingError(f"Session {session_id} not found")
+            raise SessionNotFoundError(session_id)
         if not resp.is_success:
             raise TrainingError(f"Failed to get session: {resp.text}")

         data = resp.json()
         if data.get("phase") != "ready":
-            raise TrainingError(f"Session not ready: {data.get('phase')}")
+            raise SessionNotReadyError(session_id, data.get("phase", "unknown"))

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 Ruff (0.14.8)

350-350: Avoid specifying long messages outside the exception class

(TRY003)


352-352: Avoid specifying long messages outside the exception class

(TRY003)


356-356: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
In crates/basilica-sdk-python/python/basilica/training/service_client.py around
lines 347 to 363, the code raises generic TrainingError for a 404 and for a
session not-ready state; replace those with the more specific exceptions defined
in exceptions.py by importing SessionNotFoundError and SessionNotReadyError (or
adjusting existing imports) and raise SessionNotFoundError(f"Session
{session_id} not found") for the 404 branch and SessionNotReadyError(f"Session
not ready: {data.get('phase')}") when the phase is not "ready", leaving the
other TrainingError usages unchanged.

Comment on lines +378 to +386
async def _get_async_client(self) -> httpx.AsyncClient:
"""Get or create async HTTP client."""
if self._async_client is None:
self._async_client = httpx.AsyncClient(
base_url=self.endpoint,
headers={"Authorization": f"Bearer {self.api_key}"},
timeout=300.0,
)
return self._async_client

⚠️ Potential issue | 🟡 Minor

Async client uses hardcoded timeout instead of configured value.

The sync client uses the timeout parameter from __init__, but the async client hardcodes 300.0. Store the timeout in __init__ and reuse it here for consistency.

🔎 Apply this diff:

In __init__:

         self._async_client = None  # Lazy init
+        self._timeout = timeout

In _get_async_client:

             self._async_client = httpx.AsyncClient(
                 base_url=self.endpoint,
                 headers={"Authorization": f"Bearer {self.api_key}"},
-                timeout=300.0,
+                timeout=self._timeout,
             )

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In crates/basilica-sdk-python/python/basilica/training/service_client.py around
lines 378-386, the async HTTP client is created with a hardcoded timeout of
300.0; update __init__ to store the provided timeout (e.g., self.timeout =
timeout) and change the AsyncClient construction to use that stored value
(timeout=self.timeout) so the async client honors the configured timeout
consistently with the sync client.

Comment on lines +106 to +131
def _call():
normalized = self._normalize_data(data)

# Pad sequences
max_len = max(len(d["input_ids"]) for d in normalized)
input_ids = []
attention_mask = []

for d in normalized:
ids = d["input_ids"]
pad_len = max_len - len(ids)
input_ids.append(ids + [0] * pad_len)
attention_mask.append([1] * len(ids) + [0] * pad_len)

resp = self._client.post(
self._proxy("forward"),
json={"input_ids": input_ids, "attention_mask": attention_mask},
)
if not resp.is_success:
raise TrainingError(f"forward failed: {resp.text}")
r = resp.json()
return ForwardResult(
logprobs=r["logprobs"], tokens_processed=r["tokens_processed"]
)

return APIFuture(self._executor.submit(_call), ForwardResult)

⚠️ Potential issue | 🟠 Major

Add validation for empty data list.

If data is empty, max(len(d["input_ids"]) for d in normalized) at line 110 will raise a ValueError: max() arg is an empty sequence. Add a guard at the start of _call().

🔎 Apply this diff:
         def _call():
             normalized = self._normalize_data(data)
+            if not normalized:
+                raise ValueError("data list cannot be empty")

             # Pad sequences
             max_len = max(len(d["input_ids"]) for d in normalized)
📝 Committable suggestion

Suggested change
        def _call():
            normalized = self._normalize_data(data)
            if not normalized:
                raise ValueError("data list cannot be empty")

            # Pad sequences
            max_len = max(len(d["input_ids"]) for d in normalized)
            input_ids = []
            attention_mask = []

            for d in normalized:
                ids = d["input_ids"]
                pad_len = max_len - len(ids)
                input_ids.append(ids + [0] * pad_len)
                attention_mask.append([1] * len(ids) + [0] * pad_len)

            resp = self._client.post(
                self._proxy("forward"),
                json={"input_ids": input_ids, "attention_mask": attention_mask},
            )
            if not resp.is_success:
                raise TrainingError(f"forward failed: {resp.text}")
            r = resp.json()
            return ForwardResult(
                logprobs=r["logprobs"], tokens_processed=r["tokens_processed"]
            )

        return APIFuture(self._executor.submit(_call), ForwardResult)
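The padding logic plus the proposed guard, extracted into a standalone helper for illustration (pad id 0 and the mask convention follow the snippet above; the helper name is hypothetical):

```python
def pad_batch(sequences: list) -> tuple:
    """Right-pad token id sequences to a common length; the mask marks real tokens."""
    if not sequences:
        raise ValueError("data list cannot be empty")
    max_len = max(len(ids) for ids in sequences)
    input_ids, attention_mask = [], []
    for ids in sequences:
        pad_len = max_len - len(ids)
        input_ids.append(ids + [0] * pad_len)          # pad token id 0
        attention_mask.append([1] * len(ids) + [0] * pad_len)
    return input_ids, attention_mask
```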
🧰 Tools
🪛 Ruff (0.14.8)

125-125: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
In crates/basilica-sdk-python/python/basilica/training/training_client.py around
lines 106 to 131, add a guard after normalizing data to handle an empty input
list: after normalized = self._normalize_data(data) check if not normalized and
raise a TrainingError (or otherwise return a failed future) with a clear message
such as "forward failed: no data provided" to avoid the max() empty-sequence
ValueError; keep the rest of the function flow unchanged so the APIFuture is
still returned by _call().

Comment on lines +153 to +197
def _call():
normalized = self._normalize_data(data)

# Default labels to input_ids (causal LM)
for d in normalized:
if "labels" not in d or d["labels"] is None:
d["labels"] = d["input_ids"].copy()

# Pad sequences
max_len = max(len(d["input_ids"]) for d in normalized)
input_ids = []
labels = []
attention_mask = []
loss_weights = []

for d in normalized:
ids = d["input_ids"]
pad_len = max_len - len(ids)
input_ids.append(ids + [0] * pad_len)
labels.append(d["labels"] + [-100] * pad_len)
attention_mask.append([1] * len(ids) + [0] * pad_len)
if d.get("loss_weights"):
loss_weights.append(d["loss_weights"] + [0.0] * pad_len)

payload = {
"input_ids": input_ids,
"attention_mask": attention_mask,
"labels": labels,
"loss_fn": loss_fn,
}
if loss_weights:
payload["loss_weights"] = loss_weights

resp = self._client.post(self._proxy("forward_backward"), json=payload)
if not resp.is_success:
raise TrainingError(f"forward_backward failed: {resp.text}")

r = resp.json()
return ForwardBackwardResult(
loss=r["loss"],
logprobs=r.get("logprobs", []),
tokens_processed=r.get("tokens_processed", 0),
)

return APIFuture(self._executor.submit(_call), ForwardBackwardResult)

⚠️ Potential issue | 🟠 Major

Add validation for empty data list in forward_backward.

Same issue as forward() — line 162 will raise ValueError if data is empty. Add a guard after normalization.

🔎 Apply this diff:
         def _call():
             normalized = self._normalize_data(data)
+            if not normalized:
+                raise ValueError("data list cannot be empty")

             # Default labels to input_ids (causal LM)
             for d in normalized:
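The label handling above (default labels to input_ids for causal LM, then pad labels with -100 so padding positions are ignored by the loss) can be sketched in isolation; this helper is illustrative and assumes labels, when provided, match input_ids in length:

```python
IGNORE_INDEX = -100  # HF-style convention: positions with -100 are skipped by the loss

def pad_labels(batch: list) -> list:
    """Default labels to input_ids (causal LM) and pad them with IGNORE_INDEX."""
    if not batch:
        raise ValueError("data list cannot be empty")
    max_len = max(len(d["input_ids"]) for d in batch)
    padded = []
    for d in batch:
        labels = d.get("labels") or list(d["input_ids"])
        padded.append(labels + [IGNORE_INDEX] * (max_len - len(labels)))
    return padded
```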
🧰 Tools
🪛 Ruff (0.14.8)

188-188: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
In crates/basilica-sdk-python/python/basilica/training/training_client.py around
lines 153 to 197, after normalizing the input data add a guard that checks if
the normalized list is empty and raise a TrainingError (or appropriate API
error) with a clear message (e.g. "forward_backward requires non-empty data")
before calling max(len(...)) to avoid ValueError; ensure the new check occurs
immediately after normalized = self._normalize_data(data) and returns/raises
consistently with other methods like forward().

Comment on lines +36 to +37
if os.path.exists("build/api-token.txt"):
api_key = open("build/api-token.txt").read().strip()

⚠️ Potential issue | 🟡 Minor

Resource leak: file handle not closed.

The file is opened without a context manager, leaving the handle unclosed after reading. This pattern is repeated in sampling_example() and async_example().

🔎 Apply this diff to fix the resource leak:
-    if os.path.exists("build/api-token.txt"):
-        api_key = open("build/api-token.txt").read().strip()
+    if os.path.exists("build/api-token.txt"):
+        with open("build/api-token.txt") as f:
+            api_key = f.read().strip()

Apply the same fix at lines 83-84 and 114-115.

📝 Committable suggestion

Suggested change
    if os.path.exists("build/api-token.txt"):
        with open("build/api-token.txt") as f:
            api_key = f.read().strip()
🤖 Prompt for AI Agents
In examples/training_example.py around lines 36-37 (and apply the same fix at
lines 83-84 and 114-115), the file is opened with open(...) and read without
using a context manager which leaks file handles; replace those open(...).read()
calls with a with open(...) as f: api_key = f.read().strip() pattern (i.e., use
a with-statement context manager to open, read, strip, and automatically close
the file) for each occurrence.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

♻️ Duplicate comments (11)
crates/basilica-sdk-python/python/basilica/training/training_client.py (2)

107-132: Add validation for empty data list.

If data is empty, max(len(d["input_ids"]) for d in normalized) at line 111 will raise a ValueError: max() arg is an empty sequence. This issue was previously raised and remains unaddressed.


180-224: Add validation for empty data list in forward_backward.

Same issue as forward() — line 189 will raise ValueError if data is empty. This issue was previously raised and remains unaddressed.

examples/training_example.py (1)

36-37: Resource leak: file handles not closed.

The file is opened without a context manager at lines 37, 106, and 137, leaving handles unclosed after reading. This issue was previously raised and remains unaddressed.

Also applies to: 105-106, 136-137

crates/basilica-api/src/api/routes/training.rs (6)

200-207: Potential namespace collision risk.

The sanitization logic could create collisions: different user IDs like "user_123" and "user-123" would both map to "u-user-123". This issue was previously raised and remains unaddressed.
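The collision is easy to reproduce, and a short content-hash suffix is one way out. A sketch, not the PR's actual sanitization code (the regex and prefix here only approximate what the review describes):

```python
import hashlib
import re

def naive_namespace(user_id: str) -> str:
    # Roughly the behavior the review describes: runs of
    # non-alphanumerics collapse to a single "-".
    return "u-" + re.sub(r"[^a-z0-9]+", "-", user_id.lower())

def safe_namespace(user_id: str) -> str:
    # Appending a short digest of the raw id keeps sanitized names unique.
    digest = hashlib.sha256(user_id.encode()).hexdigest()[:8]
    return f"{naive_namespace(user_id)}-{digest}"

assert naive_namespace("user_123") == naive_namespace("user-123")  # collision
assert safe_namespace("user_123") != safe_namespace("user-123")    # disambiguated
```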


283-283: Hardcoded enableBilling prevents user override.

The enableBilling field is always set to true, ignoring any potential user preference. This issue was previously raised and remains unaddressed.


323-332: Fragile error detection using string matching.

Checking for "409" or "AlreadyExists" in the error string is brittle and may miss actual errors. This issue was previously raised and remains unaddressed.


530-556: Confusing sentinel error pattern for empty list handling.

Using an error with message "empty" as a sentinel value is a code smell. This issue was previously raised and remains unaddressed.


121-129: Document accepted backend values.

The backend field has no validation or documentation indicating which values are accepted. This issue was previously raised and remains unaddressed.


398-409: Fragile error detection still present despite being marked as addressed.

The past review comment indicates this was addressed in commit 0c1cf26, but the code still uses string-based error matching with e.to_string().contains("404") at line 399.

🔎 Apply this fix to use typed error matching:
-    let session = api.get(&session_id).await.map_err(|e| {
-        if e.to_string().contains("404") || e.to_string().contains("NotFound") {
+    let session = api.get(&session_id).await.map_err(|e| match &e {
+        kube::Error::Api(ae) if ae.code == 404 => {
             ApiError::NotFound {
                 message: format!("Session {} not found", session_id),
             }
-        } else {
+        }
+        _ => {
             error!(error = %e, session_id = %session_id, "Failed to get session");
             ApiError::Internal {
                 message: format!("Failed to get session: {}", e),
             }
         }
     })?;
services/training-service/src/backend.py (2)

207-209: get_peft_model modifies base model; print_trainable_parameters uses stdout.

  1. get_peft_model can modify the base model in place, which may cause issues if the base model is cached and shared across sessions.
  2. print_trainable_parameters() outputs to stdout rather than the structured logger.

This issue was previously raised and remains unaddressed.


531-559: Critical bug: Optimizer references stale parameters after loading checkpoint.

After PeftModel.from_pretrained creates a new model at line 543, session.optimizer still references parameters from the old model. The optimizer state is loaded, but it won't update the new model's parameters. This issue was previously raised and remains unaddressed.
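The stale-reference bug can be demonstrated without torch: an optimizer that captured the old model's parameter objects keeps mutating those, never the freshly loaded model's. A pure-Python analogue (all names here are illustrative; real torch optimizers hold parameter references the same way):

```python
class Param:
    def __init__(self, value: float) -> None:
        self.value = value

class Optimizer:
    """Holds direct references to parameter objects, like torch optimizers do."""
    def __init__(self, params: list) -> None:
        self.params = params

    def step(self) -> None:
        for p in self.params:
            p.value -= 0.1

old_model = [Param(1.0)]
opt = Optimizer(old_model)

# "Load checkpoint": a brand-new model object replaces the old one...
new_model = [Param(1.0)]

opt.step()
# ...but the optimizer still mutates the old parameters:
assert old_model[0].value != 1.0
assert new_model[0].value == 1.0  # the loaded model never trains

# Fix: rebind the optimizer to the new model's parameters after loading.
opt.params = new_model
```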

🧹 Nitpick comments (4)
crates/basilica-api/src/api/routes/training.rs (1)

713-959: Significant code duplication in proxy handlers.

The proxy handler functions (create_internal_session, get_internal_session, forward_backward, optim_step, sample, save_checkpoint, load_checkpoint, forward, compute_logprobs) follow an identical pattern with only minor variations in paths and methods. Consider extracting a common helper or using a macro to reduce duplication.

💡 Example refactor using a helper function:
async fn proxy_handler(
    state: State<AppState>,
    auth: Extension<AuthContext>,
    session_id: String,
    internal_session_id: Option<String>,
    path_suffix: &str,
    method: http::Method,
    body: Option<serde_json::Value>,
    metric_name: &str,
) -> Result<axum::response::Response> {
    let start = Instant::now();
    let k8s_client = state.k8s.as_ref().ok_or(ApiError::ServiceUnavailable)?;
    let kube_client = k8s_client.kube_client();
    let namespace = user_namespace(&auth.user_id);
    
    let path = if let Some(internal_id) = internal_session_id {
        format!("/sessions/{}/{}", internal_id, path_suffix)
    } else {
        format!("/sessions{}", path_suffix)
    };
    
    let result = proxy_to_training_service(
        &kube_client,
        &namespace,
        &session_id,
        &path,
        method,
        body,
    )
    .await;
    
    apimetrics::record_request(metric_name, method.as_str(), start, result.is_ok());
    result
}
scripts/local-training-e2e.sh (3)

58-62: Consider adding a note about the security trade-off.

Installing k3d via curl | bash without signature verification is convenient but poses a security risk. While acceptable for local development scripts, consider documenting this trade-off or providing an option to skip auto-installation.


221-250: Fragile output parsing could break silently.

Lines 229-230 parse the gen-api-key output using grep and sed, which is fragile if the output format changes. While there's a validation check for empty values on Line 232, consider making the key generation output more structured (e.g., JSON) for robust parsing.

Suggested approach:

If gen-api-key can emit JSON, parse it with jq:

KEY_OUTPUT=$(cargo run -p basilica-api --bin gen-api-key -- \
    --user testuser --name training-e2e --scopes "training:*" \
    --format json 2>/dev/null)

API_TOKEN=$(echo "$KEY_OUTPUT" | jq -r '.token')
SQL_INSERT=$(echo "$KEY_OUTPUT" | jq -r '.sql')

Otherwise, document the expected format and add more validation:

 # Extract the token and SQL
 API_TOKEN=$(echo "$KEY_OUTPUT" | grep "Token (Authorization):" | sed 's/.*Bearer //')
 SQL_INSERT=$(echo "$KEY_OUTPUT" | grep "^INSERT INTO")

 if [ -z "$API_TOKEN" ] || [ -z "$SQL_INSERT" ]; then
     log_error "Failed to generate API key"
+    log_error "Expected output format not found. Output was:"
     echo "$KEY_OUTPUT"
     exit 1
 fi

362-401: Reduce duplication in session creation logic.

The session creation payload and curl invocation are duplicated for authenticated vs. unauthenticated requests. This makes the function harder to maintain.

🔎 Consolidate the logic:
# Create a training session (CPU mode with gpu_count: 0)
log_info "Creating training session (CPU mode)..."

CURL_ARGS=(-s -X POST "$API_URL/sessions" -H "Content-Type: application/json")
[ -n "$AUTH_HEADER" ] && CURL_ARGS+=(-H "$AUTH_HEADER")

RESPONSE=$(curl "${CURL_ARGS[@]}" -d '{
    "baseModel": "facebook/opt-125m",
    "checkpointStorage": {
        "backend": "r2",
        "bucket": "test-bucket",
        "path": "test/checkpoints"
    },
    "loraConfig": {
        "rank": 8,
        "alpha": 16
    },
    "gpuResources": {
        "count": 0
    }
}')
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0aa3e6b and eabae23.

📒 Files selected for processing (8)
  • crates/basilica-api/src/api/middleware/scope.rs (1 hunks)
  • crates/basilica-api/src/api/mod.rs (1 hunks)
  • crates/basilica-api/src/api/routes/training.rs (1 hunks)
  • crates/basilica-sdk-python/python/basilica/training/training_client.py (1 hunks)
  • examples/training_example.py (1 hunks)
  • scripts/local-training-e2e.sh (1 hunks)
  • services/training-service/src/backend.py (1 hunks)
  • services/training-service/src/server.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • crates/basilica-api/src/api/mod.rs
🧰 Additional context used
🧬 Code graph analysis (5)
crates/basilica-sdk-python/python/basilica/training/training_client.py (3)
crates/basilica-sdk-python/python/basilica/training/types.py (9)
  • APIFuture (138-184)
  • Datum (17-44)
  • ForwardBackwardResult (110-115)
  • ForwardResult (119-123)
  • ModelInput (48-66)
  • result (163-167)
  • to_dict (37-44)
  • to_dict (84-93)
  • result_async (169-176)
crates/basilica-sdk-python/python/basilica/training/sampling_client.py (4)
  • SamplingClient (16-186)
  • compute_logprobs (128-158)
  • sample (56-105)
  • compute_logprobs_async (172-176)
crates/basilica-sdk-python/python/basilica/training/exceptions.py (1)
  • TrainingError (8-14)
scripts/local-training-e2e.sh (3)
crates/basilica-operator/src/runtime.rs (1)
  • run (162-361)
crates/basilica-api/src/server.rs (1)
  • run (1188-1226)
orchestrator/cluster-manager/src/clustermgr/commands/deployments.py (1)
  • deployments (103-213)
examples/training_example.py (2)
crates/basilica-sdk-python/python/basilica/training/types.py (4)
  • Datum (17-44)
  • ModelInput (48-66)
  • SamplingParams (70-93)
  • from_ints (59-61)
crates/basilica-sdk-python/python/basilica/training/sampling_client.py (1)
  • compute_logprobs (128-158)
services/training-service/src/server.py (1)
services/training-service/src/backend.py (17)
  • ForwardBackwardResult (70-75)
  • ForwardResult (62-66)
  • LoraConfiguration (16-32)
  • OptimizerConfiguration (36-44)
  • SampleResult (79-85)
  • TrainingBackend (88-608)
  • list_sessions (579-581)
  • create_session (171-232)
  • get_session_status (561-571)
  • delete_session (573-577)
  • forward (302-349)
  • compute_logprobs (351-389)
  • forward_backward (234-300)
  • optim_step (391-414)
  • sample (416-478)
  • save_state (480-529)
  • load_state (531-559)
crates/basilica-api/src/api/routes/training.rs (4)
services/training-service/src/server.py (6)
  • create_session (213-244)
  • forward_backward (323-352)
  • optim_step (356-362)
  • sample (369-390)
  • forward (271-295)
  • compute_logprobs (299-319)
crates/basilica-operator/src/crd/training_session.rs (5)
  • default (34-41)
  • default (80-86)
  • default (145-151)
  • new (316-322)
  • new (374-391)
services/training-service/src/backend.py (6)
  • create_session (171-232)
  • forward_backward (234-300)
  • optim_step (391-414)
  • sample (416-478)
  • forward (302-349)
  • compute_logprobs (351-389)
crates/basilica-api/src/k8s/trait.rs (1)
  • kube_client (7-7)
🪛 Ruff (0.14.8)
crates/basilica-sdk-python/python/basilica/training/training_client.py

126-126: Avoid specifying long messages outside the exception class

(TRY003)


155-155: Avoid specifying long messages outside the exception class

(TRY003)


215-215: Avoid specifying long messages outside the exception class

(TRY003)


280-280: Avoid specifying long messages outside the exception class

(TRY003)


305-305: Avoid specifying long messages outside the exception class

(TRY003)


326-326: Avoid specifying long messages outside the exception class

(TRY003)


346-346: Avoid specifying long messages outside the exception class

(TRY003)


366-366: Avoid specifying long messages outside the exception class

(TRY003)


408-408: Avoid specifying long messages outside the exception class

(TRY003)


453-453: Avoid specifying long messages outside the exception class

(TRY003)


461-461: Avoid specifying long messages outside the exception class

(TRY003)

🪛 Shellcheck (0.11.0)
scripts/local-training-e2e.sh

[warning] 422-422: i appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 477-477: Use single quotes, otherwise this expands now rather than when signalled.

(SC2064)

🔇 Additional comments (12)
crates/basilica-api/src/api/middleware/scope.rs (1)

180-242: LGTM! Training scope mappings are well-structured.

The scope mappings for training endpoints follow consistent patterns and provide appropriate granularity (training:create, training:list, training:view, training:use, training:delete). The distinction between public CRD management routes and internal proxy routes is clear.

services/training-service/src/server.py (1)

1-421: LGTM! Clean FastAPI implementation.

The server implementation follows FastAPI best practices with:

  • Proper lifespan management for backend initialization
  • Well-structured Pydantic models with validation constraints
  • Consistent error handling (ValueError → 404, generic Exception → 500)
  • Appropriate use of structured logging
  • Correct tensor conversions for backend operations

The code is clean, maintainable, and production-ready.

scripts/local-training-e2e.sh (10)

1-39: LGTM!

Good use of strict error handling (set -euo pipefail), clear documentation, and well-organized logging helpers.


67-96: LGTM!

The cluster creation logic is well-structured with proper reuse detection and readiness checks.


129-183: LGTM!

Image building, loading, and operator deployment are well-structured with proper readiness checks.


185-211: LGTM!

Excellent handling of stale Postgres data and robust retry logic with clear failure messaging.


252-286: LGTM!

The API startup and key generation workflows are clear and well-documented with helpful user guidance.


288-333: LGTM!

Comprehensive cleanup logic that handles both CRD-managed and orphaned resources with proper waiting for termination.


470-477: Trap cleanup is correct, but shellcheck prefers explicit quoting.

The trap on Line 477 correctly expands $PF_PID at trap-set time (after Line 473 sets it). Shellcheck warns (SC2064) because double quotes can sometimes cause unintended expansions, but this usage is safe.

For clarity and to silence the warning, you could make the intent explicit:

Optional: Explicit quoting style
-    trap "kill $PF_PID 2>/dev/null" EXIT
+    # shellcheck disable=SC2064
+    trap "kill $PF_PID 2>/dev/null" EXIT

Or capture the PID in a way that makes the intent clearer:

readonly PF_PID=$!
trap "kill ${PF_PID} 2>/dev/null" EXIT

630-697: LGTM!

Utility functions for status display, log tailing, and cluster teardown are well-organized and handle errors gracefully.


699-791: LGTM!

Excellent documentation in the help text and clean command dispatcher with good error handling for unknown commands.


420-430: Shellcheck warning about unused loop variable is harmless.

Shellcheck warns (SC2034) that the loop variable i on Line 422 is unused. This is a common Bash pattern for counting iterations and is perfectly fine. The warning can be safely ignored, or you can suppress it with a comment if desired:

# shellcheck disable=SC2034
for i in {1..60}; do

Comment on lines +119 to +124
# Wait for Envoy Gateway
log_info "Waiting for Envoy Gateway to be ready..."
sleep 10
kubectl wait --namespace envoy-gateway-system \
--for=condition=Available deployment/envoy-gateway \
--timeout=180s || log_warn "Timeout waiting for envoy-gateway, continuing..."

⚠️ Potential issue | 🟠 Major

Verify gateway readiness or fail explicitly.

The timeout warning on Line 124 allows the script to continue even if the Envoy Gateway deployment never becomes available. This could cause cryptic failures in downstream operations (like deploying HTTPRoutes or testing).

🔎 Consider failing explicitly or adding a retry:
     kubectl wait --namespace envoy-gateway-system \
         --for=condition=Available deployment/envoy-gateway \
-        --timeout=180s || log_warn "Timeout waiting for envoy-gateway, continuing..."
+        --timeout=180s || {
+            log_error "Envoy Gateway failed to become ready"
+            exit 1
+        }

Alternatively, if you want to be lenient in dev scenarios, add a note that downstream failures may occur:

     kubectl wait --namespace envoy-gateway-system \
         --for=condition=Available deployment/envoy-gateway \
-        --timeout=180s || log_warn "Timeout waiting for envoy-gateway, continuing..."
+        --timeout=180s || log_warn "Timeout waiting for envoy-gateway. Continuing, but downstream operations may fail."
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Wait for Envoy Gateway
log_info "Waiting for Envoy Gateway to be ready..."
sleep 10
kubectl wait --namespace envoy-gateway-system \
--for=condition=Available deployment/envoy-gateway \
--timeout=180s || log_warn "Timeout waiting for envoy-gateway, continuing..."
# Wait for Envoy Gateway
log_info "Waiting for Envoy Gateway to be ready..."
sleep 10
kubectl wait --namespace envoy-gateway-system \
--for=condition=Available deployment/envoy-gateway \
--timeout=180s || {
log_error "Envoy Gateway failed to become ready"
exit 1
}
Suggested change
# Wait for Envoy Gateway
log_info "Waiting for Envoy Gateway to be ready..."
sleep 10
kubectl wait --namespace envoy-gateway-system \
--for=condition=Available deployment/envoy-gateway \
--timeout=180s || log_warn "Timeout waiting for envoy-gateway, continuing..."
# Wait for Envoy Gateway
log_info "Waiting for Envoy Gateway to be ready..."
sleep 10
kubectl wait --namespace envoy-gateway-system \
--for=condition=Available deployment/envoy-gateway \
--timeout=180s || log_warn "Timeout waiting for envoy-gateway. Continuing, but downstream operations may fail."
🤖 Prompt for AI Agents
In scripts/local-training-e2e.sh around lines 119 to 124, the kubectl wait
fallback currently only logs a warning on timeout and lets the script continue,
which can cause downstream cryptic failures; modify this block to fail
explicitly on timeout (exit non-zero) so the pipeline stops when the Envoy
Gateway is not Available, or implement a retry loop with a configurable retry
count/backoff and clear logging and finally exit non-zero if still unavailable;
optionally honor an environment variable (e.g., ALLOW_GATEWAY_WARN) to preserve
the current lenient behavior when set.

Comment on lines +354 to +360
# Health check
log_info "Checking API health..."
if ! curl -sf "$API_URL/health" > /dev/null; then
log_error "API not running. Start it with: $0 api"
exit 1
fi
curl -s "$API_URL/health" | jq .

⚠️ Potential issue | 🟡 Minor

Add timeout to curl health check.

The health check curl on Line 356 has no timeout, which could cause the script to hang indefinitely if the API is unresponsive.

🔎 Add a timeout:
-    if ! curl -sf "$API_URL/health" > /dev/null; then
+    if ! curl -sf --max-time 10 "$API_URL/health" > /dev/null; then
         log_error "API not running. Start it with: $0 api"
         exit 1
     fi
📝 Committable suggestion


Suggested change
# Health check
log_info "Checking API health..."
if ! curl -sf "$API_URL/health" > /dev/null; then
log_error "API not running. Start it with: $0 api"
exit 1
fi
curl -s "$API_URL/health" | jq .
# Health check
log_info "Checking API health..."
if ! curl -sf --max-time 10 "$API_URL/health" > /dev/null; then
log_error "API not running. Start it with: $0 api"
exit 1
fi
curl -s "$API_URL/health" | jq .
🤖 Prompt for AI Agents
In scripts/local-training-e2e.sh around lines 354 to 360, the curl health check
(line 356) lacks a timeout and can hang indefinitely; modify the curl invocation
to include a sensible timeout (for example --max-time 5 or -m 5) so the command
fails fast on unresponsive API, and keep the existing -s and -f flags; also
consider adding the same timeout to the subsequent curl that pipes to jq to
avoid hanging when printing the health response.

Comment on lines +483 to +488
if ! curl -sf "$TRAINING_URL/health" > /dev/null 2>&1; then
log_error "Training service not accessible via port-forward"
kill $PF_PID 2>/dev/null
return 1
fi
curl -s "$TRAINING_URL/health" | jq .

⚠️ Potential issue | 🟡 Minor

Add timeout to training service health check.

The curl on Line 483 has no timeout and could hang if the port-forward is established but the service is unresponsive.

🔎 Add a timeout:
-    if ! curl -sf "$TRAINING_URL/health" > /dev/null 2>&1; then
+    if ! curl -sf --max-time 10 "$TRAINING_URL/health" > /dev/null 2>&1; then
         log_error "Training service not accessible via port-forward"
         kill $PF_PID 2>/dev/null
         return 1
     fi
📝 Committable suggestion


Suggested change
if ! curl -sf "$TRAINING_URL/health" > /dev/null 2>&1; then
log_error "Training service not accessible via port-forward"
kill $PF_PID 2>/dev/null
return 1
fi
curl -s "$TRAINING_URL/health" | jq .
if ! curl -sf --max-time 10 "$TRAINING_URL/health" > /dev/null 2>&1; then
log_error "Training service not accessible via port-forward"
kill $PF_PID 2>/dev/null
return 1
fi
curl -s "$TRAINING_URL/health" | jq .
🤖 Prompt for AI Agents
In scripts/local-training-e2e.sh around lines 483 to 488, the health check curl
calls lack timeouts and can hang if the port-forward exists but the service is
unresponsive; update both curl invocations to include a reasonable timeout
(e.g., --max-time 5 and/or --connect-timeout 2) and retain the existing error
handling so the first timed-out check triggers log_error, kills the PF_PID, and
returns 1; ensure the follow-up curl that pipes to jq also uses a timeout so the
script never blocks indefinitely.

Comment on lines +495 to +519
RESPONSE=$(curl -s -X POST "$TRAINING_URL/sessions" \
-H "Content-Type: application/json" \
-d "{
\"session_id\": \"$INTERNAL_SESSION\",
\"base_model\": \"facebook/opt-125m\",
\"lora_config\": {
\"rank\": 8,
\"alpha\": 16,
\"dropout\": 0.05
},
\"optimizer_config\": {
\"learning_rate\": 0.0001,
\"weight_decay\": 0.01
}
}")
echo "$RESPONSE" | jq .

if echo "$RESPONSE" | jq -e '.session_id' > /dev/null 2>&1; then
log_info "Training session created successfully"
else
log_error "Failed to create training session"
echo "$RESPONSE"
kill $PF_PID 2>/dev/null
return 1
fi

⚠️ Potential issue | 🟡 Minor

Add timeout to training session creation.

The curl on Line 495 creates a training session but has no timeout, risking a hang if the service is slow or unresponsive.

🔎 Add a timeout:
-    RESPONSE=$(curl -s -X POST "$TRAINING_URL/sessions" \
+    RESPONSE=$(curl -s --max-time 30 -X POST "$TRAINING_URL/sessions" \
         -H "Content-Type: application/json" \
         -d "{

Apply the same pattern to other training operation curls (lines 537, 558, 577, 588, 606).

📝 Committable suggestion


Suggested change
RESPONSE=$(curl -s -X POST "$TRAINING_URL/sessions" \
-H "Content-Type: application/json" \
-d "{
\"session_id\": \"$INTERNAL_SESSION\",
\"base_model\": \"facebook/opt-125m\",
\"lora_config\": {
\"rank\": 8,
\"alpha\": 16,
\"dropout\": 0.05
},
\"optimizer_config\": {
\"learning_rate\": 0.0001,
\"weight_decay\": 0.01
}
}")
echo "$RESPONSE" | jq .
if echo "$RESPONSE" | jq -e '.session_id' > /dev/null 2>&1; then
log_info "Training session created successfully"
else
log_error "Failed to create training session"
echo "$RESPONSE"
kill $PF_PID 2>/dev/null
return 1
fi
RESPONSE=$(curl -s --max-time 30 -X POST "$TRAINING_URL/sessions" \
-H "Content-Type: application/json" \
-d "{
\"session_id\": \"$INTERNAL_SESSION\",
\"base_model\": \"facebook/opt-125m\",
\"lora_config\": {
\"rank\": 8,
\"alpha\": 16,
\"dropout\": 0.05
},
\"optimizer_config\": {
\"learning_rate\": 0.0001,
\"weight_decay\": 0.01
}
}")
echo "$RESPONSE" | jq .
if echo "$RESPONSE" | jq -e '.session_id' > /dev/null 2>&1; then
log_info "Training session created successfully"
else
log_error "Failed to create training session"
echo "$RESPONSE"
kill $PF_PID 2>/dev/null
return 1
fi
🤖 Prompt for AI Agents
In scripts/local-training-e2e.sh around lines 495 to 519, the curl POST that
creates the training session has no timeout and can hang; add a total request
timeout (e.g. --max-time 30) to the curl invocation to fail fast on
slow/unresponsive services, handle the non-zero exit by checking curl success
before parsing JSON, and apply the same --max-time timeout change to the other
training-related curl calls at lines 537, 558, 577, 588, and 606 so all training
operations use a consistent timeout.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 7

♻️ Duplicate comments (9)
crates/basilica-sdk-python/python/basilica/training/service_client.py (3)

37-74: Store timeout for async client initialization.

The timeout parameter is passed to the sync httpx.Client but not stored for later use. Line 399 in _get_async_client hardcodes timeout=300.0, ignoring the configured value.

🔎 Apply this diff to store and reuse timeout:
     def __init__(
         self,
         api_key: Optional[str] = None,
         endpoint: Optional[str] = None,
         timeout: float = 300.0,
     ):
         ...
+        self._timeout = timeout
         self._client = httpx.Client(
             base_url=self.endpoint,
             headers={"Authorization": f"Bearer {self.api_key}"},
             timeout=timeout,
         )
         self._async_client = None  # Lazy init

Then update _get_async_client to use self._timeout (see separate comment).

Based on past review comments.


350-378: Use specific exception types for better error handling.

Lines 364-365 and 370-371 raise generic TrainingError instead of the more specific SessionNotFoundError and SessionNotReadyError that are already imported.

🔎 Apply this diff to use specific exceptions:
+from .exceptions import (
+    TrainingError,
+    SessionTimeoutError,
+    SessionNotFoundError,
+    SessionNotReadyError,
+    AuthenticationError,
+    ValidationError,
+)

Then update the method:

         if resp.status_code == 404:
-            raise TrainingError(f"Session {session_id} not found")
+            raise SessionNotFoundError(session_id)
         if not resp.is_success:
             raise TrainingError(f"Failed to get session: {resp.text}")

         data = resp.json()
         if data.get("phase") != "ready":
-            raise TrainingError(f"Session not ready: {data.get('phase')}")
+            raise SessionNotReadyError(session_id, data.get("phase", "unknown"))

Based on past review comments.


393-401: Async client uses hardcoded timeout instead of configured value.

Line 399 hardcodes timeout=300.0, ignoring the timeout parameter passed to __init__. This creates inconsistency between sync and async clients.

🔎 Apply this diff (requires storing timeout in __init__):
     async def _get_async_client(self) -> httpx.AsyncClient:
         """Get or create async HTTP client."""
         if self._async_client is None:
             self._async_client = httpx.AsyncClient(
                 base_url=self.endpoint,
                 headers={"Authorization": f"Bearer {self.api_key}"},
-                timeout=300.0,
+                timeout=self._timeout,
             )
         return self._async_client

Based on past review comments.

crates/basilica-api/src/api/routes/training.rs (6)

121-129: Document accepted backend values.

The backend field lacks documentation about which values are accepted (e.g., "s3", "gcs"). Consider adding a doc comment or, better yet, using an enum with Deserialize to enforce valid values at deserialization time.

Based on past review comments.


399-399: Hardcoded enableBilling prevents user configuration.

Line 399 hardcodes "enableBilling": true, preventing users from opting out of billing. Consider adding an optional field to CreateSessionRequest with a sensible default.

Based on past review comments.


316-323: Namespace collision risk remains unaddressed.

The sanitization logic can create collisions (e.g., "user_123" and "user-123" both map to "u-user-123") and doesn't enforce Kubernetes' 63-character limit. Consider using a hash-based or base32-encoded approach for deterministic, collision-free namespace names.

Based on past review comments.
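The hash-suffix approach suggested above can be sketched in a few lines. This is a hypothetical illustration, not the project's code: `namespace_for` and the FNV-1a hash are stand-ins, chosen only because they need no external crates; any stable hash would serve the same purpose.

```rust
// Hypothetical sketch: deterministic, collision-resistant namespace names.
// FNV-1a is a stdlib-only stand-in for whatever hash the project prefers.
fn fnv1a(s: &str) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for b in s.bytes() {
        h ^= b as u64;
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    h
}

fn namespace_for(user_id: &str) -> String {
    // Sanitize for DNS-1123, then append a hash of the *original* id so
    // "user_123" and "user-123" cannot collide after sanitization.
    let sanitized: String = user_id
        .to_lowercase()
        .chars()
        .map(|c| if c.is_ascii_alphanumeric() { c } else { '-' })
        .collect();
    let hash = format!("{:08x}", fnv1a(user_id) as u32);
    // "u-" (2) + prefix (<=52) + "-" (1) + hash (8) stays within K8s' 63-char limit.
    let prefix: String = sanitized.chars().take(52).collect();
    format!("u-{}-{}", prefix.trim_end_matches('-'), hash)
}

fn main() {
    assert_ne!(namespace_for("user_123"), namespace_for("user-123"));
    assert!(namespace_for(&"x".repeat(200)).len() <= 63);
    println!("{}", namespace_for("user_123"));
}
```

Because the suffix is derived from the raw user id, the mapping stays deterministic across restarts while distinct ids can no longer alias each other.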


439-448: Fragile error detection via string matching.

Line 441 checks for "409" or "AlreadyExists" in the error string, which is brittle. Use typed error matching instead: matches!(e, kube::Error::Api(ref ae) if ae.code == 409) for robustness.

Based on past review comments.
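The typed pattern recommended in these comments can be shown with a minimal stand-in for kube::Error, so the snippet compiles without the kube crate; in the real handlers you would match on kube::Error::Api(ae) directly.

```rust
// Minimal stand-ins for kube::Error / kube::core::ErrorResponse so the
// pattern compiles without the kube crate; the `matches!` shape is identical.
#[derive(Debug)]
struct ErrorResponse {
    code: u16,
}

#[derive(Debug)]
enum KubeError {
    Api(ErrorResponse),
    Other(String),
}

// Typed status check: no string matching on Display output.
fn is_status(e: &KubeError, code: u16) -> bool {
    matches!(e, KubeError::Api(ae) if ae.code == code)
}

fn main() {
    let conflict = KubeError::Api(ErrorResponse { code: 409 });
    let transport = KubeError::Other("connection refused".into());
    assert!(is_status(&conflict, 409));
    assert!(!is_status(&transport, 409)); // transport errors never masquerade as HTTP statuses
}
```

Unlike e.to_string().contains("409"), this cannot false-positive on a "409" that merely appears somewhere in an error message body.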


514-525: Fragile error detection via string matching.

Line 515 uses string matching (e.to_string().contains("404")) to detect not-found errors. This should use typed error matching: matches!(e, kube::Error::Api(ref ae) if ae.code == 404).

Note: Past review marked this as addressed, but string matching is still present in the current code.


646-672: Confusing sentinel error pattern for empty list handling.

Lines 652-654 use ApiError::Internal { message: "empty".to_string() } as a sentinel value to signal an empty list, which is later checked at line 664. This makes the code harder to understand. Use explicit control flow or an Option instead.

Based on past review comments.
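The explicit-control-flow alternative can be sketched as follows; `newest_step` and the sample data are illustrative, not the handler's real types.

```rust
// Sketch: return Option instead of a sentinel ApiError::Internal { message: "empty" }.
// The empty-list case becomes an explicit branch at the call site.
fn newest_step(steps: &[u64]) -> Option<u64> {
    steps.iter().copied().max()
}

fn main() {
    match newest_step(&[100, 250, 175]) {
        Some(step) => println!("latest checkpoint at step {step}"),
        None => println!("no checkpoints yet"), // no magic "empty" error to check for later
    }
    assert_eq!(newest_step(&[]), None);
}
```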

🧹 Nitpick comments (5)
crates/basilica-sdk-python/python/basilica/training/rest_client.py (2)

212-228: Consider consistent response handling across list methods.

Line 226 returns resp.json() directly, while other list methods extract specific keys (e.g., list_training_runs extracts "runs", list_checkpoints extracts "checkpoints"). If the sessions endpoint returns a wrapper object with a "sessions" key, extracting it would maintain consistency.

🔎 View suggested fix if the API returns a wrapper:
 def _call():
     resp = self._client.get("/sessions", params={"limit": limit})
     if not resp.is_success:
         raise TrainingError(f"list_sessions failed: {resp.text}")
-    return resp.json()
+    return resp.json().get("sessions", [])

290-292: Consider waiting for in-flight operations during shutdown.

Line 292 calls shutdown(wait=False), which may abandon in-flight operations. For graceful cleanup, wait=True ensures pending futures complete before shutdown.

🔎 Apply this diff for graceful shutdown:
 def close(self):
     """Close the REST client."""
-    self._executor.shutdown(wait=False)
+    self._executor.shutdown(wait=True)
crates/basilica-api/src/api/middleware/scope.rs (1)

180-267: Consider more precise pattern matching for internal session operations.

The scope validation patterns are functional but could be more precise. For example, lines 192-196 use contains("/internal/"), which would match any path containing that substring, not just the intended /sessions/:session_id/internal/:internal_session_id pattern. While not a security risk (unmatched routes will 404), tighter patterns improve maintainability.

Example improvement:
// Instead of:
(&Method::GET, p) if p.starts_with("/sessions/") && p.contains("/internal/") => {
    Some("training:view".to_string())
}

// Consider:
(&Method::GET, p) if p.starts_with("/sessions/") && p.matches("/internal/").count() == 1 && !p.ends_with("/internal") => {
    Some("training:view".to_string())
}

However, given that invalid paths will be rejected at routing and all matched scopes are appropriate, this is a low-priority refinement.

crates/basilica-api/src/api/routes/training.rs (2)

750-825: Consider adding timeout for proxy requests.

The proxy implementation doesn't specify a timeout for the kube_client.request() call. If the training service is slow or hangs, this could block the API server thread. Consider adding a timeout and/or documenting that the kube client's default timeout will apply.


1459-1579: Placeholder implementations for checkpoint operations.

Several checkpoint management functions (download URL, delete, publish, unpublish) contain TODO comments and placeholder implementations (lines 1479, 1511-1520, 1543-1546, 1574-1575). While acceptable for an MVP, these should be tracked.

Do you want me to open issues to track the completion of these checkpoint operations, or are they already tracked elsewhere?

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between eabae23 and 430c5b4.

📒 Files selected for processing (6)
  • crates/basilica-api/src/api/middleware/scope.rs (1 hunks)
  • crates/basilica-api/src/api/mod.rs (1 hunks)
  • crates/basilica-api/src/api/routes/training.rs (1 hunks)
  • crates/basilica-sdk-python/python/basilica/training/__init__.py (1 hunks)
  • crates/basilica-sdk-python/python/basilica/training/rest_client.py (1 hunks)
  • crates/basilica-sdk-python/python/basilica/training/service_client.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
crates/basilica-api/src/api/mod.rs (1)
crates/basilica-api/src/api/routes/training.rs (19)
  • create_session (410-498)
  • list_sessions (633-737)
  • get_session (501-580)
  • delete_session (586-630)
  • forward_backward (884-908)
  • optim_step (912-935)
  • sample (939-963)
  • forward (1023-1047)
  • compute_logprobs (1051-1075)
  • save_for_sampler (1079-1103)
  • get_capabilities (1109-1129)
  • list_training_runs (1133-1203)
  • get_training_run_by_path (1256-1312)
  • get_training_run (1207-1252)
  • list_checkpoints (1318-1390)
  • get_checkpoint_info (1394-1457)
  • publish_checkpoint (1528-1555)
  • unpublish_checkpoint (1559-1579)
  • delete_checkpoint (1495-1524)
crates/basilica-sdk-python/python/basilica/training/__init__.py (2)
crates/basilica-sdk-python/python/basilica/training/types.py (8)
  • Datum (17-44)
  • ModelInput (48-66)
  • SamplingParams (70-93)
  • SampleResponse (100-106)
  • ForwardBackwardResult (110-115)
  • ForwardResult (119-123)
  • GetServerCapabilitiesResponse (127-132)
  • APIFuture (138-184)
crates/basilica-sdk-python/python/basilica/training/exceptions.py (1)
  • TrainingError (8-14)
crates/basilica-api/src/api/routes/training.rs (4)
crates/basilica-api/src/server.rs (2)
  • std (1246-1246)
  • new (559-1165)
services/training-service/src/server.py (13)
  • LoraConfigRequest (53-59)
  • OptimizerConfigRequest (62-70)
  • CreateSessionRequest (73-80)
  • CreateSessionResponse (83-87)
  • SessionStatusResponse (179-187)
  • create_session (213-244)
  • get_session (248-254)
  • delete_session (258-264)
  • forward_backward (323-352)
  • optim_step (356-362)
  • sample (369-390)
  • forward (271-295)
  • compute_logprobs (299-319)
services/training-service/src/backend.py (7)
  • create_session (171-232)
  • delete_session (573-577)
  • forward_backward (234-300)
  • optim_step (391-414)
  • sample (416-478)
  • forward (302-349)
  • compute_logprobs (351-389)
crates/basilica-api/src/k8s/helpers.rs (1)
  • arr (9-11)
crates/basilica-sdk-python/python/basilica/training/rest_client.py (2)
crates/basilica-sdk-python/python/basilica/training/types.py (2)
  • APIFuture (138-184)
  • result_async (169-176)
crates/basilica-sdk-python/python/basilica/training/exceptions.py (1)
  • TrainingError (8-14)
🪛 Ruff (0.14.8)
crates/basilica-sdk-python/python/basilica/training/__init__.py

70-98: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

crates/basilica-sdk-python/python/basilica/training/service_client.py

65-67: Avoid specifying long messages outside the exception class

(TRY003)


155-157: Avoid specifying long messages outside the exception class

(TRY003)


191-191: Avoid specifying long messages outside the exception class

(TRY003)


200-200: Avoid specifying long messages outside the exception class

(TRY003)


208-208: Avoid specifying long messages outside the exception class

(TRY003)


236-236: Avoid specifying long messages outside the exception class

(TRY003)


328-328: Avoid specifying long messages outside the exception class

(TRY003)


365-365: Avoid specifying long messages outside the exception class

(TRY003)


367-367: Avoid specifying long messages outside the exception class

(TRY003)


371-371: Avoid specifying long messages outside the exception class

(TRY003)


389-389: Avoid specifying long messages outside the exception class

(TRY003)

crates/basilica-sdk-python/python/basilica/training/rest_client.py

53-53: Avoid specifying long messages outside the exception class

(TRY003)


71-71: Avoid specifying long messages outside the exception class

(TRY003)


89-89: Avoid specifying long messages outside the exception class

(TRY003)


115-115: Avoid specifying long messages outside the exception class

(TRY003)


133-133: Avoid specifying long messages outside the exception class

(TRY003)


151-151: Avoid specifying long messages outside the exception class

(TRY003)


168-168: Avoid specifying long messages outside the exception class

(TRY003)


188-188: Avoid specifying long messages outside the exception class

(TRY003)


206-206: Avoid specifying long messages outside the exception class

(TRY003)


225-225: Avoid specifying long messages outside the exception class

(TRY003)


243-243: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (5)
crates/basilica-sdk-python/python/basilica/training/__init__.py (1)

70-98: Current __all__ organization is clear; ignore the static-analysis sorting hint.

Static analysis suggests alphabetically sorting __all__, but the current logical grouping (version, clients, types, exceptions) is more maintainable and readable than alphabetical order.

crates/basilica-api/src/api/routes/training.rs (1)

461-486: Good: Proper typed error matching for 409 Conflict.

Line 469 correctly uses kube::Error::Api(ae) if ae.code == 409 for typed error matching, which is more robust than string-based checks.

crates/basilica-api/src/api/mod.rs (3)

105-113: LGTM! Session lifecycle endpoints follow RESTful conventions.

The session lifecycle routes are well-structured and follow standard CRUD patterns under the protected router.


155-172: LGTM! Capabilities and training runs endpoints are well-structured.

The capabilities and training runs management routes follow clear RESTful patterns. The "Phase 3" comments provide helpful context for the phased rollout.


114-154: All training operation handlers are properly implemented in crates/basilica-api/src/api/routes/training.rs. The referenced handlers exist:

  • create_internal_session (line 829)
  • get_internal_session (line 857)
  • forward_backward (line 884)
  • optim_step (line 912)
  • sample (line 939)
  • save_checkpoint (line 967)
  • load_checkpoint (line 995)
  • forward (line 1023)
  • compute_logprobs (line 1051)
  • save_for_sampler (line 1079)

No issues found.

Comment on lines +1146 to +1164
let sessions = match api.list(&ListParams::default().labels(&format!("user={}", auth.user_id))).await {
Ok(list) => list,
Err(e) => {
if e.to_string().contains("404") || e.to_string().contains("NotFound") {
apimetrics::record_request("training.runs.list", "GET", start, true);
return Ok(Json(ListTrainingRunsResponse {
runs: vec![],
total: 0,
offset: params.offset,
limit: params.limit,
}));
}
error!(error = %e, "Failed to list training runs");
apimetrics::record_request("training.runs.list", "GET", start, false);
return Err(ApiError::Internal {
message: format!("Failed to list training runs: {}", e),
});
}
};

🛠️ Refactor suggestion | 🟠 Major

Use typed error matching for 404 detection.

Line 1149 uses string matching to detect 404 errors. Use kube::Error::Api(ae) if ae.code == 404 instead for more robust error handling.

🤖 Prompt for AI Agents
In crates/basilica-api/src/api/routes/training.rs around lines 1146 to 1164, the
code currently detects a 404 by matching substrings on the error string; change
this to typed error matching using kube::Error::Api(ae) if ae.code == 404 in the
match arm that handles Err(e). Specifically, add a match arm that matches
kube::Error::Api(ae) if ae.code == 404 and in that arm record the successful
metrics and return the empty ListTrainingRunsResponse as currently done for
404s; leave the existing error logging, metrics failure recording, and
ApiError::Internal return in the fallback arm for all other error variants.
Ensure the match uses the typed kube::Error pattern rather than string
comparisons.

Comment on lines +1219 to +1230
let session = api.get(&run_id).await.map_err(|e| {
if e.to_string().contains("404") || e.to_string().contains("NotFound") {
ApiError::NotFound {
message: format!("Training run {} not found", run_id),
}
} else {
error!(error = %e, run_id = %run_id, "Failed to get training run");
ApiError::Internal {
message: format!("Failed to get training run: {}", e),
}
}
})?;

🛠️ Refactor suggestion | 🟠 Major

Use typed error matching for 404 detection.

Line 1220 uses string matching to detect 404 errors. Apply the same typed error matching pattern recommended elsewhere: kube::Error::Api(ae) if ae.code == 404.

🤖 Prompt for AI Agents
In crates/basilica-api/src/api/routes/training.rs around lines 1219 to 1230, the
code detects a 404 by string-matching on the error; replace that with typed
matching on kube::Error::Api(ae) if ae.code == 404. Update the map_err closure
to pattern-match the error (or use matches/if let) against kube::Error::Api(ae)
with ae.code == 404 and return ApiError::NotFound in that branch; for all other
errors keep the existing error! log (including the error and run_id) and return
ApiError::Internal with the error string. Ensure the pattern match compiles with
the error type returned by api.get and preserve existing logging and messages.

Comment on lines +1279 to +1290
let session = api.get(&session_id).await.map_err(|e| {
if e.to_string().contains("404") || e.to_string().contains("NotFound") {
ApiError::NotFound {
message: format!("Training run for path {} not found", params.path),
}
} else {
error!(error = %e, path = %params.path, "Failed to get training run by path");
ApiError::Internal {
message: format!("Failed to get training run: {}", e),
}
}
})?;

🛠️ Refactor suggestion | 🟠 Major

Use typed error matching for 404 detection.

Line 1280 uses string matching to detect 404 errors. Use kube::Error::Api(ae) if ae.code == 404 for consistency with best practices.

🤖 Prompt for AI Agents
In crates/basilica-api/src/api/routes/training.rs around lines 1279 to 1290, the
closure currently detects 404 by string-matching the error message; replace that
with a typed match on the kube error variant (match e { kube::Error::Api(ae) if
ae.code == 404 => ApiError::NotFound { message: format!("Training run for path
{} not found", params.path) }, other => { error!(error = %other, path =
%params.path, "Failed to get training run by path"); ApiError::Internal {
message: format!("Failed to get training run: {}", other) } } }) so you use
kube::Error::Api(ae).code == 404 for 404 detection and preserve logging and
internal error construction for other cases.

Comment on lines +1332 to +1361
match api.get(run_id).await {
Ok(s) => vec![s],
Err(e) => {
if e.to_string().contains("404") || e.to_string().contains("NotFound") {
vec![]
} else {
error!(error = %e, run_id = %run_id, "Failed to get session for checkpoints");
apimetrics::record_request("training.checkpoints.list", "GET", start, false);
return Err(ApiError::Internal {
message: format!("Failed to list checkpoints: {}", e),
});
}
}
}
} else {
match api.list(&ListParams::default().labels(&format!("user={}", auth.user_id))).await {
Ok(list) => list.items,
Err(e) => {
if e.to_string().contains("404") || e.to_string().contains("NotFound") {
vec![]
} else {
error!(error = %e, "Failed to list sessions for checkpoints");
apimetrics::record_request("training.checkpoints.list", "GET", start, false);
return Err(ApiError::Internal {
message: format!("Failed to list checkpoints: {}", e),
});
}
}
}
};

🛠️ Refactor suggestion | 🟠 Major

Use typed error matching for 404 detection.

Lines 1335 and 1350 use string matching to detect 404 errors. Consistently apply typed error matching: kube::Error::Api(ae) if ae.code == 404.

🤖 Prompt for AI Agents
In crates/basilica-api/src/api/routes/training.rs around lines 1332 to 1361, the
code detects 404s by matching error.to_string() contents; change these to typed
matches against kube::Error::Api(ae) with ae.code == 404 in both error arms.
Replace the if e.to_string().contains(...) checks with a match (or if let) on
Err(e) => match e { kube::Error::Api(ae) if ae.code == 404 => vec![], _ => {
error!(error = %e, ...); apimetrics::record_request(...); return
Err(ApiError::Internal { message: format!(...) }); } } and ensure kube::Error is
in scope (add use kube::Error if needed).

Comment on lines +1434 to +1450
if let Some(cp_array) = status.get("checkpointDetails").and_then(|v| v.as_array()) {
for cp in cp_array {
let cp_path = cp.get("path").and_then(|v| v.as_str()).unwrap_or("");
if cp_path.contains(&checkpoint_name) || cp_path == params.path {
apimetrics::record_request("training.checkpoints.info", "GET", start, true);
return Ok(Json(CheckpointResponse {
checkpoint_id: cp.get("id").and_then(|v| v.as_str()).unwrap_or("").to_string(),
run_id: session_id,
path: cp_path.to_string(),
step: cp.get("step").and_then(|v| v.as_u64()).unwrap_or(0),
size_bytes: cp.get("sizeBytes").and_then(|v| v.as_u64()).unwrap_or(0),
created_at: cp.get("createdAt").and_then(|v| v.as_str()).unwrap_or("").to_string(),
is_published: cp.get("isPublished").and_then(|v| v.as_bool()).unwrap_or(false),
public_url: cp.get("publicUrl").and_then(|v| v.as_str()).map(String::from),
}));
}
}
⚠️ Potential issue | 🟡 Minor

Loose checkpoint path matching may return wrong checkpoint.

Line 1437 uses cp_path.contains(&checkpoint_name) which could match unintended checkpoints. For example, if checkpoint_name is "step_100", it would also match "step_1000". Consider using exact path matching or more precise substring logic.

🤖 Prompt for AI Agents
In crates/basilica-api/src/api/routes/training.rs around lines 1434 to 1450, the
checkpoint match uses cp_path.contains(&checkpoint_name) which can produce false
positives (e.g., "step_100" matching "step_1000"); change the condition to a
precise match — either compare the full path equality (cp_path == params.path)
or compare the checkpoint filename/basename exactly to checkpoint_name (extract
the last path segment and compare == checkpoint_name), handling potential
trailing slashes, then keep the existing return block when a true exact match is
found.
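
The suggested exact-segment comparison can be sketched as follows (illustrated in Python for brevity — the route itself is Rust — and the helper name is hypothetical):

```python
from pathlib import PurePosixPath

def matches_checkpoint(cp_path: str, checkpoint_name: str, full_path: str) -> bool:
    """Match a checkpoint by exact full path or exact final path segment.

    Avoids the substring pitfall where "step_100" also matches "step_1000".
    Trailing slashes are normalized before comparing.
    """
    normalized = cp_path.rstrip("/")
    if full_path and normalized == full_path.rstrip("/"):
        return True
    return PurePosixPath(normalized).name == checkpoint_name

# "step_100" no longer matches "step_1000"
assert matches_checkpoint("runs/a/step_100", "step_100", "")
assert not matches_checkpoint("runs/a/step_1000", "step_100", "")
```

The same basename comparison translates directly to Rust via `std::path::Path::file_name`.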

Comment on lines +120 to +136
def get_checkpoint_archive_url(self, checkpoint_id: str) -> APIFuture:
"""Get signed download URL for checkpoint.

Args:
checkpoint_id: Checkpoint ID

Returns:
APIFuture resolving to presigned URL string
"""

def _call():
resp = self._client.get(f"/checkpoints/{checkpoint_id}/download_url")
if not resp.is_success:
raise TrainingError(f"get_checkpoint_archive_url failed: {resp.text}")
return resp.json()["url"]

return APIFuture(self._executor.submit(_call), str)
⚠️ Potential issue | 🟡 Minor

Add defensive key checking to prevent KeyError.

Line 134 accesses resp.json()["url"] without verifying the key exists. If the API response is malformed or the key is missing, this will raise a KeyError.

🔎 Apply this diff to add validation:
 def _call():
     resp = self._client.get(f"/checkpoints/{checkpoint_id}/download_url")
     if not resp.is_success:
         raise TrainingError(f"get_checkpoint_archive_url failed: {resp.text}")
-    return resp.json()["url"]
+    data = resp.json()
+    if "url" not in data:
+        raise TrainingError("Missing 'url' in response")
+    return data["url"]
🧰 Tools
🪛 Ruff (0.14.8)

133-133: Avoid specifying long messages outside the exception class

(TRY003)

Comment on lines +175 to +191
def publish_checkpoint(self, path: str) -> APIFuture:
"""Make checkpoint publicly accessible.

Args:
path: Checkpoint path to publish

Returns:
APIFuture resolving to public URL string
"""

def _call():
resp = self._client.post("/checkpoints/publish", json={"path": path})
if not resp.is_success:
raise TrainingError(f"publish_checkpoint failed: {resp.text}")
return resp.json()["public_url"]

return APIFuture(self._executor.submit(_call), str)
⚠️ Potential issue | 🟡 Minor

Add defensive key checking to prevent KeyError.

Line 189 accesses resp.json()["public_url"] without verifying the key exists, risking a KeyError if the response is malformed.

🔎 Apply this diff to add validation:
 def _call():
     resp = self._client.post("/checkpoints/publish", json={"path": path})
     if not resp.is_success:
         raise TrainingError(f"publish_checkpoint failed: {resp.text}")
-    return resp.json()["public_url"]
+    data = resp.json()
+    if "public_url" not in data:
+        raise TrainingError("Missing 'public_url' in response")
+    return data["public_url"]
🧰 Tools
🪛 Ruff (0.14.8)

188-188: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
In crates/basilica-sdk-python/python/basilica/training/rest_client.py around
lines 175 to 191, the code blindly accesses resp.json()["public_url"] which can
raise KeyError or crash on invalid JSON; parse the JSON into a variable, check
that it is a dict and contains the "public_url" key, and if missing raise a
TrainingError that includes the response status, raw text and/or parsed body for
debugging; also catch JSON decode errors and raise TrainingError with
response.text so the APIFuture always surfaces a clear, defensive error rather
than a KeyError.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

♻️ Duplicate comments (8)
crates/basilica-api/src/api/routes/training.rs (4)

121-129: Add validation or documentation for storage backend values.

The backend field accepts any string. Consider adding validation for supported backends (e.g., "s3", "gcs", "r2") or at minimum document accepted values.

Based on past review comments flagging this same issue.
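
A minimal sketch of the suggested allow-list validation (in Python for brevity; the backend names are the ones suggested above, not a confirmed list):

```python
# Assumed set of supported backends, per the suggestion above.
SUPPORTED_BACKENDS = {"s3", "gcs", "r2"}

def validate_backend(backend: str) -> str:
    """Return the normalized backend name, or raise ValueError for unknown values."""
    normalized = backend.strip().lower()
    if normalized not in SUPPORTED_BACKENDS:
        raise ValueError(
            f"unsupported storage backend {backend!r}; "
            f"expected one of {sorted(SUPPORTED_BACKENDS)}"
        )
    return normalized
```

In the Rust route this would typically be an enum deserialized with serde, which rejects unknown values for free.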


341-348: Namespace collision and length issues remain unaddressed.

This sanitization can cause collisions (e.g., "user_123" and "user-123" both become "u-user-123"). Additionally, K8s namespace names have a 63-character limit that isn't enforced.

Based on past review comments flagging namespace collision risk.


1329-1347: Use typed error matching for 404 detection.

Apply the same typed matching pattern: kube::Error::Api(ae) if ae.code == 404.

Based on past review comments.


1617-1633: Loose checkpoint path matching may return wrong checkpoint.

Using cp_path.contains(&checkpoint_name) can match unintended checkpoints (e.g., "step_100" matches "step_1000"). Use exact path matching or compare the final path segment.

Based on past review comments.

examples/training_example.py (4)

39-40: Resource leak: file handle not closed.

This issue was previously flagged. The file is opened without a context manager, leaving the handle unclosed.

🔎 Apply this diff to fix:
-    if os.path.exists("build/api-token.txt"):
-        api_key = open("build/api-token.txt").read().strip()
+    if os.path.exists("build/api-token.txt"):
+        with open("build/api-token.txt") as f:
+            api_key = f.read().strip()

Apply the same fix at lines 120-121, 151-152, and 185-186.


120-121: Resource leak: file handle not closed.

This is the same issue flagged earlier. Use a context manager to ensure the file is properly closed.

🔎 Apply the same fix as lines 39-40:
-    if os.path.exists("build/api-token.txt"):
-        api_key = open("build/api-token.txt").read().strip()
+    if os.path.exists("build/api-token.txt"):
+        with open("build/api-token.txt") as f:
+            api_key = f.read().strip()

151-152: Resource leak: file handle not closed.

Same resource leak issue. Apply the same fix as previous occurrences.

🔎 Apply the context manager fix:
-        if os.path.exists("build/api-token.txt"):
-            api_key = open("build/api-token.txt").read().strip()
+        if os.path.exists("build/api-token.txt"):
+            with open("build/api-token.txt") as f:
+                api_key = f.read().strip()

185-186: Resource leak: file handle not closed.

Fourth occurrence of the same resource leak. Apply the context manager fix here as well.

🔎 Apply the context manager fix:
-    if os.path.exists("build/api-token.txt"):
-        api_key = open("build/api-token.txt").read().strip()
+    if os.path.exists("build/api-token.txt"):
+        with open("build/api-token.txt") as f:
+            api_key = f.read().strip()
🧹 Nitpick comments (11)
crates/basilica-sdk-python/python/basilica/training/tests/test_clients.py (2)

142-145: Remove redundant os.environ.pop calls.

The patch.dict(os.environ, {}, clear=True) already clears all environment variables, making the subsequent pop calls unnecessary.

🔎 Simplify the test setup:
     def test_init_requires_api_key(self):
         """Test that ServiceClient requires API key."""
         with patch.dict(os.environ, {}, clear=True):
-            # Remove any environment variables
-            os.environ.pop("BASILICA_API_KEY", None)
-            os.environ.pop("BASILICA_API_TOKEN", None)
-
             with pytest.raises(ValueError, match="API key required"):
                 ServiceClient()

194-217: Consider verifying exact request parameters.

While the test confirms the method is called, asserting the exact parameters passed to mock_client.post would strengthen the test and catch parameter-related regressions.

Example enhancement:
# After line 217, add:
call_args = mock_client.post.call_args
assert call_args[1]['json']['data'][0]['input_ids'] == [1, 2, 3]
# Or use call_args_list to verify the exact call signature
crates/basilica-api/src/api/middleware/scope.rs (1)

180-268: Training scope mappings look well-structured.

The new training-related scope mappings follow the established pattern. The route matching logic correctly distinguishes between different operations (create, view, use, delete).

However, consider adding unit tests for the new training endpoints to maintain test coverage parity with existing rental and payment endpoint tests (visible in lines 283-454).

crates/basilica-api/src/api/routes/training.rs (5)

464-473: Use typed error matching for namespace creation.

Line 466 uses string matching (e.to_string().contains("409")) while the CRD creation at line 494 correctly uses typed matching. Apply the same pattern here for consistency and robustness.

🔎 Suggested fix:
     if let Err(e) = k8s_client.create_namespace(&namespace).await {
-        // Ignore "already exists" errors
-        if !e.to_string().contains("409") && !e.to_string().contains("AlreadyExists") {
+        // Ignore "already exists" errors using typed matching
+        // Note: This requires checking if k8s_client.create_namespace returns kube::Error
+        let is_conflict = e.to_string().contains("409") || e.to_string().contains("AlreadyExists");
+        if !is_conflict {
             error!(error = %e, namespace = %namespace, "Failed to create namespace");

Note: If create_namespace doesn't return kube::Error directly, you may need to update its signature or wrap the error type. Based on past review comments flagging fragile error detection.


562-573: Use typed error matching for 404 detection.

This uses string matching to detect 404 errors. Apply the typed error matching pattern used elsewhere: kube::Error::Api(ae) if ae.code == 404.

🔎 Suggested fix:
-    let original_session = api.get(original_session_id).await.map_err(|e| {
-        if e.to_string().contains("404") || e.to_string().contains("NotFound") {
+    let original_session = api.get(original_session_id).await.map_err(|e| match &e {
+        kube::Error::Api(ae) if ae.code == 404 => {
             ApiError::NotFound {
                 message: format!("Original session {} not found", original_session_id),
             }
-        } else {
+        }
+        _ => {
             error!(error = %e, path = %req.path, "Failed to get original session");
             ApiError::Internal {
                 message: format!("Failed to get original session: {}", e),
             }
         }
     })?;

697-708: Still uses string matching for 404 detection.

The past review marked this as "✅ Addressed in commit 0c1cf26" but the code still uses string matching. Use kube::Error::Api(ae) if ae.code == 404 for consistency with delete_session (line 796).


828-855: Sentinel error pattern still present.

The past review marked this as "✅ Addressed in commit 430c5b4" but the code still uses the "empty" sentinel string pattern. Refactor to use direct control flow:

🔎 Suggested fix:
-    let sessions = api
+    let sessions = match api
         .list(&ListParams::default().labels(&format!("user={}", auth.user_id)))
         .await
-        .map_err(|e| {
-            // If namespace doesn't exist, return empty list
-            if e.to_string().contains("404") || e.to_string().contains("NotFound") {
-                return ApiError::Internal {
-                    message: "empty".to_string(), // Sentinel for empty list
-                };
-            }
-            error!(error = %e, "Failed to list sessions");
-            ApiError::Internal {
-                message: format!("Failed to list sessions: {}", e),
-            }
-        });
-
-    let sessions = match sessions {
-        Ok(list) => list,
-        Err(e) if e.to_string().contains("empty") => {
+    {
+        Ok(list) => list,
+        Err(kube::Error::Api(ae)) if ae.code == 404 => {
             apimetrics::record_request("training.list", "GET", start, true);
             return Ok(Json(vec![]));
         }
         Err(e) => {
+            error!(error = %e, "Failed to list sessions");
             apimetrics::record_request("training.list", "GET", start, false);
-            return Err(e);
+            return Err(ApiError::Internal {
+                message: format!("Failed to list sessions: {}", e),
+            });
         }
     };

1290-1312: Consider making capabilities configurable.

Hardcoding model names and limits works for MVP, but consider loading these from configuration or a CRD to avoid code changes when adding new models or adjusting limits.

scripts/test-training-sdk.sh (1)

39-67: Consider checking for jq availability.

The script uses jq for JSON processing on lines 46, 56, and 66 without verifying it's installed. While jq is commonly available in development environments, adding a check would improve robustness and provide a clearer error message if it's missing.

🔎 Add this check after the venv activation block:
 echo ""
 
+# Check for jq
+if ! command -v jq &> /dev/null; then
+    echo "WARNING: jq not found, JSON output will be raw"
+    echo "Install with: apt-get install jq (or brew install jq on macOS)"
+fi
+
 # Check if API is running
examples/training_example.py (2)

47-57: Narrow the exception handler.

Catching bare Exception can mask unexpected errors including KeyboardInterrupt. For a production-quality example, consider catching specific exceptions or at minimum excluding KeyboardInterrupt and SystemExit.

🔎 Apply this diff to narrow the exception scope:
-    except Exception as e:
+    except (IOError, ValueError, RuntimeError) as e:
         print(f"Could not fetch capabilities: {e}")

Alternatively, if you want to catch all non-system exceptions:

-    except Exception as e:
+    except BaseException as e:
+        if isinstance(e, (KeyboardInterrupt, SystemExit)):
+            raise
         print(f"Could not fetch capabilities: {e}")

73-90: Consider adding error handling for educational value.

While the training loop is functionally correct, adding basic error handling around the .result() calls would make this example more production-ready and educational for users who will adapt this code.

🔎 Example error handling pattern:
try:
    for _ in range(5):
        result = training.forward_backward(
            [Datum(input_ids=train_tokens)],
            loss_fn="cross_entropy",
        ).result()
        
        step = training.optim_step().result()
        print(f"Step {step}: loss={result.loss:.4f}")
    
    checkpoint_path = training.save_state("checkpoint-final").result()
    print(f"\nCheckpoint saved: {checkpoint_path}")
except Exception as e:
    print(f"Training error: {e}")
    raise
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 430c5b4 and f0c655f.

📒 Files selected for processing (7)
  • crates/basilica-api/src/api/middleware/scope.rs (1 hunks)
  • crates/basilica-api/src/api/mod.rs (1 hunks)
  • crates/basilica-api/src/api/routes/training.rs (1 hunks)
  • crates/basilica-sdk-python/python/basilica/training/tests/__init__.py (1 hunks)
  • crates/basilica-sdk-python/python/basilica/training/tests/test_clients.py (1 hunks)
  • examples/training_example.py (1 hunks)
  • scripts/test-training-sdk.sh (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • crates/basilica-sdk-python/python/basilica/training/tests/__init__.py
🧰 Additional context used
🧬 Code graph analysis (1)
examples/training_example.py (4)
crates/basilica-sdk-python/python/basilica/training/types.py (5)
  • Datum (17-44)
  • ModelInput (48-66)
  • SamplingParams (70-93)
  • result (163-167)
  • from_ints (59-61)
crates/basilica-sdk-python/python/basilica/training/training_client.py (8)
  • session_id (66-68)
  • forward_backward (160-224)
  • step (76-78)
  • optim_step (246-285)
  • save_state (289-308)
  • forward (97-132)
  • compute_logprobs (134-158)
  • sample (422-455)
services/training-service/src/server.py (6)
  • forward_backward (323-352)
  • optim_step (356-362)
  • save_state (397-407)
  • forward (271-295)
  • compute_logprobs (299-319)
  • sample (369-390)
crates/basilica-sdk-python/python/basilica/training/sampling_client.py (2)
  • compute_logprobs (128-158)
  • sample (56-105)
🪛 Ruff (0.14.8)
examples/training_example.py

55-55: Do not catch blind exception: Exception

(BLE001)

🔇 Additional comments (23)
crates/basilica-sdk-python/python/basilica/training/tests/test_clients.py (7)

62-84: LGTM!

Well-structured tests covering the core Datum.to_dict() functionality with good coverage of optional fields.


87-102: LGTM!

Good test coverage with proper mock verification for tokenizer integration.


105-134: LGTM!

Comprehensive test coverage for SamplingParams including defaults, custom parameters, and serialization.


240-294: LGTM!

Well-structured tests with proper resource cleanup via close() calls. Good coverage of REST client operations.


297-312: LGTM!

Clean test verifying GetServerCapabilitiesResponse construction from dictionary data.


315-316: LGTM!

Convenient pattern for running tests directly during development.


51-59: The @pytest.mark.asyncio decorator is properly supported: pytest-asyncio>=0.21 is already listed in the project's dev dependencies.

crates/basilica-api/src/api/mod.rs (1)

104-201: LGTM! Training routes properly integrated.

The training routes are correctly added to the protected routes with appropriate handler mappings. The middleware ordering (scope validation after auth) ensures proper authorization flow.

crates/basilica-api/src/api/routes/training.rs (1)

1662-1674: Placeholder token in download URL.

The URL contains token=placeholder which provides no actual security. Ensure this is replaced with proper signed URL generation before production use.

Verify this endpoint is not exposed in production without proper signed URL implementation.
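
For illustration, a hand-rolled HMAC-signed URL sketch (an assumption-laden sketch, not the project's implementation — in practice the storage provider's presigning API, e.g. S3 presigned URLs, would be used, and every name here is hypothetical):

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

def sign_download_url(base_url: str, path: str, secret: bytes,
                      ttl_secs: int = 900, now: int = None) -> str:
    """Build a URL with an expiry and an HMAC token instead of a fixed placeholder."""
    expires = (int(time.time()) if now is None else now) + ttl_secs
    msg = f"{path}:{expires}".encode()
    token = hmac.new(secret, msg, hashlib.sha256).hexdigest()
    query = urlencode({"path": path, "expires": expires, "token": token})
    return f"{base_url}?{query}"

def verify_token(path: str, expires: int, token: str, secret: bytes, now: int) -> bool:
    """Reject expired or tampered tokens; constant-time compare avoids timing leaks."""
    if now > expires:
        return False
    msg = f"{path}:{expires}".encode()
    expected = hmac.new(secret, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, token)
```

Any scheme along these lines would replace `token=placeholder` with something the download handler can actually verify.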

scripts/test-training-sdk.sh (6)

1-4: LGTM: Proper error handling setup.

The set -e directive ensures the script fails fast on any command error, which is appropriate for an integration test script.


5-17: LGTM: Robust path resolution and optional venv activation.

The script correctly resolves paths and conditionally activates the virtual environment if present, providing good flexibility for different development setups.


19-27: LGTM: Robust health check with helpful error messages.

The health check properly validates API availability before proceeding and provides clear remediation steps if the API is not running.


29-37: LGTM: Secure token handling.

The script properly validates token file existence and truncates the token in output for security, which is a good practice.


69-73: LGTM: Straightforward example execution.

The Python example is executed correctly with the appropriate flag, relying on set -e for error propagation.


75-86: LGTM: Robust pytest execution with graceful degradation.

The script properly checks for pytest availability, provides installation instructions if missing, and gracefully continues even if some tests fail (appropriate for an integration test that may run in various environments).

examples/training_example.py (8)

1-33: LGTM: Clear documentation and appropriate imports.

The docstring provides excellent usage examples and API patterns. The sys.path manipulation is acceptable for an example script that needs to import from a local development SDK.


59-72: LGTM: Well-documented LoRA configuration.

The context manager usage is correct, and the inline comments clearly explain each LoRA parameter. The GPU count of 0 for CPU testing is appropriate for a portable example.


92-111: LGTM: Clear demonstration of logprobs computation.

The logprobs demo correctly uses the API to perform inference, compute per-token probabilities, and calculate sequence probability. The pattern is educational and correct.


122-142: LGTM: Clear demonstration of SamplingClient usage.

The sampling example correctly demonstrates using ModelInput.from_ints() for token-based prompts and SamplingParams for generation control. The API usage matches the expected patterns.


145-179: LGTM: Proper async/await pattern with cleanup.

The async example correctly demonstrates the async API with proper cleanup in the finally block. The use of try/finally ensures training.close() is called even if errors occur during training.


182-231: LGTM: Comprehensive RestClient demonstration.

The example effectively demonstrates REST API operations including listing runs/sessions/checkpoints and retrieving run details. Good defensive programming with .get() for dictionary access and proper cleanup with rest.close().


233-255: LGTM: Clean CLI dispatch.

The argument parsing and dispatch logic is straightforward and well-structured, providing clear entry points for different example modes.


112-114: The training.sample() convenience method exists and is correctly used in the example. No issue found.

Likely an incorrect or invalid review comment.

Comment on lines +651 to +667
match api.create(&PostParams::default(), &crd).await {
Ok(_) => {
info!(
session_id = %session_id,
namespace = %namespace,
resumed_from = %original_session_id,
"Created TrainingSession CRD from checkpoint"
);
}
Err(e) => {
error!(error = %e, "Failed to create TrainingSession CRD from checkpoint");
apimetrics::record_request("training.from_state", "POST", start, false);
return Err(ApiError::Internal {
message: format!("Failed to create session: {}", e),
});
}
}
⚠️ Potential issue | 🟡 Minor

Missing 409 conflict handling for CRD creation.

Unlike create_session (line 494), this doesn't handle the case where the session already exists. If a UUID collision occurs (unlikely but possible), this will return an internal error instead of a proper conflict response.

🔎 Suggested fix:
     match api.create(&PostParams::default(), &crd).await {
         Ok(_) => {
             info!(
                 session_id = %session_id,
                 namespace = %namespace,
                 resumed_from = %original_session_id,
                 "Created TrainingSession CRD from checkpoint"
             );
         }
+        Err(kube::Error::Api(ae)) if ae.code == 409 => {
+            warn!(session_id = %session_id, "TrainingSession already exists");
+            apimetrics::record_request("training.from_state", "POST", start, false);
+            return Err(ApiError::Conflict {
+                message: format!("Session {} already exists", session_id),
+            });
+        }
         Err(e) => {
🤖 Prompt for AI Agents
In crates/basilica-api/src/api/routes/training.rs around lines 651 to 667, the
create() error path should detect a Kubernetes "already exists"/409 conflict and
return a Conflict API error instead of treating every error as Internal; update
the Err(e) branch to inspect the error (e.g., match on kube/k8s error kind or
map to AlreadyExists/Conflict), and for conflict cases log an appropriate
warn/info, record the apimetrics success=false for "training.from_state", and
return ApiError::Conflict with a clear message; for all other errors keep the
existing error log, metrics recording, and return ApiError::Internal.

Comment on lines +981 to +994
// Execute the request through the kube client
let response = kube_client
.request::<serde_json::Value>(request)
.await
.map_err(|e| {
error!(
error = %e,
service = %service_name,
namespace = %namespace,
path = %path,
"Failed to proxy request to training service"
);
ApiError::ServiceUnavailable
})?;
⚠️ Potential issue | 🟠 Major

Missing timeout for proxy requests.

The K8s API server proxy request has no explicit timeout. If the training pod is unresponsive, this could cause requests to hang indefinitely, potentially exhausting connection pools or worker threads.

🔎 Consider adding a timeout:

You may need to wrap the request with a tokio timeout or configure the kube client with request timeouts:

use tokio::time::{timeout, Duration};

let response = timeout(
    Duration::from_secs(30),
    kube_client.request::<serde_json::Value>(request)
).await
.map_err(|_| ApiError::GatewayTimeout { 
    message: "Training service request timed out".to_string() 
})?
.map_err(|e| {
    // ... existing error handling
})?;
🤖 Prompt for AI Agents
In crates/basilica-api/src/api/routes/training.rs around lines 981 to 994, the
kube client proxy request has no timeout and can hang indefinitely; wrap the
kube_client.request call with tokio::time::timeout using a sensible Duration
(e.g., 30s), map a timeout result to ApiError::GatewayTimeout with an
explanatory message, and preserve the existing map_err handling for the inner
request error (log the error and return ApiError::ServiceUnavailable); also add
the necessary tokio::time::{timeout, Duration} import and adjust logs to
indicate when a timeout occurred.

Comment on lines +1676 to +1707
/// Delete a checkpoint.
/// DELETE /checkpoints/:checkpoint_id
pub async fn delete_checkpoint(
State(state): State<AppState>,
Extension(auth): Extension<AuthContext>,
Path(checkpoint_id): Path<String>,
) -> Result<StatusCode> {
let start = Instant::now();

info!(
user_id = %auth.user_id,
checkpoint_id = %checkpoint_id,
"Deleting checkpoint"
);

// Check that we have K8s access
let _k8s_client = state.k8s.as_ref().ok_or(ApiError::ServiceUnavailable)?;

// TODO: Actually delete the checkpoint from storage
// This would involve:
// 1. Finding the session that owns this checkpoint
// 2. Updating the session's checkpoint list
// 3. Deleting the checkpoint files from R2/S3

warn!(
checkpoint_id = %checkpoint_id,
"Checkpoint deletion not fully implemented - returning success"
);

apimetrics::record_request("training.checkpoints.delete", "DELETE", start, true);
Ok(StatusCode::NO_CONTENT)
}
⚠️ Potential issue | 🟠 Major

Delete checkpoint returns success without performing deletion.

This endpoint claims success (NO_CONTENT) without actually deleting anything. This could lead to data inconsistencies and user confusion when checkpoints aren't actually removed.

Consider either:

  1. Returning a 501 Not Implemented until fully implemented
  2. Implementing the actual deletion logic
  3. Documenting this as expected MVP behavior in the API response
🤖 Prompt for AI Agents
In crates/basilica-api/src/api/routes/training.rs around lines 1676 to 1707, the
delete_checkpoint handler currently returns 204 NO_CONTENT without performing
any deletion; either make this explicit or implement deletion. Fix by one of two
actions: (A) If not ready to delete, return StatusCode::NOT_IMPLEMENTED (501)
and log/warn that deletion is not implemented, keeping the k8s check and
metrics; or (B) implement the deletion flow: verify checkpoint ownership via
session lookup, remove checkpoint ID from the session record, delete checkpoint
files from the configured object storage (R2/S3) and any Kubernetes resources,
remove checkpoint metadata from DB, handle and log errors (return appropriate
status codes), and only record metrics and return 204 on success. Ensure auth
checks, error handling, and metrics remain consistent.

Comment on lines +10 to +11
# Add parent path for imports
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
🛠️ Refactor suggestion | 🟠 Major

Remove sys.path manipulation in favor of proper package installation.

Modifying sys.path directly is fragile and creates dependencies on the directory structure. Instead, install the package in editable mode (pip install -e .) or configure pytest's pythonpath in pytest.ini or pyproject.toml.

🔎 Remove these lines and configure proper package installation:
-# Add parent path for imports
-sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
-
 from basilica.training import (

Then ensure the package is installed via:

pip install -e crates/basilica-sdk-python

Or add to pytest.ini:

[pytest]
pythonpath = crates/basilica-sdk-python/python

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In crates/basilica-sdk-python/python/basilica/training/tests/test_clients.py
around lines 10 to 11, remove the sys.path manipulation (sys.path.insert(...))
and rely on proper package installation or pytest configuration instead; install
the package in editable mode from the repo root (pip install -e
crates/basilica-sdk-python) or add the package path to pytest configuration
(pytest.ini or pyproject.toml pythonpath) so tests can import the package
without altering sys.path.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 10

♻️ Duplicate comments (15)
services/training-service/README.md (1)

38-48: Add language specification to fenced code block.

Line 38 begins an example output code block without a language identifier, which fails markdown linting (MD040).

🔎 Apply this diff to specify the language:
 ### Example Output

-```
+```text
 === Creating session for facebook/opt-125m ===
 Session: ts-a1b2c3d4

 === Training (3 steps) ===
 Step 1: loss=6.2612
 Step 2: loss=5.7423
 Step 3: loss=5.7501

 Sample: The quick brown -> fox jumps over the lazy
-```
+```
scripts/local-training-e2e.sh (4)

119-124: Verify gateway readiness or fail explicitly.

The timeout warning on Line 124 allows the script to continue even if the Envoy Gateway deployment never becomes available. This could cause cryptic failures in downstream operations (like deploying HTTPRoutes or testing).

🔎 Consider failing explicitly:
     kubectl wait --namespace envoy-gateway-system \
         --for=condition=Available deployment/envoy-gateway \
-        --timeout=180s || log_warn "Timeout waiting for envoy-gateway, continuing..."
+        --timeout=180s || {
+            log_error "Envoy Gateway failed to become ready"
+            exit 1
+        }

568-574: Add timeout to curl health check.

The health check curl on Line 570 has no timeout, which could cause the script to hang indefinitely if the API is unresponsive.

🔎 Add a timeout:
-    if ! curl -sf "$API_URL/health" > /dev/null; then
+    if ! curl -sf --max-time 10 "$API_URL/health" > /dev/null; then
         log_error "API not running. Start it with: $0 api"
         exit 1
     fi
-    curl -s "$API_URL/health" | jq .
+    curl -s --max-time 10 "$API_URL/health" | jq .

695-702: Add timeout to training service health check.

The curl on Line 697 has no timeout and could hang if the port-forward is established but the service is unresponsive.

🔎 Add a timeout:
-    if ! curl -sf "$TRAINING_URL/health" > /dev/null 2>&1; then
+    if ! curl -sf --max-time 10 "$TRAINING_URL/health" > /dev/null 2>&1; then
         log_error "Training service not accessible via port-forward"
         kill $PF_PID 2>/dev/null
         return 1
     fi
-    curl -s "$TRAINING_URL/health" | jq .
+    curl -s --max-time 10 "$TRAINING_URL/health" | jq .

709-733: Add timeout to training session creation and subsequent curls.

The curl on Line 709 creates a training session but has no timeout, risking a hang if the service is slow or unresponsive. The same issue applies to other training operation curls throughout run_training_steps.

🔎 Add timeouts to all training curls:
-    RESPONSE=$(curl -s -X POST "$TRAINING_URL/sessions" \
+    RESPONSE=$(curl -s --max-time 30 -X POST "$TRAINING_URL/sessions" \
         -H "Content-Type: application/json" \

Apply --max-time 30 (or appropriate value) to all curl calls in this function: lines 751, 772, 787, 791, 802, 820, 841, 858, 864, 871, 874, 880, 897, 902, 921.
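Rather than repeating the flag on every call, the timeout could be applied through one wrapper; `tcurl` and `CURL_TIMEOUT` are hypothetical names introduced for this sketch, not part of the reviewed script:

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: one shared timeout for every curl call in the script,
# so a single unresponsive endpoint cannot hang the whole pipeline.
CURL_TIMEOUT="${CURL_TIMEOUT:-30}"

tcurl() {
    curl --max-time "$CURL_TIMEOUT" "$@"
}

# A health check now fails fast instead of blocking indefinitely.
if tcurl -sf "http://localhost:9090/health" > /dev/null 2>&1; then
    echo "healthy"
else
    echo "unreachable or timed out"
fi
```

Individual calls can still override the budget with `CURL_TIMEOUT=60 tcurl ...` where a longer operation (like session creation) warrants it.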

crates/basilica-operator/src/controllers/training_session_controller.rs (10)

203-212: Status update condition still misses endpoint and start_time changes.

The condition now checks tokens_processed (line 206) but still doesn't check endpoint or start_time. Changes to these fields won't trigger status updates as previously noted.


239-251: Still using is_err() without distinguishing NotFound from other errors.

As previously noted, this treats network/auth errors the same as "pod not found" and proceeds to create a pod.


253-265: Same is_err() issue for service existence check.

Apply the same fix pattern as recommended for the pod check.


314-321: Pod failure still doesn't transition session to Failed phase.

As previously noted, when the pod enters "Failed" state, the session phase should transition to Failed, not remain in Scheduling.


389-394: Inconsistent error propagation with other handlers.

As previously noted, this propagates errors with ? while handle_pending converts similar failures to Ok(status.with_error(...)).


453-455: Silently ignoring pod deletion result.

As previously noted, the let _ = discards errors from delete_pod. At minimum, log the outcome.


643-650: Owner reference UID still defaults to empty string if missing.

As previously noted, the UID should be validated and an error returned if missing, not defaulted to an empty string.


710-717: Same empty UID issue in owner reference.

As previously noted, apply the same fix pattern as recommended for build_training_pod.


756-795: HTTPRoute still missing owner reference for garbage collection.

As previously noted, without ownerReferences, this HTTPRoute will be orphaned when the TrainingSession is deleted.


819-826: Same empty UID issue in NetworkPolicy owner reference.

As previously noted, apply the same fix pattern as recommended for the other resource builders.

🧹 Nitpick comments (3)
scripts/local-training-e2e.sh (3)

58-62: Add timeout to k3d install curl.

The k3d install curl has no timeout and could hang indefinitely if the network is slow or unresponsive.

🔎 Suggested fix
     if ! command -v k3d &> /dev/null; then
         log_info "Installing k3d..."
-        curl -s https://raw.githubusercontent.com/k3d-io/k3d/main/install.sh | bash
+        curl -s --max-time 60 https://raw.githubusercontent.com/k3d-io/k3d/main/install.sh | bash
     fi

634-644: Unused loop variable.

The loop variable i is unused. Rename it to _ to signal an intentionally unused variable.

🔎 Suggested fix:
-    for i in {1..60}; do
+    for _ in {1..60}; do
         POD_STATUS=$(kubectl get pods -A -l app=basilica-training -o jsonpath='{.items[0].status.phase}' 2>/dev/null)

Based on static analysis hint SC2034.


973-985: Add timeout to storage mounts curl.

The curl on Line 980 has no timeout and could hang if the storage daemon is unresponsive, leaving the port-forward orphaned.

🔎 Add timeout:
         kubectl port-forward -n basilica-storage "pod/$DAEMON_POD" 9092:9090 &
         PF_PID=$!
         sleep 1
-        curl -sf "http://localhost:9092/mounts" 2>/dev/null | jq . 2>/dev/null || echo "No mounts or daemon not responding"
+        curl -sf --max-time 5 "http://localhost:9092/mounts" 2>/dev/null | jq . 2>/dev/null || echo "No mounts or daemon not responding"
         kill $PF_PID 2>/dev/null
         wait $PF_PID 2>/dev/null
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f0c655f and 73cdae1.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (5)
  • crates/basilica-api/Cargo.toml (1 hunks)
  • crates/basilica-operator/src/controllers/training_session_controller.rs (1 hunks)
  • scripts/local-training-e2e.sh (1 hunks)
  • services/training-service/README.md (1 hunks)
  • services/training-service/src/backend.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • services/training-service/src/backend.py
🧰 Additional context used
🧬 Code graph analysis (2)
scripts/local-training-e2e.sh (7)
orchestrator/cluster-manager/src/clustermgr/commands/kubeconfig.py (1)
  • kubeconfig (22-27)
crates/basilica-common/src/utils/docker_validation.rs (1)
  • image (38-39)
crates/basilica-common/src/error.rs (1)
  • generic (357-361)
crates/basilica-api/src/server.rs (1)
  • run (1188-1226)
crates/basilica-operator/src/runtime.rs (1)
  • run (162-361)
orchestrator/cluster-manager/src/clustermgr/commands/deployments.py (1)
  • deployments (103-213)
orchestrator/cluster-manager/src/clustermgr/commands/gateway.py (1)
  • gateway (296-302)
crates/basilica-operator/src/controllers/training_session_controller.rs (1)
crates/basilica-operator/src/crd/training_session.rs (5)
  • new (316-322)
  • new (374-391)
  • default (34-41)
  • default (80-86)
  • default (145-151)
🪛 markdownlint-cli2 (0.18.1)
services/training-service/README.md

38-38: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


128-128: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


200-200: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


253-253: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


404-404: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Shellcheck (0.11.0)
scripts/local-training-e2e.sh

[warning] 636-636: i appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 691-691: Use single quotes, otherwise this expands now rather than when signalled.

(SC2064)

🔇 Additional comments (7)
scripts/local-training-e2e.sh (6)

1-38: LGTM!

Good use of set -euo pipefail for strict error handling, and the color-coded logging helpers improve script readability.


67-96: LGTM!

Cluster creation logic is well-structured with proper port mappings, kubeconfig export, and node readiness wait.


129-150: LGTM!

Image build and load logic is clean with clear logging for each step.


152-176: LGTM!

Operator deployment is well-structured with proper kubectl wait and timeout.


399-433: LGTM!

Postgres setup with stale data detection and reset functionality is well-implemented.


1032-1088: LGTM!

Comprehensive help documentation clearly explains the workflow, commands, and test steps.

crates/basilica-operator/src/controllers/training_session_controller.rs (1)

872-982: Good test coverage for resource builders.

The tests comprehensively verify pod and service creation, including GPU resources, FUSE storage configuration, volume mounts, and command wrapping. This provides solid confidence in the resource builder functions.

Comment on lines +110 to +117
        if sessions.sessions.is_empty() {
            debug!("no active sessions in training service");
            return None;
        }

        // Get status of the first session (typically there's only one per pod)
        let session_id = &sessions.sessions[0];
        let status_url = format!("{}/sessions/{}", internal_endpoint, session_id);

⚠️ Potential issue | 🟡 Minor

Verify session name matches expected TrainingSession.

The code retrieves the first session from the list without confirming it matches the expected name. If the training service somehow returns multiple sessions or a different session, status updates could be applied to the wrong TrainingSession.

🔎 Suggested fix to validate session name
         if sessions.sessions.is_empty() {
             debug!("no active sessions in training service");
             return None;
         }
 
-        // Get status of the first session (typically there's only one per pod)
-        let session_id = &sessions.sessions[0];
+        // Get status of the first session, verify it matches expected name
+        let session_id = sessions.sessions.iter()
+            .find(|id| id.as_str() == name)
+            .or_else(|| sessions.sessions.first())?;
         let status_url = format!("{}/sessions/{}", internal_endpoint, session_id);
🤖 Prompt for AI Agents
In crates/basilica-operator/src/controllers/training_session_controller.rs
around lines 110 to 117, the code unconditionally uses the first session from
sessions.sessions which can misidentify the intended TrainingSession; instead,
search sessions.sessions for an entry that matches the expected `name` (or other
canonical identifier available in scope), select that session's id for status
checks, and only proceed when a matching session is found (otherwise log/debug
and return None). Ensure the lookup handles multiple sessions, uses the matched
session ID in the status_url, and retains clear logging when no matching session
exists.

Comment on lines +460 to +476
        // Poll training service for status updates
        let mut updated_status = status.clone();
        if let Some((steps, tokens)) = self.fetch_training_status(namespace, name).await {
            // Only update if there's actual progress
            if steps != status.steps_completed || tokens != status.tokens_processed {
                info!(
                    session = %name,
                    steps = steps,
                    tokens = tokens,
                    "updating training progress"
                );
                updated_status = updated_status.with_steps(steps, tokens);
            }
        }

        // Update last activity timestamp
        updated_status.last_updated = chrono::Utc::now().to_rfc3339();

⚠️ Potential issue | 🟡 Minor

Timestamp updated even when status fetch fails.

Line 476 unconditionally updates last_updated even when fetch_training_status returns None. This creates misleading timestamps suggesting the session was recently checked when no actual status was obtained.

🔎 Suggested fix to update timestamp only on successful fetch
         // Poll training service for status updates
         let mut updated_status = status.clone();
         if let Some((steps, tokens)) = self.fetch_training_status(namespace, name).await {
             // Only update if there's actual progress
             if steps != status.steps_completed || tokens != status.tokens_processed {
                 info!(
                     session = %name,
                     steps = steps,
                     tokens = tokens,
                     "updating training progress"
                 );
                 updated_status = updated_status.with_steps(steps, tokens);
             }
+            // Update last activity timestamp only when we successfully fetched status
+            updated_status.last_updated = chrono::Utc::now().to_rfc3339();
         }
 
-        // Update last activity timestamp
-        updated_status.last_updated = chrono::Utc::now().to_rfc3339();
-
         Ok(updated_status)
🤖 Prompt for AI Agents
In crates/basilica-operator/src/controllers/training_session_controller.rs
around lines 460 to 476, last_updated is set unconditionally even when
fetch_training_status returned None; move the updated_status.last_updated =
chrono::Utc::now().to_rfc3339() into the Some branch so the timestamp is only
updated when fetch_training_status succeeded (place it after updating
steps/tokens or alongside that update), ensuring last_updated changes only on
successful status fetch.

Comment on lines +500 to +501
    // Allow TRAINING_IMAGE env var to override the default image (for local dev)
    let image = std::env::var("TRAINING_IMAGE").unwrap_or_else(|_| spec.image.clone());

⚠️ Potential issue | 🟠 Major

TRAINING_IMAGE override lacks dev_mode protection.

The environment variable allows overriding the spec image for "local dev" but isn't gated behind a dev_mode check. If this env var is accidentally set in production, it could allow execution of arbitrary container images, bypassing security policies.

🔎 Suggested fix to gate behind dev_mode
-    // Allow TRAINING_IMAGE env var to override the default image (for local dev)
-    let image = std::env::var("TRAINING_IMAGE").unwrap_or_else(|_| spec.image.clone());
+    // Allow TRAINING_IMAGE env var to override the default image (only in dev mode)
+    let image = if cfg!(feature = "dev_mode") {
+        std::env::var("TRAINING_IMAGE").unwrap_or_else(|_| spec.image.clone())
+    } else {
+        spec.image.clone()
+    };

Alternatively, check for an explicit DEV_MODE environment variable:

-    // Allow TRAINING_IMAGE env var to override the default image (for local dev)
-    let image = std::env::var("TRAINING_IMAGE").unwrap_or_else(|_| spec.image.clone());
+    // Allow TRAINING_IMAGE env var to override the default image (only in dev mode)
+    let image = if std::env::var("DEV_MODE").is_ok() {
+        std::env::var("TRAINING_IMAGE").unwrap_or_else(|_| spec.image.clone())
+    } else {
+        spec.image.clone()
+    };
🤖 Prompt for AI Agents
In crates/basilica-operator/src/controllers/training_session_controller.rs
around lines 500-501, the TRAINING_IMAGE env var is used unconditionally to
override the spec image; gate this override behind a dev-mode check (e.g., check
an explicit DEV_MODE env var or a operator config flag) so production cannot be
overridden. Update the code to read DEV_MODE (parse as bool or check for
"1"/"true"), only call std::env::var("TRAINING_IMAGE") when dev mode is enabled,
otherwise always use spec.image; also add a warning log when dev mode is enabled
and the override is applied, and ensure no panic on missing/invalid DEV_MODE by
defaulting to false.

Comment on lines +262 to +270
create_r2_credentials() {
    export KUBECONFIG="$KUBECONFIG_PATH"

    # R2 credentials - you can override these with environment variables
    R2_ENDPOINT="${R2_ENDPOINT:-https://80f15715bb0b882c9e967c13e677ed7d.r2.cloudflarestorage.com}"
    R2_ACCESS_KEY="${R2_ACCESS_KEY:-df1147f966d3f8f16af3911da34478e8}"
    R2_SECRET_KEY="${R2_SECRET_KEY:-2db5c9bd18a845df832938f95217e64d91d50437bdf9f39f5423f752eb3c697b}"
    R2_BUCKET="${R2_BUCKET:-80f15715bb0b882c9e967c13e677ed7d}"
    R2_REGION="${R2_REGION:-auto}"

⚠️ Potential issue | 🔴 Critical

Critical: Hardcoded R2 credentials in source code.

The R2 access key and secret key defaults appear to be real credentials committed to version control. This is a serious security risk—even if these are for a test bucket, credentials in source code can be scraped, reused, or accidentally exposed.

🔎 Remove default secrets and require environment variables:
     # R2 credentials - you can override these with environment variables
-    R2_ENDPOINT="${R2_ENDPOINT:-https://80f15715bb0b882c9e967c13e677ed7d.r2.cloudflarestorage.com}"
-    R2_ACCESS_KEY="${R2_ACCESS_KEY:-df1147f966d3f8f16af3911da34478e8}"
-    R2_SECRET_KEY="${R2_SECRET_KEY:-2db5c9bd18a845df832938f95217e64d91d50437bdf9f39f5423f752eb3c697b}"
-    R2_BUCKET="${R2_BUCKET:-80f15715bb0b882c9e967c13e677ed7d}"
-    R2_REGION="${R2_REGION:-auto}"
+    R2_ENDPOINT="${R2_ENDPOINT:?R2_ENDPOINT environment variable is required}"
+    R2_ACCESS_KEY="${R2_ACCESS_KEY:?R2_ACCESS_KEY environment variable is required}"
+    R2_SECRET_KEY="${R2_SECRET_KEY:?R2_SECRET_KEY environment variable is required}"
+    R2_BUCKET="${R2_BUCKET:?R2_BUCKET environment variable is required}"
+    R2_REGION="${R2_REGION:-auto}"

If these credentials are already committed, they should be rotated immediately.
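The `${VAR:?message}` expansion the fix relies on can be demonstrated standalone; `check_required` is an illustrative helper, not part of the script:

```shell
#!/usr/bin/env bash
# Minimal sketch of the ${VAR:?message} expansion: an unset required variable
# aborts the command with the message instead of silently falling back to a
# default committed in source.
check_required() {
    # Runs in a subshell so the failure is observable without killing the caller.
    ( : "${R2_ACCESS_KEY:?R2_ACCESS_KEY environment variable is required}" )
}

unset R2_ACCESS_KEY
check_required 2>/dev/null || echo "missing credential rejected"

R2_ACCESS_KEY="example-key"
check_required && echo "credential accepted"
```

At the top level of a script the same expansion exits the script entirely, which is exactly the desired fail-closed behavior for secrets.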

🤖 Prompt for AI Agents
In scripts/local-training-e2e.sh around lines 262 to 270, the R2 access and
secret keys are hardcoded as defaults; remove these embedded secrets and require
callers to supply credentials via environment variables (or a secure secret
manager). Replace the default assignments so R2_ACCESS_KEY and R2_SECRET_KEY are
read directly from the environment without fallback values, validate at runtime
and exit with a clear error if they are unset, update README/.env.example to
document required vars, and ensure any committed secrets are rotated and removed
from the repo/history (and add a note to rotate if present).

Comment on lines +688 to +691
sleep 3

# Cleanup on exit
trap "kill $PF_PID 2>/dev/null" EXIT

⚠️ Potential issue | 🟡 Minor

Trap expansion happens at definition time, not signal time.

Using double quotes causes $PF_PID to expand immediately when the trap is set. If the variable changes later (or is unset), the trap will use the stale value.

🔎 Use single quotes for delayed expansion:
-    trap "kill $PF_PID 2>/dev/null" EXIT
+    trap 'kill $PF_PID 2>/dev/null' EXIT

Based on static analysis hint SC2064.
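The two quoting styles can be compared side by side in a throwaway script (the USR1/USR2 traps here are only for demonstration):

```shell
#!/usr/bin/env bash
# Sketch of the SC2064 difference: double quotes expand $PF_PID when the trap
# is *defined*, single quotes defer expansion until the trap *fires*.
PF_PID=111
trap "echo early:$PF_PID" USR1   # string frozen as "echo early:111"
trap 'echo late:$PF_PID' USR2    # $PF_PID read at signal time
PF_PID=222
kill -USR1 $$   # prints early:111 (stale value captured at definition)
kill -USR2 $$   # prints late:222 (current value)
```

For the cleanup trap in the script, the single-quoted form guarantees the PID killed is whatever PF_PID holds when the script actually exits.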

🧰 Tools
🪛 Shellcheck (0.11.0)

[warning] 691-691: Use single quotes, otherwise this expands now rather than when signalled.

(SC2064)

🤖 Prompt for AI Agents
In scripts/local-training-e2e.sh around lines 688 to 691, the trap is defined
with double quotes so $PF_PID is expanded immediately; change the trap to use
single quotes (e.g., trap 'kill $PF_PID 2>/dev/null' EXIT) so the variable is
expanded at signal time and the current PF_PID value is used when the script
exits.

Comment on lines +128 to +135
```
Training Pod                    Storage Daemon (DaemonSet)           R2 Bucket
┌─────────────┐                ┌──────────────────────────┐         ┌─────────┐
│ /checkpoints│ ──hostPath──▶  │ FUSE Mount               │         │         │
│ (volume)    │                │ /var/lib/basilica/fuse/  │──sync──▶│ basilica│
└─────────────┘                │ u-testuser/              │         │         │
                               └──────────────────────────┘         └─────────┘
```

⚠️ Potential issue | 🟡 Minor

Add language specification to fenced code block.

Line 128 begins an ASCII diagram code block without a language identifier, which fails markdown linting (MD040).

🔎 Apply this diff to specify the language:
 #### How It Works

-```
+```text
 Training Pod                    Storage Daemon (DaemonSet)           R2 Bucket
 ┌─────────────┐                ┌──────────────────────────┐         ┌─────────┐
 │ /checkpoints│ ──hostPath──▶  │ FUSE Mount               │         │         │
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)

128-128: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
In services/training-service/README.md around lines 128 to 135, the fenced ASCII
diagram block lacks a language identifier causing markdown lint MD040; add a
language specifier (e.g., ```text) on the opening fence so the block becomes a
labeled code block, leaving the diagram content unchanged and keeping the
closing ``` as-is.

Comment on lines +200 to +205
```
{bucket}/{namespace}/{namespace}/{checkpoint_path}

Example:
basilica/u-testuser/u-testuser/train-session-1/test-checkpoint-final/adapter_model.safetensors
```

⚠️ Potential issue | 🟡 Minor

Add language specification to fenced code block.

Line 200 begins a code block showing the R2 object path structure without a language identifier, which fails markdown linting (MD040).

🔎 Apply this diff to specify the language:
 #### R2 Object Path

 Checkpoints are stored in R2 with the following path structure:

-```
+```text
 {bucket}/{namespace}/{namespace}/{checkpoint_path}

 Example:
 basilica/u-testuser/u-testuser/train-session-1/test-checkpoint-final/adapter_model.safetensors
-```
+```
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)

200-200: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
In services/training-service/README.md around lines 200 to 205, the fenced code
block showing the R2 object path lacks a language identifier (causing MD040);
update the opening fence to include the language specifier (e.g., change ``` to
```text) so the block becomes a language-marked code fence while leaving the
block contents unchanged.

Comment on lines +253 to +272
```
[STEP] Running training steps via port-forward...
[INFO] Training step 1/3
[INFO]   Loss: 6.26, Tokens: 10
[INFO]   Completed step: 1
[INFO] Training step 2/3
[INFO]   Loss: 5.74, Tokens: 10
[INFO]   Completed step: 2
[INFO] Training step 3/3
[INFO]   Loss: 5.75, Tokens: 10
[INFO]   Completed step: 3

[INFO] Training session status:
{
  "session_id": "train-session-1",
  "step_count": 3,
  "tokens_processed": 30,
  "lora_rank": 8
}
```

⚠️ Potential issue | 🟡 Minor

Add language specification to fenced code block.

Line 253 begins an example output code block without a language identifier, which fails markdown linting (MD040).

🔎 Apply this diff to specify the language:
 ### Example Output

-```
+```text
 [STEP] Running training steps via port-forward...
 [INFO] Training step 1/3
 [INFO]   Loss: 6.26, Tokens: 10
 [INFO]   Completed step: 1
 [INFO] Training step 2/3
 [INFO]   Loss: 5.74, Tokens: 10
 [INFO]   Completed step: 2
 [INFO] Training step 3/3
 [INFO]   Loss: 5.75, Tokens: 10
 [INFO]   Completed step: 3

 [INFO] Training session status:
 {
   "session_id": "train-session-1",
   "step_count": 3,
   "tokens_processed": 30,
   "lora_rank": 8
 }
-```
+```
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)

253-253: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
In services/training-service/README.md around lines 253 to 272, the example
output code block is missing a language identifier which triggers markdown lint
rule MD040; update the opening fenced code block from ``` to ```text so the
block is explicitly marked as plain text (leave the closing ``` unchanged) to
satisfy the linter and preserve the example output formatting.

Comment on lines +404 to +436
```
┌─────────────────────────────────────────────────────────────┐
│                    Basilica API                             │
│              (creates TrainingSession CRD)                  │
└─────────────────────┬───────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────────┐
│                 Basilica Operator                           │
│  • Reconciles TrainingSession CRD                           │
│  • Creates Pod + Service + NetworkPolicy                    │
│  • Polls training service for status updates                │
└─────────────────────┬───────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────────┐
│               Training Service Pod                          │
├─────────────────────────────────────────────────────────────┤
│              FastAPI Server (src/server.py)                 │
├─────────────────────────────────────────────────────────────┤
│              Training Backend (src/backend.py)              │
│  • Session management                                       │
│  • Forward-backward with gradient accumulation              │
│  • Optimizer step                                           │
│  • Text generation                                          │
│  • Checkpoint save/load                                     │
├─────────────────────────────────────────────────────────────┤
│           HuggingFace Transformers + PEFT                   │
│  • Model loading (OPT, LLaMA, etc.)                         │
│  • LoRA adapter (rank, alpha, dropout)                      │
│  • AdamW optimizer                                          │
└─────────────────────────────────────────────────────────────┘
```

⚠️ Potential issue | 🟡 Minor

Add language specification to fenced code block.

Line 404 begins an ASCII architecture diagram code block without a language identifier, which fails markdown linting (MD040).

🔎 Apply this diff to specify the language:
 ## Architecture

-```
+```text
 ┌─────────────────────────────────────────────────────────────┐
 │                    Basilica API                             │
 │              (creates TrainingSession CRD)                  │
 └─────────────────────┬───────────────────────────────────────┘
                       │
                       ▼
 ┌─────────────────────────────────────────────────────────────┐
 │                 Basilica Operator                           │
 │  • Reconciles TrainingSession CRD                           │
 │  • Creates Pod + Service + NetworkPolicy                    │
 │  • Polls training service for status updates                │
 └─────────────────────┬───────────────────────────────────────┘
                       │
                       ▼
 ┌─────────────────────────────────────────────────────────────┐
 │               Training Service Pod                          │
 ├─────────────────────────────────────────────────────────────┤
 │              FastAPI Server (src/server.py)                 │
 ├─────────────────────────────────────────────────────────────┤
 │              Training Backend (src/backend.py)              │
 │  • Session management                                       │
 │  • Forward-backward with gradient accumulation              │
 │  • Optimizer step                                           │
 │  • Text generation                                          │
 │  • Checkpoint save/load                                     │
 ├─────────────────────────────────────────────────────────────┤
 │           HuggingFace Transformers + PEFT                   │
 │  • Model loading (OPT, LLaMA, etc.)                         │
 │  • LoRA adapter (rank, alpha, dropout)                      │
 │  • AdamW optimizer                                          │
-```
+```
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)

404-404: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
In services/training-service/README.md around lines 404 to 436, the ASCII
architecture diagram code block is missing a language identifier which triggers
markdown lint rule MD040; update the opening fence from ``` to ```text (or
another suitable language like ```plain) and keep the closing fence unchanged so
the block is recognized as a code block with a language specifier.


@coderabbitai coderabbitai bot left a comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
examples/training_example.py (1)

39-40: Resource leak: file handle not closed.

This issue was previously flagged. The file is opened without a context manager, leaving the handle unclosed. The same pattern appears at lines 182-183, 213-214, and 247-248.

🧹 Nitpick comments (3)
crates/basilica-sdk-python/python/basilica/training/types.py (2)

176-194: Use get_running_loop() and fix caching inconsistency.

Three issues:

  1. _result_type is stored but never used (dead code)
  2. asyncio.get_event_loop() is deprecated in Python 3.10+ when called from a coroutine; use asyncio.get_running_loop() instead
  3. result() caches to _result, but result_async() bypasses this cache, causing inconsistent behavior if both are called
🔎 Proposed fix
-    def __init__(self, future: Future, result_type: type = None):
+    def __init__(self, future: Future, result_type: Optional[type] = None):
         self._future = future
         self._result = None
-        self._result_type = result_type
+        self._result_type = result_type  # Reserved for future deserialization

     def result(self, timeout: Optional[float] = None):
         """Block until operation completes (sync)."""
         if self._result is None:
             self._result = self._future.result(timeout=timeout)
         return self._result

     async def result_async(self, timeout: Optional[float] = None):
         """Wait for operation to complete (async)."""
-        loop = asyncio.get_event_loop()
-        if timeout is not None:
-            return await asyncio.wait_for(
-                loop.run_in_executor(None, self._future.result), timeout=timeout
-            )
-        return await loop.run_in_executor(None, self._future.result)
+        if self._result is not None:
+            return self._result
+        loop = asyncio.get_running_loop()
+        if timeout is not None:
+            self._result = await asyncio.wait_for(
+                loop.run_in_executor(None, self._future.result), timeout=timeout
+            )
+        else:
+            self._result = await loop.run_in_executor(None, self._future.result)
+        return self._result
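The pattern the diff converges on can be demonstrated in isolation. A toy sketch, assuming Python 3.10+ (class and variable names are illustrative, not the SDK's actual API):

```python
import asyncio
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Optional


class APIFutureSketch:
    """Toy wrapper sharing one cached result between sync and async paths."""

    def __init__(self, future: Future):
        self._future = future
        self._result = None

    def result(self, timeout: Optional[float] = None):
        if self._result is None:
            self._result = self._future.result(timeout=timeout)
        return self._result

    async def result_async(self):
        if self._result is not None:
            return self._result
        # get_running_loop() is the non-deprecated call inside a coroutine
        loop = asyncio.get_running_loop()
        self._result = await loop.run_in_executor(None, self._future.result)
        return self._result


async def main():
    with ThreadPoolExecutor() as pool:
        wrapped = APIFutureSketch(pool.submit(lambda: 40 + 2))
        first = await wrapped.result_async()  # resolves via the executor
        second = wrapped.result()             # served from the shared cache
        return first, second


first, second = asyncio.run(main())
print(first, second)
```

Because both paths write to the same `_result`, calling `result()` after `result_async()` (or vice versa) returns the cached value instead of hitting the future twice.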

207-216: Consider sorting __all__ alphabetically.

Per static analysis (RUF022), the export list is unsorted. Sorting improves readability and reduces merge conflicts.

examples/training_example.py (1)

126-144: Retry logic is reasonable but consider documenting the magic numbers.

The 5 retries with 1-second delays are hardcoded. For example code this is acceptable, but a brief inline comment explaining why these values were chosen would help readers understand the expected service behavior.
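One way to make those values self-documenting is to name them and factor the polling into a helper. A hedged sketch (constants mirror the example's hardcoded 5 × 1 s; all names are illustrative):

```python
import time

MAX_RETRIES = 5      # mirrors the example's retry count
RETRY_DELAY_S = 1.0  # mirrors the example's 1-second pause


def wait_until_ready(check, retries=MAX_RETRIES, delay=RETRY_DELAY_S):
    """Poll check() until it returns True; raise once retries are exhausted."""
    for attempt in range(1, retries + 1):
        if check():
            return attempt
        if attempt < retries:
            time.sleep(delay)
    raise TimeoutError(f"service not ready after {retries} attempts")


# Simulated service that becomes ready on the third probe.
calls = {"n": 0}


def fake_check():
    calls["n"] += 1
    return calls["n"] >= 3


attempts = wait_until_ready(fake_check, delay=0.01)
print(attempts)
```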

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 73cdae1 and f03f1d8.

📒 Files selected for processing (2)
  • crates/basilica-sdk-python/python/basilica/training/types.py (1 hunks)
  • examples/training_example.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.14.8)
examples/training_example.py

55-55: Do not catch blind exception: Exception

(BLE001)


151-151: f-string without any placeholders

Remove extraneous f prefix

(F541)

crates/basilica-sdk-python/python/basilica/training/types.py

134-134: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


135-135: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


136-136: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


137-137: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


138-138: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


140-140: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


141-141: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


142-142: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


143-143: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


144-144: Unused method argument: kwargs

(ARG002)


176-176: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


207-216: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

🔇 Additional comments (7)
crates/basilica-sdk-python/python/basilica/training/types.py (4)

16-45: LGTM!

The Datum dataclass is well-structured with appropriate optional fields and a clean to_dict() serialization method.


47-67: LGTM!

The factory methods provide a clean API for creating ModelInput from different sources.


69-94: LGTM!

Sensible defaults and proper handling of stop_sequences in serialization.


99-124: LGTM!

Response dataclasses are well-defined with appropriate fields and defaults.

examples/training_example.py (3)

60-86: LGTM!

Good use of the context manager pattern for the training client, and the training loop demonstrates the APIFuture .result() pattern clearly.


207-241: Good use of try/finally for cleanup in async context.

The try/finally pattern ensures training.close() is called even if an exception occurs during async training.


295-317: LGTM!

Clean CLI argument handling with argparse and clear dispatch to the appropriate example function.

Comment on lines +132 to +151
def __init__(
self,
models: List[str] = None,
max_batch_tokens: int = None,
max_sequence_length: int = None,
gpu_types: List[str] = None,
lora_ranks: List[int] = None,
# camelCase variants from API
maxBatchTokens: int = None,
maxSequenceLength: int = None,
gpuTypes: List[str] = None,
loraRanks: List[int] = None,
**kwargs, # Ignore unknown fields
):
self.models = models or []
self.max_batch_tokens = max_batch_tokens or maxBatchTokens or 0
self.max_sequence_length = max_sequence_length or maxSequenceLength or 0
self.gpu_types = gpu_types or gpuTypes or []
self.lora_ranks = lora_ranks or loraRanks or []


⚠️ Potential issue | 🟡 Minor

Falsy value bug: explicit 0 values are ignored.

The or chaining treats 0 as falsy, so max_batch_tokens=0 falls through to check maxBatchTokens. While the final fallback to 0 masks this for zeroes, it's semantically incorrect and will fail if someone passes max_batch_tokens=0, maxBatchTokens=100 expecting 0.

Additionally, per static analysis hints, the implicit Optional types (using = None without Optional[T]) violate PEP 484.

🔎 Proposed fix using explicit None checks
     def __init__(
         self,
-        models: List[str] = None,
-        max_batch_tokens: int = None,
-        max_sequence_length: int = None,
-        gpu_types: List[str] = None,
-        lora_ranks: List[int] = None,
+        models: Optional[List[str]] = None,
+        max_batch_tokens: Optional[int] = None,
+        max_sequence_length: Optional[int] = None,
+        gpu_types: Optional[List[str]] = None,
+        lora_ranks: Optional[List[int]] = None,
         # camelCase variants from API
-        maxBatchTokens: int = None,
-        maxSequenceLength: int = None,
-        gpuTypes: List[str] = None,
-        loraRanks: List[int] = None,
+        maxBatchTokens: Optional[int] = None,
+        maxSequenceLength: Optional[int] = None,
+        gpuTypes: Optional[List[str]] = None,
+        loraRanks: Optional[List[int]] = None,
         **kwargs,  # Ignore unknown fields
     ):
         self.models = models or []
-        self.max_batch_tokens = max_batch_tokens or maxBatchTokens or 0
-        self.max_sequence_length = max_sequence_length or maxSequenceLength or 0
-        self.gpu_types = gpu_types or gpuTypes or []
-        self.lora_ranks = lora_ranks or loraRanks or []
+        self.max_batch_tokens = max_batch_tokens if max_batch_tokens is not None else (maxBatchTokens if maxBatchTokens is not None else 0)
+        self.max_sequence_length = max_sequence_length if max_sequence_length is not None else (maxSequenceLength if maxSequenceLength is not None else 0)
+        self.gpu_types = gpu_types if gpu_types is not None else (gpuTypes if gpuTypes is not None else [])
+        self.lora_ranks = lora_ranks if lora_ranks is not None else (loraRanks if loraRanks is not None else [])

Committable suggestion skipped: line range outside the PR's diff.
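The falsy-value pitfall is easy to reproduce outside the SDK; a minimal demonstration (function and parameter names are illustrative):

```python
from typing import Optional


def pick_or(snake=None, camel=None):
    # "or" chaining: any falsy value (0, [], "") is treated as missing
    return snake or camel or 0


def pick_none(snake: Optional[int] = None, camel: Optional[int] = None) -> int:
    # explicit None checks preserve legitimate falsy values
    if snake is not None:
        return snake
    if camel is not None:
        return camel
    return 0


buggy = pick_or(snake=0, camel=100)    # the explicit 0 is silently lost
fixed = pick_none(snake=0, camel=100)  # the explicit 0 wins
print(buggy, fixed)
```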

🤖 Prompt for AI Agents
In crates/basilica-sdk-python/python/basilica/training/types.py around lines 132
to 151, the constructor uses "or" chaining which treats valid falsy values like
0 as missing; change the assignments to explicit None checks (e.g., set
self.max_batch_tokens = max_batch_tokens if max_batch_tokens is not None else
(maxBatchTokens if maxBatchTokens is not None else 0)) and similarly for
max_sequence_length, gpu_types, and lora_ranks so passed zeros or empty lists
are preserved; also update the parameter type annotations from "T = None" to
"Optional[T] = None" (import Optional from typing) for all optional params to
satisfy PEP 484.

if abs(loss_after_load - loss_before_changes) < 0.01:
print("✓ R2 checkpoint round-trip PASSED - weights restored correctly!")
else:
print(f"✗ R2 checkpoint round-trip FAILED - loss mismatch")

⚠️ Potential issue | 🟡 Minor

Remove extraneous f prefix from string.

This f-string has no placeholders, making the f prefix unnecessary.

🔎 Proposed fix
-            print(f"✗ R2 checkpoint round-trip FAILED - loss mismatch")
+            print("✗ R2 checkpoint round-trip FAILED - loss mismatch")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
print(f"✗ R2 checkpoint round-trip FAILED - loss mismatch")
print("✗ R2 checkpoint round-trip FAILED - loss mismatch")
🤖 Prompt for AI Agents
In examples/training_example.py around line 151, the print statement uses an
unnecessary f-string prefix for a literal without placeholders; remove the
leading "f" so the string is a normal literal (print("✗ R2 checkpoint round-trip
FAILED - loss mismatch")) to avoid misleading use of f-strings.

Made-with: Cursor

# Conflicts:
#	.gitignore
#	Cargo.lock
#	crates/basilica-api/Cargo.toml
#	crates/basilica-api/src/api/middleware/scope.rs
#	crates/basilica-api/src/api/mod.rs
#	crates/basilica-api/src/api/routes/mod.rs
#	crates/basilica-api/src/bin/gen_api_key.rs
#	crates/basilica-api/src/config/mod.rs
#	crates/basilica-api/src/k8s/trait.rs
#	crates/basilica-api/src/server.rs
#	crates/basilica-operator/src/controllers/mod.rs
#	crates/basilica-operator/src/crd/mod.rs
#	crates/basilica-operator/src/k8s_client.rs
#	crates/basilica-operator/src/runtime.rs
#	justfile
#	scripts/api/compose.dev.yml