
sft mvp #298

Open
distributedstatemachine wants to merge 10 commits into main from feat/training

Conversation

@distributedstatemachine
Contributor

@distributedstatemachine distributedstatemachine commented Dec 18, 2025

Summary

Brief description of what this PR does.

Related Issues

Closes #(issue number)

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Performance improvement
  • Code refactoring

Changes Made

List the main changes in this PR:

Testing

How Has This Been Tested?

Describe the tests you ran to verify your changes.

  • Unit tests pass (cargo test)
  • Integration tests pass
  • Manual testing completed

Test Configuration

  • OS:
  • Rust version:
  • Bittensor version (if applicable):

Checklist

  • My code follows the project's style guidelines
  • I have run cargo fmt to format my code
  • I have run cargo clippy and addressed all warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Additional Context

Add any other context about the PR here.

Summary by CodeRabbit

  • New Features

    • Introduced Basilica Training SDK with comprehensive Python API for LoRA fine-tuning, including session management, training operations, sampling, and checkpoint management.
    • Added training service with HTTP endpoints for forward/backward passes, optimization steps, and text generation.
    • Replaced previous evaluator scaffolding with Affine and AgentGym evaluators for improved flexibility.
  • Documentation

    • Added training service README and comprehensive example scripts demonstrating SDK usage.

@coderabbitai
Contributor

coderabbitai bot commented Dec 18, 2025

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6ea43c74-df6a-484a-83fd-bb46996bfef2

📥 Commits

Reviewing files that changed from the base of the PR and between f03f1d8 and fa32881.

📒 Files selected for processing (1)
  • .gitignore
✅ Files skipped from review due to trivial changes (1)
  • .gitignore

Walkthrough

This PR introduces the Basilica Training SDK and supporting infrastructure for LoRA-based model fine-tuning. It includes Python SDK clients, a FastAPI training backend service, Kubernetes resources (CRDs, operators, gateways), Docker Compose configuration, example scripts, and comprehensive testing.

Changes

Cohort / File(s) Summary
Python SDK Public API Surface
crates/basilica-sdk-python/python/basilica/training/__init__.py
Re-exports core public API (clients, data types, exceptions) at module level; sets __version__ to "0.2.0"; defines consolidated __all__ list for clean package imports.
SDK Core Clients
crates/basilica-sdk-python/python/basilica/training/service_client.py, training_client.py, sampling_client.py, rest_client.py
Four client implementations: ServiceClient as the entry point with capability probing, TrainingClient for training operations (forward/backward, optimizer steps), SamplingClient for text generation, and RestClient for run/checkpoint management. All use httpx with a ThreadPoolExecutor to provide async wrappers via APIFuture.
SDK Data Types & Exceptions
crates/basilica-sdk-python/python/basilica/training/types.py, exceptions.py
Introduces data classes (Datum, ModelInput, SamplingParams, SampleResponse, ForwardBackwardResult, ForwardResult) and GetServerCapabilitiesResponse for API contracts. Defines TrainingError hierarchy with 9 exception types for error handling and validation.
SDK Tests
crates/basilica-sdk-python/python/basilica/training/tests/test_clients.py, tests/__init__.py
Comprehensive test coverage for APIFuture, data types, ServiceClient, TrainingClient, RestClient, and GetServerCapabilitiesResponse with mocked HTTP interactions.
Training Backend Service Implementation
services/training-service/src/backend.py, server.py, Dockerfile, pyproject.toml, README.md, src/__init__.py
Backend service using PEFT LoRA for HuggingFace model fine-tuning. Includes TrainingBackend class managing session lifecycle, forward/backward passes, optimizer steps, sampling, and checkpoint save/load via S3/R2. FastAPI server exposes endpoints for session management, training, and inference.
Training Backend Tests
services/training-service/tests/test_backend.py, tests/__init__.py
Unit tests covering session lifecycle, forward-backward passes, optimization, sampling with logprobs, checkpoint persistence, status tracking, and error handling for nonexistent sessions.
Kubernetes CRDs & Operators
orchestrator/k8s/training/training-session-crd.yaml, operator-deployment.yaml, training-service-rbac.yaml
CustomResourceDefinition for TrainingSession with LoRA config, optimizer settings, GPU resources, and checkpoint storage. Operator deployment with RBAC for basilica-operator managing training sessions and related cluster resources.
Kubernetes Gateway & Infrastructure
orchestrator/k8s/training/local-dev/gateway.yaml, namespace.yaml
Gateway API setup with GatewayClass, HTTP listener on port 8080, ReferenceGrant for cross-namespace routing. Three namespaces: basilica-system, envoy-gateway-system, u-testuser.
Kubernetes Session Examples
orchestrator/k8s/training/training-session-example.yaml
Two example TrainingSession manifests for Llama-3.1-8B and Llama-3.1-70B with LoRA rank configurations, R2 checkpoint storage, and GPU resource specifications.
End-to-End Testing & Orchestration
scripts/local-training-e2e.sh
Comprehensive bash script for local development: cluster provisioning (k3d), Gateway/Envoy setup, image builds, storage/FUSE configuration, Postgres API backend, and end-to-end training workflow with port-forwarding and checkpoint management. Exposes CLI commands for cluster-up, deploy, test, logs, reset-db, etc.
Integration Testing
scripts/test-training-sdk.sh
Integration test script validating API health, endpoint availability, token authentication, and running Python examples and pytest suite.
Example Scripts
examples/training_example.py, examples/11_agentgym.py
training_example.py demonstrates end-to-end SDK usage with main(), sampling_example(), async_example(), and rest_client_example() entry points. 11_agentgym.py replaces LLM evaluator scaffolding with Affine and AgentGym evaluators, adding per-request API key handling via environment variables.
Docker Compose & Configuration
docker-compose.training.yml, .gitignore
Docker Compose with training-service and basilica-api services, GPU constraints, health checks, restart policies, and volume mounts. Gitignore updated to exclude .venv.

Sequence Diagram

sequenceDiagram
    participant User
    participant ServiceClient
    participant TrainingClient
    participant HTTPClient as HTTP/Remote
    participant TrainingBackend
    participant Model as HuggingFace Model
    
    User->>ServiceClient: create_lora_training_client(base_model, rank, ...)
    ServiceClient->>HTTPClient: POST /sessions (create)
    HTTPClient->>TrainingBackend: create_session(base_model)
    TrainingBackend->>Model: Load base model & tokenizer
    Model-->>TrainingBackend: Model loaded
    TrainingBackend-->>HTTPClient: session_id, status
    HTTPClient-->>ServiceClient: CreateSessionResponse
    ServiceClient->>ServiceClient: Wrap in TrainingClient
    ServiceClient-->>User: TrainingClient instance
    
    User->>TrainingClient: forward_backward(input_ids, labels)
    TrainingClient->>HTTPClient: POST /forward_backward
    HTTPClient->>TrainingBackend: forward_backward(tokens, labels)
    TrainingBackend->>Model: forward pass + loss
    Model-->>TrainingBackend: logits
    TrainingBackend->>TrainingBackend: backward pass
    TrainingBackend-->>HTTPClient: loss, logprobs, tokens_processed
    HTTPClient-->>TrainingClient: ForwardBackwardResult
    TrainingClient-->>User: APIFuture (loss, logprobs)
    
    User->>TrainingClient: optim_step()
    TrainingClient->>HTTPClient: POST /optim_step
    HTTPClient->>TrainingBackend: optim_step()
    TrainingBackend->>TrainingBackend: gradient clip & optimizer.step()
    TrainingBackend-->>HTTPClient: step count
    HTTPClient-->>TrainingClient: OptimStepResponse
    TrainingClient-->>User: step count
    
    User->>TrainingClient: save_state(path)
    TrainingClient->>HTTPClient: POST /save_state
    HTTPClient->>TrainingBackend: save_state(path)
    TrainingBackend->>Model: Save LoRA weights to S3/R2
    Model-->>TrainingBackend: Checkpoint saved
    TrainingBackend-->>HTTPClient: checkpoint path
    HTTPClient-->>TrainingClient: SaveStateResponse
    TrainingClient-->>User: checkpoint path
sequenceDiagram
    participant Admin
    participant kubectl as kubectl/K8s API
    participant Operator
    participant TrainingSessionCRD
    participant PodController
    participant TrainingPod as Training Pod
    
    Admin->>kubectl: kubectl apply training-session.yaml
    kubectl->>TrainingSessionCRD: Create TrainingSession resource
    TrainingSessionCRD-->>kubectl: resource created
    
    Operator->>TrainingSessionCRD: Watch for new TrainingSession
    TrainingSessionCRD-->>Operator: TrainingSession event (Added)
    Operator->>Operator: Validate spec (baseModel, gpuResources)
    Operator->>Operator: Build training pod spec
    Operator->>PodController: Create Pod with training-service:local image
    PodController-->>Operator: Pod created
    
    Operator->>TrainingSessionCRD: Update status (phase: loading_model)
    TrainingSessionCRD-->>Operator: Status updated
    
    PodController->>TrainingPod: Launch container
    TrainingPod->>TrainingPod: Load model, initialize LoRA
    TrainingPod-->>PodController: Ready (healthcheck passes)
    
    Operator->>TrainingSessionCRD: Update status (phase: ready)
    Operator->>TrainingSessionCRD: Set endpoint URL
    TrainingSessionCRD-->>Operator: Status updated
    
    Admin->>kubectl: Query TrainingSession status
    kubectl->>TrainingSessionCRD: Get resource
    TrainingSessionCRD-->>kubectl: Return phase=ready, endpoint
    kubectl-->>Admin: Training session ready

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • epappas
  • itzlambda

Poem

🐰 A training SDK hops into place,
With clients and models in full embrace,
LoRA fine-tunes with a quantum leap,
Checkpoints saved, gradients deep,
From K8s to FastAPI's grace! ✨

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

  • Title check — ⚠️ Warning: The title 'sft mvp' is vague and non-descriptive, using abbreviations without context; it does not clearly communicate the scope or purpose of the substantial changes across training infrastructure, SDKs, examples, and Kubernetes configurations. Resolution: replace it with a descriptive title such as 'Add Basilica Training SDK and end-to-end training infrastructure' or 'Introduce LoRA training service with Python SDK and Kubernetes deployment'.
  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 70.88%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (1 passed)
  • Description Check — ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 15

🧹 Nitpick comments (32)
examples/11_agentgym.py (1)

265-265: Remove unnecessary f-string prefix.

This line doesn't contain any placeholders, so the f prefix is unnecessary.

🔎 Apply this diff:
-    print(f"TTL: 1 hour (auto-cleanup)")
+    print("TTL: 1 hour (auto-cleanup)")
justfile (2)

962-970: Consider using pushd/popd for directory navigation.

The script uses cd to navigate into the SDK directory and cd ../.. to return. While this works correctly with set -euo pipefail, using pushd/popd would make directory changes more robust and self-documenting.

🔎 View suggested refactor:
     # Install Python SDK in editable mode
     echo "Installing Python SDK..."
     if command -v uv &> /dev/null; then
         uv pip install -e crates/basilica-sdk-python
     else
         # Use maturin directly for standard venv
-        cd crates/basilica-sdk-python
+        pushd crates/basilica-sdk-python > /dev/null
         maturin develop --release
-        cd ../..
+        popd > /dev/null
     fi

972-976: Apply the same pushd/popd pattern for stub generation.

The stub generation step uses the same cd pattern. Consider applying the same pushd/popd refactor here for consistency.

🔎 View suggested refactor:
     # Generate type stubs
     echo "Generating type stubs..."
-    cd crates/basilica-sdk-python
+    pushd crates/basilica-sdk-python > /dev/null
     cargo run --bin stub_gen --features stub-gen
-    cd ../..
+    popd > /dev/null
crates/basilica-sdk-python/python/basilica/training/__init__.py (2)

177-181: Remove or utilize the unused loss_fn parameter.

The loss_fn parameter is documented but never used in the method implementation or sent to the API endpoint. This could confuse users who expect different loss functions to work.

Consider either removing this parameter until the feature is implemented, or documenting it as reserved for future use.

🔎 Option 1: Remove the unused parameter:
     def forward_backward(
         self,
         data: List[Datum],
-        loss_fn: str = "cross_entropy",
     ) -> ForwardBackwardResult:
         """
         Compute forward pass and gradients.
 
         Args:
             data: List of training examples
-            loss_fn: Loss function (currently only "cross_entropy")
 
         Returns:
             ForwardBackwardResult with loss and logprobs
         """

589-606: Consider sorting __all__, though current grouping is also maintainable.

Static analysis suggests sorting the __all__ list alphabetically. However, the current organization by category (client, config types, data types, exceptions) is logical and aids maintainability. Either approach is acceptable.

🔎 If you prefer alphabetical sorting:
 # Export all public symbols
 __all__ = [
-    # Client
-    "TrainingClient",
-    "TrainingSession",
-    # Config types
-    "LoraConfig",
-    "OptimizerConfig",
     "CheckpointStorage",
-    "GpuResources",
-    # Data types
     "Datum",
     "ForwardBackwardResult",
-    "Sample",
-    # Exceptions
+    "GpuResources",
+    "LoraConfig",
+    "OptimizerConfig",
+    "Sample",
+    "SessionNotFoundError",
+    "TrainingClient",
     "TrainingError",
-    "SessionNotFoundError",
+    "TrainingSession",
     "TrainingServiceError",
 ]
examples/training_example.py (1)

112-135: Consider renaming the unused loop variable.

The loop variable step is not used within the loop body. Python convention is to use _step or _ for intentionally unused variables.

🔎 Apply this diff:
-    for step in range(num_steps):
+    for _step in range(num_steps):
         # Forward-backward pass (accumulate gradients)
         total_loss = 0.0
scripts/local-training-e2e.sh (1)

427-427: Consider using single quotes in trap command (optional nitpick).

While the current code works correctly (since $PF_PID is expanded at trap setup time), shell best practice is to use single quotes for trap commands to make the intention explicit.

🔎 Apply this diff:
-    trap "kill $PF_PID 2>/dev/null" EXIT
+    trap 'kill '"$PF_PID"' 2>/dev/null' EXIT
crates/basilica-api/src/api/middleware/scope.rs (1)

180-184: Add test coverage for the new training session scope mappings.

The scope mappings are correctly implemented and follow the existing patterns. However, the test suite in this file (lines 196-406) does not include test cases for the new training session endpoints.

🔎 Add test cases for training session routes

Add the following test cases to the test_required_scope_mapping function:

// Test training session endpoints
let req = Request::builder()
    .method(Method::POST)
    .uri("/sessions")
    .body(Body::empty())
    .unwrap();
assert_eq!(get_required_scope(&req), Some("training:create".to_string()));

let req = Request::builder()
    .method(Method::GET)
    .uri("/sessions")
    .body(Body::empty())
    .unwrap();
assert_eq!(get_required_scope(&req), Some("training:list".to_string()));

let req = Request::builder()
    .method(Method::GET)
    .uri("/sessions/ts-123456")
    .body(Body::empty())
    .unwrap();
assert_eq!(get_required_scope(&req), Some("training:view".to_string()));

let req = Request::builder()
    .method(Method::DELETE)
    .uri("/sessions/ts-123456")
    .body(Body::empty())
    .unwrap();
assert_eq!(get_required_scope(&req), Some("training:delete".to_string()));
services/training-service/Dockerfile (2)

15-21: Consider production-optimized installation and exclude test files.

The current setup uses editable mode installation (-e ".") and includes test files in the production image:

  • Line 17: Editable install is intended for development, not production deployment
  • Line 21: Including tests/ directory adds unnecessary bloat to the production image
🔎 Apply this diff for production optimization:
 # Install Python dependencies
 COPY pyproject.toml .
 RUN pip install --no-cache-dir --upgrade pip && \
-    pip install --no-cache-dir -e "."
+    pip install --no-cache-dir .
 
 # Copy source code
 COPY src/ src/
-COPY tests/ tests/

If you need tests in the image for integration testing, consider using a multi-stage build with separate dev and prod stages.


6-9: Consider removing git if not required at runtime.

Git is installed as a system dependency but may only be needed during development or for certain ML libraries that clone repos. If the training service doesn't require git at runtime, removing it reduces the attack surface.

If git is only needed for development or specific edge cases, consider documenting why it's required or removing it to minimize the image size and security footprint.

docker-compose.training.yml (1)

33-37: Consider making GPU count configurable.

The GPU count is hardcoded to 1. For flexibility across different development environments, consider using an environment variable.

🔎 Example approach:
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: ${GPU_COUNT:-1}  # Default to 1 if not set
          capabilities: [gpu]

Then developers can override with:

GPU_COUNT=2 docker-compose -f docker-compose.training.yml up
services/training-service/README.md (2)

128-128: Unused import in SDK example.

The Datum import is shown but never used in the example code. Either demonstrate its usage or remove it to avoid confusion for users.

-from basilica.training import TrainingClient, Datum
+from basilica.training import TrainingClient

169-187: Add language specifier to architecture diagram code block.

The fenced code block for the architecture diagram is missing a language specifier. Use text or plaintext to satisfy linters and improve rendering consistency.

-```
+```text
 ┌─────────────────────────────────────────┐
orchestrator/k8s/training/local-dev/operator-deployment.yaml (2)

12-32: Consider tightening RBAC for production variants.

Using wildcard verbs ("*") grants full permissions on these resources. While acceptable for local development, ensure production manifests follow least-privilege principles by specifying explicit verbs (get, list, watch, create, update, patch, delete) as needed.


63-79: Add securityContext to harden the container.

Even for local development, adding a securityContext improves security posture and catches permission issues early.

🔎 Suggested addition:
       containers:
         - name: operator
           image: basilica-operator:local
           imagePullPolicy: Never
+          securityContext:
+            allowPrivilegeEscalation: false
+            runAsNonRoot: true
+            readOnlyRootFilesystem: true
+            capabilities:
+              drop:
+                - ALL
           env:
services/training-service/src/backend.py (4)

93-100: Consider handling explicit CPU device with bfloat16.

When device="cpu" is passed explicitly (not as a fallback), dtype remains bfloat16, which has limited CPU support. The current logic only converts dtype when falling back from CUDA.

🔎 Suggested fix:
-        if device == "cuda" and not torch.cuda.is_available():
-            # On Mac, use CPU to avoid MPS bfloat16 issues
-            self.device = "cpu"
-            self.dtype = torch.float32  # bfloat16 not well supported on CPU
-        else:
-            self.device = device
-            self.dtype = dtype
+        if device == "cuda" and not torch.cuda.is_available():
+            # On Mac, use CPU to avoid MPS bfloat16 issues
+            self.device = "cpu"
+            self.dtype = torch.float32  # bfloat16 not well supported on CPU
+        elif device == "cpu" and dtype == torch.bfloat16:
+            # bfloat16 has limited CPU support
+            self.device = "cpu"
+            self.dtype = torch.float32
+        else:
+            self.device = device
+            self.dtype = dtype

117-155: Security consideration: trust_remote_code=True executes arbitrary code.

While often necessary for newer models, this setting allows execution of untrusted code from model repositories. Consider making this configurable or documenting the security implications.

🔎 Suggested approach:
 def __init__(
     self,
     model_cache_dir: str = "/models",
     checkpoint_dir: str = "/checkpoints",
     device: str = "cuda",
     dtype: torch.dtype = torch.bfloat16,
+    trust_remote_code: bool = True,
 ):
+    self.trust_remote_code = trust_remote_code

Then use self.trust_remote_code in _load_base_model.


176-178: Incomplete seeding for full reproducibility.

torch.manual_seed only sets the CPU seed. For reproducible GPU training, CUDA seeds should also be set.

🔎 Apply this diff:
         if seed is not None:
             torch.manual_seed(seed)
+            if torch.cuda.is_available():
+                torch.cuda.manual_seed_all(seed)

476-480: Consider explicit resource cleanup in delete_session.

Deleting from the sessions dict doesn't immediately free GPU memory. For long-running services, consider explicit cleanup.

🔎 Suggested enhancement:
     def delete_session(self, session_id: str) -> None:
         """Delete a training session."""
         if session_id in self.sessions:
+            # Explicitly delete model to help free GPU memory
+            session = self.sessions[session_id]
+            del session.model
+            del session.optimizer
             del self.sessions[session_id]
+            # Suggest garbage collection for GPU memory
+            if torch.cuda.is_available():
+                torch.cuda.empty_cache()
             logger.info("session_deleted", session_id=session_id)
orchestrator/k8s/training/local-dev/gateway.yaml (1)

25-39: ReferenceGrant is namespace-specific to u-testuser.

The ReferenceGrant only allows HTTPRoutes from the u-testuser namespace. For a more flexible local-dev setup, consider using a wildcard or label selector if Gateway API supports it, or document that additional ReferenceGrants are needed for other user namespaces.

crates/basilica-operator/src/controllers/training_session_controller.rs (5)

239-248: String-based error checking is fragile.

Checking e.to_string().contains("409") is fragile. Consider using proper error type matching if the error type supports it.


250-251: External endpoint is hardcoded.

The external endpoint https://api.basilica.ai/sessions/{}/ is hardcoded. Consider making this configurable via environment variable or controller configuration for different environments.

🔎 Suggested approach:
// At the top of the file, add:
fn get_external_base_url() -> String {
    std::env::var("EXTERNAL_API_BASE_URL")
        .unwrap_or_else(|_| "https://api.basilica.ai".to_string())
}

// Then in handle_initializing:
let external_endpoint = format!("{}/sessions/{}/", get_external_base_url(), name);

277-293: MVP placeholder: health check not implemented.

As noted in the comments, the actual health check (GET /health) is not implemented. Consider adding a TODO or tracking issue.

Do you want me to open an issue to track implementing the health check endpoint verification?


331-333: Pod deletion error is silently ignored.

When TTL expires, pod deletion errors are discarded with let _ =. Consider logging the error even if proceeding with termination.

🔎 Suggested fix:
-                    let _ = self.client.delete_pod(namespace, &pod_name).await;
+                    if let Err(e) = self.client.delete_pod(namespace, &pod_name).await {
+                        warn!(error = %e, pod = %pod_name, "failed to delete pod on TTL expiry");
+                    }

367-392: Resource defaults are hardcoded; consider making configurable.

CPU/memory limits (16/64Gi for GPU, 2/4Gi for CPU-only) are hardcoded. For flexibility across different cluster configurations, these could be made configurable via the TrainingSessionSpec or environment variables.

services/training-service/tests/test_backend.py (2)

73-76: Token IDs may exceed model vocabulary size.

torch.randint(0, 1000, ...) generates token IDs up to 999, but facebook/opt-125m has a vocabulary size of ~50k, so this is safe. However, for robustness, consider using the tokenizer's vocab size.

🔎 More robust approach:
# Get tokenizer vocab size for safer random tokens
tokenizer = backend.tokenizers.get(small_model)
if tokenizer:
    vocab_size = len(tokenizer)
    input_ids = torch.randint(0, vocab_size, (1, 32))
else:
    input_ids = torch.randint(0, 1000, (1, 32))

165-166: Assertion len(result.text) >= 0 is always true.

This assertion doesn't test anything meaningful since string length is always non-negative.

🔎 Consider a more meaningful assertion:
-        assert len(result.text) >= 0  # May be empty for some models
-        assert len(result.token_ids) > 0
+        # Verify generation produced output (may be empty text due to special tokens)
+        assert isinstance(result.text, str)
+        assert len(result.token_ids) > 0
+        assert result.finish_reason in ("stop", "length")
crates/basilica-operator/src/crd/training_session.rs (1)

211-213: Consider using a pinned image tag instead of :latest.

The default image uses the :latest tag, which can lead to inconsistent behavior across deployments and make rollbacks difficult. Consider pinning to a specific version tag or using a semantic version.

🔎 Suggested improvement:
 fn default_image() -> String {
-    "basilica/training:latest".into()
+    "basilica/training:v0.1.0".into()  // Use a specific version
 }
crates/basilica-api/src/api/routes/training.rs (2)

274-281: UUID truncation reduces uniqueness guarantees.

Taking only the first segment of a UUID (8 hex characters) significantly reduces the UUID's uniqueness. This provides only 2^32 (~4 billion) possible values instead of the full 2^128, increasing collision risk in high-throughput scenarios.

Consider using the full UUID or a larger portion:

🔎 Suggested improvement:
     let session_id = format!(
         "ts-{}",
-        uuid::Uuid::new_v4()
-            .to_string()
-            .split('-')
-            .next()
-            .unwrap()
+        uuid::Uuid::new_v4().to_string().replace('-', "")  // Full UUID without hyphens
     );

Or keep it short but use more characters:

     let session_id = format!(
         "ts-{}",
-        uuid::Uuid::new_v4()
-            .to_string()
-            .split('-')
-            .next()
-            .unwrap()
+        &uuid::Uuid::new_v4().simple().to_string()[..16]  // First 16 hex chars (64 bits)
     );

334-334: Hardcoded API domain should be configurable.

The endpoint domain "api.basilica.ai" is hardcoded, which makes it inflexible for different environments (dev, staging, production) or on-premise deployments.

Consider making this configurable via environment variable or AppState:

🔎 Suggested improvement:
+    let api_domain = std::env::var("API_DOMAIN").unwrap_or_else(|_| "api.basilica.ai".to_string());
     let endpoint = format!(
-        "https://api.basilica.ai/sessions/{}/",
+        "https://{}/sessions/{}/",
+        api_domain,
         session_id
     );
services/training-service/src/server.py (2)

190-194: Simplify target_modules handling.

The logic to handle target_modules is more complex than necessary. You can rely on the backend's LoraConfiguration defaults instead.

🔎 Simplified approach:
         lora_kwargs = {}
         if request.lora_config:
             lora_kwargs = request.lora_config.model_dump(exclude_none=True)
-            if "target_modules" not in lora_kwargs or lora_kwargs["target_modules"] is None:
-                lora_kwargs.pop("target_modules", None)

The exclude_none=True already handles None values, and the backend's LoraConfiguration will apply defaults for missing fields.


289-312: Inconsistent error handling compared to create_session.

The sample endpoint only catches ValueError, while create_session also catches generic exceptions and logs them. For consistency and better observability, add generic exception handling here as well.

🔎 Apply this fix:
     except ValueError as e:
         raise HTTPException(status_code=status.HTTP_404_NOT_FOUND, detail=str(e))
+    except Exception as e:
+        logger.exception("sample_failed", error=str(e))
+        raise HTTPException(status_code=status.HTTP_500_INTERNAL_SERVER_ERROR, detail=str(e))

Apply similar fixes to save_state and load_state endpoints.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3118197 and 11a5072.

📒 Files selected for processing (34)
  • crates/basilica-api/src/api/middleware/scope.rs (1 hunks)
  • crates/basilica-api/src/api/mod.rs (1 hunks)
  • crates/basilica-api/src/api/routes/mod.rs (1 hunks)
  • crates/basilica-api/src/api/routes/training.rs (1 hunks)
  • crates/basilica-api/src/config/mod.rs (2 hunks)
  • crates/basilica-api/src/k8s/trait.rs (1 hunks)
  • crates/basilica-api/src/server.rs (3 hunks)
  • crates/basilica-operator/src/controllers/mod.rs (1 hunks)
  • crates/basilica-operator/src/controllers/training_session_controller.rs (1 hunks)
  • crates/basilica-operator/src/crd/mod.rs (1 hunks)
  • crates/basilica-operator/src/crd/training_session.rs (1 hunks)
  • crates/basilica-operator/src/k8s_client.rs (6 hunks)
  • crates/basilica-operator/src/runtime.rs (6 hunks)
  • crates/basilica-sdk-python/python/basilica/training/__init__.py (1 hunks)
  • docker-compose.training.yml (1 hunks)
  • examples/11_agentgym.py (2 hunks)
  • examples/training_example.py (1 hunks)
  • justfile (1 hunks)
  • orchestrator/k8s/training/local-dev/gateway.yaml (1 hunks)
  • orchestrator/k8s/training/local-dev/namespace.yaml (1 hunks)
  • orchestrator/k8s/training/local-dev/operator-deployment.yaml (1 hunks)
  • orchestrator/k8s/training/training-service-rbac.yaml (1 hunks)
  • orchestrator/k8s/training/training-session-crd.yaml (1 hunks)
  • orchestrator/k8s/training/training-session-example.yaml (1 hunks)
  • scripts/api/compose.dev.yml (1 hunks)
  • scripts/local-training-e2e.sh (1 hunks)
  • services/training-service/Dockerfile (1 hunks)
  • services/training-service/README.md (1 hunks)
  • services/training-service/pyproject.toml (1 hunks)
  • services/training-service/src/__init__.py (1 hunks)
  • services/training-service/src/backend.py (1 hunks)
  • services/training-service/src/server.py (1 hunks)
  • services/training-service/tests/__init__.py (1 hunks)
  • services/training-service/tests/test_backend.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (11)
crates/basilica-api/src/config/mod.rs (1)
crates/basilica-api/src/server.rs (1)
  • new (559-1165)
crates/basilica-api/src/api/mod.rs (1)
crates/basilica-api/src/api/routes/training.rs (4)
  • create_session (255-343)
  • list_sessions (478-582)
  • get_session (346-425)
  • delete_session (431-475)
crates/basilica-operator/src/k8s_client.rs (1)
crates/basilica-operator/src/crd/training_session.rs (2)
  • default (34-41)
  • default (80-86)
crates/basilica-api/src/server.rs (1)
crates/basilica-validator/src/api/client.rs (1)
  • new (24-34)
examples/11_agentgym.py (1)
crates/basilica-sdk-python/python/basilica/deployment.py (1)
  • url (183-191)
examples/training_example.py (2)
crates/basilica-sdk-python/python/basilica/training/__init__.py (9)
  • Datum (84-89)
  • TrainingError (111-114)
  • create_session (446-534)
  • base_model (168-170)
  • session_id (163-165)
  • forward_backward (177-233)
  • optim_step (235-253)
  • save_state (303-332)
  • sample (255-301)
services/training-service/src/backend.py (5)
  • create_session (163-224)
  • forward_backward (226-292)
  • optim_step (294-317)
  • save_state (383-432)
  • sample (319-381)
crates/basilica-operator/src/controllers/training_session_controller.rs (1)
crates/basilica-operator/src/crd/training_session.rs (5)
  • new (316-322)
  • new (374-391)
  • default (34-41)
  • default (80-86)
  • default (145-151)
crates/basilica-api/src/api/routes/training.rs (2)
crates/basilica-operator/src/crd/training_session.rs (10)
  • default_rank (44-46)
  • default_alpha (47-49)
  • default_dropout (50-52)
  • default (34-41)
  • default (80-86)
  • default (145-151)
  • default_learning_rate (89-91)
  • default_weight_decay (92-94)
  • new (316-322)
  • new (374-391)
crates/basilica-api/src/k8s/trait.rs (1)
  • kube_client (7-7)
scripts/local-training-e2e.sh (3)
crates/basilica-api/src/server.rs (1)
  • run (1188-1226)
crates/basilica-operator/src/runtime.rs (1)
  • run (162-361)
orchestrator/cluster-manager/src/clustermgr/commands/deployments.py (1)
  • deployments (103-213)
services/training-service/tests/test_backend.py (1)
services/training-service/src/backend.py (12)
  • LoraConfiguration (16-32)
  • OptimizerConfiguration (36-44)
  • TrainingBackend (80-511)
  • create_session (163-224)
  • forward_backward (226-292)
  • optim_step (294-317)
  • sample (319-381)
  • save_state (383-432)
  • load_state (434-462)
  • get_session_status (464-474)
  • delete_session (476-480)
  • list_sessions (482-484)
services/training-service/src/server.py (1)
services/training-service/src/backend.py (14)
  • ForwardBackwardResult (62-67)
  • LoraConfiguration (16-32)
  • OptimizerConfiguration (36-44)
  • SampleResult (71-77)
  • TrainingBackend (80-511)
  • list_sessions (482-484)
  • create_session (163-224)
  • get_session_status (464-474)
  • delete_session (476-480)
  • forward_backward (226-292)
  • optim_step (294-317)
  • sample (319-381)
  • save_state (383-432)
  • load_state (434-462)
🪛 Checkov (3.2.334)
orchestrator/k8s/training/local-dev/operator-deployment.yaml

[medium] 8-33: Minimize wildcard use in Roles and ClusterRoles

(CKV_K8S_49)


[medium] 47-79: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[medium] 47-79: Minimize the admission of root containers

(CKV_K8S_23)

🪛 markdownlint-cli2 (0.18.1)
services/training-service/README.md

169-169: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Ruff (0.14.8)
examples/11_agentgym.py

265-265: f-string without any placeholders

Remove extraneous f prefix

(F541)

examples/training_example.py

57-57: Do not catch blind exception: Exception

(BLE001)


112-112: Loop control variable step not used within loop body

Rename unused step to _step

(B007)

crates/basilica-sdk-python/python/basilica/training/__init__.py

180-180: Unused method argument: loss_fn

(ARG002)


224-224: Avoid specifying long messages outside the exception class

(TRY003)


226-226: Avoid specifying long messages outside the exception class

(TRY003)


247-247: Avoid specifying long messages outside the exception class

(TRY003)


249-249: Avoid specifying long messages outside the exception class

(TRY003)


291-291: Avoid specifying long messages outside the exception class

(TRY003)


293-293: Avoid specifying long messages outside the exception class

(TRY003)


327-327: Avoid specifying long messages outside the exception class

(TRY003)


329-329: Avoid specifying long messages outside the exception class

(TRY003)


355-355: Avoid specifying long messages outside the exception class

(TRY003)


357-357: Avoid specifying long messages outside the exception class

(TRY003)


371-371: Avoid specifying long messages outside the exception class

(TRY003)


373-373: Avoid specifying long messages outside the exception class

(TRY003)


384-384: Avoid specifying long messages outside the exception class

(TRY003)


436-438: Avoid specifying long messages outside the exception class

(TRY003)


527-527: Avoid specifying long messages outside the exception class

(TRY003)


552-552: Avoid specifying long messages outside the exception class

(TRY003)


554-554: Avoid specifying long messages outside the exception class

(TRY003)


573-573: Avoid specifying long messages outside the exception class

(TRY003)


589-606: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

🪛 Shellcheck (0.11.0)
scripts/local-training-e2e.sh

[warning] 372-372: i appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 427-427: Use single quotes, otherwise this expands now rather than when signalled.

(SC2064)

🔇 Additional comments (82)
examples/11_agentgym.py (5)

1-27: Excellent documentation structure!

The module docstring clearly describes the evaluation flow, lists available task types for both suites, and provides usage examples. This makes the example very accessible to users.


39-76: Clean implementation with flexible API key handling!

The function properly handles optional API keys by conditionally populating environment variables. The deployment configuration with resource limits (500m CPU, 1Gi memory, 180s timeout) is appropriate for evaluation workloads.


84-125: Consistent pattern with good parameterization!

The function follows the same structure as deploy_affine_evaluator, ensuring code consistency. The env_name parameter elegantly controls both the deployment name and Docker image tag.


133-214: Helpful usage examples with good UX!

The usage printing functions provide clear, actionable curl examples. The fallback to "YOUR_CHUTES_API_KEY" when the API key is not provided ensures the examples remain useful in all scenarios.


217-270: Well-structured main function with good user guidance!

The main function provides clear warnings when the API key is missing, parses CLI arguments to select the appropriate evaluator, and prints comprehensive usage information. The overall flow is intuitive and user-friendly.

justfile (1)

937-947: Excellent improvements to the Python SDK workflow!

The additions enhance the development experience significantly:

  • Strict bash error handling with set -euo pipefail
  • Smart venv strategy supporting both modern uv and traditional python3 -m venv
  • Conditional dependency installation (maturin only for standard venv)
  • Clear, helpful output messages guiding developers

These changes align well with modern Python development practices and make the SDK setup more reliable.

Also applies to: 957-959, 978-984

crates/basilica-sdk-python/python/basilica/training/__init__.py (5)

1-40: LGTM! Well-documented module setup.

The module docstring provides clear usage examples and authentication guidance. The imports are appropriate for the SDK functionality.


43-109: LGTM! Well-structured data models.

The dataclass definitions are clean with appropriate type hints and default values. The use of default_factory for mutable defaults (lists) is correct.


111-127: LGTM! Clean exception hierarchy.

The exception classes provide a clear hierarchy for error handling in the SDK.


235-384: LGTM! Consistent method implementations.

The remaining methods follow a consistent pattern with proper error handling. The close method correctly handles 404 errors for idempotency, which is a good practice for cleanup operations.


387-586: LGTM! Well-designed client with proper authentication and lifecycle management.

The TrainingClient class is well-structured with:

  • Proper API key validation and error handling using response.is_success property to check for 2xx responses
  • Flexible configuration via environment variables or parameters
  • Context manager support for resource cleanup
  • Comprehensive error handling including specific 404 detection for missing sessions

The 300-second timeout is an appropriate explicit configuration for training operations, where requests may involve long-running tasks, compared to httpx's default 5-second timeout for network inactivity. All httpx API usage is correct.

services/training-service/src/__init__.py (1)

1-3: LGTM!

Clean package initialization following Python conventions. The version string is consistent with the SDK versioning pattern.

crates/basilica-api/src/k8s/trait.rs (1)

129-131: LGTM!

The comments clearly document the division of responsibilities between the API and operator for TrainingSession CRD operations, which will help future maintainers understand the architecture.

crates/basilica-api/src/config/mod.rs (2)

49-56: LGTM!

The dev_mode flag is well-designed for local development workflows. The documentation clearly indicates production requirements, and the default value (false) ensures production-safe behavior out of the box.


66-67: LGTM!

The default initialization correctly sets dev_mode to false and provides an empty string for validator_hotkey, which aligns with the validation logic in server.rs that requires the hotkey in production mode.

examples/training_example.py (1)

39-59: LGTM!

The client initialization includes proper environment variable validation and user-friendly error messages. While the exception handler is broad, this is acceptable for example code where the goal is to provide clear guidance to users.

scripts/local-training-e2e.sh (7)

1-38: LGTM!

The script header is well-structured with clear documentation, proper error handling (set -euo pipefail), and a clean logging abstraction using color-coded functions.


40-65: LGTM!

Comprehensive prerequisite checks with helpful error messages. The automatic k3d installation improves the developer experience.


288-409: LGTM!

The E2E test orchestration is well-structured with proper error handling, health checks, and clear logging. The API token handling gracefully falls back when authentication is unavailable.


411-542: Training steps implementation is solid.

The function properly manages port-forwarding with cleanup, performs training operations, and validates results at each step. Error handling is comprehensive.


544-591: LGTM!

The status and logging functions provide comprehensive visibility into the cluster state and component logs, essential for debugging E2E tests.


613-662: LGTM!

Excellent documentation that clearly explains the architecture, usage, and what each test step does. This will significantly improve the developer experience.


664-701: LGTM!

Clean command dispatch with proper error handling for unknown commands.

crates/basilica-api/src/api/routes/mod.rs (1)

14-14: LGTM!

The training module export follows the established pattern and properly exposes the new Training Session API routes.

crates/basilica-operator/src/crd/mod.rs (1)

5-5: LGTM!

The training_session module export follows the established pattern and properly exposes the TrainingSession CRD types for use by the operator.

crates/basilica-operator/src/controllers/mod.rs (1)

6-6: LGTM!

The training_session_controller module export follows the established pattern and properly integrates the TrainingSession controller into the operator.

orchestrator/k8s/training/local-dev/namespace.yaml (1)

1-18: LGTM!

The namespace declarations are well-structured and follow Kubernetes best practices. The user namespace labeling (basilica.ai/user: testuser) is a good pattern for identifying and filtering user-specific resources.

services/training-service/tests/__init__.py (1)

1-1: LGTM!

Standard Python package initializer with appropriate docstring.

crates/basilica-api/src/api/mod.rs (1)

105-115: LGTM!

The training session routes are properly integrated into the protected routes with appropriate handler bindings. The comment clarifying that training operations are routed directly by Envoy Gateway is helpful architectural context.

orchestrator/k8s/training/training-service-rbac.yaml (1)

1-72: LGTM! RBAC configuration follows least-privilege principles.

The RBAC configuration is well-structured:

  • Appropriate read-only access to Secrets, ConfigMaps, and Nodes
  • Full lifecycle management for TrainingSession CRDs, Pods, and Services (required for operator functionality)
  • Event creation for observability
  • Finalizer access for proper cleanup

The permissions align with the controller's responsibilities for managing training session workloads.

docker-compose.training.yml (1)

1-80: LGTM! Well-configured Docker Compose for local development.

The configuration is well-structured with:

  • Appropriate health checks with sufficient start period for model loading (120s)
  • Proper service dependencies
  • GPU resource allocation
  • Volume persistence for models and checkpoints
  • Restart policies for resilience

The debug logging and SQLite database are appropriate for local development.

crates/basilica-operator/src/runtime.rs (3)

46-50: LGTM!

The TrainingSessionCtx struct follows the established pattern used by other controller contexts (JobCtx, RentalCtx, UserDeploymentCtx).


135-160: LGTM!

The reconcile and error policy functions follow the same pattern as existing controllers (reconcile_user_deployment, error_policy_user_deployment), with appropriate structured logging including namespace and session name.


333-359: LGTM!

The training sessions controller stream is correctly wired into the operator's main loop with consistent error handling and properly included in the futures::join! macro.

orchestrator/k8s/training/training-session-example.yaml (1)

1-94: LGTM!

These example manifests are well-documented and demonstrate both a comprehensive configuration (8B model with all optional fields) and a minimal configuration (70B model with defaults). The inline comments explaining each field are helpful for users.

crates/basilica-api/src/server.rs (3)

564-585: LGTM!

The dev mode implementation correctly creates a stub validator client pointing to a non-routable address (localhost:1) that will fail fast if accidentally invoked. The warning message clearly communicates limitations to developers.


902-981: LGTM!

Appropriately gating the validator health check and rental health check tasks behind dev_mode since these require a functional validator client.


1097-1123: Gate credit exhaustion check task behind !config.bittensor.dev_mode.

The credit exhaustion check task uses a validator client for stopping community rentals, but it's only gated by billing client presence—not by dev_mode. In dev_mode, the validator client is a stub pointing to a non-existent endpoint, so stop_community_rental_internal calls fail (the errors are logged but otherwise swallowed). While this won't crash the application, community rentals won't actually be stopped in dev mode.

The health check task (line 903) already uses the !config.bittensor.dev_mode pattern and should be followed here for consistency.

orchestrator/k8s/training/training-session-crd.yaml (1)

1-183: LGTM overall!

The CRD is well-designed with:

  • Clear required fields and sensible defaults
  • Proper validation constraints (min/max on numeric fields)
  • Comprehensive status tracking with appropriate phases
  • Useful printer columns for kubectl get output
  • Short names for convenience (ts, tsession)
services/training-service/src/backend.py (10)

1-13: LGTM - Imports and logger setup are appropriate.

Standard imports for the training backend with HuggingFace, PEFT, and PyTorch libraries.


15-33: LGTM - LoRA configuration with sensible defaults.

The alpha/rank ratio of 2 and the target modules covering standard projection layers align with common LoRA practices.


35-45: LGTM - Optimizer configuration with standard AdamW defaults.


47-59: LGTM - SessionState dataclass properly tracks training session state.


61-78: LGTM - Result dataclasses provide clear return structures.


226-292: LGTM - Forward-backward implementation is correct.

The weighted loss computation properly handles the shift for next-token prediction, and allowing multiple forward_backward calls before optim_step enables gradient accumulation.
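For reference, the shift means the logits at position t score the token at position t+1. A minimal pure-Python sketch of the weighted loss (a hypothetical helper standing in for the torch implementation):

```python
import math

def weighted_next_token_loss(logits, labels, weights):
    """Weighted cross-entropy under the next-token shift.

    logits: per-position logit rows (length T); labels: token ids (length T);
    weights: per-target weights (length T-1). logits[t] predicts labels[t+1].
    """
    total, denom = 0.0, 0.0
    for t in range(len(labels) - 1):
        row = logits[t]
        log_z = math.log(sum(math.exp(x) for x in row))   # log partition function
        log_p = row[labels[t + 1]] - log_z                # log-softmax at the gold token
        total += -log_p * weights[t]
        denom += weights[t]
    return total / denom

logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
loss = weighted_next_token_loss(logits, labels=[0, 1, 2], weights=[1.0, 1.0])
assert loss > 0
```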


294-317: LGTM - Optimizer step implementation is correct.

The gradient clipping, optimizer step, and gradient zeroing sequence is properly implemented.
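That sequence can be sketched in plain Python (a hypothetical SGD-style update standing in for AdamW, with a global-norm clip assumed for grad_clip):

```python
import math

def clip_grad_norm(grads, max_norm):
    """Scale grads in place so their global L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        for i in range(len(grads)):
            grads[i] *= scale
    return norm

def optim_step(params, grads, lr=1e-4, max_norm=1.0):
    clip_grad_norm(grads, max_norm)    # 1. clip accumulated gradients
    for i in range(len(params)):
        params[i] -= lr * grads[i]     # 2. apply the update
    for i in range(len(grads)):
        grads[i] = 0.0                 # 3. zero grads for the next accumulation
    return params
```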


319-381: LGTM - Text generation implementation handles edge cases well.

The temperature=0 (greedy decoding), top_k=0 (no filtering), and finish_reason detection are all correctly implemented.
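Those edge cases reduce to the following torch-free sketch of the sampling semantics (an illustration, not the service's generate call):

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=0, rng=random):
    """temperature=0 -> greedy argmax; top_k=0 -> no filtering."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    if top_k > 0:  # keep only the top_k largest logits
        kth = sorted(logits, reverse=True)[top_k - 1]
        logits = [x if x >= kth else float("-inf") for x in logits]
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    probs = [math.exp(x - m) for x in scaled]  # unnormalized, numerically stable
    z = sum(probs)
    r, acc = rng.random() * z, 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1
```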


383-432: LGTM - Checkpoint save implementation is comprehensive.

Correctly saves adapter weights and optionally the full training state including optimizer, step count, and configurations.


492-511: LGTM - Helper methods are correctly implemented.

_get_session provides proper validation, and _compute_logprobs correctly handles the shift alignment for next-token prediction.
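The alignment convention can be stated concisely (a pure-Python illustration, assuming logits[t] scores token t+1, so a length-T sequence yields T-1 log-probs):

```python
import math

def sequence_logprobs(logits, token_ids):
    """Per-token log-probs under the next-token shift."""
    out = []
    for t in range(len(token_ids) - 1):
        row = logits[t]
        log_z = math.log(sum(math.exp(x) for x in row))
        out.append(row[token_ids[t + 1]] - log_z)
    return out  # length len(token_ids) - 1

lp = sequence_logprobs([[0.0, 1.0], [1.0, 0.0]], [0, 1, 0])
assert len(lp) == 2 and all(v < 0 for v in lp)
```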

orchestrator/k8s/training/local-dev/gateway.yaml (1)

1-24: LGTM - Gateway configuration is appropriate for local development.

The GatewayClass and Gateway resources are correctly configured for Envoy Gateway with HTTP listener on port 8080.

crates/basilica-operator/src/k8s_client.rs (6)

23-24: LGTM - Import added for TrainingSession CRD types.


77-91: LGTM - Trait methods follow established patterns.

The new TrainingSession CRUD methods are consistent with existing CRD operations in the trait.


183-183: LGTM - Storage map added for TrainingSession CRDs in mock client.


363-408: LGTM - MockK8sClient TrainingSession implementations follow existing patterns.

The implementations are consistent with other CRD operations (BasilicaJob, GpuRental, etc.) in the mock client.


1038-1084: LGTM - KubeClient TrainingSession implementations are correct.

The real Kubernetes API implementations follow the established patterns with proper error handling for 409 conflicts and use of Server-Side Apply for status updates.


1664-1693: LGTM - RateLimitedKubeClient properly delegates with rate limiting.

All four TrainingSession methods correctly await rate limiting before delegating to the inner client.

crates/basilica-operator/src/controllers/training_session_controller.rs (6)

1-28: LGTM - Imports and constants are appropriate.

The constants for gateway name, namespace, and service port are well-defined for Envoy Gateway integration.


165-211: LGTM - Scheduling handler correctly monitors pod phases.

The state machine logic properly handles Running, Pending, and Failed pod states, with appropriate fallback to Pending if pod is missing.


486-489: restart_policy: Never means failed pods won't auto-recover.

With Never, if the training container crashes, the pod enters Failed state permanently. This is intentional for training jobs (to preserve state/logs), but ensure the controller handles this appropriately.

The controller does handle Failed pods in handle_scheduling by setting an error status. Verify this is the desired behavior for production.


497-541: LGTM - Service spec is correctly configured.

The selector matches the pod label, and the service port correctly targets the container port.


543-599: LGTM - HTTPRoute construction is correct.

The route properly configures path matching, URL rewriting to strip the session prefix, and backend reference to the training service.


601-682: LGTM - Unit tests cover key resource builder functionality.

Tests verify pod and service construction with expected names, ports, and GPU resource configuration.

Consider adding tests for the controller's reconcile logic using MockK8sClient to verify state transitions.

services/training-service/tests/test_backend.py (7)

1-31: LGTM - Test fixtures are well-structured.

The fixtures properly set up the backend with temp directories and handle CUDA/CPU device selection. Using facebook/opt-125m is appropriate for fast testing.


34-62: LGTM - Session creation tests cover happy path and error case.


64-112: LGTM - Forward/backward tests verify core training functionality.

Tests appropriately verify loss computation and token counting for both standard and weighted loss scenarios.


114-147: LGTM - Optimizer step test verifies multi-step training workflow.


188-212: Test may not catch the optimizer parameter bug.

Due to the bug identified in load_state (optimizer references stale parameters after loading), this test might pass but actual training after load_state would fail. Consider adding a test that does training after loading to verify end-to-end correctness.

🔎 Suggested additional test:
def test_training_after_load_checkpoint(self, backend, session_id, small_model):
    """Test that training works correctly after loading a checkpoint."""
    backend.create_session(
        session_id=session_id,
        base_model=small_model,
        lora_config=LoraConfiguration(rank=8),
        optimizer_config=OptimizerConfiguration(),
    )

    input_ids = torch.randint(0, 1000, (1, 32))
    attention_mask = torch.ones_like(input_ids)
    labels = input_ids.clone()

    # Train and save
    backend.forward_backward(session_id, input_ids, attention_mask, labels)
    backend.optim_step(session_id)
    path = backend.save_state(session_id, "checkpoint-1")

    # Load and train more
    backend.load_state(session_id, path)
    result = backend.forward_backward(session_id, input_ids, attention_mask, labels)
    step = backend.optim_step(session_id)
    
    # Verify training continued correctly
    assert step == 2
    assert result.loss > 0

214-227: LGTM - Status test verifies configuration is correctly stored.


229-267: LGTM - Session management and error handling tests are comprehensive.

Tests cover deletion, listing multiple sessions, and proper error raising for nonexistent sessions.

crates/basilica-operator/src/crd/training_session.rs (6)

9-42: LGTM!

The LoraConfig structure is well-defined with sensible defaults (rank=32, alpha=64, dropout=0.05) and appropriate validation ranges. The target modules cover standard attention projections commonly used in transformer fine-tuning.


62-97: LGTM!

OptimizerConfig provides sensible defaults for fine-tuning (learning_rate=1e-4, weight_decay=0.01, grad_clip=1.0). The configuration is appropriate for LoRA training scenarios.


99-123: LGTM!

CheckpointStorage provides flexible configuration for multiple storage backends with appropriate optional fields for credentials, region, and custom endpoints.


125-156: LGTM!

GpuResources struct provides appropriate configuration for GPU requirements with sensible defaults (count=1) and validation ranges (1-8 GPUs, 8-256 GB memory).


221-264: LGTM!

TrainingSessionPhase enum provides a clear lifecycle progression with appropriate requeue intervals for each phase. The terminal state handling and string conversion methods are correctly implemented.


266-427: LGTM!

TrainingSessionStatus provides comprehensive state tracking with well-implemented builder methods. The automatic timestamp updates and phase transitions (e.g., setting Failed phase in with_error()) ensure consistent state management. The tests adequately cover the builder pattern and phase logic.

crates/basilica-api/src/api/routes/training.rs (2)

40-135: LGTM!

Request types are well-structured with defaults that match the CRD definitions. The use of serde(default) attributes ensures backward compatibility when fields are omitted.


427-475: LGTM!

The delete handler correctly uses typed error matching (ae.code == 404) and appropriately treats 404 as a non-error (idempotent delete). The implementation is robust and follows best practices.

services/training-service/src/server.py (5)

1-46: LGTM!

The initialization logic properly uses environment variables for configuration and correctly handles device selection. The lifespan context manager ensures proper backend lifecycle management.


82-161: LGTM!

Response models are well-structured with appropriate field types. The SessionStatusResponse correctly includes training-specific fields (lora_rank, learning_rate) that are internal to the service.


263-267: The logprobs wrapping is correct. The backend's ForwardBackwardResult returns logprobs: List[float] (a flat list for a single sequence), while ForwardBackwardResponse expects logprobs: List[List[float]] (batched format). Wrapping the result in a list converts it to the required API contract.


176-179: list_sessions() is a trivial in-memory operation and does not require blocking prevention.

The list_sessions() method simply returns list(self.sessions.keys()) — building a list of dictionary keys involves no I/O, heavy computation, or GPU operations. This will not block the event loop.

Likely an incorrect or invalid review comment.


61-70: The Python service uses Adam optimizer parameters not exposed in the CRD.

The backend internally uses beta1, beta2, and epsilon when configuring the AdamW optimizer, and the Python API accepts these parameters. However, the Rust CRD only exposes learning_rate, weight_decay, and grad_clip. This is intentional: the Python training service provides fine-grained control over Adam parameters, while the CRD presents a simpler interface for Kubernetes users. The field defaults match industry standards for Adam optimization and are properly validated.
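A minimal sketch of that split (hypothetical function name; the beta/epsilon values shown are the common AdamW defaults the comment refers to):

```python
def adamw_kwargs(learning_rate=1e-4, weight_decay=0.01,
                 beta1=0.9, beta2=0.999, epsilon=1e-8):
    """The CRD exposes only learning_rate/weight_decay (plus grad_clip);
    beta1/beta2/epsilon fall back to these service-side defaults."""
    return {
        "lr": learning_rate,
        "betas": (beta1, beta2),
        "eps": epsilon,
        "weight_decay": weight_decay,
    }
```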

Comment on lines +181 to +188
fn user_namespace(user_id: &str) -> String {
// Sanitize user_id for K8s namespace (lowercase, alphanumeric + hyphens)
let sanitized: String = user_id
.chars()
.map(|c| if c.is_alphanumeric() { c.to_ascii_lowercase() } else { '-' })
.collect();
format!("u-{}", sanitized)
}

⚠️ Potential issue | 🟠 Major

Potential namespace collision risk.

The sanitization logic could create collisions: different user IDs like "user_123" and "user-123" would both map to "u-user-123". Additionally, the function doesn't enforce Kubernetes namespace name length limits (max 63 characters).

Consider using a hash-based approach for deterministic, collision-free namespace names:

🔎 Suggested fix:
 fn user_namespace(user_id: &str) -> String {
-    // Sanitize user_id for K8s namespace (lowercase, alphanumeric + hyphens)
-    let sanitized: String = user_id
-        .chars()
-        .map(|c| if c.is_alphanumeric() { c.to_ascii_lowercase() } else { '-' })
-        .collect();
-    format!("u-{}", sanitized)
+    // Use hash to ensure unique, valid K8s namespace name
+    use std::collections::hash_map::DefaultHasher;
+    use std::hash::{Hash, Hasher};
+    
+    let mut hasher = DefaultHasher::new();
+    user_id.hash(&mut hasher);
+    let hash = hasher.finish();
+    
+    format!("u-{:x}", hash)  // Produces names like "u-9f86d081884c7d65"
 }
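The collision is easy to demonstrate with a direct Python port of the sanitization rule (illustrative only; Rust's is_alphanumeric and Python's isalnum differ on some Unicode classes):

```python
def user_namespace(user_id: str) -> str:
    """Port of the Rust rule: lowercase alphanumerics, everything else -> '-'."""
    sanitized = "".join(c.lower() if c.isalnum() else "-" for c in user_id)
    return f"u-{sanitized}"

# Distinct user IDs collapse onto the same namespace:
assert user_namespace("user_123") == user_namespace("user-123") == "u-user-123"
```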

},
"ttlSeconds": req.ttl_seconds,
"seed": req.seed,
"enableBilling": true

⚠️ Potential issue | 🟡 Minor

Hardcoded enableBilling prevents user override.

The enableBilling field is always set to true, ignoring any potential user preference. Consider adding it as an optional field in CreateSessionRequest.

🔎 Suggested fix:

Add the field to CreateSessionRequest:

 pub struct CreateSessionRequest {
     pub base_model: String,
     pub checkpoint_storage: CheckpointStorageRequest,
     // ... other fields ...
+    #[serde(default = "default_enable_billing")]
+    pub enable_billing: bool,
 }
+
+fn default_enable_billing() -> bool {
+    true
+}

Then use it in build_training_session_crd:

-            "enableBilling": true
+            "enableBilling": req.enable_billing

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In crates/basilica-api/src/api/routes/training.rs around line 244, the JSON
hardcodes "enableBilling": true which prevents users from overriding billing;
add an optional field (Option<bool>) named enableBilling to the
CreateSessionRequest struct, update request deserialization/validation as
needed, and change build_training_session_crd to read request.enableBilling and
set the CRD field to that value (use false or a sensible default when None)
instead of the hardcoded true.

Comment on lines +284 to +293
if let Err(e) = k8s_client.create_namespace(&namespace).await {
// Ignore "already exists" errors
if !e.to_string().contains("409") && !e.to_string().contains("AlreadyExists") {
error!(error = %e, namespace = %namespace, "Failed to create namespace");
apimetrics::record_request("training.create", "POST", start, false);
return Err(ApiError::Internal {
message: format!("Failed to create namespace: {}", e),
});
}
}

🛠️ Refactor suggestion | 🟠 Major

Fragile error detection using string matching.

Checking for "409" or "AlreadyExists" in the error string is brittle and may miss actual errors or incorrectly ignore failures if the error message format changes.

🔎 Apply this fix to use typed error matching:
     if let Err(e) = k8s_client.create_namespace(&namespace).await {
-        // Ignore "already exists" errors
-        if !e.to_string().contains("409") && !e.to_string().contains("AlreadyExists") {
+        // Ignore "already exists" errors using typed matching
+        if !matches!(e, kube::Error::Api(ref ae) if ae.code == 409) {
             error!(error = %e, namespace = %namespace, "Failed to create namespace");
             apimetrics::record_request("training.create", "POST", start, false);
             return Err(ApiError::Internal {
                 message: format!("Failed to create namespace: {}", e),
             });
         }
     }

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In crates/basilica-api/src/api/routes/training.rs around lines 284 to 293,
replace the fragile string-based checks against e.to_string() with typed error
matching: match the error (likely kube::Error) and specifically handle
kube::Error::Api(ae) where ae.code == 409 or ae.reason == "AlreadyExists" as the
non-fatal case; for all other error variants log the error with context, record
the metric, and return the ApiError::Internal as before. Ensure you import the
kube types if needed and keep the existing logging/metrics/return behavior for
real failures.

POSTGRES_DB: basilica_api
ports:
- "5432:5432"
- "5433:5432" # Use 5433 to avoid conflict with local Postgres

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for the compose.dev.yml file and check its content
echo "=== Checking compose.dev.yml ===" 
if [ -f "scripts/api/compose.dev.yml" ]; then
  head -20 scripts/api/compose.dev.yml
  echo ""
  sed -n '8,12p' scripts/api/compose.dev.yml
else
  echo "File not found at expected path"
fi

echo ""
echo "=== Searching for localhost:5432 references in codebase ===" 
rg -n 'localhost:5432|127\.0\.0\.1:5432' --type-list 2>/dev/null | head -20 || echo "No direct references found or rg not available"

echo ""
echo "=== Searching in .env and documentation files ===" 
find . -type f \( -name ".env*" -o -name "README*" -o -name "*.md" \) -exec grep -l "5432" {} \; 2>/dev/null | head -10

Repository: one-covenant/basilica

Length of output: 1092


🏁 Script executed:

#!/bin/bash
# Check the documentation files that mention 5432 for context
echo "=== Checking docs/validator.md for postgres references ==="
grep -n -C 3 "5432" ./docs/validator.md 2>/dev/null || echo "No matches found"

echo ""
echo "=== Checking orchestrator/README.md for postgres references ==="
grep -n -C 3 "5432" ./orchestrator/README.md 2>/dev/null || echo "No matches found"

echo ""
echo "=== Checking for any DATABASE_URL or connection string patterns ==="
rg -n "DATABASE_URL|postgres.*localhost" --type md --type yaml 2>/dev/null | head -20 || echo "No matches found"

echo ""
echo "=== Look for local development or setup instructions ==="
find . -type f -name "*.md" -exec grep -l "docker-compose\|docker compose\|local.*setup\|development.*setup" {} \; 2>/dev/null | grep -i dev | head -5

Repository: one-covenant/basilica

Length of output: 3472


Update docs/validator.md to reflect the new PostgreSQL port for local development.

The port change in scripts/api/compose.dev.yml is correct with a clear inline comment. However, docs/validator.md contains example connection strings (lines 933 and 2743) showing localhost:5432 which would now be incorrect. These examples should be updated to use localhost:5433 if they apply to the API service running with the new compose setup. Note that internal Docker references (e.g., postgres:5432 in other compose files) are unaffected and correct.

🤖 Prompt for AI Agents
In docs/validator.md (lines ~933 and ~2743) update the example connection
strings that reference localhost:5432 to localhost:5433 so they match the API
local dev compose mapping (scripts/api/compose.dev.yml line 10) — leave any
internal Docker host references (e.g., postgres:5432) unchanged; verify both
examples still parse correctly and run a quick search to ensure no other
localhost:5432 examples remain that should be localhost:5433.

Comment on lines +1 to +39
# Use NVIDIA CUDA image for GPU support, or Python slim for CPU-only
ARG BASE_IMAGE=python:3.11-slim
FROM ${BASE_IMAGE}

# Install system dependencies
RUN apt-get update && apt-get install -y \
curl \
git \
&& rm -rf /var/lib/apt/lists/*

# Set up working directory
WORKDIR /app

# Install Python dependencies
COPY pyproject.toml .
RUN pip install --no-cache-dir --upgrade pip && \
pip install --no-cache-dir -e "."

# Copy source code
COPY src/ src/
COPY tests/ tests/

# Create directories for models and checkpoints
RUN mkdir -p /models /checkpoints

# Environment variables
ENV PYTHONUNBUFFERED=1
ENV MODEL_CACHE_DIR=/models
ENV CHECKPOINT_DIR=/checkpoints

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=120s --retries=3 \
CMD curl -f http://localhost:8000/health || exit 1

# Expose port
EXPOSE 8000

# Run service
CMD ["python", "-m", "uvicorn", "src.server:app", "--host", "0.0.0.0", "--port", "8000"]

⚠️ Potential issue | 🟠 Major

Add non-root user for security.

The container runs as root by default, which violates the principle of least privilege and creates security risks. Container processes should run as non-root users.

🔎 Apply this diff to add a non-root user:
 # Install system dependencies
 RUN apt-get update && apt-get install -y \
     curl \
     git \
     && rm -rf /var/lib/apt/lists/*
 
+# Create non-root user
+RUN useradd -m -u 1000 -s /bin/bash appuser && \
+    mkdir -p /models /checkpoints && \
+    chown -R appuser:appuser /app /models /checkpoints
+
 # Set up working directory
 WORKDIR /app
 
 # Install Python dependencies
 COPY pyproject.toml .
 RUN pip install --no-cache-dir --upgrade pip && \
     pip install --no-cache-dir -e "."
 
 # Copy source code
 COPY src/ src/
 COPY tests/ tests/
 
-# Create directories for models and checkpoints
-RUN mkdir -p /models /checkpoints
-
 # Environment variables
 ENV PYTHONUNBUFFERED=1
 ENV MODEL_CACHE_DIR=/models
 ENV CHECKPOINT_DIR=/checkpoints
 
+# Switch to non-root user
+USER appuser
+
 # Health check
 HEALTHCHECK --interval=30s --timeout=10s --start-period=120s --retries=3 \
     CMD curl -f http://localhost:8000/health || exit 1

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +6 to +19
dependencies = [
"torch>=2.1.0",
"transformers>=4.36.0",
"peft>=0.7.0",
"accelerate>=0.25.0",
"safetensors>=0.4.0",
"tokenizers>=0.15.0",
"boto3>=1.34.0",
"fastapi>=0.109.0",
"uvicorn>=0.27.0",
"pydantic>=2.5.0",
"structlog>=24.1.0",
"httpx>=0.27.0",
]

🛠️ Refactor suggestion | 🟠 Major

Pin dependency versions to avoid unexpected breaking changes.

All dependencies use >= constraints without upper bounds, which can lead to:

  • Unexpected breaking changes when major versions are released
  • Inconsistent behavior across deployments
  • Difficult debugging when dependencies change behavior
🔎 Recommended approach for dependency management:

Consider one of these strategies:

Option 1: Pin exact versions in pyproject.toml (strictest)

dependencies = [
    "torch==2.1.0",
    "transformers==4.36.0",
    # ...
]

Option 2: Use compatible version ranges (recommended)

dependencies = [
    "torch>=2.1.0,<3.0.0",
    "transformers>=4.36.0,<5.0.0",
    # ...
]

Option 3: Use a requirements.txt or poetry.lock for pinning

Additionally, maintain a requirements-lock.txt or use Poetry with poetry.lock to ensure reproducible builds while keeping pyproject.toml flexible for development.

This is especially critical for ML dependencies like torch and transformers where API changes are common.

🤖 Prompt for AI Agents
In services/training-service/pyproject.toml around lines 6 to 19 the dependency
constraints are open-ended (using >=) which can introduce breaking changes;
update each dependency to either a pinned exact version (e.g., ==) or,
preferably, a compatible upper-bounded range (e.g., >=x.y.z,<next-major) for all
listed packages (torch, transformers, peft, accelerate, safetensors, tokenizers,
boto3, fastapi, uvicorn, pydantic, structlog, httpx), and ensure you commit a
lockfile (poetry.lock or a generated requirements-lock.txt) so builds are
reproducible; for ML libs like torch/transformers prefer tight upper bounds
(<3.0.0, <5.0.0 respectively) to avoid major-breaking upgrades.

Comment on lines +199 to +201
# Wrap with LoRA
model = get_peft_model(base, peft_config)
model.print_trainable_parameters()

⚠️ Potential issue | 🟡 Minor

get_peft_model modifies base model; print_trainable_parameters uses stdout.

  1. get_peft_model can modify the base model in place, which may cause issues if the base model is cached and shared across sessions.
  2. print_trainable_parameters() outputs to stdout rather than the structured logger.
🔎 Suggested fixes:
-        # Wrap with LoRA
-        model = get_peft_model(base, peft_config)
-        model.print_trainable_parameters()
+        # Wrap with LoRA (using copy to avoid modifying cached base model)
+        import copy
+        base_copy = copy.deepcopy(base) if model_name in self.base_models else base
+        model = get_peft_model(base_copy, peft_config)
+        
+        # Log trainable parameters
+        trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
+        total_params = sum(p.numel() for p in model.parameters())
+        logger.info(
+            "trainable_parameters",
+            trainable=trainable_params,
+            total=total_params,
+            percentage=100 * trainable_params / total_params,
+        )

Committable suggestion skipped: line range outside the PR's diff.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 10

♻️ Duplicate comments (2)
crates/basilica-operator/src/controllers/training_session_controller.rs (2)

203-212: Status update condition still missing endpoint and start_time checks.

The condition now includes tokens_processed, but changes to endpoint or start_time still won't trigger status updates. This can cause the CRD status to become stale when these fields change during phase transitions.

Consider deriving PartialEq on TrainingSessionStatus and comparing directly, or extend the condition to include these fields.
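The derive-and-compare approach can be sketched as follows (the struct and field names are illustrative stand-ins, not the actual TrainingSessionStatus definition):

```rust
// Illustrative stand-in; the real TrainingSessionStatus has more fields
// and lives in the operator crate.
#[derive(Clone, Debug, PartialEq, Default)]
pub struct SessionStatus {
    pub phase: String,
    pub tokens_processed: u64,
    pub endpoint: Option<String>,
    pub start_time: Option<String>,
}

// With PartialEq derived, one comparison covers every field, so fields like
// endpoint and start_time can never be silently left out of the condition.
pub fn needs_status_update(current: &SessionStatus, desired: &SessionStatus) -> bool {
    current != desired
}
```

The trade-off is that a whole-struct comparison also triggers updates on fields that may intentionally be ignored today, so it should be adopted deliberately rather than as a drop-in.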


239-251: Using is_err() conflates "not found" with other errors.

This issue was flagged in a previous review. The code treats any error from get_pod as "pod doesn't exist" and attempts creation. Network failures, permission errors, or API server issues would be masked and could lead to spurious creation attempts.

🔎 Suggested fix using proper error discrimination:
-        // Check if pod already exists
-        let existing_pod = self.client.get_pod(namespace, &pod_name).await;
-        if existing_pod.is_err() {
+        // Check if pod already exists
+        let pod_exists = match self.client.get_pod(namespace, &pod_name).await {
+            Ok(_) => true,
+            Err(e) if e.to_string().contains("NotFound") || e.to_string().contains("not found") => false,
+            Err(e) => {
+                error!(error = %e, "failed to check pod existence");
+                return Ok(status.with_error(format!("Failed to check pod: {}", e)));
+            }
+        };
+        if !pod_exists {
             // Create the pod
🧹 Nitpick comments (9)
.gitignore (1)

102-102: Remove redundant .venv pattern.

Both .venv/ (line 102) and .venv (line 137) are now in the file. In gitignore syntax, .venv/ already matches the .venv directory, so carrying both patterns is redundant.

Keep only one. Since Python virtual environments are always directories, .venv/ is sufficient and more explicit about intent.

🔎 Apply this diff to consolidate on the bare pattern at line 102 (and then drop the duplicate at line 137):
  # Runtime State
  *.pid
  *.lock
  tmp/
  temp/
- .venv/
+ .venv

  # Binaries

Or keep the explicit directory pattern and remove the new one:

- .venv

Also applies to: 137-137

scripts/local-training-e2e.sh (5)

58-62: Consider verifying the k3d install script before execution.

Piping curl directly to bash can execute arbitrary code from a remote URL without verification. For a local dev script this is often acceptable, but consider downloading first, inspecting, then executing, or using a package manager.

🔎 Safer alternative:
     # Install k3d if needed
     if ! command -v k3d &> /dev/null; then
         log_info "Installing k3d..."
-        curl -s https://raw.githubusercontent.com/k3d-io/k3d/main/install.sh | bash
+        # Option 1: Use package manager
+        # brew install k3d
+        # Option 2: Download and inspect before running
+        local install_script="/tmp/k3d-install.sh"
+        curl -sL https://raw.githubusercontent.com/k3d-io/k3d/main/install.sh -o "$install_script"
+        log_info "Downloaded install script to $install_script - review if needed"
+        bash "$install_script"
+        rm -f "$install_script"
     fi

343-380: Consider extracting the curl logic to reduce duplication.

The authenticated and unauthenticated curl calls are nearly identical. This could be simplified with a helper function or by conditionally building the headers.

🔎 Optional refactor:
# Build curl args
CURL_ARGS=(-s -X POST "$API_URL/sessions" -H "Content-Type: application/json")
if [ -n "$AUTH_HEADER" ]; then
    CURL_ARGS+=(-H "$AUTH_HEADER")
fi

RESPONSE=$(curl "${CURL_ARGS[@]}" -d '{
    "baseModel": "facebook/opt-125m",
    ...
}')

401-409: Unused loop variable i.

The loop counter i is never used inside the loop. Use _ to indicate an intentionally unused variable, which also silences shellcheck warning SC2034.

🔎 Apply this diff:
-    for i in {1..60}; do
+    for _ in {1..60}; do

449-456: Consider using single quotes for the trap command.

The current trap with double quotes works correctly here because $PF_PID is expanded when the trap is set, capturing the actual PID. However, using single quotes is more idiomatic and makes the intent clearer while silencing shellcheck warning SC2064.

🔎 Apply this diff:
     kubectl port-forward -n "$NAMESPACE" "pod/$POD_NAME" 8001:8000 &
     PF_PID=$!
     sleep 3

     # Cleanup on exit
-    trap "kill $PF_PID 2>/dev/null" EXIT
+    trap 'kill $PF_PID 2>/dev/null' EXIT

500-502: Arbitrary sleep for model loading.

The sleep 10 is a fixed wait that may be insufficient for slower systems or unnecessary for faster ones. Consider polling the session status endpoint to confirm the model is loaded, similar to how pod readiness is checked elsewhere.
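The fixed wait could be replaced with a small polling helper along these lines (a sketch; the health endpoint path and port in the usage comment are assumptions about the local setup, not taken from the script):

```shell
# Hypothetical helper: retry a command once per second until it succeeds
# or the timeout (in seconds) elapses.
wait_until() {
    local timeout="$1"
    shift
    local elapsed=0
    until "$@"; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
}

# Usage sketch: poll an assumed health endpoint instead of `sleep 10`.
# wait_until 60 curl -sf http://localhost:8001/health >/dev/null
```

This keeps fast systems fast while giving slow systems the full timeout, matching how pod readiness is already handled elsewhere in the script.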

crates/basilica-operator/src/controllers/training_session_controller.rs (3)

267-279: Same is_err() issue for network policy check.

While the network policy failure is non-fatal (as noted in the comment), distinguishing error types would provide better observability. Consider logging differently for "not found" vs other errors.


361-370: Fragile string-based error detection for conflict.

Checking e.to_string().contains("409") or contains("AlreadyExists") is brittle and may break if error message format changes. Consider using typed error variants if the K8sClient trait supports them.


808-889: Good unit test coverage for resource builders.

The tests validate the core resource construction logic. Consider adding tests for:

  • Error cases (missing UID once that validation is added)
  • CPU-only mode (gpu_count = 0)
  • HTTPRoute construction
  • NetworkPolicy construction
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 11a5072 and 40e914d.

📒 Files selected for processing (4)
  • .gitignore (1 hunks)
  • crates/basilica-operator/src/controllers/training_session_controller.rs (1 hunks)
  • scripts/local-training-e2e.sh (1 hunks)
  • services/training-service/README.md (1 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.18.1)
services/training-service/README.md

93-93: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


244-244: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Shellcheck (0.11.0)
scripts/local-training-e2e.sh

[warning] 401-401: i appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 456-456: Use single quotes, otherwise this expands now rather than when signalled.

(SC2064)

🔇 Additional comments (14)
services/training-service/README.md (1)

1-305: Comprehensive and well-organized documentation.

The README effectively documents the Training Service with clear sections covering prerequisites, quick start, API endpoints, configuration, and troubleshooting. The architecture diagram, example commands, and E2E testing guidance provide excellent user onboarding. The documentation properly aligns with the TrainingSession CRD workflow and operator-driven resource provisioning introduced in this PR.

scripts/local-training-e2e.sh (12)

1-38: LGTM!

Good script setup with set -euo pipefail, clear documentation, and useful logging helpers with color-coded output.


67-96: LGTM!

Well-structured cluster setup with proper idempotency check, reasonable port mappings, and appropriate wait for node readiness.


98-127: LGTM with minor note.

Good fallback logic for helm vs kubectl installation. The sleep 10 on line 121 before the kubectl wait is somewhat arbitrary but acceptable for a local dev script since the actual readiness is verified by the subsequent wait command.


129-150: LGTM!

Clean image build and load workflow for local development.


152-183: LGTM!

Well-organized deployment functions with proper sequencing and readiness checks.


185-211: LGTM!

Good defensive handling for stale Postgres volumes with the user existence check before proceeding.


213-286: LGTM!

Database reset, API key generation, and API runner functions are well-organized for the local dev workflow.


288-312: LGTM!

Good cleanup function with proper wait for pod termination.


509-548: LGTM!

Well-structured training loop with proper error handling and informative logging for each step.


573-620: LGTM!

Status and log viewing functions provide good observability into the cluster state.


622-640: LGTM!

Comprehensive cleanup that handles Postgres, cluster deletion, and generated artifacts.


642-734: LGTM!

Excellent help documentation with clear quick-start guide, and well-organized main command dispatch.

crates/basilica-operator/src/controllers/training_session_controller.rs (1)

399-415: LGTM - MVP placeholder acknowledged.

The comment clearly documents that health checks are deferred. The immediate transition to Ready is acceptable for MVP.

Comment on lines +253 to +265
// Check if service already exists
let existing_svc = self.client.get_service(namespace, &svc_name).await;
if existing_svc.is_err() {
// Create the service
let svc = build_training_service(session, namespace, name)?;
match self.client.create_service(namespace, &svc).await {
Ok(_) => info!(service = %svc_name, "created training service"),
Err(e) => {
error!(error = %e, "failed to create training service");
return Ok(status.with_error(format!("Failed to create service: {}", e)));
}
}
}

⚠️ Potential issue | 🟡 Minor

Same is_err() issue for service existence check.

Apply the same fix pattern as recommended for the pod check to properly distinguish "not found" from other errors.
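The fix pattern the comment refers to, sketched with a placeholder error type (the real client error is presumably kube::Error or a wrapper around it, so the variant names here are assumptions):

```rust
// Illustrative error type standing in for whatever the K8s client returns.
#[derive(Debug)]
pub enum ClientError {
    NotFound,
    Api(u16),
    Other(String),
}

// Only a "not found" result means it is safe to create the resource;
// any other failure is surfaced instead of being masked.
pub fn should_create(check: Result<(), ClientError>) -> Result<bool, ClientError> {
    match check {
        Ok(()) => Ok(false),                    // resource already exists
        Err(ClientError::NotFound) => Ok(true), // safe to create
        Err(ClientError::Api(404)) => Ok(true), // HTTP 404 from the API server
        Err(e) => Err(e),                       // network/auth/etc: real failure
    }
}
```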

Comment on lines +314 to +321
"Failed" => {
let message = pod
.status
.as_ref()
.and_then(|s| s.message.clone())
.unwrap_or_else(|| "Pod failed".into());
Ok(status.with_error(message))
}

⚠️ Potential issue | 🟠 Major

Pod failure doesn't transition session to Failed phase.

When the pod enters "Failed" state, only an error message is set via with_error(), but the session phase remains "Scheduling". This could cause the reconciler to keep checking a failed pod indefinitely.

🔎 Consider transitioning to Failed phase:
                     "Failed" => {
                         let message = pod
                             .status
                             .as_ref()
                             .and_then(|s| s.message.clone())
                             .unwrap_or_else(|| "Pod failed".into());
-                        Ok(status.with_error(message))
+                        Ok(status
+                            .with_phase(TrainingSessionPhase::Failed)
+                            .with_error(message))
                     }
🤖 Prompt for AI Agents
In crates/basilica-operator/src/controllers/training_session_controller.rs
around lines 314 to 321, when matching a Pod with "Failed" status the code only
calls with_error(...) and does not update the session phase; change the status
update to set the session phase to Failed as well (e.g., call the method that
sets phase to TrainingSessionPhase::Failed or use
status.with_phase(...).with_error(message)) so the session transitions out of
Scheduling and the reconciler stops repeatedly processing the failed pod.

Comment on lines +389 to +394
Err(e) => {
warn!(error = %e, "service not found");
// Recreate service
let svc = build_training_service(session, namespace, name)?;
self.client.create_service(namespace, &svc).await?;
Ok(status)

⚠️ Potential issue | 🟡 Minor

Inconsistent error propagation.

Here, create_service errors propagate via ?, returning an Err. In handle_pending (lines 258-264), service creation errors return Ok(status.with_error(...)). This inconsistency could cause different reconcile behaviors for similar failures.

Consider using consistent error handling strategy across all phase handlers.

🤖 Prompt for AI Agents
crates/basilica-operator/src/controllers/training_session_controller.rs around
lines 389 to 394: the handler currently propagates create_service errors with
`?` while other handlers (e.g., handle_pending at lines 258-264) convert similar
failures into an Ok(status.with_error(...)) result; make the behavior consistent
by catching the error from self.client.create_service(...) instead of using `?`
and return Ok(status.with_error(format!("create_service failed: {}", e))) (or
alternatively change the other handler to propagate — pick one consistent
strategy across all phase handlers), ensuring the status is updated with the
error and the function returns Ok(status) rather than bubbling the error.

Comment on lines +453 to +455
// Delete the pod
let _ = self.client.delete_pod(namespace, &pod_name).await;
return Ok(status.with_phase(TrainingSessionPhase::Terminated));

⚠️ Potential issue | 🟡 Minor

Silently ignoring pod deletion result.

The let _ = discards any error from delete_pod, which could mask cleanup failures. At minimum, log the outcome for observability.

🔎 Suggested fix:
-                    let _ = self.client.delete_pod(namespace, &pod_name).await;
+                    if let Err(e) = self.client.delete_pod(namespace, &pod_name).await {
+                        warn!(error = %e, pod = %pod_name, "failed to delete pod on TTL expiry");
+                    }
🤖 Prompt for AI Agents
In crates/basilica-operator/src/controllers/training_session_controller.rs
around lines 453-455, replace the ignored call `let _ =
self.client.delete_pod(...).await;` with code that captures the Result from
delete_pod, matches on it, and logs the outcome (include namespace and pod_name)
— log success at debug/info and log failures with error details at warn/error so
deletion failures are observable; keep returning
Ok(status.with_phase(TrainingSessionPhase::Terminated)) afterwards (or
optionally propagate the error if you want to treat deletion failure as fatal).

Comment on lines +586 to +593
let owner_ref = OwnerReference {
api_version: "basilica.ai/v1".to_string(),
kind: "TrainingSession".to_string(),
name: name.to_string(),
uid: session.metadata.uid.clone().unwrap_or_default(),
controller: Some(true),
block_owner_deletion: Some(true),
};

⚠️ Potential issue | 🟠 Major

Owner reference UID defaults to empty string if missing.

If session.metadata.uid is None, the owner reference will have an empty UID. Kubernetes requires a valid UID for owner references to work correctly for garbage collection. This should likely return an error instead.

🔎 Suggested fix:
+    let uid = session.metadata.uid.clone()
+        .ok_or_else(|| anyhow::anyhow!("TrainingSession missing UID"))?;
+
     let owner_ref = OwnerReference {
         api_version: "basilica.ai/v1".to_string(),
         kind: "TrainingSession".to_string(),
         name: name.to_string(),
-        uid: session.metadata.uid.clone().unwrap_or_default(),
+        uid,
         controller: Some(true),
         block_owner_deletion: Some(true),
     };
🤖 Prompt for AI Agents
In crates/basilica-operator/src/controllers/training_session_controller.rs
around lines 586 to 593, the OwnerReference is being constructed using
session.metadata.uid.clone().unwrap_or_default() which will create an empty UID
when the session UID is missing; Kubernetes requires a valid UID so instead
check for the UID and return an error if absent (e.g., use
session.metadata.uid.clone().ok_or_else(||anyhow!("missing metadata.uid for
TrainingSession {name}"))?), propagate the error by ensuring the enclosing
function returns a Result, and only build the OwnerReference when a non-empty
UID is present.

Comment on lines +646 to +653
let owner_ref = OwnerReference {
api_version: "basilica.ai/v1".to_string(),
kind: "TrainingSession".to_string(),
name: name.to_string(),
uid: session.metadata.uid.clone().unwrap_or_default(),
controller: Some(true),
block_owner_deletion: Some(true),
};

⚠️ Potential issue | 🟠 Major

Same empty UID issue in owner reference.

Same concern as build_training_pod - the owner reference will have an empty UID if the session metadata UID is missing. Apply the same fix pattern.

🤖 Prompt for AI Agents
In crates/basilica-operator/src/controllers/training_session_controller.rs
around lines 646 to 653, the OwnerReference is being built with
session.metadata.uid.clone().unwrap_or_default(), which can produce an empty
UID; mirror the fix used in build_training_pod by retrieving the UID via
session.metadata.uid.clone().ok_or_else(|| anyhow::anyhow!("missing session
UID"))? and propagate/return the error if missing, then use the guaranteed UID
when constructing the OwnerReference instead of unwrapping to a default empty
string.

Comment on lines +692 to +731
let route_json = serde_json::json!({
"apiVersion": "gateway.networking.k8s.io/v1",
"kind": "HTTPRoute",
"metadata": {
"name": route_name,
"namespace": namespace,
"labels": {
"app": "basilica-training",
"session": session_name
}
},
"spec": {
"parentRefs": [{
"name": DEFAULT_GATEWAY_NAME,
"namespace": DEFAULT_GATEWAY_NAMESPACE
}],
"rules": [{
"matches": [{
"path": {
"type": "PathPrefix",
"value": path_prefix
}
}],
"filters": [{
"type": "URLRewrite",
"urlRewrite": {
"path": {
"type": "ReplacePrefixMatch",
"replacePrefixMatch": "/"
}
}
}],
"backendRefs": [{
"name": backend_service,
"namespace": namespace,
"port": port
}]
}]
}
});

⚠️ Potential issue | 🟠 Major

HTTPRoute missing owner reference for garbage collection.

Unlike the Pod and Service, this HTTPRoute doesn't have an ownerReferences field. When the TrainingSession is deleted, this route will be orphaned and must be manually cleaned up.

🔎 Add owner reference to metadata:

The function would need the session parameter to access the UID:

 fn build_training_http_route(
+    session: &TrainingSession,
     session_name: &str,
     namespace: &str,
     backend_service: &str,
     port: u16,
 ) -> Result<DynamicObject> {
     let route_name = format!("training-route-{}", session_name);
     let path_prefix = format!("/sessions/{}/", session_name);
+    let uid = session.metadata.uid.clone()
+        .ok_or_else(|| anyhow::anyhow!("TrainingSession missing UID"))?;

     let route_json = serde_json::json!({
         // ...
         "metadata": {
             "name": route_name,
             "namespace": namespace,
             "labels": {
                 "app": "basilica-training",
                 "session": session_name
-            }
+            },
+            "ownerReferences": [{
+                "apiVersion": "basilica.ai/v1",
+                "kind": "TrainingSession",
+                "name": session_name,
+                "uid": uid,
+                "controller": true,
+                "blockOwnerDeletion": true
+            }]
         },
         // ...
🤖 Prompt for AI Agents
crates/basilica-operator/src/controllers/training_session_controller.rs around
lines 692 to 731: the generated HTTPRoute JSON is missing
metadata.ownerReferences so the route will be orphaned when the TrainingSession
is deleted; update the function signature to accept the TrainingSession (or its
UID and controller metadata) and add an ownerReferences entry in metadata with
apiVersion, kind, name, uid from the session, controller: true and
blockOwnerDeletion: true so Kubernetes garbage collection removes the HTTPRoute
when the TrainingSession is deleted.

Comment on lines +755 to +762
let owner_ref = OwnerReference {
api_version: "basilica.ai/v1".into(),
kind: "TrainingSession".into(),
name: name.to_string(),
uid: session.metadata.uid.clone().unwrap_or_default(),
controller: Some(true),
block_owner_deletion: Some(true),
};

⚠️ Potential issue | 🟠 Major

Same empty UID issue in NetworkPolicy owner reference.

Apply the same fix pattern as recommended for the other resource builders.

🤖 Prompt for AI Agents
In crates/basilica-operator/src/controllers/training_session_controller.rs
around lines 755 to 762, the OwnerReference is being built with uid:
session.metadata.uid.clone().unwrap_or_default(), which injects an empty string
when UID is missing; change this to preserve the optional nature of the UID
(e.g., use session.metadata.uid.clone() or
session.metadata.uid.clone().filter(|s| !s.is_empty())) so the
OwnerReference.uid is None when there is no UID, mirroring the fix used in the
other resource builders.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (7)
services/training-service/README.md (2)

138-157: Add language specification to fenced code block.

Line 138 begins a code block without a language identifier, which fails the MD040 markdown lint rule.

🔎 Apply this diff to specify the language:
-```
+```bash
 [STEP] Running training steps via port-forward...

289-321: Add language specification to fenced code block.

Line 289 begins an ASCII architecture diagram code block without a language identifier, which fails the MD040 markdown lint rule.

🔎 Apply this diff to specify the language:
-```
+```text
 ┌─────────────────────────────────────────────────────────────┐
crates/basilica-api/src/api/routes/training.rs (4)

288-297: Fragile error detection using string matching.

String-based error checks (e.to_string().contains("409")) are brittle and may incorrectly classify errors if message formats change. The kube crate provides typed error variants that should be used instead.

🔎 Consider using typed error matching:
     if let Err(e) = k8s_client.create_namespace(&namespace).await {
-        // Ignore "already exists" errors
-        if !e.to_string().contains("409") && !e.to_string().contains("AlreadyExists") {
+        // Ignore "already exists" errors using typed matching
+        if !matches!(e, kube::Error::Api(ref ae) if ae.code == 409) {
             error!(error = %e, namespace = %namespace, "Failed to create namespace");
             apimetrics::record_request("training.create", "POST", start, false);
             return Err(ApiError::Internal {
                 message: format!("Failed to create namespace: {}", e),
             });
         }
     }

363-374: Fragile error detection using string matching.

Similar to the namespace creation handler, this uses string matching to detect 404 errors instead of typed error matching with kube::Error variants.


495-521: Confusing sentinel error pattern for empty list handling.

Using an error with message "empty" as a sentinel value (line 502) and then checking for it with string matching (line 513) is an anti-pattern that makes the code harder to understand and maintain.

🔎 Consider cleaner error handling:
     // List all TrainingSession CRDs in the user's namespace
-    let sessions = api
+    let sessions_result = api
         .list(&ListParams::default().labels(&format!("user={}", auth.user_id)))
-        .await
-        .map_err(|e| {
-            // If namespace doesn't exist, return empty list
-            if e.to_string().contains("404") || e.to_string().contains("NotFound") {
-                return ApiError::Internal {
-                    message: "empty".to_string(), // Sentinel for empty list
-                };
-            }
-            error!(error = %e, "Failed to list sessions");
-            ApiError::Internal {
-                message: format!("Failed to list sessions: {}", e),
-            }
-        });
+        .await;
 
-    let sessions = match sessions {
-        Ok(list) => list,
-        Err(e) if e.to_string().contains("empty") => {
+    let sessions = match sessions_result {
+        Ok(list) => list,
+        Err(kube::Error::Api(ae)) if ae.code == 404 => {
+            // Namespace doesn't exist, return empty list
             apimetrics::record_request("training.list", "GET", start, true);
             return Ok(Json(vec![]));
         }
         Err(e) => {
+            error!(error = %e, "Failed to list sessions");
             apimetrics::record_request("training.list", "GET", start, false);
-            return Err(e);
+            return Err(ApiError::Internal {
+                message: format!("Failed to list sessions: {}", e),
+            });
         }
     };

185-192: Namespace collision risk remains.

The sanitization could create collisions when different user IDs map to the same namespace (e.g., "user_123" and "user-123" both become "u-user-123"). This was previously flagged but hasn't been addressed.

crates/basilica-sdk-python/python/basilica/training/__init__.py (1)

128-203: Add validation for empty data list.

The forward_backward method accesses examples[0] at lines 180 and 185 without verifying the list is non-empty, and line 174 calls max() on the list which will raise ValueError if empty.

🔎 Apply this diff to add validation:
     ) -> float:
         """
         Compute forward pass and gradients.

         Args:
             data: List of training examples (Datum objects or dicts with
                   'input_ids' and 'labels' keys)
             loss_fn: Loss function to use. Options:
                 - "cross_entropy": Standard NLL loss (default)
                 - "importance_sampling": Policy gradient with importance weighting
                 - "ppo": Proximal Policy Optimization with clipping
                 - "dpo": Direct Preference Optimization

         Returns:
             Loss value

         Example:
             >>> # Standard supervised learning
             >>> loss = session.forward_backward([{"input_ids": [1, 2, 3]}])

             >>> # RL with importance sampling
             >>> loss = session.forward_backward(data, loss_fn="importance_sampling")
         """
+        if not data:
+            raise TrainingError("data list cannot be empty")
+        
         # Normalize input
         examples = []
         for d in data:

Based on the web search results, calling max() on an empty sequence raises ValueError: max() arg is an empty sequence.
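As a standalone illustration (plain Python, with `ValueError` standing in for the SDK's `TrainingError`, and `max_seq_len` a hypothetical stand-in for the padding-length computation), the guard must run before `max()` or `examples[0]` is ever reached:

```python
def max_seq_len(data: list) -> int:
    """Hypothetical sketch of the padding-length computation."""
    # Guard first: max() raises ValueError on an empty sequence, and
    # any examples[0] access would raise IndexError.
    if not data:
        raise ValueError("data list cannot be empty")
    return max(len(d["input_ids"]) for d in data)
```

With the guard in place the caller sees a clear, intentional error instead of a bare `ValueError` from `max()`.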

🧹 Nitpick comments (1)
crates/basilica-api/src/api/middleware/scope.rs (1)

180-231: Add test coverage for training route scope mappings.

The new training session scope mappings lack corresponding test cases in the test_required_scope_mapping function. This creates a gap in validation coverage.

🔎 Consider adding tests for the new training routes:

Add these test cases to test_required_scope_mapping (after line 417):

// Test training session endpoints
let req = Request::builder()
    .method(Method::POST)
    .uri("/sessions")
    .body(Body::empty())
    .unwrap();
assert_eq!(get_required_scope(&req), Some("training:create".to_string()));

let req = Request::builder()
    .method(Method::GET)
    .uri("/sessions")
    .body(Body::empty())
    .unwrap();
assert_eq!(get_required_scope(&req), Some("training:list".to_string()));

let req = Request::builder()
    .method(Method::GET)
    .uri("/sessions/ts-123")
    .body(Body::empty())
    .unwrap();
assert_eq!(get_required_scope(&req), Some("training:view".to_string()));

let req = Request::builder()
    .method(Method::DELETE)
    .uri("/sessions/ts-123")
    .body(Body::empty())
    .unwrap();
assert_eq!(get_required_scope(&req), Some("training:delete".to_string()));

let req = Request::builder()
    .method(Method::POST)
    .uri("/sessions/ts-123/internal")
    .body(Body::empty())
    .unwrap();
assert_eq!(get_required_scope(&req), Some("training:use".to_string()));

let req = Request::builder()
    .method(Method::POST)
    .uri("/sessions/ts-123/internal/train-123/forward_backward")
    .body(Body::empty())
    .unwrap();
assert_eq!(get_required_scope(&req), Some("training:use".to_string()));
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 40e914d and 0c1cf26.

📒 Files selected for processing (7)
  • crates/basilica-api/src/api/middleware/scope.rs (1 hunks)
  • crates/basilica-api/src/api/mod.rs (1 hunks)
  • crates/basilica-api/src/api/routes/training.rs (1 hunks)
  • crates/basilica-api/src/bin/gen_api_key.rs (1 hunks)
  • crates/basilica-sdk-python/python/basilica/training/__init__.py (1 hunks)
  • examples/training_example.py (1 hunks)
  • services/training-service/README.md (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • crates/basilica-api/src/api/mod.rs
🧰 Additional context used
🧬 Code graph analysis (2)
crates/basilica-sdk-python/python/basilica/training/__init__.py (3)
crates/basilica-api/src/api/routes/training.rs (5)
  • forward_backward (733-757)
  • optim_step (761-784)
  • sample (788-812)
  • get_session (350-429)
  • list_sessions (482-586)
services/training-service/src/backend.py (4)
  • forward_backward (226-292)
  • optim_step (294-317)
  • sample (319-381)
  • list_sessions (482-484)
services/training-service/src/server.py (5)
  • forward_backward (244-273)
  • optim_step (277-283)
  • sample (290-311)
  • get_session (221-227)
  • list_sessions (177-179)
crates/basilica-api/src/api/routes/training.rs (2)
crates/basilica-operator/src/crd/training_session.rs (6)
  • default_rank (44-46)
  • default_alpha (47-49)
  • default_dropout (50-52)
  • default (34-41)
  • default (80-86)
  • default (145-151)
crates/basilica-api/src/k8s/trait.rs (1)
  • kube_client (7-7)
🪛 markdownlint-cli2 (0.18.1)
services/training-service/README.md

38-38: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


138-138: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


289-289: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Ruff (0.14.8)
crates/basilica-sdk-python/python/basilica/training/__init__.py

201-201: Avoid specifying long messages outside the exception class

(TRY003)


215-215: Avoid specifying long messages outside the exception class

(TRY003)


253-253: Avoid specifying long messages outside the exception class

(TRY003)


273-273: Avoid specifying long messages outside the exception class

(TRY003)


290-290: Avoid specifying long messages outside the exception class

(TRY003)


297-297: Avoid specifying long messages outside the exception class

(TRY003)


305-305: Avoid specifying long messages outside the exception class

(TRY003)


346-346: Avoid specifying long messages outside the exception class

(TRY003)


414-414: Avoid specifying long messages outside the exception class

(TRY003)


423-423: Avoid specifying long messages outside the exception class

(TRY003)


431-431: Avoid specifying long messages outside the exception class

(TRY003)


435-435: Avoid specifying long messages outside the exception class

(TRY003)


451-451: Avoid specifying long messages outside the exception class

(TRY003)


468-468: Avoid specifying long messages outside the exception class

(TRY003)


470-470: Avoid specifying long messages outside the exception class

(TRY003)


474-474: Avoid specifying long messages outside the exception class

(TRY003)


488-488: Avoid specifying long messages outside the exception class

(TRY003)


507-513: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

examples/training_example.py

18-18: Loop control variable i not used within loop body

Rename unused i to _i

(B007)

🔇 Additional comments (2)
crates/basilica-api/src/bin/gen_api_key.rs (1)

22-24: LGTM!

The addition of "training:*" to the default scopes is appropriate and aligns with the new training session endpoints introduced in this PR.

examples/training_example.py (1)

1-27: LGTM!

The example effectively demonstrates the training SDK workflow with session creation, training loop, and sampling. The code is clear and appropriate for an example script.

Comment on lines +38 to +48
```
=== Creating session for facebook/opt-125m ===
Session: ts-a1b2c3d4

=== Training (3 steps) ===
Step 1: loss=6.2612
Step 2: loss=5.7423
Step 3: loss=5.7501

Sample: The quick brown -> fox jumps over the lazy
```

⚠️ Potential issue | 🟡 Minor

Add language specification to fenced code block.

Line 38 begins an example output code block without a language identifier, which fails the MD040 markdown lint rule.

🔎 Apply this diff to specify the language:
-```
+```text
 === Creating session for facebook/opt-125m ===
Suggested change

```text
=== Creating session for facebook/opt-125m ===
Session: ts-a1b2c3d4

=== Training (3 steps) ===
Step 1: loss=6.2612
Step 2: loss=5.7423
Step 3: loss=5.7501

Sample: The quick brown -> fox jumps over the lazy
```
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)

38-38: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
services/training-service/README.md around lines 38 to 48: the fenced code block
starting at line 38 lacks a language spec and triggers markdown linting; change
the opening fence from ``` to ```text (and keep the closing ``` unchanged) so
the example output is explicitly marked as plain text.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

♻️ Duplicate comments (5)
crates/basilica-api/src/api/routes/training.rs (5)

200-207: Potential namespace collision risk.

Different user IDs like "user_123" and "user-123" both map to "u-user-123", causing namespace collisions. Additionally, Kubernetes namespace names have a 63-character limit that isn't enforced here.

Based on past review comments, a hash-based approach was suggested to ensure unique, valid namespace names.
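As a minimal sketch of such a hash-based approach (in Python for illustration; the real handler is Rust, and this is not the project's actual `user_namespace`), appending a short digest of the raw user ID keeps names unique and within Kubernetes' 63-character limit:

```python
import hashlib
import re

def user_namespace(user_id: str) -> str:
    """Sanitize user_id to DNS-1123 form, then append a short digest of
    the raw ID so distinct IDs that sanitize identically stay distinct."""
    sanitized = re.sub(r"[^a-z0-9-]", "-", user_id.lower()).strip("-")
    digest = hashlib.sha256(user_id.encode()).hexdigest()[:8]
    # "u-" prefix + truncated body + "-" + digest, capped at 63 chars
    body = sanitized[: 63 - len("u-") - 1 - len(digest)]
    return f"u-{body}-{digest}"

# "user_123" and "user-123" sanitize identically but get distinct names
assert user_namespace("user_123") != user_namespace("user-123")
```

The digest also bounds the name length for arbitrarily long user IDs, which the current sanitization does not.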


283-283: Hardcoded enableBilling prevents user override.

The enableBilling field is always true, ignoring potential user preferences.

Based on past review comments, consider adding an optional field to CreateSessionRequest with a default of true.


323-332: Fragile error detection using string matching.

Checking for "409" or "AlreadyExists" in the error string is brittle and may incorrectly ignore real errors if the error message format changes.

Based on past review comments, use typed error matching like the CRD creation does at line 353: kube::Error::Api(ae) if ae.code == 409.


530-556: Confusing sentinel error pattern for empty list handling.

Using an ApiError::Internal with message "empty" as a sentinel value is a code smell that makes the code harder to maintain.

Based on past review comments, use proper typed error matching with early return for 404 cases, similar to how delete_session handles it at line 497.


398-409: Fix string-based error matching in session retrieval.

String-based error matching at lines 398-409 (and in list_sessions) is fragile and inconsistent with typed error handling used elsewhere in the file. Replace with typed kube::Error::Api pattern matching:

-    let session = api.get(&session_id).await.map_err(|e| {
-        if e.to_string().contains("404") || e.to_string().contains("NotFound") {
+    let session = api.get(&session_id).await.map_err(|e| match &e {
+        kube::Error::Api(ae) if ae.code == 404 => {
             ApiError::NotFound {
                 message: format!("Session {} not found", session_id),
             }
-        } else {
+        }
+        _ => {
             error!(error = %e, session_id = %session_id, "Failed to get session");
             ApiError::Internal {
                 message: format!("Failed to get session: {}", e),
             }
         }
     })?;

Apply the same fix to the list_sessions function.

🧹 Nitpick comments (9)
crates/basilica-api/src/api/routes/training.rs (3)

313-320: Consider collision risk with short session IDs.

Using only the first segment of UUID v4 (8 hex characters = ~4 billion possibilities) could lead to collisions as the number of sessions grows, especially across all users. While unlikely in the short term, full UUIDs would eliminate this risk entirely.

🔎 Suggested fix:
     let session_id = format!(
         "ts-{}",
-        uuid::Uuid::new_v4()
-            .to_string()
-            .split('-')
-            .next()
-            .unwrap()
+        uuid::Uuid::new_v4().to_string().replace("-", "")
     );

Or simply use the standard hyphenated UUID:

-    let session_id = format!(
-        "ts-{}",
-        uuid::Uuid::new_v4()
-            .to_string()
-            .split('-')
-            .next()
-            .unwrap()
-    );
+    let session_id = format!("ts-{}", uuid::Uuid::new_v4());
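The birthday bound makes the risk concrete: with 16^8 ≈ 4.3 billion possible 8-hex-character prefixes, the chance of at least one collision passes 1% at roughly 9,300 sessions. A rough back-of-envelope (not tied to any Basilica code):

```python
import math

def birthday_collision_prob(n_items: int, space: int) -> float:
    """Approximate P(at least one collision) for n_items uniform draws
    from `space` values: 1 - exp(-n(n-1) / (2 * space))."""
    return 1.0 - math.exp(-n_items * (n_items - 1) / (2.0 * space))

space = 16 ** 8  # 8 hex characters ~= 4.3 billion IDs
for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} sessions: {birthday_collision_prob(n, space):.4f}")
```

A full 32-character UUID (16^32 values) pushes the same probability far below any practical concern.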

634-709: Improve proxy error handling specificity.

The proxy always returns ApiError::ServiceUnavailable regardless of the underlying error type. This loses important context—a 404 from the training service means the session doesn't exist, while a 500 means an internal error.

🔎 Suggested improvement:

Parse the kube error and map specific HTTP status codes:

     let response = kube_client
         .request::<serde_json::Value>(request)
         .await
-        .map_err(|e| {
+        .map_err(|e| match &e {
+            kube::Error::Api(ae) if ae.code == 404 => {
+                ApiError::NotFound {
+                    message: format!("Training session {} not found", session_id),
+                }
+            }
+            _ => {
                 error!(
                     error = %e,
                     service = %service_name,
                     namespace = %namespace,
                     path = %path,
                     "Failed to proxy request to training service"
                 );
                 ApiError::ServiceUnavailable
+            }
         })?;

711-903: Consider reducing duplication in proxy handlers.

All seven proxy handler functions follow an identical pattern with only the path and metric name differing. While the current code is clear, extracting a macro or higher-order function could reduce maintenance burden.

💡 Example approach using a macro:
macro_rules! proxy_handler {
    ($name:ident, $metric:expr, $method:expr, $path_fn:expr) => {
        pub async fn $name(
            State(state): State<AppState>,
            Extension(auth): Extension<AuthContext>,
            Path(session_id): Path<String>,
            body: Option<Json<serde_json::Value>>,
        ) -> Result<axum::response::Response> {
            let start = Instant::now();
            let k8s_client = state.k8s.as_ref().ok_or(ApiError::ServiceUnavailable)?;
            let kube_client = k8s_client.kube_client();
            let namespace = user_namespace(&auth.user_id);

            let result = proxy_to_training_service(
                &kube_client,
                &namespace,
                &session_id,
                &$path_fn(&session_id),
                $method,
                body.map(|Json(v)| v),
            )
            .await;

            apimetrics::record_request($metric, $method, start, result.is_ok());
            result
        }
    };
}

This is a nice-to-have optimization, not a blocking issue.

crates/basilica-sdk-python/python/basilica/training/types.py (1)

169-176: Use asyncio.get_running_loop() instead of deprecated get_event_loop().

asyncio.get_event_loop() is deprecated when called with no running event loop (since Python 3.10). Since result_async is an async method, a loop is guaranteed to be running, so get_running_loop() is the recommended, unambiguous call.

🔎 Apply this diff:
     async def result_async(self, timeout: Optional[float] = None):
         """Wait for operation to complete (async)."""
-        loop = asyncio.get_event_loop()
+        loop = asyncio.get_running_loop()
         if timeout is not None:
             return await asyncio.wait_for(
                 loop.run_in_executor(None, self._future.result), timeout=timeout
             )
         return await loop.run_in_executor(None, self._future.result)
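A self-contained illustration of the pattern, independent of the SDK (all names here are illustrative only):

```python
import asyncio
import concurrent.futures
import time

def slow_result() -> str:
    time.sleep(0.05)  # stand-in for a blocking Future.result() call
    return "done"

async def result_async(pool, timeout=None):
    # Safe inside a coroutine: a loop is guaranteed to be running,
    # so get_running_loop() never needs a fallback.
    loop = asyncio.get_running_loop()
    fut = loop.run_in_executor(pool, slow_result)
    if timeout is not None:
        return await asyncio.wait_for(fut, timeout=timeout)
    return await fut

with concurrent.futures.ThreadPoolExecutor() as pool:
    print(asyncio.run(result_async(pool, timeout=1.0)))  # prints "done"
```

If the executor call exceeds the timeout, asyncio.wait_for raises TimeoutError, matching the SDK's timeout semantics.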
crates/basilica-sdk-python/python/basilica/training/sampling_client.py (1)

146-150: Consider extracting endpoint resolution to reduce duplication.

The endpoint resolution logic here duplicates _get_endpoint() with a different suffix. This is minor, but could be consolidated.

🔎 Example refactor:
+    def _get_endpoint(self, operation: str = "sample") -> str:
+        """Get the endpoint for the given operation."""
+        if self._session_id and self._internal_id:
+            return f"/sessions/{self._session_id}/internal/{self._internal_id}/{operation}"
+        return f"/{operation}"
+
-    def _get_endpoint(self) -> str:
-        """Get the sampling endpoint."""
-        if self._session_id and self._internal_id:
-            return f"/sessions/{self._session_id}/internal/{self._internal_id}/sample"
-        return "/sample"

Then use self._get_endpoint("compute_logprobs") at line 148.

crates/basilica-sdk-python/python/basilica/training/exceptions.py (3)

8-14: Use explicit | None type annotation for optional parameter.

Per PEP 484, implicit Optional (using = None without a None type) is discouraged. This improves type checker compatibility.

🔎 Apply this diff:
-    def __init__(self, message: str, status_code: int = None):
+    def __init__(self, message: str, status_code: int | None = None):

54-62: Use explicit | None type annotation.

Same issue as the base class - retry_after: float = None should use explicit optional type.

🔎 Apply this diff:
-    def __init__(self, retry_after: float = None):
+    def __init__(self, retry_after: float | None = None):

72-75: Consider adding a default message to CheckpointError.

Unlike other exceptions, CheckpointError has no custom __init__, so callers must always provide a message. Consider adding a default for consistency with AuthenticationError and InsufficientResourcesError.

🔎 Apply this diff:
 class CheckpointError(TrainingError):
     """Checkpoint operation failed."""
 
-    pass
+    def __init__(self, message: str = "Checkpoint operation failed"):
+        super().__init__(message)
crates/basilica-sdk-python/python/basilica/training/service_client.py (1)

388-399: Async client is not properly closed.

The async client is lazily created but never closed, which can cause resource leaks. Consider adding an async close method or async context manager support.

🔎 Suggested approach:
async def close_async(self):
    """Close async resources."""
    if self._async_client:
        await self._async_client.aclose()
        self._async_client = None

async def __aenter__(self):
    return self

async def __aexit__(self, *args):
    await self.close_async()
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0c1cf26 and 0aa3e6b.

📒 Files selected for processing (8)
  • crates/basilica-api/src/api/routes/training.rs (1 hunks)
  • crates/basilica-sdk-python/python/basilica/training/__init__.py (1 hunks)
  • crates/basilica-sdk-python/python/basilica/training/exceptions.py (1 hunks)
  • crates/basilica-sdk-python/python/basilica/training/sampling_client.py (1 hunks)
  • crates/basilica-sdk-python/python/basilica/training/service_client.py (1 hunks)
  • crates/basilica-sdk-python/python/basilica/training/training_client.py (1 hunks)
  • crates/basilica-sdk-python/python/basilica/training/types.py (1 hunks)
  • examples/training_example.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
crates/basilica-sdk-python/python/basilica/training/sampling_client.py (3)
crates/basilica-sdk-python/python/basilica/training/types.py (7)
  • ModelInput (48-66)
  • SampleResponse (100-106)
  • SamplingParams (70-93)
  • to_dict (37-44)
  • to_dict (84-93)
  • from_string (64-66)
  • result_async (169-176)
crates/basilica-sdk-python/python/basilica/training/exceptions.py (1)
  • TrainingError (8-14)
crates/basilica-sdk-python/python/basilica/training/training_client.py (2)
  • sample (395-428)
  • close (430-435)
crates/basilica-sdk-python/python/basilica/training/service_client.py (5)
crates/basilica-sdk-python/python/basilica/training/types.py (1)
  • GetServerCapabilitiesResponse (127-132)
crates/basilica-sdk-python/python/basilica/training/training_client.py (4)
  • TrainingClient (22-465)
  • base_model (70-72)
  • session_id (65-67)
  • close (430-435)
crates/basilica-sdk-python/python/basilica/training/sampling_client.py (2)
  • SamplingClient (16-186)
  • close (178-180)
crates/basilica-sdk-python/python/basilica/training/exceptions.py (4)
  • TrainingError (8-14)
  • SessionTimeoutError (36-44)
  • AuthenticationError (47-51)
  • ValidationError (65-69)
crates/basilica-api/src/api/routes/training.rs (2)
  • get_session (385-464)
  • list_sessions (517-621)
crates/basilica-api/src/api/routes/training.rs (2)
crates/basilica-operator/src/crd/training_session.rs (9)
  • default_rank (44-46)
  • default_alpha (47-49)
  • default_dropout (50-52)
  • default (34-41)
  • default (80-86)
  • default (145-151)
  • default_ttl (214-216)
  • new (316-322)
  • new (374-391)
services/training-service/src/backend.py (4)
  • create_session (163-224)
  • forward_backward (226-292)
  • optim_step (294-317)
  • sample (319-381)
crates/basilica-sdk-python/python/basilica/training/__init__.py (5)
crates/basilica-sdk-python/python/basilica/training/types.py (8)
  • Datum (17-44)
  • ModelInput (48-66)
  • SamplingParams (70-93)
  • SampleResponse (100-106)
  • ForwardBackwardResult (110-115)
  • ForwardResult (119-123)
  • GetServerCapabilitiesResponse (127-132)
  • APIFuture (138-184)
crates/basilica-sdk-python/python/basilica/training/exceptions.py (10)
  • TrainingError (8-14)
  • SessionNotFoundError (17-22)
  • SessionNotReadyError (25-33)
  • SessionTimeoutError (36-44)
  • AuthenticationError (47-51)
  • RateLimitError (54-62)
  • ValidationError (65-69)
  • CheckpointError (72-75)
  • ModelNotFoundError (78-83)
  • InsufficientResourcesError (86-90)
crates/basilica-sdk-python/python/basilica/training/service_client.py (1)
  • ServiceClient (24-399)
crates/basilica-sdk-python/python/basilica/training/training_client.py (1)
  • TrainingClient (22-465)
crates/basilica-sdk-python/python/basilica/training/sampling_client.py (1)
  • SamplingClient (16-186)
🪛 Ruff (0.14.8)
crates/basilica-sdk-python/python/basilica/training/types.py

158-158: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


189-198: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

crates/basilica-sdk-python/python/basilica/training/training_client.py

125-125: Avoid specifying long messages outside the exception class

(TRY003)


188-188: Avoid specifying long messages outside the exception class

(TRY003)


253-253: Avoid specifying long messages outside the exception class

(TRY003)


278-278: Avoid specifying long messages outside the exception class

(TRY003)


299-299: Avoid specifying long messages outside the exception class

(TRY003)


319-319: Avoid specifying long messages outside the exception class

(TRY003)


339-339: Avoid specifying long messages outside the exception class

(TRY003)


381-381: Avoid specifying long messages outside the exception class

(TRY003)


426-426: Avoid specifying long messages outside the exception class

(TRY003)


434-434: Avoid specifying long messages outside the exception class

(TRY003)

crates/basilica-sdk-python/python/basilica/training/sampling_client.py

99-99: Avoid specifying long messages outside the exception class

(TRY003)


154-154: Avoid specifying long messages outside the exception class

(TRY003)

crates/basilica-sdk-python/python/basilica/training/service_client.py

64-66: Avoid specifying long messages outside the exception class

(TRY003)


154-156: Avoid specifying long messages outside the exception class

(TRY003)


190-190: Avoid specifying long messages outside the exception class

(TRY003)


199-199: Avoid specifying long messages outside the exception class

(TRY003)


207-207: Avoid specifying long messages outside the exception class

(TRY003)


235-235: Avoid specifying long messages outside the exception class

(TRY003)


327-327: Avoid specifying long messages outside the exception class

(TRY003)


350-350: Avoid specifying long messages outside the exception class

(TRY003)


352-352: Avoid specifying long messages outside the exception class

(TRY003)


356-356: Avoid specifying long messages outside the exception class

(TRY003)


374-374: Avoid specifying long messages outside the exception class

(TRY003)

crates/basilica-sdk-python/python/basilica/training/exceptions.py

11-11: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


57-57: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


95-106: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

crates/basilica-sdk-python/python/basilica/training/__init__.py

69-96: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

🔇 Additional comments (16)
examples/training_example.py (2)

45-77: LGTM!

The training loop demonstrates the SDK API well, with proper context manager usage for session cleanup, clear forward-backward/optim-step pattern, and checkpoint saving.


108-140: LGTM!

The async example properly demonstrates the async API pattern with try/finally for cleanup.

crates/basilica-sdk-python/python/basilica/training/types.py (2)

16-45: LGTM!

The Datum dataclass is well-documented with clear field semantics and a clean to_dict() serialization method.


69-93: LGTM!

SamplingParams provides sensible defaults and the to_dict() method handles None gracefully for stop_sequences.

crates/basilica-sdk-python/python/basilica/training/training_client.py (2)

219-258: LGTM!

The optim_step method correctly handles optional hyperparameter overrides and updates the internal step counter from the server response.


430-441: LGTM!

The close() method appropriately ignores 404 responses (idempotent deletion) and uses wait=False for executor shutdown during cleanup, which is acceptable since the session is being terminated.

crates/basilica-sdk-python/python/basilica/training/sampling_client.py (2)

26-48: LGTM!

The constructor clearly documents the dual-use pattern for session-bound vs standalone sampling, with appropriate optional parameters.


56-105: LGTM!

The sample() method cleanly handles payload construction, optional model hints, and gracefully handles both single and batch response formats with data.get("samples", [data]).
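The data.get("samples", [data]) idiom normalizes single-object and batch responses into one shape. In isolation:

```python
def extract_samples(data: dict) -> list:
    # Batch responses wrap results in a "samples" list; single-sample
    # responses return the object directly. Either way, callers get a list.
    return data.get("samples", [data])

batch = {"samples": [{"text": "a"}, {"text": "b"}]}
single = {"text": "a"}
```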

crates/basilica-sdk-python/python/basilica/training/__init__.py (2)

1-35: LGTM!

Excellent module docstring with a practical quick-start example and clear documentation of supported loss functions and authentication.


39-96: LGTM!

The module correctly re-exports all public symbols from submodules, and the __all__ list is well-organized with logical grouping (clients, types, exceptions) that aids discoverability.

crates/basilica-sdk-python/python/basilica/training/service_client.py (6)

36-73: LGTM!

The initialization logic handles API key and endpoint configuration correctly with appropriate fallbacks to environment variables and sensible defaults.


75-90: LGTM!

Capability query methods are clean and correctly use raise_for_status() for error handling.


247-307: No readiness polling for restored sessions.

Unlike create_lora_training_client which polls until the session is ready, create_training_client_from_state and create_training_client_from_state_with_optimizer return immediately after the API call. If session restoration is asynchronous, the returned TrainingClient may reference a session that isn't ready yet.

Consider adding similar readiness polling, or document that restoration is synchronous.
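If restoration does turn out to be asynchronous, the readiness poll from create_lora_training_client could be factored out and reused. A generic sketch (the poll interval, timeout, phase strings, and get_phase callable are illustrative assumptions):

```python
import time

def wait_until_ready(get_phase, timeout_s: float = 60.0, interval_s: float = 1.0) -> str:
    """Poll get_phase() until it returns "ready", failing on error or timeout."""
    deadline = time.monotonic() + timeout_s
    phase = "unknown"
    while time.monotonic() < deadline:
        phase = get_phase()
        if phase == "ready":
            return phase
        if phase == "failed":
            raise RuntimeError("session restoration failed")
        time.sleep(interval_s)
    raise TimeoutError(f"session not ready after {timeout_s}s (last phase: {phase})")
```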


309-333: LGTM!

Clean validation ensuring at least one model reference is provided before creating the sampling client.


365-376: LGTM!

Simple implementation that correctly handles errors and returns session list.


163-187: Empty checkpointStorage fields should be required.

The bucket and path fields in checkpointStorage are set to empty strings without validation. Since storage operations require these values, either provide non-empty defaults, require them as parameters, or add validation to reject empty values at the API level.
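A minimal validation of the kind this comment asks for, rejecting empty bucket/path up front (field names taken from the request shape in this PR; the helper itself is a sketch, not the PR's implementation):

```python
def validate_checkpoint_storage(storage: dict) -> None:
    """Reject checkpointStorage payloads with missing or empty bucket/path."""
    for key in ("bucket", "path"):
        value = storage.get(key, "")
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"checkpointStorage.{key} must be a non-empty string")

validate_checkpoint_storage({"backend": "s3", "bucket": "ckpts", "path": "runs/1"})
```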

Comment on lines +121 to +129
#[derive(Debug, Clone, Deserialize)]
#[serde(rename_all = "camelCase")]
pub struct CheckpointStorageRequest {
pub backend: String,
pub bucket: String,
pub path: String,
#[serde(default)]
pub credentials_secret: Option<String>,
}

⚠️ Potential issue | 🟡 Minor

Document accepted backend values.

The backend field has no validation or documentation indicating which values are accepted (e.g., "s3", "gcs", "local"). Consider adding validation or at minimum doc comments listing the supported storage backends.

🔎 Suggested improvement:

Add documentation:

 #[derive(Debug, Clone, Deserialize)]
 #[serde(rename_all = "camelCase")]
 pub struct CheckpointStorageRequest {
+    /// Storage backend type. Supported values: "s3", "gcs"
     pub backend: String,
     pub bucket: String,
     pub path: String,
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
#[derive(Debug, Clone, Deserialize)]
#[serde(rename_all = "camelCase")]
pub struct CheckpointStorageRequest {
    /// Storage backend type. Supported values: "s3", "gcs"
    pub backend: String,
    pub bucket: String,
    pub path: String,
    #[serde(default)]
    pub credentials_secret: Option<String>,
}
🤖 Prompt for AI Agents
In crates/basilica-api/src/api/routes/training.rs around lines 121 to 129, the
CheckpointStorageRequest struct's backend field lacks documentation or
validation about accepted values; add a doc comment above the backend field
enumerating supported backends (e.g., "s3", "gcs", "local") and, if practical,
enforce validation by either changing the type to a dedicated enum with
Deserialize implementations for those variants or adding a simple runtime check
where the struct is deserialized/used that returns a clear error for unsupported
values.

Comment on lines +347 to +363
resp = self._client.get(f"/sessions/{session_id}")

if resp.status_code == 404:
raise TrainingError(f"Session {session_id} not found")
if not resp.is_success:
raise TrainingError(f"Failed to get session: {resp.text}")

data = resp.json()
if data.get("phase") != "ready":
raise TrainingError(f"Session not ready: {data.get('phase')}")

return TrainingClient(
client=self._client,
session_id=session_id,
internal_id=f"train-{session_id}",
base_model=data.get("baseModel", "unknown"),
)

🛠️ Refactor suggestion | 🟠 Major

Use specific exception types from exceptions.py.

The module defines SessionNotFoundError and SessionNotReadyError specifically for these cases, but generic TrainingError is used instead. Using the specific exceptions improves error handling for callers.

🔎 Apply this diff:
+from .exceptions import (
+    TrainingError,
+    SessionTimeoutError,
+    SessionNotFoundError,
+    SessionNotReadyError,
+    AuthenticationError,
+    ValidationError,
+)

Then update the method:

         if resp.status_code == 404:
-            raise TrainingError(f"Session {session_id} not found")
+            raise SessionNotFoundError(session_id)
         if not resp.is_success:
             raise TrainingError(f"Failed to get session: {resp.text}")

         data = resp.json()
         if data.get("phase") != "ready":
-            raise TrainingError(f"Session not ready: {data.get('phase')}")
+            raise SessionNotReadyError(session_id, data.get("phase", "unknown"))

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 Ruff (0.14.8)

350-350: Avoid specifying long messages outside the exception class

(TRY003)


352-352: Avoid specifying long messages outside the exception class

(TRY003)


356-356: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
In crates/basilica-sdk-python/python/basilica/training/service_client.py around
lines 347 to 363, the code raises generic TrainingError for a 404 and for a
session not-ready state; replace those with the more specific exceptions defined
in exceptions.py by importing SessionNotFoundError and SessionNotReadyError (or
adjusting existing imports) and raise SessionNotFoundError(f"Session
{session_id} not found") for the 404 branch and SessionNotReadyError(f"Session
not ready: {data.get('phase')}") when the phase is not "ready", leaving the
other TrainingError usages unchanged.

Comment on lines +378 to +386
async def _get_async_client(self) -> httpx.AsyncClient:
"""Get or create async HTTP client."""
if self._async_client is None:
self._async_client = httpx.AsyncClient(
base_url=self.endpoint,
headers={"Authorization": f"Bearer {self.api_key}"},
timeout=300.0,
)
return self._async_client

⚠️ Potential issue | 🟡 Minor

Async client uses hardcoded timeout instead of configured value.

The sync client uses the timeout parameter from __init__, but the async client hardcodes 300.0. Store the timeout in __init__ and reuse it here for consistency.

🔎 Apply this diff:

In __init__:

         self._async_client = None  # Lazy init
+        self._timeout = timeout

In _get_async_client:

             self._async_client = httpx.AsyncClient(
                 base_url=self.endpoint,
                 headers={"Authorization": f"Bearer {self.api_key}"},
-                timeout=300.0,
+                timeout=self._timeout,
             )

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In crates/basilica-sdk-python/python/basilica/training/service_client.py around
lines 378-386, the async HTTP client is created with a hardcoded timeout of
300.0; update __init__ to store the provided timeout (e.g., self.timeout =
timeout) and change the AsyncClient construction to use that stored value
(timeout=self.timeout) so the async client honors the configured timeout
consistently with the sync client.

Comment on lines +106 to +131
def _call():
normalized = self._normalize_data(data)

# Pad sequences
max_len = max(len(d["input_ids"]) for d in normalized)
input_ids = []
attention_mask = []

for d in normalized:
ids = d["input_ids"]
pad_len = max_len - len(ids)
input_ids.append(ids + [0] * pad_len)
attention_mask.append([1] * len(ids) + [0] * pad_len)

resp = self._client.post(
self._proxy("forward"),
json={"input_ids": input_ids, "attention_mask": attention_mask},
)
if not resp.is_success:
raise TrainingError(f"forward failed: {resp.text}")
r = resp.json()
return ForwardResult(
logprobs=r["logprobs"], tokens_processed=r["tokens_processed"]
)

return APIFuture(self._executor.submit(_call), ForwardResult)

⚠️ Potential issue | 🟠 Major

Add validation for empty data list.

If data is empty, max(len(d["input_ids"]) for d in normalized) at line 110 will raise a ValueError: max() arg is an empty sequence. Add a guard at the start of _call().

🔎 Apply this diff:
         def _call():
             normalized = self._normalize_data(data)
+            if not normalized:
+                raise ValueError("data list cannot be empty")

             # Pad sequences
             max_len = max(len(d["input_ids"]) for d in normalized)
📝 Committable suggestion

Suggested change
        def _call():
            normalized = self._normalize_data(data)
            if not normalized:
                raise ValueError("data list cannot be empty")

            # Pad sequences
            max_len = max(len(d["input_ids"]) for d in normalized)
            input_ids = []
            attention_mask = []

            for d in normalized:
                ids = d["input_ids"]
                pad_len = max_len - len(ids)
                input_ids.append(ids + [0] * pad_len)
                attention_mask.append([1] * len(ids) + [0] * pad_len)

            resp = self._client.post(
                self._proxy("forward"),
                json={"input_ids": input_ids, "attention_mask": attention_mask},
            )
            if not resp.is_success:
                raise TrainingError(f"forward failed: {resp.text}")
            r = resp.json()
            return ForwardResult(
                logprobs=r["logprobs"], tokens_processed=r["tokens_processed"]
            )

        return APIFuture(self._executor.submit(_call), ForwardResult)
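The padding logic plus the proposed guard, extracted into a standalone helper for illustration (pad id 0 and the mask convention follow the snippet above; the helper name is hypothetical):

```python
def pad_batch(sequences: list) -> tuple:
    """Right-pad token id sequences to a common length; the mask marks real tokens."""
    if not sequences:
        raise ValueError("data list cannot be empty")
    max_len = max(len(ids) for ids in sequences)
    input_ids, attention_mask = [], []
    for ids in sequences:
        pad_len = max_len - len(ids)
        input_ids.append(ids + [0] * pad_len)          # pad token id 0
        attention_mask.append([1] * len(ids) + [0] * pad_len)
    return input_ids, attention_mask
```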
🧰 Tools
🪛 Ruff (0.14.8)

125-125: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
In crates/basilica-sdk-python/python/basilica/training/training_client.py around
lines 106 to 131, add a guard after normalizing data to handle an empty input
list: after normalized = self._normalize_data(data) check if not normalized and
raise a TrainingError (or otherwise return a failed future) with a clear message
such as "forward failed: no data provided" to avoid the max() empty-sequence
ValueError; keep the rest of the function flow unchanged so the APIFuture is
still returned by _call().

Comment on lines +153 to +197
def _call():
normalized = self._normalize_data(data)

# Default labels to input_ids (causal LM)
for d in normalized:
if "labels" not in d or d["labels"] is None:
d["labels"] = d["input_ids"].copy()

# Pad sequences
max_len = max(len(d["input_ids"]) for d in normalized)
input_ids = []
labels = []
attention_mask = []
loss_weights = []

for d in normalized:
ids = d["input_ids"]
pad_len = max_len - len(ids)
input_ids.append(ids + [0] * pad_len)
labels.append(d["labels"] + [-100] * pad_len)
attention_mask.append([1] * len(ids) + [0] * pad_len)
if d.get("loss_weights"):
loss_weights.append(d["loss_weights"] + [0.0] * pad_len)

payload = {
"input_ids": input_ids,
"attention_mask": attention_mask,
"labels": labels,
"loss_fn": loss_fn,
}
if loss_weights:
payload["loss_weights"] = loss_weights

resp = self._client.post(self._proxy("forward_backward"), json=payload)
if not resp.is_success:
raise TrainingError(f"forward_backward failed: {resp.text}")

r = resp.json()
return ForwardBackwardResult(
loss=r["loss"],
logprobs=r.get("logprobs", []),
tokens_processed=r.get("tokens_processed", 0),
)

return APIFuture(self._executor.submit(_call), ForwardBackwardResult)

⚠️ Potential issue | 🟠 Major

Add validation for empty data list in forward_backward.

Same issue as forward() — line 162 will raise ValueError if data is empty. Add a guard after normalization.

🔎 Apply this diff:
         def _call():
             normalized = self._normalize_data(data)
+            if not normalized:
+                raise ValueError("data list cannot be empty")

             # Default labels to input_ids (causal LM)
             for d in normalized:
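The label handling above (default labels to input_ids for causal LM, then pad labels with -100 so padding positions are ignored by the loss) can be sketched in isolation; this helper is illustrative and assumes labels, when provided, match input_ids in length:

```python
IGNORE_INDEX = -100  # HF-style convention: positions with -100 are skipped by the loss

def pad_labels(batch: list) -> list:
    """Default labels to input_ids (causal LM) and pad them with IGNORE_INDEX."""
    if not batch:
        raise ValueError("data list cannot be empty")
    max_len = max(len(d["input_ids"]) for d in batch)
    padded = []
    for d in batch:
        labels = d.get("labels") or list(d["input_ids"])
        padded.append(labels + [IGNORE_INDEX] * (max_len - len(labels)))
    return padded
```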
🧰 Tools
🪛 Ruff (0.14.8)

188-188: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
In crates/basilica-sdk-python/python/basilica/training/training_client.py around
lines 153 to 197, after normalizing the input data add a guard that checks if
the normalized list is empty and raise a TrainingError (or appropriate API
error) with a clear message (e.g. "forward_backward requires non-empty data")
before calling max(len(...)) to avoid ValueError; ensure the new check occurs
immediately after normalized = self._normalize_data(data) and returns/raises
consistently with other methods like forward().

Comment on lines +36 to +37
if os.path.exists("build/api-token.txt"):
api_key = open("build/api-token.txt").read().strip()

⚠️ Potential issue | 🟡 Minor

Resource leak: file handle not closed.

The file is opened without a context manager, leaving the handle unclosed after reading. This pattern is repeated in sampling_example() and async_example().

🔎 Apply this diff to fix the resource leak:
-    if os.path.exists("build/api-token.txt"):
-        api_key = open("build/api-token.txt").read().strip()
+    if os.path.exists("build/api-token.txt"):
+        with open("build/api-token.txt") as f:
+            api_key = f.read().strip()

Apply the same fix at lines 83-84 and 114-115.

📝 Committable suggestion

Suggested change
    if os.path.exists("build/api-token.txt"):
        with open("build/api-token.txt") as f:
            api_key = f.read().strip()
🤖 Prompt for AI Agents
In examples/training_example.py around lines 36-37 (and apply the same fix at
lines 83-84 and 114-115), the file is opened with open(...) and read without
using a context manager which leaks file handles; replace those open(...).read()
calls with a with open(...) as f: api_key = f.read().strip() pattern (i.e., use
a with-statement context manager to open, read, strip, and automatically close
the file) for each occurrence.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

♻️ Duplicate comments (11)
crates/basilica-sdk-python/python/basilica/training/training_client.py (2)

107-132: Add validation for empty data list.

If data is empty, max(len(d["input_ids"]) for d in normalized) at line 111 will raise a ValueError: max() arg is an empty sequence. This issue was previously raised and remains unaddressed.


180-224: Add validation for empty data list in forward_backward.

Same issue as forward() — line 189 will raise ValueError if data is empty. This issue was previously raised and remains unaddressed.

examples/training_example.py (1)

36-37: Resource leak: file handles not closed.

The file is opened without a context manager at lines 37, 106, and 137, leaving handles unclosed after reading. This issue was previously raised and remains unaddressed.

Also applies to: 105-106, 136-137

crates/basilica-api/src/api/routes/training.rs (6)

200-207: Potential namespace collision risk.

The sanitization logic could create collisions: different user IDs like "user_123" and "user-123" would both map to "u-user-123". This issue was previously raised and remains unaddressed.
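The collision is easy to reproduce, and a short content-hash suffix is one way out. A sketch, not the PR's actual sanitization code (the regex and prefix here only approximate what the review describes):

```python
import hashlib
import re

def naive_namespace(user_id: str) -> str:
    # Roughly the behavior the review describes: runs of
    # non-alphanumerics collapse to a single "-".
    return "u-" + re.sub(r"[^a-z0-9]+", "-", user_id.lower())

def safe_namespace(user_id: str) -> str:
    # Appending a short digest of the raw id keeps sanitized names unique.
    digest = hashlib.sha256(user_id.encode()).hexdigest()[:8]
    return f"{naive_namespace(user_id)}-{digest}"

assert naive_namespace("user_123") == naive_namespace("user-123")  # collision
assert safe_namespace("user_123") != safe_namespace("user-123")    # disambiguated
```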


283-283: Hardcoded enableBilling prevents user override.

The enableBilling field is always set to true, ignoring any potential user preference. This issue was previously raised and remains unaddressed.


323-332: Fragile error detection using string matching.

Checking for "409" or "AlreadyExists" in the error string is brittle and may miss actual errors. This issue was previously raised and remains unaddressed.


530-556: Confusing sentinel error pattern for empty list handling.

Using an error with message "empty" as a sentinel value is a code smell. This issue was previously raised and remains unaddressed.


121-129: Document accepted backend values.

The backend field has no validation or documentation indicating which values are accepted. This issue was previously raised and remains unaddressed.


398-409: Fragile error detection still present despite being marked as addressed.

The past review comment indicates this was addressed in commit 0c1cf26, but the code still uses string-based error matching with e.to_string().contains("404") at line 399.

🔎 Apply this fix to use typed error matching:
-    let session = api.get(&session_id).await.map_err(|e| {
-        if e.to_string().contains("404") || e.to_string().contains("NotFound") {
+    let session = api.get(&session_id).await.map_err(|e| match &e {
+        kube::Error::Api(ae) if ae.code == 404 => {
             ApiError::NotFound {
                 message: format!("Session {} not found", session_id),
             }
-        } else {
+        }
+        _ => {
             error!(error = %e, session_id = %session_id, "Failed to get session");
             ApiError::Internal {
                 message: format!("Failed to get session: {}", e),
             }
         }
     })?;
services/training-service/src/backend.py (2)

207-209: get_peft_model modifies base model; print_trainable_parameters uses stdout.

  1. get_peft_model can modify the base model in place, which may cause issues if the base model is cached and shared across sessions.
  2. print_trainable_parameters() outputs to stdout rather than the structured logger.

This issue was previously raised and remains unaddressed.


531-559: Critical bug: Optimizer references stale parameters after loading checkpoint.

After PeftModel.from_pretrained creates a new model at line 543, session.optimizer still references parameters from the old model. The optimizer state is loaded, but it won't update the new model's parameters. This issue was previously raised and remains unaddressed.
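The stale-reference bug can be demonstrated without torch: an optimizer that captured the old model's parameter objects keeps mutating those, never the freshly loaded model's. A pure-Python analogue (all names here are illustrative; real torch optimizers hold parameter references the same way):

```python
class Param:
    def __init__(self, value: float) -> None:
        self.value = value

class Optimizer:
    """Holds direct references to parameter objects, like torch optimizers do."""
    def __init__(self, params: list) -> None:
        self.params = params

    def step(self) -> None:
        for p in self.params:
            p.value -= 0.1

old_model = [Param(1.0)]
opt = Optimizer(old_model)

# "Load checkpoint": a brand-new model object replaces the old one...
new_model = [Param(1.0)]

opt.step()
# ...but the optimizer still mutates the old parameters:
assert old_model[0].value != 1.0
assert new_model[0].value == 1.0  # the loaded model never trains

# Fix: rebind the optimizer to the new model's parameters after loading.
opt.params = new_model
```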

🧹 Nitpick comments (4)
crates/basilica-api/src/api/routes/training.rs (1)

713-959: Significant code duplication in proxy handlers.

The proxy handler functions (create_internal_session, get_internal_session, forward_backward, optim_step, sample, save_checkpoint, load_checkpoint, forward, compute_logprobs) follow an identical pattern with only minor variations in paths and methods. Consider extracting a common helper or using a macro to reduce duplication.

💡 Example refactor using a helper function:
async fn proxy_handler(
    state: State<AppState>,
    auth: Extension<AuthContext>,
    session_id: String,
    internal_session_id: Option<String>,
    path_suffix: &str,
    method: http::Method,
    body: Option<serde_json::Value>,
    metric_name: &str,
) -> Result<axum::response::Response> {
    let start = Instant::now();
    let k8s_client = state.k8s.as_ref().ok_or(ApiError::ServiceUnavailable)?;
    let kube_client = k8s_client.kube_client();
    let namespace = user_namespace(&auth.user_id);
    
    let path = if let Some(internal_id) = internal_session_id {
        format!("/sessions/{}/{}", internal_id, path_suffix)
    } else {
        format!("/sessions{}", path_suffix)
    };
    
    let result = proxy_to_training_service(
        &kube_client,
        &namespace,
        &session_id,
        &path,
        method,
        body,
    )
    .await;
    
    apimetrics::record_request(metric_name, method.as_str(), start, result.is_ok());
    result
}
scripts/local-training-e2e.sh (3)

58-62: Consider adding a note about the security trade-off.

Installing k3d via curl | bash without signature verification is convenient but poses a security risk. While acceptable for local development scripts, consider documenting this trade-off or providing an option to skip auto-installation.


221-250: Fragile output parsing could break silently.

Lines 229-230 parse the gen-api-key output using grep and sed, which is fragile if the output format changes. While there's a validation check for empty values on Line 232, consider making the key generation output more structured (e.g., JSON) for robust parsing.

Suggested approach:

If gen-api-key can emit JSON, parse it with jq:

KEY_OUTPUT=$(cargo run -p basilica-api --bin gen-api-key -- \
    --user testuser --name training-e2e --scopes "training:*" \
    --format json 2>/dev/null)

API_TOKEN=$(echo "$KEY_OUTPUT" | jq -r '.token')
SQL_INSERT=$(echo "$KEY_OUTPUT" | jq -r '.sql')

Otherwise, document the expected format and add more validation:

 # Extract the token and SQL
 API_TOKEN=$(echo "$KEY_OUTPUT" | grep "Token (Authorization):" | sed 's/.*Bearer //')
 SQL_INSERT=$(echo "$KEY_OUTPUT" | grep "^INSERT INTO")

 if [ -z "$API_TOKEN" ] || [ -z "$SQL_INSERT" ]; then
     log_error "Failed to generate API key"
+    log_error "Expected output format not found. Output was:"
     echo "$KEY_OUTPUT"
     exit 1
 fi

362-401: Reduce duplication in session creation logic.

The session creation payload and curl invocation are duplicated for authenticated vs. unauthenticated requests. This makes the function harder to maintain.

🔎 Consolidate the logic:
# Create a training session (CPU mode with gpu_count: 0)
log_info "Creating training session (CPU mode)..."

CURL_ARGS=(-s -X POST "$API_URL/sessions" -H "Content-Type: application/json")
[ -n "$AUTH_HEADER" ] && CURL_ARGS+=(-H "$AUTH_HEADER")

RESPONSE=$(curl "${CURL_ARGS[@]}" -d '{
    "baseModel": "facebook/opt-125m",
    "checkpointStorage": {
        "backend": "r2",
        "bucket": "test-bucket",
        "path": "test/checkpoints"
    },
    "loraConfig": {
        "rank": 8,
        "alpha": 16
    },
    "gpuResources": {
        "count": 0
    }
}')
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0aa3e6b and eabae23.

📒 Files selected for processing (8)
  • crates/basilica-api/src/api/middleware/scope.rs (1 hunks)
  • crates/basilica-api/src/api/mod.rs (1 hunks)
  • crates/basilica-api/src/api/routes/training.rs (1 hunks)
  • crates/basilica-sdk-python/python/basilica/training/training_client.py (1 hunks)
  • examples/training_example.py (1 hunks)
  • scripts/local-training-e2e.sh (1 hunks)
  • services/training-service/src/backend.py (1 hunks)
  • services/training-service/src/server.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • crates/basilica-api/src/api/mod.rs
🧰 Additional context used
🧬 Code graph analysis (5)
crates/basilica-sdk-python/python/basilica/training/training_client.py (3)
crates/basilica-sdk-python/python/basilica/training/types.py (9)
  • APIFuture (138-184)
  • Datum (17-44)
  • ForwardBackwardResult (110-115)
  • ForwardResult (119-123)
  • ModelInput (48-66)
  • result (163-167)
  • to_dict (37-44)
  • to_dict (84-93)
  • result_async (169-176)
crates/basilica-sdk-python/python/basilica/training/sampling_client.py (4)
  • SamplingClient (16-186)
  • compute_logprobs (128-158)
  • sample (56-105)
  • compute_logprobs_async (172-176)
crates/basilica-sdk-python/python/basilica/training/exceptions.py (1)
  • TrainingError (8-14)
scripts/local-training-e2e.sh (3)
crates/basilica-operator/src/runtime.rs (1)
  • run (162-361)
crates/basilica-api/src/server.rs (1)
  • run (1188-1226)
orchestrator/cluster-manager/src/clustermgr/commands/deployments.py (1)
  • deployments (103-213)
examples/training_example.py (2)
crates/basilica-sdk-python/python/basilica/training/types.py (4)
  • Datum (17-44)
  • ModelInput (48-66)
  • SamplingParams (70-93)
  • from_ints (59-61)
crates/basilica-sdk-python/python/basilica/training/sampling_client.py (1)
  • compute_logprobs (128-158)
services/training-service/src/server.py (1)
services/training-service/src/backend.py (17)
  • ForwardBackwardResult (70-75)
  • ForwardResult (62-66)
  • LoraConfiguration (16-32)
  • OptimizerConfiguration (36-44)
  • SampleResult (79-85)
  • TrainingBackend (88-608)
  • list_sessions (579-581)
  • create_session (171-232)
  • get_session_status (561-571)
  • delete_session (573-577)
  • forward (302-349)
  • compute_logprobs (351-389)
  • forward_backward (234-300)
  • optim_step (391-414)
  • sample (416-478)
  • save_state (480-529)
  • load_state (531-559)
crates/basilica-api/src/api/routes/training.rs (4)
services/training-service/src/server.py (6)
  • create_session (213-244)
  • forward_backward (323-352)
  • optim_step (356-362)
  • sample (369-390)
  • forward (271-295)
  • compute_logprobs (299-319)
crates/basilica-operator/src/crd/training_session.rs (5)
  • default (34-41)
  • default (80-86)
  • default (145-151)
  • new (316-322)
  • new (374-391)
services/training-service/src/backend.py (6)
  • create_session (171-232)
  • forward_backward (234-300)
  • optim_step (391-414)
  • sample (416-478)
  • forward (302-349)
  • compute_logprobs (351-389)
crates/basilica-api/src/k8s/trait.rs (1)
  • kube_client (7-7)
🪛 Ruff (0.14.8)
crates/basilica-sdk-python/python/basilica/training/training_client.py

126-126: Avoid specifying long messages outside the exception class

(TRY003)


155-155: Avoid specifying long messages outside the exception class

(TRY003)


215-215: Avoid specifying long messages outside the exception class

(TRY003)


280-280: Avoid specifying long messages outside the exception class

(TRY003)


305-305: Avoid specifying long messages outside the exception class

(TRY003)


326-326: Avoid specifying long messages outside the exception class

(TRY003)


346-346: Avoid specifying long messages outside the exception class

(TRY003)


366-366: Avoid specifying long messages outside the exception class

(TRY003)


408-408: Avoid specifying long messages outside the exception class

(TRY003)


453-453: Avoid specifying long messages outside the exception class

(TRY003)


461-461: Avoid specifying long messages outside the exception class

(TRY003)

🪛 Shellcheck (0.11.0)
scripts/local-training-e2e.sh

[warning] 422-422: i appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 477-477: Use single quotes, otherwise this expands now rather than when signalled.

(SC2064)

🔇 Additional comments (12)
crates/basilica-api/src/api/middleware/scope.rs (1)

180-242: LGTM! Training scope mappings are well-structured.

The scope mappings for training endpoints follow consistent patterns and provide appropriate granularity (training:create, training:list, training:view, training:use, training:delete). The distinction between public CRD management routes and internal proxy routes is clear.

services/training-service/src/server.py (1)

1-421: LGTM! Clean FastAPI implementation.

The server implementation follows FastAPI best practices with:

  • Proper lifespan management for backend initialization
  • Well-structured Pydantic models with validation constraints
  • Consistent error handling (ValueError → 404, generic Exception → 500)
  • Appropriate use of structured logging
  • Correct tensor conversions for backend operations

The code is clean, maintainable, and production-ready.

scripts/local-training-e2e.sh (10)

1-39: LGTM!

Good use of strict error handling (set -euo pipefail), clear documentation, and well-organized logging helpers.


67-96: LGTM!

The cluster creation logic is well-structured with proper reuse detection and readiness checks.


129-183: LGTM!

Image building, loading, and operator deployment are well-structured with proper readiness checks.


185-211: LGTM!

Excellent handling of stale Postgres data and robust retry logic with clear failure messaging.


252-286: LGTM!

The API startup and key generation workflows are clear and well-documented with helpful user guidance.


288-333: LGTM!

Comprehensive cleanup logic that handles both CRD-managed and orphaned resources with proper waiting for termination.


470-477: Trap cleanup is correct, but shellcheck prefers explicit quoting.

The trap on Line 477 correctly expands $PF_PID at trap-set time (after Line 473 sets it). Shellcheck warns (SC2064) because double quotes can sometimes cause unintended expansions, but this usage is safe.

For clarity and to silence the warning, you could make the intent explicit:

Optional: Explicit quoting style
-    trap "kill $PF_PID 2>/dev/null" EXIT
+    # shellcheck disable=SC2064
+    trap "kill $PF_PID 2>/dev/null" EXIT

Or capture the PID in a way that makes the intent clearer:

readonly PF_PID=$!
trap "kill ${PF_PID} 2>/dev/null" EXIT

630-697: LGTM!

Utility functions for status display, log tailing, and cluster teardown are well-organized and handle errors gracefully.


699-791: LGTM!

Excellent documentation in the help text and clean command dispatcher with good error handling for unknown commands.


420-430: Shellcheck warning about unused loop variable is harmless.

Shellcheck warns (SC2034) that the loop variable i on Line 422 is unused. This is a common Bash pattern for counting iterations and is perfectly fine. The warning can be safely ignored, or you can suppress it with a comment if desired:

# shellcheck disable=SC2034
for i in {1..60}; do

Comment on lines +119 to +124
# Wait for Envoy Gateway
log_info "Waiting for Envoy Gateway to be ready..."
sleep 10
kubectl wait --namespace envoy-gateway-system \
--for=condition=Available deployment/envoy-gateway \
--timeout=180s || log_warn "Timeout waiting for envoy-gateway, continuing..."

⚠️ Potential issue | 🟠 Major

Verify gateway readiness or fail explicitly.

The timeout warning on Line 124 allows the script to continue even if the Envoy Gateway deployment never becomes available. This could cause cryptic failures in downstream operations (like deploying HTTPRoutes or testing).

🔎 Consider failing explicitly or adding a retry:
     kubectl wait --namespace envoy-gateway-system \
         --for=condition=Available deployment/envoy-gateway \
-        --timeout=180s || log_warn "Timeout waiting for envoy-gateway, continuing..."
+        --timeout=180s || {
+            log_error "Envoy Gateway failed to become ready"
+            exit 1
+        }

Alternatively, if you want to be lenient in dev scenarios, add a note that downstream failures may occur:

     kubectl wait --namespace envoy-gateway-system \
         --for=condition=Available deployment/envoy-gateway \
-        --timeout=180s || log_warn "Timeout waiting for envoy-gateway, continuing..."
+        --timeout=180s || log_warn "Timeout waiting for envoy-gateway. Continuing, but downstream operations may fail."
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Wait for Envoy Gateway
log_info "Waiting for Envoy Gateway to be ready..."
sleep 10
kubectl wait --namespace envoy-gateway-system \
--for=condition=Available deployment/envoy-gateway \
--timeout=180s || log_warn "Timeout waiting for envoy-gateway, continuing..."
# Wait for Envoy Gateway
log_info "Waiting for Envoy Gateway to be ready..."
sleep 10
kubectl wait --namespace envoy-gateway-system \
--for=condition=Available deployment/envoy-gateway \
--timeout=180s || {
log_error "Envoy Gateway failed to become ready"
exit 1
}
Suggested change
# Wait for Envoy Gateway
log_info "Waiting for Envoy Gateway to be ready..."
sleep 10
kubectl wait --namespace envoy-gateway-system \
--for=condition=Available deployment/envoy-gateway \
--timeout=180s || log_warn "Timeout waiting for envoy-gateway, continuing..."
# Wait for Envoy Gateway
log_info "Waiting for Envoy Gateway to be ready..."
sleep 10
kubectl wait --namespace envoy-gateway-system \
--for=condition=Available deployment/envoy-gateway \
--timeout=180s || log_warn "Timeout waiting for envoy-gateway. Continuing, but downstream operations may fail."
🤖 Prompt for AI Agents
In scripts/local-training-e2e.sh around lines 119 to 124, the kubectl wait
fallback currently only logs a warning on timeout and lets the script continue,
which can cause downstream cryptic failures; modify this block to fail
explicitly on timeout (exit non-zero) so the pipeline stops when the Envoy
Gateway is not Available, or implement a retry loop with a configurable retry
count/backoff and clear logging and finally exit non-zero if still unavailable;
optionally honor an environment variable (e.g., ALLOW_GATEWAY_WARN) to preserve
the current lenient behavior when set.

Comment on lines +354 to +360
# Health check
log_info "Checking API health..."
if ! curl -sf "$API_URL/health" > /dev/null; then
log_error "API not running. Start it with: $0 api"
exit 1
fi
curl -s "$API_URL/health" | jq .

⚠️ Potential issue | 🟡 Minor

Add timeout to curl health check.

The health check curl on Line 356 has no timeout, which could cause the script to hang indefinitely if the API is unresponsive.

🔎 Add a timeout:
-    if ! curl -sf "$API_URL/health" > /dev/null; then
+    if ! curl -sf --max-time 10 "$API_URL/health" > /dev/null; then
         log_error "API not running. Start it with: $0 api"
         exit 1
     fi
📝 Committable suggestion


Suggested change
# Health check
log_info "Checking API health..."
if ! curl -sf "$API_URL/health" > /dev/null; then
log_error "API not running. Start it with: $0 api"
exit 1
fi
curl -s "$API_URL/health" | jq .
# Health check
log_info "Checking API health..."
if ! curl -sf --max-time 10 "$API_URL/health" > /dev/null; then
log_error "API not running. Start it with: $0 api"
exit 1
fi
curl -s "$API_URL/health" | jq .
🤖 Prompt for AI Agents
In scripts/local-training-e2e.sh around lines 354 to 360, the curl health check
(line 356) lacks a timeout and can hang indefinitely; modify the curl invocation
to include a sensible timeout (for example --max-time 5 or -m 5) so the command
fails fast on unresponsive API, and keep the existing -s and -f flags; also
consider adding the same timeout to the subsequent curl that pipes to jq to
avoid hanging when printing the health response.

Comment on lines +483 to +488
if ! curl -sf "$TRAINING_URL/health" > /dev/null 2>&1; then
log_error "Training service not accessible via port-forward"
kill $PF_PID 2>/dev/null
return 1
fi
curl -s "$TRAINING_URL/health" | jq .

⚠️ Potential issue | 🟡 Minor

Add timeout to training service health check.

The curl on Line 483 has no timeout and could hang if the port-forward is established but the service is unresponsive.

🔎 Add a timeout:
-    if ! curl -sf "$TRAINING_URL/health" > /dev/null 2>&1; then
+    if ! curl -sf --max-time 10 "$TRAINING_URL/health" > /dev/null 2>&1; then
         log_error "Training service not accessible via port-forward"
         kill $PF_PID 2>/dev/null
         return 1
     fi
📝 Committable suggestion


Suggested change
if ! curl -sf "$TRAINING_URL/health" > /dev/null 2>&1; then
log_error "Training service not accessible via port-forward"
kill $PF_PID 2>/dev/null
return 1
fi
curl -s "$TRAINING_URL/health" | jq .
if ! curl -sf --max-time 10 "$TRAINING_URL/health" > /dev/null 2>&1; then
log_error "Training service not accessible via port-forward"
kill $PF_PID 2>/dev/null
return 1
fi
curl -s "$TRAINING_URL/health" | jq .
🤖 Prompt for AI Agents
In scripts/local-training-e2e.sh around lines 483 to 488, the health check curl
calls lack timeouts and can hang if the port-forward exists but the service is
unresponsive; update both curl invocations to include a reasonable timeout
(e.g., --max-time 5 and/or --connect-timeout 2) and retain the existing error
handling so the first timed-out check triggers log_error, kills the PF_PID, and
returns 1; ensure the follow-up curl that pipes to jq also uses a timeout so the
script never blocks indefinitely.

Comment on lines +495 to +519
RESPONSE=$(curl -s -X POST "$TRAINING_URL/sessions" \
-H "Content-Type: application/json" \
-d "{
\"session_id\": \"$INTERNAL_SESSION\",
\"base_model\": \"facebook/opt-125m\",
\"lora_config\": {
\"rank\": 8,
\"alpha\": 16,
\"dropout\": 0.05
},
\"optimizer_config\": {
\"learning_rate\": 0.0001,
\"weight_decay\": 0.01
}
}")
echo "$RESPONSE" | jq .

if echo "$RESPONSE" | jq -e '.session_id' > /dev/null 2>&1; then
log_info "Training session created successfully"
else
log_error "Failed to create training session"
echo "$RESPONSE"
kill $PF_PID 2>/dev/null
return 1
fi

⚠️ Potential issue | 🟡 Minor

Add timeout to training session creation.

The curl on Line 495 creates a training session but has no timeout, risking a hang if the service is slow or unresponsive.

🔎 Add a timeout:
-    RESPONSE=$(curl -s -X POST "$TRAINING_URL/sessions" \
+    RESPONSE=$(curl -s --max-time 30 -X POST "$TRAINING_URL/sessions" \
         -H "Content-Type: application/json" \
         -d "{

Apply the same pattern to other training operation curls (lines 537, 558, 577, 588, 606).

📝 Committable suggestion


Suggested change
RESPONSE=$(curl -s -X POST "$TRAINING_URL/sessions" \
-H "Content-Type: application/json" \
-d "{
\"session_id\": \"$INTERNAL_SESSION\",
\"base_model\": \"facebook/opt-125m\",
\"lora_config\": {
\"rank\": 8,
\"alpha\": 16,
\"dropout\": 0.05
},
\"optimizer_config\": {
\"learning_rate\": 0.0001,
\"weight_decay\": 0.01
}
}")
echo "$RESPONSE" | jq .
if echo "$RESPONSE" | jq -e '.session_id' > /dev/null 2>&1; then
log_info "Training session created successfully"
else
log_error "Failed to create training session"
echo "$RESPONSE"
kill $PF_PID 2>/dev/null
return 1
fi
RESPONSE=$(curl -s --max-time 30 -X POST "$TRAINING_URL/sessions" \
-H "Content-Type: application/json" \
-d "{
\"session_id\": \"$INTERNAL_SESSION\",
\"base_model\": \"facebook/opt-125m\",
\"lora_config\": {
\"rank\": 8,
\"alpha\": 16,
\"dropout\": 0.05
},
\"optimizer_config\": {
\"learning_rate\": 0.0001,
\"weight_decay\": 0.01
}
}")
echo "$RESPONSE" | jq .
if echo "$RESPONSE" | jq -e '.session_id' > /dev/null 2>&1; then
log_info "Training session created successfully"
else
log_error "Failed to create training session"
echo "$RESPONSE"
kill $PF_PID 2>/dev/null
return 1
fi
🤖 Prompt for AI Agents
In scripts/local-training-e2e.sh around lines 495 to 519, the curl POST that
creates the training session has no timeout and can hang; add a total request
timeout (e.g. --max-time 30) to the curl invocation to fail fast on
slow/unresponsive services, handle the non-zero exit by checking curl success
before parsing JSON, and apply the same --max-time timeout change to the other
training-related curl calls at lines 537, 558, 577, 588, and 606 so all training
operations use a consistent timeout.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 7

♻️ Duplicate comments (9)
crates/basilica-sdk-python/python/basilica/training/service_client.py (3)

37-74: Store timeout for async client initialization.

The timeout parameter is passed to the sync httpx.Client but not stored for later use. Line 399 in _get_async_client hardcodes timeout=300.0, ignoring the configured value.

🔎 Apply this diff to store and reuse timeout:
     def __init__(
         self,
         api_key: Optional[str] = None,
         endpoint: Optional[str] = None,
         timeout: float = 300.0,
     ):
         ...
+        self._timeout = timeout
         self._client = httpx.Client(
             base_url=self.endpoint,
             headers={"Authorization": f"Bearer {self.api_key}"},
             timeout=timeout,
         )
         self._async_client = None  # Lazy init

Then update _get_async_client to use self._timeout (see separate comment).

Based on past review comments.


350-378: Use specific exception types for better error handling.

Lines 364-365 and 370-371 raise generic TrainingError instead of the more specific SessionNotFoundError and SessionNotReadyError that are already imported.

🔎 Apply this diff to use specific exceptions:
+from .exceptions import (
+    TrainingError,
+    SessionTimeoutError,
+    SessionNotFoundError,
+    SessionNotReadyError,
+    AuthenticationError,
+    ValidationError,
+)

Then update the method:

         if resp.status_code == 404:
-            raise TrainingError(f"Session {session_id} not found")
+            raise SessionNotFoundError(session_id)
         if not resp.is_success:
             raise TrainingError(f"Failed to get session: {resp.text}")

         data = resp.json()
         if data.get("phase") != "ready":
-            raise TrainingError(f"Session not ready: {data.get('phase')}")
+            raise SessionNotReadyError(session_id, data.get("phase", "unknown"))

Based on past review comments.


393-401: Async client uses hardcoded timeout instead of configured value.

Line 399 hardcodes timeout=300.0, ignoring the timeout parameter passed to __init__. This creates inconsistency between sync and async clients.

🔎 Apply this diff (requires storing timeout in __init__):
     async def _get_async_client(self) -> httpx.AsyncClient:
         """Get or create async HTTP client."""
         if self._async_client is None:
             self._async_client = httpx.AsyncClient(
                 base_url=self.endpoint,
                 headers={"Authorization": f"Bearer {self.api_key}"},
-                timeout=300.0,
+                timeout=self._timeout,
             )
         return self._async_client

Based on past review comments.

crates/basilica-api/src/api/routes/training.rs (6)

121-129: Document accepted backend values.

The backend field lacks documentation about which values are accepted (e.g., "s3", "gcs"). Consider adding a doc comment or, better yet, using an enum with Deserialize to enforce valid values at deserialization time.

Based on past review comments.


399-399: Hardcoded enableBilling prevents user configuration.

Line 399 hardcodes "enableBilling": true, preventing users from opting out of billing. Consider adding an optional field to CreateSessionRequest with a sensible default.

Based on past review comments.


316-323: Namespace collision risk remains unaddressed.

The sanitization logic can create collisions (e.g., "user_123" and "user-123" both map to "u-user-123") and doesn't enforce Kubernetes' 63-character limit. Consider using a hash-based or base32-encoded approach for deterministic, collision-free namespace names.

Based on past review comments.
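The hash-suffix approach suggested above can be sketched in a few lines. This is a hypothetical illustration, not the project's code: `namespace_for` and the FNV-1a hash are stand-ins, chosen only because they need no external crates; any stable hash would serve the same purpose.

```rust
// Hypothetical sketch: deterministic, collision-resistant namespace names.
// FNV-1a is a stdlib-only stand-in for whatever hash the project prefers.
fn fnv1a(s: &str) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for b in s.bytes() {
        h ^= b as u64;
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    h
}

fn namespace_for(user_id: &str) -> String {
    // Sanitize for DNS-1123, then append a hash of the *original* id so
    // "user_123" and "user-123" cannot collide after sanitization.
    let sanitized: String = user_id
        .to_lowercase()
        .chars()
        .map(|c| if c.is_ascii_alphanumeric() { c } else { '-' })
        .collect();
    let hash = format!("{:08x}", fnv1a(user_id) as u32);
    // "u-" (2) + prefix (<=52) + "-" (1) + hash (8) stays within K8s' 63-char limit.
    let prefix: String = sanitized.chars().take(52).collect();
    format!("u-{}-{}", prefix.trim_end_matches('-'), hash)
}

fn main() {
    assert_ne!(namespace_for("user_123"), namespace_for("user-123"));
    assert!(namespace_for(&"x".repeat(200)).len() <= 63);
    println!("{}", namespace_for("user_123"));
}
```

Because the suffix is derived from the raw user id, the mapping stays deterministic across restarts while distinct ids can no longer alias each other.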


439-448: Fragile error detection via string matching.

Line 441 checks for "409" or "AlreadyExists" in the error string, which is brittle. Use typed error matching instead: matches!(e, kube::Error::Api(ref ae) if ae.code == 409) for robustness.

Based on past review comments.
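The typed pattern recommended in these comments can be shown with a minimal stand-in for kube::Error, so the snippet compiles without the kube crate; in the real handlers you would match on kube::Error::Api(ae) directly.

```rust
// Minimal stand-ins for kube::Error / kube::core::ErrorResponse so the
// pattern compiles without the kube crate; the `matches!` shape is identical.
#[derive(Debug)]
struct ErrorResponse {
    code: u16,
}

#[derive(Debug)]
enum KubeError {
    Api(ErrorResponse),
    Other(String),
}

// Typed status check: no string matching on Display output.
fn is_status(e: &KubeError, code: u16) -> bool {
    matches!(e, KubeError::Api(ae) if ae.code == code)
}

fn main() {
    let conflict = KubeError::Api(ErrorResponse { code: 409 });
    let transport = KubeError::Other("connection refused".into());
    assert!(is_status(&conflict, 409));
    assert!(!is_status(&transport, 409)); // transport errors never masquerade as HTTP statuses
}
```

Unlike e.to_string().contains("409"), this cannot false-positive on a "409" that merely appears somewhere in an error message body.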


514-525: Fragile error detection via string matching.

Line 515 uses string matching (e.to_string().contains("404")) to detect not-found errors. This should use typed error matching: matches!(e, kube::Error::Api(ref ae) if ae.code == 404).

Note: Past review marked this as addressed, but string matching is still present in the current code.


646-672: Confusing sentinel error pattern for empty list handling.

Lines 652-654 use ApiError::Internal { message: "empty".to_string() } as a sentinel value to signal an empty list, which is later checked at line 664. This makes the code harder to understand. Use explicit control flow or an Option instead.

Based on past review comments.
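The explicit-control-flow alternative can be sketched as follows; `newest_step` and the sample data are illustrative, not the handler's real types.

```rust
// Sketch: return Option instead of a sentinel ApiError::Internal { message: "empty" }.
// The empty-list case becomes an explicit branch at the call site.
fn newest_step(steps: &[u64]) -> Option<u64> {
    steps.iter().copied().max()
}

fn main() {
    match newest_step(&[100, 250, 175]) {
        Some(step) => println!("latest checkpoint at step {step}"),
        None => println!("no checkpoints yet"), // no magic "empty" error to check for later
    }
    assert_eq!(newest_step(&[]), None);
}
```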

🧹 Nitpick comments (5)
crates/basilica-sdk-python/python/basilica/training/rest_client.py (2)

212-228: Consider consistent response handling across list methods.

Line 226 returns resp.json() directly, while other list methods extract specific keys (e.g., list_training_runs extracts "runs", list_checkpoints extracts "checkpoints"). If the sessions endpoint returns a wrapper object with a "sessions" key, extracting it would maintain consistency.

🔎 View suggested fix if the API returns a wrapper:
 def _call():
     resp = self._client.get("/sessions", params={"limit": limit})
     if not resp.is_success:
         raise TrainingError(f"list_sessions failed: {resp.text}")
-    return resp.json()
+    return resp.json().get("sessions", [])

290-292: Consider waiting for in-flight operations during shutdown.

Line 292 calls shutdown(wait=False), which may abandon in-flight operations. For graceful cleanup, wait=True ensures pending futures complete before shutdown.

🔎 Apply this diff for graceful shutdown:
 def close(self):
     """Close the REST client."""
-    self._executor.shutdown(wait=False)
+    self._executor.shutdown(wait=True)
crates/basilica-api/src/api/middleware/scope.rs (1)

180-267: Consider more precise pattern matching for internal session operations.

The scope validation patterns are functional but could be more precise. For example, lines 192-196 use contains("/internal/"), which would match any path containing that substring, not just the intended /sessions/:session_id/internal/:internal_session_id pattern. While not a security risk (unmatched routes will 404), tighter patterns improve maintainability.

Example improvement:
// Instead of:
(&Method::GET, p) if p.starts_with("/sessions/") && p.contains("/internal/") => {
    Some("training:view".to_string())
}

// Consider:
(&Method::GET, p) if p.starts_with("/sessions/") && p.matches("/internal/").count() == 1 && !p.ends_with("/internal") => {
    Some("training:view".to_string())
}

However, given that invalid paths will be rejected at routing and all matched scopes are appropriate, this is a low-priority refinement.

crates/basilica-api/src/api/routes/training.rs (2)

750-825: Consider adding timeout for proxy requests.

The proxy implementation doesn't specify a timeout for the kube_client.request() call. If the training service is slow or hangs, this could block the API server thread. Consider adding a timeout and/or documenting that the kube client's default timeout will apply.


1459-1579: Placeholder implementations for checkpoint operations.

Several checkpoint management functions (download URL, delete, publish, unpublish) contain TODO comments and placeholder implementations (lines 1479, 1511-1520, 1543-1546, 1574-1575). While acceptable for an MVP, these should be tracked.

Do you want me to open issues to track the completion of these checkpoint operations, or are they already tracked elsewhere?

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between eabae23 and 430c5b4.

📒 Files selected for processing (6)
  • crates/basilica-api/src/api/middleware/scope.rs (1 hunks)
  • crates/basilica-api/src/api/mod.rs (1 hunks)
  • crates/basilica-api/src/api/routes/training.rs (1 hunks)
  • crates/basilica-sdk-python/python/basilica/training/__init__.py (1 hunks)
  • crates/basilica-sdk-python/python/basilica/training/rest_client.py (1 hunks)
  • crates/basilica-sdk-python/python/basilica/training/service_client.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
crates/basilica-api/src/api/mod.rs (1)
crates/basilica-api/src/api/routes/training.rs (19)
  • create_session (410-498)
  • list_sessions (633-737)
  • get_session (501-580)
  • delete_session (586-630)
  • forward_backward (884-908)
  • optim_step (912-935)
  • sample (939-963)
  • forward (1023-1047)
  • compute_logprobs (1051-1075)
  • save_for_sampler (1079-1103)
  • get_capabilities (1109-1129)
  • list_training_runs (1133-1203)
  • get_training_run_by_path (1256-1312)
  • get_training_run (1207-1252)
  • list_checkpoints (1318-1390)
  • get_checkpoint_info (1394-1457)
  • publish_checkpoint (1528-1555)
  • unpublish_checkpoint (1559-1579)
  • delete_checkpoint (1495-1524)
crates/basilica-sdk-python/python/basilica/training/__init__.py (2)
crates/basilica-sdk-python/python/basilica/training/types.py (8)
  • Datum (17-44)
  • ModelInput (48-66)
  • SamplingParams (70-93)
  • SampleResponse (100-106)
  • ForwardBackwardResult (110-115)
  • ForwardResult (119-123)
  • GetServerCapabilitiesResponse (127-132)
  • APIFuture (138-184)
crates/basilica-sdk-python/python/basilica/training/exceptions.py (1)
  • TrainingError (8-14)
crates/basilica-api/src/api/routes/training.rs (4)
crates/basilica-api/src/server.rs (2)
  • std (1246-1246)
  • new (559-1165)
services/training-service/src/server.py (13)
  • LoraConfigRequest (53-59)
  • OptimizerConfigRequest (62-70)
  • CreateSessionRequest (73-80)
  • CreateSessionResponse (83-87)
  • SessionStatusResponse (179-187)
  • create_session (213-244)
  • get_session (248-254)
  • delete_session (258-264)
  • forward_backward (323-352)
  • optim_step (356-362)
  • sample (369-390)
  • forward (271-295)
  • compute_logprobs (299-319)
services/training-service/src/backend.py (7)
  • create_session (171-232)
  • delete_session (573-577)
  • forward_backward (234-300)
  • optim_step (391-414)
  • sample (416-478)
  • forward (302-349)
  • compute_logprobs (351-389)
crates/basilica-api/src/k8s/helpers.rs (1)
  • arr (9-11)
crates/basilica-sdk-python/python/basilica/training/rest_client.py (2)
crates/basilica-sdk-python/python/basilica/training/types.py (2)
  • APIFuture (138-184)
  • result_async (169-176)
crates/basilica-sdk-python/python/basilica/training/exceptions.py (1)
  • TrainingError (8-14)
🪛 Ruff (0.14.8)
crates/basilica-sdk-python/python/basilica/training/__init__.py

70-98: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

crates/basilica-sdk-python/python/basilica/training/service_client.py

65-67: Avoid specifying long messages outside the exception class

(TRY003)


155-157: Avoid specifying long messages outside the exception class

(TRY003)


191-191: Avoid specifying long messages outside the exception class

(TRY003)


200-200: Avoid specifying long messages outside the exception class

(TRY003)


208-208: Avoid specifying long messages outside the exception class

(TRY003)


236-236: Avoid specifying long messages outside the exception class

(TRY003)


328-328: Avoid specifying long messages outside the exception class

(TRY003)


365-365: Avoid specifying long messages outside the exception class

(TRY003)


367-367: Avoid specifying long messages outside the exception class

(TRY003)


371-371: Avoid specifying long messages outside the exception class

(TRY003)


389-389: Avoid specifying long messages outside the exception class

(TRY003)

crates/basilica-sdk-python/python/basilica/training/rest_client.py

53-53: Avoid specifying long messages outside the exception class

(TRY003)


71-71: Avoid specifying long messages outside the exception class

(TRY003)


89-89: Avoid specifying long messages outside the exception class

(TRY003)


115-115: Avoid specifying long messages outside the exception class

(TRY003)


133-133: Avoid specifying long messages outside the exception class

(TRY003)


151-151: Avoid specifying long messages outside the exception class

(TRY003)


168-168: Avoid specifying long messages outside the exception class

(TRY003)


188-188: Avoid specifying long messages outside the exception class

(TRY003)


206-206: Avoid specifying long messages outside the exception class

(TRY003)


225-225: Avoid specifying long messages outside the exception class

(TRY003)


243-243: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (5)
crates/basilica-sdk-python/python/basilica/training/__init__.py (1)

70-98: Current __all__ organization is clear; ignore the static-analysis sorting hint.

Static analysis suggests alphabetically sorting __all__, but the current logical grouping (version, clients, types, exceptions) is more maintainable and readable than alphabetical order.

crates/basilica-api/src/api/routes/training.rs (1)

461-486: Good: Proper typed error matching for 409 Conflict.

Line 469 correctly uses kube::Error::Api(ae) if ae.code == 409 for typed error matching, which is more robust than string-based checks.

crates/basilica-api/src/api/mod.rs (3)

105-113: LGTM! Session lifecycle endpoints follow RESTful conventions.

The session lifecycle routes are well-structured and follow standard CRUD patterns under the protected router.


155-172: LGTM! Capabilities and training runs endpoints are well-structured.

The capabilities and training runs management routes follow clear RESTful patterns. The "Phase 3" comments provide helpful context for the phased rollout.


114-154: All training operation handlers are properly implemented in crates/basilica-api/src/api/routes/training.rs. The referenced handlers exist:

  • create_internal_session (line 829)
  • get_internal_session (line 857)
  • forward_backward (line 884)
  • optim_step (line 912)
  • sample (line 939)
  • save_checkpoint (line 967)
  • load_checkpoint (line 995)
  • forward (line 1023)
  • compute_logprobs (line 1051)
  • save_for_sampler (line 1079)

No issues found.

Comment on lines +1146 to +1164
let sessions = match api.list(&ListParams::default().labels(&format!("user={}", auth.user_id))).await {
Ok(list) => list,
Err(e) => {
if e.to_string().contains("404") || e.to_string().contains("NotFound") {
apimetrics::record_request("training.runs.list", "GET", start, true);
return Ok(Json(ListTrainingRunsResponse {
runs: vec![],
total: 0,
offset: params.offset,
limit: params.limit,
}));
}
error!(error = %e, "Failed to list training runs");
apimetrics::record_request("training.runs.list", "GET", start, false);
return Err(ApiError::Internal {
message: format!("Failed to list training runs: {}", e),
});
}
};

🛠️ Refactor suggestion | 🟠 Major

Use typed error matching for 404 detection.

Line 1149 uses string matching to detect 404 errors. Use kube::Error::Api(ae) if ae.code == 404 instead for more robust error handling.

🤖 Prompt for AI Agents
In crates/basilica-api/src/api/routes/training.rs around lines 1146 to 1164, the
code currently detects a 404 by matching substrings on the error string; change
this to typed error matching using kube::Error::Api(ae) if ae.code == 404 in the
match arm that handles Err(e). Specifically, add a match arm that matches
kube::Error::Api(ae) if ae.code == 404 and in that arm record the successful
metrics and return the empty ListTrainingRunsResponse as currently done for
404s; leave the existing error logging, metrics failure recording, and
ApiError::Internal return in the fallback arm for all other error variants.
Ensure the match uses the typed kube::Error pattern rather than string
comparisons.

Comment on lines +1219 to +1230
let session = api.get(&run_id).await.map_err(|e| {
if e.to_string().contains("404") || e.to_string().contains("NotFound") {
ApiError::NotFound {
message: format!("Training run {} not found", run_id),
}
} else {
error!(error = %e, run_id = %run_id, "Failed to get training run");
ApiError::Internal {
message: format!("Failed to get training run: {}", e),
}
}
})?;

🛠️ Refactor suggestion | 🟠 Major

Use typed error matching for 404 detection.

Line 1220 uses string matching to detect 404 errors. Apply the same typed error matching pattern recommended elsewhere: kube::Error::Api(ae) if ae.code == 404.

🤖 Prompt for AI Agents
In crates/basilica-api/src/api/routes/training.rs around lines 1219 to 1230, the
code detects a 404 by string-matching on the error; replace that with typed
matching on kube::Error::Api(ae) if ae.code == 404. Update the map_err closure
to pattern-match the error (or use matches/if let) against kube::Error::Api(ae)
with ae.code == 404 and return ApiError::NotFound in that branch; for all other
errors keep the existing error! log (including the error and run_id) and return
ApiError::Internal with the error string. Ensure the pattern match compiles with
the error type returned by api.get and preserve existing logging and messages.

Comment on lines +1279 to +1290
let session = api.get(&session_id).await.map_err(|e| {
if e.to_string().contains("404") || e.to_string().contains("NotFound") {
ApiError::NotFound {
message: format!("Training run for path {} not found", params.path),
}
} else {
error!(error = %e, path = %params.path, "Failed to get training run by path");
ApiError::Internal {
message: format!("Failed to get training run: {}", e),
}
}
})?;

🛠️ Refactor suggestion | 🟠 Major

Use typed error matching for 404 detection.

Line 1280 uses string matching to detect 404 errors. Use kube::Error::Api(ae) if ae.code == 404 for consistency with best practices.

🤖 Prompt for AI Agents
In crates/basilica-api/src/api/routes/training.rs around lines 1279 to 1290, the
closure currently detects 404 by string-matching the error message; replace that
with a typed match on the kube error variant (match e { kube::Error::Api(ae) if
ae.code == 404 => ApiError::NotFound { message: format!("Training run for path
{} not found", params.path) }, other => { error!(error = %other, path =
%params.path, "Failed to get training run by path"); ApiError::Internal {
message: format!("Failed to get training run: {}", other) } } }) so you use
kube::Error::Api(ae).code == 404 for 404 detection and preserve logging and
internal error construction for other cases.

Comment on lines +1332 to +1361
match api.get(run_id).await {
Ok(s) => vec![s],
Err(e) => {
if e.to_string().contains("404") || e.to_string().contains("NotFound") {
vec![]
} else {
error!(error = %e, run_id = %run_id, "Failed to get session for checkpoints");
apimetrics::record_request("training.checkpoints.list", "GET", start, false);
return Err(ApiError::Internal {
message: format!("Failed to list checkpoints: {}", e),
});
}
}
}
} else {
match api.list(&ListParams::default().labels(&format!("user={}", auth.user_id))).await {
Ok(list) => list.items,
Err(e) => {
if e.to_string().contains("404") || e.to_string().contains("NotFound") {
vec![]
} else {
error!(error = %e, "Failed to list sessions for checkpoints");
apimetrics::record_request("training.checkpoints.list", "GET", start, false);
return Err(ApiError::Internal {
message: format!("Failed to list checkpoints: {}", e),
});
}
}
}
};

🛠️ Refactor suggestion | 🟠 Major

Use typed error matching for 404 detection.

Lines 1335 and 1350 use string matching to detect 404 errors. Consistently apply typed error matching: kube::Error::Api(ae) if ae.code == 404.

🤖 Prompt for AI Agents
In crates/basilica-api/src/api/routes/training.rs around lines 1332 to 1361, the
code detects 404s by matching error.to_string() contents; change these to typed
matches against kube::Error::Api(ae) with ae.code == 404 in both error arms.
Replace the if e.to_string().contains(...) checks with a match (or if let) on
Err(e) => match e { kube::Error::Api(ae) if ae.code == 404 => vec![], _ => {
error!(error = %e, ...); apimetrics::record_request(...); return
Err(ApiError::Internal { message: format!(...) }); } } and ensure kube::Error is
in scope (add use kube::Error if needed).

Comment on lines +1434 to +1450
if let Some(cp_array) = status.get("checkpointDetails").and_then(|v| v.as_array()) {
for cp in cp_array {
let cp_path = cp.get("path").and_then(|v| v.as_str()).unwrap_or("");
if cp_path.contains(&checkpoint_name) || cp_path == params.path {
apimetrics::record_request("training.checkpoints.info", "GET", start, true);
return Ok(Json(CheckpointResponse {
checkpoint_id: cp.get("id").and_then(|v| v.as_str()).unwrap_or("").to_string(),
run_id: session_id,
path: cp_path.to_string(),
step: cp.get("step").and_then(|v| v.as_u64()).unwrap_or(0),
size_bytes: cp.get("sizeBytes").and_then(|v| v.as_u64()).unwrap_or(0),
created_at: cp.get("createdAt").and_then(|v| v.as_str()).unwrap_or("").to_string(),
is_published: cp.get("isPublished").and_then(|v| v.as_bool()).unwrap_or(false),
public_url: cp.get("publicUrl").and_then(|v| v.as_str()).map(String::from),
}));
}
}
⚠️ Potential issue | 🟡 Minor

Loose checkpoint path matching may return wrong checkpoint.

Line 1437 uses cp_path.contains(&checkpoint_name) which could match unintended checkpoints. For example, if checkpoint_name is "step_100", it would also match "step_1000". Consider using exact path matching or more precise substring logic.

🤖 Prompt for AI Agents
In crates/basilica-api/src/api/routes/training.rs around lines 1434 to 1450, the
checkpoint match uses cp_path.contains(&checkpoint_name) which can produce false
positives (e.g., "step_100" matching "step_1000"); change the condition to a
precise match — either compare the full path equality (cp_path == params.path)
or compare the checkpoint filename/basename exactly to checkpoint_name (extract
the last path segment and compare == checkpoint_name), handling potential
trailing slashes, then keep the existing return block when a true exact match is
found.
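
The suggested exact-segment comparison can be sketched as follows (illustrated in Python for brevity — the route itself is Rust — and the helper name is hypothetical):

```python
from pathlib import PurePosixPath

def matches_checkpoint(cp_path: str, checkpoint_name: str, full_path: str) -> bool:
    """Match a checkpoint by exact full path or exact final path segment.

    Avoids the substring pitfall where "step_100" also matches "step_1000".
    Trailing slashes are normalized before comparing.
    """
    normalized = cp_path.rstrip("/")
    if full_path and normalized == full_path.rstrip("/"):
        return True
    return PurePosixPath(normalized).name == checkpoint_name

# "step_100" no longer matches "step_1000"
assert matches_checkpoint("runs/a/step_100", "step_100", "")
assert not matches_checkpoint("runs/a/step_1000", "step_100", "")
```

The same basename comparison translates directly to Rust via `std::path::Path::file_name`.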

Comment on lines +120 to +136
def get_checkpoint_archive_url(self, checkpoint_id: str) -> APIFuture:
"""Get signed download URL for checkpoint.

Args:
checkpoint_id: Checkpoint ID

Returns:
APIFuture resolving to presigned URL string
"""

def _call():
resp = self._client.get(f"/checkpoints/{checkpoint_id}/download_url")
if not resp.is_success:
raise TrainingError(f"get_checkpoint_archive_url failed: {resp.text}")
return resp.json()["url"]

return APIFuture(self._executor.submit(_call), str)
⚠️ Potential issue | 🟡 Minor

Add defensive key checking to prevent KeyError.

Line 134 accesses resp.json()["url"] without verifying the key exists. If the API response is malformed or the key is missing, this will raise a KeyError.

🔎 Apply this diff to add validation:
 def _call():
     resp = self._client.get(f"/checkpoints/{checkpoint_id}/download_url")
     if not resp.is_success:
         raise TrainingError(f"get_checkpoint_archive_url failed: {resp.text}")
-    return resp.json()["url"]
+    data = resp.json()
+    if "url" not in data:
+        raise TrainingError("Missing 'url' in response")
+    return data["url"]
🧰 Tools
🪛 Ruff (0.14.8)

133-133: Avoid specifying long messages outside the exception class

(TRY003)

Comment on lines +175 to +191
def publish_checkpoint(self, path: str) -> APIFuture:
"""Make checkpoint publicly accessible.

Args:
path: Checkpoint path to publish

Returns:
APIFuture resolving to public URL string
"""

def _call():
resp = self._client.post("/checkpoints/publish", json={"path": path})
if not resp.is_success:
raise TrainingError(f"publish_checkpoint failed: {resp.text}")
return resp.json()["public_url"]

return APIFuture(self._executor.submit(_call), str)
⚠️ Potential issue | 🟡 Minor

Add defensive key checking to prevent KeyError.

Line 189 accesses resp.json()["public_url"] without verifying the key exists, risking a KeyError if the response is malformed.

🔎 Apply this diff to add validation:
 def _call():
     resp = self._client.post("/checkpoints/publish", json={"path": path})
     if not resp.is_success:
         raise TrainingError(f"publish_checkpoint failed: {resp.text}")
-    return resp.json()["public_url"]
+    data = resp.json()
+    if "public_url" not in data:
+        raise TrainingError("Missing 'public_url' in response")
+    return data["public_url"]
🧰 Tools
🪛 Ruff (0.14.8)

188-188: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
In crates/basilica-sdk-python/python/basilica/training/rest_client.py around
lines 175 to 191, the code blindly accesses resp.json()["public_url"] which can
raise KeyError or crash on invalid JSON; parse the JSON into a variable, check
that it is a dict and contains the "public_url" key, and if missing raise a
TrainingError that includes the response status, raw text and/or parsed body for
debugging; also catch JSON decode errors and raise TrainingError with
response.text so the APIFuture always surfaces a clear, defensive error rather
than a KeyError.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

♻️ Duplicate comments (8)
crates/basilica-api/src/api/routes/training.rs (4)

121-129: Add validation or documentation for storage backend values.

The backend field accepts any string. Consider adding validation for supported backends (e.g., "s3", "gcs", "r2") or at minimum document accepted values.

Based on past review comments flagging this same issue.
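
A minimal sketch of the suggested allow-list validation (in Python for brevity; the backend names are the ones suggested above, not a confirmed list):

```python
# Assumed set of supported backends, per the suggestion above.
SUPPORTED_BACKENDS = {"s3", "gcs", "r2"}

def validate_backend(backend: str) -> str:
    """Return the normalized backend name, or raise ValueError for unknown values."""
    normalized = backend.strip().lower()
    if normalized not in SUPPORTED_BACKENDS:
        raise ValueError(
            f"unsupported storage backend {backend!r}; "
            f"expected one of {sorted(SUPPORTED_BACKENDS)}"
        )
    return normalized
```

In the Rust route this would typically be an enum deserialized with serde, which rejects unknown values for free.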


341-348: Namespace collision and length issues remain unaddressed.

This sanitization can cause collisions (e.g., "user_123" and "user-123" both become "u-user-123"). Additionally, K8s namespace names have a 63-character limit that isn't enforced.

Based on past review comments flagging namespace collision risk.


1329-1347: Use typed error matching for 404 detection.

Apply the same typed matching pattern: kube::Error::Api(ae) if ae.code == 404.

Based on past review comments.


1617-1633: Loose checkpoint path matching may return wrong checkpoint.

Using cp_path.contains(&checkpoint_name) can match unintended checkpoints (e.g., "step_100" matches "step_1000"). Use exact path matching or compare the final path segment.

Based on past review comments.

examples/training_example.py (4)

39-40: Resource leak: file handle not closed.

This issue was previously flagged. The file is opened without a context manager, leaving the handle unclosed.

🔎 Apply this diff to fix:
-    if os.path.exists("build/api-token.txt"):
-        api_key = open("build/api-token.txt").read().strip()
+    if os.path.exists("build/api-token.txt"):
+        with open("build/api-token.txt") as f:
+            api_key = f.read().strip()

Apply the same fix at lines 120-121, 151-152, and 185-186.


120-121: Resource leak: file handle not closed.

This is the same issue flagged earlier. Use a context manager to ensure the file is properly closed.

🔎 Apply the same fix as lines 39-40:
-    if os.path.exists("build/api-token.txt"):
-        api_key = open("build/api-token.txt").read().strip()
+    if os.path.exists("build/api-token.txt"):
+        with open("build/api-token.txt") as f:
+            api_key = f.read().strip()

151-152: Resource leak: file handle not closed.

Same resource leak issue. Apply the same fix as previous occurrences.

🔎 Apply the context manager fix:
-        if os.path.exists("build/api-token.txt"):
-            api_key = open("build/api-token.txt").read().strip()
+        if os.path.exists("build/api-token.txt"):
+            with open("build/api-token.txt") as f:
+                api_key = f.read().strip()

185-186: Resource leak: file handle not closed.

Fourth occurrence of the same resource leak. Apply the context manager fix here as well.

🔎 Apply the context manager fix:
-    if os.path.exists("build/api-token.txt"):
-        api_key = open("build/api-token.txt").read().strip()
+    if os.path.exists("build/api-token.txt"):
+        with open("build/api-token.txt") as f:
+            api_key = f.read().strip()
🧹 Nitpick comments (11)
crates/basilica-sdk-python/python/basilica/training/tests/test_clients.py (2)

142-145: Remove redundant os.environ.pop calls.

The patch.dict(os.environ, {}, clear=True) already clears all environment variables, making the subsequent pop calls unnecessary.

🔎 Simplify the test setup:
     def test_init_requires_api_key(self):
         """Test that ServiceClient requires API key."""
         with patch.dict(os.environ, {}, clear=True):
-            # Remove any environment variables
-            os.environ.pop("BASILICA_API_KEY", None)
-            os.environ.pop("BASILICA_API_TOKEN", None)
-
             with pytest.raises(ValueError, match="API key required"):
                 ServiceClient()

194-217: Consider verifying exact request parameters.

While the test confirms the method is called, asserting the exact parameters passed to mock_client.post would strengthen the test and catch parameter-related regressions.

Example enhancement:
# After line 217, add:
call_args = mock_client.post.call_args
assert call_args[1]['json']['data'][0]['input_ids'] == [1, 2, 3]
# Or use call_args_list to verify the exact call signature
crates/basilica-api/src/api/middleware/scope.rs (1)

180-268: Training scope mappings look well-structured.

The new training-related scope mappings follow the established pattern. The route matching logic correctly distinguishes between different operations (create, view, use, delete).

However, consider adding unit tests for the new training endpoints to maintain test coverage parity with existing rental and payment endpoint tests (visible in lines 283-454).

crates/basilica-api/src/api/routes/training.rs (5)

464-473: Use typed error matching for namespace creation.

Line 466 uses string matching (e.to_string().contains("409")) while the CRD creation at line 494 correctly uses typed matching. Apply the same pattern here for consistency and robustness.

🔎 Suggested fix:
     if let Err(e) = k8s_client.create_namespace(&namespace).await {
-        // Ignore "already exists" errors
-        if !e.to_string().contains("409") && !e.to_string().contains("AlreadyExists") {
+        // Ignore "already exists" errors using typed matching
+        // Note: This requires checking if k8s_client.create_namespace returns kube::Error
+        let is_conflict = e.to_string().contains("409") || e.to_string().contains("AlreadyExists");
+        if !is_conflict {
             error!(error = %e, namespace = %namespace, "Failed to create namespace");

Note: If create_namespace doesn't return kube::Error directly, you may need to update its signature or wrap the error type. Based on past review comments flagging fragile error detection.


562-573: Use typed error matching for 404 detection.

This uses string matching to detect 404 errors. Apply the typed error matching pattern used elsewhere: kube::Error::Api(ae) if ae.code == 404.

🔎 Suggested fix:
-    let original_session = api.get(original_session_id).await.map_err(|e| {
-        if e.to_string().contains("404") || e.to_string().contains("NotFound") {
+    let original_session = api.get(original_session_id).await.map_err(|e| match &e {
+        kube::Error::Api(ae) if ae.code == 404 => {
             ApiError::NotFound {
                 message: format!("Original session {} not found", original_session_id),
             }
-        } else {
+        }
+        _ => {
             error!(error = %e, path = %req.path, "Failed to get original session");
             ApiError::Internal {
                 message: format!("Failed to get original session: {}", e),
             }
         }
     })?;

697-708: Still uses string matching for 404 detection.

The past review marked this as "✅ Addressed in commit 0c1cf26" but the code still uses string matching. Use kube::Error::Api(ae) if ae.code == 404 for consistency with delete_session (line 796).


828-855: Sentinel error pattern still present.

The past review marked this as "✅ Addressed in commit 430c5b4" but the code still uses the "empty" sentinel string pattern. Refactor to use direct control flow:

🔎 Suggested fix:
-    let sessions = api
+    let sessions = match api
         .list(&ListParams::default().labels(&format!("user={}", auth.user_id)))
         .await
-        .map_err(|e| {
-            // If namespace doesn't exist, return empty list
-            if e.to_string().contains("404") || e.to_string().contains("NotFound") {
-                return ApiError::Internal {
-                    message: "empty".to_string(), // Sentinel for empty list
-                };
-            }
-            error!(error = %e, "Failed to list sessions");
-            ApiError::Internal {
-                message: format!("Failed to list sessions: {}", e),
-            }
-        });
-
-    let sessions = match sessions {
-        Ok(list) => list,
-        Err(e) if e.to_string().contains("empty") => {
+    {
+        Ok(list) => list,
+        Err(kube::Error::Api(ae)) if ae.code == 404 => {
             apimetrics::record_request("training.list", "GET", start, true);
             return Ok(Json(vec![]));
         }
         Err(e) => {
+            error!(error = %e, "Failed to list sessions");
             apimetrics::record_request("training.list", "GET", start, false);
-            return Err(e);
+            return Err(ApiError::Internal {
+                message: format!("Failed to list sessions: {}", e),
+            });
         }
     };

1290-1312: Consider making capabilities configurable.

Hardcoding model names and limits works for MVP, but consider loading these from configuration or a CRD to avoid code changes when adding new models or adjusting limits.

scripts/test-training-sdk.sh (1)

39-67: Consider checking for jq availability.

The script uses jq for JSON processing on lines 46, 56, and 66 without verifying it's installed. While jq is commonly available in development environments, adding a check would improve robustness and provide a clearer error message if it's missing.

🔎 Add this check after the venv activation block:
 echo ""
 
+# Check for jq
+if ! command -v jq &> /dev/null; then
+    echo "WARNING: jq not found, JSON output will be raw"
+    echo "Install with: apt-get install jq (or brew install jq on macOS)"
+fi
+
 # Check if API is running
examples/training_example.py (2)

47-57: Narrow the exception handler.

Catching bare Exception can mask unexpected errors including KeyboardInterrupt. For a production-quality example, consider catching specific exceptions or at minimum excluding KeyboardInterrupt and SystemExit.

🔎 Apply this diff to narrow the exception scope:
-    except Exception as e:
+    except (IOError, ValueError, RuntimeError) as e:
         print(f"Could not fetch capabilities: {e}")

Alternatively, if you want to catch all non-system exceptions:

-    except Exception as e:
+    except BaseException as e:
+        if isinstance(e, (KeyboardInterrupt, SystemExit)):
+            raise
         print(f"Could not fetch capabilities: {e}")

73-90: Consider adding error handling for educational value.

While the training loop is functionally correct, adding basic error handling around the .result() calls would make this example more production-ready and educational for users who will adapt this code.

🔎 Example error handling pattern:
try:
    for _ in range(5):
        result = training.forward_backward(
            [Datum(input_ids=train_tokens)],
            loss_fn="cross_entropy",
        ).result()
        
        step = training.optim_step().result()
        print(f"Step {step}: loss={result.loss:.4f}")
    
    checkpoint_path = training.save_state("checkpoint-final").result()
    print(f"\nCheckpoint saved: {checkpoint_path}")
except Exception as e:
    print(f"Training error: {e}")
    raise
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 430c5b4 and f0c655f.

📒 Files selected for processing (7)
  • crates/basilica-api/src/api/middleware/scope.rs (1 hunks)
  • crates/basilica-api/src/api/mod.rs (1 hunks)
  • crates/basilica-api/src/api/routes/training.rs (1 hunks)
  • crates/basilica-sdk-python/python/basilica/training/tests/__init__.py (1 hunks)
  • crates/basilica-sdk-python/python/basilica/training/tests/test_clients.py (1 hunks)
  • examples/training_example.py (1 hunks)
  • scripts/test-training-sdk.sh (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • crates/basilica-sdk-python/python/basilica/training/tests/__init__.py
🧰 Additional context used
🧬 Code graph analysis (1)
examples/training_example.py (4)
crates/basilica-sdk-python/python/basilica/training/types.py (5)
  • Datum (17-44)
  • ModelInput (48-66)
  • SamplingParams (70-93)
  • result (163-167)
  • from_ints (59-61)
crates/basilica-sdk-python/python/basilica/training/training_client.py (8)
  • session_id (66-68)
  • forward_backward (160-224)
  • step (76-78)
  • optim_step (246-285)
  • save_state (289-308)
  • forward (97-132)
  • compute_logprobs (134-158)
  • sample (422-455)
services/training-service/src/server.py (6)
  • forward_backward (323-352)
  • optim_step (356-362)
  • save_state (397-407)
  • forward (271-295)
  • compute_logprobs (299-319)
  • sample (369-390)
crates/basilica-sdk-python/python/basilica/training/sampling_client.py (2)
  • compute_logprobs (128-158)
  • sample (56-105)
🪛 Ruff (0.14.8)
examples/training_example.py

55-55: Do not catch blind exception: Exception

(BLE001)

🔇 Additional comments (23)
crates/basilica-sdk-python/python/basilica/training/tests/test_clients.py (7)

62-84: LGTM!

Well-structured tests covering the core Datum.to_dict() functionality with good coverage of optional fields.


87-102: LGTM!

Good test coverage with proper mock verification for tokenizer integration.


105-134: LGTM!

Comprehensive test coverage for SamplingParams including defaults, custom parameters, and serialization.


240-294: LGTM!

Well-structured tests with proper resource cleanup via close() calls. Good coverage of REST client operations.


297-312: LGTM!

Clean test verifying GetServerCapabilitiesResponse construction from dictionary data.


315-316: LGTM!

Convenient pattern for running tests directly during development.


51-59: The @pytest.mark.asyncio decorator is properly supported: pytest-asyncio>=0.21 is already listed in the project's dev dependencies.

crates/basilica-api/src/api/mod.rs (1)

104-201: LGTM! Training routes properly integrated.

The training routes are correctly added to the protected routes with appropriate handler mappings. The middleware ordering (scope validation after auth) ensures proper authorization flow.

crates/basilica-api/src/api/routes/training.rs (1)

1662-1674: Placeholder token in download URL.

The URL contains token=placeholder which provides no actual security. Ensure this is replaced with proper signed URL generation before production use.

Verify this endpoint is not exposed in production without proper signed URL implementation.
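
For illustration, a hand-rolled HMAC-signed URL sketch (an assumption-laden sketch, not the project's implementation — in practice the storage provider's presigning API, e.g. S3 presigned URLs, would be used, and every name here is hypothetical):

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

def sign_download_url(base_url: str, path: str, secret: bytes,
                      ttl_secs: int = 900, now: int = None) -> str:
    """Build a URL with an expiry and an HMAC token instead of a fixed placeholder."""
    expires = (int(time.time()) if now is None else now) + ttl_secs
    msg = f"{path}:{expires}".encode()
    token = hmac.new(secret, msg, hashlib.sha256).hexdigest()
    query = urlencode({"path": path, "expires": expires, "token": token})
    return f"{base_url}?{query}"

def verify_token(path: str, expires: int, token: str, secret: bytes, now: int) -> bool:
    """Reject expired or tampered tokens; constant-time compare avoids timing leaks."""
    if now > expires:
        return False
    msg = f"{path}:{expires}".encode()
    expected = hmac.new(secret, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, token)
```

Any scheme along these lines would replace `token=placeholder` with something the download handler can actually verify.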

scripts/test-training-sdk.sh (6)

1-4: LGTM: Proper error handling setup.

The set -e directive ensures the script fails fast on any command error, which is appropriate for an integration test script.


5-17: LGTM: Robust path resolution and optional venv activation.

The script correctly resolves paths and conditionally activates the virtual environment if present, providing good flexibility for different development setups.


19-27: LGTM: Robust health check with helpful error messages.

The health check properly validates API availability before proceeding and provides clear remediation steps if the API is not running.


29-37: LGTM: Secure token handling.

The script properly validates token file existence and truncates the token in output for security, which is a good practice.


69-73: LGTM: Straightforward example execution.

The Python example is executed correctly with the appropriate flag, relying on set -e for error propagation.


75-86: LGTM: Robust pytest execution with graceful degradation.

The script properly checks for pytest availability, provides installation instructions if missing, and gracefully continues even if some tests fail (appropriate for an integration test that may run in various environments).

examples/training_example.py (8)

1-33: LGTM: Clear documentation and appropriate imports.

The docstring provides excellent usage examples and API patterns. The sys.path manipulation is acceptable for an example script that needs to import from a local development SDK.


59-72: LGTM: Well-documented LoRA configuration.

The context manager usage is correct, and the inline comments clearly explain each LoRA parameter. The GPU count of 0 for CPU testing is appropriate for a portable example.


92-111: LGTM: Clear demonstration of logprobs computation.

The logprobs demo correctly uses the API to perform inference, compute per-token probabilities, and calculate sequence probability. The pattern is educational and correct.


122-142: LGTM: Clear demonstration of SamplingClient usage.

The sampling example correctly demonstrates using ModelInput.from_ints() for token-based prompts and SamplingParams for generation control. The API usage matches the expected patterns.


145-179: LGTM: Proper async/await pattern with cleanup.

The async example correctly demonstrates the async API with proper cleanup in the finally block. The use of try/finally ensures training.close() is called even if errors occur during training.


182-231: LGTM: Comprehensive RestClient demonstration.

The example effectively demonstrates REST API operations including listing runs/sessions/checkpoints and retrieving run details. Good defensive programming with .get() for dictionary access and proper cleanup with rest.close().


233-255: LGTM: Clean CLI dispatch.

The argument parsing and dispatch logic is straightforward and well-structured, providing clear entry points for different example modes.


112-114: The training.sample() convenience method exists and is correctly used in the example. No issue found.

Likely an incorrect or invalid review comment.

Comment on lines +651 to +667
match api.create(&PostParams::default(), &crd).await {
Ok(_) => {
info!(
session_id = %session_id,
namespace = %namespace,
resumed_from = %original_session_id,
"Created TrainingSession CRD from checkpoint"
);
}
Err(e) => {
error!(error = %e, "Failed to create TrainingSession CRD from checkpoint");
apimetrics::record_request("training.from_state", "POST", start, false);
return Err(ApiError::Internal {
message: format!("Failed to create session: {}", e),
});
}
}
⚠️ Potential issue | 🟡 Minor

Missing 409 conflict handling for CRD creation.

Unlike create_session (line 494), this doesn't handle the case where the session already exists. If a UUID collision occurs (unlikely but possible), this will return an internal error instead of a proper conflict response.

🔎 Suggested fix:
     match api.create(&PostParams::default(), &crd).await {
         Ok(_) => {
             info!(
                 session_id = %session_id,
                 namespace = %namespace,
                 resumed_from = %original_session_id,
                 "Created TrainingSession CRD from checkpoint"
             );
         }
+        Err(kube::Error::Api(ae)) if ae.code == 409 => {
+            warn!(session_id = %session_id, "TrainingSession already exists");
+            apimetrics::record_request("training.from_state", "POST", start, false);
+            return Err(ApiError::Conflict {
+                message: format!("Session {} already exists", session_id),
+            });
+        }
         Err(e) => {
🤖 Prompt for AI Agents
In crates/basilica-api/src/api/routes/training.rs around lines 651 to 667, the
create() error path should detect a Kubernetes "already exists"/409 conflict and
return a Conflict API error instead of treating every error as Internal; update
the Err(e) branch to inspect the error (e.g., match on kube/k8s error kind or
map to AlreadyExists/Conflict), and for conflict cases log an appropriate
warn/info, record the apimetrics success=false for "training.from_state", and
return ApiError::Conflict with a clear message; for all other errors keep the
existing error log, metrics recording, and return ApiError::Internal.

Comment on lines +981 to +994
// Execute the request through the kube client
let response = kube_client
.request::<serde_json::Value>(request)
.await
.map_err(|e| {
error!(
error = %e,
service = %service_name,
namespace = %namespace,
path = %path,
"Failed to proxy request to training service"
);
ApiError::ServiceUnavailable
})?;
⚠️ Potential issue | 🟠 Major

Missing timeout for proxy requests.

The K8s API server proxy request has no explicit timeout. If the training pod is unresponsive, this could cause requests to hang indefinitely, potentially exhausting connection pools or worker threads.

🔎 Consider adding a timeout:

You may need to wrap the request with a tokio timeout or configure the kube client with request timeouts:

use tokio::time::{timeout, Duration};

let response = timeout(
    Duration::from_secs(30),
    kube_client.request::<serde_json::Value>(request)
).await
.map_err(|_| ApiError::GatewayTimeout { 
    message: "Training service request timed out".to_string() 
})?
.map_err(|e| {
    // ... existing error handling
})?;
🤖 Prompt for AI Agents
In crates/basilica-api/src/api/routes/training.rs around lines 981 to 994, the
kube client proxy request has no timeout and can hang indefinitely; wrap the
kube_client.request call with tokio::time::timeout using a sensible Duration
(e.g., 30s), map a timeout result to ApiError::GatewayTimeout with an
explanatory message, and preserve the existing map_err handling for the inner
request error (log the error and return ApiError::ServiceUnavailable); also add
the necessary tokio::time::{timeout, Duration} import and adjust logs to
indicate when a timeout occurred.

Comment on lines +1676 to +1707
/// Delete a checkpoint.
/// DELETE /checkpoints/:checkpoint_id
pub async fn delete_checkpoint(
State(state): State<AppState>,
Extension(auth): Extension<AuthContext>,
Path(checkpoint_id): Path<String>,
) -> Result<StatusCode> {
let start = Instant::now();

info!(
user_id = %auth.user_id,
checkpoint_id = %checkpoint_id,
"Deleting checkpoint"
);

// Check that we have K8s access
let _k8s_client = state.k8s.as_ref().ok_or(ApiError::ServiceUnavailable)?;

// TODO: Actually delete the checkpoint from storage
// This would involve:
// 1. Finding the session that owns this checkpoint
// 2. Updating the session's checkpoint list
// 3. Deleting the checkpoint files from R2/S3

warn!(
checkpoint_id = %checkpoint_id,
"Checkpoint deletion not fully implemented - returning success"
);

apimetrics::record_request("training.checkpoints.delete", "DELETE", start, true);
Ok(StatusCode::NO_CONTENT)
}
⚠️ Potential issue | 🟠 Major

Delete checkpoint returns success without performing deletion.

This endpoint claims success (NO_CONTENT) without actually deleting anything. This could lead to data inconsistencies and user confusion when checkpoints aren't actually removed.

Consider either:

  1. Returning a 501 Not Implemented until fully implemented
  2. Implementing the actual deletion logic
  3. Documenting this as expected MVP behavior in the API response
🤖 Prompt for AI Agents
In crates/basilica-api/src/api/routes/training.rs around lines 1676 to 1707, the
delete_checkpoint handler currently returns 204 NO_CONTENT without performing
any deletion; either make this explicit or implement deletion. Fix by one of two
actions: (A) If not ready to delete, return StatusCode::NOT_IMPLEMENTED (501)
and log/warn that deletion is not implemented, keeping the k8s check and
metrics; or (B) implement the deletion flow: verify checkpoint ownership via
session lookup, remove checkpoint ID from the session record, delete checkpoint
files from the configured object storage (R2/S3) and any Kubernetes resources,
remove checkpoint metadata from DB, handle and log errors (return appropriate
status codes), and only record metrics and return 204 on success. Ensure auth
checks, error handling, and metrics remain consistent.

Comment on lines +10 to +11
# Add parent path for imports
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
🛠️ Refactor suggestion | 🟠 Major

Remove sys.path manipulation in favor of proper package installation.

Modifying sys.path directly is fragile and creates dependencies on the directory structure. Instead, install the package in editable mode (pip install -e .) or configure pytest's pythonpath in pytest.ini or pyproject.toml.

🔎 Remove these lines and configure proper package installation:
-# Add parent path for imports
-sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
-
 from basilica.training import (

Then ensure the package is installed via:

pip install -e crates/basilica-sdk-python

Or add to pytest.ini:

[pytest]
pythonpath = crates/basilica-sdk-python/python

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In crates/basilica-sdk-python/python/basilica/training/tests/test_clients.py
around lines 10 to 11, remove the sys.path manipulation (sys.path.insert(...))
and rely on proper package installation or pytest configuration instead; install
the package in editable mode from the repo root (pip install -e
crates/basilica-sdk-python) or add the package path to pytest configuration
(pytest.ini or pyproject.toml pythonpath) so tests can import the package
without altering sys.path.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 10

♻️ Duplicate comments (15)
services/training-service/README.md (1)

38-48: Add language specification to fenced code block.

Line 38 begins an example output code block without a language identifier, which fails markdown linting (MD040).

🔎 Apply this diff to specify the language:
 ### Example Output

-```
+```text
 === Creating session for facebook/opt-125m ===
 Session: ts-a1b2c3d4

 === Training (3 steps) ===
 Step 1: loss=6.2612
 Step 2: loss=5.7423
 Step 3: loss=5.7501

 Sample: The quick brown -> fox jumps over the lazy
-```
+```
scripts/local-training-e2e.sh (4)

119-124: Verify gateway readiness or fail explicitly.

The timeout warning on Line 124 allows the script to continue even if the Envoy Gateway deployment never becomes available. This could cause cryptic failures in downstream operations (like deploying HTTPRoutes or testing).

🔎 Consider failing explicitly:
     kubectl wait --namespace envoy-gateway-system \
         --for=condition=Available deployment/envoy-gateway \
-        --timeout=180s || log_warn "Timeout waiting for envoy-gateway, continuing..."
+        --timeout=180s || {
+            log_error "Envoy Gateway failed to become ready"
+            exit 1
+        }

568-574: Add timeout to curl health check.

The health check curl on Line 570 has no timeout, which could cause the script to hang indefinitely if the API is unresponsive.

🔎 Add a timeout:
-    if ! curl -sf "$API_URL/health" > /dev/null; then
+    if ! curl -sf --max-time 10 "$API_URL/health" > /dev/null; then
         log_error "API not running. Start it with: $0 api"
         exit 1
     fi
-    curl -s "$API_URL/health" | jq .
+    curl -s --max-time 10 "$API_URL/health" | jq .

695-702: Add timeout to training service health check.

The curl on Line 697 has no timeout and could hang if the port-forward is established but the service is unresponsive.

🔎 Add a timeout:
-    if ! curl -sf "$TRAINING_URL/health" > /dev/null 2>&1; then
+    if ! curl -sf --max-time 10 "$TRAINING_URL/health" > /dev/null 2>&1; then
         log_error "Training service not accessible via port-forward"
         kill $PF_PID 2>/dev/null
         return 1
     fi
-    curl -s "$TRAINING_URL/health" | jq .
+    curl -s --max-time 10 "$TRAINING_URL/health" | jq .

709-733: Add timeout to training session creation and subsequent curls.

The curl on Line 709 creates a training session but has no timeout, risking a hang if the service is slow or unresponsive. The same issue applies to other training operation curls throughout run_training_steps.

🔎 Add timeouts to all training curls:
-    RESPONSE=$(curl -s -X POST "$TRAINING_URL/sessions" \
+    RESPONSE=$(curl -s --max-time 30 -X POST "$TRAINING_URL/sessions" \
         -H "Content-Type: application/json" \

Apply --max-time 30 (or appropriate value) to all curl calls in this function: lines 751, 772, 787, 791, 802, 820, 841, 858, 864, 871, 874, 880, 897, 902, 921.
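Rather than repeating the flag on every call, the timeout could be applied through one wrapper; `tcurl` and `CURL_TIMEOUT` are hypothetical names introduced for this sketch, not part of the reviewed script:

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: one shared timeout for every curl call in the script,
# so a single unresponsive endpoint cannot hang the whole pipeline.
CURL_TIMEOUT="${CURL_TIMEOUT:-30}"

tcurl() {
    curl --max-time "$CURL_TIMEOUT" "$@"
}

# A health check now fails fast instead of blocking indefinitely.
if tcurl -sf "http://localhost:9090/health" > /dev/null 2>&1; then
    echo "healthy"
else
    echo "unreachable or timed out"
fi
```

Individual calls can still override the budget with `CURL_TIMEOUT=60 tcurl ...` where a longer operation (like session creation) warrants it.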

crates/basilica-operator/src/controllers/training_session_controller.rs (10)

203-212: Status update condition still misses endpoint and start_time changes.

The condition now checks tokens_processed (line 206) but still doesn't check endpoint or start_time. Changes to these fields won't trigger status updates as previously noted.


239-251: Still using is_err() without distinguishing NotFound from other errors.

As previously noted, this treats network/auth errors the same as "pod not found" and proceeds to create a pod.


253-265: Same is_err() issue for service existence check.

Apply the same fix pattern as recommended for the pod check.


314-321: Pod failure still doesn't transition session to Failed phase.

As previously noted, when the pod enters "Failed" state, the session phase should transition to Failed, not remain in Scheduling.


389-394: Inconsistent error propagation with other handlers.

As previously noted, this propagates errors with ? while handle_pending converts similar failures to Ok(status.with_error(...)).


453-455: Silently ignoring pod deletion result.

As previously noted, the let _ = discards errors from delete_pod. At minimum, log the outcome.


643-650: Owner reference UID still defaults to empty string if missing.

As previously noted, the UID should be validated and an error returned if missing, not defaulted to an empty string.


710-717: Same empty UID issue in owner reference.

As previously noted, apply the same fix pattern as recommended for build_training_pod.


756-795: HTTPRoute still missing owner reference for garbage collection.

As previously noted, without ownerReferences, this HTTPRoute will be orphaned when the TrainingSession is deleted.


819-826: Same empty UID issue in NetworkPolicy owner reference.

As previously noted, apply the same fix pattern as recommended for the other resource builders.

🧹 Nitpick comments (3)
scripts/local-training-e2e.sh (3)

58-62: Add timeout to k3d install curl.

The k3d install curl has no timeout and could hang indefinitely if the network is slow or unresponsive.

🔎 Suggested fix
     if ! command -v k3d &> /dev/null; then
         log_info "Installing k3d..."
-        curl -s https://raw.githubusercontent.com/k3d-io/k3d/main/install.sh | bash
+        curl -s --max-time 60 https://raw.githubusercontent.com/k3d-io/k3d/main/install.sh | bash
     fi

634-644: Unused loop variable.

The loop variable i is unused. Rename it to _ to signal an intentionally unused variable.

🔎 Suggested fix:
-    for i in {1..60}; do
+    for _ in {1..60}; do
         POD_STATUS=$(kubectl get pods -A -l app=basilica-training -o jsonpath='{.items[0].status.phase}' 2>/dev/null)

Based on static analysis hint SC2034.


973-985: Add timeout to storage mounts curl.

The curl on Line 980 has no timeout and could hang if the storage daemon is unresponsive, leaving the port-forward orphaned.

🔎 Add timeout:
         kubectl port-forward -n basilica-storage "pod/$DAEMON_POD" 9092:9090 &
         PF_PID=$!
         sleep 1
-        curl -sf "http://localhost:9092/mounts" 2>/dev/null | jq . 2>/dev/null || echo "No mounts or daemon not responding"
+        curl -sf --max-time 5 "http://localhost:9092/mounts" 2>/dev/null | jq . 2>/dev/null || echo "No mounts or daemon not responding"
         kill $PF_PID 2>/dev/null
         wait $PF_PID 2>/dev/null
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f0c655f and 73cdae1.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (5)
  • crates/basilica-api/Cargo.toml (1 hunks)
  • crates/basilica-operator/src/controllers/training_session_controller.rs (1 hunks)
  • scripts/local-training-e2e.sh (1 hunks)
  • services/training-service/README.md (1 hunks)
  • services/training-service/src/backend.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • services/training-service/src/backend.py
🧰 Additional context used
🧬 Code graph analysis (2)
scripts/local-training-e2e.sh (7)
orchestrator/cluster-manager/src/clustermgr/commands/kubeconfig.py (1)
  • kubeconfig (22-27)
crates/basilica-common/src/utils/docker_validation.rs (1)
  • image (38-39)
crates/basilica-common/src/error.rs (1)
  • generic (357-361)
crates/basilica-api/src/server.rs (1)
  • run (1188-1226)
crates/basilica-operator/src/runtime.rs (1)
  • run (162-361)
orchestrator/cluster-manager/src/clustermgr/commands/deployments.py (1)
  • deployments (103-213)
orchestrator/cluster-manager/src/clustermgr/commands/gateway.py (1)
  • gateway (296-302)
crates/basilica-operator/src/controllers/training_session_controller.rs (1)
crates/basilica-operator/src/crd/training_session.rs (5)
  • new (316-322)
  • new (374-391)
  • default (34-41)
  • default (80-86)
  • default (145-151)
🪛 markdownlint-cli2 (0.18.1)
services/training-service/README.md

38-38: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


128-128: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


200-200: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


253-253: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


404-404: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Shellcheck (0.11.0)
scripts/local-training-e2e.sh

[warning] 636-636: i appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 691-691: Use single quotes, otherwise this expands now rather than when signalled.

(SC2064)

🔇 Additional comments (7)
scripts/local-training-e2e.sh (6)

1-38: LGTM!

Good use of set -euo pipefail for strict error handling, and the color-coded logging helpers improve script readability.


67-96: LGTM!

Cluster creation logic is well-structured with proper port mappings, kubeconfig export, and node readiness wait.


129-150: LGTM!

Image build and load logic is clean with clear logging for each step.


152-176: LGTM!

Operator deployment is well-structured with proper kubectl wait and timeout.


399-433: LGTM!

Postgres setup with stale data detection and reset functionality is well-implemented.


1032-1088: LGTM!

Comprehensive help documentation clearly explains the workflow, commands, and test steps.

crates/basilica-operator/src/controllers/training_session_controller.rs (1)

872-982: Good test coverage for resource builders.

The tests comprehensively verify pod and service creation, including GPU resources, FUSE storage configuration, volume mounts, and command wrapping. This provides solid confidence in the resource builder functions.

Comment on lines +110 to +117
        if sessions.sessions.is_empty() {
            debug!("no active sessions in training service");
            return None;
        }

        // Get status of the first session (typically there's only one per pod)
        let session_id = &sessions.sessions[0];
        let status_url = format!("{}/sessions/{}", internal_endpoint, session_id);

⚠️ Potential issue | 🟡 Minor

Verify session name matches expected TrainingSession.

The code retrieves the first session from the list without confirming it matches the expected name. If the training service somehow returns multiple sessions or a different session, status updates could be applied to the wrong TrainingSession.

🔎 Suggested fix to validate session name
         if sessions.sessions.is_empty() {
             debug!("no active sessions in training service");
             return None;
         }
 
-        // Get status of the first session (typically there's only one per pod)
-        let session_id = &sessions.sessions[0];
+        // Get status of the first session, verify it matches expected name
+        let session_id = sessions.sessions.iter()
+            .find(|id| id.as_str() == name)
+            .or_else(|| sessions.sessions.first())?;
         let status_url = format!("{}/sessions/{}", internal_endpoint, session_id);
🤖 Prompt for AI Agents
In crates/basilica-operator/src/controllers/training_session_controller.rs
around lines 110 to 117, the code unconditionally uses the first session from
sessions.sessions which can misidentify the intended TrainingSession; instead,
search sessions.sessions for an entry that matches the expected `name` (or other
canonical identifier available in scope), select that session's id for status
checks, and only proceed when a matching session is found (otherwise log/debug
and return None). Ensure the lookup handles multiple sessions, uses the matched
session ID in the status_url, and retains clear logging when no matching session
exists.

Comment on lines +460 to +476
        // Poll training service for status updates
        let mut updated_status = status.clone();
        if let Some((steps, tokens)) = self.fetch_training_status(namespace, name).await {
            // Only update if there's actual progress
            if steps != status.steps_completed || tokens != status.tokens_processed {
                info!(
                    session = %name,
                    steps = steps,
                    tokens = tokens,
                    "updating training progress"
                );
                updated_status = updated_status.with_steps(steps, tokens);
            }
        }

        // Update last activity timestamp
        updated_status.last_updated = chrono::Utc::now().to_rfc3339();

⚠️ Potential issue | 🟡 Minor

Timestamp updated even when status fetch fails.

Line 476 unconditionally updates last_updated even when fetch_training_status returns None. This creates misleading timestamps suggesting the session was recently checked when no actual status was obtained.

🔎 Suggested fix to update timestamp only on successful fetch
         // Poll training service for status updates
         let mut updated_status = status.clone();
         if let Some((steps, tokens)) = self.fetch_training_status(namespace, name).await {
             // Only update if there's actual progress
             if steps != status.steps_completed || tokens != status.tokens_processed {
                 info!(
                     session = %name,
                     steps = steps,
                     tokens = tokens,
                     "updating training progress"
                 );
                 updated_status = updated_status.with_steps(steps, tokens);
             }
+            // Update last activity timestamp only when we successfully fetched status
+            updated_status.last_updated = chrono::Utc::now().to_rfc3339();
         }
 
-        // Update last activity timestamp
-        updated_status.last_updated = chrono::Utc::now().to_rfc3339();
-
         Ok(updated_status)
🤖 Prompt for AI Agents
In crates/basilica-operator/src/controllers/training_session_controller.rs
around lines 460 to 476, last_updated is set unconditionally even when
fetch_training_status returned None; move the updated_status.last_updated =
chrono::Utc::now().to_rfc3339() into the Some branch so the timestamp is only
updated when fetch_training_status succeeded (place it after updating
steps/tokens or alongside that update), ensuring last_updated changes only on
successful status fetch.

Comment on lines +500 to +501
    // Allow TRAINING_IMAGE env var to override the default image (for local dev)
    let image = std::env::var("TRAINING_IMAGE").unwrap_or_else(|_| spec.image.clone());

⚠️ Potential issue | 🟠 Major

TRAINING_IMAGE override lacks dev_mode protection.

The environment variable allows overriding the spec image for "local dev" but isn't gated behind a dev_mode check. If this env var is accidentally set in production, it could allow execution of arbitrary container images, bypassing security policies.

🔎 Suggested fix to gate behind dev_mode
-    // Allow TRAINING_IMAGE env var to override the default image (for local dev)
-    let image = std::env::var("TRAINING_IMAGE").unwrap_or_else(|_| spec.image.clone());
+    // Allow TRAINING_IMAGE env var to override the default image (only in dev mode)
+    let image = if cfg!(feature = "dev_mode") {
+        std::env::var("TRAINING_IMAGE").unwrap_or_else(|_| spec.image.clone())
+    } else {
+        spec.image.clone()
+    };

Alternatively, check for an explicit DEV_MODE environment variable:

-    // Allow TRAINING_IMAGE env var to override the default image (for local dev)
-    let image = std::env::var("TRAINING_IMAGE").unwrap_or_else(|_| spec.image.clone());
+    // Allow TRAINING_IMAGE env var to override the default image (only in dev mode)
+    let image = if std::env::var("DEV_MODE").is_ok() {
+        std::env::var("TRAINING_IMAGE").unwrap_or_else(|_| spec.image.clone())
+    } else {
+        spec.image.clone()
+    };
🤖 Prompt for AI Agents
In crates/basilica-operator/src/controllers/training_session_controller.rs
around lines 500-501, the TRAINING_IMAGE env var is used unconditionally to
override the spec image; gate this override behind a dev-mode check (e.g., check
an explicit DEV_MODE env var or a operator config flag) so production cannot be
overridden. Update the code to read DEV_MODE (parse as bool or check for
"1"/"true"), only call std::env::var("TRAINING_IMAGE") when dev mode is enabled,
otherwise always use spec.image; also add a warning log when dev mode is enabled
and the override is applied, and ensure no panic on missing/invalid DEV_MODE by
defaulting to false.

Comment on lines +262 to +270
create_r2_credentials() {
    export KUBECONFIG="$KUBECONFIG_PATH"

    # R2 credentials - you can override these with environment variables
    R2_ENDPOINT="${R2_ENDPOINT:-https://80f15715bb0b882c9e967c13e677ed7d.r2.cloudflarestorage.com}"
    R2_ACCESS_KEY="${R2_ACCESS_KEY:-df1147f966d3f8f16af3911da34478e8}"
    R2_SECRET_KEY="${R2_SECRET_KEY:-2db5c9bd18a845df832938f95217e64d91d50437bdf9f39f5423f752eb3c697b}"
    R2_BUCKET="${R2_BUCKET:-80f15715bb0b882c9e967c13e677ed7d}"
    R2_REGION="${R2_REGION:-auto}"

⚠️ Potential issue | 🔴 Critical

Critical: Hardcoded R2 credentials in source code.

The R2 access key and secret key defaults appear to be real credentials committed to version control. This is a serious security risk—even if these are for a test bucket, credentials in source code can be scraped, reused, or accidentally exposed.

🔎 Remove default secrets and require environment variables:
     # R2 credentials - you can override these with environment variables
-    R2_ENDPOINT="${R2_ENDPOINT:-https://80f15715bb0b882c9e967c13e677ed7d.r2.cloudflarestorage.com}"
-    R2_ACCESS_KEY="${R2_ACCESS_KEY:-df1147f966d3f8f16af3911da34478e8}"
-    R2_SECRET_KEY="${R2_SECRET_KEY:-2db5c9bd18a845df832938f95217e64d91d50437bdf9f39f5423f752eb3c697b}"
-    R2_BUCKET="${R2_BUCKET:-80f15715bb0b882c9e967c13e677ed7d}"
-    R2_REGION="${R2_REGION:-auto}"
+    R2_ENDPOINT="${R2_ENDPOINT:?R2_ENDPOINT environment variable is required}"
+    R2_ACCESS_KEY="${R2_ACCESS_KEY:?R2_ACCESS_KEY environment variable is required}"
+    R2_SECRET_KEY="${R2_SECRET_KEY:?R2_SECRET_KEY environment variable is required}"
+    R2_BUCKET="${R2_BUCKET:?R2_BUCKET environment variable is required}"
+    R2_REGION="${R2_REGION:-auto}"

If these credentials are already committed, they should be rotated immediately.
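The `${VAR:?message}` expansion the fix relies on can be demonstrated standalone; `check_required` is an illustrative helper, not part of the script:

```shell
#!/usr/bin/env bash
# Minimal sketch of the ${VAR:?message} expansion: an unset required variable
# aborts the command with the message instead of silently falling back to a
# default committed in source.
check_required() {
    # Runs in a subshell so the failure is observable without killing the caller.
    ( : "${R2_ACCESS_KEY:?R2_ACCESS_KEY environment variable is required}" )
}

unset R2_ACCESS_KEY
check_required 2>/dev/null || echo "missing credential rejected"

R2_ACCESS_KEY="example-key"
check_required && echo "credential accepted"
```

At the top level of a script the same expansion exits the script entirely, which is exactly the desired fail-closed behavior for secrets.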

🤖 Prompt for AI Agents
In scripts/local-training-e2e.sh around lines 262 to 270, the R2 access and
secret keys are hardcoded as defaults; remove these embedded secrets and require
callers to supply credentials via environment variables (or a secure secret
manager). Replace the default assignments so R2_ACCESS_KEY and R2_SECRET_KEY are
read directly from the environment without fallback values, validate at runtime
and exit with a clear error if they are unset, update README/.env.example to
document required vars, and ensure any committed secrets are rotated and removed
from the repo/history (and add a note to rotate if present).

Comment on lines +688 to +691
sleep 3

# Cleanup on exit
trap "kill $PF_PID 2>/dev/null" EXIT

⚠️ Potential issue | 🟡 Minor

Trap expansion happens at definition time, not signal time.

Using double quotes causes $PF_PID to expand immediately when the trap is set. If the variable changes later (or is unset), the trap will use the stale value.

🔎 Use single quotes for delayed expansion:
-    trap "kill $PF_PID 2>/dev/null" EXIT
+    trap 'kill $PF_PID 2>/dev/null' EXIT

Based on static analysis hint SC2064.
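The two quoting styles can be compared side by side in a throwaway script (the USR1/USR2 traps here are only for demonstration):

```shell
#!/usr/bin/env bash
# Sketch of the SC2064 difference: double quotes expand $PF_PID when the trap
# is *defined*, single quotes defer expansion until the trap *fires*.
PF_PID=111
trap "echo early:$PF_PID" USR1   # string frozen as "echo early:111"
trap 'echo late:$PF_PID' USR2    # $PF_PID read at signal time
PF_PID=222
kill -USR1 $$   # prints early:111 (stale value captured at definition)
kill -USR2 $$   # prints late:222 (current value)
```

For the cleanup trap in the script, the single-quoted form guarantees the PID killed is whatever PF_PID holds when the script actually exits.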

🧰 Tools
🪛 Shellcheck (0.11.0)

[warning] 691-691: Use single quotes, otherwise this expands now rather than when signalled.

(SC2064)

🤖 Prompt for AI Agents
In scripts/local-training-e2e.sh around lines 688 to 691, the trap is defined
with double quotes so $PF_PID is expanded immediately; change the trap to use
single quotes (e.g., trap 'kill $PF_PID 2>/dev/null' EXIT) so the variable is
expanded at signal time and the current PF_PID value is used when the script
exits.

Comment on lines +128 to +135
```
Training Pod                    Storage Daemon (DaemonSet)           R2 Bucket
┌─────────────┐                ┌──────────────────────────┐         ┌─────────┐
│ /checkpoints│ ──hostPath──▶  │ FUSE Mount               │         │         │
│ (volume)    │                │ /var/lib/basilica/fuse/  │──sync──▶│ basilica│
└─────────────┘                │ u-testuser/              │         │         │
                               └──────────────────────────┘         └─────────┘
```

⚠️ Potential issue | 🟡 Minor

Add language specification to fenced code block.

Line 128 begins an ASCII diagram code block without a language identifier, which fails markdown linting (MD040).

🔎 Apply this diff to specify the language:
 #### How It Works

-```
+```text
 Training Pod                    Storage Daemon (DaemonSet)           R2 Bucket
 ┌─────────────┐                ┌──────────────────────────┐         ┌─────────┐
 │ /checkpoints│ ──hostPath──▶  │ FUSE Mount               │         │         │
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)

128-128: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
In services/training-service/README.md around lines 128 to 135, the fenced ASCII
diagram block lacks a language identifier causing markdown lint MD040; add a
language specifier (e.g., ```text) on the opening fence so the block becomes a
labeled code block, leaving the diagram content unchanged and keeping the
closing ``` as-is.

Comment on lines +200 to +205
```
{bucket}/{namespace}/{namespace}/{checkpoint_path}

Example:
basilica/u-testuser/u-testuser/train-session-1/test-checkpoint-final/adapter_model.safetensors
```

⚠️ Potential issue | 🟡 Minor

Add language specification to fenced code block.

Line 200 begins a code block showing the R2 object path structure without a language identifier, which fails markdown linting (MD040).

🔎 Apply this diff to specify the language:
 #### R2 Object Path

 Checkpoints are stored in R2 with the following path structure:

-```
+```text
 {bucket}/{namespace}/{namespace}/{checkpoint_path}

 Example:
 basilica/u-testuser/u-testuser/train-session-1/test-checkpoint-final/adapter_model.safetensors
-```
+```
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)

200-200: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
In services/training-service/README.md around lines 200 to 205, the fenced code
block showing the R2 object path lacks a language identifier (causing MD040);
update the opening fence to include the language specifier (e.g., change ``` to
```text) so the block becomes a language-marked code fence while leaving the
block contents unchanged.

Comment on lines +253 to +272
```
[STEP] Running training steps via port-forward...
[INFO] Training step 1/3
[INFO]   Loss: 6.26, Tokens: 10
[INFO]   Completed step: 1
[INFO] Training step 2/3
[INFO]   Loss: 5.74, Tokens: 10
[INFO]   Completed step: 2
[INFO] Training step 3/3
[INFO]   Loss: 5.75, Tokens: 10
[INFO]   Completed step: 3

[INFO] Training session status:
{
  "session_id": "train-session-1",
  "step_count": 3,
  "tokens_processed": 30,
  "lora_rank": 8
}
```

⚠️ Potential issue | 🟡 Minor

Add language specification to fenced code block.

Line 253 begins an example output code block without a language identifier, which fails markdown linting (MD040).

🔎 Apply this diff to specify the language:
 ### Example Output

-```
+```text
 [STEP] Running training steps via port-forward...
 [INFO] Training step 1/3
 [INFO]   Loss: 6.26, Tokens: 10
 [INFO]   Completed step: 1
 [INFO] Training step 2/3
 [INFO]   Loss: 5.74, Tokens: 10
 [INFO]   Completed step: 2
 [INFO] Training step 3/3
 [INFO]   Loss: 5.75, Tokens: 10
 [INFO]   Completed step: 3

 [INFO] Training session status:
 {
   "session_id": "train-session-1",
   "step_count": 3,
   "tokens_processed": 30,
   "lora_rank": 8
 }
-```
+```
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)

253-253: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
In services/training-service/README.md around lines 253 to 272, the example
output code block is missing a language identifier which triggers markdown lint
rule MD040; update the opening fenced code block from ``` to ```text so the
block is explicitly marked as plain text (leave the closing ``` unchanged) to
satisfy the linter and preserve the example output formatting.

Comment on lines +404 to +436
```
┌─────────────────────────────────────────────────────────────┐
│                    Basilica API                             │
│              (creates TrainingSession CRD)                  │
└─────────────────────┬───────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────────┐
│                 Basilica Operator                           │
│  • Reconciles TrainingSession CRD                           │
│  • Creates Pod + Service + NetworkPolicy                    │
│  • Polls training service for status updates                │
└─────────────────────┬───────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────────┐
│               Training Service Pod                          │
├─────────────────────────────────────────────────────────────┤
│              FastAPI Server (src/server.py)                 │
├─────────────────────────────────────────────────────────────┤
│              Training Backend (src/backend.py)              │
│  • Session management                                       │
│  • Forward-backward with gradient accumulation              │
│  • Optimizer step                                           │
│  • Text generation                                          │
│  • Checkpoint save/load                                     │
├─────────────────────────────────────────────────────────────┤
│           HuggingFace Transformers + PEFT                   │
│  • Model loading (OPT, LLaMA, etc.)                         │
│  • LoRA adapter (rank, alpha, dropout)                      │
│  • AdamW optimizer                                          │
└─────────────────────────────────────────────────────────────┘
```

⚠️ Potential issue | 🟡 Minor

Add language specification to fenced code block.

Line 404 begins an ASCII architecture diagram code block without a language identifier, which fails markdown linting (MD040).

🔎 Apply this diff to specify the language:
 ## Architecture

-```
+```text
 ┌─────────────────────────────────────────────────────────────┐
 │                    Basilica API                             │
 │              (creates TrainingSession CRD)                  │
 └─────────────────────┬───────────────────────────────────────┘
                       │
                       ▼
 ┌─────────────────────────────────────────────────────────────┐
 │                 Basilica Operator                           │
 │  • Reconciles TrainingSession CRD                           │
 │  • Creates Pod + Service + NetworkPolicy                    │
 │  • Polls training service for status updates                │
 └─────────────────────┬───────────────────────────────────────┘
                       │
                       ▼
 ┌─────────────────────────────────────────────────────────────┐
 │               Training Service Pod                          │
 ├─────────────────────────────────────────────────────────────┤
 │              FastAPI Server (src/server.py)                 │
 ├─────────────────────────────────────────────────────────────┤
 │              Training Backend (src/backend.py)              │
 │  • Session management                                       │
 │  • Forward-backward with gradient accumulation              │
 │  • Optimizer step                                           │
 │  • Text generation                                          │
 │  • Checkpoint save/load                                     │
 ├─────────────────────────────────────────────────────────────┤
 │           HuggingFace Transformers + PEFT                   │
 │  • Model loading (OPT, LLaMA, etc.)                         │
 │  • LoRA adapter (rank, alpha, dropout)                      │
 │  • AdamW optimizer                                          │
-```
+```
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)

404-404: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
In services/training-service/README.md around lines 404 to 436, the ASCII
architecture diagram code block is missing a language identifier which triggers
markdown lint rule MD040; update the opening fence from ``` to ```text (or
another suitable language like ```plain) and keep the closing fence unchanged so
the block is recognized as a code block with a language specifier.


@coderabbitai coderabbitai bot left a comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
examples/training_example.py (1)

39-40: Resource leak: file handle not closed.

This issue was previously flagged. The file is opened without a context manager, leaving the handle unclosed. The same pattern appears at lines 182-183, 213-214, and 247-248.

🧹 Nitpick comments (3)
crates/basilica-sdk-python/python/basilica/training/types.py (2)

176-194: Use get_running_loop() and fix caching inconsistency.

Three issues:

  1. _result_type is stored but never used (dead code)
  2. asyncio.get_event_loop() is deprecated in Python 3.10+ when called from a coroutine; use asyncio.get_running_loop() instead
  3. result() caches to _result, but result_async() bypasses this cache, causing inconsistent behavior if both are called
🔎 Proposed fix
-    def __init__(self, future: Future, result_type: type = None):
+    def __init__(self, future: Future, result_type: Optional[type] = None):
         self._future = future
         self._result = None
-        self._result_type = result_type
+        self._result_type = result_type  # Reserved for future deserialization

     def result(self, timeout: Optional[float] = None):
         """Block until operation completes (sync)."""
         if self._result is None:
             self._result = self._future.result(timeout=timeout)
         return self._result

     async def result_async(self, timeout: Optional[float] = None):
         """Wait for operation to complete (async)."""
-        loop = asyncio.get_event_loop()
-        if timeout is not None:
-            return await asyncio.wait_for(
-                loop.run_in_executor(None, self._future.result), timeout=timeout
-            )
-        return await loop.run_in_executor(None, self._future.result)
+        if self._result is not None:
+            return self._result
+        loop = asyncio.get_running_loop()
+        if timeout is not None:
+            self._result = await asyncio.wait_for(
+                loop.run_in_executor(None, self._future.result), timeout=timeout
+            )
+        else:
+            self._result = await loop.run_in_executor(None, self._future.result)
+        return self._result
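The pattern the diff converges on can be demonstrated in isolation. A toy sketch, assuming Python 3.10+ (class and variable names are illustrative, not the SDK's actual API):

```python
import asyncio
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Optional


class APIFutureSketch:
    """Toy wrapper sharing one cached result between sync and async paths."""

    def __init__(self, future: Future):
        self._future = future
        self._result = None

    def result(self, timeout: Optional[float] = None):
        if self._result is None:
            self._result = self._future.result(timeout=timeout)
        return self._result

    async def result_async(self):
        if self._result is not None:
            return self._result
        # get_running_loop() is the non-deprecated call inside a coroutine
        loop = asyncio.get_running_loop()
        self._result = await loop.run_in_executor(None, self._future.result)
        return self._result


async def main():
    with ThreadPoolExecutor() as pool:
        wrapped = APIFutureSketch(pool.submit(lambda: 40 + 2))
        first = await wrapped.result_async()  # resolves via the executor
        second = wrapped.result()             # served from the shared cache
        return first, second


first, second = asyncio.run(main())
print(first, second)
```

Because both paths write to the same `_result`, calling `result()` after `result_async()` (or vice versa) returns the cached value instead of hitting the future twice.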

207-216: Consider sorting __all__ alphabetically.

Per static analysis (RUF022), the export list is unsorted. Sorting improves readability and reduces merge conflicts.

examples/training_example.py (1)

126-144: Retry logic is reasonable but consider documenting the magic numbers.

The 5 retries with 1-second delays are hardcoded. For example code this is acceptable, but a brief inline comment explaining why these values were chosen would help readers understand the expected service behavior.
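One way to make those values self-documenting is to name them and factor the polling into a helper. A hedged sketch (constants mirror the example's hardcoded 5 × 1 s; all names are illustrative):

```python
import time

MAX_RETRIES = 5      # mirrors the example's retry count
RETRY_DELAY_S = 1.0  # mirrors the example's 1-second pause


def wait_until_ready(check, retries=MAX_RETRIES, delay=RETRY_DELAY_S):
    """Poll check() until it returns True; raise once retries are exhausted."""
    for attempt in range(1, retries + 1):
        if check():
            return attempt
        if attempt < retries:
            time.sleep(delay)
    raise TimeoutError(f"service not ready after {retries} attempts")


# Simulated service that becomes ready on the third probe.
calls = {"n": 0}


def fake_check():
    calls["n"] += 1
    return calls["n"] >= 3


attempts = wait_until_ready(fake_check, delay=0.01)
print(attempts)
```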

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 73cdae1 and f03f1d8.

📒 Files selected for processing (2)
  • crates/basilica-sdk-python/python/basilica/training/types.py (1 hunks)
  • examples/training_example.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.14.8)
examples/training_example.py

55-55: Do not catch blind exception: Exception

(BLE001)


151-151: f-string without any placeholders

Remove extraneous f prefix

(F541)

crates/basilica-sdk-python/python/basilica/training/types.py

134-134: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


135-135: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


136-136: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


137-137: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


138-138: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


140-140: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


141-141: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


142-142: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


143-143: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


144-144: Unused method argument: kwargs

(ARG002)


176-176: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


207-216: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

🔇 Additional comments (7)
crates/basilica-sdk-python/python/basilica/training/types.py (4)

16-45: LGTM!

The Datum dataclass is well-structured with appropriate optional fields and a clean to_dict() serialization method.


47-67: LGTM!

The factory methods provide a clean API for creating ModelInput from different sources.


69-94: LGTM!

Sensible defaults and proper handling of stop_sequences in serialization.


99-124: LGTM!

Response dataclasses are well-defined with appropriate fields and defaults.

examples/training_example.py (3)

60-86: LGTM!

Good use of the context manager pattern for the training client, and the training loop demonstrates the APIFuture .result() pattern clearly.


207-241: Good use of try/finally for cleanup in async context.

The try/finally pattern ensures training.close() is called even if an exception occurs during async training.


295-317: LGTM!

Clean CLI argument handling with argparse and clear dispatch to the appropriate example function.

Comment on lines +132 to +151
def __init__(
self,
models: List[str] = None,
max_batch_tokens: int = None,
max_sequence_length: int = None,
gpu_types: List[str] = None,
lora_ranks: List[int] = None,
# camelCase variants from API
maxBatchTokens: int = None,
maxSequenceLength: int = None,
gpuTypes: List[str] = None,
loraRanks: List[int] = None,
**kwargs, # Ignore unknown fields
):
self.models = models or []
self.max_batch_tokens = max_batch_tokens or maxBatchTokens or 0
self.max_sequence_length = max_sequence_length or maxSequenceLength or 0
self.gpu_types = gpu_types or gpuTypes or []
self.lora_ranks = lora_ranks or loraRanks or []


⚠️ Potential issue | 🟡 Minor

Falsy value bug: explicit 0 values are ignored.

The or chaining treats 0 as falsy, so max_batch_tokens=0 falls through to check maxBatchTokens. While the final fallback to 0 masks this for zeroes, it's semantically incorrect and will fail if someone passes max_batch_tokens=0, maxBatchTokens=100 expecting 0.

Additionally, per static analysis hints, the implicit Optional types (using = None without Optional[T]) violate PEP 484.

🔎 Proposed fix using explicit None checks
     def __init__(
         self,
-        models: List[str] = None,
-        max_batch_tokens: int = None,
-        max_sequence_length: int = None,
-        gpu_types: List[str] = None,
-        lora_ranks: List[int] = None,
+        models: Optional[List[str]] = None,
+        max_batch_tokens: Optional[int] = None,
+        max_sequence_length: Optional[int] = None,
+        gpu_types: Optional[List[str]] = None,
+        lora_ranks: Optional[List[int]] = None,
         # camelCase variants from API
-        maxBatchTokens: int = None,
-        maxSequenceLength: int = None,
-        gpuTypes: List[str] = None,
-        loraRanks: List[int] = None,
+        maxBatchTokens: Optional[int] = None,
+        maxSequenceLength: Optional[int] = None,
+        gpuTypes: Optional[List[str]] = None,
+        loraRanks: Optional[List[int]] = None,
         **kwargs,  # Ignore unknown fields
     ):
         self.models = models or []
-        self.max_batch_tokens = max_batch_tokens or maxBatchTokens or 0
-        self.max_sequence_length = max_sequence_length or maxSequenceLength or 0
-        self.gpu_types = gpu_types or gpuTypes or []
-        self.lora_ranks = lora_ranks or loraRanks or []
+        self.max_batch_tokens = max_batch_tokens if max_batch_tokens is not None else (maxBatchTokens if maxBatchTokens is not None else 0)
+        self.max_sequence_length = max_sequence_length if max_sequence_length is not None else (maxSequenceLength if maxSequenceLength is not None else 0)
+        self.gpu_types = gpu_types if gpu_types is not None else (gpuTypes if gpuTypes is not None else [])
+        self.lora_ranks = lora_ranks if lora_ranks is not None else (loraRanks if loraRanks is not None else [])

Committable suggestion skipped: line range outside the PR's diff.
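The falsy-value pitfall is easy to reproduce outside the SDK; a minimal demonstration (function and parameter names are illustrative):

```python
from typing import Optional


def pick_or(snake=None, camel=None):
    # "or" chaining: any falsy value (0, [], "") is treated as missing
    return snake or camel or 0


def pick_none(snake: Optional[int] = None, camel: Optional[int] = None) -> int:
    # explicit None checks preserve legitimate falsy values
    if snake is not None:
        return snake
    if camel is not None:
        return camel
    return 0


buggy = pick_or(snake=0, camel=100)    # the explicit 0 is silently lost
fixed = pick_none(snake=0, camel=100)  # the explicit 0 wins
print(buggy, fixed)
```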

🤖 Prompt for AI Agents
In crates/basilica-sdk-python/python/basilica/training/types.py around lines 132
to 151, the constructor uses "or" chaining which treats valid falsy values like
0 as missing; change the assignments to explicit None checks (e.g., set
self.max_batch_tokens = max_batch_tokens if max_batch_tokens is not None else
(maxBatchTokens if maxBatchTokens is not None else 0)) and similarly for
max_sequence_length, gpu_types, and lora_ranks so passed zeros or empty lists
are preserved; also update the parameter type annotations from "T = None" to
"Optional[T] = None" (import Optional from typing) for all optional params to
satisfy PEP 484.

if abs(loss_after_load - loss_before_changes) < 0.01:
print("✓ R2 checkpoint round-trip PASSED - weights restored correctly!")
else:
print(f"✗ R2 checkpoint round-trip FAILED - loss mismatch")

⚠️ Potential issue | 🟡 Minor

Remove extraneous f prefix from string.

This f-string has no placeholders, making the f prefix unnecessary.

🔎 Proposed fix
-            print(f"✗ R2 checkpoint round-trip FAILED - loss mismatch")
+            print("✗ R2 checkpoint round-trip FAILED - loss mismatch")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
print(f"✗ R2 checkpoint round-trip FAILED - loss mismatch")
print("✗ R2 checkpoint round-trip FAILED - loss mismatch")
🤖 Prompt for AI Agents
In examples/training_example.py around line 151, the print statement uses an
unnecessary f-string prefix for a literal without placeholders; remove the
leading "f" so the string is a normal literal (print("✗ R2 checkpoint round-trip
FAILED - loss mismatch")) to avoid misleading use of f-strings.

Made-with: Cursor

# Conflicts:
#	.gitignore
#	Cargo.lock
#	crates/basilica-api/Cargo.toml
#	crates/basilica-api/src/api/middleware/scope.rs
#	crates/basilica-api/src/api/mod.rs
#	crates/basilica-api/src/api/routes/mod.rs
#	crates/basilica-api/src/bin/gen_api_key.rs
#	crates/basilica-api/src/config/mod.rs
#	crates/basilica-api/src/k8s/trait.rs
#	crates/basilica-api/src/server.rs
#	crates/basilica-operator/src/controllers/mod.rs
#	crates/basilica-operator/src/crd/mod.rs
#	crates/basilica-operator/src/k8s_client.rs
#	crates/basilica-operator/src/runtime.rs
#	justfile
#	scripts/api/compose.dev.yml