Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the review settings, or use the review commands to manage reviews.
No actionable comments were generated in the recent review. 🎉
ℹ️ Recent review info
⚙️ Run configuration: Organization UI | Review profile: CHILL | Plan: Pro | Run ID:
📒 Files selected for processing (1)
✅ Files skipped from review due to trivial changes (1)
Walkthrough
This PR introduces the Basilica Training SDK and supporting infrastructure for LoRA-based model fine-tuning. It includes Python SDK clients, a FastAPI training backend service, Kubernetes resources (CRDs, operators, gateways), Docker Compose configuration, example scripts, and comprehensive testing.
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant ServiceClient
    participant TrainingClient
    participant HTTPClient as HTTP/Remote
    participant TrainingBackend
    participant Model as HuggingFace Model
    User->>ServiceClient: create_lora_training_client(base_model, rank, ...)
    ServiceClient->>HTTPClient: POST /sessions (create)
    HTTPClient->>TrainingBackend: create_session(base_model)
    TrainingBackend->>Model: Load base model & tokenizer
    Model-->>TrainingBackend: Model loaded
    TrainingBackend-->>HTTPClient: session_id, status
    HTTPClient-->>ServiceClient: CreateSessionResponse
    ServiceClient->>ServiceClient: Wrap in TrainingClient
    ServiceClient-->>User: TrainingClient instance
    User->>TrainingClient: forward_backward(input_ids, labels)
    TrainingClient->>HTTPClient: POST /forward_backward
    HTTPClient->>TrainingBackend: forward_backward(tokens, labels)
    TrainingBackend->>Model: forward pass + loss
    Model-->>TrainingBackend: logits
    TrainingBackend->>TrainingBackend: backward pass
    TrainingBackend-->>HTTPClient: loss, logprobs, tokens_processed
    HTTPClient-->>TrainingClient: ForwardBackwardResult
    TrainingClient-->>User: APIFuture (loss, logprobs)
    User->>TrainingClient: optim_step()
    TrainingClient->>HTTPClient: POST /optim_step
    HTTPClient->>TrainingBackend: optim_step()
    TrainingBackend->>TrainingBackend: gradient clip & optimizer.step()
    TrainingBackend-->>HTTPClient: step count
    HTTPClient-->>TrainingClient: OptimStepResponse
    TrainingClient-->>User: step count
    User->>TrainingClient: save_state(path)
    TrainingClient->>HTTPClient: POST /save_state
    HTTPClient->>TrainingBackend: save_state(path)
    TrainingBackend->>Model: Save LoRA weights to S3/R2
    Model-->>TrainingBackend: Checkpoint saved
    TrainingBackend-->>HTTPClient: checkpoint path
    HTTPClient-->>TrainingClient: SaveStateResponse
    TrainingClient-->>User: checkpoint path
```

```mermaid
sequenceDiagram
    participant Admin
    participant kubectl as kubectl/K8s API
    participant Operator
    participant TrainingSessionCRD
    participant PodController
    participant TrainingPod as Training Pod
    Admin->>kubectl: kubectl apply training-session.yaml
    kubectl->>TrainingSessionCRD: Create TrainingSession resource
    TrainingSessionCRD-->>kubectl: resource created
    Operator->>TrainingSessionCRD: Watch for new TrainingSession
    TrainingSessionCRD-->>Operator: TrainingSession event (Added)
    Operator->>Operator: Validate spec (baseModel, gpuResources)
    Operator->>Operator: Build training pod spec
    Operator->>PodController: Create Pod with training-service:local image
    PodController-->>Operator: Pod created
    Operator->>TrainingSessionCRD: Update status (phase: loading_model)
    TrainingSessionCRD-->>Operator: Status updated
    PodController->>TrainingPod: Launch container
    TrainingPod->>TrainingPod: Load model, initialize LoRA
    TrainingPod-->>PodController: Ready (healthcheck passes)
    Operator->>TrainingSessionCRD: Update status (phase: ready)
    Operator->>TrainingSessionCRD: Set endpoint URL
    TrainingSessionCRD-->>Operator: Status updated
    Admin->>kubectl: Query TrainingSession status
    kubectl->>TrainingSessionCRD: Get resource
    TrainingSessionCRD-->>kubectl: Return phase=ready, endpoint
    kubectl-->>Admin: Training session ready
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed (2 warnings)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 15
🧹 Nitpick comments (32)
examples/11_agentgym.py (1)
265-265: Remove unnecessary f-string prefix. This line doesn't contain any placeholders, so the `f` prefix is unnecessary. 🔎 Apply this diff:

```diff
- print(f"TTL: 1 hour (auto-cleanup)")
+ print("TTL: 1 hour (auto-cleanup)")
```

justfile (2)
962-970: Consider using `pushd`/`popd` for directory navigation. The script uses `cd` to navigate into the SDK directory and `cd ../..` to return. While this works correctly with `set -euo pipefail`, using `pushd`/`popd` would make directory changes more robust and self-documenting. 🔎 View suggested refactor:

```diff
 # Install Python SDK in editable mode
 echo "Installing Python SDK..."
 if command -v uv &> /dev/null; then
     uv pip install -e crates/basilica-sdk-python
 else
     # Use maturin directly for standard venv
-    cd crates/basilica-sdk-python
+    pushd crates/basilica-sdk-python > /dev/null
     maturin develop --release
-    cd ../..
+    popd > /dev/null
 fi
```
972-976: Apply the same `pushd`/`popd` pattern for stub generation. The stub generation step uses the same `cd` pattern. Consider applying the same refactor here for consistency. 🔎 View suggested refactor:

```diff
 # Generate type stubs
 echo "Generating type stubs..."
-cd crates/basilica-sdk-python
+pushd crates/basilica-sdk-python > /dev/null
 cargo run --bin stub_gen --features stub-gen
-cd ../..
+popd > /dev/null
```

crates/basilica-sdk-python/python/basilica/training/__init__.py (2)
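The `pushd`/`popd` save-and-restore pattern has a direct Python analogue for build scripts; a minimal sketch, not part of the PR (Python 3.11+ ships `contextlib.chdir` with the same behavior):

```python
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def pushd(path):
    """Enter `path` and restore the previous working directory on exit,
    even if the body raises -- the guarantee pushd/popd gives a shell."""
    prev = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(prev)

before = os.getcwd()
with pushd(tempfile.gettempdir()):
    pass  # run build steps here, e.g. invoking maturin
assert os.getcwd() == before  # directory restored
```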
177-181: Remove or utilize the unused `loss_fn` parameter. The `loss_fn` parameter is documented but never used in the method implementation or sent to the API endpoint. This could confuse users who expect different loss functions to work. Consider either removing this parameter until the feature is implemented, or documenting it as reserved for future use.
🔎 Option 1: Remove the unused parameter:

```diff
 def forward_backward(
     self,
     data: List[Datum],
-    loss_fn: str = "cross_entropy",
 ) -> ForwardBackwardResult:
     """
     Compute forward pass and gradients.

     Args:
         data: List of training examples
-        loss_fn: Loss function (currently only "cross_entropy")

     Returns:
         ForwardBackwardResult with loss and logprobs
     """
```
589-606: Consider sorting `__all__`, though current grouping is also maintainable. Static analysis suggests sorting the `__all__` list alphabetically. However, the current organization by category (client, config types, data types, exceptions) is logical and aids maintainability. Either approach is acceptable. 🔎 If you prefer alphabetical sorting:

```diff
 # Export all public symbols
 __all__ = [
-    # Client
-    "TrainingClient",
-    "TrainingSession",
-    # Config types
-    "LoraConfig",
-    "OptimizerConfig",
     "CheckpointStorage",
-    "GpuResources",
-    # Data types
     "Datum",
     "ForwardBackwardResult",
-    "Sample",
-    # Exceptions
+    "GpuResources",
+    "LoraConfig",
+    "OptimizerConfig",
+    "Sample",
+    "SessionNotFoundError",
+    "TrainingClient",
     "TrainingError",
-    "SessionNotFoundError",
+    "TrainingSession",
     "TrainingServiceError",
 ]
```

examples/training_example.py (1)
112-135: Consider renaming the unused loop variable. The loop variable `step` is not used within the loop body. Python convention is to use `_step` or `_` for intentionally unused variables. 🔎 Apply this diff:

```diff
-    for step in range(num_steps):
+    for _step in range(num_steps):
         # Forward-backward pass (accumulate gradients)
         total_loss = 0.0
```

scripts/local-training-e2e.sh (1)
427-427: Consider using single quotes in trap command (optional nitpick). While the current code works correctly (since `$PF_PID` is expanded at trap setup time), shell best practice is to use single quotes for trap commands to make the intention explicit. 🔎 Apply this diff:

```diff
-    trap "kill $PF_PID 2>/dev/null" EXIT
+    trap 'kill '"$PF_PID"' 2>/dev/null' EXIT
```

crates/basilica-api/src/api/middleware/scope.rs (1)
180-184: Add test coverage for the new training session scope mappings. The scope mappings are correctly implemented and follow the existing patterns. However, the test suite in this file (lines 196-406) does not include test cases for the new training session endpoints.
🔎 Add test cases for training session routes
Add the following test cases to the `test_required_scope_mapping` function:

```rust
// Test training session endpoints
let req = Request::builder()
    .method(Method::POST)
    .uri("/sessions")
    .body(Body::empty())
    .unwrap();
assert_eq!(get_required_scope(&req), Some("training:create".to_string()));

let req = Request::builder()
    .method(Method::GET)
    .uri("/sessions")
    .body(Body::empty())
    .unwrap();
assert_eq!(get_required_scope(&req), Some("training:list".to_string()));

let req = Request::builder()
    .method(Method::GET)
    .uri("/sessions/ts-123456")
    .body(Body::empty())
    .unwrap();
assert_eq!(get_required_scope(&req), Some("training:view".to_string()));

let req = Request::builder()
    .method(Method::DELETE)
    .uri("/sessions/ts-123456")
    .body(Body::empty())
    .unwrap();
assert_eq!(get_required_scope(&req), Some("training:delete".to_string()));
```

services/training-service/Dockerfile (2)
15-21: Consider production-optimized installation and exclude test files. The current setup uses editable-mode installation (`-e "."`) and includes test files in the production image:
- Line 17: Editable install is intended for development, not production deployment
- Line 21: Including tests/ directory adds unnecessary bloat to the production image
🔎 Apply this diff for production optimization:
```diff
 # Install Python dependencies
 COPY pyproject.toml .
 RUN pip install --no-cache-dir --upgrade pip && \
-    pip install --no-cache-dir -e "."
+    pip install --no-cache-dir .

 # Copy source code
 COPY src/ src/
-COPY tests/ tests/
```

If you need tests in the image for integration testing, consider using a multi-stage build with separate dev and prod stages.
6-9: Consider removing git if not required at runtime. Git is installed as a system dependency but may only be needed during development or for certain ML libraries that clone repos. If the training service doesn't require git at runtime, removing it reduces the attack surface.
If git is only needed for development or specific edge cases, consider documenting why it's required or removing it to minimize the image size and security footprint.
docker-compose.training.yml (1)
33-37: Consider making GPU count configurable. The GPU count is hardcoded to `1`. For flexibility across different development environments, consider using an environment variable. 🔎 Example approach:

```yaml
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: ${GPU_COUNT:-1}  # Default to 1 if not set
          capabilities: [gpu]
```

Then developers can override with:

```shell
GPU_COUNT=2 docker-compose -f docker-compose.training.yml up
```

services/training-service/README.md (2)
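The `${GPU_COUNT:-1}` default expansion maps onto the usual read-with-default pattern when the same setting is consumed from code; a hypothetical Python helper, not part of the PR:

```python
import os

def gpu_count(env=None):
    """Read GPU_COUNT with a default of 1, mirroring ${GPU_COUNT:-1}."""
    env = os.environ if env is None else env
    raw = env.get("GPU_COUNT", "1")
    try:
        return max(1, int(raw))
    except ValueError:
        return 1  # malformed value: fall back to the safe default

print(gpu_count({"GPU_COUNT": "2"}))  # 2
print(gpu_count({}))                  # 1
```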
128-128: Unused import in SDK example. The `Datum` import is shown but never used in the example code. Either demonstrate its usage or remove it to avoid confusion for users.

```diff
-from basilica.training import TrainingClient, Datum
+from basilica.training import TrainingClient
```
169-187: Add language specifier to architecture diagram code block. The fenced code block for the architecture diagram is missing a language specifier. Use `text` or `plaintext` to satisfy linters and improve rendering consistency.

````diff
-```
+```text
 ┌─────────────────────────────────────────┐
````

orchestrator/k8s/training/local-dev/operator-deployment.yaml (2)
12-32: Consider tightening RBAC for production variants. Using wildcard verbs (`"*"`) grants full permissions on these resources. While acceptable for local development, ensure production manifests follow least-privilege principles by specifying explicit verbs (`get`, `list`, `watch`, `create`, `update`, `patch`, `delete`) as needed.
63-79: Add securityContext to harden the container. Even for local development, adding a securityContext improves security posture and catches permission issues early.
🔎 Suggested addition:

```diff
 containers:
   - name: operator
     image: basilica-operator:local
     imagePullPolicy: Never
+    securityContext:
+      allowPrivilegeEscalation: false
+      runAsNonRoot: true
+      readOnlyRootFilesystem: true
+      capabilities:
+        drop:
+          - ALL
     env:
```

services/training-service/src/backend.py (4)
93-100: Consider handling explicit CPU device with bfloat16. When `device="cpu"` is passed explicitly (not as a fallback), `dtype` remains `bfloat16`, which has limited CPU support. The current logic only converts dtype when falling back from CUDA. 🔎 Suggested fix:

```diff
-        if device == "cuda" and not torch.cuda.is_available():
-            # On Mac, use CPU to avoid MPS bfloat16 issues
-            self.device = "cpu"
-            self.dtype = torch.float32  # bfloat16 not well supported on CPU
-        else:
-            self.device = device
-            self.dtype = dtype
+        if device == "cuda" and not torch.cuda.is_available():
+            # On Mac, use CPU to avoid MPS bfloat16 issues
+            self.device = "cpu"
+            self.dtype = torch.float32  # bfloat16 not well supported on CPU
+        elif device == "cpu" and dtype == torch.bfloat16:
+            # bfloat16 has limited CPU support
+            self.device = "cpu"
+            self.dtype = torch.float32
+        else:
+            self.device = device
+            self.dtype = dtype
```
117-155: Security consideration: `trust_remote_code=True` executes arbitrary code. While often necessary for newer models, this setting allows execution of untrusted code from model repositories. Consider making this configurable or documenting the security implications.
🔎 Suggested approach:

```diff
 def __init__(
     self,
     model_cache_dir: str = "/models",
     checkpoint_dir: str = "/checkpoints",
     device: str = "cuda",
     dtype: torch.dtype = torch.bfloat16,
+    trust_remote_code: bool = True,
 ):
+    self.trust_remote_code = trust_remote_code
```

Then use `self.trust_remote_code` in `_load_base_model`.
176-178: Incomplete seeding for full reproducibility. `torch.manual_seed` only sets the CPU seed. For reproducible GPU training, CUDA seeds should also be set. 🔎 Apply this diff:

```diff
 if seed is not None:
     torch.manual_seed(seed)
+    if torch.cuda.is_available():
+        torch.cuda.manual_seed_all(seed)
```
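The underlying rule is that every RNG in play must be seeded, not just one; a torch-free sketch of the reproducibility property using only the standard library (illustrative only, not the service's code):

```python
import random

def make_batch(seed):
    """Draw a small 'batch' from a freshly seeded generator,
    analogous to seeding one device's RNG state."""
    rng = random.Random(seed)
    return [rng.randint(0, 999) for _ in range(4)]

# Seeding the same way reproduces the run exactly -- the property
# torch.cuda.manual_seed_all extends from the CPU RNG to every GPU's.
assert make_batch(42) == make_batch(42)
```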
476-480: Consider explicit resource cleanup in `delete_session`. Deleting from the sessions dict doesn't immediately free GPU memory. For long-running services, consider explicit cleanup.
🔎 Suggested enhancement:

```diff
 def delete_session(self, session_id: str) -> None:
     """Delete a training session."""
     if session_id in self.sessions:
+        # Explicitly delete model to help free GPU memory
+        session = self.sessions[session_id]
+        del session.model
+        del session.optimizer
         del self.sessions[session_id]
+        # Suggest garbage collection for GPU memory
+        if torch.cuda.is_available():
+            torch.cuda.empty_cache()
         logger.info("session_deleted", session_id=session_id)
```

orchestrator/k8s/training/local-dev/gateway.yaml (1)
25-39: ReferenceGrant is namespace-specific to `u-testuser`. The ReferenceGrant only allows HTTPRoutes from the `u-testuser` namespace. For a more flexible local-dev setup, consider using a wildcard or label selector if Gateway API supports it, or document that additional ReferenceGrants are needed for other user namespaces.

crates/basilica-operator/src/controllers/training_session_controller.rs (5)
239-248: String-based error checking is fragile. Checking `e.to_string().contains("409")` is fragile. Consider using proper error type matching if the error type supports it.
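The PR code is Rust, but the point is language-independent: match on a structured status code, not on the rendered message. A Python illustration with a hypothetical `HttpError` standing in for the typed API error:

```python
# Match on a typed error field rather than on the rendered message.
class HttpError(Exception):  # hypothetical stand-in for a typed API error
    def __init__(self, code, message):
        super().__init__(message)
        self.code = code

def is_conflict_fragile(err):
    # Breaks if the wording changes, or if "409" appears elsewhere in the text
    return "409" in str(err)

def is_conflict_robust(err):
    # Survives message changes; depends only on the structured status code
    return isinstance(err, HttpError) and err.code == 409

err = HttpError(409, "resource already exists")
assert is_conflict_robust(err)
weird = HttpError(500, "failed after 409 ms")  # "409" appears spuriously
assert is_conflict_fragile(weird)              # false positive
assert not is_conflict_robust(weird)
```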
250-251: External endpoint is hardcoded. The external endpoint `https://api.basilica.ai/sessions/{}/` is hardcoded. Consider making this configurable via environment variable or controller configuration for different environments. 🔎 Suggested approach:

```rust
// At the top of the file, add:
fn get_external_base_url() -> String {
    std::env::var("EXTERNAL_API_BASE_URL")
        .unwrap_or_else(|_| "https://api.basilica.ai".to_string())
}

// Then in handle_initializing:
let external_endpoint = format!("{}/sessions/{}/", get_external_base_url(), name);
```
277-293: MVP placeholder: health check not implemented. As noted in the comments, the actual health check (`GET /health`) is not implemented. Consider adding a TODO or tracking issue.
Do you want me to open an issue to track implementing the health check endpoint verification?
331-333: Pod deletion error is silently ignored. When TTL expires, pod deletion errors are discarded with `let _ =`. Consider logging the error even if proceeding with termination. 🔎 Suggested fix:

```diff
-        let _ = self.client.delete_pod(namespace, &pod_name).await;
+        if let Err(e) = self.client.delete_pod(namespace, &pod_name).await {
+            warn!(error = %e, pod = %pod_name, "failed to delete pod on TTL expiry");
+        }
```
367-392: Resource defaults are hardcoded; consider making configurable. CPU/memory limits (16/64Gi for GPU, 2/4Gi for CPU-only) are hardcoded. For flexibility across different cluster configurations, these could be made configurable via the TrainingSessionSpec or environment variables.
services/training-service/tests/test_backend.py (2)
73-76: Token IDs may exceed model vocabulary size.
`torch.randint(0, 1000, ...)` generates token IDs up to 999, but `facebook/opt-125m` has a vocabulary size of ~50k, so this is safe. However, for robustness, consider using the tokenizer's vocab size. 🔎 More robust approach:

```python
# Get tokenizer vocab size for safer random tokens
tokenizer = backend.tokenizers.get(small_model)
if tokenizer:
    vocab_size = len(tokenizer)
    input_ids = torch.randint(0, vocab_size, (1, 32))
else:
    input_ids = torch.randint(0, 1000, (1, 32))
```
165-166: Assertion `len(result.text) >= 0` is always true. This assertion doesn't test anything meaningful since string length is always non-negative.
🔎 Consider a more meaningful assertion:

```diff
-    assert len(result.text) >= 0  # May be empty for some models
-    assert len(result.token_ids) > 0
+    # Verify generation produced output (may be empty text due to special tokens)
+    assert isinstance(result.text, str)
+    assert len(result.token_ids) > 0
+    assert result.finish_reason in ("stop", "length")
```

crates/basilica-operator/src/crd/training_session.rs (1)
211-213: Consider using a pinned image tag instead of `:latest`. The default image uses the `:latest` tag, which can lead to inconsistent behavior across deployments and make rollbacks difficult. Consider pinning to a specific version tag or using a semantic version. 🔎 Suggested improvement:

```diff
 fn default_image() -> String {
-    "basilica/training:latest".into()
+    "basilica/training:v0.1.0".into()  // Use a specific version
 }
```

crates/basilica-api/src/api/routes/training.rs (2)
274-281: UUID truncation reduces uniqueness guarantees. Taking only the first segment of a UUID (8 hex characters) significantly reduces the UUID's uniqueness. This provides only 2^32 (~4 billion) possible values instead of the full 2^128, increasing collision risk in high-throughput scenarios.
Consider using the full UUID or a larger portion:
🔎 Suggested improvement:

```diff
 let session_id = format!(
     "ts-{}",
-    uuid::Uuid::new_v4()
-        .to_string()
-        .split('-')
-        .next()
-        .unwrap()
+    uuid::Uuid::new_v4().to_string().replace('-', "")  // Full UUID without hyphens
 );
```

Or keep it short but use more characters:

```diff
 let session_id = format!(
     "ts-{}",
-    uuid::Uuid::new_v4()
-        .to_string()
-        .split('-')
-        .next()
-        .unwrap()
+    &uuid::Uuid::new_v4().to_string()[..16]  // First 16 chars (64 bits)
 );
```
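A birthday-bound estimate makes the collision risk concrete: for n identifiers drawn uniformly from a space of size N = 2^bits, p ≈ 1 − exp(−n²/2N). The session count below is hypothetical:

```python
import math

def collision_probability(n, bits):
    """Birthday-bound approximation: p ~ 1 - exp(-n^2 / 2N) with N = 2**bits."""
    space = 2.0 ** bits
    return 1.0 - math.exp(-(n * n) / (2.0 * space))

n = 100_000  # hypothetical total number of session IDs ever generated
p32 = collision_probability(n, 32)  # 8 hex chars, as in the current code
p64 = collision_probability(n, 64)  # 16 hex chars, as in the suggestion
print(f"32-bit ids: {p32:.1%} chance of at least one collision")
print(f"64-bit ids: {p64:.2e} chance of at least one collision")
```

At 100k IDs the 8-character form is already more likely than not to have collided, while the 16-character form remains effectively collision-free.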
334-334: Hardcoded API domain should be configurable. The endpoint domain "api.basilica.ai" is hardcoded, which makes it inflexible for different environments (dev, staging, production) or on-premise deployments.
Consider making this configurable via environment variable or AppState:
🔎 Suggested improvement:

```diff
+let api_domain = std::env::var("API_DOMAIN").unwrap_or_else(|_| "api.basilica.ai".to_string());
 let endpoint = format!(
-    "https://api.basilica.ai/sessions/{}/",
+    "https://{}/sessions/{}/",
+    api_domain,
     session_id
 );
```

services/training-service/src/server.py (2)
190-194: Simplify `target_modules` handling. The logic to handle `target_modules` is more complex than necessary. You can rely on the backend's LoraConfiguration defaults instead. 🔎 Simplified approach:

```diff
 lora_kwargs = {}
 if request.lora_config:
     lora_kwargs = request.lora_config.model_dump(exclude_none=True)
-    if "target_modules" not in lora_kwargs or lora_kwargs["target_modules"] is None:
-        lora_kwargs.pop("target_modules", None)
```

The `exclude_none=True` already handles None values, and the backend's LoraConfiguration will apply defaults for missing fields.
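For readers unfamiliar with pydantic's `model_dump(exclude_none=True)`, it behaves like filtering `None` values out of a plain dict; a dataclass-based sketch with hypothetical field names, not the PR's actual models:

```python
from dataclasses import dataclass, asdict
from typing import List, Optional

@dataclass
class LoraConfigExample:  # hypothetical stand-in for the request model
    rank: int = 8
    alpha: int = 16
    target_modules: Optional[List[str]] = None

def dump_exclude_none(obj):
    """Mimics pydantic's model_dump(exclude_none=True) for a dataclass."""
    return {k: v for k, v in asdict(obj).items() if v is not None}

cfg = LoraConfigExample(rank=16)
# target_modules is dropped entirely, so backend defaults apply downstream
print(dump_exclude_none(cfg))  # {'rank': 16, 'alpha': 16}
```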
289-312: Inconsistent error handling compared to `create_session`. The `sample` endpoint only catches `ValueError`, while `create_session` also catches generic exceptions and logs them. For consistency and better observability, add generic exception handling here as well.
🔎 Apply this fix:

```diff
 except ValueError as e:
     raise HTTPException(status_code=status.HTTP_404_NOT_FOUND, detail=str(e))
+except Exception as e:
+    logger.exception("sample_failed", error=str(e))
+    raise HTTPException(status_code=status.HTTP_500_INTERNAL_SERVER_ERROR, detail=str(e))
```

Apply similar fixes to the `save_state` and `load_state` endpoints.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (34)
- `crates/basilica-api/src/api/middleware/scope.rs` (1 hunks)
- `crates/basilica-api/src/api/mod.rs` (1 hunks)
- `crates/basilica-api/src/api/routes/mod.rs` (1 hunks)
- `crates/basilica-api/src/api/routes/training.rs` (1 hunks)
- `crates/basilica-api/src/config/mod.rs` (2 hunks)
- `crates/basilica-api/src/k8s/trait.rs` (1 hunks)
- `crates/basilica-api/src/server.rs` (3 hunks)
- `crates/basilica-operator/src/controllers/mod.rs` (1 hunks)
- `crates/basilica-operator/src/controllers/training_session_controller.rs` (1 hunks)
- `crates/basilica-operator/src/crd/mod.rs` (1 hunks)
- `crates/basilica-operator/src/crd/training_session.rs` (1 hunks)
- `crates/basilica-operator/src/k8s_client.rs` (6 hunks)
- `crates/basilica-operator/src/runtime.rs` (6 hunks)
- `crates/basilica-sdk-python/python/basilica/training/__init__.py` (1 hunks)
- `docker-compose.training.yml` (1 hunks)
- `examples/11_agentgym.py` (2 hunks)
- `examples/training_example.py` (1 hunks)
- `justfile` (1 hunks)
- `orchestrator/k8s/training/local-dev/gateway.yaml` (1 hunks)
- `orchestrator/k8s/training/local-dev/namespace.yaml` (1 hunks)
- `orchestrator/k8s/training/local-dev/operator-deployment.yaml` (1 hunks)
- `orchestrator/k8s/training/training-service-rbac.yaml` (1 hunks)
- `orchestrator/k8s/training/training-session-crd.yaml` (1 hunks)
- `orchestrator/k8s/training/training-session-example.yaml` (1 hunks)
- `scripts/api/compose.dev.yml` (1 hunks)
- `scripts/local-training-e2e.sh` (1 hunks)
- `services/training-service/Dockerfile` (1 hunks)
- `services/training-service/README.md` (1 hunks)
- `services/training-service/pyproject.toml` (1 hunks)
- `services/training-service/src/__init__.py` (1 hunks)
- `services/training-service/src/backend.py` (1 hunks)
- `services/training-service/src/server.py` (1 hunks)
- `services/training-service/tests/__init__.py` (1 hunks)
- `services/training-service/tests/test_backend.py` (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (11)
crates/basilica-api/src/config/mod.rs (1)
crates/basilica-api/src/server.rs (1)
`new` (559-1165)
crates/basilica-api/src/api/mod.rs (1)
crates/basilica-api/src/api/routes/training.rs (4)
`create_session` (255-343), `list_sessions` (478-582), `get_session` (346-425), `delete_session` (431-475)
crates/basilica-operator/src/k8s_client.rs (1)
crates/basilica-operator/src/crd/training_session.rs (2)
`default` (34-41), `default` (80-86)
crates/basilica-api/src/server.rs (1)
crates/basilica-validator/src/api/client.rs (1)
`new` (24-34)
examples/11_agentgym.py (1)
crates/basilica-sdk-python/python/basilica/deployment.py (1)
`url` (183-191)
examples/training_example.py (2)
crates/basilica-sdk-python/python/basilica/training/__init__.py (9)
`Datum` (84-89), `TrainingError` (111-114), `create_session` (446-534), `base_model` (168-170), `session_id` (163-165), `forward_backward` (177-233), `optim_step` (235-253), `save_state` (303-332), `sample` (255-301)
services/training-service/src/backend.py (5)
`create_session` (163-224), `forward_backward` (226-292), `optim_step` (294-317), `save_state` (383-432), `sample` (319-381)
crates/basilica-operator/src/controllers/training_session_controller.rs (1)
crates/basilica-operator/src/crd/training_session.rs (5)
`new` (316-322), `new` (374-391), `default` (34-41), `default` (80-86), `default` (145-151)
crates/basilica-api/src/api/routes/training.rs (2)
crates/basilica-operator/src/crd/training_session.rs (10)
`default_rank` (44-46), `default_alpha` (47-49), `default_dropout` (50-52), `default` (34-41), `default` (80-86), `default` (145-151), `default_learning_rate` (89-91), `default_weight_decay` (92-94), `new` (316-322), `new` (374-391)
crates/basilica-api/src/k8s/trait.rs (1)
`kube_client` (7-7)
scripts/local-training-e2e.sh (3)
crates/basilica-api/src/server.rs (1)
`run` (1188-1226)
crates/basilica-operator/src/runtime.rs (1)
`run` (162-361)
orchestrator/cluster-manager/src/clustermgr/commands/deployments.py (1)
`deployments` (103-213)
services/training-service/tests/test_backend.py (1)
services/training-service/src/backend.py (12)
`LoraConfiguration` (16-32), `OptimizerConfiguration` (36-44), `TrainingBackend` (80-511), `create_session` (163-224), `forward_backward` (226-292), `optim_step` (294-317), `sample` (319-381), `save_state` (383-432), `load_state` (434-462), `get_session_status` (464-474), `delete_session` (476-480), `list_sessions` (482-484)
services/training-service/src/server.py (1)
services/training-service/src/backend.py (14)
`ForwardBackwardResult` (62-67), `LoraConfiguration` (16-32), `OptimizerConfiguration` (36-44), `SampleResult` (71-77), `TrainingBackend` (80-511), `list_sessions` (482-484), `create_session` (163-224), `get_session_status` (464-474), `delete_session` (476-480), `forward_backward` (226-292), `optim_step` (294-317), `sample` (319-381), `save_state` (383-432), `load_state` (434-462)
🪛 Checkov (3.2.334)
orchestrator/k8s/training/local-dev/operator-deployment.yaml
[medium] 8-33: Minimize wildcard use in Roles and ClusterRoles
(CKV_K8S_49)
[medium] 47-79: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[medium] 47-79: Minimize the admission of root containers
(CKV_K8S_23)
🪛 markdownlint-cli2 (0.18.1)
services/training-service/README.md
169-169: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🪛 Ruff (0.14.8)
examples/11_agentgym.py
265-265: f-string without any placeholders
Remove extraneous f prefix
(F541)
examples/training_example.py
57-57: Do not catch blind exception: Exception
(BLE001)
112-112: Loop control variable step not used within loop body
Rename unused step to _step
(B007)
crates/basilica-sdk-python/python/basilica/training/__init__.py
180-180: Unused method argument: loss_fn
(ARG002)
224-224: Avoid specifying long messages outside the exception class
(TRY003)
226-226: Avoid specifying long messages outside the exception class
(TRY003)
247-247: Avoid specifying long messages outside the exception class
(TRY003)
249-249: Avoid specifying long messages outside the exception class
(TRY003)
291-291: Avoid specifying long messages outside the exception class
(TRY003)
293-293: Avoid specifying long messages outside the exception class
(TRY003)
327-327: Avoid specifying long messages outside the exception class
(TRY003)
329-329: Avoid specifying long messages outside the exception class
(TRY003)
355-355: Avoid specifying long messages outside the exception class
(TRY003)
357-357: Avoid specifying long messages outside the exception class
(TRY003)
371-371: Avoid specifying long messages outside the exception class
(TRY003)
373-373: Avoid specifying long messages outside the exception class
(TRY003)
384-384: Avoid specifying long messages outside the exception class
(TRY003)
436-438: Avoid specifying long messages outside the exception class
(TRY003)
527-527: Avoid specifying long messages outside the exception class
(TRY003)
552-552: Avoid specifying long messages outside the exception class
(TRY003)
554-554: Avoid specifying long messages outside the exception class
(TRY003)
573-573: Avoid specifying long messages outside the exception class
(TRY003)
589-606: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
🪛 Shellcheck (0.11.0)
scripts/local-training-e2e.sh
[warning] 372-372: i appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 427-427: Use single quotes, otherwise this expands now rather than when signalled.
(SC2064)
🔇 Additional comments (82)
examples/11_agentgym.py (5)
1-27: Excellent documentation structure! The module docstring clearly describes the evaluation flow, lists available task types for both suites, and provides usage examples. This makes the example very accessible to users.
39-76: Clean implementation with flexible API key handling! The function properly handles optional API keys by conditionally populating environment variables. The deployment configuration with resource limits (500m CPU, 1Gi memory, 180s timeout) is appropriate for evaluation workloads.
84-125: Consistent pattern with good parameterization! The function follows the same structure as `deploy_affine_evaluator`, ensuring code consistency. The `env_name` parameter elegantly controls both the deployment name and Docker image tag.
133-214: Helpful usage examples with good UX! The usage printing functions provide clear, actionable curl examples. The fallback to `"YOUR_CHUTES_API_KEY"` when the API key is not provided ensures the examples remain useful in all scenarios.
217-270: Well-structured main function with good user guidance! The main function provides clear warnings when the API key is missing, parses CLI arguments to select the appropriate evaluator, and prints comprehensive usage information. The overall flow is intuitive and user-friendly.
justfile (1)
937-947: Excellent improvements to the Python SDK workflow! The additions enhance the development experience significantly:
- Strict bash error handling with `set -euo pipefail`
- Smart venv strategy supporting both modern `uv` and traditional `python3 -m venv`
- Conditional dependency installation (maturin only for standard venv)
- Clear, helpful output messages guiding developers
These changes align well with modern Python development practices and make the SDK setup more reliable.
Also applies to: 957-959, 978-984
crates/basilica-sdk-python/python/basilica/training/__init__.py (5)
1-40: LGTM! Well-documented module setup. The module docstring provides clear usage examples and authentication guidance. The imports are appropriate for the SDK functionality.
43-109: LGTM! Well-structured data models. The dataclass definitions are clean with appropriate type hints and default values. The use of `default_factory` for mutable defaults (lists) is correct.
111-127: LGTM! Clean exception hierarchy. The exception classes provide a clear hierarchy for error handling in the SDK.
235-384: LGTM! Consistent method implementations. The remaining methods follow a consistent pattern with proper error handling. The `close` method correctly handles 404 errors for idempotency, which is a good practice for cleanup operations.
387-586: LGTM! Well-designed client with proper authentication and lifecycle management. The `TrainingClient` class is well-structured with:
- Proper API key validation and error handling using the `response.is_success` property to check for 2xx responses
- Flexible configuration via environment variables or parameters
- Context manager support for resource cleanup
- Comprehensive error handling including specific 404 detection for missing sessions
The 300-second timeout is an appropriate explicit configuration for training operations, where requests may involve long-running tasks, compared to httpx's default 5-second timeout for network inactivity. All httpx API usage is correct.
services/training-service/src/__init__.py (1)
1-3: LGTM!Clean package initialization following Python conventions. The version string is consistent with the SDK versioning pattern.
crates/basilica-api/src/k8s/trait.rs (1)
129-131: LGTM!The comments clearly document the division of responsibilities between the API and operator for TrainingSession CRD operations, which will help future maintainers understand the architecture.
crates/basilica-api/src/config/mod.rs (2)
49-56: LGTM! The `dev_mode` flag is well-designed for local development workflows. The documentation clearly indicates production requirements, and the default value (false) ensures production-safe behavior out of the box.
66-67: LGTM! The default initialization correctly sets `dev_mode` to `false` and provides an empty string for `validator_hotkey`, which aligns with the validation logic in `server.rs` that requires the hotkey in production mode.
examples/training_example.py (1)
39-59: LGTM! The client initialization includes proper environment variable validation and user-friendly error messages. While the exception handler is broad, this is acceptable for example code where the goal is to provide clear guidance to users.
scripts/local-training-e2e.sh (7)
1-38: LGTM! The script header is well-structured with clear documentation, proper error handling (`set -euo pipefail`), and a clean logging abstraction using color-coded functions.
40-65: LGTM! Comprehensive prerequisite checks with helpful error messages. The automatic k3d installation improves the developer experience.
288-409: LGTM! The E2E test orchestration is well-structured with proper error handling, health checks, and clear logging. The API token handling gracefully falls back when authentication is unavailable.
411-542: Training steps implementation is solid. The function properly manages port-forwarding with cleanup, performs training operations, and validates results at each step. Error handling is comprehensive.
544-591: LGTM! The status and logging functions provide comprehensive visibility into the cluster state and component logs, essential for debugging E2E tests.
613-662: LGTM! Excellent documentation that clearly explains the architecture, usage, and what each test step does. This will significantly improve the developer experience.
664-701: LGTM! Clean command dispatch with proper error handling for unknown commands.
crates/basilica-api/src/api/routes/mod.rs (1)
14-14: LGTM! The training module export follows the established pattern and properly exposes the new Training Session API routes.
crates/basilica-operator/src/crd/mod.rs (1)
5-5: LGTM! The `training_session` module export follows the established pattern and properly exposes the TrainingSession CRD types for use by the operator.
crates/basilica-operator/src/controllers/mod.rs (1)
6-6: LGTM! The `training_session_controller` module export follows the established pattern and properly integrates the TrainingSession controller into the operator.
orchestrator/k8s/training/local-dev/namespace.yaml (1)
1-18: LGTM! The namespace declarations are well-structured and follow Kubernetes best practices. The user namespace labeling (`basilica.ai/user: testuser`) is a good pattern for identifying and filtering user-specific resources.
services/training-service/tests/__init__.py (1)
1-1: LGTM! Standard Python package initializer with appropriate docstring.
crates/basilica-api/src/api/mod.rs (1)
105-115: LGTM! The training session routes are properly integrated into the protected routes with appropriate handler bindings. The comment clarifying that training operations are routed directly by Envoy Gateway is helpful architectural context.
orchestrator/k8s/training/training-service-rbac.yaml (1)
1-72: LGTM! RBAC configuration follows least-privilege principles. The RBAC configuration is well-structured:
- Appropriate read-only access to Secrets, ConfigMaps, and Nodes
- Full lifecycle management for TrainingSession CRDs, Pods, and Services (required for operator functionality)
- Event creation for observability
- Finalizer access for proper cleanup
The permissions align with the controller's responsibilities for managing training session workloads.
docker-compose.training.yml (1)
1-80: LGTM! Well-configured Docker Compose for local development. The configuration is well-structured with:
- Appropriate health checks with sufficient start period for model loading (120s)
- Proper service dependencies
- GPU resource allocation
- Volume persistence for models and checkpoints
- Restart policies for resilience
The debug logging and SQLite database are appropriate for local development.
crates/basilica-operator/src/runtime.rs (3)
46-50: LGTM! The `TrainingSessionCtx` struct follows the established pattern used by other controller contexts (`JobCtx`, `RentalCtx`, `UserDeploymentCtx`).
135-160: LGTM! The reconcile and error policy functions follow the same pattern as existing controllers (`reconcile_user_deployment`, `error_policy_user_deployment`), with appropriate structured logging including namespace and session name.
333-359: LGTM! The training sessions controller stream is correctly wired into the operator's main loop with consistent error handling and properly included in the `futures::join!` macro.
orchestrator/k8s/training/training-session-example.yaml (1)
1-94: LGTM! These example manifests are well-documented and demonstrate both a comprehensive configuration (8B model with all optional fields) and a minimal configuration (70B model with defaults). The inline comments explaining each field are helpful for users.
crates/basilica-api/src/server.rs (3)
564-585: LGTM! The dev mode implementation correctly creates a stub validator client pointing to a non-routable address (`localhost:1`) that will fail fast if accidentally invoked. The warning message clearly communicates limitations to developers.
902-981: LGTM! Appropriately gating the validator health check and rental health check tasks behind `dev_mode`, since these require a functional validator client.
1097-1123: Gate credit exhaustion check task behind `!config.bittensor.dev_mode`. The credit exhaustion check task uses a validator client for stopping community rentals, but it's only gated by billing client presence, not by dev_mode. In dev_mode, the validator client is a stub pointing to a non-existent endpoint, causing `stop_community_rental_internal` calls to fail silently (errors are logged). While this won't crash the application, community rentals won't actually be stopped in dev mode. The health check task (line 903) already uses the `!config.bittensor.dev_mode` pattern and should be followed here for consistency.
orchestrator/k8s/training/training-session-crd.yaml (1)
1-183: LGTM overall! The CRD is well-designed with:
- Clear required fields and sensible defaults
- Proper validation constraints (min/max on numeric fields)
- Comprehensive status tracking with appropriate phases
- Useful printer columns for `kubectl get` output
- Short names for convenience (`ts`, `tsession`)

services/training-service/src/backend.py (10)
1-13: LGTM - Imports and logger setup are appropriate. Standard imports for the training backend with HuggingFace, PEFT, and PyTorch libraries.
15-33: LGTM - LoRA configuration with sensible defaults. The alpha/rank ratio of 2 and the target modules covering standard projection layers align with common LoRA practices.
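As a quick arithmetic check of the ratio mentioned here (illustrative helper, not the service's code): LoRA scales its low-rank update by alpha / rank, so setting alpha = 2 * rank keeps that effective scale constant as rank changes.

```python
# LoRA applies its low-rank update scaled by alpha / rank; with the defaults
# discussed above (alpha = 2 * rank) the effective scale stays fixed at 2.0.
def lora_scaling(rank, alpha=None):
    alpha = 2 * rank if alpha is None else alpha
    return alpha / rank

assert lora_scaling(8) == 2.0
assert lora_scaling(32) == 2.0            # ratio unchanged as rank grows
assert lora_scaling(16, alpha=16) == 1.0  # explicit alpha overrides the ratio
```

Keeping the scale fixed means changing rank adjusts capacity without also changing the magnitude of the adapter's contribution.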
35-45: LGTM - Optimizer configuration with standard AdamW defaults.
47-59: LGTM - SessionState dataclass properly tracks training session state.
61-78: LGTM - Result dataclasses provide clear return structures.
226-292: LGTM - Forward-backward implementation is correct. The weighted loss computation properly handles the shift for next-token prediction, and the gradient accumulation behavior (allowing multiple forward_backward calls before optim_step) is a useful feature for gradient accumulation.
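The multi-call accumulation behavior can be sketched in miniature with plain Python (no torch; `TinyTrainer` and its scalar squared-error loss are illustrative, not the backend's API):

```python
# Gradient accumulation in miniature: several forward_backward calls *sum*
# gradients, then a single optim_step applies them and zeroes the accumulator,
# mirroring loss.backward() + optimizer.step() + optimizer.zero_grad().
class TinyTrainer:
    def __init__(self, w=0.0, lr=0.1):
        self.w, self.lr, self.grad = w, lr, 0.0

    def forward_backward(self, x, y):
        loss = (self.w * x - y) ** 2
        self.grad += 2 * (self.w * x - y) * x  # accumulate, don't overwrite
        return loss

    def optim_step(self):
        self.w -= self.lr * self.grad
        self.grad = 0.0  # zero after stepping, like optimizer.zero_grad()
        return self.w

t = TinyTrainer()
t.forward_backward(1.0, 2.0)  # microbatch 1: grad = -4
t.forward_backward(1.0, 2.0)  # microbatch 2: grad = -8 (accumulated)
t.optim_step()                # one update covering both microbatches
```

The key detail is that `forward_backward` adds into the gradient rather than replacing it, so the step is equivalent to training on the combined batch.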
294-317: LGTM - Optimizer step implementation is correct. The gradient clipping, optimizer step, and gradient zeroing sequence is properly implemented.
319-381: LGTM - Text generation implementation handles edge cases well. The temperature=0 (greedy decoding), top_k=0 (no filtering), and finish_reason detection are all correctly implemented.
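Those two edge cases can be illustrated with a toy selector (illustrative, not the service's sampling code): temperature=0 degenerates to greedy argmax, while top_k=0 keeps the full distribution.

```python
# Toy token selector showing the edge cases: temperature=0 -> greedy argmax;
# top_k=0 -> no filtering (all tokens stay candidates). Actual sampling from
# the kept candidates is elided; the sketch returns the best kept token.
def select_token(logits, temperature, top_k):
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])  # greedy
    if top_k > 0:
        keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    else:
        keep = list(range(len(logits)))  # top_k=0: no filtering at all
    return max(keep, key=lambda i: logits[i])

assert select_token([0.1, 2.0, 0.3], temperature=0, top_k=0) == 1  # argmax
```

Treating 0 as "disabled" for both knobs avoids division-by-zero on temperature scaling and an empty candidate set on top-k filtering.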
383-432: LGTM - Checkpoint save implementation is comprehensive. Correctly saves adapter weights and optionally the full training state including optimizer, step count, and configurations.
492-511: LGTM - Helper methods are correctly implemented. `_get_session` provides proper validation, and `_compute_logprobs` correctly handles the shift alignment for next-token prediction.
orchestrator/k8s/training/local-dev/gateway.yaml (1)
1-24: LGTM - Gateway configuration is appropriate for local development. The GatewayClass and Gateway resources are correctly configured for Envoy Gateway with HTTP listener on port 8080.
crates/basilica-operator/src/k8s_client.rs (6)
23-24: LGTM - Import added for TrainingSession CRD types.
77-91: LGTM - Trait methods follow established patterns. The new TrainingSession CRUD methods are consistent with existing CRD operations in the trait.
183-183: LGTM - Storage map added for TrainingSession CRDs in mock client.
363-408: LGTM - MockK8sClient TrainingSession implementations follow existing patterns. The implementations are consistent with other CRD operations (BasilicaJob, GpuRental, etc.) in the mock client.
1038-1084: LGTM - KubeClient TrainingSession implementations are correct. The real Kubernetes API implementations follow the established patterns with proper error handling for 409 conflicts and use of Server-Side Apply for status updates.
1664-1693: LGTM - RateLimitedKubeClient properly delegates with rate limiting. All four TrainingSession methods correctly await rate limiting before delegating to the inner client.
crates/basilica-operator/src/controllers/training_session_controller.rs (6)
1-28: LGTM - Imports and constants are appropriate. The constants for gateway name, namespace, and service port are well-defined for Envoy Gateway integration.
165-211: LGTM - Scheduling handler correctly monitors pod phases. The state machine logic properly handles Running, Pending, and Failed pod states, with appropriate fallback to Pending if the pod is missing.
486-489: `restart_policy: Never` means failed pods won't auto-recover. With `Never`, if the training container crashes, the pod enters Failed state permanently. This is intentional for training jobs (to preserve state/logs), but ensure the controller handles this appropriately. The controller does handle Failed pods in `handle_scheduling` by setting an error status. Verify this is the desired behavior for production.
497-541: LGTM - Service spec is correctly configured. The selector matches the pod label, and the service port correctly targets the container port.
543-599: LGTM - HTTPRoute construction is correct. The route properly configures path matching, URL rewriting to strip the session prefix, and backend reference to the training service.
601-682: LGTM - Unit tests cover key resource builder functionality. Tests verify pod and service construction with expected names, ports, and GPU resource configuration.
Consider adding tests for the controller's reconcile logic using MockK8sClient to verify state transitions.
services/training-service/tests/test_backend.py (7)
1-31: LGTM - Test fixtures are well-structured. The fixtures properly set up the backend with temp directories and handle CUDA/CPU device selection. Using `facebook/opt-125m` is appropriate for fast testing.
34-62: LGTM - Session creation tests cover happy path and error case.
64-112: LGTM - Forward/backward tests verify core training functionality. Tests appropriately verify loss computation and token counting for both standard and weighted loss scenarios.
114-147: LGTM - Optimizer step test verifies multi-step training workflow.
188-212: Test may not catch the optimizer parameter bug. Due to the bug identified in `load_state` (optimizer references stale parameters after loading), this test might pass but actual training after `load_state` would fail. Consider adding a test that does training after loading to verify end-to-end correctness.
🔎 Suggested additional test:
```python
def test_training_after_load_checkpoint(self, backend, session_id, small_model):
    """Test that training works correctly after loading a checkpoint."""
    backend.create_session(
        session_id=session_id,
        base_model=small_model,
        lora_config=LoraConfiguration(rank=8),
        optimizer_config=OptimizerConfiguration(),
    )
    input_ids = torch.randint(0, 1000, (1, 32))
    attention_mask = torch.ones_like(input_ids)
    labels = input_ids.clone()

    # Train and save
    backend.forward_backward(session_id, input_ids, attention_mask, labels)
    backend.optim_step(session_id)
    path = backend.save_state(session_id, "checkpoint-1")

    # Load and train more
    backend.load_state(session_id, path)
    result = backend.forward_backward(session_id, input_ids, attention_mask, labels)
    step = backend.optim_step(session_id)

    # Verify training continued correctly
    assert step == 2
    assert result.loss > 0
```
214-227: LGTM - Status test verifies configuration is correctly stored.
229-267: LGTM - Session management and error handling tests are comprehensive. Tests cover deletion, listing multiple sessions, and proper error raising for nonexistent sessions.
crates/basilica-operator/src/crd/training_session.rs (6)
9-42: LGTM! The LoraConfig structure is well-defined with sensible defaults (rank=32, alpha=64, dropout=0.05) and appropriate validation ranges. The target modules cover standard attention projections commonly used in transformer fine-tuning.
62-97: LGTM! OptimizerConfig provides sensible defaults for fine-tuning (learning_rate=1e-4, weight_decay=0.01, grad_clip=1.0). The configuration is appropriate for LoRA training scenarios.
99-123: LGTM! CheckpointStorage provides flexible configuration for multiple storage backends with appropriate optional fields for credentials, region, and custom endpoints.
125-156: LGTM! The GpuResources struct provides appropriate configuration for GPU requirements with sensible defaults (count=1) and validation ranges (1-8 GPUs, 8-256 GB memory).
221-264: LGTM! The TrainingSessionPhase enum provides a clear lifecycle progression with appropriate requeue intervals for each phase. The terminal state handling and string conversion methods are correctly implemented.
266-427: LGTM! TrainingSessionStatus provides comprehensive state tracking with well-implemented builder methods. The automatic timestamp updates and phase transitions (e.g., setting the Failed phase in `with_error()`) ensure consistent state management. The tests adequately cover the builder pattern and phase logic.
crates/basilica-api/src/api/routes/training.rs (2)
40-135: LGTM! Request types are well-structured with defaults that match the CRD definitions. The use of `serde(default)` attributes ensures backward compatibility when fields are omitted.
427-475: LGTM! The delete handler correctly uses typed error matching (`ae.code == 404`) and appropriately treats 404 as a non-error (idempotent delete). The implementation is robust and follows best practices.
services/training-service/src/server.py (5)
1-46: LGTM! The initialization logic properly uses environment variables for configuration and correctly handles device selection. The lifespan context manager ensures proper backend lifecycle management.
82-161: LGTM! Response models are well-structured with appropriate field types. The SessionStatusResponse correctly includes training-specific fields (lora_rank, learning_rate) that are internal to the service.
263-267: The logprobs wrapping is correct. The backend's `ForwardBackwardResult` returns `logprobs: List[float]` (a flat list for a single sequence), while `ForwardBackwardResponse` expects `logprobs: List[List[float]]` (batched format). Wrapping the result in a list converts it to the required API contract.
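Both conventions can be sketched together in plain Python (function names are illustrative, not the service's): next-token logprobs use the usual shift, and the single sequence is then wrapped in an outer list to form the batched shape.

```python
import math

def log_softmax(logits):
    m = max(logits)
    z = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - z for l in logits]

# Shift alignment: the logprob of the token at position i comes from the
# logits at position i-1, so logits[:-1] pairs with tokens[1:].
def next_token_logprobs(logits, tokens):
    return [log_softmax(step)[tok] for step, tok in zip(logits[:-1], tokens[1:])]

logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]  # [seq_len][vocab]
tokens = [0, 0, 1]
lps = next_token_logprobs(logits, tokens)
assert len(lps) == len(tokens) - 1  # one logprob per *predicted* token
batched = [lps]  # wrap the flat List[float] into the batched List[List[float]]
```

The flat list is one element shorter than the token sequence because the first token has no preceding position to predict it from.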
176-179: `list_sessions()` is a trivial in-memory operation and does not require blocking prevention. The `list_sessions()` method simply returns `list(self.sessions.keys())`, a dictionary key access with no I/O, heavy computation, or GPU operations. This will not block the event loop.
Likely an incorrect or invalid review comment.
61-70: The Python service uses Adam optimizer parameters not exposed in the CRD. The backend internally uses `beta1`, `beta2`, and `epsilon` when configuring the AdamW optimizer, and the Python API accepts these parameters. However, the Rust CRD only exposes `learning_rate`, `weight_decay`, and `grad_clip`. This is intentional: the Python training service provides fine-grained control over Adam parameters, while the CRD presents a simpler interface for Kubernetes users. The field defaults match industry standards for Adam optimization and are properly validated.
```rust
fn user_namespace(user_id: &str) -> String {
    // Sanitize user_id for K8s namespace (lowercase, alphanumeric + hyphens)
    let sanitized: String = user_id
        .chars()
        .map(|c| if c.is_alphanumeric() { c.to_ascii_lowercase() } else { '-' })
        .collect();
    format!("u-{}", sanitized)
}
```
Potential namespace collision risk.
The sanitization logic could create collisions: different user IDs like "user_123" and "user-123" would both map to "u-user-123". Additionally, the function doesn't enforce Kubernetes namespace name length limits (max 63 characters).
Consider using a hash-based approach for deterministic, collision-free namespace names:
🔎 Suggested fix:
```diff
 fn user_namespace(user_id: &str) -> String {
-    // Sanitize user_id for K8s namespace (lowercase, alphanumeric + hyphens)
-    let sanitized: String = user_id
-        .chars()
-        .map(|c| if c.is_alphanumeric() { c.to_ascii_lowercase() } else { '-' })
-        .collect();
-    format!("u-{}", sanitized)
+    // Use hash to ensure unique, valid K8s namespace name
+    use std::collections::hash_map::DefaultHasher;
+    use std::hash::{Hash, Hasher};
+
+    let mut hasher = DefaultHasher::new();
+    user_id.hash(&mut hasher);
+    let hash = hasher.finish();
+
+    format!("u-{:x}", hash) // Produces names like "u-9f86d0818847c2e5" (hex digits only)
 }
```

```rust
    },
    "ttlSeconds": req.ttl_seconds,
    "seed": req.seed,
    "enableBilling": true
```
Hardcoded `enableBilling` prevents user override.
The `enableBilling` field is always set to `true`, ignoring any potential user preference. Consider adding it as an optional field in `CreateSessionRequest`.
🔎 Suggested fix:
Add the field to `CreateSessionRequest`:

```diff
 pub struct CreateSessionRequest {
     pub base_model: String,
     pub checkpoint_storage: CheckpointStorageRequest,
     // ... other fields ...
+    #[serde(default = "default_enable_billing")]
+    pub enable_billing: bool,
 }
+
+fn default_enable_billing() -> bool {
+    true
+}
```

Then use it in `build_training_session_crd`:

```diff
-    "enableBilling": true
+    "enableBilling": req.enable_billing
```

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In crates/basilica-api/src/api/routes/training.rs around line 244, the JSON
hardcodes "enableBilling": true which prevents users from overriding billing;
add an optional field (Option<bool>) named enableBilling to the
CreateSessionRequest struct, update request deserialization/validation as
needed, and change build_training_session_crd to read request.enableBilling and
set the CRD field to that value (use false or a sensible default when None)
instead of the hardcoded true.
```rust
if let Err(e) = k8s_client.create_namespace(&namespace).await {
    // Ignore "already exists" errors
    if !e.to_string().contains("409") && !e.to_string().contains("AlreadyExists") {
        error!(error = %e, namespace = %namespace, "Failed to create namespace");
        apimetrics::record_request("training.create", "POST", start, false);
        return Err(ApiError::Internal {
            message: format!("Failed to create namespace: {}", e),
        });
    }
}
```
🛠️ Refactor suggestion | 🟠 Major
Fragile error detection using string matching.
Checking for "409" or "AlreadyExists" in the error string is brittle and may miss actual errors or incorrectly ignore failures if the error message format changes.
🔎 Apply this fix to use typed error matching:
```diff
 if let Err(e) = k8s_client.create_namespace(&namespace).await {
-    // Ignore "already exists" errors
-    if !e.to_string().contains("409") && !e.to_string().contains("AlreadyExists") {
+    // Ignore "already exists" errors using typed matching
+    if !matches!(e, kube::Error::Api(ref ae) if ae.code == 409) {
         error!(error = %e, namespace = %namespace, "Failed to create namespace");
         apimetrics::record_request("training.create", "POST", start, false);
         return Err(ApiError::Internal {
             message: format!("Failed to create namespace: {}", e),
         });
     }
 }
```

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In crates/basilica-api/src/api/routes/training.rs around lines 284 to 293,
replace the fragile string-based checks against e.to_string() with typed error
matching: match the error (likely kube::Error) and specifically handle
kube::Error::Api(ae) where ae.code == 409 or ae.reason == "AlreadyExists" as the
non-fatal case; for all other error variants log the error with context, record
the metric, and return the ApiError::Internal as before. Ensure you import the
kube types if needed and keep the existing logging/metrics/return behavior for
real failures.
scripts/api/compose.dev.yml (Outdated)
```diff
   POSTGRES_DB: basilica_api
 ports:
-  - "5432:5432"
+  - "5433:5432" # Use 5433 to avoid conflict with local Postgres
```
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
# Search for the compose.dev.yml file and check its content
echo "=== Checking compose.dev.yml ==="
if [ -f "scripts/api/compose.dev.yml" ]; then
  head -20 scripts/api/compose.dev.yml
  echo ""
  sed -n '8,12p' scripts/api/compose.dev.yml
else
  echo "File not found at expected path"
fi
echo ""
echo "=== Searching for localhost:5432 references in codebase ==="
rg -n 'localhost:5432|127\.0\.0\.1:5432' --type-list 2>/dev/null | head -20 || echo "No direct references found or rg not available"
echo ""
echo "=== Searching in .env and documentation files ==="
find . -type f \( -name ".env*" -o -name "README*" -o -name "*.md" \) -exec grep -l "5432" {} \; 2>/dev/null | head -10
```

Repository: one-covenant/basilica
Length of output: 1092
🏁 Script executed:
```bash
#!/bin/bash
# Check the documentation files that mention 5432 for context
echo "=== Checking docs/validator.md for postgres references ==="
grep -n -C 3 "5432" ./docs/validator.md 2>/dev/null || echo "No matches found"
echo ""
echo "=== Checking orchestrator/README.md for postgres references ==="
grep -n -C 3 "5432" ./orchestrator/README.md 2>/dev/null || echo "No matches found"
echo ""
echo "=== Checking for any DATABASE_URL or connection string patterns ==="
rg -n "DATABASE_URL|postgres.*localhost" --type md --type yaml 2>/dev/null | head -20 || echo "No matches found"
echo ""
echo "=== Look for local development or setup instructions ==="
find . -type f -name "*.md" -exec grep -l "docker-compose\|docker compose\|local.*setup\|development.*setup" {} \; 2>/dev/null | grep -i dev | head -5
```

Repository: one-covenant/basilica
Length of output: 3472
Update docs/validator.md to reflect the new PostgreSQL port for local development.
The port change in `scripts/api/compose.dev.yml` is correct, with a clear inline comment. However, `docs/validator.md` contains example connection strings (lines 933 and 2743) showing `localhost:5432`, which would now be incorrect. These examples should be updated to use `localhost:5433` if they apply to the API service running with the new compose setup. Note that internal Docker references (e.g., `postgres:5432` in other compose files) are unaffected and correct.
🤖 Prompt for AI Agents
In docs/validator.md (lines ~933 and ~2743) update the example connection
strings that reference localhost:5432 to localhost:5433 so they match the API
local dev compose mapping (scripts/api/compose.dev.yml line 10) — leave any
internal Docker host references (e.g., postgres:5432) unchanged; verify both
examples still parse correctly and run a quick search to ensure no other
localhost:5432 examples remain that should be localhost:5433.
```dockerfile
# Use NVIDIA CUDA image for GPU support, or Python slim for CPU-only
ARG BASE_IMAGE=python:3.11-slim
FROM ${BASE_IMAGE}

# Install system dependencies
RUN apt-get update && apt-get install -y \
    curl \
    git \
    && rm -rf /var/lib/apt/lists/*

# Set up working directory
WORKDIR /app

# Install Python dependencies
COPY pyproject.toml .
RUN pip install --no-cache-dir --upgrade pip && \
    pip install --no-cache-dir -e "."

# Copy source code
COPY src/ src/
COPY tests/ tests/

# Create directories for models and checkpoints
RUN mkdir -p /models /checkpoints

# Environment variables
ENV PYTHONUNBUFFERED=1
ENV MODEL_CACHE_DIR=/models
ENV CHECKPOINT_DIR=/checkpoints

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=120s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1

# Expose port
EXPOSE 8000

# Run service
CMD ["python", "-m", "uvicorn", "src.server:app", "--host", "0.0.0.0", "--port", "8000"]
```
Add non-root user for security.
The container runs as root by default, which violates the principle of least privilege and creates security risks. Container processes should run as non-root users.
🔎 Apply this diff to add a non-root user:
```diff
 # Install system dependencies
 RUN apt-get update && apt-get install -y \
     curl \
     git \
     && rm -rf /var/lib/apt/lists/*

+# Create non-root user
+RUN useradd -m -u 1000 -s /bin/bash appuser && \
+    mkdir -p /models /checkpoints && \
+    chown -R appuser:appuser /app /models /checkpoints
+
 # Set up working directory
 WORKDIR /app

 # Install Python dependencies
 COPY pyproject.toml .
 RUN pip install --no-cache-dir --upgrade pip && \
     pip install --no-cache-dir -e "."

 # Copy source code
 COPY src/ src/
 COPY tests/ tests/

-# Create directories for models and checkpoints
-RUN mkdir -p /models /checkpoints
-
 # Environment variables
 ENV PYTHONUNBUFFERED=1
 ENV MODEL_CACHE_DIR=/models
 ENV CHECKPOINT_DIR=/checkpoints

+# Switch to non-root user
+USER appuser
+
 # Health check
 HEALTHCHECK --interval=30s --timeout=10s --start-period=120s --retries=3 \
     CMD curl -f http://localhost:8000/health || exit 1
```

Committable suggestion skipped: line range outside the PR's diff.
```toml
dependencies = [
    "torch>=2.1.0",
    "transformers>=4.36.0",
    "peft>=0.7.0",
    "accelerate>=0.25.0",
    "safetensors>=0.4.0",
    "tokenizers>=0.15.0",
    "boto3>=1.34.0",
    "fastapi>=0.109.0",
    "uvicorn>=0.27.0",
    "pydantic>=2.5.0",
    "structlog>=24.1.0",
    "httpx>=0.27.0",
]
```
🛠️ Refactor suggestion | 🟠 Major
Pin dependency versions to avoid unexpected breaking changes.
All dependencies use `>=` constraints without upper bounds, which can lead to:
- Unexpected breaking changes when major versions are released
- Inconsistent behavior across deployments
- Difficult debugging when dependencies change behavior
🔎 Recommended approach for dependency management:
Consider one of these strategies:
Option 1: Pin exact versions in `pyproject.toml` (strictest)

```toml
dependencies = [
    "torch==2.1.0",
    "transformers==4.36.0",
    # ...
]
```

Option 2: Use compatible version ranges (recommended)

```toml
dependencies = [
    "torch>=2.1.0,<3.0.0",
    "transformers>=4.36.0,<5.0.0",
    # ...
]
```

Option 3: Use a `requirements.txt` or `poetry.lock` for pinning

Additionally, maintain a `requirements-lock.txt` or use Poetry with `poetry.lock` to ensure reproducible builds while keeping `pyproject.toml` flexible for development.
This is especially critical for ML dependencies like `torch` and `transformers`, where API changes are common.
🤖 Prompt for AI Agents
In services/training-service/pyproject.toml around lines 6 to 19 the dependency
constraints are open-ended (using >=) which can introduce breaking changes;
update each dependency to either a pinned exact version (e.g., ==) or,
preferably, a compatible upper-bounded range (e.g., >=x.y.z,<next-major) for all
listed packages (torch, transformers, peft, accelerate, safetensors, tokenizers,
boto3, fastapi, uvicorn, pydantic, structlog, httpx), and ensure you commit a
lockfile (poetry.lock or a generated requirements-lock.txt) so builds are
reproducible; for ML libs like torch/transformers prefer tight upper bounds
(<3.0.0, <5.0.0 respectively) to avoid major-breaking upgrades.
```python
# Wrap with LoRA
model = get_peft_model(base, peft_config)
model.print_trainable_parameters()
```
`get_peft_model` modifies the base model; `print_trainable_parameters` uses stdout.
- `get_peft_model` can modify the base model in place, which may cause issues if the base model is cached and shared across sessions.
- `print_trainable_parameters()` outputs to stdout rather than the structured logger.
🔎 Suggested fixes:
```diff
-        # Wrap with LoRA
-        model = get_peft_model(base, peft_config)
-        model.print_trainable_parameters()
+        # Wrap with LoRA (using copy to avoid modifying cached base model)
+        import copy
+        base_copy = copy.deepcopy(base) if model_name in self.base_models else base
+        model = get_peft_model(base_copy, peft_config)
+
+        # Log trainable parameters
+        trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
+        total_params = sum(p.numel() for p in model.parameters())
+        logger.info(
+            "trainable_parameters",
+            trainable=trainable_params,
+            total=total_params,
+            percentage=100 * trainable_params / total_params,
+        )
```
Actionable comments posted: 10
♻️ Duplicate comments (2)
crates/basilica-operator/src/controllers/training_session_controller.rs (2)
203-212: Status update condition still missing `endpoint` and `start_time` checks. The condition now includes `tokens_processed`, but changes to `endpoint` or `start_time` still won't trigger status updates. This can cause the CRD status to become stale when these fields change during phase transitions. Consider deriving `PartialEq` on `TrainingSessionStatus` and comparing directly, or extend the condition to include these fields.
239-251: Using `is_err()` conflates "not found" with other errors. This issue was flagged in a previous review. The code treats any error from `get_pod` as "pod doesn't exist" and attempts creation. Network failures, permission errors, or API server issues would be masked and could lead to spurious creation attempts.
🔎 Suggested fix using proper error discrimination:

```diff
-        // Check if pod already exists
-        let existing_pod = self.client.get_pod(namespace, &pod_name).await;
-        if existing_pod.is_err() {
+        // Check if pod already exists
+        let pod_exists = match self.client.get_pod(namespace, &pod_name).await {
+            Ok(_) => true,
+            Err(e) if e.to_string().contains("NotFound") || e.to_string().contains("not found") => false,
+            Err(e) => {
+                error!(error = %e, "failed to check pod existence");
+                return Ok(status.with_error(format!("Failed to check pod: {}", e)));
+            }
+        };
+        if !pod_exists {
             // Create the pod
```
🧹 Nitpick comments (9)
.gitignore (1)
102-102: Remove redundant `.venv` pattern. Both `.venv/` (line 102) and `.venv` (line 137) are now in the file. In gitignore, `.venv/` already matches the `.venv` directory. Adding `.venv` is redundant since both patterns accomplish the same goal for this use case. Consider keeping only one pattern. For Python virtual environments (which are always directories), `.venv/` is sufficient and more explicit about intent.
🔎 Apply this diff to remove the redundant pattern:

```diff
 # Runtime State
 *.pid
 *.lock
 tmp/
 temp/
-.venv/
+.venv

 # Binaries
```

Or keep the explicit directory pattern and remove the new one:

```diff
-.venv
```
scripts/local-training-e2e.sh (5)
58-62: Consider verifying the k3d install script before execution. Piping `curl` directly to `bash` can execute arbitrary code from a remote URL without verification. For a local dev script this is often acceptable, but consider downloading first, inspecting, then executing, or using a package manager.
🔎 Safer alternative:

```diff
 # Install k3d if needed
 if ! command -v k3d &> /dev/null; then
     log_info "Installing k3d..."
-    curl -s https://raw.githubusercontent.com/k3d-io/k3d/main/install.sh | bash
+    # Option 1: Use package manager
+    # brew install k3d
+    # Option 2: Download and inspect before running
+    local install_script="/tmp/k3d-install.sh"
+    curl -sL https://raw.githubusercontent.com/k3d-io/k3d/main/install.sh -o "$install_script"
+    log_info "Downloaded install script to $install_script - review if needed"
+    bash "$install_script"
+    rm -f "$install_script"
 fi
```
343-380: Consider extracting the curl logic to reduce duplication. The authenticated and unauthenticated curl calls are nearly identical. This could be simplified with a helper function or by conditionally building the headers.

🔎 Optional refactor:

 # Build curl args
 CURL_ARGS=(-s -X POST "$API_URL/sessions" -H "Content-Type: application/json")
 if [ -n "$AUTH_HEADER" ]; then
     CURL_ARGS+=(-H "$AUTH_HEADER")
 fi
 RESPONSE=$(curl "${CURL_ARGS[@]}" -d '{
     "baseModel": "facebook/opt-125m",
     ...
 }')
401-409: Unused loop variable `i`. The loop counter `i` is never used inside the loop. Use `_` to indicate an intentionally unused variable, which also silences shellcheck warning SC2034.

🔎 Apply this diff:

-for i in {1..60}; do
+for _ in {1..60}; do
449-456: Consider using single quotes for the trap command. The current trap with double quotes works correctly here because `$PF_PID` is expanded when the trap is set, capturing the actual PID. However, using single quotes is more idiomatic and makes the intent clearer while silencing shellcheck warning SC2064.

🔎 Apply this diff:

 kubectl port-forward -n "$NAMESPACE" "pod/$POD_NAME" 8001:8000 &
 PF_PID=$!
 sleep 3

 # Cleanup on exit
-trap "kill $PF_PID 2>/dev/null" EXIT
+trap 'kill $PF_PID 2>/dev/null' EXIT
500-502: Arbitrary sleep for model loading. The `sleep 10` is a fixed wait that may be insufficient for slower systems or unnecessary for faster ones. Consider polling the session status endpoint to confirm the model is loaded, similar to how pod readiness is checked elsewhere.

crates/basilica-operator/src/controllers/training_session_controller.rs (3)
267-279: Same `is_err()` issue for network policy check. While the network policy failure is non-fatal (as noted in the comment), distinguishing error types would provide better observability. Consider logging differently for "not found" vs other errors.
361-370: Fragile string-based error detection for conflict. Checking `e.to_string().contains("409")` or `contains("AlreadyExists")` is brittle and may break if the error message format changes. Consider using typed error variants if the K8sClient trait supports them.
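The typed-variant idea can be sketched with stand-in types (the enum and `exists` helper below are illustrative assumptions, not the actual `K8sClient` API):

```rust
// Hypothetical typed error for a K8s client wrapper: callers match on the
// failure kind instead of scraping e.to_string() for "409" or "NotFound".
#[derive(Debug, PartialEq)]
enum K8sError {
    NotFound,
    Conflict,
    Other(String),
}

// "Does the resource exist?" becomes an exhaustive match: NotFound maps to
// false, success to true, and any other failure propagates as a real error.
fn exists(result: Result<(), K8sError>) -> Result<bool, K8sError> {
    match result {
        Ok(()) => Ok(true),
        Err(K8sError::NotFound) => Ok(false),
        Err(e) => Err(e),
    }
}
```

With this shape, string-matching on error messages disappears from every existence check, and unexpected failures (e.g. a conflict) can no longer be silently treated as "not found".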
808-889: Good unit test coverage for resource builders. The tests validate the core resource construction logic. Consider adding tests for:
- Error cases (missing UID once that validation is added)
- CPU-only mode (gpu_count = 0)
- HTTPRoute construction
- NetworkPolicy construction
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- .gitignore (1 hunks)
- crates/basilica-operator/src/controllers/training_session_controller.rs (1 hunks)
- scripts/local-training-e2e.sh (1 hunks)
- services/training-service/README.md (1 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.18.1)
services/training-service/README.md
93-93: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
244-244: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🪛 Shellcheck (0.11.0)
scripts/local-training-e2e.sh
[warning] 401-401: i appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 456-456: Use single quotes, otherwise this expands now rather than when signalled.
(SC2064)
🔇 Additional comments (14)
services/training-service/README.md (1)
1-305: Comprehensive and well-organized documentation. The README effectively documents the Training Service with clear sections covering prerequisites, quick start, API endpoints, configuration, and troubleshooting. The architecture diagram, example commands, and E2E testing guidance provide excellent user onboarding. The documentation properly aligns with the TrainingSession CRD workflow and operator-driven resource provisioning introduced in this PR.
scripts/local-training-e2e.sh (12)
1-38: LGTM! Good script setup with `set -euo pipefail`, clear documentation, and useful logging helpers with color-coded output.

67-96: LGTM! Well-structured cluster setup with proper idempotency check, reasonable port mappings, and appropriate wait for node readiness.

98-127: LGTM with minor note. Good fallback logic for helm vs kubectl installation. The `sleep 10` on line 121 before the `kubectl wait` is somewhat arbitrary but acceptable for a local dev script since the actual readiness is verified by the subsequent wait command.

129-150: LGTM! Clean image build and load workflow for local development.

152-183: LGTM! Well-organized deployment functions with proper sequencing and readiness checks.

185-211: LGTM! Good defensive handling for stale Postgres volumes with the user existence check before proceeding.

213-286: LGTM! Database reset, API key generation, and API runner functions are well-organized for the local dev workflow.

288-312: LGTM! Good cleanup function with proper wait for pod termination.

509-548: LGTM! Well-structured training loop with proper error handling and informative logging for each step.

573-620: LGTM! Status and log viewing functions provide good observability into the cluster state.

622-640: LGTM! Comprehensive cleanup that handles Postgres, cluster deletion, and generated artifacts.

642-734: LGTM! Excellent help documentation with clear quick-start guide, and well-organized main command dispatch.
crates/basilica-operator/src/controllers/training_session_controller.rs (1)
399-415: LGTM - MVP placeholder acknowledged. The comment clearly documents that health checks are deferred. The immediate transition to Ready is acceptable for MVP.
// Check if service already exists
let existing_svc = self.client.get_service(namespace, &svc_name).await;
if existing_svc.is_err() {
    // Create the service
    let svc = build_training_service(session, namespace, name)?;
    match self.client.create_service(namespace, &svc).await {
        Ok(_) => info!(service = %svc_name, "created training service"),
        Err(e) => {
            error!(error = %e, "failed to create training service");
            return Ok(status.with_error(format!("Failed to create service: {}", e)));
        }
    }
}
Same `is_err()` issue for service existence check.
Apply the same fix pattern as recommended for the pod check to properly distinguish "not found" from other errors.
| "Failed" => { | ||
| let message = pod | ||
| .status | ||
| .as_ref() | ||
| .and_then(|s| s.message.clone()) | ||
| .unwrap_or_else(|| "Pod failed".into()); | ||
| Ok(status.with_error(message)) | ||
| } |
Pod failure doesn't transition session to Failed phase.
When the pod enters "Failed" state, only an error message is set via with_error(), but the session phase remains "Scheduling". This could cause the reconciler to keep checking a failed pod indefinitely.
🔎 Consider transitioning to Failed phase:
"Failed" => {
let message = pod
.status
.as_ref()
.and_then(|s| s.message.clone())
.unwrap_or_else(|| "Pod failed".into());
- Ok(status.with_error(message))
+ Ok(status
+ .with_phase(TrainingSessionPhase::Failed)
+ .with_error(message))
 }

🤖 Prompt for AI Agents
In crates/basilica-operator/src/controllers/training_session_controller.rs
around lines 314 to 321, when matching a Pod with "Failed" status the code only
calls with_error(...) and does not update the session phase; change the status
update to set the session phase to Failed as well (e.g., call the method that
sets phase to TrainingSessionPhase::Failed or use
status.with_phase(...).with_error(message)) so the session transitions out of
Scheduling and the reconciler stops repeatedly processing the failed pod.
Err(e) => {
    warn!(error = %e, "service not found");
    // Recreate service
    let svc = build_training_service(session, namespace, name)?;
    self.client.create_service(namespace, &svc).await?;
    Ok(status)
Inconsistent error propagation.
Here, create_service errors propagate via ?, returning an Err. In handle_pending (lines 258-264), service creation errors return Ok(status.with_error(...)). This inconsistency could cause different reconcile behaviors for similar failures.
Consider using consistent error handling strategy across all phase handlers.
🤖 Prompt for AI Agents
crates/basilica-operator/src/controllers/training_session_controller.rs around
lines 389 to 394: the handler currently propagates create_service errors with
`?` while other handlers (e.g., handle_pending at lines 258-264) convert similar
failures into an Ok(status.with_error(...)) result; make the behavior consistent
by catching the error from self.client.create_service(...) instead of using `?`
and return Ok(status.with_error(format!("create_service failed: {}", e))) (or
alternatively change the other handler to propagate — pick one consistent
strategy across all phase handlers), ensuring the status is updated with the
error and the function returns Ok(status) rather than bubbling the error.
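The "convert everything to status" strategy can be pictured with stand-in types rather than the controller's real ones (a sketch of the design choice, not the PR's code):

```rust
// Stand-in for the controller's status type.
#[derive(Debug, PartialEq)]
struct Status {
    error: Option<String>,
}

impl Status {
    fn with_error(mut self, msg: String) -> Self {
        self.error = Some(msg);
        self
    }
}

// Uniform policy: creation failures never bubble up as Err; they are
// recorded on the status, so every phase handler requeues the same way.
fn record_create_result(
    status: Status,
    result: Result<(), String>,
    what: &str,
) -> Result<Status, String> {
    match result {
        Ok(()) => Ok(status),
        Err(e) => Ok(status.with_error(format!("Failed to create {}: {}", what, e))),
    }
}
```

Routing every resource-creation failure through one helper like this makes it impossible for two phase handlers to drift into different error-propagation behaviors.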
// Delete the pod
let _ = self.client.delete_pod(namespace, &pod_name).await;
return Ok(status.with_phase(TrainingSessionPhase::Terminated));
Silently ignoring pod deletion result.
The `let _ =` discards any error from `delete_pod`, which could mask cleanup failures. At minimum, log the outcome for observability.
🔎 Suggested fix:
- let _ = self.client.delete_pod(namespace, &pod_name).await;
+ if let Err(e) = self.client.delete_pod(namespace, &pod_name).await {
+ warn!(error = %e, pod = %pod_name, "failed to delete pod on TTL expiry");
+ }📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
// Delete the pod
if let Err(e) = self.client.delete_pod(namespace, &pod_name).await {
    warn!(error = %e, pod = %pod_name, "failed to delete pod on TTL expiry");
}
return Ok(status.with_phase(TrainingSessionPhase::Terminated));
🤖 Prompt for AI Agents
In crates/basilica-operator/src/controllers/training_session_controller.rs
around lines 453-455, replace the ignored call `let _ =
self.client.delete_pod(...).await;` with code that captures the Result from
delete_pod, matches on it, and logs the outcome (include namespace and pod_name)
— log success at debug/info and log failures with error details at warn/error so
deletion failures are observable; keep returning
Ok(status.with_phase(TrainingSessionPhase::Terminated)) afterwards (or
optionally propagate the error if you want to treat deletion failure as fatal).
let owner_ref = OwnerReference {
    api_version: "basilica.ai/v1".to_string(),
    kind: "TrainingSession".to_string(),
    name: name.to_string(),
    uid: session.metadata.uid.clone().unwrap_or_default(),
    controller: Some(true),
    block_owner_deletion: Some(true),
};
Owner reference UID defaults to empty string if missing.
If session.metadata.uid is None, the owner reference will have an empty UID. Kubernetes requires a valid UID for owner references to work correctly for garbage collection. This should likely return an error instead.
🔎 Suggested fix:
+ let uid = session.metadata.uid.clone()
+ .ok_or_else(|| anyhow::anyhow!("TrainingSession missing UID"))?;
+
let owner_ref = OwnerReference {
api_version: "basilica.ai/v1".to_string(),
kind: "TrainingSession".to_string(),
name: name.to_string(),
- uid: session.metadata.uid.clone().unwrap_or_default(),
+ uid,
controller: Some(true),
block_owner_deletion: Some(true),
 };

🤖 Prompt for AI Agents
In crates/basilica-operator/src/controllers/training_session_controller.rs
around lines 586 to 593, the OwnerReference is being constructed using
session.metadata.uid.clone().unwrap_or_default() which will create an empty UID
when the session UID is missing; Kubernetes requires a valid UID so instead
check for the UID and return an error if absent (e.g., use
session.metadata.uid.clone().ok_or_else(||anyhow!("missing metadata.uid for
TrainingSession {name}"))?), propagate the error by ensuring the enclosing
function returns a Result, and only build the OwnerReference when a non-empty
UID is present.
let owner_ref = OwnerReference {
    api_version: "basilica.ai/v1".to_string(),
    kind: "TrainingSession".to_string(),
    name: name.to_string(),
    uid: session.metadata.uid.clone().unwrap_or_default(),
    controller: Some(true),
    block_owner_deletion: Some(true),
};
Same empty UID issue in owner reference.
Same concern as build_training_pod - the owner reference will have an empty UID if the session metadata UID is missing. Apply the same fix pattern.
🤖 Prompt for AI Agents
In crates/basilica-operator/src/controllers/training_session_controller.rs
around lines 646 to 653, the OwnerReference is being built with
session.metadata.uid.clone().unwrap_or_default(), which can produce an empty
UID; mirror the fix used in build_training_pod by retrieving the UID via
session.metadata.uid.clone().ok_or_else(|| anyhow::anyhow!("missing session
UID"))? and propagate/return the error if missing, then use the guaranteed UID
when constructing the OwnerReference instead of unwrapping to a default empty
string.
let route_json = serde_json::json!({
    "apiVersion": "gateway.networking.k8s.io/v1",
    "kind": "HTTPRoute",
    "metadata": {
        "name": route_name,
        "namespace": namespace,
        "labels": {
            "app": "basilica-training",
            "session": session_name
        }
    },
    "spec": {
        "parentRefs": [{
            "name": DEFAULT_GATEWAY_NAME,
            "namespace": DEFAULT_GATEWAY_NAMESPACE
        }],
        "rules": [{
            "matches": [{
                "path": {
                    "type": "PathPrefix",
                    "value": path_prefix
                }
            }],
            "filters": [{
                "type": "URLRewrite",
                "urlRewrite": {
                    "path": {
                        "type": "ReplacePrefixMatch",
                        "replacePrefixMatch": "/"
                    }
                }
            }],
            "backendRefs": [{
                "name": backend_service,
                "namespace": namespace,
                "port": port
            }]
        }]
    }
});
HTTPRoute missing owner reference for garbage collection.
Unlike the Pod and Service, this HTTPRoute doesn't have an ownerReferences field. When the TrainingSession is deleted, this route will be orphaned and must be manually cleaned up.
🔎 Add owner reference to metadata:
The function would need the session parameter to access the UID:
fn build_training_http_route(
+ session: &TrainingSession,
session_name: &str,
namespace: &str,
backend_service: &str,
port: u16,
) -> Result<DynamicObject> {
let route_name = format!("training-route-{}", session_name);
let path_prefix = format!("/sessions/{}/", session_name);
+ let uid = session.metadata.uid.clone()
+ .ok_or_else(|| anyhow::anyhow!("TrainingSession missing UID"))?;
let route_json = serde_json::json!({
// ...
"metadata": {
"name": route_name,
"namespace": namespace,
"labels": {
"app": "basilica-training",
"session": session_name
- }
+ },
+ "ownerReferences": [{
+ "apiVersion": "basilica.ai/v1",
+ "kind": "TrainingSession",
+ "name": session_name,
+ "uid": uid,
+ "controller": true,
+ "blockOwnerDeletion": true
+ }]
},
    // ...

📝 Committable suggestion
let route_name = format!("training-route-{}", session_name);
let path_prefix = format!("/sessions/{}/", session_name);
let uid = session.metadata.uid.clone()
    .ok_or_else(|| anyhow::anyhow!("TrainingSession missing UID"))?;
let route_json = serde_json::json!({
    "apiVersion": "gateway.networking.k8s.io/v1",
    "kind": "HTTPRoute",
    "metadata": {
        "name": route_name,
        "namespace": namespace,
        "labels": {
            "app": "basilica-training",
            "session": session_name
        },
        "ownerReferences": [{
            "apiVersion": "basilica.ai/v1",
            "kind": "TrainingSession",
            "name": session_name,
            "uid": uid,
            "controller": true,
            "blockOwnerDeletion": true
        }]
    },
    "spec": {
        "parentRefs": [{
            "name": DEFAULT_GATEWAY_NAME,
            "namespace": DEFAULT_GATEWAY_NAMESPACE
        }],
        "rules": [{
            "matches": [{
                "path": {
                    "type": "PathPrefix",
                    "value": path_prefix
                }
            }],
            "filters": [{
                "type": "URLRewrite",
                "urlRewrite": {
                    "path": {
                        "type": "ReplacePrefixMatch",
                        "replacePrefixMatch": "/"
                    }
                }
            }],
            "backendRefs": [{
                "name": backend_service,
                "namespace": namespace,
                "port": port
            }]
        }]
    }
});
🤖 Prompt for AI Agents
crates/basilica-operator/src/controllers/training_session_controller.rs around
lines 692 to 731: the generated HTTPRoute JSON is missing
metadata.ownerReferences so the route will be orphaned when the TrainingSession
is deleted; update the function signature to accept the TrainingSession (or its
UID and controller metadata) and add an ownerReferences entry in metadata with
apiVersion, kind, name, uid from the session, controller: true and
blockOwnerDeletion: true so Kubernetes garbage collection removes the HTTPRoute
when the TrainingSession is deleted.
let owner_ref = OwnerReference {
    api_version: "basilica.ai/v1".into(),
    kind: "TrainingSession".into(),
    name: name.to_string(),
    uid: session.metadata.uid.clone().unwrap_or_default(),
    controller: Some(true),
    block_owner_deletion: Some(true),
};
Same empty UID issue in NetworkPolicy owner reference.
Apply the same fix pattern as recommended for the other resource builders.
🤖 Prompt for AI Agents
In crates/basilica-operator/src/controllers/training_session_controller.rs
around lines 755 to 762, the OwnerReference is being built with uid:
session.metadata.uid.clone().unwrap_or_default(), which injects an empty string
when UID is missing; change this to preserve the optional nature of the UID
(e.g., use session.metadata.uid.clone() or
session.metadata.uid.clone().filter(|s| !s.is_empty())) so the
OwnerReference.uid is None when there is no UID, mirroring the fix used in the
other resource builders.
Actionable comments posted: 1
♻️ Duplicate comments (7)
services/training-service/README.md (2)
138-157: Add language specification to fenced code block. Line 138 begins a code block without a language identifier, failing markdown linting compliance.
🔎 Apply this diff to specify the language:
-```
+```bash
 [STEP] Running training steps via port-forward...
289-321: Add language specification to fenced code block. Line 289 begins an ASCII architecture diagram code block without a language identifier, failing markdown linting compliance.
🔎 Apply this diff to specify the language:
-```
+```text
 ┌─────────────────────────────────────────────────────────────┐

crates/basilica-api/src/api/routes/training.rs (4)
288-297: Fragile error detection using string matching. String-based error checks (`e.to_string().contains("409")`) are brittle and may incorrectly classify errors if message formats change. The `kube` crate provides typed error variants that should be used instead.

🔎 Consider using typed error matching:

 if let Err(e) = k8s_client.create_namespace(&namespace).await {
-    // Ignore "already exists" errors
-    if !e.to_string().contains("409") && !e.to_string().contains("AlreadyExists") {
+    // Ignore "already exists" errors using typed matching
+    if !matches!(e, kube::Error::Api(ref ae) if ae.code == 409) {
         error!(error = %e, namespace = %namespace, "Failed to create namespace");
         apimetrics::record_request("training.create", "POST", start, false);
         return Err(ApiError::Internal {
             message: format!("Failed to create namespace: {}", e),
         });
     }
 }
363-374: Fragile error detection using string matching. Similar to the namespace creation handler, this uses string matching to detect 404 errors instead of typed error matching with `kube::Error` variants.
495-521: Confusing sentinel error pattern for empty list handling. Using an error with message `"empty"` as a sentinel value (line 502) and then checking for it with string matching (line 513) is an anti-pattern that makes the code harder to understand and maintain.

🔎 Consider cleaner error handling:

 // List all TrainingSession CRDs in the user's namespace
-let sessions = api
+let sessions_result = api
     .list(&ListParams::default().labels(&format!("user={}", auth.user_id)))
-    .await
-    .map_err(|e| {
-        // If namespace doesn't exist, return empty list
-        if e.to_string().contains("404") || e.to_string().contains("NotFound") {
-            return ApiError::Internal {
-                message: "empty".to_string(), // Sentinel for empty list
-            };
-        }
-        error!(error = %e, "Failed to list sessions");
-        ApiError::Internal {
-            message: format!("Failed to list sessions: {}", e),
-        }
-    });
+    .await;

-let sessions = match sessions {
-    Ok(list) => list,
-    Err(e) if e.to_string().contains("empty") => {
+let sessions = match sessions_result {
+    Ok(list) => list,
+    Err(kube::Error::Api(ae)) if ae.code == 404 => {
+        // Namespace doesn't exist, return empty list
         apimetrics::record_request("training.list", "GET", start, true);
         return Ok(Json(vec![]));
     }
     Err(e) => {
+        error!(error = %e, "Failed to list sessions");
         apimetrics::record_request("training.list", "GET", start, false);
-        return Err(e);
+        return Err(ApiError::Internal {
+            message: format!("Failed to list sessions: {}", e),
+        });
     }
 };
185-192: Namespace collision risk remains. The sanitization could create collisions when different user IDs map to the same namespace (e.g., "user_123" and "user-123" both become "u-user-123"). This was previously flagged but hasn't been addressed.
crates/basilica-sdk-python/python/basilica/training/__init__.py (1)
128-203: Add validation for empty data list. The `forward_backward` method accesses `examples[0]` at lines 180 and 185 without verifying the list is non-empty, and line 174 calls `max()` on the list, which raises `ValueError: max() arg is an empty sequence` when the list is empty.

🔎 Apply this diff to add validation:

 ) -> float:
     """
     Compute forward pass and gradients.

     Args:
         data: List of training examples (Datum objects or dicts
             with 'input_ids' and 'labels' keys)
         loss_fn: Loss function to use. Options:
             - "cross_entropy": Standard NLL loss (default)
             - "importance_sampling": Policy gradient with importance weighting
             - "ppo": Proximal Policy Optimization with clipping
             - "dpo": Direct Preference Optimization

     Returns:
         Loss value

     Example:
         >>> # Standard supervised learning
         >>> loss = session.forward_backward([{"input_ids": [1, 2, 3]}])
         >>> # RL with importance sampling
         >>> loss = session.forward_backward(data, loss_fn="importance_sampling")
     """
+    if not data:
+        raise TrainingError("data list cannot be empty")
+
     # Normalize input
     examples = []
     for d in data:
🧹 Nitpick comments (1)
crates/basilica-api/src/api/middleware/scope.rs (1)
180-231: Add test coverage for training route scope mappings. The new training session scope mappings lack corresponding test cases in the `test_required_scope_mapping` function. This creates a gap in validation coverage.

🔎 Consider adding tests for the new training routes:

Add these test cases to `test_required_scope_mapping` (after line 417):

// Test training session endpoints
let req = Request::builder()
    .method(Method::POST)
    .uri("/sessions")
    .body(Body::empty())
    .unwrap();
assert_eq!(get_required_scope(&req), Some("training:create".to_string()));

let req = Request::builder()
    .method(Method::GET)
    .uri("/sessions")
    .body(Body::empty())
    .unwrap();
assert_eq!(get_required_scope(&req), Some("training:list".to_string()));

let req = Request::builder()
    .method(Method::GET)
    .uri("/sessions/ts-123")
    .body(Body::empty())
    .unwrap();
assert_eq!(get_required_scope(&req), Some("training:view".to_string()));

let req = Request::builder()
    .method(Method::DELETE)
    .uri("/sessions/ts-123")
    .body(Body::empty())
    .unwrap();
assert_eq!(get_required_scope(&req), Some("training:delete".to_string()));

let req = Request::builder()
    .method(Method::POST)
    .uri("/sessions/ts-123/internal")
    .body(Body::empty())
    .unwrap();
assert_eq!(get_required_scope(&req), Some("training:use".to_string()));

let req = Request::builder()
    .method(Method::POST)
    .uri("/sessions/ts-123/internal/train-123/forward_backward")
    .body(Body::empty())
    .unwrap();
assert_eq!(get_required_scope(&req), Some("training:use".to_string()));
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- crates/basilica-api/src/api/middleware/scope.rs (1 hunks)
- crates/basilica-api/src/api/mod.rs (1 hunks)
- crates/basilica-api/src/api/routes/training.rs (1 hunks)
- crates/basilica-api/src/bin/gen_api_key.rs (1 hunks)
- crates/basilica-sdk-python/python/basilica/training/__init__.py (1 hunks)
- examples/training_example.py (1 hunks)
- services/training-service/README.md (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- crates/basilica-api/src/api/mod.rs
🧰 Additional context used
🧬 Code graph analysis (2)
crates/basilica-sdk-python/python/basilica/training/__init__.py (3)
crates/basilica-api/src/api/routes/training.rs (5)
forward_backward (733-757), optim_step (761-784), sample (788-812), get_session (350-429), list_sessions (482-586)

services/training-service/src/backend.py (4)

forward_backward (226-292), optim_step (294-317), sample (319-381), list_sessions (482-484)

services/training-service/src/server.py (5)

forward_backward (244-273), optim_step (277-283), sample (290-311), get_session (221-227), list_sessions (177-179)
crates/basilica-api/src/api/routes/training.rs (2)
crates/basilica-operator/src/crd/training_session.rs (6)
default_rank (44-46), default_alpha (47-49), default_dropout (50-52), default (34-41), default (80-86), default (145-151)

crates/basilica-api/src/k8s/trait.rs (1)

kube_client (7-7)
🪛 markdownlint-cli2 (0.18.1)
services/training-service/README.md
38-38: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
138-138: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
289-289: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🪛 Ruff (0.14.8)
crates/basilica-sdk-python/python/basilica/training/__init__.py
201-201: Avoid specifying long messages outside the exception class
(TRY003)
215-215: Avoid specifying long messages outside the exception class
(TRY003)
253-253: Avoid specifying long messages outside the exception class
(TRY003)
273-273: Avoid specifying long messages outside the exception class
(TRY003)
290-290: Avoid specifying long messages outside the exception class
(TRY003)
297-297: Avoid specifying long messages outside the exception class
(TRY003)
305-305: Avoid specifying long messages outside the exception class
(TRY003)
346-346: Avoid specifying long messages outside the exception class
(TRY003)
414-414: Avoid specifying long messages outside the exception class
(TRY003)
423-423: Avoid specifying long messages outside the exception class
(TRY003)
431-431: Avoid specifying long messages outside the exception class
(TRY003)
435-435: Avoid specifying long messages outside the exception class
(TRY003)
451-451: Avoid specifying long messages outside the exception class
(TRY003)
468-468: Avoid specifying long messages outside the exception class
(TRY003)
470-470: Avoid specifying long messages outside the exception class
(TRY003)
474-474: Avoid specifying long messages outside the exception class
(TRY003)
488-488: Avoid specifying long messages outside the exception class
(TRY003)
507-513: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
examples/training_example.py
18-18: Loop control variable i not used within loop body
Rename unused i to _i
(B007)
🔇 Additional comments (2)
crates/basilica-api/src/bin/gen_api_key.rs (1)
22-24: LGTM! The addition of `"training:*"` to the default scopes is appropriate and aligns with the new training session endpoints introduced in this PR.

examples/training_example.py (1)

1-27: LGTM! The example effectively demonstrates the training SDK workflow with session creation, training loop, and sampling. The code is clear and appropriate for an example script.
```
=== Creating session for facebook/opt-125m ===
Session: ts-a1b2c3d4

=== Training (3 steps) ===
Step 1: loss=6.2612
Step 2: loss=5.7423
Step 3: loss=5.7501

Sample: The quick brown -> fox jumps over the lazy
```
Add language specification to fenced code block.
Line 38 begins an example output code block without a language identifier, which fails markdown linting compliance.
🔎 Apply this diff to specify the language:
-```
+```text
 === Creating session for facebook/opt-125m ===

📝 Committable suggestion
```text
=== Creating session for facebook/opt-125m ===
Session: ts-a1b2c3d4

=== Training (3 steps) ===
Step 1: loss=6.2612
Step 2: loss=5.7423
Step 3: loss=5.7501

Sample: The quick brown -> fox jumps over the lazy
```
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)
38-38: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
services/training-service/README.md around lines 38 to 48: the fenced code block
starting at line 38 lacks a language spec and triggers markdown linting; change
the opening fence from ``` to ```text (and keep the closing ``` unchanged) so
the example output is explicitly marked as plain text.
Actionable comments posted: 6
♻️ Duplicate comments (5)
crates/basilica-api/src/api/routes/training.rs (5)
200-207: Potential namespace collision risk. Different user IDs like "user_123" and "user-123" both map to "u-user-123", causing namespace collisions. Additionally, Kubernetes namespace names have a 63-character limit that isn't enforced here.
Based on past review comments, a hash-based approach was suggested to ensure unique, valid namespace names.
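A minimal sketch of that hash-based approach (the function name and digest scheme are assumptions, not code from this PR): suffix the sanitized prefix with a short hash of the raw id so distinct ids cannot alias, and cap the result at the 63-character namespace limit.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Derive a collision-resistant namespace from an arbitrary user id:
// sanitized prefix for readability, 8-hex-char hash of the raw id for
// uniqueness, total length bounded at 63 characters.
fn user_namespace(user_id: &str) -> String {
    let mut hasher = DefaultHasher::new();
    user_id.hash(&mut hasher);
    let digest = format!("{:08x}", hasher.finish() as u32);

    let sanitized: String = user_id
        .to_lowercase()
        .chars()
        .map(|c| if c.is_ascii_alphanumeric() { c } else { '-' })
        .collect();
    // "u-" prefix + "-" separator + 8-char digest leaves 52 chars for the id.
    let prefix: String = sanitized.chars().take(52).collect();
    format!("u-{}-{}", prefix.trim_matches('-'), digest)
}
```

Because the digest is computed over the raw id before sanitization, "user_123" and "user-123" produce different namespaces even though their sanitized prefixes are identical.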
283-283: Hardcoded `enableBilling` prevents user override. The `enableBilling` field is always `true`, ignoring potential user preferences.

Based on past review comments, consider adding an optional field to `CreateSessionRequest` with a default of `true`.
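The optional-field suggestion could look roughly like this (field and type names assumed; the real request type lives in the API crate and would likely use serde defaults rather than a plain `Option`):

```rust
// Sketch of the suggested opt-out: None preserves today's always-on billing,
// while callers may pass Some(false) to disable it explicitly.
struct CreateSessionRequest {
    base_model: String,
    enable_billing: Option<bool>, // None => keep the current default (true)
}

impl CreateSessionRequest {
    // Resolve the effective billing setting with a default of true.
    fn billing_enabled(&self) -> bool {
        self.enable_billing.unwrap_or(true)
    }
}
```

This keeps the current behavior for existing clients (omitted field means billing stays on) while making the override possible without a breaking API change.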
323-332: Fragile error detection using string matching. Checking for "409" or "AlreadyExists" in the error string is brittle and may incorrectly ignore real errors if the error message format changes.

Based on past review comments, use typed error matching like the CRD creation does at line 353: `kube::Error::Api(ae) if ae.code == 409`.
530-556: Confusing sentinel error pattern for empty list handling. Using an `ApiError::Internal` with message "empty" as a sentinel value is a code smell that makes the code harder to maintain. Based on past review comments, use proper typed error matching with early return for 404 cases, similar to how `delete_session` handles it at line 497.
398-409: Fix string-based error matching in session retrieval. String-based error matching at lines 398-409 (and in `list_sessions`) is fragile and inconsistent with typed error handling used elsewhere in the file. Replace with typed `kube::Error::Api` pattern matching:

```diff
-    let session = api.get(&session_id).await.map_err(|e| {
-        if e.to_string().contains("404") || e.to_string().contains("NotFound") {
+    let session = api.get(&session_id).await.map_err(|e| match &e {
+        kube::Error::Api(ae) if ae.code == 404 => {
             ApiError::NotFound {
                 message: format!("Session {} not found", session_id),
             }
-        } else {
+        }
+        _ => {
             error!(error = %e, session_id = %session_id, "Failed to get session");
             ApiError::Internal {
                 message: format!("Failed to get session: {}", e),
             }
         }
     })?;
```

Apply the same fix to the `list_sessions` function.
🧹 Nitpick comments (9)
crates/basilica-api/src/api/routes/training.rs (3)
313-320: Consider collision risk with short session IDs. Using only the first segment of UUID v4 (8 hex characters = ~4 billion possibilities) could lead to collisions as the number of sessions grows, especially across all users. While unlikely in the short term, full UUIDs would eliminate this risk entirely.
🔎 Suggested fix:
```diff
     let session_id = format!(
         "ts-{}",
-        uuid::Uuid::new_v4()
-            .to_string()
-            .split('-')
-            .next()
-            .unwrap()
+        uuid::Uuid::new_v4().to_string().replace("-", "")
     );
```

Or simply use the standard hyphenated UUID:

```diff
-    let session_id = format!(
-        "ts-{}",
-        uuid::Uuid::new_v4()
-            .to_string()
-            .split('-')
-            .next()
-            .unwrap()
-    );
+    let session_id = format!("ts-{}", uuid::Uuid::new_v4());
```
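For context on the collision claim, the standard birthday approximation over the 16^8 ≈ 4.3 billion possible 8-hex-character prefixes can be computed directly (an illustrative calculation, not part of the suggested fix):

```python
import math

def collision_probability(n_sessions: int, space: int = 16 ** 8) -> float:
    """Birthday bound: P(collision) ~= 1 - exp(-n^2 / (2 * space))."""
    return 1.0 - math.exp(-(n_sessions ** 2) / (2.0 * space))

# The risk grows quadratically with the number of sessions ever created:
for n in (10_000, 100_000, 1_000_000):
    print(n, f"{collision_probability(n):.4f}")
```

At around 100k total sessions the collision probability is already well above one half, which supports preferring the full UUID.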
634-709: Improve proxy error handling specificity. The proxy always returns `ApiError::ServiceUnavailable` regardless of the underlying error type. This loses important context: a 404 from the training service means the session doesn't exist, while a 500 means an internal error.

🔎 Suggested improvement:

Parse the kube error and map specific HTTP status codes:

```diff
     let response = kube_client
         .request::<serde_json::Value>(request)
         .await
-        .map_err(|e| {
+        .map_err(|e| match &e {
+            kube::Error::Api(ae) if ae.code == 404 => {
+                ApiError::NotFound {
+                    message: format!("Training session {} not found", session_id),
+                }
+            }
+            _ => {
                 error!(
                     error = %e,
                     service = %service_name,
                     namespace = %namespace,
                     path = %path,
                     "Failed to proxy request to training service"
                 );
                 ApiError::ServiceUnavailable
+            }
         })?;
```
711-903: Consider reducing duplication in proxy handlers. All seven proxy handler functions follow an identical pattern with only the path and metric name differing. While the current code is clear, extracting a macro or higher-order function could reduce maintenance burden.

💡 Example approach using a macro:

```rust
macro_rules! proxy_handler {
    ($name:ident, $metric:expr, $method:expr, $path_fn:expr, $body:expr) => {
        pub async fn $name(
            State(state): State<AppState>,
            Extension(auth): Extension<AuthContext>,
            Path(params): Path<_>,
            $($body: Json<serde_json::Value>,)?
        ) -> Result<axum::response::Response> {
            let start = Instant::now();
            let k8s_client = state.k8s.as_ref().ok_or(ApiError::ServiceUnavailable)?;
            let kube_client = k8s_client.kube_client();
            let namespace = user_namespace(&auth.user_id);
            let result = proxy_to_training_service(
                &kube_client,
                &namespace,
                &params.0,
                &$path_fn(params),
                $method,
                $body,
            ).await;
            apimetrics::record_request($metric, stringify!($method), start, result.is_ok());
            result
        }
    };
}
```

This is a nice-to-have optimization, not a blocking issue.
crates/basilica-sdk-python/python/basilica/training/types.py (1)
169-176: Use `asyncio.get_running_loop()` instead of deprecated `get_event_loop()`. `asyncio.get_event_loop()` is deprecated since Python 3.10 and will emit a `DeprecationWarning` when called from a coroutine without a running loop. Since `result_async` is an async method, use `get_running_loop()`, which is the recommended approach.

🔎 Apply this diff:

```diff
     async def result_async(self, timeout: Optional[float] = None):
         """Wait for operation to complete (async)."""
-        loop = asyncio.get_event_loop()
+        loop = asyncio.get_running_loop()
         if timeout is not None:
             return await asyncio.wait_for(
                 loop.run_in_executor(None, self._future.result), timeout=timeout
             )
         return await loop.run_in_executor(None, self._future.result)
```

crates/basilica-sdk-python/python/basilica/training/sampling_client.py (1)
146-150: Consider extracting endpoint resolution to reduce duplication. The endpoint resolution logic here duplicates `_get_endpoint()` with a different suffix. This is minor, but could be consolidated.

🔎 Example refactor:

```diff
+    def _get_endpoint(self, operation: str = "sample") -> str:
+        """Get the endpoint for the given operation."""
+        if self._session_id and self._internal_id:
+            return f"/sessions/{self._session_id}/internal/{self._internal_id}/{operation}"
+        return f"/{operation}"
+
-    def _get_endpoint(self) -> str:
-        """Get the sampling endpoint."""
-        if self._session_id and self._internal_id:
-            return f"/sessions/{self._session_id}/internal/{self._internal_id}/sample"
-        return "/sample"
```

Then use `self._get_endpoint("compute_logprobs")` at line 148.

crates/basilica-sdk-python/python/basilica/training/exceptions.py (3)
8-14: Use explicit `| None` type annotation for optional parameter. Per PEP 484, implicit `Optional` (using `= None` without a `None` type) is discouraged. This improves type checker compatibility.

🔎 Apply this diff:

```diff
-    def __init__(self, message: str, status_code: int = None):
+    def __init__(self, message: str, status_code: int | None = None):
```
54-62: Use explicit `| None` type annotation. Same issue as the base class: `retry_after: float = None` should use an explicit optional type.

🔎 Apply this diff:

```diff
-    def __init__(self, retry_after: float = None):
+    def __init__(self, retry_after: float | None = None):
```
72-75: Consider adding a default message to `CheckpointError`. Unlike other exceptions, `CheckpointError` has no custom `__init__`, so callers must always provide a message. Consider adding a default for consistency with `AuthenticationError` and `InsufficientResourcesError`.

🔎 Apply this diff:

```diff
 class CheckpointError(TrainingError):
     """Checkpoint operation failed."""

-    pass
+    def __init__(self, message: str = "Checkpoint operation failed"):
+        super().__init__(message)
```

crates/basilica-sdk-python/python/basilica/training/service_client.py (1)
388-399: Async client is not properly closed. The async client is lazily created but never closed, which can cause resource leaks. Consider adding an async close method or async context manager support.

🔎 Suggested approach:

```python
    async def close_async(self):
        """Close async resources."""
        if self._async_client:
            await self._async_client.aclose()
            self._async_client = None

    async def __aenter__(self):
        return self

    async def __aexit__(self, *args):
        await self.close_async()
```
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- `crates/basilica-api/src/api/routes/training.rs` (1 hunks)
- `crates/basilica-sdk-python/python/basilica/training/__init__.py` (1 hunks)
- `crates/basilica-sdk-python/python/basilica/training/exceptions.py` (1 hunks)
- `crates/basilica-sdk-python/python/basilica/training/sampling_client.py` (1 hunks)
- `crates/basilica-sdk-python/python/basilica/training/service_client.py` (1 hunks)
- `crates/basilica-sdk-python/python/basilica/training/training_client.py` (1 hunks)
- `crates/basilica-sdk-python/python/basilica/training/types.py` (1 hunks)
- `examples/training_example.py` (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
crates/basilica-sdk-python/python/basilica/training/sampling_client.py (3)
- crates/basilica-sdk-python/python/basilica/training/types.py (7): `ModelInput` (48-66), `SampleResponse` (100-106), `SamplingParams` (70-93), `to_dict` (37-44), `to_dict` (84-93), `from_string` (64-66), `result_async` (169-176)
- crates/basilica-sdk-python/python/basilica/training/exceptions.py (1): `TrainingError` (8-14)
- crates/basilica-sdk-python/python/basilica/training/training_client.py (2): `sample` (395-428), `close` (430-435)

crates/basilica-sdk-python/python/basilica/training/service_client.py (5)
- crates/basilica-sdk-python/python/basilica/training/types.py (1): `GetServerCapabilitiesResponse` (127-132)
- crates/basilica-sdk-python/python/basilica/training/training_client.py (4): `TrainingClient` (22-465), `base_model` (70-72), `session_id` (65-67), `close` (430-435)
- crates/basilica-sdk-python/python/basilica/training/sampling_client.py (2): `SamplingClient` (16-186), `close` (178-180)
- crates/basilica-sdk-python/python/basilica/training/exceptions.py (4): `TrainingError` (8-14), `SessionTimeoutError` (36-44), `AuthenticationError` (47-51), `ValidationError` (65-69)
- crates/basilica-api/src/api/routes/training.rs (2): `get_session` (385-464), `list_sessions` (517-621)

crates/basilica-api/src/api/routes/training.rs (2)
- crates/basilica-operator/src/crd/training_session.rs (9): `default_rank` (44-46), `default_alpha` (47-49), `default_dropout` (50-52), `default` (34-41), `default` (80-86), `default` (145-151), `default_ttl` (214-216), `new` (316-322), `new` (374-391)
- services/training-service/src/backend.py (4): `create_session` (163-224), `forward_backward` (226-292), `optim_step` (294-317), `sample` (319-381)

crates/basilica-sdk-python/python/basilica/training/__init__.py (5)
- crates/basilica-sdk-python/python/basilica/training/types.py (8): `Datum` (17-44), `ModelInput` (48-66), `SamplingParams` (70-93), `SampleResponse` (100-106), `ForwardBackwardResult` (110-115), `ForwardResult` (119-123), `GetServerCapabilitiesResponse` (127-132), `APIFuture` (138-184)
- crates/basilica-sdk-python/python/basilica/training/exceptions.py (10): `TrainingError` (8-14), `SessionNotFoundError` (17-22), `SessionNotReadyError` (25-33), `SessionTimeoutError` (36-44), `AuthenticationError` (47-51), `RateLimitError` (54-62), `ValidationError` (65-69), `CheckpointError` (72-75), `ModelNotFoundError` (78-83), `InsufficientResourcesError` (86-90)
- crates/basilica-sdk-python/python/basilica/training/service_client.py (1): `ServiceClient` (24-399)
- crates/basilica-sdk-python/python/basilica/training/training_client.py (1): `TrainingClient` (22-465)
- crates/basilica-sdk-python/python/basilica/training/sampling_client.py (1): `SamplingClient` (16-186)
🪛 Ruff (0.14.8)
crates/basilica-sdk-python/python/basilica/training/types.py
158-158: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
189-198: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
crates/basilica-sdk-python/python/basilica/training/training_client.py
125-125: Avoid specifying long messages outside the exception class
(TRY003)
188-188: Avoid specifying long messages outside the exception class
(TRY003)
253-253: Avoid specifying long messages outside the exception class
(TRY003)
278-278: Avoid specifying long messages outside the exception class
(TRY003)
299-299: Avoid specifying long messages outside the exception class
(TRY003)
319-319: Avoid specifying long messages outside the exception class
(TRY003)
339-339: Avoid specifying long messages outside the exception class
(TRY003)
381-381: Avoid specifying long messages outside the exception class
(TRY003)
426-426: Avoid specifying long messages outside the exception class
(TRY003)
434-434: Avoid specifying long messages outside the exception class
(TRY003)
crates/basilica-sdk-python/python/basilica/training/sampling_client.py
99-99: Avoid specifying long messages outside the exception class
(TRY003)
154-154: Avoid specifying long messages outside the exception class
(TRY003)
crates/basilica-sdk-python/python/basilica/training/service_client.py
64-66: Avoid specifying long messages outside the exception class
(TRY003)
154-156: Avoid specifying long messages outside the exception class
(TRY003)
190-190: Avoid specifying long messages outside the exception class
(TRY003)
199-199: Avoid specifying long messages outside the exception class
(TRY003)
207-207: Avoid specifying long messages outside the exception class
(TRY003)
235-235: Avoid specifying long messages outside the exception class
(TRY003)
327-327: Avoid specifying long messages outside the exception class
(TRY003)
350-350: Avoid specifying long messages outside the exception class
(TRY003)
352-352: Avoid specifying long messages outside the exception class
(TRY003)
356-356: Avoid specifying long messages outside the exception class
(TRY003)
374-374: Avoid specifying long messages outside the exception class
(TRY003)
crates/basilica-sdk-python/python/basilica/training/exceptions.py
11-11: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
57-57: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
95-106: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
crates/basilica-sdk-python/python/basilica/training/__init__.py
69-96: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
🔇 Additional comments (16)
examples/training_example.py (2)
45-77: LGTM! The training loop demonstrates the SDK API well, with proper context manager usage for session cleanup, clear forward-backward/optim-step pattern, and checkpoint saving.
108-140: LGTM! The async example properly demonstrates the async API pattern with `try/finally` for cleanup.
crates/basilica-sdk-python/python/basilica/training/types.py (2)
16-45: LGTM! The `Datum` dataclass is well-documented with clear field semantics and a clean `to_dict()` serialization method.
69-93: LGTM! `SamplingParams` provides sensible defaults and the `to_dict()` method handles `None` gracefully for `stop_sequences`.
crates/basilica-sdk-python/python/basilica/training/training_client.py (2)
219-258: LGTM! The `optim_step` method correctly handles optional hyperparameter overrides and updates the internal step counter from the server response.
430-441: LGTM! The `close()` method appropriately ignores 404 responses (idempotent deletion) and uses `wait=False` for executor shutdown during cleanup, which is acceptable since the session is being terminated.
crates/basilica-sdk-python/python/basilica/training/sampling_client.py (2)
26-48: LGTM! The constructor clearly documents the dual-use pattern for session-bound vs standalone sampling, with appropriate optional parameters.
56-105: LGTM! The `sample()` method cleanly handles payload construction, optional model hints, and gracefully handles both single and batch response formats with `data.get("samples", [data])`.
crates/basilica-sdk-python/python/basilica/training/__init__.py (2)
1-35: LGTM! Excellent module docstring with a practical quick-start example and clear documentation of supported loss functions and authentication.
39-96: LGTM! The module correctly re-exports all public symbols from submodules, and the `__all__` list is well-organized with logical grouping (clients, types, exceptions) that aids discoverability.
crates/basilica-sdk-python/python/basilica/training/service_client.py (6)
36-73: LGTM! The initialization logic handles API key and endpoint configuration correctly with appropriate fallbacks to environment variables and sensible defaults.
75-90: LGTM! Capability query methods are clean and correctly use `raise_for_status()` for error handling.
247-307: No readiness polling for restored sessions. Unlike `create_lora_training_client`, which polls until the session is ready, `create_training_client_from_state` and `create_training_client_from_state_with_optimizer` return immediately after the API call. If session restoration is asynchronous, the returned `TrainingClient` may reference a session that isn't ready yet. Consider adding similar readiness polling, or document that restoration is synchronous.
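A minimal polling helper along those lines could look like this (a sketch only; the `phase` field and "ready"/"failed" values are taken from the surrounding review, while the helper itself is hypothetical):

```python
import time

def wait_until_ready(get_phase, timeout: float = 300.0, interval: float = 2.0) -> str:
    """Poll `get_phase()` until it returns "ready", or raise on timeout.

    `get_phase` is any zero-argument callable returning the session phase
    string, e.g. a closure around a GET on /sessions/{session_id}.
    """
    phase = "unknown"
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        phase = get_phase()
        if phase == "ready":
            return phase
        if phase == "failed":
            raise RuntimeError("session restoration failed")
        time.sleep(interval)
    raise TimeoutError(f"session not ready after {timeout}s (last phase: {phase})")
```

Both restore methods could call this helper after the create request, mirroring `create_lora_training_client`.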
309-333: LGTM! Clean validation ensuring at least one model reference is provided before creating the sampling client.
365-376: LGTM! Simple implementation that correctly handles errors and returns the session list.
163-187: Empty `checkpointStorage` fields should be required. The `bucket` and `path` fields in `checkpointStorage` are set to empty strings without validation. Since storage operations require these values, either provide non-empty defaults, require them as parameters, or add validation to reject empty values at the API level.
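Client-side, that validation could be as small as the following sketch (field names mirror the review's `checkpointStorage` payload; the helper name is illustrative, not existing SDK API):

```python
def validate_checkpoint_storage(storage: dict) -> dict:
    """Reject empty required checkpointStorage fields before sending the request."""
    for field in ("backend", "bucket", "path"):
        value = storage.get(field, "")
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"checkpointStorage.{field} must be a non-empty string")
    return storage
```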
```rust
#[derive(Debug, Clone, Deserialize)]
#[serde(rename_all = "camelCase")]
pub struct CheckpointStorageRequest {
    pub backend: String,
    pub bucket: String,
    pub path: String,
    #[serde(default)]
    pub credentials_secret: Option<String>,
}
```
Document accepted backend values.
The `backend` field has no validation or documentation indicating which values are accepted (e.g., "s3", "gcs", "local"). Consider adding validation or at minimum doc comments listing the supported storage backends.
🔎 Suggested improvement:
Add documentation:
```diff
 #[derive(Debug, Clone, Deserialize)]
 #[serde(rename_all = "camelCase")]
 pub struct CheckpointStorageRequest {
+    /// Storage backend type. Supported values: "s3", "gcs"
     pub backend: String,
     pub bucket: String,
     pub path: String,
```

📝 Committable suggestion
pub path: String,📝 Committable suggestion
```rust
#[derive(Debug, Clone, Deserialize)]
#[serde(rename_all = "camelCase")]
pub struct CheckpointStorageRequest {
    /// Storage backend type. Supported values: "s3", "gcs"
    pub backend: String,
    pub bucket: String,
    pub path: String,
    #[serde(default)]
    pub credentials_secret: Option<String>,
}
```
🤖 Prompt for AI Agents
In crates/basilica-api/src/api/routes/training.rs around lines 121 to 129, the
CheckpointStorageRequest struct's backend field lacks documentation or
validation about accepted values; add a doc comment above the backend field
enumerating supported backends (e.g., "s3", "gcs", "local") and, if practical,
enforce validation by either changing the type to a dedicated enum with
Deserialize implementations for those variants or adding a simple runtime check
where the struct is deserialized/used that returns a clear error for unsupported
values.
```python
        resp = self._client.get(f"/sessions/{session_id}")

        if resp.status_code == 404:
            raise TrainingError(f"Session {session_id} not found")
        if not resp.is_success:
            raise TrainingError(f"Failed to get session: {resp.text}")

        data = resp.json()
        if data.get("phase") != "ready":
            raise TrainingError(f"Session not ready: {data.get('phase')}")

        return TrainingClient(
            client=self._client,
            session_id=session_id,
            internal_id=f"train-{session_id}",
            base_model=data.get("baseModel", "unknown"),
        )
```
🛠️ Refactor suggestion | 🟠 Major
Use specific exception types from `exceptions.py`.
The module defines `SessionNotFoundError` and `SessionNotReadyError` specifically for these cases, but the generic `TrainingError` is used instead. Using the specific exceptions improves error handling for callers.
🔎 Apply this diff:
```diff
+from .exceptions import (
+    TrainingError,
+    SessionTimeoutError,
+    SessionNotFoundError,
+    SessionNotReadyError,
+    AuthenticationError,
+    ValidationError,
+)
```
+)Then update the method:
```diff
     if resp.status_code == 404:
-        raise TrainingError(f"Session {session_id} not found")
+        raise SessionNotFoundError(session_id)
     if not resp.is_success:
         raise TrainingError(f"Failed to get session: {resp.text}")

     data = resp.json()
     if data.get("phase") != "ready":
-        raise TrainingError(f"Session not ready: {data.get('phase')}")
+        raise SessionNotReadyError(session_id, data.get("phase", "unknown"))
```

Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 Ruff (0.14.8)
350-350: Avoid specifying long messages outside the exception class
(TRY003)
352-352: Avoid specifying long messages outside the exception class
(TRY003)
356-356: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
In crates/basilica-sdk-python/python/basilica/training/service_client.py around
lines 347 to 363, the code raises generic TrainingError for a 404 and for a
session not-ready state; replace those with the more specific exceptions defined
in exceptions.py by importing SessionNotFoundError and SessionNotReadyError (or
adjusting existing imports) and raise SessionNotFoundError(f"Session
{session_id} not found") for the 404 branch and SessionNotReadyError(f"Session
not ready: {data.get('phase')}") when the phase is not "ready", leaving the
other TrainingError usages unchanged.
```python
    async def _get_async_client(self) -> httpx.AsyncClient:
        """Get or create async HTTP client."""
        if self._async_client is None:
            self._async_client = httpx.AsyncClient(
                base_url=self.endpoint,
                headers={"Authorization": f"Bearer {self.api_key}"},
                timeout=300.0,
            )
        return self._async_client
```
Async client uses hardcoded timeout instead of configured value.
The sync client uses the `timeout` parameter from `__init__`, but the async client hardcodes `300.0`. Store the timeout in `__init__` and reuse it here for consistency.
🔎 Apply this diff:
In `__init__`:

```diff
         self._async_client = None  # Lazy init
+        self._timeout = timeout
```

In `_get_async_client`:

```diff
             self._async_client = httpx.AsyncClient(
                 base_url=self.endpoint,
                 headers={"Authorization": f"Bearer {self.api_key}"},
-                timeout=300.0,
+                timeout=self._timeout,
             )
```

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In crates/basilica-sdk-python/python/basilica/training/service_client.py around
lines 378-386, the async HTTP client is created with a hardcoded timeout of
300.0; update __init__ to store the provided timeout (e.g., self.timeout =
timeout) and change the AsyncClient construction to use that stored value
(timeout=self.timeout) so the async client honors the configured timeout
consistently with the sync client.
```python
        def _call():
            normalized = self._normalize_data(data)

            # Pad sequences
            max_len = max(len(d["input_ids"]) for d in normalized)
            input_ids = []
            attention_mask = []

            for d in normalized:
                ids = d["input_ids"]
                pad_len = max_len - len(ids)
                input_ids.append(ids + [0] * pad_len)
                attention_mask.append([1] * len(ids) + [0] * pad_len)

            resp = self._client.post(
                self._proxy("forward"),
                json={"input_ids": input_ids, "attention_mask": attention_mask},
            )
            if not resp.is_success:
                raise TrainingError(f"forward failed: {resp.text}")
            r = resp.json()
            return ForwardResult(
                logprobs=r["logprobs"], tokens_processed=r["tokens_processed"]
            )

        return APIFuture(self._executor.submit(_call), ForwardResult)
```
Add validation for empty data list.
If `data` is empty, `max(len(d["input_ids"]) for d in normalized)` at line 110 will raise a `ValueError: max() arg is an empty sequence`. Add a guard at the start of `_call()`.
🔎 Apply this diff:
```diff
         def _call():
             normalized = self._normalize_data(data)
+            if not normalized:
+                raise ValueError("data list cannot be empty")

             # Pad sequences
             max_len = max(len(d["input_ids"]) for d in normalized)
```

📝 Committable suggestion
```python
        def _call():
            normalized = self._normalize_data(data)
            if not normalized:
                raise ValueError("data list cannot be empty")

            # Pad sequences
            max_len = max(len(d["input_ids"]) for d in normalized)
            input_ids = []
            attention_mask = []

            for d in normalized:
                ids = d["input_ids"]
                pad_len = max_len - len(ids)
                input_ids.append(ids + [0] * pad_len)
                attention_mask.append([1] * len(ids) + [0] * pad_len)

            resp = self._client.post(
                self._proxy("forward"),
                json={"input_ids": input_ids, "attention_mask": attention_mask},
            )
            if not resp.is_success:
                raise TrainingError(f"forward failed: {resp.text}")
            r = resp.json()
            return ForwardResult(
                logprobs=r["logprobs"], tokens_processed=r["tokens_processed"]
            )

        return APIFuture(self._executor.submit(_call), ForwardResult)
```
🧰 Tools
🪛 Ruff (0.14.8)
125-125: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
In crates/basilica-sdk-python/python/basilica/training/training_client.py around
lines 106 to 131, add a guard after normalizing data to handle an empty input
list: after normalized = self._normalize_data(data) check if not normalized and
raise a TrainingError (or otherwise return a failed future) with a clear message
such as "forward failed: no data provided" to avoid the max() empty-sequence
ValueError; keep the rest of the function flow unchanged so the APIFuture is
still returned by _call().
| def _call(): | ||
| normalized = self._normalize_data(data) | ||
|
|
||
| # Default labels to input_ids (causal LM) | ||
| for d in normalized: | ||
| if "labels" not in d or d["labels"] is None: | ||
| d["labels"] = d["input_ids"].copy() | ||
|
|
||
| # Pad sequences | ||
| max_len = max(len(d["input_ids"]) for d in normalized) | ||
| input_ids = [] | ||
| labels = [] | ||
| attention_mask = [] | ||
| loss_weights = [] | ||
|
|
||
| for d in normalized: | ||
| ids = d["input_ids"] | ||
| pad_len = max_len - len(ids) | ||
| input_ids.append(ids + [0] * pad_len) | ||
| labels.append(d["labels"] + [-100] * pad_len) | ||
| attention_mask.append([1] * len(ids) + [0] * pad_len) | ||
| if d.get("loss_weights"): | ||
| loss_weights.append(d["loss_weights"] + [0.0] * pad_len) | ||
|
|
||
| payload = { | ||
| "input_ids": input_ids, | ||
| "attention_mask": attention_mask, | ||
| "labels": labels, | ||
| "loss_fn": loss_fn, | ||
| } | ||
| if loss_weights: | ||
| payload["loss_weights"] = loss_weights | ||
|
|
||
| resp = self._client.post(self._proxy("forward_backward"), json=payload) | ||
| if not resp.is_success: | ||
| raise TrainingError(f"forward_backward failed: {resp.text}") | ||
|
|
||
| r = resp.json() | ||
| return ForwardBackwardResult( | ||
| loss=r["loss"], | ||
| logprobs=r.get("logprobs", []), | ||
| tokens_processed=r.get("tokens_processed", 0), | ||
| ) | ||
|
|
||
| return APIFuture(self._executor.submit(_call), ForwardBackwardResult) |
There was a problem hiding this comment.
Add validation for empty data list in `forward_backward`.
Same issue as `forward()`: line 162 will raise `ValueError` if `data` is empty. Add a guard after normalization.
🔎 Apply this diff:
```diff
         def _call():
             normalized = self._normalize_data(data)
+            if not normalized:
+                raise ValueError("data list cannot be empty")

             # Default labels to input_ids (causal LM)
             for d in normalized:
```

🧰 Tools
🪛 Ruff (0.14.8)
188-188: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
In crates/basilica-sdk-python/python/basilica/training/training_client.py around
lines 153 to 197, after normalizing the input data add a guard that checks if
the normalized list is empty and raise a TrainingError (or appropriate API
error) with a clear message (e.g. "forward_backward requires non-empty data")
before calling max(len(...)) to avoid ValueError; ensure the new check occurs
immediately after normalized = self._normalize_data(data) and returns/raises
consistently with other methods like forward().
```python
    if os.path.exists("build/api-token.txt"):
        api_key = open("build/api-token.txt").read().strip()
```
Resource leak: file handle not closed.
The file is opened without a context manager, leaving the handle unclosed after reading. This pattern is repeated in `sampling_example()` and `async_example()`.
🔎 Apply this diff to fix the resource leak:
```diff
-    if os.path.exists("build/api-token.txt"):
-        api_key = open("build/api-token.txt").read().strip()
+    if os.path.exists("build/api-token.txt"):
+        with open("build/api-token.txt") as f:
+            api_key = f.read().strip()
```
+ api_key = f.read().strip()Apply the same fix at lines 83-84 and 114-115.
📝 Committable suggestion
```python
    if os.path.exists("build/api-token.txt"):
        with open("build/api-token.txt") as f:
            api_key = f.read().strip()
```
🤖 Prompt for AI Agents
In examples/training_example.py around lines 36-37 (and apply the same fix at
lines 83-84 and 114-115), the file is opened with open(...) and read without
using a context manager which leaks file handles; replace those open(...).read()
calls with a with open(...) as f: api_key = f.read().strip() pattern (i.e., use
a with-statement context manager to open, read, strip, and automatically close
the file) for each occurrence.
Actionable comments posted: 4
♻️ Duplicate comments (11)
crates/basilica-sdk-python/python/basilica/training/training_client.py (2)
107-132: Add validation for empty data list. If `data` is empty, `max(len(d["input_ids"]) for d in normalized)` at line 111 will raise a `ValueError: max() arg is an empty sequence`. This issue was previously raised and remains unaddressed.
180-224: Add validation for empty data list in `forward_backward`. Same issue as `forward()`: line 189 will raise `ValueError` if `data` is empty. This issue was previously raised and remains unaddressed.
examples/training_example.py (1)
36-37: Resource leak: file handles not closed. The file is opened without a context manager at lines 37, 106, and 137, leaving handles unclosed after reading. This issue was previously raised and remains unaddressed.
Also applies to: 105-106, 136-137
crates/basilica-api/src/api/routes/training.rs (6)
200-207: Potential namespace collision risk. The sanitization logic could create collisions: different user IDs like "user_123" and "user-123" would both map to "u-user-123". This issue was previously raised and remains unaddressed.
283-283: Hardcoded `enableBilling` prevents user override. The `enableBilling` field is always set to `true`, ignoring any potential user preference. This issue was previously raised and remains unaddressed.
323-332: Fragile error detection using string matching. Checking for "409" or "AlreadyExists" in the error string is brittle and may miss actual errors. This issue was previously raised and remains unaddressed.
530-556: Confusing sentinel error pattern for empty list handling. Using an error with message "empty" as a sentinel value is a code smell. This issue was previously raised and remains unaddressed.
121-129: Document accepted backend values.The
backendfield has no validation or documentation indicating which values are accepted. This issue was previously raised and remains unaddressed.
**398-409: Fragile error detection still present despite being marked as addressed.** The past review comment indicates this was addressed in commit 0c1cf26, but the code still uses string-based error matching with `e.to_string().contains("404")` at line 399.

🔎 Apply this fix to use typed error matching:

```diff
-    let session = api.get(&session_id).await.map_err(|e| {
-        if e.to_string().contains("404") || e.to_string().contains("NotFound") {
+    let session = api.get(&session_id).await.map_err(|e| match &e {
+        kube::Error::Api(ae) if ae.code == 404 => {
             ApiError::NotFound {
                 message: format!("Session {} not found", session_id),
             }
-        } else {
+        }
+        _ => {
             error!(error = %e, session_id = %session_id, "Failed to get session");
             ApiError::Internal {
                 message: format!("Failed to get session: {}", e),
             }
         }
     })?;
```

services/training-service/src/backend.py (2)
**207-209: `get_peft_model` modifies the base model; `print_trainable_parameters` uses stdout.** `get_peft_model` can modify the base model in place, which may cause issues if the base model is cached and shared across sessions. `print_trainable_parameters()` outputs to stdout rather than the structured logger. This issue was previously raised and remains unaddressed.
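One way to avoid mutating a shared base model is to copy it before applying the adapter. A toy sketch of the idea (`fake_get_peft_model` is a stand-in for peft's `get_peft_model`, which mutates its argument; all names here are illustrative):

```python
import copy

def make_lora_model(base_model, lora_cfg, apply_adapter):
    # Deep-copy first so a cached base model shared across
    # sessions is never modified in place by the adapter.
    return apply_adapter(copy.deepcopy(base_model), lora_cfg)

def fake_get_peft_model(model, cfg):
    # Stand-in for peft.get_peft_model: mutates its argument.
    model["lora"] = cfg
    return model
```

With this wrapper the original object is untouched, while the returned copy carries the adapter configuration.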
**531-559: Critical bug: optimizer references stale parameters after loading a checkpoint.** After `PeftModel.from_pretrained` creates a new model at line 543, `session.optimizer` still references parameters from the old model. The optimizer state is loaded, but it won't update the new model's parameters. This issue was previously raised and remains unaddressed.
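The failure mode can be demonstrated with a toy optimizer that, like torch optimizers, holds references to the parameter objects it was constructed with (all names here are illustrative; the fix in the real code is to rebuild the optimizer over the new model's trainable parameters and then load the saved optimizer state):

```python
class ToyOptimizer:
    # Stand-in for a torch optimizer: stores references to the
    # parameter objects passed at construction time.
    def __init__(self, params):
        self.params = list(params)

old_params = [[0.0]]  # parameters of the original model
new_params = [[1.0]]  # parameters of the checkpoint-loaded model

opt = ToyOptimizer(old_params)
# After swapping in the new model, the old optimizer still points
# at the discarded model's parameters:
stale = opt.params[0] is old_params[0]
# Fix: recreate the optimizer over the new model's parameters.
opt = ToyOptimizer(new_params)
fixed = opt.params[0] is new_params[0]
```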
🧹 Nitpick comments (4)
crates/basilica-api/src/api/routes/training.rs (1)
**713-959: Significant code duplication in proxy handlers.** The proxy handler functions (`create_internal_session`, `get_internal_session`, `forward_backward`, `optim_step`, `sample`, `save_checkpoint`, `load_checkpoint`, `forward`, `compute_logprobs`) follow an identical pattern with only minor variations in paths and methods. Consider extracting a common helper or using a macro to reduce duplication.
💡 Example refactor using a helper function:

```rust
async fn proxy_handler(
    state: State<AppState>,
    auth: Extension<AuthContext>,
    session_id: String,
    internal_session_id: Option<String>,
    path_suffix: &str,
    method: http::Method,
    body: Option<serde_json::Value>,
    metric_name: &str,
) -> Result<axum::response::Response> {
    let start = Instant::now();
    let k8s_client = state.k8s.as_ref().ok_or(ApiError::ServiceUnavailable)?;
    let kube_client = k8s_client.kube_client();
    let namespace = user_namespace(&auth.user_id);

    let path = if let Some(internal_id) = internal_session_id {
        format!("/sessions/{}/{}", internal_id, path_suffix)
    } else {
        format!("/sessions{}", path_suffix)
    };

    let result = proxy_to_training_service(
        &kube_client,
        &namespace,
        &session_id,
        &path,
        method,
        body,
    )
    .await;

    apimetrics::record_request(metric_name, method.as_str(), start, result.is_ok());
    result
}
```

scripts/local-training-e2e.sh (3)
**58-62: Consider adding a note about the security trade-off.** Installing k3d via `curl | bash` without signature verification is convenient but poses a security risk. While acceptable for local development scripts, consider documenting this trade-off or providing an option to skip auto-installation.
**221-250: Fragile output parsing could break silently.** Lines 229-230 parse the `gen-api-key` output using `grep` and `sed`, which is fragile if the output format changes. While there's a validation check for empty values on line 232, consider making the key generation output more structured (e.g., JSON) for robust parsing.

Suggested approach: if `gen-api-key` can emit JSON, parse it with `jq`:

```sh
KEY_OUTPUT=$(cargo run -p basilica-api --bin gen-api-key -- \
    --user testuser --name training-e2e --scopes "training:*" \
    --format json 2>/dev/null)
API_TOKEN=$(echo "$KEY_OUTPUT" | jq -r '.token')
SQL_INSERT=$(echo "$KEY_OUTPUT" | jq -r '.sql')
```

Otherwise, document the expected format and add more validation:

```diff
 # Extract the token and SQL
 API_TOKEN=$(echo "$KEY_OUTPUT" | grep "Token (Authorization):" | sed 's/.*Bearer //')
 SQL_INSERT=$(echo "$KEY_OUTPUT" | grep "^INSERT INTO")

 if [ -z "$API_TOKEN" ] || [ -z "$SQL_INSERT" ]; then
     log_error "Failed to generate API key"
+    log_error "Expected output format not found. Output was:"
     echo "$KEY_OUTPUT"
     exit 1
 fi
```
**362-401: Reduce duplication in session creation logic.** The session creation payload and curl invocation are duplicated for authenticated vs. unauthenticated requests, which makes the function harder to maintain.

🔎 Consolidate the logic:

```sh
# Create a training session (CPU mode with gpu_count: 0)
log_info "Creating training session (CPU mode)..."

CURL_ARGS=(-s -X POST "$API_URL/sessions" -H "Content-Type: application/json")
[ -n "$AUTH_HEADER" ] && CURL_ARGS+=(-H "$AUTH_HEADER")

RESPONSE=$(curl "${CURL_ARGS[@]}" -d '{
    "baseModel": "facebook/opt-125m",
    "checkpointStorage": {
        "backend": "r2",
        "bucket": "test-bucket",
        "path": "test/checkpoints"
    },
    "loraConfig": { "rank": 8, "alpha": 16 },
    "gpuResources": { "count": 0 }
}')
```
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
crates/basilica-api/src/api/middleware/scope.rs(1 hunks)crates/basilica-api/src/api/mod.rs(1 hunks)crates/basilica-api/src/api/routes/training.rs(1 hunks)crates/basilica-sdk-python/python/basilica/training/training_client.py(1 hunks)examples/training_example.py(1 hunks)scripts/local-training-e2e.sh(1 hunks)services/training-service/src/backend.py(1 hunks)services/training-service/src/server.py(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- crates/basilica-api/src/api/mod.rs
🧰 Additional context used
🧬 Code graph analysis (5)
crates/basilica-sdk-python/python/basilica/training/training_client.py (3)
- `crates/basilica-sdk-python/python/basilica/training/types.py` (9): `APIFuture` (138-184), `Datum` (17-44), `ForwardBackwardResult` (110-115), `ForwardResult` (119-123), `ModelInput` (48-66), `result` (163-167), `to_dict` (37-44), `to_dict` (84-93), `result_async` (169-176)
- `crates/basilica-sdk-python/python/basilica/training/sampling_client.py` (4): `SamplingClient` (16-186), `compute_logprobs` (128-158), `sample` (56-105), `compute_logprobs_async` (172-176)
- `crates/basilica-sdk-python/python/basilica/training/exceptions.py` (1): `TrainingError` (8-14)

scripts/local-training-e2e.sh (3)

- `crates/basilica-operator/src/runtime.rs` (1): `run` (162-361)
- `crates/basilica-api/src/server.rs` (1): `run` (1188-1226)
- `orchestrator/cluster-manager/src/clustermgr/commands/deployments.py` (1): `deployments` (103-213)

examples/training_example.py (2)

- `crates/basilica-sdk-python/python/basilica/training/types.py` (4): `Datum` (17-44), `ModelInput` (48-66), `SamplingParams` (70-93), `from_ints` (59-61)
- `crates/basilica-sdk-python/python/basilica/training/sampling_client.py` (1): `compute_logprobs` (128-158)

services/training-service/src/server.py (1)

- `services/training-service/src/backend.py` (17): `ForwardBackwardResult` (70-75), `ForwardResult` (62-66), `LoraConfiguration` (16-32), `OptimizerConfiguration` (36-44), `SampleResult` (79-85), `TrainingBackend` (88-608), `list_sessions` (579-581), `create_session` (171-232), `get_session_status` (561-571), `delete_session` (573-577), `forward` (302-349), `compute_logprobs` (351-389), `forward_backward` (234-300), `optim_step` (391-414), `sample` (416-478), `save_state` (480-529), `load_state` (531-559)

crates/basilica-api/src/api/routes/training.rs (4)

- `services/training-service/src/server.py` (6): `create_session` (213-244), `forward_backward` (323-352), `optim_step` (356-362), `sample` (369-390), `forward` (271-295), `compute_logprobs` (299-319)
- `crates/basilica-operator/src/crd/training_session.rs` (5): `default` (34-41), `default` (80-86), `default` (145-151), `new` (316-322), `new` (374-391)
- `services/training-service/src/backend.py` (6): `create_session` (171-232), `forward_backward` (234-300), `optim_step` (391-414), `sample` (416-478), `forward` (302-349), `compute_logprobs` (351-389)
- `crates/basilica-api/src/k8s/trait.rs` (1): `kube_client` (7-7)
🪛 Ruff (0.14.8)
crates/basilica-sdk-python/python/basilica/training/training_client.py
Lines 126, 155, 215, 280, 305, 326, 346, 366, 408, 453, 461: Avoid specifying long messages outside the exception class (TRY003)
🪛 Shellcheck (0.11.0)
scripts/local-training-e2e.sh
[warning] 422-422: i appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 477-477: Use single quotes, otherwise this expands now rather than when signalled.
(SC2064)
🔇 Additional comments (12)
crates/basilica-api/src/api/middleware/scope.rs (1)

**180-242: LGTM! Training scope mappings are well-structured.** The scope mappings for training endpoints follow consistent patterns and provide appropriate granularity (training:create, training:list, training:view, training:use, training:delete). The distinction between public CRD management routes and internal proxy routes is clear.
services/training-service/src/server.py (1)
**1-421: LGTM! Clean FastAPI implementation.** The server implementation follows FastAPI best practices with:
- Proper lifespan management for backend initialization
- Well-structured Pydantic models with validation constraints
- Consistent error handling (ValueError → 404, generic Exception → 500)
- Appropriate use of structured logging
- Correct tensor conversions for backend operations
The code is clean, maintainable, and production-ready.
scripts/local-training-e2e.sh (10)
**1-39: LGTM!** Good use of strict error handling (`set -euo pipefail`), clear documentation, and well-organized logging helpers.

**67-96: LGTM!** The cluster creation logic is well-structured with proper reuse detection and readiness checks.

**129-183: LGTM!** Image building, loading, and operator deployment are well-structured with proper readiness checks.

**185-211: LGTM!** Excellent handling of stale Postgres data and robust retry logic with clear failure messaging.

**252-286: LGTM!** The API startup and key generation workflows are clear and well-documented with helpful user guidance.

**288-333: LGTM!** Comprehensive cleanup logic that handles both CRD-managed and orphaned resources with proper waiting for termination.
**470-477: Trap cleanup is correct, but shellcheck prefers explicit quoting.** The trap on line 477 correctly expands `$PF_PID` at trap-set time (after line 473 sets it). Shellcheck warns (SC2064) because double quotes can sometimes cause unintended expansions, but this usage is safe.

For clarity and to silence the warning, you could make the intent explicit:

```diff
-trap "kill $PF_PID 2>/dev/null" EXIT
+# shellcheck disable=SC2064
+trap "kill $PF_PID 2>/dev/null" EXIT
```

Or capture the PID in a way that makes the intent clearer:

```sh
readonly PF_PID=$!
trap "kill ${PF_PID} 2>/dev/null" EXIT
```
**630-697: LGTM!** Utility functions for status display, log tailing, and cluster teardown are well-organized and handle errors gracefully.

**699-791: LGTM!** Excellent documentation in the help text and clean command dispatcher with good error handling for unknown commands.

**420-430: Shellcheck warning about unused loop variable is harmless.** Shellcheck warns (SC2034) that the loop variable `i` on line 422 is unused. This is a common Bash pattern for counting iterations and is perfectly fine. The warning can be safely ignored, or you can suppress it with a comment if desired:

```sh
# shellcheck disable=SC2034
for i in {1..60}; do
```
```sh
# Wait for Envoy Gateway
log_info "Waiting for Envoy Gateway to be ready..."
sleep 10
kubectl wait --namespace envoy-gateway-system \
    --for=condition=Available deployment/envoy-gateway \
    --timeout=180s || log_warn "Timeout waiting for envoy-gateway, continuing..."
```
**Verify gateway readiness or fail explicitly.** The timeout warning on line 124 allows the script to continue even if the Envoy Gateway deployment never becomes available. This could cause cryptic failures in downstream operations (like deploying HTTPRoutes or testing).

🔎 Consider failing explicitly or adding a retry:

```diff
 kubectl wait --namespace envoy-gateway-system \
     --for=condition=Available deployment/envoy-gateway \
-    --timeout=180s || log_warn "Timeout waiting for envoy-gateway, continuing..."
+    --timeout=180s || {
+        log_error "Envoy Gateway failed to become ready"
+        exit 1
+    }
```

Alternatively, if you want to be lenient in dev scenarios, add a note that downstream failures may occur:

```diff
 kubectl wait --namespace envoy-gateway-system \
     --for=condition=Available deployment/envoy-gateway \
-    --timeout=180s || log_warn "Timeout waiting for envoy-gateway, continuing..."
+    --timeout=180s || log_warn "Timeout waiting for envoy-gateway. Continuing, but downstream operations may fail."
```
+ --timeout=180s || log_warn "Timeout waiting for envoy-gateway. Continuing, but downstream operations may fail."📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Wait for Envoy Gateway | |
| log_info "Waiting for Envoy Gateway to be ready..." | |
| sleep 10 | |
| kubectl wait --namespace envoy-gateway-system \ | |
| --for=condition=Available deployment/envoy-gateway \ | |
| --timeout=180s || log_warn "Timeout waiting for envoy-gateway, continuing..." | |
| # Wait for Envoy Gateway | |
| log_info "Waiting for Envoy Gateway to be ready..." | |
| sleep 10 | |
| kubectl wait --namespace envoy-gateway-system \ | |
| --for=condition=Available deployment/envoy-gateway \ | |
| --timeout=180s || { | |
| log_error "Envoy Gateway failed to become ready" | |
| exit 1 | |
| } |
| # Wait for Envoy Gateway | |
| log_info "Waiting for Envoy Gateway to be ready..." | |
| sleep 10 | |
| kubectl wait --namespace envoy-gateway-system \ | |
| --for=condition=Available deployment/envoy-gateway \ | |
| --timeout=180s || log_warn "Timeout waiting for envoy-gateway, continuing..." | |
| # Wait for Envoy Gateway | |
| log_info "Waiting for Envoy Gateway to be ready..." | |
| sleep 10 | |
| kubectl wait --namespace envoy-gateway-system \ | |
| --for=condition=Available deployment/envoy-gateway \ | |
| --timeout=180s || log_warn "Timeout waiting for envoy-gateway. Continuing, but downstream operations may fail." |
🤖 Prompt for AI Agents
In scripts/local-training-e2e.sh around lines 119 to 124, the kubectl wait
fallback currently only logs a warning on timeout and lets the script continue,
which can cause downstream cryptic failures; modify this block to fail
explicitly on timeout (exit non-zero) so the pipeline stops when the Envoy
Gateway is not Available, or implement a retry loop with a configurable retry
count/backoff and clear logging and finally exit non-zero if still unavailable;
optionally honor an environment variable (e.g., ALLOW_GATEWAY_WARN) to preserve
the current lenient behavior when set.
```sh
# Health check
log_info "Checking API health..."
if ! curl -sf "$API_URL/health" > /dev/null; then
    log_error "API not running. Start it with: $0 api"
    exit 1
fi
curl -s "$API_URL/health" | jq .
```
**Add timeout to curl health check.** The health check curl on line 356 has no timeout, which could cause the script to hang indefinitely if the API is unresponsive.

🔎 Add a timeout:

```diff
-if ! curl -sf "$API_URL/health" > /dev/null; then
+if ! curl -sf --max-time 10 "$API_URL/health" > /dev/null; then
     log_error "API not running. Start it with: $0 api"
     exit 1
 fi
```
🤖 Prompt for AI Agents
In scripts/local-training-e2e.sh around lines 354 to 360, the curl health check
(line 356) lacks a timeout and can hang indefinitely; modify the curl invocation
to include a sensible timeout (for example --max-time 5 or -m 5) so the command
fails fast on unresponsive API, and keep the existing -s and -f flags; also
consider adding the same timeout to the subsequent curl that pipes to jq to
avoid hanging when printing the health response.
```sh
if ! curl -sf "$TRAINING_URL/health" > /dev/null 2>&1; then
    log_error "Training service not accessible via port-forward"
    kill $PF_PID 2>/dev/null
    return 1
fi
curl -s "$TRAINING_URL/health" | jq .
```
**Add timeout to training service health check.** The curl on line 483 has no timeout and could hang if the port-forward is established but the service is unresponsive.

🔎 Add a timeout:

```diff
-if ! curl -sf "$TRAINING_URL/health" > /dev/null 2>&1; then
+if ! curl -sf --max-time 10 "$TRAINING_URL/health" > /dev/null 2>&1; then
     log_error "Training service not accessible via port-forward"
     kill $PF_PID 2>/dev/null
     return 1
 fi
```
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if ! curl -sf "$TRAINING_URL/health" > /dev/null 2>&1; then | |
| log_error "Training service not accessible via port-forward" | |
| kill $PF_PID 2>/dev/null | |
| return 1 | |
| fi | |
| curl -s "$TRAINING_URL/health" | jq . | |
| if ! curl -sf --max-time 10 "$TRAINING_URL/health" > /dev/null 2>&1; then | |
| log_error "Training service not accessible via port-forward" | |
| kill $PF_PID 2>/dev/null | |
| return 1 | |
| fi | |
| curl -s "$TRAINING_URL/health" | jq . |
🤖 Prompt for AI Agents
In scripts/local-training-e2e.sh around lines 483 to 488, the health check curl
calls lack timeouts and can hang if the port-forward exists but the service is
unresponsive; update both curl invocations to include a reasonable timeout
(e.g., --max-time 5 and/or --connect-timeout 2) and retain the existing error
handling so the first timed-out check triggers log_error, kills the PF_PID, and
returns 1; ensure the follow-up curl that pipes to jq also uses a timeout so the
script never blocks indefinitely.
```sh
RESPONSE=$(curl -s -X POST "$TRAINING_URL/sessions" \
    -H "Content-Type: application/json" \
    -d "{
        \"session_id\": \"$INTERNAL_SESSION\",
        \"base_model\": \"facebook/opt-125m\",
        \"lora_config\": {
            \"rank\": 8,
            \"alpha\": 16,
            \"dropout\": 0.05
        },
        \"optimizer_config\": {
            \"learning_rate\": 0.0001,
            \"weight_decay\": 0.01
        }
    }")
echo "$RESPONSE" | jq .

if echo "$RESPONSE" | jq -e '.session_id' > /dev/null 2>&1; then
    log_info "Training session created successfully"
else
    log_error "Failed to create training session"
    echo "$RESPONSE"
    kill $PF_PID 2>/dev/null
    return 1
fi
```
**Add timeout to training session creation.** The curl on line 495 creates a training session but has no timeout, risking a hang if the service is slow or unresponsive.

🔎 Add a timeout:

```diff
-RESPONSE=$(curl -s -X POST "$TRAINING_URL/sessions" \
+RESPONSE=$(curl -s --max-time 30 -X POST "$TRAINING_URL/sessions" \
     -H "Content-Type: application/json" \
     -d "{
```

Apply the same pattern to the other training operation curls (lines 537, 558, 577, 588, 606).
🤖 Prompt for AI Agents
In scripts/local-training-e2e.sh around lines 495 to 519, the curl POST that
creates the training session has no timeout and can hang; add a total request
timeout (e.g. --max-time 30) to the curl invocation to fail fast on
slow/unresponsive services, handle the non-zero exit by checking curl success
before parsing JSON, and apply the same --max-time timeout change to the other
training-related curl calls at lines 537, 558, 577, 588, and 606 so all training
operations use a consistent timeout.
Actionable comments posted: 7
♻️ Duplicate comments (9)
crates/basilica-sdk-python/python/basilica/training/service_client.py (3)
**37-74: Store timeout for async client initialization.** The `timeout` parameter is passed to the sync `httpx.Client` but not stored for later use. Line 399 in `_get_async_client` hardcodes `timeout=300.0`, ignoring the configured value.

🔎 Apply this diff to store and reuse the timeout:

```diff
 def __init__(
     self,
     api_key: Optional[str] = None,
     endpoint: Optional[str] = None,
     timeout: float = 300.0,
 ):
     ...
+    self._timeout = timeout
     self._client = httpx.Client(
         base_url=self.endpoint,
         headers={"Authorization": f"Bearer {self.api_key}"},
         timeout=timeout,
     )
     self._async_client = None  # Lazy init
```

Then update `_get_async_client` to use `self._timeout` (see separate comment). Based on past review comments.
**350-378: Use specific exception types for better error handling.** Lines 364-365 and 370-371 raise generic `TrainingError` instead of the more specific `SessionNotFoundError` and `SessionNotReadyError` that are already imported.

🔎 Apply this diff to use specific exceptions:

```diff
+from .exceptions import (
+    TrainingError,
+    SessionTimeoutError,
+    SessionNotFoundError,
+    SessionNotReadyError,
+    AuthenticationError,
+    ValidationError,
+)
```

Then update the method:

```diff
 if resp.status_code == 404:
-    raise TrainingError(f"Session {session_id} not found")
+    raise SessionNotFoundError(session_id)
 if not resp.is_success:
     raise TrainingError(f"Failed to get session: {resp.text}")
 data = resp.json()
 if data.get("phase") != "ready":
-    raise TrainingError(f"Session not ready: {data.get('phase')}")
+    raise SessionNotReadyError(session_id, data.get("phase", "unknown"))
```

Based on past review comments.
**393-401: Async client uses hardcoded timeout instead of configured value.** Line 399 hardcodes `timeout=300.0`, ignoring the `timeout` parameter passed to `__init__`. This creates inconsistency between sync and async clients.

🔎 Apply this diff (requires storing the timeout in `__init__`):

```diff
 async def _get_async_client(self) -> httpx.AsyncClient:
     """Get or create async HTTP client."""
     if self._async_client is None:
         self._async_client = httpx.AsyncClient(
             base_url=self.endpoint,
             headers={"Authorization": f"Bearer {self.api_key}"},
-            timeout=300.0,
+            timeout=self._timeout,
         )
     return self._async_client
```

Based on past review comments.
crates/basilica-api/src/api/routes/training.rs (6)
**121-129: Document accepted backend values.** The `backend` field lacks documentation about which values are accepted (e.g., "s3", "gcs"). Consider adding a doc comment or, better yet, using an enum with `Deserialize` to enforce valid values at deserialization time. Based on past review comments.

**399-399: Hardcoded `enableBilling` prevents user configuration.** Line 399 hardcodes `"enableBilling": true`, preventing users from opting out of billing. Consider adding an optional field to `CreateSessionRequest` with a sensible default. Based on past review comments.
**316-323: Namespace collision risk remains unaddressed.** The sanitization logic can create collisions (e.g., "user_123" and "user-123" both map to "u-user-123") and doesn't enforce Kubernetes' 63-character limit. Consider using a hash-based or base32-encoded approach for deterministic, collision-free namespace names. Based on past review comments.
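One collision-free scheme, sketched in Python for clarity (the Rust implementation would be analogous; the `u-` prefix, slug length, and 8-character hash are arbitrary illustrative choices):

```python
import hashlib
import re

def user_namespace(user_id: str) -> str:
    # Sanitize to a DNS-1123 label, then append a short hash of the
    # *original* id so distinct ids cannot collide after sanitization.
    slug = re.sub(r"[^a-z0-9-]", "-", user_id.lower()).strip("-")[:50]
    digest = hashlib.sha256(user_id.encode()).hexdigest()[:8]
    return f"u-{slug}-{digest}"[:63]
```

Because the hash is computed over the raw id, "user_123" and "user-123" sanitize to the same slug but receive different suffixes, and truncation keeps the result within the 63-character label limit.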
**439-448: Fragile error detection via string matching.** Line 441 checks for "409" or "AlreadyExists" in the error string, which is brittle. Use typed error matching instead: `matches!(e, kube::Error::Api(ref ae) if ae.code == 409)` for robustness. Based on past review comments.

**514-525: Fragile error detection via string matching.** Line 515 uses string matching (`e.to_string().contains("404")`) to detect not-found errors. This should use typed error matching: `matches!(e, kube::Error::Api(ref ae) if ae.code == 404)`. Note: a past review marked this as addressed, but string matching is still present in the current code.

**646-672: Confusing sentinel error pattern for empty list handling.** Lines 652-654 use `ApiError::Internal { message: "empty".to_string() }` as a sentinel value to signal an empty list, which is later checked at line 664. This makes the code harder to understand. Use explicit control flow or an `Option` instead. Based on past review comments.
🧹 Nitpick comments (5)
crates/basilica-sdk-python/python/basilica/training/rest_client.py (2)
**212-228: Consider consistent response handling across list methods.** Line 226 returns `resp.json()` directly, while other list methods extract specific keys (e.g., `list_training_runs` extracts `"runs"`, `list_checkpoints` extracts `"checkpoints"`). If the sessions endpoint returns a wrapper object with a `"sessions"` key, extracting it would maintain consistency.

🔎 View suggested fix if the API returns a wrapper:

```diff
 def _call():
     resp = self._client.get("/sessions", params={"limit": limit})
     if not resp.is_success:
         raise TrainingError(f"list_sessions failed: {resp.text}")
-    return resp.json()
+    return resp.json().get("sessions", [])
```
**290-292: Consider waiting for in-flight operations during shutdown.** Line 292 calls `shutdown(wait=False)`, which may abandon in-flight operations. For graceful cleanup, `wait=True` ensures pending futures complete before shutdown.

🔎 Apply this diff for graceful shutdown:

```diff
 def close(self):
     """Close the REST client."""
-    self._executor.shutdown(wait=False)
+    self._executor.shutdown(wait=True)
```

crates/basilica-api/src/api/middleware/scope.rs (1)
**180-267: Consider more precise pattern matching for internal session operations.** The scope validation patterns are functional but could be more precise. For example, lines 192-196 use `contains("/internal/")`, which would match any path containing that substring, not just the intended `/sessions/:session_id/internal/:internal_session_id` pattern. While not a security risk (unmatched routes will 404), tighter patterns improve maintainability.

Example improvement:

```rust
// Instead of:
(&Method::GET, p) if p.starts_with("/sessions/") && p.contains("/internal/") => {
    Some("training:view".to_string())
}

// Consider:
(&Method::GET, p)
    if p.starts_with("/sessions/")
        && p.matches("/internal/").count() == 1
        && !p.ends_with("/internal") =>
{
    Some("training:view".to_string())
}
```

However, given that invalid paths will be rejected at routing and all matched scopes are appropriate, this is a low-priority refinement.
crates/basilica-api/src/api/routes/training.rs (2)
**750-825: Consider adding a timeout for proxy requests.** The proxy implementation doesn't specify a timeout for the `kube_client.request()` call. If the training service is slow or hangs, this could block the API server thread. Consider adding a timeout and/or documenting that the kube client's default timeout will apply.
**1459-1579: Placeholder implementations for checkpoint operations.** Several checkpoint management functions (download URL, delete, publish, unpublish) contain TODO comments and placeholder implementations (lines 1479, 1511-1520, 1543-1546, 1574-1575). While acceptable for an MVP, these should be tracked.

Do you want me to open issues to track the completion of these checkpoint operations, or are they already tracked elsewhere?
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- `crates/basilica-api/src/api/middleware/scope.rs` (1 hunks)
- `crates/basilica-api/src/api/mod.rs` (1 hunks)
- `crates/basilica-api/src/api/routes/training.rs` (1 hunks)
- `crates/basilica-sdk-python/python/basilica/training/__init__.py` (1 hunks)
- `crates/basilica-sdk-python/python/basilica/training/rest_client.py` (1 hunks)
- `crates/basilica-sdk-python/python/basilica/training/service_client.py` (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
crates/basilica-api/src/api/mod.rs (1)
crates/basilica-api/src/api/routes/training.rs (19)
`create_session` (410-498), `list_sessions` (633-737), `get_session` (501-580), `delete_session` (586-630), `forward_backward` (884-908), `optim_step` (912-935), `sample` (939-963), `forward` (1023-1047), `compute_logprobs` (1051-1075), `save_for_sampler` (1079-1103), `get_capabilities` (1109-1129), `list_training_runs` (1133-1203), `get_training_run_by_path` (1256-1312), `get_training_run` (1207-1252), `list_checkpoints` (1318-1390), `get_checkpoint_info` (1394-1457), `publish_checkpoint` (1528-1555), `unpublish_checkpoint` (1559-1579), `delete_checkpoint` (1495-1524)
crates/basilica-sdk-python/python/basilica/training/__init__.py (2)
- `crates/basilica-sdk-python/python/basilica/training/types.py` (8): `Datum` (17-44), `ModelInput` (48-66), `SamplingParams` (70-93), `SampleResponse` (100-106), `ForwardBackwardResult` (110-115), `ForwardResult` (119-123), `GetServerCapabilitiesResponse` (127-132), `APIFuture` (138-184)
- `crates/basilica-sdk-python/python/basilica/training/exceptions.py` (1): `TrainingError` (8-14)
crates/basilica-api/src/api/routes/training.rs (4)
- `crates/basilica-api/src/server.rs` (2): `std` (1246-1246), `new` (559-1165)
- `services/training-service/src/server.py` (13): `LoraConfigRequest` (53-59), `OptimizerConfigRequest` (62-70), `CreateSessionRequest` (73-80), `CreateSessionResponse` (83-87), `SessionStatusResponse` (179-187), `create_session` (213-244), `get_session` (248-254), `delete_session` (258-264), `forward_backward` (323-352), `optim_step` (356-362), `sample` (369-390), `forward` (271-295), `compute_logprobs` (299-319)
- `services/training-service/src/backend.py` (7): `create_session` (171-232), `delete_session` (573-577), `forward_backward` (234-300), `optim_step` (391-414), `sample` (416-478), `forward` (302-349), `compute_logprobs` (351-389)
- `crates/basilica-api/src/k8s/helpers.rs` (1): `arr` (9-11)
crates/basilica-sdk-python/python/basilica/training/rest_client.py (2)
crates/basilica-sdk-python/python/basilica/training/types.py (2)
APIFuture(138-184)result_async(169-176)crates/basilica-sdk-python/python/basilica/training/exceptions.py (1)
TrainingError(8-14)
🪛 Ruff (0.14.8)
crates/basilica-sdk-python/python/basilica/training/__init__.py
70-98: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
crates/basilica-sdk-python/python/basilica/training/service_client.py
65-67: Avoid specifying long messages outside the exception class
(TRY003)
155-157: Avoid specifying long messages outside the exception class
(TRY003)
191-191: Avoid specifying long messages outside the exception class
(TRY003)
200-200: Avoid specifying long messages outside the exception class
(TRY003)
208-208: Avoid specifying long messages outside the exception class
(TRY003)
236-236: Avoid specifying long messages outside the exception class
(TRY003)
328-328: Avoid specifying long messages outside the exception class
(TRY003)
365-365: Avoid specifying long messages outside the exception class
(TRY003)
367-367: Avoid specifying long messages outside the exception class
(TRY003)
371-371: Avoid specifying long messages outside the exception class
(TRY003)
389-389: Avoid specifying long messages outside the exception class
(TRY003)
crates/basilica-sdk-python/python/basilica/training/rest_client.py
53-53: Avoid specifying long messages outside the exception class
(TRY003)
71-71: Avoid specifying long messages outside the exception class
(TRY003)
89-89: Avoid specifying long messages outside the exception class
(TRY003)
115-115: Avoid specifying long messages outside the exception class
(TRY003)
133-133: Avoid specifying long messages outside the exception class
(TRY003)
151-151: Avoid specifying long messages outside the exception class
(TRY003)
168-168: Avoid specifying long messages outside the exception class
(TRY003)
188-188: Avoid specifying long messages outside the exception class
(TRY003)
206-206: Avoid specifying long messages outside the exception class
(TRY003)
225-225: Avoid specifying long messages outside the exception class
(TRY003)
243-243: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (5)
crates/basilica-sdk-python/python/basilica/training/__init__.py (1)
70-98: Current __all__ organization is clear; ignore the static analysis sorting hint.

Static analysis suggests alphabetically sorting __all__, but the current logical grouping (version, clients, types, exceptions) is more maintainable and readable than alphabetical order.

crates/basilica-api/src/api/routes/training.rs (1)
461-486: Good: Proper typed error matching for 409 Conflict.

Line 469 correctly uses kube::Error::Api(ae) if ae.code == 409 for typed error matching, which is more robust than string-based checks.

crates/basilica-api/src/api/mod.rs (3)
105-113: LGTM! Session lifecycle endpoints follow RESTful conventions.

The session lifecycle routes are well-structured and follow standard CRUD patterns under the protected router.
155-172: LGTM! Capabilities and training runs endpoints are well-structured.

The capabilities and training runs management routes follow clear RESTful patterns. The "Phase 3" comments provide helpful context for the phased rollout.
114-154: All training operation handlers are properly implemented in crates/basilica-api/src/api/routes/training.rs. The referenced handlers exist:

create_internal_session (line 829)
get_internal_session (line 857)
forward_backward (line 884)
optim_step (line 912)
sample (line 939)
save_checkpoint (line 967)
load_checkpoint (line 995)
forward (line 1023)
compute_logprobs (line 1051)
save_for_sampler (line 1079)

No issues found.
let sessions = match api.list(&ListParams::default().labels(&format!("user={}", auth.user_id))).await {
    Ok(list) => list,
    Err(e) => {
        if e.to_string().contains("404") || e.to_string().contains("NotFound") {
            apimetrics::record_request("training.runs.list", "GET", start, true);
            return Ok(Json(ListTrainingRunsResponse {
                runs: vec![],
                total: 0,
                offset: params.offset,
                limit: params.limit,
            }));
        }
        error!(error = %e, "Failed to list training runs");
        apimetrics::record_request("training.runs.list", "GET", start, false);
        return Err(ApiError::Internal {
            message: format!("Failed to list training runs: {}", e),
        });
    }
};
🛠️ Refactor suggestion | 🟠 Major
Use typed error matching for 404 detection.
Line 1149 uses string matching to detect 404 errors. Use kube::Error::Api(ae) if ae.code == 404 instead for more robust error handling.
🤖 Prompt for AI Agents
In crates/basilica-api/src/api/routes/training.rs around lines 1146 to 1164, the
code currently detects a 404 by matching substrings on the error string; change
this to typed error matching using kube::Error::Api(ae) if ae.code == 404 in the
match arm that handles Err(e). Specifically, add a match arm that matches
kube::Error::Api(ae) if ae.code == 404 and in that arm record the successful
metrics and return the empty ListTrainingRunsResponse as currently done for
404s; leave the existing error logging, metrics failure recording, and
ApiError::Internal return in the fallback arm for all other error variants.
Ensure the match uses the typed kube::Error pattern rather than string
comparisons.
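The distinction between the two approaches can be sketched in Python terms with a hypothetical ApiException stand-in that carries a status code (the actual fix is the Rust guard pattern kube::Error::Api(ae) if ae.code == 404 named in the comment):

```python
class ApiException(Exception):
    """Hypothetical kubernetes-style error carrying a numeric status code."""
    def __init__(self, code: int, reason: str):
        super().__init__(f"{code}: {reason}")
        self.code = code

def is_not_found(err: Exception) -> bool:
    # Robust: inspect the typed error's status code.
    return isinstance(err, ApiException) and err.code == 404

def is_not_found_fragile(err: Exception) -> bool:
    # Fragile: substring matching misfires on unrelated text.
    return "404" in str(err)

assert is_not_found(ApiException(404, "NotFound"))
assert not is_not_found(ApiException(500, "pod crashed at 10.0.4.4:404"))
# The string check wrongly treats this 500 as a 404:
assert is_not_found_fragile(ApiException(500, "pod crashed at 10.0.4.4:404"))
```

The typed check keeps a 500 whose message happens to contain "404" out of the empty-list fast path.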
let session = api.get(&run_id).await.map_err(|e| {
    if e.to_string().contains("404") || e.to_string().contains("NotFound") {
        ApiError::NotFound {
            message: format!("Training run {} not found", run_id),
        }
    } else {
        error!(error = %e, run_id = %run_id, "Failed to get training run");
        ApiError::Internal {
            message: format!("Failed to get training run: {}", e),
        }
    }
})?;
🛠️ Refactor suggestion | 🟠 Major
Use typed error matching for 404 detection.
Line 1220 uses string matching to detect 404 errors. Apply the same typed error matching pattern recommended elsewhere: kube::Error::Api(ae) if ae.code == 404.
🤖 Prompt for AI Agents
In crates/basilica-api/src/api/routes/training.rs around lines 1219 to 1230, the
code detects a 404 by string-matching on the error; replace that with typed
matching on kube::Error::Api(ae) if ae.code == 404. Update the map_err closure
to pattern-match the error (or use matches/if let) against kube::Error::Api(ae)
with ae.code == 404 and return ApiError::NotFound in that branch; for all other
errors keep the existing error! log (including the error and run_id) and return
ApiError::Internal with the error string. Ensure the pattern match compiles with
the error type returned by api.get and preserve existing logging and messages.
let session = api.get(&session_id).await.map_err(|e| {
    if e.to_string().contains("404") || e.to_string().contains("NotFound") {
        ApiError::NotFound {
            message: format!("Training run for path {} not found", params.path),
        }
    } else {
        error!(error = %e, path = %params.path, "Failed to get training run by path");
        ApiError::Internal {
            message: format!("Failed to get training run: {}", e),
        }
    }
})?;
🛠️ Refactor suggestion | 🟠 Major
Use typed error matching for 404 detection.
Line 1280 uses string matching to detect 404 errors. Use kube::Error::Api(ae) if ae.code == 404 for consistency with best practices.
🤖 Prompt for AI Agents
In crates/basilica-api/src/api/routes/training.rs around lines 1279 to 1290, the
closure currently detects 404 by string-matching the error message; replace that
with a typed match on the kube error variant (match e { kube::Error::Api(ae) if
ae.code == 404 => ApiError::NotFound { message: format!("Training run for path
{} not found", params.path) }, other => { error!(error = %other, path =
%params.path, "Failed to get training run by path"); ApiError::Internal {
message: format!("Failed to get training run: {}", other) } } }) so you use
kube::Error::Api(ae).code == 404 for 404 detection and preserve logging and
internal error construction for other cases.
match api.get(run_id).await {
    Ok(s) => vec![s],
    Err(e) => {
        if e.to_string().contains("404") || e.to_string().contains("NotFound") {
            vec![]
        } else {
            error!(error = %e, run_id = %run_id, "Failed to get session for checkpoints");
            apimetrics::record_request("training.checkpoints.list", "GET", start, false);
            return Err(ApiError::Internal {
                message: format!("Failed to list checkpoints: {}", e),
            });
        }
    }
}
} else {
    match api.list(&ListParams::default().labels(&format!("user={}", auth.user_id))).await {
        Ok(list) => list.items,
        Err(e) => {
            if e.to_string().contains("404") || e.to_string().contains("NotFound") {
                vec![]
            } else {
                error!(error = %e, "Failed to list sessions for checkpoints");
                apimetrics::record_request("training.checkpoints.list", "GET", start, false);
                return Err(ApiError::Internal {
                    message: format!("Failed to list checkpoints: {}", e),
                });
            }
        }
    }
};
🛠️ Refactor suggestion | 🟠 Major
Use typed error matching for 404 detection.
Lines 1335 and 1350 use string matching to detect 404 errors. Consistently apply typed error matching: kube::Error::Api(ae) if ae.code == 404.
🤖 Prompt for AI Agents
In crates/basilica-api/src/api/routes/training.rs around lines 1332 to 1361, the
code detects 404s by matching error.to_string() contents; change these to typed
matches against kube::Error::Api(ae) with ae.code == 404 in both error arms.
Replace the if e.to_string().contains(...) checks with a match (or if let) on
Err(e) => match e { kube::Error::Api(ae) if ae.code == 404 => vec![], _ => {
error!(error = %e, ...); apimetrics::record_request(...); return
Err(ApiError::Internal { message: format!(...) }); } } and ensure kube::Error is
in scope (add use kube::Error if needed).
if let Some(cp_array) = status.get("checkpointDetails").and_then(|v| v.as_array()) {
    for cp in cp_array {
        let cp_path = cp.get("path").and_then(|v| v.as_str()).unwrap_or("");
        if cp_path.contains(&checkpoint_name) || cp_path == params.path {
            apimetrics::record_request("training.checkpoints.info", "GET", start, true);
            return Ok(Json(CheckpointResponse {
                checkpoint_id: cp.get("id").and_then(|v| v.as_str()).unwrap_or("").to_string(),
                run_id: session_id,
                path: cp_path.to_string(),
                step: cp.get("step").and_then(|v| v.as_u64()).unwrap_or(0),
                size_bytes: cp.get("sizeBytes").and_then(|v| v.as_u64()).unwrap_or(0),
                created_at: cp.get("createdAt").and_then(|v| v.as_str()).unwrap_or("").to_string(),
                is_published: cp.get("isPublished").and_then(|v| v.as_bool()).unwrap_or(false),
                public_url: cp.get("publicUrl").and_then(|v| v.as_str()).map(String::from),
            }));
        }
    }
Loose checkpoint path matching may return wrong checkpoint.
Line 1437 uses cp_path.contains(&checkpoint_name) which could match unintended checkpoints. For example, if checkpoint_name is "step_100", it would also match "step_1000". Consider using exact path matching or more precise substring logic.
🤖 Prompt for AI Agents
In crates/basilica-api/src/api/routes/training.rs around lines 1434 to 1450, the
checkpoint match uses cp_path.contains(&checkpoint_name) which can produce false
positives (e.g., "step_100" matching "step_1000"); change the condition to a
precise match — either compare the full path equality (cp_path == params.path)
or compare the checkpoint filename/basename exactly to checkpoint_name (extract
the last path segment and compare == checkpoint_name), handling potential
trailing slashes, then keep the existing return block when a true exact match is
found.
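The exact-match rule the prompt describes can be sketched outside the handler; this is illustrative Python (the production fix is in Rust, and the helper name is hypothetical):

```python
def checkpoint_matches(cp_path: str, checkpoint_name: str, full_path: str) -> bool:
    """Exact-match rule: full path equality, or last-path-segment equality."""
    normalized = cp_path.rstrip("/")          # tolerate trailing slashes
    basename = normalized.rsplit("/", 1)[-1]  # extract the last path segment
    return normalized == full_path.rstrip("/") or basename == checkpoint_name

# "step_100" no longer matches a "step_1000" checkpoint:
assert not checkpoint_matches("runs/abc/step_1000", "step_100", "runs/xyz")
assert checkpoint_matches("runs/abc/step_100/", "step_100", "runs/xyz")
# Full-path equality still works independent of the name:
assert checkpoint_matches("runs/abc/step_100", "other", "runs/abc/step_100")
```

Comparing the basename rather than calling contains removes the prefix false positives while preserving the existing full-path match.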
def get_checkpoint_archive_url(self, checkpoint_id: str) -> APIFuture:
    """Get signed download URL for checkpoint.

    Args:
        checkpoint_id: Checkpoint ID

    Returns:
        APIFuture resolving to presigned URL string
    """

    def _call():
        resp = self._client.get(f"/checkpoints/{checkpoint_id}/download_url")
        if not resp.is_success:
            raise TrainingError(f"get_checkpoint_archive_url failed: {resp.text}")
        return resp.json()["url"]

    return APIFuture(self._executor.submit(_call), str)
Add defensive key checking to prevent KeyError.
Line 134 accesses resp.json()["url"] without verifying the key exists. If the API response is malformed or the key is missing, this will raise a KeyError.
🔎 Apply this diff to add validation:
def _call():
resp = self._client.get(f"/checkpoints/{checkpoint_id}/download_url")
if not resp.is_success:
raise TrainingError(f"get_checkpoint_archive_url failed: {resp.text}")
- return resp.json()["url"]
+ data = resp.json()
+ if "url" not in data:
+ raise TrainingError("Missing 'url' in response")
+ return data["url"]

🧰 Tools
🪛 Ruff (0.14.8)
133-133: Avoid specifying long messages outside the exception class
(TRY003)
def publish_checkpoint(self, path: str) -> APIFuture:
    """Make checkpoint publicly accessible.

    Args:
        path: Checkpoint path to publish

    Returns:
        APIFuture resolving to public URL string
    """

    def _call():
        resp = self._client.post("/checkpoints/publish", json={"path": path})
        if not resp.is_success:
            raise TrainingError(f"publish_checkpoint failed: {resp.text}")
        return resp.json()["public_url"]

    return APIFuture(self._executor.submit(_call), str)
Add defensive key checking to prevent KeyError.
Line 189 accesses resp.json()["public_url"] without verifying the key exists, risking a KeyError if the response is malformed.
🔎 Apply this diff to add validation:
def _call():
resp = self._client.post("/checkpoints/publish", json={"path": path})
if not resp.is_success:
raise TrainingError(f"publish_checkpoint failed: {resp.text}")
- return resp.json()["public_url"]
+ data = resp.json()
+ if "public_url" not in data:
+ raise TrainingError("Missing 'public_url' in response")
+ return data["public_url"]

🧰 Tools
🪛 Ruff (0.14.8)
188-188: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
In crates/basilica-sdk-python/python/basilica/training/rest_client.py around
lines 175 to 191, the code blindly accesses resp.json()["public_url"] which can
raise KeyError or crash on invalid JSON; parse the JSON into a variable, check
that it is a dict and contains the "public_url" key, and if missing raise a
TrainingError that includes the response status, raw text and/or parsed body for
debugging; also catch JSON decode errors and raise TrainingError with
response.text so the APIFuture always surfaces a clear, defensive error rather
than a KeyError.
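Both key-checking fixes follow the same shape; a reusable helper along these lines (the helper name is illustrative, and TrainingError is a local stand-in for basilica.training.exceptions.TrainingError) keeps the APIFuture surfacing a clear error instead of a KeyError:

```python
class TrainingError(Exception):
    """Local stand-in for basilica.training.exceptions.TrainingError."""

def require_field(data: object, key: str, context: str) -> object:
    """Validate a parsed JSON body before indexing into it."""
    if not isinstance(data, dict):
        raise TrainingError(f"{context}: expected JSON object, got {type(data).__name__}")
    if key not in data:
        raise TrainingError(f"{context}: missing '{key}' in response: {data!r}")
    return data[key]

# Well-formed response passes through; malformed ones raise TrainingError.
assert require_field({"public_url": "https://x"}, "public_url", "publish_checkpoint") == "https://x"
try:
    require_field({}, "public_url", "publish_checkpoint")
except TrainingError as e:
    assert "public_url" in str(e)
```

The same helper would serve both the "url" and "public_url" call sites, and the context argument keeps the error message traceable to the failing endpoint.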
Actionable comments posted: 4
♻️ Duplicate comments (8)
crates/basilica-api/src/api/routes/training.rs (4)
121-129: Add validation or documentation for storage backend values.

The backend field accepts any string. Consider adding validation for supported backends (e.g., "s3", "gcs", "r2") or at minimum document accepted values. Based on past review comments flagging this same issue.

341-348: Namespace collision and length issues remain unaddressed.

This sanitization can cause collisions (e.g., "user_123" and "user-123" both become "u-user-123"). Additionally, K8s namespace names have a 63-character limit that isn't enforced. Based on past review comments flagging namespace collision risk.

1329-1347: Use typed error matching for 404 detection.

Apply the same typed matching pattern: kube::Error::Api(ae) if ae.code == 404. Based on past review comments.

1617-1633: Loose checkpoint path matching may return wrong checkpoint.

Using cp_path.contains(&checkpoint_name) can match unintended checkpoints (e.g., "step_100" matches "step_1000"). Use exact path matching or compare the final path segment. Based on past review comments.
examples/training_example.py (4)
39-40: Resource leak: file handle not closed.

This issue was previously flagged. The file is opened without a context manager, leaving the handle unclosed.

🔎 Apply this diff to fix:

- if os.path.exists("build/api-token.txt"):
-     api_key = open("build/api-token.txt").read().strip()
+ if os.path.exists("build/api-token.txt"):
+     with open("build/api-token.txt") as f:
+         api_key = f.read().strip()

Apply the same fix at lines 120-121, 151-152, and 185-186.
120-121: Resource leak: file handle not closed.

This is the same issue flagged earlier. Use a context manager to ensure the file is properly closed.

🔎 Apply the same fix as lines 39-40:

- if os.path.exists("build/api-token.txt"):
-     api_key = open("build/api-token.txt").read().strip()
+ if os.path.exists("build/api-token.txt"):
+     with open("build/api-token.txt") as f:
+         api_key = f.read().strip()
151-152: Resource leak: file handle not closed.

Same resource leak issue. Apply the same fix as previous occurrences.

🔎 Apply the context manager fix:

- if os.path.exists("build/api-token.txt"):
-     api_key = open("build/api-token.txt").read().strip()
+ if os.path.exists("build/api-token.txt"):
+     with open("build/api-token.txt") as f:
+         api_key = f.read().strip()
185-186: Resource leak: file handle not closed.

Fourth occurrence of the same resource leak. Apply the context manager fix here as well.

🔎 Apply the context manager fix:

- if os.path.exists("build/api-token.txt"):
-     api_key = open("build/api-token.txt").read().strip()
+ if os.path.exists("build/api-token.txt"):
+     with open("build/api-token.txt") as f:
+         api_key = f.read().strip()
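Since all four occurrences read the same token file, a small helper would keep the context-manager fix in one place. A sketch (the helper name and default path are illustrative):

```python
import os
from typing import Optional

def read_token(path: str = "build/api-token.txt") -> Optional[str]:
    """Read an API token if the file exists; the with-block closes the handle."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return f.read().strip()
```

Each call site then becomes `api_key = read_token() or api_key`, and the file handle is never leaked.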
🧹 Nitpick comments (11)
crates/basilica-sdk-python/python/basilica/training/tests/test_clients.py (2)
142-145: Remove redundant os.environ.pop calls.

The patch.dict(os.environ, {}, clear=True) already clears all environment variables, making the subsequent pop calls unnecessary.

🔎 Simplify the test setup:

  def test_init_requires_api_key(self):
      """Test that ServiceClient requires API key."""
      with patch.dict(os.environ, {}, clear=True):
-         # Remove any environment variables
-         os.environ.pop("BASILICA_API_KEY", None)
-         os.environ.pop("BASILICA_API_TOKEN", None)
-
          with pytest.raises(ValueError, match="API key required"):
              ServiceClient()
194-217: Consider verifying exact request parameters.

While the test confirms the method is called, asserting the exact parameters passed to mock_client.post would strengthen the test and catch parameter-related regressions.

Example enhancement:

# After line 217, add:
call_args = mock_client.post.call_args
assert call_args[1]['json']['data'][0]['input_ids'] == [1, 2, 3]
# Or use call_args_list to verify the exact call signature

crates/basilica-api/src/api/middleware/scope.rs (1)
180-268: Training scope mappings look well-structured.

The new training-related scope mappings follow the established pattern. The route matching logic correctly distinguishes between different operations (create, view, use, delete).
However, consider adding unit tests for the new training endpoints to maintain test coverage parity with existing rental and payment endpoint tests (visible in lines 283-454).
crates/basilica-api/src/api/routes/training.rs (5)
464-473: Use typed error matching for namespace creation.

Line 466 uses string matching (e.to_string().contains("409")) while the CRD creation at line 494 correctly uses typed matching. Apply the same pattern here for consistency and robustness.

🔎 Suggested fix:

  if let Err(e) = k8s_client.create_namespace(&namespace).await {
-     // Ignore "already exists" errors
-     if !e.to_string().contains("409") && !e.to_string().contains("AlreadyExists") {
+     // Ignore "already exists" errors using typed matching
+     // Note: This requires checking if k8s_client.create_namespace returns kube::Error
+     let is_conflict = e.to_string().contains("409") || e.to_string().contains("AlreadyExists");
+     if !is_conflict {
          error!(error = %e, namespace = %namespace, "Failed to create namespace");

Note: If create_namespace doesn't return kube::Error directly, you may need to update its signature or wrap the error type. Based on past review comments flagging fragile error detection.
562-573: Use typed error matching for 404 detection.

This uses string matching to detect 404 errors. Apply the typed error matching pattern used elsewhere: kube::Error::Api(ae) if ae.code == 404.

🔎 Suggested fix:

- let original_session = api.get(original_session_id).await.map_err(|e| {
-     if e.to_string().contains("404") || e.to_string().contains("NotFound") {
+ let original_session = api.get(original_session_id).await.map_err(|e| match &e {
+     kube::Error::Api(ae) if ae.code == 404 => {
          ApiError::NotFound {
              message: format!("Original session {} not found", original_session_id),
          }
-     } else {
+     }
+     _ => {
          error!(error = %e, path = %req.path, "Failed to get original session");
          ApiError::Internal {
              message: format!("Failed to get original session: {}", e),
          }
      }
  })?;
697-708: Still uses string matching for 404 detection.

The past review marked this as "✅ Addressed in commit 0c1cf26" but the code still uses string matching. Use kube::Error::Api(ae) if ae.code == 404 for consistency with delete_session (line 796).
828-855: Sentinel error pattern still present.

The past review marked this as "✅ Addressed in commit 430c5b4" but the code still uses the "empty" sentinel string pattern. Refactor to use direct control flow:

🔎 Suggested fix:

- let sessions = api
+ let sessions = match api
      .list(&ListParams::default().labels(&format!("user={}", auth.user_id)))
      .await
-     .map_err(|e| {
-         // If namespace doesn't exist, return empty list
-         if e.to_string().contains("404") || e.to_string().contains("NotFound") {
-             return ApiError::Internal {
-                 message: "empty".to_string(), // Sentinel for empty list
-             };
-         }
-         error!(error = %e, "Failed to list sessions");
-         ApiError::Internal {
-             message: format!("Failed to list sessions: {}", e),
-         }
-     });
-
- let sessions = match sessions {
-     Ok(list) => list,
-     Err(e) if e.to_string().contains("empty") => {
+ {
+     Ok(list) => list,
+     Err(kube::Error::Api(ae)) if ae.code == 404 => {
          apimetrics::record_request("training.list", "GET", start, true);
          return Ok(Json(vec![]));
      }
      Err(e) => {
+         error!(error = %e, "Failed to list sessions");
          apimetrics::record_request("training.list", "GET", start, false);
-         return Err(e);
+         return Err(ApiError::Internal {
+             message: format!("Failed to list sessions: {}", e),
+         });
      }
  };
1290-1312: Consider making capabilities configurable.

Hardcoding model names and limits works for MVP, but consider loading these from configuration or a CRD to avoid code changes when adding new models or adjusting limits.
scripts/test-training-sdk.sh (1)
39-67: Consider checking for jq availability.

The script uses jq for JSON processing on lines 46, 56, and 66 without verifying it's installed. While jq is commonly available in development environments, adding a check would improve robustness and provide a clearer error message if it's missing.

🔎 Add this check after the venv activation block:

  echo ""
+ # Check for jq
+ if ! command -v jq &> /dev/null; then
+     echo "WARNING: jq not found, JSON output will be raw"
+     echo "Install with: apt-get install jq (or brew install jq on macOS)"
+ fi
+
  # Check if API is running

examples/training_example.py (2)
47-57: Narrow the exception handler.

Catching bare Exception can mask unexpected errors including KeyboardInterrupt. For a production-quality example, consider catching specific exceptions or at minimum excluding KeyboardInterrupt and SystemExit.

🔎 Apply this diff to narrow the exception scope:

- except Exception as e:
+ except (IOError, ValueError, RuntimeError) as e:
      print(f"Could not fetch capabilities: {e}")

Alternatively, if you want to catch all non-system exceptions:

- except Exception as e:
+ except BaseException as e:
+     if isinstance(e, (KeyboardInterrupt, SystemExit)):
+         raise
      print(f"Could not fetch capabilities: {e}")
73-90: Consider adding error handling for educational value.

While the training loop is functionally correct, adding basic error handling around the .result() calls would make this example more production-ready and educational for users who will adapt this code.

🔎 Example error handling pattern:

try:
    for _ in range(5):
        result = training.forward_backward(
            [Datum(input_ids=train_tokens)],
            loss_fn="cross_entropy",
        ).result()
        step = training.optim_step().result()
        print(f"Step {step}: loss={result.loss:.4f}")

    checkpoint_path = training.save_state("checkpoint-final").result()
    print(f"\nCheckpoint saved: {checkpoint_path}")
except Exception as e:
    print(f"Training error: {e}")
    raise
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
crates/basilica-api/src/api/middleware/scope.rs (1 hunks)
crates/basilica-api/src/api/mod.rs (1 hunks)
crates/basilica-api/src/api/routes/training.rs (1 hunks)
crates/basilica-sdk-python/python/basilica/training/tests/__init__.py (1 hunks)
crates/basilica-sdk-python/python/basilica/training/tests/test_clients.py (1 hunks)
examples/training_example.py (1 hunks)
scripts/test-training-sdk.sh (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- crates/basilica-sdk-python/python/basilica/training/tests/init.py
🧰 Additional context used
🧬 Code graph analysis (1)
examples/training_example.py (4)
crates/basilica-sdk-python/python/basilica/training/types.py (5)
Datum (17-44), ModelInput (48-66), SamplingParams (70-93), result (163-167), from_ints (59-61)
crates/basilica-sdk-python/python/basilica/training/training_client.py (8)
session_id (66-68), forward_backward (160-224), step (76-78), optim_step (246-285), save_state (289-308), forward (97-132), compute_logprobs (134-158), sample (422-455)
services/training-service/src/server.py (6)
forward_backward (323-352), optim_step (356-362), save_state (397-407), forward (271-295), compute_logprobs (299-319), sample (369-390)
crates/basilica-sdk-python/python/basilica/training/sampling_client.py (2)
compute_logprobs (128-158), sample (56-105)
🪛 Ruff (0.14.8)
examples/training_example.py
55-55: Do not catch blind exception: Exception
(BLE001)
🔇 Additional comments (23)
crates/basilica-sdk-python/python/basilica/training/tests/test_clients.py (7)
62-84: LGTM! Well-structured tests covering the core Datum.to_dict() functionality with good coverage of optional fields.
87-102: LGTM! Good test coverage with proper mock verification for tokenizer integration.
105-134: LGTM! Comprehensive test coverage for SamplingParams including defaults, custom parameters, and serialization.
240-294: LGTM! Well-structured tests with proper resource cleanup via close() calls. Good coverage of REST client operations.
297-312: LGTM! Clean test verifying GetServerCapabilitiesResponse construction from dictionary data.
315-316: LGTM! Convenient pattern for running tests directly during development.
51-59: The @pytest.mark.asyncio decorator is properly supported; pytest-asyncio>=0.21 is already listed in the project's dev dependencies.

crates/basilica-api/src/api/mod.rs (1)
104-201: LGTM! Training routes properly integrated.

The training routes are correctly added to the protected routes with appropriate handler mappings. The middleware ordering (scope validation after auth) ensures proper authorization flow.
crates/basilica-api/src/api/routes/training.rs (1)
1662-1674: Placeholder token in download URL.

The URL contains token=placeholder, which provides no actual security. Ensure this is replaced with proper signed URL generation before production use. Verify this endpoint is not exposed in production without proper signed URL implementation.
scripts/test-training-sdk.sh (6)
1-4: LGTM: Proper error handling setup.

The set -e directive ensures the script fails fast on any command error, which is appropriate for an integration test script.

5-17: LGTM: Robust path resolution and optional venv activation.

The script correctly resolves paths and conditionally activates the virtual environment if present, providing good flexibility for different development setups.

19-27: LGTM: Robust health check with helpful error messages.

The health check properly validates API availability before proceeding and provides clear remediation steps if the API is not running.

29-37: LGTM: Secure token handling.

The script properly validates token file existence and truncates the token in output for security, which is a good practice.

69-73: LGTM: Straightforward example execution.

The Python example is executed correctly with the appropriate flag, relying on set -e for error propagation.

75-86: LGTM: Robust pytest execution with graceful degradation.

The script properly checks for pytest availability, provides installation instructions if missing, and gracefully continues even if some tests fail (appropriate for an integration test that may run in various environments).
examples/training_example.py (8)
1-33: LGTM: Clear documentation and appropriate imports.

The docstring provides excellent usage examples and API patterns. The sys.path manipulation is acceptable for an example script that needs to import from a local development SDK.

59-72: LGTM: Well-documented LoRA configuration.

The context manager usage is correct, and the inline comments clearly explain each LoRA parameter. The GPU count of 0 for CPU testing is appropriate for a portable example.

92-111: LGTM: Clear demonstration of logprobs computation.

The logprobs demo correctly uses the API to perform inference, compute per-token probabilities, and calculate sequence probability. The pattern is educational and correct.

122-142: LGTM: Clear demonstration of SamplingClient usage.

The sampling example correctly demonstrates using ModelInput.from_ints() for token-based prompts and SamplingParams for generation control. The API usage matches the expected patterns.

145-179: LGTM: Proper async/await pattern with cleanup.

The async example correctly demonstrates the async API with proper cleanup in the finally block. The use of try/finally ensures training.close() is called even if errors occur during training.

182-231: LGTM: Comprehensive RestClient demonstration.

The example effectively demonstrates REST API operations including listing runs/sessions/checkpoints and retrieving run details. Good defensive programming with .get() for dictionary access and proper cleanup with rest.close().

233-255: LGTM: Clean CLI dispatch.

The argument parsing and dispatch logic is straightforward and well-structured, providing clear entry points for different example modes.

112-114: The training.sample() convenience method exists and is correctly used in the example. No issue found.

Likely an incorrect or invalid review comment.
match api.create(&PostParams::default(), &crd).await {
    Ok(_) => {
        info!(
            session_id = %session_id,
            namespace = %namespace,
            resumed_from = %original_session_id,
            "Created TrainingSession CRD from checkpoint"
        );
    }
    Err(e) => {
        error!(error = %e, "Failed to create TrainingSession CRD from checkpoint");
        apimetrics::record_request("training.from_state", "POST", start, false);
        return Err(ApiError::Internal {
            message: format!("Failed to create session: {}", e),
        });
    }
}
Missing 409 conflict handling for CRD creation.
Unlike create_session (line 494), this doesn't handle the case where the session already exists. If a UUID collision occurs (unlikely but possible), this will return an internal error instead of a proper conflict response.
🔎 Suggested fix:

```diff
 match api.create(&PostParams::default(), &crd).await {
     Ok(_) => {
         info!(
             session_id = %session_id,
             namespace = %namespace,
             resumed_from = %original_session_id,
             "Created TrainingSession CRD from checkpoint"
         );
     }
+    Err(kube::Error::Api(ae)) if ae.code == 409 => {
+        warn!(session_id = %session_id, "TrainingSession already exists");
+        apimetrics::record_request("training.from_state", "POST", start, false);
+        return Err(ApiError::Conflict {
+            message: format!("Session {} already exists", session_id),
+        });
+    }
     Err(e) => {
```

🤖 Prompt for AI Agents
In crates/basilica-api/src/api/routes/training.rs around lines 651 to 667, the
create() error path should detect a Kubernetes "already exists"/409 conflict and
return a Conflict API error instead of treating every error as Internal; update
the Err(e) branch to inspect the error (e.g., match on kube/k8s error kind or
map to AlreadyExists/Conflict), and for conflict cases log an appropriate
warn/info, record the apimetrics success=false for "training.from_state", and
return ApiError::Conflict with a clear message; for all other errors keep the
existing error log, metrics recording, and return ApiError::Internal.
```rust
// Execute the request through the kube client
let response = kube_client
    .request::<serde_json::Value>(request)
    .await
    .map_err(|e| {
        error!(
            error = %e,
            service = %service_name,
            namespace = %namespace,
            path = %path,
            "Failed to proxy request to training service"
        );
        ApiError::ServiceUnavailable
    })?;
```
Missing timeout for proxy requests.
The K8s API server proxy request has no explicit timeout. If the training pod is unresponsive, this could cause requests to hang indefinitely, potentially exhausting connection pools or worker threads.
🔎 Consider adding a timeout:
You may need to wrap the request with a tokio timeout or configure the kube client with request timeouts:
```rust
use tokio::time::{timeout, Duration};

let response = timeout(
    Duration::from_secs(30),
    kube_client.request::<serde_json::Value>(request),
)
.await
.map_err(|_| ApiError::GatewayTimeout {
    message: "Training service request timed out".to_string(),
})?
.map_err(|e| {
    // ... existing error handling
})?;
```

🤖 Prompt for AI Agents
In crates/basilica-api/src/api/routes/training.rs around lines 981 to 994, the
kube client proxy request has no timeout and can hang indefinitely; wrap the
kube_client.request call with tokio::time::timeout using a sensible Duration
(e.g., 30s), map a timeout result to ApiError::GatewayTimeout with an
explanatory message, and preserve the existing map_err handling for the inner
request error (log the error and return ApiError::ServiceUnavailable); also add
the necessary tokio::time::{timeout, Duration} import and adjust logs to
indicate when a timeout occurred.
```rust
/// Delete a checkpoint.
/// DELETE /checkpoints/:checkpoint_id
pub async fn delete_checkpoint(
    State(state): State<AppState>,
    Extension(auth): Extension<AuthContext>,
    Path(checkpoint_id): Path<String>,
) -> Result<StatusCode> {
    let start = Instant::now();

    info!(
        user_id = %auth.user_id,
        checkpoint_id = %checkpoint_id,
        "Deleting checkpoint"
    );

    // Check that we have K8s access
    let _k8s_client = state.k8s.as_ref().ok_or(ApiError::ServiceUnavailable)?;

    // TODO: Actually delete the checkpoint from storage
    // This would involve:
    // 1. Finding the session that owns this checkpoint
    // 2. Updating the session's checkpoint list
    // 3. Deleting the checkpoint files from R2/S3

    warn!(
        checkpoint_id = %checkpoint_id,
        "Checkpoint deletion not fully implemented - returning success"
    );

    apimetrics::record_request("training.checkpoints.delete", "DELETE", start, true);
    Ok(StatusCode::NO_CONTENT)
}
```
Delete checkpoint returns success without performing deletion.
This endpoint claims success (NO_CONTENT) without actually deleting anything. This could lead to data inconsistencies and user confusion when checkpoints aren't actually removed.
Consider either:
- Returning a 501 Not Implemented until fully implemented
- Implementing the actual deletion logic
- Documenting this as expected MVP behavior in the API response
🤖 Prompt for AI Agents
In crates/basilica-api/src/api/routes/training.rs around lines 1676 to 1707, the
delete_checkpoint handler currently returns 204 NO_CONTENT without performing
any deletion; either make this explicit or implement deletion. Fix by one of two
actions: (A) If not ready to delete, return StatusCode::NOT_IMPLEMENTED (501)
and log/warn that deletion is not implemented, keeping the k8s check and
metrics; or (B) implement the deletion flow: verify checkpoint ownership via
session lookup, remove checkpoint ID from the session record, delete checkpoint
files from the configured object storage (R2/S3) and any Kubernetes resources,
remove checkpoint metadata from DB, handle and log errors (return appropriate
status codes), and only record metrics and return 204 on success. Ensure auth
checks, error handling, and metrics remain consistent.
```python
# Add parent path for imports
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
```
🛠️ Refactor suggestion | 🟠 Major
Remove sys.path manipulation in favor of proper package installation.
Modifying sys.path directly is fragile and creates dependencies on the directory structure. Instead, install the package in editable mode (pip install -e .) or configure pytest's pythonpath in pytest.ini or pyproject.toml.
🔎 Remove these lines and configure proper package installation:

```diff
-# Add parent path for imports
-sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
-
 from basilica.training import (
```

Then ensure the package is installed via:

```shell
pip install -e crates/basilica-sdk-python
```

Or add to pytest.ini:

```ini
[pytest]
pythonpath = crates/basilica-sdk-python/python
```

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In crates/basilica-sdk-python/python/basilica/training/tests/test_clients.py
around lines 10 to 11, remove the sys.path manipulation (sys.path.insert(...))
and rely on proper package installation or pytest configuration instead; install
the package in editable mode from the repo root (pip install -e
crates/basilica-sdk-python) or add the package path to pytest configuration
(pytest.ini or pyproject.toml pythonpath) so tests can import the package
without altering sys.path.
Actionable comments posted: 10
♻️ Duplicate comments (15)
services/training-service/README.md (1)
38-48: Add language specification to fenced code block. Line 38 begins an example output code block without a language identifier, which fails markdown linting (MD040).
🔎 Apply this diff to specify the language:
````diff
 ### Example Output

-```
+```text
 === Creating session for facebook/opt-125m ===
 Session: ts-a1b2c3d4
 === Training (3 steps) ===
 Step 1: loss=6.2612
 Step 2: loss=5.7423
 Step 3: loss=5.7501
 Sample: The quick brown -> fox jumps over the lazy
-```
+```
````

scripts/local-training-e2e.sh (4)
119-124: Verify gateway readiness or fail explicitly. The timeout warning on Line 124 allows the script to continue even if the Envoy Gateway deployment never becomes available. This could cause cryptic failures in downstream operations (like deploying HTTPRoutes or testing).

🔎 Consider failing explicitly:

```diff
 kubectl wait --namespace envoy-gateway-system \
     --for=condition=Available deployment/envoy-gateway \
-    --timeout=180s || log_warn "Timeout waiting for envoy-gateway, continuing..."
+    --timeout=180s || {
+        log_error "Envoy Gateway failed to become ready"
+        exit 1
+    }
```
568-574: Add timeout to curl health check. The health check curl on Line 570 has no timeout, which could cause the script to hang indefinitely if the API is unresponsive.

🔎 Add a timeout:

```diff
-    if ! curl -sf "$API_URL/health" > /dev/null; then
+    if ! curl -sf --max-time 10 "$API_URL/health" > /dev/null; then
         log_error "API not running. Start it with: $0 api"
         exit 1
     fi
-    curl -s "$API_URL/health" | jq .
+    curl -s --max-time 10 "$API_URL/health" | jq .
```
695-702: Add timeout to training service health check. The curl on Line 697 has no timeout and could hang if the port-forward is established but the service is unresponsive.

🔎 Add a timeout:

```diff
-    if ! curl -sf "$TRAINING_URL/health" > /dev/null 2>&1; then
+    if ! curl -sf --max-time 10 "$TRAINING_URL/health" > /dev/null 2>&1; then
         log_error "Training service not accessible via port-forward"
         kill $PF_PID 2>/dev/null
         return 1
     fi
-    curl -s "$TRAINING_URL/health" | jq .
+    curl -s --max-time 10 "$TRAINING_URL/health" | jq .
```
709-733: Add timeout to training session creation and subsequent curls. The curl on Line 709 creates a training session but has no timeout, risking a hang if the service is slow or unresponsive. The same issue applies to the other training-operation curls throughout `run_training_steps`.

🔎 Add timeouts to all training curls:

```diff
-    RESPONSE=$(curl -s -X POST "$TRAINING_URL/sessions" \
+    RESPONSE=$(curl -s --max-time 30 -X POST "$TRAINING_URL/sessions" \
         -H "Content-Type: application/json" \
```

Apply `--max-time 30` (or an appropriate value) to all curl calls in this function: lines 751, 772, 787, 791, 802, 820, 841, 858, 864, 871, 874, 880, 897, 902, 921.

crates/basilica-operator/src/controllers/training_session_controller.rs (10)
203-212: Status update condition still misses endpoint and start_time changes. The condition now checks `tokens_processed` (line 206) but still doesn't check `endpoint` or `start_time`. Changes to these fields won't trigger status updates, as previously noted.

239-251: Still using `is_err()` without distinguishing NotFound from other errors. As previously noted, this treats network/auth errors the same as "pod not found" and proceeds to create a pod.

253-265: Same `is_err()` issue for the service existence check. Apply the same fix pattern as recommended for the pod check.

314-321: Pod failure still doesn't transition the session to Failed phase. As previously noted, when the pod enters the "Failed" state, the session phase should transition to Failed, not remain in Scheduling.

389-394: Inconsistent error propagation with other handlers. As previously noted, this propagates errors with `?` while handle_pending converts similar failures to `Ok(status.with_error(...))`.

453-455: Silently ignoring the pod deletion result. As previously noted, the `let _ =` discards errors from delete_pod. At minimum, log the outcome.

643-650: Owner reference UID still defaults to an empty string if missing. As previously noted, the UID should be validated and an error returned if missing, not defaulted to an empty string.

710-717: Same empty UID issue in the owner reference. As previously noted, apply the same fix pattern as recommended for build_training_pod.

756-795: HTTPRoute still missing an owner reference for garbage collection. As previously noted, without ownerReferences this HTTPRoute will be orphaned when the TrainingSession is deleted.

819-826: Same empty UID issue in the NetworkPolicy owner reference. As previously noted, apply the same fix pattern as recommended for the other resource builders.
🧹 Nitpick comments (3)
scripts/local-training-e2e.sh (3)
58-62: Add timeout to k3d install curl. The k3d install curl has no timeout and could hang indefinitely if the network is slow or unresponsive.

🔎 Suggested fix:

```diff
 if ! command -v k3d &> /dev/null; then
     log_info "Installing k3d..."
-    curl -s https://raw.githubusercontent.com/k3d-io/k3d/main/install.sh | bash
+    curl -s --max-time 60 https://raw.githubusercontent.com/k3d-io/k3d/main/install.sh | bash
 fi
```
634-644: Unused loop variable. The loop variable `i` is unused. Use `_` to indicate an intentionally unused variable, or use brace expansion, which is slightly more efficient.

🔎 Suggested fix:

```diff
-    for i in {1..60}; do
+    for _ in {1..60}; do
         POD_STATUS=$(kubectl get pods -A -l app=basilica-training -o jsonpath='{.items[0].status.phase}' 2>/dev/null)
```

Based on static analysis hint SC2034.
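The poll-until-ready loop this touches can be sketched in isolation. The readiness probe below is a stand-in for the real `kubectl get pods ... -o jsonpath` query, so treat it as an illustrative sketch rather than the script's actual logic:

```shell
#!/usr/bin/env bash
# Poll-until-ready loop with `_` marking the intentionally unused counter
# (satisfies SC2034). The probe is a stand-in: it pretends the pod reaches
# Running on the third check.
attempts=0
POD_STATUS=""
for _ in {1..60}; do
    attempts=$((attempts + 1))
    if [ "$attempts" -ge 3 ]; then
        POD_STATUS="Running"   # stand-in for the kubectl jsonpath query
    fi
    if [ "$POD_STATUS" = "Running" ]; then
        break
    fi
    sleep 0.1
done
echo "status=$POD_STATUS after $attempts attempts"
```

The `break` on readiness keeps the worst case bounded at 60 iterations while exiting as soon as the pod is up.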
973-985: Add timeout to storage mounts curl. The curl on Line 980 has no timeout and could hang if the storage daemon is unresponsive, leaving the port-forward orphaned.

🔎 Add timeout:

```diff
 kubectl port-forward -n basilica-storage "pod/$DAEMON_POD" 9092:9090 &
 PF_PID=$!
 sleep 1
-curl -sf "http://localhost:9092/mounts" 2>/dev/null | jq . 2>/dev/null || echo "No mounts or daemon not responding"
+curl -sf --max-time 5 "http://localhost:9092/mounts" 2>/dev/null | jq . 2>/dev/null || echo "No mounts or daemon not responding"
 kill $PF_PID 2>/dev/null
 wait $PF_PID 2>/dev/null
```
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`Cargo.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (5)
- `crates/basilica-api/Cargo.toml` (1 hunks)
- `crates/basilica-operator/src/controllers/training_session_controller.rs` (1 hunks)
- `scripts/local-training-e2e.sh` (1 hunks)
- `services/training-service/README.md` (1 hunks)
- `services/training-service/src/backend.py` (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- services/training-service/src/backend.py
🧰 Additional context used
🧬 Code graph analysis (2)
scripts/local-training-e2e.sh (7)
- orchestrator/cluster-manager/src/clustermgr/commands/kubeconfig.py (1): `kubeconfig` (22-27)
- crates/basilica-common/src/utils/docker_validation.rs (1): `image` (38-39)
- crates/basilica-common/src/error.rs (1): `generic` (357-361)
- crates/basilica-api/src/server.rs (1): `run` (1188-1226)
- crates/basilica-operator/src/runtime.rs (1): `run` (162-361)
- orchestrator/cluster-manager/src/clustermgr/commands/deployments.py (1): `deployments` (103-213)
- orchestrator/cluster-manager/src/clustermgr/commands/gateway.py (1): `gateway` (296-302)
crates/basilica-operator/src/controllers/training_session_controller.rs (1)
crates/basilica-operator/src/crd/training_session.rs (5)
`new` (316-322), `new` (374-391), `default` (34-41), `default` (80-86), `default` (145-151)
🪛 markdownlint-cli2 (0.18.1)
services/training-service/README.md
38-38: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
128-128: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
200-200: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
253-253: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
404-404: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🪛 Shellcheck (0.11.0)
scripts/local-training-e2e.sh
[warning] 636-636: i appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 691-691: Use single quotes, otherwise this expands now rather than when signalled.
(SC2064)
🔇 Additional comments (7)
scripts/local-training-e2e.sh (6)
1-38: LGTM! Good use of `set -euo pipefail` for strict error handling, and the color-coded logging helpers improve script readability.

67-96: LGTM! Cluster creation logic is well-structured with proper port mappings, kubeconfig export, and node readiness wait.

129-150: LGTM! Image build and load logic is clean with clear logging for each step.

152-176: LGTM! Operator deployment is well-structured with proper kubectl wait and timeout.

399-433: LGTM! Postgres setup with stale data detection and reset functionality is well-implemented.

1032-1088: LGTM! Comprehensive help documentation clearly explains the workflow, commands, and test steps.

crates/basilica-operator/src/controllers/training_session_controller.rs (1)

872-982: Good test coverage for resource builders. The tests comprehensively verify pod and service creation, including GPU resources, FUSE storage configuration, volume mounts, and command wrapping. This provides solid confidence in the resource builder functions.
```rust
if sessions.sessions.is_empty() {
    debug!("no active sessions in training service");
    return None;
}

// Get status of the first session (typically there's only one per pod)
let session_id = &sessions.sessions[0];
let status_url = format!("{}/sessions/{}", internal_endpoint, session_id);
```
Verify session name matches expected TrainingSession.
The code retrieves the first session from the list without confirming it matches the expected name. If the training service somehow returns multiple sessions or a different session, status updates could be applied to the wrong TrainingSession.
🔎 Suggested fix to validate session name:

```diff
 if sessions.sessions.is_empty() {
     debug!("no active sessions in training service");
     return None;
 }

-// Get status of the first session (typically there's only one per pod)
-let session_id = &sessions.sessions[0];
+// Get status of the session matching the expected name; bail out otherwise
+let session_id = sessions.sessions.iter().find(|id| id.as_str() == name)?;
 let status_url = format!("{}/sessions/{}", internal_endpoint, session_id);
```

🤖 Prompt for AI Agents
In crates/basilica-operator/src/controllers/training_session_controller.rs
around lines 110 to 117, the code unconditionally uses the first session from
sessions.sessions which can misidentify the intended TrainingSession; instead,
search sessions.sessions for an entry that matches the expected `name` (or other
canonical identifier available in scope), select that session's id for status
checks, and only proceed when a matching session is found (otherwise log/debug
and return None). Ensure the lookup handles multiple sessions, uses the matched
session ID in the status_url, and retains clear logging when no matching session
exists.
```rust
// Poll training service for status updates
let mut updated_status = status.clone();
if let Some((steps, tokens)) = self.fetch_training_status(namespace, name).await {
    // Only update if there's actual progress
    if steps != status.steps_completed || tokens != status.tokens_processed {
        info!(
            session = %name,
            steps = steps,
            tokens = tokens,
            "updating training progress"
        );
        updated_status = updated_status.with_steps(steps, tokens);
    }
}

// Update last activity timestamp
updated_status.last_updated = chrono::Utc::now().to_rfc3339();
```
Timestamp updated even when status fetch fails.
Line 476 unconditionally updates last_updated even when fetch_training_status returns None. This creates misleading timestamps suggesting the session was recently checked when no actual status was obtained.
🔎 Suggested fix to update timestamp only on successful fetch:

```diff
 // Poll training service for status updates
 let mut updated_status = status.clone();
 if let Some((steps, tokens)) = self.fetch_training_status(namespace, name).await {
     // Only update if there's actual progress
     if steps != status.steps_completed || tokens != status.tokens_processed {
         info!(
             session = %name,
             steps = steps,
             tokens = tokens,
             "updating training progress"
         );
         updated_status = updated_status.with_steps(steps, tokens);
     }
+    // Update last activity timestamp only when we successfully fetched status
+    updated_status.last_updated = chrono::Utc::now().to_rfc3339();
 }

-// Update last activity timestamp
-updated_status.last_updated = chrono::Utc::now().to_rfc3339();
-
 Ok(updated_status)
```

🤖 Prompt for AI Agents
In crates/basilica-operator/src/controllers/training_session_controller.rs
around lines 460 to 476, last_updated is set unconditionally even when
fetch_training_status returned None; move the updated_status.last_updated =
chrono::Utc::now().to_rfc3339() into the Some branch so the timestamp is only
updated when fetch_training_status succeeded (place it after updating
steps/tokens or alongside that update), ensuring last_updated changes only on
successful status fetch.
```rust
// Allow TRAINING_IMAGE env var to override the default image (for local dev)
let image = std::env::var("TRAINING_IMAGE").unwrap_or_else(|_| spec.image.clone());
```
TRAINING_IMAGE override lacks dev_mode protection.
The environment variable allows overriding the spec image for "local dev" but isn't gated behind a dev_mode check. If this env var is accidentally set in production, it could allow execution of arbitrary container images, bypassing security policies.
🔎 Suggested fix to gate behind dev_mode:

```diff
-// Allow TRAINING_IMAGE env var to override the default image (for local dev)
-let image = std::env::var("TRAINING_IMAGE").unwrap_or_else(|_| spec.image.clone());
+// Allow TRAINING_IMAGE env var to override the default image (only in dev mode)
+let image = if cfg!(feature = "dev_mode") {
+    std::env::var("TRAINING_IMAGE").unwrap_or_else(|_| spec.image.clone())
+} else {
+    spec.image.clone()
+};
```

Alternatively, check for an explicit DEV_MODE environment variable:

```diff
-// Allow TRAINING_IMAGE env var to override the default image (for local dev)
-let image = std::env::var("TRAINING_IMAGE").unwrap_or_else(|_| spec.image.clone());
+// Allow TRAINING_IMAGE env var to override the default image (only in dev mode)
+let image = if std::env::var("DEV_MODE").is_ok() {
+    std::env::var("TRAINING_IMAGE").unwrap_or_else(|_| spec.image.clone())
+} else {
+    spec.image.clone()
+};
```

🤖 Prompt for AI Agents
In crates/basilica-operator/src/controllers/training_session_controller.rs
around lines 500-501, the TRAINING_IMAGE env var is used unconditionally to
override the spec image; gate this override behind a dev-mode check (e.g., check
an explicit DEV_MODE env var or a operator config flag) so production cannot be
overridden. Update the code to read DEV_MODE (parse as bool or check for
"1"/"true"), only call std::env::var("TRAINING_IMAGE") when dev mode is enabled,
otherwise always use spec.image; also add a warning log when dev mode is enabled
and the override is applied, and ensure no panic on missing/invalid DEV_MODE by
defaulting to false.
```shell
create_r2_credentials() {
    export KUBECONFIG="$KUBECONFIG_PATH"

    # R2 credentials - you can override these with environment variables
    R2_ENDPOINT="${R2_ENDPOINT:-https://80f15715bb0b882c9e967c13e677ed7d.r2.cloudflarestorage.com}"
    R2_ACCESS_KEY="${R2_ACCESS_KEY:-df1147f966d3f8f16af3911da34478e8}"
    R2_SECRET_KEY="${R2_SECRET_KEY:-2db5c9bd18a845df832938f95217e64d91d50437bdf9f39f5423f752eb3c697b}"
    R2_BUCKET="${R2_BUCKET:-80f15715bb0b882c9e967c13e677ed7d}"
    R2_REGION="${R2_REGION:-auto}"
```
Critical: Hardcoded R2 credentials in source code.
The R2 access key and secret key defaults appear to be real credentials committed to version control. This is a serious security risk—even if these are for a test bucket, credentials in source code can be scraped, reused, or accidentally exposed.
🔎 Remove default secrets and require environment variables:

```diff
     # R2 credentials - you can override these with environment variables
-    R2_ENDPOINT="${R2_ENDPOINT:-https://80f15715bb0b882c9e967c13e677ed7d.r2.cloudflarestorage.com}"
-    R2_ACCESS_KEY="${R2_ACCESS_KEY:-df1147f966d3f8f16af3911da34478e8}"
-    R2_SECRET_KEY="${R2_SECRET_KEY:-2db5c9bd18a845df832938f95217e64d91d50437bdf9f39f5423f752eb3c697b}"
-    R2_BUCKET="${R2_BUCKET:-80f15715bb0b882c9e967c13e677ed7d}"
-    R2_REGION="${R2_REGION:-auto}"
+    R2_ENDPOINT="${R2_ENDPOINT:?R2_ENDPOINT environment variable is required}"
+    R2_ACCESS_KEY="${R2_ACCESS_KEY:?R2_ACCESS_KEY environment variable is required}"
+    R2_SECRET_KEY="${R2_SECRET_KEY:?R2_SECRET_KEY environment variable is required}"
+    R2_BUCKET="${R2_BUCKET:?R2_BUCKET environment variable is required}"
+    R2_REGION="${R2_REGION:-auto}"
```

If these credentials are already committed, they should be rotated immediately.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```shell
create_r2_credentials() {
    export KUBECONFIG="$KUBECONFIG_PATH"

    # R2 credentials - you can override these with environment variables
    R2_ENDPOINT="${R2_ENDPOINT:?R2_ENDPOINT environment variable is required}"
    R2_ACCESS_KEY="${R2_ACCESS_KEY:?R2_ACCESS_KEY environment variable is required}"
    R2_SECRET_KEY="${R2_SECRET_KEY:?R2_SECRET_KEY environment variable is required}"
    R2_BUCKET="${R2_BUCKET:?R2_BUCKET environment variable is required}"
    R2_REGION="${R2_REGION:-auto}"
```
🤖 Prompt for AI Agents
In scripts/local-training-e2e.sh around lines 262 to 270, the R2 access and
secret keys are hardcoded as defaults; remove these embedded secrets and require
callers to supply credentials via environment variables (or a secure secret
manager). Replace the default assignments so R2_ACCESS_KEY and R2_SECRET_KEY are
read directly from the environment without fallback values, validate at runtime
and exit with a clear error if they are unset, update README/.env.example to
document required vars, and ensure any committed secrets are rotated and removed
from the repo/history (and add a note to rotate if present).
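The `${VAR:?message}` form used in the fix behaves differently from `${VAR:-default}`: the former makes the expansion fail with a message when the variable is unset or empty, while the latter substitutes silently. A small bash sketch (the variable names are illustrative, not the script's real ones):

```shell
#!/usr/bin/env bash
# ${VAR:-default} substitutes silently; ${VAR:?msg} makes the expansion
# fail when VAR is unset or empty. Variable names here are illustrative.
unset DEMO_REGION DEMO_SECRET
region="${DEMO_REGION:-auto}"   # unset, so falls back to "auto"
if out=$( { echo "${DEMO_SECRET:?DEMO_SECRET is required}"; } 2>/dev/null ); then
    result="expanded"
else
    result="aborted"            # taken: DEMO_SECRET is unset
fi
echo "region=$region result=$result"
```

Running the `:?` expansion inside a command substitution keeps the failure contained to the subshell; at the top level of a non-interactive script it terminates the script, which is exactly the fail-fast behavior the suggested fix relies on.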
```shell
    sleep 3

    # Cleanup on exit
    trap "kill $PF_PID 2>/dev/null" EXIT
```
Trap expansion happens at definition time, not signal time.
Using double quotes causes $PF_PID to expand immediately when the trap is set. If the variable changes later (or is unset), the trap will use the stale value.
🔎 Use single quotes for delayed expansion:

```diff
-    trap "kill $PF_PID 2>/dev/null" EXIT
+    trap 'kill $PF_PID 2>/dev/null' EXIT
```

Based on static analysis hint SC2064.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```shell
    sleep 3

    # Cleanup on exit
    trap 'kill $PF_PID 2>/dev/null' EXIT
```
🧰 Tools
🪛 Shellcheck (0.11.0)
[warning] 691-691: Use single quotes, otherwise this expands now rather than when signalled.
(SC2064)
🤖 Prompt for AI Agents
In scripts/local-training-e2e.sh around lines 688 to 691, the trap is defined
with double quotes so $PF_PID is expanded immediately; change the trap to use
single quotes (e.g., trap 'kill $PF_PID 2>/dev/null' EXIT) so the variable is
expanded at signal time and the current PF_PID value is used when the script
exits.
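The timing difference SC2064 flags can be observed directly. An EXIT trap set inside a command substitution fires when that subshell exits, which allows a compact bash demonstration (a sketch, not code from the script):

```shell
#!/usr/bin/env bash
# Double quotes expand $PF_PID when the trap is *defined*; single quotes
# defer expansion until the trap *fires* on EXIT.
double=$( PF_PID=111; trap "echo $PF_PID" EXIT; PF_PID=222 )
single=$( PF_PID=111; trap 'echo $PF_PID' EXIT; PF_PID=222 )
echo "double-quoted trap printed: $double"   # value baked in at definition
echo "single-quoted trap printed: $single"   # value read at exit
```

In the script's case, a double-quoted trap would try to kill whatever PID `$PF_PID` held when the trap was registered, even if the variable was later reassigned to a new port-forward process.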
```text
Training Pod              Storage Daemon (DaemonSet)         R2 Bucket
┌─────────────┐           ┌──────────────────────────┐       ┌─────────┐
│ /checkpoints│──hostPath──▶│ FUSE Mount             │       │         │
│ (volume)    │           │ /var/lib/basilica/fuse/  │──sync──▶│ basilica│
└─────────────┘           │   u-testuser/            │       │         │
                          └──────────────────────────┘       └─────────┘
```
Add language specification to fenced code block.
Line 128 begins an ASCII diagram code block without a language identifier, which fails markdown linting (MD040).
🔎 Apply this diff to specify the language:
````diff
 #### How It Works

-```
+```text
 Training Pod              Storage Daemon (DaemonSet)         R2 Bucket
 ┌─────────────┐           ┌──────────────────────────┐       ┌─────────┐
 │ /checkpoints│──hostPath──▶│ FUSE Mount             │       │         │
````

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
````markdown
#### How It Works

```text
Training Pod              Storage Daemon (DaemonSet)         R2 Bucket
┌─────────────┐           ┌──────────────────────────┐       ┌─────────┐
│ /checkpoints│──hostPath──▶│ FUSE Mount             │       │         │
│ (volume)    │           │ /var/lib/basilica/fuse/  │──sync──▶│ basilica│
└─────────────┘           │   u-testuser/            │       │         │
                          └──────────────────────────┘       └─────────┘
```
````
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)
128-128: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
In services/training-service/README.md around lines 128 to 135, the fenced ASCII
diagram block lacks a language identifier causing markdown lint MD040; add a
language specifier (e.g., ```text) on the opening fence so the block becomes a
labeled code block, leaving the diagram content unchanged and keeping the
closing ``` as-is.
```text
{bucket}/{namespace}/{namespace}/{checkpoint_path}

Example:
basilica/u-testuser/u-testuser/train-session-1/test-checkpoint-final/adapter_model.safetensors
```
Add language specification to fenced code block.
Line 200 begins a code block showing the R2 object path structure without a language identifier, which fails markdown linting (MD040).
🔎 Apply this diff to specify the language:
````diff
 #### R2 Object Path

 Checkpoints are stored in R2 with the following path structure:

-```
+```text
 {bucket}/{namespace}/{namespace}/{checkpoint_path}

 Example:
 basilica/u-testuser/u-testuser/train-session-1/test-checkpoint-final/adapter_model.safetensors
-```
+```
````
🪛 markdownlint-cli2 (0.18.1)
200-200: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
In services/training-service/README.md around lines 200 to 205, the fenced code
block showing the R2 object path lacks a language identifier (causing MD040);
update the opening fence to include the language specifier (e.g., change ``` to
```text) so the block becomes a language-marked code fence while leaving the
block contents unchanged.
```text
[STEP] Running training steps via port-forward...
[INFO] Training step 1/3
[INFO] Loss: 6.26, Tokens: 10
[INFO] Completed step: 1
[INFO] Training step 2/3
[INFO] Loss: 5.74, Tokens: 10
[INFO] Completed step: 2
[INFO] Training step 3/3
[INFO] Loss: 5.75, Tokens: 10
[INFO] Completed step: 3

[INFO] Training session status:
{
  "session_id": "train-session-1",
  "step_count": 3,
  "tokens_processed": 30,
  "lora_rank": 8
}
```
Add language specification to fenced code block.
Line 253 begins an example output code block without a language identifier, which fails markdown linting (MD040).
🔎 Apply this diff to specify the language:
````diff
 ### Example Output

-```
+```text
 [STEP] Running training steps via port-forward...
 [INFO] Training step 1/3
 [INFO] Loss: 6.26, Tokens: 10
 [INFO] Completed step: 1
 [INFO] Training step 2/3
 [INFO] Loss: 5.74, Tokens: 10
 [INFO] Completed step: 2
 [INFO] Training step 3/3
 [INFO] Loss: 5.75, Tokens: 10
 [INFO] Completed step: 3

 [INFO] Training session status:
 {
   "session_id": "train-session-1",
   "step_count": 3,
   "tokens_processed": 30,
   "lora_rank": 8
 }
-```
+```
````

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
````markdown
### Example Output

```text
[STEP] Running training steps via port-forward...
[INFO] Training step 1/3
[INFO] Loss: 6.26, Tokens: 10
[INFO] Completed step: 1
[INFO] Training step 2/3
[INFO] Loss: 5.74, Tokens: 10
[INFO] Completed step: 2
[INFO] Training step 3/3
[INFO] Loss: 5.75, Tokens: 10
[INFO] Completed step: 3

[INFO] Training session status:
{
  "session_id": "train-session-1",
  "step_count": 3,
  "tokens_processed": 30,
  "lora_rank": 8
}
```
````
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)
253-253: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
In services/training-service/README.md around lines 253 to 272, the example
output code block is missing a language identifier which triggers markdown lint
rule MD040; update the opening fenced code block from ``` to ```text so the
block is explicitly marked as plain text (leave the closing ``` unchanged) to
satisfy the linter and preserve the example output formatting.
```
┌─────────────────────────────────────────────────────────────┐
│                        Basilica API                         │
│                (creates TrainingSession CRD)                │
└─────────────────────┬───────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────────┐
│                      Basilica Operator                      │
│  • Reconciles TrainingSession CRD                           │
│  • Creates Pod + Service + NetworkPolicy                    │
│  • Polls training service for status updates                │
└─────────────────────┬───────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────────┐
│                    Training Service Pod                     │
├─────────────────────────────────────────────────────────────┤
│  FastAPI Server (src/server.py)                             │
├─────────────────────────────────────────────────────────────┤
│  Training Backend (src/backend.py)                          │
│  • Session management                                       │
│  • Forward-backward with gradient accumulation              │
│  • Optimizer step                                           │
│  • Text generation                                          │
│  • Checkpoint save/load                                     │
├─────────────────────────────────────────────────────────────┤
│  HuggingFace Transformers + PEFT                            │
│  • Model loading (OPT, LLaMA, etc.)                         │
│  • LoRA adapter (rank, alpha, dropout)                      │
│  • AdamW optimizer                                          │
└─────────────────────────────────────────────────────────────┘
```
Add language specification to fenced code block.
Line 404 begins an ASCII architecture diagram code block without a language identifier, which fails markdown linting (MD040).
🔎 Apply this diff to specify the language:
````diff
 ## Architecture
-```
+```text
 ┌─────────────────────────────────────────────────────────────┐
 │                        Basilica API                         │
 │                (creates TrainingSession CRD)                │
 └─────────────────────┬───────────────────────────────────────┘
                       │
                       ▼
 ┌─────────────────────────────────────────────────────────────┐
 │                      Basilica Operator                      │
 │  • Reconciles TrainingSession CRD                           │
 │  • Creates Pod + Service + NetworkPolicy                    │
 │  • Polls training service for status updates                │
 └─────────────────────┬───────────────────────────────────────┘
                       │
                       ▼
 ┌─────────────────────────────────────────────────────────────┐
 │                    Training Service Pod                     │
 ├─────────────────────────────────────────────────────────────┤
 │  FastAPI Server (src/server.py)                             │
 ├─────────────────────────────────────────────────────────────┤
 │  Training Backend (src/backend.py)                          │
 │  • Session management                                       │
 │  • Forward-backward with gradient accumulation              │
 │  • Optimizer step                                           │
 │  • Text generation                                          │
 │  • Checkpoint save/load                                     │
 ├─────────────────────────────────────────────────────────────┤
 │  HuggingFace Transformers + PEFT                            │
 │  • Model loading (OPT, LLaMA, etc.)                         │
 │  • LoRA adapter (rank, alpha, dropout)                      │
 │  • AdamW optimizer                                          │
-```
+```
````

🧰 Tools
🪛 markdownlint-cli2 (0.18.1)
404-404: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
In services/training-service/README.md around lines 404 to 436, the ASCII
architecture diagram code block is missing a language identifier which triggers
markdown lint rule MD040; update the opening fence from ``` to ```text (or
another suitable language like ```plain) and keep the closing fence unchanged so
the block is recognized as a code block with a language specifier.
Actionable comments posted: 2
♻️ Duplicate comments (1)
examples/training_example.py (1)
39-40: Resource leak: file handle not closed. This issue was previously flagged. The file is opened without a context manager, leaving the handle unclosed. The same pattern appears at lines 182-183, 213-214, and 247-248.
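The fix is the standard context-manager pattern. A minimal self-contained sketch follows; the file name and JSON shape are assumptions for illustration, not the actual contents of `examples/training_example.py`:

```python
import json
import os
import tempfile

# Create a throwaway file so the sketch runs standalone.
path = os.path.join(tempfile.mkdtemp(), "train_data.json")
with open(path, "w") as f:
    json.dump([{"text": "hello"}], f)

# Leaky pattern flagged above -- the handle is never closed:
#     data = json.load(open(path))
# Context-manager version -- the handle is closed even if json.load raises:
with open(path) as f:
    data = json.load(f)

print(data[0]["text"])  # hello
```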
🧹 Nitpick comments (3)
crates/basilica-sdk-python/python/basilica/training/types.py (2)
176-194: Use `get_running_loop()` and fix caching inconsistency. Three issues:

- `_result_type` is stored but never used (dead code)
- `asyncio.get_event_loop()` is deprecated in Python 3.10+ when called from a coroutine; use `asyncio.get_running_loop()` instead
- `result()` caches to `_result`, but `result_async()` bypasses this cache, causing inconsistent behavior if both are called

🔎 Proposed fix

```diff
-    def __init__(self, future: Future, result_type: type = None):
+    def __init__(self, future: Future, result_type: Optional[type] = None):
         self._future = future
         self._result = None
-        self._result_type = result_type
+        self._result_type = result_type  # Reserved for future deserialization

     def result(self, timeout: Optional[float] = None):
         """Block until operation completes (sync)."""
         if self._result is None:
             self._result = self._future.result(timeout=timeout)
         return self._result

     async def result_async(self, timeout: Optional[float] = None):
         """Wait for operation to complete (async)."""
-        loop = asyncio.get_event_loop()
-        if timeout is not None:
-            return await asyncio.wait_for(
-                loop.run_in_executor(None, self._future.result), timeout=timeout
-            )
-        return await loop.run_in_executor(None, self._future.result)
+        if self._result is not None:
+            return self._result
+        loop = asyncio.get_running_loop()
+        if timeout is not None:
+            self._result = await asyncio.wait_for(
+                loop.run_in_executor(None, self._future.result), timeout=timeout
+            )
+        else:
+            self._result = await loop.run_in_executor(None, self._future.result)
+        return self._result
```
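To see why the proposed shape works, here is a runnable reduction of the wrapper with the fix applied; the class and method names follow the review, but the surrounding SDK is not reproduced, so treat this as a sketch:

```python
import asyncio
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Optional


class APIFuture:
    """Minimal sketch of the fixed wrapper; mirrors the proposed diff."""

    def __init__(self, future: Future, result_type: Optional[type] = None):
        self._future = future
        self._result = None
        self._result_type = result_type  # reserved; see the dead-code note above

    def result(self, timeout: Optional[float] = None):
        if self._result is None:
            self._result = self._future.result(timeout=timeout)
        return self._result

    async def result_async(self, timeout: Optional[float] = None):
        if self._result is not None:  # share the sync cache
            return self._result
        loop = asyncio.get_running_loop()  # safe: we are inside a coroutine
        waiter = loop.run_in_executor(None, self._future.result)
        if timeout is not None:
            self._result = await asyncio.wait_for(waiter, timeout)
        else:
            self._result = await waiter
        return self._result


async def main():
    with ThreadPoolExecutor() as pool:
        fut = APIFuture(pool.submit(lambda: 42))
        print(await fut.result_async())  # 42
        print(fut.result())              # 42, served from the shared cache


asyncio.run(main())
```

Note the remaining edge case: a legitimately `None` result defeats the `is not None` cache check, exactly as in the proposed diff.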
207-216: Consider sorting `__all__` alphabetically. Per static analysis (RUF022), the export list is unsorted. Sorting improves readability and reduces merge conflicts.
examples/training_example.py (1)
126-144: Retry logic is reasonable but consider documenting the magic numbers. The 5 retries with 1-second delays are hardcoded. For example code this is acceptable, but a brief inline comment explaining why these values were chosen would help readers understand the expected service behavior.
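One way to address this nitpick without changing behavior is to name the constants and state the rationale next to them. The function and constant names below are illustrative, not taken from the example script:

```python
import time

# Illustrative constants: assume the service usually passes its health check
# within a few seconds of start, so 5 probes spaced 1 s apart covers startup.
MAX_RETRIES = 5
RETRY_DELAY_S = 1.0


def wait_until_ready(probe, retries=MAX_RETRIES, delay=RETRY_DELAY_S):
    """Call probe() until it succeeds or retries are exhausted."""
    for attempt in range(1, retries + 1):
        try:
            return probe()
        except ConnectionError:
            if attempt == retries:
                raise
            time.sleep(delay)


# Demo probe that fails twice, then succeeds.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("not ready")
    return "ready"

print(wait_until_ready(flaky, delay=0.01))  # ready
```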
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- crates/basilica-sdk-python/python/basilica/training/types.py (1 hunks)
- examples/training_example.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.14.8)
examples/training_example.py
55-55: Do not catch blind exception: Exception
(BLE001)
151-151: f-string without any placeholders
Remove extraneous f prefix
(F541)
crates/basilica-sdk-python/python/basilica/training/types.py
134-134: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
135-135: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
136-136: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
137-137: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
138-138: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
140-140: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
141-141: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
142-142: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
143-143: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
144-144: Unused method argument: kwargs
(ARG002)
176-176: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
207-216: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
🔇 Additional comments (7)
crates/basilica-sdk-python/python/basilica/training/types.py (4)
16-45: LGTM! The `Datum` dataclass is well-structured with appropriate optional fields and a clean `to_dict()` serialization method.
47-67: LGTM! The factory methods provide a clean API for creating `ModelInput` from different sources.
69-94: LGTM! Sensible defaults and proper handling of `stop_sequences` in serialization.
99-124: LGTM! Response dataclasses are well-defined with appropriate fields and defaults.
examples/training_example.py (3)
60-86: LGTM! Good use of the context manager pattern for the training client, and the training loop demonstrates the `APIFuture.result()` pattern clearly.
207-241: Good use of try/finally for cleanup in async context. The `try/finally` pattern ensures `training.close()` is called even if an exception occurs during async training.
295-317: LGTM! Clean CLI argument handling with argparse and clear dispatch to the appropriate example function.
```python
def __init__(
    self,
    models: List[str] = None,
    max_batch_tokens: int = None,
    max_sequence_length: int = None,
    gpu_types: List[str] = None,
    lora_ranks: List[int] = None,
    # camelCase variants from API
    maxBatchTokens: int = None,
    maxSequenceLength: int = None,
    gpuTypes: List[str] = None,
    loraRanks: List[int] = None,
    **kwargs,  # Ignore unknown fields
):
    self.models = models or []
    self.max_batch_tokens = max_batch_tokens or maxBatchTokens or 0
    self.max_sequence_length = max_sequence_length or maxSequenceLength or 0
    self.gpu_types = gpu_types or gpuTypes or []
    self.lora_ranks = lora_ranks or loraRanks or []
```
Falsy value bug: explicit 0 values are ignored.
The `or` chaining treats `0` as falsy, so `max_batch_tokens=0` falls through to check `maxBatchTokens`. While the final fallback to `0` masks this for zeroes, it's semantically incorrect and will fail if someone passes `max_batch_tokens=0, maxBatchTokens=100` expecting `0`.
Additionally, per static analysis hints, the implicit Optional types (using `= None` without `Optional[T]`) violate PEP 484.
🔎 Proposed fix using explicit None checks

```diff
 def __init__(
     self,
-    models: List[str] = None,
-    max_batch_tokens: int = None,
-    max_sequence_length: int = None,
-    gpu_types: List[str] = None,
-    lora_ranks: List[int] = None,
+    models: Optional[List[str]] = None,
+    max_batch_tokens: Optional[int] = None,
+    max_sequence_length: Optional[int] = None,
+    gpu_types: Optional[List[str]] = None,
+    lora_ranks: Optional[List[int]] = None,
     # camelCase variants from API
-    maxBatchTokens: int = None,
-    maxSequenceLength: int = None,
-    gpuTypes: List[str] = None,
-    loraRanks: List[int] = None,
+    maxBatchTokens: Optional[int] = None,
+    maxSequenceLength: Optional[int] = None,
+    gpuTypes: Optional[List[str]] = None,
+    loraRanks: Optional[List[int]] = None,
     **kwargs,  # Ignore unknown fields
 ):
     self.models = models or []
-    self.max_batch_tokens = max_batch_tokens or maxBatchTokens or 0
-    self.max_sequence_length = max_sequence_length or maxSequenceLength or 0
-    self.gpu_types = gpu_types or gpuTypes or []
-    self.lora_ranks = lora_ranks or loraRanks or []
+    self.max_batch_tokens = max_batch_tokens if max_batch_tokens is not None else (maxBatchTokens if maxBatchTokens is not None else 0)
+    self.max_sequence_length = max_sequence_length if max_sequence_length is not None else (maxSequenceLength if maxSequenceLength is not None else 0)
+    self.gpu_types = gpu_types if gpu_types is not None else (gpuTypes if gpuTypes is not None else [])
+    self.lora_ranks = lora_ranks if lora_ranks is not None else (loraRanks if loraRanks is not None else [])
```

Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 Ruff (0.14.8)
134-134: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
135-135: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
136-136: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
137-137: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
138-138: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
140-140: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
141-141: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
142-142: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
143-143: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
144-144: Unused method argument: kwargs
(ARG002)
🤖 Prompt for AI Agents
In crates/basilica-sdk-python/python/basilica/training/types.py around lines 132
to 151, the constructor uses "or" chaining which treats valid falsy values like
0 as missing; change the assignments to explicit None checks (e.g., set
self.max_batch_tokens = max_batch_tokens if max_batch_tokens is not None else
(maxBatchTokens if maxBatchTokens is not None else 0)) and similarly for
max_sequence_length, gpu_types, and lora_ranks so passed zeros or empty lists
are preserved; also update the parameter type annotations from "T = None" to
"Optional[T] = None" (import Optional from typing) for all optional params to
satisfy PEP 484.
```python
if abs(loss_after_load - loss_before_changes) < 0.01:
    print("✓ R2 checkpoint round-trip PASSED - weights restored correctly!")
else:
    print(f"✗ R2 checkpoint round-trip FAILED - loss mismatch")
```
Remove extraneous f prefix from string.
This f-string has no placeholders, making the f prefix unnecessary.
🔎 Proposed fix

```diff
-    print(f"✗ R2 checkpoint round-trip FAILED - loss mismatch")
+    print("✗ R2 checkpoint round-trip FAILED - loss mismatch")
```
🧰 Tools
🪛 Ruff (0.14.8)
151-151: f-string without any placeholders
Remove extraneous f prefix
(F541)
🤖 Prompt for AI Agents
In examples/training_example.py around line 151, the print statement uses an
unnecessary f-string prefix for a literal without placeholders; remove the
leading "f" so the string is a normal literal (print("✗ R2 checkpoint round-trip
FAILED - loss mismatch")) to avoid misleading use of f-strings.
Made-with: Cursor

```text
# Conflicts:
#	.gitignore
#	Cargo.lock
#	crates/basilica-api/Cargo.toml
#	crates/basilica-api/src/api/middleware/scope.rs
#	crates/basilica-api/src/api/mod.rs
#	crates/basilica-api/src/api/routes/mod.rs
#	crates/basilica-api/src/bin/gen_api_key.rs
#	crates/basilica-api/src/config/mod.rs
#	crates/basilica-api/src/k8s/trait.rs
#	crates/basilica-api/src/server.rs
#	crates/basilica-operator/src/controllers/mod.rs
#	crates/basilica-operator/src/crd/mod.rs
#	crates/basilica-operator/src/k8s_client.rs
#	crates/basilica-operator/src/runtime.rs
#	justfile
#	scripts/api/compose.dev.yml
```
Summary
Brief description of what this PR does.
Related Issues
Closes #(issue number)
Type of Change
Changes Made
List the main changes in this PR:
Testing
How Has This Been Tested?
Describe the tests you ran to verify your changes.
- Unit tests (`cargo test`)

Test Configuration
Checklist
- Ran `cargo fmt` to format my code
- Ran `cargo clippy` and addressed all warnings

Additional Context
Add any other context about the PR here.
Summary by CodeRabbit
New Features
Documentation