Skip to content

Conversation

@julienmancuso
Copy link
Contributor

@julienmancuso julienmancuso commented Nov 6, 2025

Overview:

add dynamoModel CRD

Summary by CodeRabbit

Release Notes

  • New Features

    • Added DynamoModel custom resource for managing model lifecycle and LoRA adapters
    • Added modelRef field to enable linking models to components for endpoint discovery
    • Introduced automatic headless service generation for model endpoint discovery
  • Documentation

    • Added comprehensive DynamoModel deployment guide with quickstart and troubleshooting
    • Updated API reference with DynamoModel schema and configuration options
    • Added LoRA adapter deployment instructions to deployment guide

Example of a DGD and associated new DynamoModel CR :

apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: sglang-disagg
spec:
  services:
    Frontend:
      dynamoNamespace: sglang-disagg
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.0
    decode:
      modelRef:
        name: Qwen/Qwen3-0.6B
      envFromSecret: hf-token-secret
      dynamoNamespace: sglang-disagg
      componentType: worker
      subComponentType: decode
      replicas: 1
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.0
          workingDir: /workspace/examples/backends/sglang
          command:
          - python3
          - -m
          - dynamo.sglang
          args:
            - --model-path
            - Qwen/Qwen3-0.6B
            - --served-model-name
            - Qwen/Qwen3-0.6B
            - --page-size
            - "16"
            - --tp
            - "1"
            - --trust-remote-code
            - --skip-tokenizer-init
            - --disaggregation-mode
            - decode
            - --disaggregation-transfer-backend
            - nixl
            - --disaggregation-bootstrap-port
            - "12345"
            - --host
            - "0.0.0.0"
    prefill:
      modelRef:
        name: Qwen/Qwen3-0.6B
      envFromSecret: hf-token-secret
      dynamoNamespace: sglang-disagg
      componentType: worker
      subComponentType: prefill
      replicas: 1
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.0
          workingDir: /workspace/examples/backends/sglang
          command:
          - python3
          - -m
          - dynamo.sglang
          args:
            - --model-path
            - Qwen/Qwen3-0.6B
            - --served-model-name
            - Qwen/Qwen3-0.6B
            - --page-size
            - "16"
            - --tp
            - "1"
            - --trust-remote-code
            - --skip-tokenizer-init
            - --disaggregation-mode
            - prefill
            - --disaggregation-transfer-backend
            - nixl
            - --disaggregation-bootstrap-port
            - "12345"
            - --host
            - "0.0.0.0"
---
# Example DynamoModel CR - Base Model
apiVersion: nvidia.com/v1alpha1
kind: DynamoModel
metadata:
  name: sglang-3-0-6b-my-lora
spec:
  modelName: Qwen/Qwen3-0.6B-my-lora
  baseModelName: Qwen/Qwen3-0.6B
  modelType: lora
  source:
    uri: s3://my-bucket/Qwen/Qwen3-0.6B-my-lora

the new controller would make sure the workers of the DGD (both decode and worker) would have the LORA loaded by calling their POST /v1/loras API.

internally we use headless service and associated endpointSlices to make sure the LORA are loaded

Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
@julienmancuso julienmancuso requested a review from a team as a code owner November 6, 2025 21:19
@github-actions github-actions bot added the feat label Nov 6, 2025
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Nov 6, 2025

Walkthrough

This pull request introduces the DynamoModel Custom Resource Definition and supporting infrastructure to manage model lifecycle and LoRA adapter deployment in the Dynamo operator. Key additions include a new DynamoModel CRD with spec and status fields, ModelRef fields on existing CRDs for model linkage, an endpoint management client for LoRA operations, a worker pool for concurrent task execution, a DynamoModel reconciler, RBAC updates, and comprehensive documentation.

Changes

Cohort / File(s) Summary
DynamoModel CRD Definitions (Kubernetes Manifests)
deploy/cloud/helm/crds/templates/nvidia.com_dynamomodels.yaml, deploy/cloud/operator/config/crd/bases/nvidia.com_dynamomodels.yaml
Introduces new CustomResourceDefinition for DynamoModel resource with API group nvidia.com, v1alpha1 version. Includes spec fields (baseModelName, modelName, modelType enum, source), status fields (endpoints, readyEndpoints, totalEndpoints, conditions), printer columns (BaseModel, Type, Ready, Total, Age), and namespaced scope with status subresource.
ModelRef Extension to Existing CRDs
deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml, deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml, deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml, deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml
Adds new optional modelRef field to DynamoComponentDeployment and DynamoGraphDeployment CRD specs. Field references a model via name (required) and revision (optional) for endpoint discovery via headless service.
Go API Types
deploy/cloud/operator/api/v1alpha1/dynamo_model_types.go, deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go, deploy/cloud/operator/api/v1alpha1/zz_generated.deepcopy.go
Adds DynamoModel, DynamoModelSpec, DynamoModelStatus, EndpointInfo, ModelSource, and ModelReference types with Kubebuilder markers and helper methods (IsLoRA, GetReadyEndpoints, HasEndpoints, HasReadyEndpoints). Updates deepcopy methods for new types and adjusts imports for Kubernetes core/v1 types.
DynamoModel Controller
deploy/cloud/operator/internal/controller/dynamo_model_controller.go, deploy/cloud/operator/internal/controller/dynamo_model_controller_test.go
Implements DynamoModelReconciler with endpoint discovery via EndpointSlice, LoRA loading/unloading through EndpointClient, status field management (TotalEndpoints, ReadyEndpoints), condition updates (EndpointsReady, ServicesFound), finalization logic, and comprehensive test coverage for reconciliation scenarios.
Model Service Reconciliation
deploy/cloud/operator/internal/dynamo/model_service.go, deploy/cloud/operator/internal/dynamo/graph.go
Adds ReconcileModelServicesForComponents to generate and sync headless services for model endpoint discovery. Introduces helpers: HashModelName, GenerateServiceName, AddBaseModelLabel, AddBaseModelAnnotation. Updates DynamoComponentDeployment and DynamoGraphDeployment controllers to call service reconciliation.
Endpoint Management (modelendpoint Package)
deploy/cloud/operator/internal/modelendpoint/client.go, deploy/cloud/operator/internal/modelendpoint/discovery.go, deploy/cloud/operator/internal/modelendpoint/lora.go, deploy/cloud/operator/internal/modelendpoint/types.go, deploy/cloud/operator/internal/modelendpoint/client_test.go, deploy/cloud/operator/internal/modelendpoint/discovery_test.go, deploy/cloud/operator/internal/modelendpoint/lora_test.go
Introduces modelendpoint package with Client for concurrent LoRA loading/unloading via HTTP. Provides endpoint discovery utilities (ExtractCandidates, FindModelsForBaseModel), LoRA HTTP operation helpers (loadLoRA, unloadLoRA), and Candidate type. Includes bounded concurrency control with configurable timeouts and comprehensive test coverage.
Worker Pool Utilities
deploy/cloud/operator/internal/workerpool/pool.go, deploy/cloud/operator/internal/workerpool/pool_test.go
Introduces generic Task and Result types with Execute function for bounded concurrency task execution. Provides maxWorkers control, per-task and aggregate timeouts, error aggregation, and result ordering preservation with comprehensive test coverage.
RBAC and Configuration Updates
deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml, deploy/cloud/operator/config/rbac/role.yaml, deploy/cloud/operator/PROJECT, deploy/cloud/operator/config/crd/kustomization.yaml, deploy/cloud/operator/config/samples/kustomization.yaml
Adds RBAC rules for endpointslices discovery and dynamomodels resource (create, delete, get, list, patch, update, watch, and subresources). Updates kustomization files to include DynamoModel CRD and sample. Registers DynamoModel API resource in PROJECT file.
Constants and Bootstrapping
deploy/cloud/operator/internal/consts/consts.go, deploy/cloud/operator/cmd/main.go
Adds three new Kubernetes label and annotation constants for base model identification. Wires EndpointClient into DynamoModelReconciler initialization in main.go with modelendpoint import.
Existing Controller Updates
deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go, deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go, deploy/cloud/operator/internal/controller/suite_test.go
Updates DynamoComponentDeployment and DynamoGraphDeployment reconcilers to call ReconcileModelServicesForComponents. Adds discovery/v1 scheme registration in test suite for EndpointSlice support.
Documentation
docs/kubernetes/README.md, docs/kubernetes/api_reference.md, docs/kubernetes/deployment/create_deployment.md, docs/kubernetes/deployment/dynamomodel-guide.md, docs/kubernetes/dynamo_operator.md
Adds comprehensive DynamoModel documentation including quick start guide, API reference, deployment workflow, integration examples with S3 and HuggingFace sources, status monitoring, troubleshooting, and operator overview updates.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50–70 minutes

Areas requiring extra attention:

  • Controller Logic (dynamo_model_controller.go): Dense reconciliation loop with endpoint discovery, LoRA lifecycle management, status/condition updates, and finalization logic—requires careful review of error handling and state transitions
  • Concurrency Implementation (modelendpoint/client.go, workerpool/pool.go): Bounded concurrency patterns, timeout handling, goroutine coordination, and error aggregation—verify correctness of task dispatch, cancellation propagation, and result aggregation
  • Integration Points (dynamocomponentdeployment_controller.go, dynamographdeployment_controller.go): New service reconciliation calls inserted into existing controller flows—verify no unintended side effects on existing reconciliation paths
  • HTTP Client Interactions (modelendpoint/lora.go): URL construction, request body marshalling, response parsing, and error propagation—verify robustness against edge cases and malformed responses
  • Test Coverage (dynamo_model_controller_test.go, all modelendpoint tests): Verify endpoint discovery mocking, status assertions, and concurrency test scenarios exercise intended behavior

Poem

🐰 A new garden of models takes root today,
DynamoModel tends each LoRA's way,
Endpoints hop between service and pod,
The worker pool hops, never too odd—
In bounded concurrency, we frolic with grace,
And headless services find their place!

Pre-merge checks

✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title 'feat: add dynamoModel CRD' clearly and concisely summarizes the main change - introducing a new DynamoModel custom resource definition.
Description check ✅ Passed The description provides an overview, detailed examples of DGD and DynamoModel CRs, and explains controller behavior, but is missing the 'Where should the reviewer start?' and 'Related Issues' sections from the template.
Docstring Coverage ✅ Passed Docstring coverage is 81.82% which is sufficient. The required threshold is 80.00%.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

🧹 Nitpick comments (5)
deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml (1)

10005-10018: Tighten modelRef schema (non-empty name, disallow unknown keys).

Looks good overall. To prevent empty strings and catch typos, add minimal validations.

Apply within this block:

               modelRef:
                 description: |-
                   ModelRef references a model that this component serves
                   When specified, a headless service will be created for endpoint discovery
-                properties:
+                additionalProperties: false
+                properties:
                   name:
                     description: Name is the base model identifier (e.g., "llama-3-70b-instruct-v1")
                     type: string
+                    minLength: 1
                   revision:
                     description: Revision is the model revision/version (optional)
                     type: string
+                    minLength: 1
                 required:
                   - name
                 type: object

Optional (nice UX): add an additionalPrinterColumn to surface the model at kubectl get time:

@@
     - additionalPrinterColumns:
       - description: Dynamo component
         jsonPath: .spec.dynamoComponent
         name: DynamoComponent
         type: string
+      - description: Model
+        jsonPath: .spec.modelRef.name
+        name: Model
+        type: string

Please confirm:

  • types.go defines ModelRef with json:"modelRef,omitempty" and string fields, and controller tolerates missing/empty revision.
  • No model names require characters beyond basic DNS-1123 label charset; if they do, keep minLength but skip adding a strict pattern. Based on learnings.
deploy/cloud/operator/internal/modelendpoint/lora.go (1)

66-66: Standardize logging levels for success cases.

Success logging is inconsistent: loadLoRA uses Info level (line 66) while unloadLoRA uses V(1) level (line 96). For consistency and operational visibility, both should log at the same level.

Consider standardizing to Info level for both operations:

-	logs.V(1).Info("Successfully unloaded LoRA", "address", address, "modelName", modelName)
+	logs.Info("Successfully unloaded LoRA", "address", address, "modelName", modelName)

Also applies to: 96-96

deploy/cloud/operator/config/crd/bases/nvidia.com_dynamomodels.yaml (1)

85-87: Consider adding validation to enforce loraPath usage.

The loraPath field is described as "only applicable for lora model type" but there's no schema-level validation to enforce this constraint. Users could accidentally set loraPath on base or adapter models.

Add CEL validation to ensure loraPath is only set when modelType is lora:

                 loraPath:
                   description: LoraPath is the path to the LoRA adapter (only applicable for lora model type)
                   type: string
                 modelName:
                   description: ModelName is the full model identifier (e.g., "meta-llama/Llama-3.3-70B-Instruct-lora")
                   type: string
                 modelType:
                   default: base
                   description: ModelType specifies the type of model (e.g., "base", "lora", "adapter")
                   enum:
                     - base
                     - lora
                     - adapter
                   type: string
               required:
                 - baseModelName
                 - modelName
               type: object
+              x-kubernetes-validations:
+                - rule: "self.modelType != 'lora' || has(self.loraPath)"
+                  message: "loraPath is required when modelType is 'lora'"
+                - rule: "self.modelType == 'lora' || !has(self.loraPath)"
+                  message: "loraPath should only be set when modelType is 'lora'"

Also applies to: 91-98

deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml (2)

10005-10018: Validate and sanitize modelRef for Service/labels; document empty revision semantics

Good addition. Please confirm:

  • Reconcile sanitizes modelRef.name (and revision if used) into valid DNS-1123 Service names (lowercase, [a-z0-9-], <=63), with truncation+hash to avoid collisions when names exceed 63 or contain dots/uppercases.
  • If modelRef is used in labels/selector values, ensure label constraints (<=63; allowed charset) or apply normalization similarly.
  • Clarify behavior when revision is empty (e.g., treated as “latest”, or excluded from identity). Add this to the Go type docstring so controller-gen propagates it.

Optionally, enforce constraints at the API by adding kubebuilder validation on the Go types (e.g., Patterns and MaxLength for name/revision) instead of manual YAML edits.

Based on learnings.


10005-10018: Improve kubectl UX with printer columns

Consider adding print columns on the Go type for:

  • Model (.spec.modelRef.name)
  • Revision (.spec.modelRef.revision)

Use +kubebuilder:printcolumn annotations so controller-gen emits them here (don’t hand-edit this YAML). This makes kubectl get dcd more informative.

Based on learnings.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3c0763f and 885d792.

📒 Files selected for processing (26)
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml (1 hunks)
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml (1 hunks)
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamomodels.yaml (1 hunks)
  • deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml (4 hunks)
  • deploy/cloud/operator/PROJECT (1 hunks)
  • deploy/cloud/operator/api/v1alpha1/dynamo_model_types.go (1 hunks)
  • deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go (2 hunks)
  • deploy/cloud/operator/api/v1alpha1/zz_generated.deepcopy.go (11 hunks)
  • deploy/cloud/operator/cmd/main.go (2 hunks)
  • deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml (1 hunks)
  • deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml (1 hunks)
  • deploy/cloud/operator/config/crd/bases/nvidia.com_dynamomodels.yaml (1 hunks)
  • deploy/cloud/operator/config/crd/kustomization.yaml (1 hunks)
  • deploy/cloud/operator/config/rbac/role.yaml (4 hunks)
  • deploy/cloud/operator/config/samples/kustomization.yaml (1 hunks)
  • deploy/cloud/operator/internal/consts/consts.go (1 hunks)
  • deploy/cloud/operator/internal/controller/dynamo_model_controller.go (1 hunks)
  • deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go (2 hunks)
  • deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go (1 hunks)
  • deploy/cloud/operator/internal/dynamo/graph.go (1 hunks)
  • deploy/cloud/operator/internal/dynamo/headless_service.go (1 hunks)
  • deploy/cloud/operator/internal/modelendpoint/client.go (1 hunks)
  • deploy/cloud/operator/internal/modelendpoint/discovery.go (1 hunks)
  • deploy/cloud/operator/internal/modelendpoint/lora.go (1 hunks)
  • deploy/cloud/operator/internal/modelendpoint/types.go (1 hunks)
  • deploy/cloud/operator/internal/workerpool/pool.go (1 hunks)
🧰 Additional context used
🧠 Learnings (6)
📓 Common learnings
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 1474
File: deploy/cloud/operator/internal/controller/dynamocomponent_controller.go:1308-1312
Timestamp: 2025-06-11T21:29:28.650Z
Learning: User julienmancuso expects replies in English; avoid switching languages unless explicitly requested.
📚 Learning: 2025-07-18T16:05:05.534Z
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:1178-1180
Timestamp: 2025-07-18T16:05:05.534Z
Learning: The stopSignal field under lifecycle in DynamoComponentDeployment CRDs is autogenerated due to Kubernetes library upgrades (k8s.io/api and k8s.io/apimachinery from v0.32.3 to v0.33.1), not a manual design decision by the user.

Applied to files:

  • deploy/cloud/operator/PROJECT
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml
  • deploy/cloud/operator/internal/consts/consts.go
  • deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml
  • deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml
📚 Learning: 2025-07-18T16:04:31.771Z
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:92-98
Timestamp: 2025-07-18T16:04:31.771Z
Learning: CRD schemas in files like deploy/cloud/helm/crds/templates/*.yaml are auto-generated from Kubernetes library upgrades and should not be manually modified as changes would be overwritten during regeneration.

Applied to files:

  • deploy/cloud/operator/PROJECT
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamomodels.yaml
  • deploy/cloud/operator/config/crd/kustomization.yaml
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml
📚 Learning: 2025-09-04T19:03:06.643Z
Learnt from: biswapanda
Repo: ai-dynamo/dynamo PR: 2872
File: examples/multimodal/deploy/agg_qwen.yaml:53-60
Timestamp: 2025-09-04T19:03:06.643Z
Learning: In the dynamo repository, Kubernetes Custom Resources use `gpu: "1"` format for GPU resource limits and requests, not the standard Kubernetes `nvidia.com/gpu: 1` format. This applies to DynamoGraphDeployment resources and other dynamo CRs.

Applied to files:

  • deploy/cloud/operator/PROJECT
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamomodels.yaml
  • deploy/cloud/operator/internal/consts/consts.go
  • deploy/cloud/operator/config/crd/bases/nvidia.com_dynamomodels.yaml
  • deploy/cloud/operator/config/samples/kustomization.yaml
  • deploy/cloud/operator/config/crd/kustomization.yaml
  • deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml
📚 Learning: 2025-07-18T16:04:47.465Z
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml:1233-1235
Timestamp: 2025-07-18T16:04:47.465Z
Learning: The `stopSignal` field in Kubernetes CRDs like DynamoGraphDeployment and DynamoComponentDeployment is autogenerated by controller-gen when upgrading Kubernetes library versions, and represents expected upstream API changes rather than manual code that needs custom validation.

Applied to files:

  • deploy/cloud/operator/PROJECT
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml
  • deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml
  • deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml
📚 Learning: 2025-10-24T04:21:08.751Z
Learnt from: biswapanda
Repo: ai-dynamo/dynamo PR: 3858
File: recipes/deepseek-r1/model-cache/model-download.yaml:18-32
Timestamp: 2025-10-24T04:21:08.751Z
Learning: In the recipes directory structure, model-specific recipes (e.g., recipes/deepseek-r1/, recipes/llama-3-70b/) contain hardcoded model names and revisions in their Kubernetes manifests (like model-download.yaml). Each recipe directory is deployment-specific and self-contained, so hardcoding model-specific values is the intended design pattern.

Applied to files:

  • deploy/cloud/operator/config/crd/kustomization.yaml
🧬 Code graph analysis (12)
deploy/cloud/operator/internal/modelendpoint/discovery.go (2)
deploy/cloud/operator/internal/modelendpoint/types.go (1)
  • Candidate (21-24)
deploy/cloud/operator/api/v1alpha1/dynamo_model_types.go (1)
  • DynamoModelList (110-114)
deploy/cloud/operator/cmd/main.go (2)
deploy/cloud/operator/internal/controller/dynamo_model_controller.go (1)
  • DynamoModelReconciler (63-67)
deploy/cloud/operator/internal/modelendpoint/client.go (2)
  • Client (43-45)
  • NewClient (48-54)
deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go (1)
deploy/cloud/operator/internal/dynamo/headless_service.go (1)
  • ReconcileModelServicesForComponents (37-97)
deploy/cloud/operator/internal/workerpool/pool.go (1)
deploy/cloud/operator/api/dynamo/schemas/schemas.go (1)
  • Duration (38-38)
deploy/cloud/operator/api/v1alpha1/dynamo_model_types.go (1)
deploy/cloud/operator/api/v1alpha1/groupversion_info.go (1)
  • SchemeBuilder (35-35)
deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go (1)
deploy/cloud/operator/internal/dynamo/headless_service.go (2)
  • ReconcileModelServicesForComponents (37-97)
  • AddBaseModelLabel (143-147)
deploy/cloud/operator/api/v1alpha1/zz_generated.deepcopy.go (2)
deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go (1)
  • ModelReference (279-287)
deploy/cloud/operator/api/v1alpha1/dynamo_model_types.go (6)
  • DynamoModel (99-105)
  • DynamoModelList (110-114)
  • DynamoModelSpec (25-44)
  • ModelSource (47-54)
  • DynamoModelStatus (72-86)
  • EndpointInfo (57-69)
deploy/cloud/operator/internal/controller/dynamo_model_controller.go (5)
deploy/cloud/operator/internal/modelendpoint/client.go (2)
  • Client (43-45)
  • NewClient (48-54)
deploy/cloud/operator/api/v1alpha1/dynamo_model_types.go (2)
  • DynamoModel (99-105)
  • EndpointInfo (57-69)
deploy/cloud/operator/internal/modelendpoint/discovery.go (2)
  • FindModelsForBaseModel (77-112)
  • ExtractCandidates (35-73)
deploy/cloud/operator/internal/modelendpoint/types.go (1)
  • Candidate (21-24)
deploy/cloud/operator/internal/consts/consts.go (1)
  • DynamoSystemPort (22-22)
deploy/cloud/operator/internal/dynamo/headless_service.go (3)
deploy/cloud/operator/internal/controller_common/resource.go (2)
  • Reconciler (49-52)
  • SyncResource (60-195)
deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go (2)
  • DynamoComponentDeploymentSharedSpec (48-111)
  • ModelReference (279-287)
deploy/cloud/operator/internal/consts/consts.go (3)
  • KubeLabelDynamoBaseModel (41-41)
  • DynamoSystemPortName (23-23)
  • DynamoSystemPort (22-22)
deploy/cloud/operator/internal/dynamo/graph.go (1)
deploy/cloud/operator/internal/dynamo/headless_service.go (1)
  • AddBaseModelLabel (143-147)
deploy/cloud/operator/internal/modelendpoint/client.go (3)
deploy/cloud/operator/internal/modelendpoint/types.go (1)
  • Candidate (21-24)
deploy/cloud/operator/api/v1alpha1/dynamo_model_types.go (2)
  • DynamoModel (99-105)
  • EndpointInfo (57-69)
deploy/cloud/operator/internal/workerpool/pool.go (2)
  • Task (28-31)
  • Execute (43-102)
deploy/cloud/operator/internal/modelendpoint/lora.go (1)
deploy/cloud/operator/internal/modelendpoint/client.go (1)
  • Client (43-45)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: sglang (amd64)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: vllm (amd64)
  • GitHub Check: operator (amd64)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (17)
deploy/cloud/operator/PROJECT (1)

27-34: LGTM!

The DynamoModel resource configuration follows the same structure and conventions as the existing DynamoComponentDeployment and DynamoGraphDeployment resources.

deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go (1)

344-354: LGTM!

The model service reconciliation is appropriately placed after Grove scaling, ensuring that workload resources are created before setting up endpoint discovery services. Error handling follows the established pattern in this controller.

deploy/cloud/operator/internal/consts/consts.go (1)

41-41: LGTM!

The new constant follows the established naming convention and is appropriately positioned with other Kubernetes label constants.

deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go (2)

86-89: LGTM!

The optional ModelRef field is well-documented and designed for backward compatibility. The documentation clearly explains its purpose for endpoint discovery via headless services.


278-287: LGTM!

The ModelReference type is well-designed with appropriate validation markers. The required Name field and optional Revision field provide flexibility while ensuring essential information is present.

deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go (2)

330-343: LGTM!

The model service reconciliation is correctly implemented for component-level reconciliation. The componentMap contains only the current component, which is appropriate for this controller's scope.


943-955: Improved label handling.

The function now properly initializes and populates labels instead of returning an empty map. This ensures that:

  1. Existing component labels are preserved
  2. Base model labels are added when a ModelRef is specified

This is a positive change that enables proper label propagation throughout the resource hierarchy.

deploy/cloud/operator/cmd/main.go (2)

63-63: LGTM!

The modelendpoint import is correctly added and used for creating the EndpointClient.


564-571: LGTM!

The DynamoModelReconciler setup follows the established pattern for controller initialization. The EndpointClient is appropriately created once and injected into the reconciler.

deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml (1)

10139-10152: Go types are properly defined and aligned with the CRD schema.

The verification confirms that the ModelReference struct is correctly defined in deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go (lines 278–287) with proper kubebuilder annotations (+kubebuilder:validation:Required for name, +optional for revision). The modelRef field in DynamoComponentDeploymentSharedSpec is correctly typed as *ModelReference with the +optional tag and proper JSON marshaling hints. The autogenerated CRD schema accurately reflects these Go types, and the structure aligns with how modelRef is used in the controller code (e.g., AddBaseModelLabel function). No issues found.

deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml (1)

65-72: LGTM: RBAC permissions properly scoped for DynamoModel CRD.

The added permissions for EndpointSlices discovery and DynamoModel lifecycle management (including finalizers and status updates) are appropriate and follow standard Kubernetes controller patterns.

Also applies to: 372-372, 387-387, 396-396

deploy/cloud/operator/config/rbac/role.yaml (1)

89-96: LGTM: RBAC permissions consistent with Helm template.

The RBAC additions mirror those in the Helm template and are properly scoped for the DynamoModel controller's operational needs.

Also applies to: 173-173, 188-188, 197-197

deploy/cloud/operator/config/crd/bases/nvidia.com_dynamomodels.yaml (2)

166-182: Verify whether podName should be required in EndpointInfo.

The podName field is not in the required list (lines 179-181), suggesting it may be optional. However, in a Kubernetes environment with endpoint discovery via EndpointSlices, the pod name should typically always be known and valuable for debugging and observability.

Please confirm whether podName can legitimately be absent in any scenario. If not, consider adding it to the required fields:

                     required:
                       - address
+                      - podName
                       - ready

174-178: Clarify the design intent for base model endpoint tracking.

The comment states "For base models: always false (no probing performed)," which suggests base model endpoints are tracked but never marked ready. This raises questions about the utility of endpoint tracking for base models and whether the status structure optimally serves both base and LoRA model use cases.

Please clarify the design rationale:

  • Why track endpoints for base models if ready is always false?
  • Is there a future plan to probe base model readiness?
  • Would separate status structures for base vs LoRA models improve clarity?
deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml (1)

10005-10018: RBAC check for headless Service creation

Since modelRef triggers headless service creation for endpoint discovery, verify the PR includes RBAC for Services and EndpointSlices (get/list/watch/create/update/patch) in the operator’s ClusterRole.

Based on learnings.

deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml (1)

10139-10152: Verify that base CRD file has been updated to generate this template change.

Per prior learnings on this codebase, CRD schemas in deploy/cloud/helm/crds/templates/*.yaml are auto-generated from base CRD files in deploy/cloud/operator/config/crd/bases/ and should not be manually edited, as manual changes would be overwritten during regeneration.

The modelRef field definition itself appears structurally sound as OpenAPI v3 schema. However, ensure that the corresponding base CRD file (deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml) has been updated with this field, and that this template change was auto-generated from it rather than manually added.

deploy/cloud/operator/internal/dynamo/headless_service.go (1)

91-105: Review comment is incorrect for Go 1.24.0

The repository declares go 1.24.0 in deploy/cloud/operator/go.mod, which is well after Go 1.22. Starting with Go 1.22, loop variables are scoped per iteration rather than reused across iterations, so closures over loop variables are safe. The problematic code pattern the review describes is not an issue in this codebase.

Additionally, the review references the wrong file. The actual loops capturing candidate are in deploy/cloud/operator/internal/modelendpoint/client.go (LoadLoRA at lines 91–102, UnloadLoRA at lines 152–163), not in headless_service.go.

The code requires no changes.

Likely an incorrect or invalid review comment.

Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Copy link
Contributor

@tmonty12 tmonty12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
@julienmancuso
Copy link
Contributor Author

@coderabbitai full review

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Nov 9, 2025

✅ Actions performed

Full review triggered.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (8)
docs/kubernetes/deployment/dynamomodel-guide.md (1)

53-58: Fix documentation lint warnings.

Several fenced code blocks (for example here) lack a language hint, which triggers markdownlint MD040. While updating, please also register this page in the Sphinx toctree to clear the “document isn’t included in any toctree” warning from the docs pipeline.

deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml (1)

10139-10152: Improve description clarity to separate intent from implementation details.

Verification confirms the modelRef structure is consistent across both DynamoGraphDeployment and DynamoComponentDeployment CRDs. However, the description mixes user-facing intent with controller implementation details. Separate these concerns for schema clarity:

                      modelRef:
                        description: |-
-                          ModelRef references a model that this component serves
-                          When specified, a headless service will be created for endpoint discovery
+                          ModelRef references a model that this component serves.
+                          The referenced model must exist as a DynamoModel resource in the same namespace.

Apply the same improvement to the parallel field in DynamoComponentDeployment for consistency.

deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml (1)

10005-10018: Add validation constraints, printer columns, and optional immutability to modelRef field.

The modelRef field is properly wired in the controller (uses spec.ModelRef safely with nil/empty checks), but the generated CRD lacks string validation, UX enhancements, and optional immutability constraints:

  1. Validation constraints: Add minLength: 1, maxLength: 253, and optionally a pattern on name to reject empty or invalid identifiers.
  2. Printer columns: Surface modelRef.name and modelRef.revision in kubectl get for better visibility.
  3. Immutability (optional): Add x-kubernetes-validations CEL rule to prevent accidental model name swaps; keep revision mutable if rollout scenarios depend on it.
  4. Description: Clarify that the referenced DynamoModel is resolved in the same namespace as the DynamoComponentDeployment.

Suggested diffs (from original review comment):

                modelRef:
                  description: |-
                    ModelRef references a model that this component serves
                    When specified, a headless service will be created for endpoint discovery
+                   The referenced DynamoModel is resolved in the same namespace as this resource.
                  properties:
                    name:
                      description: Name is the base model identifier (e.g., "llama-3-70b-instruct-v1")
                      type: string
+                     minLength: 1
+                     maxLength: 253
                    revision:
                      description: Revision is the model revision/version (optional)
                      type: string
                  required:
                    - name
                  type: object

Also add printer columns for Model and Revision in additionalPrinterColumns as shown in the original review comment.

deploy/cloud/operator/config/crd/bases/nvidia.com_dynamomodels.yaml (2)

96-108: Consider adding validation to enforce that source is only applicable for LoRA models.

The description on line 97 states that the source field is "only applicable for lora model type," but there's no OpenAPI schema validation enforcing this constraint. Users could inadvertently set source on base model types, which may lead to confusion or unexpected behavior.

Consider adding a CEL validation rule or restructuring the spec to enforce this constraint:

               source:
                 description: Source specifies the model source location (only applicable for lora model type)
                 properties:
                   uri:
                     description: |-
                       URI is the model source URI
                       Supported formats:
                       - S3: s3://bucket/path/to/model
                       - HuggingFace: hf://org/model@revision_sha
                     type: string
                 required:
                   - uri
                 type: object
+              x-kubernetes-validations:
+                - rule: "self.modelType == 'base' ? !has(self.source) : true"
+                  message: "source field is only applicable when modelType is 'lora' or 'adapter'"
             required:
               - baseModelName
               - modelName

176-192: Clarify whether podName is required in endpoint status.

The podName field is described (lines 180-182) but not included in the required fields list (lines 189-192), which only requires address and ready. If the discovery logic always populates podName (as suggested by the implementation in discovery.go), consider making it a required field for consistency and to prevent incomplete status representations.

If podName should always be present, apply this diff:

                   required:
                     - address
+                    - podName
                     - ready

Alternatively, if podName can legitimately be absent in some scenarios, update the description to clarify when it may be omitted.

deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml (1)

10005-10018: Add validation markers to ModelReference.Name in Go struct; clarify that service naming uses hashing.

The Go struct at deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go:279 defines ModelReference with only a Required marker on Name. The field lacks MinLength, MaxLength, or pattern constraints. However, the implementation mitigates DNS concerns: model_service.go:162 shows service names are generated as dynamo-model-{8-char-hash}, not from the raw name.

Recommended improvements:

  • Add +kubebuilder:validation:MinLength=1 and +kubebuilder:validation:MaxLength=255 to constrain the Name field in the Go struct.
  • Update the struct comment to clarify semantics: "Name is the base model identifier (e.g., HuggingFace model ID like 'meta-llama/llama-3-70b-instruct'). Service names are generated deterministically from this value and do not use it directly."
  • Document namespace behavior: add comment clarifying whether ModelRef must resolve in the same namespace (current code shows no cross-namespace lookup).
  • Optional: if cross-namespace refs are planned, add an optional Namespace *string field to the struct and corresponding validation.

Regenerate CRDs via make manifests or controller-gen after updating the Go struct; do not hand-edit the YAML.

deploy/cloud/operator/internal/modelendpoint/client.go (1)

32-53: LGTM! Consider documenting timeout interaction.

The timeout constants and client initialization are well-structured. The bounded concurrency (10) with per-request (15s) and total (30s) timeouts provides good control.

Optional: The interaction between RequestTimeout (applied to httpClient) and TotalTimeout (enforced by workerpool.Execute) could be clarified in a comment. With concurrent execution, individual requests timing out at 15s will surface as errors before the 30s total timeout, which appears intentional but might benefit from documentation.

deploy/cloud/operator/internal/dynamo/model_service.go (1)

152-164: LGTM! Hash-based naming is appropriate.

The 8-character SHA256 hash provides a good balance between brevity and collision resistance. For typical deployments with hundreds to thousands of models, collision probability is negligible.

Optional consideration: The 8-character hex hash provides ~4 billion possible values with 50% collision probability at ~77k models (birthday paradox). For very large-scale deployments planning to manage tens of thousands of unique base models, consider increasing the hash length. For typical use cases, the current implementation is well-suited.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3c0763f and 5b0782f.

📒 Files selected for processing (37)
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml (1 hunks)
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml (1 hunks)
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamomodels.yaml (1 hunks)
  • deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml (4 hunks)
  • deploy/cloud/operator/PROJECT (1 hunks)
  • deploy/cloud/operator/api/v1alpha1/dynamo_model_types.go (1 hunks)
  • deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go (2 hunks)
  • deploy/cloud/operator/api/v1alpha1/zz_generated.deepcopy.go (11 hunks)
  • deploy/cloud/operator/cmd/main.go (2 hunks)
  • deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml (1 hunks)
  • deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml (1 hunks)
  • deploy/cloud/operator/config/crd/bases/nvidia.com_dynamomodels.yaml (1 hunks)
  • deploy/cloud/operator/config/crd/kustomization.yaml (1 hunks)
  • deploy/cloud/operator/config/rbac/role.yaml (4 hunks)
  • deploy/cloud/operator/config/samples/kustomization.yaml (1 hunks)
  • deploy/cloud/operator/internal/consts/consts.go (1 hunks)
  • deploy/cloud/operator/internal/controller/dynamo_model_controller.go (1 hunks)
  • deploy/cloud/operator/internal/controller/dynamo_model_controller_test.go (1 hunks)
  • deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go (3 hunks)
  • deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go (1 hunks)
  • deploy/cloud/operator/internal/controller/suite_test.go (2 hunks)
  • deploy/cloud/operator/internal/dynamo/graph.go (1 hunks)
  • deploy/cloud/operator/internal/dynamo/model_service.go (1 hunks)
  • deploy/cloud/operator/internal/modelendpoint/client.go (1 hunks)
  • deploy/cloud/operator/internal/modelendpoint/client_test.go (1 hunks)
  • deploy/cloud/operator/internal/modelendpoint/discovery.go (1 hunks)
  • deploy/cloud/operator/internal/modelendpoint/discovery_test.go (1 hunks)
  • deploy/cloud/operator/internal/modelendpoint/lora.go (1 hunks)
  • deploy/cloud/operator/internal/modelendpoint/lora_test.go (1 hunks)
  • deploy/cloud/operator/internal/modelendpoint/types.go (1 hunks)
  • deploy/cloud/operator/internal/workerpool/pool.go (1 hunks)
  • deploy/cloud/operator/internal/workerpool/pool_test.go (1 hunks)
  • docs/kubernetes/README.md (1 hunks)
  • docs/kubernetes/api_reference.md (5 hunks)
  • docs/kubernetes/deployment/create_deployment.md (1 hunks)
  • docs/kubernetes/deployment/dynamomodel-guide.md (1 hunks)
  • docs/kubernetes/dynamo_operator.md (2 hunks)
🧰 Additional context used
🧠 Learnings (7)
📓 Common learnings
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 1474
File: deploy/cloud/operator/internal/controller/dynamocomponent_controller.go:1308-1312
Timestamp: 2025-06-11T21:29:28.650Z
Learning: User julienmancuso expects replies in English; avoid switching languages unless explicitly requested.
📚 Learning: 2025-07-18T16:04:31.771Z
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:92-98
Timestamp: 2025-07-18T16:04:31.771Z
Learning: CRD schemas in files like deploy/cloud/helm/crds/templates/*.yaml are auto-generated from Kubernetes library upgrades and should not be manually modified as changes would be overwritten during regeneration.

Applied to files:

  • deploy/cloud/operator/config/crd/kustomization.yaml
  • docs/kubernetes/api_reference.md
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamomodels.yaml
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml
📚 Learning: 2025-09-04T19:03:06.643Z
Learnt from: biswapanda
Repo: ai-dynamo/dynamo PR: 2872
File: examples/multimodal/deploy/agg_qwen.yaml:53-60
Timestamp: 2025-09-04T19:03:06.643Z
Learning: In the dynamo repository, Kubernetes Custom Resources use `gpu: "1"` format for GPU resource limits and requests, not the standard Kubernetes `nvidia.com/gpu: 1` format. This applies to DynamoGraphDeployment resources and other dynamo CRs.

Applied to files:

  • deploy/cloud/operator/config/crd/kustomization.yaml
  • docs/kubernetes/api_reference.md
  • docs/kubernetes/dynamo_operator.md
  • deploy/cloud/operator/PROJECT
  • deploy/cloud/operator/config/crd/bases/nvidia.com_dynamomodels.yaml
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamomodels.yaml
  • deploy/cloud/operator/config/samples/kustomization.yaml
  • deploy/cloud/operator/internal/consts/consts.go
  • deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml
📚 Learning: 2025-10-24T04:21:08.751Z
Learnt from: biswapanda
Repo: ai-dynamo/dynamo PR: 3858
File: recipes/deepseek-r1/model-cache/model-download.yaml:18-32
Timestamp: 2025-10-24T04:21:08.751Z
Learning: In the recipes directory structure, model-specific recipes (e.g., recipes/deepseek-r1/, recipes/llama-3-70b/) contain hardcoded model names and revisions in their Kubernetes manifests (like model-download.yaml). Each recipe directory is deployment-specific and self-contained, so hardcoding model-specific values is the intended design pattern.

Applied to files:

  • deploy/cloud/operator/config/crd/kustomization.yaml
📚 Learning: 2025-07-18T16:05:05.534Z
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:1178-1180
Timestamp: 2025-07-18T16:05:05.534Z
Learning: The stopSignal field under lifecycle in DynamoComponentDeployment CRDs is autogenerated due to Kubernetes library upgrades (k8s.io/api and k8s.io/apimachinery from v0.32.3 to v0.33.1), not a manual design decision by the user.

Applied to files:

  • docs/kubernetes/api_reference.md
  • deploy/cloud/operator/PROJECT
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml
  • docs/kubernetes/deployment/dynamomodel-guide.md
  • deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go
  • deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml
  • deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml
📚 Learning: 2025-07-18T16:04:47.465Z
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml:1233-1235
Timestamp: 2025-07-18T16:04:47.465Z
Learning: The `stopSignal` field in Kubernetes CRDs like DynamoGraphDeployment and DynamoComponentDeployment is autogenerated by controller-gen when upgrading Kubernetes library versions, and represents expected upstream API changes rather than manual code that needs custom validation.

Applied to files:

  • docs/kubernetes/api_reference.md
  • deploy/cloud/operator/PROJECT
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml
  • deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml
  • deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml
📚 Learning: 2025-09-17T22:35:40.674Z
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 3100
File: deploy/cloud/operator/cmd/main.go:186-190
Timestamp: 2025-09-17T22:35:40.674Z
Learning: The mpiRunSecretName validation in deploy/cloud/operator/cmd/main.go is safe for upgrades because the Helm chart automatically populates dynamo-operator.dynamo.mpiRun.secretName with a default value of "mpi-run-ssh-secret" and includes SSH key generation functionality via sshKeygen.enabled: true.

Applied to files:

  • deploy/cloud/operator/PROJECT
🪛 GitHub Actions: Generate Documentation
docs/kubernetes/deployment/dynamomodel-guide.md

[warning] 1-1: Document isn't included in any toctree [toc.not_included].

🪛 LanguageTool
docs/kubernetes/deployment/dynamomodel-guide.md

[uncategorized] ~503-~503: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...y| All endpoints are ready | ✅ Good - full service availability | |NotReady` | Not all e...

(EN_COMPOUND_ADJECTIVE_INTERNAL)

🪛 markdownlint-cli2 (0.18.1)
docs/kubernetes/api_reference.md

421-421: Bare URL used

(MD034, no-bare-urls)

docs/kubernetes/deployment/dynamomodel-guide.md

54-54: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


253-253: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


265-265: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


315-315: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🔇 Additional comments (36)
deploy/cloud/operator/config/crd/kustomization.yaml (1)

22-22: LGTM! DynamoModel CRD resource added correctly.

The addition of the DynamoModel CRD base follows the established pattern for other CRDs (DynamoComponentDeployment and DynamoGraphDeployment) and integrates properly with the kustomization structure.

deploy/cloud/operator/internal/controller/suite_test.go (1)

36-36: LGTM! Test scheme properly extended for EndpointSlice support.

The addition of discoveryv1 types to the test scheme follows the established pattern and correctly enables EndpointSlice-based discovery testing for the new DynamoModel functionality.

Also applies to: 113-114

deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go (1)

344-354: LGTM! Model service reconciliation properly integrated.

The new reconciliation step is well-placed after Grove scaling and includes appropriate error handling and logging. The integration follows the established patterns in the controller.

deploy/cloud/operator/PROJECT (1)

27-34: LGTM! DynamoModel resource properly configured.

The kubebuilder project configuration for DynamoModel follows the established pattern and includes all required fields (crdVersion, namespaced, controller, domain, kind, path, version).

docs/kubernetes/dynamo_operator.md (2)

15-15: LGTM! DynamoModel controller documented clearly.

The addition accurately describes the DynamoModelController's purpose for managing model lifecycle and LoRA adapters.


104-116: LGTM! Documentation structure enhanced for model management.

The updates clearly list DynamoModel as a core CRD and provide a helpful reference to the detailed deployment guide. The documentation follows the established structure and provides good user guidance.

docs/kubernetes/deployment/create_deployment.md (1)

221-259: LGTM! Excellent documentation for LoRA adapter deployment.

The new Step 6 provides clear, practical guidance with well-structured YAML examples showing the relationship between modelRef in the worker configuration and the corresponding DynamoModel resource. The link to the comprehensive guide is helpful for users seeking more details.

docs/kubernetes/api_reference.md (3)

40-40: LGTM! DynamoModel properly integrated into API reference.

The DynamoModel resource type and modelRef field are correctly added to the API reference, maintaining consistency with the existing documentation structure.

Also applies to: 170-170, 206-206


351-425: LGTM! Comprehensive DynamoModel API documentation.

The documentation thoroughly covers all DynamoModel-related types (DynamoModel, DynamoModelSpec, DynamoModelStatus, EndpointInfo) with clear field descriptions, validation rules, and cross-references. The structure follows the established pattern for other CRD documentation in this file.

Note: The static analysis warning about a bare URL at line 421 is a false positive—it's an example endpoint address in the documentation, not a link that needs formatting.


468-500: LGTM! ModelReference and ModelSource types documented clearly.

The documentation for these supporting types is complete and follows the established conventions, with appropriate field descriptions and validation annotations.

deploy/cloud/operator/internal/dynamo/graph.go (1)

1029-1030: No nil safety issue — function already handles nil gracefully.

The verification shows that AddBaseModelLabel at lines 168-173 of deploy/cloud/operator/internal/dynamo/model_service.go explicitly checks for nil input with if labels == nil || modelRef == nil || modelRef.Name == "" { return } and returns early. The code is safe and requires no changes.

Likely an incorrect or invalid review comment.

deploy/cloud/operator/internal/consts/consts.go (1)

41-43: LGTM! Constant additions follow established conventions.

The three new constants for base model labeling and annotation are well-named and consistent with the existing patterns in this file.

deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml (2)

65-72: LGTM! EndpointSlice RBAC permissions are correctly scoped.

The read-only access (get, list, watch) to EndpointSlices in the discovery.k8s.io API group is appropriate for the endpoint discovery functionality.


372-372: LGTM! DynamoModel RBAC permissions are comprehensive.

The permissions for dynamomodels resources, finalizers, and status follow the established pattern for other CRDs managed by this operator.

Also applies to: 387-387, 396-396

deploy/cloud/operator/config/rbac/role.yaml (2)

89-96: LGTM! EndpointSlice permissions match Helm template.

The read-only RBAC rules for EndpointSlices are correctly configured and consistent with the manager-rbac.yaml template.


173-173: LGTM! DynamoModel permissions are properly configured.

The dynamomodels resource and its subresources (finalizers, status) have appropriate RBAC verbs for controller operations.

Also applies to: 188-188, 197-197

deploy/cloud/operator/internal/modelendpoint/discovery.go (2)

36-74: LGTM! Endpoint extraction logic is well-implemented.

The ExtractCandidates function correctly:

  • Filters for Pod-ready endpoints only
  • Validates TargetRef kind is "Pod" (addressing past review feedback)
  • Handles IPv6 addresses using net.JoinHostPort (addressing past review feedback)
  • Collects service names for tracking

79-115: LGTM! Model lookup uses efficient field indexing.

The FindModelsForBaseModel function leverages controller-runtime's field indexer for O(1) lookups, which is an efficient approach for finding models by base model name or hash. Error handling and logging are appropriate.

deploy/cloud/operator/internal/modelendpoint/lora_test.go (2)

30-211: LGTM! Comprehensive test coverage for LoRA loading.

The test suite thoroughly validates:

  • URL path construction with various base address formats
  • Request body structure with proper lora_name and source.uri fields
  • Response handling for success (200, 201) and error cases (400, 404, 500)
  • Content-Type header verification

213-330: LGTM! Thorough test coverage for LoRA unloading.

The unload tests properly verify:

  • DELETE method usage
  • URL path construction including models with special characters
  • Success responses (200, 204)
  • Error responses (404, 500) with appropriate error propagation
deploy/cloud/operator/internal/modelendpoint/discovery_test.go (2)

40-464: LGTM! Excellent test coverage for endpoint extraction.

The TestExtractCandidates suite thoroughly validates all edge cases:

  • Filtering logic for Pod-only, ready endpoints
  • Handling of nil/non-Pod TargetRefs
  • Multiple addresses per endpoint
  • Multiple services in the slice
  • Mixed valid/invalid endpoints

The test structure with optional validateCandidates hooks provides flexibility for deeper assertions.


466-650: LGTM! Comprehensive test coverage for model lookup.

The TestFindModelsForBaseModel suite effectively validates:

  • Field indexer-based lookups with multiple matches
  • Empty result handling
  • Namespace-scoped filtering
  • Proper use of fake client with index registration

The tests correctly construct the scheme and register the indexer function for testing the efficient O(1) lookup behavior.

deploy/cloud/operator/cmd/main.go (2)

63-63: LGTM!

The import is correctly added and used for creating the EndpointClient.


564-571: Code is correctly implemented; original concerns do not apply.

The EndpointClient initialization in main.go follows Go best practices for HTTP clients:

  • NewClient() returns a configured *Client without error, which is the correct pattern for factory functions that always succeed
  • The Client is already configured with RequestTimeout = 15 * time.Second and supports bounded concurrency (MaxConcurrentOperations = 10)
  • All Client methods properly accept context.Context for cancellation and timeout control
  • http.Client does not require explicit cleanup; connection pooling is automatic

The initialization pattern and resource management are appropriate for this use case.

deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml (1)

10139-10152: Verify that these YAML changes were generated from updated Go structs, not manually edited.

Per the learnings from PR 2012, CRD schema files in deploy/cloud/helm/crds/templates/ are auto-generated artifacts that should not be manually modified—changes should come from regenerating CRDs via controller-gen from the corresponding Go struct definitions. Before merging, confirm that:

  1. The modelRef field was added to the DynamoGraphDeploymentSpec Go struct
  2. The YAML changes are the output of running make generate or equivalent CRD regeneration
  3. No manual edits were made directly to the YAML file

The field design itself is sound: name (required string) correctly identifies the model, and revision (optional string) provides flexibility. The description clearly explains the purpose and relationship to headless service creation.

deploy/cloud/operator/internal/modelendpoint/client.go (2)

55-136: LGTM! LoadLoRA implementation is solid.

The function correctly:

  • Skips loading for non-LoRA models (lines 64-76)
  • Validates Source URI requirement for LoRA models (lines 78-86)
  • Executes load operations with bounded concurrency
  • Returns partial results with detailed error information

Note on runtime validation (lines 78-86): The Source URI validation occurs at runtime rather than via API-side validation (admission webhooks). Based on past review comments, this is a known limitation tracked in issue #4190 and requires certificate management infrastructure to implement properly.


138-192: LGTM! UnloadLoRA follows the same robust pattern.

The function mirrors the LoadLoRA implementation with bounded concurrency and detailed logging. The best-effort approach (continuing on errors) is appropriate for unload operations, with comprehensive per-endpoint error reporting.

deploy/cloud/operator/internal/dynamo/model_service.go (3)

36-97: LGTM! ReconcileModelServicesForComponents is well-designed.

The function correctly:

  • Deduplicates services by base model name to avoid creating duplicates (lines 48-66)
  • Generates deterministic headless services for endpoint discovery
  • Provides appropriate error handling with context
  • Serves as reusable logic across multiple controllers

99-150: LGTM! Headless service generation is correctly implemented.

The function properly configures a headless service for model endpoint discovery:

  • ClusterIP: None enables direct pod IP resolution
  • PublishNotReadyAddresses: false (line 135) ensures only ready pods appear in EndpointSlices, which is critical for reliable endpoint discovery
  • Deterministic naming and label-based selection using hashes
  • Annotations preserve original model name for human readability

166-181: LGTM! Helper functions are clean and defensive.

Both AddBaseModelLabel and AddBaseModelAnnotation correctly handle nil inputs and provide consistent label/annotation management patterns.

deploy/cloud/operator/internal/controller/dynamo_model_controller.go (6)

47-72: LGTM! Constants and reconciler structure are well-defined.

The exported constants provide clear contract points for conditions and reasons. The requeue duration is now properly defined as a constant (requeueAfterDuration, line 64), addressing previous feedback. The reconciler struct has appropriate fields for client operations, event recording, and endpoint management.


80-130: LGTM! Early reconciliation logic is correct.

The function properly:

  • Fetches the DynamoModel with standard NotFound handling
  • Delegates finalizer handling to the common handler
  • Handles the no-endpoints case appropriately by updating status and conditions without requeuing, relying on EndpointSlice watches to trigger future reconciliation (line 129)

132-147: LGTM! The infinite requeue fix is correctly implemented.

The failure detection logic (lines 136-141) correctly addresses the past review concern:

  • For LoRA models: hasFailures is true if probe errors occur OR if not all endpoints are ready (line 140)
  • For base models: hasFailures is true only on probe errors, since Ready is expected to remain false for base models

This prevents infinite 30-second requeues for healthy base models while still requeuing on actual probe failures and for LoRA models with unready endpoints.


149-208: LGTM! Status and condition updates are well-structured.

The condition updates correctly differentiate behavior by model type:

  • LoRA models (lines 164-178): Check readiness counts and set EndpointsReady to True only when all endpoints are ready
  • Base models (lines 179-188): Simply verify endpoints exist, since readiness doesn't apply

The requeue logic (lines 200-206) appropriately uses the constant requeueAfterDuration when failures are detected, with clear logging of the reason.


211-265: LGTM! Helper functions and manager setup are well-implemented.

The helpers are straightforward and correct. The SetupWithManager function properly configures:

  • Field indexing by base model hash (lines 240-252) for efficient O(1) queries when mapping EndpointSlices to DynamoModels
  • Appropriate predicates: GenerationChangedPredicate for DynamoModel to avoid unnecessary reconciliations
  • EndpointSlice watch with custom mapping and disabled generic events to trigger reconciliation on endpoint changes

267-372: LGTM! Finalization and endpoint discovery are correctly implemented.

FinalizeResource (lines 295-337) properly implements best-effort cleanup:

  • Only cleans up LoRA models (base models don't require unload operations)
  • Continues with deletion even if endpoint discovery or unload operations fail (lines 310, 322)
  • Logs failures at Info level, which is appropriate since deletion proceeds regardless
  • Emits events for observability

getEndpointCandidates (lines 339-372) correctly performs label-based endpoint discovery using the hashed base model name, enabling efficient queries.

Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants