-
Notifications
You must be signed in to change notification settings - Fork 676
feat: add dynamoModel CRD #4166
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
WalkthroughThis pull request introduces the DynamoModel Custom Resource Definition and supporting infrastructure to manage model lifecycle and LoRA adapter deployment in the Dynamo operator. Key additions include a new DynamoModel CRD with spec and status fields, ModelRef fields on existing CRDs for model linkage, an endpoint management client for LoRA operations, a worker pool for concurrent task execution, a DynamoModel reconciler, RBAC updates, and comprehensive documentation. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50–70 minutes Areas requiring extra attention:
Poem
Pre-merge checks✅ Passed checks (3 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 7
🧹 Nitpick comments (5)
deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml (1)
10005-10018: Tighten modelRef schema (non-empty name, disallow unknown keys).Looks good overall. To prevent empty strings and catch typos, add minimal validations.
Apply within this block:
modelRef: description: |- ModelRef references a model that this component serves When specified, a headless service will be created for endpoint discovery - properties: + additionalProperties: false + properties: name: description: Name is the base model identifier (e.g., "llama-3-70b-instruct-v1") type: string + minLength: 1 revision: description: Revision is the model revision/version (optional) type: string + minLength: 1 required: - name type: objectOptional (nice UX): add an additionalPrinterColumn to surface the model at kubectl get time:
@@ - additionalPrinterColumns: - description: Dynamo component jsonPath: .spec.dynamoComponent name: DynamoComponent type: string + - description: Model + jsonPath: .spec.modelRef.name + name: Model + type: stringPlease confirm:
- types.go defines
ModelRefwithjson:"modelRef,omitempty"and string fields, and controller tolerates missing/emptyrevision.- No model names require characters beyond basic DNS-1123 label charset; if they do, keep minLength but skip adding a strict pattern. Based on learnings.
deploy/cloud/operator/internal/modelendpoint/lora.go (1)
66-66: Standardize logging levels for success cases.Success logging is inconsistent:
loadLoRAusesInfolevel (line 66) whileunloadLoRAusesV(1)level (line 96). For consistency and operational visibility, both should log at the same level.Consider standardizing to
Infolevel for both operations:- logs.V(1).Info("Successfully unloaded LoRA", "address", address, "modelName", modelName) + logs.Info("Successfully unloaded LoRA", "address", address, "modelName", modelName)Also applies to: 96-96
deploy/cloud/operator/config/crd/bases/nvidia.com_dynamomodels.yaml (1)
85-87: Consider adding validation to enforce loraPath usage.The
loraPathfield is described as "only applicable for lora model type" but there's no schema-level validation to enforce this constraint. Users could accidentally setloraPathon base or adapter models.Add CEL validation to ensure
loraPathis only set whenmodelTypeislora:loraPath: description: LoraPath is the path to the LoRA adapter (only applicable for lora model type) type: string modelName: description: ModelName is the full model identifier (e.g., "meta-llama/Llama-3.3-70B-Instruct-lora") type: string modelType: default: base description: ModelType specifies the type of model (e.g., "base", "lora", "adapter") enum: - base - lora - adapter type: string required: - baseModelName - modelName type: object + x-kubernetes-validations: + - rule: "self.modelType != 'lora' || has(self.loraPath)" + message: "loraPath is required when modelType is 'lora'" + - rule: "self.modelType == 'lora' || !has(self.loraPath)" + message: "loraPath should only be set when modelType is 'lora'"Also applies to: 91-98
deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml (2)
10005-10018: Validate and sanitize modelRef for Service/labels; document empty revision semanticsGood addition. Please confirm:
- Reconcile sanitizes modelRef.name (and revision if used) into valid DNS-1123 Service names (lowercase, [a-z0-9-], <=63), with truncation+hash to avoid collisions when names exceed 63 or contain dots/uppercases.
- If modelRef is used in labels/selector values, ensure label constraints (<=63; allowed charset) or apply normalization similarly.
- Clarify behavior when revision is empty (e.g., treated as “latest”, or excluded from identity). Add this to the Go type docstring so controller-gen propagates it.
Optionally, enforce constraints at the API by adding kubebuilder validation on the Go types (e.g., Patterns and MaxLength for name/revision) instead of manual YAML edits.
Based on learnings.
10005-10018: Improve kubectl UX with printer columnsConsider adding print columns on the Go type for:
- Model (.spec.modelRef.name)
- Revision (.spec.modelRef.revision)
Use +kubebuilder:printcolumn annotations so controller-gen emits them here (don’t hand-edit this YAML). This makes kubectl get dcd more informative.
Based on learnings.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (26)
deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml(1 hunks)deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml(1 hunks)deploy/cloud/helm/crds/templates/nvidia.com_dynamomodels.yaml(1 hunks)deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml(4 hunks)deploy/cloud/operator/PROJECT(1 hunks)deploy/cloud/operator/api/v1alpha1/dynamo_model_types.go(1 hunks)deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go(2 hunks)deploy/cloud/operator/api/v1alpha1/zz_generated.deepcopy.go(11 hunks)deploy/cloud/operator/cmd/main.go(2 hunks)deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml(1 hunks)deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml(1 hunks)deploy/cloud/operator/config/crd/bases/nvidia.com_dynamomodels.yaml(1 hunks)deploy/cloud/operator/config/crd/kustomization.yaml(1 hunks)deploy/cloud/operator/config/rbac/role.yaml(4 hunks)deploy/cloud/operator/config/samples/kustomization.yaml(1 hunks)deploy/cloud/operator/internal/consts/consts.go(1 hunks)deploy/cloud/operator/internal/controller/dynamo_model_controller.go(1 hunks)deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go(2 hunks)deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go(1 hunks)deploy/cloud/operator/internal/dynamo/graph.go(1 hunks)deploy/cloud/operator/internal/dynamo/headless_service.go(1 hunks)deploy/cloud/operator/internal/modelendpoint/client.go(1 hunks)deploy/cloud/operator/internal/modelendpoint/discovery.go(1 hunks)deploy/cloud/operator/internal/modelendpoint/lora.go(1 hunks)deploy/cloud/operator/internal/modelendpoint/types.go(1 hunks)deploy/cloud/operator/internal/workerpool/pool.go(1 hunks)
🧰 Additional context used
🧠 Learnings (6)
📓 Common learnings
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 1474
File: deploy/cloud/operator/internal/controller/dynamocomponent_controller.go:1308-1312
Timestamp: 2025-06-11T21:29:28.650Z
Learning: User julienmancuso expects replies in English; avoid switching languages unless explicitly requested.
📚 Learning: 2025-07-18T16:05:05.534Z
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:1178-1180
Timestamp: 2025-07-18T16:05:05.534Z
Learning: The stopSignal field under lifecycle in DynamoComponentDeployment CRDs is autogenerated due to Kubernetes library upgrades (k8s.io/api and k8s.io/apimachinery from v0.32.3 to v0.33.1), not a manual design decision by the user.
Applied to files:
deploy/cloud/operator/PROJECTdeploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yamldeploy/cloud/operator/internal/consts/consts.godeploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yamldeploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yamldeploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml
📚 Learning: 2025-07-18T16:04:31.771Z
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:92-98
Timestamp: 2025-07-18T16:04:31.771Z
Learning: CRD schemas in files like deploy/cloud/helm/crds/templates/*.yaml are auto-generated from Kubernetes library upgrades and should not be manually modified as changes would be overwritten during regeneration.
Applied to files:
deploy/cloud/operator/PROJECTdeploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yamldeploy/cloud/helm/crds/templates/nvidia.com_dynamomodels.yamldeploy/cloud/operator/config/crd/kustomization.yamldeploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml
📚 Learning: 2025-09-04T19:03:06.643Z
Learnt from: biswapanda
Repo: ai-dynamo/dynamo PR: 2872
File: examples/multimodal/deploy/agg_qwen.yaml:53-60
Timestamp: 2025-09-04T19:03:06.643Z
Learning: In the dynamo repository, Kubernetes Custom Resources use `gpu: "1"` format for GPU resource limits and requests, not the standard Kubernetes `nvidia.com/gpu: 1` format. This applies to DynamoGraphDeployment resources and other dynamo CRs.
Applied to files:
deploy/cloud/operator/PROJECTdeploy/cloud/helm/crds/templates/nvidia.com_dynamomodels.yamldeploy/cloud/operator/internal/consts/consts.godeploy/cloud/operator/config/crd/bases/nvidia.com_dynamomodels.yamldeploy/cloud/operator/config/samples/kustomization.yamldeploy/cloud/operator/config/crd/kustomization.yamldeploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yamldeploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml
📚 Learning: 2025-07-18T16:04:47.465Z
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml:1233-1235
Timestamp: 2025-07-18T16:04:47.465Z
Learning: The `stopSignal` field in Kubernetes CRDs like DynamoGraphDeployment and DynamoComponentDeployment is autogenerated by controller-gen when upgrading Kubernetes library versions, and represents expected upstream API changes rather than manual code that needs custom validation.
Applied to files:
deploy/cloud/operator/PROJECTdeploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yamldeploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yamldeploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yamldeploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml
📚 Learning: 2025-10-24T04:21:08.751Z
Learnt from: biswapanda
Repo: ai-dynamo/dynamo PR: 3858
File: recipes/deepseek-r1/model-cache/model-download.yaml:18-32
Timestamp: 2025-10-24T04:21:08.751Z
Learning: In the recipes directory structure, model-specific recipes (e.g., recipes/deepseek-r1/, recipes/llama-3-70b/) contain hardcoded model names and revisions in their Kubernetes manifests (like model-download.yaml). Each recipe directory is deployment-specific and self-contained, so hardcoding model-specific values is the intended design pattern.
Applied to files:
deploy/cloud/operator/config/crd/kustomization.yaml
🧬 Code graph analysis (12)
deploy/cloud/operator/internal/modelendpoint/discovery.go (2)
deploy/cloud/operator/internal/modelendpoint/types.go (1)
Candidate(21-24)deploy/cloud/operator/api/v1alpha1/dynamo_model_types.go (1)
DynamoModelList(110-114)
deploy/cloud/operator/cmd/main.go (2)
deploy/cloud/operator/internal/controller/dynamo_model_controller.go (1)
DynamoModelReconciler(63-67)deploy/cloud/operator/internal/modelendpoint/client.go (2)
Client(43-45)NewClient(48-54)
deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go (1)
deploy/cloud/operator/internal/dynamo/headless_service.go (1)
ReconcileModelServicesForComponents(37-97)
deploy/cloud/operator/internal/workerpool/pool.go (1)
deploy/cloud/operator/api/dynamo/schemas/schemas.go (1)
Duration(38-38)
deploy/cloud/operator/api/v1alpha1/dynamo_model_types.go (1)
deploy/cloud/operator/api/v1alpha1/groupversion_info.go (1)
SchemeBuilder(35-35)
deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go (1)
deploy/cloud/operator/internal/dynamo/headless_service.go (2)
ReconcileModelServicesForComponents(37-97)AddBaseModelLabel(143-147)
deploy/cloud/operator/api/v1alpha1/zz_generated.deepcopy.go (2)
deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go (1)
ModelReference(279-287)deploy/cloud/operator/api/v1alpha1/dynamo_model_types.go (6)
DynamoModel(99-105)DynamoModelList(110-114)DynamoModelSpec(25-44)ModelSource(47-54)DynamoModelStatus(72-86)EndpointInfo(57-69)
deploy/cloud/operator/internal/controller/dynamo_model_controller.go (5)
deploy/cloud/operator/internal/modelendpoint/client.go (2)
Client(43-45)NewClient(48-54)deploy/cloud/operator/api/v1alpha1/dynamo_model_types.go (2)
DynamoModel(99-105)EndpointInfo(57-69)deploy/cloud/operator/internal/modelendpoint/discovery.go (2)
FindModelsForBaseModel(77-112)ExtractCandidates(35-73)deploy/cloud/operator/internal/modelendpoint/types.go (1)
Candidate(21-24)deploy/cloud/operator/internal/consts/consts.go (1)
DynamoSystemPort(22-22)
deploy/cloud/operator/internal/dynamo/headless_service.go (3)
deploy/cloud/operator/internal/controller_common/resource.go (2)
Reconciler(49-52)SyncResource(60-195)deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go (2)
DynamoComponentDeploymentSharedSpec(48-111)ModelReference(279-287)deploy/cloud/operator/internal/consts/consts.go (3)
KubeLabelDynamoBaseModel(41-41)DynamoSystemPortName(23-23)DynamoSystemPort(22-22)
deploy/cloud/operator/internal/dynamo/graph.go (1)
deploy/cloud/operator/internal/dynamo/headless_service.go (1)
AddBaseModelLabel(143-147)
deploy/cloud/operator/internal/modelendpoint/client.go (3)
deploy/cloud/operator/internal/modelendpoint/types.go (1)
Candidate(21-24)deploy/cloud/operator/api/v1alpha1/dynamo_model_types.go (2)
DynamoModel(99-105)EndpointInfo(57-69)deploy/cloud/operator/internal/workerpool/pool.go (2)
Task(28-31)Execute(43-102)
deploy/cloud/operator/internal/modelendpoint/lora.go (1)
deploy/cloud/operator/internal/modelendpoint/client.go (1)
Client(43-45)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: sglang (amd64)
- GitHub Check: trtllm (arm64)
- GitHub Check: trtllm (amd64)
- GitHub Check: vllm (amd64)
- GitHub Check: operator (amd64)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (17)
deploy/cloud/operator/PROJECT (1)
27-34: LGTM!The DynamoModel resource configuration follows the same structure and conventions as the existing DynamoComponentDeployment and DynamoGraphDeployment resources.
deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go (1)
344-354: LGTM!The model service reconciliation is appropriately placed after Grove scaling, ensuring that workload resources are created before setting up endpoint discovery services. Error handling follows the established pattern in this controller.
deploy/cloud/operator/internal/consts/consts.go (1)
41-41: LGTM!The new constant follows the established naming convention and is appropriately positioned with other Kubernetes label constants.
deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go (2)
86-89: LGTM!The optional
ModelReffield is well-documented and designed for backward compatibility. The documentation clearly explains its purpose for endpoint discovery via headless services.
278-287: LGTM!The
ModelReferencetype is well-designed with appropriate validation markers. The requiredNamefield and optionalRevisionfield provide flexibility while ensuring essential information is present.deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go (2)
330-343: LGTM!The model service reconciliation is correctly implemented for component-level reconciliation. The componentMap contains only the current component, which is appropriate for this controller's scope.
943-955: Improved label handling.The function now properly initializes and populates labels instead of returning an empty map. This ensures that:
- Existing component labels are preserved
- Base model labels are added when a ModelRef is specified
This is a positive change that enables proper label propagation throughout the resource hierarchy.
deploy/cloud/operator/cmd/main.go (2)
63-63: LGTM!The modelendpoint import is correctly added and used for creating the EndpointClient.
564-571: LGTM!The DynamoModelReconciler setup follows the established pattern for controller initialization. The EndpointClient is appropriately created once and injected into the reconciler.
deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml (1)
10139-10152: Go types are properly defined and aligned with the CRD schema.The verification confirms that the
ModelReferencestruct is correctly defined indeploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go(lines 278–287) with proper kubebuilder annotations (+kubebuilder:validation:Requiredforname,+optionalforrevision). ThemodelReffield inDynamoComponentDeploymentSharedSpecis correctly typed as*ModelReferencewith the+optionaltag and proper JSON marshaling hints. The autogenerated CRD schema accurately reflects these Go types, and the structure aligns with howmodelRefis used in the controller code (e.g.,AddBaseModelLabelfunction). No issues found.deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml (1)
65-72: LGTM: RBAC permissions properly scoped for DynamoModel CRD.The added permissions for EndpointSlices discovery and DynamoModel lifecycle management (including finalizers and status updates) are appropriate and follow standard Kubernetes controller patterns.
Also applies to: 372-372, 387-387, 396-396
deploy/cloud/operator/config/rbac/role.yaml (1)
89-96: LGTM: RBAC permissions consistent with Helm template.The RBAC additions mirror those in the Helm template and are properly scoped for the DynamoModel controller's operational needs.
Also applies to: 173-173, 188-188, 197-197
deploy/cloud/operator/config/crd/bases/nvidia.com_dynamomodels.yaml (2)
166-182: Verify whether podName should be required in EndpointInfo.The
podNamefield is not in therequiredlist (lines 179-181), suggesting it may be optional. However, in a Kubernetes environment with endpoint discovery via EndpointSlices, the pod name should typically always be known and valuable for debugging and observability.Please confirm whether
podNamecan legitimately be absent in any scenario. If not, consider adding it to the required fields:required: - address + - podName - ready
174-178: Clarify the design intent for base model endpoint tracking.The comment states "For base models: always false (no probing performed)," which suggests base model endpoints are tracked but never marked ready. This raises questions about the utility of endpoint tracking for base models and whether the status structure optimally serves both base and LoRA model use cases.
Please clarify the design rationale:
- Why track endpoints for base models if ready is always false?
- Is there a future plan to probe base model readiness?
- Would separate status structures for base vs LoRA models improve clarity?
deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml (1)
10005-10018: RBAC check for headless Service creationSince modelRef triggers headless service creation for endpoint discovery, verify the PR includes RBAC for Services and EndpointSlices (get/list/watch/create/update/patch) in the operator’s ClusterRole.
Based on learnings.
deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml (1)
10139-10152: Verify that base CRD file has been updated to generate this template change.Per prior learnings on this codebase, CRD schemas in
deploy/cloud/helm/crds/templates/*.yamlare auto-generated from base CRD files indeploy/cloud/operator/config/crd/bases/and should not be manually edited, as manual changes would be overwritten during regeneration.The
modelReffield definition itself appears structurally sound as OpenAPI v3 schema. However, ensure that the corresponding base CRD file (deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml) has been updated with this field, and that this template change was auto-generated from it rather than manually added.deploy/cloud/operator/internal/dynamo/headless_service.go (1)
91-105: Review comment is incorrect for Go 1.24.0The repository declares
go 1.24.0in deploy/cloud/operator/go.mod, which is well after Go 1.22. Starting with Go 1.22, loop variables are scoped per iteration rather than reused across iterations, so closures over loop variables are safe. The problematic code pattern the review describes is not an issue in this codebase.Additionally, the review references the wrong file. The actual loops capturing
candidateare in deploy/cloud/operator/internal/modelendpoint/client.go (LoadLoRA at lines 91–102, UnloadLoRA at lines 152–163), not in headless_service.go.The code requires no changes.
Likely an incorrect or invalid review comment.
deploy/cloud/operator/internal/controller/dynamo_model_controller.go
Outdated
Show resolved
Hide resolved
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
deploy/cloud/operator/internal/controller/dynamo_model_controller.go
Outdated
Show resolved
Hide resolved
deploy/cloud/operator/internal/controller/dynamo_model_controller.go
Outdated
Show resolved
Hide resolved
deploy/cloud/operator/internal/controller/dynamo_model_controller.go
Outdated
Show resolved
Hide resolved
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
deploy/cloud/operator/internal/controller/dynamo_model_controller.go
Outdated
Show resolved
Hide resolved
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
tmonty12
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🧹 Nitpick comments (8)
docs/kubernetes/deployment/dynamomodel-guide.md (1)
53-58: Fix documentation lint warnings.Several fenced code blocks (for example here) lack a language hint, which triggers
markdownlintMD040. While updating, please also register this page in the Sphinx toctree to clear the “document isn’t included in any toctree” warning from the docs pipeline.deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml (1)
10139-10152: Improve description clarity to separate intent from implementation details.Verification confirms the
modelRefstructure is consistent across both DynamoGraphDeployment and DynamoComponentDeployment CRDs. However, the description mixes user-facing intent with controller implementation details. Separate these concerns for schema clarity:modelRef: description: |- - ModelRef references a model that this component serves - When specified, a headless service will be created for endpoint discovery + ModelRef references a model that this component serves. + The referenced model must exist as a DynamoModel resource in the same namespace.Apply the same improvement to the parallel field in DynamoComponentDeployment for consistency.
deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml (1)
10005-10018: Add validation constraints, printer columns, and optional immutability to modelRef field.The
modelReffield is properly wired in the controller (usesspec.ModelRefsafely with nil/empty checks), but the generated CRD lacks string validation, UX enhancements, and optional immutability constraints:
- Validation constraints: Add
minLength: 1,maxLength: 253, and optionally apatternonnameto reject empty or invalid identifiers.- Printer columns: Surface
modelRef.nameandmodelRef.revisioninkubectl getfor better visibility.- Immutability (optional): Add
x-kubernetes-validationsCEL rule to prevent accidental model name swaps; keeprevisionmutable if rollout scenarios depend on it.- Description: Clarify that the referenced DynamoModel is resolved in the same namespace as the DynamoComponentDeployment.
Suggested diffs (from original review comment):
modelRef: description: |- ModelRef references a model that this component serves When specified, a headless service will be created for endpoint discovery + The referenced DynamoModel is resolved in the same namespace as this resource. properties: name: description: Name is the base model identifier (e.g., "llama-3-70b-instruct-v1") type: string + minLength: 1 + maxLength: 253 revision: description: Revision is the model revision/version (optional) type: string required: - name type: objectAlso add printer columns for Model and Revision in
additionalPrinterColumnsas shown in the original review comment.deploy/cloud/operator/config/crd/bases/nvidia.com_dynamomodels.yaml (2)
96-108: Consider adding validation to enforce thatsourceis only applicable for LoRA models.The description on line 97 states that the
sourcefield is "only applicable for lora model type," but there's no OpenAPI schema validation enforcing this constraint. Users could inadvertently setsourceon base model types, which may lead to confusion or unexpected behavior.Consider adding a CEL validation rule or restructuring the spec to enforce this constraint:
source: description: Source specifies the model source location (only applicable for lora model type) properties: uri: description: |- URI is the model source URI Supported formats: - S3: s3://bucket/path/to/model - HuggingFace: hf://org/model@revision_sha type: string required: - uri type: object + x-kubernetes-validations: + - rule: "self.modelType == 'base' ? !has(self.source) : true" + message: "source field is only applicable when modelType is 'lora' or 'adapter'" required: - baseModelName - modelName
176-192: Clarify whetherpodNameis required in endpoint status.The
podNamefield is described (lines 180-182) but not included in the required fields list (lines 189-192), which only requiresaddressandready. If the discovery logic always populatespodName(as suggested by the implementation in discovery.go), consider making it a required field for consistency and to prevent incomplete status representations.If
podNameshould always be present, apply this diff:required: - address + - podName - readyAlternatively, if
podNamecan legitimately be absent in some scenarios, update the description to clarify when it may be omitted.deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml (1)
10005-10018: Add validation markers toModelReference.Namein Go struct; clarify that service naming uses hashing.The Go struct at
deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go:279definesModelReferencewith only aRequiredmarker onName. The field lacksMinLength,MaxLength, or pattern constraints. However, the implementation mitigates DNS concerns:model_service.go:162shows service names are generated asdynamo-model-{8-char-hash}, not from the raw name.Recommended improvements:
- Add
+kubebuilder:validation:MinLength=1and+kubebuilder:validation:MaxLength=255to constrain theNamefield in the Go struct.- Update the struct comment to clarify semantics: "Name is the base model identifier (e.g., HuggingFace model ID like 'meta-llama/llama-3-70b-instruct'). Service names are generated deterministically from this value and do not use it directly."
- Document namespace behavior: add comment clarifying whether
ModelRefmust resolve in the same namespace (current code shows no cross-namespace lookup).- Optional: if cross-namespace refs are planned, add an
optional Namespace *stringfield to the struct and corresponding validation.Regenerate CRDs via
make manifestsor controller-gen after updating the Go struct; do not hand-edit the YAML.deploy/cloud/operator/internal/modelendpoint/client.go (1)
32-53: LGTM! Consider documenting timeout interaction.The timeout constants and client initialization are well-structured. The bounded concurrency (10) with per-request (15s) and total (30s) timeouts provides good control.
Optional: The interaction between
RequestTimeout(applied tohttpClient) andTotalTimeout(enforced byworkerpool.Execute) could be clarified in a comment. With concurrent execution, individual requests timing out at 15s will surface as errors before the 30s total timeout, which appears intentional but might benefit from documentation.deploy/cloud/operator/internal/dynamo/model_service.go (1)
152-164: LGTM! Hash-based naming is appropriate.The 8-character SHA256 hash provides a good balance between brevity and collision resistance. For typical deployments with hundreds to thousands of models, collision probability is negligible.
Optional consideration: The 8-character hex hash provides ~4 billion possible values with 50% collision probability at ~77k models (birthday paradox). For very large-scale deployments planning to manage tens of thousands of unique base models, consider increasing the hash length. For typical use cases, the current implementation is well-suited.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (37)
deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml(1 hunks)deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml(1 hunks)deploy/cloud/helm/crds/templates/nvidia.com_dynamomodels.yaml(1 hunks)deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml(4 hunks)deploy/cloud/operator/PROJECT(1 hunks)deploy/cloud/operator/api/v1alpha1/dynamo_model_types.go(1 hunks)deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go(2 hunks)deploy/cloud/operator/api/v1alpha1/zz_generated.deepcopy.go(11 hunks)deploy/cloud/operator/cmd/main.go(2 hunks)deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml(1 hunks)deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml(1 hunks)deploy/cloud/operator/config/crd/bases/nvidia.com_dynamomodels.yaml(1 hunks)deploy/cloud/operator/config/crd/kustomization.yaml(1 hunks)deploy/cloud/operator/config/rbac/role.yaml(4 hunks)deploy/cloud/operator/config/samples/kustomization.yaml(1 hunks)deploy/cloud/operator/internal/consts/consts.go(1 hunks)deploy/cloud/operator/internal/controller/dynamo_model_controller.go(1 hunks)deploy/cloud/operator/internal/controller/dynamo_model_controller_test.go(1 hunks)deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go(3 hunks)deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go(1 hunks)deploy/cloud/operator/internal/controller/suite_test.go(2 hunks)deploy/cloud/operator/internal/dynamo/graph.go(1 hunks)deploy/cloud/operator/internal/dynamo/model_service.go(1 hunks)deploy/cloud/operator/internal/modelendpoint/client.go(1 hunks)deploy/cloud/operator/internal/modelendpoint/client_test.go(1 hunks)deploy/cloud/operator/internal/modelendpoint/discovery.go(1 hunks)deploy/cloud/operator/internal/modelendpoint/discovery_test.go(1 hunks)deploy/cloud/operator/internal/modelendpoint/lora.go(1 hunks)deploy/cloud/operator/internal/modelendpoint/lora_test.go(1 hunks)deploy/cloud/operator/internal/modelendpoint/types.go(1 hunks)deploy/cloud/operator/internal/workerpool/pool.go(1 hunks)deploy/cloud/operator/internal/workerpool/pool_test.go(1 hunks)docs/kubernetes/README.md(1 hunks)docs/kubernetes/api_reference.md(5 hunks)docs/kubernetes/deployment/create_deployment.md(1 hunks)docs/kubernetes/deployment/dynamomodel-guide.md(1 hunks)docs/kubernetes/dynamo_operator.md(2 hunks)
🧰 Additional context used
🧠 Learnings (7)
📓 Common learnings
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 1474
File: deploy/cloud/operator/internal/controller/dynamocomponent_controller.go:1308-1312
Timestamp: 2025-06-11T21:29:28.650Z
Learning: User julienmancuso expects replies in English; avoid switching languages unless explicitly requested.
📚 Learning: 2025-07-18T16:04:31.771Z
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:92-98
Timestamp: 2025-07-18T16:04:31.771Z
Learning: CRD schemas in files like deploy/cloud/helm/crds/templates/*.yaml are auto-generated from Kubernetes library upgrades and should not be manually modified as changes would be overwritten during regeneration.
Applied to files:
deploy/cloud/operator/config/crd/kustomization.yamldocs/kubernetes/api_reference.mddeploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yamldeploy/cloud/helm/crds/templates/nvidia.com_dynamomodels.yamldeploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml
📚 Learning: 2025-09-04T19:03:06.643Z
Learnt from: biswapanda
Repo: ai-dynamo/dynamo PR: 2872
File: examples/multimodal/deploy/agg_qwen.yaml:53-60
Timestamp: 2025-09-04T19:03:06.643Z
Learning: In the dynamo repository, Kubernetes Custom Resources use `gpu: "1"` format for GPU resource limits and requests, not the standard Kubernetes `nvidia.com/gpu: 1` format. This applies to DynamoGraphDeployment resources and other dynamo CRs.
Applied to files:
deploy/cloud/operator/config/crd/kustomization.yamldocs/kubernetes/api_reference.mddocs/kubernetes/dynamo_operator.mddeploy/cloud/operator/PROJECTdeploy/cloud/operator/config/crd/bases/nvidia.com_dynamomodels.yamldeploy/cloud/helm/crds/templates/nvidia.com_dynamomodels.yamldeploy/cloud/operator/config/samples/kustomization.yamldeploy/cloud/operator/internal/consts/consts.godeploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yamldeploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml
📚 Learning: 2025-10-24T04:21:08.751Z
Learnt from: biswapanda
Repo: ai-dynamo/dynamo PR: 3858
File: recipes/deepseek-r1/model-cache/model-download.yaml:18-32
Timestamp: 2025-10-24T04:21:08.751Z
Learning: In the recipes directory structure, model-specific recipes (e.g., recipes/deepseek-r1/, recipes/llama-3-70b/) contain hardcoded model names and revisions in their Kubernetes manifests (like model-download.yaml). Each recipe directory is deployment-specific and self-contained, so hardcoding model-specific values is the intended design pattern.
Applied to files:
deploy/cloud/operator/config/crd/kustomization.yaml
📚 Learning: 2025-07-18T16:05:05.534Z
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:1178-1180
Timestamp: 2025-07-18T16:05:05.534Z
Learning: The stopSignal field under lifecycle in DynamoComponentDeployment CRDs is autogenerated due to Kubernetes library upgrades (k8s.io/api and k8s.io/apimachinery from v0.32.3 to v0.33.1), not a manual design decision by the user.
Applied to files:
docs/kubernetes/api_reference.mddeploy/cloud/operator/PROJECTdeploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yamldocs/kubernetes/deployment/dynamomodel-guide.mddeploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.godeploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yamldeploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yamldeploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml
📚 Learning: 2025-07-18T16:04:47.465Z
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml:1233-1235
Timestamp: 2025-07-18T16:04:47.465Z
Learning: The `stopSignal` field in Kubernetes CRDs like DynamoGraphDeployment and DynamoComponentDeployment is autogenerated by controller-gen when upgrading Kubernetes library versions, and represents expected upstream API changes rather than manual code that needs custom validation.
Applied to files:
docs/kubernetes/api_reference.mddeploy/cloud/operator/PROJECTdeploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yamldeploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yamldeploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yamldeploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml
📚 Learning: 2025-09-17T22:35:40.674Z
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 3100
File: deploy/cloud/operator/cmd/main.go:186-190
Timestamp: 2025-09-17T22:35:40.674Z
Learning: The mpiRunSecretName validation in deploy/cloud/operator/cmd/main.go is safe for upgrades because the Helm chart automatically populates dynamo-operator.dynamo.mpiRun.secretName with a default value of "mpi-run-ssh-secret" and includes SSH key generation functionality via sshKeygen.enabled: true.
Applied to files:
deploy/cloud/operator/PROJECT
🪛 GitHub Actions: Generate Documentation
docs/kubernetes/deployment/dynamomodel-guide.md
[warning] 1-1: Document isn't included in any toctree [toc.not_included].
🪛 LanguageTool
docs/kubernetes/deployment/dynamomodel-guide.md
[uncategorized] ~503-~503: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...y| All endpoints are ready | ✅ Good - full service availability | |NotReady` | Not all e...
(EN_COMPOUND_ADJECTIVE_INTERNAL)
🪛 markdownlint-cli2 (0.18.1)
docs/kubernetes/api_reference.md
421-421: Bare URL used
(MD034, no-bare-urls)
docs/kubernetes/deployment/dynamomodel-guide.md
54-54: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
253-253: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
265-265: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
315-315: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🔇 Additional comments (36)
deploy/cloud/operator/config/crd/kustomization.yaml (1)
22-22: LGTM! DynamoModel CRD resource added correctly.The addition of the DynamoModel CRD base follows the established pattern for other CRDs (DynamoComponentDeployment and DynamoGraphDeployment) and integrates properly with the kustomization structure.
deploy/cloud/operator/internal/controller/suite_test.go (1)
36-36: LGTM! Test scheme properly extended for EndpointSlice support.The addition of
discoveryv1types to the test scheme follows the established pattern and correctly enables EndpointSlice-based discovery testing for the new DynamoModel functionality.Also applies to: 113-114
deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go (1)
344-354: LGTM! Model service reconciliation properly integrated.The new reconciliation step is well-placed after Grove scaling and includes appropriate error handling and logging. The integration follows the established patterns in the controller.
deploy/cloud/operator/PROJECT (1)
27-34: LGTM! DynamoModel resource properly configured.The kubebuilder project configuration for DynamoModel follows the established pattern and includes all required fields (crdVersion, namespaced, controller, domain, kind, path, version).
docs/kubernetes/dynamo_operator.md (2)
15-15: LGTM! DynamoModel controller documented clearly.The addition accurately describes the DynamoModelController's purpose for managing model lifecycle and LoRA adapters.
104-116: LGTM! Documentation structure enhanced for model management.The updates clearly list DynamoModel as a core CRD and provide a helpful reference to the detailed deployment guide. The documentation follows the established structure and provides good user guidance.
docs/kubernetes/deployment/create_deployment.md (1)
221-259: LGTM! Excellent documentation for LoRA adapter deployment.The new Step 6 provides clear, practical guidance with well-structured YAML examples showing the relationship between
modelRefin the worker configuration and the correspondingDynamoModelresource. The link to the comprehensive guide is helpful for users seeking more details.docs/kubernetes/api_reference.md (3)
40-40: LGTM! DynamoModel properly integrated into API reference.The DynamoModel resource type and
modelReffield are correctly added to the API reference, maintaining consistency with the existing documentation structure.Also applies to: 170-170, 206-206
351-425: LGTM! Comprehensive DynamoModel API documentation.The documentation thoroughly covers all DynamoModel-related types (DynamoModel, DynamoModelSpec, DynamoModelStatus, EndpointInfo) with clear field descriptions, validation rules, and cross-references. The structure follows the established pattern for other CRD documentation in this file.
Note: The static analysis warning about a bare URL at line 421 is a false positive—it's an example endpoint address in the documentation, not a link that needs formatting.
468-500: LGTM! ModelReference and ModelSource types documented clearly.The documentation for these supporting types is complete and follows the established conventions, with appropriate field descriptions and validation annotations.
deploy/cloud/operator/internal/dynamo/graph.go (1)
1029-1030: No nil safety issue — function already handles nil gracefully.The verification shows that
AddBaseModelLabelat lines 168-173 ofdeploy/cloud/operator/internal/dynamo/model_service.goexplicitly checks for nil input withif labels == nil || modelRef == nil || modelRef.Name == "" { return }and returns early. The code is safe and requires no changes.Likely an incorrect or invalid review comment.
deploy/cloud/operator/internal/consts/consts.go (1)
41-43: LGTM! Constant additions follow established conventions.The three new constants for base model labeling and annotation are well-named and consistent with the existing patterns in this file.
deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml (2)
65-72: LGTM! EndpointSlice RBAC permissions are correctly scoped.The read-only access (get, list, watch) to EndpointSlices in the discovery.k8s.io API group is appropriate for the endpoint discovery functionality.
372-372: LGTM! DynamoModel RBAC permissions are comprehensive.The permissions for dynamomodels resources, finalizers, and status follow the established pattern for other CRDs managed by this operator.
Also applies to: 387-387, 396-396
deploy/cloud/operator/config/rbac/role.yaml (2)
89-96: LGTM! EndpointSlice permissions match Helm template.The read-only RBAC rules for EndpointSlices are correctly configured and consistent with the manager-rbac.yaml template.
173-173: LGTM! DynamoModel permissions are properly configured.The dynamomodels resource and its subresources (finalizers, status) have appropriate RBAC verbs for controller operations.
Also applies to: 188-188, 197-197
deploy/cloud/operator/internal/modelendpoint/discovery.go (2)
36-74: LGTM! Endpoint extraction logic is well-implemented.The
ExtractCandidatesfunction correctly:
- Filters for Pod-ready endpoints only
- Validates TargetRef kind is "Pod" (addressing past review feedback)
- Handles IPv6 addresses using
net.JoinHostPort(addressing past review feedback)- Collects service names for tracking
79-115: LGTM! Model lookup uses efficient field indexing.The
FindModelsForBaseModelfunction leverages controller-runtime's field indexer for O(1) lookups, which is an efficient approach for finding models by base model name or hash. Error handling and logging are appropriate.deploy/cloud/operator/internal/modelendpoint/lora_test.go (2)
30-211: LGTM! Comprehensive test coverage for LoRA loading.The test suite thoroughly validates:
- URL path construction with various base address formats
- Request body structure with proper lora_name and source.uri fields
- Response handling for success (200, 201) and error cases (400, 404, 500)
- Content-Type header verification
213-330: LGTM! Thorough test coverage for LoRA unloading.The unload tests properly verify:
- DELETE method usage
- URL path construction including models with special characters
- Success responses (200, 204)
- Error responses (404, 500) with appropriate error propagation
deploy/cloud/operator/internal/modelendpoint/discovery_test.go (2)
40-464: LGTM! Excellent test coverage for endpoint extraction.The
TestExtractCandidatessuite thoroughly validates all edge cases:
- Filtering logic for Pod-only, ready endpoints
- Handling of nil/non-Pod TargetRefs
- Multiple addresses per endpoint
- Multiple services in the slice
- Mixed valid/invalid endpoints
The test structure with optional
validateCandidateshooks provides flexibility for deeper assertions.
466-650: LGTM! Comprehensive test coverage for model lookup.The
TestFindModelsForBaseModelsuite effectively validates:
- Field indexer-based lookups with multiple matches
- Empty result handling
- Namespace-scoped filtering
- Proper use of fake client with index registration
The tests correctly construct the scheme and register the indexer function for testing the efficient O(1) lookup behavior.
deploy/cloud/operator/cmd/main.go (2)
63-63: LGTM!The import is correctly added and used for creating the EndpointClient.
564-571: Code is correctly implemented; original concerns do not apply.The EndpointClient initialization in main.go follows Go best practices for HTTP clients:
NewClient()returns a configured*Clientwithout error, which is the correct pattern for factory functions that always succeed- The Client is already configured with
RequestTimeout = 15 * time.Secondand supports bounded concurrency (MaxConcurrentOperations = 10)- All Client methods properly accept
context.Contextfor cancellation and timeout controlhttp.Clientdoes not require explicit cleanup; connection pooling is automaticThe initialization pattern and resource management are appropriate for this use case.
deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml (1)
10139-10152: Verify that these YAML changes were generated from updated Go structs, not manually edited.Per the learnings from PR 2012, CRD schema files in
deploy/cloud/helm/crds/templates/are auto-generated artifacts that should not be manually modified—changes should come from regenerating CRDs viacontroller-genfrom the corresponding Go struct definitions. Before merging, confirm that:
- The
modelReffield was added to theDynamoGraphDeploymentSpecGo struct- The YAML changes are the output of running
make generateor equivalent CRD regeneration- No manual edits were made directly to the YAML file
The field design itself is sound:
name(required string) correctly identifies the model, andrevision(optional string) provides flexibility. The description clearly explains the purpose and relationship to headless service creation.deploy/cloud/operator/internal/modelendpoint/client.go (2)
55-136: LGTM! LoadLoRA implementation is solid.The function correctly:
- Skips loading for non-LoRA models (lines 64-76)
- Validates Source URI requirement for LoRA models (lines 78-86)
- Executes load operations with bounded concurrency
- Returns partial results with detailed error information
Note on runtime validation (lines 78-86): The Source URI validation occurs at runtime rather than via API-side validation (admission webhooks). Based on past review comments, this is a known limitation tracked in issue #4190 and requires certificate management infrastructure to implement properly.
138-192: LGTM! UnloadLoRA follows the same robust pattern.The function mirrors the LoadLoRA implementation with bounded concurrency and detailed logging. The best-effort approach (continuing on errors) is appropriate for unload operations, with comprehensive per-endpoint error reporting.
deploy/cloud/operator/internal/dynamo/model_service.go (3)
36-97: LGTM! ReconcileModelServicesForComponents is well-designed.The function correctly:
- Deduplicates services by base model name to avoid creating duplicates (lines 48-66)
- Generates deterministic headless services for endpoint discovery
- Provides appropriate error handling with context
- Serves as reusable logic across multiple controllers
99-150: LGTM! Headless service generation is correctly implemented.The function properly configures a headless service for model endpoint discovery:
ClusterIP: Noneenables direct pod IP resolutionPublishNotReadyAddresses: false(line 135) ensures only ready pods appear in EndpointSlices, which is critical for reliable endpoint discovery- Deterministic naming and label-based selection using hashes
- Annotations preserve original model name for human readability
166-181: LGTM! Helper functions are clean and defensive.Both
AddBaseModelLabelandAddBaseModelAnnotationcorrectly handle nil inputs and provide consistent label/annotation management patterns.deploy/cloud/operator/internal/controller/dynamo_model_controller.go (6)
47-72: LGTM! Constants and reconciler structure are well-defined.The exported constants provide clear contract points for conditions and reasons. The requeue duration is now properly defined as a constant (
requeueAfterDuration, line 64), addressing previous feedback. The reconciler struct has appropriate fields for client operations, event recording, and endpoint management.
80-130: LGTM! Early reconciliation logic is correct.The function properly:
- Fetches the DynamoModel with standard NotFound handling
- Delegates finalizer handling to the common handler
- Handles the no-endpoints case appropriately by updating status and conditions without requeuing, relying on EndpointSlice watches to trigger future reconciliation (line 129)
132-147: LGTM! The infinite requeue fix is correctly implemented.The failure detection logic (lines 136-141) correctly addresses the past review concern:
- For LoRA models:
hasFailuresis true if probe errors occur OR if not all endpoints are ready (line 140)- For base models:
hasFailuresis true only on probe errors, sinceReadyis expected to remainfalsefor base modelsThis prevents infinite 30-second requeues for healthy base models while still requeuing on actual probe failures and for LoRA models with unready endpoints.
149-208: LGTM! Status and condition updates are well-structured.The condition updates correctly differentiate behavior by model type:
- LoRA models (lines 164-178): Check readiness counts and set
EndpointsReadytoTrueonly when all endpoints are ready- Base models (lines 179-188): Simply verify endpoints exist, since readiness doesn't apply
The requeue logic (lines 200-206) appropriately uses the constant
requeueAfterDurationwhen failures are detected, with clear logging of the reason.
211-265: LGTM! Helper functions and manager setup are well-implemented.The helpers are straightforward and correct. The
SetupWithManagerfunction properly configures:
- Field indexing by base model hash (lines 240-252) for efficient O(1) queries when mapping EndpointSlices to DynamoModels
- Appropriate predicates:
GenerationChangedPredicatefor DynamoModel to avoid unnecessary reconciliations- EndpointSlice watch with custom mapping and disabled generic events to trigger reconciliation on endpoint changes
267-372: LGTM! Finalization and endpoint discovery are correctly implemented.
FinalizeResource(lines 295-337) properly implements best-effort cleanup:
- Only cleans up LoRA models (base models don't require unload operations)
- Continues with deletion even if endpoint discovery or unload operations fail (lines 310, 322)
- Logs failures at Info level, which is appropriate since deletion proceeds regardless
- Emits events for observability
getEndpointCandidates(lines 339-372) correctly performs label-based endpoint discovery using the hashed base model name, enabling efficient queries.
deploy/cloud/operator/internal/controller/dynamo_model_controller_test.go
Outdated
Show resolved
Hide resolved
deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go
Show resolved
Hide resolved
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Overview:
add dynamoModel CRD
Summary by CodeRabbit
Release Notes
New Features
Documentation
Example of a DGD and associated new DynamoModel CR :
the new controller would make sure the workers of the DGD (both decode and worker) would have the LORA loaded by calling their POST /v1/loras API.
internally we use headless service and associated endpointSlices to make sure the LORA are loaded