
Conversation


tmonty12 commented on Oct 27, 2025

This enhancement proposal addresses critical rollout scenarios for DynamoGraphDeployments (DGDs) and proposes strategies to support safe, zero-downtime updates for Dynamo/Grove deployments.

Problems Addressed

  1. Model Crosstalk - Workers from different models registering under the same EndpointId when deployed in the same namespace (sketched after this list)
  2. Failed Rollout Scenarios:
    a. Worker version upgrades causing request errors during rolling updates
    b. MDC checksum mismatches when updating model configurations
    c. Incompatible KV cache transfer between old/new worker sets during disaggregated rollouts
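
For concreteness, here is a minimal sketch of the crosstalk scenario in (1). Only `dynamoNamespace` comes from the proposal; the `apiVersion`, service names, and field placement are assumptions for illustration:

```yaml
# Two independent models deployed into the same Dynamo namespace:
# their workers register under the same EndpointId, so requests for
# one model can be routed to workers serving the other.
apiVersion: nvidia.com/v1alpha1          # assumed CRD group/version
kind: DynamoGraphDeployment
metadata:
  name: llama-dgd
spec:
  services:
    VllmDecodeWorker:
      dynamoNamespace: shared-ns         # same namespace as qwen-dgd below
      replicas: 2
---
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: qwen-dgd
spec:
  services:
    VllmDecodeWorker:
      dynamoNamespace: shared-ns         # collision → model crosstalk
      replicas: 2
```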

Proposed Solutions

  1. Namespace Isolation - Use the DGD name as the Dynamo namespace (deprecating the `dynamoNamespace` field)
  2. Worker Group Hashing - Introduce a `workerGroupName` field to isolate worker sets and enable canary deployments
  3. Enhanced Rolling Update Controls - Add `maxUnavailable` and `maxSurge` configuration support (see the sketch after this list)
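
A sketch of how the proposed fields might compose in a single DGD manifest. Only `workerGroupName`, `maxUnavailable`, and `maxSurge` come from this proposal; the `rolloutStrategy` placement and the other names are assumptions:

```yaml
apiVersion: nvidia.com/v1alpha1          # assumed CRD group/version
kind: DynamoGraphDeployment
metadata:
  name: llama-dgd                        # per (1), the DGD name doubles as the Dynamo namespace
spec:
  rolloutStrategy:                       # hypothetical placement for the new knobs
    rollingUpdate:
      maxUnavailable: 0                  # never dip below desired capacity
      maxSurge: 1                        # surge one new worker before retiring an old one
  services:
    VllmPrefillWorker:
      workerGroupName: A                 # groups prefill with the decode worker below
      replicas: 1
    VllmDecodeWorker:
      workerGroupName: A                 # KV cache transfer stays within one compatible set
      replicas: 2
```

Keeping prefill and decode in one group means a rollout replaces them as a unit, which avoids KV cache transfer between incompatible old/new worker sets (problem 2c).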

Goals

  • Zero-downtime rollouts for all update scenarios
  • Rollout strategy UX consistent with the rest of the Kubernetes ecosystem
  • Support for canary deployments and advanced use cases (hierarchical planner, speculative decoding)

tmonty12 changed the title from "Kubernetes Roll Out Strategy" to "Rollout Support for DynamoGraphDeployments" on Oct 28, 2025
tmonty12 marked this pull request as ready for review on October 28, 2025
The review comment below was left on this hunk of the proposal:

```yaml
    replicas: 1

  VllmDecodeWorker-A:
    workerGroupName: A   # ← Groups A decode + prefill
```

julienmancuso commented on Oct 28, 2025


As per our discussion, if we use this concept of groups, we might want to gang-schedule group A and group B separately.
We need to make sure such a feature is supported by the underlying framework (Grove, ...).
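
For reference, the gang-scheduling semantics being asked for resemble the coscheduling plugin from kubernetes-sigs/scheduler-plugins, sketched below. This is not Grove's API, only an illustration of the all-or-nothing scheduling each worker group would need:

```yaml
# One PodGroup per worker group: all of group A's pods are scheduled
# together or not at all, independently of group B.
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: worker-group-a
spec:
  minMember: 3          # e.g. 1 prefill + 2 decode pods in group A
---
apiVersion: v1
kind: Pod
metadata:
  name: decode-worker-a-0
  labels:
    scheduling.x-k8s.io/pod-group: worker-group-a   # joins the gang
spec:
  schedulerName: scheduler-plugins-scheduler        # assumed scheduler deployment name
  containers:
    - name: worker
      image: example/vllm-worker:latest             # placeholder image
```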

