feat(manifests): add node overprovisioning for faster agentic session startup #1458

rh-rahulshetty wants to merge 16 commits into ambient-code:main
Conversation
📝 Walkthrough

Introduces Kubernetes manifests for an overprovisioning component: a namespace, a low-priority PriorityClass, and a Deployment managing five pause pods configured for eviction during cluster scale-down.
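A skeletal sketch of what such a Deployment can look like, assembled from details quoted later in this review thread (replica count, labels, priority class name, pause image, resource requests); the namespace name is an assumption and the real manifest may differ:

```yaml
# Illustrative sketch only, not the exact manifest from this PR.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: acp-overprovisioner
  namespace: acp-overprovisioning   # assumption: actual namespace not shown here
spec:
  replicas: 5                       # five placeholder pods, per the walkthrough
  selector:
    matchLabels:
      app: acp-overprovisioner
  template:
    metadata:
      labels:
        app: acp-overprovisioner
      annotations:
        # Lets the cluster autoscaler scale a node down even when only
        # placeholders run there (see the PR summary below).
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
    spec:
      priorityClassName: acp-overprovisioning
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "500m"      # mirrors the runner container requests
              memory: "512Mi"
```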
🚥 Pre-merge checks: ✅ 8 passed
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@components/manifests/components/overprovisioner/deployment.yaml`:
- Around line 46-56: Add explicit resource limits matching the existing requests
under the container's resources block: mirror cpu: "500m" and memory: "512Mi"
into resources.limits so the pause container has limits equal to requests;
update the manifest where resources.requests is defined (the pause container in
the overprovisioner deployment) to include limits to ensure QoS Guaranteed and
compliance with LimitRange rules.
- Around line 14-33: The deployment currently allows all 5 replicas (replicas:
5) of the acp-overprovisioner pod to land on a single node; add spreading rules
to the pod template (under the pod spec where priorityClassName:
acp-overprovisioning is set) to distribute placeholders across nodes: add a
topologySpreadConstraints entry targeting pods with label app:
acp-overprovisioner (topologyKey: kubernetes.io/hostname, maxSkew: 1,
whenUnsatisfiable: ScheduleAnyway) or alternatively a preferred podAntiAffinity
(preferredDuringSchedulingIgnoredDuringExecution) to prefer one pod per host;
update the pod template spec to include one of these so placeholders are spread
rather than stacked on a single node.
📒 Files selected for processing (4)
- components/manifests/components/overprovisioner/deployment.yaml
- components/manifests/components/overprovisioner/namespace.yaml
- components/manifests/components/overprovisioner/priorityclass.yaml
- components/manifests/components/overprovisioner/prometheusrule.yaml
```yaml
resources:
  requests:
    # ── Tunable: must match runner pod resource requests ──
    # These values mirror the agentic session runner container requests
    # (see operator/internal/handlers/sessions.go defaults).
    # When a placeholder is evicted, the freed capacity is exactly
    # what a runner pod needs to start immediately.
    cpu: "500m"
    memory: "512Mi"
  # No limits set intentionally — the pause container uses zero
  # actual CPU/memory. Only the requests matter for scheduling.
```
🛠️ Refactor suggestion | 🟠 Major
Set resource limits alongside requests.
Repo guideline requires both. The "pause uses zero CPU/memory" reasoning is true in practice, but an explicit limit (equal to requests) makes the container QoS Guaranteed, protects against any future LimitRange surprises, and keeps this manifest compliant without changing runtime behavior.
🔧 Proposed

```diff
   resources:
     requests:
       cpu: "500m"
       memory: "512Mi"
-      # No limits set intentionally — the pause container uses zero
-      # actual CPU/memory. Only the requests matter for scheduling.
+    limits:
+      cpu: "500m"
+      memory: "512Mi"
```

As per coding guidelines: "Resource limits/requests required on containers."
📝 Committable suggestion
```yaml
resources:
  requests:
    # ── Tunable: must match runner pod resource requests ──
    # These values mirror the agentic session runner container requests
    # (see operator/internal/handlers/sessions.go defaults).
    # When a placeholder is evicted, the freed capacity is exactly
    # what a runner pod needs to start immediately.
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"
```
Signed-off-by: Rahul Shetty <rashetty@redhat.com>
Force-pushed from 64b0369 to 4bd05da.
It looks good. This + image-puller gets us the desired behavior of users never hitting a node that's not had images pre-pulled.

I see you have health rules for the new component. Do the existing OCP metrics cover being able to graph the behavior of it?
jeremyeder left a comment:
lgtm, one ask about behavior metrics - can be a followup.
♻️ Duplicate comments (1)
components/manifests/components/overprovisioner/deployment.yaml (1)
56-67: ⚠️ Potential issue | 🟠 Major

Missing resource limits — still violates repo guideline.

The explanatory comment is new, but the compliance gap isn't: the repo guideline requires both requests and limits on containers. The "Burstable makes eviction easier" reasoning is moot here — these pods are evicted via preemption (priority -10), not via node-pressure QoS ordering, so losing the Burstable class costs you nothing. Setting limits equal to requests gives Guaranteed QoS, keeps scheduling behavior identical, and passes the guideline check.

🔧 Proposed

```diff
   resources:
     requests:
       # ── Tunable: must match runner pod resource requests ──
       # These values mirror the agentic session runner container requests
       # (see operator/internal/handlers/sessions.go defaults).
       # When a placeholder is evicted, the freed capacity is exactly
       # what a runner pod needs to start immediately.
       cpu: "500m"
       memory: "512Mi"
-      # No limits set intentionally — keeps QoS class as Burstable,
-      # making these pods easier to evict. The pause container uses
-      # zero actual CPU/memory; only requests matter for scheduling.
+    limits:
+      cpu: "500m"
+      memory: "512Mi"
```

As per coding guidelines: "Resource limits/requests required on containers."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/manifests/components/overprovisioner/deployment.yaml` around lines 56 - 67, The manifest's container resources define requests (resources.requests with cpu "500m" and memory "512Mi") but omit resources.limits, violating the repo guideline that every container must have both requests and limits; add a matching resources.limits block for the same container and set cpu and memory limits equal to the requests (cpu: "500m", memory: "512Mi") so the container has both requests and limits (resources.requests and resources.limits) and the scheduling/eviction behavior remains unchanged while satisfying the policy.
🧹 Nitpick comments (1)
components/manifests/components/overprovisioner/deployment.yaml (1)
48-55: Pin the pause image by digest.

registry.k8s.io/pause:3.9 is a mutable tag. For a capacity-reservation component that silently underpins session startup latency, a digest pin (pause@sha256:...) prevents a surprise image swap from silently breaking the buffer across restarts/rollouts. Also consider imagePullPolicy: IfNotPresent to play nicely with acp-image-puller if/when this image is pre-cached.

🔧 Example

```diff
       - name: pause
-        image: registry.k8s.io/pause:3.9
+        image: registry.k8s.io/pause:3.9@sha256:<pinned-digest>
+        imagePullPolicy: IfNotPresent
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/manifests/components/overprovisioner/deployment.yaml` around lines 48 - 55, Update the pause container spec (container name "pause") to use a digest-pinned image (replace image: registry.k8s.io/pause:3.9 with registry.k8s.io/pause@sha256:<actual-digest>) and add imagePullPolicy: IfNotPresent; make these changes on the container entry where name is "pause" so the image is immutable across restarts/rollouts and it cooperates with acp-image-puller caching.
📒 Files selected for processing (3)
- components/manifests/components/overprovisioner/deployment.yaml
- components/manifests/components/overprovisioner/namespace.yaml
- components/manifests/components/overprovisioner/priorityclass.yaml
✅ Files skipped from review due to trivial changes (2)
- components/manifests/components/overprovisioner/priorityclass.yaml
- components/manifests/components/overprovisioner/namespace.yaml
@jeremyeder Just confirmed in the cluster that we don't have the required metrics for that Prometheus alert, so I deleted it for now. I think if required, we can observe the …
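If graphing is wanted later, one option that needs no new exporter is a recording rule over kube-state-metrics' kube_pod_status_phase, which ships with the OpenShift monitoring stack; this is a sketch only, and the namespace, pod-name pattern, and rule names are assumptions:

```yaml
# Sketch only: the PrometheusRule in this PR was removed, and this is not it.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: acp-overprovisioner-observability   # hypothetical name
  namespace: acp-overprovisioning           # assumption: actual namespace not confirmed
spec:
  groups:
    - name: overprovisioner.rules
      rules:
        # Pending placeholders mean the buffer has been consumed and the
        # autoscaler should be adding a node.
        - record: acp:overprovisioner_pods_pending:count
          expr: |
            count(
              kube_pod_status_phase{namespace="acp-overprovisioning", pod=~"acp-overprovisioner-.*", phase="Pending"} == 1
            ) or vector(0)
```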
Summary
Adds low-priority placeholder pods that reserve spare node capacity for agentic session runners. When a user creates a session, the runner pod preempts a placeholder instantly instead of waiting for the cluster autoscaler to provision a new node.
- Low PriorityClass (-10) ensures placeholders are evicted first
- safe-to-evict annotation preserves cluster scale-down behavior

Complements the existing acp-image-puller — image-puller caches runner images on nodes, overprovisioner ensures nodes are available.

How it works
Placeholder pods reserve capacity → User creates session → Runner preempts placeholder →
Runner starts immediately → Evicted placeholder goes Pending → Cluster autoscaler adds node →
Placeholder reschedules → Buffer restored
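The preemption step in that flow hinges on the negative-priority class; a minimal sketch, assuming the acp-overprovisioning name mentioned in the review comments:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: acp-overprovisioning
value: -10               # below the default 0, so any normal pod preempts these first
globalDefault: false
preemptionPolicy: Never  # assumption: placeholders should never preempt real workloads
description: Low priority for ACP overprovisioning placeholder pods.
```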
Experiment
The following observations were found when testing this in staging.
Initial state — 5 placeholder pods running in cluster (node size: 4):

After creating 7 agentic sessions — placeholders preempted (node size: 4):

Cluster autoscaler replenishes — placeholders rescheduled (node size scaled to 5):

After deleting 7 agentic sessions — placeholders rescheduled back to available nodes (node size scaled down to 4).