feat(manifests): add node overprovisioning for faster agentic session startup #1458

rh-rahulshetty wants to merge 16 commits into ambient-code:main
Conversation
📝 Walkthrough

Introduces Kubernetes manifests for an overprovisioning component: a namespace, a low-priority PriorityClass, and a Deployment managing five pause pods configured for eviction during cluster scale-down.
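A skeletal sketch of what such a Deployment can look like, assembled from details quoted later in this review thread (replica count, labels, priority class name, pause image, resource requests); the namespace name is an assumption and the real manifest may differ:

```yaml
# Illustrative sketch only, not the exact manifest from this PR.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: acp-overprovisioner
  namespace: acp-overprovisioning   # assumption: actual namespace not shown here
spec:
  replicas: 5                       # five placeholder pods, per the walkthrough
  selector:
    matchLabels:
      app: acp-overprovisioner
  template:
    metadata:
      labels:
        app: acp-overprovisioner
      annotations:
        # Lets the cluster autoscaler scale a node down even when only
        # placeholders run there (see the PR summary below).
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
    spec:
      priorityClassName: acp-overprovisioning
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "500m"      # mirrors the runner container requests
              memory: "512Mi"
```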
🚥 Pre-merge checks: ✅ 8 passed
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@components/manifests/components/overprovisioner/deployment.yaml`:
- Around line 46-56: Add explicit resource limits matching the existing requests
under the container's resources block: mirror cpu: "500m" and memory: "512Mi"
into resources.limits so the pause container has limits equal to requests;
update the manifest where resources.requests is defined (the pause container in
the overprovisioner deployment) to include limits to ensure QoS Guaranteed and
compliance with LimitRange rules.
- Around line 14-33: The deployment currently allows all 5 replicas (replicas:
5) of the acp-overprovisioner pod to land on a single node; add spreading rules
to the pod template (under the pod spec where priorityClassName:
acp-overprovisioning is set) to distribute placeholders across nodes: add a
topologySpreadConstraints entry targeting pods with label app:
acp-overprovisioner (topologyKey: kubernetes.io/hostname, maxSkew: 1,
whenUnsatisfiable: ScheduleAnyway) or alternatively a preferred podAntiAffinity
(preferredDuringSchedulingIgnoredDuringExecution) to prefer one pod per host;
update the pod template spec to include one of these so placeholders are spread
rather than stacked on a single node.
📒 Files selected for processing (4)
- components/manifests/components/overprovisioner/deployment.yaml
- components/manifests/components/overprovisioner/namespace.yaml
- components/manifests/components/overprovisioner/priorityclass.yaml
- components/manifests/components/overprovisioner/prometheusrule.yaml
```yaml
resources:
  requests:
    # ── Tunable: must match runner pod resource requests ──
    # These values mirror the agentic session runner container requests
    # (see operator/internal/handlers/sessions.go defaults).
    # When a placeholder is evicted, the freed capacity is exactly
    # what a runner pod needs to start immediately.
    cpu: "500m"
    memory: "512Mi"
  # No limits set intentionally — the pause container uses zero
  # actual CPU/memory. Only the requests matter for scheduling.
```
🛠️ Refactor suggestion | 🟠 Major
Set resource limits alongside requests.
Repo guideline requires both. The "pause uses zero CPU/memory" reasoning is true in practice, but an explicit limit (equal to requests) makes the container QoS Guaranteed, protects against any future LimitRange surprises, and keeps this manifest compliant without changing runtime behavior.
🔧 Proposed

```diff
   resources:
     requests:
       cpu: "500m"
       memory: "512Mi"
-      # No limits set intentionally — the pause container uses zero
-      # actual CPU/memory. Only the requests matter for scheduling.
+    limits:
+      cpu: "500m"
+      memory: "512Mi"
```

As per coding guidelines: "Resource limits/requests required on containers."
📝 Committable suggestion
```yaml
resources:
  requests:
    # ── Tunable: must match runner pod resource requests ──
    # These values mirror the agentic session runner container requests
    # (see operator/internal/handlers/sessions.go defaults).
    # When a placeholder is evicted, the freed capacity is exactly
    # what a runner pod needs to start immediately.
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"
```
Signed-off-by: Rahul Shetty <rashetty@redhat.com>
Force-pushed from 64b0369 to 4bd05da.
It looks good. This + image-puller gets us the desired behavior of users never hitting a node that's not had images pre-pulled.

I see you have health rules for the new component. Do the existing OCP metrics cover being able to graph the behavior of it?
jeremyeder left a comment:
lgtm, one ask about behavior metrics - can be a followup.
♻️ Duplicate comments (1)
components/manifests/components/overprovisioner/deployment.yaml (1)
56-67: ⚠️ Potential issue | 🟠 Major

Missing resource limits — still violates repo guideline.

The explanatory comment is new, but the compliance gap isn't: the repo guideline requires both requests and limits on containers. The "Burstable makes eviction easier" reasoning is moot here — these pods are evicted via preemption (priority -10), not via node-pressure QoS ordering, so losing the Burstable class costs you nothing. Setting limits equal to requests gives Guaranteed QoS, keeps scheduling behavior identical, and passes the guideline check.

🔧 Proposed

```diff
   resources:
     requests:
       # ── Tunable: must match runner pod resource requests ──
       # These values mirror the agentic session runner container requests
       # (see operator/internal/handlers/sessions.go defaults).
       # When a placeholder is evicted, the freed capacity is exactly
       # what a runner pod needs to start immediately.
       cpu: "500m"
       memory: "512Mi"
-      # No limits set intentionally — keeps QoS class as Burstable,
-      # making these pods easier to evict. The pause container uses
-      # zero actual CPU/memory; only requests matter for scheduling.
+    limits:
+      cpu: "500m"
+      memory: "512Mi"
```

As per coding guidelines: "Resource limits/requests required on containers."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/manifests/components/overprovisioner/deployment.yaml` around lines 56 - 67, The manifest's container resources define requests (resources.requests with cpu "500m" and memory "512Mi") but omit resources.limits, violating the repo guideline that every container must have both requests and limits; add a matching resources.limits block for the same container and set cpu and memory limits equal to the requests (cpu: "500m", memory: "512Mi") so the container has both requests and limits (resources.requests and resources.limits) and the scheduling/eviction behavior remains unchanged while satisfying the policy.
🧹 Nitpick comments (1)
components/manifests/components/overprovisioner/deployment.yaml (1)
48-55: Pin the pause image by digest.

registry.k8s.io/pause:3.9 is a mutable tag. For a capacity-reservation component that silently underpins session startup latency, a digest pin (pause@sha256:...) prevents a surprise image swap from silently breaking the buffer across restarts/rollouts. Also consider imagePullPolicy: IfNotPresent to play nicely with acp-image-puller if/when this image is pre-cached.

🔧 Example

```diff
       - name: pause
-        image: registry.k8s.io/pause:3.9
+        image: registry.k8s.io/pause:3.9@sha256:<pinned-digest>
+        imagePullPolicy: IfNotPresent
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/manifests/components/overprovisioner/deployment.yaml` around lines 48 - 55, Update the pause container spec (container name "pause") to use a digest-pinned image (replace image: registry.k8s.io/pause:3.9 with registry.k8s.io/pause@sha256:<actual-digest>) and add imagePullPolicy: IfNotPresent; make these changes on the container entry where name is "pause" so the image is immutable across restarts/rollouts and it cooperates with acp-image-puller caching.
📒 Files selected for processing (3)
- components/manifests/components/overprovisioner/deployment.yaml
- components/manifests/components/overprovisioner/namespace.yaml
- components/manifests/components/overprovisioner/priorityclass.yaml
✅ Files skipped from review due to trivial changes (2)
- components/manifests/components/overprovisioner/priorityclass.yaml
- components/manifests/components/overprovisioner/namespace.yaml
@jeremyeder Just confirmed in the cluster that we don't have the required metrics for that Prometheus alert, so I deleted it for now. I think if required, we can observe the …
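If graphing is wanted later, one option that needs no new exporter is a recording rule over kube-state-metrics' kube_pod_status_phase, which ships with the OpenShift monitoring stack; this is a sketch only, and the namespace, pod-name pattern, and rule names are assumptions:

```yaml
# Sketch only: the PrometheusRule in this PR was removed, and this is not it.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: acp-overprovisioner-observability   # hypothetical name
  namespace: acp-overprovisioning           # assumption: actual namespace not confirmed
spec:
  groups:
    - name: overprovisioner.rules
      rules:
        # Pending placeholders mean the buffer has been consumed and the
        # autoscaler should be adding a node.
        - record: acp:overprovisioner_pods_pending:count
          expr: |
            count(
              kube_pod_status_phase{namespace="acp-overprovisioning", pod=~"acp-overprovisioner-.*", phase="Pending"} == 1
            ) or vector(0)
```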
Summary
Adds low-priority placeholder pods that reserve spare node capacity for agentic session runners. When a user creates a session, the runner pod preempts a placeholder instantly instead of waiting for the cluster autoscaler to provision a new node.
- Low PriorityClass (-10) ensures placeholders are evicted first
- safe-to-evict annotation preserves cluster scale-down behavior

Complements the existing acp-image-puller — image-puller caches runner images on nodes, overprovisioner ensures nodes are available.

How it works
Placeholder pods reserve capacity → User creates session → Runner preempts placeholder →
Runner starts immediately → Evicted placeholder goes Pending → Cluster autoscaler adds node →
Placeholder reschedules → Buffer restored
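The preemption step in that flow hinges on the negative-priority class; a minimal sketch, assuming the acp-overprovisioning name mentioned in the review comments:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: acp-overprovisioning
value: -10               # below the default 0, so any normal pod preempts these first
globalDefault: false
preemptionPolicy: Never  # assumption: placeholders should never preempt real workloads
description: Low priority for ACP overprovisioning placeholder pods.
```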
Experiment
The following observations were found when testing this in staging.
Initial state — 5 placeholder pods running in cluster (node size: 4):

After creating 7 agentic sessions — placeholders preempted (node size: 4):

Cluster autoscaler replenishes — placeholders rescheduled (node size scaled to 5):

After deleting 7 agentic sessions — placeholders rescheduled back to available nodes (node size scaled down to 4).