
feat(manifests): add node overprovisioning for faster agentic session startup#1458

Open
rh-rahulshetty wants to merge 16 commits into ambient-code:main from rh-rahulshetty:feature/overprovisioner

Conversation

@rh-rahulshetty
Contributor

@rh-rahulshetty rh-rahulshetty commented Apr 24, 2026

Summary

Adds low-priority placeholder pods that reserve spare node capacity for agentic session runners. When a user creates a session, the runner pod preempts a placeholder instantly instead of waiting for the cluster autoscaler to provision a new node.

  • PriorityClass (-10) ensures placeholders are evicted first
  • 5 replicas, each matching runner resource requests (500m CPU / 512Mi memory)
  • safe-to-evict annotation preserves cluster scale-down behavior
  • PrometheusRule alerts for missing replicas or a missing deployment

This complements the existing acp-image-puller: the image-puller caches runner images on nodes, while the overprovisioner ensures nodes are available.
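
As a reference for the bullets above, here is a minimal sketch of the PriorityClass (name and value as described in this PR; the actual manifest is components/manifests/components/overprovisioner/priorityclass.yaml and may differ in details):

```yaml
# Sketch of the PriorityClass added by this PR. The negative value means any
# default-priority pod (such as a session runner) can preempt the placeholders.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: acp-overprovisioning
value: -10
globalDefault: false
description: "Low-priority placeholders reserving spare capacity for session runners."
```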

How it works

Placeholder pods reserve capacity → User creates session → Runner preempts placeholder →
Runner starts immediately → Evicted placeholder goes Pending → Cluster autoscaler adds node →
Placeholder reschedules → Buffer restored
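
A hedged sketch of the Deployment behind this cycle (values taken from the PR description and review thread; the exact manifest layout and metadata names are assumed):

```yaml
# Sketch of the overprovisioner Deployment (not the verbatim manifest).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: acp-overprovisioner        # name assumed for illustration
  namespace: acp-overprovisioner
spec:
  replicas: 5
  selector:
    matchLabels:
      app: acp-overprovisioner
  template:
    metadata:
      labels:
        app: acp-overprovisioner
      annotations:
        # Lets the cluster autoscaler scale nodes down despite these pods.
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
    spec:
      # Negative priority: session runners preempt these pods on demand.
      priorityClassName: acp-overprovisioning
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9  # consumes near-zero actual CPU/memory
          resources:
            requests:
              cpu: "500m"       # mirrors the runner pod requests
              memory: "512Mi"
```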

Experiment

The following observations are from testing this in staging.

Initial state — 5 placeholder pods running in the cluster (node count: 4):
[screenshot]

After creating 7 agentic sessions — placeholders preempted (node count still 4):
[screenshot]

Cluster autoscaler replenishes — placeholders rescheduled (node count scaled to 5):
[screenshot]

Deleted the 7 agentic sessions — placeholders rescheduled back onto the available nodes (node count scaled down to 4).

Summary by CodeRabbit

  • New Features
    • Added cluster autoscaling optimization with overprovisioning placeholder pods that reserve capacity and are evicted when needed for production workloads.
    • Introduced priority-based pod scheduling to ensure efficient resource utilization and faster workload placement during scale events.

@netlify

netlify Bot commented Apr 24, 2026

Deploy Preview for cheerful-kitten-f556a0 canceled.

🔨 Latest commit: c915e2a
🔍 Latest deploy log: https://app.netlify.com/projects/cheerful-kitten-f556a0/deploys/69f3ae32701d160008255478

@coderabbitai
Contributor

coderabbitai Bot commented Apr 24, 2026

📝 Walkthrough

Walkthrough

Introduces Kubernetes manifests for an overprovisioning component: a namespace, a low-priority PriorityClass, and a Deployment managing five pause pods configured for eviction during cluster scale-down.

Changes

Cohort / File(s) Summary
Overprovisioner Resources
components/manifests/components/overprovisioner/namespace.yaml, components/manifests/components/overprovisioner/priorityclass.yaml, components/manifests/components/overprovisioner/deployment.yaml
Adds namespace acp-overprovisioner, PriorityClass acp-overprovisioning (value: -10), and Deployment with 5 pause pod replicas. Pods configured with hardened security policies, 500m CPU/512Mi memory requests, topology spread constraints, and safe-to-evict annotation for autoscaler preemption.
🚥 Pre-merge checks | ✅ 8 passed
  • Description Check ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: title follows the Conventional Commits format (feat(manifests): description) and accurately describes the main changeset: adding node overprovisioning infrastructure via a new Deployment, Namespace, and PriorityClass.
  • Docstring Coverage ✅ Passed: no functions found in the changed files, so the docstring coverage check was skipped.
  • Linked Issues Check ✅ Passed: check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check ✅ Passed: check skipped because no linked issues were found for this pull request.
  • Performance and Algorithmic Complexity ✅ Passed: the PR adds three Kubernetes manifests (Namespace, PriorityClass, Deployment) with no algorithmic complexity or performance regressions.
  • Security and Secret Handling ✅ Passed: the manifests introduce zero security violations: no hardcoded secrets, credentials, API keys, or injection vectors. The container security context is fully hardened with runAsNonRoot, seccompProfile RuntimeDefault, allowPrivilegeEscalation false, and readOnlyRootFilesystem true.
  • Kubernetes Resource Safety ✅ Passed: the manifests satisfy all safety criteria: no child resources requiring OwnerReferences, properly defined resource requests with a documented no-limits design, zero overly permissive RBAC, proper namespace scoping, and comprehensive pod/container security hardening.



Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/manifests/components/overprovisioner/deployment.yaml`:
- Around line 46-56: Add explicit resource limits matching the existing requests
under the container's resources block: mirror cpu: "500m" and memory: "512Mi"
into resources.limits so the pause container has limits equal to requests;
update the manifest where resources.requests is defined (the pause container in
the overprovisioner deployment) to include limits to ensure QoS Guaranteed and
compliance with LimitRange rules.
- Around line 14-33: The deployment currently allows all 5 replicas (replicas:
5) of the acp-overprovisioner pod to land on a single node; add spreading rules
to the pod template (under the pod spec where priorityClassName:
acp-overprovisioning is set) to distribute placeholders across nodes: add a
topologySpreadConstraints entry targeting pods with label app:
acp-overprovisioner (topologyKey: kubernetes.io/hostname, maxSkew: 1,
whenUnsatisfiable: ScheduleAnyway) or alternatively a preferred podAntiAffinity
(preferredDuringSchedulingIgnoredDuringExecution) to prefer one pod per host;
update the pod template spec to include one of these so placeholders are spread
rather than stacked on a single node.
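
For illustration, a sketch of the topologySpreadConstraints entry the second comment asks for (standard Kubernetes scheduling fields; the app: acp-overprovisioner label is as named above):

```yaml
# Sketch of the suggested spreading rule: prefer one placeholder per node,
# but still schedule the pod if the constraint cannot be satisfied.
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: acp-overprovisioner
```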

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 184f886e-ac04-4178-af33-9b3ffaa09eb7

📥 Commits

Reviewing files that changed from the base of the PR and between 44893b9 and ffce156.

📒 Files selected for processing (4)
  • components/manifests/components/overprovisioner/deployment.yaml
  • components/manifests/components/overprovisioner/namespace.yaml
  • components/manifests/components/overprovisioner/priorityclass.yaml
  • components/manifests/components/overprovisioner/prometheusrule.yaml

Comment thread: components/manifests/components/overprovisioner/deployment.yaml
Comment on lines +46 to +56

          resources:
            requests:
              # ── Tunable: must match runner pod resource requests ──
              # These values mirror the agentic session runner container requests
              # (see operator/internal/handlers/sessions.go defaults).
              # When a placeholder is evicted, the freed capacity is exactly
              # what a runner pod needs to start immediately.
              cpu: "500m"
              memory: "512Mi"
              # No limits set intentionally — the pause container uses zero
              # actual CPU/memory. Only the requests matter for scheduling.

🛠️ Refactor suggestion | 🟠 Major

Set resource limits alongside requests.

Repo guideline requires both. The "pause uses zero CPU/memory" reasoning is true in practice, but an explicit limit (equal to requests) makes the container QoS Guaranteed, protects against any future LimitRange surprises, and keeps this manifest compliant without changing runtime behavior.

🔧 Proposed
           resources:
             requests:
               cpu: "500m"
               memory: "512Mi"
-              # No limits set intentionally — the pause container uses zero
-              # actual CPU/memory. Only the requests matter for scheduling.
+            limits:
+              cpu: "500m"
+              memory: "512Mi"

As per coding guidelines: "Resource limits/requests required on containers."


Signed-off-by: Rahul Shetty <rashetty@redhat.com>
@jeremyeder
Contributor

It looks good. This plus the image-puller gets us the desired behavior: users never hit a node that hasn't had images pre-pulled.

@jeremyeder
Contributor

I see you have health rules for the new component. Do the existing OCP metrics let us graph its behavior?

jeremyeder previously approved these changes Apr 24, 2026

@jeremyeder jeremyeder left a comment

lgtm, one ask about behavior metrics - can be a follow-up.

@coderabbitai coderabbitai Bot left a comment
♻️ Duplicate comments (1)
components/manifests/components/overprovisioner/deployment.yaml (1)

56-67: ⚠️ Potential issue | 🟠 Major

Missing resource limits — still violates repo guideline.

The explanatory comment is new, but the compliance gap isn't: repo guideline requires both requests and limits on containers. The "Burstable makes eviction easier" reasoning is moot here — these pods are evicted via preemption (priority -10), not via node-pressure QoS ordering, so losing the Burstable class costs you nothing. Setting limits equal to requests gives Guaranteed QoS, keeps scheduling behavior identical, and passes the guideline check.

🔧 Proposed
           resources:
             requests:
               # ── Tunable: must match runner pod resource requests ──
               # These values mirror the agentic session runner container requests
               # (see operator/internal/handlers/sessions.go defaults).
               # When a placeholder is evicted, the freed capacity is exactly
               # what a runner pod needs to start immediately.
               cpu: "500m"
               memory: "512Mi"
-              # No limits set intentionally — keeps QoS class as Burstable,
-              # making these pods easier to evict. The pause container uses
-              # zero actual CPU/memory; only requests matter for scheduling.
+            limits:
+              cpu: "500m"
+              memory: "512Mi"

As per coding guidelines: "Resource limits/requests required on containers."

🧹 Nitpick comments (1)
components/manifests/components/overprovisioner/deployment.yaml (1)

48-55: Pin the pause image by digest.

registry.k8s.io/pause:3.9 is a mutable tag. For a capacity-reservation component that silently underpins session startup latency, a digest pin (pause@sha256:...) prevents a surprise image swap from silently breaking the buffer across restarts/rollouts. Also consider imagePullPolicy: IfNotPresent to play nicely with acp-image-puller if/when this image is pre-cached.

🔧 Example
         - name: pause
-          image: registry.k8s.io/pause:3.9
+          image: registry.k8s.io/pause:3.9@sha256:<pinned-digest>
+          imagePullPolicy: IfNotPresent

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5ae63ec6-fb55-4053-b260-872d3524fbfa

📥 Commits

Reviewing files that changed from the base of the PR and between ffce156 and ef5bba0.

📒 Files selected for processing (3)
  • components/manifests/components/overprovisioner/deployment.yaml
  • components/manifests/components/overprovisioner/namespace.yaml
  • components/manifests/components/overprovisioner/priorityclass.yaml
✅ Files skipped from review due to trivial changes (2)
  • components/manifests/components/overprovisioner/priorityclass.yaml
  • components/manifests/components/overprovisioner/namespace.yaml

@rh-rahulshetty
Contributor Author

@jeremyeder Just confirmed in the cluster that we don't have the required metrics for that Prometheus alert, so I've deleted it for now. If needed, we can watch kube_pod_info in the overprovisioner namespace to see the restarts caused by the autoscaler.
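
For the record, a hypothetical sketch of what a kube_pod_info-based alert could look like if reintroduced later (kube_pod_info is a standard kube-state-metrics series; the alert name, threshold, and duration below are illustrative, not from this PR):

```yaml
# Hypothetical PrometheusRule sketch, per the idea above; not part of this PR.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: acp-overprovisioner-alerts
  namespace: acp-overprovisioner
spec:
  groups:
    - name: overprovisioner
      rules:
        - alert: OverprovisionerBufferDegraded
          # Fires if fewer than the expected 5 placeholder pods are registered.
          expr: count(kube_pod_info{namespace="acp-overprovisioner"}) < 5
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Placeholder buffer in acp-overprovisioner is below 5 pods."
```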
