
Conversation

ArsalanAnwer0 (Contributor) commented Jan 24, 2026

Implement HorizontalPodAutoscaler (HPA) for 8 services to enable dynamic scaling based on CPU/memory utilization.

Changes

  • Add shared HPA template (buttercup.hpa) in _helpers.tpl to eliminate duplication
  • Add HPA support for fuzzer-bot, coverage-bot, build-bot, pov-reproducer, tracer-bot, patcher, seed-gen, and task-downloader
  • Update deployment templates to support HPA (conditional replica counts; see the sketch after this list)
  • Add autoscaling configuration to service values files (disabled by default)
  • Add memory-based scaling for memory-intensive services (fuzzer-bot, coverage-bot, tracer-bot)
  • Document scale-down stabilization window choices per service
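
As a rough illustration of the conditional replica counts, the deployment templates gate replicas along these lines (the replicaCount value name is an assumption about this chart):

{{- /* Only render a fixed replica count when the HPA is not managing it */}}
{{- if not .Values.autoscaling.enabled }}
replicas: {{ .Values.replicaCount }}
{{- end }}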

Configuration

Autoscaling is disabled by default. Enable per service in values.yaml:

fuzzer-bot:
  autoscaling:
    enabled: true
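
A fuller per-service block might look like the following; keys other than enabled (minReplicas, maxReplicas, targetCPUUtilizationPercentage) follow common Helm chart conventions and are illustrative rather than confirmed names from this chart:

fuzzer-bot:
  autoscaling:
    enabled: true
    minReplicas: 1          # lower bound on replicas
    maxReplicas: 10         # upper bound for scale-out
    targetCPUUtilizationPercentage: 80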

Stabilization Windows

  • 600s (coverage-bot, tracer-bot, patcher): Long-running analysis/patching tasks that should not be interrupted
  • 300s (fuzzer-bot, pov-reproducer, seed-gen, build-bot, task-downloader): Shorter-lived tasks that tolerate faster scale-down
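
Both windows are expressed through the autoscaling/v2 behavior block; a minimal sketch for a 600s service (the policy values here are illustrative, not taken from the chart):

behavior:
  scaleDown:
    stabilizationWindowSeconds: 600   # wait 10 minutes of sustained low load before scaling down
    policies:
      - type: Pods
        value: 1                      # remove at most one pod per period
        periodSeconds: 60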

Prerequisites

Requires the metrics-server to be installed in the cluster.
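
If it is not already present, metrics-server can usually be installed from the upstream manifest, e.g. kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml; some managed clusters ship it by default.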

Resolves #348

- Implement HPA resources for 8 services (fuzzer-bot, coverage-bot, build-bot, pov-reproducer, tracer-bot, patcher, seed-gen, task-downloader)
- Add autoscaling configuration to service values files (disabled by default)
- Update deployment templates to support conditional replica counts when HPA is enabled
- Adjust resource requests for better HPA accuracy (fuzzer-bot, coverage-bot, pov-reproducer, tracer-bot)
- Add autoscaling documentation to global values.yaml

Resolves trailofbits#348
dguido (Member) left a comment


PR Review: Add Kubernetes autoscaling support

Thanks for adding HPA support! The implementation is solid, but there are a few items to address before merging.


🔴 Blocking: Resource Request Increases Need Justification

The PR silently increases resource requests, which will affect cluster capacity even when HPA is disabled:

| Service | CPU Request | Memory Request |
| --- | --- | --- |
| coverage-bot | 250m → 500m | 256Mi → 6Gi (24x increase) |
| fuzzer-bot | | 256Mi → 1536Mi (6x) |
| pov-reproducer | 100m → 500m | 1Gi → 3Gi (3x) |
| tracer-bot | | 256Mi → 1536Mi (6x) |

The PR description mentions "adjust resource requests for better HPA accuracy", but these are significant changes that need justification.

Action required: Either:

  1. Add justification for these values (OOM observations, profiling data, etc.), or
  2. Split resource request changes into a separate PR so they can be reviewed independently

🟡 Suggestions

Consider a shared HPA template

All 8 hpa.yaml files are nearly identical (272 lines of duplication). Consider a helper template in _helpers.tpl:

{{- define "common.hpa" -}}
{{- if and .Values.enabled .Values.autoscaling.enabled }}
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
...
{{- end }}
{{- end }}

Not blocking, but would reduce maintenance burden.
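
If that route is taken, each service chart's hpa.yaml could shrink to a single include (the template name mirrors the sketch above):

{{ include "common.hpa" . }}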

Add memory-based scaling for memory-intensive services

Only pov-reproducer has targetMemoryUtilizationPercentage. Given fuzzer-bot, coverage-bot, and tracer-bot are memory-intensive, they might benefit from memory-based scaling too.
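
In autoscaling/v2 terms that means adding a memory Resource metric next to the CPU one, for example (the 80% target is illustrative):

metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80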

Document the scale-down timing differences

Some services use 600s stabilization (coverage-bot, tracer-bot, patcher), others use 300s. A brief comment in the values files explaining why would help future maintainers.


🟢 What looks good

  • Uses autoscaling/v2 (the current stable API) ✓
  • Correctly gates the replica count with {{- if not .Values.autoscaling.enabled }} ✓
  • Disabled by default (safe rollout) ✓
  • The and .Values.enabled .Values.autoscaling.enabled condition is correct ✓
  • Well-structured behavior policies with stabilization windows ✓

Optional improvements (not blocking)

  • Consider adding PodDisruptionBudgets for critical services when scaling down (see the sketch after this list)
  • Update deployment docs with metrics-server setup instructions for different cluster types
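
For the PodDisruptionBudget idea, a minimal sketch (the service name, labels, and minAvailable value are illustrative):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: patcher
spec:
  minAvailable: 1          # keep at least one pod running during voluntary disruptions
  selector:
    matchLabels:
      app: patcher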

- Revert resource request increases to original values
- Extract shared HPA template into _helpers.tpl to reduce duplication
- Add memory-based scaling for memory-intensive services
- Document scale-down stabilization window choices