
Conversation

ArsalanAnwer0 (Contributor) commented Jan 24, 2026

Implement HorizontalPodAutoscaler (HPA) for 8 services to enable dynamic scaling based on CPU/memory utilization.

Changes

  • Add shared HPA template (buttercup.hpa) in _helpers.tpl to eliminate duplication
  • Add HPA support for fuzzer-bot, coverage-bot, build-bot, pov-reproducer, tracer-bot, patcher, seed-gen, and task-downloader
  • Update deployment templates to support HPA (conditional replica counts; see the sketch after this list)
  • Add autoscaling configuration to service values files (disabled by default)
  • Add memory-based scaling for memory-intensive services (fuzzer-bot, coverage-bot, tracer-bot)
  • Document scale-down stabilization window choices per service
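
As a rough illustration of the conditional replica counts, the deployment templates gate replicas along these lines (the replicaCount value name is an assumption about this chart):

{{- /* Only render a fixed replica count when the HPA is not managing it */}}
{{- if not .Values.autoscaling.enabled }}
replicas: {{ .Values.replicaCount }}
{{- end }}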

Configuration

Autoscaling is disabled by default. Enable per service in values.yaml:

fuzzer-bot:
  autoscaling:
    enabled: true
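
A fuller per-service block might look like the following; keys other than enabled (minReplicas, maxReplicas, targetCPUUtilizationPercentage) follow common Helm chart conventions and are illustrative rather than confirmed names from this chart:

fuzzer-bot:
  autoscaling:
    enabled: true
    minReplicas: 1          # lower bound on replicas
    maxReplicas: 10         # upper bound for scale-out
    targetCPUUtilizationPercentage: 80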

Stabilization Windows

  • 600s (coverage-bot, tracer-bot, patcher): Long-running analysis/patching tasks that should not be interrupted
  • 300s (fuzzer-bot, pov-reproducer, seed-gen, build-bot, task-downloader): Shorter-lived tasks that tolerate faster scale-down
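
Both windows are expressed through the autoscaling/v2 behavior block; a minimal sketch for a 600s service (the policy values here are illustrative, not taken from the chart):

behavior:
  scaleDown:
    stabilizationWindowSeconds: 600   # wait 10 minutes of sustained low load before scaling down
    policies:
      - type: Pods
        value: 1                      # remove at most one pod per period
        periodSeconds: 60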

Prerequisites

Requires the metrics-server to be installed in the cluster.
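
If it is not already present, metrics-server can usually be installed from the upstream manifest, e.g. kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml; some managed clusters ship it by default.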

Resolves #348

- Implement HPA resources for 8 services (fuzzer-bot, coverage-bot, build-bot, pov-reproducer, tracer-bot, patcher, seed-gen, task-downloader)
- Add autoscaling configuration to service values files (disabled by default)
- Update deployment templates to support conditional replica counts when HPA is enabled
- Adjust resource requests for better HPA accuracy (fuzzer-bot, coverage-bot, pov-reproducer, tracer-bot)
- Add autoscaling documentation to global values.yaml

Resolves trailofbits#348
dguido (Member) left a comment


PR Review: Add Kubernetes autoscaling support

Thanks for adding HPA support! The implementation is solid, but there are a few items to address before merging.


🔴 Blocking: Resource Request Increases Need Justification

The PR silently increases resource requests, which will affect cluster capacity even when HPA is disabled:

| Service | CPU Request | Memory Request |
| --- | --- | --- |
| coverage-bot | 250m → 500m | 256Mi → 6Gi (24x increase) |
| fuzzer-bot | | 256Mi → 1536Mi (6x) |
| pov-reproducer | 100m → 500m | 1Gi → 3Gi (3x) |
| tracer-bot | | 256Mi → 1536Mi (6x) |

The PR description mentions "adjust resource requests for better HPA accuracy", but these are significant changes that need justification.

Action required: Either:

  1. Add justification for these values (OOM observations, profiling data, etc.), or
  2. Split resource request changes into a separate PR so they can be reviewed independently

🟡 Suggestions

Consider a shared HPA template

All 8 hpa.yaml files are nearly identical (272 lines of duplication). Consider a helper template in _helpers.tpl:

{{- define "common.hpa" -}}
{{- if and .Values.enabled .Values.autoscaling.enabled }}
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
...
{{- end }}
{{- end }}

Not blocking, but would reduce maintenance burden.
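
If that route is taken, each service chart's hpa.yaml could shrink to a single include (the template name mirrors the sketch above):

{{ include "common.hpa" . }}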

Add memory-based scaling for memory-intensive services

Only pov-reproducer has targetMemoryUtilizationPercentage. Given fuzzer-bot, coverage-bot, and tracer-bot are memory-intensive, they might benefit from memory-based scaling too.
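
In autoscaling/v2 terms that means adding a memory Resource metric next to the CPU one, for example (the 80% target is illustrative):

metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80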

Document the scale-down timing differences

Some services use 600s stabilization (coverage-bot, tracer-bot, patcher), others use 300s. A brief comment in the values files explaining why would help future maintainers.


🟢 What looks good

  • Uses autoscaling/v2 (the current stable API) ✓
  • Correctly gates the replica count with {{- if not .Values.autoscaling.enabled }} ✓
  • Disabled by default (safe rollout) ✓
  • The and .Values.enabled .Values.autoscaling.enabled condition is correct ✓
  • Well-structured behavior policies with stabilization windows ✓

Optional improvements (not blocking)

  • Consider adding PodDisruptionBudgets for critical services when scaling down (see the sketch after this list)
  • Update deployment docs with metrics-server setup instructions for different cluster types
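
For the PodDisruptionBudget idea, a minimal sketch (the service name, labels, and minAvailable value are illustrative):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: patcher
spec:
  minAvailable: 1          # keep at least one pod running during voluntary disruptions
  selector:
    matchLabels:
      app: patcher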

- Revert resource request increases to original values
- Extract shared HPA template into _helpers.tpl to reduce duplication
- Add memory-based scaling for memory-intensive services
- Document scale-down stabilization window choices