Kubernetes operator for cost optimization -- automatically scales down workloads during off-hours and detects idle resources.
Sleep Schedules -- scale down workloads on a time-based schedule:
- Scales Deployments and StatefulSets to zero replicas
- Scales Prometheus Operator CRDs (ThanosRuler, Alertmanager, Prometheus) to zero replicas
- Suspends MariaDB Operator CRDs (MariaDB, MaxScale) via `spec.suspend`
- Suspends CronJobs, FluxCD HelmReleases, and Kustomizations
- Hibernates CNPG PostgreSQL clusters
- Timezone-aware scheduling with day-of-week filters
- Overnight schedule support (e.g., 22:00-06:00)
- Label selectors and name-based matching (with wildcards)
- State preservation -- original replica counts, suspend states, and hibernation annotations are stored for restoration
Idle Detection -- detect and optionally scale down underutilized workloads:
- Monitors CPU and memory usage against configurable thresholds
- Configurable idle duration before action
- Three modes: `alert` (report only), `scale` (auto-scale to zero), or `resize` (in-place pod right-sizing via K8s 1.33+)
- Supports Deployments, StatefulSets, and CronJobs
- `matchNames` wildcard selector (e.g., `prod-*`)
- Finalizer ensures workloads are restored on detector deletion
- Tracks original state for safe restoration
Note: The idle detector requires metrics-server in the cluster. Without it, the operator runs in degraded mode (always returns not-idle).
```sh
helm install slumlord oci://ghcr.io/cschockaert/charts/slumlord --version 2.13.0
```

```sh
make install  # Install CRDs
make run      # Run locally against current kubeconfig
```
```yaml
apiVersion: slumlord.io/v1alpha1
kind: SlumlordSleepSchedule
metadata:
  name: nightly-sleep
spec:
  selector:
    matchLabels:
      slumlord.io/managed: "true"
    types:
      - Deployment
      - StatefulSet
  schedule:
    start: "22:00"
    end: "06:00"
    timezone: Europe/Paris
    days: [1, 2, 3, 4, 5]
```
```yaml
apiVersion: slumlord.io/v1alpha1
kind: SlumlordSleepSchedule
metadata:
  name: weekend-stop
spec:
  selector:
    matchLabels:
      slumlord.io/managed: "true"
    types:
      - Deployment
      - StatefulSet
      - CronJob
  schedule:
    start: "00:00"
    end: "23:59"
    timezone: Europe/Paris
    days: [0, 6]
```
```yaml
apiVersion: slumlord.io/v1alpha1
kind: SlumlordSleepSchedule
metadata:
  name: pg-hibernate
spec:
  selector:
    matchLabels:
      slumlord.io/managed: "true"
    types:
      - Cluster
  schedule:
    start: "20:00"
    end: "07:00"
    timezone: Europe/Paris
    days: [1, 2, 3, 4, 5]
```
```yaml
apiVersion: slumlord.io/v1alpha1
kind: SlumlordSleepSchedule
metadata:
  name: flux-suspend
spec:
  selector:
    matchLabels:
      slumlord.io/managed: "true"
    types:
      - HelmRelease
      - Kustomization
  schedule:
    start: "21:55"
    end: "06:05"
    timezone: Europe/Paris
    days: [1, 2, 3, 4, 5]
```
Important: When managing FluxCD resources alongside Deployments/StatefulSets, use a wider sleep window for the FluxCD schedule. Suspend Flux reconciliation before scaling workloads, and resume it after restoring them. This prevents Flux from restoring scaled-down workloads during the sleep window.
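One way to derive the wider window is to pad the workload schedule by a few minutes on each side — a 5-minute pad turns 22:00-06:00 into the 21:55-06:05 used above. A minimal sketch (the helper is illustrative, not part of the operator):

```python
from datetime import datetime, timedelta

def pad_window(start: str, end: str, pad_minutes: int = 5) -> tuple[str, str]:
    """Widen an HH:MM sleep window so a controller's schedule brackets its workloads'."""
    fmt = "%H:%M"
    s = datetime.strptime(start, fmt) - timedelta(minutes=pad_minutes)
    e = datetime.strptime(end, fmt) + timedelta(minutes=pad_minutes)
    return s.strftime(fmt), e.strftime(fmt)

print(pad_window("22:00", "06:00"))
```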
```yaml
apiVersion: slumlord.io/v1alpha1
kind: SlumlordSleepSchedule
metadata:
  name: monitoring-sleep
spec:
  selector:
    matchLabels:
      slumlord.io/managed: "true"
    types:
      - Prometheus
      - Alertmanager
      - ThanosRuler
  schedule:
    start: "20:00"
    end: "07:00"
    timezone: Europe/Paris
    days: [1, 2, 3, 4, 5]
```
Note: The Prometheus Operator itself should NOT be scaled down -- only its managed CRs. The operator must be running to reconcile the CRs back up on wake.
```yaml
apiVersion: slumlord.io/v1alpha1
kind: SlumlordSleepSchedule
metadata:
  name: mariadb-suspend
spec:
  selector:
    matchLabels:
      slumlord.io/managed: "true"
    types:
      - MariaDB
      - MaxScale
  schedule:
    start: "21:55"
    end: "06:05"
    timezone: Europe/Paris
    days: [1, 2, 3, 4, 5]
```
Important: Like FluxCD, the MariaDB Operator's `spec.suspend` pauses its reconciliation loop. Suspend the operator's CRs before scaling down the underlying workloads, and resume them after restoring. This prevents the operator from recreating resources during the sleep window.
```yaml
apiVersion: slumlord.io/v1alpha1
kind: SlumlordIdleDetector
metadata:
  name: idle-alert
spec:
  selector:
    matchLabels:
      slumlord.io/managed: "true"
    types:
      - Deployment
      - StatefulSet
  thresholds:
    cpuPercent: 5
    memoryPercent: 10
  idleDuration: "1h"
  action: alert
```
```yaml
apiVersion: slumlord.io/v1alpha1
kind: SlumlordIdleDetector
metadata:
  name: idle-scaler
spec:
  selector:
    matchNames:
      - "dev-*"
      - "staging-*"
    types:
      - Deployment
  thresholds:
    cpuPercent: 3
    memoryPercent: 5
  idleDuration: "2h"
  action: scale
```
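The `matchNames` patterns behave like shell-style globs. A minimal sketch of the matching logic (illustrative; the operator's exact wildcard semantics may differ):

```python
from fnmatch import fnmatchcase

def matches(name: str, patterns: list[str]) -> bool:
    """True if the workload name matches any matchNames pattern (shell-style globs)."""
    return any(fnmatchcase(name, p) for p in patterns)

print(matches("dev-api", ["dev-*", "staging-*"]))   # matched by dev-*
print(matches("prod-api", ["dev-*", "staging-*"]))  # no pattern matches
```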
```yaml
apiVersion: slumlord.io/v1alpha1
kind: SlumlordIdleDetector
metadata:
  name: idle-resizer
spec:
  selector:
    matchLabels:
      slumlord.io/managed: "true"
    types:
      - Deployment
      - StatefulSet
  thresholds:
    cpuPercent: 10
    memoryPercent: 15
  idleDuration: "1h"
  action: resize
  reconcileInterval: 10m
  resize:
    bufferPercent: 30
    minRequests:
      cpu: "50m"
      memory: "64Mi"
```
Note: The `resize` action requires Kubernetes 1.33+ with the InPlacePodVerticalScaling feature gate enabled. It patches pod resource requests in place without restarting pods.
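The right-sizing arithmetic can be sketched as observed usage plus `bufferPercent` headroom, floored at `minRequests`. This is an illustrative model in integer millicores; the operator works with Kubernetes resource quantities.

```python
def right_size(usage_millicores: int, buffer_percent: int, min_millicores: int) -> int:
    """New CPU request = observed usage + headroom, never below the configured floor."""
    proposed = usage_millicores * (100 + buffer_percent) // 100
    return max(proposed, min_millicores)

# 120m observed usage with 30% headroom -> 156m; 10m usage is floored at 50m
print(right_size(120, 30, 50))
print(right_size(10, 30, 50))
```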
| Field | Type | Required | Description |
|---|---|---|---|
| `spec.selector.matchLabels` | `map[string]string` | No | Label selector for target workloads |
| `spec.selector.matchNames` | `[]string` | No | Name patterns (supports wildcards) |
| `spec.selector.types` | `[]string` | No | Workload types to manage. Valid: Deployment, StatefulSet, CronJob, Cluster, HelmRelease, Kustomization, ThanosRuler, Alertmanager, Prometheus, MariaDB, MaxScale. Default: all types |
| `spec.schedule.start` | `string` | Yes | Sleep start time in HH:MM format |
| `spec.schedule.end` | `string` | Yes | Wake time in HH:MM format |
| `spec.schedule.timezone` | `string` | No | IANA timezone (e.g., Europe/Paris). Default: UTC |
| `spec.schedule.days` | `[]int` | No | Days of week (0=Sunday, 6=Saturday). Default: every day |
| `spec.suspend` | `bool` | No | Pause the schedule; sleeping workloads are woken up. Default: false |
| `spec.reconcileInterval` | `duration` | No | Override the reconcile interval (e.g., 2m, 10m). Default: 5m |
| Field | Type | Required | Description |
|---|---|---|---|
| `spec.selector.matchLabels` | `map[string]string` | No | Label selector for target workloads |
| `spec.selector.matchNames` | `[]string` | No | Name patterns with wildcard support (e.g., `prod-*`) |
| `spec.selector.types` | `[]string` | No | Workload types: Deployment, StatefulSet, CronJob. Default: all |
| `spec.thresholds.cpuPercent` | `int32` | No | CPU usage % threshold (0-100). Below = idle |
| `spec.thresholds.memoryPercent` | `int32` | No | Memory usage % threshold (0-100). Below = idle |
| `spec.idleDuration` | `string` | Yes | How long a workload must be idle before action (e.g., 30m, 1h) |
| `spec.action` | `string` | Yes | `alert` (report only), `scale` (auto-scale to zero), or `resize` (in-place right-sizing) |
| `spec.reconcileInterval` | `duration` | No | Override the reconcile interval (e.g., 5m, 10m). Default: 5m |
| `spec.resize.bufferPercent` | `int32` | No | Headroom % above actual usage for resize. Default: 25 |
| `spec.resize.minRequests.cpu` | `quantity` | No | Minimum CPU request floor. Default: 50m |
| `spec.resize.minRequests.memory` | `quantity` | No | Minimum memory request floor. Default: 64Mi |
```sh
kubectl get slumlordsleepschedules -A
```

```
NAMESPACE   NAME            SLEEPING   START   END     DAYS      AGE
default     nightly-sleep   false      22:00   06:00   Mon-Fri   19h
```

```sh
kubectl get slumlordidledetectors -A
```

```
NAMESPACE   NAME         ACTION   IDLE DURATION   LAST CHECK             AGE
default     idle-alert   alert    1h              2026-02-08T12:00:00Z   1d
```
The operator runs a reconciliation loop (default: every 5 minutes, configurable) for each SlumlordSleepSchedule resource:
- Checks if the current time (in the configured timezone) falls within the sleep window
- On sleep: scales Deployments/StatefulSets to 0, scales Prometheus Operator CRDs (ThanosRuler, Alertmanager, Prometheus) to 0, suspends CronJobs, hibernates CNPG clusters, suspends FluxCD HelmReleases/Kustomizations, and suspends MariaDB Operator CRDs (MariaDB, MaxScale)
- On wake: restores all workloads to their original state
- Original state (replica counts, suspend flags, hibernation annotations) is stored in `status.managedWorkloads` to survive operator restarts
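The save/restore cycle can be modeled as follows — an illustrative sketch using plain replica counts; in practice the operator persists this state in `status.managedWorkloads`:

```python
def sleep(workloads: dict[str, int]) -> dict[str, int]:
    """Scale everything to zero, remembering original replica counts."""
    saved = dict(workloads)        # snapshot taken before scaling down
    for name in workloads:
        workloads[name] = 0
    return saved

def wake(workloads: dict[str, int], saved: dict[str, int]) -> None:
    """Restore each workload to its recorded replica count."""
    for name, replicas in saved.items():
        workloads[name] = replicas

w = {"web": 3, "worker": 2}
saved = sleep(w)   # w is now {"web": 0, "worker": 0}
wake(w, saved)
print(w)
```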
The idle detector reconciles periodically (default: every 5 minutes, configurable) for each SlumlordIdleDetector resource:
- Lists workloads matching the selector (labels and/or name patterns)
- Checks resource usage against configured thresholds
- Tracks how long each workload has been continuously idle
- In `alert` mode: reports idle workloads in `status.idleWorkloads`
- In `scale` mode: scales down workloads idle longer than `idleDuration` and stores original state in `status.scaledWorkloads`
- On detector deletion: restores all scaled workloads via finalizer
The BinPacker reconciler performs cluster-wide list operations each cycle (Nodes, Pods, ReplicaSets, PDBs, Deployments, StatefulSets). These are served from the controller-runtime informer cache, not direct API server calls. On very large clusters (thousands of Deployments/Pods), consider scoping with nodeSelector and namespaces to reduce the working set.
```mermaid
graph LR
    A[SlumlordSleepSchedule] --> B[Sleep Controller]
    B --> C[Deployments]
    B --> D[StatefulSets]
    B --> E[CronJobs]
    B --> F[CNPG Clusters]
    B --> G[HelmReleases]
    B --> H[Kustomizations]
    B --> K[ThanosRulers]
    B --> L[Alertmanagers]
    B --> M[Prometheuses]
    B --> N[MariaDBs]
    B --> O[MaxScales]
    I[SlumlordIdleDetector] --> J[Idle Controller]
    J --> C
    J --> D
    J --> E
```
Each controller has a default reconcile interval that can be overridden globally via CLI flags or per-resource via `spec.reconcileInterval`:

| Controller | Default | CLI Flag | Per-resource field |
|---|---|---|---|
| SleepSchedule | 5m | `--sleep-reconcile-interval` | `spec.reconcileInterval` |
| IdleDetector | 5m30s | `--idle-reconcile-interval` | `spec.reconcileInterval` |
| BinPacker | 6m | `--binpacker-reconcile-interval` | `spec.reconcileInterval` |
| NodeDrainPolicy | 6m30s | `--nodedrain-reconcile-interval` | `spec.reconcileInterval` |
Per-resource overrides take priority over global CLI flags, which take priority over built-in defaults.
Helm values:
```yaml
reconcileIntervals:
  sleepSchedule: "10m"  # Reduce API server load in large clusters
  idleDetector: "10m"
  binPacker: "10m"
  nodeDrain: "10m"
```
Recommendations:
- Dev/staging: use longer intervals (10m+) to reduce load
- Production: use shorter intervals (2-5m) for faster response
- Large clusters: combine longer intervals with scoped selectors (`nodeSelector`, `namespaces`)
- Short intervals (< 1m) increase API server pressure with minimal benefit
Important: Remove Slumlord resources before uninstalling to restore workloads.
```sh
# Restore all workloads by deleting resources first
kubectl delete slumlordidledetectors --all -A
kubectl delete slumlordsleepschedules --all -A

# Then uninstall
helm uninstall slumlord
```
```sh
make generate   # Regenerate DeepCopy methods
make manifests  # Regenerate CRD manifests
make test       # Run tests
make lint       # Lint code
make build      # Build binary
```
Apache 2.0