feat(disruption): Memory pressure. #1040
Merged
Zenithar merged 8 commits into main from zenithar/chaos-controller/memory_pressure_injection on Feb 26, 2026
8 commits
e0350aa feat(disruption): Memory pressure. (Zenithar)
15825a7 chore(test): add E2E tests, and fix CPU stress race condition. (Zenithar)
5dcad81 chore(go): update vendor. (Zenithar)
5e73238 chore(ci): update licenses (Zenithar)
4ad05e9 chore(git): pr reviews. (Zenithar)
47ac995 Merge branch 'main' into zenithar/chaos-controller/memory_pressure_in… (aymericDD)
4b9dae8 chore(license): update banners. (Zenithar)
7c84e77 fix(injector): fix data race in memoryStressInjector.Clean() (Zenithar)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -56,5 +56,6 @@ ebpf/builds/ | ||
| | ||
| # DS_Store | ||
| .DS_Store | ||
| .claude | ||
| | ||
| .envrc | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,141 @@ | ||
| # CLAUDE.md | ||
| | ||
| This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. | ||
| | ||
| ## Project Overview | ||
| | ||
| Kubernetes operator for chaos engineering (Datadog). Injects systemic failures (network, CPU, disk, DNS, gRPC, container/node failure) into Kubernetes clusters at scale. Built with Kubebuilder v3 and controller-runtime. | ||
| | ||
| ## Build Commands | ||
| | ||
| ```bash | ||
| make docker-build-all # Build all Docker images (manager, injector, handler) | ||
| make docker-build-injector # Build injector Docker image | ||
| make docker-build-handler # Build handler Docker image | ||
| make docker-build-manager # Build manager Docker image | ||
| make docker-build-only-all # Build all images without saving tars | ||
| make manifests # Generate CRDs and RBAC manifests | ||
| make generate # Generate Go code (deepcopy, etc.) | ||
| make generate-mocks # Regenerate mocks (mockery v2.53.5) | ||
| make clean-mocks # Remove all generated mocks | ||
| make generate-disruptionlistener-protobuf # Generate disruptionlistener protobuf | ||
| make generate-chaosdogfood-protobuf # Generate chaosdogfood protobuf | ||
| make chaosli # Build CLI helper tool | ||
| make chaosli-test # Test chaosli API portability (Docker) | ||
| make godeps # go mod tidy + vendor | ||
| make deps # godeps + license check | ||
| make header # Check/fix license headers | ||
| make header-fix # Fix missing license headers | ||
| make license # Check licenses | ||
| make release # Run release script (VERSION required) | ||
| make update-deps # Update Python dependencies (tasks/requirements.txt) | ||
| ``` | ||
| | ||
| ## Testing | ||
| | ||
| ```bash | ||
| make test # Run all unit tests (Ginkgo v2) | ||
| make test TEST_ARGS="injector" # Filter tests by package name | ||
| make test TEST_ARGS="--until-it-fails" # Detect flaky tests | ||
| make test GINKGO_PROCS=4 # Control parallelism | ||
| make e2e-test # End-to-end tests (requires cluster) | ||
| make e2e-test SKIP_DEPLOY=true # E2E tests without redeploying controller | ||
| ``` | ||
| | ||
| Tests use **Ginkgo v2** (BDD) with **Gomega** matchers. Coverage output: `cover.profile`. | ||
| | ||
| ## Linting and Formatting | ||
| | ||
| ```bash | ||
| make lint # golangci-lint (v2.8.0) | ||
| make fmt # Format Go code | ||
| make vet # Go vet | ||
| make spellcheck # Spell check markdown docs | ||
| make spellcheck-report # Spell check with report output | ||
| make spellcheck-docker # Spell check via Docker (platform-agnostic) | ||
| make spellcheck-format-spelling # Sort and deduplicate .spelling file | ||
| ``` | ||
| | ||
| ## Local Development | ||
| | ||
| ```bash | ||
| make lima-all # Start local k3s cluster with controller | ||
| make lima-start # Start lima cluster | ||
| make lima-stop # Stop and delete lima cluster | ||
| make lima-redeploy # Rebuild and redeploy to local cluster | ||
| make lima-install # Install CRDs and controller into lima cluster | ||
| make lima-uninstall # Uninstall CRDs and controller from lima cluster | ||
| make lima-restart # Restart chaos-controller pod | ||
| make lima-push-all # Push all images to lima cluster | ||
| make lima-push-injector # Build and push injector image to lima | ||
| make lima-push-handler # Build and push handler image to lima | ||
| make lima-push-manager # Build and push manager image to lima | ||
| make lima-install-cert-manager # Install cert-manager into cluster | ||
| make lima-install-datadog-agent # Install Datadog agent into cluster | ||
| make lima-install-demo # Install demo workloads (curl + nginx) | ||
| make lima-install-longhorn # Install Longhorn StorageClass for disk throttling | ||
| make lima-kubectx # Configure kubectl context for lima | ||
| make lima-kubectx-clean # Remove lima references from kubectl config | ||
| make minikube-load-all # Load all images into minikube | ||
| make watch # Auto-rebuild on file changes | ||
| make debug # Prepare for IDE debugging | ||
| make run # Run controller locally | ||
| ``` | ||
| | ||
| ## CI | ||
| | ||
| ```bash | ||
| make ci-install-minikube # Install and start minikube for CI | ||
| make venv # Create Python virtual environment | ||
| make install-datadog-ci # Install datadog-ci binary | ||
| ``` | ||
| | ||
| ## Tool Installation | ||
| | ||
| ```bash | ||
| make install-golangci-lint # Install golangci-lint | ||
| make install-controller-gen # Install controller-gen | ||
| make install-mockery # Install mockery | ||
| make install-helm # Install Helm | ||
| make install-protobuf # Install protoc | ||
| make install-kubebuilder # Install kubebuilder + setup-envtest | ||
| make install-yamlfmt # Install yamlfmt | ||
| make install-watchexec # Install watchexec (via brew) | ||
| make install-go # Install Go (version from Makefile) | ||
| ``` | ||
| | ||
| ## Architecture | ||
| | ||
| Three main components, each with its own Dockerfile in `bin/`: | ||
| | ||
| - **Manager** (`main.go`, `controllers/`): Long-running controller pod. Watches Disruption CRDs, selects targets via label selectors, creates chaos pods, manages lifecycle with finalizers. Reconciliation flow: add finalizer → compute spec hash → select targets → create chaos pods → track injection status. | ||
| - **Injector** (`injector/`, `cli/injector/`): Runs as ephemeral chaos pods on target nodes. Performs actual disruption using Linux primitives (cgroups, tc, iptables, eBPF). One chaos pod per target per disruption kind. | ||
| - **Handler** (`webhook/`, `cli/handler/`): Admission webhook for pod initialization-time network disruptions. | ||
| | ||
| ### CRDs (api/v1beta1/) | ||
| | ||
| - **Disruption**: Main resource defining what failure to inject and targeting criteria | ||
| - **DisruptionCron**: Scheduled/recurring disruptions | ||
| - **DisruptionRollout**: Progressive disruption rollout | ||
| | ||
| ### Key Packages | ||
| | ||
| - `controllers/` — Reconciliation controllers for Disruption, DisruptionCron, and DisruptionRollout CRDs | ||
| - `targetselector/` — Target selection logic (labels, count, filters, safety nets) | ||
| - `safemode/` — Safety mechanisms to prevent dangerous disruptions | ||
| - `eventnotifier/` — Notifications (Slack, Datadog, HTTP) | ||
| - `o11y/` — Observability (metrics, tracing, profiling for Datadog and Prometheus) | ||
| - `cloudservice/` — Cloud provider integrations | ||
| - `ebpf/` — eBPF programs for network disruption | ||
| - `grpc/disruptionlistener/` — gRPC service for disruption events | ||
| - `chart/` — Helm chart for deployment | ||
| | ||
| ### Code Generation | ||
| | ||
| CRDs are defined in `api/v1beta1/` with kubebuilder markers. After modifying types, run `make manifests generate`. Mocks are generated with mockery into `mocks/`. Protobuf definitions live in `grpc/` and `dogfood/`. | ||
| | ||
| ## Requirements | ||
| | ||
| - Kubernetes >= 1.16 (not 1.20.0-1.20.4) | ||
| - Go 1.25.6 | ||
| - Docker with buildx (multi-arch: amd64, arm64) | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,78 @@ | ||
| // Unless explicitly stated otherwise all files in this repository are licensed | ||
| // under the Apache License Version 2.0. | ||
| // This product includes software developed at Datadog (https://www.datadoghq.com/). | ||
| // Copyright 2026 Datadog, Inc. | ||
| | ||
| package v1beta1 | ||
| | ||
| import ( | ||
| "fmt" | ||
| "strconv" | ||
| "strings" | ||
| | ||
| "github.com/hashicorp/go-multierror" | ||
| ) | ||
| | ||
| // MemoryPressureSpec represents a memory pressure disruption | ||
| type MemoryPressureSpec struct { | ||
| // Target memory utilization as a percentage (e.g., "76%") | ||
| // +kubebuilder:validation:Required | ||
| TargetPercent string `json:"targetPercent" chaos_validate:"required"` | ||
| // Duration over which memory is gradually consumed (e.g., "10m") | ||
| // If empty, memory is consumed immediately | ||
| RampDuration DisruptionDuration `json:"rampDuration,omitempty"` | ||
| } | ||
| | ||
| // Validate validates args for the given disruption | ||
| func (s *MemoryPressureSpec) Validate() (retErr error) { | ||
| // Rule: targetPercent must be a valid percentage between 1 and 100 | ||
| pct, err := ParseTargetPercent(s.TargetPercent) | ||
| if err != nil { | ||
| retErr = multierror.Append(retErr, fmt.Errorf("invalid targetPercent %q: %w", s.TargetPercent, err)) | ||
| } else if pct < 1 || pct > 100 { | ||
| retErr = multierror.Append(retErr, fmt.Errorf("targetPercent must be between 1 and 100, got %d", pct)) | ||
| } | ||
| | ||
| // Rule: rampDuration must be non-negative | ||
| if s.RampDuration.Duration() < 0 { | ||
| retErr = multierror.Append(retErr, fmt.Errorf("rampDuration must be non-negative, got %s", s.RampDuration)) | ||
| } | ||
| | ||
| return retErr | ||
| } | ||
| | ||
| // GenerateArgs generates injection or cleanup pod arguments for the given spec | ||
| func (s *MemoryPressureSpec) GenerateArgs() []string { | ||
| args := []string{ | ||
| "memory-pressure", | ||
| "--target-percent", s.TargetPercent, | ||
| } | ||
| | ||
| if s.RampDuration.Duration() > 0 { | ||
| args = append(args, "--ramp-duration", s.RampDuration.Duration().String()) | ||
| } | ||
| | ||
| return args | ||
| } | ||
| | ||
| // Explain returns a human-readable explanation of the memory pressure disruption | ||
| func (s *MemoryPressureSpec) Explain() []string { | ||
| pct, _ := ParseTargetPercent(s.TargetPercent) | ||
| | ||
| explanation := fmt.Sprintf("spec.memoryPressure will cause memory pressure on the target, by joining its cgroup and allocating memory to reach %d%% of the target's memory limit", pct) | ||
| | ||
| if s.RampDuration.Duration() > 0 { | ||
| explanation += fmt.Sprintf(", ramping up over %s.", s.RampDuration.Duration()) | ||
| } else { | ||
| explanation += " immediately." | ||
| } | ||
| | ||
| return []string{"", explanation} | ||
| } | ||
| | ||
| // ParseTargetPercent parses a percentage string like "76%" or "76" and returns the integer value | ||
| func ParseTargetPercent(s string) (int, error) { | ||
| s = strings.TrimSpace(s) | ||
| s = strings.TrimSuffix(s, "%") | ||
| | ||
| return strconv.Atoi(s) | ||
| } |
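
Note on the parsing and validation rules in the file above: the following is a minimal stand-alone sketch, not the code from this PR. It copies the body of `ParseTargetPercent` and the two `Validate` rules, and mirrors `GenerateArgs`. To stay within the standard library it assumes `errors.Join` in place of `go-multierror` and a plain `time.Duration` in place of `DisruptionDuration`. The lowercase names `parseTargetPercent`, `validate`, and `generateArgs` are hypothetical stand-ins for the methods in the diff.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
	"time"
)

// parseTargetPercent has the same body as ParseTargetPercent in the diff:
// it trims surrounding whitespace, then one trailing "%", then parses an int.
func parseTargetPercent(s string) (int, error) {
	s = strings.TrimSpace(s)
	s = strings.TrimSuffix(s, "%")

	return strconv.Atoi(s)
}

// validate is a hypothetical stand-alone version of the Validate rules.
// Assumption: errors.Join stands in for go-multierror, and time.Duration
// stands in for DisruptionDuration.
func validate(targetPercent string, ramp time.Duration) error {
	var errs []error

	// Rule: targetPercent must be a valid percentage between 1 and 100.
	pct, err := parseTargetPercent(targetPercent)
	if err != nil {
		errs = append(errs, fmt.Errorf("invalid targetPercent %q: %w", targetPercent, err))
	} else if pct < 1 || pct > 100 {
		errs = append(errs, fmt.Errorf("targetPercent must be between 1 and 100, got %d", pct))
	}

	// Rule: rampDuration must be non-negative.
	if ramp < 0 {
		errs = append(errs, fmt.Errorf("rampDuration must be non-negative, got %s", ramp))
	}

	// errors.Join returns nil when no errors were collected.
	return errors.Join(errs...)
}

// generateArgs mirrors GenerateArgs: the ramp flag is only emitted when
// the ramp duration is strictly positive.
func generateArgs(targetPercent string, ramp time.Duration) []string {
	args := []string{"memory-pressure", "--target-percent", targetPercent}

	if ramp > 0 {
		args = append(args, "--ramp-duration", ramp.String())
	}

	return args
}

func main() {
	// Print the validation result for a few representative inputs.
	for _, in := range []string{"76%", " 76 ", "0%", "150", "7.5%", "76 %"} {
		fmt.Printf("%q -> %v\n", in, validate(in, 10*time.Minute))
	}

	fmt.Println(generateArgs("76%", 10*time.Minute))
}
```

With this parsing order, `"76%"` and `" 76 "` are accepted, while `"0%"` and `"150"` fail the range check and `"7.5%"` fails integer parsing. An input such as `"76 %"` (space before the percent sign) is also rejected, because whitespace is trimmed before the suffix is removed and the leftover `"76 "` is not a valid integer. Trimming whitespace a second time after removing the suffix would accept it, if that is the desired behavior.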