Merged
1 change: 1 addition & 0 deletions .gitignore
@@ -56,5 +56,6 @@ ebpf/builds/

# DS_Store
.DS_Store
.claude

.envrc
141 changes: 141 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,141 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

Kubernetes operator for chaos engineering (Datadog). Injects systemic failures (network, CPU, disk, DNS, gRPC, container/node failure) into Kubernetes clusters at scale. Built with Kubebuilder v3 and controller-runtime.

## Build Commands

```bash
make docker-build-all # Build all Docker images (manager, injector, handler)
make docker-build-injector # Build injector Docker image
make docker-build-handler # Build handler Docker image
make docker-build-manager # Build manager Docker image
make docker-build-only-all # Build all images without saving tars
make manifests # Generate CRDs and RBAC manifests
make generate # Generate Go code (deepcopy, etc.)
make generate-mocks # Regenerate mocks (mockery v2.53.5)
make clean-mocks # Remove all generated mocks
make generate-disruptionlistener-protobuf # Generate disruptionlistener protobuf
make generate-chaosdogfood-protobuf # Generate chaosdogfood protobuf
make chaosli # Build CLI helper tool
make chaosli-test # Test chaosli API portability (Docker)
make godeps # go mod tidy + vendor
make deps # godeps + license check
make header # Check/fix license headers
make header-fix # Fix missing license headers
make license # Check licenses
make release # Run release script (VERSION required)
make update-deps # Update Python dependencies (tasks/requirements.txt)
```

## Testing

```bash
make test # Run all unit tests (Ginkgo v2)
make test TEST_ARGS="injector" # Filter tests by package name
make test TEST_ARGS="--until-it-fails" # Detect flaky tests
make test GINKGO_PROCS=4 # Control parallelism
make e2e-test # End-to-end tests (requires cluster)
make e2e-test SKIP_DEPLOY=true # E2E tests without redeploying controller
```

Tests use **Ginkgo v2** (BDD) with **Gomega** matchers. Coverage output: `cover.profile`.

## Linting and Formatting

```bash
make lint # golangci-lint (v2.8.0)
make fmt # Format Go code
make vet # Go vet
make spellcheck # Spell check markdown docs
make spellcheck-report # Spell check with report output
make spellcheck-docker # Spell check via Docker (platform-agnostic)
make spellcheck-format-spelling # Sort and deduplicate .spelling file
```

## Local Development

```bash
make lima-all # Start local k3s cluster with controller
make lima-start # Start lima cluster
make lima-stop # Stop and delete lima cluster
make lima-redeploy # Rebuild and redeploy to local cluster
make lima-install # Install CRDs and controller into lima cluster
make lima-uninstall # Uninstall CRDs and controller from lima cluster
make lima-restart # Restart chaos-controller pod
make lima-push-all # Push all images to lima cluster
make lima-push-injector # Build and push injector image to lima
make lima-push-handler # Build and push handler image to lima
make lima-push-manager # Build and push manager image to lima
make lima-install-cert-manager # Install cert-manager into cluster
make lima-install-datadog-agent # Install Datadog agent into cluster
make lima-install-demo # Install demo workloads (curl + nginx)
make lima-install-longhorn # Install Longhorn StorageClass for disk throttling
make lima-kubectx # Configure kubectl context for lima
make lima-kubectx-clean # Remove lima references from kubectl config
make minikube-load-all # Load all images into minikube
make watch # Auto-rebuild on file changes
make debug # Prepare for IDE debugging
make run # Run controller locally
```

## CI

```bash
make ci-install-minikube # Install and start minikube for CI
make venv # Create Python virtual environment
make install-datadog-ci # Install datadog-ci binary
```

## Tool Installation

```bash
make install-golangci-lint # Install golangci-lint
make install-controller-gen # Install controller-gen
make install-mockery # Install mockery
make install-helm # Install Helm
make install-protobuf # Install protoc
make install-kubebuilder # Install kubebuilder + setup-envtest
make install-yamlfmt # Install yamlfmt
make install-watchexec # Install watchexec (via brew)
make install-go # Install Go (version from Makefile)
```

## Architecture

Three main components, each with its own Dockerfile in `bin/`:

- **Manager** (`main.go`, `controllers/`): Long-running controller pod. Watches Disruption CRDs, selects targets via label selectors, creates chaos pods, manages lifecycle with finalizers. Reconciliation flow: add finalizer → compute spec hash → select targets → create chaos pods → track injection status.
- **Injector** (`injector/`, `cli/injector/`): Runs as ephemeral chaos pods on target nodes. Performs actual disruption using Linux primitives (cgroups, tc, iptables, eBPF). One chaos pod per target per disruption kind.
- **Handler** (`webhook/`, `cli/handler/`): Admission webhook for pod initialization-time network disruptions.
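
The reconciliation flow listed for the Manager can be pictured with a small, self-contained sketch. Everything in it (the `Disruption` and `Pod` structs, the finalizer name, the `chaos-<target>-<kind>` naming, the choice of SHA-256 over JSON for the spec hash) is a hypothetical stand-in chosen for illustration, not the controller's actual types or API. The status-tracking step is left out.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// Disruption and Pod are hypothetical, simplified stand-ins for the real
// CRD and Kubernetes types.
type Disruption struct {
	Finalizers []string
	Selector   map[string]string // label selector
	Kinds      []string          // e.g. "cpu-pressure", "network"
	SpecHash   string
}

type Pod struct {
	Name   string
	Labels map[string]string
}

const finalizer = "example.com/chaos" // assumed name, for illustration only

// specHash hashes the fields that define the disruption. json.Marshal sorts
// map keys, so the result is stable across calls.
func specHash(d *Disruption) (string, error) {
	raw, err := json.Marshal(struct {
		Selector map[string]string
		Kinds    []string
	}{d.Selector, d.Kinds})
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(raw)
	return hex.EncodeToString(sum[:]), nil
}

// matches reports whether the pod carries every label in the selector.
func matches(p Pod, selector map[string]string) bool {
	for k, v := range selector {
		if p.Labels[k] != v {
			return false
		}
	}
	return true
}

// reconcile walks the first four steps of the flow and returns the names of
// the chaos pods that would be created.
func reconcile(d *Disruption, pods []Pod) ([]string, error) {
	// Step 1: add the finalizer if it is not already present.
	present := false
	for _, f := range d.Finalizers {
		if f == finalizer {
			present = true
		}
	}
	if !present {
		d.Finalizers = append(d.Finalizers, finalizer)
	}

	// Step 2: compute the spec hash.
	h, err := specHash(d)
	if err != nil {
		return nil, err
	}
	d.SpecHash = h

	// Steps 3 and 4: select targets, then plan one chaos pod per target per kind.
	var chaosPods []string
	for _, p := range pods {
		if !matches(p, d.Selector) {
			continue
		}
		for _, kind := range d.Kinds {
			chaosPods = append(chaosPods, "chaos-"+p.Name+"-"+kind)
		}
	}
	return chaosPods, nil
}

func main() {
	d := &Disruption{
		Selector: map[string]string{"app": "demo"},
		Kinds:    []string{"cpu-pressure", "network"},
	}
	pods := []Pod{
		{Name: "a", Labels: map[string]string{"app": "demo"}},
		{Name: "b", Labels: map[string]string{"app": "other"}},
	}
	names, err := reconcile(d, pods)
	if err != nil {
		panic(err)
	}
	fmt.Println(names, d.SpecHash[:8])
}
```

Calling `reconcile` a second time on the same object leaves the finalizer list and the hash unchanged, which is the idempotence a reconciliation loop generally relies on.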

### CRDs (api/v1beta1/)

- **Disruption**: Main resource defining what failure to inject and targeting criteria
- **DisruptionCron**: Scheduled/recurring disruptions
- **DisruptionRollout**: Progressive disruption rollout

### Key Packages

- `controllers/` — Reconciliation controllers for Disruption, DisruptionCron, and DisruptionRollout CRDs
- `targetselector/` — Target selection logic (labels, count, filters, safety nets)
- `safemode/` — Safety mechanisms to prevent dangerous disruptions
- `eventnotifier/` — Notifications (Slack, Datadog, HTTP)
- `o11y/` — Observability (metrics, tracing, profiling for Datadog and Prometheus)
- `cloudservice/` — Cloud provider integrations
- `ebpf/` — eBPF programs for network disruption
- `grpc/disruptionlistener/` — gRPC service for disruption events
- `chart/` — Helm chart for deployment

### Code Generation

CRDs are defined in `api/v1beta1/` with kubebuilder markers. After modifying types, run `make manifests generate`. Mocks are generated with mockery into `mocks/`. Protobuf definitions live in `grpc/` and `dogfood/`.

## Requirements

- Kubernetes >= 1.16 (not 1.20.0-1.20.4)
- Go 1.25.6
- Docker with buildx (multi-arch: amd64, arm64)
15 changes: 0 additions & 15 deletions LICENSE-3rdparty.csv
@@ -414,8 +414,6 @@ github.com/miekg/dns,github.com/miekg/dns,BSD-3-Clause
github.com/mitchellh/go-homedir,github.com/mitchellh/go-homedir,MIT
github.com/moby/docker-image-spec,github.com/moby/docker-image-spec/specs-go/v1,Apache-2.0
github.com/moby/locker,github.com/moby/locker,Apache-2.0
github.com/moby/spdystream,github.com/moby/spdystream,Apache-2.0
github.com/moby/spdystream,github.com/moby/spdystream/spdy,Apache-2.0
github.com/moby/sys/mountinfo,github.com/moby/sys/mountinfo,Apache-2.0
github.com/moby/sys/sequential,github.com/moby/sys/sequential,Apache-2.0
github.com/moby/sys/signal,github.com/moby/sys/signal,Apache-2.0
@@ -426,7 +424,6 @@ github.com/moby/term,github.com/moby/term/windows,Apache-2.0
github.com/modern-go/concurrent,github.com/modern-go/concurrent,Apache-2.0
github.com/modern-go/reflect2,github.com/modern-go/reflect2,Apache-2.0
github.com/munnerz/goautoneg,github.com/munnerz/goautoneg,BSD-3-Clause
github.com/mxk/go-flowrate,github.com/mxk/go-flowrate/flowrate,BSD-3-Clause
github.com/onsi/ginkgo/v2,github.com/onsi/ginkgo/v2,MIT
github.com/onsi/ginkgo/v2,github.com/onsi/ginkgo/v2/config,MIT
github.com/onsi/ginkgo/v2,github.com/onsi/ginkgo/v2/formatter,MIT
@@ -662,7 +659,6 @@ golang.org/x/net,golang.org/x/net/ipv4,BSD-3-Clause
golang.org/x/net,golang.org/x/net/ipv6,BSD-3-Clause
golang.org/x/net,golang.org/x/net/proxy,BSD-3-Clause
golang.org/x/net,golang.org/x/net/trace,BSD-3-Clause
golang.org/x/net,golang.org/x/net/websocket,BSD-3-Clause
golang.org/x/oauth2,golang.org/x/oauth2,BSD-3-Clause
golang.org/x/oauth2,golang.org/x/oauth2/internal,BSD-3-Clause
golang.org/x/sync,golang.org/x/sync/errgroup,BSD-3-Clause
@@ -942,20 +938,14 @@ k8s.io/apimachinery,k8s.io/apimachinery/pkg/util/dump,Apache-2.0
k8s.io/apimachinery,k8s.io/apimachinery/pkg/util/duration,Apache-2.0
k8s.io/apimachinery,k8s.io/apimachinery/pkg/util/errors,Apache-2.0
k8s.io/apimachinery,k8s.io/apimachinery/pkg/util/framer,Apache-2.0
k8s.io/apimachinery,k8s.io/apimachinery/pkg/util/httpstream,Apache-2.0
k8s.io/apimachinery,k8s.io/apimachinery/pkg/util/httpstream/spdy,Apache-2.0
k8s.io/apimachinery,k8s.io/apimachinery/pkg/util/httpstream/wsstream,Apache-2.0
k8s.io/apimachinery,k8s.io/apimachinery/pkg/util/intstr,Apache-2.0
k8s.io/apimachinery,k8s.io/apimachinery/pkg/util/json,Apache-2.0
k8s.io/apimachinery,k8s.io/apimachinery/pkg/util/managedfields,Apache-2.0
k8s.io/apimachinery,k8s.io/apimachinery/pkg/util/managedfields/internal,Apache-2.0
k8s.io/apimachinery,k8s.io/apimachinery/pkg/util/mergepatch,Apache-2.0
k8s.io/apimachinery,k8s.io/apimachinery/pkg/util/naming,Apache-2.0
k8s.io/apimachinery,k8s.io/apimachinery/pkg/util/net,Apache-2.0
k8s.io/apimachinery,k8s.io/apimachinery/pkg/util/portforward,Apache-2.0
k8s.io/apimachinery,k8s.io/apimachinery/pkg/util/proxy,Apache-2.0
k8s.io/apimachinery,k8s.io/apimachinery/pkg/util/rand,Apache-2.0
k8s.io/apimachinery,k8s.io/apimachinery/pkg/util/remotecommand,Apache-2.0
k8s.io/apimachinery,k8s.io/apimachinery/pkg/util/runtime,Apache-2.0
k8s.io/apimachinery,k8s.io/apimachinery/pkg/util/sets,Apache-2.0
k8s.io/apimachinery,k8s.io/apimachinery/pkg/util/strategicpatch,Apache-2.0
@@ -967,7 +957,6 @@ k8s.io/apimachinery,k8s.io/apimachinery/pkg/util/yaml,Apache-2.0
k8s.io/apimachinery,k8s.io/apimachinery/pkg/version,Apache-2.0
k8s.io/apimachinery,k8s.io/apimachinery/pkg/watch,Apache-2.0
k8s.io/apimachinery,k8s.io/apimachinery/third_party/forked/golang/json,Apache-2.0
k8s.io/apimachinery,k8s.io/apimachinery/third_party/forked/golang/netutil,Apache-2.0
k8s.io/apimachinery,k8s.io/apimachinery/third_party/forked/golang/reflect,Apache-2.0
k8s.io/cli-runtime,k8s.io/cli-runtime/pkg/printers,Apache-2.0
k8s.io/client-go,k8s.io/client-go/applyconfigurations,Apache-2.0
@@ -1295,15 +1284,11 @@ k8s.io/client-go,k8s.io/client-go/tools/pager,Apache-2.0
k8s.io/client-go,k8s.io/client-go/tools/record,Apache-2.0
k8s.io/client-go,k8s.io/client-go/tools/record/util,Apache-2.0
k8s.io/client-go,k8s.io/client-go/tools/reference,Apache-2.0
k8s.io/client-go,k8s.io/client-go/tools/remotecommand,Apache-2.0
k8s.io/client-go,k8s.io/client-go/transport,Apache-2.0
k8s.io/client-go,k8s.io/client-go/transport/spdy,Apache-2.0
k8s.io/client-go,k8s.io/client-go/transport/websocket,Apache-2.0
k8s.io/client-go,k8s.io/client-go/util/apply,Apache-2.0
k8s.io/client-go,k8s.io/client-go/util/cert,Apache-2.0
k8s.io/client-go,k8s.io/client-go/util/connrotation,Apache-2.0
k8s.io/client-go,k8s.io/client-go/util/consistencydetector,Apache-2.0
k8s.io/client-go,k8s.io/client-go/util/exec,Apache-2.0
k8s.io/client-go,k8s.io/client-go/util/flowcontrol,Apache-2.0
k8s.io/client-go,k8s.io/client-go/util/homedir,Apache-2.0
k8s.io/client-go,k8s.io/client-go/util/jsonpath,Apache-2.0
28 changes: 23 additions & 5 deletions api/v1beta1/disruption_types.go
@@ -76,6 +76,8 @@ type DisruptionSpec struct {
// +nullable
CPUPressure *CPUPressureSpec `json:"cpuPressure,omitempty"`
// +nullable
MemoryPressure *MemoryPressureSpec `json:"memoryPressure,omitempty"`
// +nullable
DiskPressure *DiskPressureSpec `json:"diskPressure,omitempty"`
// +nullable
DiskFailure *DiskFailureSpec `json:"diskFailure,omitempty"`
@@ -694,25 +696,25 @@ func (s DisruptionSpec) validateGlobalDisruptionScope(requireSelectors bool) (re
}

// Rule: At least one disruption kind must be applied
- if s.CPUPressure == nil && s.DiskPressure == nil && s.DiskFailure == nil && s.Network == nil && s.GRPC == nil && s.DNS == nil && s.ContainerFailure == nil && s.NodeFailure == nil && s.PodReplacement == nil {
+ if s.CPUPressure == nil && s.MemoryPressure == nil && s.DiskPressure == nil && s.DiskFailure == nil && s.Network == nil && s.GRPC == nil && s.DNS == nil && s.ContainerFailure == nil && s.NodeFailure == nil && s.PodReplacement == nil {
retErr = multierror.Append(retErr, errors.New("at least one disruption kind must be specified, please read the docs to see your options"))
}

// Rule: ContainerFailure, NodeFailure, and PodReplacement disruptions are not compatible with other failure types
if s.ContainerFailure != nil {
- if s.CPUPressure != nil || s.DiskPressure != nil || s.DiskFailure != nil || s.Network != nil || s.GRPC != nil || s.DNS != nil || s.NodeFailure != nil || s.PodReplacement != nil {
+ if s.CPUPressure != nil || s.MemoryPressure != nil || s.DiskPressure != nil || s.DiskFailure != nil || s.Network != nil || s.GRPC != nil || s.DNS != nil || s.NodeFailure != nil || s.PodReplacement != nil {
retErr = multierror.Append(retErr, errors.New("container failure disruptions are not compatible with other disruption kinds. The container failure will remove the impact of the other disruption types"))
}
}

if s.NodeFailure != nil {
- if s.CPUPressure != nil || s.DiskPressure != nil || s.DiskFailure != nil || s.Network != nil || s.GRPC != nil || s.DNS != nil || s.ContainerFailure != nil || s.PodReplacement != nil {
+ if s.CPUPressure != nil || s.MemoryPressure != nil || s.DiskPressure != nil || s.DiskFailure != nil || s.Network != nil || s.GRPC != nil || s.DNS != nil || s.ContainerFailure != nil || s.PodReplacement != nil {
retErr = multierror.Append(retErr, errors.New("node failure disruptions are not compatible with other disruption kinds. The node failure will remove the impact of the other disruption types"))
}
}

if s.PodReplacement != nil {
- if s.CPUPressure != nil || s.DiskPressure != nil || s.DiskFailure != nil || s.Network != nil || s.GRPC != nil || s.DNS != nil || s.ContainerFailure != nil || s.NodeFailure != nil {
+ if s.CPUPressure != nil || s.MemoryPressure != nil || s.DiskPressure != nil || s.DiskFailure != nil || s.Network != nil || s.GRPC != nil || s.DNS != nil || s.ContainerFailure != nil || s.NodeFailure != nil {
retErr = multierror.Append(retErr, errors.New("pod replacement disruptions are not compatible with other disruption kinds. The pod replacement will remove the impact of the other disruption types"))
}
// Rule: container failure not possible if disruption is node-level
@@ -724,6 +726,7 @@ func (s DisruptionSpec) validateGlobalDisruptionScope(requireSelectors bool) (re
// Rule: on init compatibility
if s.OnInit {
if s.CPUPressure != nil ||
s.MemoryPressure != nil ||
s.NodeFailure != nil ||
s.PodReplacement != nil ||
s.ContainerFailure != nil ||
@@ -747,6 +750,11 @@ func (s DisruptionSpec) validateGlobalDisruptionScope(requireSelectors bool) (re
retErr = multierror.Append(retErr, errors.New("disk pressure disruptions apply to all containers, specifying certain containers does not isolate the disruption"))
}

// Rule: No specificity of containers on a memory disruption
if len(s.Containers) != 0 && s.MemoryPressure != nil {
retErr = multierror.Append(retErr, errors.New("memory pressure disruptions apply to all containers, specifying certain containers does not isolate the disruption"))
}

// Rule: DisruptionTrigger
if s.Triggers != nil && !s.Triggers.IsZero() {
if !s.Triggers.Inject.IsZero() && !s.Triggers.CreatePods.IsZero() {
@@ -772,7 +780,7 @@ func (s DisruptionSpec) validateGlobalDisruptionScope(requireSelectors bool) (re
if s.Pulse != nil {
if s.Pulse.ActiveDuration.Duration() > 0 || s.Pulse.DormantDuration.Duration() > 0 {
if s.NodeFailure != nil || s.PodReplacement != nil || s.ContainerFailure != nil {
- retErr = multierror.Append(retErr, errors.New("pulse is only compatible with network, cpu pressure, disk pressure, dns, and grpc disruptions"))
+ retErr = multierror.Append(retErr, errors.New("pulse is only compatible with network, cpu pressure, memory pressure, disk pressure, dns, and grpc disruptions"))
}
}

@@ -824,6 +832,8 @@ func (s DisruptionSpec) DisruptionKindPicker(kind chaostypes.DisruptionKindName)
disruptionKind = s.Network
case chaostypes.DisruptionKindCPUPressure:
disruptionKind = s.CPUPressure
case chaostypes.DisruptionKindMemoryPressure:
disruptionKind = s.MemoryPressure
case chaostypes.DisruptionKindDiskPressure:
disruptionKind = s.DiskPressure
case chaostypes.DisruptionKindGRPCDisruption:
@@ -888,6 +898,10 @@ func (s DisruptionSpec) DisruptionCount() int {
count++
}

if s.MemoryPressure != nil {
count++
}

if s.ContainerFailure != nil {
count++
}
@@ -1060,6 +1074,10 @@ func (s DisruptionSpec) Explain() []string {
explanation = append(explanation, s.CPUPressure.Explain()...)
}

if s.MemoryPressure != nil {
explanation = append(explanation, s.MemoryPressure.Explain()...)
}

if s.DiskPressure != nil {
explanation = append(explanation, s.DiskPressure.Explain()...)
}
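
The compatibility rules that this diff extends to memory pressure can be summarised in a reduced, standalone sketch. It assumes a hypothetical `spec` struct with boolean fields in place of the real pointer-typed `DisruptionSpec`, covers only a subset of the disruption kinds, and swaps `go-multierror` for the standard library's `errors.Join`. It illustrates the shape of the checks and is not the repository's implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// spec is a hypothetical, reduced stand-in for DisruptionSpec. Each boolean
// is true when the corresponding disruption kind is set.
type spec struct {
	CPUPressure      bool
	MemoryPressure   bool
	DiskPressure     bool
	Network          bool
	ContainerFailure bool
	NodeFailure      bool
	PodReplacement   bool
	Containers       []string
}

// validate collects every rule violation instead of stopping at the first,
// as the original code does with multierror.Append.
func validate(s spec) error {
	var errs []error

	// Kinds that can be combined with each other.
	combinable := s.CPUPressure || s.MemoryPressure || s.DiskPressure || s.Network

	// Rule: at least one disruption kind must be applied.
	if !combinable && !s.ContainerFailure && !s.NodeFailure && !s.PodReplacement {
		errs = append(errs, errors.New("at least one disruption kind must be specified"))
	}

	// Rule: failure-style kinds are not compatible with any other kind.
	if s.ContainerFailure && (combinable || s.NodeFailure || s.PodReplacement) {
		errs = append(errs, errors.New("container failure is not compatible with other disruption kinds"))
	}
	if s.NodeFailure && (combinable || s.ContainerFailure || s.PodReplacement) {
		errs = append(errs, errors.New("node failure is not compatible with other disruption kinds"))
	}
	if s.PodReplacement && (combinable || s.ContainerFailure || s.NodeFailure) {
		errs = append(errs, errors.New("pod replacement is not compatible with other disruption kinds"))
	}

	// Rule: memory pressure applies to all containers.
	if len(s.Containers) != 0 && s.MemoryPressure {
		errs = append(errs, errors.New("memory pressure disruptions apply to all containers"))
	}

	// errors.Join returns nil when there is nothing to join.
	return errors.Join(errs...)
}

func main() {
	fmt.Println(validate(spec{MemoryPressure: true, CPUPressure: true}))
	fmt.Println(validate(spec{MemoryPressure: true, NodeFailure: true}))
}
```

A detail the sketch makes visible: every exclusive kind has to list the new field in its own check, which is why the diff touches the container failure, node failure, and pod replacement conditions separately.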
78 changes: 78 additions & 0 deletions api/v1beta1/memory_pressure.go
@@ -0,0 +1,78 @@
// Unless explicitly stated otherwise all files in this repository are licensed
// under the Apache License Version 2.0.
// This product includes software developed at Datadog (https://www.datadoghq.com/).
// Copyright 2026 Datadog, Inc.

package v1beta1

import (
"fmt"
"strconv"
"strings"

"github.com/hashicorp/go-multierror"
)

// MemoryPressureSpec represents a memory pressure disruption
type MemoryPressureSpec struct {
// Target memory utilization as a percentage (e.g., "76%")
// +kubebuilder:validation:Required
TargetPercent string `json:"targetPercent" chaos_validate:"required"`
// Duration over which memory is gradually consumed (e.g., "10m")
// If empty, memory is consumed immediately
RampDuration DisruptionDuration `json:"rampDuration,omitempty"`
}

// Validate validates args for the given disruption
func (s *MemoryPressureSpec) Validate() (retErr error) {
// Rule: targetPercent must be a valid percentage between 1 and 100
pct, err := ParseTargetPercent(s.TargetPercent)
if err != nil {
retErr = multierror.Append(retErr, fmt.Errorf("invalid targetPercent %q: %w", s.TargetPercent, err))
} else if pct < 1 || pct > 100 {
retErr = multierror.Append(retErr, fmt.Errorf("targetPercent must be between 1 and 100, got %d", pct))
}

// Rule: rampDuration must be non-negative
if s.RampDuration.Duration() < 0 {
retErr = multierror.Append(retErr, fmt.Errorf("rampDuration must be non-negative, got %s", s.RampDuration))
}

return retErr
}

// GenerateArgs generates injection or cleanup pod arguments for the given spec
func (s *MemoryPressureSpec) GenerateArgs() []string {
args := []string{
"memory-pressure",
"--target-percent", s.TargetPercent,
}

if s.RampDuration.Duration() > 0 {
args = append(args, "--ramp-duration", s.RampDuration.Duration().String())
}

return args
}

// Explain returns a human-readable description of the memory pressure disruption
func (s *MemoryPressureSpec) Explain() []string {
pct, _ := ParseTargetPercent(s.TargetPercent)

explanation := fmt.Sprintf("spec.memoryPressure will cause memory pressure on the target, by joining its cgroup and allocating memory to reach %d%% of the target's memory limit", pct)

if s.RampDuration.Duration() > 0 {
explanation += fmt.Sprintf(", ramping up over %s.", s.RampDuration.Duration())
} else {
explanation += " immediately."
}

return []string{"", explanation}
}

// ParseTargetPercent parses a percentage string like "76%" or "76" and returns the integer value
func ParseTargetPercent(s string) (int, error) {
s = strings.TrimSpace(s)
s = strings.TrimSuffix(s, "%")

return strconv.Atoi(s)
}
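
To see how the parsing and range rules behave on edge cases, the following standalone sketch copies the logic of `ParseTargetPercent`, `Validate`, and `GenerateArgs` into lowercase helper functions. The helper names are hypothetical, and a plain `time.Duration` stands in for `DisruptionDuration`, whose definition is not part of this diff.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// parseTargetPercent mirrors ParseTargetPercent: trim surrounding
// whitespace, drop one trailing "%", then parse an integer.
func parseTargetPercent(s string) (int, error) {
	s = strings.TrimSpace(s)
	s = strings.TrimSuffix(s, "%")

	return strconv.Atoi(s)
}

// validateTargetPercent applies the same 1..100 range rule as Validate.
func validateTargetPercent(s string) error {
	pct, err := parseTargetPercent(s)
	if err != nil {
		return fmt.Errorf("invalid targetPercent %q: %w", s, err)
	}

	if pct < 1 || pct > 100 {
		return fmt.Errorf("targetPercent must be between 1 and 100, got %d", pct)
	}

	return nil
}

// generateArgs mirrors GenerateArgs, with time.Duration as an assumed
// stand-in for DisruptionDuration.
func generateArgs(targetPercent string, ramp time.Duration) []string {
	args := []string{"memory-pressure", "--target-percent", targetPercent}

	if ramp > 0 {
		args = append(args, "--ramp-duration", ramp.String())
	}

	return args
}

func main() {
	inputs := []string{"76%", "76", " 50% ", "0%", "150%", "7.5%", "76 %"}
	for _, in := range inputs {
		fmt.Printf("%-8q valid=%t\n", in, validateTargetPercent(in) == nil)
	}

	fmt.Println(generateArgs("76%", 10*time.Minute))
	fmt.Println(generateArgs("76%", 0))
}
```

Two behaviours follow from the code as written. Whitespace is trimmed before the `%` suffix is removed, so `"76 %"` fails to parse while `" 50% "` succeeds. Fractional values such as `"7.5%"` are rejected because `strconv.Atoi` accepts integers only. In this sketch a ramp of ten minutes is rendered by `time.Duration.String` as `10m0s`.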