hack/multicluster: add failure-injection repro scripts#1479

Draft
hidalgopl wants to merge 1 commit into main from worktree-pb+failure-modes

Conversation

@hidalgopl
Contributor

Four standalone scripts under hack/multicluster/repro/ that reproduce specific resilience bugs in the multicluster operator by injecting network faults via Pumba and reading observable effects from operator logs / reconcile durations.

Coverage:

  • finding-01-dosend-stall.sh: raft DoSend synchronous in Ready loop; leader churn under asymmetric silent drop (pkg/multicluster/leaderelection)
  • finding-02-03-admin-unbounded.sh: admin-client ClientTimeout dropped on StretchCluster path + reconcile has no context.WithTimeout
  • finding-04-serial-fanout.sh: syncStatus / setupLicense / Phase-1 scans iterate clusters serially with no reachability check
  • finding-05-probe-flap.sh: health probe has no hysteresis at marginal RTT; SpecSynced flaps

Each script auto-detects the local vcluster-on-k3d dev env (set up via task dev:setup-multicluster-dev-env) when run with no arguments, or accepts --contexts=ctxA,ctxB,ctxC to target a real multi-cluster setup. Common flags (--namespace, --operator-service, --containerd-sock, etc.) let the scripts run against any operator deployment that follows the chart conventions.

A README.md in the directory documents requirements, modes, flags, and how to interpret the PASS/FAIL verdict.

These are not integration tests: line-level regression coverage for each finding should live in unit tests alongside the fix. The scripts are meant to catch the class of bug end-to-end, and to drive customer-facing demos of multi-cloud resilience.
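
As an illustration of logic small enough for that unit-level coverage, here is a hedged Go sketch of the hysteresis that finding 05 reports as missing: a health state that flips only after several consecutive contrary probe results. The `hysteresis` type and its threshold are assumptions made up for this example, not the operator's probe implementation.

```go
package main

import "fmt"

// hysteresis is a hypothetical health tracker: the reported state changes
// only after `threshold` consecutive observations that disagree with it.
type hysteresis struct {
	healthy   bool
	threshold int
	streak    int // consecutive observations contradicting `healthy`
}

// observe records one probe result and returns the (possibly updated) state.
func (h *hysteresis) observe(ok bool) bool {
	if ok == h.healthy {
		h.streak = 0
		return h.healthy
	}
	h.streak++
	if h.streak >= h.threshold {
		h.healthy = ok
		h.streak = 0
	}
	return h.healthy
}

func main() {
	h := &hysteresis{healthy: true, threshold: 3}
	flips := 0
	prev := h.healthy
	// Probe results alternating around the limit, as at marginal RTT.
	for _, ok := range []bool{true, false, true, false, false, true, false} {
		cur := h.observe(ok)
		if cur != prev {
			flips++
		}
		prev = cur
	}
	fmt.Println("flips:", flips) // prints "flips: 0"
}
```

A probe that reports each raw result directly would change state on nearly every sample in this sequence; requiring three consecutive failures keeps the state steady.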

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

This PR is stale because it has been open 5 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions Bot added the stale label Apr 29, 2026