hack/multicluster: add failure-injection repro scripts #1479
Draft
Four standalone scripts under hack/multicluster/repro/ that reproduce
specific resilience bugs in the multicluster operator by injecting
network faults via Pumba and reading observable effects from operator
logs / reconcile durations.
Coverage:
- finding-01-dosend-stall.sh: raft DoSend runs synchronously in the
  Ready loop; leader churn under asymmetric silent drop
  (pkg/multicluster/leaderelection)
- finding-02-03-admin-unbounded.sh: admin-client ClientTimeout is
  dropped on the StretchCluster path, and reconcile has no
  context.WithTimeout
- finding-04-serial-fanout.sh: syncStatus / setupLicense / Phase-1
  scans iterate clusters serially with no reachability check
- finding-05-probe-flap.sh: health probe has no hysteresis at marginal
  RTT; SpecSynced flaps
Each script auto-detects the local vcluster-on-k3d dev env (set up via
task dev:setup-multicluster-dev-env) when run with no arguments, or
accepts --contexts=ctxA,ctxB,ctxC to target a real multi-cluster setup.
Common flags (--namespace, --operator-service, --containerd-sock, etc.)
let the scripts run against any operator deployment that follows the
chart conventions.
A README.md in the directory documents requirements, modes, flags, and
how to interpret the PASS/FAIL verdict.
These are not integration tests: line-level regression coverage for
each finding should live in unit tests alongside the fix. The scripts
are meant to catch each class of bug end-to-end and to drive
customer-facing demos of multi-cloud resilience.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
RafalKorepta approved these changes on Apr 23, 2026
This PR is stale because it has been open 5 days with no activity. Remove stale label or comment or this will be closed in 5 days.