operator: fix SASL bootstrap user password drift after Secret rotation by david-yu · Pull Request #1465 · redpanda-data/redpanda-operator

david-yu · 2026-04-20T17:39:04Z

Summary

Fixes a SASL bootstrap-user password-drift bug surfaced by users migrating off the legacy Helm flow. Implements option (b) from the reproducer writeup: the sidecar configwatcher now mirrors the bootstrap user Secret's password into Redpanda's SCRAM DB on every sync, so a rotated bootstrap Secret propagates into the running cluster and rpk keeps authenticating after the next pod restart.

The bug

1. A Redpanda cluster runs with SASL enabled and an operator-owned <fullname>-bootstrap-user Secret.
2. The Secret is deleted, e.g. while cleaning up a Helm-era artifact and expecting the operator to take clean ownership.
3. charts/redpanda/render_state.go:FetchBootstrapUser / operator/multicluster/secrets.go:secretBootstrapUser treat the missing Secret as "first-time bootstrap" and write a new one with a fresh helmette.RandAlphaNum(32) password.
4. The running Redpanda retained the original password in its internal SCRAM DB because operator/internal/configwatcher/configwatcher.go passed recreate=false when syncing the internal superuser ("the internal user should only ever be created once, so don't update its password ever").
5. On the next pod restart, the Pod's RPK_USER / RPK_PASS env vars were re-materialized from the new Secret, but Redpanda still rejected them:
```
SASL_AUTHENTICATION_FAILED: Invalid credentials
```
The original password still worked — confirming the drift.
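
The drift described above can be reproduced in miniature. The following is a minimal sketch, not the operator's code: `driftCluster`, `syncUserNoRecreate`, and the user name are hypothetical stand-ins, and the in-memory map only approximates Redpanda's SCRAM DB. It assumes the pre-fix path simply swallows an "already exists" response.

```go
package main

import (
	"errors"
	"fmt"
)

// errDriftUserExists stands in for the admin API's "already exists" response (assumed shape).
var errDriftUserExists = errors.New("user already exists")

// driftCluster is a hypothetical in-memory stand-in for Redpanda's SCRAM DB.
type driftCluster struct {
	scram map[string]string // username -> password
}

func (c *driftCluster) createUser(name, password string) error {
	if _, ok := c.scram[name]; ok {
		return errDriftUserExists
	}
	c.scram[name] = password
	return nil
}

func (c *driftCluster) authenticate(name, password string) bool {
	stored, ok := c.scram[name]
	return ok && stored == password
}

// syncUserNoRecreate models the pre-fix path with recreate=false:
// an existing user is left untouched, so its password is never updated.
func syncUserNoRecreate(c *driftCluster, name, password string) error {
	err := c.createUser(name, password)
	if errors.Is(err, errDriftUserExists) {
		return nil
	}
	return err
}

func main() {
	c := &driftCluster{scram: map[string]string{}}
	const user = "bootstrap-user" // illustrative name only

	_ = syncUserNoRecreate(c, user, "P1") // first bootstrap stores P1
	_ = syncUserNoRecreate(c, user, "P2") // Secret regenerated with P2, pod restarted

	fmt.Println("new password accepted:", c.authenticate(user, "P2"))
	fmt.Println("original password accepted:", c.authenticate(user, "P1"))
}
```

In this model the regenerated password is rejected while the original one is still accepted, which matches the symptom reported in the steps above.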

The fix

operator/internal/configwatcher/configwatcher.go:

- New syncInternalUser helper that calls CreateUser and, on "already exists", falls through to UpdateUser(user, password, mechanism). It never falls back to delete-and-recreate, because dropping the internal superuser even briefly could strand the operator.
- SyncUsers routes the internal superuser through syncInternalUser instead of syncUser(..., recreate=false).
- Event-driven, not polling. SyncUsers runs only at pod start (syncInitial) and when fsnotify observes a change to the mounted Secret volume. There is no timer or reconcile tick. In steady state, UpdateUser fires once per pod lifetime (at boot); a Secret rotation adds exactly one more call when kubelet swaps the volume. See configwatcher.go:109-191: Start → syncInitial (one-shot), then watchFilesystem, a pure fsnotify.Watcher select loop.
- UpdateUser against Redpanda is a no-op when the password matches, so even the boot-time call is effectively free on the cluster side.

Rotation flow after the fix:

1. User deletes the bootstrap Secret; operator regenerates with P2.
2. K8s updates the mounted Secret file on every pod; fsnotify fires → SyncUsers runs.
3. Each pod's env still holds P1 (env vars aren't re-read mid-container), so syncInternalUser re-asserts P1; this is idempotent and Redpanda's SCRAM DB is unchanged.
4. Each pod eventually restarts (rolling or manual); the new env exposes P2; syncInternalUser drives UpdateUser → Redpanda's SCRAM DB is now P2.
5. rpk in the restarted pod authenticates with P2 successfully.
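
The rotation flow above can be traced with a small model. This is only a sketch of the sequence as the description states it: `rotationPod`, `startRotationPod`, `syncUsers`, and the user name are hypothetical, the SCRAM DB is a plain map, and the model assumes the sync always succeeds (it does not enforce authentication on the update).

```go
package main

import "fmt"

// rotationPod captures the password its env vars held when the container started.
type rotationPod struct{ envPass string }

// startRotationPod models a (re)start: env vars are materialized from the Secret once.
func startRotationPod(secret string) rotationPod { return rotationPod{envPass: secret} }

// syncUsers models create-or-update of the internal user from the pod's env.
func (p rotationPod) syncUsers(scramDB map[string]string, user string) {
	scramDB[user] = p.envPass
}

// simulateRotation returns the SCRAM DB password observed after each step.
func simulateRotation() []string {
	const user = "internal-user" // illustrative name only
	scramDB := map[string]string{}
	var trace []string
	record := func(step string) { trace = append(trace, step+": "+scramDB[user]) }

	secret := "P1"
	pod := startRotationPod(secret)
	pod.syncUsers(scramDB, user) // one-shot sync at boot
	record("boot")

	secret = "P2"                // Secret deleted and regenerated by the operator
	pod.syncUsers(scramDB, user) // fsnotify event; env still holds P1
	record("fsnotify")

	pod = startRotationPod(secret) // restart re-reads the Secret into env
	pod.syncUsers(scramDB, user)
	record("restart")
	return trace
}

func main() {
	for _, line := range simulateRotation() {
		fmt.Println(line)
	}
}
```

The trace shows that, in this model, the password only changes at the restart step, because the filesystem event alone does not change what the running container's env holds.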

Changes

- operator/internal/configwatcher/configwatcher.go: the fix (new syncInternalUser helper, SyncUsers wiring, comment update).
- operator/internal/configwatcher/configwatcher_test.go: extends the existing testcontainer-based test with a rotation case (flip RPK_PASS, re-run SyncUsers, assert the original password no longer authenticates and the rotated password does). Also enables redpanda.WithAdminAPIAuthentication() so the admin API actually enforces basic auth and the assertion discriminates.
- acceptance/features/bootstrap-user.feature + acceptance/steps/bootstrap_user.go + acceptance/steps/register.go: end-to-end regression (install SASL cluster, delete the operator-owned bootstrap Secret, wait for regeneration with a different password, restart all pods, assert rpk still works).
- .changes/unreleased/operator-Fixed-20260420-150000.yaml: changelog entry under the operator project as Fixed.

Test plan

- go test ./operator/internal/configwatcher/...: rotation subtest passes (validated locally against a Rancher Desktop Docker socket).
- task test:acceptance with the bootstrap-user.feature scenario: expected to pass with the fix and to fail on main.
- No other scenarios are affected: the new acceptance scenario is tagged @skip:gke @skip:aks @skip:eks (local-only) and uses a dedicated cluster name (bootstrap-regen).

🤖 Generated with Claude Code

Adds an acceptance scenario that fails on current main: after the bootstrap user Secret is deleted, the operator regenerates it with a fresh random password while the running Redpanda cluster still holds the original password in its internal SCRAM DB, so rpk (and any other consumer of the new Secret) fails SASL auth after the next pod restart.

The drift lives at the intersection of two deliberate choices:

* charts/redpanda/render_state.go:FetchBootstrapUser and operator/multicluster/secrets.go:secretBootstrapUser treat a missing default-named Secret as "first-time bootstrap" and generate a fresh random password (helmette.RandAlphaNum(32)), with no signal that the cluster has already been bootstrapped.
* operator/internal/configwatcher/configwatcher.go explicitly passes recreate=false when syncing the internal superuser, so Redpanda's SCRAM DB is never rewritten after the initial bootstrap.

Either (a) preserving the original password or (b) calling AlterUserSCRAMs when the Secret rotates would close the gap. This commit is purely a reproducer; no production code is touched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The sidecar configwatcher used to call syncUser for the internal superuser with recreate=false, explicitly never updating the password once the user existed in Redpanda. That left a silent drift whenever the bootstrap user Secret was rotated (e.g. the operator regenerating a deleted Secret with a fresh random password): Redpanda kept the original SCRAM credential while consumers of the new Secret failed SASL auth with Invalid credentials after the next pod restart.

Add a dedicated syncInternalUser helper that, on CreateUser returning "already exists", drives UpdateUser against the admin API so the running cluster picks up whatever password the mounted Secret now holds. UpdateUser is idempotent against Redpanda, so this is safe to invoke on every sync. Unlike the regular syncUser path, this helper never falls back to delete-and-recreate, since dropping the internal superuser even briefly could strand the operator.

Extend the testcontainer-based TestConfigWatcher with a rotation scenario that verifies the new behavior via a Kafka SASL handshake: the original password must fail authentication after rotation and the rotated password must succeed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…gelog

Review pass: the fix drives the rpadmin HTTP admin API's UpdateUser, not the Kafka protocol's AlterUserSCRAMs. Tighten comments and the changelog entry accordingly. Also fix the kafkaSASLHandshake test helper comment, which claimed a Metadata request when it actually pings each seed broker.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

No behavior change: just tighten the wording from "on every sync" to "at pod start and on fsnotify events when the mounted Secret changes, not on a timer" so readers don't assume this introduces continuous polling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

david-yu · 2026-04-20T19:35:51Z

Still needs further discussion with eng before moving to review.

david-yu · 2026-04-20T20:49:26Z

Closing as per @andrewstucki's advice.

Currently you can work around it manually by running the rpk command with the old password on the cluster to update it to the generated password. But really, we can't support changing the bootstrap password without a bunch of redesign: it's meant to be long-lived, and once auth is enabled/enforced on the cluster, changing the password requires the old password too (i.e. if going from A --> B as the user account being authed, you have to know both A and B; we can't know that, because we only have the one secret field that exists).

Any sort of redesign to support this would need to take into account being able to specify 2 passwords for the user we leverage: the old password and the new password. Fundamentally, just saying "use password B" isn't going to work unless we know password A too, which is why regenerating the password isn't supported right now.

david-yu · 2026-04-20T21:03:46Z

Related to issue redpanda-data/helm-charts#1596

david-yu and others added 2 commits April 20, 2026 10:38

operator: add changelog entry for bootstrap-user password drift fix

02d5b21

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

david-yu changed the title ~~acceptance: reproducer — bootstrap user password drift after Secret delete~~ operator: fix SASL bootstrap user password drift after Secret rotation Apr 20, 2026

david-yu and others added 4 commits April 20, 2026 10:57

operator: drop "not on a timer" qualifier from changelog

ce47318

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

david-yu closed this Apr 20, 2026

david-yu mentioned this pull request Apr 20, 2026

Document manual bootstrap user password resync after Helm-to-Operator migration redpanda-data/docs#1674

Merged

3 tasks

david-yu wants to merge 6 commits into main from bug/bootstrap-user-password-drift
