operator: fix SASL bootstrap user password drift after Secret rotation #1465
Adds an acceptance scenario that fails on current main: after the
bootstrap user Secret is deleted, the operator regenerates it with a
fresh random password while the running Redpanda cluster still holds
the original password in its internal SCRAM DB, so rpk (and any other
consumer of the new Secret) fails SASL auth after the next pod restart.
The drift lives at the intersection of two deliberate choices:
* charts/redpanda/render_state.go:FetchBootstrapUser and
  operator/multicluster/secrets.go:secretBootstrapUser treat a
  missing default-named Secret as "first-time bootstrap" and
  generate a fresh random password (helmette.RandAlphaNum(32)),
  with no signal that the cluster has already been bootstrapped.
* operator/internal/configwatcher/configwatcher.go explicitly
  passes recreate=false when syncing the internal superuser, so
  Redpanda's SCRAM DB is never rewritten after the initial bootstrap.
Either (a) preserving the original password or (b) calling
AlterUserSCRAMs when the Secret rotates would close the gap. This
commit is purely a reproducer — no production code is touched.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
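For reviewers, the drift described in this commit message can be modeled in a few lines of Go. This is a minimal sketch under stated assumptions, not the operator's code: secretStore, scramDB, fetchBootstrapPassword, syncNoRecreate and randAlphaNum are hypothetical stand-ins (the last one for helmette.RandAlphaNum), and the Secret and user names are made up.

```go
package main

import (
	"crypto/rand"
	"fmt"
	"math/big"
)

const alphaNum = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

// randAlphaNum is a hypothetical stand-in for helmette.RandAlphaNum.
func randAlphaNum(n int) string {
	out := make([]byte, n)
	for i := range out {
		idx, err := rand.Int(rand.Reader, big.NewInt(int64(len(alphaNum))))
		if err != nil {
			panic(err)
		}
		out[i] = alphaNum[idx.Int64()]
	}
	return string(out)
}

// secretStore models Kubernetes Secrets by name (assumption for illustration).
type secretStore map[string]string

// fetchBootstrapPassword returns the stored password, or treats a missing
// Secret as "first-time bootstrap" and generates a fresh one.
func fetchBootstrapPassword(s secretStore, name string) string {
	if pw, ok := s[name]; ok {
		return pw
	}
	pw := randAlphaNum(32)
	s[name] = pw
	return pw
}

// scramDB models the cluster's internal credential store.
type scramDB map[string]string

// syncNoRecreate mirrors the recreate=false behavior: the user is created
// once and its password is never rewritten afterwards.
func (db scramDB) syncNoRecreate(user, password string) {
	if _, exists := db[user]; !exists {
		db[user] = password
	}
}

func main() {
	secrets, cluster := secretStore{}, scramDB{}
	const secretName, user = "example-bootstrap-user", "example-superuser" // hypothetical names

	cluster.syncNoRecreate(user, fetchBootstrapPassword(secrets, secretName))

	delete(secrets, secretName) // the Secret is deleted and then regenerated
	cluster.syncNoRecreate(user, fetchBootstrapPassword(secrets, secretName))

	fmt.Println("secret matches cluster:", secrets[secretName] == cluster[user])
}
```

After the second sync the model's cluster still holds the first password while the Secret holds the regenerated one, which is the mismatch the acceptance scenario is meant to expose.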
The sidecar configwatcher used to call syncUser for the internal superuser with recreate=false, explicitly never updating the password once the user existed in Redpanda. That left a silent drift whenever the bootstrap user Secret was rotated (e.g. the operator regenerating a deleted Secret with a fresh random password): Redpanda kept the original SCRAM credential while consumers of the new Secret failed SASL auth with Invalid credentials after the next pod restart.

Add a dedicated syncInternalUser helper that, on CreateUser returning "already exists", drives UpdateUser against the admin API so the running cluster picks up whatever password the mounted Secret now holds. UpdateUser is idempotent against Redpanda, so this is safe to invoke on every sync. Unlike the regular syncUser path, this helper never falls back to delete-and-recreate: dropping the internal superuser even briefly could strand the operator.

Extend the testcontainer-based TestConfigWatcher with a rotation scenario that verifies the new behavior via a Kafka SASL handshake: the original password must fail authentication after rotation and the rotated password must succeed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
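The create-then-update behavior described in this commit message could look roughly like the following. This is a hedged sketch, not the code from the PR: the userAPI interface, the errAlreadyExists sentinel and the fakeAdmin type are assumptions made so the example runs on its own, and the real helper reportedly detects the "already exists" condition from the CreateUser error rather than from a sentinel value.

```go
package main

import (
	"errors"
	"fmt"
)

// errAlreadyExists is a hypothetical sentinel for the "already exists" case.
var errAlreadyExists = errors.New("user already exists")

// userAPI is a hypothetical, narrowed view of an admin API client.
type userAPI interface {
	CreateUser(user, password, mechanism string) error
	UpdateUser(user, password, mechanism string) error
}

// syncInternalUser creates the user, and if it already exists updates its
// password in place. It never deletes the user.
func syncInternalUser(api userAPI, user, password, mechanism string) error {
	err := api.CreateUser(user, password, mechanism)
	if err == nil {
		return nil
	}
	if !errors.Is(err, errAlreadyExists) {
		return fmt.Errorf("creating internal user %q: %w", user, err)
	}
	if err := api.UpdateUser(user, password, mechanism); err != nil {
		return fmt.Errorf("updating internal user %q: %w", user, err)
	}
	return nil
}

// fakeAdmin is an in-memory stand-in used only for this illustration.
type fakeAdmin struct {
	creds   map[string]string
	updates int
}

func (f *fakeAdmin) CreateUser(user, password, _ string) error {
	if _, ok := f.creds[user]; ok {
		return errAlreadyExists
	}
	f.creds[user] = password
	return nil
}

func (f *fakeAdmin) UpdateUser(user, password, _ string) error {
	if _, ok := f.creds[user]; !ok {
		return errors.New("user not found")
	}
	f.creds[user] = password
	f.updates++
	return nil
}

func main() {
	admin := &fakeAdmin{creds: map[string]string{}}
	for _, pw := range []string{"P1", "P2"} { // initial bootstrap, then rotation
		if err := syncInternalUser(admin, "example-superuser", pw, "SCRAM-SHA-256"); err != nil {
			panic(err)
		}
	}
	fmt.Println(admin.creds["example-superuser"]) // prints "P2"
}
```

Keeping the helper free of any delete call is the point of the design: the worst case is a failed update, never a window in which the internal superuser does not exist.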
…gelog

Review pass: the fix drives the rpadmin HTTP admin API's UpdateUser, not the Kafka protocol's AlterUserSCRAMs. Tighten comments and the changelog entry accordingly. Also fix the kafkaSASLHandshake test helper comment, which claimed a Metadata request when it actually pings each seed broker.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
No behavior change: just tighten the wording from "on every sync" to "at pod start and on fsnotify events when the mounted Secret changes, not on a timer" so readers don't assume this introduces continuous polling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
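To make the "not on a timer" wording concrete, here is a small sketch of the trigger model: one sync at start, then exactly one sync per change event. It is an illustration only. A plain channel stands in for the fsnotify watcher, and syncOnEvents is a hypothetical name, not a function from the configwatcher.

```go
package main

import "fmt"

// syncOnEvents runs sync once up front, then once per change event, and
// returns when the events channel is closed. There is no ticker or polling.
func syncOnEvents(events <-chan struct{}, sync func()) {
	sync() // one-shot initial sync at start
	for range events {
		sync() // one sync per observed change
	}
}

func main() {
	events := make(chan struct{}, 1)
	events <- struct{}{} // a single Secret rotation
	close(events)

	calls := 0
	syncOnEvents(events, func() { calls++ })
	fmt.Println("sync calls:", calls) // prints "sync calls: 2"
}
```

With no events at all, the sync function runs exactly once, which matches the steady-state behavior the wording change is meant to convey.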
Still needs further discussion with eng before moving to review.
Closing as per @andrewstucki's advice. As a manual workaround, you can currently run the rpk command against the cluster with the old password to update it to the generated password. Beyond that, we can't support changing the bootstrap password without significant redesign. The password is meant to be long-lived, and once auth is enabled and enforced on the cluster, changing it requires the old password as well: to go from A to B as the authenticated user account, you have to know both A and B. We can't know both, because only a single Secret field exists. Any redesign to support this would need to allow specifying two passwords for the user we leverage, the old one and the new one. Fundamentally, just saying "use password B" isn't going to work unless we also know password A, which is why regenerating the password isn't supported right now.
Related to issue redpanda-data/helm-charts#1596
Summary
Fixes a SASL bootstrap-user password-drift bug surfaced by users migrating off the legacy Helm flow. Implements option (b) from the reproducer writeup: the sidecar configwatcher now mirrors the bootstrap user Secret's password into Redpanda's SCRAM DB on every sync, so a rotated bootstrap Secret propagates into the running cluster and rpk keeps authenticating after the next pod restart.
The bug
* … <fullname>-bootstrap-user Secret.
* charts/redpanda/render_state.go:FetchBootstrapUser / operator/multicluster/secrets.go:secretBootstrapUser treat the missing Secret as "first-time bootstrap" and write a new one with a fresh helmette.RandAlphaNum(32) password.
* operator/internal/configwatcher/configwatcher.go passed recreate=false when syncing the internal superuser ("the internal user should only ever be created once, so don't update its password ever").
* RPK_USER/RPK_PASS env vars were re-materialized from the new Secret, but Redpanda still rejected them: …
The fix
* operator/internal/configwatcher/configwatcher.go: syncInternalUser helper that does CreateUser, and on "already exists" falls through to UpdateUser(user, password, mechanism). Never falls back to delete-and-recreate: dropping the internal superuser even briefly could strand the operator.
* SyncUsers routes the internal superuser through syncInternalUser instead of syncUser(..., recreate=false).
* SyncUsers runs only at pod start (syncInitial) and when fsnotify observes a change to the mounted Secret volume. There is no timer or reconcile tick. In steady state, UpdateUser fires once per pod lifetime (at boot); a Secret rotation adds exactly one more call when kubelet swaps the volume. See configwatcher.go:109-191: Start → syncInitial (one-shot), then watchFilesystem, a pure fsnotify.Watcher select loop.
* UpdateUser against Redpanda is a no-op when the password matches, so even the boot-time call is effectively free on the cluster side.
Rotation flow after the fix:
* … P2.
* … SyncUsers runs.
* … P1 (env vars aren't re-read mid-container), so syncInternalUser re-asserts P1 (idempotent; Redpanda's SCRAM DB is unchanged).
* … P2; syncInternalUser drives UpdateUser → Redpanda's SCRAM DB is now P2.
* rpk in the restarted pod authenticates with P2 successfully.
Changes
* … syncInternalUser helper, SyncUsers wiring, comment update).
* … RPK_PASS, re-run SyncUsers, assert the original password no longer authenticates and the rotated password does. Also enables redpanda.WithAdminAPIAuthentication() so the admin API actually enforces basic auth and the assertion discriminates.
* … rpk still works.
* … operator project as Fixed.
Test plan
* go test ./operator/internal/configwatcher/...: rotation subtest passes (validated locally against a Rancher Desktop Docker socket).
* task test:acceptance with the bootstrap-user.feature scenario: expected to pass with the fix and would fail on main.
* … @skip:gke @skip:aks @skip:eks (local-only) and uses a dedicated cluster name (bootstrap-regen).
🤖 Generated with Claude Code