operator: fix SASL bootstrap user password drift after Secret rotation #1465

Closed
david-yu wants to merge 6 commits into main from bug/bootstrap-user-password-drift

@david-yu (Contributor) commented Apr 20, 2026

Summary

Fixes a SASL bootstrap-user password-drift bug surfaced by users migrating off the legacy Helm flow. Implements option (b) from the reproducer writeup: the sidecar configwatcher now mirrors the bootstrap user Secret's password into Redpanda's SCRAM DB on every sync, so a rotated bootstrap Secret propagates into the running cluster and rpk keeps authenticating after the next pod restart.

The bug

  1. A Redpanda cluster runs with SASL enabled and an operator-owned <fullname>-bootstrap-user Secret.
  2. The Secret is deleted — e.g. cleaning up a Helm-era artifact and expecting the operator to take clean ownership.
  3. charts/redpanda/render_state.go:FetchBootstrapUser / operator/multicluster/secrets.go:secretBootstrapUser treat the missing Secret as "first-time bootstrap" and write a new one with a fresh helmette.RandAlphaNum(32) password.
  4. The running Redpanda retained the original password in its internal SCRAM DB because operator/internal/configwatcher/configwatcher.go passed recreate=false when syncing the internal superuser ("the internal user should only ever be created once, so don't update its password ever").
  5. On the next pod restart, the Pod's RPK_USER / RPK_PASS env vars were re-materialized from the new Secret, but Redpanda still rejected them:
    SASL_AUTHENTICATION_FAILED: Invalid credentials
    
    The original password still worked — confirming the drift.
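The sequence above can be modelled in a few lines of Go. This is a minimal sketch, not the operator's code: `secretStore`, `scramDB`, `fetchBootstrapPassword`, `syncCreateOnly`, and the Secret and user names are hypothetical stand-ins, and `randAlphaNum` only plays the role that `helmette.RandAlphaNum(32)` plays in the real render path.

```go
package main

import (
	"crypto/rand"
	"fmt"
	"math/big"
)

// secretStore is a hypothetical stand-in for the Kubernetes Secret API.
type secretStore map[string]string

// scramDB is a hypothetical stand-in for Redpanda's internal SCRAM store.
type scramDB map[string]string

const alphaNum = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

// randAlphaNum returns n random alphanumeric characters. It only mimics the
// role of helmette.RandAlphaNum; it is not the real implementation.
func randAlphaNum(n int) string {
	out := make([]byte, n)
	for i := range out {
		idx, err := rand.Int(rand.Reader, big.NewInt(int64(len(alphaNum))))
		if err != nil {
			panic(err)
		}
		out[i] = alphaNum[idx.Int64()]
	}
	return string(out)
}

// fetchBootstrapPassword returns the stored password, or treats a missing
// Secret as "first-time bootstrap" and writes a fresh random one (step 3).
func fetchBootstrapPassword(s secretStore, name string) string {
	if pw, ok := s[name]; ok {
		return pw
	}
	pw := randAlphaNum(32)
	s[name] = pw
	return pw
}

// syncCreateOnly models the recreate=false behaviour: the user is created
// once and its password is never rewritten afterwards (step 4).
func syncCreateOnly(db scramDB, user, password string) {
	if _, exists := db[user]; exists {
		return
	}
	db[user] = password
}

// reproduceDrift runs bootstrap, Secret deletion, and regeneration, then
// returns the password in the Secret and the one the cluster still holds.
func reproduceDrift() (secretPW, scramPW string) {
	const name, user = "example-bootstrap-user", "example-superuser" // hypothetical names
	secrets, db := secretStore{}, scramDB{}

	syncCreateOnly(db, user, fetchBootstrapPassword(secrets, name)) // initial bootstrap
	delete(secrets, name)                                           // step 2: Secret deleted
	syncCreateOnly(db, user, fetchBootstrapPassword(secrets, name)) // steps 3-4
	return secrets[name], db[user]
}

func main() {
	secretPW, scramPW := reproduceDrift()
	fmt.Println("secret and SCRAM DB disagree:", secretPW != scramPW)
}
```

The two values differ because the regenerated password never reaches the store that actually authenticates clients.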

The fix

operator/internal/configwatcher/configwatcher.go:

  • New syncInternalUser helper that does CreateUser, and on "already exists" falls through to UpdateUser(user, password, mechanism). Never falls back to delete-and-recreate — dropping the internal superuser even briefly could strand the operator.
  • SyncUsers routes the internal superuser through syncInternalUser instead of syncUser(..., recreate=false).
  • Event-driven, not polling. SyncUsers runs only at pod start (syncInitial) and when fsnotify observes a change to the mounted Secret volume. There is no timer or reconcile tick. In steady state, UpdateUser fires once per pod lifetime (at boot); a Secret rotation adds exactly one more call when kubelet swaps the volume. See configwatcher.go:109-191: Start → syncInitial (one-shot), then watchFilesystem, a pure fsnotify.Watcher select loop.
  • UpdateUser against Redpanda is a no-op when the password matches, so even the boot-time call is effectively free on the cluster side.
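The create-then-update shape described in the bullets above might look roughly like the following. It is a sketch under assumptions: the `userAdmin` interface, the `errAlreadyExists` sentinel, and `fakeAdmin` are hypothetical, and the real helper talks to the rpadmin client and may detect "already exists" differently.

```go
package main

import (
	"errors"
	"fmt"
)

// errAlreadyExists is a hypothetical sentinel; the real admin client may
// report this condition in another way (for example via an HTTP error body).
var errAlreadyExists = errors.New("user already exists")

// userAdmin is an assumed, minimal slice of the admin API used by the sync.
type userAdmin interface {
	CreateUser(user, password, mechanism string) error
	UpdateUser(user, password, mechanism string) error
}

// fakeAdmin is an in-memory test double, not a real Redpanda client.
type fakeAdmin struct {
	passwords map[string]string
	updates   int
}

func (f *fakeAdmin) CreateUser(user, password, _ string) error {
	if _, ok := f.passwords[user]; ok {
		return errAlreadyExists
	}
	f.passwords[user] = password
	return nil
}

func (f *fakeAdmin) UpdateUser(user, password, _ string) error {
	if _, ok := f.passwords[user]; !ok {
		return fmt.Errorf("user %q not found", user)
	}
	f.updates++
	f.passwords[user] = password
	return nil
}

// syncInternalUser creates the user, and on "already exists" updates the
// password in place. It never deletes the user, so the internal superuser
// is not missing at any point.
func syncInternalUser(admin userAdmin, user, password, mechanism string) error {
	err := admin.CreateUser(user, password, mechanism)
	if err == nil {
		return nil
	}
	if !errors.Is(err, errAlreadyExists) {
		return fmt.Errorf("creating internal user: %w", err)
	}
	if err := admin.UpdateUser(user, password, mechanism); err != nil {
		return fmt.Errorf("updating internal user: %w", err)
	}
	return nil
}

func main() {
	admin := &fakeAdmin{passwords: map[string]string{}}
	_ = syncInternalUser(admin, "bootstrap", "P1", "SCRAM-SHA-256") // first sync creates
	_ = syncInternalUser(admin, "bootstrap", "P2", "SCRAM-SHA-256") // second sync updates
	fmt.Println(admin.passwords["bootstrap"], admin.updates)        // prints "P2 1"
}
```

Keeping the update path separate from a delete-and-recreate path is the design point here: a failed update leaves the old credential in place rather than no credential at all.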

Rotation flow after the fix:

  1. User deletes the bootstrap Secret; operator regenerates with P2.
  2. K8s updates the mounted Secret file on every pod; fsnotify fires → SyncUsers runs.
  3. Each pod's env still holds P1 (env vars aren't re-read mid-container), so syncInternalUser re-asserts P1 — idempotent, Redpanda's SCRAM DB is unchanged.
  4. Each pod eventually restarts (rolling or manual); the new env exposes P2; syncInternalUser drives UpdateUser → Redpanda's SCRAM DB is now P2.
  5. rpk in the restarted pod authenticates with P2 successfully.
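As a rough illustration of the flow above, the following sketch traces the SCRAM DB password and a pod's environment through rotation and restart. The `cluster` and `pod` types are hypothetical, and the model assumes the sync call itself is always accepted by the cluster; the closing comment on this PR questions exactly that assumption.

```go
package main

import "fmt"

// cluster is a hypothetical model holding the two places a password lives.
type cluster struct {
	secret string // current content of the bootstrap Secret
	scram  string // password stored in Redpanda's SCRAM DB
}

// pod captures RPK_PASS once, at container start.
type pod struct {
	envPass string
}

// startPod snapshots the Secret into the pod env and runs the initial sync.
func (c *cluster) startPod() *pod {
	p := &pod{envPass: c.secret}
	c.sync(p)
	return p
}

// sync models syncInternalUser: it asserts the pod's env password into the
// SCRAM DB. It assumes the admin API call is accepted unconditionally.
func (c *cluster) sync(p *pod) { c.scram = p.envPass }

// rpkAuthenticates reports whether the pod's env password matches the DB.
func (c *cluster) rpkAuthenticates(p *pod) bool { return p.envPass == c.scram }

// rotationFlow walks the five steps and records the state after each phase.
func rotationFlow() []string {
	var trace []string
	c := &cluster{secret: "P1"}
	p := c.startPod()
	record := func(step string) {
		trace = append(trace, fmt.Sprintf("%s: scram=%s env=%s auth=%t",
			step, c.scram, p.envPass, c.rpkAuthenticates(p)))
	}
	record("boot")

	c.secret = "P2" // step 1: Secret deleted and regenerated
	c.sync(p)       // steps 2-3: fsnotify fires, env still holds P1
	record("after rotation")

	p = c.startPod() // step 4: restart picks up P2 and syncs it
	record("after restart")
	return trace
}

func main() {
	for _, line := range rotationFlow() {
		fmt.Println(line)
	}
}
```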

Changes

  • operator/internal/configwatcher/configwatcher.go — the fix (new syncInternalUser helper, SyncUsers wiring, comment update).
  • operator/internal/configwatcher/configwatcher_test.go — extends the existing testcontainer-based test with a rotation case: flip RPK_PASS, re-run SyncUsers, assert the original password no longer authenticates and the rotated password does. Also enables redpanda.WithAdminAPIAuthentication() so the admin API actually enforces basic auth and the assertion discriminates.
  • acceptance/features/bootstrap-user.feature + acceptance/steps/bootstrap_user.go + acceptance/steps/register.go — end-to-end regression: install SASL cluster, delete the operator-owned bootstrap Secret, wait for regeneration with a different password, restart all pods, assert rpk still works.
  • .changes/unreleased/operator-Fixed-20260420-150000.yaml — changelog entry under the operator project as Fixed.

Test plan

  • go test ./operator/internal/configwatcher/... — rotation subtest passes (validated locally against a Rancher Desktop Docker socket).
  • task test:acceptance with the bootstrap-user.feature scenario — expected to pass with the fix and would fail on main.
  • No other scenarios affected — the new acceptance scenario is @skip:gke @skip:aks @skip:eks (local-only) and uses a dedicated cluster name (bootstrap-regen).

🤖 Generated with Claude Code

david-yu and others added 2 commits April 20, 2026 10:38
Adds an acceptance scenario that fails on current main: after the
bootstrap user Secret is deleted, the operator regenerates it with a
fresh random password while the running Redpanda cluster still holds
the original password in its internal SCRAM DB, so rpk (and any other
consumer of the new Secret) fails SASL auth after the next pod restart.

The drift lives at the intersection of two deliberate choices:

  * charts/redpanda/render_state.go:FetchBootstrapUser and
    operator/multicluster/secrets.go:secretBootstrapUser treat a
    missing default-named Secret as "first-time bootstrap" and
    generate a fresh random password (helmette.RandAlphaNum(32)),
    with no signal that the cluster has already been bootstrapped.
  * operator/internal/configwatcher/configwatcher.go explicitly
    passes recreate=false when syncing the internal superuser, so
    Redpanda's SCRAM DB is never rewritten after the initial bootstrap.

Either (a) preserving the original password or (b) calling
AlterUserSCRAMs when the Secret rotates would close the gap. This
commit is purely a reproducer — no production code is touched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@david-yu david-yu changed the title acceptance: reproducer — bootstrap user password drift after Secret delete operator: fix SASL bootstrap user password drift after Secret rotation Apr 20, 2026
david-yu and others added 4 commits April 20, 2026 10:57
The sidecar configwatcher used to call syncUser for the internal
superuser with recreate=false, explicitly never updating the password
once the user existed in Redpanda. That left a silent drift whenever
the bootstrap user Secret was rotated (e.g. the operator regenerating
a deleted Secret with a fresh random password): Redpanda kept the
original SCRAM credential while consumers of the new Secret failed
SASL auth with Invalid credentials after the next pod restart.

Add a dedicated syncInternalUser helper that, on CreateUser returning
"already exists", drives UpdateUser against the admin API so the
running cluster picks up whatever password the mounted Secret now
holds. UpdateUser is idempotent against Redpanda so this is safe to
invoke on every sync. Unlike the regular syncUser path, this helper
never falls back to delete-and-recreate — dropping the internal
superuser even briefly could strand the operator.

Extend the testcontainer-based TestConfigWatcher with a rotation
scenario that verifies the new behavior via a Kafka SASL handshake:
the original password must fail authentication after rotation and the
rotated password must succeed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…gelog

Review pass: the fix drives the rpadmin HTTP admin API's UpdateUser,
not the Kafka protocol's AlterUserSCRAMs. Tighten comments and the
changelog entry accordingly. Also fix the kafkaSASLHandshake test
helper comment which claimed a Metadata request when it actually
pings each seed broker.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
No behavior change — just tighten the wording from "on every sync" to
"at pod start and on fsnotify events when the mounted Secret changes,
not on a timer" so readers don't assume this introduces continuous
polling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@david-yu
Contributor Author

Still needs further discussion with eng before moving to review.

@david-yu
Contributor Author

Closing as per @andrewstucki's advice.

Currently you can work around it manually by running the rpk command on the cluster with the old password to update the user to the generated password. But we can't really support changing the bootstrap password without a substantial redesign: the password is meant to be long-lived, and once auth is enabled and enforced on the cluster, changing the password requires the old password too. That is, if the authenticated user account goes from A --> B, you have to know both A and B, and we can't know that because the Secret has only the one field.

Any redesign to support this would need to allow specifying two passwords for the user we leverage: the old password and the new password. Fundamentally, just saying "use password B" isn't going to work unless we know password A too, which is why regenerating the password isn't supported right now.
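The constraint in this comment can be illustrated with a toy model. This is a sketch only: `authedCluster` and `updatePassword` are hypothetical and stand in for any auth-enforced endpoint where the caller must present the current credential before changing it.

```go
package main

import (
	"errors"
	"fmt"
)

var errUnauthorized = errors.New("unauthorized")

// authedCluster is a hypothetical cluster with auth enforced: every request,
// including a password change, must authenticate with the current password.
type authedCluster struct {
	current string
}

// updatePassword changes the password to newPass, but only if the caller
// authenticates with the password the cluster currently holds.
func (c *authedCluster) updatePassword(authPass, newPass string) error {
	if authPass != c.current {
		return errUnauthorized
	}
	c.current = newPass
	return nil
}

func main() {
	c := &authedCluster{current: "A"}

	// A caller that only knows the regenerated password B cannot apply it.
	fmt.Println(c.updatePassword("B", "B")) // prints "unauthorized"

	// A caller that knows both A and B can move the cluster from A to B.
	fmt.Println(c.updatePassword("A", "B")) // prints "<nil>"
	fmt.Println(c.current)                  // prints "B"
}
```

With a single Secret field, the sidecar is always in the position of the first caller once the Secret has been regenerated, which matches the manual workaround of running rpk with the old password.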

@david-yu david-yu closed this Apr 20, 2026
@david-yu
Contributor Author

Related to issue redpanda-data/helm-charts#1596
