Document manual bootstrap user password resync after Helm-to-Operator migration #1674

Merged
micheleRP merged 4 commits into redpanda-data:main from david-yu:docs/migrate-bootstrap-user-resync
Apr 21, 2026

@david-yu david-yu commented Apr 20, 2026

Summary

  • Adds a new Resynchronize the bootstrap user password section to the Helm-to-Operator migration page, describing the manual rpk acl user update workflow users must run when the Operator regenerates the bootstrap-user Secret.
  • Explains the password-drift symptom (SASL_AUTHENTICATION_FAILED after pod restart) so readers can recognize when this procedure applies.
  • Documents the current operator behavior: the Operator does not resynchronize the SCRAM database password automatically, so the fix is manual. (Context: redpanda-operator#1465, "operator: fix SASL bootstrap user password drift after Secret rotation", which would have handled this on the operator side, was closed.)

Preview pages

Test plan

  • Preview the migrate/kubernetes/helm-to-operator page and confirm the new section renders between the migration procedure and the rollback section.
  • Verify the AsciiDoc callouts (1-5) render correctly on the rpk acl user update block.
  • Confirm the cross-reference to manage:kubernetes/security/authentication/k-authentication.adoc resolves.

🤖 Generated with Claude Code

… migration

When the Redpanda Operator regenerates the bootstrap-user Secret, the
new password is not synced into Redpanda's SCRAM database, so rpk
fails to authenticate after pod restarts. Document the manual
`rpk acl user update` workflow that resynchronizes the password.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@david-yu david-yu requested a review from a team as a code owner April 20, 2026 21:50

netlify Bot commented Apr 20, 2026

Deploy Preview for redpanda-docs-preview ready!

Built without sensitive environment variables

Name Link
🔨 Latest commit a07ff53
🔍 Latest deploy log https://app.netlify.com/projects/redpanda-docs-preview/deploys/69e7b3ce5a0b680008f4e5b7
😎 Deploy Preview https://deploy-preview-1674--redpanda-docs-preview.netlify.app


coderabbitai Bot commented Apr 20, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: bd16987a-4a85-4cd3-89ad-d9722f5b72e8

📝 Walkthrough

This change adds a new troubleshooting section to the Helm-to-Operator migration documentation. The section addresses a password synchronization issue that occurs when regenerating the bootstrap user Secret on SASL-enabled clusters during migration. The Secret receives a new random password while Redpanda's SCRAM database retains the original Helm-era password, causing authentication failures. The update includes detection steps, diagnostic commands, and manual remediation procedures using rpk to resynchronize credentials.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Suggested reviewers

  • JakeSCahill
  • micheleRP
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

  • Check name: Description check
    Status: ❓ Inconclusive
    Explanation: The pull request description provides a clear summary, preview pages, and test plan. However, it does not include the JIRA ticket reference or indicate which checkbox applies (e.g., Content gap, Support Follow-up).
    Resolution: Add the JIRA ticket number to resolve and select the appropriate check category (e.g., Content gap or Support Follow-up) from the template.

✅ Passed checks (2 passed)

  • Check name: Title check
    Status: ✅ Passed
    Explanation: The title accurately and specifically describes the main change: adding documentation for manual bootstrap user password resynchronization after Helm-to-Operator migration.
  • Check name: Docstring Coverage
    Status: ✅ Passed
    Explanation: No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modules/migrate/pages/kubernetes/helm-to-operator.adoc`:
- Around line 232-234: The shell examples use unquoted variable expansions in
the `rpk acl user update` and `export RPK_PASS` lines which can break when
variables contain spaces or special chars; update the `rpk acl user update
$RPK_USER --mechanism $RPK_SASL_MECHANISM --new-password $RPK_NEW_PASS`
invocation to quote the expansions (e.g., use "$RPK_USER",
"$RPK_SASL_MECHANISM", "$RPK_NEW_PASS") and likewise quote the export (`export
RPK_PASS="$RPK_NEW_PASS"`) so the `rpk cluster info` step runs reliably with the
intended credentials.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9ff668d0-e554-429f-9109-4a62ec574531

📥 Commits

Reviewing files that changed from the base of the PR and between de08b45 and 7de3cb5.

📒 Files selected for processing (1)
  • modules/migrate/pages/kubernetes/helm-to-operator.adoc

Verified on a local kind cluster that the manual resync workflow can
leave a multi-broker cluster in a mixed state: the SCRAM database
ends up at the new password while some pods still hold the original
in env, because kubelet caches immutable Secrets per node and does
not reliably re-read them when the Secret is deleted and recreated.
Add a pre-check step, node-drain guidance, and a warning about the
inverted drift that occurs if the check is skipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

david-yu commented Apr 20, 2026

Amended with a pre-check step and a kubelet Secret-cache warning after end-to-end testing on a local kind cluster against chart redpanda-26.1.2 and operator operator-26.1.2 (test notes).

Key finding: the original resync procedure works on the pod it's run from, but can leave a multi-broker cluster in a mixed state. The operator creates <cluster-name>-bootstrap-user with immutable: true, and kubelet's per-node cache does not always invalidate on delete-and-recreate. After the resync, pods whose env still holds the original password start failing SASL auth — an inverted drift.

Changes in this push:

  • New first step: iterate every broker pod and print RPK_PASS; only proceed once they all agree.
  • If a pod disagrees: force-delete; if it still disagrees after re-creation, drain the node so the pod reschedules.
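The pre-check described above could be sketched roughly as follows. This is a minimal illustration, not the text added to the docs page: the namespace default, the label selector `app.kubernetes.io/name=redpanda`, and the container name `redpanda` are assumptions, and the helper names `list_pod_passwords` and `all_agree` are hypothetical.

```shell
#!/usr/bin/env bash
# Assumptions (not taken from the PR): namespace, label selector, container name.
NAMESPACE="${NAMESPACE:-redpanda}"
SELECTOR="${SELECTOR:-app.kubernetes.io/name=redpanda}"
CONTAINER="${CONTAINER:-redpanda}"

# Print one line per broker pod: "<pod> <value of RPK_PASS in that pod>".
list_pod_passwords() {
  kubectl get pods --namespace "$NAMESPACE" --selector "$SELECTOR" -o name |
  while read -r pod; do
    # </dev/null keeps kubectl exec from consuming the pod list on stdin.
    pass=$(kubectl exec --namespace "$NAMESPACE" "$pod" \
      --container "$CONTAINER" -- printenv RPK_PASS </dev/null)
    printf '%s %s\n' "$pod" "$pass"
  done
}

# Succeed only if every input line carries the same value after the pod name.
# Empty input (no pods found) counts as disagreement.
all_agree() {
  awk '{ sub(/^[^ ]+ /, ""); seen[$0] = 1 }
       END { n = 0; for (v in seen) n++; exit (n == 1 ? 0 : 1) }'
}

# Usage (run manually against a real cluster):
#   list_pod_passwords | all_agree && echo "all broker pods agree on RPK_PASS"
```

Note that `list_pod_passwords` writes the password to the terminal, as the step in this push does; pipe it straight into `all_agree` if that is a concern.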

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@micheleRP commented

Review notes

Three suggestions worth considering before merge.

1. CodeRabbit finding is still open: quote shell variable expansions (lines 251-252)

Lines 249-250 correctly quote "$RPK_PASS" and "<old-password>", but the next two lines don't:

rpk acl user update $RPK_USER --mechanism $RPK_SASL_MECHANISM --new-password $RPK_NEW_PASS <3>
export RPK_PASS=$RPK_NEW_PASS <4>

If $RPK_NEW_PASS (the freshly generated operator password) contains whitespace or a glob character such as * or ?, the unquoted expansion is word-split or pathname-expanded, so rpk receives the wrong arguments and the recovery procedure fails, which is especially bad in a recovery doc. Suggested fix:

-rpk acl user update $RPK_USER --mechanism $RPK_SASL_MECHANISM --new-password $RPK_NEW_PASS <3>
-export RPK_PASS=$RPK_NEW_PASS <4>
+rpk acl user update "$RPK_USER" --mechanism "$RPK_SASL_MECHANISM" --new-password "$RPK_NEW_PASS" <3>
+export RPK_PASS="$RPK_NEW_PASS" <4>
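The effect of the quoting fix can be seen without a cluster. The following is a small sketch, assuming bash with the default `IFS`; the password value and the `count_args` helper are made up for illustration and do not come from the PR.

```shell
#!/usr/bin/env bash
# Hypothetical password, chosen only to trigger word splitting; not a real credential.
RPK_NEW_PASS='s3cret with space'

# Illustration helper: prints how many arguments it received.
count_args() { echo "$#"; }

# Unquoted: the shell splits the value on whitespace before the command sees it.
unquoted_count=$(count_args --new-password $RPK_NEW_PASS)

# Quoted: the value is passed through as a single argument.
quoted_count=$(count_args --new-password "$RPK_NEW_PASS")

echo "unquoted: $unquoted_count arguments"
echo "quoted:   $quoted_count arguments"
```

With this value the unquoted call passes four arguments and the quoted call passes two, so in the unquoted form a command like `rpk acl user update` would be handed only the first word as the password.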

2. "Recovery options" xref points to a configuration page, not recovery docs (line 217)

The NOTE tells readers:

If you no longer have it, see xref:manage:kubernetes/security/authentication/k-authentication.adoc[] for recovery options.

I checked the target page — it's "Configure Authentication for Redpanda in Kubernetes" and it has no content about password recovery, reset, or bootstrap-user restoration. Grepping the whole modules/ tree, no page documents superuser password recovery except this one.

The xref promises something that isn't there. Options:

  • Drop the "If you no longer have it..." clause entirely (doesn't help the reader right now).
  • Reword to set realistic expectations (e.g., "If you no longer have it, you may need to restore from backup or contact Redpanda support").
  • Commit to adding a dedicated recovery section somewhere and link it from here.

3. Section placement reads as a required migration step

== Resynchronize the bootstrap user password sits between == Migrate to the Redpanda Operator and Helm and == Roll back from Redpanda Operator to Helm — looks like step 3 of the migration in the TOC. The prose and NOTE correctly scope it as conditional ("Only perform these steps if rpk fails SASL authentication"), but TOC scanners won't see that.

Two framings would set expectations up-front:

  • Retitle the H2: == Resynchronize the bootstrap user password (if authentication fails).
  • Or move the section under == Troubleshooting (line 274) as an H3 — which is what a recovery procedure structurally is.

Either works; the Troubleshooting move keeps the main migration flow clean and positions the recovery procedure where readers look when something's broken.

@micheleRP commented

Additional suggestions after cross-checking the originating Slack thread

Andrew Stucki's replies in that thread are the authoritative context behind this procedure. Three items to absorb into the page — listed in priority order (safety → framing → explanation).

C. Add a preventive step: back up the old password before regenerating the Secret

This is the most important addition. The customer (Alex Lavoie) in the Slack thread called this out explicitly:

i'll take a backup of the password anyways so i will have the old pwd available.

If a reader regenerates the Secret without having the old password, the entire procedure on this page becomes impossible — the rpk acl user update step needs the old password to authenticate. The current NOTE only covers the recovery case ("if you no longer have it, see…"), which is too late.

Suggested: add a TIP above step 1, and reorder the NOTE so the preventive step comes before the recovery path.

[TIP]
====
If you haven't regenerated the bootstrap-user Secret yet, back up the current password first so you have it available for the resync:

[,bash]
----
kubectl --namespace <namespace> get secret <cluster-name>-bootstrap-user \
  -o jsonpath='{.data.password}' | base64 -d
----
====

B. Frame the procedure as a recovery path users should try to avoid

Andrew was unambiguous (replies 13, 14, 17, 19) that regenerating the bootstrap-user Secret is fundamentally unsupported:

what they're attempting to do is basically unsupported without manual intervention and we may want to make bootstrapUser on the CRD immutable

the PR fix is not a real fix — in order to change any password, well, we need an admin password

The current page reads like a neutral "here's how to do it" — readers scanning the TOC might conclude this is a supported workflow. It isn't; it's a recovery path for a situation they should have tried to prevent.

Suggested: lead the section with a CAUTION admonition that sets that expectation.

[CAUTION]
====
The bootstrap user is designed to be set once at cluster creation and remain long-lived. Avoid regenerating the `<cluster-name>-bootstrap-user` Secret when possible. Only use this procedure if the Secret has already been regenerated and you need to recover from the resulting authentication failure.
====

This pairs well with suggestion #3 in my earlier comment (moving the section under Troubleshooting or adding (if authentication fails) to the H2) — both framings reinforce that this is recovery, not routine migration.

A. Explain why the operator can't do this automatically

Andrew's core technical justification (replies 19-20) is currently missing from the page:

to change any password, well, you need the old password too (i.e. if going from A → B as the user account being authed, then you have to know both A and B — we can't know that though because we only have the one secret field that exists)

any sort of redesign to support this would need to take into account being able to specify 2 passwords for the user we leverage

The page currently says:

The Redpanda Operator does not resynchronize this password for you.

…but doesn't explain why not, which makes the limitation read like an oversight rather than a constraint inherent to SCRAM. One sentence closes the gap.

Suggested: extend line 213 to something like:

The Redpanda Operator does not resynchronize this password for you: changing a SCRAM password requires authenticating with the old password, and the bootstrap Secret only tracks one credential at a time. You must update the SCRAM database manually, using the old password to authenticate and the new password from the regenerated Secret as the target.

These are additive to the three suggestions in #1674 (comment) (quoting, broken recovery xref, heading placement), not replacements.
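The A → B constraint described in this comment can be sketched as a shell function. It is assembled from the `rpk` lines quoted earlier in this thread and is not the exact block from the docs page: the function name and its single parameter are hypothetical, and it assumes it runs inside a broker pod where `RPK_USER`, `RPK_SASL_MECHANISM`, and `RPK_PASS` are already set, with `RPK_PASS` holding the value from the regenerated Secret.

```shell
#!/usr/bin/env bash
# Sketch only. Hypothetical helper; run it after confirming that all broker
# pods agree on RPK_PASS.
resync_bootstrap_password() {
  # $1 (hypothetical parameter): the old password that the SCRAM database still holds.
  [ -n "$1" ] || { echo "usage: resync_bootstrap_password <old-password>" >&2; return 2; }

  # B: the target password, taken from the regenerated Secret via the pod env.
  RPK_NEW_PASS="$RPK_PASS"

  # A: authenticate as the bootstrap user with the old password.
  export RPK_PASS="$1"

  # Change the SCRAM credential from A to B; stop here if the update fails.
  rpk acl user update "$RPK_USER" \
    --mechanism "$RPK_SASL_MECHANISM" \
    --new-password "$RPK_NEW_PASS" || return 1

  # Switch rpk back to the new password and verify that authentication works.
  export RPK_PASS="$RPK_NEW_PASS"
  rpk cluster info
}

# Usage: resync_bootstrap_password '<old-password>'
```

Both passwords appear in the function, which is the point made in the quoted replies: the Secret alone supplies only B, so A has to come from a backup.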

@micheleRP micheleRP self-requested a review April 21, 2026 16:18
Reframes the procedure as a recovery path rather than a routine
migration step, per Slack thread context from Andrew Stucki.

- Move the section under Troubleshooting as an H3 so the TOC no
  longer suggests it is a required migration step.
- Add a CAUTION explaining that the bootstrap user is designed to be
  long-lived and the Secret should not be regenerated.
- Add a TIP recommending that readers back up the current password
  before any operation that might regenerate the Secret.
- Explain the SCRAM-level reason the operator can't auto-resync
  (password change requires the old password; the Secret only holds
  one credential).
- Replace the broken recovery xref in the NOTE with realistic
  guidance (restore from backup or contact support).
- Quote shell variable expansions in the `rpk acl user update` and
  `export RPK_PASS` lines so special characters do not break the
  command.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@david-yu commented

Thanks @micheleRP — all six suggestions applied in a07ff539. Summary:

From your first comment:

  1. Quoted shell variables. rpk acl user update and the follow-up export now quote "$RPK_USER", "$RPK_SASL_MECHANISM", "$RPK_NEW_PASS". Also resolves CodeRabbit's inline finding.
  2. Broken recovery xref. Dropped the k-authentication.adoc link (it has no recovery content) and reworded the NOTE: "you may need to restore from backup or contact Redpanda support."
  3. Section placement. Moved the whole section from a standalone H2 between migration and rollback to an H3 under == Troubleshooting. TOC no longer reads it as a migration step.

From your second comment:

  • (C) Backup preventive step. Added a TIP block above step 1 with the kubectl get secret ... | base64 -d command to back up the current password before any operation that might regenerate the Secret.
  • (B) Frame as recovery path. Added a CAUTION at the top of the section stating the bootstrap user is designed to be set once and remain long-lived, and this procedure is for recovering from the resulting authentication failure, not a routine workflow. Pairs with the Troubleshooting placement.
  • (A) Explain why the operator can't auto-sync. Extended the "does not resynchronize" sentence: "changing a SCRAM password requires authenticating with the old password, and the bootstrap Secret only tracks one credential at a time."


@micheleRP micheleRP left a comment

lgtm

@micheleRP micheleRP merged commit b216ec2 into redpanda-data:main Apr 21, 2026
6 of 7 checks passed