Document manual bootstrap user password resync after Helm-to-Operator migration #1674
… migration When the Redpanda Operator regenerates the bootstrap-user Secret, the new password is not synced into Redpanda's SCRAM database, so rpk fails to authenticate after pod restarts. Document the manual `rpk acl user update` workflow that resynchronizes the password. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Walkthrough

This change adds a new troubleshooting section to the Helm-to-Operator migration documentation. The section addresses a password synchronization issue that occurs when regenerating the bootstrap user Secret on SASL-enabled clusters during migration. The Secret receives a new random password while Redpanda's SCRAM database retains the original Helm-era password, causing authentication failures. The update includes detection steps, diagnostic commands, and manual remediation procedures using …

Estimated code review effort: 1 (Trivial), ~3 minutes
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modules/migrate/pages/kubernetes/helm-to-operator.adoc`:
- Around line 232-234: The shell examples use unquoted variable expansions in
the `rpk acl user update` and `export RPK_PASS` lines which can break when
variables contain spaces or special chars; update the `rpk acl user update
$RPK_USER --mechanism $RPK_SASL_MECHANISM --new-password $RPK_NEW_PASS`
invocation to quote the expansions (e.g., use "$RPK_USER",
"$RPK_SASL_MECHANISM", "$RPK_NEW_PASS") and likewise quote the export (`export
RPK_PASS="$RPK_NEW_PASS"`) so the `rpk cluster info` step runs reliably with the
intended credentials.
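
A minimal sketch of why the quoting matters, assuming a password that contains a space. `count_args` is a hypothetical helper that stands in for `rpk` and only reports how many arguments it received; the password value is made up for illustration.

```shell
#!/bin/sh
# Hypothetical stand-in for rpk: prints the number of arguments it was given.
count_args() { echo "$#"; }

# Hypothetical password containing a space (not a real credential).
RPK_NEW_PASS='s3cr3t pass'

# Unquoted: the shell splits the value on the space, so the flag is followed by two words.
unquoted=$(count_args --new-password $RPK_NEW_PASS)

# Quoted: the value is passed through as a single argument.
quoted=$(count_args --new-password "$RPK_NEW_PASS")

echo "unquoted: $unquoted arguments"   # prints "unquoted: 3 arguments"
echo "quoted:   $quoted arguments"     # prints "quoted:   2 arguments"
```

Unquoted expansions are also subject to filename globbing, so a password containing `*` or `?` could expand differently depending on the working directory.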
📒 Files selected for processing (1)
modules/migrate/pages/kubernetes/helm-to-operator.adoc
Verified on a local kind cluster that the manual resync workflow can leave a multi-broker cluster in a mixed state: the SCRAM database ends up at the new password while some pods still hold the original in env, because kubelet caches immutable Secrets per node and does not reliably re-read them when the Secret is deleted and recreated. Add a pre-check step, node-drain guidance, and a warning about the inverted drift that occurs if the check is skipped. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Amended with a pre-check step and a kubelet Secret-cache warning after end-to-end testing on a local kind cluster against chart …

Key finding: the original resync procedure works on the pod it's run from, but can leave a multi-broker cluster in a mixed state. The operator creates …

Changes in this push: …
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Review notes

Three suggestions worth considering before merge.

1. CodeRabbit finding is still open: quote shell variable expansions (lines 251-252)

Lines 249-250 correctly quote …

    rpk acl user update $RPK_USER --mechanism $RPK_SASL_MECHANISM --new-password $RPK_NEW_PASS <3>
    export RPK_PASS=$RPK_NEW_PASS <4>

If …

    -rpk acl user update $RPK_USER --mechanism $RPK_SASL_MECHANISM --new-password $RPK_NEW_PASS <3>
    -export RPK_PASS=$RPK_NEW_PASS <4>
    +rpk acl user update "$RPK_USER" --mechanism "$RPK_SASL_MECHANISM" --new-password "$RPK_NEW_PASS" <3>
    +export RPK_PASS="$RPK_NEW_PASS" <4>

2. "Recovery options" xref points to a configuration page, not recovery docs (line 217)

The NOTE tells readers:
I checked the target page: it's "Configure Authentication for Redpanda in Kubernetes" and it has no content about password recovery, reset, or bootstrap-user restoration. Grepping the whole … The xref promises something that isn't there. Options:
3. Section placement reads as a required migration step
Two framings would set expectations up-front:
Either works; the Troubleshooting move keeps the main migration flow clean and positions the recovery procedure where readers look when something's broken.
Additional suggestions after cross-checking the originating Slack thread

Andrew Stucki's replies in that thread are the authoritative context behind this procedure. Three items to absorb into the page, listed in priority order (safety → framing → explanation).

C. Add a preventive step: back up the old password before regenerating the Secret

This is the most important addition. The customer (Alex Lavoie) in the Slack thread called this out explicitly:

If a reader regenerates the Secret without having the old password, the entire procedure on this page becomes impossible; the …

Suggested: add a TIP above step 1, and reorder the NOTE so the preventive step comes before the recovery path.

[TIP]
====
If you haven't regenerated the bootstrap-user Secret yet, back up the current password first so you have it available for the resync:
[,bash]
----
kubectl --namespace <namespace> get secret <cluster-name>-bootstrap-user \
-o jsonpath='{.data.password}' | base64 -d
----
====

B. Frame the procedure as a recovery path users should try to avoid

Andrew was unambiguous (replies 13, 14, 17, 19) that regenerating the bootstrap-user Secret is fundamentally unsupported:
The current page reads like a neutral "here's how to do it", so readers scanning the TOC might conclude this is a supported workflow. It isn't; it's a recovery path for a situation they should have tried to prevent.

Suggested: lead the section with a CAUTION admonition that sets that expectation.

[CAUTION]
====
The bootstrap user is designed to be set once at cluster creation and remain long-lived. Avoid regenerating the `<cluster-name>-bootstrap-user` Secret when possible. Only use this procedure if the Secret has already been regenerated and you need to recover from the resulting authentication failure.
====

This pairs well with suggestion #3 in my earlier comment (moving the section under Troubleshooting or adding …

A. Explain why the operator can't do this automatically

Andrew's core technical justification (replies 19-20) is currently missing from the page:

The page currently says:

…but doesn't explain why not, which makes the limitation read like an oversight rather than a constraint inherent to SCRAM. One sentence closes the gap.

Suggested: extend line 213 to something like:

The Redpanda Operator does not resynchronize this password for you: changing a SCRAM password requires authenticating with the old password, and the bootstrap Secret only tracks one credential at a time. You must update the SCRAM database manually, using the old password to authenticate and the new password from the regenerated Secret as the target.

These are additive to the three suggestions in #1674 (comment) (quoting, broken recovery xref, heading placement), not replacements.
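
For illustration, the order of operations in the suggested wording (authenticate with the old password, set the new one, then switch credentials and verify) can be sketched as below. This is a dry-run sketch under stated assumptions, not the page's final procedure: the `run` wrapper, the `DRY_RUN` switch, the placeholder values, and the `SCRAM-SHA-256` default are hypothetical, and only the `rpk` invocations themselves are taken from this PR.

```shell
#!/bin/sh
# Assumption: DRY_RUN=1 (the default here) prints commands instead of running them.
# Note that dry-run output includes the new password in clear text.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical wrapper: echo the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# Placeholder values (hypothetical); set real ones in the environment beforehand.
RPK_USER="${RPK_USER:-<bootstrap-user>}"
RPK_SASL_MECHANISM="${RPK_SASL_MECHANISM:-SCRAM-SHA-256}"
RPK_PASS="${RPK_PASS:-<old-password>}"          # old password, still in the SCRAM database
RPK_NEW_PASS="${RPK_NEW_PASS:-<new-password>}"  # password from the regenerated Secret
export RPK_USER RPK_PASS RPK_SASL_MECHANISM

# 1. While still authenticating with the old password, set the new one.
run rpk acl user update "$RPK_USER" --mechanism "$RPK_SASL_MECHANISM" --new-password "$RPK_NEW_PASS"

# 2. Switch the client credentials to the new password.
export RPK_PASS="$RPK_NEW_PASS"

# 3. Confirm that authentication works with the new password.
run rpk cluster info
```

The ordering is the point of the sketch: step 1 only succeeds while the old password is still available, which is why the backup TIP above comes first.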
Reframes the procedure as a recovery path rather than a routine migration step, per Slack thread context from Andrew Stucki.

- Move the section under Troubleshooting as an H3 so the TOC no longer suggests it is a required migration step.
- Add a CAUTION explaining that the bootstrap user is designed to be long-lived and the Secret should not be regenerated.
- Add a TIP recommending that readers back up the current password before any operation that might regenerate the Secret.
- Explain the SCRAM-level reason the operator can't auto-resync (password change requires the old password; the Secret only holds one credential).
- Replace the broken recovery xref in the NOTE with realistic guidance (restore from backup or contact support).
- Quote shell variable expansions in the `rpk acl user update` and `export RPK_PASS` lines so special characters do not break the command.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Thanks @micheleRP, all six suggestions applied in …

From your first comment: …

From your second comment: …
Summary

- … Resynchronize the bootstrap user password section to the Helm-to-Operator migration page, describing the manual `rpk acl user update` workflow users must run when the Operator regenerates the bootstrap-user Secret.
- … `SASL_AUTHENTICATION_FAILED` after pod restart) so readers can recognize when this procedure applies.

Preview pages

Test plan

- … `migrate/kubernetes/helm-to-operator` page and confirm the new section renders between the migration procedure and the rollback section.
- … `rpk acl user update` block.
- … `manage:kubernetes/security/authentication/k-authentication.adoc` resolves.

🤖 Generated with Claude Code