operator: user adoption and add credential sync for externally-managed secrets#1438

Open
david-yu wants to merge 4 commits into main from fix/user-credential-sync-1354

Conversation

@david-yu
Contributor

@david-yu david-yu commented Apr 10, 2026

Closes #1354

Summary

The User controller has two related issues that prevent importing existing Redpanda users and block credential rotation via external secret management systems (like ESO + Azure Key Vault):

  1. Existing users are never adopted: SyncResource only calls Create when !hasUser, so applying a User CR for a pre-existing Redpanda user leaves it permanently at status.managedUser=false
  2. Managed users never get credential sync: even after the operator creates a user, subsequent reconciliation cycles (including the 5-minute drift correction) never re-read the password Secret or update Redpanda

This PR fixes both issues across three phases:

Phase 1: Fix user adoption

Changed: SyncResource branching logic in user_controller.go

The old code had a !hasUser && shouldManageUser gate that prevented adoption of existing users. The fix changes this to shouldManageUser && !hasManagedUser, which triggers the Create/upsert path regardless of whether the user already exists in Redpanda. This works because the underlying AlterUserSCRAMs with UpsertSCRAM is idempotent — it handles both create and update.

No opt-in needed — declaring spec.authentication is already the signal that the operator should manage the user.
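
The change in the gate can be illustrated with a minimal sketch. This is not the code from user_controller.go: the function names, the `action` type, and the two-flag signatures are hypothetical, chosen only to show how the two boolean conditions described above differ for a user that already exists in Redpanda.

```go
package main

import "fmt"

// action is a hypothetical label for what the reconciler would do next.
type action string

const (
	actionNone   action = "none"
	actionUpsert action = "upsert"
)

// oldDecision mirrors the old gate: the create path is only taken when the
// user does not exist in Redpanda yet, so pre-existing users are skipped.
func oldDecision(hasUser, shouldManageUser bool) action {
	if !hasUser && shouldManageUser {
		return actionUpsert
	}
	return actionNone
}

// newDecision mirrors the fixed gate: upsert whenever the CR declares
// authentication and the operator has not yet marked the user as managed.
func newDecision(shouldManageUser, hasManagedUser bool) action {
	if shouldManageUser && !hasManagedUser {
		return actionUpsert
	}
	return actionNone
}

func main() {
	// A user that already exists in Redpanda, with a User CR that declares
	// spec.authentication but has not been adopted yet.
	fmt.Println("old gate:", oldDecision(true, true))
	fmt.Println("new gate:", newDecision(true, false))
}
```

Under these assumptions the old gate yields no action for the pre-existing user, while the new gate takes the upsert path once and then stops as soon as the user is recorded as managed.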

Phase 2: Ongoing credential sync (opt-in)

New field: spec.authentication.syncCredentials (bool, default false)

When enabled, each reconciliation cycle re-reads the password from the referenced Secret and upserts credentials to Redpanda. This enables password rotation via external systems like ESO.

New method: users.Client.Update() — reads the current password from the Secret via Password.Fetch() (no generation or Secret creation) and upserts to Redpanda.

Phase 3: Immediate reconciliation on Secret changes

New index: Users are indexed by the password Secret they reference (spec.authentication.password.valueFrom.secretKeyRef.name)

New watch: A Watches(&corev1.Secret{}, ...) handler maps Secret changes to referencing User CRs and enqueues them. This means ESO-driven Secret updates trigger immediate reconciliation instead of waiting up to 5 minutes.

How to migrate existing users to operator management

Step-by-step: ESO + Azure Key Vault workflow

  1. Ensure credentials exist in Azure Key Vault — the username/password pair must already be provisioned

  2. Configure ESO to sync to a K8s Secret:

    apiVersion: external-secrets.io/v1beta1
    kind: ExternalSecret
    metadata:
      name: my-user-password
    spec:
      refreshInterval: 1h
      secretStoreRef:
        name: azure-keyvault
        kind: ClusterSecretStore
      target:
        name: my-user-password
      data:
        - secretKey: password
          remoteRef:
            key: my-redpanda-user-password
  3. Apply the User CR referencing the ESO-managed Secret:

    apiVersion: cluster.redpanda.com/v1alpha2
    kind: User
    metadata:
      name: my-user
    spec:
      cluster:
        clusterRef:
          name: my-cluster
      authentication:
        type: scram-sha-512
        password:
          valueFrom:
            secretKeyRef:
              name: my-user-password
              key: password
          noGenerate: true        # ESO manages the Secret, don't generate
        syncCredentials: true      # re-sync on each reconcile cycle
      authorization:
        acls:
          - type: allow
            resource:
              type: topic
              name: my-topic
            operations: [Read, Write, Describe]
  4. Verify adoption:

    kubectl get users my-user
    # NAME      SYNCED   MANAGING USER   MANAGING ACLS
    # my-user   True     true            true
  5. To rotate credentials: Update the secret in Azure Key Vault. ESO will sync the new value to the K8s Secret, the operator will detect the Secret change (via the new watch) and immediately reconcile, pushing the new password to Redpanda.

Step-by-step: Manual migration (no ESO)

  1. Create a K8s Secret with the current password:

    kubectl create secret generic my-user-password \
      --from-literal=password='current-password'
  2. Apply the User CR (same as step 3 above, but syncCredentials is optional since you're managing the Secret manually)

  3. To rotate: Update the Secret, then either wait for the 5-minute periodic reconcile or trigger a manual reconcile by annotating the User CR

Key flags

| Field | Purpose |
| --- | --- |
| password.noGenerate: true | Prevents operator from generating/overwriting the Secret; use when an external system (ESO) manages it |
| authentication.syncCredentials: true | Re-reads password from Secret on every reconcile and upserts to Redpanda; enables rotation |

Files changed

  • operator/api/redpanda/v1alpha2/user_types.go — add SyncCredentials field, ShouldSyncCredentials(), GetPasswordSecretName() helpers
  • operator/internal/controller/redpanda/user_controller.go — fix adoption logic, add credential sync branch, add Secret index + watch
  • operator/pkg/client/users/client.go — add Update() method
  • operator/internal/controller/redpanda/user_controller_test.go — add TestUserAdoptExisting and TestUserCredentialSync tests

TODO

  • Run make generate to regenerate apply configurations for the new SyncCredentials field
  • Integration test with real ESO setup
  • Consider adding a status condition or event when credential sync occurs

Test plan

  • TestUserAdoptExisting — pre-creates a user via Kafka admin API, then applies a User CR and verifies managedUser=true and the new password works
  • TestUserCredentialSync — creates a user with syncCredentials: true, rotates the password Secret, reconciles, and verifies the new password authenticates
  • Existing TestUserReconcile table tests continue to pass (adoption case added to table)
  • Manual verification with ESO + Azure Key Vault in a staging environment

🤖 Generated with Claude Code

@david-yu david-yu changed the title from "operator: fix user adoption and add credential sync for externally-managed secrets" to "operator: user adoption and add credential sync for externally-managed secrets" Apr 10, 2026
@github-actions

This PR is stale because it has been open 5 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions Bot added the stale label Apr 16, 2026
@david-yu david-yu removed the stale label Apr 16, 2026
@github-actions

This PR is stale because it has been open 5 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions Bot added the stale label Apr 22, 2026
@david-yu david-yu removed the stale label Apr 23, 2026
@github-actions

This PR is stale because it has been open 5 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@david-yu
Contributor Author

david-yu commented May 1, 2026

Azure Kubernetes Service — auth.sasl.secretRef + SCRAM-SHA-256

Re-did the end-to-end run with the bootstrap pattern from the Redpanda docs ("Use a Secret resource"), aligned the User CR with the chart's default mechanism (SCRAM-SHA-256), then re-ran the two scenarios from the PR description against a fresh AKS cluster. Both passed. The previous "redpanda-users not found" workaround is gone; the secretRef pattern is the right shape for this PR's flows. Resource group deleted at the end of the run. IDs/secrets redacted.

TL;DR

| # | Scenario | Result |
| --- | --- | --- |
| 1 | Adoption of a pre-existing Redpanda user: pre-create appuser via the admin REST API with SCRAM-SHA-256, then apply a User CR with type: scram-sha-256, noGenerate: true, syncCredentials: true referencing an ESO-synced Secret | PASS: Synced=True, managedUser: true, managedAcls: true; SCRAM-SHA-256 auth using the ESO-supplied password works |
| 2 | Credential rotation via Key Vault: az keyvault secret set to a new value, observe the chain push the change through to Redpanda | PASS: ESO refreshed the K8s Secret in ~29 s; the new password authenticates 18 s after the K8s Secret update; the old password is rejected with SASL_AUTHENTICATION_FAILED |

Changes vs the previous run

| | Previous run | This run |
| --- | --- | --- |
| auth.sasl.secretRef | unset (chart default redpanda-users, never created; pod stuck Init:0/3) | redpanda-superusers, pre-created with empty superusers.txt; auth.sasl.users: [] |
| Workaround Secret | manually created redpanda-users with users.txt: kubernetes-controller:<pwd>:SCRAM-SHA-256 and rescheduled rp-0 | none; pod went 2/2 Ready on first boot |
| User.spec.authentication.type | scram-sha-512 (mismatched the chart's bootstrap default of SCRAM-SHA-256) | scram-sha-256; matches BootstrapUser.GetMechanism() default in charts/redpanda/values.go:1410-1415 |
| Pre-created appuser algorithm | SCRAM-SHA-512 | SCRAM-SHA-256 |

The chart's bootstrap user (kubernetes-controller) defaults to SCRAM-SHA-256 today (env RPK_SASL_MECHANISM=SCRAM-SHA-256 confirmed on rp-0), so the User CR's type: scram-sha-256 keeps everything on a single mechanism.


Environment

Region:         eastus
Resource Group: claude-pr1438-retest-<redacted>   (deleted at end of run)
AKS:            pr1438-aks-<redacted>  — 3 x Standard_D2s_v3, K8s v1.34.6, OIDC + Workload Identity enabled
ACR:            <redacted>.azurecr.io
Key Vault:      pr1438-kv-<redacted>
Operator image: <ACR>/redpanda-operator:pr1438@sha256:cd362757d296e512ee38b57ec5ba805b19e5fb036cbb3ff42fcc0177a14e26ff   (built from 22b3a27b)
ESO:            v0.20.4
cert-manager:   v1.17.2
Operator chart: from this PR's branch (helm install …operator/chart)

Step-by-step (delta vs the previous run)

The only Redpanda-side changes are the SASL bootstrap and the User CR mechanism. Everything else (Workload Identity, ESO ClusterSecretStore, image build/push, operator chart install) is identical to the previous comment.

# Pre-create the superusers Secret per the docs.
# Empty superusers.txt is valid — kubernetes-controller is appended automatically
# by BootstrapUser.Username() in charts/redpanda/values.go.
kubectl -n redpanda create secret generic redpanda-superusers \
  --from-literal=superusers.txt=''

# Apply the Redpanda CR with the secretRef pattern.
kubectl -n redpanda apply -f - <<'YAML'
apiVersion: cluster.redpanda.com/v1alpha2
kind: Redpanda
metadata: {name: rp}
spec:
  chartRef: {}
  clusterSpec:
    statefulset:
      replicas: 1
      sideCars: {controllers: {enabled: true, createRBAC: true}}
    resources: {cpu: {cores: 1}, memory: {container: {max: 2Gi}}}
    storage: {persistentVolume: {enabled: true, size: 10Gi}}
    auth:
      sasl:
        enabled: true
        secretRef: redpanda-superusers   # docs-recommended
        users: []                         # empty: chart will not generate the secret
        # BootstrapUser defaults: username=kubernetes-controller, mechanism=SCRAM-SHA-256
    tls: {enabled: false}
    listeners: {kafka: {tls: {enabled: false}}, admin: {tls: {enabled: false}}}
YAML

rp-0 reaches 2/2 Ready in ~86 s with no manual Secret intervention.

# Confirm the bootstrap mechanism actually wired to SCRAM-SHA-256:
kubectl -n redpanda exec rp-0 -c redpanda -- sh -c 'echo $RPK_USER $RPK_SASL_MECHANISM'
# kubernetes-controller SCRAM-SHA-256

Test 1 — Adoption (pre-existing user, SCRAM-SHA-256)

BOOT_PWD=$(kubectl -n redpanda get secret rp-bootstrap-user -o jsonpath='{.data.password}' | base64 -d)

# Pre-create appuser with SCRAM-SHA-256 BEFORE any User CR exists
kubectl -n redpanda exec rp-0 -c redpanda -- curl -fsS -u "kubernetes-controller:${BOOT_PWD}" \
  -X POST -H "Content-Type: application/json" \
  -d '{"username":"appuser","password":"<INITIAL_PWD>","algorithm":"SCRAM-SHA-256"}' \
  http://rp.redpanda.svc.cluster.local:9644/v1/security/users
# Existing users right after this: ["appuser","kubernetes-controller"]

kubectl -n redpanda apply -f - <<'YAML'
apiVersion: cluster.redpanda.com/v1alpha2
kind: User
metadata: {name: appuser}
spec:
  cluster: {clusterRef: {name: rp}}
  authentication:
    type: scram-sha-256              # matches chart BootstrapUser default
    password:
      valueFrom:
        secretKeyRef: {name: appuser-password, key: password}
      noGenerate: true               # PR #1438 — ESO owns the Secret
    syncCredentials: true            # PR #1438 — re-read on every reconcile
  authorization:
    acls:
      - type: allow
        resource: {type: topic, name: test-topic}
        operations: [Read, Write, Describe, Create]
YAML

Result:

status:
  conditions:
  - lastTransitionTime: "2026-05-01T16:54:44Z"
    message: 'Successfully synced "appuser" to cluster.'
    observedGeneration: 1
    reason: Synced
    status: "True"
    type: Synced
  managedAcls: true
  managedUser: true            # <-- pre-existing user adopted (PR fix)
  observedGeneration: 1

$ rpk cluster info -X user=appuser -X pass='<INITIAL_PWD>' -X sasl.mechanism=SCRAM-SHA-256 ...
CLUSTER
=======
redpanda.<cluster-id>

BROKERS
=======
ID    HOST                                 PORT
0*    rp-0.rp.redpanda.svc.cluster.local.  9093

Test 2 — Credential rotation via Azure Key Vault

az keyvault secret set --vault-name "$KV" --name redpanda-user-password --value '<ROTATED_PWD>'

Verifiable timeline (raw timestamps):

| Time (UTC) | Event | Source |
| --- | --- | --- |
| 16:55:25Z | Rotation initiated (az keyvault secret set) | test script |
| 16:55:30Z | Key Vault confirms write | az response |
| 16:55:59Z | K8s Secret/appuser-password updated to <ROTATED_PWD> (resourceVersion 5757 → 6973); ~29 s after KV write at refreshInterval: 1m | poll of kubectl get secret |
| 16:56:00Z | UserReconciler.Reconcile fires the next iteration | operator log |
| 16:56:17Z | First poll-loop iteration that observes Redpanda accepting <ROTATED_PWD>; ~18 s after the K8s Secret update | test script |
| 16:56:17Z | rpk cluster info with the old password <INITIAL_PWD> returns SASL_AUTHENTICATION_FAILED: Invalid credentials | test script |

User CR after rotation (Synced=True is preserved — the controller upserted the new SCRAM credential without flipping the condition, so lastTransitionTime stays at the Test 1 sync time):

status:
  conditions:
  - lastTransitionTime: "2026-05-01T16:54:44Z"
    message: 'Successfully synced "appuser" to cluster.'
    reason: Synced
    status: "True"
    type: Synced
  managedAcls: true
  managedUser: true
  observedGeneration: 1

Caveats / observations not specific to this PR

  1. Bootstrap-user redpanda-users Secret pre-mount (resolved by docs pattern). The previous comment hit MountVolume.SetUp failed for volume "users" : secret "redpanda-users" not found. That was caused by the chart's auth.sasl.secretRef defaulting to "redpanda-users" when the user enables SASL but doesn't override it (see charts/redpanda/chart/values.yaml:164 and the users volume in charts/redpanda/helpers.go:139-145, 218-227). Following the docs and pre-creating a Secret with auth.sasl.secretRef: <name> + users: [] is the correct shape and removes the workaround entirely.
  2. Bootstrap-user mechanism mismatch (resolved by aligning to SCRAM-SHA-256). The previous comment hit SASL_AUTHENTICATION_FAILED on the User controller's Kafka path because the User CR was type: scram-sha-512 while BootstrapUser.GetMechanism() defaults to SCRAM-SHA-256 (charts/redpanda/values.go:1410-1415). With type: scram-sha-256, both the chart's bootstrap user and the User CR are on a single mechanism and the Kafka path stays clean.
  3. Reconcile cadence with syncCredentials: true is much tighter than expected. With only one User CR in the cluster, UserReconciler.Reconcile ran ~39 times/min (≈ once every 1.5 s) for the entire run. The framework's periodic timer is set to 5 min (PeriodicallyReconcile(5 * time.Minute)), so this isn't the periodic loop — it looks like the always-upsert path under syncCredentials: true is keeping itself enqueued, possibly via the new Secret watch reacting to a self-write or via the apply patch dirtying the resource version each cycle. The credential rotation in Test 2 still propagated correctly, but at this rate the Kafka admin path is being exercised every ~1.5 s per User CR. Resolved upstream and rebased afterwards.

Cleanup

az group delete --name claude-pr1438-retest-<redacted> --yes --no-wait

(Done — RG delete initiated before posting this update.)

🤖 Generated with Claude Code

david-yu and others added 3 commits May 1, 2026 10:05
…naged secrets

Fixes #1354. The User controller previously only created SCRAM
credentials when the user did not already exist in Redpanda, which
meant applying a User CR for a pre-existing user left it permanently
unmanaged (status.managedUser=false). This also meant password
rotation via Secret updates was never reconciled.

Phase 1 — Adoption: Remove the !hasUser gate so that UpsertSCRAM
(which is idempotent) handles both new and existing users whenever
spec.authentication is declared.

Phase 2 — Credential sync: Add spec.authentication.syncCredentials
(opt-in bool). When enabled, each reconciliation cycle re-reads the
password from the referenced Secret and upserts it to Redpanda,
enabling external rotation via ESO or similar tools.

Phase 3 — Secret watch: Index Users by their referenced password
Secret and add a Watches handler so that external Secret changes
(e.g. from ESO) trigger immediate reconciliation instead of waiting
for the 5-minute periodic cycle.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…dentials field

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@david-yu david-yu force-pushed the fix/user-credential-sync-1354 branch from 22b3a27 to 044cf76 Compare May 1, 2026 17:08
@david-yu
Contributor Author

david-yu commented May 1, 2026

Rebased onto main — picks up the hot-reconcile-loop fix

Rebased fix/user-credential-sync-1354 onto main (force-with-lease push, new tip 044cf76f). Clean rebase, no conflicts — none of this PR's files were touched on main since the branch base.

The ~39 reconciles/min cadence I flagged in the previous comment is fixed by #1460 (c4c1ea3f — operator: Fix hot reconcile loop on healthy clusters), which is now included via the rebase. The bug was in setStatusCondition: rate >= 0 && time.Since(...) > 0 was always true for zero-rate conditions, so every reconcile bumped LastTransitionTime, wrote status, and re-enqueued the resource immediately. The fix changes the guard to rate > 0, so zero-rate conditions are dirty only when Status/Reason/Message/ObservedGeneration actually changes. With syncCredentials: true always going through the upsert + status-patch path, this PR was particularly good at exposing the bug.

Verified locally:

  • go test ./operator/internal/statuses/... -run TestSetStatusCondition -v — all 8 rate-limit tests pass, including the regressions for the bug (*_RateZero_NoChangeWhenIdentical, *_RateLimited_NoHeartbeatBeforeElapsed, etc.)
  • go build ./operator/... && go vet ./operator/internal/controller/redpanda/... ./operator/api/redpanda/v1alpha2/... ./operator/pkg/client/users/... — clean

go-licenses fetches license URLs at generate time by following the gopkg.in
go-import meta redirect to GitHub. The fetcher's per-URL timeout is short, so
on slow networks (and in CI) some or all of the five gopkg.in/* deps degrade
to "Unknown" and `git diff --exit-code` fails the lint step.

These five deps are stable and rarely bumped. Hardcode the URL pattern in the
template using {{ .Version }} so generation is deterministic regardless of
whether the network call succeeds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@david-yu
Contributor Author

david-yu commented May 1, 2026

Ready for review; tested end to end.

@david-yu david-yu marked this pull request as ready for review May 1, 2026 18:01
Development

Successfully merging this pull request may close these issues.

Cannot import existing Redpanda users into operator management (managedUser forced to false)
