From 2908fa12c2b4fd93768dc19c4ef8249d5c2666e5 Mon Sep 17 00:00:00 2001 From: Ed Milic Date: Mon, 6 Apr 2026 07:18:31 -0400 Subject: [PATCH 01/42] docs: add ADRs for OpenFGA operator proposal Propose adopting a Kubernetes operator for OpenFGA lifecycle management, covering migration handling, declarative CRDs for stores/models/tuples, and the operator deployment model as a Helm subchart. Also adds the ADR process documentation, template, and chart analysis. ADR-001: Adopt a Kubernetes Operator ADR-002: Operator-Managed Migrations ADR-003: Declarative Store Lifecycle CRDs ADR-004: Operator Deployment as Helm Subchart --- docs/adr/000-template.md | 48 ++++ docs/adr/001-adopt-openfga-operator.md | 95 ++++++++ docs/adr/002-operator-managed-migrations.md | 215 ++++++++++++++++++ .../003-declarative-store-lifecycle-crds.md | 199 ++++++++++++++++ docs/adr/004-operator-deployment-model.md | 167 ++++++++++++++ docs/adr/README.md | 180 +++++++++++++++ 6 files changed, 904 insertions(+) create mode 100644 docs/adr/000-template.md create mode 100644 docs/adr/001-adopt-openfga-operator.md create mode 100644 docs/adr/002-operator-managed-migrations.md create mode 100644 docs/adr/003-declarative-store-lifecycle-crds.md create mode 100644 docs/adr/004-operator-deployment-model.md create mode 100644 docs/adr/README.md diff --git a/docs/adr/000-template.md b/docs/adr/000-template.md new file mode 100644 index 0000000..2cc78bb --- /dev/null +++ b/docs/adr/000-template.md @@ -0,0 +1,48 @@ +# ADR-NNN: Title + +- **Status:** Proposed +- **Date:** YYYY-MM-DD +- **Deciders:** [list of people involved] +- **Related Issues:** # +- **Related ADR:** [ADR-NNN](NNN-filename.md) + +## Context + +What is the problem or situation that motivates this decision? What constraints exist? What forces are at play? + +Include enough background that someone unfamiliar with the project can understand why this decision matters. + +## Decision + +What is the change being proposed or decided? 
+ +### Alternatives Considered + +**A. [Alternative name]** + +[Description of the alternative] + +*Pros:* ... +*Cons:* ... + +**B. [Alternative name]** + +[Description of the alternative] + +*Pros:* ... +*Cons:* ... + +## Consequences + +### Positive + +- What improves as a result of this decision? + +### Negative + +- What gets harder, more complex, or more costly? + +### Risks + +- What assumptions might prove false? +- What could go wrong? diff --git a/docs/adr/001-adopt-openfga-operator.md b/docs/adr/001-adopt-openfga-operator.md new file mode 100644 index 0000000..ca84553 --- /dev/null +++ b/docs/adr/001-adopt-openfga-operator.md @@ -0,0 +1,95 @@ +# ADR-001: Adopt a Kubernetes Operator for OpenFGA Lifecycle Management + +- **Status:** Proposed +- **Date:** 2026-04-06 +- **Deciders:** OpenFGA Helm Charts maintainers +- **Related Issues:** #211, #107, #120, #100, #95, #126, #132, #143, #144 + +## Context + +The OpenFGA Helm chart currently handles all lifecycle concerns — deployment, configuration, database migrations, and secret management — through Helm templates and hooks. This approach works for simple installations but breaks down in several important scenarios: + +1. **Database migrations rely on Helm hooks**, which are incompatible with GitOps tools (ArgoCD, FluxCD) and Helm's own `--wait` flag. This is the single biggest pain point for users, accounting for 6 open issues (#211, #107, #120, #100, #95, #126). + +2. **Store provisioning, authorization model updates, and tuple management** are runtime operations that happen through the OpenFGA API. There is no declarative, GitOps-native way to manage these. Teams must use imperative scripts, CI pipelines, or manual API calls to set up stores and push models after deployment. + +3. **The migration init container** depends on `groundnuty/k8s-wait-for`, an unmaintained image with known CVEs, pinned by mutable tag (#132, #144). + +4. 
**Migration and runtime workloads share a single ServiceAccount**, violating least-privilege when cloud IAM-based database authentication (AWS IRSA, GCP Workload Identity) maps the ServiceAccount directly to a database role (#95). + +### Alternatives Considered + +**A. Fix migrations within the Helm chart (no operator)** + +- Strip Helm hook annotations from the migration Job by default, rendering it as a regular resource. +- Replace `k8s-wait-for` with a shell-based init container that polls the database schema version directly. +- Add a separate ServiceAccount for the migration Job. + +*Pros:* Lower complexity, no new component to maintain. +*Cons:* Doesn't solve the ordering problem cleanly — the Job and Deployment are created simultaneously, requiring an init container to gate startup. Still requires an image or script to poll. Doesn't address store/model/tuple lifecycle at all. + +**B. Recommend initContainer mode as default** + +- Change `datastore.migrationType` default from `"job"` to `"initContainer"`, running migrations inside each pod. + +*Pros:* No separate Job, no hooks, no `k8s-wait-for`. +*Cons:* Every pod runs migrations on startup (wasteful). Rolling updates trigger redundant migrations. Crash-loops on migration failure. Still shares ServiceAccount. No path to store lifecycle management. + +**C. Build an operator (selected)** + +- A Kubernetes operator manages migrations as internal reconciliation logic and exposes CRDs for store, model, and tuple lifecycle. + +*Pros:* Solves all migration issues. Enables GitOps-native authorization management. Follows established Kubernetes patterns (CNPG, Strimzi, cert-manager). Separates concerns cleanly. +*Cons:* Significant development and maintenance investment. New component to deploy and monitor. Learning curve for contributors. + +**D. External migration tool (e.g., Flyway, golang-migrate)** + +- Remove migrations from the chart entirely and document using an external tool. 
+ +*Pros:* Simplifies the chart completely. +*Cons:* Shifts complexity to the user. Every user must build their own migration pipeline. No standard approach across the community. + +## Decision + +We will build an **OpenFGA Kubernetes Operator** that handles: + +1. **Database migration orchestration** (Stage 1) — replacing Helm hooks, the `k8s-wait-for` init container, and shared ServiceAccount with operator-managed migration Jobs and deployment readiness gating. + +2. **Declarative store lifecycle management** (Stages 2-4) — exposing `FGAStore`, `FGAModel`, and `FGATuples` CRDs for GitOps-native authorization configuration. + +The operator will be: +- Written in Go using `controller-runtime` / kubebuilder +- Distributed as a Helm subchart dependency of the main OpenFGA chart +- Optional — users who don't need it can set `operator.enabled: false` and fall back to the existing behavior + +Development will follow a staged approach to deliver value incrementally: + +| Stage | Scope | Outcome | +|-------|-------|---------| +| 1 | Operator scaffolding + migration handling | All 6 migration issues resolved | +| 2 | `FGAStore` CRD | Declarative store provisioning | +| 3 | `FGAModel` CRD | Declarative authorization model management | +| 4 | `FGATuples` CRD | Declarative tuple management | + +## Consequences + +### Positive + +- **Resolves all 6 migration issues** (#211, #107, #120, #100, #95, #126) and related dependency issues (#132, #144) +- **Eliminates `k8s-wait-for` dependency** — removes an unmaintained, CVE-carrying image from the supply chain +- **Enables GitOps-native authorization management** — stores, models, and tuples become declarative Kubernetes resources that ArgoCD/FluxCD can sync +- **Enforces least-privilege** — separate ServiceAccounts for migration (DDL) and runtime (CRUD) +- **Simplifies the Helm chart** — removes migration Job template, init container logic, RBAC for job-status-reading, and hook annotations +- **Follows Kubernetes ecosystem 
conventions** — operators are the standard pattern for managing stateful application lifecycle + +### Negative + +- **New component to maintain** — the operator is a full Go project with its own release cycle, CI, testing, and CVE surface +- **Increased deployment footprint** — an additional pod running in the cluster (though resource requirements are minimal: ~50m CPU, ~64Mi memory) +- **Learning curve** — contributors need to understand controller-runtime patterns to modify the operator +- **CRD management complexity** — Helm does not upgrade or delete CRDs; users may need to apply CRD manifests separately on operator upgrades + +### Neutral + +- **Backward compatibility preserved** — the `operator.enabled: false` fallback maintains the existing Helm hook behavior for users who haven't migrated +- **No change for memory-datastore users** — users running with `datastore.engine: memory` are unaffected (no migrations, no operator needed) diff --git a/docs/adr/002-operator-managed-migrations.md b/docs/adr/002-operator-managed-migrations.md new file mode 100644 index 0000000..8fb0cd7 --- /dev/null +++ b/docs/adr/002-operator-managed-migrations.md @@ -0,0 +1,215 @@ +# ADR-002: Replace Helm Hook Migrations with Operator-Managed Migrations + +- **Status:** Proposed +- **Date:** 2026-04-06 +- **Deciders:** OpenFGA Helm Charts maintainers +- **Related ADR:** [ADR-001](001-adopt-openfga-operator.md) +- **Related Issues:** #211, #107, #120, #100, #95, #126, #132, #144 + +## Context + +### How Migrations Work Today + +The current Helm chart uses a **Helm hook Job** to run database migrations (`openfga migrate`) and a **`k8s-wait-for` init container** on the Deployment to block server startup until the migration completes. 
+ +Seven files are involved: + +| File | Role | +|------|------| +| `templates/job.yaml` | Migration Job with Helm hook annotations | +| `templates/deployment.yaml` | OpenFGA Deployment + `wait-for-migration` init container | +| `templates/serviceaccount.yaml` | Shared ServiceAccount (migration + runtime) | +| `templates/rbac.yaml` | Role + RoleBinding so init container can poll Job status | +| `templates/_helpers.tpl` | Datastore environment variable helpers | +| `values.yaml` | `datastore.*`, `migrate.*`, `initContainer.*` configuration | +| `Chart.yaml` | `bitnami/common` dependency for migration sidecars | + +**The migration Job** (`templates/job.yaml`) is annotated as a Helm hook: + +```yaml +annotations: + "helm.sh/hook": post-install,post-upgrade,post-rollback,post-delete + "helm.sh/hook-delete-policy": before-hook-creation + "helm.sh/hook-weight": "1" +``` + +This means Helm manages it outside the normal release lifecycle — it only runs after Helm finishes creating/upgrading all other resources. + +**The wait-for init container** blocks the Deployment pods from starting: + +```yaml +initContainers: + - name: wait-for-migration + image: "groundnuty/k8s-wait-for:v2.0" + args: ["job-wr", "openfga-migrate"] +``` + +It polls the Kubernetes API (`GET /apis/batch/v1/.../jobs/openfga-migrate`) until `.status.succeeded >= 1`. This requires RBAC permissions (Role/RoleBinding for `batch/jobs` `get`/`list`). + +**The alternative mode** (`datastore.migrationType: initContainer`) runs migration directly inside each Deployment pod as an init container, avoiding hooks entirely but introducing redundant migration runs across replicas. + +### The Six Issues + +| Issue | Tool | Root Cause | +|-------|------|-----------| +| **#211** | ArgoCD | ArgoCD ignores Helm hook annotations. The migration Job is never created as a managed resource. The init container waits forever for a Job that doesn't exist. | +| **#107** | ArgoCD | Same root cause. 
The Job is invisible in ArgoCD's UI — users can't see, debug, or manually sync it. | +| **#120** | Helm `--wait` | Circular deadlock. Helm waits for the Deployment to be ready before running post-install hooks. The Deployment is never ready because the init container waits for the hook Job. The Job never runs because Helm is waiting. | +| **#100** | FluxCD | FluxCD waits for all resources by default. The `hook-delete-policy: before-hook-creation` removes the completed Job before FluxCD can confirm the Deployment is healthy. | +| **#95** | AWS IRSA | Migration and runtime share a ServiceAccount. With IAM-based DB auth, the runtime gets DDL permissions it doesn't need (CREATE TABLE, ALTER TABLE). | +| **#126** | All | The `k8s-wait-for` image is configured in two separate places in `values.yaml`, leading to inconsistency. Related: #132 (image unmaintained, has CVEs) and #144 (pinned by mutable tag). | + +### Why Helm Hooks Are Fundamentally Wrong for This + +Helm hooks are a **deploy-time orchestration mechanism**. They assume Helm is the active agent running the deployment. GitOps tools (ArgoCD, FluxCD) break this assumption — they render the chart to manifests and apply them declaratively. The hook annotations are either ignored (ArgoCD) or cause ordering/cleanup conflicts (FluxCD). + +This is not a bug in ArgoCD or FluxCD. It is a fundamental mismatch between Helm's imperative hook model and the declarative GitOps model. + +## Decision + +Replace the Helm hook migration Job and `k8s-wait-for` init container with **operator-managed migrations** as part of Stage 1 of the OpenFGA Operator (see [ADR-001](001-adopt-openfga-operator.md)). + +### How It Works + +The operator runs a **migration controller** that reconciles the OpenFGA Deployment: + +``` +┌────────────────────────────────────────────────────────┐ +│ Operator Reconciliation │ +│ │ +│ 1. Read Deployment → extract image tag (e.g. v1.14.0) │ +│ 2. 
Read ConfigMap/openfga-migration-status │ +│ └── "Last migrated version: v1.13.0" │ +│ 3. Versions differ → migration needed │ +│ 4. Create Job/openfga-migrate │ +│ ├── ServiceAccount: openfga-migrator (DDL perms) │ +│ ├── Image: openfga/openfga:v1.14.0 │ +│ ├── Args: ["migrate"] │ +│ └── ttlSecondsAfterFinished: 300 │ +│ 5. Watch Job until succeeded │ +│ 6. Update ConfigMap → "version: v1.14.0" │ +│ 7. Scale Deployment replicas: 0 → 3 │ +│ 8. OpenFGA pods start, serve requests │ +└────────────────────────────────────────────────────────┘ +``` + +**Key design decisions within this approach:** + +#### Deployment starts at replicas: 0 + +The Helm chart renders the Deployment with `replicas: 0` when `operator.enabled: true`. The operator scales it up only after migration succeeds. This is simpler than readiness gates or admission webhooks, and ensures no pods run against an unmigrated schema. + +#### Version tracking via ConfigMap + +A ConfigMap (`openfga-migration-status`) records the last successfully migrated version. The operator compares this to the Deployment's image tag to determine if migration is needed. This is: +- Simple to inspect (`kubectl get configmap openfga-migration-status -o yaml`) +- Survives operator restarts +- Can be manually deleted to force re-migration + +#### Separate ServiceAccount for migrations + +The operator creates a dedicated `openfga-migrator` ServiceAccount for migration Jobs. Users can annotate it with cloud IAM roles that grant DDL permissions, while the runtime ServiceAccount retains only CRUD permissions. + +#### Migration Job is a regular resource + +The Job created by the operator has no Helm hook annotations. It is a standard Kubernetes Job, visible to ArgoCD, FluxCD, and all Kubernetes tooling. It has an owner reference to the operator's managed resource for proper garbage collection. 
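
Putting the pieces above together, the Job created by the operator might look like the following sketch. The names (`openfga-migrate`, `openfga-migrator`) and values mirror the reconciliation diagram and the failure-handling defaults; this is illustrative, not a committed manifest:

```yaml
# Illustrative only: a migration Job as the operator might create it.
# Names and values mirror the reconciliation diagram; the real manifest
# may differ. Note the absence of helm.sh/hook annotations: this is a
# regular resource, visible to ArgoCD/FluxCD, garbage-collected via an
# owner reference set by the operator.
apiVersion: batch/v1
kind: Job
metadata:
  name: openfga-migrate
spec:
  ttlSecondsAfterFinished: 300
  activeDeadlineSeconds: 300   # kills hung migrations (see failure handling below)
  backoffLimit: 3              # assumed default, configurable in values
  template:
    spec:
      serviceAccountName: openfga-migrator   # DDL permissions only
      restartPolicy: Never
      containers:
        - name: migrate
          image: openfga/openfga:v1.14.0     # matches the Deployment's image tag
          args: ["migrate"]
```
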
+ +#### Failure handling + +| Failure | Behavior | +|---------|----------| +| Job fails | Operator sets `MigrationFailed` condition on Deployment. Does NOT scale up. User inspects Job logs. | +| Job hangs | `activeDeadlineSeconds` (default 300s) kills it. Operator sees failure. | +| Operator crashes | On restart, re-reads ConfigMap and Job status. Resumes from where it left off. | +| Database unreachable | Job fails to connect. Operator retries on next reconciliation (exponential backoff). | + +### Sequence Comparison + +**Before (Helm hooks):** + +``` +helm install + ├── Create ServiceAccount, RBAC, Secret, Service + ├── Create Deployment (with wait-for-migration init container) + │ └── Pod starts → init container polls for Job → waits... + ├── [Helm finishes regular resources] + ├── Run post-install hooks: + │ └── Create Job/openfga-migrate → runs openfga migrate + │ └── Job succeeds + ├── Init container sees Job succeeded → exits + └── Main container starts +``` + +Problems: ArgoCD skips step 4. FluxCD deletes Job in step 4. `--wait` deadlocks between steps 2 and 4. + +**After (operator-managed):** + +``` +helm install + ├── Create ServiceAccount (runtime), ServiceAccount (migrator) + ├── Create Secret, Service + ├── Create Deployment (replicas: 0, no init containers) + ├── Create Operator Deployment + └── [Helm is done — all resources are regular, no hooks] + +Operator starts: + ├── Detects Deployment image version + ├── No migration status ConfigMap → migration needed + ├── Creates Job/openfga-migrate (regular Job, no hooks) + │ └── Uses openfga-migrator ServiceAccount + │ └── Runs openfga migrate → succeeds + ├── Creates ConfigMap with migrated version + └── Scales Deployment to 3 replicas → pods start +``` + +No hooks. No init containers. No `k8s-wait-for`. All resources are regular Kubernetes objects. 
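
The version-tracking ConfigMap that drives this flow is deliberately simple. A sketch, where the `version` key follows the reconciliation diagram and the exact schema is an assumption:

```yaml
# Sketch of the migration-status ConfigMap after a successful run.
# The data key name is illustrative, not a committed schema.
apiVersion: v1
kind: ConfigMap
metadata:
  name: openfga-migration-status
data:
  version: "v1.14.0"
```

Deleting this ConfigMap forces a re-migration on the next reconciliation, which is safe because `openfga migrate` is idempotent.
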
+ +### What Changes in the Helm Chart + +**Removed:** + +| File/Section | Reason | +|--------------|--------| +| `templates/job.yaml` | Operator creates migration Jobs | +| `templates/rbac.yaml` | No init container polling Job status | +| `values.yaml`: `initContainer.repository`, `initContainer.tag` | `k8s-wait-for` eliminated | +| `values.yaml`: `datastore.migrationType` | Operator always uses Job internally | +| `values.yaml`: `datastore.waitForMigrations` | Operator handles ordering | +| `values.yaml`: `migrate.annotations` (hook annotations) | No Helm hooks | +| Deployment init containers for migration | Operator manages readiness via replica scaling | + +**Added:** + +| File/Section | Purpose | +|--------------|---------| +| `values.yaml`: `operator.enabled` | Toggle operator subchart | +| `values.yaml`: `migration.serviceAccount.*` | Separate ServiceAccount for migration Jobs | +| `values.yaml`: `migration.timeout`, `backoffLimit`, `ttlSecondsAfterFinished` | Migration Job configuration | +| `templates/serviceaccount.yaml`: second SA | Migration ServiceAccount | +| `charts/openfga-operator/` | Operator subchart | + +**Preserved (backward compatible):** + +When `operator.enabled: false`, the chart falls back to the current behavior — Helm hooks, `k8s-wait-for` init container, shared ServiceAccount. This allows gradual adoption. 
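
Taken together, the values changes above might look like this in a user's override file. The nesting and defaults are a sketch based on the "Added" table, not final API:

```yaml
# Sketch of the new values surface (nesting and defaults assumed).
operator:
  enabled: true                  # false falls back to the legacy hook behavior
migration:
  serviceAccount:
    create: true
    name: openfga-migrator
    annotations: {}              # e.g. an IRSA/Workload Identity role with DDL rights
  timeout: 300                   # seconds; becomes the Job's activeDeadlineSeconds
  backoffLimit: 3
  ttlSecondsAfterFinished: 300
```
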
+ +## Consequences + +### Positive + +- **All 6 migration issues resolved** — no Helm hooks means no ArgoCD/FluxCD/`--wait` incompatibility +- **`k8s-wait-for` eliminated** — removes an unmaintained image with CVEs from the supply chain (#132, #144) +- **Least-privilege enforced** — separate ServiceAccounts for migration (DDL) and runtime (CRUD) (#95) +- **Helm chart simplified** — 2 templates removed, init container logic removed, RBAC for job-watching removed +- **Migration is observable** — Job is a regular resource visible in all tools; ConfigMap records migration history; operator conditions surface errors +- **Idempotent and crash-safe** — operator can restart at any point and resume correctly + +### Negative + +- **Operator is a new runtime dependency** — if the operator pod is unavailable, migrations don't run (but existing running pods are unaffected) +- **Replica scaling model** — starting at `replicas: 0` means a brief period where the Deployment exists but has no pods; monitoring tools may flag this +- **Two upgrade paths to document** — `operator.enabled: true` (new) vs `operator.enabled: false` (legacy) + +### Risks + +- **Zero-downtime upgrades** — the initial implementation scales to 0 during migration, causing brief downtime. A future enhancement can support rolling upgrades where the new schema is backward-compatible, but this is explicitly out of scope for Stage 1. +- **ConfigMap as state store** — if the ConfigMap is accidentally deleted, the operator re-runs migration (which is safe — `openfga migrate` is idempotent). This is a feature, not a bug, but should be documented. 
diff --git a/docs/adr/003-declarative-store-lifecycle-crds.md b/docs/adr/003-declarative-store-lifecycle-crds.md new file mode 100644 index 0000000..a54ee44 --- /dev/null +++ b/docs/adr/003-declarative-store-lifecycle-crds.md @@ -0,0 +1,199 @@ +# ADR-003: Declarative Store Lifecycle Management via CRDs + +- **Status:** Proposed +- **Date:** 2026-04-06 +- **Deciders:** OpenFGA Helm Charts maintainers +- **Related ADR:** [ADR-001](001-adopt-openfga-operator.md) + +## Context + +OpenFGA is an authorization service. After deploying the server, teams must perform several runtime operations to make it usable: + +1. **Create a store** — a logical container for authorization data +2. **Write an authorization model** — the DSL that defines types, relations, and permissions +3. **Write tuples** — the relationship data that the model operates on (e.g., "user:anne is owner of document:budget") + +Today, these operations happen outside Kubernetes — through the OpenFGA API, CLI (`fga`), or custom scripts in CI pipelines. There is no declarative, Kubernetes-native way to manage them. + +This creates several problems: + +- **No GitOps for authorization config** — authorization models live in scripts or API calls, not in version-controlled manifests that ArgoCD/FluxCD sync. +- **No drift detection** — if someone modifies a model or tuple via the API, there's no controller to detect and reconcile the change. +- **No cross-team ownership** — each team that uses OpenFGA must build their own tooling to manage stores and models. There's no standard pattern. +- **Manual coordination** — deploying a new version of an application that needs a model change requires coordinating the Helm upgrade with a separate model push. + +### Alternatives Considered + +**A. CLI wrapper in CI pipelines** + +Use the `fga` CLI in a CI/CD step after `helm upgrade` to create stores, push models, and write tuples. + +*Pros:* No new Kubernetes components. Works with any CI system. 
+*Cons:* Imperative, not declarative. No drift detection. Each team builds their own pipeline. Model changes are not atomic with deployments. No visibility in Kubernetes tooling. + +**B. Helm post-install hook Job** + +Add a Helm hook Job that runs `fga` CLI commands after installation. + +*Pros:* Stays within the Helm ecosystem. +*Cons:* Helm hooks are the exact problem we're solving in ADR-002. Same ArgoCD/FluxCD incompatibilities. Hook Jobs are fire-and-forget with no reconciliation. + +**C. CRDs managed by the operator (selected)** + +Expose `FGAStore`, `FGAModel`, and `FGATuples` as Custom Resource Definitions. The operator watches these resources and reconciles them against the OpenFGA API. + +*Pros:* Fully declarative. GitOps-native. Continuous reconciliation. Standard Kubernetes patterns. Teams own their auth config as manifests. +*Cons:* Requires the operator (ADR-001). CRD design and reconciliation logic add development scope. Tuple reconciliation is complex. + +## Decision + +Introduce three CRDs, built in stages after the migration handling (ADR-002) is complete: + +### Stage 2: FGAStore + +```yaml +apiVersion: openfga.dev/v1alpha1 +kind: FGAStore +metadata: + name: my-app + namespace: my-team +spec: + # Reference to the OpenFGA instance + openfgaRef: + url: openfga.openfga-system.svc:8081 + credentialsRef: + name: openfga-api-credentials # Secret with API key or client credentials + # Store display name + name: "my-app-store" +status: + storeId: "01HXYZ..." 
+ ready: true + conditions: + - type: Ready + status: "True" + lastTransitionTime: "2026-04-06T12:00:00Z" +``` + +**Controller behavior:** +- On create: call `CreateStore` API, store the returned store ID in `.status.storeId` +- On delete: call `DeleteStore` API (with finalizer to ensure cleanup) +- Idempotent: if a store with the same name exists, adopt it rather than creating a duplicate +- Status: set `Ready` condition when store is confirmed to exist + +### Stage 3: FGAModel + +```yaml +apiVersion: openfga.dev/v1alpha1 +kind: FGAModel +metadata: + name: my-app-model + namespace: my-team +spec: + storeRef: + name: my-app # References an FGAStore in the same namespace + model: | + model + schema 1.1 + type user + type organization + relations + define member: [user] + define admin: [user] + type document + relations + define reader: [user, organization#member] + define writer: [user, organization#admin] + define owner: [user] +status: + modelId: "01HABC..." + ready: true + lastWrittenHash: "sha256:a1b2c3..." # Hash of the model DSL to detect changes + conditions: + - type: Ready + status: "True" + - type: InSync + status: "True" +``` + +**Controller behavior:** +- On create/update: hash the model DSL. 
If hash differs from `.status.lastWrittenHash`, call `WriteAuthorizationModel` API +- Store the returned model ID in `.status.modelId` +- Model writes are append-only in OpenFGA (each write creates a new version), so this is safe +- Validation: optionally validate DSL syntax before calling the API (fail-fast with a clear error condition) +- The controller does NOT delete old model versions — OpenFGA retains model history + +### Stage 4: FGATuples + +```yaml +apiVersion: openfga.dev/v1alpha1 +kind: FGATuples +metadata: + name: my-app-base-tuples + namespace: my-team +spec: + storeRef: + name: my-app + tuples: + - user: "user:anne" + relation: "owner" + object: "document:budget" + - user: "team:engineering#member" + relation: "reader" + object: "folder:engineering-docs" + - user: "organization:acme#admin" + relation: "writer" + object: "folder:engineering-docs" +status: + writtenCount: 3 + ready: true + lastReconciled: "2026-04-06T12:00:00Z" + conditions: + - type: Ready + status: "True" + - type: InSync + status: "True" +``` + +**Controller behavior:** +- Maintain an **ownership model** — the controller tracks which tuples it wrote (via annotations or a status field). It only manages tuples it owns, never deleting tuples written by the application at runtime. +- On reconciliation: diff the desired tuples (from spec) against owned tuples in the store + - Tuples in spec but not in store → write them + - Tuples in store (owned) but not in spec → delete them + - Tuples in store but not owned → leave them alone +- Pagination: handle large tuple sets that exceed API response limits +- Batching: use `Write` API with batch operations to minimize API calls + +**Scope limitation:** `FGATuples` is intended for **base/static tuples** — organizational structure, role assignments, resource hierarchies. It is NOT intended to replace application-level tuple writes for dynamic data (e.g., per-request access grants). The ownership model ensures these two concerns don't interfere. 
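
A worked example of the reconciliation diff described above, using a `user relation object` shorthand rather than API syntax (illustrative only):

```
desired (spec):   anne   owner   document:budget
                  bob    reader  document:budget
owned (tracked):  anne   owner   document:budget
                  carol  reader  document:budget
in store:         anne   owner   document:budget
                  carol  reader  document:budget
                  svc    writer  document:runtime   <- written by the application

reconcile:
  write  bob   reader document:budget     (in spec, missing from store)
  delete carol reader document:budget     (owned, no longer in spec)
  keep   svc   writer document:runtime    (not owned, never touched)
```
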
+ +### CRD Design Principles + +1. **Namespace-scoped** — all CRDs are namespaced, allowing teams to manage their own stores/models/tuples in their namespace +2. **Reference-based** — `FGAModel` and `FGATuples` reference an `FGAStore` by name, not by store ID. The controller resolves the reference. +3. **Status-driven** — controllers report state via `.status.conditions` following Kubernetes conventions (`Ready`, `InSync`, error conditions) +4. **Finalizers for cleanup** — `FGAStore` uses a finalizer to ensure the store is deleted from OpenFGA when the CR is deleted +5. **Idempotent** — all operations are safe to retry. Re-running reconciliation produces the same result. +6. **`v1alpha1` API version** — signals that the CRD schema may change. We will promote to `v1beta1` and `v1` as the design stabilizes. + +## Consequences + +### Positive + +- **GitOps-native authorization management** — stores, models, and tuples are Kubernetes resources that ArgoCD/FluxCD sync from Git +- **Drift detection and reconciliation** — the operator continuously ensures the actual state matches the declared state +- **Cross-team standardization** — every team uses the same CRDs, eliminating custom scripts and CI hacks +- **Atomic deployments** — a team can include `FGAModel` in their application's Helm chart; model updates deploy alongside code changes +- **Visibility** — `kubectl get fgastores`, `kubectl get fgamodels`, `kubectl describe fgatuples` provide instant visibility into authorization configuration +- **RBAC integration** — Kubernetes RBAC controls who can create/modify stores, models, and tuples per namespace + +### Negative + +- **Significant development scope** — three controllers, each with its own reconciliation logic, error handling, and tests +- **Tuple reconciliation complexity** — diffing and ownership tracking for tuples is the most complex piece; edge cases around partial failures, pagination, and large tuple sets +- **CRD upgrade burden** — CRD schema changes 
require careful migration; Helm does not upgrade CRDs automatically +- **API dependency** — the operator must be able to reach the OpenFGA API; network issues or API downtime affect reconciliation +- **Not suitable for all tuple management** — dynamic, application-driven tuples should still be written via the API, not CRDs. Users must understand this boundary. + +### Risks + +- **FGATuples at scale** — for stores with millions of tuples, the reconciliation diff could be expensive. The ownership model mitigates this (only diff owned tuples), but documentation must clearly state that `FGATuples` is for base/static data, not high-volume dynamic writes. +- **Multi-cluster** — if OpenFGA serves multiple clusters, CRDs in one cluster may conflict with CRDs in another pointing at the same store. This is out of scope for `v1alpha1` but should be considered for future versions. diff --git a/docs/adr/004-operator-deployment-model.md b/docs/adr/004-operator-deployment-model.md new file mode 100644 index 0000000..bedb12c --- /dev/null +++ b/docs/adr/004-operator-deployment-model.md @@ -0,0 +1,167 @@ +# ADR-004: Operator Deployment as Helm Subchart Dependency + +- **Status:** Proposed +- **Date:** 2026-04-06 +- **Deciders:** OpenFGA Helm Charts maintainers +- **Related ADR:** [ADR-001](001-adopt-openfga-operator.md) + +## Context + +The OpenFGA Operator (ADR-001) needs a deployment model — how do users install it alongside or independent of the OpenFGA server? + +There are several established patterns in the Kubernetes ecosystem: + +### Alternatives Considered + +**A. Standalone operator chart (install separately)** + +Users install the operator chart first, then install the OpenFGA chart. The operator watches for OpenFGA Deployments across namespaces. + +*Example:* +```bash +helm install openfga-operator openfga/openfga-operator -n openfga-system +helm install openfga openfga/openfga -n my-namespace +``` + +*Pros:* Clean separation of concerns. 
One operator instance serves multiple OpenFGA installations. Follows the OLM/OperatorHub pattern. +*Cons:* Two install steps. Ordering dependency — operator must exist before the chart is useful. Users must manage two releases. Harder to get started. + +**B. Operator bundled in the main chart (single chart, always installed)** + +The operator Deployment, RBAC, and CRDs are templates in the main OpenFGA chart. No subchart. + +*Pros:* Simplest for users — one chart, one install. No dependency management. +*Cons:* Chart becomes larger and harder to maintain. Users who manage the operator separately (e.g., cluster-wide) can't disable it. CRDs are tied to the application chart's release cycle. Multiple OpenFGA installations in the same cluster would deploy multiple operator instances. + +**C. Operator as a conditional subchart dependency (selected)** + +The operator is a separate Helm chart (`openfga-operator`) that the main chart declares as a conditional dependency. Enabled by default, but users can disable it. + +*Example:* +```bash +# Everything in one command +helm install openfga openfga/openfga \ + --set datastore.engine=postgres \ + --set operator.enabled=true + +# Or, operator managed separately +helm install openfga-operator openfga/openfga-operator -n openfga-system +helm install openfga openfga/openfga \ + --set operator.enabled=false +``` + +*Pros:* Single install for most users. Operator chart has its own versioning. Users can disable for standalone management. Clean separation in code. +*Cons:* Subchart dependency adds some Chart.yaml complexity. CRDs still need special handling (Helm's `crds/` directory or a pre-install hook). + +**D. OLM (Operator Lifecycle Manager) only** + +Publish the operator to OperatorHub. Users install via OLM. + +*Pros:* Standard pattern for OpenShift. Handles CRD upgrades, operator upgrades, and RBAC. +*Cons:* OLM is not available on all clusters (not standard on EKS, GKE, AKS). Adds a dependency on OLM itself. 
Doesn't help Helm-only users. + +## Decision + +The operator will be distributed as a **conditional Helm subchart dependency** of the main OpenFGA chart. + +### Chart Structure + +``` +helm-charts/ +├── charts/ +│ ├── openfga/ # Main chart (existing) +│ │ ├── Chart.yaml # Declares openfga-operator as dependency +│ │ ├── values.yaml # operator.enabled: true +│ │ ├── templates/ +│ │ └── crds/ # Empty in Stage 1 +│ │ +│ └── openfga-operator/ # Operator subchart (new) +│ ├── Chart.yaml +│ ├── values.yaml +│ ├── templates/ +│ │ ├── deployment.yaml +│ │ ├── serviceaccount.yaml +│ │ ├── clusterrole.yaml +│ │ └── clusterrolebinding.yaml +│ └── crds/ # CRDs added in Stages 2-4 +│ ├── fgastore.yaml +│ ├── fgamodel.yaml +│ └── fgatuples.yaml +``` + +### Dependency Declaration + +```yaml +# charts/openfga/Chart.yaml +dependencies: + - name: openfga-operator + version: "0.1.x" + repository: "oci://ghcr.io/openfga/helm-charts" + condition: operator.enabled +``` + +### CRD Handling + +Helm has specific behavior around CRDs: + +1. **`crds/` directory** — CRDs placed here are installed on `helm install` but are **never upgraded or deleted** by Helm. This is safe but requires manual CRD upgrades. + +2. **Pre-install/pre-upgrade hook Job** — a Job that runs `kubectl apply -f` on CRD manifests before the main install/upgrade. This handles upgrades but reintroduces Helm hooks (the problem ADR-002 solves). + +3. **Static manifests applied separately** — CRDs are published as a standalone YAML file. Users run `kubectl apply -f` before `helm install`. This is the pattern used by cert-manager, Istio, and Prometheus Operator. + +**Decision:** Use the `crds/` directory in the operator subchart for initial installation. Publish CRD manifests as a standalone artifact for upgrades. Document both paths clearly. 
+ +```bash +# First install — Helm installs CRDs automatically +helm install openfga openfga/openfga + +# CRD upgrades — applied manually (Helm won't upgrade them) +kubectl apply -f https://github.com/openfga/helm-charts/releases/download/v0.2.0/crds.yaml +``` + +### Installation Modes + +| Mode | Command | Use case | +|------|---------|----------| +| **All-in-one** (default) | `helm install openfga openfga/openfga` | Most users. Single install, operator included. | +| **Operator disabled** | `helm install openfga openfga/openfga --set operator.enabled=false` | Operator managed separately or not needed (memory datastore). | +| **Operator standalone** | `helm install op openfga/openfga-operator -n openfga-system` | Cluster-wide operator serving multiple OpenFGA instances. | + +### Multi-Instance Considerations + +When multiple OpenFGA installations exist in the same cluster: + +- **All-in-one mode:** Each installation gets its own operator instance. The operator only watches resources in its own namespace. This is simple but wasteful. +- **Standalone mode:** One operator installation watches all namespaces (or a configured set). Individual OpenFGA installations set `operator.enabled=false`. This is more efficient for large clusters. 
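The namespace-scoping logic behind these two modes can be sketched in Go, the operator's implementation language. This is an illustrative sketch only: the function, its name, and the precedence rule (an explicit `watchNamespace` wins over `watchAllNamespaces`) are assumptions, not the operator's actual code.

```go
package main

import "fmt"

// watchScope resolves which namespaces the operator should watch from the
// values discussed above. A nil result means "watch every namespace".
// Hypothetical sketch: the name and the precedence rule are assumptions.
func watchScope(watchNamespace, releaseNamespace string, watchAllNamespaces bool) []string {
	if watchNamespace != "" {
		// Explicitly pinned to a single namespace.
		return []string{watchNamespace}
	}
	if watchAllNamespaces {
		// Standalone mode: unscoped, cluster-wide watch.
		return nil
	}
	// All-in-one default: watch only the operator's own release namespace.
	return []string{releaseNamespace}
}

func main() {
	fmt.Println(watchScope("", "openfga", false))       // all-in-one: [openfga]
	fmt.Println(watchScope("team-a", "openfga", false)) // pinned: [team-a]
	fmt.Println(watchScope("", "openfga-system", true)) // standalone: []
}
```

In controller-runtime terms, a list like this would typically scope the manager's cache, with the unscoped case corresponding to a cluster-wide watch.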
+
+The operator will support both modes via a `watchNamespace` configuration:
+
+```yaml
+# Operator values
+operator:
+  watchNamespace: "" # empty = watch own namespace only (all-in-one mode)
+  # watchNamespace: "openfga-system" # or pin to a specific namespace
+  # watchAllNamespaces: true # watch all namespaces (standalone mode)
+```
+
+## Consequences
+
+### Positive
+
+- **Single `helm install` for most users** — no ordering dependencies, no manual operator setup
+- **Opt-out available** — `operator.enabled: false` for users who manage it separately or don't need it
+- **Independent versioning** — operator chart has its own version; can be released on a different cadence than the main chart
+- **Clean code separation** — operator code and templates are in their own chart directory
+- **Standalone installation supported** — cluster admins can install one operator for multiple OpenFGA instances
+- **Consistent with ecosystem** — this is the same pattern used by charts that depend on Bitnami PostgreSQL, Redis, etc.
+
+### Negative
+
+- **CRD upgrade complexity** — Helm does not upgrade CRDs; users must apply CRD manifests separately on operator upgrades
+- **Multiple operators in all-in-one mode** — if a user installs OpenFGA in three namespaces, they get three operator pods (wasteful). Documentation should recommend standalone mode for multi-instance clusters.
+- **Subchart value passing** — configuring the operator requires prefixed values (e.g., `openfga-operator.image.tag`), which is slightly less ergonomic than top-level values
+
+### Neutral
+
+- **OLM support is not excluded** — the operator can be published to OperatorHub in the future alongside the Helm distribution. The two are not mutually exclusive.
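
To make the value-prefixing trade-off noted under *Negative* concrete, here is a hypothetical parent-chart values snippet. The `image.tag` key is shown for illustration only and assumes the operator chart exposes it:

```yaml
# Parent openfga chart values (illustrative)
operator:
  enabled: true        # condition flag, read by the parent chart

openfga-operator:      # all other operator settings nest under the dependency name
  image:
    tag: "0.1.1"       # hypothetical override passed through to the subchart
```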
diff --git a/docs/adr/README.md b/docs/adr/README.md new file mode 100644 index 0000000..298a9e3 --- /dev/null +++ b/docs/adr/README.md @@ -0,0 +1,180 @@ +# Architecture Decision Records + +This directory contains Architecture Decision Records (ADRs) for the OpenFGA Helm Charts project. + +ADRs are short documents that capture significant architectural decisions along with their context, alternatives considered, and consequences. They serve as a decision log — not a living design doc, but a point-in-time record of *why* a decision was made. + +We follow the format described by [Michael Nygard](https://cognitect.com/blog/2011/11/15/documenting-architecture-decisions). + +## Index + +| ADR | Title | Status | Date | +|-----|-------|--------|------| +| [ADR-001](001-adopt-openfga-operator.md) | Adopt a Kubernetes Operator for OpenFGA Lifecycle Management | Proposed | 2026-04-06 | +| [ADR-002](002-operator-managed-migrations.md) | Replace Helm Hook Migrations with Operator-Managed Migrations | Proposed | 2026-04-06 | +| [ADR-003](003-declarative-store-lifecycle-crds.md) | Declarative Store Lifecycle Management via CRDs | Proposed | 2026-04-06 | +| [ADR-004](004-operator-deployment-model.md) | Operator Deployment as Helm Subchart Dependency | Proposed | 2026-04-06 | + +--- + +## What is an ADR? + +An ADR captures a single architectural decision. It records: + +- **What** was decided +- **Why** it was decided (the context and constraints at the time) +- **What alternatives** were considered and why they were rejected +- **What consequences** follow from the decision (positive, negative, and neutral) + +ADRs are **immutable once accepted** — if a decision changes, you write a new ADR that supersedes the old one rather than editing it. This preserves the history of *why* things changed over time. 
+ +## ADR Lifecycle + +``` +Proposed → Accepted → (optionally) Superseded or Deprecated + ↑ + │ feedback loop + │ + Discussion +``` + +### Statuses + +| Status | Meaning | +|--------|---------| +| **Proposed** | The ADR has been written and is open for discussion. No commitment has been made. | +| **Accepted** | The decision has been agreed upon by maintainers. Implementation can proceed. | +| **Deprecated** | The decision is no longer relevant (e.g., the feature was removed). | +| **Superseded by ADR-XXX** | A newer ADR has replaced this decision. The old ADR links to the new one. | + +## How to Propose an ADR + +1. **Create a branch** — e.g., `docs/adr-005-my-decision` + +2. **Copy the template** — use `000-template.md` as a starting point + +3. **Write the ADR** — fill in Context, Decision, and Consequences. Focus on *why*, not *how*. The most valuable part is the Alternatives Considered section — it shows reviewers what you evaluated and why you chose this path. + +4. **Assign a number** — use the next sequential number. Check the index above. + +5. **Open a pull request** — the PR is where discussion happens. Title it: `ADR-005: ` + +6. **Add to the index** — update the table in this README with the new entry (status: Proposed) + +### Proposing related ADRs together + +When multiple ADRs are part of a single cohesive proposal — e.g., a foundational decision and several downstream decisions that depend on it — they can be submitted in a single PR. This lets reviewers see the full picture instead of bouncing between separate PRs. + +When doing this: + +- **Explain the relationship in the PR description** — identify which ADR is the foundational decision and which are downstream. For example: "ADR-001 is the core decision to build an operator. ADR-002, 003, and 004 are downstream decisions about how the operator handles migrations, CRDs, and deployment." 
+- **Each ADR can be accepted or rejected independently** — a reviewer might approve the foundational decision but push back on a downstream one. If that happens, split the PR: merge the accepted ADRs and keep the contested ones open for further discussion. +- **Keep each ADR self-contained** — even though they're in the same PR, each ADR should stand on its own. A reader should be able to understand ADR-003 without reading ADR-002 first (though they may reference each other). + +## How to Give Feedback on an ADR + +ADR review happens in the **pull request**, not by editing the ADR directly. This keeps the discussion visible and linked to the decision. + +### As a reviewer + +- **Comment on the PR** — ask questions, challenge assumptions, suggest alternatives. Good review questions: + - "Did you consider X as an alternative?" + - "What happens if Y fails?" + - "This conflicts with how we do Z — can you address that?" + - "I agree with the decision but the consequence about X should mention Y" + +- **Request changes** if you believe the decision is wrong or incomplete + +- **Approve** when you're satisfied the decision is sound and well-documented + +### As the author responding to feedback + +- **Update the ADR in the PR** based on feedback: + - Add alternatives that reviewers suggested (with your evaluation of them) + - Expand the Consequences section if reviewers identified impacts you missed + - Clarify the Context if reviewers were confused about the problem + - Adjust the Decision if feedback reveals a better approach + +- **Do NOT delete feedback-driven changes** — if a reviewer raised a valid alternative and you addressed it, the ADR is stronger for including it + +- **Resolve PR comments** as you address them so reviewers can track progress + +### Reaching consensus + +- ADRs move to **Accepted** when maintainers approve the PR +- Not every maintainer needs to approve — follow the project's normal review standards +- If consensus can't be reached, escalate 
to a synchronous discussion (meeting, call) and record the outcome in the PR +- Disagreement is fine — document it in the Consequences section as a risk or trade-off rather than hiding it + +## How to Supersede an ADR + +When a decision needs to change: + +1. **Do NOT edit the original ADR** — it's a historical record + +2. **Write a new ADR** that references the old one: + ```markdown + - **Supersedes:** [ADR-002](002-operator-managed-migrations.md) + ``` + +3. **Update the old ADR's status** — change it to: + ```markdown + - **Status:** Superseded by [ADR-007](007-new-approach.md) + ``` + +4. **Update the index** in this README + +This way, anyone reading ADR-002 knows it's been replaced and can follow the link to understand what changed and why. + +## ADR Format + +Every ADR follows this structure: + +```markdown +# ADR-NNN: Title + +- **Status:** Proposed | Accepted | Deprecated | Superseded by ADR-XXX +- **Date:** YYYY-MM-DD +- **Deciders:** Who was involved in the decision +- **Related Issues:** GitHub issue references +- **Related ADR:** Links to related ADRs + +## Context + +What is the problem or situation that motivates this decision? +Include enough background that someone unfamiliar with the project +can understand why this decision matters. + +## Decision + +What is the decision and why was it chosen? + +### Alternatives Considered + +What other options were evaluated? Why were they rejected? +This is often the most valuable section — it prevents future +contributors from re-proposing rejected approaches. + +## Consequences + +### Positive +What improves as a result of this decision? + +### Negative +What gets harder or more complex? Be honest — every decision has costs. + +### Risks +What could go wrong? What assumptions might prove false? +``` + +## Template + +A blank template is available at [000-template.md](000-template.md). + +## Tips for Writing Good ADRs + +- **Keep it short** — an ADR is one decision, not a design doc. 
If it's longer than 2-3 pages, consider splitting it. +- **Focus on why, not how** — implementation details change; the reasoning behind the decision is what matters long-term. +- **Be honest about trade-offs** — an ADR that lists only positive consequences isn't credible. Every decision has costs. +- **Write for your future self** — in 18 months, you won't remember why you chose this. The ADR should tell you. +- **Not every decision needs an ADR** — use ADRs for decisions that are hard to reverse, affect multiple components, or where the reasoning isn't obvious from the code. From c6a645afa9bf7d00385a1a0dc03cc8d13ed28e79 Mon Sep 17 00:00:00 2001 From: Ed Milic <edmilic@gmail.com> Date: Fri, 10 Apr 2026 13:09:35 -0400 Subject: [PATCH 02/42] feat: add operator for migration orchestration (Stage 1) Replace Helm hook-based migrations with a lightweight Kubernetes operator that watches OpenFGA Deployments, detects version changes, and runs migrations as regular Jobs. - Go operator using controller-runtime (no CRDs) - Helm subchart with opt-in via operator.enabled (default false) - Dedicated migration ServiceAccount (separate from runtime) - Auto-recovery on database failure (delete/retry cycle) - GitHub Actions workflow for multi-arch image builds - Integration test values for local Kubernetes clusters Resolves #211, #107, #120, #100, #126 --- .github/workflows/operator.yml | 100 ++++++ charts/openfga-operator/Chart.yaml | 13 + charts/openfga-operator/crds/README.md | 4 + charts/openfga-operator/templates/NOTES.txt | 11 + .../openfga-operator/templates/_helpers.tpl | 72 ++++ .../templates/clusterrole.yaml | 31 ++ .../templates/clusterrolebinding.yaml | 14 + .../templates/deployment.yaml | 79 ++++ .../templates/serviceaccount.yaml | 13 + charts/openfga-operator/values.yaml | 59 +++ charts/openfga/Chart.lock | 7 +- charts/openfga/Chart.yaml | 4 + charts/openfga/templates/_helpers.tpl | 11 + charts/openfga/templates/deployment.yaml | 17 +- 
charts/openfga/templates/job.yaml | 2 +- charts/openfga/templates/rbac.yaml | 2 +- charts/openfga/templates/serviceaccount.yaml | 13 + charts/openfga/values.schema.json | 69 ++++ charts/openfga/values.yaml | 26 ++ operator/.dockerignore | 6 + operator/Dockerfile | 17 + operator/Makefile | 26 ++ operator/README.md | 129 +++++++ operator/cmd/main.go | 102 ++++++ operator/go.mod | 66 ++++ operator/go.sum | 171 +++++++++ operator/internal/controller/helpers.go | 269 ++++++++++++++ .../controller/migration_controller.go | 215 +++++++++++ .../controller/migration_controller_test.go | 336 ++++++++++++++++++ operator/tests/README.md | 186 ++++++++++ operator/tests/values-db-outage.yaml | 70 ++++ operator/tests/values-happy-path.yaml | 70 ++++ operator/tests/values-no-db.yaml | 23 ++ 33 files changed, 2225 insertions(+), 8 deletions(-) create mode 100644 .github/workflows/operator.yml create mode 100644 charts/openfga-operator/Chart.yaml create mode 100644 charts/openfga-operator/crds/README.md create mode 100644 charts/openfga-operator/templates/NOTES.txt create mode 100644 charts/openfga-operator/templates/_helpers.tpl create mode 100644 charts/openfga-operator/templates/clusterrole.yaml create mode 100644 charts/openfga-operator/templates/clusterrolebinding.yaml create mode 100644 charts/openfga-operator/templates/deployment.yaml create mode 100644 charts/openfga-operator/templates/serviceaccount.yaml create mode 100644 charts/openfga-operator/values.yaml create mode 100644 operator/.dockerignore create mode 100644 operator/Dockerfile create mode 100644 operator/Makefile create mode 100644 operator/README.md create mode 100644 operator/cmd/main.go create mode 100644 operator/go.mod create mode 100644 operator/go.sum create mode 100644 operator/internal/controller/helpers.go create mode 100644 operator/internal/controller/migration_controller.go create mode 100644 operator/internal/controller/migration_controller_test.go create mode 100644 operator/tests/README.md create 
mode 100644 operator/tests/values-db-outage.yaml create mode 100644 operator/tests/values-happy-path.yaml create mode 100644 operator/tests/values-no-db.yaml diff --git a/.github/workflows/operator.yml b/.github/workflows/operator.yml new file mode 100644 index 0000000..71f9e7f --- /dev/null +++ b/.github/workflows/operator.yml @@ -0,0 +1,100 @@ +name: Operator + +on: + push: + branches: + - main + paths: + - "operator/**" + pull_request: + paths: + - "operator/**" + workflow_dispatch: + inputs: + push_image: + description: "Push the operator image to GHCR" + type: boolean + default: true + +env: + IMAGE_NAME: ghcr.io/${{ github.repository_owner }}/openfga-operator + +jobs: + test: + runs-on: ubuntu-latest + permissions: + contents: read + steps: + - name: Checkout + uses: actions/checkout@v6 + + - name: Set up Go + uses: actions/setup-go@v5 + with: + go-version-file: operator/go.mod + cache-dependency-path: operator/go.sum + + - name: Run tests + working-directory: operator + run: go test ./... -v + + - name: Run vet + working-directory: operator + run: go vet ./... 
+ + build-and-push: + needs: test + if: >- + (github.event_name == 'push' && github.ref == 'refs/heads/main') || + (github.event_name == 'workflow_dispatch' && inputs.push_image) + runs-on: ubuntu-latest + permissions: + contents: read + packages: write + steps: + - name: Checkout + uses: actions/checkout@v6 + + - name: Extract version from Chart.yaml + id: version + run: | + version=$(grep '^appVersion:' charts/openfga-operator/Chart.yaml | awk '{print $2}' | tr -d '"') + echo "version=${version}" >> "$GITHUB_OUTPUT" + short_sha="${GITHUB_SHA::7}" + echo "short_sha=${short_sha}" >> "$GITHUB_OUTPUT" + echo "Operator version: ${version} (sha: ${short_sha})" + + - name: Determine image tags + id: tags + run: | + if [[ "${{ github.ref }}" == "refs/heads/main" ]]; then + echo "tags=${{ env.IMAGE_NAME }}:${{ steps.version.outputs.version }},${{ env.IMAGE_NAME }}:latest" >> "$GITHUB_OUTPUT" + else + # Dev build — tag with version-sha to avoid clobbering release tags + echo "tags=${{ env.IMAGE_NAME }}:${{ steps.version.outputs.version }}-${{ steps.version.outputs.short_sha }}" >> "$GITHUB_OUTPUT" + fi + + - name: Set up Docker Buildx + uses: docker/setup-buildx-action@v3 + + - name: Login to GHCR + uses: docker/login-action@v4.1.0 + with: + registry: ghcr.io + username: ${{ github.actor }} + password: ${{ secrets.GITHUB_TOKEN }} + + - name: Build and push + uses: docker/build-push-action@v6 + with: + context: operator + push: true + platforms: linux/amd64,linux/arm64 + tags: ${{ steps.tags.outputs.tags }} + cache-from: type=gha + cache-to: type=gha,mode=max + labels: | + org.opencontainers.image.source=https://github.com/${{ github.repository }} + org.opencontainers.image.version=${{ steps.version.outputs.version }} + org.opencontainers.image.title=openfga-operator + org.opencontainers.image.description=OpenFGA Kubernetes operator for migration orchestration diff --git a/charts/openfga-operator/Chart.yaml b/charts/openfga-operator/Chart.yaml new file mode 100644 index 
0000000..1bdacb0 --- /dev/null +++ b/charts/openfga-operator/Chart.yaml @@ -0,0 +1,13 @@ +apiVersion: v2 +name: openfga-operator +description: Helm chart for the OpenFGA Kubernetes operator. + +type: application +version: 0.1.0 +appVersion: "0.1.0" + +home: "https://openfga.github.io/helm-charts" +icon: https://github.com/openfga/community/raw/main/brand-assets/icon/color/openfga-icon-color.svg + +annotations: + artifacthub.io/license: Apache-2.0 diff --git a/charts/openfga-operator/crds/README.md b/charts/openfga-operator/crds/README.md new file mode 100644 index 0000000..060b0d0 --- /dev/null +++ b/charts/openfga-operator/crds/README.md @@ -0,0 +1,4 @@ +# CRDs + +This directory is reserved for Custom Resource Definitions added in later stages. +No CRDs are installed in Stage 1 (migration orchestration). diff --git a/charts/openfga-operator/templates/NOTES.txt b/charts/openfga-operator/templates/NOTES.txt new file mode 100644 index 0000000..8c398b1 --- /dev/null +++ b/charts/openfga-operator/templates/NOTES.txt @@ -0,0 +1,11 @@ +The openfga-operator has been deployed. + +NOTE: The operator container image ({{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}) +does not exist yet. The operator pod will remain in ImagePullBackOff until +the Go binary is built and pushed. + +To check operator status: + kubectl get deployment --namespace {{ include "openfga-operator.namespace" . }} {{ include "openfga-operator.fullname" . }} + +To view operator logs (once the image is available): + kubectl logs --namespace {{ include "openfga-operator.namespace" . }} -l "app.kubernetes.io/name={{ include "openfga-operator.name" . }}" diff --git a/charts/openfga-operator/templates/_helpers.tpl b/charts/openfga-operator/templates/_helpers.tpl new file mode 100644 index 0000000..70d6e4c --- /dev/null +++ b/charts/openfga-operator/templates/_helpers.tpl @@ -0,0 +1,72 @@ +{{/* +Expand the name of the chart. 
+*/}} +{{- define "openfga-operator.name" -}} +{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" }} +{{- end }} + +{{/* +Create a default fully qualified app name. +We truncate at 63 chars because some Kubernetes name fields are limited to this (by the DNS naming spec). +If release name contains chart name it will be used as a full name. +*/}} +{{- define "openfga-operator.fullname" -}} +{{- if .Values.fullnameOverride }} +{{- .Values.fullnameOverride | trunc 63 | trimSuffix "-" }} +{{- else }} +{{- $name := default .Chart.Name .Values.nameOverride }} +{{- if contains $name .Release.Name }} +{{- .Release.Name | trunc 63 | trimSuffix "-" }} +{{- else }} +{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" }} +{{- end }} +{{- end }} +{{- end }} + +{{/* +Expand the namespace of the release. +Allows overriding it for multi-namespace deployments in combined charts. +*/}} +{{- define "openfga-operator.namespace" -}} +{{- default .Release.Namespace .Values.namespaceOverride | trunc 63 | trimSuffix "-" -}} +{{- end -}} + +{{/* +Create chart name and version as used by the chart label. +*/}} +{{- define "openfga-operator.chart" -}} +{{- printf "%s-%s" .Chart.Name .Chart.Version | replace "+" "_" | trunc 63 | trimSuffix "-" }} +{{- end }} + +{{/* +Common labels +*/}} +{{- define "openfga-operator.labels" -}} +helm.sh/chart: {{ include "openfga-operator.chart" . }} +{{ include "openfga-operator.selectorLabels" . }} +app.kubernetes.io/component: operator +{{- if .Chart.AppVersion }} +app.kubernetes.io/version: {{ .Chart.AppVersion | quote }} +{{- end }} +app.kubernetes.io/managed-by: {{ .Release.Service }} +app.kubernetes.io/part-of: openfga +{{- end }} + +{{/* +Selector labels +*/}} +{{- define "openfga-operator.selectorLabels" -}} +app.kubernetes.io/name: {{ include "openfga-operator.name" . 
}} +app.kubernetes.io/instance: {{ .Release.Name }} +{{- end }} + +{{/* +Create the name of the service account to use +*/}} +{{- define "openfga-operator.serviceAccountName" -}} +{{- if .Values.serviceAccount.create }} +{{- default (include "openfga-operator.fullname" .) .Values.serviceAccount.name }} +{{- else }} +{{- default "default" .Values.serviceAccount.name }} +{{- end }} +{{- end }} diff --git a/charts/openfga-operator/templates/clusterrole.yaml b/charts/openfga-operator/templates/clusterrole.yaml new file mode 100644 index 0000000..09d0fd7 --- /dev/null +++ b/charts/openfga-operator/templates/clusterrole.yaml @@ -0,0 +1,31 @@ +apiVersion: rbac.authorization.k8s.io/v1 +kind: ClusterRole +metadata: + name: {{ include "openfga-operator.fullname" . }} + labels: + {{- include "openfga-operator.labels" . | nindent 4 }} +rules: + - apiGroups: ["apps"] + resources: ["deployments"] + verbs: ["get", "list", "watch", "patch"] + - apiGroups: ["apps"] + resources: ["deployments/status"] + verbs: ["update"] + - apiGroups: ["batch"] + resources: ["jobs"] + verbs: ["get", "list", "watch", "create", "delete"] + - apiGroups: [""] + resources: ["configmaps"] + verbs: ["get", "list", "watch", "create", "update"] + - apiGroups: [""] + resources: ["secrets"] + verbs: ["get"] + - apiGroups: [""] + resources: ["serviceaccounts"] + verbs: ["get", "list", "create"] + - apiGroups: ["coordination.k8s.io"] + resources: ["leases"] + verbs: ["get", "list", "watch", "create", "update"] + - apiGroups: [""] + resources: ["events"] + verbs: ["create", "patch"] diff --git a/charts/openfga-operator/templates/clusterrolebinding.yaml b/charts/openfga-operator/templates/clusterrolebinding.yaml new file mode 100644 index 0000000..854521a --- /dev/null +++ b/charts/openfga-operator/templates/clusterrolebinding.yaml @@ -0,0 +1,14 @@ +apiVersion: rbac.authorization.k8s.io/v1 +kind: ClusterRoleBinding +metadata: + name: {{ include "openfga-operator.fullname" . 
}} + labels: + {{- include "openfga-operator.labels" . | nindent 4 }} +roleRef: + apiGroup: rbac.authorization.k8s.io + kind: ClusterRole + name: {{ include "openfga-operator.fullname" . }} +subjects: + - kind: ServiceAccount + name: {{ include "openfga-operator.serviceAccountName" . }} + namespace: {{ include "openfga-operator.namespace" . }} diff --git a/charts/openfga-operator/templates/deployment.yaml b/charts/openfga-operator/templates/deployment.yaml new file mode 100644 index 0000000..ae8af0d --- /dev/null +++ b/charts/openfga-operator/templates/deployment.yaml @@ -0,0 +1,79 @@ +apiVersion: apps/v1 +kind: Deployment +metadata: + name: {{ include "openfga-operator.fullname" . }} + namespace: {{ include "openfga-operator.namespace" . }} + labels: + {{- include "openfga-operator.labels" . | nindent 4 }} +spec: + replicas: {{ .Values.replicaCount }} + selector: + matchLabels: + {{- include "openfga-operator.selectorLabels" . | nindent 6 }} + template: + metadata: + {{- with .Values.podAnnotations }} + annotations: + {{- toYaml . | nindent 8 }} + {{- end }} + labels: + {{- include "openfga-operator.labels" . | nindent 8 }} + spec: + {{- with .Values.imagePullSecrets }} + imagePullSecrets: + {{- toYaml . | nindent 8 }} + {{- end }} + serviceAccountName: {{ include "openfga-operator.serviceAccountName" . }} + {{- with .Values.podSecurityContext }} + securityContext: + {{- toYaml . | nindent 8 }} + {{- end }} + containers: + - name: operator + {{- with .Values.securityContext }} + securityContext: + {{- toYaml . 
| nindent 12 }} + {{- end }} + image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}" + imagePullPolicy: {{ .Values.image.pullPolicy }} + args: + {{- if .Values.leaderElection.enabled }} + - --leader-elect + {{- end }} + {{- if .Values.watchNamespace }} + - --watch-namespace={{ .Values.watchNamespace }} + {{- else if .Values.watchAllNamespaces }} + - --watch-all-namespaces + {{- end }} + ports: + - name: healthz + containerPort: 8081 + protocol: TCP + livenessProbe: + httpGet: + path: /healthz + port: healthz + initialDelaySeconds: 15 + periodSeconds: 20 + readinessProbe: + httpGet: + path: /readyz + port: healthz + initialDelaySeconds: 5 + periodSeconds: 10 + {{- with .Values.resources }} + resources: + {{- toYaml . | nindent 12 }} + {{- end }} + {{- with .Values.nodeSelector }} + nodeSelector: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.affinity }} + affinity: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.tolerations }} + tolerations: + {{- toYaml . | nindent 8 }} + {{- end }} diff --git a/charts/openfga-operator/templates/serviceaccount.yaml b/charts/openfga-operator/templates/serviceaccount.yaml new file mode 100644 index 0000000..8b1f894 --- /dev/null +++ b/charts/openfga-operator/templates/serviceaccount.yaml @@ -0,0 +1,13 @@ +{{- if .Values.serviceAccount.create -}} +apiVersion: v1 +kind: ServiceAccount +metadata: + name: {{ include "openfga-operator.serviceAccountName" . }} + namespace: {{ include "openfga-operator.namespace" . }} + labels: + {{- include "openfga-operator.labels" . | nindent 4 }} + {{- with .Values.serviceAccount.annotations }} + annotations: + {{- toYaml . 
| nindent 4 }}
+  {{- end }}
+{{- end }}
diff --git a/charts/openfga-operator/values.yaml b/charts/openfga-operator/values.yaml
new file mode 100644
index 0000000..891ad57
--- /dev/null
+++ b/charts/openfga-operator/values.yaml
@@ -0,0 +1,59 @@
+replicaCount: 1
+
+image:
+  repository: openfga/openfga-operator
+  pullPolicy: IfNotPresent
+  # -- Overrides the image tag whose default is the chart appVersion.
+  tag: ""
+
+imagePullSecrets: []
+nameOverride: ""
+fullnameOverride: ""
+
+serviceAccount:
+  # -- Specifies whether a service account should be created.
+  create: true
+  # -- Annotations to add to the service account.
+  annotations: {}
+  # -- The name of the service account to use.
+  # If not set and create is true, a name is generated using the fullname template.
+  name: ""
+
+podAnnotations: {}
+
+podSecurityContext: {}
+  # runAsNonRoot: true
+  # seccompProfile:
+  #   type: RuntimeDefault
+
+securityContext: {}
+  # capabilities:
+  #   drop:
+  #     - ALL
+  # readOnlyRootFilesystem: true
+  # runAsNonRoot: true
+  # runAsUser: 65532
+
+# -- Constrain the operator to watch a single namespace.
+# Leave empty to default to the release namespace.
+watchNamespace: ""
+
+# -- Watch all namespaces. Ignored when watchNamespace is set (an explicit
+# watchNamespace takes precedence, matching the deployment template).
+watchAllNamespaces: false
+
+leaderElection:
+  # -- Enable leader election for controller manager.
+ enabled: true + +resources: {} + # requests: + # cpu: 10m + # memory: 64Mi + # limits: + # memory: 128Mi + +nodeSelector: {} + +tolerations: [] + +affinity: {} diff --git a/charts/openfga/Chart.lock b/charts/openfga/Chart.lock index e82ffa5..8011453 100644 --- a/charts/openfga/Chart.lock +++ b/charts/openfga/Chart.lock @@ -8,5 +8,8 @@ dependencies: - name: common repository: oci://registry-1.docker.io/bitnamicharts version: 2.13.3 -digest: sha256:4bbfb25821b0dfb6c70aabb5caf4c5ec7e6526261f93a8f531f507f1d4c43e3e -generated: "2026-03-18T11:41:40.1785546-04:00" +- name: openfga-operator + repository: file://../openfga-operator + version: 0.1.0 +digest: sha256:d502dc105790995a4368a049c0f593820d08f2f82dc9c9a70480a343c7affe8b +generated: "2026-04-10T11:45:16.638975-04:00" diff --git a/charts/openfga/Chart.yaml b/charts/openfga/Chart.yaml index c7eeb76..624d5ca 100644 --- a/charts/openfga/Chart.yaml +++ b/charts/openfga/Chart.yaml @@ -29,3 +29,7 @@ dependencies: repository: oci://registry-1.docker.io/bitnamicharts tags: - bitnami-common + - name: openfga-operator + version: "0.1.0" + repository: "file://../openfga-operator" + condition: operator.enabled diff --git a/charts/openfga/templates/_helpers.tpl b/charts/openfga/templates/_helpers.tpl index 5889497..35ad94a 100644 --- a/charts/openfga/templates/_helpers.tpl +++ b/charts/openfga/templates/_helpers.tpl @@ -74,6 +74,17 @@ Create the name of the service account to use {{- end }} {{- end }} +{{/* +Create the name of the migration service account to use (operator mode only) +*/}} +{{- define "openfga.migrationServiceAccountName" -}} +{{- if .Values.migration.serviceAccount.name }} +{{- .Values.migration.serviceAccount.name | trunc 63 | trimSuffix "-" }} +{{- else }} +{{- printf "%s-migration" (include "openfga.fullname" .) 
| trunc 63 | trimSuffix "-" }} +{{- end }} +{{- end }} + {{/* Return true if a secret object should be created */}} diff --git a/charts/openfga/templates/deployment.yaml b/charts/openfga/templates/deployment.yaml index e6c1fff..1cd20ac 100644 --- a/charts/openfga/templates/deployment.yaml +++ b/charts/openfga/templates/deployment.yaml @@ -4,12 +4,21 @@ metadata: name: {{ include "openfga.fullname" . }} labels: {{- include "openfga.labels" . | nindent 4 }} - {{- with .Values.annotations }} annotations: + {{- if .Values.operator.enabled }} + openfga.dev/desired-replicas: "{{ ternary 1 .Values.replicaCount (eq .Values.datastore.engine "memory") }}" + openfga.dev/migration-service-account: "{{ include "openfga.migrationServiceAccountName" . }}" + {{- end }} + {{- with .Values.annotations }} {{- toYaml . | nindent 4 }} - {{- end }} + {{- end }} spec: - {{- if not .Values.autoscaling.enabled }} + {{- if .Values.operator.enabled }} + {{- if .Values.autoscaling.enabled }} + {{- fail "operator.enabled and autoscaling.enabled cannot both be true" }} + {{- end }} + replicas: 0 + {{- else if not .Values.autoscaling.enabled }} replicas: {{ ternary 1 .Values.replicaCount (eq .Values.datastore.engine "memory")}} {{- end }} selector: @@ -37,7 +46,7 @@ spec: serviceAccountName: {{ include "openfga.serviceAccountName" . 
}} securityContext: {{- toYaml .Values.podSecurityContext | nindent 8 }} - {{ if or (and (has .Values.datastore.engine (list "postgres" "mysql")) .Values.datastore.applyMigrations .Values.datastore.waitForMigrations) .Values.extraInitContainers }} + {{ if and (not .Values.operator.enabled) (or (and (has .Values.datastore.engine (list "postgres" "mysql")) .Values.datastore.applyMigrations .Values.datastore.waitForMigrations) .Values.extraInitContainers) }} initContainers: {{- if and (has .Values.datastore.engine (list "postgres" "mysql")) .Values.datastore.applyMigrations .Values.datastore.waitForMigrations (eq .Values.datastore.migrationType "job") }} - name: wait-for-migration diff --git a/charts/openfga/templates/job.yaml b/charts/openfga/templates/job.yaml index fc70228..d46d938 100644 --- a/charts/openfga/templates/job.yaml +++ b/charts/openfga/templates/job.yaml @@ -1,4 +1,4 @@ -{{- if and (has .Values.datastore.engine (list "postgres" "mysql")) .Values.datastore.applyMigrations (eq .Values.datastore.migrationType "job") -}} +{{- if and (not .Values.operator.enabled) (has .Values.datastore.engine (list "postgres" "mysql")) .Values.datastore.applyMigrations (eq .Values.datastore.migrationType "job") -}} apiVersion: batch/v1 kind: Job metadata: diff --git a/charts/openfga/templates/rbac.yaml b/charts/openfga/templates/rbac.yaml index 3c8e0f8..71d3c09 100644 --- a/charts/openfga/templates/rbac.yaml +++ b/charts/openfga/templates/rbac.yaml @@ -1,4 +1,4 @@ -{{- if .Values.serviceAccount.create -}} +{{- if and (not .Values.operator.enabled) .Values.serviceAccount.create -}} apiVersion: rbac.authorization.k8s.io/v1 kind: Role metadata: diff --git a/charts/openfga/templates/serviceaccount.yaml b/charts/openfga/templates/serviceaccount.yaml index bbe191c..278be07 100644 --- a/charts/openfga/templates/serviceaccount.yaml +++ b/charts/openfga/templates/serviceaccount.yaml @@ -10,3 +10,16 @@ metadata: {{- toYaml . 
| nindent 4 }} {{- end }} {{- end }} +{{- if and .Values.operator.enabled .Values.migration.serviceAccount.create }} +--- +apiVersion: v1 +kind: ServiceAccount +metadata: + name: {{ include "openfga.migrationServiceAccountName" . }} + labels: + {{- include "openfga.labels" . | nindent 4 }} + {{- with .Values.migration.serviceAccount.annotations }} + annotations: + {{- toYaml . | nindent 4 }} + {{- end }} +{{- end }} diff --git a/charts/openfga/values.schema.json b/charts/openfga/values.schema.json index 151cc21..76b2cf9 100644 --- a/charts/openfga/values.schema.json +++ b/charts/openfga/values.schema.json @@ -1289,6 +1289,75 @@ "type": "boolean", "description": "This value is not used by this chart, but allows a common pattern of enabling/disabling subchart dependencies (where OpenFGA is a subchart)", "default": false + }, + "operator": { + "type": "object", + "description": "Controls the openfga-operator subchart. When enabled, migration is managed by the operator instead of the Helm job hook.", + "properties": { + "enabled": { + "type": "boolean", + "description": "Enable the openfga-operator subchart for operator-managed migrations", + "default": false + } + } + }, + "openfga-operator": { + "type": "object", + "description": "Values passed through to the openfga-operator subchart" + }, + "migration": { + "type": "object", + "description": "Controls operator-driven migration behavior. 
Only used when operator.enabled is true.", + "properties": { + "enabled": { + "type": "boolean", + "description": "Enable operator-managed database migrations", + "default": true + }, + "timeout": { + "type": ["string", "null"], + "description": "Timeout passed to the migration Job as activeDeadlineSeconds", + "default": "" + }, + "backoffLimit": { + "type": "integer", + "description": "Number of retries before marking the migration as failed", + "default": 3 + }, + "ttlSecondsAfterFinished": { + "type": "integer", + "description": "Seconds to keep completed/failed migration Jobs before cleanup", + "default": 600 + }, + "serviceAccount": { + "type": "object", + "properties": { + "create": { + "type": "boolean", + "description": "Create a dedicated service account for migration Jobs", + "default": true + }, + "annotations": { + "type": "object", + "description": "Annotations to add to the migration service account", + "additionalProperties": { + "type": "string" + }, + "default": {} + }, + "name": { + "type": "string", + "description": "The name of the migration service account. Defaults to {fullname}-migration.", + "default": "" + } + } + }, + "resources": { + "type": "object", + "description": "Resource requests/limits for migration Job pods", + "default": {} + } + } } }, "additionalProperties": false diff --git a/charts/openfga/values.yaml b/charts/openfga/values.yaml index 75fa19d..e843563 100644 --- a/charts/openfga/values.yaml +++ b/charts/openfga/values.yaml @@ -385,6 +385,32 @@ testContainerSpec: {} # -- Array of extra K8s manifests to deploy ## Note: Supports use of custom Helm templates extraObjects: [] + +# -- operator controls the openfga-operator subchart. +# When enabled, migration is managed by the operator instead of the Helm job hook. +operator: + enabled: false + +# -- migration controls operator-driven migration behavior. +# Only used when operator.enabled is true. 
+migration: + enabled: true + # -- Timeout passed to the migration Job as activeDeadlineSeconds. + timeout: "" + # -- Number of retries before marking the migration as failed. + backoffLimit: 3 + # -- Seconds to keep completed/failed migration Jobs before cleanup. + ttlSecondsAfterFinished: 600 + serviceAccount: + # -- Create a dedicated service account for migration Jobs. + create: true + # -- Annotations to add to the migration service account. + annotations: {} + # -- The name of the migration service account. + # If not set and create is true, defaults to {fullname}-migration. + name: "" + # -- Resource requests/limits for migration Job pods. + resources: {} ## Example: Deploy a PostgreSQL instance for dev/test using official Docker images. ## For production, use a managed database service or an operator like CloudnativePG. ## Configure the chart to use the secret: diff --git a/operator/.dockerignore b/operator/.dockerignore new file mode 100644 index 0000000..3efb8a0 --- /dev/null +++ b/operator/.dockerignore @@ -0,0 +1,6 @@ +**/.git +**/.gitignore +**/README.md +**/LICENSE +**/Makefile +**/.dockerignore diff --git a/operator/Dockerfile b/operator/Dockerfile new file mode 100644 index 0000000..be9097e --- /dev/null +++ b/operator/Dockerfile @@ -0,0 +1,17 @@ +FROM golang:1.25 AS builder + +WORKDIR /workspace +COPY go.mod go.sum ./ +RUN go mod download + +COPY cmd/ cmd/ +COPY internal/ internal/ + +RUN CGO_ENABLED=0 GOOS=linux go build -ldflags="-s -w" -o /operator ./cmd/ + +FROM gcr.io/distroless/static:nonroot +WORKDIR / +COPY --from=builder /operator . +USER 65532:65532 + +ENTRYPOINT ["/operator"] diff --git a/operator/Makefile b/operator/Makefile new file mode 100644 index 0000000..575bb7c --- /dev/null +++ b/operator/Makefile @@ -0,0 +1,26 @@ +IMG ?= openfga/openfga-operator:dev + +.PHONY: build test vet fmt lint docker-build docker-push clean + +build: + go build -o bin/operator ./cmd/ + +test: + go test ./... -v + +vet: + go vet ./... 
+
+fmt:
+	go fmt ./...
+
+lint: vet fmt
+
+docker-build:
+	docker build -t $(IMG) .
+
+docker-push:
+	docker push $(IMG)
+
+clean:
+	rm -rf bin/
diff --git a/operator/README.md b/operator/README.md
new file mode 100644
index 0000000..a2b0699
--- /dev/null
+++ b/operator/README.md
@@ -0,0 +1,129 @@
+# OpenFGA Operator
+
+A Kubernetes operator that manages database migrations for OpenFGA deployments. Instead of relying on Helm hooks and init containers, the operator watches OpenFGA Deployments, detects version changes, and orchestrates migrations as regular Jobs.
+
+This is **Stage 1** of the operator — focused solely on migration orchestration. See [ADR-001](../docs/adr/001-adopt-openfga-operator.md) for the full roadmap.
+
+## How It Works
+
+1. The operator watches Deployments labeled `app.kubernetes.io/part-of: openfga`
+2. When a version change is detected (comparing the container image tag to the version recorded in the per-Deployment `<name>-migration-status` ConfigMap), the operator:
+   - Keeps the Deployment at 0 replicas
+   - Creates a migration Job running `openfga migrate`
+   - Waits for the Job to complete
+   - Updates the ConfigMap with the new version
+   - Scales the Deployment up to the desired replica count
+3. On failure, a `MigrationFailed` condition is set on the Deployment and replicas stay at 0
+
+## Prerequisites
+
+- Go 1.25+
+- Docker
+- Helm 3.6+
+- A Kubernetes cluster (Rancher Desktop, kind, etc.)
+
+## Development
+
+### Build
+
+```bash
+cd operator
+go build ./...
+```
+
+### Test
+
+```bash
+go test ./... -v
+```
+
+### Lint
+
+```bash
+go vet ./...
+```
+
+### Docker Image
+
+```bash
+docker build -t openfga/openfga-operator:dev .
+```
+
+## Local Testing
+
+Integration test values and instructions are in [`tests/`](tests/). 
Three scenarios are provided: + +| Scenario | Values File | What It Tests | +|----------|-------------|---------------| +| Happy path | `tests/values-happy-path.yaml` | Full lifecycle: Postgres up, migration succeeds, OpenFGA scales to 3/3 | +| DB outage & recovery | `tests/values-db-outage.yaml` | Postgres starts at 0 replicas; scale it up later to verify self-healing | +| No database | `tests/values-no-db.yaml` | Permanent failure: operator retries without crashing, app stays at 0 | + +Quick start: + +```bash +# 1. Build the operator image +cd operator +docker build -t openfga/openfga-operator:dev . + +# 2. Update chart dependencies +cd .. +helm dependency update charts/openfga + +# 3. Run the happy-path test +kubectl create namespace openfga-test +helm install openfga-test charts/openfga -n openfga-test \ + -f operator/tests/values-happy-path.yaml + +# 4. Verify (wait ~30s) +kubectl get all -n openfga-test + +# 5. Clean up +helm uninstall openfga-test -n openfga-test +kubectl delete namespace openfga-test +``` + +See [`tests/README.md`](tests/README.md) for detailed verification steps and all three scenarios. 
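
When checking why a scenario did or did not trigger a migration, it helps to reproduce the operator's version detection by hand: the operator extracts the tag (or digest) from the Deployment's container image and compares it with the version recorded in the per-Deployment `<name>-migration-status` ConfigMap. The sketch below is a simplified standalone copy of the `extractImageTag` helper in `internal/controller/helpers.go`, for illustration only:

```go
package main

import (
	"fmt"
	"strings"
)

// extractImageTag returns the tag or digest portion of an image reference.
// Only the final path segment is inspected for a ":", so a registry port
// (e.g. "registry:5000/...") is never mistaken for a tag.
func extractImageTag(image string) string {
	// Digest references: everything after "@".
	if idx := strings.LastIndex(image, "@"); idx != -1 {
		return image[idx+1:]
	}
	// Isolate the image name from the registry/repository path.
	nameAndTag := image
	if lastSlash := strings.LastIndex(image, "/"); lastSlash != -1 {
		nameAndTag = image[lastSlash+1:]
	}
	if idx := strings.LastIndex(nameAndTag, ":"); idx != -1 {
		return nameAndTag[idx+1:]
	}
	return "latest"
}

func main() {
	fmt.Println(extractImageTag("openfga/openfga:v1.14.0"))       // v1.14.0
	fmt.Println(extractImageTag("registry:5000/openfga/openfga")) // latest
	fmt.Println(extractImageTag("openfga/openfga@sha256:abc123")) // sha256:abc123
}
```

A migration Job is only created when this extracted value differs from the stored one, which is why re-installing the same image version does not re-run `openfga migrate`.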
+ +## Project Structure + +``` +operator/ +├── cmd/ +│ └── main.go # Entry point, manager setup +├── internal/ +│ └── controller/ +│ ├── migration_controller.go # Reconciliation loop +│ ├── migration_controller_test.go # Unit tests +│ └── helpers.go # Job builder, scaling, ConfigMap helpers +├── Dockerfile # Multi-stage build (distroless runtime) +├── Makefile +├── go.mod +└── go.sum +``` + +## Configuration + +The operator accepts the following flags: + +| Flag | Default | Description | +|------|---------|-------------| +| `--leader-elect` | `false` | Enable leader election | +| `--watch-namespace` | `""` | Namespace to watch (defaults to release namespace) | +| `--watch-all-namespaces` | `false` | Watch all namespaces | +| `--metrics-bind-address` | `:8080` | Metrics endpoint address | +| `--health-probe-bind-address` | `:8081` | Health probe endpoint address | +| `--backoff-limit` | `3` | BackoffLimit for migration Jobs | +| `--active-deadline-seconds` | `300` | ActiveDeadlineSeconds for migration Jobs | +| `--ttl-seconds-after-finished` | `300` | TTLSecondsAfterFinished for migration Jobs | + +When deployed via the Helm subchart, these are configured through `values.yaml`. See `charts/openfga-operator/values.yaml` for all available options. + +## Annotations + +The operator reads these annotations from the OpenFGA Deployment: + +| Annotation | Description | +|------------|-------------| +| `openfga.dev/desired-replicas` | The replica count to restore after migration succeeds. Set by the Helm chart. | +| `openfga.dev/migration-service-account` | The ServiceAccount to use for migration Jobs. Defaults to the Deployment's SA. 
|
diff --git a/operator/cmd/main.go b/operator/cmd/main.go
new file mode 100644
index 0000000..e4c9d3e
--- /dev/null
+++ b/operator/cmd/main.go
@@ -0,0 +1,100 @@
+package main
+
+import (
+	"flag"
+	"os"
+
+	"k8s.io/apimachinery/pkg/runtime"
+	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
+	clientgoscheme "k8s.io/client-go/kubernetes/scheme"
+	ctrl "sigs.k8s.io/controller-runtime"
+	"sigs.k8s.io/controller-runtime/pkg/cache"
+	"sigs.k8s.io/controller-runtime/pkg/healthz"
+	"sigs.k8s.io/controller-runtime/pkg/log/zap"
+	metricsserver "sigs.k8s.io/controller-runtime/pkg/metrics/server"
+
+	"github.com/openfga/openfga-operator/internal/controller"
+)
+
+var scheme = runtime.NewScheme()
+
+func init() {
+	utilruntime.Must(clientgoscheme.AddToScheme(scheme))
+}
+
+func main() {
+	var (
+		leaderElect        bool
+		watchNamespace     string
+		watchAllNamespaces bool
+		metricsAddr        string
+		healthProbeAddr    string
+		backoffLimit       int
+		activeDeadline     int
+		ttlAfterFinished   int
+	)
+
+	flag.BoolVar(&leaderElect, "leader-elect", false, "Enable leader election for the controller manager.")
+	flag.StringVar(&watchNamespace, "watch-namespace", "", "Namespace to watch. 
Defaults to the release namespace.") + flag.BoolVar(&watchAllNamespaces, "watch-all-namespaces", false, "Watch all namespaces.") + flag.StringVar(&metricsAddr, "metrics-bind-address", ":8080", "The address the metric endpoint binds to.") + flag.StringVar(&healthProbeAddr, "health-probe-bind-address", ":8081", "The address the health probe endpoint binds to.") + flag.IntVar(&backoffLimit, "backoff-limit", int(controller.DefaultBackoffLimit), "BackoffLimit for migration Jobs.") + flag.IntVar(&activeDeadline, "active-deadline-seconds", int(controller.DefaultActiveDeadlineSeconds), "ActiveDeadlineSeconds for migration Jobs.") + flag.IntVar(&ttlAfterFinished, "ttl-seconds-after-finished", int(controller.DefaultTTLSecondsAfterFinished), "TTLSecondsAfterFinished for migration Jobs.") + + opts := zap.Options{Development: false} + opts.BindFlags(flag.CommandLine) + flag.Parse() + + ctrl.SetLogger(zap.New(zap.UseFlagOptions(&opts))) + logger := ctrl.Log.WithName("setup") + + // Configure cache namespace restrictions. 
+ var cacheOpts cache.Options + if watchNamespace != "" && !watchAllNamespaces { + cacheOpts.DefaultNamespaces = map[string]cache.Config{ + watchNamespace: {}, + } + } + + mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{ + Scheme: scheme, + Metrics: metricsserver.Options{BindAddress: metricsAddr}, + HealthProbeBindAddress: healthProbeAddr, + LeaderElection: leaderElect, + LeaderElectionID: "openfga-operator-leader", + Cache: cacheOpts, + }) + if err != nil { + logger.Error(err, "unable to create manager") + os.Exit(1) + } + + reconciler := &controller.MigrationReconciler{ + Client: mgr.GetClient(), + BackoffLimit: int32(backoffLimit), + ActiveDeadlineSeconds: int64(activeDeadline), + TTLSecondsAfterFinished: int32(ttlAfterFinished), + } + + if err := reconciler.SetupWithManager(mgr); err != nil { + logger.Error(err, "unable to create controller", "controller", "MigrationReconciler") + os.Exit(1) + } + + if err := mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil { + logger.Error(err, "unable to set up health check") + os.Exit(1) + } + if err := mgr.AddReadyzCheck("readyz", healthz.Ping); err != nil { + logger.Error(err, "unable to set up readiness check") + os.Exit(1) + } + + logger.Info("starting manager") + if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil { + logger.Error(err, "problem running manager") + os.Exit(1) + } +} diff --git a/operator/go.mod b/operator/go.mod new file mode 100644 index 0000000..fea9649 --- /dev/null +++ b/operator/go.mod @@ -0,0 +1,66 @@ +module github.com/openfga/openfga-operator + +go 1.25.6 + +require ( + k8s.io/api v0.35.3 + k8s.io/apimachinery v0.35.3 + k8s.io/client-go v0.35.3 + k8s.io/utils v0.0.0-20260319190234-28399d86e0b5 + sigs.k8s.io/controller-runtime v0.23.3 +) + +require ( + github.com/beorn7/perks v1.0.1 // indirect + github.com/cespare/xxhash/v2 v2.3.0 // indirect + github.com/davecgh/go-spew v1.1.1 // indirect + github.com/emicklei/go-restful/v3 v3.12.2 // indirect + 
github.com/evanphx/json-patch/v5 v5.9.11 // indirect + github.com/fsnotify/fsnotify v1.9.0 // indirect + github.com/fxamacker/cbor/v2 v2.9.0 // indirect + github.com/go-logr/logr v1.4.3 // indirect + github.com/go-logr/zapr v1.3.0 // indirect + github.com/go-openapi/jsonpointer v0.21.0 // indirect + github.com/go-openapi/jsonreference v0.20.2 // indirect + github.com/go-openapi/swag v0.23.0 // indirect + github.com/google/btree v1.1.3 // indirect + github.com/google/gnostic-models v0.7.0 // indirect + github.com/google/go-cmp v0.7.0 // indirect + github.com/google/uuid v1.6.0 // indirect + github.com/josharian/intern v1.0.0 // indirect + github.com/json-iterator/go v1.1.12 // indirect + github.com/mailru/easyjson v0.7.7 // indirect + github.com/modern-go/concurrent v0.0.0-20180306012644-bacd9c7ef1dd // indirect + github.com/modern-go/reflect2 v1.0.3-0.20250322232337-35a7c28c31ee // indirect + github.com/munnerz/goautoneg v0.0.0-20191010083416-a7dc8b61c822 // indirect + github.com/pmezard/go-difflib v1.0.0 // indirect + github.com/prometheus/client_golang v1.23.2 // indirect + github.com/prometheus/client_model v0.6.2 // indirect + github.com/prometheus/common v0.66.1 // indirect + github.com/prometheus/procfs v0.16.1 // indirect + github.com/spf13/pflag v1.0.9 // indirect + github.com/x448/float16 v0.8.4 // indirect + go.uber.org/multierr v1.11.0 // indirect + go.uber.org/zap v1.27.0 // indirect + go.yaml.in/yaml/v2 v2.4.3 // indirect + go.yaml.in/yaml/v3 v3.0.4 // indirect + golang.org/x/net v0.47.0 // indirect + golang.org/x/oauth2 v0.30.0 // indirect + golang.org/x/sync v0.18.0 // indirect + golang.org/x/sys v0.38.0 // indirect + golang.org/x/term v0.37.0 // indirect + golang.org/x/text v0.31.0 // indirect + golang.org/x/time v0.9.0 // indirect + gomodules.xyz/jsonpatch/v2 v2.4.0 // indirect + google.golang.org/protobuf v1.36.8 // indirect + gopkg.in/evanphx/json-patch.v4 v4.13.0 // indirect + gopkg.in/inf.v0 v0.9.1 // indirect + gopkg.in/yaml.v3 v3.0.1 // 
indirect + k8s.io/apiextensions-apiserver v0.35.0 // indirect + k8s.io/klog/v2 v2.130.1 // indirect + k8s.io/kube-openapi v0.0.0-20250910181357-589584f1c912 // indirect + sigs.k8s.io/json v0.0.0-20250730193827-2d320260d730 // indirect + sigs.k8s.io/randfill v1.0.0 // indirect + sigs.k8s.io/structured-merge-diff/v6 v6.3.2-0.20260122202528-d9cc6641c482 // indirect + sigs.k8s.io/yaml v1.6.0 // indirect +) diff --git a/operator/go.sum b/operator/go.sum new file mode 100644 index 0000000..79e7481 --- /dev/null +++ b/operator/go.sum @@ -0,0 +1,171 @@ +github.com/Masterminds/semver/v3 v3.4.0 h1:Zog+i5UMtVoCU8oKka5P7i9q9HgrJeGzI9SA1Xbatp0= +github.com/Masterminds/semver/v3 v3.4.0/go.mod h1:4V+yj/TJE1HU9XfppCwVMZq3I84lprf4nC11bSS5beM= +github.com/beorn7/perks v1.0.1 h1:VlbKKnNfV8bJzeqoa4cOKqO6bYr3WgKZxO8Z16+hsOM= +github.com/beorn7/perks v1.0.1/go.mod h1:G2ZrVWU2WbWT9wwq4/hrbKbnv/1ERSJQ0ibhJ6rlkpw= +github.com/cespare/xxhash/v2 v2.3.0 h1:UL815xU9SqsFlibzuggzjXhog7bL6oX9BbNZnL2UFvs= +github.com/cespare/xxhash/v2 v2.3.0/go.mod h1:VGX0DQ3Q6kWi7AoAeZDth3/j3BFtOZR5XLFGgcrjCOs= +github.com/creack/pty v1.1.9/go.mod h1:oKZEueFk5CKHvIhNR5MUki03XCEU+Q6VDXinZuGJ33E= +github.com/davecgh/go-spew v1.1.0/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38= +github.com/davecgh/go-spew v1.1.1 h1:vj9j/u1bqnvCEfJOwUhtlOARqs3+rkHYY13jYWTU97c= +github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38= +github.com/emicklei/go-restful/v3 v3.12.2 h1:DhwDP0vY3k8ZzE0RunuJy8GhNpPL6zqLkDf9B/a0/xU= +github.com/emicklei/go-restful/v3 v3.12.2/go.mod h1:6n3XBCmQQb25CM2LCACGz8ukIrRry+4bhvbpWn3mrbc= +github.com/evanphx/json-patch v0.5.2 h1:xVCHIVMUu1wtM/VkR9jVZ45N3FhZfYMMYGorLCR8P3k= +github.com/evanphx/json-patch v0.5.2/go.mod h1:ZWS5hhDbVDyob71nXKNL0+PWn6ToqBHMikGIFbs31qQ= +github.com/evanphx/json-patch/v5 v5.9.11 h1:/8HVnzMq13/3x9TPvjG08wUGqBTmZBsCWzjTM0wiaDU= +github.com/evanphx/json-patch/v5 v5.9.11/go.mod h1:3j+LviiESTElxA4p3EMKAB9HXj3/XEtnUf6OZxqIQTM= 
+github.com/fsnotify/fsnotify v1.9.0 h1:2Ml+OJNzbYCTzsxtv8vKSFD9PbJjmhYF14k/jKC7S9k= +github.com/fsnotify/fsnotify v1.9.0/go.mod h1:8jBTzvmWwFyi3Pb8djgCCO5IBqzKJ/Jwo8TRcHyHii0= +github.com/fxamacker/cbor/v2 v2.9.0 h1:NpKPmjDBgUfBms6tr6JZkTHtfFGcMKsw3eGcmD/sapM= +github.com/fxamacker/cbor/v2 v2.9.0/go.mod h1:vM4b+DJCtHn+zz7h3FFp/hDAI9WNWCsZj23V5ytsSxQ= +github.com/go-logr/logr v1.4.3 h1:CjnDlHq8ikf6E492q6eKboGOC0T8CDaOvkHCIg8idEI= +github.com/go-logr/logr v1.4.3/go.mod h1:9T104GzyrTigFIr8wt5mBrctHMim0Nb2HLGrmQ40KvY= +github.com/go-logr/zapr v1.3.0 h1:XGdV8XW8zdwFiwOA2Dryh1gj2KRQyOOoNmBy4EplIcQ= +github.com/go-logr/zapr v1.3.0/go.mod h1:YKepepNBd1u/oyhd/yQmtjVXmm9uML4IXUgMOwR8/Gg= +github.com/go-openapi/jsonpointer v0.19.6/go.mod h1:osyAmYz/mB/C3I+WsTTSgw1ONzaLJoLCyoi6/zppojs= +github.com/go-openapi/jsonpointer v0.21.0 h1:YgdVicSA9vH5RiHs9TZW5oyafXZFc6+2Vc1rr/O9oNQ= +github.com/go-openapi/jsonpointer v0.21.0/go.mod h1:IUyH9l/+uyhIYQ/PXVA41Rexl+kOkAPDdXEYns6fzUY= +github.com/go-openapi/jsonreference v0.20.2 h1:3sVjiK66+uXK/6oQ8xgcRKcFgQ5KXa2KvnJRumpMGbE= +github.com/go-openapi/jsonreference v0.20.2/go.mod h1:Bl1zwGIM8/wsvqjsOQLJ/SH+En5Ap4rVB5KVcIDZG2k= +github.com/go-openapi/swag v0.22.3/go.mod h1:UzaqsxGiab7freDnrUUra0MwWfN/q7tE4j+VcZ0yl14= +github.com/go-openapi/swag v0.23.0 h1:vsEVJDUo2hPJ2tu0/Xc+4noaxyEffXNIs3cOULZ+GrE= +github.com/go-openapi/swag v0.23.0/go.mod h1:esZ8ITTYEsH1V2trKHjAN8Ai7xHb8RV+YSZ577vPjgQ= +github.com/go-task/slim-sprig/v3 v3.0.0 h1:sUs3vkvUymDpBKi3qH1YSqBQk9+9D/8M2mN1vB6EwHI= +github.com/go-task/slim-sprig/v3 v3.0.0/go.mod h1:W848ghGpv3Qj3dhTPRyJypKRiqCdHZiAzKg9hl15HA8= +github.com/google/btree v1.1.3 h1:CVpQJjYgC4VbzxeGVHfvZrv1ctoYCAI8vbl07Fcxlyg= +github.com/google/btree v1.1.3/go.mod h1:qOPhT0dTNdNzV6Z/lhRX0YXUafgPLFUh+gZMl761Gm4= +github.com/google/gnostic-models v0.7.0 h1:qwTtogB15McXDaNqTZdzPJRHvaVJlAl+HVQnLmJEJxo= +github.com/google/gnostic-models v0.7.0/go.mod h1:whL5G0m6dmc5cPxKc5bdKdEN3UjI7OUGxBlw57miDrQ= +github.com/google/go-cmp 
v0.7.0 h1:wk8382ETsv4JYUZwIsn6YpYiWiBsYLSJiTsyBybVuN8= +github.com/google/go-cmp v0.7.0/go.mod h1:pXiqmnSA92OHEEa9HXL2W4E7lf9JzCmGVUdgjX3N/iU= +github.com/google/gofuzz v1.0.0/go.mod h1:dBl0BpW6vV/+mYPU4Po3pmUjxk6FQPldtuIdl/M65Eg= +github.com/google/gofuzz v1.2.0 h1:xRy4A+RhZaiKjJ1bPfwQ8sedCA+YS2YcCHW6ec7JMi0= +github.com/google/gofuzz v1.2.0/go.mod h1:dBl0BpW6vV/+mYPU4Po3pmUjxk6FQPldtuIdl/M65Eg= +github.com/google/pprof v0.0.0-20250403155104-27863c87afa6 h1:BHT72Gu3keYf3ZEu2J0b1vyeLSOYI8bm5wbJM/8yDe8= +github.com/google/pprof v0.0.0-20250403155104-27863c87afa6/go.mod h1:boTsfXsheKC2y+lKOCMpSfarhxDeIzfZG1jqGcPl3cA= +github.com/google/uuid v1.6.0 h1:NIvaJDMOsjHA8n1jAhLSgzrAzy1Hgr+hNrb57e+94F0= +github.com/google/uuid v1.6.0/go.mod h1:TIyPZe4MgqvfeYDBFedMoGGpEw/LqOeaOT+nhxU+yHo= +github.com/josharian/intern v1.0.0 h1:vlS4z54oSdjm0bgjRigI+G1HpF+tI+9rE5LLzOg8HmY= +github.com/josharian/intern v1.0.0/go.mod h1:5DoeVV0s6jJacbCEi61lwdGj/aVlrQvzHFFd8Hwg//Y= +github.com/json-iterator/go v1.1.12 h1:PV8peI4a0ysnczrg+LtxykD8LfKY9ML6u2jnxaEnrnM= +github.com/json-iterator/go v1.1.12/go.mod h1:e30LSqwooZae/UwlEbR2852Gd8hjQvJoHmT4TnhNGBo= +github.com/klauspost/compress v1.18.0 h1:c/Cqfb0r+Yi+JtIEq73FWXVkRonBlf0CRNYc8Zttxdo= +github.com/klauspost/compress v1.18.0/go.mod h1:2Pp+KzxcywXVXMr50+X0Q/Lsb43OQHYWRCY2AiWywWQ= +github.com/kr/pretty v0.2.1/go.mod h1:ipq/a2n7PKx3OHsz4KJII5eveXtPO4qwEXGdVfWzfnI= +github.com/kr/pretty v0.3.1 h1:flRD4NNwYAUpkphVc1HcthR4KEIFJ65n8Mw5qdRn3LE= +github.com/kr/pretty v0.3.1/go.mod h1:hoEshYVHaxMs3cyo3Yncou5ZscifuDolrwPKZanG3xk= +github.com/kr/pty v1.1.1/go.mod h1:pFQYn66WHrOpPYNljwOMqo10TkYh1fy3cYio2l3bCsQ= +github.com/kr/text v0.1.0/go.mod h1:4Jbv+DJW3UT/LiOwJeYQe1efqtUx/iVham/4vfdArNI= +github.com/kr/text v0.2.0 h1:5Nx0Ya0ZqY2ygV366QzturHI13Jq95ApcVaJBhpS+AY= +github.com/kr/text v0.2.0/go.mod h1:eLer722TekiGuMkidMxC/pM04lWEeraHUUmBw8l2grE= +github.com/kylelemons/godebug v1.1.0 h1:RPNrshWIDI6G2gRW9EHilWtl7Z6Sb1BR0xunSBf0SNc= 
+github.com/kylelemons/godebug v1.1.0/go.mod h1:9/0rRGxNHcop5bhtWyNeEfOS8JIWk580+fNqagV/RAw= +github.com/mailru/easyjson v0.7.7 h1:UGYAvKxe3sBsEDzO8ZeWOSlIQfWFlxbzLZe7hwFURr0= +github.com/mailru/easyjson v0.7.7/go.mod h1:xzfreul335JAWq5oZzymOObrkdz5UnU4kGfJJLY9Nlc= +github.com/modern-go/concurrent v0.0.0-20180228061459-e0a39a4cb421/go.mod h1:6dJC0mAP4ikYIbvyc7fijjWJddQyLn8Ig3JB5CqoB9Q= +github.com/modern-go/concurrent v0.0.0-20180306012644-bacd9c7ef1dd h1:TRLaZ9cD/w8PVh93nsPXa1VrQ6jlwL5oN8l14QlcNfg= +github.com/modern-go/concurrent v0.0.0-20180306012644-bacd9c7ef1dd/go.mod h1:6dJC0mAP4ikYIbvyc7fijjWJddQyLn8Ig3JB5CqoB9Q= +github.com/modern-go/reflect2 v1.0.2/go.mod h1:yWuevngMOJpCy52FWWMvUC8ws7m/LJsjYzDa0/r8luk= +github.com/modern-go/reflect2 v1.0.3-0.20250322232337-35a7c28c31ee h1:W5t00kpgFdJifH4BDsTlE89Zl93FEloxaWZfGcifgq8= +github.com/modern-go/reflect2 v1.0.3-0.20250322232337-35a7c28c31ee/go.mod h1:yWuevngMOJpCy52FWWMvUC8ws7m/LJsjYzDa0/r8luk= +github.com/munnerz/goautoneg v0.0.0-20191010083416-a7dc8b61c822 h1:C3w9PqII01/Oq1c1nUAm88MOHcQC9l5mIlSMApZMrHA= +github.com/munnerz/goautoneg v0.0.0-20191010083416-a7dc8b61c822/go.mod h1:+n7T8mK8HuQTcFwEeznm/DIxMOiR9yIdICNftLE1DvQ= +github.com/onsi/ginkgo/v2 v2.27.2 h1:LzwLj0b89qtIy6SSASkzlNvX6WktqurSHwkk2ipF/Ns= +github.com/onsi/ginkgo/v2 v2.27.2/go.mod h1:ArE1D/XhNXBXCBkKOLkbsb2c81dQHCRcF5zwn/ykDRo= +github.com/onsi/gomega v1.38.2 h1:eZCjf2xjZAqe+LeWvKb5weQ+NcPwX84kqJ0cZNxok2A= +github.com/onsi/gomega v1.38.2/go.mod h1:W2MJcYxRGV63b418Ai34Ud0hEdTVXq9NW9+Sx6uXf3k= +github.com/pkg/errors v0.9.1 h1:FEBLx1zS214owpjy7qsBeixbURkuhQAwrK5UwLGTwt4= +github.com/pkg/errors v0.9.1/go.mod h1:bwawxfHBFNV+L2hUp1rHADufV3IMtnDRdf1r5NINEl0= +github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM= +github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4= +github.com/prometheus/client_golang v1.23.2 h1:Je96obch5RDVy3FDMndoUsjAhG5Edi49h0RJWRi/o0o= 
+github.com/prometheus/client_golang v1.23.2/go.mod h1:Tb1a6LWHB3/SPIzCoaDXI4I8UHKeFTEQ1YCr+0Gyqmg= +github.com/prometheus/client_model v0.6.2 h1:oBsgwpGs7iVziMvrGhE53c/GrLUsZdHnqNwqPLxwZyk= +github.com/prometheus/client_model v0.6.2/go.mod h1:y3m2F6Gdpfy6Ut/GBsUqTWZqCUvMVzSfMLjcu6wAwpE= +github.com/prometheus/common v0.66.1 h1:h5E0h5/Y8niHc5DlaLlWLArTQI7tMrsfQjHV+d9ZoGs= +github.com/prometheus/common v0.66.1/go.mod h1:gcaUsgf3KfRSwHY4dIMXLPV0K/Wg1oZ8+SbZk/HH/dA= +github.com/prometheus/procfs v0.16.1 h1:hZ15bTNuirocR6u0JZ6BAHHmwS1p8B4P6MRqxtzMyRg= +github.com/prometheus/procfs v0.16.1/go.mod h1:teAbpZRB1iIAJYREa1LsoWUXykVXA1KlTmWl8x/U+Is= +github.com/rogpeppe/go-internal v1.14.1 h1:UQB4HGPB6osV0SQTLymcB4TgvyWu6ZyliaW0tI/otEQ= +github.com/rogpeppe/go-internal v1.14.1/go.mod h1:MaRKkUm5W0goXpeCfT7UZI6fk/L7L7so1lCWt35ZSgc= +github.com/spf13/pflag v1.0.9 h1:9exaQaMOCwffKiiiYk6/BndUBv+iRViNW+4lEMi0PvY= +github.com/spf13/pflag v1.0.9/go.mod h1:McXfInJRrz4CZXVZOBLb0bTZqETkiAhM9Iw0y3An2Bg= +github.com/stretchr/objx v0.1.0/go.mod h1:HFkY916IF+rwdDfMAkV7OtwuqBVzrE8GR6GFx+wExME= +github.com/stretchr/objx v0.4.0/go.mod h1:YvHI0jy2hoMjB+UWwv71VJQ9isScKT/TqJzVSSt89Yw= +github.com/stretchr/objx v0.5.0/go.mod h1:Yh+to48EsGEfYuaHDzXPcE3xhTkx73EhmCGUpEOglKo= +github.com/stretchr/objx v0.5.2 h1:xuMeJ0Sdp5ZMRXx/aWO6RZxdr3beISkG5/G/aIRr3pY= +github.com/stretchr/objx v0.5.2/go.mod h1:FRsXN1f5AsAjCGJKqEizvkpNtU+EGNCLh3NxZ/8L+MA= +github.com/stretchr/testify v1.3.0/go.mod h1:M5WIy9Dh21IEIfnGCwXGc5bZfKNJtfHm1UVUgZn+9EI= +github.com/stretchr/testify v1.7.1/go.mod h1:6Fq8oRcR53rry900zMqJjRRixrwX3KX962/h/Wwjteg= +github.com/stretchr/testify v1.8.0/go.mod h1:yNjHg4UonilssWZ8iaSj1OCr/vHnekPRkoO+kdMU+MU= +github.com/stretchr/testify v1.8.1/go.mod h1:w2LPCIKwWwSfY2zedu0+kehJoqGctiVI29o6fzry7u4= +github.com/stretchr/testify v1.11.1 h1:7s2iGBzp5EwR7/aIZr8ao5+dra3wiQyKjjFuvgVKu7U= +github.com/stretchr/testify v1.11.1/go.mod h1:wZwfW3scLgRK+23gO65QZefKpKQRnfz6sD981Nm4B6U= +github.com/x448/float16 
v0.8.4 h1:qLwI1I70+NjRFUR3zs1JPUCgaCXSh3SW62uAKT1mSBM= +github.com/x448/float16 v0.8.4/go.mod h1:14CWIYCyZA/cWjXOioeEpHeN/83MdbZDRQHoFcYsOfg= +go.uber.org/goleak v1.3.0 h1:2K3zAYmnTNqV73imy9J1T3WC+gmCePx2hEGkimedGto= +go.uber.org/goleak v1.3.0/go.mod h1:CoHD4mav9JJNrW/WLlf7HGZPjdw8EucARQHekz1X6bE= +go.uber.org/multierr v1.11.0 h1:blXXJkSxSSfBVBlC76pxqeO+LN3aDfLQo+309xJstO0= +go.uber.org/multierr v1.11.0/go.mod h1:20+QtiLqy0Nd6FdQB9TLXag12DsQkrbs3htMFfDN80Y= +go.uber.org/zap v1.27.0 h1:aJMhYGrd5QSmlpLMr2MftRKl7t8J8PTZPA732ud/XR8= +go.uber.org/zap v1.27.0/go.mod h1:GB2qFLM7cTU87MWRP2mPIjqfIDnGu+VIO4V/SdhGo2E= +go.yaml.in/yaml/v2 v2.4.3 h1:6gvOSjQoTB3vt1l+CU+tSyi/HOjfOjRLJ4YwYZGwRO0= +go.yaml.in/yaml/v2 v2.4.3/go.mod h1:zSxWcmIDjOzPXpjlTTbAsKokqkDNAVtZO0WOMiT90s8= +go.yaml.in/yaml/v3 v3.0.4 h1:tfq32ie2Jv2UxXFdLJdh3jXuOzWiL1fo0bu/FbuKpbc= +go.yaml.in/yaml/v3 v3.0.4/go.mod h1:DhzuOOF2ATzADvBadXxruRBLzYTpT36CKvDb3+aBEFg= +golang.org/x/mod v0.29.0 h1:HV8lRxZC4l2cr3Zq1LvtOsi/ThTgWnUk/y64QSs8GwA= +golang.org/x/mod v0.29.0/go.mod h1:NyhrlYXJ2H4eJiRy/WDBO6HMqZQ6q9nk4JzS3NuCK+w= +golang.org/x/net v0.47.0 h1:Mx+4dIFzqraBXUugkia1OOvlD6LemFo1ALMHjrXDOhY= +golang.org/x/net v0.47.0/go.mod h1:/jNxtkgq5yWUGYkaZGqo27cfGZ1c5Nen03aYrrKpVRU= +golang.org/x/oauth2 v0.30.0 h1:dnDm7JmhM45NNpd8FDDeLhK6FwqbOf4MLCM9zb1BOHI= +golang.org/x/oauth2 v0.30.0/go.mod h1:B++QgG3ZKulg6sRPGD/mqlHQs5rB3Ml9erfeDY7xKlU= +golang.org/x/sync v0.18.0 h1:kr88TuHDroi+UVf+0hZnirlk8o8T+4MrK6mr60WkH/I= +golang.org/x/sync v0.18.0/go.mod h1:9KTHXmSnoGruLpwFjVSX0lNNA75CykiMECbovNTZqGI= +golang.org/x/sys v0.38.0 h1:3yZWxaJjBmCWXqhN1qh02AkOnCQ1poK6oF+a7xWL6Gc= +golang.org/x/sys v0.38.0/go.mod h1:OgkHotnGiDImocRcuBABYBEXf8A9a87e/uXjp9XT3ks= +golang.org/x/term v0.37.0 h1:8EGAD0qCmHYZg6J17DvsMy9/wJ7/D/4pV/wfnld5lTU= +golang.org/x/term v0.37.0/go.mod h1:5pB4lxRNYYVZuTLmy8oR2BH8dflOR+IbTYFD8fi3254= +golang.org/x/text v0.31.0 h1:aC8ghyu4JhP8VojJ2lEHBnochRno1sgL6nEi9WGFGMM= +golang.org/x/text v0.31.0/go.mod 
h1:tKRAlv61yKIjGGHX/4tP1LTbc13YSec1pxVEWXzfoeM= +golang.org/x/time v0.9.0 h1:EsRrnYcQiGH+5FfbgvV4AP7qEZstoyrHB0DzarOQ4ZY= +golang.org/x/time v0.9.0/go.mod h1:3BpzKBy/shNhVucY/MWOyx10tF3SFh9QdLuxbVysPQM= +golang.org/x/tools v0.38.0 h1:Hx2Xv8hISq8Lm16jvBZ2VQf+RLmbd7wVUsALibYI/IQ= +golang.org/x/tools v0.38.0/go.mod h1:yEsQ/d/YK8cjh0L6rZlY8tgtlKiBNTL14pGDJPJpYQs= +gomodules.xyz/jsonpatch/v2 v2.4.0 h1:Ci3iUJyx9UeRx7CeFN8ARgGbkESwJK+KB9lLcWxY/Zw= +gomodules.xyz/jsonpatch/v2 v2.4.0/go.mod h1:AH3dM2RI6uoBZxn3LVrfvJ3E0/9dG4cSrbuBJT4moAY= +google.golang.org/protobuf v1.36.8 h1:xHScyCOEuuwZEc6UtSOvPbAT4zRh0xcNRYekJwfqyMc= +google.golang.org/protobuf v1.36.8/go.mod h1:fuxRtAxBytpl4zzqUh6/eyUujkJdNiuEkXntxiD/uRU= +gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0= +gopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c h1:Hei/4ADfdWqJk1ZMxUNpqntNwaWcugrBjAiHlqqRiVk= +gopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c/go.mod h1:JHkPIbrfpd72SG/EVd6muEfDQjcINNoR0C8j2r3qZ4Q= +gopkg.in/evanphx/json-patch.v4 v4.13.0 h1:czT3CmqEaQ1aanPc5SdlgQrrEIb8w/wwCvWWnfEbYzo= +gopkg.in/evanphx/json-patch.v4 v4.13.0/go.mod h1:p8EYWUEYMpynmqDbY58zCKCFZw8pRWMG4EsWvDvM72M= +gopkg.in/inf.v0 v0.9.1 h1:73M5CoZyi3ZLMOyDlQh031Cx6N9NDJ2Vvfl76EDAgDc= +gopkg.in/inf.v0 v0.9.1/go.mod h1:cWUDdTG/fYaXco+Dcufb5Vnc6Gp2YChqWtbxRZE0mXw= +gopkg.in/yaml.v3 v3.0.0-20200313102051-9f266ea9e77c/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM= +gopkg.in/yaml.v3 v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA= +gopkg.in/yaml.v3 v3.0.1/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM= +k8s.io/api v0.35.3 h1:pA2fiBc6+N9PDf7SAiluKGEBuScsTzd2uYBkA5RzNWQ= +k8s.io/api v0.35.3/go.mod h1:9Y9tkBcFwKNq2sxwZTQh1Njh9qHl81D0As56tu42GA4= +k8s.io/apiextensions-apiserver v0.35.0 h1:3xHk2rTOdWXXJM+RDQZJvdx0yEOgC0FgQ1PlJatA5T4= +k8s.io/apiextensions-apiserver v0.35.0/go.mod h1:E1Ahk9SADaLQ4qtzYFkwUqusXTcaV2uw3l14aqpL2LU= +k8s.io/apimachinery v0.35.3 
h1:MeaUwQCV3tjKP4bcwWGgZ/cp/vpsRnQzqO6J6tJyoF8= +k8s.io/apimachinery v0.35.3/go.mod h1:jQCgFZFR1F4Ik7hvr2g84RTJSZegBc8yHgFWKn//hns= +k8s.io/client-go v0.35.3 h1:s1lZbpN4uI6IxeTM2cpdtrwHcSOBML1ODNTCCfsP1pg= +k8s.io/client-go v0.35.3/go.mod h1:RzoXkc0mzpWIDvBrRnD+VlfXP+lRzqQjCmKtiwZ8Q9c= +k8s.io/klog/v2 v2.130.1 h1:n9Xl7H1Xvksem4KFG4PYbdQCQxqc/tTUyrgXaOhHSzk= +k8s.io/klog/v2 v2.130.1/go.mod h1:3Jpz1GvMt720eyJH1ckRHK1EDfpxISzJ7I9OYgaDtPE= +k8s.io/kube-openapi v0.0.0-20250910181357-589584f1c912 h1:Y3gxNAuB0OBLImH611+UDZcmKS3g6CthxToOb37KgwE= +k8s.io/kube-openapi v0.0.0-20250910181357-589584f1c912/go.mod h1:kdmbQkyfwUagLfXIad1y2TdrjPFWp2Q89B3qkRwf/pQ= +k8s.io/utils v0.0.0-20260319190234-28399d86e0b5 h1:kBawHLSnx/mYHmRnNUf9d4CpjREbeZuxoSGOX/J+aYM= +k8s.io/utils v0.0.0-20260319190234-28399d86e0b5/go.mod h1:xDxuJ0whA3d0I4mf/C4ppKHxXynQ+fxnkmQH0vTHnuk= +sigs.k8s.io/controller-runtime v0.23.3 h1:VjB/vhoPoA9l1kEKZHBMnQF33tdCLQKJtydy4iqwZ80= +sigs.k8s.io/controller-runtime v0.23.3/go.mod h1:B6COOxKptp+YaUT5q4l6LqUJTRpizbgf9KSRNdQGns0= +sigs.k8s.io/json v0.0.0-20250730193827-2d320260d730 h1:IpInykpT6ceI+QxKBbEflcR5EXP7sU1kvOlxwZh5txg= +sigs.k8s.io/json v0.0.0-20250730193827-2d320260d730/go.mod h1:mdzfpAEoE6DHQEN0uh9ZbOCuHbLK5wOm7dK4ctXE9Tg= +sigs.k8s.io/randfill v1.0.0 h1:JfjMILfT8A6RbawdsK2JXGBR5AQVfd+9TbzrlneTyrU= +sigs.k8s.io/randfill v1.0.0/go.mod h1:XeLlZ/jmk4i1HRopwe7/aU3H5n1zNUcX6TM94b3QxOY= +sigs.k8s.io/structured-merge-diff/v6 v6.3.2-0.20260122202528-d9cc6641c482 h1:2WOzJpHUBVrrkDjU4KBT8n5LDcj824eX0I5UKcgeRUs= +sigs.k8s.io/structured-merge-diff/v6 v6.3.2-0.20260122202528-d9cc6641c482/go.mod h1:M3W8sfWvn2HhQDIbGWj3S099YozAsymCo/wrT5ohRUE= +sigs.k8s.io/yaml v1.6.0 h1:G8fkbMSAFqgEFgh4b1wmtzDnioxFCUgTZhlbj5P9QYs= +sigs.k8s.io/yaml v1.6.0/go.mod h1:796bPqUfzR/0jLAl6XjHl3Ck7MiyVv8dbTdyT3/pMf4= diff --git a/operator/internal/controller/helpers.go b/operator/internal/controller/helpers.go new file mode 100644 index 0000000..175805d --- /dev/null +++ 
b/operator/internal/controller/helpers.go @@ -0,0 +1,269 @@ +package controller + +import ( + "context" + "fmt" + "strconv" + "strings" + "time" + + appsv1 "k8s.io/api/apps/v1" + batchv1 "k8s.io/api/batch/v1" + corev1 "k8s.io/api/core/v1" + metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" + "k8s.io/utils/ptr" + "sigs.k8s.io/controller-runtime/pkg/client" +) + +const ( + // Labels used to discover OpenFGA Deployments. + LabelPartOf = "app.kubernetes.io/part-of" + LabelComponent = "app.kubernetes.io/component" + + LabelPartOfValue = "openfga" + LabelComponentValue = "authorization-controller" + + // Annotations set on the Deployment by the Helm chart / operator. + AnnotationDesiredReplicas = "openfga.dev/desired-replicas" + AnnotationMigrationServiceAccount = "openfga.dev/migration-service-account" + + // Defaults for migration Job configuration. + DefaultBackoffLimit int32 = 3 + DefaultActiveDeadlineSeconds int64 = 300 + DefaultTTLSecondsAfterFinished int32 = 300 +) + +// extractImageTag returns the tag portion of a container image reference. +// For "openfga/openfga:v1.14.0" it returns "v1.14.0". +// For "openfga/openfga@sha256:abc..." it returns the digest. +// If there is no tag or digest, it returns "latest". +func extractImageTag(image string) string { + // Handle digest references. + if idx := strings.LastIndex(image, "@"); idx != -1 { + return image[idx+1:] + } + + // Handle tag references — be careful not to split on the port in a registry URL. + // Find the last '/' to isolate the image name from the registry. + lastSlash := strings.LastIndex(image, "/") + nameAndTag := image + if lastSlash != -1 { + nameAndTag = image[lastSlash+1:] + } + + if idx := strings.LastIndex(nameAndTag, ":"); idx != -1 { + return nameAndTag[idx+1:] + } + + return "latest" +} + +// migrationConfigMapName returns the name of the ConfigMap used to track migration state. 
+func migrationConfigMapName(deploymentName string) string {
+	return deploymentName + "-migration-status"
+}
+
+// migrationJobName returns the name of the migration Job.
+func migrationJobName(deploymentName string) string {
+	return deploymentName + "-migrate"
+}
+
+// buildMigrationJob constructs a migration Job for the given Deployment and version.
+func buildMigrationJob(
+	deployment *appsv1.Deployment,
+	desiredVersion string,
+	backoffLimit int32,
+	activeDeadlineSeconds int64,
+	ttlSecondsAfterFinished int32,
+) *batchv1.Job {
+	// Extract the main container's image and datastore env vars.
+	mainContainer := deployment.Spec.Template.Spec.Containers[0]
+
+	// Determine the migration service account.
+	migrationSA := deployment.Annotations[AnnotationMigrationServiceAccount]
+	if migrationSA == "" {
+		migrationSA = deployment.Spec.Template.Spec.ServiceAccountName
+	}
+
+	// Filter env vars — only pass datastore-related vars to the migration Job.
+	var datastoreEnvVars []corev1.EnvVar
+	for _, env := range mainContainer.Env {
+		if strings.HasPrefix(env.Name, "OPENFGA_DATASTORE_") {
+			datastoreEnvVars = append(datastoreEnvVars, env)
+		}
+	}
+
+	return &batchv1.Job{
+		ObjectMeta: metav1.ObjectMeta{
+			Name:      migrationJobName(deployment.Name),
+			Namespace: deployment.Namespace,
+			Labels: map[string]string{
+				LabelPartOf:                    LabelPartOfValue,
+				LabelComponent:                 "migration",
+				"app.kubernetes.io/managed-by": "openfga-operator",
+			},
+			// Record the version this Job migrates to, so stale Jobs can be
+			// told apart from current ones. An annotation rather than a
+			// label, because digest references ("sha256:...") contain
+			// characters that are invalid in label values.
+			Annotations: map[string]string{
+				"openfga.dev/migration-version": desiredVersion,
+			},
+			OwnerReferences: []metav1.OwnerReference{
+				{
+					APIVersion:         "apps/v1",
+					Kind:               "Deployment",
+					Name:               deployment.Name,
+					UID:                deployment.UID,
+					Controller:         ptr.To(true),
+					BlockOwnerDeletion: ptr.To(true),
+				},
+			},
+		},
+		Spec: batchv1.JobSpec{
+			BackoffLimit:            ptr.To(backoffLimit),
+			ActiveDeadlineSeconds:   ptr.To(activeDeadlineSeconds),
+			TTLSecondsAfterFinished: ptr.To(ttlSecondsAfterFinished),
+			Template: corev1.PodTemplateSpec{
+				ObjectMeta: metav1.ObjectMeta{
+					Labels: map[string]string{
+						LabelPartOf: LabelPartOfValue,
LabelComponent: "migration", + }, + }, + Spec: corev1.PodSpec{ + ServiceAccountName: migrationSA, + RestartPolicy: corev1.RestartPolicyNever, + Containers: []corev1.Container{ + { + Name: "migrate-database", + Image: mainContainer.Image, + Args: []string{"migrate"}, + Env: datastoreEnvVars, + }, + }, + // Inherit scheduling constraints from the parent Deployment. + NodeSelector: deployment.Spec.Template.Spec.NodeSelector, + Tolerations: deployment.Spec.Template.Spec.Tolerations, + Affinity: deployment.Spec.Template.Spec.Affinity, + }, + }, + }, + } +} + +// updateMigrationStatus creates or updates the migration-status ConfigMap. +func updateMigrationStatus( + ctx context.Context, + c client.Client, + deployment *appsv1.Deployment, + version string, + jobName string, +) error { + cmName := migrationConfigMapName(deployment.Name) + cm := &corev1.ConfigMap{ + ObjectMeta: metav1.ObjectMeta{ + Name: cmName, + Namespace: deployment.Namespace, + Labels: map[string]string{ + LabelPartOf: LabelPartOfValue, + LabelComponent: "migration", + "app.kubernetes.io/managed-by": "openfga-operator", + }, + OwnerReferences: []metav1.OwnerReference{ + { + APIVersion: "apps/v1", + Kind: "Deployment", + Name: deployment.Name, + UID: deployment.UID, + Controller: ptr.To(true), + BlockOwnerDeletion: ptr.To(true), + }, + }, + }, + Data: map[string]string{ + "version": version, + "migratedAt": time.Now().UTC().Format(time.RFC3339), + "jobName": jobName, + }, + } + + // Try to get existing ConfigMap first. + existing := &corev1.ConfigMap{} + err := c.Get(ctx, client.ObjectKeyFromObject(cm), existing) + if err != nil { + if client.IgnoreNotFound(err) != nil { + return fmt.Errorf("getting migration status ConfigMap: %w", err) + } + // ConfigMap doesn't exist — create it. + if createErr := c.Create(ctx, cm); createErr != nil { + return fmt.Errorf("creating migration status ConfigMap: %w", createErr) + } + return nil + } + + // Update existing ConfigMap. 
+	existing.Data = cm.Data
+	existing.Labels = cm.Labels
+	if updateErr := c.Update(ctx, existing); updateErr != nil {
+		return fmt.Errorf("updating migration status ConfigMap: %w", updateErr)
+	}
+	return nil
+}
+
+// ensureDeploymentScaled ensures the Deployment is scaled to the desired replica count.
+// The desired count is read from the AnnotationDesiredReplicas annotation.
+// Returns true if the Deployment was already at the desired scale.
+func ensureDeploymentScaled(ctx context.Context, c client.Client, deployment *appsv1.Deployment) (bool, error) {
+	desiredStr, ok := deployment.Annotations[AnnotationDesiredReplicas]
+	if !ok || desiredStr == "" {
+		// No annotation — nothing to do. The Deployment may not have been scaled down yet.
+		return true, nil
+	}
+
+	desired, err := strconv.ParseInt(desiredStr, 10, 32)
+	if err != nil {
+		return false, fmt.Errorf("parsing desired replicas annotation: %w", err)
+	}
+
+	desiredInt32 := int32(desired)
+	current := int32(1)
+	if deployment.Spec.Replicas != nil {
+		current = *deployment.Spec.Replicas
+	}
+
+	if current == desiredInt32 {
+		return true, nil
+	}
+
+	patch := client.MergeFrom(deployment.DeepCopy())
+	deployment.Spec.Replicas = ptr.To(desiredInt32)
+	// Clear the stored count in the same patch. If the annotation lingered
+	// after scale-up, every later reconcile would compare against this stale
+	// value and revert any replica change the user makes between migrations.
+	delete(deployment.Annotations, AnnotationDesiredReplicas)
+	if patchErr := c.Patch(ctx, deployment, patch); patchErr != nil {
+		return false, fmt.Errorf("scaling deployment to %d replicas: %w", desiredInt32, patchErr)
+	}
+	return false, nil
+}
+
+// scaleDeploymentToZero scales the Deployment to 0 replicas, storing the current
+// desired count in an annotation so it can be restored later.
+func scaleDeploymentToZero(ctx context.Context, c client.Client, deployment *appsv1.Deployment) error {
+	if deployment.Spec.Replicas != nil && *deployment.Spec.Replicas == 0 {
+		return nil // Already at zero.
+	}
+
+	patch := client.MergeFrom(deployment.DeepCopy())
+
+	// Store the current desired replica count before zeroing.
+ currentReplicas := int32(1) + if deployment.Spec.Replicas != nil { + currentReplicas = *deployment.Spec.Replicas + } + + // Only store if not already stored (avoid overwriting with 0 on re-reconciliation). + if _, ok := deployment.Annotations[AnnotationDesiredReplicas]; !ok { + if deployment.Annotations == nil { + deployment.Annotations = make(map[string]string) + } + deployment.Annotations[AnnotationDesiredReplicas] = strconv.FormatInt(int64(currentReplicas), 10) + } + + deployment.Spec.Replicas = ptr.To(int32(0)) + + if err := c.Patch(ctx, deployment, patch); err != nil { + return fmt.Errorf("scaling deployment to 0: %w", err) + } + return nil +} diff --git a/operator/internal/controller/migration_controller.go b/operator/internal/controller/migration_controller.go new file mode 100644 index 0000000..935e4ff --- /dev/null +++ b/operator/internal/controller/migration_controller.go @@ -0,0 +1,215 @@ +package controller + +import ( + "context" + "fmt" + "time" + + appsv1 "k8s.io/api/apps/v1" + batchv1 "k8s.io/api/batch/v1" + corev1 "k8s.io/api/core/v1" + apierrors "k8s.io/apimachinery/pkg/api/errors" + metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" + "k8s.io/apimachinery/pkg/types" + ctrl "sigs.k8s.io/controller-runtime" + "sigs.k8s.io/controller-runtime/pkg/builder" + "sigs.k8s.io/controller-runtime/pkg/client" + "sigs.k8s.io/controller-runtime/pkg/handler" + "sigs.k8s.io/controller-runtime/pkg/log" + "sigs.k8s.io/controller-runtime/pkg/predicate" + "sigs.k8s.io/controller-runtime/pkg/reconcile" +) + +// MigrationReconciler watches OpenFGA Deployments and orchestrates database +// migrations when the application version changes. +type MigrationReconciler struct { + client.Client + + // BackoffLimit for migration Jobs. + BackoffLimit int32 + // ActiveDeadlineSeconds for migration Jobs. + ActiveDeadlineSeconds int64 + // TTLSecondsAfterFinished for migration Jobs. 
+ TTLSecondsAfterFinished int32 +} + +// Reconcile handles a single reconciliation for an OpenFGA Deployment. +func (r *MigrationReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) { + logger := log.FromContext(ctx) + + // 1. Get the OpenFGA Deployment. + deployment := &appsv1.Deployment{} + if err := r.Get(ctx, req.NamespacedName, deployment); err != nil { + if apierrors.IsNotFound(err) { + return ctrl.Result{}, nil + } + return ctrl.Result{}, err + } + + // 2. Extract the desired version from the Deployment's image tag. + if len(deployment.Spec.Template.Spec.Containers) == 0 { + logger.Info("deployment has no containers, skipping") + return ctrl.Result{}, nil + } + desiredVersion := extractImageTag(deployment.Spec.Template.Spec.Containers[0].Image) + + // 3. Check current migration status from ConfigMap. + configMap := &corev1.ConfigMap{} + cmName := migrationConfigMapName(req.Name) + err := r.Get(ctx, types.NamespacedName{Name: cmName, Namespace: req.Namespace}, configMap) + + currentVersion := "" + if err == nil { + currentVersion = configMap.Data["version"] + } else if !apierrors.IsNotFound(err) { + return ctrl.Result{}, fmt.Errorf("getting migration status: %w", err) + } + + // 4. If versions match, ensure Deployment is scaled up and return. + if currentVersion == desiredVersion { + logger.V(1).Info("migration up to date", "version", desiredVersion) + if _, scaleErr := ensureDeploymentScaled(ctx, r.Client, deployment); scaleErr != nil { + return ctrl.Result{}, scaleErr + } + return ctrl.Result{}, nil + } + + logger.Info("migration needed", "currentVersion", currentVersion, "desiredVersion", desiredVersion) + + // 5. Ensure the Deployment is scaled to zero before migrating. + if err := scaleDeploymentToZero(ctx, r.Client, deployment); err != nil { + return ctrl.Result{}, err + } + + // 6. Check if a migration Job already exists. 
+ jobName := migrationJobName(req.Name) + job := &batchv1.Job{} + err = r.Get(ctx, types.NamespacedName{Name: jobName, Namespace: req.Namespace}, job) + + if apierrors.IsNotFound(err) { + // Create the migration Job. + job = buildMigrationJob( + deployment, + desiredVersion, + r.BackoffLimit, + r.ActiveDeadlineSeconds, + r.TTLSecondsAfterFinished, + ) + if createErr := r.Create(ctx, job); createErr != nil { + return ctrl.Result{}, fmt.Errorf("creating migration job: %w", createErr) + } + logger.Info("created migration job", "job", jobName, "version", desiredVersion) + return ctrl.Result{RequeueAfter: 5 * time.Second}, nil + } else if err != nil { + return ctrl.Result{}, fmt.Errorf("getting migration job: %w", err) + } + + // 7. Check Job status. + if job.Status.Succeeded >= 1 { + logger.Info("migration succeeded", "version", desiredVersion) + + // Update migration status ConfigMap. + if statusErr := updateMigrationStatus(ctx, r.Client, deployment, desiredVersion, jobName); statusErr != nil { + return ctrl.Result{}, statusErr + } + + // Scale Deployment back up. + if _, scaleErr := ensureDeploymentScaled(ctx, r.Client, deployment); scaleErr != nil { + return ctrl.Result{}, scaleErr + } + + return ctrl.Result{}, nil + } + + backoffLimit := r.BackoffLimit + if job.Spec.BackoffLimit != nil { + backoffLimit = *job.Spec.BackoffLimit + } + + if job.Status.Failed >= backoffLimit { + logger.Error(nil, "migration job failed, will delete and retry", "job", jobName, "version", desiredVersion) + + // Set condition so kubectl describe shows the failure. + setMigrationFailedCondition(deployment, desiredVersion) + if patchErr := r.Status().Update(ctx, deployment); patchErr != nil { + logger.Error(patchErr, "failed to set MigrationFailed condition") + } + + // Delete the failed Job so a fresh one is created on the next reconcile. + // This allows auto-recovery when the database comes back. 
+ propagation := metav1.DeletePropagationBackground + if delErr := r.Delete(ctx, job, &client.DeleteOptions{ + PropagationPolicy: &propagation, + }); delErr != nil && !apierrors.IsNotFound(delErr) { + return ctrl.Result{}, fmt.Errorf("deleting failed migration job: %w", delErr) + } + logger.Info("deleted failed migration job, will retry", "job", jobName) + + // Requeue after a longer delay to avoid tight retry loops. + return ctrl.Result{RequeueAfter: 60 * time.Second}, nil + } + + // 8. Job still running — requeue. + logger.V(1).Info("migration job in progress", "job", jobName) + return ctrl.Result{RequeueAfter: 10 * time.Second}, nil +} + +// setMigrationFailedCondition sets a MigrationFailed condition on the Deployment. +func setMigrationFailedCondition(deployment *appsv1.Deployment, version string) { + condition := appsv1.DeploymentCondition{ + Type: "MigrationFailed", + Status: corev1.ConditionTrue, + LastTransitionTime: metav1.Now(), + Reason: "MigrationJobFailed", + Message: fmt.Sprintf("Database migration failed for version %s. Check migration job logs.", version), + } + + // Replace existing MigrationFailed condition if present. + for i, c := range deployment.Status.Conditions { + if c.Type == "MigrationFailed" { + deployment.Status.Conditions[i] = condition + return + } + } + deployment.Status.Conditions = append(deployment.Status.Conditions, condition) +} + +// SetupWithManager sets up the controller with the Manager. +func (r *MigrationReconciler) SetupWithManager(mgr ctrl.Manager) error { + // Only watch Deployments that are part of OpenFGA. + labelPredicate, err := predicate.LabelSelectorPredicate(metav1.LabelSelector{ + MatchLabels: map[string]string{ + LabelPartOf: LabelPartOfValue, + LabelComponent: LabelComponentValue, + }, + }) + if err != nil { + return fmt.Errorf("creating label predicate: %w", err) + } + + return ctrl.NewControllerManagedBy(mgr). + For(&appsv1.Deployment{}, builder.WithPredicates(labelPredicate)). + Owns(&batchv1.Job{}). 
+ Watches(&corev1.ConfigMap{}, handler.EnqueueRequestsFromMapFunc( + func(ctx context.Context, obj client.Object) []reconcile.Request { + // Only watch ConfigMaps that are migration status ConfigMaps. + if obj.GetLabels()[LabelPartOf] != LabelPartOfValue || + obj.GetLabels()["app.kubernetes.io/managed-by"] != "openfga-operator" { + return nil + } + // Map back to the owning Deployment. + for _, ref := range obj.GetOwnerReferences() { + if ref.Kind == "Deployment" { + return []reconcile.Request{ + {NamespacedName: types.NamespacedName{ + Name: ref.Name, + Namespace: obj.GetNamespace(), + }}, + } + } + } + return nil + }, + )). + Complete(r) +} diff --git a/operator/internal/controller/migration_controller_test.go b/operator/internal/controller/migration_controller_test.go new file mode 100644 index 0000000..dc0f0c5 --- /dev/null +++ b/operator/internal/controller/migration_controller_test.go @@ -0,0 +1,336 @@ +package controller + +import ( + "context" + "testing" + "time" + + appsv1 "k8s.io/api/apps/v1" + batchv1 "k8s.io/api/batch/v1" + corev1 "k8s.io/api/core/v1" + metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" + "k8s.io/apimachinery/pkg/runtime" + "k8s.io/apimachinery/pkg/types" + "k8s.io/utils/ptr" + clientgoscheme "k8s.io/client-go/kubernetes/scheme" + ctrl "sigs.k8s.io/controller-runtime" + "sigs.k8s.io/controller-runtime/pkg/client/fake" +) + +func newScheme() *runtime.Scheme { + s := runtime.NewScheme() + _ = clientgoscheme.AddToScheme(s) + return s +} + +func newTestDeployment(name, namespace, image string, replicas int32) *appsv1.Deployment { + return &appsv1.Deployment{ + ObjectMeta: metav1.ObjectMeta{ + Name: name, + Namespace: namespace, + UID: "test-uid-123", + Labels: map[string]string{ + LabelPartOf: LabelPartOfValue, + LabelComponent: LabelComponentValue, + }, + Annotations: map[string]string{}, + }, + Spec: appsv1.DeploymentSpec{ + Replicas: ptr.To(replicas), + Selector: &metav1.LabelSelector{ + MatchLabels: map[string]string{"app": "openfga"}, + }, 
+ Template: corev1.PodTemplateSpec{ + ObjectMeta: metav1.ObjectMeta{ + Labels: map[string]string{"app": "openfga"}, + }, + Spec: corev1.PodSpec{ + ServiceAccountName: "openfga", + Containers: []corev1.Container{ + { + Name: "openfga", + Image: image, + Env: []corev1.EnvVar{ + {Name: "OPENFGA_DATASTORE_ENGINE", Value: "postgres"}, + {Name: "OPENFGA_DATASTORE_URI", Value: "postgres://localhost/openfga"}, + {Name: "OPENFGA_LOG_LEVEL", Value: "info"}, + }, + }, + }, + }, + }, + }, + } +} + +func newReconciler(objects ...runtime.Object) *MigrationReconciler { + scheme := newScheme() + clientBuilder := fake.NewClientBuilder().WithScheme(scheme) + for _, obj := range objects { + clientBuilder = clientBuilder.WithRuntimeObjects(obj) + } + return &MigrationReconciler{ + Client: clientBuilder.Build(), + BackoffLimit: DefaultBackoffLimit, + ActiveDeadlineSeconds: DefaultActiveDeadlineSeconds, + TTLSecondsAfterFinished: DefaultTTLSecondsAfterFinished, + } +} + +func TestReconcile_FirstInstall_CreatesJob(t *testing.T) { + // Given: a Deployment with no migration-status ConfigMap. + dep := newTestDeployment("openfga", "default", "openfga/openfga:v1.14.0", 0) + r := newReconciler(dep) + + // When: reconciling. + result, err := r.Reconcile(context.Background(), ctrl.Request{ + NamespacedName: types.NamespacedName{Name: "openfga", Namespace: "default"}, + }) + + // Then: a migration Job should be created and requeue requested. + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + if result.RequeueAfter == 0 { + t.Error("expected requeue, got none") + } + + // Verify the Job was created. 
+ job := &batchv1.Job{} + if err := r.Get(context.Background(), types.NamespacedName{ + Name: "openfga-migrate", Namespace: "default", + }, job); err != nil { + t.Fatalf("expected migration job to be created: %v", err) + } + + if job.Spec.Template.Spec.Containers[0].Image != "openfga/openfga:v1.14.0" { + t.Errorf("expected job image openfga/openfga:v1.14.0, got %s", job.Spec.Template.Spec.Containers[0].Image) + } + + if job.Spec.Template.Spec.Containers[0].Args[0] != "migrate" { + t.Errorf("expected job args [migrate], got %v", job.Spec.Template.Spec.Containers[0].Args) + } + + // Verify only datastore env vars were passed. + for _, env := range job.Spec.Template.Spec.Containers[0].Env { + if env.Name == "OPENFGA_LOG_LEVEL" { + t.Error("non-datastore env var OPENFGA_LOG_LEVEL should not be passed to migration job") + } + } +} + +func TestReconcile_VersionMatch_ScalesUp(t *testing.T) { + // Given: a Deployment at 0 replicas with matching migration-status ConfigMap. + dep := newTestDeployment("openfga", "default", "openfga/openfga:v1.14.0", 0) + dep.Annotations[AnnotationDesiredReplicas] = "3" + + cm := &corev1.ConfigMap{ + ObjectMeta: metav1.ObjectMeta{ + Name: "openfga-migration-status", + Namespace: "default", + }, + Data: map[string]string{ + "version": "v1.14.0", + "migratedAt": "2026-04-06T12:00:00Z", + "jobName": "openfga-migrate", + }, + } + + r := newReconciler(dep, cm) + + // When: reconciling. + result, err := r.Reconcile(context.Background(), ctrl.Request{ + NamespacedName: types.NamespacedName{Name: "openfga", Namespace: "default"}, + }) + + // Then: no error, no requeue. + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + if result.RequeueAfter != 0 { + t.Error("expected no requeue when versions match") + } + + // Verify Deployment was scaled up. 
+ updated := &appsv1.Deployment{} + if err := r.Get(context.Background(), types.NamespacedName{ + Name: "openfga", Namespace: "default", + }, updated); err != nil { + t.Fatalf("getting deployment: %v", err) + } + if *updated.Spec.Replicas != 3 { + t.Errorf("expected 3 replicas, got %d", *updated.Spec.Replicas) + } +} + +func TestReconcile_JobSucceeded_UpdatesConfigMapAndScalesUp(t *testing.T) { + // Given: a Deployment at 0 replicas, no ConfigMap, and a succeeded migration Job. + dep := newTestDeployment("openfga", "default", "openfga/openfga:v1.14.0", 0) + dep.Annotations[AnnotationDesiredReplicas] = "3" + + job := &batchv1.Job{ + ObjectMeta: metav1.ObjectMeta{ + Name: "openfga-migrate", + Namespace: "default", + OwnerReferences: []metav1.OwnerReference{ + { + APIVersion: "apps/v1", + Kind: "Deployment", + Name: "openfga", + UID: "test-uid-123", + }, + }, + }, + Spec: batchv1.JobSpec{ + BackoffLimit: ptr.To(int32(3)), + Template: corev1.PodTemplateSpec{ + Spec: corev1.PodSpec{ + Containers: []corev1.Container{{Name: "migrate", Image: "openfga/openfga:v1.14.0"}}, + RestartPolicy: corev1.RestartPolicyNever, + }, + }, + }, + Status: batchv1.JobStatus{ + Succeeded: 1, + }, + } + + r := newReconciler(dep, job) + + // When: reconciling. + _, err := r.Reconcile(context.Background(), ctrl.Request{ + NamespacedName: types.NamespacedName{Name: "openfga", Namespace: "default"}, + }) + + // Then: no error. + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + + // Verify ConfigMap was created. + cm := &corev1.ConfigMap{} + if err := r.Get(context.Background(), types.NamespacedName{ + Name: "openfga-migration-status", Namespace: "default", + }, cm); err != nil { + t.Fatalf("expected ConfigMap to be created: %v", err) + } + if cm.Data["version"] != "v1.14.0" { + t.Errorf("expected version v1.14.0 in ConfigMap, got %s", cm.Data["version"]) + } + + // Verify Deployment was scaled up. 
+ updated := &appsv1.Deployment{} + if err := r.Get(context.Background(), types.NamespacedName{ + Name: "openfga", Namespace: "default", + }, updated); err != nil { + t.Fatalf("getting deployment: %v", err) + } + if *updated.Spec.Replicas != 3 { + t.Errorf("expected 3 replicas, got %d", *updated.Spec.Replicas) + } +} + +func TestReconcile_JobFailed_DeletesJobAndRequeues(t *testing.T) { + // Given: a Deployment at 0 replicas and a failed migration Job. + dep := newTestDeployment("openfga", "default", "openfga/openfga:v1.14.0", 0) + dep.Annotations[AnnotationDesiredReplicas] = "3" + + job := &batchv1.Job{ + ObjectMeta: metav1.ObjectMeta{ + Name: "openfga-migrate", + Namespace: "default", + OwnerReferences: []metav1.OwnerReference{ + { + APIVersion: "apps/v1", + Kind: "Deployment", + Name: "openfga", + UID: "test-uid-123", + }, + }, + }, + Spec: batchv1.JobSpec{ + BackoffLimit: ptr.To(int32(3)), + Template: corev1.PodTemplateSpec{ + Spec: corev1.PodSpec{ + Containers: []corev1.Container{{Name: "migrate", Image: "openfga/openfga:v1.14.0"}}, + RestartPolicy: corev1.RestartPolicyNever, + }, + }, + }, + Status: batchv1.JobStatus{ + Failed: 3, + }, + } + + r := newReconciler(dep, job) + + // When: reconciling. + result, err := r.Reconcile(context.Background(), ctrl.Request{ + NamespacedName: types.NamespacedName{Name: "openfga", Namespace: "default"}, + }) + + // Then: no error, but requeue after 60s for retry. + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + if result.RequeueAfter != 60*time.Second { + t.Errorf("expected 60s requeue, got %v", result.RequeueAfter) + } + + // Verify Deployment was NOT scaled up — still at 0. 
+ updated := &appsv1.Deployment{} + if getErr := r.Get(context.Background(), types.NamespacedName{ + Name: "openfga", Namespace: "default", + }, updated); getErr != nil { + t.Fatalf("getting deployment: %v", getErr) + } + if *updated.Spec.Replicas != 0 { + t.Errorf("expected 0 replicas after failed migration, got %d", *updated.Spec.Replicas) + } + + // Verify the failed Job was deleted. + deletedJob := &batchv1.Job{} + if getErr := r.Get(context.Background(), types.NamespacedName{ + Name: "openfga-migrate", Namespace: "default", + }, deletedJob); getErr == nil { + t.Error("expected failed migration job to be deleted") + } +} + +func TestReconcile_DeploymentNotFound_NoError(t *testing.T) { + r := newReconciler() + + result, err := r.Reconcile(context.Background(), ctrl.Request{ + NamespacedName: types.NamespacedName{Name: "nonexistent", Namespace: "default"}, + }) + + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + if result.RequeueAfter != 0 { + t.Error("expected no requeue for missing deployment") + } +} + +func TestExtractImageTag(t *testing.T) { + tests := []struct { + image string + expected string + }{ + {"openfga/openfga:v1.14.0", "v1.14.0"}, + {"openfga/openfga:latest", "latest"}, + {"openfga/openfga", "latest"}, + {"ghcr.io/openfga/openfga:v1.14.0", "v1.14.0"}, + {"registry.example.com:5000/openfga/openfga:v1.14.0", "v1.14.0"}, + {"openfga/openfga@sha256:abcdef1234567890", "sha256:abcdef1234567890"}, + } + + for _, tt := range tests { + t.Run(tt.image, func(t *testing.T) { + got := extractImageTag(tt.image) + if got != tt.expected { + t.Errorf("extractImageTag(%q) = %q, want %q", tt.image, got, tt.expected) + } + }) + } +} diff --git a/operator/tests/README.md b/operator/tests/README.md new file mode 100644 index 0000000..e4377a9 --- /dev/null +++ b/operator/tests/README.md @@ -0,0 +1,186 @@ +# Local Integration Tests + +Manual integration tests for the OpenFGA operator on a local Kubernetes cluster (Rancher Desktop, kind, minikube, etc.). 
+
+## Prerequisites
+
+- A running local Kubernetes cluster
+- Helm 3.6+
+- The operator image built locally:
+  ```bash
+  cd operator
+  docker build -t openfga/openfga-operator:dev .
+  ```
+- Chart dependencies updated:
+  ```bash
+  helm dependency update charts/openfga
+  ```
+
+All test values files set the operator image's `pullPolicy` to `Never`, so the locally built image must be available to the cluster's container runtime. On Rancher Desktop (dockerd) and Docker Desktop this works automatically. For kind, load the image first:
+
+```bash
+kind load docker-image openfga/openfga-operator:dev
+```
+
+## Test Scenarios
+
+### 1. Happy Path
+
+Deploys OpenFGA with a Postgres instance. The operator should run the migration and scale OpenFGA up within ~30 seconds.
+
+```bash
+kubectl create namespace openfga-test
+helm install openfga-test charts/openfga -n openfga-test \
+  -f operator/tests/values-happy-path.yaml
+```
+
+**Expected outcome:**
+
+| Resource | State |
+|----------|-------|
+| `openfga-test-openfga-operator` | `1/1 Running` |
+| `openfga-test-postgres` | `1/1 Running` |
+| `openfga-test-migrate-xxxxx` | `0/1 Completed` |
+| `openfga-test` (OpenFGA) | `3/3 Running` |
+
+**Verify:**
+
+```bash
+# All resources healthy
+kubectl get all -n openfga-test
+
+# Operator logs show full lifecycle
+kubectl logs -n openfga-test deployment/openfga-test-openfga-operator
+
+# Migration status recorded
+kubectl get configmap openfga-test-migration-status -n openfga-test -o jsonpath='{.data}'
+
+# Database tables created
+kubectl exec -n openfga-test deployment/openfga-test-postgres -- \
+  psql -U openfga -d openfga -c '\dt'
+
+# OpenFGA responding
+kubectl run curl-test --image=curlimages/curl -n openfga-test \
+  --rm -it --restart=Never -- curl -s http://openfga-test:8080/healthz
+# Expected: {"status":"SERVING"}
+```
+
+**Clean up:**
+
+```bash
+helm uninstall openfga-test -n openfga-test
+kubectl delete namespace openfga-test
+```
+
+---
+
+### 2. 
Database Outage and Recovery + +Deploys OpenFGA with a Postgres instance scaled to 0 replicas (simulating a database that isn't ready yet). The operator should retry migrations until Postgres becomes available, then self-heal. + +```bash +kubectl create namespace openfga-test +helm install openfga-test charts/openfga -n openfga-test \ + -f operator/tests/values-db-outage.yaml +``` + +**Expected behavior while Postgres is down:** + +- Migration Job runs and fails (each pod times out after ~60s) +- After 3 failures (backoffLimit), the operator: + - Sets `MigrationFailed: True` condition on the Deployment + - Deletes the failed Job + - Creates a fresh Job after a 60-second delay +- This cycle repeats indefinitely +- OpenFGA stays at 0 replicas throughout (safe — no unmigrated app running) + +**Watch the failure cycle:** + +```bash +# Check deployment conditions +kubectl get deployment openfga-test -n openfga-test \ + -o jsonpath='{range .status.conditions[*]}{.type}: {.status} - {.message}{"\n"}{end}' + +# Watch operator logs for delete/retry cycle +kubectl logs -n openfga-test deployment/openfga-test-openfga-operator -f +# Look for: +# "migration job failed, will delete and retry" +# "deleted failed migration job, will retry" +# "created migration job" +``` + +**Bring Postgres back (after a few minutes):** + +```bash +kubectl scale deployment openfga-test-postgres -n openfga-test --replicas=1 +``` + +**Expected recovery (within ~60s of Postgres becoming ready):** + +- The currently running migration pod connects and succeeds +- Operator updates the ConfigMap with the new version +- Operator scales OpenFGA to 3/3 replicas +- `{"status":"SERVING"}` from the health endpoint + +**Verify recovery:** + +```bash +# OpenFGA should be 3/3 Running +kubectl get all -n openfga-test + +# Migration status recorded +kubectl get configmap openfga-test-migration-status -n openfga-test -o jsonpath='{.data}' + +# Health check +kubectl run curl-test --image=curlimages/curl -n 
openfga-test \ + --rm -it --restart=Never -- curl -s http://openfga-test:8080/healthz +``` + +**Clean up:** + +```bash +helm uninstall openfga-test -n openfga-test +kubectl delete namespace openfga-test +``` + +--- + +### 3. No Database (Permanent Failure) + +Deploys OpenFGA pointing at a Postgres hostname that doesn't exist. The operator should continuously retry without crashing or leaving the app in a broken state. + +```bash +kubectl create namespace openfga-test +helm install openfga-test charts/openfga -n openfga-test \ + -f operator/tests/values-no-db.yaml +``` + +**Expected behavior:** + +- Migration Jobs fail repeatedly (DNS resolution fails for `postgres-does-not-exist`) +- Operator sets `MigrationFailed: True` on the Deployment +- Operator deletes failed Jobs and retries every ~60 seconds +- OpenFGA stays at 0 replicas indefinitely — never starts against an unmigrated database + +This scenario verifies the operator doesn't crash-loop or consume excessive resources when the database is permanently unavailable. 
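+
+For reference, the condition the operator writes (see `setMigrationFailedCondition` in the controller) appears in the Deployment status roughly as follows; the version value is illustrative:
+
+```yaml
+status:
+  conditions:
+    - type: MigrationFailed
+      status: "True"
+      reason: MigrationJobFailed
+      message: Database migration failed for version v1.14.0. Check migration job logs.
+```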
+ +**Verify:** + +```bash +# OpenFGA at 0/0, operator at 1/1 +kubectl get deployments -n openfga-test + +# MigrationFailed condition present +kubectl get deployment openfga-test -n openfga-test \ + -o jsonpath='{range .status.conditions[*]}{.type}: {.status} - {.message}{"\n"}{end}' + +# Operator logs show retry cycle +kubectl logs -n openfga-test deployment/openfga-test-openfga-operator --tail=20 +``` + +**Clean up:** + +```bash +helm uninstall openfga-test -n openfga-test +kubectl delete namespace openfga-test +``` diff --git a/operator/tests/values-db-outage.yaml b/operator/tests/values-db-outage.yaml new file mode 100644 index 0000000..a7c5972 --- /dev/null +++ b/operator/tests/values-db-outage.yaml @@ -0,0 +1,70 @@ +# Test values: Postgres deployed but scaled to 0 (simulates DB outage) +operator: + enabled: true + +openfga-operator: + image: + repository: openfga/openfga-operator + tag: dev + pullPolicy: Never + resources: + requests: + cpu: 10m + memory: 64Mi + +datastore: + engine: postgres + uri: "postgres://openfga:changeme@openfga-test-postgres:5432/openfga?sslmode=disable" + +migration: + enabled: true + serviceAccount: + create: true + +extraObjects: + - apiVersion: v1 + kind: Secret + metadata: + name: openfga-test-postgres-creds + stringData: + POSTGRES_USER: openfga + POSTGRES_PASSWORD: changeme + POSTGRES_DB: openfga + - apiVersion: apps/v1 + kind: Deployment + metadata: + name: openfga-test-postgres + spec: + replicas: 0 # Start with Postgres DOWN + selector: + matchLabels: + app: openfga-test-postgres + template: + metadata: + labels: + app: openfga-test-postgres + spec: + containers: + - name: postgres + image: postgres:17 + ports: + - containerPort: 5432 + envFrom: + - secretRef: + name: openfga-test-postgres-creds + volumeMounts: + - name: data + mountPath: /var/lib/postgresql/data + volumes: + - name: data + emptyDir: {} + - apiVersion: v1 + kind: Service + metadata: + name: openfga-test-postgres + spec: + selector: + app: 
openfga-test-postgres + ports: + - port: 5432 + targetPort: 5432 diff --git a/operator/tests/values-happy-path.yaml b/operator/tests/values-happy-path.yaml new file mode 100644 index 0000000..77e6306 --- /dev/null +++ b/operator/tests/values-happy-path.yaml @@ -0,0 +1,70 @@ +# Local test values for operator-managed migration on Rancher Desktop +operator: + enabled: true + +openfga-operator: + image: + repository: openfga/openfga-operator + tag: dev + pullPolicy: Never + resources: + requests: + cpu: 10m + memory: 64Mi + +datastore: + engine: postgres + uri: "postgres://openfga:changeme@openfga-test-postgres:5432/openfga?sslmode=disable" + +migration: + enabled: true + serviceAccount: + create: true + +extraObjects: + - apiVersion: v1 + kind: Secret + metadata: + name: openfga-test-postgres-creds + stringData: + POSTGRES_USER: openfga + POSTGRES_PASSWORD: changeme + POSTGRES_DB: openfga + - apiVersion: apps/v1 + kind: Deployment + metadata: + name: openfga-test-postgres + spec: + replicas: 1 + selector: + matchLabels: + app: openfga-test-postgres + template: + metadata: + labels: + app: openfga-test-postgres + spec: + containers: + - name: postgres + image: postgres:17 + ports: + - containerPort: 5432 + envFrom: + - secretRef: + name: openfga-test-postgres-creds + volumeMounts: + - name: data + mountPath: /var/lib/postgresql/data + volumes: + - name: data + emptyDir: {} + - apiVersion: v1 + kind: Service + metadata: + name: openfga-test-postgres + spec: + selector: + app: openfga-test-postgres + ports: + - port: 5432 + targetPort: 5432 diff --git a/operator/tests/values-no-db.yaml b/operator/tests/values-no-db.yaml new file mode 100644 index 0000000..2d1cd76 --- /dev/null +++ b/operator/tests/values-no-db.yaml @@ -0,0 +1,23 @@ +# Test values with NO postgres — simulates database unavailable +operator: + enabled: true + +openfga-operator: + image: + repository: openfga/openfga-operator + tag: dev + pullPolicy: Never + resources: + requests: + cpu: 10m + memory: 64Mi 
+ +datastore: + engine: postgres + # Points to a service that doesn't exist + uri: "postgres://openfga:changeme@postgres-does-not-exist:5432/openfga?sslmode=disable" + +migration: + enabled: true + serviceAccount: + create: true From 4108b2a3722592d829fa1e20d3781a2381dfa5ea Mon Sep 17 00:00:00 2001 From: Ed Milic <edmilic@gmail.com> Date: Sat, 11 Apr 2026 08:34:55 -0400 Subject: [PATCH 03/42] fix: address PR #309 review feedback from Copilot and CodeRabbit - Harden pod security (runAsNonRoot, seccompProfile, drop ALL caps) - Find container by name instead of index to handle sidecars - Skip migration for memory datastore - Persist retry-after annotation before Job deletion to survive re-enqueue - Clear MigrationFailed condition on success - Propagate imagePullSecrets and securityContext to migration Jobs - Remove unused RBAC rules (secrets, serviceaccounts) - Add POD_NAMESPACE downward API for namespace-scoped watch default - Remove no-op migration values (timeout, backoffLimit, resources) - Fix migration SA helper to require name when create=false - Guard operator logic on both operator.enabled and migration.enabled - Build and load operator image into kind for chart-testing CI - Add path filters to operator workflow - Fix ADR inaccuracies (retry strategy, default-enabled wording) - Pin Dockerfile base image to golang:1.26.2 --- .github/workflows/operator.yml | 4 + .github/workflows/test.yml | 7 + charts/openfga-operator/templates/NOTES.txt | 7 +- .../templates/clusterrole.yaml | 6 - .../templates/deployment.yaml | 11 +- charts/openfga-operator/values.yaml | 24 ++-- charts/openfga/templates/_helpers.tpl | 6 +- charts/openfga/templates/deployment.yaml | 11 +- charts/openfga/values.schema.json | 20 --- charts/openfga/values.yaml | 9 +- docs/adr/002-operator-managed-migrations.md | 2 +- docs/adr/004-operator-deployment-model.md | 8 +- docs/adr/README.md | 2 +- operator/Dockerfile | 2 +- operator/Makefile | 1 + operator/README.md | 20 +-- operator/cmd/main.go | 13 +- 
operator/go.mod | 2 +- operator/internal/controller/helpers.go | 34 +++-- .../controller/migration_controller.go | 98 ++++++++++++-- .../controller/migration_controller_test.go | 128 +++++++++++++++++- 21 files changed, 312 insertions(+), 103 deletions(-) diff --git a/.github/workflows/operator.yml b/.github/workflows/operator.yml index 71f9e7f..dc88c55 100644 --- a/.github/workflows/operator.yml +++ b/.github/workflows/operator.yml @@ -6,9 +6,13 @@ on: - main paths: - "operator/**" + - "charts/openfga-operator/**" + - ".github/workflows/operator.yml" pull_request: paths: - "operator/**" + - "charts/openfga-operator/**" + - ".github/workflows/operator.yml" workflow_dispatch: inputs: push_image: diff --git a/.github/workflows/test.yml b/.github/workflows/test.yml index 4903083..f2369c5 100644 --- a/.github/workflows/test.yml +++ b/.github/workflows/test.yml @@ -59,6 +59,13 @@ jobs: if: steps.list-changed.outputs.changed == 'true' uses: helm/kind-action@v1.14.0 + - name: Build and load operator image into kind + if: steps.list-changed.outputs.changed == 'true' + run: | + version=$(grep '^appVersion:' charts/openfga-operator/Chart.yaml | awk '{print $2}' | tr -d '"') + docker build -t "openfga/openfga-operator:${version}" operator/ + kind load docker-image "openfga/openfga-operator:${version}" --name chart-testing + - name: Run chart-testing (install) if: steps.list-changed.outputs.changed == 'true' run: ct install --target-branch ${{ github.event.repository.default_branch }} diff --git a/charts/openfga-operator/templates/NOTES.txt b/charts/openfga-operator/templates/NOTES.txt index 8c398b1..c2e09f9 100644 --- a/charts/openfga-operator/templates/NOTES.txt +++ b/charts/openfga-operator/templates/NOTES.txt @@ -1,11 +1,10 @@ The openfga-operator has been deployed. -NOTE: The operator container image ({{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}) -does not exist yet. 
The operator pod will remain in ImagePullBackOff until -the Go binary is built and pushed. +NOTE: Ensure the operator image ({{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}) is available in your registry. +If unavailable, the operator pod may remain in ImagePullBackOff until the image is pushed. To check operator status: kubectl get deployment --namespace {{ include "openfga-operator.namespace" . }} {{ include "openfga-operator.fullname" . }} -To view operator logs (once the image is available): +To view operator logs: kubectl logs --namespace {{ include "openfga-operator.namespace" . }} -l "app.kubernetes.io/name={{ include "openfga-operator.name" . }}" diff --git a/charts/openfga-operator/templates/clusterrole.yaml b/charts/openfga-operator/templates/clusterrole.yaml index 09d0fd7..7dbfddf 100644 --- a/charts/openfga-operator/templates/clusterrole.yaml +++ b/charts/openfga-operator/templates/clusterrole.yaml @@ -17,12 +17,6 @@ rules: - apiGroups: [""] resources: ["configmaps"] verbs: ["get", "list", "watch", "create", "update"] - - apiGroups: [""] - resources: ["secrets"] - verbs: ["get"] - - apiGroups: [""] - resources: ["serviceaccounts"] - verbs: ["get", "list", "create"] - apiGroups: ["coordination.k8s.io"] resources: ["leases"] verbs: ["get", "list", "watch", "create", "update"] diff --git a/charts/openfga-operator/templates/deployment.yaml b/charts/openfga-operator/templates/deployment.yaml index ae8af0d..4570660 100644 --- a/charts/openfga-operator/templates/deployment.yaml +++ b/charts/openfga-operator/templates/deployment.yaml @@ -40,11 +40,16 @@ spec: {{- if .Values.leaderElection.enabled }} - --leader-elect {{- end }} - {{- if .Values.watchNamespace }} - - --watch-namespace={{ .Values.watchNamespace }} - {{- else if .Values.watchAllNamespaces }} + {{- if .Values.watchAllNamespaces }} - --watch-all-namespaces + {{- else if .Values.watchNamespace }} + - --watch-namespace={{ .Values.watchNamespace }} {{- end }} + env: + 
- name: POD_NAMESPACE + valueFrom: + fieldRef: + fieldPath: metadata.namespace ports: - name: healthz containerPort: 8081 diff --git a/charts/openfga-operator/values.yaml b/charts/openfga-operator/values.yaml index 891ad57..59c9dfb 100644 --- a/charts/openfga-operator/values.yaml +++ b/charts/openfga-operator/values.yaml @@ -21,18 +21,18 @@ serviceAccount: podAnnotations: {} -podSecurityContext: {} - # runAsNonRoot: true - # seccompProfile: - # type: RuntimeDefault - -securityContext: {} - # capabilities: - # drop: - # - ALL - # readOnlyRootFilesystem: true - # runAsNonRoot: true - # runAsUser: 65532 +podSecurityContext: + runAsNonRoot: true + seccompProfile: + type: RuntimeDefault + +securityContext: + capabilities: + drop: + - ALL + readOnlyRootFilesystem: true + runAsNonRoot: true + runAsUser: 65532 # -- Constrain the operator to watch a single namespace. # Leave empty to default to the release namespace. diff --git a/charts/openfga/templates/_helpers.tpl b/charts/openfga/templates/_helpers.tpl index 35ad94a..cc50e03 100644 --- a/charts/openfga/templates/_helpers.tpl +++ b/charts/openfga/templates/_helpers.tpl @@ -78,10 +78,10 @@ Create the name of the service account to use Create the name of the migration service account to use (operator mode only) */}} {{- define "openfga.migrationServiceAccountName" -}} -{{- if .Values.migration.serviceAccount.name }} -{{- .Values.migration.serviceAccount.name | trunc 63 | trimSuffix "-" }} +{{- if .Values.migration.serviceAccount.create }} +{{- default (printf "%s-migration" (include "openfga.fullname" .)) .Values.migration.serviceAccount.name | trunc 63 | trimSuffix "-" }} {{- else }} -{{- printf "%s-migration" (include "openfga.fullname" .) 
| trunc 63 | trimSuffix "-" }} +{{- required "migration.serviceAccount.name must be set when migration.serviceAccount.create=false" .Values.migration.serviceAccount.name | trunc 63 | trimSuffix "-" }} {{- end }} {{- end }} diff --git a/charts/openfga/templates/deployment.yaml b/charts/openfga/templates/deployment.yaml index 1cd20ac..3d101c0 100644 --- a/charts/openfga/templates/deployment.yaml +++ b/charts/openfga/templates/deployment.yaml @@ -5,15 +5,17 @@ metadata: labels: {{- include "openfga.labels" . | nindent 4 }} annotations: - {{- if .Values.operator.enabled }} + {{- if and .Values.operator.enabled .Values.migration.enabled }} openfga.dev/desired-replicas: "{{ ternary 1 .Values.replicaCount (eq .Values.datastore.engine "memory") }}" + {{- if or .Values.migration.serviceAccount.create .Values.migration.serviceAccount.name }} openfga.dev/migration-service-account: "{{ include "openfga.migrationServiceAccountName" . }}" {{- end }} + {{- end }} {{- with .Values.annotations }} {{- toYaml . | nindent 4 }} {{- end }} spec: - {{- if .Values.operator.enabled }} + {{- if and .Values.operator.enabled .Values.migration.enabled }} {{- if .Values.autoscaling.enabled }} {{- fail "operator.enabled and autoscaling.enabled cannot both be true" }} {{- end }} @@ -46,8 +48,10 @@ spec: serviceAccountName: {{ include "openfga.serviceAccountName" . 
}} securityContext: {{- toYaml .Values.podSecurityContext | nindent 8 }} - {{ if and (not .Values.operator.enabled) (or (and (has .Values.datastore.engine (list "postgres" "mysql")) .Values.datastore.applyMigrations .Values.datastore.waitForMigrations) .Values.extraInitContainers) }} + {{- $operatorMigration := and .Values.operator.enabled .Values.migration.enabled }} + {{ if or (and (not $operatorMigration) (or (and (has .Values.datastore.engine (list "postgres" "mysql")) .Values.datastore.applyMigrations .Values.datastore.waitForMigrations))) .Values.extraInitContainers }} initContainers: + {{- if not $operatorMigration }} {{- if and (has .Values.datastore.engine (list "postgres" "mysql")) .Values.datastore.applyMigrations .Values.datastore.waitForMigrations (eq .Values.datastore.migrationType "job") }} - name: wait-for-migration securityContext: @@ -86,6 +90,7 @@ spec: {{- include "common.tplvalues.render" ( dict "value" .Values.migrate.sidecars "context" $) | nindent 8 }} {{- end }} {{- end }} + {{- end }} {{- with .Values.extraInitContainers }} {{- toYaml . 
| nindent 8 }} {{- end }} diff --git a/charts/openfga/values.schema.json b/charts/openfga/values.schema.json index 76b2cf9..54e737b 100644 --- a/charts/openfga/values.schema.json +++ b/charts/openfga/values.schema.json @@ -1314,21 +1314,6 @@ "description": "Enable operator-managed database migrations", "default": true }, - "timeout": { - "type": ["string", "null"], - "description": "Timeout passed to the migration Job as activeDeadlineSeconds", - "default": "" - }, - "backoffLimit": { - "type": "integer", - "description": "Number of retries before marking the migration as failed", - "default": 3 - }, - "ttlSecondsAfterFinished": { - "type": "integer", - "description": "Seconds to keep completed/failed migration Jobs before cleanup", - "default": 600 - }, "serviceAccount": { "type": "object", "properties": { @@ -1351,11 +1336,6 @@ "default": "" } } - }, - "resources": { - "type": "object", - "description": "Resource requests/limits for migration Job pods", - "default": {} } } } diff --git a/charts/openfga/values.yaml b/charts/openfga/values.yaml index e843563..24723c2 100644 --- a/charts/openfga/values.yaml +++ b/charts/openfga/values.yaml @@ -394,13 +394,8 @@ operator: # -- migration controls operator-driven migration behavior. # Only used when operator.enabled is true. migration: + # -- Enable operator-managed migrations. Set to false if you manage migrations externally. enabled: true - # -- Timeout passed to the migration Job as activeDeadlineSeconds. - timeout: "" - # -- Number of retries before marking the migration as failed. - backoffLimit: 3 - # -- Seconds to keep completed/failed migration Jobs before cleanup. - ttlSecondsAfterFinished: 600 serviceAccount: # -- Create a dedicated service account for migration Jobs. create: true @@ -409,8 +404,6 @@ migration: # -- The name of the migration service account. # If not set and create is true, defaults to {fullname}-migration. name: "" - # -- Resource requests/limits for migration Job pods. 
- resources: {} ## Example: Deploy a PostgreSQL instance for dev/test using official Docker images. ## For production, use a managed database service or an operator like CloudnativePG. ## Configure the chart to use the secret: diff --git a/docs/adr/002-operator-managed-migrations.md b/docs/adr/002-operator-managed-migrations.md index 8fb0cd7..9b86d9d 100644 --- a/docs/adr/002-operator-managed-migrations.md +++ b/docs/adr/002-operator-managed-migrations.md @@ -121,7 +121,7 @@ The Job created by the operator has no Helm hook annotations. It is a standard K | Job fails | Operator sets `MigrationFailed` condition on Deployment. Does NOT scale up. User inspects Job logs. | | Job hangs | `activeDeadlineSeconds` (default 300s) kills it. Operator sees failure. | | Operator crashes | On restart, re-reads ConfigMap and Job status. Resumes from where it left off. | -| Database unreachable | Job fails to connect. Operator retries on next reconciliation (exponential backoff). | +| Database unreachable | Job fails to connect. After exhausting `backoffLimit`, operator deletes the failed Job, sets a `retry-after` annotation, and recreates a fresh Job after a fixed 60-second cooldown. Cycle repeats until the database becomes available. | ### Sequence Comparison diff --git a/docs/adr/004-operator-deployment-model.md b/docs/adr/004-operator-deployment-model.md index bedb12c..745e777 100644 --- a/docs/adr/004-operator-deployment-model.md +++ b/docs/adr/004-operator-deployment-model.md @@ -35,7 +35,7 @@ The operator Deployment, RBAC, and CRDs are templates in the main OpenFGA chart. **C. Operator as a conditional subchart dependency (selected)** -The operator is a separate Helm chart (`openfga-operator`) that the main chart declares as a conditional dependency. Enabled by default, but users can disable it. +The operator is a separate Helm chart (`openfga-operator`) that the main chart declares as a conditional dependency. 
Disabled by default for backward compatibility; users opt in with `operator.enabled: true`. *Example:* ```bash @@ -71,7 +71,7 @@ helm-charts/ ├── charts/ │ ├── openfga/ # Main chart (existing) │ │ ├── Chart.yaml # Declares openfga-operator as dependency -│ │ ├── values.yaml # operator.enabled: true +│ │ ├── values.yaml # operator.enabled: false (opt-in) │ │ ├── templates/ │ │ └── crds/ # Empty in Stage 1 │ │ @@ -124,8 +124,8 @@ kubectl apply -f https://github.com/openfga/helm-charts/releases/download/v0.2.0 | Mode | Command | Use case | |------|---------|----------| -| **All-in-one** (default) | `helm install openfga openfga/openfga` | Most users. Single install, operator included. | -| **Operator disabled** | `helm install openfga openfga/openfga --set operator.enabled=false` | Operator managed separately or not needed (memory datastore). | +| **Default** (no operator) | `helm install openfga openfga/openfga` | Backward compatible. Uses Helm hooks for migration. | +| **All-in-one** | `helm install openfga openfga/openfga --set operator.enabled=true` | Single install with operator-managed migrations. | | **Operator standalone** | `helm install op openfga/openfga-operator -n openfga-system` | Cluster-wide operator serving multiple OpenFGA instances. 
| ### Multi-Instance Considerations diff --git a/docs/adr/README.md b/docs/adr/README.md index 298a9e3..5f80512 100644 --- a/docs/adr/README.md +++ b/docs/adr/README.md @@ -30,7 +30,7 @@ ADRs are **immutable once accepted** — if a decision changes, you write a new ## ADR Lifecycle -``` +```text Proposed → Accepted → (optionally) Superseded or Deprecated ↑ │ feedback loop diff --git a/operator/Dockerfile b/operator/Dockerfile index be9097e..7d836a3 100644 --- a/operator/Dockerfile +++ b/operator/Dockerfile @@ -1,4 +1,4 @@ -FROM golang:1.25 AS builder +FROM golang:1.26.2 AS builder WORKDIR /workspace COPY go.mod go.sum ./ diff --git a/operator/Makefile b/operator/Makefile index 575bb7c..4b97c0c 100644 --- a/operator/Makefile +++ b/operator/Makefile @@ -3,6 +3,7 @@ IMG ?= openfga/openfga-operator:dev .PHONY: build test vet fmt lint docker-build docker-push clean build: + mkdir -p bin go build -o bin/operator ./cmd/ test: diff --git a/operator/README.md b/operator/README.md index a2b0699..8e576db 100644 --- a/operator/README.md +++ b/operator/README.md @@ -2,12 +2,12 @@ A Kubernetes operator that manages database migrations for OpenFGA deployments. Instead of relying on Helm hooks and init containers, the operator watches OpenFGA Deployments, detects version changes, and orchestrates migrations as regular Jobs. -This is **Stage 1** of the operator — focused solely on migration orchestration. See [ADR-001](../docs/adr/001-adopt-operator.md) for the full roadmap. +This is **Stage 1** of the operator — focused solely on migration orchestration. See [ADR-001](../docs/adr/001-adopt-openfga-operator.md) for the full roadmap. ## How It Works 1. The operator watches Deployments labeled `app.kubernetes.io/part-of: openfga` -2. When a version change is detected (comparing the container image tag to the `openfga-migration-status` ConfigMap), the operator: +2. 
When a version change is detected (comparing the container image tag to the `{name}-migration-status` ConfigMap), the operator: - Keeps the Deployment at 0 replicas - Creates a migration Job running `openfga migrate` - Waits for the Job to complete @@ -108,14 +108,14 @@ The operator accepts the following flags: | Flag | Default | Description | |------|---------|-------------| -| `--leader-elect` | `false` | Enable leader election | -| `--watch-namespace` | `""` | Namespace to watch (defaults to release namespace) | -| `--watch-all-namespaces` | `false` | Watch all namespaces | -| `--metrics-bind-address` | `:8080` | Metrics endpoint address | -| `--health-probe-bind-address` | `:8081` | Health probe endpoint address | -| `--backoff-limit` | `3` | BackoffLimit for migration Jobs | -| `--active-deadline-seconds` | `300` | ActiveDeadlineSeconds for migration Jobs | -| `--ttl-seconds-after-finished` | `300` | TTLSecondsAfterFinished for migration Jobs | +| `--leader-elect` | `false` | Enable leader election so only one replica actively reconciles at a time. Required when running multiple operator replicas for high availability; standby pods wait for the leader's Lease to expire before taking over. Not needed for single-replica deployments. | +| `--watch-namespace` | `""` | Namespace to watch for OpenFGA Deployments. Defaults to the operator pod's own namespace (via `POD_NAMESPACE` env var). Set explicitly for multi-namespace setups. | +| `--watch-all-namespaces` | `false` | Watch all namespaces for OpenFGA Deployments, making the operator cluster-wide. Overrides `--watch-namespace`. | +| `--metrics-bind-address` | `:8080` | Address the Prometheus metrics endpoint binds to. Change only if the default port conflicts with other containers in the pod. | +| `--health-probe-bind-address` | `:8081` | Address the Kubernetes liveness and readiness probe endpoints bind to. Change only if the default port conflicts. 
| +| `--backoff-limit` | `3` | Number of times a migration Job's pod can fail before the Job is considered failed. After hitting this limit the operator deletes the Job, sets a `MigrationFailed` condition on the Deployment, and retries after a 60-second cooldown. | +| `--active-deadline-seconds` | `300` | Maximum wall-clock seconds a migration Job can run before Kubernetes terminates it. Prevents stuck migrations from blocking the pipeline indefinitely. Increase for very large databases. | +| `--ttl-seconds-after-finished` | `300` | Seconds Kubernetes keeps a completed or failed Job (and its pods) before garbage-collecting them, giving you time to inspect logs. | When deployed via the Helm subchart, these are configured through `values.yaml`. See `charts/openfga-operator/values.yaml` for all available options. diff --git a/operator/cmd/main.go b/operator/cmd/main.go index e4c9d3e..9fb9107 100644 --- a/operator/cmd/main.go +++ b/operator/cmd/main.go @@ -5,7 +5,6 @@ import ( "os" "k8s.io/apimachinery/pkg/runtime" - utilruntime "k8s.io/apimachinery/pkg/runtime/serializer" clientgoscheme "k8s.io/client-go/kubernetes/scheme" ctrl "sigs.k8s.io/controller-runtime" "sigs.k8s.io/controller-runtime/pkg/cache" @@ -20,8 +19,6 @@ var scheme = runtime.NewScheme() func init() { _ = clientgoscheme.AddToScheme(scheme) - // Suppress unused import. - _ = utilruntime.CodecFactory{} } func main() { @@ -37,7 +34,7 @@ func main() { ) flag.BoolVar(&leaderElect, "leader-elect", false, "Enable leader election for the controller manager.") - flag.StringVar(&watchNamespace, "watch-namespace", "", "Namespace to watch. Defaults to the release namespace.") + flag.StringVar(&watchNamespace, "watch-namespace", "", "Namespace to watch. 
Defaults to the operator pod namespace.") flag.BoolVar(&watchAllNamespaces, "watch-all-namespaces", false, "Watch all namespaces.") flag.StringVar(&metricsAddr, "metrics-bind-address", ":8080", "The address the metric endpoint binds to.") flag.StringVar(&healthProbeAddr, "health-probe-bind-address", ":8081", "The address the health probe endpoint binds to.") @@ -52,6 +49,14 @@ func main() { ctrl.SetLogger(zap.New(zap.UseFlagOptions(&opts))) logger := ctrl.Log.WithName("setup") + // Fall back to the pod's namespace when no explicit scope is set. + if !watchAllNamespaces && watchNamespace == "" { + if podNS, ok := os.LookupEnv("POD_NAMESPACE"); ok && podNS != "" { + watchNamespace = podNS + logger.Info("defaulting watch scope to pod namespace", "namespace", podNS) + } + } + // Configure cache namespace restrictions. var cacheOpts cache.Options if watchNamespace != "" && !watchAllNamespaces { diff --git a/operator/go.mod b/operator/go.mod index fea9649..8cd2bbe 100644 --- a/operator/go.mod +++ b/operator/go.mod @@ -1,6 +1,6 @@ module github.com/openfga/openfga-operator -go 1.25.6 +go 1.26.2 require ( k8s.io/api v0.35.3 diff --git a/operator/internal/controller/helpers.go b/operator/internal/controller/helpers.go index 175805d..23e0d8d 100644 --- a/operator/internal/controller/helpers.go +++ b/operator/internal/controller/helpers.go @@ -26,6 +26,7 @@ const ( // Annotations set on the Deployment by the Helm chart / operator. AnnotationDesiredReplicas = "openfga.dev/desired-replicas" AnnotationMigrationServiceAccount = "openfga.dev/migration-service-account" + AnnotationRetryAfter = "openfga.dev/migration-retry-after" // Defaults for migration Job configuration. DefaultBackoffLimit int32 = 3 @@ -68,17 +69,29 @@ func migrationJobName(deploymentName string) string { return deploymentName + "-migrate" } -// buildMigrationJob constructs a migration Job for the given Deployment and version. +// findOpenFGAContainer finds the OpenFGA container in the Deployment's pod spec. 
+// It looks for a container named "openfga" first, then falls back to the first container. +func findOpenFGAContainer(deployment *appsv1.Deployment) *corev1.Container { + for i := range deployment.Spec.Template.Spec.Containers { + if deployment.Spec.Template.Spec.Containers[i].Name == "openfga" { + return &deployment.Spec.Template.Spec.Containers[i] + } + } + // Fallback: use the first container (for charts that don't name it "openfga"). + if len(deployment.Spec.Template.Spec.Containers) > 0 { + return &deployment.Spec.Template.Spec.Containers[0] + } + return nil +} + +// buildMigrationJob constructs a migration Job for the given Deployment. func buildMigrationJob( deployment *appsv1.Deployment, - desiredVersion string, + mainContainer *corev1.Container, backoffLimit int32, activeDeadlineSeconds int64, ttlSecondsAfterFinished int32, ) *batchv1.Job { - // Extract the main container's image and datastore env vars. - mainContainer := deployment.Spec.Template.Spec.Containers[0] - // Determine the migration service account. migrationSA := deployment.Annotations[AnnotationMigrationServiceAccount] if migrationSA == "" { @@ -127,12 +140,15 @@ func buildMigrationJob( Spec: corev1.PodSpec{ ServiceAccountName: migrationSA, RestartPolicy: corev1.RestartPolicyNever, + ImagePullSecrets: deployment.Spec.Template.Spec.ImagePullSecrets, + SecurityContext: deployment.Spec.Template.Spec.SecurityContext, Containers: []corev1.Container{ { - Name: "migrate-database", - Image: mainContainer.Image, - Args: []string{"migrate"}, - Env: datastoreEnvVars, + Name: "migrate-database", + Image: mainContainer.Image, + Args: []string{"migrate"}, + Env: datastoreEnvVars, + SecurityContext: mainContainer.SecurityContext, }, }, // Inherit scheduling constraints from the parent Deployment. 
diff --git a/operator/internal/controller/migration_controller.go b/operator/internal/controller/migration_controller.go index 935e4ff..9da395c 100644 --- a/operator/internal/controller/migration_controller.go +++ b/operator/internal/controller/migration_controller.go @@ -3,6 +3,7 @@ package controller import ( "context" "fmt" + "strings" "time" appsv1 "k8s.io/api/apps/v1" @@ -46,14 +47,24 @@ func (r *MigrationReconciler) Reconcile(ctx context.Context, req ctrl.Request) ( return ctrl.Result{}, err } - // 2. Extract the desired version from the Deployment's image tag. - if len(deployment.Spec.Template.Spec.Containers) == 0 { + // 2. Find the OpenFGA container and extract the desired version. + mainContainer := findOpenFGAContainer(deployment) + if mainContainer == nil { logger.Info("deployment has no containers, skipping") return ctrl.Result{}, nil } - desiredVersion := extractImageTag(deployment.Spec.Template.Spec.Containers[0].Image) + desiredVersion := extractImageTag(mainContainer.Image) - // 3. Check current migration status from ConfigMap. + // 3. Skip migration for memory datastore — just ensure the Deployment is scaled up. + if isMemoryDatastore(mainContainer) { + logger.V(1).Info("memory datastore detected, skipping migration") + if _, scaleErr := ensureDeploymentScaled(ctx, r.Client, deployment); scaleErr != nil { + return ctrl.Result{}, scaleErr + } + return ctrl.Result{}, nil + } + + // 4. Check current migration status from ConfigMap. configMap := &corev1.ConfigMap{} cmName := migrationConfigMapName(req.Name) err := r.Get(ctx, types.NamespacedName{Name: cmName, Namespace: req.Namespace}, configMap) @@ -65,9 +76,13 @@ func (r *MigrationReconciler) Reconcile(ctx context.Context, req ctrl.Request) ( return ctrl.Result{}, fmt.Errorf("getting migration status: %w", err) } - // 4. If versions match, ensure Deployment is scaled up and return. + // 5. If versions match, ensure Deployment is scaled up and return. 
if currentVersion == desiredVersion { logger.V(1).Info("migration up to date", "version", desiredVersion) + clearMigrationFailedCondition(deployment) + if patchErr := r.Status().Update(ctx, deployment); patchErr != nil { + logger.Error(patchErr, "failed to clear MigrationFailed condition") + } if _, scaleErr := ensureDeploymentScaled(ctx, r.Client, deployment); scaleErr != nil { return ctrl.Result{}, scaleErr } @@ -76,12 +91,22 @@ func (r *MigrationReconciler) Reconcile(ctx context.Context, req ctrl.Request) ( logger.Info("migration needed", "currentVersion", currentVersion, "desiredVersion", desiredVersion) - // 5. Ensure the Deployment is scaled to zero before migrating. + // 6. Ensure the Deployment is scaled to zero before migrating. if err := scaleDeploymentToZero(ctx, r.Client, deployment); err != nil { return ctrl.Result{}, err } - // 6. Check if a migration Job already exists. + // 7. Check retry-after annotation to honor backoff cooldown. + if retryAfter, ok := deployment.Annotations[AnnotationRetryAfter]; ok { + retryTime, parseErr := time.Parse(time.RFC3339, retryAfter) + if parseErr == nil && time.Now().Before(retryTime) { + remaining := time.Until(retryTime) + logger.V(1).Info("in retry cooldown", "retryAfter", retryAfter, "remaining", remaining) + return ctrl.Result{RequeueAfter: remaining}, nil + } + } + + // 8. Check if a migration Job already exists. jobName := migrationJobName(req.Name) job := &batchv1.Job{} err = r.Get(ctx, types.NamespacedName{Name: jobName, Namespace: req.Namespace}, job) @@ -90,11 +115,19 @@ func (r *MigrationReconciler) Reconcile(ctx context.Context, req ctrl.Request) ( // Create the migration Job. job = buildMigrationJob( deployment, - desiredVersion, + mainContainer, r.BackoffLimit, r.ActiveDeadlineSeconds, r.TTLSecondsAfterFinished, ) + // Clear the retry-after annotation now that we're creating a new Job. 
+ if _, hasRetry := deployment.Annotations[AnnotationRetryAfter]; hasRetry { + patch := client.MergeFrom(deployment.DeepCopy()) + delete(deployment.Annotations, AnnotationRetryAfter) + if patchErr := r.Patch(ctx, deployment, patch); patchErr != nil { + logger.Error(patchErr, "failed to clear retry-after annotation") + } + } if createErr := r.Create(ctx, job); createErr != nil { return ctrl.Result{}, fmt.Errorf("creating migration job: %w", createErr) } @@ -104,10 +137,16 @@ func (r *MigrationReconciler) Reconcile(ctx context.Context, req ctrl.Request) ( return ctrl.Result{}, fmt.Errorf("getting migration job: %w", err) } - // 7. Check Job status. + // 9. Check Job status. if job.Status.Succeeded >= 1 { logger.Info("migration succeeded", "version", desiredVersion) + // Clear MigrationFailed condition. + clearMigrationFailedCondition(deployment) + if patchErr := r.Status().Update(ctx, deployment); patchErr != nil { + logger.Error(patchErr, "failed to clear MigrationFailed condition") + } + // Update migration status ConfigMap. if statusErr := updateMigrationStatus(ctx, r.Client, deployment, desiredVersion, jobName); statusErr != nil { return ctrl.Result{}, statusErr @@ -135,8 +174,19 @@ func (r *MigrationReconciler) Reconcile(ctx context.Context, req ctrl.Request) ( logger.Error(patchErr, "failed to set MigrationFailed condition") } + // Persist a retry-after annotation so the cooldown is honored even + // when the Job deletion triggers an immediate re-enqueue. + retryAfter := time.Now().Add(60 * time.Second).UTC().Format(time.RFC3339) + patch := client.MergeFrom(deployment.DeepCopy()) + if deployment.Annotations == nil { + deployment.Annotations = make(map[string]string) + } + deployment.Annotations[AnnotationRetryAfter] = retryAfter + if patchErr := r.Patch(ctx, deployment, patch); patchErr != nil { + logger.Error(patchErr, "failed to set retry-after annotation") + } + // Delete the failed Job so a fresh one is created on the next reconcile. 
- // This allows auto-recovery when the database comes back. propagation := metav1.DeletePropagationBackground if delErr := r.Delete(ctx, job, &client.DeleteOptions{ PropagationPolicy: &propagation, @@ -145,15 +195,26 @@ func (r *MigrationReconciler) Reconcile(ctx context.Context, req ctrl.Request) ( } logger.Info("deleted failed migration job, will retry", "job", jobName) - // Requeue after a longer delay to avoid tight retry loops. + // Requeue after the cooldown period. return ctrl.Result{RequeueAfter: 60 * time.Second}, nil } - // 8. Job still running — requeue. + // 10. Job still running — requeue. logger.V(1).Info("migration job in progress", "job", jobName) return ctrl.Result{RequeueAfter: 10 * time.Second}, nil } +// isMemoryDatastore checks if the Deployment is using the memory datastore +// (no database migration needed). +func isMemoryDatastore(container *corev1.Container) bool { + for _, env := range container.Env { + if env.Name == "OPENFGA_DATASTORE_ENGINE" { + return strings.EqualFold(env.Value, "memory") + } + } + return false +} + // setMigrationFailedCondition sets a MigrationFailed condition on the Deployment. func setMigrationFailedCondition(deployment *appsv1.Deployment, version string) { condition := appsv1.DeploymentCondition{ @@ -174,6 +235,19 @@ func setMigrationFailedCondition(deployment *appsv1.Deployment, version string) deployment.Status.Conditions = append(deployment.Status.Conditions, condition) } +// clearMigrationFailedCondition removes or sets the MigrationFailed condition to False. +func clearMigrationFailedCondition(deployment *appsv1.Deployment) { + for i, c := range deployment.Status.Conditions { + if c.Type == "MigrationFailed" { + deployment.Status.Conditions[i].Status = corev1.ConditionFalse + deployment.Status.Conditions[i].LastTransitionTime = metav1.Now() + deployment.Status.Conditions[i].Reason = "MigrationSucceeded" + deployment.Status.Conditions[i].Message = "Migration completed successfully." 
+ return + } + } +} + // SetupWithManager sets up the controller with the Manager. func (r *MigrationReconciler) SetupWithManager(mgr ctrl.Manager) error { // Only watch Deployments that are part of OpenFGA. diff --git a/operator/internal/controller/migration_controller_test.go b/operator/internal/controller/migration_controller_test.go index dc0f0c5..f463c7d 100644 --- a/operator/internal/controller/migration_controller_test.go +++ b/operator/internal/controller/migration_controller_test.go @@ -230,7 +230,7 @@ func TestReconcile_JobSucceeded_UpdatesConfigMapAndScalesUp(t *testing.T) { } } -func TestReconcile_JobFailed_DeletesJobAndRequeues(t *testing.T) { +func TestReconcile_JobFailed_SetsRetryAnnotationAndRequeues(t *testing.T) { // Given: a Deployment at 0 replicas and a failed migration Job. dep := newTestDeployment("openfga", "default", "openfga/openfga:v1.14.0", 0) dep.Annotations[AnnotationDesiredReplicas] = "3" @@ -295,6 +295,87 @@ func TestReconcile_JobFailed_DeletesJobAndRequeues(t *testing.T) { }, deletedJob); getErr == nil { t.Error("expected failed migration job to be deleted") } + + // Verify retry-after annotation was set on the Deployment. + if _, ok := updated.Annotations[AnnotationRetryAfter]; !ok { + t.Error("expected retry-after annotation to be set on Deployment") + } +} + +func TestReconcile_RetryAfterCooldown_SkipsJobCreation(t *testing.T) { + // Given: a Deployment with a retry-after annotation in the future. + dep := newTestDeployment("openfga", "default", "openfga/openfga:v1.14.0", 0) + dep.Annotations[AnnotationDesiredReplicas] = "3" + dep.Annotations[AnnotationRetryAfter] = time.Now().Add(30 * time.Second).UTC().Format(time.RFC3339) + + r := newReconciler(dep) + + // When: reconciling. + result, err := r.Reconcile(context.Background(), ctrl.Request{ + NamespacedName: types.NamespacedName{Name: "openfga", Namespace: "default"}, + }) + + // Then: no error, requeue with remaining cooldown time. 
+ if err != nil { + t.Fatalf("unexpected error: %v", err) + } + if result.RequeueAfter == 0 { + t.Error("expected requeue during cooldown") + } + if result.RequeueAfter > 30*time.Second { + t.Errorf("expected requeue within 30s, got %v", result.RequeueAfter) + } + + // Verify no Job was created. + job := &batchv1.Job{} + if getErr := r.Get(context.Background(), types.NamespacedName{ + Name: "openfga-migrate", Namespace: "default", + }, job); getErr == nil { + t.Error("expected no migration job during cooldown") + } +} + +func TestReconcile_MemoryDatastore_SkipsMigration(t *testing.T) { + // Given: a Deployment using the memory datastore. + dep := newTestDeployment("openfga", "default", "openfga/openfga:v1.14.0", 0) + dep.Annotations[AnnotationDesiredReplicas] = "1" + dep.Spec.Template.Spec.Containers[0].Env = []corev1.EnvVar{ + {Name: "OPENFGA_DATASTORE_ENGINE", Value: "memory"}, + } + + r := newReconciler(dep) + + // When: reconciling. + result, err := r.Reconcile(context.Background(), ctrl.Request{ + NamespacedName: types.NamespacedName{Name: "openfga", Namespace: "default"}, + }) + + // Then: no error, no requeue. + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + if result.RequeueAfter != 0 { + t.Error("expected no requeue for memory datastore") + } + + // Verify Deployment was scaled up (no migration needed). + updated := &appsv1.Deployment{} + if getErr := r.Get(context.Background(), types.NamespacedName{ + Name: "openfga", Namespace: "default", + }, updated); getErr != nil { + t.Fatalf("getting deployment: %v", getErr) + } + if *updated.Spec.Replicas != 1 { + t.Errorf("expected 1 replica, got %d", *updated.Spec.Replicas) + } + + // Verify no Job was created. 
+ job := &batchv1.Job{} + if getErr := r.Get(context.Background(), types.NamespacedName{ + Name: "openfga-migrate", Namespace: "default", + }, job); getErr == nil { + t.Error("expected no migration job for memory datastore") + } } func TestReconcile_DeploymentNotFound_NoError(t *testing.T) { @@ -312,6 +393,51 @@ func TestReconcile_DeploymentNotFound_NoError(t *testing.T) { } } +func TestReconcile_FindContainerByName(t *testing.T) { + // Given: a Deployment with a sidecar before the openfga container. + dep := newTestDeployment("openfga", "default", "openfga/openfga:v1.14.0", 0) + dep.Spec.Template.Spec.Containers = []corev1.Container{ + { + Name: "sidecar", + Image: "envoyproxy/envoy:v1.30", + }, + { + Name: "openfga", + Image: "openfga/openfga:v1.14.0", + Env: []corev1.EnvVar{ + {Name: "OPENFGA_DATASTORE_ENGINE", Value: "postgres"}, + {Name: "OPENFGA_DATASTORE_URI", Value: "postgres://localhost/openfga"}, + }, + }, + } + + r := newReconciler(dep) + + // When: reconciling. + result, err := r.Reconcile(context.Background(), ctrl.Request{ + NamespacedName: types.NamespacedName{Name: "openfga", Namespace: "default"}, + }) + + // Then: Job should use the openfga container's image, not the sidecar's. 
+ if err != nil { + t.Fatalf("unexpected error: %v", err) + } + if result.RequeueAfter == 0 { + t.Error("expected requeue, got none") + } + + job := &batchv1.Job{} + if err := r.Get(context.Background(), types.NamespacedName{ + Name: "openfga-migrate", Namespace: "default", + }, job); err != nil { + t.Fatalf("expected migration job to be created: %v", err) + } + + if job.Spec.Template.Spec.Containers[0].Image != "openfga/openfga:v1.14.0" { + t.Errorf("expected job image openfga/openfga:v1.14.0, got %s", job.Spec.Template.Spec.Containers[0].Image) + } +} + func TestExtractImageTag(t *testing.T) { tests := []struct { image string From 40ca6caa8887a3fc93f7984244ca46167a495b14 Mon Sep 17 00:00:00 2001 From: Ed Milic <edmilic@gmail.com> Date: Sat, 11 Apr 2026 09:14:14 -0400 Subject: [PATCH 04/42] fix: address additional Copilot review feedback on PR #309 - Render extraInitContainers in operator mode (previously skipped) - Add version label to migration Jobs and delete stale Jobs on image change - Use namespaced Role/RoleBinding when watchAllNamespaces is false - Replace Status().Update with Status().Patch to avoid write conflicts - Fix logger.Error(nil, ...) 
to logger.Info for expected failure state - Wire desiredVersion param into buildMigrationJob for version tracking - Update RBAC: deployments/status verb from update to patch - Don't force replicas: 0 for memory engine in operator mode - Guard migration SA creation on migration.enabled - Document both required labels and mutable tag limitation in README --- .../templates/clusterrole.yaml | 9 ++++++- .../templates/clusterrolebinding.yaml | 11 +++++++++ charts/openfga/templates/deployment.yaml | 11 +++++---- charts/openfga/templates/serviceaccount.yaml | 2 +- operator/README.md | 6 ++++- operator/internal/controller/helpers.go | 2 ++ .../controller/migration_controller.go | 24 +++++++++++++++---- 7 files changed, 53 insertions(+), 12 deletions(-) diff --git a/charts/openfga-operator/templates/clusterrole.yaml b/charts/openfga-operator/templates/clusterrole.yaml index 7dbfddf..652b48e 100644 --- a/charts/openfga-operator/templates/clusterrole.yaml +++ b/charts/openfga-operator/templates/clusterrole.yaml @@ -1,7 +1,14 @@ apiVersion: rbac.authorization.k8s.io/v1 +{{- if .Values.watchAllNamespaces }} kind: ClusterRole +{{- else }} +kind: Role +{{- end }} metadata: name: {{ include "openfga-operator.fullname" . }} + {{- if not .Values.watchAllNamespaces }} + namespace: {{ include "openfga-operator.namespace" . }} + {{- end }} labels: {{- include "openfga-operator.labels" . 
| nindent 4 }} rules: @@ -10,7 +17,7 @@ rules: verbs: ["get", "list", "watch", "patch"] - apiGroups: ["apps"] resources: ["deployments/status"] - verbs: ["update"] + verbs: ["patch"] - apiGroups: ["batch"] resources: ["jobs"] verbs: ["get", "list", "watch", "create", "delete"] diff --git a/charts/openfga-operator/templates/clusterrolebinding.yaml b/charts/openfga-operator/templates/clusterrolebinding.yaml index 854521a..cfca8d1 100644 --- a/charts/openfga-operator/templates/clusterrolebinding.yaml +++ b/charts/openfga-operator/templates/clusterrolebinding.yaml @@ -1,12 +1,23 @@ apiVersion: rbac.authorization.k8s.io/v1 +{{- if .Values.watchAllNamespaces }} kind: ClusterRoleBinding +{{- else }} +kind: RoleBinding +{{- end }} metadata: name: {{ include "openfga-operator.fullname" . }} + {{- if not .Values.watchAllNamespaces }} + namespace: {{ include "openfga-operator.namespace" . }} + {{- end }} labels: {{- include "openfga-operator.labels" . | nindent 4 }} roleRef: apiGroup: rbac.authorization.k8s.io + {{- if .Values.watchAllNamespaces }} kind: ClusterRole + {{- else }} + kind: Role + {{- end }} name: {{ include "openfga-operator.fullname" . 
}} subjects: - kind: ServiceAccount diff --git a/charts/openfga/templates/deployment.yaml b/charts/openfga/templates/deployment.yaml index 3d101c0..2de84cd 100644 --- a/charts/openfga/templates/deployment.yaml +++ b/charts/openfga/templates/deployment.yaml @@ -19,7 +19,7 @@ spec: {{- if .Values.autoscaling.enabled }} {{- fail "operator.enabled and autoscaling.enabled cannot both be true" }} {{- end }} - replicas: 0 + replicas: {{ ternary 1 0 (eq .Values.datastore.engine "memory") }} {{- else if not .Values.autoscaling.enabled }} replicas: {{ ternary 1 .Values.replicaCount (eq .Values.datastore.engine "memory")}} {{- end }} @@ -49,10 +49,11 @@ spec: securityContext: {{- toYaml .Values.podSecurityContext | nindent 8 }} {{- $operatorMigration := and .Values.operator.enabled .Values.migration.enabled }} - {{ if or (and (not $operatorMigration) (or (and (has .Values.datastore.engine (list "postgres" "mysql")) .Values.datastore.applyMigrations .Values.datastore.waitForMigrations))) .Values.extraInitContainers }} + {{- $needsMigrationInit := and (not $operatorMigration) (has .Values.datastore.engine (list "postgres" "mysql")) .Values.datastore.applyMigrations .Values.datastore.waitForMigrations }} + {{- if or $needsMigrationInit .Values.extraInitContainers }} initContainers: - {{- if not $operatorMigration }} - {{- if and (has .Values.datastore.engine (list "postgres" "mysql")) .Values.datastore.applyMigrations .Values.datastore.waitForMigrations (eq .Values.datastore.migrationType "job") }} + {{- if $needsMigrationInit }} + {{- if eq .Values.datastore.migrationType "job" }} - name: wait-for-migration securityContext: {{- toYaml .Values.securityContext | nindent 12 }} @@ -62,7 +63,7 @@ spec: resources: {{- toYaml .Values.datastore.migrations.resources | nindent 12 }} {{- end }} - {{- if and (has .Values.datastore.engine (list "postgres" "mysql")) (eq .Values.datastore.migrationType "initContainer") }} + {{- if eq .Values.datastore.migrationType "initContainer" }} {{- with 
.Values.migrate.extraInitContainers }} {{- toYaml . | nindent 8 }} {{- end }} diff --git a/charts/openfga/templates/serviceaccount.yaml b/charts/openfga/templates/serviceaccount.yaml index 278be07..f732c46 100644 --- a/charts/openfga/templates/serviceaccount.yaml +++ b/charts/openfga/templates/serviceaccount.yaml @@ -10,7 +10,7 @@ metadata: {{- toYaml . | nindent 4 }} {{- end }} {{- end }} -{{- if and .Values.operator.enabled .Values.migration.serviceAccount.create }} +{{- if and .Values.operator.enabled .Values.migration.enabled .Values.migration.serviceAccount.create }} --- apiVersion: v1 kind: ServiceAccount diff --git a/operator/README.md b/operator/README.md index 8e576db..a2e927b 100644 --- a/operator/README.md +++ b/operator/README.md @@ -6,7 +6,7 @@ This is **Stage 1** of the operator — focused solely on migration orchestratio ## How It Works -1. The operator watches Deployments labeled `app.kubernetes.io/part-of: openfga` +1. The operator watches Deployments labeled `app.kubernetes.io/part-of: openfga` and `app.kubernetes.io/component: authorization-controller` 2. When a version change is detected (comparing the container image tag to the `{name}-migration-status` ConfigMap), the operator: - Keeps the Deployment at 0 replicas - Creates a migration Job running `openfga migrate` @@ -127,3 +127,7 @@ The operator reads these annotations from the OpenFGA Deployment: |------------|-------------| | `openfga.dev/desired-replicas` | The replica count to restore after migration succeeds. Set by the Helm chart. | | `openfga.dev/migration-service-account` | The ServiceAccount to use for migration Jobs. Defaults to the Deployment's SA. | + +## Limitations + +- **Mutable image tags:** The operator detects version changes by comparing the container image tag (or digest). If you deploy with a mutable tag like `latest` or reuse the same tag for different builds, the operator will not detect changes and will skip the migration. 
Use immutable tags (e.g., `v1.14.0`) or pin images by digest for reliable migration triggering. diff --git a/operator/internal/controller/helpers.go b/operator/internal/controller/helpers.go index 23e0d8d..a9c37d5 100644 --- a/operator/internal/controller/helpers.go +++ b/operator/internal/controller/helpers.go @@ -88,6 +88,7 @@ func findOpenFGAContainer(deployment *appsv1.Deployment) *corev1.Container { func buildMigrationJob( deployment *appsv1.Deployment, mainContainer *corev1.Container, + desiredVersion string, backoffLimit int32, activeDeadlineSeconds int64, ttlSecondsAfterFinished int32, @@ -114,6 +115,7 @@ func buildMigrationJob( LabelPartOf: LabelPartOfValue, LabelComponent: "migration", "app.kubernetes.io/managed-by": "openfga-operator", + "app.kubernetes.io/version": desiredVersion, }, OwnerReferences: []metav1.OwnerReference{ { diff --git a/operator/internal/controller/migration_controller.go b/operator/internal/controller/migration_controller.go index 9da395c..77dd22e 100644 --- a/operator/internal/controller/migration_controller.go +++ b/operator/internal/controller/migration_controller.go @@ -79,8 +79,9 @@ func (r *MigrationReconciler) Reconcile(ctx context.Context, req ctrl.Request) ( // 5. If versions match, ensure Deployment is scaled up and return. 
if currentVersion == desiredVersion { logger.V(1).Info("migration up to date", "version", desiredVersion) + statusPatch := client.MergeFrom(deployment.DeepCopy()) clearMigrationFailedCondition(deployment) - if patchErr := r.Status().Update(ctx, deployment); patchErr != nil { + if patchErr := r.Status().Patch(ctx, deployment, statusPatch); patchErr != nil { logger.Error(patchErr, "failed to clear MigrationFailed condition") } if _, scaleErr := ensureDeploymentScaled(ctx, r.Client, deployment); scaleErr != nil { @@ -116,6 +117,7 @@ func (r *MigrationReconciler) Reconcile(ctx context.Context, req ctrl.Request) ( job = buildMigrationJob( deployment, mainContainer, + desiredVersion, r.BackoffLimit, r.ActiveDeadlineSeconds, r.TTLSecondsAfterFinished, @@ -137,13 +139,26 @@ func (r *MigrationReconciler) Reconcile(ctx context.Context, req ctrl.Request) ( return ctrl.Result{}, fmt.Errorf("getting migration job: %w", err) } + // 8b. If the existing Job is for a different version, delete it and recreate. + if jobVersion := job.Labels["app.kubernetes.io/version"]; jobVersion != "" && jobVersion != desiredVersion { + logger.Info("existing migration job is for a different version, deleting", "jobVersion", jobVersion, "desiredVersion", desiredVersion) + propagation := metav1.DeletePropagationBackground + if delErr := r.Delete(ctx, job, &client.DeleteOptions{ + PropagationPolicy: &propagation, + }); delErr != nil && !apierrors.IsNotFound(delErr) { + return ctrl.Result{}, fmt.Errorf("deleting stale migration job: %w", delErr) + } + return ctrl.Result{RequeueAfter: 5 * time.Second}, nil + } + // 9. Check Job status. if job.Status.Succeeded >= 1 { logger.Info("migration succeeded", "version", desiredVersion) // Clear MigrationFailed condition. 
+ statusPatch := client.MergeFrom(deployment.DeepCopy()) clearMigrationFailedCondition(deployment) - if patchErr := r.Status().Update(ctx, deployment); patchErr != nil { + if patchErr := r.Status().Patch(ctx, deployment, statusPatch); patchErr != nil { logger.Error(patchErr, "failed to clear MigrationFailed condition") } @@ -166,11 +181,12 @@ func (r *MigrationReconciler) Reconcile(ctx context.Context, req ctrl.Request) ( } if job.Status.Failed >= backoffLimit { - logger.Error(nil, "migration job failed, will delete and retry", "job", jobName, "version", desiredVersion) + logger.Info("migration job failed, will delete and retry", "job", jobName, "version", desiredVersion) // Set condition so kubectl describe shows the failure. + statusPatch := client.MergeFrom(deployment.DeepCopy()) setMigrationFailedCondition(deployment, desiredVersion) - if patchErr := r.Status().Update(ctx, deployment); patchErr != nil { + if patchErr := r.Status().Patch(ctx, deployment, statusPatch); patchErr != nil { logger.Error(patchErr, "failed to set MigrationFailed condition") } From 7e33bf7a5681310152375c9ce2b292c119f8ccf8 Mon Sep 17 00:00:00 2001 From: Ed Milic <edmilic@gmail.com> Date: Sat, 11 Apr 2026 09:57:11 -0400 Subject: [PATCH 05/42] fix: address remaining PR #309 review feedback - Add opt-in annotation (openfga.dev/migration-enabled) so the operator only manages migrations for explicitly opted-in Deployments - Propagate volumes, volumeMounts, and envFrom from the Deployment to migration Jobs for TLS certs and file-based credentials - Remove watchAllNamespaces option; operator is now always namespace-scoped - Update ADR-004 dependency example to match actual file:// reference - Add test for migration-not-enabled skip behavior --- .../templates/deployment.yaml | 4 +- .../templates/{clusterrole.yaml => role.yaml} | 6 --- ...usterrolebinding.yaml => rolebinding.yaml} | 10 ----- charts/openfga-operator/values.yaml | 5 +-- charts/openfga/templates/deployment.yaml | 1 + 
docs/adr/004-operator-deployment-model.md | 19 ++++---- operator/README.md | 8 ++-- operator/cmd/main.go | 20 ++++----- operator/internal/controller/helpers.go | 6 ++- .../controller/migration_controller.go | 8 +++- .../controller/migration_controller_test.go | 44 ++++++++++++++++++- 11 files changed, 79 insertions(+), 52 deletions(-) rename charts/openfga-operator/templates/{clusterrole.yaml => role.yaml} (86%) rename charts/openfga-operator/templates/{clusterrolebinding.yaml => rolebinding.yaml} (69%) diff --git a/charts/openfga-operator/templates/deployment.yaml b/charts/openfga-operator/templates/deployment.yaml index 4570660..ecad091 100644 --- a/charts/openfga-operator/templates/deployment.yaml +++ b/charts/openfga-operator/templates/deployment.yaml @@ -40,9 +40,7 @@ spec: {{- if .Values.leaderElection.enabled }} - --leader-elect {{- end }} - {{- if .Values.watchAllNamespaces }} - - --watch-all-namespaces - {{- else if .Values.watchNamespace }} + {{- if .Values.watchNamespace }} - --watch-namespace={{ .Values.watchNamespace }} {{- end }} env: diff --git a/charts/openfga-operator/templates/clusterrole.yaml b/charts/openfga-operator/templates/role.yaml similarity index 86% rename from charts/openfga-operator/templates/clusterrole.yaml rename to charts/openfga-operator/templates/role.yaml index 652b48e..dd17870 100644 --- a/charts/openfga-operator/templates/clusterrole.yaml +++ b/charts/openfga-operator/templates/role.yaml @@ -1,14 +1,8 @@ apiVersion: rbac.authorization.k8s.io/v1 -{{- if .Values.watchAllNamespaces }} -kind: ClusterRole -{{- else }} kind: Role -{{- end }} metadata: name: {{ include "openfga-operator.fullname" . }} - {{- if not .Values.watchAllNamespaces }} namespace: {{ include "openfga-operator.namespace" . }} - {{- end }} labels: {{- include "openfga-operator.labels" . 
| nindent 4 }} rules: diff --git a/charts/openfga-operator/templates/clusterrolebinding.yaml b/charts/openfga-operator/templates/rolebinding.yaml similarity index 69% rename from charts/openfga-operator/templates/clusterrolebinding.yaml rename to charts/openfga-operator/templates/rolebinding.yaml index cfca8d1..afacb98 100644 --- a/charts/openfga-operator/templates/clusterrolebinding.yaml +++ b/charts/openfga-operator/templates/rolebinding.yaml @@ -1,23 +1,13 @@ apiVersion: rbac.authorization.k8s.io/v1 -{{- if .Values.watchAllNamespaces }} -kind: ClusterRoleBinding -{{- else }} kind: RoleBinding -{{- end }} metadata: name: {{ include "openfga-operator.fullname" . }} - {{- if not .Values.watchAllNamespaces }} namespace: {{ include "openfga-operator.namespace" . }} - {{- end }} labels: {{- include "openfga-operator.labels" . | nindent 4 }} roleRef: apiGroup: rbac.authorization.k8s.io - {{- if .Values.watchAllNamespaces }} - kind: ClusterRole - {{- else }} kind: Role - {{- end }} name: {{ include "openfga-operator.fullname" . }} subjects: - kind: ServiceAccount diff --git a/charts/openfga-operator/values.yaml b/charts/openfga-operator/values.yaml index 59c9dfb..38a4bff 100644 --- a/charts/openfga-operator/values.yaml +++ b/charts/openfga-operator/values.yaml @@ -34,13 +34,10 @@ securityContext: runAsNonRoot: true runAsUser: 65532 -# -- Constrain the operator to watch a single namespace. +# -- Namespace to watch for OpenFGA Deployments. # Leave empty to default to the release namespace. watchNamespace: "" -# -- Watch all namespaces. Overrides watchNamespace. -watchAllNamespaces: false - leaderElection: # -- Enable leader election for controller manager. enabled: true diff --git a/charts/openfga/templates/deployment.yaml b/charts/openfga/templates/deployment.yaml index 2de84cd..5fc980f 100644 --- a/charts/openfga/templates/deployment.yaml +++ b/charts/openfga/templates/deployment.yaml @@ -6,6 +6,7 @@ metadata: {{- include "openfga.labels" . 
| nindent 4 }} annotations: {{- if and .Values.operator.enabled .Values.migration.enabled }} + openfga.dev/migration-enabled: "true" openfga.dev/desired-replicas: "{{ ternary 1 .Values.replicaCount (eq .Values.datastore.engine "memory") }}" {{- if or .Values.migration.serviceAccount.create .Values.migration.serviceAccount.name }} openfga.dev/migration-service-account: "{{ include "openfga.migrationServiceAccountName" . }}" diff --git a/docs/adr/004-operator-deployment-model.md b/docs/adr/004-operator-deployment-model.md index 745e777..603bf3a 100644 --- a/docs/adr/004-operator-deployment-model.md +++ b/docs/adr/004-operator-deployment-model.md @@ -96,10 +96,14 @@ helm-charts/ dependencies: - name: openfga-operator version: "0.1.x" - repository: "oci://ghcr.io/openfga/helm-charts" + repository: "file://../openfga-operator" condition: operator.enabled ``` +> **Note:** The `file://` reference is used because the operator subchart lives in the same +> monorepo. When the charts are published, consumers pulling from a registry will resolve the +> dependency automatically via the chart's packaging. + ### CRD Handling Helm has specific behavior around CRDs: @@ -130,19 +134,12 @@ kubectl apply -f https://github.com/openfga/helm-charts/releases/download/v0.2.0 ### Multi-Instance Considerations -When multiple OpenFGA installations exist in the same cluster: - -- **All-in-one mode:** Each installation gets its own operator instance. The operator only watches resources in its own namespace. This is simple but wasteful. -- **Standalone mode:** One operator installation watches all namespaces (or a configured set). Individual OpenFGA installations set `operator.enabled=false`. This is more efficient for large clusters. - -The operator will support both modes via a `watchNamespace` configuration: +When multiple OpenFGA installations exist in the same cluster, each installation gets its own operator instance. 
The operator is **namespace-scoped** — it only watches resources in its own namespace (or the namespace specified via `--watch-namespace`). This ensures independent OpenFGA installations never interfere with each other. ```yaml # Operator values operator: - watchNamespace: "" # empty = watch own namespace only (all-in-one mode) - # watchNamespace: "" # or set to a specific namespace - # watchAllNamespaces: true # watch all namespaces (standalone mode) + watchNamespace: "" # empty = watch own namespace only (default) ``` ## Consequences @@ -153,7 +150,7 @@ operator: - **Opt-out available** — `operator.enabled: false` for users who manage it separately or don't need it - **Independent versioning** — operator chart has its own version; can be released on a different cadence than the main chart - **Clean code separation** — operator code and templates are in their own chart directory -- **Standalone installation supported** — cluster admins can install one operator for multiple OpenFGA instances +- **Namespace isolation** — each operator instance is scoped to its own namespace, so multiple OpenFGA installations coexist safely - **Consistent with ecosystem** — this is the same pattern used by charts that depend on Bitnami PostgreSQL, Redis, etc. ### Negative diff --git a/operator/README.md b/operator/README.md index a2e927b..0e43a3d 100644 --- a/operator/README.md +++ b/operator/README.md @@ -6,7 +6,7 @@ This is **Stage 1** of the operator — focused solely on migration orchestratio ## How It Works -1. The operator watches Deployments labeled `app.kubernetes.io/part-of: openfga` and `app.kubernetes.io/component: authorization-controller` +1. The operator watches Deployments **in its own namespace** labeled `app.kubernetes.io/part-of: openfga` and `app.kubernetes.io/component: authorization-controller` 2. 
When a version change is detected (comparing the container image tag to the `{name}-migration-status` ConfigMap), the operator: - Keeps the Deployment at 0 replicas - Creates a migration Job running `openfga migrate` @@ -17,7 +17,7 @@ This is **Stage 1** of the operator — focused solely on migration orchestratio ## Prerequisites -- Go 1.25+ +- Go 1.26.2+ - Docker - Helm 3.6+ - A Kubernetes cluster (Rancher Desktop, kind, etc.) @@ -109,8 +109,7 @@ The operator accepts the following flags: | Flag | Default | Description | |------|---------|-------------| | `--leader-elect` | `false` | Enable leader election so only one replica actively reconciles at a time. Required when running multiple operator replicas for high availability; standby pods wait for the leader's Lease to expire before taking over. Not needed for single-replica deployments. | -| `--watch-namespace` | `""` | Namespace to watch for OpenFGA Deployments. Defaults to the operator pod's own namespace (via `POD_NAMESPACE` env var). Set explicitly for multi-namespace setups. | -| `--watch-all-namespaces` | `false` | Watch all namespaces for OpenFGA Deployments, making the operator cluster-wide. Overrides `--watch-namespace`. | +| `--watch-namespace` | `""` | Namespace to watch for OpenFGA Deployments. Defaults to the operator pod's own namespace (via `POD_NAMESPACE` env var). Each operator instance manages only its own namespace, so multiple independent OpenFGA installations can coexist safely. | | `--metrics-bind-address` | `:8080` | Address the Prometheus metrics endpoint binds to. Change only if the default port conflicts with other containers in the pod. | | `--health-probe-bind-address` | `:8081` | Address the Kubernetes liveness and readiness probe endpoints bind to. Change only if the default port conflicts. | | `--backoff-limit` | `3` | Number of times a migration Job's pod can fail before the Job is considered failed. 
After hitting this limit the operator deletes the Job, sets a `MigrationFailed` condition on the Deployment, and retries after a 60-second cooldown. | @@ -125,6 +124,7 @@ The operator reads these annotations from the OpenFGA Deployment: | Annotation | Description | |------------|-------------| +| `openfga.dev/migration-enabled` | Must be `"true"` for the operator to manage migrations. Deployments without this annotation are ignored. Set by the Helm chart when `operator.enabled` and `migration.enabled` are both true. | | `openfga.dev/desired-replicas` | The replica count to restore after migration succeeds. Set by the Helm chart. | | `openfga.dev/migration-service-account` | The ServiceAccount to use for migration Jobs. Defaults to the Deployment's SA. | diff --git a/operator/cmd/main.go b/operator/cmd/main.go index 9fb9107..7917264 100644 --- a/operator/cmd/main.go +++ b/operator/cmd/main.go @@ -23,19 +23,17 @@ func init() { func main() { var ( - leaderElect bool - watchNamespace string - watchAllNamespaces bool - metricsAddr string - healthProbeAddr string - backoffLimit int - activeDeadline int - ttlAfterFinished int + leaderElect bool + watchNamespace string + metricsAddr string + healthProbeAddr string + backoffLimit int + activeDeadline int + ttlAfterFinished int ) flag.BoolVar(&leaderElect, "leader-elect", false, "Enable leader election for the controller manager.") flag.StringVar(&watchNamespace, "watch-namespace", "", "Namespace to watch. 
Defaults to the operator pod namespace.") - flag.BoolVar(&watchAllNamespaces, "watch-all-namespaces", false, "Watch all namespaces.") flag.StringVar(&metricsAddr, "metrics-bind-address", ":8080", "The address the metric endpoint binds to.") flag.StringVar(&healthProbeAddr, "health-probe-bind-address", ":8081", "The address the health probe endpoint binds to.") flag.IntVar(&backoffLimit, "backoff-limit", int(controller.DefaultBackoffLimit), "BackoffLimit for migration Jobs.") @@ -50,7 +48,7 @@ func main() { logger := ctrl.Log.WithName("setup") // Fall back to the pod's namespace when no explicit scope is set. - if !watchAllNamespaces && watchNamespace == "" { + if watchNamespace == "" { if podNS, ok := os.LookupEnv("POD_NAMESPACE"); ok && podNS != "" { watchNamespace = podNS logger.Info("defaulting watch scope to pod namespace", "namespace", podNS) @@ -59,7 +57,7 @@ func main() { // Configure cache namespace restrictions. var cacheOpts cache.Options - if watchNamespace != "" && !watchAllNamespaces { + if watchNamespace != "" { cacheOpts.DefaultNamespaces = map[string]cache.Config{ watchNamespace: {}, } diff --git a/operator/internal/controller/helpers.go b/operator/internal/controller/helpers.go index a9c37d5..efc7a4b 100644 --- a/operator/internal/controller/helpers.go +++ b/operator/internal/controller/helpers.go @@ -24,6 +24,7 @@ const ( LabelComponentValue = "authorization-controller" // Annotations set on the Deployment by the Helm chart / operator. 
+ AnnotationMigrationEnabled = "openfga.dev/migration-enabled" AnnotationDesiredReplicas = "openfga.dev/desired-replicas" AnnotationMigrationServiceAccount = "openfga.dev/migration-service-account" AnnotationRetryAfter = "openfga.dev/migration-retry-after" @@ -150,10 +151,13 @@ func buildMigrationJob( Image: mainContainer.Image, Args: []string{"migrate"}, Env: datastoreEnvVars, + EnvFrom: mainContainer.EnvFrom, + VolumeMounts: mainContainer.VolumeMounts, SecurityContext: mainContainer.SecurityContext, }, }, - // Inherit scheduling constraints from the parent Deployment. + // Inherit volumes and scheduling constraints from the parent Deployment. + Volumes: deployment.Spec.Template.Spec.Volumes, NodeSelector: deployment.Spec.Template.Spec.NodeSelector, Tolerations: deployment.Spec.Template.Spec.Tolerations, Affinity: deployment.Spec.Template.Spec.Affinity, diff --git a/operator/internal/controller/migration_controller.go b/operator/internal/controller/migration_controller.go index 77dd22e..e28aed8 100644 --- a/operator/internal/controller/migration_controller.go +++ b/operator/internal/controller/migration_controller.go @@ -47,7 +47,13 @@ func (r *MigrationReconciler) Reconcile(ctx context.Context, req ctrl.Request) ( return ctrl.Result{}, err } - // 2. Find the OpenFGA container and extract the desired version. + // 2. Skip if migration is not opted-in via annotation. + if deployment.Annotations[AnnotationMigrationEnabled] != "true" { + logger.V(1).Info("migration not enabled for this deployment, skipping") + return ctrl.Result{}, nil + } + + // 3. Find the OpenFGA container and extract the desired version. 
mainContainer := findOpenFGAContainer(deployment) if mainContainer == nil { logger.Info("deployment has no containers, skipping") diff --git a/operator/internal/controller/migration_controller_test.go b/operator/internal/controller/migration_controller_test.go index f463c7d..811e89c 100644 --- a/operator/internal/controller/migration_controller_test.go +++ b/operator/internal/controller/migration_controller_test.go @@ -33,7 +33,9 @@ func newTestDeployment(name, namespace, image string, replicas int32) *appsv1.De LabelPartOf: LabelPartOfValue, LabelComponent: LabelComponentValue, }, - Annotations: map[string]string{}, + Annotations: map[string]string{ + AnnotationMigrationEnabled: "true", + }, }, Spec: appsv1.DeploymentSpec{ Replicas: ptr.To(replicas), @@ -438,6 +440,46 @@ func TestReconcile_FindContainerByName(t *testing.T) { } } +func TestReconcile_MigrationNotEnabled_Skips(t *testing.T) { + // Given: a Deployment without the migration-enabled annotation. + dep := newTestDeployment("openfga", "default", "openfga/openfga:v1.14.0", 3) + delete(dep.Annotations, AnnotationMigrationEnabled) + + r := newReconciler(dep) + + // When: reconciling. + result, err := r.Reconcile(context.Background(), ctrl.Request{ + NamespacedName: types.NamespacedName{Name: "openfga", Namespace: "default"}, + }) + + // Then: no error, no requeue, no Job created, replicas unchanged. + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + if result.RequeueAfter != 0 { + t.Error("expected no requeue when migration is not enabled") + } + + // Verify no Job was created. + job := &batchv1.Job{} + if getErr := r.Get(context.Background(), types.NamespacedName{ + Name: "openfga-migrate", Namespace: "default", + }, job); getErr == nil { + t.Error("expected no migration job when migration is not enabled") + } + + // Verify replicas unchanged. 
+ updated := &appsv1.Deployment{} + if getErr := r.Get(context.Background(), types.NamespacedName{ + Name: "openfga", Namespace: "default", + }, updated); getErr != nil { + t.Fatalf("getting deployment: %v", getErr) + } + if *updated.Spec.Replicas != 3 { + t.Errorf("expected 3 replicas unchanged, got %d", *updated.Spec.Replicas) + } +} + func TestExtractImageTag(t *testing.T) { tests := []struct { image string From ff2b39ad177eb535b66dda6a07f38baabab4ae43 Mon Sep 17 00:00:00 2001 From: Ed Milic <edmilic@gmail.com> Date: Sat, 11 Apr 2026 10:12:47 -0400 Subject: [PATCH 06/42] fix: remove EnvFrom from migration Job to preserve least-privilege The migration Job should only receive explicitly filtered OPENFGA_DATASTORE_* env vars, not the full EnvFrom from the source Deployment which could leak non-datastore secrets. --- operator/internal/controller/helpers.go | 1 - 1 file changed, 1 deletion(-) diff --git a/operator/internal/controller/helpers.go b/operator/internal/controller/helpers.go index efc7a4b..3727e2f 100644 --- a/operator/internal/controller/helpers.go +++ b/operator/internal/controller/helpers.go @@ -151,7 +151,6 @@ func buildMigrationJob( Image: mainContainer.Image, Args: []string{"migrate"}, Env: datastoreEnvVars, - EnvFrom: mainContainer.EnvFrom, VolumeMounts: mainContainer.VolumeMounts, SecurityContext: mainContainer.SecurityContext, }, From 722f3e703d04af290ee066c03ece3c5a6cc92861 Mon Sep 17 00:00:00 2001 From: Ed Milic <edmilic@gmail.com> Date: Sat, 11 Apr 2026 10:24:29 -0400 Subject: [PATCH 07/42] fix: address Copilot review round 3 on PR #309 - Wrap deployment annotations in conditional to avoid emitting empty annotations: field which produces an invalid manifest - Store full version in annotation (openfga.dev/desired-version) and truncate label to 63 chars to support digest-pinned images - Align operator image default to ghcr.io/openfga/openfga-operator to match CI publishing target --- .github/workflows/test.yml | 4 ++-- 
charts/openfga-operator/values.yaml | 2 +- charts/openfga/templates/deployment.yaml | 5 ++++- operator/internal/controller/helpers.go | 11 ++++++++++- operator/internal/controller/migration_controller.go | 7 ++++++- 5 files changed, 23 insertions(+), 6 deletions(-) diff --git a/.github/workflows/test.yml b/.github/workflows/test.yml index f2369c5..bbec31d 100644 --- a/.github/workflows/test.yml +++ b/.github/workflows/test.yml @@ -63,8 +63,8 @@ jobs: if: steps.list-changed.outputs.changed == 'true' run: | version=$(grep '^appVersion:' charts/openfga-operator/Chart.yaml | awk '{print $2}' | tr -d '"') - docker build -t "openfga/openfga-operator:${version}" operator/ - kind load docker-image "openfga/openfga-operator:${version}" --name chart-testing + docker build -t "ghcr.io/openfga/openfga-operator:${version}" operator/ + kind load docker-image "ghcr.io/openfga/openfga-operator:${version}" --name chart-testing - name: Run chart-testing (install) if: steps.list-changed.outputs.changed == 'true' diff --git a/charts/openfga-operator/values.yaml b/charts/openfga-operator/values.yaml index 38a4bff..1b52525 100644 --- a/charts/openfga-operator/values.yaml +++ b/charts/openfga-operator/values.yaml @@ -1,7 +1,7 @@ replicaCount: 1 image: - repository: openfga/openfga-operator + repository: ghcr.io/openfga/openfga-operator pullPolicy: IfNotPresent # -- Overrides the image tag whose default is the chart appVersion. tag: "" diff --git a/charts/openfga/templates/deployment.yaml b/charts/openfga/templates/deployment.yaml index 5fc980f..ded90ab 100644 --- a/charts/openfga/templates/deployment.yaml +++ b/charts/openfga/templates/deployment.yaml @@ -4,8 +4,10 @@ metadata: name: {{ include "openfga.fullname" . }} labels: {{- include "openfga.labels" . 
| nindent 4 }} + {{- $hasOperatorAnnotations := and .Values.operator.enabled .Values.migration.enabled }} + {{- if or $hasOperatorAnnotations .Values.annotations }} annotations: - {{- if and .Values.operator.enabled .Values.migration.enabled }} + {{- if $hasOperatorAnnotations }} openfga.dev/migration-enabled: "true" openfga.dev/desired-replicas: "{{ ternary 1 .Values.replicaCount (eq .Values.datastore.engine "memory") }}" {{- if or .Values.migration.serviceAccount.create .Values.migration.serviceAccount.name }} @@ -15,6 +17,7 @@ metadata: {{- with .Values.annotations }} {{- toYaml . | nindent 4 }} {{- end }} + {{- end }} spec: {{- if and .Values.operator.enabled .Values.migration.enabled }} {{- if .Values.autoscaling.enabled }} diff --git a/operator/internal/controller/helpers.go b/operator/internal/controller/helpers.go index 3727e2f..799c609 100644 --- a/operator/internal/controller/helpers.go +++ b/operator/internal/controller/helpers.go @@ -108,6 +108,12 @@ func buildMigrationJob( } } + // Truncate version for label (max 63 chars); store full version in annotation. 
+ labelVersion := desiredVersion + if len(labelVersion) > 63 { + labelVersion = labelVersion[:63] + } + return &batchv1.Job{ ObjectMeta: metav1.ObjectMeta{ Name: migrationJobName(deployment.Name), @@ -116,7 +122,10 @@ func buildMigrationJob( LabelPartOf: LabelPartOfValue, LabelComponent: "migration", "app.kubernetes.io/managed-by": "openfga-operator", - "app.kubernetes.io/version": desiredVersion, + "app.kubernetes.io/version": labelVersion, + }, + Annotations: map[string]string{ + "openfga.dev/desired-version": desiredVersion, }, OwnerReferences: []metav1.OwnerReference{ { diff --git a/operator/internal/controller/migration_controller.go b/operator/internal/controller/migration_controller.go index e28aed8..0e87325 100644 --- a/operator/internal/controller/migration_controller.go +++ b/operator/internal/controller/migration_controller.go @@ -146,7 +146,12 @@ func (r *MigrationReconciler) Reconcile(ctx context.Context, req ctrl.Request) ( } // 8b. If the existing Job is for a different version, delete it and recreate. - if jobVersion := job.Labels["app.kubernetes.io/version"]; jobVersion != "" && jobVersion != desiredVersion { + // Check annotation first (supports digests > 63 chars), fall back to label. 
+ jobVersion := job.Annotations["openfga.dev/desired-version"]
+ if jobVersion == "" {
+ jobVersion = job.Labels["app.kubernetes.io/version"]
+ }
+ if jobVersion != "" && jobVersion != desiredVersion {
 logger.Info("existing migration job is for a different version, deleting", "jobVersion", jobVersion, "desiredVersion", desiredVersion)
 propagation := metav1.DeletePropagationBackground
 if delErr := r.Delete(ctx, job, &client.DeleteOptions{

From b2c00612edde667f98a8f862d95ea80aa72a2a62 Mon Sep 17 00:00:00 2001
From: Ed Milic <edmilic@gmail.com>
Date: Sat, 11 Apr 2026 11:48:59 -0400
Subject: [PATCH 08/42] fix: replace problematic ":" with "_" in desired version for label value

---
 operator/internal/controller/helpers.go | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/operator/internal/controller/helpers.go b/operator/internal/controller/helpers.go
index 799c609..871d1a5 100644
--- a/operator/internal/controller/helpers.go
+++ b/operator/internal/controller/helpers.go
@@ -108,8 +108,9 @@ func buildMigrationJob(
 }
 }
- // Truncate version for label (max 63 chars); store full version in annotation.
- labelVersion := desiredVersion
+ // Sanitize version for use as a label value (must match [a-zA-Z0-9._-], max 63 chars).
+ // The full version is stored in an annotation for accurate comparison.
+ labelVersion := strings.ReplaceAll(desiredVersion, ":", "_") if len(labelVersion) > 63 { labelVersion = labelVersion[:63] } From c84d7517fd27ce9b7ba307836e4bb81e14902e43 Mon Sep 17 00:00:00 2001 From: Ed Milic <edmilic@gmail.com> Date: Sat, 11 Apr 2026 12:04:42 -0400 Subject: [PATCH 09/42] fix: address Copilot review comments - Return error on retry-after annotation patch failure to prevent Job churn that bypasses the 60s cooldown - Add test for stale-Job version mismatch deletion path --- .../controller/migration_controller.go | 2 +- .../controller/migration_controller_test.go | 72 +++++++++++++++++++ 2 files changed, 73 insertions(+), 1 deletion(-) diff --git a/operator/internal/controller/migration_controller.go b/operator/internal/controller/migration_controller.go index 0e87325..1cf0f06 100644 --- a/operator/internal/controller/migration_controller.go +++ b/operator/internal/controller/migration_controller.go @@ -210,7 +210,7 @@ func (r *MigrationReconciler) Reconcile(ctx context.Context, req ctrl.Request) ( } deployment.Annotations[AnnotationRetryAfter] = retryAfter if patchErr := r.Patch(ctx, deployment, patch); patchErr != nil { - logger.Error(patchErr, "failed to set retry-after annotation") + return ctrl.Result{}, fmt.Errorf("persisting retry-after annotation: %w", patchErr) } // Delete the failed Job so a fresh one is created on the next reconcile. diff --git a/operator/internal/controller/migration_controller_test.go b/operator/internal/controller/migration_controller_test.go index 811e89c..636182c 100644 --- a/operator/internal/controller/migration_controller_test.go +++ b/operator/internal/controller/migration_controller_test.go @@ -440,6 +440,78 @@ func TestReconcile_FindContainerByName(t *testing.T) { } } +func TestReconcile_StaleJob_DeletedAndRequeued(t *testing.T) { + // Given: a Deployment at v1.15.0 with an existing migration Job for v1.14.0. 
+ dep := newTestDeployment("openfga", "default", "openfga/openfga:v1.15.0", 0) + dep.Annotations[AnnotationDesiredReplicas] = "3" + + staleJob := &batchv1.Job{ + ObjectMeta: metav1.ObjectMeta{ + Name: "openfga-migrate", + Namespace: "default", + Labels: map[string]string{ + "app.kubernetes.io/version": "v1.14.0", + }, + Annotations: map[string]string{ + "openfga.dev/desired-version": "v1.14.0", + }, + OwnerReferences: []metav1.OwnerReference{ + { + APIVersion: "apps/v1", + Kind: "Deployment", + Name: "openfga", + UID: "test-uid-123", + }, + }, + }, + Spec: batchv1.JobSpec{ + BackoffLimit: ptr.To(int32(3)), + Template: corev1.PodTemplateSpec{ + Spec: corev1.PodSpec{ + Containers: []corev1.Container{{Name: "migrate", Image: "openfga/openfga:v1.14.0"}}, + RestartPolicy: corev1.RestartPolicyNever, + }, + }, + }, + Status: batchv1.JobStatus{ + Succeeded: 1, + }, + } + + r := newReconciler(dep, staleJob) + + // When: reconciling. + result, err := r.Reconcile(context.Background(), ctrl.Request{ + NamespacedName: types.NamespacedName{Name: "openfga", Namespace: "default"}, + }) + + // Then: no error, requeue to recreate with correct version. + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + if result.RequeueAfter == 0 { + t.Error("expected requeue after deleting stale job") + } + + // Verify the stale Job was deleted. + deletedJob := &batchv1.Job{} + if getErr := r.Get(context.Background(), types.NamespacedName{ + Name: "openfga-migrate", Namespace: "default", + }, deletedJob); getErr == nil { + t.Error("expected stale migration job to be deleted") + } + + // Verify ConfigMap was NOT updated (migration didn't actually run for v1.15.0). 
+ cm := &corev1.ConfigMap{} + if getErr := r.Get(context.Background(), types.NamespacedName{ + Name: "openfga-migration-status", Namespace: "default", + }, cm); getErr == nil { + if cm.Data["version"] == "v1.15.0" { + t.Error("ConfigMap should not be updated to v1.15.0 from a stale v1.14.0 job") + } + } +} + func TestReconcile_MigrationNotEnabled_Skips(t *testing.T) { // Given: a Deployment without the migration-enabled annotation. dep := newTestDeployment("openfga", "default", "openfga/openfga:v1.14.0", 3) From 0f0c736a11e720dee000b345eb16cfdbe8483a95 Mon Sep 17 00:00:00 2001 From: Ed Milic <edmilic@gmail.com> Date: Sat, 11 Apr 2026 12:30:15 -0400 Subject: [PATCH 10/42] fix: validate flags to prevent negative or out-of-range values (i.e. values overflowing) --- operator/cmd/main.go | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+) diff --git a/operator/cmd/main.go b/operator/cmd/main.go index 7917264..ac9bac6 100644 --- a/operator/cmd/main.go +++ b/operator/cmd/main.go @@ -2,6 +2,8 @@ package main import ( "flag" + "fmt" + "math" "os" "k8s.io/apimachinery/pkg/runtime" @@ -44,6 +46,22 @@ func main() { opts.BindFlags(flag.CommandLine) flag.Parse() + // Validate flag values. 
+ for _, v := range []struct { + name string + value int + max int + }{ + {"backoff-limit", backoffLimit, math.MaxInt32}, + {"active-deadline-seconds", activeDeadline, math.MaxInt32}, + {"ttl-seconds-after-finished", ttlAfterFinished, math.MaxInt32}, + } { + if v.value < 0 || v.value > v.max { + fmt.Fprintf(os.Stderr, "invalid value for --%s: must be between 0 and %d\n", v.name, v.max) + os.Exit(1) + } + } + ctrl.SetLogger(zap.New(zap.UseFlagOptions(&opts))) logger := ctrl.Log.WithName("setup") From c415e560ed6bf528f099262da16a7202de1701d4 Mon Sep 17 00:00:00 2001 From: Ed Milic <edmilic@gmail.com> Date: Sat, 11 Apr 2026 12:37:17 -0400 Subject: [PATCH 11/42] fix: gate legacy migration initContainers on operator.enabled When operator.enabled=true but migration.enabled=false, the legacy wait-for-migration initContainer could render and hang waiting for a Job that will never be created. Gate on operator.enabled instead so the legacy path is fully disabled when the operator is installed. --- charts/openfga/templates/deployment.yaml | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/charts/openfga/templates/deployment.yaml b/charts/openfga/templates/deployment.yaml index ded90ab..275bd86 100644 --- a/charts/openfga/templates/deployment.yaml +++ b/charts/openfga/templates/deployment.yaml @@ -52,8 +52,7 @@ spec: serviceAccountName: {{ include "openfga.serviceAccountName" . 
}} securityContext: {{- toYaml .Values.podSecurityContext | nindent 8 }} - {{- $operatorMigration := and .Values.operator.enabled .Values.migration.enabled }} - {{- $needsMigrationInit := and (not $operatorMigration) (has .Values.datastore.engine (list "postgres" "mysql")) .Values.datastore.applyMigrations .Values.datastore.waitForMigrations }} + {{- $needsMigrationInit := and (not .Values.operator.enabled) (has .Values.datastore.engine (list "postgres" "mysql")) .Values.datastore.applyMigrations .Values.datastore.waitForMigrations }} {{- if or $needsMigrationInit .Values.extraInitContainers }} initContainers: {{- if $needsMigrationInit }} From d64235805924dd26b011f31e3153b7127e2e02ed Mon Sep 17 00:00:00 2001 From: Ed Milic <edmilic@gmail.com> Date: Sat, 11 Apr 2026 12:53:39 -0400 Subject: [PATCH 12/42] fix: use single quotes for Helm annotations with nested template expressions The double-quoted annotation values contained inner double quotes (e.g. "memory") which produced invalid YAML that IDEs flagged as errors. Switch to single-quote wrappers so the inner Go template strings don't conflict. --- charts/openfga/templates/deployment.yaml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/charts/openfga/templates/deployment.yaml b/charts/openfga/templates/deployment.yaml index 275bd86..a70c93d 100644 --- a/charts/openfga/templates/deployment.yaml +++ b/charts/openfga/templates/deployment.yaml @@ -9,9 +9,9 @@ metadata: annotations: {{- if $hasOperatorAnnotations }} openfga.dev/migration-enabled: "true" - openfga.dev/desired-replicas: "{{ ternary 1 .Values.replicaCount (eq .Values.datastore.engine "memory") }}" + openfga.dev/desired-replicas: '{{ ternary 1 .Values.replicaCount (eq .Values.datastore.engine "memory") }}' {{- if or .Values.migration.serviceAccount.create .Values.migration.serviceAccount.name }} - openfga.dev/migration-service-account: "{{ include "openfga.migrationServiceAccountName" . 
}}" + openfga.dev/migration-service-account: '{{ include "openfga.migrationServiceAccountName" . }}' {{- end }} {{- end }} {{- with .Values.annotations }} From 3740978d137534920bc3016630812b56f1f01e73 Mon Sep 17 00:00:00 2001 From: Ed Milic <edmilic@gmail.com> Date: Sat, 11 Apr 2026 12:54:47 -0400 Subject: [PATCH 13/42] fix: address Copilot review round 6 on PR #309 - Pin Dockerfile base images by digest for reproducible builds - Fix version label fallback comparison for digest-pinned images by sanitizing desiredVersion before comparing to the label value --- operator/Dockerfile | 6 ++++-- operator/internal/controller/migration_controller.go | 9 ++++++++- 2 files changed, 12 insertions(+), 3 deletions(-) diff --git a/operator/Dockerfile b/operator/Dockerfile index 7d836a3..846414a 100644 --- a/operator/Dockerfile +++ b/operator/Dockerfile @@ -1,4 +1,5 @@ -FROM golang:1.26.2 AS builder +# pinned golang:1.26.2 linux/amd64 +FROM golang:1.26.2@sha256:b53c282df83967299380adbd6a2dc67e750a58217f39285d6240f6f80b19eaad AS builder WORKDIR /workspace COPY go.mod go.sum ./ @@ -9,7 +10,8 @@ COPY internal/ internal/ RUN CGO_ENABLED=0 GOOS=linux go build -ldflags="-s -w" -o /operator ./cmd/ -FROM gcr.io/distroless/static:nonroot +# pinned gcr.io/distroless/static:nonroot linux/amd64 +FROM gcr.io/distroless/static:nonroot@sha256:64c43684e6d2b581d1eb362ea47b6a4defee6a9cac5f7ebbda3daa67e8c9b8e6 WORKDIR / COPY --from=builder /operator . USER 65532:65532 diff --git a/operator/internal/controller/migration_controller.go b/operator/internal/controller/migration_controller.go index 1cf0f06..be00b7c 100644 --- a/operator/internal/controller/migration_controller.go +++ b/operator/internal/controller/migration_controller.go @@ -148,10 +148,17 @@ func (r *MigrationReconciler) Reconcile(ctx context.Context, req ctrl.Request) ( // 8b. If the existing Job is for a different version, delete it and recreate. // Check annotation first (supports digests > 63 chars), fall back to label. 
jobVersion := job.Annotations["openfga.dev/desired-version"] + versionMatch := jobVersion == desiredVersion if jobVersion == "" { + // Label values have ":" replaced with "_", so sanitize desiredVersion for comparison. + sanitized := strings.ReplaceAll(desiredVersion, ":", "_") + if len(sanitized) > 63 { + sanitized = sanitized[:63] + } jobVersion = job.Labels["app.kubernetes.io/version"] + versionMatch = jobVersion == sanitized } - if jobVersion != "" && jobVersion != desiredVersion { + if jobVersion != "" && !versionMatch { logger.Info("existing migration job is for a different version, deleting", "jobVersion", jobVersion, "desiredVersion", desiredVersion) propagation := metav1.DeletePropagationBackground if delErr := r.Delete(ctx, job, &client.DeleteOptions{ From a71fbe63e52494a795d46fefad65d8ca02096512 Mon Sep 17 00:00:00 2001 From: Ed Milic <edmilic@gmail.com> Date: Sat, 11 Apr 2026 14:13:27 -0400 Subject: [PATCH 14/42] fix: remove env var filtering, fix tests, updated README.md to clarify migration-specific volumes --- operator/README.md | 1 + operator/internal/controller/helpers.go | 11 +--- .../controller/migration_controller_test.go | 57 +++++++++++++++++-- 3 files changed, 55 insertions(+), 14 deletions(-) diff --git a/operator/README.md b/operator/README.md index 0e43a3d..c9efbe8 100644 --- a/operator/README.md +++ b/operator/README.md @@ -131,3 +131,4 @@ The operator reads these annotations from the OpenFGA Deployment: ## Limitations - **Mutable image tags:** The operator detects version changes by comparing the container image tag (or digest). If you deploy with a mutable tag like `latest` or reuse the same tag for different builds, the operator will not detect changes and will skip the migration. Use immutable tags (e.g., `v1.14.0`) or pin images by digest for reliable migration triggering. +- **Migration-specific volumes:** The legacy Helm chart values `migrate.extraVolumes` and `migrate.extraVolumeMounts` have no effect in operator mode. 
The operator inherits volumes and mounts from the main Deployment pod spec. If you need additional volumes for migrations (e.g., CA bundles or TLS certs), add them to the top-level `extraVolumes` and `extraVolumeMounts` values instead. diff --git a/operator/internal/controller/helpers.go b/operator/internal/controller/helpers.go index 871d1a5..deaa9e1 100644 --- a/operator/internal/controller/helpers.go +++ b/operator/internal/controller/helpers.go @@ -100,14 +100,6 @@ func buildMigrationJob( migrationSA = deployment.Spec.Template.Spec.ServiceAccountName } - // Filter env vars — only pass datastore-related vars to the migration Job. - var datastoreEnvVars []corev1.EnvVar - for _, env := range mainContainer.Env { - if strings.HasPrefix(env.Name, "OPENFGA_DATASTORE_") { - datastoreEnvVars = append(datastoreEnvVars, env) - } - } - // Sanitize version for use as a label value (must match [a-zA-Z0-9._-], max 63 chars). // The full version is stored in an annotation for accurate comparison. labelVersion := strings.ReplaceAll(desiredVersion, ":", "_") @@ -160,7 +152,8 @@ func buildMigrationJob( Name: "migrate-database", Image: mainContainer.Image, Args: []string{"migrate"}, - Env: datastoreEnvVars, + Env: mainContainer.Env, + EnvFrom: mainContainer.EnvFrom, VolumeMounts: mainContainer.VolumeMounts, SecurityContext: mainContainer.SecurityContext, }, diff --git a/operator/internal/controller/migration_controller_test.go b/operator/internal/controller/migration_controller_test.go index 636182c..be37fe3 100644 --- a/operator/internal/controller/migration_controller_test.go +++ b/operator/internal/controller/migration_controller_test.go @@ -67,7 +67,8 @@ func newTestDeployment(name, namespace, image string, replicas int32) *appsv1.De func newReconciler(objects ...runtime.Object) *MigrationReconciler { scheme := newScheme() - clientBuilder := fake.NewClientBuilder().WithScheme(scheme) + clientBuilder := fake.NewClientBuilder().WithScheme(scheme). 
+ WithStatusSubresource(&appsv1.Deployment{}) for _, obj := range objects { clientBuilder = clientBuilder.WithRuntimeObjects(obj) } @@ -79,6 +80,15 @@ func newReconciler(objects ...runtime.Object) *MigrationReconciler { } } +func findCondition(conditions []appsv1.DeploymentCondition, condType string) *appsv1.DeploymentCondition { + for i := range conditions { + if string(conditions[i].Type) == condType { + return &conditions[i] + } + } + return nil +} + func TestReconcile_FirstInstall_CreatesJob(t *testing.T) { // Given: a Deployment with no migration-status ConfigMap. dep := newTestDeployment("openfga", "default", "openfga/openfga:v1.14.0", 0) @@ -113,10 +123,14 @@ func TestReconcile_FirstInstall_CreatesJob(t *testing.T) { t.Errorf("expected job args [migrate], got %v", job.Spec.Template.Spec.Containers[0].Args) } - // Verify only datastore env vars were passed. + // Verify all env vars from the main container were passed. + jobEnvNames := make(map[string]bool) for _, env := range job.Spec.Template.Spec.Containers[0].Env { - if env.Name == "OPENFGA_LOG_LEVEL" { - t.Error("non-datastore env var OPENFGA_LOG_LEVEL should not be passed to migration job") + jobEnvNames[env.Name] = true + } + for _, expected := range []string{"OPENFGA_DATASTORE_ENGINE", "OPENFGA_DATASTORE_URI", "OPENFGA_LOG_LEVEL"} { + if !jobEnvNames[expected] { + t.Errorf("expected env var %s to be passed to migration job", expected) } } } @@ -166,9 +180,18 @@ func TestReconcile_VersionMatch_ScalesUp(t *testing.T) { } func TestReconcile_JobSucceeded_UpdatesConfigMapAndScalesUp(t *testing.T) { - // Given: a Deployment at 0 replicas, no ConfigMap, and a succeeded migration Job. + // Given: a Deployment at 0 replicas, no ConfigMap, a succeeded migration Job, + // and a pre-existing MigrationFailed condition from a prior attempt. 
dep := newTestDeployment("openfga", "default", "openfga/openfga:v1.14.0", 0) dep.Annotations[AnnotationDesiredReplicas] = "3" + dep.Status.Conditions = []appsv1.DeploymentCondition{ + { + Type: "MigrationFailed", + Status: corev1.ConditionTrue, + Reason: "MigrationJobFailed", + Message: "Database migration failed for version v1.13.0.", + }, + } job := &batchv1.Job{ ObjectMeta: metav1.ObjectMeta{ @@ -230,6 +253,18 @@ func TestReconcile_JobSucceeded_UpdatesConfigMapAndScalesUp(t *testing.T) { if *updated.Spec.Replicas != 3 { t.Errorf("expected 3 replicas, got %d", *updated.Spec.Replicas) } + + // Verify MigrationFailed condition was cleared. + cond := findCondition(updated.Status.Conditions, "MigrationFailed") + if cond == nil { + t.Fatal("expected MigrationFailed condition to exist") + } + if cond.Status != corev1.ConditionFalse { + t.Errorf("expected MigrationFailed status False after success, got %s", cond.Status) + } + if cond.Reason != "MigrationSucceeded" { + t.Errorf("expected reason MigrationSucceeded, got %s", cond.Reason) + } } func TestReconcile_JobFailed_SetsRetryAnnotationAndRequeues(t *testing.T) { @@ -302,6 +337,18 @@ func TestReconcile_JobFailed_SetsRetryAnnotationAndRequeues(t *testing.T) { if _, ok := updated.Annotations[AnnotationRetryAfter]; !ok { t.Error("expected retry-after annotation to be set on Deployment") } + + // Verify MigrationFailed condition was set. 
+ cond := findCondition(updated.Status.Conditions, "MigrationFailed")
+ if cond == nil {
+ t.Fatal("expected MigrationFailed condition to be set")
+ }
+ if cond.Status != corev1.ConditionTrue {
+ t.Errorf("expected MigrationFailed status True, got %s", cond.Status)
+ }
+ if cond.Reason != "MigrationJobFailed" {
+ t.Errorf("expected reason MigrationJobFailed, got %s", cond.Reason)
+ }
 }
 
 func TestReconcile_RetryAfterCooldown_SkipsJobCreation(t *testing.T) {

From 767d6ebf5cd03c3e36bb35699fe5fbd1e9b3bc15 Mon Sep 17 00:00:00 2001
From: Ed Milic <edmilic@gmail.com>
Date: Sat, 11 Apr 2026 14:31:11 -0400
Subject: [PATCH 15/42] fix: include OwnerReferences when updating migration status ConfigMap

---
 operator/internal/controller/helpers.go | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/operator/internal/controller/helpers.go b/operator/internal/controller/helpers.go
index deaa9e1..7a4ac2a 100644
--- a/operator/internal/controller/helpers.go
+++ b/operator/internal/controller/helpers.go
@@ -219,9 +219,11 @@ func updateMigrationStatus(
 return nil
 }
 
- // Update existing ConfigMap.
+ // Update existing ConfigMap (including OwnerReferences in case the Deployment
+ // was deleted and recreated with a new UID).
 existing.Data = cm.Data
 existing.Labels = cm.Labels
+ existing.OwnerReferences = cm.OwnerReferences
 if updateErr := c.Update(ctx, existing); updateErr != nil {
 return fmt.Errorf("updating migration status ConfigMap: %w", updateErr)
 }

From 65be2d7e5321b54aea60e484b42ae946ce3a8fb9 Mon Sep 17 00:00:00 2001
From: Ed Milic <edmilic@gmail.com>
Date: Sat, 11 Apr 2026 14:36:25 -0400
Subject: [PATCH 16/42] feat: add PodDisruptionBudget to operator subchart

Prevents the operator from being evicted during node drains, which could
leave the OpenFGA Deployment stuck at 0 replicas with no controller to
scale it back up. Disabled by default; supports both minAvailable and
maxUnavailable modes.

--- charts/openfga-operator/templates/pdb.yaml | 18 ++++++++++++++++++ charts/openfga-operator/values.yaml | 10 ++++++++++ 2 files changed, 28 insertions(+) create mode 100644 charts/openfga-operator/templates/pdb.yaml diff --git a/charts/openfga-operator/templates/pdb.yaml b/charts/openfga-operator/templates/pdb.yaml new file mode 100644 index 0000000..6c3514e --- /dev/null +++ b/charts/openfga-operator/templates/pdb.yaml @@ -0,0 +1,18 @@ +{{- if .Values.podDisruptionBudget.enabled -}} +apiVersion: policy/v1 +kind: PodDisruptionBudget +metadata: + name: {{ include "openfga-operator.fullname" . }} + namespace: {{ include "openfga-operator.namespace" . }} + labels: + {{- include "openfga-operator.labels" . | nindent 4 }} +spec: + {{- if .Values.podDisruptionBudget.minAvailable }} + minAvailable: {{ .Values.podDisruptionBudget.minAvailable }} + {{- else }} + maxUnavailable: {{ .Values.podDisruptionBudget.maxUnavailable | default 1 }} + {{- end }} + selector: + matchLabels: + {{- include "openfga-operator.selectorLabels" . | nindent 6 }} +{{- end }} diff --git a/charts/openfga-operator/values.yaml b/charts/openfga-operator/values.yaml index 1b52525..4b308a2 100644 --- a/charts/openfga-operator/values.yaml +++ b/charts/openfga-operator/values.yaml @@ -49,6 +49,16 @@ resources: {} # limits: # memory: 128Mi +podDisruptionBudget: + # -- Enable a PodDisruptionBudget for the operator. + enabled: false + # -- Minimum number of pods that must be available during disruption. + # Cannot be set together with maxUnavailable. + minAvailable: "" + # -- Maximum number of pods that can be unavailable during disruption. + # Defaults to 1 when enabled and minAvailable is not set. 
+ maxUnavailable: 1 + nodeSelector: {} tolerations: [] From aaddab3ee886b98972d2adcab144d6fa0f9b5748 Mon Sep 17 00:00:00 2001 From: Ed Milic <edmilic@gmail.com> Date: Sat, 11 Apr 2026 14:38:23 -0400 Subject: [PATCH 17/42] fix: use Job conditions instead of status counters for failure detection Replace checks on job.Status.Failed >= backoffLimit with isJobConditionTrue(job, batchv1.JobFailed), and job.Status.Succeeded with batchv1.JobComplete. The Job controller sets conditions atomically when it makes its final decision, avoiding races where the operator acts on intermediate counter states before Kubernetes has finished cleaning up. --- .../controller/migration_controller.go | 23 ++++++++++++------- .../controller/migration_controller_test.go | 18 +++++++++++++++ 2 files changed, 33 insertions(+), 8 deletions(-) diff --git a/operator/internal/controller/migration_controller.go b/operator/internal/controller/migration_controller.go index be00b7c..75cbf19 100644 --- a/operator/internal/controller/migration_controller.go +++ b/operator/internal/controller/migration_controller.go @@ -169,8 +169,8 @@ func (r *MigrationReconciler) Reconcile(ctx context.Context, req ctrl.Request) ( return ctrl.Result{RequeueAfter: 5 * time.Second}, nil } - // 9. Check Job status. - if job.Status.Succeeded >= 1 { + // 9. Check Job status using conditions for authoritative completion signals. + if isJobConditionTrue(job, batchv1.JobComplete) { logger.Info("migration succeeded", "version", desiredVersion) // Clear MigrationFailed condition. 
@@ -193,12 +193,7 @@ func (r *MigrationReconciler) Reconcile(ctx context.Context, req ctrl.Request) ( return ctrl.Result{}, nil } - backoffLimit := r.BackoffLimit - if job.Spec.BackoffLimit != nil { - backoffLimit = *job.Spec.BackoffLimit - } - - if job.Status.Failed >= backoffLimit { + if isJobConditionTrue(job, batchv1.JobFailed) { logger.Info("migration job failed, will delete and retry", "job", jobName, "version", desiredVersion) // Set condition so kubectl describe shows the failure. @@ -238,6 +233,18 @@ func (r *MigrationReconciler) Reconcile(ctx context.Context, req ctrl.Request) ( return ctrl.Result{RequeueAfter: 10 * time.Second}, nil } +// isJobConditionTrue returns true if the Job has a condition of the given type +// with status True. This is more reliable than comparing status counters because +// the Job controller sets conditions atomically when it makes its final decision. +func isJobConditionTrue(job *batchv1.Job, conditionType batchv1.JobConditionType) bool { + for _, c := range job.Status.Conditions { + if c.Type == conditionType && c.Status == corev1.ConditionTrue { + return true + } + } + return false +} + // isMemoryDatastore checks if the Deployment is using the memory datastore // (no database migration needed). 
func isMemoryDatastore(container *corev1.Container) bool { diff --git a/operator/internal/controller/migration_controller_test.go b/operator/internal/controller/migration_controller_test.go index be37fe3..c410aaf 100644 --- a/operator/internal/controller/migration_controller_test.go +++ b/operator/internal/controller/migration_controller_test.go @@ -217,6 +217,12 @@ func TestReconcile_JobSucceeded_UpdatesConfigMapAndScalesUp(t *testing.T) { }, Status: batchv1.JobStatus{ Succeeded: 1, + Conditions: []batchv1.JobCondition{ + { + Type: batchv1.JobComplete, + Status: corev1.ConditionTrue, + }, + }, }, } @@ -296,6 +302,12 @@ func TestReconcile_JobFailed_SetsRetryAnnotationAndRequeues(t *testing.T) { }, Status: batchv1.JobStatus{ Failed: 3, + Conditions: []batchv1.JobCondition{ + { + Type: batchv1.JobFailed, + Status: corev1.ConditionTrue, + }, + }, }, } @@ -522,6 +534,12 @@ func TestReconcile_StaleJob_DeletedAndRequeued(t *testing.T) { }, Status: batchv1.JobStatus{ Succeeded: 1, + Conditions: []batchv1.JobCondition{ + { + Type: batchv1.JobComplete, + Status: corev1.ConditionTrue, + }, + }, }, } From 6d25647034518ae06cafd8f1dfa1b148d72de9f2 Mon Sep 17 00:00:00 2001 From: Ed Milic <edmilic@gmail.com> Date: Sat, 11 Apr 2026 14:39:10 -0400 Subject: [PATCH 18/42] test: add helm-unittest tests for operator mode 20 new tests across 4 files covering the operator-enabled code paths that previously had no unit test coverage: - deployment: annotations, replicas=0, autoscaling conflict, initContainers gating - job: template not rendered when operator enabled - serviceaccount: migration SA creation, custom names, IRSA annotations - rbac: legacy Role/RoleBinding not rendered when operator enabled --- .../openfga/tests/operator_mode_job_test.yaml | 28 ++++ .../tests/operator_mode_rbac_test.yaml | 25 ++++ .../operator_mode_serviceaccount_test.yaml | 66 +++++++++ charts/openfga/tests/operator_mode_test.yaml | 134 ++++++++++++++++++ 4 files changed, 253 insertions(+) create mode 100644 
charts/openfga/tests/operator_mode_job_test.yaml create mode 100644 charts/openfga/tests/operator_mode_rbac_test.yaml create mode 100644 charts/openfga/tests/operator_mode_serviceaccount_test.yaml create mode 100644 charts/openfga/tests/operator_mode_test.yaml diff --git a/charts/openfga/tests/operator_mode_job_test.yaml b/charts/openfga/tests/operator_mode_job_test.yaml new file mode 100644 index 0000000..31d5760 --- /dev/null +++ b/charts/openfga/tests/operator_mode_job_test.yaml @@ -0,0 +1,28 @@ +suite: operator mode - job template +templates: + - templates/job.yaml +tests: + - it: should not render migration job when operator is enabled + set: + operator.enabled: true + migration.enabled: true + datastore.engine: postgres + datastore.uri: "postgres://localhost/openfga" + datastore.applyMigrations: true + datastore.migrationType: job + asserts: + - hasDocuments: + count: 0 + + - it: should render migration job when operator is disabled + set: + operator.enabled: false + datastore.engine: postgres + datastore.uri: "postgres://localhost/openfga" + datastore.applyMigrations: true + datastore.migrationType: job + asserts: + - hasDocuments: + count: 1 + - isKind: + of: Job diff --git a/charts/openfga/tests/operator_mode_rbac_test.yaml b/charts/openfga/tests/operator_mode_rbac_test.yaml new file mode 100644 index 0000000..bb60846 --- /dev/null +++ b/charts/openfga/tests/operator_mode_rbac_test.yaml @@ -0,0 +1,25 @@ +suite: operator mode - RBAC +templates: + - templates/rbac.yaml +tests: + - it: should not render legacy RBAC when operator is enabled + set: + operator.enabled: true + serviceAccount.create: true + asserts: + - hasDocuments: + count: 0 + + - it: should render legacy RBAC when operator is disabled + set: + operator.enabled: false + serviceAccount.create: true + asserts: + - hasDocuments: + count: 2 + - isKind: + of: Role + documentIndex: 0 + - isKind: + of: RoleBinding + documentIndex: 1 diff --git 
a/charts/openfga/tests/operator_mode_serviceaccount_test.yaml b/charts/openfga/tests/operator_mode_serviceaccount_test.yaml new file mode 100644 index 0000000..cbeab1a --- /dev/null +++ b/charts/openfga/tests/operator_mode_serviceaccount_test.yaml @@ -0,0 +1,66 @@ +suite: operator mode - service accounts +templates: + - templates/serviceaccount.yaml +tests: + - it: should render migration service account when operator is enabled + set: + operator.enabled: true + migration.enabled: true + migration.serviceAccount.create: true + serviceAccount.create: true + asserts: + - hasDocuments: + count: 2 + - isKind: + of: ServiceAccount + documentIndex: 1 + - equal: + path: metadata.name + value: RELEASE-NAME-openfga-migration + documentIndex: 1 + + - it: should not render migration service account when operator is disabled + set: + operator.enabled: false + serviceAccount.create: true + asserts: + - hasDocuments: + count: 1 + + - it: should not render migration service account when migration SA creation is disabled + set: + operator.enabled: true + migration.enabled: true + migration.serviceAccount.create: false + migration.serviceAccount.name: external-sa + serviceAccount.create: true + asserts: + - hasDocuments: + count: 1 + + - it: should render migration service account with custom annotations + set: + operator.enabled: true + migration.enabled: true + migration.serviceAccount.create: true + migration.serviceAccount.annotations: + eks.amazonaws.com/role-arn: "arn:aws:iam::123456789012:role/openfga-migrator" + serviceAccount.create: true + asserts: + - equal: + path: metadata.annotations["eks.amazonaws.com/role-arn"] + value: "arn:aws:iam::123456789012:role/openfga-migrator" + documentIndex: 1 + + - it: should use custom migration service account name + set: + operator.enabled: true + migration.enabled: true + migration.serviceAccount.create: true + migration.serviceAccount.name: my-migrator + serviceAccount.create: true + asserts: + - equal: + path: metadata.name + 
value: my-migrator + documentIndex: 1 diff --git a/charts/openfga/tests/operator_mode_test.yaml b/charts/openfga/tests/operator_mode_test.yaml new file mode 100644 index 0000000..31d06c1 --- /dev/null +++ b/charts/openfga/tests/operator_mode_test.yaml @@ -0,0 +1,134 @@ +suite: operator mode +templates: + - templates/deployment.yaml +tests: + # --- Deployment annotations --- + - it: should set operator annotations when operator and migration are enabled + set: + operator.enabled: true + migration.enabled: true + replicaCount: 3 + datastore.engine: postgres + asserts: + - equal: + path: metadata.annotations["openfga.dev/migration-enabled"] + value: "true" + - equal: + path: metadata.annotations["openfga.dev/desired-replicas"] + value: "3" + - equal: + path: metadata.annotations["openfga.dev/migration-service-account"] + value: RELEASE-NAME-openfga-migration + + - it: should not set operator annotations when operator is disabled + set: + operator.enabled: false + annotations: + custom: value + asserts: + - isNull: + path: metadata.annotations["openfga.dev/migration-enabled"] + - isNull: + path: metadata.annotations["openfga.dev/desired-replicas"] + + - it: should set desired-replicas to 1 for memory datastore + set: + operator.enabled: true + migration.enabled: true + replicaCount: 5 + datastore.engine: memory + asserts: + - equal: + path: metadata.annotations["openfga.dev/desired-replicas"] + value: "1" + + - it: should use custom migration service account name when set + set: + operator.enabled: true + migration.enabled: true + datastore.engine: postgres + migration.serviceAccount.name: my-custom-sa + asserts: + - equal: + path: metadata.annotations["openfga.dev/migration-service-account"] + value: my-custom-sa + + - it: should not set migration-service-account annotation when SA creation is disabled and no name set + set: + operator.enabled: true + migration.enabled: true + datastore.engine: postgres + migration.serviceAccount.create: false + asserts: + - isNull: + 
path: metadata.annotations["openfga.dev/migration-service-account"] + + # --- Replica count --- + - it: should set replicas to 0 when operator is enabled with database datastore + set: + operator.enabled: true + migration.enabled: true + replicaCount: 3 + datastore.engine: postgres + asserts: + - equal: + path: spec.replicas + value: 0 + + - it: should set replicas to 1 when operator is enabled with memory datastore + set: + operator.enabled: true + migration.enabled: true + replicaCount: 5 + datastore.engine: memory + asserts: + - equal: + path: spec.replicas + value: 1 + + - it: should set replicas to replicaCount when operator is disabled + set: + operator.enabled: false + replicaCount: 5 + datastore.engine: postgres + asserts: + - equal: + path: spec.replicas + value: 5 + + # --- Autoscaling conflict --- + - it: should fail when operator and autoscaling are both enabled + set: + operator.enabled: true + migration.enabled: true + autoscaling.enabled: true + datastore.engine: postgres + asserts: + - failedTemplate: + errorMessage: "operator.enabled and autoscaling.enabled cannot both be true" + + # --- initContainers gating --- + - it: should not render migration initContainers when operator is enabled + set: + operator.enabled: true + migration.enabled: true + datastore.engine: postgres + datastore.uri: "postgres://localhost/openfga" + datastore.applyMigrations: true + datastore.waitForMigrations: true + datastore.migrationType: job + asserts: + - isNull: + path: spec.template.spec.initContainers + + - it: should render migration initContainers when operator is disabled + set: + operator.enabled: false + datastore.engine: postgres + datastore.uri: "postgres://localhost/openfga" + datastore.applyMigrations: true + datastore.waitForMigrations: true + datastore.migrationType: job + asserts: + - isNotNull: + path: spec.template.spec.initContainers From 3d88092cca972d98c87c44b02dee4af893b80b2b Mon Sep 17 00:00:00 2001 From: Ed Milic <edmilic@gmail.com> Date: Sat, 11 
Apr 2026 14:50:27 -0400 Subject: [PATCH 19/42] fix: inherit resource limits in operator migration Job The old Helm-templated migration Job uses datastore.migrations.resources for resource limits, but the operator-built Job had none. Inherit the main container's Resources to maintain parity and prevent unbounded resource consumption during migrations. --- operator/internal/controller/helpers.go | 1 + 1 file changed, 1 insertion(+) diff --git a/operator/internal/controller/helpers.go b/operator/internal/controller/helpers.go index 7a4ac2a..794210b 100644 --- a/operator/internal/controller/helpers.go +++ b/operator/internal/controller/helpers.go @@ -154,6 +154,7 @@ func buildMigrationJob( Args: []string{"migrate"}, Env: mainContainer.Env, EnvFrom: mainContainer.EnvFrom, + Resources: mainContainer.Resources, VolumeMounts: mainContainer.VolumeMounts, SecurityContext: mainContainer.SecurityContext, }, From f8644518e5a0c2cce56f58b1616e6bb4646f6e8d Mon Sep 17 00:00:00 2001 From: Ed Milic <edmilic@gmail.com> Date: Sat, 11 Apr 2026 14:54:33 -0400 Subject: [PATCH 20/42] test: add missing controller unit tests for edge cases Four new tests covering previously untested code paths: - StaleJob_LabelOnlyFallback: version-mismatch detection when Job has only a label (no annotation), exercising the sanitized-label fallback - JobSucceeded_UpdatesExistingConfigMap: ConfigMap update path when a prior version's ConfigMap already exists - ScaleToZero_NilAnnotationsMap: scaleDeploymentToZero correctly stores desired-replicas when the annotation was not previously set - JobInProgress_Requeues: in-progress Job triggers 10s requeue without scaling up or modifying the Deployment --- .../controller/migration_controller_test.go | 246 ++++++++++++++++++ 1 file changed, 246 insertions(+) diff --git a/operator/internal/controller/migration_controller_test.go b/operator/internal/controller/migration_controller_test.go index c410aaf..b5aa702 100644 --- 
a/operator/internal/controller/migration_controller_test.go +++ b/operator/internal/controller/migration_controller_test.go @@ -617,6 +617,252 @@ func TestReconcile_MigrationNotEnabled_Skips(t *testing.T) { } } +func TestReconcile_StaleJob_LabelOnlyFallback_DeletedAndRequeued(t *testing.T) { + // Given: a Deployment at v1.15.0 with an existing Job that only has a label (no annotation). + dep := newTestDeployment("openfga", "default", "openfga/openfga:v1.15.0", 0) + dep.Annotations[AnnotationDesiredReplicas] = "3" + + staleJob := &batchv1.Job{ + ObjectMeta: metav1.ObjectMeta{ + Name: "openfga-migrate", + Namespace: "default", + Labels: map[string]string{ + "app.kubernetes.io/version": "v1.14.0", + }, + // No annotation — forces the label-only fallback path. + OwnerReferences: []metav1.OwnerReference{ + { + APIVersion: "apps/v1", + Kind: "Deployment", + Name: "openfga", + UID: "test-uid-123", + }, + }, + }, + Spec: batchv1.JobSpec{ + BackoffLimit: ptr.To(int32(3)), + Template: corev1.PodTemplateSpec{ + Spec: corev1.PodSpec{ + Containers: []corev1.Container{{Name: "migrate", Image: "openfga/openfga:v1.14.0"}}, + RestartPolicy: corev1.RestartPolicyNever, + }, + }, + }, + } + + r := newReconciler(dep, staleJob) + + // When: reconciling. + result, err := r.Reconcile(context.Background(), ctrl.Request{ + NamespacedName: types.NamespacedName{Name: "openfga", Namespace: "default"}, + }) + + // Then: stale Job should be deleted and requeue requested. 
+ if err != nil { + t.Fatalf("unexpected error: %v", err) + } + if result.RequeueAfter == 0 { + t.Error("expected requeue after deleting stale job") + } + + deletedJob := &batchv1.Job{} + if getErr := r.Get(context.Background(), types.NamespacedName{ + Name: "openfga-migrate", Namespace: "default", + }, deletedJob); getErr == nil { + t.Error("expected stale migration job to be deleted") + } +} + +func TestReconcile_JobSucceeded_UpdatesExistingConfigMap(t *testing.T) { + // Given: a Deployment with a pre-existing ConfigMap from v1.13.0 and a succeeded Job for v1.14.0. + dep := newTestDeployment("openfga", "default", "openfga/openfga:v1.14.0", 0) + dep.Annotations[AnnotationDesiredReplicas] = "3" + + existingCM := &corev1.ConfigMap{ + ObjectMeta: metav1.ObjectMeta{ + Name: "openfga-migration-status", + Namespace: "default", + Labels: map[string]string{ + LabelPartOf: LabelPartOfValue, + LabelComponent: "migration", + "app.kubernetes.io/managed-by": "openfga-operator", + }, + OwnerReferences: []metav1.OwnerReference{ + { + APIVersion: "apps/v1", + Kind: "Deployment", + Name: "openfga", + UID: "test-uid-123", + }, + }, + }, + Data: map[string]string{ + "version": "v1.13.0", + "migratedAt": "2026-04-01T12:00:00Z", + "jobName": "openfga-migrate", + }, + } + + job := &batchv1.Job{ + ObjectMeta: metav1.ObjectMeta{ + Name: "openfga-migrate", + Namespace: "default", + OwnerReferences: []metav1.OwnerReference{ + { + APIVersion: "apps/v1", + Kind: "Deployment", + Name: "openfga", + UID: "test-uid-123", + }, + }, + }, + Spec: batchv1.JobSpec{ + BackoffLimit: ptr.To(int32(3)), + Template: corev1.PodTemplateSpec{ + Spec: corev1.PodSpec{ + Containers: []corev1.Container{{Name: "migrate", Image: "openfga/openfga:v1.14.0"}}, + RestartPolicy: corev1.RestartPolicyNever, + }, + }, + }, + Status: batchv1.JobStatus{ + Succeeded: 1, + Conditions: []batchv1.JobCondition{ + { + Type: batchv1.JobComplete, + Status: corev1.ConditionTrue, + }, + }, + }, + } + + r := newReconciler(dep, 
existingCM, job) + + // When: reconciling. + _, err := r.Reconcile(context.Background(), ctrl.Request{ + NamespacedName: types.NamespacedName{Name: "openfga", Namespace: "default"}, + }) + + // Then: no error. + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + + // Verify ConfigMap was updated to v1.14.0. + cm := &corev1.ConfigMap{} + if getErr := r.Get(context.Background(), types.NamespacedName{ + Name: "openfga-migration-status", Namespace: "default", + }, cm); getErr != nil { + t.Fatalf("expected ConfigMap to exist: %v", getErr) + } + if cm.Data["version"] != "v1.14.0" { + t.Errorf("expected version v1.14.0 in ConfigMap, got %s", cm.Data["version"]) + } +} + +func TestReconcile_ScaleToZero_NilAnnotationsMap(t *testing.T) { + // Given: a Deployment with nil Annotations map and replicas > 0. + dep := newTestDeployment("openfga", "default", "openfga/openfga:v1.14.0", 3) + dep.Annotations = nil + // Re-add the required annotation via a fresh map — but test that scaleDeploymentToZero + // handles nil gracefully by setting it only via the migration-enabled annotation. + dep.Annotations = map[string]string{ + AnnotationMigrationEnabled: "true", + } + + r := newReconciler(dep) + + // When: reconciling — this will call scaleDeploymentToZero which must handle + // the case where AnnotationDesiredReplicas is not yet set. + result, err := r.Reconcile(context.Background(), ctrl.Request{ + NamespacedName: types.NamespacedName{Name: "openfga", Namespace: "default"}, + }) + + // Then: no error, Job created. + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + if result.RequeueAfter == 0 { + t.Error("expected requeue after creating job") + } + + // Verify Deployment was scaled to 0 and desired-replicas annotation was preserved. 
+ updated := &appsv1.Deployment{} + if getErr := r.Get(context.Background(), types.NamespacedName{ + Name: "openfga", Namespace: "default", + }, updated); getErr != nil { + t.Fatalf("getting deployment: %v", getErr) + } + if *updated.Spec.Replicas != 0 { + t.Errorf("expected 0 replicas, got %d", *updated.Spec.Replicas) + } + if updated.Annotations[AnnotationDesiredReplicas] != "3" { + t.Errorf("expected desired-replicas=3, got %s", updated.Annotations[AnnotationDesiredReplicas]) + } +} + +func TestReconcile_JobInProgress_Requeues(t *testing.T) { + // Given: a Deployment with a running Job (no conditions set yet). + dep := newTestDeployment("openfga", "default", "openfga/openfga:v1.14.0", 0) + dep.Annotations[AnnotationDesiredReplicas] = "3" + + job := &batchv1.Job{ + ObjectMeta: metav1.ObjectMeta{ + Name: "openfga-migrate", + Namespace: "default", + Annotations: map[string]string{ + "openfga.dev/desired-version": "v1.14.0", + }, + OwnerReferences: []metav1.OwnerReference{ + { + APIVersion: "apps/v1", + Kind: "Deployment", + Name: "openfga", + UID: "test-uid-123", + }, + }, + }, + Spec: batchv1.JobSpec{ + BackoffLimit: ptr.To(int32(3)), + Template: corev1.PodTemplateSpec{ + Spec: corev1.PodSpec{ + Containers: []corev1.Container{{Name: "migrate", Image: "openfga/openfga:v1.14.0"}}, + RestartPolicy: corev1.RestartPolicyNever, + }, + }, + }, + Status: batchv1.JobStatus{ + Active: 1, + }, + } + + r := newReconciler(dep, job) + + // When: reconciling. + result, err := r.Reconcile(context.Background(), ctrl.Request{ + NamespacedName: types.NamespacedName{Name: "openfga", Namespace: "default"}, + }) + + // Then: no error, requeue after 10s to poll progress. + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + if result.RequeueAfter != 10*time.Second { + t.Errorf("expected 10s requeue for in-progress job, got %v", result.RequeueAfter) + } + + // Verify Deployment still at 0 replicas. 
+ updated := &appsv1.Deployment{} + if getErr := r.Get(context.Background(), types.NamespacedName{ + Name: "openfga", Namespace: "default", + }, updated); getErr != nil { + t.Fatalf("getting deployment: %v", getErr) + } + if *updated.Spec.Replicas != 0 { + t.Errorf("expected 0 replicas while job in progress, got %d", *updated.Spec.Replicas) + } +} + func TestExtractImageTag(t *testing.T) { tests := []struct { image string From f70c58f4001e9a1011a46702c14d0a0267f39664 Mon Sep 17 00:00:00 2001 From: Ed Milic <edmilic@gmail.com> Date: Sat, 11 Apr 2026 15:07:33 -0400 Subject: [PATCH 21/42] fix: wire migration Job flags (backoff/deadline/TTL) through Helm values The README documented these flags as configurable via the subchart values.yaml, but only --leader-elect and --watch-namespace were actually wired. Add migrationJob.backoffLimit, activeDeadlineSeconds, and ttlSecondsAfterFinished values with matching args in the operator Deployment template. --- charts/openfga-operator/templates/deployment.yaml | 3 +++ charts/openfga-operator/values.yaml | 8 ++++++++ 2 files changed, 11 insertions(+) diff --git a/charts/openfga-operator/templates/deployment.yaml b/charts/openfga-operator/templates/deployment.yaml index ecad091..4ef3b80 100644 --- a/charts/openfga-operator/templates/deployment.yaml +++ b/charts/openfga-operator/templates/deployment.yaml @@ -43,6 +43,9 @@ spec: {{- if .Values.watchNamespace }} - --watch-namespace={{ .Values.watchNamespace }} {{- end }} + - --backoff-limit={{ .Values.migrationJob.backoffLimit }} + - --active-deadline-seconds={{ .Values.migrationJob.activeDeadlineSeconds }} + - --ttl-seconds-after-finished={{ .Values.migrationJob.ttlSecondsAfterFinished }} env: - name: POD_NAMESPACE valueFrom: diff --git a/charts/openfga-operator/values.yaml b/charts/openfga-operator/values.yaml index 4b308a2..56a0b5b 100644 --- a/charts/openfga-operator/values.yaml +++ b/charts/openfga-operator/values.yaml @@ -42,6 +42,14 @@ leaderElection: # -- Enable leader 
election for controller manager. enabled: true +migrationJob: + # -- Number of pod failures before a migration Job is considered failed. + backoffLimit: 3 + # -- Maximum wall-clock seconds a migration Job can run before being terminated. + activeDeadlineSeconds: 300 + # -- Seconds to keep completed/failed Job pods for log inspection before garbage collection. + ttlSecondsAfterFinished: 300 + resources: {} # requests: # cpu: 10m From 1be6309efb1dba857604d6213b0d6050cf2af9f3 Mon Sep 17 00:00:00 2001 From: Ed Milic <edmilic@gmail.com> Date: Sat, 11 Apr 2026 15:08:55 -0400 Subject: [PATCH 22/42] fix: document namespaceOverride in operator subchart values.yaml The _helpers.tpl namespace template already supported namespaceOverride but the value was undeclared in values.yaml, making it undiscoverable. --- charts/openfga-operator/values.yaml | 3 +++ 1 file changed, 3 insertions(+) diff --git a/charts/openfga-operator/values.yaml b/charts/openfga-operator/values.yaml index 56a0b5b..6a4c165 100644 --- a/charts/openfga-operator/values.yaml +++ b/charts/openfga-operator/values.yaml @@ -9,6 +9,9 @@ image: imagePullSecrets: [] nameOverride: "" fullnameOverride: "" +# -- Override the namespace for all operator resources. +# Useful when the parent chart deploys subcharts into a different namespace. +namespaceOverride: "" serviceAccount: # -- Specifies whether a service account should be created. From 96edaf95bffd703e2f790d892091d0c58a8b76de Mon Sep 17 00:00:00 2001 From: Ed Milic <edmilic@gmail.com> Date: Sat, 11 Apr 2026 15:10:18 -0400 Subject: [PATCH 23/42] fix: rename misleading ScaleToZero test to match actual behavior The test validates that scaleDeploymentToZero stores the current replica count in the desired-replicas annotation before zeroing, not that it handles a nil annotations map. 
--- operator/internal/controller/migration_controller_test.go | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/operator/internal/controller/migration_controller_test.go b/operator/internal/controller/migration_controller_test.go index b5aa702..a2edbf6 100644 --- a/operator/internal/controller/migration_controller_test.go +++ b/operator/internal/controller/migration_controller_test.go @@ -760,12 +760,10 @@ func TestReconcile_JobSucceeded_UpdatesExistingConfigMap(t *testing.T) { } } -func TestReconcile_ScaleToZero_NilAnnotationsMap(t *testing.T) { - // Given: a Deployment with nil Annotations map and replicas > 0. +func TestReconcile_ScaleToZero_StoresDesiredReplicas(t *testing.T) { + // Given: a Deployment with replicas > 0 and no desired-replicas annotation yet. + // scaleDeploymentToZero should store the current replica count before zeroing. dep := newTestDeployment("openfga", "default", "openfga/openfga:v1.14.0", 3) - dep.Annotations = nil - // Re-add the required annotation via a fresh map — but test that scaleDeploymentToZero - // handles nil gracefully by setting it only via the migration-enabled annotation. dep.Annotations = map[string]string{ AnnotationMigrationEnabled: "true", } From 329f05e58ad88c5ed3ee31271e9521fb1d96b9dd Mon Sep 17 00:00:00 2001 From: Ed Milic <edmilic@gmail.com> Date: Sat, 11 Apr 2026 15:14:29 -0400 Subject: [PATCH 24/42] fix: require explicit serviceAccount.name when create=false Falling back to the "default" ServiceAccount would silently grant operator RBAC permissions (Deployment patch, Job create/delete) to a shared SA. Require an explicit name so the user makes a deliberate choice. 
--- charts/openfga-operator/templates/_helpers.tpl | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/charts/openfga-operator/templates/_helpers.tpl b/charts/openfga-operator/templates/_helpers.tpl index 70d6e4c..f63057d 100644 --- a/charts/openfga-operator/templates/_helpers.tpl +++ b/charts/openfga-operator/templates/_helpers.tpl @@ -67,6 +67,6 @@ Create the name of the service account to use {{- if .Values.serviceAccount.create }} {{- default (include "openfga-operator.fullname" .) .Values.serviceAccount.name }} {{- else }} -{{- default "default" .Values.serviceAccount.name }} +{{- required "serviceAccount.name must be set when serviceAccount.create=false" .Values.serviceAccount.name }} {{- end }} {{- end }} From ec643f9e8b4c5514af4a60d57cbf165685cdc6d9 Mon Sep 17 00:00:00 2001 From: Ed Milic <edmilic@gmail.com> Date: Sat, 11 Apr 2026 15:15:29 -0400 Subject: [PATCH 25/42] docs: update chart structure --- docs/adr/004-operator-deployment-model.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/adr/004-operator-deployment-model.md b/docs/adr/004-operator-deployment-model.md index 603bf3a..a5a693c 100644 --- a/docs/adr/004-operator-deployment-model.md +++ b/docs/adr/004-operator-deployment-model.md @@ -66,7 +66,7 @@ The operator will be distributed as a **conditional Helm subchart dependency** o ### Chart Structure -``` +```text helm-charts/ ├── charts/ │ ├── openfga/ # Main chart (existing) @@ -81,8 +81,8 @@ helm-charts/ │ ├── templates/ │ │ ├── deployment.yaml │ │ ├── serviceaccount.yaml -│ │ ├── clusterrole.yaml -│ │ └── clusterrolebinding.yaml +│ │ ├── role.yaml +│ │ └── rolebinding.yaml │ └── crds/ # CRDs added in Stages 2-4 │ ├── fgastore.yaml │ ├── fgamodel.yaml From bd534ca6741f15028d846cd1977ee69aad878651 Mon Sep 17 00:00:00 2001 From: Ed Milic <edmilic@gmail.com> Date: Sat, 11 Apr 2026 15:30:09 -0400 Subject: [PATCH 26/42] fix: handle AlreadyExists on migration Job creation gracefully When concurrent reconciles 
race between the GET and CREATE, the second create returns AlreadyExists. Treat this as benign and requeue to poll the existing Job instead of returning a hard error that produces noisy reconcile failures in the controller logs. --- operator/internal/controller/migration_controller.go | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/operator/internal/controller/migration_controller.go b/operator/internal/controller/migration_controller.go index 75cbf19..de044d6 100644 --- a/operator/internal/controller/migration_controller.go +++ b/operator/internal/controller/migration_controller.go @@ -137,6 +137,11 @@ func (r *MigrationReconciler) Reconcile(ctx context.Context, req ctrl.Request) ( } } if createErr := r.Create(ctx, job); createErr != nil { + if apierrors.IsAlreadyExists(createErr) { + // A concurrent reconcile already created the Job; requeue to pick it up. + logger.V(1).Info("migration job already exists, will recheck", "job", jobName) + return ctrl.Result{RequeueAfter: 5 * time.Second}, nil + } return ctrl.Result{}, fmt.Errorf("creating migration job: %w", createErr) } logger.Info("created migration job", "job", jobName, "version", desiredVersion) From 467b1ea8a72d67916126fccf302356d768bd63eb Mon Sep 17 00:00:00 2001 From: Ed Milic <edmilic@gmail.com> Date: Sat, 11 Apr 2026 15:57:37 -0400 Subject: [PATCH 27/42] fix: harden openfga-operator chart security and quality defaults Address review findings: add allowPrivilegeEscalation: false for restricted PSS compliance, set default resource requests/limits, add values.schema.json validation, use stable selectorLabels on pod template to prevent spurious rollouts, add .helmignore, and improve Chart.yaml metadata and NOTES.txt with migration commands. 
--- charts/openfga-operator/.helmignore | 18 ++++ charts/openfga-operator/Chart.yaml | 6 ++ charts/openfga-operator/templates/NOTES.txt | 6 ++ .../templates/deployment.yaml | 2 +- charts/openfga-operator/values.schema.json | 100 ++++++++++++++++++ charts/openfga-operator/values.yaml | 13 +-- .../controller/migration_controller.go | 8 +- 7 files changed, 144 insertions(+), 9 deletions(-) create mode 100644 charts/openfga-operator/.helmignore create mode 100644 charts/openfga-operator/values.schema.json diff --git a/charts/openfga-operator/.helmignore b/charts/openfga-operator/.helmignore new file mode 100644 index 0000000..edf9e7e --- /dev/null +++ b/charts/openfga-operator/.helmignore @@ -0,0 +1,18 @@ +# Patterns to ignore when building packages. +.DS_Store +.git +.gitignore +.bzr +.bzrignore +.hg +.hgignore +.svn +*.swp +*.bak +*.tmp +*.orig +*~ +.project +.idea +*.tmproj +.vscode diff --git a/charts/openfga-operator/Chart.yaml b/charts/openfga-operator/Chart.yaml index 1bdacb0..95da06b 100644 --- a/charts/openfga-operator/Chart.yaml +++ b/charts/openfga-operator/Chart.yaml @@ -9,5 +9,11 @@ appVersion: "0.1.0" home: "https://openfga.github.io/helm-charts" icon: https://github.com/openfga/community/raw/main/brand-assets/icon/color/openfga-icon-color.svg +maintainers: + - name: OpenFGA Authors + url: https://github.com/openfga +sources: + - https://github.com/openfga/helm-charts + annotations: artifacthub.io/license: Apache-2.0 diff --git a/charts/openfga-operator/templates/NOTES.txt b/charts/openfga-operator/templates/NOTES.txt index c2e09f9..bcfcf80 100644 --- a/charts/openfga-operator/templates/NOTES.txt +++ b/charts/openfga-operator/templates/NOTES.txt @@ -8,3 +8,9 @@ To check operator status: To view operator logs: kubectl logs --namespace {{ include "openfga-operator.namespace" . }} -l "app.kubernetes.io/name={{ include "openfga-operator.name" . }}" + +To check migration status: + kubectl get configmap -n {{ include "openfga-operator.namespace" . 
}} -l app.kubernetes.io/managed-by=openfga-operator + +To inspect migration jobs: + kubectl get jobs -n {{ include "openfga-operator.namespace" . }} -l app.kubernetes.io/part-of=openfga,app.kubernetes.io/component=migration diff --git a/charts/openfga-operator/templates/deployment.yaml b/charts/openfga-operator/templates/deployment.yaml index 4ef3b80..5b83a6b 100644 --- a/charts/openfga-operator/templates/deployment.yaml +++ b/charts/openfga-operator/templates/deployment.yaml @@ -17,7 +17,7 @@ spec: {{- toYaml . | nindent 8 }} {{- end }} labels: - {{- include "openfga-operator.labels" . | nindent 8 }} + {{- include "openfga-operator.selectorLabels" . | nindent 8 }} spec: {{- with .Values.imagePullSecrets }} imagePullSecrets: diff --git a/charts/openfga-operator/values.schema.json b/charts/openfga-operator/values.schema.json new file mode 100644 index 0000000..470b265 --- /dev/null +++ b/charts/openfga-operator/values.schema.json @@ -0,0 +1,100 @@ +{ + "$schema": "https://json-schema.org/draft/2020-12/schema", + "type": "object", + "properties": { + "replicaCount": { + "type": "integer", + "minimum": 1 + }, + "image": { + "type": "object", + "properties": { + "repository": { + "type": "string", + "minLength": 1 + }, + "pullPolicy": { + "type": "string", + "enum": ["Always", "IfNotPresent", "Never"] + }, + "tag": { + "type": "string" + } + }, + "required": ["repository"] + }, + "imagePullSecrets": { + "type": "array", + "items": { + "type": "object", + "properties": { + "name": { "type": "string" } + }, + "required": ["name"] + } + }, + "nameOverride": { "type": "string" }, + "fullnameOverride": { "type": "string" }, + "namespaceOverride": { "type": "string" }, + "serviceAccount": { + "type": "object", + "properties": { + "create": { "type": "boolean" }, + "annotations": { "type": "object" }, + "name": { "type": "string" } + } + }, + "podAnnotations": { "type": "object" }, + "podSecurityContext": { "type": "object" }, + "securityContext": { "type": "object" }, + 
"watchNamespace": { "type": "string" }, + "leaderElection": { + "type": "object", + "properties": { + "enabled": { "type": "boolean" } + } + }, + "migrationJob": { + "type": "object", + "properties": { + "backoffLimit": { + "type": "integer", + "minimum": 0 + }, + "activeDeadlineSeconds": { + "type": "integer", + "minimum": 1 + }, + "ttlSecondsAfterFinished": { + "type": "integer", + "minimum": 0 + } + } + }, + "resources": { "type": "object" }, + "podDisruptionBudget": { + "type": "object", + "properties": { + "enabled": { "type": "boolean" }, + "minAvailable": { + "oneOf": [ + { "type": "string" }, + { "type": "integer", "minimum": 0 } + ] + }, + "maxUnavailable": { + "oneOf": [ + { "type": "string" }, + { "type": "integer", "minimum": 0 } + ] + } + } + }, + "nodeSelector": { "type": "object" }, + "tolerations": { + "type": "array", + "items": { "type": "object" } + }, + "affinity": { "type": "object" } + } +} diff --git a/charts/openfga-operator/values.yaml b/charts/openfga-operator/values.yaml index 6a4c165..db2747b 100644 --- a/charts/openfga-operator/values.yaml +++ b/charts/openfga-operator/values.yaml @@ -30,6 +30,7 @@ podSecurityContext: type: RuntimeDefault securityContext: + allowPrivilegeEscalation: false capabilities: drop: - ALL @@ -53,12 +54,12 @@ migrationJob: # -- Seconds to keep completed/failed Job pods for log inspection before garbage collection. ttlSecondsAfterFinished: 300 -resources: {} - # requests: - # cpu: 10m - # memory: 64Mi - # limits: - # memory: 128Mi +resources: + requests: + cpu: 10m + memory: 64Mi + limits: + memory: 128Mi podDisruptionBudget: # -- Enable a PodDisruptionBudget for the operator. 
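As a quick sanity check, the enum and minimum constraints in the schema above can be mirrored outside Helm. A minimal Go sketch (hypothetical helpers, not part of the chart or operator) of the same two rules:

```go
package main

import "fmt"

// validatePullPolicy mirrors the image.pullPolicy enum constraint from
// values.schema.json. Kubernetes pull policies are case-sensitive.
func validatePullPolicy(p string) bool {
	switch p {
	case "Always", "IfNotPresent", "Never":
		return true
	}
	return false
}

// validateReplicaCount mirrors the replicaCount "minimum: 1" constraint.
func validateReplicaCount(n int) bool { return n >= 1 }

func main() {
	fmt.Println(validatePullPolicy("IfNotPresent")) // true
	fmt.Println(validatePullPolicy("ifnotpresent")) // false: enum is case-sensitive
	fmt.Println(validateReplicaCount(0))            // false
}
```

Helm itself enforces `values.schema.json` during `helm install`, `helm upgrade`, `helm lint`, and `helm template`; the sketch only illustrates what these two constraints mean.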
diff --git a/operator/internal/controller/migration_controller.go b/operator/internal/controller/migration_controller.go index de044d6..2fea3b4 100644 --- a/operator/internal/controller/migration_controller.go +++ b/operator/internal/controller/migration_controller.go @@ -48,7 +48,7 @@ func (r *MigrationReconciler) Reconcile(ctx context.Context, req ctrl.Request) ( } // 2. Skip if migration is not opted-in via annotation. - if deployment.Annotations[AnnotationMigrationEnabled] != "true" { + if len(deployment.Annotations) == 0 || deployment.Annotations[AnnotationMigrationEnabled] != "true" { logger.V(1).Info("migration not enabled for this deployment, skipping") return ctrl.Result{}, nil } @@ -61,7 +61,7 @@ func (r *MigrationReconciler) Reconcile(ctx context.Context, req ctrl.Request) ( } desiredVersion := extractImageTag(mainContainer.Image) - // 3. Skip migration for memory datastore — just ensure the Deployment is scaled up. + // 3b. Skip migration for memory datastore — just ensure the Deployment is scaled up. if isMemoryDatastore(mainContainer) { logger.V(1).Info("memory datastore detected, skipping migration") if _, scaleErr := ensureDeploymentScaled(ctx, r.Client, deployment); scaleErr != nil { @@ -252,6 +252,10 @@ func isJobConditionTrue(job *batchv1.Job, conditionType batchv1.JobConditionType // isMemoryDatastore checks if the Deployment is using the memory datastore // (no database migration needed). +// +// NOTE: This only inspects explicit env vars on the container spec. If +// OPENFGA_DATASTORE_ENGINE is injected via envFrom (ConfigMap/Secret), it +// will not be detected here and the operator will attempt a migration. 
func isMemoryDatastore(container *corev1.Container) bool { for _, env := range container.Env { if env.Name == "OPENFGA_DATASTORE_ENGINE" { From dfe31808a252e79596fab5fde2f1c91973aa28e3 Mon Sep 17 00:00:00 2001 From: Ed Milic <edmilic@gmail.com> Date: Sat, 11 Apr 2026 16:56:43 -0400 Subject: [PATCH 28/42] fix: replace scale-to-zero with lookup-based zero-downtime upgrades MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The operator was scaling Deployments to 0 replicas during every migration, causing a full outage on every helm upgrade — a regression from the existing rolling update behavior. OpenFGA already gates readiness on schema version (MinimumSupportedDatastoreSchemaRevision in sqlcommon.IsReady), so new pods naturally block until migration completes while old pods keep serving. Use Helm's lookup function to preserve the live replica count on upgrade (falling back to replicas: 0 on fresh install where no Deployment exists). Remove scaleDeploymentToZero from the operator reconcile loop. Update ADR-002 to document the rationale and the readiness gate dependency. --- charts/openfga/templates/deployment.yaml | 20 +++++ charts/openfga/tests/operator_mode_test.yaml | 5 +- docs/adr/002-operator-managed-migrations.md | 77 +++++++++++++------ operator/internal/controller/helpers.go | 30 -------- .../controller/migration_controller.go | 11 +-- .../controller/migration_controller_test.go | 22 +++--- 6 files changed, 89 insertions(+), 76 deletions(-) diff --git a/charts/openfga/templates/deployment.yaml b/charts/openfga/templates/deployment.yaml index a70c93d..a0da306 100644 --- a/charts/openfga/templates/deployment.yaml +++ b/charts/openfga/templates/deployment.yaml @@ -23,7 +23,27 @@ spec: {{- if .Values.autoscaling.enabled }} {{- fail "operator.enabled and autoscaling.enabled cannot both be true" }} {{- end }} + {{- /* On upgrade, preserve live replica count so existing pods keep serving (zero-downtime). 
+ On fresh install, lookup returns empty — fall back to 0 so no pods start before migration. + See ADR-002 for rationale. */ -}} + {{- $existing := (lookup "apps/v1" "Deployment" (include "openfga.namespace" .) (include "openfga.fullname" .)) }} + {{- $desiredImage := printf "%s:%s" .Values.image.repository (.Values.image.tag | default .Chart.AppVersion) }} + {{- $currentImage := "" }} + {{- range (($existing).spec).template.spec.containers }} + {{- if eq .name "openfga" }} + {{- $currentImage = .image }} + {{- end }} + {{- end }} + {{- if and $existing (eq $currentImage $desiredImage) }} + replicas: {{ $existing.spec.replicas }} + {{- else if $existing }} + {{- /* Image changed — preserve replicas; OpenFGA's built-in schema version check + (MinimumSupportedDatastoreSchemaRevision) causes readiness to fail until + migration completes, so old pods keep serving while new pods wait. */ -}} + replicas: {{ $existing.spec.replicas }} + {{- else }} replicas: {{ ternary 1 0 (eq .Values.datastore.engine "memory") }} + {{- end }} {{- else if not .Values.autoscaling.enabled }} replicas: {{ ternary 1 .Values.replicaCount (eq .Values.datastore.engine "memory")}} {{- end }} diff --git a/charts/openfga/tests/operator_mode_test.yaml b/charts/openfga/tests/operator_mode_test.yaml index 31d06c1..2e50915 100644 --- a/charts/openfga/tests/operator_mode_test.yaml +++ b/charts/openfga/tests/operator_mode_test.yaml @@ -64,7 +64,10 @@ tests: path: metadata.annotations["openfga.dev/migration-service-account"] # --- Replica count --- - - it: should set replicas to 0 when operator is enabled with database datastore + # When no live cluster is available (helm template / test), lookup returns empty, + # so the template falls back to replicas: 0 (fresh install behavior). + # On a real cluster, lookup preserves the existing replica count for zero-downtime upgrades. 
+ - it: should set replicas to 0 on fresh install when operator is enabled with database datastore set: operator.enabled: true migration.enabled: true diff --git a/docs/adr/002-operator-managed-migrations.md b/docs/adr/002-operator-managed-migrations.md index 9b86d9d..ad53799 100644 --- a/docs/adr/002-operator-managed-migrations.md +++ b/docs/adr/002-operator-managed-migrations.md @@ -74,30 +74,37 @@ Replace the Helm hook migration Job and `k8s-wait-for` init container with **ope The operator runs a **migration controller** that reconciles the OpenFGA Deployment: ``` -┌────────────────────────────────────────────────────────┐ -│ Operator Reconciliation │ -│ │ -│ 1. Read Deployment → extract image tag (e.g. v1.14.0) │ -│ 2. Read ConfigMap/openfga-migration-status │ -│ └── "Last migrated version: v1.13.0" │ -│ 3. Versions differ → migration needed │ -│ 4. Create Job/openfga-migrate │ -│ ├── ServiceAccount: openfga-migrator (DDL perms) │ -│ ├── Image: openfga/openfga:v1.14.0 │ -│ ├── Args: ["migrate"] │ -│ └── ttlSecondsAfterFinished: 300 │ -│ 5. Watch Job until succeeded │ -│ 6. Update ConfigMap → "version: v1.14.0" │ -│ 7. Scale Deployment replicas: 0 → 3 │ -│ 8. OpenFGA pods start, serve requests │ -└────────────────────────────────────────────────────────┘ +┌──────────────────────────────────────────────────────────┐ +│ Operator Reconciliation │ +│ │ +│ 1. Read Deployment → extract image tag (e.g. v1.14.0) │ +│ 2. Read ConfigMap/openfga-migration-status │ +│ └── "Last migrated version: v1.13.0" │ +│ 3. Versions differ → migration needed │ +│ 4. Create Job/openfga-migrate │ +│ ├── ServiceAccount: openfga-migrator (DDL perms) │ +│ ├── Image: openfga/openfga:v1.14.0 │ +│ ├── Args: ["migrate"] │ +│ └── ttlSecondsAfterFinished: 300 │ +│ 5. Watch Job until succeeded │ +│ 6. Update ConfigMap → "version: v1.14.0" │ +│ 7. Ensure Deployment at desired replicas │ +│ (fresh install: 0 → N; upgrade: already running) │ +│ 8. 
New pods pass readiness, serve requests │ +└──────────────────────────────────────────────────────────┘ ``` **Key design decisions within this approach:** -#### Deployment starts at replicas: 0 +#### Zero-downtime upgrades via lookup and readiness gating -The Helm chart renders the Deployment with `replicas: 0` when `operator.enabled: true`. The operator scales it up only after migration succeeds. This is simpler than readiness gates or admission webhooks, and ensures no pods run against an unmigrated schema. +On **fresh install**, the Helm chart renders the Deployment with `replicas: 0` (no existing Deployment found via `lookup`). The operator runs the migration Job and scales the Deployment to the desired replica count afterward. + +On **upgrade**, the chart uses Helm's `lookup` function to read the current replica count from the live Deployment and preserves it. Kubernetes starts a rolling update with the new image. OpenFGA has a **built-in schema version gate**: on startup, each instance calls `IsReady()` which checks the database schema revision against `MinimumSupportedDatastoreSchemaRevision` (via goose). If the schema is behind, the gRPC health endpoint returns `NOT_SERVING`, the readiness probe fails, and Kubernetes does not route traffic to the pod. Old pods continue serving on the migrated schema (OpenFGA migrations are additive/backward-compatible — this is how the existing Helm hook flow has operated for years with rolling updates). Once the operator's migration Job completes, new pods pass readiness and the rolling update proceeds. + +This matches the existing zero-downtime behavior of the non-operator chart. The previous approach (always starting at `replicas: 0`) introduced a full outage on every `helm upgrade` — even for config-only changes — which was a regression from the existing rolling update model. 
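The replica-selection rule described above collapses to a small decision function. A sketch in Go (names are illustrative, not the chart's actual template logic), using a negative count to encode "`lookup` found no live Deployment":

```go
package main

import "fmt"

// desiredReplicas models the chart's replica logic under operator mode:
// upgrade (live Deployment found): preserve its replica count;
// fresh install: 0 for database datastores (the operator scales up after
// the migration Job succeeds), 1 for the memory datastore (no migration).
func desiredReplicas(existingReplicas int, memoryDatastore bool) int {
	if existingReplicas >= 0 { // lookup found a live Deployment
		return existingReplicas
	}
	if memoryDatastore {
		return 1
	}
	return 0
}

func main() {
	fmt.Println(desiredReplicas(3, false))  // upgrade: preserve 3
	fmt.Println(desiredReplicas(-1, false)) // fresh install, database: 0
	fmt.Println(desiredReplicas(-1, true))  // fresh install, memory: 1
}
```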
+ +**`lookup` caveat:** `helm template` and `--dry-run=client` cannot query the cluster, so `lookup` returns empty and the template falls back to `replicas: 0`. This is correct for CI rendering (no live cluster) and does not affect real installs/upgrades. `--dry-run=server` works correctly. #### Version tracking via ConfigMap @@ -142,13 +149,13 @@ helm install Problems: ArgoCD skips step 4. FluxCD deletes Job in step 4. `--wait` deadlocks between steps 2 and 4. -**After (operator-managed):** +**After (operator-managed, fresh install):** ``` helm install ├── Create ServiceAccount (runtime), ServiceAccount (migrator) ├── Create Secret, Service - ├── Create Deployment (replicas: 0, no init containers) + ├── Create Deployment (replicas: 0 via lookup fallback, no init containers) ├── Create Operator Deployment └── [Helm is done — all resources are regular, no hooks] @@ -159,10 +166,30 @@ Operator starts: │ └── Uses openfga-migrator ServiceAccount │ └── Runs openfga migrate → succeeds ├── Creates ConfigMap with migrated version - └── Scales Deployment to 3 replicas → pods start + └── Scales Deployment 0 → 3 replicas → pods start +``` + +**After (operator-managed, upgrade with new image):** + +``` +helm upgrade + ├── lookup finds existing Deployment at 3 replicas → preserves replicas: 3 + ├── Patches Deployment with new image tag + ├── Kubernetes starts rolling update + │ ├── New pods (v1.14) start → schema is behind → + │ │ readiness fails (gRPC NOT_SERVING) → no traffic routed + │ └── Old pods (v1.13) continue serving traffic + └── [Helm is done] + +Operator reconciles: + ├── Detects image version differs from ConfigMap + ├── Creates Job/openfga-migrate → runs migration + ├── Updates ConfigMap → "version: v1.14.0" + └── New pods pass readiness → rolling update completes + (operator does NOT scale to zero — zero downtime) ``` -No hooks. No init containers. No `k8s-wait-for`. All resources are regular Kubernetes objects. +No hooks. No init containers. No `k8s-wait-for`. 
No downtime on upgrade. All resources are regular Kubernetes objects. ### What Changes in the Helm Chart @@ -206,10 +233,10 @@ When `operator.enabled: false`, the chart falls back to the current behavior — ### Negative - **Operator is a new runtime dependency** — if the operator pod is unavailable, migrations don't run (but existing running pods are unaffected) -- **Replica scaling model** — starting at `replicas: 0` means a brief period where the Deployment exists but has no pods; monitoring tools may flag this +- **`lookup` limitation** — `helm template` and `--dry-run=client` cannot query the cluster; the template falls back to `replicas: 0` in these contexts. This does not affect real installs/upgrades. - **Two upgrade paths to document** — `operator.enabled: true` (new) vs `operator.enabled: false` (legacy) ### Risks -- **Zero-downtime upgrades** — the initial implementation scales to 0 during migration, causing brief downtime. A future enhancement can support rolling upgrades where the new schema is backward-compatible, but this is explicitly out of scope for Stage 1. +- **Readiness gate relies on OpenFGA's built-in schema check** — the zero-downtime upgrade model depends on `MinimumSupportedDatastoreSchemaRevision` in `pkg/storage/sqlcommon/sqlcommon.go` causing `NOT_SERVING` when the schema is behind. If a future OpenFGA release removes or weakens this check, new pods could serve traffic against an unmigrated schema. This coupling should be documented and monitored across OpenFGA releases. - **ConfigMap as state store** — if the ConfigMap is accidentally deleted, the operator re-runs migration (which is safe — `openfga migrate` is idempotent). This is a feature, not a bug, but should be documented. 
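The version-tracking decision at the heart of the reconcile loop reduces to comparing the Deployment's image tag with the last migrated version recorded in the ConfigMap. A simplified Go sketch (helper names are illustrative; it ignores digest-pinned images, which the operator's actual code handles via annotations):

```go
package main

import (
	"fmt"
	"strings"
)

// extractImageTag returns the tag portion of an image reference,
// defaulting to "latest" when no tag is present. Simplified: it does
// not handle digest references (image@sha256:...).
func extractImageTag(image string) string {
	if i := strings.LastIndex(image, ":"); i >= 0 && !strings.Contains(image[i:], "/") {
		return image[i+1:]
	}
	return "latest" // no tag, or the last ":" belongs to a registry port
}

// migrationNeeded compares the desired version (from the Deployment's
// image) with the last migrated version recorded in the ConfigMap.
func migrationNeeded(image, lastMigrated string) bool {
	return extractImageTag(image) != lastMigrated
}

func main() {
	fmt.Println(migrationNeeded("openfga/openfga:v1.14.0", "v1.13.0")) // true
	fmt.Println(migrationNeeded("openfga/openfga:v1.14.0", "v1.14.0")) // false
}
```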
diff --git a/operator/internal/controller/helpers.go b/operator/internal/controller/helpers.go index 794210b..7eced33 100644 --- a/operator/internal/controller/helpers.go +++ b/operator/internal/controller/helpers.go @@ -264,33 +264,3 @@ func ensureDeploymentScaled(ctx context.Context, c client.Client, deployment *ap return false, nil } -// scaleDeploymentToZero scales the Deployment to 0 replicas, storing the current -// desired count in an annotation so it can be restored later. -func scaleDeploymentToZero(ctx context.Context, c client.Client, deployment *appsv1.Deployment) error { - if deployment.Spec.Replicas != nil && *deployment.Spec.Replicas == 0 { - return nil // Already at zero. - } - - patch := client.MergeFrom(deployment.DeepCopy()) - - // Store the current desired replica count before zeroing. - currentReplicas := int32(1) - if deployment.Spec.Replicas != nil { - currentReplicas = *deployment.Spec.Replicas - } - - // Only store if not already stored (avoid overwriting with 0 on re-reconciliation). - if _, ok := deployment.Annotations[AnnotationDesiredReplicas]; !ok { - if deployment.Annotations == nil { - deployment.Annotations = make(map[string]string) - } - deployment.Annotations[AnnotationDesiredReplicas] = strconv.FormatInt(int64(currentReplicas), 10) - } - - deployment.Spec.Replicas = ptr.To(int32(0)) - - if err := c.Patch(ctx, deployment, patch); err != nil { - return fmt.Errorf("scaling deployment to 0: %w", err) - } - return nil -} diff --git a/operator/internal/controller/migration_controller.go b/operator/internal/controller/migration_controller.go index 2fea3b4..6152df2 100644 --- a/operator/internal/controller/migration_controller.go +++ b/operator/internal/controller/migration_controller.go @@ -98,12 +98,7 @@ func (r *MigrationReconciler) Reconcile(ctx context.Context, req ctrl.Request) ( logger.Info("migration needed", "currentVersion", currentVersion, "desiredVersion", desiredVersion) - // 6. 
Ensure the Deployment is scaled to zero before migrating. - if err := scaleDeploymentToZero(ctx, r.Client, deployment); err != nil { - return ctrl.Result{}, err - } - - // 7. Check retry-after annotation to honor backoff cooldown. + // 6. Check retry-after annotation to honor backoff cooldown. if retryAfter, ok := deployment.Annotations[AnnotationRetryAfter]; ok { retryTime, parseErr := time.Parse(time.RFC3339, retryAfter) if parseErr == nil && time.Now().Before(retryTime) { @@ -113,7 +108,7 @@ func (r *MigrationReconciler) Reconcile(ctx context.Context, req ctrl.Request) ( } } - // 8. Check if a migration Job already exists. + // 7. Check if a migration Job already exists. jobName := migrationJobName(req.Name) job := &batchv1.Job{} err = r.Get(ctx, types.NamespacedName{Name: jobName, Namespace: req.Namespace}, job) @@ -150,7 +145,7 @@ func (r *MigrationReconciler) Reconcile(ctx context.Context, req ctrl.Request) ( return ctrl.Result{}, fmt.Errorf("getting migration job: %w", err) } - // 8b. If the existing Job is for a different version, delete it and recreate. + // 8. If the existing Job is for a different version, delete it and recreate. // Check annotation first (supports digests > 63 chars), fall back to label. jobVersion := job.Annotations["openfga.dev/desired-version"] versionMatch := jobVersion == desiredVersion diff --git a/operator/internal/controller/migration_controller_test.go b/operator/internal/controller/migration_controller_test.go index a2edbf6..a4d62a2 100644 --- a/operator/internal/controller/migration_controller_test.go +++ b/operator/internal/controller/migration_controller_test.go @@ -326,7 +326,7 @@ func TestReconcile_JobFailed_SetsRetryAnnotationAndRequeues(t *testing.T) { t.Errorf("expected 60s requeue, got %v", result.RequeueAfter) } - // Verify Deployment was NOT scaled up — still at 0. + // Verify Deployment replicas unchanged (still at 0 from fresh install). 
updated := &appsv1.Deployment{} if getErr := r.Get(context.Background(), types.NamespacedName{ Name: "openfga", Namespace: "default", @@ -760,18 +760,19 @@ func TestReconcile_JobSucceeded_UpdatesExistingConfigMap(t *testing.T) { } } -func TestReconcile_ScaleToZero_StoresDesiredReplicas(t *testing.T) { - // Given: a Deployment with replicas > 0 and no desired-replicas annotation yet. - // scaleDeploymentToZero should store the current replica count before zeroing. +func TestReconcile_MigrationNeeded_DoesNotScaleToZero(t *testing.T) { + // Given: a Deployment with replicas > 0 and no migration-status ConfigMap. + // The operator should create the migration Job WITHOUT scaling to zero, + // relying on OpenFGA's built-in schema version check to gate readiness. dep := newTestDeployment("openfga", "default", "openfga/openfga:v1.14.0", 3) dep.Annotations = map[string]string{ AnnotationMigrationEnabled: "true", + AnnotationDesiredReplicas: "3", } r := newReconciler(dep) - // When: reconciling — this will call scaleDeploymentToZero which must handle - // the case where AnnotationDesiredReplicas is not yet set. + // When: reconciling. result, err := r.Reconcile(context.Background(), ctrl.Request{ NamespacedName: types.NamespacedName{Name: "openfga", Namespace: "default"}, }) @@ -784,18 +785,15 @@ func TestReconcile_ScaleToZero_StoresDesiredReplicas(t *testing.T) { t.Error("expected requeue after creating job") } - // Verify Deployment was scaled to 0 and desired-replicas annotation was preserved. + // Verify Deployment replicas were NOT changed — pods keep running during migration. 
updated := &appsv1.Deployment{} if getErr := r.Get(context.Background(), types.NamespacedName{ Name: "openfga", Namespace: "default", }, updated); getErr != nil { t.Fatalf("getting deployment: %v", getErr) } - if *updated.Spec.Replicas != 0 { - t.Errorf("expected 0 replicas, got %d", *updated.Spec.Replicas) - } - if updated.Annotations[AnnotationDesiredReplicas] != "3" { - t.Errorf("expected desired-replicas=3, got %s", updated.Annotations[AnnotationDesiredReplicas]) + if *updated.Spec.Replicas != 3 { + t.Errorf("expected replicas to remain at 3, got %d", *updated.Spec.Replicas) } } From 4d46e90dfbc54fce9d790a8644553c6d6e93f51e Mon Sep 17 00:00:00 2001 From: Ed Milic <edmilic@gmail.com> Date: Sat, 18 Apr 2026 16:31:17 -0400 Subject: [PATCH 29/42] refactor(operator): resolve container via annotation and tidy deployment template - findOpenFGAContainer now reads the openfga.dev/container-name annotation emitted by the chart, and returns an error when the target container is missing instead of silently falling back to the first container in the pod spec. - migration_controller surfaces that error to the reconciler instead of logging and skipping, so misconfigured Deployments are visible. - deployment.yaml emits the new container-name annotation, collapses the replica-preservation logic to a single branch (both previous branches already preserved existing replicas), and uses selectorLabels on the pod template to avoid chart-version churn in pod labels across upgrades. - values.yaml documents the openfga-operator subchart values passthrough and clarifies migration service account behavior. 
--- charts/openfga/templates/deployment.yaml | 22 ++++----------- charts/openfga/values.yaml | 20 +++++++++++++ operator/internal/controller/helpers.go | 28 ++++++++++++------- .../controller/migration_controller.go | 10 +++---- 4 files changed, 48 insertions(+), 32 deletions(-) diff --git a/charts/openfga/templates/deployment.yaml b/charts/openfga/templates/deployment.yaml index a0da306..e3d8601 100644 --- a/charts/openfga/templates/deployment.yaml +++ b/charts/openfga/templates/deployment.yaml @@ -9,6 +9,7 @@ metadata: annotations: {{- if $hasOperatorAnnotations }} openfga.dev/migration-enabled: "true" + openfga.dev/container-name: "{{ .Chart.Name }}" openfga.dev/desired-replicas: '{{ ternary 1 .Values.replicaCount (eq .Values.datastore.engine "memory") }}' {{- if or .Values.migration.serviceAccount.create .Values.migration.serviceAccount.name }} openfga.dev/migration-service-account: '{{ include "openfga.migrationServiceAccountName" . }}' @@ -23,23 +24,10 @@ spec: {{- if .Values.autoscaling.enabled }} {{- fail "operator.enabled and autoscaling.enabled cannot both be true" }} {{- end }} - {{- /* On upgrade, preserve live replica count so existing pods keep serving (zero-downtime). - On fresh install, lookup returns empty — fall back to 0 so no pods start before migration. - See ADR-002 for rationale. */ -}} + {{- /* On upgrade: preserve live replicas (zero-downtime). On fresh install: lookup returns empty, fall back to 0. + OpenFGA gates readiness on MinimumSupportedDatastoreSchemaRevision — see ADR-002. */ -}} {{- $existing := (lookup "apps/v1" "Deployment" (include "openfga.namespace" .) 
(include "openfga.fullname" .)) }} - {{- $desiredImage := printf "%s:%s" .Values.image.repository (.Values.image.tag | default .Chart.AppVersion) }} - {{- $currentImage := "" }} - {{- range (($existing).spec).template.spec.containers }} - {{- if eq .name "openfga" }} - {{- $currentImage = .image }} - {{- end }} - {{- end }} - {{- if and $existing (eq $currentImage $desiredImage) }} - replicas: {{ $existing.spec.replicas }} - {{- else if $existing }} - {{- /* Image changed — preserve replicas; OpenFGA's built-in schema version check - (MinimumSupportedDatastoreSchemaRevision) causes readiness to fail until - migration completes, so old pods keep serving while new pods wait. */ -}} + {{- if and $existing (hasKey ($existing) "spec") }} replicas: {{ $existing.spec.replicas }} {{- else }} replicas: {{ ternary 1 0 (eq .Values.datastore.engine "memory") }} @@ -60,7 +48,7 @@ spec: prometheus.io/path: /metrics prometheus.io/port: "{{ (split ":" .Values.telemetry.metrics.addr)._1 }}" labels: - {{- include "openfga.labels" . | nindent 8 }} + {{- include "openfga.selectorLabels" . | nindent 8 }} {{- with .Values.podExtraLabels }} {{- toYaml . | nindent 8 }} {{- end }} diff --git a/charts/openfga/values.yaml b/charts/openfga/values.yaml index 24723c2..fcca5b0 100644 --- a/charts/openfga/values.yaml +++ b/charts/openfga/values.yaml @@ -391,6 +391,23 @@ extraObjects: [] operator: enabled: false +# -- Values passed to the openfga-operator subchart (when operator.enabled is true). +# See charts/openfga-operator/values.yaml for all available options. +openfga-operator: {} + # migrationJob: + # backoffLimit: 3 + # activeDeadlineSeconds: 300 + # ttlSecondsAfterFinished: 300 + # leaderElection: + # enabled: true + # watchNamespace: "" + # resources: + # requests: + # cpu: 10m + # memory: 64Mi + # limits: + # memory: 128Mi + # -- migration controls operator-driven migration behavior. # Only used when operator.enabled is true. 
migration: @@ -398,8 +415,11 @@ migration: enabled: true serviceAccount: # -- Create a dedicated service account for migration Jobs. + # The migration Job inherits env vars (including secretKeyRef) from the OpenFGA container. + # Secret-backed env vars are injected by the kubelet, so this service account needs no RBAC on the secret; the secret only needs to exist in the Job's namespace. create: true # -- Annotations to add to the migration service account. + # Use this to attach cloud IAM roles (e.g., eks.amazonaws.com/role-arn) for DDL permissions. annotations: {} # -- The name of the migration service account. # If not set and create is true, defaults to {fullname}-migration. diff --git a/operator/internal/controller/helpers.go b/operator/internal/controller/helpers.go index 7eced33..da1c717 100644 --- a/operator/internal/controller/helpers.go +++ b/operator/internal/controller/helpers.go @@ -25,6 +25,7 @@ const ( // Annotations set on the Deployment by the Helm chart / operator. AnnotationMigrationEnabled = "openfga.dev/migration-enabled" + AnnotationContainerName = "openfga.dev/container-name" AnnotationDesiredReplicas = "openfga.dev/desired-replicas" AnnotationMigrationServiceAccount = "openfga.dev/migration-service-account" AnnotationRetryAfter = "openfga.dev/migration-retry-after" @@ -71,18 +72,25 @@ func migrationJobName(deploymentName string) string { } // findOpenFGAContainer finds the OpenFGA container in the Deployment's pod spec. -// It looks for a container named "openfga" first, then falls back to the first container. -func findOpenFGAContainer(deployment *appsv1.Deployment) *corev1.Container { - for i := range deployment.Spec.Template.Spec.Containers { - if deployment.Spec.Template.Spec.Containers[i].Name == "openfga" { - return &deployment.Spec.Template.Spec.Containers[i] - } +// It checks the openfga.dev/container-name annotation first, then looks for a +// container named "openfga". Returns an error if no containers exist or the +// target container is not found. 
+func findOpenFGAContainer(deployment *appsv1.Deployment) (*corev1.Container, error) { + containers := deployment.Spec.Template.Spec.Containers + if len(containers) == 0 { + return nil, fmt.Errorf("deployment %s/%s has no containers", deployment.Namespace, deployment.Name) } - // Fallback: use the first container (for charts that don't name it "openfga"). - if len(deployment.Spec.Template.Spec.Containers) > 0 { - return &deployment.Spec.Template.Spec.Containers[0] + + targetName := deployment.Annotations[AnnotationContainerName] + if targetName == "" { + targetName = "openfga" } - return nil + for i := range containers { + if containers[i].Name == targetName { + return &containers[i], nil + } + } + return nil, fmt.Errorf("container %q not found in deployment %s/%s", targetName, deployment.Namespace, deployment.Name) } // buildMigrationJob constructs a migration Job for the given Deployment. diff --git a/operator/internal/controller/migration_controller.go b/operator/internal/controller/migration_controller.go index 6152df2..2812a6f 100644 --- a/operator/internal/controller/migration_controller.go +++ b/operator/internal/controller/migration_controller.go @@ -54,10 +54,10 @@ func (r *MigrationReconciler) Reconcile(ctx context.Context, req ctrl.Request) ( } // 3. Find the OpenFGA container and extract the desired version. - mainContainer := findOpenFGAContainer(deployment) - if mainContainer == nil { - logger.Info("deployment has no containers, skipping") - return ctrl.Result{}, nil + mainContainer, err := findOpenFGAContainer(deployment) + if err != nil { + logger.Error(err, "unable to find OpenFGA container") + return ctrl.Result{}, err } desiredVersion := extractImageTag(mainContainer.Image) @@ -73,7 +73,7 @@ func (r *MigrationReconciler) Reconcile(ctx context.Context, req ctrl.Request) ( // 4. Check current migration status from ConfigMap. 
configMap := &corev1.ConfigMap{} cmName := migrationConfigMapName(req.Name) - err := r.Get(ctx, types.NamespacedName{Name: cmName, Namespace: req.Namespace}, configMap) + err = r.Get(ctx, types.NamespacedName{Name: cmName, Namespace: req.Namespace}, configMap) currentVersion := "" if err == nil { From 022a8f47ee7bbb283aefd4957b1287dad21e4230 Mon Sep 17 00:00:00 2001 From: Ed Milic <edmilic@gmail.com> Date: Sun, 19 Apr 2026 05:23:22 -0400 Subject: [PATCH 30/42] chore: remove ADRs not relevant to this PR --- .../003-declarative-store-lifecycle-crds.md | 199 ------------------ docs/adr/004-operator-deployment-model.md | 164 --------------- 2 files changed, 363 deletions(-) delete mode 100644 docs/adr/003-declarative-store-lifecycle-crds.md delete mode 100644 docs/adr/004-operator-deployment-model.md diff --git a/docs/adr/003-declarative-store-lifecycle-crds.md b/docs/adr/003-declarative-store-lifecycle-crds.md deleted file mode 100644 index a54ee44..0000000 --- a/docs/adr/003-declarative-store-lifecycle-crds.md +++ /dev/null @@ -1,199 +0,0 @@ -# ADR-003: Declarative Store Lifecycle Management via CRDs - -- **Status:** Proposed -- **Date:** 2026-04-06 -- **Deciders:** OpenFGA Helm Charts maintainers -- **Related ADR:** [ADR-001](001-adopt-openfga-operator.md) - -## Context - -OpenFGA is an authorization service. After deploying the server, teams must perform several runtime operations to make it usable: - -1. **Create a store** — a logical container for authorization data -2. **Write an authorization model** — the DSL that defines types, relations, and permissions -3. **Write tuples** — the relationship data that the model operates on (e.g., "user:anne is owner of document:budget") - -Today, these operations happen outside Kubernetes — through the OpenFGA API, CLI (`fga`), or custom scripts in CI pipelines. There is no declarative, Kubernetes-native way to manage them. 
- -This creates several problems: - -- **No GitOps for authorization config** — authorization models live in scripts or API calls, not in version-controlled manifests that ArgoCD/FluxCD sync. -- **No drift detection** — if someone modifies a model or tuple via the API, there's no controller to detect and reconcile the change. -- **No cross-team ownership** — each team that uses OpenFGA must build their own tooling to manage stores and models. There's no standard pattern. -- **Manual coordination** — deploying a new version of an application that needs a model change requires coordinating the Helm upgrade with a separate model push. - -### Alternatives Considered - -**A. CLI wrapper in CI pipelines** - -Use the `fga` CLI in a CI/CD step after `helm upgrade` to create stores, push models, and write tuples. - -*Pros:* No new Kubernetes components. Works with any CI system. -*Cons:* Imperative, not declarative. No drift detection. Each team builds their own pipeline. Model changes are not atomic with deployments. No visibility in Kubernetes tooling. - -**B. Helm post-install hook Job** - -Add a Helm hook Job that runs `fga` CLI commands after installation. - -*Pros:* Stays within the Helm ecosystem. -*Cons:* Helm hooks are the exact problem we're solving in ADR-002. Same ArgoCD/FluxCD incompatibilities. Hook Jobs are fire-and-forget with no reconciliation. - -**C. CRDs managed by the operator (selected)** - -Expose `FGAStore`, `FGAModel`, and `FGATuples` as Custom Resource Definitions. The operator watches these resources and reconciles them against the OpenFGA API. - -*Pros:* Fully declarative. GitOps-native. Continuous reconciliation. Standard Kubernetes patterns. Teams own their auth config as manifests. -*Cons:* Requires the operator (ADR-001). CRD design and reconciliation logic add development scope. Tuple reconciliation is complex. 
- -## Decision - -Introduce three CRDs, built in stages after the migration handling (ADR-002) is complete: - -### Stage 2: FGAStore - -```yaml -apiVersion: openfga.dev/v1alpha1 -kind: FGAStore -metadata: - name: my-app - namespace: my-team -spec: - # Reference to the OpenFGA instance - openfgaRef: - url: openfga.openfga-system.svc:8081 - credentialsRef: - name: openfga-api-credentials # Secret with API key or client credentials - # Store display name - name: "my-app-store" -status: - storeId: "01HXYZ..." - ready: true - conditions: - - type: Ready - status: "True" - lastTransitionTime: "2026-04-06T12:00:00Z" -``` - -**Controller behavior:** -- On create: call `CreateStore` API, store the returned store ID in `.status.storeId` -- On delete: call `DeleteStore` API (with finalizer to ensure cleanup) -- Idempotent: if a store with the same name exists, adopt it rather than creating a duplicate -- Status: set `Ready` condition when store is confirmed to exist - -### Stage 3: FGAModel - -```yaml -apiVersion: openfga.dev/v1alpha1 -kind: FGAModel -metadata: - name: my-app-model - namespace: my-team -spec: - storeRef: - name: my-app # References an FGAStore in the same namespace - model: | - model - schema 1.1 - type user - type organization - relations - define member: [user] - define admin: [user] - type document - relations - define reader: [user, organization#member] - define writer: [user, organization#admin] - define owner: [user] -status: - modelId: "01HABC..." - ready: true - lastWrittenHash: "sha256:a1b2c3..." # Hash of the model DSL to detect changes - conditions: - - type: Ready - status: "True" - - type: InSync - status: "True" -``` - -**Controller behavior:** -- On create/update: hash the model DSL. 
If hash differs from `.status.lastWrittenHash`, call `WriteAuthorizationModel` API -- Store the returned model ID in `.status.modelId` -- Model writes are append-only in OpenFGA (each write creates a new version), so this is safe -- Validation: optionally validate DSL syntax before calling the API (fail-fast with a clear error condition) -- The controller does NOT delete old model versions — OpenFGA retains model history - -### Stage 4: FGATuples - -```yaml -apiVersion: openfga.dev/v1alpha1 -kind: FGATuples -metadata: - name: my-app-base-tuples - namespace: my-team -spec: - storeRef: - name: my-app - tuples: - - user: "user:anne" - relation: "owner" - object: "document:budget" - - user: "team:engineering#member" - relation: "reader" - object: "folder:engineering-docs" - - user: "organization:acme#admin" - relation: "writer" - object: "folder:engineering-docs" -status: - writtenCount: 3 - ready: true - lastReconciled: "2026-04-06T12:00:00Z" - conditions: - - type: Ready - status: "True" - - type: InSync - status: "True" -``` - -**Controller behavior:** -- Maintain an **ownership model** — the controller tracks which tuples it wrote (via annotations or a status field). It only manages tuples it owns, never deleting tuples written by the application at runtime. -- On reconciliation: diff the desired tuples (from spec) against owned tuples in the store - - Tuples in spec but not in store → write them - - Tuples in store (owned) but not in spec → delete them - - Tuples in store but not owned → leave them alone -- Pagination: handle large tuple sets that exceed API response limits -- Batching: use `Write` API with batch operations to minimize API calls - -**Scope limitation:** `FGATuples` is intended for **base/static tuples** — organizational structure, role assignments, resource hierarchies. It is NOT intended to replace application-level tuple writes for dynamic data (e.g., per-request access grants). The ownership model ensures these two concerns don't interfere. 
- -### CRD Design Principles - -1. **Namespace-scoped** — all CRDs are namespaced, allowing teams to manage their own stores/models/tuples in their namespace -2. **Reference-based** — `FGAModel` and `FGATuples` reference an `FGAStore` by name, not by store ID. The controller resolves the reference. -3. **Status-driven** — controllers report state via `.status.conditions` following Kubernetes conventions (`Ready`, `InSync`, error conditions) -4. **Finalizers for cleanup** — `FGAStore` uses a finalizer to ensure the store is deleted from OpenFGA when the CR is deleted -5. **Idempotent** — all operations are safe to retry. Re-running reconciliation produces the same result. -6. **`v1alpha1` API version** — signals that the CRD schema may change. We will promote to `v1beta1` and `v1` as the design stabilizes. - -## Consequences - -### Positive - -- **GitOps-native authorization management** — stores, models, and tuples are Kubernetes resources that ArgoCD/FluxCD sync from Git -- **Drift detection and reconciliation** — the operator continuously ensures the actual state matches the declared state -- **Cross-team standardization** — every team uses the same CRDs, eliminating custom scripts and CI hacks -- **Atomic deployments** — a team can include `FGAModel` in their application's Helm chart; model updates deploy alongside code changes -- **Visibility** — `kubectl get fgastores`, `kubectl get fgamodels`, `kubectl describe fgatuples` provide instant visibility into authorization configuration -- **RBAC integration** — Kubernetes RBAC controls who can create/modify stores, models, and tuples per namespace - -### Negative - -- **Significant development scope** — three controllers, each with its own reconciliation logic, error handling, and tests -- **Tuple reconciliation complexity** — diffing and ownership tracking for tuples is the most complex piece; edge cases around partial failures, pagination, and large tuple sets -- **CRD upgrade burden** — CRD schema changes 
require careful migration; Helm does not upgrade CRDs automatically -- **API dependency** — the operator must be able to reach the OpenFGA API; network issues or API downtime affect reconciliation -- **Not suitable for all tuple management** — dynamic, application-driven tuples should still be written via the API, not CRDs. Users must understand this boundary. - -### Risks - -- **FGATuples at scale** — for stores with millions of tuples, the reconciliation diff could be expensive. The ownership model mitigates this (only diff owned tuples), but documentation must clearly state that `FGATuples` is for base/static data, not high-volume dynamic writes. -- **Multi-cluster** — if OpenFGA serves multiple clusters, CRDs in one cluster may conflict with CRDs in another pointing at the same store. This is out of scope for `v1alpha1` but should be considered for future versions. diff --git a/docs/adr/004-operator-deployment-model.md b/docs/adr/004-operator-deployment-model.md deleted file mode 100644 index a5a693c..0000000 --- a/docs/adr/004-operator-deployment-model.md +++ /dev/null @@ -1,164 +0,0 @@ -# ADR-004: Operator Deployment as Helm Subchart Dependency - -- **Status:** Proposed -- **Date:** 2026-04-06 -- **Deciders:** OpenFGA Helm Charts maintainers -- **Related ADR:** [ADR-001](001-adopt-openfga-operator.md) - -## Context - -The OpenFGA Operator (ADR-001) needs a deployment model — how do users install it alongside or independent of the OpenFGA server? - -There are several established patterns in the Kubernetes ecosystem: - -### Alternatives Considered - -**A. Standalone operator chart (install separately)** - -Users install the operator chart first, then install the OpenFGA chart. The operator watches for OpenFGA Deployments across namespaces. - -*Example:* -```bash -helm install openfga-operator openfga/openfga-operator -n openfga-system -helm install openfga openfga/openfga -n my-namespace -``` - -*Pros:* Clean separation of concerns. 
One operator instance serves multiple OpenFGA installations. Follows the OLM/OperatorHub pattern. -*Cons:* Two install steps. Ordering dependency — operator must exist before the chart is useful. Users must manage two releases. Harder to get started. - -**B. Operator bundled in the main chart (single chart, always installed)** - -The operator Deployment, RBAC, and CRDs are templates in the main OpenFGA chart. No subchart. - -*Pros:* Simplest for users — one chart, one install. No dependency management. -*Cons:* Chart becomes larger and harder to maintain. Users who manage the operator separately (e.g., cluster-wide) can't disable it. CRDs are tied to the application chart's release cycle. Multiple OpenFGA installations in the same cluster would deploy multiple operator instances. - -**C. Operator as a conditional subchart dependency (selected)** - -The operator is a separate Helm chart (`openfga-operator`) that the main chart declares as a conditional dependency. Disabled by default for backward compatibility; users opt in with `operator.enabled: true`. - -*Example:* -```bash -# Everything in one command -helm install openfga openfga/openfga \ - --set datastore.engine=postgres \ - --set operator.enabled=true - -# Or, operator managed separately -helm install openfga-operator openfga/openfga-operator -n openfga-system -helm install openfga openfga/openfga \ - --set operator.enabled=false -``` - -*Pros:* Single install for most users. Operator chart has its own versioning. Users can disable for standalone management. Clean separation in code. -*Cons:* Subchart dependency adds some Chart.yaml complexity. CRDs still need special handling (Helm's `crds/` directory or a pre-install hook). - -**D. OLM (Operator Lifecycle Manager) only** - -Publish the operator to OperatorHub. Users install via OLM. - -*Pros:* Standard pattern for OpenShift. Handles CRD upgrades, operator upgrades, and RBAC. -*Cons:* OLM is not available on all clusters (not standard on EKS, GKE, AKS). 
Adds a dependency on OLM itself. Doesn't help Helm-only users. - -## Decision - -The operator will be distributed as a **conditional Helm subchart dependency** of the main OpenFGA chart. - -### Chart Structure - -```text -helm-charts/ -├── charts/ -│ ├── openfga/ # Main chart (existing) -│ │ ├── Chart.yaml # Declares openfga-operator as dependency -│ │ ├── values.yaml # operator.enabled: false (opt-in) -│ │ ├── templates/ -│ │ └── crds/ # Empty in Stage 1 -│ │ -│ └── openfga-operator/ # Operator subchart (new) -│ ├── Chart.yaml -│ ├── values.yaml -│ ├── templates/ -│ │ ├── deployment.yaml -│ │ ├── serviceaccount.yaml -│ │ ├── role.yaml -│ │ └── rolebinding.yaml -│ └── crds/ # CRDs added in Stages 2-4 -│ ├── fgastore.yaml -│ ├── fgamodel.yaml -│ └── fgatuples.yaml -``` - -### Dependency Declaration - -```yaml -# charts/openfga/Chart.yaml -dependencies: - - name: openfga-operator - version: "0.1.x" - repository: "file://../openfga-operator" - condition: operator.enabled -``` - -> **Note:** The `file://` reference is used because the operator subchart lives in the same -> monorepo. When the charts are published, consumers pulling from a registry will resolve the -> dependency automatically via the chart's packaging. - -### CRD Handling - -Helm has specific behavior around CRDs: - -1. **`crds/` directory** — CRDs placed here are installed on `helm install` but are **never upgraded or deleted** by Helm. This is safe but requires manual CRD upgrades. - -2. **Pre-install/pre-upgrade hook Job** — a Job that runs `kubectl apply -f` on CRD manifests before the main install/upgrade. This handles upgrades but reintroduces Helm hooks (the problem ADR-002 solves). - -3. **Static manifests applied separately** — CRDs are published as a standalone YAML file. Users run `kubectl apply -f` before `helm install`. This is the pattern used by cert-manager, Istio, and Prometheus Operator. - -**Decision:** Use the `crds/` directory in the operator subchart for initial installation. 
Publish CRD manifests as a standalone artifact for upgrades. Document both paths clearly. - -```bash -# First install — Helm installs CRDs automatically -helm install openfga openfga/openfga - -# CRD upgrades — applied manually (Helm won't upgrade them) -kubectl apply -f https://github.com/openfga/helm-charts/releases/download/v0.2.0/crds.yaml -``` - -### Installation Modes - -| Mode | Command | Use case | -|------|---------|----------| -| **Default** (no operator) | `helm install openfga openfga/openfga` | Backward compatible. Uses Helm hooks for migration. | -| **All-in-one** | `helm install openfga openfga/openfga --set operator.enabled=true` | Single install with operator-managed migrations. | -| **Operator standalone** | `helm install op openfga/openfga-operator -n openfga-system` | Cluster-wide operator serving multiple OpenFGA instances. | - -### Multi-Instance Considerations - -When multiple OpenFGA installations exist in the same cluster, each installation gets its own operator instance. The operator is **namespace-scoped** — it only watches resources in its own namespace (or the namespace specified via `--watch-namespace`). This ensures independent OpenFGA installations never interfere with each other. 
- -```yaml -# Operator values -operator: - watchNamespace: "" # empty = watch own namespace only (default) -``` - -## Consequences - -### Positive - -- **Single `helm install` for most users** — no ordering dependencies, no manual operator setup -- **Opt-out available** — `operator.enabled: false` for users who manage it separately or don't need it -- **Independent versioning** — operator chart has its own version; can be released on a different cadence than the main chart -- **Clean code separation** — operator code and templates are in their own chart directory -- **Namespace isolation** — each operator instance is scoped to its own namespace, so multiple OpenFGA installations coexist safely -- **Consistent with ecosystem** — this is the same pattern used by charts that depend on Bitnami PostgreSQL, Redis, etc. - -### Negative - -- **CRD upgrade complexity** — Helm does not upgrade CRDs; users must apply CRD manifests separately on operator upgrades -- **Multiple operators in all-in-one mode** — if a user installs OpenFGA in three namespaces, they get three operator pods (wasteful). Documentation should recommend standalone mode for multi-instance clusters. -- **Subchart value passing** — configuring the operator requires prefixed values (e.g., `openfga-operator.image.tag`), which is slightly less ergonomic than top-level values - -### Neutral - -- **OLM support is not excluded** — the operator can be published to OperatorHub in the future alongside the Helm distribution. The two are not mutually exclusive. 
From 0bea6c4db7822efce58da24515ef8a780334e61c Mon Sep 17 00:00:00 2001 From: Ed Milic <edmilic@gmail.com> Date: Sun, 19 Apr 2026 05:28:09 -0400 Subject: [PATCH 31/42] fix: clear retry-after annotation after Job creation --- .../controller/migration_controller.go | 17 ++-- .../controller/migration_controller_test.go | 89 ++++++++++++++++++- 2 files changed, 97 insertions(+), 9 deletions(-) diff --git a/operator/internal/controller/migration_controller.go b/operator/internal/controller/migration_controller.go index 2812a6f..3275703 100644 --- a/operator/internal/controller/migration_controller.go +++ b/operator/internal/controller/migration_controller.go @@ -123,22 +123,23 @@ func (r *MigrationReconciler) Reconcile(ctx context.Context, req ctrl.Request) ( r.ActiveDeadlineSeconds, r.TTLSecondsAfterFinished, ) - // Clear the retry-after annotation now that we're creating a new Job. - if _, hasRetry := deployment.Annotations[AnnotationRetryAfter]; hasRetry { - patch := client.MergeFrom(deployment.DeepCopy()) - delete(deployment.Annotations, AnnotationRetryAfter) - if patchErr := r.Patch(ctx, deployment, patch); patchErr != nil { - logger.Error(patchErr, "failed to clear retry-after annotation") - } - } if createErr := r.Create(ctx, job); createErr != nil { if apierrors.IsAlreadyExists(createErr) { // A concurrent reconcile already created the Job; requeue to pick it up. logger.V(1).Info("migration job already exists, will recheck", "job", jobName) return ctrl.Result{RequeueAfter: 5 * time.Second}, nil } + // Leave the retry-after annotation intact so the cooldown survives this failure. return ctrl.Result{}, fmt.Errorf("creating migration job: %w", createErr) } + // Clear the retry-after annotation now that the Job is created. 
+ if _, hasRetry := deployment.Annotations[AnnotationRetryAfter]; hasRetry { + patch := client.MergeFrom(deployment.DeepCopy()) + delete(deployment.Annotations, AnnotationRetryAfter) + if patchErr := r.Patch(ctx, deployment, patch); patchErr != nil { + logger.Error(patchErr, "failed to clear retry-after annotation") + } + } logger.Info("created migration job", "job", jobName, "version", desiredVersion) return ctrl.Result{RequeueAfter: 5 * time.Second}, nil } else if err != nil { diff --git a/operator/internal/controller/migration_controller_test.go b/operator/internal/controller/migration_controller_test.go index a4d62a2..3603788 100644 --- a/operator/internal/controller/migration_controller_test.go +++ b/operator/internal/controller/migration_controller_test.go @@ -2,6 +2,7 @@ package controller import ( "context" + "fmt" "testing" "time" @@ -11,10 +12,12 @@ import ( metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" "k8s.io/apimachinery/pkg/runtime" "k8s.io/apimachinery/pkg/types" - "k8s.io/utils/ptr" clientgoscheme "k8s.io/client-go/kubernetes/scheme" + "k8s.io/utils/ptr" ctrl "sigs.k8s.io/controller-runtime" + "sigs.k8s.io/controller-runtime/pkg/client" "sigs.k8s.io/controller-runtime/pkg/client/fake" + "sigs.k8s.io/controller-runtime/pkg/client/interceptor" ) func newScheme() *runtime.Scheme { @@ -396,6 +399,90 @@ func TestReconcile_RetryAfterCooldown_SkipsJobCreation(t *testing.T) { } } +func TestReconcile_RetryAfterPersistsOnJobCreateFailure(t *testing.T) { + // Given: a Deployment with an elapsed retry-after annotation, and a client + // that fails Job creation with a non-AlreadyExists error. + dep := newTestDeployment("openfga", "default", "openfga/openfga:v1.14.0", 0) + dep.Annotations[AnnotationDesiredReplicas] = "3" + dep.Annotations[AnnotationRetryAfter] = time.Now().Add(-1 * time.Second).UTC().Format(time.RFC3339) + + scheme := newScheme() + c := fake.NewClientBuilder(). + WithScheme(scheme). + WithStatusSubresource(&appsv1.Deployment{}). 
+ WithRuntimeObjects(dep). + WithInterceptorFuncs(interceptor.Funcs{ + Create: func(ctx context.Context, c client.WithWatch, obj client.Object, opts ...client.CreateOption) error { + if _, ok := obj.(*batchv1.Job); ok { + return fmt.Errorf("simulated transient API error") + } + return c.Create(ctx, obj, opts...) + }, + }). + Build() + r := &MigrationReconciler{ + Client: c, + BackoffLimit: DefaultBackoffLimit, + ActiveDeadlineSeconds: DefaultActiveDeadlineSeconds, + TTLSecondsAfterFinished: DefaultTTLSecondsAfterFinished, + } + + // When: reconciling. + _, err := r.Reconcile(context.Background(), ctrl.Request{ + NamespacedName: types.NamespacedName{Name: "openfga", Namespace: "default"}, + }) + + // Then: an error is returned and the retry-after annotation is preserved + // so the next reconcile honors the cooldown. + if err == nil { + t.Fatal("expected error from failed job creation") + } + + updated := &appsv1.Deployment{} + if getErr := r.Get(context.Background(), types.NamespacedName{ + Name: "openfga", Namespace: "default", + }, updated); getErr != nil { + t.Fatalf("getting deployment: %v", getErr) + } + if _, ok := updated.Annotations[AnnotationRetryAfter]; !ok { + t.Error("expected retry-after annotation to persist after Job creation failure") + } +} + +func TestReconcile_RetryAfterClearedAfterJobCreated(t *testing.T) { + // Given: a Deployment with an elapsed retry-after annotation. + dep := newTestDeployment("openfga", "default", "openfga/openfga:v1.14.0", 0) + dep.Annotations[AnnotationDesiredReplicas] = "3" + dep.Annotations[AnnotationRetryAfter] = time.Now().Add(-1 * time.Second).UTC().Format(time.RFC3339) + + r := newReconciler(dep) + + // When: reconciling. + if _, err := r.Reconcile(context.Background(), ctrl.Request{ + NamespacedName: types.NamespacedName{Name: "openfga", Namespace: "default"}, + }); err != nil { + t.Fatalf("unexpected error: %v", err) + } + + // Then: the Job exists and the retry-after annotation has been cleared. 
+ job := &batchv1.Job{} + if getErr := r.Get(context.Background(), types.NamespacedName{ + Name: "openfga-migrate", Namespace: "default", + }, job); getErr != nil { + t.Fatalf("expected migration job to be created: %v", getErr) + } + + updated := &appsv1.Deployment{} + if getErr := r.Get(context.Background(), types.NamespacedName{ + Name: "openfga", Namespace: "default", + }, updated); getErr != nil { + t.Fatalf("getting deployment: %v", getErr) + } + if _, ok := updated.Annotations[AnnotationRetryAfter]; ok { + t.Error("expected retry-after annotation to be cleared after Job created") + } +} + func TestReconcile_MemoryDatastore_SkipsMigration(t *testing.T) { // Given: a Deployment using the memory datastore. dep := newTestDeployment("openfga", "default", "openfga/openfga:v1.14.0", 0) From 06d81309c28331811bc7c463b0890ca08a7210ff Mon Sep 17 00:00:00 2001 From: Ed Milic <edmilic@gmail.com> Date: Sun, 19 Apr 2026 05:33:42 -0400 Subject: [PATCH 32/42] fix: a migration Job without a version annotation or matching label is stale, its JobComplete would write the wrong version into the status ConfigMap --- .../controller/migration_controller.go | 13 ++-- .../controller/migration_controller_test.go | 68 +++++++++++++++++++ 2 files changed, 76 insertions(+), 5 deletions(-) diff --git a/operator/internal/controller/migration_controller.go b/operator/internal/controller/migration_controller.go index 3275703..ec718fb 100644 --- a/operator/internal/controller/migration_controller.go +++ b/operator/internal/controller/migration_controller.go @@ -146,8 +146,11 @@ func (r *MigrationReconciler) Reconcile(ctx context.Context, req ctrl.Request) ( return ctrl.Result{}, fmt.Errorf("getting migration job: %w", err) } - // 8. If the existing Job is for a different version, delete it and recreate. - // Check annotation first (supports digests > 63 chars), fall back to label. + // 8. If the existing Job is for a different (or unknown) version, delete it + // and recreate. 
Check annotation first (supports digests > 63 chars), fall + // back to label. A Job with neither marker is treated as stale: we cannot + // trust its outcome to represent the current desired version, so trusting + // JobComplete in step 9 would write a wrong version into the status ConfigMap. jobVersion := job.Annotations["openfga.dev/desired-version"] versionMatch := jobVersion == desiredVersion if jobVersion == "" { @@ -157,10 +160,10 @@ func (r *MigrationReconciler) Reconcile(ctx context.Context, req ctrl.Request) ( sanitized = sanitized[:63] } jobVersion = job.Labels["app.kubernetes.io/version"] - versionMatch = jobVersion == sanitized + versionMatch = jobVersion != "" && jobVersion == sanitized } - if jobVersion != "" && !versionMatch { - logger.Info("existing migration job is for a different version, deleting", "jobVersion", jobVersion, "desiredVersion", desiredVersion) + if !versionMatch { + logger.Info("existing migration job is for a different or unknown version, deleting", "jobVersion", jobVersion, "desiredVersion", desiredVersion) propagation := metav1.DeletePropagationBackground if delErr := r.Delete(ctx, job, &client.DeleteOptions{ PropagationPolicy: &propagation, diff --git a/operator/internal/controller/migration_controller_test.go b/operator/internal/controller/migration_controller_test.go index 3603788..8299638 100644 --- a/operator/internal/controller/migration_controller_test.go +++ b/operator/internal/controller/migration_controller_test.go @@ -200,6 +200,9 @@ func TestReconcile_JobSucceeded_UpdatesConfigMapAndScalesUp(t *testing.T) { ObjectMeta: metav1.ObjectMeta{ Name: "openfga-migrate", Namespace: "default", + Annotations: map[string]string{ + "openfga.dev/desired-version": "v1.14.0", + }, OwnerReferences: []metav1.OwnerReference{ { APIVersion: "apps/v1", @@ -285,6 +288,9 @@ func TestReconcile_JobFailed_SetsRetryAnnotationAndRequeues(t *testing.T) { ObjectMeta: metav1.ObjectMeta{ Name: "openfga-migrate", Namespace: "default", + Annotations: 
map[string]string{ + "openfga.dev/desired-version": "v1.14.0", + }, OwnerReferences: []metav1.OwnerReference{ { APIVersion: "apps/v1", @@ -399,6 +405,65 @@ func TestReconcile_RetryAfterCooldown_SkipsJobCreation(t *testing.T) { } } +func TestReconcile_UnknownVersionJob_DeletedNotTrusted(t *testing.T) { + // Given: a Deployment desiring v1.14.0 and a JobComplete migration Job that + // carries no version annotation or label (e.g. left over from an older + // operator or created by a third-party tool). Trusting its outcome would + // write the wrong version into the migration-status ConfigMap. + dep := newTestDeployment("openfga", "default", "openfga/openfga:v1.14.0", 0) + dep.Annotations[AnnotationDesiredReplicas] = "3" + + job := &batchv1.Job{ + ObjectMeta: metav1.ObjectMeta{ + Name: "openfga-migrate", + Namespace: "default", + OwnerReferences: []metav1.OwnerReference{ + { + APIVersion: "apps/v1", + Kind: "Deployment", + Name: "openfga", + UID: "test-uid-123", + }, + }, + }, + Status: batchv1.JobStatus{ + Conditions: []batchv1.JobCondition{ + {Type: batchv1.JobComplete, Status: corev1.ConditionTrue}, + }, + }, + } + + r := newReconciler(dep, job) + + // When: reconciling. + result, err := r.Reconcile(context.Background(), ctrl.Request{ + NamespacedName: types.NamespacedName{Name: "openfga", Namespace: "default"}, + }) + + // Then: the Job is deleted and a requeue is scheduled; the ConfigMap is + // NOT created from the unknown-version Job's outcome. 
+ if err != nil { + t.Fatalf("unexpected error: %v", err) + } + if result.RequeueAfter == 0 { + t.Error("expected requeue after deleting unknown-version job") + } + + deletedJob := &batchv1.Job{} + if getErr := r.Get(context.Background(), types.NamespacedName{ + Name: "openfga-migrate", Namespace: "default", + }, deletedJob); getErr == nil { + t.Error("expected unknown-version job to be deleted") + } + + cm := &corev1.ConfigMap{} + if getErr := r.Get(context.Background(), types.NamespacedName{ + Name: "openfga-migration-status", Namespace: "default", + }, cm); getErr == nil { + t.Errorf("expected no migration-status ConfigMap; got version=%q", cm.Data["version"]) + } +} + func TestReconcile_RetryAfterPersistsOnJobCreateFailure(t *testing.T) { // Given: a Deployment with an elapsed retry-after annotation, and a client // that fails Job creation with a non-AlreadyExists error. @@ -794,6 +859,9 @@ func TestReconcile_JobSucceeded_UpdatesExistingConfigMap(t *testing.T) { ObjectMeta: metav1.ObjectMeta{ Name: "openfga-migrate", Namespace: "default", + Annotations: map[string]string{ + "openfga.dev/desired-version": "v1.14.0", + }, OwnerReferences: []metav1.OwnerReference{ { APIVersion: "apps/v1", From 4700037938dccb64ed5443ca30a5343fc6ad274f Mon Sep 17 00:00:00 2001 From: Ed Milic <edmilic@gmail.com> Date: Sun, 19 Apr 2026 05:42:52 -0400 Subject: [PATCH 33/42] fix(chart): restore full label set on pod template metadata MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit I switched the pod template labels from openfga.labels to openfga.selectorLabels, which would have stripped helm.sh/chart, commonLabels, component, version, managed-by, and part-of from running pods on upgrade — a breaking change for any tooling filtering on those labels. Add a helm-unittest regression guard for both operator on/off modes. 
--- charts/openfga/templates/deployment.yaml | 2 +- charts/openfga/tests/operator_mode_test.yaml | 33 ++++++++++++++++++++ 2 files changed, 34 insertions(+), 1 deletion(-) diff --git a/charts/openfga/templates/deployment.yaml b/charts/openfga/templates/deployment.yaml index e3d8601..7318dde 100644 --- a/charts/openfga/templates/deployment.yaml +++ b/charts/openfga/templates/deployment.yaml @@ -48,7 +48,7 @@ spec: prometheus.io/path: /metrics prometheus.io/port: "{{ (split ":" .Values.telemetry.metrics.addr)._1 }}" labels: - {{- include "openfga.selectorLabels" . | nindent 8 }} + {{- include "openfga.labels" . | nindent 8 }} {{- with .Values.podExtraLabels }} {{- toYaml . | nindent 8 }} {{- end }} diff --git a/charts/openfga/tests/operator_mode_test.yaml b/charts/openfga/tests/operator_mode_test.yaml index 2e50915..164d8bd 100644 --- a/charts/openfga/tests/operator_mode_test.yaml +++ b/charts/openfga/tests/operator_mode_test.yaml @@ -135,3 +135,36 @@ tests: asserts: - isNotNull: path: spec.template.spec.initContainers + + # --- Pod template labels --- + # The pod template must carry the full common label set (helm.sh/chart, + # component, version, managed-by, part-of) — not just selectorLabels — + # so logging/monitoring tooling that filters on these labels keeps working + # across upgrades. Regression guard for the operator-migration branch. 
+ - it: should include common labels on pod template metadata when operator is disabled + set: + operator.enabled: false + asserts: + - isNotEmpty: + path: spec.template.metadata.labels["helm.sh/chart"] + - equal: + path: spec.template.metadata.labels["app.kubernetes.io/component"] + value: authorization-controller + - equal: + path: spec.template.metadata.labels["app.kubernetes.io/part-of"] + value: openfga + + - it: should include common labels on pod template metadata when operator is enabled + set: + operator.enabled: true + migration.enabled: true + datastore.engine: postgres + asserts: + - isNotEmpty: + path: spec.template.metadata.labels["helm.sh/chart"] + - equal: + path: spec.template.metadata.labels["app.kubernetes.io/component"] + value: authorization-controller + - equal: + path: spec.template.metadata.labels["app.kubernetes.io/part-of"] + value: openfga From 9763e5eba706f3daf2788ceea788e0fd30d7972d Mon Sep 17 00:00:00 2001 From: Ed Milic <edmilic@gmail.com> Date: Mon, 20 Apr 2026 05:43:02 -0400 Subject: [PATCH 34/42] =?UTF-8?q?ci:=20add=20operator-mode=20coverage=20an?= =?UTF-8?q?d=20v1.9.5=20=E2=86=92=20v1.14.1=20upgrade=20E2E?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - charts/openfga-operator/ci and charts/openfga/ci values files so chart-testing exercises both the standalone operator subchart and the parent chart with operator.enabled=true. - .github/ci/operator-postgres-values.yaml plus a dedicated workflow step that installs at v1.9.5 and upgrades to v1.14.1 (crossing the v1.10.0 !!REQUIRES MIGRATION!! boundary), asserting the migration ConfigMap and ready rollout at each step. 
--- .github/ci/operator-postgres-values.yaml | 77 +++++++++++++++++ .github/workflows/test.yml | 84 +++++++++++++++++++ .../openfga-operator/ci/default-values.yaml | 4 + charts/openfga/ci/operator-mode-values.yaml | 25 ++++++ 4 files changed, 190 insertions(+) create mode 100644 .github/ci/operator-postgres-values.yaml create mode 100644 charts/openfga-operator/ci/default-values.yaml create mode 100644 charts/openfga/ci/operator-mode-values.yaml diff --git a/.github/ci/operator-postgres-values.yaml b/.github/ci/operator-postgres-values.yaml new file mode 100644 index 0000000..4330d9f --- /dev/null +++ b/.github/ci/operator-postgres-values.yaml @@ -0,0 +1,77 @@ +# E2E values consumed by the "operator + postgres E2E" step in test.yml. +# Not under charts/openfga/ci/ on purpose — chart-testing's helm-test runs +# a gRPC probe immediately after install, which would race the operator's +# scale-up. The dedicated workflow step waits for the migration ConfigMap +# and the scale-up explicitly, then verifies readiness. 
+replicaCount: 1 + +operator: + enabled: true + +migration: + enabled: true + +datastore: + engine: postgres + uriSecret: openfga-e2e-postgres-credentials + +openfga-operator: + image: + pullPolicy: Never + +extraObjects: + - apiVersion: v1 + kind: Secret + metadata: + name: openfga-e2e-postgres-credentials + stringData: + uri: "postgres://openfga:changeme@openfga-e2e-postgres:5432/openfga?sslmode=disable" + - apiVersion: apps/v1 + kind: Deployment + metadata: + name: openfga-e2e-postgres + spec: + replicas: 1 + selector: + matchLabels: + app: openfga-e2e-postgres + template: + metadata: + labels: + app: openfga-e2e-postgres + spec: + containers: + - name: postgres + image: postgres:17 + ports: + - containerPort: 5432 + env: + - name: POSTGRES_USER + value: openfga + - name: POSTGRES_PASSWORD + value: changeme + - name: POSTGRES_DB + value: openfga + - name: PGDATA + value: /var/lib/postgresql/data/pgdata + volumeMounts: + - name: data + mountPath: /var/lib/postgresql/data + readinessProbe: + exec: + command: ["pg_isready", "-U", "openfga", "-d", "openfga"] + initialDelaySeconds: 5 + periodSeconds: 5 + volumes: + - name: data + emptyDir: {} + - apiVersion: v1 + kind: Service + metadata: + name: openfga-e2e-postgres + spec: + selector: + app: openfga-e2e-postgres + ports: + - port: 5432 + targetPort: 5432 diff --git a/.github/workflows/test.yml b/.github/workflows/test.yml index bbec31d..bc035c3 100644 --- a/.github/workflows/test.yml +++ b/.github/workflows/test.yml @@ -69,3 +69,87 @@ jobs: - name: Run chart-testing (install) if: steps.list-changed.outputs.changed == 'true' run: ct install --target-branch ${{ github.event.repository.default_branch }} + + - name: E2E test — operator-managed migration across schema boundary + id: e2e-operator + if: steps.list-changed.outputs.changed == 'true' + env: + NS: openfga-e2e + REL: openfga + # v1.9.5 → v1.14.1 crosses the v1.10.0 "!!REQUIRES MIGRATION!!" + # boundary (collation spec change in openfga/openfga#2661). 
+ OLD_VER: v1.9.5 + NEW_VER: v1.14.1 + run: | + set -euo pipefail + kubectl create namespace "$NS" + helm dependency build charts/openfga + + echo "=== Phase 1: fresh install at ${OLD_VER} ===" + helm install "$REL" charts/openfga \ + --namespace "$NS" \ + --values .github/ci/operator-postgres-values.yaml \ + --set image.tag="${OLD_VER}" \ + --wait --timeout=3m + + # Operator pod must reach Ready (validates /readyz, RBAC, env vars). + kubectl wait deployment -n "$NS" \ + -l app.kubernetes.io/name=openfga-operator \ + --for=condition=Available=True --timeout=2m + + # Operator must run the migration Job and write ConfigMap at OLD_VER. + # Poll because kubectl wait --for=create requires kubectl >=1.31. + for i in $(seq 1 60); do + ver=$(kubectl get configmap "${REL}-migration-status" -n "$NS" \ + -o jsonpath='{.data.version}' 2>/dev/null || true) + if [ "$ver" = "${OLD_VER}" ]; then + echo "Phase 1: migration ConfigMap version=${ver}" + break + fi + sleep 3 + done + test "$ver" = "${OLD_VER}" + + # Operator must scale the openfga Deployment from 0 to 1 ready replica. + # condition=Available alone returns true at 0/0 before scale-up; + # readyReplicas=1 is the load-bearing signal. + kubectl wait deployment/"$REL" -n "$NS" \ + --for=jsonpath='{.status.readyReplicas}'=1 --timeout=3m + + echo "=== Phase 2: helm upgrade ${OLD_VER} → ${NEW_VER} ===" + helm upgrade "$REL" charts/openfga \ + --namespace "$NS" \ + --values .github/ci/operator-postgres-values.yaml \ + --set image.tag="${NEW_VER}" \ + --wait --timeout=3m + + # Operator must detect the version change, delete the stale Job, + # run a new migration, and update the ConfigMap to NEW_VER. 
+ for i in $(seq 1 60); do + ver=$(kubectl get configmap "${REL}-migration-status" -n "$NS" \ + -o jsonpath='{.data.version}' 2>/dev/null || true) + if [ "$ver" = "${NEW_VER}" ]; then + echo "Phase 2: migration ConfigMap version=${ver}" + break + fi + sleep 3 + done + test "$ver" = "${NEW_VER}" + + # New pods must roll out at NEW_VER and become Ready. + kubectl wait deployment/"$REL" -n "$NS" \ + --for=jsonpath='{.status.readyReplicas}'=1 --timeout=3m + image=$(kubectl get deployment/"$REL" -n "$NS" \ + -o jsonpath='{.spec.template.spec.containers[0].image}') + echo "Phase 2 running image: $image" + echo "$image" | grep -q ":${NEW_VER}" + + - name: Dump operator E2E diagnostics on failure + if: failure() && steps.e2e-operator.conclusion == 'failure' + env: + NS: openfga-e2e + run: | + kubectl get all,configmap,job -n "$NS" -o wide || true + kubectl describe deployment -n "$NS" || true + kubectl logs -n "$NS" -l app.kubernetes.io/name=openfga-operator --tail=200 || true + kubectl logs -n "$NS" -l job-name --tail=200 || true diff --git a/charts/openfga-operator/ci/default-values.yaml b/charts/openfga-operator/ci/default-values.yaml new file mode 100644 index 0000000..93797cd --- /dev/null +++ b/charts/openfga-operator/ci/default-values.yaml @@ -0,0 +1,4 @@ +# Standalone install exercise for chart-testing. +# kind has the operator image preloaded, so skip the registry pull. +image: + pullPolicy: Never diff --git a/charts/openfga/ci/operator-mode-values.yaml b/charts/openfga/ci/operator-mode-values.yaml new file mode 100644 index 0000000..b85a6af --- /dev/null +++ b/charts/openfga/ci/operator-mode-values.yaml @@ -0,0 +1,25 @@ +# Exercises operator-managed mode end-to-end via chart-testing. +# +# The openfga-operator subchart auto-installs (conditional dependency on +# operator.enabled). With the memory datastore, the chart starts the +# Deployment at replicas=1 immediately, so `helm test` runs without racing +# the operator's reconcile loop. 
Migration is skipped (memory engine), but
+# the rest of the wiring is exercised: subchart resolution, operator RBAC,
+# pod/SA/annotation rendering, and the operator pod actually running and
+# reconciling against the openfga Deployment in its release namespace.
+#
+# Postgres + operator (which exercises the migration Job path) is covered
+# by the dedicated "operator + postgres E2E" step in test.yml instead — it
+# requires waiting for the operator to scale the Deployment up before
+# `helm test` runs the gRPC probe, which chart-testing's install flow
+# cannot do.
+operator:
+  enabled: true
+
+migration:
+  enabled: true
+
+datastore:
+  engine: memory
+
+openfga-operator:
+  image:
+    pullPolicy: Never

From 160f1e90cf295c16d759186eab6981ef90c0d8d1 Mon Sep 17 00:00:00 2001
From: Ed Milic <edmilic@gmail.com>
Date: Mon, 20 Apr 2026 05:47:35 -0400
Subject: [PATCH 35/42] docs: update docs to reflect operator deployment changes

---
 docs/adr/002-operator-managed-migrations.md | 35 ++++++++++-----------
 docs/adr/README.md | 6 ++--
 2 files changed, 19 insertions(+), 22 deletions(-)

diff --git a/docs/adr/002-operator-managed-migrations.md b/docs/adr/002-operator-managed-migrations.md
index ad53799..1f0dc74 100644
--- a/docs/adr/002-operator-managed-migrations.md
+++ b/docs/adr/002-operator-managed-migrations.md
@@ -193,31 +193,30 @@ No hooks. No init containers. No `k8s-wait-for`. No downtime on upgrade. All res

### What Changes in the Helm Chart

-**Removed:**
+Nothing is deleted outright — every change is gated on `operator.enabled` so the legacy flow remains the default for backward compatibility.
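+
+As a sketch of the gating pattern (illustrative only; the resource body and
+hook annotation below are placeholders, not the chart's literal template):
+
+```yaml
+# templates/job.yaml (sketch): the legacy hook-based migration Job renders
+# only while the operator is disabled; when it is enabled, the operator
+# creates migration Jobs itself at reconcile time.
+{{- if and .Values.migration.enabled (not .Values.operator.enabled) }}
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: {{ include "openfga.fullname" . }}-migrate
+  annotations:
+    "helm.sh/hook": pre-install,pre-upgrade
+spec:
+  template:
+    spec:
+      restartPolicy: Never
+      containers:
+        - name: migrate
+          image: "openfga/openfga:{{ .Values.image.tag }}"
+          args: ["migrate"]
+{{- end }}
+```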
-| File/Section | Reason |
-|--------------|--------|
-| `templates/job.yaml` | Operator creates migration Jobs |
-| `templates/rbac.yaml` | No init container polling Job status |
-| `values.yaml`: `initContainer.repository`, `initContainer.tag` | `k8s-wait-for` eliminated |
-| `values.yaml`: `datastore.migrationType` | Operator always uses Job internally |
-| `values.yaml`: `datastore.waitForMigrations` | Operator handles ordering |
-| `values.yaml`: `migrate.annotations` (hook annotations) | No Helm hooks |
-| Deployment init containers for migration | Operator manages readiness via replica scaling |
+**Legacy Helm-hook resources, rendered only when `operator.enabled: false`; the table lists how each behaves once the operator is enabled:**

-**Added:**
+| File/Section | Behavior when operator is enabled |
+|--------------|-----------------------------------|
+| `templates/job.yaml` | Skipped — operator creates migration Jobs dynamically |
+| `templates/rbac.yaml` | Skipped — no init container needs to poll Job status |
+| `values.yaml`: `initContainer.*` | Unused — `k8s-wait-for` not deployed |
+| `values.yaml`: `datastore.migrationType`, `datastore.waitForMigrations` | Unused — operator always uses a Job and handles ordering |
+| `values.yaml`: `migrate.annotations` | Unused — no Helm hooks |
+| Deployment migration init containers | Skipped — operator manages readiness via replica scaling |
+
+**Added (active only when `operator.enabled: true`):**

| File/Section | Purpose |
|--------------|---------|
-| `values.yaml`: `operator.enabled` | Toggle operator subchart |
+| `values.yaml`: `operator.enabled` | Toggle the operator subchart |
| `values.yaml`: `migration.serviceAccount.*` | Separate ServiceAccount for migration Jobs |
-| `values.yaml`: `migration.timeout`, `backoffLimit`, `ttlSecondsAfterFinished` | Migration Job configuration |
+| `values.yaml`: `migration.backoffLimit`, `activeDeadlineSeconds`, `ttlSecondsAfterFinished` | Migration Job configuration |
| `templates/serviceaccount.yaml`:
second SA | Migration ServiceAccount | -| `charts/openfga-operator/` | Operator subchart | - -**Preserved (backward compatible):** +| `charts/openfga-operator/` | Operator subchart (conditional dependency) | -When `operator.enabled: false`, the chart falls back to the current behavior — Helm hooks, `k8s-wait-for` init container, shared ServiceAccount. This allows gradual adoption. +Users on `operator.enabled: false` (the default) see identical rendered output to the pre-operator chart, so gradual adoption is possible with no forced migration. ## Consequences @@ -226,7 +225,7 @@ When `operator.enabled: false`, the chart falls back to the current behavior — - **All 6 migration issues resolved** — no Helm hooks means no ArgoCD/FluxCD/`--wait` incompatibility - **`k8s-wait-for` eliminated** — removes an unmaintained image with CVEs from the supply chain (#132, #144) - **Least-privilege enforced** — separate ServiceAccounts for migration (DDL) and runtime (CRUD) (#95) -- **Helm chart simplified** — 2 templates removed, init container logic removed, RBAC for job-watching removed +- **Runtime surface area reduced** — when `operator.enabled: true`, the legacy migration Job, init-container `k8s-wait-for` logic, and job-watching RBAC are skipped from the rendered manifest - **Migration is observable** — Job is a regular resource visible in all tools; ConfigMap records migration history; operator conditions surface errors - **Idempotent and crash-safe** — operator can restart at any point and resume correctly diff --git a/docs/adr/README.md b/docs/adr/README.md index 5f80512..d6b3445 100644 --- a/docs/adr/README.md +++ b/docs/adr/README.md @@ -12,8 +12,6 @@ We follow the format described by [Michael Nygard](https://cognitect.com/blog/20 |-----|-------|--------|------| | [ADR-001](001-adopt-openfga-operator.md) | Adopt a Kubernetes Operator for OpenFGA Lifecycle Management | Proposed | 2026-04-06 | | [ADR-002](002-operator-managed-migrations.md) | Replace Helm Hook Migrations 
with Operator-Managed Migrations | Proposed | 2026-04-06 | -| [ADR-003](003-declarative-store-lifecycle-crds.md) | Declarative Store Lifecycle Management via CRDs | Proposed | 2026-04-06 | -| [ADR-004](004-operator-deployment-model.md) | Operator Deployment as Helm Subchart Dependency | Proposed | 2026-04-06 | --- @@ -67,9 +65,9 @@ When multiple ADRs are part of a single cohesive proposal — e.g., a foundation When doing this: -- **Explain the relationship in the PR description** — identify which ADR is the foundational decision and which are downstream. For example: "ADR-001 is the core decision to build an operator. ADR-002, 003, and 004 are downstream decisions about how the operator handles migrations, CRDs, and deployment." +- **Explain the relationship in the PR description** — identify which ADR is the foundational decision and which are downstream. For example: "ADR-001 is the core decision to build an operator. ADR-002 is a downstream decision about how the operator handles migrations." - **Each ADR can be accepted or rejected independently** — a reviewer might approve the foundational decision but push back on a downstream one. If that happens, split the PR: merge the accepted ADRs and keep the contested ones open for further discussion. -- **Keep each ADR self-contained** — even though they're in the same PR, each ADR should stand on its own. A reader should be able to understand ADR-003 without reading ADR-002 first (though they may reference each other). +- **Keep each ADR self-contained** — even though they're in the same PR, each ADR should stand on its own. A reader should be able to understand a downstream ADR without reading the foundational one first (though they may reference each other). 
## How to Give Feedback on an ADR From a7bffbe30da449f3cbfa2c7e3214f5c80dc39044 Mon Sep 17 00:00:00 2001 From: Ed Milic <edmilic@gmail.com> Date: Mon, 20 Apr 2026 06:04:46 -0400 Subject: [PATCH 36/42] fix(operator): react to JobFailureTarget for fast failure detection MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The Job controller sets JobFailureTarget as soon as it decides a Job will fail (backoff limit reached, active deadline exceeded, etc.) — JobFailed only flips after pods finish terminating, which can take up to BackoffLimit × ActiveDeadlineSeconds. Previously the operator only watched JobFailed, so a broken migration took ~15 minutes (with chart defaults) before MigrationFailed appeared on the Deployment. Treat either condition as "failed" and add a regression test. --- .../controller/migration_controller.go | 7 +- .../controller/migration_controller_test.go | 66 +++++++++++++++++++ 2 files changed, 72 insertions(+), 1 deletion(-) diff --git a/operator/internal/controller/migration_controller.go b/operator/internal/controller/migration_controller.go index ec718fb..ae096b6 100644 --- a/operator/internal/controller/migration_controller.go +++ b/operator/internal/controller/migration_controller.go @@ -197,7 +197,12 @@ func (r *MigrationReconciler) Reconcile(ctx context.Context, req ctrl.Request) ( return ctrl.Result{}, nil } - if isJobConditionTrue(job, batchv1.JobFailed) { + // JobFailureTarget is set as soon as the Job controller decides the Job + // will fail (backoff limit reached, deadline exceeded, etc.); JobFailed + // only flips after pods finish terminating, which can take BackoffLimit × + // ActiveDeadlineSeconds. Treating either as "failed" surfaces the failure + // to users within seconds instead of minutes. 
+ if isJobConditionTrue(job, batchv1.JobFailed) || isJobConditionTrue(job, batchv1.JobFailureTarget) { logger.Info("migration job failed, will delete and retry", "job", jobName, "version", desiredVersion) // Set condition so kubectl describe shows the failure. diff --git a/operator/internal/controller/migration_controller_test.go b/operator/internal/controller/migration_controller_test.go index 8299638..1bd2eb0 100644 --- a/operator/internal/controller/migration_controller_test.go +++ b/operator/internal/controller/migration_controller_test.go @@ -372,6 +372,72 @@ func TestReconcile_JobFailed_SetsRetryAnnotationAndRequeues(t *testing.T) { } } +func TestReconcile_JobFailureTarget_TreatedAsFailed(t *testing.T) { + // Given: a Job with only JobFailureTarget=True (no JobFailed yet). The + // Job controller sets this as soon as it decides the Job will fail, + // before pods finish terminating and JobFailed is recorded. The operator + // should treat this as a failure to surface the error in seconds rather + // than waiting the full BackoffLimit × ActiveDeadlineSeconds. 
+ dep := newTestDeployment("openfga", "default", "openfga/openfga:v1.14.0", 0) + dep.Annotations[AnnotationDesiredReplicas] = "3" + + job := &batchv1.Job{ + ObjectMeta: metav1.ObjectMeta{ + Name: "openfga-migrate", + Namespace: "default", + Annotations: map[string]string{ + "openfga.dev/desired-version": "v1.14.0", + }, + OwnerReferences: []metav1.OwnerReference{ + { + APIVersion: "apps/v1", + Kind: "Deployment", + Name: "openfga", + UID: "test-uid-123", + }, + }, + }, + Status: batchv1.JobStatus{ + Conditions: []batchv1.JobCondition{ + {Type: batchv1.JobFailureTarget, Status: corev1.ConditionTrue, Reason: "BackoffLimitExceeded"}, + }, + }, + } + + r := newReconciler(dep, job) + + result, err := r.Reconcile(context.Background(), ctrl.Request{ + NamespacedName: types.NamespacedName{Name: "openfga", Namespace: "default"}, + }) + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + if result.RequeueAfter != 60*time.Second { + t.Errorf("expected 60s requeue, got %v", result.RequeueAfter) + } + + updated := &appsv1.Deployment{} + if getErr := r.Get(context.Background(), types.NamespacedName{ + Name: "openfga", Namespace: "default", + }, updated); getErr != nil { + t.Fatalf("getting deployment: %v", getErr) + } + + deletedJob := &batchv1.Job{} + if getErr := r.Get(context.Background(), types.NamespacedName{ + Name: "openfga-migrate", Namespace: "default", + }, deletedJob); getErr == nil { + t.Error("expected migration job to be deleted on JobFailureTarget") + } + if _, ok := updated.Annotations[AnnotationRetryAfter]; !ok { + t.Error("expected retry-after annotation to be set") + } + cond := findCondition(updated.Status.Conditions, "MigrationFailed") + if cond == nil || cond.Status != corev1.ConditionTrue { + t.Fatal("expected MigrationFailed condition True") + } +} + func TestReconcile_RetryAfterCooldown_SkipsJobCreation(t *testing.T) { // Given: a Deployment with a retry-after annotation in the future. 
dep := newTestDeployment("openfga", "default", "openfga/openfga:v1.14.0", 0) From 2cb102d0e71331a87ed1883ef83063680dc468f6 Mon Sep 17 00:00:00 2001 From: Ed Milic <edmilic@gmail.com> Date: Mon, 20 Apr 2026 06:08:23 -0400 Subject: [PATCH 37/42] chore(schema): reject unknown keys in operator and migration values MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Both the operator subchart and the parent chart's operator/migration blocks were missing additionalProperties: false, so typos like `migrationjob:` (lowercase), `enbaled: true`, or misplaced fields were silently ignored at install time. Add the guard to all well-defined object blocks — free-form blocks (podAnnotations, resources, securityContext, etc.) stay permissive since they pass through to pod spec. --- charts/openfga-operator/values.schema.json | 21 ++++++++++++++------- charts/openfga/values.schema.json | 13 ++++++++----- 2 files changed, 22 insertions(+), 12 deletions(-) diff --git a/charts/openfga-operator/values.schema.json b/charts/openfga-operator/values.schema.json index 470b265..e74147c 100644 --- a/charts/openfga-operator/values.schema.json +++ b/charts/openfga-operator/values.schema.json @@ -21,7 +21,8 @@ "type": "string" } }, - "required": ["repository"] + "required": ["repository"], + "additionalProperties": false }, "imagePullSecrets": { "type": "array", @@ -30,7 +31,8 @@ "properties": { "name": { "type": "string" } }, - "required": ["name"] + "required": ["name"], + "additionalProperties": false } }, "nameOverride": { "type": "string" }, @@ -42,7 +44,8 @@ "create": { "type": "boolean" }, "annotations": { "type": "object" }, "name": { "type": "string" } - } + }, + "additionalProperties": false }, "podAnnotations": { "type": "object" }, "podSecurityContext": { "type": "object" }, @@ -52,7 +55,8 @@ "type": "object", "properties": { "enabled": { "type": "boolean" } - } + }, + "additionalProperties": false }, "migrationJob": { "type": "object", @@ -69,7 
+73,8 @@ "type": "integer", "minimum": 0 } - } + }, + "additionalProperties": false }, "resources": { "type": "object" }, "podDisruptionBudget": { @@ -88,7 +93,8 @@ { "type": "integer", "minimum": 0 } ] } - } + }, + "additionalProperties": false }, "nodeSelector": { "type": "object" }, "tolerations": { @@ -96,5 +102,6 @@ "items": { "type": "object" } }, "affinity": { "type": "object" } - } + }, + "additionalProperties": false } diff --git a/charts/openfga/values.schema.json b/charts/openfga/values.schema.json index 54e737b..4fb19a2 100644 --- a/charts/openfga/values.schema.json +++ b/charts/openfga/values.schema.json @@ -1299,11 +1299,12 @@ "description": "Enable the openfga-operator subchart for operator-managed migrations", "default": false } - } + }, + "additionalProperties": false }, "openfga-operator": { "type": "object", - "description": "Values passed through to the openfga-operator subchart" + "description": "Values passed through to the openfga-operator subchart (validated by that chart's own schema)" }, "migration": { "type": "object", @@ -1332,12 +1333,14 @@ }, "name": { "type": "string", - "description": "The name of the migration service account. Defaults to {fullname}-migration.", + "description": "The name of the migration service account. Defaults to {fullname}-migration. Must be set explicitly when create=false and a dedicated migration SA is desired; leave empty to skip the annotation entirely.", "default": "" } - } + }, + "additionalProperties": false } - } + }, + "additionalProperties": false } }, "additionalProperties": false From 98e2b8bf1e2c17766eefec26bdea89f74d3705ed Mon Sep 17 00:00:00 2001 From: Ed Milic <edmilic@gmail.com> Date: Mon, 20 Apr 2026 06:11:42 -0400 Subject: [PATCH 38/42] ci(operator): build multi-arch on PRs, add immutable tag on main - Removes the push-to-main/dispatch gate on build-and-push so PRs verify the linux/arm64 build before merge. 
- On main pushes, adds an immutable :<version>-<sha> tag alongside the existing floating :<version> and :latest so consumers can pin a specific commit. --- .github/workflows/operator.yml | 32 ++++++++++++++++++++++---------- 1 file changed, 22 insertions(+), 10 deletions(-) diff --git a/.github/workflows/operator.yml b/.github/workflows/operator.yml index dc88c55..001d22f 100644 --- a/.github/workflows/operator.yml +++ b/.github/workflows/operator.yml @@ -48,9 +48,6 @@ jobs: build-and-push: needs: test - if: >- - (github.event_name == 'push' && github.ref == 'refs/heads/main') || - (github.event_name == 'workflow_dispatch' && inputs.push_image) runs-on: ubuntu-latest permissions: contents: read @@ -68,31 +65,45 @@ jobs: echo "short_sha=${short_sha}" >> "$GITHUB_OUTPUT" echo "Operator version: ${version} (sha: ${short_sha})" - - name: Determine image tags + - name: Determine image tags and push policy id: tags run: | - if [[ "${{ github.ref }}" == "refs/heads/main" ]]; then - echo "tags=${{ env.IMAGE_NAME }}:${{ steps.version.outputs.version }},${{ env.IMAGE_NAME }}:latest" >> "$GITHUB_OUTPUT" - else - # Dev build — tag with version-sha to avoid clobbering release tags + if [[ "${{ github.event_name }}" == "push" && "${{ github.ref }}" == "refs/heads/main" ]]; then + # Main push: publish floating :<version> and :latest plus an + # immutable :<version>-<sha> so consumers pinning a specific + # commit have a stable reference. 
+ echo "tags=${{ env.IMAGE_NAME }}:${{ steps.version.outputs.version }},${{ env.IMAGE_NAME }}:latest,${{ env.IMAGE_NAME }}:${{ steps.version.outputs.version }}-${{ steps.version.outputs.short_sha }}" >> "$GITHUB_OUTPUT" + echo "push=true" >> "$GITHUB_OUTPUT" + elif [[ "${{ github.event_name }}" == "workflow_dispatch" && "${{ inputs.push_image }}" == "true" ]]; then echo "tags=${{ env.IMAGE_NAME }}:${{ steps.version.outputs.version }}-${{ steps.version.outputs.short_sha }}" >> "$GITHUB_OUTPUT" + echo "push=true" >> "$GITHUB_OUTPUT" + else + # Pull request (or workflow_dispatch with push_image=false): + # build both platforms but don't publish — catches arm64-incompatible + # changes (build tags, syscalls, CGO) before they merge. + echo "tags=${{ env.IMAGE_NAME }}:pr-${{ steps.version.outputs.short_sha }}" >> "$GITHUB_OUTPUT" + echo "push=false" >> "$GITHUB_OUTPUT" fi + - name: Set up QEMU + uses: docker/setup-qemu-action@v3 + - name: Set up Docker Buildx uses: docker/setup-buildx-action@v3 - name: Login to GHCR + if: steps.tags.outputs.push == 'true' uses: docker/login-action@v4.1.0 with: registry: ghcr.io username: ${{ github.actor }} password: ${{ secrets.GITHUB_TOKEN }} - - name: Build and push + - name: Build and (conditionally) push uses: docker/build-push-action@v6 with: context: operator - push: true + push: ${{ steps.tags.outputs.push }} platforms: linux/amd64,linux/arm64 tags: ${{ steps.tags.outputs.tags }} cache-from: type=gha @@ -100,5 +111,6 @@ jobs: labels: | org.opencontainers.image.source=https://github.com/${{ github.repository }} org.opencontainers.image.version=${{ steps.version.outputs.version }} + org.opencontainers.image.revision=${{ github.sha }} org.opencontainers.image.title=openfga-operator org.opencontainers.image.description=OpenFGA Kubernetes operator for migration orchestration From 729b0ec0bfb6947a4b36e21688f8bb07d6d3a14f Mon Sep 17 00:00:00 2001 From: Ed Milic <edmilic@gmail.com> Date: Mon, 20 Apr 2026 06:15:04 -0400 Subject: [PATCH 
39/42] docs(chart): explain 0-replica install in NOTES when operator is enabled

When operator.enabled=true the workload starts at 0 replicas and only scales
up after migration. If the operator pod is unhealthy this looks like a stuck
install with no signal. Add NOTES output pointing at the operator deployment,
migration Job, and MigrationFailed condition.
---
 charts/openfga/templates/NOTES.txt | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/charts/openfga/templates/NOTES.txt b/charts/openfga/templates/NOTES.txt
index 0048291..628c355 100644
--- a/charts/openfga/templates/NOTES.txt
+++ b/charts/openfga/templates/NOTES.txt
@@ -1,3 +1,20 @@
+{{- if and .Values.operator.enabled .Values.migration.enabled }}
+NOTE: operator-managed migration is enabled. The OpenFGA Deployment starts at
+0 replicas and is scaled up by the openfga-operator only after the migration
+Job completes successfully.
+
+If pods don't appear within ~2 minutes, check the operator and the migration
+Job:
+
+  kubectl get deployment -A -l app.kubernetes.io/name=openfga-operator
+  kubectl logs -n {{ .Release.Namespace }} -l app.kubernetes.io/name=openfga-operator --tail=100
+  kubectl get job/{{ include "openfga.fullname" . }}-migrate -n {{ .Release.Namespace }} -o yaml
+  kubectl describe deployment/{{ include "openfga.fullname" . }} -n {{ .Release.Namespace }}
+
+A `MigrationFailed` condition on the Deployment indicates the migration Job
+failed; the operator retries every 60s, so it recovers once the issue is fixed.
+
+{{ end -}}
1. 
Get the application URL by running these commands: {{- if .Values.ingress.enabled }} {{- range $host := .Values.ingress.hosts }} From d106292456440471f4342421ffe863c9b04ed470 Mon Sep 17 00:00:00 2001 From: Ed Milic <edmilic@gmail.com> Date: Mon, 20 Apr 2026 06:19:30 -0400 Subject: [PATCH 40/42] fix: add missing global block to fix helm unit tests --- charts/openfga-operator/values.schema.json | 3 +++ 1 file changed, 3 insertions(+) diff --git a/charts/openfga-operator/values.schema.json b/charts/openfga-operator/values.schema.json index e74147c..324465f 100644 --- a/charts/openfga-operator/values.schema.json +++ b/charts/openfga-operator/values.schema.json @@ -2,6 +2,9 @@ "$schema": "https://json-schema.org/draft/2020-12/schema", "type": "object", "properties": { + "global": { + "type": "object" + }, "replicaCount": { "type": "integer", "minimum": 1 From ff6de224107ca6270bce39e6d64526c50cca4409 Mon Sep 17 00:00:00 2001 From: Ed Milic <edmilic@gmail.com> Date: Mon, 20 Apr 2026 06:58:44 -0400 Subject: [PATCH 41/42] fix(operator): use multi-arch base image digests + Go cross-compile --- operator/Dockerfile | 16 +++++++++++----- 1 file changed, 11 insertions(+), 5 deletions(-) diff --git a/operator/Dockerfile b/operator/Dockerfile index 846414a..034e22d 100644 --- a/operator/Dockerfile +++ b/operator/Dockerfile @@ -1,5 +1,10 @@ -# pinned golang:1.26.2 linux/amd64 -FROM golang:1.26.2@sha256:b53c282df83967299380adbd6a2dc67e750a58217f39285d6240f6f80b19eaad AS builder +# pinned multi-arch index for golang:1.26.2 (linux/amd64, linux/arm64, ...) +FROM --platform=$BUILDPLATFORM golang:1.26.2@sha256:5f3787b7f902c07c7ec4f3aa91a301a3eda8133aa32661a3b3a3a86ab3a68a36 AS builder + +# buildx provides these automatically; declare so Go cross-compiles to the +# requested target instead of the build host's arch. 
+ARG TARGETOS +ARG TARGETARCH WORKDIR /workspace COPY go.mod go.sum ./ @@ -8,10 +13,11 @@ RUN go mod download COPY cmd/ cmd/ COPY internal/ internal/ -RUN CGO_ENABLED=0 GOOS=linux go build -ldflags="-s -w" -o /operator ./cmd/ +RUN CGO_ENABLED=0 GOOS=${TARGETOS} GOARCH=${TARGETARCH} \ + go build -ldflags="-s -w" -o /operator ./cmd/ -# pinned gcr.io/distroless/static:nonroot linux/amd64 -FROM gcr.io/distroless/static:nonroot@sha256:64c43684e6d2b581d1eb362ea47b6a4defee6a9cac5f7ebbda3daa67e8c9b8e6 +# pinned multi-arch index for gcr.io/distroless/static:nonroot +FROM gcr.io/distroless/static:nonroot@sha256:e3f945647ffb95b5839c07038d64f9811adf17308b9121d8a2b87b6a22a80a39 WORKDIR / COPY --from=builder /operator . USER 65532:65532 From 4bdead4aa364fe007560b8d1c07ee80134250f76 Mon Sep 17 00:00:00 2001 From: Ed Milic <edmilic@gmail.com> Date: Mon, 20 Apr 2026 07:00:13 -0400 Subject: [PATCH 42/42] docs(operator-chart): clarify watchNamespace default --- charts/openfga-operator/values.yaml | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/charts/openfga-operator/values.yaml b/charts/openfga-operator/values.yaml index db2747b..921dcef 100644 --- a/charts/openfga-operator/values.yaml +++ b/charts/openfga-operator/values.yaml @@ -39,7 +39,12 @@ securityContext: runAsUser: 65532 # -- Namespace to watch for OpenFGA Deployments. -# Leave empty to default to the release namespace. +# Leave empty to default to the operator pod's own namespace (read from +# the POD_NAMESPACE env var, set via the downward API). This usually +# equals the release namespace, but when `namespaceOverride` puts the +# operator in a different namespace than the release, the watch follows +# the pod — not the release. Set this explicitly to watch a specific +# namespace independent of where the operator runs. watchNamespace: "" leaderElection: