# docs: add ADRs for OpenFGA operator proposal #307

**emilic** wants to merge 1 commit into `main` from `docs/adr-operator-proposal` (status: Open).
# ADR-NNN: Title

- **Status:** Proposed
- **Date:** YYYY-MM-DD
- **Deciders:** [list of people involved]
- **Related Issues:** #
- **Related ADR:** [ADR-NNN](NNN-filename.md)

## Context

What is the problem or situation that motivates this decision? What constraints exist? What forces are at play?

Include enough background that someone unfamiliar with the project can understand why this decision matters.

## Decision

What is the change being proposed or decided?

### Alternatives Considered

**A. [Alternative name]**

[Description of the alternative]

*Pros:* ...
*Cons:* ...

**B. [Alternative name]**

[Description of the alternative]

*Pros:* ...
*Cons:* ...

## Consequences

### Positive

- What improves as a result of this decision?

### Negative

- What gets harder, more complex, or more costly?

### Risks

- What assumptions might prove false?
- What could go wrong?
# ADR-001: Adopt a Kubernetes Operator for OpenFGA Lifecycle Management

- **Status:** Proposed
- **Date:** 2026-04-06
- **Deciders:** OpenFGA Helm Charts maintainers
- **Related Issues:** #211, #107, #120, #100, #95, #126, #132, #143, #144

## Context

The OpenFGA Helm chart currently handles all lifecycle concerns — deployment, configuration, database migrations, and secret management — through Helm templates and hooks. This approach works for simple installations but breaks down in several important scenarios:

1. **Database migrations rely on Helm hooks**, which are incompatible with GitOps tools (ArgoCD, FluxCD) and Helm's own `--wait` flag. This is the single biggest pain point for users, accounting for 6 open issues (#211, #107, #120, #100, #95, #126).

2. **Store provisioning, authorization model updates, and tuple management** are runtime operations that happen through the OpenFGA API. There is no declarative, GitOps-native way to manage these. Teams must use imperative scripts, CI pipelines, or manual API calls to set up stores and push models after deployment.

3. **The migration init container** depends on `groundnuty/k8s-wait-for`, an unmaintained image with known CVEs, pinned by mutable tag (#132, #144).

4. **Migration and runtime workloads share a single ServiceAccount**, violating least-privilege when cloud IAM-based database authentication (AWS IRSA, GCP Workload Identity) maps the ServiceAccount directly to a database role (#95).
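To make point 4 concrete, here is a minimal sketch of the least-privilege split with two ServiceAccounts mapped to different cloud IAM roles. The names and role ARNs below are illustrative assumptions, not values from the chart:

```yaml
# Hypothetical sketch: separate ServiceAccounts so that IAM-based DB auth
# (e.g. AWS IRSA) can map migration and runtime pods to different DB roles.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: openfga-migrator          # used only by the migration Job (DDL role)
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/openfga-ddl
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: openfga                   # used by the server pods (CRUD-only role)
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/openfga-crud
```

With a single shared ServiceAccount, both workloads would necessarily receive the union of these permissions.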
### Alternatives Considered

**A. Fix migrations within the Helm chart (no operator)**

- Strip Helm hook annotations from the migration Job by default, rendering it as a regular resource.
- Replace `k8s-wait-for` with a shell-based init container that polls the database schema version directly.
- Add a separate ServiceAccount for the migration Job.

*Pros:* Lower complexity, no new component to maintain.
*Cons:* Doesn't solve the ordering problem cleanly — the Job and Deployment are created simultaneously, requiring an init container to gate startup. Still requires an image or script to poll. Doesn't address store/model/tuple lifecycle at all.
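A rough sketch of the shell-based polling init container from alternative A. Everything here is an assumption for illustration: the image, the secret name, and the migration-version table (`openfga migrate` is believed to use goose-style version tracking, but the exact table depends on the datastore engine and OpenFGA version):

```yaml
# Hypothetical sketch of alternative A: an init container that polls the
# datastore's migration-version table instead of the Kubernetes Job API.
initContainers:
  - name: wait-for-schema
    image: postgres:16-alpine        # assumed image providing psql
    command:
      - sh
      - -c
      - |
        # Poll until a migration version row exists; the table name is an
        # assumption and would need to match what `openfga migrate` writes.
        until psql "$DATASTORE_URI" -tAc \
          "SELECT version_id FROM goose_db_version ORDER BY id DESC LIMIT 1" \
          | grep -q .; do
          echo "waiting for schema migration..."; sleep 5
        done
    env:
      - name: DATASTORE_URI
        valueFrom:
          secretKeyRef: {name: openfga-datastore, key: uri}
```

Note this still cannot distinguish "migration not started" from "migration to the *new* version not finished", which is part of why the alternative was rejected.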
| **B. Recommend initContainer mode as default** | ||
|
|
||
| - Change `datastore.migrationType` default from `"job"` to `"initContainer"`, running migrations inside each pod. | ||
|
|
||
| *Pros:* No separate Job, no hooks, no `k8s-wait-for`. | ||
| *Cons:* Every pod runs migrations on startup (wasteful). Rolling updates trigger redundant migrations. Crash-loops on migration failure. Still shares ServiceAccount. No path to store lifecycle management. | ||
|
|
||
| **C. Build an operator (selected)** | ||
|
|
||
| - A Kubernetes operator manages migrations as internal reconciliation logic and exposes CRDs for store, model, and tuple lifecycle. | ||
|
|
||
| *Pros:* Solves all migration issues. Enables GitOps-native authorization management. Follows established Kubernetes patterns (CNPG, Strimzi, cert-manager). Separates concerns cleanly. | ||
| *Cons:* Significant development and maintenance investment. New component to deploy and monitor. Learning curve for contributors. | ||
|
|
||
| **D. External migration tool (e.g., Flyway, golang-migrate)** | ||
|
|
||
| - Remove migrations from the chart entirely and document using an external tool. | ||
|
|
||
| *Pros:* Simplifies the chart completely. | ||
| *Cons:* Shifts complexity to the user. Every user must build their own migration pipeline. No standard approach across the community. | ||
|
|
||
## Decision

We will build an **OpenFGA Kubernetes Operator** that handles:

1. **Database migration orchestration** (Stage 1) — replacing Helm hooks, the `k8s-wait-for` init container, and the shared ServiceAccount with operator-managed migration Jobs and deployment readiness gating.

2. **Declarative store lifecycle management** (Stages 2-4) — exposing `FGAStore`, `FGAModel`, and `FGATuples` CRDs for GitOps-native authorization configuration.

The operator will be:

- Written in Go using `controller-runtime` / kubebuilder
- Distributed as a Helm subchart dependency of the main OpenFGA chart
- Optional — users who don't need it can set `operator.enabled: false` and fall back to the existing behavior

Development will follow a staged approach to deliver value incrementally:

| Stage | Scope | Outcome |
|-------|-------|---------|
| 1 | Operator scaffolding + migration handling | All 6 migration issues resolved |
| 2 | `FGAStore` CRD | Declarative store provisioning |
| 3 | `FGAModel` CRD | Declarative authorization model management |
| 4 | `FGATuples` CRD | Declarative tuple management |
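As a sketch of what Stage 2 could enable, a store might be declared like this. The API group, version, and every field name below are hypothetical — the CRD schemas have not been designed yet:

```yaml
# Hypothetical FGAStore resource; group/version and fields are illustrative.
apiVersion: openfga.openfga.dev/v1alpha1
kind: FGAStore
metadata:
  name: orders
spec:
  # Which OpenFGA deployment this store belongs to (assumed field name)
  instanceRef:
    name: openfga
status:
  storeID: ""   # would be populated by the operator after store creation
```

The point of the sketch is the workflow, not the schema: the store definition lives in Git, and ArgoCD/FluxCD sync it like any other resource.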
## Consequences

### Positive

- **Resolves all 6 migration issues** (#211, #107, #120, #100, #95, #126) and related dependency issues (#132, #144)
- **Eliminates `k8s-wait-for` dependency** — removes an unmaintained, CVE-carrying image from the supply chain
- **Enables GitOps-native authorization management** — stores, models, and tuples become declarative Kubernetes resources that ArgoCD/FluxCD can sync
- **Enforces least-privilege** — separate ServiceAccounts for migration (DDL) and runtime (CRUD)
- **Simplifies the Helm chart** — removes migration Job template, init container logic, RBAC for job-status-reading, and hook annotations
- **Follows Kubernetes ecosystem conventions** — operators are the standard pattern for managing stateful application lifecycle

### Negative

- **New component to maintain** — the operator is a full Go project with its own release cycle, CI, testing, and CVE surface
- **Increased deployment footprint** — an additional pod running in the cluster (though resource requirements are minimal: ~50m CPU, ~64Mi memory)
- **Learning curve** — contributors need to understand controller-runtime patterns to modify the operator
- **CRD management complexity** — Helm does not upgrade or delete CRDs; users may need to apply CRD manifests separately on operator upgrades

### Neutral

- **Backward compatibility preserved** — the `operator.enabled: false` fallback maintains the existing Helm hook behavior for users who haven't migrated
- **No change for memory-datastore users** — users running with `datastore.engine: memory` are unaffected (no migrations, no operator needed)
# ADR-002: Replace Helm Hook Migrations with Operator-Managed Migrations

- **Status:** Proposed
- **Date:** 2026-04-06
- **Deciders:** OpenFGA Helm Charts maintainers
- **Related ADR:** [ADR-001](001-adopt-openfga-operator.md)
- **Related Issues:** #211, #107, #120, #100, #95, #126, #132, #144

## Context

### How Migrations Work Today

The current Helm chart uses a **Helm hook Job** to run database migrations (`openfga migrate`) and a **`k8s-wait-for` init container** on the Deployment to block server startup until the migration completes.

Seven files are involved:

| File | Role |
|------|------|
| `templates/job.yaml` | Migration Job with Helm hook annotations |
| `templates/deployment.yaml` | OpenFGA Deployment + `wait-for-migration` init container |
| `templates/serviceaccount.yaml` | Shared ServiceAccount (migration + runtime) |
| `templates/rbac.yaml` | Role + RoleBinding so init container can poll Job status |
| `templates/_helpers.tpl` | Datastore environment variable helpers |
| `values.yaml` | `datastore.*`, `migrate.*`, `initContainer.*` configuration |
| `Chart.yaml` | `bitnami/common` dependency for migration sidecars |

**The migration Job** (`templates/job.yaml`) is annotated as a Helm hook:

```yaml
annotations:
  "helm.sh/hook": post-install,post-upgrade,post-rollback,post-delete
  "helm.sh/hook-delete-policy": before-hook-creation
  "helm.sh/hook-weight": "1"
```

This means Helm manages it outside the normal release lifecycle — it only runs after Helm finishes creating/upgrading all other resources.

**The wait-for init container** blocks the Deployment pods from starting:

```yaml
initContainers:
  - name: wait-for-migration
    image: "groundnuty/k8s-wait-for:v2.0"
    args: ["job-wr", "openfga-migrate"]
```

It polls the Kubernetes API (`GET /apis/batch/v1/.../jobs/openfga-migrate`) until `.status.succeeded >= 1`. This requires RBAC permissions (Role/RoleBinding for `batch/jobs` `get`/`list`).
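That RBAC looks roughly like the following — a sketch reconstructed from the description above; the resource names in the actual `templates/rbac.yaml` may differ:

```yaml
# Sketch of the Role/RoleBinding that lets the wait-for-migration init
# container poll Job status via the Kubernetes API; names are illustrative.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: openfga-job-reader
rules:
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: openfga-job-reader
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: openfga-job-reader
subjects:
  - kind: ServiceAccount
    name: openfga   # the shared ServiceAccount used by the server pods
```

This is RBAC that exists solely so an application pod can watch Kubernetes infrastructure — one of the things the operator approach removes.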
**The alternative mode** (`datastore.migrationType: initContainer`) runs migration directly inside each Deployment pod as an init container, avoiding hooks entirely but introducing redundant migration runs across replicas.

### The Six Issues

| Issue | Tool | Root Cause |
|-------|------|------------|
| **#211** | ArgoCD | ArgoCD ignores Helm hook annotations. The migration Job is never created as a managed resource. The init container waits forever for a Job that doesn't exist. |
| **#107** | ArgoCD | Same root cause. The Job is invisible in ArgoCD's UI — users can't see, debug, or manually sync it. |
| **#120** | Helm `--wait` | Circular deadlock. Helm waits for the Deployment to be ready before running post-install hooks. The Deployment is never ready because the init container waits for the hook Job. The Job never runs because Helm is waiting. |
| **#100** | FluxCD | FluxCD waits for all resources by default. The `hook-delete-policy: before-hook-creation` removes the completed Job before FluxCD can confirm the Deployment is healthy. |
| **#95** | AWS IRSA | Migration and runtime share a ServiceAccount. With IAM-based DB auth, the runtime gets DDL permissions it doesn't need (CREATE TABLE, ALTER TABLE). |
| **#126** | All | The `k8s-wait-for` image is configured in two separate places in `values.yaml`, leading to inconsistency. Related: #132 (image unmaintained, has CVEs) and #144 (pinned by mutable tag). |

### Why Helm Hooks Are Fundamentally Wrong for This

Helm hooks are a **deploy-time orchestration mechanism**. They assume Helm is the active agent running the deployment. GitOps tools (ArgoCD, FluxCD) break this assumption — they render the chart to manifests and apply them declaratively. The hook annotations are either ignored (ArgoCD) or cause ordering/cleanup conflicts (FluxCD).

This is not a bug in ArgoCD or FluxCD. It is a fundamental mismatch between Helm's imperative hook model and the declarative GitOps model.

## Decision

Replace the Helm hook migration Job and `k8s-wait-for` init container with **operator-managed migrations** as part of Stage 1 of the OpenFGA Operator (see [ADR-001](001-adopt-openfga-operator.md)).

### How It Works

The operator runs a **migration controller** that reconciles the OpenFGA Deployment:

```
┌───────────────────────────────────────────────────────┐
│ Operator Reconciliation                               │
│                                                       │
│ 1. Read Deployment → extract image tag (e.g. v1.14.0) │
│ 2. Read ConfigMap/openfga-migration-status            │
│    └── "Last migrated version: v1.13.0"               │
│ 3. Versions differ → migration needed                 │
│ 4. Create Job/openfga-migrate                         │
│    ├── ServiceAccount: openfga-migrator (DDL perms)   │
│    ├── Image: openfga/openfga:v1.14.0                 │
│    ├── Args: ["migrate"]                              │
│    └── ttlSecondsAfterFinished: 300                   │
│ 5. Watch Job until succeeded                          │
│ 6. Update ConfigMap → "version: v1.14.0"              │
│ 7. Scale Deployment replicas: 0 → 3                   │
│ 8. OpenFGA pods start, serve requests                 │
└───────────────────────────────────────────────────────┘
```

**Key design decisions within this approach:**

#### Deployment starts at replicas: 0

The Helm chart renders the Deployment with `replicas: 0` when `operator.enabled: true`. The operator scales it up only after migration succeeds. This is simpler than readiness gates or admission webhooks, and ensures no pods run against an unmigrated schema.

#### Version tracking via ConfigMap

A ConfigMap (`openfga-migration-status`) records the last successfully migrated version. The operator compares this to the Deployment's image tag to determine if migration is needed. This is:

- Simple to inspect (`kubectl get configmap openfga-migration-status -o yaml`)
- Survives operator restarts
- Can be manually deleted to force re-migration
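A sketch of what that status ConfigMap might contain. The data key is illustrative — the exact schema is an implementation detail of the operator:

```yaml
# Hypothetical contents of the migration-status ConfigMap.
apiVersion: v1
kind: ConfigMap
metadata:
  name: openfga-migration-status
data:
  migratedVersion: "v1.13.0"   # last version `openfga migrate` completed for
```

When the Deployment's image tag is `v1.14.0` and this value reads `v1.13.0`, the controller knows a migration Job is needed.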
#### Separate ServiceAccount for migrations

The operator creates a dedicated `openfga-migrator` ServiceAccount for migration Jobs. Users can annotate it with cloud IAM roles that grant DDL permissions, while the runtime ServiceAccount retains only CRUD permissions.

#### Migration Job is a regular resource

The Job created by the operator has no Helm hook annotations. It is a standard Kubernetes Job, visible to ArgoCD, FluxCD, and all Kubernetes tooling. It has an owner reference to the operator's managed resource for proper garbage collection.
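Putting those pieces together, the Job the operator creates would look roughly like this — reconstructed from the reconciliation sketch above, with field values such as `backoffLimit` shown as illustrative defaults:

```yaml
# Sketch of the operator-created migration Job: a plain Job with no hook
# annotations, a dedicated ServiceAccount, and a bounded lifetime.
apiVersion: batch/v1
kind: Job
metadata:
  name: openfga-migrate
spec:
  backoffLimit: 3                  # illustrative; would be configurable
  activeDeadlineSeconds: 300       # kills hung migrations (default 300s)
  ttlSecondsAfterFinished: 300     # garbage-collects the completed Job
  template:
    spec:
      serviceAccountName: openfga-migrator   # DDL-capable identity only here
      restartPolicy: Never
      containers:
        - name: migrate
          image: openfga/openfga:v1.14.0
          args: ["migrate"]
```

Because this is an ordinary Job, `kubectl logs job/openfga-migrate` and the ArgoCD/FluxCD UIs all work on it without special handling.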
#### Failure handling

| Failure | Behavior |
|---------|----------|
| Job fails | Operator sets `MigrationFailed` condition on Deployment. Does NOT scale up. User inspects Job logs. |
| Job hangs | `activeDeadlineSeconds` (default 300s) kills it. Operator sees failure. |
| Operator crashes | On restart, re-reads ConfigMap and Job status. Resumes from where it left off. |
| Database unreachable | Job fails to connect. Operator retries on next reconciliation (exponential backoff). |

### Sequence Comparison

**Before (Helm hooks):**

```
helm install
├── Create ServiceAccount, RBAC, Secret, Service
├── Create Deployment (with wait-for-migration init container)
│   └── Pod starts → init container polls for Job → waits...
├── [Helm finishes regular resources]
├── Run post-install hooks:
│   └── Create Job/openfga-migrate → runs openfga migrate
│       └── Job succeeds
├── Init container sees Job succeeded → exits
└── Main container starts
```

Problems: ArgoCD skips the post-install hook step entirely. FluxCD deletes the hook Job before confirming health. `--wait` deadlocks between the Deployment and the hook Job.

**After (operator-managed):**

```
helm install
├── Create ServiceAccount (runtime), ServiceAccount (migrator)
├── Create Secret, Service
├── Create Deployment (replicas: 0, no init containers)
├── Create Operator Deployment
└── [Helm is done — all resources are regular, no hooks]

Operator starts:
├── Detects Deployment image version
├── No migration status ConfigMap → migration needed
├── Creates Job/openfga-migrate (regular Job, no hooks)
│   ├── Uses openfga-migrator ServiceAccount
│   └── Runs openfga migrate → succeeds
├── Creates ConfigMap with migrated version
└── Scales Deployment to 3 replicas → pods start
```

No hooks. No init containers. No `k8s-wait-for`. All resources are regular Kubernetes objects.

### What Changes in the Helm Chart

**Removed:**

| File/Section | Reason |
|--------------|--------|
| `templates/job.yaml` | Operator creates migration Jobs |
| `templates/rbac.yaml` | No init container polling Job status |
| `values.yaml`: `initContainer.repository`, `initContainer.tag` | `k8s-wait-for` eliminated |
| `values.yaml`: `datastore.migrationType` | Operator always uses a Job internally |
| `values.yaml`: `datastore.waitForMigrations` | Operator handles ordering |
| `values.yaml`: `migrate.annotations` (hook annotations) | No Helm hooks |
| Deployment init containers for migration | Operator manages readiness via replica scaling |

**Added:**

| File/Section | Purpose |
|--------------|---------|
| `values.yaml`: `operator.enabled` | Toggle operator subchart |
| `values.yaml`: `migration.serviceAccount.*` | Separate ServiceAccount for migration Jobs |
| `values.yaml`: `migration.timeout`, `backoffLimit`, `ttlSecondsAfterFinished` | Migration Job configuration |
| `templates/serviceaccount.yaml`: second SA | Migration ServiceAccount |
| `charts/openfga-operator/` | Operator subchart |
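A sketch of the new `values.yaml` surface described in the tables above. Key names follow those tables; the defaults shown are illustrative, not decided:

```yaml
# Illustrative values.yaml fragment for operator-managed migrations.
operator:
  enabled: true              # toggle the operator subchart

migration:
  serviceAccount:
    create: true
    name: openfga-migrator
    annotations: {}          # e.g. a cloud IAM role granting DDL access
  timeout: 300               # seconds before a hung migration Job is killed
  backoffLimit: 3
  ttlSecondsAfterFinished: 300
```

Setting `operator.enabled: false` would leave all of the `migration.*` keys unused and restore the legacy hook-based path.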
**Preserved (backward compatible):**

When `operator.enabled: false`, the chart falls back to the current behavior — Helm hooks, `k8s-wait-for` init container, shared ServiceAccount. This allows gradual adoption.

## Consequences

### Positive

- **All 6 migration issues resolved** — no Helm hooks means no ArgoCD/FluxCD/`--wait` incompatibility
- **`k8s-wait-for` eliminated** — removes an unmaintained image with CVEs from the supply chain (#132, #144)
- **Least-privilege enforced** — separate ServiceAccounts for migration (DDL) and runtime (CRUD) (#95)
- **Helm chart simplified** — 2 templates removed, init container logic removed, RBAC for job-watching removed
- **Migration is observable** — the Job is a regular resource visible in all tools; the ConfigMap records migration history; operator conditions surface errors
- **Idempotent and crash-safe** — the operator can restart at any point and resume correctly

### Negative

- **Operator is a new runtime dependency** — if the operator pod is unavailable, migrations don't run (but existing running pods are unaffected)
- **Replica scaling model** — starting at `replicas: 0` means a brief period where the Deployment exists but has no pods; monitoring tools may flag this
- **Two upgrade paths to document** — `operator.enabled: true` (new) vs `operator.enabled: false` (legacy)

### Risks

- **Zero-downtime upgrades** — the initial implementation scales to 0 during migration, causing brief downtime. A future enhancement can support rolling upgrades where the new schema is backward-compatible, but this is explicitly out of scope for Stage 1.
- **ConfigMap as state store** — if the ConfigMap is accidentally deleted, the operator re-runs migration (which is safe — `openfga migrate` is idempotent). This is a feature, not a bug, but should be documented.
I like this! 👍