# docs: add ADRs for OpenFGA operator proposal #307

**emilic** wants to merge 1 commit into `main` from `docs/adr-operator-proposal` (status: Open).
# ADR-NNN: Title

- **Status:** Proposed
- **Date:** YYYY-MM-DD
- **Deciders:** [list of people involved]
- **Related Issues:** #
- **Related ADR:** [ADR-NNN](NNN-filename.md)

## Context

What is the problem or situation that motivates this decision? What constraints exist? What forces are at play?

Include enough background that someone unfamiliar with the project can understand why this decision matters.

## Decision

What is the change being proposed or decided?

### Alternatives Considered

**A. [Alternative name]**

[Description of the alternative]

*Pros:* ...
*Cons:* ...

**B. [Alternative name]**

[Description of the alternative]

*Pros:* ...
*Cons:* ...

## Consequences

### Positive

- What improves as a result of this decision?

### Negative

- What gets harder, more complex, or more costly?

### Risks

- What assumptions might prove false?
- What could go wrong?
# ADR-001: Adopt a Kubernetes Operator for OpenFGA Lifecycle Management

- **Status:** Proposed
- **Date:** 2026-04-06
- **Deciders:** OpenFGA Helm Charts maintainers
- **Related Issues:** #211, #107, #120, #100, #95, #126, #132, #143, #144

## Context

The OpenFGA Helm chart currently handles all lifecycle concerns — deployment, configuration, database migrations, and secret management — through Helm templates and hooks. This approach works for simple installations but breaks down in several important scenarios:

1. **Database migrations rely on Helm hooks**, which are incompatible with GitOps tools (ArgoCD, FluxCD) and Helm's own `--wait` flag. This is the single biggest pain point for users, accounting for 6 open issues (#211, #107, #120, #100, #95, #126).

2. **Store provisioning, authorization model updates, and tuple management** are runtime operations that happen through the OpenFGA API. There is no declarative, GitOps-native way to manage these. Teams must use imperative scripts, CI pipelines, or manual API calls to set up stores and push models after deployment.

3. **The migration init container** depends on `groundnuty/k8s-wait-for`, an unmaintained image with known CVEs, pinned by mutable tag (#132, #144).

4. **Migration and runtime workloads share a single ServiceAccount**, violating least-privilege when cloud IAM-based database authentication (AWS IRSA, GCP Workload Identity) maps the ServiceAccount directly to a database role (#95).
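To make point 4 concrete, here is a minimal sketch of the least-privilege split with two ServiceAccounts mapped to different cloud IAM roles. The names and role ARNs below are illustrative assumptions, not values from the chart:

```yaml
# Hypothetical sketch: separate ServiceAccounts so that IAM-based DB auth
# (e.g. AWS IRSA) can map migration and runtime pods to different DB roles.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: openfga-migrator          # used only by the migration Job (DDL role)
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/openfga-ddl
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: openfga                   # used by the server pods (CRUD-only role)
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/openfga-crud
```

With a single shared ServiceAccount, both workloads would necessarily receive the union of these permissions.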
### Alternatives Considered

**A. Fix migrations within the Helm chart (no operator)**

- Strip Helm hook annotations from the migration Job by default, rendering it as a regular resource.
- Replace `k8s-wait-for` with a shell-based init container that polls the database schema version directly.
- Add a separate ServiceAccount for the migration Job.

*Pros:* Lower complexity, no new component to maintain.
*Cons:* Doesn't solve the ordering problem cleanly — the Job and Deployment are created simultaneously, requiring an init container to gate startup. Still requires an image or script to poll. Doesn't address store/model/tuple lifecycle at all.
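A rough sketch of the shell-based polling init container from alternative A. Everything here is an assumption for illustration: the image, the secret name, and the migration-version table (`openfga migrate` is believed to use goose-style version tracking, but the exact table depends on the datastore engine and OpenFGA version):

```yaml
# Hypothetical sketch of alternative A: an init container that polls the
# datastore's migration-version table instead of the Kubernetes Job API.
initContainers:
  - name: wait-for-schema
    image: postgres:16-alpine        # assumed image providing psql
    command:
      - sh
      - -c
      - |
        # Poll until a migration version row exists; the table name is an
        # assumption and would need to match what `openfga migrate` writes.
        until psql "$DATASTORE_URI" -tAc \
          "SELECT version_id FROM goose_db_version ORDER BY id DESC LIMIT 1" \
          | grep -q .; do
          echo "waiting for schema migration..."; sleep 5
        done
    env:
      - name: DATASTORE_URI
        valueFrom:
          secretKeyRef: {name: openfga-datastore, key: uri}
```

Note this still cannot distinguish "migration not started" from "migration to the *new* version not finished", which is part of why the alternative was rejected.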
| **B. Recommend initContainer mode as default** | ||
|
|
||
| - Change `datastore.migrationType` default from `"job"` to `"initContainer"`, running migrations inside each pod. | ||
|
|
||
| *Pros:* No separate Job, no hooks, no `k8s-wait-for`. | ||
| *Cons:* Every pod runs migrations on startup (wasteful). Rolling updates trigger redundant migrations. Crash-loops on migration failure. Still shares ServiceAccount. No path to store lifecycle management. | ||
|
|
||
| **C. Build an operator (selected)** | ||
|
|
||
| - A Kubernetes operator manages migrations as internal reconciliation logic and exposes CRDs for store, model, and tuple lifecycle. | ||
|
|
||
| *Pros:* Solves all migration issues. Enables GitOps-native authorization management. Follows established Kubernetes patterns (CNPG, Strimzi, cert-manager). Separates concerns cleanly. | ||
| *Cons:* Significant development and maintenance investment. New component to deploy and monitor. Learning curve for contributors. | ||
|
|
||
| **D. External migration tool (e.g., Flyway, golang-migrate)** | ||
|
|
||
| - Remove migrations from the chart entirely and document using an external tool. | ||
|
|
||
| *Pros:* Simplifies the chart completely. | ||
| *Cons:* Shifts complexity to the user. Every user must build their own migration pipeline. No standard approach across the community. | ||
|
|
||
## Decision

We will build an **OpenFGA Kubernetes Operator** that handles:

1. **Database migration orchestration** (Stage 1) — replacing Helm hooks, the `k8s-wait-for` init container, and the shared ServiceAccount with operator-managed migration Jobs and deployment readiness gating.

2. **Declarative store lifecycle management** (Stages 2-4) — exposing `FGAStore`, `FGAModel`, and `FGATuples` CRDs for GitOps-native authorization configuration.

The operator will be:

- Written in Go using `controller-runtime` / kubebuilder
- Distributed as a Helm subchart dependency of the main OpenFGA chart
- Optional — users who don't need it can set `operator.enabled: false` and fall back to the existing behavior

Development will follow a staged approach to deliver value incrementally:

| Stage | Scope | Outcome |
|-------|-------|---------|
| 1 | Operator scaffolding + migration handling | All 6 migration issues resolved |
| 2 | `FGAStore` CRD | Declarative store provisioning |
| 3 | `FGAModel` CRD | Declarative authorization model management |
| 4 | `FGATuples` CRD | Declarative tuple management |
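As a sketch of what Stage 2 could enable, a store might be declared like this. The API group, version, and every field name below are hypothetical — the CRD schemas have not been designed yet:

```yaml
# Hypothetical FGAStore resource; group/version and fields are illustrative.
apiVersion: openfga.openfga.dev/v1alpha1
kind: FGAStore
metadata:
  name: orders
spec:
  # Which OpenFGA deployment this store belongs to (assumed field name)
  instanceRef:
    name: openfga
status:
  storeID: ""   # would be populated by the operator after store creation
```

The point of the sketch is the workflow, not the schema: the store definition lives in Git, and ArgoCD/FluxCD sync it like any other resource.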
## Consequences

### Positive

- **Resolves all 6 migration issues** (#211, #107, #120, #100, #95, #126) and related dependency issues (#132, #144)
- **Eliminates `k8s-wait-for` dependency** — removes an unmaintained, CVE-carrying image from the supply chain
- **Enables GitOps-native authorization management** — stores, models, and tuples become declarative Kubernetes resources that ArgoCD/FluxCD can sync
- **Enforces least-privilege** — separate ServiceAccounts for migration (DDL) and runtime (CRUD)
- **Simplifies the Helm chart** — removes migration Job template, init container logic, RBAC for job-status-reading, and hook annotations
- **Follows Kubernetes ecosystem conventions** — operators are the standard pattern for managing stateful application lifecycle

### Negative

- **New component to maintain** — the operator is a full Go project with its own release cycle, CI, testing, and CVE surface
- **Increased deployment footprint** — an additional pod running in the cluster (though resource requirements are minimal: ~50m CPU, ~64Mi memory)
- **Learning curve** — contributors need to understand controller-runtime patterns to modify the operator
- **CRD management complexity** — Helm does not upgrade or delete CRDs; users may need to apply CRD manifests separately on operator upgrades

### Neutral

- **Backward compatibility preserved** — the `operator.enabled: false` fallback maintains the existing Helm hook behavior for users who haven't migrated
- **No change for memory-datastore users** — users running with `datastore.engine: memory` are unaffected (no migrations, no operator needed)
# ADR-002: Replace Helm Hook Migrations with Operator-Managed Migrations

- **Status:** Proposed
- **Date:** 2026-04-06
- **Deciders:** OpenFGA Helm Charts maintainers
- **Related ADR:** [ADR-001](001-adopt-openfga-operator.md)
- **Related Issues:** #211, #107, #120, #100, #95, #126, #132, #144

## Context

### How Migrations Work Today

The current Helm chart uses a **Helm hook Job** to run database migrations (`openfga migrate`) and a **`k8s-wait-for` init container** on the Deployment to block server startup until the migration completes.

Seven files are involved:

| File | Role |
|------|------|
| `templates/job.yaml` | Migration Job with Helm hook annotations |
| `templates/deployment.yaml` | OpenFGA Deployment + `wait-for-migration` init container |
| `templates/serviceaccount.yaml` | Shared ServiceAccount (migration + runtime) |
| `templates/rbac.yaml` | Role + RoleBinding so init container can poll Job status |
| `templates/_helpers.tpl` | Datastore environment variable helpers |
| `values.yaml` | `datastore.*`, `migrate.*`, `initContainer.*` configuration |
| `Chart.yaml` | `bitnami/common` dependency for migration sidecars |

**The migration Job** (`templates/job.yaml`) is annotated as a Helm hook:

```yaml
annotations:
  "helm.sh/hook": post-install,post-upgrade,post-rollback,post-delete
  "helm.sh/hook-delete-policy": before-hook-creation
  "helm.sh/hook-weight": "1"
```

This means Helm manages it outside the normal release lifecycle — it only runs after Helm finishes creating/upgrading all other resources.

**The wait-for init container** blocks the Deployment pods from starting:

```yaml
initContainers:
  - name: wait-for-migration
    image: "groundnuty/k8s-wait-for:v2.0"
    args: ["job-wr", "openfga-migrate"]
```

It polls the Kubernetes API (`GET /apis/batch/v1/.../jobs/openfga-migrate`) until `.status.succeeded >= 1`. This requires RBAC permissions (Role/RoleBinding for `batch/jobs` `get`/`list`).
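That RBAC looks roughly like the following — a sketch reconstructed from the description above; the resource names in the actual `templates/rbac.yaml` may differ:

```yaml
# Sketch of the Role/RoleBinding that lets the wait-for-migration init
# container poll Job status via the Kubernetes API; names are illustrative.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: openfga-job-reader
rules:
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: openfga-job-reader
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: openfga-job-reader
subjects:
  - kind: ServiceAccount
    name: openfga   # the shared ServiceAccount used by the server pods
```

This is RBAC that exists solely so an application pod can watch Kubernetes infrastructure — one of the things the operator approach removes.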
**The alternative mode** (`datastore.migrationType: initContainer`) runs migration directly inside each Deployment pod as an init container, avoiding hooks entirely but introducing redundant migration runs across replicas.

### The Six Issues

| Issue | Tool | Root Cause |
|-------|------|------------|
| **#211** | ArgoCD | ArgoCD ignores Helm hook annotations. The migration Job is never created as a managed resource. The init container waits forever for a Job that doesn't exist. |
| **#107** | ArgoCD | Same root cause. The Job is invisible in ArgoCD's UI — users can't see, debug, or manually sync it. |
| **#120** | Helm `--wait` | Circular deadlock. Helm waits for the Deployment to be ready before running post-install hooks. The Deployment is never ready because the init container waits for the hook Job. The Job never runs because Helm is waiting. |
| **#100** | FluxCD | FluxCD waits for all resources by default. The `hook-delete-policy: before-hook-creation` removes the completed Job before FluxCD can confirm the Deployment is healthy. |
| **#95** | AWS IRSA | Migration and runtime share a ServiceAccount. With IAM-based DB auth, the runtime gets DDL permissions it doesn't need (CREATE TABLE, ALTER TABLE). |
| **#126** | All | The `k8s-wait-for` image is configured in two separate places in `values.yaml`, leading to inconsistency. Related: #132 (image unmaintained, has CVEs) and #144 (pinned by mutable tag). |

### Why Helm Hooks Are Fundamentally Wrong for This

Helm hooks are a **deploy-time orchestration mechanism**. They assume Helm is the active agent running the deployment. GitOps tools (ArgoCD, FluxCD) break this assumption — they render the chart to manifests and apply them declaratively. The hook annotations are either ignored (ArgoCD) or cause ordering/cleanup conflicts (FluxCD).

This is not a bug in ArgoCD or FluxCD. It is a fundamental mismatch between Helm's imperative hook model and the declarative GitOps model.

## Decision

Replace the Helm hook migration Job and `k8s-wait-for` init container with **operator-managed migrations** as part of Stage 1 of the OpenFGA Operator (see [ADR-001](001-adopt-openfga-operator.md)).

### How It Works

The operator runs a **migration controller** that reconciles the OpenFGA Deployment:

```
┌───────────────────────────────────────────────────────┐
│ Operator Reconciliation                               │
│                                                       │
│ 1. Read Deployment → extract image tag (e.g. v1.14.0) │
│ 2. Read ConfigMap/openfga-migration-status            │
│    └── "Last migrated version: v1.13.0"               │
│ 3. Versions differ → migration needed                 │
│ 4. Create Job/openfga-migrate                         │
│    ├── ServiceAccount: openfga-migrator (DDL perms)   │
│    ├── Image: openfga/openfga:v1.14.0                 │
│    ├── Args: ["migrate"]                              │
│    └── ttlSecondsAfterFinished: 300                   │
│ 5. Watch Job until succeeded                          │
│ 6. Update ConfigMap → "version: v1.14.0"              │
│ 7. Scale Deployment replicas: 0 → 3                   │
│ 8. OpenFGA pods start, serve requests                 │
└───────────────────────────────────────────────────────┘
```

**Key design decisions within this approach:**

#### Deployment starts at replicas: 0

The Helm chart renders the Deployment with `replicas: 0` when `operator.enabled: true`. The operator scales it up only after migration succeeds. This is simpler than readiness gates or admission webhooks, and ensures no pods run against an unmigrated schema.

#### Version tracking via ConfigMap

A ConfigMap (`openfga-migration-status`) records the last successfully migrated version. The operator compares this to the Deployment's image tag to determine if migration is needed. This is:

- Simple to inspect (`kubectl get configmap openfga-migration-status -o yaml`)
- Survives operator restarts
- Can be manually deleted to force re-migration
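A sketch of what that status ConfigMap might contain. The data key is illustrative — the exact schema is an implementation detail of the operator:

```yaml
# Hypothetical contents of the migration-status ConfigMap.
apiVersion: v1
kind: ConfigMap
metadata:
  name: openfga-migration-status
data:
  migratedVersion: "v1.13.0"   # last version `openfga migrate` completed for
```

When the Deployment's image tag is `v1.14.0` and this value reads `v1.13.0`, the controller knows a migration Job is needed.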
#### Separate ServiceAccount for migrations

The operator creates a dedicated `openfga-migrator` ServiceAccount for migration Jobs. Users can annotate it with cloud IAM roles that grant DDL permissions, while the runtime ServiceAccount retains only CRUD permissions.

#### Migration Job is a regular resource

The Job created by the operator has no Helm hook annotations. It is a standard Kubernetes Job, visible to ArgoCD, FluxCD, and all Kubernetes tooling. It has an owner reference to the operator's managed resource for proper garbage collection.
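Putting those pieces together, the Job the operator creates would look roughly like this — reconstructed from the reconciliation sketch above, with field values such as `backoffLimit` shown as illustrative defaults:

```yaml
# Sketch of the operator-created migration Job: a plain Job with no hook
# annotations, a dedicated ServiceAccount, and a bounded lifetime.
apiVersion: batch/v1
kind: Job
metadata:
  name: openfga-migrate
spec:
  backoffLimit: 3                  # illustrative; would be configurable
  activeDeadlineSeconds: 300       # kills hung migrations (default 300s)
  ttlSecondsAfterFinished: 300     # garbage-collects the completed Job
  template:
    spec:
      serviceAccountName: openfga-migrator   # DDL-capable identity only here
      restartPolicy: Never
      containers:
        - name: migrate
          image: openfga/openfga:v1.14.0
          args: ["migrate"]
```

Because this is an ordinary Job, `kubectl logs job/openfga-migrate` and the ArgoCD/FluxCD UIs all work on it without special handling.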
#### Failure handling

| Failure | Behavior |
|---------|----------|
| Job fails | Operator sets `MigrationFailed` condition on Deployment. Does NOT scale up. User inspects Job logs. |
| Job hangs | `activeDeadlineSeconds` (default 300s) kills it. Operator sees failure. |
| Operator crashes | On restart, re-reads ConfigMap and Job status. Resumes from where it left off. |
| Database unreachable | Job fails to connect. Operator retries on next reconciliation (exponential backoff). |

### Sequence Comparison

**Before (Helm hooks):**

```
helm install
├── Create ServiceAccount, RBAC, Secret, Service
├── Create Deployment (with wait-for-migration init container)
│   └── Pod starts → init container polls for Job → waits...
├── [Helm finishes regular resources]
├── Run post-install hooks:
│   └── Create Job/openfga-migrate → runs openfga migrate
│       └── Job succeeds
├── Init container sees Job succeeded → exits
└── Main container starts
```

Problems: ArgoCD skips the post-install hook step entirely. FluxCD deletes the hook Job before confirming health. `--wait` deadlocks between the Deployment and the hook Job.

**After (operator-managed):**

```
helm install
├── Create ServiceAccount (runtime), ServiceAccount (migrator)
├── Create Secret, Service
├── Create Deployment (replicas: 0, no init containers)
├── Create Operator Deployment
└── [Helm is done — all resources are regular, no hooks]

Operator starts:
├── Detects Deployment image version
├── No migration status ConfigMap → migration needed
├── Creates Job/openfga-migrate (regular Job, no hooks)
│   ├── Uses openfga-migrator ServiceAccount
│   └── Runs openfga migrate → succeeds
├── Creates ConfigMap with migrated version
└── Scales Deployment to 3 replicas → pods start
```

No hooks. No init containers. No `k8s-wait-for`. All resources are regular Kubernetes objects.

### What Changes in the Helm Chart

**Removed:**

| File/Section | Reason |
|--------------|--------|
| `templates/job.yaml` | Operator creates migration Jobs |
| `templates/rbac.yaml` | No init container polling Job status |
| `values.yaml`: `initContainer.repository`, `initContainer.tag` | `k8s-wait-for` eliminated |
| `values.yaml`: `datastore.migrationType` | Operator always uses a Job internally |
| `values.yaml`: `datastore.waitForMigrations` | Operator handles ordering |
| `values.yaml`: `migrate.annotations` (hook annotations) | No Helm hooks |
| Deployment init containers for migration | Operator manages readiness via replica scaling |

**Added:**

| File/Section | Purpose |
|--------------|---------|
| `values.yaml`: `operator.enabled` | Toggle operator subchart |
| `values.yaml`: `migration.serviceAccount.*` | Separate ServiceAccount for migration Jobs |
| `values.yaml`: `migration.timeout`, `backoffLimit`, `ttlSecondsAfterFinished` | Migration Job configuration |
| `templates/serviceaccount.yaml`: second SA | Migration ServiceAccount |
| `charts/openfga-operator/` | Operator subchart |
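A sketch of the new `values.yaml` surface described in the tables above. Key names follow those tables; the defaults shown are illustrative, not decided:

```yaml
# Illustrative values.yaml fragment for operator-managed migrations.
operator:
  enabled: true              # toggle the operator subchart

migration:
  serviceAccount:
    create: true
    name: openfga-migrator
    annotations: {}          # e.g. a cloud IAM role granting DDL access
  timeout: 300               # seconds before a hung migration Job is killed
  backoffLimit: 3
  ttlSecondsAfterFinished: 300
```

Setting `operator.enabled: false` would leave all of the `migration.*` keys unused and restore the legacy hook-based path.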
**Preserved (backward compatible):**

When `operator.enabled: false`, the chart falls back to the current behavior — Helm hooks, `k8s-wait-for` init container, shared ServiceAccount. This allows gradual adoption.

## Consequences

### Positive

- **All 6 migration issues resolved** — no Helm hooks means no ArgoCD/FluxCD/`--wait` incompatibility
- **`k8s-wait-for` eliminated** — removes an unmaintained image with CVEs from the supply chain (#132, #144)
- **Least-privilege enforced** — separate ServiceAccounts for migration (DDL) and runtime (CRUD) (#95)
- **Helm chart simplified** — 2 templates removed, init container logic removed, RBAC for job-watching removed
- **Migration is observable** — the Job is a regular resource visible in all tools; the ConfigMap records migration history; operator conditions surface errors
- **Idempotent and crash-safe** — the operator can restart at any point and resume correctly

### Negative

- **Operator is a new runtime dependency** — if the operator pod is unavailable, migrations don't run (but existing running pods are unaffected)
- **Replica scaling model** — starting at `replicas: 0` means a brief period where the Deployment exists but has no pods; monitoring tools may flag this
- **Two upgrade paths to document** — `operator.enabled: true` (new) vs `operator.enabled: false` (legacy)

### Risks

- **Zero-downtime upgrades** — the initial implementation scales to 0 during migration, causing brief downtime. A future enhancement can support rolling upgrades where the new schema is backward-compatible, but this is explicitly out of scope for Stage 1.
- **ConfigMap as state store** — if the ConfigMap is accidentally deleted, the operator re-runs migration (which is safe — `openfga migrate` is idempotent). This is a feature, not a bug, but should be documented.
I like this! 👍