From 134c8573709a03bda69c5b3ada90bb40d6b8f23d Mon Sep 17 00:00:00 2001 From: Ed Milic Date: Mon, 6 Apr 2026 07:18:31 -0400 Subject: [PATCH] docs: add ADRs for OpenFGA operator proposal Propose adopting a Kubernetes operator for OpenFGA lifecycle management, covering migration handling, declarative CRDs for stores/models/tuples, and the operator deployment model as a Helm subchart. Also adds the ADR process documentation, template, and chart analysis. ADR-001: Adopt a Kubernetes Operator ADR-002: Operator-Managed Migrations ADR-003: Declarative Store Lifecycle CRDs ADR-004: Operator Deployment as Helm Subchart --- docs/adr/000-template.md | 48 ++++ docs/adr/001-adopt-openfga-operator.md | 95 ++++++++ docs/adr/002-operator-managed-migrations.md | 215 ++++++++++++++++++ .../003-declarative-store-lifecycle-crds.md | 199 ++++++++++++++++ docs/adr/004-operator-deployment-model.md | 167 ++++++++++++++ docs/adr/README.md | 180 +++++++++++++++ 6 files changed, 904 insertions(+) create mode 100644 docs/adr/000-template.md create mode 100644 docs/adr/001-adopt-openfga-operator.md create mode 100644 docs/adr/002-operator-managed-migrations.md create mode 100644 docs/adr/003-declarative-store-lifecycle-crds.md create mode 100644 docs/adr/004-operator-deployment-model.md create mode 100644 docs/adr/README.md diff --git a/docs/adr/000-template.md b/docs/adr/000-template.md new file mode 100644 index 0000000..2cc78bb --- /dev/null +++ b/docs/adr/000-template.md @@ -0,0 +1,48 @@ +# ADR-NNN: Title + +- **Status:** Proposed +- **Date:** YYYY-MM-DD +- **Deciders:** [list of people involved] +- **Related Issues:** # +- **Related ADR:** [ADR-NNN](NNN-filename.md) + +## Context + +What is the problem or situation that motivates this decision? What constraints exist? What forces are at play? + +Include enough background that someone unfamiliar with the project can understand why this decision matters. + +## Decision + +What is the change being proposed or decided? 
+ +### Alternatives Considered + +**A. [Alternative name]** + +[Description of the alternative] + +*Pros:* ... +*Cons:* ... + +**B. [Alternative name]** + +[Description of the alternative] + +*Pros:* ... +*Cons:* ... + +## Consequences + +### Positive + +- What improves as a result of this decision? + +### Negative + +- What gets harder, more complex, or more costly? + +### Risks + +- What assumptions might prove false? +- What could go wrong? diff --git a/docs/adr/001-adopt-openfga-operator.md b/docs/adr/001-adopt-openfga-operator.md new file mode 100644 index 0000000..ca84553 --- /dev/null +++ b/docs/adr/001-adopt-openfga-operator.md @@ -0,0 +1,95 @@ +# ADR-001: Adopt a Kubernetes Operator for OpenFGA Lifecycle Management + +- **Status:** Proposed +- **Date:** 2026-04-06 +- **Deciders:** OpenFGA Helm Charts maintainers +- **Related Issues:** #211, #107, #120, #100, #95, #126, #132, #143, #144 + +## Context + +The OpenFGA Helm chart currently handles all lifecycle concerns — deployment, configuration, database migrations, and secret management — through Helm templates and hooks. This approach works for simple installations but breaks down in several important scenarios: + +1. **Database migrations rely on Helm hooks**, which are incompatible with GitOps tools (ArgoCD, FluxCD) and Helm's own `--wait` flag. This is the single biggest pain point for users, accounting for 6 open issues (#211, #107, #120, #100, #95, #126). + +2. **Store provisioning, authorization model updates, and tuple management** are runtime operations that happen through the OpenFGA API. There is no declarative, GitOps-native way to manage these. Teams must use imperative scripts, CI pipelines, or manual API calls to set up stores and push models after deployment. + +3. **The migration init container** depends on `groundnuty/k8s-wait-for`, an unmaintained image with known CVEs, pinned by mutable tag (#132, #144). + +4. 
**Migration and runtime workloads share a single ServiceAccount**, violating least-privilege when cloud IAM-based database authentication (AWS IRSA, GCP Workload Identity) maps the ServiceAccount directly to a database role (#95). + +### Alternatives Considered + +**A. Fix migrations within the Helm chart (no operator)** + +- Strip Helm hook annotations from the migration Job by default, rendering it as a regular resource. +- Replace `k8s-wait-for` with a shell-based init container that polls the database schema version directly. +- Add a separate ServiceAccount for the migration Job. + +*Pros:* Lower complexity, no new component to maintain. +*Cons:* Doesn't solve the ordering problem cleanly — the Job and Deployment are created simultaneously, requiring an init container to gate startup. Still requires an image or script to poll. Doesn't address store/model/tuple lifecycle at all. + +**B. Recommend initContainer mode as default** + +- Change `datastore.migrationType` default from `"job"` to `"initContainer"`, running migrations inside each pod. + +*Pros:* No separate Job, no hooks, no `k8s-wait-for`. +*Cons:* Every pod runs migrations on startup (wasteful). Rolling updates trigger redundant migrations. Crash-loops on migration failure. Still shares ServiceAccount. No path to store lifecycle management. + +**C. Build an operator (selected)** + +- A Kubernetes operator manages migrations as internal reconciliation logic and exposes CRDs for store, model, and tuple lifecycle. + +*Pros:* Solves all migration issues. Enables GitOps-native authorization management. Follows established Kubernetes patterns (CNPG, Strimzi, cert-manager). Separates concerns cleanly. +*Cons:* Significant development and maintenance investment. New component to deploy and monitor. Learning curve for contributors. + +**D. External migration tool (e.g., Flyway, golang-migrate)** + +- Remove migrations from the chart entirely and document using an external tool. 
+ +*Pros:* Simplifies the chart completely. +*Cons:* Shifts complexity to the user. Every user must build their own migration pipeline. No standard approach across the community. + +## Decision + +We will build an **OpenFGA Kubernetes Operator** that handles: + +1. **Database migration orchestration** (Stage 1) — replacing Helm hooks, the `k8s-wait-for` init container, and shared ServiceAccount with operator-managed migration Jobs and deployment readiness gating. + +2. **Declarative store lifecycle management** (Stages 2-4) — exposing `FGAStore`, `FGAModel`, and `FGATuples` CRDs for GitOps-native authorization configuration. + +The operator will be: +- Written in Go using `controller-runtime` / kubebuilder +- Distributed as a Helm subchart dependency of the main OpenFGA chart +- Optional — users who don't need it can set `operator.enabled: false` and fall back to the existing behavior + +Development will follow a staged approach to deliver value incrementally: + +| Stage | Scope | Outcome | +|-------|-------|---------| +| 1 | Operator scaffolding + migration handling | All 6 migration issues resolved | +| 2 | `FGAStore` CRD | Declarative store provisioning | +| 3 | `FGAModel` CRD | Declarative authorization model management | +| 4 | `FGATuples` CRD | Declarative tuple management | + +## Consequences + +### Positive + +- **Resolves all 6 migration issues** (#211, #107, #120, #100, #95, #126) and related dependency issues (#132, #144) +- **Eliminates `k8s-wait-for` dependency** — removes an unmaintained, CVE-carrying image from the supply chain +- **Enables GitOps-native authorization management** — stores, models, and tuples become declarative Kubernetes resources that ArgoCD/FluxCD can sync +- **Enforces least-privilege** — separate ServiceAccounts for migration (DDL) and runtime (CRUD) +- **Simplifies the Helm chart** — removes migration Job template, init container logic, RBAC for job-status-reading, and hook annotations +- **Follows Kubernetes ecosystem 
conventions** — operators are the standard pattern for managing stateful application lifecycle + +### Negative + +- **New component to maintain** — the operator is a full Go project with its own release cycle, CI, testing, and CVE surface +- **Increased deployment footprint** — an additional pod running in the cluster (though resource requirements are minimal: ~50m CPU, ~64Mi memory) +- **Learning curve** — contributors need to understand controller-runtime patterns to modify the operator +- **CRD management complexity** — Helm does not upgrade or delete CRDs; users may need to apply CRD manifests separately on operator upgrades + +### Neutral + +- **Backward compatibility preserved** — the `operator.enabled: false` fallback maintains the existing Helm hook behavior for users who haven't migrated +- **No change for memory-datastore users** — users running with `datastore.engine: memory` are unaffected (no migrations, no operator needed) diff --git a/docs/adr/002-operator-managed-migrations.md b/docs/adr/002-operator-managed-migrations.md new file mode 100644 index 0000000..8fb0cd7 --- /dev/null +++ b/docs/adr/002-operator-managed-migrations.md @@ -0,0 +1,215 @@ +# ADR-002: Replace Helm Hook Migrations with Operator-Managed Migrations + +- **Status:** Proposed +- **Date:** 2026-04-06 +- **Deciders:** OpenFGA Helm Charts maintainers +- **Related ADR:** [ADR-001](001-adopt-openfga-operator.md) +- **Related Issues:** #211, #107, #120, #100, #95, #126, #132, #144 + +## Context + +### How Migrations Work Today + +The current Helm chart uses a **Helm hook Job** to run database migrations (`openfga migrate`) and a **`k8s-wait-for` init container** on the Deployment to block server startup until the migration completes. 
+ +Seven files are involved: + +| File | Role | +|------|------| +| `templates/job.yaml` | Migration Job with Helm hook annotations | +| `templates/deployment.yaml` | OpenFGA Deployment + `wait-for-migration` init container | +| `templates/serviceaccount.yaml` | Shared ServiceAccount (migration + runtime) | +| `templates/rbac.yaml` | Role + RoleBinding so init container can poll Job status | +| `templates/_helpers.tpl` | Datastore environment variable helpers | +| `values.yaml` | `datastore.*`, `migrate.*`, `initContainer.*` configuration | +| `Chart.yaml` | `bitnami/common` dependency for migration sidecars | + +**The migration Job** (`templates/job.yaml`) is annotated as a Helm hook: + +```yaml +annotations: + "helm.sh/hook": post-install,post-upgrade,post-rollback,post-delete + "helm.sh/hook-delete-policy": before-hook-creation + "helm.sh/hook-weight": "1" +``` + +This means Helm manages it outside the normal release lifecycle — it only runs after Helm finishes creating/upgrading all other resources. + +**The wait-for init container** blocks the Deployment pods from starting: + +```yaml +initContainers: + - name: wait-for-migration + image: "groundnuty/k8s-wait-for:v2.0" + args: ["job-wr", "openfga-migrate"] +``` + +It polls the Kubernetes API (`GET /apis/batch/v1/.../jobs/openfga-migrate`) until `.status.succeeded >= 1`. This requires RBAC permissions (Role/RoleBinding for `batch/jobs` `get`/`list`). + +**The alternative mode** (`datastore.migrationType: initContainer`) runs migration directly inside each Deployment pod as an init container, avoiding hooks entirely but introducing redundant migration runs across replicas. + +### The Six Issues + +| Issue | Tool | Root Cause | +|-------|------|-----------| +| **#211** | ArgoCD | ArgoCD ignores Helm hook annotations. The migration Job is never created as a managed resource. The init container waits forever for a Job that doesn't exist. | +| **#107** | ArgoCD | Same root cause. 
The Job is invisible in ArgoCD's UI — users can't see, debug, or manually sync it. | +| **#120** | Helm `--wait` | Circular deadlock. Helm waits for the Deployment to be ready before running post-install hooks. The Deployment is never ready because the init container waits for the hook Job. The Job never runs because Helm is waiting. | +| **#100** | FluxCD | FluxCD waits for all resources by default. The `hook-delete-policy: before-hook-creation` removes the completed Job before FluxCD can confirm the Deployment is healthy. | +| **#95** | AWS IRSA | Migration and runtime share a ServiceAccount. With IAM-based DB auth, the runtime gets DDL permissions it doesn't need (CREATE TABLE, ALTER TABLE). | +| **#126** | All | The `k8s-wait-for` image is configured in two separate places in `values.yaml`, leading to inconsistency. Related: #132 (image unmaintained, has CVEs) and #144 (pinned by mutable tag). | + +### Why Helm Hooks Are Fundamentally Wrong for This + +Helm hooks are a **deploy-time orchestration mechanism**. They assume Helm is the active agent running the deployment. GitOps tools (ArgoCD, FluxCD) break this assumption — they render the chart to manifests and apply them declaratively. The hook annotations are either ignored (ArgoCD) or cause ordering/cleanup conflicts (FluxCD). + +This is not a bug in ArgoCD or FluxCD. It is a fundamental mismatch between Helm's imperative hook model and the declarative GitOps model. + +## Decision + +Replace the Helm hook migration Job and `k8s-wait-for` init container with **operator-managed migrations** as part of Stage 1 of the OpenFGA Operator (see [ADR-001](001-adopt-openfga-operator.md)). + +### How It Works + +The operator runs a **migration controller** that reconciles the OpenFGA Deployment: + +``` +┌────────────────────────────────────────────────────────┐ +│ Operator Reconciliation │ +│ │ +│ 1. Read Deployment → extract image tag (e.g. v1.14.0) │ +│ 2. 
Read ConfigMap/openfga-migration-status │ +│ └── "Last migrated version: v1.13.0" │ +│ 3. Versions differ → migration needed │ +│ 4. Create Job/openfga-migrate │ +│ ├── ServiceAccount: openfga-migrator (DDL perms) │ +│ ├── Image: openfga/openfga:v1.14.0 │ +│ ├── Args: ["migrate"] │ +│ └── ttlSecondsAfterFinished: 300 │ +│ 5. Watch Job until succeeded │ +│ 6. Update ConfigMap → "version: v1.14.0" │ +│ 7. Scale Deployment replicas: 0 → 3 │ +│ 8. OpenFGA pods start, serve requests │ +└────────────────────────────────────────────────────────┘ +``` + +**Key design decisions within this approach:** + +#### Deployment starts at replicas: 0 + +The Helm chart renders the Deployment with `replicas: 0` when `operator.enabled: true`. The operator scales it up only after migration succeeds. This is simpler than readiness gates or admission webhooks, and ensures no pods run against an unmigrated schema. + +#### Version tracking via ConfigMap + +A ConfigMap (`openfga-migration-status`) records the last successfully migrated version. The operator compares this to the Deployment's image tag to determine if migration is needed. This is: +- Simple to inspect (`kubectl get configmap openfga-migration-status -o yaml`) +- Survives operator restarts +- Can be manually deleted to force re-migration + +#### Separate ServiceAccount for migrations + +The operator creates a dedicated `openfga-migrator` ServiceAccount for migration Jobs. Users can annotate it with cloud IAM roles that grant DDL permissions, while the runtime ServiceAccount retains only CRUD permissions. + +#### Migration Job is a regular resource + +The Job created by the operator has no Helm hook annotations. It is a standard Kubernetes Job, visible to ArgoCD, FluxCD, and all Kubernetes tooling. It has an owner reference to the operator's managed resource for proper garbage collection. 
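The decision in steps 1-3 reduces to a tag comparison. A minimal sketch in plain Go (illustrative only: the function names are assumptions, and the real controller would read the image reference and ConfigMap through the Kubernetes API):

```go
package main

import (
	"fmt"
	"strings"
)

// imageTag extracts the tag from a container image reference,
// e.g. "openfga/openfga:v1.14.0" yields "v1.14.0".
// Untagged images (including registries with ports but no tag)
// fall back to "latest".
func imageTag(image string) string {
	// Split on the last ':' so registry ports ("registry:5000/img") are handled.
	i := strings.LastIndex(image, ":")
	if i < 0 || strings.Contains(image[i:], "/") {
		return "latest"
	}
	return image[i+1:]
}

// migrationNeeded compares the Deployment's image tag against the last
// version recorded in the status ConfigMap (empty string on first install).
func migrationNeeded(deployedImage, lastMigrated string) bool {
	return imageTag(deployedImage) != lastMigrated
}

func main() {
	fmt.Println(migrationNeeded("openfga/openfga:v1.14.0", "v1.13.0")) // true
	fmt.Println(migrationNeeded("openfga/openfga:v1.14.0", "v1.14.0")) // false
}
```

If the versions differ, the controller proceeds to create the migration Job; otherwise it skips straight to ensuring the Deployment is scaled up.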
#### Failure handling

| Failure | Behavior |
|---------|----------|
| Job fails | Operator sets `MigrationFailed` condition on Deployment. Does NOT scale up. User inspects Job logs. |
| Job hangs | `activeDeadlineSeconds` (default 300s) kills it. Operator sees failure. |
| Operator crashes | On restart, re-reads ConfigMap and Job status. Resumes from where it left off. |
| Database unreachable | Job fails to connect. Operator retries on next reconciliation (exponential backoff). |

### Sequence Comparison

**Before (Helm hooks):**

```
helm install
  ├── Create ServiceAccount, RBAC, Secret, Service
  ├── Create Deployment (with wait-for-migration init container)
  │     └── Pod starts → init container polls for Job → waits...
  ├── [Helm finishes regular resources]
  ├── Run post-install hooks:
  │     └── Create Job/openfga-migrate → runs openfga migrate
  │           └── Job succeeds
  ├── Init container sees Job succeeded → exits
  └── Main container starts
```

Problems: ArgoCD never runs the hook phase, so the Job is never created. FluxCD deletes the completed hook Job before it confirms Deployment health. `--wait` deadlocks: Helm waits for the Deployment to become ready before running the hooks, while the Deployment's init container waits for the hook Job.

**After (operator-managed):**

```
helm install
  ├── Create ServiceAccount (runtime), ServiceAccount (migrator)
  ├── Create Secret, Service
  ├── Create Deployment (replicas: 0, no init containers)
  ├── Create Operator Deployment
  └── [Helm is done — all resources are regular, no hooks]

Operator starts:
  ├── Detects Deployment image version
  ├── No migration status ConfigMap → migration needed
  ├── Creates Job/openfga-migrate (regular Job, no hooks)
  │     └── Uses openfga-migrator ServiceAccount
  │           └── Runs openfga migrate → succeeds
  ├── Creates ConfigMap with migrated version
  └── Scales Deployment to 3 replicas → pods start
```

No hooks. No init containers. No `k8s-wait-for`. All resources are regular Kubernetes objects.
+ +### What Changes in the Helm Chart + +**Removed:** + +| File/Section | Reason | +|--------------|--------| +| `templates/job.yaml` | Operator creates migration Jobs | +| `templates/rbac.yaml` | No init container polling Job status | +| `values.yaml`: `initContainer.repository`, `initContainer.tag` | `k8s-wait-for` eliminated | +| `values.yaml`: `datastore.migrationType` | Operator always uses Job internally | +| `values.yaml`: `datastore.waitForMigrations` | Operator handles ordering | +| `values.yaml`: `migrate.annotations` (hook annotations) | No Helm hooks | +| Deployment init containers for migration | Operator manages readiness via replica scaling | + +**Added:** + +| File/Section | Purpose | +|--------------|---------| +| `values.yaml`: `operator.enabled` | Toggle operator subchart | +| `values.yaml`: `migration.serviceAccount.*` | Separate ServiceAccount for migration Jobs | +| `values.yaml`: `migration.timeout`, `backoffLimit`, `ttlSecondsAfterFinished` | Migration Job configuration | +| `templates/serviceaccount.yaml`: second SA | Migration ServiceAccount | +| `charts/openfga-operator/` | Operator subchart | + +**Preserved (backward compatible):** + +When `operator.enabled: false`, the chart falls back to the current behavior — Helm hooks, `k8s-wait-for` init container, shared ServiceAccount. This allows gradual adoption. 
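Putting the added values together, an operator-enabled install might configure something like the following. This is a sketch: the top-level keys match the tables above, but the sub-keys under `migration.serviceAccount` and the annotation shown are illustrative, not a finalized schema.

```yaml
operator:
  enabled: true                    # toggle the operator subchart

migration:
  serviceAccount:
    create: true
    name: openfga-migrator
    annotations:
      # example: bind this ServiceAccount to a cloud IAM role with DDL permissions
      eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/openfga-ddl
  timeout: 300                     # activeDeadlineSeconds for the migration Job
  backoffLimit: 3
  ttlSecondsAfterFinished: 300
```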
+ +## Consequences + +### Positive + +- **All 6 migration issues resolved** — no Helm hooks means no ArgoCD/FluxCD/`--wait` incompatibility +- **`k8s-wait-for` eliminated** — removes an unmaintained image with CVEs from the supply chain (#132, #144) +- **Least-privilege enforced** — separate ServiceAccounts for migration (DDL) and runtime (CRUD) (#95) +- **Helm chart simplified** — 2 templates removed, init container logic removed, RBAC for job-watching removed +- **Migration is observable** — Job is a regular resource visible in all tools; ConfigMap records migration history; operator conditions surface errors +- **Idempotent and crash-safe** — operator can restart at any point and resume correctly + +### Negative + +- **Operator is a new runtime dependency** — if the operator pod is unavailable, migrations don't run (but existing running pods are unaffected) +- **Replica scaling model** — starting at `replicas: 0` means a brief period where the Deployment exists but has no pods; monitoring tools may flag this +- **Two upgrade paths to document** — `operator.enabled: true` (new) vs `operator.enabled: false` (legacy) + +### Risks + +- **Zero-downtime upgrades** — the initial implementation scales to 0 during migration, causing brief downtime. A future enhancement can support rolling upgrades where the new schema is backward-compatible, but this is explicitly out of scope for Stage 1. +- **ConfigMap as state store** — if the ConfigMap is accidentally deleted, the operator re-runs migration (which is safe — `openfga migrate` is idempotent). This is a feature, not a bug, but should be documented. 
diff --git a/docs/adr/003-declarative-store-lifecycle-crds.md b/docs/adr/003-declarative-store-lifecycle-crds.md new file mode 100644 index 0000000..a54ee44 --- /dev/null +++ b/docs/adr/003-declarative-store-lifecycle-crds.md @@ -0,0 +1,199 @@ +# ADR-003: Declarative Store Lifecycle Management via CRDs + +- **Status:** Proposed +- **Date:** 2026-04-06 +- **Deciders:** OpenFGA Helm Charts maintainers +- **Related ADR:** [ADR-001](001-adopt-openfga-operator.md) + +## Context + +OpenFGA is an authorization service. After deploying the server, teams must perform several runtime operations to make it usable: + +1. **Create a store** — a logical container for authorization data +2. **Write an authorization model** — the DSL that defines types, relations, and permissions +3. **Write tuples** — the relationship data that the model operates on (e.g., "user:anne is owner of document:budget") + +Today, these operations happen outside Kubernetes — through the OpenFGA API, CLI (`fga`), or custom scripts in CI pipelines. There is no declarative, Kubernetes-native way to manage them. + +This creates several problems: + +- **No GitOps for authorization config** — authorization models live in scripts or API calls, not in version-controlled manifests that ArgoCD/FluxCD sync. +- **No drift detection** — if someone modifies a model or tuple via the API, there's no controller to detect and reconcile the change. +- **No cross-team ownership** — each team that uses OpenFGA must build their own tooling to manage stores and models. There's no standard pattern. +- **Manual coordination** — deploying a new version of an application that needs a model change requires coordinating the Helm upgrade with a separate model push. + +### Alternatives Considered + +**A. CLI wrapper in CI pipelines** + +Use the `fga` CLI in a CI/CD step after `helm upgrade` to create stores, push models, and write tuples. + +*Pros:* No new Kubernetes components. Works with any CI system. 
+*Cons:* Imperative, not declarative. No drift detection. Each team builds their own pipeline. Model changes are not atomic with deployments. No visibility in Kubernetes tooling. + +**B. Helm post-install hook Job** + +Add a Helm hook Job that runs `fga` CLI commands after installation. + +*Pros:* Stays within the Helm ecosystem. +*Cons:* Helm hooks are the exact problem we're solving in ADR-002. Same ArgoCD/FluxCD incompatibilities. Hook Jobs are fire-and-forget with no reconciliation. + +**C. CRDs managed by the operator (selected)** + +Expose `FGAStore`, `FGAModel`, and `FGATuples` as Custom Resource Definitions. The operator watches these resources and reconciles them against the OpenFGA API. + +*Pros:* Fully declarative. GitOps-native. Continuous reconciliation. Standard Kubernetes patterns. Teams own their auth config as manifests. +*Cons:* Requires the operator (ADR-001). CRD design and reconciliation logic add development scope. Tuple reconciliation is complex. + +## Decision + +Introduce three CRDs, built in stages after the migration handling (ADR-002) is complete: + +### Stage 2: FGAStore + +```yaml +apiVersion: openfga.dev/v1alpha1 +kind: FGAStore +metadata: + name: my-app + namespace: my-team +spec: + # Reference to the OpenFGA instance + openfgaRef: + url: openfga.openfga-system.svc:8081 + credentialsRef: + name: openfga-api-credentials # Secret with API key or client credentials + # Store display name + name: "my-app-store" +status: + storeId: "01HXYZ..." 
+ ready: true + conditions: + - type: Ready + status: "True" + lastTransitionTime: "2026-04-06T12:00:00Z" +``` + +**Controller behavior:** +- On create: call `CreateStore` API, store the returned store ID in `.status.storeId` +- On delete: call `DeleteStore` API (with finalizer to ensure cleanup) +- Idempotent: if a store with the same name exists, adopt it rather than creating a duplicate +- Status: set `Ready` condition when store is confirmed to exist + +### Stage 3: FGAModel + +```yaml +apiVersion: openfga.dev/v1alpha1 +kind: FGAModel +metadata: + name: my-app-model + namespace: my-team +spec: + storeRef: + name: my-app # References an FGAStore in the same namespace + model: | + model + schema 1.1 + type user + type organization + relations + define member: [user] + define admin: [user] + type document + relations + define reader: [user, organization#member] + define writer: [user, organization#admin] + define owner: [user] +status: + modelId: "01HABC..." + ready: true + lastWrittenHash: "sha256:a1b2c3..." # Hash of the model DSL to detect changes + conditions: + - type: Ready + status: "True" + - type: InSync + status: "True" +``` + +**Controller behavior:** +- On create/update: hash the model DSL. 
If hash differs from `.status.lastWrittenHash`, call `WriteAuthorizationModel` API +- Store the returned model ID in `.status.modelId` +- Model writes are append-only in OpenFGA (each write creates a new version), so this is safe +- Validation: optionally validate DSL syntax before calling the API (fail-fast with a clear error condition) +- The controller does NOT delete old model versions — OpenFGA retains model history + +### Stage 4: FGATuples + +```yaml +apiVersion: openfga.dev/v1alpha1 +kind: FGATuples +metadata: + name: my-app-base-tuples + namespace: my-team +spec: + storeRef: + name: my-app + tuples: + - user: "user:anne" + relation: "owner" + object: "document:budget" + - user: "team:engineering#member" + relation: "reader" + object: "folder:engineering-docs" + - user: "organization:acme#admin" + relation: "writer" + object: "folder:engineering-docs" +status: + writtenCount: 3 + ready: true + lastReconciled: "2026-04-06T12:00:00Z" + conditions: + - type: Ready + status: "True" + - type: InSync + status: "True" +``` + +**Controller behavior:** +- Maintain an **ownership model** — the controller tracks which tuples it wrote (via annotations or a status field). It only manages tuples it owns, never deleting tuples written by the application at runtime. +- On reconciliation: diff the desired tuples (from spec) against owned tuples in the store + - Tuples in spec but not in store → write them + - Tuples in store (owned) but not in spec → delete them + - Tuples in store but not owned → leave them alone +- Pagination: handle large tuple sets that exceed API response limits +- Batching: use `Write` API with batch operations to minimize API calls + +**Scope limitation:** `FGATuples` is intended for **base/static tuples** — organizational structure, role assignments, resource hierarchies. It is NOT intended to replace application-level tuple writes for dynamic data (e.g., per-request access grants). The ownership model ensures these two concerns don't interfere. 
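The ownership-aware reconciliation above is, at its core, a set diff between the desired tuples in the spec and the tuples the controller already owns. A minimal Go sketch (the type and function names are illustrative, not the operator's actual API; pagination and batching are omitted):

```go
package main

import "fmt"

// Tuple mirrors one entry of an FGATuples spec.tuples list.
type Tuple struct {
	User, Relation, Object string
}

// diffTuples computes the writes and deletes needed to make the owned
// tuples in the store match the desired spec. Only controller-owned
// tuples are passed in as `owned`, so tuples written by the application
// at runtime can never be deleted here.
func diffTuples(desired, owned []Tuple) (writes, deletes []Tuple) {
	want := map[Tuple]bool{}
	for _, t := range desired {
		want[t] = true
	}
	have := map[Tuple]bool{}
	for _, t := range owned {
		have[t] = true
	}
	for _, t := range desired {
		if !have[t] {
			writes = append(writes, t) // in spec but not in store
		}
	}
	for _, t := range owned {
		if !want[t] {
			deletes = append(deletes, t) // owned, but removed from spec
		}
	}
	return writes, deletes
}

func main() {
	desired := []Tuple{{"user:anne", "owner", "document:budget"}}
	owned := []Tuple{{"user:bob", "owner", "document:budget"}}
	w, d := diffTuples(desired, owned)
	fmt.Println(len(w), len(d)) // 1 1
}
```

In the real controller, `writes` and `deletes` would be sent as batched `Write` API calls, and the owned set would be rebuilt from the controller's tracking annotations on each reconciliation.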
+ +### CRD Design Principles + +1. **Namespace-scoped** — all CRDs are namespaced, allowing teams to manage their own stores/models/tuples in their namespace +2. **Reference-based** — `FGAModel` and `FGATuples` reference an `FGAStore` by name, not by store ID. The controller resolves the reference. +3. **Status-driven** — controllers report state via `.status.conditions` following Kubernetes conventions (`Ready`, `InSync`, error conditions) +4. **Finalizers for cleanup** — `FGAStore` uses a finalizer to ensure the store is deleted from OpenFGA when the CR is deleted +5. **Idempotent** — all operations are safe to retry. Re-running reconciliation produces the same result. +6. **`v1alpha1` API version** — signals that the CRD schema may change. We will promote to `v1beta1` and `v1` as the design stabilizes. + +## Consequences + +### Positive + +- **GitOps-native authorization management** — stores, models, and tuples are Kubernetes resources that ArgoCD/FluxCD sync from Git +- **Drift detection and reconciliation** — the operator continuously ensures the actual state matches the declared state +- **Cross-team standardization** — every team uses the same CRDs, eliminating custom scripts and CI hacks +- **Atomic deployments** — a team can include `FGAModel` in their application's Helm chart; model updates deploy alongside code changes +- **Visibility** — `kubectl get fgastores`, `kubectl get fgamodels`, `kubectl describe fgatuples` provide instant visibility into authorization configuration +- **RBAC integration** — Kubernetes RBAC controls who can create/modify stores, models, and tuples per namespace + +### Negative + +- **Significant development scope** — three controllers, each with its own reconciliation logic, error handling, and tests +- **Tuple reconciliation complexity** — diffing and ownership tracking for tuples is the most complex piece; edge cases around partial failures, pagination, and large tuple sets +- **CRD upgrade burden** — CRD schema changes 
require careful migration; Helm does not upgrade CRDs automatically +- **API dependency** — the operator must be able to reach the OpenFGA API; network issues or API downtime affect reconciliation +- **Not suitable for all tuple management** — dynamic, application-driven tuples should still be written via the API, not CRDs. Users must understand this boundary. + +### Risks + +- **FGATuples at scale** — for stores with millions of tuples, the reconciliation diff could be expensive. The ownership model mitigates this (only diff owned tuples), but documentation must clearly state that `FGATuples` is for base/static data, not high-volume dynamic writes. +- **Multi-cluster** — if OpenFGA serves multiple clusters, CRDs in one cluster may conflict with CRDs in another pointing at the same store. This is out of scope for `v1alpha1` but should be considered for future versions. diff --git a/docs/adr/004-operator-deployment-model.md b/docs/adr/004-operator-deployment-model.md new file mode 100644 index 0000000..bedb12c --- /dev/null +++ b/docs/adr/004-operator-deployment-model.md @@ -0,0 +1,167 @@ +# ADR-004: Operator Deployment as Helm Subchart Dependency + +- **Status:** Proposed +- **Date:** 2026-04-06 +- **Deciders:** OpenFGA Helm Charts maintainers +- **Related ADR:** [ADR-001](001-adopt-openfga-operator.md) + +## Context + +The OpenFGA Operator (ADR-001) needs a deployment model — how do users install it alongside or independent of the OpenFGA server? + +There are several established patterns in the Kubernetes ecosystem: + +### Alternatives Considered + +**A. Standalone operator chart (install separately)** + +Users install the operator chart first, then install the OpenFGA chart. The operator watches for OpenFGA Deployments across namespaces. + +*Example:* +```bash +helm install openfga-operator openfga/openfga-operator -n openfga-system +helm install openfga openfga/openfga -n my-namespace +``` + +*Pros:* Clean separation of concerns. 
One operator instance serves multiple OpenFGA installations. Follows the OLM/OperatorHub pattern. +*Cons:* Two install steps. Ordering dependency — operator must exist before the chart is useful. Users must manage two releases. Harder to get started. + +**B. Operator bundled in the main chart (single chart, always installed)** + +The operator Deployment, RBAC, and CRDs are templates in the main OpenFGA chart. No subchart. + +*Pros:* Simplest for users — one chart, one install. No dependency management. +*Cons:* Chart becomes larger and harder to maintain. Users who manage the operator separately (e.g., cluster-wide) can't disable it. CRDs are tied to the application chart's release cycle. Multiple OpenFGA installations in the same cluster would deploy multiple operator instances. + +**C. Operator as a conditional subchart dependency (selected)** + +The operator is a separate Helm chart (`openfga-operator`) that the main chart declares as a conditional dependency. Enabled by default, but users can disable it. + +*Example:* +```bash +# Everything in one command +helm install openfga openfga/openfga \ + --set datastore.engine=postgres \ + --set operator.enabled=true + +# Or, operator managed separately +helm install openfga-operator openfga/openfga-operator -n openfga-system +helm install openfga openfga/openfga \ + --set operator.enabled=false +``` + +*Pros:* Single install for most users. Operator chart has its own versioning. Users can disable for standalone management. Clean separation in code. +*Cons:* Subchart dependency adds some Chart.yaml complexity. CRDs still need special handling (Helm's `crds/` directory or a pre-install hook). + +**D. OLM (Operator Lifecycle Manager) only** + +Publish the operator to OperatorHub. Users install via OLM. + +*Pros:* Standard pattern for OpenShift. Handles CRD upgrades, operator upgrades, and RBAC. +*Cons:* OLM is not available on all clusters (not standard on EKS, GKE, AKS). Adds a dependency on OLM itself. 
Doesn't help Helm-only users. + +## Decision + +The operator will be distributed as a **conditional Helm subchart dependency** of the main OpenFGA chart. + +### Chart Structure + +``` +helm-charts/ +├── charts/ +│ ├── openfga/ # Main chart (existing) +│ │ ├── Chart.yaml # Declares openfga-operator as dependency +│ │ ├── values.yaml # operator.enabled: true +│ │ ├── templates/ +│ │ └── crds/ # Empty in Stage 1 +│ │ +│ └── openfga-operator/ # Operator subchart (new) +│ ├── Chart.yaml +│ ├── values.yaml +│ ├── templates/ +│ │ ├── deployment.yaml +│ │ ├── serviceaccount.yaml +│ │ ├── clusterrole.yaml +│ │ └── clusterrolebinding.yaml +│ └── crds/ # CRDs added in Stages 2-4 +│ ├── fgastore.yaml +│ ├── fgamodel.yaml +│ └── fgatuples.yaml +``` + +### Dependency Declaration + +```yaml +# charts/openfga/Chart.yaml +dependencies: + - name: openfga-operator + version: "0.1.x" + repository: "oci://ghcr.io/openfga/helm-charts" + condition: operator.enabled +``` + +### CRD Handling + +Helm has specific behavior around CRDs: + +1. **`crds/` directory** — CRDs placed here are installed on `helm install` but are **never upgraded or deleted** by Helm. This is safe but requires manual CRD upgrades. + +2. **Pre-install/pre-upgrade hook Job** — a Job that runs `kubectl apply -f` on CRD manifests before the main install/upgrade. This handles upgrades but reintroduces Helm hooks (the problem ADR-002 solves). + +3. **Static manifests applied separately** — CRDs are published as a standalone YAML file. Users run `kubectl apply -f` before `helm install`. This is the pattern used by cert-manager, Istio, and Prometheus Operator. + +**Decision:** Use the `crds/` directory in the operator subchart for initial installation. Publish CRD manifests as a standalone artifact for upgrades. Document both paths clearly. 
+ +```bash +# First install — Helm installs CRDs automatically +helm install openfga openfga/openfga + +# CRD upgrades — applied manually (Helm won't upgrade them) +kubectl apply -f https://github.com/openfga/helm-charts/releases/download/v0.2.0/crds.yaml +``` + +### Installation Modes + +| Mode | Command | Use case | +|------|---------|----------| +| **All-in-one** (default) | `helm install openfga openfga/openfga` | Most users. Single install, operator included. | +| **Operator disabled** | `helm install openfga openfga/openfga --set operator.enabled=false` | Operator managed separately or not needed (memory datastore). | +| **Operator standalone** | `helm install op openfga/openfga-operator -n openfga-system` | Cluster-wide operator serving multiple OpenFGA instances. | + +### Multi-Instance Considerations + +When multiple OpenFGA installations exist in the same cluster: + +- **All-in-one mode:** Each installation gets its own operator instance. The operator only watches resources in its own namespace. This is simple but wasteful. +- **Standalone mode:** One operator installation watches all namespaces (or a configured set). Individual OpenFGA installations set `operator.enabled=false`. This is more efficient for large clusters. 
The operator will support both modes via a `watchNamespace` configuration:
+
+```yaml
+# Operator values
+operator:
+  watchNamespace: ""                # empty = watch own namespace only (all-in-one mode)
+  # watchNamespace: "my-namespace"  # or set to a specific namespace
+  # watchAllNamespaces: true        # watch all namespaces (standalone mode)
+```
+
+## Consequences
+
+### Positive
+
+- **Single `helm install` for most users** — no ordering dependencies, no manual operator setup
+- **Opt-out available** — `operator.enabled: false` for users who manage it separately or don't need it
+- **Independent versioning** — operator chart has its own version; can be released on a different cadence than the main chart
+- **Clean code separation** — operator code and templates are in their own chart directory
+- **Standalone installation supported** — cluster admins can install one operator for multiple OpenFGA instances
+- **Consistent with ecosystem** — this is the same pattern used by charts that depend on Bitnami PostgreSQL, Redis, etc.
+
+### Negative
+
+- **CRD upgrade complexity** — Helm does not upgrade CRDs; users must apply CRD manifests separately on operator upgrades
+- **Multiple operators in all-in-one mode** — if a user installs OpenFGA in three namespaces, they get three operator pods (wasteful). Documentation should recommend standalone mode for multi-instance clusters.
+- **Subchart value passing** — configuring the operator requires prefixed values (e.g., `openfga-operator.image.tag`), which is slightly less ergonomic than top-level values
+
+### Neutral
+
+- **OLM support is not excluded** — the operator can be published to OperatorHub in the future alongside the Helm distribution. The two are not mutually exclusive.
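+
+As an illustration of the subchart value-passing point above, operator settings supplied through the parent chart are nested under the dependency's chart name. This is a sketch — only `operator.enabled` and `openfga-operator.image.tag` appear in this ADR; the full set of operator values would be defined in the operator chart's own `values.yaml`:
+
+```yaml
+# values.yaml passed to the main openfga chart (illustrative)
+operator:
+  enabled: true            # condition flag evaluated by the parent chart
+
+# Subchart values are prefixed with the dependency's chart name
+openfga-operator:
+  image:
+    tag: "0.1.0"           # illustrative tag; pin an explicit operator version
+```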
diff --git a/docs/adr/README.md b/docs/adr/README.md new file mode 100644 index 0000000..298a9e3 --- /dev/null +++ b/docs/adr/README.md @@ -0,0 +1,180 @@ +# Architecture Decision Records + +This directory contains Architecture Decision Records (ADRs) for the OpenFGA Helm Charts project. + +ADRs are short documents that capture significant architectural decisions along with their context, alternatives considered, and consequences. They serve as a decision log — not a living design doc, but a point-in-time record of *why* a decision was made. + +We follow the format described by [Michael Nygard](https://cognitect.com/blog/2011/11/15/documenting-architecture-decisions). + +## Index + +| ADR | Title | Status | Date | +|-----|-------|--------|------| +| [ADR-001](001-adopt-openfga-operator.md) | Adopt a Kubernetes Operator for OpenFGA Lifecycle Management | Proposed | 2026-04-06 | +| [ADR-002](002-operator-managed-migrations.md) | Replace Helm Hook Migrations with Operator-Managed Migrations | Proposed | 2026-04-06 | +| [ADR-003](003-declarative-store-lifecycle-crds.md) | Declarative Store Lifecycle Management via CRDs | Proposed | 2026-04-06 | +| [ADR-004](004-operator-deployment-model.md) | Operator Deployment as Helm Subchart Dependency | Proposed | 2026-04-06 | + +--- + +## What is an ADR? + +An ADR captures a single architectural decision. It records: + +- **What** was decided +- **Why** it was decided (the context and constraints at the time) +- **What alternatives** were considered and why they were rejected +- **What consequences** follow from the decision (positive, negative, and neutral) + +ADRs are **immutable once accepted** — if a decision changes, you write a new ADR that supersedes the old one rather than editing it. This preserves the history of *why* things changed over time. 
+
+## ADR Lifecycle
+
+```
+Proposed → Accepted → (optionally) Superseded or Deprecated
+    ↑
+    │ feedback loop
+    │
+ Discussion
+```
+
+### Statuses
+
+| Status | Meaning |
+|--------|---------|
+| **Proposed** | The ADR has been written and is open for discussion. No commitment has been made. |
+| **Accepted** | The decision has been agreed upon by maintainers. Implementation can proceed. |
+| **Deprecated** | The decision is no longer relevant (e.g., the feature was removed). |
+| **Superseded by ADR-XXX** | A newer ADR has replaced this decision. The old ADR links to the new one. |
+
+## How to Propose an ADR
+
+1. **Create a branch** — e.g., `docs/adr-005-my-decision`
+
+2. **Copy the template** — use `000-template.md` as a starting point
+
+3. **Write the ADR** — fill in Context, Decision, and Consequences. Focus on *why*, not *how*. The most valuable part is the Alternatives Considered section — it shows reviewers what you evaluated and why you chose this path.
+
+4. **Assign a number** — use the next sequential number. Check the index above.
+
+5. **Open a pull request** — the PR is where discussion happens. Title it: `ADR-005: <short title>`
+
+6. **Add to the index** — update the table in this README with the new entry (status: Proposed)
+
+### Proposing related ADRs together
+
+When multiple ADRs are part of a single cohesive proposal — e.g., a foundational decision and several downstream decisions that depend on it — they can be submitted in a single PR. This lets reviewers see the full picture instead of bouncing between separate PRs.
+
+When doing this:
+
+- **Explain the relationship in the PR description** — identify which ADR is the foundational decision and which are downstream. For example: "ADR-001 is the core decision to build an operator. ADR-002, 003, and 004 are downstream decisions about how the operator handles migrations, CRDs, and deployment."
+- **Each ADR can be accepted or rejected independently** — a reviewer might approve the foundational decision but push back on a downstream one. If that happens, split the PR: merge the accepted ADRs and keep the contested ones open for further discussion. +- **Keep each ADR self-contained** — even though they're in the same PR, each ADR should stand on its own. A reader should be able to understand ADR-003 without reading ADR-002 first (though they may reference each other). + +## How to Give Feedback on an ADR + +ADR review happens in the **pull request**, not by editing the ADR directly. This keeps the discussion visible and linked to the decision. + +### As a reviewer + +- **Comment on the PR** — ask questions, challenge assumptions, suggest alternatives. Good review questions: + - "Did you consider X as an alternative?" + - "What happens if Y fails?" + - "This conflicts with how we do Z — can you address that?" + - "I agree with the decision but the consequence about X should mention Y" + +- **Request changes** if you believe the decision is wrong or incomplete + +- **Approve** when you're satisfied the decision is sound and well-documented + +### As the author responding to feedback + +- **Update the ADR in the PR** based on feedback: + - Add alternatives that reviewers suggested (with your evaluation of them) + - Expand the Consequences section if reviewers identified impacts you missed + - Clarify the Context if reviewers were confused about the problem + - Adjust the Decision if feedback reveals a better approach + +- **Do NOT delete feedback-driven changes** — if a reviewer raised a valid alternative and you addressed it, the ADR is stronger for including it + +- **Resolve PR comments** as you address them so reviewers can track progress + +### Reaching consensus + +- ADRs move to **Accepted** when maintainers approve the PR +- Not every maintainer needs to approve — follow the project's normal review standards +- If consensus can't be reached, escalate 
to a synchronous discussion (meeting, call) and record the outcome in the PR +- Disagreement is fine — document it in the Consequences section as a risk or trade-off rather than hiding it + +## How to Supersede an ADR + +When a decision needs to change: + +1. **Do NOT edit the original ADR** — it's a historical record + +2. **Write a new ADR** that references the old one: + ```markdown + - **Supersedes:** [ADR-002](002-operator-managed-migrations.md) + ``` + +3. **Update the old ADR's status** — change it to: + ```markdown + - **Status:** Superseded by [ADR-007](007-new-approach.md) + ``` + +4. **Update the index** in this README + +This way, anyone reading ADR-002 knows it's been replaced and can follow the link to understand what changed and why. + +## ADR Format + +Every ADR follows this structure: + +```markdown +# ADR-NNN: Title + +- **Status:** Proposed | Accepted | Deprecated | Superseded by ADR-XXX +- **Date:** YYYY-MM-DD +- **Deciders:** Who was involved in the decision +- **Related Issues:** GitHub issue references +- **Related ADR:** Links to related ADRs + +## Context + +What is the problem or situation that motivates this decision? +Include enough background that someone unfamiliar with the project +can understand why this decision matters. + +## Decision + +What is the decision and why was it chosen? + +### Alternatives Considered + +What other options were evaluated? Why were they rejected? +This is often the most valuable section — it prevents future +contributors from re-proposing rejected approaches. + +## Consequences + +### Positive +What improves as a result of this decision? + +### Negative +What gets harder or more complex? Be honest — every decision has costs. + +### Risks +What could go wrong? What assumptions might prove false? +``` + +## Template + +A blank template is available at [000-template.md](000-template.md). + +## Tips for Writing Good ADRs + +- **Keep it short** — an ADR is one decision, not a design doc. 
If it's longer than 2-3 pages, consider splitting it. +- **Focus on why, not how** — implementation details change; the reasoning behind the decision is what matters long-term. +- **Be honest about trade-offs** — an ADR that lists only positive consequences isn't credible. Every decision has costs. +- **Write for your future self** — in 18 months, you won't remember why you chose this. The ADR should tell you. +- **Not every decision needs an ADR** — use ADRs for decisions that are hard to reverse, affect multiple components, or where the reasoning isn't obvious from the code.