From 134c8573709a03bda69c5b3ada90bb40d6b8f23d Mon Sep 17 00:00:00 2001 From: Ed Milic Date: Mon, 6 Apr 2026 07:18:31 -0400 Subject: [PATCH] docs: add ADRs for OpenFGA operator proposal Propose adopting a Kubernetes operator for OpenFGA lifecycle management, covering migration handling, declarative CRDs for stores/models/tuples, and the operator deployment model as a Helm subchart. Also adds the ADR process documentation, template, and chart analysis. ADR-001: Adopt a Kubernetes Operator ADR-002: Operator-Managed Migrations ADR-003: Declarative Store Lifecycle CRDs ADR-004: Operator Deployment as Helm Subchart --- docs/adr/000-template.md | 48 ++++ docs/adr/001-adopt-openfga-operator.md | 95 ++++++++ docs/adr/002-operator-managed-migrations.md | 215 ++++++++++++++++++ .../003-declarative-store-lifecycle-crds.md | 199 ++++++++++++++++ docs/adr/004-operator-deployment-model.md | 167 ++++++++++++++ docs/adr/README.md | 180 +++++++++++++++ 6 files changed, 904 insertions(+) create mode 100644 docs/adr/000-template.md create mode 100644 docs/adr/001-adopt-openfga-operator.md create mode 100644 docs/adr/002-operator-managed-migrations.md create mode 100644 docs/adr/003-declarative-store-lifecycle-crds.md create mode 100644 docs/adr/004-operator-deployment-model.md create mode 100644 docs/adr/README.md diff --git a/docs/adr/000-template.md b/docs/adr/000-template.md new file mode 100644 index 0000000..2cc78bb --- /dev/null +++ b/docs/adr/000-template.md @@ -0,0 +1,48 @@ +# ADR-NNN: Title + +- **Status:** Proposed +- **Date:** YYYY-MM-DD +- **Deciders:** [list of people involved] +- **Related Issues:** # +- **Related ADR:** [ADR-NNN](NNN-filename.md) + +## Context + +What is the problem or situation that motivates this decision? What constraints exist? What forces are at play? + +Include enough background that someone unfamiliar with the project can understand why this decision matters. + +## Decision + +What is the change being proposed or decided? 
+ +### Alternatives Considered + +**A. [Alternative name]** + +[Description of the alternative] + +*Pros:* ... +*Cons:* ... + +**B. [Alternative name]** + +[Description of the alternative] + +*Pros:* ... +*Cons:* ... + +## Consequences + +### Positive + +- What improves as a result of this decision? + +### Negative + +- What gets harder, more complex, or more costly? + +### Risks + +- What assumptions might prove false? +- What could go wrong? diff --git a/docs/adr/001-adopt-openfga-operator.md b/docs/adr/001-adopt-openfga-operator.md new file mode 100644 index 0000000..ca84553 --- /dev/null +++ b/docs/adr/001-adopt-openfga-operator.md @@ -0,0 +1,95 @@ +# ADR-001: Adopt a Kubernetes Operator for OpenFGA Lifecycle Management + +- **Status:** Proposed +- **Date:** 2026-04-06 +- **Deciders:** OpenFGA Helm Charts maintainers +- **Related Issues:** #211, #107, #120, #100, #95, #126, #132, #143, #144 + +## Context + +The OpenFGA Helm chart currently handles all lifecycle concerns — deployment, configuration, database migrations, and secret management — through Helm templates and hooks. This approach works for simple installations but breaks down in several important scenarios: + +1. **Database migrations rely on Helm hooks**, which are incompatible with GitOps tools (ArgoCD, FluxCD) and Helm's own `--wait` flag. This is the single biggest pain point for users, accounting for 6 open issues (#211, #107, #120, #100, #95, #126). + +2. **Store provisioning, authorization model updates, and tuple management** are runtime operations that happen through the OpenFGA API. There is no declarative, GitOps-native way to manage these. Teams must use imperative scripts, CI pipelines, or manual API calls to set up stores and push models after deployment. + +3. **The migration init container** depends on `groundnuty/k8s-wait-for`, an unmaintained image with known CVEs, pinned by mutable tag (#132, #144). + +4. 
**Migration and runtime workloads share a single ServiceAccount**, violating least-privilege when cloud IAM-based database authentication (AWS IRSA, GCP Workload Identity) maps the ServiceAccount directly to a database role (#95). + +### Alternatives Considered + +**A. Fix migrations within the Helm chart (no operator)** + +- Strip Helm hook annotations from the migration Job by default, rendering it as a regular resource. +- Replace `k8s-wait-for` with a shell-based init container that polls the database schema version directly. +- Add a separate ServiceAccount for the migration Job. + +*Pros:* Lower complexity, no new component to maintain. +*Cons:* Doesn't solve the ordering problem cleanly — the Job and Deployment are created simultaneously, requiring an init container to gate startup. Still requires an image or script to poll. Doesn't address store/model/tuple lifecycle at all. + +**B. Recommend initContainer mode as default** + +- Change `datastore.migrationType` default from `"job"` to `"initContainer"`, running migrations inside each pod. + +*Pros:* No separate Job, no hooks, no `k8s-wait-for`. +*Cons:* Every pod runs migrations on startup (wasteful). Rolling updates trigger redundant migrations. Crash-loops on migration failure. Still shares ServiceAccount. No path to store lifecycle management. + +**C. Build an operator (selected)** + +- A Kubernetes operator manages migrations as internal reconciliation logic and exposes CRDs for store, model, and tuple lifecycle. + +*Pros:* Solves all migration issues. Enables GitOps-native authorization management. Follows established Kubernetes patterns (CNPG, Strimzi, cert-manager). Separates concerns cleanly. +*Cons:* Significant development and maintenance investment. New component to deploy and monitor. Learning curve for contributors. + +**D. External migration tool (e.g., Flyway, golang-migrate)** + +- Remove migrations from the chart entirely and document using an external tool. 
+ +*Pros:* Simplifies the chart completely. +*Cons:* Shifts complexity to the user. Every user must build their own migration pipeline. No standard approach across the community. + +## Decision + +We will build an **OpenFGA Kubernetes Operator** that handles: + +1. **Database migration orchestration** (Stage 1) — replacing Helm hooks, the `k8s-wait-for` init container, and shared ServiceAccount with operator-managed migration Jobs and deployment readiness gating. + +2. **Declarative store lifecycle management** (Stages 2-4) — exposing `FGAStore`, `FGAModel`, and `FGATuples` CRDs for GitOps-native authorization configuration. + +The operator will be: +- Written in Go using `controller-runtime` / kubebuilder +- Distributed as a Helm subchart dependency of the main OpenFGA chart +- Optional — users who don't need it can set `operator.enabled: false` and fall back to the existing behavior + +Development will follow a staged approach to deliver value incrementally: + +| Stage | Scope | Outcome | +|-------|-------|---------| +| 1 | Operator scaffolding + migration handling | All 6 migration issues resolved | +| 2 | `FGAStore` CRD | Declarative store provisioning | +| 3 | `FGAModel` CRD | Declarative authorization model management | +| 4 | `FGATuples` CRD | Declarative tuple management | + +## Consequences + +### Positive + +- **Resolves all 6 migration issues** (#211, #107, #120, #100, #95, #126) and related dependency issues (#132, #144) +- **Eliminates `k8s-wait-for` dependency** — removes an unmaintained, CVE-carrying image from the supply chain +- **Enables GitOps-native authorization management** — stores, models, and tuples become declarative Kubernetes resources that ArgoCD/FluxCD can sync +- **Enforces least-privilege** — separate ServiceAccounts for migration (DDL) and runtime (CRUD) +- **Simplifies the Helm chart** — removes migration Job template, init container logic, RBAC for job-status-reading, and hook annotations +- **Follows Kubernetes ecosystem 
conventions** — operators are the standard pattern for managing stateful application lifecycle + +### Negative + +- **New component to maintain** — the operator is a full Go project with its own release cycle, CI, testing, and CVE surface +- **Increased deployment footprint** — an additional pod running in the cluster (though resource requirements are minimal: ~50m CPU, ~64Mi memory) +- **Learning curve** — contributors need to understand controller-runtime patterns to modify the operator +- **CRD management complexity** — Helm does not upgrade or delete CRDs; users may need to apply CRD manifests separately on operator upgrades + +### Neutral + +- **Backward compatibility preserved** — the `operator.enabled: false` fallback maintains the existing Helm hook behavior for users who haven't migrated +- **No change for memory-datastore users** — users running with `datastore.engine: memory` are unaffected (no migrations, no operator needed) diff --git a/docs/adr/002-operator-managed-migrations.md b/docs/adr/002-operator-managed-migrations.md new file mode 100644 index 0000000..8fb0cd7 --- /dev/null +++ b/docs/adr/002-operator-managed-migrations.md @@ -0,0 +1,215 @@ +# ADR-002: Replace Helm Hook Migrations with Operator-Managed Migrations + +- **Status:** Proposed +- **Date:** 2026-04-06 +- **Deciders:** OpenFGA Helm Charts maintainers +- **Related ADR:** [ADR-001](001-adopt-openfga-operator.md) +- **Related Issues:** #211, #107, #120, #100, #95, #126, #132, #144 + +## Context + +### How Migrations Work Today + +The current Helm chart uses a **Helm hook Job** to run database migrations (`openfga migrate`) and a **`k8s-wait-for` init container** on the Deployment to block server startup until the migration completes. 
+ +Seven files are involved: + +| File | Role | +|------|------| +| `templates/job.yaml` | Migration Job with Helm hook annotations | +| `templates/deployment.yaml` | OpenFGA Deployment + `wait-for-migration` init container | +| `templates/serviceaccount.yaml` | Shared ServiceAccount (migration + runtime) | +| `templates/rbac.yaml` | Role + RoleBinding so init container can poll Job status | +| `templates/_helpers.tpl` | Datastore environment variable helpers | +| `values.yaml` | `datastore.*`, `migrate.*`, `initContainer.*` configuration | +| `Chart.yaml` | `bitnami/common` dependency for migration sidecars | + +**The migration Job** (`templates/job.yaml`) is annotated as a Helm hook: + +```yaml +annotations: + "helm.sh/hook": post-install,post-upgrade,post-rollback,post-delete + "helm.sh/hook-delete-policy": before-hook-creation + "helm.sh/hook-weight": "1" +``` + +This means Helm manages it outside the normal release lifecycle — it only runs after Helm finishes creating/upgrading all other resources. + +**The wait-for init container** blocks the Deployment pods from starting: + +```yaml +initContainers: + - name: wait-for-migration + image: "groundnuty/k8s-wait-for:v2.0" + args: ["job-wr", "openfga-migrate"] +``` + +It polls the Kubernetes API (`GET /apis/batch/v1/.../jobs/openfga-migrate`) until `.status.succeeded >= 1`. This requires RBAC permissions (Role/RoleBinding for `batch/jobs` `get`/`list`). + +**The alternative mode** (`datastore.migrationType: initContainer`) runs migration directly inside each Deployment pod as an init container, avoiding hooks entirely but introducing redundant migration runs across replicas. + +### The Six Issues + +| Issue | Tool | Root Cause | +|-------|------|-----------| +| **#211** | ArgoCD | ArgoCD ignores Helm hook annotations. The migration Job is never created as a managed resource. The init container waits forever for a Job that doesn't exist. | +| **#107** | ArgoCD | Same root cause. 
The Job is invisible in ArgoCD's UI — users can't see, debug, or manually sync it. | +| **#120** | Helm `--wait` | Circular deadlock. Helm waits for the Deployment to be ready before running post-install hooks. The Deployment is never ready because the init container waits for the hook Job. The Job never runs because Helm is waiting. | +| **#100** | FluxCD | FluxCD waits for all resources by default. The `hook-delete-policy: before-hook-creation` removes the completed Job before FluxCD can confirm the Deployment is healthy. | +| **#95** | AWS IRSA | Migration and runtime share a ServiceAccount. With IAM-based DB auth, the runtime gets DDL permissions it doesn't need (CREATE TABLE, ALTER TABLE). | +| **#126** | All | The `k8s-wait-for` image is configured in two separate places in `values.yaml`, leading to inconsistency. Related: #132 (image unmaintained, has CVEs) and #144 (pinned by mutable tag). | + +### Why Helm Hooks Are Fundamentally Wrong for This + +Helm hooks are a **deploy-time orchestration mechanism**. They assume Helm is the active agent running the deployment. GitOps tools (ArgoCD, FluxCD) break this assumption — they render the chart to manifests and apply them declaratively. The hook annotations are either ignored (ArgoCD) or cause ordering/cleanup conflicts (FluxCD). + +This is not a bug in ArgoCD or FluxCD. It is a fundamental mismatch between Helm's imperative hook model and the declarative GitOps model. + +## Decision + +Replace the Helm hook migration Job and `k8s-wait-for` init container with **operator-managed migrations** as part of Stage 1 of the OpenFGA Operator (see [ADR-001](001-adopt-openfga-operator.md)). + +### How It Works + +The operator runs a **migration controller** that reconciles the OpenFGA Deployment: + +``` +┌────────────────────────────────────────────────────────┐ +│ Operator Reconciliation │ +│ │ +│ 1. Read Deployment → extract image tag (e.g. v1.14.0) │ +│ 2. 
Read ConfigMap/openfga-migration-status │ +│ └── "Last migrated version: v1.13.0" │ +│ 3. Versions differ → migration needed │ +│ 4. Create Job/openfga-migrate │ +│ ├── ServiceAccount: openfga-migrator (DDL perms) │ +│ ├── Image: openfga/openfga:v1.14.0 │ +│ ├── Args: ["migrate"] │ +│ └── ttlSecondsAfterFinished: 300 │ +│ 5. Watch Job until succeeded │ +│ 6. Update ConfigMap → "version: v1.14.0" │ +│ 7. Scale Deployment replicas: 0 → 3 │ +│ 8. OpenFGA pods start, serve requests │ +└────────────────────────────────────────────────────────┘ +``` + +**Key design decisions within this approach:** + +#### Deployment starts at replicas: 0 + +The Helm chart renders the Deployment with `replicas: 0` when `operator.enabled: true`. The operator scales it up only after migration succeeds. This is simpler than readiness gates or admission webhooks, and ensures no pods run against an unmigrated schema. + +#### Version tracking via ConfigMap + +A ConfigMap (`openfga-migration-status`) records the last successfully migrated version. The operator compares this to the Deployment's image tag to determine if migration is needed. This is: +- Simple to inspect (`kubectl get configmap openfga-migration-status -o yaml`) +- Survives operator restarts +- Can be manually deleted to force re-migration + +#### Separate ServiceAccount for migrations + +The operator creates a dedicated `openfga-migrator` ServiceAccount for migration Jobs. Users can annotate it with cloud IAM roles that grant DDL permissions, while the runtime ServiceAccount retains only CRUD permissions. + +#### Migration Job is a regular resource + +The Job created by the operator has no Helm hook annotations. It is a standard Kubernetes Job, visible to ArgoCD, FluxCD, and all Kubernetes tooling. It has an owner reference to the operator's managed resource for proper garbage collection. 
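The decision in steps 1-3 reduces to a tag comparison. A minimal sketch in plain Go (illustrative only: the function names are assumptions, and the real controller would read the image reference and ConfigMap through the Kubernetes API):

```go
package main

import (
	"fmt"
	"strings"
)

// imageTag extracts the tag from a container image reference,
// e.g. "openfga/openfga:v1.14.0" yields "v1.14.0".
// Untagged images (including registries with ports but no tag)
// fall back to "latest".
func imageTag(image string) string {
	// Split on the last ':' so registry ports ("registry:5000/img") are handled.
	i := strings.LastIndex(image, ":")
	if i < 0 || strings.Contains(image[i:], "/") {
		return "latest"
	}
	return image[i+1:]
}

// migrationNeeded compares the Deployment's image tag against the last
// version recorded in the status ConfigMap (empty string on first install).
func migrationNeeded(deployedImage, lastMigrated string) bool {
	return imageTag(deployedImage) != lastMigrated
}

func main() {
	fmt.Println(migrationNeeded("openfga/openfga:v1.14.0", "v1.13.0")) // true
	fmt.Println(migrationNeeded("openfga/openfga:v1.14.0", "v1.14.0")) // false
}
```

If the versions differ, the controller proceeds to create the migration Job; otherwise it skips straight to ensuring the Deployment is scaled up.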
#### Failure handling

| Failure | Behavior |
|---------|----------|
| Job fails | Operator sets `MigrationFailed` condition on Deployment. Does NOT scale up. User inspects Job logs. |
| Job hangs | `activeDeadlineSeconds` (default 300s) kills it. Operator sees failure. |
| Operator crashes | On restart, re-reads ConfigMap and Job status. Resumes from where it left off. |
| Database unreachable | Job fails to connect. Operator retries on next reconciliation (exponential backoff). |

### Sequence Comparison

**Before (Helm hooks):**

```
helm install
  ├── Create ServiceAccount, RBAC, Secret, Service
  ├── Create Deployment (with wait-for-migration init container)
  │     └── Pod starts → init container polls for Job → waits...
  ├── [Helm finishes regular resources]
  ├── Run post-install hooks:
  │     └── Create Job/openfga-migrate → runs openfga migrate
  │           └── Job succeeds
  ├── Init container sees Job succeeded → exits
  └── Main container starts
```

Problems: ArgoCD never runs the hook phase, so the Job is never created. FluxCD deletes the completed hook Job before it confirms Deployment health. `--wait` deadlocks: Helm waits for the Deployment to become ready before running the hooks, while the Deployment's init container waits for the hook Job.

**After (operator-managed):**

```
helm install
  ├── Create ServiceAccount (runtime), ServiceAccount (migrator)
  ├── Create Secret, Service
  ├── Create Deployment (replicas: 0, no init containers)
  ├── Create Operator Deployment
  └── [Helm is done — all resources are regular, no hooks]

Operator starts:
  ├── Detects Deployment image version
  ├── No migration status ConfigMap → migration needed
  ├── Creates Job/openfga-migrate (regular Job, no hooks)
  │     └── Uses openfga-migrator ServiceAccount
  │           └── Runs openfga migrate → succeeds
  ├── Creates ConfigMap with migrated version
  └── Scales Deployment to 3 replicas → pods start
```

No hooks. No init containers. No `k8s-wait-for`. All resources are regular Kubernetes objects.
+ +### What Changes in the Helm Chart + +**Removed:** + +| File/Section | Reason | +|--------------|--------| +| `templates/job.yaml` | Operator creates migration Jobs | +| `templates/rbac.yaml` | No init container polling Job status | +| `values.yaml`: `initContainer.repository`, `initContainer.tag` | `k8s-wait-for` eliminated | +| `values.yaml`: `datastore.migrationType` | Operator always uses Job internally | +| `values.yaml`: `datastore.waitForMigrations` | Operator handles ordering | +| `values.yaml`: `migrate.annotations` (hook annotations) | No Helm hooks | +| Deployment init containers for migration | Operator manages readiness via replica scaling | + +**Added:** + +| File/Section | Purpose | +|--------------|---------| +| `values.yaml`: `operator.enabled` | Toggle operator subchart | +| `values.yaml`: `migration.serviceAccount.*` | Separate ServiceAccount for migration Jobs | +| `values.yaml`: `migration.timeout`, `backoffLimit`, `ttlSecondsAfterFinished` | Migration Job configuration | +| `templates/serviceaccount.yaml`: second SA | Migration ServiceAccount | +| `charts/openfga-operator/` | Operator subchart | + +**Preserved (backward compatible):** + +When `operator.enabled: false`, the chart falls back to the current behavior — Helm hooks, `k8s-wait-for` init container, shared ServiceAccount. This allows gradual adoption. 
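Putting the added values together, an operator-enabled install might configure something like the following. This is a sketch: the top-level keys match the tables above, but the sub-keys under `migration.serviceAccount` and the annotation shown are illustrative, not a finalized schema.

```yaml
operator:
  enabled: true                    # toggle the operator subchart

migration:
  serviceAccount:
    create: true
    name: openfga-migrator
    annotations:
      # example: bind this ServiceAccount to a cloud IAM role with DDL permissions
      eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/openfga-ddl
  timeout: 300                     # activeDeadlineSeconds for the migration Job
  backoffLimit: 3
  ttlSecondsAfterFinished: 300
```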
+ +## Consequences + +### Positive + +- **All 6 migration issues resolved** — no Helm hooks means no ArgoCD/FluxCD/`--wait` incompatibility +- **`k8s-wait-for` eliminated** — removes an unmaintained image with CVEs from the supply chain (#132, #144) +- **Least-privilege enforced** — separate ServiceAccounts for migration (DDL) and runtime (CRUD) (#95) +- **Helm chart simplified** — 2 templates removed, init container logic removed, RBAC for job-watching removed +- **Migration is observable** — Job is a regular resource visible in all tools; ConfigMap records migration history; operator conditions surface errors +- **Idempotent and crash-safe** — operator can restart at any point and resume correctly + +### Negative + +- **Operator is a new runtime dependency** — if the operator pod is unavailable, migrations don't run (but existing running pods are unaffected) +- **Replica scaling model** — starting at `replicas: 0` means a brief period where the Deployment exists but has no pods; monitoring tools may flag this +- **Two upgrade paths to document** — `operator.enabled: true` (new) vs `operator.enabled: false` (legacy) + +### Risks + +- **Zero-downtime upgrades** — the initial implementation scales to 0 during migration, causing brief downtime. A future enhancement can support rolling upgrades where the new schema is backward-compatible, but this is explicitly out of scope for Stage 1. +- **ConfigMap as state store** — if the ConfigMap is accidentally deleted, the operator re-runs migration (which is safe — `openfga migrate` is idempotent). This is a feature, not a bug, but should be documented. 
diff --git a/docs/adr/003-declarative-store-lifecycle-crds.md b/docs/adr/003-declarative-store-lifecycle-crds.md new file mode 100644 index 0000000..a54ee44 --- /dev/null +++ b/docs/adr/003-declarative-store-lifecycle-crds.md @@ -0,0 +1,199 @@ +# ADR-003: Declarative Store Lifecycle Management via CRDs + +- **Status:** Proposed +- **Date:** 2026-04-06 +- **Deciders:** OpenFGA Helm Charts maintainers +- **Related ADR:** [ADR-001](001-adopt-openfga-operator.md) + +## Context + +OpenFGA is an authorization service. After deploying the server, teams must perform several runtime operations to make it usable: + +1. **Create a store** — a logical container for authorization data +2. **Write an authorization model** — the DSL that defines types, relations, and permissions +3. **Write tuples** — the relationship data that the model operates on (e.g., "user:anne is owner of document:budget") + +Today, these operations happen outside Kubernetes — through the OpenFGA API, CLI (`fga`), or custom scripts in CI pipelines. There is no declarative, Kubernetes-native way to manage them. + +This creates several problems: + +- **No GitOps for authorization config** — authorization models live in scripts or API calls, not in version-controlled manifests that ArgoCD/FluxCD sync. +- **No drift detection** — if someone modifies a model or tuple via the API, there's no controller to detect and reconcile the change. +- **No cross-team ownership** — each team that uses OpenFGA must build their own tooling to manage stores and models. There's no standard pattern. +- **Manual coordination** — deploying a new version of an application that needs a model change requires coordinating the Helm upgrade with a separate model push. + +### Alternatives Considered + +**A. CLI wrapper in CI pipelines** + +Use the `fga` CLI in a CI/CD step after `helm upgrade` to create stores, push models, and write tuples. + +*Pros:* No new Kubernetes components. Works with any CI system. 
+*Cons:* Imperative, not declarative. No drift detection. Each team builds their own pipeline. Model changes are not atomic with deployments. No visibility in Kubernetes tooling. + +**B. Helm post-install hook Job** + +Add a Helm hook Job that runs `fga` CLI commands after installation. + +*Pros:* Stays within the Helm ecosystem. +*Cons:* Helm hooks are the exact problem we're solving in ADR-002. Same ArgoCD/FluxCD incompatibilities. Hook Jobs are fire-and-forget with no reconciliation. + +**C. CRDs managed by the operator (selected)** + +Expose `FGAStore`, `FGAModel`, and `FGATuples` as Custom Resource Definitions. The operator watches these resources and reconciles them against the OpenFGA API. + +*Pros:* Fully declarative. GitOps-native. Continuous reconciliation. Standard Kubernetes patterns. Teams own their auth config as manifests. +*Cons:* Requires the operator (ADR-001). CRD design and reconciliation logic add development scope. Tuple reconciliation is complex. + +## Decision + +Introduce three CRDs, built in stages after the migration handling (ADR-002) is complete: + +### Stage 2: FGAStore + +```yaml +apiVersion: openfga.dev/v1alpha1 +kind: FGAStore +metadata: + name: my-app + namespace: my-team +spec: + # Reference to the OpenFGA instance + openfgaRef: + url: openfga.openfga-system.svc:8081 + credentialsRef: + name: openfga-api-credentials # Secret with API key or client credentials + # Store display name + name: "my-app-store" +status: + storeId: "01HXYZ..." 
+ ready: true + conditions: + - type: Ready + status: "True" + lastTransitionTime: "2026-04-06T12:00:00Z" +``` + +**Controller behavior:** +- On create: call `CreateStore` API, store the returned store ID in `.status.storeId` +- On delete: call `DeleteStore` API (with finalizer to ensure cleanup) +- Idempotent: if a store with the same name exists, adopt it rather than creating a duplicate +- Status: set `Ready` condition when store is confirmed to exist + +### Stage 3: FGAModel + +```yaml +apiVersion: openfga.dev/v1alpha1 +kind: FGAModel +metadata: + name: my-app-model + namespace: my-team +spec: + storeRef: + name: my-app # References an FGAStore in the same namespace + model: | + model + schema 1.1 + type user + type organization + relations + define member: [user] + define admin: [user] + type document + relations + define reader: [user, organization#member] + define writer: [user, organization#admin] + define owner: [user] +status: + modelId: "01HABC..." + ready: true + lastWrittenHash: "sha256:a1b2c3..." # Hash of the model DSL to detect changes + conditions: + - type: Ready + status: "True" + - type: InSync + status: "True" +``` + +**Controller behavior:** +- On create/update: hash the model DSL. 
If hash differs from `.status.lastWrittenHash`, call `WriteAuthorizationModel` API +- Store the returned model ID in `.status.modelId` +- Model writes are append-only in OpenFGA (each write creates a new version), so this is safe +- Validation: optionally validate DSL syntax before calling the API (fail-fast with a clear error condition) +- The controller does NOT delete old model versions — OpenFGA retains model history + +### Stage 4: FGATuples + +```yaml +apiVersion: openfga.dev/v1alpha1 +kind: FGATuples +metadata: + name: my-app-base-tuples + namespace: my-team +spec: + storeRef: + name: my-app + tuples: + - user: "user:anne" + relation: "owner" + object: "document:budget" + - user: "team:engineering#member" + relation: "reader" + object: "folder:engineering-docs" + - user: "organization:acme#admin" + relation: "writer" + object: "folder:engineering-docs" +status: + writtenCount: 3 + ready: true + lastReconciled: "2026-04-06T12:00:00Z" + conditions: + - type: Ready + status: "True" + - type: InSync + status: "True" +``` + +**Controller behavior:** +- Maintain an **ownership model** — the controller tracks which tuples it wrote (via annotations or a status field). It only manages tuples it owns, never deleting tuples written by the application at runtime. +- On reconciliation: diff the desired tuples (from spec) against owned tuples in the store + - Tuples in spec but not in store → write them + - Tuples in store (owned) but not in spec → delete them + - Tuples in store but not owned → leave them alone +- Pagination: handle large tuple sets that exceed API response limits +- Batching: use `Write` API with batch operations to minimize API calls + +**Scope limitation:** `FGATuples` is intended for **base/static tuples** — organizational structure, role assignments, resource hierarchies. It is NOT intended to replace application-level tuple writes for dynamic data (e.g., per-request access grants). The ownership model ensures these two concerns don't interfere. 
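The ownership-aware reconciliation above is, at its core, a set diff between the desired tuples in the spec and the tuples the controller already owns. A minimal Go sketch (the type and function names are illustrative, not the operator's actual API; pagination and batching are omitted):

```go
package main

import "fmt"

// Tuple mirrors one entry of an FGATuples spec.tuples list.
type Tuple struct {
	User, Relation, Object string
}

// diffTuples computes the writes and deletes needed to make the owned
// tuples in the store match the desired spec. Only controller-owned
// tuples are passed in as `owned`, so tuples written by the application
// at runtime can never be deleted here.
func diffTuples(desired, owned []Tuple) (writes, deletes []Tuple) {
	want := map[Tuple]bool{}
	for _, t := range desired {
		want[t] = true
	}
	have := map[Tuple]bool{}
	for _, t := range owned {
		have[t] = true
	}
	for _, t := range desired {
		if !have[t] {
			writes = append(writes, t) // in spec but not in store
		}
	}
	for _, t := range owned {
		if !want[t] {
			deletes = append(deletes, t) // owned, but removed from spec
		}
	}
	return writes, deletes
}

func main() {
	desired := []Tuple{{"user:anne", "owner", "document:budget"}}
	owned := []Tuple{{"user:bob", "owner", "document:budget"}}
	w, d := diffTuples(desired, owned)
	fmt.Println(len(w), len(d)) // 1 1
}
```

In the real controller, `writes` and `deletes` would be sent as batched `Write` API calls, and the owned set would be rebuilt from the controller's tracking annotations on each reconciliation.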
+ +### CRD Design Principles + +1. **Namespace-scoped** — all CRDs are namespaced, allowing teams to manage their own stores/models/tuples in their namespace +2. **Reference-based** — `FGAModel` and `FGATuples` reference an `FGAStore` by name, not by store ID. The controller resolves the reference. +3. **Status-driven** — controllers report state via `.status.conditions` following Kubernetes conventions (`Ready`, `InSync`, error conditions) +4. **Finalizers for cleanup** — `FGAStore` uses a finalizer to ensure the store is deleted from OpenFGA when the CR is deleted +5. **Idempotent** — all operations are safe to retry. Re-running reconciliation produces the same result. +6. **`v1alpha1` API version** — signals that the CRD schema may change. We will promote to `v1beta1` and `v1` as the design stabilizes. + +## Consequences + +### Positive + +- **GitOps-native authorization management** — stores, models, and tuples are Kubernetes resources that ArgoCD/FluxCD sync from Git +- **Drift detection and reconciliation** — the operator continuously ensures the actual state matches the declared state +- **Cross-team standardization** — every team uses the same CRDs, eliminating custom scripts and CI hacks +- **Atomic deployments** — a team can include `FGAModel` in their application's Helm chart; model updates deploy alongside code changes +- **Visibility** — `kubectl get fgastores`, `kubectl get fgamodels`, `kubectl describe fgatuples` provide instant visibility into authorization configuration +- **RBAC integration** — Kubernetes RBAC controls who can create/modify stores, models, and tuples per namespace + +### Negative + +- **Significant development scope** — three controllers, each with its own reconciliation logic, error handling, and tests +- **Tuple reconciliation complexity** — diffing and ownership tracking for tuples is the most complex piece; edge cases around partial failures, pagination, and large tuple sets +- **CRD upgrade burden** — CRD schema changes 
require careful migration; Helm does not upgrade CRDs automatically +- **API dependency** — the operator must be able to reach the OpenFGA API; network issues or API downtime affect reconciliation +- **Not suitable for all tuple management** — dynamic, application-driven tuples should still be written via the API, not CRDs. Users must understand this boundary. + +### Risks + +- **FGATuples at scale** — for stores with millions of tuples, the reconciliation diff could be expensive. The ownership model mitigates this (only diff owned tuples), but documentation must clearly state that `FGATuples` is for base/static data, not high-volume dynamic writes. +- **Multi-cluster** — if OpenFGA serves multiple clusters, CRDs in one cluster may conflict with CRDs in another pointing at the same store. This is out of scope for `v1alpha1` but should be considered for future versions. diff --git a/docs/adr/004-operator-deployment-model.md b/docs/adr/004-operator-deployment-model.md new file mode 100644 index 0000000..bedb12c --- /dev/null +++ b/docs/adr/004-operator-deployment-model.md @@ -0,0 +1,167 @@ +# ADR-004: Operator Deployment as Helm Subchart Dependency + +- **Status:** Proposed +- **Date:** 2026-04-06 +- **Deciders:** OpenFGA Helm Charts maintainers +- **Related ADR:** [ADR-001](001-adopt-openfga-operator.md) + +## Context + +The OpenFGA Operator (ADR-001) needs a deployment model — how do users install it alongside or independent of the OpenFGA server? + +There are several established patterns in the Kubernetes ecosystem: + +### Alternatives Considered + +**A. Standalone operator chart (install separately)** + +Users install the operator chart first, then install the OpenFGA chart. The operator watches for OpenFGA Deployments across namespaces. + +*Example:* +```bash +helm install openfga-operator openfga/openfga-operator -n openfga-system +helm install openfga openfga/openfga -n my-namespace +``` + +*Pros:* Clean separation of concerns. 
One operator instance serves multiple OpenFGA installations. Follows the OLM/OperatorHub pattern. +*Cons:* Two install steps. Ordering dependency — operator must exist before the chart is useful. Users must manage two releases. Harder to get started. + +**B. Operator bundled in the main chart (single chart, always installed)** + +The operator Deployment, RBAC, and CRDs are templates in the main OpenFGA chart. No subchart. + +*Pros:* Simplest for users — one chart, one install. No dependency management. +*Cons:* Chart becomes larger and harder to maintain. Users who manage the operator separately (e.g., cluster-wide) can't disable it. CRDs are tied to the application chart's release cycle. Multiple OpenFGA installations in the same cluster would deploy multiple operator instances. + +**C. Operator as a conditional subchart dependency (selected)** + +The operator is a separate Helm chart (`openfga-operator`) that the main chart declares as a conditional dependency. Enabled by default, but users can disable it. + +*Example:* +```bash +# Everything in one command +helm install openfga openfga/openfga \ + --set datastore.engine=postgres \ + --set operator.enabled=true + +# Or, operator managed separately +helm install openfga-operator openfga/openfga-operator -n openfga-system +helm install openfga openfga/openfga \ + --set operator.enabled=false +``` + +*Pros:* Single install for most users. Operator chart has its own versioning. Users can disable for standalone management. Clean separation in code. +*Cons:* Subchart dependency adds some Chart.yaml complexity. CRDs still need special handling (Helm's `crds/` directory or a pre-install hook). + +**D. OLM (Operator Lifecycle Manager) only** + +Publish the operator to OperatorHub. Users install via OLM. + +*Pros:* Standard pattern for OpenShift. Handles CRD upgrades, operator upgrades, and RBAC. +*Cons:* OLM is not available on all clusters (not standard on EKS, GKE, AKS). Adds a dependency on OLM itself. 
Doesn't help Helm-only users. + +## Decision + +The operator will be distributed as a **conditional Helm subchart dependency** of the main OpenFGA chart. + +### Chart Structure + +``` +helm-charts/ +├── charts/ +│ ├── openfga/ # Main chart (existing) +│ │ ├── Chart.yaml # Declares openfga-operator as dependency +│ │ ├── values.yaml # operator.enabled: true +│ │ ├── templates/ +│ │ └── crds/ # Empty in Stage 1 +│ │ +│ └── openfga-operator/ # Operator subchart (new) +│ ├── Chart.yaml +│ ├── values.yaml +│ ├── templates/ +│ │ ├── deployment.yaml +│ │ ├── serviceaccount.yaml +│ │ ├── clusterrole.yaml +│ │ └── clusterrolebinding.yaml +│ └── crds/ # CRDs added in Stages 2-4 +│ ├── fgastore.yaml +│ ├── fgamodel.yaml +│ └── fgatuples.yaml +``` + +### Dependency Declaration + +```yaml +# charts/openfga/Chart.yaml +dependencies: + - name: openfga-operator + version: "0.1.x" + repository: "oci://ghcr.io/openfga/helm-charts" + condition: operator.enabled +``` + +### CRD Handling + +Helm has specific behavior around CRDs: + +1. **`crds/` directory** — CRDs placed here are installed on `helm install` but are **never upgraded or deleted** by Helm. This is safe but requires manual CRD upgrades. + +2. **Pre-install/pre-upgrade hook Job** — a Job that runs `kubectl apply -f` on CRD manifests before the main install/upgrade. This handles upgrades but reintroduces Helm hooks (the problem ADR-002 solves). + +3. **Static manifests applied separately** — CRDs are published as a standalone YAML file. Users run `kubectl apply -f` before `helm install`. This is the pattern used by cert-manager, Istio, and Prometheus Operator. + +**Decision:** Use the `crds/` directory in the operator subchart for initial installation. Publish CRD manifests as a standalone artifact for upgrades. Document both paths clearly. 
+ +```bash +# First install — Helm installs CRDs automatically +helm install openfga openfga/openfga + +# CRD upgrades — applied manually (Helm won't upgrade them) +kubectl apply -f https://github.com/openfga/helm-charts/releases/download/v0.2.0/crds.yaml +``` + +### Installation Modes + +| Mode | Command | Use case | +|------|---------|----------| +| **All-in-one** (default) | `helm install openfga openfga/openfga` | Most users. Single install, operator included. | +| **Operator disabled** | `helm install openfga openfga/openfga --set operator.enabled=false` | Operator managed separately or not needed (memory datastore). | +| **Operator standalone** | `helm install op openfga/openfga-operator -n openfga-system` | Cluster-wide operator serving multiple OpenFGA instances. | + +### Multi-Instance Considerations + +When multiple OpenFGA installations exist in the same cluster: + +- **All-in-one mode:** Each installation gets its own operator instance. The operator only watches resources in its own namespace. This is simple but wasteful. +- **Standalone mode:** One operator installation watches all namespaces (or a configured set). Individual OpenFGA installations set `operator.enabled=false`. This is more efficient for large clusters. 
The operator will support both modes via a `watchNamespace` configuration:
+
+```yaml
+# Operator values
+operator:
+  watchNamespace: ""                # empty = watch own namespace only (all-in-one mode)
+  # watchNamespace: "my-namespace"  # or set to a specific namespace
+  # watchAllNamespaces: true        # watch all namespaces (standalone mode)
+```
+
+## Consequences
+
+### Positive
+
+- **Single `helm install` for most users** — no ordering dependencies, no manual operator setup
+- **Opt-out available** — `operator.enabled: false` for users who manage it separately or don't need it
+- **Independent versioning** — operator chart has its own version; can be released on a different cadence than the main chart
+- **Clean code separation** — operator code and templates are in their own chart directory
+- **Standalone installation supported** — cluster admins can install one operator for multiple OpenFGA instances
+- **Consistent with ecosystem** — this is the same pattern used by charts that depend on Bitnami PostgreSQL, Redis, etc.
+
+### Negative
+
+- **CRD upgrade complexity** — Helm does not upgrade CRDs; users must apply CRD manifests separately on operator upgrades
+- **Multiple operators in all-in-one mode** — if a user installs OpenFGA in three namespaces, they get three operator pods (wasteful). Documentation should recommend standalone mode for multi-instance clusters.
+- **Subchart value passing** — configuring the operator requires prefixed values (e.g., `openfga-operator.image.tag`), which is slightly less ergonomic than top-level values
+
+### Neutral
+
+- **OLM support is not excluded** — the operator can be published to OperatorHub in the future alongside the Helm distribution. The two are not mutually exclusive.
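+
+As an illustration of the subchart value-passing point above, operator settings supplied through the parent chart are nested under the dependency's chart name. This is a sketch — only `operator.enabled` and `openfga-operator.image.tag` appear in this ADR; the full set of operator values would be defined in the operator chart's own `values.yaml`:
+
+```yaml
+# values.yaml passed to the main openfga chart (illustrative)
+operator:
+  enabled: true            # condition flag evaluated by the parent chart
+
+# Subchart values are prefixed with the dependency's chart name
+openfga-operator:
+  image:
+    tag: "0.1.0"           # illustrative tag; pin an explicit operator version
+```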
diff --git a/docs/adr/README.md b/docs/adr/README.md new file mode 100644 index 0000000..298a9e3 --- /dev/null +++ b/docs/adr/README.md @@ -0,0 +1,180 @@ +# Architecture Decision Records + +This directory contains Architecture Decision Records (ADRs) for the OpenFGA Helm Charts project. + +ADRs are short documents that capture significant architectural decisions along with their context, alternatives considered, and consequences. They serve as a decision log — not a living design doc, but a point-in-time record of *why* a decision was made. + +We follow the format described by [Michael Nygard](https://cognitect.com/blog/2011/11/15/documenting-architecture-decisions). + +## Index + +| ADR | Title | Status | Date | +|-----|-------|--------|------| +| [ADR-001](001-adopt-openfga-operator.md) | Adopt a Kubernetes Operator for OpenFGA Lifecycle Management | Proposed | 2026-04-06 | +| [ADR-002](002-operator-managed-migrations.md) | Replace Helm Hook Migrations with Operator-Managed Migrations | Proposed | 2026-04-06 | +| [ADR-003](003-declarative-store-lifecycle-crds.md) | Declarative Store Lifecycle Management via CRDs | Proposed | 2026-04-06 | +| [ADR-004](004-operator-deployment-model.md) | Operator Deployment as Helm Subchart Dependency | Proposed | 2026-04-06 | + +--- + +## What is an ADR? + +An ADR captures a single architectural decision. It records: + +- **What** was decided +- **Why** it was decided (the context and constraints at the time) +- **What alternatives** were considered and why they were rejected +- **What consequences** follow from the decision (positive, negative, and neutral) + +ADRs are **immutable once accepted** — if a decision changes, you write a new ADR that supersedes the old one rather than editing it. This preserves the history of *why* things changed over time. 
+
+## ADR Lifecycle
+
+```
+Proposed → Accepted → (optionally) Superseded or Deprecated
+    ↑
+    │ feedback loop
+    │
+ Discussion
+```
+
+### Statuses
+
+| Status | Meaning |
+|--------|---------|
+| **Proposed** | The ADR has been written and is open for discussion. No commitment has been made. |
+| **Accepted** | The decision has been agreed upon by maintainers. Implementation can proceed. |
+| **Deprecated** | The decision is no longer relevant (e.g., the feature was removed). |
+| **Superseded by ADR-XXX** | A newer ADR has replaced this decision. The old ADR links to the new one. |
+
+## How to Propose an ADR
+
+1. **Create a branch** — e.g., `docs/adr-005-my-decision`
+
+2. **Copy the template** — use `000-template.md` as a starting point
+
+3. **Write the ADR** — fill in Context, Decision, and Consequences. Focus on *why*, not *how*. The most valuable part is the Alternatives Considered section — it shows reviewers what you evaluated and why you chose this path.
+
+4. **Assign a number** — use the next sequential number. Check the index above.
+
+5. **Open a pull request** — the PR is where discussion happens. Title it: `ADR-005: <short title>`
+
+6. **Add to the index** — update the table in this README with the new entry (status: Proposed)
+
+### Proposing related ADRs together
+
+When multiple ADRs are part of a single cohesive proposal — e.g., a foundational decision and several downstream decisions that depend on it — they can be submitted in a single PR. This lets reviewers see the full picture instead of bouncing between separate PRs.
+
+When doing this:
+
+- **Explain the relationship in the PR description** — identify which ADR is the foundational decision and which are downstream. For example: "ADR-001 is the core decision to build an operator. ADR-002, 003, and 004 are downstream decisions about how the operator handles migrations, CRDs, and deployment."
+- **Each ADR can be accepted or rejected independently** — a reviewer might approve the foundational decision but push back on a downstream one. If that happens, split the PR: merge the accepted ADRs and keep the contested ones open for further discussion. +- **Keep each ADR self-contained** — even though they're in the same PR, each ADR should stand on its own. A reader should be able to understand ADR-003 without reading ADR-002 first (though they may reference each other). + +## How to Give Feedback on an ADR + +ADR review happens in the **pull request**, not by editing the ADR directly. This keeps the discussion visible and linked to the decision. + +### As a reviewer + +- **Comment on the PR** — ask questions, challenge assumptions, suggest alternatives. Good review questions: + - "Did you consider X as an alternative?" + - "What happens if Y fails?" + - "This conflicts with how we do Z — can you address that?" + - "I agree with the decision but the consequence about X should mention Y" + +- **Request changes** if you believe the decision is wrong or incomplete + +- **Approve** when you're satisfied the decision is sound and well-documented + +### As the author responding to feedback + +- **Update the ADR in the PR** based on feedback: + - Add alternatives that reviewers suggested (with your evaluation of them) + - Expand the Consequences section if reviewers identified impacts you missed + - Clarify the Context if reviewers were confused about the problem + - Adjust the Decision if feedback reveals a better approach + +- **Do NOT delete feedback-driven changes** — if a reviewer raised a valid alternative and you addressed it, the ADR is stronger for including it + +- **Resolve PR comments** as you address them so reviewers can track progress + +### Reaching consensus + +- ADRs move to **Accepted** when maintainers approve the PR +- Not every maintainer needs to approve — follow the project's normal review standards +- If consensus can't be reached, escalate 
to a synchronous discussion (meeting, call) and record the outcome in the PR +- Disagreement is fine — document it in the Consequences section as a risk or trade-off rather than hiding it + +## How to Supersede an ADR + +When a decision needs to change: + +1. **Do NOT edit the original ADR** — it's a historical record + +2. **Write a new ADR** that references the old one: + ```markdown + - **Supersedes:** [ADR-002](002-operator-managed-migrations.md) + ``` + +3. **Update the old ADR's status** — change it to: + ```markdown + - **Status:** Superseded by [ADR-007](007-new-approach.md) + ``` + +4. **Update the index** in this README + +This way, anyone reading ADR-002 knows it's been replaced and can follow the link to understand what changed and why. + +## ADR Format + +Every ADR follows this structure: + +```markdown +# ADR-NNN: Title + +- **Status:** Proposed | Accepted | Deprecated | Superseded by ADR-XXX +- **Date:** YYYY-MM-DD +- **Deciders:** Who was involved in the decision +- **Related Issues:** GitHub issue references +- **Related ADR:** Links to related ADRs + +## Context + +What is the problem or situation that motivates this decision? +Include enough background that someone unfamiliar with the project +can understand why this decision matters. + +## Decision + +What is the decision and why was it chosen? + +### Alternatives Considered + +What other options were evaluated? Why were they rejected? +This is often the most valuable section — it prevents future +contributors from re-proposing rejected approaches. + +## Consequences + +### Positive +What improves as a result of this decision? + +### Negative +What gets harder or more complex? Be honest — every decision has costs. + +### Risks +What could go wrong? What assumptions might prove false? +``` + +## Template + +A blank template is available at [000-template.md](000-template.md). + +## Tips for Writing Good ADRs + +- **Keep it short** — an ADR is one decision, not a design doc. 
If it's longer than 2-3 pages, consider splitting it. +- **Focus on why, not how** — implementation details change; the reasoning behind the decision is what matters long-term. +- **Be honest about trade-offs** — an ADR that lists only positive consequences isn't credible. Every decision has costs. +- **Write for your future self** — in 18 months, you won't remember why you chose this. The ADR should tell you. +- **Not every decision needs an ADR** — use ADRs for decisions that are hard to reverse, affect multiple components, or where the reasoning isn't obvious from the code.