From 03a7a2ae28e2f5e99361632a67d7b037357862f7 Mon Sep 17 00:00:00 2001
From: JARVIS-Agent <JARVIS-coding-Agent@users.noreply.github.com>
Date: Mon, 13 Apr 2026 16:36:34 +0800
Subject: [PATCH 1/8] docs: add upgrade SOP for Helm-based deployments

Co-Authored-By: JARVIS-Agent <JARVIS-coding-Agent@users.noreply.github.com>
---
 docs/openab-upgrade-sop.md | 379 +++++++++++++++++++++++++++++++++++++
 1 file changed, 379 insertions(+)
 create mode 100644 docs/openab-upgrade-sop.md
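
Reviewer note on the "Verify the Backup" section: the check there only lists empty files, so a file that was never written goes unnoticed. A minimal sketch of a stricter check follows, assuming the file names written by the one-click backup script (`values.yaml`, `hosts.yml`, `helm-history.txt`). The function name `verify_backup` is hypothetical and not part of this patch.

```bash
#!/usr/bin/env bash
# Sketch only: report every expected backup file that is missing or empty.
# verify_backup is a hypothetical helper; file names are taken from the
# one-click backup script and may differ in your setup.

verify_backup() {
  local dir="$1"; shift
  local missing=0 f
  for f in "$@"; do
    # -s is true only when the file exists and has a size greater than zero
    if [ ! -s "$dir/$f" ]; then
      echo "MISSING_OR_EMPTY: $f"
      missing=1
    fi
  done
  return "$missing"
}

# Example call (not executed here):
#   verify_backup "$BACKUP_DIR" values.yaml hosts.yml helm-history.txt || exit 1
```

Returning non-zero lets a calling script stop before the upgrade instead of relying on someone reading the `find` output.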

diff --git a/docs/openab-upgrade-sop.md b/docs/openab-upgrade-sop.md
new file mode 100644
index 00000000..636a634c
--- /dev/null
+++ b/docs/openab-upgrade-sop.md
@@ -0,0 +1,379 @@
+# OpenAB Version Upgrade SOP
+
+## Environment Reference
+
+| Item | Details |
+|---|---|
+| Deployment Method | Kubernetes (Helm Chart) |
+| Deployment Name | `openab-kiro` |
+| Pod Label | `app.kubernetes.io/instance=openab` |
+| Helm Repo (GitHub Pages) | `https://openabdev.github.io/openab` |
+| Helm Repo (OCI) | `oci://ghcr.io/openabdev/charts/openab` |
+| Image Registry | `ghcr.io/openabdev/openab` |
+| Git Repo | `github.com/openabdev/openab` |
+| Agent Config | `/home/agent/.kiro/agents/default.json` |
+| Steering Files | `/home/agent/.kiro/steering/` |
+| PVC Mount Path | `/home/agent` (Helm); `.kiro` / `.local/share/kiro-cli` (raw k8s) |
+| KUBECONFIG | `~/.kube/config` (must be set explicitly — default k3s config has insufficient permissions) |
+
+> ⚠️ The local kubectl defaults to reading `/etc/rancher/k3s/k3s.yaml`, which will result in a permission denied error. Before running any command, always set:
+> ```bash
+> export KUBECONFIG=~/.kube/config
+> ```
+
+---
+
+## Upgrade Process Overview
+
+```
+┌─────────────────────────────────────────────────────────────┐
+│                     Pre-Upgrade Preparation                  │
+│  Check version info → Read Release Notes → Announce outage  │
+└────────────────────────┬────────────────────────────────────┘
+                         │
+                         ▼
+┌─────────────────────────────────────────────────────────────┐
+│                          Backup                              │
+│  Helm values / Agent config / Steering / hosts.yml / PVC    │
+└────────────────────────┬────────────────────────────────────┘
+                         │
+                         ▼
+┌─────────────────────────────────────────────────────────────┐
+│                  Upgrade Execution (2 Phases)                │
+│                                                             │
+│  Step 1: Pre-release Validation                             │
+│    helm upgrade --version x.x.x-beta.1                     │
+│    └─ Discord functional test                               │
+│         ├─ Pass ──────────────────────────┐                 │
+│         └─ Fail → Wait for beta.2, retry  │                 │
+│                                           ▼                 │
+│  Step 2: Promote to Stable                                  │
+│    helm upgrade --version x.x.x                            │
+│    └─ kubectl rollout status                               │
+└────────────────────────┬────────────────────────────────────┘
+                         │
+                         ▼
+┌─────────────────────────────────────────────────────────────┐
+│                    Post-Upgrade Verification                  │
+│  Pod Ready? → Version check → Log check → Discord E2E test  │
+│       │                                                     │
+│       ├─ All pass → Send completion notice ✅               │
+│       └─ Issues   → Proceed to rollback ↓                  │
+└────────────────────────┬────────────────────────────────────┘
+                         │ (on failure)
+                         ▼
+┌─────────────────────────────────────────────────────────────┐
+│                       Rollback                               │
+│                                                             │
+│  Diagnose symptom                                           │
+│   ├─ Pod won't start    → helm rollback <REVISION>          │
+│   ├─ Broken / Pod OK    → rollback or hotfix                │
+│   ├─ Config lost        → restore from backup               │
+│   └─ Bot unresponsive   → kubectl rollout restart → rollback │
+│                                                             │
+│  Rollback done → Re-run verification → Send rollback notice │
+└─────────────────────────────────────────────────────────────┘
+```
+
+---
+
+## I. Pre-Upgrade Preparation
+
+### 1. Check Current Version
+
+> ℹ️ OpenAB is a pre-compiled Rust binary shipped inside a Docker image. There is **no source code or git repository** inside the container — version information must be retrieved from Helm or the image tag.
+
+```bash
+# Get the current running Pod
+POD=$(kubectl get pod -l app.kubernetes.io/instance=openab -o jsonpath='{.items[0].metadata.name}')
+
+# Check deployed Helm chart version and image tag
+helm list -f openab
+helm status openab
+
+# Check the actual image the Pod is running (including tag / SHA)
+kubectl get deployment openab-kiro \
+  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'
+
+# Check latest releases on GitHub
+# Visit https://github.com/openabdev/openab/releases
+
+# List available versions from the Helm repo (requires repo to be added first — see Section III)
+helm search repo openab --versions
+```
+
+### 2. Read the Release Notes
+
+- Go to `https://github.com/openabdev/openab/releases/tag/<target-version>`
+- Pay special attention to:
+  - Breaking changes
+  - Helm Chart values changes
+  - Added or deprecated environment variables
+  - Any migration steps
+
+### 3. Announce the Upgrade
+
+> ⚠️ **Downtime is expected during every upgrade.** The deployment strategy is `Recreate` because the PVC is ReadWriteOnce, which does not support RollingUpdate. The old Pod is terminated before the new one starts — the Discord bot will be unavailable during this window, and this is expected behaviour.
+
+- Notify all users via Discord channel / Slack / email:
+  - Scheduled upgrade time and estimated downtime (typically 1–3 minutes)
+  - Scope of impact (Discord bot will be offline during the upgrade)
+  - Emergency contact
+
+---
+
+## II. Backup
+
+### Backup Checklist
+
+| Item | Command | Notes |
+|---|---|---|
+| Helm values | `helm get values openab -o yaml > openab-values-backup-$(date +%Y%m%d).yaml` | Current Helm deployment parameters |
+| Agent config | `kubectl cp $POD:/home/agent/.kiro/agents/default.json ./backup-default.json` | Custom agent settings (model, prompt, tools, etc.) |
+| Steering files | `kubectl cp $POD:/home/agent/.kiro/steering/ ./backup-steering/` | Steering docs such as IDENTITY.md |
+| Skills | `kubectl cp $POD:/home/agent/.kiro/skills/ ./backup-skills/` | Custom agent skills (if any; skip if path does not exist) |
+| hosts.yml | `kubectl cp $POD:/home/agent/.config/gh/hosts.yml ./backup-hosts.yml` | GitHub CLI credentials (including multi-account configs) |
+| STT API Key | `kubectl get secret openab-kiro -o jsonpath='{.data.stt-api-key}' \| base64 -d > ./backup-stt-api-key.txt` | Required only if STT is enabled (`stt.enabled: true`) |
+| PVC data | See "PVC Backup" section below | Persistent data mounted to the Pod (⚠️ manual step required) |
+| Helm release history | `helm history openab > openab-helm-history-$(date +%Y%m%d).txt` | Useful reference for rollback |
+
+### One-Click Backup Script
+
+```bash
+#!/bin/bash
+set -e
+
+BACKUP_DIR="openab-backup-$(date +%Y%m%d-%H%M%S)"
+mkdir -p "$BACKUP_DIR"
+
+POD=$(kubectl get pod -l app.kubernetes.io/instance=openab -o jsonpath='{.items[0].metadata.name}')
+if [ -z "$POD" ]; then
+  echo "❌ OpenAB Pod not found. Aborting backup." && exit 1
+fi
+
+backup_step() {
+  local desc="$1"; shift
+  echo "=== $desc ==="
+  if ! "$@"; then
+    echo "❌ Failed: $desc" && exit 1
+  fi
+}
+
+backup_step "Backup Helm values"   bash -c "helm get values openab -o yaml > $BACKUP_DIR/values.yaml"
+backup_step "Backup Agent config"  kubectl cp "$POD:/home/agent/.kiro/agents/" "$BACKUP_DIR/agents/"
+backup_step "Backup Steering files" kubectl cp "$POD:/home/agent/.kiro/steering/" "$BACKUP_DIR/steering/"
+kubectl cp "$POD:/home/agent/.kiro/skills/" "$BACKUP_DIR/skills/" 2>/dev/null || echo "⚠️ skills/ directory not found, skipping"
+backup_step "Backup hosts.yml"     kubectl cp "$POD:/home/agent/.config/gh/hosts.yml" "$BACKUP_DIR/hosts.yml"
+backup_step "Backup Helm history"  bash -c "helm history openab > $BACKUP_DIR/helm-history.txt"
+
+echo "=== ✅ Backup complete: $BACKUP_DIR ==="
+ls -la "$BACKUP_DIR/"
+```
+
+### Verify the Backup
+
+```bash
+# Check for empty files in the backup directory
+find $BACKUP_DIR -type f -empty
+
+# Confirm values.yaml is readable
+cat $BACKUP_DIR/values.yaml | head -20
+```
+
+### PVC Backup (⚠️ Manual Step)
+
+> This step must be performed manually based on your PVC type. It cannot be automated.
+
+```bash
+# 1. List PVCs mounted to the Pod
+kubectl get pod $POD -o jsonpath='{range .spec.volumes[*]}{.name}{"\t"}{.persistentVolumeClaim.claimName}{"\n"}{end}'
+
+# 2. Option A: VolumeSnapshot (recommended — requires CSI driver support)
+kubectl apply -f - <<EOF
+apiVersion: snapshot.storage.k8s.io/v1
+kind: VolumeSnapshot
+metadata:
+  name: openab-pvc-snapshot-$(date +%Y%m%d)
+spec:
+  volumeSnapshotClassName: <your-snapshot-class>
+  source:
+    persistentVolumeClaimName: <pvc-name>
+EOF
+
+# 3. Option B: kubectl cp (suitable for small data volumes)
+kubectl cp $POD:<pvc-mount-path> $BACKUP_DIR/pvc-data/
+```
+
+---
+
+## III. Upgrade Execution
+
+### Pre-check: Confirm Helm Repo is Configured
+
+```bash
+# GitHub Pages (stable releases — recommended for most cases)
+helm repo add openab https://openabdev.github.io/openab
+helm repo update
+
+# List available versions
+helm search repo openab --versions
+
+# Or query via OCI Registry
+helm show chart oci://ghcr.io/openabdev/charts/openab --version <target-version>
+```
+
+### Step 1: Pre-release Validation (Required)
+
+> ⚠️ Per project convention, **a stable release must be preceded by a validated pre-release**. Do not skip directly to Step 2.
+
+```bash
+BACKUP_VALUES="<backup-dir>/values.yaml"  # e.g. openab-backup-20260413-070000/values.yaml
+
+# Dry-run the pre-release version first
+helm upgrade openab openab/openab \
+  --version <target-version>-beta.1 \
+  -f "$BACKUP_VALUES" \
+  --dry-run
+
+# Deploy the pre-release
+helm upgrade openab openab/openab \
+  --version <target-version>-beta.1 \
+  -f "$BACKUP_VALUES"
+
+kubectl rollout status deployment/openab-kiro --timeout=300s
+
+# Run functional tests in the Discord channel
+# Verify basic conversation and tool calls work as expected
+# If issues are found, wait for beta.2 and repeat this step
+```
+
+### Step 2: Promote to Stable
+
+```bash
+BACKUP_VALUES="<backup-dir>/values.yaml"
+
+# Dry-run the stable version
+helm upgrade openab openab/openab \
+  --version <target-version> \
+  -f "$BACKUP_VALUES" \
+  --dry-run
+
+# Deploy stable (short downtime is expected)
+helm upgrade openab openab/openab \
+  --version <target-version> \
+  -f "$BACKUP_VALUES"
+
+# Wait for the Pod to be ready
+kubectl rollout status deployment/openab-kiro --timeout=300s
+```
+
+### Post-Upgrade Verification
+
+```bash
+POD=$(kubectl get pod -l app.kubernetes.io/instance=openab -o jsonpath='{.items[0].metadata.name}')
+
+# 1. Check Pod status (must be Running and READY)
+kubectl get pod -l app.kubernetes.io/instance=openab
+kubectl wait --for=condition=Ready pod/$POD --timeout=120s
+
+# 2. Verify version (from Helm and image tag — no source code in the container)
+helm list -f openab
+kubectl get deployment openab-kiro \
+  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'
+
+# 3. Confirm the openab process is running
+kubectl exec $POD -- pgrep -x openab
+
+# 4. Check logs for errors (ERROR / WARN / panic)
+kubectl logs deployment/openab-kiro --tail=100 | grep -iE "error|warn|panic|fatal"
+
+# 5. Discord E2E verification (final check)
+# → Send a test message in the Discord channel
+# → Confirm the bot responds and conversation works correctly
+```
+
+### Completion Notice
+
+- Once all verifications pass, notify users:
+  - Upgrade complete, service restored
+  - New version number and summary of key changes
+  - Contact channel for reporting any issues
+
+---
+
+## IV. Rollback
+
+### Decision Tree
+
+```
+Issue detected after upgrade
+         │
+         ▼
+    Pod status?
+         │
+         ├─ CrashLoopBackOff / Pending ──→ helm rollback <REVISION>
+         │
+         ├─ Running, but functionality broken
+         │         │
+         │         ├─ Quick fix possible (e.g. config error) ──→ hotfix (engineer)
+         │         └─ Root cause unclear ────────────────────→ helm rollback <REVISION>
+         │
+         ├─ Running, but bot is unresponsive
+         │         │
+         │         └─ kubectl rollout restart deployment/openab-kiro
+         │                   │
+         │                   ├─ Recovers after restart ──→ Continue monitoring
+         │                   └─ Still unresponsive      ──→ helm rollback <REVISION>
+         │
+         └─ Running, but custom config is missing
+                   │
+                   └─ Restore config from backup → kubectl rollout restart
+```
+
+| Symptom | Action |
+|---|---|
+| Pod fails to start (CrashLoopBackOff) | Helm rollback |
+| Functionality broken, Pod is running | Rollback or hotfix  |
+| Custom config lost | Restore config files from backup |
+| Bot unresponsive | Restart Pod first; rollback if it persists |
+
+### Helm Rollback
+
+```bash
+# 1. View release history
+helm history openab
+
+# 2. Roll back to a previous revision
+helm rollback openab <REVISION>
+
+# 3. Wait for the Pod to be ready
+kubectl rollout status deployment/openab-kiro --timeout=300s
+
+# 4. Confirm rollback succeeded
+kubectl get pod -l app.kubernetes.io/instance=openab
+
+# 5. Run full post-rollback verification (same as post-upgrade verification)
+POD=$(kubectl get pod -l app.kubernetes.io/instance=openab -o jsonpath='{.items[0].metadata.name}')
+kubectl wait --for=condition=Ready pod/$POD --timeout=120s
+kubectl exec $POD -- pgrep -x openab
+kubectl logs deployment/openab-kiro --tail=100 | grep -iE "error|warn|panic|fatal"
+# → Send a test message in the Discord channel to confirm the bot responds
+```
+
+### Restore Custom Config
+
+```bash
+POD=$(kubectl get pod -l app.kubernetes.io/instance=openab -o jsonpath='{.items[0].metadata.name}')
+
+# Restore agent config
+kubectl cp ./backup-default.json $POD:/home/agent/.kiro/agents/default.json
+
+# Restore steering files
+kubectl cp ./backup-steering/ $POD:/home/agent/.kiro/steering/
+
+# Restore GitHub CLI credentials
+kubectl cp ./backup-hosts.yml $POD:/home/agent/.config/gh/hosts.yml
+
+# Restart Pod to apply changes
+kubectl rollout restart deployment/openab-kiro
+```

From 9d101d9fa669ec14d03950edc4024c9bbb1fb176 Mon Sep 17 00:00:00 2001
From: JARVIS-Agent <JARVIS-coding-Agent@users.noreply.github.com>
Date: Mon, 13 Apr 2026 16:47:09 +0800
Subject: [PATCH 2/8] docs(sop): address reviewer feedback

- remove set -e; add explicit per-step error handling note
- add tar pre-check before kubectl cp directory operations
- add export KUBECONFIG inside backup script for consistency
- add full Secret backup step (not just STT key)
- add node resource check step in pre-upgrade preparation
- add note on when pre-release step can be skipped
- use tar pipe for steering restore to avoid kubectl cp dir nesting issue
- add Document Version / Last Updated header

Co-Authored-By: JARVIS-Agent <JARVIS-coding-Agent@users.noreply.github.com>
---
 docs/openab-upgrade-sop.md | 57 ++++++++++++++++++++++++++++++++------
 1 file changed, 49 insertions(+), 8 deletions(-)
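
Reviewer note on the steering restore: the tar pipe copies the contents of the source directory rather than the directory itself, which is why it cannot produce `steering/steering/`. The sketch below shows the same pattern run entirely on the local machine, so it can be tried without a cluster. `copy_contents` is a hypothetical name, and the right-hand `tar` stands in for the one that `kubectl exec -i` would run inside the Pod.

```bash
#!/usr/bin/env bash
# Sketch only: copy the *contents* of SRC into DEST through a tar pipe.
# In the SOP the right-hand tar runs in the Pod via `kubectl exec -i`;
# here both ends run locally so the behaviour can be checked offline.

copy_contents() {
  local src="$1" dest="$2"
  mkdir -p "$dest" || return 1
  # "-f -" makes both ends use stdin/stdout explicitly rather than
  # whatever default archive device the local tar build was given.
  tar cf - -C "$src" . | tar xf - -C "$dest"
}

# Example call (not executed here):
#   copy_contents ./backup-steering /tmp/steering-check
```

Because the archive is rooted at `.`, the files land directly under the destination regardless of any trailing slash.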

diff --git a/docs/openab-upgrade-sop.md b/docs/openab-upgrade-sop.md
index 636a634c..c01a032d 100644
--- a/docs/openab-upgrade-sop.md
+++ b/docs/openab-upgrade-sop.md
@@ -1,5 +1,10 @@
 # OpenAB Version Upgrade SOP
 
+| | |
+|---|---|
+| **Document Version** | 1.1 |
+| **Last Updated** | 2026-04-13 |
+
 ## Environment Reference
 
 | Item | Details |
@@ -111,7 +116,22 @@ helm search repo openab --versions
   - Added or deprecated environment variables
   - Any migration steps
 
-### 3. Announce the Upgrade
+### 3. Check Node Resources
+
+> Skipping this step risks the new Pod getting stuck in `Pending` if the node lacks capacity.
+
+```bash
+# Check allocatable resources on all nodes
+kubectl describe nodes | grep -A 5 "Allocatable:"
+
+# Check current resource usage per node (kubectl top reports usage, not requests; requires metrics-server)
+kubectl top nodes
+
+# Confirm the new image size has not changed significantly
+# (check the release notes for any resource requirement changes)
+```
+
+### 4. Announce the Upgrade
 
 > ⚠️ **Downtime is expected during every upgrade.** The deployment strategy is `Recreate` because the PVC is ReadWriteOnce, which does not support RollingUpdate. The old Pod is terminated before the new one starts — the Discord bot will be unavailable during this window, and this is expected behaviour.
 
@@ -133,6 +153,7 @@ helm search repo openab --versions
 | Steering files | `kubectl cp $POD:/home/agent/.kiro/steering/ ./backup-steering/` | Steering docs such as IDENTITY.md |
 | Skills | `kubectl cp $POD:/home/agent/.kiro/skills/ ./backup-skills/` | Custom agent skills (if any; skip if path does not exist) |
 | hosts.yml | `kubectl cp $POD:/home/agent/.config/gh/hosts.yml ./backup-hosts.yml` | GitHub CLI credentials (including multi-account configs) |
+| Full Secret export | `kubectl get secret openab-kiro -o yaml > ./backup-secret.yaml` | Full Secret dump including Discord token, STT key, etc. — store securely |
 | STT API Key | `kubectl get secret openab-kiro -o jsonpath='{.data.stt-api-key}' \| base64 -d > ./backup-stt-api-key.txt` | Required only if STT is enabled (`stt.enabled: true`) |
 | PVC data | See "PVC Backup" section below | Persistent data mounted to the Pod (⚠️ manual step required) |
 | Helm release history | `helm history openab > openab-helm-history-$(date +%Y%m%d).txt` | Useful reference for rollback |
@@ -141,7 +162,11 @@ helm search repo openab --versions
 
 ```bash
 #!/bin/bash
-set -e
+# Note: set -e is intentionally omitted.
+# Error handling is done explicitly per step via backup_step()
+# to avoid set -e interfering with the if ! "$@" pattern inside functions.
+
+export KUBECONFIG=~/.kube/config
 
 BACKUP_DIR="openab-backup-$(date +%Y%m%d-%H%M%S)"
 mkdir -p "$BACKUP_DIR"
@@ -151,6 +176,13 @@ if [ -z "$POD" ]; then
   echo "❌ OpenAB Pod not found. Aborting backup." && exit 1
 fi
 
+# Pre-check: kubectl cp directory operations require tar inside the container
+if ! kubectl exec "$POD" -- which tar > /dev/null 2>&1; then
+  echo "❌ tar not found in container. kubectl cp of directories will fail."
+  echo "   Use 'kubectl exec' with a tar pipe instead, or use VolumeSnapshot for PVC backup."
+  exit 1
+fi
+
 backup_step() {
   local desc="$1"; shift
   echo "=== $desc ==="
@@ -159,15 +191,18 @@ backup_step() {
   fi
 }
 
-backup_step "Backup Helm values"   bash -c "helm get values openab -o yaml > $BACKUP_DIR/values.yaml"
-backup_step "Backup Agent config"  kubectl cp "$POD:/home/agent/.kiro/agents/" "$BACKUP_DIR/agents/"
+backup_step "Backup Helm values"    bash -c "helm get values openab -o yaml > $BACKUP_DIR/values.yaml"
+backup_step "Backup Agent config"   kubectl cp "$POD:/home/agent/.kiro/agents/" "$BACKUP_DIR/agents/"
 backup_step "Backup Steering files" kubectl cp "$POD:/home/agent/.kiro/steering/" "$BACKUP_DIR/steering/"
-kubectl cp "$POD:/home/agent/.kiro/skills/" "$BACKUP_DIR/skills/" 2>/dev/null || echo "⚠️ skills/ directory not found, skipping"
-backup_step "Backup hosts.yml"     kubectl cp "$POD:/home/agent/.config/gh/hosts.yml" "$BACKUP_DIR/hosts.yml"
-backup_step "Backup Helm history"  bash -c "helm history openab > $BACKUP_DIR/helm-history.txt"
+kubectl cp "$POD:/home/agent/.kiro/skills/" "$BACKUP_DIR/skills/" 2>/dev/null \
+  || echo "⚠️ skills/ directory not found, skipping"
+backup_step "Backup hosts.yml"      kubectl cp "$POD:/home/agent/.config/gh/hosts.yml" "$BACKUP_DIR/hosts.yml"
+backup_step "Backup full Secret"    bash -c "kubectl get secret openab-kiro -o yaml > $BACKUP_DIR/secret.yaml"
+backup_step "Backup Helm history"   bash -c "helm history openab > $BACKUP_DIR/helm-history.txt"
 
 echo "=== ✅ Backup complete: $BACKUP_DIR ==="
 ls -la "$BACKUP_DIR/"
+echo "⚠️  secret.yaml contains sensitive credentials — store it securely and do not commit it."
 ```
 
 ### Verify the Backup
@@ -225,6 +260,8 @@ helm show chart oci://ghcr.io/openabdev/charts/openab --version <target-version>
 ### Step 1: Pre-release Validation (Required)
 
 > ⚠️ Per project convention, **a stable release must be preceded by a validated pre-release**. Do not skip directly to Step 2.
+>
+> **When can Step 1 be skipped?** Only if the maintainer explicitly states that the stable release was promoted directly from a pre-release that was already validated in another environment (e.g. a staging cluster). In all other cases, run Step 1 first.
 
 ```bash
 BACKUP_VALUES="<backup-dir>/values.yaml"  # e.g. openab-backup-20260413-070000/values.yaml
@@ -369,7 +406,11 @@ POD=$(kubectl get pod -l app.kubernetes.io/instance=openab -o jsonpath='{.items[
 kubectl cp ./backup-default.json $POD:/home/agent/.kiro/agents/default.json
 
 # Restore steering files
-kubectl cp ./backup-steering/ $POD:/home/agent/.kiro/steering/
+# ⚠️ kubectl cp directory behaviour varies across versions — trailing slash matters.
+# Use the tar pipe method below to avoid accidentally creating a nested directory
+# (e.g. steering/steering/) which can happen with some kubectl versions.
+kubectl exec $POD -- mkdir -p /home/agent/.kiro/steering
+tar cf - -C ./backup-steering . | kubectl exec -i "$POD" -- tar xf - -C /home/agent/.kiro/steering
 
 # Restore GitHub CLI credentials
 kubectl cp ./backup-hosts.yml $POD:/home/agent/.config/gh/hosts.yml

From df4f39b35f463effb1dcf7c0a177c4a56662163e Mon Sep 17 00:00:00 2001
From: Claude <claude@anthropic.com>
Date: Mon, 13 Apr 2026 13:53:56 +0000
Subject: [PATCH 3/8] docs: address reviewer feedback on upgrade SOP (v1.2)

- Backup script: replace exit 1 with record-and-continue pattern; report all failed steps at end
- Backup checklist: strengthen security warning for secret.yaml with encryption suggestions (gpg/age)
- Environment Reference: add Namespace row; add namespace alias setup instructions
- PVC backup Option B: add data size check step with recommended size limit
- Pre-release skip condition: replace vague "maintainer explicitly states" with concrete pre-release-validated: true marker
- Post-upgrade verification: add steering files and agent config presence checks
---
 docs/openab-upgrade-sop.md | 64 ++++++++++++++++++++++++++++++++------
 1 file changed, 54 insertions(+), 10 deletions(-)
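
Reviewer note on the record-and-continue pattern: as far as this hunk shows, the script prints the failed steps but does not set an exit status, so an agent or CI job calling it cannot tell a failed backup from a clean one. A minimal sketch of one way to surface that is below. `backup_summary` is a hypothetical helper; `backup_step` and `FAILED_STEPS` follow the names used in the patch.

```bash
#!/usr/bin/env bash
# Sketch only: record failures per step, then turn them into an exit status.
# backup_summary is a hypothetical addition, not part of this patch.

FAILED_STEPS=()

backup_step() {
  local desc="$1"; shift
  if ! "$@"; then
    FAILED_STEPS+=("$desc")   # remember the failure and keep going
  fi
}

# Prints one line per failed step and returns 1, or prints OK and returns 0.
backup_summary() {
  if [ "${#FAILED_STEPS[@]}" -gt 0 ]; then
    printf 'FAILED: %s\n' "${FAILED_STEPS[@]}"
    return 1
  fi
  echo "OK"
}

# Example (not executed here): end the backup script with
#   backup_summary || exit 1
```

This keeps the continue-on-failure behaviour while still letting the caller gate the upgrade on the result.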

diff --git a/docs/openab-upgrade-sop.md b/docs/openab-upgrade-sop.md
index c01a032d..af7d19f2 100644
--- a/docs/openab-upgrade-sop.md
+++ b/docs/openab-upgrade-sop.md
@@ -2,7 +2,7 @@
 
 | | |
 |---|---|
-| **Document Version** | 1.1 |
+| **Document Version** | 1.2 |
 | **Last Updated** | 2026-04-13 |
 
 ## Environment Reference
@@ -20,12 +20,22 @@
 | Steering Files | `/home/agent/.kiro/steering/` |
 | PVC Mount Path | `/home/agent` (Helm); `.kiro` / `.local/share/kiro-cli` (raw k8s) |
 | KUBECONFIG | `~/.kube/config` (must be set explicitly — default k3s config has insufficient permissions) |
+| Namespace | `default` (adjust to match your actual deployment namespace) |
 
 > ⚠️ The local kubectl defaults to reading `/etc/rancher/k3s/k3s.yaml`, which will result in a permission denied error. Before running any command, always set:
 > ```bash
 > export KUBECONFIG=~/.kube/config
 > ```
 
+> 💡 **Namespace setup (recommended):** If OpenAB is deployed in a non-default namespace, set the following at the start of your session to avoid having to append `-n <namespace>` to every command:
+> ```bash
+> export NS=openab          # replace with your actual namespace
+> export KUBECONFIG=~/.kube/config
+> alias kubectl="kubectl -n $NS"
+> alias helm="helm -n $NS"
+> ```
+> All `kubectl` and `helm` commands in this SOP assume either the default namespace or that the above aliases are in effect.
+
 ---
 
 ## Upgrade Process Overview
@@ -153,7 +163,7 @@ kubectl top nodes
 | Steering files | `kubectl cp $POD:/home/agent/.kiro/steering/ ./backup-steering/` | Steering docs such as IDENTITY.md |
 | Skills | `kubectl cp $POD:/home/agent/.kiro/skills/ ./backup-skills/` | Custom agent skills (if any; skip if path does not exist) |
 | hosts.yml | `kubectl cp $POD:/home/agent/.config/gh/hosts.yml ./backup-hosts.yml` | GitHub CLI credentials (including multi-account configs) |
-| Full Secret export | `kubectl get secret openab-kiro -o yaml > ./backup-secret.yaml` | Full Secret dump including Discord token, STT key, etc. — store securely |
+| Full Secret export | `kubectl get secret openab-kiro -o yaml > ./backup-secret.yaml` | ⚠️ **SENSITIVE** — contains Discord token, STT key, and other credentials. Store securely, **never commit to version control**. Consider encrypting with `gpg` or [`age`](https://github.com/FiloSottile/age) before storing. |
 | STT API Key | `kubectl get secret openab-kiro -o jsonpath='{.data.stt-api-key}' \| base64 -d > ./backup-stt-api-key.txt` | Required only if STT is enabled (`stt.enabled: true`) |
 | PVC data | See "PVC Backup" section below | Persistent data mounted to the Pod (⚠️ manual step required) |
 | Helm release history | `helm history openab > openab-helm-history-$(date +%Y%m%d).txt` | Useful reference for rollback |
@@ -163,8 +173,8 @@ kubectl top nodes
 ```bash
 #!/bin/bash
 # Note: set -e is intentionally omitted.
-# Error handling is done explicitly per step via backup_step()
-# to avoid set -e interfering with the if ! "$@" pattern inside functions.
+# Failures are recorded per step and reported at the end,
+# so that a single failure does not prevent remaining items from being backed up.
 
 export KUBECONFIG=~/.kube/config
 
@@ -183,11 +193,14 @@ if ! kubectl exec "$POD" -- which tar > /dev/null 2>&1; then
   exit 1
 fi
 
+FAILED_STEPS=()
+
 backup_step() {
   local desc="$1"; shift
   echo "=== $desc ==="
   if ! "$@"; then
-    echo "❌ Failed: $desc" && exit 1
+    echo "⚠️  Failed: $desc (continuing with remaining steps...)"
+    FAILED_STEPS+=("$desc")
   fi
 }
 
@@ -200,9 +213,29 @@ backup_step "Backup hosts.yml"      kubectl cp "$POD:/home/agent/.config/gh/host
 backup_step "Backup full Secret"    bash -c "kubectl get secret openab-kiro -o yaml > $BACKUP_DIR/secret.yaml"
 backup_step "Backup Helm history"   bash -c "helm history openab > $BACKUP_DIR/helm-history.txt"
 
-echo "=== ✅ Backup complete: $BACKUP_DIR ==="
+echo ""
+echo "=== Backup Summary: $BACKUP_DIR ==="
 ls -la "$BACKUP_DIR/"
-echo "⚠️  secret.yaml contains sensitive credentials — store it securely and do not commit it."
+
+if [ ${#FAILED_STEPS[@]} -gt 0 ]; then
+  echo ""
+  echo "⚠️  The following backup steps FAILED:"
+  for step in "${FAILED_STEPS[@]}"; do
+    echo "   - $step"
+  done
+  echo ""
+  echo "   Review the failures above before proceeding with the upgrade."
+  echo "   A failed backup step means the corresponding data is NOT protected."
+else
+  echo "✅ All backup steps completed successfully."
+fi
+
+echo ""
+echo "🔐 SECURITY REMINDER: $BACKUP_DIR/secret.yaml contains sensitive credentials"
+echo "   (Discord token, STT key, etc.). Do NOT commit this file."
+echo "   Consider encrypting it before storing:"
+echo "     gpg --symmetric $BACKUP_DIR/secret.yaml"
+echo "     # or: age -p -o $BACKUP_DIR/secret.yaml.age $BACKUP_DIR/secret.yaml"
 ```
 
 ### Verify the Backup
@@ -235,7 +268,12 @@ spec:
     persistentVolumeClaimName: <pvc-name>
 EOF
 
-# 3. Option B: kubectl cp (suitable for small data volumes)
+# 3. Option B: kubectl cp (suitable for small data volumes only)
+#
+# ⚠️ Check the size first: kubectl cp may time out or run out of memory on large datasets.
+# Recommended limit: < 500 MB. For larger volumes, use Option A (VolumeSnapshot).
+kubectl exec $POD -- du -sh <pvc-mount-path>
+# If the output is within an acceptable range, proceed:
 kubectl cp $POD:<pvc-mount-path> $BACKUP_DIR/pvc-data/
 ```
 
@@ -261,7 +299,7 @@ helm show chart oci://ghcr.io/openabdev/charts/openab --version <target-version>
 
 > ⚠️ Per project convention, **a stable release must be preceded by a validated pre-release**. Do not skip directly to Step 2.
 >
-> **When can Step 1 be skipped?** Only if the maintainer explicitly states that the stable release was promoted directly from a pre-release that was already validated in another environment (e.g. a staging cluster). In all other cases, run Step 1 first.
+> **When can Step 1 be skipped?** Only if the release notes for the target stable version explicitly contain `pre-release-validated: true`, indicating that the corresponding pre-release has already been validated in another environment (e.g. a staging cluster). In all other cases, run Step 1 first.
 
 ```bash
 BACKUP_VALUES="<backup-dir>/values.yaml"  # e.g. openab-backup-20260413-070000/values.yaml
@@ -324,7 +362,13 @@ kubectl exec $POD -- pgrep -x openab
 # 4. Check logs for errors (ERROR / WARN / panic)
 kubectl logs deployment/openab-kiro --tail=100 | grep -iE "error|warn|panic|fatal"
 
-# 5. Discord E2E verification (final check)
+# 5. Verify steering files and agent config are still present
+#    (PVC content should survive upgrades, but this confirms it)
+kubectl exec $POD -- ls /home/agent/.kiro/steering/
+kubectl exec $POD -- cat /home/agent/.kiro/agents/default.json | head -5
+#    If either path is missing, restore from backup (see Section IV: Restore Custom Config)
+
+# 6. Discord E2E verification (final check)
 # → Send a test message in the Discord channel
 # → Confirm the bot responds and conversation works correctly
 ```

From 7a027e7bc72c24ab2ea304c4d40393b6de352b47 Mon Sep 17 00:00:00 2001
From: JARVIS-Agent <JARVIS-coding-Agent@users.noreply.github.com>
Date: Tue, 14 Apr 2026 22:49:59 +0000
Subject: [PATCH 4/8] docs: revise upgrade SOP for AI-first execution and
 address review feedback

Address masami-agent review comments and repo owner AI-first design feedback:

Technical fixes (masami-agent):
- Add deployment naming pattern note (<release-name>-kiro)
- Use precise pod label selector (instance + name) to avoid multi-agent conflicts
- Prefer OCI registry for helm commands; GitHub Pages listed as fallback
- Add helm uninstall PVC deletion warning to Environment Reference
- Add kiro-cli auth DB backup (data.sqlite3) to checklist and scripts

AI-first redesign (repo owner):
- Add Agent-Executable Backup section: linear Steps 0-7 with explicit
  input/output dependency annotations (no ambiguous branches)
- Replace all <placeholder> patterns with "run this command to resolve"
  patterns (RELEASE_NAME, DEPLOYMENT, BACKUP_DIR, TARGET_VERSION, PREV_REVISION)
- Add Verification Gate after backup: checks all files exist and are
  non-empty; exits 1 on failure so agent cannot proceed to upgrade
- Add machine-readable pass/fail criteria for pre-release validation
  and post-upgrade verification steps
- Add machine-readable decision table for rollback branch selection
- Auto-resolve PREV_REVISION via helm history JSON + jq in rollback steps
- Restore section uses BACKUP_DIR resolved from ls -td pattern
---
 docs/openab-upgrade-sop.md | 604 +++++++++++++++++++++++++++----------
 1 file changed, 449 insertions(+), 155 deletions(-)
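
Reviewer note on the deployment naming snippet: `jq -r '.[0].name'` takes whichever release `helm list` prints first, which may not be OpenAB when several releases share the namespace. The sketch below selects the release by exact name and fails when it is absent. `resolve_release` and `deployment_for` are hypothetical helpers. The `<release-name>-kiro` pattern comes from this patch, and `helm list -q` is assumed to print one release name per line.

```bash
#!/usr/bin/env bash
# Sketch only: pick the release by exact name instead of by position.
# resolve_release and deployment_for are hypothetical helpers.

# Reads release names on stdin (one per line, e.g. from `helm list -q`)
# and prints the one equal to $1; returns non-zero when it is not present.
resolve_release() {
  grep -Fx -- "$1"
}

# Applies the <release-name>-kiro naming pattern described in the SOP.
deployment_for() {
  printf '%s-kiro\n' "$1"
}

# Example (not executed here):
#   RELEASE_NAME=$(helm list -q | resolve_release openab) || exit 1
#   DEPLOYMENT=$(deployment_for "$RELEASE_NAME")
```

`grep -Fx` matches whole lines as fixed strings, so `openab` does not match `openab-staging`.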

diff --git a/docs/openab-upgrade-sop.md b/docs/openab-upgrade-sop.md
index af7d19f2..0dfc3307 100644
--- a/docs/openab-upgrade-sop.md
+++ b/docs/openab-upgrade-sop.md
@@ -2,26 +2,34 @@
 
 | | |
 |---|---|
-| **Document Version** | 1.2 |
-| **Last Updated** | 2026-04-13 |
+| **Document Version** | 1.3 |
+| **Last Updated** | 2026-04-14 |
 
 ## Environment Reference
 
 | Item | Details |
 |---|---|
 | Deployment Method | Kubernetes (Helm Chart) |
-| Deployment Name | `openab-kiro` |
-| Pod Label | `app.kubernetes.io/instance=openab` |
-| Helm Repo (GitHub Pages) | `https://openabdev.github.io/openab` |
-| Helm Repo (OCI) | `oci://ghcr.io/openabdev/charts/openab` |
+| Deployment Name | `<release-name>-kiro` (default: `openab-kiro`) — see note below |
+| Pod Label (precise) | `app.kubernetes.io/instance=openab,app.kubernetes.io/name=openab-kiro` |
+| Helm Repo (OCI, recommended) | `oci://ghcr.io/openabdev/charts/openab` |
+| Helm Repo (GitHub Pages, fallback) | `https://openabdev.github.io/openab` |
 | Image Registry | `ghcr.io/openabdev/openab` |
 | Git Repo | `github.com/openabdev/openab` |
 | Agent Config | `/home/agent/.kiro/agents/default.json` |
 | Steering Files | `/home/agent/.kiro/steering/` |
+| kiro-cli Auth DB | `/home/agent/.local/share/kiro-cli/data.sqlite3` |
 | PVC Mount Path | `/home/agent` (Helm); `.kiro` / `.local/share/kiro-cli` (raw k8s) |
 | KUBECONFIG | `~/.kube/config` (must be set explicitly — default k3s config has insufficient permissions) |
 | Namespace | `default` (adjust to match your actual deployment namespace) |
 
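+> **Deployment naming pattern:** The deployment name follows `<release-name>-kiro`. For the default setup (`helm install openab …`), the deployment is `openab-kiro`. If you used a different release name (e.g. `my-bot`), the deployment is `my-bot-kiro`. The command below takes the first release that `helm list` returns in the current namespace; if the namespace holds more than one Helm release, set `RELEASE_NAME` by hand instead. Resolve it with: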
+> ```bash
+> RELEASE_NAME=$(helm list -o json | jq -r '.[0].name')
+> DEPLOYMENT="${RELEASE_NAME}-kiro"
+> echo "Deployment: $DEPLOYMENT"
+> ```
+
 > ⚠️ The local kubectl defaults to reading `/etc/rancher/k3s/k3s.yaml`, which will result in a permission denied error. Before running any command, always set:
 > ```bash
 > export KUBECONFIG=~/.kube/config
@@ -36,6 +44,8 @@
 > ```
 > All `kubectl` and `helm` commands in this SOP assume either the default namespace or that the above aliases are in effect.
 
+> ⚠️ **Data loss warning:** `helm uninstall` **deletes the PVC** and all persistent data (steering files, auth database, agent config) unless the PVC carries the `helm.sh/resource-policy: keep` annotation. Use `helm rollback` instead of uninstall + reinstall. If you must uninstall, back up the PVC data first.
+
 ---
 
 ## Upgrade Process Overview
@@ -50,6 +60,7 @@
 ┌─────────────────────────────────────────────────────────────┐
 │                          Backup                              │
 │  Helm values / Agent config / Steering / hosts.yml / PVC    │
+│  → Verification gate (all files exist & non-empty) ✅        │
 └────────────────────────┬────────────────────────────────────┘
                          │
                          ▼
@@ -58,9 +69,9 @@
 │                                                             │
 │  Step 1: Pre-release Validation                             │
 │    helm upgrade --version x.x.x-beta.1                     │
-│    └─ Discord functional test                               │
+│    └─ Automated smoke test (kubectl wait + pgrep + logs)    │
 │         ├─ Pass ──────────────────────────┐                 │
-│         └─ Fail → Wait for beta.2, retry  │                 │
+│         └─ Fail → rollback, stop          │                 │
 │                                           ▼                 │
 │  Step 2: Promote to Stable                                  │
 │    helm upgrade --version x.x.x                            │
@@ -94,32 +105,54 @@
 
 ## I. Pre-Upgrade Preparation
 
-### 1. Check Current Version
+### 1. Resolve Environment Variables
 
-> ℹ️ OpenAB is a pre-compiled Rust binary shipped inside a Docker image. There is **no source code or git repository** inside the container — version information must be retrieved from Helm or the image tag.
+> **Agent note:** Run this block first. All subsequent steps depend on these variables.
+>
+> **Output:** `RELEASE_NAME`, `DEPLOYMENT`, `POD`, `CURRENT_VERSION` (resolved by the first block) and `TARGET_VERSION` (set by hand in the second block).
 
 ```bash
-# Get the current running Pod
-POD=$(kubectl get pod -l app.kubernetes.io/instance=openab -o jsonpath='{.items[0].metadata.name}')
+export KUBECONFIG=~/.kube/config
+
+# Resolve release name and deployment name
+RELEASE_NAME=$(helm list -o json | jq -r '.[0].name')
+DEPLOYMENT="${RELEASE_NAME}-kiro"
+echo "Release: $RELEASE_NAME  |  Deployment: $DEPLOYMENT"
+
+# Get current running pod (precise label selector — avoids matching other agents)
+POD=$(kubectl get pod \
+  -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" \
+  -o jsonpath='{.items[0].metadata.name}')
+echo "Pod: $POD"
+if [ -z "$POD" ]; then echo "❌ Pod not found. Check label selectors."; exit 1; fi
+
+# Get current deployed chart version
+CURRENT_VERSION=$(helm list -f "$RELEASE_NAME" -o json | jq -r '.[0].chart' | sed 's/openab-//')
+echo "Current chart version: $CURRENT_VERSION"
 
-# Check deployed Helm chart version and image tag
-helm list -f openab
-helm status openab
+# Show the latest chart version published to the OCI registry (no repo add needed)
+helm show chart oci://ghcr.io/openabdev/charts/openab 2>/dev/null | grep ^version
 
-# Check the actual image the Pod is running (including tag / SHA)
-kubectl get deployment openab-kiro \
+# List all published versions (requires GitHub Pages repo to be added)
+# helm repo add openab https://openabdev.github.io/openab && helm repo update
+# helm search repo openab --versions
+
+# Check the actual image the Pod is running
+kubectl get deployment "$DEPLOYMENT" \
   -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'
+```
 
-# Check latest releases on GitHub
-# Visit https://github.com/openabdev/openab/releases
+After running the above, set the target version:
 
-# List available versions from the Helm repo (requires repo to be added first — see Section III)
-helm search repo openab --versions
+```bash
+# Set this to the version you are upgrading to (e.g. 0.7.5)
+TARGET_VERSION="0.7.5"
+echo "Upgrading to: $TARGET_VERSION"
 ```
 
 ### 2. Read the Release Notes
 
-- Go to `https://github.com/openabdev/openab/releases/tag/<target-version>`
+- Go to `https://github.com/openabdev/openab/releases/tag/v${TARGET_VERSION}`
 - Pay special attention to:
   - Breaking changes
   - Helm Chart values changes
@@ -136,9 +169,6 @@ kubectl describe nodes | grep -A 5 "Allocatable:"
 
 # Check current resource requests across the cluster
 kubectl top nodes
-
-# Confirm the new image size has not changed significantly
-# (check the release notes for any resource requirement changes)
 ```
 
 ### 4. Announce the Upgrade
@@ -154,22 +184,204 @@ kubectl top nodes
 
 ## II. Backup
 
-### Backup Checklist
+> **Agent note — dependency chain:**
+> - Step 0 must run first (resolves `RELEASE_NAME`, `DEPLOYMENT`, `BACKUP_DIR`, and `POD`)
+> - Steps 1–7 depend on the variables from Step 0
+> - The Verification Gate must pass before proceeding to Section III
 
-| Item | Command | Notes |
-|---|---|---|
-| Helm values | `helm get values openab -o yaml > openab-values-backup-$(date +%Y%m%d).yaml` | Current Helm deployment parameters |
-| Agent config | `kubectl cp $POD:/home/agent/.kiro/agents/default.json ./backup-default.json` | Custom agent settings (model, prompt, tools, etc.) |
-| Steering files | `kubectl cp $POD:/home/agent/.kiro/steering/ ./backup-steering/` | Steering docs such as IDENTITY.md |
-| Skills | `kubectl cp $POD:/home/agent/.kiro/skills/ ./backup-skills/` | Custom agent skills (if any; skip if path does not exist) |
-| hosts.yml | `kubectl cp $POD:/home/agent/.config/gh/hosts.yml ./backup-hosts.yml` | GitHub CLI credentials (including multi-account configs) |
-| Full Secret export | `kubectl get secret openab-kiro -o yaml > ./backup-secret.yaml` | ⚠️ **SENSITIVE** — contains Discord token, STT key, and other credentials. Store securely, **never commit to version control**. Consider encrypting with `gpg` or [`age`](https://github.com/FiloSottile/age) before storing. |
-| STT API Key | `kubectl get secret openab-kiro -o jsonpath='{.data.stt-api-key}' \| base64 -d > ./backup-stt-api-key.txt` | Required only if STT is enabled (`stt.enabled: true`) |
-| PVC data | See "PVC Backup" section below | Persistent data mounted to the Pod (⚠️ manual step required) |
-| Helm release history | `helm history openab > openab-helm-history-$(date +%Y%m%d).txt` | Useful reference for rollback |
+### Agent-Executable Backup (Linear Sequence)
+
+This section is written as a machine-executable runbook with no ambiguous branches. Run steps in order.
+
+#### Step 0 — Resolve variables and create backup directory
+
+> **Output:** `RELEASE_NAME`, `DEPLOYMENT`, `BACKUP_DIR`, `POD` → used in Steps 1–7 and the Verification Gate.
+
+```bash
+export KUBECONFIG=~/.kube/config
+
+RELEASE_NAME=$(helm list -o json | jq -r '.[0].name')
+DEPLOYMENT="${RELEASE_NAME}-kiro"
+BACKUP_DIR="openab-backup-$(date +%Y%m%d-%H%M%S)"
+mkdir -p "$BACKUP_DIR"
+echo "Backup directory: $BACKUP_DIR"
+
+POD=$(kubectl get pod \
+  -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" \
+  -o jsonpath='{.items[0].metadata.name}')
+echo "Pod: $POD"
+
+# Gate: abort if pod not found
+if [ -z "$POD" ]; then
+  echo "❌ Pod not found. Cannot proceed with backup."
+  exit 1
+fi
+
+# Gate: verify tar is available (required for directory kubectl cp)
+if ! kubectl exec "$POD" -- which tar > /dev/null 2>&1; then
+  echo "❌ tar not found in container. kubectl cp of directories will fail. Aborting."
+  exit 1
+fi
+```
+
+#### Step 1 — Backup Helm values
+
+> **Input:** `RELEASE_NAME` from Step 0 · **Output:** `$BACKUP_DIR/values.yaml`
+
+```bash
+helm get values "$RELEASE_NAME" -o yaml > "$BACKUP_DIR/values.yaml"
+echo "✅ Helm values backed up"
+```
+
+#### Step 2 — Backup agent config
+
+> **Input:** `POD` from Step 0 · **Output:** `$BACKUP_DIR/agents/`
+
+```bash
+kubectl cp "$POD:/home/agent/.kiro/agents/" "$BACKUP_DIR/agents/"
+echo "✅ Agent config backed up"
+```
+
+#### Step 3 — Backup steering files
+
+> **Input:** `POD` from Step 0 · **Output:** `$BACKUP_DIR/steering/`
+
+```bash
+kubectl cp "$POD:/home/agent/.kiro/steering/" "$BACKUP_DIR/steering/"
+echo "✅ Steering files backed up"
+```
+
+#### Step 4 — Backup skills (optional)
+
+> **Input:** `POD` from Step 0 · **Output:** `$BACKUP_DIR/skills/` (may be skipped)
+
+```bash
+if kubectl exec "$POD" -- test -d /home/agent/.kiro/skills 2>/dev/null; then
+  kubectl cp "$POD:/home/agent/.kiro/skills/" "$BACKUP_DIR/skills/"
+  echo "✅ Skills directory backed up"
+else
+  echo "⚠️ skills/ not found in container — skipping (this is normal if no custom skills are installed)"
+fi
+```
+
+#### Step 5 — Backup GitHub CLI credentials and kiro-cli auth
+
+> **Input:** `POD` from Step 0 · **Output:** `$BACKUP_DIR/hosts.yml`, `$BACKUP_DIR/kiro-auth.sqlite3`
+
+```bash
+kubectl cp "$POD:/home/agent/.config/gh/hosts.yml" "$BACKUP_DIR/hosts.yml"
+echo "✅ hosts.yml backed up"
+
+# kiro-cli auth database — required for bot to resume without re-authentication
+kubectl cp "$POD:/home/agent/.local/share/kiro-cli/data.sqlite3" "$BACKUP_DIR/kiro-auth.sqlite3"
+echo "✅ kiro-cli auth DB backed up"
+```
+
+#### Step 6 — Backup Kubernetes Secret
+
+> **Input:** `DEPLOYMENT` from Step 0 · **Output:** `$BACKUP_DIR/secret.yaml` ⚠️ SENSITIVE
+
+```bash
+kubectl get secret "${DEPLOYMENT}" -o yaml > "$BACKUP_DIR/secret.yaml"
+echo "✅ Secret backed up"
+echo "🔐 SECURITY: $BACKUP_DIR/secret.yaml contains credentials. Do NOT commit."
+echo "   Encrypt if storing: gpg --symmetric $BACKUP_DIR/secret.yaml"
+```
+
+#### Step 7 — Backup Helm release history and PVC data
+
+> **Input:** `RELEASE_NAME`, `POD` from Step 0 · **Output:** `$BACKUP_DIR/helm-history.txt`, `$BACKUP_DIR/pvc-data/`
+
+```bash
+helm history "$RELEASE_NAME" > "$BACKUP_DIR/helm-history.txt"
+echo "✅ Helm history backed up"
+
+# PVC backup via kubectl cp (default path — /home/agent is the full PVC mount)
+# Check size first to avoid timeout
+PVC_SIZE=$(kubectl exec "$POD" -- du -sh /home/agent 2>/dev/null | cut -f1)
+echo "PVC size: $PVC_SIZE"
+# Proceed with kubectl cp (recommended for < 500 MB; use VolumeSnapshot for larger volumes)
+kubectl cp "$POD:/home/agent/" "$BACKUP_DIR/pvc-data/"
+echo "✅ PVC data backed up"
+```
+
+> **Advanced option — VolumeSnapshot (for large PVCs or CSI-enabled clusters):**
+> ```bash
+> # First, resolve the PVC name
+> PVC_NAME=$(kubectl get pod "$POD" \
+>   -o jsonpath='{.spec.volumes[?(@.persistentVolumeClaim)].persistentVolumeClaim.claimName}')
+> echo "PVC name: $PVC_NAME"
+>
+> # Pick the first available VolumeSnapshotClass
+> SNAPSHOT_CLASS=$(kubectl get volumesnapshotclass -o jsonpath='{.items[0].metadata.name}')
+> echo "Snapshot class: $SNAPSHOT_CLASS"
+>
+> # Create the snapshot
+> kubectl apply -f - <<EOF
+> apiVersion: snapshot.storage.k8s.io/v1
+> kind: VolumeSnapshot
+> metadata:
+>   name: openab-pvc-snapshot-$(date +%Y%m%d)
+> spec:
+>   volumeSnapshotClassName: ${SNAPSHOT_CLASS}
+>   source:
+>     persistentVolumeClaimName: ${PVC_NAME}
+> EOF
+> ```
+
+#### Verification Gate — must pass before proceeding to upgrade
+
+> **Agent instruction:** Run this gate after all backup steps. If any check fails, **stop and do not proceed** with the upgrade. A failed backup means that data is unprotected.
+
+```bash
+echo "=== Backup Verification Gate ==="
+GATE_PASS=true
+
+check_file() {
+  local path="$1"
+  local label="$2"
+  if [ -s "$path" ]; then
+    echo "  ✅ $label ($path)"
+  else
+    echo "  ❌ MISSING or EMPTY: $label ($path)"
+    GATE_PASS=false
+  fi
+}
+
+check_dir() {
+  local path="$1"
+  local label="$2"
+  if [ -d "$path" ] && [ -n "$(ls -A "$path" 2>/dev/null)" ]; then
+    echo "  ✅ $label ($path)"
+  else
+    echo "  ❌ MISSING or EMPTY: $label ($path)"
+    GATE_PASS=false
+  fi
+}
+
+check_file "$BACKUP_DIR/values.yaml"           "Helm values"
+check_dir  "$BACKUP_DIR/agents/"               "Agent config"
+check_dir  "$BACKUP_DIR/steering/"             "Steering files"
+check_file "$BACKUP_DIR/hosts.yml"             "GitHub CLI credentials"
+check_file "$BACKUP_DIR/kiro-auth.sqlite3"     "kiro-cli auth DB"
+check_file "$BACKUP_DIR/secret.yaml"           "Kubernetes Secret"
+check_file "$BACKUP_DIR/helm-history.txt"      "Helm history"
+check_dir  "$BACKUP_DIR/pvc-data/"             "PVC data"
+
+echo ""
+if [ "$GATE_PASS" = true ]; then
+  echo "✅ GATE PASSED — all backup files present and non-empty. Safe to proceed with upgrade."
+else
+  echo "❌ GATE FAILED — one or more backup files are missing or empty."
+  echo "   Do NOT proceed with the upgrade until all checks pass."
+  exit 1
+fi
+```
 
 ### One-Click Backup Script
 
+The script below combines Steps 0–7 and the Verification Gate into a single file.
+
 ```bash
 #!/bin/bash
 # Note: set -e is intentionally omitted.
@@ -178,10 +390,14 @@ kubectl top nodes
 
 export KUBECONFIG=~/.kube/config
 
+RELEASE_NAME=$(helm list -o json | jq -r '.[0].name')
+DEPLOYMENT="${RELEASE_NAME}-kiro"
 BACKUP_DIR="openab-backup-$(date +%Y%m%d-%H%M%S)"
 mkdir -p "$BACKUP_DIR"
 
-POD=$(kubectl get pod -l app.kubernetes.io/instance=openab -o jsonpath='{.items[0].metadata.name}')
+POD=$(kubectl get pod \
+  -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" \
+  -o jsonpath='{.items[0].metadata.name}')
 if [ -z "$POD" ]; then
   echo "❌ OpenAB Pod not found. Aborting backup." && exit 1
 fi
@@ -189,7 +405,7 @@ fi
 # Pre-check: kubectl cp directory operations require tar inside the container
 if ! kubectl exec "$POD" -- which tar > /dev/null 2>&1; then
   echo "❌ tar not found in container. kubectl cp of directories will fail."
-  echo "   Use 'kubectl exec' with a tar pipe instead, or use VolumeSnapshot for PVC backup."
+  echo "   Use VolumeSnapshot for PVC backup instead."
   exit 1
 fi
 
@@ -204,14 +420,20 @@ backup_step() {
   fi
 }
 
-backup_step "Backup Helm values"    bash -c "helm get values openab -o yaml > $BACKUP_DIR/values.yaml"
-backup_step "Backup Agent config"   kubectl cp "$POD:/home/agent/.kiro/agents/" "$BACKUP_DIR/agents/"
-backup_step "Backup Steering files" kubectl cp "$POD:/home/agent/.kiro/steering/" "$BACKUP_DIR/steering/"
-kubectl cp "$POD:/home/agent/.kiro/skills/" "$BACKUP_DIR/skills/" 2>/dev/null \
-  || echo "⚠️ skills/ directory not found, skipping"
-backup_step "Backup hosts.yml"      kubectl cp "$POD:/home/agent/.config/gh/hosts.yml" "$BACKUP_DIR/hosts.yml"
-backup_step "Backup full Secret"    bash -c "kubectl get secret openab-kiro -o yaml > $BACKUP_DIR/secret.yaml"
-backup_step "Backup Helm history"   bash -c "helm history openab > $BACKUP_DIR/helm-history.txt"
+backup_step "Backup Helm values"       bash -c "helm get values '$RELEASE_NAME' -o yaml > $BACKUP_DIR/values.yaml"
+backup_step "Backup Agent config"      kubectl cp "$POD:/home/agent/.kiro/agents/" "$BACKUP_DIR/agents/"
+backup_step "Backup Steering files"    kubectl cp "$POD:/home/agent/.kiro/steering/" "$BACKUP_DIR/steering/"
+backup_step "Backup hosts.yml"         kubectl cp "$POD:/home/agent/.config/gh/hosts.yml" "$BACKUP_DIR/hosts.yml"
+backup_step "Backup kiro-cli auth DB"  kubectl cp "$POD:/home/agent/.local/share/kiro-cli/data.sqlite3" "$BACKUP_DIR/kiro-auth.sqlite3"
+backup_step "Backup full Secret"       bash -c "kubectl get secret '${DEPLOYMENT}' -o yaml > $BACKUP_DIR/secret.yaml"
+backup_step "Backup Helm history"      bash -c "helm history '$RELEASE_NAME' > $BACKUP_DIR/helm-history.txt"
+backup_step "Backup PVC data"          kubectl cp "$POD:/home/agent/" "$BACKUP_DIR/pvc-data/"
+
+if kubectl exec "$POD" -- test -d /home/agent/.kiro/skills 2>/dev/null; then
+  backup_step "Backup skills" kubectl cp "$POD:/home/agent/.kiro/skills/" "$BACKUP_DIR/skills/"
+else
+  echo "⚠️ skills/ not found — skipping (normal if no custom skills installed)"
+fi
 
 echo ""
 echo "=== Backup Summary: $BACKUP_DIR ==="
@@ -238,61 +460,69 @@ echo "     gpg --symmetric $BACKUP_DIR/secret.yaml"
 echo "     # or: age -p -o $BACKUP_DIR/secret.yaml.age $BACKUP_DIR/secret.yaml"
 ```
 
-### Verify the Backup
+### Backup Checklist (Human Reference)
 
-```bash
-# Check for empty files in the backup directory
-find $BACKUP_DIR -type f -empty
+| Item | Command | Notes |
+|---|---|---|
+| Helm values | `helm get values $RELEASE_NAME -o yaml > $BACKUP_DIR/values.yaml` | Current Helm deployment parameters |
+| Agent config | `kubectl cp $POD:/home/agent/.kiro/agents/ $BACKUP_DIR/agents/` | Custom agent settings (model, prompt, tools, etc.) |
+| Steering files | `kubectl cp $POD:/home/agent/.kiro/steering/ $BACKUP_DIR/steering/` | Steering docs such as IDENTITY.md |
+| Skills | `kubectl cp $POD:/home/agent/.kiro/skills/ $BACKUP_DIR/skills/` | Custom agent skills (if any; see Step 4 for conditional check) |
+| hosts.yml | `kubectl cp $POD:/home/agent/.config/gh/hosts.yml $BACKUP_DIR/hosts.yml` | GitHub CLI credentials (including multi-account configs) |
+| kiro-cli auth | `kubectl cp $POD:/home/agent/.local/share/kiro-cli/data.sqlite3 $BACKUP_DIR/kiro-auth.sqlite3` | Bot auth DB — required to avoid re-authentication after PVC loss |
+| Full Secret export | `kubectl get secret ${DEPLOYMENT} -o yaml > $BACKUP_DIR/secret.yaml` | ⚠️ **SENSITIVE** — contains Discord token, STT key, and other credentials. Store securely, **never commit to version control**. |
+| PVC data | `kubectl cp $POD:/home/agent/ $BACKUP_DIR/pvc-data/` | Default: kubectl cp. See Step 7 for VolumeSnapshot (advanced). |
+| Helm release history | `helm history $RELEASE_NAME > $BACKUP_DIR/helm-history.txt` | Useful reference for rollback |
 
-# Confirm values.yaml is readable
-cat $BACKUP_DIR/values.yaml | head -20
-```
+---
 
-### PVC Backup (⚠️ Manual Step)
+## III. Upgrade Execution
 
-> This step must be performed manually based on your PVC type. It cannot be automated.
+> **Agent note — dependency chain:**
+> - Requires `RELEASE_NAME`, `DEPLOYMENT`, and `TARGET_VERSION` from Section I, and `BACKUP_DIR` from Section II.
+> - Requires the Verification Gate (Section II) to have passed.
+
+### Pre-check: Resolve Upgrade Variables
 
 ```bash
-# 1. List PVCs mounted to the Pod
-kubectl get pod $POD -o jsonpath='{range .spec.volumes[*]}{.name}{"\t"}{.persistentVolumeClaim.claimName}{"\n"}{end}'
-
-# 2. Option A: VolumeSnapshot (recommended — requires CSI driver support)
-kubectl apply -f - <<EOF
-apiVersion: snapshot.storage.k8s.io/v1
-kind: VolumeSnapshot
-metadata:
-  name: openab-pvc-snapshot-$(date +%Y%m%d)
-spec:
-  volumeSnapshotClassName: <your-snapshot-class>
-  source:
-    persistentVolumeClaimName: <pvc-name>
-EOF
-
-# 3. Option B: kubectl cp (suitable for small data volumes only)
-#
-# ⚠️ Size check before proceeding — kubectl cp may timeout or OOM on large datasets.
-# Recommended limit: < 500 MB. For larger volumes, use Option A (VolumeSnapshot).
-kubectl exec $POD -- du -sh <pvc-mount-path>
-# If the output is within an acceptable range, proceed:
-kubectl cp $POD:<pvc-mount-path> $BACKUP_DIR/pvc-data/
-```
+export KUBECONFIG=~/.kube/config
 
----
+# Resolve release name and deployment
+RELEASE_NAME=$(helm list -o json | jq -r '.[0].name')
+DEPLOYMENT="${RELEASE_NAME}-kiro"
 
-## III. Upgrade Execution
+# Resolve backup directory (most recent backup)
+BACKUP_DIR=$(ls -td openab-backup-* 2>/dev/null | head -1)
+BACKUP_VALUES="${BACKUP_DIR}/values.yaml"
+echo "Using backup: $BACKUP_DIR"
+echo "Values file: $BACKUP_VALUES"
 
-### Pre-check: Confirm Helm Repo is Configured
+# Confirm the values file exists and is non-empty
+if [ ! -s "$BACKUP_VALUES" ]; then
+  echo "❌ values.yaml not found or empty at $BACKUP_VALUES. Run backup first."
+  exit 1
+fi
 
-```bash
-# GitHub Pages (stable releases — recommended for most cases)
-helm repo add openab https://openabdev.github.io/openab
-helm repo update
+# Set target version (e.g. 0.7.5 — check https://github.com/openabdev/openab/releases)
+TARGET_VERSION="0.7.5"
+
+# Show chart metadata for the target version via OCI (no helm repo add required)
+helm show chart oci://ghcr.io/openabdev/charts/openab --version "$TARGET_VERSION" 2>/dev/null \
+  | grep -E "^(name|version|appVersion):"
+```
 
-# List available versions
-helm search repo openab --versions
+### Pre-check: Confirm Helm OCI Access
 
-# Or query via OCI Registry
-helm show chart oci://ghcr.io/openabdev/charts/openab --version <target-version>
+```bash
+# Verify OCI registry is reachable and the target version exists
+helm show chart oci://ghcr.io/openabdev/charts/openab --version "${TARGET_VERSION}" > /dev/null \
+  && echo "✅ Chart version ${TARGET_VERSION} available via OCI" \
+  || echo "❌ Chart version ${TARGET_VERSION} not found. Check version string."
+
+# Also verify the pre-release beta.1 version exists (required for Step 1)
+helm show chart oci://ghcr.io/openabdev/charts/openab --version "${TARGET_VERSION}-beta.1" > /dev/null \
+  && echo "✅ Pre-release ${TARGET_VERSION}-beta.1 available" \
+  || echo "⚠️ beta.1 not found — check GitHub releases for available pre-release tags"
 ```
 
 ### Step 1: Pre-release Validation (Required)
@@ -300,75 +530,107 @@ helm show chart oci://ghcr.io/openabdev/charts/openab --version <target-version>
 > ⚠️ Per project convention, **a stable release must be preceded by a validated pre-release**. Do not skip directly to Step 2.
 >
 > **When can Step 1 be skipped?** Only if the release notes for the target stable version explicitly contain `pre-release-validated: true`, indicating that the corresponding pre-release has already been validated in another environment (e.g. a staging cluster). In all other cases, run Step 1 first.
+>
+> **Agent note — pass/fail criteria:**
+> - **Pass:** `kubectl wait` exits 0 AND `pgrep -x openab` exits 0 AND log scan returns no `panic` or `fatal` lines.
+> - **Fail:** Any of the above fails, OR a human operator reports a functional regression in Discord within the monitoring window. On failure, run `helm rollback` (see Section IV) and stop — do not proceed to Step 2.
 
 ```bash
-BACKUP_VALUES="<backup-dir>/values.yaml"  # e.g. openab-backup-20260413-070000/values.yaml
-
-# Dry-run the pre-release version first
-helm upgrade openab openab/openab \
-  --version <target-version>-beta.1 \
+# Dry-run first to catch values conflicts
+helm upgrade "$RELEASE_NAME" oci://ghcr.io/openabdev/charts/openab \
+  --version "${TARGET_VERSION}-beta.1" \
   -f "$BACKUP_VALUES" \
   --dry-run
 
 # Deploy the pre-release
-helm upgrade openab openab/openab \
-  --version <target-version>-beta.1 \
+helm upgrade "$RELEASE_NAME" oci://ghcr.io/openabdev/charts/openab \
+  --version "${TARGET_VERSION}-beta.1" \
   -f "$BACKUP_VALUES"
 
-kubectl rollout status deployment/openab-kiro --timeout=300s
+kubectl rollout status "deployment/${DEPLOYMENT}" --timeout=300s
+
+# Automated smoke test
+POD=$(kubectl get pod \
+  -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" \
+  -o jsonpath='{.items[0].metadata.name}')
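+kubectl wait --for=condition=Ready "pod/${POD}" --timeout=120s || { echo "❌ Pod not Ready. Run rollback."; exit 1; }
+kubectl exec "$POD" -- pgrep -x openab || { echo "❌ openab process not running. Run rollback."; exit 1; }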
+PANIC_LINES=$(kubectl logs "deployment/${DEPLOYMENT}" --tail=100 | grep -icE "panic|fatal" || true)
+if [ "$PANIC_LINES" -gt 0 ]; then
+  echo "❌ Panic/fatal lines found in logs. Do NOT proceed. Run rollback."
+  exit 1
+fi
+echo "✅ Automated smoke test passed. Proceed with Discord functional validation."
 
-# Run functional tests in the Discord channel
-# Verify basic conversation and tool calls work as expected
-# If issues are found, wait for beta.2 and repeat this step
+# After automated smoke test — manual Discord check required:
+# → Send a test message in the Discord channel
+# → Confirm the bot responds and basic conversation / tool calls work
+# → If bot is unresponsive or broken: run helm rollback (Section IV) and stop
 ```
 
 ### Step 2: Promote to Stable
 
-```bash
-BACKUP_VALUES="<backup-dir>/values.yaml"
+> **Agent note:** Only run this after Step 1 has passed both automated and Discord validation.
 
+```bash
 # Dry-run the stable version
-helm upgrade openab openab/openab \
-  --version <target-version> \
+helm upgrade "$RELEASE_NAME" oci://ghcr.io/openabdev/charts/openab \
+  --version "${TARGET_VERSION}" \
   -f "$BACKUP_VALUES" \
   --dry-run
 
-# Deploy stable (short downtime is expected)
-helm upgrade openab openab/openab \
-  --version <target-version> \
+# Deploy stable (short downtime is expected due to Recreate strategy)
+helm upgrade "$RELEASE_NAME" oci://ghcr.io/openabdev/charts/openab \
+  --version "${TARGET_VERSION}" \
   -f "$BACKUP_VALUES"
 
 # Wait for the Pod to be ready
-kubectl rollout status deployment/openab-kiro --timeout=300s
+kubectl rollout status "deployment/${DEPLOYMENT}" --timeout=300s
 ```
 
 ### Post-Upgrade Verification
 
+> **Agent note — pass/fail criteria:**
+> - **Pass:** All commands below exit 0 AND the deployed chart version matches `TARGET_VERSION` AND no panic/fatal in logs.
+> - **Fail:** Any command exits non-zero, or the chart version does not match. → Proceed to Section IV Rollback.
+
 ```bash
-POD=$(kubectl get pod -l app.kubernetes.io/instance=openab -o jsonpath='{.items[0].metadata.name}')
+POD=$(kubectl get pod \
+  -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" \
+  -o jsonpath='{.items[0].metadata.name}')
 
 # 1. Check Pod status (must be Running and READY)
-kubectl get pod -l app.kubernetes.io/instance=openab
-kubectl wait --for=condition=Ready pod/$POD --timeout=120s
+kubectl get pod -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}"
+kubectl wait --for=condition=Ready "pod/${POD}" --timeout=120s
+
+# 2. Verify deployed chart version matches target
+DEPLOYED=$(helm list -f "$RELEASE_NAME" -o json | jq -r '.[0].chart')
+echo "Deployed chart: $DEPLOYED  |  Expected: openab-${TARGET_VERSION}"
+if [ "$DEPLOYED" != "openab-${TARGET_VERSION}" ]; then
+  echo "❌ Version mismatch. Investigate before proceeding."
+fi
 
-# 2. Verify version (from Helm and image tag — no source code in the container)
-helm list -f openab
-kubectl get deployment openab-kiro \
+# 3. Verify image tag
+kubectl get "deployment/${DEPLOYMENT}" \
   -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'
 
-# 3. Confirm the openab process is running
-kubectl exec $POD -- pgrep -x openab
+# 4. Confirm the openab process is running
+kubectl exec "$POD" -- pgrep -x openab
 
-# 4. Check logs for errors (ERROR / WARN / panic)
-kubectl logs deployment/openab-kiro --tail=100 | grep -iE "error|warn|panic|fatal"
+# 5. Check logs for errors
+PANIC_LINES=$(kubectl logs "deployment/${DEPLOYMENT}" --tail=100 | grep -icE "panic|fatal" || true)
+ERROR_LINES=$(kubectl logs "deployment/${DEPLOYMENT}" --tail=100 | grep -icE "error|warn" || true)
+echo "Panic/fatal lines: $PANIC_LINES  |  Error/warn lines: $ERROR_LINES"
+if [ "$PANIC_LINES" -gt 0 ]; then
+  echo "❌ Panic/fatal found. Rollback recommended."
+fi
 
-# 5. Verify steering files and agent config are still present
-#    (PVC content should survive upgrades, but this confirms it)
-kubectl exec $POD -- ls /home/agent/.kiro/steering/
-kubectl exec $POD -- cat /home/agent/.kiro/agents/default.json | head -5
-#    If either path is missing, restore from backup (see Section IV: Restore Custom Config)
+# 6. Verify PVC data (steering files and agent config) are still present
+kubectl exec "$POD" -- ls /home/agent/.kiro/steering/
+kubectl exec "$POD" -- cat /home/agent/.kiro/agents/default.json | head -5
+# If either path is missing, restore from backup (see Section IV: Restore Custom Config)
 
-# 6. Discord E2E verification (final check)
+# 7. Discord E2E verification (final check — requires human operator)
 # → Send a test message in the Discord channel
 # → Confirm the bot responds and conversation works correctly
 ```
@@ -386,6 +648,17 @@ kubectl exec $POD -- cat /home/agent/.kiro/agents/default.json | head -5
 
 ### Decision Tree
 
+> **Agent note — machine-readable branch criteria:**
+>
+> | Observed condition | Action |
+> |---|---|
+> | `kubectl get pod` shows `CrashLoopBackOff` or `Pending` | `helm rollback` immediately |
+> | Pod is `Running` AND `pgrep -x openab` exits non-zero | `helm rollback` |
+> | Pod is `Running`, process OK, but logs contain `panic` or `fatal` | `helm rollback` |
+> | Pod is `Running`, process OK, logs clean, but no Discord response after 60s | `kubectl rollout restart` first; if still no response after 60s → `helm rollback` |
+> | Pod is `Running`, process OK, logs clean, Discord responds, but config is missing | Restore config from backup → `kubectl rollout restart` |
+> | Quick fix is clearly identified (e.g. a known bad config key) | Hotfix — escalate to human engineer |
+
 ```
 Issue detected after upgrade
          │
@@ -394,20 +667,16 @@ Issue detected after upgrade
          │
          ├─ CrashLoopBackOff / Pending ──→ helm rollback <REVISION>
          │
-         ├─ Running, but functionality broken
-         │         │
-         │         ├─ Quick fix possible (e.g. config error) ──→ hotfix (engineer)
-         │         └─ Root cause unclear ────────────────────→ helm rollback <REVISION>
+         ├─ Running, pgrep exits non-zero OR panic in logs
+         │         └─ helm rollback <REVISION>
          │
-         ├─ Running, but bot is unresponsive
-         │         │
-         │         └─ kubectl rollout restart deployment/openab-kiro
+         ├─ Running, logs clean, bot unresponsive
+         │         └─ kubectl rollout restart deployment/${DEPLOYMENT}
          │                   │
-         │                   ├─ Recovers after restart ──→ Continue monitoring
-         │                   └─ Still unresponsive      ──→ helm rollback <REVISION>
+         │                   ├─ Responds within 60s ──→ Continue monitoring
+         │                   └─ Still unresponsive   ──→ helm rollback <REVISION>
          │
-         └─ Running, but custom config is missing
-                   │
+         └─ Running, bot OK, config missing
                    └─ Restore config from backup → kubectl rollout restart
 ```
 
@@ -421,44 +690,69 @@ Issue detected after upgrade
 ### Helm Rollback
 
 ```bash
-# 1. View release history
-helm history openab
+export KUBECONFIG=~/.kube/config
+RELEASE_NAME=$(helm list -o json | jq -r '.[0].name')
+DEPLOYMENT="${RELEASE_NAME}-kiro"
+
+# 1. View release history and identify the previous revision
+helm history "$RELEASE_NAME"
+
+# 2. Get the previous revision number automatically
+PREV_REVISION=$(helm history "$RELEASE_NAME" --output json \
+  | jq -r 'sort_by(.revision) | .[-2].revision')
+echo "Rolling back to revision: $PREV_REVISION"
 
-# 2. Roll back to a previous revision
-helm rollback openab <REVISION>
+# 3. Roll back
+helm rollback "$RELEASE_NAME" "$PREV_REVISION"
 
-# 3. Wait for the Pod to be ready
-kubectl rollout status deployment/openab-kiro --timeout=300s
+# 4. Wait for the Pod to be ready
+kubectl rollout status "deployment/${DEPLOYMENT}" --timeout=300s
 
-# 4. Confirm rollback succeeded
-kubectl get pod -l app.kubernetes.io/instance=openab
+# 5. Confirm rollback succeeded
+kubectl get pod -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}"
 
-# 5. Run full post-rollback verification (same as post-upgrade verification)
-POD=$(kubectl get pod -l app.kubernetes.io/instance=openab -o jsonpath='{.items[0].metadata.name}')
-kubectl wait --for=condition=Ready pod/$POD --timeout=120s
-kubectl exec $POD -- pgrep -x openab
-kubectl logs deployment/openab-kiro --tail=100 | grep -iE "error|warn|panic|fatal"
+# 6. Run full post-rollback verification
+POD=$(kubectl get pod \
+  -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" \
+  -o jsonpath='{.items[0].metadata.name}')
+kubectl wait --for=condition=Ready "pod/${POD}" --timeout=120s
+kubectl exec "$POD" -- pgrep -x openab
+kubectl logs "deployment/${DEPLOYMENT}" --tail=100 | grep -iE "error|warn|panic|fatal"
 # → Send a test message in the Discord channel to confirm the bot responds
 ```
 
 ### Restore Custom Config
 
 ```bash
-POD=$(kubectl get pod -l app.kubernetes.io/instance=openab -o jsonpath='{.items[0].metadata.name}')
+export KUBECONFIG=~/.kube/config
+RELEASE_NAME=$(helm list -o json | jq -r '.[0].name')
+DEPLOYMENT="${RELEASE_NAME}-kiro"
+
+POD=$(kubectl get pod \
+  -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" \
+  -o jsonpath='{.items[0].metadata.name}')
+
+# Resolve backup directory
+BACKUP_DIR=$(ls -td openab-backup-* 2>/dev/null | head -1)
+echo "Restoring from: ${BACKUP_DIR:?no openab-backup-* directory found}"
 
 # Restore agent config
-kubectl cp ./backup-default.json $POD:/home/agent/.kiro/agents/default.json
+kubectl cp "$BACKUP_DIR/agents/default.json" "$POD:/home/agent/.kiro/agents/default.json"
 
 # Restore steering files
 # ⚠️ kubectl cp directory behaviour varies across versions — trailing slash matters.
 # Use the tar pipe method below to avoid accidentally creating a nested directory
 # (e.g. steering/steering/) which can happen with some kubectl versions.
-kubectl exec $POD -- mkdir -p /home/agent/.kiro/steering
-tar c -C ./backup-steering . | kubectl exec -i $POD -- tar x -C /home/agent/.kiro/steering
+kubectl exec "$POD" -- mkdir -p /home/agent/.kiro/steering
+tar c -C "$BACKUP_DIR/steering" . | kubectl exec -i "$POD" -- tar x -C /home/agent/.kiro/steering
 
 # Restore GitHub CLI credentials
-kubectl cp ./backup-hosts.yml $POD:/home/agent/.config/gh/hosts.yml
+kubectl cp "$BACKUP_DIR/hosts.yml" "$POD:/home/agent/.config/gh/hosts.yml"
+
+# Restore kiro-cli auth database
+kubectl exec "$POD" -- mkdir -p /home/agent/.local/share/kiro-cli
+kubectl cp "$BACKUP_DIR/kiro-auth.sqlite3" "$POD:/home/agent/.local/share/kiro-cli/data.sqlite3"
 
 # Restart Pod to apply changes
-kubectl rollout restart deployment/openab-kiro
+kubectl rollout restart "deployment/${DEPLOYMENT}"
 ```

From 506555a25dc0a7b65c55fdad03b8641b23bc03c5 Mon Sep 17 00:00:00 2001
From: JARVIS-Agent <JARVIS-coding-Agent@users.noreply.github.com>
Date: Tue, 14 Apr 2026 23:10:35 +0000
Subject: [PATCH 5/8] docs: address AI-first review feedback (v1.4)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Fix all issues flagged in the second round of review:

1. TARGET_VERSION: auto-resolved from OCI registry latest stable version
   (helm show chart ... | grep ^version) — no more hardcoded placeholder

2. Pre-release beta.1 ambiguous branch: add explicit 3-way branch in
   Section I env setup — (a) beta.1 found: set PRERELEASE_VERSION,
   (b) not found but release notes contain pre-release-validated: true:
   set PRERELEASE_VERSION="" to skip Step 1, (c) neither: exit 1 with
   clear instructions (wait / check alternate tags / ask human)

3. Discord E2E validation: replace comment-only instructions with
   read -r HUMAN_INPUT gate accepting CONFIRMED or ROLLBACK;
   unrecognized input exits 1 for safety

4. Announcements: replace text-only descriptions with curl Discord
   webhook calls (guarded by DISCORD_WEBHOOK_URL env var check)

5. Session env file (openab-session-env.sh): resolve all variables once
   in Section I and persist to file; all subsequent sections source it.
   BACKUP_DIR appended after Step 0. Resumption instructions included.

6. BACKUP_DIR validation on resume: add timestamp echo and ls check
   before upgrade so agent can confirm correct backup is being used

7. Pre-condition check (Section 0): verify kubectl/helm/jq/curl/awk/tar
   are installed, KUBECONFIG file exists, context is set, cluster is
   reachable — exit 1 with per-tool guidance on failure

8. PREV_REVISION: resolve from backup helm-history.txt (captured before
   upgrade) using awk to find last "deployed" revision — avoids the
   ambiguity of [-2] when pre-release + stable both ran during upgrade

9. Add expected stdout patterns and estimated durations to key steps
   so agents can validate success beyond exit code
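
Illustration for item 8: a minimal sketch of how the last "deployed"
revision might be read from a saved helm-history.txt. The helper name
last_deployed_revision and the sample rows are hypothetical and are not
taken from the SOP itself. The sketch assumes the default table output of
helm history, where the first field is the revision number and the status
appears as a standalone field. A DESCRIPTION that happens to contain the
bare word "deployed" would produce a false match, so JSON output is likely
the safer input when the history can be captured again.

```bash
#!/usr/bin/env bash
# Sketch only: the helper name and the sample rows below are illustrative.

# Print the revision of the last row whose STATUS field is exactly "deployed".
last_deployed_revision() {
  awk '
    $1 ~ /^[0-9]+$/ {                 # skip the header row
      for (i = 2; i <= NF; i++)
        if ($i == "deployed") rev = $1
    }
    END { if (rev != "") print rev }
  ' "$1"
}

# Hypothetical helm-history.txt, as captured before an upgrade.
SAMPLE_HISTORY=$(mktemp)
cat > "$SAMPLE_HISTORY" <<'EOF'
REVISION  UPDATED                   STATUS      CHART         APP VERSION  DESCRIPTION
1         Mon Apr  6 10:00:00 2026  superseded  openab-0.7.3  0.7.3        Install complete
2         Fri Apr 10 09:30:00 2026  deployed    openab-0.7.4  0.7.4        Upgrade complete
EOF

PREV_REVISION=$(last_deployed_revision "$SAMPLE_HISTORY")
rm -f "$SAMPLE_HISTORY"

if [ -z "$PREV_REVISION" ]; then
  echo "No deployed revision found in history" >&2
  exit 1
fi
echo "PREV_REVISION=$PREV_REVISION"
```

Because the file is captured before any upgrade runs, the matched row is
the revision that was live at backup time, regardless of how many
pre-release or stable revisions are added afterwards.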
---
 docs/openab-upgrade-sop.md | 794 +++++++++++++++++++++----------------
 1 file changed, 459 insertions(+), 335 deletions(-)

diff --git a/docs/openab-upgrade-sop.md b/docs/openab-upgrade-sop.md
index 0dfc3307..0d3421fb 100644
--- a/docs/openab-upgrade-sop.md
+++ b/docs/openab-upgrade-sop.md
@@ -2,7 +2,7 @@
 
 | | |
 |---|---|
-| **Document Version** | 1.3 |
+| **Document Version** | 1.4 |
 | **Last Updated** | 2026-04-14 |
 
 ## Environment Reference
@@ -11,7 +11,7 @@
 |---|---|
 | Deployment Method | Kubernetes (Helm Chart) |
 | Deployment Name | `<release-name>-kiro` (default: `openab-kiro`) — see note below |
-| Pod Label (precise) | `app.kubernetes.io/instance=openab,app.kubernetes.io/name=openab-kiro` |
+| Pod Label (precise) | `app.kubernetes.io/instance=<release-name>,app.kubernetes.io/name=<release-name>-kiro` |
 | Helm Repo (OCI, recommended) | `oci://ghcr.io/openabdev/charts/openab` |
 | Helm Repo (GitHub Pages, fallback) | `https://openabdev.github.io/openab` |
 | Image Registry | `ghcr.io/openabdev/openab` |
@@ -52,156 +52,265 @@
 
 ```
 ┌─────────────────────────────────────────────────────────────┐
-│                     Pre-Upgrade Preparation                  │
-│  Check version info → Read Release Notes → Announce outage  │
+│                  0. Environment Readiness Check              │
+│  kubectl / helm / jq / curl / KUBECONFIG / cluster access   │
 └────────────────────────┬────────────────────────────────────┘
                          │
                          ▼
 ┌─────────────────────────────────────────────────────────────┐
-│                          Backup                              │
-│  Helm values / Agent config / Steering / hosts.yml / PVC    │
-│  → Verification gate (all files exist & non-empty) ✅        │
+│                     I. Pre-Upgrade Preparation               │
+│  Resolve vars → Save session env file → Read release notes  │
 └────────────────────────┬────────────────────────────────────┘
                          │
                          ▼
 ┌─────────────────────────────────────────────────────────────┐
-│                  Upgrade Execution (2 Phases)                │
-│                                                             │
-│  Step 1: Pre-release Validation                             │
-│    helm upgrade --version x.x.x-beta.1                     │
-│    └─ Automated smoke test (kubectl wait + pgrep + logs)    │
-│         ├─ Pass ──────────────────────────┐                 │
-│         └─ Fail → rollback, stop          │                 │
-│                                           ▼                 │
-│  Step 2: Promote to Stable                                  │
-│    helm upgrade --version x.x.x                            │
-│    └─ kubectl rollout status                               │
+│                          II. Backup                          │
+│  Steps 0–7 → Verification Gate (all files non-empty) ✅      │
 └────────────────────────┬────────────────────────────────────┘
                          │
                          ▼
 ┌─────────────────────────────────────────────────────────────┐
-│                    Post-Upgrade Verification                  │
-│  Pod Ready? → Version check → Log check → Discord E2E test  │
-│       │                                                     │
-│       ├─ All pass → Send completion notice ✅               │
-│       └─ Issues   → Proceed to rollback ↓                  │
+│                 III. Upgrade Execution (2 Phases)            │
+│                                                             │
+│  Step 1: Pre-release Validation                             │
+│    Check beta.1 exists → deploy → automated smoke test      │
+│    → ⏸ HUMAN CONFIRMATION → proceed or rollback            │
+│                                                             │
+│  Step 2: Promote to Stable                                  │
+│    helm upgrade (OCI) → rollout status → verification       │
 └────────────────────────┬────────────────────────────────────┘
-                         │ (on failure)
+                         │
                          ▼
 ┌─────────────────────────────────────────────────────────────┐
-│                       Rollback                               │
-│                                                             │
-│  Diagnose symptom                                           │
-│   ├─ Pod won't start    → helm rollback <REVISION>          │
-│   ├─ Broken / Pod OK    → rollback or hotfix                │
-│   ├─ Config lost        → restore from backup               │
-│   └─ Bot unresponsive   → kubectl rollout restart → rollback │
-│                                                             │
-│  Rollback done → Re-run verification → Send rollback notice │
+│                    IV. Rollback                              │
+│  PREV_REVISION from backup helm-history.txt                 │
+│  Machine-readable decision table → rollback → verify        │
 └─────────────────────────────────────────────────────────────┘
 ```
 
 ---
 
+## 0. Environment Readiness Check
+
+> **Agent instruction:** Run this section before anything else. If any check fails, stop and resolve the issue before proceeding. Do not attempt workarounds.
+>
+> **Expected output on success:** Every check line prints `✅` and the final line reads `✅ Environment ready. Proceed to Section I.`
+
+```bash
+export KUBECONFIG=~/.kube/config
+
+echo "=== Environment Readiness Check ==="
+READY=true
+
+check_cmd() {
+  if command -v "$1" > /dev/null 2>&1; then
+    echo "  ✅ $1 found"
+  else
+    echo "  ❌ $1 not found — install it before proceeding"
+    READY=false
+  fi
+}
+
+check_cmd kubectl
+check_cmd helm
+check_cmd jq
+check_cmd curl
+check_cmd awk
+check_cmd tar
+
+echo ""
+echo "  KUBECONFIG: $KUBECONFIG"
+if [ -f "$KUBECONFIG" ]; then
+  echo "  ✅ KUBECONFIG file exists"
+else
+  echo "  ❌ KUBECONFIG file not found at $KUBECONFIG"
+  READY=false
+fi
+
+CURRENT_CONTEXT=$(kubectl config current-context 2>/dev/null)
+if [ -n "$CURRENT_CONTEXT" ]; then
+  echo "  ✅ kubectl context: $CURRENT_CONTEXT"
+else
+  echo "  ❌ No kubectl context — check KUBECONFIG"
+  READY=false
+fi
+
+if kubectl cluster-info > /dev/null 2>&1; then
+  echo "  ✅ Cluster reachable"
+else
+  echo "  ❌ Cannot reach cluster — check KUBECONFIG and cluster status"
+  READY=false
+fi
+
+echo ""
+if [ "$READY" = true ]; then
+  echo "✅ Environment ready. Proceed to Section I."
+else
+  echo "❌ Environment not ready. Fix the issues above before proceeding."
+  exit 1
+fi
+```
+
+---
+
 ## I. Pre-Upgrade Preparation
 
-### 1. Resolve Environment Variables
+### 1. Resolve All Session Variables
 
-> **Agent note:** Run this block first. All subsequent steps depend on these variables.
+> **Agent instruction:** Run this entire block as one unit. All subsequent sections depend on `openab-session-env.sh`. If the session file already exists from a previous run (e.g. backup was done earlier and upgrade is now resuming), source it instead of re-running this block.
 >
-> **Step 1 output:** `RELEASE_NAME`, `DEPLOYMENT`, `POD`, `CURRENT_VERSION`, `TARGET_VERSION` → used in all subsequent steps.
+> **Output:** `openab-session-env.sh` → sourced by all subsequent sections.
 
 ```bash
 export KUBECONFIG=~/.kube/config
 
-# Resolve release name and deployment name
+# If resuming a previous session, source the saved env and skip this block:
+# source openab-session-env.sh && echo "✅ Session env loaded" && exit 0
+
+# --- Resolve release and deployment names ---
 RELEASE_NAME=$(helm list -o json | jq -r '.[0].name')
+if [ -z "$RELEASE_NAME" ] || [ "$RELEASE_NAME" = "null" ]; then
+  echo "❌ No Helm release found. Is OpenAB installed?"
+  exit 1
+fi
 DEPLOYMENT="${RELEASE_NAME}-kiro"
 echo "Release: $RELEASE_NAME  |  Deployment: $DEPLOYMENT"
+# Expected output (with the default release name): "Release: openab  |  Deployment: openab-kiro"
 
-# Get current running pod (precise label selector — avoids matching other agents)
-POD=$(kubectl get pod \
-  -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" \
-  -o jsonpath='{.items[0].metadata.name}')
-echo "Pod: $POD"
-if [ -z "$POD" ]; then echo "❌ Pod not found. Check label selectors."; fi
-
-# Get current deployed chart version
+# --- Resolve current version ---
 CURRENT_VERSION=$(helm list -f "$RELEASE_NAME" -o json | jq -r '.[0].chart' | sed 's/openab-//')
 echo "Current chart version: $CURRENT_VERSION"
 
-# List available versions via OCI (no repo add needed)
-helm show chart oci://ghcr.io/openabdev/charts/openab 2>/dev/null | grep ^version
+# --- Resolve target version (latest stable from OCI, no pre-release tags) ---
+TARGET_VERSION=$(helm show chart oci://ghcr.io/openabdev/charts/openab 2>/dev/null \
+  | grep '^version:' | awk '{print $2}')
+if [ -z "$TARGET_VERSION" ]; then
+  echo "❌ Could not resolve target version from OCI registry."
+  echo "   Check network connectivity and registry access."
+  exit 1
+fi
+echo "Target version (latest stable from OCI): $TARGET_VERSION"
+# Example output (the version will vary): "Target version (latest stable from OCI): 0.7.5"
 
-# List all published versions (requires GitHub Pages repo to be added)
-# helm repo add openab https://openabdev.github.io/openab && helm repo update
-# helm search repo openab --versions
+# If you need to upgrade to a specific version instead of latest, override here:
+# TARGET_VERSION="0.7.4"
 
-# Check the actual image the Pod is running
-kubectl get deployment "$DEPLOYMENT" \
-  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'
-```
+# --- Check if upgrade is needed ---
+if [ "$CURRENT_VERSION" = "$TARGET_VERSION" ]; then
+  echo "ℹ️  Already on the latest version ($TARGET_VERSION). No upgrade needed."
+  echo "   If you still want to proceed (e.g. force re-deploy), continue manually."
+  exit 0
+fi
 
-After running the above, set the target version:
+# --- Check pre-release availability (determines Step 1 path) ---
+if helm show chart oci://ghcr.io/openabdev/charts/openab \
+     --version "${TARGET_VERSION}-beta.1" > /dev/null 2>&1; then
+  PRERELEASE_VERSION="${TARGET_VERSION}-beta.1"
+  echo "✅ Pre-release found: $PRERELEASE_VERSION"
+else
+  # Check if release notes explicitly mark this as pre-validated
+  RELEASE_NOTES=$(gh api "repos/openabdev/openab/releases/tags/v${TARGET_VERSION}" \
+    --jq '.body' 2>/dev/null || true)
+  if echo "$RELEASE_NOTES" | grep -q 'pre-release-validated: true'; then
+    PRERELEASE_VERSION=""
+    echo "✅ Release notes contain pre-release-validated: true — Step 1 will be skipped"
+  else
+    echo "❌ STOP: ${TARGET_VERSION}-beta.1 not found in OCI registry."
+    echo "   Release notes do not contain 'pre-release-validated: true'."
+    echo "   Options:"
+    echo "   1. Wait for the project to publish ${TARGET_VERSION}-beta.1"
+    echo "   2. Check GitHub releases for an alternative pre-release tag:"
+    echo "      gh release list --repo openabdev/openab"
+    echo "   3. If a different pre-release tag is available (e.g. beta.2), set:"
+    echo "      PRERELEASE_VERSION=\"${TARGET_VERSION}-beta.2\""
+    echo "   Do NOT proceed until a pre-release is available or the release notes"
+    echo "   explicitly contain 'pre-release-validated: true'."
+    exit 1
+  fi
+fi
 
-```bash
-# Set this to the version you are upgrading to (e.g. 0.7.5)
-TARGET_VERSION="0.7.5"
-echo "Upgrading to: $TARGET_VERSION"
+# --- Save session environment file ---
+cat > openab-session-env.sh <<EOF
+# OpenAB upgrade session — generated $(date)
+export KUBECONFIG=~/.kube/config
+export RELEASE_NAME="${RELEASE_NAME}"
+export DEPLOYMENT="${DEPLOYMENT}"
+export CURRENT_VERSION="${CURRENT_VERSION}"
+export TARGET_VERSION="${TARGET_VERSION}"
+export PRERELEASE_VERSION="${PRERELEASE_VERSION}"
+# BACKUP_DIR will be appended by Section II
+EOF
+echo "✅ Session env saved to openab-session-env.sh"
+echo "   Source it in subsequent sessions: source openab-session-env.sh"
 ```
 
 ### 2. Read the Release Notes
 
-- Go to `https://github.com/openabdev/openab/releases/tag/v${TARGET_VERSION}`
-- Pay special attention to:
-  - Breaking changes
-  - Helm Chart values changes
-  - Added or deprecated environment variables
-  - Any migration steps
+```bash
+source openab-session-env.sh
 
-### 3. Check Node Resources
+echo "Release notes URL: https://github.com/openabdev/openab/releases/tag/v${TARGET_VERSION}"
 
-> Skipping this step risks the new Pod getting stuck in `Pending` if the node lacks capacity.
+# Print release notes to terminal for review
+gh release view "v${TARGET_VERSION}" --repo openabdev/openab 2>/dev/null \
+  || echo "⚠️ Could not fetch release notes via gh CLI — check the URL manually"
+```
+
+Pay special attention to:
+- Breaking changes
+- Helm Chart values changes
+- Added or deprecated environment variables
+- Any migration steps
+
+### 3. Check Node Resources
 
 ```bash
-# Check allocatable resources on all nodes
-kubectl describe nodes | grep -A 5 "Allocatable:"
+source openab-session-env.sh
 
-# Check current resource requests across the cluster
+kubectl describe nodes | grep -A 5 "Allocatable:"
 kubectl top nodes
 ```
 
+> Skipping this step risks the new Pod getting stuck in `Pending` if the node lacks capacity.
+
 ### 4. Announce the Upgrade
 
 > ⚠️ **Downtime is expected during every upgrade.** The deployment strategy is `Recreate` because the PVC is ReadWriteOnce, which does not support RollingUpdate. The old Pod is terminated before the new one starts — the Discord bot will be unavailable during this window, and this is expected behaviour.
 
-- Notify all users via Discord channel / Slack / email:
-  - Scheduled upgrade time and estimated downtime (typically 1–3 minutes)
-  - Scope of impact (Discord bot will be offline during the upgrade)
-  - Emergency contact
+```bash
+source openab-session-env.sh
+
+# Option A: Discord webhook notification (set DISCORD_WEBHOOK_URL in environment)
+if [ -n "${DISCORD_WEBHOOK_URL:-}" ]; then
+  curl -fsS -o /dev/null -X POST "$DISCORD_WEBHOOK_URL" \
+    -H "Content-Type: application/json" \
+    -d "{\"content\": \"🔧 **Upgrade starting:** OpenAB is being upgraded from v${CURRENT_VERSION} to v${TARGET_VERSION}. The bot will be offline for approximately 1–3 minutes.\"}" \
+    && echo "✅ Discord notification sent" || echo "⚠️  Discord webhook call failed; notify users manually"
+else
+  echo "ℹ️  DISCORD_WEBHOOK_URL not set — skipping automated notification"
+  echo "   Notify users manually: OpenAB upgrading v${CURRENT_VERSION} → v${TARGET_VERSION}, ~1–3 min downtime"
+fi
+```
 
 ---
 
 ## II. Backup
 
-> **Agent note — dependency chain:**
-> - Step 0 must run first (resolves `BACKUP_DIR` and `POD`)
-> - Steps 1–7 depend on `POD` from Step 0
-> - The Verification Gate must pass before proceeding to Section III
+> **Agent instruction — dependency chain:**
+> - `openab-session-env.sh` must exist (created in Section I)
+> - Steps 0–7 must run in order
+> - The Verification Gate must print `✅ GATE PASSED` before proceeding to Section III
+> - `BACKUP_DIR` is appended to `openab-session-env.sh` after Step 0
 
 ### Agent-Executable Backup (Linear Sequence)
 
-This section is written as a machine-executable runbook with no ambiguous branches. Run steps in order.
-
 #### Step 0 — Resolve variables and create backup directory
 
-> **Output:** `BACKUP_DIR`, `POD` → used in Steps 1–7 and the Verification Gate.
+> **Output:** `BACKUP_DIR` appended to `openab-session-env.sh` → used in Steps 1–7 and the Verification Gate.
 
 ```bash
-export KUBECONFIG=~/.kube/config
+source openab-session-env.sh
 
-RELEASE_NAME=$(helm list -o json | jq -r '.[0].name')
-DEPLOYMENT="${RELEASE_NAME}-kiro"
 BACKUP_DIR="openab-backup-$(date +%Y%m%d-%H%M%S)"
 mkdir -p "$BACKUP_DIR"
 echo "Backup directory: $BACKUP_DIR"
@@ -211,70 +320,81 @@ POD=$(kubectl get pod \
   -o jsonpath='{.items[0].metadata.name}')
 echo "Pod: $POD"
 
-# Gate: abort if pod not found
 if [ -z "$POD" ]; then
   echo "❌ Pod not found. Cannot proceed with backup."
   exit 1
 fi
 
-# Gate: verify tar is available (required for directory kubectl cp)
 if ! kubectl exec "$POD" -- which tar > /dev/null 2>&1; then
   echo "❌ tar not found in container. kubectl cp of directories will fail. Aborting."
   exit 1
 fi
+
+# Append BACKUP_DIR to session env file
+echo "export BACKUP_DIR=\"${BACKUP_DIR}\"" >> openab-session-env.sh
+echo "✅ BACKUP_DIR saved to openab-session-env.sh"
 ```
 
 #### Step 1 — Backup Helm values
 
 > **Output:** `$BACKUP_DIR/values.yaml`
+> · **Expected:** file size > 0 bytes
 
 ```bash
+source openab-session-env.sh
 helm get values "$RELEASE_NAME" -o yaml > "$BACKUP_DIR/values.yaml"
-echo "✅ Helm values backed up"
+echo "✅ Helm values backed up ($(wc -c < "$BACKUP_DIR/values.yaml") bytes)"
 ```
 
 #### Step 2 — Backup agent config
 
-> **Input:** `POD` from Step 0 · **Output:** `$BACKUP_DIR/agents/`
+> **Input:** `POD` (re-resolved) · **Output:** `$BACKUP_DIR/agents/`
 
 ```bash
+source openab-session-env.sh
+POD=$(kubectl get pod -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" -o jsonpath='{.items[0].metadata.name}')
 kubectl cp "$POD:/home/agent/.kiro/agents/" "$BACKUP_DIR/agents/"
 echo "✅ Agent config backed up"
 ```
 
 #### Step 3 — Backup steering files
 
-> **Input:** `POD` from Step 0 · **Output:** `$BACKUP_DIR/steering/`
+> **Input:** `POD` (re-resolved) · **Output:** `$BACKUP_DIR/steering/`
 
 ```bash
+source openab-session-env.sh
+POD=$(kubectl get pod -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" -o jsonpath='{.items[0].metadata.name}')
 kubectl cp "$POD:/home/agent/.kiro/steering/" "$BACKUP_DIR/steering/"
 echo "✅ Steering files backed up"
 ```
 
 #### Step 4 — Backup skills (optional)
 
-> **Input:** `POD` from Step 0 · **Output:** `$BACKUP_DIR/skills/` (may be skipped)
+> **Input:** `POD` (re-resolved) · **Output:** `$BACKUP_DIR/skills/` (may be absent)
 
 ```bash
+source openab-session-env.sh
+POD=$(kubectl get pod -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" -o jsonpath='{.items[0].metadata.name}')
 if kubectl exec "$POD" -- test -d /home/agent/.kiro/skills 2>/dev/null; then
   kubectl cp "$POD:/home/agent/.kiro/skills/" "$BACKUP_DIR/skills/"
   echo "✅ Skills directory backed up"
 else
-  echo "⚠️ skills/ not found in container — skipping (this is normal if no custom skills are installed)"
+  echo "⚠️ skills/ not found in container — skipping (normal if no custom skills are installed)"
 fi
 ```
 
 #### Step 5 — Backup GitHub CLI credentials and kiro-cli auth
 
-> **Input:** `POD` from Step 0 · **Output:** `$BACKUP_DIR/hosts.yml`, `$BACKUP_DIR/kiro-auth.sqlite3`
+> **Input:** `POD` (re-resolved) · **Output:** `$BACKUP_DIR/hosts.yml`, `$BACKUP_DIR/kiro-auth.sqlite3`
 
 ```bash
+source openab-session-env.sh
+POD=$(kubectl get pod -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" -o jsonpath='{.items[0].metadata.name}')
 kubectl cp "$POD:/home/agent/.config/gh/hosts.yml" "$BACKUP_DIR/hosts.yml"
-echo "✅ hosts.yml backed up"
+echo "✅ hosts.yml backed up ($(wc -c < "$BACKUP_DIR/hosts.yml") bytes)"
 
-# kiro-cli auth database — required for bot to resume without re-authentication
 kubectl cp "$POD:/home/agent/.local/share/kiro-cli/data.sqlite3" "$BACKUP_DIR/kiro-auth.sqlite3"
-echo "✅ kiro-cli auth DB backed up"
+echo "✅ kiro-cli auth DB backed up ($(wc -c < "$BACKUP_DIR/kiro-auth.sqlite3") bytes)"
 ```
 
 #### Step 6 — Backup Kubernetes Secret
@@ -282,41 +402,38 @@ echo "✅ kiro-cli auth DB backed up"
 > **Output:** `$BACKUP_DIR/secret.yaml` ⚠️ SENSITIVE
 
 ```bash
+source openab-session-env.sh
 kubectl get secret "${DEPLOYMENT}" -o yaml > "$BACKUP_DIR/secret.yaml"
-echo "✅ Secret backed up"
-echo "🔐 SECURITY: $BACKUP_DIR/secret.yaml contains credentials. Do NOT commit."
-echo "   Encrypt if storing: gpg --symmetric $BACKUP_DIR/secret.yaml"
+echo "✅ Secret backed up ($(wc -c < "$BACKUP_DIR/secret.yaml") bytes)"
+echo "🔐 SECURITY: secret.yaml contains credentials — do NOT commit. Encrypt before storing:"
+echo "   gpg --symmetric $BACKUP_DIR/secret.yaml"
 ```
 
 #### Step 7 — Backup Helm release history and PVC data
 
-> **Input:** `POD` from Step 0 · **Output:** `$BACKUP_DIR/helm-history.txt`, `$BACKUP_DIR/pvc-data/`
+> **Input:** `POD` (re-resolved) · **Output:** `$BACKUP_DIR/helm-history.txt`, `$BACKUP_DIR/pvc-data/`
 
 ```bash
+source openab-session-env.sh
+POD=$(kubectl get pod -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" -o jsonpath='{.items[0].metadata.name}')
+
 helm history "$RELEASE_NAME" > "$BACKUP_DIR/helm-history.txt"
 echo "✅ Helm history backed up"
+# This file is the source of truth for PREV_REVISION used in rollback
 
-# PVC backup via kubectl cp (default path — /home/agent is the full PVC mount)
-# Check size first to avoid timeout
 PVC_SIZE=$(kubectl exec "$POD" -- du -sh /home/agent 2>/dev/null | cut -f1)
 echo "PVC size: $PVC_SIZE"
-# Proceed with kubectl cp (recommended for < 500 MB; use VolumeSnapshot for larger volumes)
 kubectl cp "$POD:/home/agent/" "$BACKUP_DIR/pvc-data/"
 echo "✅ PVC data backed up"
 ```
 
 > **Advanced option — VolumeSnapshot (for large PVCs or CSI-enabled clusters):**
 > ```bash
-> # First, resolve the PVC name
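+> source openab-session-env.sh; POD=$(kubectl get pod -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" -o jsonpath='{.items[0].metadata.name}')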
 > PVC_NAME=$(kubectl get pod "$POD" \
 >   -o jsonpath='{.spec.volumes[?(@.persistentVolumeClaim)].persistentVolumeClaim.claimName}')
-> echo "PVC name: $PVC_NAME"
->
-> # List available VolumeSnapshotClasses
 > SNAPSHOT_CLASS=$(kubectl get volumesnapshotclass -o jsonpath='{.items[0].metadata.name}')
-> echo "Snapshot class: $SNAPSHOT_CLASS"
->
-> # Create the snapshot
+> echo "PVC: $PVC_NAME  |  SnapshotClass: $SNAPSHOT_CLASS"
 > kubectl apply -f - <<EOF
 > apiVersion: snapshot.storage.k8s.io/v1
 > kind: VolumeSnapshot
@@ -331,17 +448,17 @@ echo "✅ PVC data backed up"
 
 #### Verification Gate — must pass before proceeding to upgrade
 
-> **Agent instruction:** Run this gate after all backup steps. If any check fails, **stop and do not proceed** with the upgrade. A failed backup means that data is unprotected.
+> **Agent instruction:** Run this gate after all backup steps. If output does not contain `✅ GATE PASSED`, **stop immediately** and do not proceed with the upgrade.
 
 ```bash
+source openab-session-env.sh
 echo "=== Backup Verification Gate ==="
 GATE_PASS=true
 
 check_file() {
-  local path="$1"
-  local label="$2"
+  local path="$1"; local label="$2"
   if [ -s "$path" ]; then
-    echo "  ✅ $label ($path)"
+    echo "  ✅ $label ($(wc -c < "$path") bytes)"
   else
     echo "  ❌ MISSING or EMPTY: $label ($path)"
     GATE_PASS=false
@@ -349,24 +466,23 @@ check_file() {
 }
 
 check_dir() {
-  local path="$1"
-  local label="$2"
+  local path="$1"; local label="$2"
   if [ -d "$path" ] && [ -n "$(ls -A "$path" 2>/dev/null)" ]; then
-    echo "  ✅ $label ($path)"
+    echo "  ✅ $label ($(ls -A "$path" | wc -l) entries)"
   else
     echo "  ❌ MISSING or EMPTY: $label ($path)"
     GATE_PASS=false
   fi
 }
 
-check_file "$BACKUP_DIR/values.yaml"           "Helm values"
-check_dir  "$BACKUP_DIR/agents/"               "Agent config"
-check_dir  "$BACKUP_DIR/steering/"             "Steering files"
-check_file "$BACKUP_DIR/hosts.yml"             "GitHub CLI credentials"
-check_file "$BACKUP_DIR/kiro-auth.sqlite3"     "kiro-cli auth DB"
-check_file "$BACKUP_DIR/secret.yaml"           "Kubernetes Secret"
-check_file "$BACKUP_DIR/helm-history.txt"      "Helm history"
-check_dir  "$BACKUP_DIR/pvc-data/"             "PVC data"
+check_file "$BACKUP_DIR/values.yaml"          "Helm values"
+check_dir  "$BACKUP_DIR/agents/"              "Agent config"
+check_dir  "$BACKUP_DIR/steering/"            "Steering files"
+check_file "$BACKUP_DIR/hosts.yml"            "GitHub CLI credentials"
+check_file "$BACKUP_DIR/kiro-auth.sqlite3"    "kiro-cli auth DB"
+check_file "$BACKUP_DIR/secret.yaml"          "Kubernetes Secret"
+check_file "$BACKUP_DIR/helm-history.txt"     "Helm history"
+check_dir  "$BACKUP_DIR/pvc-data/"            "PVC data"
 
 echo ""
 if [ "$GATE_PASS" = true ]; then
@@ -380,59 +496,49 @@ fi
 
 ### One-Click Backup Script
 
-The script below combines Steps 0–7 and the Verification Gate into a single file.
+The script below combines Steps 0–7 and the Verification Gate.
 
 ```bash
 #!/bin/bash
-# Note: set -e is intentionally omitted.
-# Failures are recorded per step and reported at the end,
-# so that a single failure does not prevent remaining items from being backed up.
-
 export KUBECONFIG=~/.kube/config
+source openab-session-env.sh
 
-RELEASE_NAME=$(helm list -o json | jq -r '.[0].name')
-DEPLOYMENT="${RELEASE_NAME}-kiro"
 BACKUP_DIR="openab-backup-$(date +%Y%m%d-%H%M%S)"
 mkdir -p "$BACKUP_DIR"
 
 POD=$(kubectl get pod \
   -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" \
   -o jsonpath='{.items[0].metadata.name}')
-if [ -z "$POD" ]; then
-  echo "❌ OpenAB Pod not found. Aborting backup." && exit 1
-fi
-
-# Pre-check: kubectl cp directory operations require tar inside the container
+if [ -z "$POD" ]; then echo "❌ Pod not found. Aborting." && exit 1; fi
 if ! kubectl exec "$POD" -- which tar > /dev/null 2>&1; then
-  echo "❌ tar not found in container. kubectl cp of directories will fail."
-  echo "   Use VolumeSnapshot for PVC backup instead."
-  exit 1
+  echo "❌ tar not found in container. Aborting." && exit 1
 fi
 
-FAILED_STEPS=()
+echo "export BACKUP_DIR=\"${BACKUP_DIR}\"" >> openab-session-env.sh
 
+FAILED_STEPS=()
 backup_step() {
   local desc="$1"; shift
   echo "=== $desc ==="
   if ! "$@"; then
-    echo "⚠️  Failed: $desc (continuing with remaining steps...)"
+    echo "⚠️  Failed: $desc"
     FAILED_STEPS+=("$desc")
   fi
 }
 
-backup_step "Backup Helm values"       bash -c "helm get values '$RELEASE_NAME' -o yaml > $BACKUP_DIR/values.yaml"
-backup_step "Backup Agent config"      kubectl cp "$POD:/home/agent/.kiro/agents/" "$BACKUP_DIR/agents/"
-backup_step "Backup Steering files"    kubectl cp "$POD:/home/agent/.kiro/steering/" "$BACKUP_DIR/steering/"
-backup_step "Backup hosts.yml"         kubectl cp "$POD:/home/agent/.config/gh/hosts.yml" "$BACKUP_DIR/hosts.yml"
-backup_step "Backup kiro-cli auth DB"  kubectl cp "$POD:/home/agent/.local/share/kiro-cli/data.sqlite3" "$BACKUP_DIR/kiro-auth.sqlite3"
-backup_step "Backup full Secret"       bash -c "kubectl get secret '${DEPLOYMENT}' -o yaml > $BACKUP_DIR/secret.yaml"
-backup_step "Backup Helm history"      bash -c "helm history '$RELEASE_NAME' > $BACKUP_DIR/helm-history.txt"
-backup_step "Backup PVC data"          kubectl cp "$POD:/home/agent/" "$BACKUP_DIR/pvc-data/"
+backup_step "Helm values"       bash -c "helm get values '$RELEASE_NAME' -o yaml > $BACKUP_DIR/values.yaml"
+backup_step "Agent config"      kubectl cp "$POD:/home/agent/.kiro/agents/" "$BACKUP_DIR/agents/"
+backup_step "Steering files"    kubectl cp "$POD:/home/agent/.kiro/steering/" "$BACKUP_DIR/steering/"
+backup_step "hosts.yml"         kubectl cp "$POD:/home/agent/.config/gh/hosts.yml" "$BACKUP_DIR/hosts.yml"
+backup_step "kiro-cli auth DB"  kubectl cp "$POD:/home/agent/.local/share/kiro-cli/data.sqlite3" "$BACKUP_DIR/kiro-auth.sqlite3"
+backup_step "Kubernetes Secret" bash -c "kubectl get secret '${DEPLOYMENT}' -o yaml > $BACKUP_DIR/secret.yaml"
+backup_step "Helm history"      bash -c "helm history '$RELEASE_NAME' > $BACKUP_DIR/helm-history.txt"
+backup_step "PVC data"          kubectl cp "$POD:/home/agent/" "$BACKUP_DIR/pvc-data/"
 
 if kubectl exec "$POD" -- test -d /home/agent/.kiro/skills 2>/dev/null; then
-  backup_step "Backup skills" kubectl cp "$POD:/home/agent/.kiro/skills/" "$BACKUP_DIR/skills/"
+  backup_step "Skills" kubectl cp "$POD:/home/agent/.kiro/skills/" "$BACKUP_DIR/skills/"
 else
-  echo "⚠️ skills/ not found — skipping (normal if no custom skills installed)"
+  echo "⚠️ skills/ not found — skipping"
 fi
 
 echo ""
@@ -440,319 +546,337 @@ echo "=== Backup Summary: $BACKUP_DIR ==="
 ls -la "$BACKUP_DIR/"
 
 if [ ${#FAILED_STEPS[@]} -gt 0 ]; then
-  echo ""
-  echo "⚠️  The following backup steps FAILED:"
-  for step in "${FAILED_STEPS[@]}"; do
-    echo "   - $step"
-  done
-  echo ""
-  echo "   Review the failures above before proceeding with the upgrade."
-  echo "   A failed backup step means the corresponding data is NOT protected."
+  echo "⚠️  Failed steps: ${FAILED_STEPS[*]}"
+  echo "   Review failures before proceeding with the upgrade."
 else
-  echo "✅ All backup steps completed successfully."
+  echo "✅ All backup steps completed."
 fi
 
 echo ""
-echo "🔐 SECURITY REMINDER: $BACKUP_DIR/secret.yaml contains sensitive credentials"
-echo "   (Discord token, STT key, etc.). Do NOT commit this file."
-echo "   Consider encrypting it before storing:"
-echo "     gpg --symmetric $BACKUP_DIR/secret.yaml"
-echo "     # or: age -p -o $BACKUP_DIR/secret.yaml.age $BACKUP_DIR/secret.yaml"
+echo "🔐 SECURITY: $BACKUP_DIR/secret.yaml contains credentials. Do NOT commit."
+echo "   gpg --symmetric $BACKUP_DIR/secret.yaml"
 ```
 
-### Backup Checklist (Human Reference)
-
-| Item | Command | Notes |
-|---|---|---|
-| Helm values | `helm get values $RELEASE_NAME -o yaml > $BACKUP_DIR/values.yaml` | Current Helm deployment parameters |
-| Agent config | `kubectl cp $POD:/home/agent/.kiro/agents/ $BACKUP_DIR/agents/` | Custom agent settings (model, prompt, tools, etc.) |
-| Steering files | `kubectl cp $POD:/home/agent/.kiro/steering/ $BACKUP_DIR/steering/` | Steering docs such as IDENTITY.md |
-| Skills | `kubectl cp $POD:/home/agent/.kiro/skills/ $BACKUP_DIR/skills/` | Custom agent skills (if any; see Step 4 for conditional check) |
-| hosts.yml | `kubectl cp $POD:/home/agent/.config/gh/hosts.yml $BACKUP_DIR/hosts.yml` | GitHub CLI credentials (including multi-account configs) |
-| kiro-cli auth | `kubectl cp $POD:/home/agent/.local/share/kiro-cli/data.sqlite3 $BACKUP_DIR/kiro-auth.sqlite3` | Bot auth DB — required to avoid re-authentication after PVC loss |
-| Full Secret export | `kubectl get secret ${DEPLOYMENT} -o yaml > $BACKUP_DIR/secret.yaml` | ⚠️ **SENSITIVE** — contains Discord token, STT key, and other credentials. Store securely, **never commit to version control**. |
-| PVC data | `kubectl cp $POD:/home/agent/ $BACKUP_DIR/pvc-data/` | Default: kubectl cp. See Step 7 for VolumeSnapshot (advanced). |
-| Helm release history | `helm history $RELEASE_NAME > $BACKUP_DIR/helm-history.txt` | Useful reference for rollback |
-
 ---
 
 ## III. Upgrade Execution
 
-> **Agent note — dependency chain:**
-> - Requires `RELEASE_NAME`, `DEPLOYMENT`, `BACKUP_DIR`, `TARGET_VERSION` from Section I.
-> - Requires the Verification Gate (Section II) to have passed.
-
-### Pre-check: Resolve Upgrade Variables
+> **Agent instruction — session continuity:**
+> - Source `openab-session-env.sh` at the start of each step
+> - If resuming after a gap (e.g. backup was done earlier), verify `BACKUP_DIR` matches the intended backup:
+>   ```bash
+>   source openab-session-env.sh
+>   echo "BACKUP_DIR: $BACKUP_DIR"
+>   echo "Backup time: $(echo "$BACKUP_DIR" | grep -oE '[0-9]{8}-[0-9]{6}')"
+>   ls "$BACKUP_DIR/"
+>   # Confirm this is the correct backup before proceeding
+>   ```
 
-```bash
-export KUBECONFIG=~/.kube/config
+### Step 1: Pre-release Validation
 
-# Resolve release name and deployment
-RELEASE_NAME=$(helm list -o json | jq -r '.[0].name')
-DEPLOYMENT="${RELEASE_NAME}-kiro"
-
-# Resolve backup directory (most recent backup)
-BACKUP_DIR=$(ls -td openab-backup-* 2>/dev/null | head -1)
-BACKUP_VALUES="${BACKUP_DIR}/values.yaml"
-echo "Using backup: $BACKUP_DIR"
-echo "Values file: $BACKUP_VALUES"
-
-# Confirm the values file exists and is non-empty
-if [ ! -s "$BACKUP_VALUES" ]; then
-  echo "❌ values.yaml not found or empty at $BACKUP_VALUES. Run backup first."
-  exit 1
-fi
-
-# Set target version (e.g. 0.7.5 — check https://github.com/openabdev/openab/releases)
-TARGET_VERSION="0.7.5"
-
-# List available chart versions via OCI (no helm repo add required)
-helm show chart oci://ghcr.io/openabdev/charts/openab --version "$TARGET_VERSION" 2>/dev/null \
-  | grep -E "^(name|version|appVersion):"
-```
-
-### Pre-check: Confirm Helm OCI Access
+> ⚠️ Per project convention, **a stable release must be preceded by a validated pre-release**. Do not skip directly to Step 2.
+>
+> **Agent note — branch resolution:**
+> - If `PRERELEASE_VERSION` is empty (left unset in Section I because the release notes contained `pre-release-validated: true`): **skip this entire step** and proceed directly to Step 2.
+> - If `PRERELEASE_VERSION` is non-empty: run the full step below.
+> - If this step fails (automated smoke test fails): run rollback (Section IV) and **stop** — do not proceed to Step 2.
 
 ```bash
-# Verify OCI registry is reachable and the target version exists
-helm show chart oci://ghcr.io/openabdev/charts/openab --version "${TARGET_VERSION}" > /dev/null \
-  && echo "✅ Chart version ${TARGET_VERSION} available via OCI" \
-  || echo "❌ Chart version ${TARGET_VERSION} not found. Check version string."
-
-# Also verify the pre-release beta.1 version exists (required for Step 1)
-helm show chart oci://ghcr.io/openabdev/charts/openab --version "${TARGET_VERSION}-beta.1" > /dev/null \
-  && echo "✅ Pre-release ${TARGET_VERSION}-beta.1 available" \
-  || echo "⚠️ beta.1 not found — check GitHub releases for available pre-release tags"
-```
+source openab-session-env.sh
 
-### Step 1: Pre-release Validation (Required)
+if [ -z "$PRERELEASE_VERSION" ]; then
+  echo "ℹ️  PRERELEASE_VERSION is empty — pre-release step was skipped during env setup."
+  echo "   Proceeding directly to Step 2."
+  exit 0
+fi
 
-> ⚠️ Per project convention, **a stable release must be preceded by a validated pre-release**. Do not skip directly to Step 2.
->
-> **When can Step 1 be skipped?** Only if the release notes for the target stable version explicitly contain `pre-release-validated: true`, indicating that the corresponding pre-release has already been validated in another environment (e.g. a staging cluster). In all other cases, run Step 1 first.
->
-> **Agent note — pass/fail criteria:**
-> - **Pass:** `kubectl wait` exits 0 AND `pgrep -x openab` exits 0 AND log scan returns no `panic` or `fatal` lines.
-> - **Fail:** Any of the above fails, OR a human operator reports a functional regression in Discord within the monitoring window. On failure, run `helm rollback` (see Section IV) and stop — do not proceed to Step 2.
+echo "Deploying pre-release: $PRERELEASE_VERSION"
 
-```bash
-# Dry-run first to catch values conflicts
+# Dry-run first
 helm upgrade "$RELEASE_NAME" oci://ghcr.io/openabdev/charts/openab \
-  --version "${TARGET_VERSION}-beta.1" \
-  -f "$BACKUP_VALUES" \
+  --version "$PRERELEASE_VERSION" \
+  -f "$BACKUP_DIR/values.yaml" \
   --dry-run
+# Expected output contains: "Release \"openab\" has been upgraded. Happy Helming!"
 
-# Deploy the pre-release
+# Deploy pre-release
 helm upgrade "$RELEASE_NAME" oci://ghcr.io/openabdev/charts/openab \
-  --version "${TARGET_VERSION}-beta.1" \
-  -f "$BACKUP_VALUES"
+  --version "$PRERELEASE_VERSION" \
+  -f "$BACKUP_DIR/values.yaml"
 
 kubectl rollout status "deployment/${DEPLOYMENT}" --timeout=300s
+# Expected output: "deployment/<DEPLOYMENT> successfully rolled out"
 
-# Automated smoke test
-POD=$(kubectl get pod \
-  -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" \
-  -o jsonpath='{.items[0].metadata.name}')
+# --- Automated smoke test ---
+# Estimated duration: 30–60 seconds
+POD=$(kubectl get pod -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" -o jsonpath='{.items[0].metadata.name}')
 kubectl wait --for=condition=Ready "pod/${POD}" --timeout=120s
+# Expected output: "pod/<POD> condition met"
+
 kubectl exec "$POD" -- pgrep -x openab
+# Expected output: a numeric PID (e.g. "42") — non-zero exit means process not running
+
 PANIC_LINES=$(kubectl logs "deployment/${DEPLOYMENT}" --tail=100 | grep -icE "panic|fatal" || true)
 if [ "$PANIC_LINES" -gt 0 ]; then
-  echo "❌ Panic/fatal lines found in logs. Do NOT proceed. Run rollback."
+  echo "❌ Panic/fatal lines found in logs. Automated smoke test FAILED."
+  echo "   Run rollback (Section IV) and do not proceed to Step 2."
   exit 1
 fi
-echo "✅ Automated smoke test passed. Proceed with Discord functional validation."
+echo "✅ Automated smoke test passed."
+```
 
-# After automated smoke test — manual Discord check required:
-# → Send a test message in the Discord channel
-# → Confirm the bot responds and basic conversation / tool calls work
-# → If bot is unresponsive or broken: run helm rollback (Section IV) and stop
+**After automated smoke test — human Discord validation required:**
+
+```bash
+# ⏸ HUMAN CONFIRMATION REQUIRED
+# Estimated wait: 2–5 minutes
+echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
+echo "⏸  PAUSED — Human action required before continuing"
+echo ""
+echo "  1. Send a test message to the Discord channel"
+echo "  2. Confirm the bot responds and basic conversation / tool calls work"
+echo ""
+echo "  When confirmed OK, type:  CONFIRMED"
+echo "  To abort and rollback,    type:  ROLLBACK"
+echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
+read -r HUMAN_INPUT
+case "$HUMAN_INPUT" in
+  CONFIRMED)
+    echo "✅ Human confirmed — proceeding to Step 2"
+    ;;
+  ROLLBACK)
+    echo "🔁 Rollback requested by human. Proceed to Section IV."
+    exit 2
+    ;;
+  *)
+    echo "❌ Unrecognized input ('$HUMAN_INPUT'). Aborting for safety."
+    echo "   Run rollback (Section IV) if needed."
+    exit 1
+    ;;
+esac
 ```
 
 ### Step 2: Promote to Stable
 
-> **Agent note:** Only run this after Step 1 has passed both automated and Discord validation.
+> **Agent instruction:** Only run this after Step 1 is fully complete (automated + human confirmation), or after confirming `PRERELEASE_VERSION` was empty.
 
 ```bash
-# Dry-run the stable version
+source openab-session-env.sh
+
+echo "Promoting to stable: $TARGET_VERSION"
+
+# Dry-run
 helm upgrade "$RELEASE_NAME" oci://ghcr.io/openabdev/charts/openab \
-  --version "${TARGET_VERSION}" \
-  -f "$BACKUP_VALUES" \
+  --version "$TARGET_VERSION" \
+  -f "$BACKUP_DIR/values.yaml" \
   --dry-run
 
-# Deploy stable (short downtime is expected due to Recreate strategy)
+# Deploy stable (short downtime expected due to Recreate strategy)
 helm upgrade "$RELEASE_NAME" oci://ghcr.io/openabdev/charts/openab \
-  --version "${TARGET_VERSION}" \
-  -f "$BACKUP_VALUES"
+  --version "$TARGET_VERSION" \
+  -f "$BACKUP_DIR/values.yaml"
 
-# Wait for the Pod to be ready
 kubectl rollout status "deployment/${DEPLOYMENT}" --timeout=300s
+# Expected output: "deployment/<DEPLOYMENT> successfully rolled out"
+# Estimated duration: 60–180 seconds
 ```
 
 ### Post-Upgrade Verification
 
 > **Agent note — pass/fail criteria:**
-> - **Pass:** All commands below exit 0 AND image tag matches `TARGET_VERSION` AND no panic/fatal in logs.
-> - **Fail:** Any command exits non-zero, or image tag does not match. → Proceed to Section IV Rollback.
+> - **Pass:** All commands exit 0, deployed chart version equals `openab-${TARGET_VERSION}`, no panic/fatal in logs, PVC paths are present.
+> - **Fail:** Any command exits non-zero, version mismatch, or panic/fatal in logs. → Proceed to Section IV Rollback immediately.
 
 ```bash
+source openab-session-env.sh
+
 POD=$(kubectl get pod \
   -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" \
   -o jsonpath='{.items[0].metadata.name}')
 
-# 1. Check Pod status (must be Running and READY)
+# 1. Pod status
 kubectl get pod -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}"
 kubectl wait --for=condition=Ready "pod/${POD}" --timeout=120s
+# Expected output: "pod/<POD> condition met"
 
-# 2. Verify deployed chart version matches target
+# 2. Chart version
 DEPLOYED=$(helm list -f "$RELEASE_NAME" -o json | jq -r '.[0].chart')
-echo "Deployed chart: $DEPLOYED  |  Expected: openab-${TARGET_VERSION}"
+echo "Deployed: $DEPLOYED  |  Expected: openab-${TARGET_VERSION}"
 if [ "$DEPLOYED" != "openab-${TARGET_VERSION}" ]; then
   echo "❌ Version mismatch. Investigate before proceeding."
+  exit 1
 fi
 
-# 3. Verify image tag
+# 3. Image tag
 kubectl get "deployment/${DEPLOYMENT}" \
   -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'
+# Expected output contains: TARGET_VERSION or its image SHA
 
-# 4. Confirm the openab process is running
+# 4. Process check
 kubectl exec "$POD" -- pgrep -x openab
+# Expected output: a numeric PID
 
-# 5. Check logs for errors
+# 5. Log check
 PANIC_LINES=$(kubectl logs "deployment/${DEPLOYMENT}" --tail=100 | grep -icE "panic|fatal" || true)
-ERROR_LINES=$(kubectl logs "deployment/${DEPLOYMENT}" --tail=100 | grep -icE "error|warn" || true)
-echo "Panic/fatal lines: $PANIC_LINES  |  Error/warn lines: $ERROR_LINES"
+WARN_LINES=$(kubectl logs "deployment/${DEPLOYMENT}" --tail=100 | grep -icE "error|warn" || true)
+echo "Panic/fatal: $PANIC_LINES  |  Error/warn: $WARN_LINES"
 if [ "$PANIC_LINES" -gt 0 ]; then
-  echo "❌ Panic/fatal found. Rollback recommended."
+  echo "❌ Panic/fatal found. Proceed to Section IV Rollback."
+  exit 1
 fi
 
-# 6. Verify PVC data (steering files and agent config) are still present
+# 6. PVC data integrity
 kubectl exec "$POD" -- ls /home/agent/.kiro/steering/
+# Expected output: at least one file listed (e.g. IDENTITY.md)
 kubectl exec "$POD" -- cat /home/agent/.kiro/agents/default.json | head -5
-# If either path is missing, restore from backup (see Section IV: Restore Custom Config)
+# Expected output: first 5 lines of valid JSON
 
-# 7. Discord E2E verification (final check — requires human operator)
-# → Send a test message in the Discord channel
-# → Confirm the bot responds and conversation works correctly
+echo "✅ All automated checks passed."
+```
+
+**After automated checks — human Discord E2E confirmation:**
+
+```bash
+echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
+echo "⏸  PAUSED — Human E2E validation required"
+echo ""
+echo "  Send a test message in the Discord channel."
+echo "  Confirm the bot responds and conversation works correctly."
+echo ""
+echo "  When confirmed OK, type:  CONFIRMED"
+echo "  If issues found,   type:  ROLLBACK"
+echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
+read -r HUMAN_INPUT
+case "$HUMAN_INPUT" in
+  CONFIRMED) echo "✅ Upgrade complete." ;;
+  ROLLBACK)  echo "🔁 Rollback requested. Proceed to Section IV."; exit 2 ;;
+  *)         echo "❌ Unrecognized input. Aborting."; exit 1 ;;
+esac
 ```
 
 ### Completion Notice
 
-- Once all verifications pass, notify users:
-  - Upgrade complete, service restored
-  - New version number and summary of key changes
-  - Contact channel for reporting any issues
+```bash
+source openab-session-env.sh
+
+# Send completion notification via Discord webhook (if configured)
+if [ -n "${DISCORD_WEBHOOK_URL:-}" ]; then
+  curl -s -X POST "$DISCORD_WEBHOOK_URL" \
+    -H "Content-Type: application/json" \
+    -d "{\"content\": \"✅ **Upgrade complete:** OpenAB is now running v${TARGET_VERSION}. Service restored.\"}"
+  echo "✅ Completion notice sent"
+else
+  echo "ℹ️  Notify users manually: OpenAB upgraded to v${TARGET_VERSION}, service restored."
+fi
+```
 
 ---
 
 ## IV. Rollback
 
-### Decision Tree
-
-> **Agent note — machine-readable branch criteria:**
->
-> | Observed condition | Action |
-> |---|---|
-> | `kubectl get pod` shows `CrashLoopBackOff` or `Pending` | `helm rollback` immediately |
-> | Pod is `Running` AND `pgrep -x openab` exits non-zero | `helm rollback` |
-> | Pod is `Running`, process OK, but logs contain `panic` or `fatal` | `helm rollback` |
-> | Pod is `Running`, process OK, logs clean, but no Discord response after 60s | `kubectl rollout restart` first; if still no response after 60s → `helm rollback` |
-> | Pod is `Running`, process OK, logs clean, Discord responds, but config is missing | Restore config from backup → `kubectl rollout restart` |
-> | Quick fix is clearly identified (e.g. a known bad config key) | Hotfix — escalate to human engineer |
+### Decision Table (Machine-Readable)
 
-```
-Issue detected after upgrade
-         │
-         ▼
-    Pod status?
-         │
-         ├─ CrashLoopBackOff / Pending ──→ helm rollback <REVISION>
-         │
-         ├─ Running, pgrep exits non-zero OR panic in logs
-         │         └─ helm rollback <REVISION>
-         │
-         ├─ Running, logs clean, bot unresponsive
-         │         └─ kubectl rollout restart deployment/${DEPLOYMENT}
-         │                   │
-         │                   ├─ Responds within 60s ──→ Continue monitoring
-         │                   └─ Still unresponsive   ──→ helm rollback <REVISION>
-         │
-         └─ Running, bot OK, config missing
-                   └─ Restore config from backup → kubectl rollout restart
-```
+> **Agent instruction:** Evaluate conditions in order. Execute the action for the first matching row. Only one action should be taken per rollback event.
 
-| Symptom | Action |
-|---|---|
-| Pod fails to start (CrashLoopBackOff) | Helm rollback |
-| Functionality broken, Pod is running | Rollback or hotfix  |
-| Custom config lost | Restore config files from backup |
-| Bot unresponsive | Restart Pod first; rollback if it persists |
+| Condition to check | How to check | Action |
+|---|---|---|
+| Pod is `Pending`, or its container is waiting with reason `CrashLoopBackOff` | `kubectl get pod ... -o jsonpath='{.items[0].status.phase}{" "}{.items[0].status.containerStatuses[0].state.waiting.reason}'` | `helm rollback` immediately |
+| Pod is `Running` AND `pgrep -x openab` exits non-zero | `kubectl exec $POD -- pgrep -x openab; echo $?` | `helm rollback` |
+| Pod is `Running`, process OK, logs contain `panic` or `fatal` | `kubectl logs ... \| grep -icE "panic\|fatal"` | `helm rollback` |
+| Pod is `Running`, process OK, logs clean, no Discord response after 60s | Human reports no response | `kubectl rollout restart` first; if still no response after 60s → `helm rollback` |
+| Pod is `Running`, process OK, bot responds, but config files missing | `kubectl exec $POD -- ls /home/agent/.kiro/steering/` | Restore from backup → `kubectl rollout restart` |
+| Quick fix is clearly identified (e.g. known bad config key) | Human identifies root cause | Hotfix — escalate to human engineer |
 
 ### Helm Rollback
 
+> **Agent instruction:** `PREV_REVISION` is resolved from the backup's `helm-history.txt` (saved before any upgrade occurred). This avoids the ambiguity of picking the "second-to-last" revision when multiple `helm upgrade` calls were made during the upgrade process (pre-release + stable).
+
 ```bash
-export KUBECONFIG=~/.kube/config
-RELEASE_NAME=$(helm list -o json | jq -r '.[0].name')
-DEPLOYMENT="${RELEASE_NAME}-kiro"
+source openab-session-env.sh
 
-# 1. View release history and identify the previous revision
-helm history "$RELEASE_NAME"
+# Validate BACKUP_DIR is set and helm-history.txt exists
+if [ -z "$BACKUP_DIR" ] || [ ! -f "$BACKUP_DIR/helm-history.txt" ]; then
+  echo "❌ BACKUP_DIR not set or helm-history.txt missing."
+  echo "   Resolve manually: helm history $RELEASE_NAME"
+  exit 1
+fi
 
-# 2. Get the previous revision number automatically
-PREV_REVISION=$(helm history "$RELEASE_NAME" --output json \
-  | jq -r 'sort_by(.revision) | .[-2].revision')
-echo "Rolling back to revision: $PREV_REVISION"
+echo "Using backup: $BACKUP_DIR"
+echo "Backup timestamp: $(echo "$BACKUP_DIR" | grep -oE '[0-9]{8}-[0-9]{6}')"
+
+# Resolve the pre-upgrade stable revision from the backup
+# (the last row whose STATUS is "deployed"; UPDATED spans several fields, so scan every field)
+PREV_REVISION=$(awk 'NR>1 { for (i=2; i<=NF; i++) if ($i=="deployed") rev=$1 } END {print rev}' "$BACKUP_DIR/helm-history.txt")
+if [ -z "$PREV_REVISION" ]; then
+  echo "❌ Could not resolve PREV_REVISION from helm-history.txt."
+  echo "   Contents of helm-history.txt:"
+  cat "$BACKUP_DIR/helm-history.txt"
+  echo ""
+  echo "   Set PREV_REVISION manually and re-run: helm rollback $RELEASE_NAME <REVISION>"
+  exit 1
+fi
+echo "Rolling back to revision: $PREV_REVISION (pre-upgrade stable)"
 
-# 3. Roll back
+# Rollback
 helm rollback "$RELEASE_NAME" "$PREV_REVISION"
+# Expected output: "Rollback was a success! Happy Helming!"
 
-# 4. Wait for the Pod to be ready
 kubectl rollout status "deployment/${DEPLOYMENT}" --timeout=300s
+# Expected output: "deployment/<DEPLOYMENT> successfully rolled out"
 
-# 5. Confirm rollback succeeded
+# Confirm rollback
 kubectl get pod -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}"
+# Expected output: 1 pod in Running/Ready state
 
-# 6. Run full post-rollback verification
+# Post-rollback verification
 POD=$(kubectl get pod \
   -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" \
   -o jsonpath='{.items[0].metadata.name}')
 kubectl wait --for=condition=Ready "pod/${POD}" --timeout=120s
 kubectl exec "$POD" -- pgrep -x openab
-kubectl logs "deployment/${DEPLOYMENT}" --tail=100 | grep -iE "error|warn|panic|fatal"
-# → Send a test message in the Discord channel to confirm the bot responds
+# Expected output: a numeric PID
+
+PANIC_LINES=$(kubectl logs "deployment/${DEPLOYMENT}" --tail=100 | grep -icE "panic|fatal" || true)
+echo "Panic/fatal after rollback: $PANIC_LINES"
+if [ "$PANIC_LINES" -gt 0 ]; then
+  echo "❌ Panic/fatal found even after rollback. Escalate to human engineer."
+  exit 1
+fi
+echo "✅ Rollback complete. Send Discord test message to confirm bot is responsive."
 ```
 
 ### Restore Custom Config
 
 ```bash
-export KUBECONFIG=~/.kube/config
-RELEASE_NAME=$(helm list -o json | jq -r '.[0].name')
-DEPLOYMENT="${RELEASE_NAME}-kiro"
+source openab-session-env.sh
 
 POD=$(kubectl get pod \
   -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" \
   -o jsonpath='{.items[0].metadata.name}')
 
-# Resolve backup directory
-BACKUP_DIR=$(ls -td openab-backup-* 2>/dev/null | head -1)
 echo "Restoring from: $BACKUP_DIR"
 
 # Restore agent config
 kubectl cp "$BACKUP_DIR/agents/default.json" "$POD:/home/agent/.kiro/agents/default.json"
+echo "✅ Agent config restored"
 
 # Restore steering files
-# ⚠️ kubectl cp directory behaviour varies across versions — trailing slash matters.
-# Use the tar pipe method below to avoid accidentally creating a nested directory
-# (e.g. steering/steering/) which can happen with some kubectl versions.
+# ⚠️ Use tar pipe to avoid nested directory issue (e.g. steering/steering/)
 kubectl exec "$POD" -- mkdir -p /home/agent/.kiro/steering
 tar c -C "$BACKUP_DIR/steering" . | kubectl exec -i "$POD" -- tar x -C /home/agent/.kiro/steering
+echo "✅ Steering files restored"
 
 # Restore GitHub CLI credentials
 kubectl cp "$BACKUP_DIR/hosts.yml" "$POD:/home/agent/.config/gh/hosts.yml"
+echo "✅ hosts.yml restored"
 
 # Restore kiro-cli auth database
 kubectl exec "$POD" -- mkdir -p /home/agent/.local/share/kiro-cli
 kubectl cp "$BACKUP_DIR/kiro-auth.sqlite3" "$POD:/home/agent/.local/share/kiro-cli/data.sqlite3"
+echo "✅ kiro-cli auth DB restored"
 
 # Restart Pod to apply changes
 kubectl rollout restart "deployment/${DEPLOYMENT}"
+kubectl rollout status "deployment/${DEPLOYMENT}" --timeout=300s
+echo "✅ Pod restarted with restored config"
 ```

From f32cbfc03b08caf27acb3f47a640c0970e3657d5 Mon Sep 17 00:00:00 2001
From: JARVIS-Agent <JARVIS-coding-Agent@users.noreply.github.com>
Date: Tue, 14 Apr 2026 23:28:38 +0000
Subject: [PATCH 6/8] docs: add AI instruction note to SOP title

---
 docs/openab-upgrade-sop.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/docs/openab-upgrade-sop.md b/docs/openab-upgrade-sop.md
index 0d3421fb..d373f8c2 100644
--- a/docs/openab-upgrade-sop.md
+++ b/docs/openab-upgrade-sop.md
@@ -1,5 +1,7 @@
 # OpenAB Version Upgrade SOP
 
+> [AI Instruction] Follow this documentation to assist the user in executing the upgrade process efficiently while ensuring all backup and rollback protocols are met.
+
 | | |
 |---|---|
 | **Document Version** | 1.4 |

From 8b384e5fad160c167b42d013e82bff912f67b77d Mon Sep 17 00:00:00 2001
From: JARVIS-Agent <JARVIS-coding-Agent@users.noreply.github.com>
Date: Tue, 14 Apr 2026 23:32:31 +0000
Subject: [PATCH 7/8] docs: fix multi-release selector, read timeout, remove
 duplicate backup script
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

1. helm list release name resolution: replace .[0].name with
   .[] | select(.chart | startswith("openab-")) | .name | head -1
   to correctly handle namespaces with multiple Helm releases

2. read HUMAN_INPUT: add -t 600 timeout with exit 1 on expiry to
   prevent indefinite hang in non-interactive / CI/CD environments

3. Remove "One-Click Backup Script" section (was ~60 lines of content
   functionally identical to the Agent-Executable Steps 0-7 above it)
   — reduces duplication and document length
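
A minimal sketch of the timeout behaviour described in item 2, assuming
bash. `confirm_or_abort` is a hypothetical helper name used only for
illustration and is not part of the SOP; the return codes mirror the
exit codes in the document (1 = abort, 2 = rollback requested).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the SOP): wait up to $1 seconds for one line on stdin.
# Prints a short status word and returns 0 (confirmed), 2 (rollback) or 1 (abort).
confirm_or_abort() {
  local timeout="$1" answer
  # read returns non-zero on timeout and also on end-of-input
  if ! read -t "$timeout" -r answer; then
    echo "no-input"
    return 1
  fi
  case "$answer" in
    CONFIRMED) echo "confirmed";    return 0 ;;
    ROLLBACK)  echo "rollback";     return 2 ;;
    *)         echo "unrecognized"; return 1 ;;
  esac
}
```

Wrapping the prompt in a function, rather than calling `exit` inline,
lets the caller decide how to react to each return code; the SOP itself
exits directly.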
---
 docs/openab-upgrade-sop.md | 71 +++-----------------------------------
 1 file changed, 4 insertions(+), 67 deletions(-)

diff --git a/docs/openab-upgrade-sop.md b/docs/openab-upgrade-sop.md
index d373f8c2..f36b66af 100644
--- a/docs/openab-upgrade-sop.md
+++ b/docs/openab-upgrade-sop.md
@@ -27,7 +27,7 @@
 
 > **Deployment naming pattern:** The deployment name follows `<release-name>-kiro`. For the default setup (`helm install openab …`), the deployment is `openab-kiro`. If you used a different release name (e.g. `my-bot`), the deployment is `my-bot-kiro`. Verify with:
 > ```bash
-> RELEASE_NAME=$(helm list -o json | jq -r '.[0].name')
+> RELEASE_NAME=$(helm list -o json | jq -r '.[] | select(.chart | startswith("openab-")) | .name' | head -1)
 > DEPLOYMENT="${RELEASE_NAME}-kiro"
 > echo "Deployment: $DEPLOYMENT"
 > ```
@@ -170,7 +170,7 @@ export KUBECONFIG=~/.kube/config
 # source openab-session-env.sh && echo "✅ Session env loaded" && exit 0
 
 # --- Resolve release and deployment names ---
-RELEASE_NAME=$(helm list -o json | jq -r '.[0].name')
+RELEASE_NAME=$(helm list -o json | jq -r '.[] | select(.chart | startswith("openab-")) | .name' | head -1)
 if [ -z "$RELEASE_NAME" ]; then
   echo "❌ No Helm release found. Is OpenAB installed?"
   exit 1
@@ -496,69 +496,6 @@ else
 fi
 ```
 
-### One-Click Backup Script
-
-The script below combines Steps 0–7 and the Verification Gate.
-
-```bash
-#!/bin/bash
-export KUBECONFIG=~/.kube/config
-source openab-session-env.sh
-
-BACKUP_DIR="openab-backup-$(date +%Y%m%d-%H%M%S)"
-mkdir -p "$BACKUP_DIR"
-
-POD=$(kubectl get pod \
-  -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" \
-  -o jsonpath='{.items[0].metadata.name}')
-if [ -z "$POD" ]; then echo "❌ Pod not found. Aborting." && exit 1; fi
-if ! kubectl exec "$POD" -- which tar > /dev/null 2>&1; then
-  echo "❌ tar not found in container. Aborting." && exit 1
-fi
-
-echo "export BACKUP_DIR=\"${BACKUP_DIR}\"" >> openab-session-env.sh
-
-FAILED_STEPS=()
-backup_step() {
-  local desc="$1"; shift
-  echo "=== $desc ==="
-  if ! "$@"; then
-    echo "⚠️  Failed: $desc"
-    FAILED_STEPS+=("$desc")
-  fi
-}
-
-backup_step "Helm values"       bash -c "helm get values '$RELEASE_NAME' -o yaml > $BACKUP_DIR/values.yaml"
-backup_step "Agent config"      kubectl cp "$POD:/home/agent/.kiro/agents/" "$BACKUP_DIR/agents/"
-backup_step "Steering files"    kubectl cp "$POD:/home/agent/.kiro/steering/" "$BACKUP_DIR/steering/"
-backup_step "hosts.yml"         kubectl cp "$POD:/home/agent/.config/gh/hosts.yml" "$BACKUP_DIR/hosts.yml"
-backup_step "kiro-cli auth DB"  kubectl cp "$POD:/home/agent/.local/share/kiro-cli/data.sqlite3" "$BACKUP_DIR/kiro-auth.sqlite3"
-backup_step "Kubernetes Secret" bash -c "kubectl get secret '${DEPLOYMENT}' -o yaml > $BACKUP_DIR/secret.yaml"
-backup_step "Helm history"      bash -c "helm history '$RELEASE_NAME' > $BACKUP_DIR/helm-history.txt"
-backup_step "PVC data"          kubectl cp "$POD:/home/agent/" "$BACKUP_DIR/pvc-data/"
-
-if kubectl exec "$POD" -- test -d /home/agent/.kiro/skills 2>/dev/null; then
-  backup_step "Skills" kubectl cp "$POD:/home/agent/.kiro/skills/" "$BACKUP_DIR/skills/"
-else
-  echo "⚠️ skills/ not found — skipping"
-fi
-
-echo ""
-echo "=== Backup Summary: $BACKUP_DIR ==="
-ls -la "$BACKUP_DIR/"
-
-if [ ${#FAILED_STEPS[@]} -gt 0 ]; then
-  echo "⚠️  Failed steps: ${FAILED_STEPS[*]}"
-  echo "   Review failures before proceeding with the upgrade."
-else
-  echo "✅ All backup steps completed."
-fi
-
-echo ""
-echo "🔐 SECURITY: $BACKUP_DIR/secret.yaml contains credentials. Do NOT commit."
-echo "   gpg --symmetric $BACKUP_DIR/secret.yaml"
-```
-
 ---
 
 ## III. Upgrade Execution
@@ -641,7 +578,7 @@ echo ""
 echo "  When confirmed OK, type:  CONFIRMED"
 echo "  To abort and rollback,    type:  ROLLBACK"
 echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
-read -r HUMAN_INPUT
+read -t 600 -r HUMAN_INPUT || { echo "❌ Timeout: no human input received within 600s. Aborting."; exit 1; }
 case "$HUMAN_INPUT" in
   CONFIRMED)
     echo "✅ Human confirmed — proceeding to Step 2"
@@ -748,7 +685,7 @@ echo ""
 echo "  When confirmed OK, type:  CONFIRMED"
 echo "  If issues found,   type:  ROLLBACK"
 echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
-read -r HUMAN_INPUT
+read -t 600 -r HUMAN_INPUT || { echo "❌ Timeout: no human input received within 600s. Aborting."; exit 1; }
 case "$HUMAN_INPUT" in
   CONFIRMED) echo "✅ Upgrade complete." ;;
   ROLLBACK)  echo "🔁 Rollback requested. Proceed to Section IV."; exit 2 ;;

From 986eb0b3a1c8da852f6c1acfadbb1ac19e10396e Mon Sep 17 00:00:00 2001
From: JARVIS-Agent <JARVIS-coding-Agent@users.noreply.github.com>
Date: Wed, 15 Apr 2026 00:03:35 +0000
Subject: [PATCH 8/8] docs: address chaodu-agent review feedback (v1.5)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Fix all 3 suggested changes and 2 nits from chaodu-agent review:

1. Add PVC rollback warning at top of Section IV: helm rollback does
   NOT revert PVC data; if new version ran a data migration, restore
   PVC from Step 7 backup before rolling back

2. Add Agent note above both read HUMAN_INPUT blocks: if running in
   non-interactive shell (no stdin), skip read and report to user
   that human confirmation is required, then pause execution

3. Step 7 PVC overlap: add explicit note explaining that pvc-data/ is
   intentionally redundant — it is the full PVC snapshot for rollback
   of migrated data, while Steps 2-5 are for targeted fast restores.
   Add 500MB size threshold gate with VolumeSnapshot recommendation.

Nit 1: replace awk text parsing of helm-history with JSON approach —
   Step 7 now saves helm-history.json in addition to .txt;
   PREV_REVISION resolution uses jq on the JSON file for reliability
   across Helm versions (avoids column-shift issues with text output);
   Verification Gate and rollback section updated accordingly

Nit 2: add comment in Step 0 explaining why POD is not persisted to
   openab-session-env.sh (pod name changes after every upgrade/restart)
---
 docs/openab-upgrade-sop.md | 70 ++++++++++++++++++++++++++------------
 1 file changed, 49 insertions(+), 21 deletions(-)
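
Item 2 above relies on the agent noticing that no terminal is attached to
stdin. A minimal sketch of that check follows. The helper name
`await_human_confirmation` and its return-code convention are assumptions
made for illustration; neither appears in the SOP.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the SOP): wraps the confirmation prompt so
# it degrades cleanly when stdin is not a terminal.
# Assumed return codes: 0 = CONFIRMED, 2 = ROLLBACK,
#                       1 = timeout or unrecognized input,
#                       3 = non-interactive shell, human action still needed.
await_human_confirmation() {
  local timeout="${1:-600}"
  local reply

  # [ -t 0 ] succeeds only when file descriptor 0 (stdin) is a terminal.
  if [ ! -t 0 ]; then
    echo "⏸ Non-interactive shell: human confirmation required (CONFIRMED or ROLLBACK)."
    return 3
  fi

  # read -t is a bash feature; it fails on timeout or end-of-file.
  if ! read -t "$timeout" -r reply; then
    echo "❌ No human input received within ${timeout}s."
    return 1
  fi

  case "$reply" in
    CONFIRMED) return 0 ;;
    ROLLBACK)  return 2 ;;
    *)         echo "❌ Unrecognized input: $reply"; return 1 ;;
  esac
}
```

A caller could run `await_human_confirmation 600` and treat return code 3 as
the signal to pause and ask the user, instead of letting `read` fail at once
on a closed stdin and report a misleading timeout.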

diff --git a/docs/openab-upgrade-sop.md b/docs/openab-upgrade-sop.md
index f36b66af..e9ea81b7 100644
--- a/docs/openab-upgrade-sop.md
+++ b/docs/openab-upgrade-sop.md
@@ -4,8 +4,8 @@
 
 | | |
 |---|---|
-| **Document Version** | 1.4 |
-| **Last Updated** | 2026-04-14 |
+| **Document Version** | 1.5 |
+| **Last Updated** | 2026-04-15 |
 
 ## Environment Reference
 
@@ -85,7 +85,7 @@
                          ▼
 ┌─────────────────────────────────────────────────────────────┐
 │                    IV. Rollback                              │
-│  PREV_REVISION from backup helm-history.txt                 │
+│  PREV_REVISION from backup helm-history.json                │
 │  Machine-readable decision table → rollback → verify        │
 └─────────────────────────────────────────────────────────────┘
 ```
@@ -309,6 +309,8 @@ fi
 #### Step 0 — Resolve variables and create backup directory
 
 > **Output:** `BACKUP_DIR` appended to `openab-session-env.sh` → used in Steps 1–7 and the Verification Gate.
+>
+> **Why `POD` is not saved to `openab-session-env.sh`:** The pod name changes after every `helm upgrade` or `kubectl rollout restart` (new pod is created, old one is terminated). Persisting the pod name would cause subsequent steps to target a pod that no longer exists. Each step re-resolves `POD` at runtime to ensure it always refers to the currently running pod.
 
 ```bash
 source openab-session-env.sh
@@ -411,22 +413,33 @@ echo "🔐 SECURITY: secret.yaml contains credentials — do NOT commit. Encrypt
 echo "   gpg --symmetric $BACKUP_DIR/secret.yaml"
 ```
 
-#### Step 7 — Backup Helm release history and PVC data
+#### Step 7 — Backup Helm release history and full PVC snapshot
 
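-> **Input:** `POD` (re-resolved) · **Output:** `$BACKUP_DIR/helm-history.txt`, `$BACKUP_DIR/pvc-data/`
+> **Input:** `POD` (re-resolved) · **Output:** `$BACKUP_DIR/helm-history.txt`, `$BACKUP_DIR/helm-history.json`, `$BACKUP_DIR/pvc-data/`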
+>
+> **Note on PVC overlap:** `pvc-data/` copies the entire `/home/agent` directory, which includes paths already backed up individually in Steps 2–5 (agents/, steering/, hosts.yml, kiro-auth.sqlite3). This overlap is **intentional**: the full PVC copy is the last-resort restore path if the new version runs a data migration that corrupts the PVC or leaves it incompatible with the old version. The individual backups in Steps 2–5 are for fast, targeted restores; `pvc-data/` is for full rollback of PVC state.
+>
+> **Size threshold:** If the PVC is larger than ~500 MB, `kubectl cp` may be slow or time out. In that case, use the VolumeSnapshot option below instead.
 
 ```bash
 source openab-session-env.sh
 POD=$(kubectl get pod -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" -o jsonpath='{.items[0].metadata.name}')
 
 helm history "$RELEASE_NAME" > "$BACKUP_DIR/helm-history.txt"
-echo "✅ Helm history backed up"
-# This file is the source of truth for PREV_REVISION used in rollback
-
-PVC_SIZE=$(kubectl exec "$POD" -- du -sh /home/agent 2>/dev/null | cut -f1)
-echo "PVC size: $PVC_SIZE"
+helm history "$RELEASE_NAME" --output json > "$BACKUP_DIR/helm-history.json"
+echo "✅ Helm history backed up (text + JSON)"
+# helm-history.json is the source of truth for PREV_REVISION used in Section IV rollback
+# JSON format avoids column-shift parsing issues across Helm versions
+
+PVC_SIZE_KB=$(kubectl exec "$POD" -- du -sk /home/agent 2>/dev/null | cut -f1)  # KiB; -k works with both GNU and BusyBox du
+PVC_SIZE_HUMAN=$(kubectl exec "$POD" -- du -sh /home/agent 2>/dev/null | cut -f1)
+echo "PVC size: $PVC_SIZE_HUMAN"
+if [ "${PVC_SIZE_KB:-0}" -gt 512000 ]; then  # 512000 KiB = 500 MiB
+  echo "⚠️ PVC exceeds 500 MB — kubectl cp may be slow or time out."
+  echo "   Consider using the VolumeSnapshot option below instead."
+fi
 kubectl cp "$POD:/home/agent/" "$BACKUP_DIR/pvc-data/"
-echo "✅ PVC data backed up"
+echo "✅ Full PVC snapshot backed up"
 ```
 
 > **Advanced option — VolumeSnapshot (for large PVCs or CSI-enabled clusters):**
@@ -483,7 +496,8 @@ check_dir  "$BACKUP_DIR/steering/"            "Steering files"
 check_file "$BACKUP_DIR/hosts.yml"            "GitHub CLI credentials"
 check_file "$BACKUP_DIR/kiro-auth.sqlite3"    "kiro-cli auth DB"
 check_file "$BACKUP_DIR/secret.yaml"          "Kubernetes Secret"
-check_file "$BACKUP_DIR/helm-history.txt"     "Helm history"
+check_file "$BACKUP_DIR/helm-history.txt"     "Helm history (text)"
+check_file "$BACKUP_DIR/helm-history.json"    "Helm history (JSON — used for PREV_REVISION)"
 check_dir  "$BACKUP_DIR/pvc-data/"            "PVC data"
 
 echo ""
@@ -566,6 +580,8 @@ echo "✅ Automated smoke test passed."
 
 **After automated smoke test — human Discord validation required:**
 
+> **Agent note:** If running in a non-interactive shell (no stdin available), skip the `read` command below. Instead, report to the user that human confirmation is required and pause execution. Resume only after the user explicitly provides `CONFIRMED` or `ROLLBACK`.
+
 ```bash
 # ⏸ HUMAN CONFIRMATION REQUIRED
 # Estimated wait: 2–5 minutes
@@ -675,6 +691,8 @@ echo "✅ All automated checks passed."
 
 **After automated checks — human Discord E2E confirmation:**
 
+> **Agent note:** If running in a non-interactive shell (no stdin available), skip the `read` command below. Instead, report to the user that human confirmation is required and pause execution. Resume only after the user explicitly provides `CONFIRMED` or `ROLLBACK`.
+
 ```bash
 echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
 echo "⏸  PAUSED — Human E2E validation required"
@@ -713,6 +731,14 @@ fi
 
 ## IV. Rollback
 
+> ⚠️ **`helm rollback` does NOT revert PVC data.** Helm only rolls back Kubernetes resources (Deployment, ConfigMap, Secret, etc.). The PVC and its contents remain as-is after rollback.
+>
+> If the new version ran a data migration on startup, the old version may not be compatible with the modified PVC data. In that case, restore PVC data from the Step 7 backup **before** running `helm rollback`:
+> ```bash
+> # Restore PVC data from backup first (see "Restore Custom Config" below)
+> # Then run helm rollback
+> ```
+
 ### Decision Table (Machine-Readable)
 
 > **Agent instruction:** Evaluate conditions in order. Execute the action for the first matching row. Only one action should be taken per rollback event.
@@ -728,28 +754,30 @@ fi
 
 ### Helm Rollback
 
-> **Agent instruction:** `PREV_REVISION` is resolved from the backup's `helm-history.txt` (saved before any upgrade occurred). This avoids the ambiguity of "倒數第二個" when multiple `helm upgrade` calls were made during the upgrade process (pre-release + stable).
+> **Agent instruction:** `PREV_REVISION` is resolved from `helm-history.json` saved during Step 7 (before any upgrade occurred). Using the JSON format avoids column-shift parsing issues across Helm versions. This also avoids the ambiguity of "second-to-last revision" when multiple `helm upgrade` calls were made (pre-release + stable).
 
 ```bash
 source openab-session-env.sh
 
-# Validate BACKUP_DIR is set and helm-history.txt exists
+# Validate BACKUP_DIR is set and helm-history.json exists
-if [ -z "$BACKUP_DIR" ] || [ ! -f "$BACKUP_DIR/helm-history.txt" ]; then
-  echo "❌ BACKUP_DIR not set or helm-history.txt missing."
-  echo "   Resolve manually: helm history $RELEASE_NAME"
+if [ -z "$BACKUP_DIR" ] || [ ! -f "$BACKUP_DIR/helm-history.json" ]; then
+  echo "❌ BACKUP_DIR not set or helm-history.json missing."
+  echo "   Resolve manually: helm history $RELEASE_NAME --output json | jq"
   exit 1
 fi
 
 echo "Using backup: $BACKUP_DIR"
 echo "Backup timestamp: $(echo "$BACKUP_DIR" | grep -oE '[0-9]{8}-[0-9]{6}')"
 
-# Resolve the pre-upgrade stable revision from the backup
+# Resolve the pre-upgrade stable revision from the backup JSON
 # (the last revision with status "deployed" at the time of backup)
-PREV_REVISION=$(awk 'NR>1 && $3=="deployed" {rev=$1} END {print rev}' "$BACKUP_DIR/helm-history.txt")
-if [ -z "$PREV_REVISION" ]; then
-  echo "❌ Could not resolve PREV_REVISION from helm-history.txt."
-  echo "   Contents of helm-history.txt:"
-  cat "$BACKUP_DIR/helm-history.txt"
+# Uses JSON format saved during Step 7 — avoids column-shift parsing issues across Helm versions
+PREV_REVISION=$(jq -r '[.[] | select(.status == "deployed")] | sort_by(.revision) | last | .revision' \
+  "$BACKUP_DIR/helm-history.json" 2>/dev/null)
+if [ -z "$PREV_REVISION" ] || [ "$PREV_REVISION" = "null" ]; then
+  echo "❌ Could not resolve PREV_REVISION from helm-history.json."
+  echo "   Contents of helm-history.json:"
+  cat "$BACKUP_DIR/helm-history.json"
   echo ""
   echo "   Set PREV_REVISION manually and re-run: helm rollback $RELEASE_NAME <REVISION>"
   exit 1