diff --git a/docs/openab-upgrade-sop.md b/docs/openab-upgrade-sop.md new file mode 100644 index 00000000..e9ea81b7 --- /dev/null +++ b/docs/openab-upgrade-sop.md @@ -0,0 +1,849 @@ +# OpenAB Version Upgrade SOP + +> [AI Instruction] Follow this documentation to assist the user in executing the upgrade process efficiently while ensuring all backup and rollback protocols are met. + +| | | +|---|---| +| **Document Version** | 1.5 | +| **Last Updated** | 2026-04-15 | + +## Environment Reference + +| Item | Details | +|---|---| +| Deployment Method | Kubernetes (Helm Chart) | +| Deployment Name | `-kiro` (default: `openab-kiro`) — see note below | +| Pod Label (precise) | `app.kubernetes.io/instance=,app.kubernetes.io/name=-kiro` | +| Helm Repo (OCI, recommended) | `oci://ghcr.io/openabdev/charts/openab` | +| Helm Repo (GitHub Pages, fallback) | `https://openabdev.github.io/openab` | +| Image Registry | `ghcr.io/openabdev/openab` | +| Git Repo | `github.com/openabdev/openab` | +| Agent Config | `/home/agent/.kiro/agents/default.json` | +| Steering Files | `/home/agent/.kiro/steering/` | +| kiro-cli Auth DB | `/home/agent/.local/share/kiro-cli/data.sqlite3` | +| PVC Mount Path | `/home/agent` (Helm); `.kiro` / `.local/share/kiro-cli` (raw k8s) | +| KUBECONFIG | `~/.kube/config` (must be set explicitly — default k3s config has insufficient permissions) | +| Namespace | `default` (adjust to match your actual deployment namespace) | + +> **Deployment naming pattern:** The deployment name follows `-kiro`. For the default setup (`helm install openab …`), the deployment is `openab-kiro`. If you used a different release name (e.g. `my-bot`), the deployment is `my-bot-kiro`. Verify with: +> ```bash +> RELEASE_NAME=$(helm list -o json | jq -r '.[] | select(.chart | startswith("openab-")) | .name' | head -1) +> DEPLOYMENT="${RELEASE_NAME}-kiro" +> echo "Deployment: $DEPLOYMENT" +> ``` + +> ⚠️ The local kubectl defaults to reading `/etc/rancher/k3s/k3s.yaml`, which will result in a permission denied error. Before running any command, always set: +> ```bash +> export KUBECONFIG=~/.kube/config +> ``` + +> 💡 **Namespace setup (recommended):** If OpenAB is deployed in a non-default namespace, set the following at the start of your session to avoid having to append `-n ` to every command: +> ```bash +> export NS=openab # replace with your actual namespace +> export KUBECONFIG=~/.kube/config +> alias kubectl="kubectl -n $NS" +> alias helm="helm -n $NS" +> ``` +> All `kubectl` and `helm` commands in this SOP assume either the default namespace or that the above aliases are in effect. + +> ⚠️ **Data loss warning:** `helm uninstall` **deletes the PVC** and all persistent data (steering files, auth database, agent config) unless the chart has an explicit resource policy annotation. Always use `helm rollback` instead of uninstall + reinstall. If you need to uninstall, back up the PVC data first. + +--- + +## Upgrade Process Overview + +``` +┌─────────────────────────────────────────────────────────────┐ +│ 0. Environment Readiness Check │ +│ kubectl / helm / jq / curl / KUBECONFIG / cluster access │ +└────────────────────────┬────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────┐ +│ I. Pre-Upgrade Preparation │ +│ Resolve vars → Save session env file → Read release notes │ +└────────────────────────┬────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────┐ +│ II. Backup │ +│ Steps 0–7 → Verification Gate (all files non-empty) ✅ │ +└────────────────────────┬────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────┐ +│ III. Upgrade Execution (2 Phases) │ +│ │ +│ Step 1: Pre-release Validation │ +│ Check beta.1 exists → deploy → automated smoke test │ +│ → ⏸ HUMAN CONFIRMATION → proceed or rollback │ +│ │ +│ Step 2: Promote to Stable │ +│ helm upgrade (OCI) → rollout status → verification │ +└────────────────────────┬────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────┐ +│ IV. Rollback │ +│ PREV_REVISION from backup helm-history.json │ +│ Machine-readable decision table → rollback → verify │ +└─────────────────────────────────────────────────────────────┘ +``` + +--- + +## 0. Environment Readiness Check + +> **Agent instruction:** Run this section before anything else. If any check fails, stop and resolve the issue before proceeding. Do not attempt workarounds. +> +> **Expected output on success:** All lines print `✅` and the final line reads `✅ Environment ready.` + +```bash +export KUBECONFIG=~/.kube/config + +echo "=== Environment Readiness Check ===" +READY=true + +check_cmd() { + if command -v "$1" > /dev/null 2>&1; then + echo " ✅ $1 found" + else + echo " ❌ $1 not found — install it before proceeding" + READY=false + fi +} + +check_cmd kubectl +check_cmd helm +check_cmd jq +check_cmd curl +check_cmd awk +check_cmd tar + +echo "" +echo " KUBECONFIG: $KUBECONFIG" +if [ -f "$KUBECONFIG" ]; then + echo " ✅ KUBECONFIG file exists" +else + echo " ❌ KUBECONFIG file not found at $KUBECONFIG" + READY=false +fi + +CURRENT_CONTEXT=$(kubectl config current-context 2>/dev/null) +if [ -n "$CURRENT_CONTEXT" ]; then + echo " ✅ kubectl context: $CURRENT_CONTEXT" +else + echo " ❌ No kubectl context — check KUBECONFIG" + READY=false +fi + +if kubectl cluster-info > /dev/null 2>&1; then + echo " ✅ Cluster reachable" +else + echo " ❌ Cannot reach cluster — check KUBECONFIG and cluster status" + READY=false +fi + +echo "" +if [ "$READY" = true ]; then + echo "✅ Environment ready. Proceed to Section I." +else + echo "❌ Environment not ready. Fix the issues above before proceeding." + exit 1 +fi +``` + +--- + +## I. Pre-Upgrade Preparation + +### 1. Resolve All Session Variables + +> **Agent instruction:** Run this entire block as one unit. All subsequent sections depend on `openab-session-env.sh`. If the session file already exists from a previous run (e.g. backup was done earlier and upgrade is now resuming), source it instead of re-running this block. +> +> **Output:** `openab-session-env.sh` → sourced by all subsequent sections. + +```bash +export KUBECONFIG=~/.kube/config + +# If resuming a previous session, source the saved env and skip this block: +# source openab-session-env.sh && echo "✅ Session env loaded" && exit 0 + +# --- Resolve release and deployment names --- +RELEASE_NAME=$(helm list -o json | jq -r '.[] | select(.chart | startswith("openab-")) | .name' | head -1) +if [ -z "$RELEASE_NAME" ]; then + echo "❌ No Helm release found. Is OpenAB installed?" + exit 1 +fi +DEPLOYMENT="${RELEASE_NAME}-kiro" +echo "Release: $RELEASE_NAME | Deployment: $DEPLOYMENT" +# Expected output contains: "Release: openab | Deployment: openab-kiro" + +# --- Resolve current version --- +CURRENT_VERSION=$(helm list -f "$RELEASE_NAME" -o json | jq -r '.[0].chart' | sed 's/openab-//') +echo "Current chart version: $CURRENT_VERSION" + +# --- Resolve target version (latest stable from OCI, no pre-release tags) --- +TARGET_VERSION=$(helm show chart oci://ghcr.io/openabdev/charts/openab 2>/dev/null \ + | grep '^version:' | awk '{print $2}') +if [ -z "$TARGET_VERSION" ]; then + echo "❌ Could not resolve target version from OCI registry." + echo " Check network connectivity and registry access." + exit 1 +fi +echo "Target version (latest stable from OCI): $TARGET_VERSION" +# Expected output: "Target version (latest stable from OCI): 0.7.5" + +# If you need to upgrade to a specific version instead of latest, override here: +# TARGET_VERSION="0.7.4" + +# --- Check if upgrade is needed --- +if [ "$CURRENT_VERSION" = "$TARGET_VERSION" ]; then + echo "ℹ️ Already on the latest version ($TARGET_VERSION). No upgrade needed." + echo " If you still want to proceed (e.g. force re-deploy), continue manually." + exit 0 +fi + +# --- Check pre-release availability (determines Step 1 path) --- +if helm show chart oci://ghcr.io/openabdev/charts/openab \ + --version "${TARGET_VERSION}-beta.1" > /dev/null 2>&1; then + PRERELEASE_VERSION="${TARGET_VERSION}-beta.1" + echo "✅ Pre-release found: $PRERELEASE_VERSION" +else + # Check if release notes explicitly mark this as pre-validated + RELEASE_NOTES=$(gh api "repos/openabdev/openab/releases/tags/v${TARGET_VERSION}" \ + --jq '.body' 2>/dev/null || true) + if echo "$RELEASE_NOTES" | grep -q 'pre-release-validated: true'; then + PRERELEASE_VERSION="" + echo "✅ Release notes contain pre-release-validated: true — Step 1 will be skipped" + else + echo "❌ STOP: ${TARGET_VERSION}-beta.1 not found in OCI registry." + echo " Release notes do not contain 'pre-release-validated: true'." + echo " Options:" + echo " 1. Wait for the project to publish ${TARGET_VERSION}-beta.1" + echo " 2. Check GitHub releases for an alternative pre-release tag:" + echo " gh release list --repo openabdev/openab" + echo " 3. If a different pre-release tag is available (e.g. beta.2), set:" + echo " PRERELEASE_VERSION=\"${TARGET_VERSION}-beta.2\"" + echo " Do NOT proceed until a pre-release is available or the release notes" + echo " explicitly contain 'pre-release-validated: true'." + exit 1 + fi +fi + +# --- Save session environment file --- +cat > openab-session-env.sh </dev/null \ + || echo "⚠️ Could not fetch release notes via gh CLI — check the URL manually" +``` + +Pay special attention to: +- Breaking changes +- Helm Chart values changes +- Added or deprecated environment variables +- Any migration steps + +### 3. Check Node Resources + +```bash +source openab-session-env.sh + +kubectl describe nodes | grep -A 5 "Allocatable:" +kubectl top nodes +``` + +> Skipping this step risks the new Pod getting stuck in `Pending` if the node lacks capacity. + +### 4. Announce the Upgrade + +> ⚠️ **Downtime is expected during every upgrade.** The deployment strategy is `Recreate` because the PVC is ReadWriteOnce, which does not support RollingUpdate. The old Pod is terminated before the new one starts — the Discord bot will be unavailable during this window, and this is expected behaviour. + +```bash +source openab-session-env.sh + +# Option A: Discord webhook notification (set DISCORD_WEBHOOK_URL in environment) +if [ -n "${DISCORD_WEBHOOK_URL:-}" ]; then + curl -s -X POST "$DISCORD_WEBHOOK_URL" \ + -H "Content-Type: application/json" \ + -d "{\"content\": \"🔧 **Upgrade starting:** OpenAB is being upgraded from v${CURRENT_VERSION} to v${TARGET_VERSION}. The bot will be offline for approximately 1–3 minutes.\"}" + echo "✅ Discord notification sent" +else + echo "ℹ️ DISCORD_WEBHOOK_URL not set — skipping automated notification" + echo " Notify users manually: OpenAB upgrading v${CURRENT_VERSION} → v${TARGET_VERSION}, ~1–3 min downtime" +fi +``` + +--- + +## II. Backup + +> **Agent instruction — dependency chain:** +> - `openab-session-env.sh` must exist (created in Section I) +> - Steps 0–7 must run in order +> - The Verification Gate must print `✅ GATE PASSED` before proceeding to Section III +> - `BACKUP_DIR` is appended to `openab-session-env.sh` after Step 0 + +### Agent-Executable Backup (Linear Sequence) + +#### Step 0 — Resolve variables and create backup directory + +> **Output:** `BACKUP_DIR` appended to `openab-session-env.sh` → used in Steps 1–7 and the Verification Gate. +> +> **Why `POD` is not saved to `openab-session-env.sh`:** The pod name changes after every `helm upgrade` or `kubectl rollout restart` (new pod is created, old one is terminated). Persisting the pod name would cause subsequent steps to target a pod that no longer exists. Each step re-resolves `POD` at runtime to ensure it always refers to the currently running pod. + +```bash +source openab-session-env.sh + +BACKUP_DIR="openab-backup-$(date +%Y%m%d-%H%M%S)" +mkdir -p "$BACKUP_DIR" +echo "Backup directory: $BACKUP_DIR" + +POD=$(kubectl get pod \ + -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" \ + -o jsonpath='{.items[0].metadata.name}') +echo "Pod: $POD" + +if [ -z "$POD" ]; then + echo "❌ Pod not found. Cannot proceed with backup." + exit 1 +fi + +if ! kubectl exec "$POD" -- which tar > /dev/null 2>&1; then + echo "❌ tar not found in container. kubectl cp of directories will fail. Aborting." + exit 1 +fi + +# Append BACKUP_DIR to session env file +echo "export BACKUP_DIR=\"${BACKUP_DIR}\"" >> openab-session-env.sh +echo "✅ BACKUP_DIR saved to openab-session-env.sh" +``` + +#### Step 1 — Backup Helm values + +> **Output:** `$BACKUP_DIR/values.yaml` +> **Expected:** file size > 0 bytes + +```bash +source openab-session-env.sh +helm get values "$RELEASE_NAME" -o yaml > "$BACKUP_DIR/values.yaml" +echo "✅ Helm values backed up ($(wc -c < "$BACKUP_DIR/values.yaml") bytes)" +``` + +#### Step 2 — Backup agent config + +> **Input:** `POD` (re-resolved) · **Output:** `$BACKUP_DIR/agents/` + +```bash +source openab-session-env.sh +POD=$(kubectl get pod -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" -o jsonpath='{.items[0].metadata.name}') +kubectl cp "$POD:/home/agent/.kiro/agents/" "$BACKUP_DIR/agents/" +echo "✅ Agent config backed up" +``` + +#### Step 3 — Backup steering files + +> **Input:** `POD` (re-resolved) · **Output:** `$BACKUP_DIR/steering/` + +```bash +source openab-session-env.sh +POD=$(kubectl get pod -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" -o jsonpath='{.items[0].metadata.name}') +kubectl cp "$POD:/home/agent/.kiro/steering/" "$BACKUP_DIR/steering/" +echo "✅ Steering files backed up" +``` + +#### Step 4 — Backup skills (optional) + +> **Input:** `POD` (re-resolved) · **Output:** `$BACKUP_DIR/skills/` (may be absent) + +```bash +source openab-session-env.sh +POD=$(kubectl get pod -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" -o jsonpath='{.items[0].metadata.name}') +if kubectl exec "$POD" -- test -d /home/agent/.kiro/skills 2>/dev/null; then + kubectl cp "$POD:/home/agent/.kiro/skills/" "$BACKUP_DIR/skills/" + echo "✅ Skills directory backed up" +else + echo "⚠️ skills/ not found in container — skipping (normal if no custom skills are installed)" +fi +``` + +#### Step 5 — Backup GitHub CLI credentials and kiro-cli auth + +> **Input:** `POD` (re-resolved) · **Output:** `$BACKUP_DIR/hosts.yml`, `$BACKUP_DIR/kiro-auth.sqlite3` + +```bash +source openab-session-env.sh +POD=$(kubectl get pod -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" -o jsonpath='{.items[0].metadata.name}') +kubectl cp "$POD:/home/agent/.config/gh/hosts.yml" "$BACKUP_DIR/hosts.yml" +echo "✅ hosts.yml backed up ($(wc -c < "$BACKUP_DIR/hosts.yml") bytes)" + +kubectl cp "$POD:/home/agent/.local/share/kiro-cli/data.sqlite3" "$BACKUP_DIR/kiro-auth.sqlite3" +echo "✅ kiro-cli auth DB backed up ($(wc -c < "$BACKUP_DIR/kiro-auth.sqlite3") bytes)" +``` + +#### Step 6 — Backup Kubernetes Secret + +> **Output:** `$BACKUP_DIR/secret.yaml` ⚠️ SENSITIVE + +```bash +source openab-session-env.sh +kubectl get secret "${DEPLOYMENT}" -o yaml > "$BACKUP_DIR/secret.yaml" +echo "✅ Secret backed up ($(wc -c < "$BACKUP_DIR/secret.yaml") bytes)" +echo "🔐 SECURITY: secret.yaml contains credentials — do NOT commit. Encrypt before storing:" +echo " gpg --symmetric $BACKUP_DIR/secret.yaml" +``` + +#### Step 7 — Backup Helm release history and full PVC snapshot + +> **Input:** `POD` (re-resolved) · **Output:** `$BACKUP_DIR/helm-history.txt`, `$BACKUP_DIR/pvc-data/` +> +> **Note on PVC overlap:** `pvc-data/` copies the entire `/home/agent` directory, which includes paths already backed up individually in Steps 2–5 (agents/, steering/, hosts.yml, kiro-auth.sqlite3). This overlap is **intentional** — the full PVC snapshot is the last-resort restore path if the new version ran a data migration that corrupts the PVC. The individual backups in Steps 2–5 are for fast, targeted restores; `pvc-data/` is for full rollback of PVC state. +> +> **Size threshold:** If the PVC is larger than ~500 MB, `kubectl cp` may be slow or time out. In that case, use the VolumeSnapshot option below instead. + +```bash +source openab-session-env.sh +POD=$(kubectl get pod -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" -o jsonpath='{.items[0].metadata.name}') + +helm history "$RELEASE_NAME" > "$BACKUP_DIR/helm-history.txt" +helm history "$RELEASE_NAME" --output json > "$BACKUP_DIR/helm-history.json" +echo "✅ Helm history backed up (text + JSON)" +# helm-history.json is the source of truth for PREV_REVISION used in Section IV rollback +# JSON format avoids column-shift parsing issues across Helm versions + +PVC_SIZE_BYTES=$(kubectl exec "$POD" -- du -sb /home/agent 2>/dev/null | cut -f1) +PVC_SIZE_HUMAN=$(kubectl exec "$POD" -- du -sh /home/agent 2>/dev/null | cut -f1) +echo "PVC size: $PVC_SIZE_HUMAN" +if [ "${PVC_SIZE_BYTES:-0}" -gt 524288000 ]; then + echo "⚠️ PVC exceeds 500 MB — kubectl cp may be slow or time out." + echo " Consider using the VolumeSnapshot option below instead." +fi +kubectl cp "$POD:/home/agent/" "$BACKUP_DIR/pvc-data/" +echo "✅ Full PVC snapshot backed up" +``` + +> **Advanced option — VolumeSnapshot (for large PVCs or CSI-enabled clusters):** +> ```bash +> source openab-session-env.sh +> PVC_NAME=$(kubectl get pod "$POD" \ +> -o jsonpath='{.spec.volumes[?(@.persistentVolumeClaim)].persistentVolumeClaim.claimName}') +> SNAPSHOT_CLASS=$(kubectl get volumesnapshotclass -o jsonpath='{.items[0].metadata.name}') +> echo "PVC: $PVC_NAME | SnapshotClass: $SNAPSHOT_CLASS" +> kubectl apply -f - < apiVersion: snapshot.storage.k8s.io/v1 +> kind: VolumeSnapshot +> metadata: +> name: openab-pvc-snapshot-$(date +%Y%m%d) +> spec: +> volumeSnapshotClassName: ${SNAPSHOT_CLASS} +> source: +> persistentVolumeClaimName: ${PVC_NAME} +> EOF +> ``` + +#### Verification Gate — must pass before proceeding to upgrade + +> **Agent instruction:** Run this gate after all backup steps. If output does not contain `✅ GATE PASSED`, **stop immediately** and do not proceed with the upgrade. + +```bash +source openab-session-env.sh +echo "=== Backup Verification Gate ===" +GATE_PASS=true + +check_file() { + local path="$1"; local label="$2" + if [ -s "$path" ]; then + echo " ✅ $label ($(wc -c < "$path") bytes)" + else + echo " ❌ MISSING or EMPTY: $label ($path)" + GATE_PASS=false + fi +} + +check_dir() { + local path="$1"; local label="$2" + if [ -d "$path" ] && [ -n "$(ls -A "$path" 2>/dev/null)" ]; then + echo " ✅ $label ($(ls "$path" | wc -l) files)" + else + echo " ❌ MISSING or EMPTY: $label ($path)" + GATE_PASS=false + fi +} + +check_file "$BACKUP_DIR/values.yaml" "Helm values" +check_dir "$BACKUP_DIR/agents/" "Agent config" +check_dir "$BACKUP_DIR/steering/" "Steering files" +check_file "$BACKUP_DIR/hosts.yml" "GitHub CLI credentials" +check_file "$BACKUP_DIR/kiro-auth.sqlite3" "kiro-cli auth DB" +check_file "$BACKUP_DIR/secret.yaml" "Kubernetes Secret" +check_file "$BACKUP_DIR/helm-history.txt" "Helm history (text)" +check_file "$BACKUP_DIR/helm-history.json" "Helm history (JSON — used for PREV_REVISION)" +check_dir "$BACKUP_DIR/pvc-data/" "PVC data" + +echo "" +if [ "$GATE_PASS" = true ]; then + echo "✅ GATE PASSED — all backup files present and non-empty. Safe to proceed with upgrade." +else + echo "❌ GATE FAILED — one or more backup files are missing or empty." + echo " Do NOT proceed with the upgrade until all checks pass." + exit 1 +fi +``` + +--- + +## III. Upgrade Execution + +> **Agent instruction — session continuity:** +> - Source `openab-session-env.sh` at the start of each step +> - If resuming after a gap (e.g. backup was done earlier), verify `BACKUP_DIR` matches the intended backup: +> ```bash +> source openab-session-env.sh +> echo "BACKUP_DIR: $BACKUP_DIR" +> echo "Backup time: $(echo "$BACKUP_DIR" | grep -oE '[0-9]{8}-[0-9]{6}')" +> ls "$BACKUP_DIR/" +> # Confirm this is the correct backup before proceeding +> ``` + +### Step 1: Pre-release Validation + +> ⚠️ Per project convention, **a stable release must be preceded by a validated pre-release**. Do not skip directly to Step 2. +> +> **Agent note — branch resolution:** +> - If `PRERELEASE_VERSION` is empty (set during Section I because `pre-release-validated: true` was found in release notes): **skip this entire step**, proceed directly to Step 2. +> - If `PRERELEASE_VERSION` is non-empty: run the full step below. +> - If this step fails (automated smoke test fails): run rollback (Section IV) and **stop** — do not proceed to Step 2. + +```bash +source openab-session-env.sh + +if [ -z "$PRERELEASE_VERSION" ]; then + echo "ℹ️ PRERELEASE_VERSION is empty — pre-release step was skipped during env setup." + echo " Proceeding directly to Step 2." + exit 0 +fi + +echo "Deploying pre-release: $PRERELEASE_VERSION" + +# Dry-run first +helm upgrade "$RELEASE_NAME" oci://ghcr.io/openabdev/charts/openab \ + --version "$PRERELEASE_VERSION" \ + -f "$BACKUP_DIR/values.yaml" \ + --dry-run +# Expected output contains: "Release \"openab\" has been upgraded. Happy Helming!" + +# Deploy pre-release +helm upgrade "$RELEASE_NAME" oci://ghcr.io/openabdev/charts/openab \ + --version "$PRERELEASE_VERSION" \ + -f "$BACKUP_DIR/values.yaml" + +kubectl rollout status "deployment/${DEPLOYMENT}" --timeout=300s +# Expected output: "deployment/ successfully rolled out" + +# --- Automated smoke test --- +# Estimated duration: 30–60 seconds +POD=$(kubectl get pod -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" -o jsonpath='{.items[0].metadata.name}') +kubectl wait --for=condition=Ready "pod/${POD}" --timeout=120s +# Expected output: "pod/ condition met" + +kubectl exec "$POD" -- pgrep -x openab +# Expected output: a numeric PID (e.g. "42") — non-zero exit means process not running + +PANIC_LINES=$(kubectl logs "deployment/${DEPLOYMENT}" --tail=100 | grep -icE "panic|fatal" || true) +if [ "$PANIC_LINES" -gt 0 ]; then + echo "❌ Panic/fatal lines found in logs. Automated smoke test FAILED." + echo " Run rollback (Section IV) and do not proceed to Step 2." + exit 1 +fi +echo "✅ Automated smoke test passed." +``` + +**After automated smoke test — human Discord validation required:** + +> **Agent note:** If running in a non-interactive shell (no stdin available), skip the `read` command below. Instead, report to the user that human confirmation is required and pause execution. Resume only after the user explicitly provides `CONFIRMED` or `ROLLBACK`. + +```bash +# ⏸ HUMAN CONFIRMATION REQUIRED +# Estimated wait: 2–5 minutes +echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━" +echo "⏸ PAUSED — Human action required before continuing" +echo "" +echo " 1. Send a test message to the Discord channel" +echo " 2. Confirm the bot responds and basic conversation / tool calls work" +echo "" +echo " When confirmed OK, type: CONFIRMED" +echo " To abort and rollback, type: ROLLBACK" +echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━" +read -t 600 -r HUMAN_INPUT || { echo "❌ Timeout: no human input received within 600s. Aborting."; exit 1; } +case "$HUMAN_INPUT" in + CONFIRMED) + echo "✅ Human confirmed — proceeding to Step 2" + ;; + ROLLBACK) + echo "🔁 Rollback requested by human. Proceed to Section IV." + exit 2 + ;; + *) + echo "❌ Unrecognized input ('$HUMAN_INPUT'). Aborting for safety." + echo " Run rollback (Section IV) if needed." + exit 1 + ;; +esac +``` + +### Step 2: Promote to Stable + +> **Agent instruction:** Only run this after Step 1 is fully complete (automated + human confirmation), or after confirming `PRERELEASE_VERSION` was empty. + +```bash +source openab-session-env.sh + +echo "Promoting to stable: $TARGET_VERSION" + +# Dry-run +helm upgrade "$RELEASE_NAME" oci://ghcr.io/openabdev/charts/openab \ + --version "$TARGET_VERSION" \ + -f "$BACKUP_DIR/values.yaml" \ + --dry-run + +# Deploy stable (short downtime expected due to Recreate strategy) +helm upgrade "$RELEASE_NAME" oci://ghcr.io/openabdev/charts/openab \ + --version "$TARGET_VERSION" \ + -f "$BACKUP_DIR/values.yaml" + +kubectl rollout status "deployment/${DEPLOYMENT}" --timeout=300s +# Expected output: "deployment/ successfully rolled out" +# Estimated duration: 60–180 seconds +``` + +### Post-Upgrade Verification + +> **Agent note — pass/fail criteria:** +> - **Pass:** All commands exit 0, deployed chart version equals `openab-${TARGET_VERSION}`, no panic/fatal in logs, PVC paths are present. +> - **Fail:** Any command exits non-zero, version mismatch, or panic/fatal in logs. → Proceed to Section IV Rollback immediately. + +```bash +source openab-session-env.sh + +POD=$(kubectl get pod \ + -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" \ + -o jsonpath='{.items[0].metadata.name}') + +# 1. Pod status +kubectl get pod -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" +kubectl wait --for=condition=Ready "pod/${POD}" --timeout=120s +# Expected output: "pod/ condition met" + +# 2. Chart version +DEPLOYED=$(helm list -f "$RELEASE_NAME" -o json | jq -r '.[0].chart') +echo "Deployed: $DEPLOYED | Expected: openab-${TARGET_VERSION}" +if [ "$DEPLOYED" != "openab-${TARGET_VERSION}" ]; then + echo "❌ Version mismatch. Investigate before proceeding." + exit 1 +fi + +# 3. Image tag +kubectl get "deployment/${DEPLOYMENT}" \ + -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}' +# Expected output contains: TARGET_VERSION or its image SHA + +# 4. Process check +kubectl exec "$POD" -- pgrep -x openab +# Expected output: a numeric PID + +# 5. Log check +PANIC_LINES=$(kubectl logs "deployment/${DEPLOYMENT}" --tail=100 | grep -icE "panic|fatal" || true) +WARN_LINES=$(kubectl logs "deployment/${DEPLOYMENT}" --tail=100 | grep -icE "error|warn" || true) +echo "Panic/fatal: $PANIC_LINES | Error/warn: $WARN_LINES" +if [ "$PANIC_LINES" -gt 0 ]; then + echo "❌ Panic/fatal found. Proceed to Section IV Rollback." + exit 1 +fi + +# 6. PVC data integrity +kubectl exec "$POD" -- ls /home/agent/.kiro/steering/ +# Expected output: at least one file listed (e.g. IDENTITY.md) +kubectl exec "$POD" -- cat /home/agent/.kiro/agents/default.json | head -5 +# Expected output: first 5 lines of valid JSON + +echo "✅ All automated checks passed." +``` + +**After automated checks — human Discord E2E confirmation:** + +> **Agent note:** If running in a non-interactive shell (no stdin available), skip the `read` command below. Instead, report to the user that human confirmation is required and pause execution. Resume only after the user explicitly provides `CONFIRMED` or `ROLLBACK`. + +```bash +echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━" +echo "⏸ PAUSED — Human E2E validation required" +echo "" +echo " Send a test message in the Discord channel." +echo " Confirm the bot responds and conversation works correctly." +echo "" +echo " When confirmed OK, type: CONFIRMED" +echo " If issues found, type: ROLLBACK" +echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━" +read -t 600 -r HUMAN_INPUT || { echo "❌ Timeout: no human input received within 600s. Aborting."; exit 1; } +case "$HUMAN_INPUT" in + CONFIRMED) echo "✅ Upgrade complete." ;; + ROLLBACK) echo "🔁 Rollback requested. Proceed to Section IV."; exit 2 ;; + *) echo "❌ Unrecognized input. Aborting."; exit 1 ;; +esac +``` + +### Completion Notice + +```bash +source openab-session-env.sh + +# Send completion notification via Discord webhook (if configured) +if [ -n "${DISCORD_WEBHOOK_URL:-}" ]; then + curl -s -X POST "$DISCORD_WEBHOOK_URL" \ + -H "Content-Type: application/json" \ + -d "{\"content\": \"✅ **Upgrade complete:** OpenAB is now running v${TARGET_VERSION}. Service restored.\"}" + echo "✅ Completion notice sent" +else + echo "ℹ️ Notify users manually: OpenAB upgraded to v${TARGET_VERSION}, service restored." +fi +``` + +--- + +## IV. Rollback + +> ⚠️ **`helm rollback` does NOT revert PVC data.** Helm only rolls back Kubernetes resources (Deployment, ConfigMap, Secret, etc.). The PVC and its contents remain as-is after rollback. +> +> If the new version ran a data migration on startup, the old version may not be compatible with the modified PVC data. In that case, restore PVC data from the Step 7 backup **before** running `helm rollback`: +> ```bash +> # Restore PVC data from backup first (see "Restore Custom Config" below) +> # Then run helm rollback +> ``` + +### Decision Table (Machine-Readable) + +> **Agent instruction:** Evaluate conditions in order. Execute the action for the first matching row. Only one action should be taken per rollback event. + +| Condition to check | How to check | Action | +|---|---|---| +| Pod phase is `CrashLoopBackOff` or `Pending` | `kubectl get pod ... -o jsonpath='{.items[0].status.phase}'` | `helm rollback` immediately | +| Pod is `Running` AND `pgrep -x openab` exits non-zero | `kubectl exec $POD -- pgrep -x openab; echo $?` | `helm rollback` | +| Pod is `Running`, process OK, logs contain `panic` or `fatal` | `kubectl logs ... \| grep -icE "panic\|fatal"` | `helm rollback` | +| Pod is `Running`, process OK, logs clean, no Discord response after 60s | Human reports no response | `kubectl rollout restart` first; if still no response after 60s → `helm rollback` | +| Pod is `Running`, process OK, bot responds, but config files missing | `kubectl exec $POD -- ls /home/agent/.kiro/steering/` | Restore from backup → `kubectl rollout restart` | +| Quick fix is clearly identified (e.g. known bad config key) | Human identifies root cause | Hotfix — escalate to human engineer | + +### Helm Rollback + +> **Agent instruction:** `PREV_REVISION` is resolved from `helm-history.json` saved during Step 7 (before any upgrade occurred). Using the JSON format avoids column-shift parsing issues across Helm versions. This also avoids the ambiguity of "second-to-last revision" when multiple `helm upgrade` calls were made (pre-release + stable). + +```bash +source openab-session-env.sh + +# Validate BACKUP_DIR is set and helm-history.txt exists +if [ -z "$BACKUP_DIR" ] || [ ! -f "$BACKUP_DIR/helm-history.json" ]; then + echo "❌ BACKUP_DIR not set or helm-history.json missing." + echo " Resolve manually: helm history $RELEASE_NAME --output json | jq" + exit 1 +fi + +echo "Using backup: $BACKUP_DIR" +echo "Backup timestamp: $(echo "$BACKUP_DIR" | grep -oE '[0-9]{8}-[0-9]{6}')" + +# Resolve the pre-upgrade stable revision from the backup JSON +# (the last revision with status "deployed" at the time of backup) +# Uses JSON format saved during Step 7 — avoids column-shift parsing issues across Helm versions +PREV_REVISION=$(jq -r '[.[] | select(.status == "deployed")] | sort_by(.revision) | last | .revision' \ + "$BACKUP_DIR/helm-history.json" 2>/dev/null) +if [ -z "$PREV_REVISION" ] || [ "$PREV_REVISION" = "null" ]; then + echo "❌ Could not resolve PREV_REVISION from helm-history.json." + echo " Contents of helm-history.json:" + cat "$BACKUP_DIR/helm-history.json" + echo "" + echo " Set PREV_REVISION manually and re-run: helm rollback $RELEASE_NAME " + exit 1 +fi +echo "Rolling back to revision: $PREV_REVISION (pre-upgrade stable)" + +# Rollback +helm rollback "$RELEASE_NAME" "$PREV_REVISION" +# Expected output: "Rollback was a success! Happy Helming!" + +kubectl rollout status "deployment/${DEPLOYMENT}" --timeout=300s +# Expected output: "deployment/ successfully rolled out" + +# Confirm rollback +kubectl get pod -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" +# Expected output: 1 pod in Running/Ready state + +# Post-rollback verification +POD=$(kubectl get pod \ + -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" \ + -o jsonpath='{.items[0].metadata.name}') +kubectl wait --for=condition=Ready "pod/${POD}" --timeout=120s +kubectl exec "$POD" -- pgrep -x openab +# Expected output: a numeric PID + +PANIC_LINES=$(kubectl logs "deployment/${DEPLOYMENT}" --tail=100 | grep -icE "panic|fatal" || true) +echo "Panic/fatal after rollback: $PANIC_LINES" +if [ "$PANIC_LINES" -gt 0 ]; then + echo "❌ Panic/fatal found even after rollback. Escalate to human engineer." + exit 1 +fi +echo "✅ Rollback complete. Send Discord test message to confirm bot is responsive." +``` + +### Restore Custom Config + +```bash +source openab-session-env.sh + +POD=$(kubectl get pod \ + -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" \ + -o jsonpath='{.items[0].metadata.name}') + +echo "Restoring from: $BACKUP_DIR" + +# Restore agent config +kubectl cp "$BACKUP_DIR/agents/default.json" "$POD:/home/agent/.kiro/agents/default.json" +echo "✅ Agent config restored" + +# Restore steering files +# ⚠️ Use tar pipe to avoid nested directory issue (e.g. steering/steering/) +kubectl exec "$POD" -- mkdir -p /home/agent/.kiro/steering +tar c -C "$BACKUP_DIR/steering" . | kubectl exec -i "$POD" -- tar x -C /home/agent/.kiro/steering +echo "✅ Steering files restored" + +# Restore GitHub CLI credentials +kubectl cp "$BACKUP_DIR/hosts.yml" "$POD:/home/agent/.config/gh/hosts.yml" +echo "✅ hosts.yml restored" + +# Restore kiro-cli auth database +kubectl exec "$POD" -- mkdir -p /home/agent/.local/share/kiro-cli +kubectl cp "$BACKUP_DIR/kiro-auth.sqlite3" "$POD:/home/agent/.local/share/kiro-cli/data.sqlite3" +echo "✅ kiro-cli auth DB restored" + +# Restart Pod to apply changes +kubectl rollout restart "deployment/${DEPLOYMENT}" +kubectl rollout status "deployment/${DEPLOYMENT}" --timeout=300s +echo "✅ Pod restarted with restored config" +```