From 03a7a2ae28e2f5e99361632a67d7b037357862f7 Mon Sep 17 00:00:00 2001 From: JARVIS-Agent Date: Mon, 13 Apr 2026 16:36:34 +0800 Subject: [PATCH 1/8] docs: add upgrade SOP for Helm-based deployments Co-Authored-By: JARVIS-Agent --- docs/openab-upgrade-sop.md | 379 +++++++++++++++++++++++++++++++++++++ 1 file changed, 379 insertions(+) create mode 100644 docs/openab-upgrade-sop.md diff --git a/docs/openab-upgrade-sop.md b/docs/openab-upgrade-sop.md new file mode 100644 index 00000000..636a634c --- /dev/null +++ b/docs/openab-upgrade-sop.md @@ -0,0 +1,379 @@ +# OpenAB Version Upgrade SOP + +## Environment Reference + +| Item | Details | +|---|---| +| Deployment Method | Kubernetes (Helm Chart) | +| Deployment Name | `openab-kiro` | +| Pod Label | `app.kubernetes.io/instance=openab` | +| Helm Repo (GitHub Pages) | `https://openabdev.github.io/openab` | +| Helm Repo (OCI) | `oci://ghcr.io/openabdev/charts/openab` | +| Image Registry | `ghcr.io/openabdev/openab` | +| Git Repo | `github.com/openabdev/openab` | +| Agent Config | `/home/agent/.kiro/agents/default.json` | +| Steering Files | `/home/agent/.kiro/steering/` | +| PVC Mount Path | `/home/agent` (Helm); `.kiro` / `.local/share/kiro-cli` (raw k8s) | +| KUBECONFIG | `~/.kube/config` (must be set explicitly — default k3s config has insufficient permissions) | + +> ⚠️ The local kubectl defaults to reading `/etc/rancher/k3s/k3s.yaml`, which will result in a permission denied error. 
Before running any command, always set: +> ```bash +> export KUBECONFIG=~/.kube/config +> ``` + +--- + +## Upgrade Process Overview + +``` +┌─────────────────────────────────────────────────────────────┐ +│ Pre-Upgrade Preparation │ +│ Check version info → Read Release Notes → Announce outage │ +└────────────────────────┬────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────┐ +│ Backup │ +│ Helm values / Agent config / Steering / hosts.yml / PVC │ +└────────────────────────┬────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────┐ +│ Upgrade Execution (2 Phases) │ +│ │ +│ Step 1: Pre-release Validation │ +│ helm upgrade --version x.x.x-beta.1 │ +│ └─ Discord functional test │ +│ ├─ Pass ──────────────────────────┐ │ +│ └─ Fail → Wait for beta.2, retry │ │ +│ ▼ │ +│ Step 2: Promote to Stable │ +│ helm upgrade --version x.x.x │ +│ └─ kubectl rollout status │ +└────────────────────────┬────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────┐ +│ Post-Upgrade Verification │ +│ Pod Ready? → Version check → Log check → Discord E2E test │ +│ │ │ +│ ├─ All pass → Send completion notice ✅ │ +│ └─ Issues → Proceed to rollback ↓ │ +└────────────────────────┬────────────────────────────────────┘ + │ (on failure) + ▼ +┌─────────────────────────────────────────────────────────────┐ +│ Rollback │ +│ │ +│ Diagnose symptom │ +│ ├─ Pod won't start → helm rollback │ +│ ├─ Broken / Pod OK → rollback or hotfix │ +│ ├─ Config lost → restore from backup │ +│ └─ Bot unresponsive → kubectl rollout restart → rollback │ +│ │ +│ Rollback done → Re-run verification → Send rollback notice │ +└─────────────────────────────────────────────────────────────┘ +``` + +--- + +## I. Pre-Upgrade Preparation + +### 1. Check Current Version + +> ℹ️ OpenAB is a pre-compiled Rust binary shipped inside a Docker image. 
There is **no source code or git repository** inside the container — version information must be retrieved from Helm or the image tag. + +```bash +# Get the current running Pod +POD=$(kubectl get pod -l app.kubernetes.io/instance=openab -o jsonpath='{.items[0].metadata.name}') + +# Check deployed Helm chart version and image tag +helm list -f openab +helm status openab + +# Check the actual image the Pod is running (including tag / SHA) +kubectl get deployment openab-kiro \ + -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}' + +# Check latest releases on GitHub +# Visit https://github.com/openabdev/openab/releases + +# List available versions from the Helm repo (requires repo to be added first — see Section III) +helm search repo openab --versions +``` + +### 2. Read the Release Notes + +- Go to `https://github.com/openabdev/openab/releases/tag/` +- Pay special attention to: + - Breaking changes + - Helm Chart values changes + - Added or deprecated environment variables + - Any migration steps + +### 3. Announce the Upgrade + +> ⚠️ **Downtime is expected during every upgrade.** The deployment strategy is `Recreate` because the PVC is ReadWriteOnce, which does not support RollingUpdate. The old Pod is terminated before the new one starts — the Discord bot will be unavailable during this window, and this is expected behaviour. + +- Notify all users via Discord channel / Slack / email: + - Scheduled upgrade time and estimated downtime (typically 1–3 minutes) + - Scope of impact (Discord bot will be offline during the upgrade) + - Emergency contact + +--- + +## II. Backup + +### Backup Checklist + +| Item | Command | Notes | +|---|---|---| +| Helm values | `helm get values openab -o yaml > openab-values-backup-$(date +%Y%m%d).yaml` | Current Helm deployment parameters | +| Agent config | `kubectl cp $POD:/home/agent/.kiro/agents/default.json ./backup-default.json` | Custom agent settings (model, prompt, tools, etc.) 
| +| Steering files | `kubectl cp $POD:/home/agent/.kiro/steering/ ./backup-steering/` | Steering docs such as IDENTITY.md | +| Skills | `kubectl cp $POD:/home/agent/.kiro/skills/ ./backup-skills/` | Custom agent skills (if any; skip if path does not exist) | +| hosts.yml | `kubectl cp $POD:/home/agent/.config/gh/hosts.yml ./backup-hosts.yml` | GitHub CLI credentials (including multi-account configs) | +| STT API Key | `kubectl get secret openab-kiro -o jsonpath='{.data.stt-api-key}' \| base64 -d > ./backup-stt-api-key.txt` | Required only if STT is enabled (`stt.enabled: true`) | +| PVC data | See "PVC Backup" section below | Persistent data mounted to the Pod (⚠️ manual step required) | +| Helm release history | `helm history openab > openab-helm-history-$(date +%Y%m%d).txt` | Useful reference for rollback | + +### One-Click Backup Script + +```bash +#!/bin/bash +set -e + +BACKUP_DIR="openab-backup-$(date +%Y%m%d-%H%M%S)" +mkdir -p "$BACKUP_DIR" + +POD=$(kubectl get pod -l app.kubernetes.io/instance=openab -o jsonpath='{.items[0].metadata.name}') +if [ -z "$POD" ]; then + echo "❌ OpenAB Pod not found. Aborting backup." && exit 1 +fi + +backup_step() { + local desc="$1"; shift + echo "=== $desc ===" + if ! 
"$@"; then + echo "❌ Failed: $desc" && exit 1 + fi +} + +backup_step "Backup Helm values" bash -c "helm get values openab -o yaml > $BACKUP_DIR/values.yaml" +backup_step "Backup Agent config" kubectl cp "$POD:/home/agent/.kiro/agents/" "$BACKUP_DIR/agents/" +backup_step "Backup Steering files" kubectl cp "$POD:/home/agent/.kiro/steering/" "$BACKUP_DIR/steering/" +kubectl cp "$POD:/home/agent/.kiro/skills/" "$BACKUP_DIR/skills/" 2>/dev/null || echo "⚠️ skills/ directory not found, skipping" +backup_step "Backup hosts.yml" kubectl cp "$POD:/home/agent/.config/gh/hosts.yml" "$BACKUP_DIR/hosts.yml" +backup_step "Backup Helm history" bash -c "helm history openab > $BACKUP_DIR/helm-history.txt" + +echo "=== ✅ Backup complete: $BACKUP_DIR ===" +ls -la "$BACKUP_DIR/" +``` + +### Verify the Backup + +```bash +# Check for empty files in the backup directory +find $BACKUP_DIR -type f -empty + +# Confirm values.yaml is readable +cat $BACKUP_DIR/values.yaml | head -20 +``` + +### PVC Backup (⚠️ Manual Step) + +> This step must be performed manually based on your PVC type. It cannot be automated. + +```bash +# 1. List PVCs mounted to the Pod +kubectl get pod $POD -o jsonpath='{range .spec.volumes[*]}{.name}{"\t"}{.persistentVolumeClaim.claimName}{"\n"}{end}' + +# 2. Option A: VolumeSnapshot (recommended — requires CSI driver support) +kubectl apply -f - < + source: + persistentVolumeClaimName: +EOF + +# 3. Option B: kubectl cp (suitable for small data volumes) +kubectl cp $POD: $BACKUP_DIR/pvc-data/ +``` + +--- + +## III. 
Upgrade Execution + +### Pre-check: Confirm Helm Repo is Configured + +```bash +# GitHub Pages (stable releases — recommended for most cases) +helm repo add openab https://openabdev.github.io/openab +helm repo update + +# List available versions +helm search repo openab --versions + +# Or query via OCI Registry +helm show chart oci://ghcr.io/openabdev/charts/openab --version +``` + +### Step 1: Pre-release Validation (Required) + +> ⚠️ Per project convention, **a stable release must be preceded by a validated pre-release**. Do not skip directly to Step 2. + +```bash +BACKUP_VALUES="/values.yaml" # e.g. openab-backup-20260413-070000/values.yaml + +# Dry-run the pre-release version first +helm upgrade openab openab/openab \ + --version -beta.1 \ + -f "$BACKUP_VALUES" \ + --dry-run + +# Deploy the pre-release +helm upgrade openab openab/openab \ + --version -beta.1 \ + -f "$BACKUP_VALUES" + +kubectl rollout status deployment/openab-kiro --timeout=300s + +# Run functional tests in the Discord channel +# Verify basic conversation and tool calls work as expected +# If issues are found, wait for beta.2 and repeat this step +``` + +### Step 2: Promote to Stable + +```bash +BACKUP_VALUES="/values.yaml" + +# Dry-run the stable version +helm upgrade openab openab/openab \ + --version \ + -f "$BACKUP_VALUES" \ + --dry-run + +# Deploy stable (short downtime is expected) +helm upgrade openab openab/openab \ + --version \ + -f "$BACKUP_VALUES" + +# Wait for the Pod to be ready +kubectl rollout status deployment/openab-kiro --timeout=300s +``` + +### Post-Upgrade Verification + +```bash +POD=$(kubectl get pod -l app.kubernetes.io/instance=openab -o jsonpath='{.items[0].metadata.name}') + +# 1. Check Pod status (must be Running and READY) +kubectl get pod -l app.kubernetes.io/instance=openab +kubectl wait --for=condition=Ready pod/$POD --timeout=120s + +# 2. 
Verify version (from Helm and image tag — no source code in the container) +helm list -f openab +kubectl get deployment openab-kiro \ + -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}' + +# 3. Confirm the openab process is running +kubectl exec $POD -- pgrep -x openab + +# 4. Check logs for errors (ERROR / WARN / panic) +kubectl logs deployment/openab-kiro --tail=100 | grep -iE "error|warn|panic|fatal" + +# 5. Discord E2E verification (final check) +# → Send a test message in the Discord channel +# → Confirm the bot responds and conversation works correctly +``` + +### Completion Notice + +- Once all verifications pass, notify users: + - Upgrade complete, service restored + - New version number and summary of key changes + - Contact channel for reporting any issues + +--- + +## IV. Rollback + +### Decision Tree + +``` +Issue detected after upgrade + │ + ▼ + Pod status? + │ + ├─ CrashLoopBackOff / Pending ──→ helm rollback + │ + ├─ Running, but functionality broken + │ │ + │ ├─ Quick fix possible (e.g. config error) ──→ hotfix (engineer) + │ └─ Root cause unclear ────────────────────→ helm rollback + │ + ├─ Running, but bot is unresponsive + │ │ + │ └─ kubectl rollout restart deployment/openab-kiro + │ │ + │ ├─ Recovers after restart ──→ Continue monitoring + │ └─ Still unresponsive ──→ helm rollback + │ + └─ Running, but custom config is missing + │ + └─ Restore config from backup → kubectl rollout restart +``` + +| Symptom | Action | +|---|---| +| Pod fails to start (CrashLoopBackOff) | Helm rollback | +| Functionality broken, Pod is running | Rollback or hotfix | +| Custom config lost | Restore config files from backup | +| Bot unresponsive | Restart Pod first; rollback if it persists | + +### Helm Rollback + +```bash +# 1. View release history +helm history openab + +# 2. Roll back to a previous revision +helm rollback openab + +# 3. Wait for the Pod to be ready +kubectl rollout status deployment/openab-kiro --timeout=300s + +# 4. 
Confirm rollback succeeded +kubectl get pod -l app.kubernetes.io/instance=openab + +# 5. Run full post-rollback verification (same as post-upgrade verification) +POD=$(kubectl get pod -l app.kubernetes.io/instance=openab -o jsonpath='{.items[0].metadata.name}') +kubectl wait --for=condition=Ready pod/$POD --timeout=120s +kubectl exec $POD -- pgrep -x openab +kubectl logs deployment/openab-kiro --tail=100 | grep -iE "error|warn|panic|fatal" +# → Send a test message in the Discord channel to confirm the bot responds +``` + +### Restore Custom Config + +```bash +POD=$(kubectl get pod -l app.kubernetes.io/instance=openab -o jsonpath='{.items[0].metadata.name}') + +# Restore agent config +kubectl cp ./backup-default.json $POD:/home/agent/.kiro/agents/default.json + +# Restore steering files +kubectl cp ./backup-steering/ $POD:/home/agent/.kiro/steering/ + +# Restore GitHub CLI credentials +kubectl cp ./backup-hosts.yml $POD:/home/agent/.config/gh/hosts.yml + +# Restart Pod to apply changes +kubectl rollout restart deployment/openab-kiro +``` From 9d101d9fa669ec14d03950edc4024c9bbb1fb176 Mon Sep 17 00:00:00 2001 From: JARVIS-Agent Date: Mon, 13 Apr 2026 16:47:09 +0800 Subject: [PATCH 2/8] docs(sop): address reviewer feedback - remove set -e; add explicit per-step error handling note - add tar pre-check before kubectl cp directory operations - add export KUBECONFIG inside backup script for consistency - add full Secret backup step (not just STT key) - add node resource check step in pre-upgrade preparation - add note on when pre-release step can be skipped - use tar pipe for steering restore to avoid kubectl cp dir nesting issue - add Document Version / Last Updated header Co-Authored-By: JARVIS-Agent --- docs/openab-upgrade-sop.md | 57 ++++++++++++++++++++++++++++++++------ 1 file changed, 49 insertions(+), 8 deletions(-) diff --git a/docs/openab-upgrade-sop.md b/docs/openab-upgrade-sop.md index 636a634c..c01a032d 100644 --- a/docs/openab-upgrade-sop.md +++ 
b/docs/openab-upgrade-sop.md @@ -1,5 +1,10 @@ # OpenAB Version Upgrade SOP +| | | +|---|---| +| **Document Version** | 1.1 | +| **Last Updated** | 2026-04-13 | + ## Environment Reference | Item | Details | @@ -111,7 +116,22 @@ helm search repo openab --versions - Added or deprecated environment variables - Any migration steps -### 3. Announce the Upgrade +### 3. Check Node Resources + +> Skipping this step risks the new Pod getting stuck in `Pending` if the node lacks capacity. + +```bash +# Check allocatable resources on all nodes +kubectl describe nodes | grep -A 5 "Allocatable:" + +# Check current resource usage on each node (requires metrics-server) +kubectl top nodes + +# Confirm the new image size has not changed significantly +# (check the release notes for any resource requirement changes) +``` + +### 4. Announce the Upgrade > ⚠️ **Downtime is expected during every upgrade.** The deployment strategy is `Recreate` because the PVC is ReadWriteOnce, which does not support RollingUpdate. The old Pod is terminated before the new one starts — the Discord bot will be unavailable during this window, and this is expected behaviour. @@ -133,6 +153,7 @@ helm search repo openab --versions | Steering files | `kubectl cp $POD:/home/agent/.kiro/steering/ ./backup-steering/` | Steering docs such as IDENTITY.md | | Skills | `kubectl cp $POD:/home/agent/.kiro/skills/ ./backup-skills/` | Custom agent skills (if any; skip if path does not exist) | | hosts.yml | `kubectl cp $POD:/home/agent/.config/gh/hosts.yml ./backup-hosts.yml` | GitHub CLI credentials (including multi-account configs) | +| Full Secret export | `kubectl get secret openab-kiro -o yaml > ./backup-secret.yaml` | Full Secret dump including Discord token, STT key, etc. 
— store securely | | STT API Key | `kubectl get secret openab-kiro -o jsonpath='{.data.stt-api-key}' \| base64 -d > ./backup-stt-api-key.txt` | Required only if STT is enabled (`stt.enabled: true`) | | PVC data | See "PVC Backup" section below | Persistent data mounted to the Pod (⚠️ manual step required) | | Helm release history | `helm history openab > openab-helm-history-$(date +%Y%m%d).txt` | Useful reference for rollback | @@ -141,7 +162,11 @@ helm search repo openab --versions ```bash #!/bin/bash -set -e +# Note: set -e is intentionally omitted. +# Error handling is done explicitly per step via backup_step() +# to avoid set -e interfering with the if ! "$@" pattern inside functions. + +export KUBECONFIG=~/.kube/config BACKUP_DIR="openab-backup-$(date +%Y%m%d-%H%M%S)" mkdir -p "$BACKUP_DIR" @@ -151,6 +176,13 @@ if [ -z "$POD" ]; then echo "❌ OpenAB Pod not found. Aborting backup." && exit 1 fi +# Pre-check: kubectl cp directory operations require tar inside the container +if ! kubectl exec "$POD" -- which tar > /dev/null 2>&1; then + echo "❌ tar not found in container. kubectl cp of directories will fail." + echo " Use 'kubectl exec' with a tar pipe instead, or use VolumeSnapshot for PVC backup." 
+ exit 1 +fi + backup_step() { local desc="$1"; shift echo "=== $desc ===" @@ -159,15 +191,18 @@ backup_step() { fi } -backup_step "Backup Helm values" bash -c "helm get values openab -o yaml > $BACKUP_DIR/values.yaml" -backup_step "Backup Agent config" kubectl cp "$POD:/home/agent/.kiro/agents/" "$BACKUP_DIR/agents/" +backup_step "Backup Helm values" bash -c "helm get values openab -o yaml > $BACKUP_DIR/values.yaml" +backup_step "Backup Agent config" kubectl cp "$POD:/home/agent/.kiro/agents/" "$BACKUP_DIR/agents/" backup_step "Backup Steering files" kubectl cp "$POD:/home/agent/.kiro/steering/" "$BACKUP_DIR/steering/" -kubectl cp "$POD:/home/agent/.kiro/skills/" "$BACKUP_DIR/skills/" 2>/dev/null || echo "⚠️ skills/ directory not found, skipping" -backup_step "Backup hosts.yml" kubectl cp "$POD:/home/agent/.config/gh/hosts.yml" "$BACKUP_DIR/hosts.yml" -backup_step "Backup Helm history" bash -c "helm history openab > $BACKUP_DIR/helm-history.txt" +kubectl cp "$POD:/home/agent/.kiro/skills/" "$BACKUP_DIR/skills/" 2>/dev/null \ + || echo "⚠️ skills/ directory not found, skipping" +backup_step "Backup hosts.yml" kubectl cp "$POD:/home/agent/.config/gh/hosts.yml" "$BACKUP_DIR/hosts.yml" +backup_step "Backup full Secret" bash -c "kubectl get secret openab-kiro -o yaml > $BACKUP_DIR/secret.yaml" +backup_step "Backup Helm history" bash -c "helm history openab > $BACKUP_DIR/helm-history.txt" echo "=== ✅ Backup complete: $BACKUP_DIR ===" ls -la "$BACKUP_DIR/" +echo "⚠️ secret.yaml contains sensitive credentials — store it securely and do not commit it." ``` ### Verify the Backup @@ -225,6 +260,8 @@ helm show chart oci://ghcr.io/openabdev/charts/openab --version ### Step 1: Pre-release Validation (Required) > ⚠️ Per project convention, **a stable release must be preceded by a validated pre-release**. Do not skip directly to Step 2. 
+> +> **When can Step 1 be skipped?** Only if the maintainer explicitly states that the stable release was promoted directly from a pre-release that was already validated in another environment (e.g. a staging cluster). In all other cases, run Step 1 first. ```bash BACKUP_VALUES="/values.yaml" # e.g. openab-backup-20260413-070000/values.yaml @@ -369,7 +406,11 @@ POD=$(kubectl get pod -l app.kubernetes.io/instance=openab -o jsonpath='{.items[ kubectl cp ./backup-default.json $POD:/home/agent/.kiro/agents/default.json # Restore steering files -kubectl cp ./backup-steering/ $POD:/home/agent/.kiro/steering/ +# ⚠️ kubectl cp directory behaviour varies across versions — trailing slash matters. +# Use the tar pipe method below to avoid accidentally creating a nested directory +# (e.g. steering/steering/) which can happen with some kubectl versions. +kubectl exec $POD -- mkdir -p /home/agent/.kiro/steering +tar c -C ./backup-steering . | kubectl exec -i $POD -- tar x -C /home/agent/.kiro/steering # Restore GitHub CLI credentials kubectl cp ./backup-hosts.yml $POD:/home/agent/.config/gh/hosts.yml From df4f39b35f463effb1dcf7c0a177c4a56662163e Mon Sep 17 00:00:00 2001 From: Claude Date: Mon, 13 Apr 2026 13:53:56 +0000 Subject: [PATCH 3/8] docs: address reviewer feedback on upgrade SOP (v1.2) - Backup script: replace exit 1 with record-and-continue pattern; report all failed steps at end - Backup checklist: strengthen security warning for secret.yaml with encryption suggestions (gpg/age) - Environment Reference: add Namespace row; add namespace alias setup instructions - PVC backup Option B: add data size check step with recommended size limit - Pre-release skip condition: replace vague "maintainer explicitly states" with concrete pre-release-validated: true marker - Post-upgrade verification: add steering files and agent config presence checks --- docs/openab-upgrade-sop.md | 64 ++++++++++++++++++++++++++++++++------ 1 file changed, 54 insertions(+), 10 deletions(-) diff 
--git a/docs/openab-upgrade-sop.md b/docs/openab-upgrade-sop.md index c01a032d..af7d19f2 100644 --- a/docs/openab-upgrade-sop.md +++ b/docs/openab-upgrade-sop.md @@ -2,7 +2,7 @@ | | | |---|---| -| **Document Version** | 1.1 | +| **Document Version** | 1.2 | | **Last Updated** | 2026-04-13 | ## Environment Reference @@ -20,12 +20,22 @@ | Steering Files | `/home/agent/.kiro/steering/` | | PVC Mount Path | `/home/agent` (Helm); `.kiro` / `.local/share/kiro-cli` (raw k8s) | | KUBECONFIG | `~/.kube/config` (must be set explicitly — default k3s config has insufficient permissions) | +| Namespace | `default` (adjust to match your actual deployment namespace) | > ⚠️ The local kubectl defaults to reading `/etc/rancher/k3s/k3s.yaml`, which will result in a permission denied error. Before running any command, always set: > ```bash > export KUBECONFIG=~/.kube/config > ``` +> 💡 **Namespace setup (recommended):** If OpenAB is deployed in a non-default namespace, set the following at the start of your session to avoid having to append `-n ` to every command: +> ```bash +> export NS=openab # replace with your actual namespace +> export KUBECONFIG=~/.kube/config +> alias kubectl="kubectl -n $NS" +> alias helm="helm -n $NS" +> ``` +> All `kubectl` and `helm` commands in this SOP assume either the default namespace or that the above aliases are in effect. + --- ## Upgrade Process Overview @@ -153,7 +163,7 @@ kubectl top nodes | Steering files | `kubectl cp $POD:/home/agent/.kiro/steering/ ./backup-steering/` | Steering docs such as IDENTITY.md | | Skills | `kubectl cp $POD:/home/agent/.kiro/skills/ ./backup-skills/` | Custom agent skills (if any; skip if path does not exist) | | hosts.yml | `kubectl cp $POD:/home/agent/.config/gh/hosts.yml ./backup-hosts.yml` | GitHub CLI credentials (including multi-account configs) | -| Full Secret export | `kubectl get secret openab-kiro -o yaml > ./backup-secret.yaml` | Full Secret dump including Discord token, STT key, etc. 
— store securely | +| Full Secret export | `kubectl get secret openab-kiro -o yaml > ./backup-secret.yaml` | ⚠️ **SENSITIVE** — contains Discord token, STT key, and other credentials. Store securely, **never commit to version control**. Consider encrypting with `gpg` or [`age`](https://github.com/FiloSottile/age) before storing. | | STT API Key | `kubectl get secret openab-kiro -o jsonpath='{.data.stt-api-key}' \| base64 -d > ./backup-stt-api-key.txt` | Required only if STT is enabled (`stt.enabled: true`) | | PVC data | See "PVC Backup" section below | Persistent data mounted to the Pod (⚠️ manual step required) | | Helm release history | `helm history openab > openab-helm-history-$(date +%Y%m%d).txt` | Useful reference for rollback | @@ -163,8 +173,8 @@ kubectl top nodes ```bash #!/bin/bash # Note: set -e is intentionally omitted. -# Error handling is done explicitly per step via backup_step() -# to avoid set -e interfering with the if ! "$@" pattern inside functions. +# Failures are recorded per step and reported at the end, +# so that a single failure does not prevent remaining items from being backed up. export KUBECONFIG=~/.kube/config @@ -183,11 +193,14 @@ if ! kubectl exec "$POD" -- which tar > /dev/null 2>&1; then exit 1 fi +FAILED_STEPS=() + backup_step() { local desc="$1"; shift echo "=== $desc ===" if ! "$@"; then - echo "❌ Failed: $desc" && exit 1 + echo "⚠️ Failed: $desc (continuing with remaining steps...)" + FAILED_STEPS+=("$desc") fi } @@ -200,9 +213,29 @@ backup_step "Backup hosts.yml" kubectl cp "$POD:/home/agent/.config/gh/host backup_step "Backup full Secret" bash -c "kubectl get secret openab-kiro -o yaml > $BACKUP_DIR/secret.yaml" backup_step "Backup Helm history" bash -c "helm history openab > $BACKUP_DIR/helm-history.txt" -echo "=== ✅ Backup complete: $BACKUP_DIR ===" +echo "" +echo "=== Backup Summary: $BACKUP_DIR ===" ls -la "$BACKUP_DIR/" -echo "⚠️ secret.yaml contains sensitive credentials — store it securely and do not commit it." 
+ +if [ ${#FAILED_STEPS[@]} -gt 0 ]; then + echo "" + echo "⚠️ The following backup steps FAILED:" + for step in "${FAILED_STEPS[@]}"; do + echo " - $step" + done + echo "" + echo " Review the failures above before proceeding with the upgrade." + echo " A failed backup step means the corresponding data is NOT protected." +else + echo "✅ All backup steps completed successfully." +fi + +echo "" +echo "🔐 SECURITY REMINDER: $BACKUP_DIR/secret.yaml contains sensitive credentials" +echo " (Discord token, STT key, etc.). Do NOT commit this file." +echo " Consider encrypting it before storing:" +echo " gpg --symmetric $BACKUP_DIR/secret.yaml" +echo " # or: age -p -o $BACKUP_DIR/secret.yaml.age $BACKUP_DIR/secret.yaml" ``` ### Verify the Backup @@ -235,7 +268,12 @@ spec: persistentVolumeClaimName: EOF -# 3. Option B: kubectl cp (suitable for small data volumes) +# 3. Option B: kubectl cp (suitable for small data volumes only) +# +# ⚠️ Size check before proceeding — kubectl cp may time out or run out of memory on large datasets. +# Recommended limit: < 500 MB. For larger volumes, use Option A (VolumeSnapshot). +kubectl exec $POD -- du -sh +# If the output is within an acceptable range, proceed: kubectl cp $POD: $BACKUP_DIR/pvc-data/ ``` @@ -261,7 +299,7 @@ helm show chart oci://ghcr.io/openabdev/charts/openab --version > ⚠️ Per project convention, **a stable release must be preceded by a validated pre-release**. Do not skip directly to Step 2. > -> **When can Step 1 be skipped?** Only if the maintainer explicitly states that the stable release was promoted directly from a pre-release that was already validated in another environment (e.g. a staging cluster). In all other cases, run Step 1 first. +> **When can Step 1 be skipped?** Only if the release notes for the target stable version explicitly contain `pre-release-validated: true`, indicating that the corresponding pre-release has already been validated in another environment (e.g. a staging cluster). 
In all other cases, run Step 1 first. ```bash BACKUP_VALUES="/values.yaml" # e.g. openab-backup-20260413-070000/values.yaml @@ -324,7 +362,13 @@ kubectl exec $POD -- pgrep -x openab # 4. Check logs for errors (ERROR / WARN / panic) kubectl logs deployment/openab-kiro --tail=100 | grep -iE "error|warn|panic|fatal" -# 5. Discord E2E verification (final check) +# 5. Verify steering files and agent config are still present +# (PVC content should survive upgrades, but this confirms it) +kubectl exec $POD -- ls /home/agent/.kiro/steering/ +kubectl exec $POD -- cat /home/agent/.kiro/agents/default.json | head -5 +# If either path is missing, restore from backup (see Section IV: Restore Custom Config) + +# 6. Discord E2E verification (final check) # → Send a test message in the Discord channel # → Confirm the bot responds and conversation works correctly ``` From 7a027e7bc72c24ab2ea304c4d40393b6de352b47 Mon Sep 17 00:00:00 2001 From: JARVIS-Agent Date: Tue, 14 Apr 2026 22:49:59 +0000 Subject: [PATCH 4/8] docs: revise upgrade SOP for AI-first execution and address review feedback Address masami-agent review comments and repo owner AI-first design feedback: Technical fixes (masami-agent): - Add deployment naming pattern note (-kiro) - Use precise pod label selector (instance + name) to avoid multi-agent conflicts - Prefer OCI registry for helm commands; GitHub Pages listed as fallback - Add helm uninstall PVC deletion warning to Environment Reference - Add kiro-cli auth DB backup (data.sqlite3) to checklist and scripts AI-first redesign (repo owner): - Add Agent-Executable Backup section: linear Steps 0-7 with explicit input/output dependency annotations (no ambiguous branches) - Replace all patterns with "run this command to resolve" patterns (RELEASE_NAME, DEPLOYMENT, BACKUP_DIR, TARGET_VERSION, PREV_REVISION) - Add Verification Gate after backup: checks all files exist and are non-empty; exits 1 on failure so agent cannot proceed to upgrade - Add machine-readable 
pass/fail criteria for pre-release validation and post-upgrade verification steps - Add machine-readable decision table for rollback branch selection - Auto-resolve PREV_REVISION via helm history JSON + jq in rollback steps - Restore section uses BACKUP_DIR resolved from ls -td pattern --- docs/openab-upgrade-sop.md | 604 +++++++++++++++++++++++++++---------- 1 file changed, 449 insertions(+), 155 deletions(-) diff --git a/docs/openab-upgrade-sop.md b/docs/openab-upgrade-sop.md index af7d19f2..0dfc3307 100644 --- a/docs/openab-upgrade-sop.md +++ b/docs/openab-upgrade-sop.md @@ -2,26 +2,34 @@ | | | |---|---| -| **Document Version** | 1.2 | -| **Last Updated** | 2026-04-13 | +| **Document Version** | 1.3 | +| **Last Updated** | 2026-04-14 | ## Environment Reference | Item | Details | |---|---| | Deployment Method | Kubernetes (Helm Chart) | -| Deployment Name | `openab-kiro` | -| Pod Label | `app.kubernetes.io/instance=openab` | -| Helm Repo (GitHub Pages) | `https://openabdev.github.io/openab` | -| Helm Repo (OCI) | `oci://ghcr.io/openabdev/charts/openab` | +| Deployment Name | `-kiro` (default: `openab-kiro`) — see note below | +| Pod Label (precise) | `app.kubernetes.io/instance=openab,app.kubernetes.io/name=openab-kiro` | +| Helm Repo (OCI, recommended) | `oci://ghcr.io/openabdev/charts/openab` | +| Helm Repo (GitHub Pages, fallback) | `https://openabdev.github.io/openab` | | Image Registry | `ghcr.io/openabdev/openab` | | Git Repo | `github.com/openabdev/openab` | | Agent Config | `/home/agent/.kiro/agents/default.json` | | Steering Files | `/home/agent/.kiro/steering/` | +| kiro-cli Auth DB | `/home/agent/.local/share/kiro-cli/data.sqlite3` | | PVC Mount Path | `/home/agent` (Helm); `.kiro` / `.local/share/kiro-cli` (raw k8s) | | KUBECONFIG | `~/.kube/config` (must be set explicitly — default k3s config has insufficient permissions) | | Namespace | `default` (adjust to match your actual deployment namespace) | +> **Deployment naming pattern:** The deployment 
name follows `-kiro`. For the default setup (`helm install openab …`), the deployment is `openab-kiro`. If you used a different release name (e.g. `my-bot`), the deployment is `my-bot-kiro`. Verify with: +> ```bash +> RELEASE_NAME=$(helm list -o json | jq -r '.[0].name') +> DEPLOYMENT="${RELEASE_NAME}-kiro" +> echo "Deployment: $DEPLOYMENT" +> ``` + > ⚠️ The local kubectl defaults to reading `/etc/rancher/k3s/k3s.yaml`, which will result in a permission denied error. Before running any command, always set: > ```bash > export KUBECONFIG=~/.kube/config @@ -36,6 +44,8 @@ > ``` > All `kubectl` and `helm` commands in this SOP assume either the default namespace or that the above aliases are in effect. +> ⚠️ **Data loss warning:** `helm uninstall` **deletes the PVC** and all persistent data (steering files, auth database, agent config) unless the chart has an explicit resource policy annotation. Always use `helm rollback` instead of uninstall + reinstall. If you need to uninstall, back up the PVC data first. + --- ## Upgrade Process Overview @@ -50,6 +60,7 @@ ┌─────────────────────────────────────────────────────────────┐ │ Backup │ │ Helm values / Agent config / Steering / hosts.yml / PVC │ +│ → Verification gate (all files exist & non-empty) ✅ │ └────────────────────────┬────────────────────────────────────┘ │ ▼ @@ -58,9 +69,9 @@ │ │ │ Step 1: Pre-release Validation │ │ helm upgrade --version x.x.x-beta.1 │ -│ └─ Discord functional test │ +│ └─ Automated smoke test (kubectl wait + pgrep + logs) │ │ ├─ Pass ──────────────────────────┐ │ -│ └─ Fail → Wait for beta.2, retry │ │ +│ └─ Fail → rollback, stop │ │ │ ▼ │ │ Step 2: Promote to Stable │ │ helm upgrade --version x.x.x │ @@ -94,32 +105,54 @@ ## I. Pre-Upgrade Preparation -### 1. Check Current Version +### 1. Resolve Environment Variables -> ℹ️ OpenAB is a pre-compiled Rust binary shipped inside a Docker image. 
There is **no source code or git repository** inside the container — version information must be retrieved from Helm or the image tag. +> **Agent note:** Run this block first. All subsequent steps depend on these variables. +> +> **Step 1 output:** `RELEASE_NAME`, `DEPLOYMENT`, `POD`, `CURRENT_VERSION`, `TARGET_VERSION` → used in all subsequent steps. ```bash -# Get the current running Pod -POD=$(kubectl get pod -l app.kubernetes.io/instance=openab -o jsonpath='{.items[0].metadata.name}') +export KUBECONFIG=~/.kube/config + +# Resolve release name and deployment name +RELEASE_NAME=$(helm list -o json | jq -r '.[0].name') +DEPLOYMENT="${RELEASE_NAME}-kiro" +echo "Release: $RELEASE_NAME | Deployment: $DEPLOYMENT" + +# Get current running pod (precise label selector — avoids matching other agents) +POD=$(kubectl get pod \ + -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" \ + -o jsonpath='{.items[0].metadata.name}') +echo "Pod: $POD" +if [ -z "$POD" ]; then echo "❌ Pod not found. 
Check label selectors."; fi + +# Get current deployed chart version +CURRENT_VERSION=$(helm list -f "$RELEASE_NAME" -o json | jq -r '.[0].chart' | sed 's/openab-//') +echo "Current chart version: $CURRENT_VERSION" -# Check deployed Helm chart version and image tag -helm list -f openab -helm status openab +# List available versions via OCI (no repo add needed) +helm show chart oci://ghcr.io/openabdev/charts/openab 2>/dev/null | grep ^version -# Check the actual image the Pod is running (including tag / SHA) -kubectl get deployment openab-kiro \ +# List all published versions (requires GitHub Pages repo to be added) +# helm repo add openab https://openabdev.github.io/openab && helm repo update +# helm search repo openab --versions + +# Check the actual image the Pod is running +kubectl get deployment "$DEPLOYMENT" \ -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}' +``` -# Check latest releases on GitHub -# Visit https://github.com/openabdev/openab/releases +After running the above, set the target version: -# List available versions from the Helm repo (requires repo to be added first — see Section III) -helm search repo openab --versions +```bash +# Set this to the version you are upgrading to (e.g. 0.7.5) +TARGET_VERSION="0.7.5" +echo "Upgrading to: $TARGET_VERSION" ``` ### 2. Read the Release Notes -- Go to `https://github.com/openabdev/openab/releases/tag/` +- Go to `https://github.com/openabdev/openab/releases/tag/v${TARGET_VERSION}` - Pay special attention to: - Breaking changes - Helm Chart values changes @@ -136,9 +169,6 @@ kubectl describe nodes | grep -A 5 "Allocatable:" # Check current resource requests across the cluster kubectl top nodes - -# Confirm the new image size has not changed significantly -# (check the release notes for any resource requirement changes) ``` ### 4. Announce the Upgrade @@ -154,22 +184,204 @@ kubectl top nodes ## II. 
Backup -### Backup Checklist +> **Agent note — dependency chain:** +> - Step 0 must run first (resolves `BACKUP_DIR` and `POD`) +> - Steps 1–7 depend on `POD` from Step 0 +> - The Verification Gate must pass before proceeding to Section III -| Item | Command | Notes | -|---|---|---| -| Helm values | `helm get values openab -o yaml > openab-values-backup-$(date +%Y%m%d).yaml` | Current Helm deployment parameters | -| Agent config | `kubectl cp $POD:/home/agent/.kiro/agents/default.json ./backup-default.json` | Custom agent settings (model, prompt, tools, etc.) | -| Steering files | `kubectl cp $POD:/home/agent/.kiro/steering/ ./backup-steering/` | Steering docs such as IDENTITY.md | -| Skills | `kubectl cp $POD:/home/agent/.kiro/skills/ ./backup-skills/` | Custom agent skills (if any; skip if path does not exist) | -| hosts.yml | `kubectl cp $POD:/home/agent/.config/gh/hosts.yml ./backup-hosts.yml` | GitHub CLI credentials (including multi-account configs) | -| Full Secret export | `kubectl get secret openab-kiro -o yaml > ./backup-secret.yaml` | ⚠️ **SENSITIVE** — contains Discord token, STT key, and other credentials. Store securely, **never commit to version control**. Consider encrypting with `gpg` or [`age`](https://github.com/FiloSottile/age) before storing. | -| STT API Key | `kubectl get secret openab-kiro -o jsonpath='{.data.stt-api-key}' \| base64 -d > ./backup-stt-api-key.txt` | Required only if STT is enabled (`stt.enabled: true`) | -| PVC data | See "PVC Backup" section below | Persistent data mounted to the Pod (⚠️ manual step required) | -| Helm release history | `helm history openab > openab-helm-history-$(date +%Y%m%d).txt` | Useful reference for rollback | +### Agent-Executable Backup (Linear Sequence) + +This section is written as a machine-executable runbook with no ambiguous branches. Run steps in order. 
+ +#### Step 0 — Resolve variables and create backup directory + +> **Output:** `BACKUP_DIR`, `POD` → used in Steps 1–7 and the Verification Gate. + +```bash +export KUBECONFIG=~/.kube/config + +RELEASE_NAME=$(helm list -o json | jq -r '.[0].name') +DEPLOYMENT="${RELEASE_NAME}-kiro" +BACKUP_DIR="openab-backup-$(date +%Y%m%d-%H%M%S)" +mkdir -p "$BACKUP_DIR" +echo "Backup directory: $BACKUP_DIR" + +POD=$(kubectl get pod \ + -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" \ + -o jsonpath='{.items[0].metadata.name}') +echo "Pod: $POD" + +# Gate: abort if pod not found +if [ -z "$POD" ]; then + echo "❌ Pod not found. Cannot proceed with backup." + exit 1 +fi + +# Gate: verify tar is available (required for directory kubectl cp) +if ! kubectl exec "$POD" -- which tar > /dev/null 2>&1; then + echo "❌ tar not found in container. kubectl cp of directories will fail. Aborting." + exit 1 +fi +``` + +#### Step 1 — Backup Helm values + +> **Output:** `$BACKUP_DIR/values.yaml` + +```bash +helm get values "$RELEASE_NAME" -o yaml > "$BACKUP_DIR/values.yaml" +echo "✅ Helm values backed up" +``` + +#### Step 2 — Backup agent config + +> **Input:** `POD` from Step 0 · **Output:** `$BACKUP_DIR/agents/` + +```bash +kubectl cp "$POD:/home/agent/.kiro/agents/" "$BACKUP_DIR/agents/" +echo "✅ Agent config backed up" +``` + +#### Step 3 — Backup steering files + +> **Input:** `POD` from Step 0 · **Output:** `$BACKUP_DIR/steering/` + +```bash +kubectl cp "$POD:/home/agent/.kiro/steering/" "$BACKUP_DIR/steering/" +echo "✅ Steering files backed up" +``` + +#### Step 4 — Backup skills (optional) + +> **Input:** `POD` from Step 0 · **Output:** `$BACKUP_DIR/skills/` (may be skipped) + +```bash +if kubectl exec "$POD" -- test -d /home/agent/.kiro/skills 2>/dev/null; then + kubectl cp "$POD:/home/agent/.kiro/skills/" "$BACKUP_DIR/skills/" + echo "✅ Skills directory backed up" +else + echo "⚠️ skills/ not found in container — skipping (this is normal if no custom 
skills are installed)" +fi +``` + +#### Step 5 — Backup GitHub CLI credentials and kiro-cli auth + +> **Input:** `POD` from Step 0 · **Output:** `$BACKUP_DIR/hosts.yml`, `$BACKUP_DIR/kiro-auth.sqlite3` + +```bash +kubectl cp "$POD:/home/agent/.config/gh/hosts.yml" "$BACKUP_DIR/hosts.yml" +echo "✅ hosts.yml backed up" + +# kiro-cli auth database — required for bot to resume without re-authentication +kubectl cp "$POD:/home/agent/.local/share/kiro-cli/data.sqlite3" "$BACKUP_DIR/kiro-auth.sqlite3" +echo "✅ kiro-cli auth DB backed up" +``` + +#### Step 6 — Backup Kubernetes Secret + +> **Output:** `$BACKUP_DIR/secret.yaml` ⚠️ SENSITIVE + +```bash +kubectl get secret "${DEPLOYMENT}" -o yaml > "$BACKUP_DIR/secret.yaml" +echo "✅ Secret backed up" +echo "🔐 SECURITY: $BACKUP_DIR/secret.yaml contains credentials. Do NOT commit." +echo " Encrypt if storing: gpg --symmetric $BACKUP_DIR/secret.yaml" +``` + +#### Step 7 — Backup Helm release history and PVC data + +> **Input:** `POD` from Step 0 · **Output:** `$BACKUP_DIR/helm-history.txt`, `$BACKUP_DIR/pvc-data/` + +```bash +helm history "$RELEASE_NAME" > "$BACKUP_DIR/helm-history.txt" +echo "✅ Helm history backed up" + +# PVC backup via kubectl cp (default path — /home/agent is the full PVC mount) +# Check size first to avoid timeout +PVC_SIZE=$(kubectl exec "$POD" -- du -sh /home/agent 2>/dev/null | cut -f1) +echo "PVC size: $PVC_SIZE" +# Proceed with kubectl cp (recommended for < 500 MB; use VolumeSnapshot for larger volumes) +kubectl cp "$POD:/home/agent/" "$BACKUP_DIR/pvc-data/" +echo "✅ PVC data backed up" +``` + +> **Advanced option — VolumeSnapshot (for large PVCs or CSI-enabled clusters):** +> ```bash +> # First, resolve the PVC name +> PVC_NAME=$(kubectl get pod "$POD" \ +> -o jsonpath='{.spec.volumes[?(@.persistentVolumeClaim)].persistentVolumeClaim.claimName}') +> echo "PVC name: $PVC_NAME" +> +> # List available VolumeSnapshotClasses +> SNAPSHOT_CLASS=$(kubectl get volumesnapshotclass -o 
jsonpath='{.items[0].metadata.name}') +> echo "Snapshot class: $SNAPSHOT_CLASS" +> +> # Create the snapshot +> kubectl apply -f - <<EOF +> apiVersion: snapshot.storage.k8s.io/v1 +> kind: VolumeSnapshot +> metadata: +> name: openab-pvc-snapshot-$(date +%Y%m%d) +> spec: +> volumeSnapshotClassName: ${SNAPSHOT_CLASS} +> source: +> persistentVolumeClaimName: ${PVC_NAME} +> EOF +> ``` + +#### Verification Gate — must pass before proceeding to upgrade + +> **Agent instruction:** Run this gate after all backup steps. If any check fails, **stop and do not proceed** with the upgrade. A failed backup means that data is unprotected. + +```bash +echo "=== Backup Verification Gate ===" +GATE_PASS=true + +check_file() { + local path="$1" + local label="$2" + if [ -s "$path" ]; then + echo " ✅ $label ($path)" + else + echo " ❌ MISSING or EMPTY: $label ($path)" + GATE_PASS=false + fi +} + +check_dir() { + local path="$1" + local label="$2" + if [ -d "$path" ] && [ -n "$(ls -A "$path" 2>/dev/null)" ]; then + echo " ✅ $label ($path)" + else + echo " ❌ MISSING or EMPTY: $label ($path)" + GATE_PASS=false + fi +} + +check_file "$BACKUP_DIR/values.yaml" "Helm values" +check_dir "$BACKUP_DIR/agents/" "Agent config" +check_dir "$BACKUP_DIR/steering/" "Steering files" +check_file "$BACKUP_DIR/hosts.yml" "GitHub CLI credentials" +check_file "$BACKUP_DIR/kiro-auth.sqlite3" "kiro-cli auth DB" +check_file "$BACKUP_DIR/secret.yaml" "Kubernetes Secret" +check_file "$BACKUP_DIR/helm-history.txt" "Helm history" +check_dir "$BACKUP_DIR/pvc-data/" "PVC data" + +echo "" +if [ "$GATE_PASS" = true ]; then + echo "✅ GATE PASSED — all backup files present and non-empty. Safe to proceed with upgrade." +else + echo "❌ GATE FAILED — one or more backup files are missing or empty." + echo " Do NOT proceed with the upgrade until all checks pass." + exit 1 +fi +``` ### One-Click Backup Script +The script below combines Steps 0–7 and the Verification Gate into a single file.
+ ```bash #!/bin/bash # Note: set -e is intentionally omitted. @@ -178,10 +390,14 @@ kubectl top nodes export KUBECONFIG=~/.kube/config +RELEASE_NAME=$(helm list -o json | jq -r '.[0].name') +DEPLOYMENT="${RELEASE_NAME}-kiro" BACKUP_DIR="openab-backup-$(date +%Y%m%d-%H%M%S)" mkdir -p "$BACKUP_DIR" -POD=$(kubectl get pod -l app.kubernetes.io/instance=openab -o jsonpath='{.items[0].metadata.name}') +POD=$(kubectl get pod \ + -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" \ + -o jsonpath='{.items[0].metadata.name}') if [ -z "$POD" ]; then echo "❌ OpenAB Pod not found. Aborting backup." && exit 1 fi @@ -189,7 +405,7 @@ fi # Pre-check: kubectl cp directory operations require tar inside the container if ! kubectl exec "$POD" -- which tar > /dev/null 2>&1; then echo "❌ tar not found in container. kubectl cp of directories will fail." - echo " Use 'kubectl exec' with a tar pipe instead, or use VolumeSnapshot for PVC backup." + echo " Use VolumeSnapshot for PVC backup instead." 
exit 1 fi @@ -204,14 +420,20 @@ backup_step() { fi } -backup_step "Backup Helm values" bash -c "helm get values openab -o yaml > $BACKUP_DIR/values.yaml" -backup_step "Backup Agent config" kubectl cp "$POD:/home/agent/.kiro/agents/" "$BACKUP_DIR/agents/" -backup_step "Backup Steering files" kubectl cp "$POD:/home/agent/.kiro/steering/" "$BACKUP_DIR/steering/" -kubectl cp "$POD:/home/agent/.kiro/skills/" "$BACKUP_DIR/skills/" 2>/dev/null \ - || echo "⚠️ skills/ directory not found, skipping" -backup_step "Backup hosts.yml" kubectl cp "$POD:/home/agent/.config/gh/hosts.yml" "$BACKUP_DIR/hosts.yml" -backup_step "Backup full Secret" bash -c "kubectl get secret openab-kiro -o yaml > $BACKUP_DIR/secret.yaml" -backup_step "Backup Helm history" bash -c "helm history openab > $BACKUP_DIR/helm-history.txt" +backup_step "Backup Helm values" bash -c "helm get values '$RELEASE_NAME' -o yaml > $BACKUP_DIR/values.yaml" +backup_step "Backup Agent config" kubectl cp "$POD:/home/agent/.kiro/agents/" "$BACKUP_DIR/agents/" +backup_step "Backup Steering files" kubectl cp "$POD:/home/agent/.kiro/steering/" "$BACKUP_DIR/steering/" +backup_step "Backup hosts.yml" kubectl cp "$POD:/home/agent/.config/gh/hosts.yml" "$BACKUP_DIR/hosts.yml" +backup_step "Backup kiro-cli auth DB" kubectl cp "$POD:/home/agent/.local/share/kiro-cli/data.sqlite3" "$BACKUP_DIR/kiro-auth.sqlite3" +backup_step "Backup full Secret" bash -c "kubectl get secret '${DEPLOYMENT}' -o yaml > $BACKUP_DIR/secret.yaml" +backup_step "Backup Helm history" bash -c "helm history '$RELEASE_NAME' > $BACKUP_DIR/helm-history.txt" +backup_step "Backup PVC data" kubectl cp "$POD:/home/agent/" "$BACKUP_DIR/pvc-data/" + +if kubectl exec "$POD" -- test -d /home/agent/.kiro/skills 2>/dev/null; then + backup_step "Backup skills" kubectl cp "$POD:/home/agent/.kiro/skills/" "$BACKUP_DIR/skills/" +else + echo "⚠️ skills/ not found — skipping (normal if no custom skills installed)" +fi echo "" echo "=== Backup Summary: $BACKUP_DIR ===" @@ 
-238,61 +460,69 @@ echo " gpg --symmetric $BACKUP_DIR/secret.yaml" echo " # or: age -p -o $BACKUP_DIR/secret.yaml.age $BACKUP_DIR/secret.yaml" ``` -### Verify the Backup +### Backup Checklist (Human Reference) -```bash -# Check for empty files in the backup directory -find $BACKUP_DIR -type f -empty +| Item | Command | Notes | +|---|---|---| +| Helm values | `helm get values $RELEASE_NAME -o yaml > $BACKUP_DIR/values.yaml` | Current Helm deployment parameters | +| Agent config | `kubectl cp $POD:/home/agent/.kiro/agents/ $BACKUP_DIR/agents/` | Custom agent settings (model, prompt, tools, etc.) | +| Steering files | `kubectl cp $POD:/home/agent/.kiro/steering/ $BACKUP_DIR/steering/` | Steering docs such as IDENTITY.md | +| Skills | `kubectl cp $POD:/home/agent/.kiro/skills/ $BACKUP_DIR/skills/` | Custom agent skills (if any; see Step 4 for conditional check) | +| hosts.yml | `kubectl cp $POD:/home/agent/.config/gh/hosts.yml $BACKUP_DIR/hosts.yml` | GitHub CLI credentials (including multi-account configs) | +| kiro-cli auth | `kubectl cp $POD:/home/agent/.local/share/kiro-cli/data.sqlite3 $BACKUP_DIR/kiro-auth.sqlite3` | Bot auth DB — required to avoid re-authentication after PVC loss | +| Full Secret export | `kubectl get secret ${DEPLOYMENT} -o yaml > $BACKUP_DIR/secret.yaml` | ⚠️ **SENSITIVE** — contains Discord token, STT key, and other credentials. Store securely, **never commit to version control**. | +| PVC data | `kubectl cp $POD:/home/agent/ $BACKUP_DIR/pvc-data/` | Default: kubectl cp. See Step 7 for VolumeSnapshot (advanced). | +| Helm release history | `helm history $RELEASE_NAME > $BACKUP_DIR/helm-history.txt` | Useful reference for rollback | -# Confirm values.yaml is readable -cat $BACKUP_DIR/values.yaml | head -20 -``` +--- -### PVC Backup (⚠️ Manual Step) +## III. Upgrade Execution -> This step must be performed manually based on your PVC type. It cannot be automated. 
+> **Agent note — dependency chain:** +> - Requires `RELEASE_NAME`, `DEPLOYMENT`, `BACKUP_DIR`, `TARGET_VERSION` from Section I. +> - Requires the Verification Gate (Section II) to have passed. + +### Pre-check: Resolve Upgrade Variables ```bash -# 1. List PVCs mounted to the Pod -kubectl get pod $POD -o jsonpath='{range .spec.volumes[*]}{.name}{"\t"}{.persistentVolumeClaim.claimName}{"\n"}{end}' - -# 2. Option A: VolumeSnapshot (recommended — requires CSI driver support) -kubectl apply -f - < - source: - persistentVolumeClaimName: -EOF - -# 3. Option B: kubectl cp (suitable for small data volumes only) -# -# ⚠️ Size check before proceeding — kubectl cp may timeout or OOM on large datasets. -# Recommended limit: < 500 MB. For larger volumes, use Option A (VolumeSnapshot). -kubectl exec $POD -- du -sh -# If the output is within an acceptable range, proceed: -kubectl cp $POD: $BACKUP_DIR/pvc-data/ -``` +export KUBECONFIG=~/.kube/config ---- +# Resolve release name and deployment +RELEASE_NAME=$(helm list -o json | jq -r '.[0].name') +DEPLOYMENT="${RELEASE_NAME}-kiro" -## III. Upgrade Execution +# Resolve backup directory (most recent backup) +BACKUP_DIR=$(ls -td openab-backup-* 2>/dev/null | head -1) +BACKUP_VALUES="${BACKUP_DIR}/values.yaml" +echo "Using backup: $BACKUP_DIR" +echo "Values file: $BACKUP_VALUES" -### Pre-check: Confirm Helm Repo is Configured +# Confirm the values file exists and is non-empty +if [ ! -s "$BACKUP_VALUES" ]; then + echo "❌ values.yaml not found or empty at $BACKUP_VALUES. Run backup first." + exit 1 +fi -```bash -# GitHub Pages (stable releases — recommended for most cases) -helm repo add openab https://openabdev.github.io/openab -helm repo update +# Set target version (e.g. 
0.7.5 — check https://github.com/openabdev/openab/releases) +TARGET_VERSION="0.7.5" + +# List available chart versions via OCI (no helm repo add required) +helm show chart oci://ghcr.io/openabdev/charts/openab --version "$TARGET_VERSION" 2>/dev/null \ + | grep -E "^(name|version|appVersion):" +``` -# List available versions -helm search repo openab --versions +### Pre-check: Confirm Helm OCI Access -# Or query via OCI Registry -helm show chart oci://ghcr.io/openabdev/charts/openab --version +```bash +# Verify OCI registry is reachable and the target version exists +helm show chart oci://ghcr.io/openabdev/charts/openab --version "${TARGET_VERSION}" > /dev/null \ + && echo "✅ Chart version ${TARGET_VERSION} available via OCI" \ + || echo "❌ Chart version ${TARGET_VERSION} not found. Check version string." + +# Also verify the pre-release beta.1 version exists (required for Step 1) +helm show chart oci://ghcr.io/openabdev/charts/openab --version "${TARGET_VERSION}-beta.1" > /dev/null \ + && echo "✅ Pre-release ${TARGET_VERSION}-beta.1 available" \ + || echo "⚠️ beta.1 not found — check GitHub releases for available pre-release tags" ``` ### Step 1: Pre-release Validation (Required) @@ -300,75 +530,107 @@ helm show chart oci://ghcr.io/openabdev/charts/openab --version > ⚠️ Per project convention, **a stable release must be preceded by a validated pre-release**. Do not skip directly to Step 2. > > **When can Step 1 be skipped?** Only if the release notes for the target stable version explicitly contain `pre-release-validated: true`, indicating that the corresponding pre-release has already been validated in another environment (e.g. a staging cluster). In all other cases, run Step 1 first. +> +> **Agent note — pass/fail criteria:** +> - **Pass:** `kubectl wait` exits 0 AND `pgrep -x openab` exits 0 AND log scan returns no `panic` or `fatal` lines. 
+> - **Fail:** Any of the above fails, OR a human operator reports a functional regression in Discord within the monitoring window. On failure, run `helm rollback` (see Section IV) and stop — do not proceed to Step 2. ```bash -BACKUP_VALUES="/values.yaml" # e.g. openab-backup-20260413-070000/values.yaml - -# Dry-run the pre-release version first -helm upgrade openab openab/openab \ - --version -beta.1 \ +# Dry-run first to catch values conflicts +helm upgrade "$RELEASE_NAME" oci://ghcr.io/openabdev/charts/openab \ + --version "${TARGET_VERSION}-beta.1" \ -f "$BACKUP_VALUES" \ --dry-run # Deploy the pre-release -helm upgrade openab openab/openab \ - --version -beta.1 \ +helm upgrade "$RELEASE_NAME" oci://ghcr.io/openabdev/charts/openab \ + --version "${TARGET_VERSION}-beta.1" \ -f "$BACKUP_VALUES" -kubectl rollout status deployment/openab-kiro --timeout=300s +kubectl rollout status "deployment/${DEPLOYMENT}" --timeout=300s + +# Automated smoke test +POD=$(kubectl get pod \ + -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" \ + -o jsonpath='{.items[0].metadata.name}') +kubectl wait --for=condition=Ready "pod/${POD}" --timeout=120s +kubectl exec "$POD" -- pgrep -x openab +PANIC_LINES=$(kubectl logs "deployment/${DEPLOYMENT}" --tail=100 | grep -icE "panic|fatal" || true) +if [ "$PANIC_LINES" -gt 0 ]; then + echo "❌ Panic/fatal lines found in logs. Do NOT proceed. Run rollback." + exit 1 +fi +echo "✅ Automated smoke test passed. Proceed with Discord functional validation." 
-# Run functional tests in the Discord channel -# Verify basic conversation and tool calls work as expected -# If issues are found, wait for beta.2 and repeat this step +# After automated smoke test — manual Discord check required: +# → Send a test message in the Discord channel +# → Confirm the bot responds and basic conversation / tool calls work +# → If bot is unresponsive or broken: run helm rollback (Section IV) and stop ``` ### Step 2: Promote to Stable -```bash -BACKUP_VALUES="/values.yaml" +> **Agent note:** Only run this after Step 1 has passed both automated and Discord validation. +```bash # Dry-run the stable version -helm upgrade openab openab/openab \ - --version \ +helm upgrade "$RELEASE_NAME" oci://ghcr.io/openabdev/charts/openab \ + --version "${TARGET_VERSION}" \ -f "$BACKUP_VALUES" \ --dry-run -# Deploy stable (short downtime is expected) -helm upgrade openab openab/openab \ - --version \ +# Deploy stable (short downtime is expected due to Recreate strategy) +helm upgrade "$RELEASE_NAME" oci://ghcr.io/openabdev/charts/openab \ + --version "${TARGET_VERSION}" \ -f "$BACKUP_VALUES" # Wait for the Pod to be ready -kubectl rollout status deployment/openab-kiro --timeout=300s +kubectl rollout status "deployment/${DEPLOYMENT}" --timeout=300s ``` ### Post-Upgrade Verification +> **Agent note — pass/fail criteria:** +> - **Pass:** All commands below exit 0 AND image tag matches `TARGET_VERSION` AND no panic/fatal in logs. +> - **Fail:** Any command exits non-zero, or image tag does not match. → Proceed to Section IV Rollback. + ```bash -POD=$(kubectl get pod -l app.kubernetes.io/instance=openab -o jsonpath='{.items[0].metadata.name}') +POD=$(kubectl get pod \ + -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" \ + -o jsonpath='{.items[0].metadata.name}') # 1. 
Check Pod status (must be Running and READY) -kubectl get pod -l app.kubernetes.io/instance=openab -kubectl wait --for=condition=Ready pod/$POD --timeout=120s +kubectl get pod -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" +kubectl wait --for=condition=Ready "pod/${POD}" --timeout=120s + +# 2. Verify deployed chart version matches target +DEPLOYED=$(helm list -f "$RELEASE_NAME" -o json | jq -r '.[0].chart') +echo "Deployed chart: $DEPLOYED | Expected: openab-${TARGET_VERSION}" +if [ "$DEPLOYED" != "openab-${TARGET_VERSION}" ]; then + echo "❌ Version mismatch. Investigate before proceeding." +fi -# 2. Verify version (from Helm and image tag — no source code in the container) -helm list -f openab -kubectl get deployment openab-kiro \ +# 3. Verify image tag +kubectl get "deployment/${DEPLOYMENT}" \ -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}' -# 3. Confirm the openab process is running -kubectl exec $POD -- pgrep -x openab +# 4. Confirm the openab process is running +kubectl exec "$POD" -- pgrep -x openab -# 4. Check logs for errors (ERROR / WARN / panic) -kubectl logs deployment/openab-kiro --tail=100 | grep -iE "error|warn|panic|fatal" +# 5. Check logs for errors +PANIC_LINES=$(kubectl logs "deployment/${DEPLOYMENT}" --tail=100 | grep -icE "panic|fatal" || true) +ERROR_LINES=$(kubectl logs "deployment/${DEPLOYMENT}" --tail=100 | grep -icE "error|warn" || true) +echo "Panic/fatal lines: $PANIC_LINES | Error/warn lines: $ERROR_LINES" +if [ "$PANIC_LINES" -gt 0 ]; then + echo "❌ Panic/fatal found. Rollback recommended." +fi -# 5. Verify steering files and agent config are still present -# (PVC content should survive upgrades, but this confirms it) -kubectl exec $POD -- ls /home/agent/.kiro/steering/ -kubectl exec $POD -- cat /home/agent/.kiro/agents/default.json | head -5 -# If either path is missing, restore from backup (see Section IV: Restore Custom Config) +# 6. 
Verify PVC data (steering files and agent config) are still present +kubectl exec "$POD" -- ls /home/agent/.kiro/steering/ +kubectl exec "$POD" -- cat /home/agent/.kiro/agents/default.json | head -5 +# If either path is missing, restore from backup (see Section IV: Restore Custom Config) -# 6. Discord E2E verification (final check) +# 7. Discord E2E verification (final check — requires human operator) # → Send a test message in the Discord channel # → Confirm the bot responds and conversation works correctly ``` @@ -386,6 +648,17 @@ kubectl exec $POD -- cat /home/agent/.kiro/agents/default.json | head -5 ### Decision Tree +> **Agent note — machine-readable branch criteria:** +> +> | Observed condition | Action | +> |---|---| +> | `kubectl get pod` shows `CrashLoopBackOff` or `Pending` | `helm rollback` immediately | +> | Pod is `Running` AND `pgrep -x openab` exits non-zero | `helm rollback` | +> | Pod is `Running`, process OK, but logs contain `panic` or `fatal` | `helm rollback` | +> | Pod is `Running`, process OK, logs clean, but no Discord response after 60s | `kubectl rollout restart` first; if still no response after 60s → `helm rollback` | +> | Pod is `Running`, process OK, logs clean, Discord responds, but config is missing | Restore config from backup → `kubectl rollout restart` | +> | Quick fix is clearly identified (e.g. a known bad config key) | Hotfix — escalate to human engineer | + ``` Issue detected after upgrade │ @@ -394,20 +667,16 @@ Issue detected after upgrade │ ├─ CrashLoopBackOff / Pending ──→ helm rollback │ - ├─ Running, but functionality broken - │ │ - │ ├─ Quick fix possible (e.g. 
config error) ──→ hotfix (engineer) - │ └─ Root cause unclear ────────────────────→ helm rollback + ├─ Running, pgrep exits non-zero OR panic in logs + │ └─ helm rollback │ - ├─ Running, but bot is unresponsive - │ │ - │ └─ kubectl rollout restart deployment/openab-kiro + ├─ Running, logs clean, bot unresponsive + │ └─ kubectl rollout restart deployment/${DEPLOYMENT} │ │ - │ ├─ Recovers after restart ──→ Continue monitoring - │ └─ Still unresponsive ──→ helm rollback + │ ├─ Responds within 60s ──→ Continue monitoring + │ └─ Still unresponsive ──→ helm rollback │ - └─ Running, but custom config is missing - │ + └─ Running, bot OK, config missing └─ Restore config from backup → kubectl rollout restart ``` @@ -421,44 +690,69 @@ Issue detected after upgrade ### Helm Rollback ```bash -# 1. View release history -helm history openab +export KUBECONFIG=~/.kube/config +RELEASE_NAME=$(helm list -o json | jq -r '.[0].name') +DEPLOYMENT="${RELEASE_NAME}-kiro" + +# 1. View release history and identify the previous revision +helm history "$RELEASE_NAME" + +# 2. Get the previous revision number automatically +PREV_REVISION=$(helm history "$RELEASE_NAME" --output json \ + | jq -r 'sort_by(.revision) | .[-2].revision') +echo "Rolling back to revision: $PREV_REVISION" -# 2. Roll back to a previous revision -helm rollback openab +# 3. Roll back +helm rollback "$RELEASE_NAME" "$PREV_REVISION" -# 3. Wait for the Pod to be ready -kubectl rollout status deployment/openab-kiro --timeout=300s +# 4. Wait for the Pod to be ready +kubectl rollout status "deployment/${DEPLOYMENT}" --timeout=300s -# 4. Confirm rollback succeeded -kubectl get pod -l app.kubernetes.io/instance=openab +# 5. Confirm rollback succeeded +kubectl get pod -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" -# 5. 
Run full post-rollback verification (same as post-upgrade verification) -POD=$(kubectl get pod -l app.kubernetes.io/instance=openab -o jsonpath='{.items[0].metadata.name}') -kubectl wait --for=condition=Ready pod/$POD --timeout=120s -kubectl exec $POD -- pgrep -x openab -kubectl logs deployment/openab-kiro --tail=100 | grep -iE "error|warn|panic|fatal" +# 6. Run full post-rollback verification +POD=$(kubectl get pod \ + -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" \ + -o jsonpath='{.items[0].metadata.name}') +kubectl wait --for=condition=Ready "pod/${POD}" --timeout=120s +kubectl exec "$POD" -- pgrep -x openab +kubectl logs "deployment/${DEPLOYMENT}" --tail=100 | grep -iE "error|warn|panic|fatal" # → Send a test message in the Discord channel to confirm the bot responds ``` ### Restore Custom Config ```bash -POD=$(kubectl get pod -l app.kubernetes.io/instance=openab -o jsonpath='{.items[0].metadata.name}') +export KUBECONFIG=~/.kube/config +RELEASE_NAME=$(helm list -o json | jq -r '.[0].name') +DEPLOYMENT="${RELEASE_NAME}-kiro" + +POD=$(kubectl get pod \ + -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" \ + -o jsonpath='{.items[0].metadata.name}') + +# Resolve backup directory +BACKUP_DIR=$(ls -td openab-backup-* 2>/dev/null | head -1) +echo "Restoring from: $BACKUP_DIR" # Restore agent config -kubectl cp ./backup-default.json $POD:/home/agent/.kiro/agents/default.json +kubectl cp "$BACKUP_DIR/agents/default.json" "$POD:/home/agent/.kiro/agents/default.json" # Restore steering files # ⚠️ kubectl cp directory behaviour varies across versions — trailing slash matters. # Use the tar pipe method below to avoid accidentally creating a nested directory # (e.g. steering/steering/) which can happen with some kubectl versions. -kubectl exec $POD -- mkdir -p /home/agent/.kiro/steering -tar c -C ./backup-steering . 
| kubectl exec -i $POD -- tar x -C /home/agent/.kiro/steering +kubectl exec "$POD" -- mkdir -p /home/agent/.kiro/steering +tar c -C "$BACKUP_DIR/steering" . | kubectl exec -i "$POD" -- tar x -C /home/agent/.kiro/steering # Restore GitHub CLI credentials -kubectl cp ./backup-hosts.yml $POD:/home/agent/.config/gh/hosts.yml +kubectl cp "$BACKUP_DIR/hosts.yml" "$POD:/home/agent/.config/gh/hosts.yml" + +# Restore kiro-cli auth database +kubectl exec "$POD" -- mkdir -p /home/agent/.local/share/kiro-cli +kubectl cp "$BACKUP_DIR/kiro-auth.sqlite3" "$POD:/home/agent/.local/share/kiro-cli/data.sqlite3" # Restart Pod to apply changes -kubectl rollout restart deployment/openab-kiro +kubectl rollout restart "deployment/${DEPLOYMENT}" ``` From 506555a25dc0a7b65c55fdad03b8641b23bc03c5 Mon Sep 17 00:00:00 2001 From: JARVIS-Agent Date: Tue, 14 Apr 2026 23:10:35 +0000 Subject: [PATCH 5/8] docs: address AI-first review feedback (v1.4) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Fix all issues flagged in the second round of review: 1. TARGET_VERSION: auto-resolved from OCI registry latest stable version (helm show chart ... | grep ^version) — no more hardcoded placeholder 2. Pre-release beta.1 ambiguous branch: add explicit 3-way branch in Section I env setup — (a) beta.1 found: set PRERELEASE_VERSION, (b) not found but release notes contain pre-release-validated: true: set PRERELEASE_VERSION="" to skip Step 1, (c) neither: exit 1 with clear instructions (wait / check alternate tags / ask human) 3. Discord E2E validation: replace comment-only instructions with read -r HUMAN_INPUT gate accepting CONFIRMED or ROLLBACK; unrecognized input exits 1 for safety 4. Announcements: replace text-only descriptions with curl Discord webhook calls (guarded by DISCORD_WEBHOOK_URL env var check) 5. Session env file (openab-session-env.sh): resolve all variables once in Section I and persist to file; all subsequent sections source it. 
BACKUP_DIR appended after Step 0. Resumption instructions included. 6. BACKUP_DIR validation on resume: add timestamp echo and ls check before upgrade so agent can confirm correct backup is being used 7. Pre-condition check (Section 0): verify kubectl/helm/jq/curl/awk/tar are installed, KUBECONFIG file exists, context is set, cluster is reachable — exit 1 with per-tool guidance on failure 8. PREV_REVISION: resolve from backup helm-history.txt (captured before upgrade) using awk to find last "deployed" revision — avoids the ambiguity of [-2] when pre-release + stable both ran during upgrade 9. Add expected stdout patterns and estimated durations to key steps so agents can validate success beyond exit code --- docs/openab-upgrade-sop.md | 794 +++++++++++++++++++++---------------- 1 file changed, 459 insertions(+), 335 deletions(-) diff --git a/docs/openab-upgrade-sop.md b/docs/openab-upgrade-sop.md index 0dfc3307..0d3421fb 100644 --- a/docs/openab-upgrade-sop.md +++ b/docs/openab-upgrade-sop.md @@ -2,7 +2,7 @@ | | | |---|---| -| **Document Version** | 1.3 | +| **Document Version** | 1.4 | | **Last Updated** | 2026-04-14 | ## Environment Reference @@ -11,7 +11,7 @@ |---|---| | Deployment Method | Kubernetes (Helm Chart) | | Deployment Name | `-kiro` (default: `openab-kiro`) — see note below | -| Pod Label (precise) | `app.kubernetes.io/instance=openab,app.kubernetes.io/name=openab-kiro` | +| Pod Label (precise) | `app.kubernetes.io/instance=,app.kubernetes.io/name=-kiro` | | Helm Repo (OCI, recommended) | `oci://ghcr.io/openabdev/charts/openab` | | Helm Repo (GitHub Pages, fallback) | `https://openabdev.github.io/openab` | | Image Registry | `ghcr.io/openabdev/openab` | @@ -52,156 +52,265 @@ ``` ┌─────────────────────────────────────────────────────────────┐ -│ Pre-Upgrade Preparation │ -│ Check version info → Read Release Notes → Announce outage │ +│ 0. 
Environment Readiness Check │ +│ kubectl / helm / jq / curl / KUBECONFIG / cluster access │ └────────────────────────┬────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────┐ -│ Backup │ -│ Helm values / Agent config / Steering / hosts.yml / PVC │ -│ → Verification gate (all files exist & non-empty) ✅ │ +│ I. Pre-Upgrade Preparation │ +│ Resolve vars → Save session env file → Read release notes │ └────────────────────────┬────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────┐ -│ Upgrade Execution (2 Phases) │ -│ │ -│ Step 1: Pre-release Validation │ -│ helm upgrade --version x.x.x-beta.1 │ -│ └─ Automated smoke test (kubectl wait + pgrep + logs) │ -│ ├─ Pass ──────────────────────────┐ │ -│ └─ Fail → rollback, stop │ │ -│ ▼ │ -│ Step 2: Promote to Stable │ -│ helm upgrade --version x.x.x │ -│ └─ kubectl rollout status │ +│ II. Backup │ +│ Steps 0–7 → Verification Gate (all files non-empty) ✅ │ └────────────────────────┬────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────┐ -│ Post-Upgrade Verification │ -│ Pod Ready? → Version check → Log check → Discord E2E test │ -│ │ │ -│ ├─ All pass → Send completion notice ✅ │ -│ └─ Issues → Proceed to rollback ↓ │ +│ III. 
Upgrade Execution (2 Phases) │ +│ │ +│ Step 1: Pre-release Validation │ +│ Check beta.1 exists → deploy → automated smoke test │ +│ → ⏸ HUMAN CONFIRMATION → proceed or rollback │ +│ │ +│ Step 2: Promote to Stable │ +│ helm upgrade (OCI) → rollout status → verification │ └────────────────────────┬────────────────────────────────────┘ - │ (on failure) + │ ▼ ┌─────────────────────────────────────────────────────────────┐ -│ Rollback │ -│ │ -│ Diagnose symptom │ -│ ├─ Pod won't start → helm rollback │ -│ ├─ Broken / Pod OK → rollback or hotfix │ -│ ├─ Config lost → restore from backup │ -│ └─ Bot unresponsive → kubectl rollout restart → rollback │ -│ │ -│ Rollback done → Re-run verification → Send rollback notice │ +│ IV. Rollback │ +│ PREV_REVISION from backup helm-history.txt │ +│ Machine-readable decision table → rollback → verify │ └─────────────────────────────────────────────────────────────┘ ``` --- +## 0. Environment Readiness Check + +> **Agent instruction:** Run this section before anything else. If any check fails, stop and resolve the issue before proceeding. Do not attempt workarounds. 
+> +> **Expected output on success:** All lines print `✅` and the final line reads `✅ Environment ready.` + +```bash +export KUBECONFIG=~/.kube/config + +echo "=== Environment Readiness Check ===" +READY=true + +check_cmd() { + if command -v "$1" > /dev/null 2>&1; then + echo " ✅ $1 found" + else + echo " ❌ $1 not found — install it before proceeding" + READY=false + fi +} + +check_cmd kubectl +check_cmd helm +check_cmd jq +check_cmd curl +check_cmd awk +check_cmd tar + +echo "" +echo " KUBECONFIG: $KUBECONFIG" +if [ -f "$KUBECONFIG" ]; then + echo " ✅ KUBECONFIG file exists" +else + echo " ❌ KUBECONFIG file not found at $KUBECONFIG" + READY=false +fi + +CURRENT_CONTEXT=$(kubectl config current-context 2>/dev/null) +if [ -n "$CURRENT_CONTEXT" ]; then + echo " ✅ kubectl context: $CURRENT_CONTEXT" +else + echo " ❌ No kubectl context — check KUBECONFIG" + READY=false +fi + +if kubectl cluster-info > /dev/null 2>&1; then + echo " ✅ Cluster reachable" +else + echo " ❌ Cannot reach cluster — check KUBECONFIG and cluster status" + READY=false +fi + +echo "" +if [ "$READY" = true ]; then + echo "✅ Environment ready. Proceed to Section I." +else + echo "❌ Environment not ready. Fix the issues above before proceeding." + exit 1 +fi +``` + +--- + ## I. Pre-Upgrade Preparation -### 1. Resolve Environment Variables +### 1. Resolve All Session Variables -> **Agent note:** Run this block first. All subsequent steps depend on these variables. +> **Agent instruction:** Run this entire block as one unit. All subsequent sections depend on `openab-session-env.sh`. If the session file already exists from a previous run (e.g. backup was done earlier and upgrade is now resuming), source it instead of re-running this block. > -> **Step 1 output:** `RELEASE_NAME`, `DEPLOYMENT`, `POD`, `CURRENT_VERSION`, `TARGET_VERSION` → used in all subsequent steps. +> **Output:** `openab-session-env.sh` → sourced by all subsequent sections. 
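
The resume rule above (source the existing session file instead of re-resolving everything) can be sketched as a small guard. This is a minimal sketch and not part of the SOP itself: `load_session` is a hypothetical helper, and the list of required variable names is an assumption based on how later sections use the session file.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the SOP): reuse an existing session file
# only if it defines every variable the later sections rely on.

# Assumed variable names, inferred from how later sections use the session file.
# PRERELEASE_VERSION is deliberately absent: an empty value is valid for it.
REQUIRED_SESSION_VARS="RELEASE_NAME DEPLOYMENT CURRENT_VERSION TARGET_VERSION"

load_session() {
  local file="${1:-./openab-session-env.sh}"
  local name
  # Missing or empty file: the caller should resolve variables from scratch.
  [ -s "$file" ] || return 1
  # shellcheck disable=SC1090
  . "$file" || return 1
  for name in $REQUIRED_SESSION_VARS; do
    # ${!name} is bash indirect expansion: the value of the variable named by $name.
    if [ -z "${!name:-}" ]; then
      echo "❌ $file does not set $name; re-run the resolution block" >&2
      return 1
    fi
  done
  echo "✅ Session env loaded from $file"
}
```

A caller would run `load_session || { ...resolution block...; }`, so a stale or truncated session file falls back to a fresh resolution rather than being trusted silently.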
```bash export KUBECONFIG=~/.kube/config -# Resolve release name and deployment name +# If resuming a previous session, source the saved env and skip this block: +# source openab-session-env.sh && echo "✅ Session env loaded" && exit 0 + +# --- Resolve release and deployment names --- RELEASE_NAME=$(helm list -o json | jq -r '.[0].name') +# jq -r prints the literal string "null" when the release list is empty, so check for it too +if [ -z "$RELEASE_NAME" ] || [ "$RELEASE_NAME" = "null" ]; then + echo "❌ No Helm release found. Is OpenAB installed?" + exit 1 +fi DEPLOYMENT="${RELEASE_NAME}-kiro" echo "Release: $RELEASE_NAME | Deployment: $DEPLOYMENT" +# Expected output contains: "Release: openab | Deployment: openab-kiro" -# Get current running pod (precise label selector — avoids matching other agents) -POD=$(kubectl get pod \ - -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" \ - -o jsonpath='{.items[0].metadata.name}') -echo "Pod: $POD" -if [ -z "$POD" ]; then echo "❌ Pod not found. Check label selectors."; fi - -# Get current deployed chart version +# --- Resolve current version --- CURRENT_VERSION=$(helm list -f "$RELEASE_NAME" -o json | jq -r '.[0].chart' | sed 's/openab-//') echo "Current chart version: $CURRENT_VERSION" -# List available versions via OCI (no repo add needed) -helm show chart oci://ghcr.io/openabdev/charts/openab 2>/dev/null | grep ^version +# --- Resolve target version (latest stable from OCI, no pre-release tags) --- +TARGET_VERSION=$(helm show chart oci://ghcr.io/openabdev/charts/openab 2>/dev/null \ + | grep '^version:' | awk '{print $2}') +if [ -z "$TARGET_VERSION" ]; then + echo "❌ Could not resolve target version from OCI registry." + echo " Check network connectivity and registry access."
+ exit 1 +fi +echo "Target version (latest stable from OCI): $TARGET_VERSION" +# Expected output: "Target version (latest stable from OCI): 0.7.5" -# List all published versions (requires GitHub Pages repo to be added) -# helm repo add openab https://openabdev.github.io/openab && helm repo update -# helm search repo openab --versions +# If you need to upgrade to a specific version instead of latest, override here: +# TARGET_VERSION="0.7.4" -# Check the actual image the Pod is running -kubectl get deployment "$DEPLOYMENT" \ - -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}' -``` +# --- Check if upgrade is needed --- +if [ "$CURRENT_VERSION" = "$TARGET_VERSION" ]; then + echo "ℹ️ Already on the latest version ($TARGET_VERSION). No upgrade needed." + echo " If you still want to proceed (e.g. force re-deploy), continue manually." + exit 0 +fi -After running the above, set the target version: +# --- Check pre-release availability (determines Step 1 path) --- +if helm show chart oci://ghcr.io/openabdev/charts/openab \ + --version "${TARGET_VERSION}-beta.1" > /dev/null 2>&1; then + PRERELEASE_VERSION="${TARGET_VERSION}-beta.1" + echo "✅ Pre-release found: $PRERELEASE_VERSION" +else + # Check if release notes explicitly mark this as pre-validated + RELEASE_NOTES=$(gh api "repos/openabdev/openab/releases/tags/v${TARGET_VERSION}" \ + --jq '.body' 2>/dev/null || true) + if echo "$RELEASE_NOTES" | grep -q 'pre-release-validated: true'; then + PRERELEASE_VERSION="" + echo "✅ Release notes contain pre-release-validated: true — Step 1 will be skipped" + else + echo "❌ STOP: ${TARGET_VERSION}-beta.1 not found in OCI registry." + echo " Release notes do not contain 'pre-release-validated: true'." + echo " Options:" + echo " 1. Wait for the project to publish ${TARGET_VERSION}-beta.1" + echo " 2. Check GitHub releases for an alternative pre-release tag:" + echo " gh release list --repo openabdev/openab" + echo " 3. If a different pre-release tag is available (e.g. 
beta.2), set:" + echo " PRERELEASE_VERSION=\"${TARGET_VERSION}-beta.2\"" + echo " Do NOT proceed until a pre-release is available or the release notes" + echo " explicitly contain 'pre-release-validated: true'." + exit 1 + fi +fi -```bash -# Set this to the version you are upgrading to (e.g. 0.7.5) -TARGET_VERSION="0.7.5" -echo "Upgrading to: $TARGET_VERSION" +# --- Save session environment file --- +cat > openab-session-env.sh < Skipping this step risks the new Pod getting stuck in `Pending` if the node lacks capacity. +# Print release notes to terminal for review +gh release view "v${TARGET_VERSION}" --repo openabdev/openab 2>/dev/null \ + || echo "⚠️ Could not fetch release notes via gh CLI — check the URL manually" +``` + +Pay special attention to: +- Breaking changes +- Helm Chart values changes +- Added or deprecated environment variables +- Any migration steps + +### 3. Check Node Resources ```bash -# Check allocatable resources on all nodes -kubectl describe nodes | grep -A 5 "Allocatable:" +source openab-session-env.sh -# Check current resource requests across the cluster +kubectl describe nodes | grep -A 5 "Allocatable:" kubectl top nodes ``` +> Skipping this step risks the new Pod getting stuck in `Pending` if the node lacks capacity. + ### 4. Announce the Upgrade > ⚠️ **Downtime is expected during every upgrade.** The deployment strategy is `Recreate` because the PVC is ReadWriteOnce, which does not support RollingUpdate. The old Pod is terminated before the new one starts — the Discord bot will be unavailable during this window, and this is expected behaviour. 
-- Notify all users via Discord channel / Slack / email: - - Scheduled upgrade time and estimated downtime (typically 1–3 minutes) - - Scope of impact (Discord bot will be offline during the upgrade) - - Emergency contact +```bash +source openab-session-env.sh + +# Option A: Discord webhook notification (set DISCORD_WEBHOOK_URL in environment) +if [ -n "${DISCORD_WEBHOOK_URL:-}" ]; then + curl -s -X POST "$DISCORD_WEBHOOK_URL" \ + -H "Content-Type: application/json" \ + -d "{\"content\": \"🔧 **Upgrade starting:** OpenAB is being upgraded from v${CURRENT_VERSION} to v${TARGET_VERSION}. The bot will be offline for approximately 1–3 minutes.\"}" + echo "✅ Discord notification sent" +else + echo "ℹ️ DISCORD_WEBHOOK_URL not set — skipping automated notification" + echo " Notify users manually: OpenAB upgrading v${CURRENT_VERSION} → v${TARGET_VERSION}, ~1–3 min downtime" +fi +``` --- ## II. Backup -> **Agent note — dependency chain:** -> - Step 0 must run first (resolves `BACKUP_DIR` and `POD`) -> - Steps 1–7 depend on `POD` from Step 0 -> - The Verification Gate must pass before proceeding to Section III +> **Agent instruction — dependency chain:** +> - `openab-session-env.sh` must exist (created in Section I) +> - Steps 0–7 must run in order +> - The Verification Gate must print `✅ GATE PASSED` before proceeding to Section III +> - `BACKUP_DIR` is appended to `openab-session-env.sh` after Step 0 ### Agent-Executable Backup (Linear Sequence) -This section is written as a machine-executable runbook with no ambiguous branches. Run steps in order. - #### Step 0 — Resolve variables and create backup directory -> **Output:** `BACKUP_DIR`, `POD` → used in Steps 1–7 and the Verification Gate. +> **Output:** `BACKUP_DIR` appended to `openab-session-env.sh` → used in Steps 1–7 and the Verification Gate. 
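
Because Step 0 appends `BACKUP_DIR` to the session file with `>>`, running the step twice leaves two `export BACKUP_DIR=` lines (the last one wins when the file is sourced). One way to keep the file unambiguous is to replace the entry instead of appending. The following is a minimal sketch under that assumption: `set_session_var` is a hypothetical helper, not something the SOP defines.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the SOP): write "export NAME=VALUE" into the
# session file, replacing any earlier export of the same variable.
set_session_var() {
  local name="$1" value="$2" file="${3:-openab-session-env.sh}"
  local tmp
  tmp="$(mktemp)" || return 1
  if [ -f "$file" ]; then
    # Keep every line except a previous export of this variable.
    # grep exits 1 when nothing is left to print, which is not an error here.
    grep -v "^export ${name}=" "$file" > "$tmp" || true
  fi
  # %q (bash printf) quotes the value so it survives being sourced again.
  printf 'export %s=%q\n' "$name" "$value" >> "$tmp"
  mv "$tmp" "$file"
}
```

With this in place, Step 0 could call `set_session_var BACKUP_DIR "$BACKUP_DIR"` and a resumed run would see exactly one backup directory in the file.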
```bash -export KUBECONFIG=~/.kube/config +source openab-session-env.sh -RELEASE_NAME=$(helm list -o json | jq -r '.[0].name') -DEPLOYMENT="${RELEASE_NAME}-kiro" BACKUP_DIR="openab-backup-$(date +%Y%m%d-%H%M%S)" mkdir -p "$BACKUP_DIR" echo "Backup directory: $BACKUP_DIR" @@ -211,70 +320,81 @@ POD=$(kubectl get pod \ -o jsonpath='{.items[0].metadata.name}') echo "Pod: $POD" -# Gate: abort if pod not found if [ -z "$POD" ]; then echo "❌ Pod not found. Cannot proceed with backup." exit 1 fi -# Gate: verify tar is available (required for directory kubectl cp) if ! kubectl exec "$POD" -- which tar > /dev/null 2>&1; then echo "❌ tar not found in container. kubectl cp of directories will fail. Aborting." exit 1 fi + +# Append BACKUP_DIR to session env file +echo "export BACKUP_DIR=\"${BACKUP_DIR}\"" >> openab-session-env.sh +echo "✅ BACKUP_DIR saved to openab-session-env.sh" ``` #### Step 1 — Backup Helm values > **Output:** `$BACKUP_DIR/values.yaml` +> **Expected:** file size > 0 bytes ```bash +source openab-session-env.sh helm get values "$RELEASE_NAME" -o yaml > "$BACKUP_DIR/values.yaml" -echo "✅ Helm values backed up" +echo "✅ Helm values backed up ($(wc -c < "$BACKUP_DIR/values.yaml") bytes)" ``` #### Step 2 — Backup agent config -> **Input:** `POD` from Step 0 · **Output:** `$BACKUP_DIR/agents/` +> **Input:** `POD` (re-resolved) · **Output:** `$BACKUP_DIR/agents/` ```bash +source openab-session-env.sh +POD=$(kubectl get pod -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" -o jsonpath='{.items[0].metadata.name}') kubectl cp "$POD:/home/agent/.kiro/agents/" "$BACKUP_DIR/agents/" echo "✅ Agent config backed up" ``` #### Step 3 — Backup steering files -> **Input:** `POD` from Step 0 · **Output:** `$BACKUP_DIR/steering/` +> **Input:** `POD` (re-resolved) · **Output:** `$BACKUP_DIR/steering/` ```bash +source openab-session-env.sh +POD=$(kubectl get pod -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" 
-o jsonpath='{.items[0].metadata.name}') kubectl cp "$POD:/home/agent/.kiro/steering/" "$BACKUP_DIR/steering/" echo "✅ Steering files backed up" ``` #### Step 4 — Backup skills (optional) -> **Input:** `POD` from Step 0 · **Output:** `$BACKUP_DIR/skills/` (may be skipped) +> **Input:** `POD` (re-resolved) · **Output:** `$BACKUP_DIR/skills/` (may be absent) ```bash +source openab-session-env.sh +POD=$(kubectl get pod -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" -o jsonpath='{.items[0].metadata.name}') if kubectl exec "$POD" -- test -d /home/agent/.kiro/skills 2>/dev/null; then kubectl cp "$POD:/home/agent/.kiro/skills/" "$BACKUP_DIR/skills/" echo "✅ Skills directory backed up" else - echo "⚠️ skills/ not found in container — skipping (this is normal if no custom skills are installed)" + echo "⚠️ skills/ not found in container — skipping (normal if no custom skills are installed)" fi ``` #### Step 5 — Backup GitHub CLI credentials and kiro-cli auth -> **Input:** `POD` from Step 0 · **Output:** `$BACKUP_DIR/hosts.yml`, `$BACKUP_DIR/kiro-auth.sqlite3` +> **Input:** `POD` (re-resolved) · **Output:** `$BACKUP_DIR/hosts.yml`, `$BACKUP_DIR/kiro-auth.sqlite3` ```bash +source openab-session-env.sh +POD=$(kubectl get pod -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" -o jsonpath='{.items[0].metadata.name}') kubectl cp "$POD:/home/agent/.config/gh/hosts.yml" "$BACKUP_DIR/hosts.yml" -echo "✅ hosts.yml backed up" +echo "✅ hosts.yml backed up ($(wc -c < "$BACKUP_DIR/hosts.yml") bytes)" -# kiro-cli auth database — required for bot to resume without re-authentication kubectl cp "$POD:/home/agent/.local/share/kiro-cli/data.sqlite3" "$BACKUP_DIR/kiro-auth.sqlite3" -echo "✅ kiro-cli auth DB backed up" +echo "✅ kiro-cli auth DB backed up ($(wc -c < "$BACKUP_DIR/kiro-auth.sqlite3") bytes)" ``` #### Step 6 — Backup Kubernetes Secret @@ -282,41 +402,38 @@ echo "✅ kiro-cli auth DB backed up" > **Output:** 
`$BACKUP_DIR/secret.yaml` ⚠️ SENSITIVE ```bash +source openab-session-env.sh kubectl get secret "${DEPLOYMENT}" -o yaml > "$BACKUP_DIR/secret.yaml" -echo "✅ Secret backed up" -echo "🔐 SECURITY: $BACKUP_DIR/secret.yaml contains credentials. Do NOT commit." -echo " Encrypt if storing: gpg --symmetric $BACKUP_DIR/secret.yaml" +echo "✅ Secret backed up ($(wc -c < "$BACKUP_DIR/secret.yaml") bytes)" +echo "🔐 SECURITY: secret.yaml contains credentials — do NOT commit. Encrypt before storing:" +echo " gpg --symmetric $BACKUP_DIR/secret.yaml" ``` #### Step 7 — Backup Helm release history and PVC data -> **Input:** `POD` from Step 0 · **Output:** `$BACKUP_DIR/helm-history.txt`, `$BACKUP_DIR/pvc-data/` +> **Input:** `POD` (re-resolved) · **Output:** `$BACKUP_DIR/helm-history.txt`, `$BACKUP_DIR/pvc-data/` ```bash +source openab-session-env.sh +POD=$(kubectl get pod -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" -o jsonpath='{.items[0].metadata.name}') + helm history "$RELEASE_NAME" > "$BACKUP_DIR/helm-history.txt" echo "✅ Helm history backed up" +# This file is the source of truth for PREV_REVISION used in rollback -# PVC backup via kubectl cp (default path — /home/agent is the full PVC mount) -# Check size first to avoid timeout PVC_SIZE=$(kubectl exec "$POD" -- du -sh /home/agent 2>/dev/null | cut -f1) echo "PVC size: $PVC_SIZE" -# Proceed with kubectl cp (recommended for < 500 MB; use VolumeSnapshot for larger volumes) kubectl cp "$POD:/home/agent/" "$BACKUP_DIR/pvc-data/" echo "✅ PVC data backed up" ``` > **Advanced option — VolumeSnapshot (for large PVCs or CSI-enabled clusters):** > ```bash -> # First, resolve the PVC name +> source openab-session-env.sh > PVC_NAME=$(kubectl get pod "$POD" \ > -o jsonpath='{.spec.volumes[?(@.persistentVolumeClaim)].persistentVolumeClaim.claimName}') -> echo "PVC name: $PVC_NAME" -> -> # List available VolumeSnapshotClasses > SNAPSHOT_CLASS=$(kubectl get volumesnapshotclass -o 
jsonpath='{.items[0].metadata.name}') -> echo "Snapshot class: $SNAPSHOT_CLASS" -> -> # Create the snapshot +> echo "PVC: $PVC_NAME | SnapshotClass: $SNAPSHOT_CLASS" > kubectl apply -f - < apiVersion: snapshot.storage.k8s.io/v1 > kind: VolumeSnapshot @@ -331,17 +448,17 @@ echo "✅ PVC data backed up" #### Verification Gate — must pass before proceeding to upgrade -> **Agent instruction:** Run this gate after all backup steps. If any check fails, **stop and do not proceed** with the upgrade. A failed backup means that data is unprotected. +> **Agent instruction:** Run this gate after all backup steps. If output does not contain `✅ GATE PASSED`, **stop immediately** and do not proceed with the upgrade. ```bash +source openab-session-env.sh echo "=== Backup Verification Gate ===" GATE_PASS=true check_file() { - local path="$1" - local label="$2" + local path="$1"; local label="$2" if [ -s "$path" ]; then - echo " ✅ $label ($path)" + echo " ✅ $label ($(wc -c < "$path") bytes)" else echo " ❌ MISSING or EMPTY: $label ($path)" GATE_PASS=false @@ -349,24 +466,23 @@ check_file() { } check_dir() { - local path="$1" - local label="$2" + local path="$1"; local label="$2" if [ -d "$path" ] && [ -n "$(ls -A "$path" 2>/dev/null)" ]; then - echo " ✅ $label ($path)" + echo " ✅ $label ($(ls "$path" | wc -l) files)" else echo " ❌ MISSING or EMPTY: $label ($path)" GATE_PASS=false fi } -check_file "$BACKUP_DIR/values.yaml" "Helm values" -check_dir "$BACKUP_DIR/agents/" "Agent config" -check_dir "$BACKUP_DIR/steering/" "Steering files" -check_file "$BACKUP_DIR/hosts.yml" "GitHub CLI credentials" -check_file "$BACKUP_DIR/kiro-auth.sqlite3" "kiro-cli auth DB" -check_file "$BACKUP_DIR/secret.yaml" "Kubernetes Secret" -check_file "$BACKUP_DIR/helm-history.txt" "Helm history" -check_dir "$BACKUP_DIR/pvc-data/" "PVC data" +check_file "$BACKUP_DIR/values.yaml" "Helm values" +check_dir "$BACKUP_DIR/agents/" "Agent config" +check_dir "$BACKUP_DIR/steering/" "Steering files" +check_file 
"$BACKUP_DIR/hosts.yml" "GitHub CLI credentials" +check_file "$BACKUP_DIR/kiro-auth.sqlite3" "kiro-cli auth DB" +check_file "$BACKUP_DIR/secret.yaml" "Kubernetes Secret" +check_file "$BACKUP_DIR/helm-history.txt" "Helm history" +check_dir "$BACKUP_DIR/pvc-data/" "PVC data" echo "" if [ "$GATE_PASS" = true ]; then @@ -380,59 +496,49 @@ fi ### One-Click Backup Script -The script below combines Steps 0–7 and the Verification Gate into a single file. +The script below combines Steps 0–7 and the Verification Gate. ```bash #!/bin/bash -# Note: set -e is intentionally omitted. -# Failures are recorded per step and reported at the end, -# so that a single failure does not prevent remaining items from being backed up. - export KUBECONFIG=~/.kube/config +source openab-session-env.sh -RELEASE_NAME=$(helm list -o json | jq -r '.[0].name') -DEPLOYMENT="${RELEASE_NAME}-kiro" BACKUP_DIR="openab-backup-$(date +%Y%m%d-%H%M%S)" mkdir -p "$BACKUP_DIR" POD=$(kubectl get pod \ -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" \ -o jsonpath='{.items[0].metadata.name}') -if [ -z "$POD" ]; then - echo "❌ OpenAB Pod not found. Aborting backup." && exit 1 -fi - -# Pre-check: kubectl cp directory operations require tar inside the container +if [ -z "$POD" ]; then echo "❌ Pod not found. Aborting." && exit 1; fi if ! kubectl exec "$POD" -- which tar > /dev/null 2>&1; then - echo "❌ tar not found in container. kubectl cp of directories will fail." - echo " Use VolumeSnapshot for PVC backup instead." - exit 1 + echo "❌ tar not found in container. Aborting." && exit 1 fi -FAILED_STEPS=() +echo "export BACKUP_DIR=\"${BACKUP_DIR}\"" >> openab-session-env.sh +FAILED_STEPS=() backup_step() { local desc="$1"; shift echo "=== $desc ===" if ! 
"$@"; then - echo "⚠️ Failed: $desc (continuing with remaining steps...)" + echo "⚠️ Failed: $desc" FAILED_STEPS+=("$desc") fi } -backup_step "Backup Helm values" bash -c "helm get values '$RELEASE_NAME' -o yaml > $BACKUP_DIR/values.yaml" -backup_step "Backup Agent config" kubectl cp "$POD:/home/agent/.kiro/agents/" "$BACKUP_DIR/agents/" -backup_step "Backup Steering files" kubectl cp "$POD:/home/agent/.kiro/steering/" "$BACKUP_DIR/steering/" -backup_step "Backup hosts.yml" kubectl cp "$POD:/home/agent/.config/gh/hosts.yml" "$BACKUP_DIR/hosts.yml" -backup_step "Backup kiro-cli auth DB" kubectl cp "$POD:/home/agent/.local/share/kiro-cli/data.sqlite3" "$BACKUP_DIR/kiro-auth.sqlite3" -backup_step "Backup full Secret" bash -c "kubectl get secret '${DEPLOYMENT}' -o yaml > $BACKUP_DIR/secret.yaml" -backup_step "Backup Helm history" bash -c "helm history '$RELEASE_NAME' > $BACKUP_DIR/helm-history.txt" -backup_step "Backup PVC data" kubectl cp "$POD:/home/agent/" "$BACKUP_DIR/pvc-data/" +backup_step "Helm values" bash -c "helm get values '$RELEASE_NAME' -o yaml > $BACKUP_DIR/values.yaml" +backup_step "Agent config" kubectl cp "$POD:/home/agent/.kiro/agents/" "$BACKUP_DIR/agents/" +backup_step "Steering files" kubectl cp "$POD:/home/agent/.kiro/steering/" "$BACKUP_DIR/steering/" +backup_step "hosts.yml" kubectl cp "$POD:/home/agent/.config/gh/hosts.yml" "$BACKUP_DIR/hosts.yml" +backup_step "kiro-cli auth DB" kubectl cp "$POD:/home/agent/.local/share/kiro-cli/data.sqlite3" "$BACKUP_DIR/kiro-auth.sqlite3" +backup_step "Kubernetes Secret" bash -c "kubectl get secret '${DEPLOYMENT}' -o yaml > $BACKUP_DIR/secret.yaml" +backup_step "Helm history" bash -c "helm history '$RELEASE_NAME' > $BACKUP_DIR/helm-history.txt" +backup_step "PVC data" kubectl cp "$POD:/home/agent/" "$BACKUP_DIR/pvc-data/" if kubectl exec "$POD" -- test -d /home/agent/.kiro/skills 2>/dev/null; then - backup_step "Backup skills" kubectl cp "$POD:/home/agent/.kiro/skills/" "$BACKUP_DIR/skills/" + backup_step 
"Skills" kubectl cp "$POD:/home/agent/.kiro/skills/" "$BACKUP_DIR/skills/" else - echo "⚠️ skills/ not found — skipping (normal if no custom skills installed)" + echo "⚠️ skills/ not found — skipping" fi echo "" @@ -440,319 +546,337 @@ echo "=== Backup Summary: $BACKUP_DIR ===" ls -la "$BACKUP_DIR/" if [ ${#FAILED_STEPS[@]} -gt 0 ]; then - echo "" - echo "⚠️ The following backup steps FAILED:" - for step in "${FAILED_STEPS[@]}"; do - echo " - $step" - done - echo "" - echo " Review the failures above before proceeding with the upgrade." - echo " A failed backup step means the corresponding data is NOT protected." + echo "⚠️ Failed steps: ${FAILED_STEPS[*]}" + echo " Review failures before proceeding with the upgrade." else - echo "✅ All backup steps completed successfully." + echo "✅ All backup steps completed." fi echo "" -echo "🔐 SECURITY REMINDER: $BACKUP_DIR/secret.yaml contains sensitive credentials" -echo " (Discord token, STT key, etc.). Do NOT commit this file." -echo " Consider encrypting it before storing:" -echo " gpg --symmetric $BACKUP_DIR/secret.yaml" -echo " # or: age -p -o $BACKUP_DIR/secret.yaml.age $BACKUP_DIR/secret.yaml" +echo "🔐 SECURITY: $BACKUP_DIR/secret.yaml contains credentials. Do NOT commit." +echo " gpg --symmetric $BACKUP_DIR/secret.yaml" ``` -### Backup Checklist (Human Reference) - -| Item | Command | Notes | -|---|---|---| -| Helm values | `helm get values $RELEASE_NAME -o yaml > $BACKUP_DIR/values.yaml` | Current Helm deployment parameters | -| Agent config | `kubectl cp $POD:/home/agent/.kiro/agents/ $BACKUP_DIR/agents/` | Custom agent settings (model, prompt, tools, etc.) 
| -| Steering files | `kubectl cp $POD:/home/agent/.kiro/steering/ $BACKUP_DIR/steering/` | Steering docs such as IDENTITY.md | -| Skills | `kubectl cp $POD:/home/agent/.kiro/skills/ $BACKUP_DIR/skills/` | Custom agent skills (if any; see Step 4 for conditional check) | -| hosts.yml | `kubectl cp $POD:/home/agent/.config/gh/hosts.yml $BACKUP_DIR/hosts.yml` | GitHub CLI credentials (including multi-account configs) | -| kiro-cli auth | `kubectl cp $POD:/home/agent/.local/share/kiro-cli/data.sqlite3 $BACKUP_DIR/kiro-auth.sqlite3` | Bot auth DB — required to avoid re-authentication after PVC loss | -| Full Secret export | `kubectl get secret ${DEPLOYMENT} -o yaml > $BACKUP_DIR/secret.yaml` | ⚠️ **SENSITIVE** — contains Discord token, STT key, and other credentials. Store securely, **never commit to version control**. | -| PVC data | `kubectl cp $POD:/home/agent/ $BACKUP_DIR/pvc-data/` | Default: kubectl cp. See Step 7 for VolumeSnapshot (advanced). | -| Helm release history | `helm history $RELEASE_NAME > $BACKUP_DIR/helm-history.txt` | Useful reference for rollback | - --- ## III. Upgrade Execution -> **Agent note — dependency chain:** -> - Requires `RELEASE_NAME`, `DEPLOYMENT`, `BACKUP_DIR`, `TARGET_VERSION` from Section I. -> - Requires the Verification Gate (Section II) to have passed. - -### Pre-check: Resolve Upgrade Variables +> **Agent instruction — session continuity:** +> - Source `openab-session-env.sh` at the start of each step +> - If resuming after a gap (e.g. 
backup was done earlier), verify `BACKUP_DIR` matches the intended backup: +> ```bash +> source openab-session-env.sh +> echo "BACKUP_DIR: $BACKUP_DIR" +> echo "Backup time: $(echo "$BACKUP_DIR" | grep -oE '[0-9]{8}-[0-9]{6}')" +> ls "$BACKUP_DIR/" +> # Confirm this is the correct backup before proceeding +> ``` -```bash -export KUBECONFIG=~/.kube/config +### Step 1: Pre-release Validation -# Resolve release name and deployment -RELEASE_NAME=$(helm list -o json | jq -r '.[0].name') -DEPLOYMENT="${RELEASE_NAME}-kiro" - -# Resolve backup directory (most recent backup) -BACKUP_DIR=$(ls -td openab-backup-* 2>/dev/null | head -1) -BACKUP_VALUES="${BACKUP_DIR}/values.yaml" -echo "Using backup: $BACKUP_DIR" -echo "Values file: $BACKUP_VALUES" - -# Confirm the values file exists and is non-empty -if [ ! -s "$BACKUP_VALUES" ]; then - echo "❌ values.yaml not found or empty at $BACKUP_VALUES. Run backup first." - exit 1 -fi - -# Set target version (e.g. 0.7.5 — check https://github.com/openabdev/openab/releases) -TARGET_VERSION="0.7.5" - -# List available chart versions via OCI (no helm repo add required) -helm show chart oci://ghcr.io/openabdev/charts/openab --version "$TARGET_VERSION" 2>/dev/null \ - | grep -E "^(name|version|appVersion):" -``` - -### Pre-check: Confirm Helm OCI Access +> ⚠️ Per project convention, **a stable release must be preceded by a validated pre-release**. Do not skip directly to Step 2. +> +> **Agent note — branch resolution:** +> - If `PRERELEASE_VERSION` is empty (set during Section I because `pre-release-validated: true` was found in release notes): **skip this entire step**, proceed directly to Step 2. +> - If `PRERELEASE_VERSION` is non-empty: run the full step below. +> - If this step fails (automated smoke test fails): run rollback (Section IV) and **stop** — do not proceed to Step 2. 
```bash -# Verify OCI registry is reachable and the target version exists -helm show chart oci://ghcr.io/openabdev/charts/openab --version "${TARGET_VERSION}" > /dev/null \ - && echo "✅ Chart version ${TARGET_VERSION} available via OCI" \ - || echo "❌ Chart version ${TARGET_VERSION} not found. Check version string." - -# Also verify the pre-release beta.1 version exists (required for Step 1) -helm show chart oci://ghcr.io/openabdev/charts/openab --version "${TARGET_VERSION}-beta.1" > /dev/null \ - && echo "✅ Pre-release ${TARGET_VERSION}-beta.1 available" \ - || echo "⚠️ beta.1 not found — check GitHub releases for available pre-release tags" -``` +source openab-session-env.sh -### Step 1: Pre-release Validation (Required) +if [ -z "$PRERELEASE_VERSION" ]; then + echo "ℹ️ PRERELEASE_VERSION is empty — pre-release step was skipped during env setup." + echo " Proceeding directly to Step 2." + exit 0 +fi -> ⚠️ Per project convention, **a stable release must be preceded by a validated pre-release**. Do not skip directly to Step 2. -> -> **When can Step 1 be skipped?** Only if the release notes for the target stable version explicitly contain `pre-release-validated: true`, indicating that the corresponding pre-release has already been validated in another environment (e.g. a staging cluster). In all other cases, run Step 1 first. -> -> **Agent note — pass/fail criteria:** -> - **Pass:** `kubectl wait` exits 0 AND `pgrep -x openab` exits 0 AND log scan returns no `panic` or `fatal` lines. -> - **Fail:** Any of the above fails, OR a human operator reports a functional regression in Discord within the monitoring window. On failure, run `helm rollback` (see Section IV) and stop — do not proceed to Step 2. 
+echo "Deploying pre-release: $PRERELEASE_VERSION" -```bash -# Dry-run first to catch values conflicts +# Dry-run first helm upgrade "$RELEASE_NAME" oci://ghcr.io/openabdev/charts/openab \ - --version "${TARGET_VERSION}-beta.1" \ - -f "$BACKUP_VALUES" \ + --version "$PRERELEASE_VERSION" \ + -f "$BACKUP_DIR/values.yaml" \ --dry-run +# Expected output contains: "Release \"openab\" has been upgraded. Happy Helming!" -# Deploy the pre-release +# Deploy pre-release helm upgrade "$RELEASE_NAME" oci://ghcr.io/openabdev/charts/openab \ - --version "${TARGET_VERSION}-beta.1" \ - -f "$BACKUP_VALUES" + --version "$PRERELEASE_VERSION" \ + -f "$BACKUP_DIR/values.yaml" kubectl rollout status "deployment/${DEPLOYMENT}" --timeout=300s +# Expected output: "deployment/ successfully rolled out" -# Automated smoke test -POD=$(kubectl get pod \ - -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" \ - -o jsonpath='{.items[0].metadata.name}') +# --- Automated smoke test --- +# Estimated duration: 30–60 seconds +POD=$(kubectl get pod -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" -o jsonpath='{.items[0].metadata.name}') kubectl wait --for=condition=Ready "pod/${POD}" --timeout=120s +# Expected output: "pod/ condition met" + kubectl exec "$POD" -- pgrep -x openab +# Expected output: a numeric PID (e.g. "42") — non-zero exit means process not running + PANIC_LINES=$(kubectl logs "deployment/${DEPLOYMENT}" --tail=100 | grep -icE "panic|fatal" || true) if [ "$PANIC_LINES" -gt 0 ]; then - echo "❌ Panic/fatal lines found in logs. Do NOT proceed. Run rollback." + echo "❌ Panic/fatal lines found in logs. Automated smoke test FAILED." + echo " Run rollback (Section IV) and do not proceed to Step 2." exit 1 fi -echo "✅ Automated smoke test passed. Proceed with Discord functional validation." +echo "✅ Automated smoke test passed." 
+``` -# After automated smoke test — manual Discord check required: -# → Send a test message in the Discord channel -# → Confirm the bot responds and basic conversation / tool calls work -# → If bot is unresponsive or broken: run helm rollback (Section IV) and stop +**After automated smoke test — human Discord validation required:** + +```bash +# ⏸ HUMAN CONFIRMATION REQUIRED +# Estimated wait: 2–5 minutes +echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━" +echo "⏸ PAUSED — Human action required before continuing" +echo "" +echo " 1. Send a test message to the Discord channel" +echo " 2. Confirm the bot responds and basic conversation / tool calls work" +echo "" +echo " When confirmed OK, type: CONFIRMED" +echo " To abort and rollback, type: ROLLBACK" +echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━" +read -r HUMAN_INPUT +case "$HUMAN_INPUT" in + CONFIRMED) + echo "✅ Human confirmed — proceeding to Step 2" + ;; + ROLLBACK) + echo "🔁 Rollback requested by human. Proceed to Section IV." + exit 2 + ;; + *) + echo "❌ Unrecognized input ('$HUMAN_INPUT'). Aborting for safety." + echo " Run rollback (Section IV) if needed." + exit 1 + ;; +esac ``` ### Step 2: Promote to Stable -> **Agent note:** Only run this after Step 1 has passed both automated and Discord validation. +> **Agent instruction:** Only run this after Step 1 is fully complete (automated + human confirmation), or after confirming `PRERELEASE_VERSION` was empty. 
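
The precondition stated above can also be checked mechanically before the stable deploy. This is a minimal sketch: `step2_allowed` is a hypothetical function, and it assumes the operator's answer from Step 1 was kept in a variable such as `HUMAN_INPUT`. It does not re-run the automated smoke test; it only encodes the two allowed paths into Step 2.

```bash
#!/usr/bin/env bash
# Hypothetical gate (not part of the SOP): decide whether Step 2 may run.
#   $1 = PRERELEASE_VERSION (empty when Step 1 was skipped during env setup)
#   $2 = operator's answer recorded in Step 1 (CONFIRMED / ROLLBACK / other)
step2_allowed() {
  local prerelease="$1" answer="${2:-}"
  if [ -z "$prerelease" ]; then
    # Path 1: no pre-release to validate, so Step 1 was skipped by design.
    return 0
  fi
  # Path 2: a pre-release was deployed, so an explicit CONFIRMED is required.
  [ "$answer" = "CONFIRMED" ]
}
```

A guard such as `step2_allowed "$PRERELEASE_VERSION" "$HUMAN_INPUT" || exit 1` at the top of the Step 2 block would stop an accidental promotion when Step 1 ended in `ROLLBACK` or was never answered.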
```bash -# Dry-run the stable version +source openab-session-env.sh + +echo "Promoting to stable: $TARGET_VERSION" + +# Dry-run helm upgrade "$RELEASE_NAME" oci://ghcr.io/openabdev/charts/openab \ - --version "${TARGET_VERSION}" \ - -f "$BACKUP_VALUES" \ + --version "$TARGET_VERSION" \ + -f "$BACKUP_DIR/values.yaml" \ --dry-run -# Deploy stable (short downtime is expected due to Recreate strategy) +# Deploy stable (short downtime expected due to Recreate strategy) helm upgrade "$RELEASE_NAME" oci://ghcr.io/openabdev/charts/openab \ - --version "${TARGET_VERSION}" \ - -f "$BACKUP_VALUES" + --version "$TARGET_VERSION" \ + -f "$BACKUP_DIR/values.yaml" -# Wait for the Pod to be ready kubectl rollout status "deployment/${DEPLOYMENT}" --timeout=300s +# Expected output: "deployment/ successfully rolled out" +# Estimated duration: 60–180 seconds ``` ### Post-Upgrade Verification > **Agent note — pass/fail criteria:** -> - **Pass:** All commands below exit 0 AND image tag matches `TARGET_VERSION` AND no panic/fatal in logs. -> - **Fail:** Any command exits non-zero, or image tag does not match. → Proceed to Section IV Rollback. +> - **Pass:** All commands exit 0, deployed chart version equals `openab-${TARGET_VERSION}`, no panic/fatal in logs, PVC paths are present. +> - **Fail:** Any command exits non-zero, version mismatch, or panic/fatal in logs. → Proceed to Section IV Rollback immediately. ```bash +source openab-session-env.sh + POD=$(kubectl get pod \ -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" \ -o jsonpath='{.items[0].metadata.name}') -# 1. Check Pod status (must be Running and READY) +# 1. Pod status kubectl get pod -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" kubectl wait --for=condition=Ready "pod/${POD}" --timeout=120s +# Expected output: "pod/ condition met" -# 2. Verify deployed chart version matches target +# 2. 
Chart version DEPLOYED=$(helm list -f "$RELEASE_NAME" -o json | jq -r '.[0].chart') -echo "Deployed chart: $DEPLOYED | Expected: openab-${TARGET_VERSION}" +echo "Deployed: $DEPLOYED | Expected: openab-${TARGET_VERSION}" if [ "$DEPLOYED" != "openab-${TARGET_VERSION}" ]; then echo "❌ Version mismatch. Investigate before proceeding." + exit 1 fi -# 3. Verify image tag +# 3. Image tag kubectl get "deployment/${DEPLOYMENT}" \ -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}' +# Expected output contains: TARGET_VERSION or its image SHA -# 4. Confirm the openab process is running +# 4. Process check kubectl exec "$POD" -- pgrep -x openab +# Expected output: a numeric PID -# 5. Check logs for errors +# 5. Log check PANIC_LINES=$(kubectl logs "deployment/${DEPLOYMENT}" --tail=100 | grep -icE "panic|fatal" || true) -ERROR_LINES=$(kubectl logs "deployment/${DEPLOYMENT}" --tail=100 | grep -icE "error|warn" || true) -echo "Panic/fatal lines: $PANIC_LINES | Error/warn lines: $ERROR_LINES" +WARN_LINES=$(kubectl logs "deployment/${DEPLOYMENT}" --tail=100 | grep -icE "error|warn" || true) +echo "Panic/fatal: $PANIC_LINES | Error/warn: $WARN_LINES" if [ "$PANIC_LINES" -gt 0 ]; then - echo "❌ Panic/fatal found. Rollback recommended." + echo "❌ Panic/fatal found. Proceed to Section IV Rollback." + exit 1 fi -# 6. Verify PVC data (steering files and agent config) are still present +# 6. PVC data integrity kubectl exec "$POD" -- ls /home/agent/.kiro/steering/ +# Expected output: at least one file listed (e.g. IDENTITY.md) kubectl exec "$POD" -- cat /home/agent/.kiro/agents/default.json | head -5 -# If either path is missing, restore from backup (see Section IV: Restore Custom Config) +# Expected output: first 5 lines of valid JSON -# 7. Discord E2E verification (final check — requires human operator) -# → Send a test message in the Discord channel -# → Confirm the bot responds and conversation works correctly +echo "✅ All automated checks passed." 
+``` + +**After automated checks — human Discord E2E confirmation:** + +```bash +echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━" +echo "⏸ PAUSED — Human E2E validation required" +echo "" +echo " Send a test message in the Discord channel." +echo " Confirm the bot responds and conversation works correctly." +echo "" +echo " When confirmed OK, type: CONFIRMED" +echo " If issues found, type: ROLLBACK" +echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━" +read -r HUMAN_INPUT +case "$HUMAN_INPUT" in + CONFIRMED) echo "✅ Upgrade complete." ;; + ROLLBACK) echo "🔁 Rollback requested. Proceed to Section IV."; exit 2 ;; + *) echo "❌ Unrecognized input. Aborting."; exit 1 ;; +esac ``` ### Completion Notice -- Once all verifications pass, notify users: - - Upgrade complete, service restored - - New version number and summary of key changes - - Contact channel for reporting any issues +```bash +source openab-session-env.sh + +# Send completion notification via Discord webhook (if configured) +if [ -n "${DISCORD_WEBHOOK_URL:-}" ]; then + curl -s -X POST "$DISCORD_WEBHOOK_URL" \ + -H "Content-Type: application/json" \ + -d "{\"content\": \"✅ **Upgrade complete:** OpenAB is now running v${TARGET_VERSION}. Service restored.\"}" + echo "✅ Completion notice sent" +else + echo "ℹ️ Notify users manually: OpenAB upgraded to v${TARGET_VERSION}, service restored." +fi +``` --- ## IV. 
Rollback -### Decision Tree - -> **Agent note — machine-readable branch criteria:** -> -> | Observed condition | Action | -> |---|---| -> | `kubectl get pod` shows `CrashLoopBackOff` or `Pending` | `helm rollback` immediately | -> | Pod is `Running` AND `pgrep -x openab` exits non-zero | `helm rollback` | -> | Pod is `Running`, process OK, but logs contain `panic` or `fatal` | `helm rollback` | -> | Pod is `Running`, process OK, logs clean, but no Discord response after 60s | `kubectl rollout restart` first; if still no response after 60s → `helm rollback` | -> | Pod is `Running`, process OK, logs clean, Discord responds, but config is missing | Restore config from backup → `kubectl rollout restart` | -> | Quick fix is clearly identified (e.g. a known bad config key) | Hotfix — escalate to human engineer | +### Decision Table (Machine-Readable) -``` -Issue detected after upgrade - │ - ▼ - Pod status? - │ - ├─ CrashLoopBackOff / Pending ──→ helm rollback - │ - ├─ Running, pgrep exits non-zero OR panic in logs - │ └─ helm rollback - │ - ├─ Running, logs clean, bot unresponsive - │ └─ kubectl rollout restart deployment/${DEPLOYMENT} - │ │ - │ ├─ Responds within 60s ──→ Continue monitoring - │ └─ Still unresponsive ──→ helm rollback - │ - └─ Running, bot OK, config missing - └─ Restore config from backup → kubectl rollout restart -``` +> **Agent instruction:** Evaluate conditions in order. Execute the action for the first matching row. Only one action should be taken per rollback event. -| Symptom | Action | -|---|---| -| Pod fails to start (CrashLoopBackOff) | Helm rollback | -| Functionality broken, Pod is running | Rollback or hotfix | -| Custom config lost | Restore config files from backup | -| Bot unresponsive | Restart Pod first; rollback if it persists | +| Condition to check | How to check | Action | +|---|---|---| +| Pod phase is `CrashLoopBackOff` or `Pending` | `kubectl get pod ... 
-o jsonpath='{.items[0].status.phase}'` (note: `CrashLoopBackOff` is not a pod phase; it appears in the `STATUS` column of `kubectl get pod` and in `.status.containerStatuses[0].state.waiting.reason`, while `.status.phase` typically reads `Running`) | `helm rollback` immediately | +| Pod is `Running` AND `pgrep -x openab` exits non-zero | `kubectl exec $POD -- pgrep -x openab; echo $?` | `helm rollback` | +| Pod is `Running`, process OK, logs contain `panic` or `fatal` | `kubectl logs ... \| grep -icE "panic\|fatal"` | `helm rollback` | +| Pod is `Running`, process OK, logs clean, no Discord response after 60s | Human reports no response | `kubectl rollout restart` first; if still no response after 60s → `helm rollback` | +| Pod is `Running`, process OK, bot responds, but config files missing | `kubectl exec $POD -- ls /home/agent/.kiro/steering/` | Restore from backup → `kubectl rollout restart` | +| Quick fix is clearly identified (e.g. known bad config key) | Human identifies root cause | Hotfix — escalate to human engineer | ### Helm Rollback +> **Agent instruction:** `PREV_REVISION` is resolved from the backup's `helm-history.txt` (saved before any upgrade occurred). This avoids the ambiguity of "second-to-last revision" when multiple `helm upgrade` calls were made during the upgrade process (pre-release + stable). + ```bash -export KUBECONFIG=~/.kube/config -RELEASE_NAME=$(helm list -o json | jq -r '.[0].name') -DEPLOYMENT="${RELEASE_NAME}-kiro" +source openab-session-env.sh -# 1. View release history and identify the previous revision -helm history "$RELEASE_NAME" +# Validate BACKUP_DIR is set and helm-history.txt exists +if [ -z "$BACKUP_DIR" ] || [ ! -f "$BACKUP_DIR/helm-history.txt" ]; then + echo "❌ BACKUP_DIR not set or helm-history.txt missing." + echo " Resolve manually: helm history $RELEASE_NAME" + exit 1 +fi -# 2.
Get the previous revision number automatically -PREV_REVISION=$(helm history "$RELEASE_NAME" --output json \ - | jq -r 'sort_by(.revision) | .[-2].revision') -echo "Rolling back to revision: $PREV_REVISION" +echo "Using backup: $BACKUP_DIR" +echo "Backup timestamp: $(echo "$BACKUP_DIR" | grep -oE '[0-9]{8}-[0-9]{6}')" + +# Resolve the pre-upgrade stable revision from the backup +# (the last revision with status "deployed" at the time of backup) +PREV_REVISION=$(awk 'NR>1 && $3=="deployed" {rev=$1} END {print rev}' "$BACKUP_DIR/helm-history.txt") +if [ -z "$PREV_REVISION" ]; then + echo "❌ Could not resolve PREV_REVISION from helm-history.txt." + echo " Contents of helm-history.txt:" + cat "$BACKUP_DIR/helm-history.txt" + echo "" + echo " Set PREV_REVISION manually and re-run: helm rollback $RELEASE_NAME " + exit 1 +fi +echo "Rolling back to revision: $PREV_REVISION (pre-upgrade stable)" -# 3. Roll back +# Rollback helm rollback "$RELEASE_NAME" "$PREV_REVISION" +# Expected output: "Rollback was a success! Happy Helming!" -# 4. Wait for the Pod to be ready kubectl rollout status "deployment/${DEPLOYMENT}" --timeout=300s +# Expected output: "deployment/ successfully rolled out" -# 5. Confirm rollback succeeded +# Confirm rollback kubectl get pod -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" +# Expected output: 1 pod in Running/Ready state -# 6. 
Run full post-rollback verification +# Post-rollback verification POD=$(kubectl get pod \ -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" \ -o jsonpath='{.items[0].metadata.name}') kubectl wait --for=condition=Ready "pod/${POD}" --timeout=120s kubectl exec "$POD" -- pgrep -x openab -kubectl logs "deployment/${DEPLOYMENT}" --tail=100 | grep -iE "error|warn|panic|fatal" -# → Send a test message in the Discord channel to confirm the bot responds +# Expected output: a numeric PID + +PANIC_LINES=$(kubectl logs "deployment/${DEPLOYMENT}" --tail=100 | grep -icE "panic|fatal" || true) +echo "Panic/fatal after rollback: $PANIC_LINES" +if [ "$PANIC_LINES" -gt 0 ]; then + echo "❌ Panic/fatal found even after rollback. Escalate to human engineer." + exit 1 +fi +echo "✅ Rollback complete. Send Discord test message to confirm bot is responsive." ``` ### Restore Custom Config ```bash -export KUBECONFIG=~/.kube/config -RELEASE_NAME=$(helm list -o json | jq -r '.[0].name') -DEPLOYMENT="${RELEASE_NAME}-kiro" +source openab-session-env.sh POD=$(kubectl get pod \ -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" \ -o jsonpath='{.items[0].metadata.name}') -# Resolve backup directory -BACKUP_DIR=$(ls -td openab-backup-* 2>/dev/null | head -1) echo "Restoring from: $BACKUP_DIR" # Restore agent config kubectl cp "$BACKUP_DIR/agents/default.json" "$POD:/home/agent/.kiro/agents/default.json" +echo "✅ Agent config restored" # Restore steering files -# ⚠️ kubectl cp directory behaviour varies across versions — trailing slash matters. -# Use the tar pipe method below to avoid accidentally creating a nested directory -# (e.g. steering/steering/) which can happen with some kubectl versions. +# ⚠️ Use tar pipe to avoid nested directory issue (e.g. steering/steering/) kubectl exec "$POD" -- mkdir -p /home/agent/.kiro/steering tar c -C "$BACKUP_DIR/steering" . 
| kubectl exec -i "$POD" -- tar x -C /home/agent/.kiro/steering +echo "✅ Steering files restored" # Restore GitHub CLI credentials kubectl cp "$BACKUP_DIR/hosts.yml" "$POD:/home/agent/.config/gh/hosts.yml" +echo "✅ hosts.yml restored" # Restore kiro-cli auth database kubectl exec "$POD" -- mkdir -p /home/agent/.local/share/kiro-cli kubectl cp "$BACKUP_DIR/kiro-auth.sqlite3" "$POD:/home/agent/.local/share/kiro-cli/data.sqlite3" +echo "✅ kiro-cli auth DB restored" # Restart Pod to apply changes kubectl rollout restart "deployment/${DEPLOYMENT}" +kubectl rollout status "deployment/${DEPLOYMENT}" --timeout=300s +echo "✅ Pod restarted with restored config" ``` From f32cbfc03b08caf27acb3f47a640c0970e3657d5 Mon Sep 17 00:00:00 2001 From: JARVIS-Agent Date: Tue, 14 Apr 2026 23:28:38 +0000 Subject: [PATCH 6/8] docs: add AI instruction note to SOP title --- docs/openab-upgrade-sop.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/docs/openab-upgrade-sop.md b/docs/openab-upgrade-sop.md index 0d3421fb..d373f8c2 100644 --- a/docs/openab-upgrade-sop.md +++ b/docs/openab-upgrade-sop.md @@ -1,5 +1,7 @@ # OpenAB Version Upgrade SOP +> [AI Instruction] Follow this documentation to assist the user in executing the upgrade process efficiently while ensuring all backup and rollback protocols are met. + | | | |---|---| | **Document Version** | 1.4 | From 8b384e5fad160c167b42d013e82bff912f67b77d Mon Sep 17 00:00:00 2001 From: JARVIS-Agent Date: Tue, 14 Apr 2026 23:32:31 +0000 Subject: [PATCH 7/8] docs: fix multi-release selector, read timeout, remove duplicate backup script MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit 1. helm list release name resolution: replace .[0].name with .[] | select(.chart | startswith("openab-")) | .name | head -1 to correctly handle namespaces with multiple Helm releases 2. read HUMAN_INPUT: add -t 600 timeout with exit 1 on expiry to prevent indefinite hang in non-interactive / CI/CD environments 3. 
Remove "One-Click Backup Script" section (was ~60 lines of content functionally identical to the Agent-Executable Steps 0-7 above it) — reduces duplication and document length --- docs/openab-upgrade-sop.md | 71 +++----------------------------------- 1 file changed, 4 insertions(+), 67 deletions(-) diff --git a/docs/openab-upgrade-sop.md b/docs/openab-upgrade-sop.md index d373f8c2..f36b66af 100644 --- a/docs/openab-upgrade-sop.md +++ b/docs/openab-upgrade-sop.md @@ -27,7 +27,7 @@ > **Deployment naming pattern:** The deployment name follows `-kiro`. For the default setup (`helm install openab …`), the deployment is `openab-kiro`. If you used a different release name (e.g. `my-bot`), the deployment is `my-bot-kiro`. Verify with: > ```bash -> RELEASE_NAME=$(helm list -o json | jq -r '.[0].name') +> RELEASE_NAME=$(helm list -o json | jq -r '.[] | select(.chart | startswith("openab-")) | .name' | head -1) > DEPLOYMENT="${RELEASE_NAME}-kiro" > echo "Deployment: $DEPLOYMENT" > ``` @@ -170,7 +170,7 @@ export KUBECONFIG=~/.kube/config # source openab-session-env.sh && echo "✅ Session env loaded" && exit 0 # --- Resolve release and deployment names --- -RELEASE_NAME=$(helm list -o json | jq -r '.[0].name') +RELEASE_NAME=$(helm list -o json | jq -r '.[] | select(.chart | startswith("openab-")) | .name' | head -1) if [ -z "$RELEASE_NAME" ]; then echo "❌ No Helm release found. Is OpenAB installed?" exit 1 @@ -496,69 +496,6 @@ else fi ``` -### One-Click Backup Script - -The script below combines Steps 0–7 and the Verification Gate. - -```bash -#!/bin/bash -export KUBECONFIG=~/.kube/config -source openab-session-env.sh - -BACKUP_DIR="openab-backup-$(date +%Y%m%d-%H%M%S)" -mkdir -p "$BACKUP_DIR" - -POD=$(kubectl get pod \ - -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" \ - -o jsonpath='{.items[0].metadata.name}') -if [ -z "$POD" ]; then echo "❌ Pod not found. Aborting." && exit 1; fi -if ! 
kubectl exec "$POD" -- which tar > /dev/null 2>&1; then - echo "❌ tar not found in container. Aborting." && exit 1 -fi - -echo "export BACKUP_DIR=\"${BACKUP_DIR}\"" >> openab-session-env.sh - -FAILED_STEPS=() -backup_step() { - local desc="$1"; shift - echo "=== $desc ===" - if ! "$@"; then - echo "⚠️ Failed: $desc" - FAILED_STEPS+=("$desc") - fi -} - -backup_step "Helm values" bash -c "helm get values '$RELEASE_NAME' -o yaml > $BACKUP_DIR/values.yaml" -backup_step "Agent config" kubectl cp "$POD:/home/agent/.kiro/agents/" "$BACKUP_DIR/agents/" -backup_step "Steering files" kubectl cp "$POD:/home/agent/.kiro/steering/" "$BACKUP_DIR/steering/" -backup_step "hosts.yml" kubectl cp "$POD:/home/agent/.config/gh/hosts.yml" "$BACKUP_DIR/hosts.yml" -backup_step "kiro-cli auth DB" kubectl cp "$POD:/home/agent/.local/share/kiro-cli/data.sqlite3" "$BACKUP_DIR/kiro-auth.sqlite3" -backup_step "Kubernetes Secret" bash -c "kubectl get secret '${DEPLOYMENT}' -o yaml > $BACKUP_DIR/secret.yaml" -backup_step "Helm history" bash -c "helm history '$RELEASE_NAME' > $BACKUP_DIR/helm-history.txt" -backup_step "PVC data" kubectl cp "$POD:/home/agent/" "$BACKUP_DIR/pvc-data/" - -if kubectl exec "$POD" -- test -d /home/agent/.kiro/skills 2>/dev/null; then - backup_step "Skills" kubectl cp "$POD:/home/agent/.kiro/skills/" "$BACKUP_DIR/skills/" -else - echo "⚠️ skills/ not found — skipping" -fi - -echo "" -echo "=== Backup Summary: $BACKUP_DIR ===" -ls -la "$BACKUP_DIR/" - -if [ ${#FAILED_STEPS[@]} -gt 0 ]; then - echo "⚠️ Failed steps: ${FAILED_STEPS[*]}" - echo " Review failures before proceeding with the upgrade." -else - echo "✅ All backup steps completed." -fi - -echo "" -echo "🔐 SECURITY: $BACKUP_DIR/secret.yaml contains credentials. Do NOT commit." -echo " gpg --symmetric $BACKUP_DIR/secret.yaml" -``` - --- ## III. 
Upgrade Execution @@ -641,7 +578,7 @@ echo "" echo " When confirmed OK, type: CONFIRMED" echo " To abort and rollback, type: ROLLBACK" echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━" -read -r HUMAN_INPUT +read -t 600 -r HUMAN_INPUT || { echo "❌ Timeout: no human input received within 600s. Aborting."; exit 1; } case "$HUMAN_INPUT" in CONFIRMED) echo "✅ Human confirmed — proceeding to Step 2" @@ -748,7 +685,7 @@ echo "" echo " When confirmed OK, type: CONFIRMED" echo " If issues found, type: ROLLBACK" echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━" -read -r HUMAN_INPUT +read -t 600 -r HUMAN_INPUT || { echo "❌ Timeout: no human input received within 600s. Aborting."; exit 1; } case "$HUMAN_INPUT" in CONFIRMED) echo "✅ Upgrade complete." ;; ROLLBACK) echo "🔁 Rollback requested. Proceed to Section IV."; exit 2 ;; From 986eb0b3a1c8da852f6c1acfadbb1ac19e10396e Mon Sep 17 00:00:00 2001 From: JARVIS-Agent Date: Wed, 15 Apr 2026 00:03:35 +0000 Subject: [PATCH 8/8] docs: address chaodu-agent review feedback (v1.5) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Fix all 3 suggested changes and 2 nits from chaodu-agent review: 1. Add PVC rollback warning at top of Section IV: helm rollback does NOT revert PVC data; if new version ran a data migration, restore PVC from Step 7 backup before rolling back 2. Add Agent note above both read HUMAN_INPUT blocks: if running in non-interactive shell (no stdin), skip read and report to user that human confirmation is required, then pause execution 3. Step 7 PVC overlap: add explicit note explaining that pvc-data/ is intentionally redundant — it is the full PVC snapshot for rollback of migrated data, while Steps 2-5 are for targeted fast restores. Add 500MB size threshold gate with VolumeSnapshot recommendation. 
Nit 1: replace awk text parsing of helm-history with JSON approach — Step 7 now saves helm-history.json in addition to .txt; PREV_REVISION resolution uses jq on the JSON file for reliability across Helm versions (avoids column-shift issues with text output); Verification Gate and rollback section updated accordingly Nit 2: add comment in Step 0 explaining why POD is not persisted to openab-session-env.sh (pod name changes after every upgrade/restart) --- docs/openab-upgrade-sop.md | 70 ++++++++++++++++++++++++++------------ 1 file changed, 49 insertions(+), 21 deletions(-) diff --git a/docs/openab-upgrade-sop.md b/docs/openab-upgrade-sop.md index f36b66af..e9ea81b7 100644 --- a/docs/openab-upgrade-sop.md +++ b/docs/openab-upgrade-sop.md @@ -4,8 +4,8 @@ | | | |---|---| -| **Document Version** | 1.4 | -| **Last Updated** | 2026-04-14 | +| **Document Version** | 1.5 | +| **Last Updated** | 2026-04-15 | ## Environment Reference @@ -85,7 +85,7 @@ ▼ ┌─────────────────────────────────────────────────────────────┐ │ IV. Rollback │ -│ PREV_REVISION from backup helm-history.txt │ +│ PREV_REVISION from backup helm-history.json │ │ Machine-readable decision table → rollback → verify │ └─────────────────────────────────────────────────────────────┘ ``` @@ -309,6 +309,8 @@ fi #### Step 0 — Resolve variables and create backup directory > **Output:** `BACKUP_DIR` appended to `openab-session-env.sh` → used in Steps 1–7 and the Verification Gate. +> +> **Why `POD` is not saved to `openab-session-env.sh`:** The pod name changes after every `helm upgrade` or `kubectl rollout restart` (new pod is created, old one is terminated). Persisting the pod name would cause subsequent steps to target a pod that no longer exists. Each step re-resolves `POD` at runtime to ensure it always refers to the currently running pod. ```bash source openab-session-env.sh @@ -411,22 +413,33 @@ echo "🔐 SECURITY: secret.yaml contains credentials — do NOT commit. 
Encrypt echo " gpg --symmetric $BACKUP_DIR/secret.yaml" ``` -#### Step 7 — Backup Helm release history and PVC data +#### Step 7 — Backup Helm release history and full PVC snapshot > **Input:** `POD` (re-resolved) · **Output:** `$BACKUP_DIR/helm-history.txt`, `$BACKUP_DIR/pvc-data/` +> +> **Note on PVC overlap:** `pvc-data/` copies the entire `/home/agent` directory, which includes paths already backed up individually in Steps 2–5 (agents/, steering/, hosts.yml, kiro-auth.sqlite3). This overlap is **intentional** — the full PVC snapshot is the last-resort restore path if the new version ran a data migration that corrupts the PVC. The individual backups in Steps 2–5 are for fast, targeted restores; `pvc-data/` is for full rollback of PVC state. +> +> **Size threshold:** If the PVC is larger than ~500 MB, `kubectl cp` may be slow or time out. In that case, use the VolumeSnapshot option below instead. ```bash source openab-session-env.sh POD=$(kubectl get pod -l "app.kubernetes.io/instance=${RELEASE_NAME},app.kubernetes.io/name=${DEPLOYMENT}" -o jsonpath='{.items[0].metadata.name}') helm history "$RELEASE_NAME" > "$BACKUP_DIR/helm-history.txt" -echo "✅ Helm history backed up" -# This file is the source of truth for PREV_REVISION used in rollback - -PVC_SIZE=$(kubectl exec "$POD" -- du -sh /home/agent 2>/dev/null | cut -f1) -echo "PVC size: $PVC_SIZE" +helm history "$RELEASE_NAME" --output json > "$BACKUP_DIR/helm-history.json" +echo "✅ Helm history backed up (text + JSON)" +# helm-history.json is the source of truth for PREV_REVISION used in Section IV rollback +# JSON format avoids column-shift parsing issues across Helm versions + +PVC_SIZE_BYTES=$(kubectl exec "$POD" -- du -sb /home/agent 2>/dev/null | cut -f1) +PVC_SIZE_HUMAN=$(kubectl exec "$POD" -- du -sh /home/agent 2>/dev/null | cut -f1) +echo "PVC size: $PVC_SIZE_HUMAN" +if [ "${PVC_SIZE_BYTES:-0}" -gt 524288000 ]; then + echo "⚠️ PVC exceeds 500 MB — kubectl cp may be slow or time out." 
+ echo " Consider using the VolumeSnapshot option below instead." +fi kubectl cp "$POD:/home/agent/" "$BACKUP_DIR/pvc-data/" -echo "✅ PVC data backed up" +echo "✅ Full PVC snapshot backed up" ``` > **Advanced option — VolumeSnapshot (for large PVCs or CSI-enabled clusters):** @@ -483,7 +496,8 @@ check_dir "$BACKUP_DIR/steering/" "Steering files" check_file "$BACKUP_DIR/hosts.yml" "GitHub CLI credentials" check_file "$BACKUP_DIR/kiro-auth.sqlite3" "kiro-cli auth DB" check_file "$BACKUP_DIR/secret.yaml" "Kubernetes Secret" -check_file "$BACKUP_DIR/helm-history.txt" "Helm history" +check_file "$BACKUP_DIR/helm-history.txt" "Helm history (text)" +check_file "$BACKUP_DIR/helm-history.json" "Helm history (JSON — used for PREV_REVISION)" check_dir "$BACKUP_DIR/pvc-data/" "PVC data" echo "" @@ -566,6 +580,8 @@ echo "✅ Automated smoke test passed." **After automated smoke test — human Discord validation required:** +> **Agent note:** If running in a non-interactive shell (no stdin available), skip the `read` command below. Instead, report to the user that human confirmation is required and pause execution. Resume only after the user explicitly provides `CONFIRMED` or `ROLLBACK`. + ```bash # ⏸ HUMAN CONFIRMATION REQUIRED # Estimated wait: 2–5 minutes @@ -675,6 +691,8 @@ echo "✅ All automated checks passed." **After automated checks — human Discord E2E confirmation:** +> **Agent note:** If running in a non-interactive shell (no stdin available), skip the `read` command below. Instead, report to the user that human confirmation is required and pause execution. Resume only after the user explicitly provides `CONFIRMED` or `ROLLBACK`. + ```bash echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━" echo "⏸ PAUSED — Human E2E validation required" @@ -713,6 +731,14 @@ fi ## IV. Rollback +> ⚠️ **`helm rollback` does NOT revert PVC data.** Helm only rolls back Kubernetes resources (Deployment, ConfigMap, Secret, etc.). The PVC and its contents remain as-is after rollback. 
+> +> If the new version ran a data migration on startup, the old version may not be compatible with the modified PVC data. In that case, restore PVC data from the Step 7 backup **before** running `helm rollback`: +> ```bash +> # Restore PVC data from backup first (see "Restore Custom Config" below) +> # Then run helm rollback +> ``` + ### Decision Table (Machine-Readable) > **Agent instruction:** Evaluate conditions in order. Execute the action for the first matching row. Only one action should be taken per rollback event. @@ -728,28 +754,30 @@ fi ### Helm Rollback -> **Agent instruction:** `PREV_REVISION` is resolved from the backup's `helm-history.txt` (saved before any upgrade occurred). This avoids the ambiguity of "second-to-last revision" when multiple `helm upgrade` calls were made during the upgrade process (pre-release + stable). +> **Agent instruction:** `PREV_REVISION` is resolved from `helm-history.json` saved during Step 7 (before any upgrade occurred). Using the JSON format avoids column-shift parsing issues across Helm versions. This also avoids the ambiguity of "second-to-last revision" when multiple `helm upgrade` calls were made (pre-release + stable). ```bash source openab-session-env.sh # Validate BACKUP_DIR is set and helm-history.txt exists -if [ -z "$BACKUP_DIR" ] || [ ! -f "$BACKUP_DIR/helm-history.txt" ]; then - echo "❌ BACKUP_DIR not set or helm-history.txt missing." - echo " Resolve manually: helm history $RELEASE_NAME" +if [ -z "$BACKUP_DIR" ] || [ ! -f "$BACKUP_DIR/helm-history.json" ]; then + echo "❌ BACKUP_DIR not set or helm-history.json missing."
+ echo " Resolve manually: helm history $RELEASE_NAME --output json | jq" exit 1 fi echo "Using backup: $BACKUP_DIR" echo "Backup timestamp: $(echo "$BACKUP_DIR" | grep -oE '[0-9]{8}-[0-9]{6}')" -# Resolve the pre-upgrade stable revision from the backup +# Resolve the pre-upgrade stable revision from the backup JSON # (the last revision with status "deployed" at the time of backup) -PREV_REVISION=$(awk 'NR>1 && $3=="deployed" {rev=$1} END {print rev}' "$BACKUP_DIR/helm-history.txt") -if [ -z "$PREV_REVISION" ]; then - echo "❌ Could not resolve PREV_REVISION from helm-history.txt." - echo " Contents of helm-history.txt:" - cat "$BACKUP_DIR/helm-history.txt" +# Uses JSON format saved during Step 7 — avoids column-shift parsing issues across Helm versions +PREV_REVISION=$(jq -r '[.[] | select(.status == "deployed")] | sort_by(.revision) | last | .revision' \ + "$BACKUP_DIR/helm-history.json" 2>/dev/null) +if [ -z "$PREV_REVISION" ] || [ "$PREV_REVISION" = "null" ]; then + echo "❌ Could not resolve PREV_REVISION from helm-history.json." + echo " Contents of helm-history.json:" + cat "$BACKUP_DIR/helm-history.json" echo "" echo " Set PREV_REVISION manually and re-run: helm rollback $RELEASE_NAME " exit 1