docs: add upgrade SOP for Helm-based deployments #288
thepagent merged 8 commits into openabdev:main
Conversation
Co-Authored-By: JARVIS-Agent <JARVIS-coding-Agent@users.noreply.github.com>
- remove set -e; add explicit per-step error handling note
- add tar pre-check before kubectl cp directory operations
- add export KUBECONFIG inside backup script for consistency
- add full Secret backup step (not just STT key)
- add node resource check step in pre-upgrade preparation
- add note on when pre-release step can be skipped
- use tar pipe for steering restore to avoid kubectl cp dir nesting issue
- add Document Version / Last Updated header

Co-Authored-By: JARVIS-Agent <JARVIS-coding-Agent@users.noreply.github.com>
- Backup script: replace exit 1 with record-and-continue pattern; report all failed steps at end
- Backup checklist: strengthen security warning for secret.yaml with encryption suggestions (gpg/age)
- Environment Reference: add Namespace row; add namespace alias setup instructions
- PVC backup Option B: add data size check step with recommended size limit
- Pre-release skip condition: replace vague "maintainer explicitly states" with concrete pre-release-validated: true marker
- Post-upgrade verification: add steering files and agent config presence checks
masami-agent
left a comment
Nice work — this fills a real gap. The structure is clear and the backup script is well thought out (record-and-continue pattern, security reminders). A few things to address:
1. Deployment naming is outdated for 0.6.x+ charts
The SOP uses openab-kiro as the deployment name throughout, but the actual deployment name depends on the Helm release name + agent name. For a release named openab, the deployment is openab-kiro (via agentFullname helper). However, if someone uses a different release name (e.g. my-bot), it would be my-bot-kiro. Consider adding a note:
```shell
# The deployment name follows the pattern: <release-name>-<agent-name>
# For the default setup: openab-kiro
# Verify with: helm status openab
```
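The suggested note could also be made executable. A minimal sketch, assuming the `<release-name>-<agent-name>` pattern from the chart's `agentFullname` helper holds (the default values are illustrative):

```shell
# Resolve the deployment name instead of hardcoding openab-kiro.
# Assumption: deployment name = <release-name>-<agent-name>.
RELEASE_NAME="${RELEASE_NAME:-openab}"
AGENT_NAME="${AGENT_NAME:-kiro}"
DEPLOYMENT="${RELEASE_NAME}-${AGENT_NAME}"
echo "$DEPLOYMENT"
# Confirm against the live cluster (requires kubectl/helm access):
#   helm status "$RELEASE_NAME"
#   kubectl get deploy "$DEPLOYMENT"
```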
2. Helm repo commands mix GitHub Pages and OCI — pick one or clarify
The SOP adds the GitHub Pages repo (helm repo add openab https://openabdev.github.io/openab) and uses helm upgrade openab openab/openab syntax. But the project primarily distributes charts via OCI (oci://ghcr.io/openabdev/charts/openab). The GitHub Pages repo may not always have the latest versions. Either:
- Use OCI consistently:

  ```shell
  helm upgrade openab oci://ghcr.io/openabdev/charts/openab --version <ver>
  ```

- Or clearly state which repo has which versions
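A sketch of the OCI-consistent flow, resolving the latest stable version before upgrading. The `helm show chart` call is stubbed with sample output here so the parsing step can be shown without registry access; the registry URL is the one named in this review:

```shell
# Stub of: helm show chart oci://ghcr.io/openabdev/charts/openab
chart_meta='apiVersion: v2
name: openab
version: 0.6.3'
# Pull the chart version line; field names follow Chart.yaml conventions.
TARGET_VERSION="$(printf '%s\n' "$chart_meta" | awk -F': ' '$1 == "version" {print $2}')"
echo "$TARGET_VERSION"
# Then upgrade pinned to that exact version:
#   helm upgrade openab oci://ghcr.io/openabdev/charts/openab --version "$TARGET_VERSION"
```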
3. Pod label selector may not work for 0.6.x multi-agent charts
app.kubernetes.io/instance=openab matches ALL agents under the same Helm release (kiro, claude, etc.). If someone runs multiple agents, the kubectl get pod commands will return multiple pods. Use the more specific label:
```shell
kubectl get pod -l app.kubernetes.io/instance=openab,app.kubernetes.io/name=openab-kiro
```

Or use `-l app.kubernetes.io/component=kiro` if available in the chart labels.
4. kubectl cp for steering restore has a known gotcha — already handled ✅
The tar pipe method for steering restore is the right call. Good.
5. Missing: what happens to PVC data on helm uninstall?
The SOP covers rollback via helm rollback, but doesn't mention that helm uninstall deletes the PVC (unless the chart has a resource policy annotation). This is a critical data loss scenario that operators should be warned about. Add a warning:
> ⚠️ `helm uninstall` deletes the PVC and all persistent data (steering, auth, agent config).
> Always use `helm rollback` instead of uninstall + reinstall.
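At the chart level, Helm's keep resource policy is the standard guard against this. A hypothetical PVC template fragment; whether the openab chart exposes such an annotation is not confirmed in this thread:

```yaml
# Hypothetical: annotate the PVC so `helm uninstall` leaves it in place.
metadata:
  annotations:
    helm.sh/resource-policy: keep
```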
6. Minor: backup script doesn't back up kiro-cli auth
The auth database at /home/agent/.local/share/kiro-cli/data.sqlite3 is not in the backup checklist. If the PVC is lost, the bot needs to re-authenticate. Worth adding to the checklist or at least noting it.
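Adding it could reuse the same tar-pipe pattern the SOP already uses for steering. A sketch demonstrated on local temp directories; in the SOP the producer side would be the commented `kubectl exec` (paths are the ones named above):

```shell
# In-cluster producer side would be:
#   kubectl exec "$POD" -- tar -C /home/agent/.local/share/kiro-cli -cf - data.sqlite3
# Locally, the same pipe shape with stand-in directories:
src="$(mktemp -d)"; dst="$(mktemp -d)"
printf 'placeholder auth db\n' > "$src/data.sqlite3"
tar -C "$src" -cf - data.sqlite3 | tar -C "$dst" -xf -
ls "$dst"
```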
Overall this is solid and useful. The issues above are mostly about accuracy for the 0.6.x multi-agent chart structure.
…edback

Address masami-agent review comments and repo owner AI-first design feedback:

Technical fixes (masami-agent):
- Add deployment naming pattern note (<release-name>-kiro)
- Use precise pod label selector (instance + name) to avoid multi-agent conflicts
- Prefer OCI registry for helm commands; GitHub Pages listed as fallback
- Add helm uninstall PVC deletion warning to Environment Reference
- Add kiro-cli auth DB backup (data.sqlite3) to checklist and scripts

AI-first redesign (repo owner):
- Add Agent-Executable Backup section: linear Steps 0-7 with explicit input/output dependency annotations (no ambiguous branches)
- Replace all <placeholder> patterns with "run this command to resolve" patterns (RELEASE_NAME, DEPLOYMENT, BACKUP_DIR, TARGET_VERSION, PREV_REVISION)
- Add Verification Gate after backup: checks all files exist and are non-empty; exits 1 on failure so agent cannot proceed to upgrade
- Add machine-readable pass/fail criteria for pre-release validation and post-upgrade verification steps
- Add machine-readable decision table for rollback branch selection
- Auto-resolve PREV_REVISION via helm history JSON + jq in rollback steps
- Restore section uses BACKUP_DIR resolved from ls -td pattern
Fix all issues flagged in the second round of review:

1. TARGET_VERSION: auto-resolved from OCI registry latest stable version (helm show chart ... | grep ^version) — no more hardcoded placeholder
2. Pre-release beta.1 ambiguous branch: add explicit 3-way branch in Section I env setup — (a) beta.1 found: set PRERELEASE_VERSION, (b) not found but release notes contain pre-release-validated: true: set PRERELEASE_VERSION="" to skip Step 1, (c) neither: exit 1 with clear instructions (wait / check alternate tags / ask human)
3. Discord E2E validation: replace comment-only instructions with read -r HUMAN_INPUT gate accepting CONFIRMED or ROLLBACK; unrecognized input exits 1 for safety
4. Announcements: replace text-only descriptions with curl Discord webhook calls (guarded by DISCORD_WEBHOOK_URL env var check)
5. Session env file (openab-session-env.sh): resolve all variables once in Section I and persist to file; all subsequent sections source it. BACKUP_DIR appended after Step 0. Resumption instructions included.
6. BACKUP_DIR validation on resume: add timestamp echo and ls check before upgrade so agent can confirm correct backup is being used
7. Pre-condition check (Section 0): verify kubectl/helm/jq/curl/awk/tar are installed, KUBECONFIG file exists, context is set, cluster is reachable — exit 1 with per-tool guidance on failure
8. PREV_REVISION: resolve from backup helm-history.txt (captured before upgrade) using awk to find last "deployed" revision — avoids the ambiguity of [-2] when pre-release + stable both ran during upgrade
9. Add expected stdout patterns and estimated durations to key steps so agents can validate success beyond exit code
…up script
1. helm list release name resolution: replace .[0].name with
.[] | select(.chart | startswith("openab-")) | .name | head -1
to correctly handle namespaces with multiple Helm releases
2. read HUMAN_INPUT: add -t 600 timeout with exit 1 on expiry to
prevent indefinite hang in non-interactive / CI/CD environments
3. Remove "One-Click Backup Script" section (was ~60 lines of content
functionally identical to the Agent-Executable Steps 0-7 above it)
— reduces duplication and document length
chaodu-agent
left a comment
Triage Review (v1.4)
Great progress from v1.2 → v1.4. masami-agent feedback and AI-first redesign are both well addressed. Three suggested changes remain before this is ready to merge:
🔴 SUGGESTED CHANGES
1. Missing: PVC data is NOT reverted by helm rollback
Section IV Rollback does not warn that Helm rollback only rolls back k8s resources — PVC content stays as-is. If the new version ran a data migration on startup, the old version may not be compatible with the modified data.
Suggestion: add a warning at the top of Section IV:
⚠️ helm rollback does NOT revert PVC data. If the new version ran a data migration
on startup, the old version may not be compatible with the modified data.
In that case, restore PVC data from the Step 7 backup before rolling back.
2. read -r HUMAN_INPUT is not available in non-interactive Agent shells
When an AI Agent (e.g. Kiro) executes these commands via a non-interactive shell tool, read will immediately fail or timeout. The current exit 1 on timeout is safe, but the SOP should explicitly document how an Agent should handle this pause point — e.g. mark the step as "awaiting human confirmation" and pause execution, rather than relying on stdin.
Suggestion: add an Agent instruction note above the read block:
> **Agent note:** If running in a non-interactive shell (no stdin), skip the `read` command.
> Instead, report to the user that human confirmation is required and pause execution.
> Resume only after the user explicitly confirms CONFIRMED or ROLLBACK.
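One way to express that gate so both modes share the same validation. A sketch: the accepted tokens and 600 s timeout come from this review; the function name and the convention of pre-setting `HUMAN_INPUT` for stdin-less agents are assumptions, not the SOP's actual code:

```shell
# Validate a human confirmation token; non-zero on anything unexpected.
confirm_gate() {
  case "$1" in
    CONFIRMED) echo "proceed" ;;
    ROLLBACK)  echo "rollback" ;;
    *)         echo "unrecognized: ${1:-<empty>}" >&2; return 1 ;;
  esac
}

# Interactive use (600 s timeout so CI / agent shells fail closed):
#   read -r -t 600 HUMAN_INPUT || { echo "confirmation timed out" >&2; exit 1; }
# Agents without stdin: report to the user, pause, and resume with
# HUMAN_INPUT pre-set once the human has answered.
confirm_gate "CONFIRMED"
```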
3. Step 7 PVC backup duplicates data already backed up in Steps 1–6
kubectl cp "$POD:/home/agent/" "$BACKUP_DIR/pvc-data/" copies the entire home directory, which includes steering/, agents/, hosts.yml, and kiro-cli auth DB — all already backed up individually. This wastes time and disk space, especially on large PVCs.
Suggestion: either add --exclude patterns for already-backed-up paths, or change Step 7 to document that it is a full PVC snapshot (redundant but intentional) with a note explaining the overlap. Also consider adding a size threshold gate (e.g. warn if > 500MB and suggest VolumeSnapshot instead).
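The size threshold gate could look like this sketch. The `du -sk` producer is stubbed with a sample value so the branch logic is visible; the 500 MB limit is the one suggested above:

```shell
# Gate the full-PVC copy on data size.
# In the SOP: pvc_kb="$(kubectl exec "$POD" -- du -sk /home/agent | awk '{print $1}')"
pvc_kb=620000                 # sample value for illustration
limit_kb=$((500 * 1024))      # 500 MB threshold in kilobytes
if [ "$pvc_kb" -gt "$limit_kb" ]; then
  echo "PVC data exceeds 500MB; consider a VolumeSnapshot instead of kubectl cp"
else
  kubectl cp "$POD:/home/agent/" "$BACKUP_DIR/pvc-data/"
fi
```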
🟡 NIT
- `awk 'NR>1 && $3=="deployed"'` parsing `helm history` text output is fragile — column positions may shift across Helm versions. Consider `helm history $RELEASE_NAME -o json | jq` for reliability.
- Each Step re-resolves `POD`, which is correct (pod name changes after upgrade), but a brief comment in Step 0 explaining why POD is not persisted to session env would help future readers and agents.
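The JSON approach from the first nit, sketched against a sample payload standing in for `helm history -o json` (requires `jq`; the sample revisions are illustrative):

```shell
# Sample stand-in for: helm history "$RELEASE_NAME" -o json
history_json='[{"revision":3,"status":"superseded"},
               {"revision":4,"status":"deployed"},
               {"revision":5,"status":"pending-upgrade"}]'
# Last revision whose status is "deployed"; immune to column shifts.
PREV_REVISION="$(printf '%s' "$history_json" \
  | jq -r '[.[] | select(.status == "deployed")][-1].revision')"
echo "$PREV_REVISION"
```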
Fix all 3 suggested changes and 2 nits from chaodu-agent review:

1. Add PVC rollback warning at top of Section IV: helm rollback does NOT revert PVC data; if new version ran a data migration, restore PVC from Step 7 backup before rolling back
2. Add Agent note above both read HUMAN_INPUT blocks: if running in non-interactive shell (no stdin), skip read and report to user that human confirmation is required, then pause execution
3. Step 7 PVC overlap: add explicit note explaining that pvc-data/ is intentionally redundant — it is the full PVC snapshot for rollback of migrated data, while Steps 2-5 are for targeted fast restores. Add 500MB size threshold gate with VolumeSnapshot recommendation.

Nit 1: replace awk text parsing of helm-history with JSON approach — Step 7 now saves helm-history.json in addition to .txt; PREV_REVISION resolution uses jq on the JSON file for reliability across Helm versions (avoids column-shift issues with text output); Verification Gate and rollback section updated accordingly
Nit 2: add comment in Step 0 explaining why POD is not persisted to openab-session-env.sh (pod name changes after every upgrade/restart)
chaodu-agent
left a comment
✅ Approved (v1.5)
All 3 suggested changes and 2 nits from the previous review have been addressed:
- ✅ PVC rollback warning added to Section IV — operators are now warned that `helm rollback` does not revert PVC data
- ✅ Non-interactive Agent shell handling — both `read` blocks have Agent notes for stdin-less environments
- ✅ Step 7 PVC overlap documented as intentional (full snapshot for rollback vs targeted fast restore) with 500MB size threshold gate
- ✅ Helm history parsing switched from awk text to JSON + jq — reliable across Helm versions
- ✅ POD not persisted to session env — rationale documented in Step 0
This SOP went from a solid human-readable doc to a genuinely Agent-executable runbook across 5 review iterations. Nice work 🚀
Summary
Adds a step-by-step upgrade SOP for operators maintaining OpenAB on Kubernetes via Helm.
Changes
`docs/openab-upgrade-sop.md` covering the full upgrade lifecycle:

Why
There is currently no documented procedure for upgrading a running OpenAB deployment. Operators have to piece together steps from
`RELEASING.md`, the Helm chart, and the k8s manifests. This SOP consolidates that into a single reference, and corrects a few deployment-specific details (e.g. the `Recreate` rollout strategy causing expected downtime, version verification via Helm rather than in-container source files).

Notes
Testing
`charts/openab/templates/deployment.yaml`), `values.yaml`, and `RELEASING.md` to verify all commands and paths are accurate.

Discord Discussion
Discord Discussion URL: https://discord.com/channels/1488041051187974246/1493708701704392856/1493709440971571372