diff --git a/.gitignore b/.gitignore
index 5c64191..bf1eee4 100644
--- a/.gitignore
+++ b/.gitignore
@@ -2,6 +2,7 @@
 *.iso
 *.qcow2
 *.img
+seed.iso
 
 # Logs
 *.log
@@ -25,6 +26,9 @@ Thumbs.db
 # Claude Code local settings
 .claude/
 
+# MCP server config (local)
+.mcp.json
+
 # Secrets - never commit these
 secrets/
 *.pem
diff --git a/.mcp.json b/.mcp.json
deleted file mode 100644
index c050400..0000000
--- a/.mcp.json
+++ /dev/null
@@ -1,8 +0,0 @@
-{
-  "mcpServers": {
-    "nexus-agents": {
-      "command": "nexus-agents",
-      "args": ["--mode=server"]
-    }
-  }
-}
diff --git a/CHANGELOG.md b/CHANGELOG.md
index e493758..9c00d96 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -19,16 +19,29 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - Golden image now includes full authentication (Claude OAuth, GitHub token, Codex, Gemini)
 - Clones inherit all authentication - no setup required
 - Health check on connect (`./agent.sh --health`)
+- **Agent resilience phase** in bootstrap with crash recovery tools:
+  - Claude memory watchdog systemd service (warns at 8GB, kills at 13GB)
+  - `run-claude-limited` cgroups wrapper for hard memory limits
+  - `agent-session` tmux wrapper with session persistence
+  - Enhanced `vm-health-check` with memory trend prediction and OOM alerts
+  - Crash event logging to `~/.agent-session/crashes.log`
+- `RESOURCES.md` - Comprehensive guide for parallel agent memory planning
+- Default RAM increased to 16GB (from 8GB) for Claude CLI memory leak protection
+- Default swap increased to 8GB (from 4GB)
+- `--memory` and `--vcpus` flags for `setup_cloud.sh`
 
 ### Changed
 - Updated README with one-command agent workflow
 - Updated README with quick reference table
 - Updated README with shell aliases
+- Updated README with agent resilience commands
+- Bootstrap now has 11 phases (added agent resilience phase)
 
 ### Fixed
 - CI shellcheck warnings (SC2155, SC2088)
 - Pinned GitHub Actions to stable versions (ludeeus/action-shellcheck@2.0.0, ibiqlik/action-yamllint@v3.1.1)
 - Updated golden image dependencies (npm 11.8.0, corepack 0.34.6, nexus-agents latest)
+- Removed `.mcp.json` from git tracking (local MCP config should not be shared)
 
 ## [1.1.0] - 2026-02-01
 
diff --git a/README.md b/README.md
index d3f4fe5..35e68d2 100644
--- a/README.md
+++ b/README.md
@@ -100,9 +100,15 @@ sudo shutdown -h now
 moltdown/
 ├── README.md # This file
 ├── CLAUDE.md # Development guidelines
+├── RESOURCES.md # Memory planning for parallel agents
+├── CHANGELOG.md # Release history
 ├── Makefile # Common operations
+├── agent.sh # One-command agent VM creation
 ├── setup_cloud.sh # One-command setup (cloud images, RECOMMENDED)
 ├── setup.sh # One-command setup (ISO installer)
+├── update-golden.sh # Update golden image CLIs and auth
+├── sync-ai-auth.sh # Sync AI CLI auth to VMs
+├── code-connect.sh # VS Code Remote SSH connection
 ├── generate_cloud_seed.sh # Create seed ISO for cloud images
 ├── generate_nocloud_iso.sh # Create seed ISO for ISO installer
 ├── virt_install_agent_vm.sh # Create VM with virt-install
@@ -116,8 +122,7 @@ moltdown/
 │   ├── user-data # Autoinstall config (for ISO installer)
 │   └── meta-data # Cloud-init metadata
 ├── guest/
-│   ├── bootstrap_agent_vm.sh # Run inside VM
-│   └── vm-health-check.sh # Health monitoring script
+│   └── bootstrap_agent_vm.sh # Run inside VM (includes health check)
 ├── docs/
 │   └── CLOUD_IMAGES.md # Cloud image workflow docs
 ├── examples/
@@ -244,17 +249,25 @@ images stay in `/var/lib/libvirt/images/` and are excluded by `.gitignore`.
 
 VMs are hardened for multi-day or multi-week agent sessions:
 
-- **Swap file**: 4GB for memory pressure
+- **Swap file**: 8GB for memory pressure (Claude CLI can leak to 13GB+)
 - **Journal limits**: 100MB max, prevents disk fill
 - **No auto-reboot**: Security updates don't restart
 - **Cloud-init disabled**: Prevents reconfiguration
+- **Memory watchdog**: Auto-kills runaway Claude CLI processes at 13GB threshold
+- **cgroups limits**: Optional hard memory limits via `run-claude-limited`
 
 Monitor health inside VM:
 ```bash
-vm-health-check # Quick status
-vm-health-check --watch # Live monitoring
+vm-health-check # Quick status with Claude memory tracking
+vm-health-check --watch # Live monitoring (30s refresh)
+vm-health-check --trend # Memory trend analysis with OOM prediction
+run-claude-limited # Run Claude with 12GB memory limit
+run-claude-limited 8G # Run with custom limit
+agent-session # Persistent tmux session with auto-reattach
 ```
 
+See [RESOURCES.md](RESOURCES.md) for detailed memory planning and parallel agent deployment guidance.
+
 ## Scripts Reference
 
 ### bootstrap_agent_vm.sh
diff --git a/guest/vm-health-check.sh b/guest/vm-health-check.sh
deleted file mode 100755
index e76912c..0000000
--- a/guest/vm-health-check.sh
+++ /dev/null
@@ -1,65 +0,0 @@
-#!/bin/bash
-#===============================================================================
-# vm-health-check.sh - Quick VM health status for long-running sessions
-#===============================================================================
-# Part of moltdown 🦀 - https://github.com/williamzujkowski/moltdown
-#
-# Purpose: Display quick health metrics for monitoring long-running agent VMs
-#
-# Usage: vm-health-check
-#        vm-health-check --watch # Refresh every 30 seconds
-#
-# License: MIT
-#===============================================================================
-
-set -euo pipefail
-
-show_health() {
-    echo "=== VM Health Check $(date '+%Y-%m-%d %H:%M:%S') ==="
-    echo "Uptime: $(uptime -p)"
-    echo "Memory: $(free -h | awk '/Mem:/{print $3 "/" $2 " (" int($3/$2*100) "% used)"}')"
-    echo "Swap: $(free -h | awk '/Swap:/{if($2!="0B") print $3 "/" $2; else print "not configured"}')"
-    echo "Disk: $(df -h / | awk 'NR==2{print $3 "/" $2 " (" $5 " used)"}')"
-    echo "Load: $(cat /proc/loadavg | cut -d' ' -f1-3)"
-    echo "Procs: $(ps aux --no-headers | wc -l)"
-
-    # Journal size (if available)
-    if command -v journalctl &>/dev/null; then
-        echo "Journal: $(journalctl --disk-usage 2>/dev/null | grep -oP '\d+\.\d+[MG]' || echo 'unknown')"
-    fi
-
-    # Docker status (if installed)
-    if command -v docker &>/dev/null; then
-        local containers
-        containers=$(docker ps -q 2>/dev/null | wc -l)
-        echo "Docker: $containers containers running"
-    fi
-}
-
-main() {
-    case "${1:-}" in
-        --watch|-w)
-            while true; do
-                clear
-                show_health
-                echo ""
-                echo "(Refreshing every 30s, Ctrl+C to exit)"
-                sleep 30
-            done
-            ;;
-        --help|-h)
-            echo "Usage: vm-health-check [--watch]"
-            echo ""
-            echo "Display quick health metrics for the VM."
-            echo ""
-            echo "Options:"
-            echo " --watch, -w Refresh every 30 seconds"
-            echo " --help, -h Show this help"
-            ;;
-        *)
-            show_health
-            ;;
-    esac
-}
-
-main "$@"