release: v1.2.0 - Agent resilience and crash recovery #18

Merged
williamzujkowski merged 1 commit into main from release/v1.2.0 on Feb 2, 2026
@williamzujkowski
Owner

Summary

This release adds agent process resilience features that prevent, and help recover from, AI CLI crashes, particularly those caused by Claude Code memory leaks.

Key Features

Agent Resilience Phase (New Bootstrap Phase 8)

  • Claude Memory Watchdog: systemd service that monitors Claude CLI memory usage
    • Warns at 8GB and kills the process at 13GB
    • Prevents cascade failures across parallel clones
  • cgroups Memory Limiting: run-claude-limited wrapper enforces hard limits
    • Default 12GB limit, customizable
    • Uses systemd cgroups v2
  • Session Persistence: agent-session tmux wrapper
    • Auto-reattach on reconnect
    • Crash event logging
  • Enhanced Health Check: vm-health-check with trend analysis
    • Memory trend prediction
    • OOM alerts 30-60 minutes before exhaustion
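
For reviewers, the watchdog's warn/kill thresholds could be expressed roughly as in the sketch below. This is an illustration under stated assumptions, not the service shipped in this release: the function names `classify_rss` and `scan_once` are hypothetical, "GB" is read as GiB, and memory usage is taken to be per-process RSS in kB.

```shell
#!/bin/sh
# Hypothetical sketch of the watchdog decision logic (not the shipped service).
# Assumption: thresholds are GiB, compared against RSS reported in kB.
WARN_KB=$((8 * 1024 * 1024))    # 8 GiB warn threshold
KILL_KB=$((13 * 1024 * 1024))   # 13 GiB kill threshold

# classify_rss RSS_KB -> prints "ok", "warn", or "kill"
classify_rss() {
  rss_kb="$1"
  if [ "$rss_kb" -ge "$KILL_KB" ]; then
    echo kill
  elif [ "$rss_kb" -ge "$WARN_KB" ]; then
    echo warn
  else
    echo ok
  fi
}

# scan_once reads "PID RSS_KB" lines on stdin and prints "PID DECISION".
scan_once() {
  while read -r pid rss_kb; do
    echo "$pid $(classify_rss "$rss_kb")"
  done
}
```

Piping `ps -C claude -o pid=,rss=` into `scan_once` would print one decision per process, assuming the CLI's process name is `claude`. A real service would log on `warn` and signal the process on `kill` rather than just printing.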

Resource Planning

  • Default RAM increased: 8GB → 16GB
  • Default swap increased: 4GB → 8GB
  • --memory and --vcpus flags for setup_cloud.sh
  • New RESOURCES.md with parallel deployment guide
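
As a rough illustration of the planning arithmetic for parallel clones, the sketch below estimates how many VMs fit on a host at the new 16GB default. It is not taken from RESOURCES.md: the function name `max_parallel_vms` and the idea of a fixed host reserve are assumptions made for this example.

```shell
#!/bin/sh
# Hypothetical planning helper; the formula is an assumption, not from RESOURCES.md.
# max_parallel_vms HOST_GB RESERVE_GB [PER_VM_GB]
#   HOST_GB    total host RAM
#   RESERVE_GB RAM kept back for the host itself
#   PER_VM_GB  RAM per VM (defaults to this release's 16GB default)
max_parallel_vms() {
  host_gb="$1"
  reserve_gb="$2"
  per_vm_gb="${3:-16}"
  usable_gb=$((host_gb - reserve_gb))
  # Not even one VM fits once the reserve is subtracted.
  if [ "$usable_gb" -lt "$per_vm_gb" ]; then
    echo 0
    return 0
  fi
  echo $((usable_gb / per_vm_gb))
}
```

For example, `max_parallel_vms 64 8` prints `3`: 56GB usable divided by 16GB per VM, rounded down.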

Cleanup

  • Removed .mcp.json from git (local MCP config)
  • Removed vestigial guest/vm-health-check.sh (now embedded in bootstrap)
  • Fixed README accuracy (swap size, directory structure)

New Commands (Inside VM)

vm-health-check              # Quick health with Claude memory
vm-health-check --watch      # Continuous monitoring
vm-health-check --trend      # Memory trend analysis
run-claude-limited           # Run Claude with 12GB limit
run-claude-limited 8G        # Custom limit
agent-session                # Persistent tmux session
systemctl status claude-watchdog  # Check watchdog
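
To make the `run-claude-limited` behaviour concrete, here is a minimal sketch of how such a wrapper might assemble a cgroups v2 limit through `systemd-run`. It is an assumption about the approach, not the script added by this PR: the function name `build_limited_cmd`, the use of a `--user --scope` unit, and the `claude` binary name are all illustrative. `MemoryMax=` is the systemd resource-control property for a hard cgroup v2 memory limit.

```shell
#!/bin/sh
# Hypothetical sketch of a run-claude-limited style wrapper (dry run only).
DEFAULT_LIMIT="12G"   # matches the 12GB default described in this PR

# build_limited_cmd [LIMIT] [CLAUDE_ARGS...]
# Prints the systemd-run command line instead of executing it.
build_limited_cmd() {
  limit="${1:-$DEFAULT_LIMIT}"
  # Drop the limit argument, if one was given, so "$@" holds only CLI args.
  if [ "$#" -gt 0 ]; then
    shift
  fi
  # Assumption: a transient user scope with a hard memory cap.
  set -- systemd-run --user --scope -p "MemoryMax=${limit}" claude "$@"
  echo "$*"
}
```

Calling `build_limited_cmd 8G` prints the command for a custom 8G limit. A real wrapper would `exec` the same argument list rather than echo it.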

Test Plan

  • make lint passes
  • README accuracy verified
  • CHANGELOG updated
  • Directory structure accurate
  • Bootstrap runs successfully in VM (manual test)
  • Watchdog service starts and monitors correctly

🤖 Generated with Claude Code

## Summary
- Add agent process resilience phase with memory watchdog, cgroups limits, and session persistence
- Increase default RAM to 16GB and swap to 8GB for Claude CLI memory leak protection
- Add RESOURCES.md with parallel agent memory planning guide
- Clean up vestigial files and fix documentation accuracy

## Changes

### Added
- Claude memory watchdog systemd service (warns 8GB, kills 13GB)
- `run-claude-limited` cgroups wrapper for hard memory limits
- `agent-session` tmux wrapper with session persistence
- Enhanced `vm-health-check` with memory trend prediction
- `--memory` and `--vcpus` flags for setup_cloud.sh
- RESOURCES.md comprehensive resource planning guide
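
The memory trend prediction can be pictured as a linear extrapolation between two samples of available memory. The sketch below is only that simplified model, with hypothetical function names (`minutes_to_exhaustion`, `should_alert`) and an assumed 60-minute alert horizon; the actual `vm-health-check --trend` logic may differ.

```shell
#!/bin/sh
# Hypothetical sketch of linear memory-trend prediction (not the shipped logic).

# minutes_to_exhaustion OLD_AVAIL_MB NEW_AVAIL_MB INTERVAL_MIN
# Prints estimated minutes until available memory reaches zero,
# or "stable" when available memory is not shrinking.
minutes_to_exhaustion() {
  old_mb="$1"
  new_mb="$2"
  interval_min="$3"
  drop_mb=$((old_mb - new_mb))
  if [ "$drop_mb" -le 0 ]; then
    echo stable
    return 0
  fi
  # Integer division: remaining memory divided by the drop rate per minute.
  echo $((new_mb * interval_min / drop_mb))
}

# should_alert ESTIMATE -> exit 0 when exhaustion is within 60 minutes.
should_alert() {
  [ "$1" != "stable" ] && [ "$1" -le 60 ]
}
```

On Linux, the samples could come from the `MemAvailable` line of `/proc/meminfo`, for instance via `awk '/^MemAvailable:/ {print int($2/1024)}' /proc/meminfo`, taken a fixed interval apart.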

### Fixed
- README: swap size 4GB → 8GB (accuracy)
- README: updated directory structure
- Removed .mcp.json from git (local config)
- Removed vestigial guest/vm-health-check.sh (embedded in bootstrap)

### Documentation
- Updated CHANGELOG with all new features
- Updated CLAUDE.md quick reference
- Added agent resilience commands to README

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
williamzujkowski merged commit 1b00ef1 into main on Feb 2, 2026
2 checks passed
williamzujkowski deleted the release/v1.2.0 branch on February 2, 2026 at 18:35