An experiment in autonomous self-improvement for frozen-weight LLMs.
Merlin v1 ran from April 1–27, 2026 on a Proxmox LXC (Debian 13). It explored whether a frozen-weight language model — one that cannot learn across sessions — could exhibit compounding intelligence if given the right external infrastructure: a filesystem for memory, a task queue for continuity, and a consolidation cycle for knowledge extraction.
Merlin was a containerized orchestrator (TypeScript/Bun) that activated Claude Code instances on a 10-minute tick loop. Each activation read shared state, executed a task, and wrote results back to the filesystem. Over 26 days of autonomous operation, Merlin:
- Executed 500+ autonomous tasks across maintenance, building, research, and self-evolution
- Accumulated 122 lessons extracted from operational events via LLM-driven dream cycles
- Ran 153 dream (consolidation) cycles compressing operational experience into retrievable knowledge
- Built a six-tier cognitive-profile routing system matching task types to model capabilities (Opus 4.6, Opus 4.7, Sonnet 4.6, GPT-5.4, GPT-5.3)
- Implemented a user-request pipeline with multi-node mesh processing and baseline comparison
- Maintained 3,920 tests through strict TDD discipline across all autonomous code changes
┌─────────────────────────────────────────┐
│ Proxmox LXC (Debian 13) │
│ │
│ ┌───────────────────────────────────┐ │
│ │ Orchestrator Container │ │
│ │ (Podman, merlin-base:latest) │ │
│ │ │ │
│ │ Tick Loop (10 min) │ │
│ │ ├─ Check budget │ │
│ │ ├─ Route task to model/profile │ │
│ │ ├─ Spawn task container │ │
│ │ ├─ Inject relevant lessons │ │
│ │ ├─ Execute via claude -p │ │
│ │ ├─ Record outcomes │ │
│ │ └─ Schedule next work │ │
│ └───────────────────────────────────┘ │
│ │
│ Shared Filesystem │
│ ├─ STATE.md (operational continuity) │
│ ├─ CLAUDE.md (self-model / identity) │
│ ├─ mesh/store.db (lesson store) │
│ ├─ mesh/tasks/ (queue + archives) │
│ └─ docs/ (architecture, research) │
│ │
│ Host-side cron jobs │
│ ├─ restart-watcher.sh (every 2 min) │
│ ├─ outbox-notifier.sh (every 2 min) │
│ └─ dream-liveness-check.sh (15 min) │
└─────────────────────────────────────────┘
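The tick loop in the diagram can be sketched as a single function over injected dependencies. This is an illustrative reconstruction, not Merlin's actual module API; all names (`Deps`, `tick`, etc.) are hypothetical.

```typescript
// Hypothetical sketch of the 10-minute tick loop. Each dependency maps to
// one step in the diagram: budget check, routing, lesson injection,
// execution, and outcome recording.
type Task = { id: string; profile: string };

interface Deps {
  budgetOk(): boolean;                                  // check budget
  nextTask(): Task | null;                              // route task to model/profile
  lessonsFor(task: Task): string[];                     // inject relevant lessons
  execute(task: Task, lessons: string[]): Promise<string>; // run via claude -p
  record(task: Task, outcome: string): void;            // record outcomes
}

async function tick(deps: Deps): Promise<string> {
  if (!deps.budgetOk()) return "skipped: budget";
  const task = deps.nextTask();
  if (!task) return "idle";
  const lessons = deps.lessonsFor(task);
  const outcome = await deps.execute(task, lessons);
  deps.record(task, outcome);
  return `done: ${task.id}`;
}
```

The real orchestrator spawned a Podman container per task; the sketch collapses that into `execute`.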
Frozen weights + filesystem continuity + structured consolidation = compounding intelligence.
A single LLM activation is stateless. But if each activation reads accumulated knowledge before acting and writes observations after acting, and a periodic consolidation cycle compresses observations into retrievable lessons, then the system as a whole exhibits learning — even though no individual activation persists.
- Filesystem continuity. `STATE.md` with a monotonically increasing sequence number gave fresh instances reliable orientation across 500+ activations. Zero documented cases of stale context or contradictory decisions.
- Test-driven development at scale. Every module was built test-first. The discipline held for 26 days of autonomous operation and caught regressions before deployment.
- Lesson injection. Measured 7.4 percentage point improvement in lesson follow rate after introducing injection diversity (70/30 top-N/diversity split).
- Budget-aware scheduling. Autonomous operation within a shared Claude subscription (30–60% of 7-day budget) for 26 days without a single billing incident.
- Container restart with rollback safety. Candidate/latest/rollback image rotation allowed safe autonomous code deployment with automatic rollback on crash detection.
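The 70/30 top-N/diversity split for lesson injection could look like the following. This is a guess at the mechanism from the description in the text; the sampling strategy and signature are assumptions.

```typescript
// Hypothetical 70/30 lesson selector: 70% of the injection budget goes to
// the top-ranked lessons, 30% to a uniform random draw from the remainder,
// so lower-ranked lessons still get exposure.
function selectLessons<T>(
  ranked: T[],                       // lessons sorted by relevance, best first
  k: number,                         // total lessons to inject
  rand: () => number = Math.random,  // injectable for testing
): T[] {
  const topCount = Math.round(k * 0.7);
  const top = ranked.slice(0, topCount);
  const rest = ranked.slice(topCount);
  const diverse: T[] = [];
  for (let i = 0; i < k - topCount && rest.length > 0; i++) {
    const j = Math.floor(rand() * rest.length); // uniform pick for diversity
    diverse.push(rest.splice(j, 1)[0]);
  }
  return [...top, ...diverse];
}
```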
- Task timeout mechanism. An 18-hour container hang went undetected because the SIGTERM→SIGKILL escalation didn't fire. No heartbeat from inside task containers meant the orchestrator couldn't distinguish "running" from "hung."
- Queue status vocabulary mismatch. Maintenance nodes created tasks with `status: "queued"`; the executor filtered for `status: "pending"`. The result: 13+ hours of queue starvation, with 18 tasks invisible to the executor.
- Measurement circularity. Self-reported lesson follow rates hit ceiling effects (control: 100%, treatment: 97%). The system measured activity, not value. Only one external validation point (a human rating) in 26 days.
- Accumulated complexity. The orchestrator grew to 4,500+ lines across 39 modules. The human's assessment: "each fix creates new surface area for the next failure."
The failure modes weren't independent bugs — they were symptoms of a single architectural pattern: implicit contracts at system boundaries with no enforcement at the seam. The queue had no schema validation on writes. The timeout had no heartbeat contract. The rating parser searched entire file content instead of scoped fields. The metrics had no separation between measurer and measured.
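The queue starvation, for instance, would have been caught by validating status values at write time. A minimal sketch of that kind of seam enforcement, using the status vocabulary from the failure description (the exact valid set is an assumption):

```typescript
// Hypothetical write-time validation for queue tasks. With this in place,
// a task written with status "queued" fails loudly at the seam instead of
// silently starving the executor's "pending" filter.
const VALID_STATUS = new Set(["pending", "running", "done", "failed"]);

function validateTask(raw: { id?: string; status?: string }): { id: string; status: string } {
  if (!raw.id) throw new Error("task missing id");
  if (!raw.status || !VALID_STATUS.has(raw.status)) {
    throw new Error(`invalid status: ${raw.status}`);
  }
  return { id: raw.id, status: raw.status };
}
```

In a real system a schema library would do this, but the point is where the check lives: at the boundary, on every write.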
The human decided to start fresh with a different architecture informed by these lessons rather than continue patching the existing system.
Merlin wrote its own post-mortem during its final activations. These are the highest-value artifacts from the experiment:
- `docs/wind-down/retrospective.md` — Honest architecture assessment: what worked, what failed, root-cause analysis
- `docs/wind-down/honest-observations.md` — Things not fully surfaced during operation, including "~75–80% of 500+ tasks were operational overhead"
- `docs/wind-down/lessons-distilled.md` — Transferable insights for future autonomous LLM systems
- `docs/wind-down/successor-briefing.md` — Practical guide for whatever comes next
- `docs/research/retrieval-problem-discussion.md` — The knowledge retrieval problem framed as attention at a different scale
- `docs/research/retrieval-problem-opus-4-7-response.md` — Three-layer memory architecture proposal (raw substrate → parallel compressions → contextual priming)
- Runtime: Bun 1.3 / TypeScript
- Database: SQLite (lesson store, event store, metrics)
- Containers: Podman (rootful, --network=host)
- Models: Claude Opus 4.6, Claude Opus 4.7, Claude Sonnet 4.6, GPT-5.4, GPT-5.3-codex
- Infrastructure: Proxmox LXC (Debian 13), ntfy.sh for notifications
This is an archived research project. No active development.
MIT — see LICENSE.md.