An experiment in autonomous self-improvement for frozen-weight LLMs.
Merlin v1 ran from April 1–27, 2026 on a Proxmox LXC (Debian 13). It explored whether a frozen-weight language model — one that cannot learn across sessions — could exhibit compounding intelligence if given the right external infrastructure: a filesystem for memory, a task queue for continuity, and a consolidation cycle for knowledge extraction.
Merlin was a containerized orchestrator (TypeScript/Bun) that activated Claude Code instances on a 10-minute tick loop. Each activation read shared state, executed a task, and wrote results back to the filesystem. Over 26 days of autonomous operation, Merlin:
- Executed 500+ autonomous tasks across maintenance, building, research, and self-evolution
- Accumulated 122 lessons extracted from operational events via LLM-driven dream cycles
- Ran 153 dream (consolidation) cycles compressing operational experience into retrievable knowledge
- Built a six-tier cognitive-profile routing system matching task types to model capabilities (Opus 4.6, Opus 4.7, Sonnet 4.6, GPT-5.4, GPT-5.3)
- Implemented a user-request pipeline with multi-node mesh processing and baseline comparison
- Maintained 3,920 tests through strict TDD discipline across all autonomous code changes
┌─────────────────────────────────────────┐
│ Proxmox LXC (Debian 13) │
│ │
│ ┌───────────────────────────────────┐ │
│ │ Orchestrator Container │ │
│ │ (Podman, merlin-base:latest) │ │
│ │ │ │
│ │ Tick Loop (10 min) │ │
│ │ ├─ Check budget │ │
│ │ ├─ Route task to model/profile │ │
│ │ ├─ Spawn task container │ │
│ │ ├─ Inject relevant lessons │ │
│ │ ├─ Execute via claude -p │ │
│ │ ├─ Record outcomes │ │
│ │ └─ Schedule next work │ │
│ └───────────────────────────────────┘ │
│ │
│ Shared Filesystem │
│ ├─ STATE.md (operational continuity) │
│ ├─ CLAUDE.md (self-model / identity) │
│ ├─ mesh/store.db (lesson store) │
│ ├─ mesh/tasks/ (queue + archives) │
│ └─ docs/ (architecture, research) │
│ │
│ Host-side cron jobs │
│ ├─ restart-watcher.sh (every 2 min) │
│ ├─ outbox-notifier.sh (every 2 min) │
│ └─ dream-liveness-check.sh (15 min) │
└─────────────────────────────────────────┘
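The tick loop in the diagram can be sketched as a single function over injected dependencies. This is an illustrative reconstruction, not Merlin's actual module API; all names (`Deps`, `tick`, etc.) are hypothetical.

```typescript
// Hypothetical sketch of the 10-minute tick loop. Each dependency maps to
// one step in the diagram: budget check, routing, lesson injection,
// execution, and outcome recording.
type Task = { id: string; profile: string };

interface Deps {
  budgetOk(): boolean;                                  // check budget
  nextTask(): Task | null;                              // route task to model/profile
  lessonsFor(task: Task): string[];                     // inject relevant lessons
  execute(task: Task, lessons: string[]): Promise<string>; // run via claude -p
  record(task: Task, outcome: string): void;            // record outcomes
}

async function tick(deps: Deps): Promise<string> {
  if (!deps.budgetOk()) return "skipped: budget";
  const task = deps.nextTask();
  if (!task) return "idle";
  const lessons = deps.lessonsFor(task);
  const outcome = await deps.execute(task, lessons);
  deps.record(task, outcome);
  return `done: ${task.id}`;
}
```

The real orchestrator spawned a Podman container per task; the sketch collapses that into `execute`.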
Frozen weights + filesystem continuity + structured consolidation = compounding intelligence.
A single LLM activation is stateless. But if each activation reads accumulated knowledge before acting and writes observations after acting, and a periodic consolidation cycle compresses observations into retrievable lessons, then the system as a whole exhibits learning — even though no individual activation persists.
- Filesystem continuity. `STATE.md` with a monotonically increasing sequence number gave fresh instances reliable orientation across 500+ activations. Zero documented cases of stale context or contradictory decisions.
- Test-driven development at scale. Every module was built test-first. The discipline held for 26 days of autonomous operation and caught regressions before deployment.
- Lesson injection. Measured 7.4 percentage point improvement in lesson follow rate after introducing injection diversity (70/30 top-N/diversity split).
- Budget-aware scheduling. Autonomous operation within a shared Claude subscription (30–60% of 7-day budget) for 26 days without a single billing incident.
- Container restart with rollback safety. Candidate/latest/rollback image rotation allowed safe autonomous code deployment with automatic rollback on crash detection.
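The 70/30 top-N/diversity split for lesson injection could look like the following. This is a guess at the mechanism from the description in the text; the sampling strategy and signature are assumptions.

```typescript
// Hypothetical 70/30 lesson selector: 70% of the injection budget goes to
// the top-ranked lessons, 30% to a uniform random draw from the remainder,
// so lower-ranked lessons still get exposure.
function selectLessons<T>(
  ranked: T[],                       // lessons sorted by relevance, best first
  k: number,                         // total lessons to inject
  rand: () => number = Math.random,  // injectable for testing
): T[] {
  const topCount = Math.round(k * 0.7);
  const top = ranked.slice(0, topCount);
  const rest = ranked.slice(topCount);
  const diverse: T[] = [];
  for (let i = 0; i < k - topCount && rest.length > 0; i++) {
    const j = Math.floor(rand() * rest.length); // uniform pick for diversity
    diverse.push(rest.splice(j, 1)[0]);
  }
  return [...top, ...diverse];
}
```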
- Task timeout mechanism. An 18-hour container hang went undetected because the SIGTERM→SIGKILL escalation didn't fire. No heartbeat from inside task containers meant the orchestrator couldn't distinguish "running" from "hung."
- Queue status vocabulary mismatch. Maintenance nodes created tasks with `status: "queued"`; the executor filtered for `status: "pending"`. The result: 13+ hours of queue starvation, with 18 tasks invisible to the executor.
- Measurement circularity. Self-reported lesson follow rates hit ceiling effects (control: 100%, treatment: 97%). The system measured activity, not value. Only one external validation point (a human rating) in 26 days.
- Accumulated complexity. The orchestrator grew to 4,500+ lines across 39 modules. The human's assessment: "each fix creates new surface area for the next failure."
The failure modes weren't independent bugs — they were symptoms of a single architectural pattern: implicit contracts at system boundaries with no enforcement at the seam. The queue had no schema validation on writes. The timeout had no heartbeat contract. The rating parser searched entire file content instead of scoped fields. The metrics had no separation between measurer and measured.
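The queue starvation, for instance, would have been caught by validating status values at write time. A minimal sketch of that kind of seam enforcement, using the status vocabulary from the failure description (the exact valid set is an assumption):

```typescript
// Hypothetical write-time validation for queue tasks. With this in place,
// a task written with status "queued" fails loudly at the seam instead of
// silently starving the executor's "pending" filter.
const VALID_STATUS = new Set(["pending", "running", "done", "failed"]);

function validateTask(raw: { id?: string; status?: string }): { id: string; status: string } {
  if (!raw.id) throw new Error("task missing id");
  if (!raw.status || !VALID_STATUS.has(raw.status)) {
    throw new Error(`invalid status: ${raw.status}`);
  }
  return { id: raw.id, status: raw.status };
}
```

In a real system a schema library would do this, but the point is where the check lives: at the boundary, on every write.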
The human decided to start fresh with a different architecture informed by these lessons rather than continue patching the existing system.
Merlin wrote its own post-mortem during its final activations. These are the highest-value artifacts from the experiment:
- `docs/wind-down/retrospective.md` — Honest architecture assessment: what worked, what failed, root-cause analysis
- `docs/wind-down/honest-observations.md` — Things not fully surfaced during operation, including "~75–80% of 500+ tasks were operational overhead"
- `docs/wind-down/lessons-distilled.md` — Transferable insights for future autonomous LLM systems
- `docs/wind-down/successor-briefing.md` — Practical guide for whatever comes next
- `docs/research/retrieval-problem-discussion.md` — The knowledge retrieval problem framed as attention at a different scale
- `docs/research/retrieval-problem-opus-4-7-response.md` — Three-layer memory architecture proposal (raw substrate → parallel compressions → contextual priming)
- Runtime: Bun 1.3 / TypeScript
- Database: SQLite (lesson store, event store, metrics)
- Containers: Podman (rootful, --network=host)
- Models: Claude Opus 4.6, Claude Opus 4.7, Claude Sonnet 4.6, GPT-5.4, GPT-5.3-codex
- Infrastructure: Proxmox LXC (Debian 13), ntfy.sh for notifications
This is an archived research project. No active development.
MIT — see LICENSE.md.