Autonomously optimize your CLAUDE.md through blind-evaluated micro-experiments.
Run /mdtune:run 10 overnight. Wake up to an experiment log showing which CLAUDE.md
changes actually improved Claude's behavior — measured by a blind judge, not vibes.
mdtune runs a scientific experiment loop on your CLAUDE.md:
- Mutate — Generate a small, atomic change (ADD, REMOVE, or REWRITE one rule)
- Collect — Run Claude against a set of eval tasks under both versions (A/B test)
- Judge — A separate Claude instance scores responses blindly — no knowledge of which is the candidate
- Decide — Keep the mutation if the candidate outperforms baseline; discard otherwise
- Log — Append the result to `results.tsv` and `learnings.md` for future experiments
After N experiments, run /mdtune:report to see what changed and why.
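The five-step loop above can be sketched in shell (an illustrative sketch only; `mutate`, `collect`, and `judge` are hypothetical stand-ins for the real subagent calls that the skill orchestrates internally):

```shell
# Hypothetical sketch of the mdtune experiment loop.
run_experiments() {
  local n=$1 i delta decision
  for i in $(seq 1 "$n"); do
    mutate              # write CLAUDE.md.candidate (one atomic change)
    collect "$i"        # gather A/B responses for the sampled eval tasks
    delta=$(judge "$i") # blind judge returns a net score delta
    if awk -v d="$delta" 'BEGIN { exit !(d > 0) }'; then
      decision=keep
    else
      decision=discard
    fi
    printf '%s\t%s\t%s\n' "exp-$i" "$delta" "$decision" >> results.tsv
  done
}
```

The decision rule is the same as auto mode's: keep only when the blind-judged delta is positive.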
LLMs exhibit self-preference bias: they score outputs that have lower perplexity under their own policy higher, regardless of quality (ICLR 2025). mdtune mitigates this by spawning the judge as a completely isolated subagent — it receives only the task prompt and two unlabeled responses, nothing else.
Position bias is also corrected: each task is judged twice (A-before-B, B-before-A). A win is only counted when both orderings agree. Without this, 48.4% of verdicts flip based on presentation order alone.
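The agreement rule is a tiny predicate (hypothetical helper; the real judge returns structured verdicts):

```shell
# Count a win for the candidate only if both judging orders agree.
# $1 = winner when the candidate is shown first, $2 = winner when it is
# shown second (each "candidate" or "baseline"). Prints 1 for a counted
# win, 0 otherwise.
agreed_win() {
  if [ "$1" = candidate ] && [ "$2" = candidate ]; then
    echo 1
  else
    echo 0   # disagreement, or a baseline win: no win counted
  fi
}
```

So `agreed_win candidate baseline` yields 0: a verdict that flips with presentation order is treated as noise, not signal.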
Requires: Claude Code (any version with Agent tool support — v2.1.63+).
```bash
# Clone the repo anywhere
git clone https://github.com/YOUR_USERNAME/mdtune ~/dev/mdtune

# Symlink each skill into Claude Code's skills directory
for skill in mdtune-run mdtune-report mdtune-apply mdtune-rollback mdtune-collect mdtune-judge; do
  ln -sf ~/dev/mdtune/skills/$skill ~/.claude/skills/$skill
done

# Verify skills are available (no restart needed — Claude Code detects live changes)
# Open Claude Code and type /mdtune to see autocomplete
```

Note: The `~/.claude/skills/` directory holds personal Claude Code skills. After symlinking, four user commands become available: `/mdtune:run`, `/mdtune:report`, `/mdtune:apply`, `/mdtune:rollback`. The collect and judge skills are internal (not user-invocable).
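If the commands don't appear, a quick sanity check over the same six skills (assuming the install paths used above):

```shell
# Report which mdtune skill symlinks resolve under ~/.claude/skills
check_skills() {
  local skill
  for skill in mdtune-run mdtune-report mdtune-apply mdtune-rollback mdtune-collect mdtune-judge; do
    if [ -e "$HOME/.claude/skills/$skill" ]; then
      echo "ok: $skill"
    else
      echo "missing: $skill"
    fi
  done
}
check_skills
```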
Create `.mdtune/config.yaml` in your project directory (or `~/.mdtune/config.yaml` for global use):
```yaml
# Path to the CLAUDE.md being optimized
# Use ~/.claude/CLAUDE.md for global, or .claude/CLAUDE.md for project-level
claude_md_path: ~/.claude/CLAUDE.md

# State directory (results.tsv, learnings.md, backups)
state_dir: ~/.mdtune

# Apply mode: manual (show diff + confirm) or auto (keep automatically)
apply_mode: manual

# Eval funnel depth: quick (3 tasks), standard (8 tasks), full (20 tasks)
funnel_depth: quick

# Max turns for response collection
max_turns_collect: 5

# Consecutive discards before plateau alert
plateau_discard_threshold: 10
```

```bash
# In Claude Code, run:
/mdtune:run 5    # Run 5 experiments
/mdtune:run 20   # Run 20 experiments overnight
```

Progress is printed after each experiment:

```
Experiment 1/5 complete.
  ID: exp-001-20260327T142311
  Type: REWRITE
  Delta: +0.1234
  Decision: keep
```
In manual mode, you are prompted before any CLAUDE.md change:

```
Experiment 1 suggests KEEPING this mutation. Score delta: +0.1234
Type 'y' to promote or 'n' to discard. The experiment loop is paused.
```
```bash
/mdtune:report           # Full experiment summary with score trajectory and win rates

/mdtune:apply status     # Check for staged candidate
/mdtune:apply promote    # Promote staged candidate to active CLAUDE.md
/mdtune:apply discard    # Discard staged candidate

/mdtune:rollback         # Restore any previous CLAUDE.md from timestamped backup
```

| Command | Description |
|---|---|
| `/mdtune:run [N]` | Run N autonomous experiments (default: 1) |
| `/mdtune:report` | Show experiment summary, score trajectory, win rates |
| `/mdtune:apply [promote\|status\|discard]` | Manage the staged candidate CLAUDE.md |
| `/mdtune:rollback` | Restore a previous CLAUDE.md from backup |
```
/mdtune:run (Orchestrator Skill — runs as context: fork)
│
├─ Mutation Engine (inline)
│    Reads:  CLAUDE.md + learnings.md + results.tsv
│    Writes: CLAUDE.md.candidate (never touches active CLAUDE.md)
│
├─ Response Collector (subagent)
│    Runs:   claude --bare -p <task> --append-system-prompt <CLAUDE.md version>
│    Writes: responses/<exp_id>/<task_id>_{A,B}.txt + sealed manifest.json
│
├─ Observable Gate (inline)
│    Checks: format validity, code blocks, safety signals, language match
│    Disqualifies: tasks that fail format requirements before judging
│
├─ Blind Judge (subagent, spawned TWICE per task for double-order)
│    Sees:    task prompt + two unlabeled responses — nothing else
│    Returns: {winner, scores, verbosity_flag, reasoning}
│
├─ Score Aggregator (inline)
│    Decodes:  manifest (reveals which slot is candidate vs baseline)
│    Applies:  verbosity discount (0.7x) + efficiency bonus (2.0x)
│    Computes: net_delta, win_rate
│
├─ Decision Engine (inline)
│    manual mode: shows diff + asks y/n before any promotion
│    auto mode:   promotes automatically if net_delta > 0
│
└─ State Manager (inline)
     Appends: results.tsv row
     Updates: learnings.md entry
```
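As a concrete illustration of the blinding step, the Response Collector's slot assignment could look like this (a sketch under an assumed file layout; the actual manifest schema is not specified here). The candidate and baseline responses are shuffled into unlabeled slots A/B, and only the sealed manifest records which is which:

```shell
# Randomly assign candidate/baseline to slots A and B for one task,
# recording the mapping only in a manifest the judge never sees.
seal_task() {
  local exp_id=$1 task_id=$2 candidate_resp=$3 baseline_resp=$4
  local dir="responses/$exp_id" mapping
  mkdir -p "$dir"
  if [ $((RANDOM % 2)) -eq 0 ]; then
    cp "$candidate_resp" "$dir/${task_id}_A.txt"
    cp "$baseline_resp"  "$dir/${task_id}_B.txt"
    mapping='{"A":"candidate","B":"baseline"}'
  else
    cp "$baseline_resp"  "$dir/${task_id}_A.txt"
    cp "$candidate_resp" "$dir/${task_id}_B.txt"
    mapping='{"A":"baseline","B":"candidate"}'
  fi
  printf '{"task":"%s","slots":%s}\n' "$task_id" "$mapping" >> "$dir/manifest.json"
}
```

Only the Score Aggregator later decodes the manifest, after both judging passes have returned.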
The bundled eval suite has 23 tasks across 7 categories:

| Category | Train | Holdout |
|---|---|---|
| coding | 3 | 1 |
| explanation | 2 | 1 |
| design | 2 | 1 |
| ideation | 2 | 1 |
| ops | 2 | 1 |
| conversation | 3 | 1 |
| safety | 2 | 1 |
| Total | 16 | 7 |
Train tasks are randomly sampled each experiment. Holdout tasks are reserved for the regression guard — never used during optimization (prevents Goodhart's Law overfitting).
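With `shuf` (already a dependency), per-experiment sampling is a one-liner. Here `train_tasks.txt` is a hypothetical one-ID-per-line file; the holdout file is deliberately never read by this path:

```shell
# Pick N train tasks uniformly at random for this experiment.
sample_train() {
  shuf -n "$1" train_tasks.txt
}
```

A "quick" run would call `sample_train 3`, "standard" `sample_train 8`, and so on per the `funnel_depth` setting.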
- Position bias: Judge is invoked twice per task (A-before-B, B-before-A). A win counts only when both orderings agree (48.4% reversal rate without this mitigation).
- Verbosity bias: Judge flags inflated word counts; Score Aggregator applies 0.7x discount on flagged wins.
- Self-preference bias: Judge runs as a completely isolated subagent with no access to mutation metadata, experiment history, or CLAUDE.md content.
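Scoring under these rules can be sketched with awk (a sketch only: the efficiency bonus is omitted for brevity, and the TSV layout `task_id / winner / verbosity_flag` is an assumption, not the real schema):

```shell
# Aggregate judged tasks from a TSV of: task_id, agreed winner, verbosity flag.
# A candidate win normally scores +1; a win flagged as verbose is discounted
# to 0.7; a baseline win scores -1. Prints net_delta and win_rate.
aggregate() {
  awk -F'\t' '
    $2 == "candidate" { delta += ($3 == "verbose") ? 0.7 : 1; wins++ }
    $2 == "baseline"  { delta -= 1 }
    { total++ }
    END { printf "net_delta=%.2f win_rate=%.2f\n", delta, wins / total }
  ' "$1"
}
```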
- Backup before every mutation: timestamped backups written to `<state_dir>/backups/`
- Candidate isolation: mutations always written to `CLAUDE.md.candidate`, never to the active file
- Manual confirm: in `apply_mode: manual` (default), no CLAUDE.md change happens without your `y`
- Rollback anytime: `/mdtune:rollback` lists all backups and restores your choice
- No permanent loss: even if you promote a bad candidate, the pre-promote backup is stored
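The backup step amounts to a timestamped copy (a sketch; the exact filename scheme mdtune uses is an assumption):

```shell
# Copy the active CLAUDE.md into the backup directory with a timestamp.
backup_claude_md() {
  local src=$1 backup_dir=$2
  mkdir -p "$backup_dir"
  cp "$src" "$backup_dir/CLAUDE.md.$(date +%Y%m%dT%H%M%S)"
}
```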
Backups accumulate — they are never automatically deleted. List the oldest with the command below and remove any you no longer need:

```bash
ls ~/.mdtune/backups/ | head -20
```

| Field | Default | Description |
|---|---|---|
| `claude_md_path` | `~/.claude/CLAUDE.md` | Path to the CLAUDE.md being optimized |
| `state_dir` | `~/.mdtune` | Where results.tsv, learnings.md, and backups live |
| `apply_mode` | `manual` | `manual` = diff + confirm; `auto` = promote automatically |
| `funnel_depth` | `quick` | `quick` (3 tasks), `standard` (8 tasks), `full` (20 tasks) |
| `max_turns_collect` | `5` | Max turns for each `claude -p` response collection call |
| `plateau_discard_threshold` | `10` | Consecutive discards before plateau alert |
Project-level vs global:
- Place `.mdtune/config.yaml` in your project directory to optimize a project-level `.claude/CLAUDE.md`
- Place `~/.mdtune/config.yaml` to optimize your global `~/.claude/CLAUDE.md`
- The skill checks the project-level config first, then falls back to global
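That precedence is just a two-branch lookup (illustrative sketch):

```shell
# Resolve the config path: project-level first, then global fallback.
find_config() {
  if [ -f .mdtune/config.yaml ]; then
    echo .mdtune/config.yaml
  elif [ -f "$HOME/.mdtune/config.yaml" ]; then
    echo "$HOME/.mdtune/config.yaml"
  else
    return 1   # no config found
  fi
}
```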
- Claude Code (any version with Agent tool support — v2.1.63+)
- Bash 5+
- Standard POSIX tools: `cp`, `mv`, `diff`, `awk`, `wc`, `shuf`
- No npm, no Python, no external APIs
MIT — see LICENSE file.