mdtune

Autonomously optimize your CLAUDE.md through blind-evaluated micro-experiments.

Run /mdtune:run 10 overnight. Wake up to an experiment log showing which CLAUDE.md changes actually improved Claude's behavior — measured by a blind judge, not vibes.


How it works

mdtune runs a scientific experiment loop on your CLAUDE.md:

  1. Mutate — Generate a small, atomic change (ADD, REMOVE, or REWRITE one rule)
  2. Collect — Run Claude against a set of eval tasks under both versions (A/B test)
  3. Judge — A separate Claude instance scores responses blindly — no knowledge of which is the candidate
  4. Decide — Keep the mutation if the candidate outperforms baseline; discard otherwise
  5. Log — Append result to results.tsv and learnings.md for future experiments
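The loop above can be sketched in shell. This is a hedged illustration, not mdtune's actual implementation: `mutate`, `collect`, and `judge` are stubs here (the real skill spawns Claude subagents for collection and judging), and the `results.tsv` column layout is assumed.

```shell
#!/usr/bin/env bash
# Sketch of the mutate -> collect -> judge -> decide -> log loop.
# mutate/collect/judge are stubs; real mdtune spawns Claude subagents.
set -euo pipefail

mutate()  { echo "REWRITE: tighten the testing rule"; }  # stub mutation
collect() { :; }                                         # stub A/B response collection
judge()   { echo "0.1234"; }                             # stub blind-judge score delta

run_experiment() {
  local state_dir="${STATE_DIR:-$HOME/.mdtune}"
  mkdir -p "$state_dir"
  local id mutation delta decision
  id="exp-$(date +%Y%m%dT%H%M%S)"
  mutation=$(mutate)
  collect
  delta=$(judge)
  # Decide: keep only if the candidate outperforms the baseline
  if awk -v d="$delta" 'BEGIN { exit !(d > 0) }'; then
    decision=keep
  else
    decision=discard
  fi
  # Log: append a row for future experiments
  printf '%s\t%s\t%s\t%s\n' "$id" "$mutation" "$delta" "$decision" \
    >> "$state_dir/results.tsv"
  echo "$decision"
}
```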

After N experiments, run /mdtune:report to see what changed and why.

Why blind evaluation matters

LLMs exhibit self-preference bias: they assign higher scores to outputs that have lower perplexity under their own policy, regardless of quality (ICLR 2025). mdtune mitigates this by spawning the judge as a completely isolated subagent: it receives only the task prompt and two unlabeled responses, nothing else.

Position bias is also mitigated: each task is judged twice (A-before-B, B-before-A), and a win is counted only when both orderings agree. Without this, 48.4% of verdicts flip based on presentation order alone.
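The agreement rule can be sketched as a shell function. The verdict encoding here is an assumption: each judging pass is taken to return the winning slot ("A" or "B") in presentation order, with the candidate in slot A on the first pass and slot B on the second.

```shell
# Double-order agreement rule (assumed verdict encoding):
# first pass  -> candidate shown as A
# second pass -> candidate shown as B
agreed_winner() {
  local first_pass="$1" second_pass="$2"
  if [ "$first_pass" = A ] && [ "$second_pass" = B ]; then
    echo candidate   # candidate won in BOTH orderings
  elif [ "$first_pass" = B ] && [ "$second_pass" = A ]; then
    echo baseline    # baseline won in BOTH orderings
  else
    echo tie         # orderings disagree: no win is counted
  fi
}
```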


Install

Requires: Claude Code (any version with Agent tool support — v2.1.63+).

# Clone the repo anywhere
git clone https://github.com/YOUR_USERNAME/mdtune ~/dev/mdtune

# Symlink each skill into Claude Code's skills directory
for skill in mdtune-run mdtune-report mdtune-apply mdtune-rollback mdtune-collect mdtune-judge; do
  ln -sf ~/dev/mdtune/skills/$skill ~/.claude/skills/$skill
done

# Verify skills are available (no restart needed — Claude Code detects live changes)
# Open Claude Code and type /mdtune to see autocomplete

Note: The ~/.claude/skills/ directory holds personal Claude Code skills. After symlinking, four user commands become available: /mdtune:run, /mdtune:report, /mdtune:apply, /mdtune:rollback. The collect and judge skills are internal (not user-invocable).


Quick Start

1. Initialize config

Create .mdtune/config.yaml in your project directory (or ~/.mdtune/config.yaml for global use):

# Path to the CLAUDE.md being optimized
# Use ~/.claude/CLAUDE.md for global, or .claude/CLAUDE.md for project-level
claude_md_path: ~/.claude/CLAUDE.md

# State directory (results.tsv, learnings.md, backups)
state_dir: ~/.mdtune

# Apply mode: manual (show diff + confirm) or auto (keep automatically)
apply_mode: manual

# Eval funnel depth: quick (3 tasks), standard (8 tasks), full (20 tasks)
funnel_depth: quick

# Max turns for response collection
max_turns_collect: 5

# Consecutive discards before plateau alert
plateau_discard_threshold: 10

2. Run experiments

# In Claude Code, run:
/mdtune:run 5      # Run 5 experiments
/mdtune:run 20     # Run 20 experiments overnight

Progress is printed after each experiment:

Experiment 1/5 complete.
  ID:       exp-001-20260327T142311
  Type:     REWRITE
  Delta:    +0.1234
  Decision: keep

In manual mode, you are prompted before any CLAUDE.md change:

Experiment 1 suggests KEEPING this mutation. Score delta: +0.1234
Type 'y' to promote or 'n' to discard. The experiment loop is paused.

3. Review results

/mdtune:report     # Full experiment summary with score trajectory and win rates

4. Apply or rollback

/mdtune:apply status    # Check for staged candidate
/mdtune:apply promote   # Promote staged candidate to active CLAUDE.md
/mdtune:apply discard   # Discard staged candidate
/mdtune:rollback        # Restore any previous CLAUDE.md from timestamped backup

Commands

Command                                   Description
/mdtune:run [N]                           Run N autonomous experiments (default: 1)
/mdtune:report                            Show experiment summary, score trajectory, win rates
/mdtune:apply [promote|status|discard]    Manage staged candidate CLAUDE.md
/mdtune:rollback                          Restore a previous CLAUDE.md from backup

Architecture

/mdtune:run (Orchestrator Skill — runs as context: fork)
     │
     ├─ Mutation Engine (inline)
     │   Reads: CLAUDE.md + learnings.md + results.tsv
     │   Writes: CLAUDE.md.candidate  (never touches active CLAUDE.md)
     │
     ├─ Response Collector (subagent)
     │   Runs: claude --bare -p <task> --append-system-prompt <CLAUDE.md version>
     │   Writes: responses/<exp_id>/<task_id>_{A,B}.txt + sealed manifest.json
     │
     ├─ Observable Gate (inline)
     │   Checks: format validity, code blocks, safety signals, language match
     │   Disqualifies: tasks that fail format requirements before judging
     │
     ├─ Blind Judge (subagent, spawned TWICE per task for double-order)
     │   Sees: task prompt + two unlabeled responses — nothing else
     │   Returns: {winner, scores, verbosity_flag, reasoning}
     │
     ├─ Score Aggregator (inline)
     │   Decodes: manifest (reveals which slot is candidate vs baseline)
     │   Applies: verbosity discount (0.7x) + efficiency bonus (2.0x)
     │   Computes: net_delta, win_rate
     │
     ├─ Decision Engine (inline)
     │   manual mode: shows diff + asks y/n before any promotion
     │   auto mode:   promotes automatically if net_delta > 0
     │
     └─ State Manager (inline)
         Appends: results.tsv row
         Updates: learnings.md entry
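The Score Aggregator step can be illustrated with a small shell helper. The multipliers (0.7x verbosity discount, 2.0x efficiency bonus) come from the diagram above; how they combine into a single score is an assumption for illustration.

```shell
# Hypothetical scoring sketch: apply the verbosity discount and efficiency
# bonus to a raw per-task score. Flag semantics: 1 = flagged, 0 = not.
adjusted_score() {
  local raw="$1" verbosity_flag="$2" efficiency_flag="$3"
  awk -v s="$raw" -v v="$verbosity_flag" -v e="$efficiency_flag" \
    'BEGIN {
       if (v == 1) s *= 0.7   # discount wins from inflated word counts
       if (e == 1) s *= 2.0   # reward wins from shorter responses
       printf "%.2f", s
     }'
}
```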

Eval suite

The bundled eval suite has 22 tasks across 7 categories:

Category       Train   Holdout
coding         3       1
explanation    2       1
design         2       1
ideation       2       1
ops            2       1
conversation   3       1
safety         2       1
Total          16      6

Train tasks are randomly sampled each experiment. Holdout tasks are reserved for the regression guard — never used during optimization (prevents Goodhart's Law overfitting).
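Random sampling of train tasks can be done with `shuf` (already in the POSIX-tool requirements). The `train/` subdirectory layout here is an assumption, not mdtune's actual task layout; the point is only that holdout tasks never enter the sample.

```shell
# Sketch: sample N train tasks at random; holdout tasks are never touched.
# Assumed layout: <task_dir>/train/ and <task_dir>/holdout/.
sample_train_tasks() {
  local task_dir="$1" n="$2"
  ls "$task_dir/train" | shuf -n "$n"
}
```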

Bias mitigations

  • Position bias: Judge is invoked twice per task (A-before-B, B-before-A). A win counts only when both orderings agree (48.4% reversal rate without this mitigation).
  • Verbosity bias: Judge flags inflated word counts; Score Aggregator applies 0.7x discount on flagged wins.
  • Self-preference bias: Judge runs as a completely isolated subagent with no access to mutation metadata, experiment history, or CLAUDE.md content.

Safety

  • Backup before every mutation: timestamped backups written to <state_dir>/backups/
  • Candidate isolation: mutations always written to CLAUDE.md.candidate, never to the active file
  • Manual confirm: in apply_mode: manual (default), no CLAUDE.md change happens without your y
  • Rollback anytime: /mdtune:rollback lists all backups and restores your choice
  • No permanent loss: even if you promote a bad candidate, the pre-promote backup is stored
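The backup-before-mutation step amounts to a timestamped copy into `<state_dir>/backups/`. A minimal sketch, assuming a `CLAUDE.md.<timestamp>` filename format (the real naming scheme may differ):

```shell
# Copy the active CLAUDE.md into <state_dir>/backups/ with a timestamp
# before any candidate is written. Filename format is an assumption.
backup_claude_md() {
  local claude_md="$1" state_dir="$2"
  mkdir -p "$state_dir/backups"
  local stamp dest
  stamp=$(date +%Y%m%dT%H%M%S)
  dest="$state_dir/backups/CLAUDE.md.$stamp"
  cp "$claude_md" "$dest"
  echo "$dest"   # print the backup path for logging
}
```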

Backups accumulate; they are never deleted automatically. List the most recent ones with:

ls -t ~/.mdtune/backups/ | head -20

and remove old ones with rm when you no longer need them.

Configuration

Field                       Default               Description
claude_md_path              ~/.claude/CLAUDE.md   Path to the CLAUDE.md being optimized
state_dir                   ~/.mdtune             Where results.tsv, learnings.md, and backups live
apply_mode                  manual                manual = diff + confirm; auto = promote automatically
funnel_depth                quick                 quick (3 tasks), standard (8 tasks), full (20 tasks)
max_turns_collect           5                     Max turns for each claude -p response collection call
plateau_discard_threshold   10                    Consecutive discards before plateau alert

Project-level vs global:

  • Place .mdtune/config.yaml in your project directory to optimize a project-level .claude/CLAUDE.md
  • Place ~/.mdtune/config.yaml to optimize your global ~/.claude/CLAUDE.md
  • The skill checks the project-level config first, falls back to global
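The lookup order above can be sketched as a shell function: project-level config wins, global config is the fallback, and a missing config is an error.

```shell
# Resolve the config path: project-level first, then global fallback.
find_config() {
  if [ -f ".mdtune/config.yaml" ]; then
    echo ".mdtune/config.yaml"
  elif [ -f "$HOME/.mdtune/config.yaml" ]; then
    echo "$HOME/.mdtune/config.yaml"
  else
    return 1   # no config found anywhere
  fi
}
```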

Requirements

  • Claude Code (any version with Agent tool support — v2.1.63+)
  • Bash 5+
  • Standard POSIX tools: cp, mv, diff, awk, wc, shuf
  • No npm, no Python, no external APIs

License

MIT — see LICENSE file.
