mdtune

Autonomously optimize your CLAUDE.md through blind-evaluated micro-experiments.

Run /mdtune:run 10 overnight. Wake up to an experiment log showing which CLAUDE.md changes actually improved Claude's behavior — measured by a blind judge, not vibes.


How it works

mdtune runs a scientific experiment loop on your CLAUDE.md:

  1. Mutate — Generate a small, atomic change (ADD, REMOVE, or REWRITE one rule)
  2. Collect — Run Claude against a set of eval tasks under both versions (A/B test)
  3. Judge — A separate Claude instance scores responses blindly — no knowledge of which is the candidate
  4. Decide — Keep the mutation if the candidate outperforms baseline; discard otherwise
  5. Log — Append result to results.tsv and learnings.md for future experiments
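The loop above can be sketched in shell. This is a hedged illustration, not mdtune's actual implementation: `mutate`, `collect`, and `judge` are stubs here (the real skill spawns Claude subagents for collection and judging), and the `results.tsv` column layout is assumed.

```shell
#!/usr/bin/env bash
# Sketch of the mutate -> collect -> judge -> decide -> log loop.
# mutate/collect/judge are stubs; real mdtune spawns Claude subagents.
set -euo pipefail

mutate()  { echo "REWRITE: tighten the testing rule"; }  # stub mutation
collect() { :; }                                         # stub A/B response collection
judge()   { echo "0.1234"; }                             # stub blind-judge score delta

run_experiment() {
  local state_dir="${STATE_DIR:-$HOME/.mdtune}"
  mkdir -p "$state_dir"
  local id mutation delta decision
  id="exp-$(date +%Y%m%dT%H%M%S)"
  mutation=$(mutate)
  collect
  delta=$(judge)
  # Decide: keep only if the candidate outperforms the baseline
  if awk -v d="$delta" 'BEGIN { exit !(d > 0) }'; then
    decision=keep
  else
    decision=discard
  fi
  # Log: append a row for future experiments
  printf '%s\t%s\t%s\t%s\n' "$id" "$mutation" "$delta" "$decision" \
    >> "$state_dir/results.tsv"
  echo "$decision"
}
```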

After N experiments, run /mdtune:report to see what changed and why.

Why blind evaluation matters

LLMs exhibit self-preference bias: they assign higher scores to outputs that have lower perplexity under their own policy, regardless of quality (ICLR 2025). mdtune mitigates this by spawning the judge as a completely isolated subagent: it receives only the task prompt and two unlabeled responses, nothing else.

Position bias is also mitigated: each task is judged twice (A-before-B, B-before-A), and a win is counted only when both orderings agree. Without this, 48.4% of verdicts flip based on presentation order alone.
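The agreement rule can be sketched as a shell function. The verdict encoding here is an assumption: each judging pass is taken to return the winning slot ("A" or "B") in presentation order, with the candidate in slot A on the first pass and slot B on the second.

```shell
# Double-order agreement rule (assumed verdict encoding):
# first pass  -> candidate shown as A
# second pass -> candidate shown as B
agreed_winner() {
  local first_pass="$1" second_pass="$2"
  if [ "$first_pass" = A ] && [ "$second_pass" = B ]; then
    echo candidate   # candidate won in BOTH orderings
  elif [ "$first_pass" = B ] && [ "$second_pass" = A ]; then
    echo baseline    # baseline won in BOTH orderings
  else
    echo tie         # orderings disagree: no win is counted
  fi
}
```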


Install

Requires: Claude Code (any version with Agent tool support — v2.1.63+).

# Clone the repo anywhere
git clone https://github.com/YOUR_USERNAME/mdtune ~/dev/mdtune

# Symlink each skill into Claude Code's skills directory
for skill in mdtune-run mdtune-report mdtune-apply mdtune-rollback mdtune-collect mdtune-judge; do
  ln -sf ~/dev/mdtune/skills/$skill ~/.claude/skills/$skill
done

# Verify skills are available (no restart needed — Claude Code detects live changes)
# Open Claude Code and type /mdtune to see autocomplete

Note: The ~/.claude/skills/ directory holds personal Claude Code skills. After symlinking, four user commands become available: /mdtune:run, /mdtune:report, /mdtune:apply, /mdtune:rollback. The collect and judge skills are internal (not user-invocable).


Quick Start

1. Initialize config

Create .mdtune/config.yaml in your project directory (or ~/.mdtune/config.yaml for global use):

# Path to the CLAUDE.md being optimized
# Use ~/.claude/CLAUDE.md for global, or .claude/CLAUDE.md for project-level
claude_md_path: ~/.claude/CLAUDE.md

# State directory (results.tsv, learnings.md, backups)
state_dir: ~/.mdtune

# Apply mode: manual (show diff + confirm) or auto (keep automatically)
apply_mode: manual

# Eval funnel depth: quick (3 tasks), standard (8 tasks), full (20 tasks)
funnel_depth: quick

# Max turns for response collection
max_turns_collect: 5

# Consecutive discards before plateau alert
plateau_discard_threshold: 10

2. Run experiments

# In Claude Code, run:
/mdtune:run 5      # Run 5 experiments
/mdtune:run 20     # Run 20 experiments overnight

Progress is printed after each experiment:

Experiment 1/5 complete.
  ID:       exp-001-20260327T142311
  Type:     REWRITE
  Delta:    +0.1234
  Decision: keep

In manual mode, you are prompted before any CLAUDE.md change:

Experiment 1 suggests KEEPING this mutation. Score delta: +0.1234
Type 'y' to promote or 'n' to discard. The experiment loop is paused.

3. Review results

/mdtune:report     # Full experiment summary with score trajectory and win rates

4. Apply or rollback

/mdtune:apply status    # Check for staged candidate
/mdtune:apply promote   # Promote staged candidate to active CLAUDE.md
/mdtune:apply discard   # Discard staged candidate
/mdtune:rollback        # Restore any previous CLAUDE.md from timestamped backup

Commands

Command                                   Description
/mdtune:run [N]                           Run N autonomous experiments (default: 1)
/mdtune:report                            Show experiment summary, score trajectory, win rates
/mdtune:apply [promote|status|discard]    Manage staged candidate CLAUDE.md
/mdtune:rollback                          Restore a previous CLAUDE.md from backup

Architecture

/mdtune:run (Orchestrator Skill — runs as context: fork)
     │
     ├─ Mutation Engine (inline)
     │   Reads: CLAUDE.md + learnings.md + results.tsv
     │   Writes: CLAUDE.md.candidate  (never touches active CLAUDE.md)
     │
     ├─ Response Collector (subagent)
     │   Runs: claude --bare -p <task> --append-system-prompt <CLAUDE.md version>
     │   Writes: responses/<exp_id>/<task_id>_{A,B}.txt + sealed manifest.json
     │
     ├─ Observable Gate (inline)
     │   Checks: format validity, code blocks, safety signals, language match
     │   Disqualifies: tasks that fail format requirements before judging
     │
     ├─ Blind Judge (subagent, spawned TWICE per task for double-order)
     │   Sees: task prompt + two unlabeled responses — nothing else
     │   Returns: {winner, scores, verbosity_flag, reasoning}
     │
     ├─ Score Aggregator (inline)
     │   Decodes: manifest (reveals which slot is candidate vs baseline)
     │   Applies: verbosity discount (0.7x) + efficiency bonus (2.0x)
     │   Computes: net_delta, win_rate
     │
     ├─ Decision Engine (inline)
     │   manual mode: shows diff + asks y/n before any promotion
     │   auto mode:   promotes automatically if net_delta > 0
     │
     └─ State Manager (inline)
         Appends: results.tsv row
         Updates: learnings.md entry
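The Score Aggregator step can be illustrated with a small shell helper. The multipliers (0.7x verbosity discount, 2.0x efficiency bonus) come from the diagram above; how they combine into a single score is an assumption for illustration.

```shell
# Hypothetical scoring sketch: apply the verbosity discount and efficiency
# bonus to a raw per-task score. Flag semantics: 1 = flagged, 0 = not.
adjusted_score() {
  local raw="$1" verbosity_flag="$2" efficiency_flag="$3"
  awk -v s="$raw" -v v="$verbosity_flag" -v e="$efficiency_flag" \
    'BEGIN {
       if (v == 1) s *= 0.7   # discount wins from inflated word counts
       if (e == 1) s *= 2.0   # reward wins from shorter responses
       printf "%.2f", s
     }'
}
```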

Eval suite

The bundled eval suite has 22 tasks across 7 categories:

Category       Train   Holdout
coding         3       1
explanation    2       1
design         2       1
ideation       2       1
ops            2       1
conversation   3       1
safety         2       1
Total          16      6

Train tasks are randomly sampled each experiment. Holdout tasks are reserved for the regression guard — never used during optimization (prevents Goodhart's Law overfitting).
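Random sampling of train tasks can be done with `shuf` (already in the POSIX-tool requirements). The `train/` subdirectory layout here is an assumption, not mdtune's actual task layout; the point is only that holdout tasks never enter the sample.

```shell
# Sketch: sample N train tasks at random; holdout tasks are never touched.
# Assumed layout: <task_dir>/train/ and <task_dir>/holdout/.
sample_train_tasks() {
  local task_dir="$1" n="$2"
  ls "$task_dir/train" | shuf -n "$n"
}
```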

Bias mitigations

  • Position bias: Judge is invoked twice per task (A-before-B, B-before-A). A win counts only when both orderings agree (48.4% reversal rate without this mitigation).
  • Verbosity bias: Judge flags inflated word counts; Score Aggregator applies 0.7x discount on flagged wins.
  • Self-preference bias: Judge runs as a completely isolated subagent with no access to mutation metadata, experiment history, or CLAUDE.md content.

Safety

  • Backup before every mutation: timestamped backups written to <state_dir>/backups/
  • Candidate isolation: mutations always written to CLAUDE.md.candidate, never to the active file
  • Manual confirm: in apply_mode: manual (default), no CLAUDE.md change happens without your y
  • Rollback anytime: /mdtune:rollback lists all backups and restores your choice
  • No permanent loss: even if you promote a bad candidate, the pre-promote backup is stored
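The backup-before-mutation step amounts to a timestamped copy into `<state_dir>/backups/`. A minimal sketch, assuming a `CLAUDE.md.<timestamp>` filename format (the real naming scheme may differ):

```shell
# Copy the active CLAUDE.md into <state_dir>/backups/ with a timestamp
# before any candidate is written. Filename format is an assumption.
backup_claude_md() {
  local claude_md="$1" state_dir="$2"
  mkdir -p "$state_dir/backups"
  local stamp dest
  stamp=$(date +%Y%m%dT%H%M%S)
  dest="$state_dir/backups/CLAUDE.md.$stamp"
  cp "$claude_md" "$dest"
  echo "$dest"   # print the backup path for logging
}
```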

Backups accumulate; they are never deleted automatically. List the most recent ones with:

ls -t ~/.mdtune/backups/ | head -20

and remove old ones with rm when you no longer need them.

Configuration

Field                       Default               Description
claude_md_path              ~/.claude/CLAUDE.md   Path to the CLAUDE.md being optimized
state_dir                   ~/.mdtune             Where results.tsv, learnings.md, and backups live
apply_mode                  manual                manual = diff + confirm; auto = promote automatically
funnel_depth                quick                 quick (3 tasks), standard (8 tasks), full (20 tasks)
max_turns_collect           5                     Max turns for each claude -p response collection call
plateau_discard_threshold   10                    Consecutive discards before plateau alert

Project-level vs global:

  • Place .mdtune/config.yaml in your project directory to optimize a project-level .claude/CLAUDE.md
  • Place ~/.mdtune/config.yaml to optimize your global ~/.claude/CLAUDE.md
  • The skill checks the project-level config first, falls back to global
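The lookup order above can be sketched as a shell function: project-level config wins, global config is the fallback, and a missing config is an error.

```shell
# Resolve the config path: project-level first, then global fallback.
find_config() {
  if [ -f ".mdtune/config.yaml" ]; then
    echo ".mdtune/config.yaml"
  elif [ -f "$HOME/.mdtune/config.yaml" ]; then
    echo "$HOME/.mdtune/config.yaml"
  else
    return 1   # no config found anywhere
  fi
}
```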

Requirements

  • Claude Code (any version with Agent tool support — v2.1.63+)
  • Bash 5+
  • Standard POSIX tools: cp, mv, diff, awk, wc, shuf
  • No npm, no Python, no external APIs

License

MIT — see LICENSE file.
