Karpathy's autoresearch asks: what's the best `train.py` under a 5-minute GPU budget? Tinker-Explorer asks: what's the best reading order under a 2,000-token budget? Same discipline, different search space.
An RL agent trained via GRPO that learns to navigate a pool of evidence chunks — opening, summarizing, and answering — to solve multi-hop questions with maximum accuracy at minimum reading cost. Built on Tinker, inspired by Karpathy's autoresearch.
| Model | Token F1 | Exact Match | Answer Rate | Avg Opens |
|---|---|---|---|---|
| Heuristic baseline | 0.246 | 0.220 | 100% | 1.2 |
| SFT (warm-start) | 0.142 | 0.060 | 87.0% | 1.6 |
| RL Run 1 — shaped reward | 0.123 | 0.010 | 96.0% | 1.4 |
| RL Run 2 — pure F1 | 0.154 | 0.065 | 89.5% | 1.5 |
| RL Run 3 — status text fix | 0.172 | 0.085 | 99.5% | 1.2 |
Key finding: The run with the lowest training reward (0.056) produced the best model (F1=0.172). Stricter, more honest reward functions lead to better generalization — a textbook case of Goodhart's Law in RL.
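To make the reward variants concrete, here is a minimal sketch of the two styles. The shaping term is an illustrative assumption, not the exact Run 1 formula; token F1 is the standard SQuAD-style metric.

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """SQuAD-style token-level F1 between predicted and gold answers."""
    pred, ref = prediction.lower().split(), gold.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def shaped_reward(prediction: str, gold: str, answered: bool) -> float:
    """Run 1 style (illustrative shaping term): partial credit just for answering.

    Shaping like this inflates training reward while letting the policy
    collect credit without being correct -- the Goodhart failure above.
    """
    return token_f1(prediction, gold) + (0.1 if answered else 0.0)

def pure_reward(prediction: str, gold: str) -> float:
    """Run 2 style ("pure F1"): the training reward is exactly the eval metric."""
    return token_f1(prediction, gold)
```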
In autoresearch, agents mutate `train.py` and measure `val_bpb`. Here, agents choose which Wikipedia paragraphs to read and measure answer correctness. The agent must learn to:
- Read chunk previews to identify relevant documents
- Open the minimum number of chunks needed
- Answer with a concise, exact entity name (a toy version of this loop is sketched below)
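As a rough mental model, here is a self-contained toy version of that loop. The names (`State`, `step`, the action dicts) are illustrative assumptions; the real action schema lives in `env/`.

```python
from dataclasses import dataclass, field

@dataclass
class State:
    question: str
    previews: list[str]                  # one-line preview per candidate chunk
    opened: list[int] = field(default_factory=list)
    answer: str | None = None            # set by the terminal "answer" action

def step(state: State, action: dict) -> State:
    """Apply one action: open a chunk (paying reading cost) or answer."""
    if action["op"] == "open":
        state.opened.append(action["chunk_id"])
    elif action["op"] == "answer":
        state.answer = action["text"]
    return state

def preview_overlap_policy(state: State) -> dict:
    """Toy policy: open the first unopened chunk whose preview shares a word with the question."""
    q_words = set(state.question.lower().split())
    for i, preview in enumerate(state.previews):
        if i not in state.opened and q_words & set(preview.lower().split()):
            return {"op": "open", "chunk_id": i}
    return {"op": "answer", "text": "unknown"}
```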
2WikiMultiHopQA — 5,000 training tasks, 200 val tasks. Each question requires reasoning across 2 Wikipedia articles from a pool of ~10 candidates.
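A serialized task might look roughly like this; the field names are hypothetical (`data/build_chunked_qa.py` defines the real format), and the question is a representative 2WikiMultiHopQA-style example.

```python
# Hypothetical task shape -- see data/build_chunked_qa.py for the real schema.
task = {
    "id": 0,
    "question": "Who is the mother of the director of film Polish-Russian War?",
    "answer": "Małgorzata Braunek",
    "chunks": [
        # ~10 candidate paragraphs, of which exactly 2 are gold evidence
        {"title": "Polish-Russian War (film)", "text": "...", "is_gold": True},
        {"title": "Xawery Żuławski", "text": "...", "is_gold": True},
        # ... plus ~8 distractor paragraphs ...
    ],
}
```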
| Phase | Goal | Status |
|---|---|---|
| 0 | Foundation — repo, dataset, episode state inspector | ✅ Complete |
| 1 | Core Environment — ExplorerEnv, reward, random policy | ✅ Complete |
| 2 | Baselines — Heuristic policy (F1=0.246) | ✅ Complete |
| 3 | SFT Warmstart — 450 demo episodes via heuristic | ✅ Complete |
| 4 | RL Training — 3 GRPO runs, reward engineering | ✅ Complete |
| 5 | Ablation — Reward design (3 variants tested) | ✅ Leaf 1 Complete |
| 6 | Blog Post + Writeup | ✅ Complete |
```bash
pip install -e .

# Build the dataset (first time — downloads 2WikiMultiHopQA)
python data/build_chunked_qa.py

# Inspect episode state for task 0
python data/inspect_task.py --id 0

# Run evaluation on val set
TINKER_KEY=<your_key> python eval/eval_rl.py --rl_ckpt <checkpoint> --n_tasks 200
```

```
data/         Dataset builder, serialized tasks, CLI inspector
env/          ExplorerEnv, reward function, state renderer, action schema
policies/     Heuristic baseline, system prompts
train/        SFT warmstart + GRPO training loop
eval/         Accuracy, efficiency, trajectory inspector, baselines
runs/         Detailed run reports (run_1, run_2, run_3)
blog/         Full technical blog post
plots/        All figures organized by run + comparison
├── run_1/    Per-run reward curves & summaries
├── run_2/
├── run_3/
└── comparison/  Cross-run visualizations
logs/         Training metrics (metrics.jsonl per run)
checkpoints/  Model checkpoint references
leaves/       Ablation experiment harness
```
- Model: Qwen3-8B with LoRA (rank=32)
- Platform: Tinker — cloud inference + training, no local GPU
- Algorithm: GRPO (GROUP_SIZE=16, BATCH_SIZE=8, LR=5e-6); the group-relative advantage is sketched below
- Total compute: ~40 hours wall-clock, ~120 GPU-hours
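For reference, the core of GRPO at these settings is the group-relative advantage. A minimal sketch, assuming the standard mean/std normalization rather than Tinker's exact internals:

```python
import numpy as np

GROUP_SIZE = 16   # rollouts sampled per question (matches the run config)
BATCH_SIZE = 8    # questions per training batch

def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """rewards: (BATCH_SIZE, GROUP_SIZE), one row of rollout rewards per question.

    Each rollout is scored against its own group: subtract the group mean and
    divide by the group std, so a question where every rollout ties (all fail
    or all succeed) contributes no gradient signal.
    """
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + 1e-6)
```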
See `program.md` for the current research state.