Tinker-Explorer

Karpathy's autoresearch asks: what's the best train.py under a 5-minute GPU budget? Tinker-Explorer asks: what's the best reading order under a 2,000-token budget? Same discipline, different search space.

An RL agent, trained with GRPO, that learns to navigate a pool of evidence chunks — opening, summarizing, answering — to solve multi-hop questions with maximum accuracy at minimum reading cost. Built on Tinker, inspired by Karpathy's autoresearch.


Results

| Model | Token F1 | Exact Match | Answer Rate | Avg Opens |
|---|---|---|---|---|
| Heuristic baseline | 0.246 | 0.220 | 100% | 1.2 |
| SFT (warm-start) | 0.142 | 0.060 | 87.0% | 1.6 |
| RL Run 1 — shaped reward | 0.123 | 0.010 | 96.0% | 1.4 |
| RL Run 2 — pure F1 | 0.154 | 0.065 | 89.5% | 1.5 |
| RL Run 3 — status text fix | 0.172 | 0.085 | 99.5% | 1.2 |

Key finding: the run with the lowest training reward (0.056) produced the best model (F1 = 0.172). The shaped reward was easier to exploit, so a high training score did not translate into held-out accuracy; the stricter, more honest reward generalized better — a textbook case of Goodhart's Law in RL.
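
For concreteness, here is a minimal sketch of a pure token-F1 reward in the style of Runs 2 and 3 (Run 1's shaped variant added intermediate bonuses on top). Function names are illustrative; the actual implementation lives in env/.

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """SQuAD-style token F1 between predicted and gold answer strings."""
    pred_toks = prediction.lower().split()
    gold_toks = gold.lower().split()
    if not pred_toks or not gold_toks:
        # Both empty -> perfect match; exactly one empty -> zero.
        return float(pred_toks == gold_toks)
    common = Counter(pred_toks) & Counter(gold_toks)  # multiset intersection
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_toks)
    recall = num_same / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

def episode_reward(answer: str | None, gold: str) -> float:
    """Pure-F1 reward (Run 2/3 style): score only the final answer,
    with no shaping bonus for opening the right chunks."""
    return 0.0 if answer is None else token_f1(answer, gold)
```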

📝 Read the full blog post in blog/ →


The Core Idea

In autoresearch, agents mutate train.py and measure val_bpb. Here, agents choose which Wikipedia paragraphs to read and measure answer correctness. The agent must learn to (see the sketch after this list):

  1. Read chunk previews to identify relevant documents
  2. Open the minimum number of chunks needed
  3. Answer with a concise, exact entity name
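
A minimal sketch of the episode state behind those three steps, with hypothetical names (the real ExplorerEnv in env/ additionally renders this state into the prompt and enforces the token budget):

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    doc_title: str
    preview: str   # short preview, shown to the agent for free
    text: str      # full paragraph, counted against the reading budget

@dataclass
class EpisodeState:
    question: str
    chunks: list[Chunk]                        # pool of ~10 candidate chunks
    opened: set[int] = field(default_factory=set)
    notes: list[str] = field(default_factory=list)
    tokens_read: int = 0
    answer: str | None = None                  # set by the terminal answer action

    def open(self, i: int) -> str:
        """Open chunk i: pay its token cost once, get the full text back."""
        if i not in self.opened:
            self.opened.add(i)
            self.tokens_read += len(self.chunks[i].text.split())
        return self.chunks[i].text

    def summarize(self, note: str) -> None:
        """Record a short scratchpad note that persists across steps."""
        self.notes.append(note)

    def submit(self, answer: str) -> None:
        """Terminal action: commit a concise entity answer for F1 scoring."""
        self.answer = answer
```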

Dataset

2WikiMultiHopQA — 5,000 training tasks, 200 val tasks. Each question requires reasoning across 2 Wikipedia articles from a pool of ~10 candidates.
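
A serialized task might look roughly like this (a hypothetical shape; the exact schema is whatever data/build_chunked_qa.py emits):

```python
# Hypothetical task record; field names are illustrative, not the
# exact schema produced by data/build_chunked_qa.py.
task = {
    "id": 0,
    "question": "Who is the maternal grandfather of ...?",  # multi-hop question
    "answer": "...",                        # gold entity string for F1 / EM
    "chunks": [                             # pool of ~10 candidate paragraphs
        {"title": "...", "preview": "...", "text": "..."},
        # ...
    ],
    "gold_chunk_ids": [2, 7],               # the 2 chunks actually needed
}
```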

Project Phases

| Phase | Goal | Status |
|---|---|---|
| 0 | Foundation — repo, dataset, episode state inspector | ✅ Complete |
| 1 | Core Environment — ExplorerEnv, reward, random policy | ✅ Complete |
| 2 | Baselines — heuristic policy (F1 = 0.246) | ✅ Complete |
| 3 | SFT Warm-start — 450 demo episodes via heuristic | ✅ Complete |
| 4 | RL Training — 3 GRPO runs, reward engineering | ✅ Complete |
| 5 | Ablation — reward design (3 variants tested) | ✅ Leaf 1 Complete |
| 6 | Blog Post + Writeup | ✅ Complete |

Quickstart

pip install -e .

# Build the dataset (first time — downloads 2WikiMultiHopQA)
python data/build_chunked_qa.py

# Inspect episode state for task 0
python data/inspect_task.py --id 0

# Run evaluation on val set
TINKER_KEY=<your_key> python eval/eval_rl.py --rl_ckpt <checkpoint> --n_tasks 200

Structure

data/          Dataset builder, serialized tasks, CLI inspector
env/           ExplorerEnv, reward function, state renderer, action schema
policies/      Heuristic baseline, system prompts
train/         SFT warmstart + GRPO training loop
eval/          Accuracy, efficiency, trajectory inspector, baselines
runs/          Detailed run reports (run_1, run_2, run_3)
blog/          Full technical blog post
plots/         All figures organized by run + comparison
  ├── run_1/     Per-run reward curves & summaries
  ├── run_2/
  ├── run_3/
  └── comparison/ Cross-run visualizations
logs/          Training metrics (metrics.jsonl per run)
checkpoints/   Model checkpoint references
leaves/        Ablation experiment harness

Training Infrastructure

  • Model: Qwen3-8B with LoRA (rank=32)
  • Platform: Tinker — cloud inference + training, no local GPU
  • Algorithm: GRPO (GROUP_SIZE=16, BATCH_SIZE=8, LR=5e-6); the group-relative advantage is sketched below
  • Total compute: ~40 hours wall-clock, ~120 GPU-hours
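
The heart of GRPO is that advantages are computed relative to each task's own group of rollouts rather than via a learned critic. Below is a minimal sketch of one common variant under the settings above (some implementations center without dividing by the std; the real loop in train/ also handles sampling, tokenization, and the policy-gradient update itself):

```python
import numpy as np

GROUP_SIZE = 16  # rollouts sampled per task, matching the setting above

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantages: z-score each rollout's reward against
    the mean/std of its own group, so hard and easy tasks contribute
    comparable learning signal."""
    groups = rewards.reshape(-1, GROUP_SIZE)        # (n_tasks, GROUP_SIZE)
    mean = groups.mean(axis=1, keepdims=True)
    std = groups.std(axis=1, keepdims=True)
    return ((groups - mean) / (std + eps)).reshape(-1)
```

Each token of a rollout is then weighted by its group-relative advantage in the policy-gradient step.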

See program.md for the current research state.
