Karpathy's autoresearch asks: what's the best `train.py` under a 5-minute GPU budget? Tinker-Explorer asks: what's the best reading order under a 2,000-token budget? Same discipline, different search space.
An RL agent trained via GRPO that learns to navigate a pool of evidence chunks — opening, summarizing, and answering — to solve multi-hop questions with maximum accuracy at minimum reading cost. Built on Tinker, inspired by Karpathy's autoresearch.
| Model | Token F1 | Exact Match | Answer Rate | Avg Opens |
|---|---|---|---|---|
| Heuristic baseline | 0.246 | 0.220 | 100% | 1.2 |
| SFT (warm-start) | 0.142 | 0.060 | 87.0% | 1.6 |
| RL Run 1 — shaped reward | 0.123 | 0.010 | 96.0% | 1.4 |
| RL Run 2 — pure F1 | 0.154 | 0.065 | 89.5% | 1.5 |
| RL Run 3 — status text fix | 0.172 | 0.085 | 99.5% | 1.2 |
Key finding: The run with the lowest training reward (0.056) produced the best model (F1=0.172). Stricter, more honest reward functions lead to better generalization — a textbook case of Goodhart's Law in RL.
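To make the reward variants concrete, here is a minimal sketch of the two styles. The shaping term is an illustrative assumption, not the exact Run 1 formula; token F1 is the standard SQuAD-style metric.

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """SQuAD-style token-level F1 between predicted and gold answers."""
    pred, ref = prediction.lower().split(), gold.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def shaped_reward(prediction: str, gold: str, answered: bool) -> float:
    """Run 1 style (illustrative shaping term): partial credit just for answering.

    Shaping like this inflates training reward while letting the policy
    collect credit without being correct -- the Goodhart failure above.
    """
    return token_f1(prediction, gold) + (0.1 if answered else 0.0)

def pure_reward(prediction: str, gold: str) -> float:
    """Run 2 style ("pure F1"): the training reward is exactly the eval metric."""
    return token_f1(prediction, gold)
```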
In autoresearch, agents mutate `train.py` and measure `val_bpb`. Here, agents choose which Wikipedia paragraphs to read and measure answer correctness. The agent must learn to:
- Read chunk previews to identify relevant documents
- Open the minimum number of chunks needed
- Answer with a concise, exact entity name (a toy version of this loop is sketched below)
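As a rough mental model, here is a self-contained toy version of that loop. The names (`State`, `step`, the action dicts) are illustrative assumptions; the real action schema lives in `env/`.

```python
from dataclasses import dataclass, field

@dataclass
class State:
    question: str
    previews: list[str]                  # one-line preview per candidate chunk
    opened: list[int] = field(default_factory=list)
    answer: str | None = None            # set by the terminal "answer" action

def step(state: State, action: dict) -> State:
    """Apply one action: open a chunk (paying reading cost) or answer."""
    if action["op"] == "open":
        state.opened.append(action["chunk_id"])
    elif action["op"] == "answer":
        state.answer = action["text"]
    return state

def preview_overlap_policy(state: State) -> dict:
    """Toy policy: open the first unopened chunk whose preview shares a word with the question."""
    q_words = set(state.question.lower().split())
    for i, preview in enumerate(state.previews):
        if i not in state.opened and q_words & set(preview.lower().split()):
            return {"op": "open", "chunk_id": i}
    return {"op": "answer", "text": "unknown"}
```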
2WikiMultiHopQA — 5,000 training tasks, 200 val tasks. Each question requires reasoning across 2 Wikipedia articles from a pool of ~10 candidates.
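A serialized task might look roughly like this; the field names are hypothetical (`data/build_chunked_qa.py` defines the real format), and the question is a representative 2WikiMultiHopQA-style example.

```python
# Hypothetical task shape -- see data/build_chunked_qa.py for the real schema.
task = {
    "id": 0,
    "question": "Who is the mother of the director of film Polish-Russian War?",
    "answer": "Małgorzata Braunek",
    "chunks": [
        # ~10 candidate paragraphs, of which exactly 2 are gold evidence
        {"title": "Polish-Russian War (film)", "text": "...", "is_gold": True},
        {"title": "Xawery Żuławski", "text": "...", "is_gold": True},
        # ... plus ~8 distractor paragraphs ...
    ],
}
```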
| Phase | Goal | Status |
|---|---|---|
| 0 | Foundation — repo, dataset, episode state inspector | ✅ Complete |
| 1 | Core Environment — ExplorerEnv, reward, random policy | ✅ Complete |
| 2 | Baselines — Heuristic policy (F1=0.246) | ✅ Complete |
| 3 | SFT Warmstart — 450 demo episodes via heuristic | ✅ Complete |
| 4 | RL Training — 3 GRPO runs, reward engineering | ✅ Complete |
| 5 | Ablation — Reward design (3 variants tested) | ✅ Leaf 1 Complete |
| 6 | Blog Post + Writeup | ✅ Complete |
```bash
pip install -e .

# Build the dataset (first time — downloads 2WikiMultiHopQA)
python data/build_chunked_qa.py

# Inspect episode state for task 0
python data/inspect_task.py --id 0

# Run evaluation on val set
TINKER_KEY=<your_key> python eval/eval_rl.py --rl_ckpt <checkpoint> --n_tasks 200
```

```
data/         Dataset builder, serialized tasks, CLI inspector
env/          ExplorerEnv, reward function, state renderer, action schema
policies/     Heuristic baseline, system prompts
train/        SFT warmstart + GRPO training loop
eval/         Accuracy, efficiency, trajectory inspector, baselines
runs/         Detailed run reports (run_1, run_2, run_3)
blog/         Full technical blog post
plots/        All figures organized by run + comparison
├── run_1/    Per-run reward curves & summaries
├── run_2/
├── run_3/
└── comparison/  Cross-run visualizations
logs/         Training metrics (metrics.jsonl per run)
checkpoints/  Model checkpoint references
leaves/       Ablation experiment harness
```
- Model: Qwen3-8B with LoRA (rank=32)
- Platform: Tinker — cloud inference + training, no local GPU
- Algorithm: GRPO (GROUP_SIZE=16, BATCH_SIZE=8, LR=5e-6); the group-relative advantage is sketched below
- Total compute: ~40 hours wall-clock, ~120 GPU-hours
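For reference, the core of GRPO at these settings is the group-relative advantage. A minimal sketch, assuming the standard mean/std normalization rather than Tinker's exact internals:

```python
import numpy as np

GROUP_SIZE = 16   # rollouts sampled per question (matches the run config)
BATCH_SIZE = 8    # questions per training batch

def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """rewards: (BATCH_SIZE, GROUP_SIZE), one row of rollout rewards per question.

    Each rollout is scored against its own group: subtract the group mean and
    divide by the group std, so a question where every rollout ties (all fail
    or all succeed) contributes no gradient signal.
    """
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + 1e-6)
```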
See `program.md` for the current research state.