V3 → V4: What Changed and Where We're Going #11

2imi9 · 2026-04-03T01:32:04Z

2imi9
Apr 3, 2026
Maintainer

V3 → V4 Transition

We've upgraded the autonomous research runner from V3 to V4, adapting agent design patterns from Claude Code (Anthropic). This post covers what changed, what broke, what we learned, and what's next.

What V3 Achieved

Best val_bpb: 1.152 (from 1.268 baseline) over 64 experiments
Qwen3.5-9B orchestrated search → scoring → proposal → training loop
MolmoWeb-4B visual browsing for deep paper reads
Ran with PyTorch SDPA (cuDNN backend), batch size 32
FA3 was in the code but never actually ran on our RTX 5090 — it silently fell back to SDPA

What V4 Adds (Claude Code Patterns)

Pattern	Source	What It Does
Batch scoring	Tool orchestration	Score all papers in 1 LLM call instead of N
History compaction	Context compaction	Summarize old experiments into structured digest
Adaptive thinking	Query loop	Qwen /think mode ON for proposals, OFF for scoring
Validation hooks	Stop hooks	Reject duplicates + exhausted params before training
Circuit breaker	withRetry	Fall back to random proposals after 3 consecutive failures
Exponential backoff	withRetry	500ms × 2^attempt, capped at 32s, with jitter
FileStateCache	fileStateCache.ts	LRU cache with mtime invalidation for file reads
Graceful interrupt	AbortController	Ctrl+C saves queue + history, second Ctrl+C force-quits
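
As a rough illustration of the backoff and circuit-breaker rows above, here is a minimal Python sketch. The constants (500 ms base, 32 s cap, 3 consecutive failures) come from the table. The names `backoff_delay`, `CircuitBreaker`, and `propose`, the 25% jitter fraction, and the `max_attempts` default are assumptions made for the example, not the runner's actual code.

```python
import random
import time

BASE_DELAY_S = 0.5            # 500 ms base delay (from the table)
MAX_DELAY_S = 32.0            # delay cap (from the table)
MAX_CONSECUTIVE_FAILURES = 3  # circuit-breaker threshold (from the table)


def backoff_delay(attempt, jitter_fraction=0.25, rng=random):
    """Delay before retrying after 0-based `attempt`: 500ms * 2^attempt, capped at 32s.

    The jitter shape is an assumption: up to `jitter_fraction` of the capped
    delay is added on top.
    """
    capped = min(BASE_DELAY_S * (2 ** attempt), MAX_DELAY_S)
    return capped + rng.uniform(0, jitter_fraction * capped)


class CircuitBreaker:
    """Opens after N consecutive failures; any success resets the count."""

    def __init__(self, threshold=MAX_CONSECUTIVE_FAILURES):
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, success):
        self.consecutive_failures = 0 if success else self.consecutive_failures + 1

    @property
    def is_open(self):
        return self.consecutive_failures >= self.threshold


def propose(llm_propose, random_propose, breaker, max_attempts=4, sleep=time.sleep):
    """Try the LLM with backoff; fall back to random proposals once the breaker opens."""
    for attempt in range(max_attempts):
        if breaker.is_open:
            break
        try:
            proposal = llm_propose()
            breaker.record(True)
            return proposal
        except Exception:
            breaker.record(False)
            # Only wait if another LLM attempt will actually follow.
            if not breaker.is_open and attempt < max_attempts - 1:
                sleep(backoff_delay(attempt))
    return random_propose()
```

Passing `sleep` and `rng` as parameters keeps the retry logic testable without real waiting.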

FlexAttention Upgrade

We discovered that V3's SDPA fallback ran without sliding-window attention, so the SSSL window pattern was silently ignored. V4 now uses PyTorch FlexAttention, which supports sliding windows natively:

Metric	SDPA (V3 fallback)	FlexAttention (V4)
val_bpb	1.739	1.680
tok/sec	~70k	~83k
Sliding window	❌	✅
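
For readers unfamiliar with FlexAttention, a causal sliding window can be expressed as a `mask_mod` function. The sketch below assumes PyTorch 2.5 or newer. The helper names and the `window` parameter are hypothetical, and this is not necessarily how V4 wires it up.

```python
def sliding_window_causal(window):
    """Build a FlexAttention mask_mod: each query sees itself and the previous window-1 tokens."""
    def mask_mod(b, h, q_idx, kv_idx):
        causal = q_idx >= kv_idx                # never attend to future tokens
        in_window = (q_idx - kv_idx) < window   # stay within the local window
        return causal & in_window
    return mask_mod


def sliding_window_attention(q, k, v, window):
    """q, k, v: (batch, heads, seq_len, head_dim) tensors on the same device."""
    # Imported lazily so the mask logic above can be checked without a GPU build.
    from torch.nn.attention.flex_attention import create_block_mask, flex_attention

    seq_len = q.shape[-2]
    # B=None and H=None broadcast one mask over all batches and heads.
    block_mask = create_block_mask(
        sliding_window_causal(window), B=None, H=None,
        Q_LEN=seq_len, KV_LEN=seq_len, device=str(q.device),
    )
    return flex_attention(q, k, v, block_mask=block_mask)
```

In practice `flex_attention` is usually wrapped in `torch.compile` to get the fused kernel. The uncompiled call above is only meant to show the masking logic.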

Flash Attention Status

FA3: Hopper (SM 9.0) only — does NOT support Blackwell (RTX 5090, SM 12.0)
FA4: in beta; SM 12.0 support is under review in Dao-AILab/flash-attention#2268 ("Add SM120 (Blackwell GeForce / DGX Spark) flash attention"), roughly 2-3 weeks out
SageAttention3: Has explicit Blackwell variant — worth evaluating

The dual-repo FA3 logic (varunneal for Hopper, kernels-community for others) was accidentally lost during the V4 rewrite and has been restored.
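
One way to avoid another silent fallback is to choose the backend explicitly from the GPU's compute capability and log the choice. The sketch below is an assumption-heavy illustration. The function names and backend labels are hypothetical, and it does not reproduce the dual-repo logic described above.

```python
def pick_attention_backend(capability):
    """Map a CUDA compute capability (major, minor) to a backend label."""
    if capability == (9, 0):
        return "fa3"             # Hopper (SM 9.0): FA3 is supported
    return "flex_attention"      # e.g. Blackwell (SM 12.0): no FA3, use FlexAttention


def detect_backend():
    """Pick a backend for the current device and say so out loud."""
    import torch  # lazy import so the mapping above is testable without PyTorch

    if not torch.cuda.is_available():
        backend = "sdpa"  # assumption: CPU-only runs use plain SDPA
    else:
        backend = pick_attention_backend(torch.cuda.get_device_capability())
    print(f"attention backend: {backend}")
    return backend
```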

Proposal Pipeline Fix

V4's first test produced 0 LLM-generated proposals because:

Qwen's thinking tags (<think>...</think>) broke the regex parser
No few-shot examples → LLM output format was unpredictable
6 silent rejection points with no logging

All three are now fixed with multi-strategy parsing, few-shot examples, and diagnostic logging.
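
To make the parsing fix concrete, here is a minimal sketch of what multi-strategy parsing with logged rejections could look like. The expected output formats (a JSON object, or `key: value` lines) and all names are assumptions for illustration. The actual parser in the repo may differ.

```python
import json
import logging
import re

log = logging.getLogger("proposal_parser")

THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)
JSON_OBJECT = re.compile(r"\{.*\}", re.DOTALL)
KEY_VALUE = re.compile(r"^\s*(\w+)\s*[:=]\s*(\S+)\s*$", re.MULTILINE)


def parse_proposal(raw):
    """Return a dict of proposed params, or None (with a log line) if nothing parses."""
    # Drop Qwen's reasoning first so it cannot confuse the strategies below.
    text = THINK_BLOCK.sub("", raw).strip()

    # Strategy 1: treat the outermost {...} span as JSON.
    match = JSON_OBJECT.search(text)
    if match:
        try:
            parsed = json.loads(match.group(0))
            if isinstance(parsed, dict) and parsed:
                return parsed
            log.warning("JSON strategy: expected a non-empty object, got %r", parsed)
        except json.JSONDecodeError as err:
            log.warning("JSON strategy failed: %s", err)

    # Strategy 2: fall back to `key: value` or `key = value` lines (values stay strings).
    pairs = dict(KEY_VALUE.findall(text))
    if pairs:
        return pairs

    # Every rejection is logged so an empty proposal queue can be diagnosed.
    log.warning("proposal rejected, no strategy matched: %r", text[:200])
    return None
```

For example, `parse_proposal('<think>try a lower lr</think>{"lr": 0.01}')` yields `{"lr": 0.01}`, while unparseable output returns `None` and leaves a warning in the log.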

Open Issues

Flash Attention 3/4 does not support Blackwell (RTX 5090, SM 12.0) #4 FA3/FA4 Blackwell support tracking
FlexAttention sliding window replaces SDPA fallback #5 FlexAttention integration
V4 proposal pipeline generates 0 LLM proposals — falls back to random #6 Proposal pipeline quality
Restore dual-repo FA3 logic lost during V4 rewrite #7 Dual-repo FA3 logic
Evaluate SageAttention3 as FlexAttention alternative for Blackwell #8 SageAttention3 evaluation
Docker training mode commented out — restore as optional flag #9 Docker training restore

PR

V4: Claude Code agent patterns + FlexAttention + proposal pipeline fixes #10

What's Next

Run full 64-experiment campaign with improved proposal pipeline
Evaluate SageAttention3 for better Blackwell performance
Adopt FA4 when SM 12.0 support lands (~2-3 weeks)
AutoML 2026 paper — deadline April 30

64 unit tests passing. All code on experiments/baseline branch.
