Infrastructure: Per-step logging, smoke tests, and reproducibility #9
Description
Objective
Add lightweight infrastructure to improve experiment quality without adding complexity or dependencies.
1. Per-Step CSV Logging (HIGH priority, SMALL effort)
Add a CSV logger to `train.py` that writes metrics each step:

```
step,elapsed_sec,train_loss,lr_multiplier,tok_per_sec,mfu
0,0.0,8.234,1.0,150000,35.2
1,0.3,7.891,1.0,155000,36.1
...
```
Value: The agent currently only sees final val_bpb. A per-step CSV enables:
- Detecting whether loss was still decreasing (should train longer / use more steps)
- Comparing learning curves across experiments
- Feeding `analysis.ipynb` without regex parsing
Implementation: ~20 lines using the stdlib `csv` module. Write to `metrics.csv` alongside `run.log`. No new dependencies.
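A minimal sketch of such a logger, using only the stdlib as proposed. The class name `MetricsLogger` and the rounding choices are illustrative, not from the source:

```python
import csv
import time


class MetricsLogger:
    """Per-step CSV logger (sketch; stdlib only, no new dependencies)."""

    FIELDS = ["step", "elapsed_sec", "train_loss", "lr_multiplier", "tok_per_sec", "mfu"]

    def __init__(self, path="metrics.csv"):
        self.start = time.time()
        self.file = open(path, "w", newline="")
        self.writer = csv.DictWriter(self.file, fieldnames=self.FIELDS)
        self.writer.writeheader()

    def log(self, step, train_loss, lr_multiplier, tok_per_sec, mfu):
        self.writer.writerow({
            "step": step,
            "elapsed_sec": round(time.time() - self.start, 1),
            "train_loss": round(train_loss, 4),
            "lr_multiplier": lr_multiplier,
            "tok_per_sec": tok_per_sec,
            "mfu": mfu,
        })
        self.file.flush()  # flush each step so a crashed run still leaves usable data

    def close(self):
        self.file.close()
```

Flushing every step is cheap at this scale and means the learning curve survives even if the run dies mid-training.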
2. Smoke Tests (HIGH priority, MEDIUM effort)
Before each 5-minute training run, a 10-second smoke test catches shape mismatches, import errors, and obvious regressions.
`tests/test_train_smoke.py`:

```python
import torch

from gpt import GPT, GPTConfig  # adjust to the project's actual model module path


def test_forward_pass():
    """Model forward produces correct shape, loss is finite."""
    config = GPTConfig(n_layer=2, n_embd=128, n_head=2, n_kv_head=2, vocab_size=256)
    model = GPT(config).cuda()
    model.init_weights()
    x = torch.randint(0, 256, (2, 64), device="cuda")
    loss = model(x, x)
    assert loss.isfinite()
    assert loss.ndim == 0


def test_gradient_flow():
    """All parameters receive non-zero gradients after one forward+backward."""
    config = GPTConfig(n_layer=2, n_embd=128, n_head=2, n_kv_head=2, vocab_size=256)
    model = GPT(config).cuda()
    model.init_weights()
    x = torch.randint(0, 256, (2, 64), device="cuda")
    loss = model(x, x)
    loss.backward()
    for name, p in model.named_parameters():
        assert p.grad is not None, f"{name}: no gradient"
        assert p.grad.abs().sum() > 0, f"{name}: gradient is all zeros"
```

New dev dependency: pytest (add to `[tool.uv.dev-dependencies]` in `pyproject.toml`).
3. Reproducibility: Seed Logging (HIGH priority, SMALL effort)
Current seed is hardcoded to 42. Improvements:
- Log the seed value in the final summary output (alongside val_bpb)
- Accept a `SEED` env variable override: `torch.manual_seed(int(os.environ.get("SEED", 42)))`. This enables re-running a specific experiment with a different seed to measure variance.
No --deterministic flag needed — cudnn autotuning is fine for the 5-min budget.
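The override above fits in a one-line helper. A sketch, where the name `get_seed` is illustrative and not from the source:

```python
import os


def get_seed(default=42):
    """Resolve the run seed: a SEED env var override wins, else the hardcoded default."""
    return int(os.environ.get("SEED", default))


# In train.py (sketch):
#   torch.manual_seed(get_seed())
#   ...
#   print(f"val_bpb={val_bpb:.4f} seed={get_seed()}")  # seed in the final summary
```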
4. Structured Results Store (MEDIUM priority, SMALL effort)
Replace results.tsv with results.jsonl — one JSON object per experiment:
```json
{"commit": "a1b2c3d", "val_bpb": 0.9979, "memory_gb": 44.0, "status": "keep", "description": "baseline", "config": {"DEPTH": 8, "MATRIX_LR": 0.04}}
```

Value: JSON lines support nested config, are easier to query, and eliminate TSV quoting issues. Compatible with the existing `program.md` workflow.
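Reading and writing the store needs only the stdlib. A sketch, assuming helper names `append_result` and `load_results` (illustrative, not from the source):

```python
import json


def append_result(record, path="results.jsonl"):
    """Append one experiment's record as a single JSON line."""
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")


def load_results(path="results.jsonl"):
    """Read all experiment records back as a list of dicts."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]
```

Append-only writes mean concurrent or crashed runs never corrupt earlier records, which is the main practical win over rewriting a TSV.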
5. Analysis Improvements (MEDIUM priority, MEDIUM effort)
Add `analysis/trends.py`, a standalone script that reads results and outputs:
- Best val_bpb over time (monotonic improvement curve)
- Which config changes correlated with improvement
- Machine-readable `analysis_summary.json` the agent can consume
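The first and third outputs can be sketched in a few lines; `best_so_far` and `summarize` are illustrative names, and the summary fields are assumptions rather than a spec from the source:

```python
import json


def best_so_far(results):
    """Monotonic improvement curve: entry i is the lowest val_bpb seen through experiment i."""
    curve, best = [], float("inf")
    for r in results:
        best = min(best, r["val_bpb"])
        curve.append(best)
    return curve


def summarize(results, out_path="analysis_summary.json"):
    """Write a machine-readable summary the agent can consume (sketch of the fields)."""
    curve = best_so_far(results)
    summary = {
        "n_experiments": len(results),
        "best_val_bpb": curve[-1] if curve else None,
        "best_so_far": curve,
        "n_keep": sum(1 for r in results if r.get("status") == "keep"),
    }
    with open(out_path, "w") as f:
        json.dump(summary, f, indent=2)
    return summary
```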
What NOT to Add
- wandb/mlflow: Too heavy for an autonomous loop. They require auth, network access, and accounts.
- TensorBoard: Optional nice-to-have but not critical — CSV logging covers the agent's needs.
- Checkpoint saving: The 5-min full-retrain design is intentional. Checkpointing adds state management complexity.
- Complex CI/CD: The agent IS the CI.
Implementation Order
| Phase | Items | Effort |
|---|---|---|
| 1 | CSV logging + seed logging | ~30 min |
| 2 | Smoke tests | ~1 hour |
| 3 | results.jsonl + trends script | ~1 hour |
🤖 Generated with Claude Code