Infrastructure: Per-step logging, smoke tests, and reproducibility #9

@Jason-Adam

Description

Objective

Add lightweight infrastructure to improve experiment quality without adding complexity or dependencies.

1. Per-Step CSV Logging (HIGH priority, SMALL effort)

Add a CSV logger to train.py that writes metrics each step:

step,elapsed_sec,train_loss,lr_multiplier,tok_per_sec,mfu
0,0.0,8.234,1.0,150000,35.2
1,0.3,7.891,1.0,155000,36.1
...

Value: The agent currently only sees final val_bpb. A per-step CSV enables:

  • Detecting whether loss was still decreasing (should train longer / use more steps)
  • Comparing learning curves across experiments
  • Feeding analysis.ipynb without regex parsing

Implementation: ~20 lines using stdlib csv. Write to metrics.csv alongside run.log. No new dependencies.
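A minimal sketch of the logger, assuming stdlib `csv` as proposed. The class name, file-handling details, and call signature are illustrative; only the column names come from this issue:

```python
import csv
import time

class StepLogger:
    """Hypothetical per-step metrics.csv writer for train.py."""

    FIELDS = ["step", "elapsed_sec", "train_loss", "lr_multiplier", "tok_per_sec", "mfu"]

    def __init__(self, path="metrics.csv"):
        self.start = time.time()
        self.file = open(path, "w", newline="")
        self.writer = csv.DictWriter(self.file, fieldnames=self.FIELDS)
        self.writer.writeheader()

    def log(self, **metrics):
        # elapsed_sec is computed here so the training loop only passes raw metrics
        metrics["elapsed_sec"] = round(time.time() - self.start, 1)
        self.writer.writerow(metrics)
        self.file.flush()  # keep partial runs readable if the process dies mid-step

# Usage (from the training loop):
# logger = StepLogger()
# logger.log(step=0, train_loss=8.234, lr_multiplier=1.0, tok_per_sec=150000, mfu=35.2)
```

Flushing every step costs almost nothing at 5-minute-run scale and guarantees the CSV is usable even after a crash.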

2. Smoke Tests (HIGH priority, MEDIUM effort)

A ~10-second smoke test, run before each 5-minute training run, catches shape mismatches, import errors, and obvious regressions before they waste a full budget.

tests/test_train_smoke.py

import torch

from model import GPT, GPTConfig  # assumed project imports; adjust to the repo's actual module

def test_forward_pass():
    """Model forward produces a scalar loss, and the loss is finite."""
    config = GPTConfig(n_layer=2, n_embd=128, n_head=2, n_kv_head=2, vocab_size=256)
    model = GPT(config).cuda()
    model.init_weights()
    x = torch.randint(0, 256, (2, 64), device="cuda")
    loss = model(x, x)
    assert loss.isfinite()
    assert loss.ndim == 0

def test_gradient_flow():
    """All parameters receive non-zero gradients after one forward+backward."""
    config = GPTConfig(n_layer=2, n_embd=128, n_head=2, n_kv_head=2, vocab_size=256)
    model = GPT(config).cuda()
    model.init_weights()
    x = torch.randint(0, 256, (2, 64), device="cuda")
    model(x, x).backward()
    for name, p in model.named_parameters():
        assert p.grad is not None, f"{name}: no gradient"
        assert p.grad.abs().sum() > 0, f"{name}: zero gradient"

New dev dependency: pytest (add to [tool.uv.dev-dependencies] in pyproject.toml)

3. Reproducibility: Seed Logging (HIGH priority, SMALL effort)

Current seed is hardcoded to 42. Improvements:

  • Log the seed value in the final summary output (alongside val_bpb)
  • Accept SEED env variable override: torch.manual_seed(int(os.environ.get("SEED", 42)))
  • Enables re-running a specific experiment with a different seed to measure variance

No --deterministic flag needed — cudnn autotuning is fine for the 5-min budget.
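The override from the bullet above can be isolated in a small helper so the summary line and the seeding call read the same value. The helper name is hypothetical; `train.py` would pass the result to `torch.manual_seed()` and echo it next to val_bpb:

```python
import os

def resolve_seed(default=42):
    """Return the SEED env var override if set, else the hardcoded default.

    Hypothetical helper: train.py would call torch.manual_seed(resolve_seed())
    and print the same value in the final summary for variance re-runs.
    """
    return int(os.environ.get("SEED", default))
```

Re-running an experiment with a different seed is then just `SEED=7 python train.py`.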

4. Structured Results Store (MEDIUM priority, SMALL effort)

Replace results.tsv with results.jsonl — one JSON object per experiment:

{"commit": "a1b2c3d", "val_bpb": 0.9979, "memory_gb": 44.0, "status": "keep", "description": "baseline", "config": {"DEPTH": 8, "MATRIX_LR": 0.04}}

Value: JSON lines support nested config, are easier to query, and eliminate TSV quoting issues. Compatible with the existing program.md workflow.
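Read/write helpers need only stdlib `json`. This is a sketch; the function names are assumptions, and the record shape follows the example above:

```python
import json

def append_result(path, record):
    """Append one experiment record to results.jsonl as a single JSON line."""
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

def load_results(path):
    """Read every experiment record back as a list of dicts, skipping blank lines."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]
```

Append-only writes mean a crashed run can at worst leave one truncated trailing line, never corrupt earlier records.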

5. Analysis Improvements (MEDIUM priority, MEDIUM effort)

Add analysis/trends.py — a standalone script that reads results and outputs:

  • Best val_bpb over time (monotonic improvement curve)
  • Which config changes correlated with improvement
  • Machine-readable analysis_summary.json the agent can consume
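The first and third outputs could start from a sketch like this (function names assumed; the correlation analysis is omitted here):

```python
import json

def best_so_far(results):
    """Running best (minimum) val_bpb across experiments, in order.

    A flat tail in this curve signals the agent has plateaued.
    """
    best, curve = float("inf"), []
    for r in results:
        best = min(best, r["val_bpb"])
        curve.append(best)
    return curve

def write_summary(results, path="analysis_summary.json"):
    """Emit a machine-readable summary the agent can consume directly."""
    summary = {
        "n_experiments": len(results),
        "best_val_bpb": min(r["val_bpb"] for r in results),
        "improvement_curve": best_so_far(results),
    }
    with open(path, "w") as f:
        json.dump(summary, f, indent=2)
    return summary
```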

What NOT to Add

  • wandb/mlflow: Too heavy for autonomous loop. Requires auth, network, accounts.
  • TensorBoard: A nice-to-have, but not critical — CSV logging covers the agent's needs.
  • Checkpoint saving: The 5-min full-retrain design is intentional. Checkpointing adds state management complexity.
  • Complex CI/CD: The agent IS the CI.

Implementation Order

Phase  Items                          Effort
1      CSV logging + seed logging     ~30 min
2      Smoke tests                    ~1 hour
3      results.jsonl + trends script  ~1 hour

🤖 Generated with Claude Code

Labels: infrastructure (tooling, testing, or workflow improvement) · priority: medium (medium impact) · size: M (5-15 experiments or 1-3 hours)
