Infrastructure: Per-step logging, smoke tests, and reproducibility #9

@Jason-Adam

Description

Objective

Add lightweight infrastructure to improve experiment quality without adding complexity or dependencies.

1. Per-Step CSV Logging (HIGH priority, SMALL effort)

Add a CSV logger to train.py that writes metrics each step:

step,elapsed_sec,train_loss,lr_multiplier,tok_per_sec,mfu
0,0.0,8.234,1.0,150000,35.2
1,0.3,7.891,1.0,155000,36.1
...

Value: The agent currently only sees final val_bpb. A per-step CSV enables:

  • Detecting whether loss was still decreasing (should train longer / use more steps)
  • Comparing learning curves across experiments
  • Feeding analysis.ipynb without regex parsing

Implementation: ~20 lines using stdlib csv. Write to metrics.csv alongside run.log. No new dependencies.
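A minimal sketch of the logger, assuming stdlib `csv` as proposed. The class name, file-handling details, and call signature are illustrative; only the column names come from this issue:

```python
import csv
import time

class StepLogger:
    """Hypothetical per-step metrics.csv writer for train.py."""

    FIELDS = ["step", "elapsed_sec", "train_loss", "lr_multiplier", "tok_per_sec", "mfu"]

    def __init__(self, path="metrics.csv"):
        self.start = time.time()
        self.file = open(path, "w", newline="")
        self.writer = csv.DictWriter(self.file, fieldnames=self.FIELDS)
        self.writer.writeheader()

    def log(self, **metrics):
        # elapsed_sec is computed here so the training loop only passes raw metrics
        metrics["elapsed_sec"] = round(time.time() - self.start, 1)
        self.writer.writerow(metrics)
        self.file.flush()  # keep partial runs readable if the process dies mid-step

# Usage (from the training loop):
# logger = StepLogger()
# logger.log(step=0, train_loss=8.234, lr_multiplier=1.0, tok_per_sec=150000, mfu=35.2)
```

Flushing every step costs almost nothing at 5-minute-run scale and guarantees the CSV is usable even after a crash.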

2. Smoke Tests (HIGH priority, MEDIUM effort)

A ~10-second smoke test, run before each 5-minute training run, catches shape mismatches, import errors, and obvious regressions before they waste a full budget.

tests/test_train_smoke.py

import torch

from model import GPT, GPTConfig  # assumed project imports; adjust to the repo's actual module

def test_forward_pass():
    """Model forward produces a scalar loss, and the loss is finite."""
    config = GPTConfig(n_layer=2, n_embd=128, n_head=2, n_kv_head=2, vocab_size=256)
    model = GPT(config).cuda()
    model.init_weights()
    x = torch.randint(0, 256, (2, 64), device="cuda")
    loss = model(x, x)
    assert loss.isfinite()
    assert loss.ndim == 0

def test_gradient_flow():
    """All parameters receive non-zero gradients after one forward+backward."""
    config = GPTConfig(n_layer=2, n_embd=128, n_head=2, n_kv_head=2, vocab_size=256)
    model = GPT(config).cuda()
    model.init_weights()
    x = torch.randint(0, 256, (2, 64), device="cuda")
    model(x, x).backward()
    for name, p in model.named_parameters():
        assert p.grad is not None, f"{name}: no gradient"
        assert p.grad.abs().sum() > 0, f"{name}: zero gradient"

New dev dependency: pytest (add to [tool.uv.dev-dependencies] in pyproject.toml)

3. Reproducibility: Seed Logging (HIGH priority, SMALL effort)

Current seed is hardcoded to 42. Improvements:

  • Log the seed value in the final summary output (alongside val_bpb)
  • Accept SEED env variable override: torch.manual_seed(int(os.environ.get("SEED", 42)))
  • Enables re-running a specific experiment with a different seed to measure variance

No --deterministic flag needed — cudnn autotuning is fine for the 5-min budget.
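The override from the bullet above can be isolated in a small helper so the summary line and the seeding call read the same value. The helper name is hypothetical; `train.py` would pass the result to `torch.manual_seed()` and echo it next to val_bpb:

```python
import os

def resolve_seed(default=42):
    """Return the SEED env var override if set, else the hardcoded default.

    Hypothetical helper: train.py would call torch.manual_seed(resolve_seed())
    and print the same value in the final summary for variance re-runs.
    """
    return int(os.environ.get("SEED", default))
```

Re-running an experiment with a different seed is then just `SEED=7 python train.py`.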

4. Structured Results Store (MEDIUM priority, SMALL effort)

Replace results.tsv with results.jsonl — one JSON object per experiment:

{"commit": "a1b2c3d", "val_bpb": 0.9979, "memory_gb": 44.0, "status": "keep", "description": "baseline", "config": {"DEPTH": 8, "MATRIX_LR": 0.04}}

Value: JSON lines support nested config, are easier to query, and eliminate TSV quoting issues. Compatible with the existing program.md workflow.
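Read/write helpers need only stdlib `json`. This is a sketch; the function names are assumptions, and the record shape follows the example above:

```python
import json

def append_result(path, record):
    """Append one experiment record to results.jsonl as a single JSON line."""
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

def load_results(path):
    """Read every experiment record back as a list of dicts, skipping blank lines."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]
```

Append-only writes mean a crashed run can at worst leave one truncated trailing line, never corrupt earlier records.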

5. Analysis Improvements (MEDIUM priority, MEDIUM effort)

Add analysis/trends.py — a standalone script that reads results and outputs:

  • Best val_bpb over time (monotonic improvement curve)
  • Which config changes correlated with improvement
  • Machine-readable analysis_summary.json the agent can consume
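The first and third outputs could start from a sketch like this (function names assumed; the correlation analysis is omitted here):

```python
import json

def best_so_far(results):
    """Running best (minimum) val_bpb across experiments, in order.

    A flat tail in this curve signals the agent has plateaued.
    """
    best, curve = float("inf"), []
    for r in results:
        best = min(best, r["val_bpb"])
        curve.append(best)
    return curve

def write_summary(results, path="analysis_summary.json"):
    """Emit a machine-readable summary the agent can consume directly."""
    summary = {
        "n_experiments": len(results),
        "best_val_bpb": min(r["val_bpb"] for r in results),
        "improvement_curve": best_so_far(results),
    }
    with open(path, "w") as f:
        json.dump(summary, f, indent=2)
    return summary
```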

What NOT to Add

  • wandb/mlflow: Too heavy for autonomous loop. Requires auth, network, accounts.
  • TensorBoard: A nice-to-have, but not critical — CSV logging covers the agent's needs.
  • Checkpoint saving: The 5-min full-retrain design is intentional. Checkpointing adds state management complexity.
  • Complex CI/CD: The agent IS the CI.

Implementation Order

Phase  Items                          Effort
1      CSV logging + seed logging     ~30 min
2      Smoke tests                    ~1 hour
3      results.jsonl + trends script  ~1 hour

🤖 Generated with Claude Code

Labels: infrastructure (tooling, testing, or workflow improvement) · priority: medium (medium impact) · size: M (5-15 experiments or 1-3 hours)
