feat(training): add military strikes forecasting notebook #45
Conversation
cc @bartolomej — training is working end to end, but crashed at step 52. You can use this to test. Dataset: dataset = lr.datasets.get("8841b3cb-35c5-496f-a217-a3ec50f0bdca")

Also parallelized the eval cell — Tinker RL, Tinker base, and the GPT-5.4 benchmark now run concurrently instead of sequentially.
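The concurrency change described above can be sketched with a thread pool. This is a minimal illustration of the pattern, not the actual notebook cell: the three eval functions below are hypothetical stand-ins for the real SDK calls (Tinker RL, Tinker base, and the GPT-5.4 benchmark).

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stubs for the three eval calls in the notebook;
# the real cell calls the Lightning Rod SDK. Each returns a small
# result dict so the concurrency pattern is easy to follow.
def eval_tinker_rl(dataset_id):
    return {"model": "tinker-rl", "dataset": dataset_id}

def eval_tinker_base(dataset_id):
    return {"model": "tinker-base", "dataset": dataset_id}

def eval_gpt_benchmark(dataset_id):
    return {"model": "gpt-5.4", "dataset": dataset_id}

def run_evals_concurrently(dataset_id):
    """Submit all three evals at once instead of running them sequentially."""
    evals = [eval_tinker_rl, eval_tinker_base, eval_gpt_benchmark]
    with ThreadPoolExecutor(max_workers=len(evals)) as pool:
        futures = [pool.submit(fn, dataset_id) for fn in evals]
        # Results come back in submission order.
        return [f.result() for f in futures]

results = run_evals_concurrently("8841b3cb-35c5-496f-a217-a3ec50f0bdca")
```

Since the three evals are I/O-bound API calls, threads (rather than processes) are enough to overlap their wall-clock time.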
Training Complete — military-strikes-v2 (111/111 steps)

cc @bartolomej v1 crashed at step 52. Restarted with the same dataset and config — v2 completed all 111 steps.

IDs & Checkpoints
Reward Progress
Eval Results (993 test samples)

Evaluated via
Cost
Notebook update

Cleared all outputs and removed cells 17-21 (the direct Tinker workaround from the v1 crash). The notebook now runs end-to-end through the SDK.
E2E SDK example: data gen, training (gpt-oss-120b), eval vs GPT-5.4. Uses client-approved prompts covering global strike types and actors.
# Conflicts:
#	notebooks/fine_tuning/04_military_strikes.ipynb
Training completed successfully (111/111 steps), so the direct Tinker evaluation cells (17-21) are no longer needed. Also fix GPT-5.1 typo to GPT-5.4 in eval section header.
force-pushed from 0db6bec to e4b3d3d
bartolomej left a comment
Awesome stuff!
I've opened this PR with eval API changes (to support evals on multiple models and intermediate checkpoints): https://app.graphite.com/github/pr/lightning-rod-labs/lightningrod-python-sdk/49
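The eval API changes in that PR (evals on multiple models and intermediate checkpoints) could look roughly like the sketch below. All names here are illustrative assumptions, not the real SDK surface: evaluate() is a stub scorer over a fake test set, and the checkpoint refs are invented.

```python
# Hypothetical sketch: score several model refs (including
# intermediate checkpoints) against one shared test set.
def evaluate(model_ref, samples):
    # Stub scorer: fraction of samples where this model's stored
    # prediction matches the label. The real API would run inference.
    correct = sum(1 for s in samples if s["preds"][model_ref] == s["label"])
    return correct / len(samples)

def eval_many(model_refs, samples):
    # One pass per model ref; a real multi-model eval API would
    # presumably batch this server-side.
    return {ref: evaluate(ref, samples) for ref in model_refs}

# Tiny fake test set with per-model predictions baked in.
samples = [
    {"label": 1, "preds": {"ckpt-052": 0, "ckpt-111": 1, "gpt-5.4": 1}},
    {"label": 0, "preds": {"ckpt-052": 0, "ckpt-111": 0, "gpt-5.4": 1}},
]
scores = eval_many(["ckpt-052", "ckpt-111", "gpt-5.4"], samples)
```

Returning a dict keyed by model ref makes it easy to compare a final checkpoint against intermediate ones and an external benchmark in a single call.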

Summary
notebooks/fine_tuning/04_military_strikes.ipynb

cc @bartolomej — FYI, not ready for review yet