
Align training recipe and add validation logging #15

Open
FrankHui wants to merge 1 commit into kyegomez:main from FrankHui:training-recipe-alignment

Conversation


@FrankHui FrankHui commented Apr 20, 2026

Summary

  • add a float16 GradScaler path for single-GPU training when bf16 is unavailable
  • add periodic validation loss/perplexity estimation during training (both changes are sketched after this list)
  • add a validation data stream with a shard offset to avoid using the exact same worker shard as training
  • update README training notes to match the current optimizer and precision behavior
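
The sketch below illustrates the first two bullets. It is not the PR's exact code: it assumes an HF-style causal LM whose forward pass returns a `.loss`, a batch dict with `"input_ids"`, and placeholder names (`model`, `optimizer`, `train_loader`, `val_loader`, `val_every`) rather than the script's actual identifiers.

```python
# Illustrative sketch only; names and batch layout are assumptions, not the PR's code.
import math
import torch

device = "cuda"
use_bf16 = torch.cuda.is_bf16_supported()
amp_dtype = torch.bfloat16 if use_bf16 else torch.float16
# GradScaler is only needed on the float16 path; bf16 has enough dynamic range.
scaler = torch.cuda.amp.GradScaler(enabled=not use_bf16)


@torch.no_grad()
def estimate_val_loss(model, val_loader, max_batches=50):
    """Average loss and perplexity over a handful of held-out batches."""
    model.eval()
    losses = []
    for i, batch in enumerate(val_loader):
        if i >= max_batches:
            break
        x = batch["input_ids"].to(device)
        with torch.autocast(device_type="cuda", dtype=amp_dtype):
            losses.append(model(input_ids=x, labels=x).loss.item())
    model.train()
    mean_loss = sum(losses) / len(losses)
    return mean_loss, math.exp(mean_loss)


for step, batch in enumerate(train_loader):
    x = batch["input_ids"].to(device)
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=amp_dtype):
        loss = model(input_ids=x, labels=x).loss
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)  # so gradient clipping sees unscaled gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(optimizer)
    scaler.update()

    if step % val_every == 0:
        val_loss, val_ppl = estimate_val_loss(model, val_loader)
        print(f"step {step} | val_loss {val_loss:.4f} | val_ppl {val_ppl:.2f}")
```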

Test plan

  • python3 -m py_compile training/3b_fine_web_edu.py
  • Run a short training smoke test and confirm validation logs are emitted every val_every steps

Made with Cursor

Add a float16 GradScaler path for single-GPU training, include periodic validation loss/perplexity estimation, and update README training notes to match the current optimizer and precision behavior.

Made-with: Cursor
@Qodo-Free-For-OSS

Hi! Validation data sharding depends on num_workers, but the training and validation DataLoaders use different num_workers (4 vs. 2), so shard_offset=1 still yields overlapping shards between train and validation. This can contaminate validation metrics with training data and undermines the purpose of the separate validation stream.
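
To make the overlap concrete, here is a hedged illustration. It assumes modulo sharding in the style of HF datasets' `.shard(num_shards, index)` (sample i belongs to shard i % num_shards), a single process (world_size=1, rank=0), and that shard_offset is added to the worker's shard index:

```python
# Hypothetical illustration of the overlap, under the assumptions stated above.
samples = range(32)

# Train loader: num_workers=4 -> total_shards=4; worker 1 reads shard 1.
train_worker_1 = {i for i in samples if i % 4 == 1}   # {1, 5, 9, 13, ...}

# Val loader: num_workers=2, shard_offset=1 -> total_shards=2; worker 0 reads shard 1.
val_worker_0 = {i for i in samples if i % 2 == 1}     # {1, 3, 5, 7, ...}

# The train shard is entirely contained in the val shard -> contaminated validation.
assert train_worker_1 <= val_worker_0
```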

Severity: action required | Category: correctness

How to fix: Decouple sharding from workers

Agent prompt to fix - you can give this to your LLM of choice:

Issue description

FineWebEduDataset currently derives total_shards and shard_index from the dataloader worker count. Because train and validation loaders use different num_workers, shard_offset does not reliably produce a disjoint validation stream.

Issue Context

  • Train loader uses num_workers=4
  • Val loader uses num_workers=2
  • Sharding math uses num_workers in both total_shards and rank * num_workers + worker_id

Fix Focus Areas

  • training/3b_fine_web_edu.py[59-66]
  • training/3b_fine_web_edu.py[256-263]

Suggested direction

Make sharding independent of num_workers so train/val can be made disjoint regardless of worker configuration. Options include (a sketch of the first option follows this list):

  • Use only world_size/rank for HF .shard() and handle worker-level partitioning locally (e.g., each worker skips/strides samples).
  • Or pass an explicit, shared num_shards/shard_index scheme into both train and val datasets so their shard spaces match, then choose a shard_offset that guarantees disjointness within that shared space.
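
A minimal sketch of the first option, not the file's actual API: the class layout, the `split_index`/`num_splits` arguments, and the dataset id (guessed from the filename) are all illustrative, and it assumes a datasets version where streaming datasets support `.shard(...)`.

```python
# Illustrative sketch: shard the HF stream by rank/world_size plus a fixed
# train/val split, then let each DataLoader worker stride over the local shard,
# so disjointness no longer depends on num_workers.
from torch.utils.data import IterableDataset, get_worker_info
from datasets import load_dataset


class FineWebEduStream(IterableDataset):
    def __init__(self, rank, world_size, split_index, num_splits=2):
        # Global sharding: ranks first, then a fixed train/val split that both
        # loaders share, independent of how many workers each loader uses.
        self.stream = load_dataset(
            "HuggingFaceFW/fineweb-edu", split="train", streaming=True
        ).shard(
            num_shards=world_size * num_splits,
            index=rank * num_splits + split_index,
        )

    def __iter__(self):
        info = get_worker_info()
        workers = info.num_workers if info else 1
        worker_id = info.id if info else 0
        # Worker-level partitioning stays local: each worker strides/skips samples.
        for i, sample in enumerate(self.stream):
            if i % workers == worker_id:
                yield sample


# train and val now occupy disjoint shards regardless of num_workers, e.g.:
# train_ds = FineWebEduStream(rank, world_size, split_index=0)
# val_ds   = FineWebEduStream(rank, world_size, split_index=1)
```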

We noticed a couple of other issues in this PR as well - happy to share if helpful.


Spotted by Qodo code review - free for open-source projects.

