
Align training recipe and add validation logging #15

Open
FrankHui wants to merge 1 commit into kyegomez:main from FrankHui:training-recipe-alignment

Conversation


@FrankHui FrankHui commented Apr 20, 2026

Summary

  • add a float16 GradScaler path for single-GPU training when bf16 is unavailable
  • add periodic validation loss/perplexity estimation during training (both changes are sketched after this list)
  • add a validation data stream with a shard offset to avoid using the exact same worker shard as training
  • update README training notes to match the current optimizer and precision behavior
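
The sketch below illustrates the first two bullets. It is not the PR's exact code: it assumes an HF-style causal LM whose forward pass returns a `.loss`, a batch dict with `"input_ids"`, and placeholder names (`model`, `optimizer`, `train_loader`, `val_loader`, `val_every`) rather than the script's actual identifiers.

```python
# Illustrative sketch only; names and batch layout are assumptions, not the PR's code.
import math
import torch

device = "cuda"
use_bf16 = torch.cuda.is_bf16_supported()
amp_dtype = torch.bfloat16 if use_bf16 else torch.float16
# GradScaler is only needed on the float16 path; bf16 has enough dynamic range.
scaler = torch.cuda.amp.GradScaler(enabled=not use_bf16)


@torch.no_grad()
def estimate_val_loss(model, val_loader, max_batches=50):
    """Average loss and perplexity over a handful of held-out batches."""
    model.eval()
    losses = []
    for i, batch in enumerate(val_loader):
        if i >= max_batches:
            break
        x = batch["input_ids"].to(device)
        with torch.autocast(device_type="cuda", dtype=amp_dtype):
            losses.append(model(input_ids=x, labels=x).loss.item())
    model.train()
    mean_loss = sum(losses) / len(losses)
    return mean_loss, math.exp(mean_loss)


for step, batch in enumerate(train_loader):
    x = batch["input_ids"].to(device)
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=amp_dtype):
        loss = model(input_ids=x, labels=x).loss
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)  # so gradient clipping sees unscaled gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(optimizer)
    scaler.update()

    if step % val_every == 0:
        val_loss, val_ppl = estimate_val_loss(model, val_loader)
        print(f"step {step} | val_loss {val_loss:.4f} | val_ppl {val_ppl:.2f}")
```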

Test plan

  • python3 -m py_compile training/3b_fine_web_edu.py
  • Run a short training smoke test and confirm validation logs are emitted every val_every steps

Made with Cursor

Add a float16 GradScaler path for single-GPU training, include periodic validation loss/perplexity estimation, and update README training notes to match the current optimizer and precision behavior.

Made-with: Cursor
@Qodo-Free-For-OSS

Hi! Validation data sharding depends on num_workers, but the training and validation DataLoaders use different num_workers (4 vs. 2), so shard_offset=1 still yields overlapping shards between train and validation. This can contaminate validation metrics with training data and undermines the purpose of the separate validation stream.
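
To make the overlap concrete, here is a hedged illustration. It assumes modulo sharding in the style of HF datasets' `.shard(num_shards, index)` (sample i belongs to shard i % num_shards), a single process (world_size=1, rank=0), and that shard_offset is added to the worker's shard index:

```python
# Hypothetical illustration of the overlap, under the assumptions stated above.
samples = range(32)

# Train loader: num_workers=4 -> total_shards=4; worker 1 reads shard 1.
train_worker_1 = {i for i in samples if i % 4 == 1}   # {1, 5, 9, 13, ...}

# Val loader: num_workers=2, shard_offset=1 -> total_shards=2; worker 0 reads shard 1.
val_worker_0 = {i for i in samples if i % 2 == 1}     # {1, 3, 5, 7, ...}

# The train shard is entirely contained in the val shard -> contaminated validation.
assert train_worker_1 <= val_worker_0
```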

Severity: action required | Category: correctness

How to fix: Decouple sharding from workers

Agent prompt to fix - you can give this to your LLM of choice:

Issue description

FineWebEduDataset currently derives total_shards and shard_index from the dataloader worker count. Because train and validation loaders use different num_workers, shard_offset does not reliably produce a disjoint validation stream.

Issue Context

  • Train loader uses num_workers=4
  • Val loader uses num_workers=2
  • Sharding math uses num_workers in both total_shards and rank * num_workers + worker_id

Fix Focus Areas

  • training/3b_fine_web_edu.py[59-66]
  • training/3b_fine_web_edu.py[256-263]

Suggested direction

Make sharding independent of num_workers so train/val can be made disjoint regardless of worker configuration. Options include (a sketch of the first option follows this list):

  • Use only world_size/rank for HF .shard() and handle worker-level partitioning locally (e.g., each worker skips/strides samples).
  • Or pass an explicit, shared num_shards/shard_index scheme into both train and val datasets so their shard spaces match, then choose a shard_offset that guarantees disjointness within that shared space.
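
A minimal sketch of the first option, not the file's actual API: the class layout, the `split_index`/`num_splits` arguments, and the dataset id (guessed from the filename) are all illustrative, and it assumes a datasets version where streaming datasets support `.shard(...)`.

```python
# Illustrative sketch: shard the HF stream by rank/world_size plus a fixed
# train/val split, then let each DataLoader worker stride over the local shard,
# so disjointness no longer depends on num_workers.
from torch.utils.data import IterableDataset, get_worker_info
from datasets import load_dataset


class FineWebEduStream(IterableDataset):
    def __init__(self, rank, world_size, split_index, num_splits=2):
        # Global sharding: ranks first, then a fixed train/val split that both
        # loaders share, independent of how many workers each loader uses.
        self.stream = load_dataset(
            "HuggingFaceFW/fineweb-edu", split="train", streaming=True
        ).shard(
            num_shards=world_size * num_splits,
            index=rank * num_splits + split_index,
        )

    def __iter__(self):
        info = get_worker_info()
        workers = info.num_workers if info else 1
        worker_id = info.id if info else 0
        # Worker-level partitioning stays local: each worker strides/skips samples.
        for i, sample in enumerate(self.stream):
            if i % workers == worker_id:
                yield sample


# train and val now occupy disjoint shards regardless of num_workers, e.g.:
# train_ds = FineWebEduStream(rank, world_size, split_index=0)
# val_ds   = FineWebEduStream(rank, world_size, split_index=1)
```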

We noticed a couple of other issues in this PR as well - happy to share if helpful.


Spotted by Qodo code review - free for open-source projects.

