nano3: bring pretrain blend config to parity with super3 #141

Open
Doondi-Ashlesh wants to merge 1 commit into NVIDIA-NeMo:main from Doondi-Ashlesh:nano3/pretrain-blend-parity

@Doondi-Ashlesh
Contributor

What this PR does

Brings the Nano3 pretrain blend config to parity with Super3 by adding metadata and a curriculum phase split, as requested in issue #136.

Changes

  • Added _comment and _missing_categories to data_blend_raw.json documenting the 23.5T/1.5T phase structure, proprietary data gaps (code, crawl++, academic), and the open-source code datasets available separately (Nemotron-CC-Code-v1, Nemotron-Pretraining-Code-v2)
  • Created data_blend_raw_phase1.json for the 23.5T diversity phase with approximate internal weight ratios, replacing uniform weight: 1.0 placeholders
  • Created data_blend_raw_phase2.json for the 1.5T quality phase with medium-quality subsets removed and synthetic weights reduced, matching the curriculum described in the tech report
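
The relationship between the two phase files can be sketched roughly as follows. This is only an illustration of the idea described above (drop medium-quality subsets, scale down synthetic weights). The entry fields (`name`, `weight`, `quality`, `synthetic`), the dataset names, and the `synthetic_scale` value are all hypothetical and do not reflect the actual JSON schema or the weights in this PR.

```python
from typing import Dict, List

# Hypothetical entry layout; the real JSON schema is not shown in this PR.
Entry = Dict[str, object]


def derive_phase2(phase1: List[Entry], synthetic_scale: float = 0.5) -> List[Entry]:
    """Build a phase-2 style blend from a phase-1 style blend.

    Entries tagged as medium quality are dropped, and the weight of
    entries tagged as synthetic is multiplied by synthetic_scale.
    The input list is not modified.
    """
    phase2: List[Entry] = []
    for entry in phase1:
        if entry.get("quality") == "medium":
            continue  # medium-quality subsets are removed in the quality phase
        new_entry = dict(entry)  # shallow copy so phase1 stays untouched
        if entry.get("synthetic", False):
            new_entry["weight"] = float(entry["weight"]) * synthetic_scale
        phase2.append(new_entry)
    return phase2


# Made-up entries for illustration only.
phase1_example = [
    {"name": "crawl_high", "weight": 4.0, "quality": "high", "synthetic": False},
    {"name": "crawl_medium", "weight": 2.0, "quality": "medium", "synthetic": False},
    {"name": "synthetic_qa", "weight": 2.0, "quality": "high", "synthetic": True},
]
phase2_example = derive_phase2(phase1_example)
```

In the PR itself both files are written by hand, so a helper like this would only serve as a sanity check that phase 2 is a filtered and reweighted subset of phase 1.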

Weights are unnormalized approximations based on the Nano3 tech report and the Super3 blend files for shared datasets. The pipeline normalizes at runtime.
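
For readers unfamiliar with unnormalized weights, here is a minimal sketch of what runtime normalization typically amounts to. It is not the pipeline's actual code; the function name `normalize_weights` and the dataset names and values are assumptions made for illustration.

```python
from typing import Dict


def normalize_weights(raw: Dict[str, float]) -> Dict[str, float]:
    """Scale weights so they sum to 1.0 while preserving their ratios."""
    total = sum(raw.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return {name: weight / total for name, weight in raw.items()}


# Hypothetical dataset names and weights, for illustration only.
raw_blend = {"dataset_a": 3.0, "dataset_b": 1.0, "dataset_c": 1.0}
normalized = normalize_weights(raw_blend)
```

Because only the ratios matter, the blend files can carry approximate relative weights without needing to sum to any particular total.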

Closes #136

The Nano3 pretrain blend config lacked the metadata and phase
structure that the Super3 config provides. This commit adds:

- _comment and _missing_categories to data_blend_raw.json,
  documenting the 23.5T/1.5T phase split, proprietary data gaps
  (code, crawl++, academic), and the open-source code datasets
  available separately (Nemotron-CC-Code-v1, Nemotron-Pretraining-Code-v2)
- data_blend_raw_phase1.json for the 23.5T diversity phase with
  approximate internal weight ratios replacing uniform placeholders
- data_blend_raw_phase2.json for the 1.5T quality phase with
  medium-quality subsets removed and synthetic weights adjusted

Closes NVIDIA-NeMo#136

Signed-off-by: Doondi-Ashlesh <doondiashlesh@gmail.com>
Doondi-Ashlesh force-pushed the nano3/pretrain-blend-parity branch from 192bbb3 to c86ab66 on April 11, 2026 03:43
@Doondi-Ashlesh
Contributor Author

Hi, just checking in on this PR. I think it is ready to merge once reviewed.
Happy to make changes if anything needs updating. Thank you.

Development

Successfully merging this pull request may close these issues.

Bring Nano3 pretrain blend documentation to parity with Super3 (_missing_categories, real weights, phase split)