nano3: bring pretrain blend config to parity with super3 #141
Open
Doondi-Ashlesh wants to merge 1 commit into NVIDIA-NeMo:main from
The Nano3 pretrain blend config lacked the metadata and phase structure that the Super3 config provides. This commit adds:

- _comment and _missing_categories to data_blend_raw.json, documenting the 23.5T/1.5T phase split, proprietary data gaps (code, crawl++, academic), and the open-source code datasets available separately (Nemotron-CC-Code-v1, Nemotron-Pretraining-Code-v2)
- data_blend_raw_phase1.json for the 23.5T diversity phase with approximate internal weight ratios replacing uniform placeholders
- data_blend_raw_phase2.json for the 1.5T quality phase with medium-quality subsets removed and synthetic weights adjusted

Closes NVIDIA-NeMo#136

Signed-off-by: Doondi-Ashlesh <doondiashlesh@gmail.com>
192bbb3 to c86ab66
Contributor, Author:
Hi, just checking in on this PR. I think it should be ready to merge once reviewed.
What this PR does
Brings the Nano3 pretrain blend config to parity with Super3 by adding metadata and a curriculum phase split, as requested in issue #136.
Changes
- `_comment` and `_missing_categories` to `data_blend_raw.json`, documenting the 23.5T/1.5T phase structure, proprietary data gaps (code, crawl++, academic), and the open-source code datasets available separately (Nemotron-CC-Code-v1, Nemotron-Pretraining-Code-v2)
- `data_blend_raw_phase1.json` for the 23.5T diversity phase with approximate internal weight ratios, replacing uniform `weight: 1.0` placeholders
- `data_blend_raw_phase2.json` for the 1.5T quality phase with medium-quality subsets removed and synthetic weights reduced, matching the curriculum described in the tech report

Weights are unnormalized approximations based on the Nano3 tech report and the Super3 blend files for shared datasets. The pipeline normalizes at runtime.
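For reviewers, here is a rough sketch of what "unnormalized weights, normalized at runtime" could look like. It is not the pipeline's actual code. The `datasets` key, the entry layout, the dataset names, the list form of `_missing_categories`, and the `normalize_blend` helper are all assumptions made for illustration; only the `_comment` and `_missing_categories` key names and the per-dataset `weight` field come from this PR.

```python
import json

# Hypothetical blend file contents. The real schema, dataset names, and
# weights in data_blend_raw_phase1.json may differ from this sketch.
RAW_BLEND = """
{
  "_comment": "Illustrative only; not the actual Nano3 blend.",
  "_missing_categories": ["code", "crawl++", "academic"],
  "datasets": [
    {"name": "example-web-high-quality", "weight": 3.0},
    {"name": "example-web-medium-quality", "weight": 2.0},
    {"name": "example-synthetic", "weight": 5.0}
  ]
}
"""


def normalize_blend(blend: dict) -> dict[str, float]:
    """Return dataset name -> weight, scaled so the weights sum to 1.

    Metadata keys such as _comment are simply never read here.
    """
    entries = blend["datasets"]
    total = sum(entry["weight"] for entry in entries)
    if total <= 0:
        raise ValueError("blend weights must sum to a positive value")
    return {entry["name"]: entry["weight"] / total for entry in entries}


blend = json.loads(RAW_BLEND)
normalized = normalize_blend(blend)
for name, weight in normalized.items():
    print(f"{name}: {weight:.3f}")
```

Because only the ratios matter under this kind of normalization, the approximate weights in the phase files do not need to sum to any particular value.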
Closes #136