Adding Omni-3 recipes (#173)
Merged
marcromeyn merged 23 commits into main on Apr 28, 2026

Conversation
Keep the _smoke_test_stage placeholder in-tree for wave-1 CLI wiring smoke coverage; replace or remove it in the next wave when the real omni3 stages land. Signed-off-by: Marc Romeyn <marcromeyn@gmail.com>
Add the long-document SDG recipe directory with a full README and nine script stubs.

The upstream GitLab recipe source currently redirects to sign-in from this environment, so the script bodies remain explicit release-time TODO stubs for now.

Signed-off-by: Marc Romeyn <marcromeyn@gmail.com>
Replaces 19 occurrences of the literal $USER in OCI-archive container
paths with ${oc.env:USER}. OmegaConf does not expand shell variables,
so the literal was passed through to ensure_squashed_image unmodified
and would fail to resolve the archive path on any real cluster.
Spotted across:
- stage0_sft configs (default, tiny, audio_text, image_text_sft,
image_text_peft, peft_valor32k) and README.md
- stage1_rl sub-stages (mpo, text, vision) configs + train.py runspec
comments
- stage2_eval default config
No functional change for any config that was previously overriding the
container field; only fixes the out-of-the-box default path resolution.
Signed-off-by: Marc Romeyn <marcromeyn@gmail.com>
Two critical fixes before the branch can survive real cluster usage:

1. stage3_vision_rl/train.py's runspec declared nodes=16, gpus_per_node=8 on a stub that raises NotImplementedError. A stray `nemotron omni3 rl vision --batch` would allocate 128 GPUs, wait, and then crash. Dropped to nodes=1, gpus_per_node=0; the real footprint gets restored when the upstream launcher lands alongside the real main() body.

2. stage0_sft/train.py invoked torchrun with only --nproc-per-node, so a multi-node run (`nodes=2` in the runspec) launched two independent torchrun processes that never rendezvoused: each spanning its own 8 GPUs, with no cross-node collectives. It now reads SLURM_NNODES/SLURM_NODEID, resolves the first SLURM_STEP_NODELIST hostname into MASTER_ADDR, and passes --nnodes / --node-rank / --rdzv-endpoint through to torchrun. Single-node runs fall back to the static default. A warning is emitted when nnodes>1 but MASTER_ADDR can't be resolved.

Signed-off-by: Marc Romeyn <marcromeyn@gmail.com>
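The rendezvous-arg derivation in fix 2 can be sketched as follows. Function and variable names are illustrative, not the shipped code, and the nodelist-to-hostname resolution is simplified to a pre-set MASTER_ADDR:

```python
# Derive explicit torchrun rendezvous flags from SLURM env vars so the
# per-node torchrun processes actually join one job instead of running
# independently. Single-node runs return no extra flags.
import os
import warnings


def torchrun_rendezvous_args(master_port: int = 29500) -> list[str]:
    nnodes = int(os.environ.get("SLURM_NNODES", "1"))
    if nnodes <= 1:
        return []  # single node: torchrun's static default rendezvous is fine
    node_rank = int(os.environ.get("SLURM_NODEID", "0"))
    # In the real fix this comes from the first SLURM_STEP_NODELIST hostname.
    master = os.environ.get("MASTER_ADDR")
    if not master:
        warnings.warn("nnodes > 1 but MASTER_ADDR could not be resolved")
        master = "127.0.0.1"
    return [
        f"--nnodes={nnodes}",
        f"--node-rank={node_rank}",
        f"--rdzv-endpoint={master}:{master_port}",
    ]


os.environ.update(SLURM_NNODES="2", SLURM_NODEID="1", MASTER_ADDR="node001")
assert torchrun_rendezvous_args() == [
    "--nnodes=2", "--node-rank=1", "--rdzv-endpoint=node001:29500"
]
```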
…leak
Three fixes that turn "works on a happy-path stable SSH session" into
"survives real operational conditions":
1. omni3/pipe.py now auto-detects the vision stub via its degenerate
resource footprint (gpus_per_node=0, set in the previous commit) and
skips stage 4 by default, surfacing omni3-rl-text-model:latest as the
final usable artifact. Users can pass `pipe.force_vision=true` via
dotlist to override once the upstream launcher lands and the runspec
footprint has been bumped back. Previously, `omni3 pipe --run` burned
through stages 1-3 (days of compute) before crashing at stage 4's
NotImplementedError.
2. omni3/data/prep/{sft,rl}.py used $CWD/config.yaml as a staging path
for the resolved recipe config. That silently overwrote whatever the
user happened to have at that path — a data-loss foot-gun for anyone
iterating on config.yaml in the repo root. Now stages under a
job-scoped filename (`.nemotron-data-prep-<job>.yaml`). The remote
path inside the Ray workdir stays `config.yaml` so the `cmd` template
is unchanged.
3. Same two files leaked one `NamedTemporaryFile(delete=False)` per
invocation in /tmp for the Ray runtime_env yaml. Both that file and
the new job-scoped config copy are now cleaned up in a try/finally
around ray_job.start + tunnel.put.
Test updates:
- test_pipe_dry_run_succeeds now asserts the stub-skip path (rl text as
final artifact).
- test_pipe_dry_run_force_vision covers the override path.
Signed-off-by: Marc Romeyn <marcromeyn@gmail.com>
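The staging-plus-cleanup pattern from fixes 2 and 3 can be sketched like this. `tunnel` and `ray_job` are stand-ins for the real objects, and all names here are illustrative:

```python
# Stage the resolved config under a job-scoped local name (never the user's
# config.yaml), stage the Ray runtime_env yaml in a NamedTemporaryFile, and
# guarantee both are removed in a finally block around submission.
import os
import tempfile


def submit_with_cleanup(tunnel, ray_job, job_name: str, config_text: str) -> None:
    # Job-scoped filename avoids clobbering whatever the user has at ./config.yaml.
    staged_config = f".nemotron-data-prep-{job_name}.yaml"
    runtime_env = tempfile.NamedTemporaryFile(mode="w", suffix=".yaml", delete=False)
    try:
        with open(staged_config, "w") as f:
            f.write(config_text)
        runtime_env.write("working_dir: .\n")
        runtime_env.close()
        # The remote name stays config.yaml so the cmd template is unchanged.
        tunnel.put(staged_config, "config.yaml")
        ray_job.start(runtime_env=runtime_env.name)
    finally:
        runtime_env.close()
        for path in (staged_config, runtime_env.name):
            if os.path.exists(path):
                os.unlink(path)
```

Cleanup runs whether `ray_job.start` succeeds or raises, closing both the /tmp leak and the data-loss foot-gun in one shape.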
The broad `!src/nemotron/cli/commands/omni3/data/**` and `!src/nemotron/recipes/data/**` exceptions needed to defeat the top-level `data/` rule also un-masked `__pycache__/` directories inside those trees. Four `.pyc` files leaked into the previous commit. Re-ignore explicitly and remove the tracked artifacts. Signed-off-by: Marc Romeyn <marcromeyn@gmail.com>
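The resulting rule shape, roughly (exact ordering in the shipped .gitignore may differ; later rules win, so the explicit re-ignore must come after the negations):

```gitignore
# Top-level rule ignores all data/ trees...
data/
# ...except the omni3 recipe data trees, which must stay tracked:
!src/nemotron/cli/commands/omni3/data/**
!src/nemotron/recipes/data/**
# The negations above would otherwise un-mask bytecode inside those trees,
# so re-ignore it explicitly:
__pycache__/
*.pyc
```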
Six sites previously inlined the same `build_partition > run_partition >
partition` / `build_time > time` / `build_image > default` precedence
chain (flagged across the oracle review of commits 2, 3, 5, and 6 — the
single largest duplication on the omni3 branch).
Consolidate into four public helpers in `nemo_runspec.squash`:
- resolve_build_partition(env) -> str | None
- resolve_build_time(env, default) -> str
- resolve_build_image(env, default) -> str
- build_salloc_args(env, **opts) -> list[str] (salloc argv shape)
Migrate call sites:
- nemo_runspec.squash.ensure_squashed_image (was 20 lines, now 1)
- nemotron.cli.kit.squash (was 20 lines, now 1)
- nemotron.cli.commands.omni3.build (two display/executor sites, both now
use the three resolve_* helpers directly since the SlurmExecutor kwargs
differ from salloc argv)
Behavior is unchanged: the partition/time precedence and defaults match
what was written inline in each site. Adds 13 unit tests covering the
helpers' precedence rules; existing 11 scheme-passthrough tests keep
passing.
Note: the data-prep Ray submissions (omni3/data/prep/{sft,rl}.py) and
the model lifecycle helper (omni3/model/_base.py) use a different,
mode-driven precedence (run_partition vs batch_partition depending on
attached mode) and are intentionally not touched by this commit.
Signed-off-by: Marc Romeyn <marcromeyn@gmail.com>
Oracle review of commit 9 flagged that the omni3 family docs described the design-doc state of the branch rather than the shipped state: the vision RL stub, RL Dockerfile placeholder, SFT data-prep staging scaffold, SDG script stubs, text-only eval default, and upstream-private branch resolution were either glossed over or absent entirely.

Fix:
- README.md: prominent "Current Limitations" section listing each stub/placeholder with the exact path, the auto-detect/override mechanism where one exists, and a pointer to the per-stage guide.
- sft.md: blockquote callout covering the data-prep staging scaffold and the pending upstream dev/nomni branch.
- rl.md: blockquote callout covering the vision stub auto-skip in pipe, the Dockerfile placeholder body, and the pending nano-v3-omni-recipes branch.
- evaluate.md: blockquote callout explaining the text-only default task list and how to override the default model artifact while vision is stubbed.

No code changes. The rl.md vision-stub acknowledgment that already existed is preserved and expanded.

Signed-off-by: Marc Romeyn <marcromeyn@gmail.com>
The previous shape had three problems the user surfaced:
1. `uv run` leaked into the PEP 723 cmd template — that's a Dockerfile
implementation detail (we use uv to manage the venv), not part of
the script's declared interface.
2. `train.py` hand-rolled a subprocess dance that reconstructed the
torchrun command, duplicated SLURM_NNODES/SLURM_NODEID parsing, and
flattened OmegaConf overrides back to argv — work that nemo-run's
Torchrun launcher already does.
3. `train.py` ran in the base image's Python while the subprocess ran
in the uv venv — a split that made imports fragile.
Split concerns cleanly:
**Dockerfile** (`stage0_sft/Dockerfile`):
- `ENV PATH=/workspace/Megatron-Bridge/.venv/bin:$PATH` makes the
uv-synced venv the default Python environment. Downstream tooling
no longer needs to know `uv` was how it got built.
- Drop `uv run` prefix from the sanity-check RUN lines.
**Runspec** (`stage0_sft/train.py` PEP 723):
- `launch = "torchrun"` (was `"direct"`) — nemo-run now wraps with
torchrun and injects --nproc-per-node / --nnodes / --node-rank / the
rendezvous flags from SLURM env vars automatically.
- `cmd = "python {script} --recipe {recipe} --step_func {step_func} --config {config}"`
— no `uv run`, matches nano3's declarative style.
**train.py body** — collapsed to a ~15-line `os.execvp` forwarder that
accepts the CLI flags and hands them to
`/workspace/Megatron-Bridge/scripts/training/run_recipe.py`. Deletes
`_build_command`, `_torchrun_rendezvous_args`, and `_flatten_overrides`
(~90 lines). No nemotron.* imports so it runs cleanly in the uv venv.
**CLI** (`commands/omni3/sft.py`):
- Reads `recipe.name` / `recipe.step_func` from the compiled job
config and passes them as `--recipe` / `--step_func` args alongside
`--config REMOTE_CONFIG`.
- `torchrun=(SPEC.run.launch == "torchrun")` — honors the runspec's
launch mode for local execution too (was hardcoded `False`).
Net effect: the hand-rolled multi-node rendezvous logic from the
previous fix commit becomes unnecessary (nemo-run's Torchrun launcher
does it); the `uv run` / base-venv-vs-uv-venv split disappears; the
runspec is portable across dependency-management tools (swap uv for
conda tomorrow: only the Dockerfile changes).
Signed-off-by: Marc Romeyn <marcromeyn@gmail.com>
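The collapsed forwarder's shape, roughly. The PEP 723 header fields and the run_recipe.py path follow the commit message; the argparse details and helper name are assumptions for the sketch:

```python
# /// script
# dependencies = []
# [tool.runspec]
# launch = "torchrun"
# cmd = "python {script} --recipe {recipe} --step_func {step_func} --config {config}"
# ///
# Thin forwarder: accept the declared CLI flags and exec the upstream entry
# point in-place. No nemotron.* imports, so it runs cleanly in the uv venv
# that the Dockerfile put first on PATH.
import argparse
import os
import sys


def build_exec_argv(recipe: str, step_func: str, config: str) -> list[str]:
    return [
        "python",
        "/workspace/Megatron-Bridge/scripts/training/run_recipe.py",
        "--recipe", recipe,
        "--step_func", step_func,
        "--config", config,
    ]


def main(argv=None) -> None:
    parser = argparse.ArgumentParser()
    for flag in ("--recipe", "--step_func", "--config"):
        parser.add_argument(flag, required=True)
    args = parser.parse_args(argv)
    # execvp replaces this process, so whichever python is on PATH
    # (the uv venv's, per the Dockerfile ENV change) is what runs.
    os.execvp("python", build_exec_argv(args.recipe, args.step_func, args.config))


if __name__ == "__main__":
    main(sys.argv[1:])
```

Because `launch = "torchrun"`, nemo-run wraps this command with torchrun and injects the rendezvous flags from the SLURM env itself.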
The shipped default for Omni SFT pointed at Valor32k-AVQA, an NVIDIA-internal
Energon dataset — so a fresh user running `nemotron omni3 sft --run <cluster>`
out of the box hit `Energon dataset not found at /datasets/valor32k/energon`
with no path to recovery. Flipping the default makes the recipe runnable from
open data on a public clone.
Config swaps (explicit git mv so history follows):
- default.yaml → valor32k.yaml (internal Energon SFT, same contents;
added header documenting OMNI3_VALOR32K_ENERGON_PATH)
- image_text_sft.yaml → default.yaml (CORD-v2 via HF `vlm-hf` loader;
train_iters bumped 20 → 1000; save path aligned to
/nemo_run/omni3-sft-model; header documents the
projector-only trade-off per QA guide §5.2.2)
tiny.yaml repointed at CORD-v2 so `-c tiny` smoke tests work without internal
dataset access. Same for data_prep/tiny.yaml.
data_prep flow now supports both modes:
- hf_dataset_id set → manifest-only (training container pulls from Hub)
- dataset_path set → Energon validation (existing behavior)
- both → ValueError
- neither → ValueError pointing at the two config exemplars
data_prep.py:
- Omni3EnergonDataPrepConfig gains optional hf_dataset_id; dataset_path is
now Optional[Path] so callers can choose the flow via which field they set.
- run_data_prep dispatches to _run_hf_prep or _run_energon_prep; both emit a
manifest.json carrying `source: huggingface|energon` for downstream
artifact lineage.
- PEP 723 runspec default flipped "valor32k" → "default" so `omni3 data prep
sft` picks the HF config by default.
New config: data_prep/default.yaml (CORD-v2 HF flow).
Docs: sft.md and README.md Known Limitations updated to reflect the open default
and the Valor32k opt-in. Stays honest about the projector-only constraint on
single-node CORD-v2.
Signed-off-by: Marc Romeyn <marcromeyn@gmail.com>
Contributor comment: WHOO!!!!
No description provided.