
Adding Omni-3 recipes #173

Merged
marcromeyn merged 23 commits into main from romeyn/omni-3 on Apr 28, 2026

Conversation

@marcromeyn
Contributor

No description provided.

Signed-off-by: Marc Romeyn <marcromeyn@gmail.com>
Keep the _smoke_test_stage placeholder in-tree for wave-1 CLI wiring smoke coverage; replace or remove it in the next wave when the real omni3 stages land.

Signed-off-by: Marc Romeyn <marcromeyn@gmail.com>
Add the long-document SDG recipe directory with a full README and nine script stubs.

The upstream GitLab recipe source currently redirects to sign-in from this environment, so the script bodies remain explicit release-time TODO stubs for now.

Signed-off-by: Marc Romeyn <marcromeyn@gmail.com>
Replaces 19 occurrences of the literal $USER in OCI-archive container
paths with ${oc.env:USER}. OmegaConf does not expand shell variables,
so the literal was being passed through to ensure_squashed_image and
would fail to resolve the archive path on any real cluster.

Spotted across:
- stage0_sft configs (default, tiny, audio_text, image_text_sft,
  image_text_peft, peft_valor32k) and README.md
- stage1_rl sub-stages (mpo, text, vision) configs + train.py runspec
  comments
- stage2_eval default config

No functional change for any config that was already overriding the
container field; this only fixes the out-of-the-box default path
resolution.

Signed-off-by: Marc Romeyn <marcromeyn@gmail.com>
Two critical fixes before the branch can survive real cluster usage:

1. stage3_vision_rl/train.py runspec declared nodes=16, gpus_per_node=8
   on a stub that raises NotImplementedError. A stray `nemotron omni3 rl
   vision --batch` would allocate 128 GPUs, wait, and then crash.
   Dropped to nodes=1, gpus_per_node=0; the real footprint gets restored
   when the upstream launcher lands alongside the real main() body.

2. stage0_sft/train.py invoked torchrun with only --nproc-per-node, so a
   multi-node run (`nodes=2` in the runspec) launched two independent
   torchrun processes that never rendezvoused — each spanning its own
   8 GPUs, no cross-node collectives. Now reads SLURM_NNODES/SLURM_NODEID
   and resolves the first SLURM_STEP_NODELIST hostname into MASTER_ADDR,
   passing --nnodes / --node-rank / --rdzv-endpoint through to torchrun.
   Single-node runs fall back to the static default. Emits a warning
   when nnodes>1 but MASTER_ADDR can't be resolved.

Signed-off-by: Marc Romeyn <marcromeyn@gmail.com>
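The rendezvous wiring in fix 2 can be sketched as below. Helper and default values are illustrative, not the shipped `train.py`; in particular, a real SLURM nodelist can be compressed (`node[01-04]`), which production code would expand via `scontrol show hostnames` rather than a comma split.

```python
# Sketch: derive torchrun multi-node args from SLURM env vars, falling
# back to a static single-node default when SLURM_NNODES <= 1.
import os
import socket
import warnings

def torchrun_rendezvous_args(gpus_per_node=8, port=29500):
    nnodes = int(os.environ.get("SLURM_NNODES", "1"))
    node_rank = int(os.environ.get("SLURM_NODEID", "0"))
    if nnodes <= 1:
        # Single-node: no rendezvous needed, keep the static default.
        return ["--nproc-per-node", str(gpus_per_node)]
    # First hostname in the step nodelist becomes MASTER_ADDR.
    # (Naive comma split; compressed ranges need scontrol expansion.)
    nodelist = os.environ.get("SLURM_STEP_NODELIST", "")
    master = nodelist.split(",")[0] if nodelist else "localhost"
    try:
        master = socket.gethostbyname(master)
    except OSError:
        warnings.warn(f"nnodes>1 but MASTER_ADDR could not be resolved from {master!r}")
    return [
        "--nnodes", str(nnodes),
        "--node-rank", str(node_rank),
        "--rdzv-endpoint", f"{master}:{port}",
        "--nproc-per-node", str(gpus_per_node),
    ]
```

Without `--nnodes`/`--rdzv-endpoint`, each node's torchrun forms its own independent process group, which is exactly the silent no-cross-node-collectives failure described above.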
…leak

Three fixes that turn "works on a happy-path stable SSH session" into
"survives real operational conditions":

1. omni3/pipe.py now auto-detects the vision stub via its degenerate
   resource footprint (gpus_per_node=0, set in the previous commit) and
   skips stage 4 by default, surfacing omni3-rl-text-model:latest as the
   final usable artifact. Users can pass `pipe.force_vision=true` via
   dotlist to override once the upstream launcher lands and the runspec
   footprint has been bumped back. Previously, `omni3 pipe --run` burned
   through stages 1-3 (days of compute) before crashing at stage 4's
   NotImplementedError.

2. omni3/data/prep/{sft,rl}.py used $CWD/config.yaml as a staging path
   for the resolved recipe config. That silently overwrote whatever the
   user happened to have at that path — a data-loss foot-gun for anyone
   iterating on config.yaml in the repo root. Now stages under a
   job-scoped filename (`.nemotron-data-prep-<job>.yaml`). The remote
   path inside the Ray workdir stays `config.yaml` so the `cmd` template
   is unchanged.

3. Same two files leaked one `NamedTemporaryFile(delete=False)` per
   invocation in /tmp for the Ray runtime_env yaml. Both that file and
   the new job-scoped config copy are now cleaned up in a try/finally
   around ray_job.start + tunnel.put.

Test updates:
- test_pipe_dry_run_succeeds now asserts the stub-skip path (rl text as
  final artifact).
- test_pipe_dry_run_force_vision covers the override path.

Signed-off-by: Marc Romeyn <marcromeyn@gmail.com>
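Fixes 2 and 3 share one staging pattern, sketched below with hypothetical names (`submit_with_staged_config`, the `put`/`start_job` callables): a job-scoped local filename so nothing clobbers the user's `config.yaml`, and a try/finally that removes both temp artifacts even when submission fails.

```python
# Sketch of job-scoped config staging plus guaranteed temp-file cleanup.
import os
import tempfile
from pathlib import Path

def submit_with_staged_config(job_name, resolved_yaml, put, start_job):
    # Job-scoped staging path instead of the repo-root config.yaml,
    # so we never overwrite a file the user is editing.
    local_cfg = Path(f".nemotron-data-prep-{job_name}.yaml")
    local_cfg.write_text(resolved_yaml)
    runtime_env = tempfile.NamedTemporaryFile(
        mode="w", suffix=".yaml", delete=False
    )
    try:
        runtime_env.write("pip: []\n")  # placeholder Ray runtime_env body
        runtime_env.close()
        # Remote name stays config.yaml so the cmd template is unchanged.
        put(local_cfg, "config.yaml")
        start_job(runtime_env.name)
    finally:
        # Clean up both artifacts whether or not submission succeeded.
        os.unlink(runtime_env.name)
        local_cfg.unlink(missing_ok=True)
```

The pre-fix behavior was the inverse on both counts: a fixed `config.yaml` staging path and a `NamedTemporaryFile(delete=False)` that nothing ever deleted.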
The broad `!src/nemotron/cli/commands/omni3/data/**` and `!src/nemotron/recipes/data/**`
exceptions needed to defeat the top-level `data/` rule also un-masked
`__pycache__/` directories inside those trees. Four `.pyc` files leaked
into the previous commit. Re-ignore explicitly and remove the tracked
artifacts.

Signed-off-by: Marc Romeyn <marcromeyn@gmail.com>
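The interaction above comes from gitignore ordering: a later pattern wins, so a broad negation re-includes everything beneath it until a still-later rule re-excludes. A sketch of the intended rule order (paths from this commit; the trailing rules are the fix):

```gitignore
# Broad rule: ignore all data/ trees.
data/
# Narrow un-ignores for the recipe sources that must stay tracked.
!src/nemotron/cli/commands/omni3/data/**
!src/nemotron/recipes/data/**
# Last: re-ignore bytecode so the negations above cannot resurrect it.
__pycache__/
*.pyc
```

Placing the `__pycache__/` rule after the negations is what keeps `.pyc` files out of commits inside the un-masked trees.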
Six sites previously inlined the same `build_partition > run_partition >
partition` / `build_time > time` / `build_image > default` precedence
chain (flagged across the oracle review of commits 2, 3, 5, and 6 — the
single largest duplication on the omni3 branch).

Consolidate into four public helpers in `nemo_runspec.squash`:

- resolve_build_partition(env)         -> str | None
- resolve_build_time(env, default)     -> str
- resolve_build_image(env, default)    -> str
- build_salloc_args(env, **opts)       -> list[str]    (salloc argv shape)

Migrate call sites:
- nemo_runspec.squash.ensure_squashed_image (was 20 lines, now 1)
- nemotron.cli.kit.squash (was 20 lines, now 1)
- nemotron.cli.commands.omni3.build (two display/executor sites, both now
  use the three resolve_* helpers directly since the SlurmExecutor kwargs
  differ from salloc argv)

Behavior is unchanged: the partition/time precedence and defaults match
what was written inline in each site. Adds 13 unit tests covering the
helpers' precedence rules; existing 11 scheme-passthrough tests keep
passing.

Note: the data-prep Ray submissions (omni3/data/prep/{sft,rl}.py) and
the model lifecycle helper (omni3/model/_base.py) use a different,
mode-driven precedence (run_partition vs batch_partition depending on
attached mode) and are intentionally not touched by this commit.

Signed-off-by: Marc Romeyn <marcromeyn@gmail.com>
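The precedence chains can be sketched as below. The environment-variable names and defaults here are assumptions for illustration, not the actual `nemo_runspec.squash` implementation; only the precedence order (`build_* > run_* > base`, with a caller default as the last resort) is taken from the commit message.

```python
# Sketch of the consolidated precedence helpers.
import os

def resolve_build_partition(env=os.environ):
    # build_partition > run_partition > partition, else None.
    for key in ("NEMOTRON_BUILD_PARTITION", "NEMOTRON_RUN_PARTITION",
                "NEMOTRON_PARTITION"):  # names assumed for illustration
        if env.get(key):
            return env[key]
    return None

def resolve_build_time(env=os.environ, default="01:00:00"):
    # build_time > time, else the caller's default.
    return env.get("NEMOTRON_BUILD_TIME") or env.get("NEMOTRON_TIME") or default

def build_salloc_args(env=os.environ, time_default="01:00:00"):
    # salloc argv shape: emit only the flags whose values resolved.
    args = []
    if (partition := resolve_build_partition(env)):
        args += ["--partition", partition]
    args += ["--time", resolve_build_time(env, time_default)]
    return args
```

The omni3 `build` call sites use the `resolve_*` functions directly because SlurmExecutor takes keyword arguments, not salloc argv; both shapes share the same precedence logic.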
Oracle review of commit 9 flagged that the omni3 family docs described
the design-doc state of the branch rather than the shipped state — the
vision RL stub, RL Dockerfile placeholder, SFT data-prep staging
scaffold, SDG script stubs, text-only eval default, and upstream-private
branch resolution were either glossed over or absent entirely.

Fix:
- README.md: prominent "Current Limitations" section listing each
  stub/placeholder with the exact path, the auto-detect/override
  mechanism where one exists, and a pointer to the per-stage guide.
- sft.md: blockquote callout covering data-prep staging scaffold + the
  pending upstream dev/nomni branch.
- rl.md: blockquote callout covering the vision stub auto-skip in pipe,
  the Dockerfile placeholder body, and the pending nano-v3-omni-recipes
  branch.
- evaluate.md: blockquote callout explaining text-only default task
  list + how to override the default model artifact while vision is
  stubbed.

No code changes. The rl.md vision-stub acknowledgment that already
existed is preserved and expanded.

Signed-off-by: Marc Romeyn <marcromeyn@gmail.com>
The previous shape had three problems the user surfaced:
  1. `uv run` leaked into the PEP 723 cmd template — that's a Dockerfile
     implementation detail (we use uv to manage the venv), not part of
     the script's declared interface.
  2. `train.py` hand-rolled a subprocess dance that reconstructed the
     torchrun command, duplicated SLURM_NNODES/SLURM_NODEID parsing, and
     flattened OmegaConf overrides back to argv — work that nemo-run's
     Torchrun launcher already does.
  3. `train.py` ran in the base image's Python while the subprocess ran
     in the uv venv — a split that made imports fragile.

Split concerns cleanly:

**Dockerfile** (`stage0_sft/Dockerfile`):
  - `ENV PATH=/workspace/Megatron-Bridge/.venv/bin:$PATH` makes the
    uv-synced venv the default Python environment. Downstream tooling
    no longer needs to know `uv` was how it got built.
  - Drop `uv run` prefix from the sanity-check RUN lines.

**Runspec** (`stage0_sft/train.py` PEP 723):
  - `launch = "torchrun"` (was `"direct"`) — nemo-run now wraps with
    torchrun and injects --nproc-per-node / --nnodes / --node-rank / the
    rendezvous flags from SLURM env vars automatically.
  - `cmd = "python {script} --recipe {recipe} --step_func {step_func} --config {config}"`
    — no `uv run`, matches nano3's declarative style.

**train.py body** — collapsed to a ~15-line `os.execvp` forwarder that
accepts the CLI flags and hands them to
`/workspace/Megatron-Bridge/scripts/training/run_recipe.py`. Deletes
`_build_command`, `_torchrun_rendezvous_args`, and `_flatten_overrides`
(~90 lines). No nemotron.* imports so it runs cleanly in the uv venv.

**CLI** (`commands/omni3/sft.py`):
  - Reads `recipe.name` / `recipe.step_func` from the compiled job
    config and passes them as `--recipe` / `--step_func` args alongside
    `--config REMOTE_CONFIG`.
  - `torchrun=(SPEC.run.launch == "torchrun")` — honors the runspec's
    launch mode for local execution too (was hardcoded `False`).

Net effect: the hand-rolled multi-node rendezvous logic from the
previous fix commit becomes unnecessary (nemo-run's Torchrun launcher
does it); the `uv run` / base-venv-vs-uv-venv split disappears; the
runspec is portable across dependency-management tools (swap uv for
conda tomorrow: only the Dockerfile changes).

Signed-off-by: Marc Romeyn <marcromeyn@gmail.com>
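The collapsed forwarder has roughly this shape. The target path and flag names are quoted from the commit message above; the argparse wiring and helper split are illustrative (the argv is built in a separate function here purely so it can be inspected without exec'ing):

```python
# Sketch of the ~15-line execvp forwarder: parse three flags, exec the
# recipe runner in-place. No nemotron.* imports, so it runs cleanly in
# whatever Python the container's PATH resolves (the uv venv).
import argparse
import os
import sys

RUN_RECIPE = "/workspace/Megatron-Bridge/scripts/training/run_recipe.py"

def build_exec_argv(recipe, step_func, config, extra=()):
    return [sys.executable, RUN_RECIPE,
            "--recipe", recipe, "--step_func", step_func,
            "--config", config, *extra]

def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--recipe", required=True)
    parser.add_argument("--step_func", required=True)
    parser.add_argument("--config", required=True)
    args, extra = parser.parse_known_args(argv)
    cmd = build_exec_argv(args.recipe, args.step_func, args.config, extra)
    # Replace this process instead of spawning a child: no subprocess
    # dance, and signals and exit codes flow through untouched.
    os.execvp(cmd[0], cmd)
```

Because nemo-run's Torchrun launcher wraps the command and injects the rendezvous flags from SLURM env vars, the forwarder never needs to know about nodes or GPUs at all.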
The shipped default for Omni SFT pointed at Valor32k-AVQA, an NVIDIA-internal
Energon dataset — so a fresh user running `nemotron omni3 sft --run <cluster>`
out of the box hit `Energon dataset not found at /datasets/valor32k/energon`
with no path to recovery. Flipping the default makes the recipe runnable from
open data on a public clone.

Config swaps (explicit git mv so history follows):
  - default.yaml          → valor32k.yaml (internal Energon SFT, same contents;
                            added header documenting OMNI3_VALOR32K_ENERGON_PATH)
  - image_text_sft.yaml   → default.yaml (CORD-v2 via HF `vlm-hf` loader;
                            train_iters bumped 20 → 1000; save path aligned to
                            /nemo_run/omni3-sft-model; header documents the
                            projector-only trade-off per QA guide §5.2.2)

tiny.yaml repointed at CORD-v2 so `-c tiny` smoke tests work without internal
dataset access. Same for data_prep/tiny.yaml.

data_prep flow now supports both modes:
  - hf_dataset_id set → manifest-only (training container pulls from Hub)
  - dataset_path set  → Energon validation (existing behavior)
  - both              → ValueError
  - neither           → ValueError pointing at the two config exemplars

data_prep.py:
  - Omni3EnergonDataPrepConfig gains optional hf_dataset_id; dataset_path is
    now Optional[Path] so callers can choose the flow via which field they set.
  - run_data_prep dispatches to _run_hf_prep or _run_energon_prep; both emit a
    manifest.json carrying `source: huggingface|energon` for downstream
    artifact lineage.
  - PEP 723 runspec default flipped "valor32k" → "default" so `omni3 data prep
    sft` picks the HF config by default.

New config: data_prep/default.yaml (CORD-v2 HF flow).

Docs: sft.md and README.md Known Limitations updated to reflect the open default
and the Valor32k opt-in. Stays honest about the projector-only constraint on
single-node CORD-v2.

Signed-off-by: Marc Romeyn <marcromeyn@gmail.com>
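The four-way validation in the data_prep flow can be sketched as follows. The config and function names mirror the commit message, but the bodies are illustrative stand-ins (the real `_run_hf_prep`/`_run_energon_prep` do actual work; here they just return the manifest's `source` field):

```python
# Sketch of the two-mode dispatch: HF hub id vs. local Energon path,
# exactly one of which must be set.
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

@dataclass
class Omni3EnergonDataPrepConfig:
    hf_dataset_id: Optional[str] = None   # manifest-only HF flow
    dataset_path: Optional[Path] = None   # existing Energon validation flow

def run_data_prep(cfg):
    if cfg.hf_dataset_id and cfg.dataset_path:
        raise ValueError("set hf_dataset_id or dataset_path, not both")
    if cfg.hf_dataset_id:
        # Training container pulls from the Hub; manifest records lineage.
        return {"source": "huggingface", "dataset": cfg.hf_dataset_id}
    if cfg.dataset_path:
        return {"source": "energon", "dataset": str(cfg.dataset_path)}
    raise ValueError(
        "set hf_dataset_id (see data_prep/default.yaml) or "
        "dataset_path (see data_prep/valor32k.yaml)"
    )
```

The `source: huggingface|energon` field in the returned manifest is what downstream artifact lineage keys on.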
marcromeyn and others added 4 commits April 28, 2026 20:44
Signed-off-by: Marc Romeyn <marcromeyn@gmail.com>
@chrisalexiuk-nvidia
Contributor

WHOO!!!!

Contributor

@chrisalexiuk-nvidia left a comment

Hooray!

@marcromeyn marcromeyn merged commit 65311da into main Apr 28, 2026
3 of 4 checks passed
@marcromeyn marcromeyn deleted the romeyn/omni-3 branch April 28, 2026 19:20
