
Adding Omni-3 recipes #173

Merged
marcromeyn merged 23 commits into main from romeyn/omni-3 on Apr 28, 2026

Conversation

@marcromeyn
Contributor

No description provided.

Signed-off-by: Marc Romeyn <marcromeyn@gmail.com>
Keep the _smoke_test_stage placeholder in-tree for wave-1 CLI wiring smoke coverage; replace or remove it in the next wave when the real omni3 stages land.

Signed-off-by: Marc Romeyn <marcromeyn@gmail.com>
Add the long-document SDG recipe directory with a full README and nine script stubs.

The upstream GitLab recipe source currently redirects to sign-in from this environment, so the script bodies remain explicit release-time TODO stubs for now.

Signed-off-by: Marc Romeyn <marcromeyn@gmail.com>
Replaces 19 occurrences of the literal $USER in OCI-archive container
paths with ${oc.env:USER}. OmegaConf does not expand shell variables,
so the literal was being passed through to ensure_squashed_image and
would fail to resolve the archive path on any real cluster.

Spotted across:
- stage0_sft configs (default, tiny, audio_text, image_text_sft,
  image_text_peft, peft_valor32k) and README.md
- stage1_rl sub-stages (mpo, text, vision) configs + train.py runspec
  comments
- stage2_eval default config

No functional change for any config that was already overriding the
container field; this only fixes the out-of-the-box default path
resolution.

Signed-off-by: Marc Romeyn <marcromeyn@gmail.com>
Two critical fixes before the branch can survive real cluster usage:

1. stage3_vision_rl/train.py runspec declared nodes=16, gpus_per_node=8
   on a stub that raises NotImplementedError. A stray `nemotron omni3 rl
   vision --batch` would allocate 128 GPUs, wait, and then crash.
   Dropped to nodes=1, gpus_per_node=0; the real footprint gets restored
   when the upstream launcher lands alongside the real main() body.

2. stage0_sft/train.py invoked torchrun with only --nproc-per-node, so a
   multi-node run (`nodes=2` in the runspec) launched two independent
   torchrun processes that never rendezvoused — each spanning its own
   8 GPUs, no cross-node collectives. Now reads SLURM_NNODES/SLURM_NODEID
   and resolves the first SLURM_STEP_NODELIST hostname into MASTER_ADDR,
   passing --nnodes / --node-rank / --rdzv-endpoint through to torchrun.
   Single-node runs fall back to the static default. Emits a warning
   when nnodes>1 but MASTER_ADDR can't be resolved.

Signed-off-by: Marc Romeyn <marcromeyn@gmail.com>
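The rendezvous wiring in fix 2 can be sketched as below. Helper and default values are illustrative, not the shipped `train.py`; in particular, a real SLURM nodelist can be compressed (`node[01-04]`), which production code would expand via `scontrol show hostnames` rather than a comma split.

```python
# Sketch: derive torchrun multi-node args from SLURM env vars, falling
# back to a static single-node default when SLURM_NNODES <= 1.
import os
import socket
import warnings

def torchrun_rendezvous_args(gpus_per_node=8, port=29500):
    nnodes = int(os.environ.get("SLURM_NNODES", "1"))
    node_rank = int(os.environ.get("SLURM_NODEID", "0"))
    if nnodes <= 1:
        # Single-node: no rendezvous needed, keep the static default.
        return ["--nproc-per-node", str(gpus_per_node)]
    # First hostname in the step nodelist becomes MASTER_ADDR.
    # (Naive comma split; compressed ranges need scontrol expansion.)
    nodelist = os.environ.get("SLURM_STEP_NODELIST", "")
    master = nodelist.split(",")[0] if nodelist else "localhost"
    try:
        master = socket.gethostbyname(master)
    except OSError:
        warnings.warn(f"nnodes>1 but MASTER_ADDR could not be resolved from {master!r}")
    return [
        "--nnodes", str(nnodes),
        "--node-rank", str(node_rank),
        "--rdzv-endpoint", f"{master}:{port}",
        "--nproc-per-node", str(gpus_per_node),
    ]
```

Without `--nnodes`/`--rdzv-endpoint`, each node's torchrun forms its own independent process group, which is exactly the silent no-cross-node-collectives failure described above.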
…leak

Three fixes that turn "works on a happy-path stable SSH session" into
"survives real operational conditions":

1. omni3/pipe.py now auto-detects the vision stub via its degenerate
   resource footprint (gpus_per_node=0, set in the previous commit) and
   skips stage 4 by default, surfacing omni3-rl-text-model:latest as the
   final usable artifact. Users can pass `pipe.force_vision=true` via
   dotlist to override once the upstream launcher lands and the runspec
   footprint has been bumped back. Previously, `omni3 pipe --run` burned
   through stages 1-3 (days of compute) before crashing at stage 4's
   NotImplementedError.

2. omni3/data/prep/{sft,rl}.py used $CWD/config.yaml as a staging path
   for the resolved recipe config. That silently overwrote whatever the
   user happened to have at that path — a data-loss foot-gun for anyone
   iterating on config.yaml in the repo root. Now stages under a
   job-scoped filename (`.nemotron-data-prep-<job>.yaml`). The remote
   path inside the Ray workdir stays `config.yaml` so the `cmd` template
   is unchanged.

3. Same two files leaked one `NamedTemporaryFile(delete=False)` per
   invocation in /tmp for the Ray runtime_env yaml. Both that file and
   the new job-scoped config copy are now cleaned up in a try/finally
   around ray_job.start + tunnel.put.

Test updates:
- test_pipe_dry_run_succeeds now asserts the stub-skip path (rl text as
  final artifact).
- test_pipe_dry_run_force_vision covers the override path.

Signed-off-by: Marc Romeyn <marcromeyn@gmail.com>
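Fixes 2 and 3 share one staging pattern, sketched below with hypothetical names (`submit_with_staged_config`, the `put`/`start_job` callables): a job-scoped local filename so nothing clobbers the user's `config.yaml`, and a try/finally that removes both temp artifacts even when submission fails.

```python
# Sketch of job-scoped config staging plus guaranteed temp-file cleanup.
import os
import tempfile
from pathlib import Path

def submit_with_staged_config(job_name, resolved_yaml, put, start_job):
    # Job-scoped staging path instead of the repo-root config.yaml,
    # so we never overwrite a file the user is editing.
    local_cfg = Path(f".nemotron-data-prep-{job_name}.yaml")
    local_cfg.write_text(resolved_yaml)
    runtime_env = tempfile.NamedTemporaryFile(
        mode="w", suffix=".yaml", delete=False
    )
    try:
        runtime_env.write("pip: []\n")  # placeholder Ray runtime_env body
        runtime_env.close()
        # Remote name stays config.yaml so the cmd template is unchanged.
        put(local_cfg, "config.yaml")
        start_job(runtime_env.name)
    finally:
        # Clean up both artifacts whether or not submission succeeded.
        os.unlink(runtime_env.name)
        local_cfg.unlink(missing_ok=True)
```

The pre-fix behavior was the inverse on both counts: a fixed `config.yaml` staging path and a `NamedTemporaryFile(delete=False)` that nothing ever deleted.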
The broad `!src/nemotron/cli/commands/omni3/data/**` and `!src/nemotron/recipes/data/**`
exceptions needed to defeat the top-level `data/` rule also un-masked
`__pycache__/` directories inside those trees. Four `.pyc` files leaked
into the previous commit. Re-ignore explicitly and remove the tracked
artifacts.

Signed-off-by: Marc Romeyn <marcromeyn@gmail.com>
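The interaction above comes from gitignore ordering: a later pattern wins, so a broad negation re-includes everything beneath it until a still-later rule re-excludes. A sketch of the intended rule order (paths from this commit; the trailing rules are the fix):

```gitignore
# Broad rule: ignore all data/ trees.
data/
# Narrow un-ignores for the recipe sources that must stay tracked.
!src/nemotron/cli/commands/omni3/data/**
!src/nemotron/recipes/data/**
# Last: re-ignore bytecode so the negations above cannot resurrect it.
__pycache__/
*.pyc
```

Placing the `__pycache__/` rule after the negations is what keeps `.pyc` files out of commits inside the un-masked trees.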
Six sites previously inlined the same `build_partition > run_partition >
partition` / `build_time > time` / `build_image > default` precedence
chain (flagged across the oracle review of commits 2, 3, 5, and 6 — the
single largest duplication on the omni3 branch).

Consolidate into four public helpers in `nemo_runspec.squash`:

- resolve_build_partition(env)         -> str | None
- resolve_build_time(env, default)     -> str
- resolve_build_image(env, default)    -> str
- build_salloc_args(env, **opts)       -> list[str]    (salloc argv shape)

Migrate call sites:
- nemo_runspec.squash.ensure_squashed_image (was 20 lines, now 1)
- nemotron.cli.kit.squash (was 20 lines, now 1)
- nemotron.cli.commands.omni3.build (two display/executor sites, both now
  use the three resolve_* helpers directly since the SlurmExecutor kwargs
  differ from salloc argv)

Behavior is unchanged: the partition/time precedence and defaults match
what was written inline in each site. Adds 13 unit tests covering the
helpers' precedence rules; existing 11 scheme-passthrough tests keep
passing.

Note: the data-prep Ray submissions (omni3/data/prep/{sft,rl}.py) and
the model lifecycle helper (omni3/model/_base.py) use a different,
mode-driven precedence (run_partition vs batch_partition depending on
attached mode) and are intentionally not touched by this commit.

Signed-off-by: Marc Romeyn <marcromeyn@gmail.com>
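The precedence chains can be sketched as below. The environment-variable names and defaults here are assumptions for illustration, not the actual `nemo_runspec.squash` implementation; only the precedence order (`build_* > run_* > base`, with a caller default as the last resort) is taken from the commit message.

```python
# Sketch of the consolidated precedence helpers.
import os

def resolve_build_partition(env=os.environ):
    # build_partition > run_partition > partition, else None.
    for key in ("NEMOTRON_BUILD_PARTITION", "NEMOTRON_RUN_PARTITION",
                "NEMOTRON_PARTITION"):  # names assumed for illustration
        if env.get(key):
            return env[key]
    return None

def resolve_build_time(env=os.environ, default="01:00:00"):
    # build_time > time, else the caller's default.
    return env.get("NEMOTRON_BUILD_TIME") or env.get("NEMOTRON_TIME") or default

def build_salloc_args(env=os.environ, time_default="01:00:00"):
    # salloc argv shape: emit only the flags whose values resolved.
    args = []
    if (partition := resolve_build_partition(env)):
        args += ["--partition", partition]
    args += ["--time", resolve_build_time(env, time_default)]
    return args
```

The omni3 `build` call sites use the `resolve_*` functions directly because SlurmExecutor takes keyword arguments, not salloc argv; both shapes share the same precedence logic.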
Oracle review of commit 9 flagged that the omni3 family docs described
the design-doc state of the branch rather than the shipped state — the
vision RL stub, RL Dockerfile placeholder, SFT data-prep staging
scaffold, SDG script stubs, text-only eval default, and upstream-private
branch resolution were either glossed over or absent entirely.

Fix:
- README.md: prominent "Current Limitations" section listing each
  stub/placeholder with the exact path, the auto-detect/override
  mechanism where one exists, and a pointer to the per-stage guide.
- sft.md: blockquote callout covering data-prep staging scaffold + the
  pending upstream dev/nomni branch.
- rl.md: blockquote callout covering the vision stub auto-skip in pipe,
  the Dockerfile placeholder body, and the pending nano-v3-omni-recipes
  branch.
- evaluate.md: blockquote callout explaining text-only default task
  list + how to override the default model artifact while vision is
  stubbed.

No code changes. The rl.md vision-stub acknowledgment that already
existed is preserved and expanded.

Signed-off-by: Marc Romeyn <marcromeyn@gmail.com>
The previous shape had three problems the user surfaced:
  1. `uv run` leaked into the PEP 723 cmd template — that's a Dockerfile
     implementation detail (we use uv to manage the venv), not part of
     the script's declared interface.
  2. `train.py` hand-rolled a subprocess dance that reconstructed the
     torchrun command, duplicated SLURM_NNODES/SLURM_NODEID parsing, and
     flattened OmegaConf overrides back to argv — work that nemo-run's
     Torchrun launcher already does.
  3. `train.py` ran in the base image's Python while the subprocess ran
     in the uv venv — a split that made imports fragile.

Split concerns cleanly:

**Dockerfile** (`stage0_sft/Dockerfile`):
  - `ENV PATH=/workspace/Megatron-Bridge/.venv/bin:$PATH` makes the
    uv-synced venv the default Python environment. Downstream tooling
    no longer needs to know `uv` was how it got built.
  - Drop `uv run` prefix from the sanity-check RUN lines.

**Runspec** (`stage0_sft/train.py` PEP 723):
  - `launch = "torchrun"` (was `"direct"`) — nemo-run now wraps with
    torchrun and injects --nproc-per-node / --nnodes / --node-rank / the
    rendezvous flags from SLURM env vars automatically.
  - `cmd = "python {script} --recipe {recipe} --step_func {step_func} --config {config}"`
    — no `uv run`, matches nano3's declarative style.

**train.py body** — collapsed to a ~15-line `os.execvp` forwarder that
accepts the CLI flags and hands them to
`/workspace/Megatron-Bridge/scripts/training/run_recipe.py`. Deletes
`_build_command`, `_torchrun_rendezvous_args`, and `_flatten_overrides`
(~90 lines). No nemotron.* imports so it runs cleanly in the uv venv.

**CLI** (`commands/omni3/sft.py`):
  - Reads `recipe.name` / `recipe.step_func` from the compiled job
    config and passes them as `--recipe` / `--step_func` args alongside
    `--config REMOTE_CONFIG`.
  - `torchrun=(SPEC.run.launch == "torchrun")` — honors the runspec's
    launch mode for local execution too (was hardcoded `False`).

Net effect: the hand-rolled multi-node rendezvous logic from the
previous fix commit becomes unnecessary (nemo-run's Torchrun launcher
does it); the `uv run` / base-venv-vs-uv-venv split disappears; the
runspec is portable across dependency-management tools (swap uv for
conda tomorrow: only the Dockerfile changes).

Signed-off-by: Marc Romeyn <marcromeyn@gmail.com>
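The collapsed forwarder has roughly this shape. The target path and flag names are quoted from the commit message above; the argparse wiring and helper split are illustrative (the argv is built in a separate function here purely so it can be inspected without exec'ing):

```python
# Sketch of the ~15-line execvp forwarder: parse three flags, exec the
# recipe runner in-place. No nemotron.* imports, so it runs cleanly in
# whatever Python the container's PATH resolves (the uv venv).
import argparse
import os
import sys

RUN_RECIPE = "/workspace/Megatron-Bridge/scripts/training/run_recipe.py"

def build_exec_argv(recipe, step_func, config, extra=()):
    return [sys.executable, RUN_RECIPE,
            "--recipe", recipe, "--step_func", step_func,
            "--config", config, *extra]

def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--recipe", required=True)
    parser.add_argument("--step_func", required=True)
    parser.add_argument("--config", required=True)
    args, extra = parser.parse_known_args(argv)
    cmd = build_exec_argv(args.recipe, args.step_func, args.config, extra)
    # Replace this process instead of spawning a child: no subprocess
    # dance, and signals and exit codes flow through untouched.
    os.execvp(cmd[0], cmd)
```

Because nemo-run's Torchrun launcher wraps the command and injects the rendezvous flags from SLURM env vars, the forwarder never needs to know about nodes or GPUs at all.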
The shipped default for Omni SFT pointed at Valor32k-AVQA, an NVIDIA-internal
Energon dataset — so a fresh user running `nemotron omni3 sft --run <cluster>`
out of the box hit `Energon dataset not found at /datasets/valor32k/energon`
with no path to recovery. Flipping the default makes the recipe runnable from
open data on a public clone.

Config swaps (explicit git mv so history follows):
  - default.yaml          → valor32k.yaml (internal Energon SFT, same contents;
                            added header documenting OMNI3_VALOR32K_ENERGON_PATH)
  - image_text_sft.yaml   → default.yaml (CORD-v2 via HF `vlm-hf` loader;
                            train_iters bumped 20 → 1000; save path aligned to
                            /nemo_run/omni3-sft-model; header documents the
                            projector-only trade-off per QA guide §5.2.2)

tiny.yaml repointed at CORD-v2 so `-c tiny` smoke tests work without internal
dataset access. Same for data_prep/tiny.yaml.

data_prep flow now supports both modes:
  - hf_dataset_id set → manifest-only (training container pulls from Hub)
  - dataset_path set  → Energon validation (existing behavior)
  - both              → ValueError
  - neither           → ValueError pointing at the two config exemplars

data_prep.py:
  - Omni3EnergonDataPrepConfig gains optional hf_dataset_id; dataset_path is
    now Optional[Path] so callers can choose the flow via which field they set.
  - run_data_prep dispatches to _run_hf_prep or _run_energon_prep; both emit a
    manifest.json carrying `source: huggingface|energon` for downstream
    artifact lineage.
  - PEP 723 runspec default flipped "valor32k" → "default" so `omni3 data prep
    sft` picks the HF config by default.

New config: data_prep/default.yaml (CORD-v2 HF flow).

Docs: sft.md and README.md Known Limitations updated to reflect the open default
and the Valor32k opt-in. Stays honest about the projector-only constraint on
single-node CORD-v2.

Signed-off-by: Marc Romeyn <marcromeyn@gmail.com>
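The four-way validation in the data_prep flow can be sketched as follows. The config and function names mirror the commit message, but the bodies are illustrative stand-ins (the real `_run_hf_prep`/`_run_energon_prep` do actual work; here they just return the manifest's `source` field):

```python
# Sketch of the two-mode dispatch: HF hub id vs. local Energon path,
# exactly one of which must be set.
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

@dataclass
class Omni3EnergonDataPrepConfig:
    hf_dataset_id: Optional[str] = None   # manifest-only HF flow
    dataset_path: Optional[Path] = None   # existing Energon validation flow

def run_data_prep(cfg):
    if cfg.hf_dataset_id and cfg.dataset_path:
        raise ValueError("set hf_dataset_id or dataset_path, not both")
    if cfg.hf_dataset_id:
        # Training container pulls from the Hub; manifest records lineage.
        return {"source": "huggingface", "dataset": cfg.hf_dataset_id}
    if cfg.dataset_path:
        return {"source": "energon", "dataset": str(cfg.dataset_path)}
    raise ValueError(
        "set hf_dataset_id (see data_prep/default.yaml) or "
        "dataset_path (see data_prep/valor32k.yaml)"
    )
```

The `source: huggingface|energon` field in the returned manifest is what downstream artifact lineage keys on.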
marcromeyn and others added 4 commits April 28, 2026 20:44
Signed-off-by: Marc Romeyn <marcromeyn@gmail.com>
@chrisalexiuk-nvidia
Contributor

WHOO!!!!

Contributor

@chrisalexiuk-nvidia left a comment

Hooray!

@marcromeyn marcromeyn merged commit 65311da into main Apr 28, 2026
3 of 4 checks passed
@marcromeyn marcromeyn deleted the romeyn/omni-3 branch April 28, 2026 19:20
