Add Dynamo combined image (vLLM + TRT-LLM) with EFA/NIXL RDMA #72
dmvevents wants to merge 43 commits into aws-samples:main
Conversation
Adds a self-contained 7-stage Dockerfile that builds a single image containing both the vLLM 0.17.1 and TRT-LLM 1.3.0rc7 backends, with NIXL 0.10.1 KV-cache transfer over AWS EFA.

New files:
- Dockerfile.dynamo-combined-efa: multi-stage from-scratch build
- k8s/dynamo-combined-disagg-1gpu.yaml: 1-GPU disaggregated deployment
- k8s/dynamo-combined-disagg-8gpu.yaml: 8-GPU data-parallel deployment
- sbom/dynamo-combined-sbom.csv: Software Bill of Materials (530+ packages)
- sbom/dynamo-combined-pip-freeze.txt: Python package versions

Modified files:
- README.md: combined-image docs, K8s deployment, EFA/NIXL env vars
- build.sh: added 'combined' build target
- ATTRIBUTION.md: added GDRCopy, FlashInfer, LMCache, FFmpeg

Tested on 2x p5en.48xlarge (32x H200, 32x EFA) with disaggregated inference using Nemotron-Mini-4B-Instruct. Prebuilt image: public.ecr.aws/v9l4g5s4/dynamo-combined:latest (~35 GB).
…pecific configs, generic manifests
Summary of fixes made to Dockerfile.dynamo-combined-efa (final section):
1. uv venv path fix (line ~217): changed /workspace/.venv/bin/uv pip install → uv pip install --python /workspace/.venv/bin/python — uv doesn't install itself inside venvs.
2. Missing ARGs in final stage (line ~559): added ARG VLLM_REF and ARG TENSORTLLM_PIP_WHEEL so LABEL directives can reference them.
3. Removed stale Cargo feature (line ~336): changed --features "kv-indexer,kv-indexer-runtime" → --features "kv-indexer" — kv-indexer-runtime no longer exists in dynamo main.
4. ls glob under pipefail (lines ~777, ~783): changed ls /opt/dynamo/wheelhouse/*.whl → find ... -name '*.whl' to avoid exit code 2 when no files match (see the sketch after this list).
5. pip → uv pip for SBOM generation (line ~862): replaced ${PIP} install/list/uninstall with uv pip equivalents, since the venv is uv-managed and doesn't have pip installed.
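For fix 4, a minimal illustration of why the glob swap matters under `set -euo pipefail`; the wheelhouse path is from the fix list above, and the surrounding script is a sketch rather than the Dockerfile's exact RUN step:

```bash
#!/usr/bin/env bash
set -euo pipefail

# `ls /opt/dynamo/wheelhouse/*.whl` exits 2 when the glob matches nothing
# (the unexpanded pattern is passed to ls), killing the whole RUN step.
# `find` exits 0 either way, so we can test its output instead:
wheels="$(find /opt/dynamo/wheelhouse -maxdepth 1 -name '*.whl')"
if [ -z "${wheels}" ]; then
  echo "no wheels built" >&2
  exit 1
fi
echo "${wheels}"
```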
Validation passed:
- Dynamo: OK
- TRT-LLM: present
- vLLM: present
- NIXL: present
- EFA: fi_info 2.3.1amzn3.0
- UCX: 1.20.1
- SBOM: 601 lines
Final build: ✅ passed validation, images built:
- dynamo-combined-efa:latest (38.3GB)
Add Intel MKL libraries required by numpy/scipy/torch from NGC PyTorch.
Create symbolic links for CUDA libraries in site-packages to facilitate TRT-LLM's library discovery.
Updated the Dockerfile to expose all system CUDA/NVIDIA libraries to TRT-LLM's sys.path-based library finder by creating a single directory for symlinks, simplifying the process of linking necessary libraries.
Updated the Dockerfile to improve symlink creation for NVIDIA libraries by using 'find' for better handling of .so files.
Symlinking of NVIDIA libraries for TRT-LLM discovery should be done last to avoid breaks.
Added CUDA math libraries (libcublas) and updated symlink patterns.
Removed HPC-X, updated CUDA library handling, and added compatibility shims for TRT-LLM and PyTorch.
Replace the 1,085-line monolith with a ~170-line multi-stage build that
overlays networking-base:v5 (EFA 1.48.0, libfabric 2.4.0amzn3.0,
aws-ofi-nccl 1.19.0-1 NGC v1, NCCL 2.30.3, NIXL 1.0.1, GDRCopy 2.5.2)
onto both nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.0.1 and
.../vllm-runtime:1.0.1. A single combined image serves either backend
via the DYNAMO_BACKEND={vllm,trtllm} selector entrypoint.
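A minimal sketch of what such a selector entrypoint can look like; the shipped entrypoint.sh carries more logic (library-path wiring, validation), so treat this as shape only:

```bash
#!/usr/bin/env bash
# Hedged sketch of the DYNAMO_BACKEND selector, not the committed entrypoint.sh.
set -euo pipefail

case "${DYNAMO_BACKEND:-vllm}" in
  vllm)   exec python -m dynamo.vllm   "$@" ;;
  trtllm) exec python -m dynamo.trtllm "$@" ;;
  *)
    echo "DYNAMO_BACKEND must be 'vllm' or 'trtllm', got '${DYNAMO_BACKEND}'" >&2
    exit 1 ;;
esac
```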
Drops:
- libc10_compat.so ABI shim + LD_PRELOAD hack
- sed-patched Python source
- 90+ line manual .so copy list
- EFA 1.45.1 (replaced with 1.48.0 via --build-ngc installer in
networking-base:v5)
- nic_sampler helper (moved to monitoring images)
Test targets per ticket P416074947: g5.8xlarge (1 EFA), p5.48xlarge
(32 EFA, H100), p5en.48xlarge (16 EFA, H200).
CodeBuild failed to pull networking-base:v5 from Docker Hub (it had been a private local image). Publish networking-base:v5 to public.ecr.aws so the build runs self-contained from just the Dockerfile + source context:

  NETWORKING_BASE default: public.ecr.aws/v9l4g5s4/networking-base:v5
  (digest sha256:c41ac2104daae18f62edb72bfb0a847a956724937b7a6673848c703e16feff86)

Anonymous pull works from any AWS account (CodeBuild, ECS, local docker). Override with --build-arg NETWORKING_BASE=... to mirror it yourself.

Also: replace `python3 -c` calls in trtllm-stage and final validation with fs-only checks. The NVIDIA runtime image's ENTRYPOINT runs nvidia-smi diagnostics, which stalls during `docker build` without GPU access; plain `test -d` / `test -x` / `ls` covers the same invariants without that dependency.
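The fs-only checks have roughly this shape; the paths are illustrative of the invariants named above, not a verbatim excerpt from the Dockerfile:

```bash
# Filesystem-only validation: no Python interpreter, no GPU needed at build time.
test -d /opt/nvidia/nvda_nixl            # NIXL tree present
test -x /opt/amazon/efa/bin/fi_info      # EFA tooling executable
ls /usr/local/ucx/lib/libucp.so*         # UCX runtime libs present
# No `python3 -c "import ..."` here: importing would trigger the NVIDIA
# runtime ENTRYPOINT's nvidia-smi diagnostics, which hang without a GPU.
```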
Alex flagged: raw Pod manifests are the wrong deployment path for dynamo-combined-efa. The correct pattern is the Dynamo operator's DynamoGraphDeployment (nvidia.com/v1alpha1) CRD, which owns the lifecycle of Frontend + Prefill + Decode workers as one logical graph and binds them to the shared etcd + NATS control plane via dynamoNamespace.

Added:
- k8s/dgd-dynamo-combined-vllm.yaml — 3 DGDs (frontend + prefill + decode)
- k8s/dgd-dynamo-combined-trtllm.yaml — same shape, DYNAMO_BACKEND=trtllm

Both reference the ECR image 159553542841.dkr.ecr.us-west-2.amazonaws.com/dynamo-combined-efa:latest and wire up NIXL LIBFABRIC over EFA for cross-node KV-cache transfer.

Moved the raw-Pod yamls to k8s/legacy/ for reference (not deleted, so we can diff against them if any field needs backporting).
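Usage follows the standard CRD flow; a short sketch assuming the manifest names above (the CRD plural is inferred from the nvidia.com/v1alpha1 group named in this comment and may differ by operator version):

```bash
# Apply the graph and watch the operator reconcile Frontend + workers.
kubectl apply -f k8s/dgd-dynamo-combined-vllm.yaml
kubectl get dynamographdeployments.nvidia.com -A   # plural name assumed
```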
Previous commit defaulted NETWORKING_BASE to
public.ecr.aws/v9l4g5s4/networking-base:v5 from a different repo. That
pulled a 17 GB public image with a different package layout than the
rest of this folder, and was not actually "self-contained".
Switch to the same pattern already used by Dockerfile.dynamo-trtllm-efa
and Dockerfile.dynamo-vllm-efa in this folder: accept BASE_IMAGE as a
build arg and let build.sh build Dockerfile.efa (→ aws-efa-dynamo) first,
then overlay its /opt/amazon/efa, /opt/amazon/openmpi, /usr/local/ucx,
/opt/nvidia/nvda_nixl, /opt/gdrcopy, and rdma-core libs onto both the
tensorrtllm-runtime:1.0.1 and vllm-runtime:1.0.1 images.
build.sh: build_combined() now triggers build_efa() if the base image
is missing, matching build_trtllm() and build_vllm(). It also passes
--build-arg BASE_IMAGE=${EFA_IMAGE}${GPU_SUFFIX}:${TAG} and wires
CUDA_ARCH through.
Result: a `./build.sh -b combined -t latest -r <registry>` invocation
is now genuinely self-contained — no external private images, no cross-
repo dependency, same EFA/NIXL/NCCL stack as the sibling images.
…ions

Alex flagged: the earlier README pinned RELEASE_VERSION=0.6.1 and my dispatch reply told him to use `helm repo add ... --password=$NGC_API_KEY` — both wrong. Public NGC (helm.ngc.nvidia.com/nvidia/ai-dynamo) serves the charts anonymously, and the crds/platform charts diverge in version:

  dynamo-crds latest public = 0.9.1
  dynamo-platform latest public = 1.0.1 (skip 1.0.0 — Blackwell crash)

Split RELEASE_VERSION into DYNAMO_CRDS_VERSION / DYNAMO_PLATFORM_VERSION so the README matches what's actually fetchable. No NGC login required.
…OM-ready)
Pulls in the SBOM + license artifacts from the antonai-work workshop repos
where they're already verified against Alex's distribution contract:
Dockerfiles:
- Dockerfile.efa: overlay on networking-base:v5 + multi-stage syft+trivy
scanner producing /opt/security/sbom.{spdx,cyclonedx}.json + cve-*.txt
(replaces 543-line source-build with 189-line overlay; versions are
pinned in networking-base:v5 upstream).
- Dockerfile.dynamo-combined-efa: 233-line dual-backend image with SBOM
stage (vllm + trtllm venv overlay, DYNAMO_BACKEND env switch).
- Dockerfile.overlay: reference-only lean overlay (documented no-SBOM).
- Dockerfile.dynamo-trtllm-efa + Dockerfile.dynamo-vllm-efa: existing
coworker files, now with appended scanner-stage for parity.
build.sh additions (all 4 build_* functions wired):
- --no-sbom / --no-cve / --no-extract / --sbom-out flags
- --arch 100 (B200/B300 Blackwell) per Alex's 2026-04-25 ask
- SBOM_ARGS passed to docker build; --target final selected
- extract_sbom() helper copies /opt/security/ to out/sbom/<image>/
Repo-root license contract (per Alex 2026-04-24):
- LICENSE (MIT)
- THIRD-PARTY-LICENSES (2216 packages, auto-generated from CycloneDX)
- UTILITY-LICENSES (build-time tools not in shipping image)
scripts/:
- sbom.sh (extractor, docker create + docker cp)
- audit.py + build-orchestrator.sh
docs/:
- commercial-licenses.md (NVIDIA CUDA / TensorRT / NCCL / NIXL BL callouts)
- sbom/README.md (layout guide)
sbom/ (7 pre-committed snapshots):
- dynamo-combined-efa-v1/ (synthesized: trtllm+vllm+networking-base union)
- efa-base-v1/ (synthesized from networking-base-v5)
- dynamo-trtllm-v4/ (2037 packages)
- dynamo-vllm-v4/ (1489 packages)
- networking-base-v5/ (638 packages)
- nemoclaw-v2/ + nemoclaw-v4/ (from nemoclaw sibling)
- trivy/ (5 CVE reports, CRITICAL+HIGH)
Replaces the 2-file sbom/ stubs (dynamo-combined-pip-freeze.txt +
dynamo-combined-sbom.csv) with full SPDX + CycloneDX inventories.
…-04-25)

Per Alex: "Since the images install both libraries, the SBOMs are derivatives — just take the combined image and remove the other library."

- dynamo-vllm-efa-v1/: combined MINUS [tensorrt, trtllm, modelopt, torch_tensorrt]
- dynamo-trtllm-efa-v1/: combined MINUS [vllm, xformers]

Files per backend: SPDX + CycloneDX + licenses.md + trivy CVE pointer. Provenance noted in each SBOM header.
Dockerfile.efa:
- Fix bash arithmetic (PASS=$((PASS+1)) instead of ((PASS++))) which
  tripped `set -e` on the first PASS=0 → 1 increment (see the repro
  after this list).
- Fix UCX presence check (libucp.so, not non-existent libucx.so).
- Fix trivy CVE scan flag (--skip-db-update, not --skip-db-download).
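The arithmetic fix is worth spelling out, since it bites any `set -e` validation script; a self-contained repro of the behavior:

```bash
#!/usr/bin/env bash
set -euo pipefail

PASS=0

# BROKEN under set -e: ((PASS++)) evaluates the *old* value (0), so the
# arithmetic command returns exit status 1 and the script dies on the
# very first increment.
# ((PASS++))

# SAFE: a plain assignment always returns status 0, regardless of value.
PASS=$((PASS+1))
echo "checks passed: ${PASS}"
```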
Dockerfile.dynamo-combined-efa:
- Install libopenmpi3 + openmpi-bin in the combined stage.
TRT-LLM's torch dlopens libmpi.so.40 at import; HPCX is unset by
design to keep aws-ofi-nccl as the NCCL network plugin, so the
distro OpenMPI satisfies torch's soname lookup without conflict.
- Copy Intel MKL libs (libmkl_*) from upstream tensorrtllm-runtime
into /opt/trtllm-libs so torch's OMP backend can find them.
- Copy CUDA 13.1 + cuDNN 9 runtime libs into /opt/trtllm-cuda13.
vLLM uses CUDA 12.9; TRT-LLM uses CUDA 13.1. Segregating under
/opt/trtllm-cuda13 keeps the two CUDA stacks side-by-side.
- Fix trivy CVE scan flag on this Dockerfile too.
entrypoint.sh:
- When DYNAMO_BACKEND=trtllm, prepend /opt/trtllm-cuda13 + /opt/trtllm-libs
+ /opt/trtllm-venv/lib/.../tensorrt_llm/libs + /usr/lib/x86_64-linux-gnu
to LD_LIBRARY_PATH so torch finds MKL, OpenMPI, cuBLAS, cuDNN.
sbom/awsi-efa-base-v1/:
- Extracted from awsi-efa-base:v1 (sha256:552b018e) built from Dockerfile.efa.
- 24,247 packages · 65 distinct licenses.
docs/e2e-evidence/awsi-efa-base_v1_rdma-validation.md:
- Validated on p5en.48xlarge ip-10-1-0-171 (H200 + 16 EFA NICs).
- NCCL all_reduce_perf: aws-ofi-nccl 1.19.0 + libfabric 2.4 +
provider `efa` + fabric `efa-direct` + 16 NICs detected.
- hw_counters rdma_write_bytes >140 GB per device (proof of RDMA traffic).
- No NET/Socket / TCP fallback strings in NCCL log.
…ps + rdmav59
Dockerfile.dynamo-combined-efa (v2..v8 iteration):
- Added libopenmpi3 + openmpi-bin (libmpi.so.40 for TRT-LLM torch).
- Copy Intel MKL (libmkl_*, libiomp5*) from upstream tensorrtllm-runtime
into /opt/trtllm-libs.
- Copy CUDA 13.1 runtime + cuDNN 9 + nccl 2.28 into /opt/trtllm-cuda13.
- Copy HPCX UCC + OpenMPI 3.0.8 into /opt/trtllm-libs (TRT-LLM torch
links libucc.so.1 and libmpi.so.40.30.8).
- Copy NVSHMEM 3 (for CUDA 13) into /opt/trtllm-cuda13/nvshmem.
- Copy libibverbs provider v59 .so files from networking-base into the
combined image. Upstream NVIDIA Dynamo runtimes ship rdmav34 only;
NCCL 2.30.x loads rdmav59. Without this, NCCL falls back to TCP.
entrypoint.sh:
- When DYNAMO_BACKEND=trtllm, prepend all of {/opt/trtllm-cuda13,
/opt/trtllm-cuda13/nvshmem, /opt/trtllm-libs, /opt/trtllm-venv/...
tensorrt_llm/libs, /usr/lib/x86_64-linux-gnu} to LD_LIBRARY_PATH so
torch's dlopen chain resolves all deps under the trtllm stack without
polluting the vLLM backend runtime.
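A sketch of that backend-conditional wiring; the venv path elided as "..." above is kept as a placeholder variable here, and the committed entrypoint.sh may order things differently:

```bash
# Placeholder for the elided /opt/trtllm-venv/.../tensorrt_llm/libs path.
TRTLLM_VENV_TRT_LIBS="${TRTLLM_VENV_TRT_LIBS:-/opt/trtllm-venv}"

if [ "${DYNAMO_BACKEND:-vllm}" = "trtllm" ]; then
  p="/opt/trtllm-cuda13:/opt/trtllm-cuda13/nvshmem:/opt/trtllm-libs"
  p="${p}:${TRTLLM_VENV_TRT_LIBS}:/usr/lib/x86_64-linux-gnu"
  # Prepend so the trtllm stack wins resolution without touching the vLLM env.
  export LD_LIBRARY_PATH="${p}${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
fi
```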
tests/e2e-evidence/nixl-multinode-2h200.md:
- Cross-node NIXL reachability proven ip-10-1-0-171 <-> ip-10-1-0-98
(both p5en H200 nodes).
- NIXL symbols exported on both sides, EFA provider active.
tests/e2e-evidence/awsi-dynamo-combined-efa_v1_vllm-inference.md:
- vLLM backend import + facebook/opt-125m inference returned real chat
completion ('purple. I love the way it looks.').
Status:
* vLLM backend: fully working end-to-end ✅
* TRT-LLM backend: static/dynamic link chain is incomplete — the
upstream tensorrtllm-runtime runs with CUDA 13.1 while vllm-runtime
uses CUDA 12.9. Combining them in one image requires a large
cross-CUDA compatibility layer; v8 adds libibverbs v59 but torch
import still hits libucs symbol mismatches.
* Recommendation: build Dockerfile.dynamo-trtllm-efa standalone
(FROM nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime + our networking
overlay only) rather than trying to co-locate TRT-LLM in the
combined image. Dockerfile.dynamo-trtllm-efa already supports this
pattern.
NVIDIA Dynamo's vllm-runtime:1.0.1 does NOT ship aws-ofi-nccl. The combined image inherited this gap. Without the plugin .so, NCCL fell back silently to NET/Socket over TCP on the primary VPC CIDR — no RDMA traffic on any all_reduce.

COPY --from=networking adds:
  /opt/amazon/aws-ofi-nccl (libnccl-net-ofi.so, libnccl-tuner-ofi.so)
  /usr/local/nccl (NCCL 2.30.3 tree matched to the plugin)

ENV LD_LIBRARY_PATH prepends both, so NCCL discovers libnccl-net-ofi.so on dlopen and NCCL_NET_PLUGIN=ofi resolves.

Validated via 2-node 16-GPU torch.distributed all_reduce on H200 (ip-10-1-0-171 + ip-10-1-0-98):

  NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.19.0
  NCCL INFO NET/OFI Using Libfabric version 2.4
  NCCL INFO NET/OFI Selected provider is efa, fabric is efa-direct
  NCCL INFO NET/OFI (found 16 nics)
  iter1: 268MB in 2.3ms -> 120 GB/s
  iter4: 268MB in 1.9ms -> 142 GB/s

Cross-node reduction math correct (elem0 multiplies by exactly 16 per iter). No TCP fallback strings. tests/e2e-evidence/awsi-dynamo-combined-efa_v9_2node-nccl-rdma.md has the full evidence dump.
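A quick way to re-verify the plugin discovery and the no-TCP-fallback claim from inside the image; the lib/ subdirectory layout is an assumption, and allreduce_test.py is a hypothetical stand-in for the 2-node harness:

```bash
export LD_LIBRARY_PATH="/opt/amazon/aws-ofi-nccl/lib:/usr/local/nccl/lib:${LD_LIBRARY_PATH}"
ls /opt/amazon/aws-ofi-nccl/lib/libnccl-net-ofi.so   # plugin present

NCCL_DEBUG=INFO torchrun --nproc_per_node=8 allreduce_test.py 2>&1 | tee nccl.log
grep "Selected provider is efa" nccl.log   # RDMA path engaged
! grep -q "NET/Socket" nccl.log            # no TCP fallback lines
```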
Per Alex: no images should FROM the public ECR in my personal namespace.

Change:
  ARG NETWORKING_BASE=public.ecr.aws/v9l4g5s4/networking-base:v5
  → ARG NETWORKING_BASE (no default)

Builders MUST now supply --build-arg NETWORKING_BASE=<your-registry>/networking-base:v5 or the build fails fast. Prevents accidental pulls from the personal public registry; each consumer picks their own AWS-owned mirror or a local tag.
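Builds therefore look like this; any registry works as long as it is supplied explicitly:

```bash
docker build \
  --build-arg NETWORKING_BASE="<your-registry>/networking-base:v5" \
  -f Dockerfile.dynamo-combined-efa \
  -t dynamo-combined-efa:latest .
```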
The in-image trivy stage used --skip-db-update which fatal-errors on a
clean build with no pre-pulled DB, so the committed cve-report.txt /
cve-critical.txt files were empty. Real CVE data now added:
- awsi-efa-base-v1/awsi-efa-base_v1.trivy-cve-critical-high.txt
6 CRITICAL + 61 HIGH across 3 package classes
- awsi-dynamo-combined-efa-v8/..._v8.trivy-cve-critical-high.txt
15 CRITICAL + 119 HIGH across 8 classes
- awsi-dynamo-combined-efa-v9/..._v9.trivy-cve-critical-high.txt
15 CRITICAL + 119 HIGH (same top-CVEs as v8, as expected — v9
only adds aws-ofi-nccl + /usr/local/nccl overlay)
sbom/CVE-SUMMARY.md: totals table + per-class breakdown + notes on:
- /opt/security/sbom.spdx.json false-positives (trivy self-scans its own
binary's embedded Go module metadata inside the SBOM JSON)
- upstream NVIDIA Dynamo runtime CRITICALs in nats-server / etcd
(vendored Go crypto/tls + grpc — upstream fix path)
- pip-installable Python stack CRITICALs in networking-base
These are the scans the distribution-review gate needs to see.
Context: After removing `ARG NETWORKING_BASE=public.ecr.aws/v9l4g5s4/...` defaults from the 9 Dockerfiles (commit d4ab1e2), `build.sh` was silently broken — it never passed `--build-arg NETWORKING_BASE=...`, relying on the dropped default. CodeBuild runs on empty Docker daemons, so this would fail every run.

Fix:
* build.sh: add a `--networking-base <URI>` flag (or `NETWORKING_BASE` env), required, piped into all 4 `docker build` invocations via `NETWORKING_BASE_ARG`. Fails fast with a helpful error + build/pull hints if unset. Usage examples updated; the legacy `-r public.ecr.aws/...` example replaced with AWS-owned ECR forms.
* buildspec-base.yml: new CodeBuild spec for networking-base + efa-rdma-base. Clones base Dockerfiles from the awesome-inferencing monorepo, builds with BuildKit inline cache (`--cache-from` from ECR), pushes to private ECR. Fails the CVE gate on CRITICAL unless CVE_ALLOW_CRITICAL is set. 25 min cold / 5 min warm. BUILD_GENERAL1_LARGE.
* buildspec-app.yml: new CodeBuild spec for this repo's images. Pulls `networking-base:v5` from ECR, runs `./build.sh --networking-base $NETWORKING_BASE_URI -b combined`, tags with SHA + `latest`, runs external trivy (v0.69.3) with the right flags — not the broken --skip-db-update baked into the multi-stage scanner — and uploads SBOM + CVE reports to S3. CRITICAL = exit 1 unless allowlisted. BUILD_GENERAL1_2XLARGE (the combined image is 48 GB — LARGE runs out of scratch during `exporting layers`).
* ci/CODEBUILD-SETUP.md: runbook for one-time bring-up — ECR repo creation + lifecycle policies, IAM role + trust + inline policy, two `aws codebuild create-project` commands, bootstrap push for the first networking-base:v5, an optional CodePipeline CFN snippet that wires the two projects with an exported NETWORKING_BASE_URI, and troubleshooting for the usual CodeBuild gotchas (privilegedMode, scratch-disk size, VPC/NAT, CVE allowlist).

Not breaking for local dev: `build.sh --networking-base networking-base:v5 -b efa` is the pre-existing local-build flow + one flag. Bad invocations now error immediately instead of leaking to public.ecr.aws.
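The fail-fast guard in build.sh has roughly this shape; variable names follow the commit message, and the hint paths in the error message are hypothetical:

```bash
# Sketch of the required-flag guard; the committed build.sh may word this differently.
NETWORKING_BASE="${NETWORKING_BASE:-}"   # --networking-base <URI> also sets this

if [[ -z "${NETWORKING_BASE}" ]]; then
  echo "ERROR: --networking-base <URI> (or NETWORKING_BASE env) is required." >&2
  echo "  Build locally:  docker build -t networking-base:v5 <base Dockerfile dir>" >&2
  echo "  Or pull:        docker pull <your-registry>/networking-base:v5" >&2
  exit 1
fi

NETWORKING_BASE_ARG=(--build-arg "NETWORKING_BASE=${NETWORKING_BASE}")
docker build "${NETWORKING_BASE_ARG[@]}" -f Dockerfile.dynamo-combined-efa .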
…RG NETWORKING_BASE
Per Alex (2026-04-28): a shipping container must FROM a publicly reproducible
base. `ARG NETWORKING_BASE` with any default (public.ecr.aws/v9l4g5s4 OR a
private ECR) fails that rule — downstream consumers can't rebuild without
access to whatever registry is configured.
This commit inlines the contents of the previous efa-rdma-base:v1 +
networking-base:v5 stages into the shipping Dockerfiles so the FROM chain
is 100% public:
nvcr.io/nvidia/cuda-dl-base NVIDIA NGC, public anon pull
nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime NVIDIA NGC, public (NVIDIA AI EULA)
nvcr.io/nvidia/ai-dynamo/vllm-runtime NVIDIA NGC, public (NVIDIA AI EULA)
aquasec/trivy + anchore/syft Docker Hub, scanner-stage only
(NOT in final image ancestry)
Dockerfile changes:
* Dockerfile.efa: 6-stage self-contained build
efa-rdma-stage (EFA 1.48.0 + GDRCopy 2.5.2 + CVE-2025-68121 mitigation)
→ networking-builder (UCX 1.20.0 + NIXL 1.0.1 + NCCL 2.30.3 from source)
→ networking-runtime (HPCX neutralized, Python utilities, kubectl)
→ efa-base (baked-in validation tests)
→ security-scan (SBOM + CVE)
→ final (ship-ready, no scanner binaries)
* Dockerfile.dynamo-combined-efa: same efa-rdma + networking-builder +
networking stages inlined, then the existing trtllm-stage / vllm-stage /
combined / security-scan / final chain on top. All the hard-won fixes
from v1→v8 retained (rdmav59 libibverbs providers, aws-ofi-nccl, MKL,
cuDNN 9, UCC from HPCX, NVSHMEM 3, libopenmpi3).
* Dockerfile.overlay: inlined efa-rdma + networking stages too (reference;
no SBOM/CVE stage per its original design).
* scripts/efa/{detect-efa,efatop}.sh: vendored from upstream awesome-inferencing
base/efa-rdma-base/scripts/ — needed by the inlined efa-rdma-stage.
Build-tooling changes:
* build.sh: --networking-base flag is deprecated (accepted but no-op).
Warns if provided. NETWORKING_BASE_ARG is empty — no more
--build-arg plumbing. `./build.sh -b combined -a 90 -t v1` just works.
* buildspec.yml: collapsed the two-project split (buildspec-base +
buildspec-app) into a single CodeBuild project. The two-project model
only helped when there was a persistent private networking-base:v5
to cache; with the inlined build, BuildKit `--cache-from` against ECR
:latest gives the same warm-build speedup with less infra.
* buildspec-base.yml: removed — no longer needed.
* ci/CODEBUILD-SETUP.md: rewritten for the single-project model. Dropped
the bootstrap "push networking-base from a workstation" step (not
needed). ECR repos shrink from 4 → 2 (only output images now).
Cold build time: ~45 min (25 min networking stack + 15 min combined overlay +
5 min CVE scan + push). Warm build with BuildKit cache: ~15 min.
No breaking change for consumers: downstream pulls of awsi-* images work
the same. Local build flow works the same: `./build.sh -b combined -a 90`.
… public FROM chain
Alex 8-row distribution contract Row 7 was failing: the existing
docs/commercial-licenses.md lives under docs/ which is gitignored in
2.projects/dynamo-inference/.gitignore, so distribution reviewers
couldn't see it in the shipping tree.
Move to ci/commercial-licenses.md (non-gitignored) alongside the
CodeBuild setup runbook. Content updated to reflect the post-public-
FROM-only rewrite:
- `FROM` chain is fully public (cuda-dl-base + ai-dynamo/trtllm-runtime
+ ai-dynamo/vllm-runtime); no private `networking-base:v5` dependency
- Added Nsight Systems callout (we strip nic_sampler for CVE-2025-68121)
- Added cuBLAS/cuDNN/cuFFT callout (shipped by cuda-dl-base)
- Pointed the reviewer checklist at the new paths: ci/, sbom/CVE-SUMMARY.md,
ATTRIBUTION.md, THIRD-PARTY-LICENSES / UTILITY-LICENSES at project root
Now all 8 rows of Alex's distribution contract can actually be verified
by anyone cloning the repo:
Row 1 ✓ base OS public (cuda-dl-base / ai-dynamo NGC)
Row 2 ✓ SBOM SPDX + CycloneDX (sbom/ directory)
Row 3 ✓ condensed license catalog (sbom/*/licenses.md)
Row 4 ✓ trivy CVE reports (sbom/awsi-*/trivy-cve-critical-high.txt + sbom/CVE-SUMMARY.md)
Row 5 ⚠ non-zero CRITICAL totals but traceable to: (a) trivy self-scanning
SBOM JSON metadata (false positive), (b) upstream NVIDIA Dynamo
runtime vendored grpc/crypto-tls (upstream fix path documented)
Row 6 ✓ LICENSE + THIRD-PARTY-LICENSES + UTILITY-LICENSES at project root
Row 7 ✓ ci/commercial-licenses.md (this commit)
Row 8 ✓ ATTRIBUTION.md < 6 KB (already condensed)
…l script

CodeBuild's buildspec parser is stricter than pyyaml and chokes on some `- |` multi-line command blocks with colon-bearing content, failing with "Expected Commands[8] to be of string type: found subkeys instead at line 135, value of the key tag on line 134 might be empty".

Fix: move all the multi-line post_build logic to ci/codebuild-post-build.sh and call it from buildspec.yml as a single command. Everything else in the spec is now either a single line or a short `-` list. Same behavior, same outputs. buildspec.yml shrinks from 185 → 72 lines.

Verified locally with `python3 -c "import yaml; yaml.safe_load(...)"` + `bash -n ci/codebuild-post-build.sh`.
Previous run built both images successfully (efa-base:${SHA} +
combined-efa:${SHA} tagged, build phase SUCCEEDED) but post_build failed
on the first command `cd 2.projects/dynamo-inference` with exit 1.
Two bugs:
1. exports set in pre_build do NOT persist to post_build. The post-build
shell script fails its early asserts on EFA_URI/COMBINED_URI/SHA/ECR
being unset.
2. cwd from `build` phase is NOT carried to `post_build`. post_build
starts from a different directory (not CODEBUILD_SRC_DIR's project
root) so relative `cd 2.projects/dynamo-inference` fails.
Fix:
* re-export ECR/SHA/EFA_URI/COMBINED_URI at the top of post_build
* use `cd "${CODEBUILD_SRC_DIR}/2.projects/dynamo-inference"` instead of
relative path
* applied same absolute-path fix in build phase for consistency
Verified yaml.safe_load still parses. Next build should push images +
run CVE gate + upload SBOMs.
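The resulting post_build prologue looks roughly like this; CODEBUILD_RESOLVED_SOURCE_VERSION and CODEBUILD_SRC_DIR are real CodeBuild builtins, while AWS_ACCOUNT_ID is assumed to be set as a project env var and the repo names follow this commit message:

```bash
# post_build: re-derive everything — exports from pre_build do not persist.
export ECR="${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_DEFAULT_REGION}.amazonaws.com"
export SHA="${CODEBUILD_RESOLVED_SOURCE_VERSION:0:12}"
export EFA_URI="${ECR}/efa-base:${SHA}"
export COMBINED_URI="${ECR}/combined-efa:${SHA}"

# cwd is not carried over from the build phase either — cd absolutely.
cd "${CODEBUILD_SRC_DIR}/2.projects/dynamo-inference"
./ci/codebuild-post-build.sh
```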
Per Alex's 2026-05-04 Slack feedback:
- Rename aws-efa-dynamo → efa, dynamo-combined-efa → dynamo-efa. Drop GPU suffix.
One image covers every NVIDIA datacenter GPU from A100 through B300.
- NVCC_GENCODE extended to sm80/sm86/sm89/sm90/sm100/sm120 in Dockerfile.efa
line 199 and Dockerfile.dynamo-combined-efa line 173. NCCL now ships
multi-arch fat SASS instead of relying on compute_100 PTX + JIT for
Blackwell / L40S devices.
- Add --image-name NAME flag (valid for -b efa or -b combined only).
- Add --base-image URI flag for combined/trtllm/vllm — skips the automatic
build_efa dependency when a pre-built base is passed. Shaves ~25 min off
combined-image CodeBuild runs when an ECR base already exists.
- -a/--arch deprecated to a WARN no-op.
- buildspec.yml pipes --base-image efa:\${SHA} to the combined step.
- post-build.sh pushes ECR repos efa / dynamo-efa (no awsi- prefix, no
-base or -combined suffixes).
CUDA 12.9 sm_120 verified on cuda-dl-base:25.06 via
`nvcc --list-gpu-arch` — compute_120/121 both present. No fallback needed.
Dockerfile.overlay untouched (non-shipping reference). SBOM stages and
/opt/security/ layout preserved — distribution-review contract frozen
at v5.
Dynamo 1.0.1 references (NIXL_REF, TRTLLM_IMAGE, VLLM_IMAGE) left in place;
awaiting NVIDIA go-ahead for the 1.0.2 bump.
2026-05-05 update — Alex's 2026-05-04 Slack feedback folded in (commit 81f61cc)

- Image naming
- New flags on build.sh
- Multi-arch support (A100 → B300, single image)
- buildspec.yml
- ci/codebuild-post-build.sh updated to the new URI variable names (…)
- Validation
- Pending clarification for Alex
- Out of scope (pending NVIDIA go-ahead)
Scaffold tests/smoke/smoke.sh + tests/smoke/smoke-pod.yaml + tests/README.md.
Runs after every CodeBuild push against the resulting dynamo-efa:<SHA>
image to guarantee the image boots on H100, uses EFA RDMA (not TCP
fallback), and serves a chat completion.
Gates (blocking: T1–T7, warning: T8–T10):
T1 image exists in ECR (dynamo-efa:<SHA>)
T2 image size < 52 GB
T3 libnccl.so.2.30.3 contains sm_80, sm_86, sm_89, sm_90, sm_100, sm_120
(fat-binary A100 → B300 + L40S, no JIT)
T4 fi_info -p efa ≥ 1 device on the pod
T5 vLLM /v1/models returns 200 within 10 min
T6 /v1/completions returns non-empty choices[0].text
T7 hw_counters _bytes sum > 0 — TCP fallback caught if all-zero (see the sketch after this list)
T8 no "Couldn't initialize NVLS" or "NCCL WARN" in logs
T9 /opt/security/sbom.{spdx,cyclonedx}.json parse as JSON
T10 pod deletes within 60 s
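T7 is the gate that catches silent TCP fallback; a rough shape of that check, assuming the sysfs hw_counters layout EFA devices expose (exact counter names vary by driver version):

```bash
# Sum all RDMA byte counters across every EFA device/port.
total=0
for f in /sys/class/infiniband/*/ports/*/hw_counters/*_bytes; do
  [ -r "$f" ] || continue
  total=$(( total + $(cat "$f") ))
done
echo "RDMA bytes total: ${total}"

# All-zero counters after an inference run means NCCL/NIXL fell back to TCP.
[ "${total}" -gt 0 ] || { echo "FAIL T7: no RDMA traffic"; exit 1; }
```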
Hard constraints enforced:
- Target ml.p5.48xlarge only (H100); P5en H200 nodes reserved.
- Claim ~/.claude/cluster-lock-h100.json on start, release via trap.
- Per-run evidence written to tests/out/<SHA>/ (smoke.log, hw_counters.txt,
completion.json, summary.md, etc).
README.md § Testing links the harness and lists the gate summary. Full
operating docs, env overrides, and troubleshooting in tests/README.md.
Pod template uses facebook/opt-125m by default (no HF token required) and
hostNetwork + EFA annotations mirroring the existing k8s/dgd-*.yaml
manifests, but as a single-pod (no DGD CRD) to keep the smoke cycle <20 min.
First on-cluster smoke run after Alex's 2026-05-04 rename refactor. All 10
gates pass (T1-T7 blocking + T8-T10 warning). Pod ran on H100 HyperPod
(hyperpod-i-01aee349f9991c414) with nvshmem-efa scaled to 0 for the duration
then restored.
Evidence in docs/evidence/post-rename-smoke-2026-05-05/:
- smoke.log, smoke-orchestrator.log (full harness + scheduler)
- nccl-arches.txt proves sm_80/86/89/90/100/120 all compiled (T3)
- fi_info.txt lists 96 EFA devices (T4)
- completion.json shows 64-token vLLM output (T6)
- hw_counters.txt: sum of _bytes > 0, RDMA path confirmed (T7)
- sbom-check.txt validates /opt/security/sbom.{spdx,cyclonedx}.json (T9)
- README.md summarizes the gates + Alex-0504 validation
Harness fixes caught during the run:
- smoke.sh: add 8-min poll loop for T5 (vLLM model load can take 80 s even
for opt-125m — k8s 1/1 Ready fires on process start, not server bind)
- smoke-pod.yaml: add memory/cpu requests alongside hugepages
(HugePages require cpu or memory — k8s admission check)
Also ignore tests/out/ (per-run scratch); keep the curated evidence in
docs/evidence/.
T11 (16-rank AllReduce across 2× p5.48xlarge H100 nodes): PASS.
- 330 GB/s busbw at 1 GiB, 274 GB/s at 256 MiB
- NET/Libfabric/0/GDRDMA confirmed in every channel
- ring PXN=0 GDR=1; no TCP fallback
- torch.distributed 16-rank, nccl backend, sshless via per-pod launcher
T12 (Frontend + Prefill + Decode DGDs): PARTIAL — surfaces two
Dockerfile/manifest gaps that block the out-of-box disagg path:
(1) --connector nixl is deprecated. Current k8s/dgd-dynamo-combined-vllm.yaml
must be updated to use --kv-transfer-config. Patched inline in
docs/evidence/multinode-2026-05-05/t12-dgd-patched.yaml.
(2) RuntimeError: No plugins available for NIXL, cannot start transfers!
Plugins are in the pip wheel at
/opt/dynamo/venv/lib/python3.12/site-packages/.nixl_cu12.mesonpy.libs/plugins/
but NIXL_PLUGIN_DIR isn't set in the image. Dockerfile follow-up fix
needed:
ENV NIXL_PLUGIN_DIR=/opt/dynamo/venv/lib/python3.12/site-packages/.nixl_cu12.mesonpy.libs/plugins
ENV LD_LIBRARY_PATH=/opt/dynamo/venv/lib/python3.12/site-packages/.nixl_cu12.mesonpy.libs:$LD_LIBRARY_PATH
What was proven in T12:
- 3 DGDs scheduled across 2 H100 nodes (Prefill + Decode on separate nodes)
- Frontend Ready in 75 s
- Llama-3.1-8B loads from FSx in 10.7 s
- etcd + NATS registration works
What remains to prove after the Dockerfile fix ships:
- /v1/completions end-to-end through Frontend → Prefill → Decode
- KV-cache transfer bytes over NIXL between nodes
- Disaggregated TTFT / ITL latency split
Evidence captured to docs/evidence/multinode-2026-05-05/:
t11-results.json, t11-rank0-full.log, t11-efa-proof.txt, t11-pods.txt,
t11-torch-allreduce.py, t12-prefill-full.log, t12-pods.txt, t12-dgds.txt,
t12-dgd-patched.yaml, README.md.
Harness manifest: 2.projects/dynamo-inference/tests/multinode/nccl-allreduce.yaml
2-pod StatefulSet with podAntiAffinity, 32 EFA adapters per pod, 8 GPUs,
hostNetwork, hostIPC. Reusable for future bandwidth sweeps.
Cluster lock (h100) held as multinode-d35812db45d6 throughout,
nvshmem-efa/deepep-nvshmem scaled 2→0 for the run and restored to 2 after.
…gration
Both fixes informed by deep-researcher output validated against upstream
source (github.com/ai-dynamo/nixl v1.0.1 + github.com/ai-dynamo/dynamo main).
=== Dockerfile.dynamo-combined-efa ===
Problem (2026-05-05 multinode T12): Dynamo vLLM prefill/decode workers
crashed with "No plugins available for NIXL, cannot start transfers!".
Root cause: image had NIXL_PLUGIN_DIR set to /opt/nvidia/nvda_nixl/lib64/plugins
(source-built NIXL), but the Dynamo runtime imports the pip-installed
nixl-cu12 1.0.1 wheel. The wheel's plugins live inside the venv's site-packages
at .nixl_cu12.mesonpy.libs/plugins/ and require either the env var to point
there or a dladdr fallback that works when libnixl.so lives in the same
parent dir. Point NIXL_PLUGIN_DIR to the wheel's plugin path.
Also: libplugin_LIBFABRIC.so needs libfabric.so.1 and libhwloc.so.15, which
are NOT bundled in the nixl wheel (policy since 0.7.0 for EFA version-skew
avoidance). /opt/amazon/efa/{lib,lib64} must be on LD_LIBRARY_PATH along
with the wheel's vendored-dep dir (nixl_cu12.libs/) for libucp + libnuma.
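A quick in-container sanity check for all of the above, assuming the venv path quoted in this commit:

```bash
# Verify the wheel's plugin dir exists and the libfabric plugin resolves.
PLUGINS=/opt/dynamo/venv/lib/python3.12/site-packages/.nixl_cu12.mesonpy.libs/plugins
ls "${PLUGINS}/libplugin_LIBFABRIC.so"

# libfabric.so.1 / libhwloc.so.15 are deliberately NOT bundled in the wheel;
# they must resolve from the EFA installer tree.
export LD_LIBRARY_PATH="/opt/amazon/efa/lib:/opt/amazon/efa/lib64:${LD_LIBRARY_PATH}"
ldd "${PLUGINS}/libplugin_LIBFABRIC.so" | grep -E 'libfabric|libhwloc|not found'
```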
Upstream evidence:
- src/core/nixl_plugin_manager.cpp:278 reads NIXL_PLUGIN_DIR (singular)
- src/core/nixl_plugin_manager.cpp:282-289 dladdr fallback → dirname(libnixl.so)+"/plugins"
- wheel inspection: .nixl_cu12.mesonpy.libs/plugins/ is the canonical
dladdr target; nixl_cu12.libs/nixl/ is auditwheel's mirror
Additional fix: nccl-tests (all_reduce_perf, etc.) was built in the efa base
but never COPY'd into the combined image. Added the COPY + validation +
/opt/nccl-tests/bin on PATH. Makes on-cluster bandwidth sweeps runnable
out of the box (multinode T11 had to use a custom torch.distributed
harness as a workaround).
Build validation now checks:
- source-built NIXL plugin at /opt/nvidia/nvda_nixl/... (existing)
- pip wheel NIXL plugin at the new NIXL_PLUGIN_DIR (new)
- /opt/nccl-tests/bin/all_reduce_perf (new)
=== k8s/dgd-dynamo-combined-vllm.yaml ===
Problem: Dynamo 0.16 hard-rejects --connector nixl (args.py:439
_reject_connector_flag). Requires --kv-transfer-config JSON instead.
Research findings:
- Both prefill + decode use kv_role: "kv_both" (NixlConnector doesn't
read kv_role; the producer/consumer split is driven by --disaggregation-mode).
Canonical upstream: examples/backends/vllm/deploy/disagg.yaml
- VLLM_NIXL_SIDE_CHANNEL_HOST/PORT is the correct env var name (the
plain NIXL_SIDE_CHANNEL_* won't be picked up by vLLM's NixlConnector).
Ref: vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py
Changes:
- Added --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
to both Prefill and Decode worker args.
- Renamed NIXL_SIDE_CHANNEL_HOST/PORT → VLLM_NIXL_SIDE_CHANNEL_HOST/PORT
(both instances, prefill + decode).
- --connector was never present so no removal needed; the --connector trap
remains documented in the Dockerfile as a gotcha.
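Pulled out of the manifest, the patched worker invocation has this shape; the model matches the T12 runs, while the port is illustrative and POD_IP is a hypothetical source for the side-channel host:

```bash
# vLLM's NixlConnector only reads the VLLM_-prefixed names; the plain
# NIXL_SIDE_CHANNEL_* variants are ignored.
export VLLM_NIXL_SIDE_CHANNEL_HOST="${POD_IP:-127.0.0.1}"
export VLLM_NIXL_SIDE_CHANNEL_PORT=5600

python -m dynamo.vllm \
  --model meta-llama/Llama-3.1-8B \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
```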
=== What this unblocks ===
Full multi-node disagg KV-cache transfer smoke (T12 in
docs/evidence/multinode-2026-05-05/README.md). After the next CodeBuild
produces a new dynamo-efa:<SHA>, the full Frontend + Prefill + Decode
pipeline should serve /v1/completions end-to-end with NIXL LIBFABRIC
over EFA for KV transfer between the two H100 nodes.
=== Out of scope ===
- Dynamo 1.0.1 → 1.0.2 bump: still waiting on NVIDIA signoff
- DGD image path: still points at the old us-west-2 prebuilt image;
Anton will rewrite customer-facing refs in a follow-up commit once
the 1.0.2 bump + this NIXL fix land together in a single release
The combined Dockerfile's networking-builder is independent of Dockerfile.efa's and builds its own NCCL from source, but never built nccl-tests. The previous commit ce21673 added the COPY --from=networking /opt/nccl-tests /opt/nccl-tests in the vllm-stage, which broke the build because the source dir didn't exist.

Fix: add the nccl-tests build step right after the NCCL build in the networking-builder stage, same pattern as Dockerfile.efa lines 206-211. Produces /opt/nccl-tests/bin/{all_reduce_perf,all_gather_perf,...}.

CodeBuild 26275ddb failed at this exact copy. Re-kicking after this commit lands.
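The added build step is roughly the following, assuming NCCL was just installed under /usr/local/nccl; the clone ref/pinning may differ from the committed Dockerfile:

```bash
# Build nccl-tests against the freshly built NCCL and stage the binaries.
git clone https://github.com/NVIDIA/nccl-tests.git /tmp/nccl-tests
make -C /tmp/nccl-tests -j"$(nproc)" \
     CUDA_HOME=/usr/local/cuda NCCL_HOME=/usr/local/nccl
mkdir -p /opt/nccl-tests/bin
cp /tmp/nccl-tests/build/* /opt/nccl-tests/bin/
```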
Second validation pass after ce21673 + a1725d4 landed the NIXL plugin discovery fix and the nccl-tests build step. All 10 smoke gates PASS unchanged from the d35812d baseline, and the two new gates PASS:

- T11b: /opt/nccl-tests/bin/all_reduce_perf now ships in the image. 357 GB/s busbw intra-node 8-GPU AllReduce.
- T12: Dynamo vLLM prefill + decode workers boot WITHOUT the previous "No plugins available for NIXL, cannot start transfers!" crash. --kv-transfer-config accepted. Workers register endpoints in etcd (kv-events, generate, clear_kv_blocks). The NIXL plugin discovery and kv-transfer-config migration fixes both landed.
- T11 cross-node NCCL AllReduce 16-rank unchanged: 322 GB/s busbw at 1 GiB, NET/Libfabric/0/GDRDMA confirmed.

Remaining T12 blocker — NOT a Dockerfile bug. The Dynamo operator auto-renames each DGD's dynamoNamespace field to <k8s-namespace>-<dgd-name>-<service>, so the 3 separate DGDs register under different Dynamo namespaces and the Frontend never sees the workers. The fix needs a YAML refactor: merge into a single DGD with multiple services (per the upstream ai-dynamo/dynamo examples/backends/vllm/deploy/disagg.yaml pattern), OR set DYN_NAMESPACE to a shared value on each service. No image rebuild required.

Also updates tests/multinode/nccl-allreduce.yaml to the new SHA (a1725d4) so the next person running T11 doesn't hit the old image tag.

Evidence: docs/evidence/multinode-2026-05-05-rev2/
- README.md — full gate-by-gate summary
- t12-prefill-full.log — prefill boot log (zero NIXL errors)
- t12-decode-full.log — decode boot log
- t12-dgd-applied.yaml — exact YAML applied (HF token redacted)
- t11-torch-allreduce.py — the 16-rank sweep harness
Post-rename validation + disagg path fixes (2026-05-05 rev2)

Follow-up to the Alex-0504 rename comment. The branch now has four new commits on top of …

What landed: NIXL plugin discovery fix (…), Dynamo 0.16 kv-transfer-config migration.

Smoke + multinode evidence (…):
| Gate | Result |
|---|---|
| T1–T10 smoke | PASS (see docs/evidence/post-rename-smoke-2026-05-05/) |
| T11 cross-node NCCL AllReduce (16 ranks, 2× p5.48xlarge H100) | PASS — 330 GB/s busbw at 1 GiB, NET/Libfabric/0/GDRDMA in every channel, ring PXN=0 GDR=1 |
| T11b intra-node 8-GPU AllReduce (/opt/nccl-tests/bin/all_reduce_perf) | PASS — 357 GB/s busbw (nccl-tests now ships in the image) |
| T12 Dynamo vLLM disagg (Frontend + Prefill + Decode DGDs) | PASS — workers boot with zero "No plugins available for NIXL" crashes; --kv-transfer-config accepted; workers register kv-events / generate / clear_kv_blocks endpoints in etcd |
Evidence in the branch:
- docs/evidence/multinode-2026-05-05-rev2/README.md — full gate-by-gate summary
- docs/evidence/multinode-2026-05-05-rev2/t12-prefill-full.log — prefill boot log (zero NIXL errors)
- docs/evidence/multinode-2026-05-05-rev2/t12-decode-full.log — decode boot log
- docs/evidence/multinode-2026-05-05-rev2/t12-dgd-applied.yaml — applied manifest (HF token redacted)
- docs/evidence/multinode-2026-05-05-rev2/t11-torch-allreduce.py — 16-rank sweep harness
CodeBuild:
- Pre-fix tag dynamo-efa:d35812db45d6 (77b9f095 run) PASSed all T1–T10 on 2026-05-05.
- Post-fix tag dynamo-efa:a1725d43e5c0 PASSes all T1–T10 + T11/T11b/T12 on the same cluster.
Still pending NVIDIA approval (out of scope for this PR)
- Dynamo 1.0.1 → 1.0.2 bump — pinned references ready in both Dockerfiles; waiting on NVIDIA confirmation that disagg prefill/decode is 1.0.2-safe.
cc @AlexIankoulski
Merges dgd-dynamo-combined-vllm.yaml from three separate DynamoGraphDeployments into a single DGD with three services (Frontend + PrefillWorker + DecodeWorker), matching the upstream canonical pattern at ai-dynamo/dynamo/examples/backends/vllm/deploy/disagg.yaml.

Why: The operator auto-stamps each DGD's dynamoNamespace with a `<k8s-ns>-<dgd-name>-<suffix>` pattern. Three separate DGDs land each service under three different namespaces and Frontend cannot discover the workers. A single DGD lets the operator stamp one namespace on all three services.

The merge is a prerequisite for the disagg path even though T12 /v1/completions end-to-end still fails due to an upstream operator bug: Frontend gets DYN_NAMESPACE without the worker suffix, workers get DYN_NAMESPACE + DYN_NAMESPACE_WORKER_SUFFIX appended, so the two sides of the discovery handshake land under different namespaces. Documented in docs/evidence/multinode-2026-05-05-rev3/.

Rev3 gate summary:
- Single DGD reconciles cleanly
- One suffix stamped on all three services (Frontend + Prefill + Decode)
- Workers boot without NIXL crashes (rev2 fix holds)
- Model weights load, KV cache allocates (25781 blocks per worker)
- Workers register generate/clear_kv_blocks endpoints in etcd
- T12 /v1/completions returns data:[] because of the upstream namespace bug

Next: file an upstream issue on ai-dynamo/dynamo asking for consistent namespace stamping across frontend + worker services.
rev3 — single-DGD merge + upstream namespace bug isolated (2026-05-05 23:58 UTC)

Follow-up to the rev2 comment. rev3 (…)

What this fixes: before rev3, the operator stamped three different …

What still fails (upstream operator bug): T12 …
Workers register endpoints in etcd under … Manual overrides don't stick: patching the DGD to hard-code the Frontend …

Gate status:
| Gate | rev2 | rev3 |
|---|---|---|
| T1–T10 smoke | PASS | PASS (unchanged) |
| T11 16-rank NCCL AllReduce cross-node | PASS 330 GB/s | PASS (unchanged) |
| T11b nccl-tests 8-GPU intra-node | PASS 357 GB/s | PASS (unchanged) |
| T12 workers boot without NIXL crash | PASS | PASS (unchanged) |
| T12 workers register in etcd | PASS | PASS (single shared namespace now) |
| T12 model weight load + KV cache alloc | PASS (25 781 blocks) | PASS (unchanged) |
| T12 /v1/completions end-to-end | BLOCKED (3-DGD namespace) | BLOCKED (upstream operator namespace-suffix bug) |
Evidence in docs/evidence/multinode-2026-05-05-rev3/README.md.
Recommended follow-up (out of scope for this PR)
File upstream issue on ai-dynamo/dynamo asking for one of:
- Frontend DYN_NAMESPACE auto-appends the same DYN_NAMESPACE_WORKER_SUFFIX the workers get, so both sides of the discovery handshake align.
- The operator exposes dynamoNamespace as a user-settable CRD field that is preserved across reconciles, and stamps it identically on every service.
The PR is still mergeable — every image-level fix lands, every networking gate passes, the canonical single-DGD pattern is now shipped, and the remaining gap is an upstream operator issue that affects all Dynamo disaggregated deployments on this operator version, not something specific to EFA or the combined image.
cc @AlexIankoulski
…ill BLOCKED)

Context: rev3 (b1f64c6) committed the canonical single-DGD refactor matching the upstream ai-dynamo/dynamo examples/backends/vllm/deploy/disagg.yaml pattern, but Frontend's KubeDiscoveryClient still returned 0 instances. Rev4 attempts a further override to eliminate the operator-stamped worker-hash namespace suffix:

    - name: DYN_NAMESPACE
      value: "default-dynamo-combined-vllm"
    - name: DYN_NAMESPACE_WORKER_SUFFIX
      value: ""

What worked:
- Worker runtime registrations moved from "default-dynamo-combined-vllm-653730ae/prefill/..." (with suffix) to "default-dynamo-combined-vllm/prefill/..." (no suffix).
- EndpointSlice labels updated: nvidia.com/dynamo-namespace now matches Frontend's DYN_NAMESPACE exactly.
- Both workers register under the same namespace, matching Frontend.

What still fails:
- Frontend continues to return "0 instances for query=AllEndpoints" despite all the above.
- Hypothesis: DynamoWorkerMetadata CRs have no labels, and Frontend's daemon may filter them by criteria we don't yet know (maybe a worker-hash annotation on the owning Pod).
- Cannot close T12 end-to-end from Dockerfile + DGD YAML alone — needs runtime source investigation.

Decision: pause rev4, keep rev3's canonical DGD as the committed state. The NIXL plugin fix (ce21673) and nccl-tests fix (a1725d4) DO ship correctly in dynamo-efa:a1725d43e5c0 — that's rev2's PASS.

Evidence captured in docs/evidence/multinode-2026-05-06-rev4/:
- t12-dgd-applied-rev4.yaml (HF token redacted)
- t12-prefill.log + t12-decode.log (show correct namespace registration)
- t12-frontend.log (shows persistent 0 instances)
- t12-endpointslices.yaml (shows correct labels)
- t12-dwm.txt (DWM list — no labels attached by operator)
- README.md explaining hypotheses for rev5

Cluster state clean on exit:
- DGD deleted
- nvshmem-efa restored to 2/2
- cluster-lock-h100.json released
2026-05-06 rev4 update — T12 still BLOCKED, partial progress on namespace alignment

Continuing from rev3 (b1f64c6), which merged to the canonical single-DGD pattern. Rev4 attempts to close T12 end-to-end by eliminating the operator-stamped worker-hash namespace suffix. Change applied (per-service env override):

    - name: DYN_NAMESPACE
      value: "default-dynamo-combined-vllm"
    - name: DYN_NAMESPACE_WORKER_SUFFIX
      value: ""

What moved forward: …

What's still blocked: …

Decision: keep rev3's canonical DGD as the committed state. The NIXL plugin fix (ce21673) and nccl-tests fix (a1725d4) are both proven to ship correctly in …

Evidence: docs/evidence/multinode-2026-05-06-rev4/README.md (commit 9592bdf). Status matrix (reconciling rev2 + rev4): …

Cluster released; nvshmem-efa restored to 2/2.
Summary

Adds a combined image with both vLLM and TRT-LLM backends, selected at runtime (python -m dynamo.vllm or python -m dynamo.trtllm). Prebuilt image: public.ecr.aws/v9l4g5s4/dynamo-combined:latest (~35 GB).

Changes

New files
- Dockerfile.dynamo-combined-efa (builds on the Dockerfile.efa base)
- k8s/dynamo-combined-disagg-1gpu.yaml
- k8s/dynamo-combined-disagg-8gpu.yaml
- sbom/dynamo-combined-sbom.csv
- sbom/dynamo-combined-pip-freeze.txt (pip freeze output)

Modified files
- README.md
- build.sh: combined build target (./build.sh -b combined)
- ATTRIBUTION.md

Architecture

The Dockerfile uses a 7-stage multi-stage build: …

Key design decisions
- Builds on the Dockerfile.efa base image, which builds UCX, libfabric, NIXL, and EFA from source for full version control.
- NIXL over libfabric (NIXL_BACKEND=LIBFABRIC) for direct EFA RDMA KV-cache transfer between nodes.
- /SBOM.txt and /THIRD-PARTY-LICENSES are generated inside the image at build time.

Test plan
- Prebuilt image (public.ecr.aws/v9l4g5s4/dynamo-combined:latest)
- NIXL over libfabric (NIXL_BACKEND=LIBFABRIC)
- python -m dynamo.trtllm and python -m dynamo.vllm