
[FEA]: MRC crashes with fatal error on DGX Spark (GB10) — nvmlDeviceGetMemoryInfo returns "Not Supported" #561

@mbudge

Description

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request

Critical (currently preventing usage)

Please provide a clear description of the problem this feature solves

MRC crashes with fatal error on DGX Spark (GB10) — nvmlDeviceGetMemoryInfo returns "Not Supported"

Summary

Morpheus pipelines abort immediately on NVIDIA DGX Spark (GB10, aarch64) because MRC treats NVML_ERROR_NOT_SUPPORTED from nvmlDeviceGetMemoryInfo() as a fatal error. The GB10 uses a unified memory architecture (UMA) where NVML memory queries are not supported — this is expected/documented behavior for this platform, but MRC has no fallback path.

Environment

Component       Value
--------------  ----------------------------------------------------------
Platform        NVIDIA DGX Spark (GB10)
Architecture    aarch64
OS              Ubuntu 24.04.3 LTS (Noble Numbat)
NVIDIA Driver   580.95.05
CUDA Version    13.0
Morpheus        Production DFP example (digital_fingerprinting/production)
Python          3.12 (conda env morpheus-dfp)
Deployment      Docker Compose

nvidia-smi output (host)

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|=========================================+========================+======================|
|   0  NVIDIA GB10                    On  |   0000000F:01:00.0 Off |                  N/A |
| N/A   52C    P0             12W /  N/A  | Not Supported          |      7%      Default |
+-----------------------------------------+------------------------+----------------------+

Note: Memory-Usage: Not Supported is expected behavior on DGX Spark due to UMA — there is no dedicated framebuffer to report.

Steps to Reproduce

  1. On a DGX Spark (GB10) host running Ubuntu 24.04 aarch64
  2. Clone the Morpheus repo and navigate to examples/digital_fingerprinting/production
  3. Run docker compose up morpheus_pipeline

Observed Behavior

The pipeline aborts with exit code 134 (SIGABRT) before any Python pipeline code executes:

F20260227 21:37:48.420159 258244446173728 device_info.cpp:329] NVML failed running
'NvmlState::handle().nvmlDeviceGetMemoryInfo(get_handle_by_id(device_id), &info)'.
Error msg: Not Supported

Stack trace (abbreviated):

google::LogMessage::Fail()
google::LogMessageFatal::~LogMessageFatal()
mrc::system::DeviceInfo::DeviceTotalMemory(unsigned int)        ← fatal here
mrc::system::Topology::Create(mrc::TopologyOptions const&)
mrc::system::SystemDefinition::SystemDefinition(mrc::Options const&)
mrc::make_system(std::shared_ptr<mrc::Options>)
mrc::pymrc::Executor::Executor(std::shared_ptr<mrc::Options>)

Full stack trace

Attaching to morpheus_pipeline
morpheus_pipeline  | Running training pipeline with the following options: 
morpheus_pipeline  | Train generic_user: True
morpheus_pipeline  | Skipping users: []
morpheus_pipeline  | F20260227 21:37:48.420159 258244446173728 device_info.cpp:329] NVML failed running 'NvmlState::handle().nvmlDeviceGetMemoryInfo(get_handle_by_id(device_id), &info)'. Error msg: Not Supported
morpheus_pipeline  |     @     0xeadf321158f8  google::LogMessage::Fail()
morpheus_pipeline  |     @     0xeadf321177ec  google::LogMessageFatal::~LogMessageFatal()
morpheus_pipeline  |     @     0xeadec0e65f4c  mrc::system::DeviceInfo::DeviceTotalMemory(unsigned int)
morpheus_pipeline  |     @     0xeadec0e930c8  mrc::system::Topology::Create(mrc::TopologyOptions const&)
morpheus_pipeline  |     @     0xeadec0e859c4  mrc::system::SystemDefinition::SystemDefinition(mrc::Options const&)
morpheus_pipeline  |     @     0xeadec0f54414  mrc::make_system(std::shared_ptr<mrc::Options>)
morpheus_pipeline  |     @     0xeadec1169d7c  mrc::pymrc::Executor::Executor(std::shared_ptr<mrc::Options>)
morpheus_pipeline  |     @     0xeadeaf72c944  (/opt/conda/envs/morpheus-dfp/lib/python3.12/site-packages/mrc/core/executor.cpython-312-aarch64-linux-gnu.so+0x1c943)
morpheus_pipeline  |     @     0xeadeaf7222f0  (/opt/conda/envs/morpheus-dfp/lib/python3.12/site-packages/mrc/core/executor.cpython-312-aarch64-linux-gnu.so+0x122ef)
morpheus_pipeline  |     @     0xab7839967a38  cfunction_call
morpheus_pipeline  |     @     0xab7839915db8  _PyObject_MakeTpCall
morpheus_pipeline  |     @     0xab7839919ed0  method_vectorcall
morpheus_pipeline  |     @     0xab7839990ce8  slot_tp_init
morpheus_pipeline  |     @     0xab7839987ddc  type_call
morpheus_pipeline  |     @     0xeadef2617b24  (/opt/conda/envs/morpheus-dfp/lib/python3.12/site-packages/morpheus/_lib/common.cpython-312-aarch64-linux-gnu.so+0x17b23)
morpheus_pipeline  |     @     0xab7839915db8  _PyObject_MakeTpCall
morpheus_pipeline  |     @     0xab7839a1ae54  _PyEval_EvalFrameDefault
morpheus_pipeline  |     @     0xab78399308c0  gen_send_ex2
morpheus_pipeline  |     @     0xeadf35a4aa04  task_step_impl
morpheus_pipeline  |     @     0xeadf35a4bcc4  task_step
morpheus_pipeline  |     @     0xab7839915db8  _PyObject_MakeTpCall
morpheus_pipeline  |     @     0xab7839a3cfac  context_run
morpheus_pipeline  |     @     0xab7839967b68  cfunction_vectorcall_FASTCALL_KEYWORDS
morpheus_pipeline  |     @     0xab7839a1d8f4  _PyEval_EvalFrameDefault
morpheus_pipeline  |     @     0xab7839919f08  method_vectorcall
morpheus_pipeline  |     @     0xab78399182e4  _PyVectorcall_Call
morpheus_pipeline  |     @     0xab7839a1d8f4  _PyEval_EvalFrameDefault
morpheus_pipeline  |     @     0xab7839919f08  method_vectorcall
morpheus_pipeline  |     @     0xab78399182e4  _PyVectorcall_Call
morpheus_pipeline  |     @     0xab7839a1d8f4  _PyEval_EvalFrameDefault
morpheus_pipeline  |     @     0xab7839917fac  _PyObject_FastCallDictTstate
morpheus_pipeline  |     @     0xab78399181ec  _PyObject_Call_Prepend
morpheus_pipeline  | /workspace/examples/digital_fingerprinting/production/launch.sh: line 17:    20 Aborted                 (core dumped) python dfp_duo_pipeline.py "$@"
morpheus_pipeline exited with code 134

The crash originates in device_info.cpp:329 where the NVML return code is treated as fatal with no fallback.

Expected Behavior

MRC should gracefully handle NVML_ERROR_NOT_SUPPORTED from nvmlDeviceGetMemoryInfo() by falling back to an alternative memory query method, such as:

  • cudaMemGetInfo(&free, &total) — works on all CUDA devices
  • cudaDeviceProp::totalGlobalMem from cudaGetDeviceProperties() — also universally available

This would allow Morpheus to run on UMA platforms like DGX Spark where NVML memory reporting is unsupported by design.

Suggested Fix

In mrc::system::DeviceInfo::DeviceTotalMemory() (device_info.cpp), replace the fatal log with a fallback. Note that cudaMemGetInfo (from cuda_runtime_api.h) reports on the current device, so the target device must be selected first:

nvmlReturn_t result = nvmlDeviceGetMemoryInfo(handle, &info);
if (result == NVML_SUCCESS)
{
    return info.total;
}
else if (result == NVML_ERROR_NOT_SUPPORTED)
{
    // Fallback for UMA platforms (e.g., DGX Spark GB10), where NVML has no
    // dedicated framebuffer to report. cudaMemGetInfo queries the *current*
    // device, so select the target device before asking.
    std::size_t free = 0, total = 0;
    if (cudaSetDevice(static_cast<int>(device_id)) == cudaSuccess &&
        cudaMemGetInfo(&free, &total) == cudaSuccess)
    {
        LOG(WARNING) << "NVML memory query not supported on this device. "
                     << "Falling back to cudaMemGetInfo. Total: " << total;
        return total;
    }
    // If the CUDA fallback also fails, then fatal
    LOG(FATAL) << "Unable to determine device memory via NVML or CUDA.";
}
else
{
    LOG(FATAL) << "NVML failed: " << nvmlErrorString(result);
}
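Since the fix can only be exercised on UMA hardware, here is a minimal stand-alone mock of the same control flow that compiles without the CUDA toolkit. All mock_* names and the 128 GiB figure are stand-ins for illustration, not real NVML/CUDA calls:

```cpp
#include <cstddef>
#include <cstdio>
#include <stdexcept>

// Stubs standing in for NVML / CUDA. On real hardware these would be
// nvmlDeviceGetMemoryInfo and cudaMemGetInfo.
enum MockNvmlResult { MOCK_NVML_SUCCESS, MOCK_NVML_NOT_SUPPORTED };
enum MockCudaResult { MOCK_CUDA_SUCCESS, MOCK_CUDA_ERROR };

// Pretend NVML behaves as on DGX Spark: the memory query is unsupported.
static MockNvmlResult mock_nvml_get_memory(std::size_t* /*total*/)
{
    return MOCK_NVML_NOT_SUPPORTED;
}

// Pretend the CUDA runtime reports 128 GiB of unified memory.
static MockCudaResult mock_cuda_mem_get_info(std::size_t* free_b, std::size_t* total_b)
{
    *free_b  = 100ULL << 30;
    *total_b = 128ULL << 30;
    return MOCK_CUDA_SUCCESS;
}

// The proposed ordering: NVML first, CUDA second, fatal (here: throw) last.
std::size_t device_total_memory()
{
    std::size_t total = 0;
    if (mock_nvml_get_memory(&total) == MOCK_NVML_SUCCESS)
    {
        return total;
    }
    std::size_t free_b = 0;
    if (mock_cuda_mem_get_info(&free_b, &total) == MOCK_CUDA_SUCCESS)
    {
        std::fprintf(stderr, "NVML unsupported; using CUDA fallback\n");
        return total;
    }
    throw std::runtime_error("Unable to determine device memory via NVML or CUDA");
}
```

With the stubs above, device_total_memory() takes the CUDA branch and returns 128 GiB, mirroring the path a DGX Spark would hit.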

Workarounds

  • NVML shim library: build a drop-in libnvidia-ml.so shim that wraps nvmlDeviceGetMemoryInfo with a CUDA-based fallback, and load it in the container ahead of the real library (e.g., via LD_PRELOAD or LD_LIBRARY_PATH)
  • Run on x86_64 with a discrete GPU: avoids the UMA limitation entirely
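A rough sketch of the shim approach. It assumes the standard nvmlMemory_t layout ({total, free, used} as unsigned long long) and error codes from nvml.h (NVML_SUCCESS == 0, NVML_ERROR_NOT_SUPPORTED == 3); rather than calling into CUDA, this sketch reports system RAM via sysinfo(), which on a UMA platform is the same pool the GPU uses. Verify all of this against your driver's headers before relying on it:

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <sys/sysinfo.h>

/* Minimal local declarations so the shim builds without nvml.h.
 * Check these against your installed NVML headers. */
typedef struct { unsigned long long total, free, used; } nvmlMemory_t;
typedef void* nvmlDevice_t;
typedef int nvmlReturn_t;
#define NVML_SUCCESS             0
#define NVML_ERROR_NOT_SUPPORTED 3

nvmlReturn_t nvmlDeviceGetMemoryInfo(nvmlDevice_t device, nvmlMemory_t* memory)
{
    /* Forward to the real NVML first, if one is loaded behind us. */
    typedef nvmlReturn_t (*real_fn_t)(nvmlDevice_t, nvmlMemory_t*);
    real_fn_t real = (real_fn_t)dlsym(RTLD_NEXT, "nvmlDeviceGetMemoryInfo");
    if (real) {
        nvmlReturn_t rc = real(device, memory);
        if (rc != NVML_ERROR_NOT_SUPPORTED)
            return rc;
    }
    /* UMA fallback: report system RAM as the device's memory pool. */
    struct sysinfo si;
    if (sysinfo(&si) != 0)
        return NVML_ERROR_NOT_SUPPORTED;
    memory->total = (unsigned long long)si.totalram * si.mem_unit;
    memory->free  = (unsigned long long)si.freeram  * si.mem_unit;
    memory->used  = memory->total - memory->free;
    return NVML_SUCCESS;
}
```

Build it with something like `gcc -shared -fPIC -o libnvml_shim.so shim.c -ldl` (shim.c is a hypothetical filename) and set LD_PRELOAD in the container so the wrapper resolves before the real library.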

Additional Context

The DGX Spark / GB10 is a shipping NVIDIA product with a Blackwell GPU on a unified memory architecture. As NVIDIA expands its ARM64 and UMA product line, MRC's hard dependency on NVML memory queries will increasingly become a blocker. The cudaMemGetInfo fallback is a minimal, low-risk change that would unblock all UMA platforms.

Labels

bug, platform:aarch64, component:mrc, priority:high

Describe your ideal solution

NVIDIA Morpheus runs on the NVIDIA DGX Spark and similar unified-memory devices.

Describe any alternatives you have considered

No response

Additional context

We want to demo our product on NVIDIA DGX Spark systems.

Code of Conduct

  • I agree to follow MRC's Code of Conduct
  • I have searched the open feature requests and have found no duplicates for this feature request
