[FEA]: MRC crashes with fatal error on DGX Spark (GB10) — nvmlDeviceGetMemoryInfo returns "Not Supported" #561
Description
Is this a new feature, an improvement, or a change to existing functionality?
New Feature
How would you describe the priority of this feature request
Critical (currently preventing usage)
Please provide a clear description of the problem this feature solves
MRC crashes with fatal error on DGX Spark (GB10) — nvmlDeviceGetMemoryInfo returns "Not Supported"
Summary
Morpheus pipelines abort immediately on NVIDIA DGX Spark (GB10, aarch64) because MRC treats NVML_ERROR_NOT_SUPPORTED from nvmlDeviceGetMemoryInfo() as a fatal error. The GB10 uses a unified memory architecture (UMA) where NVML memory queries are not supported — this is expected/documented behavior for this platform, but MRC has no fallback path.
Environment
| Component | Value |
|---|---|
| Platform | NVIDIA DGX Spark (GB10) |
| Architecture | aarch64 |
| OS | Ubuntu 24.04.3 LTS (Noble Numbat) |
| NVIDIA Driver | 580.95.05 |
| CUDA Version | 13.0 |
| Morpheus | Production DFP example (digital_fingerprinting/production) |
| Python | 3.12 (conda env morpheus-dfp) |
| Deployment | Docker Compose |
nvidia-smi output (host)
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05 Driver Version: 580.95.05 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
|=========================================+========================+======================|
| 0 NVIDIA GB10 On | 0000000F:01:00.0 Off | N/A |
| N/A 52C P0 12W / N/A | Not Supported | 7% Default |
+-----------------------------------------+------------------------+----------------------+
Note: Memory-Usage: Not Supported is expected behavior on DGX Spark due to UMA — there is no dedicated framebuffer to report.
Steps to Reproduce
- On a DGX Spark (GB10) host running Ubuntu 24.04 aarch64
- Clone the Morpheus repo and navigate to examples/digital_fingerprinting/production
- Run docker compose up morpheus_pipeline
Observed Behavior
The pipeline aborts with exit code 134 (SIGABRT) before any Python pipeline code executes:
F20260227 21:37:48.420159 258244446173728 device_info.cpp:329] NVML failed running
'NvmlState::handle().nvmlDeviceGetMemoryInfo(get_handle_by_id(device_id), &info)'.
Error msg: Not Supported
Stack trace (abbreviated):
google::LogMessage::Fail()
google::LogMessageFatal::~LogMessageFatal()
mrc::system::DeviceInfo::DeviceTotalMemory(unsigned int) ← fatal here
mrc::system::Topology::Create(mrc::TopologyOptions const&)
mrc::system::SystemDefinition::SystemDefinition(mrc::Options const&)
mrc::make_system(std::shared_ptr<mrc::Options>)
mrc::pymrc::Executor::Executor(std::shared_ptr<mrc::Options>)
Full stack trace
Attaching to morpheus_pipeline
morpheus_pipeline | Running training pipeline with the following options:
morpheus_pipeline | Train generic_user: True
morpheus_pipeline | Skipping users: []
morpheus_pipeline | F20260227 21:37:48.420159 258244446173728 device_info.cpp:329] NVML failed running 'NvmlState::handle().nvmlDeviceGetMemoryInfo(get_handle_by_id(device_id), &info)'. Error msg: Not Supported
morpheus_pipeline | @ 0xeadf321158f8 google::LogMessage::Fail()
morpheus_pipeline | @ 0xeadf321177ec google::LogMessageFatal::~LogMessageFatal()
morpheus_pipeline | @ 0xeadec0e65f4c mrc::system::DeviceInfo::DeviceTotalMemory(unsigned int)
morpheus_pipeline | @ 0xeadec0e930c8 mrc::system::Topology::Create(mrc::TopologyOptions const&)
morpheus_pipeline | @ 0xeadec0e859c4 mrc::system::SystemDefinition::SystemDefinition(mrc::Options const&)
morpheus_pipeline | @ 0xeadec0f54414 mrc::make_system(std::shared_ptr<mrc::Options>)
morpheus_pipeline | @ 0xeadec1169d7c mrc::pymrc::Executor::Executor(std::shared_ptr<mrc::Options>)
morpheus_pipeline | @ 0xeadeaf72c944 (/opt/conda/envs/morpheus-dfp/lib/python3.12/site-packages/mrc/core/executor.cpython-312-aarch64-linux-gnu.so+0x1c943)
morpheus_pipeline | @ 0xeadeaf7222f0 (/opt/conda/envs/morpheus-dfp/lib/python3.12/site-packages/mrc/core/executor.cpython-312-aarch64-linux-gnu.so+0x122ef)
morpheus_pipeline | @ 0xab7839967a38 cfunction_call
morpheus_pipeline | @ 0xab7839915db8 _PyObject_MakeTpCall
morpheus_pipeline | @ 0xab7839919ed0 method_vectorcall
morpheus_pipeline | @ 0xab7839990ce8 slot_tp_init
morpheus_pipeline | @ 0xab7839987ddc type_call
morpheus_pipeline | @ 0xeadef2617b24 (/opt/conda/envs/morpheus-dfp/lib/python3.12/site-packages/morpheus/_lib/common.cpython-312-aarch64-linux-gnu.so+0x17b23)
morpheus_pipeline | @ 0xab7839915db8 _PyObject_MakeTpCall
morpheus_pipeline | @ 0xab7839a1ae54 _PyEval_EvalFrameDefault
morpheus_pipeline | @ 0xab78399308c0 gen_send_ex2
morpheus_pipeline | @ 0xeadf35a4aa04 task_step_impl
morpheus_pipeline | @ 0xeadf35a4bcc4 task_step
morpheus_pipeline | @ 0xab7839915db8 _PyObject_MakeTpCall
morpheus_pipeline | @ 0xab7839a3cfac context_run
morpheus_pipeline | @ 0xab7839967b68 cfunction_vectorcall_FASTCALL_KEYWORDS
morpheus_pipeline | @ 0xab7839a1d8f4 _PyEval_EvalFrameDefault
morpheus_pipeline | @ 0xab7839919f08 method_vectorcall
morpheus_pipeline | @ 0xab78399182e4 _PyVectorcall_Call
morpheus_pipeline | @ 0xab7839a1d8f4 _PyEval_EvalFrameDefault
morpheus_pipeline | @ 0xab7839919f08 method_vectorcall
morpheus_pipeline | @ 0xab78399182e4 _PyVectorcall_Call
morpheus_pipeline | @ 0xab7839a1d8f4 _PyEval_EvalFrameDefault
morpheus_pipeline | @ 0xab7839917fac _PyObject_FastCallDictTstate
morpheus_pipeline | @ 0xab78399181ec _PyObject_Call_Prepend
morpheus_pipeline | /workspace/examples/digital_fingerprinting/production/launch.sh: line 17: 20 Aborted (core dumped) python dfp_duo_pipeline.py "$@"
morpheus_pipeline exited with code 134
The crash originates in device_info.cpp:329 where the NVML return code is treated as fatal with no fallback.
Expected Behavior
MRC should gracefully handle NVML_ERROR_NOT_SUPPORTED from nvmlDeviceGetMemoryInfo() by falling back to an alternative memory query method, such as:
- cudaMemGetInfo(&free, &total) — works on all CUDA devices
- cudaGetDeviceProperties().totalGlobalMem — also universally available
This would allow Morpheus to run on UMA platforms like DGX Spark where NVML memory reporting is unsupported by design.
Suggested Fix
In mrc::system::DeviceInfo::DeviceTotalMemory() (device_info.cpp), replace the fatal log with a fallback:
nvmlReturn_t result = nvmlDeviceGetMemoryInfo(handle, &info);
if (result == NVML_SUCCESS)
{
return info.total;
}
else if (result == NVML_ERROR_NOT_SUPPORTED)
{
// Fallback for UMA platforms (e.g., DGX Spark GB10)
size_t free = 0, total = 0;
cudaError_t cuda_result = cudaMemGetInfo(&free, &total);
if (cuda_result == cudaSuccess)
{
LOG(WARNING) << "NVML memory query not supported on this device. "
<< "Falling back to cudaMemGetInfo. Total: " << total;
return total;
}
// If CUDA fallback also fails, then fatal
LOG(FATAL) << "Unable to determine device memory via NVML or CUDA.";
}
else
{
LOG(FATAL) << "NVML failed: " << nvmlErrorString(result);
}
Workarounds
- NVML shim library: Build a drop-in libnvidia-ml.so shim that wraps nvmlDeviceGetMemoryInfo with a fallback memory query, and mount it into the container via LD_LIBRARY_PATH
- Run on x86_64 + discrete GPU: Avoids the UMA limitation entirely
Additional Context
The DGX Spark / GB10 is a shipping NVIDIA product with a Blackwell GPU on a unified memory architecture. As NVIDIA expands its ARM64 and UMA product line, MRC's hard dependency on NVML memory queries will increasingly become a blocker. The cudaMemGetInfo fallback is a minimal, low-risk change that would unblock all UMA platforms.
Labels
bug, platform:aarch64, component:mrc, priority:high
Describe your ideal solution
NVIDIA Morpheus works on the NVIDIA DGX Spark and similar shared-memory (UMA) devices.
Describe any alternatives you have considered
No response
Additional context
We want to demo our product on NVIDIA DGX Spark systems.
Code of Conduct
- I agree to follow MRC's Code of Conduct
- I have searched the open feature requests and have found no duplicates for this feature request