
[FEA]: MRC crashes with fatal error on DGX Spark (GB10) — nvmlDeviceGetMemoryInfo returns "Not Supported" #561

@mbudge

Description

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request

Critical (currently preventing usage)

Please provide a clear description of the problem this feature solves

MRC crashes with fatal error on DGX Spark (GB10) — nvmlDeviceGetMemoryInfo returns "Not Supported"

Summary

Morpheus pipelines abort immediately on NVIDIA DGX Spark (GB10, aarch64) because MRC treats NVML_ERROR_NOT_SUPPORTED from nvmlDeviceGetMemoryInfo() as a fatal error. The GB10 uses a unified memory architecture (UMA) where NVML memory queries are not supported — this is expected/documented behavior for this platform, but MRC has no fallback path.

Environment

Component       Value
--------------  ----------------------------------------------------------
Platform        NVIDIA DGX Spark (GB10)
Architecture    aarch64
OS              Ubuntu 24.04.3 LTS (Noble Numbat)
NVIDIA Driver   580.95.05
CUDA Version    13.0
Morpheus        Production DFP example (digital_fingerprinting/production)
Python          3.12 (conda env morpheus-dfp)
Deployment      Docker Compose

nvidia-smi output (host)

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|=========================================+========================+======================|
|   0  NVIDIA GB10                    On  |   0000000F:01:00.0 Off |                  N/A |
| N/A   52C    P0             12W /  N/A  | Not Supported          |      7%      Default |
+-----------------------------------------+------------------------+----------------------+

Note: Memory-Usage: Not Supported is expected behavior on DGX Spark due to UMA — there is no dedicated framebuffer to report.

Steps to Reproduce

  1. On a DGX Spark (GB10) host running Ubuntu 24.04 aarch64
  2. Clone the Morpheus repo and navigate to examples/digital_fingerprinting/production
  3. Run docker compose up morpheus_pipeline

Observed Behavior

The pipeline aborts with exit code 134 (SIGABRT) before any Python pipeline code executes:

F20260227 21:37:48.420159 258244446173728 device_info.cpp:329] NVML failed running
'NvmlState::handle().nvmlDeviceGetMemoryInfo(get_handle_by_id(device_id), &info)'.
Error msg: Not Supported

Stack trace (abbreviated):

google::LogMessage::Fail()
google::LogMessageFatal::~LogMessageFatal()
mrc::system::DeviceInfo::DeviceTotalMemory(unsigned int)        ← fatal here
mrc::system::Topology::Create(mrc::TopologyOptions const&)
mrc::system::SystemDefinition::SystemDefinition(mrc::Options const&)
mrc::make_system(std::shared_ptr<mrc::Options>)
mrc::pymrc::Executor::Executor(std::shared_ptr<mrc::Options>)

Full stack trace

Attaching to morpheus_pipeline
morpheus_pipeline  | Running training pipeline with the following options: 
morpheus_pipeline  | Train generic_user: True
morpheus_pipeline  | Skipping users: []
morpheus_pipeline  | F20260227 21:37:48.420159 258244446173728 device_info.cpp:329] NVML failed running 'NvmlState::handle().nvmlDeviceGetMemoryInfo(get_handle_by_id(device_id), &info)'. Error msg: Not Supported
morpheus_pipeline  |     @     0xeadf321158f8  google::LogMessage::Fail()
morpheus_pipeline  |     @     0xeadf321177ec  google::LogMessageFatal::~LogMessageFatal()
morpheus_pipeline  |     @     0xeadec0e65f4c  mrc::system::DeviceInfo::DeviceTotalMemory(unsigned int)
morpheus_pipeline  |     @     0xeadec0e930c8  mrc::system::Topology::Create(mrc::TopologyOptions const&)
morpheus_pipeline  |     @     0xeadec0e859c4  mrc::system::SystemDefinition::SystemDefinition(mrc::Options const&)
morpheus_pipeline  |     @     0xeadec0f54414  mrc::make_system(std::shared_ptr<mrc::Options>)
morpheus_pipeline  |     @     0xeadec1169d7c  mrc::pymrc::Executor::Executor(std::shared_ptr<mrc::Options>)
morpheus_pipeline  |     @     0xeadeaf72c944  (/opt/conda/envs/morpheus-dfp/lib/python3.12/site-packages/mrc/core/executor.cpython-312-aarch64-linux-gnu.so+0x1c943)
morpheus_pipeline  |     @     0xeadeaf7222f0  (/opt/conda/envs/morpheus-dfp/lib/python3.12/site-packages/mrc/core/executor.cpython-312-aarch64-linux-gnu.so+0x122ef)
morpheus_pipeline  |     @     0xab7839967a38  cfunction_call
morpheus_pipeline  |     @     0xab7839915db8  _PyObject_MakeTpCall
morpheus_pipeline  |     @     0xab7839919ed0  method_vectorcall
morpheus_pipeline  |     @     0xab7839990ce8  slot_tp_init
morpheus_pipeline  |     @     0xab7839987ddc  type_call
morpheus_pipeline  |     @     0xeadef2617b24  (/opt/conda/envs/morpheus-dfp/lib/python3.12/site-packages/morpheus/_lib/common.cpython-312-aarch64-linux-gnu.so+0x17b23)
morpheus_pipeline  |     @     0xab7839915db8  _PyObject_MakeTpCall
morpheus_pipeline  |     @     0xab7839a1ae54  _PyEval_EvalFrameDefault
morpheus_pipeline  |     @     0xab78399308c0  gen_send_ex2
morpheus_pipeline  |     @     0xeadf35a4aa04  task_step_impl
morpheus_pipeline  |     @     0xeadf35a4bcc4  task_step
morpheus_pipeline  |     @     0xab7839915db8  _PyObject_MakeTpCall
morpheus_pipeline  |     @     0xab7839a3cfac  context_run
morpheus_pipeline  |     @     0xab7839967b68  cfunction_vectorcall_FASTCALL_KEYWORDS
morpheus_pipeline  |     @     0xab7839a1d8f4  _PyEval_EvalFrameDefault
morpheus_pipeline  |     @     0xab7839919f08  method_vectorcall
morpheus_pipeline  |     @     0xab78399182e4  _PyVectorcall_Call
morpheus_pipeline  |     @     0xab7839a1d8f4  _PyEval_EvalFrameDefault
morpheus_pipeline  |     @     0xab7839919f08  method_vectorcall
morpheus_pipeline  |     @     0xab78399182e4  _PyVectorcall_Call
morpheus_pipeline  |     @     0xab7839a1d8f4  _PyEval_EvalFrameDefault
morpheus_pipeline  |     @     0xab7839917fac  _PyObject_FastCallDictTstate
morpheus_pipeline  |     @     0xab78399181ec  _PyObject_Call_Prepend
morpheus_pipeline  | /workspace/examples/digital_fingerprinting/production/launch.sh: line 17:    20 Aborted                 (core dumped) python dfp_duo_pipeline.py "$@"
morpheus_pipeline exited with code 134

The crash originates in device_info.cpp:329 where the NVML return code is treated as fatal with no fallback.

Expected Behavior

MRC should gracefully handle NVML_ERROR_NOT_SUPPORTED from nvmlDeviceGetMemoryInfo() by falling back to an alternative memory query method, such as:

  • cudaMemGetInfo(&free, &total) — works on all CUDA devices
  • cudaDeviceProp::totalGlobalMem from cudaGetDeviceProperties() — also universally available

This would allow Morpheus to run on UMA platforms like DGX Spark where NVML memory reporting is unsupported by design.

Suggested Fix

In mrc::system::DeviceInfo::DeviceTotalMemory() (device_info.cpp), replace the fatal log with a fallback. Note that cudaMemGetInfo (from cuda_runtime_api.h) reports on the current device, so the target device must be selected first:

nvmlReturn_t result = nvmlDeviceGetMemoryInfo(handle, &info);
if (result == NVML_SUCCESS)
{
    return info.total;
}
else if (result == NVML_ERROR_NOT_SUPPORTED)
{
    // Fallback for UMA platforms (e.g., DGX Spark GB10), where NVML has no
    // dedicated framebuffer to report. cudaMemGetInfo queries the *current*
    // device, so select the target device before asking.
    std::size_t free = 0, total = 0;
    if (cudaSetDevice(static_cast<int>(device_id)) == cudaSuccess &&
        cudaMemGetInfo(&free, &total) == cudaSuccess)
    {
        LOG(WARNING) << "NVML memory query not supported on this device. "
                     << "Falling back to cudaMemGetInfo. Total: " << total;
        return total;
    }
    // If the CUDA fallback also fails, then fatal
    LOG(FATAL) << "Unable to determine device memory via NVML or CUDA.";
}
else
{
    LOG(FATAL) << "NVML failed: " << nvmlErrorString(result);
}
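Since the fix can only be exercised on UMA hardware, here is a minimal stand-alone mock of the same control flow that compiles without the CUDA toolkit. All mock_* names and the 128 GiB figure are stand-ins for illustration, not real NVML/CUDA calls:

```cpp
#include <cstddef>
#include <cstdio>
#include <stdexcept>

// Stubs standing in for NVML / CUDA. On real hardware these would be
// nvmlDeviceGetMemoryInfo and cudaMemGetInfo.
enum MockNvmlResult { MOCK_NVML_SUCCESS, MOCK_NVML_NOT_SUPPORTED };
enum MockCudaResult { MOCK_CUDA_SUCCESS, MOCK_CUDA_ERROR };

// Pretend NVML behaves as on DGX Spark: the memory query is unsupported.
static MockNvmlResult mock_nvml_get_memory(std::size_t* /*total*/)
{
    return MOCK_NVML_NOT_SUPPORTED;
}

// Pretend the CUDA runtime reports 128 GiB of unified memory.
static MockCudaResult mock_cuda_mem_get_info(std::size_t* free_b, std::size_t* total_b)
{
    *free_b  = 100ULL << 30;
    *total_b = 128ULL << 30;
    return MOCK_CUDA_SUCCESS;
}

// The proposed ordering: NVML first, CUDA second, fatal (here: throw) last.
std::size_t device_total_memory()
{
    std::size_t total = 0;
    if (mock_nvml_get_memory(&total) == MOCK_NVML_SUCCESS)
    {
        return total;
    }
    std::size_t free_b = 0;
    if (mock_cuda_mem_get_info(&free_b, &total) == MOCK_CUDA_SUCCESS)
    {
        std::fprintf(stderr, "NVML unsupported; using CUDA fallback\n");
        return total;
    }
    throw std::runtime_error("Unable to determine device memory via NVML or CUDA");
}
```

With the stubs above, device_total_memory() takes the CUDA branch and returns 128 GiB, mirroring the path a DGX Spark would hit.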

Workarounds

  • NVML shim library: build a drop-in libnvidia-ml.so shim that wraps nvmlDeviceGetMemoryInfo with a CUDA-based fallback, and load it in the container ahead of the real library (e.g., via LD_PRELOAD or LD_LIBRARY_PATH)
  • Run on x86_64 with a discrete GPU: avoids the UMA limitation entirely
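A rough sketch of the shim approach. It assumes the standard nvmlMemory_t layout ({total, free, used} as unsigned long long) and error codes from nvml.h (NVML_SUCCESS == 0, NVML_ERROR_NOT_SUPPORTED == 3); rather than calling into CUDA, this sketch reports system RAM via sysinfo(), which on a UMA platform is the same pool the GPU uses. Verify all of this against your driver's headers before relying on it:

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <sys/sysinfo.h>

/* Minimal local declarations so the shim builds without nvml.h.
 * Check these against your installed NVML headers. */
typedef struct { unsigned long long total, free, used; } nvmlMemory_t;
typedef void* nvmlDevice_t;
typedef int nvmlReturn_t;
#define NVML_SUCCESS             0
#define NVML_ERROR_NOT_SUPPORTED 3

nvmlReturn_t nvmlDeviceGetMemoryInfo(nvmlDevice_t device, nvmlMemory_t* memory)
{
    /* Forward to the real NVML first, if one is loaded behind us. */
    typedef nvmlReturn_t (*real_fn_t)(nvmlDevice_t, nvmlMemory_t*);
    real_fn_t real = (real_fn_t)dlsym(RTLD_NEXT, "nvmlDeviceGetMemoryInfo");
    if (real) {
        nvmlReturn_t rc = real(device, memory);
        if (rc != NVML_ERROR_NOT_SUPPORTED)
            return rc;
    }
    /* UMA fallback: report system RAM as the device's memory pool. */
    struct sysinfo si;
    if (sysinfo(&si) != 0)
        return NVML_ERROR_NOT_SUPPORTED;
    memory->total = (unsigned long long)si.totalram * si.mem_unit;
    memory->free  = (unsigned long long)si.freeram  * si.mem_unit;
    memory->used  = memory->total - memory->free;
    return NVML_SUCCESS;
}
```

Build it with something like `gcc -shared -fPIC -o libnvml_shim.so shim.c -ldl` (shim.c is a hypothetical filename) and set LD_PRELOAD in the container so the wrapper resolves before the real library.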

Additional Context

The DGX Spark / GB10 is a shipping NVIDIA product with a Blackwell GPU on a unified memory architecture. As NVIDIA expands its ARM64 and UMA product line, MRC's hard dependency on NVML memory queries will increasingly become a blocker. The cudaMemGetInfo fallback is a minimal, low-risk change that would unblock all UMA platforms.

Labels

bug, platform:aarch64, component:mrc, priority:high

Describe your ideal solution

NVIDIA Morpheus runs on the NVIDIA DGX Spark and similar unified-memory devices.

Describe any alternatives you have considered

No response

Additional context

We want to demo our product on NVIDIA DGX Spark systems.

Code of Conduct

  • I agree to follow MRC's Code of Conduct
  • I have searched the open feature requests and have found no duplicates for this feature request
