nvlink check output not very helpful #142

@ncclementi

I'm running rapids doctor as part of testing new checks I'm adding to cudf.

But I found that on the machine I'm using, the nvlink check failed, and the errors are not very helpful. It's unclear to me what a good approach is here. Is there a way to make this more human-readable, so that it tells the user whether there is a problem and, if so, what it is?

When I run rapids doctor --verbose, I get:

🧑‍⚕️ Performing REQUIRED health check for RAPIDS 
Discovering checks
Found check 'cudf_import' provided by 'cudf.health_checks:import_check'
Found check 'cudf_functional' provided by 'cudf.health_checks:functional_check'
Found check 'cudf_functional_numba' provided by 'cudf.health_checks:functional_numba_check'
Found check 'cuda' provided by 'rapids_cli.doctor.checks.cuda_driver:cuda_check'
Found check 'gpu' provided by 'rapids_cli.doctor.checks.gpu:gpu_check'
Found check 'gpu_compute_capability' provided by 'rapids_cli.doctor.checks.gpu:check_gpu_compute_capability'
Found check 'memory_to_gpu_ratio' provided by 'rapids_cli.doctor.checks.memory:check_memory_to_gpu_ratio'
Found check 'nvlink_status' provided by 'rapids_cli.doctor.checks.nvlink:check_nvlink_status'
Discovered 8 checks
Running checks
import_check: cuDF 26.06.00 is available
functional_check: cuDF groupby/agg succeeded
functional_numba_check: cuDF Series.apply (Numba path) succeeded
gpu_check: GPU(s) detected: 8
check_nvlink_status failed
  NVLink 0 Status Check Failed
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /raid/nclementi/conda/envs/cudf-dev/lib/python3.13/site-packages/rapids_cli/doctor/checks/nvlink │
│ .py:23 in check_nvlink_status                                                                    │
│                                                                                                  │
│   20 │   │   handle = pynvml.nvmlDeviceGetHandleByIndex(i)                                       │
│   21 │   │   for nvlink_id in range(pynvml.NVML_NVLINK_MAX_LINKS):                               │
│   22 │   │   │   try:                                                                            │
│ ❱ 23 │   │   │   │   pynvml.nvmlDeviceGetNvLinkState(handle, 0)                                  │
│   24 │   │   │   │   return True                                                                 │
│   25 │   │   │   except pynvml.NVMLError as e:                                                   │
│   26 │   │   │   │   raise ValueError(f"NVLink {nvlink_id} Status Check Failed") from e          │
│                                                                                                  │
│ /raid/nclementi/conda/envs/cudf-dev/lib/python3.13/site-packages/pynvml.py:4690 in               │
│ nvmlDeviceGetNvLinkState                                                                         │
│                                                                                                  │
│   4687 │   c_isActive = c_uint()                                                                 │
│   4688 │   fn = _nvmlGetFunctionPointer("nvmlDeviceGetNvLinkState")                              │
│   4689 │   ret = fn(device, link, byref(c_isActive))                                             │
│ ❱ 4690 │   _nvmlCheckReturn(ret)                                                                 │
│   4691 │   return c_isActive.value                                                               │
│   4692                                                                                           │
│   4693 def nvmlDeviceGetNvLinkVersion(device, link):                                             │
│                                                                                                  │
│ /raid/nclementi/conda/envs/cudf-dev/lib/python3.13/site-packages/pynvml.py:1083 in               │
│ _nvmlCheckReturn                                                                                 │
│                                                                                                  │
│   1080                                                                                           │
│   1081 def _nvmlCheckReturn(ret):                                                                │
│   1082 │   if (ret != NVML_SUCCESS):                                                             │
│ ❱ 1083 │   │   raise NVMLError(ret)                                                              │
│   1084 │   return ret                                                                            │
│   1085                                                                                           │
│   1086 ## Function access ##                                                                     │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
NVMLError_NotSupported: Not Supported

The above exception was the direct cause of the following exception:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /raid/nclementi/conda/envs/cudf-dev/lib/python3.13/site-packages/rapids_cli/doctor/doctor.py:130 │
│ in doctor_check                                                                                  │
│                                                                                                  │
│   127 │   │   │   │   console.print(f"  {result.error}")                                         │
│   128 │   │   │   │   if verbose and result.error:                                               │
│   129 │   │   │   │   │   try:                                                                   │
│ ❱ 130 │   │   │   │   │   │   raise result.error                                                 │
│   131 │   │   │   │   │   except Exception:                                                      │
│   132 │   │   │   │   │   │   console.print_exception()                                          │
│   133 │   │   return False                                                                       │
│                                                                                                  │
│ /raid/nclementi/conda/envs/cudf-dev/lib/python3.13/site-packages/rapids_cli/doctor/doctor.py:90  │
│ in doctor_check                                                                                  │
│                                                                                                  │
│    87 │   │   │   │   with warnings.catch_warnings(record=True) as w:                            │
│    88 │   │   │   │   │   warnings.simplefilter("always")                                        │
│    89 │   │   │   │   │   status = True                                                          │
│ ❱  90 │   │   │   │   │   value = check_fn(verbose=verbose)                                      │
│    91 │   │   │   │   │   caught_warnings = w                                                    │
│    92 │   │   │                                                                                  │
│    93 │   │   │   except Exception as e:                                                         │
│                                                                                                  │
│ /raid/nclementi/conda/envs/cudf-dev/lib/python3.13/site-packages/rapids_cli/doctor/checks/nvlink │
│ .py:26 in check_nvlink_status                                                                    │
│                                                                                                  │
│   23 │   │   │   │   pynvml.nvmlDeviceGetNvLinkState(handle, 0)                                  │
│   24 │   │   │   │   return True                                                                 │
│   25 │   │   │   except pynvml.NVMLError as e:                                                   │
│ ❱ 26 │   │   │   │   raise ValueError(f"NVLink {nvlink_id} Status Check Failed") from e          │
│   27                                                                                             │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: NVLink 0 Status Check Failed
╭─ Error ──────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Health checks failed.                                                                                                │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
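One possible direction, sketched here only as an illustration: the root error in the traceback is NVMLError_NotSupported, which suggests the query is simply not supported on this GPU rather than that a link is broken. The traceback also shows the call being made with a literal 0 instead of nvlink_id, and returning after the first successful query. A check could instead probe each link and report a readable summary per GPU. In the sketch below, summarize_nvlink, LinkNotSupported, get_link_state, and fake_state are all hypothetical names and are not part of rapids_cli or pynvml. In a real check, the callback would presumably wrap pynvml.nvmlDeviceGetNvLinkState and translate pynvml.NVMLError_NotSupported.

```python
class LinkNotSupported(Exception):
    """Hypothetical stand-in for pynvml.NVMLError_NotSupported."""


def summarize_nvlink(device_count, max_links, get_link_state):
    """Build one readable status line per GPU.

    get_link_state(gpu, link) is a hypothetical callback that returns a truthy
    value for an active link and raises LinkNotSupported when the query is
    not supported.
    """
    lines = []
    for gpu in range(device_count):
        active, inactive = [], []
        for link in range(max_links):
            try:
                state = get_link_state(gpu, link)
            except LinkNotSupported:
                # Assumption: once a link is unsupported, stop probing this GPU.
                break
            (active if state else inactive).append(link)
        if not active and not inactive:
            lines.append(f"GPU {gpu}: NVLink not supported (skipped)")
        else:
            lines.append(
                f"GPU {gpu}: {len(active)} active link(s) {active}, "
                f"{len(inactive)} inactive link(s) {inactive}"
            )
    return lines


def fake_state(gpu, link):
    # Simulated hardware: GPU 0 has three links, GPU 1 has no NVLink at all.
    if gpu == 1 or link >= 3:
        raise LinkNotSupported()
    return link != 2  # links 0 and 1 are active, link 2 is inactive


for line in summarize_nvlink(2, 6, fake_state):
    print(line)
```

With something like this, a machine without NVLink would get a "not supported" note instead of a failed check and a chained traceback. Whether that case should count as a pass, a warning, or a failure is a design decision for the maintainers.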
