GPU not working on prod #1332

@homework36

Updated 1/26:

Clarification: GPU jobs (excluding training) CAN be submitted on production, but we cannot use the full capacity of the GPUs, so jobs take a very long time. Please use staging until further notice. There's nothing we can do on our end at this point.


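Well, it's working, but also not working: the GPU worker starts and receives the task, but the job never gets past the first progress step. This is probably related to the Arbutus outage.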

root@prod-rodan-vgpu:/srv/webapps/Rodan# make gpu-celery_log
[2026-01-19 19:24:37,467: INFO/MainProcess] Connected to amqp://someadmin:**@rabbitmq:5672//
[2026-01-19 19:24:37,659: INFO/MainProcess] mingle: searching for neighbors
[2026-01-19 19:24:39,375: INFO/MainProcess] mingle: all alone
[2026-01-19 19:24:40,428: INFO/MainProcess] celery@GPU ready.
[2026-01-19 19:31:07,251: INFO/MainProcess] Received task: Fast Pixelwise Analysis of Music Document, Classifying[0a08daed-c17f-42ef-92e6-849bfcd0755c]
[2026-01-19 19:31:07,331: INFO/ForkPoolWorker-1] started running the task!
[2026-01-19 19:37:30,073: WARNING/ForkPoolWorker-1] Fast Pixelwise Analysis of Music Document, Classifying[0a08daed-c17f-42ef-92e6-849bfcd0755c]: 0 / 1535

[stuck here forever]
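
For anyone who wants to check for this stall without watching the log by hand, here is a rough sketch in Python. It assumes the progress lines always end in `done / total` as in the log above, and that the caller supplies `now` in the same timezone as the log timestamps. The function names and the 30-minute threshold are made up for illustration and are not part of Rodan or Celery.

```python
import re
from datetime import datetime, timedelta

# Timestamp prefix of a Celery log line, e.g. "[2026-01-19 19:37:30,073: WARNING/..."
LOG_PREFIX = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+: ")
# Progress suffix at the end of a line, e.g. ": 0 / 1535"
PROGRESS = re.compile(r": (\d+) / (\d+)\s*$")


def last_progress(log_text):
    """Return (timestamp, done, total) for the last progress line, or None."""
    result = None
    for line in log_text.splitlines():
        prefix = LOG_PREFIX.match(line)
        progress = PROGRESS.search(line)
        if prefix and progress:
            when = datetime.strptime(prefix.group(1), "%Y-%m-%d %H:%M:%S")
            result = (when, int(progress.group(1)), int(progress.group(2)))
    return result


def is_stalled(log_text, now, max_idle=timedelta(minutes=30)):
    """True if the last progress line is incomplete and older than max_idle.

    The threshold is an arbitrary assumption; tune it to how long one
    progress step normally takes.
    """
    latest = last_progress(log_text)
    if latest is None:
        return False
    when, done, total = latest
    return done < total and (now - when) > max_idle
```

Feeding it the output of `make gpu-celery_log` and the current time should flag a job that is still sitting at `0 / 1535` well after it started.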

root@prod-rodan-vgpu:/srv/webapps/Rodan# nvidia-smi
Tue Jan 20 00:43:30 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  GRID V100D-16C                 On  |   00000000:00:05.0 Off |                    0 |
| N/A   N/A    P0             N/A /  N/A  |   14470MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A    254044      C   /usr/bin/python3.7                          14470MiB |
+-----------------------------------------------------------------------------------------+
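
The pattern in the table above (most of the memory allocated by the Python process, GPU-Util at 0%) can also be checked from a script. Below is a minimal sketch using `nvidia-smi`'s CSV query mode. The helper names and the 80% memory threshold are assumptions for illustration. Keep in mind that GPU-Util is a sampled value, so a single 0% reading does not prove a hang on its own.

```python
import subprocess

# CSV query mode of nvidia-smi: one line per GPU, values without units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def _to_int(field):
    """Convert a CSV field to int; fields like "[N/A]" become None."""
    try:
        return int(field.strip())
    except ValueError:
        return None


def parse_gpu_csv(text):
    """Parse the query output into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        used, total, util = (_to_int(f) for f in line.split(","))
        rows.append({"used_mib": used, "total_mib": total, "util_pct": util})
    return rows


def looks_idle_but_full(row, mem_fraction=0.8):
    """True if memory is mostly allocated while utilization reads 0%.

    mem_fraction is an arbitrary threshold chosen for this sketch.
    """
    if None in row.values():
        return False
    return row["util_pct"] == 0 and row["used_mib"] >= mem_fraction * row["total_mib"]


def read_gpus():
    """Run nvidia-smi on this host and return the parsed rows."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)
```

On the GPU host, `[looks_idle_but_full(g) for g in read_gpus()]` gives one flag per GPU. Sampling it a few times, a minute or so apart, gives a more reliable signal than a single reading.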
