Updated 1/26:
To clarify: GPU jobs (other than training) CAN be submitted on production, but we cannot use the GPUs' full capacity, so jobs take a very long time. Please use staging until further notice; there is nothing we can do on our end at this point.
It's working, but also not working: the worker starts and picks up tasks, but they never make progress. Probably related to the Arbutus outage.
root@prod-rodan-vgpu:/srv/webapps/Rodan# make gpu-celery_log
[2026-01-19 19:24:37,467: INFO/MainProcess] Connected to amqp://someadmin:**@rabbitmq:5672//
[2026-01-19 19:24:37,659: INFO/MainProcess] mingle: searching for neighbors
[2026-01-19 19:24:39,375: INFO/MainProcess] mingle: all alone
[2026-01-19 19:24:40,428: INFO/MainProcess] celery@GPU ready.
[2026-01-19 19:31:07,251: INFO/MainProcess] Received task: Fast Pixelwise Analysis of Music Document, Classifying[0a08daed-c17f-42ef-92e6-849bfcd0755c]
[2026-01-19 19:31:07,331: INFO/ForkPoolWorker-1] started running the task!
[2026-01-19 19:37:30,073: WARNING/ForkPoolWorker-1] Fast Pixelwise Analysis of Music Document, Classifying[0a08daed-c17f-42ef-92e6-849bfcd0755c]: 0 / 1535
[stuck here forever]
root@prod-rodan-vgpu:/srv/webapps/Rodan# nvidia-smi
Tue Jan 20 00:43:30 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  GRID V100D-16C                 On  |   00000000:00:05.0 Off |                    0 |
| N/A   N/A    P0             N/A /  N/A  |   14470MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                                    Usage |
|=========================================================================================|
|    0   N/A  N/A    254044      C   /usr/bin/python3.7                          14470MiB |
+-----------------------------------------------------------------------------------------+
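For the record, the telltale symptom in the nvidia-smi output above is ~14.5 GiB of memory held while GPU utilization sits at 0%: a process is pinning memory without doing any work. As a hypothetical diagnostic sketch (the CSV query fields and the 1 GiB threshold are assumptions, not part of this report), one could flag such GPUs by parsing `nvidia-smi --query-gpu=index,memory.used,utilization.gpu --format=csv,noheader,nounits` output:

```python
# Sketch: flag GPUs that hold significant memory while showing 0% utilization.
# Input is the CSV output of the nvidia-smi query shown above; the threshold
# (1024 MiB) is an arbitrary assumption for this example.
def find_stuck_gpus(csv_text, mem_threshold_mib=1024):
    """Return indices of GPUs with memory allocated but no compute activity."""
    stuck = []
    for line in csv_text.strip().splitlines():
        index, mem_used, util = (field.strip() for field in line.split(","))
        if int(mem_used) >= mem_threshold_mib and int(util) == 0:
            stuck.append(int(index))
    return stuck

# Example: the single-GPU state captured above (14470 MiB used, 0% util)
sample = "0, 14470, 0"
print(find_stuck_gpus(sample))  # [0]
```

A cron job running something like this could alert us the next time a worker wedges instead of us finding out from a task stuck at 0 / 1535.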