
feat: AMD Multi-GPU Support #750

Open
y-coffee-dev wants to merge 5 commits into Light-Heart-Labs:main from y-coffee-dev:feat/amd-multi-gpu

Conversation

@y-coffee-dev
Contributor

feat: AMD Multi-GPU Support

End-to-end multi-GPU support for AMD GPUs, matching the existing NVIDIA multi-GPU feature set.

Previously, AMD support was limited to a single GPU. This branch implements end-to-end support for multiple AMD GPUs: hardware discovery, topology analysis, GPU assignment, Docker Compose isolation, CLI management, and monitoring.

What was added

Hardware Detection

  • Multi-GPU AMD detection via sysfs (counts all vendor=0x1002 cards)
  • Handles XCP virtual cards on MI300X (filters by non-empty vendor file)
  • Total VRAM aggregation across all detected AMD GPUs
  • Mixed APU + discrete GPU classification
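The sysfs counting described above can be sketched roughly as follows. This is an illustrative Python sketch, not the PR's actual shell implementation in installers/lib/detection.sh; it assumes the standard /sys/class/drm layout:

```python
from pathlib import Path

AMD_VENDOR_ID = "0x1002"  # PCI vendor ID for AMD

def count_amd_gpus(drm_root="/sys/class/drm"):
    """Count AMD GPUs by reading each card's PCI vendor file.

    Cards with an empty vendor file (e.g. MI300X XCP virtual
    partitions) fail the vendor match and are skipped, mirroring
    the non-empty-vendor filter described above.
    """
    count = 0
    for vendor_file in sorted(Path(drm_root).glob("card*/device/vendor")):
        try:
            vendor = vendor_file.read_text().strip()
        except OSError:
            continue  # unreadable sysfs entry: not a usable card
        if vendor == AMD_VENDOR_ID:
            count += 1
    return count
```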

Topology Detection

  • Full AMD topology library with three detection backends: amd-smi JSON, rocm-smi text, sysfs NUMA/IOMMU fallback
  • Inter-GPU link classification: XGMI, PCIe-SameSwitch, PCIe-HostBridge, PCIe-CrossNUMA
  • Link ranking system (0–100) for topology-aware GPU assignment
  • Per-GPU metadata: render node, GFX version, PCI BDF, VRAM, memory type (unified/discrete)
  • GPU identification with three fallback methods: amd-smi UUID, sysfs unique_id, composite PCI BDF
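The link classification and 0–100 ranking can be illustrated with a small sketch. The rank values below are assumptions for illustration; the actual scores live in amd-topo.sh and may differ:

```python
# Illustrative rank values only; the real scores are defined in amd-topo.sh.
LINK_RANKS = {
    "XGMI": 100,            # direct GPU-to-GPU fabric link
    "PCIe-SameSwitch": 60,  # both GPUs behind the same PCIe switch
    "PCIe-HostBridge": 40,  # traffic crosses the host bridge
    "PCIe-CrossNUMA": 20,   # traffic crosses a NUMA boundary
}

def best_pair(gpu_links):
    """Pick the GPU pair connected by the highest-ranked link.

    gpu_links maps (gpu_a, gpu_b) tuples to a link-type string.
    """
    return max(gpu_links, key=lambda pair: LINK_RANKS.get(gpu_links[pair], 0))
```

Topology-aware assignment then prefers the highest-ranked pairs, e.g. keeping tensor-split partners on the same XGMI fabric.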

Installer Integration

  • AMD topology detection phase (runs when GPU_COUNT > 1 and backend is AMD)
  • Vendor-aware GPU assignment extraction
  • AMD multi-GPU env vars written to .env only when applicable
  • Render node verification in AMD tuning phase

Docker Compose Overlays

  • AMD multi-GPU overlay for llama-server with Lemonade passthrough
  • --split-mode passed via --llamacpp-args (Lemonade's official mechanism)
  • ROCm backend selected via LEMONADE_LLAMACPP env var (compatible with both Python and C++ Lemonade builds)
  • Per-service GPU isolation via ROCR_VISIBLE_DEVICES
  • Renamed existing NVIDIA overlays from generic multigpu to multigpu-nvidia
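A rough sketch of what such an overlay might look like; service name, variable defaults, and device mappings here are illustrative, not the PR's exact file:

```yaml
# docker-compose.multigpu-amd.yml (illustrative sketch, not the actual overlay)
services:
  llama-server:
    environment:
      # Restrict this service to the GPUs assigned by the installer
      ROCR_VISIBLE_DEVICES: "${LLAMA_SERVER_GPU_INDICES:-0}"
      LEMONADE_LLAMACPP: "rocm"
    devices:
      - /dev/kfd
      - /dev/dri
```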

CLI (dream gpu commands)

  • All five GPU commands (status, topology, assign, reassign, monitor) are AMD-aware
  • AMD GPU status table with VRAM, utilization, temperature, power via amd-smi
  • AMD topology display with GFX versions, memory types, and render nodes
  • AMD GPU reassignment writes LLAMA_SERVER_GPU_INDICES and per-service *_GPU_INDEX

Dashboard API

  • AMD GPU monitoring via amd-smi and sysfs hwmon
  • Per-GPU metrics: utilization, VRAM, temperature, power, fan speed
  • GPU assignment decoding from GPU_ASSIGNMENT_JSON_B64
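Decoding the base64-encoded assignment blob can be sketched as below. The exact JSON shape is an assumption based on the fields referenced elsewhere in this PR (gpu_assignment.services.llama_server.gpus):

```python
import base64
import json
import os

def load_gpu_assignment(env=os.environ):
    """Decode the GPU assignment JSON written to .env by the installer."""
    raw = env.get("GPU_ASSIGNMENT_JSON_B64")
    if not raw:
        return None  # single-GPU or pre-assignment install
    return json.loads(base64.b64decode(raw))
```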

Environment & Schema

  • LLAMA_SERVER_GPU_INDICES — comma-separated GPU indices for ROCR_VISIBLE_DEVICES
  • COMFYUI_GPU_INDEX, WHISPER_GPU_INDEX, EMBEDDINGS_GPU_INDEX — per-service GPU index
  • LLAMA_ARG_SPLIT_MODE, LLAMA_ARG_TENSOR_SPLIT — llama.cpp multi-GPU parameters
  • All new env vars added to .env.schema.json and .env.example
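For context, a resulting .env section might look like this; the values are illustrative for a 4-GPU node, not defaults shipped by the PR:

```
# Illustrative values for a 4-GPU MI300X node
LLAMA_SERVER_GPU_INDICES=0,1,2,3
COMFYUI_GPU_INDEX=0
WHISPER_GPU_INDEX=1
EMBEDDINGS_GPU_INDEX=2
LLAMA_ARG_SPLIT_MODE=row
LLAMA_ARG_TENSOR_SPLIT=1,1,1,1
```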

Tests

  • 25 BATS unit tests for amd-topo.sh (render node, GFX version, GPU name, GPU ID, topology parsing)
  • 23 shell integration tests with fixture files (4-GPU XGMI, 2-GPU PCIe, field-based select, cross-tool agreement)
  • 16 pytest tests for dashboard-api AMD GPU monitoring
  • Fixture files extracted from real hardware for amd-smi JSON and rocm-smi text output
  • Real hardware tests on 4x AMD Instinct MI300X

Files Changed

| Area | Files |
| --- | --- |
| Topology detection | installers/lib/amd-topo.sh (new) |
| Hardware detection | installers/lib/detection.sh, installers/phases/02-detection.sh |
| Installer | installers/phases/03-features.sh, installers/phases/06-directories.sh, installers/phases/10-amd-tuning.sh |
| Compose overlays | docker-compose.multigpu-amd.yml (new), docker-compose.multigpu-nvidia.yml (renamed) |
| Service overlays | extensions/services/{comfyui,whisper,embeddings}/compose.multigpu-amd.yaml (new), NVIDIA renamed |
| CLI | dream-cli |
| Assignment | scripts/assign_gpus.py, scripts/resolve-compose-stack.sh |
| Dashboard API | extensions/services/dashboard-api/gpu.py |
| Config | config/gpu-database.json, .env.schema.json, .env.example |
| Tests | tests/bats-tests/amd-topo.bats, tests/test-amd-topo.sh, tests/fixtures/amd/*, extensions/services/dashboard-api/tests/test_gpu_amd.py |

Collaborator

@Lightheartdevs Lightheartdevs left a comment


Audit Review — AMD Multi-GPU Support

Strong PR. The architecture mirrors the NVIDIA multi-GPU system faithfully, the code is well-structured, and the test coverage is excellent (64+ tests across BATS, shell, and pytest with real MI300X hardware fixtures). No security concerns — sysfs reads are guarded, PCI BDF strings are regex-filtered, jq uses numeric variables, no shell injection vectors.

Two bugs to fix, two things to verify:

Bug 1: Bash evaluation order in _gpu_status power reading

if [[ "$pw_uw" -eq 0 || ! "$pw_uw" =~ ^[0-9]+$ ]]

The -eq runs before the regex check. If pw_uw is non-numeric (e.g., sysfs returns an error string), bash throws an integer comparison error. Swap the order:

if [[ ! "$pw_uw" =~ ^[0-9]+$ || "$pw_uw" -eq 0 ]]

Bug 2: Dead jq expression in _gpu_reassign auto-mode

The llama_indices assignment uses a jq filter that always returns empty string:

llama_indices=$(echo "$assignment_json" | jq -r '
    [.gpu_assignment.services.llama_server.gpus[]] as $uuids |
    "" ')

The actual index extraction happens in the while loop below. Remove the dead code.

Verify: compute_subset rank_matrix indexing

In the APU+dGPU hybrid path (assign_gpus.py ~line 488):

discrete_gpus = [g for g in llama_subset.gpus if g.memory_type == "discrete"]
discrete_subset = compute_subset(discrete_gpus, rank_matrix)

Confirm compute_subset uses gpu.index (original topology index) for rank_matrix lookups, not list position. If it uses list position, the filtered subset will produce wrong link rankings.
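The failure mode the reviewer is flagging can be illustrated with a small sketch. The Gpu dataclass and rank_matrix shape here are simplified assumptions; the real compute_subset lives in scripts/assign_gpus.py:

```python
from dataclasses import dataclass

@dataclass
class Gpu:
    index: int        # position in the ORIGINAL, unfiltered topology
    memory_type: str  # "discrete" or "unified"

def subset_ranks(gpus, rank_matrix):
    """Look up pairwise link ranks for a filtered GPU subset.

    Correct behavior: index rank_matrix by each GPU's original
    topology index (g.index), never by its position in the
    filtered list, otherwise rankings shift after filtering.
    """
    indices = [g.index for g in gpus]
    return [(a, b, rank_matrix[a][b]) for a in indices for b in indices if a < b]
```

If the lookup used list positions instead, filtering out an APU at topology index 0 would silently remap every discrete GPU's row in the matrix.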

Verify: Old multigpu.yml filename references

The rename from docker-compose.multigpu.yml to docker-compose.multigpu-nvidia.yml is clean, but grep the full codebase for any hardcoded references to the old filename in dream-cli, docs, or tests.

Non-blocking notes (follow-up material)

  • gpu-database.json: RX 7900 XTX/XT/GRE share device_id 0x744c — matcher must prioritize name_patterns over device_ids
  • Missing schema entries for ROCR_VISIBLE_DEVICES, VIDEO_GID, RENDER_GID, HSA_OVERRIDE_GFX_VERSION
  • _gpu_reassign interactive prompts have no --yes flag — will hang in non-interactive pipelines
  • No tests for 0-GPU and 1-GPU edge cases, _detect_topo_sysfs fallback, or mixed APU+dGPU in the API detailed view

Good

  • 3-backend fallback chain (amd-smi -> rocm-smi -> sysfs) with graceful degradation
  • Hybrid APU+dGPU strategy correctly routes lightweight services to APU, freeing discrete VRAM for LLM
  • All 5 CLI GPU commands are now vendor-aware
  • Tests use real hardware fixtures (4x MI300X XGMI, 2-GPU PCIe) — no flaky mocks
  • Docker Compose overlays follow existing patterns (ROCR_VISIBLE_DEVICES parallels NVIDIA_VISIBLE_DEVICES)
  • resolve-compose-stack.sh correctly constructs vendor-specific overlay filenames
  • All CI failures are pre-existing on main, none caused by this PR

@y-coffee-dev
Contributor Author

Hey! Thank you for the thorough review and for the kind words, I really appreciate it!

Fixed bug 1 (power reading eval order):
Great catch. You're right that -eq on a non-numeric value would cause issues before the regex guard gets a chance to run. In practice pw_uw is initialized to 0 and reset on cat failure, so sysfs would have to return a non-numeric string successfully for this to trigger, which is unlikely but not impossible. I swapped the order so the regex check runs first just in case.

Fixed Bug 2 (dead llama_indices):
Yep, that was a leftover from an earlier approach. The jq expression [...] as $uuids | "" literally evaluates to an empty string, and the variable is never referenced after assignment. I removed it.

Verified compute_subset rank_matrix indexing:
I went through this carefully: compute_subset() at line 130 does indices = [g.index for g in gpus] and then uses those indices for rank_matrix lookups via get_rank(rank_matrix, a, b). So when we filter to discrete-only GPUs, ranks are still looked up by the original topology indices, not by list position; we're good.

Verified old multigpu.yml filename references:
I grepped the full codebase. The old docker-compose.multigpu.yml / compose.multigpu.yaml do not appear anywhere: no references in dream-cli, resolve-compose-stack.sh, tests, or any other production code. All good on this as well.

Non-blocking notes:
All fair observations. I went ahead and addressed the schema one: ROCR_VISIBLE_DEVICES was the only entry actually missing from .env.schema.json; VIDEO_GID, RENDER_GID, and HSA_OVERRIDE_GFX_VERSION were already there, so I just improved their descriptions and added defaults. I also updated .env.example with all four AMD-specific vars.

Thanks again for the review!
