feat: AMD Multi-GPU Support #750
Conversation
Lightheartdevs left a comment
Audit Review — AMD Multi-GPU Support
Strong PR. The architecture mirrors the NVIDIA multi-GPU system faithfully, the code is well-structured, and the test coverage is excellent (64+ tests across BATS, shell, and pytest with real MI300X hardware fixtures). No security concerns — sysfs reads are guarded, PCI BDF strings are regex-filtered, jq uses numeric variables, no shell injection vectors.
Two bugs to fix, two things to verify:
Bug 1: Bash evaluation order in `_gpu_status` power reading
```shell
if [[ "$pw_uw" -eq 0 || ! "$pw_uw" =~ ^[0-9]+$ ]]
```

The `-eq` runs before the regex check. If `pw_uw` is non-numeric (e.g., sysfs returns an error string), bash throws an integer comparison error. Swap the order:

```shell
if [[ ! "$pw_uw" =~ ^[0-9]+$ || "$pw_uw" -eq 0 ]]
```

Bug 2: Dead jq expression in `_gpu_reassign` auto-mode
The `llama_indices` assignment uses a jq filter that always returns an empty string:

```shell
llama_indices=$(echo "$assignment_json" | jq -r '
  [.gpu_assignment.services.llama_server.gpus[]] as $uuids |
  "" ')
```

The actual index extraction happens in the while loop below it. Remove the dead code.
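The guard-order pitfall from Bug 1 can be sanity-checked in isolation. A minimal sketch — the function name and the microwatt-to-watt conversion are illustrative, only the reordered guard comes from the PR:

```shell
# Regex check first: it short-circuits before the arithmetic test,
# so a non-numeric sysfs read never reaches -eq.
check_power() {
  pw_uw="$1"
  if [[ ! "$pw_uw" =~ ^[0-9]+$ || "$pw_uw" -eq 0 ]]; then
    echo "N/A"
  else
    echo "$((pw_uw / 1000000)) W"
  fi
}

check_power "215000000"   # valid microwatt reading -> 215 W
check_power "0"           # zero -> N/A
check_power "ERR_NO_DATA" # non-numeric -> N/A, no bash error
```

With the original ordering, the third call would abort with an `integer expression expected`-style error instead of printing `N/A`.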
Verify: `compute_subset` rank_matrix indexing
In the APU+dGPU hybrid path (assign_gpus.py ~line 488):

```python
discrete_gpus = [g for g in llama_subset.gpus if g.memory_type == "discrete"]
discrete_subset = compute_subset(discrete_gpus, rank_matrix)
```

Confirm `compute_subset` uses `gpu.index` (the original topology index) for `rank_matrix` lookups, not list position. If it uses list position, the filtered subset will produce wrong link rankings.
Verify: Old multigpu.yml filename references
The rename from docker-compose.multigpu.yml to docker-compose.multigpu-nvidia.yml is clean, but grep the full codebase for any hardcoded references to the old filename in dream-cli, docs, or tests.
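A quick way to run the suggested scan — the helper name is hypothetical; the pattern escapes the dots so the new `docker-compose.multigpu-nvidia.yml` name does not match:

```shell
# Print every file under the given tree (default: cwd) that still
# references the pre-rename overlay filename.
scan_stale_refs() {
  grep -rIl 'docker-compose\.multigpu\.yml' "${1:-.}" 2>/dev/null
}
```

Empty output means the rename left no stale references behind.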
Non-blocking notes (follow-up material)
- gpu-database.json: RX 7900 XTX/XT/GRE share device_id `0x744c` — matcher must prioritize `name_patterns` over `device_ids`
- Missing schema entries for `ROCR_VISIBLE_DEVICES`, `VIDEO_GID`, `RENDER_GID`, `HSA_OVERRIDE_GFX_VERSION`
- `_gpu_reassign` interactive prompts have no `--yes` flag — will hang in non-interactive pipelines
- No tests for 0-GPU and 1-GPU edge cases, `_detect_topo_sysfs` fallback, or mixed APU+dGPU in the API detailed view
Good
- 3-backend fallback chain (amd-smi -> rocm-smi -> sysfs) with graceful degradation
- Hybrid APU+dGPU strategy correctly routes lightweight services to APU, freeing discrete VRAM for LLM
- All 5 CLI GPU commands are now vendor-aware
- Tests use real hardware fixtures (4x MI300X XGMI, 2-GPU PCIe) — no flaky mocks
- Docker Compose overlays follow existing patterns (`ROCR_VISIBLE_DEVICES` parallels `NVIDIA_VISIBLE_DEVICES`)
- `resolve-compose-stack.sh` correctly constructs vendor-specific overlay filenames
- All CI failures are pre-existing on main, none caused by this PR
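The 3-backend fallback chain praised above can be sketched as follows — an illustrative helper, not the PR's actual detection code; only the tool names (`amd-smi`, `rocm-smi`, sysfs) come from the review:

```shell
# Pick the best available AMD query backend, degrading gracefully.
detect_amd_backend() {
  if command -v amd-smi >/dev/null 2>&1; then
    echo "amd-smi"
  elif command -v rocm-smi >/dev/null 2>&1; then
    echo "rocm-smi"
  elif ls /sys/class/drm/card*/device/vendor >/dev/null 2>&1; then
    echo "sysfs"
  else
    echo "none"   # caller can fall back to single-GPU behavior
  fi
}
```

`command -v` is the portable way to probe for each tool; the sysfs glob check keeps the chain working on hosts without any SMI tooling installed.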
Hey! Thank you for the thorough review and for the kind words, I really appreciate it!

- Fixed Bug 1 (power reading eval order)
- Fixed Bug 2 (dead llama_indices)
- Verified compute_subset rank_matrix indexing
- Verified old multigpu.yml filename references
- Non-blocking notes

Thanks again for the review!
feat: AMD Multi-GPU Support
End-to-end multi-GPU support for AMD GPUs, matching the existing NVIDIA multi-GPU feature set.
Previously, AMD support was limited to a single GPU. This branch implements end-to-end multi-GPU support: hardware discovery, topology analysis, GPU assignment, Docker Compose isolation, CLI management, and monitoring.
What was added
Hardware Detection
- Enumerates AMD GPUs (`vendor=0x1002` cards)

Topology Detection
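Both detection steps lean on sysfs. A minimal sketch of the `vendor=0x1002` scan — a hypothetical helper, not the PR's `amd-topo.sh`; the sysfs root is parameterized so it can be tested against a fake tree:

```shell
# List DRM card nodes whose PCI vendor ID is AMD (0x1002).
list_amd_cards() {
  root="${1:-/sys/class/drm}"
  for v in "$root"/card*/device/vendor; do
    [ -r "$v" ] || continue
    if [ "$(cat "$v")" = "0x1002" ]; then
      dirname "$(dirname "$v")"   # strip /device/vendor -> .../cardN
    fi
  done
}
```

Matching on the vendor ID rather than driver name also catches cards bound to `amdgpu` and older `radeon` devices alike.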
Installer Integration
- `.env` entries written only when applicable

Docker Compose Overlays
- `--split-mode` passed via `--llamacpp-args` (Lemonade's official mechanism)
- `LEMONADE_LLAMACPP` env var (compatible with both Python and C++ Lemonade builds)
- GPU isolation via `ROCR_VISIBLE_DEVICES`
- NVIDIA overlay renamed from `multigpu` to `multigpu-nvidia`

CLI (`dream gpu` commands)

- All commands (`status`, `topology`, `assign`, `reassign`, `monitor`) are AMD-aware
- Use `LLAMA_SERVER_GPU_INDICES` and per-service `*_GPU_INDEX`

Dashboard API
- `GPU_ASSIGNMENT_JSON_B64`

Environment & Schema
- `LLAMA_SERVER_GPU_INDICES` — comma-separated GPU indices for `ROCR_VISIBLE_DEVICES`
- `COMFYUI_GPU_INDEX`, `WHISPER_GPU_INDEX`, `EMBEDDINGS_GPU_INDEX` — per-service GPU index
- `LLAMA_ARG_SPLIT_MODE`, `LLAMA_ARG_TENSOR_SPLIT` — llama.cpp multi-GPU parameters
- Added to `.env.schema.json` and `.env.example`

Tests
Files Changed
- `installers/lib/amd-topo.sh` (new)
- `installers/lib/detection.sh`, `installers/phases/02-detection.sh`
- `installers/phases/03-features.sh`, `installers/phases/06-directories.sh`, `installers/phases/10-amd-tuning.sh`
- `docker-compose.multigpu-amd.yml` (new), `docker-compose.multigpu-nvidia.yml` (renamed)
- `extensions/services/{comfyui,whisper,embeddings}/compose.multigpu-amd.yaml` (new), NVIDIA renamed
- `dream-cli`
- `scripts/assign_gpus.py`, `scripts/resolve-compose-stack.sh`
- `extensions/services/dashboard-api/gpu.py`
- `config/gpu-database.json`, `.env.schema.json`, `.env.example`
- `tests/bats-tests/amd-topo.bats`, `tests/test-amd-topo.sh`, `tests/fixtures/amd/*`, `extensions/services/dashboard-api/tests/test_gpu_amd.py`