
feat: AMD Multi-GPU Support #750

Open
y-coffee-dev wants to merge 5 commits into Light-Heart-Labs:main from y-coffee-dev:feat/amd-multi-gpu

Conversation

@y-coffee-dev
Contributor

feat: AMD Multi-GPU Support

End-to-end multi-GPU support for AMD GPUs, matching the existing NVIDIA multi-GPU feature set.

Previously, AMD support was limited to a single GPU. This branch implements end-to-end support for multiple AMD GPUs: hardware discovery, topology analysis, GPU assignment, Docker Compose isolation, CLI management, and monitoring.

What was added

Hardware Detection

  • Multi-GPU AMD detection via sysfs (counts all vendor=0x1002 cards)
  • Handles XCP virtual cards on MI300X (filters by non-empty vendor file)
  • Total VRAM aggregation across all detected AMD GPUs
  • Mixed APU + discrete GPU classification
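The sysfs counting described above can be sketched roughly as follows. This is an illustrative Python sketch, not the PR's actual shell implementation in installers/lib/detection.sh; it assumes the standard /sys/class/drm layout:

```python
from pathlib import Path

AMD_VENDOR_ID = "0x1002"  # PCI vendor ID for AMD

def count_amd_gpus(drm_root="/sys/class/drm"):
    """Count AMD GPUs by reading each card's PCI vendor file.

    Cards with an empty vendor file (e.g. MI300X XCP virtual
    partitions) fail the vendor match and are skipped, mirroring
    the non-empty-vendor filter described above.
    """
    count = 0
    for vendor_file in sorted(Path(drm_root).glob("card*/device/vendor")):
        try:
            vendor = vendor_file.read_text().strip()
        except OSError:
            continue  # unreadable sysfs entry: not a usable card
        if vendor == AMD_VENDOR_ID:
            count += 1
    return count
```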

Topology Detection

  • Full AMD topology library with three detection backends: amd-smi JSON, rocm-smi text, sysfs NUMA/IOMMU fallback
  • Inter-GPU link classification: XGMI, PCIe-SameSwitch, PCIe-HostBridge, PCIe-CrossNUMA
  • Link ranking system (0–100) for topology-aware GPU assignment
  • Per-GPU metadata: render node, GFX version, PCI BDF, VRAM, memory type (unified/discrete)
  • GPU identification with three fallback methods: amd-smi UUID, sysfs unique_id, composite PCI BDF
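The link classification and 0–100 ranking can be illustrated with a small sketch. The rank values below are assumptions for illustration; the actual scores live in amd-topo.sh and may differ:

```python
# Illustrative rank values only; the real scores are defined in amd-topo.sh.
LINK_RANKS = {
    "XGMI": 100,            # direct GPU-to-GPU fabric link
    "PCIe-SameSwitch": 60,  # both GPUs behind the same PCIe switch
    "PCIe-HostBridge": 40,  # traffic crosses the host bridge
    "PCIe-CrossNUMA": 20,   # traffic crosses a NUMA boundary
}

def best_pair(gpu_links):
    """Pick the GPU pair connected by the highest-ranked link.

    gpu_links maps (gpu_a, gpu_b) tuples to a link-type string.
    """
    return max(gpu_links, key=lambda pair: LINK_RANKS.get(gpu_links[pair], 0))
```

Topology-aware assignment then prefers the highest-ranked pairs, e.g. keeping tensor-split partners on the same XGMI fabric.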

Installer Integration

  • AMD topology detection phase (runs when GPU_COUNT > 1 and backend is AMD)
  • Vendor-aware GPU assignment extraction
  • AMD multi-GPU env vars written to .env only when applicable
  • Render node verification in AMD tuning phase

Docker Compose Overlays

  • AMD multi-GPU overlay for llama-server with Lemonade passthrough
  • --split-mode passed via --llamacpp-args (Lemonade's official mechanism)
  • ROCm backend selected via LEMONADE_LLAMACPP env var (compatible with both Python and C++ Lemonade builds)
  • Per-service GPU isolation via ROCR_VISIBLE_DEVICES
  • Renamed existing NVIDIA overlays from generic multigpu to multigpu-nvidia
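A rough sketch of what such an overlay might look like; service name, variable defaults, and device mappings here are illustrative, not the PR's exact file:

```yaml
# docker-compose.multigpu-amd.yml (illustrative sketch, not the actual overlay)
services:
  llama-server:
    environment:
      # Restrict this service to the GPUs assigned by the installer
      ROCR_VISIBLE_DEVICES: "${LLAMA_SERVER_GPU_INDICES:-0}"
      LEMONADE_LLAMACPP: "rocm"
    devices:
      - /dev/kfd
      - /dev/dri
```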

CLI (dream gpu commands)

  • All five GPU commands (status, topology, assign, reassign, monitor) are AMD-aware
  • AMD GPU status table with VRAM, utilization, temperature, power via amd-smi
  • AMD topology display with GFX versions, memory types, and render nodes
  • AMD GPU reassignment writes LLAMA_SERVER_GPU_INDICES and per-service *_GPU_INDEX

Dashboard API

  • AMD GPU monitoring via amd-smi and sysfs hwmon
  • Per-GPU metrics: utilization, VRAM, temperature, power, fan speed
  • GPU assignment decoding from GPU_ASSIGNMENT_JSON_B64
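Decoding the base64-encoded assignment blob can be sketched as below. The exact JSON shape is an assumption based on the fields referenced elsewhere in this PR (gpu_assignment.services.llama_server.gpus):

```python
import base64
import json
import os

def load_gpu_assignment(env=os.environ):
    """Decode the GPU assignment JSON written to .env by the installer."""
    raw = env.get("GPU_ASSIGNMENT_JSON_B64")
    if not raw:
        return None  # single-GPU or pre-assignment install
    return json.loads(base64.b64decode(raw))
```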

Environment & Schema

  • LLAMA_SERVER_GPU_INDICES — comma-separated GPU indices for ROCR_VISIBLE_DEVICES
  • COMFYUI_GPU_INDEX, WHISPER_GPU_INDEX, EMBEDDINGS_GPU_INDEX — per-service GPU index
  • LLAMA_ARG_SPLIT_MODE, LLAMA_ARG_TENSOR_SPLIT — llama.cpp multi-GPU parameters
  • All new env vars added to .env.schema.json and .env.example
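For context, a resulting .env section might look like this; the values are illustrative for a 4-GPU node, not defaults shipped by the PR:

```
# Illustrative values for a 4-GPU MI300X node
LLAMA_SERVER_GPU_INDICES=0,1,2,3
COMFYUI_GPU_INDEX=0
WHISPER_GPU_INDEX=1
EMBEDDINGS_GPU_INDEX=2
LLAMA_ARG_SPLIT_MODE=row
LLAMA_ARG_TENSOR_SPLIT=1,1,1,1
```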

Tests

  • 25 BATS unit tests for amd-topo.sh (render node, GFX version, GPU name, GPU ID, topology parsing)
  • 23 shell integration tests with fixture files (4-GPU XGMI, 2-GPU PCIe, field-based select, cross-tool agreement)
  • 16 pytest tests for dashboard-api AMD GPU monitoring
  • Fixture files extracted from real hardware for amd-smi JSON and rocm-smi text output
  • Real hardware tests on 4x AMD Instinct MI300X

Files Changed

| Area | Files |
| --- | --- |
| Topology detection | installers/lib/amd-topo.sh (new) |
| Hardware detection | installers/lib/detection.sh, installers/phases/02-detection.sh |
| Installer | installers/phases/03-features.sh, installers/phases/06-directories.sh, installers/phases/10-amd-tuning.sh |
| Compose overlays | docker-compose.multigpu-amd.yml (new), docker-compose.multigpu-nvidia.yml (renamed) |
| Service overlays | extensions/services/{comfyui,whisper,embeddings}/compose.multigpu-amd.yaml (new), NVIDIA renamed |
| CLI | dream-cli |
| Assignment | scripts/assign_gpus.py, scripts/resolve-compose-stack.sh |
| Dashboard API | extensions/services/dashboard-api/gpu.py |
| Config | config/gpu-database.json, .env.schema.json, .env.example |
| Tests | tests/bats-tests/amd-topo.bats, tests/test-amd-topo.sh, tests/fixtures/amd/*, extensions/services/dashboard-api/tests/test_gpu_amd.py |

Collaborator

@Lightheartdevs Lightheartdevs left a comment


Audit Review — AMD Multi-GPU Support

Strong PR. The architecture mirrors the NVIDIA multi-GPU system faithfully, the code is well-structured, and the test coverage is excellent (64+ tests across BATS, shell, and pytest with real MI300X hardware fixtures). No security concerns — sysfs reads are guarded, PCI BDF strings are regex-filtered, jq uses numeric variables, no shell injection vectors.

Two bugs to fix, two things to verify:

Bug 1: Bash evaluation order in _gpu_status power reading

if [[ "$pw_uw" -eq 0 || ! "$pw_uw" =~ ^[0-9]+$ ]]

The -eq runs before the regex check. If pw_uw is non-numeric (e.g., sysfs returns an error string), bash throws an integer comparison error. Swap the order:

if [[ ! "$pw_uw" =~ ^[0-9]+$ || "$pw_uw" -eq 0 ]]

Bug 2: Dead jq expression in _gpu_reassign auto-mode

The llama_indices assignment uses a jq filter that always returns empty string:

llama_indices=$(echo "$assignment_json" | jq -r '
    [.gpu_assignment.services.llama_server.gpus[]] as $uuids |
    "" ')

The actual index extraction happens in the while loop below. Remove the dead code.

Verify: compute_subset rank_matrix indexing

In the APU+dGPU hybrid path (assign_gpus.py ~line 488):

discrete_gpus = [g for g in llama_subset.gpus if g.memory_type == "discrete"]
discrete_subset = compute_subset(discrete_gpus, rank_matrix)

Confirm compute_subset uses gpu.index (original topology index) for rank_matrix lookups, not list position. If it uses list position, the filtered subset will produce wrong link rankings.
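The failure mode the reviewer is flagging can be illustrated with a small sketch. The Gpu dataclass and rank_matrix shape here are simplified assumptions; the real compute_subset lives in scripts/assign_gpus.py:

```python
from dataclasses import dataclass

@dataclass
class Gpu:
    index: int        # position in the ORIGINAL, unfiltered topology
    memory_type: str  # "discrete" or "unified"

def subset_ranks(gpus, rank_matrix):
    """Look up pairwise link ranks for a filtered GPU subset.

    Correct behavior: index rank_matrix by each GPU's original
    topology index (g.index), never by its position in the
    filtered list, otherwise rankings shift after filtering.
    """
    indices = [g.index for g in gpus]
    return [(a, b, rank_matrix[a][b]) for a in indices for b in indices if a < b]
```

If the lookup used list positions instead, filtering out an APU at topology index 0 would silently remap every discrete GPU's row in the matrix.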

Verify: Old multigpu.yml filename references

The rename from docker-compose.multigpu.yml to docker-compose.multigpu-nvidia.yml is clean, but grep the full codebase for any hardcoded references to the old filename in dream-cli, docs, or tests.

Non-blocking notes (follow-up material)

  • gpu-database.json: RX 7900 XTX/XT/GRE share device_id 0x744c — matcher must prioritize name_patterns over device_ids
  • Missing schema entries for ROCR_VISIBLE_DEVICES, VIDEO_GID, RENDER_GID, HSA_OVERRIDE_GFX_VERSION
  • _gpu_reassign interactive prompts have no --yes flag — will hang in non-interactive pipelines
  • No tests for 0-GPU and 1-GPU edge cases, _detect_topo_sysfs fallback, or mixed APU+dGPU in the API detailed view

Good

  • 3-backend fallback chain (amd-smi -> rocm-smi -> sysfs) with graceful degradation
  • Hybrid APU+dGPU strategy correctly routes lightweight services to APU, freeing discrete VRAM for LLM
  • All 5 CLI GPU commands are now vendor-aware
  • Tests use real hardware fixtures (4x MI300X XGMI, 2-GPU PCIe) — no flaky mocks
  • Docker Compose overlays follow existing patterns (ROCR_VISIBLE_DEVICES parallels NVIDIA_VISIBLE_DEVICES)
  • resolve-compose-stack.sh correctly constructs vendor-specific overlay filenames
  • All CI failures are pre-existing on main, none caused by this PR

@y-coffee-dev
Contributor Author

Hey! Thank you for the thorough review and for the kind words, I really appreciate it!

Fixed bug 1 (power reading eval order):
Great catch. You're right that -eq on a non-numeric value would cause issues before the regex guard gets a chance to run. In practice pw_uw is initialized to 0 and reset on cat failure, so sysfs would have to return a non-numeric string successfully for this to trigger, which is unlikely but not impossible. I swapped the order so the regex check runs first just in case.

Fixed Bug 2 (dead llama_indices):
Yep, that was a leftover from an earlier approach. The jq expression [...] as $uuids | "" literally evaluates to an empty string, and the variable is never referenced after assignment. I removed it.

Verified compute_subset rank_matrix indexing:
I went through this carefully: compute_subset() at line 130 does indices = [g.index for g in gpus] and then uses those indices for rank_matrix lookups via get_rank(rank_matrix, a, b). So when we filter to discrete-only GPUs, ranks are still looked up by the original topology indices, not by list position; we're good.

Verified old multigpu.yml filename references:
I grepped the full codebase. The old docker-compose.multigpu.yml / compose.multigpu.yaml do not appear anywhere: no references in dream-cli, resolve-compose-stack.sh, tests, or any other production code. All good on this as well.

Non-blocking notes:
All fair observations. I went ahead and addressed the schema one: ROCR_VISIBLE_DEVICES was the only entry actually missing from .env.schema.json; VIDEO_GID, RENDER_GID, and HSA_OVERRIDE_GFX_VERSION were already there, so I just improved their descriptions and added defaults. I also updated .env.example with all four AMD-specific vars.

Thanks again for the review!
