AMD Strix Halo #4

@FirebirdRender


First off, kudos, amazing concept, cool to see it implemented.

While I realize this is directly outside the scope of your macOS-only project, have you given any thought to how it might align with a Strix Halo machine on the x64 side, since it also uses unified memory?

Sharing a couple of thoughts that Gemini 3.1 Pro had when I asked it about porting to x64. Feel free to delete; posting in case it's interesting to anyone else.


To port @t8/hypura from Apple Silicon (macOS/ARM64) to Ubuntu 24 on Intel hardware (Linux/x86_64), you would face a number of architectural and system-level challenges. Although Rust itself is cross-platform, this project is deeply intertwined with Apple's hardware and OS-level memory management.

Here is an evaluation of where you should start and the primary challenges you will encounter:

Where to Start

  1. Update the build system (hypura-sys/build.rs)
    The current build script is hardcoded for Apple Silicon. You will need to modify it to:

Change CMake flags: drop GGML_METAL=ON and the macOS-specific flags. Enable GGML_CUDA=ON (for NVIDIA) or GGML_HIP=ON (for AMD).
Remove macOS framework linking (Metal, Foundation, MetalKit, Accelerate).
Link the appropriate Linux libraries (e.g., cuda, cudart, cublas).
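For the AMD path, the configure step might look roughly like this (a sketch, not a verified recipe: GGML_HIP is the flag in recent upstream llama.cpp trees, older ones used GGML_HIPBLAS, and gfx1151 is Strix Halo's gfx target; the exact flag set hypura would need is an assumption):

```shell
# Hypothetical cmake invocation for llama.cpp's HIP backend on Strix Halo.
CC=/opt/rocm/llvm/bin/clang CXX=/opt/rocm/llvm/bin/clang++ \
  cmake -B build \
    -DGGML_HIP=ON \
    -DAMDGPU_TARGETS=gfx1151
cmake --build build -j
```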
  2. Refactor the hardware profiler (src/profiler/)

GPU Detection: src/profiler/gpu.rs currently looks specifically for "MTL" or "Metal" backend devices and hardcodes lookup tables for "M1", "M1 Pro", etc. You'll need to write a Linux equivalent to query CUDA/ROCm device info and available VRAM.
NVMe Profiling: Ensure the disk I/O profiling correctly queries block devices in Linux (e.g., via /sys/block or lsblk equivalents) rather than macOS disk utilities.

  3. Adapt direct I/O and memory APIs
The project relies on the macOS-specific F_NOCACHE fcntl to bypass the page cache when streaming from NVMe. On Linux, this will need to be replaced with the O_DIRECT open flag, which additionally requires the buffer, file offset, and read length to be aligned to the device's logical block size.
You will need to check the C buffer wrapper (hypura-sys/src/hypura_buft.c) where mmap is used.
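To make the F_NOCACHE → O_DIRECT difference concrete, here is a minimal sketch of an uncached read on Linux (read_direct is a hypothetical helper, not a function from hypura; the 4096-byte alignment is an assumption that should really be queried from the device):

```c
// Sketch: reading a block with O_DIRECT on Linux. Unlike macOS's
// fcntl(fd, F_NOCACHE, 1), which can be toggled on an existing fd,
// O_DIRECT is passed at open() time and requires the buffer, offset,
// and length to be block-aligned (assumed 4096 here).
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

// Reads `len` bytes at `offset`, bypassing the page cache.
// Returns bytes read, or -1 on error. `len` and `offset` must be
// multiples of the block size.
ssize_t read_direct(const char *path, void *dst, size_t len, off_t offset) {
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0) return -1;
    void *aligned;
    if (posix_memalign(&aligned, 4096, len) != 0) { close(fd); return -1; }
    ssize_t n = pread(fd, aligned, len, offset);
    if (n > 0) memcpy(dst, aligned, (size_t)n);
    free(aligned);
    close(fd);
    return n;
}
```

Note that some filesystems (tmpfs in particular) reject O_DIRECT outright, so a fallback path to plain pread is worth keeping.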

The Major Challenges

  1. Unified Memory vs. Discrete Memory (The Biggest Hurdle)
    The core design of Hypura relies on the fact that Apple Silicon uses Unified Memory. When Hypura streams weights into RAM, the GPU can immediately access them because system RAM and VRAM are the exact same physical chips. On Intel/NVIDIA (Discrete GPU): RAM and VRAM are separate. If you stream expert weights from an NVMe drive into system RAM, you then have to transfer them over the PCIe bus to the GPU's VRAM before the GPU can compute them. The scheduler currently does not account for this PCIe bottleneck or the explicit memory copy operations required for discrete GPUs.

  2. Virtual Memory & OS Paging Differences
    The scheduler heavily exploits Apple Silicon's specific memory-mapping behaviors. The code explicitly notes that "on macOS unified memory, mmap pages are lazily committed on first access... [this] avoids Metal OOM". Linux handles overcommit, lazy allocation, and swapping completely differently. If you map a 40GB model into a 32GB RAM system on Linux without tuning vm.overcommit_memory or adding madvise hints, the Linux OOM killer is likely to intervene far sooner, and less gracefully, than macOS's memory compressor and pager do.

  3. llama.cpp Backend Customization
    Hypura uses a custom C shim (hypura_buft.c) that registers a custom GGML buffer type to intercept layer loading and stop the Metal backend from eagerly mapping everything. You will need to write an equivalent interception layer for the ggml-cuda or ggml-sycl backends, which handle tensor offloading and memory buffers quite differently from ggml-metal.

Summary

To get this working, your first steps are purely compiling the C/C++ bindings for CUDA/Linux instead of Metal/macOS. However, the true challenge is architectural: you will have to rewrite the scheduler to account for PCIe transfer costs and discrete VRAM budgets, as the current logic assumes RAM and GPU memory are the same pool.


Targeting an AMD Strix-Halo APU changes the calculus entirely and makes this port significantly more feasible.

Strix-Halo (with its massive RDNA 3.5 iGPU and wide LPDDR5x memory bus) mirrors the Apple Silicon architecture in the most critical way: it has true Unified Memory.

By targeting this specific hardware, the biggest hurdle of the previous port—the PCIe bus bottleneck and the need to manually copy memory from System RAM to Discrete VRAM—completely disappears. Your RAM is your VRAM.

Here is how your porting strategy and challenges would shift for a Linux + AMD Strix-Halo target:

What Gets Solved (The Big Wins)

The Core Scheduler Logic Survives: Hypura’s mathematical placement optimization assumes that if a tensor is in RAM, the GPU can compute it directly at full speed without a transfer penalty. On Strix-Halo, this holds true. You won't need to rewrite the tiering logic to account for PCIe transfer times.
Zero-Copy Architecture: You can still stream weights from NVMe into RAM and have the RDNA 3.5 GPU execute them in place, exactly as the project currently does with Metal.

Where You Would Start Now

Build System (ROCm / HIP): Instead of porting to CUDA, you will configure hypura-sys/build.rs to build llama.cpp with the ROCm/HIP backend (GGML_HIP=ON).
Direct I/O for Linux: You still have to replace the macOS-specific F_NOCACHE in the C shim with Linux's O_DIRECT flag when opening file descriptors to read from NVMe, ensuring you bypass the Linux page cache.
Hardware Profiler Updates: You will need to update src/profiler/gpu.rs to query the AMD SMI (System Management Interface) or ROCm APIs to detect the RDNA 3.5 GPU, its clock speeds, and the unified LPDDR5x bandwidth.
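As a starting point for that profiler work, on Linux the amdgpu driver exposes its memory pools through sysfs, so a first pass could avoid linking ROCm libraries entirely. A minimal sketch (read_mem_info is a hypothetical helper; the sysfs node names come from the amdgpu driver, and on an APU the GTT pool, i.e. system RAM visible to the GPU, is the figure that matters rather than a VRAM carve-out):

```c
// Sketch (Linux + amdgpu driver assumed): read a GPU memory pool size
// from sysfs instead of querying Metal.
// Typical nodes: /sys/class/drm/card0/device/mem_info_vram_total
//                /sys/class/drm/card0/device/mem_info_gtt_total
#include <stdio.h>

// Returns the counter's value in bytes, or 0 if the node is unreadable.
unsigned long long read_mem_info(const char *sysfs_path) {
    FILE *f = fopen(sysfs_path, "r");
    if (!f) return 0;
    unsigned long long v = 0;
    if (fscanf(f, "%llu", &v) != 1) v = 0;
    fclose(f);
    return v;
}
```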

The New & Remaining Challenges

While the hardware matches the paradigm, the software stack (ROCm + Linux) introduces its own friction:

  1. Convincing ROCm to use Zero-Copy / Unified Memory
    llama.cpp’s ROCm backend (ggml-hip.cpp) is primarily written assuming discrete AMD GPUs (like the RX 7900 XTX). When a buffer is created, it usually expects to allocate VRAM on the card. To make Hypura work, you must ensure that your custom buffer C-shim (hypura_buft.c) uses ROCm's unified memory allocators (e.g., hipHostMalloc or mapping standard mmap memory to the GPU via hipHostRegister). If you don't intercept this correctly, the ROCm driver might try to unnecessarily copy data from your mmap'd system RAM into a dedicated GPU carve-out, defeating the purpose of the APU.
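A rough sketch of what that interception might look like (register_weights_for_gpu and its role in hypura_buft.c are hypothetical; hipHostRegister, hipHostGetDevicePointer, and hipHostUnregister are real HIP runtime API calls, but whether pinning a file-backed mmap region behaves well on a given ROCm version is something to verify on hardware):

```c
// Hypothetical buffer-allocation path for an APU: map the existing
// mmap'd weight pages into the GPU's address space instead of
// hipMalloc'ing a separate buffer (which would force a copy).
#include <hip/hip_runtime_api.h>

// Returns a device-visible pointer aliasing `host_ptr`, or NULL on failure.
void *register_weights_for_gpu(void *host_ptr, size_t len) {
    if (hipHostRegister(host_ptr, len, hipHostRegisterMapped) != hipSuccess)
        return NULL;
    void *dev_ptr = NULL;
    if (hipHostGetDevicePointer(&dev_ptr, host_ptr, 0) != hipSuccess) {
        hipHostUnregister(host_ptr);
        return NULL;
    }
    // On an APU both pointers resolve to the same physical LPDDR5X pages,
    // so no hipMemcpy is needed before kernels consume the weights.
    return dev_ptr;
}
```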

  2. Linux Virtual Memory vs. macOS Memory Pressure
    Hypura exploits a specific quirk of macOS: lazy mmap commits. On Apple Silicon, Metal lets you reserve a massive virtual address space without crashing the GPU, only consuming physical RAM when pages are actually read. Linux handles mmap overcommit differently (and tunably, via vm.overcommit_memory), so you will need to manage page eviction yourself. Hypura currently uses MADV_FREE to release memory back to macOS. Linux also supports MADV_FREE (kernel 4.5+), but it merely marks pages reclaimable and defers the actual release until memory pressure, so you will likely want MADV_DONTNEED instead to ensure the kernel immediately drops the pages and frees physical RAM for the next block of streamed NVMe weights.
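The MADV_DONTNEED behavior described above is easy to demonstrate: after the call, the next read of a private anonymous page returns zeros, because the kernel dropped the old page immediately (a self-contained demo on Linux; the function name is just for illustration):

```c
// Demo: MADV_DONTNEED drops committed pages right away, whereas
// MADV_FREE would only allow reclaim later, under memory pressure.
#define _DEFAULT_SOURCE
#include <string.h>
#include <sys/mman.h>

// Returns 1 if the page reads back as zeros after MADV_DONTNEED,
// 0 if not, -1 on mmap failure.
int madv_dontneed_drops_pages(void) {
    size_t len = 4096;
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return -1;
    memset(p, 0xAB, len);            // commit a physical page
    madvise(p, len, MADV_DONTNEED);  // drop it immediately
    int dropped = (p[0] == 0);       // next touch gets a fresh zero page
    munmap(p, len);
    return dropped;
}
```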

  3. Memory Bandwidth Calibration
    While Strix Halo is fast for x86 (roughly 256 GB/s of theoretical LPDDR5X-8000 bandwidth on its 256-bit bus, with sustained real-world throughput somewhat lower), it is still slower than the M-series Max chips (~400 GB/s) and Ultra chips (~800 GB/s) that Hypura targets. You will need to run the hypura profile command to let the scheduler recalculate the optimal prefetch lookahead depth, as the slower RAM bandwidth changes the ratio between compute time and NVMe I/O time.
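Some back-of-envelope arithmetic shows which way that ratio moves (all figures here are illustrative assumptions, and the model that decode is purely memory-bandwidth-bound is a simplification): if reading an expert's weights from RAM takes size/mem_bandwidth and fetching it from NVMe takes size/nvme_bandwidth, the number of experts that must be in flight to hide I/O behind compute is their ratio, which reduces to mem_bandwidth/nvme_bandwidth regardless of expert size.

```c
// Back-of-envelope prefetch depth: how many experts must be streaming
// ahead so NVMe I/O hides behind memory-bound compute.
int lookahead_depth(double expert_gb, double mem_gbps, double nvme_gbps) {
    double compute_s = expert_gb / mem_gbps;  // time to read weights from RAM
    double fetch_s   = expert_gb / nvme_gbps; // time to stream them from NVMe
    double r = fetch_s / compute_s;           // expert size cancels out
    int d = (int)r;
    if ((double)d < r) d++;                   // ceil without <math.h>
    return d;
}
```

With a 7 GB/s NVMe drive, ~400 GB/s RAM needs a lookahead of 58 experts while ~250 GB/s RAM needs only 36: counterintuitively, the slower RAM shrinks the required lookahead, because each expert takes longer to compute.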
