Changes from all commits · 18 commits
158 changes: 103 additions & 55 deletions CLAUDE.md
@@ -1,52 +1,108 @@
# CLAUDE.md
# go-rocm — AMD ROCm GPU Inference

## What This Is

AMD ROCm GPU inference for Linux. Module: `forge.lthn.ai/core/go-rocm`
AMD ROCm GPU inference for Linux via a managed `llama-server` subprocess. Module: `dappco.re/go/rocm`.

Implements `inference.Backend` and `inference.TextModel` (from `core/go-inference`) using llama.cpp compiled with HIP/ROCm. Targets AMD RDNA 3+ GPUs.
Implements `inference.Backend` and `inference.TextModel` (from `core/go-inference`) using llama.cpp compiled with `-DGGML_HIP=ON`. Targets AMD RDNA 2+ GPUs (tested on Radeon RX 7800 XT, gfx1100).

## Target Hardware
Sibling to `go-mlx` (Metal on macOS). Both expose the same interface; users select at runtime based on `Available()`.

- **GPU**: AMD Radeon RX 7800 XT (gfx1100, RDNA 3, 16 GB VRAM) — confirmed gfx1100, not gfx1101
- **OS**: Ubuntu 24.04 LTS (linux/amd64)
- **ROCm**: 7.2.0 installed
- **Kernel**: 6.17.0
## Key Facts

## Commands
- **Subprocess model:** llama-server runs as an isolated process and communicates via HTTP/SSE
- **GGUF parser:** Reads model metadata (v2/v3) without loading tensors — enables fast discovery
- **VRAM monitoring:** sysfs-based (no ROCm runtime library dependency)
- **iGPU masking:** `HIP_VISIBLE_DEVICES=0` hardcoded — Ryzen 9 iGPU crashes llama-server if exposed
- **Auto-register:** `init()` registers the backend via `inference.Register()` on linux && amd64 (see the sketch below)
- **Platform stubs:** Exports no-op funcs on non-Linux/amd64 to avoid build failures
- **Error wrapping:** All errors use `coreerr.E(scope, msg, cause)` from `go-log`
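
A minimal sketch of the auto-register hook. Hedged: the exact signature of `inference.Register()` and the contents of the real `register_rocm.go` are assumptions here.

```go
//go:build linux && amd64

package rocm

import "dappco.re/go/inference"

// init makes the "rocm" backend selectable at runtime; callers only need a
// blank import of dappco.re/go/rocm. Register's signature is assumed.
func init() {
	inference.Register(&rocmBackend{})
}
```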

```bash
go test ./... # Unit tests (no GPU required)
go test -tags rocm ./... # Integration tests + benchmarks (GPU required)
go test -tags rocm -v -run TestROCm ./... # Full GPU tests only
go test -tags rocm -bench=. -benchtime=3x ./... # Benchmarks
```
## Hardware & OS

## Architecture
| Component | Value |
|-----------|-------|
| GPU | Radeon RX 7800 XT (gfx1100, RDNA 3, 16 GB) |
| CPU | Ryzen 9 9950X |
| OS | Ubuntu 24.04 LTS |
| ROCm | 7.2.0 |
| Kernel | 6.17.0 |

See `docs/architecture.md` for full detail.
## Architecture

```
go-rocm/
├── backend.go inference.Backend (linux && amd64)
├── model.go inference.TextModel (linux && amd64)
├── server.go llama-server subprocess lifecycle
├── vram.go VRAM monitoring via sysfs
├── discover.go GGUF model discovery
├── register_rocm.go auto-registers via init() (linux && amd64)
├── rocm_stub.go stubs for non-linux/non-amd64
└── internal/
├── llamacpp/ llama-server HTTP client + health check
└── gguf/ GGUF v2/v3 binary metadata parser
dappco.re/go/rocm/
├── Public:
│   ├── rocm.go             [VRAMInfo, ModelInfo types]
│   ├── discover.go         [DiscoverModels(dir) -> []ModelInfo]
│   └── register_rocm.go    [init() register]
├── Backend/Model (linux && amd64):
│   ├── backend.go          [rocmBackend impl]
│   ├── model.go            [rocmModel impl, metrics, streaming]
│   ├── server.go           [subprocess lifecycle, port mgmt]
│   ├── vram.go             [GetVRAMInfo() via sysfs]
│   └── rocm_stub.go        [stubs for other platforms]
└── Internal:
    ├── internal/gguf/
    │   └── gguf.go         [GGUF v2/v3 binary header parser]
    └── internal/llamacpp/
        ├── client.go       [HTTP client, Complete, ChatComplete]
        └── health.go       [/health endpoint polling]
```
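
A caller-side sketch of the public surface listed above. Hedged: the tree only states `DiscoverModels(dir) -> []ModelInfo` and `GetVRAMInfo()`, so the error returns, the directory path, and the printed fields below are assumptions.

```go
package main

import (
	"fmt"

	"dappco.re/go/rocm"
)

func main() {
	// Metadata-only GGUF discovery (no tensors loaded); directory and the
	// (models, err) return shape are illustrative assumptions.
	models, err := rocm.DiscoverModels("/srv/models")
	if err != nil {
		panic(err)
	}
	fmt.Println("found", len(models), "GGUF models")

	// sysfs VRAM snapshot (linux/amd64 only); return shape assumed.
	vram, err := rocm.GetVRAMInfo()
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", vram)
}
```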

## Critical: iGPU Crash
## Critical Rules

1. **iGPU always masked:** `serverEnv()` enforces `HIP_VISIBLE_DEVICES=0`. This is non-negotiable; do not accept a config or env-var override (see the sketch after this list).

2. **Platform-specific:** Build tags `linux && amd64` for GPU code. Stubs on other platforms prevent build errors.

3. **Subprocess isolation:** llama-server is not trusted. It runs with default permissions and a minimal environment, and is killed automatically on exit.

The Ryzen 9 9950X iGPU appears as ROCm Device 1. llama-server crashes trying to split tensors across it. `serverEnv()` always sets `HIP_VISIBLE_DEVICES=0`. Do not remove or weaken this.
4. **Error scope:** All errors use `coreerr.E()`. No `fmt.Errorf`, no `errors.New`, no `log` package.

## Building llama-server with ROCm
5. **Banned imports:** `fmt`, `log`, `errors`, `os/exec`; use their core.* equivalents. (Exception: `os` is used directly for file/env ops; `go-io`'s transitive deps are too heavy for a GPU inference module.)

6. **Metrics best-effort:** VRAM stats are read non-atomically from sysfs; under heavy churn, transient gaps are expected. Recording is not real-time.
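
A sketch of what rule 1 implies for `server.go`. Hypothetical: the real `serverEnv()` may assemble the environment differently, but it always pins `HIP_VISIBLE_DEVICES=0`.

```go
package rocm

import "os"

// serverEnv is a hypothetical sketch only; see server.go for the real code.
func serverEnv() []string {
	return []string{
		// Mask the Ryzen iGPU (ROCm device 1): llama-server crashes when it
		// tries to split tensors across it. Never configurable.
		"HIP_VISIBLE_DEVICES=0",
		// Minimal environment for the untrusted subprocess (rule 3).
		"PATH=" + os.Getenv("PATH"),
	}
}
```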

## Spec Index

See `/sessions/vibrant-sharp-fermat/mnt/plans/code/core/go/rocm/RFC.md`:

- **§1–2:** Overview & package layout
- **§3:** Type definitions (VRAMInfo, ModelInfo, rocmBackend, rocmModel, server)
- **§4:** Inference pipeline (Load, Generate, Chat, metrics)
- **§5:** GGUF parser internals
- **§6:** llama-server HTTP bridge
- **§7–9:** VRAM discovery, model discovery, platform support
- **§10–16:** Error handling, config, quantisation, design notes, cross-refs

## Working Commands

```bash
# Unit tests (no GPU required)
go test ./...

# Integration tests + benchmarks (GPU required, gfx1100)
go test -tags rocm ./...

# Full GPU tests only
go test -tags rocm -v -run TestROCm ./...

# Benchmarks
go test -tags rocm -bench=. -benchtime=3x ./...

# Format
go fmt ./...
```

## Building llama-server

```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build \
-DGGML_HIP=ON \
-DAMDGPU_TARGETS=gfx1100 \
@@ -56,34 +112,26 @@ cmake --build build --parallel $(nproc) -t llama-server
sudo cp build/bin/llama-server /usr/local/bin/llama-server
```

## Environment Variables
## Coordination

| Variable | Default | Purpose |
|----------|---------|---------|
| `ROCM_LLAMA_SERVER_PATH` | PATH lookup | Path to llama-server binary |
| `HIP_VISIBLE_DEVICES` | overridden to `0` | Always forced to 0 — do not rely on ambient value |
- **Virgil** (forge.lthn.ai/core) — orchestrator, task writer, PR reviewer
- **go-mlx** — sibling Metal backend (same interface contract)
- **go-inference** — shared TextModel/Backend interface definitions
- **go-ml** — scoring engine wrapping both backends
- **LEM training** — uses go-rocm for model eval on Charon homelab

## Coding Standards
## Test Naming

- UK English
- Tests: testify assert/require
- Build tags: `linux && amd64` for GPU code, `rocm` for integration tests
- Errors: `coreerr.E("pkg.Func", "what failed", err)` via `go-log`, never `fmt.Errorf` or `errors.New`
- File I/O: `os` package used directly — `go-io` not imported (its transitive deps are too heavy for a GPU inference module)
- Conventional commits
- Co-Author: `Co-Authored-By: Virgil <virgil@lethean.io>`
- Licence: EUPL-1.2
Format: `TestFilename_Function_{Good,Bad,Ugly}` — all three categories mandatory.

## Coordination
Example: `TestModel_Generate_Good`, `TestModel_Generate_Bad`, `TestModel_Generate_Ugly`.
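
A sketch of the pattern against `resolveContextLength` from `backend.go`, using testify per the existing standards. Hedged: the test file's package name, build tags, and the `gguf.Metadata` literal are assumptions.

```go
//go:build linux && amd64

package rocm

import (
	"testing"

	"github.com/stretchr/testify/assert"

	"dappco.re/go/rocm/internal/gguf"
)

// Good: an explicit request wins over the metadata value.
func TestBackend_ResolveContextLength_Good(t *testing.T) {
	assert.Equal(t, 8192, resolveContextLength(8192, gguf.Metadata{ContextLength: 131072}))
}

// Bad: no request and no metadata falls back to the 4096 cap.
func TestBackend_ResolveContextLength_Bad(t *testing.T) {
	assert.Equal(t, 4096, resolveContextLength(0, gguf.Metadata{}))
}

// Ugly: a huge native context is clamped to the cap to protect VRAM.
func TestBackend_ResolveContextLength_Ugly(t *testing.T) {
	assert.Equal(t, 4096, resolveContextLength(0, gguf.Metadata{ContextLength: 131072}))
}
```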

- **Virgil** (core/go) is the orchestrator — writes tasks and reviews PRs
- **go-mlx** is the sibling — Metal backend on macOS, same interface contract
- **go-inference** defines the shared TextModel/Backend interfaces both backends implement
- **go-ml** wraps both backends into the scoring engine
## Commit Style

## Documentation
```
type(scope): description

Co-Authored-By: Virgil <virgil@lethean.io>
```

- `docs/architecture.md` — component design, data flow, interface contracts
- `docs/development.md` — prerequisites, test commands, benchmarks, coding standards
- `docs/history.md` — completed phases, commit hashes, known limitations
- `docs/plans/` — phase design documents (read-only reference)
Example: `feat(rocm): add VRAM monitoring via sysfs`
4 changes: 2 additions & 2 deletions README.md
@@ -2,7 +2,7 @@

AMD ROCm GPU inference for Linux via a managed llama-server subprocess. Implements the `inference.Backend` and `inference.TextModel` interfaces from go-inference for AMD RDNA 3+ GPUs (validated on RX 7800 XT with ROCm 7.2). Uses llama-server's OpenAI-compatible streaming API rather than direct HIP CGO bindings, giving access to 50+ GGUF model architectures with GPU crash isolation. Includes a GGUF v2/v3 binary metadata parser, sysfs VRAM monitoring, and model discovery. Platform-restricted: `linux/amd64` only; a safe stub compiles everywhere else.

**Module**: `forge.lthn.ai/core/go-rocm`
**Module**: `dappco.re/go/rocm`
**Licence**: EUPL-1.2
**Language**: Go 1.25

@@ -11,7 +11,7 @@ AMD ROCm GPU inference for Linux via a managed llama-server subprocess. Implemen
```go
import (
"forge.lthn.ai/core/go-inference"
_ "forge.lthn.ai/core/go-rocm" // registers "rocm" backend via init()
_ "dappco.re/go/rocm" // registers "rocm" backend via init()
)

// Requires llama-server compiled with HIP/ROCm on PATH
115 changes: 66 additions & 49 deletions backend.go
@@ -6,14 +6,16 @@ import (
"os"
"strings"

coreerr "forge.lthn.ai/core/go-log"
"forge.lthn.ai/core/go-inference"
"forge.lthn.ai/core/go-rocm/internal/gguf"
"dappco.re/go/inference"
coreerr "dappco.re/go/log"
"dappco.re/go/rocm/internal/gguf"
)

// rocmBackend implements inference.Backend for AMD ROCm GPUs.
type rocmBackend struct{}

const defaultContextLengthCap = 4096

func (b *rocmBackend) Name() string { return "rocm" }

// Available reports whether ROCm GPU inference can run on this machine.
@@ -30,68 +32,83 @@

// LoadModel loads a GGUF model onto the AMD GPU via llama-server.
// Model architecture is read from GGUF metadata (replacing filename-based guessing).
// If no context length is specified, defaults to min(model_context_length, 4096)
// to prevent VRAM exhaustion on models with 128K+ native context.
// If no context length is specified, defaults to min(model_context_length,
// 4096). When metadata omits the native context, it falls back to 4096 to
// keep the load path on the safe side of VRAM usage.
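// For example (illustrative values): requested 8192 returns 8192; requested 0
// with a 131072-token native context returns 4096; requested 0 with no
// metadata returns 4096.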
func (b *rocmBackend) LoadModel(path string, opts ...inference.LoadOption) (inference.TextModel, error) {
cfg := inference.ApplyLoadOpts(opts)
loadConfig := inference.ApplyLoadOpts(opts)

binary, err := findLlamaServer()
if err != nil {
return nil, err
}

meta, err := gguf.ReadMetadata(path)
metadata, err := gguf.ReadMetadata(path)
if err != nil {
return nil, coreerr.E("rocm.LoadModel", "read model metadata", err)
}

ctxLen := cfg.ContextLen
if ctxLen == 0 && meta.ContextLength > 0 {
ctxLen = int(min(meta.ContextLength, 4096))
}
contextLength := resolveContextLength(loadConfig.ContextLen, metadata)

srv, err := startServer(binary, path, cfg.GPULayers, ctxLen, cfg.ParallelSlots)
modelServer, err := startServer(serverStartConfig{
BinaryPath: binary,
ModelPath: path,
GPULayerCount: loadConfig.GPULayers,
ContextSize: contextLength,
ParallelSlotCount: loadConfig.ParallelSlots,
})
if err != nil {
return nil, err
}

// Map quantisation file type to bit width.
quantBits := 0
quantGroup := 0
ftName := gguf.FileTypeName(meta.FileType)
switch {
case strings.HasPrefix(ftName, "Q4_"):
quantBits = 4
quantGroup = 32
case strings.HasPrefix(ftName, "Q5_"):
quantBits = 5
quantGroup = 32
case strings.HasPrefix(ftName, "Q8_"):
quantBits = 8
quantGroup = 32
case strings.HasPrefix(ftName, "Q2_"):
quantBits = 2
quantGroup = 16
case strings.HasPrefix(ftName, "Q3_"):
quantBits = 3
quantGroup = 32
case strings.HasPrefix(ftName, "Q6_"):
quantBits = 6
quantGroup = 64
case ftName == "F16":
quantBits = 16
case ftName == "F32":
quantBits = 32
}

return &rocmModel{
srv: srv,
modelType: meta.Architecture,
modelInfo: inference.ModelInfo{
Architecture: meta.Architecture,
NumLayers: int(meta.BlockCount),
QuantBits: quantBits,
QuantGroup: quantGroup,
},
server: modelServer,
modelType: metadata.Architecture,
modelInfo: modelInfoFromMetadata(metadata),
}, nil
}

func resolveContextLength(requestedContextLength int, metadata gguf.Metadata) int {
if requestedContextLength > 0 {
return requestedContextLength
}
if metadata.ContextLength == 0 {
return defaultContextLengthCap
}
return min(int(metadata.ContextLength), defaultContextLengthCap)
}

func modelInfoFromMetadata(metadata gguf.Metadata) inference.ModelInfo {
quantBits, quantGroup := quantisationFromFileType(metadata.FileType)
return inference.ModelInfo{
Architecture: metadata.Architecture,
NumLayers: int(metadata.BlockCount),
QuantBits: quantBits,
QuantGroup: quantGroup,
}
}

func quantisationFromFileType(fileType uint32) (bits, groupSize int) {
fileTypeName := gguf.FileTypeName(fileType)

switch {
case strings.HasPrefix(fileTypeName, "Q4_"):
return 4, 32
case strings.HasPrefix(fileTypeName, "Q5_"):
return 5, 32
case strings.HasPrefix(fileTypeName, "Q8_"):
return 8, 32
case strings.HasPrefix(fileTypeName, "Q2_"):
return 2, 16
case strings.HasPrefix(fileTypeName, "Q3_"):
return 3, 32
case strings.HasPrefix(fileTypeName, "Q6_"):
return 6, 64
case fileTypeName == "F16":
return 16, 0
case fileTypeName == "F32":
return 32, 0
default:
return 0, 0
}
}