A comprehensive reference guide for running LLMs in Docker containers with AMD ROCm GPU acceleration.
- ROCM_DOCKER_GUIDE.md - Complete guide covering:
  - ROCm Docker architecture and requirements
  - Official AMD images and documentation links
  - GPU device mapping methods (CDI, runtime, manual)
  - Framework implementations (llama.cpp, vLLM, Ollama)
  - Example Dockerfiles and Docker Compose configurations
  - Best practices for production deployments
  - Troubleshooting common issues
- Docker Compose Examples - Ready-to-use configurations for different use cases
The host system requires only:
- AMD GPU kernel driver (`amdgpu-dkms`)
- Docker 25.0+
- User in the `video` and `render` groups

No ROCm installation is needed on the host - all ROCm libraries run inside the containers.
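The host checklist above can be sanity-checked with a short script. A minimal sketch (group names and device paths follow the requirements listed above; this only reports problems, it does not fix them):

```python
import os

# Requirements from the host checklist above.
REQUIRED_GROUPS = {"video", "render"}
DEVICE_NODES = ["/dev/kfd", "/dev/dri"]

def missing_groups(current_groups):
    """Return the required GPU-access groups the user is not a member of."""
    return REQUIRED_GROUPS - set(current_groups)

def missing_devices(nodes=DEVICE_NODES):
    """Return device nodes the amdgpu kernel driver has not created."""
    return [n for n in nodes if not os.path.exists(n)]

if __name__ == "__main__":
    import grp  # Unix-only: resolve numeric group IDs to names
    current = {grp.getgrgid(g).gr_name for g in os.getgroups()}
    print("Missing groups:", missing_groups(current) or "none")
    print("Missing devices:", missing_devices() or "none")
```

If either check reports something missing, see the Troubleshooting section below (`sudo usermod -a -G video,render $USER`, then log out and back in).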
All Docker Compose files support environment variable configuration. See .env.example for all available options.
Quick Start (using defaults):
Ollama (Easiest for development):
docker-compose -f docker-compose-ollama.yml up -d
docker exec -it ollama-dev ollama pull llama3.2
docker exec -it ollama-dev ollama run llama3.2

llama.cpp (General-purpose):
mkdir -p models # Place your .gguf models here
docker-compose -f docker-compose-llamacpp.yml up -d

vLLM (Production/high performance):
mkdir -p models hf-cache # Place your HuggingFace models here
docker-compose -f docker-compose-vllm.yml up -d

Custom Configuration:
Method 1: Using a .env file (Recommended)
1. Create your `.env` file from the example:
   cp .env.example .env
2. Edit `.env` with your settings:
   # Example for llama.cpp with local models
   MODELS_DIR=./models
   MODEL_PATH=/data/granite-4.0-h-tiny-Q4_K_M.gguf
   LLAMA_PORT=8000
   GPU_ID=0
3. Start the service (it automatically reads `.env`):
   docker-compose -f docker-compose-llamacpp.yml up -d
Method 2: Using environment variables
export MODELS_DIR="$HOME/.lmstudio/models"
export MODEL_PATH="/data/model-dir/model.gguf"
export GPU_ID=0
docker-compose -f docker-compose-llamacpp.yml up -d

Method 3: Inline variables
MODELS_DIR=./models MODEL_PATH=/data/my-model.gguf docker-compose -f docker-compose-llamacpp.yml up -d

Key environment variables:
- `MODELS_DIR` - Host directory containing models (default: `./models`)
- `MODEL_PATH` - Path to the model file inside the container (default: `/data/model.gguf`)
- `GPU_ID` - GPU device ID: `0`, `1`, or `"0,1"` for multi-GPU (default: `0`)
- `ROCM_ARCH` - GPU architecture, e.g. `gfx1100` for the RX 7900 XTX (default: `gfx1100`)
- `HF_TOKEN` - HuggingFace token for gated models (vLLM only)
See .env.example for all available configuration options and examples.
llama.cpp:
docker run -d \
--device=/dev/kfd --device=/dev/dri \
--group-add video \
-v ./models:/data \
-p 8000:8000 \
rocm/llama.cpp:llama.cpp-b5997_rocm6.4.0_ubuntu24.04_server \
-m /data/model.gguf --host 0.0.0.0

Ollama:
docker run -d \
--device=/dev/kfd --device=/dev/dri \
--group-add video \
-v ollama:/root/.ollama \
-p 11434:11434 \
ollama/ollama:rocm

vLLM:
docker run -d \
--device=/dev/kfd --device=/dev/dri \
--group-add video --security-opt seccomp=unconfined \
--ipc=host --shm-size 16G \
-v ./models:/models \
-p 8000:8000 \
rocm/vllm:rocm6.4.1_vllm_0.10.1_20250909 \
--model /models/llama-2-7b-hf --host 0.0.0.0

Self-Contained Design:
- Host: only the kernel driver (`amdgpu-dkms`)
- Container: all ROCm userspace libraries, runtimes, and frameworks
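The `--device` and `--group-add` flags used in the docker run examples above translate directly into Docker Compose. A sketch of the equivalent Compose service (the service name `llm` and the volume/port values are placeholders matching the llama.cpp example):

```yaml
services:
  llm:
    image: rocm/llama.cpp:llama.cpp-b5997_rocm6.4.0_ubuntu24.04_server
    devices:
      - /dev/kfd        # ROCm compute interface
      - /dev/dri        # GPU render nodes
    group_add:
      - video           # grants the container access to the GPU device nodes
    volumes:
      - ./models:/data
    ports:
      - "8000:8000"
```

This is the same manual device-mapping method described later in the GPU device mapping section; the bundled docker-compose files use this pattern.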
Recommended Versions (2025):
- ROCm 6.4.3: Most stable for production
- ROCm 7.0.1: Latest stable release
- Ubuntu 24.04 base images recommended
| Framework | Best For | Model Format | Ease of Use | Performance |
|---|---|---|---|---|
| llama.cpp | General use | GGUF | Medium | Good |
| Ollama | Development | GGUF (auto-convert) | Easy | Good |
| vLLM | Production API | HuggingFace | Medium | Excellent |
All backends provide OpenAI-compatible APIs, allowing you to use the official OpenAI SDK or any compatible client library with locally hosted models.
| Feature | llama.cpp | Ollama | vLLM | Notes |
|---|---|---|---|---|
| /v1/chat/completions | ✅ | ✅ | ✅ | Full support with streaming |
| /v1/completions | ✅ | ✅ | ✅ | Legacy text completion |
| /v1/embeddings | ✅ | ✅ | ✅ | Text embeddings |
| Tool/Function Calling | ✅ * | ✅ | ✅ | *Requires --jinja flag |
| Streaming | ✅ | ✅ | ✅ | Server-sent events |
| JSON Mode | ✅ | ✅ | ✅ | Structured output |
* llama.cpp requires --jinja flag - already configured in docker-compose-llamacpp.yml
Python Example:
from openai import OpenAI
# llama.cpp
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="not-needed"
)
# Ollama
# client = OpenAI(
# base_url="http://localhost:11434/v1",
# api_key="not-needed"
# )
# Chat
response = client.chat.completions.create(
model="gpt-3.5-turbo",
messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
# Tool calling (requires compatible model like Llama 3.1+)
tools = [{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string"}
},
"required": ["location"]
}
}
}]
response = client.chat.completions.create(
model="gpt-3.5-turbo",
messages=[{"role": "user", "content": "What's the weather in Paris?"}],
tools=tools
)

llama.cpp:
- ✅ `--jinja` flag enabled by default in our docker-compose
- ✅ Use a compatible model (Llama 3.1+, Mistral Nemo, Qwen 2.5, Hermes)
- ✅ Q4_K_M or higher quantization recommended
Ollama:
- ✅ No setup required - works automatically
- ✅ Pull a compatible model:
docker exec -it ollama-dev ollama pull llama3.1
Recommended Models for Tool Calling:
- Llama 3.1 / 3.3 (8B, 70B) - Best overall choice
- Mistral Nemo (12B) - Excellent performance
- Qwen 2.5 (7B, 14B, 32B) - Great multilingual support
- Hermes 2/3 - Function calling focused
Not Compatible:
- ❌ Granite models (no tool support)
- ❌ Llama 3.2 1B/3B (no tool support)
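The Python example earlier only sends the tool schema; executing the tool the model requests is the client's job. A minimal sketch of that step, continuing the `get_weather` example (the stub implementation and `TOOLS` registry here are stand-ins, not part of any SDK):

```python
import json

# Stand-in implementation of the get_weather tool declared in the schema above.
def get_weather(location: str) -> str:
    return f"Sunny in {location}"  # hypothetical; replace with a real lookup

TOOLS = {"get_weather": get_weather}

def handle_tool_calls(message):
    """Run each tool the model requested and build the 'tool' role
    messages to send back in the follow-up chat request."""
    results = []
    for call in (message.tool_calls or []):
        fn = TOOLS[call.function.name]
        args = json.loads(call.function.arguments)  # arguments arrive as a JSON string
        results.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": fn(**args),
        })
    return results
```

Append these messages (after the assistant message that contained the tool calls) and call `client.chat.completions.create` again so the model can produce its final answer.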
📖 For complete documentation, examples, and troubleshooting, see OPENAI_API_GUIDE.md
CDI (requires AMD Container Toolkit):
docker run --device amd.com/gpu=all <image>

Runtime:
docker run --runtime=amd -e AMD_VISIBLE_DEVICES=0 <image>

Manual device mapping:
docker run --device=/dev/kfd --device=/dev/dri --group-add video <image>

See the llama-tests/ directory for test scripts:
- test-llama.sh - Basic functionality and API validation
- load-test-llama.sh - Load testing and performance benchmarks
- concurrency-test-llama.sh - Parallel vs sequential execution analysis
Quick test:
cd llama-tests
./test-llama.sh

GPU not detected: Verify the kernel driver is loaded (lsmod | grep amdgpu) and the device nodes exist (ls -la /dev/kfd /dev/dri/)
Permission denied: Add user to groups: sudo usermod -a -G video,render $USER (logout/login required)
Out of memory: Increase shared memory: --shm-size 16G
Slow performance: Verify GPU usage with rocm-smi and ensure GPU layers enabled (-ngl 99)
IBM Granite 4.0 H Tiny + vLLM on ROCm:
IBM Granite 4.0 H Tiny (and potentially other Granite models) is incompatible with vLLM on ROCm due to flash attention kernel failures. This is caused by:
- Non-power-of-2 attention block sizes: Granite uses 400-token blocks, incompatible with Triton kernels which require power-of-2 sizes
- Hybrid MoE architecture incompatibilities: The model's hybrid Mixture-of-Experts design triggers kernel compilation errors
- ROCm flash attention limitations: Both the Triton and ROCm Flash Attention backends fail with `HIP Function Failed - invalid device function`
Attempted workarounds (all failed):
- `--enforce-eager` flag
- `VLLM_USE_TRITON_FLASH_ATTN=0`
- `VLLM_ATTENTION_BACKEND=TORCH_SDPA` (not supported on AMD)
- `VLLM_ATTENTION_BACKEND=ROCM_FLASH`
- `VLLM_USE_V1=0` (V0 engine)
Solution: Use llama.cpp for Granite models (works perfectly with GGUF format), and use vLLM with tested-compatible models like TinyLlama, Llama-2, Llama-3, Mistral, or Qwen.
See ROCM_DOCKER_GUIDE.md for detailed troubleshooting.
See LICENSE file.
Last Updated: October 2025