Tutorial: Run Claude Code with Local Models Using llama.cpp

Save 100% on AI costs and run completely offline

This step-by-step tutorial shows you how to set up llama.cpp with DevAIFlow to run Claude Code using free local models on your machine. Perfect for:

💰 Developers wanting zero-cost AI assistance
🔒 Teams requiring complete data privacy
✈️ Working offline without internet access
🧪 Experimenting with different coding models

What You'll Get

By the end of this tutorial, you'll have:

✅ llama.cpp server running locally
✅ Qwen3-Coder 25B model (or your choice)
✅ DevAIFlow configured to use local models
✅ Full Claude Code IDE integration working offline

Time Required: 15-20 minutes Cost: FREE (forever) Difficulty: Intermediate

Prerequisites

Hardware Requirements

Minimum:

16GB RAM (for 14B parameter models)
50GB free disk space
macOS with Apple Silicon OR Linux with NVIDIA GPU

Recommended:

32GB+ RAM (for 25B+ parameter models)
100GB free disk space for multiple models
Fast SSD for better performance

Software Requirements

Git - For cloning llama.cpp repository
CMake - For building llama.cpp
DevAIFlow - Already installed (pip install devaiflow)
Claude Code CLI - Version 2.1.3 or higher

Install dependencies:

# macOS
brew install cmake git

# Ubuntu/Debian
sudo apt-get update && sudo apt-get install build-essential cmake git -y

# Fedora/RHEL
sudo dnf install gcc gcc-c++ cmake git -y

Step 1: Build llama.cpp (One-Time Setup)

For macOS (Apple Silicon)

# Clone the repository
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# Build with Metal support (GPU acceleration)
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j

# Verify build succeeded
ls build/bin/llama-server  # Should exist

Compilation time: 3-5 minutes on modern Macs

For Linux (NVIDIA GPU)

# Clone the repository
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# Build with CUDA support
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Verify build succeeded
ls build/bin/llama-server  # Should exist

Compilation time: 5-10 minutes depending on hardware

For Linux (CPU Only)

# Clone and build without GPU acceleration
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

cmake -B build
cmake --build build --config Release -j

ls build/bin/llama-server  # Verify

Note: CPU-only mode is much slower but works without a GPU.

Step 2: Choose and Test a Model

Recommended Models for Coding

Choose based on your available RAM:

For 32GB+ RAM (Best Quality):

# Qwen3-Coder 25B (Excellent for coding, our recommendation)
MODEL="bartowski/cerebras_Qwen3-Coder-REAP-25B-A3B-GGUF:Q4_K_M"
ALIAS="Qwen3-Coder"

For 16-32GB RAM (Good Balance):

# DeepSeek-Coder V2 16B (Fast and capable)
MODEL="bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF:Q5_K_M"
ALIAS="DeepSeek-Coder"

# OR Qwen2.5-Coder 14B
MODEL="bartowski/Qwen2.5-Coder-14B-Instruct-GGUF:Q4_K_M"
ALIAS="Qwen2.5-Coder"

For 16GB RAM (Minimum):

# Qwen2.5-Coder 7B (Still quite capable)
MODEL="bartowski/Qwen2.5-Coder-7B-Instruct-GGUF:Q4_K_M"
ALIAS="Qwen2.5-7B"

Understanding Model Naming

Example: bartowski/Qwen2.5-Coder-14B-Instruct-GGUF:Q4_K_M

bartowski - HuggingFace user who quantized the model
Qwen2.5-Coder-14B-Instruct - Base model name (14 billion parameters)
GGUF - File format compatible with llama.cpp
Q4_K_M - Quantization level (Q4 = 4-bit, medium quality)

Quantization levels explained:

Q4_K_M - Best balance (recommended for most users)
Q5_K_M - Higher quality, more RAM required
Q6_K - Near-original quality, 50% more RAM
Q8_0 - Highest quality, double RAM

Step 3: Start the llama.cpp Server

Basic Server Command

cd llama.cpp

# Start server (replace MODEL and ALIAS with your choice from Step 2)
./build/bin/llama-server \
    -hf $MODEL \
    --alias "$ALIAS" \
    --port 8000 \
    --jinja \
    --ctx-size 64000

Full Server Command (Optimized)

For better performance with Claude Code:

./build/bin/llama-server \
    -hf $MODEL \
    --alias "$ALIAS" \
    --port 8000 \
    --jinja \
    --kv-unified \
    --cache-type-k q8_0 --cache-type-v q8_0 \
    --flash-attn on \
    --batch-size 4096 --ubatch-size 1024 \
    --ctx-size 64000 \
    --n-gpu-layers 99  # macOS/NVIDIA only, remove for CPU

Important flags explained:

Flag	Purpose	Required?
`--jinja`	Enables tool calling support	YES - CRITICAL
`-hf`	Download model from HuggingFace	Recommended
`--alias`	Model name for Claude Code	YES
`--ctx-size 64000`	Large context for Claude's tool definitions	YES
`--port 8000`	Port to listen on	Customizable
`--batch-size 4096`	Processing batch size	Performance tuning
`--flash-attn on`	Flash attention (faster)	macOS only

First Run: Model Download

First time running with -hf flag:

Loading model from HuggingFace...
Downloading bartowski/cerebras_Qwen3-Coder-REAP-25B-A3B-GGUF:Q4_K_M
Progress: [=====>    ] 45%

Download time: 5-15 minutes depending on model size and internet speed

7B model: ~4GB download
14B model: ~8GB download
25B model: ~15GB download

Models are cached in ~/.cache/huggingface/hub/ - subsequent starts are instant.

Verify Server is Running

Once you see:

llama server listening at http://127.0.0.1:8000

Test it:

# In a new terminal
curl http://localhost:8000/v1/models

Expected output:

{
  "data": [
    {
      "id": "Qwen3-Coder",
      "object": "model",
      ...
    }
  ]
}

✅ Success! Your llama.cpp server is running.

Keep this terminal window open - the server must stay running for Claude Code to use it.

Step 4: Configure DevAIFlow

You have three options:

Option A: Interactive TUI (Recommended for Beginners)

# Open configuration interface
daf config edit

# Navigate using Tab key:
# 1. Go to "Model Providers" tab
# 2. Click "Add Profile" button
# 3. Select "Custom Provider"
# 4. Fill in the form:
#    - Name: llama-cpp
#    - Base URL: http://localhost:8000
#    - Auth Token: llama-cpp
#    - API Key: (leave empty or enter empty string)
#    - Model Name: Qwen3-Coder  (match your --alias from Step 3)
# 5. Click "Set as Default" (optional)
# 6. Press Ctrl+S to save

Option B: Manual JSON Editing

Edit ~/.daf-sessions/config.json:

{
  "model_provider": {
    "default_profile": "llama-cpp",
    "profiles": {
      "llama-cpp": {
        "name": "llama-cpp",
        "base_url": "http://localhost:8000",
        "auth_token": "llama-cpp",
        "api_key": "",
        "model_name": "Qwen3-Coder"
      }
    }
  }
}

Important:

base_url must match your server port
model_name must match your --alias flag exactly (case-sensitive)
api_key can be empty string or omitted

Option C: Environment Variables (Temporary Testing)

# Set for current terminal session only
export ANTHROPIC_BASE_URL="http://localhost:8000"
export ANTHROPIC_AUTH_TOKEN="llama-cpp"
export ANTHROPIC_API_KEY=""

# Test it
daf open PROJ-123

This method doesn't persist across terminal sessions.

Step 5: Test Your Setup

Create a Test Session

# Create a test session
daf new --name llama-test --goal "Test local llama.cpp setup"

# Claude Code should launch automatically
# If it doesn't, the session was created - you can open it manually:
# daf open llama-test

First Prompt Test

In Claude Code, type: hi

Expected behavior:

First time: 30-60 second wait (normal!)
- Claude Code sends ~35,000 tokens of tool definitions
- llama.cpp processes them in batches
- Progress shows in terminal: prompt eval time = 45678.89 ms
Response appears: You should get a greeting message
Subsequent prompts: Much faster (context already loaded)

If you see a response, congratulations! 🎉 It's working!

Test File Operations

Try asking Claude Code to create a file:

Create a file called test.txt with the word "Hello from llama.cpp!" in it

Expected:

Claude Code uses the Edit or Write tool
File appears in your directory
You see confirmation message

If file operations work, you have full IDE integration! ✅

Step 6: Use with Real Projects

Now use it for actual work:

# Create session for real ticket
daf new PROJ-12345 --goal "Implement user authentication"

# Or sync existing tickets
daf sync

# Open any session - uses llama.cpp by default
daf open PROJ-12345

# Work normally in Claude Code
# All file operations, multi-file changes, tool calling work!

# Complete when done
daf complete PROJ-12345

Advanced: Multiple Models

Want different models for different tasks?

Set Up Multiple Servers

# Terminal 1: Large model for complex tasks (port 8000)
./build/bin/llama-server \
    -hf bartowski/cerebras_Qwen3-Coder-REAP-25B-A3B-GGUF:Q4_K_M \
    --alias "Qwen3-Coder" --port 8000 --jinja --ctx-size 64000

# Terminal 2: Small model for quick tasks (port 8001)
./build/bin/llama-server \
    -hf bartowski/Qwen2.5-Coder-7B-Instruct-GGUF:Q4_K_M \
    --alias "Qwen2.5-7B" --port 8001 --jinja --ctx-size 64000

Configure Multiple Profiles

Edit ~/.daf-sessions/config.json:

{
  "model_provider": {
    "default_profile": "llama-large",
    "profiles": {
      "llama-large": {
        "name": "llama-large",
        "base_url": "http://localhost:8000",
        "auth_token": "llama-cpp",
        "api_key": "",
        "model_name": "Qwen3-Coder"
      },
      "llama-fast": {
        "name": "llama-fast",
        "base_url": "http://localhost:8001",
        "auth_token": "llama-cpp",
        "api_key": "",
        "model_name": "Qwen2.5-7B"
      }
    }
  }
}

Switch Between Models

# Use large model for complex refactoring
daf open PROJ-123 --model-profile llama-large

# Use fast model for simple bug fix
daf open PROJ-456 --model-profile llama-fast

# Sessions remember their profile choice
daf open PROJ-123  # Uses llama-large automatically

Common Issues and Solutions

Issue: "Deliberating..." Hangs Forever

Solution 1: Wait longer

First prompt takes 30-60 seconds (normal!)
Check terminal - are you seeing progress?
prompt eval time = 45678.89 ms means it's working

Solution 2: Reduce resource usage

# Use smaller model
MODEL="bartowski/Qwen2.5-Coder-7B-Instruct-GGUF:Q4_K_M"

# OR reduce context size
--ctx-size 32000  # instead of 64000

# OR reduce batch size
--batch-size 2048 --ubatch-size 512

Issue: Tool Calling Doesn't Work

Symptoms:

Claude responds but doesn't create/edit files
Gets stuck when trying to use tools

Solution: Add --jinja flag

# WRONG
./build/bin/llama-server -hf model --port 8000

# CORRECT
./build/bin/llama-server -hf model --port 8000 --jinja

The --jinja flag is CRITICAL for Claude Code compatibility!

Issue: Out of Memory

Symptoms:

Server crashes during model load
System becomes unresponsive

Solutions:

Use smaller quantization:
- Q4_K_M instead of Q5_K_M
- Q4_K_M instead of Q6_K
Use smaller model:
- 7B instead of 14B
- 14B instead of 25B
Close other applications to free RAM

Issue: Server Won't Start

Check:

# Port already in use?
lsof -i :8000
# If yes, either kill that process or use different port

# Model file corrupt?
rm -rf ~/.cache/huggingface/hub/models*Qwen*
# Then restart server to re-download

Performance Tips

Optimize for Your Hardware

For Apple Silicon Macs:

--flash-attn on \         # Use Flash Attention
--n-gpu-layers 99 \       # Offload everything to GPU
--kv-unified              # Unified KV cache

For NVIDIA GPUs:

--n-gpu-layers 99 \       # Offload all layers to GPU
--batch-size 4096 \       # Larger batches if you have VRAM
--ubatch-size 1024

For CPU-Only:

--threads 8 \             # Match your CPU core count
--batch-size 512 \        # Smaller batches
--ubatch-size 128

Expected Performance

MacBook Pro M1 Max (32GB):

Model: Qwen3-Coder 25B Q4_K_M
First prompt: ~45 seconds
Subsequent prompts: 5-10 seconds
Quality: Excellent

Desktop with RTX 4090 (24GB VRAM):

Model: DeepSeek-Coder V2 16B Q5_K_M
First prompt: ~20 seconds
Subsequent prompts: 2-5 seconds
Quality: Excellent

Cost Comparison

Claude Opus 4.6 (Cloud)

Cost: $15 per million tokens
For 10M tokens/month: $150/month
For 50M tokens/month: $750/month

llama.cpp (Local)

Cost: $0 (FREE)
Initial hardware: You already have it
Electricity: ~$0.50-2/month
Savings: 100%

Next Steps

Now that you have local models working:

Experiment with models: Try different coding models to find your favorite
Create profiles: Set up profiles for different use cases
Share with team: Document your setup for teammates
Contribute back: Report which models work best at https://github.com/itdove/devaiflow/issues

Troubleshooting Resources

DevAIFlow Docs: Alternative Model Providers
llama.cpp GitHub: https://github.com/ggml-org/llama.cpp
HuggingFace Models: https://huggingface.co/models?other=gguf
Community Help: https://github.com/itdove/devaiflow/issues

Conclusion

You now have a completely free, offline AI coding assistant running on your local machine! This setup provides:

✅ Zero ongoing costs
✅ Complete data privacy
✅ Works offline
✅ Full Claude Code IDE integration
✅ Flexibility to try different models

Happy coding! 🎉

Found this tutorial helpful? Please star the DevAIFlow repository and share with others!

Questions or issues? Open an issue on GitHub: https://github.com/itdove/devaiflow/issues

FilesExpand file tree

local-llama-cpp-setup.md

Latest commit

History