Cheesebrain is a high-performance C/C++ runtime for Large Language Model (LLM) inference. It is designed to be small, fast, and self-contained so you can run modern GGUF models on laptops, workstations, and servers with minimal setup.
Cheesebrain is a standalone project rather than a derivative of cheese.cpp. It focuses on:
- Portability – a single codebase that runs well on Linux, macOS, and Windows.
- Performance – tight low-level code with aggressive quantization support and hardware-aware kernels.
- Practical tooling – CLI, HTTP server, Web UI, quantization tools, and model conversion utilities.
- C / C++ implementation for easy integration into existing systems.
- Hardware-optimized backends:
  - Apple Silicon (NEON, Accelerate, Metal)
  - x86 (SSE/AVX/AVX2/AVX-512/AMX where available)
  - Optional GPU backends (CUDA / Metal / others, depending on build flags)
- Quantization-aware: supports multiple GGUF quantization schemes to reduce memory and improve throughput.
- Rich tooling:
  - `cheese-cli` for interactive and scripted use.
  - `cheese-server` for an OpenAI-compatible HTTP API (with optional Web UI).
  - Quantization, benchmarking, and conversion helpers under `tools/`.
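As a sketch of how the quantization helpers might be driven (the tool name `cheese-quantize`, its argument order, and the `Q4_K_M` scheme label are illustrative assumptions, not confirmed by this README — check `tools/` for the actual helpers):

```shell
# Hypothetical sketch: re-quantize an FP16 GGUF model down to a 4-bit scheme
# to reduce memory use. Tool name and arguments are illustrative only.
./build/bin/cheese-quantize ./models/model-f16.gguf ./models/model-q4.gguf Q4_K_M
```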
From the repository root:
```shell
cmake -B build
cmake --build build --config Release
```

This produces binaries under `build/bin/`.
Assuming you have a GGUF model at `./models/model.gguf`:

```shell
# Chat from the terminal
./build/bin/cheese-cli -m ./models/model.gguf -cnv

# Start an OpenAI-compatible HTTP server (add --webui for the Web UI)
./build/bin/cheese-server -m ./models/model.gguf --port 8080
```

The server exposes `/v1/chat/completions`, `/v1/completions`, `/v1/embeddings`, and related endpoints.
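Once the server is up, the chat endpoint can be exercised with `curl`. A minimal sketch, assuming the request body follows the usual OpenAI chat-completions shape (the `model` value here is illustrative):

```shell
# Query the OpenAI-compatible chat endpoint on the port chosen above.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "model.gguf",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```

The same request shape works against `/v1/completions` and `/v1/embeddings` with the fields those endpoints expect.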
See the in-repo docs for details.
To get the best performance on your machine:
- Build with an appropriate backend (BLAS for CPU, CUDA/Metal/SYCL where applicable).
- Tune runtime flags:
  - Threads: `-t N`
  - GPU layers: `-ngl N`
  - Batch size / ubatch: `-ub N`
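Putting the tuning flags together, a combined invocation might look like this (the specific values are illustrative and should be tuned to your hardware):

```shell
# -t 8    : use 8 CPU threads
# -ngl 32 : offload 32 layers to the GPU (GPU-enabled builds only)
# -ub 512 : micro-batch size
./build/bin/cheese-cli -m ./models/model.gguf -t 8 -ngl 32 -ub 512
```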
Cheesebrain aims to be a pragmatic, low-friction way to run GGUF models locally while remaining small and hackable.
