A high-performance, portable inference engine for diffusion language models (LLaDA, Dream).
Diffusion LLMs generate all tokens in parallel via iterative unmasking, which shifts the bottleneck from memory bandwidth to compute. That shift makes CPU inference viable for sequence generation: diffuse-rs unmasks 30+ tokens per step, where an autoregressive model decodes one token at a time.
All benchmarks on the same hardware: Xeon Gold 6252, 12 threads, Q4_K_M quantization. Build with `RUSTFLAGS="-C target-cpu=native"` for AVX2/FMA.
| # | Prompt | diffuse-rs | diffuse-cpp (C++) | Speedup |
|---|---|---|---|---|
| 1 | Capital of France? | 1.5 tok/s | 0.8 tok/s | 1.9x |
| 2 | Translate to French | 2.5 tok/s | 0.7 tok/s | 3.6x |
| 3 | 15 times 23? | 1.3 tok/s | 0.8 tok/s | 1.6x |
| 4 | Translate to Spanish | 1.8 tok/s | 0.7 tok/s | 2.6x |
| 5 | Python is_prime() | 0.6 tok/s | 0.6 tok/s | 1.0x |
| 6 | Why is the sky blue? | 0.6 tok/s | 0.7 tok/s | 0.9x |
| 7 | List the planets | 0.8 tok/s | 0.7 tok/s | 1.1x |
| 8 | Poem about the ocean | 0.6 tok/s | 0.7 tok/s | 0.9x |
llama.cpp (Llama-3-8B, autoregressive) reaches 9.35 tok/s on the same hardware. Diffusion models trade per-token speed for parallel generation: each step unmasks multiple tokens simultaneously.
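One way to reason about that trade-off: effective throughput is tokens emitted divided by total step time, so unmasking more tokens per step directly offsets a slower forward pass. A minimal sketch (the function name and all numbers below are illustrative, not benchmark data):

```rust
/// Effective throughput of a diffusion LLM: total tokens emitted
/// divided by total time spent across all denoising steps.
fn effective_tok_per_s(total_tokens: u32, n_steps: u32, step_seconds: f64) -> f64 {
    total_tokens as f64 / (n_steps as f64 * step_seconds)
}

fn main() {
    // Hypothetical: 128 tokens generated in 16 steps at 5 s per forward pass.
    let tps = effective_tok_per_s(128, 16, 5.0);
    println!("{tps} tok/s"); // an autoregressive model would need 128 passes
}
```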
Optimization in progress: eliminating per-matmul Tensor overhead to match candle's internal AVX2 throughput. Target: 2-3x improvement on multi-step prompts.
```bash
cargo build --release
```
Download a model:
```bash
pip install huggingface-hub
huggingface-cli download diffuse-cpp/LLaDA-8B-Instruct-GGUF llada-8b-q4km.gguf --local-dir models/
```
Generate:
```bash
./target/release/diffuse-rs generate \
  --model models/llada-8b-q4km.gguf \
  --prompt-ids "2372,341,268,7706,300,11406,30" \
  -n 128 --n-steps 16 --threads 12
# → "The capital of France is Paris."
```
To enable the HTTP server, build with the `server` feature:
```bash
cargo build --release --features server
```
```bash
# Download tokenizer for text prompts
huggingface-cli download GSAI-ML/LLaDA-8B-Instruct tokenizer.json --local-dir models/

./target/release/diffuse-rs serve \
  --model models/llada-8b-q4km.gguf \
  --tokenizer models/tokenizer.json --threads 12
```
```bash
# Text prompt
curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt":"What is the capital of France?","max_tokens":128}'

# Token IDs (no tokenizer needed)
curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"token_ids":[2372,341,268,7706,300,11406,30],"max_tokens":128}'
```
Project layout:
```text
src/
├── model.rs    # Model loading, forward pass, KV cache, sampler
├── kernels.rs  # Q4_K/Q6_K block types, Q8_K quantization, scalar dot products
├── server.rs   # HTTP API server (axum, optional)
└── main.rs     # CLI: bench, generate, serve, profile
```
Quantized matmuls via candle's `QMatMul` (Q4_K, Q6_K) with AVX2/FMA enabled via `target-cpu=native`. Native SIMD for attention, RoPE, softmax, SiLU. Rayon fork-join for parallel dispatch. F16 embeddings dequantized per-row at runtime (~72 MB vs ~2 GB).
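The F16 embedding trick (keeping the table in half precision and converting only the rows a batch actually touches) can be sketched as below; `f16_to_f32` and `dequant_row` are illustrative names, not the actual diffuse-rs or candle API:

```rust
/// Convert an IEEE 754 binary16 value (stored as a raw u16) to f32.
/// Illustrative sketch; candle ships its own f16 handling.
fn f16_to_f32(bits: u16) -> f32 {
    let sign = if bits & 0x8000 != 0 { -1.0f32 } else { 1.0f32 };
    let exp = ((bits >> 10) & 0x1f) as i32;
    let frac = (bits & 0x03ff) as f32;
    match exp {
        // Subnormal (and zero): value = sign * frac * 2^-24
        0 => sign * frac * 2f32.powi(-24),
        // Exponent all-ones: infinity or NaN
        31 => if frac == 0.0 { sign * f32::INFINITY } else { f32::NAN },
        // Normal: value = sign * (1 + frac/1024) * 2^(exp-15)
        _ => sign * (1.0 + frac / 1024.0) * 2f32.powi(exp - 15),
    }
}

/// Dequantize one embedding row on demand instead of materializing the
/// whole table in f32 up front.
fn dequant_row(f16_rows: &[u16], dim: usize, row: usize) -> Vec<f32> {
    f16_rows[row * dim..(row + 1) * dim]
        .iter()
        .map(|&b| f16_to_f32(b))
        .collect()
}
```
Only the rows needed by the current tokens are converted, which is why the table costs ~72 MB resident instead of ~2 GB.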
- Initialize output positions as `[MASK]` tokens
- Run a full forward pass through the bidirectional transformer (no causal mask)
- For each masked position, compute the entropy of the logit distribution
- Unmask the lowest-entropy positions (the most confident predictions)
- Reuse cached K/V for unchanged positions on subsequent steps
- Repeat until all positions are unmasked or the step budget is exhausted
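The confidence-based selection in the steps above can be sketched as follows; all function names are illustrative, not the actual diffuse-rs API:

```rust
/// Numerically stable softmax over raw logits.
fn softmax(logits: &[f32]) -> Vec<f32> {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// Shannon entropy of a probability distribution, in nats.
fn entropy(probs: &[f32]) -> f32 {
    probs.iter().filter(|&&p| p > 0.0).map(|&p| -p * p.ln()).sum()
}

/// Pick the `k` masked positions whose logit distributions have the
/// lowest entropy, i.e. the model's most confident predictions.
fn select_unmask(logits_per_pos: &[Vec<f32>], masked: &[usize], k: usize) -> Vec<usize> {
    let mut scored: Vec<(usize, f32)> = masked
        .iter()
        .map(|&pos| (pos, entropy(&softmax(&logits_per_pos[pos]))))
        .collect();
    scored.sort_by(|a, b| a.1.partial_cmp(&b.1).unwrap());
    scored.into_iter().take(k).map(|(pos, _)| pos).collect()
}
```
A sharply peaked distribution has low entropy and gets unmasked first; a near-uniform one stays masked for a later step.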
- diffuse-cpp: reference C++ implementation
- LLaDA: masked diffusion language model
- candle (MIT/Apache-2.0): GGUF parsing and quantized matmul
AGPL-3.0. See LICENSE.
For commercial licensing, contact aygul.galimova@duke.edu.