A high-performance, portable inference engine for diffusion language models (LLaDA, Dream).
Diffusion LLMs generate all tokens in parallel via iterative unmasking, which shifts the bottleneck from memory bandwidth to compute. That shift makes CPU inference viable for sequence generation: diffuse-rs unmasks 30+ tokens per step, where an autoregressive model decodes one token at a time.
All benchmarks on the same hardware: Xeon Gold 6252, 12 threads, Q4_K_M quantization. Build with `RUSTFLAGS="-C target-cpu=native"` for AVX2/FMA.
| # | Prompt | diffuse-rs | diffuse-cpp (C++) | Speedup |
|---|---|---|---|---|
| 1 | Capital of France? | 1.5 tok/s | 0.8 tok/s | 1.9x |
| 2 | Translate to French | 2.5 tok/s | 0.7 tok/s | 3.6x |
| 3 | 15 times 23? | 1.3 tok/s | 0.8 tok/s | 1.6x |
| 4 | Translate to Spanish | 1.8 tok/s | 0.7 tok/s | 2.6x |
| 5 | Python is_prime() | 0.6 tok/s | 0.6 tok/s | 1.0x |
| 6 | Why is the sky blue? | 0.6 tok/s | 0.7 tok/s | 0.9x |
| 7 | List the planets | 0.8 tok/s | 0.7 tok/s | 1.1x |
| 8 | Poem about the ocean | 0.6 tok/s | 0.7 tok/s | 0.9x |
llama.cpp (Llama-3-8B, autoregressive) reaches 9.35 tok/s on the same hardware. Diffusion models trade per-token speed for parallel generation: each step unmasks multiple tokens simultaneously.
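One way to reason about that trade-off: effective throughput is tokens emitted divided by total step time, so unmasking more tokens per step directly offsets a slower forward pass. A minimal sketch (the function name and all numbers below are illustrative, not benchmark data):

```rust
/// Effective throughput of a diffusion LLM: total tokens emitted
/// divided by total time spent across all denoising steps.
fn effective_tok_per_s(total_tokens: u32, n_steps: u32, step_seconds: f64) -> f64 {
    total_tokens as f64 / (n_steps as f64 * step_seconds)
}

fn main() {
    // Hypothetical: 128 tokens generated in 16 steps at 5 s per forward pass.
    let tps = effective_tok_per_s(128, 16, 5.0);
    println!("{tps} tok/s"); // an autoregressive model would need 128 passes
}
```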
Optimization in progress: eliminating per-matmul Tensor overhead to match candle's internal AVX2 throughput. Target: 2-3x improvement on multi-step prompts.
```bash
cargo build --release
```
Download a model:
```bash
pip install huggingface-hub
huggingface-cli download diffuse-cpp/LLaDA-8B-Instruct-GGUF llada-8b-q4km.gguf --local-dir models/
```
Generate:
```bash
./target/release/diffuse-rs generate \
  --model models/llada-8b-q4km.gguf \
  --prompt-ids "2372,341,268,7706,300,11406,30" \
  -n 128 --n-steps 16 --threads 12
# → "The capital of France is Paris."
```
To enable the HTTP server, build with the `server` feature:
```bash
cargo build --release --features server
```
```bash
# Download tokenizer for text prompts
huggingface-cli download GSAI-ML/LLaDA-8B-Instruct tokenizer.json --local-dir models/

./target/release/diffuse-rs serve \
  --model models/llada-8b-q4km.gguf \
  --tokenizer models/tokenizer.json --threads 12
```
```bash
# Text prompt
curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt":"What is the capital of France?","max_tokens":128}'

# Token IDs (no tokenizer needed)
curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"token_ids":[2372,341,268,7706,300,11406,30],"max_tokens":128}'
```
Project layout:
```text
src/
├── model.rs    # Model loading, forward pass, KV cache, sampler
├── kernels.rs  # Q4_K/Q6_K block types, Q8_K quantization, scalar dot products
├── server.rs   # HTTP API server (axum, optional)
└── main.rs     # CLI: bench, generate, serve, profile
```
Quantized matmuls via candle's `QMatMul` (Q4_K, Q6_K) with AVX2/FMA enabled via `target-cpu=native`. Native SIMD for attention, RoPE, softmax, SiLU. Rayon fork-join for parallel dispatch. F16 embeddings dequantized per-row at runtime (~72 MB vs ~2 GB).
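The F16 embedding trick (keeping the table in half precision and converting only the rows a batch actually touches) can be sketched as below; `f16_to_f32` and `dequant_row` are illustrative names, not the actual diffuse-rs or candle API:

```rust
/// Convert an IEEE 754 binary16 value (stored as a raw u16) to f32.
/// Illustrative sketch; candle ships its own f16 handling.
fn f16_to_f32(bits: u16) -> f32 {
    let sign = if bits & 0x8000 != 0 { -1.0f32 } else { 1.0f32 };
    let exp = ((bits >> 10) & 0x1f) as i32;
    let frac = (bits & 0x03ff) as f32;
    match exp {
        // Subnormal (and zero): value = sign * frac * 2^-24
        0 => sign * frac * 2f32.powi(-24),
        // Exponent all-ones: infinity or NaN
        31 => if frac == 0.0 { sign * f32::INFINITY } else { f32::NAN },
        // Normal: value = sign * (1 + frac/1024) * 2^(exp-15)
        _ => sign * (1.0 + frac / 1024.0) * 2f32.powi(exp - 15),
    }
}

/// Dequantize one embedding row on demand instead of materializing the
/// whole table in f32 up front.
fn dequant_row(f16_rows: &[u16], dim: usize, row: usize) -> Vec<f32> {
    f16_rows[row * dim..(row + 1) * dim]
        .iter()
        .map(|&b| f16_to_f32(b))
        .collect()
}
```
Only the rows needed by the current tokens are converted, which is why the table costs ~72 MB resident instead of ~2 GB.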
- Initialize output positions as `[MASK]` tokens
- Run a full forward pass through the bidirectional transformer (no causal mask)
- For each masked position, compute the entropy of the logit distribution
- Unmask the lowest-entropy positions (the most confident predictions)
- Reuse cached K/V for unchanged positions on subsequent steps
- Repeat until all positions are unmasked or the step budget is exhausted
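The confidence-based selection in the steps above can be sketched as follows; all function names are illustrative, not the actual diffuse-rs API:

```rust
/// Numerically stable softmax over raw logits.
fn softmax(logits: &[f32]) -> Vec<f32> {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// Shannon entropy of a probability distribution, in nats.
fn entropy(probs: &[f32]) -> f32 {
    probs.iter().filter(|&&p| p > 0.0).map(|&p| -p * p.ln()).sum()
}

/// Pick the `k` masked positions whose logit distributions have the
/// lowest entropy, i.e. the model's most confident predictions.
fn select_unmask(logits_per_pos: &[Vec<f32>], masked: &[usize], k: usize) -> Vec<usize> {
    let mut scored: Vec<(usize, f32)> = masked
        .iter()
        .map(|&pos| (pos, entropy(&softmax(&logits_per_pos[pos]))))
        .collect();
    scored.sort_by(|a, b| a.1.partial_cmp(&b.1).unwrap());
    scored.into_iter().take(k).map(|(pos, _)| pos).collect()
}
```
A sharply peaked distribution has low entropy and gets unmasked first; a near-uniform one stays masked for a later step.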
- diffuse-cpp: reference C++ implementation
- LLaDA: masked diffusion language model
- candle (MIT/Apache-2.0): GGUF parsing and quantized matmul
AGPL-3.0. See LICENSE.
For commercial licensing, contact aygul.galimova@duke.edu.