# diffuse-rs

A high-performance, portable inference engine for diffusion language models (LLaDA, Dream).

Diffusion LLMs generate all tokens in parallel via iterative unmasking, which shifts the bottleneck from memory bandwidth to compute and makes CPU inference viable for sequence generation: diffuse-rs unmasks 30+ tokens per step, versus one token at a time for autoregressive decoding.

## Performance

All benchmarks were run on the same hardware: Xeon Gold 6252, 12 threads, Q4_K_M quantization. Build with `RUSTFLAGS="-C target-cpu=native"` to enable AVX2/FMA.

### LLaDA-8B (B=128, entropy_exit, steps=16)

| # | Prompt | diffuse-rs | diffuse-cpp (C++) | Speedup |
|---|--------|------------|-------------------|---------|
| 1 | Capital of France? | 1.5 tok/s | 0.8 tok/s | 1.9x |
| 2 | Translate to French | 2.5 tok/s | 0.7 tok/s | 3.6x |
| 3 | 15 times 23? | 1.3 tok/s | 0.8 tok/s | 1.6x |
| 4 | Translate to Spanish | 1.8 tok/s | 0.7 tok/s | 2.6x |
| 5 | Python is_prime() | 0.6 tok/s | 0.6 tok/s | 1.0x |
| 6 | Why is the sky blue? | 0.6 tok/s | 0.7 tok/s | 0.9x |
| 7 | List the planets | 0.8 tok/s | 0.7 tok/s | 1.1x |
| 8 | Poem about the ocean | 0.6 tok/s | 0.7 tok/s | 0.9x |

llama.cpp (Llama-3-8B, autoregressive) reaches 9.35 tok/s on the same hardware. Diffusion models trade per-token speed for parallel generation: each step unmasks multiple tokens simultaneously.
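To see why raw per-token speed understates diffusion decoding, compare effective rates: a slower step that unmasks several tokens can match a faster one-token-at-a-time decoder. A minimal sketch with hypothetical numbers (not benchmark results):

```rust
// Illustrative arithmetic only; the figures below are hypothetical,
// not measurements from the table above.
fn effective_tps(tokens_per_step: f64, step_seconds: f64) -> f64 {
    tokens_per_step / step_seconds
}

fn main() {
    // A diffusion step that unmasks 8 tokens in 4 s matches an
    // autoregressive decoder emitting one token every 0.5 s.
    println!("{}", effective_tps(8.0, 4.0)); // 2
    println!("{}", effective_tps(1.0, 0.5)); // 2
}
```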

Optimization in progress: eliminating per-matmul `Tensor` overhead to match candle's internal AVX2 throughput. Target: 2-3x improvement on multi-step prompts.

## Quick Start

```bash
cargo build --release
```

Download a model:

```bash
pip install huggingface-hub
huggingface-cli download diffuse-cpp/LLaDA-8B-Instruct-GGUF llada-8b-q4km.gguf --local-dir models/
```

Generate:

```bash
./target/release/diffuse-rs generate \
  --model models/llada-8b-q4km.gguf \
  --prompt-ids "2372,341,268,7706,300,11406,30" \
  -n 128 --n-steps 16 --threads 12
# → "The capital of France is Paris."
```

## HTTP Server

```bash
cargo build --release --features server

# Download the tokenizer for text prompts
huggingface-cli download GSAI-ML/LLaDA-8B-Instruct tokenizer.json --local-dir models/

./target/release/diffuse-rs serve \
  --model models/llada-8b-q4km.gguf \
  --tokenizer models/tokenizer.json --threads 12
```

```bash
# Text prompt
curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt":"What is the capital of France?","max_tokens":128}'

# Token IDs (no tokenizer needed)
curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"token_ids":[2372,341,268,7706,300,11406,30],"max_tokens":128}'
```

## Architecture

```
src/
├── model.rs     # Model loading, forward pass, KV cache, sampler
├── kernels.rs   # Q4_K/Q6_K block types, Q8_K quantization, scalar dot products
├── server.rs    # HTTP API server (axum, optional)
└── main.rs      # CLI: bench, generate, serve, profile
```

Quantized matmuls use candle's `QMatMul` (Q4_K, Q6_K), with AVX2/FMA enabled through `target-cpu=native`. Native SIMD kernels cover attention, RoPE, softmax, and SiLU; Rayon fork-join handles parallel dispatch. F16 embeddings are dequantized per-row at runtime (~72 MB resident vs. ~2 GB fully dequantized).
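The per-row embedding lookup can be sketched as below. This is an illustrative standalone version, not the actual diffuse-rs implementation (which goes through candle's GGUF types); the half-to-single conversion is the standard IEEE 754 binary16 bit manipulation:

```rust
// Sketch: dequantize one F16 embedding row to f32 on demand, so the full
// table never has to live in memory as f32. Names are illustrative.
fn f16_bits_to_f32(bits: u16) -> f32 {
    let sign = ((bits >> 15) as u32) << 31;
    let exp = ((bits >> 10) & 0x1f) as u32;
    let frac = (bits & 0x3ff) as u32;
    let out = match exp {
        0 => {
            if frac == 0 {
                sign // signed zero
            } else {
                // Subnormal half: renormalize into the f32 exponent range.
                let mut e = 127 - 15 + 1;
                let mut f = frac;
                while f & 0x400 == 0 {
                    f <<= 1;
                    e -= 1;
                }
                sign | ((e as u32) << 23) | ((f & 0x3ff) << 13)
            }
        }
        0x1f => sign | 0x7f80_0000 | (frac << 13), // inf / NaN
        _ => sign | ((exp + 127 - 15) << 23) | (frac << 13),
    };
    f32::from_bits(out)
}

/// Dequantize row `token_id` of a raw F16 table stored as u16 words.
fn dequant_row(table: &[u16], dim: usize, token_id: usize) -> Vec<f32> {
    table[token_id * dim..(token_id + 1) * dim]
        .iter()
        .map(|&b| f16_bits_to_f32(b))
        .collect()
}

fn main() {
    // 2 rows of dim 2: [1.0, 0.5] and [-2.0, 0.0] in F16 bit patterns.
    let table = [0x3C00u16, 0x3800, 0xC000, 0x0000];
    println!("{:?}", dequant_row(&table, 2, 0)); // [1.0, 0.5]
}
```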

## How It Works

1. Initialize output positions as `[MASK]` tokens.
2. Run a full forward pass through the bidirectional transformer (no causal mask).
3. For each masked position, compute the entropy of its logit distribution.
4. Unmask the lowest-entropy positions (the most confident predictions).
5. Reuse cached K/V for unchanged positions on subsequent steps.
6. Repeat until all positions are unmasked or the step budget is exhausted.
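One step of the loop above can be sketched in Rust. All names here (`step`, `entropy`, `MASK_ID`, the per-position logits layout) are illustrative, not the diffuse-rs API, and KV caching is omitted:

```rust
// Illustrative sentinel for a masked position; not the real LLaDA mask id.
const MASK_ID: u32 = u32::MAX;

/// Shannon entropy of a logit distribution: softmax, then -sum p ln p.
fn entropy(logits: &[f32]) -> f32 {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&l| (l - max).exp()).collect();
    let z: f32 = exps.iter().sum();
    exps.iter()
        .map(|&e| {
            let p = e / z;
            if p > 0.0 { -p * p.ln() } else { 0.0 }
        })
        .sum()
}

/// One diffusion step: score masked positions, unmask the k most confident.
fn step(tokens: &mut [u32], logits_per_pos: &[Vec<f32>], k: usize) {
    let mut masked: Vec<(usize, f32)> = tokens
        .iter()
        .enumerate()
        .filter(|&(_, &t)| t == MASK_ID)
        .map(|(i, _)| (i, entropy(&logits_per_pos[i])))
        .collect();
    // Lowest entropy first = most confident predictions.
    masked.sort_by(|a, b| a.1.partial_cmp(&b.1).unwrap());
    for &(i, _) in masked.iter().take(k) {
        // Greedy unmask: write the argmax token at this position.
        let row = &logits_per_pos[i];
        let mut best = 0;
        for (j, &l) in row.iter().enumerate() {
            if l > row[best] {
                best = j;
            }
        }
        tokens[i] = best as u32;
    }
}

fn main() {
    let mut tokens = vec![42u32, MASK_ID, MASK_ID];
    let logits = vec![
        vec![0.0_f32, 0.0, 0.0], // already unmasked, ignored
        vec![10.0, 0.0, 0.0],    // confident -> unmasked first
        vec![1.0, 1.0, 1.0],     // uncertain -> stays masked
    ];
    step(&mut tokens, &logits, 1);
    println!("{:?}", tokens); // [42, 0, 4294967295]
}
```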

## Acknowledgements

- diffuse-cpp: reference C++ implementation
- LLaDA: masked diffusion language model
- candle (MIT/Apache-2.0): GGUF parsing and quantized matmul

## License

AGPL-3.0. See LICENSE.

For commercial licensing, contact aygul.galimova@duke.edu.
