A production-style Large Language Model (LLM) inference backend built from scratch using llama.cpp, FastAPI, and pure systems thinking.
This project demonstrates how real-world inference infrastructure works — including KV cache behavior, thread tuning, quantization benchmarking, and live observability.
Build a CPU-based inference server that includes:
- Model loading (GGUF via llama.cpp)
- Chat completion API
- KV cache experimentation
- Thread performance tuning
- Quantization benchmarking
- Metrics endpoint
- Experimental cache complexity analysis
This is not an API wrapper. This is an infrastructure-level implementation.
User
↓
FastAPI (/generate)
↓
llama.cpp (GGUF model)
↓
KV Cache (attention reuse)
↓
Metrics Collector
↓
/metrics endpoint
Core components:
- model.py → Model loader
- main.py → FastAPI server
- metrics.py → Observability layer
- kv_cache.py → Cache experiment
- Python 3.12
- llama-cpp-python
- FastAPI
- Uvicorn
- psutil
- GGUF Quantized Models
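A minimal sketch of the serving path through the core components above, assuming the high-level llama-cpp-python API. The model path, the `max_tokens` field, and the response shape are illustrative assumptions, not the repo's exact code:

```python
# main.py-style sketch (illustrative; paths and defaults are assumptions)
import time

from fastapi import FastAPI
from pydantic import BaseModel
from llama_cpp import Llama

app = FastAPI()

# model.py equivalent: load the GGUF model once at startup
llm = Llama(
    model_path="models/tinyllama-1.1b-chat-q4_k_m.gguf",  # assumed local path
    n_ctx=2048,
    n_threads=2,  # tuned value from the thread experiment below
    verbose=False,
)

class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 128  # assumed default

@app.post("/generate")
def generate(req: GenerateRequest):
    start = time.perf_counter()
    out = llm(req.prompt, max_tokens=req.max_tokens)
    latency = time.perf_counter() - start
    return {
        "text": out["choices"][0]["text"],
        "latency_sec": round(latency, 3),
        "tokens_generated": out["usage"]["completion_tokens"],
    }
```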
Measure the performance difference between:
- KV cache enabled (default)
- KV cache destroyed (forced full recomputation)
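A sketch of how this experiment can be run with the high-level llama-cpp-python API (not the repo's exact kv_cache.py; the model path is assumed, and `llm.reset()` is used here to discard cached keys/values so every step re-evaluates the whole sequence):

```python
# kv_cache_bench.py-style sketch (illustrative)
import time

from llama_cpp import Llama

MODEL_PATH = "models/tinyllama-1.1b-chat-q4_k_m.gguf"  # assumed path
PROMPT = "Explain gravity simply"
N_TOKENS = 50

llm = Llama(model_path=MODEL_PATH, n_ctx=2048, n_threads=2, verbose=False)

def with_cache() -> float:
    """One normal generation call; the KV cache is reused internally."""
    llm.reset()
    start = time.perf_counter()
    llm(PROMPT, max_tokens=N_TOKENS, temperature=0.0)
    return time.perf_counter() - start

def without_cache() -> float:
    """Generate one token at a time, resetting state each step so the
    entire prompt + generated text is re-evaluated from scratch."""
    text = PROMPT
    start = time.perf_counter()
    for _ in range(N_TOKENS):
        llm.reset()  # discard cached K/V -> forced full recomputation
        out = llm(text, max_tokens=1, temperature=0.0)
        text += out["choices"][0]["text"]
    return time.perf_counter() - start

t_cache, t_nocache = with_cache(), without_cache()
print(f"with cache:    {t_cache:.1f}s ({N_TOKENS / t_cache:.1f} tok/s)")
print(f"without cache: {t_nocache:.1f}s ({N_TOKENS / t_nocache:.1f} tok/s)")
```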
With KV cache enabled:
- 50 tokens generated
- Total latency ≈ 2.5 seconds
- ≈ 0.05 sec per token
- ≈ 26–28 tokens/sec
With KV cache destroyed:
- 50 tokens generated
- Total latency ≈ 44.6 seconds
- ≈ 0.89 sec per token (avg)
- ~18x slower overall
Without KV cache:
Each token recomputes attention over all previous tokens.
Total attention work:
1 + 2 + 3 + ... + N = N(N+1)/2 ≈ O(N²)
With KV cache:
Keys and Values are stored. Each new token computes attention only once.
Total work:
N × constant ≈ O(N)
KV cache converts quadratic growth into linear incremental cost.
This is the reason modern chat systems are usable.
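A quick worked example for N = 50 generated tokens makes the gap concrete:

```python
# Attention positions touched while generating N new tokens (illustrative count)
N = 50
no_cache = sum(range(1, N + 1))  # token i re-attends over all i previous positions
with_cache = N                   # each new token attends once, reusing stored K/V
print(no_cache, with_cache)      # 1275 vs 50 -> ~25x more attention work
```

That ~25x gap in attention work is the same order of magnitude as the ~18x slowdown measured above; fixed per-token costs (sampling, prompt handling) narrow the observed ratio.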
Machine: 4 logical cores (nproc = 4)
| Threads | Tokens/sec |
|---|---|
| 2 | 28.6 |
| 3 | 27.0 |
| 4 | 25.8 |
Optimal configuration: n_threads = 2
Observation:
Increasing threads did not improve performance. Inference was memory-bandwidth bound, not compute-bound.
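The sweep itself is a short loop over the `n_threads` parameter of the `Llama` constructor; a sketch, with the model path assumed:

```python
# thread_sweep.py-style sketch (illustrative)
import time

from llama_cpp import Llama

MODEL_PATH = "models/tinyllama-1.1b-chat-q4_k_m.gguf"  # assumed path

for n_threads in (2, 3, 4):
    llm = Llama(model_path=MODEL_PATH, n_threads=n_threads, verbose=False)
    start = time.perf_counter()
    out = llm("Explain gravity simply", max_tokens=50, temperature=0.0)
    elapsed = time.perf_counter() - start
    tokens = out["usage"]["completion_tokens"]
    print(f"n_threads={n_threads}: {tokens / elapsed:.1f} tok/s")
    del llm  # release the model before loading the next instance
```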
Expose live server statistics:
GET /metrics
Returns:
{
"total_requests": 2,
"avg_latency_sec": 6.86,
"tokens_per_sec_avg": 26.6,
"total_tokens_generated": 365,
"memory_mb": 1213.11,
"cpu_percent": 21.1
}
Provides:
- Request count
- Average latency
- Aggregate throughput
- RAM usage
- CPU usage
Observability is built-in.
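A sketch of how such an endpoint can be wired up with FastAPI and psutil; the aggregation mirrors the field names in the example payload above, but the repo's metrics.py may differ:

```python
# metrics.py-style sketch (illustrative)
import psutil
from fastapi import FastAPI

app = FastAPI()
process = psutil.Process()  # current server process

# Running totals, updated by the /generate handler after each request
stats = {"requests": 0, "latency_sum": 0.0, "tokens": 0}

def record(latency_sec: float, n_tokens: int) -> None:
    stats["requests"] += 1
    stats["latency_sum"] += latency_sec
    stats["tokens"] += n_tokens

@app.get("/metrics")
def metrics():
    reqs = stats["requests"]
    lat = stats["latency_sum"]
    return {
        "total_requests": reqs,
        "avg_latency_sec": round(lat / reqs, 2) if reqs else 0.0,
        "tokens_per_sec_avg": round(stats["tokens"] / lat, 2) if lat else 0.0,
        "total_tokens_generated": stats["tokens"],
        "memory_mb": round(process.memory_info().rss / 1e6, 2),
        "cpu_percent": psutil.cpu_percent(interval=0.1),
    }
```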
Model tested: TinyLlama 1.1B Q4_K_M
Memory footprint ≈ 1.2GB (including buffers and cache)
Quantization reduces:
- RAM usage
- Memory bandwidth
- Latency per token
Tradeoff: slight reduction in output quality.
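A simple way to check the resident-memory cost of a given quantization is to measure process RSS before and after loading the model (a sketch; the GGUF filename is an assumption, and any quant level can be swapped in):

```python
# quant_memory.py-style sketch (illustrative)
import psutil
from llama_cpp import Llama

proc = psutil.Process()
before = proc.memory_info().rss

llm = Llama(
    model_path="models/tinyllama-1.1b-chat-q4_k_m.gguf",  # assumed Q4_K_M file
    n_threads=2,
    verbose=False,
)
llm("warm up", max_tokens=8)  # force KV cache and scratch buffers to allocate

after = proc.memory_info().rss
print(f"model + buffers resident memory: {(after - before) / 1e6:.0f} MB")
```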
pip install llama-cpp-python fastapi uvicorn psutil
python -m uvicorn server.main:app --reload
curl -X POST "http://localhost:8000/generate" \
-H "Content-Type: application/json" \
-d '{"prompt": "Explain gravity simply"}'
curl http://localhost:8000/metrics
- KV cache is mandatory for usable LLM inference
- CPU inference is often memory-bandwidth bound
- More threads ≠ better performance
- Observability is essential before optimization
- Quantization dramatically affects throughput
This project shows understanding of:
- Transformer attention mechanics
- KV caching behavior
- Systems performance profiling
- Async API serving
- CPU-level optimization
- Infrastructure thinking
- Add batching scheduler
- Implement streaming tokens (SSE)
- Add quantization comparison benchmarks
- Dockerize deployment
- Publish performance graphs
This Mini vLLM implementation proves that efficient inference is not about model size — it is about systems design.
KV cache turns quadratic complexity into linear growth. Thread tuning reveals memory bottlenecks. Metrics make performance measurable.
Inference is infrastructure. And infrastructure is engineering.