Local LLM Benchmarking Suite

A modular toolkit for benchmarking local Large Language Models (LLMs) using Ollama. Evaluate performance metrics like throughput, latency, and resource usage across models, configurations, and hardware setups. Ideal for developers, researchers, and AI enthusiasts optimizing local AI deployments.

🚀 Why Use This Toolkit?

Running LLMs locally offers privacy, cost savings, and low-latency inference—but performance depends heavily on your hardware, model choice, and settings. This toolkit helps you:

Measure real-world viability: Test if a model meets your speed and resource requirements.
Compare configurations: Evaluate trade-offs in context size, temperature, GPU layers, and more.
Monitor in real-time: Visualize generation progress with live metrics and resource graphs.
Batch benchmark: Run automated suites across multiple models and params, exporting results to CSV.

Built with Ollama for simplicity—no complex setups required. Focus on insights, not infrastructure.

✨ Key Features

Real-Time Mode: Interactive dashboard for monitoring generation, with tokens/sec, TPM (tokens per minute), CPU/GPU/RAM usage, and live text output.
Batch Mode: Automated benchmarking across model grids, averaging multiple trials for reliable results.
Detailed Metrics: Time-to-first-token (TTFT), tokens per second (TPS), peak resource usage, and more.
Config-Driven: JSON-based configs for easy customization of models, prompts, sampling params, and hardware settings.
Smart Model Management: Automatic model pulling, fuzzy matching, and suggestions if a model isn't found locally.
Logging & Export: Detailed logs and CSV outputs for analysis and sharing.
Hardware Snapshot: Captures your system's CPU, RAM, GPU, and Ollama version for reproducible benchmarks.

📦 Installation

Install Ollama: Download and set up Ollama from ollama.com. Start the server with ollama serve.

Clone the Repository:

git clone https://github.com/kaushall13/local-llm-benchmark
cd local-llm-benchmark

Create a Python Virtual Environment: It is recommended to use a virtual environment to isolate dependencies.
```
python -m venv .venv
source .venv/bin/activate   # On Linux/Mac
.venv\Scripts\activate      # On Windows
```
Install Python Dependencies: Ensure Python 3.8+ is installed, then:
```
pip install -r requirements.txt
```
(Requirements include ollama, psutil, pynvml, rich, pandas, tqdm, and more for monitoring and analysis.)

🔧 Usage

The toolkit uses a unified entry point (main.py) to run either real-time or batch modes.

General Command

python main.py [realtime|batch] [options]

Real-Time Mode

Monitor a single generation in real-time with a rich terminal UI.

Basic Example:

python main.py realtime --model-name qwen2.5:1.5b --prompt "Explain quantum computing in simple terms"

With Custom Config: First, create a default config:

python main.py realtime --mode create-config

Edit config.json, then run:

python main.py realtime --config config.json

Options:
- --model-name STR: Model (e.g., qwen2.5:1.5b). Fuzzy matching supported.
- --prompt STR: Input prompt for generation.
- --host STR: Ollama server URL (default: http://localhost:11434).
- --max-tokens INT: Max tokens to generate (default: 2000).
- --temperature FLOAT: Sampling temperature (default: 0.7).
- --n-gpu-layers INT: GPU layers (-1 for all, default: -1).
- --n-ctx INT: Context length (default: 2048).
- --top-k INT: Top-k sampling (default: 40).
- --top-p FLOAT: Top-p sampling (default: 0.9).
- --repeat-penalty FLOAT: Repeat penalty (default: 1.1).
- --num-thread INT: CPU threads (default: 0, auto-detect).
- --config STR: Path to JSON config file.
- --force-size: Skip terminal size check (use cautiously).
- --min-width INT / --min-height INT: Override min terminal dimensions.
The UI requires a terminal of at least 80x30 (adjustable). Press Ctrl+C to stop.

Batch Mode

Run automated benchmarks across multiple configs, averaging trials.

Basic Example (creates sample config if missing):

python main.py batch --config ollama_benchmark_config.json

Options:
- --config STR: Path to config file (default: ollama_benchmark_config.json).

If the config doesn't exist, a sample is created with models like gemma3:270m and qwen2.5:1.5b. Edit it to specify:

{
  "models": [
    {"name": "qwen2.5:1.5b"},
    {"name": "llama3.1:8b"}
  ],
  "parameters": {
    "num_ctx": [2048, 4096],
    "num_predict": [128, 512],
    "num_gpu": [0, -1],
    "temperature": [0.7],
    "num_thread": [4],
    "top_k": [40],
    "top_p": [0.9],
    "repeat_penalty": [1.1]
  },
  "benchmark_settings": {
    "trials": 3,
    "output_prefix": "ollama_benchmark_results",
    "ollama_host": "http://localhost:11434"
  }
}

Results are saved as CSV (e.g., ollama_benchmark_results_YYYYMMDD_HHMMSS.csv) with a console summary.

📊 Example Outputs

Real-Time Mode

Live dashboard showing model info, resource usage, TPS/TPM, and generated text.

Batch Mode Summary

--- Ollama Benchmark Summary ---

Model: qwen2.5:1.5b
  Test Scenario (context: 2048, temp: 0.7, predict: 128):
    - CPU: TTFT: 0.45s | TPS: 45.20 | RAM Peak: 2100 MB
    - GPU: TTFT: 0.12s | TPS: 120.50 | VRAM Peak: 1500 MB
      Performance Gain: GPU is 2.67x faster

Full CSV includes TTFT, TPS, TPM, tokens, resource peaks, and errors.

📌 Use Cases

Model Selection: Compare Llama, Qwen, Gemma, etc., for your hardware.
Optimization: Tune GPU layers, context size, and sampling for best performance.
Hardware Testing: Profile CPU vs. GPU inference on different setups.
Deployment Prep: Ensure configs meet latency/throughput SLAs for apps.

🤝 Contributing

Contributions welcome! Help add features like multi-GPU support or visualizations.

Fork and create a feature branch.
Submit a PR with clear descriptions and tests.
Share your benchmark results across hardware.

📄 License

MIT License. See LICENSE for details.

Name		Name	Last commit message	Last commit date
Latest commit History 53 Commits
llama-cpp-python		llama-cpp-python
prompts		prompts
sample_dashboard_images		sample_dashboard_images
README.md		README.md
approach.md		approach.md
batch_benchmark.py		batch_benchmark.py
config.json		config.json
interface.py		interface.py
main.py		main.py
my_benchmark.json		my_benchmark.json
realtime_benchmark.py		realtime_benchmark.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Local LLM Benchmarking Suite

🚀 Why Use This Toolkit?

✨ Key Features

📦 Installation

🔧 Usage

General Command

Real-Time Mode

Batch Mode

📊 Example Outputs

Real-Time Mode

Batch Mode Summary

📌 Use Cases

🤝 Contributing

📄 License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Local LLM Benchmarking Suite

🚀 Why Use This Toolkit?

✨ Key Features

📦 Installation

🔧 Usage

General Command

Real-Time Mode

Batch Mode

📊 Example Outputs

Real-Time Mode

Batch Mode Summary

📌 Use Cases

🤝 Contributing

📄 License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages