Benchmaker

A modular, visual LLM benchmarking desktop application for scientifically comparing multiple Large Language Models under identical conditions.

Overview

Benchmaker enables AI researchers, prompt engineers, and technical teams to:

  • Define standardized test suites with system prompts and test cases
  • Select and compare multiple LLMs via the OpenRouter API
  • Execute benchmarks in parallel across all selected models
  • Score results using objective rules and/or LLM judge models
  • Persist and compare results for regression testing and historical analysis
  • Analyze performance trends with a comprehensive analytics dashboard

Features

Core Benchmarking

  • Multi-model parallel execution - Run the same prompts against multiple LLMs simultaneously
  • Real-time response streaming - Watch responses as they're generated
  • Configurable inference parameters - Temperature, top_p, max_tokens per run

Scoring System

  • Exact match - Precise string comparison
  • Regex match - Pattern-based validation
  • Numeric tolerance - Numerical comparison with configurable tolerance
  • Boolean match - Contains-based validation
  • LLM-as-Judge - AI-powered evaluation with customizable judge prompts
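
Conceptually, each method is a small pluggable scorer. A minimal sketch in TypeScript, using a hypothetical Scorer signature for illustration (the real implementations live in src/scoring/ and may be shaped differently):

    // Hypothetical scorer shape for illustration; see src/scoring/ for the real code.
    interface ScoreResult {
      pass: boolean;
      score: number; // normalized 0..1
    }

    type Scorer = (expected: string, actual: string) => ScoreResult;

    // Exact match: precise string comparison (whitespace-trimmed).
    const exactMatch: Scorer = (expected, actual) => {
      const pass = expected.trim() === actual.trim();
      return { pass, score: pass ? 1 : 0 };
    };

    // Numeric tolerance: values count as equal within a configurable epsilon.
    const numericTolerance = (tolerance: number): Scorer => (expected, actual) => {
      const pass = Math.abs(Number(expected) - Number(actual)) <= tolerance;
      return { pass, score: pass ? 1 : 0 };
    };

    // Example: 3.14159 vs 3.14 passes at tolerance 0.01 but fails at 0.001.
    console.log(numericTolerance(0.01)('3.14159', '3.14')); // { pass: true, score: 1 }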

AI-Assisted Tools

  • Benchmark Generator - AI-powered generation of complete test suites from descriptions
  • Test Case Generator - Automatically generate test cases based on system prompts
  • Prompt Enhancer - AI-assisted improvement of system and judge prompts

Analytics Dashboard

  • Model Leaderboard - Ranked performance comparison across all runs
  • Performance Trends - Track model improvements over time
  • Interesting Facts - AI-generated insights about benchmark results
  • Per-suite filtering - Analyze results by specific test suite

Data Management

  • SQLite persistence - Local database for test suites and run history
  • Data Vault - Inspect and patch the live JSON store directly
  • Import/Export - Reproducible benchmark configurations

User Experience

  • Monaco editor integration - Rich code editor for prompt authoring
  • Custom Tauri window - Native desktop experience with custom title bar
  • Auto-updates - Checks GitHub releases on startup and installs new versions
  • Dark/Light mode - Theme support for comfortable usage

Tech Stack

Frontend:

  • React 19 + TypeScript
  • Vite 7 build system
  • Tailwind CSS 4
  • Zustand state management
  • Radix UI primitives (custom-styled)
  • Monaco Editor
  • Lucide React icons

Backend:

  • Tauri (Rust)
  • SQLite (rusqlite)
  • Serde serialization

External APIs:

  • OpenRouter (LLM access)

Prerequisites

  • Node.js and npm (for the Vite + React frontend)
  • A Rust toolchain (required to build the Tauri desktop backend)
  • An OpenRouter API key (for LLM access)

Installation

  1. Clone the repository:

    git clone https://github.com/oshtz/Benchmaker.git
    cd Benchmaker

  2. Install dependencies:

    npm install

  3. Run in development mode:

    npm run tauri dev

  4. Build for production:

    npm run tauri build

Project Structure

benchmaker/
├── src/                          # React frontend
│   ├── components/
│   │   ├── arena/                # Test execution controls
│   │   ├── analytics/            # Analytics dashboard
│   │   ├── prompt-manager/       # Test suite creation & AI tools
│   │   ├── results/              # Results and reporting
│   │   ├── data/                 # Data management (Data Vault)
│   │   ├── settings/             # Settings UI
│   │   ├── layout/               # App layout components
│   │   └── ui/                   # Reusable UI primitives
│   ├── services/                 # Business logic
│   │   ├── analytics.ts          # Analytics computation
│   │   ├── benchmarkGenerator.ts # AI benchmark generation
│   │   ├── execution.ts          # Benchmark execution
│   │   ├── localDb.ts            # SQLite operations
│   │   ├── openrouter.ts         # OpenRouter API client
│   │   ├── promptEnhancer.ts     # AI prompt enhancement
│   │   ├── testCaseGenerator.ts  # AI test case generation
│   │   └── updater.ts            # Auto-update service
│   ├── stores/                   # Zustand state stores
│   │   ├── modelStore.ts         # Model selection state
│   │   ├── runStore.ts           # Benchmark run state
│   │   ├── settingsStore.ts      # App settings
│   │   ├── testSuiteStore.ts     # Test suite state
│   │   └── updateStore.ts        # Update status state
│   ├── scoring/                  # Scoring implementations
│   │   ├── exact-match.ts
│   │   ├── regex-match.ts
│   │   ├── numeric-tolerance.ts
│   │   └── llm-judge.ts
│   ├── types/                    # TypeScript definitions
│   └── lib/                      # Utilities
├── src-tauri/                    # Rust backend
│   ├── src/main.rs               # Tauri app + SQLite
│   └── tauri.conf.json           # Tauri configuration
├── package.json
├── tsconfig.json
└── vite.config.ts

Usage

1. Configure API Key

Enter your OpenRouter API key in the Settings (gear icon in header).

2. Create a Test Suite

  • Navigate to the Prompts tab
  • Define a system prompt that applies to all test cases
  • Add individual test cases with prompts and expected outputs
  • Configure scoring methods per test case (exact-match, regex, numeric, boolean, or llm-judge)
  • Optionally use AI tools to generate test cases or enhance prompts
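
For a sense of what a suite contains, here is a hypothetical shape (the actual TypeScript definitions live in src/types/index.ts and may differ):

    // Hypothetical types for illustration; see src/types/index.ts for the real ones.
    interface TestCase {
      prompt: string;
      expected: string;
      scoring: 'exact-match' | 'regex' | 'numeric' | 'boolean' | 'llm-judge';
    }

    interface TestSuite {
      name: string;
      systemPrompt: string;
      cases: TestCase[];
    }

    const arithmeticSuite: TestSuite = {
      name: 'Basic arithmetic',
      systemPrompt: 'Answer with only the final number, no explanation.',
      cases: [
        { prompt: 'What is 17 * 23?', expected: '391', scoring: 'exact-match' },
        { prompt: 'What is sqrt(2) to 3 decimal places?', expected: '1.414', scoring: 'numeric' },
      ],
    };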

3. Select Models

  • Go to the Arena tab
  • Select one or more LLMs from the model list
  • Configure inference parameters (temperature, top_p, max_tokens)
  • Optionally select a judge model for LLM-based scoring

4. Run Benchmark

  • Click Run to execute all test cases against selected models
  • Watch real-time streaming responses
  • View progress and status per model
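
Under the hood, streaming uses OpenRouter's OpenAI-compatible SSE endpoint. A minimal sketch of a streaming request with per-run inference parameters, for illustration only (the app's actual client lives in src/services/openrouter.ts):

    // Illustrative streaming call; not the app's actual implementation.
    async function streamCompletion(
      apiKey: string,
      model: string,
      prompt: string,
      onToken: (token: string) => void,
    ): Promise<void> {
      const res = await fetch('https://openrouter.ai/api/v1/chat/completions', {
        method: 'POST',
        headers: {
          Authorization: `Bearer ${apiKey}`,
          'Content-Type': 'application/json',
        },
        body: JSON.stringify({
          model, // an OpenRouter model ID
          messages: [{ role: 'user', content: prompt }],
          temperature: 0.7, // inference parameters, configurable per run
          top_p: 1,
          max_tokens: 1024,
          stream: true,
        }),
      });

      const reader = res.body!.getReader();
      const decoder = new TextDecoder();
      let buffer = '';
      for (;;) {
        const { done, value } = await reader.read();
        if (done) break;
        buffer += decoder.decode(value, { stream: true });
        const lines = buffer.split('\n');
        buffer = lines.pop() ?? ''; // keep any partial line for the next chunk
        for (const line of lines) {
          if (!line.startsWith('data: ')) continue; // skip SSE comments/keepalives
          const payload = line.slice('data: '.length).trim();
          if (payload === '[DONE]') return;
          const delta = JSON.parse(payload).choices?.[0]?.delta?.content;
          if (delta) onToken(delta);
        }
      }
    }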

5. Analyze Results

  • Switch to the Results tab for detailed response comparison
  • Use the Analytics tab for:
    • Model leaderboard rankings
    • Performance trends over time
    • AI-generated insights and interesting facts
  • Results are automatically saved for future reference

6. Manage Data

  • Use the Data tab to inspect the live JSON store
  • Patch data directly for reproducible runs
  • Export/import benchmark configurations

Updates

  • The app checks for updates on startup
  • Click the version button in the header (e.g. v0.1.4) to view update status, release notes, or manually re-check
  • Updates are pulled from GitHub Releases and expect a Benchmaker-Portable.exe asset on the latest tag
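
A rough sketch of what the startup check amounts to, assuming the public GitHub releases API (the real logic lives in src/services/updater.ts):

    // Sketch of an update check against GitHub's releases API; illustrative only.
    async function checkForUpdate(currentVersion: string): Promise<string | null> {
      const res = await fetch(
        'https://api.github.com/repos/oshtz/Benchmaker/releases/latest',
      );
      if (!res.ok) return null; // offline or rate-limited: skip quietly
      const release: { tag_name: string } = await res.json();
      const latest = release.tag_name.replace(/^v/, ''); // tags look like vX.Y.Z
      return latest !== currentVersion ? latest : null; // a differing tag signals an update
    }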

Development

Scripts

Command              Description
npm run dev          Start Vite dev server only
npm run build        Build frontend for production
npm run tauri dev    Run full Tauri development build
npm run tauri build  Build production desktop app

Release Notes

  • Version is sourced from package.json and src-tauri/tauri.conf.json
  • GitHub releases should be tagged as vX.Y.Z and include Benchmaker-Portable.exe

Architecture Notes

  • State Management: All application state is managed through Zustand stores in src/stores/
  • Services: API calls and business logic are encapsulated in src/services/
  • Scoring: Pluggable scoring system in src/scoring/ - add new scoring methods here
  • Types: Centralized TypeScript types in src/types/index.ts
  • Database: SQLite schema and migrations are in src-tauri/src/main.rs
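
As an illustration of the state-management pattern, a minimal Zustand store might look like this (hypothetical; the real stores in src/stores/ are richer):

    import { create } from 'zustand';

    // Hypothetical settings store for illustration; see src/stores/ for the real ones.
    interface SettingsState {
      apiKey: string;
      theme: 'dark' | 'light';
      setApiKey: (key: string) => void;
      toggleTheme: () => void;
    }

    export const useSettingsStore = create<SettingsState>((set) => ({
      apiKey: '',
      theme: 'dark',
      setApiKey: (key) => set({ apiKey: key }),
      toggleTheme: () =>
        set((s) => ({ theme: s.theme === 'dark' ? 'light' : 'dark' })),
    }));

Components subscribe with a selector, e.g. useSettingsStore((s) => s.theme), and re-render only when that slice changes.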

Contributing

Contributions are welcome! Here's how to get started:

  1. Fork the repository
  2. Create a feature branch:

    git checkout -b feature/your-feature-name

  3. Make your changes following the existing code style
  4. Test your changes thoroughly
  5. Commit with clear messages:

    git commit -m "Add: description of your feature"

  6. Push and create a Pull Request

Contribution Guidelines

  • Follow existing TypeScript/React patterns in the codebase
  • Keep components small and focused
  • Add types for all new code
  • Test changes with multiple models before submitting
  • Update documentation for new features

Areas Open for Contribution

  • Additional scoring plugins
  • Export functionality (JSON, CSV, PDF reports)
  • UI/UX improvements
  • Performance optimizations
  • Documentation and examples
  • Test coverage
  • Accessibility improvements

Roadmap

  • Code execution sandbox scoring
  • Cost-aware benchmarking (track API costs)
  • Prompt diff/comparison tools
  • Export to JSON/CSV/PDF
  • Public shareable benchmark URLs
  • CI-style automated regression runs
  • Team workspaces and collaboration
  • Model cost tracking and optimization suggestions

License

MIT License

Acknowledgments