A modular, visual LLM benchmarking desktop application for scientifically comparing multiple Large Language Models under identical conditions.
Benchmaker enables AI researchers, prompt engineers, and technical teams to:
- Define standardized test suites with system prompts and test cases
- Select and compare multiple LLMs via the OpenRouter API
- Execute benchmarks in parallel across all selected models
- Score results using objective rules and/or LLM judge models
- Persist and compare results for regression testing and historical analysis
- Analyze performance trends with comprehensive analytics dashboard
Execution:
- Multi-model parallel execution - Run the same prompts against multiple LLMs simultaneously
- Real-time response streaming - Watch responses as they're generated
- Configurable inference parameters - Temperature, top_p, max_tokens per run

Scoring methods:
- Exact match - Precise string comparison
- Regex match - Pattern-based validation
- Numeric tolerance - Numerical comparison with configurable tolerance
- Boolean match - Contains-based validation
- LLM-as-Judge - AI-powered evaluation with customizable judge prompts

AI-assisted tools:
- Benchmark Generator - AI-powered generation of complete test suites from descriptions
- Test Case Generator - Automatically generate test cases based on system prompts
- Prompt Enhancer - AI-assisted improvement of system and judge prompts

Analytics:
- Model Leaderboard - Ranked performance comparison across all runs
- Performance Trends - Track model improvements over time
- Interesting Facts - AI-generated insights about benchmark results
- Per-suite filtering - Analyze results by specific test suite

Data:
- SQLite persistence - Local database for test suites and run history
- Data Vault - Inspect and patch the live JSON store directly
- Import/Export - Reproducible benchmark configurations

Desktop:
- Monaco editor integration - Rich code editor for prompt authoring
- Custom Tauri window - Native desktop experience with custom title bar
- Auto-updates - Checks GitHub releases on startup and installs new versions
- Dark/Light mode - Theme support for comfortable usage
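The rule-based scoring methods above can be sketched as small predicate functions. This is a hypothetical shape for illustration only; the actual implementations live in `src/scoring/` and may differ:

```typescript
// Hypothetical sketches of the rule-based scorers; the real code is in src/scoring/.

// Exact match: precise string comparison (after trimming surrounding whitespace).
function exactMatch(response: string, expected: string): boolean {
  return response.trim() === expected.trim();
}

// Regex match: pattern-based validation.
function regexMatch(response: string, pattern: string): boolean {
  return new RegExp(pattern).test(response);
}

// Numeric tolerance: numerical comparison with a configurable tolerance.
function numericTolerance(response: string, expected: number, tolerance = 1e-6): boolean {
  const value = parseFloat(response);
  return Number.isFinite(value) && Math.abs(value - expected) <= tolerance;
}

// Boolean match: contains-based validation (case-insensitive here by assumption).
function booleanMatch(response: string, expected: string): boolean {
  return response.toLowerCase().includes(expected.toLowerCase());
}
```

The LLM-as-Judge method is the odd one out: instead of a pure predicate, it sends the response and a judge prompt to a second model and parses its verdict.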
Frontend:
- React 19 + TypeScript
- Vite 7 build system
- Tailwind CSS 4
- Zustand state management
- Radix UI primitives (custom-styled)
- Monaco Editor
- Lucide React icons
Backend:
- Tauri (Rust)
- SQLite (rusqlite)
- Serde serialization
External APIs:
- OpenRouter (LLM access)
Prerequisites:
- Node.js (v18 or higher)
- Rust (latest stable)
- pnpm or npm
- An OpenRouter API key
1. Clone the repository:

   ```shell
   git clone https://github.com/oshtz/Benchmaker.git
   cd benchmaker
   ```

2. Install dependencies:

   ```shell
   npm install
   ```

3. Run in development mode:

   ```shell
   npm run tauri dev
   ```

4. Build for production:

   ```shell
   npm run tauri build
   ```
```
benchmaker/
├── src/                          # React frontend
│   ├── components/
│   │   ├── arena/                # Test execution controls
│   │   ├── analytics/            # Analytics dashboard
│   │   ├── prompt-manager/       # Test suite creation & AI tools
│   │   ├── results/              # Results and reporting
│   │   ├── data/                 # Data management (Data Vault)
│   │   ├── settings/             # Settings UI
│   │   ├── layout/               # App layout components
│   │   └── ui/                   # Reusable UI primitives
│   ├── services/                 # Business logic
│   │   ├── analytics.ts          # Analytics computation
│   │   ├── benchmarkGenerator.ts # AI benchmark generation
│   │   ├── execution.ts          # Benchmark execution
│   │   ├── localDb.ts            # SQLite operations
│   │   ├── openrouter.ts         # OpenRouter API client
│   │   ├── promptEnhancer.ts     # AI prompt enhancement
│   │   ├── testCaseGenerator.ts  # AI test case generation
│   │   └── updater.ts            # Auto-update service
│   ├── stores/                   # Zustand state stores
│   │   ├── modelStore.ts         # Model selection state
│   │   ├── runStore.ts           # Benchmark run state
│   │   ├── settingsStore.ts      # App settings
│   │   ├── testSuiteStore.ts     # Test suite state
│   │   └── updateStore.ts        # Update status state
│   ├── scoring/                  # Scoring implementations
│   │   ├── exact-match.ts
│   │   ├── regex-match.ts
│   │   ├── numeric-tolerance.ts
│   │   └── llm-judge.ts
│   ├── types/                    # TypeScript definitions
│   └── lib/                      # Utilities
├── src-tauri/                    # Rust backend
│   ├── src/main.rs               # Tauri app + SQLite
│   └── tauri.conf.json           # Tauri configuration
├── package.json
├── tsconfig.json
└── vite.config.ts
```
Enter your OpenRouter API key in the Settings (gear icon in header).
Create a test suite:
- Navigate to the Prompts tab
- Define a system prompt that applies to all test cases
- Add individual test cases with prompts and expected outputs
- Configure scoring methods per test case (exact-match, regex, numeric, boolean, or llm-judge)
- Optionally use AI tools to generate test cases or enhance prompts
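The steps above produce a suite with one shared system prompt and per-case scoring. The sketch below shows a hypothetical shape for that data; the real types live in `src/types/index.ts` and may use different field names:

```typescript
// Hypothetical test suite shape for illustration; see src/types/index.ts for
// the actual definitions.
type ScoringMethod = "exact-match" | "regex" | "numeric" | "boolean" | "llm-judge";

interface TestCase {
  prompt: string;         // sent as the user message
  expected: string;       // expected output, interpreted per scoring method
  scoring: ScoringMethod;
}

interface TestSuite {
  name: string;
  systemPrompt: string;   // applies to every test case in the suite
  testCases: TestCase[];
}

const suite: TestSuite = {
  name: "Arithmetic sanity",
  systemPrompt: "Answer with only the final number.",
  testCases: [
    { prompt: "What is 17 * 3?", expected: "51", scoring: "exact-match" },
    { prompt: "Square root of 2, to 3 decimals?", expected: "1.414", scoring: "numeric" },
  ],
};
```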
Run a benchmark:
- Go to the Arena tab
- Select one or more LLMs from the model list
- Configure inference parameters (temperature, top_p, max_tokens)
- Optionally select a judge model for LLM-based scoring
- Click Run to execute all test cases against selected models
- Watch real-time streaming responses
- View progress and status per model
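The run flow above, minus streaming and error handling, can be sketched against OpenRouter's chat-completions endpoint. This is a hedged sketch under assumed names; the real client lives in `src/services/openrouter.ts` and `src/services/execution.ts`:

```typescript
// Minimal sketch of fanning one prompt out to several models in parallel via
// the OpenRouter chat-completions API. Streaming and error handling omitted.
interface RunParams {
  temperature: number;
  top_p: number;
  max_tokens: number;
}

// Every model receives an identical request body, which is what makes the
// comparison fair.
function buildRequest(model: string, systemPrompt: string, userPrompt: string, params: RunParams) {
  return {
    model,
    messages: [
      { role: "system", content: systemPrompt },
      { role: "user", content: userPrompt },
    ],
    ...params,
  };
}

async function runAcrossModels(models: string[], systemPrompt: string, prompt: string,
                               params: RunParams, apiKey: string) {
  // Promise.all issues all requests concurrently, one per model.
  return Promise.all(models.map(async (model) => {
    const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
      method: "POST",
      headers: { Authorization: `Bearer ${apiKey}`, "Content-Type": "application/json" },
      body: JSON.stringify(buildRequest(model, systemPrompt, prompt, params)),
    });
    const data = await res.json();
    return { model, text: data.choices[0].message.content as string };
  }));
}
```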
Review results:
- Switch to the Results tab for detailed response comparison
- Use the Analytics tab for:
- Model leaderboard rankings
- Performance trends over time
- AI-generated insights and interesting facts
- Results are automatically saved for future reference
Manage data:
- Use the Data tab to inspect the live JSON store
- Patch data directly for reproducible runs
- Export/import benchmark configurations
- The app checks for updates on startup
- Click the version button in the header (e.g. `v0.1.4`) to view update status, release notes, or manually re-check
- Updates are pulled from GitHub Releases and expect a `Benchmaker-Portable.exe` asset on the latest tag
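An update check of this kind typically boils down to comparing the running version with the latest release tag. A hypothetical sketch of that comparison (the actual logic is in `src/services/updater.ts`):

```typescript
// Hypothetical semver comparison for an update check: returns true when the
// latest GitHub release tag (vX.Y.Z) is newer than the running version.
function isNewer(latestTag: string, currentVersion: string): boolean {
  const parse = (v: string) => v.replace(/^v/, "").split(".").map(Number);
  const [latest, current] = [parse(latestTag), parse(currentVersion)];
  for (let i = 0; i < 3; i++) {
    const a = latest[i] ?? 0;
    const b = current[i] ?? 0;
    if (a !== b) return a > b; // first differing component decides
  }
  return false; // identical versions are not "newer"
}
```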
| Command | Description |
|---|---|
| `npm run dev` | Start Vite dev server only |
| `npm run build` | Build frontend for production |
| `npm run tauri dev` | Run full Tauri development build |
| `npm run tauri build` | Build production desktop app |
- Version is sourced from `package.json` and `src-tauri/tauri.conf.json`
- GitHub releases should be tagged as `vX.Y.Z` and include `Benchmaker-Portable.exe`
- State Management: All application state is managed through Zustand stores in `src/stores/`
- Services: API calls and business logic are encapsulated in `src/services/`
- Scoring: Pluggable scoring system in `src/scoring/` - add new scoring methods here
- Types: Centralized TypeScript types in `src/types/index.ts`
- Database: SQLite schema and migrations are in `src-tauri/src/main.rs`
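A pluggable scoring system usually means new methods can be added without touching the executor. The interface names below are assumptions for illustration; check `src/scoring/` for the actual contract:

```typescript
// Hypothetical scorer registry illustrating the pluggable design; the real
// interface in src/scoring/ may differ.
interface Scorer {
  id: string;
  score(response: string, expected: string): boolean;
}

const scorers = new Map<string, Scorer>();

function registerScorer(scorer: Scorer): void {
  scorers.set(scorer.id, scorer);
}

// Adding a new scoring method is one small object:
registerScorer({
  id: "case-insensitive-match",
  score: (response, expected) =>
    response.trim().toLowerCase() === expected.trim().toLowerCase(),
});

// The executor looks scorers up by id, so it never needs to know about
// individual methods.
function scoreWith(id: string, response: string, expected: string): boolean {
  const scorer = scorers.get(id);
  if (!scorer) throw new Error(`Unknown scoring method: ${id}`);
  return scorer.score(response, expected);
}
```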
Contributions are welcome! Here's how to get started:
- Fork the repository
- Create a feature branch: `git checkout -b feature/your-feature-name`
- Make your changes following the existing code style
- Test your changes thoroughly
- Commit with clear messages: `git commit -m "Add: description of your feature"`
- Push and create a Pull Request
Guidelines:
- Follow existing TypeScript/React patterns in the codebase
- Keep components small and focused
- Add types for all new code
- Test changes with multiple models before submitting
- Update documentation for new features
Areas for contribution:
- Additional scoring plugins
- Export functionality (JSON, CSV, PDF reports)
- UI/UX improvements
- Performance optimizations
- Documentation and examples
- Test coverage
- Accessibility improvements
Ideas for the future:
- Code execution sandbox scoring
- Cost-aware benchmarking (track API costs)
- Prompt diff/comparison tools
- Export to JSON/CSV/PDF
- Public shareable benchmark URLs
- CI-style automated regression runs
- Team workspaces and collaboration
- Model cost tracking and optimization suggestions
Thanks to:
- OpenRouter for unified LLM API access
- Tauri for the desktop framework
- Radix UI for accessible component primitives
- Monaco Editor for the code editing experience
- Lucide for beautiful icons
