Benchmaker

A modular, visual LLM benchmarking desktop application for scientifically comparing multiple Large Language Models under identical conditions.

Overview

Benchmaker enables AI researchers, prompt engineers, and technical teams to:

  • Define standardized test suites with system prompts and test cases
  • Select and compare multiple LLMs via the OpenRouter API
  • Execute benchmarks in parallel across all selected models
  • Score results using objective rules and/or LLM judge models
  • Persist and compare results for regression testing and historical analysis
  • Analyze performance trends with a comprehensive analytics dashboard

Features

Core Benchmarking

  • Multi-model parallel execution - Run the same prompts against multiple LLMs simultaneously
  • Real-time response streaming - Watch responses as they're generated
  • Configurable inference parameters - Temperature, top_p, max_tokens per run

Scoring System

  • Exact match - Precise string comparison
  • Regex match - Pattern-based validation
  • Numeric tolerance - Numerical comparison with configurable tolerance
  • Boolean match - Contains-based validation
  • LLM-as-Judge - AI-powered evaluation with customizable judge prompts
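
Conceptually, each method is a small pluggable scorer. A minimal sketch in TypeScript, using a hypothetical Scorer signature for illustration (the real implementations live in src/scoring/ and may be shaped differently):

    // Hypothetical scorer shape for illustration; see src/scoring/ for the real code.
    interface ScoreResult {
      pass: boolean;
      score: number; // normalized 0..1
    }

    type Scorer = (expected: string, actual: string) => ScoreResult;

    // Exact match: precise string comparison (whitespace-trimmed).
    const exactMatch: Scorer = (expected, actual) => {
      const pass = expected.trim() === actual.trim();
      return { pass, score: pass ? 1 : 0 };
    };

    // Numeric tolerance: values count as equal within a configurable epsilon.
    const numericTolerance = (tolerance: number): Scorer => (expected, actual) => {
      const pass = Math.abs(Number(expected) - Number(actual)) <= tolerance;
      return { pass, score: pass ? 1 : 0 };
    };

    // Example: 3.14159 vs 3.14 passes at tolerance 0.01 but fails at 0.001.
    console.log(numericTolerance(0.01)('3.14159', '3.14')); // { pass: true, score: 1 }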

AI-Assisted Tools

  • Benchmark Generator - AI-powered generation of complete test suites from descriptions
  • Test Case Generator - Automatically generate test cases based on system prompts
  • Prompt Enhancer - AI-assisted improvement of system and judge prompts

Analytics Dashboard

  • Model Leaderboard - Ranked performance comparison across all runs
  • Performance Trends - Track model improvements over time
  • Interesting Facts - AI-generated insights about benchmark results
  • Per-suite filtering - Analyze results by specific test suite

Data Management

  • SQLite persistence - Local database for test suites and run history
  • Data Vault - Inspect and patch the live JSON store directly
  • Import/Export - Reproducible benchmark configurations

User Experience

  • Monaco editor integration - Rich code editor for prompt authoring
  • Custom Tauri window - Native desktop experience with custom title bar
  • Auto-updates - Checks GitHub releases on startup and installs new versions
  • Dark/Light mode - Theme support for comfortable usage

Tech Stack

Frontend:

  • React 19 + TypeScript
  • Vite 7 build system
  • Tailwind CSS 4
  • Zustand state management
  • Radix UI primitives (custom-styled)
  • Monaco Editor
  • Lucide React icons

Backend:

  • Tauri (Rust)
  • SQLite (rusqlite)
  • Serde serialization

External APIs:

  • OpenRouter (LLM access)

Prerequisites

  • Node.js and npm (for the Vite + React frontend)
  • A Rust toolchain (required to build the Tauri desktop backend)
  • An OpenRouter API key (for LLM access)

Installation

  1. Clone the repository:

    git clone https://github.com/oshtz/Benchmaker.git
    cd Benchmaker

  2. Install dependencies:

    npm install

  3. Run in development mode:

    npm run tauri dev

  4. Build for production:

    npm run tauri build

Project Structure

benchmaker/
├── src/                          # React frontend
│   ├── components/
│   │   ├── arena/                # Test execution controls
│   │   ├── analytics/            # Analytics dashboard
│   │   ├── prompt-manager/       # Test suite creation & AI tools
│   │   ├── results/              # Results and reporting
│   │   ├── data/                 # Data management (Data Vault)
│   │   ├── settings/             # Settings UI
│   │   ├── layout/               # App layout components
│   │   └── ui/                   # Reusable UI primitives
│   ├── services/                 # Business logic
│   │   ├── analytics.ts          # Analytics computation
│   │   ├── benchmarkGenerator.ts # AI benchmark generation
│   │   ├── execution.ts          # Benchmark execution
│   │   ├── localDb.ts            # SQLite operations
│   │   ├── openrouter.ts         # OpenRouter API client
│   │   ├── promptEnhancer.ts     # AI prompt enhancement
│   │   ├── testCaseGenerator.ts  # AI test case generation
│   │   └── updater.ts            # Auto-update service
│   ├── stores/                   # Zustand state stores
│   │   ├── modelStore.ts         # Model selection state
│   │   ├── runStore.ts           # Benchmark run state
│   │   ├── settingsStore.ts      # App settings
│   │   ├── testSuiteStore.ts     # Test suite state
│   │   └── updateStore.ts        # Update status state
│   ├── scoring/                  # Scoring implementations
│   │   ├── exact-match.ts
│   │   ├── regex-match.ts
│   │   ├── numeric-tolerance.ts
│   │   └── llm-judge.ts
│   ├── types/                    # TypeScript definitions
│   └── lib/                      # Utilities
├── src-tauri/                    # Rust backend
│   ├── src/main.rs               # Tauri app + SQLite
│   └── tauri.conf.json           # Tauri configuration
├── package.json
├── tsconfig.json
└── vite.config.ts

Usage

1. Configure API Key

Enter your OpenRouter API key in the Settings (gear icon in header).

2. Create a Test Suite

  • Navigate to the Prompts tab
  • Define a system prompt that applies to all test cases
  • Add individual test cases with prompts and expected outputs
  • Configure scoring methods per test case (exact-match, regex, numeric, boolean, or llm-judge)
  • Optionally use AI tools to generate test cases or enhance prompts
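
For a sense of what a suite contains, here is a hypothetical shape (the actual TypeScript definitions live in src/types/index.ts and may differ):

    // Hypothetical types for illustration; see src/types/index.ts for the real ones.
    interface TestCase {
      prompt: string;
      expected: string;
      scoring: 'exact-match' | 'regex' | 'numeric' | 'boolean' | 'llm-judge';
    }

    interface TestSuite {
      name: string;
      systemPrompt: string;
      cases: TestCase[];
    }

    const arithmeticSuite: TestSuite = {
      name: 'Basic arithmetic',
      systemPrompt: 'Answer with only the final number, no explanation.',
      cases: [
        { prompt: 'What is 17 * 23?', expected: '391', scoring: 'exact-match' },
        { prompt: 'What is sqrt(2) to 3 decimal places?', expected: '1.414', scoring: 'numeric' },
      ],
    };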

3. Select Models

  • Go to the Arena tab
  • Select one or more LLMs from the model list
  • Configure inference parameters (temperature, top_p, max_tokens)
  • Optionally select a judge model for LLM-based scoring

4. Run Benchmark

  • Click Run to execute all test cases against selected models
  • Watch real-time streaming responses
  • View progress and status per model
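
Under the hood, streaming uses OpenRouter's OpenAI-compatible SSE endpoint. A minimal sketch of a streaming request with per-run inference parameters, for illustration only (the app's actual client lives in src/services/openrouter.ts):

    // Illustrative streaming call; not the app's actual implementation.
    async function streamCompletion(
      apiKey: string,
      model: string,
      prompt: string,
      onToken: (token: string) => void,
    ): Promise<void> {
      const res = await fetch('https://openrouter.ai/api/v1/chat/completions', {
        method: 'POST',
        headers: {
          Authorization: `Bearer ${apiKey}`,
          'Content-Type': 'application/json',
        },
        body: JSON.stringify({
          model, // an OpenRouter model ID
          messages: [{ role: 'user', content: prompt }],
          temperature: 0.7, // inference parameters, configurable per run
          top_p: 1,
          max_tokens: 1024,
          stream: true,
        }),
      });

      const reader = res.body!.getReader();
      const decoder = new TextDecoder();
      let buffer = '';
      for (;;) {
        const { done, value } = await reader.read();
        if (done) break;
        buffer += decoder.decode(value, { stream: true });
        const lines = buffer.split('\n');
        buffer = lines.pop() ?? ''; // keep any partial line for the next chunk
        for (const line of lines) {
          if (!line.startsWith('data: ')) continue; // skip SSE comments/keepalives
          const payload = line.slice('data: '.length).trim();
          if (payload === '[DONE]') return;
          const delta = JSON.parse(payload).choices?.[0]?.delta?.content;
          if (delta) onToken(delta);
        }
      }
    }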

5. Analyze Results

  • Switch to the Results tab for detailed response comparison
  • Use the Analytics tab for:
    • Model leaderboard rankings
    • Performance trends over time
    • AI-generated insights and interesting facts
  • Results are automatically saved for future reference

6. Manage Data

  • Use the Data tab to inspect the live JSON store
  • Patch data directly for reproducible runs
  • Export/import benchmark configurations

Updates

  • The app checks for updates on startup
  • Click the version button in the header (e.g. v0.1.4) to view update status, release notes, or manually re-check
  • Updates are pulled from GitHub Releases and expect a Benchmaker-Portable.exe asset on the latest tag
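
A rough sketch of what the startup check amounts to, assuming the public GitHub releases API (the real logic lives in src/services/updater.ts):

    // Sketch of an update check against GitHub's releases API; illustrative only.
    async function checkForUpdate(currentVersion: string): Promise<string | null> {
      const res = await fetch(
        'https://api.github.com/repos/oshtz/Benchmaker/releases/latest',
      );
      if (!res.ok) return null; // offline or rate-limited: skip quietly
      const release: { tag_name: string } = await res.json();
      const latest = release.tag_name.replace(/^v/, ''); // tags look like vX.Y.Z
      return latest !== currentVersion ? latest : null; // a differing tag signals an update
    }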

Development

Scripts

Command              Description
npm run dev          Start Vite dev server only
npm run build        Build frontend for production
npm run tauri dev    Run full Tauri development build
npm run tauri build  Build production desktop app

Release Notes

  • Version is sourced from package.json and src-tauri/tauri.conf.json
  • GitHub releases should be tagged as vX.Y.Z and include Benchmaker-Portable.exe

Architecture Notes

  • State Management: All application state is managed through Zustand stores in src/stores/
  • Services: API calls and business logic are encapsulated in src/services/
  • Scoring: Pluggable scoring system in src/scoring/ - add new scoring methods here
  • Types: Centralized TypeScript types in src/types/index.ts
  • Database: SQLite schema and migrations are in src-tauri/src/main.rs
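
As an illustration of the state-management pattern, a minimal Zustand store might look like this (hypothetical; the real stores in src/stores/ are richer):

    import { create } from 'zustand';

    // Hypothetical settings store for illustration; see src/stores/ for the real ones.
    interface SettingsState {
      apiKey: string;
      theme: 'dark' | 'light';
      setApiKey: (key: string) => void;
      toggleTheme: () => void;
    }

    export const useSettingsStore = create<SettingsState>((set) => ({
      apiKey: '',
      theme: 'dark',
      setApiKey: (key) => set({ apiKey: key }),
      toggleTheme: () =>
        set((s) => ({ theme: s.theme === 'dark' ? 'light' : 'dark' })),
    }));

Components subscribe with a selector, e.g. useSettingsStore((s) => s.theme), and re-render only when that slice changes.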

Contributing

Contributions are welcome! Here's how to get started:

  1. Fork the repository
  2. Create a feature branch:

    git checkout -b feature/your-feature-name

  3. Make your changes following the existing code style
  4. Test your changes thoroughly
  5. Commit with clear messages:

    git commit -m "Add: description of your feature"

  6. Push and create a Pull Request

Contribution Guidelines

  • Follow existing TypeScript/React patterns in the codebase
  • Keep components small and focused
  • Add types for all new code
  • Test changes with multiple models before submitting
  • Update documentation for new features

Areas Open for Contribution

  • Additional scoring plugins
  • Export functionality (JSON, CSV, PDF reports)
  • UI/UX improvements
  • Performance optimizations
  • Documentation and examples
  • Test coverage
  • Accessibility improvements

Roadmap

  • Code execution sandbox scoring
  • Cost-aware benchmarking (track API costs)
  • Prompt diff/comparison tools
  • Export to JSON/CSV/PDF
  • Public shareable benchmark URLs
  • CI-style automated regression runs
  • Team workspaces and collaboration
  • Model cost tracking and optimization suggestions

License

MIT License

Acknowledgments