berkdurmus commented on Jan 25, 2026

Summary

[Screenshots: AI Eval Playground comparison UI]
  • Add new AI Eval Playground utility for comparing AI model outputs and prompts
  • Implement BYOK (Bring Your Own Key) support for OpenAI, Anthropic, and Google AI
  • Build LLM-as-judge scoring system with configurable criteria weights
  • Create clean, table-based comparison UI following Linear.app design patterns

Features

Comparison Modes

  • Model vs Model: Compare 2-4 models with the same prompt
  • Prompt vs Prompt: Compare 2-4 prompt variations with the same model
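
Roughly, the two modes above map to a configuration shape like the following. This is a sketch only; the `EvalConfig` type and field names are assumptions, not the actual types in `EvalConfigPanel.tsx`:

```ts
// Hypothetical config shape for the two comparison modes (type and field
// names are assumptions, not the actual code in EvalConfigPanel.tsx).
type ComparisonMode = "model-vs-model" | "prompt-vs-prompt";

interface ModelRef {
  provider: "openai" | "anthropic" | "google";
  model: string; // e.g. "gpt-4o" or "claude-3-5-sonnet"
}

interface EvalConfig {
  mode: ComparisonMode;
  models: ModelRef[]; // 2-4 entries in model-vs-model mode, 1 otherwise
  prompts: string[];  // 2-4 entries in prompt-vs-prompt mode, 1 otherwise
}
```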

Supported Providers

Provider     Models
OpenAI       GPT-4o, GPT-4o Mini, GPT-4 Turbo, GPT-3.5 Turbo
Anthropic    Claude 3.5 Sonnet, Claude 3.5 Haiku, Claude 3 Opus
Google AI    Gemini 2.0 Flash, Gemini 1.5 Pro, Gemini 1.5 Flash
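
Each BYOK call goes straight from the browser to the provider's public REST API. A minimal sketch of what the OpenAI branch in `ai-eval-providers.ts` could look like (function and parameter names are assumptions; the endpoint and payload follow OpenAI's documented chat-completions API):

```ts
// Sketch of a BYOK provider call: the key comes from sessionStorage and is
// sent directly from the browser to the provider's REST API. Function and
// parameter names are assumptions.
async function runOpenAI(apiKey: string, model: string, prompt: string): Promise<string> {
  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify({
      model, // e.g. "gpt-4o-mini"
      messages: [{ role: "user", content: prompt }],
    }),
  });
  if (!res.ok) throw new Error(`OpenAI request failed: ${res.status}`);
  const data = await res.json();
  return data.choices[0].message.content; // assistant's text response
}
```

The Anthropic and Google AI branches would work the same way against their respective REST endpoints (`/v1/messages` and `models/{model}:generateContent`).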

LLM-as-Judge Scoring

  • 5 evaluation criteria: Accuracy, Relevance, Clarity, Completeness, Conciseness
  • Adjustable weight sliders for custom scoring emphasis
  • Pairwise comparison with winner detection
  • Visual score badges and breakdown bars
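
The combined score is a normalized weighted average of the per-criterion scores. A minimal sketch, assuming a 1-10 score per criterion from the judge model and 0-1 slider weights (names are illustrative, not the actual code in `ai-eval-judge.ts`):

```ts
// Hypothetical scoring helpers: the judge model returns a 1-10 score per
// criterion, and the weight sliders decide how much each criterion counts.
type Criterion = "accuracy" | "relevance" | "clarity" | "completeness" | "conciseness";

type Scores = Record<Criterion, number>;  // per-criterion scores from the judge (1-10)
type Weights = Record<Criterion, number>; // per-criterion weights from the sliders (0-1)

function weightedScore(scores: Scores, weights: Weights): number {
  const criteria = Object.keys(scores) as Criterion[];
  const totalWeight = criteria.reduce((sum, c) => sum + weights[c], 0);
  if (totalWeight === 0) return 0; // all sliders at zero
  const weighted = criteria.reduce((sum, c) => sum + scores[c] * weights[c], 0);
  return weighted / totalWeight; // normalized weighted average
}

// Pairwise winner detection: the candidate with the higher weighted score wins.
function pickWinner(a: Scores, b: Scores, weights: Weights): "A" | "B" | "tie" {
  const scoreA = weightedScore(a, weights);
  const scoreB = weightedScore(b, weights);
  return scoreA === scoreB ? "tie" : scoreA > scoreB ? "A" : "B";
}
```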

Security

  • API keys are stored in sessionStorage only (cleared when the browser session ends)
  • All requests run client-side, directly from the browser to the provider APIs
  • Keys are never sent to any application server
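
A rough sketch of the storage model behind `useApiKeys.ts` (hook, type, and storage-key names are assumptions; the point is that keys live only in `sessionStorage`):

```ts
import { useCallback, useState } from "react";

// Hypothetical hook: provider keys live in sessionStorage only, so the
// browser drops them when the session ends; nothing is persisted or sent
// to any application server.
type Provider = "openai" | "anthropic" | "google";

const readKey = (provider: Provider): string =>
  typeof window === "undefined"
    ? "" // no sessionStorage during server-side rendering
    : sessionStorage.getItem(`ai-eval-key-${provider}`) ?? "";

export function useApiKeys() {
  const [keys, setKeys] = useState<Record<Provider, string>>(() => ({
    openai: readKey("openai"),
    anthropic: readKey("anthropic"),
    google: readKey("google"),
  }));

  const setKey = useCallback((provider: Provider, key: string) => {
    sessionStorage.setItem(`ai-eval-key-${provider}`, key);
    setKeys((prev) => ({ ...prev, [provider]: key }));
  }, []);

  return { keys, setKey };
}
```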

Files Added

components/
├── ai-eval/
│   ├── ApiKeyDialog.tsx
│   ├── EvalComparisonGrid.tsx
│   ├── EvalConfigPanel.tsx
│   ├── EvalJudgePanel.tsx
│   ├── EvalModelSelector.tsx
│   ├── EvalResultCell.tsx
│   └── EvalScoreDisplay.tsx
├── hooks/
│   └── useApiKeys.ts
└── utils/
    ├── ai-eval-judge.ts
    ├── ai-eval-providers.ts
    ├── ai-eval-schemas.ts
    └── ai-eval-schemas.test.ts

pages/utilities/
└── ai-eval.tsx
