A lightweight, flexible evaluation ('eval') framework for testing models with automated judging, supporting Gemini, Anthropic, OpenAI, and Ollama.
Step 1: Build the Question for the Judge
We create a prompt that asks another AI model (the judge) to evaluate the first model's answer; a minimal sketch of this step follows the list below. This prompt contains:
- The rules for what counts as a good answer (you can customize these or use the defaults)
- What the correct answer should be
- What the model actually said
- Simple instructions telling the judge how to evaluate
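A minimal sketch of how such a prompt might be assembled. The {{criteria}}, {{expected}}, and {{actual}} placeholders match the judge-prompt templates shown later in this README; the surrounding wording and the build_judge_prompt helper are illustrative, not the framework's exact internals.

```python
# Illustrative sketch only: the real template is stored in the database
# (see the judge prompt section later in this README) and may differ.
DEFAULT_JUDGE_TEMPLATE = """You are an expert evaluator comparing two text outputs.
EVALUATION CRITERIA: {{criteria}}
EXPECTED OUTPUT: {{expected}}
ACTUAL OUTPUT: {{actual}}
INSTRUCTIONS:
1. Carefully compare both outputs
2. Provide your verdict as the first line: "Verdict: PASS" or "Verdict: FAIL"
3. Then explain your reasoning in 2-3 sentences."""


def build_judge_prompt(criteria: str, expected: str, actual: str,
                       template: str = DEFAULT_JUDGE_TEMPLATE) -> str:
    """Fill the template placeholders with the criteria, expected answer, and model output."""
    return (template
            .replace("{{criteria}}", criteria)
            .replace("{{expected}}", expected)
            .replace("{{actual}}", actual))


judge_prompt = build_judge_prompt(
    criteria="Does the output correctly name the capital city?",
    expected="Paris",
    actual="The capital of France is **Paris**.",
)
```

In the real service the template text comes from the currently active judge prompt stored in the database.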
Step 2: Send to the Right AI Service
- We figure out which AI service to use (like Anthropic, OpenAI, etc.) from the judge model name
- We send the evaluation question to that service (using the same system we used for the first model)
- We track how long it takes and how many tokens it uses
Step 3: Understand the Judge's Answer
We read the judge model's response and pull out:
- The decision: Did it pass or fail? (We look for words like "PASS", "FAIL", "yes", or "no")
- The explanation: Why did the judge decide this?
If we can't tell what the verdict is, we mark it as "Uncertain" (see the parsing sketch below).
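A sketch of the verdict parsing described above, assuming the simple keyword approach; the framework's actual parser may differ in detail.

```python
def parse_verdict(judge_response: str) -> tuple[str, str]:
    """Extract (verdict, reasoning) from a raw judge response.

    Looks for PASS/FAIL (or yes/no) keywords; anything else is "Uncertain".
    """
    text = judge_response.strip()
    first_line = text.splitlines()[0].lower() if text else ""
    if "pass" in first_line or first_line.startswith("yes"):
        verdict = "Pass"
    elif "fail" in first_line or first_line.startswith("no"):
        verdict = "Fail"
    else:
        verdict = "Uncertain"
    return verdict, text


verdict, reasoning = parse_verdict(
    "Verdict: PASS\n\nThe actual output correctly names Paris as the capital city."
)
print(verdict)  # Pass
```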
- SQLite database for saving history
- Specify LLM provider for both model and judge
- Batch evaluations to multiple providers/models
- API endpoints for developers to consume
- Built-in GUI and results dashboard
- Additional evaluation criteria options (exact match, semantic similarity, etc.)
- Python SDK available
- Real-time WebSocket updates
If you want to use 'evaluate' via your own Python scripts or Jupyter Notebooks, you can use the SDK:
https://pypi.org/project/llmeval-sdk/
(Example usage with Python is shown on that PyPI page.)
You'll need:
- Docker (recommended) OR Rust/Cargo
- API keys for your LLM provider(s)
If you use Ollama:
ollama pull llama3
Create a .env file in your project root (see env.example):
DATABASE_URL=sqlite:./data/evals.db
GEMINI_API_BASE=https://generativelanguage.googleapis.com
GEMINI_API_KEY=AIzaxxxxxxxxxxxxxxxxxxxxxxxxxxc
GEMINI_MODELS=gemini-2.5-pro,gemini-2.5-flash
OLLAMA_API_BASE=http://host.docker.internal:11434
OPENAI_API_BASE=https://api.openai.com/v1
OPENAI_API_KEY=sk-proj-xxxxxxxxxxxxxxxxxxxxx
OPENAI_MODELS=gpt-4o,gpt-4o-mini,gpt-3.5-turbo
ANTHROPIC_API_KEY=sk-placeholder-ant-a1b2c3d4e5f6-a1b2c3d4e5f6-a1b2c3d4e5f6-a1b2c3d4e5f6
ANTHROPIC_MODELS=claude-opus-4,claude-sonnet-4-5,claude-haiku-4
RUST_LOG=info
Build the image:
docker build -t evaluate:latest .
Run on Linux:
docker run --rm -it \
--network host \
--env-file .env \
-v $(pwd)/data:/usr/local/bin/data \
-e OLLAMA_API_BASE=http://localhost:11434 \
evaluate:latest
Run on Mac:
docker run --rm -it -p 8080:8080 \
--env-file .env \
-v $(pwd)/data:/usr/local/bin/data \
evaluate:latest
Run on Windows (PowerShell):
docker run --rm -it -p 8080:8080 `
--env-file .env `
-v ${PWD}/data:/usr/local/bin/data `
evaluate:latest
To run from source instead:
# 1. Clone the repository
git clone git@github.com:RGGH/evaluate.git
# 2. Navigate into the project directory
cd evaluate
# 3. Run with Cargo (requires Rust/Cargo installed)
cargo run
You should see output similar to:
[INFO] Starting database migration...
[INFO] Starting server at 127.0.0.1:8080
Access the application at http://localhost:8080
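Once the server is up, you can sanity-check it from Python as well as curl. A hedged sketch against the documented /health and /models endpoints (the requests library is just one choice of HTTP client):

```python
import requests

BASE_URL = "http://localhost:8080/api/v1"

# Check the service is up
health = requests.get(f"{BASE_URL}/health", timeout=10).json()
print(health)  # e.g. {"status": "healthy", "service": "eval-api", "version": "..."}

# List the models configured via your .env file
models = requests.get(f"{BASE_URL}/models", timeout=10).json()
print(models["models"])  # e.g. ["gemini:gemini-2.5-pro", "ollama:llama3", ...]
```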
Gemini Example:
curl -X POST http://127.0.0.1:8080/api/v1/evals/run \
-H "Content-Type: application/json" \
-d '{
"model": "gemini-2.5-pro",
"prompt": "What is the capital of France?",
"expected": "Paris",
"judge_model": "gemini-2.5-pro",
"criteria": "Does the output correctly name the capital city?"
}' | jq
Response:
{
"id": "619cd32a-4376-4969-ac48-0f25b37bc933",
"status": "passed",
"result": {
"model": "gemini-2.5-pro",
"prompt": "What is the capital of France?",
"model_output": "The capital of France is **Paris**.",
"expected": "Paris",
"judge_result": {
"judge_model": "gemini-2.5-pro",
"verdict": "Pass",
"reasoning": "Verdict: PASS\n\nThe actual output correctly names Paris as the capital city...",
"confidence": null
},
"timestamp": "2024-07-29T10:30:00.123456789+00:00"
},
"error": null
}
Ollama Example:
curl -X POST http://127.0.0.1:8080/api/v1/evals/run \
-H "Content-Type: application/json" \
-d '{
"model": "ollama:llama3",
"prompt": "What is the capital of France?",
"expected": "Paris",
"judge_model": "ollama:llama3",
"criteria": "Does the output correctly name the capital city?"
}' | jq
You can set the provider in the JSON file and use the generic provider:model_name syntax for batch evals:
{
"model": "gemini:gemini-2.5-flash-latest",
"prompt": "What is the capital of France?",
"expected": "Paris",
"judge_model": "gemini:gemini-2.5-pro-latest"
}
Call the /api/v1/evals/batch endpoint:
curl -X POST http://127.0.0.1:8080/api/v1/evals/batch \
-H "Content-Type: application/json" \
-d '@qa_sample.json' | jq
curl -X POST http://127.0.0.1:8080/api/v1/evals/batch \
-H "Content-Type: application/json" \
-d '@qa_f1.json' | jq
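Both commands post an array of EvalConfig objects. The contents of qa_sample.json and qa_f1.json are not reproduced here, so the payload below is an assumed example of such a file, sent from Python rather than via curl:

```python
import requests

# Assumed example payload: an array of EvalConfig objects (fields documented below).
batch = [
    {
        "model": "gemini:gemini-2.5-flash-latest",
        "prompt": "What is the capital of France?",
        "expected": "Paris",
        "judge_model": "gemini:gemini-2.5-pro-latest",
    },
    {
        "model": "ollama:llama3",
        "prompt": "What is 2+2?",
        "expected": "4",
        "judge_model": "gemini:gemini-2.5-pro-latest",
        "criteria": "The output should be mathematically correct",
    },
]

resp = requests.post("http://127.0.0.1:8080/api/v1/evals/batch", json=batch, timeout=300)
print(resp.json())  # batch summary with passed/failed counts and per-eval results
```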
Base URL: http://localhost:8080/api/v1
| Method | Endpoint | Description | Response |
|---|---|---|---|
| GET | /health | Health check endpoint | {"status": "healthy", "service": "eval-api", "version": "..."} |
| GET | /models | List all available models | {"models": ["gemini:model-name", "ollama:model-name", ...]} |
| Method | Endpoint | Description | Request Body |
|---|---|---|---|
| POST | /evals/run | Run a single evaluation | RunEvalRequest |
| POST | /evals/batch | Run multiple evaluations concurrently | Array of EvalConfig |
| GET | /evals/history | Get all evaluation history | - |
| GET | /evals/{id} | Get specific evaluation result | - |
| GET | /evals/{id}/status | Get evaluation status | - |
| Method | Endpoint | Description | Request Body |
|---|---|---|---|
| GET | /judge-prompts | Get all judge prompt versions | - |
| GET | /judge-prompts/active | Get the currently active judge prompt | - |
| GET | /judge-prompts/{version} | Get a specific judge prompt by version | - |
| POST | /judge-prompts | Create a new judge prompt version | CreateJudgePromptRequest |
| PUT | /judge-prompts/active | Set a judge prompt version as active | {"version": 2} |
Get all judge prompts:
curl http://localhost:8080/api/v1/judge-prompts
Create a new judge prompt:
curl -X POST http://localhost:8080/api/v1/judge-prompts \
-H "Content-Type: application/json" \
-d '{
"name": "Strict Evaluator",
"template": "Compare:\nExpected: {{expected}}\nActual: {{actual}}\nVerdict: PASS or FAIL",
"description": "Requires exact semantic match",
"set_active": true
}'
Set a version as active:
curl -X PUT http://localhost:8080/api/v1/judge-prompts/active \
-H "Content-Type: application/json" \
-d '{"version": 2}'
| Method | Endpoint | Description | Request Body |
|---|---|---|---|
| POST | /experiments | Create a new experiment | CreateExperimentRequest |
| GET | /experiments/{id} | Get experiment details | - |
| Protocol | Endpoint | Description |
|---|---|---|
| WS | /ws | Real-time evaluation updates |
Connect to WebSocket:
const ws = new WebSocket('ws://localhost:8080/api/v1/ws');
ws.onmessage = (event) => {
const update = JSON.parse(event.data);
console.log('Eval update:', update);
};
RunEvalRequest:
{
"model": "gemini:gemini-2.5-flash-latest",
"prompt": "What is 2+2?",
"expected": "4",
"judge_model": "gemini:gemini-1.5-pro-latest",
"criteria": "The output should be mathematically correct"
}
Fields:
- model (required): Model identifier in the format provider:model_name
- prompt (required): The prompt to send to the model
- expected (optional): Expected output for comparison
- judge_model (optional): Judge model for LLM-as-a-judge evaluation
- criteria (optional): Custom evaluation criteria
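The same request can be sent from Python instead of curl; a minimal sketch using the fields above (the requests library is an assumption, any HTTP client works):

```python
import requests

payload = {
    "model": "gemini:gemini-2.5-flash-latest",   # provider:model_name
    "prompt": "What is 2+2?",
    "expected": "4",
    "judge_model": "gemini:gemini-1.5-pro-latest",
    "criteria": "The output should be mathematically correct",
}

resp = requests.post("http://localhost:8080/api/v1/evals/run", json=payload, timeout=120)
result = resp.json()
print(result["status"], result["result"]["judge_result"]["verdict"])
```

The next example shows a richer request that adds the optional tags and metadata fields.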
{
"model": "openai:gpt-4o",
"prompt": "Explain quantum computing",
"expected": "Quantum computing uses quantum bits...",
"judge_model": "gemini:gemini-2.5-pro-latest",
"criteria": "The explanation should be accurate and accessible",
"tags": ["physics", "computing"],
"metadata": {
"category": "science",
"difficulty": "advanced"
}
}
Response:
{
"id": "uuid-string",
"status": "passed",
"result": {
"model": "gemini:gemini-2.5-flash-latest",
"prompt": "What is 2+2?",
"model_output": "2+2 equals 4",
"expected": "4",
"judge_result": {
"judge_model": "gemini:gemini-2.5-pro-latest",
"verdict": "Pass",
"reasoning": "The output correctly identifies that 2+2 equals 4...",
"confidence": null
},
"timestamp": "2025-10-15T12:34:56Z",
"latency_ms": 450,
"judge_latency_ms": 320,
"total_latency_ms": 770
},
"error": null
}
Status values: "passed", "failed", "uncertain", "completed", "error"
Verdict values: "Pass", "Fail", "Uncertain"
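Individual results can also be fetched after the fact via the documented history endpoints; a short sketch (the id is one returned by an earlier /evals/run call, and the exact shape of the /status payload is an assumption):

```python
import requests

BASE_URL = "http://localhost:8080/api/v1"
eval_id = "619cd32a-4376-4969-ac48-0f25b37bc933"  # id returned by an earlier /evals/run call

# The exact shape of the /status payload is an assumption; adjust to what your server returns.
status = requests.get(f"{BASE_URL}/evals/{eval_id}/status", timeout=10).json()
print(status)

# Fetch the full result once the evaluation is no longer pending.
result = requests.get(f"{BASE_URL}/evals/{eval_id}", timeout=10).json()
print(result["result"]["judge_result"]["reasoning"])
```

The batch endpoint returns a summary like the JSON below.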
{
"batch_id": "uuid-string",
"status": "completed",
"total": 10,
"completed": 10,
"passed": 8,
"failed": 2,
"average_model_latency_ms": 425,
"average_judge_latency_ms": 315,
"results": []
}
See full documentation for:
- HistoryResponse
- CreateExperimentRequest
- ExperimentResponse
- CreateJudgePromptRequest
- JudgePrompt
- EvalUpdate (WebSocket)
Models are specified in the format provider:model_name:
Gemini:
- gemini:gemini-2.5-flash-latest
- gemini:gemini-2.5-pro-latest
Ollama:
- ollama:llama3
- ollama:gemma
OpenAI:
- openai:gpt-4o
- openai:gpt-4o-mini
- openai:gpt-3.5-turbo
Anthropic:
- anthropic:claude-opus-4
- anthropic:claude-sonnet-4
- anthropic:claude-sonnet-4-5
- anthropic:claude-haiku-4
If no provider is specified, gemini is used as the default.
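A small sketch of this naming convention (illustrative only, not the server's actual parsing code):

```python
def split_model_id(model_id: str, default_provider: str = "gemini") -> tuple[str, str]:
    """Split 'provider:model_name'; fall back to the gemini provider when none is given."""
    if ":" in model_id:
        provider, name = model_id.split(":", 1)
        return provider, name
    return default_provider, model_id


assert split_model_id("ollama:llama3") == ("ollama", "llama3")
assert split_model_id("gemini-2.5-pro") == ("gemini", "gemini-2.5-pro")
```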
This framework now supports versioned and dynamically loaded judge prompts, allowing you to change the LLM evaluation criteria without restarting the server.
Key Features:
- Version Control: Prompts are stored in the database with version numbers.
- API Control: The active prompt can be set via a dedicated API endpoint.
- Default Prompt: An initial default judge prompt is inserted automatically by the database migration.
| Method | Endpoint | Description | Body |
|---|---|---|---|
| GET | /api/v1/judge-prompts/active | Retrieves the currently active prompt template. | N/A |
| POST | /api/v1/judge-prompts | Creates a new prompt version. | {name: "new prompt", template: "...", set_active: false} |
| PUT | /api/v1/judge-prompts/active | Sets a specific version as active. | {version: 3} (requires the version number) |
curl -X POST 'http://127.0.0.1:8080/api/v1/judge-prompts' \
-H 'Content-Type: application/json' \
-d '{
"name": "Relaxed Math Judge",
"template": "You are an expert evaluator comparing two text outputs. When evaluating mathematical or factual answers, prioritize the core numerical or fact value. Ignore auxiliary text, equations, or prefixes (like \"The answer is\") if the core value is correct. EVALUATION CRITERIA: {{criteria}} EXPECTED OUTPUT: {{expected}} ACTUAL OUTPUT: {{actual}} INSTRUCTIONS: 1. Carefully compare both outputs 2. Provide your verdict as the first line: \"Verdict: PASS\" or \"Verdict: FAIL\" 3. Then explain your reasoning in 2-3 sentences. Your evaluation:",
"description": "A prompt designed to be less strict than the default, allowing for correct answers that include extraneous text.",
"set_active": false
}'
Key Components Explained:
-X POST: Specifies the HTTP method.
http://127.0.0.1:8080/api/v1/judge-prompts: Your local API endpoint for creating prompts.
-H 'Content-Type: application/json': Tells the server to expect JSON data in the body.
-d '...': The data payload containing the fields required by your create_judge_prompt handler:
name: A human-readable identifier.
template: The full new prompt text, which now includes instructions to be more flexible on matching.
set_active: We set this to false because we usually create the prompt first, then manually activate it (Step 2).
Next Step: Activating the New Prompt
After running the POST command, the API will respond with the newly created JudgePrompt object, which includes its unique version number (e.g., version: 2).
You would then use a PUT request to make that new version the official, active prompt:
curl -X PUT 'http://127.0.0.1:8080/api/v1/judge-prompts/active' \
-H 'Content-Type: application/json' \
-d '{"version": 2}'
One major limitation of LLMs is knowledge recency. Since these models are trained on fixed datasets that quickly become outdated, they often struggle with topics that rely on the latest information, such as new laws, policies, or medical guidance. This means their judgements can be based on old or irrelevant data, leading to unreliable results. To keep them up to date, techniques like retrieval-augmented generation (RAG), regular fine-tuning, and continual learning can help ensure LLMs-as-judges have access to the most current knowledge when making decisions.
Another key weakness is hallucination, where LLMs confidently generate information that isn't true. In an evaluation context, this could mean inventing fake references, misinterpreting facts, or fabricating evidence, all of which can undermine trust in their output. Building in robust fact-checking systems that verify claims against reliable sources is essential to reduce the impact of these errors and maintain fairness in judgement.
Lastly, LLMs often face domain-specific knowledge gaps. While they're great generalists, they can lack the deep understanding needed for complex areas like law, finance, or medicine. Integrating domain-specific knowledge graphs or using RAG to pull in expert information can help bridge this gap, allowing them to deliver more accurate and context-aware evaluations.
Thank you for your interest in contributing!
We welcome contributions of all kinds: bug fixes, improvements, documentation, examples, or new features. Rust, Python, and front-end JS/TS contributions are all welcome. See the current issues for ideas.
- Fork the repository and create a new branch for your changes
- Make your changes with clear, descriptive commit messages
- Open a Pull Request explaining what you've done and why
Please make sure your code follows the existing style and passes any tests. For larger changes, feel free to open an issue first to discuss your approach.
By contributing, you agree that your work will be licensed under this project's license.
Thank you for helping make this project better!
- https://arxiv.org/html/2412.05579v2
- https://github.com/openai/evals
- https://learn.microsoft.com/en-us/azure/ai-foundry/how-to/evaluate-generative-ai-app#query-and-response-metric-requirements
- Image Classifier Evals
