From 5891466a68704691719d9783c862cbb79ec3d12b Mon Sep 17 00:00:00 2001
From: ahmedhammam <ahmed.hammam@aleph-alpha-ip.ai>
Date: Mon, 12 Jan 2026 10:59:17 +0000
Subject: [PATCH 1/6] docs: add LLM as judge guide

---
 docs/llm_as_judge_guide.md | 591 +++++++++++++++++++++++++++++++++++++
 1 file changed, 591 insertions(+)
 create mode 100644 docs/llm_as_judge_guide.md

diff --git a/docs/llm_as_judge_guide.md b/docs/llm_as_judge_guide.md
new file mode 100644
index 00000000..843910ed
--- /dev/null
+++ b/docs/llm_as_judge_guide.md
@@ -0,0 +1,591 @@
+# LLM as Judge Evaluation Guide
+
+This guide provides comprehensive documentation for using LLM-as-a-judge evaluation in the eval-framework. An LLM judge is a language model that scores another model's outputs for quality, correctness, and other properties such as style or safety.
+
+## Table of Contents
+
+- [Architecture](#architecture)
+- [Available LLM Judge Metrics](#available-llm-judge-metrics)
+- [Configuration](#configuration)
+  - [CLI Configuration](#cli-configuration)
+  - [Python API Configuration](#python-api-configuration)
+- [Adding a New Benchmark with LLM Judges](#adding-a-new-benchmark-with-llm-judges)
+- [Understanding Results](#understanding-results)
+- [Best Practices](#best-practices)
+- [Troubleshooting](#troubleshooting)
+
+
+---
+
+## Architecture
+
+The LLM judge system follows a layered architecture:
+
+```
+┌─────────────────────────────────────────────────────────────┐
+│                    EvalConfig / CLI                         │
+│     (llm_judge_class, judge_model_args, judge_model_name)   │
+└─────────────────────────┬───────────────────────────────────┘
+                          │
+┌─────────────────────────▼───────────────────────────────────┐
+│                  BaseLLMJudgeMetric                         │
+│        (Base class for all LLM judge metrics)               │
+└─────────────────────────┬───────────────────────────────────┘
+                          │
+┌─────────────────────────▼───────────────────────────────────┐
+│                     Graders Layer                           │
+│  (InstructionGrader, ComparisonGrader, ChatbotStyleGrader)  │
+└─────────────────────────┬───────────────────────────────────┘
+                          │
+┌─────────────────────────▼───────────────────────────────────┐
+│                     BaseLLM (Judge Model)                   │
+│       (OpenAIModel, HFLLM, VLLM, or custom model)           │
+└─────────────────────────────────────────────────────────────┘
+```
+
+### Core Components
+
+| Component | Description |
+|-----------|-------------|
+| `BaseLLMJudgeMetric` | Abstract base class for all LLM judge metrics |
+| `Graders` | Specialized classes that format prompts and parse judge outputs |
+| `PromptTemplate` | Defines system and user prompts for the judge |
+| `GradingOutput` | Structured output from graders containing judge response |
+| `MetricResult` | Final result including value, judge prompt, and judge response |
+
+---
+
+## Available LLM Judge Metrics
+
+The framework ships with the following pre-built LLM judge metrics:
+
+### Quality & Style Metrics
+
+| Metric | Class | Description | Languages |
+|--------|-------|-------------|-----------|
+| **Chatbot Style** | `LLMJudgeChatbotStyle` | Evaluates if responses follow chatbot conventions (friendly intro, verbose language, follow-up questions) | EN, DE |
+| **Conciseness** | `LLMJudgeConciseness` | Assesses if responses are brief and to the point without unnecessary elaboration | EN, DE |
+| **Coherence** | `LLMJudgeCoherence` | Evaluates logical flow and consistency of responses | EN |
+
+### Correctness Metrics
+
+| Metric | Class | Description | Languages |
+|--------|-------|-------------|-----------|
+| **Completion Accuracy** | `LLMJudgeCompletionAccuracy` | Evaluates if the model response matches the expected answer | EN |
+| **Instruction Following** | `LLMJudgeInstruction` | Comprehensive evaluation of instruction adherence with multiple sub-metrics | EN, DE, FI |
+| **Format Correctness** | `LLMJudgeFormatCorrectness` | Validates if output follows specified format requirements | EN |
+
+### Specialized Metrics
+
+| Metric | Class | Description | Languages |
+|--------|-------|-------------|-----------|
+| **SQL Quality** | `LLMJudgeSql` | Evaluates SQL query quality, efficiency, and accuracy (A-F grade) | EN, DE |
+| **World Knowledge** | `LLMJudgeWorldKnowledge` | Detects if summaries contain information beyond the source text | EN, FR, DE |
+| **Avoids Names** | `LLMJudgeAvoidsNames` | Checks if responses avoid using personal names | EN, FR, DE |
+| **Refusal Classification** | `LLMJudgeRefusal` | Detects if the model refused to answer | EN |
+
+### Comparison Metrics (MT-Bench Style)
+
+| Metric | Class | Description | Languages |
+|--------|-------|-------------|-----------|
+| **Pairwise Judgement** | `MTBenchJudgePair` | Compares two responses and selects the better one (A wins, B wins, tie) | EN, DE, FI |
+| **Single Judgement** | `MTBenchJudgeSingle` | Rates a single response on a 1-10 scale | EN, DE, FI |
+
+### Multi-Key Metrics
+
+Some metrics return multiple evaluation keys:
+
+**`LLMJudgeInstruction`** returns:
+- `quality` - Overall quality score (normalized 0-1)
+- `is_following_instruction` - Boolean instruction adherence
+- `has_correct_grammar_and_spelling` - Boolean grammar check
+- `is_context_consistent` - Boolean consistency with context
+- `is_not_repeating` - Boolean repetition check
+- `is_trustworthy` - Boolean truthfulness check
+- `is_safe` - Boolean safety check
+
+**`LLMJudgeCoherence`** returns:
+- `coherence_score` - Overall coherence rating
+- `is_coherent` - Boolean coherence flag
+- `has_repetition` - Boolean repetition detection
+
+---
+
+## Configuration
+
+### CLI Configuration
+
+To use LLM judges via the command line:
+
+```bash
+uv run eval_framework \
+    --models path/to/your/models.py \
+    --llm-name YourModelToEvaluate \
+    --task-name YourTaskName \
+    --judge-models path/to/judge_models.py \
+    --judge-model-name OpenAI_gpt_4o_mini \
+    --judge-model-args api_key="your-api-key" \
+    --output-dir ./eval_results \
+    --num-samples 100
+```
+
+#### Judge-Specific CLI Arguments
+
+| Argument | Description |
+|----------|-------------|
+| `--judge-models` | Path to Python module containing judge model classes |
+| `--judge-model-name` | Name of the judge model class to instantiate |
+| `--judge-model-args` | Key=value pairs for judge model constructor arguments |
+| `--randomize-judge-order` | Enable position randomization for pairwise comparisons |
+
+### Python API Configuration
+
+```python
+from pathlib import Path
+from eval_framework.llm.openai import OpenAI_gpt_4o_mini
+from eval_framework.llm.huggingface import HFLLM
+from eval_framework.main import main
+from eval_framework.tasks.eval_config import EvalConfig
+
+# Define your model to evaluate
+class MyModel(HFLLM):
+    LLM_NAME = "your-model-name"
+
+# Configure evaluation with LLM judge
+config = EvalConfig(
+    task_name="YourTaskName",
+    llm_class=MyModel,
+    llm_judge_class=OpenAI_gpt_4o_mini,  # Judge model class
+    judge_model_args={                    # Judge model arguments
+        "api_key": "your-api-key",
+        "temperature": 0.0,               # Lower temperature for consistent judging
+    },
+    output_dir=Path("./eval_results"),
+    num_samples=100,
+    randomize_judge_order=True,           # Mitigate position bias
+)
+
+# Run evaluation
+llm = MyModel()
+results = main(llm=llm, config=config)
+```
+
+### Using Different Judge Models
+
+#### OpenAI Models
+
+```python
+from eval_framework.llm.openai import OpenAIModel
+
+# Using pre-defined alias
+from eval_framework.llm.openai import OpenAI_gpt_4o_mini
+
+# Or configure directly
+class CustomOpenAIJudge(OpenAIModel):
+    LLM_NAME = "gpt-4-turbo"
+
+config = EvalConfig(
+    llm_judge_class=CustomOpenAIJudge,
+    judge_model_args={
+        "api_key": "your-api-key",
+        "temperature": 0.0,
+    },
+    # ...
+)
+```
+
+#### DeepSeek Models
+
+```python
+from eval_framework.llm.openai import Deepseek_chat
+
+config = EvalConfig(
+    llm_judge_class=Deepseek_chat,
+    judge_model_args={
+        # Uses DEEPSEEK_API_KEY env variable by default
+    },
+    # ...
+)
+```
+
+#### Local vLLM Models
+
+```python
+from eval_framework.llm.vllm import VLLM
+
+class LocalJudge(VLLM):
+    LLM_NAME = "meta-llama/Meta-Llama-3-70B-Instruct"
+
+config = EvalConfig(
+    llm_judge_class=LocalJudge,
+    judge_model_args={
+        "tensor_parallel_size": 4,
+        "max_model_len": 8192,
+    },
+    # ...
+)
+```
+
+---
+
+## Adding a New Benchmark with LLM Judges
+
+This section provides a complete guide to creating a new benchmark task that uses LLM judge metrics for evaluation.
+
+### Step 1: Define Your Task Class
+
+Every benchmark inherits from `BaseTask[SubjectType]` and requires these class attributes:
+
+```python
+from typing import Any
+from eval_framework.tasks.base import BaseTask, ResponseType, Sample, Language
+
+class YourBenchmark(BaseTask[str]):
+    # Required attributes
+    NAME: str = "YourBenchmarkName"              # Display name
+    DATASET_PATH: str = "huggingface/dataset"    # HuggingFace dataset path
+    SAMPLE_SPLIT: str = "test"                   # Split for evaluation samples
+    FEWSHOT_SPLIT: str = "train"                 # Split for few-shot examples
+    RESPONSE_TYPE: ResponseType                  # COMPLETION or LOGLIKELIHOODS
+    METRICS: list[type[BaseMetric]]              # Metrics to compute
+    SUBJECTS: list[str]                          # Subjects/categories
+    
+    # Optional attributes
+    LANGUAGE: Language | None = Language.ENG     # Primary language
+    HF_REVISION: str | None = None               # Dataset version pin
+```
+
+### Step 2: Choose Your LLM Judge Metrics
+
+Select from the available LLM judge metrics based on your evaluation requirements:
+
+```python
+# Quality evaluation
+from eval_framework.metrics.llm.llm_judge_instruction import LLMJudgeInstruction
+from eval_framework.metrics.llm.llm_judge_chatbot_style import LLMJudgeChatbotStyle
+from eval_framework.metrics.llm.llm_judge_conciseness import LLMJudgeConciseness
+from eval_framework.metrics.llm.llm_judge_coherence import LLMJudgeCoherence
+
+# Correctness evaluation
+from eval_framework.metrics.llm.llm_judge_completion_accuracy import LLMJudgeCompletionAccuracy
+from eval_framework.metrics.llm.llm_judge_format_correctness import LLMJudgeFormatCorrectness
+
+# Specialized evaluation
+from eval_framework.metrics.llm.llm_judge_sql import LLMJudgeSql
+from eval_framework.metrics.llm.llm_judge_world_knowledge import LLMJudgeWorldKnowledge
+from eval_framework.metrics.llm.llm_judge_contains_names import LLMJudgeAvoidsNames
+from eval_framework.metrics.llm.llm_judge_refusal import LLMJudgeRefusal
+
+# Comparison evaluation
+from eval_framework.metrics.llm.llm_judge_mtbench_pair import MTBenchJudgePair
+from eval_framework.metrics.llm.llm_judge_mtbench_single import MTBenchJudgeSingle
+```
+
+### Step 3: Implement Required Methods
+
+Every benchmark must implement these core methods:
+
+```python
+def _get_instruction_text(self, item: dict[str, Any]) -> str:
+    """Generate the instruction/question for the model."""
+    pass
+
+def _get_ground_truth(self, item: dict[str, Any]) -> str | list[str] | None:
+    """Extract the expected answer(s) from a dataset item."""
+    pass
+```
+
+### Step 4: Provide Context for Judge Metrics
+
+Many LLM judge metrics require additional context via the `_get_context` method:
+
+```python
+from eval_framework.shared.types import LanguageMetricContext, BaseMetricContext
+
+def _get_context(self, item: dict[str, Any]) -> BaseMetricContext | None:
+    """Provide additional context for metric evaluation."""
+    return LanguageMetricContext(
+        language=item.get("language", "en"),
+    )
+```
+
+### Step 5: Create Custom Metrics (If Needed)
+
+If the pre-built LLM judge metrics don't cover your evaluation requirements, you can create custom metrics.
+
+#### Basic Custom Metric
+
+Create a new LLM judge metric by extending `BaseLLMJudgeMetric`:
+
+```python
+from eval_framework.llm.base import BaseLLM
+from eval_framework.metrics.base import MetricResult
+from eval_framework.metrics.llm.base import BaseLLMJudgeMetric
+from eval_framework.shared.types import Completion
+from template_formatting.formatter import Message, Role
+
+class CustomJudgeMetric(BaseLLMJudgeMetric):
+    NAME = "Custom Judge Metric"
+    
+    def __init__(self, llm_judge: BaseLLM, randomize_order: bool = False):
+        super().__init__(llm_judge, randomize_order)
+    
+    def calculate(self, response: Completion) -> list[MetricResult]:
+        if response.error is not None:
+            return [MetricResult(
+                metric_name=self.NAME,
+                value=None,
+                higher_is_better=True,
+                error=response.error,
+            )]
+        
+        # Create judge prompt
+        judge_prompt = f"""Evaluate the following response for quality.
+
+Question: {response.system_user_instruction}
+
+Response: {response.sanitized_completion}
+
+Rate the quality on a scale of 1-5, where 5 is excellent.
+Respond with ONLY a JSON object: {{"score": <number>, "reasoning": "<explanation>"}}"""
+        
+        # Get judge response
+        messages = [Message(role=Role.USER, content=judge_prompt)]
+        output = self._llm_judge.generate_from_messages([messages])
+        
+        # Parse result (implement your parsing logic)
+        import json
+        try:
+            parsed = json.loads(output[0].completion)
+            score = parsed.get("score", 3) / 5.0  # Normalize to 0-1
+        except:
+            score = None
+        
+        return [MetricResult(
+            metric_name=self.NAME,
+            value=score,
+            higher_is_better=True,
+            llm_judge_prompt=judge_prompt,
+            llm_judge_response=output[0].completion,
+        )]
+```
+
+#### Creating a Custom Grader
+
+When you need per-language prompt templates or structured output fields, create a custom grader:
+
+```python
+from collections.abc import Mapping
+from eval_framework.llm.base import BaseLLM
+from eval_framework.metrics.llm.graders.models import (
+    GradingOutput,
+    PromptTemplate,
+    parse_json_output,
+)
+from eval_framework.metrics.llm.graders.language import Language
+
+class CustomGradingOutput(GradingOutput):
+    """Structured output from the grader."""
+    quality_score: float | None
+    has_errors: bool | None
+    feedback: str | None
+
+class CustomGrader:
+    RESPONSE_KEY = "response"
+    CRITERIA_KEY = "criteria"
+    
+    PROMPT_TEMPLATES = {
+        Language("en"): PromptTemplate(
+            system_prompt="""You are an expert evaluator. Assess the given response 
+based on the specified criteria.
+
+Provide your evaluation as JSON:
+{
+    "quality_score": float (0.0 to 1.0),
+    "has_errors": bool,
+    "feedback": str
+}""",
+            user_prompt=f"""**Response to Evaluate**:
+{{{RESPONSE_KEY}}}
+
+**Evaluation Criteria**:
+{{{CRITERIA_KEY}}}""",
+        ),
+    }
+    
+    def __init__(
+        self,
+        grading_model: BaseLLM,
+        prompt_templates: Mapping[Language, PromptTemplate] = PROMPT_TEMPLATES,
+    ):
+        self._grading_model = grading_model
+        self._prompt_templates = prompt_templates
+    
+    def grade(
+        self,
+        response: str,
+        criteria: str,
+        language: Language,
+    ) -> CustomGradingOutput:
+        try:
+            prompt_template = language.language_config(self._prompt_templates)
+        except:
+            prompt_template = Language("en").language_config(self._prompt_templates)
+        
+        messages = prompt_template.to_messages(
+            [],  # system key-value pairs
+            [    # user key-value pairs
+                (self.RESPONSE_KEY, response),
+                (self.CRITERIA_KEY, criteria),
+            ],
+        )
+        
+        raw_completion = self._grading_model.generate_from_messages([messages])[0]
+        loaded_json = parse_json_output(raw_completion.completion)
+        
+        return CustomGradingOutput(
+            quality_score=loaded_json.get("quality_score"),
+            has_errors=loaded_json.get("has_errors"),
+            feedback=loaded_json.get("feedback"),
+            judge_prompt=raw_completion.prompt,
+            judge_response=raw_completion.completion,
+        )
+```
+
+#### Using the Custom Grader in a Metric
+
+```python
+class CustomGraderMetric(BaseLLMJudgeMetric):
+    NAME = "Custom Grader Metric"
+    
+    def __init__(self, llm_judge: BaseLLM, randomize_order: bool = False):
+        super().__init__(llm_judge, randomize_order)
+        self._grader = CustomGrader(llm_judge)
+    
+    def calculate(self, response: Completion) -> list[MetricResult]:
+        if response.error is not None:
+            return [self._create_metric_result(
+                metric_name=self.NAME,
+                higher_is_better=True,
+                value=None,
+                error=response.error,
+            )]
+        
+        language = Language(response.get_instruction_language())
+        
+        grading = self._grader.grade(
+            response=response.sanitized_completion,
+            criteria="Evaluate for accuracy, clarity, and completeness.",
+            language=language,
+        )
+        
+        return [self._create_metric_result(
+            metric_name=self.NAME,
+            higher_is_better=True,
+            value=grading.quality_score,
+            llm_judge_prompt=grading.judge_prompt,
+            llm_judge_response=grading.judge_response,
+        )]
+```
+
+### Step 6: Register Your Benchmark
+
+Add your benchmark to the task registry in `src/eval_framework/tasks/task_names.py`:
+
+```python
+# In register_all_tasks(); the module and class names below are placeholders
+register_lazy_task("eval_framework.tasks.benchmarks.your_module.WritingQualityBenchmark")
+register_lazy_task("eval_framework.tasks.benchmarks.your_module.SQLGenerationBenchmark")
+```
+
+### Step 7: Run Your Benchmark
+
+**Via CLI:**
+
+```bash
+uv run eval_framework \
+    --models path/to/models.py \
+    --llm-name YourModel \
+    --task-name WritingQuality \
+    --task-subjects "creative_writing" \
+    --judge-models eval_framework.llm.openai \
+    --judge-model-name OpenAI_gpt_4o_mini \
+    --judge-model-args api_key="$OPENAI_API_KEY" \
+    --output-dir ./eval_results \
+    --num-samples 50
+```
+
+**Via Python:**
+
+```python
+from pathlib import Path
+from eval_framework.llm.openai import OpenAI_gpt_4o_mini
+from eval_framework.main import main
+from eval_framework.tasks.eval_config import EvalConfig
+
+# Your model class
+from your_models import YourModel
+
+config = EvalConfig(
+    task_name="WritingQuality",
+    task_subjects=["creative_writing", "technical_writing"],
+    llm_class=YourModel,
+    llm_judge_class=OpenAI_gpt_4o_mini,
+    judge_model_args={"api_key": "your-api-key", "temperature": 0.0},
+    output_dir=Path("./eval_results"),
+    num_samples=50,
+    num_fewshot=3,
+    randomize_judge_order=True,  # For fair pairwise comparisons
+)
+
+llm = YourModel()
+results = main(llm=llm, config=config)
+```
+
+
+### Notes on LLM Judge Tasks
+
+1. **Judge Configuration Required**: When running a task with LLM judge metrics, you must configure the judge model; otherwise the run fails with:
+   ```
+   AssertionError: The LLM Judge must be defined for this evaluation task.
+   ```
+
+2. **Context Matching**: Ensure your `_get_context` returns the correct context type for your metrics:
+   - `LanguageMetricContext` - Most metrics
+   - `LLMJudgeSqlMetricContext` - SQL evaluation
+   - `MTBenchJudgePairMetricContext` - Pairwise comparison
+
+3. **Language Support**: Check that your chosen metrics support the languages in your dataset.
+
+---
+
+## Appendix
+
+
+### Import Reference
+
+```python
+# Judge metrics
+from eval_framework.metrics.llm.llm_judge_instruction import LLMJudgeInstruction
+from eval_framework.metrics.llm.llm_judge_chatbot_style import LLMJudgeChatbotStyle
+from eval_framework.metrics.llm.llm_judge_completion_accuracy import LLMJudgeCompletionAccuracy
+from eval_framework.metrics.llm.llm_judge_conciseness import LLMJudgeConciseness
+from eval_framework.metrics.llm.llm_judge_coherence import LLMJudgeCoherence
+from eval_framework.metrics.llm.llm_judge_format_correctness import LLMJudgeFormatCorrectness
+from eval_framework.metrics.llm.llm_judge_sql import LLMJudgeSql
+from eval_framework.metrics.llm.llm_judge_world_knowledge import LLMJudgeWorldKnowledge
+from eval_framework.metrics.llm.llm_judge_contains_names import LLMJudgeAvoidsNames
+from eval_framework.metrics.llm.llm_judge_refusal import LLMJudgeRefusal
+from eval_framework.metrics.llm.llm_judge_mtbench_pair import MTBenchJudgePair
+from eval_framework.metrics.llm.llm_judge_mtbench_single import MTBenchJudgeSingle
+
+# Base classes for custom metrics
+from eval_framework.metrics.llm.base import BaseLLMJudgeMetric
+from eval_framework.metrics.llm.graders.models import GradingOutput, PromptTemplate
+
+# Judge model classes
+from eval_framework.llm.openai import OpenAIModel, OpenAI_gpt_4o_mini, Deepseek_chat
+from eval_framework.llm.vllm import VLLM
+from eval_framework.llm.huggingface import HFLLM
+```
+

From b0ed1134229502437f813627445ea026410f828a Mon Sep 17 00:00:00 2001
From: ahmedhammam <ahmed.hammam@aleph-alpha-ip.ai>
Date: Mon, 12 Jan 2026 11:04:51 +0000
Subject: [PATCH 2/6] fix lint

---
 docs/llm_as_judge_guide.md | 39 +++++++++++++++++++-------------------
 1 file changed, 19 insertions(+), 20 deletions(-)

diff --git a/docs/llm_as_judge_guide.md b/docs/llm_as_judge_guide.md
index 843910ed..06c0bbdb 100644
--- a/docs/llm_as_judge_guide.md
+++ b/docs/llm_as_judge_guide.md
@@ -249,7 +249,7 @@ class YourBenchmark(BaseTask[str]):
     RESPONSE_TYPE: ResponseType                  # COMPLETION or LOGLIKELIHOODS
     METRICS: list[type[BaseMetric]]              # Metrics to compute
     SUBJECTS: list[str]                          # Subjects/categories
-    
+
     # Optional attributes
     LANGUAGE: Language | None = Language.ENG     # Primary language
     HF_REVISION: str | None = None               # Dataset version pin
@@ -326,10 +326,10 @@ from template_formatting.formatter import Message, Role
 
 class CustomJudgeMetric(BaseLLMJudgeMetric):
     NAME = "Custom Judge Metric"
-    
+
     def __init__(self, llm_judge: BaseLLM, randomize_order: bool = False):
         super().__init__(llm_judge, randomize_order)
-    
+
     def calculate(self, response: Completion) -> list[MetricResult]:
         if response.error is not None:
             return [MetricResult(
@@ -338,7 +338,7 @@ class CustomJudgeMetric(BaseLLMJudgeMetric):
                 higher_is_better=True,
                 error=response.error,
             )]
-        
+
         # Create judge prompt
         judge_prompt = f"""Evaluate the following response for quality.
 
@@ -348,11 +348,11 @@ Response: {response.sanitized_completion}
 
 Rate the quality on a scale of 1-5, where 5 is excellent.
 Respond with ONLY a JSON object: {{"score": <number>, "reasoning": "<explanation>"}}"""
-        
+
         # Get judge response
         messages = [Message(role=Role.USER, content=judge_prompt)]
         output = self._llm_judge.generate_from_messages([messages])
-        
+
         # Parse result (implement your parsing logic)
         import json
         try:
@@ -360,7 +360,7 @@ Respond with ONLY a JSON object: {{"score": <number>, "reasoning": "<explanation
             score = parsed.get("score", 3) / 5.0  # Normalize to 0-1
         except:
             score = None
-        
+
         return [MetricResult(
             metric_name=self.NAME,
             value=score,
@@ -393,10 +393,10 @@ class CustomGradingOutput(GradingOutput):
 class CustomGrader:
     RESPONSE_KEY = "response"
     CRITERIA_KEY = "criteria"
-    
+
     PROMPT_TEMPLATES = {
         Language("en"): PromptTemplate(
-            system_prompt="""You are an expert evaluator. Assess the given response 
+            system_prompt="""You are an expert evaluator. Assess the given response
 based on the specified criteria.
 
 Provide your evaluation as JSON:
@@ -412,7 +412,7 @@ Provide your evaluation as JSON:
 {{{CRITERIA_KEY}}}""",
         ),
     }
-    
+
     def __init__(
         self,
         grading_model: BaseLLM,
@@ -420,7 +420,7 @@ Provide your evaluation as JSON:
     ):
         self._grading_model = grading_model
         self._prompt_templates = prompt_templates
-    
+
     def grade(
         self,
         response: str,
@@ -431,7 +431,7 @@ Provide your evaluation as JSON:
             prompt_template = language.language_config(self._prompt_templates)
         except:
             prompt_template = Language("en").language_config(self._prompt_templates)
-        
+
         messages = prompt_template.to_messages(
             [],  # system key-value pairs
             [    # user key-value pairs
@@ -439,10 +439,10 @@ Provide your evaluation as JSON:
                 (self.CRITERIA_KEY, criteria),
             ],
         )
-        
+
         raw_completion = self._grading_model.generate_from_messages([messages])[0]
         loaded_json = parse_json_output(raw_completion.completion)
-        
+
         return CustomGradingOutput(
             quality_score=loaded_json.get("quality_score"),
             has_errors=loaded_json.get("has_errors"),
@@ -457,11 +457,11 @@ Provide your evaluation as JSON:
 ```python
 class CustomGraderMetric(BaseLLMJudgeMetric):
     NAME = "Custom Grader Metric"
-    
+
     def __init__(self, llm_judge: BaseLLM, randomize_order: bool = False):
         super().__init__(llm_judge, randomize_order)
         self._grader = CustomGrader(llm_judge)
-    
+
     def calculate(self, response: Completion) -> list[MetricResult]:
         if response.error is not None:
             return [self._create_metric_result(
@@ -470,15 +470,15 @@ class CustomGraderMetric(BaseLLMJudgeMetric):
                 value=None,
                 error=response.error,
             )]
-        
+
         language = Language(response.get_instruction_language())
-        
+
         grading = self._grader.grade(
             response=response.sanitized_completion,
             criteria="Evaluate for accuracy, clarity, and completeness.",
             language=language,
         )
-        
+
         return [self._create_metric_result(
             metric_name=self.NAME,
             higher_is_better=True,
@@ -588,4 +588,3 @@ from eval_framework.llm.openai import OpenAIModel, OpenAI_gpt_4o_mini, Deepseek_
 from eval_framework.llm.vllm import VLLM
 from eval_framework.llm.huggingface import HFLLM
 ```
-

From ade157e2f5510e7ba4f78f893234ffdf75007c84 Mon Sep 17 00:00:00 2001
From: ahmedhammam <ahmed.hammam@aleph-alpha-ip.ai>
Date: Wed, 14 Jan 2026 11:24:19 +0000
Subject: [PATCH 3/6] add guide to sphinx

---
 docs/benchmarks_and_metrics.md | 61 ++++++++++++++++++++++++++--------
 docs/index.rst                 |  1 +
 2 files changed, 48 insertions(+), 14 deletions(-)

diff --git a/docs/benchmarks_and_metrics.md b/docs/benchmarks_and_metrics.md
index 200acc1d..92b20a13 100644
--- a/docs/benchmarks_and_metrics.md
+++ b/docs/benchmarks_and_metrics.md
@@ -64,36 +64,69 @@ Additional task documentation can be generated with the script `utils/generate-t
 
 | Metrics Type | Metrics                       |
 |-|-------------------------------|
-| Completion Metrics | Accuracy
+| Completion Metrics | Accuracy Completion           |
+|| AidanBench                    |
+|| **Text Similarity**           |
 || Bleu                          |
+|| Linewise BLEU                 |
+|| Response to Original BLEU     |
 || Chrf                          |
+|| Linewise chrF                 |
+|| COMET                         |
 || Ter                           |
+|| Linewise TER                  |
 || F1                            |
 || Rouge 1                       |
 || Rouge 2                       |
 || Rouge-L                       |
+|| Rouge-Geometric-Mean          |
+|| Exponential Similarity        |
+|| **Code Evaluation**           |
 || Code Assertion                |
+|| Code Execution Pass@1         |
+|| **Language & Format**         |
 || Language Checker              |
-|| Length Checker                |
-|| Math Reasoning                |
-|| Placeholder Checker           |
-|| Text Counter                  |
+|| German Completion Check       |
+|| Language Consistency          |
+|| Language Consistency Raw      |
 || CSV Format                    |
 || JSON Format                   |
 || Postscript Format             |
+|| **Text Statistics & Control** |
+|| Length Checker                |
+|| Word Count                    |
+|| Sentence Count                |
+|| Paragraph Count               |
+|| Response to Original Length Ratio |
+|| Text Counter                  |
+|| Word Repetition               |
+|| Placeholder Checker           |
+|| **Task-Specific**             |
+|| Math Reasoning                |
+|| Concordance Index             |
 || Custom IFEval Checker         |
 || Custom CWE Checker            |
 || Custom NIAH Checker           |
 || Custom Grid Comparison Checker |
-|| Repetition Checker            |
+|| StructMetric                  |
+|| RenderableStructMetric        |
 | Loglikelihood Metrics | Accuracy Loglikelihood        |
-|| Normalized Accuracy Loglikelihood |
+|| Accuracy Normalized Loglikelihood |
+|| Confidence-weighted Accuracy  |
+|| Distributional Correctness Score |
 || Probability Mass              |
-| LLM Metrics | Chatbot Style Judge           |
-|| Completion Accuracy Judge
-|| Conciseness Judge
-|| Contains Names Judge
-|| Instruction Judge
-|| SQL Format
-|| World Knowledge Judge
+|| Probability Mass Normalized   |
+|| Ternary Score                 |
+| LLM Judge Metrics | Chatbot Style                 |
+|| Coherence                     |
+|| Completion Accuracy           |
+|| Conciseness                   |
+|| Avoids Names                  |
+|| Format Correctness            |
+|| Instruction Following         |
+|| Pairwise Judgement (MTBench)  |
+|| Single Judgement (MTBench)    |
+|| Refusal Classifier            |
+|| SQL Quality                   |
+|| World Knowledge               |
 | Efficiency Metrics | Bytes per Sequence Position   |
diff --git a/docs/index.rst b/docs/index.rst
index 49d7e066..840af808 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -67,6 +67,7 @@ Documentation Overview
    controlling_upload_results.md
    docker_guide.md
    evaluate_huggingface_model.md
+   llm_as_judge_guide.md
    loglikelihood_task_guide.md
    model_arguments.md
    overview_dataloading.md

From 2cda47b2ce89419f5bd5cc341c332fbd94d6ed4d Mon Sep 17 00:00:00 2001
From: ahmedhammam <ahmed.hammam@aleph-alpha-ip.ai>
Date: Wed, 14 Jan 2026 11:32:10 +0000
Subject: [PATCH 4/6] add aidan bench metric

---
 docs/llm_as_judge_guide.md | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)
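
Note on the new "Generation Control" section: the stop rule it describes can be sketched as a plain loop. This is an illustration under stated assumptions, not the code in `aidanbench.py`: `count_coherent_responses`, `generate`, and `coherence_score` are hypothetical names, and the 0.5 threshold and the iteration cap are arbitrary.

```python
from collections.abc import Callable


def count_coherent_responses(
    generate: Callable[[list[str]], str],
    coherence_score: Callable[[str], float],
    threshold: float,
    max_iterations: int = 20,
) -> int:
    """Generate until a response scores below `threshold`; return how many were kept."""
    accepted: list[str] = []
    for _ in range(max_iterations):
        # Earlier answers are passed in so the generator can avoid repeating them.
        candidate = generate(accepted)
        if coherence_score(candidate) < threshold:
            break  # quality dropped below the threshold: stop generating
        accepted.append(candidate)
    return len(accepted)


# Stand-ins for the model under test and the judge (both hypothetical).
scores = {"answer 0": 0.9, "answer 1": 0.8, "answer 2": 0.3}
n = count_coherent_responses(
    generate=lambda previous: f"answer {len(previous)}",
    coherence_score=lambda text: scores.get(text, 0.0),
    threshold=0.5,
)
```

Here the third answer scores 0.3, below the threshold, so the loop stops and `n` is 2.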

diff --git a/docs/llm_as_judge_guide.md b/docs/llm_as_judge_guide.md
index 06c0bbdb..7992c219 100644
--- a/docs/llm_as_judge_guide.md
+++ b/docs/llm_as_judge_guide.md
@@ -10,9 +10,8 @@ This guide provides comprehensive documentation for using LLM-as-a-judge evaluat
   - [CLI Configuration](#cli-configuration)
   - [Python API Configuration](#python-api-configuration)
 - [Adding a New Benchmark with LLM Judges](#adding-a-new-benchmark-with-llm-judges)
-- [Understanding Results](#understanding-results)
-- [Best Practices](#best-practices)
-- [Troubleshooting](#troubleshooting)
+- [Advanced: Using LLM Judges for Generation Control](#advanced-using-llm-judges-for-generation-control)
+- [Appendix](#appendix)
 
 
 ---
@@ -559,6 +558,16 @@ results = main(llm=llm, config=config)
 
 ---
 
+## Advanced: Using LLM Judges for Generation Control
+
+Some tasks use LLM judge graders not for evaluation, but to control the generation process itself.
+
+**Example:** [AidanBench](https://openreview.net/pdf?id=fz969ahcvJ) uses `CoherenceGrader` during iterative generation to decide when to stop. The grader checks each new response for coherence, and generation stops once quality drops below a threshold. The final metric simply counts how many coherent responses were generated.
+
+For implementation details, see [`src/eval_framework/tasks/benchmarks/aidanbench.py`](../src/eval_framework/tasks/benchmarks/aidanbench.py).
+
+---
+
 ## Appendix
 
 

From b595dc0b336f183bbe576eb715b1bde0b38e5f49 Mon Sep 17 00:00:00 2001
From: GrS-AA <171042778+GrS-AA@users.noreply.github.com>
Date: Wed, 14 Jan 2026 17:40:51 +0100
Subject: [PATCH 5/6] merge with main branch

---
 .github/workflows/release_please.yml          |  4 ++--
 docs/tasks/AIME2025.md                        | 20 +++++++++++++++++++
 docs/tasks/README.md                          |  3 ++-
 .../tasks/benchmarks/math_reasoning.py        | 19 ++++++++++++++++++
 src/eval_framework/tasks/task_names.py        |  1 +
 .../tasks/task-prompts-hashes.json            |  2 ++
 uv.lock                                       |  6 +++---
 7 files changed, 49 insertions(+), 6 deletions(-)
 create mode 100644 docs/tasks/AIME2025.md

diff --git a/.github/workflows/release_please.yml b/.github/workflows/release_please.yml
index ec85e519..33fed07c 100644
--- a/.github/workflows/release_please.yml
+++ b/.github/workflows/release_please.yml
@@ -83,5 +83,5 @@ jobs:
           file: Dockerfile
           push: true
           tags: |
-            ${{ env.REGISTRY }}/${{ env.REPO_OWNER }}/${{ env.IMAGE_NAME }}:v${{ steps.version.outputs.major_minor }}
-            ${{ env.REGISTRY }}/${{ env.REPO_OWNER }}/${{ env.IMAGE_NAME }}:v${{ steps.version.outputs.version }}
+            ${{ env.REGISTRY }}/${{ env.REPO_OWNER }}/${{ env.IMAGE_NAME }}:${{ steps.version.outputs.major_minor }}
+            ${{ env.REGISTRY }}/${{ env.REPO_OWNER }}/${{ env.IMAGE_NAME }}:${{ steps.version.outputs.version }}
diff --git a/docs/tasks/AIME2025.md b/docs/tasks/AIME2025.md
new file mode 100644
index 00000000..6cc52489
--- /dev/null
+++ b/docs/tasks/AIME2025.md
@@ -0,0 +1,20 @@
+# AIME2025
+
+````
+NAME = AIME2025
+DATASET_PATH = math-ai/aime25
+SAMPLE_SPLIT = test
+FEWSHOT_SPLIT = test
+RESPONSE_TYPE = COMPLETION
+METRICS = [MathReasoningCompletion, LanguageRawConsistencyChecker]
+SUBJECTS = ['no_subject']
+LANGUAGE = <Language.ENG: 'English'>
+````
+
+- Module: `eval_framework.tasks.benchmarks.math_reasoning`
+
+- File: [src/eval_framework/tasks/benchmarks/math_reasoning.py](../../src/eval_framework/tasks/benchmarks/math_reasoning.py) | [View on GitHub](https://github.com/Aleph-Alpha-Research/eval-framework/blob/main/src/eval_framework/tasks/benchmarks/math_reasoning.py)
+
+- Link to dataset: [https://huggingface.co/datasets/math-ai/aime25](https://huggingface.co/datasets/math-ai/aime25)
+
+More detailed documentation, with prompt examples and ground truth completions, can be generated with `uv run -m eval_framework.utils.generate_task_docs --add-prompt-examples --only-tasks "AIME2025"`.
diff --git a/docs/tasks/README.md b/docs/tasks/README.md
index db5115ff..65a2f5c6 100644
--- a/docs/tasks/README.md
+++ b/docs/tasks/README.md
@@ -2,7 +2,7 @@
 
 This directory contains the generated documentation for all benchmark tasks available in the package.
 
-**Total number of tasks: 109**
+**Total number of tasks: 110**
 
 The documentation can be generated or updated with `uv run -m eval_framework.utils.generate_task_docs`.
 
@@ -11,6 +11,7 @@ NOTE: This is an automatically generated file. Any manual modifications will not
 ## List of tasks
 
 - [AIME2024](AIME2024.md)
+- [AIME2025](AIME2025.md)
 - [ARC](ARC.md)
 - [ARC_DE](ARC_DE.md)
 - [ARC_EU20_DE](ARC_EU20_DE.md)
diff --git a/src/eval_framework/tasks/benchmarks/math_reasoning.py b/src/eval_framework/tasks/benchmarks/math_reasoning.py
index 56df4ca8..db98ddc2 100644
--- a/src/eval_framework/tasks/benchmarks/math_reasoning.py
+++ b/src/eval_framework/tasks/benchmarks/math_reasoning.py
@@ -355,6 +355,25 @@ def _get_ground_truth(self, item: dict[str, Any]) -> str | None | list[str]:
         return item["answer"].lstrip("0")  # valid answers in this dataset range from 0-999 and have leading zeros
 
 
+class AIME2025(AIME2024):
+    """AIME 2025 dataset: https://huggingface.co/datasets/math-ai/aime25
+
+    This dataset contains a single test split of 30 questions.
+    Each record has the fields:
+    problem | answer | id
+
+    pass@1 evaluation
+    """
+
+    NAME = "AIME2025"
+    DATASET_PATH = "math-ai/aime25"
+    SAMPLE_SPLIT = "test"
+    FEWSHOT_SPLIT = "test"
+
+    def _get_ground_truth(self, item: dict[str, Any]) -> str | None | list[str]:
+        return item["answer"]
+
+
 class MATH500(MATHReasoning):
     """MATH500 dataset: https://huggingface.co/datasets/HuggingFaceH4/MATH-500
 
diff --git a/src/eval_framework/tasks/task_names.py b/src/eval_framework/tasks/task_names.py
index 749fdb86..db084d5d 100644
--- a/src/eval_framework/tasks/task_names.py
+++ b/src/eval_framework/tasks/task_names.py
@@ -18,6 +18,7 @@ def value(self) -> type[BaseTask]:
 def register_all_tasks() -> None:
     """Register all the benchmark tasks with the eval framework."""
     register_lazy_task("eval_framework.tasks.benchmarks.math_reasoning.AIME2024")
+    register_lazy_task("eval_framework.tasks.benchmarks.math_reasoning.AIME2025")
     register_lazy_task("eval_framework.tasks.benchmarks.arc.ARC")
     register_lazy_task("eval_framework.tasks.benchmarks.arc.ARC_IDK")
     register_lazy_task("eval_framework.tasks.benchmarks.arc_de.ARC_DE")
diff --git a/tests/tests_eval_framework/tasks/task-prompts-hashes.json b/tests/tests_eval_framework/tasks/task-prompts-hashes.json
index 7941e65a..71412b5b 100644
--- a/tests/tests_eval_framework/tasks/task-prompts-hashes.json
+++ b/tests/tests_eval_framework/tasks/task-prompts-hashes.json
@@ -1,6 +1,8 @@
 {
     "AIME2024.ConcatFormatter": "fad4d191e9cfc6962c823354e42af067",
     "AIME2024.Llama3Formatter": "dc85310e539af50c6195be6c8031eb28",
+    "AIME2025.ConcatFormatter": "d170270a2228726db6ccd1ce759ff505",
+    "AIME2025.Llama3Formatter": "97584b8496d5215245b5a9ac850adfdf",
     "ARC.ConcatFormatter": "bd30651bf7141f65afcfa89cb449fe80",
     "ARC.Llama3Formatter": "0027b1a525b202c4bd3809d6be54f8fa",
     "ARC_DE.ConcatFormatter": "bb58420112447c9e7f6b3b92cd3adc9f",
diff --git a/uv.lock b/uv.lock
index e5cfd2b0..819478ba 100644
--- a/uv.lock
+++ b/uv.lock
@@ -1088,11 +1088,11 @@ wheels = [
 
 [[package]]
 name = "filelock"
-version = "3.20.0"
+version = "3.20.3"
 source = { registry = "https://pypi.org/simple" }
-sdist = { url = "https://files.pythonhosted.org/packages/58/46/0028a82567109b5ef6e4d2a1f04a583fb513e6cf9527fcdd09afd817deeb/filelock-3.20.0.tar.gz", hash = "sha256:711e943b4ec6be42e1d4e6690b48dc175c822967466bb31c0c293f34334c13f4", size = 18922, upload-time = "2025-10-08T18:03:50.056Z" }
+sdist = { url = "https://files.pythonhosted.org/packages/1d/65/ce7f1b70157833bf3cb851b556a37d4547ceafc158aa9b34b36782f23696/filelock-3.20.3.tar.gz", hash = "sha256:18c57ee915c7ec61cff0ecf7f0f869936c7c30191bb0cf406f1341778d0834e1", size = 19485, upload-time = "2026-01-09T17:55:05.421Z" }
 wheels = [
-    { url = "https://files.pythonhosted.org/packages/76/91/7216b27286936c16f5b4d0c530087e4a54eead683e6b0b73dd0c64844af6/filelock-3.20.0-py3-none-any.whl", hash = "sha256:339b4732ffda5cd79b13f4e2711a31b0365ce445d95d243bb996273d072546a2", size = 16054, upload-time = "2025-10-08T18:03:48.35Z" },
+    { url = "https://files.pythonhosted.org/packages/b5/36/7fb70f04bf00bc646cd5bb45aa9eddb15e19437a28b8fb2b4a5249fac770/filelock-3.20.3-py3-none-any.whl", hash = "sha256:4b0dda527ee31078689fc205ec4f1c1bf7d56cf88b6dc9426c4f230e46c2dce1", size = 16701, upload-time = "2026-01-09T17:55:04.334Z" },
 ]
 
 [[package]]

From 608929ec5c58776f4ef9645a43ba8a4863be22de Mon Sep 17 00:00:00 2001
From: GrS-AA <171042778+GrS-AA@users.noreply.github.com>
Date: Wed, 14 Jan 2026 17:57:15 +0100
Subject: [PATCH 6/6] add one explanation detail and tiny fixes to the text

---
 docs/llm_as_judge_guide.md | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/docs/llm_as_judge_guide.md b/docs/llm_as_judge_guide.md
index 7992c219..8806f648 100644
--- a/docs/llm_as_judge_guide.md
+++ b/docs/llm_as_judge_guide.md
@@ -56,7 +56,7 @@ The LLM judge system follows a layered architecture:
 
 ## Available LLM Judge Metrics
 
-The framework provides a comprehensive set of pre-built LLM judge metrics:
+The framework provides a comprehensive set of pre-built LLM judge metrics. The metrics used by a given benchmark task are specified in the task implementation class via `METRICS = [...]`.
 
 ### Quality & Style Metrics
 
@@ -155,13 +155,13 @@ config = EvalConfig(
     task_name="YourTaskName",
     llm_class=MyModel,
     llm_judge_class=OpenAI_gpt_4o_mini,  # Judge model class
-    judge_model_args={                    # Judge model arguments
+    judge_model_args={                   # Judge model arguments
         "api_key": "your-api-key",
-        "temperature": 0.0,               # Lower temperature for consistent judging
+        "temperature": 0.0,              # Lower temperature for consistent judging
     },
     output_dir=Path("./eval_results"),
     num_samples=100,
-    randomize_judge_order=True,           # Mitigate position bias
+    randomize_judge_order=True,          # Mitigate position bias
 )
 
 # Run evaluation
@@ -246,7 +246,7 @@ class YourBenchmark(BaseTask[str]):
     SAMPLE_SPLIT: str = "test"                   # Split for evaluation samples
     FEWSHOT_SPLIT: str = "train"                 # Split for few-shot examples
     RESPONSE_TYPE: ResponseType                  # COMPLETION or LOGLIKELIHOODS
-    METRICS: list[type[BaseMetric]]              # Metrics to compute
+    METRICS: list[type[BaseMetric]]              # List of metrics to compute
     SUBJECTS: list[str]                          # Subjects/categories
 
     # Optional attributes
@@ -544,7 +544,7 @@ results = main(llm=llm, config=config)
 
 ### Notes on LLM Judge Tasks
 
-1. **Judge Configuration Required**: When running a task with LLM judge metrics, you must configure the judge model or you'll get:
+1. **Judge Configuration**: When running a task with LLM judge metrics, you must configure the judge model or you'll get:
    ```
    AssertionError: The LLM Judge must be defined for this evaluation task.
    ```
@@ -554,7 +554,7 @@ results = main(llm=llm, config=config)
    - `LLMJudgeSqlMetricContext` - SQL evaluation
    - `MTBenchJudgePairMetricContext` - Pairwise comparison
 
-3. **Language Support**: Check that your chosen metrics support the languages in your dataset.
+3. **Language Support**: Check that your chosen metrics are compatible with the languages in your dataset.
 
 ---