From 5891466a68704691719d9783c862cbb79ec3d12b Mon Sep 17 00:00:00 2001 From: ahmedhammam Date: Mon, 12 Jan 2026 10:59:17 +0000 Subject: [PATCH 1/6] docs: add LLM as judge guide --- docs/llm_as_judge_guide.md | 591 +++++++++++++++++++++++++++++++++++++ 1 file changed, 591 insertions(+) create mode 100644 docs/llm_as_judge_guide.md diff --git a/docs/llm_as_judge_guide.md b/docs/llm_as_judge_guide.md new file mode 100644 index 00000000..843910ed --- /dev/null +++ b/docs/llm_as_judge_guide.md @@ -0,0 +1,591 @@ +# LLM as Judge Evaluation Guide + +This guide provides comprehensive documentation for using LLM-as-a-judge evaluation in the eval-framework. LLM judges leverage language models to evaluate the quality, correctness, and various other aspects of model outputs. + +## Table of Contents + +- [Architecture](#architecture) +- [Available LLM Judge Metrics](#available-llm-judge-metrics) +- [Configuration](#configuration) + - [CLI Configuration](#cli-configuration) + - [Python API Configuration](#python-api-configuration) +- [Adding a New Benchmark with LLM Judges](#adding-a-new-benchmark-with-llm-judges) +- [Understanding Results](#understanding-results) +- [Best Practices](#best-practices) +- [Troubleshooting](#troubleshooting) + + +--- + +## Architecture + +The LLM judge system follows a layered architecture: + +``` +┌─────────────────────────────────────────────────────────────┐ +│ EvalConfig / CLI │ +│ (llm_judge_class, judge_model_args, judge_model_name) │ +└─────────────────────────┬───────────────────────────────────┘ + │ +┌─────────────────────────▼───────────────────────────────────┐ +│ BaseLLMJudgeMetric │ +│ (Base class for all LLM judge metrics) │ +└─────────────────────────┬───────────────────────────────────┘ + │ +┌─────────────────────────▼───────────────────────────────────┐ +│ Graders Layer │ +│ (InstructionGrader, ComparisonGrader, ChatbotStyleGrader) │ +└─────────────────────────┬───────────────────────────────────┘ + │ 
+┌─────────────────────────▼───────────────────────────────────┐ +│ BaseLLM (Judge Model) │ +│ (OpenAIModel, HFLLM, VLLM, or custom model) │ +└─────────────────────────────────────────────────────────────┘ +``` + +### Core Components + +| Component | Description | +|-----------|-------------| +| `BaseLLMJudgeMetric` | Abstract base class for all LLM judge metrics | +| `Graders` | Specialized classes that format prompts and parse judge outputs | +| `PromptTemplate` | Defines system and user prompts for the judge | +| `GradingOutput` | Structured output from graders containing judge response | +| `MetricResult` | Final result including value, judge prompt, and judge response | + +--- + +## Available LLM Judge Metrics + +The framework provides a comprehensive set of pre-built LLM judge metrics: + +### Quality & Style Metrics + +| Metric | Class | Description | Languages | +|--------|-------|-------------|-----------| +| **Chatbot Style** | `LLMJudgeChatbotStyle` | Evaluates if responses follow chatbot conventions (friendly intro, verbose language, follow-up questions) | EN, DE | +| **Conciseness** | `LLMJudgeConciseness` | Assesses if responses are brief and to the point without unnecessary elaboration | EN, DE | +| **Coherence** | `LLMJudgeCoherence` | Evaluates logical flow and consistency of responses | EN | + +### Correctness Metrics + +| Metric | Class | Description | Languages | +|--------|-------|-------------|-----------| +| **Completion Accuracy** | `LLMJudgeCompletionAccuracy` | Evaluates if the model response matches the expected answer | EN | +| **Instruction Following** | `LLMJudgeInstruction` | Comprehensive evaluation of instruction adherence with multiple sub-metrics | EN, DE, FI | +| **Format Correctness** | `LLMJudgeFormatCorrectness` | Validates if output follows specified format requirements | EN | + +### Specialized Metrics + +| Metric | Class | Description | Languages | +|--------|-------|-------------|-----------| +| **SQL Quality** | 
`LLMJudgeSql` | Evaluates SQL query quality, efficiency, and accuracy (A-F grade) | EN, DE | +| **World Knowledge** | `LLMJudgeWorldKnowledge` | Detects if summaries contain information beyond the source text | EN, FR, DE | +| **Avoids Names** | `LLMJudgeAvoidsNames` | Checks if responses avoid using personal names | EN, FR, DE | +| **Refusal Classification** | `LLMJudgeRefusal` | Detects if the model refused to answer | EN | + +### Comparison Metrics (MT-Bench Style) + +| Metric | Class | Description | Languages | +|--------|-------|-------------|-----------| +| **Pairwise Judgement** | `MTBenchJudgePair` | Compares two responses and selects the better one (A wins, B wins, tie) | EN, DE, FI | +| **Single Judgement** | `MTBenchJudgeSingle` | Rates a single response on a 1-10 scale | EN, DE, FI | + +### Multi-Key Metrics + +Some metrics return multiple evaluation keys: + +**`LLMJudgeInstruction`** returns: +- `quality` - Overall quality score (normalized 0-1) +- `is_following_instruction` - Boolean instruction adherence +- `has_correct_grammar_and_spelling` - Boolean grammar check +- `is_context_consistent` - Boolean consistency with context +- `is_not_repeating` - Boolean repetition check +- `is_trustworthy` - Boolean truthfulness check +- `is_safe` - Boolean safety check + +**`LLMJudgeCoherence`** returns: +- `coherence_score` - Overall coherence rating +- `is_coherent` - Boolean coherence flag +- `has_repetition` - Boolean repetition detection + +--- + +## Configuration + +### CLI Configuration + +To use LLM judges via the command line: + +```bash +uv run eval_framework \ + --models path/to/your/models.py \ + --llm-name YourModelToEvaluate \ + --task-name YourTaskName \ + --judge-models path/to/judge_models.py \ + --judge-model-name OpenAI_gpt_4o_mini \ + --judge-model-args api_key="your-api-key" \ + --output-dir ./eval_results \ + --num-samples 100 +``` + +#### Judge-Specific CLI Arguments + +| Argument | Description | +|----------|-------------| +| 
`--judge-models` | Path to Python module containing judge model classes | +| `--judge-model-name` | Name of the judge model class to instantiate | +| `--judge-model-args` | Key=value pairs for judge model constructor arguments | +| `--randomize-judge-order` | Enable position randomization for pairwise comparisons | + +### Python API Configuration + +```python +from pathlib import Path +from eval_framework.llm.openai import OpenAI_gpt_4o_mini +from eval_framework.llm.huggingface import HFLLM +from eval_framework.main import main +from eval_framework.tasks.eval_config import EvalConfig + +# Define your model to evaluate +class MyModel(HFLLM): + LLM_NAME = "your-model-name" + +# Configure evaluation with LLM judge +config = EvalConfig( + task_name="YourTaskName", + llm_class=MyModel, + llm_judge_class=OpenAI_gpt_4o_mini, # Judge model class + judge_model_args={ # Judge model arguments + "api_key": "your-api-key", + "temperature": 0.0, # Lower temperature for consistent judging + }, + output_dir=Path("./eval_results"), + num_samples=100, + randomize_judge_order=True, # Mitigate position bias +) + +# Run evaluation +llm = MyModel() +results = main(llm=llm, config=config) +``` + +### Using Different Judge Models + +#### OpenAI Models + +```python +from eval_framework.llm.openai import OpenAIModel + +# Using pre-defined alias +from eval_framework.llm.openai import OpenAI_gpt_4o_mini + +# Or configure directly +class CustomOpenAIJudge(OpenAIModel): + LLM_NAME = "gpt-4-turbo" + +config = EvalConfig( + llm_judge_class=CustomOpenAIJudge, + judge_model_args={ + "api_key": "your-api-key", + "temperature": 0.0, + }, + # ... +) +``` + +#### Deepseek Models + +```python +from eval_framework.llm.openai import Deepseek_chat + +config = EvalConfig( + llm_judge_class=Deepseek_chat, + judge_model_args={ + # Uses DEEPSEEK_API_KEY env variable by default + }, + # ... 
+) +``` + +#### Local vLLM Models + +```python +from eval_framework.llm.vllm import VLLM + +class LocalJudge(VLLM): + LLM_NAME = "meta-llama/Meta-Llama-3-70B-Instruct" + +config = EvalConfig( + llm_judge_class=LocalJudge, + judge_model_args={ + "tensor_parallel_size": 4, + "max_model_len": 8192, + }, + # ... +) +``` + +--- + +## Adding a New Benchmark with LLM Judges + +This section provides a complete guide to creating a new benchmark task that uses LLM judge metrics for evaluation. + +### Step 1: Define Your Task Class + +Every benchmark inherits from `BaseTask[SubjectType]` and requires these class attributes: + +```python +from typing import Any +from eval_framework.tasks.base import BaseTask, ResponseType, Sample, Language + +class YourBenchmark(BaseTask[str]): + # Required attributes + NAME: str = "YourBenchmarkName" # Display name + DATASET_PATH: str = "huggingface/dataset" # HuggingFace dataset path + SAMPLE_SPLIT: str = "test" # Split for evaluation samples + FEWSHOT_SPLIT: str = "train" # Split for few-shot examples + RESPONSE_TYPE: ResponseType # COMPLETION or LOGLIKELIHOODS + METRICS: list[type[BaseMetric]] # Metrics to compute + SUBJECTS: list[str] # Subjects/categories + + # Optional attributes + LANGUAGE: Language | None = Language.ENG # Primary language + HF_REVISION: str | None = None # Dataset version pin +``` + +### Step 2: Choose Your LLM Judge Metrics + +Select from the available LLM judge metrics based on your evaluation requirements: + +```python +# Quality evaluation +from eval_framework.metrics.llm.llm_judge_instruction import LLMJudgeInstruction +from eval_framework.metrics.llm.llm_judge_chatbot_style import LLMJudgeChatbotStyle +from eval_framework.metrics.llm.llm_judge_conciseness import LLMJudgeConciseness +from eval_framework.metrics.llm.llm_judge_coherence import LLMJudgeCoherence + +# Correctness evaluation +from eval_framework.metrics.llm.llm_judge_completion_accuracy import LLMJudgeCompletionAccuracy +from 
eval_framework.metrics.llm.llm_judge_format_correctness import LLMJudgeFormatCorrectness + +# Specialized evaluation +from eval_framework.metrics.llm.llm_judge_sql import LLMJudgeSql +from eval_framework.metrics.llm.llm_judge_world_knowledge import LLMJudgeWorldKnowledge +from eval_framework.metrics.llm.llm_judge_contains_names import LLMJudgeAvoidsNames +from eval_framework.metrics.llm.llm_judge_refusal import LLMJudgeRefusal + +# Comparison evaluation +from eval_framework.metrics.llm.llm_judge_mtbench_pair import MTBenchJudgePair +from eval_framework.metrics.llm.llm_judge_mtbench_single import MTBenchJudgeSingle +``` + +### Step 3: Implement Required Methods + +Every benchmark must implement these core methods: + +```python +def _get_instruction_text(self, item: dict[str, Any]) -> str: + """Generate the instruction/question for the model.""" + pass + +def _get_ground_truth(self, item: dict[str, Any]) -> str | list[str] | None: + """Extract the expected answer(s) from a dataset item.""" + pass +``` + +### Step 4: Provide Context for Judge Metrics + +Many LLM judge metrics require additional context via the `_get_context` method: + +```python +from eval_framework.shared.types import LanguageMetricContext, BaseMetricContext + +def _get_context(self, item: dict[str, Any]) -> BaseMetricContext | None: + """Provide additional context for metric evaluation.""" + return LanguageMetricContext( + language=item.get("language", "en"), + ) +``` + +### Step 5: Create Custom Metrics (If Needed) + +If the pre-built LLM judge metrics don't cover your evaluation requirements, you can create custom metrics. 
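Custom judge metrics usually have to turn the judge's free-text reply into a number. The following is a minimal, framework-independent sketch of a defensive parser, assuming the judge was asked for a JSON object with a numeric `score` on a 1-5 scale. The helper name `parse_judge_score` and its parameters are illustrative and not part of eval-framework.

```python
import json
import re
from typing import Optional


def parse_judge_score(raw: str, min_score: int = 1, max_score: int = 5) -> Optional[float]:
    """Return the judge's score divided by max_score, or None if it is unusable."""
    # Judges often wrap JSON in prose or code fences, so take the span
    # from the first '{' to the last '}' and try to decode that.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        return None
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if not isinstance(parsed, dict):
        return None
    score = parsed.get("score")
    # Reject missing, boolean, non-numeric, or out-of-range scores instead of guessing.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return None
    if not min_score <= score <= max_score:
        return None
    return score / max_score
```

Returning `None` for anything unparseable keeps bad judge replies visible in the results instead of silently mapping them to a mid-range default.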
+ +#### Basic Custom Metric + +Create a new LLM judge metric by extending `BaseLLMJudgeMetric`: + +```python +from eval_framework.llm.base import BaseLLM +from eval_framework.metrics.base import MetricResult +from eval_framework.metrics.llm.base import BaseLLMJudgeMetric +from eval_framework.shared.types import Completion +from template_formatting.formatter import Message, Role + +class CustomJudgeMetric(BaseLLMJudgeMetric): + NAME = "Custom Judge Metric" + + def __init__(self, llm_judge: BaseLLM, randomize_order: bool = False): + super().__init__(llm_judge, randomize_order) + + def calculate(self, response: Completion) -> list[MetricResult]: + if response.error is not None: + return [MetricResult( + metric_name=self.NAME, + value=None, + higher_is_better=True, + error=response.error, + )] + + # Create judge prompt + judge_prompt = f"""Evaluate the following response for quality. + +Question: {response.system_user_instruction} + +Response: {response.sanitized_completion} + +Rate the quality on a scale of 1-5, where 5 is excellent. 
+Respond with ONLY a JSON object: {{"score": , "reasoning": ""}}""" + + # Get judge response + messages = [Message(role=Role.USER, content=judge_prompt)] + output = self._llm_judge.generate_from_messages([messages]) + + # Parse result (implement your parsing logic) + import json + try: + parsed = json.loads(output[0].completion) + score = parsed.get("score", 3) / 5.0 # Normalize to 0-1 + except (json.JSONDecodeError, TypeError, AttributeError): # Malformed or non-numeric judge output + score = None + + return [MetricResult( + metric_name=self.NAME, + value=score, + higher_is_better=True, + llm_judge_prompt=judge_prompt, + llm_judge_response=output[0].completion, + )] +``` + +#### Creating a Custom Grader + +For more sophisticated evaluation, create a custom grader: + +```python +from collections.abc import Mapping +from eval_framework.llm.base import BaseLLM +from eval_framework.metrics.llm.graders.models import ( + GradingOutput, + PromptTemplate, + parse_json_output, +) +from eval_framework.metrics.llm.graders.language import Language + +class CustomGradingOutput(GradingOutput): + """Structured output from the grader.""" + quality_score: float | None + has_errors: bool | None + feedback: str | None + +class CustomGrader: + RESPONSE_KEY = "response" + CRITERIA_KEY = "criteria" + + PROMPT_TEMPLATES = { + Language("en"): PromptTemplate( + system_prompt="""You are an expert evaluator. Assess the given response +based on the specified criteria. 
+ +Provide your evaluation as JSON: +{ + "quality_score": float (0.0 to 1.0), + "has_errors": bool, + "feedback": str +}""", + user_prompt=f"""**Response to Evaluate**: +{{{RESPONSE_KEY}}} + +**Evaluation Criteria**: +{{{CRITERIA_KEY}}}""", + ), + } + + def __init__( + self, + grading_model: BaseLLM, + prompt_templates: Mapping[Language, PromptTemplate] = PROMPT_TEMPLATES, + ): + self._grading_model = grading_model + self._prompt_templates = prompt_templates + + def grade( + self, + response: str, + criteria: str, + language: Language, + ) -> CustomGradingOutput: + try: + prompt_template = language.language_config(self._prompt_templates) + except Exception: # No template for this language: fall back to English + prompt_template = Language("en").language_config(self._prompt_templates) + + messages = prompt_template.to_messages( + [], # system key-value pairs + [ # user key-value pairs + (self.RESPONSE_KEY, response), + (self.CRITERIA_KEY, criteria), + ], + ) + + raw_completion = self._grading_model.generate_from_messages([messages])[0] + loaded_json = parse_json_output(raw_completion.completion) + + return CustomGradingOutput( + quality_score=loaded_json.get("quality_score"), + has_errors=loaded_json.get("has_errors"), + feedback=loaded_json.get("feedback"), + judge_prompt=raw_completion.prompt, + judge_response=raw_completion.completion, + ) +``` + +#### Using the Custom Grader in a Metric + +```python +class CustomGraderMetric(BaseLLMJudgeMetric): + NAME = "Custom Grader Metric" + + def __init__(self, llm_judge: BaseLLM, randomize_order: bool = False): + super().__init__(llm_judge, randomize_order) + self._grader = CustomGrader(llm_judge) + + def calculate(self, response: Completion) -> list[MetricResult]: + if response.error is not None: + return [self._create_metric_result( + metric_name=self.NAME, + higher_is_better=True, + value=None, + error=response.error, + )] + + language = Language(response.get_instruction_language()) + + grading = self._grader.grade( + response=response.sanitized_completion, + criteria="Evaluate for 
accuracy, clarity, and completeness.", + language=language, + ) + + return [self._create_metric_result( + metric_name=self.NAME, + higher_is_better=True, + value=grading.quality_score, + llm_judge_prompt=grading.judge_prompt, + llm_judge_response=grading.judge_response, + )] +``` + +### Step 6: Register Your Benchmark + +Add your benchmark to the task registry in `src/eval_framework/tasks/task_names.py`: + +```python +# In register_all_tasks() function +register_lazy_task("eval_framework.tasks.benchmarks.your_module.WritingQualityBenchmark") +register_lazy_task("eval_framework.tasks.benchmarks.your_module.SQLGenerationBenchmark") +``` + +### Step 7: Run Your Benchmark + +**Via CLI:** + +```bash +uv run eval_framework \ + --models path/to/models.py \ + --llm-name YourModel \ + --task-name WritingQuality \ + --task-subjects "creative_writing" \ + --judge-models eval_framework.llm.openai \ + --judge-model-name OpenAI_gpt_4o_mini \ + --judge-model-args api_key="$OPENAI_API_KEY" \ + --output-dir ./eval_results \ + --num-samples 50 +``` + +**Via Python:** + +```python +from pathlib import Path +from eval_framework.llm.openai import OpenAI_gpt_4o_mini +from eval_framework.main import main +from eval_framework.tasks.eval_config import EvalConfig + +# Your model class +from your_models import YourModel + +config = EvalConfig( + task_name="WritingQuality", + task_subjects=["creative_writing", "technical_writing"], + llm_class=YourModel, + llm_judge_class=OpenAI_gpt_4o_mini, + judge_model_args={"api_key": "your-api-key", "temperature": 0.0}, + output_dir=Path("./eval_results"), + num_samples=50, + num_fewshot=3, + randomize_judge_order=True, # For fair pairwise comparisons +) + +llm = YourModel() +results = main(llm=llm, config=config) +``` + + +### Notes on LLM Judge Tasks + +1. 
**Judge Configuration Required**: When running a task with LLM judge metrics, you must configure the judge model or you'll get: + ``` + AssertionError: The LLM Judge must be defined for this evaluation task. + ``` + +2. **Context Matching**: Ensure your `_get_context` returns the correct context type for your metrics: + - `LanguageMetricContext` - Most metrics + - `LLMJudgeSqlMetricContext` - SQL evaluation + - `MTBenchJudgePairMetricContext` - Pairwise comparison + +3. **Language Support**: Check that your chosen metrics support the languages in your dataset. + +--- + +## Appendix + + +### Import Reference + +```python +# Judge metrics +from eval_framework.metrics.llm.llm_judge_instruction import LLMJudgeInstruction +from eval_framework.metrics.llm.llm_judge_chatbot_style import LLMJudgeChatbotStyle +from eval_framework.metrics.llm.llm_judge_completion_accuracy import LLMJudgeCompletionAccuracy +from eval_framework.metrics.llm.llm_judge_conciseness import LLMJudgeConciseness +from eval_framework.metrics.llm.llm_judge_coherence import LLMJudgeCoherence +from eval_framework.metrics.llm.llm_judge_format_correctness import LLMJudgeFormatCorrectness +from eval_framework.metrics.llm.llm_judge_sql import LLMJudgeSql +from eval_framework.metrics.llm.llm_judge_world_knowledge import LLMJudgeWorldKnowledge +from eval_framework.metrics.llm.llm_judge_contains_names import LLMJudgeAvoidsNames +from eval_framework.metrics.llm.llm_judge_refusal import LLMJudgeRefusal +from eval_framework.metrics.llm.llm_judge_mtbench_pair import MTBenchJudgePair +from eval_framework.metrics.llm.llm_judge_mtbench_single import MTBenchJudgeSingle + +# Base classes for custom metrics +from eval_framework.metrics.llm.base import BaseLLMJudgeMetric +from eval_framework.metrics.llm.graders.models import GradingOutput, PromptTemplate + +# Judge model classes +from eval_framework.llm.openai import OpenAIModel, OpenAI_gpt_4o_mini, Deepseek_chat +from eval_framework.llm.vllm import VLLM +from 
eval_framework.llm.huggingface import HFLLM +``` + From b0ed1134229502437f813627445ea026410f828a Mon Sep 17 00:00:00 2001 From: ahmedhammam Date: Mon, 12 Jan 2026 11:04:51 +0000 Subject: [PATCH 2/6] fix lint --- docs/llm_as_judge_guide.md | 39 +++++++++++++++++++------------------- 1 file changed, 19 insertions(+), 20 deletions(-) diff --git a/docs/llm_as_judge_guide.md b/docs/llm_as_judge_guide.md index 843910ed..06c0bbdb 100644 --- a/docs/llm_as_judge_guide.md +++ b/docs/llm_as_judge_guide.md @@ -249,7 +249,7 @@ class YourBenchmark(BaseTask[str]): RESPONSE_TYPE: ResponseType # COMPLETION or LOGLIKELIHOODS METRICS: list[type[BaseMetric]] # Metrics to compute SUBJECTS: list[str] # Subjects/categories - + # Optional attributes LANGUAGE: Language | None = Language.ENG # Primary language HF_REVISION: str | None = None # Dataset version pin @@ -326,10 +326,10 @@ from template_formatting.formatter import Message, Role class CustomJudgeMetric(BaseLLMJudgeMetric): NAME = "Custom Judge Metric" - + def __init__(self, llm_judge: BaseLLM, randomize_order: bool = False): super().__init__(llm_judge, randomize_order) - + def calculate(self, response: Completion) -> list[MetricResult]: if response.error is not None: return [MetricResult( @@ -338,7 +338,7 @@ class CustomJudgeMetric(BaseLLMJudgeMetric): higher_is_better=True, error=response.error, )] - + # Create judge prompt judge_prompt = f"""Evaluate the following response for quality. @@ -348,11 +348,11 @@ Response: {response.sanitized_completion} Rate the quality on a scale of 1-5, where 5 is excellent. 
Respond with ONLY a JSON object: {{"score": , "reasoning": ""}}""" - + # Get judge response messages = [Message(role=Role.USER, content=judge_prompt)] output = self._llm_judge.generate_from_messages([messages]) - + # Parse result (implement your parsing logic) import json try: @@ -360,7 +360,7 @@ Respond with ONLY a JSON object: {{"score": , "reasoning": " list[MetricResult]: if response.error is not None: return [self._create_metric_result( @@ -470,15 +470,15 @@ class CustomGraderMetric(BaseLLMJudgeMetric): value=None, error=response.error, )] - + language = Language(response.get_instruction_language()) - + grading = self._grader.grade( response=response.sanitized_completion, criteria="Evaluate for accuracy, clarity, and completeness.", language=language, ) - + return [self._create_metric_result( metric_name=self.NAME, higher_is_better=True, @@ -588,4 +588,3 @@ from eval_framework.llm.openai import OpenAIModel, OpenAI_gpt_4o_mini, Deepseek_ from eval_framework.llm.vllm import VLLM from eval_framework.llm.huggingface import HFLLM ``` - From ade157e2f5510e7ba4f78f893234ffdf75007c84 Mon Sep 17 00:00:00 2001 From: ahmedhammam Date: Wed, 14 Jan 2026 11:24:19 +0000 Subject: [PATCH 3/6] add guide to sphinx --- docs/benchmarks_and_metrics.md | 61 ++++++++++++++++++++++++++-------- docs/index.rst | 1 + 2 files changed, 48 insertions(+), 14 deletions(-) diff --git a/docs/benchmarks_and_metrics.md b/docs/benchmarks_and_metrics.md index 200acc1d..92b20a13 100644 --- a/docs/benchmarks_and_metrics.md +++ b/docs/benchmarks_and_metrics.md @@ -64,36 +64,69 @@ Additional task documentation can be generated with the script `utils/generate-t | Metrics Type | Metrics | |-|-------------------------------| -| Completion Metrics | Accuracy +| Completion Metrics | Accuracy Completion +|| AidanBench | +|| **Text Similarity** || Bleu | +|| Linewise BLEU | +|| Response to Original BLEU | || Chrf | +|| Linewise chrF | +|| COMET | || Ter | +|| Linewise TER | || F1 | || Rouge 1 | || Rouge 2 | 
|| Rouge-L | +|| Rouge-Geometric-Mean | +|| Exponential Similarity | +|| **Code Evaluation** || Code Assertion | +|| Code Execution Pass@1 | +|| **Language & Format** || Language Checker | -|| Length Checker | -|| Math Reasoning | -|| Placeholder Checker | -|| Text Counter | +|| German Completion Check | +|| Language Consistency | +|| Language Consistency Raw | || CSV Format | || JSON Format | || Postscript Format | +|| **Text Statistics & Control** +|| Length Checker | +|| Word Count | +|| Sentence Count | +|| Paragraph Count | +|| Response to Original Length Ratio | +|| Text Counter | +|| Word Repetition | +|| Placeholder Checker | +|| **Task-Specific** +|| Math Reasoning | +|| Concordance Index | || Custom IFEval Checker | || Custom CWE Checker | || Custom NIAH Checker | || Custom Grid Comparison Checker | -|| Repetition Checker | +|| StructMetric | +|| RenderableStructMetric | | Loglikelihood Metrics | Accuracy Loglikelihood | -|| Normalized Accuracy Loglikelihood | +|| Accuracy Normalized Loglikelihood | +|| Confidence-weighted Accuracy | +|| Distributional Correctness Score | || Probability Mass | -| LLM Metrics | Chatbot Style Judge | -|| Completion Accuracy Judge -|| Conciseness Judge -|| Contains Names Judge -|| Instruction Judge -|| SQL Format -|| World Knowledge Judge +|| Probability Mass Normalized | +|| Ternary Score | +| LLM Judge Metrics | Chatbot Style | +|| Coherence | +|| Completion Accuracy | +|| Conciseness | +|| Avoids Names | +|| Format Correctness | +|| Instruction Following | +|| Pairwise Judgement (MTBench) | +|| Single Judgement (MTBench) | +|| Refusal Classifier | +|| SQL Quality | +|| World Knowledge | | Efficiency Metrics | Bytes per Sequence Position | diff --git a/docs/index.rst b/docs/index.rst index 49d7e066..840af808 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -67,6 +67,7 @@ Documentation Overview controlling_upload_results.md docker_guide.md evaluate_huggingface_model.md + llm_as_judge_guide.md loglikelihood_task_guide.md 
model_arguments.md overview_dataloading.md From 2cda47b2ce89419f5bd5cc341c332fbd94d6ed4d Mon Sep 17 00:00:00 2001 From: ahmedhammam Date: Wed, 14 Jan 2026 11:32:10 +0000 Subject: [PATCH 4/6] add aidan bench metric --- docs/llm_as_judge_guide.md | 15 ++++++++++++--- 1 file changed, 12 insertions(+), 3 deletions(-) diff --git a/docs/llm_as_judge_guide.md b/docs/llm_as_judge_guide.md index 06c0bbdb..7992c219 100644 --- a/docs/llm_as_judge_guide.md +++ b/docs/llm_as_judge_guide.md @@ -10,9 +10,8 @@ This guide provides comprehensive documentation for using LLM-as-a-judge evaluat - [CLI Configuration](#cli-configuration) - [Python API Configuration](#python-api-configuration) - [Adding a New Benchmark with LLM Judges](#adding-a-new-benchmark-with-llm-judges) -- [Understanding Results](#understanding-results) -- [Best Practices](#best-practices) -- [Troubleshooting](#troubleshooting) +- [Advanced: Using LLM Judges for Generation Control](#advanced-using-llm-judges-for-generation-control) +- [Appendix](#appendix) --- @@ -559,6 +558,16 @@ results = main(llm=llm, config=config) --- +## Advanced: Using LLM Judges for Generation Control + +Some tasks use LLM judge graders not for evaluation, but to control the generation process itself. + +**Example:** [AidanBench](https://openreview.net/pdf?id=fz969ahcvJ) uses `CoherenceGrader` during iterative generation to decide when to stop. The grader checks each new response for coherence, and generation stops once quality drops below a threshold. The final metric simply counts how many coherent responses were generated. + +For implementation details, see [`src/eval_framework/tasks/benchmarks/aidanbench.py`](../src/eval_framework/tasks/benchmarks/aidanbench.py). 
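The control flow described above can be sketched independently of the framework. This is a toy illustration only, not the AidanBench implementation: `generate`, `coherence_score`, the threshold, and the cap on responses are all hypothetical stand-ins. In the framework the coherence judgment comes from `CoherenceGrader` backed by a judge model.

```python
from collections.abc import Callable


def count_coherent_responses(
    generate: Callable[[list[str]], str],
    coherence_score: Callable[[str], float],
    threshold: float = 0.5,
    max_responses: int = 10,
) -> int:
    """Generate responses until one falls below the coherence threshold.

    Returns how many responses passed the check before generation stopped.
    """
    accepted: list[str] = []
    while len(accepted) < max_responses:
        # The generator receives the earlier answers so it can try to produce a new one.
        candidate = generate(accepted)
        if coherence_score(candidate) < threshold:
            break  # quality dropped: stop generating
        accepted.append(candidate)
    return len(accepted)


# Toy stand-ins: the third answer is scored as incoherent.
answers = iter(["answer one", "answer two", "asdf qwer", "answer four"])
scores = {"answer one": 0.9, "answer two": 0.8, "asdf qwer": 0.1, "answer four": 0.9}

print(count_coherent_responses(lambda previous: next(answers), scores.__getitem__))  # prints 2
```

The cap on responses is a design choice for the sketch: without it, a judge that never rejects a response would keep the loop running indefinitely.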
+ +--- + ## Appendix From b595dc0b336f183bbe576eb715b1bde0b38e5f49 Mon Sep 17 00:00:00 2001 From: GrS-AA <171042778+GrS-AA@users.noreply.github.com> Date: Wed, 14 Jan 2026 17:40:51 +0100 Subject: [PATCH 5/6] merge with main branch --- .github/workflows/release_please.yml | 4 ++-- docs/tasks/AIME2025.md | 20 +++++++++++++++++++ docs/tasks/README.md | 3 ++- .../tasks/benchmarks/math_reasoning.py | 19 ++++++++++++++++++ src/eval_framework/tasks/task_names.py | 1 + .../tasks/task-prompts-hashes.json | 2 ++ uv.lock | 6 +++--- 7 files changed, 49 insertions(+), 6 deletions(-) create mode 100644 docs/tasks/AIME2025.md diff --git a/.github/workflows/release_please.yml b/.github/workflows/release_please.yml index ec85e519..33fed07c 100644 --- a/.github/workflows/release_please.yml +++ b/.github/workflows/release_please.yml @@ -83,5 +83,5 @@ jobs: file: Dockerfile push: true tags: | - ${{ env.REGISTRY }}/${{ env.REPO_OWNER }}/${{ env.IMAGE_NAME }}:v${{ steps.version.outputs.major_minor }} - ${{ env.REGISTRY }}/${{ env.REPO_OWNER }}/${{ env.IMAGE_NAME }}:v${{ steps.version.outputs.version }} + ${{ env.REGISTRY }}/${{ env.REPO_OWNER }}/${{ env.IMAGE_NAME }}:${{ steps.version.outputs.major_minor }} + ${{ env.REGISTRY }}/${{ env.REPO_OWNER }}/${{ env.IMAGE_NAME }}:${{ steps.version.outputs.version }} diff --git a/docs/tasks/AIME2025.md b/docs/tasks/AIME2025.md new file mode 100644 index 00000000..6cc52489 --- /dev/null +++ b/docs/tasks/AIME2025.md @@ -0,0 +1,20 @@ +# AIME2025 + +```` +NAME = AIME2025 +DATASET_PATH = math-ai/aime25 +SAMPLE_SPLIT = test +FEWSHOT_SPLIT = test +RESPONSE_TYPE = COMPLETION +METRICS = [MathReasoningCompletion, LanguageRawConsistencyChecker] +SUBJECTS = ['no_subject'] +LANGUAGE = +```` + +- Module: `eval_framework.tasks.benchmarks.math_reasoning` + +- File: [src/eval_framework/tasks/benchmarks/math_reasoning.py](../../src/eval_framework/tasks/benchmarks/math_reasoning.py) | [View on 
GitHub](https://github.com/Aleph-Alpha-Research/eval-framework/blob/main/src/eval_framework/tasks/benchmarks/math_reasoning.py) + +- Link to dataset: [https://huggingface.co/datasets/math-ai/aime25](https://huggingface.co/datasets/math-ai/aime25) + +More detailed documentation, with prompt examples and ground truth completions, can be generated with `uv run -m eval_framework.utils.generate_task_docs --add-prompt-examples --only-tasks "AIME2025"`. diff --git a/docs/tasks/README.md b/docs/tasks/README.md index db5115ff..65a2f5c6 100644 --- a/docs/tasks/README.md +++ b/docs/tasks/README.md @@ -2,7 +2,7 @@ This directory contains the generated documentation for all benchmark tasks available in the package. -**Total number of tasks: 109** +**Total number of tasks: 110** The documentation can be generated or updated with `uv run -m eval_framework.utils.generate_task_docs`. @@ -11,6 +11,7 @@ NOTE: This is an automatically generated file. Any manual modifications will not ## List of tasks - [AIME2024](AIME2024.md) +- [AIME2025](AIME2025.md) - [ARC](ARC.md) - [ARC_DE](ARC_DE.md) - [ARC_EU20_DE](ARC_EU20_DE.md) diff --git a/src/eval_framework/tasks/benchmarks/math_reasoning.py b/src/eval_framework/tasks/benchmarks/math_reasoning.py index 56df4ca8..db98ddc2 100644 --- a/src/eval_framework/tasks/benchmarks/math_reasoning.py +++ b/src/eval_framework/tasks/benchmarks/math_reasoning.py @@ -355,6 +355,25 @@ def _get_ground_truth(self, item: dict[str, Any]) -> str | None | list[str]: return item["answer"].lstrip("0") # valid answers in this dataset range from 0-999 and have leading zeros +class AIME2025(AIME2024): + """AIME 2025 dataset: https://huggingface.co/datasets/math-ai/aime25 + + This dataset contains a single test split of 30 questions. 
+    Data contains
+    problem | answer | id
+
+    pass@1 evaluation
+    """
+
+    NAME = "AIME2025"
+    DATASET_PATH = "math-ai/aime25"
+    SAMPLE_SPLIT = "test"
+    FEWSHOT_SPLIT = "test"
+
+    def _get_ground_truth(self, item: dict[str, Any]) -> str | None | list[str]:
+        return item["answer"]
+
+
 class MATH500(MATHReasoning):
     """MATH500 dataset: https://huggingface.co/datasets/HuggingFaceH4/MATH-500
diff --git a/src/eval_framework/tasks/task_names.py b/src/eval_framework/tasks/task_names.py
index 749fdb86..db084d5d 100644
--- a/src/eval_framework/tasks/task_names.py
+++ b/src/eval_framework/tasks/task_names.py
@@ -18,6 +18,7 @@ def value(self) -> type[BaseTask]:
 def register_all_tasks() -> None:
     """Register all the benchmark tasks with the eval framework."""
     register_lazy_task("eval_framework.tasks.benchmarks.math_reasoning.AIME2024")
+    register_lazy_task("eval_framework.tasks.benchmarks.math_reasoning.AIME2025")
     register_lazy_task("eval_framework.tasks.benchmarks.arc.ARC")
     register_lazy_task("eval_framework.tasks.benchmarks.arc.ARC_IDK")
     register_lazy_task("eval_framework.tasks.benchmarks.arc_de.ARC_DE")
diff --git a/tests/tests_eval_framework/tasks/task-prompts-hashes.json b/tests/tests_eval_framework/tasks/task-prompts-hashes.json
index 7941e65a..71412b5b 100644
--- a/tests/tests_eval_framework/tasks/task-prompts-hashes.json
+++ b/tests/tests_eval_framework/tasks/task-prompts-hashes.json
@@ -1,6 +1,8 @@
 {
     "AIME2024.ConcatFormatter": "fad4d191e9cfc6962c823354e42af067",
     "AIME2024.Llama3Formatter": "dc85310e539af50c6195be6c8031eb28",
+    "AIME2025.ConcatFormatter": "d170270a2228726db6ccd1ce759ff505",
+    "AIME2025.Llama3Formatter": "97584b8496d5215245b5a9ac850adfdf",
     "ARC.ConcatFormatter": "bd30651bf7141f65afcfa89cb449fe80",
     "ARC.Llama3Formatter": "0027b1a525b202c4bd3809d6be54f8fa",
     "ARC_DE.ConcatFormatter": "bb58420112447c9e7f6b3b92cd3adc9f",
diff --git a/uv.lock b/uv.lock
index e5cfd2b0..819478ba 100644
--- a/uv.lock
+++ b/uv.lock
@@ -1088,11 +1088,11 @@ wheels = [

 [[package]]
 name = "filelock"
-version = "3.20.0"
+version = "3.20.3"
 source = { registry = "https://pypi.org/simple" }
-sdist = { url = "https://files.pythonhosted.org/packages/58/46/0028a82567109b5ef6e4d2a1f04a583fb513e6cf9527fcdd09afd817deeb/filelock-3.20.0.tar.gz", hash = "sha256:711e943b4ec6be42e1d4e6690b48dc175c822967466bb31c0c293f34334c13f4", size = 18922, upload-time = "2025-10-08T18:03:50.056Z" }
+sdist = { url = "https://files.pythonhosted.org/packages/1d/65/ce7f1b70157833bf3cb851b556a37d4547ceafc158aa9b34b36782f23696/filelock-3.20.3.tar.gz", hash = "sha256:18c57ee915c7ec61cff0ecf7f0f869936c7c30191bb0cf406f1341778d0834e1", size = 19485, upload-time = "2026-01-09T17:55:05.421Z" }
 wheels = [
-    { url = "https://files.pythonhosted.org/packages/76/91/7216b27286936c16f5b4d0c530087e4a54eead683e6b0b73dd0c64844af6/filelock-3.20.0-py3-none-any.whl", hash = "sha256:339b4732ffda5cd79b13f4e2711a31b0365ce445d95d243bb996273d072546a2", size = 16054, upload-time = "2025-10-08T18:03:48.35Z" },
+    { url = "https://files.pythonhosted.org/packages/b5/36/7fb70f04bf00bc646cd5bb45aa9eddb15e19437a28b8fb2b4a5249fac770/filelock-3.20.3-py3-none-any.whl", hash = "sha256:4b0dda527ee31078689fc205ec4f1c1bf7d56cf88b6dc9426c4f230e46c2dce1", size = 16701, upload-time = "2026-01-09T17:55:04.334Z" },
 ]

 [[package]]

From 608929ec5c58776f4ef9645a43ba8a4863be22de Mon Sep 17 00:00:00 2001
From: GrS-AA <171042778+GrS-AA@users.noreply.github.com>
Date: Wed, 14 Jan 2026 17:57:15 +0100
Subject: [PATCH 6/6] add one explaination detail and tiny fixes on the text

---
 docs/llm_as_judge_guide.md | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/docs/llm_as_judge_guide.md b/docs/llm_as_judge_guide.md
index 7992c219..8806f648 100644
--- a/docs/llm_as_judge_guide.md
+++ b/docs/llm_as_judge_guide.md
@@ -56,7 +56,7 @@ The LLM judge system follows a layered architecture:

 ## Available LLM Judge Metrics

-The framework provides a comprehensive set of pre-built LLM judge metrics:
+The framework provides a comprehensive set of pre-built LLM judge metrics. The metric which is used in a given benchmark task is specified in the task implementation class with `METRICS = [...]`.

 ### Quality & Style Metrics

@@ -155,13 +155,13 @@ config = EvalConfig(
     task_name="YourTaskName",
     llm_class=MyModel,
     llm_judge_class=OpenAI_gpt_4o_mini, # Judge model class
-    judge_model_args={ # Judge model arguments
+    judge_model_args={ # Judge model arguments
         "api_key": "your-api-key",
-        "temperature": 0.0, # Lower temperature for consistent judging
+        "temperature": 0.0, # Lower temperature for consistent judging
     },
     output_dir=Path("./eval_results"),
     num_samples=100,
-    randomize_judge_order=True, # Mitigate position bias
+    randomize_judge_order=True, # Mitigate position bias
 )

 # Run evaluation
@@ -246,7 +246,7 @@ class YourBenchmark(BaseTask[str]):
     SAMPLE_SPLIT: str = "test" # Split for evaluation samples
     FEWSHOT_SPLIT: str = "train" # Split for few-shot examples
     RESPONSE_TYPE: ResponseType # COMPLETION or LOGLIKELIHOODS
-    METRICS: list[type[BaseMetric]] # Metrics to compute
+    METRICS: list[type[BaseMetric]] # List of metrics to compute
     SUBJECTS: list[str] # Subjects/categories

     # Optional attributes
@@ -544,7 +544,7 @@ results = main(llm=llm, config=config)

 ### Notes on LLM Judge Tasks

-1. **Judge Configuration Required**: When running a task with LLM judge metrics, you must configure the judge model or you'll get:
+1. **Judge Configuration**: When running a task with LLM judge metrics, you must configure the judge model or you'll get:
    ```
    AssertionError: The LLM Judge must be defined for this evaluation task.
    ```
@@ -554,7 +554,7 @@ results = main(llm=llm, config=config)
    - `LLMJudgeSqlMetricContext` - SQL evaluation
    - `MTBenchJudgePairMetricContext` - Pairwise comparison

-3. **Language Support**: Check that your chosen metrics support the languages in your dataset.
+3. **Language Support**: Check that your chosen metrics is compatible with the languages in your dataset.

 ---