From 4b49b7bdc69943a6e3f88ea3ea196f7e4cf2e20f Mon Sep 17 00:00:00 2001 From: Frank Schneider Date: Wed, 11 Feb 2026 09:06:09 +0000 Subject: [PATCH 1/5] docs: Remove changelog request from PR template, will be done automatically. --- .github/pull_request_template.md | 7 ++----- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/.github/pull_request_template.md b/.github/pull_request_template.md index 073873e8..3edad436 100644 --- a/.github/pull_request_template.md +++ b/.github/pull_request_template.md @@ -10,17 +10,14 @@ - ✅ Provide tests for your changes. - 📝 Use descriptive commit messages. - 📗 Update any related documentation and include any relevant screenshots. - - 📗 Reflect the changes you made in the changelog. - --> - ## PR Checklist + - [ ] Use descriptive commit messages. - [ ] Provide tests for your changes. - [ ] Update any related documentation and include any relevant screenshots. - [ ] Check if changes need to be made to docs (README or any guides in `/docs/`). -- [ ] Reflect the changes you made in the changelog. ## What type of PR is this? (check all applicable) @@ -52,8 +49,8 @@ _Please replace this line with instructions on how to test your changes, a note on the hardware and config this has been tested on, as well as any relevant additional information._ - ## Added/updated tests? + - [ ] Yes - [ ] No, and this is why: _please replace this line with details on why tests have not been included_ From f7d7b9862e7d5eabd14387bfddd5dedccf8f2355 Mon Sep 17 00:00:00 2001 From: Frank Schneider Date: Wed, 11 Feb 2026 09:16:02 +0000 Subject: [PATCH 2/5] docs: Add all _IDK variants to the table. 
--- docs/benchmarks_and_metrics.md | 152 ++++++++++++++++----------------- 1 file changed, 76 insertions(+), 76 deletions(-) diff --git a/docs/benchmarks_and_metrics.md b/docs/benchmarks_and_metrics.md index 200acc1d..20fe1b81 100644 --- a/docs/benchmarks_and_metrics.md +++ b/docs/benchmarks_and_metrics.md @@ -6,53 +6,53 @@ Additional task documentation can be generated with the script `utils/generate-t ## Completion -| **Task** | **Capability** | **Benchmarks** | **Long Context** | -|-|-|-|-| -| Logical Reasoning | Math | `AIME2024`, `GSM8K`, `GSM8K_EU20_DE`, `GSM8K_EU20_FR`, `GSM8KEvalHarness`,`GSM8KReasoning`, `MATH`, `MATH500`, `MATHLvl5`, `TableBench` | `InfiniteBench_MathFind` | -| Logical Reasoning | Programming | `BigCodeBench`, `BigCodeBenchHard`, `BigCodeBenchInstruct`, `BigCodeBenchHardInstruct`, `HumanEval`, `HumanEvalInstruct`, `MBPP`, `MBPP_PROMPT_WITHOUT_TESTS`, `MBPP_SANITIZED`, `MBPP_PROMPT_WITHOUT_TESTS_SANITIZED` | `InfiniteBench_CodeRun` | -| Logical Reasoning | Puzzle | `SPHYR` | | -| Output Control | Structure | `IFEval`, `IFEvalDe`, `IFEvalFiSv`, `RenderableStructEval`, `StructEval` | | -| Text Distillation | Aggregation | | `ZERO_SCROLLS_SPACE_DIGEST` | -| Text Distillation | Classification | `GPQA_COT`, `MMLU`, `MMLU_PRO_COT`, `MMMLU_GERMAN_COT`, `PAWSX`, `TRIVIAQA` | | -| TextDistillation | Closed QA | `SQUAD`, `SQUAD2` | `InfiniteBench_EnDia` `InfiniteBench_EnQA` | -| Text Distilation | Extraction | `DUC_ABSTRACTIVE`, `DUC_EXTRACTIVE` | `InfiniteBench_RetrieveKV2`, `InfiniteBench_RetrieveNumber`, `InfiniteBench_RetrievePassKey1` | -| Text Distillation | QA | | `ZERO_SCROLLS_GOV_REPORT`, `ZERO_SCROLLS_MUSIQUE`, `ZERO_SCROLLS_NARRATIVEQA`, `ZERO_SCROLLS_QASPER`, `ZERO_SCROLLS_QMSUM`, `ZERO_SCROLLS_SQUALITY` | -| Text Transformation | Translation | `Flores200`, `FloresPlus`, `WMT14`, `WMT14_INSTRUCT`, `WMT16, WMT16_INSTRUCT, WMT20, WMT20_INSTRUCT` | | +| **Task** | **Capability** | **Benchmarks** | **Long Context** | +| ------------------- | 
-------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------- |
+| Logical Reasoning   | Math           | `AIME2024`, `GSM8K`, `GSM8K_EU20_DE`, `GSM8K_EU20_FR`, `GSM8KEvalHarness`, `GSM8KReasoning`, `MATH`, `MATH500`, `MATHLvl5`, `TableBench` | `InfiniteBench_MathFind` |
+| Logical Reasoning   | Programming    | `BigCodeBench`, `BigCodeBenchHard`, `BigCodeBenchInstruct`, `BigCodeBenchHardInstruct`, `HumanEval`, `HumanEvalInstruct`, `MBPP`, `MBPP_PROMPT_WITHOUT_TESTS`, `MBPP_SANITIZED`, `MBPP_PROMPT_WITHOUT_TESTS_SANITIZED` | `InfiniteBench_CodeRun` |
+| Logical Reasoning   | Puzzle         | `SPHYR` | |
+| Output Control      | Structure      | `IFEval`, `IFEvalDe`, `IFEvalFiSv`, `RenderableStructEval`, `StructEval` | |
+| Text Distillation   | Aggregation    | | `ZERO_SCROLLS_SPACE_DIGEST` |
+| Text Distillation   | Classification | `GPQA_COT`, `MMLU`, `MMLU_IDK`, `MMLU_PRO_COT`, `MMMLU_GERMAN_COT`, `PAWSX`, `TRIVIAQA` | |
+| Text Distillation   | Closed QA      | `SQUAD`, `SQUAD2` | `InfiniteBench_EnDia`, `InfiniteBench_EnQA` |
+| Text Distillation   | Extraction     | `DUC_ABSTRACTIVE`, `DUC_EXTRACTIVE` | `InfiniteBench_RetrieveKV2`, `InfiniteBench_RetrieveNumber`, `InfiniteBench_RetrievePassKey1` |
+| Text Distillation   | QA             | | `ZERO_SCROLLS_GOV_REPORT`, `ZERO_SCROLLS_MUSIQUE`, `ZERO_SCROLLS_NARRATIVEQA`, `ZERO_SCROLLS_QASPER`, `ZERO_SCROLLS_QMSUM`, `ZERO_SCROLLS_SQUALITY` |
+| Text Transformation | Translation    | `Flores200`, `FloresPlus`, `WMT14`, `WMT14_INSTRUCT`, `WMT16`, `WMT16_INSTRUCT`, `WMT20`, `WMT20_INSTRUCT` | |

## Loglikelihoods

-| **Task** | **Capability** | **Benchmarks** | **Long Context** |
-|-|-|-|-|
-| Output Control | Bias | `WINOGENDER` | |
-| Text Distillation | Classification | 
`ARC`,`ARC_DE`, `ARC_EU20_DE`, `ARC_EU20_FR`, `ARC_FI`, `BELEBELE`, `ChemBench`, `FullTextMMLU`, `GPQA`, `INCLUDE`, `MMLU`, `MMLU_DE`,`MMLU_EU20_DE`, `MMLU_EU20_FR`, `MMMLU`, `MMLU_PRO`, `OPENBOOKQA`, `PIQA`, `SCIQ`, `SCIQEvalHarness`, `TRUTHFULQA`, `TRUTHFULQA_EU20_DE, TRUTHFULQA_EU20_FR` | |
-| Text Distillation | QA | | `QUALITY`, `ZERO_SCROLLS_QUALITY` |
-| Text Generation | Open QA | `CASEHOLD` | |
-| Logical Reasoning | Closed QA | | `InfiniteBench_EnMC` |
-| Logical Reasoning | Programming | | `InfiniteBench_CodeDebug` |
-| Logical Reasoning | Reasoning | `COPA`, `HELLASWAG, HELLASWAG_EU20_DE, HELLASWAG_EU20_FR`, `WINOGRANDE`, `WINOX_DE`, `WINOX_FR` | |
+| **Task**          | **Capability** | **Benchmarks** | **Long Context**                  |
+| ----------------- | -------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------- |
+| Output Control    | Bias           | `WINOGENDER`, `WINOGENDER_IDK` | |
+| Text Distillation | Classification | `ARC`, `ARC_DE`, `ARC_EU20_DE`, `ARC_EU20_FR`, `ARC_FI`, `ARC_IDK`, `BELEBELE`, `ChemBench`, `FullTextMMLU`, `GPQA`, `GPQA_IDK`, `INCLUDE`, `MMLU`, `MMLU_DE`, `MMLU_EU20_DE`, `MMLU_EU20_FR`, `MMMLU`, `MMLU_PRO`, `MMLU_PRO_IDK`, `OPENBOOKQA`, `OPENBOOKQA_IDK`, `PIQA`, `PIQA_IDK`, `SCIQ`, `SCIQEvalHarness`, `TRUTHFULQA`, `TRUTHFULQA_EU20_DE`, `TRUTHFULQA_EU20_FR`, `TRUTHFULQA_IDK` | |
+| Text Distillation | QA             | | `QUALITY`, `ZERO_SCROLLS_QUALITY` |
+| Text Generation   | Open QA        | `CASEHOLD` | |
+| Logical Reasoning | Closed QA      | | `InfiniteBench_EnMC` |
+| Logical Reasoning | Programming    | | `InfiniteBench_CodeDebug` |
+| Logical Reasoning | Reasoning      | `COPA`, `COPA_IDK`, `HELLASWAG`, `HELLASWAG_IDK`, 
`HELLASWAG_EU20_DE`, `HELLASWAG_EU20_FR`, `WINOGRANDE`, `WINOGRANDE_IDK`, `WINOX_DE`, `WINOX_FR` | | ## Long-Context -| Task Name | Tag | Task | Capability | Domain | Common Few-Shot Counts | Avg #Words | Language | -|--------------------------------|----------------------------------|-|--------------------------------------------------|-|-|----------------------------------------------|--------------| -| Babilong | `Eval Suite Long Context` | Text Generation, Long Context | Completion, Long Context | ? | not supported | 22003 | en | -| InfiniteBench_CodeDebug | `InfiniteBench_CodeDebug` | LogicalReasoning | Programming | ? | not supported | 127761 | en | -| InfiniteBench_CodeRun | `InfiniteBench_CodeRun` | LogicalReasoning | Programming | ? | not supported | 34851 | en | -| InfiniteBench_EnDia | `InfiniteBench_EnDia` | TextDistillation | Closed QA | ? | not supported | 73240 | en | -| InfiniteBench_EnMC | `InfiniteBench_EnMC` | TextDistillation | Closed QA | ? | not supported | 139966 | en | -| InfiniteBench_EnQA | `InfiniteBench_EnQA` | TextDistillation | Closed QA | ? | not supported | 149442 | en | -| InfiniteBench_MathFind | `InfiniteBench_MathFind` | LogicalReasoning | Math | ? | not supported | 30017 | en | -| InfiniteBench_RetrieveKV2 | `InfiniteBench_RetrieveKV2` | TextDistillation | Extraction | ? | not supported | 5010 | en | -| InfiniteBench_RetrieveNumber | `InfiniteBench_RetrieveNumber` | TextDistillation | Extraction | ? | not supported | 99199 | en | -| InfiniteBench_RetrievePassKey1 | `InfiniteBench_RetrievePassKey1` | TextDistillation| Extraction | ? | not supported | 99196 | en | -| QuALITY | `QuALITY` | Text Distillation | QA | Literature, Misc | not supported | 4248 | en | -| ZeroSCROLLS GovReport | `ZeroSCROLLS GovReport` | Text Distillation | QA | Government | not supported | 7273 | en | -| ZeroSCROLLS SQuALITY | `ZeroSCROLLS SQuALITY` | Text Distillation | QB-Summ? 
| Literature | not supported | 4971 | en |
-| ZeroSCROLLS Qasper | `ZeroSCROLLS Qasper` | Text Distillation | QA | Science | not supported | 3531 | en |
-| ZeroSCROLLS NarrativeQA | `ZeroSCROLLS NarrativeQA` | Text Distillation | QA | Literature, Film | not supported | 49384 | en |
-| ZeroSCROLLS QuALITY | `ZeroSCROLLS QuALITY` | Text Distillation | QA | Literature, Misc | not supported | 4248 | en |
-| ZeroSCROLLS MuSiQue | `ZeroSCROLLS MuSiQue` | Text Distillation | QA | Wikipedia | not supported | 1749 | en |
-| ZeroSCROLLS SpaceDigest | `ZeroSCROLLS SpaceDigest` | Text Distillation | Aggregation | Reviews | not supported | 5481 | en |
+| Task Name                      | Tag                              | Task                          | Capability               | Domain           | Common Few-Shot Counts | Avg #Words | Language |
+| ------------------------------ | -------------------------------- | ----------------------------- | ------------------------ | ---------------- | ---------------------- | ---------- | -------- |
+| Babilong                       | `Eval Suite Long Context`        | Text Generation, Long Context | Completion, Long Context | ?                | not supported          | 22003      | en       |
+| InfiniteBench_CodeDebug        | `InfiniteBench_CodeDebug`        | Logical Reasoning             | Programming              | ?                | not supported          | 127761     | en       |
+| InfiniteBench_CodeRun          | `InfiniteBench_CodeRun`          | Logical Reasoning             | Programming              | ?                | not supported          | 34851      | en       |
+| InfiniteBench_EnDia            | `InfiniteBench_EnDia`            | Text Distillation             | Closed QA                | ?                | not supported          | 73240      | en       |
+| InfiniteBench_EnMC             | `InfiniteBench_EnMC`             | Text Distillation             | Closed QA                | ?                | not supported          | 139966     | en       |
+| InfiniteBench_EnQA             | `InfiniteBench_EnQA`             | Text Distillation             | Closed QA                | ?                | not supported          | 149442     | en       |
+| InfiniteBench_MathFind         | `InfiniteBench_MathFind`         | Logical Reasoning             | Math                     | ?                | not supported          | 30017      | en       |
+| InfiniteBench_RetrieveKV2      | `InfiniteBench_RetrieveKV2`      | Text Distillation             | Extraction               | ?                | not supported          | 5010       | en       |
+| InfiniteBench_RetrieveNumber   | `InfiniteBench_RetrieveNumber`   | Text Distillation             | Extraction               | ?                | not supported          | 99199      | en       |
+| InfiniteBench_RetrievePassKey1 | `InfiniteBench_RetrievePassKey1` | Text Distillation             | Extraction               | ?                | not supported          | 99196      | en       |
+| QuALITY                        | `QuALITY`                        | Text Distillation             | QA                       | Literature, Misc | not supported          | 4248       | en       |
+| ZeroSCROLLS GovReport          | `ZeroSCROLLS GovReport`          | Text Distillation             | QA                       | Government       | not supported          | 7273       | en       |
+| ZeroSCROLLS SQuALITY           | `ZeroSCROLLS SQuALITY`           | Text Distillation             | QB-Summ?                 | Literature       | not supported          | 4971       | en       |
+| ZeroSCROLLS Qasper             | `ZeroSCROLLS Qasper`             | Text Distillation             | QA                       | Science          | not supported          | 3531       | en       |
+| ZeroSCROLLS NarrativeQA        | `ZeroSCROLLS NarrativeQA`        | Text Distillation             | QA                       | Literature, Film | not supported          | 49384      | en       |
+| ZeroSCROLLS QuALITY            | `ZeroSCROLLS QuALITY`            | Text Distillation             | QA                       | Literature, Misc | not supported          | 4248       | en       |
+| ZeroSCROLLS MuSiQue            | `ZeroSCROLLS MuSiQue`            | Text Distillation             | QA                       | Wikipedia        | not supported          | 1749       | en       |
+| ZeroSCROLLS SpaceDigest        | `ZeroSCROLLS SpaceDigest`        | Text Distillation             | Aggregation              | Reviews          | not supported          | 5481       | en       |

## Languages

@@ -62,38 +62,38 @@ Additional task documentation can be generated with the script `utils/generate-t

## Metrics

-| Metrics Type | Metrics |
-|-|-------------------------------|
-| Completion Metrics | Accuracy
-|| Bleu |
-|| Chrf |
-|| Ter |
-|| F1 |
-|| Rouge 1 |
-|| Rouge 2 |
-|| Rouge-L |
-|| Code Assertion |
-|| Language Checker |
-|| Length Checker |
-|| Math Reasoning |
-|| Placeholder Checker |
-|| Text Counter |
-|| CSV Format |
-|| JSON Format |
-|| Postscript Format |
-|| Custom IFEval Checker |
-|| Custom CWE Checker |
-|| Custom NIAH Checker |
-|| Custom Grid Comparison Checker |
-|| Repetition Checker |
-| Loglikelihood Metrics | Accuracy Loglikelihood |
-|| Normalized Accuracy Loglikelihood |
-|| Probability Mass |
-| LLM Metrics | Chatbot Style Judge |
-|| Completion Accuracy Judge
-|| Conciseness Judge
-|| Contains Names Judge
-|| Instruction Judge 
-|| SQL Format -|| World Knowledge Judge -| Efficiency Metrics | Bytes per Sequence Position | +| Metrics Type | Metrics | +| --------------------- | --------------------------------- | +| Completion Metrics | Accuracy | +| | Bleu | +| | Chrf | +| | Ter | +| | F1 | +| | Rouge 1 | +| | Rouge 2 | +| | Rouge-L | +| | Code Assertion | +| | Language Checker | +| | Length Checker | +| | Math Reasoning | +| | Placeholder Checker | +| | Text Counter | +| | CSV Format | +| | JSON Format | +| | Postscript Format | +| | Custom IFEval Checker | +| | Custom CWE Checker | +| | Custom NIAH Checker | +| | Custom Grid Comparison Checker | +| | Repetition Checker | +| Loglikelihood Metrics | Accuracy Loglikelihood | +| | Normalized Accuracy Loglikelihood | +| | Probability Mass | +| LLM Metrics | Chatbot Style Judge | +| | Completion Accuracy Judge | +| | Conciseness Judge | +| | Contains Names Judge | +| | Instruction Judge | +| | SQL Format | +| | World Knowledge Judge | +| Efficiency Metrics | Bytes per Sequence Position | From 3d6a02d7206398164f44a9c4e5586a67f4218db5 Mon Sep 17 00:00:00 2001 From: Frank Schneider Date: Wed, 11 Feb 2026 09:30:49 +0000 Subject: [PATCH 3/5] docs: Expand steps on adding new benchmarks (docs and tests). --- docs/add_new_benchmark_guide.md | 45 ++++++++++++++++++++++++++++----- 1 file changed, 38 insertions(+), 7 deletions(-) diff --git a/docs/add_new_benchmark_guide.md b/docs/add_new_benchmark_guide.md index 53875e98..f9241077 100644 --- a/docs/add_new_benchmark_guide.md +++ b/docs/add_new_benchmark_guide.md @@ -5,10 +5,12 @@ This guide provides comprehensive instructions for adding new benchmarks to the ## Overview The eval-framework supports two response types: + 1. **Completion Tasks** - Generate text completions (e.g., math problems, code generation) 2. 
**Loglikelihood Tasks** - Multiple choice questions where the model ranks answer options For detailed information about implementing each task type, please refer to: + - [Completion Task Guide](completion_task_guide.md) - Comprehensive guide for text generation tasks - [Loglikelihood Task Guide](loglikelihood_task_guide.md) - Detailed guide for multiple choice tasks @@ -95,7 +97,6 @@ def post_process_generated_completion(self, completion_text: str, sample: Sample This section provides a complete reference for all configurations available when creating benchmarks. - ### Response Types The response type determines how your model interacts with the task and what type of output is expected. @@ -114,7 +115,6 @@ RESPONSE_TYPE = ResponseType.LOGLIKELIHOODS Metrics define how your task's outputs are evaluated and scored. Choose metrics that align with your response type and evaluation goals. - #### Completion Metrics These metrics work with generated text outputs from COMPLETION tasks: @@ -175,7 +175,6 @@ from eval_framework.metrics.loglikelihood.probability_mass import ProbabilityMas These metrics use another LLM to evaluate generated outputs, useful for complex or subjective tasks: - ```python from eval_framework.metrics.llm.llm_judge_chatbot_style import LLMJudgeChatbotStyle # Classifies whether a text generation model's response follows a chatbot-style format by evaluating characteristics like friendly introductions, verbose language, follow-up questions, and conversational fluff, returning a boolean classification with reasoning. 
(English and German) @@ -221,7 +220,6 @@ from eval_framework.metrics.llm.llm_judge_world_knowledge import LLMJudgeWorldKn ``` - ## Implementation Examples and Patterns ### Practical Example: GeographyQATask @@ -267,7 +265,6 @@ class GeographyQATask(BaseTask[str]): return self.rnd.sample(self.dataset[self.FEWSHOT_SPLIT], self.num_fewshot) ``` - ### Add to Task Registry Add a registration call for your new benchmark to `register_all_tasks` in `src/eval_framework/tasks/task_names.py`: @@ -280,32 +277,66 @@ The task will now be available through `get_task("GeographyQA")`. ### Testing your benchmark -All tasks automatically go through formatting tests to ensure proper prompt generation. However, if your benchmark has specific functionality that needs testing, create a dedicated test file. +All tasks automatically go through formatting tests to ensure proper prompt generation. The formatting test lives in `tests/tests_eval_framework/tasks/test_all_formatters.py` and runs all registered tasks automatically. #### Automatic Formatting Tests -All benchmarks are automatically tested for proper prompt formatting across different chat templates. No additional setup required. + +All benchmarks are automatically tested for proper prompt formatting across different chat templates. If your new task needs non-default initialization arguments (for example, a specific `num_fewshot`), add an entry for your task to `SPECIAL_ARGS` in `tests/tests_eval_framework/tasks/test_all_formatters.py`. + +The expected formatter outputs are tracked as hashes in `tests/tests_eval_framework/tasks/task-prompts-hashes.json`. + +When you add a new task: + +1. Run the formatter hash test once for your task to generate/check hashes. +2. If your task hash is new, it will be added to `task-prompts-hashes.json`. +3. Commit the updated JSON file together with your task changes. 
+ +Run the formatter hash test only for your newly created task (replace `YourTaskName`): + +```bash +uv run pytest tests/tests_eval_framework/tasks/test_all_formatters.py -m formatter_hash -k "YourTaskName" +``` #### Custom Task Tests (Optional) + If your benchmark has specific logic that needs testing, create a test file in `tests/tasks/` to test it. +### Update benchmark documentation + +After adding a benchmark, you also need to update task documentation: + +1. Manually add the new benchmark name(s) to `docs/benchmarks_and_metrics.md` (including `*_IDK` variants if your benchmark has them). +2. Regenerate the task docs: + +```bash +uv run -m eval_framework.utils.generate_task_docs +``` + +This updates `docs/tasks/README.md` and creates per-task documentation files for new tasks in `docs/tasks/`. + ## Benchmark Examples by Task Type Study these existing benchmarks in the codebase for more complex patterns: #### Simple Classification Tasks + - **ARC** (`src/eval_framework/tasks/arc.py`): Multiple choice with loglikelihoods - **MMLU** (`src/eval_framework/tasks/mmlu.py`): Multi-subject classification with enum subjects #### Reasoning Tasks + - **GSM8K** (`src/eval_framework/tasks/gsm8k.py`): Math reasoning with answer extraction patterns #### Code Generation + - **HumanEval** (`src/eval_framework/tasks/human_eval.py`): Code completion with execution validation - **MBPP** (`src/eval_framework/tasks/mbpp.py`): Code generation with comprehensive test validation #### Long Context Tasks + - **InfiniteBench** (`src/eval_framework/tasks/infinite_bench_tasks.py`): Long context reasoning tasks #### Custom Format Tasks + - **IFEval** (`src/eval_framework/tasks/ifeval.py`): Instruction following with format validation - **JSON/CSV Tasks:** Custom format validation examples From f0a11840bd94b957e223166cfbef6a3e76a575b5 Mon Sep 17 00:00:00 2001 From: Frank Schneider Date: Thu, 12 Feb 2026 12:36:43 +0000 Subject: [PATCH 4/5] docs: Add IDK-specific metrics --- 
docs/benchmarks_and_metrics.md | 74 ++++++++++++++++++---------------- 1 file changed, 39 insertions(+), 35 deletions(-) diff --git a/docs/benchmarks_and_metrics.md b/docs/benchmarks_and_metrics.md index 20fe1b81..77fa7085 100644 --- a/docs/benchmarks_and_metrics.md +++ b/docs/benchmarks_and_metrics.md @@ -62,38 +62,42 @@ Additional task documentation can be generated with the script `utils/generate-t ## Metrics -| Metrics Type | Metrics | -| --------------------- | --------------------------------- | -| Completion Metrics | Accuracy | -| | Bleu | -| | Chrf | -| | Ter | -| | F1 | -| | Rouge 1 | -| | Rouge 2 | -| | Rouge-L | -| | Code Assertion | -| | Language Checker | -| | Length Checker | -| | Math Reasoning | -| | Placeholder Checker | -| | Text Counter | -| | CSV Format | -| | JSON Format | -| | Postscript Format | -| | Custom IFEval Checker | -| | Custom CWE Checker | -| | Custom NIAH Checker | -| | Custom Grid Comparison Checker | -| | Repetition Checker | -| Loglikelihood Metrics | Accuracy Loglikelihood | -| | Normalized Accuracy Loglikelihood | -| | Probability Mass | -| LLM Metrics | Chatbot Style Judge | -| | Completion Accuracy Judge | -| | Conciseness Judge | -| | Contains Names Judge | -| | Instruction Judge | -| | SQL Format | -| | World Knowledge Judge | -| Efficiency Metrics | Bytes per Sequence Position | +| Metrics Type | Metrics | +| ---------------------- | --------------------------------- | +| Completion Metrics | Accuracy | +| | Bleu | +| | Chrf | +| | Ter | +| | F1 | +| | Rouge 1 | +| | Rouge 2 | +| | Rouge-L | +| | Code Assertion | +| | Language Checker | +| | Length Checker | +| | Math Reasoning | +| | Placeholder Checker | +| | Text Counter | +| | CSV Format | +| | JSON Format | +| | Postscript Format | +| | Custom IFEval Checker | +| | Custom CWE Checker | +| | Custom NIAH Checker | +| | Custom Grid Comparison Checker | +| | Repetition Checker | +| Loglikelihood Metrics | Accuracy Loglikelihood | +| | Accuracy Normalized Loglikelihood | 
+| | Confidence-weighted Accuracy | +| | Probability Mass | +| | Probability Mass Normalized | +| (IDK-Specific Metrics) | Distributional Correctness Score | +| | Ternary Score | +| LLM Metrics | Chatbot Style Judge | +| | Completion Accuracy Judge | +| | Conciseness Judge | +| | Contains Names Judge | +| | Instruction Judge | +| | SQL Format | +| | World Knowledge Judge | +| Efficiency Metrics | Bytes per Sequence Position | From 5df2fd51ed200d434d6d30e22650d3b5c48858d5 Mon Sep 17 00:00:00 2001 From: Frank Schneider Date: Thu, 12 Feb 2026 12:37:34 +0000 Subject: [PATCH 5/5] docs: Highlight the option to run individual tasks. --- docs/add_new_benchmark_guide.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/docs/add_new_benchmark_guide.md b/docs/add_new_benchmark_guide.md index f9241077..11ed4079 100644 --- a/docs/add_new_benchmark_guide.md +++ b/docs/add_new_benchmark_guide.md @@ -279,6 +279,10 @@ The task will now be available through `get_task("GeographyQA")`. All tasks automatically go through formatting tests to ensure proper prompt generation. The formatting test lives in `tests/tests_eval_framework/tasks/test_all_formatters.py` and runs all registered tasks automatically. +> [!TIP] +> CI runs this test across all registered tasks, but during development you can target only your task, for example: +> `uv run pytest tests/tests_eval_framework/tasks/test_all_formatters.py -k "YourTaskName"` + #### Automatic Formatting Tests All benchmarks are automatically tested for proper prompt formatting across different chat templates. If your new task needs non-default initialization arguments (for example, a specific `num_fewshot`), add an entry for your task to `SPECIAL_ARGS` in `tests/tests_eval_framework/tasks/test_all_formatters.py`.