From 4b49b7bdc69943a6e3f88ea3ea196f7e4cf2e20f Mon Sep 17 00:00:00 2001 From: Frank Schneider Date: Wed, 11 Feb 2026 09:06:09 +0000 Subject: [PATCH 1/5] docs: Remove changelog request from PR template, will be done automatically. --- .github/pull_request_template.md | 7 ++----- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/.github/pull_request_template.md b/.github/pull_request_template.md index 073873e8..3edad436 100644 --- a/.github/pull_request_template.md +++ b/.github/pull_request_template.md @@ -10,17 +10,14 @@ - ✅ Provide tests for your changes. - 📝 Use descriptive commit messages. - 📗 Update any related documentation and include any relevant screenshots. - - 📗 Reflect the changes you made in the changelog. - --> - ## PR Checklist + - [ ] Use descriptive commit messages. - [ ] Provide tests for your changes. - [ ] Update any related documentation and include any relevant screenshots. - [ ] Check if changes need to be made to docs (README or any guides in `/docs/`). -- [ ] Reflect the changes you made in the changelog. ## What type of PR is this? (check all applicable) @@ -52,8 +49,8 @@ _Please replace this line with instructions on how to test your changes, a note on the hardware and config this has been tested on, as well as any relevant additional information._ - ## Added/updated tests? + - [ ] Yes - [ ] No, and this is why: _please replace this line with details on why tests have not been included_ From f7d7b9862e7d5eabd14387bfddd5dedccf8f2355 Mon Sep 17 00:00:00 2001 From: Frank Schneider Date: Wed, 11 Feb 2026 09:16:02 +0000 Subject: [PATCH 2/5] docs: Add all _IDK variants to the table. 
--- docs/benchmarks_and_metrics.md | 152 ++++++++++++++++----------------- 1 file changed, 76 insertions(+), 76 deletions(-) diff --git a/docs/benchmarks_and_metrics.md b/docs/benchmarks_and_metrics.md index 200acc1d..20fe1b81 100644 --- a/docs/benchmarks_and_metrics.md +++ b/docs/benchmarks_and_metrics.md @@ -6,53 +6,53 @@ Additional task documentation can be generated with the script `utils/generate-t ## Completion -| **Task** | **Capability** | **Benchmarks** | **Long Context** | -|-|-|-|-| -| Logical Reasoning | Math | `AIME2024`, `GSM8K`, `GSM8K_EU20_DE`, `GSM8K_EU20_FR`, `GSM8KEvalHarness`,`GSM8KReasoning`, `MATH`, `MATH500`, `MATHLvl5`, `TableBench` | `InfiniteBench_MathFind` | -| Logical Reasoning | Programming | `BigCodeBench`, `BigCodeBenchHard`, `BigCodeBenchInstruct`, `BigCodeBenchHardInstruct`, `HumanEval`, `HumanEvalInstruct`, `MBPP`, `MBPP_PROMPT_WITHOUT_TESTS`, `MBPP_SANITIZED`, `MBPP_PROMPT_WITHOUT_TESTS_SANITIZED` | `InfiniteBench_CodeRun` | -| Logical Reasoning | Puzzle | `SPHYR` | | -| Output Control | Structure | `IFEval`, `IFEvalDe`, `IFEvalFiSv`, `RenderableStructEval`, `StructEval` | | -| Text Distillation | Aggregation | | `ZERO_SCROLLS_SPACE_DIGEST` | -| Text Distillation | Classification | `GPQA_COT`, `MMLU`, `MMLU_PRO_COT`, `MMMLU_GERMAN_COT`, `PAWSX`, `TRIVIAQA` | | -| TextDistillation | Closed QA | `SQUAD`, `SQUAD2` | `InfiniteBench_EnDia` `InfiniteBench_EnQA` | -| Text Distilation | Extraction | `DUC_ABSTRACTIVE`, `DUC_EXTRACTIVE` | `InfiniteBench_RetrieveKV2`, `InfiniteBench_RetrieveNumber`, `InfiniteBench_RetrievePassKey1` | -| Text Distillation | QA | | `ZERO_SCROLLS_GOV_REPORT`, `ZERO_SCROLLS_MUSIQUE`, `ZERO_SCROLLS_NARRATIVEQA`, `ZERO_SCROLLS_QASPER`, `ZERO_SCROLLS_QMSUM`, `ZERO_SCROLLS_SQUALITY` | -| Text Transformation | Translation | `Flores200`, `FloresPlus`, `WMT14`, `WMT14_INSTRUCT`, `WMT16, WMT16_INSTRUCT, WMT20, WMT20_INSTRUCT` | | +| **Task** | **Capability** | **Benchmarks** | **Long Context** | +| ------------------- | 
-------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------- |
+| Logical Reasoning   | Math           | `AIME2024`, `GSM8K`, `GSM8K_EU20_DE`, `GSM8K_EU20_FR`, `GSM8KEvalHarness`, `GSM8KReasoning`, `MATH`, `MATH500`, `MATHLvl5`, `TableBench` | `InfiniteBench_MathFind` |
+| Logical Reasoning   | Programming    | `BigCodeBench`, `BigCodeBenchHard`, `BigCodeBenchInstruct`, `BigCodeBenchHardInstruct`, `HumanEval`, `HumanEvalInstruct`, `MBPP`, `MBPP_PROMPT_WITHOUT_TESTS`, `MBPP_SANITIZED`, `MBPP_PROMPT_WITHOUT_TESTS_SANITIZED` | `InfiniteBench_CodeRun` |
+| Logical Reasoning   | Puzzle         | `SPHYR` | |
+| Output Control      | Structure      | `IFEval`, `IFEvalDe`, `IFEvalFiSv`, `RenderableStructEval`, `StructEval` | |
+| Text Distillation   | Aggregation    | | `ZERO_SCROLLS_SPACE_DIGEST` |
+| Text Distillation   | Classification | `GPQA_COT`, `MMLU`, `MMLU_IDK`, `MMLU_PRO_COT`, `MMMLU_GERMAN_COT`, `PAWSX`, `TRIVIAQA` | |
+| Text Distillation   | Closed QA      | `SQUAD`, `SQUAD2` | `InfiniteBench_EnDia`, `InfiniteBench_EnQA` |
+| Text Distillation   | Extraction     | `DUC_ABSTRACTIVE`, `DUC_EXTRACTIVE` | `InfiniteBench_RetrieveKV2`, `InfiniteBench_RetrieveNumber`, `InfiniteBench_RetrievePassKey1` |
+| Text Distillation   | QA             | | `ZERO_SCROLLS_GOV_REPORT`, `ZERO_SCROLLS_MUSIQUE`, `ZERO_SCROLLS_NARRATIVEQA`, `ZERO_SCROLLS_QASPER`, `ZERO_SCROLLS_QMSUM`, `ZERO_SCROLLS_SQUALITY` |
+| Text Transformation | Translation    | `Flores200`, `FloresPlus`, `WMT14`, `WMT14_INSTRUCT`, `WMT16`, `WMT16_INSTRUCT`, `WMT20`, `WMT20_INSTRUCT` | |

## Loglikelihoods

-| **Task** | **Capability** | **Benchmarks** | **Long Context** |
-|-|-|-|-|
-| Output Control | Bias | `WINOGENDER` | |
-| Text Distillation | Classification | 
`ARC`,`ARC_DE`, `ARC_EU20_DE`, `ARC_EU20_FR`, `ARC_FI`, `BELEBELE`, `ChemBench`, `FullTextMMLU`, `GPQA`, `INCLUDE`, `MMLU`, `MMLU_DE`,`MMLU_EU20_DE`, `MMLU_EU20_FR`, `MMMLU`, `MMLU_PRO`, `OPENBOOKQA`, `PIQA`, `SCIQ`, `SCIQEvalHarness`, `TRUTHFULQA`, `TRUTHFULQA_EU20_DE, TRUTHFULQA_EU20_FR` | |
-| Text Distillation | QA | | `QUALITY`, `ZERO_SCROLLS_QUALITY` |
-| Text Generation | Open QA | `CASEHOLD` | |
-| Logical Reasoning | Closed QA | | `InfiniteBench_EnMC` |
-| Logical Reasoning | Programming | | `InfiniteBench_CodeDebug` |
-| Logical Reasoning | Reasoning | `COPA`, `HELLASWAG, HELLASWAG_EU20_DE, HELLASWAG_EU20_FR`, `WINOGRANDE`, `WINOX_DE`, `WINOX_FR` | |
+| **Task**          | **Capability** | **Benchmarks** | **Long Context**                  |
+| ----------------- | -------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------- |
+| Output Control    | Bias           | `WINOGENDER`, `WINOGENDER_IDK` | |
+| Text Distillation | Classification | `ARC`, `ARC_DE`, `ARC_EU20_DE`, `ARC_EU20_FR`, `ARC_FI`, `ARC_IDK`, `BELEBELE`, `ChemBench`, `FullTextMMLU`, `GPQA`, `GPQA_IDK`, `INCLUDE`, `MMLU`, `MMLU_DE`, `MMLU_EU20_DE`, `MMLU_EU20_FR`, `MMMLU`, `MMLU_PRO`, `MMLU_PRO_IDK`, `OPENBOOKQA`, `OPENBOOKQA_IDK`, `PIQA`, `PIQA_IDK`, `SCIQ`, `SCIQEvalHarness`, `TRUTHFULQA`, `TRUTHFULQA_EU20_DE`, `TRUTHFULQA_EU20_FR`, `TRUTHFULQA_IDK` | |
+| Text Distillation | QA             | | `QUALITY`, `ZERO_SCROLLS_QUALITY` |
+| Text Generation   | Open QA        | `CASEHOLD` | |
+| Logical Reasoning | Closed QA      | | `InfiniteBench_EnMC` |
+| Logical Reasoning | Programming    | | `InfiniteBench_CodeDebug` |
+| Logical Reasoning | Reasoning      | `COPA`, `COPA_IDK`, `HELLASWAG`, `HELLASWAG_IDK`, 
`HELLASWAG_EU20_DE`, `HELLASWAG_EU20_FR`, `WINOGRANDE`, `WINOGRANDE_IDK`, `WINOX_DE`, `WINOX_FR` | | ## Long-Context -| Task Name | Tag | Task | Capability | Domain | Common Few-Shot Counts | Avg #Words | Language | -|--------------------------------|----------------------------------|-|--------------------------------------------------|-|-|----------------------------------------------|--------------| -| Babilong | `Eval Suite Long Context` | Text Generation, Long Context | Completion, Long Context | ? | not supported | 22003 | en | -| InfiniteBench_CodeDebug | `InfiniteBench_CodeDebug` | LogicalReasoning | Programming | ? | not supported | 127761 | en | -| InfiniteBench_CodeRun | `InfiniteBench_CodeRun` | LogicalReasoning | Programming | ? | not supported | 34851 | en | -| InfiniteBench_EnDia | `InfiniteBench_EnDia` | TextDistillation | Closed QA | ? | not supported | 73240 | en | -| InfiniteBench_EnMC | `InfiniteBench_EnMC` | TextDistillation | Closed QA | ? | not supported | 139966 | en | -| InfiniteBench_EnQA | `InfiniteBench_EnQA` | TextDistillation | Closed QA | ? | not supported | 149442 | en | -| InfiniteBench_MathFind | `InfiniteBench_MathFind` | LogicalReasoning | Math | ? | not supported | 30017 | en | -| InfiniteBench_RetrieveKV2 | `InfiniteBench_RetrieveKV2` | TextDistillation | Extraction | ? | not supported | 5010 | en | -| InfiniteBench_RetrieveNumber | `InfiniteBench_RetrieveNumber` | TextDistillation | Extraction | ? | not supported | 99199 | en | -| InfiniteBench_RetrievePassKey1 | `InfiniteBench_RetrievePassKey1` | TextDistillation| Extraction | ? | not supported | 99196 | en | -| QuALITY | `QuALITY` | Text Distillation | QA | Literature, Misc | not supported | 4248 | en | -| ZeroSCROLLS GovReport | `ZeroSCROLLS GovReport` | Text Distillation | QA | Government | not supported | 7273 | en | -| ZeroSCROLLS SQuALITY | `ZeroSCROLLS SQuALITY` | Text Distillation | QB-Summ? 
| Literature | not supported | 4971 | en |
-| ZeroSCROLLS Qasper | `ZeroSCROLLS Qasper` | Text Distillation | QA | Science | not supported | 3531 | en |
-| ZeroSCROLLS NarrativeQA | `ZeroSCROLLS NarrativeQA` | Text Distillation | QA | Literature, Film | not supported | 49384 | en |
-| ZeroSCROLLS QuALITY | `ZeroSCROLLS QuALITY` | Text Distillation | QA | Literature, Misc | not supported | 4248 | en |
-| ZeroSCROLLS MuSiQue | `ZeroSCROLLS MuSiQue` | Text Distillation | QA | Wikipedia | not supported | 1749 | en |
-| ZeroSCROLLS SpaceDigest | `ZeroSCROLLS SpaceDigest` | Text Distillation | Aggregation | Reviews | not supported | 5481 | en |
+| Task Name                      | Tag                              | Task                          | Capability               | Domain           | Common Few-Shot Counts | Avg #Words | Language |
+| ------------------------------ | -------------------------------- | ----------------------------- | ------------------------ | ---------------- | ---------------------- | ---------- | -------- |
+| Babilong                       | `Eval Suite Long Context`        | Text Generation, Long Context | Completion, Long Context | ?                | not supported          | 22003      | en       |
+| InfiniteBench_CodeDebug        | `InfiniteBench_CodeDebug`        | Logical Reasoning             | Programming              | ?                | not supported          | 127761     | en       |
+| InfiniteBench_CodeRun          | `InfiniteBench_CodeRun`          | Logical Reasoning             | Programming              | ?                | not supported          | 34851      | en       |
+| InfiniteBench_EnDia            | `InfiniteBench_EnDia`            | Text Distillation             | Closed QA                | ?                | not supported          | 73240      | en       |
+| InfiniteBench_EnMC             | `InfiniteBench_EnMC`             | Text Distillation             | Closed QA                | ?                | not supported          | 139966     | en       |
+| InfiniteBench_EnQA             | `InfiniteBench_EnQA`             | Text Distillation             | Closed QA                | ?                | not supported          | 149442     | en       |
+| InfiniteBench_MathFind         | `InfiniteBench_MathFind`         | Logical Reasoning             | Math                     | ?                | not supported          | 30017      | en       |
+| InfiniteBench_RetrieveKV2      | `InfiniteBench_RetrieveKV2`      | Text Distillation             | Extraction               | ?                | not supported          | 5010       | en       |
+| InfiniteBench_RetrieveNumber   | `InfiniteBench_RetrieveNumber`   | Text Distillation             | Extraction               | ?                | not supported          | 99199      | en       |
+| InfiniteBench_RetrievePassKey1 | `InfiniteBench_RetrievePassKey1` | Text Distillation             | Extraction               | ?                | not supported          | 99196      | en       |
+| QuALITY                        | `QuALITY`                        | Text Distillation             | QA                       | Literature, Misc | not supported          | 4248       | en       |
+| ZeroSCROLLS GovReport          | `ZeroSCROLLS GovReport`          | Text Distillation             | QA                       | Government       | not supported          | 7273       | en       |
+| ZeroSCROLLS SQuALITY           | `ZeroSCROLLS SQuALITY`           | Text Distillation             | QB-Summ?                 | Literature       | not supported          | 4971       | en       |
+| ZeroSCROLLS Qasper             | `ZeroSCROLLS Qasper`             | Text Distillation             | QA                       | Science          | not supported          | 3531       | en       |
+| ZeroSCROLLS NarrativeQA        | `ZeroSCROLLS NarrativeQA`        | Text Distillation             | QA                       | Literature, Film | not supported          | 49384      | en       |
+| ZeroSCROLLS QuALITY            | `ZeroSCROLLS QuALITY`            | Text Distillation             | QA                       | Literature, Misc | not supported          | 4248       | en       |
+| ZeroSCROLLS MuSiQue            | `ZeroSCROLLS MuSiQue`            | Text Distillation             | QA                       | Wikipedia        | not supported          | 1749       | en       |
+| ZeroSCROLLS SpaceDigest        | `ZeroSCROLLS SpaceDigest`        | Text Distillation             | Aggregation              | Reviews          | not supported          | 5481       | en       |

## Languages

@@ -62,38 +62,38 @@ Additional task documentation can be generated with the script `utils/generate-t

## Metrics

-| Metrics Type | Metrics |
-|-|-------------------------------|
-| Completion Metrics | Accuracy
-|| Bleu |
-|| Chrf |
-|| Ter |
-|| F1 |
-|| Rouge 1 |
-|| Rouge 2 |
-|| Rouge-L |
-|| Code Assertion |
-|| Language Checker |
-|| Length Checker |
-|| Math Reasoning |
-|| Placeholder Checker |
-|| Text Counter |
-|| CSV Format |
-|| JSON Format |
-|| Postscript Format |
-|| Custom IFEval Checker |
-|| Custom CWE Checker |
-|| Custom NIAH Checker |
-|| Custom Grid Comparison Checker |
-|| Repetition Checker |
-| Loglikelihood Metrics | Accuracy Loglikelihood |
-|| Normalized Accuracy Loglikelihood |
-|| Probability Mass |
-| LLM Metrics | Chatbot Style Judge |
-|| Completion Accuracy Judge
-|| Conciseness Judge
-|| Contains Names Judge
-|| Instruction Judge 
-|| SQL Format -|| World Knowledge Judge -| Efficiency Metrics | Bytes per Sequence Position | +| Metrics Type | Metrics | +| --------------------- | --------------------------------- | +| Completion Metrics | Accuracy | +| | Bleu | +| | Chrf | +| | Ter | +| | F1 | +| | Rouge 1 | +| | Rouge 2 | +| | Rouge-L | +| | Code Assertion | +| | Language Checker | +| | Length Checker | +| | Math Reasoning | +| | Placeholder Checker | +| | Text Counter | +| | CSV Format | +| | JSON Format | +| | Postscript Format | +| | Custom IFEval Checker | +| | Custom CWE Checker | +| | Custom NIAH Checker | +| | Custom Grid Comparison Checker | +| | Repetition Checker | +| Loglikelihood Metrics | Accuracy Loglikelihood | +| | Normalized Accuracy Loglikelihood | +| | Probability Mass | +| LLM Metrics | Chatbot Style Judge | +| | Completion Accuracy Judge | +| | Conciseness Judge | +| | Contains Names Judge | +| | Instruction Judge | +| | SQL Format | +| | World Knowledge Judge | +| Efficiency Metrics | Bytes per Sequence Position | From 3d6a02d7206398164f44a9c4e5586a67f4218db5 Mon Sep 17 00:00:00 2001 From: Frank Schneider Date: Wed, 11 Feb 2026 09:30:49 +0000 Subject: [PATCH 3/5] docs: Expand steps on adding new benchmarks (docs and tests). --- docs/add_new_benchmark_guide.md | 45 ++++++++++++++++++++++++++++----- 1 file changed, 38 insertions(+), 7 deletions(-) diff --git a/docs/add_new_benchmark_guide.md b/docs/add_new_benchmark_guide.md index 53875e98..f9241077 100644 --- a/docs/add_new_benchmark_guide.md +++ b/docs/add_new_benchmark_guide.md @@ -5,10 +5,12 @@ This guide provides comprehensive instructions for adding new benchmarks to the ## Overview The eval-framework supports two response types: + 1. **Completion Tasks** - Generate text completions (e.g., math problems, code generation) 2. 
**Loglikelihood Tasks** - Multiple choice questions where the model ranks answer options For detailed information about implementing each task type, please refer to: + - [Completion Task Guide](completion_task_guide.md) - Comprehensive guide for text generation tasks - [Loglikelihood Task Guide](loglikelihood_task_guide.md) - Detailed guide for multiple choice tasks @@ -95,7 +97,6 @@ def post_process_generated_completion(self, completion_text: str, sample: Sample This section provides a complete reference for all configurations available when creating benchmarks. - ### Response Types The response type determines how your model interacts with the task and what type of output is expected. @@ -114,7 +115,6 @@ RESPONSE_TYPE = ResponseType.LOGLIKELIHOODS Metrics define how your task's outputs are evaluated and scored. Choose metrics that align with your response type and evaluation goals. - #### Completion Metrics These metrics work with generated text outputs from COMPLETION tasks: @@ -175,7 +175,6 @@ from eval_framework.metrics.loglikelihood.probability_mass import ProbabilityMas These metrics use another LLM to evaluate generated outputs, useful for complex or subjective tasks: - ```python from eval_framework.metrics.llm.llm_judge_chatbot_style import LLMJudgeChatbotStyle # Classifies whether a text generation model's response follows a chatbot-style format by evaluating characteristics like friendly introductions, verbose language, follow-up questions, and conversational fluff, returning a boolean classification with reasoning. 
(English and German) @@ -221,7 +220,6 @@ from eval_framework.metrics.llm.llm_judge_world_knowledge import LLMJudgeWorldKn ``` - ## Implementation Examples and Patterns ### Practical Example: GeographyQATask @@ -267,7 +265,6 @@ class GeographyQATask(BaseTask[str]): return self.rnd.sample(self.dataset[self.FEWSHOT_SPLIT], self.num_fewshot) ``` - ### Add to Task Registry Add a registration call for your new benchmark to `register_all_tasks` in `src/eval_framework/tasks/task_names.py`: @@ -280,32 +277,66 @@ The task will now be available through `get_task("GeographyQA")`. ### Testing your benchmark -All tasks automatically go through formatting tests to ensure proper prompt generation. However, if your benchmark has specific functionality that needs testing, create a dedicated test file. +All tasks automatically go through formatting tests to ensure proper prompt generation. The formatting test lives in `tests/tests_eval_framework/tasks/test_all_formatters.py` and runs all registered tasks automatically. #### Automatic Formatting Tests -All benchmarks are automatically tested for proper prompt formatting across different chat templates. No additional setup required. + +All benchmarks are automatically tested for proper prompt formatting across different chat templates. If your new task needs non-default initialization arguments (for example, a specific `num_fewshot`), add an entry for your task to `SPECIAL_ARGS` in `tests/tests_eval_framework/tasks/test_all_formatters.py`. + +The expected formatter outputs are tracked as hashes in `tests/tests_eval_framework/tasks/task-prompts-hashes.json`. + +When you add a new task: + +1. Run the formatter hash test once for your task to generate/check hashes. +2. If your task hash is new, it will be added to `task-prompts-hashes.json`. +3. Commit the updated JSON file together with your task changes. 
+ +Run the formatter hash test only for your newly created task (replace `YourTaskName`): + +```bash +uv run pytest tests/tests_eval_framework/tasks/test_all_formatters.py -m formatter_hash -k "YourTaskName" +``` #### Custom Task Tests (Optional) + If your benchmark has specific logic that needs testing, create a test file in `tests/tasks/` to test it. +### Update benchmark documentation + +After adding a benchmark, you also need to update task documentation: + +1. Manually add the new benchmark name(s) to `docs/benchmarks_and_metrics.md` (including `*_IDK` variants if your benchmark has them). +2. Regenerate the task docs: + +```bash +uv run -m eval_framework.utils.generate_task_docs +``` + +This updates `docs/tasks/README.md` and creates per-task documentation files for new tasks in `docs/tasks/`. + ## Benchmark Examples by Task Type Study these existing benchmarks in the codebase for more complex patterns: #### Simple Classification Tasks + - **ARC** (`src/eval_framework/tasks/arc.py`): Multiple choice with loglikelihoods - **MMLU** (`src/eval_framework/tasks/mmlu.py`): Multi-subject classification with enum subjects #### Reasoning Tasks + - **GSM8K** (`src/eval_framework/tasks/gsm8k.py`): Math reasoning with answer extraction patterns #### Code Generation + - **HumanEval** (`src/eval_framework/tasks/human_eval.py`): Code completion with execution validation - **MBPP** (`src/eval_framework/tasks/mbpp.py`): Code generation with comprehensive test validation #### Long Context Tasks + - **InfiniteBench** (`src/eval_framework/tasks/infinite_bench_tasks.py`): Long context reasoning tasks #### Custom Format Tasks + - **IFEval** (`src/eval_framework/tasks/ifeval.py`): Instruction following with format validation - **JSON/CSV Tasks:** Custom format validation examples From f0a11840bd94b957e223166cfbef6a3e76a575b5 Mon Sep 17 00:00:00 2001 From: Frank Schneider Date: Thu, 12 Feb 2026 12:36:43 +0000 Subject: [PATCH 4/5] docs: Add IDK-specific metrics --- 
docs/benchmarks_and_metrics.md | 74 ++++++++++++++++++---------------- 1 file changed, 39 insertions(+), 35 deletions(-) diff --git a/docs/benchmarks_and_metrics.md b/docs/benchmarks_and_metrics.md index 20fe1b81..77fa7085 100644 --- a/docs/benchmarks_and_metrics.md +++ b/docs/benchmarks_and_metrics.md @@ -62,38 +62,42 @@ Additional task documentation can be generated with the script `utils/generate-t ## Metrics -| Metrics Type | Metrics | -| --------------------- | --------------------------------- | -| Completion Metrics | Accuracy | -| | Bleu | -| | Chrf | -| | Ter | -| | F1 | -| | Rouge 1 | -| | Rouge 2 | -| | Rouge-L | -| | Code Assertion | -| | Language Checker | -| | Length Checker | -| | Math Reasoning | -| | Placeholder Checker | -| | Text Counter | -| | CSV Format | -| | JSON Format | -| | Postscript Format | -| | Custom IFEval Checker | -| | Custom CWE Checker | -| | Custom NIAH Checker | -| | Custom Grid Comparison Checker | -| | Repetition Checker | -| Loglikelihood Metrics | Accuracy Loglikelihood | -| | Normalized Accuracy Loglikelihood | -| | Probability Mass | -| LLM Metrics | Chatbot Style Judge | -| | Completion Accuracy Judge | -| | Conciseness Judge | -| | Contains Names Judge | -| | Instruction Judge | -| | SQL Format | -| | World Knowledge Judge | -| Efficiency Metrics | Bytes per Sequence Position | +| Metrics Type | Metrics | +| ---------------------- | --------------------------------- | +| Completion Metrics | Accuracy | +| | Bleu | +| | Chrf | +| | Ter | +| | F1 | +| | Rouge 1 | +| | Rouge 2 | +| | Rouge-L | +| | Code Assertion | +| | Language Checker | +| | Length Checker | +| | Math Reasoning | +| | Placeholder Checker | +| | Text Counter | +| | CSV Format | +| | JSON Format | +| | Postscript Format | +| | Custom IFEval Checker | +| | Custom CWE Checker | +| | Custom NIAH Checker | +| | Custom Grid Comparison Checker | +| | Repetition Checker | +| Loglikelihood Metrics | Accuracy Loglikelihood | +| | Accuracy Normalized Loglikelihood | 
+| | Confidence-weighted Accuracy | +| | Probability Mass | +| | Probability Mass Normalized | +| (IDK-Specific Metrics) | Distributional Correctness Score | +| | Ternary Score | +| LLM Metrics | Chatbot Style Judge | +| | Completion Accuracy Judge | +| | Conciseness Judge | +| | Contains Names Judge | +| | Instruction Judge | +| | SQL Format | +| | World Knowledge Judge | +| Efficiency Metrics | Bytes per Sequence Position | From 5df2fd51ed200d434d6d30e22650d3b5c48858d5 Mon Sep 17 00:00:00 2001 From: Frank Schneider Date: Thu, 12 Feb 2026 12:37:34 +0000 Subject: [PATCH 5/5] docs: Highlight the option to run individual tasks. --- docs/add_new_benchmark_guide.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/docs/add_new_benchmark_guide.md b/docs/add_new_benchmark_guide.md index f9241077..11ed4079 100644 --- a/docs/add_new_benchmark_guide.md +++ b/docs/add_new_benchmark_guide.md @@ -279,6 +279,10 @@ The task will now be available through `get_task("GeographyQA")`. All tasks automatically go through formatting tests to ensure proper prompt generation. The formatting test lives in `tests/tests_eval_framework/tasks/test_all_formatters.py` and runs all registered tasks automatically. +> [!TIP] +> CI runs this test across all registered tasks, but during development you can target only your task, for example: +> `uv run pytest tests/tests_eval_framework/tasks/test_all_formatters.py -k "YourTaskName"` + #### Automatic Formatting Tests All benchmarks are automatically tested for proper prompt formatting across different chat templates. If your new task needs non-default initialization arguments (for example, a specific `num_fewshot`), add an entry for your task to `SPECIAL_ARGS` in `tests/tests_eval_framework/tasks/test_all_formatters.py`.